BU.450.760 Technical Document T3.2 – Airbnb listings Prof.
Airbnb listing prediction exercise
The goal of the analysis is to use the text data contained in Airbnb hosts' listing descriptions (see Xs in the screenshot below) to predict the average review score that Airbnb users will give the host (see Y in the screenshot below).
We utilize the dataset D3.2 with codebook C3.2 ("Airbnb Listings codebook.pdf"). The key text variable is "listing_description" (which contains the information for X), whereas the outcome to be predicted is "review_ratings" (Y).
The companion R script is S3.2.
Preliminaries (lines 6-12) follow steps explained in previous documents and are omitted here:
• Load packages
• Set working directory
• Clear workspace
• Load and summarize data
1. Generate a Document-Term Matrix with TF-IDF scores
We first generate the document-term matrix by applying the steps discussed in class. Note how each successive line applies a new instruction to the object generated in the previous line (text_corpus). In line 23 we generate the document-term matrix, which we label "dtm". Line 25 transforms this matrix into TF-IDF scores. Note that dtm is highly sparse: about 99% of its entries are zero.
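The corpus-to-DTM pipeline can be sketched as follows with the tm package. This is a schematic reconstruction, not the literal content of S3.2: the exact cleaning steps and the column name listing_description are taken from the codebook description above, and ds is assumed to be the main dataframe loaded in the preliminaries.

```r
library(tm)

# Build a corpus from the listing descriptions, then clean it step by step;
# each line applies a new instruction to the object from the previous line
text_corpus <- VCorpus(VectorSource(ds$listing_description))
text_corpus <- tm_map(text_corpus, content_transformer(tolower))
text_corpus <- tm_map(text_corpus, removePunctuation)
text_corpus <- tm_map(text_corpus, removeNumbers)
text_corpus <- tm_map(text_corpus, removeWords, stopwords("english"))
text_corpus <- tm_map(text_corpus, stripWhitespace)

dtm <- DocumentTermMatrix(text_corpus)  # raw term counts (cf. line 23)
dtm <- weightTfIdf(dtm)                 # convert counts to TF-IDF (cf. line 25)
inspect(dtm)                            # reports the ~99% sparsity noted above
```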
2. Prepare data for estimation
To prepare the data for estimation, we first need to deal with the "wide X" problem, namely, the fact that there are too many columns (terms) in the document-term matrix (many more than available observations). This is a crucial step: without it, linear regression is not feasible.
As discussed in class, our approach to this problem is to eliminate sparse terms, that is, terms that appear with relatively low frequency in our corpus. We carry out this step with an eye toward two models:
Model 1: will include terms that appear in at least 40% of docs (20 terms)
Model 2: will include terms that appear in at least 50% of docs (9 terms)
Thus, model 1 will include a “wider X” compared to model 2. Lines 31 and 32 extract the two sets of terms and lines 33 and 34 convert them into the matrix format (necessary for estimation) as well as attach them to the main dataframe ds.
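A sketch of the sparse-term extraction and conversion steps, assuming the dtm object generated above. Note that tm's removeSparseTerms() takes the maximum allowed fraction of documents from which a term may be absent, so keeping terms present in at least 40% of docs corresponds to a sparsity threshold of 0.60.

```r
library(tm)

# Keep only relatively frequent terms (thresholds per models 1 and 2)
dtm_m1 <- removeSparseTerms(dtm, 0.60)  # terms in >= 40% of docs (20 terms)
dtm_m2 <- removeSparseTerms(dtm, 0.50)  # terms in >= 50% of docs (9 terms)

# Convert to matrix format (necessary for estimation) and attach to ds;
# the model-2 terms are a subset of the model-1 terms, so attaching the
# model-1 columns makes both sets available in the dataframe
ds <- cbind(ds, as.matrix(dtm_m1))
```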
In lines 35-39 we declare categorical variables as factors and then generate the index for the training/validation data split.
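These two steps might look as follows. The specific categorical variable (room_type), the seed, and the 70/30 split proportion are assumptions for illustration; the actual choices are in S3.2.

```r
# Declare categorical variables as factors (room_type is a placeholder name)
ds$room_type <- as.factor(ds$room_type)

# Generate the index for the training/validation split (70/30 assumed)
set.seed(123)                                  # for a reproducible split
train_idx <- sample(nrow(ds), 0.7 * nrow(ds))  # rows assigned to training
ds_train <- ds[train_idx, ]
ds_valid <- ds[-train_idx, ]
```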
In line 41 we plot a histogram of the outcome that we will attempt to predict. This is useful for forming an idea of the variability that we are exploring. In this case, for example, it turns out that most ratings are very high: the distribution is "piled up" in the 90s. Typically this is not desirable, since ratings below 80 are in fact outliers that somewhat "distort" the estimated parameters. To evaluate how much distortion is introduced, we could compare our estimates/predictions obtained from the full sample against those obtained when observations associated with ratings below 80 are dropped.
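The histogram in question is a single base-R call along these lines (bin count chosen here for illustration):

```r
# Distribution of the outcome: expect a pile-up of ratings in the 90s
hist(ds$review_ratings,
     breaks = 20,
     main   = "Distribution of review ratings",
     xlab   = "review_ratings")
```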
3. Model estimation
We estimate three models:
Model 1 (line 45): includes the 20 terms that appear in at least 40% of docs
Model 2 (line 57): includes the 9 terms that appear in at least 50% of docs
Model 3 (line 69): includes no text terms
The code for model 1 is presented here; the code for the other two models is analogous. The estimation code is familiar from previous classes. Notice that we ensure that predicted ratings never fall outside the feasible range of [0,100] (see lines 49 and 50).
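The estimation and truncation steps for model 1 can be sketched as below. This assumes the dtm object and the ds_train/ds_valid split from the preceding steps; reformulate() builds the regression formula from the retained term names, which avoids typing out all 20 terms.

```r
library(tm)

# Names of the 20 model-1 terms (present in >= 40% of docs)
terms_m1 <- colnames(as.matrix(removeSparseTerms(dtm, 0.60)))

# Linear regression of the rating on the retained terms
m1 <- lm(reformulate(terms_m1, response = "review_ratings"), data = ds_train)
summary(m1)

# Out-of-sample predictions, truncated to the feasible [0, 100] range
# (cf. lines 49-50 of the script)
pred1 <- predict(m1, newdata = ds_valid)
pred1 <- pmax(pmin(pred1, 100), 0)
```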
For all models, out-of-sample fit is noticeably worse than in-sample fit. Because we also observe this for model 3 (which uses no text-based data), we should not worry that the text data are inducing an overfitting problem. Overall, model 2 (the more restrictive of the two models that include text data) performs best out of sample.
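The in- versus out-of-sample comparison can be made concrete with a fit metric such as RMSE (the metric is an assumption; the script may report MSE or R-squared instead). This assumes a fitted model m1 and the training/validation frames ds_train and ds_valid from the earlier steps.

```r
# Root mean squared prediction error
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

rmse_in  <- rmse(ds_train$review_ratings, predict(m1, newdata = ds_train))
rmse_out <- rmse(ds_valid$review_ratings, predict(m1, newdata = ds_valid))

# Out-of-sample error is typically the larger of the two
c(in_sample = rmse_in, out_of_sample = rmse_out)
```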
Lastly, notice that if our predictions were perfect (i.e., zero prediction errors), actual and predicted outcomes would fall on the line Y = X. Therefore, as an additional check, in lines 88-90 we generate this plot for each model (code below and plots on the next page). While none of the plots aligns perfectly with the 45-degree line (predictions are not perfect), those for models 1 and 2 display a generally upward-sloping pattern. For the predictions of model 3 there is no discernible pattern.
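For model 1, the actual-versus-predicted plot might look like this in base R, assuming a vector pred1 of truncated model-1 predictions for the validation sample:

```r
# Scatter of actual vs. predicted ratings, with the Y = X reference line
plot(pred1, ds_valid$review_ratings,
     xlab = "Predicted rating",
     ylab = "Actual rating",
     main = "Model 1: actual vs. predicted")
abline(a = 0, b = 1, lty = 2)  # perfect predictions would fall on this line
```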
Notice how much more compressed the Y range is for the values predicted with model 3, compared to that for the predictions of models 1 and 2.