Boston Airbnb — To comment or not to comment

Mauricio Prada
5 min read · Mar 14, 2021
Boston Terrier :)

Listing any type of property on Airbnb has become an important source of income for many owners, and it is also an affordable accommodation option for travelers on a budget (although you can also find castles and islands).

Boston?

The real reason for choosing Boston as an Airbnb case study is simply that the Udacity Nanodegree recommends it, but there are some interesting facts surrounding the listings in this city.

  • A 2016 study suggested that a 10% increase in Airbnb listings led to a 0.42% increase in rent prices being asked in that area. [fact 1]
  • Boston is the 12th highest occupancy rate city in the U.S. [fact 2]
  • “…The new rules require that hosts own the properties they rent out, and live in them for at least nine months of the year.” [fact 3]

What data do we have?

The dataset has three files to work with:

- Listings, including full descriptions and average review score
- Reviews, including unique id for each reviewer and detailed comments

What questions do we want to answer?

1. Is there a strong correlation between the size of the listing and its price?
2. Is location the most important variable for demand and pricing?
3. Do past reviews impact future listings of the place?

Let’s do some Modeling

We are going to try and build a simple model to predict the price of the listing.

Datasets

The original dataset has 95 variables for 3,858 listings. It holds some interesting unstructured data, like summary, space and description.

Although these variables surely hold relevant information for the traveler's decision, we will focus our efforts on understanding how numeric and categorical variables influence price. With this in mind, we are going to work with the following sets of variables:

  1. Host information: These variables have relevant information about the host, like their acceptance rate, response time and total listings.
  2. Property characteristics: Here you will find information about the listing: size, location, bedrooms, etc.
  3. Review scores: There are at least 6 different scores, ranging from cleanliness to communication.
  4. Comments: We have more than 60k comments on the listings. These are plain text about the guest's stay.

Let’s begin by plotting the correlation of the variables…

Correlation matrix of the important variables

There seem to be some obvious correlations:

  1. The review scores are most strongly correlated with each other.
  2. The number of bedrooms is correlated with the number of beds.

But there are some that are not that simple:

  1. It seems that the scores most correlated with price are location and cleanliness.
  2. Latitude seems to be more correlated with price than longitude.
Correlations with price

Checking correlations only with ‘price’, it is clear that the number of bedrooms/beds, and therefore the size of the listing, is strongly correlated with the price. Location (latitude and longitude) has some correlation with price.
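To make the idea concrete, here is a minimal sketch of how a correlation matrix and the correlations with ‘price’ could be computed with pandas. The data is synthetic and the column names (bedrooms, beds, latitude, longitude, price) are assumptions for illustration, not the exact columns or code used in this analysis.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the listings table (column names are assumptions).
rng = np.random.default_rng(0)
n = 200
bedrooms = rng.integers(1, 5, size=n)
listings = pd.DataFrame({
    "bedrooms": bedrooms,
    "beds": bedrooms + rng.integers(0, 2, size=n),
    "latitude": 42.30 + 0.08 * rng.random(n),
    "longitude": -71.15 + 0.12 * rng.random(n),
})
# Made-up price that depends mostly on the number of bedrooms.
listings["price"] = 60 + 45 * listings["bedrooms"] + rng.normal(0, 20, size=n)

# Pearson correlation between every pair of numeric columns.
corr_matrix = listings.corr()

# Correlation of each variable with price, strongest first.
price_corr = corr_matrix["price"].drop("price").sort_values(ascending=False)
print(price_corr)
```

The same `corr_matrix` could be passed to a heatmap function of your plotting library of choice to get a picture like the one above.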

For the comments we apply a pre-built sentiment analysis tool that gives a text string a negativity, neutrality and positivity value. For our problem we only kept the positivity value.
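The sketch below only illustrates the shape of that step: score each comment, keep the positivity value, and average it per listing. The `toy_sentiment` function and its word lists are hypothetical stand-ins, not the pre-built tool used in the analysis, and the column names `listing_id` and `comments` are assumptions.

```python
import pandas as pd

# Toy word lists, purely illustrative (not a real sentiment lexicon).
POSITIVE_WORDS = {"great", "clean", "lovely", "perfect", "friendly"}
NEGATIVE_WORDS = {"dirty", "noisy", "rude", "bad", "broken"}

def toy_sentiment(text: str) -> dict:
    """Return the share of negative, neutral and positive words in a text."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    words = [w for w in words if w]
    if not words:
        return {"neg": 0.0, "neu": 1.0, "pos": 0.0}
    pos = sum(w in POSITIVE_WORDS for w in words) / len(words)
    neg = sum(w in NEGATIVE_WORDS for w in words) / len(words)
    return {"neg": neg, "neu": 1.0 - pos - neg, "pos": pos}

# Hypothetical reviews table: one row per comment.
reviews = pd.DataFrame({
    "listing_id": [1, 1, 2],
    "comments": [
        "Great location and a clean room!",
        "The street was noisy.",
        "Lovely, friendly host.",
    ],
})

# Keep only the positivity value of each comment.
reviews["positive"] = reviews["comments"].apply(lambda c: toy_sentiment(c)["pos"])

# Average positivity per listing, ready to join onto the listings table.
positive_by_listing = reviews.groupby("listing_id")["positive"].mean()
print(positive_by_listing)
```

A real sentiment tool would replace `toy_sentiment`; the rest of the pipeline (apply, keep one value, aggregate per listing) would stay roughly the same.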

Modeling and results

In this section we are going to use a simple random forest to predict the price of the listings. We are going to follow these steps:

1. Split data (80/20)
2. Instantiate model and fit (1,000 estimators random forest regressor)
3. Evaluate (predicted price vs real price)
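As a rough sketch, the steps above could look like this with scikit-learn. The data is synthetic, the feature names are assumptions, and the R² score is just one possible way to summarize predicted vs real price (the article itself compares them visually). The sketch uses 100 estimators to run quickly; the model described here uses 1,000.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the prepared features (names are assumptions).
rng = np.random.default_rng(42)
n = 300
X = pd.DataFrame({
    "bedrooms": rng.integers(1, 5, size=n),
    "is_private_room": rng.integers(0, 2, size=n),
    "latitude": 42.30 + 0.08 * rng.random(n),
})
# Made-up price driven by bedrooms and room type, plus noise.
y = 80 + 40 * X["bedrooms"] - 50 * X["is_private_room"] + rng.normal(0, 10, size=n)

# Split data: 80% for training, 20% held out for evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Instantiate the model and fit it on the training split.
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Evaluate: predict prices for the held-out listings and compare with the real ones.
y_pred = model.predict(X_test)
r2 = r2_score(y_test, y_pred)
print(f"R^2 on the test split: {r2:.2f}")
```

Plotting `y_test` against `y_pred` as a scatter plot gives the kind of visual check used in this post.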

Scatter plot comparing predictions vs real values

Visually, it seems that our model did a pretty good job of estimating the price.

Variable importance

We can see that whether the listing is a private room is very important for our model when predicting the price of the listings. In the next section we will dive deeper into the importance findings.
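For reference, a fitted scikit-learn random forest exposes these importances through its `feature_importances_` attribute. The sketch below ranks them on synthetic data; the feature names (including the deliberately useless `unrelated` column) are assumptions and the numbers have nothing to do with the real listings.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic features (names are assumptions for illustration).
rng = np.random.default_rng(7)
n = 300
X = pd.DataFrame({
    "is_private_room": rng.integers(0, 2, size=n),
    "bedrooms": rng.integers(1, 5, size=n),
    "unrelated": rng.random(n),  # has no effect on the made-up price
})
y = 150 - 80 * X["is_private_room"] + 15 * X["bedrooms"] + rng.normal(0, 5, size=n)

model = RandomForestRegressor(n_estimators=100, random_state=7).fit(X, y)

# Impurity-based importances, one per feature, summing to 1.
importances = pd.Series(
    model.feature_importances_, index=X.columns
).sort_values(ascending=False)
print(importances)
```

Keep in mind that these importances say how much the model relies on a variable, not whether the variable pushes the price up or down.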

Regarding our initial questions we have the following:

Is there a strong correlation between the size of the listing and its price?

If we treat the number of bathrooms and the number of people a listing can accommodate as proxies for size, the short answer is yes.

Is location the most important variable for demand and pricing?

Location (captured by latitude and longitude) is one of the most important variables when we are talking about the price of a listing. It is quite obvious that location will be important, but it is kind of interesting that latitude is more important than longitude, meaning that where a property sits north/south matters more than where it sits east/west when deciding where to invest in real estate.

Do past reviews impact future listings of the place?

The actual review score did not show up as one of the main variables, although the variable ‘positive’ (which captures the sentiment of the comments) has some impact on pricing.

Conclusion

Wrapping up, we found that there is a strong correlation between the location of a listing and its price (kind of obvious), but the real insight here is that the average sentiment of the comments is way more important than the scores given by the guests.

You can find all the files for this analysis on my GitHub :)
