Yelp's Academic Dataset consists of 152,327 reviews of 6,900 businesses by 65,888 users. These businesses span 433 categories that are not mutually exclusive, ranging from "Accessories" (52 companies) to "Zoos" (1 company). A little background about the dataset: it primarily covers businesses near 30 schools across the United States.
Are you more likely to post a review on Yelp on certain days of the week?
The academic dataset pulled reviews from users that frequented restaurants near 30 US schools. Naively, you might attempt a frequency count of the days that reviews are posted, as shown:
This would suggest that users tend to post towards the beginning of the week and drop off by Saturday, with a slight uptick on Sunday. Whether this result is truly representative of the total population of Yelp reviewers is debatable. It is an accurate representation of this dataset, that is, of the users who reviewed these restaurants near a set of universities and colleges. This leaves us susceptible to many confounding variables. What if some locations are represented more than others? What about power users (those who write dozens, if not hundreds, of reviews)? To minimize these effects, we must take a random subset of the data.
Something interesting is revealed when you look at the total number of reviews made per user:
Out of nearly 66,000 users, approximately 46,000 (or 70%) account for only about 16.6% of the reviews. This is an instance of the so-called "80-20" rule for Pareto distributions: the vast majority of the reviews come from a small group of users. I can quickly confirm this adherence to a power law with a simple log-log plot:
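A minimal sketch of how such a check could be computed, assuming the per-user review totals are already available as a list of integers. The function names and the toy data are hypothetical and only illustrate the idea; they are not the Yelp figures.

```python
import math
from collections import Counter


def share_of_reviews(review_counts, user_fraction):
    """Fraction of all reviews written by the least active `user_fraction` of users."""
    ordered = sorted(review_counts)
    cutoff = round(len(ordered) * user_fraction)
    return sum(ordered[:cutoff]) / sum(ordered)


def log_log_points(review_counts):
    """(log10 reviews-per-user, log10 number-of-users) pairs for a log-log plot."""
    histogram = Counter(review_counts)
    return [(math.log10(k), math.log10(n)) for k, n in sorted(histogram.items())]


# Hypothetical toy data: most users write one review, one user writes 200.
toy_counts = [1] * 70 + [2] * 15 + [5] * 10 + [20] * 4 + [200]

print(share_of_reviews(toy_counts, 0.70))
print(log_log_points(toy_counts))
```

If the points from `log_log_points` fall roughly on a straight line when plotted, that is consistent with a power law, though it is not a formal test.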
For power-law distributions, expectation values depend on sample size, throwing much of statistical theory out of the window. At this point, the best approach to answering this question is to break the data into segments:
- Occasional Reviewers: Reviews from users with fewer than 100 total reviews (~54,000 users)
- Power Reviewers: Reviews from users with more than 500 total reviews (~900 users)
- Moderate Reviewers: Reviews from users with a total in between (~11,000 users)
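The segments above can be sketched as a small bucketing function. The thresholds (100 and 500) come from the list; the function names and the `reviews_per_user` mapping are hypothetical, and the sketch assumes users with exactly 100 or exactly 500 reviews count as moderate.

```python
from collections import defaultdict


def segment_for(review_count, low=100, high=500):
    """Map a user's total review count to a segment name."""
    if review_count < low:
        return "occasional"
    if review_count > high:
        return "power"
    return "moderate"


def group_users(reviews_per_user):
    """Group user ids by segment, given a {user_id: review_count} mapping."""
    groups = defaultdict(list)
    for user_id, count in reviews_per_user.items():
        groups[segment_for(count)].append(user_id)
    return dict(groups)


# Hypothetical toy data covering both boundaries.
toy = {"u1": 3, "u2": 100, "u3": 500, "u4": 501, "u5": 99}
print(group_users(toy))
```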
We choose a sample size of 500 for each user segment. This was chosen to satisfy a 95% confidence level at a margin of error corresponding to 3, 10, and 30 reviews for occasional, moderate, and power reviewers, respectively. Plotting the results by day reveals the following distributions:
This visual suggests the following: power users of Yelp review most on Mondays, remain steady Tuesday through Thursday, and drop off significantly when the weekend rolls around. Moderate users review most Sunday through Tuesday. Occasional users remain relatively steady throughout the week.
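The sampling step could look roughly like the sketch below. It assumes the sample of 500 is drawn over users without replacement, and it uses the textbook normal-approximation margin of error with z = 1.96 for a 95% confidence level; neither detail is confirmed by the original script, and the normal approximation is itself shaky for heavy-tailed data. The function names are hypothetical.

```python
import math
import random
import statistics


def sample_users(user_ids, size=500, seed=0):
    """Draw up to `size` distinct users; the seed makes the draw repeatable."""
    rng = random.Random(seed)
    ids = list(user_ids)
    return rng.sample(ids, min(size, len(ids)))


def margin_of_error(values, z=1.96):
    """Half-width of a normal-approximation confidence interval for the mean."""
    return z * statistics.stdev(values) / math.sqrt(len(values))


# Hypothetical usage: 1,000 made-up user ids, then a tiny list of review counts.
picked = sample_users(range(1000))
print(len(picked))
print(margin_of_error([1, 2, 3, 4, 5]))
```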
How much of this is due to chance?
Let's establish the following null hypothesis:
There is no relationship between the day of the week and whether a user posts a review.
In other words, I will either reject or fail to reject the notion that P(Monday) = P(Tuesday) = P(Wednesday) = P(Thursday) = P(Friday) = P(Saturday) = P(Sunday) against the alternative hypothesis that at least two proportions are different. My test will be at the 5% significance level. For this question, I will only look at occasional and power users, to keep things simple.
Here is a table of the observed and expected values for the power user sample set:
|Observed (Power)||Expected (Power)|
Observed and expected reviews of power users in sample set (size = 500, margin-of-error = 30 reviews)
The chi-square statistic for this test is 86.26. Looking at a standard table of critical values (with 6 degrees of freedom and an alpha of 0.05), we see that the chi-square critical value is 12.592. Because 86.26 > 12.592, we can safely reject the null hypothesis and conclude that the observed variation in reviews per day for power users is unlikely to have happened by chance.
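For readers who want to reproduce this kind of test, here is a minimal sketch of the chi-square goodness-of-fit statistic against a uniform expectation (every day equally likely). The daily counts below are hypothetical placeholders, not the observed values from the sample; only the critical value of 12.592 is taken from the text.

```python
def chi_square_uniform(observed):
    """Chi-square statistic when every category is expected to be equally likely."""
    total = sum(observed)
    expected = total / len(observed)
    return sum((o - expected) ** 2 / expected for o in observed)


# Critical value for 6 degrees of freedom at alpha = 0.05, as quoted above.
CRITICAL_VALUE = 12.592

# Hypothetical Monday-to-Sunday review counts, for illustration only.
hypothetical_counts = [120, 100, 95, 105, 90, 60, 130]

statistic = chi_square_uniform(hypothetical_counts)
reject_null = statistic > CRITICAL_VALUE
print(statistic, reject_null)
```

With seven days there are 7 - 1 = 6 degrees of freedom, which is why the 6-degree critical value applies.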
Here is a table of observed and expected values for occasional users:
|Observed (Occ.)||Expected (Occ.)|
Observed and expected reviews of occasional users in sample set (size = 500, margin-of-error = 3 reviews)
The chi-square statistic for this test is 16.78. Because 16.78 > 12.592, we can also reject the null hypothesis at the 5% significance level and conclude that the observed variation in reviews per day for occasional users is unlikely to have happened by chance.
Power users of Yelp (those who have made more than 500 reviews) are more likely to post their reviews on Mondays. There is a significant drop-off in reviews when the weekend rolls around; perhaps they are "collecting their data".
Occasional users of Yelp (those who have made fewer than 100 reviews) do not have a specific day that they significantly favor (though there is a slight drop-off on Fridays). In other words, they post whenever it is convenient for them and are not particularly driven to start their work week with a round of reviews.
The data is delivered as a text file with one JSON object per line, each containing an identifier attribute that makes more complex joins possible ("What days are 3+ star reviews posted for Persian restaurants?"). I used the following modules in my script: json (to parse the dataset), calendar (to convert the dates to specific days of the week), and matplotlib (for plotting).
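A rough sketch of the parsing and day-of-week tally, using json and calendar as described. The field names ("type", "date", "user_id", "stars") and the YYYY-MM-DD date format are assumptions about the file layout, and the sample lines are made up; this is not the original script.

```python
import calendar
import json
from collections import Counter
from datetime import datetime


def weekday_counts(lines):
    """Count reviews per weekday from an iterable of JSON-object lines."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        # Assumed layout: "type" marks the record kind, "date" is YYYY-MM-DD.
        if record.get("type") != "review":
            continue
        posted = datetime.strptime(record["date"], "%Y-%m-%d")
        counts[calendar.day_name[posted.weekday()]] += 1
    return counts


# Made-up sample lines: two reviews and one non-review record.
sample_lines = [
    '{"type": "review", "user_id": "u1", "date": "2011-02-14", "stars": 4}',
    '{"type": "review", "user_id": "u2", "date": "2011-02-19", "stars": 2}',
    '{"type": "user", "user_id": "u1", "review_count": 12}',
]

counts = weekday_counts(sample_lines)
print(counts)
```

For the real file, the same function could be fed an open file handle instead of a list, and the resulting counts passed to a matplotlib bar chart.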