RQ1: What are the statistics of online reviews?
Select one metropolitan area and study the empirical distributions of:
1. the number of reviews written by each individual user
2. the number of reviews received by each individual business
Discuss which probability distribution (e.g., Gaussian, power law) is well suited to describe these data, and provide an estimate of the parameters of that distribution (we’ll see examples of this in the first 2 case studies)
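As an illustration of the parameter estimate asked for in RQ1, the sketch below counts reviews per identifier and computes a maximum-likelihood estimate of a power-law exponent. It is only a sketch under stated assumptions: the function names and the synthetic data are hypothetical, the lower cutoff x_min is taken as given rather than fitted, and the discrete case uses the common continuous approximation with x_min shifted by 0.5, which is known to be rough when x_min is small. It does not test whether a power law is actually a better description than the alternatives.

```python
import math
import random
from collections import Counter


def reviews_per_key(keys):
    """Count how many reviews each key (e.g. a user or business id) has."""
    return list(Counter(keys).values())


def fit_power_law_exponent(counts, x_min=1.0, discrete=True):
    """Maximum-likelihood estimate of alpha for p(x) ~ x^(-alpha), x >= x_min.

    Assumption: x_min is chosen by the analyst, not estimated here.
    For integer data the continuous formula is used with x_min - 0.5,
    an approximation that degrades for small x_min.
    Returns (alpha, standard error, number of points in the tail).
    """
    tail = [x for x in counts if x >= x_min]
    n = len(tail)
    shift = 0.5 if discrete else 0.0
    log_sum = sum(math.log(x / (x_min - shift)) for x in tail)
    alpha = 1.0 + n / log_sum
    std_err = (alpha - 1.0) / math.sqrt(n)
    return alpha, std_err, n


# Hypothetical check on synthetic continuous data with a known exponent,
# drawn by inverse-transform sampling (not Yelp data).
random.seed(0)
true_alpha = 2.5
samples = [(1.0 - random.random()) ** (-1.0 / (true_alpha - 1.0))
           for _ in range(20000)]
alpha_hat, alpha_err, n_tail = fit_power_law_exponent(
    samples, x_min=1.0, discrete=False)
print(f"alpha = {alpha_hat:.2f} +/- {alpha_err:.2f} (n = {n_tail})")
```

With real data, `reviews_per_key` would be applied to the column of user or business identifiers of the reviews in the chosen metropolitan area, and the resulting counts passed to `fit_power_law_exponent` with `discrete=True`.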
RQ2: Are Yelp users consistent across different areas?
Focus on one particular type of business (e.g., fast food restaurant, pub, lounge) and compare the empirical distributions of the ratings for this category across two different metropolitan areas. Discuss differences and similarities between the distributions and their main statistical properties (e.g., mean, standard deviation, skewness)
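The statistical properties named in RQ2 could be computed along the following lines. This is a minimal sketch: the function names and the two rating lists are hypothetical, ratings are assumed to be numeric star values, and the skewness is the population (biased) version, i.e. the third central moment divided by the cube of the population standard deviation.

```python
import statistics
from collections import Counter


def rating_summary(ratings):
    """Mean, population standard deviation and skewness of a rating sample.

    Assumes the ratings are not all identical (otherwise std is zero
    and the skewness is undefined).
    """
    n = len(ratings)
    mean = statistics.fmean(ratings)
    std = statistics.pstdev(ratings, mu=mean)
    skew = sum((r - mean) ** 3 for r in ratings) / (n * std ** 3)
    return {"n": n, "mean": mean, "std": std, "skewness": skew}


def empirical_pmf(ratings):
    """Fraction of ratings at each observed value, sorted by value."""
    n = len(ratings)
    counts = Counter(ratings)
    return {value: counts[value] / n for value in sorted(counts)}


# Hypothetical ratings for one business category in two areas.
area_a = [5, 4, 4, 3, 5, 1, 4, 5, 2, 4]
area_b = [3, 3, 4, 2, 5, 1, 3, 4, 2, 3]
for name, ratings in (("area A", area_a), ("area B", area_b)):
    summary = rating_summary(ratings)
    print(name, {k: round(v, 2) for k, v in summary.items()})
    print(name, empirical_pmf(ratings))
```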
1st assignment – due February 21
• Import the Yelp dataset in your favourite programming language and address the two research questions (RQs), RQ1 and RQ2
• Summarize your results in an individual report of fewer than 1000 words, with up to 3 display items (plots or tables) with captions of fewer than 150 words (excluded from the total word count)
• January 30 (TBC): Tips on how to write the report & detailed marking criteria
• February 27: Self-assessment exercise, in-class discussion and feedback
Marking criteria
1. Justification of approaches used
2. Clarity (both in text and plots / tables)
3. Consistency of language and mathematical notation
4. Soundness of results
All four criteria are equally important
Justification of approaches used
1. Walk the reader through every hypothesis and choice you make (assume the reader is a graduate student with basic knowledge of probability and statistics)
2. Explain why you chose to do something over the alternatives (e.g., why did you choose to consider a certain class of distributions?)
3. Using a package downloaded from the Internet does NOT qualify as a methodology: you have to explain what it does and demonstrate that you understand it
Clarity (text)
1. Write using clear language and notation; ask yourself whether you would understand your own report
2. Avoid colloquial language; use formal scientific language as in a scientific paper
3. Use the word limit to your advantage: write short sentences where each word matters!
Clarity (plots & tables)
1. Use 2-3 significant digits when presenting numerical results (e.g., 1.23 instead of 1.23456789)
2. Show good-quality figures (no cropping!)
3. All plots must have: axis labels & ticks, legends where needed, clearly discernible symbols and lines
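The point about significant digits can be applied mechanically when building tables. The helpers below are a small sketch with hypothetical names: one rounds a number to a given count of significant digits, the other formats it as text using Python's general ("g") format, which switches to scientific notation for very large or very small values.

```python
import math


def round_sig(value, digits=3):
    """Round a number to the given count of significant digits."""
    if value == 0:
        return 0.0
    exponent = math.floor(math.log10(abs(value)))
    return round(value, digits - 1 - exponent)


def format_sig(value, digits=3):
    """Format a number as text with the given count of significant digits."""
    return f"{value:.{digits}g}"


print(format_sig(1.23456789))   # 1.23
print(round_sig(12345.678))     # 12300.0
```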
Example of a good plot
Use log scale when necessary!
[Figure: two panels, one with linear axes (tick values 0 to 10000 and 0 to 1) and one with logarithmic axes (tick values 10^0 to 10^4 and 10^-2 to 10^0)]
Consistency of language and notation
1. Use the same style throughout the report (e.g., do not switch between present and past tense)
2. Use consistent mathematical notation (e.g., do not rename variables, label plot axes accordingly, etc.)
Soundness of results
1. Be rigorous and discuss your results professionally
2. This means discussing statistical significance when appropriate, and discussing what worked and what did not work in your approaches in relation to your initial hypotheses
3. Negative results are OK! There’s nothing wrong with disproving your own hypotheses. No need to “stretch” your results in order to make them work
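One possible way to discuss statistical significance when comparing two samples is a permutation test. The sketch below is one option among several, not a prescribed method: the function name and the example data are hypothetical, the test statistic is the absolute difference in means, and the null hypothesis is that the two samples come from the same distribution, so that group labels can be shuffled freely.

```python
import random
import statistics


def permutation_test_mean_diff(sample_a, sample_b, n_permutations=2000, seed=0):
    """Two-sided permutation p-value for a difference in means.

    Repeatedly shuffles the pooled data, splits it into two groups of the
    original sizes, and counts how often the shuffled difference in means
    is at least as large as the observed one.
    """
    rng = random.Random(seed)
    observed = abs(statistics.fmean(sample_a) - statistics.fmean(sample_b))
    pooled = list(sample_a) + list(sample_b)
    n_a = len(sample_a)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(statistics.fmean(pooled[:n_a]) - statistics.fmean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return (extreme + 1) / (n_permutations + 1)


# Hypothetical ratings from two areas (not Yelp data).
area_a = [5, 4, 4, 3, 5, 1, 4, 5, 2, 4]
area_b = [3, 3, 4, 2, 5, 1, 3, 4, 2, 3]
p_value = permutation_test_mean_diff(area_a, area_b)
print(f"p = {p_value:.3g}")
```

A large p-value here would be a negative result in the sense above: the data give no evidence that the mean ratings differ, which is a legitimate finding to report.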
Suggested structure
1. Introduction – A brief overview of the data, a presentation of the research questions and of the hypotheses you’re making to address them
2. Methodology – Explain how you’re going to tackle the research questions
3. Results – A summary of your main results, without much comment
4. Discussion & conclusions – Comment on your results: what did you find? Was it in agreement with your hypotheses? What worked or did not work? How could you improve the analyses?