Topic Modeling
Due: Thursday by 11:59pm
Points: 10
Submitting: a website URL or a file upload
File types: docx, doc, and pdf
Topic Modeling Amazon Product Reviews
Introduction
For this individual project, you will use topic modeling to deliver marketing and product insights for a popular clothing and/or shoes company.
The Data
Prof. Julian McAuley at UC San Diego has graciously let me use his “Amazon Product Data” database (http://jmcauley.ucsd.edu/data/amazon/links.html). It contains a large amount of data about Amazon products. Specifically, we will leverage two datasets: (1) metadata about products and (2) product reviews. The full database has reviews on all types of Amazon products, but those datasets are huge (~80 GB).
To remedy this, I’ve picked two smaller datasets that contain only (1) metadata about products categorized as “Clothing, Shoes & Jewelry” and (2) reviews of products in the “Clothing, Shoes & Jewelry” category.
The product data:
https://drive.google.com/file/d/1Xvb8Np-4pvJ-_I2_Zy3y0UjmAdpdDyk5
The review data:
https://drive.google.com/file/d/1iVqe6GENMTJcXxRpZTsA4c90Gie5Ck3e
*IMPORTANT* FOR BOTH LINKS ABOVE CLICK THE ADD TO DRIVE BUTTON
THEN, MOVE THE FILE TO WHATEVER FOLDER YOU WOULD LIKE IN GOOGLE DRIVE.
This way you don’t have to download and reupload the data. While these datasets are smaller than the aforementioned ones, they’re still big.
Picking a Clothing/Shoe Brand
First, you must pick a popular clothing and shoe brand to analyze. Not sure what constitutes popular? Here are a couple lists I found on eMarketer:
You can choose any popular brand except Nike. In order to complete this project, you must pick a brand/product that has at least 2,000 reviews (not products). I’d encourage you to search for a brand in the data that you are passionate about.
Extracting the Data
First, you’ll need to identify the ASINs (Amazon Standard Identification Numbers) associated with products from your brand. To do that, you’ll leverage dataset #1. Once you have a list of ASINs, you’ll use dataset #2 to extract the reviews that match those ASINs. This process will be demonstrated in class.
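As a rough illustration of the two-step extraction described above (not the method shown in class), here is a minimal sketch. It assumes each file holds one JSON record per line and that the records use the field names `asin`, `brand`, and `reviewText`; those names, the brand "Acme", and the sample records are assumptions for illustration only. If the real files are not strict JSON, the `json.loads` call would need replacing (for example with `ast.literal_eval`).

```python
import io
import json


def find_brand_asins(metadata_lines, brand):
    """Return the set of ASINs whose 'brand' field matches `brand` (case-insensitive)."""
    asins = set()
    for line in metadata_lines:
        line = line.strip()
        if not line:
            continue
        product = json.loads(line)
        # Some products may have no brand at all, so fall back to an empty string.
        product_brand = (product.get("brand") or "").strip().lower()
        if product_brand == brand.lower():
            asins.add(product["asin"])
    return asins


def extract_reviews(review_lines, asins):
    """Return every review record whose 'asin' is in the given set."""
    reviews = []
    for line in review_lines:
        line = line.strip()
        if not line:
            continue
        review = json.loads(line)
        if review.get("asin") in asins:
            reviews.append(review)
    return reviews


# Tiny in-memory stand-ins for the two files (made-up records, not real data).
sample_metadata = io.StringIO(
    '{"asin": "A001", "title": "Trail Runner", "brand": "Acme"}\n'
    '{"asin": "A002", "title": "Wool Socks", "brand": "Other Co"}\n'
    '{"asin": "A003", "title": "Gym Bag", "brand": "ACME"}\n'
    '{"asin": "A004", "title": "No-brand Belt"}\n'
)
sample_reviews = io.StringIO(
    '{"asin": "A001", "reviewText": "Very comfortable."}\n'
    '{"asin": "A002", "reviewText": "Warm socks."}\n'
    '{"asin": "A003", "reviewText": "Fits all my gym clothes."}\n'
)

brand_asins = find_brand_asins(sample_metadata, "Acme")
brand_reviews = extract_reviews(sample_reviews, brand_asins)
print(sorted(brand_asins))
print(len(brand_reviews), "reviews matched")
```

With the real data, the `io.StringIO` objects would be replaced by open file handles pointing at the two files in your Google Drive. Reading line by line keeps memory use low, which matters because the files are still big.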
Performing Topic Modeling
Once you’ve picked a brand, identified its product ASINs, and extracted the relevant reviews, you’ll need to perform topic modeling on the text of the review data. Using one of the popular clustering methods demonstrated in class (e.g., k-means or LDA), perform topic modeling on the data to reveal the most popular topics.
Visualize those topics. From a grading standpoint, we’ll be looking particularly at how logical the topic models are. Can we read them and generally understand what the topic model represents? There is no minimum or maximum number of topics; instead, I want you to tweak this parameter until the topics make the most sense to you.
Here’s a list of Nike topics that aren’t terrible:
0: item ordered great happy
1: tight little size shoes
2: shox shoes love pair
3: really shoes nice comfortable
4: excellent shoes product quality
5: loved shoes son great
6: good shoes fit shoe
7: love shoes color comfortable
8: purchased great pair fit
9: christmas pleased son loves
10: white blue shoes black
11: shoes great comfortable love
12: size shoes half ordered
13: gift loved birthday great
14: running shoe shoes great
15: shorts fit good pockets
16: sneaker great love sneakers
17: light shoes run weight
18: nice shoes good look
19: great price fit shoe
20: one watch watches battery
21: bag gym clothes good
22: sneakers great comfortable love
23: shoe shoes great foot
24: red black color shoes
25: four stars three great
26: perfect fit shoes love
27: cleats cleat soccer football
28: watch band wrist great
29: old year son shoes
30: air max shoe shoes
31: feet shoes comfortable wide
32: product great good quality
33: boots boot great comfortable
34: five stars great perfect
35: blisters heavy shoes shoe
36: force air ones shoes
37: sunglasses glasses lenses great
38: big little size shoes
39: shoes great like comfortable
40: shoes best ever pair
41: muy de el la
42: narrow shoes fit foot
43: shoe great comfortable good
44: socks great fit comfortable
45: shirt great fits quality
46: loves son shoes daughter
47: small size runs run
48: show dirt wear shoes
49: clean easy shoes great
Performing Data Clustering
Once you’ve got your final topics, perform topic classification on the documents to automatically cluster your data by topic. Inspect a dozen or so reviews from each topic to better understand what the topics actually represent. Make sure you print the ASIN for each review alongside the review text. Look at the most common ASINs, and resolve those ASINs to specific products so you know which products are commonly talked about in each topic. Finally, summarize each topic in 1-2 sentences.
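One simple way to do this, sketched below under stated assumptions, is to assign each review to the topic with the largest weight and then group the reviews by that label. The `doc_topic` weights and review records here are invented for illustration; in practice the weights would come from whatever model you fit, and the field names `asin` and `reviewText` are assumed.

```python
from collections import Counter, defaultdict

# Made-up reviews and topic weights (one row of weights per review, same order).
reviews = [
    {"asin": "A001", "reviewText": "Runs narrow, had to return."},
    {"asin": "A001", "reviewText": "Too narrow for my foot."},
    {"asin": "A002", "reviewText": "Son loved them for Christmas."},
    {"asin": "A003", "reviewText": "Great birthday gift."},
]
doc_topic = [
    [0.90, 0.10],
    [0.80, 0.20],
    [0.15, 0.85],
    [0.30, 0.70],
]


def assign_topics(doc_topic):
    """Label each document with the index of its highest-weighted topic."""
    return [max(range(len(row)), key=row.__getitem__) for row in doc_topic]


def group_by_topic(reviews, labels):
    """Collect reviews into lists keyed by their assigned topic."""
    groups = defaultdict(list)
    for review, label in zip(reviews, labels):
        groups[label].append(review)
    return groups


labels = assign_topics(doc_topic)
groups = group_by_topic(reviews, labels)

for topic_id in sorted(groups):
    members = groups[topic_id]
    print(f"Topic {topic_id} ({len(members)} reviews)")
    # Print up to a dozen reviews, each with its ASIN.
    for review in members[:12]:
        print(f"  {review['asin']}: {review['reviewText']}")
    # The most frequent ASINs hint at which products drive the topic.
    common_asins = Counter(r["asin"] for r in members).most_common(3)
    print(f"  most common ASINs: {common_asins}")
```

The most common ASINs can then be looked up in the product metadata to find the product titles.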
Example Topic Descriptions
Topic 44 – A wide variety of different socks tend to fit comfortably.
Topic 42 – Consumers say specific types of Nike shoes are too narrow, including: Nike Air Max 5 and Nike Air VaporMax 2017.
Topic 20 – Battery life on the Nike Watch is not ideal.
Topic 41 – Reviews in Spanish
Topic 35 – Nike Presto shoes tend to give blisters.
Marketing & Product Insights
From the topic models, extract actionable marketing and product insights. A topic model can be used to infer:
Attributes that people like about our products
Things that consumers are most impressed by can be things we choose to accentuate in our ads. For instance, Topic 49 relates to the fact that a specific type of Nike shoe, the Flyknit, is easy to clean. If we were advertising that product, we may choose to showcase that feature. Topic 27 shows that people love the quality of Nike’s new line of cleats. Moreover, topic 37 tells us that people love the lenses in Nike’s sunglasses. In general, the word “comfortable” appears for a lot of different types of shoes. We may choose to make that a key attribute in a new ad campaign. Across topics, people tend to think that Nike shoes and apparel are true-to-size, and fit well. Quality of fit may also be an attribute we choose to leverage in a future campaign.
Attributes that people dislike about our products
Topic 48 suggests that a specific type of shoe shows dirt easily; upon inspection, it’s leather Nike shoes that people complain about the most. We might choose to include a cleaning guide on our website, or with the shoes in the box upon purchase.
Topic 20 suggests that the battery life on the Nike Watch is not ideal; as such, we’d want to avoid making claims in our ads that say the battery life is good.
Purchase Occasions
Topic 9 shows that consumers often buy a broad variety of Nike shoes as Christmas gifts, particularly for sons. Topic 46 shows that daughters also receive gifts, but from a broader group of apparel, and more often for birthdays. Better understanding why and when consumers buy Nike products can yield a variety of insights, including what advertising media to use and when to advertise. From this data, I’d suggest that Nike advertise around the holidays and, in those ads, show young children, especially boys, receiving Nike shoes as Christmas gifts.
Product Development/Improvement Ideas
Topic 20 shows that consumers would like a better battery. We can suggest to our R&D department that we develop a better battery, as consumers really want it.
Topic 42 suggests that Nike should release a wide size for specific shoes, specifically: Nike Air Max 5 and Nike Air VaporMax 2017. R&D can work on making these sizes.
Pricing Suggestions
In looking at the Nike data, consumers generally do not mention price in reviews. This suggests, to me, that the products are not perceived as too expensive. Moreover, since we do not see words associated with perceived value (e.g., deal, value, inexpensive, steal), we can also surmise that consumers view Nike’s products as premium rather than budget wear. A closer look at these kinds of words can reveal how consumers weigh perceived quality against price.
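One way to check how often price and value come up is to count a small word list across all reviews. The sketch below is an assumption-laden illustration: the words "deal", "value", "inexpensive", and "steal" come from the paragraph above, while "price", "expensive", and "cheap" are additions you may or may not want, and the sample texts are made up.

```python
import re
from collections import Counter

# Words from the text above plus a few illustrative additions (see lead-in).
VALUE_WORDS = {"price", "deal", "value", "inexpensive", "steal", "expensive", "cheap"}


def value_word_share(texts, vocabulary=VALUE_WORDS):
    """Count value-word occurrences and the share of reviews mentioning any of them."""
    counts = Counter()
    mentioning = 0
    for text in texts:
        tokens = re.findall(r"[a-z']+", text.lower())
        hits = [token for token in tokens if token in vocabulary]
        counts.update(hits)
        if hits:
            mentioning += 1
    share = mentioning / len(texts) if texts else 0.0
    return counts, share


# Made-up review texts for demonstration.
sample_texts = [
    "Great deal for the price",
    "Very comfortable shoes",
    "Too expensive for what you get",
    "Love the color",
]

counts, share = value_word_share(sample_texts)
print(counts.most_common())
print(f"{share:.0%} of reviews mention price or value")
```

A low share supports the reading that price is not a major concern for reviewers; a high share would suggest looking at which words dominate.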
What’s Not in the Data
What common topics/words did you expect to be in the data that you didn’t observe? Take a moment and think about what you’d expect to see in reviews. What’s missing? Turning to Nike, I generally expected to see more comments about high-end style and design. Nike has premier deals with high-end fashion designers to make custom shoes. Instead, in reading these topics, consumers seem to talk more about practical attributes, such as the quality of a lens or a cleat. It’s clear that, at least on Amazon, the majority of Nike’s reviews are about function, not fashion. This should broadly shape how we perceive our consumers on Amazon, the products we think will sell well on Amazon, and how we advertise products sold on Amazon.com.
Diving Further into the Data
While the above are ideas and examples, “A” projects will be creative with insights. The above categories are not exhaustive. Think creatively: what can these common themes do to help us market and advertise our products better? What should we alter or change to address our shortcomings and accentuate our strengths? In general, I feel the majority of the grade variance in this project will come from the degree to which you use your qualitative skills to extract meaning from these topics.
Going Above and Beyond – Being Creative & Segmenting Data
The highest A’s will have given some additional thought to the data processing step of this project. Show that you’re taking care to segment the data to get the most insights. You may decide to preprocess the data using a special stopword list. Alternatively, you may generate models by segmenting data (e.g., only looking at reviews in specific categories). There are many ways in which you could filter the data, but here are some ideas:
Amazon Product Metadata – Easy – Use the metadata (dataset #1) to filter out reviews by other product attributes (e.g., only shoes, or only socks). Look at topic models for specific types of products, compare differences across products.
By Stars – Easy – In addition to a text review, Amazon users are required to give a star rating for the product they are reviewing. It is interesting, from a marketing standpoint, to compare reviews that are critical (e.g., low star ratings) with reviews that are positive. This can be done easily by separating the data and building topic models for different review score ranges.
By Sales Rank – Intermediate – In the product metadata file (dataset #1) there is a field dedicated to the sales rank for each product. You may want to look at products for your brand that are doing well in sales (e.g., ranked near the top) vs. products that are not doing so well in sales (e.g., ranked far down the list). Keep in mind that on Amazon a numerically smaller rank means better sales. Special attention could be given to products that aren’t selling well, and suggestions on what would make these products better!
Sentiment – Difficult – Filter by sentiment (e.g., only looking at negative reviews). You’ll need to use a sentiment tool (https://districtdatalabs.silvrback.com/modern-methods-for-sentiment-analysis) to code reviews for sentiment. This is a particularly good idea if you’re not seeing enough negative topics. Turning to my Nike example, I’d say that is the major limitation of my current topic models. There are definitely more complaints than what we’re seeing here.
Date – Somewhat Difficult – Use the date/time field in the Amazon review data to filter data by specific date ranges. Are reviews getting better or worse?
Super Reviewers – Very Difficult – Find top users in the entire dataset (e.g., users who have the most reviews across all reviews in dataset #2). Filter your product reviews and only look at top reviewers. Interpret these reviewers as more detail-oriented and opinionated customers.
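Several of the ideas above come down to filtering the review records before modeling. The sketch below illustrates three of them (stars, date, and super reviewers) using only the standard library. It assumes review records carry the fields `overall` (star rating), `unixReviewTime` (seconds since 1970), and `reviewerID`; those field names, the star cut-offs, and the sample records are assumptions for illustration.

```python
from collections import Counter
from datetime import datetime, timezone

# Made-up review records for demonstration.
reviews = [
    {"reviewerID": "U1", "asin": "A001", "overall": 5.0, "unixReviewTime": 1372636800},
    {"reviewerID": "U1", "asin": "A002", "overall": 1.0, "unixReviewTime": 1404172800},
    {"reviewerID": "U2", "asin": "A001", "overall": 2.0, "unixReviewTime": 1404172800},
    {"reviewerID": "U3", "asin": "A003", "overall": 4.0, "unixReviewTime": 1372636800},
    {"reviewerID": "U1", "asin": "A003", "overall": 3.0, "unixReviewTime": 1404172800},
]


def split_by_stars(reviews, low_max=2, high_min=4):
    """Separate critical reviews (<= low_max stars) from positive ones (>= high_min)."""
    critical = [r for r in reviews if r["overall"] <= low_max]
    positive = [r for r in reviews if r["overall"] >= high_min]
    return critical, positive


def filter_by_year(reviews, year):
    """Keep reviews whose timestamp falls in the given calendar year (UTC)."""
    return [
        r for r in reviews
        if datetime.fromtimestamp(r["unixReviewTime"], tz=timezone.utc).year == year
    ]


def top_reviewers(all_reviews, n):
    """Return the IDs of the n reviewers with the most reviews."""
    counts = Counter(r["reviewerID"] for r in all_reviews)
    return {reviewer for reviewer, _ in counts.most_common(n)}


critical, positive = split_by_stars(reviews)
reviews_2014 = filter_by_year(reviews, 2014)
super_ids = top_reviewers(reviews, n=1)
super_reviews = [r for r in reviews if r["reviewerID"] in super_ids]

print(len(critical), "critical,", len(positive), "positive")
print(len(reviews_2014), "reviews from 2014")
print(len(super_reviews), "reviews by top reviewers", sorted(super_ids))
```

For the super-reviewer idea, `top_reviewers` would be run over all of dataset #2 rather than only your brand's reviews, which is part of why that option is harder. Each filtered list can then be modeled separately and the resulting topics compared.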
Keep in mind that these methods may not be directly taught in class and instead would show a significant investment on your part, learning outside of the classroom. As always, Scott and I are here if you get stuck. We will have a week of dedicated lab time for help and support.
What to Submit Here
1. Your Report (.doc, .docx or .pdf)
   1. Your topics: visualized in an easy to read way (1-pg)
   2. Topic Description: short, concise and accurate (1-2pg)
   3. Preprocessing steps: what you did to preprocess the data (.5pg)
   4. Model steps: what method you used to model (.5pg)
   5. Marketing & Product Insights: must reflect substantial time spent looking at the data (1-2pg)
2. Your code
   1. Executable code via a shared Google Colab link. Must be in Python. Your code should include all steps you used for the assignment:
      Code that looks through the Amazon product metadata and extracts ASINs
      Code that extracts the reviews for your brand
      Code that preprocesses the data (e.g., gets it ready to be topic modeled)
      Code that executes and prints your topics (it is OKAY if your topic model prints slightly differently, due to random iterations)
      Code that classifies and separates documents by topic
Topic Modeling
Criteria
Ratings
Pts
Data manipulation
No errors in data manipulation (e.g., bad features, missorted data, code that performs unanticipated actions to data).
0.5 pts Perfect code
0.3 pts Minor errors
0.0 pts Fatal flaws
0.5 pts
English, Grammar & Presentation
Grammar / spelling inside of report. Only use charts if they’re easy to read on the page. Do not stretch or pixelate images.
0.5 pts Exceptional
0.4 pts Average
0.3 pts Somewhat Clean
0.2 pts Major issues
0.0 pts Terrible
0.5 pts
Clarity, Brevity and Organization of Report
Use bulleted points that are brief. Avoid paragraphs. Use section heads. Make the document easy to reference and scan.
0.5 pts Extremely concise, organized and clear
0.4 pts Concise, organized and clear
0.3 pts Somewhat concise, organized and clear
0.0 pts Not concise, organized and clear
0.5 pts
Organization and Clarity of Code Notebook
Organization of your Python notebooks. Include comments and clear sections so I can easily follow.
0.5 pts Extremely concise, organized and clear
0.4 pts Concise, organized and clear
0.3 pts Somewhat concise, organized and clear
0.0 pts Not concise, organized and clear
0.5 pts
Topic Quality
Topics are intuitive and qualitatively cluster together in an intuitive way. Must also be visualized in an easy to read way.
1.5 pts Extremely concise, organized and clear
1.2 pts Concise, organized and clear
0.9 pts Somewhat concise, organized and clear
0.0 pts Not concise, organized and clear
1.5 pts
Topic Model Descriptions
You described your topics in a short, concise and accurate manner. You must put forward your intuition as to what the topic models are saying, and that intuition must be logical.
1.0 pts Extremely concise, organized and clear
0.8 pts Concise, organized and clear
0.6 pts Somewhat concise, organized and clear
0.0 pts Not concise, organized and clear
1.0 pts
Preprocessing steps
Compared to your peers, to what extent did you conduct preprocessing steps? Did the preprocessing steps work as intended?
0.5 pts Exceptional
0.4 pts Average
0.3 pts Somewhat Clean
0.2 pts Major issues
0.0 pts Terrible
0.5 pts
Modeling steps
To what extent are your models valid and not in violation of the statistical assumptions of said topic model?
0.5 pts Perfect code
0.3 pts Minor errors
0.0 pts Fatal flaws
0.5 pts
Marketing Insights
The extent to which your marketing and product insights are intuitive. Must reflect substantial time spent looking at the data and meaningful takeaways that would help the business/products studied.
2.0 pts Extremely concise, organized and clear
1.6 pts Concise, organized and clear
1.2 pts Somewhat concise, organized and clear
0.0 pts Not concise, organized and clear
2.0 pts
Document classification
Code that classifies and separates documents by topic
0.5 pts Perfect code
0.3 pts Minor errors
0.0 pts Fatal flaws
0.5 pts
Being Creative & Segmenting Data
The extent to which you segmented your data in clever ways that led to topic models that are more insightful than ones that are not segmented.
2.0 pts Exceptional
1.6 pts Average
1.2 pts Somewhat Clean
0.8 pts Major issues
0.0 pts Terrible
2.0 pts
Total Points: 10.0