CIS 415 – Assignment 5
This assignment is worth 20 points, and is individual effort.
Problem Definition
In this Assignment, we are going to use Amazon Product Co-purchase data to make Book Recommendations using Social Network Analysis.
This assignment has three objectives:
- Review Python concepts to read and manipulate data and get it ready for analysis
- Apply Social Network Analysis concepts to Build and Analyze Graphs
- Apply concepts in Text Processing, Social Network Analysis and Recommendation Systems to make a product recommendation
We will be using the Amazon Meta-Data Set maintained on the SNAP site. This data set comprises product and review metadata on 548,552 different products. The data was collected in 2006 by crawling the Amazon website. You can view the data by double-clicking on the file amazon-meta.txt that’s been included in SocialNetworkAnalysis.zip. The following information is available for each product in this dataset:
- Id: Product id (number 0, …, 548551)
- ASIN: Amazon Standard Identification Number.
The Amazon Standard Identification Number (ASIN) is a 10-character alphanumeric unique identifier assigned by Amazon.com for product identification. You can look up products by ASIN using the following link: https://www.amazon.com/product-reviews/<ASIN>
- title: Name/title of the product
- group: Product group. The product group can be Book, DVD, Video or Music.
- salesrank: Amazon Salesrank
The Amazon sales rank represents how a product is selling in comparison to other products in its primary category. The lower the rank, the better a product is selling.
- similar: ASINs of co-purchased products (people who buy X also buy Y)
- categories: Location in product category hierarchy to which the product belongs (separated by |, category id in [])
- reviews: Product review information: total number of reviews, average rating, as well as individual customer review information including time, user id, rating, total number of votes on the review, total number of helpfulness votes (how many people found the review to be helpful)
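As a reading aid for the "categories" field described above, here is a minimal sketch of splitting one category path into its levels. The sample line, its category names, and its ids are hypothetical placeholders shaped like the description (levels separated by |, category id in []), and parse_category_path is an illustrative helper, not a function from the provided scripts.

```python
import re

# Hypothetical category path; names and ids are made up for illustration.
line = "|Books[101]|Subjects[202]|Science[303]|Physics[404]"

def parse_category_path(path):
    """Return (name, id) pairs for each level of one category path."""
    levels = []
    for part in path.strip().split("|"):
        if not part:
            continue  # skip the empty piece before the leading "|"
        match = re.fullmatch(r"(.*)\[(\d+)\]", part)
        if match:
            levels.append((match.group(1), int(match.group(2))))
    return levels

print(parse_category_path(line))
```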
Please download and unzip the SocialNetworkAnalysis.zip file from Blackboard in the directory where you have been doing all of your Python scripting. Then, double click on amazon-meta.txt and ensure it has the expected data described above.
The first step is to read and understand the preprocessing and formatting that have been applied to this data. You have been provided with a Python script called PreprocessAmazonBooks.py that is included in SocialNetworkAnalysis.zip. This script takes the “amazon-meta.txt” file as input, and performs the following steps:
- Parses the amazon-meta.txt file
- Preprocesses the metadata for all ASINs, and writes out the following fields into the amazonProducts Nested Dictionary (key = ASIN and value = MetaData Dictionary associated with ASIN):
- Id: same as “Id” in amazon-meta.txt
- ASIN: same as “ASIN” in amazon-meta.txt
- Title: same as “title” in amazon-meta.txt
- Categories: a transformed version of “categories” in amazon-meta.txt. Essentially, all categories associated with the ASIN are concatenated, and are then subject to the following Text Preprocessing steps: lowercase, stemming, remove digit/punctuation, remove stop words, retain only unique words. The resulting list of words is then placed into “Categories”.
- Copurchased: a transformed version of “similar” in amazon-meta.txt. Essentially, the copurchased ASINs in the “similar” field are filtered down to only those ASINs that have metadata associated with them. The resulting list of ASINs is then placed into “Copurchased”.
- SalesRank: same as “salesrank” in amazon-meta.txt
- TotalReviews: same as total number of reviews under “reviews” in amazon-meta.txt
- AvgRating: same as average rating under “reviews” in amazon-meta.txt
- Filters the amazonProducts Dictionary down to only Group=Book, and writes the filtered data to the amazonBooks Dictionary
- Uses the co-purchase data in the amazonBooks Dictionary to create the copurchaseGraph Structure as follows:
- Nodes: the ASINs are Nodes in the Graph
- Edges: an Edge exists between two Nodes (ASINs) if the two ASINs were co-purchased
- Edge Weight (based on Category Similarity): since we are attempting to make book recommendations based on co-purchase information, it would be nice to have some measure of Similarity for each ASIN (Node) pair that was co-purchased (existence of Edge between the Nodes). We can then use the Similarity measure as the Edge Weight between the Node pair that was co-purchased. We can potentially create such a Similarity measure by using the “Categories” data, where the Similarity measure between any two ASINs that were co-purchased is calculated as follows:
Similarity = (Number of words that are common between Categories of connected Nodes) / (Total Number of words in both Categories of connected Nodes)
The Similarity ranges from 0 (most dissimilar) to 1 (most similar).
- Adds the following graph-related measures for each ASIN to the amazonBooks Dictionary:
- DegreeCentrality: associated with each Node (ASIN)
- ClusteringCoeff: associated with each Node (ASIN)
- Writes out the amazonBooks data to the amazon-books.txt file
- Writes out the copurchaseGraph data to the amazon-books-copurchase.edgelist file
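The Categories cleanup and the Similarity measure described in the steps above can be sketched roughly as follows. This is not the code in PreprocessAmazonBooks.py. It assumes that "Total Number of words in both Categories" means the set of unique words across the two books (a Jaccard-style ratio), which is consistent with the stated range of 0 to 1. Stemming is left out because it needs an external library, and STOP_WORDS, clean_categories, category_similarity, and the sample strings are all hypothetical.

```python
import string

# Tiny illustrative stop-word list (an assumption; the script's list may differ).
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "for", "to"}

def clean_categories(raw):
    """Lowercase, replace digits/punctuation with spaces, drop stop words,
    and keep only unique words. Stemming is intentionally omitted here."""
    table = str.maketrans({ch: " " for ch in string.digits + string.punctuation})
    words = raw.lower().translate(table).split()
    return {word for word in words if word not in STOP_WORDS}

def category_similarity(words_a, words_b):
    """Common words divided by all unique words across both sets."""
    union = words_a | words_b
    if not union:
        return 0.0  # two empty category sets: treat as most dissimilar
    return len(words_a & words_b) / len(union)

# Hypothetical category strings for two co-purchased books.
book_a = clean_categories("|Books[1]|Subjects[2]|Science[3]|Physics[4]")
book_b = clean_categories("|Books[1]|Subjects[2]|Science[3]|Astronomy[5]")

# 3 shared words out of 5 unique words overall
print(category_similarity(book_a, book_b))  # -> 0.6
```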
Please review the PreprocessAmazonBooks.py script to ensure you understand how the code relates to the processing steps described above, and how it created the amazon-books.txt and amazon-books-copurchase.edgelist files included in the SocialNetworkAnalysis.zip file you just downloaded from our course site. Understanding how these files were created is an important step in learning these Social Network Analysis and Big Data concepts.
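While reading the script, it may help to keep a small picture of the graph-related measures in mind. The sketch below builds a weighted, undirected graph from plain dictionaries and computes a node's degree and local clustering coefficient by hand. It is only an illustration under stated assumptions: the node names and weights are made up, the helper names are hypothetical, and the provided script may well rely on a graph library instead. Degree is taken here as the raw count of neighbors.

```python
from itertools import combinations

# Hypothetical co-purchase pairs with category-similarity edge weights.
edges = [("A", "B", 0.8), ("A", "C", 0.4), ("B", "C", 0.6), ("A", "D", 0.2)]

# Undirected graph stored as {node: {neighbor: weight}}.
graph = {}
for u, v, weight in edges:
    graph.setdefault(u, {})[v] = weight
    graph.setdefault(v, {})[u] = weight

def degree(graph, node):
    """Number of distinct neighbors of the node."""
    return len(graph[node])

def clustering_coefficient(graph, node):
    """Fraction of neighbor pairs that are themselves connected."""
    neighbors = list(graph[node])
    k = len(neighbors)
    if k < 2:
        return 0.0  # fewer than two neighbors: no pairs to connect
    links = sum(1 for a, b in combinations(neighbors, 2) if b in graph[a])
    return links / (k * (k - 1) / 2)

print(degree(graph, "A"))                  # -> 3
print(clustering_coefficient(graph, "B"))  # -> 1.0
```

In this toy graph, node A has three neighbors (B, C, D) of which only the pair B and C is connected, so its clustering coefficient is one third.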
The next step is to use the transformed data provided to you in the SocialNetworkAnalysis.zip file with file names amazon-books.txt and amazon-books-copurchase.edgelist to make Book Recommendations.
You have been provided with a Python script called AnalyzeAmazonBooks.py that’s been included in SocialNetworkAnalysis.zip. This script takes the “amazon-books.txt” and “amazon-books-copurchase.edgelist” files as input, and performs the following steps:
- Read amazon-books.txt data into the amazonBooks Dictionary
- Read amazon-books-copurchase.edgelist into the copurchaseGraph Structure
- We then assume a User has purchased a Book with ASIN=0805047905. The question then is, how do we make other Book Recommendations to this User, based on the Book copurchase data that we have? We could potentially take ALL books that were ever copurchased with this book and recommend all of them. However, the Degree Centrality of Nodes in a Product Co-Purchase Network can typically be large. We should therefore come up with a better strategy. Let’s take the following approach:
- First we examine the metadata associated with the Book that the User is looking to purchase (ASIN=0805047905), including Title, SalesRank, TotalReviews, AvgRating, DegreeCentrality, and ClusteringCoefficient. We notice that this Book has a DegreeCentrality of 216, which means 216 other Books were copurchased with this Book by other Customers. So yes, it would indeed make sense to come up with a better strategy for recommending copurchased Books.
- So now, let’s consider the Ego Network (depth 1) of the Book that the User is looking to purchase (ASIN=0805047905). This is essentially ALL of the Books that have ever been co-purchased with the book under consideration.
- Recall that the Edge Weight between any two Nodes (Book ASINs) in our copurchaseGraph is the Category Similarity between the two Nodes (Book ASINs) connected by the Edge. So we can actually use the Island method to get rid of copurchased Books with a very low degree of category similarity. We pick a threshold of 0.5, and create a Trimmed Ego Network.
- We can then consider the Copurchased Books (Nodes or ASINs) that are still connected to the Book that the User is looking to purchase (ASIN=0805047905) in the Trimmed Ego Network. We can then sort these copurchased books in descending order by their AvgRatings, and recommend the Top 3.
- We can examine the metadata associated with the top recommended books, including Title, SalesRank, TotalReviews, AvgRating, DegreeCentrality, and ClusteringCoefficient. We find that they are all pretty good matches.
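The approach above (depth-1 Ego Network, Island-method trimming at a threshold, then ranking by AvgRating) can be sketched on toy data as follows. This is a minimal illustration, not the code in AnalyzeAmazonBooks.py: the single-letter keys stand in for ASINs, the ratings and weights are invented, the helper names are hypothetical, and keeping edges whose weight is at or above the threshold is an assumption about how ties are handled.

```python
# Hypothetical metadata; keys stand in for ASINs.
books = {
    "X": {"AvgRating": 4.0},
    "A": {"AvgRating": 4.5},
    "B": {"AvgRating": 5.0},
    "C": {"AvgRating": 3.5},
    "D": {"AvgRating": 4.0},
    "E": {"AvgRating": 4.8},
}

# Hypothetical co-purchase edges with category-similarity weights.
edges = [("X", "A", 0.7), ("X", "B", 0.3), ("X", "C", 0.5),
         ("X", "D", 0.9), ("X", "E", 0.6), ("A", "B", 0.4)]

graph = {}
for u, v, weight in edges:
    graph.setdefault(u, {})[v] = weight
    graph.setdefault(v, {})[u] = weight

def ego_network(graph, center):
    """Depth-1 ego network: the center, its neighbors, and edges among them."""
    nodes = {center} | set(graph[center])
    return {u: {v: w for v, w in graph[u].items() if v in nodes} for u in nodes}

def trim_edges(graph, threshold):
    """Island method: keep only edges with weight >= threshold (assumed)."""
    return {u: {v: w for v, w in nbrs.items() if w >= threshold}
            for u, nbrs in graph.items()}

def recommend(graph, books, purchased, threshold=0.5, top_n=3):
    """Rank the books still linked to `purchased` after trimming, by AvgRating."""
    trimmed = trim_edges(ego_network(graph, purchased), threshold)
    candidates = trimmed[purchased]
    ranked = sorted(candidates, key=lambda asin: books[asin]["AvgRating"],
                    reverse=True)
    return ranked[:top_n]

print(recommend(graph, books, "X"))  # -> ['E', 'A', 'D']
```

Note that B has the highest rating in the toy data but is not recommended, because its similarity of 0.3 to the purchased book falls below the 0.5 threshold and the edge is trimmed away.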
Please read the AnalyzeAmazonBooks.py script to ensure you are able to relate the code back to the analysis and recommendation steps described above. Then, execute the script. Once it completes, examine the output and confirm these are appropriate matches.
Requirements for this Assignment
Here are the Requirements for this Assignment:
- Complete the steps highlighted above:
- Download and unzip the zip file from our course site
- Read, understand, and review the py script to understand how the “amazon-books.txt” and “amazon-books-copurchase.edgelist” files have been generated as included in the SocialNetworkAnalysis.zip file downloaded from our course site.
- Read, understand, and execute the py script and ensure you can see the Top 3 Recommendations for ASIN=0805047905
- Make sure that the python scripts mentioned in this section are in the root directory of the unzipped files in step 1.a above.
- Root directory or folder means that the Python scripts for this assignment need to run in the same directory or folder (not another directory or folder) where the unzipped files amazon-books.txt, amazon-books-copurchase.edgelist, and amazon-meta.txt are located.
- Failure to have the Python scripts for this assignment in the correct directory or folder, as described in this step, will result in the following Python error upon execution:
FileNotFoundError: [Errno 2] No such file or directory: './amazon-books.txt'
- The best way to ensure you are running the Python scripts in the correct root directory or folder location is to do the following:
- Close Spyder (this is important)
- Use File Explorer (Windows) or Finder Window (If you installed Spyder locally in your Mac) to navigate to your directory or folder where you unzipped the file zip.
- Ensure that such directory or folder has the files: py, PreprocessAmazonBooks.py, amazon-meta.txt, amazon-books.txt, and amazon-books-copurchase.edgelist for all to work correctly.
- Now open Spyder. In Spyder, go to the File menu, then Open menu, navigate to the directory or folder where you unzipped your SocialNetworkAnalysis.zip file, and select the appropriate script. Then click on Open or Ok.
- Now you can continue with the relevant assignment steps
- Update AnalyzeAmazonBooks.py to do the following:
- Make Top 5 Recommendations for a Buyer who is purchasing ASIN =
- List the Top 5 Recommendations and associated Metadata.
- Make sure that your updated python script mentioned in this section is saved in the root directory or folder of your unzipped zip file:
- Root directory or folder means that your updated Python script for this assignment needs to run in the same directory or folder (not another directory or folder) where the unzipped files amazon-books.txt, amazon-books-copurchase.edgelist, and amazon-meta.txt are located.
- Failure to save and run your updated Python script in the correct root directory or folder, as described in this section, will result in the following Python error upon execution:
FileNotFoundError: [Errno 2] No such file or directory: './amazon-books.txt'
- The best way to ensure that you are running your updated Python script in the correct root directory or folder location is to do the following:
- Save your updated Python script in the correct root directory or folder (Step 2.c.i)
- Close Spyder (this is important)
- Use File Explorer (Windows) or Finder Window (If you installed Spyder locally in your Mac) to navigate to your root directory or folder where you unzipped the file zip.
- Ensure that such root directory or folder has the following files: py, PreprocessAmazonBooks.py, amazon-meta.txt, amazon-books.txt, and amazon-books-copurchase.edgelist for all to work correctly.
- Now verify that your updated Python Script file is in this root directory or folder.
- Now open Spyder. In Spyder, go to the File menu, then Open menu, navigate to the root directory or folder where you unzipped your zip file, and then select the name of your updated Python script. Then click on Open or Ok.
- Now you can continue with the relevant assignment steps
- Recall that once we had trimmed the Ego Network, we considered the Copurchased Books that are still connected to the Book that the User is looking to purchase in the Trimmed Ego Network. We then sorted these copurchased books in descending order by their AvgRatings, and recommended the Top 3. Is there some other data and/or logic we could have used to pick the Top Book Recommendations? What would that be? Briefly describe the logic. Then, update the AnalyzeAmazonBooks.py script to use your new logic and make Top 5 Recommendations for ASIN = 0812580036.
Note: you cannot suggest a trivial change like changing the Island threshold value here!
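The directory check described in the steps above can also be done from Python before running either script, which may help when tracking down the FileNotFoundError. The sketch below is only a convenience under stated assumptions: missing_files is a hypothetical helper, and the list contains just the three data files named in this handout.

```python
from pathlib import Path

# Data files this handout says must sit next to the Python scripts.
REQUIRED_FILES = [
    "amazon-meta.txt",
    "amazon-books.txt",
    "amazon-books-copurchase.edgelist",
]

def missing_files(directory, required=REQUIRED_FILES):
    """Return the required file names that are not present in `directory`."""
    folder = Path(directory)
    return [name for name in required if not (folder / name).is_file()]

# Check the current working directory, which is where "./amazon-books.txt"
# is resolved from when the script runs.
missing = missing_files(Path.cwd())
if missing:
    print("Missing from", Path.cwd(), ":", ", ".join(missing))
else:
    print("All required data files found in", Path.cwd())
```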
Submission for this Assignment
Submit the following for this Assignment:
- For (2) above:
- List Top 5 Recommendations for a Buyer who is purchasing ASIN = 0812580036. [5 points]
- For (3) above:
- Brief Description of alternate/enhanced logic to make Top Recommendations from the Trimmed Ego Network. [5 points]
- Updated AnalyzeAmazonBooks.py script that implements this new logic. This file will be run as-is. Points will be deducted if it does not execute and/or does not provide the top recommendations. [5 points]
- Based on your newly executed logic, list Top 5 Recommendations for a Buyer who is purchasing ASIN = 0812580036. The Top Recommendations listed here should match the output from the updated script. [5 points]