Scenario
You work for the data science group of a large US-based supermarket chain. Today, you are assigned to develop a predictive model that can help improve future sales of domestic wine. Your colleague has obtained a dataset of 54503 different wines from a market research firm. The dataset is stored in a .csv file and contains 8 attributes for each wine:
1. Id: uniquely identifies a wine.
2. Name: the name of the wine.
3. Score: the rating score given by professional reviewers, on a scale of 1-100.
4. Price: the manufacturer's suggested retail price in US dollars.
5. State: the US state where the wine is made.
6. Region_1: the region where the wine is made.
7. Region_2: a more specific region where the wine is made.
8. Variety: the grapes used to make the wine.
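A first step that may save time later is to confirm that the file's header matches the eight attributes listed above. The sketch below is only an illustration: the column spellings are copied from the list and may not match the real file exactly, the helper name check_header is hypothetical, and an in-memory header stands in for the actual .csv file, whose name is not given here.

```python
import csv
import io

# Column names taken from the attribute list above. The exact spelling and
# capitalisation in the real file is an assumption and may differ.
EXPECTED_COLUMNS = ["Id", "Name", "Score", "Price", "State",
                    "Region_1", "Region_2", "Variety"]


def check_header(csv_file):
    """Compare the header row of an open CSV file with the expected columns.

    Returns (missing, unexpected) as two sorted lists of column names.
    """
    reader = csv.reader(csv_file)
    header = [name.strip() for name in next(reader)]
    missing = sorted(set(EXPECTED_COLUMNS) - set(header))
    unexpected = sorted(set(header) - set(EXPECTED_COLUMNS))
    return missing, unexpected


# In-memory stand-in for the real file (header only, no data rows).
sample = io.StringIO("Id,Name,Score,Price,State,Region_1,Region_2,Variety\n")
missing, unexpected = check_header(sample)
print("missing:", missing, "unexpected:", unexpected)
```

With the real dataset, the same function could be called on a file opened with open(path, newline=""), where path is wherever you saved the .csv file.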
Requirements:
1. All code must be written in Python.
2. You should use Jupyter Notebook to work on this project and submit the .ipynb file.
3. You are required to write an executive summary (Word or PDF file) to present your work.
The summary should be no more than two pages (double spaced, excluding any figures,
tables, and references).
4. Code must be well documented with comments.
5. You should also include narrative alongside your code, using Markdown, to explain and justify your steps and to describe any insights gained from each step.
6. You may search online or discuss with other students, but each student must work
independently.
Notes:
1. You are welcome to use additional Python packages (not covered in class), but they should
be well documented through Markdown.
2. Comments are different from explanations using Markdown.
3. Here is the relative importance of each component of your work. The percentages are only indicative.
Explanation/Description using Markdown (20% – 25%)
Code, including comments (60% – 65%)
Executive summary (15% – 20%)
4. This dataset is adapted from an open dataset but has been modified to fit our project. You might find similar datasets online, but don’t rely on existing solutions, as they may not work properly.
5. The accuracy (or other metric) of your final prediction model is less important than the process you follow to achieve and improve that value. In particular, because the dataset has been modified, you may end up with low accuracy.
6. You may try different algorithms (including those not covered in the lectures) and include them in your submission. However, a purposeful selection of a smaller number of algorithms with good justification is better than a random selection of a larger number of algorithms without good justification.
7. Given the size of the dataset, it may take time to train your model.
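One possible way to keep early experiments quick, offered only as a suggestion and not as part of the requirements, is to prototype on a reproducible random subset of rows before training on the full dataset. In the sketch below, the helper name sample_row_indices, the 10% fraction, and the seed are all illustrative assumptions.

```python
import random

TOTAL_ROWS = 54503  # dataset size stated in the scenario


def sample_row_indices(n_rows, fraction, seed=0):
    """Return a sorted, reproducible random subset of row indices."""
    k = max(1, int(n_rows * fraction))  # number of rows to keep
    rng = random.Random(seed)           # fixed seed so reruns give the same subset
    return sorted(rng.sample(range(n_rows), k))


# Illustrative choice: prototype on roughly 10% of the rows.
subset = sample_row_indices(TOTAL_ROWS, fraction=0.10)
print(len(subset))
```

Any conclusions drawn from the subset should be rechecked on the full dataset, since a model tuned on a sample may behave differently when trained on all rows.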