Assignment-4
Courses covered: EDA, Feature Engineering, Regression Analysis, Time Series Analysis
General Instructions:
1. The Module 4 assignment consists of 4 sections.
2. The learner will have to submit solutions for all the 4 sections.
3. Read the problem statement carefully before answering.
4. Provide appropriate comments in your code.
5. Perform all the mentioned tasks programmatically using Python libraries.
Submission Instructions:
1. Create a separate Jupyter Notebook for each of the 4 sections.
2. Save each Jupyter Notebook in the given format: .ipynb
(e.g., Assignment-4A.ipynb)
3. Place all 4 Jupyter Notebooks in the folder
*******************************************************************
Assignment – 4A:
‘ABC’ and ‘DEF’ are two popular retailers. The purchase details of customers visiting these stores are stored in three different datasets as mentioned below:
· “Datasource/Customer.csv” has customer details.
· “Datasource/Sales.csv” has sales related information.
· “Datasource/Product.csv” has product related information.
Perform the following operations using the given datasets.
i. Import the given datasets.
ii. Find the total quantity sold and the total price for each product.
[Hint: Total price should be computed using total quantity and ‘Price’. Sample output is given below]
iii. Display the sales report with ‘Customer’ details, ‘Product’ name and ‘quantity’ purchased. Sample output is shown below.
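One possible approach to tasks ii and iii is sketched below with pandas. The small in-memory frames stand in for the three CSV files, and every column name (CustomerID, CustomerName, ProductID, ProductName, Price, Quantity) is an assumption that may not match the actual datasets.

```python
import pandas as pd

# Hypothetical stand-ins for Customer.csv, Sales.csv and Product.csv.
# With the real files, pd.read_csv("Datasource/Customer.csv") etc. would be used,
# and the column names below would need to be adjusted.
customers = pd.DataFrame({
    "CustomerID": [1, 2],
    "CustomerName": ["Asha", "Ben"],
})
products = pd.DataFrame({
    "ProductID": [10, 11],
    "ProductName": ["Pen", "Notebook"],
    "Price": [2.0, 5.0],
})
sales = pd.DataFrame({
    "CustomerID": [1, 1, 2],
    "ProductID": [10, 11, 10],
    "Quantity": [3, 1, 4],
})

# Task ii: total quantity per product, then total price = total quantity * Price.
totals = (
    sales.groupby("ProductID", as_index=False)["Quantity"].sum()
    .merge(products, on="ProductID")
)
totals["TotalPrice"] = totals["Quantity"] * totals["Price"]

# Task iii: join sales with customer and product details for the report.
report = (
    sales.merge(customers, on="CustomerID")
    .merge(products, on="ProductID")
    [["CustomerID", "CustomerName", "ProductName", "Quantity"]]
)

print(totals)
print(report)
```

Grouping before merging keeps one row per product, so the price is multiplied by the summed quantity rather than summed row by row.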
Assignment – 4B:
An organization is interested in finding the likelihood of a road accident with respect to the driver’s Blood Alcohol Concentration (BAC). Based on various incidents, the BAC levels and the accident likelihood quotient are recorded in ‘crash_BAC.csv’.
As a data scientist, help the organization in building a predictive model that can predict the likelihood of an accident, given the BAC of the driver.
Problem statement:
Perform the following tasks to build the predictive model:
1. Import the data set (crash_BAC.csv).
2. Build a predictive model as given below:
a. Build a regression model using the Ordinary Least Squares (OLS) method for the relationship between ‘BAC’ and ‘CrashLikelihood’ given below.
CrashLikelihood = β0 *
[The statsmodels library can be used.]
b. Obtain the statistical inferences of the above model using the summary() function.
c. Visualize the model along with the actual data points used for training.
d. Using a scale-location plot, visualize the residuals against the fitted values.
e. Plot a Normal Q-Q plot to infer the distribution of the residuals.
3. Predict the ‘CrashLikelihood’ for the ‘BAC’ value 0.18 using both the models built.
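The OLS workflow can be sketched with statsmodels as follows. The model form is not fully specified here, so this sketch assumes a straight-line relationship, CrashLikelihood = β0 + β1 · BAC, which may differ from the intended one. The data is synthetic and invented purely for illustration; only the column names ‘BAC’ and ‘CrashLikelihood’ come from the problem statement.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for crash_BAC.csv (values are invented, not real data).
rng = np.random.default_rng(0)
bac = np.linspace(0.0, 0.25, 30)
data = pd.DataFrame({
    "BAC": bac,
    "CrashLikelihood": 1.0 + 40.0 * bac + rng.normal(0, 0.5, bac.size),
})

# Fit by ordinary least squares; the linear formula is an assumption.
model = smf.ols("CrashLikelihood ~ BAC", data=data).fit()
print(model.summary())

# Quantities for a scale-location plot:
# x-axis = fitted values, y-axis = sqrt(|standardized residuals|).
fitted = model.fittedvalues
std_resid = model.get_influence().resid_studentized_internal
scale_location = np.sqrt(np.abs(std_resid))

# Prediction for a single new BAC value.
pred = model.predict(pd.DataFrame({"BAC": [0.18]}))
print(pred.iloc[0])
```

With matplotlib available, `fitted` and `scale_location` can be passed to a scatter plot, and `statsmodels.api.qqplot(model.resid, line="s")` draws the Normal Q-Q plot of the residuals.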
Assignment – 4C:
A manufacturing company has collected details about the products it manufactures; the data is available in “Product_inspect.csv”. The dataset has 178 samples, 14 columns and 3 classes. Import this data set and perform the following tasks:
i. Split the dataset into training and test data with 80:20 ratio.
ii. Build machine learning model 1:
a. Train a Support Vector Machine using the given data.
b. Find the train and the test score for model 1.
c. Build confusion matrices for model 1, based on the train data and the test data.
iii. Build machine learning model 2:
a. Standardize the data.
b. Transform the train and test data using Linear Discriminant Analysis.
c. Train a Support Vector Machine using the transformed data.
d. Find the train and the test score for model 2.
e. Build confusion matrices for model 2, based on the feature-transformed train data and the test data.
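The two models can be sketched with scikit-learn as shown below. Since “Product_inspect.csv” is not available here, scikit-learn's bundled wine dataset is used only as a stand-in because it has a matching shape (178 samples, 13 features plus a class label, 3 classes); that the two datasets have the same content is an assumption. The random_state value and default SVC settings are likewise illustrative choices.

```python
from sklearn.datasets import load_wine
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Stand-in data; with the real file, pd.read_csv("Product_inspect.csv") would be used.
X, y = load_wine(return_X_y=True)

# Task i: 80:20 split, stratified so each class appears in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Model 1: SVM on the raw features.
svm_raw = SVC().fit(X_train, y_train)
score1_train = svm_raw.score(X_train, y_train)
score1_test = svm_raw.score(X_test, y_test)
cm1_train = confusion_matrix(y_train, svm_raw.predict(X_train))
cm1_test = confusion_matrix(y_test, svm_raw.predict(X_test))

# Model 2: standardize, project with LDA, then train the SVM.
# The scaler and LDA are fitted on the training data only.
scaler = StandardScaler().fit(X_train)
X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test)

lda = LinearDiscriminantAnalysis(n_components=2).fit(X_train_std, y_train)
X_train_lda = lda.transform(X_train_std)
X_test_lda = lda.transform(X_test_std)

svm_lda = SVC().fit(X_train_lda, y_train)
score2_train = svm_lda.score(X_train_lda, y_train)
score2_test = svm_lda.score(X_test_lda, y_test)
cm2_train = confusion_matrix(y_train, svm_lda.predict(X_train_lda))
cm2_test = confusion_matrix(y_test, svm_lda.predict(X_test_lda))

print("Model 1 train/test score:", score1_train, score1_test)
print("Model 2 train/test score:", score2_train, score2_test)
print(cm1_test)
print(cm2_test)
```

With 3 classes, LDA can produce at most 2 discriminant components, which is why n_components is set to 2 in this sketch.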
Assignment – 4D:
The sales dataset (‘salesdata.csv’) provides time series data for the period Jan 2013 – Dec 2018.
Import the data set “salesdata.csv” and perform the following tasks on the given dataset:
i. Extract the year, month, day, and weekday as features in the given data frame.
ii. Plot a graph to visualize the relation between Sales and the Date.
iii. Decompose the data to separate the seasonality, trend, and residual.
iv. Perform a stationarity test using the Dickey-Fuller test. If the p-value obtained from the test is greater than 0.01, apply differencing to make the time series stationary.
v. Draw ACF and PACF plots and comment on your observations from the plots.
vi. Build a SARIMA model using the given training set. Choose the parameters appropriately. Plot a graph to visualize the distribution of the residuals from the SARIMA model.
vii. Consider the last 30 days of data from the training set and predict sales using the model built.