SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTING
BIG DATA & DATA ANALYTICS LAB PROJECT 2
This lab project is based on a dataset about movie success in 2014 and 2015 by Ahmed et al. (2015), which is available on the online platform by Lichman (2013). Download the file movidata.csv from Blackboard and then complete the following exercises.
EXERCISE 1 (1 MARK)
Use ggplot() to create a box plot that shows the number of screens on which each movie was initially launched in the US on the y-axis separately for 2014 and 2015. Note: Only include those observations that do not have a missing value (NA) for the variable “screens” (e.g., by using !is.na(…)).
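The filtering step in the note can be sketched as follows. This is a minimal illustration on a made-up toy data frame (the name toy and all of its values are assumptions, not taken from the real file); it only shows how !is.na() combines with subset() and how a year column can be turned into a grouping factor for a box plot. The plotting part runs only if ggplot2 is installed.

```r
# Toy stand-in for the real data; the values are invented for illustration.
toy <- data.frame(
  year    = c(2014, 2014, 2014, 2015, 2015, 2015),
  screens = c(3000, NA, 1200, 2500, 40, NA)
)

# Keep only the rows where screens is not missing.
toy_complete <- subset(toy, !is.na(screens))

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  # factor(year) gives one box per year instead of a continuous x-axis.
  p <- ggplot(toy_complete, aes(x = factor(year), y = screens)) +
    geom_boxplot() +
    labs(x = "Year", y = "Screens")
  # print(p) renders the plot.
}
```

Filtering before plotting, rather than relying on ggplot2 to drop missing values, avoids the "removed rows" warning and makes the excluded observations explicit.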
EXERCISE 2 (2 MARKS) [R-CODE]
Calculate the profit of each movie (profit = gross – budget) and add the results as a new variable “profit” to the moviedata dataframe. Use ggplot() to create a violin plot that shows the profit on the y-axis separately for ORIGINAL movies and SEQUEL movies (using the sequelcat variable). Use the “YlOrRd” colour palette from the RColorBrewer library to fill the violin plots (hint for spelling: YlOrRd stands for Yellow / Orange / Red). Add a boxplot on top of the violin plot and add a red point that indicates the mean value. Note: Only include those observations that do not have a missing value (NA) for the variable “profit”.
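How the layers described above stack can be sketched on toy data. The data frame toy and its numbers are invented; only the column names gross, budget, profit, and sequelcat come from the exercise. The sketch assumes ggplot2 version 3.3.0 or later (where stat_summary() takes the argument fun) and uses scale_fill_brewer(), which draws on the RColorBrewer palettes; it is one possible layering, not necessarily the intended solution.

```r
# Toy stand-in for moviedata; the values are invented for illustration.
toy <- data.frame(
  sequelcat = rep(c("ORIGINAL", "SEQUEL"), each = 5),
  gross     = c(10, 25, 40, 8, NA, 90, 120, 60, 150, 75),
  budget    = c(5, 30, 20, 10, 12, 50, 80, 70, 100, NA)
)

# Profit is NA whenever gross or budget is NA.
toy$profit <- toy$gross - toy$budget
toy_complete <- subset(toy, !is.na(profit))

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(toy_complete, aes(x = sequelcat, y = profit, fill = sequelcat)) +
    geom_violin() +                          # distribution shape per group
    geom_boxplot(width = 0.15) +             # narrow box drawn on top of the violin
    stat_summary(fun = mean, geom = "point", # one red point at each group mean
                 colour = "red", size = 3) +
    scale_fill_brewer(palette = "YlOrRd")
  # print(p) renders the plot.
}
```

Layers are drawn in the order they are added, so the box plot and the mean point must come after geom_violin() to appear on top.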
EXERCISE 3 (1 MARK) [R-CODE]
Use the subset() command to create a subset of the dataframe that only includes observations without missing values for budget, screens, and aggregate_followers. Name this data frame “moviedatasub”. Then, using the newly created data frame “moviedatasub”, use the custom winsor() function discussed in the lecture slides in week 3 to create a new variable likes_winsor based on the variable likes. Use a multiplier of 1.5.
To make sure that the winsorising worked, compare the two variables by creating simple box plots using the following commands.
with(moviedatasub, boxplot(likes))
with(moviedatasub, boxplot(likes_winsor))
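The winsor() function from the lecture slides is not reproduced in this sheet. As a rough idea of what such a function might do, the sketch below clamps values to the fences Q1 − multiplier × IQR and Q3 + multiplier × IQR. This definition is an assumption; the lecture version may use a different rule or different argument names, and the lecture version is the one to use for the exercise.

```r
# Hypothetical stand-in for the lecture's winsor(); the real one may differ.
winsor <- function(x, multiplier = 1.5) {
  q <- unname(quantile(x, probs = c(0.25, 0.75), na.rm = TRUE))
  iqr <- q[2] - q[1]
  lower <- q[1] - multiplier * iqr   # lower fence
  upper <- q[2] + multiplier * iqr   # upper fence
  # Pull values outside the fences back onto the nearest fence.
  pmin(pmax(x, lower), upper)
}

# Toy vector with one extreme value (invented for illustration).
likes <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 100)
likes_winsor <- winsor(likes, multiplier = 1.5)
```

With this definition the fences match the whiskers' limits in a standard box plot, so a box plot of the winsorised variable should show no points beyond the whiskers.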

EXERCISE 4 (2 MARKS)
Look up the “cut” command. Based on the dataset “moviedatasub”, create a new column “ratingscat” in the dataframe that describes the ratings category of a movie using the cut command. Distinguish between the following categories:
– “negative” (0 ≤ rating < 6)
– “neutral” (6 ≤ rating < 6.8)
– “positive” (6.8 ≤ rating < 10)
Use ggplot() to create a scatterplot for gross over likes_winsor that you created in Exercise 3. Indicate the different ratings categories by colouring the points in the scatterplot with the "FantasticFox1" color palette of the "wesanderson" package.

EXERCISE 5 (1 MARK) [R-CODE]
Based on the dataset “moviedatasub”, use the ddply() function of the package “plyr” to create a data frame with the means and standard deviations of profit, gross, and budget for the three different ratings categories (variable: ratingscat, cf. Exercise 4) and for the two different values of sequelcat (ORIGINAL / SEQUEL). Also include the number of observations N for each of the category combinations. The output should look like this:

EXERCISE 6 (2 MARKS) [R-CODE]
Based on the dataset “moviedatasub”, use a Bartlett’s test to test for variance homogeneity in the variable profit across the three different ratings categories (variable: ratingscat, cf. Exercise 4). In your own words, interpret the results of the test and decide whether we should assume that the variances are homogeneous. Then, use a one-way Analysis of Variance (ANOVA) to test whether there is a difference in mean profit across the three different ratings categories and interpret the result in your own words. Conduct a PostHoc analysis to determine which groups are significantly different from each other. How does the result of the test of variance homogeneity affect the PostHoc analysis?

EXERCISE 7 (1 MARK) [R-CODE]
Based on the dataset “moviedatasub”, compare the mean profits for ORIGINAL and SEQUEL movies (variable: sequelcat). Which test should we use to test whether there is a significant difference and why? Conduct the test in R and interpret the result in your own words.

REFERENCES
Ahmed M, Jahangir M, Afzal H, Majeed A, Siddiqi I. Using Crowd-source based features from social media and Conventional features to predict the movies popularity. In Smart City/SocialCom/SustainCom (SmartCity), 2015 IEEE International Conference on 2015 Dec 19 (pp. 273-278). IEEE. https://ieeexplore.ieee.org/document/7463737
Lichman, M. (2013). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.

DATASET
moviedata
Conventional and Social Media Movies 2014 and 2015

Description
A dataset about the success of movies in 2014 and 2015.

Usage
moviedata

Format
A data frame with 231 observations on the following 14 variables.
movie: Name of the movie
year: Year of movie release
ratings: Rating of the movie (0 – 10)
genre: Identifier for the genre of the movie (e.g., action, adventure, drama)
gross: Gross world-wide income from the movie (in US$)
budget: Budget for the movie
screens: Number of screens that the movie was initially launched in on the opening weekend in the US
sequel: A number indicating whether the movie is a sequel or an original (individual) movie, where higher numbers indicate later sequels in a series. For instance, for Mission Impossible a sequel value of 5 indicates that this is the fifth movie in the series.
dummy_sequel: 0 – Original movie, 1 – Sequel movie
sentiment: A sentiment score assessed through an analysis of tweets about the movie on Twitter. 0 represents a neutral sentiment, a positive value represents a positive sentiment, and a negative value indicates a negative sentiment. The sentiment score for each movie was calculated by retrieving all tweets related to each movie, assigning the sentiment score to each of them and then aggregating the score.
views: Number of times the movie trailer was viewed on YouTube
likes: Number of likes the movie trailer received on YouTube
dislikes: Number of dislikes the movie trailer received on YouTube
comments: Number of times the movie trailer received a comment on YouTube
aggregate_followers: The aggregate number of actor followers: Equal to sum of followers of top 3 cast from Twitter

Source
Ahmed M, Jahangir M, Afzal H, Majeed A, Siddiqi I. Using Crowd-source based features from social media and Conventional features to predict the movies popularity. In Smart City/SocialCom/SustainCom (SmartCity), 2015 IEEE International Conference on 2015 Dec 19 (pp. 273-278). IEEE. https://ieeexplore.ieee.org/document/7463737
Lichman, M. (2013). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.