COMP5349_week5_lab_solution
!pip install pyspark
Collecting pyspark
Downloading pyspark-3.2.1.tar.gz (281.4 MB)
|████████████████████████████████| 281.4 MB 34 kB/s
Collecting py4j==0.10.9.3
Downloading py4j-0.10.9.3-py2.py3-none-any.whl (198 kB)
|████████████████████████████████| 198 kB 50.6 MB/s
Building wheels for collected packages: pyspark
Building wheel for pyspark (setup.py) … done
Created wheel for pyspark: filename=pyspark-3.2.1-py2.py3-none-any.whl size=281853642 sha256=5531dbadde2207da79971ede4cda93eba4fd8fe90961499ec75a2c5d560af865
Stored in directory: /root/.cache/pip/wheels/9f/f5/07/7cd8017084dce4e93e84e92efd1e1d5334db05f2e83bcef74f
Successfully built pyspark
Installing collected packages: py4j, pyspark
Successfully installed py4j-0.10.9.3 pyspark-3.2.1
from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive
from pyspark import SparkConf, SparkContext
import csv
spark_conf = SparkConf()\
    .setAppName("Week 5 Lab Solution")
sc=SparkContext.getOrCreate(spark_conf)
input_path = "file:///content/drive/MyDrive/comp5349/"
ratings = sc.textFile(input_path + "ratings.csv")
movieData = sc.textFile(input_path + "movies.csv")
Exercise 1
Find all movies that have no genre listed and print their IDs. The data set uses the special genre name (no genres listed) for such cases. You may hard-code the genre name in your functions.
gname = '(no genres listed)'
def mapGenres(record):
    for row in csv.reader([record]):
        if len(row) != 3:
            return []  # skip malformed rows
        movieId,genres = row[0],row[2]
        if gname in genres:
            return [movieId] # row[0] is the id of a movie with no genre
    return []
noGenreMovies = movieData.flatMap(mapGenres)
ngmovie = noGenreMovies.collect()
['126929', '135460', '138863', '141305', '141472', '143709', '149532']
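The functions in this lab parse each line with `csv.reader` rather than `str.split(",")`. A likely reason is that a movie title can itself contain a comma, in which case the title is quoted in the CSV file. The sketch below shows the difference on a single row. The sample line and the helper name `parse_movie_line` are assumptions made for illustration and are not taken from the lab data.

```python
import csv

def parse_movie_line(line):
    """Parse one CSV line into a list of fields, honouring quoted commas."""
    return next(csv.reader([line]))

# Hypothetical row in the movieId,title,genres layout,
# with a comma inside the quoted title
sample = '11,"American President, The (1995)",Comedy|Drama|Romance'

naive = sample.split(",")          # splits on every comma, including the one in the title
parsed = parse_movie_line(sample)  # keeps the quoted title as a single field

print(len(naive))   # 4
print(len(parsed))  # 3
print(parsed[1])    # American President, The (1995)
```

With the naive split, the `len(row) != 3` check used in the solution would reject this row, so the movie would be dropped silently.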
Exercise 2
Find the top 5 movies in the Documentary genre based on the number of ratings each movie received. You may hard-code the genre name in your functions.
def filterMovieInGenre(record):
    """This function filters entries of movies.csv based on the genre list.
    It only keeps movies whose genre list contains the given genre.
    The given genre name is defined outside as genre and is captured in the closure.
    Args:
        record (str): A row of the CSV file, with three columns separated by commas
    Returns:
        [(movieID, title)] if the movie belongs to the given genre, [] otherwise
    """
    for row in csv.reader([record]):
        if len(row) != 3:
            return []  # skip malformed rows
        movieID, title, genreList = row[0],row[1], row[2]
        genres = genreList.split('|')
        if genre in genres:
            return [(movieID, title)]
    return []
def extractRating(record):
    """This function converts entries of ratings.csv into key-value pairs of the following format:
    (movieID, rating)
    Args:
        record (str): A row of the CSV file, with four columns separated by commas
    Returns:
        The return value is a tuple (movieID, rating)
    """
    userID, movieID, rating, timestamp = record.split(",")
    rating = float(rating)
    return (movieID, rating)
genre = 'Documentary'
moviesInGenre = movieData.flatMap(filterMovieInGenre)
movieRatings = ratings.map(extractRating)
# joining on movieID gives (movieID, (title, rating))
movieRatingsInGenre = moviesInGenre.join(movieRatings)
# countByKey() is an action returning a dictionary to the driver, instead of an RDD
# After values(), the key becomes title
movieRatingCount = list(movieRatingsInGenre.values().countByKey().items())
# the list could be sorted and limited using plain Python directly;
# here we convert the list into an RDD to utilize Spark features
results = sc.parallelize(movieRatingCount) \
    .sortBy(lambda r: r[1],ascending=False) \
    .take(5)
print(results)
[('Bowling for Columbine (2002)', 51), ('Hoop Dreams (1994)', 36), ('Roger & Me (1989)', 35), ('Super Size Me (2004)', 33), ('Fahrenheit 9/11 (2004)', 33)]
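As the comments in the solution note, the counts returned by `countByKey()` already live on the driver, so they can also be sorted and truncated with plain Python instead of being turned back into an RDD. A minimal sketch of that alternative follows. The helper name `top_n` is hypothetical, three of the counts are copied from the output above, and the fourth entry is made up to show the cut-off.

```python
def top_n(counts, n):
    """Return the n (key, count) pairs with the largest counts, highest first."""
    return sorted(counts.items(), key=lambda kv: kv[1], reverse=True)[:n]

rating_counts = {
    'Roger & Me (1989)': 35,
    'Bowling for Columbine (2002)': 51,
    'Hypothetical Title (2000)': 2,   # made-up entry, not from the data set
    'Hoop Dreams (1994)': 36,
}

print(top_n(rating_counts, 3))
# [('Bowling for Columbine (2002)', 51), ('Hoop Dreams (1994)', 36), ('Roger & Me (1989)', 35)]
```

This is reasonable only when the number of distinct keys is small enough to fit in driver memory, which is already a precondition for calling `countByKey()`.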
Exercise 3
Find the top 5 genre pairs that co-occur most often in the data set. Many movies are classified into more than one genre.
This exercise uses one function that extracts the genre pairs from each movie record.
def getGenrePairs(record):
    """This function converts entries of movies.csv into ((g1,g2),1) pairs for all genre pairs
    appearing in the row.
    Since there may be multiple genres per movie, this function returns a list of tuples.
    Args:
        record (str): A row of the CSV file, with three columns separated by commas
    Returns:
        The return value is a list of tuples, each tuple contains ((g1,g2), 1)
    """
    for row in csv.reader([record]):
        if len(row) != 3:
            return []  # skip malformed rows
        genre_list = row[2].split("|")
        g = len(genre_list)
        if g<2 : # single genre case
            return []
        # at least two genres case
        results = []
        sorted_glist = sorted(genre_list) # sort in alphabetical order
        for i in range(g):
            for j in range(i+1,g): # from i+1 to last
                results.append(((sorted_glist[i],sorted_glist[j]),1))
        return results
    return []
movieData.flatMap(getGenrePairs)\
    .reduceByKey(lambda a,b:a+b) \
    .sortBy(lambda r: r[1],ascending=False).take(5)
[(('Drama', 'Romance'), 1096),
(('Comedy', 'Drama'), 1039),
(('Drama', 'Thriller'), 1016),
(('Comedy', 'Romance'), 892),
(('Crime', 'Drama'), 841)]
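The nested loop in `getGenrePairs` enumerates every unordered pair from the sorted genre list. Sorting first matters because it makes ('Drama', 'Romance') and ('Romance', 'Drama') map to the same key before `reduceByKey` adds up the counts. The same pairs can be produced with `itertools.combinations`. The sketch below is a Spark-free illustration on a single genre string, and `genre_pairs` is a hypothetical helper name rather than part of the solution.

```python
from itertools import combinations

def genre_pairs(genre_field):
    """Return ((g1, g2), 1) for every unordered genre pair, with g1 < g2 alphabetically."""
    genres = sorted(genre_field.split("|"))
    return [(pair, 1) for pair in combinations(genres, 2)]

print(genre_pairs("Drama|Comedy|Romance"))
# [(('Comedy', 'Drama'), 1), (('Comedy', 'Romance'), 1), (('Drama', 'Romance'), 1)]

print(genre_pairs("Documentary"))
# []
```

A movie with g genres contributes g*(g-1)/2 pairs, and a single-genre movie contributes none, which matches the `g<2` check in the solution.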