INF 553 – Spring 2017 Assignment 1
Overview of the assignment
In this assignment, students will complete two tasks. The goal of these two tasks
is to let students get familiar with Spark and perform data analysis using Spark.
This description has three parts: the first explains how to configure the
environment and data sets, the second describes the two tasks in detail, and the
third lists the files students should submit and the grading criteria.

Spark Installation
Spark can be downloaded from the official website:
http://spark.apache.org/downloads.html
Spark 1.6.1 pre-built for Hadoop 2.4 is recommended. The download page of the
official Spark website is shown in the following figure.

Scala Installation
Please refer to the Spark slides.
Python Configuration
You need to add the paths of your Spark folder (path/to/your/Spark) and its Python
folder (path/to/your/Spark/python) to the interpreter's environment variables
SPARK_HOME and PYTHONPATH, respectively.
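If you prefer to set these variables from inside a script rather than in the
interpreter settings, the following minimal sketch shows one way to do it. The
helper name configure_spark_paths and the example path /opt/spark are
hypothetical; replace the path with your own Spark folder.

```python
import os
import sys


def configure_spark_paths(spark_home):
    """Point SPARK_HOME and PYTHONPATH at a Spark folder (hypothetical helper)."""
    python_dir = os.path.join(spark_home, "python")

    # SPARK_HOME is the Spark folder itself.
    os.environ["SPARK_HOME"] = spark_home

    # PYTHONPATH gets the Spark python folder, keeping any existing entries.
    existing = [p for p in os.environ.get("PYTHONPATH", "").split(os.pathsep) if p]
    if python_dir not in existing:
        existing.insert(0, python_dir)
    os.environ["PYTHONPATH"] = os.pathsep.join(existing)

    # Also extend sys.path so the already running interpreter can see the folder.
    if python_dir not in sys.path:
        sys.path.insert(0, python_dir)
    return python_dir


# "/opt/spark" is an assumed example location, not a required one.
spark_python_dir = configure_spark_paths("/opt/spark")
```

Changing os.environ only affects the current process and the processes it
starts, so this has to run before any Spark module is imported.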
Data
Please download the data from MovieLens using the following link:
You are required to download two data sets. The first is ml-20m.zip, whose size is
190MB; the second is ml-latest-small.zip, whose size is 1MB. Each zip file contains
five CSV files. The files tags.csv and ratings.csv are needed for the tasks.

Task1: (40%)
Students are required to calculate each movie's average rating. The ratings.csv file
is needed for this task.

Result format:
1. Save the result as one text file. There is no requirement on the format of the
file.
2. The result must be ordered by movieId in ascending order.

The following snapshot is an example of the result for Task 1. It only shows the
format of the result.

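To make the Task 1 computation concrete, here is a plain-Python sketch of the
aggregation, not a Spark solution. It assumes ratings.csv has the columns
userId, movieId, rating, timestamp; the function name and the three sample rows
are made up for illustration. In Spark, the same idea could be expressed by
mapping each row to (movieId, (rating, 1)), combining the pairs with
reduceByKey, and sorting with sortByKey.

```python
import csv
import io
from collections import defaultdict


def average_rating_per_movie(rows):
    """Return (movieId, average rating) pairs sorted by movieId ascending."""
    totals = defaultdict(lambda: [0.0, 0])  # movieId -> [sum of ratings, count]
    for row in rows:
        acc = totals[int(row["movieId"])]
        acc[0] += float(row["rating"])
        acc[1] += 1
    # Keys are integers, so the order is numeric rather than lexicographic.
    return [(movie_id, s / n) for movie_id, (s, n) in sorted(totals.items())]


# Tiny made-up sample in the assumed ratings.csv layout.
sample = io.StringIO(
    "userId,movieId,rating,timestamp\n"
    "1,2,4.0,100\n"
    "2,2,3.0,101\n"
    "1,1,5.0,102\n"
)
averages = average_rating_per_movie(csv.DictReader(sample))
print(averages)
```

Converting movieId to an integer before sorting matters: sorting the ids as
strings would place 10 before 2.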
Task2: (60%)
Students are required to calculate the average rating of each tag. Both the
ratings.csv and tags.csv files are required for this task.

Result format:
1. Students are required to save the result in a CSV file.
2. The CSV file has two columns. The first column is the tag's name and should be
named tag. The second column is the average rating and should be named
rating_avg. The file should be sorted by tag name in descending order.
The following two snapshots are examples of the result for Task 2. The unreadable
characters in the first snapshot are caused by an encoding problem; the snapshot
only shows the format of the result. In the second snapshot, the data is sorted by
the first column in descending order.

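The Task 2 aggregation can be sketched in plain Python as follows. This is an
illustration under an assumption, not the required method: the description does
not say how ratings and tags are matched, so the sketch joins them on movieId
and averages every rating of every movie that carries the tag. The function
name and sample values are made up.

```python
from collections import defaultdict


def average_rating_per_tag(ratings, tags):
    """ratings: (movieId, rating) pairs; tags: (movieId, tag) pairs.

    Returns (tag, average rating) pairs sorted by tag in descending order.
    Assumption: a tag's ratings are all ratings of the movies tagged with it.
    """
    ratings_by_movie = defaultdict(list)
    for movie_id, rating in ratings:
        ratings_by_movie[movie_id].append(rating)

    totals = defaultdict(lambda: [0.0, 0])  # tag -> [sum of ratings, count]
    for movie_id, tag in tags:
        # Inner join: tags on movies without any rating contribute nothing.
        for rating in ratings_by_movie.get(movie_id, []):
            totals[tag][0] += rating
            totals[tag][1] += 1

    ordered = sorted(totals.items(), key=lambda item: item[0], reverse=True)
    return [(tag, s / n) for tag, (s, n) in ordered]


sample_ratings = [(1, 4.0), (1, 2.0), (2, 5.0)]
sample_tags = [(1, "funny"), (2, "funny"), (2, "dark"), (3, "unrated")]
tag_averages = average_rating_per_tag(sample_ratings, sample_tags)
print(tag_averages)
```

With Spark DataFrames, the analogous steps would be a join on movieId followed
by a group-by on tag and a descending sort.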
Hints for Task2:
1. Unicode problem: you may encounter text encoding problems when saving
the result as a CSV file. You should save your file with the 'utf-8' encoding.
2. You can create DataFrame objects and save them as a CSV file.
3. You can learn more about DataFrames at this link:
https://spark.apache.org/docs/1.6.0/sql-programming-guide.html#creating-dataframes

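As an illustration of the first hint, the sketch below writes a two-column CSV
file with the required header using UTF-8 encoding. It uses Python 3's built-in
csv module rather than a Spark DataFrame writer; the file name
result_task2.csv, the helper name, and the sample rows are hypothetical.

```python
import csv
import os
import tempfile


def write_tag_averages(path, rows):
    """Write (tag, rating_avg) rows to a UTF-8 encoded CSV file with a header."""
    # newline="" lets the csv module control line endings itself.
    with open(path, "w", encoding="utf-8", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["tag", "rating_avg"])
        writer.writerows(rows)


# Made-up rows, already sorted by tag in descending order.
out_path = os.path.join(tempfile.mkdtemp(), "result_task2.csv")
write_tag_averages(out_path, [("café", 4.5), ("action", 3.25)])

# Read the file back with the same encoding to confirm the tag survives intact.
with open(out_path, encoding="utf-8", newline="") as handle:
    written = list(csv.reader(handle))
print(written)
```

Opening the file with an explicit encoding avoids depending on the platform
default, which is a common source of unreadable characters in the output.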
What you need to turn in:
1. Source code for the two tasks (you can use either Python or Scala), named
Firstname_Lastname_task1 and Firstname_Lastname_task2, respectively (for
example, Weiwei_Duan_task1.py).

2. Result files of the two tasks for the large and small data sets, named:
   Firstname_Lastname_result_task1_big
   Firstname_Lastname_result_task2_big.csv
   Firstname_Lastname_result_task1_small
   Firstname_Lastname_result_task2_small.csv

3. Readme document: please describe how to run your program in this document.

4. If you use Scala, please submit the jar packages as well and name them
Firstname_Lastname_task1.jar and Firstname_Lastname_task2.jar.
5. Zip the above files and name the archive Firstname_Lastname_HW1.zip.

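The naming rules above can be checked before zipping with a small helper such as
the hypothetical one below. It only builds the expected names from the list
above; the .scala source extension is an assumption, and the Readme is left out
because its file name is not specified.

```python
def expected_submission_files(first, last, language="python"):
    """List the file names the submission should contain (hypothetical helper)."""
    prefix = "{}_{}".format(first, last)
    # ".py" follows the example in the list; ".scala" is an assumed extension.
    ext = ".py" if language == "python" else ".scala"

    files = [prefix + "_task1" + ext, prefix + "_task2" + ext]
    files += [
        prefix + "_result_task1_big",
        prefix + "_result_task2_big.csv",
        prefix + "_result_task1_small",
        prefix + "_result_task2_small.csv",
    ]
    if language == "scala":
        # Scala submissions also include one jar package per task.
        files += [prefix + "_task1.jar", prefix + "_task2.jar"]
    return files


python_files = expected_submission_files("Weiwei", "Duan")
scala_files = expected_submission_files("Weiwei", "Duan", language="scala")
archive_name = "Weiwei_Duan_HW1.zip"
print(python_files)
```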
Grading Criteria:
1. Your code will be run according to your Readme file. If your programs cannot
be run with the commands you provide, your submission will be graded based
on the result files you submit, with a 20% penalty.
2. If the file generated by your program is unsorted, there will be a 20% penalty.
3. If your program generates more than one file, there will be a 20% penalty.
4. If the CSV file generated in Task 2 has more than two columns, there will be a
20% penalty.
5. If the header of the CSV file is missing in Task 2, there will be a 10% penalty.
6. The deadline for Assignment 1 is 02/07 at midnight. There will be a 20% penalty
for late submission.