CSE313: Big Data Analytics – Lab Assignments

Spark

Use the Spark Scala shell (spark-shell) for all questions except question 9. For question 10, use the spark-submit command to execute the Scala class. Provide code comments wherever applicable.

Consider two RDD datasets: RDD1 holds the numbers 1 to 10 and RDD2 holds the numbers 5 to 10. Combine the two datasets, remove duplicates, and find the maximum number.
Hint: Use distinct.
Read a text file into an RDD. Filter and show all lines that do not contain the word “ERROR”. Count the number of lines that do not contain the word “ERROR”. Hint: Use filter. For the count, see the “with Accumulator” example in the Spark class notes.
Use an RDD operation to break a word into letters. Separate the letters with commas in the output. Print only the first 5 characters. Hint: Use flatMap.
The file /tmp/person.txt contains one first name and last name per line, as shown below:

Xi Jinping
Hu Jintao
Xi Zemin
Hu Sangkun

Load the entire file into an RDD. Count how many people share each first name, for example how many Xi and how many Hu there are. Hint: Load first name and last name into key-value pairs, then group by first name.
Take the above input file /tmp/person.txt. Sort the file based on the first name. Print only the first record after sorting.
Take the above input file /tmp/person.txt. Convert all the first names to uppercase and print only the first names (do not print last names).
Take the above input file /tmp/person.txt. Search for the first name Xi and print all people with that first name; the search result should be Xi Jinping and Xi Zemin.
Hint: Use lookup.
Take the above input file /tmp/person.txt and a second input file, /tmp/person1.txt, with the following contents:

Mao Zedong
Xi Zedong

Load the two text files into two RDDs. Show the first names that appear in both files (i.e. the result is Xi).

Load the values 1 to 5 into an RDD. Multiply all the numbers (i.e. 1×2×3×4×5). Persist the result RDD to disk. Hint: Use reduce() to multiply, convert the result into an RDD, then use persist() to store the RDD.
Write the above application (Question 8) as a Scala class file. Read the input from HDFS. Provide the spark-submit command to execute the class on a Spark cluster. The user should provide two HDFS paths as two parameters on the spark-submit command line.
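
The RDD operations named in the hints above (distinct, flatMap, key-value grouping, lookup, reduce, persist) can be sketched as follows. This is a minimal illustration under stated assumptions, not a model answer: it uses small in-memory samples built with sc.parallelize instead of the files from the questions, the sample word "analytics" and all variable names are made up for the example, and it assumes Spark is on the classpath (as it is inside spark-shell).

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

// In spark-shell, getOrCreate() returns the session the shell already started;
// master("local[*]") only matters when no session exists yet (assumption: local run).
val spark = SparkSession.builder().appName("rdd-sketch").master("local[*]").getOrCreate()
val sc = spark.sparkContext

// union keeps duplicates, distinct removes them, max returns the largest element.
val rdd1 = sc.parallelize(1 to 10)
val rdd2 = sc.parallelize(5 to 10)
val merged = rdd1.union(rdd2).distinct()
val maxValue = merged.max() // 10

// flatMap turns each word into its individual letters.
// "analytics" is a made-up sample word.
val letters = sc.parallelize(Seq("analytics")).flatMap(word => word.toList)
val firstFive = letters.take(5).mkString(",")

// In-memory stand-in for /tmp/person.txt; with the real file this would be
// sc.textFile("/tmp/person.txt").
val people = sc.parallelize(Seq("Xi Jinping", "Hu Jintao", "Xi Zemin", "Hu Sangkun"))

// Build (firstName, lastName) pairs so the key-based operations apply.
val byFirstName = people.map { line =>
  val parts = line.split(" ")
  (parts(0), parts(1))
}

// Group by first name and count the members of each group.
val countsByFirstName = byFirstName.groupByKey().mapValues(_.size).collectAsMap()

// lookup returns every value stored under the given key.
val xiLastNames = byFirstName.lookup("Xi")

// reduce is an action and returns a plain Int, so the result is wrapped
// in a new RDD before it can be persisted.
val product = sc.parallelize(1 to 5).reduce(_ * _) // 120
val productRdd = sc.parallelize(Seq(product)).persist(StorageLevel.DISK_ONLY)
productRdd.count() // persist is lazy; an action triggers the actual storage
```

Because reduce returns a value rather than an RDD, the extra parallelize step is what makes persist applicable, and nothing is written until an action such as count runs.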
