
CSE313: Big Data Analytics – Lab Assignments


Spark

Use the Scala shell (spark-shell) for all questions except question 9. For question 10, use the spark-submit command to execute the Scala class. Provide code comments wherever applicable.
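The hints in the questions below name a handful of RDD operations (distinct, filter, flatMap, lookup, reduce, persist). As a rough warm-up, the sketch below tries each of them on toy data in the shell. All values and variable names here are made up for illustration and are not the assignment inputs. The sketch assumes Spark 2.x or later, since it uses `longAccumulator`.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

// In spark-shell this returns the shell's existing context;
// elsewhere it starts a local one (assumption: local mode is acceptable).
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("hint-sketch").setMaster("local[*]"))

// union + distinct + max on two overlapping ranges (toy values)
val merged = sc.parallelize(1 to 4).union(sc.parallelize(3 to 6)).distinct()
val largest = merged.max()                         // 6

// filter, with a LongAccumulator counting the lines that pass the filter
val lines = sc.parallelize(Seq("INFO start", "WARN disk", "INFO done"))
val kept = lines.filter(line => !line.contains("WARN"))
val keptCount = sc.longAccumulator("kept lines")
kept.foreach(_ => keptCount.add(1))                // foreach is an action, so the count is updated here

// flatMap: one word becomes an RDD of its letters
val letters = sc.parallelize(Seq("spark")).flatMap(word => word.toList)
val firstThree = letters.take(3).mkString(",")     // "s,p,a"

// key-value pairs of (first name, last name), then group, look up, and sort by key
val people = sc.parallelize(Seq("Ada Lovelace", "Alan Turing", "Ada Yonath"))
  .map { line => val Array(first, last) = line.split(" "); (first, last) }
val perFirstName = people.groupByKey().mapValues(_.size).collectAsMap()
val adas = people.lookup("Ada")                    // last names stored under the key "Ada"
val earliest = people.sortByKey().first()          // a record with the smallest key

// reduce to a single value, wrap it back into an RDD, and persist it on disk
val product = sc.parallelize(1 to 4).reduce(_ * _) // 24
val productRdd = sc.parallelize(Seq(product)).persist(StorageLevel.DISK_ONLY)
productRdd.count()                                 // an action, so the persisted data is materialized
```

Transformations such as filter and flatMap are lazy, so nothing runs until an action (max, foreach, take, count, and so on) is called. That is why the accumulator is only meaningful after the `foreach`.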

  1. Consider two RDD datasets: RDD1 holds the numbers 1 to 10 and RDD2 holds the numbers 5 to 10. Combine the two datasets, remove the duplicates, and find the maximum number.
    Hint: Use distinct
  2. Read a text file into an RDD. Filter and show all the lines that do not contain the word “ERROR”. Count the number of lines that do not contain the word “ERROR”. Hint: Use filter. For the count, see the “with Accumulator” example in the Spark class notes.
  3. Use an RDD operation to break a word into letters. In the output, separate the letters with a comma as the delimiter. Print only the first 5 characters. Hint: Use flatMap
  4. The file /tmp/person.txt contains first name and last name pairs as shown below:
    Xi Jinping
    Hu Jintao
    Xi Zemin
    Hu Sangkun
    Load the entire file into an RDD. Count how many people share each first name, for example how many Xi and how many Hu there are. Hint: Load the first name and last name as key-value pairs. Group by first name.
  5. Take the above input file /tmp/person.txt. Sort the file based on the first name. Print only the first record after sorting.
  6. Take the above input file /tmp/person.txt. Convert all the first names to uppercase and print only the first names (do not print the last names).
  7. Take the above input file /tmp/person.txt. Search for the first name Xi and print all people with that first name. The search result should be Xi Jinping and Xi Zemin.
    Hint: Use lookup
  8. Take the above input file /tmp/person.txt and a second input file /tmp/person1.txt with the following contents:
    Mao Zedong
    Xi Zedong
    Load the two text files into two RDDs. Show the first name that is common to both files (i.e. the result is Xi).

  9. Load the values 1 to 5 into an RDD. Multiply all the numbers (i.e. 1×2×3×4×5). Persist the result RDD to disk. Hint: Use reduce() to multiply, convert the result into an RDD, then use persist() to store the RDD.
  10. Write the above application (Question 8) as a Scala class file. Read the input from HDFS. Provide the spark-submit command to execute the class on a Spark cluster. The user should provide the two HDFS paths as two parameters on the spark-submit command line.
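
For the spark-submit question, the overall shape of a standalone Scala class that takes two paths on the command line might look like the sketch below. The object name `TwoPathApp`, the jar name, and the paths in the comment are hypothetical, and the body only counts lines as a placeholder rather than implementing the Question 8 logic.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical object name; it must match whatever is passed to --class.
object TwoPathApp {

  // Validate the command line: exactly two input paths are expected.
  def parseArgs(args: Array[String]): Either[String, (String, String)] =
    args match {
      case Array(first, second) => Right((first, second))
      case _ => Left("Usage: TwoPathApp <hdfs-path-1> <hdfs-path-2>")
    }

  def main(args: Array[String]): Unit = {
    parseArgs(args) match {
      case Left(usage) =>
        System.err.println(usage)
        sys.exit(1)
      case Right((path1, path2)) =>
        // The master URL is left to spark-submit (--master), so it is not set here.
        val sc = new SparkContext(new SparkConf().setAppName("TwoPathApp"))
        try {
          val first = sc.textFile(path1)
          val second = sc.textFile(path2)
          // Placeholder logic: report line counts only.
          println(s"$path1: ${first.count()} lines, $path2: ${second.count()} lines")
        } finally {
          sc.stop()
        }
    }
  }
}

// Possible invocation (jar name, master URL, and paths are placeholders):
//   spark-submit --class TwoPathApp --master <master-url> two-path-app.jar \
//     hdfs:///path/to/first.txt hdfs:///path/to/second.txt
```

Keeping the argument check in a small pure function makes it easy to try out in the shell before packaging the class into a jar.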