
Laboratory 6: SparkSQL
Part 1: We will continue working with the Heterogeneity Dataset for Human Activity Recognition (HHAR), which contains readings from the movement sensors of phones and watches. The link to the data is: https://archive.ics.uci.edu/ml/datasets/Heterogeneity+Activity+Recognition
The objective will also be to use the triple user (User), model (Model) and executed movement (gt) as the primary key.
In particular, you have to create an RDD for each file. From each RDD, we will obtain a DataFrame by inferring the schema.
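A minimal sketch of this step, assuming the usual HHAR column layout (Index, Arrival_Time, Creation_Time, x, y, z, User, Model, Device, gt), comma-separated files with a header line, and that spark and sc are the SparkSession and SparkContext already available in the notebook (file names are illustrative):

```python
from pyspark.sql import Row

def load_as_dataframe(path):
    """Create an RDD from a csv file and obtain a DataFrame by inferring the schema."""
    rdd = sc.textFile(path)                            # sc: the notebook's SparkContext
    header = rdd.first()
    rows = (rdd.filter(lambda line: line != header)    # drop the header line
               .map(lambda line: line.split(","))
               .map(lambda f: Row(Index=int(f[0]), Arrival_Time=int(f[1]),
                                  Creation_Time=int(f[2]), x=float(f[3]),
                                  y=float(f[4]), z=float(f[5]), User=f[6],
                                  Model=f[7], Device=f[8], gt=f[9])))
    return spark.createDataFrame(rows)                 # schema inferred from the Row objects

phones_accel_df = load_as_dataframe("Phones_accelerometer.csv")  # assumed file name
```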
From the initial DataFrames, we will obtain one record per user, model and movement class (gt) with the mean, standard deviation, maximum and minimum of the sequence of the executed movement.
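One possible way to obtain these summary records, assuming the sensor axes are named x, y and z and reusing the DataFrame from the previous sketch:

```python
from pyspark.sql import functions as F

def summarise(df):
    """One record per (User, Model, gt) with mean, standard deviation, max and min per axis."""
    aggs = []
    for axis in ("x", "y", "z"):                       # assumed axis column names
        aggs += [F.mean(axis).alias(axis + "_mean"),
                 F.stddev(axis).alias(axis + "_std"),
                 F.max(axis).alias(axis + "_max"),
                 F.min(axis).alias(axis + "_min")]
    return df.groupBy("User", "Model", "gt").agg(*aggs)

phones_accel_summary = summarise(phones_accel_df)
```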
Once this is done, the gyroscope and accelerometer records of the watches on the one hand and of the phones on the other should be combined by means of a join. Finally, a single DataFrame will be created (via union) from the phone and watch DataFrames.
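A sketch of the join and union step, assuming the other summary DataFrames (phones_gyro_summary, watch_accel_summary, watch_gyro_summary) were built the same way; the non-key columns are renamed with a suffix so the accelerometer and gyroscope summaries do not collide after the join, and both halves end up with the same schema so the union is valid:

```python
def with_suffix(df, suffix):
    """Rename every non-key column so the joined result has unique column names."""
    for c in df.columns:
        if c not in ("User", "Model", "gt"):
            df = df.withColumnRenamed(c, c + suffix)
    return df

# Join accelerometer and gyroscope summaries on the (User, Model, gt) key.
phones = with_suffix(phones_accel_summary, "_acc").join(
    with_suffix(phones_gyro_summary, "_gyr"), on=["User", "Model", "gt"])
watches = with_suffix(watch_accel_summary, "_acc").join(
    with_suffix(watch_gyro_summary, "_gyr"), on=["User", "Model", "gt"])

# Both halves now share the same schema, so the union is well defined.
hhar = phones.union(watches)
```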
To manipulate DataFrames, we can use either of the two options seen in class (both are illustrated after this list):
Apply API operations
Execute SQL queries
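As an illustration (not the required solution), the same simple aggregation written with both options, assuming the phones_accel_df DataFrame from the earlier sketch:

```python
# Option 1: DataFrame API operations.
api_result = phones_accel_df.groupBy("gt").count()

# Option 2: the equivalent SQL query over a temporary view.
phones_accel_df.createOrReplaceTempView("phones_accel")
sql_result = spark.sql("""
    SELECT gt, COUNT(*) AS count
    FROM phones_accel
    GROUP BY gt
""")
```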
For this task, a single notebook will be used that will be part of the .zip file corresponding to Laboratory 6. Data files should not be included. The functions must be documented.

Part 2: We will continue working with the Heterogeneity Dataset for Human Activity Recognition (HHAR), which contains readings from the movement sensors of phones and watches. We will use the complete data published on the page of the subject (not the sample data).
1. The first objective of this task is to verify that the parquet format reduces file sizes considerably with respect to text files in csv format. To do this, you must generate a parquet file for each of the csv files and provide a table showing the size of each csv file and of its corresponding parquet file.
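A possible sketch for generating the parquet files and measuring sizes, assuming the files live on the local file system (on HDFS the sizes would be obtained with hdfs dfs -du instead); note that Spark writes parquet output as a directory, so its size is the sum of the part files it contains:

```python
import os

def csv_to_parquet(csv_path, parquet_path):
    """Read a csv file into a DataFrame and write it back in parquet format."""
    df = spark.read.csv(csv_path, header=True, inferSchema=True)
    df.write.mode("overwrite").parquet(parquet_path)

def size_on_disk(path):
    """Size in bytes of a file, or of every file under a directory (parquet output is a directory)."""
    if os.path.isfile(path):
        return os.path.getsize(path)
    return sum(os.path.getsize(os.path.join(root, name))
               for root, _, names in os.walk(path) for name in names)

csv_to_parquet("Phones_accelerometer.csv", "Phones_accelerometer.parquet")
print(size_on_disk("Phones_accelerometer.csv"),
      size_on_disk("Phones_accelerometer.parquet"))
```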
2. The second objective of this task is to measure the execution time of task 1 when it is carried out in different ways. We will consider the following cases:
Case 1. RDDs are created for each of the csv files (this corresponds to the notebook made in Lab 5).
Case 2. DataFrames are created from the RDDs (this corresponds to Task 1 of Laboratory 6).
Case 3. DataFrames are created from the parquet files generated in section 1.
Case 4. DataFrames are created from the original csv files.
To do this, you can use the spark.read.csv function: http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame
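A minimal timing sketch for Cases 3 and 4, reusing the hypothetical summarise function from the earlier sketch and forcing an action (count) so that Spark actually executes the job:

```python
import time

def timed(label, action):
    """Run the given zero-argument function and print the elapsed wall-clock time."""
    start = time.time()
    action()
    print(f"{label}: {time.time() - start:.2f} s")

# Case 3: DataFrames created from the parquet files generated in section 1.
timed("Case 3 (parquet)",
      lambda: summarise(spark.read.parquet("Phones_accelerometer.parquet")).count())

# Case 4: DataFrames created directly from the original csv files.
timed("Case 4 (csv)",
      lambda: summarise(spark.read.csv("Phones_accelerometer.csv",
                                       header=True, inferSchema=True)).count())
```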
For this task, a single notebook will be used that will be part of the .zip file corresponding to Laboratory 6. Data files should not be included. The functions must be documented.
The two tables corresponding to sections 1 and 2 will also be delivered.
We are going to work with a new set of data from BookCrossing (http://www.bookcrossing.com), a community of book lovers who exchange books around the world and share their experiences.

The first step is to download the CSV Dump from the page http://www2.informatik.uni-freiburg.de/~cziegler/BX/
1. The first objective of this task is to use the functions of the API to solve the following queries (a minimal sketch follows the list):
a. List of users together with the number of books they have rated
b. Maximum rating received by each publisher
c. Name of the author who has received the most ratings
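One possible sketch with the DataFrame API, assuming the usual file names and semicolon-separated layout of the BookCrossing dump (BX-Books.csv and BX-Book-Ratings.csv, with columns such as User-ID, ISBN, Book-Rating, Book-Author and Publisher):

```python
from pyspark.sql import functions as F

# Assumed file names and separator of the BookCrossing dump.
books = spark.read.csv("BX-Books.csv", sep=";", header=True, inferSchema=True)
ratings = spark.read.csv("BX-Book-Ratings.csv", sep=";", header=True, inferSchema=True)
rated_books = ratings.join(books, on="ISBN")

# a. Each user together with the number of books they have rated.
ratings_per_user = ratings.groupBy("User-ID").agg(F.count("ISBN").alias("books_rated"))

# b. Maximum rating received by each publisher.
max_per_publisher = rated_books.groupBy("Publisher").agg(F.max("Book-Rating").alias("max_rating"))

# c. Author who has received the most ratings.
top_author = (rated_books.groupBy("Book-Author")
              .agg(F.count("*").alias("num_ratings"))
              .orderBy(F.desc("num_ratings"))
              .limit(1))
```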
2. The second objective of this task is to use the Spark SQL Window Functions to solve the following queries (see the sketch after this list):
a. What is the title of the book with the highest number of ratings for each publisher?
b. What is the difference between the number of ratings in each book and the number of ratings in the book with the highest number of ratings in the same publisher?
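A sketch of both window-function queries, reusing the hypothetical rated_books DataFrame (ratings joined with the book catalogue) from the previous sketch:

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Number of ratings per (Publisher, Book-Title).
counts = rated_books.groupBy("Publisher", "Book-Title").agg(F.count("*").alias("num_ratings"))

# a. Title with the highest number of ratings inside each publisher.
w_rank = Window.partitionBy("Publisher").orderBy(F.desc("num_ratings"))
top_title_per_publisher = (counts.withColumn("rank", F.row_number().over(w_rank))
                                 .filter(F.col("rank") == 1))

# b. Difference between each book's number of ratings and the publisher's maximum.
w_all = Window.partitionBy("Publisher")
diff_to_max = counts.withColumn(
    "diff_to_max", F.max("num_ratings").over(w_all) - F.col("num_ratings"))
```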
For this task, a single notebook will be used that will be part of the .zip file corresponding to Laboratory 6. Data files should not be included. The functions must be documented.