Introduction to Apache Spark using PySpark¶
Resilient Distributed Datasets¶
Spark Initialization: Spark Context¶
Spark applications run as independent sets of processes, coordinated by a SparkContext in a driver program.
It may be created automatically (for instance, if you launch the pyspark shell, a SparkContext called sc is created for you).
But we haven’t set one up here, so we need to define it:
In [1]:
from pyspark import SparkContext
• the master (first argument) can be local[*], spark://, yarn, etc. What is available for you depends on how Spark has been deployed on the machine you use.
• the second argument is the application name; it is a human-readable string you choose.
Because we do not specify a number of worker threads for local, only one is used. To use a maximum of 2 threads in parallel:
In [2]:
%time
#sc.stop()
sc = SparkContext('local[2]', 'Spark 101')
CPU times: user 2 µs, sys: 0 ns, total: 2 µs
Wall time: 4.77 µs
If you wish to use all the available resources, you can simply use '*', i.e.
In [3]:
%time
sc.stop()
sc = SparkContext('local[*]', 'Spark 101')
CPU times: user 3 µs, sys: 0 ns, total: 3 µs
Wall time: 10 µs
Please note that within one session, you cannot define several Spark contexts! So if you have tried the three previous SparkContext examples, don’t be surprised to get an error!
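If you want to guard against this error, a defensive pattern (a minimal sketch using the standard getOrCreate class method) is to reuse the running context when one already exists:

sc = SparkContext.getOrCreate()  # returns the existing context if there is one, otherwise creates a new one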
Let’s convert temperatures from Celsius to Kelvin.
You will recognize the map function (please note it is not the pure Python map function but the PySpark map function).
Here, map acts as the transformation while collect is the action: it pulls all elements of the RDD back to the driver.
In [4]:
temp_c = [10, 3, -5, 25, 1, 9, 29, -10, 5]
rdd_temp_c = sc.parallelize(temp_c)
rdd_temp_K = rdd_temp_c.map(lambda x: x + 273.15).collect()
print(rdd_temp_K)
[283.15, 276.15, 268.15, 298.15, 274.15, 282.15, 302.15, 263.15, 278.15]
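As a further illustration of the transformation/action pattern (a hypothetical extra step reusing rdd_temp_c from above), filter is another transformation; here we keep only the temperatures above freezing before collecting:

above_freezing = rdd_temp_c.filter(lambda x: x > 0).collect()  # filter is lazy; collect triggers the computation
print(above_freezing)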
Now let’s take another example, this time using reduce as the action.
In [5]:
# we define a list of integers
numbers = [1, 4, 6, 2, 9, 10]
rdd_numbers = sc.parallelize(numbers)
# Use reduce to combine numbers
rdd_reduce = rdd_numbers.reduce(lambda x, y: "(" + str(x) + ", " + str(y) + "]")
#rdd_reduce = rdd_numbers.reduce(lambda x, y: "(" + str(x) + str(y))
print(rdd_reduce)
(((1, (4, 6]], 2], (9, 10]]
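The nesting in this output comes from reduce combining elements pairwise, first within each partition and then across partitions; because this string-building lambda is not associative, the exact result depends on how the list was partitioned. To inspect the partitioning yourself (a quick check using the standard glom method):

print(rdd_numbers.glom().collect())  # glom lists the elements held by each partition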
In [6]:
data2 = [10,11,12,13,14,15,16]
data2distr = sc.parallelize(data2)
In [7]:
data2distr.reduce(lambda a, b: a + b)
Out[7]:
91
In [8]:
sum(data2)
Out[8]:
91
In [9]:
from functools import reduce
reduce(lambda a, b: a + b, data2)
Out[9]:
91
Data Processing using PySpark¶
DataFrame operations¶
In [10]:
from pyspark.sql import SparkSession
session = SparkSession.builder.appName('data_processing').getOrCreate()
# may take several seconds to run
Download the famous iris dataset using the wget command.
In [11]:
!wget https://raw.githubusercontent.com/uiuc-cse/data-fa14/gh-pages/data/iris.csv
/bin/sh: wget: command not found
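If wget is not installed on your machine (as the output above shows), a pure-Python alternative using the standard library downloads the same file:

import urllib.request
url = 'https://raw.githubusercontent.com/uiuc-cse/data-fa14/gh-pages/data/iris.csv'
urllib.request.urlretrieve(url, 'iris.csv')  # saves iris.csv in the current working directory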
Read in the .csv file that we have downloaded in the previous step. We will create a DataFrame named df.
In [12]:
df = session.read.csv('iris.csv', inferSchema=True, header=True)
Next, we check what has been read in from the csv file.
In [13]:
df.show()
+------------+-----------+------------+-----------+-------+
|sepal_length|sepal_width|petal_length|petal_width|species|
+------------+-----------+------------+-----------+-------+
| 5.1| 3.5| 1.4| 0.2| setosa|
| 4.9| 3.0| 1.4| 0.2| setosa|
| 4.7| 3.2| 1.3| 0.2| setosa|
| 4.6| 3.1| 1.5| 0.2| setosa|
| 5.0| 3.6| 1.4| 0.2| setosa|
| 5.4| 3.9| 1.7| 0.4| setosa|
| 4.6| 3.4| 1.4| 0.3| setosa|
| 5.0| 3.4| 1.5| 0.2| setosa|
| 4.4| 2.9| 1.4| 0.2| setosa|
| 4.9| 3.1| 1.5| 0.1| setosa|
| 5.4| 3.7| 1.5| 0.2| setosa|
| 4.8| 3.4| 1.6| 0.2| setosa|
| 4.8| 3.0| 1.4| 0.1| setosa|
| 4.3| 3.0| 1.1| 0.1| setosa|
| 5.8| 4.0| 1.2| 0.2| setosa|
| 5.7| 4.4| 1.5| 0.4| setosa|
| 5.4| 3.9| 1.3| 0.4| setosa|
| 5.1| 3.5| 1.4| 0.3| setosa|
| 5.7| 3.8| 1.7| 0.3| setosa|
| 5.1| 3.8| 1.5| 0.3| setosa|
+------------+-----------+------------+-----------+-------+
only showing top 20 rows
Let’s restrict the display of rows to just the first 3 rows.
In [14]:
df.show(3)
+------------+-----------+------------+-----------+-------+
|sepal_length|sepal_width|petal_length|petal_width|species|
+------------+-----------+------------+-----------+-------+
| 5.1| 3.5| 1.4| 0.2| setosa|
| 4.9| 3.0| 1.4| 0.2| setosa|
| 4.7| 3.2| 1.3| 0.2| setosa|
+------------+-----------+------------+-----------+-------+
only showing top 3 rows
Sometimes, we just want to know the column names instead of the contents (or rows).
In [15]:
df.columns
Out[15]:
['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']
Next, we use the count() method to find out how many rows there are in the DataFrame named df.
In [16]:
df.count()
Out[16]:
150
Next, we find out the schema behind this DataFrame. It tells us each column's name and data type, as well as whether cells in that column can be empty (nullable). If nullable = false, then no empty cells are allowed in that column.
In [17]:
df.printSchema()
root
 |-- sepal_length: double (nullable = true)
 |-- sepal_width: double (nullable = true)
 |-- petal_length: double (nullable = true)
 |-- petal_width: double (nullable = true)
 |-- species: string (nullable = true)
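Instead of relying on inferSchema, you can also declare the schema explicitly, which avoids an extra pass over the file and lets you control nullability. A minimal sketch (the column names match the csv header above):

from pyspark.sql.types import StructType, StructField, DoubleType, StringType
iris_schema = StructType([
    StructField('sepal_length', DoubleType(), True),
    StructField('sepal_width', DoubleType(), True),
    StructField('petal_length', DoubleType(), True),
    StructField('petal_width', DoubleType(), True),
    StructField('species', StringType(), True),
])
df_typed = session.read.csv('iris.csv', schema=iris_schema, header=True)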
Next, we can restrict which column(s) get displayed. In this case, only the sepal_width and petal_width columns and only the first 3 rows are displayed.
In [18]:
df.select('sepal_width','petal_width').show(3)
+-----------+-----------+
|sepal_width|petal_width|
+-----------+-----------+
| 3.5| 0.2|
| 3.0| 0.2|
| 3.2| 0.2|
+-----------+-----------+
only showing top 3 rows
Generate several descriptive statistics (count, mean, standard deviation, minimum, maximum) for each column.
In [19]:
df.describe().show()
+-------+------------------+-------------------+------------------+------------------+---------+
|summary| sepal_length| sepal_width| petal_length| petal_width| species|
+-------+------------------+-------------------+------------------+------------------+---------+
| count| 150| 150| 150| 150| 150|
| mean| 5.843333333333335| 3.0540000000000007|3.7586666666666693|1.1986666666666672| null|
| stddev|0.8280661279778637|0.43359431136217375| 1.764420419952262|0.7631607417008414| null|
| min| 4.3| 2.0| 1.0| 0.1| setosa|
| max| 7.9| 4.4| 6.9| 2.5|virginica|
+-------+------------------+-------------------+------------------+------------------+---------+
In [20]:
df.describe(‘petal_length’).show()
+-------+------------------+
|summary| petal_length|
+-------+------------------+
| count| 150|
| mean|3.7586666666666693|
| stddev| 1.764420419952262|
| min| 1.0|
| max| 6.9|
+-------+------------------+
Add a new column
In [21]:
df.withColumn("new_col_name", (df["sepal_width"]*10)).show(6)
+------------+-----------+------------+-----------+-------+------------+
|sepal_length|sepal_width|petal_length|petal_width|species|new_col_name|
+------------+-----------+------------+-----------+-------+------------+
| 5.1| 3.5| 1.4| 0.2| setosa| 35.0|
| 4.9| 3.0| 1.4| 0.2| setosa| 30.0|
| 4.7| 3.2| 1.3| 0.2| setosa| 32.0|
| 4.6| 3.1| 1.5| 0.2| setosa| 31.0|
| 5.0| 3.6| 1.4| 0.2| setosa| 36.0|
| 5.4| 3.9| 1.7| 0.4| setosa| 39.0|
+------------+-----------+------------+-----------+-------+------------+
only showing top 6 rows
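withColumn accepts any column expression, not just simple arithmetic. For instance, a hypothetical extra column rounding the product of two columns (using the round function from pyspark.sql.functions):

from pyspark.sql import functions as F
df.withColumn('sepal_area_approx', F.round(df['sepal_length'] * df['sepal_width'], 1)).show(3)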
Filter the rows to keep only the species setosa, then show the first 9 rows.
In [22]:
df.filter(df["species"]=="setosa").show(9)
+------------+-----------+------------+-----------+-------+
|sepal_length|sepal_width|petal_length|petal_width|species|
+------------+-----------+------------+-----------+-------+
| 5.1| 3.5| 1.4| 0.2| setosa|
| 4.9| 3.0| 1.4| 0.2| setosa|
| 4.7| 3.2| 1.3| 0.2| setosa|
| 4.6| 3.1| 1.5| 0.2| setosa|
| 5.0| 3.6| 1.4| 0.2| setosa|
| 5.4| 3.9| 1.7| 0.4| setosa|
| 4.6| 3.4| 1.4| 0.3| setosa|
| 5.0| 3.4| 1.5| 0.2| setosa|
| 4.4| 2.9| 1.4| 0.2| setosa|
+------------+-----------+------------+-----------+-------+
only showing top 9 rows
Only show those rows that belong to the species setosa, display the first 9 rows, and only show the 3 columns named sepal_width, petal_width, and species.
In [23]:
df.filter(df["species"]=="setosa").select("sepal_width","petal_width","species").show(9)
+-----------+-----------+-------+
|sepal_width|petal_width|species|
+-----------+-----------+-------+
| 3.5| 0.2| setosa|
| 3.0| 0.2| setosa|
| 3.2| 0.2| setosa|
| 3.1| 0.2| setosa|
| 3.6| 0.2| setosa|
| 3.9| 0.4| setosa|
| 3.4| 0.3| setosa|
| 3.4| 0.2| setosa|
| 2.9| 0.2| setosa|
+-----------+-----------+-------+
only showing top 9 rows
Next, we add a further criterion. We now also want the sepal width to be strictly less than 3.0.
In [24]:
df.filter((df[“species”]==”setosa”) & (df[“sepal_width”]<3.0)).show(9)
+------------+-----------+------------+-----------+-------+
|sepal_length|sepal_width|petal_length|petal_width|species|
+------------+-----------+------------+-----------+-------+
| 4.4| 2.9| 1.4| 0.2| setosa|
| 4.5| 2.3| 1.3| 0.3| setosa|
+------------+-----------+------------+-----------+-------+
In the previous example, we applied the 'and' condition. In the following example, we change it from 'and' to 'or' using the pipe symbol |.
In [25]:
df.filter((df["species"]=="setosa") | (df["sepal_width"]<3.0)).show(9)
+------------+-----------+------------+-----------+-------+
|sepal_length|sepal_width|petal_length|petal_width|species|
+------------+-----------+------------+-----------+-------+
| 5.1| 3.5| 1.4| 0.2| setosa|
| 4.9| 3.0| 1.4| 0.2| setosa|
| 4.7| 3.2| 1.3| 0.2| setosa|
| 4.6| 3.1| 1.5| 0.2| setosa|
| 5.0| 3.6| 1.4| 0.2| setosa|
| 5.4| 3.9| 1.7| 0.4| setosa|
| 4.6| 3.4| 1.4| 0.3| setosa|
| 5.0| 3.4| 1.5| 0.2| setosa|
| 4.4| 2.9| 1.4| 0.2| setosa|
+------------+-----------+------------+-----------+-------+
only showing top 9 rows
Next, we want to show the unique values in the column named species.
In [26]:
df.select("species").distinct().show()
+----------+
| species|
+----------+
| virginica|
|versicolor|
| setosa|
+----------+
Next, we want to show the number of unique values in the column named species. We see that we have 3 unique species in our data.
In [27]:
df.select("species").distinct().count()
Out[27]:
3
Group by¶
Next, we find out how many times each species appears in the data. We achieve this by consolidating the occurrences using the groupBy method.
In [28]:
df.groupBy('species').count().show()
+----------+-----+
| species|count|
+----------+-----+
| virginica| 50|
|versicolor| 50|
| setosa| 50|
+----------+-----+
Next, we count the number of times each unique value of petal length occurs in the data, again consolidating the occurrences using the groupBy method.
In [29]:
df.groupBy('petal_length').count().show()
+------------+-----+
|petal_length|count|
+------------+-----+
| 5.4| 2|
| 3.5| 2|
| 6.1| 3|
| 6.6| 1|
| 3.7| 1|
| 4.5| 8|
| 5.7| 3|
| 1.4| 12|
| 1.7| 4|
| 6.7| 2|
| 4.9| 5|
| 1.0| 1|
| 4.1| 3|
| 4.0| 5|
| 1.9| 2|
| 3.9| 3|
| 3.8| 1|
| 5.1| 8|
| 4.2| 4|
| 1.3| 7|
+------------+-----+
only showing top 20 rows
Next, sort the results using the orderBy method. The default order is ascending (smallest value on top, largest value at the bottom).
In [30]:
df.groupBy('petal_length').count().orderBy("count").show()
+------------+-----+
|petal_length|count|
+------------+-----+
| 1.1| 1|
| 1.0| 1|
| 3.6| 1|
| 6.3| 1|
| 6.4| 1|
| 6.6| 1|
| 3.0| 1|
| 3.8| 1|
| 6.9| 1|
| 3.7| 1|
| 6.0| 2|
| 5.2| 2|
| 1.2| 2|
| 4.3| 2|
| 6.7| 2|
| 5.4| 2|
| 1.9| 2|
| 3.5| 2|
| 5.9| 2|
| 5.3| 2|
+------------+-----+
only showing top 20 rows
Next, we change the order of sorting by using ascending = False.
In [31]:
df.groupBy('petal_length').count().orderBy("count",ascending=False).show()
+------------+-----+
|petal_length|count|
+------------+-----+
| 1.5| 14|
| 1.4| 12|
| 4.5| 8|
| 5.1| 8|
| 1.6| 7|
| 1.3| 7|
| 5.6| 6|
| 4.9| 5|
| 4.0| 5|
| 4.7| 5|
| 4.2| 4|
| 5.0| 4|
| 4.8| 4|
| 4.4| 4|
| 1.7| 4|
| 4.1| 3|
| 4.6| 3|
| 6.1| 3|
| 5.8| 3|
| 5.7| 3|
+------------+-----+
only showing top 20 rows
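An equivalent way to sort in descending order is the desc helper from pyspark.sql.functions (an alternative sketch):

from pyspark.sql.functions import desc
df.groupBy('petal_length').count().orderBy(desc('count')).show(5)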
Next, we calculate the average sepal length, sepal width, petal length, and petal width for each unique value of petal length.
In [32]:
df.groupBy('petal_length').mean().show()
+------------+------------------+------------------+------------------+-------------------+
|petal_length| avg(sepal_length)| avg(sepal_width)| avg(petal_length)| avg(petal_width)|
+------------+------------------+------------------+------------------+-------------------+
| 5.4| 6.550000000000001| 3.25| 5.4| 2.2|
| 3.5| 5.35| 2.3| 3.5| 1.0|
| 6.1| 7.433333333333334|3.1333333333333333| 6.099999999999999| 2.2333333333333334|
| 6.6| 7.6| 3.0| 6.6| 2.1|
| 3.7| 5.5| 2.4| 3.7| 1.0|
| 4.5| 5.775| 2.875| 4.5| 1.5125|
| 5.7| 6.766666666666667| 3.266666666666667| 5.7| 2.3000000000000003|
| 1.4| 4.916666666666667|3.3333333333333335|1.4000000000000001| 0.2166666666666667|
| 1.7| 5.4|3.5999999999999996| 1.7| 0.35|
| 6.7| 7.7| 3.3| 6.7| 2.1|
| 4.9| 6.239999999999999|2.8199999999999994| 4.9| 1.72|
| 1.0| 4.6| 3.6| 1.0| 0.2|
| 4.1| 5.699999999999999|2.8333333333333335| 4.1| 1.2|
| 4.0| 5.78| 2.48| 4.0| 1.22|
| 1.9| 4.949999999999999|3.5999999999999996| 1.9|0.30000000000000004|
| 3.9| 5.533333333333334|2.6333333333333333| 3.9| 1.2333333333333334|
| 3.8| 5.5| 2.4| 3.8| 1.1|
| 5.1| 6.125|2.8750000000000004|5.1000000000000005| 1.925|
| 4.2| 5.725| 2.9| 4.2| 1.325|
| 1.3|4.8428571428571425|3.2285714285714286| 1.3| 0.2571428571428572|
+------------+------------------+------------------+------------------+-------------------+
only showing top 20 rows
Next, we calculate the total sepal length, sepal width, petal length, and petal width for each unique value of petal length.
In [33]:
df.groupBy('petal_length').sum().show()
+------------+------------------+------------------+------------------+------------------+
|petal_length| sum(sepal_length)| sum(sepal_width)| sum(petal_length)| sum(petal_width)|
+------------+------------------+------------------+------------------+------------------+
| 5.4|13.100000000000001| 6.5| 10.8| 4.4|
| 3.5| 10.7| 4.6| 7.0| 2.0|
| 6.1| 22.3| 9.4|18.299999999999997| 6.7|
| 6.6| 7.6| 3.0| 6.6| 2.1|
| 3.7| 5.5| 2.4| 3.7| 1.0|
| 4.5| 46.2| 23.0| 36.0| 12.1|
| 5.7| 20.3| 9.8| 17.1| 6.9|
| 1.4| 59.0| 40.0| 16.8|2.6000000000000005|
| 1.7| 21.6|14.399999999999999| 6.8| 1.4|
| 6.7| 15.4| 6.6| 13.4| 4.2|
| 4.9|31.199999999999996|14.099999999999998| 24.5| 8.6|
| 1.0| 4.6| 3.6| 1.0| 0.2|
| 4.1|17.099999999999998| 8.5|12.299999999999999|3.5999999999999996|
| 4.0|28.900000000000002| 12.4| 20.0| 6.1|
| 1.9| 9.899999999999999| 7.199999999999999| 3.8|0.6000000000000001|
| 3.9| 16.6| 7.9| 11.7| 3.7|
| 3.8| 5.5| 2.4| 3.8| 1.1|
| 5.1| 49.0|23.000000000000004|40.800000000000004| 15.4|
| 4.2| 22.9| 11.6| 16.8| 5.3|
| 1.3| 33.9| 22.6| 9.1| 1.8|
+------------+------------------+------------------+------------------+------------------+
only showing top 20 rows
Next, we calculate the maximum sepal length, sepal width, petal length, and petal width for each unique value of petal length.
In [34]:
df.groupBy('petal_length').max().show()
+------------+-----------------+----------------+-----------------+----------------+
|petal_length|max(sepal_length)|max(sepal_width)|max(petal_length)|max(petal_width)|
+------------+-----------------+----------------+-----------------+----------------+
| 5.4| 6.9| 3.4| 5.4| 2.3|
| 3.5| 5.7| 2.6| 3.5| 1.0|
| 6.1| 7.7| 3.6| 6.1| 2.5|
| 6.6| 7.6| 3.0| 6.6| 2.1|
| 3.7| 5.5| 2.4| 3.7| 1.0|
| 4.5| 6.4| 3.4| 4.5| 1.7|
| 5.7| 6.9| 3.3| 5.7| 2.5|
| 1.4| 5.5| 4.2| 1.4| 0.3|
| 1.7| 5.7| 3.9| 1.7| 0.5|
| 6.7| 7.7| 3.8| 6.7| 2.2|
| 4.9| 6.9| 3.1| 4.9| 2.0|
| 1.0| 4.6| 3.6| 1.0| 0.2|
| 4.1| 5.8| 3.0| 4.1| 1.3|
| 4.0| 6.1| 2.8| 4.0| 1.3|
| 1.9| 5.1| 3.8| 1.9| 0.4|
| 3.9| 5.8| 2.7| 3.9| 1.4|
| 3.8| 5.5| 2.4| 3.8| 1.1|
| 5.1| 6.9| 3.2| 5.1| 2.4|
| 4.2| 5.9| 3.0| 4.2| 1.5|
| 1.3| 5.5| 3.9| 1.3| 0.4|
+------------+-----------------+----------------+-----------------+----------------+
only showing top 20 rows
Next, we calculate the minimum sepal length, sepal width, petal length, and petal width for each unique value of petal length.
In [35]:
df.groupBy('petal_length').min().show()
+------------+-----------------+----------------+-----------------+----------------+
|petal_length|min(sepal_length)|min(sepal_width)|min(petal_length)|min(petal_width)|
+------------+-----------------+----------------+-----------------+----------------+
| 5.4| 6.2| 3.1| 5.4| 2.1|
| 3.5| 5.0| 2.0| 3.5| 1.0|
| 6.1| 7.2| 2.8| 6.1| 1.9|
| 6.6| 7.6| 3.0| 6.6| 2.1|
| 3.7| 5.5| 2.4| 3.7| 1.0|
| 4.5| 4.9| 2.2| 4.5| 1.3|
| 5.7| 6.7| 3.2| 5.7| 2.1|
| 1.4| 4.4| 2.9| 1.4| 0.1|
| 1.7| 5.1| 3.3| 1.7| 0.2|
| 6.7| 7.7| 2.8| 6.7| 2.0|
| 4.9| 5.6| 2.5| 4.9| 1.5|
| 1.0| 4.6| 3.6| 1.0| 0.2|
| 4.1| 5.6| 2.7| 4.1| 1.0|
| 4.0| 5.5| 2.2| 4.0| 1.0|
| 1.9| 4.8| 3.4| 1.9| 0.2|
| 3.9| 5.2| 2.5| 3.9| 1.1|
| 3.8| 5.5| 2.4| 3.8| 1.1|
| 5.1| 5.8| 2.7| 5.1| 1.5|
| 4.2| 5.6| 2.7| 4.2| 1.2|
| 1.3| 4.4| 2.3| 1.3| 0.2|
+------------+-----------------+----------------+-----------------+----------------+
only showing top 20 rows
Aggregation function¶
Next, we will use an aggregation function that does summation. More specifically, we will sum the sepal length for each species.
In [36]:
df.groupBy('species').agg({'sepal_length':'sum'}).show()
+----------+------------------+
| species| sum(sepal_length)|
+----------+------------------+
| virginica| 329.3999999999999|
|versicolor| 296.8|
| setosa|250.29999999999998|
+----------+------------------+
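The dictionary form of agg supports only one aggregation per column. To compute several named aggregations at once, you can use the column functions from pyspark.sql.functions (a sketch; the alias names are our own choice):

from pyspark.sql import functions as F
df.groupBy('species').agg(
    F.sum('sepal_length').alias('total_sepal_length'),
    F.avg('sepal_length').alias('mean_sepal_length'),
    F.max('petal_width').alias('max_petal_width')
).show()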
In [37]:
!pwd
/Users/macuser/Desktop/NUS/eca5372_2020
Exporting and saving output¶
Next, we will save our output into two different file formats: the comma-separated values (csv) format and the parquet format.
In [38]:
import os
import shutil
def save_output(folder_name, file_type):
    if not os.path.exists(folder_name):  # if the folder does not exist
        print("Good, the", folder_name, "folder does not exist.")
        df.coalesce(3)\
            .write.format(file_type)\
            .option("header", "True")\
            .save(folder_name)  # save the output files into that particular folder
        print("Successfully saved", file_type, "file into", folder_name, "folder.")
    else:
        print("Detected that", folder_name, "folder exists, proceeding to delete folder and its contents.")
        shutil.rmtree(folder_name)  # if the folder exists, remove it and all its contents
        print("Since we cannot write into the same folder, we removed", folder_name, "folder and its contents.")
        save_output(folder_name, file_type)  # try again by calling itself
save_output("saved_csv","csv")
Good, the saved_csv folder does not exist.
Successfully saved csv file into saved_csv folder.
Important note: Spark cannot write into a folder that already exists, so calling .save() twice with the same folder name raises an error. The save_output function above works around this by deleting the existing folder and its contents before saving again.
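An alternative to deleting the folder is to let Spark overwrite it via the DataFrameWriter mode option (a minimal sketch; the folder name saved_csv_overwrite is our own choice):

df.coalesce(3)\
    .write.format('csv')\
    .option('header', 'True')\
    .mode('overwrite')\
    .save('saved_csv_overwrite')  # replaces the folder contents if it already exists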
In [39]:
!ls -trl saved_csv
total 8
-rw-r--r-- 1 macuser staff 3858 Feb 17 16:28 part-00000-c07d7410-4d27-4e9c-a647-cb2efd9dd59d-c000.csv
-rw-r--r-- 1 macuser staff 0 Feb 17 16:28 _SUCCESS
In [40]:
save_output("saved_parquet","parquet")
Good, the saved_parquet folder does not exist.
Successfully saved parquet file into saved_parquet folder.
In [41]:
!ls -trl saved_parquet
total 8
-rw-r--r-- 1 macuser staff 2608 Feb 17 16:28 part-00000-39177311-bd1a-497b-a0cd-5ff8dfdd3d54-c000.snappy.parquet
-rw-r--r-- 1 macuser staff 0 Feb 17 16:28 _SUCCESS
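To verify the saved output, you can read the parquet folder back into a DataFrame (a quick sanity check; Spark reads all part files in the folder):

df_parquet = session.read.parquet('saved_parquet')
df_parquet.count()  # should match the 150 rows we wrote out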
Next, we do a simple count to check the number of rows currently in the DataFrame.
In [42]:
df.count()
Out[42]:
150
Next, we use the dropDuplicates method to remove any duplicated rows. Note that dropDuplicates returns a new DataFrame rather than modifying df in place, so counting df afterwards still shows the original number of rows.
In [43]:
df.dropDuplicates()
Out[43]:
DataFrame[sepal_length: double, sepal_width: double, petal_length: double, petal_width: double, species: string]
In [44]:
df.count()
Out[44]:
150
Next, we use the dropna method to remove any rows that contain empty cells (cells with no value). Like dropDuplicates, it returns a new DataFrame and leaves df unchanged.
In [45]:
df.dropna()
Out[45]:
DataFrame[sepal_length: double, sepal_width: double, petal_length: double, petal_width: double, species: string]
In [46]:
df.count()
Out[46]:
150
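Since both methods return new DataFrames rather than modifying df, the idiomatic check (a minimal sketch) is to count the returned DataFrame directly:

print(df.dropDuplicates().count())  # row count after removing duplicate rows
print(df.dropna().count())  # row count after removing rows with missing values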
Before removing one of the columns, we show all the columns to get a good idea of the contents of the DataFrame.
In [47]:
df.show(5)
+------------+-----------+------------+-----------+-------+
|sepal_length|sepal_width|petal_length|petal_width|species|
+------------+-----------+------------+-----------+-------+
| 5.1| 3.5| 1.4| 0.2| setosa|
| 4.9| 3.0| 1.4| 0.2| setosa|
| 4.7| 3.2| 1.3| 0.2| setosa|
| 4.6| 3.1| 1.5| 0.2| setosa|
| 5.0| 3.6| 1.4| 0.2| setosa|
+------------+-----------+------------+-----------+-------+
only showing top 5 rows
Next, we remove the sepal_length column using the drop() method.
In [48]:
new_df = df.drop("sepal_length")
In [49]:
new_df.show(5)
+-----------+------------+-----------+-------+
|sepal_width|petal_length|petal_width|species|
+-----------+------------+-----------+-------+
| 3.5| 1.4| 0.2| setosa|
| 3.0| 1.4| 0.2| setosa|
| 3.2| 1.3| 0.2| setosa|
| 3.1| 1.5| 0.2| setosa|
| 3.6| 1.4| 0.2| setosa|
+-----------+------------+-----------+-------+
only showing top 5 rows
Next, we save the output into a new csv folder. Note that save_output, as written above, always saves the global df; to save new_df instead, you would need to pass the DataFrame itself as a parameter to the function.
In [50]:
save_output("saved_new_df_csv","csv")
Good, the saved_new_df_csv folder does not exist.
Successfully saved csv file into saved_new_df_csv folder.