• A3 Spec Updated
• Individual Project – Presentation (~10 min) + Q&A (3-5 min)
• GCP Coupon Survey
• Check your balance
• Fill out the survey (see Announcement) if you need one
Cloud Computing
Picture: https://www.mygreatlearning.com/blog/apache-spark/
Picture: https://databricks.com/glossary/what-is-spark-streaming
• History & Architecture of Spark SQL
• Operations on DataFrame
• Connecting Spark SQL with external data sources
• Machine learning basics
• Introduction to MLlib: Feature extraction, transformation, selection, classification, and regression
• Pipeline and components
• Machine Learning Examples:
– Logistic regression
– Decision tree
– Clustering
– Collaborative filtering
– Association Rule Mining
• Apache Hive is a data warehouse software project built on top of Apache Hadoop
– for data query, data summarization, and data analysis
• Hive provides an SQL-like interface to query data: HiveQL (HQL).
– (SQL-like query ➔ MapReduce jobs on Hadoop)
• HiveQL queries can be integrated into the underlying Java code, without the need to implement queries in the low-level Java API.
• Hive aids portability of SQL-based applications to Hadoop.
Spark SQL – History
Spark SQL – History
• Shortcoming of Hive: less than ideal performance, since every HiveQL query is compiled into MapReduce jobs, which incur high latency.
The Spark SQL architecture contains three layers:
• Data Sources − The usual data source for Spark Core is a text file, etc. The data sources for Spark SQL are different: Parquet files, JSON documents, Hive tables, and Cassandra databases.
• Schema RDD − Spark Core is designed around the special data structure called RDD. Spark SQL works on schemas, tables, and records, so a Schema RDD can be used as a temporary table. A Schema RDD is also called a DataFrame.
• Language API − Spark is compatible with different languages, and Spark SQL is supported through language APIs (Python, Scala, Java, HiveQL).
Spark SQL – Architecture
Spark SQL supports the vast majority of Hive features, such as:
• Hive query statements, including:
– GROUP BY
– ORDER BY
– CLUSTER BY
• All Hive expressions
• User defined functions (UDF)
Apache Hive Compatibility
A DataFrame:
• is a distributed collection of data organized into named columns (inspired by DataFrames in R and Pandas in Python).
• can be constructed from an array of different sources, such as Hive tables, structured data files, external databases, or existing RDDs.
• supports processing data of any size, from kilobytes to petabytes.
• supports different data formats (Avro, CSV, Elasticsearch, Cassandra, etc.) and storage systems (HDFS, Hive tables, MySQL, etc.).
Spark SQL – DataFrame
A DataFrame:
• can be easily integrated with all Big Data tools and frameworks via Spark Core.
• retains the RDD abilities: immutable, resilient, in-memory, distributed computing.
• is one step ahead of RDD: better memory management and an optimized execution plan.
Spark SQL – DataFrame
• SparkSession supports loading data from different data sources and converting it into DataFrames.
– HDFS, Cassandra, Hive, local files, etc.
• Once a DataFrame is created, the data can be converted into tables in SQLContext for SQL queries.
• To create a basic SparkSession, just use SparkSession.builder(), as sketched below.
Spark SQL – Creating a DataFrame
SparkSession vs SparkContext?
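A minimal sketch in Scala (the app name and config key/value are placeholders, not from the slides):

import org.apache.spark.sql.SparkSession

val spark = SparkSession
  .builder()
  .appName("SparkSQLExample")                        // placeholder name
  .config("spark.some.config.option", "some-value")  // optional, placeholder
  .getOrCreate()

// enables implicit conversions such as RDD -> DataFrame and the $"col" syntax
import spark.implicits._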
• With a SparkSession, applications can create DataFrames
– from an existing RDD,
– from a Hive table,
– or from Spark data sources.
• spark.read.json(...) can load a JSON file of sales records and convert it into a DataFrame, as sketched below.
Spark SQL – Creating a DataFrame
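A sketch of loading JSON into a DataFrame, using the people.json sample that ships with the Spark distribution (substitute your own file of records):

val df = spark.read.json("examples/src/main/resources/people.json")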
• The .printSchema() method can be used to display the schema.
• The .show() method can be used to display the top 20 rows of data.
Spark SQL – Creating a DataFrame
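For the people.json sample loaded above, the output looks like this (a sketch; the exact rows depend on your data):

df.printSchema()
// root
//  |-- age: long (nullable = true)
//  |-- name: string (nullable = true)

df.show()
// +----+-------+
// | age|   name|
// +----+-------+
// |null|Michael|
// |  30|   Andy|
// |  19| Justin|
// +----+-------+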
Spark SQL supports two different methods for converting existing RDDs into DataFrames.
• The first method uses reflection to infer the schema of an RDD that contains specific types
of objects.
– RDD.toDF() or RDD -> case class -> .toDF()
• The second method is through a programmatic interface that allows you to construct a schema and then apply it to an existing RDD.
– createDataFrame(RDD, schema)
Spark SQL – from RDD to DataFrame
• Spark SQL supports automatically converting an RDD containing case classes to a DataFrame.
• The case class defines the schema of the table.
• The names of the arguments to the case class are read using reflection and become the names of the columns.
• Case classes can also be nested or contain complex types such as Seqs or Arrays.
• This RDD can be implicitly converted to a DataFrame and then be registered as a table.
• Tables can be used in subsequent SQL statements.
• Example: people.txt in "/spark/examples/src/main/resources"
Spark SQL – Reflection
people.txt --sc.textFile()--> RDD (lines): "Michael, 29", "Andy, 30", "Justin, 19"
Firstly, define a case class (only a case class can be implicitly converted into a DataFrame by Spark SQL).
Secondly, load the file into an RDD:
Spark SQL – Reflection
people.txt --sc.textFile()--> RDD (lines): "Michael, 29", "Andy, 30", "Justin, 19"
RDD (lines) --.map(_.split(","))--> RDD (arrays): Array("Michael", "29"), Array("Andy", "30"), Array("Justin", "19")
Thirdly, use map() to convert each array into a Person object, stored in a new RDD.
Lastly, call the .toDF() method to convert the RDD into a DataFrame (a complete sketch follows below).
Spark SQL – Reflection
RDD (arrays) --.map(x => Person(x(0), x(1).trim.toInt))--> RDD (Person): Person("Michael", 29), Person("Andy", 30), Person("Justin", 19)
RDD (Person) --.toDF()--> DataFrame (columns: name, age)
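Putting the four steps together, a minimal runnable sketch (paths as in the Spark distribution; assumes a spark-shell or a SparkSession named spark):

import spark.implicits._

case class Person(name: String, age: Int)

val peopleDF = spark.sparkContext
  .textFile("examples/src/main/resources/people.txt")
  .map(_.split(","))                            // "Michael, 29" -> Array("Michael", " 29")
  .map(x => Person(x(0), x(1).trim.toInt))      // Array -> case class instance
  .toDF()                                       // schema inferred via reflection

// register as a table and query it with SQL
peopleDF.createOrReplaceTempView("people")
spark.sql("SELECT name, age FROM people WHERE age BETWEEN 13 AND 19").show()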
Spark SQL – Reflection
Diagram: an RDD of Student case-class objects is converted by the .toDF() reflection method into a DataFrame whose columns Name (String), Age (Int), Height (Double) are inferred from the case class:
case class Student (name: String, age: Int, height: Double)
*Image from https://www.ibmastery.com/blog/how-to-write-ib-extended-essay-reflections
• More often, case classes cannot be defined ahead of time
– E.g. the structure of records is encoded in a string,
– or a text dataset will be parsed and fields will be projected differently for different users
• A DataFrame can be created programmatically with three steps:
– Step 1: Create an RDD of Rows from the original RDD;
– Step 2: Create the schema, represented by a StructType, matching the structure of Rows in the RDD created in Step 1;
– Step 3: Apply the schema to the RDD of Rows via the createDataFrame method provided by SparkSession.
• Example: people.txt in "/spark/examples/src/main/resources"
Spark SQL – Programmatic Interface
people.txt --sc.textFile()--> RDD (lines): "Michael, 29", "Andy, 30", "Justin, 19"
• Firstly, create an RDD of Rows from the original RDD
• Secondly, create the schema represented by a StructType matching the structure of Rows
Spark SQL – Programmatic Interface
RDD (arrays) --.map(x => Row(x(0), x(1).trim))--> RDD (Rows): Row("Michael", "29"), Row("Andy", "30"), Row("Justin", "19")
Column names Array("name", "age") map to Array(StructField(name,StringType,true), StructField(age,StringType,true)), giving schema = StructType(StructField(name,StringType,true), StructField(age,StringType,true))
• Lastly, apply the schema to the RDD of Rows:
Spark SQL – Programmatic Interface
RDD (Rows): Row("Michael", "29"), Row("Andy", "30"), Row("Justin", "19") + schema StructType(StructField(name,StringType,true), StructField(age,StringType,true))
--createDataFrame(rowRDD, schema)--> DataFrame (name | age), e.g. row (Michael | 29)
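The full programmatic flow as a runnable sketch (same people.txt sample; age is kept as a string here, matching the StringType field above):

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, StringType}

val peopleRDD = spark.sparkContext.textFile("examples/src/main/resources/people.txt")

// Step 1: create an RDD of Rows
val rowRDD = peopleRDD.map(_.split(",")).map(x => Row(x(0), x(1).trim))

// Step 2: build the schema as a StructType
val schema = StructType(Array(
  StructField("name", StringType, true),
  StructField("age", StringType, true)))

// Step 3: apply the schema to the RDD of Rows
val peopleDF = spark.createDataFrame(rowRDD, schema)
peopleDF.show()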
Spark SQL – Programmatic Interface
Diagram: an RDD of Rows plus a programmatically built schema (name: String, age: Int, height: Double) is converted via createDataFrame into a DataFrame with columns Name (String), Age (Int), Height (Double).
*Image from https://www.ibmastery.com/blog/how-to-write-ib-extended-essay-reflections
• RDD
– is an immutable distributed collection of elements of your data, partitioned across nodes in your cluster.
– is a lower-level data abstraction supporting more atomic transformations and actions: map(), filter(), collect(), take(), etc.
• DataFrame
– is an immutable distributed collection of data, but organized into named columns, like a table in a relational database.
– is designed to make processing of large data sets even easier in some tasks.
– is a higher-level data abstraction.
• Use cases:
– unstructured data manipulation (RDD) vs. large structured data query (DataFrame)
Spark SQL – RDD vs DataFrame
• The write() method can save a DataFrame to external storage, and .format() supports various output formats:
– JSON, Parquet, JDBC, ORC, LibSVM, CSV, text, etc.
Spark SQL – Save to Files
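A sketch, assuming df is the DataFrame created earlier (the output path is a placeholder):

// save as JSON; .format() takes any of the formats listed above
df.write.format("json").save("file:///tmp/people-output")
// shorthand also exists for common formats: df.write.json(...), df.write.parquet(...), etc.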
• History & Architecture of Spark SQL
• Operations on DataFrame
• Connecting Spark SQL with external data sources
• Machine learning basics
• Introduction to MLlib: Feature extraction, transformation, selection, classification, and regression
• Pipeline and components
• Machine Learning Examples:
– Logistic regression
– Decision tree
– Clustering
– Collaborative filtering
– Association Rule Mining
• To show a specific column, use .select("[col_name]").show(), as sketched below.
Spark SQL – DataFrame Operations
SQL equivalent statement: SELECT
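A sketch against the df loaded earlier:

df.select("name").show()
// columns can also be transformed during selection:
df.select($"name", $"age" + 1).show()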
• To show filtered results, .filter() can be used: .filter(condition).show()
• In the condition, $ is used to quote a field name (like a variable), e.g. $"id" > 20.
Spark SQL – DataFrame Operations
SQL equivalent: WHERE Clause
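A sketch (the condition follows the slide and assumes the DataFrame has an id column; the $ syntax requires import spark.implicits._):

df.filter($"id" > 20).show()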
• To show aggregated results, use .groupBy("[col_name]").count().show()
Spark SQL – DataFrame Operations
SQL equivalent: GROUP BY
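A sketch, counting rows per distinct age:

df.groupBy("age").count().show()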
• To sort the results, use .sort("[col_name]").show()
• The sort() method also supports secondary sorting, as sketched below.
Spark SQL – DataFrame Operations
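A sketch (descending by age, then ascending by name as the secondary key):

df.sort($"age".desc).show()
df.sort($"age".desc, $"name".asc).show()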
• The sql function on a SparkSession enables applications to run SQL queries programmatically and returns the result as a DataFrame. Temporary views in Spark SQL are session-scoped and will disappear if the session that creates them terminates.
Spark SQL – DataFrame Operations
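A sketch:

// register the DataFrame as a session-scoped temporary view
df.createOrReplaceTempView("people")
val sqlDF = spark.sql("SELECT * FROM people")
sqlDF.show()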
• If you want a view to stay alive until the Spark application terminates, use a global temporary view.
• A global temporary view is tied to a system-preserved database global_temp,
– e.g. SELECT * FROM global_temp.view1 WHERE id = 1
Spark SQL – DataFrame Operations
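A sketch:

df.createGlobalTempView("people")
spark.sql("SELECT * FROM global_temp.people").show()
// global temporary views remain visible from a new session
spark.newSession().sql("SELECT * FROM global_temp.people").show()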
• History & Architecture of Spark SQL
• Operations on DataFrame
• Connecting Spark SQL with external data sources
• Machine learning basics
• Introduction to MLlib: Feature extraction, transformation, selection, classification, and regression
• Pipeline and components
• Machine Learning Examples:
– Logistic regression
– Decision tree
– Clustering
– Collaborative filtering
– Association Rule Mining
• Spark SQL also includes a data source that can read data from other databases using JDBC.
• We use MySQL database as a demo:
– MySQL has been installed and set up on VM instance-3
– To access to MySQL, phpMyAdmin is enabled
Spark SQL – Connect to MySQL via JDBC
• Create a database named “sparktest” and create a table “student”
Spark SQL – Connect to MySQL via JDBC
• To get started, you will need to include the JDBC driver for your particular database on the Spark classpath.
• For example, to connect to MySQL from the Spark shell you would run the following:
– Download the JDBC driver for the specific database: https://dev.mysql.com/downloads/connector/j/5.1.html
– Unzip the compressed file
– Put the jar (mysql-connector-java-5.1.48.jar) in the folder $SPARK/jars
– When starting a spark-shell, the driver needs to be specified:
bin/spark-shell --driver-class-path mysql-connector-java-5.1.48-bin.jar --jars mysql-connector-java-5.1.48-bin.jar
Spark SQL – Connect to MySQL via JDBC
• Then the MySQL database can be connected to via JDBC.
• The database information needs to be supplied via the .option() method, as sketched below:
Spark SQL – Connect to MySQL via JDBC
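A sketch (host, port, user, and password are placeholders for your own setup):

val jdbcDF = spark.read.format("jdbc")
  .option("url", "jdbc:mysql://localhost:3306/sparktest")  // placeholder host/port
  .option("driver", "com.mysql.jdbc.Driver")               // matches the 5.1 connector above
  .option("dbtable", "student")
  .option("user", "root")                                  // placeholder credentials
  .option("password", "password")
  .load()
jdbcDF.show()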
• A DataFrame can be written into the MySQL database via JDBC:
• Create Row objects in an RDD
• Create a schema
Spark SQL – Connect to MySQL via JDBC
parallelize() --.map(_.split(" "))--> RDD (arrays): Array("6", "Rudd", "M", "23"), Array("7", "David", "M", "20")
--.map(x => Row(...))--> RDD (Rows): Row("6", "Rudd", "M", "23"), Row("7", "David", "M", "20")
• Apply the schema to the RDD to generate a DataFrame
• Use the write() method in "append" mode to write records into MySQL via JDBC, as sketched below:
Spark SQL – Connect to MySQL via JDBC
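A runnable sketch of the whole write path (the column names id/name/gender/age and the credentials are assumptions, not from the slides):

import java.util.Properties
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType}

val rowRDD = spark.sparkContext
  .parallelize(Array("6 Rudd M 23", "7 David M 20"))
  .map(_.split(" "))
  .map(x => Row(x(0).toInt, x(1).trim, x(2).trim, x(3).toInt))

val schema = StructType(Array(
  StructField("id", IntegerType, true),        // assumed column names
  StructField("name", StringType, true),
  StructField("gender", StringType, true),
  StructField("age", IntegerType, true)))

val studentDF = spark.createDataFrame(rowRDD, schema)

val prop = new Properties()
prop.put("user", "root")                       // placeholder credentials
prop.put("password", "password")
prop.put("driver", "com.mysql.jdbc.Driver")

// "append" adds the new rows to the existing student table
studentDF.write.mode("append")
  .jdbc("jdbc:mysql://localhost:3306/sparktest", "student", prop)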
• History & Architecture of Spark SQL
• Operations on DataFrame
• Connecting Spark SQL with external data sources
• Machine learning basics
• Introduction to MLlib: Feature extraction, transformation, selection, classification, and regression
• Pipeline and components
• Machine Learning Examples:
– Logistic regression
– Decision tree
– Clustering
– Collaborative filtering
– Association Rule Mining
Machine Learning (ML) is the scientific study of algorithms and statistical models that computer systems use to effectively perform a specific task without using explicit instructions, relying on patterns and inference instead (Wikipedia).
Types of ML:
• Supervised (and semi-supervised) learning
– Has ground-truth information (e.g. features and labels in training data)
– Classification (binary-class/multi-class/multi-label) & regression (continuous values)
– Naïve Bayesian, Decision Trees, SVMs, CNNs, RNNs, etc.
• Unsupervised learning
– No labels (only features)
– Clustering analysis (to find structure in the data)
– K-means, DBSCAN, GMM, Autoencoders, etc.
Machine Learning Basics
Training Set:
Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

Test Set:
Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?
General Approach for Building Classification Model
• Training errors: number of classification errors on training records.
• Generalization (test) errors: number of classification errors on test records.
• A good model should have low errors of both types.
• Model underfitting: the model is too simple, such that both training error and generalization error are high.
• Model overfitting: a model that fits the training data too well (with too low training errors) may have a poorer generalization error than a model with a higher training error.
Underfitting vs Overfitting
(Figure: underfitted models are too simple; overfitted models are too complicated.)
Predictive accuracy
• Precision, p, is the number of correctly classified positive examples divided by the total number of examples that are classified as positive.
• Recall, r, is the number of correctly classified positive examples divided by the total number of actual positive examples in the test set (formulas below).
Efficiency
• time to construct the model
• time to use the model
Robustness: handling noise and missing values
Scalability: efficiency in disk-resident databases
Evaluating Classification Methods
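In symbols (with TP, FP, FN the counts of true positives, false positives, and false negatives), the precision and recall defined above are:

p = \frac{TP}{TP + FP}, \qquad r = \frac{TP}{TP + FN}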
• History & Architecture of Spark SQL
• Operations on DataFrame
• Connecting Spark SQL with external data sources
• Machine learning basics
• Introduction to MLlib: Feature extraction, transformation, selection, classification, and regression
• Pipeline and components
• Machine Learning Examples:
– Logistic regression
– Decision tree
– Clustering
– Collaborative filtering
– Association Rule Mining
• Traditional machine learning algorithms can only be performed on small training data, which makes the performance depend heavily on the quality of data sampling.
• In the big data era, the availability of massive data