• A3 Spec Updated
• Individual Project – Presentation (~10 min) + Q&A (3-5 min)
• GCP Coupon Survey
• Check your balance
• Fill out the survey (see Announcement) if you need one
Cloud Computing
Picture: https://www.mygreatlearning.com/blog/apache-spark/
Picture: https://databricks.com/glossary/what-is-spark-streaming
• History & Architecture of Spark SQL
• Operations on DataFrame
• Connecting Spark SQL with external data sources
• Machine learning basics
• Introduction to MLlib: Feature extraction, transformation, selection, classification, and regression
• Pipeline and components
• Machine Learning Examples:
– Logistic regression
– Decision tree
– Clustering
– Collaborative filtering
– Association Rule Mining
• Apache Hive is a data warehouse software project built on top of Apache Hadoop
– for data query, data summarization, and data analysis
• Hive provides an SQL-like interface to query data: HiveQL (HQL).
– (SQL-like query ➔ MapReduce jobs on Hadoop)
• HiveQL queries can be integrated into the underlying Java code, without the need to implement queries in the low-level Java API.
• Hive aids portability of SQL-based applications to Hadoop.
Spark SQL – History
Spark SQL – History
• Shortcoming of Hive: less than ideal performance, since every HiveQL query is compiled into MapReduce jobs, which incur high latency.
The Spark SQL architecture contains three layers:
• Data Sources − The usual data source for Spark Core is a text file, etc. The data sources for Spark SQL are different: Parquet files, JSON documents, Hive tables, and Cassandra databases.
• Schema RDD − Spark Core is designed around the special data structure called RDD. Spark SQL works on schemas, tables, and records, so a Schema RDD can be used as a temporary table. A Schema RDD is also called a DataFrame.
• Language API − Spark is compatible with different languages, and Spark SQL is supported through language APIs (Python, Scala, Java, HiveQL).
Spark SQL – Architecture
Spark SQL supports the vast majority of Hive features, such as:
• Hive query statements, including:
– GROUP BY
– ORDER BY
– CLUSTER BY
• All Hive expressions
• User defined functions (UDF)
Apache Hive Compatibility
A DataFrame:
• is a distributed collection of data organized into named columns (inspired by DataFrames in R and Pandas in Python).
• can be constructed from an array of different sources, such as Hive tables, structured data files, external databases, or existing RDDs.
• supports processing data of any size, from kilobytes to petabytes.
• supports different data formats (Avro, CSV, Elasticsearch, Cassandra, etc.) and storage systems (HDFS, Hive tables, MySQL, etc.).
Spark SQL – DataFrame
A DataFrame:
• can be easily integrated with all Big Data tools and frameworks via Spark Core.
• retains the RDD abilities: immutable, resilient, in-memory, distributed computing.
• is one step ahead of RDD: better memory management and an optimized execution plan.
Spark SQL – DataFrame
• SparkSession supports loading data from different data sources and converting it into DataFrames.
– HDFS, Cassandra, Hive, local files, etc.
• Once a DataFrame is created, the data can be converted into tables in SQLContext for SQL queries.
• To create a basic SparkSession, just use SparkSession.builder(), as sketched below.
Spark SQL – Creating a DataFrame
SparkSession vs SparkContext?
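A minimal sketch in Scala (the app name and config key/value are placeholders, not from the slides):

import org.apache.spark.sql.SparkSession

val spark = SparkSession
  .builder()
  .appName("SparkSQLExample")                        // placeholder name
  .config("spark.some.config.option", "some-value")  // optional, placeholder
  .getOrCreate()

// enables implicit conversions such as RDD -> DataFrame and the $"col" syntax
import spark.implicits._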
• With a SparkSession, applications can create DataFrames
– from an existing RDD,
– from a Hive table,
– or from Spark data sources.
• spark.read.json(...) can load a JSON file of sales records and convert it into a DataFrame, as sketched below.
Spark SQL – Creating a DataFrame
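A sketch of loading JSON into a DataFrame, using the people.json sample that ships with the Spark distribution (substitute your own file of records):

val df = spark.read.json("examples/src/main/resources/people.json")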
• The .printSchema() method can be used to display the schema.
• The .show() method can be used to display the top 20 rows of data.
Spark SQL – Creating a DataFrame
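For the people.json sample loaded above, the output looks like this (a sketch; the exact rows depend on your data):

df.printSchema()
// root
//  |-- age: long (nullable = true)
//  |-- name: string (nullable = true)

df.show()
// +----+-------+
// | age|   name|
// +----+-------+
// |null|Michael|
// |  30|   Andy|
// |  19| Justin|
// +----+-------+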
Spark SQL supports two different methods for converting existing RDDs into DataFrames.
• The first method uses reflection to infer the schema of an RDD that contains specific types
of objects.
– RDD.toDF() or RDD -> case class -> .toDF()
• The second method is through a programmatic interface that allows you to construct a schema and then apply it to an existing RDD.
– createDataFrame(RDD, schema)
Spark SQL – from RDD to DataFrame
• Spark SQL supports automatically converting an RDD containing case classes to a DataFrame.
• The case class defines the schema of the table.
• The names of the arguments to the case class are read using reflection and become the names of the columns.
• Case classes can also be nested or contain complex types such as Seqs or Arrays.
• This RDD can be implicitly converted to a DataFrame and then be registered as a table.
• Tables can be used in subsequent SQL statements.
• Example: people.txt in "/spark/examples/src/main/resources"
Spark SQL – Reflection
people.txt --sc.textFile()--> RDD (lines): "Michael, 29", "Andy, 30", "Justin, 19"
Firstly, define a case class (only a case class can be implicitly converted into a DataFrame by Spark SQL).
Secondly, load the file into an RDD:
Spark SQL – Reflection
people.txt --sc.textFile()--> RDD (lines): "Michael, 29", "Andy, 30", "Justin, 19"
RDD (lines) --.map(_.split(","))--> RDD (arrays): Array("Michael", "29"), Array("Andy", "30"), Array("Justin", "19")
Thirdly, use map() to convert each array into a Person object, stored in a new RDD.
Lastly, call the .toDF() method to convert the RDD into a DataFrame (a complete sketch follows below).
Spark SQL – Reflection
RDD (arrays) --.map(x => Person(x(0), x(1).trim.toInt))--> RDD (Person): Person("Michael", 29), Person("Andy", 30), Person("Justin", 19)
RDD (Person) --.toDF()--> DataFrame (columns: name, age)
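Putting the four steps together, a minimal runnable sketch (paths as in the Spark distribution; assumes a spark-shell or a SparkSession named spark):

import spark.implicits._

case class Person(name: String, age: Int)

val peopleDF = spark.sparkContext
  .textFile("examples/src/main/resources/people.txt")
  .map(_.split(","))                            // "Michael, 29" -> Array("Michael", " 29")
  .map(x => Person(x(0), x(1).trim.toInt))      // Array -> case class instance
  .toDF()                                       // schema inferred via reflection

// register as a table and query it with SQL
peopleDF.createOrReplaceTempView("people")
spark.sql("SELECT name, age FROM people WHERE age BETWEEN 13 AND 19").show()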
Spark SQL – Reflection
Diagram: an RDD of Student case-class objects is converted by the .toDF() reflection method into a DataFrame whose columns Name (String), Age (Int), Height (Double) are inferred from the case class:
case class Student (name: String, age: Int, height: Double)
*Image from https://www.ibmastery.com/blog/how-to-write-ib-extended-essay-reflections
• More often, case classes cannot be defined ahead of time
– E.g. the structure of records is encoded in a string,
– or a text dataset will be parsed and fields will be projected differently for different users
• A DataFrame can be created programmatically with three steps:
– Step 1: Create an RDD of Rows from the original RDD;
– Step 2: Create the schema, represented by a StructType, matching the structure of Rows in the RDD created in Step 1;
– Step 3: Apply the schema to the RDD of Rows via the createDataFrame method provided by SparkSession.
• Example: people.txt in "/spark/examples/src/main/resources"
Spark SQL – Programmatic Interface
people.txt --sc.textFile()--> RDD (lines): "Michael, 29", "Andy, 30", "Justin, 19"
• Firstly, create an RDD of Rows from the original RDD
• Secondly, create the schema represented by a StructType matching the structure of Rows
Spark SQL – Programmatic Interface
RDD (arrays) --.map(x => Row(x(0), x(1).trim))--> RDD (Rows): Row("Michael", "29"), Row("Andy", "30"), Row("Justin", "19")
Column names Array("name", "age") map to Array(StructField(name,StringType,true), StructField(age,StringType,true)), giving schema = StructType(StructField(name,StringType,true), StructField(age,StringType,true))
• Lastly, apply the schema to the RDD of Rows:
Spark SQL – Programmatic Interface
RDD (Rows): Row("Michael", "29"), Row("Andy", "30"), Row("Justin", "19") + schema StructType(StructField(name,StringType,true), StructField(age,StringType,true))
--createDataFrame(rowRDD, schema)--> DataFrame (name | age), e.g. row (Michael | 29)
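The full programmatic flow as a runnable sketch (same people.txt sample; age is kept as a string here, matching the StringType field above):

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, StringType}

val peopleRDD = spark.sparkContext.textFile("examples/src/main/resources/people.txt")

// Step 1: create an RDD of Rows
val rowRDD = peopleRDD.map(_.split(",")).map(x => Row(x(0), x(1).trim))

// Step 2: build the schema as a StructType
val schema = StructType(Array(
  StructField("name", StringType, true),
  StructField("age", StringType, true)))

// Step 3: apply the schema to the RDD of Rows
val peopleDF = spark.createDataFrame(rowRDD, schema)
peopleDF.show()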
Spark SQL – Programmatic Interface
Diagram: an RDD of Rows plus a programmatically built schema (name: String, age: Int, height: Double) is converted via createDataFrame into a DataFrame with columns Name (String), Age (Int), Height (Double).
*Image from https://www.ibmastery.com/blog/how-to-write-ib-extended-essay-reflections
• RDD
– is an immutable distributed collection of elements of your data, partitioned across nodes in your cluster.
– is a lower-level data abstraction supporting more atomic transformations and actions: map(), filter(), collect(), take(), etc.
• DataFrame
– is an immutable distributed collection of data, but organized into named columns, like a table in a relational database.
– is designed to make processing of large data sets even easier in some tasks.
– is a higher-level data abstraction.
• Use cases:
– unstructured data manipulation (RDD) vs. large structured data query (DataFrame)
Spark SQL – RDD vs DataFrame
• The write() method can save a DataFrame to external storage, and .format() supports various output formats:
– JSON, Parquet, JDBC, ORC, LibSVM, CSV, text, etc.
Spark SQL – Save to Files
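A sketch, assuming df is the DataFrame created earlier (the output path is a placeholder):

// save as JSON; .format() takes any of the formats listed above
df.write.format("json").save("file:///tmp/people-output")
// shorthand also exists for common formats: df.write.json(...), df.write.parquet(...), etc.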
• History & Architecture of Spark SQL
• Operations on DataFrame
• Connecting Spark SQL with external data sources
• Machine learning basics
• Introduction to MLlib: Feature extraction, transformation, selection, classification, and regression
• Pipeline and components
• Machine Learning Examples:
– Logistic regression
– Decision tree
– Clustering
– Collaborative filtering
– Association Rule Mining
• To show a specific column, use .select("[col_name]").show(), as sketched below.
Spark SQL – DataFrame Operations
SQL equivalent statement: SELECT
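A sketch against the df loaded earlier:

df.select("name").show()
// columns can also be transformed during selection:
df.select($"name", $"age" + 1).show()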
• To show filtered results, .filter() can be used: .filter(condition).show()
• In the condition, $ is used to quote a field name (like a variable), e.g. $"id" > 20.
Spark SQL – DataFrame Operations
SQL equivalent: WHERE Clause
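A sketch (the condition follows the slide and assumes the DataFrame has an id column; the $ syntax requires import spark.implicits._):

df.filter($"id" > 20).show()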
• To show aggregated results, use .groupBy("[col_name]").count().show()
Spark SQL – DataFrame Operations
SQL equivalent: GROUP BY
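A sketch, counting rows per distinct age:

df.groupBy("age").count().show()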
• To sort the results, use .sort("[col_name]").show()
• The sort() method also supports secondary sorting, as sketched below.
Spark SQL – DataFrame Operations
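A sketch (descending by age, then ascending by name as the secondary key):

df.sort($"age".desc).show()
df.sort($"age".desc, $"name".asc).show()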
• The sql function on a SparkSession enables applications to run SQL queries programmatically and returns the result as a DataFrame. Temporary views in Spark SQL are session-scoped and will disappear if the session that creates them terminates.
Spark SQL – DataFrame Operations
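A sketch:

// register the DataFrame as a session-scoped temporary view
df.createOrReplaceTempView("people")
val sqlDF = spark.sql("SELECT * FROM people")
sqlDF.show()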
• If you want a view to stay alive until the Spark application terminates, use a global temporary view.
• A global temporary view is tied to a system-preserved database global_temp,
– e.g. SELECT * FROM global_temp.view1 WHERE id = 1
Spark SQL – DataFrame Operations
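A sketch:

df.createGlobalTempView("people")
spark.sql("SELECT * FROM global_temp.people").show()
// global temporary views remain visible from a new session
spark.newSession().sql("SELECT * FROM global_temp.people").show()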
• History & Architecture of Spark SQL
• Operations on DataFrame
• Connecting Spark SQL with external data sources
• Machine learning basics
• Introduction to MLlib: Feature extraction, transformation, selection, classification, and regression
• Pipeline and components
• Machine Learning Examples:
– Logistic regression
– Decision tree
– Clustering
– Collaborative filtering
– Association Rule Mining
• Spark SQL also includes a data source that can read data from other databases using JDBC.
• We use MySQL database as a demo:
– MySQL has been installed and set up on VM instance-3
– To access to MySQL, phpMyAdmin is enabled
Spark SQL – Connect to MySQL via JDBC
• Create a database named “sparktest” and create a table “student”
Spark SQL – Connect to MySQL via JDBC
• To get started, you will need to include the JDBC driver for your particular database on the Spark classpath.
• For example, to connect to MySQL from the Spark shell you would run the following:
– Download the JDBC driver for the specific database: https://dev.mysql.com/downloads/connector/j/5.1.html
– Unzip the compressed file
– Put the jar (mysql-connector-java-5.1.48.jar) in the folder $SPARK/jars
– When starting a spark-shell, the driver needs to be specified:
bin/spark-shell --driver-class-path mysql-connector-java-5.1.48-bin.jar --jars mysql-connector-java-5.1.48-bin.jar
Spark SQL – Connect to MySQL via JDBC
• Then the MySQL database can be connected to via JDBC.
• The database information needs to be supplied via the .option() method, as sketched below:
Spark SQL – Connect to MySQL via JDBC
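A sketch (host, port, user, and password are placeholders for your own setup):

val jdbcDF = spark.read.format("jdbc")
  .option("url", "jdbc:mysql://localhost:3306/sparktest")  // placeholder host/port
  .option("driver", "com.mysql.jdbc.Driver")               // matches the 5.1 connector above
  .option("dbtable", "student")
  .option("user", "root")                                  // placeholder credentials
  .option("password", "password")
  .load()
jdbcDF.show()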
• A DataFrame can be written into the MySQL database via JDBC:
• Create Row objects in an RDD
• Create a schema
Spark SQL – Connect to MySQL via JDBC
parallelize() --.map(_.split(" "))--> RDD (arrays): Array("6", "Rudd", "M", "23"), Array("7", "David", "M", "20")
--.map(x => Row(...))--> RDD (Rows): Row("6", "Rudd", "M", "23"), Row("7", "David", "M", "20")
• Apply the schema to the RDD to generate a DataFrame
• Use the write() method in "append" mode to write records into MySQL via JDBC, as sketched below:
Spark SQL – Connect to MySQL via JDBC
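A runnable sketch of the whole write path (the column names id/name/gender/age and the credentials are assumptions, not from the slides):

import java.util.Properties
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType}

val rowRDD = spark.sparkContext
  .parallelize(Array("6 Rudd M 23", "7 David M 20"))
  .map(_.split(" "))
  .map(x => Row(x(0).toInt, x(1).trim, x(2).trim, x(3).toInt))

val schema = StructType(Array(
  StructField("id", IntegerType, true),        // assumed column names
  StructField("name", StringType, true),
  StructField("gender", StringType, true),
  StructField("age", IntegerType, true)))

val studentDF = spark.createDataFrame(rowRDD, schema)

val prop = new Properties()
prop.put("user", "root")                       // placeholder credentials
prop.put("password", "password")
prop.put("driver", "com.mysql.jdbc.Driver")

// "append" adds the new rows to the existing student table
studentDF.write.mode("append")
  .jdbc("jdbc:mysql://localhost:3306/sparktest", "student", prop)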
• History & Architecture of Spark SQL
• Operations on DataFrame
• Connecting Spark SQL with external data sources
• Machine learning basics
• Introduction to MLlib: Feature extraction, transformation, selection, classification, and regression
• Pipeline and components
• Machine Learning Examples:
– Logistic regression
– Decision tree
– Clustering
– Collaborative filtering
– Association Rule Mining
Machine Learning (ML) is the scientific study of algorithms and statistical models that computer systems use to effectively perform a specific task without using explicit instructions, relying on patterns and inference instead (Wikipedia).
Types of ML:
• Supervised (and semi-supervised) learning
– Has ground-truth information (e.g. features and labels in training data)
– Classification (binary-class/multi-class/multi-label) & regression (continuous values)
– Naïve Bayesian, Decision Trees, SVMs, CNNs, RNNs, etc.
• Unsupervised learning
– No labels (only features)
– Clustering analysis (to find structure in the data)
– K-means, DBSCAN, GMM, Autoencoders, etc.
Machine Learning Basics
Training Set:
Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

Test Set:
Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?
General Approach for Building Classification Model
• Training errors: number of classification errors on training records.
• Generalization (test) errors: number of classification errors on test records.
• A good model should have low errors of both types.
• Model underfitting: the model is too simple, such that both training error and generalization error are high.
• Model overfitting: a model that fits the training data too well (with too low training errors) may have a poorer generalization error than a model with a higher training error.
Underfitting vs Overfitting
(Figure: underfitted models are too simple; overfitted models are too complicated.)
Predictive accuracy
• Precision, p, is the number of correctly classified positive examples divided by the total number of examples that are classified as positive.
• Recall, r, is the number of correctly classified positive examples divided by the total number of actual positive examples in the test set (formulas below).
Efficiency
• time to construct the model
• time to use the model
Robustness: handling noise and missing values
Scalability: efficiency in disk-resident databases
Evaluating Classification Methods
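In symbols (with TP, FP, FN the counts of true positives, false positives, and false negatives), the precision and recall defined above are:

p = \frac{TP}{TP + FP}, \qquad r = \frac{TP}{TP + FN}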
• History & Architecture of Spark SQL
• Operations on DataFrame
• Connecting Spark SQL with external data sources
• Machine learning basics
• Introduction to MLlib: Feature extraction, transformation, selection, classification, and regression
• Pipeline and components
• Machine Learning Examples:
– Logistic regression
– Decision tree
– Clustering
– Collaborative filtering
– Association Rule Mining
• Traditional machine learning algorithms can only be performed on small training data, which makes the performance depend heavily on the quality of data sampling.
• In the big data era, the availability of massive data