Cloud Computing INFS3208
Updates
• A3 Spec Updated
• Individual Project – Presentation (~10 min) + Q&A (3-5 min)
• GCP Coupon Survey
– Check your balance
– Fill out the survey (see Announcement) if you need one
CRICOS code 00025B 2
Outline
• Spark SQL
– History & Architecture of Spark SQL
– Operations on DataFrame
– Connecting Spark SQL with external data sources
• Machine learning basics
• Introduction to MLlib
– Feature extraction, transformation, selection, classification, and regression
– Pipeline and components
• Machine Learning Examples:
– Logistic regression
– Decision tree
– Clustering
– Collaborative filtering
– Association Rule Mining
Spark SQL – History
• Apache Hive is a data warehouse software project built on top of Apache Hadoop.
– It supports data query, data summarization, and data analysis.
• Hive provides an SQL-like interface to query data: HiveQL (HQL).
– SQL-like queries are translated into MapReduce jobs on Hadoop.
• HiveQL queries can be integrated into the underlying Java without the need to implement them in the low-level Java API.
• Hive aids portability of SQL-based applications to Hadoop.
[Figure labels: MapReduce, HDFS]
Spark SQL – History
• Shortcoming of Hive: less-than-ideal performance, since queries run as MapReduce jobs.
Spark SQL – Architecture
This architecture contains three layers:
• Data Sources − Spark Core usually reads from sources such as text files. The data sources for Spark SQL are different: Parquet files, JSON documents, Hive tables, and Cassandra databases.
• Schema RDD − Spark Core is designed around a special data structure called the RDD. Spark SQL generally works on schemas, tables, and records, so a Schema RDD can be used as a temporary table. A Schema RDD is also called a DataFrame.
• Language API − Spark SQL is available through the languages that Spark supports (Python, Scala, Java) and through HiveQL.
Apache Hive Compatibility
Spark SQL supports the vast majority of Hive features, such as:
• Hive query statements, including:
– SELECT
– GROUP BY
– ORDER BY
– CLUSTER BY
– SORT BY
• All Hive expressions
• User defined functions (UDF)
• …
Spark SQL – DataFrame
• is a distributed collection of data organized into named columns (inspired by the data frame in R and the DataFrame in pandas for Python).
• can be constructed from a wide array of sources such as Hive tables, structured data files, external databases, or existing RDDs.
• can process data ranging in size from kilobytes to petabytes.
• supports different data formats (Avro, CSV, Elasticsearch, and Cassandra) and storage systems (HDFS, Hive tables, MySQL, etc.).
Spark SQL – DataFrame
• can be easily integrated with Big Data tools and frameworks via Spark Core.
• keeps the properties of an RDD: it is immutable, resilient, in-memory, and distributed.
• is one step ahead of the RDD: it adds memory management and an optimized execution plan.
Spark SQL – Creating a DataFrame
• A SparkSession can load data from different data sources and convert it into DataFrames.
– HDFS, Cassandra, Hive, local files, etc.
• Once a DataFrame is created, it can be registered as a table (temporary view) for SQL queries.
• To create a basic SparkSession, use SparkSession.builder() and finish the chain with .getOrCreate().
SparkSession vs SparkContext?
Spark SQL – Creating a DataFrame
• With a SparkSession, applications can create DataFrames
– from an existing RDD,
– from a Hive table,
– or from Spark data sources.
• spark.read.json(...) can load a JSON file of sales records and convert it into a DataFrame.
Spark SQL – Creating a DataFrame
• The .printSchema() method displays the schema of a DataFrame.
• The .show() method displays the first 20 rows of data by default.
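The steps for creating a DataFrame can be sketched as follows, written for spark-shell or a Scala script with Spark on the classpath. The sales records, the temporary file, and the application name are assumptions made for illustration; they are not the lecture's data.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

// Local session for experimentation; on a cluster the master is normally set at submit time.
val spark = SparkSession.builder()
  .appName("CreateDataFrameSketch") // hypothetical name
  .master("local[*]")
  .getOrCreate()

// Hypothetical sales records in JSON Lines format (one JSON object per line).
val salesJson =
  """{"id": 1, "item": "pen", "amount": 2.5}
    |{"id": 2, "item": "book", "amount": 12.0}
    |{"id": 3, "item": "bag", "amount": 30.0}""".stripMargin
val path = Files.createTempFile("sales", ".json")
Files.write(path, salesJson.getBytes(StandardCharsets.UTF_8))

// Load the JSON file into a DataFrame; the schema is inferred from the records.
val sales = spark.read.json(path.toUri.toString)

sales.printSchema() // column names and inferred types
sales.show()        // first 20 rows (here: all 3)
```

The SparkSession also exposes the underlying SparkContext as spark.sparkContext, which is how RDD-based code is reached from a session.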
Spark SQL – from RDD to DataFrame
Spark SQL supports two different methods for converting existing RDDs into DataFrames.
• The first method uses reflection to infer the schema of an RDD that contains specific types of objects.
– RDD.toDF() or RDD -> case class -> .toDF()
• The second method is through a programmatic interface that allows you to construct a schema and then apply it to an existing RDD.
– createDataFrame(RDD, schema)
Spark SQL – Reflection
• Spark SQL supports automatically converting an RDD containing case classes to a DataFrame.
• The case class defines the schema of the table.
• The names of the arguments to the case class are read using reflection and become the names of the columns.
• Case classes can also be nested or contain complex types such as Seqs or Arrays.
• This RDD can be implicitly converted to a DataFrame and then be registered as a table.
• Tables can be used in subsequent SQL statements.
• Example: people.txt in "/spark/examples/src/main/resources"
sc.textFile() → RDD (lines): "Michael, 29", "Andy, 30", "Justin, 19"
Spark SQL – Reflection
Firstly, define a case class (Spark can implicitly convert an RDD of case-class objects into a DataFrame).
Secondly, load the file into an RDD and split each line:
sc.textFile() → RDD (lines): "Michael, 29", "Andy, 30", "Justin, 19"
.map(_.split(",")) → RDD: Array("Michael", "29"), Array("Andy", "30"), Array("Justin", "19")
Spark SQL – Reflection
Thirdly, use map() to convert each array to a Person object and store the objects in an RDD:
.map(x => Person(x(0), x(1).trim.toInt))
RDD: Array("Michael", "29"), Array("Andy", "30"), Array("Justin", "19")
→ RDD: Person("Michael", 29), Person("Andy", 30), Person("Justin", 19)
For each element, e.g. Array("Michael", "29"), map() applies a function equivalent to:
Func (x) {
val tmp = new Person(x(0), x(1).trim.toInt)
return tmp }
Lastly, call the .toDF() method to convert the RDD to a DataFrame:
name | age
Michael | 29
Andy | 30
Justin | 19
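The reflection route above can be sketched end to end as follows. It is written for spark-shell or a script; in a compiled application the case class would need to be defined outside the method that calls .toDF(). The in-memory lines stand in for sc.textFile() on people.txt, and the SQL query is only an illustration.

```scala
import org.apache.spark.sql.SparkSession

// The case class defines the schema: its field names become the column names.
case class Person(name: String, age: Int)

val spark = SparkSession.builder()
  .appName("ReflectionSketch") // hypothetical name
  .master("local[*]")
  .getOrCreate()
import spark.implicits._ // provides .toDF() on RDDs of case classes

// Stand-in for sc.textFile(".../people.txt"), so the sketch needs no input file.
val lines = spark.sparkContext.parallelize(Seq("Michael, 29", "Andy, 30", "Justin, 19"))

val peopleDF = lines
  .map(_.split(","))                       // "Michael, 29" -> Array("Michael", " 29")
  .map(x => Person(x(0), x(1).trim.toInt)) // Array -> Person
  .toDF()                                  // RDD[Person] -> DataFrame

// Register the DataFrame as a table and query it with SQL.
peopleDF.createOrReplaceTempView("people")
val teenagers = spark.sql("SELECT name FROM people WHERE age BETWEEN 13 AND 19")
teenagers.show()
```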
Spark SQL – Reflection
case class Student (name: String, age: Int, height: Double)
[Figure: reflection (the toDF method) converts an RDD (Student) of Student case-class objects into a DataFrame (Student) with columns Name (String), Age (Int), and Height (Double).]
*Image is from https://www.ibmastery.com/blog/how-to-write-ib-extended-essay-reflections
Spark SQL – Programmatic Interface
• Often, case classes cannot be defined ahead of time
– E.g. the structure of records is encoded in a string,
– or a text dataset will be parsed and fields will be projected differently for different users
• A DataFrame can be created programmatically with three steps.
– Step 1: Create an RDD of Rows from the original RDD;
– Step 2: Create the schema represented by a StructType matching the structure of Rows in the RDD created in Step 1.
– Step 3: Apply the schema to the RDD of Rows via the createDataFrame method provided by SparkSession.
• Example: people.txt in "/spark/examples/src/main/resources"
sc.textFile() → RDD (lines): "Michael, 29", "Andy, 30", "Justin, 19"
Spark SQL – Programmatic Interface
• Firstly, create an RDD of Rows from the original RDD:
.map(x => Row(x(0), x(1).trim))
RDD: Array("Michael", "29"), Array("Andy", "30"), Array("Justin", "19")
→ RDD: Row("Michael", "29"), Row("Andy", "30"), Row("Justin", "19")
• Secondly, create the schema represented by a StructType matching the structure of the Rows:
Array("name", "age")
→ Array(StructField(name,StringType,true), StructField(age,StringType,true))
→ StructType(StructField(name,StringType,true), StructField(age,StringType,true))
Spark SQL – Programmatic Interface
• Lastly, apply the schema to the RDD:
createDataFrame(rowRDD, schema)
schema: StructType(StructField(name,StringType,true), StructField(age,StringType,true))
rowRDD: Row("Michael", "29"), Row("Andy", "30"), Row("Justin", "19")
DataFrame:
name | age
Michael | 29
Andy | 30
Justin | 19
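The three programmatic steps can be sketched as follows, for spark-shell or a script. The in-memory lines again stand in for people.txt, and the schema string "name age" is an assumption used to show a schema that is only known at run time.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val spark = SparkSession.builder()
  .appName("ProgrammaticSchemaSketch") // hypothetical name
  .master("local[*]")
  .getOrCreate()

// Stand-in for sc.textFile(".../people.txt").
val lines = spark.sparkContext.parallelize(Seq("Michael, 29", "Andy, 30", "Justin, 19"))

// Step 1: create an RDD of Rows from the original RDD.
val rowRDD = lines.map(_.split(",")).map(x => Row(x(0), x(1).trim))

// Step 2: build a StructType matching the Rows (both columns kept as strings here).
val schemaString = "name age"
val fields = schemaString.split(" ")
  .map(fieldName => StructField(fieldName, StringType, nullable = true))
val schema = StructType(fields)

// Step 3: apply the schema to the RDD of Rows.
val peopleDF = spark.createDataFrame(rowRDD, schema)
peopleDF.printSchema()
peopleDF.show()
```

Compared with the reflection route, nothing about the record structure has to be known at compile time, at the cost of untyped Rows.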
Spark SQL – Programmatic Interface
schema (name: String, age: Int, height: Double)
[Figure: an RDD (Student) is converted into an RDD (Rows of Student); applying the schema yields a DataFrame (Student) with columns Name (String), Age (Int), and Height (Double).]
*Image is from https://www.ibmastery.com/blog/how-to-write-ib-extended-essay-reflections
Spark SQL – RDD vs DataFrame
• RDD
– is an immutable distributed collection of elements of your data, partitioned across nodes in your cluster.
– is a lower-level data abstraction supporting fine-grained transformations and actions: map(), filter(), collect(), take(), etc.
• DataFrame
– is an immutable distributed collection of data, but organized into named columns, like a table in a relational database.
– is designed to make processing large data sets easier.
– is a higher-level data abstraction.
• Use cases:
– Unstructured data manipulation (RDD) vs. large structured data queries (DataFrame)
Spark SQL – Save to Files
• The write interface of a DataFrame saves it to files, and .format() supports various output formats:
– JSON, Parquet, JDBC, ORC, LIBSVM, CSV, text, etc.
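Saving and reloading a DataFrame might look like the sketch below, for spark-shell or a script. The people data, the temporary output directory, and the choice of JSON as the format are assumptions for illustration.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("SaveToFilesSketch") // hypothetical name
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Hypothetical data to save.
val people = Seq(("Michael", 29), ("Andy", 30), ("Justin", 19)).toDF("name", "age")

// Spark writes a directory of part files, not a single file.
val outDir = Files.createTempDirectory("people-out").resolve("json").toUri.toString

people.write
  .format("json")    // could also be "parquet", "csv", "orc", ...
  .mode("overwrite") // replace the output if it already exists
  .save(outDir)

// Read the files back to check the round trip.
val reloaded = spark.read.format("json").load(outDir)
reloaded.show()
```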
Spark SQL – DataFrame Operations
• To show a specific column, use .select("[col_name]").show()
SQL equivalent statement: SELECT
Spark SQL – DataFrame Operations
• To show filtered results, use .filter(condition).show()
SQL equivalent: WHERE clause
• In a filter condition such as $"id" > 20, $ turns a field name into a column reference (like a variable).
Spark SQL – DataFrame Operations
• To show aggregated results, use .groupBy("[col_name]").count().show()
SQL equivalent: GROUP BY
Spark SQL – DataFrame Operations
• To sort the results, use .sort("[col_name]").show()
• The sort() method also supports secondary sorting, i.e. sorting by more than one column.
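The select, filter, groupBy, and sort operations can be tried together on a small DataFrame, as in this sketch for spark-shell or a script. The table contents and column names are assumptions for illustration.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("DataFrameOpsSketch") // hypothetical name
  .master("local[*]")
  .getOrCreate()
import spark.implicits._ // enables toDF and the $"col" syntax

// Hypothetical records.
val df = Seq(
  (10, "Ann", 23, "F"),
  (20, "Bob", 31, "M"),
  (30, "Cat", 27, "F"),
  (40, "Dan", 19, "M")
).toDF("id", "name", "age", "gender")

val selected = df.select("name")          // SELECT name
val filtered = df.filter($"id" > 20)      // WHERE id > 20
val counts   = df.groupBy("gender").count() // GROUP BY gender
// Secondary sorting: by gender ascending, then by age descending within each gender.
val sorted   = df.sort($"gender".asc, $"age".desc)

selected.show()
filtered.show()
counts.show()
sorted.show()
```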
Spark SQL – DataFrame Operations
• The sql function on a SparkSession enables applications to run SQL queries programmatically and returns the result as a DataFrame.
• Temporary views in Spark SQL are session-scoped and will disappear if the session that creates them terminates.
Spark SQL – DataFrame Operations
• If you want a temporary view to be shared among all sessions and kept alive until the Spark application terminates, use a global temporary view.
• A global temporary view is tied to the system-preserved database global_temp, so it must be referred to by its qualified name,
– e.g. SELECT * FROM global_temp.view1 WHERE id = 1
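The difference between the two kinds of view can be seen in a short sketch for spark-shell or a script. The data and the view name people are assumptions; view1 follows the slide.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("TempViewSketch") // hypothetical name
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

val df = Seq((1, "Ann"), (2, "Bob")).toDF("id", "name") // hypothetical data

// Session-scoped temporary view: visible only in the session that created it.
df.createOrReplaceTempView("people")
val viaSql = spark.sql("SELECT name FROM people WHERE id = 1")
viaSql.show()

// Global temporary view: lives in the global_temp database for the whole application.
df.createOrReplaceGlobalTempView("view1")
val fromNewSession = spark.newSession()
  .sql("SELECT * FROM global_temp.view1 WHERE id = 1")
fromNewSession.show()
```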
Spark SQL – Connect to MySQL via JDBC
• Spark SQL also includes a data source that can read data from other databases using JDBC.
• We use a MySQL database as a demo:
– MySQL has been installed and set up on VM instance-3
– phpMyAdmin is enabled for accessing MySQL
Spark SQL – Connect to MySQL via JDBC
• Create a database named “sparktest” and create a table “student”
Spark SQL – Connect to MySQL via JDBC
• To get started, you will need to include the JDBC driver for your particular database on the Spark classpath.
• For example, to connect to MySQL from the Spark shell:
– Download the JDBC driver for the specific database: https://dev.mysql.com/downloads/connector/j/5.1.html
– Unzip the compressed file
– Put the jar (mysql-connector-java-5.1.48.jar) in the folder $SPARK/jars
– When starting a spark-shell, specify the driver:
bin/spark-shell --driver-class-path mysql-connector-java-5.1.48-bin.jar --jars mysql-connector-java-5.1.48-bin.jar
Spark SQL – Connect to MySQL via JDBC
• The MySQL database can then be connected to via JDBC.
• The database connection details are supplied via the .option() method:
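A JDBC read might be set up as in the sketch below. The host, port, user, and password are placeholders, not the lecture's settings; sparktest and student follow the slides. The actual load is left commented out because it needs a reachable MySQL server and the Connector/J jar on the classpath.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Builds the options for Spark's JDBC data source; all argument values used below are placeholders.
def mysqlOptions(host: String, database: String, table: String,
                 user: String, password: String): Map[String, String] =
  Map(
    "url"      -> s"jdbc:mysql://$host:3306/$database", // 3306 is MySQL's default port
    "driver"   -> "com.mysql.jdbc.Driver",              // driver class in Connector/J 5.1
    "dbtable"  -> table,
    "user"     -> user,
    "password" -> password
  )

// Reads one table into a DataFrame through JDBC.
def readTable(spark: SparkSession, options: Map[String, String]): DataFrame =
  spark.read.format("jdbc").options(options).load()

val options = mysqlOptions("localhost", "sparktest", "student", "root", "changeme")

// Requires a running MySQL server and the JDBC driver on the classpath:
// val students = readTable(spark, options)
// students.show()
```

Passing the settings as one Map through .options() is equivalent to chaining several .option(key, value) calls.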
Spark SQL – Connect to MySQL via JDBC
• A DataFrame can be written into a MySQL database via JDBC:
• Create Row objects in an RDD:
parallelize() → RDD: "6 Rudd M 23", "7 David M 20"
.map(_.split(" ")) → RDD: Array("6", "Rudd", "M", "23"), Array("7", "David", "M", "20")
.map(x => Row(x(0), x(1), x(2), x(3))) → RDD: Row("6", "Rudd", "M", "23"), Row("7", "David", "M", "20")
• Create a schema
Spark SQL – Connect to MySQL via JDBC
• Apply the schema to the RDD to generate a DataFrame
• Use the write interface in "append" mode to write the records into MySQL via JDBC:
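The write path can be sketched as follows, for spark-shell or a script. The column names (id, name, gender, age), their types, and the connection settings are assumptions; the slides only show the two records. The final write is commented out because it needs a reachable MySQL server.

```scala
import java.util.Properties
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

val spark = SparkSession.builder()
  .appName("JdbcWriteSketch") // hypothetical name
  .master("local[*]")
  .getOrCreate()

// Create Row objects in an RDD from the two new student records.
val rowRDD = spark.sparkContext
  .parallelize(Seq("6 Rudd M 23", "7 David M 20"))
  .map(_.split(" "))
  .map(x => Row(x(0).toInt, x(1), x(2), x(3).toInt))

// Create a schema; the column names and types are assumed, not taken from the slides.
val schema = StructType(Seq(
  StructField("id", IntegerType, nullable = true),
  StructField("name", StringType, nullable = true),
  StructField("gender", StringType, nullable = true),
  StructField("age", IntegerType, nullable = true)
))

// Apply the schema to the RDD to generate a DataFrame.
val studentDF = spark.createDataFrame(rowRDD, schema)
studentDF.show()

// Placeholder connection settings.
val props = new Properties()
props.put("user", "root")
props.put("password", "changeme")
props.put("driver", "com.mysql.jdbc.Driver")

// Requires a running MySQL server with database sparktest and table student:
// studentDF.write.mode("append").jdbc("jdbc:mysql://localhost:3306/sparktest", "student", props)
```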
Machine Learning Basics
Machine Learning (ML) is the scientific study of algorithms and statistical models that computer systems use to effectively perform a specific task without using explicit instructions, relying on patterns and inference instead (Wikipedia).
Types of ML:
• Supervised (and semi-supervised) learning
– Has ground-truth information (e.g. features and labels in training data)
– Classification (binary-class/multi-class/multi-label) & regression (continuous values)
– Naïve Bayes, Decision Trees, SVMs, CNNs, RNNs, etc.
• Unsupervised learning
– No labels (only features)
– Clustering analysis (to find structure in the data)
– K-means, DBSCAN, GMM, Autoencoders, etc.
General Approach for Building Classification Model
Training Set:
Tid | Attrib1 | Attrib2 | Attrib3 | Class
1 | Yes | Large | 125K | No
2 | No | Medium | 100K | No
3 | No | Small | 70K | No
4 | Yes | Medium | 120K | No
5 | No | Large | 95K | Yes
6 | No | Medium | 60K | No
7 | Yes | Large | 220K | No
8 | No | Small | 85K | Yes
9 | No | Medium | 75K | No
10 | No | Small | 90K | Yes
Test Set:
Tid | Attrib1 | Attrib2 | Attrib3 | Class
11 | No | Small | 55K | ?
12 | Yes | Medium | 80K | ?
13 | Yes | Large | 110K | ?
14 | No | Small | 95K | ?
15 | No | Large | 67K | ?
Process I (Induction): Training Set → Learning algorithm → Learn Model → Model (Decision Boundary)
Process II (Deduction): Model → Apply Model → Test Set
Training errors: number of classification errors on training records.
Generalization (test) errors: number of classification errors on test records.
A good model should have low errors of both types.
Underfitting vs Overfitting
• Training errors: number of classification errors on training records.
• Generalization (test) errors: number of classification errors on test records.
• A good model should have low errors of both types.
• Model underfitting: the model is too simple, so both the training error and the generalization error are high.
• Model overfitting: a model that fits the training data too well (with a very low training error) may have a poorer generalization error than a model with a higher training error.
[Figure captions: "Models are too simple!" (underfitting), "Models are too complicated!" (overfitting)]
Evaluating Classification Methods
Predictive accuracy
Precision, p, is the number of correctly classified positive examples divided by the total number of examples that are classified as positive:
$p = \frac{TP}{TP + FP}$
Recall, r, is the number of correctly classified positive examples divided by the total number of actual positive examples in the test set:
$r = \frac{TP}{TP + FN}$
Here TP, FP, and FN are the numbers of true positives, false positives, and false negatives.
Efficiency
• time to construct the model
• time to use the model
Robustness: handling noise and missing values
Scalability: efficiency in disk-resident databases
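Precision and recall as defined above can be computed in a few lines of plain Scala. The label vectors are made up for illustration.

```scala
// true = positive class, false = negative class; both vectors are made-up examples.
val actual    = Seq(true, true,  true, false, false, false, true, false)
val predicted = Seq(true, false, true, true,  false, false, true, false)

val pairs = actual.zip(predicted)
val tp = pairs.count { case (a, p) => a && p }   // positives classified as positive
val fp = pairs.count { case (a, p) => !a && p }  // negatives classified as positive
val fn = pairs.count { case (a, p) => a && !p }  // positives classified as negative

val precision = tp.toDouble / (tp + fp) // p = TP / (TP + FP)
val recall    = tp.toDouble / (tp + fn) // r = TP / (TP + FN)

println(f"TP=$tp FP=$fp FN=$fn precision=$precision%.2f recall=$recall%.2f")
```

With these vectors there are three true positives, one false positive, and one false negative, so both precision and recall come out at 0.75.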
Spark MLlib
• Traditional machine learning algorithms are typically limited to small training data, so their performance depends heavily on the quality of data sampling.
• In the Big Data era, the availability of massive data lets algorithms learn from the complete data distribution.
• Many ML algorithms are based on iterative training, e.g. gradient descent.
• The Hadoop MapReduce computation model cannot cope with a large number of iterations because of its I/O and communication overhead.
• In contrast, Spark's in-memory computing suits iterative ML algorithms well.
• Spark MLlib provides a library of ML algorithms that run in a distributed computing mode.
• Spark users only need to know the basics of the algorithms: input, output, and parameters.
• Taking advantage of Spark's interactive mode, data scientists can run their ML test code over the data and see the responses directly.
Spark MLlib – Feature Extraction, Transformation, and Selection
Apache Spark MLlib includes a variety of algorithms for extracting, transforming, and selecting features:
• Feature Extractors: extracting features from "raw" data.
– TF-IDF, Word2Vec, and CountVectorizer
• Feature Transformers (20+): scaling, converting, or modifying features.
– Tokenizer, StopWordsRemover,
– n-gram, Binarizer,
– Principal Component Analysis (PCA),
– OneHotEncoder,
– etc.
• Feature Selectors: selecting a subset of a larger set of features.
– VectorSlicer, ChiSqSelector, etc.
• Locality Sensitive Hashing
https://data-flair.training/blogs/apache-spark-mllib/
Spark MLlib – Classification and Regression
• Classifiers in Spark MLlib:
– Decision tree classifier
– Random forest classifier
– Multilayer perceptron classifier
– Linear Support Vector Machine
– Naive Bayes
– etc.
• Regressors in Spark MLlib:
– Linear regression
– Generalized linear regression
– Decision tree regression
– Random forest regression
– Gradient-boosted tree regression
– etc.
Spark MLlib – Pipeline Components
• DataFrame:
– is used by ML API and can hold a variety of data types.
– E.g., a DataFrame could have different columns storing text, feature vectors, true labels.
• Estimator:
– abstracts the concept of a learning algorithm or any algorithm that fits or trains on data.
– implements a method fit(), which accepts a DataFrame and produces a Model, which is a Transformer
– E.g., LogisticRegression is an Estimator and can train on a DataFrame
Spark MLlib – Pipeline Components
• Transformer:
– abstracts feature transformations and learned models.
– implements a method transform(), which converts one DataFrame into another, generally by appending one or more columns.
– E.g., a learned LogisticRegression model is a Transformer that transforms test data into test data with predicted labels.
• Parameter:
– All Transformers and Estimators share a common API for specifying parameters.
Spark MLlib – How Pipeline Works
An Estimator Pipeline has multiple stages:
• The first two (Tokenizer and HashingTF) are Transformers (blue)
– The Tokenizer.transform() method splits the raw text documents into words, adding a new column with words to the DataFrame.
– The HashingTF.transform() method converts the words column into feature vectors, adding a new column with those vectors to the DataFrame.
• The third (LogisticRegression) is an Estimator (red).
– the Pipeline first calls LogisticRegression.fit() to produce a LogisticRegressionModel
• The Pipeline.fit() method is called on the original DataFrame, which has raw text documents and labels.
Spark MLlib – How Pipeline Works
The Estimator Pipeline has produced a PipelineModel, which will be used at test time:
• The first two stages (Tokenizer and HashingTF) are still the same Transformers (blue).
• The third stage (LogisticRegressionModel) becomes a Transformer (blue).
– It was produced when the Pipeline called LogisticRegression.fit() during training.
• The PipelineModel's transform() method is called on a test dataset and the data are passed through the fitted pipeline in order.
• Each stage's transform() method updates the dataset and passes it to the next stage.
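The training-time and test-time behaviour described above can be sketched with the three stages named on the slides, for spark-shell or a script. The tiny training and test documents, the number of hash features, and the hyperparameter values are assumptions for illustration.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("PipelineSketch") // hypothetical name
  .master("local[*]")
  .getOrCreate()

// Raw text documents with labels (made-up training data).
val training = spark.createDataFrame(Seq(
  (0L, "a b c d e spark", 1.0),
  (1L, "b d", 0.0),
  (2L, "spark f g h", 1.0),
  (3L, "hadoop mapreduce", 0.0)
)).toDF("id", "text", "label")

// Stage 1 and 2 are Transformers; stage 3 is an Estimator.
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF()
  .setNumFeatures(1000)
  .setInputCol(tokenizer.getOutputCol)
  .setOutputCol("features")
val lr = new LogisticRegression().setMaxIter(10).setRegParam(0.001)

val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))

// Pipeline.fit() runs the Transformers and fits the Estimator, producing a PipelineModel.
val model = pipeline.fit(training)

// Unlabelled test documents (made up).
val test = spark.createDataFrame(Seq(
  (4L, "spark i j k"),
  (5L, "l m n"),
  (6L, "spark hadoop spark"),
  (7L, "apache hadoop")
)).toDF("id", "text")

// PipelineModel.transform() passes the test data through all fitted stages in order.
val predictions = model.transform(test)
predictions.select("id", "text", "probability", "prediction").show(truncate = false)
```

After fitting, the last stage of the PipelineModel is a LogisticRegressionModel, i.e. a Transformer, which is why the same pipeline object can be reused unchanged at test time.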
Spark MLlib – Logistic Regression
Logistic regression is a popular method to predict a categorical response.
• In statistics, the logistic model (or logit model) is used to model the probability of a certain class or event existing,
– such as pass/fail, win/lose, alive/dead or healthy/sick.
• LR can be extended to model several classes of events,
– such as determining whether an image contains a cat, dog, lion, etc.
– Each object being detected in the image would be assigned a probability between 0 and 1, with the probabilities summing to one.
• Logistic regression can be used to predict
– a binary outcome by using binomial logistic regression,
– a multiclass outcome by using multinomial logistic regression.
https://www.bmc.com/blogs/using-logistic-regression-scala-spark/
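How scores become probabilities in the binomial and multinomial cases can be sketched in plain Scala. The functions below are the standard logistic (sigmoid) and softmax functions written out by hand, not MLlib's internal code, and the class scores are made up.

```scala
// Binomial case: the logistic (sigmoid) function maps any score to a probability in (0, 1).
def sigmoid(z: Double): Double = 1.0 / (1.0 + math.exp(-z))

// Multinomial case: softmax turns one score per class into probabilities that sum to one.
def softmax(scores: Seq[Double]): Seq[Double] = {
  val maxScore = scores.max
  val exps = scores.map(s => math.exp(s - maxScore)) // subtract the max for numerical stability
  val total = exps.sum
  exps.map(_ / total)
}

val pPass = sigmoid(0.0)                      // a score of 0 means "undecided"
val classProbs = softmax(Seq(2.0, 1.0, 0.1))  // hypothetical scores for cat, dog, lion

println(s"P(pass) = $pPass")
println(s"P(cat), P(dog), P(lion) = ${classProbs.mkString(", ")}")
```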
Spark MLlib – Decision Tree
• Decision tree learning is one of the most widely used techniques for classification.
• Its classification accuracy is competitive with other methods, and it is very efficient.
• The classification model is a tree, called a decision tree.
• Because of its simplicity, a DT is regarded as an interpretable (explainable) predictive model.
– Many domains require explainability, such as healthcare, finance, etc.
Training Data:
ID | Home Owner | Marital Status | Annual Income | Defaulted Borrower
1 | Yes | Single | 125K | No
2 | No | Married | 100K | No
3 | No | Single | 70K | No
4 | Yes | Married | 120K | No
5 | No | Divorced | 95K | Yes
6 | No | Married | 60K | No
7 | Yes | Divorced | 220K | No
8 | No | Single | 85K | Yes
9 | No | Married | 75K | No
10 | No | Single | 90K | Yes
Record to classify:
Home Owner | Marital Status | Annual Income | Defaulted Borrower
No | Married | 80K | ?
Model: Decision Tree (splitting attributes: Home Owner, MarSt, Income)
Home Owner
– Yes: NO
– No: MarSt
  – Married: NO
  – Single, Divorced: Income
    – < 80K: NO
    – > 80K: YES
Alternative tree for the same data:
MarSt
– Married: NO
– Single, Divorced: Home Owner
  – Yes: NO
  – No: Income
    – < 80K: NO
    – > 80K: YES
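The decision tree on this slide (Home Owner at the root, then MarSt, then Income) can be written directly as nested conditions in plain Scala and checked against the training data. This is a hand-coded copy of the slide's tree, not a tree learned by MLlib; sending an income of exactly 80K down the "> 80K" branch is an assumption, since the slide leaves that case open.

```scala
case class Borrower(homeOwner: Boolean, maritalStatus: String, incomeK: Int)

// Returns true if the tree predicts "Defaulted Borrower = Yes".
def predictsDefault(b: Borrower): Boolean =
  if (b.homeOwner) false                         // Home Owner = Yes -> NO
  else if (b.maritalStatus == "Married") false   // MarSt = Married  -> NO
  else b.incomeK >= 80                           // Single/Divorced: < 80K -> NO, otherwise YES (80K itself: assumed)

// Training data from the slide: (record, actual label).
val training = Seq(
  (Borrower(true,  "Single",   125), false),
  (Borrower(false, "Married",  100), false),
  (Borrower(false, "Single",    70), false),
  (Borrower(true,  "Married",  120), false),
  (Borrower(false, "Divorced",  95), true),
  (Borrower(false, "Married",   60), false),
  (Borrower(true,  "Divorced", 220), false),
  (Borrower(false, "Single",    85), true),
  (Borrower(false, "Married",   75), false),
  (Borrower(false, "Single",    90), true)
)

// Training error: number of training records the tree misclassifies.
val trainingErrors = training.count { case (b, label) => predictsDefault(b) != label }

// The unlabelled record from the slide: not a home owner, married, 80K.
val prediction = predictsDefault(Borrower(false, "Married", 80))

println(s"training errors = $trainingErrors, predicted default = $prediction")
```

Each prediction can be explained by reading off the path taken through the conditions, which is the interpretability argument made on the slide.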
Spark MLlib – Clustering
• Cluster: A collection of data objects
– similar (or related) to one another within the same group
– dissimilar (or unrelated) to the objects in other groups
• Cluster analysis (or clustering, data segmentation, …)
– Finding similarities between data according to the characteristics found in the data and grouping similar data objects into clusters
• Unsupervised learning: no predefined classes (i.e., learning by observations vs. learning by examples: supervised)
• Typical applications
– As a stand-alone tool to get insight into data distribution
  Marketing: help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs
  City-planning: identifying groups of houses according to their house type, value, and geographical location
  Anomaly detection: credit card fraud and theft detection
– As a preprocessing step for other algorithms
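A clustering run with MLlib's KMeans might look like the sketch below, for spark-shell or a script. The six 2-D points, k = 2, and the seed are assumptions chosen so that the two groups are obvious.

```scala
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("KMeansSketch") // hypothetical name
  .master("local[*]")
  .getOrCreate()

// Two well-separated groups of made-up points; there are no labels (unsupervised learning).
val points = Seq(
  Vectors.dense(0.0, 0.0), Vectors.dense(0.1, 0.1), Vectors.dense(0.2, 0.2),
  Vectors.dense(9.0, 9.0), Vectors.dense(9.1, 9.1), Vectors.dense(9.2, 9.2)
).map(Tuple1.apply)
val dataset = spark.createDataFrame(points).toDF("features")

// KMeans is an Estimator: fit() returns a KMeansModel, which is a Transformer.
val kmeans = new KMeans().setK(2).setSeed(1L)
val model = kmeans.fit(dataset)

// transform() appends a "prediction" column holding each point's cluster index.
val clustered = model.transform(dataset)
clustered.show()
model.clusterCenters.foreach(println)
```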
Spark MLlib – Collaborative Filtering (CF)
• Collaborative filtering (CF) is a technique used by recommender systems.
• In a narrower sense, collaborative filtering is a method of making automatic predictions (filtering) about the interests of a user by collecting preferences or taste information from many users (collaborating).
• CF's underlying assumption is that if a person A has the same opinion as a person B on an issue, A is more likely to have B's opinion on a different issue than that of a randomly chosen person.
"Birds of a feather flock together."
https://spark.apache.org/docs/latest/ml-collaborative-filtering.html
https://en.wikipedia.org/wiki/Collaborative_filtering
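The underlying assumption can be illustrated with a small neighbourhood-based predictor in plain Scala. This is not what MLlib does internally (its collaborative filtering is based on ALS matrix factorization); it is a minimal sketch of the idea, and the users, items, and ratings are made up.

```scala
// ratings(user)(item) = score on a 1-5 scale; all values are made up.
val ratings: Map[String, Map[String, Double]] = Map(
  "A" -> Map("item1" -> 5.0, "item2" -> 1.0),
  "B" -> Map("item1" -> 5.0, "item2" -> 1.0, "item3" -> 4.0),
  "C" -> Map("item1" -> 1.0, "item2" -> 5.0, "item3" -> 1.0)
)

// Cosine similarity of two users over the items both have rated.
def cosine(u: Map[String, Double], v: Map[String, Double]): Double = {
  val common = (u.keySet intersect v.keySet).toSeq
  if (common.isEmpty) 0.0
  else {
    val dot   = common.map(i => u(i) * v(i)).sum
    val normU = math.sqrt(common.map(i => u(i) * u(i)).sum)
    val normV = math.sqrt(common.map(i => v(i) * v(i)).sum)
    dot / (normU * normV)
  }
}

// Predict a rating as the similarity-weighted average of other users' ratings for the item.
def predict(user: String, item: String): Double = {
  val neighbours = (ratings - user).toSeq.collect {
    case (_, r) if r.contains(item) => (cosine(ratings(user), r), r(item))
  }
  val weightSum = neighbours.map(_._1).sum
  if (weightSum == 0.0) 0.0
  else neighbours.map { case (w, score) => w * score }.sum / weightSum
}

val simAB = cosine(ratings("A"), ratings("B"))
val simAC = cosine(ratings("A"), ratings("C"))
val predicted = predict("A", "item3")

println(s"sim(A,B) = $simAB, sim(A,C) = $simAC, predicted rating of item3 for A = $predicted")
```

Because A agrees with B on the items both rated, B's opinion of item3 weighs more heavily than C's in the prediction for A.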
Google Dataproc
Dataproc is a Big Data platform for running Apache Spark and Apache Hadoop jobs.
• Automated cluster management
– Resizable clusters
– Autoscaling clusters
– Cluster scheduled deletion
– Automatic or manual configuration
– Flexible virtual machines
• Containerize OSS jobs
• Cloud integrated
• Enterprise security
Google Dataproc Example I – Calculating Pi
https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/JavaSparkPi.java
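The linked example estimates Pi with a Monte Carlo method in Java. A Scala sketch of the same idea follows, for spark-shell or a script; the sample size, the number of slices, and the local master are assumptions (on Dataproc the master is supplied when the job is submitted).

```scala
import scala.util.Random
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("SparkPiSketch") // hypothetical name
  .master("local[*]")       // assumption for local runs
  .getOrCreate()

val slices = 2           // number of partitions to spread the work over
val n = 100000 * slices  // total number of random points

// Throw n random points into the square [-1, 1] x [-1, 1] and count those inside the unit circle.
val inside = spark.sparkContext.parallelize(1 to n, slices).map { _ =>
  val x = Random.nextDouble() * 2 - 1
  val y = Random.nextDouble() * 2 - 1
  if (x * x + y * y <= 1) 1 else 0
}.reduce(_ + _)

// The circle covers Pi/4 of the square, so Pi is about 4 * inside / n.
val piEstimate = 4.0 * inside / n
println(s"Pi is roughly $piEstimate")
```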
Tutorial and Practical Activities in Week 11
Tutorial
• The Solution to Assignment II
• Project Topic Discussion
Practical
• I programming practice
• Project Technical Solution Discussion
Reading Materials
1. Spark SQL Tutorial. https://data-flair.training/blogs/spark-sql-tutorial/
2. Introduction to Spark SQL. https://www.tutorialspoint.com/spark_sql/spark_sql_introduction.htm
3. Machine Learning Library (MLlib) Guide. https://spark.apache.org/docs/latest/ml-guide.html
4. https://www.bmc.com/blogs/using-logistic-regression-scala-spark/
Next (Week 12) Topic:
Security (blockchain, crypto coins) and Privacy