
Starting Point: SparkSession
Entry point into all functionality in Spark SQL
Already created in pyspark
Need to create in self-contained applications:
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

DataFrames
Idea borrowed from pandas and R
A DataFrame is an RDD of Row objects
The fields in a Row can be accessed like attributes

>>> from pyspark.sql import Row
>>> row = Row(name="Alice", age=11)
>>> row
Row(age=11, name='Alice')
>>> row['name'], row['age']
('Alice', 11)
>>> row.name, row.age
('Alice', 11)

Note: Use row['name'] rather than row.name to avoid conflicts with Row's own attributes (Row subclasses tuple, so e.g. row.count resolves to tuple's count method, not a field named count)
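The lookup order behind this note can be illustrated with a minimal pure-Python stand-in that mimics how pyspark's Row subclasses tuple (this is a toy sketch, not the real pyspark class):

```python
class Row(tuple):
    """Toy stand-in mimicking pyspark.sql.Row's tuple-subclass lookup order."""

    def __new__(cls, **kwargs):
        names = sorted(kwargs)  # fields stored in sorted order, like pyspark's kwarg Rows
        obj = super().__new__(cls, tuple(kwargs[n] for n in names))
        obj.__fields__ = names
        return obj

    def __getitem__(self, key):
        if isinstance(key, str):  # row['name'] always looks up the field
            return tuple.__getitem__(self, self.__fields__.index(key))
        return tuple.__getitem__(self, key)

    def __getattr__(self, name):  # only reached when normal attribute lookup fails
        if not name.startswith("__") and name in self.__fields__:
            return self[name]
        raise AttributeError(name)


r = Row(name="Alice", count=3)
print(r["count"])         # bracket access returns the field: 3
print(callable(r.count))  # attribute access found tuple.count first: True
```

Because tuple already defines count and index, attribute access never falls through to `__getattr__` for those names; bracket access sidesteps the problem entirely.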

DataFrame API
RDDs can be used to achieve the same functionality. What are the benefits of using DataFrames?
Readability
Flexibility
Columnar storage
Catalyst optimizer

Plan Optimization & Execution

[Pipeline diagram: SQL AST / DataFrame → Unresolved Logical Plan → Analysis (using the Catalog) → Logical Plan → Logical Optimization → Optimized Logical Plan → Physical Planning → Physical Plans → Cost Model → Selected Physical Plan → Code Generation → RDDs]

DataFrames and SQL share the same optimization/execution pipeline

Tree transformations
Constant folding: rewrite x + (1 + 2) into x + 3.

[Tree: Add node with children Attribute(x) and Literal(3)]

Catalyst is written in Scala because functional programming languages make compiler-style tree transformations natural to express.

Catalyst Rules
Pattern matching functions that transform subtrees into specific structures.
Multiple patterns in the same transform call.

May take multiple batches to reach a fixed point.
(x+0) + (3+3) => x + (3+3) => x+6
transform can contain arbitrary Scala code.
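The multi-batch, fixed-point behavior can be sketched in Python (Catalyst itself is Scala; the names Add, Literal, Attribute mirror Catalyst's expression nodes, but transform_up and the rule below are illustrative, not Catalyst's API). Two rules together reproduce (x+0) + (3+3) => x + 6:

```python
from dataclasses import dataclass

@dataclass
class Literal:
    value: int

@dataclass
class Attribute:
    name: str

@dataclass
class Add:
    left: object
    right: object

def transform_up(node, rule):
    """Apply a rule bottom-up: rewrite children first, then the node itself."""
    if isinstance(node, Add):
        node = Add(transform_up(node.left, rule), transform_up(node.right, rule))
    return rule(node)

def fold_and_simplify(node):
    # Rule 1: constant folding  Add(Literal, Literal) -> Literal
    if isinstance(node, Add) and isinstance(node.left, Literal) \
            and isinstance(node.right, Literal):
        return Literal(node.left.value + node.right.value)
    # Rule 2: x + 0 -> x
    if isinstance(node, Add) and isinstance(node.right, Literal) \
            and node.right.value == 0:
        return node.left
    return node

def optimize(node):
    """Re-run the rule batch until the tree stops changing (a fixed point)."""
    while True:
        new = transform_up(node, fold_and_simplify)
        if new == node:
            return new
        node = new

expr = Add(Add(Attribute("x"), Literal(0)), Add(Literal(3), Literal(3)))
print(optimize(expr))  # Add(left=Attribute(name='x'), right=Literal(value=6))
```

In Catalyst the rules are written as Scala partial functions passed to transform, which is why arbitrary Scala code can appear inside them.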


Logical Optimization
Applies standard rule-based optimizations: constant folding, predicate pushdown, projection pruning, null propagation, boolean expression simplification, etc.

[Pipeline stage: Logical Plan → Logical Optimization → Optimized Logical Plan]

Example: Find expression SUM(x) where x is a DECIMAL(prec, scale). If prec is not too high, convert the SUM into a sum over LONGs, compute the sum, and convert the result back.
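A rough Python sketch of this idea (illustrative, not Spark's actual code path): with a fixed scale, each DECIMAL maps to an integer count of its smallest unit, the sum runs over plain integers (standing in for LONGs), and only the single final result is converted back:

```python
from decimal import Decimal

def sum_decimals_as_longs(values, scale):
    """Sum DECIMAL(prec, scale) values via their unscaled integer representations."""
    # Each DECIMAL becomes an integer number of 10**-scale units.
    unscaled = [int(v.scaleb(scale)) for v in values]
    total = sum(unscaled)                 # integer (LONG-like) arithmetic only
    return Decimal(total).scaleb(-scale)  # convert the one result back to a DECIMAL

prices = [Decimal("19.99"), Decimal("0.01"), Decimal("5.00")]
print(sum_decimals_as_longs(prices, scale=2))  # 25.00
```

The precision check matters because the intermediate integer sum must not overflow a LONG; that is why the rewrite only fires when prec is small enough.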

(events.join(users, 'user_id')
       .filter(events.city == "Hong Kong")
       .select(events.timestamp, users.phone)
       .show())
[Diagram: rule-based optimization of the plan above]
Logical Plan: select ← filter ← join ← (events table, users table)
Physical Plan (predicate pushdown): select ← join ← (filter ← events table, users table)
Physical Plan 2 (column pruning): select ← join ← (select(user_id, city, timestamp) ← filter ← events table, select(user_id, phone) ← users table)
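The effect of predicate pushdown can be sketched with toy in-memory "tables" (lists of dicts; an illustration of the rewrite, not pyspark code): filtering events before the join shrinks the join input, while the final rows are identical:

```python
events = [
    {"user_id": 1, "city": "Hong Kong", "timestamp": 100},
    {"user_id": 2, "city": "London",    "timestamp": 200},
    {"user_id": 1, "city": "Hong Kong", "timestamp": 300},
]
users = [{"user_id": 1, "phone": "555-0001"},
         {"user_id": 2, "phone": "555-0002"}]

def join(left, right, key):
    """Toy equi-join: merge every pair of rows whose key values match."""
    return [{**l, **r} for l in left for r in right if l[key] == r[key]]

# Naive plan: join first, filter after -- the join touches every event row.
naive = [r for r in join(events, users, "user_id") if r["city"] == "Hong Kong"]

# Pushed-down plan: filter events first, so the join sees only matching rows.
pushed = join([e for e in events if e["city"] == "Hong Kong"], users, "user_id")

result = [(r["timestamp"], r["phone"]) for r in pushed]  # projection last
print(result)  # [(100, '555-0001'), (300, '555-0001')]
assert naive == pushed  # same answer, smaller intermediate result
```

Column pruning works the same way: projecting only user_id, city, timestamp (and user_id, phone) before the join would shrink each row as well as the row count.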

Plan Optimization & Execution
[Same pipeline diagram, now highlighting Physical Planning: Optimized Logical Plan → Physical Planning → Physical Plans → Cost Model → Selected Physical Plan]

Cost-based optimization
Currently only used for choosing join algorithms
Big table joining big table: shuffle
Small table joining big table: broadcast the small table
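The broadcast choice can be sketched in plain Python (a conceptual model, not Spark's implementation): build a hash map from the small table once, ship a copy to every partition of the big table, and probe locally, so the big table is never shuffled:

```python
def broadcast_hash_join(big_partitions, small_table, key):
    """Join each partition of the big table against a broadcast copy of the small one."""
    # Build the hash map once; Spark would serialize this to every executor.
    lookup = {row[key]: row for row in small_table}
    out = []
    for partition in big_partitions:   # each partition joins locally, no shuffle
        for row in partition:
            match = lookup.get(row[key])
            if match is not None:
                out.append({**row, **match})
    return out

events = [  # big table, already split into partitions
    [{"user_id": 1, "city": "Hong Kong"}, {"user_id": 3, "city": "Tokyo"}],
    [{"user_id": 2, "city": "London"}],
]
users = [{"user_id": 1, "phone": "555-0001"},
         {"user_id": 2, "phone": "555-0002"}]

print(broadcast_hash_join(events, users, "user_id"))
```

In pyspark the same hint is available as functions.broadcast(small_df) inside a join; the cost model applies it automatically when the small side is below the broadcast threshold.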
Rewrite the PMI example using DataFrame API:

A Bug in SparkSQL
Bug report:
https://issues.apache.org/jira/browse/SPARK-20169
