INFO20003 Database Systems
Dr Renata Borovica-Gajic
Lecture 22
Adaptive databases for the future Introducing research avenues (non-examinable)
Week 11
Data, data everywhere…
www.tableausoftware.com
[The Economist]
Year
50-fold from 2010-2020*
* ¡°The Digital Universe in 2020: Big Data, Bigger Digital Shadows, and Biggest Growth in the Far East¡°, 2012, IDC
http://reportlogix.com/reporting.html
And grows exponentially… 2
Data Amounts
Finding useful information
Equals to finding the needle in a haystack 3
Data analysis with databases
Database systems (DB):
4 decades of research, predominant data analysis tool
SQL Queries
select val from sales where id = max;
Results
Process Access Store
DB System
4
From data to knowledge
Time
Too expensive
query optimize
prepare store
insight
data
The luxury is long gone
Database (DB) System
5
Unless…..
6
Data analysis with databases
SQL Queries
Results
Process Access
Store
DB System
7
Store data carefully
Storage tiering in private and public clouds
Hot Warm Cold
Tape
[VLDB¡¯16, ADMS¡¯17]
Memory Disk
8
Cost
Performance
Store data carefully
Storage tiering in private and public clouds
[VLDB¡¯16, ADMS¡¯17]
Storing 100TB of data [Horison, 2015]
Hot Warm Cold
Cold Storage Cold Storage
Tape
400
350
300
250
200
150
100
50 0
$159,641
Memory Disk
Trad. Cold 3-tier 2-tier
Embrace new technology
9
Cost
Performance
Cost (x1000$)
Cost benefit without performance penalty
Setting: multitenant enterprise datacenter, clients: TPCH 50, Q12, CSD: shared, layout: one client per group
5
Cold storage
PostgreSQL
HDD 4 Ideal
3 2 1 0
Skipper on Skipper
Cold Storage
12345 Number of clients
10
Average exec. time (x1000 sec)
Skipper to the rescue
Virtualised enterprise data center
VM1
DB1
Multi-way joins: VM2 VM3
VM
2.
MJoin
Hash Scan B
PostgreSQL Opportunistic execution
triggered upon data arrival
DB2 DB3
Hash Scan A
Hash Scan C
Network
1.
3.
Cache Management
A2
I/O Scheduler object-group map.
A1 B1 C1
Novel ranking Balances access across groups and
Progress driven caching: Favors caching of objects to maximise query progress
algorithm:
efficiency
across clients
Cold Storage
fairness
11
Lesson #1
Embrace new HW technology
12
SQL Queries
Results
Process Access Store
DB System
Data analysis
13
Choose access strategy on-the-fly
[DBTest¡¯12, ICDE¡¯15, VLDBJ¡®18] Find ¡®age¡¯ from people where name = ¡®Smith¡¯
Index Scan
Smooth Scan
Full Scan
Result size
Adapt to data
14
Execution time
Morphing mechanism
Modes:
1. Index Access: Traditional index access
2. Entire Page Probe: Index access probes entire page
3. Gradual Flattening Access: Probe adjacent region(s)
…
INDEX
HEAP PAGES
Mode 1
Mode 2
Mode 3
15
Smooth Scan in action
Setting: Micro-benchmark, 25GB table, Order by, Selectivity 0-100% 100000
10000 1000 100 10 1 0.1
Full Scan
Index Scan
Smooth Scan
0 0.001 0.01
0.1 1 Selectivity(%)
50 75 100
Near-optimal over entire selectivity range 16
20
Execution time (sec) (log scale)
Lesson #2
Learn from data
17
Data analysis with databases
SQL Queries
Results
Process Access Store
DB System
18
Process instantly
[SIGMOD¡¯12, VLDB¡¯12, CACM¡¯15]
q4
q4
q3
q2
q1
q3
q2
q4
q3
q1
q2
q1
80% reuse within 3 hours
[Facebook]
Adapt to queries
19
DB
File querying NoDB (MR)
Time Loading
PostgresRaw: NoDB from idea to practice
1. Positional
Pointers to end of tuples
indexing
1|Supplier#01|17|335-1736|5755.94|each slyly…
2|Supplier#02|5|861-2259|4032.68| slyly bold… 3|Supplier#03|1|516-1199|4192.40|blithely… 4|Supplier#04|15|787-7479|4641.08|riously eve… 5|Supplier#05|11|21-151-690-3663|-283.84|. Slyly… 6|Supplier#06|14|24-696-997- 4969|1365.79|final…..
scan
3. Statistics
Pointers to
attributes
2. Cache
Workload
NationKey Name
Supplier#01
Supplier#02
…
17
5
…
10 5 0
1 3 5 7 9 111315
# Buckets
Adjust to queries = progressively cheaper 20
Frequency
1800
1600
1400
1200
1000
800
600
400
200
0
~ 7000
~ 4806
Q20 Q19 Q18 Q17 Q16 Q15 Q14 Q13 Q12 Q11 Q10 Q9 Q8 Q7 Q6 Q5 Q4 Q3 Q2 Q1 Load
PostgresRaw in action
Setting: 7.5M tuples, 150 attributes, 11GB file
Queries: 10 arbitrary attributes per query, vary selectivity
Data-to-insight time halved with PostgresRaw
PostgresRaw
MySQL CSV Engine DBMS X DBMS X PostgreSQL
MySQL w/ external files
Per query performance comparable to traditional DBMS
21
Execution time (sec)
Lesson #3
Learn from queries
22
Self-designing systems for data analysis
¡°It is not the strongest species that survive, nor the most intelligent, but the ones most responsive to change.¡± Charles Darwin
Queries
[SIGMOD¡¯12] [VLDB¡¯12] [CACM¡¯15]
Adapt Refine
DB System
Fast response
Data
[DBTest¡¯12] [ICDE¡¯15] [VLDBJ¡¯18]
Hardware
[VLDB¡¯16] [ADMS¡¯17]
Thank you!
Anyone can be a data scientist with self-driving DB
23