MET CS 689 B1
Designing and Implementing a Data Warehouse
Mary E Letourneau
Modeling Big Data
April 15, 2020
The 3 V’s – effects on modeling
Volume: Big data is typically in the petabyte range. This is a serious burden on administrators if the data is put into a relational database, but it is a storage concern, not a modeling one.
Velocity: Millions to billions of data records (structured or not) arrive every day. Again, this is more an administrative concern than a modeling one.
Variety: Numerous structures, plus semi-structured and unstructured (text, media) data. There can be no single static data model for Big Data.
Note: Various authors have added more V’s over the years (e.g. Veracity, Value, Visualization, Variability), but it starts with these three.
Modeling the Unmodelable
Data with only partial structure (or none):
raw text
“flat files”
XML files
Media – images, sound, video
Sensor data
Emails
Office docs
Data Flow for Big Data
Gather
Analyze
Process
Distribute
(see Krishnan ch 11)
Gather unstructured
Data must be gathered/acquired before it can be analyzed
Choose storage that fits your intended final integration
Regular files
Database
Hadoop distributed files
other distributed storage
Analysis
This is the stage that turns the ‘fire hose’ of raw bytes into some kind of information
All data is analyzed, even data that already has structure
The aim is not permanent fact storage but insight discovery
Apply multiple algorithms of multiple types
Use a generic storage mechanism (see the sketch below)
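A minimal Python sketch of this stage, assuming two made-up algorithms and a hypothetical insights.json output file; it only illustrates the idea of running several algorithms of different types and keeping the results in a generic, schema-free form.

    import json
    from datetime import datetime, timezone

    def word_frequencies(records):
        # One "text mining" style algorithm: count word occurrences.
        counts = {}
        for text in records:
            for word in text.lower().split():
                counts[word] = counts.get(word, 0) + 1
        return counts

    def record_lengths(records):
        # One "statistical" style algorithm: simple length statistics.
        return {"min": min(map(len, records)), "max": max(map(len, records))}

    records = ["big data moves fast", "variety breaks static models"]
    insights = []
    for name, algorithm in {"word_frequencies": word_frequencies,
                            "record_lengths": record_lengths}.items():
        insights.append({"algorithm": name,
                         "run_at": datetime.now(timezone.utc).isoformat(),
                         "result": algorithm(records)})

    with open("insights.json", "w") as f:    # generic storage: schema-free JSON
        json.dump(insights, f, indent=2)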
Structuring your metadata
Technical Metadata
Business Metadata
Contextual Metadata
Process-Design-Level metadata
Program-level metadata
Operational Metadata
Business Intelligence metadata
Technical Metadata
Table layout including indexes
Source system of record
data transformations
types of transformations
source data
destination of transformed data
administration
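As a hedged illustration, one technical metadata entry could be represented as a simple Python record like the following; all field, table, and system names are made up, not a standard.

    # Illustrative technical metadata for one table and one transformation step.
    technical_metadata = {
        "table": "sales_fact",
        "indexes": ["sales_fact_pk", "sales_fact_date_idx"],
        "source_system_of_record": "POS_ORACLE_PROD",
        "transformation": {
            "type": "aggregate",                      # type of transformation
            "source_data": "pos.sales_line_items",    # source data
            "destination": "dw.sales_fact",           # destination of transformed data
        },
        "administration": {"owner": "dw_admin", "retention_days": 1825},
    }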
Business Metadata
business objects and structure
form of the data attributes
data element dictionary
confidence/veracity
business validation rules
history
ownership/stewardship
Contextual Metadata
system of record (data warehouse)
incoming feed
type of incoming record
embedded information
tags
EXIF (see the example below)
SKUs
data from surrounding records
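For example, contextual metadata embedded in an image can be pulled from its EXIF tags. A small sketch, assuming the Pillow library is installed and a local file named photo.jpg exists:

    # Read EXIF tags embedded in an image and map numeric tag ids to names.
    from PIL import Image
    from PIL.ExifTags import TAGS

    img = Image.open("photo.jpg")
    exif = img.getexif()                 # dict-like mapping: tag id -> value
    contextual = {TAGS.get(tag_id, tag_id): value for tag_id, value in exif.items()}
    print(contextual.get("DateTime"), contextual.get("Model"))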
Process-Design-Level metadata
Data Source and Target
Lookup-data references
Source platform and address
Target platform and address
Extract or Feed mechanism
Exception handling
Business rules
Program-Level Metadata
Source code management
System and program name
Authorship and history
Load scripts
Dependencies
Business rules applied by program
Operational Metadata
Frequency
Record counts
Usage
Processing time
Security
Business Intelligence metadata
Data mining metadata
OLAP Dimensional/cube structuring
Reporting
Metadata on Analysis Algorithms
Name of algorithm
Category of algorithm (e.g. pattern processing)
Implementation version history
forms of applicable input data
structure of parameter data
specification of results storage
parameters used in runs for which sources
history of parameter changes
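A hedged sketch of how such algorithm metadata might be recorded as a Python structure; the field names and values are illustrative only, not a standard.

    # Illustrative (not standardized) metadata record for one analysis algorithm.
    algorithm_metadata = {
        "name": "k_means_clustering",
        "category": "pattern processing",
        "version_history": ["1.0", "1.1"],
        "applicable_inputs": ["numeric feature vectors"],
        "parameter_schema": {"k": "int", "max_iter": "int"},
        "results_storage": "insights.cluster_assignments",
        "runs": [
            {"source": "web_clickstream_2020_04",
             "parameters": {"k": 8, "max_iter": 300}},
        ],
        "parameter_change_history": ["2020-03-01: k raised from 5 to 8"],
    }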
Categories of Algorithms
Text mining
Data mining
Pattern processing
Statistical analysis
Mathematical models
MET CS 689 B1
Designing and Implementing a Data Warehouse
Mary E Letourneau
Alternative Storage for Big Data
April 15, 2020
Alternative Storage for Big Data
Revisiting the Storage Problem
CAP Theorem
What is “NoSQL”?
Parallel Data Manipulation
Column Stores
Document Databases
Hadoop DFS
Others
Revisiting the Storage Problem
Relational databases seem a poor fit for big data
can capacity scale with “agility” (without waiting on a DBA)?
transaction controls cause bottlenecks
read optimization?
Volume/Velocity
Swallow large data volumes fast
avoid constant hardware swap-outs
High-parallelism storage is the answer
CAP Theorem
Ideally, distributed data would have three qualities
Consistency (not the same kind as in ACID)
Copies of data are the same across the system
Availability
Queries (read or write) succeed
Partition Tolerance
The system keeps working even when a network partition separates the nodes holding the data
CAP Theorem (cont’d)
The CAP Theorem proves you can guarantee at most two of the three (“choose any two”)
A conventional RDBMS chooses Consistency and Availability
Big Data’s high parallelism means partitioning data across a network
Wide Partitioning risks dropped data
Big Data must have “P” – which other one?
BASE – Basically Available, Soft-State, Eventually Consistent
What is “NoSQL”?
Not Only SQL
Large volume
Non-relational
(Usually) no schema – “schema on read” (sketched below)
Often no standardized query language available
Broad horizontal scalability (over network)
Loose guarantees of data consistency – “Eventually consistent”
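A small Python sketch of “schema on read”, with made-up records: nothing is enforced when the data is written, and structure is imposed only when it is read.

    import json

    # Records stored with no enforced schema; fields vary from record to record.
    raw_lines = [
        '{"user": "a17", "action": "click", "page": "/home"}',
        '{"user": "b02", "action": "purchase", "sku": "X-991", "amount": 19.95}',
    ]

    for line in raw_lines:
        record = json.loads(line)           # no schema applied at write time
        user = record.get("user")           # structure imposed here, at read time
        amount = record.get("amount", 0.0)  # missing fields get defaults
        print(user, amount)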
Parallel Data Manipulation
Push processing operations out from central node to storage nodes
Horizontal scaling of processing
Most Big Data stores support a form of MapReduce (Google’s term; sketched below)
Map filters and rekeys source data
Shuffle redistributes by keys
Reduce processes/aggregates
MapReduce actions can be pipelined
Supported by many kinds of Big Data stores
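A minimal in-process sketch of the Map → Shuffle → Reduce pattern, here as a plain-Python word count; a real engine such as Hadoop runs each phase in parallel across many storage nodes.

    from collections import defaultdict

    def map_phase(lines):
        for line in lines:                    # Map: filter and re-key source data
            for word in line.lower().split():
                yield (word, 1)

    def shuffle_phase(pairs):
        groups = defaultdict(list)            # Shuffle: redistribute pairs by key
        for key, value in pairs:
            groups[key].append(value)
        return groups

    def reduce_phase(groups):
        return {key: sum(vals) for key, vals in groups.items()}  # Reduce: aggregate

    lines = ["big data moves fast", "big data is varied"]
    print(reduce_phase(shuffle_phase(map_phase(lines))))
    # {'big': 2, 'data': 2, 'moves': 1, 'fast': 1, 'is': 1, 'varied': 1}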
Column Stores
Contrasted to RDBMS ‘row stores’
All values for a “column” are stored together
As in a relational DB, column values are usually ‘scalar’ or ‘atomic’ rather than embedded data structures
Values are stored with a key
Columns often distributed between nodes for parallelism
MAIN BENEFIT: streamed inserts can be exceptionally fast
Different columns can be correlated by using the same key (see the sketch below)
A legitimate storage mechanism for a relational database as well
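A toy Python illustration of the column-oriented layout: each column keeps its values together, keyed by row id, so columns can live on different nodes and still be correlated by key. The data values are made up.

    # Each "column" is stored separately, keyed by row id.
    columns = {
        "customer_name": {1: "Ada", 2: "Grace", 3: "Edsger"},
        "region":        {1: "East", 2: "West", 3: "East"},
        "total_spend":   {1: 120.0, 2: 340.5, 3: 99.9},
    }

    # Reassemble "row" 2 by pulling the same key from every column.
    row_2 = {col: values[2] for col, values in columns.items()}
    print(row_2)  # {'customer_name': 'Grace', 'region': 'West', 'total_spend': 340.5}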
Widely Used Column Stores
Greenplum
Amazon Redshift
SAP HANA
InfoBright
Vertica
Document Databases
Information is stored as “documents”
Documents most resemble “programming objects” à la C++ or Java
Information is stored in a ‘serialized’ form such as XML or Binary JSON (“BSON”) – see the example below
Keys, paths, or URIs are used to store and retrieve documents
May be tuned so a search engine can scan documents and retrieve them by their enclosed fields
Replicated for parallelism
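A hedged sketch of storing and retrieving one document, assuming a MongoDB server on localhost and the pymongo driver; the database, collection, and field names are made up.

    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")
    db = client["warehouse"]

    # Store a schema-free document; the driver serializes it to BSON.
    db.events.insert_one({"user": "a17", "action": "click", "tags": ["promo", "spring"]})

    # Retrieve by an enclosed field.
    doc = db.events.find_one({"user": "a17"})
    print(doc["action"], doc["tags"])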
Widely-Used Document Databases
MongoDB
CouchDB
Solr
Elasticsearch
Hadoop DFS
Massively parallel distributed file system
Runs on commodity hardware
Exists in a “cluster” of HDFS “nodes”
Multiple replicas of files (or blocks of them) across the nodes in the cluster
Replication provides high fault tolerance
Built for MapReduce to support processing at storage nodes
Others
Key-Value Stores (Cassandra)
Column Families (HBase)
Graph database (Neo4j)
Key Points
Visualizing Big Data
Gather, Analyze, Process, Distribute
Data Science and Analytics
CAP vs ACID
Issues that the three V’s create
Metadata – types and sources
Big Data Technologies/Solutions
Krishnan’s goals of information life-cycle management
Krishnan’s technology layers
Have a Good Evening and a Great Week!
End of Presentation