

MET CS 689 B1
Designing and Implementing a Data Warehouse

Mary E Letourneau
Modeling Big Data
April 15, 2020


The 3 V’s – effects on modeling
Volume – Big data is typically in the petabyte range. This presents a serious burden on administrators if put into a relational database. However, this concerns only storage, not modeling.
Velocity – Millions to billions of data records (structured or not) arrive every day. Again, this is less a modeling concern than an administrative one.
Variety – Numerous structures, plus semi-structured and unstructured (text, media) data. There can be no static data model for Big Data.

Note: Various individuals have added to these over the years (e.g. Veracity, Value, Visualization, Variability) but it starts with these three.

Modeling the Unmodelable
Partial structure for data
raw text
“flat files”
XML files
Media – images, sound, video
Sensor data
Emails
Office docs

Data Flow for Big Data
Gather
Analyze
Process
Distribute
(see Krishnan ch 11)

Gather unstructured
Need to Gather/Acquire in order to analyze
Choose storage that fits your intended final integration
Regular files
Database
Hadoop distributed files
other distributed storage

Analysis
This is the stage that turns the ‘fire hose’ of raw bytes into some kind of information
All data is analyzed, even data that already has structure
The aim is not permanent fact storage, but insight discovery
Apply multiple algorithms of multiple types
Use generic storage mechanism

Structuring your metadata
Technical Metadata
Business Metadata
Contextual Metadata
Process-Design-Level metadata
Program-level metadata
Operational Metadata
Business Intelligence metadata

Technical Metadata
Table layout including indexes
Source system of record
data transformations
types of transformations
source data
destination of transformed data
administration

Business Metadata
business objects and structure
form of the data attributes
data element dictionary
confidence/veracity
business validation rules
history
ownership/stewardship

Contextual Metadata
system of record (data warehouse)
incoming feed
type of incoming record
embedded information
tags
EXIF
SKUs
data from surrounding records

Process-Design-Level metadata
Data Source and Target
Lookup-data references
Source platform and address
Target platform and address
Extract or Feed mechanism
Exception handling
Business rules

Program-level metadata
Source code management
System and program name
Authorship and history
Load scripts
Dependencies
Business rules applied by program

Operational Metadata
Frequency 
Record counts 
Usage 
Processing time 
Security

Business Intelligence metadata
Data mining metadata
OLAP Dimensional/cube structuring
Reporting

Metadata on Analysis Algorithms
Name of algorithm
Category of algorithm (e.g. pattern processing)
Implementation version history
forms of applicable input data
structure of parameter data
specification of results storage
parameters used in each run, and against which sources
history of parameter changes (see the example record sketched below)
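A minimal sketch of what one such algorithm-metadata record might look like, written here as a plain Python dictionary; the field names and values are illustrative assumptions, not a prescribed schema.

```python
# Hypothetical metadata record for one analysis algorithm (field names are illustrative).
algorithm_metadata = {
    "name": "kmeans_customer_segments",
    "category": "data mining",                      # e.g. text mining, pattern processing
    "versions": ["1.0", "1.1", "2.0"],              # implementation version history
    "input_forms": ["csv", "parquet"],              # forms of applicable input data
    "parameter_schema": {"k": "int", "max_iter": "int"},
    "results_storage": "generic key-value store",   # specification of results storage
    "runs": [                                       # parameters used, and against which sources
        {"source": "clickstream_2020_04", "parameters": {"k": 8, "max_iter": 300}},
    ],
    "parameter_history": [
        {"date": "2020-04-01", "change": "k raised from 5 to 8"},
    ],
}
```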

Categories of Algorithms
Text mining
Data mining
Pattern processing
Statistical analysis
Mathematical models

MET CS 689 B1
Designing and Implementing a Data Warehouse

Mary E Letourneau
Alternative Storage for Big Data
April 15, 2020


Alternative Storage for Big Data
Revisiting the Storage Problem
CAP Theorem
What is “NoSQL”?
Parallel Data Manipulation
Column Stores
Document Databases
Hadoop DFS
Others

Revisiting the Storage Problem
Relational databases seem a poor fit for big data
can capacity scale with “agility” (no DBA)?
transaction controls cause bottlenecks
read optimization?
Volume/Velocity
Swallow large data volumes fast
avoid crazy hardware swapping
High-parallelism storage is the answer

CAP Theorem
Distributed Data should have three qualities
Consistency (not the same kind as in ACID)
Copies of data are the same across the system
Availability
Queries (read or write) succeed
Partition Tolerance
System still works even if data is partitioned across multiple nodes

CAP Theorem (cont'd)
The CAP Theorem proves you can "choose any two"
A conventional RDBMS holds Consistency and Availability
Big Data's high parallelism means partitioning data across the network
Wide partitioning risks dropped data
Big Data must have "P" – which other one?
BASE – Basically Available, Soft-State, Eventually Consistent (see the sketch below)
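A toy sketch of BASE behavior, assuming two in-memory replicas of a key-value store; real systems replicate over a network, but the stale-read-then-converge pattern is the same idea.

```python
# Toy illustration of eventual consistency (BASE) -- not a real replication protocol.
replica_a = {}
replica_b = {}

def write(key, value):
    # The write is acknowledged once one replica accepts it ("basically available").
    replica_a[key] = value

def sync():
    # Replication happens later, in the background ("eventually consistent").
    replica_b.update(replica_a)

write("user:42", "premium")
print(replica_b.get("user:42"))   # None -> a stale read before replication catches up
sync()
print(replica_b.get("user:42"))   # 'premium' -> replicas have converged
```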

What is “NoSQL”?
Not Only SQL
Large volume
Non-relational
(Usually) no schema – "schema on read" (see the sketch after this list)
Often no standardized query language available
Broad horizontal scalability (over network)
Loose guarantees of data consistency – “Eventually consistent”
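A small sketch of "schema on read", assuming newline-delimited JSON records whose fields vary from record to record; the structure is imposed by the reader at query time, not by the store.

```python
import json

# Records with no fixed schema -- each one carries whatever fields it has.
raw_records = [
    '{"id": 1, "name": "Ada", "email": "ada@example.com"}',
    '{"id": 2, "name": "Grace", "phone": "555-0100"}',
    '{"id": 3, "handle": "@linus"}',
]

# "Schema on read": the reader decides which fields matter and how to handle gaps.
for line in raw_records:
    record = json.loads(line)
    print(record.get("id"), record.get("name", "<unknown>"), record.get("email", "-"))
```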

Parallel Data Manipulation
Push processing operations out from central node to storage nodes
Horizontal scaling of processing
Most Big Data stores support a form of MapReduce (Google's term; see the sketch after this list)
Map filters and rekeys source data
Shuffle redistributes by keys
Reduce processes/aggregates
MapReduce actions can be pipelined
Supported by many kinds of Big Data stores
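A minimal single-process sketch of the three MapReduce phases, assuming a word-count job; a real framework runs the same map, shuffle, and reduce steps in parallel across many storage nodes.

```python
from collections import defaultdict

docs = ["big data moves fast", "big data is varied"]

# Map: filter and re-key the source data into (key, value) pairs.
mapped = [(word, 1) for doc in docs for word in doc.split()]

# Shuffle: redistribute the pairs so all values for a key land together.
shuffled = defaultdict(list)
for key, value in mapped:
    shuffled[key].append(value)

# Reduce: process/aggregate the values for each key.
counts = {key: sum(values) for key, values in shuffled.items()}
print(counts)   # {'big': 2, 'data': 2, 'moves': 1, ...}
```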

Column Stores
Contrasted to RDBMS ‘row stores’
All values for a “column” are stored together
Like a relational DB, column values are usually ‘scalar’ or ‘atomic’ rather than an embedded data structure per se
Values are stored with a key
Columns often distributed between nodes for parallelism
MAIN BENEFIT: streamed inserts can be exceptionally fast
Different columns can be correlated by using the same key (see the sketch after this list)
Legitimate storage mechanism for relational database
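A small sketch of the column-store layout, assuming plain Python dictionaries: each "column" keeps its values keyed by row key, and a row is reassembled by looking up the same key in every column.

```python
# Each column is stored (and can be distributed) separately, keyed by the row key.
order_date   = {"o1": "2020-04-01", "o2": "2020-04-02"}
order_amount = {"o1": 125.00, "o2": 89.50}
order_region = {"o1": "US-East", "o2": "EU-West"}

# A streamed insert is just a fast keyed append into each column.
order_date["o3"], order_amount["o3"], order_region["o3"] = "2020-04-03", 42.10, "US-West"

# Columns are correlated by the shared key, so a "row" can be rebuilt on demand.
def get_order(key):
    return {"date": order_date[key], "amount": order_amount[key], "region": order_region[key]}

print(get_order("o3"))
```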

Widely Used Column Stores
Greenplum
Amazon Redshift
SAP HANA
InfoBright
Vertica

Document Databases
Information is stored as “documents”
Documents are often most similar to "programming objects" à la C++ or Java
Information is stored in a 'serialized' fashion such as XML or Binary JSON ("BSON")
Keys, paths, or URIs are used to store and retrieve (see the sketch after this list)
May be tuned for scanning by search engine to retrieve by enclosed fields
Replicated for parallelism
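A minimal sketch of the document-store pattern, assuming an in-memory dict of serialized JSON documents; real document databases (MongoDB, CouchDB, etc.) add indexing, replication, and query languages on top of this basic idea.

```python
import json

store = {}   # key (or path/URI) -> serialized document

def put(key, document):
    store[key] = json.dumps(document)     # stored serialized (JSON here; MongoDB uses BSON)

def get(key):
    return json.loads(store[key])         # retrieved whole, by key

def find(field, value):
    # Scan documents and match on an enclosed field -- what search-engine tuning speeds up.
    matches = []
    for raw in store.values():
        record = json.loads(raw)
        if record.get(field) == value:
            matches.append(record)
    return matches

put("cust/1001", {"name": "Ada", "orders": [{"sku": "A-17", "qty": 2}]})
print(get("cust/1001"))
print(find("name", "Ada"))
```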

Widely-Used Document Databases
MongoDB
CouchDB
Solr
Elasticsearch

Hadoop DFS
Massive-parallelism distributed file system
Run on commodity hardware
Exists in a “cluster” of HDFS “nodes”
Multiple replicas of files (or their blocks) across nodes in the cluster
Replication provides high fault tolerance (see the sketch after this list)
Built for MapReduce, to support processing at the storage nodes
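A toy sketch of the HDFS storage idea, assuming a file split into fixed-size blocks and each block copied to several nodes; the block size, replication factor, and node names are made up for illustration, and this is not the real HDFS API.

```python
# Toy illustration of HDFS-style block replication (not the real HDFS API).
BLOCK_SIZE = 16                  # bytes, absurdly small for the example (HDFS defaults to ~128 MB)
REPLICATION_FACTOR = 3
nodes = {f"node{i}": {} for i in range(1, 6)}   # a hypothetical cluster of 5 data nodes

def store_file(name, data):
    node_names = list(nodes)
    blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
    for block_no, block in enumerate(blocks):
        # Each block goes to REPLICATION_FACTOR different nodes, so losing one node loses no data.
        for r in range(REPLICATION_FACTOR):
            target = node_names[(block_no + r) % len(node_names)]
            nodes[target][(name, block_no)] = block

store_file("clickstream.log", b"0123456789abcdefghijklmnopqrstuv")
print({node: sorted(blocks) for node, blocks in nodes.items()})
```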

Others
Key-Value Stores (Cassandra)
Column Families (HBase)
Graph database (Neo4j)

Key Points
Visualizing Big Data
Gather, Analyze, Process, Distribute
Data Science and Analytics
CAP vs ACID
Issues that the three V’s create
Metadata – types and sources
Big Data Technologies/Solutions
Krishnan’s goals of information life-cycle management
Krishnan’s technology layers

Have a Good Evening and a Great Week!
End of Presentation
