
COMP9313: Big Data Management
Course web site: http://www.cse.unsw.edu.au/~cs9313/

Chapter 8.2: Hive

Introduction to Hive

What is Hive?
❖ A data warehouse system for Hadoop that
➢ facilitates easy data summarization
➢ supports ad-hoc queries (still batch though…)
➢ was created by Facebook
❖ A mechanism to project structure onto data and query it using a SQL-like language, HiveQL
➢ Interactive console, or
➢ Execute scripts
➢ Kicks off one or more MapReduce jobs in the background
❖ The ability to use indexes and built-in or user-defined functions
❖ Latest version: 3.1.2, works with Hadoop 3.x.y

Motivation of Hive
❖ Limitations of MapReduce
➢ Have to use the M/R programming model
➢ Code is not reusable
➢ Error prone
➢ Complex jobs require multiple stages of Map/Reduce functions
 This is just like asking developers to write a specific physical execution plan in a database
❖ Hive intuition
➢ Make unstructured data look like tables regardless of how it is really laid out
➢ SQL-based queries can be issued directly against these tables
➢ Hive generates a specific execution plan for each query

Hive Features
❖ A subset of SQL covering the most common statements
❖ Agile data types: Array, Map, Struct, and JSON objects
❖ User Defined Functions and Aggregates
❖ Regular expression support
❖ MapReduce support
❖ JDBC support
❖ Partitions and Buckets (for performance optimization)
❖ Views and Indexes

Word Count using MapReduce

Word Count using Hive
create table doc(
  text string
) row format delimited fields terminated by '\n' stored as textfile;

load data local inpath '/home/Words' overwrite into table doc;

SELECT word, COUNT(*) FROM
  (SELECT explode(split(text, ' ')) AS word FROM doc) wTable
GROUP BY word;

Word Count using Hive
create table doc(
  text string
) row format delimited fields terminated by '\n' stored as textfile;

load data local inpath '/home/Words' overwrite into table doc;

SELECT word, COUNT(*) FROM
  doc LATERAL VIEW explode(split(text, ' ')) wTable AS word
GROUP BY word;

Architecture of Hive

Architecture of Hive
❖ Metastore
➢ The component that stores the system catalog and metadata about tables, columns, partitions, etc.
➢ Stored in a relational DBMS (built-in: Derby)

Architecture of Hive
❖ Driver: manages the lifecycle of a HiveQL statement as it moves through Hive.
➢ Query Compiler: compiles HiveQL into map/reduce tasks
➢ Optimizer: generates the best execution plan
➢ Execution Engine: executes the tasks produced by the compiler in proper dependency order. The execution engine interacts with the underlying Hadoop instance.

Architecture of Hive
❖ Thrift Server: cross-language support
➢ Provides a Thrift interface and a JDBC/ODBC server, and offers a way of integrating Hive with other applications

Architecture of Hive
❖ Client Components
➢ Includes the Command Line Interface (CLI), the web UI, and the JDBC/ODBC driver

Hive Installation and Configuration
❖ Download at: https://hive.apache.org/downloads.html
❖ The latest version: 3.1.0
❖ Install:
$ tar xzf apache-hive-3.1.0-bin.tar.gz
$ mv apache-hive-3.1.0-bin ~/hive
❖ Environment variables in ~/.bashrc:
export HIVE_HOME=~/hive
export PATH=$HIVE_HOME/bin:$PATH
❖ Create /tmp and /user/hive/warehouse in HDFS and chmod them g+w so that more than one user can use them:
$ hdfs dfs -mkdir /tmp
$ hdfs dfs -mkdir /user/hive/warehouse
$ hdfs dfs -chmod g+w /tmp
$ hdfs dfs -chmod g+w /user/hive/warehouse
❖ Run the schematool command to initialize Hive:
$ schematool -dbType derby -initSchema
❖ Start the Hive shell:
$ hive
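❖ To sanity-check the installation (a suggested extra step, not part of the original instructions), run a statement non-interactively with the CLI's -e option:
$ hive -e 'SHOW DATABASES;'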

Hive Type System
❖ Primitive types
➢ Integers: TINYINT, SMALLINT, INT, BIGINT
➢ Boolean: BOOLEAN
➢ Floating point numbers: FLOAT, DOUBLE.
➢ Fixed point numbers: DECIMAL
➢ String: STRING, CHAR, VARCHAR.
➢ Date and time types: TIMESTAMP, DATE
❖ Complex types
➢ Structs: c has type {a INT; b INT}; c.a accesses the first field
➢ Maps: M['group'] accesses the value stored under key 'group'
➢ Arrays: for A = ['a', 'b', 'c'], A[1] returns 'b' (indexing is 0-based)
➢ Types can be composed, e.g., list<map<string, struct<p1:int, p2:int>>>
 Represents a list of associative arrays that map strings to structs that contain two ints
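As an illustrative sketch (the table and column names are invented for this example), such types are declared and accessed as follows:

-- Hypothetical table using the complex types above
CREATE TABLE complex_demo (
  c   STRUCT<a:INT, b:INT>,
  m   MAP<STRING, INT>,
  arr ARRAY<STRING>
);

-- Field, key, and index access
SELECT c.a, m['group'], arr[1] FROM complex_demo;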

Hive Data Model
❖ Databases: Namespaces function to avoid naming conflicts for tables, views, partitions, columns, and so on.
❖ Tables: Homogeneous units of data which have the same schema
➢ Analogous to tables in relational DBs
➢ Each table has corresponding directory in HDFS.
➢ An example table, page_views, with columns:
 timestamp: of INT type, a UNIX timestamp of when the page was viewed
 userid: of BIGINT type, identifies the user who viewed the page
 page_url: of STRING type, captures the location of the page
 referer_url: of STRING type, captures the location of the page from where the user arrived at the current page
 IP: of STRING type, captures the IP address from where the page request was made

Hive Data Model (Cont.)
❖ Partitions:
➢ Each table can have one or more partition keys which determine how the data is stored
➢ Example:
 Given the table page_views, we can define two partition columns: date_partition of type STRING and country_partition of type STRING
 All "US" data from "2009-12-23" is a partition of the page_views table
➢ Partition columns are virtual columns; they are not part of the data itself but are derived on load
➢ It is the user's job to guarantee the relationship between partition name and data content
❖ Buckets: Data in each partition may in turn be divided into buckets based on the value of a hash function of some column of the table
➢ Example: the page_views table may be bucketed by userid
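As a sketch using the partition columns above, restricting a query to one partition is just a WHERE clause on the virtual columns; Hive then reads only that partition's directory:

-- Reads only the date_partition='2009-12-23', country_partition='US' directory
SELECT page_url
FROM page_views
WHERE date_partition = '2009-12-23' AND country_partition = 'US';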

Data Model and Storage
Tables (HDFS directories) → Partitions (subdirectories) → Buckets (files)

Create Table
See full CREATE TABLE command at:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL
CREATE TABLE [IF NOT EXISTS] [db_name.]table_name
[(col_name data_type [COMMENT col_comment], …)]
[COMMENT table_comment]
[PARTITIONED BY (col_name data_type [COMMENT col_comment], …)]
[CLUSTERED BY (col_name, col_name, …) [SORTED BY (col_name [ASC|DESC], …)] INTO num_buckets BUCKETS]
[ROW FORMAT row_format]
[STORED AS file_format]

SerDe
❖ SerDe is a short name for "Serializer and Deserializer"
➢ Describes how to load the data from a file into a representation that makes it look like a table
❖ Hive uses SerDe (and FileFormat) to read and write table rows:
❖ HDFS files -> InputFileFormat -> <key, value> -> Deserializer -> Row object
❖ Row object -> Serializer -> <key, value> -> OutputFileFormat -> HDFS files
❖ More details see: https://cwiki.apache.org/confluence/display/Hive/DeveloperGuide#DeveloperGuide-HiveSerDe
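As a hedged illustration of naming an explicit SerDe class in the DDL (OpenCSVSerde ships with Hive's serde library in recent releases; the table here is invented for the example):

CREATE TABLE csv_demo (name STRING, age STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
STORED AS TEXTFILE;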

row_format
: DELIMITED [FIELDS TERMINATED BY char [ESCAPED BY char]]
  [COLLECTION ITEMS TERMINATED BY char]
  [MAP KEYS TERMINATED BY char]
  [LINES TERMINATED BY char]
Default values: Ctrl+A, Ctrl+B, Ctrl+C, new line, respectively
file_format:
: SEQUENCEFILE
| TEXTFILE -- (Default, depending on hive.default.fileformat configuration)
| RCFILE -- (Note: Available in Hive 0.6.0 and later)
| ORC -- (Note: Available in Hive 0.11.0 and later)
| PARQUET -- (Note: Available in Hive 0.13.0 and later)
| AVRO -- (Note: Available in Hive 0.14.0 and later)
| INPUTFORMAT input_format_classname OUTPUTFORMAT output_format_classname

Create Table Example
❖ Example:
CREATE TABLE page_view(viewTime INT, userid BIGINT,
    page_url STRING, referrer_url STRING,
    ip STRING COMMENT 'IP Address of the User')
COMMENT 'This is the page view table'
PARTITIONED BY(dt STRING, country STRING)
CLUSTERED BY(userid) SORTED BY(viewTime) INTO 32 BUCKETS
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\001'
  COLLECTION ITEMS TERMINATED BY '\002'
  MAP KEYS TERMINATED BY '\003'
  LINES TERMINATED BY '\n'
STORED AS TEXTFILE;

Browsing Tables and Partitions
❖ To list existing tables in the warehouse
➢ SHOW TABLES;
❖ To list tables with prefix 'page'
➢ SHOW TABLES 'page.*';
❖ To list partitions of a table
➢ SHOW PARTITIONS page_view;
❖ To list columns and column types of a table
➢ DESCRIBE page_view;

Alter Table/Partition/Column
❖ To rename existing table to a new name
➢ ALTER TABLE old_table_name RENAME TO new_table_name;
❖ To replace the columns of an existing table
➢ ALTER TABLE old_table_name REPLACE COLUMNS (col1 TYPE, …);
❖ To add columns to an existing table
➢ ALTER TABLE tab1 ADD COLUMNS (c1 INT COMMENT 'a new int column', c2 STRING DEFAULT 'def val');
❖ To rename a partition
➢ ALTER TABLE table_name PARTITION old_partition_spec RENAME TO PARTITION new_partition_spec;
❖ To rename a column
➢ ALTER TABLE table_name CHANGE old_col_name new_col_name column_type
❖ More details see:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-AlterTable

Drop Table/Partition
❖ To drop a table
➢ DROP TABLE [IF EXISTS] table_name;
➢ Example:
 DROP TABLE page_view;
❖ To drop a partition
➢ ALTER TABLE table_name DROP [IF EXISTS] PARTITION partition_spec[, PARTITION partition_spec, …];
➢ Example:
 ALTER TABLE pv_users DROP PARTITION (ds='2008-08-08');

Loading Data
❖ Hive does not do any transformation while loading data into tables. Load operations are currently pure copy/move operations that move data files into locations corresponding to Hive tables.
LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 …)]
➢ Load data from a file in the local file system:
 LOAD DATA LOCAL INPATH '/tmp/pv_2008-06-08_us.txt' INTO TABLE page_view PARTITION(date='2008-06-08', country='US')
➢ Load data from a file in HDFS:
 LOAD DATA INPATH '/user/data/pv_2008-06-08_us.txt' INTO TABLE page_view PARTITION(date='2008-06-08', country='US')
➢ The input data format must be the same as the table format!

Insert Data
❖ Insert rows into a table:
➢ Syntax:
INSERT INTO TABLE tablename [PARTITION (partcol1[=val1], partcol2[=val2] …)] VALUES values_row [, values_row …]
❖ Insert data into Hive tables from queries:
➢ Syntax:
INSERT INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 …)] select_statement FROM from_statement;
➢ Example:
INSERT OVERWRITE TABLE user_active
SELECT user.*
FROM user
WHERE user.active = 1;

Update Data
UPDATE tablename SET column = value [, column = value …] [WHERE expression]
❖ Synopsis
➢ The referenced column must be a column of the table being updated
➢ The value assigned must be an expression that Hive supports in the SELECT clause; arithmetic operators, UDFs, casts, literals, etc. are supported, but subqueries are not
➢ Only rows that match the WHERE clause will be updated.
➢ Partitioning columns cannot be updated.
➢ Bucketing columns cannot be updated.
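A minimal sketch (the table and columns are hypothetical; note that UPDATE only works on tables that support ACID transactions, e.g., bucketed ORC tables created with TBLPROPERTIES ('transactional'='true')):

UPDATE users
SET active = 0
WHERE last_login < '2018-01-01';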

Query Data
❖ Select syntax:
SELECT [ALL | DISTINCT] select_expr, select_expr, …
FROM table_reference
[WHERE where_condition]
[GROUP BY col_list]
[ORDER BY col_list]
[CLUSTER BY col_list | [DISTRIBUTE BY col_list] [SORT BY col_list]]
[LIMIT number]

Order, Sort, Cluster, and Distribute By
❖ Difference between Order By and Sort By
➢ The former guarantees total order in the output, while the latter only guarantees ordering of the rows within a reducer
❖ Cluster By
➢ Cluster By is a short-cut for both Distribute By and Sort By.
➢ Hive uses the columns in Distribute By to distribute the rows among reducers. All rows with the same Distribute By columns will go to the same reducer. However, Distribute By does not guarantee clustering or sorting properties on the distributed keys.
[Figure: rows with keys x1, x2, x3, x4 routed to reducers, comparing Distribute By with Cluster By; see the sketch below]
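A hedged sketch of the equivalence (table and column taken from the earlier page_view example):

-- Rows with the same userid go to the same reducer and are sorted within it
SELECT * FROM page_view DISTRIBUTE BY userid SORT BY userid;

-- CLUSTER BY is shorthand for the above
SELECT * FROM page_view CLUSTER BY userid;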

Query Examples
❖ Select column 'foo' from all rows of partition ds=2008-08-15 of the invites table. The results are not stored anywhere, but are displayed on the console.
hive> SELECT a.foo FROM invites a WHERE a.ds='2008-08-15';
❖ Select all rows from partition ds=2008-08-15 of the invites table into an HDFS directory.
hive> INSERT OVERWRITE DIRECTORY '/tmp/hdfs_out' SELECT a.* FROM invites a WHERE a.ds='2008-08-15';
❖ Select all rows from the pokes table into a local directory.
hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/local_out' SELECT a.* FROM pokes a;

❖ Count the number of distinct users by gender:
INSERT OVERWRITE TABLE pv_gender_sum
SELECT pv_users.gender, count(DISTINCT pv_users.userid)
FROM pv_users
GROUP BY pv_users.gender;
❖ Multiple DISTINCT expressions in the same query are not allowed:
INSERT OVERWRITE TABLE pv_gender_agg
SELECT pv_users.gender, count(DISTINCT pv_users.userid), count(DISTINCT pv_users.ip)
FROM pv_users
GROUP BY pv_users.gender;

Joins
❖ Hive does not support join conditions that are not equality conditions
➢ It is very difficult to express such conditions as a map/reduce job
➢ Allowed: SELECT a.* FROM a JOIN b ON (a.id = b.id)
➢ However, the following statement is not allowed:
 SELECT a.* FROM a JOIN b ON (a.id <> b.id)
❖ More than 2 tables can be joined in the same query
➢ SELECT a.val, b.val, c.val FROM a JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key2)
❖ Example:
SELECT s.word, s.freq, k.freq FROM shakespeare s
JOIN bible k ON (s.word = k.word) WHERE s.freq >= 1 AND k.freq >= 1 ORDER BY s.freq DESC LIMIT 10;

Behind the Scenes
SELECT s.word, s.freq, k.freq FROM shakespeare s
JOIN bible k ON (s.word = k.word) WHERE s.freq >= 1 AND k.freq >= 1 ORDER BY s.freq DESC LIMIT 10;
(Abstract Syntax Tree)
(TOK_QUERY (TOK_FROM (TOK_JOIN (TOK_TABREF shakespeare s) (TOK_TABREF bible k) (= (. (TOK_TABLE_OR_COL s) word) (. (TOK_TABLE_OR_COL k) word)))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (. (TOK_TABLE_OR_COL s) word)) (TOK_SELEXPR (. (TOK_TABLE_OR_COL s) freq)) (TOK_SELEXPR (. (TOK_TABLE_OR_COL k) freq))) (TOK_WHERE (AND (>= (. (TOK_TABLE_OR_COL s) freq) 1) (>= (. (TOK_TABLE_OR_COL k) freq) 1))) (TOK_ORDERBY (TOK_TABSORTCOLNAMEDESC (. (TOK_TABLE_OR_COL s) freq))) (TOK_LIMIT 10)))
(compiled into one or more MapReduce jobs)

Operators and User Functions (UDFs)
❖ Built-in operators:
➢ relational, arithmetic, logical, etc.
❖ Built-in functions:
➢ mathematical, date functions, string functions, etc.
❖ Built-in aggregate functions:
➢ max, min, count, etc.
❖ Built-in table-generating functions: transform a single input row to multiple output rows
➢ explode(ARRAY): returns one row for each element from the array
➢ explode(MAP): returns one row for each key-value pair from the input map, with two columns in each row
❖ Create custom UDFs
❖ More details see: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-explode
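A small sketch of both table-generating forms, using literal values:

-- explode(ARRAY): one row per array element
SELECT explode(array('a', 'b', 'c')) AS element;

-- explode(MAP): one row per key-value pair, two columns per row
SELECT explode(map('k1', 1, 'k2', 2)) AS (key, value);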

❖ Create a table in Hive
create table doc(
  text string
) row format delimited fields terminated by '\n' stored as textfile;
❖ Load a file into the table
load data local inpath '/home/Words' overwrite into table doc;
❖ Compute word count using SELECT
➢ explode() takes in an array (or a map) as input and outputs the elements of the array (map) as separate rows
➢ LATERAL VIEW is used in conjunction with user-defined table-generating functions such as explode()
➢ A lateral view first applies the UDTF to each row of the base table and then joins the resulting output rows to form a virtual table
SELECT word, COUNT(*) FROM
  doc LATERAL VIEW explode(split(text, ' ')) wTable AS word
GROUP BY word;

explode() Function
❖ explode() takes in an array (or a map) as an input and outputs the elements of the array (map) as separate rows.
❖ The following will return a table of words in doc, with a single column word
SELECT explode(split(text, ' ')) AS word FROM doc
❖ The following will compute the frequency of each word
SELECT word, COUNT(*)
FROM (SELECT explode(split(text, ' ')) AS word FROM doc) AS words
GROUP BY word;

Lateral View
❖ Lateral view is used in conjunction with user-defined table generating functions such as explode()
❖ A lateral view first applies the UDTF (User Defined Tabular Function) to each row of base table and then joins resulting output rows to form a virtual table.
❖ Lateral View Syntax
➢ lateralView: LATERAL VIEW udtf(expression) tableAlias AS columnAlias (',' columnAlias)*
➢ fromClause: FROM baseTable (lateralView)*
❖ Compare the two ways:
SELECT word, COUNT(*) FROM
  (SELECT explode(split(text, ' ')) AS word FROM doc) AS words
GROUP BY word;

SELECT word, COUNT(*) FROM
  doc LATERAL VIEW explode(split(text, ' ')) wTable AS word
GROUP BY word;

Writing HIVE Scripts
❖ Rather than executing HQL statements one-by-one in a Hive shell, you can bundle them into a script and execute them all together. This is also a good way to save your statements, edit them, and run them easily whenever you like.
❖ To execute the statements in the file, just enter the following command in the terminal: hive -f frequency.hql
CREATE TABLE doc (
  line STRING
) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\n' STORED AS TEXTFILE;

LOAD DATA LOCAL INPATH 'text.txt' OVERWRITE INTO TABLE doc;

SELECT word, COUNT(*)
FROM doc LATERAL VIEW explode(split(line, ' ')) words AS word
GROUP BY word;
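If the MapReduce progress messages clutter the output, the hive CLI's silent flag can help (a standard option; redirect to capture the results):
$ hive -S -f frequency.hql > output.txt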

Pros and Cons of Hive
❖ Pros
➢ An easy way to process large-scale data
➢ Supports SQL-based queries
➢ Provides more user-defined interfaces to extend
➢ Programmability
➢ Efficient execution plans for performance
➢ Interoperability with other databases
❖ Cons
➢ No easy way to append data
 Files in HDFS are immutable

Applications of Hive
❖ Log processing
➢ Daily reports
➢ User activity measurement
❖ Data/Text mining
➢ Machine learning (training data)
❖ Business intelligence
➢ Advertising delivery
➢ Spam detection

References
❖ https://cwiki.apache.org/confluence/display/Hive/Home#Home-HiveDocumentation
❖ http://www.tutorialspoint.com/hive/
❖ Hadoop: The Definitive Guide, Hive chapter

End of Chapter 8.2