7CCSMBDT – Big Data Technologies Practical
HIVE
The practical commands are to be executed in Cloudera Quickstart VM. General reference (documentation of HIVE)
https://cwiki.apache.org/confluence/display/Hive/LanguageManual
A. Open a terminal in Cloudera and type hive. You will see the prompt:
hive>
1. CREATE DATABASE is a statement used to create a database in Hive. A database in Hive is a namespace or a collection of tables. The syntax for this statement is as follows:
CREATE DATABASE [IF NOT EXISTS] database_name;
Here, IF NOT EXISTS is an optional clause, which suppresses the error that Hive would otherwise raise if a database with the same name already exists. For details, see: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-CreateTableCreate/Drop/TruncateTable
Use the CREATE DATABASE statement to create a database named userdb.
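A minimal form of the statement (a sketch, using the database name userdb from the task) looks like:

```sql
-- Creates the database only if it does not already exist;
-- without IF NOT EXISTS, Hive raises an error when userdb exists.
CREATE DATABASE IF NOT EXISTS userdb;
```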
2. SHOW DATABASES is a statement used to list the existing databases in HIVE. To try it, type:
SHOW DATABASES;
You should see output like the following:
OK
default
userdb
Time taken: 3.794 seconds, Fetched: 2 row(s)
3. DROP DATABASE is a statement that drops all the tables and deletes the database. Its syntax is as follows:
DROP DATABASE [IF EXISTS] database_name [RESTRICT|CASCADE];
The CASCADE keyword causes tables to be dropped before the database is dropped. Using the RESTRICT keyword instead of CASCADE is equivalent to the default behavior, where existing tables must be dropped before dropping the database. When a database is dropped, its directory is also deleted. Note: For external tables, the metadata is deleted but the data is not.
For details, read
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-Create/Drop/Alter/UseDatabase
Use the DROP DATABASE command with the CASCADE keyword to drop the database userdb.
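One way to do this (a sketch; CASCADE drops the tables inside userdb together with the database itself):

```sql
-- IF EXISTS suppresses the error if userdb has already been dropped;
-- CASCADE removes any tables in userdb before deleting the database.
DROP DATABASE IF EXISTS userdb CASCADE;
```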
4. CREATE TABLE First, create a new database named mydb. Then, execute
use mydb;
This command tells HIVE that we will use the database mydb in our subsequent queries. Now, create a new table by executing:
create table employee (eid int, name String, salary String, destination String) COMMENT 'Employee details' ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n' STORED AS TEXTFILE;
If the schema specified differs from the schema in the table that already exists, Hive won’t warn you. If your intention is for this table to have the new schema, you’ll have to drop the old table, losing your data, and then re-create it.
5. Open a terminal in Cloudera and create a text file, sample.txt, containing the following data. Do not copy and paste, because text formatting is important here. Type the text, making sure that there is a tab between attribute values and a newline after each line. If this fails, a select * from employee will show you records with NULL values.
1201  George   45000  Technical manager
1202  Mike     45000  Proof reader
1203  Michael  40000  Technical writer
1204  Kyle     40000  Hr Admin
1205  Kris     30000  Op Admin
6. LOAD DATA Inside the hive shell, type
load data local inpath '/home/cloudera/sample.txt' overwrite into table employee;
On success, you should see something like:
Loading data to table default.employee
Table default.employee stats: [numFiles=1, numRows=0, totalSize=201, rawDataSize=0]
OK
Time taken: 7.516 seconds
hive>
The local keyword tells Hive to read the input file from the local filesystem rather than from HDFS.
If we wanted to get data from an HDFS file, we would just use inpath, e.g. if we had a file myfile.txt in the HDFS directory /user/hive, we would use:
load data inpath '/user/hive/myfile.txt' overwrite into table employee;
7. SELECT Read about the select command https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Select
Use the SELECT command to retrieve all records of the table employee with eid greater than 1203.
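A query along these lines should work (a sketch of the standard WHERE-clause form):

```sql
-- Retrieve every column of the employees whose eid exceeds 1203
SELECT * FROM employee WHERE eid > 1203;
```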
8. HADOOP HDFS FROM HIVE Use Hadoop dfs commands inside HIVE to find out: (i) the HDFS directory for the database mydb we created
(ii) the HDFS directory for the table employee
(iii) the HDFS file corresponding to the table employee
In addition, display the contents of the HDFS file in (iii).
You can run the hadoop dfs … commands from within the hive CLI; just drop the word hadoop from the command and add a semicolon at the end. Example:
hive> dfs -ls / ;
Found 3 items
drwxr-xr-x   - root    supergroup          0 2011-08-17 16:27 /etl
drwxr-xr-x   - edward  supergroup          0 2012-01-18 15:51 /flag
drwxrwxr-x   - hadoop  supergroup          0 2010-02-03 17:50 /users
This method of accessing hadoop commands is actually more efficient than using the hadoop dfs … equivalent at the bash shell, because the latter starts up a new JVM instance each time, whereas Hive just runs the same code in its current process.
You can see a full listing of help on the options supported by dfs using this command: hive> dfs -help;
Recall that the default directory for hive is /user/hive/warehouse
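As a sketch, the tasks above can be approached by listing the warehouse directory level by level (the .db suffix for database directories and the preserved sample.txt file name are Hive's default conventions; the exact names on your VM may differ):

```sql
-- (i) database directories live under the warehouse root, e.g. mydb.db
dfs -ls /user/hive/warehouse;
-- (ii) each table gets a subdirectory inside its database directory
dfs -ls /user/hive/warehouse/mydb.db;
-- (iii) the table directory holds the data file(s) loaded into it
dfs -ls /user/hive/warehouse/mydb.db/employee;
-- display the contents of the data file loaded earlier
dfs -cat /user/hive/warehouse/mydb.db/employee/sample.txt;
```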
9. DESCRIBE The DESCRIBE command can be used to show details about a table. It has two different modes, EXTENDED and FORMATTED. DESCRIBE shows the list of columns including partition columns for the given table. If the EXTENDED keyword is specified then it will show all the metadata for the table in Thrift serialized form. This is generally only useful for debugging and not for general use. If the FORMATTED keyword is specified, then it will show the metadata in a tabular format.
See
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-Describe
Use the DESCRIBE command as follows and observe the outcome in each case: hive> DESCRIBE EXTENDED mydb.employee;
hive> DESCRIBE FORMATTED mydb.employee;
10. ALTER TABLE Most table properties can be altered with ALTER TABLE statements, which change metadata about the table but not the data itself. These statements can be used to fix mistakes in schema, move partition locations, and do other operations. ALTER TABLE modifies table metadata only. The data for the table is untouched. It’s up to you to ensure that any modifications are consistent with the actual data.
See:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-AlterTable
for how ALTER works.
See:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types#LanguageManualTypes-ColumnTypes for the BIGINT data type
and
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types#LanguageManualTypes-MiscTypes
for the array data type that you will need below.
See:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-Add/ReplaceColumns for adding columns.
Tasks (you can check the result of your commands with the DESCRIBE command):
(i) Use the ALTER command to rename the table employee to employee2.
(ii) Change column eid to column new_id. The new_id column needs to be of type BIGINT.
(iii) Add a new column extra_ids. This column needs to be of type array of INT.
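The three tasks can be sketched as follows (the statements assume the table and column names used in this practical; check each step with DESCRIBE):

```sql
-- (i) rename the table
ALTER TABLE employee RENAME TO employee2;
-- (ii) rename column eid to new_id and change its type to BIGINT
ALTER TABLE employee2 CHANGE eid new_id BIGINT;
-- (iii) add a column holding an array of INT values
ALTER TABLE employee2 ADD COLUMNS (extra_ids ARRAY<INT>);
```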
11. HQL FILES In addition to typing commands in the hive shell, you can also include the commands in an .hql file and execute it from the command line (i.e., the Cloudera terminal) with hive -f
i) Create a file query.hql containing commands to retrieve the salary from the records of employee2 with new_id>1203. Execute the file.
ii) Create a file query2.hql containing commands to retrieve the average salary of users with new_id <=1203 and execute the file.
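For example, query.hql might contain the following (a sketch; the exact projection is up to you):

```sql
-- contents of /home/cloudera/query.hql
USE mydb;
SELECT salary FROM employee2 WHERE new_id > 1203;
```

It would then be executed from the Cloudera terminal with: hive -f /home/cloudera/query.hql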
12. PARTITIONED TABLES Hive organizes tables into partitions, based on the values of partition columns. Using partitions, it is faster to query a portion of the data. Read the section "Partitioned Tables": https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-PartitionedTables
(i) Create a table employee3 with the attributes id, name, salary_dq, and destination, whose types are int, string, string, and string, respectively. The table employee3 needs to be partitioned by salary.
(ii) Read the section “Inserting data into Hive Tables”:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML#LanguageManualDML-InsertingdataintoHiveTablesfromqueries
Using the "INSERT INTO" command (three times) insert the records of employee2 with salary 45000 into the partition for salary=45000, the records of employee2 with salary 40000 into the partition for salary=40000, and the records of employee2 with salary 30000 into the partition for salary=30000. Since employee2 has an extra attribute, you need to project the records of employee2 onto the first four attributes before executing INSERT INTO.
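A sketch of one of the three inserts (assuming employee3 was created with a PARTITIONED BY (salary string) clause; the SELECT projects away the extra_ids column of employee2):

```sql
-- route the salary=45000 records of employee2 into the matching partition
INSERT INTO TABLE employee3 PARTITION (salary='45000')
SELECT new_id, name, salary, destination
FROM employee2
WHERE salary = '45000';
```

The remaining two inserts follow the same pattern with 40000 and 30000.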
(iii) Write a query to select the names of employees whose salary is equal to 45000 from employee2. In addition, write a query to select the names of employees whose salary is equal to 45000 from employee3. How much time does each query take? Why?
Read http://blog.cloudera.com/blog/2014/08/improving-query-performance-using-partitioning-in-apache-hive/ to find out more about partitioning.
13. UDFS User-Defined Functions (UDFs) are a powerful feature that allows users to extend HiveQL. Custom UDFs need to be written in JAVA, so we will not cover them (if somebody is interested, they can read https://dzone.com/articles/writing-custom-hive-udf-andudaf ). We will cover existing UDFs in HIVE.
(i) Type
SHOW FUNCTIONS;
to get a list of the functions currently loaded in the Hive session.
(ii) Use the DESCRIBE FUNCTION command to learn more about the UDFs split, count, and explode.
For example, DESCRIBE FUNCTION count shows information about count.
Also, read the following sections from
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF
• String Functions for split()
• Built-in Aggregate Functions (UDAF) for count()
• Built-in Table-Generating Functions (UDTF) for explode()
and https://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html for regular expressions (HIVE uses the same syntax as JAVA).
(iii) Use split to separate each destination value, in each record of employee2, into words. For example, "Technical manager" should become an array ["Technical","manager"]
(iv) Modify the query in (iii) to return each element of each array in a separate row.
(v) Modify the query in (iv) to count the number of distinct returned values.
(vi) Write a query that displays each distinct word and its count in the destination attribute of employee2.
Example output:
Hr 1
Manager 1
Op 1
Proof 1
Technical 2
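One way to arrive at output of this shape is to combine split, explode, and a GROUP BY (a sketch; the subquery gives the exploded words a usable alias):

```sql
-- split each destination into words, flatten the arrays into rows,
-- then count how often each distinct word occurs
SELECT word, count(*)
FROM (SELECT explode(split(destination, ' ')) AS word FROM employee2) w
GROUP BY word;
```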
14. WORDCOUNT in HIVE
i) Create a table dictionary which contains each word in the words file as a string. The "words" file is in /usr/share/dict.
For this, you need to create a table and then load the words file into the table.
ii) Test that you have created and loaded the table correctly by executing:
select count(*) from dictionary;
The output must be 479829.
iii) Write a query that displays each word in the dictionary table and its length.
iv) Write a query that displays the length of the longest word in the dictionary table.
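The two length queries can be sketched with the built-in length() UDF and the max() aggregate (this assumes the single column of the dictionary table was named word when you created it):

```sql
-- (iii) every word together with its length
SELECT word, length(word) FROM dictionary;
-- (iv) the length of the longest word
SELECT max(length(word)) FROM dictionary;
```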