
7CCSMBDT – Big Data Technologies Practical
HIVE
The practical commands are to be executed in the Cloudera Quickstart VM. General reference (HIVE documentation):
https://cwiki.apache.org/confluence/display/Hive/LanguageManual
A. Open a terminal in Cloudera and type hive. You will see the prompt:
hive>
1. CREATE DATABASE is a statement used to create a database in Hive. A database in Hive is a namespace or a collection of tables. The syntax for this statement is as follows:
CREATE DATABASE [IF NOT EXISTS] database_name;
Here, IF NOT EXISTS is an optional clause that suppresses the error which is otherwise raised when a database with the same name already exists. For details, see: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-CreateTableCreate/Drop/TruncateTable
Use the command CREATE DATABASE to create a database with name userdb.
SOL: CREATE DATABASE userdb;
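Because of the optional clause, re-running the statement as follows is safe: it simply skips creation when userdb already exists, instead of raising an error:
hive> CREATE DATABASE IF NOT EXISTS userdb;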
2. SHOW DATABASES is a statement used to list the existing databases in HIVE. To try it, type:
SHOW DATABASES;
You must see output like the following:
OK
default
userdb
Time taken: 3.794 seconds, Fetched: 2 row(s)

3. DROP DATABASE is a statement that drops all the tables and deletes the database. Its syntax is as follows:
DROP DATABASE [IF EXISTS] database_name [RESTRICT|CASCADE];
The CASCADE keyword causes tables to be dropped before the database is dropped. Using the RESTRICT keyword instead of CASCADE is equivalent to the default behavior, where existing tables must be dropped before dropping the database. When a database is dropped, its directory is also deleted. Note: For external tables, the metadata is deleted but the data is not.
For details, read
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-Create/Drop/Alter/UseDatabase
Use the command DROP DATABASE to drop the database userdb, also using the CASCADE keyword.
SOL: DROP DATABASE IF EXISTS userdb CASCADE;
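For comparison (assuming userdb still contained tables): DROP DATABASE userdb RESTRICT; would be rejected because the database is not empty, whereas DROP DATABASE userdb CASCADE; drops the tables first and then the database.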
4. CREATE TABLE First, create a new database mydb. Then, execute:
use mydb;
This command tells HIVE that we will use the database mydb in our further queries. Now, create a new table by executing:
create table employee (eid int, name String, salary String, destination String) COMMENT 'Employee details' ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n' STORED AS TEXTFILE;
If the schema specified differs from the schema in the table that already exists, Hive won’t warn you. If your intention is for this table to have the new schema, you’ll have to drop the old table, losing your data, and then re-create it.
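A minimal sketch of that drop-and-recreate step (it deletes the table's data, so use it deliberately):
hive> DROP TABLE IF EXISTS employee;
Then re-run the CREATE TABLE statement above.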
5. Open a terminal in Cloudera and create a text file, sample.txt, containing the following data. Do not copy and paste, because text formatting is important here. Type the text, making sure that there is a tab between attribute values and a newline after each line. If the formatting is wrong, a later select * from employee will show records with NULL values.
1201	George	45000	Technical manager
1202	Mike	45000	Proof reader
1203	Michael	40000	Technical writer
1204	Kyle	40000	Hr Admin
1205	Kris	30000	Op Admin

SOL: nano sample.txt
then write the contents inside and save (with Ctrl+O) and then exit
6. LOAD DATA Inside the hive shell, type
load data local inpath '/home/cloudera/sample.txt' overwrite into table employee;
On success, you should see output like:
Loading data to table default.employee
Table default.employee stats: [numFiles=1, numRows=0, totalSize=201, rawDataSize=0]
OK
Time taken: 7.516 seconds
hive>
The local keyword in load data local inpath tells Hive to read the file from the local filesystem.
If we wanted to get data from an HDFS file, we would just use inpath, e.g. if we had a file myfile.txt in the HDFS directory /user/hive, we would use:
load data inpath '/user/hive/myfile.txt' overwrite into table employee;
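After either load, a quick sanity check is to display the table contents; the five records should appear with no NULL values (NULLs indicate that the tab/newline formatting of sample.txt was wrong):
hive> select * from employee;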
7. SELECT Read about the select command https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Select
Use the select command to retrieve all records of the table employee with eid greater than 1203
SOL: select * from employee where eid>1203;
8. HADOOP HDFS FROM HIVE Use Hadoop dfs commands inside HIVE to find out:
(i) the HDFS directory for the database mydb we created
(ii) the HDFS directory for the table employee
(iii) the HDFS file corresponding to the table employee
In addition, display the contents of the HDFS file in (iii).
You can run the hadoop dfs … commands from within the hive CLI; just drop the hadoop word from the command and add the semicolon at the end. Example:
hive> dfs -ls / ;
Found 3 items
drwxr-xr-x   - root   supergroup          0 2011-08-17 16:27 /etl
drwxr-xr-x   - edward supergroup          0 2012-01-18 15:51 /flag
drwxrwxr-x   - hadoop supergroup          0 2010-02-03 17:50 /users

This method of accessing hadoop commands is actually more efficient than using the hadoop dfs … equivalent at the bash shell, because the latter starts up a new JVM instance each time, whereas Hive just runs the same code in its current process.
You can see a full listing of help on the options supported by dfs using this command: hive> dfs -help;
Recall that the default directory for hive is /user/hive/warehouse
SOL: (i)
hive> dfs -ls /user/hive/warehouse;
Found 1 items
drwxrwxrwx   - cloudera supergroup          0 2017-02-09 04:41 /user/hive/warehouse/mydb.db
The first “d” shows mydb.db is a directory, and mydb.db corresponds to the database mydb we have created.
(ii)
hive> dfs -ls /user/hive/warehouse/mydb.db;
Found 1 items
drwxrwxrwx   - cloudera supergroup          0 2017-02-09 04:45 /user/hive/warehouse/mydb.db/employee
(iii)
hive> dfs -ls /user/hive/warehouse/mydb.db/employee/;
To display the contents of the HDFS file in (iii):
hive> dfs -cat /user/hive/warehouse/mydb.db/employee/*;
9. DESCRIBE The DESCRIBE command can be used to show details about a table. It has two different modes, EXTENDED and FORMATTED. DESCRIBE shows the list of columns including partition columns for the given table. If the EXTENDED keyword is specified then it will show all the metadata for the table in Thrift serialized form. This is generally only useful for debugging and not for general use. If the FORMATTED keyword is specified, then it will show the metadata in a tabular format.
See
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-Describe
Use the DESCRIBE command as follows and observe the outcome in each case:
hive> DESCRIBE EXTENDED mydb.employee;
hive> DESCRIBE FORMATTED mydb.employee;

7CCSMBDT – Big Data Technologies Practical
10. ALTER TABLE Most table properties can be altered with ALTER TABLE statements, which change metadata about the table but not the data itself. These statements can be used to fix mistakes in schema, move partition locations, and do other operations. Since the data is untouched, it is up to you to ensure that any modifications are consistent with the actual data.
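For instance (an illustration, not one of the tasks below; the property name is arbitrary), the following statement changes only the table's metadata and rewrites no data files:
hive> ALTER TABLE employee SET TBLPROPERTIES ('edited-by' = 'your name here');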
See:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-AlterTable
for how ALTER works.
See:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types#LanguageManualTypes-ColumnTypes for the BIGINT data type
and
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types#LanguageManualTypes-MiscTypes
for the array data type that you will need below.
See:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-Add/ReplaceColumns for adding columns.
Tasks (you can check the result of your commands with the DESCRIBE command):
(i) Use the ALTER command to rename the table employee to employee2
(ii) Change column eid to column new_id. The new_id needs to be of type BIGINT.
(iii) Add a new column extra_ids. This column needs to be of type array of INT.
SOL:
(i) alter table mydb.employee rename to mydb.employee2;
(ii) alter table employee2 change eid new_id BIGINT;
(iii) alter table employee2 add columns (extra_ids ARRAY<INT>);
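To verify the three changes, DESCRIBE the table; the column list should now show new_id with type bigint and extra_ids with type array<int>:
hive> DESCRIBE employee2;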

11. HQL FILES In addition to typing commands in the hive shell, you can also put the commands in an .hql file and execute it from the command line (i.e., the cloudera terminal) with hive -f <filename>
i) Create a file query.hql containing commands to retrieve the salary from the records of employee2 with new_id>1203. Execute the file.
SOL: In command line: nano query.hql.
Inside nano:
use mydb;
select salary from employee2 where new_id>1203;
In command line: hive -f query.hql
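As an aside (not required for the exercise), short queries can also be passed directly on the command line with the -e option, so no .hql file is needed:
hive -e 'use mydb; select salary from employee2 where new_id>1203;'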
ii) Create a file query2.hql containing commands to retrieve the average salary of users with new_id <= 1203 and execute the file.
SOL: same as (i), but the file needs to contain:
use mydb;
select avg(salary) from employee2 where new_id<=1203;

12. PARTITIONED TABLES Hive organizes tables into partitions, based on the values of partitioned columns. Using a partition, it is faster to query a portion of the data. Read the section "Partitioned tables":
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-PartitionedTables
(i) Create a table employee3 with the attributes id, name, salary_dq, destination, whose types are int, string, string, and string, respectively. The table employee3 needs to be partitioned by salary.
SOL: create table employee3(id int, name string, salary_dq string, destination string) partitioned by (salary string);
(ii) Read the section "Inserting data into Hive Tables":
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML#LanguageManualDML-InsertingdataintoHiveTablesfromqueries
Using the "INSERT INTO" command (three times), insert the records of employee2 with salary 45000 into the partition for salary=45000, the records of employee2 with salary 40000 into the partition for salary=40000, and the records of employee2 with salary 30000 into the partition for salary=30000. Since employee2 has an extra attribute, you need to project the records of employee2 onto the first four attributes before executing INSERT INTO.
SOL:
insert into table employee3 partition(salary=45000) select new_id, name, salary, destination from employee2 where salary=45000;
insert into table employee3 partition(salary=40000) select new_id, name, salary, destination from employee2 where salary=40000;
insert into table employee3 partition(salary=30000) select new_id, name, salary, destination from employee2 where salary=30000;
(iii) Write a query to select the names of employees whose salary is equal to 45000 from employee2. In addition, write a query to select the names of employees whose salary is equal to 45000 from employee3. How much time does each query take? Why?
Read http://blog.cloudera.com/blog/2014/08/improving-query-performance-using-partitioning-in-apache-hive/ to find out more about partitioning.
SOL: When a partitioned table is queried with one or both partition columns in the WHERE clause, Hive effectively performs partition elimination, scanning only those data directories that are needed. If no partitioned columns are used, all the directories are scanned (a full table scan) and partitioning has no effect.

13. UDFS User-Defined Functions (UDFs) are a powerful feature that allows users to extend HiveQL. Custom UDFs need to be written in Java, so we will not cover them (if somebody is interested, they can read https://dzone.com/articles/writing-custom-hive-udf-andudaf ). We will cover existing UDFs in HIVE.
(i) Type SHOW FUNCTIONS; to get a list of the functions currently loaded in the Hive session.
(ii) Use the DESCRIBE FUNCTION command to learn more about the UDFs: split, count, and explode. For example, DESCRIBE FUNCTION count shows information about count.
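For more detail, the EXTENDED variant can be used (a quick illustration; the exact output depends on the Hive version):
hive> DESCRIBE FUNCTION EXTENDED count;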
Also, read the following sections from https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF
• String Functions for split()
• Built-in Aggregate Functions (UDAF) for count()
• Built-in Table-Generating Functions (UDTF) for explode()
and https://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html for regular expressions (HIVE uses the same syntax as JAVA).
(iii) Use split to separate each destination value, in each record of employee2, into words. For example, "Technical Manager" should become an array ["Technical","Manager"].
SOL: select split(destination,'\\s') from employee2;
(iv) Modify the query in (iii) to return each element of each array in a separate row.
SOL: select explode(split(destination,'\\s')) from employee2;
(v) Modify the query in (iv) to count the number of distinct returned values.
SOL: select count(DISTINCT col) from (select explode(split(destination, '\\s')) from employee2) as w;
(vi) Write a query that displays each distinct word and its count in the destination attribute of employee2.
Example output:
Hr 1
Manager 1
Op 1
Proof 1
Technical 2
SOL: select w.word, count(1) from (select explode(split(destination, '\\s')) as word from employee2) w group by w.word;

14. WORDCOUNT in HIVE
i) Create a table dictionary which contains each word in the words file as a string. The "words" file is in /usr/share/dict. For this, you need to create a table and then load the words file into the table.
SOL:
create table dictionary (line STRING);
load data local inpath '/usr/share/dict/words' overwrite into table dictionary;
ii) Test that you have created and loaded the table correctly by executing:
select count(*) from dictionary;
The output must be 479829.
iii) Write a query that displays each word in the dictionary table and its length.
SOL: select w.word, length(w.word) from (select explode(split(line,'\\s')) as word from dictionary) w group by w.word;
iv) Write a query that displays the length of the longest word in the dictionary table.
SOL: select max(length(w.word)) from (select explode(split(line, '\\s')) as word from dictionary) w;
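A variant of (iv) that also displays which word is the longest (a sketch; sorting the whole table is slower than computing a single max, but the LIMIT keeps the output small):
select w.word, length(w.word) as len from (select explode(split(line, '\\s')) as word from dictionary) w order by len desc limit 1;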