
7CCSMBDT – Big Data Technologies Practical
Lab 2: Big Data access connectors
Introduction to using Cloudera Quickstart VM (this is useful for every lab using Cloudera)
The following steps should be performed after logging in to Cloudera. For logging in, please refer to the email you received from Andreas Biternas. Any issues related to Cloudera should be reported to the NMS Computing Team by raising a ticket: https://apps.nms.kcl.ac.uk/sd/
As mentioned in the class, you are encouraged to download the Cloudera Quickstart VM on your own PC (for this you will also need a VM manager, e.g., http://www.oracle.com/technetwork/server-storage/virtualbox/downloads/index.html ). This will help your practical study in your individual study time, because the Cloudera Quickstart VM will only be accessible during labs (and later for the courseworks).
1. On Cloudera’s desktop click on “Launch Cloudera Express” and wait until the terminal window closes. An alternative (in case there are issues) is to open a terminal (right-click anywhere on the desktop and then Open Terminal) and type
sudo /home/cloudera/cloudera-manager --force --express
2. Go to the Firefox window that opened when you first entered Cloudera and locate a small icon “Cloudera Manager” under the address bar. Click on it and then select “I agree”. Type cloudera for the username and again cloudera for the password, and then select Remember. These are internal passwords for Cloudera (visible only within the VM).
3. You will see Cloudera Manager in the top-left corner of the window that opens, and a large white panel on the left titled “Cloudera Quick… (CDH 5.112.0, Packages)”. Next to “Cloudera Quick… (CDH 5.112.0, Packages)” there is a dropdown menu. Click on it and then select “Start”. Answer “Start” to the message “Are you sure you want to run the start command on cluster Cloudera Quickstart?”.
A picture of Cloudera manager (similar to the one you will see) is on the next page.

4. Wait until all commands on the window that appears have a green sign next to them. This may take some minutes.

5. Now you will see all services under the Cloudera Quickstart panel green, except the icon for the Sqoop 1 Client.
Apache Sqoop
This is to be run in the Cloudera Quickstart VM.
We will use Sqoop to move data from a MySQL database into HDFS, the distributed file system of Hadoop.
1. MySQL and JDBC driver installation, database setup, and testing
(a) We will install MySQL in the Cloudera environment. Execute:
sudo yum install mysql-server
sudo /sbin/service mysqld start
sudo /usr/bin/mysql_secure_installation
Enter “cloudera” for the root password and then select “do not change the root password”.
Answer yes to all the prompts.
Useful commands (do not execute them now):
To start the service (already started at this point): sudo /sbin/service mysqld start
To end the service (do not end it now): sudo /sbin/service mysqld stop
(b) JDBC is a Java-specific, database-vendor-independent interface for accessing relational databases and enterprise data warehouses. On top of this generic interface, each database vendor must implement a compliant driver containing the required functionality. Sqoop requires the JDBC driver for the MySQL database server that we have installed in order to transfer data. You will need to install the JDBC driver separately.
(i) To install it, execute:
sudo yum install mysql-connector-java
(ii) To copy it so that it can be used by sqoop, execute:
sudo cp /var/lib/sqoop/mysql-connector-java.jar /usr/lib/sqoop/lib/.
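To confirm that the driver is now visible to Sqoop (optional; a sketch, and the exact jar name may differ on your VM):
ls -l /usr/lib/sqoop/lib/ | grep mysql-connector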

(c) Download the setup_credentials.sql script from KEATS (Datasets for Lab2) or
https://www.dropbox.com/sh/r9au7l4ycamc2tu/AACjrZ-HuPagm_vxSJBjaEJva?dl=0
This has commands to create a user “sqoop” and new databases that the user can access.
(d) Run the setup_credentials.sql script with:
mysql -u root -p < setup_credentials.sql
(e) Download and run the setup_tables.sql script from KEATS (Datasets for Lab2) or
https://www.dropbox.com/sh/r9au7l4ycamc2tu/AACjrZ-HuPagm_vxSJBjaEJva?dl=0
This has commands to create and populate some example tables.
(f) Execute the MySQL shell to verify that the database and tables have been created:
mysql -u root -p
After typing the password “cloudera”, you must see
Welcome to the MySQL monitor ...
...
Type ‘help;’ or ‘\h’ for help. Type ‘\c’ to clear the current input statement.
mysql>
(g) Execute:
show databases;
You must see a database “sqoop” among the listed databases.
(h) Execute:
use sqoop;
and then execute:
show tables;
You must see 5 tables, namely “cities”, “countries”, “normcities”, “staging_cities”, and “visits”.
(i) Execute:
exit;
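As a quicker, non-interactive alternative (a sketch; it passes the root password on the command line, which is acceptable inside the VM), you can run the same checks from the shell:
mysql -u root -pcloudera -e "SHOW DATABASES;"
mysql -u root -pcloudera -e "SHOW TABLES IN sqoop;"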

2. Sqoop
When executing Sqoop, ignore the Accumulo warning. Also, always use the argument
--driver com.mysql.jdbc.Driver
to tell Sqoop to use the correct JDBC driver.
Reference guide for Sqoop with explanations and examples: https://sqoop.apache.org/docs/1.4.6/SqoopUserGuide.html
Listing databases and tables
(a) To get a list of databases, execute:
sqoop list-databases --connect jdbc:mysql://quickstart.cloudera/ --username sqoop --password sqoop
Note that we connect to jdbc:mysql://quickstart.cloudera/
with the username sqoop and password sqoop. This user was created when you executed setup_credentials.sql. This command does not include --driver (Sqoop falls back to the standard driver).
(b) Get a list of the tables in the sqoop database. To find out how, execute:
sqoop help
SOL: sqoop list-tables --driver com.mysql.jdbc.Driver --connect jdbc:mysql://quickstart.cloudera/sqoop --username sqoop --password sqoop
The solution includes --driver com.mysql.jdbc.Driver, but --driver may fail in the Cloudera version on the lab PCs.
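If --driver causes problems, you can usually omit it and let Sqoop fall back to the standard driver inferred from the jdbc:mysql:// connect string, e.g.:
sqoop list-tables --connect jdbc:mysql://quickstart.cloudera/sqoop --username sqoop --password sqoop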
Importing data
(c) To see how the import command of Sqoop works, execute:
sqoop help import
[If you want to read page by page, execute: sqoop help import | more]

You will see the usage and the list of arguments of the import command.
Now use sqoop import to import the table “cities” of the database “sqoop” into HDFS. The last two lines of the output you must see will be similar to:
17/01/21 05:03:11 INFO mapreduce.ImportJobBase: Transferred 54 bytes in 127.1497 seconds (0.4247 bytes/sec)
17/01/21 05:03:11 INFO mapreduce.ImportJobBase: Retrieved 3 records.
SOL: sqoop import --connect jdbc:mysql://quickstart.cloudera/sqoop --driver com.mysql.jdbc.Driver --username sqoop --password sqoop --table cities
(d) Note in the output of the sqoop import command in (c), under “Job counters”, the number of “Launched map tasks”. Now execute:
hadoop fs -ls cities/
You must see three files with names starting with part-m. Each of these files corresponds to a mapper.
(e) Execute:
hadoop fs -cat cities/part-m-*
You must see the contents of the table cities, i.e.:
1,USA,Palo Alto
2,Czech Republic,Brno
3,USA,Sunnyvale

Target directories
(f) Import the table “cities” into the directory “lab2/input” of HDFS using 4 mappers. What is the time taken for transfer? How does it compare to the time in (c)?
SOL: sqoop import --connect jdbc:mysql://quickstart.cloudera/sqoop --driver com.mysql.jdbc.Driver --username sqoop --password sqoop --table cities --target-dir lab2/input/ -m 4
The time is smaller because in (c) only 3 mappers were used (the number is determined automatically and may differ on your machine), whereas here we use 4.
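One way to confirm the number of mappers (a sketch, assuming the directory from (f); each mapper writes one part-m file):
hadoop fs -ls lab2/input/ | grep part-m | wc -l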
(g) Execute the same command you used in (f) again. What do you observe? Please note that --target-dir cannot be specified together with --warehouse-dir.
SOL: Sqoop rejects the import because the final output directory (here, lab2/input) already exists. (When --warehouse-dir is used instead of --target-dir, the final directory name is comprised of the directory specified in --warehouse-dir and the name of the transferred table.)
(h) Execute the command you used in (f) again, but with --warehouse-dir lab2/input2/ instead of --target-dir.
What do you observe in lab2/input2?
SOL: Instead of directly specifying the final directory, the parameter --warehouse-dir allows you to specify only the parent directory. Rather than writing data into the warehouse directory, Sqoop will create a directory with the same name as the table inside the warehouse directory and import data there. This is similar to the default case where Sqoop imports data to your home directory on HDFS, with the notable exception that the --warehouse-dir parameter allows you to use a directory other than the home directory.
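You can verify this layout by listing the warehouse directory; expect a subdirectory named cities containing the part-m files:
hadoop fs -ls lab2/input2/
hadoop fs -ls lab2/input2/cities/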
Importing based on selection criteria
(i) Use the --where parameter of import to import into HDFS only the records of cities with country Czech Republic. The records should be imported into the directory lab2/input3.
SOL: sqoop import --connect jdbc:mysql://quickstart.cloudera/sqoop --driver com.mysql.jdbc.Driver --username sqoop --password sqoop --table cities -m 4 --target-dir lab2/input3 --where "country = 'Czech Republic'"
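To check the result (assuming the import above succeeded), list the imported records; you should see only 2,Czech Republic,Brno:
hadoop fs -cat lab2/input3/part-m-*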
Using the Avro Format

(k) Apache Avro is a generic data serialization system. Specifying the --as-avrodatafile parameter instructs Sqoop to use its compact and fast binary encoding format. Avro can store any arbitrary data structures, and it uses a schema, encoded as a JSON string, to describe what data structures are stored within the file. The schema is generated automatically based on the metadata information retrieved from the database server, and is retained in each generated file. Any application that uses Avro files needs to use Avro libraries.
Import the rows of table “countries” with country Czech Republic using the --as-avrodatafile parameter. Then, execute
hadoop fs -cat countries/*
to view the file.
SOL: sqoop import --connect jdbc:mysql://quickstart.cloudera/sqoop --driver com.mysql.jdbc.Driver --username sqoop --password sqoop --table countries -m 4 --where "country = 'Czech Republic'" --as-avrodatafile
(Without --target-dir, the data is imported into the countries directory under your HDFS home directory, which matches the hadoop fs -cat command above.)
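Since Avro is a binary format, hadoop fs -cat will print mostly unreadable bytes. A sketch for viewing the records as JSON, assuming the avro-tools command is available on the VM (otherwise use the avro-tools jar):
hadoop fs -get countries/part-m-00000.avro /tmp/
avro-tools tojson /tmp/part-m-00000.avro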
Split-by and direct
(l) Download and execute the setup_rand_num.sql script from KEATS (Datasets for Lab2) or https://www.dropbox.com/sh/r9au7l4ycamc2tu/AACjrZ-HuPagm_vxSJBjaEJva?dl=0
This script creates two tables rand_numbers and rand_numbers2, each of which has one attribute ‘number’ with 100000 values (records).
(m) Run import to transfer the table rand_numbers with --split-by number. This parameter uses the attribute number to slice the data into multiple parallel tasks.
How much time does the transfer take?
SOL: sqoop import --connect jdbc:mysql://quickstart.cloudera/sqoop --driver com.mysql.jdbc.Driver --username sqoop --password sqoop --table rand_numbers --split-by number
It took 115.8434 seconds
(n) Run import to transfer the table rand_numbers2 with --split-by number and --direct. The --direct parameter delegates the transferring of data to the native utilities provided by the database vendor. In the case of MySQL, the mysqldump and mysqlimport utilities will be used for retrieving data from the database server or moving data back. This is sometimes much faster.
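A sketch of the command, mirroring (m); --driver is omitted here because forcing the generic driver may conflict with --direct on this setup:
sqoop import --connect jdbc:mysql://quickstart.cloudera/sqoop --username sqoop --password sqoop --table rand_numbers2 --split-by number --direct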
SOL: It took 83.5218 seconds.

Importing only new data
(o) Sqoop has an incremental feature that is activated by --incremental.
When the table is only getting new rows and the existing ones are not changed, we use --incremental append. Incremental import also requires two additional parameters:
--check-column id indicates that a column with name id is checked for newly appended data, and
--last-value val which contains the last value that was successfully imported into Hadoop.
Execute:
sqoop import --connect jdbc:mysql://quickstart.cloudera/sqoop --driver com.mysql.jdbc.Driver --username sqoop --password sqoop --table visits --incremental append --check-column id --last-value 1
Use hadoop fs -cat visits/*. What do you observe?
SOL: Only the row with id=2 is imported, i.e., the rows with id greater than the last value 1.
(p) After executing the sqoop command in (o), a message (“consider saving this with ‘sqoop job --create’”) is shown. This recommends taking advantage of the built-in Sqoop metastore that allows you to save all parameters for later reuse. Execute:
sqoop job --create visits -- import --connect jdbc:mysql://quickstart.cloudera/sqoop --driver com.mysql.jdbc.Driver --username sqoop --password sqoop --table visits --incremental append --check-column id --last-value 2
Note the space between -- and import. This creates a job called “visits”. To view the content of the job you created, execute:
sqoop job --show visits
To show all created jobs execute:
sqoop job --list
To delete the job visits, execute:
sqoop job --delete visits
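A saved job can also be executed with --exec (not used above); for saved incremental jobs the metastore updates --last-value automatically after each run:
sqoop job --exec visits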

------------------------- Optional (if you have time, or in your individual study time) -------------------------
Importing data from two tables
Sqoop can also import the result set of an arbitrary SQL query. This requires specifying an SQL statement with the --query argument and a destination directory with --target-dir. If you want to import the results of a query in parallel, then each map task will need to execute a copy of the query, with results partitioned by bounding conditions inferred by Sqoop. Your query must include the token $CONDITIONS, which each Sqoop process will replace with a unique condition expression. You must also select a splitting column with --split-by.
An example query:
sqoop import --query 'SELECT a.*, b.* FROM a JOIN b ON (a.id = b.id) WHERE $CONDITIONS' --split-by a.id --target-dir …
(q) We have created the following two tables in MySQL: normcities and countries.
We need to import one table containing id, city, and country (i.e., using the country values from the table countries instead of the country_id).
SOL: sqoop import --connect jdbc:mysql://quickstart.cloudera/sqoop --driver com.mysql.jdbc.Driver --username sqoop --password sqoop --query 'SELECT normcities.id, normcities.city, countries.country FROM normcities JOIN countries USING(country_id) WHERE $CONDITIONS' --split-by id --target-dir normcities
Transferring data from HDFS
(r) To export data from the HDFS directory cities (specified with --export-dir) into the MySQL table cities, we use the export command. Sqoop fetches the table’s metadata for the export: the destination table (specified with the --table parameter) must exist prior to running Sqoop. The table does not have to be empty, and you can even export new data from Hadoop to your database on an iterative basis. However, there should not be any constraint violations when performing the INSERT statements.

Execute:
sqoop export --connect jdbc:mysql://quickstart.cloudera/sqoop --driver com.mysql.jdbc.Driver --username sqoop --password sqoop --table cities --export-dir cities
Why does the command result in an error? Check setup_tables.sql again.
SOL: id is a primary key, and the table cities already contains records with the same ids as those we try to export.
(s) Delete all records from the MySQL table cities and try the command of (r) again.
SOL: To delete all records, execute “DELETE FROM cities WHERE id > 0;” in MySQL.
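To verify that the export succeeded (a sketch; you will be prompted for the root password “cloudera”):
mysql -u root -p -e "SELECT * FROM cities;" sqoop
You should see the three city records that were exported back from HDFS.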