7CCSMBDT – Big Data Technologies Practical
Lab 2: Big Data access connectors
Introduction to using Cloudera Quickstart VM (this is useful for every lab using Cloudera)
The following steps should be performed after logging in to Cloudera. For logging in, please refer to the email you received from Andreas Biternas. Any issues related to Cloudera should be reported to the NMS Computing Team by raising a ticket at https://apps.nms.kcl.ac.uk/sd/
As mentioned in class, you are encouraged to download the Cloudera Quickstart VM on your own PC (for this you will also need a VM manager, e.g., http://www.oracle.com/technetwork/server-storage/virtualbox/downloads/index.html ). This will help your practical study in your individual study time, because the Cloudera Quickstart VM will only be accessible during labs (and later for the courseworks).
1. On Cloudera’s desktop, click on “Launch Cloudera Express” and wait until the terminal window closes. An alternative (in case there are issues) is to open a terminal (right-click anywhere on the desktop and then Open Terminal) and type
sudo /home/cloudera/cloudera-manager --force --express
2. Go to the Firefox window that opened when you first entered Cloudera and locate the small “Cloudera Manager” icon under the address bar. Please click on it and then select “I Agree”. Type cloudera for the username and again cloudera for the password, and then select Remember. These are internal credentials for Cloudera (visible only within the VM).
3. You will see Cloudera Manager in the top-left corner of the window that opens, and a large white panel on the left titled “Cloudera Quick… (CDH 5.12.0, Packages)”. Next to “Cloudera Quick… (CDH 5.12.0, Packages)” there is a dropdown menu. Click on it and then select “Start”. Answer “Start” to the message “Are you sure you want to run the start command on cluster Cloudera Quickstart?”.
A picture of Cloudera manager (similar to the one you will see) is on the next page.
4. Wait until all commands on the window that appears have a green sign next to them. This may take a few minutes.
5. Now you will see all services under the Cloudera Quickstart panel in green, except the icon for the Sqoop 1 Client.
Apache Sqoop
This is to be run in the Cloudera Quickstart VM.
We will use Sqoop to move data from a MySQL database into HDFS, the distributed file system of Hadoop.
1. MySQL and JDBC driver installation and database setup and testing
(a) We will install MySQL in the Cloudera environment. Execute:
sudo yum install mysql-server
sudo /sbin/service mysqld start
sudo /usr/bin/mysql_secure_installation
Enter “cloudera” for the root password and then select “do not change the root password”.
Answer yes to all the prompts.
Useful commands (do not execute them now):
To start the service (already started at this point): sudo /sbin/service mysqld start
To end the service (do not end it now): sudo /sbin/service mysqld stop
(b) JDBC is a Java-specific, database-vendor-independent interface for accessing relational databases and enterprise data warehouses. On top of this generic interface, each database vendor must implement a compliant driver containing the required functionality. Sqoop requires the JDBC driver for the MySQL database server that we have installed in order to transfer data. You need to install the JDBC driver separately.
(i) To install it, execute:
sudo yum install mysql-connector-java
(ii) To copy it so that it can be used by sqoop, execute:
sudo cp /var/lib/sqoop/mysql-connector-java.jar /usr/lib/sqoop/lib/.
(c) Download the setup_credentials.sql script from KEATS (Datasets for Lab2) or
https://www.dropbox.com/sh/r9au7l4ycamc2tu/AACjrZ-HuPagm_vxSJBjaEJva?dl=0
This has commands to create a user “sqoop” and new databases that the user can access.
(d) Run the setup_credentials.sql script with:
mysql -u root -p < setup_credentials.sql
(e) Download and run the setup_tables.sql script from KEATS (Datasets for Lab2) or
https://www.dropbox.com/sh/r9au7l4ycamc2tu/AACjrZ-HuPagm_vxSJBjaEJva?dl=0
This has commands to create and populate some example tables.
(f) Execute the MySQL shell to verify that the database and tables have been created:
mysql -u root -p
After typing the password “cloudera”, you must see Welcome to the MySQL monitor ...
...
Type ‘help;’ or ‘\h’ for help. Type ‘\c’ to clear the current input statement.
mysql>
(g) Execute:
show databases;
You must see a database “sqoop” among the listed databases.
(h) Execute:
use sqoop;
and then execute:
show tables;
You must see 5 tables, namely “cities”, “countries”, “normcities”, “staging_cities”, and “visits”.
(i) Execute:
exit;
2. Sqoop
When executing sqoop, ignore the accumulo warning. Also, always use the argument
--driver com.mysql.jdbc.Driver
to tell Sqoop to use the correct JDBC driver.
Reference guide for Sqoop with explanations and examples: https://sqoop.apache.org/docs/1.4.6/SqoopUserGuide.html
Listing databases and tables
(a) To get a list of databases, execute:
sqoop list-databases --connect jdbc:mysql://quickstart.cloudera/ --username sqoop --password sqoop
Note that we connect to jdbc:mysql://quickstart.cloudera/
with the username sqoop and password sqoop. This user was created when you executed setup_credentials.sql. This command does not include --driver (Sqoop falls back to the standard driver).
(b) Get a list of the tables in the sqoop database. To find out how, execute:
sqoop help
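If you get stuck after reading the help, a command of the following form should work; this is only a sketch (the list-tables tool with the same connection details as in (a)), so check it against the help output:
sqoop list-tables --connect jdbc:mysql://quickstart.cloudera/sqoop --driver com.mysql.jdbc.Driver --username sqoop --password sqoop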
Importing data
(c) To see how the import command of Sqoop works, execute:
sqoop help import
[If you want to read page by page, execute: sqoop help import | more]
You will see the arguments of the import command, together with their descriptions.
Now use sqoop import to import the table “cities” of the database “sqoop” into HDFS. The last two lines of the output should be similar to:
17/01/21 05:03:11 INFO mapreduce.ImportJobBase: Transferred 54 bytes in 127.1497 seconds (0.4247 bytes/sec)
17/01/21 05:03:11 INFO mapreduce.ImportJobBase: Retrieved 3 records.
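One possible form of the import command for (c), reusing the connection details from (a) and the --driver argument mentioned above, is the following sketch (your timings and counters will differ from the sample output above):
sqoop import --connect jdbc:mysql://quickstart.cloudera/sqoop --driver com.mysql.jdbc.Driver --username sqoop --password sqoop --table cities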
(d) Note in the output of the sqoop import command in (c), under “Job counters”, the number of “Launched map tasks”. Now execute:
hadoop fs -ls cities/
You must see three files with names starting with part-m. Each of these files corresponds to a mapper.
(e) Execute:
hadoop fs -cat cities/part-m-*
You must see the contents of the table cities, i.e.:
1,USA,Palo Alto
2,Czech Republic,Brno
3,USA,Sunnyvale
Target directories
(f) Import the table “cities” into the directory “lab2/input” of HDFS using 4 mappers. What is the time taken for the transfer? How does it compare to the time in (c)? (Possible forms of the commands for (f) and (h) are sketched after (h).)
(g) Execute the same command you used in (f) again. What do you observe? Please note that --target-dir cannot be specified together with --warehouse-dir.
(h) Execute the command you used in (f) again, this time with --warehouse-dir lab2/input2/ instead of --target-dir. What do you observe in lab2/input2?
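For reference, possible forms of the commands for (f) and (h) are sketched below; -m sets the number of mappers (equivalently, --num-mappers 4):
sqoop import --connect jdbc:mysql://quickstart.cloudera/sqoop --driver com.mysql.jdbc.Driver --username sqoop --password sqoop --table cities --target-dir lab2/input -m 4
sqoop import --connect jdbc:mysql://quickstart.cloudera/sqoop --driver com.mysql.jdbc.Driver --username sqoop --password sqoop --table cities --warehouse-dir lab2/input2/ -m 4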
Importing based on selection criteria
(i) Use the --where parameter of import to import into HDFS only the records of cities with country Czech Republic. The records should be imported into the directory lab2/input3.
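A possible form of the command, assuming the relevant column of the cities table is named country (check setup_tables.sql for the actual column name), is:
sqoop import --connect jdbc:mysql://quickstart.cloudera/sqoop --driver com.mysql.jdbc.Driver --username sqoop --password sqoop --table cities --where "country = 'Czech Republic'" --target-dir lab2/input3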
Using the Avro Format
(k) Apache Avro is a generic data serialization system. Specifying the --as-avrodatafile parameter instructs Sqoop to use its compact and fast binary encoding format. Avro can store arbitrary data structures, and it uses a schema, encoded as a JSON string, to describe what data structures are stored within the file. The schema is generated automatically based on the metadata information retrieved from the database server, and Sqoop retains the schema in each generated file. Any application that uses Avro files needs to use Avro libraries.
Import the rows of table “countries” with country Czech Republic using the --as-avrodatafile parameter. Then, execute
hadoop fs -cat countries/*
to view the file.
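A possible form of the import command for (k), again assuming the column of the countries table holding the country name is called country, is:
sqoop import --connect jdbc:mysql://quickstart.cloudera/sqoop --driver com.mysql.jdbc.Driver --username sqoop --password sqoop --table countries --where "country = 'Czech Republic'" --as-avrodatafile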
Split-by and direct
(l) Download and execute the setup_rand_num.sql script from KEATS (Datasets for Lab2) or https://www.dropbox.com/sh/r9au7l4ycamc2tu/AACjrZ-HuPagm_vxSJBjaEJva?dl=0
This script creates two tables rand_numbers and rand_numbers2, each of which has one attribute ‘number’ with 100000 values (records).
(m) Run import to import the table rand_numbers with --split-by number. This parameter uses the attribute number to slice the data into multiple parallel tasks.
How much time does the transfer take? (Possible forms of the commands for (m) and (n) are sketched after (n).)
(n) Run import to import the table rand_numbers2 with --split-by number and --direct. The --direct parameter delegates the transfer of data to the native utilities provided by the database vendor. In the case of MySQL, mysqldump and mysqlimport will be used for retrieving data from the database server or moving data back. This is sometimes much faster.
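Possible forms of the commands for (m) and (n) are sketched below. Note that --direct relies on Sqoop's MySQL-specific connection manager (which invokes mysqldump); if Sqoop complains when --driver and --direct are combined, omit --driver for (n):
sqoop import --connect jdbc:mysql://quickstart.cloudera/sqoop --driver com.mysql.jdbc.Driver --username sqoop --password sqoop --table rand_numbers --split-by number
sqoop import --connect jdbc:mysql://quickstart.cloudera/sqoop --username sqoop --password sqoop --table rand_numbers2 --split-by number --direct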
Importing only new data
(o) Sqoop has an incremental feature that is activated by --incremental.
When the table is only getting new rows and the existing ones are not changed, we use --incremental append. Incremental import also requires two additional parameters:
--check-column id, which indicates that a column named id is checked for newly appended data, and
--last-value val, which contains the last value that was successfully imported into Hadoop.
Execute:
sqoop import --connect jdbc:mysql://quickstart.cloudera/sqoop --driver com.mysql.jdbc.Driver --username sqoop --password sqoop --table visits --incremental append --check-column id --last-value 1
Use hadoop fs -cat visits/* . What do you observe?
(p) After executing the sqoop command in (o), a message (consider saving this with ‘sqoop job --create’) is shown. This recommends taking advantage of the built-in Sqoop metastore, which allows you to save all parameters for later reuse. Execute:
sqoop job --create visits -- import --connect jdbc:mysql://quickstart.cloudera/sqoop --driver com.mysql.jdbc.Driver --username sqoop --password sqoop --table visits --incremental append --check-column id --last-value 2
Note the space between -- and import. This creates a job called “visits”. To view the content of the job you created, execute:
sqoop job --show visits
To show all created jobs execute:
sqoop job --list
To delete the job visits, execute:
sqoop job --delete visits
------------------------- Optional (if you have time, or in your individual study time) -------------------------
Importing data from two tables
Sqoop can also import the result set of an arbitrary SQL query. This requires specifying an SQL statement with the --query argument and a destination directory with --target-dir. If you want to import the results of a query in parallel, then each map task will need to execute a copy of the query, with results partitioned by bounding conditions inferred by Sqoop. Your query must include the token $CONDITIONS, which each Sqoop process will replace with a unique condition expression. You must also select a splitting column with --split-by.
An example query is:
sqoop import --query 'SELECT a.*, b.* FROM a JOIN b ON (a.id = b.id) WHERE $CONDITIONS' --split-by a.id --target-dir …
(q) We have created two tables in MySQL (see setup_tables.sql for their definitions).
We need to import a single table into HDFS, containing id, city, and country (i.e., use the country values from the table countries instead of the country_id).
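A possible form of the command is sketched below. The table and column names used in the query (normcities with columns id, country_id, city, and countries with columns country_id, country) are assumptions; check setup_tables.sql for the actual definitions and adjust the query accordingly. The target directory name is also just an example:
# the table/column names below are assumed -- verify them in setup_tables.sql
sqoop import --connect jdbc:mysql://quickstart.cloudera/sqoop --driver com.mysql.jdbc.Driver --username sqoop --password sqoop --query 'SELECT n.id, n.city, c.country FROM normcities n JOIN countries c ON (n.country_id = c.country_id) WHERE $CONDITIONS' --split-by n.id --target-dir lab2/joined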
Transferring data from HDFS
(r) To export data from the HDFS directory cities (specified with --export-dir) into the MySQL table cities, we use the export command. Sqoop fetches the table's metadata for the export: the destination table (specified with the --table parameter) must exist prior to running Sqoop. The table does not have to be empty, and you can even export new data from Hadoop to your database on an iterative basis. However, there should not be any constraint violations when performing the INSERT statements.
Execute:
sqoop export --connect jdbc:mysql://quickstart.cloudera/sqoop --driver com.mysql.jdbc.Driver --username sqoop --password sqoop --table cities --export-dir cities
Why does the command result in an error? Check setup_tables.sql again.
(s) Delete all records from the MySQL table cities and try the command of (r) again.
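One way to delete the records is from the MySQL shell (the root password is cloudera, as before):
mysql -u root -p
use sqoop;
DELETE FROM cities;
exit;
Then run the sqoop export command from (r) again.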