7CCSMBDT - Tutorial 1
Huiping Chen
huiping.chen@kcl.ac.uk
Big Data Technologies in the labs
In practice, the big data technologies that we will discuss in the module run on powerful servers (in the cloud) and are managed by professionals.
We want to be able to use these big data technologies in a “testing” environment with little or no management.
We will use the Cloudera CDH QuickStart VM for that.
Cloudera CDH
Complete, tested, and popular distribution of Apache Hadoop and related projects.
Core elements of Hadoop – scalable storage and distributed computing – along with a Web-based user interface and vital enterprise capabilities.
Open source
Flexibility: Store any type of data and manipulate it with a variety of different computation frameworks, including batch processing and interactive SQL.
Integration: Get up and running quickly on a complete Hadoop platform that works with a broad range of hardware and software solutions.
Scalability: Enable a broad range of applications and scale and extend them to suit your requirements.
https://www.cloudera.com/documentation/enterprise/5-13-x/topics/cdh_intro.html#xd_583c10bfdbd326ba--5a52cca-1476e7473cd--7f59
How it runs
• Cloudera CDH QuickStart VM: one for each student. It runs remotely in KCL’s cloud and can be accessed from the labs.
• You will receive an email with instructions on how to access your Cloudera VM
• You have full control of your Cloudera VM, and no one else does. You are responsible for backing up your files (copy them out of the VM).
• For any issues with Cloudera (e.g., it does not start), you need to raise a ticket at https://apps.nms.kcl.ac.uk/sd/ (lecturers and TAs do not have control over your VM).
• Expect several warnings and errors when the VM starts. Most are easily addressed by restarting the services, and the rest by restarting the VM.
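The backup advice above can be sketched in Python. This is a minimal sketch, not a prescribed procedure: it assumes a Python 3 interpreter is available (the VM's default interpreter may be older), and the function name and directory names are hypothetical. It packs one working directory into a single timestamped .tar.gz archive, which you can then copy out of the VM by whatever means you have.

```python
import shutil
import time
from pathlib import Path


def backup_directory(source_dir, backup_root):
    """Pack source_dir into a timestamped .tar.gz under backup_root.

    Both paths are hypothetical examples; substitute your own.
    Returns the path of the archive that was written.
    """
    source = Path(source_dir).expanduser().resolve()
    if not source.is_dir():
        raise NotADirectoryError(f"{source} is not a directory")

    target_root = Path(backup_root).expanduser().resolve()
    # Refuse to write the archive inside the directory being archived.
    if target_root == source or source in target_root.parents:
        raise ValueError("backup_root must be outside source_dir")
    target_root.mkdir(parents=True, exist_ok=True)

    stamp = time.strftime("%Y%m%d-%H%M%S")
    base_name = target_root / f"{source.name}-{stamp}"

    # make_archive appends the ".tar.gz" suffix itself.
    archive = shutil.make_archive(
        str(base_name),
        "gztar",
        root_dir=str(source.parent),
        base_dir=source.name,
    )
    return Path(archive)
```

For example, `backup_directory("~/tutorial_work", "~/backups")` (both names made up here) would leave one archive per run in the backups directory. The archive still lives inside the VM until you copy it elsewhere.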
How it runs
• You can also (optionally) install the Cloudera CDH QuickStart VM on your own PC (Windows, Mac, Linux).
• In this tutorial, we will show, step by step, how this can be done.
• Remember, Cloudera is a VM, so it needs to run (be hosted) somewhere, and you only have your PC.
• VirtualBox is a tool that will host your Cloudera VM on your own PC. It is free and provided by Oracle.
• Alternatives exist: the labs use X2Go to access the remotely hosted VM, and other virtualization software can also host it.
Oracle VirtualBox Installation
1. Download the VirtualBox software
https://www.oracle.com/technetwork/server-storage/virtualbox/downloads/index.html
2. Double-click the VirtualBox.pkg installer file displayed in the window that opens
Installation Failed
• Open up System Preferences
• Click on the Security & Privacy icon
• Make sure that, under "Allow apps downloaded from:", the option "App Store and identified developers" is selected.
• Then run VirtualBox_Uninstall.tool to remove the failed installation, and install VirtualBox again.
Download Cloudera
https://www.cloudera.com/downloads/quickstart_vms/5-13.html
Set the VM's RAM as large as possible (leave some memory for the host PC)
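To make the RAM advice concrete, here is a small Python sketch that picks a memory size for the VM and builds the matching VirtualBox command line. The function names, the VM name, the 2048 MB kept back for the host, and the 4096 MB floor are illustrative assumptions, not values from this tutorial. `VBoxManage modifyvm <name> --memory <MB>` is VirtualBox's command-line counterpart of the memory slider in the settings dialog. The sketch only builds the command; it does not run it.

```python
def choose_vm_memory_mb(host_total_mb, host_reserve_mb=2048, minimum_mb=4096):
    """Return the RAM (in MB) to give the VM: all host RAM minus a reserve.

    host_reserve_mb and minimum_mb are assumed defaults for illustration.
    """
    candidate = host_total_mb - host_reserve_mb
    if candidate < minimum_mb:
        raise ValueError(
            f"only {candidate} MB left for the VM, below the "
            f"assumed minimum of {minimum_mb} MB"
        )
    return candidate


def modifyvm_memory_command(vm_name, memory_mb):
    """Build (but do not run) the VBoxManage command that sets VM memory."""
    return ["VBoxManage", "modifyvm", vm_name, "--memory", str(memory_mb)]


if __name__ == "__main__":
    # 16384 MB host and the VM name are made-up example values.
    memory = choose_vm_memory_mb(host_total_mb=16384)
    print(" ".join(modifyvm_memory_command("cloudera-quickstart-vm", memory)))
```

If you do run the resulting command, note that `modifyvm` applies to a VM that is powered off, and the VM name must match the name shown in the VirtualBox manager.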
Login: cloudera Password: cloudera
This button opens a drop-down menu that starts, restarts, and stops the services.
This button opens a drop-down menu that starts, restarts, and stops the service HDFS.
This button opens a terminal. Most commands in the course will be executed in terminals within Cloudera.
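As a first check from a terminal, one might confirm that the HDFS command-line client responds. The Python sketch below wraps the standard `hdfs dfs -ls <path>` command. It assumes Python 3.7 or later and that the `hdfs` client is on the PATH inside the VM; the function names are hypothetical. When the client is not found, it returns None instead of failing.

```python
import shutil
import subprocess


def hdfs_ls_command(path="/"):
    """Build the command that lists an HDFS directory."""
    return ["hdfs", "dfs", "-ls", path]


def list_hdfs(path="/"):
    """Run `hdfs dfs -ls` and return its output, or None if hdfs is absent."""
    if shutil.which("hdfs") is None:
        # Not inside the VM, or the client is not on the PATH.
        return None
    result = subprocess.run(
        hdfs_ls_command(path), capture_output=True, text=True
    )
    if result.returncode != 0:
        raise RuntimeError(result.stderr.strip())
    return result.stdout
```

An error from `list_hdfs()` inside the VM may simply mean that the HDFS service is not running yet, in which case restarting it from the menu described above is the first thing to try.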