HW3: assigned 4/8, due 4/12 before 11:59 PM
In this homework, you are going to use Apache Pig to count letters (not words!) in multiple input text files, using the HortonWorks Hadoop Sandbox setup running inside a VirtualBox VM. What you need to submit:
* countChars.pig – your Pig script that does the letter counting
* additional files, if any [e.g. if you wrote a custom ‘UDF’, or if you created extra input files, submit those]
Following are the steps.
1. Install VirtualBox (NOT VMWare!).
2. Download the HortonWorks Hadoop Sandbox. If you are using a PC, download WinSCP as well.
3. Start VirtualBox, and import the Hadoop Sandbox appliance:
Here is a useful guide that will help with setting up the sandbox.
4. Bring up the sandbox (press the Start button on the VM) – this will take a few minutes:
5. Once the sandbox shell (terminal) comes up, you can start to play! You can type in standard Unix commands (ls, rm, mkdir, more...), and use bash-like editing (Ctrl p, Ctrl b, Ctrl d etc.). From this terminal you can also run Hadoop MapReduce commands, Spark, Hive, Pig, etc. – LOTS of power!
Verify that Pig runs:
6. Learn (the basics of) Pig, start playing with it. You can directly run Pig commands in the shell, or bring up the Pig-specific ‘grunt’ shell and run commands inside it (I recommend not using ‘grunt’).
7. Write a small script called countChars.pig. You’ll do all script typing on your machine, transfer (upload) the script to the sandbox using ‘scp’ (Mac/Linux) or WinSCP (PC), and run it in the sandbox. Note: with WinSCP, you can drag and drop files and folders from your PC to the sandbox, and from the sandbox back to your PC.
Download these files (para1.txt through para6.txt) and use them as the ‘official’ inputs for your script:
* para1.txt
* para2.txt
* para3.txt
* para4.txt
* para5.txt
* para6.txt
Transfer (scp) your .pig script plus the 6 input files, to the sandbox. The input files can be placed in an ‘in’ directory on the sandbox, to keep the sandbox clean (do ‘mkdir in’ in your sandbox, to create the folder). On a PC, bring up WinSCP, and log on to the sandbox in order to transfer files:
* host: 127.0.0.1
* port: 2222
* user: root
* password: < whatever you picked when you were asked to change the default password 'hadoop' >
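On Mac/Linux, the same transfer can be done from a terminal. The following is a minimal sketch, assuming the connection settings listed above (host 127.0.0.1, port 2222, user root) and that your script and the six input files sit in your current local directory. The helper names `sandbox_mkdir_in` and `sandbox_upload` are hypothetical, made up for illustration.

```shell
# Connection settings taken from the list above.
SANDBOX_HOST=127.0.0.1
SANDBOX_PORT=2222
SANDBOX_USER=root

# Create the 'in' directory on the sandbox (ssh takes the port as lowercase -p).
sandbox_mkdir_in() {
    ssh -p "$SANDBOX_PORT" "$SANDBOX_USER@$SANDBOX_HOST" 'mkdir -p in'
}

# Copy local files to a directory on the sandbox (scp takes the port as uppercase -P).
# Usage: sandbox_upload <remote_dir> <file>...
sandbox_upload() {
    remote_dir=$1
    shift
    scp -P "$SANDBOX_PORT" "$@" "$SANDBOX_USER@$SANDBOX_HOST:$remote_dir"
}
```

With these defined, `sandbox_mkdir_in`, then `sandbox_upload in/ para?.txt` and `sandbox_upload . countChars.pig` would place the inputs in ‘in’ and the script in root’s home directory; expect a password prompt on each call.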
Run your program, debug, make changes to the script, upload, run, debug..
Below is a pair of clips that show me uploading and running countChars.pig on four tiny input files named p1.txt, p2.txt, p3.txt, p4.txt. Together, these four text files contain ‘The quick brown fox jumps over the lazy dog’, a sentence that is special because it has all the letters of the alphabet 🙂
download [right-click to save]
download [right-click to save]
Success! As you can tell, the output file contained counts for all 26 letters. Note that I had specified ‘charcount’ as the output directory in my .pig script. If you too specify a directory for output (recommended), make sure this directory does not already exist when you run your script! If it does, you’ll get an error when your script runs. Use ‘rm -rf’ to remove the output directory and its contents each time before running countChars.pig.
The letter counting should IGNORE case, so ‘Tutti Frutti’ would produce 5 for the ‘t’ (or ‘T’) count, not 4.
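To make the case rule concrete, here is a small shell sketch (not Pig, and not something to submit) that counts letters case-insensitively with standard Unix tools; it can serve as an independent cross-check of your script’s totals. The function name `count_letters` is a hypothetical helper, and the tab-separated letter/count layout is an assumption that need not match your Pig output format.

```shell
# Count letters case-insensitively in the given files (or stdin if none are given).
count_letters() {
    cat "$@" |
        tr 'A-Z' 'a-z' |      # fold upper case into lower case
        tr -cd 'a-z' |        # drop everything that is not a letter
        fold -w 1 |           # put one letter on each line
        sort | uniq -c |      # count identical letters
        awk '{ print $2 "\t" $1 }'   # print as: letter <TAB> count
}

# Example from the text: the 't' line should show 5, not 4.
printf 'Tutti Frutti' | count_letters
```

Running `count_letters in/para?.txt` in the sandbox would give totals to compare against what your Pig script writes.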
That was a lot of output from the running processes! The underlying YARN manager takes our .pig script, parses it, and automagically spawns a series of mappers and reducers to run the Pig commands, where possible in parallel. Cool!
Note the syntax for executing a Pig script: ‘pig -x local countChars.pig’. Local execution (i.e. against the sandbox’s own file system) is simpler than executing against HDFS (the Hadoop Distributed File System), which is something you can learn later (you need to copy the inputs and your .pig script to HDFS, then run the script using ‘pig -x mapreduce’).
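The remove-then-run cycle can be wrapped into one step. This is a minimal sketch, assuming the output directory is named ‘charcount’ as in my script and that Pig writes its results as `part-*` files inside that directory; `run_count` is a hypothetical helper name.

```shell
# Remove any stale output directory, run the script in local mode, show the results.
run_count() {
    out_dir=${1:-charcount}          # output directory named in the .pig script
    rm -rf "$out_dir"                # the script errors out if this already exists
    pig -x local countChars.pig &&   # print results only if the run succeeded
        cat "$out_dir"/part-*
}
```

If your script writes to a different directory, pass its name as the first argument, e.g. `run_count myoutput`.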
Next, I upload the ‘official’ para?.txt files and a slightly modified countChars.pig that points to para?.txt, and execute the script:
download [right-click to save]
download [right-click to save]
download [right-click to save]
Again, success! You can see total letter counts for all letters in the six input paragraphs.
Note: your submittable (.pig) could be as small as just 6 lines (one or two more lines if you attempt Part2 below)!! Do allocate enough time, though, to experiment with Pig commands and data flow, since that is how you will arrive at the solution. Translation: do not put off working on this because there doesn’t seem to be much to type in.
Tip: ‘dump’ and ‘describe’ are VERY useful Pig commands to add to your code; they are great debugging aids.
Part2: Modify countChars.pig so that it only outputs totals for the vowels, i.e. for a, e, i, o, u; submit it as countChars_Part2.pig. Feel free to create any extra files you might need for this. If you do so, submit these extra files as well.