Learning Hadoop basics

Monday, January 2nd, 2017

I greatly appreciated a lesson on Apache Hadoop this past fall semester.  Over the winter break, I am posting one assigned problem to help cement my understanding.  This is a personal reference as I move on to learning Apache Spark and the Hadoop Image Processing Interface (HIPI).

This Hadoop exercise involves installing Hadoop on Windows 10, adding a JSON library, and performing a page rank.

Windows 10 installation

At the time of implementation, I was not aware of the Cloudera QuickStart VM or the Hortonworks Sandbox.  So I installed Hadoop directly on Windows so that the setup could be shared among the team members in my class.

I am following an article to install Hadoop 2.7.1 and Java 1.7 on Windows 10 without Cygwin.  Also, a reminder to myself to set the Java path.  The final step of the article is to start the Hadoop file system.

C:\Dev\hadoop-2.7.1\sbin>Start-all.cmd

[Screenshot: localhost:8088, the YARN ResourceManager web UI]

[Screenshot: localhost:50070, the HDFS NameNode web UI]

Adding JSON library

To process JSON input, I need to include an external Java library.  In this article on Stack Overflow, Doug Crockford references a Java JSON library authored by Sean Leary.  With my working directory at c:\Hwork, I place the Java files in the sub-directory c:\Hwork\org\json\.

[Screenshot: json-src]

Compiling the JSON library: javac org/json/*.java

[Screenshot: compile-json]

Build JSON jar file: jar -cvf json.jar org/json/*.class

[Screenshot: jar-json]

Page Rank

As a beginner, I am using the famous wordcount example in the Hadoop documentation as a starting point.  My task is to sort and rank the text file below (one example), which contains multiple lines of JSON.

{"page":"http://marys_flowers/main.html", "list":[{"rose":"http://bobs-flowers/page1"},{"rose":"http://bettys-flowers/rose_page.html"},{"lily":"http://flowers-r-us/lily"},{"rose":"http://bobs-flowers/page1"}]}
{"page":"http://flowerpage/main.html","list":[{"rose":"http://bobs-flowers/page1"},{"mums":"http://flowerpage/page2.html"}]}

Starting from the wordcount example, I renamed the class to PageRank, imported the JSON library, and changed the method signatures to handle Text for both keys and values (download PageRank.java, download PageRank.jar).

[Screenshot: import-json]

[Screenshot: map-signatures]

[Screenshot: reducer-signatures]

In the wordcount example, the map() and reduce() methods work with keys of type Text and values of type IntWritable (reduce() receives its values as an Iterable).  In the screenshots above, I change this to Text for both keys and values.  The types must be consistent across both methods: what map() writes is what reduce() reads.  I also have to set the job output types in main (below).

[Screenshot: setoutput]

The map() code is straightforward.  I parse the list value of the JSONObject into a JSONArray and then write each key and value for reduce() to rank.

[Screenshot: map-parser]
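
As a rough illustration of the map step, here is a minimal plain-Java sketch. It is not the code from PageRank.java: to stay self-contained it swaps the org.json parser for a regular expression and returns the pairs in a list instead of writing them to a Hadoop Context. The names MapSketch, map() and LINK are hypothetical, chosen only for this example, and the pattern assumes keys and URLs contain no escaped quotes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Plain-Java stand-in for the map step (no Hadoop or org.json dependency).
class MapSketch {
    // Matches one inner link object such as {"rose":"http://bobs-flowers/page1"}.
    // Assumption: neither the key nor the URL contains an escaped quote.
    private static final Pattern LINK =
            Pattern.compile("\\{\\s*\"([^\"]+)\"\\s*:\\s*\"([^\"]+)\"\\s*\\}");

    // Returns the (key, url) pairs that would be emitted for one input line.
    static List<String[]> map(String jsonLine) {
        List<String[]> pairs = new ArrayList<>();
        Matcher m = LINK.matcher(jsonLine);
        while (m.find()) {
            pairs.add(new String[] { m.group(1), m.group(2) });
        }
        return pairs;
    }

    public static void main(String[] args) {
        // Second line of the sample input from this post.
        String line = "{\"page\":\"http://flowerpage/main.html\",\"list\":["
                + "{\"rose\":\"http://bobs-flowers/page1\"},"
                + "{\"mums\":\"http://flowerpage/page2.html\"}]}";
        for (String[] pair : map(line)) {
            System.out.println(pair[0] + "\t" + pair[1]);
        }
    }
}
```

The outer object is skipped because its "page" entry is followed by a comma rather than a closing brace, so only the entries inside the list are emitted. A real JSON parser, as used in the actual job, is the safer choice for anything beyond this sample format.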

The reduce() code is also straightforward.  I aggregate the values for the same key.  The ranking is a numeric count of the values per key (below).

[Screenshot: reduce]

The output is as follows.  The second column of integers is the number of values associated with that row's key.

1 1 lily http://flowers-r-us/lily
1 1 mums http://flowerpage/page2.html
1 4 rose http://bobs-flowers/page1,http://bobs-flowers/page1,http://bettys-flowers/rose_page.html,http://bobs-flowers/page1
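
The aggregation behind this output can be sketched in plain Java. This is only an approximation under stated assumptions, not the code from PageRank.java: a TreeMap stands in for Hadoop's shuffle and key sorting, the names ReduceSketch and reduce() are hypothetical, and the leading column of the real output is not reproduced because the post does not describe it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java stand-in for the shuffle and reduce steps (no Hadoop dependency).
class ReduceSketch {
    // Groups URLs by key, then builds one "count key url1,url2,..." row per key.
    static List<String> reduce(List<String[]> pairs) {
        // TreeMap keeps keys sorted, similar to how Hadoop sorts reducer keys.
        Map<String, List<String>> grouped = new TreeMap<>();
        for (String[] pair : pairs) {
            grouped.computeIfAbsent(pair[0], k -> new ArrayList<>()).add(pair[1]);
        }
        List<String> rows = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : grouped.entrySet()) {
            List<String> urls = e.getValue();
            // The rank is simply the number of values collected for the key.
            rows.add(urls.size() + " " + e.getKey() + " " + String.join(",", urls));
        }
        return rows;
    }

    public static void main(String[] args) {
        // The six (key, url) pairs contained in the two sample input lines.
        List<String[]> pairs = new ArrayList<>();
        pairs.add(new String[] { "rose", "http://bobs-flowers/page1" });
        pairs.add(new String[] { "rose", "http://bettys-flowers/rose_page.html" });
        pairs.add(new String[] { "lily", "http://flowers-r-us/lily" });
        pairs.add(new String[] { "rose", "http://bobs-flowers/page1" });
        pairs.add(new String[] { "rose", "http://bobs-flowers/page1" });
        pairs.add(new String[] { "mums", "http://flowerpage/page2.html" });
        reduce(pairs).forEach(System.out::println);
    }
}
```

On these six pairs the counts come out as 1 for lily, 1 for mums and 4 for rose, which agrees with the second column above. The order of URLs within a row may differ from a real Hadoop run, since Hadoop does not guarantee the order of values for a key.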

Build + Execution

Compile:   C:\Hwork>javac -classpath c:\dev\hadoop-2.7.1\share\hadoop\common\hadoop-common-2.7.1.jar;c:\dev\hadoop-2.7.1\share\hadoop\mapreduce\hadoop-mapreduce-client-core-2.7.1.jar;c:\dev\hadoop-2.7.1\share\hadoop\common\lib\gson-2.2.4.jar;c:\dev\hadoop-2.7.1\share\hadoop\common\lib\commons-cli-1.2.jar PageRank.java
Build JAR:   C:\Hwork>jar -cvf PageRank.jar *.class

[Screenshot: build]

Create an input directory on HDFS and copy the input file there for execution.

C:\Dev\hadoop-2.7.1\sbin>hadoop fs -mkdir /in

C:\Dev\hadoop-2.7.1\sbin>hadoop fs -copyFromLocal c:\Hwork\input\input.txt /in

[Screenshot: create_in]

Make the JSON library JAR file available by copying it into a directory that is on the Hadoop classpath.

C:\Dev\hadoop-2.7.1\sbin>hadoop classpath

[Screenshot: classpath]

Copy the JSON library JAR file into the common directory; the PageRank code will then be able to access the JSON library when executed.

[Screenshot: json-dir]

Execute PageRank:

C:\Dev\hadoop-2.7.1\sbin>hadoop jar c:\Hwork\PageRank.jar PageRank /in /out

[Screenshot: execute]

Retrieve output file from HDFS:

C:\Dev\hadoop-2.7.1\sbin>hadoop fs -copyToLocal /out/* c:\Hwork\output

[Screenshot: retrieve]

Reference: Tom White, “Hadoop: The Definitive Guide”, 4th Edition, O’Reilly.  ISBN: 978-1-491-90163-2