
Learning Hadoop basics

Monday, January 2nd, 2017

I greatly appreciated a lesson on Apache Hadoop this fall semester.  Over the winter break, I am posting one assigned problem to help cement my understanding.  It also serves as a personal reference as I move on to learn Apache Spark and the Hadoop Image Processing Interface (HIPI).

This Hadoop exercise involves installing Hadoop on Windows 10, adding a JSON library, and performing a page rank.

Windows 10 installation

At the time of implementation, I was not aware of the Cloudera QuickStart VM or the Hortonworks sandboxes.  So I went with a native Windows installation that could be shared among the team members in my class.

I followed an article to install Hadoop 2.7.1 and Java 1.7 on Windows 10 without Cygwin.  Also, a reminder to myself: set the Java path.  The final step of the article is to start the Hadoop file system.




Adding JSON library

To process JSON input, I need to include an external Java library.  In an article on StackOverflow, Doug Crockford references a Java JSON library authored by Sean Leary.  With my working directory at c:\Hwork, I placed the Java source files in the sub-directory c:\Hwork\org\json\.


Compile the JSON library: javac org/json/*.java

Build JSON jar file: jar -cvf json.jar org/json/*.class


Page Rank

As a beginner, I am using the famous wordcount example from the Hadoop documentation as a starting point.  My task is to sort and rank a text file containing multiple lines of JSON.  One example line:

{"page":"http://marys_flowers/main.html", "list":[{"rose":"http://bobs-flowers/page1"},{"rose":"http://bettys-flowers/rose_page.html"},{"lily":"http://flowers-r-us/lily"},{"rose":"http://bobs-flowers/page1"}]}

Starting from the wordcount example, I renamed it to PageRank, included the JSON library, and changed the method signatures to handle Text for both keys and values (download PageRank.jar).




In the wordcount example, the map() and reduce() methods work with keys of type Text and Iterable values of type IntWritable.  In my code, I changed this to Text for both keys and values.  The types must be consistent across both methods: what map() writes is what reduce() receives.  I also have to set the job output types in main to match.


The map() code is straightforward.  I parse the "list" value of the JSONObject into a JSONArray and then write each key and value for reduce() to rank.
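The map step described above can be sketched in plain Java.  This is a minimal sketch, not the actual Hadoop mapper: it assumes every entry in "list" is a single-pair object, as in the sample line, and it uses a regular expression as a stand-in for the JSONObject/JSONArray parsing so that it runs without Hadoop or the JSON library.  The names MapSketch and map are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MapSketch {
    // Matches one single-pair object such as {"rose":"http://bobs-flowers/page1"}.
    // The outer {"page":..., "list":[...]} object does not match, because its
    // first pair is followed by a comma rather than a closing brace.
    private static final Pattern PAIR =
        Pattern.compile("\\{\\s*\"([^\"]+)\"\\s*:\\s*\"([^\"]+)\"\\s*\\}");

    /** Returns one {keyword, url} pair per entry of the "list" array, in input order. */
    static List<String[]> map(String jsonLine) {
        List<String[]> pairs = new ArrayList<>();
        Matcher m = PAIR.matcher(jsonLine);
        while (m.find()) {
            // In a real mapper this would be a context.write(key, value) call.
            pairs.add(new String[] { m.group(1), m.group(2) });
        }
        return pairs;
    }
}
```

Calling MapSketch.map on the sample line yields four pairs: three with the key rose and one with the key lily.  Duplicates are kept on purpose, since the rank is a count of values per key.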


The reduce() code is also straightforward.  I aggregate the values for the same key.  The rank is a numeric count of the values per key.
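The aggregation can likewise be sketched in plain Java.  Again, this is only a sketch under assumptions: the grouping that Hadoop performs between map and reduce is imitated with a TreeMap, the class and method names (ReduceSketch, group, reduce) are hypothetical, and the output line format "count key url1,url2,..." is my simplification rather than the exact format the job writes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class ReduceSketch {
    /** Groups {keyword, url} pairs by keyword; the TreeMap keeps the keys sorted. */
    static Map<String, List<String>> group(List<String[]> pairs) {
        Map<String, List<String>> grouped = new TreeMap<>();
        for (String[] pair : pairs) {
            grouped.computeIfAbsent(pair[0], k -> new ArrayList<>()).add(pair[1]);
        }
        return grouped;
    }

    /** The rank is the number of values for the key; the values are joined with commas. */
    static String reduce(String key, List<String> urls) {
        return urls.size() + " " + key + " " + String.join(",", urls);
    }
}
```

One difference worth noting: this sketch keeps the values in insertion order, whereas Hadoop makes no promise about the order of values within a key.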


The output is as follows.  The second column of integers is the number of values associated with that row's key.

1 1 lily http://flowers-r-us/lily
1 1 mums http://flowerpage/page2.html
1 4 rose http://bobs-flowers/page1,http://bobs-flowers/page1,http://bettys-flowers/rose_page.html,http://bobs-flowers/page1

Build + Execution

Compile:   C:\Hwork>javac -classpath c:\dev\hadoop-2.7.1\share\hadoop\common\hadoop-common-2.7.1.jar;c:\dev\hadoop-2.7.1\share\hadoop\mapreduce\hadoop-mapreduce-client-core-2.7.1.jar;c:\dev\hadoop-2.7.1\share\hadoop\common\lib\gson-2.2.4.jar;c:\dev\hadoop-2.7.1\share\hadoop\common\lib\commons-cli-1.2.jar
Build JAR:   C:\Hwork>jar -cvf PageRank.jar *.class


Create an input directory on HDFS and copy the input file there for execution.

C:\Dev\hadoop-2.7.1\sbin>hadoop fs -mkdir /in

C:\Dev\hadoop-2.7.1\sbin>hadoop fs -copyFromLocal c:\Hwork\input\input.txt /in


Make the JSON library JAR file available by copying it into a directory that is on the Hadoop classpath.  The following command lists those directories:

C:\Dev\hadoop-2.7.1\sbin>hadoop classpath


Copy the JSON library JAR file into the common directory; the PageRank code will then be able to access the JSON library when executed.

Execute PageRank:

C:\Dev\hadoop-2.7.1\sbin>hadoop jar c:\Hwork\PageRank.jar PageRank /in /out


Retrieve output file from HDFS:

C:\Dev\hadoop-2.7.1\sbin>hadoop fs -copyToLocal /out/* c:\Hwork\output


Reference: Tom White, "Hadoop: The Definitive Guide", 4th Edition, O'Reilly