I greatly appreciated a course on Apache Hadoop this fall semester. Over the winter break, I am posting one assigned problem to help cement my understanding. This is a personal reference as I move on to learn Apache Spark and the Hadoop Image Processing Interface (HIPI).
This Hadoop exercise involves installing Hadoop on Windows 10, adding a JSON library, and performing a page rank.
I am following Tushar Sarde's article to install Hadoop 2.7.1 and Java 1.7 on Windows 10 without Cygwin. Also, a reminder to myself: set the Java path. The final step of the article is to start the Hadoop file system.
Adding JSON library
To process JSON input, I need to include an external Java library. In this article on StackOverflow, Doug Crockford references a Java JSON library authored by Sean Leary. With my working directory at c:\Hwork, I place the Java files in the sub-directory c:\Hwork\org\json\.
Build the JSON jar file (after compiling the sources to .class files): jar -cvf json.jar org/json/*.class
As a beginner, I am using the famous wordcount example in the Hadoop documentation as a starting point. My task is to sort and rank the text file below (one example), which contains multiple lines of JSON.
By default, the keys and Iterable values passed between the map() and reduce() methods are of type Text and IntWritable. In the screenshots above, I change this to Text for both keys and values. The types must be consistent across both methods, and I also have to set the job output types in main (below).
The map() code is straightforward. I parse the value list of each JSONObject into a JSONArray and then write each key and value for reduce() to rank.
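A minimal sketch of the map step in plain Java, without Hadoop or org.json on the classpath. It assumes each input record has already been parsed into a key and a list of string values, and that one (key, value) pair is emitted per list element. The class name MapSketch and the method emitPairs are hypothetical, chosen only for illustration. In the real job, JSONObject and JSONArray would do the parsing and context.write() would do the emitting.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

// Plain-Java stand-in for the map() step; no Hadoop or org.json classes are used.
final class MapSketch {

    // Emits one (key, value) pair per element of the value list,
    // mirroring a context.write(key, value) call inside a Hadoop mapper.
    static List<Map.Entry<String, String>> emitPairs(String key, List<String> values) {
        List<Map.Entry<String, String>> pairs = new ArrayList<>();
        for (String value : values) {
            pairs.add(new SimpleEntry<>(key, value));
        }
        return pairs;
    }

    public static void main(String[] args) {
        // Hypothetical record: key "A" with a list of three values.
        for (Map.Entry<String, String> pair : emitPairs("A", Arrays.asList("B", "C", "D"))) {
            System.out.println(pair.getKey() + "\t" + pair.getValue());
        }
    }
}
```

Emitting one pair per element keeps the mapper stateless, so all counting is left to the reduce step.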
The reduce() code is also straightforward. I aggregate the values for the same key. The rank is a numeric count of the values per key (below).
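The counting idea can be sketched in plain Java, again without Hadoop. This is an illustration under assumptions, not the actual reducer: the class name ReduceSketch and the method countValuesPerKey are hypothetical, and each pair is represented as a two-element String array {key, value}. The TreeMap plays the role of the shuffle, which groups values by key and hands keys to the reducer in sorted order.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java stand-in for the shuffle and reduce() steps; no Hadoop classes are used.
final class ReduceSketch {

    // Counts how many values arrive for each key. Each pair is {key, value}.
    // The TreeMap keeps the keys in sorted order.
    static Map<String, Integer> countValuesPerKey(List<String[]> pairs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String[] pair : pairs) {
            Integer current = counts.get(pair[0]);
            counts.put(pair[0], current == null ? 1 : current + 1);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical pairs as they might leave the map step.
        List<String[]> pairs = Arrays.asList(
            new String[] {"A", "B"},
            new String[] {"A", "C"},
            new String[] {"B", "A"});
        for (Map.Entry<String, Integer> entry : countValuesPerKey(pairs).entrySet()) {
            System.out.println(entry.getKey() + "\t" + entry.getValue());
        }
    }
}
```

With the hypothetical pairs in main, key A has two values and key B has one, so each printed line is a key followed by its count.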
The output is as follows. The second column of integers is the number of values associated with that row's key.
Build + Execution
Compile: C:\Hwork>javac -classpath c:\dev\hadoop-2.7.1\share\hadoop\common\hadoop-common-2.7.1.jar;c:\dev\hadoop-2.7.1\share\hadoop\mapreduce\hadoop-mapreduce-client-core-2.7.1.jar;c:\dev\hadoop-2.7.1\share\hadoop\common\lib\gson-2.2.4.jar;c:\dev\hadoop-2.7.1\share\hadoop\common\lib\commons-cli-1.2.jar PageRank.java
Build JAR: C:\Hwork>jar -cvf PageRank.jar *.class
Create an input directory on HDFS and copy the input file there for execution.
C:\Dev\hadoop-2.7.1\sbin>hadoop fs -mkdir /in
C:\Dev\hadoop-2.7.1\sbin>hadoop fs -copyFromLocal c:\Hwork\input\input.txt /in
Make the JSON library JAR file available by copying it into a directory that is on the Hadoop classpath, then run the job:
C:\Dev\hadoop-2.7.1\sbin>hadoop jar c:\Hwork\PageRank.jar PageRank /in /out
Retrieve output file from HDFS:
C:\Dev\hadoop-2.7.1\sbin>hadoop fs -copyToLocal /out/* c:\Hwork\output
Reference: Tom White, "Hadoop: The Definitive Guide", 4th Edition, O'Reilly