Hadoop

The CHPC offers a small Hadoop cluster named 'elephant'. The 'elephant' cluster has the following characteristics:

  • name nodes: 2
  • data nodes: 24 (3.7 TB HDD per node)
  • replication factor: 3
  • login nodes: 2
  • Hadoop software stack: HDP 2.2.4

NOTE: The Hadoop cluster is currently down in order to be relocated and to be updated to Hadoop 2.7.1 and Spark 1.6.0. This process has encountered a number of issues and the ETA for returning the cluster to service is unknown. We will update this page when more information is available. (6 July 2016)

Access to the cluster

You need to obtain permission to run on the CHPC Hadoop cluster. To request access, please send an email to issues@chpc.utah.edu and cc Sam Liston (sam.liston@utah.edu). Don't forget to add 'Access to the Hadoop cluster' to your email's subject line.

Once you have obtained permission to log into CHPC's 'elephant' cluster, you can access it in the following way (the elephant cluster can only be accessed by first logging into one of the compute clusters):

ssh -X $UNID@$cluster.chpc.utah.edu
ssh -X $UNID@elephant.chpc.utah.edu

where $cluster stands for one of the compute clusters (ember, kingspeak, lonepeak, ...) and $UNID (here, as in the remainder of this discussion) stands for the user's University of Utah ID.
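
For example, a (hypothetical) user u0123456 logging in via kingspeak would type:

ssh -X u0123456@kingspeak.chpc.utah.edu
ssh -X u0123456@elephant.chpc.utah.edu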

Running your first Hadoop program

We provide a basic example, WordCount, to test out and get familiar with the new Hadoop environment.

Step 1: Retrieve the example code

The example code can be retrieved from CHPC's GitLab in the following way:

cd $EXAMPLEDIR
module load git
git clone https://gitlab.chpc.utah.edu/u0253283/Hadoop-Examples.git WordCount

Note that we encourage you to set the variable EXAMPLEDIR to a directory of your choice. If EXAMPLEDIR is not set, 'cd $EXAMPLEDIR' falls back to your HOME directory, and the WordCount directory will be created there.
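
For example, you could point EXAMPLEDIR at a location of your choice before cloning (the path below is only an illustration):

export EXAMPLEDIR=$HOME/hadoop-examples
mkdir -p $EXAMPLEDIR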

Step 2: Copy data from your local file system to the Hadoop Distributed File System (HDFS)

We will subject Machiavelli's work "The Prince" to the WordCount program.

First, we retrieve 'The Prince' from Project Gutenberg by executing the following commands:

cd WordCount
wget http://www.gutenberg.org/cache/epub/1232/pg1232.txt
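
To quickly verify that the download succeeded, you can inspect the file (line counts may differ slightly between Project Gutenberg revisions):

wc -l pg1232.txt
head -n 5 pg1232.txt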

In order to initialize the Hadoop environment, we must load the following two modules:

module load jdk/1.7.0_67
module load hadoop/2.6.0 
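
To verify that the Hadoop environment has been set up correctly, you can query the version:

hadoop version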

In the next step, we copy 'The Prince' into HDFS by executing the following commands:

hadoop fs -mkdir /user/$UNID
hadoop fs -mkdir machia-inp
hadoop fs -copyFromLocal pg1232.txt machia-inp

If you want to see the content of the input directory, please type:

hadoop fs -ls -R machia-inp
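
You can also inspect the first lines of the uploaded file directly in HDFS, without copying it back:

hadoop fs -cat machia-inp/pg1232.txt | head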

Step 3.1: Building & testing the Java version of WordCount

The Hadoop framework is written in Java, so Java code has easy access to the Hadoop functionality. In order to run the Java code, we first need to build a jar file. This can easily be done using Maven:

module load maven/3.3.3

cd WordCount-Java
mvn -e package 
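
You can list the target subdirectory to check for the build artifact:

ls target/*.jar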

If the build was successful, the listing shows the corresponding WordCount-1.0.jar file in the target subdirectory. We are now ready to execute the Java WordCount code:

hadoop jar target/WordCount-1.0.jar org.chpc.wim.wordcount.WordCount \
-Dmapred.job.queue.name=general -D mapreduce.job.reduces=5 \
-DwordcountInput=/user/$UNID/machia-inp/pg1232.txt \
-DwordcountOutput=/user/$UNID/machia-out-java

In the above example, we used two additional flags:

  • -Dmapred.job.queue.name=general (the name of the queue (general|chpc); this flag is MANDATORY)
  • -D mapreduce.job.reduces=5 (OPTIONAL: the number of reducers to be used; the default value is 1)

We are now ready to check the output (assuming the MapReduce job completed successfully).

We can check the output by copying it back to our local file system in the following way:

cd ..
hadoop fs -copyToLocal machia-out-java . 

If the WordCount job was successful, the directory will contain an empty _SUCCESS file as well as N part-r-0000x files, where N stands for the number of reducers that have been used.
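
Each line of the output has the form 'word<TAB>count' (the default TextOutputFormat), so the most frequent words can be extracted from the local copy with standard Unix tools (a quick illustration):

sort -k2,2nr machia-out-java/part-r-* | head -20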

We can also check the output without the need to copy it over from the HDFS file system:

hadoop fs -ls -R machia-out-java
hadoop fs -cat machia-out-java/part-r-00000

We can easily delete the output in the HDFS file system by invoking the following command:

hadoop fs -rm -R machia-out-java
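
If the HDFS trash feature happens to be enabled on the cluster (we have not verified this here), removed files are first moved to a .Trash directory; to bypass the trash and free the space immediately, add the -skipTrash flag:

hadoop fs -rm -R -skipTrash machia-out-java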

Step 3.2: Testing the Python version of WordCount

The Hadoop framework can also interact with Python code through the Hadoop streaming jar. 

The corresponding Python code can now be executed as follows:

cd WordCount-Python
hadoop jar /usr/hdp/2.2.4.2-2/hadoop-mapreduce/hadoop-streaming-2.6.0.2.2.4.2-2.jar \
-D mapred.job.queue.name=general -D mapreduce.job.reduces=5 \
-input /user/$UNID/machia-inp/pg1232.txt \
-output /user/$UNID/machia-out-python \
-mapper ./mapper.py -file ./mapper.py \
-reducer ./reducer.py -file ./reducer.py

Note that the following flag is MANDATORY (as in the Java case):

-D mapred.job.queue.name=general

The output can be checked in exactly the same way as for the Java example.
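
Because streaming mappers and reducers simply read from stdin and write to stdout, they can also be sanity-tested locally, without invoking Hadoop at all (this assumes mapper.py and reducer.py are executable and that pg1232.txt still resides one directory up):

cat ../pg1232.txt | ./mapper.py | sort | ./reducer.py | head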

Links

Apache Hadoop: https://hadoop.apache.org/