NOTE: The Hadoop cluster is currently down while it is being relocated and upgraded to Hadoop 2.7.1 and Spark 1.6.0. This process has encountered a number of issues, and the ETA for returning the node to service is unknown. We will post an update when more information is available. 6 July 2016
Access to the cluster
You need to obtain permission to run on the CHPC Hadoop cluster. Therefore, please send an email to email@example.com and cc Sam Liston (firstname.lastname@example.org). Don't forget to add 'Access to the Hadoop cluster' in your email's subject line.
Once you have obtained permission to log into CHPC's 'elephant' cluster, you can access it as follows (the elephant cluster can only be reached by first logging into one of the compute clusters):
ssh -X $UNID@$cluster.chpc.utah.edu
ssh -X $UNID@elephant.chpc.utah.edu
where cluster stands for ember, kingspeak, lonepeak, ... and UNID (as in the remainder of this discussion) stands for the user's UoU ID.
Running your first Hadoop program
We provide a basic example, WordCount, to test and get familiar with the new Hadoop environment.
Step 1: Retrieve the example code
The example code can be retrieved from CHPC's gitlab in the following way:
module load git
git clone https://gitlab.chpc.utah.edu/u0253283/Hadoop-Examples.git WordCount
Note that we encourage you to set the variable EXAMPLEDIR to a directory of your choice. If the EXAMPLEDIR is not set, the WordCount directory will be created in your HOME directory.
Step 2: Copy data from your local fs to the Hadoop File System (HDFS):
We will subject Machiavelli's work "The Prince" to the WordCount program.
As a first step, retrieve 'The Prince' from Project Gutenberg and save it in your working directory as pg1232.txt, the file name used in the commands below.
In order to initialize the Hadoop environment, we must load the following 2 modules:
module load jdk/1.7.0_67
module load hadoop/2.6.0
In the next step, we will copy the 'Prince' into the HDFS filesystem, by executing the following commands:
hadoop fs -mkdir /user/$UNID
hadoop fs -mkdir machia-inp
hadoop fs -copyFromLocal pg1232.txt machia-inp
If you want to see the content of the input directory, please type:
hadoop fs -ls -R machia-inp
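If you stage data regularly, the commands of this step can be generated from a small script. The following is a minimal sketch, not a CHPC-provided tool: the function names staging_commands and render and the uNID u0000000 are hypothetical placeholders, and the script only builds and prints the command lines shown above rather than executing them.

```python
import shlex


def staging_commands(unid, local_file="pg1232.txt", hdfs_dir="machia-inp"):
    """Return the HDFS staging commands of Step 2 as argument lists."""
    return [
        # Create the user's HDFS home directory.
        ["hadoop", "fs", "-mkdir", f"/user/{unid}"],
        # Create the input directory (relative to the HDFS home directory).
        ["hadoop", "fs", "-mkdir", hdfs_dir],
        # Copy the local text file into the input directory.
        ["hadoop", "fs", "-copyFromLocal", local_file, hdfs_dir],
        # List the input directory to confirm the copy.
        ["hadoop", "fs", "-ls", "-R", hdfs_dir],
    ]


def render(commands):
    """Turn argument lists into shell-safe command strings."""
    return [shlex.join(cmd) for cmd in commands]


# "u0000000" is a placeholder uNID used only for illustration.
for line in render(staging_commands("u0000000")):
    print(line)
```

To actually run the commands on the cluster, each argument list could be passed to subprocess.run(cmd, check=True) after the jdk and hadoop modules have been loaded.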
Step 3.1: Building & testing the Java version of WordCount
The Hadoop framework is written in Java, so Java code has direct access to the Hadoop functionality. In order to run the Java code, we first need to build a Jar file. This can easily be done using Maven:
module load maven/3.3.3
mvn -e package
If the build was successful, we should see the corresponding WordCount-1.0.jar file in the target subdirectory. We are now ready to execute the Java WordCount code:
hadoop jar target/WordCount-1.0.jar org.chpc.wim.wordcount.WordCount \
-Dmapred.job.queue.name=general -D mapreduce.job.reduces=5 \
In the above example, we used 2 additional flags:
- -Dmapred.job.queue.name=general (name of the queue (general|chpc), which is MANDATORY)
- -D mapreduce.job.reduces=5 (OPTIONAL flag -> number of reducers to be used. The default value is 1)
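To see what the number of reducers means in practice: Hadoop's default hash partitioner sends each key to reducer (hash & Integer.MAX_VALUE) % numReduceTasks, so every occurrence of a word ends up at the same reducer. The sketch below illustrates that rule in Python. The function names byte_hash and reducer_for are hypothetical, and the hash is a simple stand-in, so the reducer numbers it prints are not guaranteed to match what the cluster assigns.

```python
def byte_hash(word):
    """Stand-in 32-bit hash over the UTF-8 bytes of a word (illustrative only)."""
    h = 0
    for b in word.encode("utf-8"):
        h = (31 * h + b) & 0xFFFFFFFF  # keep the value within 32 bits
    return h


def reducer_for(word, num_reducers):
    """Apply the partitioning rule: clear the sign bit, then take the modulus."""
    return (byte_hash(word) & 0x7FFFFFFF) % num_reducers


# Group a few sample words by the reducer they would be sent to (5 reducers).
words = ["the", "prince", "fortune", "the", "state", "prince"]
buckets = {}
for w in words:
    buckets.setdefault(reducer_for(w, 5), set()).add(w)

for r in sorted(buckets):
    print(f"reducer {r}: {sorted(buckets[r])}")
```

Because the assignment depends only on the key, repeated words such as "the" always land in the same bucket, which is why each reducer can produce final counts on its own.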
We are now ready to check the output (assuming the MapReduce went well).
We can check the output by copying it back to our local file system in the following way:
hadoop fs -copyToLocal machia-out-java .
If the WordCount job was successful, the directory will contain an empty _SUCCESS file as well as N part-r-0000x files, where N stands for the number of reducers that were used.
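Since the result is spread over several part-r-0000x files, it is often convenient to merge them after copying the directory to the local file system. The following is a minimal sketch, assuming each output line has the form word, tab, count (the default text output layout, which the WordCount code in the repository may or may not use). The function names merge_counts, job_succeeded and top_words are hypothetical.

```python
import glob
import os


def job_succeeded(output_dir):
    """The job is considered successful if the _SUCCESS marker file exists."""
    return os.path.exists(os.path.join(output_dir, "_SUCCESS"))


def merge_counts(output_dir):
    """Read every part-r-* file and combine the counts into one dictionary."""
    counts = {}
    for path in sorted(glob.glob(os.path.join(output_dir, "part-r-*"))):
        with open(path, encoding="utf-8") as fh:
            for line in fh:
                line = line.rstrip("\n")
                if not line:
                    continue
                # Assumed layout: "<word>\t<count>"
                word, _, count = line.rpartition("\t")
                counts[word] = counts.get(word, 0) + int(count)
    return counts


def top_words(counts, n=10):
    """Return the n most frequent words, ties broken alphabetically."""
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:n]


if __name__ == "__main__":
    out_dir = "machia-out-java"  # local copy created by -copyToLocal
    if job_succeeded(out_dir):
        for word, count in top_words(merge_counts(out_dir)):
            print(word, count)
```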
We can also check the output without copying it over from the HDFS file system:
hadoop fs -ls -R machia-out-java
hadoop fs -cat machia-out-java/part-r-00000
We can easily delete the output in the HDFS file system, by invoking the following command:
hadoop fs -rm -R machia-out-java
Step 3.2: Testing the Python version of WordCount
The Hadoop framework can also interact with Python code through the Hadoop streaming jar.
The corresponding Python code can now be executed as follows:
hadoop jar /usr/hdp/126.96.36.199-2/hadoop-mapreduce/hadoop-streaming-188.8.131.52.2.4.2-2.jar \
-D mapred.job.queue.name=general -D mapreduce.job.reduces=5 \
-input /user/$UNID/machia-inp/pg1232.txt \
-output /user/$UNID/machia-out-python \
-mapper ./mapper.py -file ./mapper.py \
-reducer ./reducer.py -file ./reducer.py
Note that the queue flag -D mapred.job.queue.name=general is MANDATORY (as in the Java case).
The output can be checked in exactly the same way as for the Java example.
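For readers new to Hadoop streaming, the sketch below shows the general shape of a streaming mapper and reducer: the mapper reads lines and emits tab-separated key/value pairs, the framework sorts them by key, and the reducer sums the values of consecutive identical keys. This is an illustration under stated assumptions, not the mapper.py and reducer.py shipped in the repository. The function names map_lines, reduce_lines and simulate, as well as the tokenization rule (lowercase letters and apostrophes), are choices made here for the example.

```python
import re
from itertools import groupby


def map_lines(lines):
    """Mapper logic: emit "<word>\t1" for every word in every input line."""
    for line in lines:
        for word in re.findall(r"[a-z']+", line.lower()):
            yield f"{word}\t1"


def reduce_lines(sorted_lines):
    """Reducer logic: input is sorted by key, so equal words are adjacent."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(n) for _, n in group)}"


def simulate(lines):
    """Local stand-in for the streaming pipeline: map, sort, then reduce."""
    return list(reduce_lines(sorted(map_lines(lines))))


print(simulate(["The prince and the state", "the prince"]))
# prints ['and\t1', 'prince\t2', 'state\t1', 'the\t3']
```

In an actual streaming job, the mapper script would apply map_lines to sys.stdin and print each result, and the reducer script would do the same with reduce_lines. Testing the pair locally in this way before submitting the job can save queue time.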