It's very important to learn Hadoop by practice.
One of the early hurdles is writing your first MapReduce application and debugging it in your favorite IDE, Eclipse. Do we need any Eclipse plugins? No, we do not; we can do Hadoop development without any MapReduce plugins.
This tutorial will show you how to set up Eclipse and run your MapReduce project and job right from the IDE. Before you read further, you should have a Hadoop single-node cluster set up on your machine.
Use Case:
We will explore weather data to find the maximum temperature per year, using the example from Chapter 2 of Tom White's book Hadoop: The Definitive Guide (3rd edition), and run it using ToolRunner.
I am using Linux Mint 15 on a VirtualBox VM instance.
In addition, you should have
- A Hadoop (MRv1; I am using 1.2.1) single-node cluster installed and running. If you have not done so, I strongly recommend you set one up first.
- The Eclipse IDE downloaded; as of this writing, the latest version is Kepler.
1. Create New Java Project
2. Add Dependencies JARs
Right-click the project, select Properties, and open Java Build Path.
Add all JARs from $HADOOP_HOME/lib and from $HADOOP_HOME itself (where the hadoop-core and hadoop-tools JARs live).
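If the build path is set up correctly, the Hadoop classes will resolve at runtime. As a quick sanity check before running the job, you can probe the classpath for a class; this small helper is my own sketch, not part of Hadoop or Eclipse:

```java
// ClasspathCheck.java - a small, hypothetical helper to verify that a
// class can be loaded from the current classpath.
public class ClasspathCheck {

    // Returns true if the named class can be loaded, false otherwise.
    public static boolean isOnClasspath(String className) {
        try {
            Class.forName(className);
            return true;
        } catch (ClassNotFoundException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // With the Hadoop JARs on the build path, this should print true.
        System.out.println(isOnClasspath("org.apache.hadoop.mapreduce.Job"));
    }
}
```

If this prints false, revisit step 2 and make sure the hadoop-core JAR is on the build path.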
3. Create Mapper
package com.letsdobigdata;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MaxTemperatureMapper extends
    Mapper<LongWritable, Text, Text, IntWritable> {

  private static final int MISSING = 9999;

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String line = value.toString();
    String year = line.substring(15, 19);
    int airTemperature;
    if (line.charAt(87) == '+') { // parseInt doesn't like leading plus signs
      airTemperature = Integer.parseInt(line.substring(88, 92));
    } else {
      airTemperature = Integer.parseInt(line.substring(87, 92));
    }
    String quality = line.substring(92, 93);
    if (airTemperature != MISSING && quality.matches("[01459]")) {
      context.write(new Text(year), new IntWritable(airTemperature));
    }
  }
}
4. Create Reducer
package com.letsdobigdata;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class MaxTemperatureReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {

  @Override
  public void reduce(Text key, Iterable<IntWritable> values,
      Context context)
      throws IOException, InterruptedException {

    int maxValue = Integer.MIN_VALUE;
    for (IntWritable value : values) {
      maxValue = Math.max(maxValue, value.get());
    }
    context.write(key, new IntWritable(maxValue));
  }
}
5. Create Driver for MapReduce Job
The MapReduce job is executed via the handy Hadoop utility class ToolRunner, which parses the generic Hadoop command-line options for us.
package com.letsdobigdata;

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

/* This class is responsible for running the MapReduce job */
public class MaxTemperatureDriver extends Configured implements Tool {

  public int run(String[] args) throws Exception {

    if (args.length != 2) {
      System.err.println("Usage: MaxTemperatureDriver <input path> <output path>");
      return -1;
    }

    // Pass in the configuration that ToolRunner prepared; a bare new Job()
    // would discard any generic options parsed from the command line.
    Job job = new Job(getConf());
    job.setJarByClass(MaxTemperatureDriver.class);
    job.setJobName("Max Temperature");

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    job.setMapperClass(MaxTemperatureMapper.class);
    job.setReducerClass(MaxTemperatureReducer.class);

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    boolean success = job.waitForCompletion(true);
    return success ? 0 : 1;
  }

  public static void main(String[] args) throws Exception {
    MaxTemperatureDriver driver = new MaxTemperatureDriver();
    int exitCode = ToolRunner.run(driver, args);
    System.exit(exitCode);
  }
}
6. Supply Input and Output
We need to supply the input file that will be read during the map phase; the final output will be written to the output directory by the reduce task. Edit the Run Configuration and supply the input and output paths as command-line arguments. sample.txt resides in the project root. Your Project Explorer should contain the following.
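To see why the mapper uses those particular substring offsets, here is a standalone sketch of the same fixed-width parsing applied to a synthetic NCDC-style record. The sample line below is fabricated for illustration only; real records come from the NCDC dataset:

```java
public class ParseDemo {

    // Extracts the temperature (in tenths of a degree Celsius) using the
    // same fixed-width offsets as MaxTemperatureMapper.
    public static int parseTemperature(String line) {
        if (line.charAt(87) == '+') { // parseInt rejects a leading plus sign
            return Integer.parseInt(line.substring(88, 92));
        }
        return Integer.parseInt(line.substring(87, 92));
    }

    // The four-digit year lives at columns 15-18.
    public static String parseYear(String line) {
        return line.substring(15, 19);
    }

    public static void main(String[] args) {
        // A fabricated 93-character record: year at columns 15-18,
        // signed temperature at columns 87-91, quality code at column 92.
        String line = "0".repeat(15)   // columns 0-14: padding
                + "1949"               // columns 15-18: year
                + "0".repeat(68)       // columns 19-86: padding
                + "+0111"              // columns 87-91: +11.1 C, in tenths
                + "1";                 // column 92: quality code
        System.out.println(parseYear(line) + " " + parseTemperature(line));
        // prints "1949 111"
    }
}
```

Running this prints the same year/temperature pair the mapper would emit for such a record.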
7. Map Reduce Job Execution
8. Final Output
If you have made it this far: once the job is complete, it creates the output directory containing a _SUCCESS marker and a part-nnnnn file. Double-click the part file to view it in the Eclipse editor. We supplied 5 rows of weather data (downloaded from the NCDC) and wanted to find the maximum temperature for each year in the input file; the output contains 2 rows, one per supplied year, with the maximum temperature in tenths of a degree Celsius:
1949 111 (11.1 C)
1950 22 (2.2 C)
Make sure you delete the output directory before the next run of your application; otherwise Hadoop will fail with an error saying the directory already exists.
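When running locally from Eclipse, the output directory is just a folder on the local filesystem, so you can clear it before each run. Here is a minimal sketch in plain Java; the directory name "output" is only an example and should match your run configuration, and on a real HDFS path you would use Hadoop's FileSystem API instead:

```java
import java.io.File;

public class CleanOutput {

    // Recursively deletes a directory and its contents; returns true if
    // nothing is left at the given path afterwards.
    public static boolean deleteRecursively(File dir) {
        File[] children = dir.listFiles();
        if (children != null) {
            for (File child : children) {
                deleteRecursively(child);
            }
        }
        dir.delete();
        return !dir.exists();
    }

    public static void main(String[] args) {
        // Example path; point this at the output directory from your
        // run configuration before launching the job again.
        File output = new File("output");
        System.out.println(deleteRecursively(output));
    }
}
```

You could also call this helper at the top of the driver's run() method so every launch starts with a clean output path.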
Happy Hadooping!