Distributed grep using Hadoop

The Hadoop word count example is commonly used to introduce MapReduce concepts. I have altered the word count sample to do pattern matching, so it works like the UNIX grep command.

First, copy the input text file to an HDFS location, then run the job:

bin/hadoop dfs -copyFromLocal local-dir hdfs-dir 
bin/hadoop jar path/grep.jar org.myperl.Grep hdfs-input-dir hdfs-output-dir pattern

package org.myperl;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;

import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Grep {

    public static class Map extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();
        private Pattern pattern;
        private int group;

        public void configure(JobConf job) {
            pattern = Pattern.compile(job.get("mapred.mapper.regex"));
            group = job.getInt("mapred.mapper.regex.group", 0);
        }

        public void map(LongWritable key, Text value,
                        OutputCollector<Text, IntWritable> output,
                        Reporter reporter) throws IOException {
            String line = value.toString();
            Matcher matcher = pattern.matcher(line);
            if (matcher.find()) {
                // Emit the whole matching line, like grep; if a capture
                // group was requested, emit only that group.
                word.set(group == 0 ? line : matcher.group(group));
                output.collect(word, one);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 3) {
            System.out.println("Grep <inDir> <outDir> <regex> [<group>]");
            return;
        }
        JobConf conf = new JobConf(Grep.class);
        conf.setJobName("Grep");
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        conf.setMapperClass(Map.class);
        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);

        conf.set("mapred.mapper.regex", args[2]);
        if (args.length == 4) {
            conf.set("mapred.mapper.regex.group", args[3]);
        }
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
    }
}
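Since no reducer class is set, the job falls back to Hadoop's default IdentityReducer, so each matching line appears in the output followed by its count of 1. A sample run and a way to inspect the result (the pattern and paths here are illustrative):

bin/hadoop jar path/grep.jar org.myperl.Grep logs-in logs-out "ERROR.*"
bin/hadoop dfs -cat logs-out/part-00000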


Apache Hadoop 1.0.0 released

Apache Hadoop 1.0.0 has been released (see the Release Note).

Hadoop 1.0.0 was released from the 0.20.2xx.x development tree. There is another new development in the Hadoop space with version 0.23.0, which contains HDFS Federation and NextGen MapReduce (YARN).


How to start Apache Hadoop in debug mode

Insert JPDA parameters into the $HADOOP_HOME/bin/hadoop script:

HADOOP_OPTS="$HADOOP_OPTS -Xdebug -Xrunjdwp:transport=dt_socket,server=y,suspend=y,address=8000"

Start Hadoop with the command:

$HADOOP_HOME/bin/hadoop namenode

Apache Hadoop will start and wait for a debugger to connect on port 8000.
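For example, you can then attach the JDK's command-line debugger from another terminal (a minimal sketch; any JPDA-capable IDE works as well, and the port matches the address above):

jdb -connect com.sun.jdi.SocketAttach:hostname=localhost,port=8000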


How to read/write to the Apache Hadoop file system (HDFS) using Java

Deploy an HDFS cluster.

Create an HDFS client using Java.

Generate a Maven2 project:

mvn archetype:generate -DgroupId=org.wso2.carbon -DartifactId=HDFSClient
hadoop-core is the only dependency you need in the Maven2-based project:
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-core</artifactId>
    <version>0.20.2</version>
</dependency>

Set the cluster configuration in the client.

Do file operations via the Java API (see the combined sketch below).
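A minimal sketch covering both steps, assuming the NameNode listens on hdfs://localhost:9000 (the class name, file path, and content are illustrative; fs.default.name is the 0.20.x configuration key):

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HDFSClient {

    public static void main(String[] args) throws IOException {
        // Set the cluster configuration: point the client at the NameNode.
        Configuration conf = new Configuration();
        conf.set("fs.default.name", "hdfs://localhost:9000"); // assumed NameNode address
        FileSystem fs = FileSystem.get(conf);

        // Write a small file to HDFS, then read it back.
        Path file = new Path("/tmp/hello.txt"); // hypothetical path
        FSDataOutputStream out = fs.create(file);
        out.writeUTF("Hello HDFS");
        out.close();

        FSDataInputStream in = fs.open(file);
        System.out.println(in.readUTF());
        in.close();

        fs.close();
    }
}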

A Maven2-based sample can be found in the OT svn.