Anagram Finder MapReduce Case Study

The goal is to identify anagram words in input files containing an arbitrary number of strings and group them together.



Problem Statement: Find anagram words in the input files, group them together, and convert each word to upper case.

AnagramFinderMapper: This mapper reads the input file line by line, tokenizes each line, and sorts the characters of each token to form the output key, with the original token as the value. Hadoop then groups together all words that share the same sorted key.


For example, aba and baa both sort to aab, so the reducer will receive the key aab with the group {aba, baa}.

package com.hadoopgyaan;

import java.io.IOException;
import java.util.Arrays;
import java.util.StringTokenizer;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class AnagramFinderMapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, Text> {

    public void map(LongWritable key, Text value, OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);

        while (tokenizer.hasMoreTokens()) {
            String originalInput = tokenizer.nextToken();
            // The sorted characters form the key, so all anagrams of a
            // word end up in the same reduce group.
            String sortedKey = sortStringChar(originalInput);
            System.out.println("Original Input: " + originalInput + " Sorted Input: " + sortedKey);
            output.collect(new Text(sortedKey), new Text(originalInput));
        }
    }

    // Sorts the characters of a word to build its grouping key.
    private String sortStringChar(String string) {
        char[] chars = string.toCharArray();
        Arrays.sort(chars);
        return new String(chars);
    }
}
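The sorted-key trick can be checked outside Hadoop. A minimal standalone sketch (plain Java, no Hadoop dependencies; the class and method names here are illustrative, not part of the job):

```java
import java.util.Arrays;

public class SortKeyDemo {
    // Sort the characters of a word to build the grouping key,
    // mirroring sortStringChar in the mapper.
    static String sortKey(String word) {
        char[] chars = word.toCharArray();
        Arrays.sort(chars);
        return new String(chars);
    }

    public static void main(String[] args) {
        // Anagrams share the same sorted key...
        System.out.println(sortKey("aba")); // aab
        System.out.println(sortKey("baa")); // aab
        // ...while non-anagrams do not.
        System.out.println(sortKey("abc")); // abc
    }
}
```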
 
 
 

AnagramFinderReducer: It reads each member of the group and combines them into a single tab-separated string.

package com.hadoopgyaan;

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class AnagramFinderReducer extends MapReduceBase implements Reducer<Text, Text, Text, IntWritable> {

    public void reduce(Text key, Iterator<Text> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
        // Join every word of the anagram group into one tab-separated string.
        StringBuilder similarWords = new StringBuilder();
        while (values.hasNext()) {
            if (similarWords.length() > 0) {
                similarWords.append("\t");
            }
            similarWords.append(values.next().toString());
        }
        System.out.println(similarWords.toString());
        output.collect(new Text(similarWords.toString()), new IntWritable(1));
    }
}
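The driver below also chains a third mapper, UpperCaseMapper, to upper-case each grouped line after the reduce step, but that class is not listed in this post. Here is a minimal sketch of what it could look like; the class name comes from the driver, and the field choices are assumptions matched to the types the driver passes for it (Text/IntWritable in, Text/NullWritable out):

```java
package com.hadoopgyaan;

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class UpperCaseMapper extends MapReduceBase
        implements Mapper<Text, IntWritable, Text, NullWritable> {

    // Receives each tab-joined anagram group from the reducer and
    // emits it upper-cased; the count value is discarded.
    public void map(Text key, IntWritable value,
            OutputCollector<Text, NullWritable> output, Reporter reporter)
            throws IOException {
        output.collect(new Text(key.toString().toUpperCase()), NullWritable.get());
    }
}
```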


AnagramFinderDriver: To run the job we need a driver program that creates the job configuration, wires the proper mapper and reducer into the job, and sets the input and output directories, which are passed as command-line arguments. Because the words must also be converted to upper case after grouping, the driver chains an extra upper-casing mapper after the reducer.

package com.hadoopgyaan;

import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextOutputFormat;
import org.apache.hadoop.mapred.lib.ChainMapper;
import org.apache.hadoop.mapred.lib.ChainReducer;

public class AnagramFinderDriver {

    public static void main(String[] args) throws IOException {
        JobConf conf = new JobConf();
        System.out.println("Execution Started");
        conf.setJobName("AnagramFinder");
        conf.setJarByClass(AnagramFinderDriver.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        // Chain: sort-key mapper -> grouping reducer -> upper-case mapper.
        JobConf mapConf = new JobConf(false);
        ChainMapper.addMapper(conf, AnagramFinderMapper.class, LongWritable.class, Text.class, Text.class, Text.class, true, mapConf);
        JobConf redConf = new JobConf(false);
        ChainReducer.setReducer(conf, AnagramFinderReducer.class, Text.class, Text.class, Text.class, IntWritable.class, true, redConf);
        JobConf upperConf = new JobConf(false);
        ChainReducer.addMapper(conf, UpperCaseMapper.class, Text.class, IntWritable.class, Text.class, NullWritable.class, true, upperConf);

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(NullWritable.class);
        conf.setOutputFormat(TextOutputFormat.class);
        JobClient.runJob(conf);
    }
}
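To sanity-check the overall flow without a cluster, the three chained stages (sort-key mapping, tab-joined grouping, upper-casing) can be simulated in plain Java. This is only a local sketch of the job's logic; the class and method names are illustrative:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class AnagramPipelineDemo {
    // Simulates the chained job: sort-key map, tab-joined reduce, upper-case map.
    static List<String> run(String[] words) {
        // Map phase: key each word by its sorted characters;
        // the TreeMap stands in for Hadoop's shuffle/group step.
        Map<String, StringBuilder> groups = new TreeMap<>();
        for (String word : words) {
            char[] chars = word.toCharArray();
            Arrays.sort(chars);
            String key = new String(chars);
            if (groups.containsKey(key)) {
                groups.get(key).append("\t").append(word);
            } else {
                groups.put(key, new StringBuilder(word));
            }
        }
        // Reduce phase plus the chained mapper: emit each group upper-cased.
        List<String> out = new ArrayList<>();
        for (StringBuilder group : groups.values()) {
            out.add(group.toString().toUpperCase());
        }
        return out;
    }

    public static void main(String[] args) {
        for (String line : run(new String[]{"aba", "baa", "abc", "cab"})) {
            System.out.println(line); // ABA<tab>BAA, then ABC<tab>CAB
        }
    }
}
```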
 

I hope this tutorial helps you. If you have any questions or problems, please let me know.
Happy Hadooping with Patrick..
