Thursday, April 30, 2015

Data localization in MapReduce

Whenever a mapper process completes, before the TaskTracker emits the result it keeps the map output in the LFS (local file system) on the same node.

Note: Data localization applies to the mapper phase only, not to the sort & shuffle or reducer phases.

The mapper output lives only until the end of the job: on job completion, whether success or failure, the local copies of the mapper output are automatically removed by the framework.
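These local copies live under the directory given by the mapred.local.dir property (a Hadoop 1.x setting that defaults to ${hadoop.tmp.dir}/mapred/local). A small illustrative sketch (the class name here is made up for this post, not part of the course code) that prints the configured value:

import org.apache.hadoop.conf.Configuration;

// Illustrative only: prints the node-local directory where the TaskTracker
// keeps intermediate map output (Hadoop 1.x property name).
public class ShowMapredLocalDir {
  public static void main(String[] args) {
    // Loads core-site.xml / mapred-site.xml from the classpath, if present.
    Configuration conf = new Configuration();
    // Defaults to ${hadoop.tmp.dir}/mapred/local when not set explicitly.
    System.out.println(conf.get("mapred.local.dir"));
  }
}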


Creation of jar file and Execution Process in Clustered Environment

Process of Creating the JAR File:
Step1: Open Eclipse, click File -> New -> Java Project (if Java Project is not listed, click Other, select Java, and then click Java Project).


Step2: Copy the provided code or write the code below.
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountNew {

  // Mapper: tokenizes each input line and emits a (word, 1) pair per token.
  public static class TokenizerMapper
       extends Mapper<Object, Text, Text, IntWritable>{

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context
                    ) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer (also used as the combiner): sums the counts for each word.
  public static class IntSumReducer
       extends Reducer<Text,IntWritable,Text,IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values,
                       Context context
                       ) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    Job job = new Job(conf, "word count");
    job.setJarByClass(WordCountNew.class);
  
    job.setMapperClass(TokenizerMapper.class);
    // Reusing the reducer as a combiner pre-aggregates counts on the map side.
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
  
    // args[0] = HDFS input path; args[1] = HDFS output path (must not already exist).
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
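Before packaging the class, it helps to trace what the mapper and reducer do to a single line of text. The stand-alone sketch below (plain Java, using a made-up sample line rather than the actual Input-Big.txt) mimics the same tokenize-then-sum logic outside Hadoop:

import java.util.StringTokenizer;
import java.util.TreeMap;

public class WordCountLogicDemo {
  public static void main(String[] args) {
    // A made-up sample line; the real job reads lines from the HDFS input file.
    String line = "hadoop is good hadoop";
    TreeMap<String, Integer> counts = new TreeMap<String, Integer>();
    StringTokenizer itr = new StringTokenizer(line);
    while (itr.hasMoreTokens()) {                     // map: one (word, 1) per token
      String word = itr.nextToken();
      Integer prev = counts.get(word);
      counts.put(word, prev == null ? 1 : prev + 1);  // reduce: sum per word
    }
    for (String w : counts.keySet()) {                // keys come out sorted, like part-r-00000
      System.out.println(w + "\t" + counts.get(w));
    }
  }
}

Running it prints good 1, hadoop 2 and is 1, the same (word, count) shape the job writes to part-r-00000.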
Step3: Right-click on “src” in the Package Explorer tab -> Build Path -> Configure Build Path -> Libraries tab -> Add External JARs... (select the Hadoop JARs) -> press “OK”.




Step4: Right-click on “src” -> New -> Class -> give the class file the same name as in the code (WordCountNew) -> press “Finish” -> paste or write the code.
Step5: Right-click on “src” again -> Export -> select JAR file under Java -> give the path and the name of the JAR file -> press “Finish”.

Execution Process:

 
Step1: Copy the JAR file and the input text file (provided) from the server's shared path into the Downloads folder.
Step2: Copy the text file and the JAR file into a local folder.
Step3: Create a new directory in the HDFS path (hadoop fs -mkdir MrInput).
Step4: Copy the text file from the LFS to HDFS using -put or -copyFromLocal (e.g., hadoop fs -put Input-Big.txt MrInput).
Step5: Run the job with the hadoop jar command, giving the driver class, the HDFS input path, and a new (non-existing) HDFS output path:

root@ubuntu:/home/chandu# hadoop jar BATCH38-WORDCOUNT.jar WordCountNew /root/user/chandu/MrInput/Input-Big.txt /root/user/chandu/MROutput
14/08/31 02:50:00 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
14/08/31 02:50:00 INFO input.FileInputFormat: Total input paths to process : 1
14/08/31 02:50:00 WARN snappy.LoadSnappy: Snappy native library is available
14/08/31 02:50:00 INFO util.NativeCodeLoader: Loaded the native-hadoop library
14/08/31 02:50:00 INFO snappy.LoadSnappy: Snappy native library loaded
14/08/31 02:50:00 INFO mapred.JobClient: Running job: job_201408310158_0002
14/08/31 02:50:01 INFO mapred.JobClient:  map 0% reduce 0%
14/08/31 02:50:07 INFO mapred.JobClient:  map 100% reduce 0%
14/08/31 02:50:14 INFO mapred.JobClient:  map 100% reduce 33%
14/08/31 02:50:16 INFO mapred.JobClient:  map 100% reduce 100%
14/08/31 02:50:16 INFO mapred.JobClient: Job complete: job_201408310158_0002
14/08/31 02:50:16 INFO mapred.JobClient: Counters: 26
14/08/31 02:50:16 INFO mapred.JobClient:   Job Counters
14/08/31 02:50:16 INFO mapred.JobClient:     Launched reduce tasks=1
14/08/31 02:50:16 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=5140
14/08/31 02:50:16 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
14/08/31 02:50:16 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
14/08/31 02:50:16 INFO mapred.JobClient:     Launched map tasks=1
14/08/31 02:50:16 INFO mapred.JobClient:     Data-local map tasks=1
14/08/31 02:50:16 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=8545
14/08/31 02:50:16 INFO mapred.JobClient:   FileSystemCounters
14/08/31 02:50:16 INFO mapred.JobClient:     FILE_BYTES_READ=139
14/08/31 02:50:16 INFO mapred.JobClient:     HDFS_BYTES_READ=153395
14/08/31 02:50:16 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=119182
14/08/31 02:50:16 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=122
14/08/31 02:50:16 INFO mapred.JobClient:   Map-Reduce Framework
14/08/31 02:50:16 INFO mapred.JobClient:     Map input records=5487
14/08/31 02:50:16 INFO mapred.JobClient:     Reduce shuffle bytes=139
14/08/31 02:50:16 INFO mapred.JobClient:     Spilled Records=22
14/08/31 02:50:16 INFO mapred.JobClient:     Map output bytes=251174
14/08/31 02:50:16 INFO mapred.JobClient:     CPU time spent (ms)=2000
14/08/31 02:50:16 INFO mapred.JobClient:     Total committed heap usage (bytes)=177016832
14/08/31 02:50:16 INFO mapred.JobClient:     Combine input records=25872
14/08/31 02:50:16 INFO mapred.JobClient:     SPLIT_RAW_BYTES=125
14/08/31 02:50:16 INFO mapred.JobClient:     Reduce input records=11
14/08/31 02:50:16 INFO mapred.JobClient:     Reduce input groups=11
14/08/31 02:50:16 INFO mapred.JobClient:     Combine output records=11
14/08/31 02:50:16 INFO mapred.JobClient:     Physical memory (bytes) snapshot=186167296
14/08/31 02:50:16 INFO mapred.JobClient:     Reduce output records=11
14/08/31 02:50:16 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=749965312
14/08/31 02:50:16 INFO mapred.JobClient:     Map output records=25872
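Notice the Data-local map tasks=1 counter above: the single map task was scheduled on the node that holds the input block in HDFS, which is exactly the data localization described at the start of this post.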
Step6: Verify the input and output files in HDFS and view the result:

root@ubuntu:/home/chandu# hadoop fs -ls /root/user/chandu/MrInput
Found 1 items
-rw-r--r--   1 root supergroup     153270 2014-08-31 02:33 /root/user/chandu/MrInput/Input-Big.txt
root@ubuntu:/home/chandu# hadoop fs -ls /root/user/chandu/MROutput
Found 3 items
-rw-r--r--   1 root supergroup          0 2014-08-31 02:50 /root/user/chandu/MROutput/_SUCCESS
drwxrwxrwx   - root supergroup          0 2014-08-31 02:50 /root/user/chandu/MROutput/_logs
-rw-r--r--   1 root supergroup        122 2014-08-31 02:50 /root/user/chandu/MROutput/part-r-00000
root@ubuntu:/home/chandu# hadoop fs -cat /root/user/chandu/MROutput/part-r-00000
good    4312
hadoop    4312
having    2156
is    4312
knowledge    1078
leader    1078
learn    1078
market    3234
now    2156
people    1078
the    1078
root@ubuntu:/home/chandu#
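As a quick sanity check against the counters in the job log: the eleven distinct words match Reduce output records=11, and the per-word counts sum to 4312 + 4312 + 2156 + 4312 + 1078 + 1078 + 1078 + 3234 + 2156 + 1078 + 1078 = 25872, which is exactly the Map output records counter.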