Map Reduce Interview Questions and Answers
-
What is Map Reduce?
MapReduce is a Java-based programming model in the Hadoop framework for processing large data sets in parallel. Jobs written against it scale horizontally across the nodes of a Hadoop cluster.
-
How Map Reduce works in Hadoop?
MapReduce divides the workload into two kinds of tasks, namely
1. map tasks and 2. reduce tasks, which can run in parallel.
The map tasks break the input data set down into key-value pairs (tuples).
The reduce tasks then take the output of the map tasks and combine those tuples
into a smaller set of tuples.
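The two phases can be sketched in plain Java, with no Hadoop dependencies, by simulating map, shuffle, and reduce over an in-memory word-count data set (a conceptual illustration, not real Hadoop code):

```java
import java.util.*;

public class MapReduceSketch {
    public static void main(String[] args) {
        String[] lines = {"hello world", "hello hadoop"};

        // Map phase: emit a (word, 1) tuple for every word in every line.
        List<Map.Entry<String, Integer>> mapped = new ArrayList<>();
        for (String line : lines) {
            for (String word : line.split("\\s+")) {
                mapped.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }

        // Shuffle: group the tuples by key, as the framework does between phases.
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> e : mapped) {
            grouped.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).add(e.getValue());
        }

        // Reduce phase: combine each key's values into a single smaller tuple.
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            int sum = 0;
            for (int v : e.getValue()) sum += v;
            System.out.println(e.getKey() + "\t" + sum);
        }
    }
}
```

In real Hadoop the map and reduce steps run as separate tasks on many nodes, and the shuffle moves the grouped tuples across the network between them.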
-
What is ‘Key value pair’ in Map Reduce?
A key-value pair is the intermediate data generated by the mappers and sent to the
reducers to produce the final output.
-
What is the difference between MapReduce engine and HDFS cluster?
HDFS cluster is the name given to the whole configuration of master and slave
nodes where data is stored. The MapReduce engine is the programming module used
to retrieve and analyze that data.
-
Is map like a pointer?
No, Map is not like a pointer.
-
Why are the number of splits equal to the number of maps?
The number of maps is equal to the number of input splits because we want the key
and value pairs of all the input splits.
-
Is a job split into maps?
No, a job is not split into maps. Spilt is created for the file. The file is placed on
datanodes in blocks. For each split, a map is needed.
-
How can you set an arbitrary number of mappers to be created for a job in Hadoop?
This is a trick question. You cannot set it directly; the number of map tasks is
determined by the number of input splits.
-
How can you set an arbitrary number of reducers to be created for a job in Hadoop?
You can either do it programmatically by calling the setNumReduceTasks method on
the JobConf class, or set it up as a configuration setting.
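Both options can be sketched as a driver-side fragment using the classic mapred API named above (assumes Hadoop on the classpath; MyDriver is a hypothetical driver class):

```java
// Programmatic option:
JobConf conf = new JobConf(MyDriver.class);
conf.setNumReduceTasks(10);   // ask for 10 reduce tasks

// Equivalent classic configuration setting:
//   mapred.reduce.tasks = 10
```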
-
How will you write a custom partitioner for a Hadoop job?
The following steps are needed to write a custom partitioner.
– Create a new class that extends the Partitioner class
– Override the getPartition method
– In the wrapper that runs the MapReduce job, either
– add the custom partitioner to the job programmatically using the
setPartitionerClass method, or
– add the custom partitioner to the job as a configuration setting (if your wrapper
reads from a config file or Oozie)
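The steps above can be sketched as follows, assuming Hadoop's mapreduce API is on the classpath (the class name and routing rule are illustrative):

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Step 1: extend Partitioner; step 2: override getPartition.
public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        // Hypothetical rule: route keys to reducers by their first character.
        if (key.getLength() == 0) return 0;
        return (key.charAt(0) & Integer.MAX_VALUE) % numPartitions;
    }
}

// Step 3, in the driver:
//   job.setPartitionerClass(FirstLetterPartitioner.class);
```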
-
What is the difference between TextInputFormat and KeyValueInputFormat class?
TextInputFormat: reads lines of text files and provides the byte offset of each line
as the key and the line itself as the value to the Mapper.
KeyValueInputFormat: reads text files and parses each line into a key-value pair.
Everything up to the first tab character is sent as the key to the Mapper and the
remainder of the line is sent as the value.
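The tab-splitting behaviour of KeyValueInputFormat can be illustrated in plain Java, with no Hadoop dependencies (a sketch of the parsing rule only):

```java
public class KeyValueSplit {
    // Split a line the way KeyValueInputFormat does: key before the first tab,
    // value after it; a line with no tab becomes a key with an empty value.
    static String[] split(String line) {
        int tab = line.indexOf('\t');
        if (tab == -1) return new String[] { line, "" };
        return new String[] { line.substring(0, tab), line.substring(tab + 1) };
    }

    public static void main(String[] args) {
        String[] kv = split("apple\tred\tfruit");
        // Only the first tab separates key from value.
        System.out.println("key=" + kv[0] + " value=" + kv[1]);
    }
}
```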
-
What is a Combiner?
The Combiner is a “mini-reduce” process which operates only on data generated by a
mapper. The Combiner will receive as input all data emitted by the Mapper instances
on a given node. The output from the Combiner is then sent to the Reducers, instead
of the output from the Mappers.
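The effect of a combiner can be simulated in plain Java: pre-aggregating one node's mapper output shrinks what is shuffled without changing the final result, which is why combiners suit associative, commutative operations such as summing (a conceptual sketch, not real Hadoop code):

```java
import java.util.*;

public class CombinerSketch {
    public static void main(String[] args) {
        // Tuples emitted by the mapper instances on one hypothetical node.
        List<String> mapperOutput = Arrays.asList("hello", "world", "hello", "hello");

        // Combiner: a mini-reduce over this node's mapper output only.
        Map<String, Integer> combined = new TreeMap<>();
        for (String w : mapperOutput) combined.merge(w, 1, Integer::sum);

        // Only 2 tuples now cross the network to the reducers instead of 4.
        System.out.println(combined); // {hello=3, world=1}
    }
}
```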
-
If no custom partitioner is defined in Hadoop, then how is data partitioned before it is sent to the reducer?
The default partitioner (HashPartitioner) computes a hash value for the key and
assigns the partition based on this result.
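The arithmetic behind that default is simple and can be shown in plain Java, without the Writable wrappers:

```java
public class DefaultPartitioning {
    // The same computation the default hash partitioner performs:
    // mask off the sign bit, then take the key's hash modulo the reducer count.
    static int partitionFor(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int numReducers = 4;
        // Equal keys always land in the same partition,
        // so all values for one key reach the same reducer.
        System.out.println(partitionFor("hadoop", numReducers)
                == partitionFor("hadoop", numReducers)); // true
    }
}
```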
-
Have you ever used Counters in Hadoop? Give us an example scenario.
Counters collect job-level statistics; a common scenario is counting bad or
malformed input records so they can be skipped without failing the job. Anybody
who claims to have worked on a Hadoop project is expected to have used counters.
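That scenario can be sketched inside a mapper against the mapreduce API (assumes Hadoop on the classpath; the enum and the comma-separated record format are illustrative):

```java
// Inside a Mapper subclass:
enum RecordQuality { MALFORMED }   // hypothetical counter group

@Override
protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
    if (value.toString().split(",").length < 3) {
        // Count the bad record instead of failing the whole job;
        // the total is reported with the job's other counters.
        context.getCounter(RecordQuality.MALFORMED).increment(1);
        return;
    }
    // ... normal processing ...
}
```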
-
Is it possible to provide multiple input to Hadoop? If yes then how can you give multiple directories as input to the Hadoop job?
Yes, The input format class provides methods to add multiple directories as input to
a Hadoop job.
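For example, in the driver (assumes Hadoop on the classpath; the paths are illustrative):

```java
// Each call appends another input directory to the job.
FileInputFormat.addInputPath(job, new Path("/data/logs/2015"));
FileInputFormat.addInputPath(job, new Path("/data/logs/2016"));

// Or several comma-separated directories at once:
//   FileInputFormat.addInputPaths(job, "/data/logs/2015,/data/logs/2016");
```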
-
Is it possible to have Hadoop job output in multiple directories? If yes, then how?
Yes, by using the MultipleOutputs class.
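A brief sketch of how MultipleOutputs is wired up (assumes Hadoop on the classpath; the output name is illustrative):

```java
// In the driver: register a named output with its format and key/value types.
MultipleOutputs.addNamedOutput(job, "errors", TextOutputFormat.class,
        Text.class, IntWritable.class);

// In the reducer, after obtaining a MultipleOutputs instance in setup():
//   mos.write("errors", key, value, "errors/part");
```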
-
Explain what are the basic parameters of a Mapper?
The basic parameters of a Mapper are its input and output key-value types, for
example:
LongWritable and Text (input key and value)
Text and IntWritable (output key and value)
-
Explain what is the function of the MapReduce partitioner?
The function of the MapReduce partitioner is to make sure that all values for a
single key go to the same reducer, which in turn helps distribute the map output
evenly over the reducers.
-
Explain what is the difference between an Input Split and an HDFS Block?
A logical division of the data is known as a Split, while a physical division of the
data is known as an HDFS Block.
-
Mention what are the main configuration parameters that the user needs to specify to run a MapReduce job?
The user of the MapReduce framework needs to specify:
– Job’s input locations in the distributed file system
– Job’s output location in the distributed file system
– Input format
– Output format
– Class containing the map function
– Class containing the reduce function
– JAR file containing the mapper, reducer and driver classes
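The parameters above come together in a typical driver class, sketched here against the mapreduce API (assumes Hadoop on the classpath; WordCountMapper and WordCountReducer are hypothetical classes):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);           // JAR containing the classes

        FileInputFormat.addInputPath(job, new Path(args[0]));    // input location
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output location

        job.setInputFormatClass(TextInputFormat.class);     // input format
        job.setOutputFormatClass(TextOutputFormat.class);   // output format

        job.setMapperClass(WordCountMapper.class);    // class with the map function
        job.setReducerClass(WordCountReducer.class);  // class with the reduce function

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```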