Map Reduce Interview Questions and Answers
-
What is Map Reduce?
MapReduce is a Java-based programming model in the Hadoop framework for processing large data sets in parallel. Jobs written against it scale horizontally across the nodes of a Hadoop cluster.
-
How Map Reduce works in Hadoop?
MapReduce divides the workload into two kinds of tasks, namely
1. map tasks and 2. reduce tasks, which can run in parallel.
The map tasks break the input data set down into key-value pairs (tuples).
The reduce tasks then take the output of the map tasks and combine those tuples
into a smaller set of tuples.
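The two phases can be sketched in plain Java, with no Hadoop dependencies, by simulating map, shuffle, and reduce over an in-memory word-count data set (a conceptual illustration, not real Hadoop code):

```java
import java.util.*;

public class MapReduceSketch {
    public static void main(String[] args) {
        String[] lines = {"hello world", "hello hadoop"};

        // Map phase: emit a (word, 1) tuple for every word in every line.
        List<Map.Entry<String, Integer>> mapped = new ArrayList<>();
        for (String line : lines) {
            for (String word : line.split("\\s+")) {
                mapped.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }

        // Shuffle: group the tuples by key, as the framework does between phases.
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> e : mapped) {
            grouped.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).add(e.getValue());
        }

        // Reduce phase: combine each key's values into a single smaller tuple.
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            int sum = 0;
            for (int v : e.getValue()) sum += v;
            System.out.println(e.getKey() + "\t" + sum);
        }
    }
}
```

In real Hadoop the map and reduce steps run as separate tasks on many nodes, and the shuffle moves the grouped tuples across the network between them.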
-
What is ‘Key value pair’ in Map Reduce?
A key-value pair is the intermediate data generated by the mappers and sent to the
reducers to produce the final output.
-
What is the difference between MapReduce engine and HDFS cluster?
HDFS cluster is the name given to the whole configuration of master and slave
nodes where data is stored. The MapReduce engine is the programming module used
to retrieve and analyze that data.
-
Is map like a pointer?
No, Map is not like a pointer.
-
Why are the number of splits equal to the number of maps?
The number of maps is equal to the number of input splits because we want the key
and value pairs of all the input splits.
-
Is a job split into maps?
No, a job is not split into maps. Spilt is created for the file. The file is placed on
datanodes in blocks. For each split, a map is needed.
-
How can you set an arbitrary number of mappers to be created for a job in Hadoop?
This is a trick question. You cannot set it directly; the number of map tasks is
determined by the number of input splits.
-
How can you set an arbitrary number of reducers to be created for a job in Hadoop?
You can either do it programmatically by calling the setNumReduceTasks method on
the JobConf class, or set it up as a configuration setting.
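Both options can be sketched as a driver-side fragment using the classic mapred API named above (assumes Hadoop on the classpath; MyDriver is a hypothetical driver class):

```java
// Programmatic option:
JobConf conf = new JobConf(MyDriver.class);
conf.setNumReduceTasks(10);   // ask for 10 reduce tasks

// Equivalent classic configuration setting:
//   mapred.reduce.tasks = 10
```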
-
How will you write a custom partitioner for a Hadoop job?
The following steps are needed to write a custom partitioner.
– Create a new class that extends the Partitioner class
– Override the getPartition method
– In the wrapper that runs the MapReduce job, either
– add the custom partitioner to the job programmatically using the
setPartitionerClass method, or
– add the custom partitioner to the job as a configuration setting (if your wrapper
reads from a config file or Oozie)
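The steps above can be sketched as follows, assuming Hadoop's mapreduce API is on the classpath (the class name and routing rule are illustrative):

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Step 1: extend Partitioner; step 2: override getPartition.
public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        // Hypothetical rule: route keys to reducers by their first character.
        if (key.getLength() == 0) return 0;
        return (key.charAt(0) & Integer.MAX_VALUE) % numPartitions;
    }
}

// Step 3, in the driver:
//   job.setPartitionerClass(FirstLetterPartitioner.class);
```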
-
What is the difference between TextInputFormat and KeyValueInputFormat class?
TextInputFormat: reads lines of text files and provides the byte offset of each line
as the key and the line itself as the value to the Mapper.
KeyValueInputFormat: reads text files and parses each line into a key-value pair.
Everything up to the first tab character is sent as the key to the Mapper and the
remainder of the line is sent as the value.
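The tab-splitting behaviour of KeyValueInputFormat can be illustrated in plain Java, with no Hadoop dependencies (a sketch of the parsing rule only):

```java
public class KeyValueSplit {
    // Split a line the way KeyValueInputFormat does: key before the first tab,
    // value after it; a line with no tab becomes a key with an empty value.
    static String[] split(String line) {
        int tab = line.indexOf('\t');
        if (tab == -1) return new String[] { line, "" };
        return new String[] { line.substring(0, tab), line.substring(tab + 1) };
    }

    public static void main(String[] args) {
        String[] kv = split("apple\tred\tfruit");
        // Only the first tab separates key from value.
        System.out.println("key=" + kv[0] + " value=" + kv[1]);
    }
}
```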
-
What is a Combiner?
The Combiner is a “mini-reduce” process which operates only on data generated by a
mapper. The Combiner will receive as input all data emitted by the Mapper instances
on a given node. The output from the Combiner is then sent to the Reducers, instead
of the output from the Mappers.
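The effect of a combiner can be simulated in plain Java: pre-aggregating one node's mapper output shrinks what is shuffled without changing the final result, which is why combiners suit associative, commutative operations such as summing (a conceptual sketch, not real Hadoop code):

```java
import java.util.*;

public class CombinerSketch {
    public static void main(String[] args) {
        // Tuples emitted by the mapper instances on one hypothetical node.
        List<String> mapperOutput = Arrays.asList("hello", "world", "hello", "hello");

        // Combiner: a mini-reduce over this node's mapper output only.
        Map<String, Integer> combined = new TreeMap<>();
        for (String w : mapperOutput) combined.merge(w, 1, Integer::sum);

        // Only 2 tuples now cross the network to the reducers instead of 4.
        System.out.println(combined); // {hello=3, world=1}
    }
}
```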
-
If no custom partitioner is defined in Hadoop, then how is data partitioned before it is sent to the reducer?
The default partitioner (HashPartitioner) computes a hash value for the key and
assigns the partition based on this result.
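The arithmetic behind that default is simple and can be shown in plain Java, without the Writable wrappers:

```java
public class DefaultPartitioning {
    // The same computation the default hash partitioner performs:
    // mask off the sign bit, then take the key's hash modulo the reducer count.
    static int partitionFor(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int numReducers = 4;
        // Equal keys always land in the same partition,
        // so all values for one key reach the same reducer.
        System.out.println(partitionFor("hadoop", numReducers)
                == partitionFor("hadoop", numReducers)); // true
    }
}
```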
-
Have you ever used Counters in Hadoop? Give us an example scenario.
Counters collect job-level statistics; a common scenario is counting bad or
malformed input records so they can be skipped without failing the job. Anybody
who claims to have worked on a Hadoop project is expected to have used counters.
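That scenario can be sketched inside a mapper against the mapreduce API (assumes Hadoop on the classpath; the enum and the comma-separated record format are illustrative):

```java
// Inside a Mapper subclass:
enum RecordQuality { MALFORMED }   // hypothetical counter group

@Override
protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
    if (value.toString().split(",").length < 3) {
        // Count the bad record instead of failing the whole job;
        // the total is reported with the job's other counters.
        context.getCounter(RecordQuality.MALFORMED).increment(1);
        return;
    }
    // ... normal processing ...
}
```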
-
Is it possible to provide multiple input to Hadoop? If yes then how can you give multiple directories as input to the Hadoop job?
Yes, The input format class provides methods to add multiple directories as input to
a Hadoop job.
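For example, in the driver (assumes Hadoop on the classpath; the paths are illustrative):

```java
// Each call appends another input directory to the job.
FileInputFormat.addInputPath(job, new Path("/data/logs/2015"));
FileInputFormat.addInputPath(job, new Path("/data/logs/2016"));

// Or several comma-separated directories at once:
//   FileInputFormat.addInputPaths(job, "/data/logs/2015,/data/logs/2016");
```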
-
Is it possible to have Hadoop job output in multiple directories? If yes, then how?
Yes, by using the MultipleOutputs class.
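A brief sketch of how MultipleOutputs is wired up (assumes Hadoop on the classpath; the output name is illustrative):

```java
// In the driver: register a named output with its format and key/value types.
MultipleOutputs.addNamedOutput(job, "errors", TextOutputFormat.class,
        Text.class, IntWritable.class);

// In the reducer, after obtaining a MultipleOutputs instance in setup():
//   mos.write("errors", key, value, "errors/part");
```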
-
Explain what are the basic parameters of a Mapper?
The basic parameters of a Mapper are its input and output key-value types, for
example:
LongWritable and Text (input key and value)
Text and IntWritable (output key and value)
-
Explain what is the function of the MapReduce partitioner?
The function of the MapReduce partitioner is to make sure that all values for a
single key go to the same reducer, which in turn helps distribute the map output
evenly over the reducers.
-
Explain what is the difference between an Input Split and an HDFS Block?
A logical division of the data is known as a Split, while a physical division of the
data is known as an HDFS Block.
-
Mention what are the main configuration parameters that the user needs to specify to run a MapReduce job?
The user of the MapReduce framework needs to specify:
– Job’s input locations in the distributed file system
– Job’s output location in the distributed file system
– Input format
– Output format
– Class containing the map function
– Class containing the reduce function
– JAR file containing the mapper, reducer and driver classes
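The parameters above come together in a typical driver class, sketched here against the mapreduce API (assumes Hadoop on the classpath; WordCountMapper and WordCountReducer are hypothetical classes):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);           // JAR containing the classes

        FileInputFormat.addInputPath(job, new Path(args[0]));    // input location
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output location

        job.setInputFormatClass(TextInputFormat.class);     // input format
        job.setOutputFormatClass(TextOutputFormat.class);   // output format

        job.setMapperClass(WordCountMapper.class);    // class with the map function
        job.setReducerClass(WordCountReducer.class);  // class with the reduce function

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```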