HDFS:
1. Without touching the block size or input split, can we control the number of mappers?
Ans: Yes. Create a custom InputFormat and override 'isSplitable()' to return false, so that each file is handed to a single mapper.
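A minimal sketch of such an InputFormat, using the newer org.apache.hadoop.mapreduce API (the class name here is just an example); the driver would then call job.setInputFormatClass(WholeFileTextInputFormat.class):

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

    // Files handled by this format are never split, so one mapper reads the whole file.
    public class WholeFileTextInputFormat extends TextInputFormat {
      @Override
      protected boolean isSplitable(JobContext context, Path file) {
        return false;
      }
    }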
2. What is the difference between Block size & input split?
Ans: A block is a physical division of the data, whereas an input split is a logical division.
3. To process one hundred files each of size - 100MB on HDFS whose default block size is 64MB, how many mappers would be invoked?
Ans: Each file occupies 2 blocks (block 1 - 64 MB, block 2 - 36 MB), so 100 files occupy 200 blocks. With the default of one input split per block, 200 mappers would be invoked.
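As a quick sanity check, the same arithmetic in code (sizes taken from the question, assuming one map task per block):

    public class MapperCount {
      public static void main(String[] args) {
        long fileSize  = 100L * 1024 * 1024;  // 100 MB per file
        long blockSize =  64L * 1024 * 1024;  // default block size of 64 MB
        int  files     = 100;
        long blocksPerFile = (fileSize + blockSize - 1) / blockSize;  // ceil(100/64) = 2
        System.out.println(blocksPerFile * files);                    // 200 map tasks
      }
    }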
4. What is data locality optimization?
Ans: In Hadoop, computation is moved close to the data. Execution can happen in three ways, and the scheduler always prefers the first:
Same node execution: Tasktracker process is initiated in the Datanode where the block of data is stored.
Off-node execution: If no Tasktracker slot is available on the Datanode where the data block is located, the task runs on another node in the same rack and the block of data is transferred to it.
Off-rack execution: If no slot is free to run the task anywhere in the rack where the block of data is present, the block is moved across to a node in a different rack and processed there.
5. What is Speculative execution?
Ans: If one of the tasks of a MapReduce job is slow, it pulls down the overall performance of the job. Hence, the Jobtracker continuously monitors each task for progress (via heartbeat signals). If a task does not report progress in the given time interval, the Jobtracker speculates that the task is struggling and launches a duplicate task on a different node holding a replica of the same block. This concept is called speculative execution.
An important point: the slow-running task is not killed up front. Both tasks run simultaneously, and only when one of them completes is the other killed.
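Speculative execution can be switched on or off per job. A minimal sketch using the classic MR1 property names (newer releases rename these to mapreduce.map.speculative and mapreduce.reduce.speculative):

    import org.apache.hadoop.conf.Configuration;

    public class SpeculationConfig {
      public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Classic MR1 property names for map-side and reduce-side speculation.
        conf.setBoolean("mapred.map.tasks.speculative.execution", true);
        conf.setBoolean("mapred.reduce.tasks.speculative.execution", false);
      }
    }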
6. What are the different types of File permissions in HDFS?
Ans:
drwxrwxrwx user1 prog 10 Aug 16 15:02 myfolder
-rwxrwxrwx user1 prog 10 Aug 01 07:02 myfile.sas
Position 1: ‘d’ means folder, ‘-’ means file
Positions 2-4: Owner's permissions on the file/folder
Positions 5-7: Group permissions on file or folder
Positions 8-10: Global permissions on file or folder
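These permissions can be changed with 'hadoop fs -chmod' from the shell (e.g. 'hadoop fs -chmod 750 /user/user1/myfolder'), or programmatically. A minimal sketch using the FileSystem API (the path is just an example), setting rwxr-x--- on a folder:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.fs.permission.FsAction;
    import org.apache.hadoop.fs.permission.FsPermission;

    public class ChmodDemo {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // rwx for the owner, r-x for the group, no access for others (i.e. mode 750).
        fs.setPermission(new Path("/user/user1/myfolder"),
            new FsPermission(FsAction.ALL, FsAction.READ_EXECUTE, FsAction.NONE));
      }
    }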
7. What is Rack-awareness?
Ans: In HDFS, the replicas of a single block are never all placed in the same rack. This placement policy is called rack-awareness: if an entire rack went down and every replica lived in that rack, there would be no way to recover that block of data.
8. What are the different modes of HDFS that one can run? Where do we configure these modes?
Ans: Hadoop can be configured to run in one of the following modes:
a. Standalone or local mode (the default)
b. Pseudo-distributed mode
c. Fully distributed mode
These modes are configured via core-site.xml, hdfs-site.xml and mapred-site.xml.
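For instance, a pseudo-distributed (single-node) setup is conventionally selected with entries like the following; the host/port values are the usual single-node defaults, and exact property names differ slightly across Hadoop releases:

    <!-- core-site.xml -->
    <configuration>
      <property>
        <name>fs.default.name</name>
        <value>hdfs://localhost:9000</value>
      </property>
    </configuration>

    <!-- hdfs-site.xml -->
    <configuration>
      <property>
        <name>dfs.replication</name>
        <value>1</value>
      </property>
    </configuration>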
9. What are the available data-types in Hadoop?
Ans: To support serialization/deserialization and comparison with one another, Hadoop provides its own datatypes.
The following types implement WritableComparable (a short usage sketch follows the list):
Primitives: BooleanWritable, ByteWritable, ShortWritable, IntWritable, VIntWritable, FloatWritable, LongWritable, VLongWritable, DoubleWritable
Others:
NullWritable, Text, BytesWritable, MD5Hash
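A minimal sketch showing that these are mutable, comparable wrappers (the values are arbitrary):

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;

    public class WritableDemo {
      public static void main(String[] args) {
        IntWritable a = new IntWritable(10);
        IntWritable b = new IntWritable(64);
        System.out.println(a.compareTo(b) < 0);  // true: WritableComparable defines an ordering
        Text t = new Text("hadoop");
        t.set("hdfs");                           // Writables are mutable and reusable
        System.out.println(t);                   // prints "hdfs"
      }
    }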
10. Explain the command '-getmerge'.
Ans: hadoop fs -getmerge <directory> <merged file name>
This option gets all the files in the directory and merges them into a single file.
11. Explain the anatomy of a file read in HDFS
Ans:
1. Client opens the file (calls open() on Distributed File System).
2. DFS calls the namenode to get block locations.
3. DFS creates FSDataInputStream and client invokes read() on this object.
4. Using DFSDataInputStream (a subclass of FSDataInputStream), the read operation is performed against the datanodes where the file's blocks are present. Blocks are read in order. Once all blocks have been read, the client calls close() on the FSDataInputStream.
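From the client's point of view, these steps are all driven through the FileSystem API. A minimal sketch that opens an HDFS file (path passed as the first argument) and copies it to standard output:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class HdfsRead {
      public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);              // a DistributedFileSystem when the defaults point at HDFS
        FSDataInputStream in = fs.open(new Path(args[0])); // steps 1-3: open() returns an FSDataInputStream
        try {
          IOUtils.copyBytes(in, System.out, 4096, false);  // step 4: read() streams the blocks in order
        } finally {
          IOUtils.closeStream(in);                         // client calls close() when done
        }
      }
    }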
12. Explain the anatomy of a file write in HDFS
Ans:
1. Client creates a file (calls create() on DFS).
2. Client calls the namenode (NN) to create the file. The NN checks the client's access permissions and whether the file already exists; if it does, an IOException is thrown.
3. The DFS returns an FSDataOutputStream to write data into. FSDataOutputStream has a subclass, DFSDataOutputStream, which handles communication with the NN and the datanodes (DN).
4. DFSDataOutputStream writes data in the form of packets (small units of data), and these packets are written to various DNs to form blocks of data. A pipeline is formed that consists of the list of DNs that a single block has to be replicated to.
5. When a block of data is written to all DNs in the pipeline, acknowledgement comes from the DNs in the pipeline in the reverse order.
6. When client has finished writing the data, it calls close() on the stream
7. The client waits for acknowledgements before contacting the namenode to signal that the file is complete.
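Again, the client side of this is just the FileSystem API. A minimal sketch that creates an HDFS file (path passed as the first argument) and writes a line into it:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsWrite {
      public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        FSDataOutputStream out = fs.create(new Path(args[0])); // steps 1-3: create() goes through the NN checks
        try {
          out.writeBytes("hello hdfs\n");  // steps 4-5: data flows as packets through the datanode pipeline
        } finally {
          out.close();                     // steps 6-7: close() flushes and waits for acknowledgements
        }
      }
    }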
MapReduce:
1. What is Distributed Cache?
Ans: Distributed Cache is a mechanism by which 'side data' (extra read-only data needed by an MR program) is distributed to the nodes of the cluster, so that map and reduce tasks can read it from local disk.
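A minimal sketch with the classic MR1-era DistributedCache class (the lookup-file path is just an example; newer releases expose the same idea via Job.addCacheFile()):

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.fs.Path;

    public class SideDataDemo {
      // Driver side: register the side-data file before submitting the job.
      static void registerSideData(Configuration conf) throws Exception {
        DistributedCache.addCacheFile(new URI("/user/user1/lookup.txt"), conf);
      }

      // Task side (e.g. in Mapper.setup): the file has been copied to the task's local disk.
      static Path[] localCopies(Configuration conf) throws Exception {
        return DistributedCache.getLocalCacheFiles(conf);
      }
    }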
2. What is 'Sequence File' format? Where do we use it?
Ans: SequenceFile is a flat file consisting of binary key/value pairs. It is extensively used in MapReduce as an input/output format; internally, the temporary outputs of maps are also stored using SequenceFile.
SequenceFile provides Writer, Reader and Sorter classes for writing, reading and sorting respectively. There are 3 different SequenceFile formats:
a. Uncompressed key/value records
b. Record compressed key/value records - only 'values' are compressed here.
c. Block compressed key/value records - both keys and values are collected in 'blocks' separately and compressed. The size of the 'block' is configurable
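A minimal sketch that writes a few key/value records and reads them back (the path is just an example; the Writer/Reader constructors shown are the classic ones, deprecated in favour of option-based factories in newer releases):

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class SeqFileDemo {
      public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("demo.seq");   // example path

        // Write a few key/value records.
        SequenceFile.Writer writer = SequenceFile.createWriter(fs, conf, path,
            IntWritable.class, Text.class);
        try {
          for (int i = 0; i < 3; i++) {
            writer.append(new IntWritable(i), new Text("record-" + i));
          }
        } finally {
          writer.close();
        }

        // Read the records back.
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
        try {
          IntWritable key = new IntWritable();
          Text value = new Text();
          while (reader.next(key, value)) {
            System.out.println(key + "\t" + value);
          }
        } finally {
          reader.close();
        }
      }
    }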
3. What are the different File Input Formats in MapReduce?
Ans: FileInputFormat is the base class for all implementations of InputFormat that use files as their data source. The subclasses of FileInputFormat are: CombineFileInputFormat, TextInputFormat (default), KeyValueTextInputFormat, NLineInputFormat, SequenceFileInputFormat.
SequenceFileInputFormat has a few subclasses of its own: SequenceFileAsBinaryInputFormat, SequenceFileAsTextInputFormat, SequenceFileInputFilter.
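Selecting a non-default input format is a one-line job setting. A minimal sketch, assuming a release that ships the new-API (org.apache.hadoop.mapreduce.lib.input) KeyValueTextInputFormat:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;

    public class InputFormatDemo {
      public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "kv demo");
        // Replace the default TextInputFormat with another FileInputFormat subclass.
        job.setInputFormatClass(KeyValueTextInputFormat.class);
      }
    }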
4. What is ‘Shuffling & sorting’ phase in MapReduce?
Ans: This phase occurs between the map and reduce phases. During it, the keys emitted by the various mappers are partitioned, sorted and copied to the appropriate reducers.
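The "which reducer gets which key" decision is made by a Partitioner. A minimal custom-partitioner sketch (assumes two reduce tasks and non-empty Text keys; the class name is just an example), wired in with job.setPartitionerClass(AlphabetPartitioner.class):

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    // Keys beginning with a-m go to reducer 0, everything else to reducer 1.
    public class AlphabetPartitioner extends Partitioner<Text, IntWritable> {
      @Override
      public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        char first = Character.toLowerCase(key.toString().charAt(0));
        return (first >= 'a' && first <= 'm') ? 0 : 1 % numReduceTasks;  // % guards the single-reducer case
      }
    }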
5. How many instances of a 'jobtracker' run in a cluster?
Ans: Only one instance of the Jobtracker runs in a cluster.
6. Can two different Mappers communicate with each other?
Ans: No, Mappers/Reducers run independently of each other.
8. How do you make sure that only one mapper processes your entire file?
Ans: Create a custom InputFormat and override 'isSplitable()' to return false (see the sketch under HDFS question 1). A cruder alternative is to set the block size greater than the size of the input file.
9. When does the reduce phase start in an MR program, and why is the reducer's progress a non-zero percentage even before the map phase has finished?
Ans: The reduce function itself starts only after all mappers finish their execution, but the reported reducer progress becomes non-zero before map progress reaches 100%. This is because the reducer phase is actually a combination of copy, sort and reduce: map outputs start being copied to, and sorted on, the reducers while the last map tasks are still running.
9. Explain various phases of a MapReduce program.
Ans:
Mapper phase: A MapReduce job usually splits the input data-set into independent chunks which are processed by the map tasks in a completely parallel manner
Sort & shuffle phase: Determines which reducer should receive each map output key/value pair (this is called partitioning). The keys arriving at each reducer are sorted.
Reducer phase: The reducer receives a key and the corresponding list of values (emitted across all the mappers). Aggregation of these values is done in the reducer phase. A minimal end-to-end example follows.
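A minimal word-count sketch, using the newer org.apache.hadoop.mapreduce API, that exercises all three phases (input/output paths are taken from the command line):

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

      // Mapper phase: each input split is processed independently, one line at a time.
      public static class TokenizerMapper
          extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
          for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
              word.set(token);
              context.write(word, ONE);  // emitted pairs are partitioned, sorted and shuffled
            }
          }
        }
      }

      // Reducer phase: receives each key with the list of values emitted across all mappers.
      public static class SumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable v : values) {
            sum += v.get();
          }
          context.write(key, new IntWritable(sum));
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }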
10. What is a 'Task instance'?
Ans: A task instance is the child JVM process launched by the Tasktracker to run a map or reduce task. Running tasks in separate JVMs ensures that a task failure does not take down the Tasktracker itself.