1. What is Big Data?
Any data that cannot be stored or processed in a traditional RDBMS is termed Big Data. Most of the data we use today has been generated in the past 20 years, and it is largely unstructured or semi-structured in nature. More than the volume of the data, it is the nature of the data that determines whether it is considered Big Data.
2. What do the four V’s of Big Data denote?
IBM has a simple explanation for the four critical features of big data:
a) Volume – Scale of data
b) Velocity – Speed of data generation and analysis of streaming data
c) Variety – Different forms of data
d) Veracity – Uncertainty of data
For more basic Big Data questions and answers click here
2) Hadoop HDFS Interview Questions
1. What is a block and block scanner in HDFS?
Block - The minimum amount of data that can be read or written is referred to as a “block” in HDFS. The default block size is 64 MB in Hadoop 1.x (128 MB in Hadoop 2.x).
Block Scanner - The Block Scanner tracks the list of blocks present on a DataNode and verifies them to find checksum errors. Block Scanners use a throttling mechanism to conserve disk bandwidth on the DataNode.
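As a rough sketch of how the block size can be checked programmatically through the HDFS Java API (the path /user/data below is hypothetical and only for illustration):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeCheck {
    public static void main(String[] args) throws Exception {
        // Loads core-site.xml / hdfs-site.xml from the classpath
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // Default block size that would be used for new files under this (hypothetical) path
        long blockSize = fs.getDefaultBlockSize(new Path("/user/data"));
        System.out.println("Default HDFS block size: " + blockSize + " bytes");
    }
}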
2. Explain the difference between NameNode, Backup Node and Checkpoint NameNode.
NameNode: The NameNode is at the heart of the HDFS file system and manages the metadata, i.e. the file data itself is not stored on the NameNode; instead it holds the directory tree of all the files present in the HDFS file system on a Hadoop cluster. The NameNode uses two files for the namespace:
fsimage file - It keeps track of the latest checkpoint of the namespace.
edits file - It is a log of changes that have been made to the namespace since the last checkpoint.
Checkpoint Node-
The Checkpoint Node keeps the latest checkpoint in a directory that has the same structure as the NameNode’s directory. It creates checkpoints for the namespace at regular intervals by downloading the edits and fsimage files from the NameNode and merging them locally. The new image is then uploaded back to the active NameNode.
Backup Node:
The Backup Node provides the same checkpointing functionality as the Checkpoint Node, but it also maintains an up-to-date, in-memory copy of the file system namespace that is always in sync with the active NameNode.
For more Hadoop HDFS Interview Questions click here
3) MapReduce Interview Questions
1. Explain the usage of Context Object.
The Context Object allows the mapper to interact with the rest of the Hadoop system. It can be used to update counters, report progress, and provide application-level status updates. The Context Object also holds the configuration details for the job and exposes the interfaces through which output is emitted.
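For illustration, a minimal mapper sketch that uses the Context object to update a counter and emit output might look like this (the counter group/name and the word-count logic are assumptions, not part of the original answer):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (token.isEmpty()) {
                // Counters are updated through the Context object
                context.getCounter("TokenMapper", "EMPTY_TOKENS").increment(1);
                continue;
            }
            word.set(token);
            // Output is also emitted through the Context object
            context.write(word, ONE);
        }
    }
}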
2. What are the core methods of a Reducer?
The 3 core methods of a reducer are –
1) setup() – This method of the reducer is used for configuring various parameters like the input data size, distributed cache, heap size, etc.
Function Definition - protected void setup(Context context)
2) reduce() – This is the heart of the reducer; it is called once per key with the associated list of values.
Function Definition - protected void reduce(KEYIN key, Iterable<VALUEIN> values, Context context)
3) cleanup() – This method is called only once at the end of the reduce task, for clearing all the temporary files.
Function Definition - protected void cleanup(Context context)
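Putting the three methods together, a reducer skeleton might look like the following sketch (the Text/IntWritable types and the summing logic are illustrative assumptions):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    protected void setup(Context context) {
        // Called once before any reduce() call; read job configuration here,
        // e.g. context.getConfiguration().get("some.property")
    }

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Called once per key with all values grouped for that key
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        context.write(key, new IntWritable(sum));
    }

    @Override
    protected void cleanup(Context context) {
        // Called once after all keys are processed; release resources here
    }
}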
For more MapReduce Interview Questions click here
4) Hadoop HBase Interview Questions
1. When should you use HBase and what are the key components of HBase?
HBase should be used when the big data application has –
1) A variable schema
2) Data stored in the form of collections
3) A need for key-based access to data during retrieval
Key components of HBase are –
Region - This component contains the in-memory data store and the HFile.
Region Server - This component serves and manages the Regions.
HBase Master - It is responsible for monitoring the Region Servers.
Zookeeper - It takes care of the coordination between the HBase Master component and the client.
Catalog Tables - The two important catalog tables are -ROOT- and .META. The -ROOT- table tracks where the .META. table is, and the .META. table stores the locations of all regions in the system.
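As a brief sketch of the key-based access HBase provides, using the HBase Java client API (the table name user_events, column family info, and qualifier last_login are hypothetical):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseKeyAccess {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("user_events"))) {

            // Write a cell keyed by row key "user-42" (hypothetical names)
            Put put = new Put(Bytes.toBytes("user-42"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("last_login"),
                    Bytes.toBytes("2016-01-01"));
            table.put(put);

            // Key-based retrieval of the same row
            Get get = new Get(Bytes.toBytes("user-42"));
            Result result = table.get(get);
            byte[] value = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("last_login"));
            System.out.println("last_login = " + Bytes.toString(value));
        }
    }
}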
For more Hadoop HBase Interview Questions click here
5) Hadoop Sqoop Interview Questions
1. Explain some important Sqoop commands other than import and export.
Create Job (--create)
Here we create a job named myjob, which can import table data from an RDBMS table into HDFS. The following command creates a job that imports data from the employee table in the db database into HDFS.
$ sqoop job --create myjob \
-- import \
--connect jdbc:mysql://localhost/db \
--username root \
--table employee -m 1
Verify Job (--list)
The ‘--list’ argument is used to verify the saved jobs. The following command lists the saved Sqoop jobs.
$ sqoop job --list
Inspect Job (--show)
The ‘--show’ argument is used to inspect or verify a particular job and its details. The following command verifies a job called myjob.
$ sqoop job --show myjob
Execute Job (--exec)
The ‘--exec’ option is used to execute a saved job. The following command executes a saved job called myjob.
$ sqoop job --exec myjob
For more Hadoop Sqoop Interview Questions click here
6) Hadoop Flume Interview Questions
1. Explain the core components of Flume.
The core components of Flume are –
Event - The single log entry or unit of data that is transported.
Source - The component through which data enters Flume workflows.
Sink - It is responsible for transporting data to the desired destination.
Channel - It is the duct between the Source and the Sink.
Agent - Any JVM process that runs Flume.
Client - The component that transmits events to the Source operating within the Agent.
For more Hadoop Flume Interview Questions click here
7) Hadoop Zookeeper Interview Questions
1. Can Apache Kafka be used without Zookeeper?
It is not possible to use Apache Kafka without Zookeeper, because if Zookeeper is down, Kafka cannot serve client requests.
2. Name a few companies that use Zookeeper.
Yahoo, Solr, Helprace, Neo4j, Rackspace
For more Hadoop Zookeeper Interview Questions click here
8) Pig Interview Questions
1. What do you mean by a bag in Pig?
A collection of tuples is referred to as a bag in Apache Pig.
2. Does Pig support multi-line commands?
Yes
For more Pig Interview Questions click here
9) Hive Interview Questions
1. What is a Hive Metastore?
The Hive Metastore is a central repository that stores Hive metadata (such as table schemas, partitions, and data locations) in an external relational database.
2. Are multiline comments supported in Hive?
No
For more Hive Interview Questions click here
10) Hadoop YARN Interview Questions
1. What are the stable versions of Hadoop?
Release 2.7.1 (stable)
Release 2.4.1
Release 1.2.1 (stable)
2. What is Apache Hadoop YARN?
YARN is a powerful and efficient feature rolled out as part of Hadoop 2.0. YARN is a large-scale distributed system for running big data applications.