HadoopDevelopment: August 2015

Tuesday, 25 August 2015

Hadoop Interview Questions – HDFS

Q1. What is Hadoop?

Ans. Hadoop is a free, Java-based programming framework. It supports the processing of big data in a distributed computing environment.

Q2. Who is the provider of Hadoop?

Ans. Hadoop is an Apache project provided by the Apache Software Foundation.

Q3. What is the use of Hadoop?

Ans. With Hadoop, the user can run applications on systems that have thousands of nodes and span many terabytes of data. Rapid data processing and transfer among nodes keeps operation uninterrupted even when a node fails, which prevents failure of the whole system.

Q4. What are the operating systems on which Hadoop works?

Ans. Linux is the preferred operating system and Windows is also supported; Hadoop can also work on OS X and BSD.

Q5. What is meant by Big Data?

Ans. Big Data refers to a collection of data so large that it is difficult to capture, store, process, or retrieve. Traditional database management tools cannot handle it, but Hadoop can.

Q6. Can you give examples of Big Data?

Ans. Facebook alone generates more than 500 terabytes of data daily, whereas many other organizations like Jet Air and the Stock Exchange Market generate 1+ terabytes of data every hour. These are Big Data.

Q7. What are the major characteristics of Big Data?

Ans. The three characteristics of Big Data are volume, velocity, and variety. Earlier, data was measured in megabytes and gigabytes, but now it is measured in terabytes.

Q8. What is the use of Big Data Analysis for an enterprise?

Ans. Analysis of Big Data identifies the problems and focus points in an enterprise. It can prevent big losses and increase profits by helping entrepreneurs take informed decisions.

Q9. What are the characteristics of data scientists?

Ans. Data scientists analyze data and provide solutions for business problems. They are gradually replacing business and data analysts.

Q10. What are the basic characteristics of Hadoop?

Ans. Written in Java, the Hadoop framework is capable of solving problems that involve Big Data analysis. Its programming model is based on Google's MapReduce, and its infrastructure is based on Google's work on Big Data and distributed file systems, notably the Google File System. Hadoop is scalable: more nodes can be added to it.

Q11. Which are the major players on the web that use Hadoop?

Ans. Hadoop grew out of the Nutch project that Doug Cutting started in 2002. It drew on Google's MapReduce paper, published in 2004, and became a separate project with its own HDFS in 2006. Yahoo and Facebook adopted it in 2008 and 2009 respectively. Major commercial enterprises using Hadoop include EMC, Hortonworks, Cloudera, MapR, Twitter, eBay, and Amazon, among others.

Q12. How is Hadoop different from traditional RDBMS?

Ans. An RDBMS is useful for small, structured data sets that are read and updated record by record, whereas Hadoop is useful for processing Big Data in bulk, in one pass.

Q13. What are the main components of Hadoop?

Ans. The main components of Hadoop are HDFS, used to store large data sets, and MapReduce, used to analyze them.

Q14. What is HDFS?

Ans. HDFS is a file system used to store large data files. It handles streaming data and runs on clusters of commodity hardware.

Q15. What are the main features of HDFS?

Ans. Great fault tolerance, high throughput, suitability for handling large data sets, and streaming access to file system data are the main features of HDFS. It can be built with commodity hardware.

Q16. Why is replication used in HDFS even though it causes data redundancy?

Ans. Systems with average configuration are liable to crash at any time. By default, HDFS replicates data and stores it at three different locations, which makes the system highly fault tolerant. If the data at one location becomes corrupt or inaccessible, it can be retrieved from another location.
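
The replication idea can be sketched in a few lines of Python. This is a toy model and not HDFS code: the node names and the `place_replicas` and `read_block` helpers are hypothetical, and real HDFS placement is rack-aware rather than the simple round-robin used here.

```python
REPLICATION_FACTOR = 3  # three copies, as described in the answer above


def place_replicas(block_id, nodes, factor=REPLICATION_FACTOR):
    """Pick `factor` distinct nodes for a block (toy round-robin placement)."""
    if factor > len(nodes):
        raise ValueError("not enough nodes for the requested replication factor")
    start = block_id % len(nodes)
    return [nodes[(start + i) % len(nodes)] for i in range(factor)]


def read_block(replica_nodes, failed_nodes):
    """Return the first replica holder that is still healthy."""
    for node in replica_nodes:
        if node not in failed_nodes:
            return node
    raise IOError("all replicas are unavailable")


nodes = ["dn1", "dn2", "dn3", "dn4"]  # hypothetical data node names
replicas = place_replicas(block_id=5, nodes=nodes)
```

With these inputs the block lands on `dn2`, `dn3`, and `dn4`, and `read_block(replicas, {"dn2"})` falls back to `dn3`. The read fails only when every replica holder is down.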

Q17. Would the calculations made on one node be replicated to others in HDFS?

Ans. No! The calculation is made on the original node only. Only if that node fails does the master node rerun the calculation on a second node.

Q18. What is meant by streaming access?

Ans. HDFS works on the principle of “write once, read many”, and the focus is on fast and accurate data retrieval. Streaming access refers to reading the complete data set instead of retrieving a single record from the database.

Q19. What is meant by ‘commodity hardware’? Can Hadoop work on them?

Ans. Average, inexpensive systems are known as commodity hardware, and Hadoop can be installed on any of them. Hadoop does not require high-end hardware to function.

Q20. Which one is the master node in HDFS? Can it be commodity?

Ans. The name node is the master node in HDFS, and the job tracker commonly runs on the same machine. The name node holds the metadata and is the single point of failure in HDFS, so it should be a high-availability machine. It cannot be commodity hardware, as the entire HDFS depends on it.

Q21. What is meant by Data node?

Ans. The data node is the slave daemon deployed on each of the systems; it provides the actual storage and serves read and write requests for clients.

Q22. What is a daemon?

Ans. A daemon is a process that runs in the background in a UNIX environment. The equivalent in Windows is a ‘service’ and in DOS a ‘TSR’.

Q23. What is the function of ‘job tracker’?

Ans. The job tracker is a daemon that usually runs on the name node; it accepts MapReduce jobs and tracks their tasks in Hadoop. There is only one job tracker, which distributes tasks to the various task trackers. When it goes down, all running jobs come to a halt.

Q24. What is the role played by task trackers?

Ans. Task trackers are daemons that run on the data nodes; they take care of the individual tasks on a slave node as assigned to them by the job tracker.

Q25. What is meant by heartbeat in HDFS?

Ans. Data nodes and task trackers send heartbeat signals to the name node and job tracker respectively to report that they are alive. If the signal is not received, it indicates a problem with the data node or task tracker.
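
A minimal sketch of heartbeat tracking, assuming a simple timeout rule. The `HeartbeatMonitor` class, the worker names, and the 10-second timeout are all hypothetical choices for illustration; they are not Hadoop's actual settings or API.

```python
HEARTBEAT_TIMEOUT = 10.0  # seconds; hypothetical value chosen for illustration


class HeartbeatMonitor:
    """Tracks the last heartbeat time reported by each worker."""

    def __init__(self, timeout=HEARTBEAT_TIMEOUT):
        self.timeout = timeout
        self.last_seen = {}  # worker name -> time of last heartbeat

    def record(self, worker, now):
        """Store the time at which a worker last reported in."""
        self.last_seen[worker] = now

    def dead_workers(self, now):
        """Return workers whose last heartbeat is older than the timeout."""
        return sorted(
            w for w, t in self.last_seen.items() if now - t > self.timeout
        )


monitor = HeartbeatMonitor()
monitor.record("datanode-1", now=0.0)
monitor.record("datanode-2", now=0.0)
monitor.record("datanode-1", now=8.0)  # datanode-1 reports again
```

At time 12.0, `monitor.dead_workers(now=12.0)` returns `["datanode-2"]`, because only `datanode-1` has reported within the last 10 seconds.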

Q26. Is it necessary that Name node and job tracker should be on the same host?

Ans. No! They can be on different hosts.

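Q27. What is meant by ‘block’ in HDFS?

Ans. A block in HDFS is the minimum unit of data for reading or writing. The default block size is 64 MB in HDFS (Hadoop 1; Hadoop 2 raised the default to 128 MB). If a file is 52 MB, HDFS stores it in a single block that takes up only 52 MB on disk; the remaining 12 MB is not reserved and stays free for other data.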
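
The block arithmetic can be worked through with a short Python sketch. The `split_into_blocks` helper is hypothetical, and it assumes the 64 MB block size quoted above; the last block of a file is only as large as the data it holds.

```python
BLOCK_SIZE_MB = 64  # block size quoted in the answer above


def split_into_blocks(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Return the sizes (in MB) of the blocks a file would occupy."""
    if file_size_mb < 0:
        raise ValueError("file size cannot be negative")
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    # Full blocks first, then one partial block for any leftover data.
    return [block_size_mb] * full_blocks + ([remainder] if remainder else [])


small_file = split_into_blocks(52)
large_file = split_into_blocks(200)
```

Here `small_file` is `[52]`, a single block, while `large_file` is `[64, 64, 64, 8]`: three full blocks and a final 8 MB block.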

Q28. Can blocks be broken down by HDFS if a machine does not have the capacity to store as many blocks as the user wants?

Ans. Blocks in HDFS cannot be broken down. The master node calculates the required space and decides how the data is distributed when a machine has less space available.

Q29. What is the process of indexing in HDFS?

Ans. Once data is stored, HDFS relies on the last part of the data to find out where the next part is stored.

Q30. How is a data node identified as saturated?

Ans. When a data node is full and has no space left, the name node identifies it as saturated.

Q31. What type of data is processed by Hadoop?

Ans. Hadoop processes digital data only.

Q32. How does the name node determine which data node to write to?

Ans. The name node holds metadata about all the data nodes, and it uses this information to decide which data node will store the data.

Q33. Who is the ‘user’ in HDFS?

Ans. Anyone who tries to retrieve data from a database using HDFS is the user. The client is not the end user but an application that uses the job tracker and task trackers to retrieve data.

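Q34. How does the client communicate with the name node and data node in HDFS?

Ans. Clients communicate with the name node and data nodes through Hadoop's own RPC and data-transfer protocols over TCP/IP. SSH is used only by the cluster start and stop scripts to launch daemons on remote machines, not for client communication.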

Q35. What is a rack in HDFS?

Ans. A rack is a physical collection of data nodes kept together in a single location.

Monday, 17 August 2015

Differences between Hadoop 1.0 & Hadoop 2.0

Early adopters of the Hadoop ecosystem were restricted to processing models that were MapReduce-based only. Hadoop 2 has brought with it processing models that lend themselves to many Big Data uses, including interactive SQL queries over big data, analysis of Big Data scale graphs, and scalable machine learning. The evolution from Hadoop 1's limited processing model, made up of batch-oriented MapReduce tasks, to the more specialized and interactive models of Hadoop 2 has showcased the potential value of distributed, large-scale processing systems. Read on to note the major differences between Hadoop 1 and 2.

Hadoop--YARN and HDFS

Other available solutions are often unsuitable for interactive analytics, are I/O intensive, or are constrained in their support for graph processing, memory-intensive algorithms, and other machine learning workloads; Hadoop is well ahead on these counts. By providing a reliable, scalable, and strong foundation for Big Data architectures, the Hadoop ecosystem has been positioned as one of the most dominant Big Data platforms for analytics. It deserves mention that Hadoop developers rewrote major components of the Hadoop 1 file system to produce Hadoop 2. The resource manager YARN and HDFS federation were introduced as important advances in Hadoop 2.

HDFS-- Hadoop file system with a difference

HDFS, the Hadoop file system, comprises two main components: the block storage service and namespaces. The block storage service deals with block operations, cluster management of data nodes, and replication, while namespaces manage all operations on files and directories, especially the creation and modification of files and directories.

A single Namenode was responsible for managing the complete namespace for Hadoop clusters in Hadoop 1. With the advent of HDFS federation, several Namenode servers are used for the management of namespaces. This in turn allows for performance improvements, horizontal scaling, and multiple namespaces. The implementation of HDFS federation lets existing single-Namenode configurations operate without changes. A shift to HDFS federation requires Hadoop administrators to format the Namenodes and update them for use with the latest Hadoop cluster applications. It also involves adding more Namenodes to the Hadoop cluster.
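
One way to picture several Namenodes each owning part of the namespace is a mount table that maps path prefixes to Namenodes. The sketch below is a toy illustration under that assumption: the table contents, the Namenode names, and the `resolve_namenode` helper are hypothetical and do not reproduce Hadoop's own configuration format.

```python
# Hypothetical mount table: each path prefix is served by one Namenode.
MOUNT_TABLE = {
    "/user": "namenode-1",
    "/data": "namenode-2",
    "/data/archive": "namenode-3",
}


def resolve_namenode(path, mount_table=MOUNT_TABLE):
    """Return the Namenode whose mount point is the longest prefix of `path`."""
    matches = [
        mount for mount in mount_table
        if path == mount or path.startswith(mount + "/")
    ]
    if not matches:
        raise KeyError(f"no Namenode is mounted for {path}")
    return mount_table[max(matches, key=len)]


owner = resolve_namenode("/data/archive/2015/log.txt")
```

Here `owner` is `"namenode-3"`, since `/data/archive` is a longer match than `/data`. A path such as `/database/x` matches nothing, because prefixes are compared on whole path components.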

YARN—Supports additional performance enhancements for Hadoop 2

While HDFS federation brings reliability and scalability to Hadoop, YARN brings significant performance enhancements for certain applications, implements an overall more flexible execution engine, and offers support for additional processing models. As a recap, YARN is a resource manager that was developed by separating the resource management capabilities from the processing engine of MapReduce as implemented in Hadoop 1.

Often referred to as the operating system of Hadoop because of its role in managing and monitoring diverse workloads, implementing security controls, maintaining multi-tenant environments, and managing high-availability Hadoop features, YARN is designed for diverse, multiple user applications that operate on a given multi-tenant platform. In addition to MapReduce, YARN supports multiple other processing models.

High Availability Mode (HA) of Namenode

The name node stores all metadata in the Hadoop cluster. It is extremely important because an event such as an unexpected machine crash on the name node can bring down the entire Hadoop cluster. Hadoop 2.0 offers a solution to this problem: the High Availability feature of HDFS allows two redundant name nodes to run in the same cluster. These name nodes run in an active/passive configuration, with one operating as the primary name node and the other as a hot standby.

Both name nodes share an edits log, in which all changes are collected on shared NFS storage. At any point in time, only a single writer, the active name node, is allowed to write to this shared storage. The passive name node has read access to the storage and is responsible for keeping its metadata about the cluster up to date. If the active name node fails, the passive name node takes over as the active one and starts writing to the shared storage.
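
The active/passive arrangement can be sketched as a toy model in Python. The `NameNode` and `SharedEditLog` classes and the `fail_over` function are hypothetical stand-ins, not Hadoop classes; the sketch only illustrates the single-writer rule and the hand-over on failure.

```python
class NameNode:
    def __init__(self, name, state):
        self.name = name
        self.state = state      # "active", "standby", or "failed"
        self.applied_txid = 0   # number of edits this node has applied


class SharedEditLog:
    """Toy shared edit log that accepts writes from one name node only."""

    def __init__(self, writer):
        self.writer = writer
        self.edits = []

    def append(self, node, edit):
        if node is not self.writer:
            raise PermissionError(f"{node.name} is not the current writer")
        self.edits.append(edit)
        node.applied_txid = len(self.edits)

    def catch_up(self, node):
        """Read-only replay: bring a standby up to the latest edit."""
        node.applied_txid = len(self.edits)


def fail_over(log, failed, standby):
    """Promote the standby after it has replayed all shared edits."""
    log.catch_up(standby)
    failed.state = "failed"
    standby.state = "active"
    log.writer = standby


nn1 = NameNode("nn1", "active")
nn2 = NameNode("nn2", "standby")
log = SharedEditLog(writer=nn1)
log.append(nn1, "mkdir /data")
log.append(nn1, "create /data/a.txt")
fail_over(log, failed=nn1, standby=nn2)
log.append(nn2, "delete /data/a.txt")
```

After the fail-over, `nn2` is the only node allowed to append, and a further write attempt by `nn1` raises `PermissionError`.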

Enhanced Utilization of Resources

In Hadoop 1.0, the JobTracker held the dual responsibility of driving the execution of MapReduce jobs and managing the resources of the cluster. With YARN, the two major functions of the overburdened JobTracker, job scheduling/monitoring and resource management, are split into separate daemons. These are:

A Resource Manager (RM) that focuses on the management of cluster resources;

An Application Master (AM), typically one per running application, which manages an individual running application such as a MapReduce job.

It is essential to note that there are no longer any inflexible map and reduce slots. With YARN as the central resource manager, multiple applications can now share common resources and run on Hadoop.
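
The contrast with fixed slots can be illustrated by a toy scheduler that hands out memory from one shared pool to any kind of application. The `ResourceManager` class below is a hypothetical sketch, not the YARN API, and the application names and memory sizes are made up for the example.

```python
class ResourceManager:
    """Toy scheduler that hands out memory from one shared pool."""

    def __init__(self, total_memory_mb):
        self.free_memory_mb = total_memory_mb
        self.containers = []  # (application, memory_mb) pairs

    def allocate(self, application, memory_mb):
        """Grant a container if the pool has room; return True on success."""
        if memory_mb <= 0 or memory_mb > self.free_memory_mb:
            return False
        self.free_memory_mb -= memory_mb
        self.containers.append((application, memory_mb))
        return True

    def release(self, application):
        """Return every container held by an application to the pool."""
        freed = sum(m for a, m in self.containers if a == application)
        self.containers = [(a, m) for a, m in self.containers if a != application]
        self.free_memory_mb += freed
        return freed


rm = ResourceManager(total_memory_mb=8192)
granted = [
    rm.allocate("mapreduce-job", 2048),
    rm.allocate("spark-job", 4096),
    rm.allocate("giraph-job", 4096),  # exceeds what is left, so it is refused
]
```

Here `granted` is `[True, True, False]` and 2048 MB stays free. Once `rm.release("spark-job")` returns its 4096 MB, the Giraph request fits, which is the kind of sharing that fixed map and reduce slots did not allow.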

Batch Oriented application

In its 2.0 version, Hadoop goes well beyond its batch-oriented nature and can also run interactive and streaming applications.

Native Windows Support

Originally, Hadoop was developed to support the UNIX family of operating systems. With Hadoop 2.0 offering native support for the Windows operating system, the reach of Hadoop has extended significantly. It now also caters to the ever-growing Windows Server market.

Non MapReduce Applications on Hadoop 2.0

Hadoop 1.0 was compatible with MapReduce framework tasks only; these could process all data stored in HDFS, but other than MapReduce there were no models for data processing. For things such as graph or real-time analysis of the data stored in HDFS, users had to move the data to alternate storage facilities like HBase. YARN helps Hadoop run non-MapReduce applications too. YARN APIs can be used to write other frameworks that run on top of HDFS. This enables different non-MapReduce applications on Hadoop, with MPI, Giraph, Spark, and HAMA being some that are well ported for running within YARN.

Data node caching for faster access

Hadoop 2.0 users and applications such as Pig, Hive, or HBase can identify sets of files that require caching. For instance, the dimension tables in Hive can now be configured to be cached in DataNode RAM, thereby allowing faster reads for Hive queries on the most frequently looked-up tables.

HDFS- Multiple Storage

Another important difference between Hadoop 1.0 and Hadoop 2.0 is the latter’s support for heterogeneous storage. Whether SSDs or spinning disks, Hadoop 1.0 treats all storage devices on a DataNode as a single uniform pool. So, while Hadoop 1.0 users could store their data on an SSD, they had no way to control which data went there. Heterogeneous storage is an integral part of Hadoop from version 2.0 onwards. The approach is quite general and also permits users to treat memory as a storage tier for temporary and cached data.

HDFS Snapshots

Hadoop 2.0 adds support for file system snapshots. They are point-in-time images of the complete file system or of sub-trees of the file system. The many uses of snapshots include:

Protection against user errors: An admin-driven process can be set up for taking snapshots periodically. So, if users happen to delete files accidentally, the lost data can be restored from the snapshots that contain it.

Reliable backups: Snapshots of the entire file system or of sub-trees can be used by the admin as a starting point for full backups. Incremental backups can then be taken by copying the differences between any two given snapshots.

Disaster recovery: Snapshots may also be used to copy point-in-time images to remote sites for disaster recovery.
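
The incremental backup idea, copying only the differences between two snapshots, can be sketched as follows. This is a toy model: each snapshot is represented as a plain dictionary from path to a content version, and the `snapshot_diff` helper is hypothetical rather than part of HDFS.

```python
def snapshot_diff(old, new):
    """Compare two snapshots given as {path: content_version} mappings."""
    created = sorted(p for p in new if p not in old)
    deleted = sorted(p for p in old if p not in new)
    modified = sorted(p for p in new if p in old and new[p] != old[p])
    return {"created": created, "deleted": deleted, "modified": modified}


# Two hypothetical point-in-time images of the same directory tree.
monday = {"/data/a.txt": 1, "/data/b.txt": 1}
tuesday = {"/data/a.txt": 2, "/data/c.txt": 1}
diff = snapshot_diff(monday, tuesday)
```

Here `diff` reports `/data/c.txt` as created, `/data/b.txt` as deleted, and `/data/a.txt` as modified. An incremental backup would need to copy only the created and modified files, and a file deleted by accident could be restored from the older snapshot.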