Saturday, 27 June 2015

Hadoop Basic Questions

HDFS:

1. Without touching the block size or input split, can we control the number of mappers?
Ans: Yes. Create a custom InputFormat and override isSplitable() to return false, so that each file is processed by exactly one mapper.
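For illustration, a minimal sketch of such an InputFormat using the new (org.apache.hadoop.mapreduce) API; the class name here is hypothetical:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// A TextInputFormat whose files are never split, so each file gets exactly one mapper.
public class NonSplittableTextInputFormat extends TextInputFormat {
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false;   // one split (and hence one mapper) per file
    }
}

Set it on the job with job.setInputFormatClass(NonSplittableTextInputFormat.class).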

2. What is the difference between Block size & input split?

Ans: A block is a physical division of the data, whereas an input split is a logical division.

3. To process one hundred files each of size - 100MB on HDFS whose default block size is 64MB, how many mappers would be invoked?

Ans: Each file occupies 2 blocks (block 1: 64 MB, block 2: 36 MB), so the 100 files produce 200 input splits and hence 200 mappers are invoked.

4. What is data locality optimization?

Ans: In Hadoop, execution is moved to the data rather than the data to the execution. The scheduler (Jobtracker) tries the following options in order, always preferring the first:
Data-local execution: the task is launched on the Tasktracker running on the Datanode that stores the block.
Rack-local (off-node) execution: if no Tasktracker slot is free on that Datanode, the task runs on another node in the same rack and the block is read over the rack's network.
Off-rack execution: if no slot is free anywhere in the rack that holds the block, the task runs on a node in a different rack and the block is transferred across racks.

5. What is Speculative execution?

Ans: If one of the tasks of a MapReduce job is slow, it drags down the overall performance of the job. The Jobtracker therefore continuously monitors the progress of each task (via heartbeat signals). If a task is progressing much more slowly than expected, the Jobtracker speculatively launches a duplicate of that task on another node, using a different replica of the same block. This concept is called speculative execution.

Note that the slow-running task is not killed up front: both attempts run simultaneously, and only when one of them completes is the other killed.
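Speculative execution can also be switched on or off per job. A hedged sketch using the Hadoop 1.x property names (newer releases use mapreduce.map.speculative / mapreduce.reduce.speculative instead):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SpeculationConfig {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.setBoolean("mapred.map.tasks.speculative.execution", true);
        // Often disabled for reducers whose side effects are not idempotent:
        conf.setBoolean("mapred.reduce.tasks.speculative.execution", false);
        Job job = Job.getInstance(conf, "speculation-demo");
        // ... remaining job setup
    }
}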

6. What are the different types of File permissions in HDFS?

Ans: 
drwxrwxrwx user1 prog 10 Aug 16 15:02 myfolder
-rwxrwxrwx user1 prog 10 Aug 01 07:02 myfile.sas

Position 1: ‘d’ means directory, ‘-’ means file
Positions 2-4: owner's permissions on the file/directory
Positions 5-7: group permissions on the file/directory
Positions 8-10: permissions for all other users

7. What is Rack-awareness?

Ans: In HDFS, the replicas of a block are deliberately not all stored on the same rack; this placement policy, based on the Namenode's knowledge of the cluster's rack topology, is called rack awareness. If all the replicas were on one rack and that entire rack went down, there would be no way of recovering that block of data.

8. What are the different modes of HDFS that one can run? Where do we configure these modes?

Ans: Hadoop can be configured to run on one of the following modes. 
   a. Standalone Mode or local (default mode) 
   b. Pseudo-distributed mode 
   c. Fully distributed mode.
These modes are configured via core-site.xml, hdfs-site.xml and mapred-site.xml.

9. What are the available data-types in Hadoop?

Ans: To support efficient serialization/deserialization, and so that keys can be compared with one another, Hadoop defines its own box types instead of relying on the plain Java types.

The following types implement WritableComparable:
Primitives: BooleanWritable, ByteWritable, ShortWritable, IntWritable, VIntWritable, FloatWritable, LongWritable, VLongWritable, DoubleWritable.
Others: NullWritable, Text, BytesWritable, MD5Hash.
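As a quick illustration of how these box types behave (a small sketch; the class name is just for the example):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;

public class WritableDemo {
    public static void main(String[] args) {
        IntWritable a = new IntWritable(10);
        IntWritable b = new IntWritable(42);
        System.out.println(a.compareTo(b));   // negative, because 10 < 42 (WritableComparable)
        Text word = new Text("hadoop");
        System.out.println(word.getLength()); // 6 (UTF-8 byte length)
        word.set("spark");                    // Writables are mutable, so objects can be reused
        System.out.println(word);
    }
}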

10. Explain the command '-getMerge'

Ans: hadoop fs -getmerge <directory> <merged file name>
     This command concatenates all the files in the given HDFS directory into a single file on the local filesystem.

11. Explain the anatomy of a file read in HDFS
Ans: 
1. Client opens the file (calls open() on Distributed File System).
2. DFS calls the namenode to get block locations.
3. DFS creates FSDataInputStream and client invokes read() on this object.
4. The client invokes read() on the FSDataInputStream (which wraps a DFSInputStream that handles datanode communication); the blocks are read in order from the datanodes holding them. Once all the blocks have been read, the client calls close() on the FSDataInputStream.
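A minimal client-side sketch of this read path, assuming fs.defaultFS points at the cluster; the path /user/demo/input.txt is only an example:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);                              // DistributedFileSystem for an HDFS URI
        FSDataInputStream in = fs.open(new Path("/user/demo/input.txt")); // open(): namenode returns block locations
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
            String line;
            while ((line = reader.readLine()) != null) {                  // read(): bytes are streamed from the datanodes
                System.out.println(line);
            }
        }                                                                  // close() is called on the stream when done
    }
}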

12. Explain the anatomy of a file write in HDFS
Ans: 
1. The client creates a file (calls create() on the DFS).
2. The DFS calls the namenode (NN) to create the file. The NN checks the client's permissions and whether the file already exists; if it does, an IOException is thrown.
3. The DFS returns an FSDataOutputStream to write data into. It wraps a DFSOutputStream, which handles communication with the NN and the datanodes (DNs).
4. The DFSOutputStream breaks the data into packets (small units of data) and streams them to the DNs that will hold the block. A pipeline is formed from the list of DNs that each block has to be replicated to.
5. When a block of data has been written to all DNs in the pipeline, acknowledgements come back through the pipeline in reverse order.
6. When the client has finished writing the data, it calls close() on the stream.
7. The client waits for the outstanding acknowledgements before contacting the namenode to signal that the file is complete.
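A matching client-side sketch of the write path (the path is illustrative; create() with overwrite=false reproduces the "file already exists" exception described above):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path out = new Path("/user/demo/output.txt");             // example path
        try (FSDataOutputStream stream = fs.create(out, false)) { // namenode records the new file, or throws if it exists
            stream.writeUTF("hello hdfs");                        // data is queued as packets and pipelined to the datanodes
        }                                                         // close() flushes remaining packets and waits for the acks
    }
}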

MapReduce:


1. What is Distributed Cache?
Ans: The Distributed Cache is a mechanism for distributing 'side data' (extra read-only data needed by a MapReduce program, such as a small lookup file) to every node, so that map and reduce tasks can read it locally.
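A hedged sketch of shipping a side file with the cache, using the Hadoop 2.x Job API (older releases use DistributedCache.addCacheFile() instead); the file path and job name are illustrative:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class CacheSetupExample {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "join-with-side-data");
        job.addCacheFile(new URI("/user/demo/lookup.txt"));  // copied to every node that runs a task
        // ... set mapper/reducer, input/output paths, then job.waitForCompletion(true)
    }
}

Inside a mapper's setup() the cached file can then be opened from the task's local working directory (or located via context.getCacheFiles()).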

2. What is 'Sequence File' format? Where do we use it?
Ans: SequenceFile is a flat file consisting of binary key/value pairs. It is extensively used in MapReduce as input/output formats. It is also worth noting that, internally, the temporary outputs of maps are stored using SequenceFile.
The SequenceFile class provides Writer, Reader and Sorter classes for writing, reading and sorting respectively.
There are 3 different SequenceFile formats:
     a. Uncompressed key/value records
     b. Record compressed key/value records - only 'values' are compressed here.
     c. Block compressed key/value records - both keys and values are collected in 'blocks' separately and compressed. The size of the 'block' is configurable
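A minimal sketch of writing and then reading a SequenceFile with the Hadoop 2.x option-style API; the path and key/value types are just examples:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path path = new Path("/user/demo/data.seq");

        // Write binary key/value records
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(path),
                SequenceFile.Writer.keyClass(IntWritable.class),
                SequenceFile.Writer.valueClass(Text.class))) {
            writer.append(new IntWritable(1), new Text("first record"));
            writer.append(new IntWritable(2), new Text("second record"));
        }

        // Read them back in order
        try (SequenceFile.Reader reader = new SequenceFile.Reader(conf,
                SequenceFile.Reader.file(path))) {
            IntWritable key = new IntWritable();
            Text value = new Text();
            while (reader.next(key, value)) {
                System.out.println(key + "\t" + value);
            }
        }
    }
}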

3. What are the different File Input Formats in MapReduce?
Ans: FileInputFormat is the base class for all implementations of InputFormat that use files as their data source. Its sub-classes include CombineFileInputFormat, TextInputFormat (the default), KeyValueTextInputFormat, NLineInputFormat and SequenceFileInputFormat.

SequenceFileInputFormat in turn has a few subclasses: SequenceFileAsBinaryInputFormat, SequenceFileAsTextInputFormat and SequenceFileInputFilter.
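Choosing an input format is a per-job configuration call; a small sketch (class name hypothetical):

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;

public class InputFormatChoice {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance();
        // Treat each line as a tab-separated key/value pair instead of (offset, line):
        job.setInputFormatClass(KeyValueTextInputFormat.class);
        // Or give every mapper a fixed number of input lines:
        // job.setInputFormatClass(NLineInputFormat.class);
        // NLineInputFormat.setNumLinesPerSplit(job, 10);
    }
}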

4. What is ‘Shuffling & sorting’ phase in MapReduce?
Ans: This phase occurs between the map and reduce phases. The key/value pairs emitted by the various mappers are partitioned by key, sorted, and copied to the appropriate reducers.

5. How many instances of a 'jobtracker' run in a cluster?
Ans: Only one instance of Jobtracker would run in a cluster

6. Can two different Mappers communicate with each other?
Ans: No, Mappers/Reducers run independently of each other.

7. How do you make sure that only one mapper runs your entire file?
Ans: Create a custom InputFormat and override isSplitable() to return false (see the sketch under HDFS question 1). A cruder alternative is to set the block size larger than the input file, so the whole file fits in a single block and produces a single split.

8. When does the reducer phase start in an MR program, and why does the reducer's progress show a non-zero percentage even before the mapper phase has finished?
Ans: The reduce() calls start only after all mappers have finished execution. The reported reducer progress is non-zero before map progress reaches 100% because the reducer phase is actually a combination of copy, sort and reduce: sorted map outputs start being copied to the reducers while the remaining map tasks are still running.
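The point at which reducers begin copying map output is tunable. A hedged sketch using the Hadoop 1.x property name (the value is the fraction of map tasks that must complete before the copy phase starts; the default is 0.05):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SlowStartConfig {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.setFloat("mapred.reduce.slowstart.completed.maps", 0.80f); // wait for 80% of maps to finish
        Job job = Job.getInstance(conf, "slowstart-demo");
        // ... remaining job setup
    }
}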

9.  Explain various phases of a MapReduce program.
Ans: 
Mapper phase: a MapReduce job splits the input data-set into independent chunks (input splits), which are processed by the map tasks in a completely parallel manner.
Sort & shuffle phase: determines which reducer should receive each map output key/value pair (partitioning); the pairs are copied to the reducers and the keys within each reducer are sorted.
Reducer phase: the reducer receives a key and the list of values emitted for it across all the mappers, and aggregates those values. (A minimal word-count sketch illustrating the phases follows.)
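Here is word count written out as a minimal sketch of the map and reduce classes (new API; the shuffle & sort step happens inside the framework, between the two classes):

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Map phase: each mapper processes one input split and emits (word, 1) pairs.
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Shuffle & sort: the framework partitions the (word, 1) pairs by key,
    // copies them to the reducers, and presents each reducer its keys in sorted order.

    // Reduce phase: the reducer gets a word and the list of counts emitted for it
    // across all mappers, and aggregates them.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}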
10. What is a 'Task instance' ?
Ans: A task instance is the child JVM process spawned by the Tasktracker to run an individual map or reduce task. Running tasks in separate JVMs ensures that a task failure or crash does not take down the Tasktracker itself.

Thursday, 25 June 2015

Shuffling and repartitioning of RDDs in Apache Spark

To write an optimized Spark application you have to choose transformations and actions carefully; the wrong ones will slow the application down. Keep the following points in mind to make your application more efficient.
1. Number of partitions when creating RDD
By default Spark creates one partition for each HDFS block of the file (64 MB by default). You can also pass the desired number of partitions as a second argument when creating the RDD. For example, creating an RDD from a text file:
val rdd= sc.textFile(“file.txt”,5)
The statement above creates an RDD from the text file with 5 partitions. Now suppose the cluster has 4 cores and each partition takes 5 minutes to process: 4 partitions are processed in parallel, and the fifth starts only once a core frees up, so the whole job takes 10 minutes while three cores sit idle during the second wave.
To avoid this, create the RDD with a number of partitions equal to the number of cores in the cluster, so that all partitions are processed in parallel and resources are used evenly.
 2 . reduceByKey Vs. groupByKey
Let's take word count as an example: you can compute word frequencies using either of the transformations groupByKey or reduceByKey.
Word count using reduceByKey:
val wordPairsRDD = rdd.map(word => (word, 1))

val wordCountsWithReduce = wordPairsRDD
  .reduceByKey(_ + _)
  .collect()
The diagram below shows how the RDD is processed and shuffled over the network:
[Diagram: word count with reduceByKey (each worker combines counts locally before the shuffle)]
As the diagram shows, with reduceByKey each worker node first counts the words in its own partition locally, and only the partial results are shuffled for the final aggregation.
On the other hand, if we use groupByKey for word count, as follows:
val wordCountsWithGroup = rdd
  .groupByKey()
  .map(t => (t._1, t._2.sum))
  .collect()
The diagram below shows how the RDD is processed and shuffled over the network when using groupByKey:
[Diagram: word count with groupByKey (all pairs are shuffled before counting)]
As you can see, with groupByKey every worker node shuffles its raw pairs and the counting happens only after the shuffle, so a lot of unnecessary data is transferred over the network.
So avoid using groupByKey as much as possible.
3. Hash-partition before transformation over pair RDD
Before performing transformations on a pair RDD, it helps to have all records with the same key on the same worker. Hash partitioning does exactly that: the data is shuffled once and partitions are assigned by the key of the pair RDD. For example:
import org.apache.spark.HashPartitioner

val wordPairsRDD = rdd.map(word => (word, 1))
                      .partitionBy(new HashPartitioner(4))

val wordCountsWithReduce = wordPairsRDD
  .reduceByKey(_ + _)
  .collect()
With hash partitioning the data is shuffled once so that all records with the same key land on the same worker, as the diagram shows:
[Diagram: hash-partitioned pair RDD (all records for a given key end up on one worker)]
In the diagram you can see that all the data for key “c” ends up on the same worker node. So when applying transformations over a pair RDD, hash partitioning is worth using.
4. Do not use collect() over a big dataset
The collect() action gathers all the elements of the RDD and sends them to the driver, so using it on a big dataset can cause an out-of-memory error because the data does not fit in the driver's memory. Filter the data before calling collect(), or use take() or takeSample() instead.
5. Use coalesce to repartition in decrease number of partition
Use coalesce() rather than repartition() when decreasing the number of partitions of an RDD; coalesce is useful because it does not shuffle data over the network.
rdd.coalesce(1)

Cloud Computing



In response to the many emails from readers asking to know more about cloud computing and its features, here are the top cloud computing frequently asked questions and answers. Read on to learn the basics:




 What Is Cloud Computing? 
Cloud computing describes a service rather than a product: hosted computing services delivered over the internet. The service may be paid or free, depending on the provider and the user, and paid plans are typically billed by usage, per hour or even per minute. It is flexible: you can consume as much of the service as you need, whenever you need it. There are three types of cloud computing: Infrastructure as a Service (IaaS), Platform as a Service (PaaS), and Software as a Service (SaaS).



 Why Should You Use Cloud Computing? 
You should definitely consider cloud computing because it has several advantages. It reduces cost for both the buyer and the service provider. It provides enormous storage capacity: once you have the application you need not worry about space, only about keeping the software up to date, which in turn lowers the company's costs. Upgrades are applied automatically, so there is no need to hire specialists. It is very flexible; you can use it according to your comfort level. It is also very mobile: you can access it from anywhere, the only requirement being a reliable internet connection. Last but not least, you do not need to download any data.

 What Are The Disadvantages Of Cloud Computing? 
1. The quality of service can vary a great deal between providers.
2. Some cloud service providers fail to carry out regular maintenance.
3. Cloud computing applications can still be very costly, since buyers may need the platform to develop various versions of their software.
4. It can be difficult to move a service from one cloud provider to another (vendor lock-in).
5. Many recent applications are still going through various stages of testing and have not yet delivered the promised flexibility.
6. You need sound knowledge of the cloud computing application and of dealing with the cloud service provider, which is not easy to acquire.

What Do You Mean By Public Cloud?
A public cloud is one where the service provider makes services such as storage and applications available to the general public over the internet. The service is offered under two schemes: either free of charge or pay-as-you-use.
Let's Discuss The Advantages Of Public Cloud
Setup is easy and inexpensive because the service provider covers the cost of the hardware and bandwidth. It provides swift scalability to meet all kinds of needs, and there is no waste of money: you pay only for what you use, which gives you more control over your spending than paying for capacity you never touch.

What Is Private Cloud Computing? 
You could also call it a corporate cloud or internal cloud. It is an infrastructure that provides hosted services to a limited number of users behind a firewall. If you own a media company, for instance, you could use a private cloud to keep full control over your data. To build one you may seek the help of a third-party provider; Amazon's Simple Storage Service is one example of such an online business service.

How Unique is Cloud Computing? 
Cloud is defined as an on-demand service. It combines features of the grid and utility computing models, helps you perform a huge number of tasks, and acts as an independent service provider. Today cloud computing is one of the most desirable elements of the IT industry, and it is this combination of attributes that has made it so popular. The terms utility computing and cloud computing complement each other; together with grid computing they are the real trendsetters of the web world.

 What Does Service Provider Would Charge For The Service? 
Software-as-a-Service providers charge for exactly what you use: you pay as you go, and the service itself is effectively unlimited. If your usage exceeds the agreed allowance you need not worry; the provider simply bills the excess afterwards. Amazon's cloud servers, for example, charge by the hour, with a basic Linux instance costing around 10 cents per hour.

Data Safety In Cloud Computing?
In cloud computing, data safety is an important aspect that you should always pay heed to. Some online service providers, such as Carbonite and LinkUp, have faced serious data problems: they suddenly lost data and were unable to recover it for their customers. Worse, data in the cloud is not automatically safe; it can end up in the wrong hands. It is therefore necessary to read the provider's privacy and security policies and terms and conditions in detail.

Wednesday, 24 June 2015

Spark-Part2

What is Apache Spark?
Why it is a hot topic in Big Data forums?
Is Apache Spark going to replace hadoop?
If you are into BigData analytics business then, should you really care about Spark?
I hope this blog post will help answer some of the questions that may have been on your mind lately.

Introduction to Apache Spark

Apache Spark is a framework for performing general data analytics on a distributed computing cluster such as Hadoop. It provides in-memory computation, which increases the speed of data processing compared with MapReduce. It runs on top of an existing Hadoop cluster and can access the Hadoop data store (HDFS), and it can also process structured data in Hive and streaming data from HDFS, Flume, Kafka and Twitter.
[Diagram: Spark architecture]

Is Apache Spark going to replace Hadoop?

Hadoop is a parallel data processing framework that has traditionally been used to run map/reduce jobs: long-running jobs that take minutes or hours to complete. Spark was designed to run on top of Hadoop as an alternative to the traditional batch map/reduce model, one that can be used for real-time stream processing and fast interactive queries that finish within seconds. So Hadoop supports both traditional map/reduce and Spark.
We should look at Hadoop as a general purpose Framework that supports multiple models and We should look at Spark as an alternative to Hadoop MapReduce rather than a replacement to Hadoop.
[Diagram: Hadoop ecosystem]

Hadoop MapReduce vs. Spark –Which One to Choose?

Because Spark keeps data in RAM rather than relying on network and disk I/O, it is relatively fast compared to Hadoop MapReduce. But since it uses a lot of RAM, it needs dedicated high-end physical machines to produce effective results.
Which one to choose depends on your workload, and the variables this decision depends on keep changing over time.


Difference between Hadoop Mapreduce and Apache Spark

Spark stores data in-memory whereas Hadoop stores data on disk. Hadoop uses replication to achieve fault tolerance, whereas Spark uses a different data storage model, resilient distributed datasets (RDDs), with a clever way of guaranteeing fault tolerance that minimizes network I/O. For details see UC Berkeley's paper, Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing.
From the Spark academic paper: "RDDs achieve fault tolerance through a notion of lineage: if a partition of an RDD is lost, the RDD has enough information to rebuild just that partition." This removes the need for replication to achieve fault tolerance.

Do I need to learn Hadoop first to learn Apache Spark?

No, you don't need to learn Hadoop to learn Spark. Spark started as an independent project, but after YARN and Hadoop 2.0 it became popular because it can run on top of HDFS alongside other Hadoop components. Spark has become another data processing engine in the Hadoop ecosystem, which is good for businesses and the community as it adds capability to the Hadoop stack.
For developers, there is almost no overlap between the two. Hadoop is a framework in which you write MapReduce jobs by inheriting from Java classes; Spark is a library that enables parallel computation via function calls.
For operators running a cluster, there is an overlap in general skills, such as monitoring, configuration, and code deployment.

Apache Spark's features

Let's go through some of Spark's features that really make it stand out in the Big Data world!
From http://spark.apache.org/:


i) Speed:

Spark enables applications in Hadoop clusters to run up to 100x faster in memory, and up to 10x faster even when running on disk. It makes this possible by reducing the number of reads and writes to disk: intermediate processing data is stored in memory using the concept of a Resilient Distributed Dataset (RDD), which lets Spark transparently keep data in memory and persist it to disk only when needed. This avoids most of the disk reads and writes, the main time-consuming factors in data processing.
[Chart: logistic regression running time in Hadoop vs. Spark]

ii) Ease of Use:

Spark lets you quickly write applications in Java, Scala, or Python, so developers can create and run applications in a familiar programming language and easily build parallel apps. It comes with a built-in set of over 80 high-level operators, and you can also use it interactively to query data from the shell.
Word count in Spark's Python API
datafile = sc.textFile("hdfs://...")          # sc is the SparkContext (e.g. in the pyspark shell)
counts = (datafile.flatMap(lambda line: line.split())
                  .map(lambda word: (word, 1))
                  .reduceByKey(lambda x, y: x + y))

iii) Combines SQL, streaming, and complex analytics.

In addition to simple “map” and “reduce” operations, Spark supports SQL queries, streaming data, and complex analytics such as machine learning and graph algorithms out-of-the-box. Not only that, users can combine all these capabilities seamlessly in a single workflow.

iv) Runs Everywhere

Spark runs on Hadoop, Mesos, standalone, or in the cloud. It can access diverse data sources including HDFS, Cassandra, HBase, S3.

Spark’s major use cases over Hadoop

  • Iterative Algorithms in Machine Learning
  • Interactive Data Mining and Data Processing
  • Spark is a fully Apache Hive-compatible data warehousing system that can run 100x faster than Hive.
  • Stream processing: Log processing and Fraud detection in live streams for alerts, aggregates and analysis
  • Sensor data processing: where data is fetched and joined from multiple sources; in-memory datasets are really helpful here because they are easy and fast to process.
Note : Spark is still working out bugs as it matures.

Your Turn: Go Get Started

It is easy to get started writing powerful Big Data applications with Spark. Your existing Hadoop and/or programming skills will have you productively interacting with your data in minutes. Go get started today!