Spark’s Awesome Features:
§ Hadoop Integration – Spark can work with files stored in HDFS.
§ Spark’s Interactive Shell – Spark is written in Scala, and has its own version of the Scala interpreter.
§ Spark’s Analytic Suite – Spark comes with tools for interactive query analysis, large-scale graph processing, and real-time analysis.
§ Resilient Distributed Datasets (RDDs) – RDDs are distributed objects that can be cached in memory across a cluster of compute nodes. They are the primary data objects used in Spark.
§ Distributed Operators – Besides map and reduce, there are many other operators one can use on RDDs, such as filter, join, and groupByKey (see the sketch below).
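To make the last point concrete, here is a minimal sketch of a few RDD operators beyond map and reduce. It assumes an existing SparkContext named sc (as provided by the Spark shell), and the tiny dataset is made up for illustration.

# A few RDD operators beyond map and reduce; `sc` is assumed to be an
# existing SparkContext, and the data below is invented for illustration.
pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])
other = sc.parallelize([("a", "x"), ("b", "y")])

big = pairs.filter(lambda kv: kv[1] > 1)   # keep pairs with value > 1
grouped = pairs.groupByKey()               # group values by key
joined = pairs.join(other)                 # join two RDDs on their keys
print(joined.collect())  # e.g. [('a', (1, 'x')), ('a', (3, 'x')), ('b', (2, 'y'))]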
Advantages of Using Apache Spark with Hadoop:
- Spark is not tied to the two-stage MapReduce paradigm, and promises performance up to 100 times faster than Hadoop MapReduce for certain applications.
- Well suited to machine learning algorithms – Spark provides primitives for in-memory cluster computing that allow user programs to load data into a cluster’s memory and query it repeatedly.
- MLlib implements a slew of common machine learning algorithms, such as naïve Bayes classification and clustering (see the sketch after this list); Spark Streaming enables high-speed processing of data ingested from multiple sources; and GraphX allows for computations on graph data.
- Apache Spark Compatibility with Hadoop [HDFS, HBase and YARN] – Apache Spark is fully compatible with Hadoop’s Distributed File System (HDFS), as well as with other Hadoop components such as YARN (Yet Another Resource Negotiator) and the HBase distributed database.
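As a quick illustration of the MLlib point, here is a minimal sketch of training MLlib’s naïve Bayes classifier with the RDD-based API. It assumes an existing SparkContext named sc, and the two-point dataset is invented purely for illustration.

# A minimal naive Bayes sketch; `sc` is an assumed SparkContext and the
# labeled points below are made-up toy data.
from pyspark.mllib.classification import NaiveBayes
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.regression import LabeledPoint

data = sc.parallelize([
    LabeledPoint(0.0, Vectors.dense([1.0, 0.0])),
    LabeledPoint(1.0, Vectors.dense([0.0, 1.0])),
])
model = NaiveBayes.train(data)                    # fit the model in memory
print(model.predict(Vectors.dense([0.0, 1.0])))   # predicts label 1.0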
Apache Spark's features
Let’s go through some of Spark’s features that really make it stand out in the Big Data world!
From http://spark.apache.org/:
i) Speed:
Spark enables applications in Hadoop clusters to run up to 100x faster in memory, and 10x faster even when running on disk. Spark makes this possible by reducing the number of reads and writes to disk: it stores intermediate processing data in memory. It uses the concept of a Resilient Distributed Dataset (RDD), which allows it to transparently keep data in memory and persist it to disk only when it’s needed. This eliminates most of the disk reads and writes, which are the main time-consuming factors in data processing.
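Here is a minimal sketch of that caching behavior, assuming an existing SparkContext named sc and a hypothetical "events.log" input file: the filtered RDD is computed once, cached, and then queried twice without re-reading from disk.

# Cache an RDD in memory and query it repeatedly; `sc` and "events.log"
# are assumptions for this sketch.
lines = sc.textFile("events.log")
errors = lines.filter(lambda line: "ERROR" in line).cache()  # keep in memory

# Both queries below reuse the cached RDD instead of re-reading the file.
print(errors.count())
print(errors.filter(lambda line: "timeout" in line).count())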
ii) Ease of Use:
Spark lets you quickly write applications in Java, Scala, or Python. This lets developers create and run their applications in programming languages they are already familiar with, and makes it easy to build parallel apps. It comes with a built-in set of over 80 high-level operators. We can use it interactively to query data within the shell too.
Word count in Spark's Python API
datafile = sc.textFile("hdfs://...")   # sc is the SparkContext provided by the shell
counts = (datafile.flatMap(lambda line: line.split())
                  .map(lambda word: (word, 1))
                  .reduceByKey(lambda x, y: x + y))
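To materialize the result, you can for example pull a few counts back to the driver (the output shown is purely illustrative):

print(counts.take(5))   # e.g. [('the', 102), ('and', 57), ...]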
iii) Combines SQL, streaming, and complex analytics:
In addition to simple “map” and “reduce” operations, Spark supports SQL queries, streaming data, and complex analytics such as machine learning and graph algorithms out of the box. Not only that, users can combine all these capabilities seamlessly in a single workflow.
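Here is a minimal sketch of that kind of combined workflow, using the newer DataFrame API: an SQL query feeds straight into an MLlib clustering algorithm in the same program. The file "people.json" and its age/income columns are assumptions for this sketch.

# Mix SQL and machine learning in one workflow; "people.json" and its
# columns are hypothetical.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.appName("combined-analytics").getOrCreate()

people = spark.read.json("people.json")
people.createOrReplaceTempView("people")

# An SQL query...
adults = spark.sql("SELECT age, income FROM people WHERE age >= 18")

# ...feeds directly into an MLlib algorithm in the same program.
features = VectorAssembler(inputCols=["age", "income"],
                           outputCol="features").transform(adults)
model = KMeans(k=2, seed=1).fit(features)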
iv) Runs Everywhere
Spark runs on Hadoop, Mesos, standalone, or in the cloud. It can access diverse data sources including HDFS, Cassandra, HBase, and S3.
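One way to see this in practice: the same application code can target different cluster managers just by changing the master URL. A minimal sketch (the app name and master values are illustrative):

# Point the same code at different cluster managers via the master URL.
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("runs-everywhere")
        .setMaster("local[4]"))  # or "yarn", "mesos://host:5050", "spark://host:7077"
sc = SparkContext(conf=conf)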
Spark’s major use cases over Hadoop
· Iterative Algorithms in Machine Learning
· Interactive Data Mining and Data Processing
· Spark is a fully Apache Hive-compatible data warehousing system that can run 100x faster than Hive.
· Stream processing: Log processing and fraud detection in live streams for alerts, aggregates and analysis (see the sketch after this list).
· Sensor data processing: Where data is fetched and joined from multiple sources, in-memory datasets are really helpful, as they are easy and fast to process.
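As a quick illustration of the stream-processing use case, here is a minimal sketch using Spark Streaming’s DStream API to count error lines in a live log stream. The host, port, and 10-second batch interval are assumptions for this sketch.

# Count "ERROR" lines per 10-second micro-batch from a live socket stream;
# host/port and batch size are illustrative.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="log-alerts")
ssc = StreamingContext(sc, 10)                  # 10-second micro-batches

logs = ssc.socketTextStream("localhost", 9999)  # live log stream
errors = logs.filter(lambda line: "ERROR" in line)
errors.count().pprint()                         # print a count per batch

ssc.start()
ssc.awaitTermination()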