Spark’s Awesome Features:
§ Hadoop Integration – Spark can work with files stored in HDFS.
§ Spark’s Interactive Shell – Spark is written in Scala, and has its own version of the Scala interpreter.
§ Spark’s Analytic Suite – Spark comes with tools for interactive query analysis, large-scale graph processing, and real-time analysis.
§ Resilient Distributed Datasets (RDDs) – RDDs are distributed objects that can be cached in memory across a cluster of compute nodes. They are the primary data objects used in Spark.
§ Distributed Operators – Besides map and reduce, there are many other operators one can use on RDDs, such as filter, join, and groupByKey (see the sketch below).
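To make the last point concrete, here is a minimal sketch of a few RDD operators beyond map and reduce. It assumes an existing SparkContext named sc (as provided by the Spark shell), and the tiny dataset is made up for illustration.

# A few RDD operators beyond map and reduce; `sc` is assumed to be an
# existing SparkContext, and the data below is invented for illustration.
pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])
other = sc.parallelize([("a", "x"), ("b", "y")])

big = pairs.filter(lambda kv: kv[1] > 1)   # keep pairs with value > 1
grouped = pairs.groupByKey()               # group values by key
joined = pairs.join(other)                 # join two RDDs on their keys
print(joined.collect())  # e.g. [('a', (1, 'x')), ('a', (3, 'x')), ('b', (2, 'y'))]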
Advantages of Using Apache Spark with Hadoop:
- Spark is not tied to the two-stage MapReduce paradigm, and promises performance up to 100 times faster than Hadoop MapReduce for certain applications.
- Well suited to machine learning algorithms – Spark provides primitives for in-memory cluster computing that allow user programs to load data into a cluster’s memory and query it repeatedly.
- MLlib implements a slew of common machine learning algorithms, such as naïve Bayes classification and clustering (see the sketch after this list); Spark Streaming enables high-speed processing of data ingested from multiple sources; and GraphX allows for computations on graph data.
- Apache Spark Compatibility with Hadoop [HDFS, HBase and YARN] – Apache Spark is fully compatible with Hadoop’s Distributed File System (HDFS), as well as with other Hadoop components such as YARN (Yet Another Resource Negotiator) and the HBase distributed database.
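As a quick illustration of the MLlib point, here is a minimal sketch of training MLlib’s naïve Bayes classifier with the RDD-based API. It assumes an existing SparkContext named sc, and the two-point dataset is invented purely for illustration.

# A minimal naive Bayes sketch; `sc` is an assumed SparkContext and the
# labeled points below are made-up toy data.
from pyspark.mllib.classification import NaiveBayes
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.regression import LabeledPoint

data = sc.parallelize([
    LabeledPoint(0.0, Vectors.dense([1.0, 0.0])),
    LabeledPoint(1.0, Vectors.dense([0.0, 1.0])),
])
model = NaiveBayes.train(data)                    # fit the model in memory
print(model.predict(Vectors.dense([0.0, 1.0])))   # predicts label 1.0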
Apache Spark's features
Let’s go through some of Spark’s features that really make it stand out in the Big Data world!
From http://spark.apache.org/:
i) Speed:
Spark enables applications in Hadoop clusters to run up to 100x faster in memory, and 10x faster even when running on disk. Spark makes this possible by reducing the number of reads and writes to disk: it stores intermediate processing data in memory. It uses the concept of a Resilient Distributed Dataset (RDD), which allows it to transparently keep data in memory and persist it to disk only when it’s needed. This eliminates most of the disk reads and writes, which are the main time-consuming factors in data processing.
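Here is a minimal sketch of that caching behavior, assuming an existing SparkContext named sc and a hypothetical "events.log" input file: the filtered RDD is computed once, cached, and then queried twice without re-reading from disk.

# Cache an RDD in memory and query it repeatedly; `sc` and "events.log"
# are assumptions for this sketch.
lines = sc.textFile("events.log")
errors = lines.filter(lambda line: "ERROR" in line).cache()  # keep in memory

# Both queries below reuse the cached RDD instead of re-reading the file.
print(errors.count())
print(errors.filter(lambda line: "timeout" in line).count())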
ii) Ease of Use:
Spark lets you quickly write applications in Java, Scala, or Python. This lets developers create and run their applications in programming languages they are already familiar with, and makes it easy to build parallel apps. It comes with a built-in set of over 80 high-level operators. We can use it interactively to query data within the shell too.
Word count in Spark's Python API
datafile = sc.textFile("hdfs://...")   # sc is the SparkContext provided by the shell
counts = (datafile.flatMap(lambda line: line.split())
                  .map(lambda word: (word, 1))
                  .reduceByKey(lambda x, y: x + y))
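To materialize the result, you can for example pull a few counts back to the driver (the output shown is purely illustrative):

print(counts.take(5))   # e.g. [('the', 102), ('and', 57), ...]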
iii) Combines SQL, streaming, and complex analytics:
In addition to simple “map” and “reduce” operations, Spark supports SQL queries, streaming data, and complex analytics such as machine learning and graph algorithms out of the box. Not only that, users can combine all these capabilities seamlessly in a single workflow.
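Here is a minimal sketch of that kind of combined workflow, using the newer DataFrame API: an SQL query feeds straight into an MLlib clustering algorithm in the same program. The file "people.json" and its age/income columns are assumptions for this sketch.

# Mix SQL and machine learning in one workflow; "people.json" and its
# columns are hypothetical.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.appName("combined-analytics").getOrCreate()

people = spark.read.json("people.json")
people.createOrReplaceTempView("people")

# An SQL query...
adults = spark.sql("SELECT age, income FROM people WHERE age >= 18")

# ...feeds directly into an MLlib algorithm in the same program.
features = VectorAssembler(inputCols=["age", "income"],
                           outputCol="features").transform(adults)
model = KMeans(k=2, seed=1).fit(features)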
iv) Runs Everywhere
Spark runs on Hadoop, Mesos, standalone, or in the cloud. It can access diverse data sources including HDFS, Cassandra, HBase, and S3.
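One way to see this in practice: the same application code can target different cluster managers just by changing the master URL. A minimal sketch (the app name and master values are illustrative):

# Point the same code at different cluster managers via the master URL.
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("runs-everywhere")
        .setMaster("local[4]"))  # or "yarn", "mesos://host:5050", "spark://host:7077"
sc = SparkContext(conf=conf)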
Spark’s major use cases over Hadoop
· Iterative Algorithms in Machine Learning
· Interactive Data Mining and Data Processing
· Spark is a fully Apache Hive-compatible data warehousing system that can run 100x faster than Hive.
· Stream processing: Log processing and fraud detection in live streams for alerts, aggregates and analysis (see the sketch after this list).
· Sensor data processing: Where data is fetched and joined from multiple sources, in-memory datasets are really helpful, as they are easy and fast to process.
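As a quick illustration of the stream-processing use case, here is a minimal sketch using Spark Streaming’s DStream API to count error lines in a live log stream. The host, port, and 10-second batch interval are assumptions for this sketch.

# Count "ERROR" lines per 10-second micro-batch from a live socket stream;
# host/port and batch size are illustrative.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="log-alerts")
ssc = StreamingContext(sc, 10)                  # 10-second micro-batches

logs = ssc.socketTextStream("localhost", 9999)  # live log stream
errors = logs.filter(lambda line: "ERROR" in line)
errors.count().pprint()                         # print a count per batch

ssc.start()
ssc.awaitTermination()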