Sunday, 14 February 2016

Hive tables stored as TEXTFILE vs. PARQUET

Here is a very quick way for you to get some hands-on experience with the differences between TEXTFILE and PARQUET storage, and between Hive and Impala.  You can do this on your own cluster or use Cloudera’s Quick Start VM.  Our steps were done using a three-node CDH 5.2.0 cluster, which has Hive 0.13.1 and Impala 2.0.1.


First, get some data.  In this example, we grabbed temperature data from the US government.  We took the ‘hourly_TEMP_2014.zip’ data, which, once uncompressed, is around 1 GB.  Not large by any means, but enough to use in this example.

#unzip, efficiently remove the header row from the file, and add it to hdfs
unzip hourly_TEMP_2014.zip
sed -i '1d' hourly_TEMP_2014.csv   # one way to drop the CSV header line in place
hadoop fs -copyFromLocal hourly_TEMP_2014.csv /tmp
Next, log into Hive (via beeline or Hue), create tables, and load some data.  We’re creating a TEXTFILE table and a PARQUET table in this example.  Parquet is a columnar storage format that gives us advantages for storing and scanning data. Storing the data column-wise allows for better compression, which gives us faster scans while using less storage. It’s also helpful for “wide” tables and for things like column-level aggregations, e.g. avg(degrees).

#hive (via the beeline shell or Hue)

create table temps_txt (statecode string, countycode string, sitenum string, paramcode string, poc string, latitude string, longitude string, datum string, param string, datelocal string, timelocal string, dategmt string, timegmt string, degrees double, uom string, mdl string, uncert string, qual string, method string, methodname string, state string, county string, dateoflastchange string) row format delimited fields terminated by ',';

load data inpath '/tmp/hourly_TEMP_2014.csv' into table temps_txt;

create table temps_par (statecode string, countycode string, sitenum string, paramcode string, poc string, latitude string, longitude string, datum string, param string, datelocal string, timelocal string, dategmt string, timegmt string, degrees double, uom string, mdl string, uncert string, qual string, method string, methodname string, state string, county string, dateoflastchange string) stored as parquet;

insert into table temps_par select * from temps_txt;
Now that we have some data, let’s do some analysis.  In these examples, we use Hive to query the TEXTFILE and PARQUET tables.  Your results will vary, but for the statements in this example the PARQUET queries should be faster because of Parquet’s columnar storage approach.

#hive (via the beeline shell or Hue)
select avg(degrees) from temps_txt;
select avg(degrees) from temps_par;
Now, let’s launch the impala-shell and issue the same queries.  Impala uses the Hive metastore, so it can query the same tables that you created in Hive.

#impala-shell
invalidate metadata;  -- refresh Impala's view of the Hive metastore
select avg(degrees) from temps_txt;
select avg(degrees) from temps_par;
Again, these are very small data sets for Hadoop, but they give a simple example of how to get up and running so you can see the differences between storage formats (TEXTFILE vs. PARQUET) and query engine behavior (Hive vs. Impala).


Saturday, 13 February 2016

MapReduce vs. Spark


Processing
  MapReduce: Slow for processing real-time data; jobs run in Java; supports only batch processing.
  Spark: Fast processing; jobs run in Scala; works for both batch processing and streaming.

Speed
  MapReduce: Good.
  Spark: Up to 100x faster with in-memory computation, and around 10x faster on disk.

Ease of Use
  MapReduce: Jobs can be written in Java, Pig or Hive.
  Spark: Jobs can be written in Java, Scala or Python.

Sophisticated Analysis
  MapReduce: Batch processing and SQL.
  Spark: Supports SQL queries, streaming data, and complex analytics.

Real-Time Analysis
  MapReduce: Widely criticized as a bottleneck in Hadoop clusters because it executes jobs in batch mode, so real-time analysis of data is not possible.
  Spark: Executes jobs in short bursts of micro-batches that are five seconds or less apart.

ML and Graph Processing
  MapReduce: No.
  Spark: Handles computationally in-depth jobs involving machine learning and graph processing.

Spark’s Awesome Features:
§  Hadoop Integration – Spark can work with files stored in HDFS.
§  Spark’s Interactive Shell – Spark is written in Scala, and has its own version of the Scala interpreter.
§  Spark’s Analytic Suite – Spark comes with tools for interactive query analysis, large-scale graph processing and analysis and real-time analysis.
§  Resilient Distributed Datasets (RDDs) – RDDs are distributed objects that can be cached in memory across a cluster of compute nodes. They are the primary data objects used in Spark.
§  Distributed Operators – Besides map and reduce, there are many other operators you can use on RDDs, such as filter and join (see the sketch below).
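To make the RDD and operator points concrete, here is a minimal, hypothetical PySpark sketch (the HDFS paths, file layouts and application name are made up purely for illustration) showing a few distributed operators beyond plain map and reduce:

# hypothetical sketch: distributed operators on RDDs (Spark 1.x-style RDD API)
from pyspark import SparkContext

sc = SparkContext(appName="rdd-operators-sketch")

# RDDs are distributed collections; textFile() partitions the lines across the cluster
events = sc.textFile("hdfs:///tmp/events.txt")   # assumed format: "user_id action"
users = sc.textFile("hdfs:///tmp/users.txt")     # assumed format: "user_id country"

# filter, map and reduceByKey: count click events per user
clicks = (events.map(lambda line: line.split())
                .filter(lambda rec: rec[1] == "click")
                .map(lambda rec: (rec[0], 1))
                .reduceByKey(lambda a, b: a + b))

# join pairs the two RDDs on user_id: (user_id, (click_count, country))
countries = users.map(lambda line: tuple(line.split()))
clicks_by_country = (clicks.join(countries)
                           .map(lambda kv: (kv[1][1], kv[1][0]))
                           .reduceByKey(lambda a, b: a + b))

print(clicks_by_country.take(10))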
Advantages of Using Apache Spark with Hadoop:
  • Spark is not tied to the two-stage MapReduce paradigm, and promises performance up to 100 times faster than Hadoop MapReduce for certain applications.
  • Well suited to machine learning algorithms – Spark provides primitives for in-memory cluster computing that allow user programs to load data into a cluster’s memory and query it repeatedly.
  • MLlib implements a slew of common machine learning algorithms, such as naïve Bayes classification or clustering (a minimal clustering sketch follows this list); Spark Streaming enables high-speed processing of data ingested from multiple sources; and GraphX allows for computations on graph data.
  • Apache Spark Compatibility with Hadoop [HDFS, HBASE and YARN] – Apache Spark is fully compatible with Hadoop’s Distributed File System (HDFS), as well as with other Hadoop components such as YARN (Yet Another Resource Negotiator) and the HBase distributed database.
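The MLlib point above is easy to try out. Below is a minimal, hypothetical sketch using MLlib’s RDD-based KMeans API from the Spark 1.x era (the points and application name are made up):

# hypothetical sketch: clustering a tiny in-memory dataset with MLlib's KMeans
from pyspark import SparkContext
from pyspark.mllib.clustering import KMeans

sc = SparkContext(appName="mllib-kmeans-sketch")

# a made-up dataset of 2-D points, parallelized across the cluster
points = sc.parallelize([
    [0.0, 0.0], [0.1, 0.2], [0.2, 0.1],   # points near the origin
    [9.0, 9.0], [9.1, 8.9], [8.8, 9.2],   # points near (9, 9)
])

# train a k-means model with two clusters over the in-memory RDD
model = KMeans.train(points, k=2, maxIterations=10)

print(model.clusterCenters)
print(model.predict([0.05, 0.1]))   # index of the nearest cluster centre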


Apache Spark's features
Let’s go through some of the features that really make Spark stand out in the Big Data world!
From http://spark.apache.org/:
i) Speed:
Spark enables applications in Hadoop clusters to run up to 100x faster in memory, and 10x faster even when running on disk. Spark makes this possible by reducing the number of reads and writes to disk: it stores intermediate processing data in memory. It uses the concept of a Resilient Distributed Dataset (RDD), which allows it to transparently keep data in memory and persist it to disk only when it’s needed. This helps to reduce most of the disk reads and writes, which are the main time-consuming factors in data processing.
[Chart: Logistic regression running times in Hadoop vs. Spark]
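To make the caching behaviour described above concrete, here is a minimal, hypothetical sketch (the HDFS path and the choice of storage level are just examples) of persisting an intermediate RDD so that repeated actions do not re-read the data from disk:

# hypothetical sketch: keeping an intermediate RDD in memory (Spark 1.x RDD API)
from pyspark import SparkContext, StorageLevel

sc = SparkContext(appName="caching-sketch")

# parse the raw file once into an intermediate RDD
readings = (sc.textFile("hdfs:///tmp/hourly_TEMP_2014.csv")
              .map(lambda line: line.split(",")))

# MEMORY_AND_DISK spills partitions to disk only when they do not fit in memory
readings.persist(StorageLevel.MEMORY_AND_DISK)

# the first action materialises and caches the RDD; later actions reuse the
# cached partitions instead of re-reading and re-parsing the file
print(readings.count())
print(readings.count())   # served from the cache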
ii) Ease of Use:
Spark lets you quickly write applications in Java, Scala, or Python. This lets developers create and run their applications in programming languages they are already familiar with, and makes it easy to build parallel apps. It comes with a built-in set of over 80 high-level operators. We can also use it interactively to query data from within the shell.
Word count in Spark's Python API
# "spark" here refers to the SparkContext
datafile = spark.textFile("hdfs://...")
counts = (datafile.flatMap(lambda line: line.split())
                  .map(lambda word: (word, 1))
                  .reduceByKey(lambda x, y: x + y))

iii) Combines SQL, streaming, and complex analytics.
In addition to simple “map” and “reduce” operations, Spark supports SQL queries, streaming data, and complex analytics such as machine learning and graph algorithms out-of-the-box. Not only that, users can combine all these capabilities seamlessly in a single workflow.
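As a small illustration of mixing these capabilities, here is a hypothetical sketch using the Spark 1.x SQLContext API (the table name, rows and application name are made up) that starts from an ordinary RDD and queries it with SQL in the same job:

# hypothetical sketch: combining RDDs and SQL in one workflow (Spark 1.x SQLContext API)
from pyspark import SparkContext
from pyspark.sql import SQLContext, Row

sc = SparkContext(appName="sql-mix-sketch")
sqlContext = SQLContext(sc)

# start from an ordinary RDD of rows...
rows = sc.parallelize([
    Row(state="AZ", degrees=88.0),
    Row(state="AZ", degrees=91.5),
    Row(state="MN", degrees=40.2),
])

# ...turn it into a DataFrame and register it so SQL can see it
df = sqlContext.createDataFrame(rows)
df.registerTempTable("temps")

# SQL queries, DataFrame operations and further RDD transformations mix freely
avg_by_state = sqlContext.sql(
    "SELECT state, AVG(degrees) AS avg_degrees FROM temps GROUP BY state")
print(avg_by_state.collect())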
iv) Runs Everywhere
Spark runs on Hadoop, Mesos, standalone, or in the cloud. It can access diverse data sources including HDFS, Cassandra, HBase, and S3.
Spark’s major use cases over Hadoop
·         Iterative Algorithms in Machine Learning
·         Interactive Data Mining and Data Processing
·         Data warehousing: Spark (via Spark SQL) is fully compatible with Apache Hive and can run queries up to 100x faster than Hive.
·         Stream processing: log processing and fraud detection in live streams for alerts, aggregates and analysis.
·         Sensor data processing: where data is fetched and joined from multiple sources, in-memory datasets are really helpful, as they are easy and fast to process.