Sunday, 14 February 2016

Hive tables stored as TEXTFILE vs. PARQUET

Here is a very quick way to get some hands-on experience with the differences between TEXTFILE and PARQUET, along with Hive and Impala.  You can do this on your own cluster or use Cloudera’s Quick Start VM.  Our steps were done using a three-node CDH 5.2.0 cluster, which has Hive 0.13.1 and Impala 2.0.1.


First, get some data.  In this example, we grabbed temperature data from the US government.  We took the ‘hourly_TEMP_2014.zip’ data, which, once uncompressed, is around 1 GB.  Not large by any means, but enough to use in this example.

#unzip, remove the header row from the file, and add it to hdfs.
unzip hourly_TEMP_2014.zip
sed -i '1d' hourly_TEMP_2014.csv   # one simple way to strip the CSV header line
hadoop fs -copyFromLocal hourly_TEMP_2014.csv /tmp
Next, log into Hive (via beeline or Hue), create the tables, and load some data.  We’re creating a TEXTFILE table and a PARQUET table in this example.  PARQUET is a columnar store that gives us advantages for storing and scanning data.  Storing the data column-wise allows for better compression, which gives us faster scans while using less storage.  It’s also helpful for “wide” tables and for things like column-level aggregations, e.g. avg(degrees).

#hive (via the beeline shell or Hue)

create table temps_txt (statecode string, countycode string, sitenum string, paramcode string, poc string, latitude string, longitude string, datum string, param string, datelocal string, timelocal string, dategmt string, timegmt string, degrees double, uom string, mdl string, uncert string, qual string, method string, methodname string, state string, county string, dateoflastchange string) row format delimited fields terminated by ',';

load data inpath '/tmp/hourly_TEMP_2014.csv' into table temps_txt;

create table temps_par (statecode string, countycode string, sitenum string, paramcode string, poc string, latitude string, longitude string, datum string, param string, datelocal string, timelocal string, dategmt string, timegmt string, degrees double, uom string, mdl string, uncert string, qual string, method string, methodname string, state string, county string, dateoflastchange string) stored as parquet;

insert into table temps_par select * from temps_txt;
Now that we have some data, let’s do some analysis.  In these examples, we use Hive to query the TEXTFILE and PARQUET tables.  Your results will vary, but with the statements in this example the PARQUET queries should be faster because of Parquet’s columnar storage approach.

#hive (via the beeline shell or Hue)
select avg(degrees) from temps_txt;
select avg(degrees) from temps_par;
Now, let’s launch the impala-shell and issue the same commands.  Impala uses the Hive metastore, so it can query the same tables you created in Hive.

#impala-shell
invalidate metadata;  -- pick up the table metadata created in Hive
select avg(degrees) from temps_txt;
select avg(degrees) from temps_par;
Again, these are very small data sets for Hadoop, but they give a simple example of how to get up and running so you can see the differences between storage formats (TEXTFILE vs. PARQUET) and query engines (Hive vs. Impala).


Saturday, 13 February 2016

MapReduce vs. Spark


Processing
MapReduce: Slow for processing real-time data.  Jobs run in Java.  Batch processing only.
Spark: Fast processing.  Jobs run in Scala.  Works for both batch processing and streaming.

Speed
MapReduce: Good.
Spark: Up to 100x faster for in-memory computation and up to 10x faster for on-disk computation.

Ease of Use
MapReduce: Jobs can be written in Java, Pig, and Hive.
Spark: Jobs can be written in Java, Scala, and Python.

Sophisticated analysis
MapReduce: Batch processing and SQL (via Hive).
Spark: Supports SQL queries, streaming data, and complex analytics.

Real-time analysis
MapReduce: Widely criticized as a bottleneck in Hadoop clusters because it executes jobs in batch mode, which means that real-time analysis of data is not possible.
Spark: Executes jobs in short bursts of micro-batches that are five seconds or less apart.

ML and graph processing
MapReduce: Not well suited.
Spark: Handles computationally in-depth jobs involving machine learning and graph processing.

Spark’s Awesome Features:
§  Hadoop Integration – Spark can work with files stored in HDFS.
§  Spark’s Interactive Shell – Spark is written in Scala, and has its own version of the Scala interpreter.
§  Spark’s Analytic Suite – Spark comes with tools for interactive query analysis, large-scale graph processing and analysis and real-time analysis.
§  Resilient Distributed Datasets (RDDs) – RDDs are distributed objects that can be cached in memory across a cluster of compute nodes. They are the primary data objects used in Spark.
§  Distributed Operators – Besides MapReduce, there are many other operators one can use on RDDs (a short sketch follows this list).
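To make the RDD idea a bit more concrete, here is a minimal PySpark sketch of caching an RDD and applying a few distributed operators; the HDFS path is a placeholder and the whole snippet is an illustration, not code from the original post.

from pyspark import SparkContext

sc = SparkContext(appName="rdd-sketch")

# Build an RDD from a text file on HDFS (placeholder path)
records = sc.textFile("hdfs://...")

# Cache the parsed rows in memory across the cluster of compute nodes
parsed = records.map(lambda line: line.split(",")).cache()

# A few distributed operators beyond plain map/reduce
print(parsed.count())
print(parsed.filter(lambda row: len(row) > 1).count())
print(parsed.map(lambda row: row[0]).distinct().count())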
Advantages of Using Apache Spark with Hadoop:
  • Spark is not tied to the two-stage MapReduce paradigm, and promises performance up to 100 times faster than Hadoop MapReduce for certain applications.
  • Well suited to machine learning algorithms – Spark provides primitives for in-memory cluster computing that allow user programs to load data into a cluster’s memory and query it repeatedly (see the sketch after this list).
  • MLlib implements a slew of common machine learning algorithms, such as naïve Bayesian classification or clustering; Spark Streaming enables high-speed processing of data ingested from multiple sources; and GraphX allows for computations on graph data.
  • Apache Spark Compatibility with Hadoop [HDFS, HBASE and YARN] – Apache Spark is fully compatible with Hadoop’s Distributed File System (HDFS), as well as with other Hadoop components such as YARN (Yet Another Resource Negotiator) and the HBase distributed database.
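As a hedged illustration of that iterative, in-memory style, here is a small MLlib sketch; the HDFS path, feature layout, and parameter values are assumptions made for the example.

from pyspark import SparkContext
from pyspark.mllib.clustering import KMeans

sc = SparkContext(appName="mllib-kmeans-sketch")

# Load numeric feature vectors from a CSV on HDFS (placeholder path and columns)
points = (sc.textFile("hdfs://...")
            .map(lambda line: [float(x) for x in line.split(",")])
            .cache())  # keep the training data in memory across iterations

# KMeans passes over the cached RDD many times without re-reading from disk
model = KMeans.train(points, k=3, maxIterations=10)
print(model.clusterCenters)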


Apache Spark's features
Let’s go through some of Spark’s features that really make it stand out in the Big Data world!
From http://spark.apache.org/:
i) Speed:
Spark enables applications in Hadoop clusters to run up to 100x faster in memory, and 10x faster even when running on disk. Spark makes this possible by reducing the number of reads and writes to disk. It stores the intermediate processing data in memory. It uses the concept of a Resilient Distributed Dataset (RDD), which allows it to transparently store data in memory and persist it to disk only when it’s needed. This eliminates most of the disk reads and writes – the main time-consuming factors – of data processing.
(Figure: Logistic regression in Hadoop and Spark)
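A minimal sketch of that persistence behavior, assuming a placeholder HDFS path and an arbitrary ERROR filter chosen just for illustration:

from pyspark import SparkContext, StorageLevel

sc = SparkContext(appName="persist-sketch")

logs = sc.textFile("hdfs://...")  # placeholder path

# Keep the intermediate RDD in memory, spilling to disk only when it does not fit
errors = logs.filter(lambda line: "ERROR" in line).persist(StorageLevel.MEMORY_AND_DISK)

# Both actions reuse the persisted data instead of re-reading the source from disk
print(errors.count())
print(errors.take(5))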
ii) Ease of Use:
Spark lets you quickly write applications in Java, Scala, or Python. This lets developers create and run applications in the programming languages they already know and makes it easy to build parallel apps. It comes with a built-in set of over 80 high-level operators. We can also use it interactively to query data from within the shell.
Word count in Spark's Python API
# 'spark' is assumed to be the SparkContext
datafile = spark.textFile("hdfs://...")
counts = (datafile.flatMap(lambda line: line.split())
                  .map(lambda word: (word, 1))
                  .reduceByKey(lambda x, y: x + y))

iii) Combines SQL, streaming, and complex analytics.
In addition to simple “map” and “reduce” operations, Spark supports SQL queries, streaming data, and complex analytics such as machine learning and graph algorithms out-of-the-box. Not only that, users can combine all these capabilities seamlessly in a single workflow.
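A hedged sketch of what combining these capabilities in one program can look like, using a SQLContext; the file path and column layout are illustrative assumptions, not something from the original post.

from pyspark import SparkContext
from pyspark.sql import SQLContext, Row

sc = SparkContext(appName="combined-workflow-sketch")
sqlContext = SQLContext(sc)

# Start from a plain RDD (placeholder path and columns)
rows = (sc.textFile("hdfs://...")
          .map(lambda line: line.split(","))
          .map(lambda f: Row(state=f[0], degrees=float(f[1]))))

# Switch to SQL on the same data without leaving the application
df = sqlContext.createDataFrame(rows)
df.registerTempTable("temps")
avg_by_state = sqlContext.sql("SELECT state, AVG(degrees) AS avg_deg FROM temps GROUP BY state")

# Hand the SQL result back to the RDD side of the same job
print(avg_by_state.rdd.map(lambda r: (r.state, r.avg_deg)).take(5))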
iv) Runs Everywhere
Spark runs on Hadoop, Mesos, standalone, or in the cloud. It can access diverse data sources including HDFS, Cassandra, HBase, and S3.
Spark’s major use cases over Hadoop
·         Iterative Algorithms in Machine Learning
·         Interactive Data Mining and Data Processing
·         Spark is a fully Apache Hive-compatible data warehousing system that can run 100x faster than Hive.
·         Stream processing: log processing and fraud detection in live streams for alerts, aggregates, and analysis (a small streaming sketch follows this list)
·         Sensor data processing: where data is fetched and joined from multiple sources, in-memory datasets are really helpful because they are easy and fast to process.
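As a rough sketch of the stream-processing use case above, here is a small Spark Streaming example; the socket source, port, and five-second batch interval are assumptions chosen to keep it self-contained.

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="log-stream-sketch")
ssc = StreamingContext(sc, 5)  # process incoming data in 5-second micro-batches

# Read a live text stream (a socket source keeps the example simple)
lines = ssc.socketTextStream("localhost", 9999)

# Count ERROR lines per batch -- the kind of live aggregate used for alerting
error_counts = lines.filter(lambda line: "ERROR" in line).count()
error_counts.pprint()

ssc.start()
ssc.awaitTermination()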

Wednesday, 23 September 2015

Amazon Web Services (AWS) Command Line Interface (CLI)

In this post I am explaining the AWS CLI. Here we will see how to access an Amazon account using the CLI.

            We can access an AWS account from a Windows machine or a Linux machine.

Windows
To access an AWS account from Windows, we should install the AWS CLI.


For that, we can download the installer from https://aws.amazon.com/cli/ or use https://s3.amazonaws.com/aws-cli/AWSCLI64.msi directly.

Running this .msi file installs the AWS CLI on our local Windows machine.

Then open command prompt:

Type "aws configure"

It will then ask for the following configuration values:

AWS Access Key ID [None]: xxxxxxxxxxxx
AWS Secret Access Key [None]: ###########################
Default region name [None]: @@@@@
Default output format [None]: (press ENTER to accept the default)

Finally, the AWS CLI is installed and configured on your Windows machine.
For testing you can use “aws s3 ls”. It will list all of your S3 buckets.

AWS EC2 machine:

On an AWS EC2 machine, the CLI comes preinstalled at /usr/bin/aws by default.

So there is no need to install it again on that machine; you can use it directly from anywhere.

Here I explain some commands we can use to work faster.

“aws s3 ls”
It will give us a list of all buckets.
“aws s3 ls s3://somutest/”
It will give us a list of all documents and folders in this bucket.
“aws s3 cp . s3://somutest/ --recursive”
It will copy all documents from the current local folder (the dot represents the current location) to S3. To copy only one file, mention the file name, like “aws s3 cp abc.txt s3://somutest/”.
Below are some more commands which are commonly used in real life.
aws s3 cp s3://somutest .

$ aws s3 sync . s3://my-bucket/MyFolder --acl public-read

Creating Buckets
$ aws s3 mb s3://bucket-name

Removing Buckets
$ aws s3 rb s3://bucket-name
$ aws s3 rb s3://bucket-name --force (Non empty bucket)

Listing Buckets
$ aws s3 ls
$ aws s3 ls s3://bucket-name/MyFolder

Managing objects
aws s3 mv s3://mybucket . --recursive
$ aws s3 cp file.txt s3://bucket-name/ --grants read=uri=http://acs.amazonaws.com/groups/global/AllUsers full=emailaddress=user@example.com
$ aws s3 sync <source> <target> [--options]
aws s3 sync . s3://somutest (it will upload all documents from the current local folder to the S3 bucket)
// Delete local file
$ rm ./MyFile1.txt

// Attempt sync without --delete option - nothing happens
$ aws s3 sync . s3://my-bucket/MyFolder

// Sync with deletion - object is deleted from bucket
$ aws s3 sync . s3://my-bucket/MyFolder --delete
delete: s3://my-bucket/MyFolder/MyFile1.txt

// Delete object from bucket
$ aws s3 rm s3://my-bucket/MyFolder/MySubdirectory/MyFile3.txt
delete: s3://my-bucket/MyFolder/MySubdirectory/MyFile3.txt

// Sync with deletion - local file is deleted
$ aws s3 sync s3://my-bucket/MyFolder . --delete
delete: MySubdirectory\MyFile3.txt
// Local directory contains 3 files: MyFile1.txt, MyFile2.rtf, MyFile88.txt

// Exclude all .txt files - only the .rtf file is uploaded
$ aws s3 sync . s3://my-bucket/MyFolder --exclude '*.txt'
upload: MyFile2.rtf to s3://my-bucket/MyFolder/MyFile2.rtf

// Re-include files matching MyFile*.txt - all three files are uploaded
$ aws s3 sync . s3://my-bucket/MyFolder --exclude '*.txt' --include 'MyFile*.txt'
upload: MyFile1.txt to s3://my-bucket/MyFolder/MyFile1.txt
upload: MyFile88.txt to s3://my-bucket/MyFolder/MyFile88.txt
upload: MyFile2.rtf to s3://my-bucket/MyFolder/MyFile2.rtf

// Filters are applied in order - the final --exclude 'MyFile?.txt' drops MyFile1.txt again
$ aws s3 sync . s3://my-bucket/MyFolder --exclude '*.txt' --include 'MyFile*.txt' --exclude 'MyFile?.txt'
upload: MyFile2.rtf to s3://my-bucket/MyFolder/MyFile2.rtf
upload: MyFile88.txt to s3://my-bucket/MyFolder/MyFile88.txt
// Copy MyFile.txt in current directory to s3://my-bucket/MyFolder
$ aws s3 cp MyFile.txt s3://my-bucket/MyFolder/

// Move all .jpg files in s3://my-bucket/MyFolder to ./MyDirectory
$ aws s3 mv s3://my-bucket/MyFolder ./MyDirectory --exclude '*' --include '*.jpg' --recursive

// List the contents of my-bucket
$ aws s3 ls s3://my-bucket

// List the contents of MyFolder in my-bucket
$ aws s3 ls s3://my-bucket/MyFolder

// Delete s3://my-bucket/MyFolder/MyFile.txt
$ aws s3 rm s3://my-bucket/MyFolder/MyFile.txt

// Delete s3://my-bucket/MyFolder and all of its contents
$ aws s3 rm s3://my-bucket/MyFolder --recursive



AWS EMR Cluster

aws emr create-cluster \
  --ami-version 3.8.0 \
  --instance-type m1.xlarge \
  --instance-count 1 \
  --name "cascading-kinesis-example" \
  --visible-to-all-users \
  --enable-debugging \
  --auto-terminate \
  --no-termination-protected \
  --log-uri s3n://quanttestbucket/logs/ \
  --service-role EMR_DefaultRole \
  --ec2-attributes InstanceProfile=EMR_EC2_DefaultRole

Indexing Common Crawl Metadata on Amazon EMR Using Cascading and Elasticsearch

This post shows you how to build a simple application with Cascading to read Common Crawl metadata, index the metadata in Elasticsearch, and query the indexed content with Kibana.

What is Common Crawl?

Common Crawl is an open-source repository of web crawl data. This data set is freely available on Amazon S3 under the Common Crawl terms of use. The data is stored in several data formats. In this example, you work with the WAT response format that contains the metadata for the crawled HTML information. This allows you to build an Elasticsearch index, which can be used to extract useful information about tons of sites on the Internet.

What is Cascading?

Cascading is an application development platform for building data applications on Apache Hadoop. In this post, you use it to build a simple application that indexes JSON files in Elasticsearch, without the need to think in terms of MapReduce methods.

Launching an EMR cluster with Elasticsearch, Maven, and Kibana

As in the previous post, you launch a cluster with Elasticsearch and Kibana installed. You also install Maven to compile the application and run a script to resolve some library dependencies between Elasticsearch and Cascading. All the bootstrap actions are public, so you can download the code to verify the installation steps at any time.
To launch the cluster, use the AWS CLI and run the following command:
aws emr create-cluster --name Elasticsearch --ami-version 3.9.0 \
--instance-type=m1.medium --instance-count 3 \
--ec2-attributes KeyName=your-key \
--log-uri s3://your-bucket/logs/ \
--bootstrap-action Name="Setup Jars",Path=s3://support.elasticmapreduce/bootstrap-actions/other/cascading-elasticsearch-jar-classpath.sh \
Name="Install Maven",Path=s3://support.elasticmapreduce/bootstrap-actions/other/maven-install.sh \
Name="Install Elasticsearch",Path=s3://support.elasticmapreduce/bootstrap-actions/other/elasticsearch_install.rb \
Name="Install Kibana",Path=s3://support.elasticmapreduce/bootstrap-actions/other/kibananginx_install.rb \
--no-auto-terminate --use-default-roles

Compiling Cascading Source Code with Maven

After you have the cluster up and running, you can connect over SSH to the master node to compile and run the application. Your Cascading application applies a filter before you start the indexing process, to remove the WARC envelope and obtain plain JSON output. For more information about the code, see the GitHub repository.
Install git:
$ sudo yum install git
Clone the repository:
$ git clone https://github.com/awslabs/aws-big-data-blog.git
Compile the code:
$ cd aws-big-data-blog/aws-blog-elasticsearch-cascading-commoncrawl/commoncrawl.cascading.elasticsearch
$ mvn clean && mvn assembly:assembly -Dmaven.test.skip=true  -Ddescriptor=./src/main/assembly/job.xml -e
The compiled application is placed in the following directory: aws-big-data-blog/aws-blog-elasticsearch-cascading-commoncrawl/commoncrawl.cascading.elasticsearch/target
Listing the directory should show the packaged application JAR.

Indexing Common Crawl Metadata on Elasticsearch

Using the application you just compiled, you can index a single Common Crawl file or a complete directory by changing the path parameter. The following commands show you how to index a file or a directory.
Index a single file:
hadoop jar /home/hadoop/aws-big-data-blog/aws-blog-elasticsearch-cascading-commoncrawl/commoncrawl.cascading.elasticsearch/target/commoncrawl.cascading.elasticsearch-0.0.1-SNAPSHOT-job.jar com.amazonaws.bigdatablog.indexcommoncrawl.Main s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2014-52/segments/1419447563504.69/wat/CC-MAIN-20141224185923-00099-ip-10-231-17-201.ec2.internal.warc.wat.gz
Index a complete directory:
hadoop jar /home/hadoop/aws-big-data-blog/aws-blog-elasticsearch-cascading-commoncrawl/commoncrawl.cascading.elasticsearch/target/commoncrawl.cascading.elasticsearch-0.0.1-SNAPSHOT-job.jar com.amazonaws.bigdatablog.indexcommoncrawl.Main s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2014-52/segments/1419447563504.69/wat/
Running the command to index a single file kicks off the Hadoop job; the application writes each JSON entry directly into Elasticsearch using the Cascading and Hadoop connectors.

Checking Indexes and Mappings

The index on Elasticsearch is created automatically, using the default configuration. Now, run a couple of commands on the console to check the index and mappings.
List all indexes:
$ curl 'localhost:9200/_cat/indices?v'
View the mappings:
$ curl -XGET 'http://localhost:9200/_all/_mapping' | python -m json.tool | more
If you look at the mapping output, you’ll see that it follows the structure shown in the Common Crawl WAT metadata description: http://commoncrawl.org/the-data/get-started/.
This mapping is shown in the Kibana menu and allows you to navigate the different metadata entries.

Querying Indexed Content

Because the Kibana bootstrap action configures the cluster to use port 80, you can point the browser to the master node public DNS address to access the Kibana console. On the Kibana console, click Sample Dashboard to start exploring the content indexed earlier in this post.
A sample dashboard appears with some basic information extracted.
You can search Head.Metas headers for all the occurrences of “hello”; in the search box, type “HTML-Metadata.Head.Metas AND keywords AND hello”.
That search returns all the records that contain ‘keywords’ and ‘hello’ in the “Metadata.Head.Metas” header.
Another useful way to find information is by using the mapping index. You can click “Envelope.Payload-Metadata.HTTP-Response-Metadata.Headers.Server” to see a ranking of the different server technologies of all the indexed sites.
Click the magnifier icon to find all the details on the selected entry.
Or you can get the top ten technologies used in the indexed web applications by clicking “Envelope.Payload-Metadata.HTTP-Response-Metadata.Headers.X-Powered-By”.

Conclusion

This post has shown how EMR lets you build and compile a simple Cascading application and use it to index Common Crawl metadata on an Elasticsearch cluster.
Cascading provided a simple application layer on top of Hadoop to parallelize the process and fetch the data directly from the S3 repository location, while Kibana provided a presentation interface that allowed you to research the indexed data in many ways.
If you have questions or suggestions, please leave a comment below.