Wednesday, 23 September 2015

Amazon Web Services (AWS) Command Line Interface (CLI)

In this post I explain the AWS CLI and show how to access an Amazon account using it.

We can access an AWS account from a Windows machine or a Linux machine.

Windows
To access an AWS account from Windows, we should install the AWS CLI.


Download the installer from https://aws.amazon.com/cli/, or use https://s3.amazonaws.com/aws-cli/AWSCLI64.msi directly.

Running this .msi file installs the AWS CLI on the local Windows machine.

Then open a command prompt.

Type "aws configure"

It will then ask for the following configuration values:

AWS Access Key ID [None]: xxxxxxxxxxxx
AWS Secret Access Key [None]: ###########################
Default region name [None]: @@@@@
Default output format [None]: (press ENTER to accept the default)

The AWS CLI is now installed and configured on your Windows machine.
To test it, run "aws s3 ls"; it will list all of your S3 buckets.
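Behind the scenes, "aws configure" saves these values in two plain-text files under your home directory (%UserProfile%\.aws\ on Windows, ~/.aws/ on Linux). A minimal sketch of what they look like; the key values below are placeholders:

$ cat ~/.aws/credentials
[default]
aws_access_key_id = AKIAXXXXXXXXXXXXXXXX
aws_secret_access_key = xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

$ cat ~/.aws/config
[default]
region = us-east-1
output = json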

AWS EC2 machine:

On an AWS EC2 machine (Amazon Linux AMI), the AWS CLI is available by default at /usr/bin/aws.

So there is no need to install it again on that machine; you can use it directly from anywhere.
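To quickly confirm the pre-installed CLI is there and see which version the AMI ships, you can run the following (the version string in the comment is only an example):

$ which aws
/usr/bin/aws
$ aws --version    # prints something like aws-cli/1.x Python/2.x Linux/...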

Here are some commands I use often; they help speed up everyday work.

“aws s3 ls”
It lists all buckets.
“aws s3 ls s3://somutest/”
It lists all the objects and folders in this bucket.
“aws s3 cp . s3://somutest/ --recursive”
It copies all documents from the present local folder (the dot stands for the current location) to the bucket; --recursive is needed because the source is a directory. To transfer only one file, name it explicitly, e.g. “aws s3 cp abc.txt s3://somutest/”.
Below are some more commands that are frequently used in real life.
aws s3 cp s3://somutest . --recursive (downloads the whole bucket to the current folder)

$ aws s3 sync . s3://my-bucket/MyFolder --acl public-read

Creating Buckets
$ aws s3 mb s3://bucket-name

Removing Buckets
$ aws s3 rb s3://bucket-name
$ aws s3 rb s3://bucket-name --force (Non empty bucket)

Listing Buckets
$ aws s3 ls
$ aws s3 ls s3://bucket-name/MyFolder

Managing objects
aws s3 mv s3://mybucket . --recursive
$ aws s3 cp file.txt s3://bucket-name/ --grants read=uri=http://acs.amazonaws.com/groups/global/AllUsers full=emailaddress=user@example.com
$ aws s3 sync <source> <target> [--options]
aws s3 sync . s3://somutest (uploads all documents from the current local folder to the S3 bucket)
// Delete local file
$ rm ./MyFile1.txt

// Attempt sync without --delete option - nothing happens
$ aws s3 sync . s3://my-bucket/MyFolder

// Sync with deletion - object is deleted from bucket
$ aws s3 sync . s3://my-bucket/MyFolder --delete
delete: s3://my-bucket/MyFolder/MyFile1.txt

// Delete object from bucket
$ aws s3 rm s3://my-bucket/MyFolder/MySubdirectory/MyFile3.txt
delete: s3://my-bucket/MyFolder/MySubdirectory/MyFile3.txt

// Sync with deletion - local file is deleted
$ aws s3 sync s3://my-bucket/MyFolder . --delete
delete: MySubdirectory\MyFile3.txt
The next examples show how the --exclude and --include filters affect sync. The local directory contains 3 files:
MyFile1.txt
MyFile2.rtf
MyFile88.txt
$ aws s3 sync . s3://my-bucket/MyFolder --exclude '*.txt'
upload: MyFile2.rtf to s3://my-bucket/MyFolder/MyFile2.rtf
$ aws s3 sync . s3://my-bucket/MyFolder --exclude '*.txt' --include 'MyFile*.txt'
upload: MyFile1.txt to s3://my-bucket/MyFolder/MyFile1.txt
upload: MyFile88.txt to s3://my-bucket/MyFolder/MyFile88.txt
upload: MyFile2.rtf to s3://my-bucket/MyFolder/MyFile2.rtf
$ aws s3 sync . s3://my-bucket/MyFolder --exclude '*.txt' --include 'MyFile*.txt' --exclude 'MyFile?.txt'
upload: MyFile2.rtf to s3://my-bucket/MyFolder/MyFile2.rtf
upload: MyFile88.txt to s3://my-bucket/MyFolder/MyFile88.txt
// Copy MyFile.txt in current directory to s3://my-bucket/MyFolder
$ aws s3 cp MyFile.txt s3://my-bucket/MyFolder/

// Move all .jpg files in s3://my-bucket/MyFolder to ./MyDirectory
$ aws s3 mv s3://my-bucket/MyFolder ./MyDirectory --exclude '*' --include '*.jpg' --recursive

// List the contents of my-bucket
$ aws s3 ls s3://my-bucket

// List the contents of MyFolder in my-bucket
$ aws s3 ls s3://my-bucket/MyFolder

// Delete s3://my-bucket/MyFolder/MyFile.txt
$ aws s3 rm s3://my-bucket/MyFolder/MyFile.txt

// Delete s3://my-bucket/MyFolder and all of its contents
$ aws s3 rm s3://my-bucket/MyFolder --recursive



AWS EMR Cluster

aws emr create-cluster \
  --ami-version 3.8.0 \
  --instance-type m1.xlarge \
  --instance-count 1 \
  --name "cascading-kinesis-example" \
  --visible-to-all-users \
  --enable-debugging \
  --auto-terminate \
  --no-termination-protected \
  --log-uri s3n://quanttestbucket/logs/ \
  --service-role EMR_DefaultRole --ec2-attributes InstanceProfile=EMR_EC2_DefaultRole
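The create-cluster call returns a cluster ID (a j-XXXXXXXXXXXXX value). A few follow-up commands are useful for watching the cluster come up; the cluster ID below is a placeholder for the value returned to you:

$ aws emr list-clusters --active
$ aws emr describe-cluster --cluster-id j-XXXXXXXXXXXXX
$ aws emr list-steps --cluster-id j-XXXXXXXXXXXXX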

Indexing Common Crawl Metadata on Amazon EMR Using Cascading and Elasticsearch

This post shows you how to build a simple application with Cascading that reads Common Crawl metadata, indexes the metadata in Elasticsearch, and uses Kibana to query the indexed content.

What is Common Crawl?

Common Crawl is an open-source repository of web crawl data. This data set is freely available on Amazon S3 under the Common Crawl terms of use. The data is stored in several data formats. In this example, you work with the WAT response format that contains the metadata for the crawled HTML information. This allows you to build an Elasticsearch index, which can be used to extract useful information about tons of sites on the Internet.

What is Cascading?

Cascading is an application development platform for building data applications on Apache Hadoop. In this post, you use it to build a simple application that indexes JSON files in Elasticsearch, without the need to think in terms of MapReduce methods.

Launching an EMR cluster with Elasticsearch, Maven, and Kibana

As in the previous post, you launch a cluster with Elasticsearch and Kibana installed. You also install Maven to compile the application and run a script to resolve some library dependencies between Elasticsearch and Cascading. All the bootstrap actions are public, so you can download the code to verify the installation steps at any time.
To launch the cluster, use the AWS CLI and run the following command:
aws emr create-cluster --name Elasticsearch --ami-version 3.9.0 \
--instance-type=m1.medium --instance-count 3 \
--ec2-attributes KeyName=your-key \
--log-uri s3://your-bucket/logs/ \
--bootstrap-action Name="Setup Jars",Path=s3://support.elasticmapreduce/bootstrap-actions/other/cascading-elasticsearch-jar-classpath.sh \
Name="Install Maven",Path=s3://support.elasticmapreduce/bootstrap-actions/other/maven-install.sh \
Name="Install Elasticsearch",Path=s3://support.elasticmapreduce/bootstrap-actions/other/elasticsearch_install.rb \
Name="Install Kibana",Path=s3://support.elasticmapreduce/bootstrap-actions/other/kibananginx_install.rb \
--no-auto-terminate --use-default-roles

Compiling Cascading Source Code with Maven

After the cluster is up and running, you can connect to the master node over SSH to compile and run the application. The Cascading application applies a filter before the indexing process starts, to remove the WARC envelope and obtain plain JSON output. For more information about the code, see the GitHub repository.
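If you launched the cluster with an EC2 key pair (KeyName=your-key in the command above), one way to open that SSH session is through the CLI itself. A sketch, assuming the matching .pem file is on your machine and the cluster ID placeholder is replaced with your own:

$ aws emr ssh --cluster-id j-XXXXXXXXXXXXX --key-pair-file ~/your-key.pem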
Install git:
$ sudo yum install git
Clone the repository:
$ git clone https://github.com/awslabs/aws-big-data-blog.git
Compile the code:
$ cd aws-big-data-blog/aws-blog-elasticsearch-cascading-commoncrawl/commoncrawl.cascading.elasticsearch
$ mvn clean && mvn assembly:assembly -Dmaven.test.skip=true  -Ddescriptor=./src/main/assembly/job.xml -e
The compiled application is placed in the following directory: aws-big-data-blog/aws-blog-elasticsearch-cascading-commoncrawl/commoncrawl.cascading.elasticsearch/target
Listing that directory should show the packaged application (the -job.jar file).

Indexing Common Crawl Metadata on Elasticsearch

Using the application you just compiled, you can index a single Common Crawl file or a complete directory by changing the path parameter. The following commands show how to index a file or a directory.
Index a single file:
hadoop jar /home/hadoop/aws-big-data-blog/aws-blog-elasticsearch-cascading-commoncrawl/commoncrawl.cascading.elasticsearch/target/commoncrawl.cascading.elasticsearch-0.0.1-SNAPSHOT-job.jar com.amazonaws.bigdatablog.indexcommoncrawl.Main s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2014-52/segments/1419447563504.69/wat/CC-MAIN-20141224185923-00099-ip-10-231-17-201.ec2.internal.warc.wat.gz
Index a complete directory:
hadoop jar /home/hadoop/aws-big-data-blog/aws-blog-elasticsearch-cascading-commoncrawl/commoncrawl.cascading.elasticsearch/target/commoncrawl.cascading.elasticsearch-0.0.1-SNAPSHOT-job.jar com.amazonaws.bigdatablog.indexcommoncrawl.Main s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2014-52/segments/1419447563504.69/wat/
Running the command to index a single file prints the console output of the Hadoop job.
The application writes each JSON entry directly into Elasticsearch using the Cascading and Hadoop connectors.

Checking Indexes and Mappings

The index on Elasticsearch is created automatically, using the default configuration. Now, run a couple of commands on the console to check the index and mappings.
List all indexes:
$ curl 'localhost:9200/_cat/indices?v'
View the mappings:
curl -XGET 'http://localhost:9200/_all/_mapping' | python -m json.tool |more
If you look at the mapping output, you’ll see that it follows the structure shown in the Common Crawl WAT metadata description: http://commoncrawl.org/the-data/get-started/.
This mapping is shown in the Kibana menu and allows you to navigate the different metadata entries.
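Before moving on to Kibana, you can also run an ad-hoc query directly against Elasticsearch from the master node. A minimal sketch; the query term is just an example, and the matching index names depend on what the application created:

$ curl 'localhost:9200/_search?q=hello&pretty' | head -40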

Querying Indexed Content

Because the Kibana bootstrap action configures the cluster to use port 80, you can point the browser to the master node public DNS address to access the Kibana console. On the Kibana console, click Sample Dashboard to start exploring the content indexed earlier in this post.
A sample dashboard appears with some basic information extracted.
You can search Head.Metas headers for all the occurrences of “hello”; in the search box, type “HTML-Metadata.Head.Metas AND keywords AND hello”.
That search returns all the records that contain ‘keywords’ and ‘hello’ in the “Metadata.Head.Metas” header.
Another useful way to find information is by using the mapping index. You can click “Envelope.Payload-Metadata.HTTP-Response-Metadata.Headers.Server” to see a ranking of the different server technologies across all the indexed sites.
Click the magnifier icon to find all the details on the selected entry.
Or you can get the top ten technologies used in the indexed web applications by clicking “Envelope.Payload-Metadata.HTTP-Response-Metadata.Headers.X-Powered-By”.

Conclusion

This post has shown how EMR lets you build and compile a simple Cascading application and use it to index Common Crawl metadata on an Elasticsearch cluster.
Cascading provided a simple application layer on top of Hadoop to parallelize the process and fetch the data directly from the S3 repository location, while Kibana provided a presentation interface that allowed you to research the indexed data in many ways.
If you have questions or suggestions, please leave a comment below.

Friday, 4 September 2015

Hadoop Interview Questions V2

1. What is Big Data?
Any data that cannot be stored in a traditional RDBMS is termed Big Data. Most of the data we use today has been generated in the past 20 years, and it is largely unstructured or semi-structured. More than the volume of the data, it is the nature of the data that determines whether it is considered Big Data.
2. What do the four V’s of Big Data denote?
IBM has a nice, simple explanation for the four critical features of big data:
a) Volume – Scale of data
b) Velocity – Speed of (streaming) data and its analysis
c) Variety – Different forms of data
d) Veracity – Uncertainty of data
For more basic questions and answers, click here.
2) Hadoop HDFS Interview Questions
1. What is a block and block scanner in HDFS?
Block - The minimum amount of data that can be read or written is generally referred to as a “block” in HDFS. The default block size in HDFS is 64 MB in Hadoop 1.x (128 MB in Hadoop 2.x).
Block Scanner - Block Scanner tracks the list of blocks present on a DataNode and verifies them to find any kind of checksum errors. Block Scanners use a throttling mechanism to reserve disk bandwidth on the datanode.
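Two HDFS commands make these concepts easy to check on a live cluster; a sketch, where the file path is a placeholder (on Hadoop 1.x use "hadoop fsck" instead of "hdfs fsck"):

$ hdfs getconf -confKey dfs.blocksize                              # configured block size in bytes
$ hdfs fsck /user/hadoop/somefile.txt -files -blocks -locations    # list the blocks backing a file and where they live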
2. Explain the difference between NameNode, Backup Node and Checkpoint NameNode.
NameNode: The NameNode is at the heart of the HDFS file system. It manages the metadata; the data of the files is not stored on the NameNode, rather it holds the directory tree of all the files present in the HDFS file system on a Hadoop cluster. The NameNode uses two files for the namespace:
fsimage file - It keeps track of the latest checkpoint of the namespace.
edits file - It is a log of changes that have been made to the namespace since the last checkpoint.
Checkpoint Node-
The Checkpoint Node keeps track of the latest checkpoint in a directory that has the same structure as the NameNode's directory. It creates checkpoints for the namespace at regular intervals by downloading the edits and fsimage files from the NameNode and merging them locally. The new image is then uploaded back to the active NameNode.
BackupNode:
The Backup Node provides checkpointing functionality like the Checkpoint Node, but it also maintains an up-to-date, in-memory copy of the file system namespace that is always in sync with the active NameNode.
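Hadoop also ships offline viewers for both namespace files, which makes the fsimage/edits distinction easy to inspect yourself. A sketch with illustrative file names; the real paths and transaction IDs depend on dfs.namenode.name.dir on your cluster:

# Offline Image Viewer: dump an fsimage checkpoint to XML
$ hdfs oiv -p XML -i /data/namenode/current/fsimage_0000000000000001234 -o fsimage.xml
# Offline Edits Viewer: dump an edits log segment to XML
$ hdfs oev -i /data/namenode/current/edits_0000000000000001235-0000000000000001299 -o edits.xml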
For more Hadoop HDFS Interview Questions, click here.
3) MapReduce Interview Questions
1. Explain the usage of Context Object.
The Context object is used to help the mapper interact with the rest of the Hadoop system. It can be used to update counters, report progress, and provide any application-level status updates. The Context object also carries the configuration details for the job, as well as the interfaces that allow it to emit output.
2. What are the core methods of a Reducer?
The 3 core methods of a reducer are –
1) setup() – This method of the reducer is used to configure various parameters such as the input data size, distributed cache, heap size, etc.
Function definition: public void setup(Context context)
2) reduce() – This is the heart of the reducer; it is called once per key with the associated list of values.
Function definition: public void reduce(Key key, Iterable<Value> values, Context context)
3) cleanup() – This method is called only once, at the end of the reduce task, to clear all the temporary files.
Function definition: public void cleanup(Context context)
For more MapReduce Interview Questions, click here.
4) Hadoop HBase Interview Questions
1. When should you use HBase and what are the key components of HBase?
HBase should be used when the big data application has –
1) A variable schema
2) Data stored in the form of collections
3) Key-based access to data when retrieving
Key components of HBase are –
Region- This component contains memory data store and Hfile.
Region Server-This monitors the Region.
HBase Master-It is responsible for monitoring the region server.
Zookeeper- It takes care of the coordination between the HBase Master component and the client.
Catalog Tables - The two important catalog tables are ROOT and META. The ROOT table tracks where the META table is, and the META table stores all the regions in the system.
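A quick way to see the key-based access model in action is the HBase shell on the cluster. A minimal sketch with a hypothetical table and column family:

$ hbase shell <<'EOF'
create 'mytable', 'cf'
put 'mytable', 'row1', 'cf:city', 'Hyderabad'
get 'mytable', 'row1'
scan 'mytable'
EOF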
For more Hadoop HBase Interview Questions, click here.
5) Hadoop Sqoop Interview Questions
1. Explain about some important Sqoop commands other than import and export.
Create Job (--create)
Here we create a job with the name myjob, which can import table data from an RDBMS table into HDFS. The following command creates a job that imports data from the employee table in the db database into HDFS.
$ sqoop job --create myjob \
--import \
--connect jdbc:mysql://localhost/db \
--username root \
--table employee --m 1
Verify Job (--list)
The ‘--list’ argument is used to verify the saved jobs. The following command lists the saved Sqoop jobs.
$ sqoop job --list
Inspect Job (--show)
‘--show’ argument is used to inspect or verify particular jobs and their details. The following command and sample output is used to verify a job called myjob.
$ sqoop job --show myjob
Execute Job (--exec)
‘--exec’ option is used to execute a saved job. The following command is used to execute a saved job called myjob.
$ sqoop job --exec myjob
For more Hadoop Sqoop Interview Questions, click here.
6) Hadoop Flume Interview Questions
1. Explain about the core components of Flume.
The core components of Flume are –
Event - The single log entry or unit of data that is transported.
Source - The component through which data enters Flume workflows.
Sink - It is responsible for transporting data to the desired destination.
Channel - It is the duct between the Source and the Sink.
Agent - Any JVM process that runs Flume.
Client - The component that transmits the event to the source, operating with the agent.
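The relationship between these components is easiest to see in a minimal agent configuration. The sketch below is the classic netcat-to-logger example; the names a1, r1, c1 and k1 are arbitrary:

$ cat > example.conf <<'EOF'
# One agent (a1) with one source, one channel and one sink
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Source: events arrive over a TCP port
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Channel: the in-memory duct between source and sink
a1.channels.c1.type = memory

# Sink: writes events to the Flume log
a1.sinks.k1.type = logger

# Wire source -> channel -> sink
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
EOF
$ flume-ng agent --conf conf --conf-file example.conf --name a1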
For more Hadoop Flume Interview Questions click here
7) Hadoop Zookeeper Interview Questions
1. Can Apache Kafka be used without Zookeeper?
It is not possible to use Apache Kafka without ZooKeeper, because if ZooKeeper is down Kafka cannot serve client requests.
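Since Kafka depends on that ensemble, it helps to know how to check whether ZooKeeper is actually up. A sketch using ZooKeeper's four-letter commands, assuming the server listens on the default port 2181:

$ echo ruok | nc localhost 2181    # replies "imok" if the server is running
$ echo stat | nc localhost 2181    # prints basic server statistics and client connections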
2. Name a few companies that use Zookeeper.
Yahoo, Solr, Helprace, Neo4j, Rackspace
For more Hadoop Zookeeper Interview Questions click here
8) Pig Interview Questions
1. What do you mean by a bag in Pig?
A collection of tuples is referred to as a bag in Apache Pig.
2. Does Pig support multi-line commands?
Yes
For more Pig Interview Questions click here
9) Hive Interview Questions
1. What is a Hive Metastore?
The Hive Metastore is a central repository that stores Hive metadata in an external database.
2. Are multiline comments supported in Hive?
No
For more Hive Interview Questions click here
10) Hadoop YARN Interview Questions
1. What are the stable versions of Hadoop?
Release 2.7.1 (stable)
Release 2.4.1
Release 1.2.1 (stable)
2. What is Apache Hadoop YARN?
YARN is a powerful and efficient feature rolled out as part of Hadoop 2.0. YARN is a large-scale distributed system for running big data applications.