Wednesday, 23 September 2015

Indexing Common Crawl Metadata on Amazon EMR Using Cascading and Elasticsearch

This post shows you how to build a simple application with Cascading for reading Common Crawl metadata, index the metadata on Elasticsearch, and use Kibana to query the indexed content.

What is Common Crawl?

Common Crawl is an open-source repository of web crawl data. This data set is freely available on Amazon S3 under the Common Crawl terms of use. The data is stored in several data formats. In this example, you work with the WAT response format that contains the metadata for the crawled HTML information. This allows you to build an Elasticsearch index, which can be used to extract useful information about tons of sites on the Internet.

What is Cascading?

Cascading is an application development platform for building data applications on Apache Hadoop. In this post, you use it to build a simple application that indexes JSON files in Elasticsearch, without the need to think in terms of MapReduce methods.

Launching an EMR cluster with Elasticsearch, Maven, and Kibana

As in the previous post, you launch a cluster with Elasticsearch and Kibana installed. You also install Maven to compile the application and run a script to resolve some library dependencies between Elasticsearch and Cascading. All the bootstrap actions are public, so you can download the code to verify the installation steps at any time.
To launch the cluster, use the AWS CLI and run the following command:
aws emr create-cluster --name Elasticsearch --ami-version 3.9.0 \
--instance-type=m1.medium --instance-count 3 \
--ec2-attributes KeyName=your-key \
--log-uri s3://your-bucket/logs/ \
--bootstrap-action Name="Setup Jars",Path=s3://support.elasticmapreduce/bootstrap-actions/other/cascading-elasticsearch-jar-classpath.sh \
Name="Install Maven",Path=s3://support.elasticmapreduce/bootstrap-actions/other/maven-install.sh \
Name="Install Elasticsearch",Path=s3://support.elasticmapreduce/bootstrap-actions/other/elasticsearch_install.rb \
Name="Install Kibana",Path=s3://support.elasticmapreduce/bootstrap-actions/other/kibananginx_install.rb \
--no-auto-terminate --use-default-roles

Compiling Cascading Source Code with Maven

After you have the cluster up and running, you can connect using SSH into the master node to compile and run the application. Your Cascading application applies a filter before you start the indexing process, to remove the WARC envelope and obtain plain JSON output. For more information about the code, see theGithub repository.
Install git:
$ sudo yum install git
Clone the repository:
$ git clone https://github.com/awslabs/aws-big-data-blog.git
Compile the code:
$ cd aws-big-data-blog/aws-blog-elasticsearch-cascading-commoncrawl/commoncrawl.cascading.elasticsearch
$ mvn clean && mvn assembly:assembly -Dmaven.test.skip=true  -Ddescriptor=./src/main/assembly/job.xml -e
Compiled application is placed in the following directory: aws-big-data-blog/aws-blog-elasticsearch-cascading-commoncrawl/commoncrawl.cascading.elasticsearch/target
Listing the directory should show the packaged application, as shown in the following graphic:

Indexing Common Crawl Metadata on Elasticsearch

Using the application you just compiled, you can index a single Common Crawl file or a complete directory, by modifying the parameter. The following commands show you how to index a file or directory.
Index a single file:
hadoop jar /home/hadoop/aws-big-data-blog/aws-blog-elasticsearch-cascading-commoncrawl/commoncrawl.cascading.elasticsearch/target/commoncrawl.cascading.elasticsearch-0.0.1-SNAPSHOT-job.jar com.amazonaws.bigdatablog.indexcommoncrawl.Main s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2014-52/segments/1419447563504.69/wat/CC-MAIN-20141224185923-00099-ip-10-231-17-201.ec2.internal.warc.wat.gz
Index a complete directory:
hadoop jar /home/hadoop/aws-big-data-blog/aws-blog-elasticsearch-cascading-commoncrawl/commoncrawl.cascading.elasticsearch/target/commoncrawl.cascading.elasticsearch-0.0.1-SNAPSHOT-job.jar com.amazonaws.bigdatablog.indexcommoncrawl.Main s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2014-52/segments/1419447563504.69/wat/
Running the command to index a single file produces the following output:
Running the command to index a single file produces this output
The application writes each JSON entry directly into Elasticsearch using the Cascading and Hadoop connectors.

Checking Indexes and Mappings

The index on Elasticsearch is created automatically, using the default configuration. Now, run a couple of commands on the console to check the index and mappings.
List all indexes:
$ curl 'localhost:9200/_cat/indices?v'
Listing indexes
View the mappings:
curl -XGET 'http://localhost:9200/_all/_mapping' | python -m json.tool |more
If you look at the mapping output, you’ll see that it follows the structure showed on the Common Crawl WAT metadata description: http://commoncrawl.org/the-data/get-started/.
Viewing mappings
This mapping is shown in the Kibana menu and allows you to navigate the different metadata entries.

Querying Indexed Content

Because the Kibana bootstrap action configures the cluster to use port 80, you can point the browser to the master node public DNS address to access the Kibana console. On the Kibana console, click Sample Dashboard to start exploring the content indexed earlier in this post.
A sample dasbhard appears with some basic information extracted:
Querying Indexed Content
You can search Head.Metas headers for all the occurrences of “hello”; in the search box, type “HTML-Metadata.Head.Metas AND keywords AND hello”.
Searching for headers
That search returns all the records that contain ‘keywords’ and ‘hello’ on the “Metadata.Head.Metas” header. The result looks like the following:
Search results
Another useful way to find information is by using the mapping index. You can click “Envelope.Payload-Metadata.HTTP-Response-Metadata.Headers.Server” to see a ranking of the different server technologies of all the indexed sites:
using the mapping index
Click the magnifier icon to find all the details on the selected entry.
Or you can get the top ten technologies used in the indexed web application by clicking “Envelope.Payload-Metadata.HTTP-Response-Metadata.Headers.X-Powered-By”. The following graphic shows an example:
Getting the top ten technologies used in the indexed web application

Conclusion

This post has shown how EMR lets you build and compile a simple Cascading application and use it to index Common Crawl metadata on an Elasticsearch cluster.
Cascading provided a simple application layer on top of Hadoop to parallelize the process and fetch the data directly from the S3 repository location, while Kibana provided a presentation interface that allowed you to research the indexed data in many ways.
If you have questions or suggestions, please leave a comment below.
------------------------------------
Do more with EMR:
---------------------------------------------------------------

1 comment:

  1. This is an awesome post.Really very informative and creative contents. These concept is a good way to enhance the knowledge.I like it and help me to development very well.Thank you for this brief explanation and very nice information.Well, got a good knowledge.
    AWS Training in Chennai

    ReplyDelete