Here is a very quick way for you to get some hands-on experience with the differences between TEXTFILE and PARQUET, as well as between Hive and Impala. You can do this on your own cluster or use Cloudera’s QuickStart VM. Our steps were done using a three-node CDH 5.2.0 cluster, which includes Hive 0.13.1 and Impala 2.0.1.
First, get some data. In this example, we grabbed temperature data from the US government. We took the ‘hourly_TEMP_2014.zip’ file, which, once uncompressed, is around 1 GB. Not large by any means, but enough for this example.
#unzip, remove the header row from the file, and add it to HDFS
unzip hourly_TEMP_2014.zip
sed -i '1d' hourly_TEMP_2014.csv
hadoop fs -copyFromLocal hourly_TEMP_2014.csv /tmp
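Before moving on, it doesn’t hurt to confirm the file actually landed in HDFS:
#verify the file is in HDFS
hadoop fs -ls /tmp/hourly_TEMP_2014.csv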
Next, log in to Hive (via beeline or Hue), create the tables, and load some data. We’re creating a TEXTFILE table and a PARQUET table in this example. PARQUET is a columnar storage format that gives us advantages for storing and scanning data. Storing the data column-wise allows for better compression, which gives us faster scans while using less storage. It’s also helpful for “wide” tables and for column-level aggregations such as avg(degrees).
#hive (via the beeline shell or Hue)
create table temps_txt (statecode string, countycode string, sitenum string, paramcode string, poc string, latitude string, longitude string, datum string, param string, datelocal string, timelocal string, dategmt string, timegmt string, degrees double, uom string, mdl string, uncert string, qual string, method string, methodname string, state string, county string, dateoflastchange string) row format delimited fields terminated by ',';
load data inpath '/tmp/hourly_TEMP_2014.csv' into table temps_txt;
create table temps_par (statecode string, countycode string, sitenum string, paramcode string, poc string, latitude string, longitude string, datum string, param string, datelocal string, timelocal string, dategmt string, timegmt string, degrees double, uom string, mdl string, uncert string, qual string, method string, methodname string, state string, county string, dateoflastchange string) stored as parquet;
insert into table temps_par select * from temps_txt;
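Since Parquet stores data column-wise and compresses it, you can also compare the two tables’ footprints on HDFS to see the storage savings. The paths below assume the default Hive warehouse location; adjust them if your cluster stores managed tables elsewhere.
#compare on-disk size (paths assume the default Hive warehouse location)
hadoop fs -du -h /user/hive/warehouse/temps_txt
hadoop fs -du -h /user/hive/warehouse/temps_par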
Now that we have some data, let’s do some analysis. In these examples, we are using Hive to query the TEXTFILE and PARQUET tables. Your results will vary, but with statements like these the PARQUET queries should be faster, because the columnar layout lets the aggregation read only the degrees column instead of every field in every row.
#hive (via the beeline shell or Hue)
select avg(degrees) from temps_txt;
select avg(degrees) from temps_par;
Now, let’s launch impala-shell and issue the same queries. Impala uses the Hive metastore, so it can query the same tables you created in Hive.
#impala-shell
invalidate metadata; -- pick up the table metadata created in Hive
select avg(degrees) from temps_txt;
select avg(degrees) from temps_par;
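If you’re curious where the time goes, impala-shell can also report on the most recent query: summary prints a per-operator breakdown, and profile dumps the full runtime profile. This is optional, but it gives you a way to compare how much work each query did.
#impala-shell (run immediately after one of the queries above)
summary;
profile;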
Again, these are very small data sets for Hadoop, but they give a simple example of how to get up and running so you can see the differences between storage formats (TEXTFILE vs PARQUET) and query engine behavior (Hive vs Impala).