Impala
What is Impala?
Impala is a low latency query tools suitable for data analyst. It is not suitable for ETL jobs, for which hive is more appropriate, It is custom built C++ based system developed at Cloudera, but it now it is an apache open source project. It does not support delete, update operations. Below is Impala architecture from apache site. Impala does not fault tolerance during execution time, that means if during query execution fails, user has resubmit the query. Occasionally it shows performance issues with data size larger than 10 TB.
Download the stocks file and upload to HDFS /user/cloudera/stocks directory.
In hive create a stocks table with simple delimited format specification.
hive> create external table stocks(
tr_date String,
open Double,
high Double,
low Double,
close Double,
volume Double,
adjClose Double,
symbol String)
row format delimited
fields terminated by ','
stored as textfile
location '/user/cloudera/stocks'
TBLPROPERTIES("skip.header.line.count"="1");
Launch impala shell.
[cloudera@quickstart ~]$ impala-shell
Starting Impala Shell without Kerberos authentication
Connected to quickstart.cloudera:21000
Server version: impalad version 2.9.0-cdh5.12.0 RELEASE (build 03c6ddbdcec39238be4f5b14a300d5c4f576097e)
***********************************************************************************
Welcome to the Impala shell.
(Impala Shell v2.9.0-cdh5.12.0 (03c6ddb) built on Thu Jun 29 04:17:31 PDT 2017)
To see a summary of a query's progress that updates in real-time, run 'set
LIVE_PROGRESS=1;'.
***********************************************************************************
Each time you want to access the newly created hive tables, invalidate impala meta store which caches the meta data information.
[quickstart.cloudera:21000] > invalidate metadata;
Query: invalidate metadata
Query submitted at: 2018-08-28 11:33:33 (Coordinator: http://quickstart.cloudera:25000)
Query progress can be monitored at: http://quickstart.cloudera:25000/query_plan?query_id=854c4859e73d0289:512740200000000
Fetched 0 row(s) in 5.94s
View existing databases.
[quickstart.cloudera:21000] > show databases;
Query: show databases
+------------------+----------------------------------------------+
| name | comment |
+------------------+----------------------------------------------+
| _impala_builtins | System database for Impala builtin functions |
| default | Default Hive database |
+------------------+----------------------------------------------+
Fetched 2 row(s) in 0.12s
Run select statement against stocks table.
[quickstart.cloudera:21000] > select * from stocks limit 4;
Query: select * from stocks limit 4
Query submitted at: 2018-08-28 23:38:07 (Coordinator: http://quickstart.cloudera:25000)
Query progress can be monitored at: http://quickstart.cloudera:25000/query_plan?query_id=fe443ca03202f2c7:d72fecb500000000
+------------+----------+------------+-----------+---------+----------+-------------------+--------+
| tr_date | open | high | low | close | volume | adjclose | symbol |
+------------+----------+------------+-----------+---------+----------+-------------------+--------+
| 2000-07-18 | 144.8125 | 144.828125 | 141.4375 | 143 | 50683600 | 50.155473 | INTC |
| 2000-07-20 | 32.93751 | 34.25001 | 32.8125 | 33.75 | 3288300 | 8.789733999999999 | BEN |
| 2000-07-24 | 64.25 | 67.312477 | 64.187523 | 64.75 | 948800 | 7.689567 | APH |
| 2000-07-26 | 21.875 | 22.125 | 20.9375 | 20.9375 | 1464300 | 15.61832 | SHW |
+------------+----------+------------+-----------+---------+----------+-------------------+--------+
Fetched 4 row(s) in 5.65s
We can see the results from stocks table.
Now, go to hive and create a CSV serde table on the same stocks dataset.
hive> CREATE EXTERNAL TABLE stocks_csv_serde(
tr_date string,
open Double,
high Double,
low Double,
close Double,
volume Double,
adjClose Double,
symbol String)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
'separatorChar' = ',',
'quoteChar' = '"',
'escapeChar' = '\\')
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
'/user/cloudera/stocks'
TBLPROPERTIES (
'serialization.null.format' = '',
'skip.header.line.count' = '1');
In impala, run the invalidate metadata command as before and try to query the serde table. You should see an error message.
[quickstart.cloudera:21000] > select * from stocks_csv_serde limit 4;
Query: select * from stocks_csv_serde limit 4
Query submitted at: 2018-08-28 23:43:52 (Coordinator: http://quickstart.cloudera:25000)
ERROR: AnalysisException: Failed to load metadata for table: 'stocks_csv_serde'
CAUSED BY: TableLoadingException: Failed to load metadata for table: stocks_csv_serde
CAUSED BY: InvalidStorageDescriptorException: Impala does not support tables of this type. REASON: SerDe library 'org.apache.hadoop.hive.serde2.OpenCSVSerde' is not supported.
Impala does not support serde based hive tables. Only parquet and hbase serde are supported. Look at the supported file formats.
https://www.cloudera.com/documentation/enterprise/5-4-x/topics/impala_file_formats.html
Go back to hive and create a copy of the stocks_csv_serde in parquet format.
hive> create table stocks_parquet like stocks_csv_serde stored as parquet;
hive> insert into stocks_parquet select * from stocks_csv_serde;
Invlidate imapala meta store again and query the stocks_parquet table. You should see the data there.
[quickstart.cloudera:21000] > select * from stocks_parquet limit 4;
Query: select * from stocks_parquet limit 4
Query submitted at: 2018-08-28 23:47:43 (Coordinator: http://quickstart.cloudera:25000)
Query progress can be monitored at: http://quickstart.cloudera:25000/query_plan?query_id=c484501e875e053:c55c1d7900000000
+------------+----------+------------+-----------+---------+----------+-------------------+--------+
| tr_date | open | high | low | close | volume | adjclose | symbol |
+------------+----------+------------+-----------+---------+----------+-------------------+--------+
| 2000-07-18 | 144.8125 | 144.828125 | 141.4375 | 143 | 50683600 | 50.155473 | INTC |
| 2000-07-20 | 32.93751 | 34.25001 | 32.8125 | 33.75 | 3288300 | 8.789733999999999 | BEN |
| 2000-07-24 | 64.25 | 67.312477 | 64.187523 | 64.75 | 948800 | 7.689567 | APH |
| 2000-07-26 | 21.875 | 22.125 | 20.9375 | 20.9375 | 1464300 | 15.61832 | SHW |
+------------+----------+------------+-----------+---------+----------+-------------------+--------+
Fetched 4 row(s) in 0.13s