Drill for Interactive Query

Key features of Apache Drill

  1. Run interactive query on data on files system without defining a schema
  2. Run interactive query against hive table
  3. Run interactive query against HBase/Mapr DB table
  4. Run interactive query against RDBMS tables using jdbc storage plugin
  5. Run interactive join queries against 2 or more tables - one located in file system and other is located HBase or any combination of them.
  6. Run interactive query against nested data formats like json, parquet, avro (xml support is not there yet)

Download apache drill from here and untar it.

$ ~/Downloads
$ tar xf apache-drill-1.9.0.tar.gz
$ cd apache-drill-1.9.0
$ export DRILL_HOME=$HOME/Downloads/apache-drill-1.9.0


If you are using MapR Sandbox, set DRILL_HOME as below. MapR version of Drill has the necessary libraries to work with MapR FS, MapR-DB etc.

$ export DRILL_HOME=/opt/mapr/drill/drill-1.11.0 

Test java version. Java 1.7+ is required.

$ java -version
openjdk version "1.8.0_91"
OpenJDK Runtime Environment (build 1.8.0_91-b14)
OpenJDK 64-Bit Server VM (build 25.91-b14, mixed mode)

Start Drill embedded mode. Set Drill memory to 1GB to ask for less memory from OS.

$ export DRILL_HEAP=1G
$ $DRILL_HOME/bin/drill-embedded 
OpenJDK 64-Bit Server VM warning: ignoring option MaxPermSize=512M; support was removed in 8.0
Aug 07, 2017 9:40:46 AM org.glassfish.jersey.server.ApplicationHandler initialize
INFO: Initiating Jersey application, version Jersey: 2.8 2014-04-29 01:25:26...
apache drill 1.9.0 
"a little sql for your nosql"

Show Schemas

0: jdbc:drill:zk=local> show schemas;
+---------------------+
|     SCHEMA_NAME     |
+---------------------+
| INFORMATION_SCHEMA  |
| cp.default          |
| dfs.default         |
| dfs.root            |
| dfs.tmp             |
| sys                 |
+---------------------+

Working with CSV Data

Load sample stocks file to /user/mapr/stocks in HDFS. You can find a sample file here.

0: jdbc:drill:zk=local> select * from dfs.root.`/user/mapr/stocks` limit 10;
+-----------------------------------------------------------------------------------------------+
|                                            columns                                            |
+-----------------------------------------------------------------------------------------------+
| ["date","open","high","low","close","volume","adjclose","symbol"]                             |
| ["2000-07-17","95.4375","97.5","92.75","96.625","3508100.0","74.269199","XLNX"]               |
| ["2000-07-17","22.625","22.75","22.4375","22.5625","201600.0","13.48614","ES"]                |
| ["2000-07-17","6.750002","6.937503","6.375","6.5","1235700.0","5.241649","CHK"]               |
| ["2000-07-17","19.812501","20.1875","19.500001","20.1875","1434100.0","3.806147","NI"]        |
| ["2000-07-17","30.5","30.6875","30.0","30.03125","254600.0","19.81183","SNA"]                 |
| ["2000-07-17","44.749996","45.062498","44.500004","45.000009","535200.0","17.400773","FOXA"]  |
| ["2000-07-17","19.625","19.625","19.25","19.375","309500.0","13.768835","R"]                  |
| ["2000-07-17","16.6562","16.6875","16.125","16.25","5507200.0","1.755466","ROST"]             |
| ["2000-07-17","56.25","57.25","56.0625","56.125","7941200.0","18.31076","PG"]                 |
+-----------------------------------------------------------------------------------------------+

As you can see Drill has parsed tokenized each record and created a single column schema that contains complex data type - list. If you want to parse the fields from the list data type you can do the following way.

0: jdbc:drill:zk=local> select columns[0] `date`, columns[5] `volume`, columns[7] `symbol` from dfs.root.`/user/mapr/stocks` limit 10;
+-------------+------------+---------+
|    date     |   volume   | symbol  |
+-------------+------------+---------+
| date        | volume     | symbol  |
| 2000-07-17  | 3508100.0  | XLNX    |
| 2000-07-17  | 201600.0   | ES      |
| 2000-07-17  | 1235700.0  | CHK     |
| 2000-07-17  | 1434100.0  | NI      |
| 2000-07-17  | 254600.0   | SNA     |
| 2000-07-17  | 535200.0   | FOXA    |
| 2000-07-17  | 309500.0   | R       |
| 2000-07-17  | 5507200.0  | ROST    |
| 2000-07-17  | 7941200.0  | PG      |
+-------------+------------+---------+

But there is a better way to handle this type of data by updating the storage plugin in Drill Console. Update Storage Plugin to add more intelligence to CSV reader.

Launch Drill Web Console at http://localhost:8047

Go to Storage tab and click on Update button near to "dfs" section.

Update csv file as below and click on

"csv": {   
     "type": "text",   
     "extensions": ["csv"],   
     "skipFirstLine": false,   
     "extractHeader": true,   
     "delimiter": "," 
}

Now run the query again to see the difference.

0: jdbc:drill:zk=local> select *  from dfs.`/user/mapr/stocks` limit 10;
+-------------+------------+------------+------------+------------+------------+------------+---------+
|    date     |    open    |    high    |    low     |   close    |   volume   |  adjclose  | symbol  |
+-------------+------------+------------+------------+------------+------------+------------+---------+
| 2000-07-17  | 95.4375    | 97.5       | 92.75      | 96.625     | 3508100.0  | 74.269199  | XLNX    |
| 2000-07-17  | 22.625     | 22.75      | 22.4375    | 22.5625    | 201600.0   | 13.48614   | ES      |
| 2000-07-17  | 6.750002   | 6.937503   | 6.375      | 6.5        | 1235700.0  | 5.241649   | CHK     |
| 2000-07-17  | 19.812501  | 20.1875    | 19.500001  | 20.1875    | 1434100.0  | 3.806147   | NI      |
| 2000-07-17  | 30.5       | 30.6875    | 30.0       | 30.03125   | 254600.0   | 19.81183   | SNA     |
| 2000-07-17  | 44.749996  | 45.062498  | 44.500004  | 45.000009  | 535200.0   | 17.400773  | FOXA    |
| 2000-07-17  | 19.625     | 19.625     | 19.25      | 19.375     | 309500.0   | 13.768835  | R       |
| 2000-07-17  | 16.6562    | 16.6875    | 16.125     | 16.25      | 5507200.0  | 1.755466   | ROST    |
| 2000-07-17  | 56.25      | 57.25      | 56.0625    | 56.125     | 7941200.0  | 18.31076   | PG      |
| 2000-07-17  | 54.000326  | 54.000326  | 52.500318  | 53.375325  | 3725000.0  | 71.068871  | TYC     |
+-------------+------------+------------+------------+------------+------------+------------+---------+

Randomly select 10 rows from the stocks dataset.

0: jdbc:drill:zk=local> select * from dfs.`/user/mapr/stocks` order by rand() limit 10;
+-------------+-------------+-------------+-------------+-------------+-------------+---------+------------+
|  adjclose   |    close    |    date     |    high     |     low     |    open     | symbol  |   volume   |
+-------------+-------------+-------------+-------------+-------------+-------------+---------+------------+
| 86.41385    | 86.980003   | 2015-12-07  | 87.57       | 85.110001   | 85.440002   | EL      | 2066300.0  |
| 80.143503   | 90.43       | 2012-10-25  | 90.800003   | 89.68       | 90.629997   | XOM     | 1.03872E7  |
| 43.311197   | 48.16       | 2009-05-15  | 48.560001   | 47.5        | 48.360001   | AMGN    | 7823600.0  |
| 4.215844    | 17.170007   | 2002-01-17  | 17.170007   | 16.920003   | 17.09999    | OKE     | 263100.0   |
| 19.39638    | 23.74       | 2003-07-01  | 24.08       | 23.129999   | 24.08       | AON     | 2770400.0  |
| 16.955152   | 79.790001   | 2004-04-26  | 80.68       | 79.699997   | 80.050003   | WFM     | 1648000.0  |
| 29.671386   | 30.43       | 2013-09-18  | 30.540001   | 29.92       | 30.299999   | TSN     | 3449800.0  |
| 101.900002  | 101.900002  | 2014-07-15  | 103.519997  | 101.690002  | 103.059998  | WAT     | 512100.0   |
| 17.676357   | 19.17       | 2008-02-14  | 19.85       | 19.0        | 19.809999   | LEN     | 3864600.0  |
| 4.933124    | 25.379999   | 2002-11-07  | 26.51       | 25.219999   | 26.35       | CMI     | 2774000.0  |
+-------------+-------------+-------------+-------------+-------------+-------------+---------+------------+

Working with Json Data

Load this small file to /user/mapr/ directory in HDFS.

0: jdbc:drill:zk=local> select source from dfs.`/user/mapr/tweets.small.json` limit 10;
+---------------------------------------------------------------------------------------+
|                                        source                                         |
+---------------------------------------------------------------------------------------+
| <a href="http://twitter.com/download/android" rel="nofollow">Twitter for Android</a>  |
| <a href="https://about.twitter.com/products/tweetdeck" rel="nofollow">TweetDeck</a>   |
| <a href="http://twitter.com/download/android" rel="nofollow">Twitter for Android</a>  |
| <a href="http://www.tweetedtimes.com" rel="nofollow">The Tweeted Times</a>            |
| <a href="http://bufferapp.com" rel="nofollow">Buffer</a>                              |
| <a href="http://www.hubspot.com/" rel="nofollow">HubSpot</a>                          |
| <a href="http://www.linkedin.com/" rel="nofollow">LinkedIn</a>                        |
| <a href="http://www.hubspot.com/" rel="nofollow">HubSpot</a>                          |
| <a href="https://about.twitter.com/products/tweetdeck" rel="nofollow">TweetDeck</a>   |
| <a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>                    |
+---------------------------------------------------------------------------------------+
10 rows selected (0.196 seconds)

Working with Hive Tables

Enable and update hive storage plugin using Drill Console http://localhost:8047. You can the hive related configuration in hive-site.xml inside hive conf directory.

{
  "type": "hive",
  "enabled": true,
  "configProps": {
    "hive.metastore.uris": "thrift://localhost:9083",
    "javax.jdo.option.ConnectionURL": "jdbc:derby:;databaseName=../sample-data/drill_hive_db;create=true",
    "hive.metastore.warehouse.dir": "/user/hive/warehouse",
    "fs.default.name": "file:///",
    "hive.metastore.sasl.enabled": "false"
  }
}


If you have enabled mysql meta store for hive, use the following plugin configuration for hive. Also, you have to provide the jdbc connector jar to apache drill. Go to Drill directory and run the following statement and restart Drillbit.

cd $DRILL_HOME
ln -s /opt/mapr/lib/mysql-connector-java-5.1.25.jar jars/3rdparty/
ln -s /opt/mapr/lib/maprfs-6.0.1-mapr.jar jars/3rdparty/
ln -s /opt/mapr/lib/mapr-hbase-6.0.1-mapr.jar jars/3rdparty/
ln -s /opt/mapr/lib/maprdb-6.0.1-mapr.jar jars/3rdparty/
ln -s /opt/mapr/hive/hive-2.1/conf/hive-site.xml conf/


(Following changes have to be done at the plugin configuration in Drill Web UI console)

{
  "type": "hive",
  "enabled": true,
  "configProps": {
    "hive.metastore.uris": "thrift://localhost:9083",
    "javax.jdo.option.ConnectionDriverName": "com.mysql.jdbc.Driver",
    "javax.jdo.option.ConnectionURL": "jdbc:mysql://localhost:3306/hive",
    "javax.jdo.option.javax.jdo.option.ConnectionUserName": "hive",
    "javax.jdo.option.ConnectionPassword": "Mapr@123",
    "hive.metastore.warehouse.dir": "/user/hive/warehouse",
    "fs.default.name": "maprfs:///",
    "hive.metastore.sasl.enabled": "false"
  }
}

After you enable hive in Drill Console, Drill can access hive table definitions. See the schemas the Drill can access now.

0: jdbc:drill:zk=local> show schemas;
+---------------------+
|     SCHEMA_NAME     |
+---------------------+
| INFORMATION_SCHEMA  |
| cp.default          |
| dfs.default         |
| dfs.root            |
| dfs.tmp             |
| hive.default        |
| sys                 |
+---------------------+

Switch to hive.default schema.

0: jdbc:drill:zk=local> use `hive.default`;
+-------+-------------------------------------------+
|  ok   |                  summary                  |
+-------+-------------------------------------------+
| true  | Default schema changed to [hive.default]  |
+-------+-------------------------------------------+

See tables inside hive.default schema.

0: jdbc:drill:zk=local> show tables;
+---------------+-------------+
| TABLE_SCHEMA  | TABLE_NAME  |
+---------------+-------------+
| hive.default  | movies      |
| hive.default  | stocks      |
+---------------+-------------+
2 rows selected (0.967 seconds)

Reference: http://drill.apache.org/docs/tutorials-introduction/



Query RDBMS sources

We can query RDBMS data sources by using jdbc plugin. It requires jar file for JDBC connection to the RDBMS database and connection configuration like connection string, credentials etc. Add jdbc driver for mysql and restart drillbit.

$ cd $DRILL_HOME/jars/3rdparty
$ ln -s /opt/mapr/lib/mysql-connector-java-5.1.25.jar 

Create a new storage plugin called mysql as below.

{
  "type": "jdbc",
  "driver": "com.mysql.jdbc.Driver",
  "url": "jdbc:mysql://localhost:3306?useSSL=false",
  "username": "root",
  "password": "mapr",
  "enabled": true
} 


0: jdbc:drill:zk=local> show databases;

+---------------------------+
|        SCHEMA_NAME        |
+---------------------------+
| cp.default                |
| dfs.default               |
| dfs.mapr_user             |
| dfs.root                  |
| dfs.tmp                   |
| hive.default              |
| INFORMATION_SCHEMA        |
| mysql.employees           |
| mysql.hive                |
| mysql.information_schema  |
| mysql.mysql               |
| mysql.performance_schema  |
| mysql.sys                 |
| mysql                     |
| sys                       |
+---------------------------+


Query the data of mysql table from Drill.

0: jdbc:drill:zk=local> use mysql.employees;
+-------+----------------------------------------------+
|  ok   |                   summary                    |
+-------+----------------------------------------------+
| true  | Default schema changed to [Mysql.employees]  |
+-------+----------------------------------------------+
1 row selected (0.066 seconds)
0: jdbc:drill:zk=local> select * from employees limit 3;
+---------+-------------+-------------+------------+---------+-------------+
| emp_no  | birth_date  | first_name  | last_name  | gender  |  hire_date  |
+---------+-------------+-------------+------------+---------+-------------+
| 10001   | 1953-09-02  | Georgi      | Facello    | M       | 1986-06-26  |
| 10002   | 1964-06-02  | Bezalel     | Simmel     | F       | 1985-11-21  |
| 10003   | 1959-12-03  | Parto       | Bamford    | M       | 1986-08-28  |
+---------+-------------+-------------+------------+---------+-------------+
3 rows selected (0.479 seconds)

Note: if you want to query other RDBMS, drop the required jar files to jars/3rdparty directory of the apache drill or add to CLASSPATH.


Join two data sources - mysql and HDFS

Import a mysql table into HDFS as parquet file. Parquet file contains the schema that Drill can advantage of.

$ sqoop import --connect jdbc:mysql://localhost:3306/employees --username root --password mapr --table salaries --target-dir /user/mapr/salaries -m 1 --as-parquetfile


Create a view that joins two datasets - employees table in mysql and salaries in HDFS.

0: jdbc:drill:zk=local> use dfs.tmp;
0: jdbc:drill:zk=local> create view employee_salary as select * from dfs.root.`/user/mapr/salaries` t1 join mysql.employees.employees t2 on t1.emp_no = t2.emp_no;


Run a query on the view

0: jdbc:drill:zk=local> select emp_no, first_name, salary from employee_salary limit 3;
+---------+-------------+---------+
| emp_no  | first_name  | salary  |
+---------+-------------+---------+
| 10001   | Georgi      | 60117   |
| 10001   | Georgi      | 62102   |
| 10001   | Georgi      | 66074   |
+---------+-------------+---------+
3 rows selected (1.322 seconds)



Query HBase /MapR-DB table using Drill


Create a table called product having a single column family info. Insert a record having two columns - author and page.

hbase> create 'product', 'info'
hbase> put 'product', 'harry potter', 'info:author', 'JKRowling'
hbase> put 'product', 'harry potter', 'info:page', '500'


Start Drill. Enable HBase storage plugin usong Drill Web UI (http://localhost:8047)

{
  "type": "hbase",
  "config": {
    "hbase.zookeeper.quorum": "maprdemo",
    "hbase.zookeeper.property.clientPort": "5181"
  },
  "size.calculator.enabled": false,
  "enabled": true
}

Run the query against the Hbase table.

0: jdbc:drill:drillbit=localhost> select cast(row_key as varchar),cast(info['author'] as varchar) author , cast(info['page'] as int) page from dfs.root.`/tables/product`;
+----------------+------------+-------+
|     EXPR$0     |   author   | page  |
+----------------+------------+-------+
| harry potter  | JKRowling  | 500   |
+----------------+------------+-------+
1 row selected (0.181 seconds)