Thrift Service on Spark SQL and JDBC

Key Features of the Thrift Server:

  • The Thrift server allows multiple Spark sessions to run while sharing cached tables. You can turn off multi-session mode by setting spark.sql.hive.thriftServer.singleSession to true (see the example after this list).

  • You can connect standard BI tools such as Tableau and MicroStrategy to Spark.

  • You can also use this as a gateway to access Hive tables. To access Hive tables, put core-site.xml, hdfs-site.xml, and hive-site.xml (or link them) inside $SPARK_HOME/conf.

  • Note that $SPARK_HOME/bin/spark-sql is not the same as the Thrift server. The spark-sql CLI runs only in local mode, whereas the Thrift server runs in distributed mode.
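For example, to turn off multi-session mode, pass the property when starting the server. This is a minimal sketch; --conf is simply forwarded to spark-submit, and the rest of the command line matches the one used later in this guide.

$ $SPARK_HOME/sbin/start-thriftserver.sh --conf spark.sql.hive.thriftServer.singleSession=true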

Pre-requisites

  • Compile Spark with support for Hive and Hadoop, and optionally SparkR and PySpark (see the build sketch after this list).

  • To use the Thrift server you do not need a Hadoop or Hive configuration; just the libraries are required.
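For reference, a typical build command looks something like the one below. This is only a sketch; the Maven profiles you need depend on the Spark and Hadoop versions you are targeting.

$ ./build/mvn -Pyarn -Phadoop-2.6 -Phive -Phive-thriftserver -DskipTests clean package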

Start the Thrift service for Spark SQL

1. Start the Thrift server

$ $SPARK_HOME/sbin/start-thriftserver.sh --hiveconf hive.server2.thrift.port=10000 --hiveconf hive.server2.thrift.bind.host=0.0.0.0
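The start script prints the path of its log file; tailing that file is a quick way to confirm the server is up before connecting. The exact file name (shown here with wildcards) depends on your user and host names.

$ tail -f $SPARK_HOME/logs/spark-*-org.apache.spark.sql.hive.thriftserver.HiveThriftServer2-*.out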

2. Launch beeline

$ $SPARK_HOME/bin/beeline -u jdbc:hive2://localhost:10000/default

0: jdbc:hive2://localhost:10000> create temporary table people using org.apache.spark.sql.json options(path "/Volumes/SONY/Data/people.json");

0: jdbc:hive2://localhost:10000> cache table people;

0: jdbc:hive2://localhost:10000> select * from people limit 10;

+-------------+------------+---------+-------------+-------------+---------+--------------+
|     dob     | firstname  | gender  |  lastname   | middlename  | salary  |     ssn      |
+-------------+------------+---------+-------------+-------------+---------+--------------+
| 1940-08-06  | Dacia      | F       | Samborski   | Rosella     | 274357  | 932-39-7400  |
| 1964-01-31  | Loria      | F       | Cassino     | Suzie       | 166618  | 940-40-2137  |
| 1936-06-02  | Lashaunda  | F       | Rockhill    | Markita     | 185766  | 923-83-5563  |
| 1971-09-25  | Candace    | F       | Goike       | Marcy       | 92497   | 935-40-2967  |
| 1926-06-29  | Marhta     | F       | Bonin       | Filomena    | 40013   | 968-22-1158  |
| 1951-04-27  | Rachel     | F       | Mcmonigle   | Gwyn        | 211468  | 926-47-4803  |
| 2012-09-08  | Lorine     | F       | Bilous      | Valencia    | 26612   | 992-10-1262  |
| 1926-04-25  | Alene      | F       | Somji       | Berniece    | 74027   | 989-16-1381  |
| 1930-07-01  | Sadye      | F       | Morrisseau  | Mara        | 209278  | 971-50-8157  |
| 1962-08-26  | Shawn      | F       | Battisti    | Reginia     | 190167  | 993-42-5846  |
+-------------+------------+---------+-------------+-------------+---------+--------------+

Connect to Spark Thrift Service using JDBC

Now, let's connect SQL Workbench (you can choose any other SQL client that supports JDBC drivers).

Step 1. Download hadoop-common-2.6.4.jar and hive-jdbc-2.1.0-standalone.jar. You can choose to download the versions that are relevant for you. You can find these in the Hadoop and Hive binary distributions respectively.

Step 2. Create a driver in SQL Workbench by adding the above jars and choosing org.apache.hive.jdbc.HiveDriver as the driver class. Give the driver a name such as "hive". Now you are ready to connect.

Step 3. Create a connection using the above driver.

Connection string: jdbc:hive2://localhost:10000/default

Username: your local machine user (does not matter)

Password: keep it blank

Step 4. You should now be logged in. Have fun with SQL!
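If you want to sanity-check the same settings from the command line first, beeline accepts the username via -n (any value works here) and the password can simply be omitted:

$ $SPARK_HOME/bin/beeline -u jdbc:hive2://localhost:10000/default -n $(whoami)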

Process CSV Files over the Thrift Service using the spark-csv Package

The purpose of this module is to load CSV files using the spark-csv package (https://github.com/databricks/spark-csv) inside a SQL client connected over the Spark Thrift service.

1. First, stop the Thrift service if it is already running.

$ $SPARK_HOME/sbin/stop-thriftserver.sh

Verify that the Thrift service has stopped properly using the Spark History Server (http://localhost:18080).

2. Start the Thrift service, passing the extra package details to the command as below.

$ $SPARK_HOME/sbin/start-thriftserver.sh --packages com.databricks:spark-csv_2.10:1.4.0 --master local[*]

Now your Thrift service is ready with the package needed to handle CSV files.

3. Launch SQL Workbench or beeline as described above. Run the following command to create a table. A table created this way is registered in the Hive metastore and is persistent (not temporary). Note that you need hive-site.xml in the Spark conf directory. A sample hive-site.xml is attached along with the people.csv file.

CREATE TABLE people(firstname STRING, middlename STRING, lastname STRING, gender STRING, dob Date, salary INT, ssn STRING)

USING com.databricks.spark.csv

OPTIONS (path "/Volumes/SONY/Data/people.txt", header "false", inferSchema "false", delimiter ":")

Now you can write SQL statements. Optionally, you can cache the table. Once it is cached, other SQL sessions connected to the Thrift service will be able to take advantage of the cached data as well.

CACHE TABLE PEOPLE;

SELECT * FROM people LIMIT 5;
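Because the table is registered in the Hive metastore, it survives a Thrift server restart. You can confirm it is there with standard Spark SQL commands:

SHOW TABLES;

DESCRIBE people;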

Create Spark SQL Table On JDBC Source

$ wget https://dev.mysql.com/get/Downloads/Connector-J/mysql-connector-java-5.1.39.tar.gz

$ tar -zxf mysql-connector-java-5.1.39.tar.gz

Move the mysql-connector-java-5.1.39/mysql-connector-java-5.1.39-bin.jar file to /usr/share/java/.

Launch the Thrift server with the connector jar. Alternatively, you can put the jar in the $SPARK_HOME/lib/ directory on each machine in the Spark cluster.

$ cd $SPARK_HOME

$ sbin/start-thriftserver.sh --jars /usr/share/java/mysql-connector-java-5.1.39-bin.jar

Now, you can connect with either beeline or SQL Workbench.

CREATE TEMPORARY TABLE customers

USING org.apache.spark.sql.jdbc

OPTIONS(

url 'jdbc:mysql://localhost:3306/employees',

driver 'com.mysql.jdbc.Driver',

dbtable 'employees',

user 'root',

password 'root',

fetchSize '1000');

Once the table is created, you can run queries against it.

beeline> select * from customers;

beeline> cache table customers;

Persisting Spark Table

To persist the tables created in the Thrift server, you need Hive configured. Create $SPARK_HOME/conf/hive-site.xml if you are running in standalone mode, or link /etc/hive/conf/hive-site.xml to $SPARK_HOME/conf/hive-site.xml. In standalone mode, create the Hive warehouse directory /user/hive/warehouse/ on the Linux file system. To persist a table, remove the word TEMPORARY from the CREATE TABLE command, as in the sketch below.
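For example, the customers table from the JDBC section above becomes persistent simply by dropping the TEMPORARY keyword. The sketch below reuses the same sample connection options from that section; the warehouse directory is the standalone-mode path mentioned above.

$ sudo mkdir -p /user/hive/warehouse   # standalone mode only

CREATE TABLE customers
USING org.apache.spark.sql.jdbc
OPTIONS(
url 'jdbc:mysql://localhost:3306/employees',
driver 'com.mysql.jdbc.Driver',
dbtable 'employees',
user 'root',
password 'root',
fetchSize '1000');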