Data Virtualization Using Spark

Load entire database details in Spark SQL in a single go

Load all tables from a mysql database into Spark so they can be queried using Spark SQL. You can select a subset of tables as well.

import java.sql.DriverManager
import java.sql.Connection
val username = "root"
val password = "cloudera"
val db = "retail_db"

val url = s"jdbc:mysql://localhost/$db"

val connection = DriverManager.getConnection(url, username, password)
val statement = connection.createStatement()
val resultSet = statement.executeQuery(s"show tables from $db")
while ({
   val table = resultSet.getString(1)
   println(s"Registering $table table to spark sql as my_$table")"jdbc")
   .option("url", url)
   .option("driver", "com.mysql.jdbc.Driver")
   .option("dbtable", table)
   .option("user", username)
   .option("password", password)
   .createOrReplaceTempView("my_" + table)
println("Tables registered in Spark. Name of a mysql table starts with my_: ")
sql("show tables").show

Import Data from mysql into HDFS and build a hive table

Import orders dataset into HDFS as parquet format using sqoop.

$ sqoop import \

--connect jdbc:mysql://localhost:3306/retail_db \

--username root \

--hive-import \

--password cloudera \

--table orders \

--target-dir /user/cloudera/orders \

--as-parquetfile \

--compress --compression-codec snappy \

-m 1

You could also import this table data and schema using schema using Spark. However, sqoop is beneficial since it is command line driven and you can consider building incremental sqoop job as describe here.

The above sqoop import statement will import data into HDFS /user/hive/warehouse/orders dir and create a hive managed table, called orders.

Provided hive metastore is connected with Spark, it will access all tables defined in hive. So you can join in Spark SQL between two tables - one staying in HDFS and one in mysql.

scala> sql("select * from orders t1 join my_customers t2 where t1.order_customer_id = t2.customer_id").show

You can extend this mechanism to other data sources or multiple tables.