Spark SQL over REST API
The primary goal is to submit a SQL query to a Spark cluster and receive the result set as a collection of JSON objects with low latency (single-digit seconds).
The service is exposed as a REST endpoint, which can be consumed by any BI tool.
Spark over the Thrift service is a cleaner solution, but due to security concerns around query impersonation, the Thrift service may not be viable in a security-enabled Hive ecosystem.
This solution can be launched on any Spark cluster (on YARN or standalone).
Further work is required to make this REST service secure (authentication and authorization).
Alternative solutions
Create a Scala project with your favourite IDE.
One notable dependency is Finagle, an open-source, high-performance RPC/HTTP library for Scala developed at Twitter and used in production at twitter.com. It is used here to create the REST service.
"com.twitter" %% "finagle-http" % "6.35.0",
"org.apache.spark" %% "spark-core" % "1.6.1",
"org.apache.spark" %% "spark-sql" % "1.6.1",
"log4j" % "log4j" % "1.2.14"
Create a Scala object that starts the HTTP server, accepts the SQL query from the request, runs it through Spark SQL, and returns the result as JSON.
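A minimal sketch of such an object, assuming the query arrives as an `sql` request parameter (the object name, parameter name, and error payload below are illustrative, not prescribed by the original):

```scala
import com.twitter.finagle.{Http, Service}
import com.twitter.finagle.http.{Request, Response, Status}
import com.twitter.util.{Await, Future}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object SparkSqlRestService {
  def main(args: Array[String]): Unit = {
    // Master URL and other settings are taken from spark-submit.
    val sc = new SparkContext(new SparkConf().setAppName("spark-sql-rest"))
    val sqlContext = new SQLContext(sc)

    val service = new Service[Request, Response] {
      def apply(req: Request): Future[Response] = {
        val response = Response()
        req.params.get("sql") match {
          case Some(query) =>
            // Run the query, collect the rows, and serialize them as a JSON array.
            val json = sqlContext.sql(query).toJSON.collect().mkString("[", ",", "]")
            response.setContentTypeJson()
            response.contentString = json
          case None =>
            response.status = Status.BadRequest
            response.contentString = """{"error":"missing sql parameter"}"""
        }
        Future.value(response)
      }
    }

    // Bind the Finagle HTTP server on port 8080 and block until shutdown.
    val server = Http.serve(":8080", service)
    Await.ready(server)
  }
}
```

Note that `collect()` pulls the whole result set to the driver, so this pattern is only appropriate for queries that return modest result sizes.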
Export the project as a .jar. While creating the jar, do not include the dependencies; they are not necessary.
Now submit the jar to the cluster. Submit in "client" deploy mode so that the driver runs on the submitting machine and receives updates from the running jobs. You do not need to include the Spark and log4j packages; those are already on the cluster's classpath. In the example below, the application is submitted to a standalone cluster.
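A spark-submit invocation along these lines should work; the class name, jar name, master URL, and Scala version suffix are placeholders to adapt to your build and cluster:

```shell
# Finagle is not on the cluster classpath, so pull it in via --packages.
# Spark 1.6.x is built against Scala 2.10 by default, hence the _2.10 suffix.
spark-submit \
  --class SparkSqlRestService \
  --master spark://master-host:7077 \
  --deploy-mode client \
  --packages com.twitter:finagle-http_2.10:6.35.0 \
  spark-sql-rest.jar
```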
Now you are ready to send SQL queries via REST calls. The service listens on port 8080 of the machine from which the Spark application was submitted. Using Postman or curl, submit the following GET request and inspect the output.
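For example, assuming the service reads the query from an `sql` parameter (the host and table name here are placeholders for your environment):

```shell
# The query must be URL-encoded; %20 encodes a space.
curl "http://driver-host:8080/?sql=SELECT%20*%20FROM%20my_table%20LIMIT%2010"
```

The response body is a JSON array, one object per result row.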