Convert String to Timestamp for SparkSQL

Objective:

  • Convert a datetime string to Timestamp, which is compatible with Spark SQL

import java.text.SimpleDateFormat

import java.util.Locale

import java.sql.Timestamp // Since java.util.Date is not supported in Spark SQL

def sf = new SimpleDateFormat( "EEE MMM dd HH:mm:ss ZZZZZ yyyy", Locale.ENGLISH)

sf.setLenient(true)

def parseDatetimeString(dateStr: String): Timestamp = {

new Timestamp(sf.parse(dateStr).getTime)

}

View more on date format https://docs.oracle.com/javase/7/docs/api/java/text/SimpleDateFormat.html

parseDatetimeString("Thu Jul 07 22:59:41 +0000 2016")

Same way, if you want to remove the time portion, use java.sql.Date class rather than java.sql.Timestamp. Rest remains same.

Now, register this function as UDF so that you can use it in Dataframe function or Spark SQL

import org.apache.spark.sql.functions.udf

def toTimestamp = udf(parseDatetimeString _)

sqlContext.udf.register("toTimestamp", parseDatetimeString _)