Convert String to Timestamp for SparkSQL
Objective:
Convert a datetime string to a java.sql.Timestamp, the type Spark SQL expects for timestamp columns.
import java.text.SimpleDateFormat
import java.util.Locale
import java.sql.Timestamp // Since java.util.Date is not supported in Spark SQL
// Use val, not def: with def a fresh formatter is created on every reference,
// so setLenient below would only configure a throwaway instance
val sf = new SimpleDateFormat("EEE MMM dd HH:mm:ss ZZZZZ yyyy", Locale.ENGLISH)
sf.setLenient(true)
// Parse a datetime string such as "Thu Jul 07 22:59:41 +0000 2016" into a java.sql.Timestamp
def parseDatetimeString(dateStr: String): Timestamp = {
  new Timestamp(sf.parse(dateStr).getTime)
}
See the SimpleDateFormat documentation for the full list of pattern letters: https://docs.oracle.com/javase/7/docs/api/java/text/SimpleDateFormat.html
parseDatetimeString("Thu Jul 07 22:59:41 +0000 2016")
Similarly, if you want to drop the time portion, use java.sql.Date instead of java.sql.Timestamp; the rest of the code stays the same, as sketched below.
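A minimal sketch of that date-only variant, assuming the same formatter sf defined above (the name parseDateString is made up for illustration):

import java.sql.Date // date-only type; Spark maps it to a DATE column

// Parses the same string format; the time-of-day is ignored once stored as a DATE
def parseDateString(dateStr: String): Date = {
  new Date(sf.parse(dateStr).getTime)
}

parseDateString("Thu Jul 07 22:59:41 +0000 2016") // 2016-07-07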
Now register this function as a UDF so it can be used both in DataFrame operations and in Spark SQL queries.
import org.apache.spark.sql.functions.udf

// Wrapper for the DataFrame API (val so the UDF object is created once)
val toTimestamp = udf(parseDatetimeString _)

// Registration for use inside Spark SQL queries
sqlContext.udf.register("toTimestamp", parseDatetimeString _)
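As a rough usage sketch, assuming a DataFrame df with a string column named created_at (the DataFrame, column, and table names here are made up for illustration):

// DataFrame API: apply the udf wrapper to a column
val withTs = df.withColumn("created_ts", toTimestamp(df("created_at")))

// Spark SQL: refer to the registered name inside a query
df.registerTempTable("tweets")
sqlContext.sql("SELECT created_at, toTimestamp(created_at) AS created_ts FROM tweets")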