Reading from S3 with Spark

Add the following package to spark-defaults.conf:

spark.jars.packages   org.apache.hadoop:hadoop-aws:2.6.0

Or launch Spark with command-line options:

$ spark-shell --packages org.apache.hadoop:hadoop-aws:2.6.0
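
The same --packages option works with spark-submit as well. A minimal sketch, assuming a hypothetical application jar my-app.jar whose main class is com.example.OrdersJob:

$ spark-submit --class com.example.OrdersJob --packages org.apache.hadoop:hadoop-aws:2.6.0 my-app.jar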

Read from S3 using Spark RDD

val hadoopConf = sc.hadoopConfiguration
hadoopConf.set("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
hadoopConf.set("fs.s3.awsAccessKeyId", "<aws key>")
hadoopConf.set("fs.s3.awsSecretAccessKey", "<aws secret>")

val rdd = sc.textFile("s3://aws-bigdata-bootcamp/data/orders/")
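
A quick sanity check that the RDD is readable (the output naturally depends on what is in the bucket):

rdd.take(5).foreach(println)   // print the first few lines
println(rdd.count())           // number of lines read from S3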

Read from S3 using Spark DataFrame

val path = "s3://aws-bigdata-bootcamp/data/orders-full/"
val df = spark
  .read
  .format("csv")
  .load(path)
  .toDF("customer_id","order_id","description","product_id","unitprice","quantity","extended_price","line_tax")

The CSV reader loads every column as a string by default, so cast the columns to their proper types:

val df2 = df.selectExpr(
  "cast(customer_id as int)", "cast(order_id as int)",
  "description", "cast(product_id as int)",
  "cast(unitprice as double)", "cast(quantity as int)",
  "cast(extended_price as double)", "cast(line_tax as double)")
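
A quick check that the casts took effect (the output depends on the data):

df2.printSchema()
df2.show(5)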

df2.write.save("s3://emr.einext.com/data/orders")
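
save() without an explicit format writes Parquet by default. To make the output format and the overwrite behaviour explicit, something along these lines works:

df2.write
  .mode("overwrite")
  .format("parquet")
  .save("s3://emr.einext.com/data/orders")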


If you have already given the EC2 instance an IAM role with permission to access the S3 bucket, then you do not have to put the access key details in the Spark code.

spark.read.text("s3a://data.einext.com/stocks/stocks.csv.gz").show()
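
If the stocks file is a CSV with a header row (an assumption here, not stated above), it can also be loaded as a DataFrame with inferred column types:

val stocks = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("s3a://data.einext.com/stocks/stocks.csv.gz")

stocks.printSchema()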