Using Spark to Read from S3

Add the following package to spark-defaults.conf:

spark.jars.packages org.apache.hadoop:hadoop-aws:2.6.0

Or launch Spark with command-line options:

$ spark-shell --packages org.apache.hadoop:hadoop-aws:2.6.0
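The same credentials that are set in code below can also be passed at launch; Spark copies any property prefixed with spark.hadoop. into the Hadoop configuration. A sketch with placeholder keys:

$ spark-shell --packages org.apache.hadoop:hadoop-aws:2.6.0 \
    --conf spark.hadoop.fs.s3.impl=org.apache.hadoop.fs.s3native.NativeS3FileSystem \
    --conf spark.hadoop.fs.s3.awsAccessKeyId=<aws key> \
    --conf spark.hadoop.fs.s3.awsSecretAccessKey=<aws secret>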

Read from S3 using Spark RDD

Set the S3 filesystem implementation and credentials on the SparkContext's Hadoop configuration, then read the path into an RDD:

val hadoopConf = sc.hadoopConfiguration
hadoopConf.set("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
hadoopConf.set("fs.s3.awsAccessKeyId", "<aws key>")
hadoopConf.set("fs.s3.awsSecretAccessKey", "<aws secret>")

val rdd = sc.textFile("s3://aws-bigdata-bootcamp/data/orders/")

Read from S3 using Spark DataFrame

val path = "s3://aws-bigdata-bootcamp/data/orders-full/"

// Load the CSV files and assign column names
val df = spark
  .read
  .format("csv")
  .load(path)
  .toDF("customer_id", "order_id", "description", "product_id",
    "unitprice", "quantity", "extended_price", "line_tax")

// Cast the string columns to the proper types
val df2 = df.selectExpr(
  "cast(customer_id as int)", "cast(order_id as int)",
  "description", "cast(product_id as int)",
  "cast(unitprice as double)", "cast(quantity as int)",
  "cast(extended_price as double)", "cast(line_tax as double)")

// Write the result back to S3 (default output format is Parquet)
df2.write.save("s3://emr.einext.com/data/orders")
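A quick, optional sanity check to confirm the casts took effect before writing:

df2.printSchema()
df2.show(5)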


If you have already given the EC2 instance an IAM role with permission to access the S3 bucket, then you do not have to put access key details in the Spark code.

spark.read.text("s3a://data.einext.com/stocks/stocks.csv.gz").show()
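For CSV data, the same role-based access works with the DataFrame reader. The options below are a sketch, assuming the file has a header row and you want Spark to infer column types:

val stocks = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("s3a://data.einext.com/stocks/stocks.csv.gz")
stocks.printSchema()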