Spark to Read from S3
Add the following package entry to spark-defaults.conf
spark.jars.packages org.apache.hadoop:hadoop-aws:2.6.0
Or launch Spark with command-line options
$ spark-shell --packages org.apache.hadoop:hadoop-aws:2.6.0
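Credentials can also be passed at launch time instead of being set in code: Spark copies any property prefixed with spark.hadoop. into the Hadoop configuration. A minimal sketch, assuming the same fs.s3 property names used in the snippets below:
$ spark-shell --packages org.apache.hadoop:hadoop-aws:2.6.0 \
    --conf spark.hadoop.fs.s3.awsAccessKeyId=<aws key> \
    --conf spark.hadoop.fs.s3.awsSecretAccessKey=<aws secret>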
Read from S3 using Spark RDD
// Map the s3:// scheme onto the S3 native filesystem and supply AWS credentials
val hadoopConf = sc.hadoopConfiguration
hadoopConf.set("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
hadoopConf.set("fs.s3.awsAccessKeyId", "<aws key>")
hadoopConf.set("fs.s3.awsSecretAccessKey", "<aws secret>")
val rdd = sc.textFile("s3://...")
For example:
val hadoopConf = sc.hadoopConfiguration
hadoopConf.set("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
hadoopConf.set("fs.s3.awsAccessKeyId", "AKI...")
hadoopConf.set("fs.s3.awsSecretAccessKey", "dco...")
val rdd = sc.textFile("s3://aws-bigdata-bootcamp/data/orders/")
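A quick way to confirm the read works is to run a small action on the RDD; take and count are standard RDD operations:
// Print a few sample lines and the total record count
rdd.take(5).foreach(println)
println(rdd.count())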
Read from S3 using Spark DataFrame
val path = "s3://aws-bigdata-bootcamp/data/orders-full/"

// Load the CSV files and assign column names
val df = spark
  .read
  .format("csv")
  .load(path)
  .toDF("customer_id", "order_id", "description", "product_id",
    "unitprice", "quantity", "extended_price", "line_tax")

// Cast the string columns to their proper types
val df2 = df.selectExpr("cast(customer_id as int)", "cast(order_id as int)",
  "description", "cast(product_id as int)",
  "cast(unitprice as double)", "cast(quantity as int)",
  "cast(extended_price as double)", "cast(line_tax as double)")

df2.write.save("s3://emr.einext.com/data/orders")
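DataFrameWriter.save writes Parquet by default. To make the output format and overwrite behavior explicit, a variant such as the following can be used (same output path as above):
// Same write, with an explicit output format and save mode
df2.write
  .mode("overwrite")
  .format("parquet")
  .save("s3://emr.einext.com/data/orders")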
If you have already granted the EC2 instance an IAM role with permission to access the S3 bucket, you do not have to put access key details in the Spark code.
spark.read.text("s3a://data.einext.com/stocks/stocks.csv.gz").show()
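Spark decompresses .gz files transparently, so the same file can also be loaded as a CSV DataFrame. A sketch, where the header and inferSchema options are assumptions about the file's layout:
// Read the gzipped CSV directly into a DataFrame
val stocks = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("s3a://data.einext.com/stocks/stocks.csv.gz")
stocks.show(5)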