Working with AWS S3 Storage Using Spark

Start spark-shell with the AWS SDK and hadoop-aws packages:

$ spark-shell --packages com.amazonaws:aws-java-sdk-pom:1.10.34,org.apache.hadoop:hadoop-aws:2.6.0

Set the AWS access key and secret access key in the Hadoop configuration:

scala> val sc = spark.sparkContext
scala> sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "<access_key>")
scala> sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "<access_key_secret>")
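On newer Hadoop versions (hadoop-aws 2.7 and later), the s3a connector is generally recommended over the older s3n connector shown above. A sketch of the equivalent configuration, assuming the newer fs.s3a property names:

scala> sc.hadoopConfiguration.set("fs.s3a.access.key", "<access_key>")
scala> sc.hadoopConfiguration.set("fs.s3a.secret.key", "<access_key_secret>")

With s3a, bucket paths use the s3a:// scheme instead of s3n://.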

Load a file from S3 as a DataFrame:

scala> val df = spark.read.format("csv").load("s3n://einext.com/data/Olympics2016.csv")
scala> df.show
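By default the CSV reader treats every column as a string and reads the first row as data. If the file has a header row, the standard header and inferSchema options can be passed; a sketch against the same sample path:

scala> val df = spark.read.format("csv")
     |   .option("header", "true")
     |   .option("inferSchema", "true")
     |   .load("s3n://einext.com/data/Olympics2016.csv")
scala> df.printSchema

With inferSchema enabled, Spark makes an extra pass over the data to assign column types, so expect slower reads on large files.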