File Format
Save a CSV dataset in ORC, Parquet, Avro, XML, and other formats, and compare the sizes on disk.
Download https://data.sfgov.org/Public-Safety/Map-Crime-Incidents-from-1-Jan-2003/gxxq-x39z/data to a directory. I downloaded it to the /data directory.
Launch Spark. The Maven coordinates of the required packages are passed via --packages in the launch command below.
$ $SPARK_HOME/bin/spark-shell --packages com.databricks:spark-csv_2.10:1.4.0,com.databricks:spark-avro_2.10:2.0.1
Using the spark-csv package, load the CSV as a data frame.
scala> val incidents = sqlContext
.read
.format("com.databricks.spark.csv")
.options(Map("header" -> "true", "inferSchema" -> "true"))
.load("/data/Map__Crime_Incidents_-_from_1_Jan_2003.csv")
Save the incidents data frame as an Avro file.
scala> incidents.coalesce(1).write.format("com.databricks.spark.avro").save("/data/sfpd.avro")
You should see a part file ending in .avro inside the /data/sfpd.avro output directory.
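To verify the write, you can read the Avro output back with the same package and compare the row count (the avroCheck variable name below is just for illustration):
scala> val avroCheck = sqlContext.read.format("com.databricks.spark.avro").load("/data/sfpd.avro")
scala> avroCheck.count()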
In the same way, you can export to ORC, Parquet, JSON, and XML. For XML you need one extra Spark package, similar to the CSV package: com.databricks:spark-xml_2.10:0.3.3 (see the launch example after these write commands).
$ incidents.coalesce(1).write.format("orc").save("/data/sfpd.orc")
$ incidents.coalesce(1).write.format("parquet").save("/data/sfpd.parquet")
$ incidents.coalesce(1).write.format("json").save("/data/sfpd.json")
$ incidents.coalesce(1).write.format("xml").save("/data/sfpd.xml")
Parquet recommends a block size of 512 MB or 1024 MB rather than the default 128 MB. You can set the block size while writing the file to HDFS.
scala> incidents.write.option("parquet.block.size", (1024 * 1024 * 1024).toString).mode("overwrite").format("parquet").save("/data/sfpd.parquet")
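If the per-write option does not take effect in your Spark version, an alternative sketch is to set the same property on the SparkContext's Hadoop configuration before writing (this assumes the sc provided by the shell):
scala> sc.hadoopConfiguration.setInt("parquet.block.size", 1024 * 1024 * 1024)
scala> incidents.write.mode("overwrite").format("parquet").save("/data/sfpd.parquet")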
Clearly, ORC and Parquet are more space efficient than the other formats.
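To check this yourself, you can compare the on-disk sizes from the same shell session with the Hadoop FileSystem API; a minimal sketch, assuming the output paths used above:
scala> import org.apache.hadoop.fs.{FileSystem, Path}
scala> val fs = FileSystem.get(sc.hadoopConfiguration)
scala> Seq("/data/sfpd.avro", "/data/sfpd.orc", "/data/sfpd.parquet", "/data/sfpd.json", "/data/sfpd.xml").foreach { p =>
         println(p + " -> " + fs.getContentSummary(new Path(p)).getLength + " bytes")
       }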