What is Apache Spark
A popular open source framework for distributed parallel processing for so called "big data" - the data that cannot practically be processed using a single machine. Spark can run on Hadoop YARN cluster manager. Spark was built with hindsight of other popular distributed processing framework such as Hadoop MapReduce. The MR framework did not take full advantage of the available memory at the cluster level. Researchers at Berkley University released a paper Resilient Distributed Dataset that provided an approach to take advantage of the shared memory at the cluster level. Since memory IO is order of magnitude faster that disk IO, memory based computation becomes a lot faster. For a few types iterative processing effect is as much 10-100x speed gain.
Why Spark is fast?
- In Memory Computation: While Hadoop leverage distributed storage and computing power, Spark started its journey (read the RDD research paper from Berkley) with a goal to create a shared memory architecture. Essentially, it allows developer to cache data in memory that allows processing data of memory rather than of disk. Just to put the in different in perspective, to sequentially read 1 MB data from disk takes 20 milliseconds, while that from memory takes 250 microseconds that is 1/80 of the former. However, not all computation is Spark happen in memory. It is up to the developer to control which data to keep it it in memory. By leveraging in memory computation Spark performs 10x - 100x faster than Hadoop map reduce program.
- DAG Optimization: Spark allows the developer to write a chain of operations on dataset that are lazily computed. The series of steps forms a computation graph known as DAG (Directed Acyclic Graph). Spark finds an optimized execution path in order to execute the DAG. This is most optimized that manually determining the order of the steps as known as Tasks. Spark has an optimizer known as Catalyst engine that performs logical and physical optimization of query plan.
- High Performance Data Structure: Spark has 3 main data structures - RDD (Resilience Distributed Dataset, Dataframe, and Dataset). Dataframe and Dataset are highly optimized for in-memory storage and processing. Internally, they are form of columnar compressed data structure that requires less memory for storage compared to RDD (as much as 80% less) and faster compared to RDD in order of magnitude.
Popular Use Cases of Apache Spark
Survey report 2016 that shows popularity, typical workload of Spark.