This blog post details Facebook’s migration from Hive to Spark for a particular distributed data processing workload.
Spark is a distributed compute engine that lets you write SQL or imperative code and have it run across a cluster with relatively few accommodations for the distributed setting.
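For instance, the same query can be written either way. A minimal sketch in a spark-shell session (where spark is predefined); the "features" table, its columns, and the filter condition are made up for illustration:

```scala
// Register a placeholder table; "features" and its columns are invented names
spark.read.parquet("hdfs:///path/to/features").createOrReplaceTempView("features")

// Declarative: plain SQL
val viaSql = spark.sql("SELECT entity_id, feature_value FROM features WHERE feature_value > 0")

// Imperative: the same query through the DataFrame API
val viaApi = spark.table("features")
  .select("entity_id", "feature_value")
  .filter("feature_value > 0")
```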
Example pipeline (Hive):
Filter ->
Group ->
Reduce
In this writeup, the core task was to aggregate feature values for each (entity_id, target_id) pair and shard the output by entity_id, as sketched below:
entity_id, target_id, feature_id, feature_value
->
// The aggregation could be something like this (it's not explicit in the writeup):
entity_id, target_id: <average, sum, ...> of feature_value per feature_id
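A rough sketch of that aggregation with Spark's DataFrame API (spark-shell style, with spark predefined); the input path and the choice of avg are placeholders I'm assuming, since the writeup doesn't spell them out:

```scala
import org.apache.spark.sql.functions.avg

// Input rows: entity_id, target_id, feature_id, feature_value
val features = spark.read.parquet("hdfs:///path/to/features") // placeholder path

// One aggregated feature_value per (entity_id, target_id, feature_id);
// avg stands in for whatever aggregate the production pipeline actually used
val aggregated = features
  .groupBy("entity_id", "target_id", "feature_id")
  .agg(avg("feature_value").as("feature_value"))
```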
In Hive, each step outputs a temporary table (stored in HDFS) that serves as the input to the next step. Because data is repeatedly serialized and written to HDFS along the way, this carries a severe performance cost in both time and CPU hours.
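To make that cost concrete, here's roughly what the staged execution looks like if you mimic it in Spark code, with each stage materializing its output to HDFS (paths, filter condition, and aggregate are all placeholders):

```scala
import org.apache.spark.sql.functions.avg

// Stage 1: filter, then materialize the result to a temporary HDFS location
spark.read.parquet("hdfs:///input/features")
  .filter("feature_value IS NOT NULL")
  .write.parquet("hdfs:///tmp/stage1_filtered")

// Stage 2: read stage 1's output back in, aggregate, materialize again
spark.read.parquet("hdfs:///tmp/stage1_filtered")
  .groupBy("entity_id", "target_id", "feature_id")
  .agg(avg("feature_value").as("feature_value"))
  .write.parquet("hdfs:///tmp/stage2_grouped")

// Stage 3: read stage 2's output back in, shard by entity_id, write the final output
val grouped = spark.read.parquet("hdfs:///tmp/stage2_grouped")
grouped
  .repartition(grouped.col("entity_id"))
  .write.parquet("hdfs:///output/features_by_entity")
```

Every stage boundary here is a full serialize-and-write round trip through HDFS, which is where the extra time and CPU hours go.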
In the Spark implementation described here, the steps are instead combined into a single Spark job, so intermediate results flow between stages without being written back to HDFS.
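Chained together, the same steps become one job, and only the final output is written out (same placeholder paths and aggregate as above):

```scala
import org.apache.spark.sql.functions.avg

// Filter, aggregate, and shard as one chained plan; only the final result
// is written to HDFS, with shuffles handling the data movement in between
val result = spark.read.parquet("hdfs:///input/features")
  .filter("feature_value IS NOT NULL")
  .groupBy("entity_id", "target_id", "feature_id")
  .agg(avg("feature_value").as("feature_value"))

result
  .repartition(result.col("entity_id"))
  .write.parquet("hdfs:///output/features_by_entity")
```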
They also note some contributions to Spark to improve its reliability, since the sheer size of their workload revealed cracks, especially where hardcoded limits proved too restrictive.
E.g., this pull request addresses a problem that arises when a reducer's memory is nearly fully allocated. Memory is taken up both by the pointer array (the array that actually gets sorted) and by the records it points to. Originally, the code would reset the pointer array first (allocating new memory) and only then free the records, which could cause OOM errors unnecessarily.
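A conceptual sketch of the ordering issue; the class, method names, and numbers below are invented for illustration and are not Spark's actual sorter code:

```scala
// A tiny memory-budget model stands in for the task's memory manager.
final class MemoryBudget(capacityBytes: Long) {
  private var used = 0L
  def acquire(bytes: Long): Unit = {
    if (used + bytes > capacityBytes)
      throw new OutOfMemoryError(s"need $bytes bytes, only ${capacityBytes - used} available")
    used += bytes
  }
  def release(bytes: Long): Unit = used -= bytes
}

object ResetOrderingSketch {
  val capacity    = 1000L // total memory available to the task
  val pointerSize = 400L  // size of the sort pointer array
  val recordSize  = 500L  // size of the records the array points to

  // Resetting the pointer array allocates a fresh array before the old one
  // is dropped, so the two briefly coexist.
  def resetPointerArray(budget: MemoryBudget): Unit = {
    budget.acquire(pointerSize) // new pointer array
    budget.release(pointerSize) // old pointer array
  }

  def main(args: Array[String]): Unit = {
    // Good ordering: free the records first, then reset the pointer array.
    // Peak usage is old array + new array = 800, which fits in 1000.
    val good = new MemoryBudget(capacity)
    good.acquire(pointerSize); good.acquire(recordSize) // starting state
    good.release(recordSize)
    resetPointerArray(good)
    println("free-then-reset succeeded")

    // Bad ordering: reset the pointer array while the records are still held.
    // Peak usage is old array + records + new array = 1300 > 1000, so the
    // allocation fails even though the records were about to be freed anyway.
    val bad = new MemoryBudget(capacity)
    bad.acquire(pointerSize); bad.acquire(recordSize)
    try {
      resetPointerArray(bad)
      bad.release(recordSize)
    } catch {
      case e: OutOfMemoryError => println(s"reset-then-free hit OOM: ${e.getMessage}")
    }
  }
}
```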