A direct comparison of Hadoop and Spark is difficult because they do many of the same things, but are also non-overlapping in some areas.
For example, Spark has no file management and therefore must rely on Hadoop’s Distributed File System (HDFS) or some other solution.
It is wiser to compare Hadoop MapReduce to Spark, because they’re more comparable as data processing engines.
The most important thing to remember about both technologies is that their use is not an either-or scenario because they are not mutually exclusive. Nor is one necessarily a drop-in replacement for the other. The two are compatible with each other and that makes their pairing an extremely powerful solution for a variety of big data applications.
Hadoop is an Apache.org project that is a software library and a framework that allows for distributed processing of large data sets (big data) across computer clusters using simple programming models. It can scale from single computer systems up to thousands of commodity systems that offer local storage and compute power. Hadoop, in essence, is the ubiquitous 800-lb big data gorilla in the big data analytics space.
Hadoop is composed of modules that work together to create the Hadoop framework. The primary Hadoop framework modules are:
- Hadoop Common
- Hadoop Distributed File System (HDFS)
- Hadoop YARN
- Hadoop MapReduce
Although the above four modules comprise Hadoop’s core, there are several other modules. These include Ambari, Avro, Cassandra, Hive, Pig, Oozie, Flume, and Sqoop, which further enhance and extend Hadoop’s power and reach into big data applications and large data set processing.
Many companies that use big data sets and analytics use Hadoop. It has become the de facto standard in big data applications. Hadoop originally was designed to handle crawling and searching billions of web pages and collecting their information into a database. The result of the desire to crawl and search the web was Hadoop’s HDFS and its distributed processing engine, MapReduce.
Hadoop is useful to companies when data sets become so large or so complex that their current solutions cannot effectively process the information in what the data users consider being a reasonable amount of time.
MapReduce is an excellent text processing engine and rightly so since crawling and searching the web (its first job) are both text-based tasks.
Developers bill it as “a fast and general engine for large-scale data processing.” By comparison, and sticking with the analogy, if Hadoop’s Big Data framework is the 800-lb gorilla, then Spark is the 130-lb big data cheetah.
Although critics of Spark’s in-memory processing admit that Spark is very fast (Up to 100 times faster than Hadoop MapReduce), they might not be so ready to acknowledge that it runs up to ten times faster on disk. Spark can also perform batch processing, however, it really excels at streaming workloads, interactive queries, and machine-based learning.
Spark’s big claim to fame is its real-time data processing capability as compared to MapReduce’s disk-bound, batch processing engine. Spark is compatible with Hadoop and its modules. In fact, on Hadoop’s project page, Spark is listed as a module.
Spark has its own page because, while it can run in Hadoop clusters through YARN (Yet Another Resource Negotiator), it also has a standalone mode. The fact that it can run as a Hadoop module and as a standalone solution makes it tricky to directly compare and contrast. However, as time goes on, some big data scientists expect Spark to diverge and perhaps replace Hadoop, especially in instances where faster access to processed data is critical.
Spark is a cluster-computing framework, which means that it competes more with MapReduce than with the entire Hadoop ecosystem. For example, Spark doesn’t have its own distributed filesystem, but can use HDFS.
It uses memory and can use disk for processing, whereas MapReduce is strictly disk-based. The primary difference between MapReduce and Spark is that MapReduce uses persistent storage and Spark uses Resilient Distributed Datasets (RDDs), which is covered in more detail under the Fault Tolerance section.