Learn Hadoop Fundamentals

Have you ever wondered how Google queries huge amounts of data so quickly? This tutorial answers that question. Here we will concentrate on the fundamentals of Hadoop.

What is Hadoop?

Hadoop is open-source software for distributed computing, which helps obtain results quickly when dealing with very large amounts of data.

Disadvantages of a non-distributed architecture

  1. All the data is stored as a single pile on one server, and every client must access that server for information; as storage grows, access becomes more difficult and more physical space is needed to hold the data.
  2. This architecture is not reliable: when the main server fails, we have to take a backup and restore it, which is a very tedious process.

Key points about Hadoop

  1. When we query for specific data on a very large data set, the query is executed on individual local servers, each working on a small portion of the data, and finally all the results are consolidated. This makes access to a very large data set much faster.
  2. We don’t need a single powerful server; less expensive machines with less memory are enough.
  3. High fault tolerance. If any node in the Hadoop environment fails, the system takes care of itself and still produces correct results, because the data is distributed and replicated efficiently across multiple nodes.
  4. The data is processed in parallel, not serially.
  5. Hadoop is platform independent, as it is written in Java.
  6. We generally use Hadoop when dealing with unstructured data.
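The split-then-consolidate idea in point 1 can be illustrated with a minimal Python sketch. This is not Hadoop's actual API (Hadoop itself is written in Java); the function names here are made up for illustration, and the per-chunk scans that would run on separate machines in a real cluster are run in a plain loop to keep the sketch self-contained.

```python
def count_matches(chunk, term):
    # Runs on one "node" against only its local slice of the data.
    return sum(1 for record in chunk if term in record)

def distributed_count(records, term, n_nodes=4):
    # Partition the data set: node i holds every n_nodes-th record.
    chunks = [records[i::n_nodes] for i in range(n_nodes)]
    # In a real cluster each chunk would be scanned on its own machine,
    # in parallel; here we simply loop over the chunks.
    partials = [count_matches(chunk, term) for chunk in chunks]
    # Consolidate: combine the partial results into the final answer.
    return sum(partials)

print(distributed_count(["ham", "spam", "spam ham", "eggs"], "spam"))  # 2
```

The key property is that each node only ever touches its own chunk, so adding nodes shrinks the work per node rather than requiring a bigger single server.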

Hadoop Framework modules

  • Hadoop Common: The common utilities that support the other Hadoop modules.
  • Hadoop Distributed File System (HDFS): It is a distributed file system that provides high-throughput access to application data.
  • Hadoop YARN: It is a framework for job scheduling and cluster resource management.
  • Hadoop MapReduce: A YARN-based system used for parallel processing of large data sets.
  • Hadoop Ozone: An object store for Hadoop.

Hadoop Distributed File System

In HDFS, data is stored across a network of machines, making it one of the most reliable file systems. It has a unique design that provides storage for very large files with a streaming data access pattern, and it runs on commodity hardware. HDFS is designed on the principle of write once, read many times. HDFS also makes the data available to applications for parallel processing.


  • It provides distributed data storage.
  • It stores data in blocks, which reduces seek time.
  • The data is highly available, as the same block is present on multiple data nodes.
  • High fault tolerance.
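The block and replication ideas above can be sketched in a few lines of Python. This is a toy model, not HDFS's real placement policy (which is rack-aware): the tiny block size, the round-robin placement, and all the names here are illustrative assumptions.

```python
BLOCK_SIZE = 8      # toy value in characters; real HDFS defaults to 128 MB
REPLICATION = 3     # HDFS's default replication factor

def split_into_blocks(data, block_size=BLOCK_SIZE):
    # A file is chopped into fixed-size blocks before being stored.
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_blocks(blocks, nodes, replication=REPLICATION):
    # Toy round-robin placement: each block is copied to `replication`
    # distinct data nodes, so losing any one node never loses a block.
    return {i: [nodes[(i + r) % len(nodes)] for r in range(replication)]
            for i in range(len(blocks))}

blocks = split_into_blocks("a very large file stored in hdfs")
placement = place_blocks(blocks, ["node1", "node2", "node3", "node4"])
print(placement[0])  # ['node1', 'node2', 'node3']
```

Because every block lives on three distinct nodes, a reader can fall back to another replica when a node fails, which is where the high availability and fault tolerance in the list above come from.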

Goals of HDFS

  • Fault detection and recovery: since HDFS runs on a large number of commodity machines, component failures are expected, so it must detect faults and recover from them quickly and automatically.
  • Huge datasets: HDFS should scale to hundreds of nodes per cluster to manage applications with huge datasets.
  • Hardware at data: moving the computation to where the data resides is cheaper than moving the data, which reduces network traffic.

Where HDFS does not fit well

  • Low-latency data access: HDFS is optimized for high throughput, not for fast access to individual records.
  • Small file problem: a very large number of small files overloads the NameNode, which keeps all file metadata in memory.
  • Multiple writers: HDFS files are write-once; concurrent writers and arbitrary file modifications are not supported.


Hadoop MapReduce

Hadoop MapReduce is a software framework for writing applications that process vast amounts of data in parallel on large clusters of commodity hardware.

MapReduce splits the large input data set into independent chunks, which are processed in parallel by the map tasks. The framework sorts the map outputs, which then become the input to the reduce tasks. Usually both the input and the output of a job are stored in a file system. The framework handles task scheduling, monitors the tasks, and re-executes any that fail.

MapReduce’s advantage is that it is easy to scale data processing over multiple computing nodes.
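The map → shuffle/sort → reduce flow just described can be simulated in plain Python with the classic word-count example. This is a single-process sketch of the data flow, not Hadoop's Java `Mapper`/`Reducer` API; in a real job each phase runs as distributed tasks.

```python
from collections import defaultdict

def map_phase(chunk):
    # Map: emit a (word, 1) pair for every word in this chunk.
    return [(word, 1) for line in chunk for word in line.split()]

def shuffle(pairs):
    # Shuffle/sort: group all values by key, as the framework does
    # between the map and reduce phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: sum the counts emitted for each word.
    return {word: sum(counts) for word, counts in groups.items()}

chunks = [["big data big"], ["data big"]]               # two input splits
pairs = [p for c in chunks for p in map_phase(c)]       # map tasks
print(reduce_phase(shuffle(pairs)))  # {'big': 3, 'data': 2}
```

Note that the map tasks never communicate with each other; all coordination happens in the shuffle, which is what makes it easy to scale the map phase across many nodes.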

Usage of MapReduce

  • It can be used in document clustering, distributed sorting, and web link-graph reversal.
  • It can be used for distributed pattern-based searching.
  • We can also use MapReduce in machine learning.
  • It was used by Google to regenerate Google’s index of the World Wide Web.
  • It can be used in multiple computing environments such as multi-cluster, multi-core, and mobile environments.


Hadoop YARN

  • YARN stands for “Yet Another Resource Negotiator”. The idea of YARN is to have a global ResourceManager (RM) and a per-application ApplicationMaster (AM). An application is either a single job or a DAG of jobs.
  • It enables users to run a variety of tools on the same cluster as required, such as Spark for real-time processing and Hive for SQL queries.
  • It performs job scheduling and allocates cluster resources to applications.
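The RM/AM split can be sketched as a toy model in Python. This is a deliberate simplification of YARN's actual protocol (no NodeManagers, heartbeats, or container launch), and the class and variable names are illustrative assumptions: the point is only that one global component tracks cluster capacity while each application negotiates for its own share.

```python
class ResourceManager:
    # Global view of the cluster's free containers (toy model).
    def __init__(self, total_containers):
        self.free = total_containers

    def allocate(self, requested):
        # Grant as many containers as are available, never more.
        granted = min(requested, self.free)
        self.free -= granted
        return granted

class ApplicationMaster:
    # Per-application negotiator: asks the RM for containers for its tasks.
    def __init__(self, name, tasks):
        self.name, self.tasks = name, tasks

    def run(self, rm):
        granted = rm.allocate(self.tasks)
        return f"{self.name}: got {granted}/{self.tasks} containers"

rm = ResourceManager(total_containers=10)
print(ApplicationMaster("job-A", tasks=6).run(rm))  # job-A: got 6/6 containers
print(ApplicationMaster("job-B", tasks=6).run(rm))  # job-B: got 4/6 containers
```

Because the ResourceManager only hands out resources and each ApplicationMaster manages its own job, many different frameworks can share one cluster, which is where YARN's utilization and multitenancy benefits come from.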

Benefits of YARN

  • Scalability
  • Utilization
  • Multitenancy
  • Compatibility


This tutorial is a basic overview of Hadoop; there is a lot more to it. Hadoop is a booming technology in the field of computer science. You can also check our post on Spark vs Hadoop.


Aswath Rao

Currently pursuing Msc in Data Science
