Apache Hadoop is a Big Data ecosystem of open-source components that fundamentally changes the way large datasets are stored, transferred, processed and analyzed. In contrast to traditional distributed processing systems, Hadoop can run multiple kinds of analytic workloads on the same datasets at the same time.
The following qualities make Hadoop stand out from the crowd:
- A single namespace, provided by HDFS, makes content visible across all the nodes
- Easily administered using High Performance Computing (HPC)
- Distributed data can be queried and managed using Hive
- Pig facilitates analysis of large and complex datasets on Hadoop
- HDFS is designed for high throughput rather than low latency
Comparison of Hadoop 1 and Hadoop 2 architectures
While Hadoop is the foundation for most big data architectures, each of its major versions introduced significant improvements. It always helps to have a good grasp of the functionality each successive version of a technology offers, so let's compare Hadoop 1 and Hadoop 2:
| Hadoop 1 | Hadoop 2 |
|---|---|
| Components: HDFS (V1), MapReduce (V1) | Components: HDFS (V2), YARN (MR V2), MapReduce (V2) |
| Only one namespace | Multiple namespaces |
| Fixed-size slots | Variable-size containers |
| Only one programming model | Multiple programming models |
| Supports a maximum of 4,000 nodes per cluster | Supports a maximum of 10,000 nodes per cluster |
The most widely and frequently used framework for managing massive data across computing platforms and servers in every industry, Hadoop is rocketing ahead in enterprises. It lets organizations store files that are bigger than any single node or server can hold. More importantly, Hadoop is not just a storage platform; it is one of the most optimized and efficient computational frameworks for big data analytics. The right Hadoop training helps you understand real-world scenarios of working with Big Data.
Read Also: Apache Hadoop Ecosystem
Hadoop Tutorial: Hadoop Features
Fig: Hadoop Tutorial – Hadoop Features
When machines are working in tandem and one of them fails, another machine takes over its responsibility, so the system keeps running in a reliable, fault-tolerant fashion. Hadoop's infrastructure has inbuilt fault-tolerance features, and hence Hadoop is highly reliable.
Read Also: Apache Cassandra Tutorial
Hadoop runs on commodity hardware (like your PC or laptop). For example, in a small Hadoop cluster, all your DataNodes can have modest configurations, say 8-16 GB of RAM, a 5-10 TB hard disk and Xeon processors, whereas using hardware-based RAID with Oracle for the same purpose would cost at least five times more. So the total cost of ownership of a Hadoop-based project is considerably lower. The Hadoop environment is also easier to maintain and economical to run. Moreover, Hadoop is open-source software, so there is no licensing cost.
Hadoop integrates seamlessly with cloud-based services. So, if you are installing Hadoop in the cloud, you don't need to worry about scalability: you can procure more hardware and expand your setup within minutes whenever required.
Hadoop is very flexible in the kinds of data it can handle. We discussed "Variety" in our previous blog on Big Data Analytics: data can be of any kind, and Hadoop can store and process it all, whether structured, semi-structured or unstructured.
These four characteristics (reliability, economy, scalability and flexibility) make Hadoop a front-runner as a solution to Big Data challenges. Now that we know what Hadoop is, let us explore its core components.
Hadoop Tutorial: Hadoop Core Components
While setting up a Hadoop cluster, you can choose from a lot of services as part of your Hadoop platform, but two services are always mandatory: HDFS (storage) and YARN (processing). HDFS stands for Hadoop Distributed File System, the scalable storage unit of Hadoop, whereas YARN processes the data stored in HDFS in a distributed and parallel fashion.
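To make the "scalable storage" idea concrete, here is a small Python sketch (illustrative only, not Hadoop code) of how HDFS logically splits a large file into fixed-size blocks before distributing them across DataNodes. The 128 MB block size is the Hadoop 2 default.

```python
# Illustrative sketch: how HDFS logically splits a file into blocks.
# 128 MB is the default block size in Hadoop 2 (64 MB in Hadoop 1).

BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB in bytes

def split_into_blocks(file_size_bytes, block_size=BLOCK_SIZE):
    """Return the sizes of the blocks a file of the given size occupies."""
    blocks = []
    remaining = file_size_bytes
    while remaining > 0:
        blocks.append(min(block_size, remaining))
        remaining -= block_size
    return blocks

# A 300 MB file becomes two full 128 MB blocks plus one 44 MB block.
sizes = split_into_blocks(300 * 1024 * 1024)
print([s // (1024 * 1024) for s in sizes])  # [128, 128, 44]
```

Note that the last block only occupies as much space as the remaining data, which is why files bigger than any single disk can still be stored across the cluster.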
Let us go ahead with HDFS first. The main components of HDFS are: NameNode and DataNode. Let us talk about the roles of these two components in detail.
Read Also: Apache Cassandra Architecture
NameNode
- It is the master daemon that maintains and manages the DataNodes (slave nodes)
- It records the metadata of all the blocks stored in the cluster, e.g. the location of stored blocks, file sizes, permissions, hierarchy, etc.
- It records each and every change that takes place to the file system metadata
- If a file is deleted in HDFS, the NameNode will immediately record this in the EditLog
- It regularly receives a Heartbeat and a block report from all the DataNodes in the cluster to ensure that the DataNodes are live
- It keeps a record of all the blocks in HDFS and the DataNodes on which they are stored
- It has high availability and federation features which I will discuss in HDFS architecture in detail
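The block-to-DataNode bookkeeping described above can be sketched in a few lines of Python. This is a conceptual model, not Hadoop's actual internals: the names and data structures are illustrative assumptions.

```python
# Conceptual sketch of the NameNode's in-memory metadata: a mapping from
# each block ID to the set of DataNodes holding a replica. Built up from
# the block reports that DataNodes send to the NameNode.

block_locations = {}  # block_id -> set of DataNode hostnames

def record_replica(block_id, datanode):
    """Record (from a DataNode's block report) that datanode holds block_id."""
    block_locations.setdefault(block_id, set()).add(datanode)

def under_replicated(replication_factor=3):
    """Blocks with fewer replicas than the target (HDFS default factor is 3)."""
    return [b for b, nodes in block_locations.items()
            if len(nodes) < replication_factor]

record_replica("blk_001", "datanode1")
record_replica("blk_001", "datanode2")
record_replica("blk_001", "datanode3")
record_replica("blk_002", "datanode1")
print(under_replicated())  # ['blk_002']
```

When the real NameNode detects an under-replicated block like `blk_002`, it instructs other DataNodes to create additional replicas, which is the replication decision mentioned in the DataNode section below.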
DataNode
- It is the slave daemon that runs on each slave machine
- The actual data is stored on DataNodes
- It is responsible for serving read and write requests from the clients
- It is also responsible for creating, deleting and replicating blocks based on the decisions taken by the NameNode
- It sends heartbeats to the NameNode periodically to report its health; by default, this interval is set to 3 seconds
So, this was HDFS in a nutshell. Now, let us move on to the second fundamental unit of Hadoop, i.e. YARN.