Professors can't start a big data class without introducing Hadoop, and that makes sense, because all big data stories share the same origin: Google.
Around 2000, Google had a huge scalability problem caused by continuously growing data volumes. The scale-up architecture they were using had already been pushed to its limit - the hardware couldn't keep up with what Google was doing. They wanted a new solution for building Google's web search index and computing PageRank.
The main idea behind the whitepapers Google later published (the Google File System paper in 2003 and the MapReduce paper in 2004) is a distributed computing paradigm known as MapReduce.
In fact, distributed computing like Massively Parallel Processing (MPP) was not new at that time; several fields had already leveraged this paradigm, such as oceanography, earthquake monitoring, and space exploration.
Still, MapReduce outclassed the other distributed computing tools of its time with advantages like simplicity, generality, and linear scaling. It looked really promising, except for one thing - Google didn't share the source code!
Thankfully, around 2006, Doug Cutting and Mike Cafarella created Apache Hadoop based on both MapReduce and the Google File System, and made it fully open-source software (OSS).
“The name my kid gave a stuffed yellow elephant. Short, relatively easy to spell and pronounce, meaningless, and not used elsewhere: those are my naming criteria. Kids are good at generating such.”
-- Doug Cutting on the origin of the name Hadoop
Since 2006, Hadoop has become one of the fastest-growing OSS projects in history. Many engineers started building tools on top of the Hadoop ecosystem, since it was written in Java and runs on the highly popular JVM; Hadoop tools are thriving and also enjoy high rates of enterprise adoption.
Okay, no more history class - let's learn Hadoop. In this article, I'm going to focus on just two main Hadoop components: HDFS and YARN.
HDFS is a software filesystem, written in Java, that sits on top of the native file system. HDFS has a master/slave architecture consisting of a single Name Node and multiple Data Nodes in the cluster.
The following is the HDFS architecture:
The Name Node is the master node; a Hadoop cluster can have only one active Name Node running. Its main job is to keep the filesystem metadata: the directory tree, the mapping from each file to its blocks, and the location of every block across the Data Nodes.
In other words, it just holds the pointers pointing to the real data!
What the metadata looks like
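To make the "pointers" idea concrete, here is a toy sketch in plain Python of what the Name Node conceptually keeps in memory - illustrative only, not real HDFS internals (all paths, block IDs, and node names below are made up):

```python
# Toy model of Name Node metadata (illustrative, NOT real HDFS internals).
# The Name Node never stores file contents - only the namespace and the
# mapping from each file to its blocks, and from each block to Data Nodes.
namespace = {
    "/user/alice/logs.txt": {
        "replication": 3,
        "block_size": 128 * 1024 * 1024,      # 128 MB
        "blocks": ["blk_0001", "blk_0002"],   # ordered list of block IDs
    }
}

# Which Data Nodes hold a replica of each block.
block_locations = {
    "blk_0001": ["datanode-1", "datanode-3", "datanode-7"],
    "blk_0002": ["datanode-2", "datanode-4", "datanode-5"],
}

def locate(path):
    """Answer a client's question: where are the blocks of this file?"""
    return [(blk, block_locations[blk]) for blk in namespace[path]["blocks"]]

print(locate("/user/alice/logs.txt"))
```

A client first asks the Name Node where the blocks are, then reads the actual bytes directly from the Data Nodes - the metadata is just the map, never the territory.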
In the early days of Hadoop, HDFS seriously suffered from a single point of failure: if the Name Node failed, the entire system stopped working. Luckily, nowadays we can set up a high-availability cluster with a standby Name Node, which prevents cluster failure by being promoted to active whenever the first Name Node loses connectivity.
Note: the standby Name Node is always up and running in standby mode (don't confuse it with the Secondary Name Node, which only checkpoints the Name Node's metadata and cannot take over).
Data Nodes, as the name suggests, are where the data is actually stored. Data Nodes are also responsible for performing block creation, deletion, and replication upon instruction from the Name Node.
The main concept behind HDFS is that it divides a file into blocks instead of dealing with the file as a whole. This enables many features - distribution, replication, failure recovery - and later becomes the most important building block for distributed processing.
With this simple concept, it solves the classic roadblocks found in traditional scale-up architectures.
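The splitting itself is conceptually simple. Here is a minimal sketch (a hypothetical helper, assuming a 128 MB block size - the real client-side logic lives in HDFS's Java code):

```python
# Sketch of how HDFS conceptually splits a file into fixed-size blocks.
# Hypothetical helper for illustration - not actual HDFS code.
BLOCK_SIZE = 128 * 1024 * 1024  # assume 128 MB blocks

def split_into_blocks(file_size_bytes, block_size=BLOCK_SIZE):
    """Return (offset, length) pairs, one per block."""
    blocks = []
    offset = 0
    while offset < file_size_bytes:
        length = min(block_size, file_size_bytes - offset)
        blocks.append((offset, length))
        offset += length
    return blocks

# A 300 MB file becomes two full 128 MB blocks plus one 44 MB tail block.
blocks = split_into_blocks(300 * 1024 * 1024)
print(len(blocks))  # 3
```

Note the last block only occupies as much space as the remaining data - a 44 MB tail does not waste a full 128 MB on disk.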
Block sizes can be configured to 64 MB, 128 MB, 256 MB, or 512 MB.

YARN
YARN stands for Yet Another Resource Negotiator. Since we are dealing with a big cluster, an orchestrator is a must-have. YARN's Resource Manager negotiates with the Node Manager running on each Data Node, allocates resources, distributes jobs across the nodes, and receives job submissions from clients. You can think of it as another software layer sitting on top of the JVM that handles every processing job in Hadoop - the brain of the cluster.
Moving data across the network to the computation is freaking slow. That's why YARN does the opposite: it ships the computation to the target Data Nodes and computes where the data already lives (data locality).
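The data-locality idea can be illustrated with a toy scheduler - this is just the principle, not YARN's actual scheduling code, and the node and block names are made up:

```python
# Toy illustration of data locality - the principle, NOT YARN's real scheduler.
# Prefer running a task on a node that already stores the block it will read.
block_locations = {"blk_0001": {"node-1", "node-3"}}

def pick_node(block_id, free_nodes):
    """Pick a free node, preferring one that holds a local replica."""
    local = [n for n in free_nodes if n in block_locations[block_id]]
    # Fall back to any free node only when no local replica is available.
    return local[0] if local else free_nodes[0]

print(pick_node("blk_0001", ["node-2", "node-3", "node-4"]))  # node-3
```

When node-3 (which holds a replica) is free, the task runs there and reads from local disk; only if no replica-holding node is free does the task pay the network cost of a remote read.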
YARN running process - https://www.oreilly.com/library/view/hadoop-the-definitive/9781491901687/ch04.html
To process the data, Hadoop uses a technique called MapReduce, which basically sends "map" jobs that emit key-value pairs across the nodes in the cluster, followed by "reduce" jobs that aggregate them. Combined with HDFS distributing the blocks, submitting a MapReduce job achieves true parallelism.
WordCounts Example - https://www.guru99.com/introduction-to-mapreduce.html
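The classic WordCount from the figure can be sketched in a few lines of plain Python to show the map → shuffle → reduce flow. This is a single-process simulation of the idea, not actual Hadoop code:

```python
# Single-process simulation of MapReduce WordCount - not actual Hadoop code.
from collections import defaultdict

def map_phase(line):
    # map: emit a (word, 1) pair for every word in the line
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    # shuffle: group values by key, as the framework does between map and reduce
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(word, counts):
    # reduce: sum the 1s emitted for each word
    return word, sum(counts)

lines = ["deer bear river", "car car river", "deer car bear"]
mapped = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(w, c) for w, c in shuffle(mapped).items())
print(counts)  # {'deer': 2, 'bear': 2, 'river': 2, 'car': 3}
```

In real Hadoop, each mapper runs on the node holding its input block and each reducer handles a partition of the keys, so the same three phases run in parallel across the cluster.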
Hadoop seemed like the perfect solution for computing massive amounts of data; however, as time progressed, the community realized it has some limitations and drawbacks: every MapReduce stage writes its intermediate results back to disk, which makes iterative and interactive workloads painfully slow, and the low-level Java API is verbose to program against.
And the community reacted to that really quickly by inventing Apache Spark, which keeps intermediate data in memory and offers a much richer API, to address the MapReduce issues.
When we talk about Hadoop, you should first clarify what you are really talking about: MapReduce or HDFS? If you mean MapReduce, it is dead and has been almost completely replaced by Spark. HDFS, on the other hand, is far from dead, even though many new projects nowadays start with S3 or GCS instead.
Google always has Google-scale problems - pretty unique to Google alone. But it turns out that Google's tools are really useful and have inspired a lot of open-source projects. Hadoop is one of those projects that started in a really small niche and gradually reached a massive audience. Yes, it has some flaws, but its drawbacks, its lessons, and its attempt to decentralize big data tools have inspired many OSS projects to fill in the missing pieces of the Hadoop ecosystem.
So now you understand why professors always start lecturing with Hadoop: we owe Hadoop for these amazingly invaluable lessons!