Apache Hadoop

Apache Hadoop is a platform comprising technologies used to store and process huge amounts of data. Hadoop was created by Doug Cutting and Mike Cafarella while they were building a search engine called Nutch, and it was named after a toy elephant belonging to Doug Cutting's son. It is a powerful, popular and well-supported framework that is well equipped for handling huge amounts of data. It can also be viewed as a data lake built on top of a suite of components: a collection of libraries, tools and utilities that help to handle data.

Apache Hadoop has the following three characteristics:

1. Distributed – it should be able to utilise multiple machines to solve tasks.

2. Scalable – when needed, it should be able to increase the number of machines.

3. Reliable – if one of the machines in the cluster fails, it should not affect the work executed by the system as a whole.

As mentioned before, Apache Hadoop consists of the following suite of components:

HDFS, or Hadoop Distributed File System – an open-source, scalable, reliable, distributed file system spread across many computers that provides storage for huge amounts of data.
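The quickest way to get a feel for HDFS is through its command-line interface. Below is a minimal Python sketch that shells out to the real `hdfs dfs` commands; it assumes a running cluster with the `hdfs` CLI on the PATH, and the file and directory names are hypothetical.

```python
import subprocess

def hdfs(*args):
    # Run an `hdfs dfs` subcommand and raise if it fails.
    subprocess.run(["hdfs", "dfs", *args], check=True)

hdfs("-mkdir", "-p", "/data/raw")               # create a directory in HDFS
hdfs("-put", "local_events.log", "/data/raw")   # copy a local file into HDFS
hdfs("-ls", "/data/raw")                        # list the directory contents
```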

YARN, or Yet Another Resource Negotiator – it keeps track of the resources of all machines connected through the network and allocates them to the applications being executed.

HBase – a NoSQL datastore that provides storage in the form of database tables.
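As an illustration of what "storage in the form of database tables" means, here is a small sketch using the third-party Python client happybase (an assumption, not part of HBase itself); it requires the HBase Thrift server to be running, and the table, column family and row key names are hypothetical.

```python
import happybase

# Connect to the HBase Thrift gateway (9090 is its default port).
connection = happybase.Connection("localhost", port=9090)

# Create a table with one column family; this fails if the table already exists.
connection.create_table("users", {"info": dict()})

table = connection.table("users")
table.put(b"row-1", {b"info:name": b"Alice", b"info:city": b"Paris"})

# Read the row back as a dict of column -> value.
print(table.row(b"row-1"))
```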

MapReduce – a batch-processing engine with a built-in shuffle-and-sort phase that uses YARN for program execution. It consists of two parts, map and reduce: the map phase converts raw data into key-value pairs, and the reduce phase groups and combines the data based on the key.
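The classic example is word count. The sketch below follows the Hadoop Streaming style, where the map and reduce phases are plain scripts reading from standard input; splitting both phases into one script and the command-line switch are assumptions made for illustration.

```python
#!/usr/bin/env python3
# Word count in the Hadoop Streaming style: the same script acts as the
# mapper or the reducer depending on its first command-line argument.
import sys

def mapper():
    # Map phase: emit a (word, 1) pair for every word read from stdin.
    for line in sys.stdin:
        for word in line.strip().split():
            print(f"{word}\t1")

def reducer():
    # Reduce phase: the framework delivers input sorted by key, so counts
    # for the same word arrive together and can be summed in one pass.
    current_word, current_count = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t")
        if word == current_word:
            current_count += int(count)
        else:
            if current_word is not None:
                print(f"{current_word}\t{current_count}")
            current_word, current_count = word, int(count)
    if current_word is not None:
        print(f"{current_word}\t{current_count}")

if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()
```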

Spark – similar to MapReduce in its construct and framework, but more recent and faster, largely because it keeps intermediate data in memory.
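For comparison, here is the same word count expressed with PySpark; it assumes PySpark is installed, and the HDFS paths are hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wordcount").getOrCreate()

counts = (
    spark.sparkContext.textFile("hdfs:///data/input.txt")  # hypothetical input
    .flatMap(lambda line: line.split())                    # map: split into words
    .map(lambda word: (word, 1))
    .reduceByKey(lambda a, b: a + b)                        # reduce: sum per word
)
counts.saveAsTextFile("hdfs:///data/wordcount_output")      # hypothetical output
spark.stop()
```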

Hive – lets you write your logic in SQL and converts it into MapReduce internally, since writing code directly in MapReduce is very time-consuming. Hive can process huge amounts of structured and semi-structured data with simple SQL.
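To show how little code a query takes, here is a hedged sketch that submits HiveQL from Python through the PyHive library (an assumption, not part of Hive itself); the host, user and table name are placeholders, with 10000 being the usual HiveServer2 port.

```python
from pyhive import hive

# Connect to HiveServer2; host and username are hypothetical placeholders.
conn = hive.Connection(host="hive-server", port=10000, username="analyst")
cursor = conn.cursor()

# One line of SQL instead of a hand-written MapReduce job; `page_views` is hypothetical.
cursor.execute(
    "SELECT country, COUNT(*) AS visits "
    "FROM page_views GROUP BY country ORDER BY visits DESC LIMIT 10"
)
for row in cursor.fetchall():
    print(row)
```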

Pig (Latin) – helps you write complex MapReduce transformations using the scripting language Pig Latin.

Mahout – a machine learning algorithm library that breaks work up so that it is executed on many machines in a distributed fashion.

ZooKeeper – an independent component that is used for coordination between other components such as HDFS, YARN and HBase.
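The snippet below sketches the kind of coordination ZooKeeper provides, using the third-party kazoo client (an assumption, not part of Hadoop); the znode paths, data and host are hypothetical.

```python
from kazoo.client import KazooClient

zk = KazooClient(hosts="127.0.0.1:2181")  # 2181 is ZooKeeper's default port
zk.start()

# Store a small piece of shared state that other processes can read or watch.
zk.ensure_path("/app/config")
zk.create("/app/config/leader", b"worker-1", ephemeral=True)

data, stat = zk.get("/app/config/leader")
print(data.decode(), stat.version)

zk.stop()
```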

Flume – continuously pumps unstructured data from many machines into a central store such as HDFS.

Sqoop – helps transfer data between Hadoop and relational databases such as MySQL. It imports data into Hadoop using MapReduce jobs and can also export processed data back to the relational database.
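Sqoop is driven from the command line; the following sketch simply invokes it from Python, with the JDBC URL, credentials, table and HDFS directory all being hypothetical placeholders.

```python
import subprocess

subprocess.run(
    [
        "sqoop", "import",
        "--connect", "jdbc:mysql://db-host/sales",  # hypothetical database
        "--username", "etl_user",
        "--password", "secret",                     # placeholder credentials
        "--table", "customers",
        "--target-dir", "/data/customers",          # HDFS destination directory
    ],
    check=True,  # raise if the underlying MapReduce import job fails
)
```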

Oozie – a workflow engine that helps execute jobs in a defined sequence.
