Monday, August 17, 2015

My First take on Hadoop HDFS and MapReduce

HDFS

Hadoop Distributed File System (HDFS) is a distributed file system modeled on the Google File System. HDFS handles files that are larger than any individual hard disk available to host them by splitting them into fixed-size blocks and managing requests to access those blocks reliably at any time.

The default file block size is 64 MB (128 MB in Hadoop 2.x), and each block has metadata associated with it. The NameNode keeps a lookup table for every file hosted on the cluster, mapping it to its blocks and their locations. A cluster is a pool of slave systems managed by one master node, called the NameNode; the slave systems are called DataNodes. The cluster configuration identifies each DataNode, and the cluster can be scaled by adding or removing DataNodes in the cluster configuration file.

The cluster lets the user control reliability by replicating file blocks across multiple systems, so data survives failed DataNodes or unreachable network segments. If a file block is stored on three machines, its replication factor is 3, which is also Hadoop's default.
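As a rough sketch, both the block size and the replication factor can be set in hdfs-site.xml. The property names below (dfs.replication, dfs.blocksize) are the usual Hadoop 2.x ones; older releases use slightly different names, so treat this as an example rather than a definitive config.

    <!-- hdfs-site.xml (sketch): cluster-wide defaults for replication and block size -->
    <configuration>
      <property>
        <name>dfs.replication</name>
        <value>3</value>          <!-- keep 3 copies of every block -->
      </property>
      <property>
        <name>dfs.blocksize</name>
        <value>67108864</value>   <!-- 64 MB block size, in bytes -->
      </property>
    </configuration>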

The file access commands normally used by the host operating system do not apply to files stored in HDFS. HDFS provides its own equivalent commands to list files, merge files, copy them to the local file system, and so on, as the examples below show.
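A few of the HDFS shell commands, as a sketch; the paths used here (/user/demo, input.txt) are only placeholders:

    # list files in an HDFS directory
    hadoop fs -ls /user/demo

    # copy a local file into HDFS
    hadoop fs -put input.txt /user/demo/input.txt

    # copy a file from HDFS back to the local file system
    hadoop fs -get /user/demo/input.txt ./input_copy.txt

    # merge all files in an HDFS directory into one local file
    hadoop fs -getmerge /user/demo/output merged_output.txt

    # print a file's contents
    hadoop fs -cat /user/demo/input.txt

    # change the replication factor of an existing file
    hadoop fs -setrep 3 /user/demo/input.txt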


MapReduce

MapReduce is the processing framework that runs on the DataNodes themselves, computing the required information from the data stored on each node. Instead of transferring the whole data set used for the computation, only the resulting output is sent back over the network, which is what allows data localization.

Map is a process that takes a data field or element and transforms it into another value that can be used by the next process, called Reduce. Reduce assimilates and summarizes the mapped data into a final or intermediate form that is useful to the business.
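The canonical word count example shows both steps: the map step emits a (word, 1) pair for every word it sees, and the reduce step sums those pairs into a count per word. This is only a sketch against the org.apache.hadoop.mapreduce API; the class names WordCountMapper and WordCountReducer are my own, and in practice they are often written as nested classes of the driver.

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Map: take a line of text and emit (word, 1) for every word in it
    class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reduce: sum the 1s emitted for each word into a final count
    class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }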

Hadoop's MapReduce framework is implemented in the Java language, which makes it machine and operating system independent.
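For completeness, a small driver class (WordCountDriver here, a name of my own) configures the Job, wires in the mapper and reducer sketched above, points the job at HDFS input and output paths (placeholders below), and submits it to the cluster.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    // Driver: configure the job, point it at HDFS paths, and wait for it to finish
    class WordCountDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "word count");
            job.setJarByClass(WordCountDriver.class);
            job.setMapperClass(WordCountMapper.class);
            job.setReducerClass(WordCountReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path("/user/demo/input"));
            FileOutputFormat.setOutputPath(job, new Path("/user/demo/output"));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }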