Apache Hadoop can be defined as an open-source software framework for the distributed storage and distributed processing of very large data sets across clusters of machines. All Hadoop modules are designed with the fundamental assumption that hardware failure is commonplace and should be automatically handled by the framework.
The core of Apache Hadoop consists of two basic parts. The first part is the storage part, the Hadoop Distributed File System (HDFS), and the second part is the processing part, MapReduce. Hadoop splits large files into large blocks (64 MB or 128 MB by default) and distributes the blocks across the nodes of the cluster.
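The splitting step above can be sketched as follows. This is a minimal illustration of dividing a file into fixed-size blocks, not real HDFS code; the 128 MB block size matches the default mentioned above, and the file size is made up.

```python
# Sketch: HDFS-style splitting of a file into fixed-size blocks.
# Illustrative only -- real HDFS does this inside the NameNode/DataNode
# machinery, not in client code like this.

BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, a common HDFS default


def split_into_blocks(file_size_bytes, block_size=BLOCK_SIZE):
    """Return a list of (block_index, block_length) pairs covering the file."""
    blocks = []
    offset = 0
    while offset < file_size_bytes:
        length = min(block_size, file_size_bytes - offset)
        blocks.append((offset // block_size, length))
        offset += length
    return blocks


# A 300 MB file becomes three blocks: 128 MB + 128 MB + 44 MB.
blocks = split_into_blocks(300 * 1024 * 1024)
```

Each of these blocks would then be shipped to a different node, which is what lets the processing step run in parallel.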
To process the data, Hadoop transfers packaged code (specifically JAR files) to the nodes that hold the required data, and those nodes then process it in parallel. This approach takes advantage of data locality: the data is processed more efficiently and faster than it would be in a more conventional architecture that relies on a parallel file system, where computation and data are connected over high-speed networking.
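The map and reduce phases that those nodes run can be illustrated with a toy word count in plain Python. This is a conceptual stand-in, not the Hadoop API: in real Hadoop the mapper and reducer are Java classes and the framework handles the shuffle between them.

```python
# Toy word count illustrating the two MapReduce phases.
# Plain Python stand-in for concepts only -- not the Hadoop API.
from collections import defaultdict


def map_phase(line):
    # Like a Mapper: emit a (word, 1) pair for every word in the line.
    return [(word, 1) for word in line.split()]


def reduce_phase(pairs):
    # Like a Reducer after the shuffle: sum the counts per word.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)


lines = ["big data big cluster", "big data"]
pairs = [pair for line in lines for pair in map_phase(line)]
counts = reduce_phase(pairs)  # {'big': 3, 'data': 2, 'cluster': 1}
```

In a cluster, each node would run `map_phase` over its local blocks in parallel, and only the small intermediate pairs would travel over the network to the reducers.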
Advantages of Hadoop Technology:
Scalability: Hadoop is highly scalable because it can store and process very large data sets, and capacity can be grown simply by adding nodes, with all modules running on multiple servers simultaneously.
Cost Effective: Hadoop is quite cost effective. The problem with traditional relational database management systems is that they rely on costly methods to manage large amounts of data, whereas Hadoop uses far less costly commodity hardware.
Fast: Hadoop is very fast at processing because it works on a distributed methodology and can easily locate and process data anywhere within the cluster.
Flexibility: Hadoop is flexible enough to adapt to new technologies and workloads, and it can run without failure for long spans of time.
High Fault Tolerance: Hadoop has high fault tolerance. It stores data by creating duplicate copies (replicas), which are retrieved in the event of a failure.
Hadoop requires a Java Runtime Environment (JRE) 1.6 or higher to run.
HDFS stores large files (typically gigabytes to terabytes) across multiple machines. It achieves reliability by replicating the data across multiple hosts, which also increases the I/O performance of the system. By default, each block is stored on three nodes: two on the same rack and a third on a different rack. Data nodes can talk to each other to rebalance data and to move copies around.
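The replica placement described above can be sketched as follows. This follows the two-on-one-rack, one-on-another pattern from the text; the rack and node names are made up, and real HDFS placement also depends on where the writing client sits.

```python
# Sketch of three-replica placement: two replicas on one rack, the
# third on a different rack. Rack/node names are illustrative only.
import random


def place_replicas(racks):
    """racks: {rack_name: [node, ...]} with at least two racks.

    Returns three (rack, node) placements: two distinct nodes on one
    rack and one node on a different rack.
    """
    rack_a, rack_b = random.sample(sorted(racks), 2)
    first, second = random.sample(racks[rack_a], 2)
    third = random.choice(racks[rack_b])
    return [(rack_a, first), (rack_a, second), (rack_b, third)]


cluster = {"rack1": ["n1", "n2", "n3"], "rack2": ["n4", "n5"]}
replicas = place_replicas(cluster)
```

Keeping two replicas on the same rack keeps most replication traffic local, while the off-rack copy survives the loss of an entire rack.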
An advantage of using HDFS is the data awareness between the JobTracker and the TaskTrackers. The JobTracker schedules map or reduce jobs to TaskTrackers with knowledge of where the data is located: for example, if node A contains data (x, y, z) and node B contains data (a, b, c), node A would be scheduled to perform tasks on (x, y, z) and node B on (a, b, c). This reduces the traffic that goes over the network and prevents unnecessary data transfer.
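That data-local scheduling decision can be sketched as a simple lookup. This is a conceptual simplification of what the JobTracker does; the node, block, and task names are hypothetical, and the real scheduler also weighs node load, rack locality, and speculative execution.

```python
# Sketch of data-local scheduling: assign each task to a node that
# already stores the block it needs. Names are illustrative only.


def schedule(tasks, block_locations):
    """tasks: {task: block}; block_locations: {block: [nodes holding it]}.

    Returns {task: node}, preferring a node that holds the block and
    falling back to None when no location is known.
    """
    assignment = {}
    for task, block in tasks.items():
        nodes = block_locations.get(block)
        assignment[task] = nodes[0] if nodes else None
    return assignment


locations = {"x": ["nodeA"], "a": ["nodeB"]}
plan = schedule({"task1": "x", "task2": "a"}, locations)
# task1 runs on nodeA, task2 on nodeB -- the code moves, not the data.
```

Because only the small JAR travels to the node rather than gigabytes of block data traveling to the code, network traffic stays low, which is exactly the benefit the paragraph above describes.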