Apache Hadoop is an open-source framework for the parallel processing of large data sets and the collective mining of disparate data sources. Hadoop consists of the Hadoop Distributed File System (HDFS), YARN (Yet Another Resource Negotiator), and processing components such as MapReduce. YARN acts as the cluster's operating system, managing resources for applications like MapReduce, which processes large datasets in parallel.
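To make the MapReduce model concrete, here is a minimal sketch of its three phases (map, shuffle/sort, reduce) simulated in a single Python process. On a real cluster the mapper and reducer would run as separate tasks (for example via Hadoop Streaming) against data blocks stored in HDFS; the function names and the word-count example are illustrative, not part of any Hadoop API.

```python
# Minimal word-count sketch of the MapReduce model, run locally.
# In Hadoop, the framework parallelizes the map tasks across HDFS
# blocks and performs the shuffle/sort between map and reduce.
from itertools import groupby
from operator import itemgetter

def mapper(line):
    # Map phase: emit a (word, 1) pair for every word in a line.
    for word in line.split():
        yield word.lower(), 1

def reducer(word, counts):
    # Reduce phase: sum all counts emitted for a single word.
    return word, sum(counts)

def word_count(lines):
    # Shuffle/sort phase: sort intermediate pairs and group by key,
    # as the Hadoop framework does between map and reduce.
    pairs = sorted(kv for line in lines for kv in mapper(line))
    return dict(reducer(key, (count for _, count in group))
                for key, group in groupby(pairs, key=itemgetter(0)))

print(word_count(["big data big insights", "data lake"]))
# {'big': 2, 'data': 2, 'insights': 1, 'lake': 1}
```

The value of the model is that the mapper and reducer contain no parallelism logic; Hadoop can scale the same two functions across many machines because the shuffle/sort step between them is handled by the framework.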
The Apache Hadoop open-source framework extends to include additional software components, such as Spark, ZooKeeper, Pig, and Hive, along with hundreds of others. These additional components address the ingestion, security, scripting, processing, visualization, and monitoring of data. Not all components are required; which ones you use depends entirely on your individual workflow needs.
What can you do with Hadoop?
Hadoop analytics help provide a better understanding of customer behavior, operations activities, sales patterns, and more. Hadoop assists the science, medical, and pharmaceutical industries by helping researchers apply new analytic methods to massive quantities of data, enabling discoveries that could not be made with smaller data sample sizes. Hadoop also proves invaluable when evaluating Internet of Things data, where countless appliances, machines, vehicles, devices, garments, accessories, and more produce massive amounts of insight-rich data every day.
What makes Hadoop a component of Big Data?
Apache Hadoop allows for the quick, streamlined mining of the various data sources that you have been collecting. This data in turn enables you to gain valuable business insight. The cross-correlation of data helps you make smarter decisions, build better products and services, and form more informed predictions about future trends and behavior.
Why a Data Lake for Hadoop?
At EMC, we maintain that a Data Lake is essential for true Hadoop environments because the more data that is available for analytics, the richer the data insights will be. A Data Lake takes various data, traditionally held in separate silos, and consolidates them into a single repository that is Hadoop enabled. This consolidation of data allows you to work from a single data source and to manage, control, and protect that source in a unified manner.
Why a Data Lake from EMC for Hadoop?
- Lower operating costs: With the capabilities made possible by a Data Lake from EMC, you’ll require less storage capacity and physical space to house the same amount of data. The Data Lake from EMC is simpler to manage and consumes fewer IT resources for storage administration. Because of these storage efficiencies, you can keep more data for longer rather than disposing of older data sets.
- Faster time to results: With a Data Lake from EMC, you no longer have to move data before analyzing it because the Data Lake enables in-place analytics.
- Scale and flexibility: Although direct-attached storage (DAS) is the conventional approach to deploying and managing Hadoop, there are benefits to decoupling compute and storage with a Data Lake, especially if your Hadoop workload does not linearly scale along with the amount of data.