Thursday, July 24, 2014

A Summary of Hadoop Ecosystem

Big Data Number-Crunching Platform: Hadoop

Apache Hadoop
  1. Hadoop Common: The common utilities that support the other Hadoop modules.
  2. Hadoop Distributed File System (HDFS): A distributed file system that provides high-throughput access to application data.
  3. Hadoop YARN: A framework for job scheduling and cluster resource management.
  4. Hadoop MapReduce: A YARN-based system for parallel processing of large data sets (see the word-count sketch below).
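
To make the MapReduce model concrete, here is the classic word-count job written against the org.apache.hadoop.mapreduce API: the mapper emits a (word, 1) pair for every token, and the reducer sums the counts per word. This is a minimal sketch, with input and output paths taken from the command line.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emits (word, 1) for every token in its input split
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reducer: sums the counts emitted for each word
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // local pre-aggregation
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Compile it against the hadoop-client libraries and run it with "hadoop jar wordcount.jar WordCount <input dir> <output dir>"; note that the output directory must not already exist.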
Hadoop Ecosystem
  1. Pig: A platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs.
  2. Hive: A data warehouse system for Hadoop that offers a SQL-like query language to facilitate easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop-compatible file systems (see the JDBC sketch after this list).
  3. HBase: A distributed, scalable big data store with random, real-time read/write access.
  4. Sqoop: A tool designed for efficiently transferring bulk data between Apache Hadoop and structured data stores such as relational databases.
  5. Flume: A distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data.
  6. ZooKeeper: A centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services.
and many others.
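
As a taste of Hive's SQL-like interface from application code, the sketch below runs a HiveQL query over a HiveServer2 JDBC connection. The host name, credentials, and the web_logs table are hypothetical placeholders; under the hood, Hive compiles the query into MapReduce jobs.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuery {
  public static void main(String[] args) throws Exception {
    // Register the HiveServer2 JDBC driver
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    try (Connection conn = DriverManager.getConnection(
             "jdbc:hive2://hive-server:10000/default", "user", ""); // placeholder host/user
         Statement stmt = conn.createStatement();
         // HiveQL looks like SQL but executes as MapReduce jobs on the cluster
         ResultSet rs = stmt.executeQuery(
             "SELECT page, COUNT(*) AS hits FROM web_logs GROUP BY page")) {
      while (rs.next()) {
        System.out.println(rs.getString("page") + "\t" + rs.getLong("hits"));
      }
    }
  }
}

The same aggregation written as raw MapReduce would take dozens of lines, which is exactly the gap Hive fills.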

Hadoop Major Distributions
  1. Hortonworks, Cloudera, and MapR
  2. Pivotal HD by EMC Corporation
  3. IBM InfoSphere BigInsights
  4. Amazon Elastic MapReduce (EMR), a cloud-based offering
Hadoop Distribution Analysis
  1. Cloudera: By far the most established distribution, with the largest number of reference deployments. Powerful tooling for deployment, management, and monitoring is available. Cloudera developed and contributed Impala to offer real-time processing of big data.
  2. Hortonworks: The only vendor that uses 100% open-source Apache Hadoop without proprietary (non-open) modifications. Hortonworks was the first vendor to use Apache HCatalog functionality for metadata services, and its Stinger initiative massively optimizes the Hive project. Hortonworks also offers a very good, easy-to-use sandbox for getting started, and it developed and committed enhancements into the core trunk that make Apache Hadoop run natively on Microsoft Windows platforms, including Windows Server and Windows Azure.
  3. MapR: Uses some concepts that differ from its competitors, most notably support for a native Unix file system instead of HDFS (with non-open-source components) for better performance and ease of use; native Unix commands can be used instead of Hadoop commands. MapR also differentiates itself with high-availability features such as snapshots, mirroring, and stateful failover. The company is spearheading the Apache Drill project, an open-source re-envisioning of Google's Dremel, to offer real-time, SQL-like queries on Hadoop data.
  4. Amazon Elastic MapReduce (EMR): Differs from the others in that it is a hosted solution running on the web-scale infrastructure of Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Simple Storage Service (Amazon S3). Besides Amazon's own distribution, you can also run MapR on EMR. A major use case is ephemeral clusters: if you need one-time or infrequent big data processing, EMR can save you a lot of money. There are disadvantages, too. Of the Hadoop ecosystem, only Pig and Hive are included by default, so many other tools are missing. EMR is also highly tuned for working with data in S3, which has higher latency and does not locate the data on your computational nodes, so file I/O on EMR is slower and more latent than I/O on your own Hadoop cluster or your own EC2 cluster.

Big Data Suite
A big data suite comes on top of Apache Hadoop or a Hadoop distribution. Most suites support several Hadoop distributions, while some vendors implement their own Hadoop solution. Features added:
  1. Tooling: Usually based on an IDE (Eclipse, for instance), typically offering graphical tooling to model big data services.
  2. Modeling: Although the basic infrastructure for Hadoop clusters is provided by Apache Hadoop or a Hadoop distribution, there's still a need to write lots of code to build a MapReduce program. It can be written in plain Java, or in optimized languages such as Pig Latin or the Hive Query Language (HQL), which generate MapReduce code.
  3. Code Generation: The suite generates the MapReduce code, meaning you don't have to write, debug, analyze, and optimize it yourself.
  4. Scheduling: Execution of big data jobs can be scheduled and monitored, which makes defining and managing execution plans much easier.
  5. Integration: A Hadoop deployment must integrate files, SQL databases, NoSQL databases, social media (Twitter, Facebook), messages from middleware, and data from B2B products (Salesforce, SAP). A big data suite provides connectors into Hadoop and back that can be managed graphically and can offer services such as data cleansing (see the sketch after this list for what hand-coded file integration looks like).
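
To illustrate the kind of plumbing a suite's graphical connectors replace, here is a minimal hand-written sketch that moves a file into and out of HDFS with Hadoop's FileSystem API. The namenode address and paths are hypothetical placeholders; real integration code would add error handling, format conversion, and data cleansing on top.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRoundTrip {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.defaultFS", "hdfs://namenode:8020"); // placeholder address

    FileSystem fs = FileSystem.get(conf);
    Path path = new Path("/data/incoming/events.csv"); // placeholder path

    // Write a small CSV file into HDFS (overwrite if it exists)
    try (FSDataOutputStream out = fs.create(path, true)) {
      out.write("id,value\n1,42\n".getBytes(StandardCharsets.UTF_8));
    }

    // Read the file back line by line
    try (BufferedReader reader = new BufferedReader(
             new InputStreamReader(fs.open(path), StandardCharsets.UTF_8))) {
      String line;
      while ((line = reader.readLine()) != null) {
        System.out.println(line);
      }
    }
  }
}

A suite generates or hides this kind of boilerplate and layers cleansing, scheduling, and monitoring on top of it.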
The three alternatives, therefore, are:
  1. Apache Hadoop Framework
  2. A Hadoop distribution 
  3. A Big Data Suite (from IBM, Oracle, or Microsoft, for example)