Thursday, July 24, 2014

A Summary of Hadoop Ecosystem

Big Data Number Crunching Platform: Hadoop

Apache Hadoop
  1. Hadoop Common: The common utilities that support the other Hadoop modules.
  2. Hadoop Distributed File System (HDFS): provides high-throughput access to application data.
  3. Hadoop YARN: A framework for job scheduling and cluster resource management.
  4. Hadoop MapReduce: A YARN-based system for parallel processing of large data sets (see the word count sketch below).
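
To make the MapReduce model concrete, here is the classic word count program as a minimal sketch in Java, assuming the Hadoop 2.x client libraries are on the classpath; input and output HDFS paths come from the command line:

  // The classic MapReduce word count: counts how often each word
  // appears in the input files.
  import java.io.IOException;
  import java.util.StringTokenizer;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.Mapper;
  import org.apache.hadoop.mapreduce.Reducer;
  import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
  import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

  public class WordCount {

    // Map phase: emit (word, 1) for every token in the input line.
    public static class TokenizerMapper
        extends Mapper<Object, Text, Text, IntWritable> {
      private final static IntWritable ONE = new IntWritable(1);
      private final Text word = new Text();

      @Override
      public void map(Object key, Text value, Context context)
          throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
          word.set(itr.nextToken());
          context.write(word, ONE);
        }
      }
    }

    // Reduce phase: sum the counts emitted for each word.
    public static class IntSumReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {
      private final IntWritable result = new IntWritable();

      @Override
      public void reduce(Text key, Iterable<IntWritable> values,
          Context context) throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
          sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
      }
    }

    public static void main(String[] args) throws Exception {
      Job job = Job.getInstance(new Configuration(), "word count");
      job.setJarByClass(WordCount.class);
      job.setMapperClass(TokenizerMapper.class);
      job.setCombinerClass(IntSumReducer.class); // local pre-aggregation
      job.setReducerClass(IntSumReducer.class);
      job.setOutputKeyClass(Text.class);
      job.setOutputValueClass(IntWritable.class);
      FileInputFormat.addInputPath(job, new Path(args[0]));
      FileOutputFormat.setOutputPath(job, new Path(args[1]));
      System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
  }

Packaged into a jar, this runs with the standard invocation: hadoop jar wordcount.jar WordCount /input /output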
Hadoop Ecosystem
  1. Pig: A platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs.
  2. Hive: A data warehouse system for Hadoop that offers a SQL-like query language to facilitate easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop compatible file systems.
  3. HBase: A distributed, scalable big data store with random, real-time read/write access.
  4. Sqoop: A tool designed for efficiently transferring bulk data between Apache Hadoop and structured data stores such as relational databases.
  5. Flume: A distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data.
  6. Zookeeper: A centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services.
and many others.

Hadoop Major Distributions
  1. Hortonworks, Cloudera and MapR
  2. Pivotal HD by EMC Corporation
  3. IBM InfoSphere BigInsights
  4. Amazon Elastic MapReduce (EMR), which is cloud based
Hadoop distribution analysis
  1. Cloudera: The most established distribution by far, with the largest number of reference deployments. Powerful tooling for deployment, management and monitoring is available. Impala, developed and contributed by Cloudera, offers real-time processing of big data.
  2. Hortonworks: The only vendor that uses 100% open-source Apache Hadoop without its own (non-open) modifications. Hortonworks was the first vendor to use Apache HCatalog functionality for metadata services, and its Stinger initiative massively optimizes the Hive project. Hortonworks offers a very good, easy-to-use sandbox for getting started. It also developed and committed enhancements into the core trunk that make Apache Hadoop run natively on Microsoft Windows platforms, including Windows Server and Windows Azure.
  3. MapR: Uses some concepts different from its competitors', most notably support for a native Unix file system instead of HDFS (with non-open-source components) for better performance and ease of use. Native Unix commands can be used instead of Hadoop commands. MapR also differentiates itself with high-availability features such as snapshots, mirroring and stateful failover. The company is also spearheading the Apache Drill project, an open-source re-envisioning of Google's Dremel, for SQL-like queries on Hadoop data with real-time processing.
  4. Amazon Elastic MapReduce (EMR): Differs from the others in that it is a hosted solution running on the web-scale infrastructure of Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Simple Storage Service (Amazon S3). Besides Amazon's distribution, you can also use MapR on EMR. A major use case is ephemeral clusters: if you need one-time or infrequent big data processing, EMR might save you a lot of money. However, there are some disadvantages, too. Of the Hadoop ecosystem, only Pig and Hive are included, so many other components are missing by default. Moreover, EMR is highly tuned for working with data in S3, which has higher latency and does not locate the data on your computational nodes, so file I/O on EMR is slower and more latent than I/O on your own Hadoop cluster or on your own EC2 cluster.

Big Data Suite
A big data suite sits on top of Apache Hadoop or a Hadoop distribution. Some suites support several different Hadoop distributions, while some vendors implement their own Hadoop solution. Features added:
  1. Tooling: Usually based on top of an IDE (Eclipse, for instance). Typically it offers graphical tooling to model big data services.
  2. Modeling: Although the basic infrastructure for Hadoop clusters is provided by Apache Hadoop or a Hadoop distribution, there is still a need to write lots of code to build a MapReduce program. It can be written in plain Java, or in higher-level languages such as Pig Latin or the Hive Query Language (HQL), which generate MapReduce code (see the HQL sketch after this list).
  3. Code Generation: Means that you don't have to write, debug, analyze and optimize MapReduce code yourself.
  4. Scheduling: Execution of big data jobs can be scheduled and monitored, making it much easier to define and manage execution plans.
  5. Integration: Hadoop deployments need to integrate files, SQL databases, NoSQL databases, social media (Twitter, Facebook), messages from middleware and data from B2B products (Salesforce, SAP). A big data suite provides connectors to Hadoop and back, which can be managed graphically and provide services such as data cleansing.
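
As an illustration of how little code a higher-level language needs, here is a minimal sketch that runs an HQL query over Hadoop data through Hive's JDBC interface. It assumes a running HiveServer2 instance at localhost:10000 with the Hive JDBC driver on the classpath; the table and column names (web_logs, status) are hypothetical:

  // A sketch of querying Hadoop data through HiveServer2's JDBC
  // interface. The table and column (web_logs, status) are made up
  // for illustration.
  import java.sql.Connection;
  import java.sql.DriverManager;
  import java.sql.ResultSet;
  import java.sql.Statement;

  public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
      Class.forName("org.apache.hive.jdbc.HiveDriver");
      try (Connection con = DriverManager.getConnection(
               "jdbc:hive2://localhost:10000/default", "user", "");
           Statement stmt = con.createStatement();
           // One line of HQL replaces a hand-written MapReduce job:
           ResultSet rs = stmt.executeQuery(
               "SELECT status, COUNT(*) FROM web_logs GROUP BY status")) {
        while (rs.next()) {
          System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
        }
      }
    }
  }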
Therefore the three alternatives are:
  1. The Apache Hadoop framework
  2. A Hadoop distribution
  3. A big data suite (from IBM, Oracle or Microsoft, for example)



Monday, May 5, 2014

Initial Experiences with Hadoop on a VM, Ubuntu on an external drive and other eccentricities...

-------------------------------------------------------------------------
02/05/2014
Linux version: Ubuntu Server 12.04 LTS (Precise) - http://www.ubuntu.com/download/server

I will try installing it on an external drive from a flash drive. First I need to download the Ubuntu image file (.iso) from http://www.ubuntu.com/download/server

To prepare the flash drive to receive the Ubuntu installer I will be using an application called Rufus, which can be downloaded from http://rufus.akeo.ie/. This application doesn't need to be installed; it just needs to be downloaded and run. Once it has started, select:


  1. MBR partition scheme for BIOS or UEFI computers
  2. Create a bootable disk using: ISO Image
  3. Select the .iso file downloaded previously
  4. Rufus will run and prepare the USB stick to be a bootable device

I will now reboot my system and check whether the OS loads from the USB stick this time. Once this is done, I will start the installation process on the external HD. First thing to remember: it is an "sdb" drive and the boot loader must be installed on it, NOT on the "sda" drive.

Note: On my Samsung Ativ 6 with Windows 8.1 there's a feature in the BIOS setup that must be disabled. The Samsung Ativ 6 setup sequence is activated by pressing F2 during boot. First go to Advanced and disable Fast BIOS Mode. Then save settings and reboot. Get back to the setup sequence by pressing F2 again. Now go to Boot and change the boot order, making the USB stick the first option. This will load the Ubuntu installation program.

Now let's get back to the installation procedure.

The product was installed and the boot sequence was OK; the boot loader for the external HD now appears. Nonetheless, after the initial sequence was followed, the system ends at a blank black screen. My initial reaction was to reinstall the server. I did this and included only PostgreSQL and the Tomcat Java server in the installation. The installation sequence went very smoothly and the same problem happened. I then tried to start Ubuntu in a different way, with command-line support only. It seems to accept this procedure, even asking for my login name (gustavo) and my password (standard, new, non-Windows).

----------------------------------------------------------------
03/05/2014

I found one explanation and one possibility for this problem. First: the graphical user interface is not present in the Ubuntu (Linux) server version. This is why I ended up with a black screen and a command prompt (when I tried to initiate one of the secured sequences in Ubuntu). To address this I'm now downloading the desktop version and will try to add other features later. Second: my video card may be incompatible with Ubuntu, which I will find out in a few minutes, since I'm downloading Ubuntu desktop (which comes in a much bigger package).

An interesting alternative would be to run Ubuntu in Oracle VirtualBox (formerly Sun VirtualBox) and configure everything from there. It is repeatedly stated in the literature that this runs perfectly. I will also give it a try.

I keep getting the blank screen and found a possible solution that is related (as I suspected) to the video drivers. Here is a copy-paste explanation/solution that I got from the internet (http://ubuntuforums.org/showthread.php?t=2203824):


  1. "Because the BIOS/UEFI firmware is setup in CSM mode (which emulates old fashioned 16-bit BIOS calls) the installer's generic video drivers won't recognize your Intel Haswell ULT Touchscreen Graphics card which ultimately results in booting up to a blank screen. There's an easy way and a hard way to fix this:
    1. Easy way: When the GRUB bootloader prompts you to Install Ubuntu, press F6 (this is the easy way). Pressing F6 will display some options and you want to select the very last option NOMODESET which disables a portion of the RAM [framebuffer] that's built-in your Intel Haswell graphics card. The active drivers on the bootable USB installer don't know how to utilize this yet."
    2. Hard way: 
      1. Use the GRUB prompt (hard way) if necessary to edit the "quiet splash" command, which you can simply replace with "nomodeset" before GRUB loads the kernel and boots Ubuntu.
      2. The same thread also says to adjust my firmware settings in the following way:
        1. Keep Fastboot DISABLED
        2. Keep Secure Boot DISABLED
        3. Select UEFI OS
        4. Save & Reset


Note: I tried the F6 option and it didn't work. I then noticed that there was an option to edit the start configuration file by pressing "e". I did this and found the line with the "quiet splash" command. I replaced it with "nomodeset" and BINGO, everything worked just fine. I was able to boot from the pen drive, and saw the Ubuntu graphical user interface for the first time.

Now let's move back to the hard drive configuration.

I tried starting the hard drive installation but again got stuck at the black screen. Suspecting that I was facing the video driver problem again, I started my system once more and, before going into the Ubuntu installation, pressed "e"; there it was, "quiet splash", once more in the configuration file. I changed it to "nomodeset" and the installation started.

I was then presented with the hard disk configuration utility. I found information about running Windows 8.1 and Ubuntu on different drives at http://www.linuxbsdos.com/2013/10/23/how-to-install-ubuntu-13-10-on-an-external-hard-drive/ and got the following:

  1. sda is the drive with Windows 8 and it should be left alone. Let's focus on either sdb or sdc.
  2. Choose the option "Something else" (don't choose any of the standard options or you will end up messing up the Windows 8.1 partition on your internal HD).
  3. Leave sda alone and delete the other partition(s). Now prepare the system to receive Ubuntu in the following way:
    1. Create a new partition (I did this using 90% of my 1 TB disk) for an Ext4 journaling file system mounted on / (i.e. root),
    2. Create another one, using a little less than 10%, for swap,
    3. Change the boot loader location to this new partition (sdb or sdc) and voilà, you're ready to go.

Note: so far, I must remember that whenever I try running Ubuntu I must change "quiet splash" to "nomodeset" in the loader script. I will try learning how to make this change permanent (a sketch follows the checklist below). The summary of "strange" prerequisites for this Ubuntu installation is:


  1. System setup (F2):
    1. Fastboot DISABLED
    2. Secure Boot DISABLED
    3. Method UEFI OS
    4. Boot sequence: pen drive, external HD, Windows
  2. Loader script:
    1. Press "e" to edit
    2. Change "quiet splash" to "nomodeset"
    3. Press F10 to boot

When running Ubuntu, ALWAYS change the loader script; otherwise the system will not run.
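
To make the "nomodeset" change permanent, the usual Ubuntu approach (which I still have to test on this machine) is to edit the GRUB defaults file and regenerate the boot configuration:

  # /etc/default/grub (edit as root)
  GRUB_CMDLINE_LINUX_DEFAULT="nomodeset"   # was "quiet splash"

  # Then regenerate /boot/grub/grub.cfg:
  sudo update-grub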

That's it. Now I have a system with a boot loader and can start either Windows 8.1 or Ubuntu.

So far, so good!

Thursday, February 20, 2014

Big Data for Managers, Quantitative Modeling and Advanced Modeling - Course Development Ideas - Post originally in Brazilian Portuguese

----------------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------------
Big Data for Managers Course Structure

A Big Data course that aims to present an initial view of the possibilities and resources, while also giving participants hands-on experience, would have the following topics:

I - Data Modeling

Introduce the concept of a data model. The objective is to show the student the possibilities for data analysis beyond the traditional spreadsheet, which appear when the amount of data grows and complex interrelationships between the columns of a spreadsheet start to emerge.

Since these examples aim to introduce new concepts or to improve the individual use of resources, it is recommended to start this part with Access and later present relational databases that are scalable (SQL Server or even SciDB).

II – Statistics

Starting from the databases used to implement the examples of part I, more sophisticated statistical analysis concepts can be explored, based on subsets of data extracted and exported from a DB (the one used in part I).

The first examples involve hypothesis tests, ANOVA and regression done in Excel (a small regression sketch appears below).

More sophisticated examples of clustering and factor analysis are done in R.
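
As a taste of what the regression examples compute, here is a minimal sketch (with made-up numbers) of ordinary least squares for a single predictor, the same model Excel's regression tool fits:

  // Ordinary least squares for one predictor: the model behind the
  // Excel regression examples. The numbers below are made up.
  public class SimpleRegression {
    public static void main(String[] args) {
      double[] x = {1, 2, 3, 4, 5};           // e.g. advertising spend
      double[] y = {2.1, 3.9, 6.2, 8.0, 9.8}; // e.g. resulting sales
      int n = x.length;
      double sx = 0, sy = 0, sxy = 0, sxx = 0;
      for (int i = 0; i < n; i++) {
        sx += x[i]; sy += y[i];
        sxy += x[i] * y[i]; sxx += x[i] * x[i];
      }
      // slope = (n*Sxy - Sx*Sy) / (n*Sxx - Sx^2); intercept from the means
      double slope = (n * sxy - sx * sy) / (n * sxx - sx * sx);
      double intercept = (sy - slope * sx) / n;
      System.out.printf("y = %.3f + %.3f x%n", intercept, slope);
    }
  }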

III – Big Data

Always through practical, management-oriented examples executed on the computer, the student has by now learned the importance of the structure of data and the power of statistical techniques to turn diffuse data into useful information.

At this point the student is introduced to techniques for capturing data and "mining" information in unstructured (that is, real) environments. This is where Hadoop and MapReduce are introduced.

The topics would be:

  • a) Introduction to Hadoop. Understanding distributed systems. Comparing SQL DBs and Hadoop. Understanding MapReduce. Running a simple word count program.
  • b) Hadoop structure. The Hortonworks implementation on Windows, for individual machines. Differences between individual implementations (for testing and learning) and typical (cluster) implementations.
  • c) MapReduce. Creating MapReduce programs. Combining different data sources. Creating filters.
  • d) R and Hadoop integration.
  • e) Case studies: New York Times Archive, Mining at China Mobile, Websites at StumbleUpon, IBM Project ES2.


IV – Case Studies

The course concludes with examples of predictive analytics in the areas of advertising, consumer preferences and choice, market basket analysis, economic analysis, operations, text analytics, sports, the relationship between brand and price, and also spatial analysis of data.

These examples explore and integrate the three parts presented above.

----------------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------------
Quantitative Modeling Basic Course

  1. Spreadsheet problem modeling
  2. Results optimization
  3. Uncertainty simulation
  4. Examples:
  • Production and marketing mix optimization.
  • Advertising results analysis. 
  • Problem linearization and operation sequencing: oil and pharma industry examples.
  • Operations management applications. 
  • Financial applications. Cash flow management.
  • Network distribution optimization.
  • Resource allocation. Territory assignment and facilities location.
  • DEA – Data Envelopment Analysis. Applications in services, finance and third sector organizations.
  • Integer and binary programming
  • Results simulation under uncertainty conditions: overbooking, market-share, product insurance, cash flow, VaR (Value at Risk) introduction (see the sketch after this list).
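
To show the kind of model the uncertainty-simulation examples build, here is a minimal Monte Carlo sketch of an overbooking policy; all the numbers (seats, tickets sold, show-up probability) are hypothetical:

  // Monte Carlo simulation of an airline overbooking policy. All the
  // numbers (seats, tickets sold, show-up probability) are made up.
  import java.util.Random;

  public class OverbookingSim {
    public static void main(String[] args) {
      int seats = 100;
      int ticketsSold = 108;      // hypothetical overbooking policy
      double showUpProb = 0.92;   // hypothetical passenger behavior
      int trials = 100_000;
      Random rnd = new Random(42);
      long bumped = 0;
      for (int t = 0; t < trials; t++) {
        int showUps = 0;
        for (int i = 0; i < ticketsSold; i++) {
          if (rnd.nextDouble() < showUpProb) showUps++;
        }
        bumped += Math.max(0, showUps - seats); // denied boarding
      }
      System.out.printf("Average bumped passengers per flight: %.3f%n",
          (double) bumped / trials);
    }
  }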



Advanced Course (A.I. Applied to Business Problems)

  • Software: Excel, Solver (2010) & Palisade @RISK
  1. Non-linear optimization: gradient methods, application limits. Evolutionary methods. Multiple-start methods. Pricing, operations and investment applications with Solver 2010 and Palisade Evolver.
  2. Neural net applications. Credit portfolio management and risk analysis with Palisade NeuralTools.
  3. VBA intro. Monte Carlo simulation review. Applications in finance, hedging, futures and derivatives using Microsoft Excel spreadsheets and Palisade @RISK.
  4. Optimization under uncertainty. Derivatives pricing. Hedge, futures and derivatives optimization with Excel, Palisade RISKOptimizer & VBA.


Monday, January 20, 2014

THE MISSION!

Here starts THE MISSION: the development of an Introduction to Big Data course. In this course, some important technologies that enable the Big Data Era will be explained and applied to a series of business examples.