A Data Architect’s Walkthrough of HDP 2.1


When Hortonworks announced the general availability of HDP 2.1, I was eager to check out what’s new in the release. My initial reason was to double-check the version of Hue for a Quora post, but I also wanted to see the much-awaited commercial support for Storm. Based on last October’s Elasticsearch-Hortonworks partnership announcement, I thought I would be seeing Elasticsearch too. A couple of days ago, though, I read about the Lucidworks-Hortonworks partnership. Search is important, and Lucidworks has a good reputation. It’s just that I happened to learn Elasticsearch on an earlier customer-360 type project and wanted to see my two favorite real-time big data technologies working together, out of the box.


Hortonworks is great about creating virtual machines with all of the Hadoop software installed and configured, so I downloaded the VirtualBox image from their site, landing a 2.82 GB file on my laptop before my flight took off:




Since I had VirtualBox already installed on my MacBook, all I had to do was double-click the downloaded file to open it and start the VM import process:



Helpful hint: If you can, bump up the RAM setting from the default 4GB. Initially I had trouble getting HBase and Storm to start. Configuring all of these components to run in just 4-8GB of RAM is a science unto itself, given that modern Hadoop control and data nodes have 128-256GB of RAM. On my second, successful attempt, I set my VM to use 8GB of RAM.
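If you prefer the command line to the VirtualBox settings dialog, the same change can be made with VBoxManage. Here is a minimal Python sketch that builds the command. The VM name is a placeholder of mine (use whatever name your imported VM shows in VirtualBox), and the VM needs to be powered off before the setting can be changed.

```python
import shlex


def modifyvm_memory_command(vm_name, memory_mb):
    """Build the VBoxManage argument list that sets a VM's RAM (in megabytes)."""
    if memory_mb <= 0:
        raise ValueError("memory_mb must be positive")
    return ["VBoxManage", "modifyvm", vm_name, "--memory", str(memory_mb)]


if __name__ == "__main__":
    # "Hortonworks Sandbox 2.1" is a hypothetical VM name, not taken from the image.
    command = modifyvm_memory_command("Hortonworks Sandbox 2.1", 8 * 1024)
    # Print a copy-and-paste friendly shell command instead of running it.
    print(" ".join(shlex.quote(part) for part in command))
```

The script only prints the command, so you can look it over before running it yourself.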


In the VM’s console window you will see the Red Hat Linux boot sequence and then, finally, success:



Launch the Firefox browser and open up the URL indicated. Wade through the registration until you get to the about page:



My eyes eagerly scanned the list, and then I found the prize: Storm 0.9.1. Storm is a real-time stream processing framework. Storm’s development was led by my bro-crush Nathan Marz and sponsored by Twitter. Storm is now under the Apache umbrella and is integrated to run under the YARN resource manager. By the looks of it, Storm is now managed by Ambari. Storm is used by a number of companies we know, often for ingestion and ETL; some companies have gone as far as generating real-time analytics, aggregates and alerts from Storm and Trident.
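To give a feel for the aggregates-and-alerts use case, here is a toy sketch in plain Python. It is only an analogy and does not use the Storm API: the function names, the sample events and the threshold are all made up for illustration. One generator stands in for a spout that emits events, and another stands in for a bolt that keeps a running count per key and emits an alert the moment a key reaches the threshold.

```python
from collections import defaultdict


def event_source(events):
    """Stand-in for a spout: hand events downstream one at a time."""
    for event in events:
        yield event


def count_and_alert(stream, threshold):
    """Stand-in for a bolt: keep a running count per key and emit
    (key, count) once, at the moment the count reaches the threshold."""
    counts = defaultdict(int)
    for key in stream:
        counts[key] += 1
        if counts[key] == threshold:
            yield (key, counts[key])


if __name__ == "__main__":
    # Hypothetical event keys; in practice these would arrive continuously.
    events = ["login_failed", "page_view", "login_failed", "login_failed"]
    for alert in count_and_alert(event_source(events), threshold=2):
        print("ALERT:", alert)
```

The point is that each event is processed as it arrives, and the alert fires without waiting for a batch job to finish.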


Also notice the Knox and Falcon components. Expect to hear much more about these projects in the future.


You’ll see the familiar HDFS, the file-based storage foundation of Hadoop. For some, YARN and MapReduce 2 are new. YARN has become an important part of getting full utilization of a Hadoop cluster. YARN has displaced MapReduce as a first-class citizen in Hadoop. In fact, MapReduce is now implemented as a YARN application. YARN is definitely the future of Hadoop’s distributed processing model.


Also listed is the latest version of Hive / HCatalog, 0.13. HCatalog encapsulates all of the metadata management for Hive (think DDL), and makes this information available to Hive, Pig, Java MapReduce, Cascading, etc. For those times when you need low-latency data access and analytics, the NoSQL HBase is onboard and upgraded to version 0.98.


Just about everyone knows by now that Hadoop can crunch through massive data sets but often doesn’t fare well with “human-scale” response times to queries. The next milestone will be low-latency SQL queries. Hortonworks has placed its bets on Tez as the path forward. Tez is an optimized processing framework, more flexible than MapReduce but still compatible with it. Tez allows a single job to connect multiple map stages to multiple reduce stages and, where possible, keeps the intermediate results in memory and off of slower disk. Tez is available by default with the release of Hive 0.13. I can tell you from a recent benchmarking project that Hive 0.13 is hot stuff, capable of grinding through complex decision support queries requiring complex window functions, massive joins and aggregations.
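The idea of chaining stages within one job can be illustrated with a toy sketch in plain Python. This is an analogy only, not the Tez API, and the stage names and sample data are mine. A map stage feeds a reduce stage, which feeds a second reduce stage, and the intermediate results are passed along in memory. With classic MapReduce, this shape would typically be two separate jobs, with the first writing its output to HDFS for the second to read back.

```python
from collections import Counter
from itertools import chain


def map_tokenize(lines):
    """Map stage: split each line into lowercase words."""
    return chain.from_iterable(line.lower().split() for line in lines)


def reduce_count(words):
    """First reduce stage: count occurrences of each word."""
    return Counter(words)


def reduce_top(counts, n):
    """Second reduce stage: keep the n most frequent words
    (ties broken alphabetically so the result is deterministic)."""
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:n]


def run_as_single_job(lines, n=2):
    """Chain map -> reduce -> reduce, handing results along in memory."""
    return reduce_top(reduce_count(map_tokenize(lines)), n)


if __name__ == "__main__":
    print(run_as_single_job(["storm tez hive", "tez hive", "tez"]))
```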


Let’s check out the Apache Ambari console:



A thing of beauty: we can now see Tez, Pig, Sqoop, Oozie, Storm, HBase, YARN and MapReduce2 all running side by side. One happy zoo.


Helpful hint: If you have problems launching Storm, try increasing the memory for the DRPC service. Out of the box, the DRPC memory was set to -Xmx250m, and Ambari warns that it should be set to -Xmx768m.
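If you want to sanity-check a setting like this outside of Ambari, here is a small Python sketch that parses the -Xmx value out of a JVM options string and compares it against the 768m figure from the warning. The helper names are mine, and I am assuming the setting arrives as a plain options string (in stock Storm I believe this lives under the drpc.childopts property, but check your own configuration).

```python
import re

# Matches JVM max-heap flags such as -Xmx250m, -Xmx1g or -Xmx524288k.
_XMX_PATTERN = re.compile(r"-Xmx(\d+)([kKmMgG]?)")
_UNIT_BYTES = {"": 1, "k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}


def max_heap_bytes(jvm_opts):
    """Return the -Xmx value in bytes, or None if the flag is absent."""
    match = _XMX_PATTERN.search(jvm_opts)
    if match is None:
        return None
    size, unit = match.groups()
    return int(size) * _UNIT_BYTES[unit.lower()]


def drpc_heap_ok(jvm_opts, minimum="-Xmx768m"):
    """True if the configured max heap is at least the recommended minimum."""
    heap = max_heap_bytes(jvm_opts)
    return heap is not None and heap >= max_heap_bytes(minimum)


if __name__ == "__main__":
    for opts in ("-Xmx250m", "-Xmx768m"):
        print(opts, "ok" if drpc_heap_ok(opts) else "too small")
```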



Here is my final Storm DRPC server configuration:




Drilling down on the Storm link, we find the familiar Storm UI.



These are just some of the highlights of HDP 2.1 that caught my attention. I’m excited to see a huge chunk of the Big Data ecosystem so easily available on my laptop for client implementation projects. Back when I first started at Think Big, bringing up just Hadoop was a multi-day effort, starting with compiling the software and compression codecs. These days, VMs are a standard part of our implementation projects. This is hugely valuable for us given the large number of projects and different environments we work with on a weekly basis. Three is this week’s Hadoop distribution count for me.


The pace of innovation is only accelerating. Since HDP 2.0’s release in October, several components have received major upgrades (Hive, HBase, Mahout, Ambari), and more than half a dozen new projects have been included (Tez, Storm, Accumulo, Solr, Knox, Falcon, and Phoenix).


Best of luck to you in your Big Data Analytics journey. We’re here to help along the way.
