The Hadoop Summit of 2010 started off with a vuvuzela blast from Blake Irving, Chief Product Officer at Yahoo. Yahoo’s keynote addresses outlined the scale of its Hadoop use, the technical directions of its contributions, and the architectural patterns in how it applies the technology.
The increasing interest in Hadoop was evident: this year’s conference had 1,000 people in attendance and was sold out 10 days before it started, up from about 300 attendees two years ago and 650 last year. James Gosling, the father of Java, was also in attendance. The conference marked the (approximate) fifth anniversary of Hadoop. Irving noted that only 5% of the world’s data is structured, that unstructured data is growing tremendously, and that this new data is of a far more transient nature. He highlighted that Yahoo uses Hadoop to analyze every page click and to optimize content rankings, updating the results every 7 minutes, and he stated, "we believe Hadoop is now ready for mainstream enterprise use."
Yahoo’s SVP of Cloud Computing, Shelton Shugar, noted that Yahoo ingests 120 TB of data per day, representing 100 billion events, and currently stores 70 PB with 170 PB of capacity. Yahoo processes 3 PB per day, running over a million jobs per month on 38,000 servers. As Yahoo’s usage of Hadoop has grown, they have needed to build support for mainstream application programmers, better management tools, and data security. He noted that Yahoo uses Hadoop in production for a variety of products:
- data analytics
- content optimization
- content enrichment
- Yahoo! Mail Anti-Spam
- Ad products
- Ad optimization
- Ad selection
- Big data processing & ETL
Yahoo is also using Hadoop heavily for applied science, such as:
- user interest prediction
- ad inventory prediction
- search ranking
- ad targeting
- spam filtering
Eric Baldeschwieler, VP of Hadoop Software Development at Yahoo, noted that in the last year Yahoo has:
- increased their clusters from 2,000 to 4,000 nodes each
- doubled the number of jobs per node, benefiting from increased CPU power due to Moore’s law
- reached over 80% disk utilization and typically 50-60% CPU utilization, with data use growing faster than processing use
- contributed over 70% of the total patches to Hadoop
Their focus in the last year was to improve Hadoop map-reduce:
- a new capacity scheduler
- job tracker stability and robustness supporting mixed workloads
- adding limits to resource use: safety rails
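The capacity-scheduler and "safety rails" ideas can be sketched in miniature. This is a toy model, not the actual Hadoop CapacityScheduler: the queue names, fractions, and two-pass assignment below are illustrative assumptions only.

```python
# Toy model of capacity scheduling with hard limits ("safety rails").
# Each queue is guaranteed a share of the cluster's task slots, and no
# queue may grow past its hard limit even when the cluster is idle.
def assign_slots(total_slots, queues):
    """queues: name -> {"guaranteed": fraction, "limit": fraction, "demand": slots}."""
    # Pass 1: every queue gets its guaranteed share, if it wants it.
    assigned = {}
    for name, q in queues.items():
        assigned[name] = min(q["demand"], int(total_slots * q["guaranteed"]))
    free = total_slots - sum(assigned.values())
    # Pass 2: hand out leftover slots, never past a queue's hard limit.
    for name, q in queues.items():
        ceiling = int(total_slots * q["limit"])
        extra = min(free, q["demand"] - assigned[name], ceiling - assigned[name])
        if extra > 0:
            assigned[name] += extra
            free -= extra
    return assigned

# Hypothetical queues: research wants 100 slots but is capped at 50%.
queues = {
    "prod": {"guaranteed": 0.6, "limit": 0.8, "demand": 50},
    "research": {"guaranteed": 0.4, "limit": 0.5, "demand": 100},
}
print(assign_slots(100, queues))  # research is railed at 50 slots
```

The hard limit is the "safety rail": without it, a runaway research job could claim every idle slot and starve production work that arrives later.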
Their focus is now on developing Hadoop’s distributed file system, HDFS:
- every node in their clusters now has 12 TB of storage, and they are building a single 48 PB cluster – "that blows Hadoop’s mind" because of scalability limits in the name node
- improving use of memory, connections and buffers, and to provide metrics
- breaking up storage into a set of volumes (uses multiple HDFS clusters)
- releasing federated storage across HDFS instances, in the next major Hadoop release
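A minimal sketch of the volume idea, assuming a client-side mount table that routes path prefixes to separate name nodes so no single name node must hold the whole namespace; the cluster URIs and path prefixes here are hypothetical:

```python
# Hypothetical mount table: each path prefix (a "volume") is served by
# its own HDFS namespace with its own name node.
MOUNT_TABLE = {
    "/user": "hdfs://nn-user.example.com:8020",
    "/data": "hdfs://nn-data.example.com:8020",
    "/projects": "hdfs://nn-projects.example.com:8020",
}

def resolve(path):
    """Pick the name node responsible for a path by longest matching prefix."""
    for prefix in sorted(MOUNT_TABLE, key=len, reverse=True):
        if path == prefix or path.startswith(prefix + "/"):
            return MOUNT_TABLE[prefix], path
    raise KeyError(f"no volume mounted for {path}")

print(resolve("/data/clicks/2010-06-29"))
```

Splitting the namespace this way sidesteps the single name node's memory limits, at the cost of clients needing the mount table to find the right cluster.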
Baldeschwieler explained how Yahoo personalizes their home page:
- Realtime serving systems built on Apache read maps of users to interests from a database
- Every 5 minutes they use a production Hadoop cluster to rerank content based on recent data, updating results every 7 minutes
- Every week they recompute their machine learning models for categories in a science Hadoop cluster
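The periodic reranking step can be illustrated with a Hadoop-Streaming-style mapper/reducer pair, driven locally here; the event format and the scoring rule (rank by recent click volume) are assumptions for illustration, not Yahoo’s actual algorithm.

```python
# Sketch of a batch reranking job: map recent click events to
# (category, 1) pairs, then reduce to a fresh content ranking.
from collections import defaultdict

def map_click(event):
    """Map a raw (user, category) click event to (category, 1) pairs."""
    user, category = event
    yield category, 1

def reduce_scores(pairs):
    """Sum click counts per category; highest recent volume ranks first."""
    totals = defaultdict(int)
    for category, count in pairs:
        totals[category] += count
    return sorted(totals, key=totals.get, reverse=True)

def rerank(recent_clicks):
    """One batch run: feed all mapper output into the reducer."""
    pairs = [kv for event in recent_clicks for kv in map_click(event)]
    return reduce_scores(pairs)

clicks = [("u1", "sports"), ("u2", "news"), ("u3", "sports"), ("u4", "finance")]
print(rerank(clicks))  # "sports" ranks first (most recent clicks)
```

On a real cluster the mapper and reducer would run over the click logs in HDFS; the weekly model recompute on the science cluster would feed the serving side separately.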
Yahoo Mail uses Hadoop in a similar way:
- Scoring mail against a spam model in a production cluster frequently
- Retraining antispam models on a science cluster every few hours.
- This system powers 5 billion deliveries per day to over 450 million mail boxes.
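The science/production split can be sketched with a toy word-frequency spam model: a train step standing in for the science cluster's retraining job and a score step for the production cluster's scoring job. The model and all data below are illustrative, not Yahoo's anti-spam system.

```python
# Toy spam model: smoothed per-word frequencies for spam vs. ham,
# scored as a log-likelihood ratio (positive means spam-like).
import math
from collections import Counter

def train(spam_docs, ham_docs):
    """Science-cluster step: recompute the model from labeled mail."""
    model = {}
    for label, docs in (("spam", spam_docs), ("ham", ham_docs)):
        counts = Counter(w for d in docs for w in d.split())
        total = sum(counts.values())
        # Add-one smoothing so unseen words get a small nonzero weight.
        model[label] = {w: (c + 1) / (total + len(counts)) for w, c in counts.items()}
        model[label + "_default"] = 1 / (total + len(counts))
    return model

def score(model, message):
    """Production-cluster step: sum log-odds of each word in the mail."""
    s = 0.0
    for w in message.split():
        p_spam = model["spam"].get(w, model["spam_default"])
        p_ham = model["ham"].get(w, model["ham_default"])
        s += math.log(p_spam / p_ham)
    return s

model = train(["win cash now", "free cash offer"],
              ["meeting at noon", "lunch at noon"])
print(score(model, "free cash") > 0)      # True: spam-like words
print(score(model, "lunch at noon") > 0)  # False: ham-like words
```

Retraining every few hours amounts to rerunning `train` over fresh labels on the science cluster, then shipping the resulting model to the production scorers.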
Because HDFS has a single point of failure (the name node), it is a risk for a high-availability production system. To mitigate this, Yahoo replicates data to multiple clusters, so that an outage of one distributed file system can be compensated for by switching to a backup file system. In their presentations Yahoo noted that they are using the Hive data warehousing project, in addition to their own Pig project.
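The replication-based mitigation can be sketched as client-side failover between clusters. The cluster objects below are fake stand-ins for real HDFS clients, used only to show the pattern.

```python
# Failover read: try each cluster in order, falling back to the next
# replica when one cluster's name node is unreachable.
def read_with_failover(path, clusters):
    """Return data from the first cluster that can serve the path."""
    last_error = None
    for cluster in clusters:
        try:
            return cluster.read(path)
        except IOError as err:  # e.g. name node down
            last_error = err
    raise last_error

class FakeCluster:
    """Stand-in for an HDFS client bound to one cluster."""
    def __init__(self, name, up, data):
        self.name, self.up, self.data = name, up, data

    def read(self, path):
        if not self.up:
            raise IOError(f"{self.name}: name node unreachable")
        return self.data[path]

primary = FakeCluster("primary-dfs", up=False, data={})
backup = FakeCluster("backup-dfs", up=True, data={"/logs/clicks": b"click data"})
print(read_with_failover("/logs/clicks", [primary, backup]))
```

The cost of this approach is keeping the replica clusters' copies of the data current, which is why the data is written to multiple clusters in the first place.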
Baldeschwieler announced that Yahoo has released a beta test of Hadoop Security, which uses Kerberos for authentication and allows business-sensitive data to be colocated within the same cluster. They have also released Oozie, a workflow engine for Hadoop that has become the de-facto ETL standard at Yahoo; it is integrated with MapReduce, HDFS, Pig, and Hadoop Security.
Overall, Yahoo demonstrated continued leadership in developing Hadoop technologies, and at the same time they were clearly pleased that a number of leading Internet companies and independent technology vendors have emerged as part of the ecosystem.
This article originally appeared on InfoQ.