Tag Archives: Advertising

[repost ]Evolution of Hadoop Ecosystem: AOL Advertising Experience


Pero works on research and development in new technologies for online advertising at Aol Advertising R&D in Palo Alto. Over the past 4 years he has been the Chief Architect of an R&D distributed ecosystem comprising more than a thousand nodes in multiple data centers. He also led large-scale contextual analysis, segmentation and machine learning efforts at AOL, Yahoo and Cadence Design Systems and has published patents and research papers in these areas.

A critical premise for success of online advertising networks is to successfully collect, organize, analyze and use large volumes of data for decision making. Given the nature of their online orientation and dynamics, it is critical that these processes be automated to the largest extent possible.

Specifically, the success of advertising technology and its impact on revenue are directly proportional to its capability to use large amounts of data in order to compute proper impression value given the unique circumstances of ad serving events such as the characteristics of the impression, the ad, and the user as well as the content and context. As a general rule, more data results in more accurate predictions.
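The "more data yields more accurate predictions" claim can be made concrete with a toy sketch (this is an illustration, not AOL's actual model): a smoothed click-through-rate estimate whose standard error shrinks as impressions accumulate. The prior values and function names here are assumptions for the example.

```python
import math

def ctr_estimate(clicks, impressions, prior_ctr=0.01, prior_weight=100):
    """Smoothed click-through-rate estimate (Beta-style prior).

    With few impressions the estimate stays near the prior;
    as data accumulates, the observed rate dominates.
    """
    return (clicks + prior_ctr * prior_weight) / (impressions + prior_weight)

def ctr_stderr(ctr, impressions):
    """Standard error of a rate estimate; shrinks as data grows."""
    return math.sqrt(ctr * (1 - ctr) / impressions)

# Same observed rate, 1,000 vs. 1,000,000 impressions:
small_sample = ctr_stderr(0.02, 1_000)
large_sample = ctr_stderr(0.02, 1_000_000)
# Uncertainty drops by sqrt(1000), roughly 32x, with 1000x more data.
```

This is why moving from heavily sampled data sets to full serving logs matters: estimates for rare ad/user/context combinations become usable only once enough events are observed.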

In addition to Optimization, Reporting and Analytics provide indispensable feedback to our internal Business and Sales teams, helping us acquire new commitments from external customers and expand current ones.

At AOL, we started large-scale data collection more than 4 years ago and went from using heavily sampled data sets to being able to process full serving logs. We have been using Apache Hadoop since version 0.14 as a part of an R&D effort and recently moved to Cloudera CDH3 distribution. Gradually, we introduced more systems and technologies to our ecosystem around Hadoop.

We chose Hadoop for several reasons:

  • Ability to store, organize and process large data sets
  • Great flexibility with data formats
  • Map-reduce offers flexible data processing paradigm and works well with changing data
  • Excellent cost-volume/price-performance point which proved very important in early proof-of-concept stages
  • Failure tolerance built into the system via distributed computation and data redundancy
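The map-reduce paradigm mentioned above can be sketched in miniature: map each log record to key/value pairs, group by key (the shuffle), then reduce each group. This pure-Python simulation uses made-up toy log records as stand-ins for real serving logs; it is not AOL's pipeline.

```python
from collections import defaultdict

# Toy serving-log records: (ad_id, event) pairs.
log = [
    ("ad_1", "impression"),
    ("ad_2", "impression"),
    ("ad_1", "impression"),
    ("ad_1", "click"),
]

def mapper(record):
    """Emit ((ad_id, event), 1) for each log record."""
    ad_id, event = record
    yield (ad_id, event), 1

def reducer(key, values):
    """Sum all counts seen for one key."""
    return key, sum(values)

# Shuffle phase: group mapper output by key.
groups = defaultdict(list)
for record in log:
    for key, value in mapper(record):
        groups[key].append(value)

counts = dict(reducer(k, v) for k, v in groups.items())
# counts[("ad_1", "impression")] == 2
```

Because map and reduce are independent per record and per key, the same logic scales across many nodes, and it adapts easily when log formats change — only the mapper needs updating.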
Figure 1. Growth of Hadoop cluster (cluster size [nodes] and aggregate disk space [TB])

Figure 2. Growth in sampling rate

We show the growth of our Hadoop clusters in Figure 1, and the increase in the sampling rate in Figure 2. Between the 3rd and 4th iterations we switched to disks that are 4 times larger and used 4-8 times more cores per node. The increase in the total number of CPUs was even more pronounced, as we found we needed more processing power for newly developed processing flows. During the initial stages, growing the sampling rate was the primary goal. As the number of processing pipelines increased, so did the output data volume, and we also added more external data flows. These two trends drove the increase in total storage space and processing power beyond full log samples between stages 4 and 5. Note that factors like the business environment and team growth also had a significant impact on the pace of cluster upgrades.

At the same time, we grew the ecosystem around Hadoop to encompass other infrastructure and computational components such as databases, caching and high-performance computing clusters. As our Hadoop clusters increased in size, these clusters correspondingly increased to store and process larger data sets.

The main driver of the qualitative shift between the 3rd and 4th iterations was the move from R&D to a production environment. With the involvement of additional teams we faced several challenges that Cloudera helped us with:

  • Specifying and executing operational requirements
  • Cluster setup
  • Staff training
  • Introducing other indispensable parts of Hadoop ecosystem such as robust data flows (Flume), monitoring and instrumentation
  • Ensuring that long-term vision and execution are aligned with Hadoop roadmap

The last point is especially important as we see Hadoop as an ever-evolving data processing platform. We see ourselves as a contributor and partner in this process – through the recently introduced Cloudera Customer Council we participate in discussions and working groups. For us, this is a great learning experience which simultaneously provides ample opportunities for us to contribute to an important technology that is changing the way we do business.

[repost ]Mining of Massive Datasets


This book is placed on the Web for free use of all who wish it. We do, however, retain copyright on the work, and we expect that you will acknowledge our authorship if you republish parts or all of it. We are sorry to have to mention this point, but we have evidence that other items we have published on the Web have been appropriated and republished under other names. It is easy to detect such misuse, by the way, as you will learn in Chapter 3.

— Anand Rajaraman (@anand_raj) and Jeff Ullman


Download the Complete Book (340 pages, approximately 2MB)

Download chapters of the book:

Preface and Table of Contents
Chapter 1 Data Mining
Chapter 2 Large-Scale File Systems and Map-Reduce
Chapter 3 Finding Similar Items
Chapter 4 Mining Data Streams
Chapter 5 Link Analysis
Chapter 6 Frequent Itemsets
Chapter 7 Clustering
Chapter 8 Advertising on the Web
Chapter 9 Recommendation Systems

Gradiance Support

If you are an instructor interested in using the Gradiance Automated Homework System with this book, start by creating an account for yourself at www.gradiance.com/services. Then, email your chosen login and the request to become an instructor for the MMDS book to support@gradiance.com. You will then be able to create a class using these materials. Manuals explaining the use of the system are at www.gradiance.com/info.html.

Students who want to use the Gradiance system for self-study can register at www.gradiance.com/services. Then, use the class token 1EDD8A1D to join the “omnibus class” for the MMDS book. See The Student Guide for more information.

Other Stuff

  • Slides and Course Material from old CS345A. Like the book, you are welcome to use these as you like, but please preserve our authorship.
  • The Errata Sheet. We shall endeavor to keep the downloads up to date. But if you bought or printed out a copy, you can check this list for known errors with the date of discovery. Please report errata to ullman a t gmail.com.