Big Data Problems in Monitoring at eBay

This post is based on a talk by Bhaven Avalani and Yuri Finklestein at QConSF 2012 (slides). Bhaven and Yuri work on the Platform Services team at eBay.

by @mattokeefe

This is a Big Data talk with Monitoring as the context. The problem domain includes operational management (performance, errors, anomaly detection), triaging (Root Cause Analysis), and business monitoring (customer behavior, click stream analytics). Customers of Monitoring include dev, Ops, infosec, management, research, and the business team. How much data? In 2009 it was tens of terabytes per day, now more than 500 TB/day. Drivers of this volume are business growth, SOA (many small pieces log more data), business insights, and Ops automation.

The second aspect is Data Quality. There are logs, metrics, and events with decreasing entropy in that order. Logs are free-form whereas events are well defined. Veracity increases in that order. Logs might be inaccurate.

There are tens of thousands of servers in multiple datacenters generating logs, metrics and events that feed into a data distribution system. The data is distributed to OLAP, Hadoop, and HBase for storage. Some of the data is dealt with in real-time while other activities such as OLAP for metric extraction is not.

Logs
How do you make logs less “wild”? Typically there are no schema, types, or governance. At eBay they impose a log format as a requirement. The log entry types includes open and close for transactions, with time for transaction begin and end, status code, and arbitrary key-value data. Transactions can be nested. Another type is atomic transactions. There are also types for events and heartbeats. They generate 150TB of logs per day.

Large Scale Data Distribution
The hardest part of distributing such large amounts of data is fault handling. It is necessary to be able to buffer data temporarily, and handle large spikes. Their solution is similar to Scribe and Flume except the unit of work is a log entry with multiple lines. The lines must be processed in correct order. The Fault Domain Manager copies the data into downstream domains. It uses a system of queues to handle the temporary unavailability of a destination domain such as Hadoop or Messaging. Queues can indicate the pressure in the system being produced by the tens of thousands of publisher clients. The queues are implemented as circular buffers so that they can start dropping data if the pressure is too great. There are different policies such as drop head and drop tail that are applied depending on the domain’s requirements.

Metric Extraction
The raw log data is a great source of metrics and events. The client does not need to know ahead of time what is of interest. The heart of the system that does this is Distributed OLAP. There are multiple dimensions such as machine name, cluster name, datacenter, transaction name, etc. The system maintains counters in memory on hierarchically described data. Traditional OLAP systems cannot keep up with the amount of data, so they partition across layers consisting of publishers, buses, aggregators, combiners, and query servers. The result of the aggregators is OLAP cubes with multidimensional structures with counters. The combiner then produces one gigantic cube that is made available for queries.

Time Series Storage
RRD was a remarkable invention when it came out, but it can’t deal with data at this scale. One solution is to use a column oriented database such or HBase or Cassandra. However you don’t know what your row size should be and handling very large rows is problematic. On the other hand OpenTSDB uses fixed row sizes based on time intervals. At eBay’s scale with millions of metrics per second, you need to down-sample based on metric frequency. To solve this, they introduced a concept of multiple row spans for different resolutions.

Insights
* Entropy is important to look at; remove it as early as possible
* Data distribution needs to be flexible and elastic
* Storage should be optimized for access patterns

Q&A
Q. What are the outcomes in terms of value gained?
A. Insights into availability of the site are important as they release code every day. Business insights into customer behavior are great too.

Q. How do they scale their infrastructure and do deployments?
A. Each layer is horizontally scalable but they’re still working on auto-scaling at this time. EBay is looking to leverage Cloud automation to address this.

Q. What is the smallest element that you cannot divide?
A. Logs must be processed atomically. It is hard to parallelize metric families.

Q. How do you deal with security challenges?
A. Their security team applies governance. Also there is a secure channel that is encrypted for when you absolutely need to log sensitive data.

Latest Images

Trending Articles

Latest Images