folly/stats/test/TDigestBenchmark.cpp · 5e1a06aee1bed25e5968f49340403679763b58c2 · Libraries / folly

TDigest for estimating quantiles · 5e1a06ae

Marc Celani authored Apr 13, 2018

Summary:
The existing histogram approach for estimating quantiles is inadequate for common use cases. This diff introduces TDigests for estimating quantiles of streaming values.

The current histogram approach suffers from two issues:
  - The user needs to know something about the universe of values, U, and the underlying distribution in order to get reasonably accurate values
  - Because the histogram is not biased, estimating tail quantiles such as p99.9 requires a large amount of memory.

t-digests (https://github.com/tdunning/t-digest/blob/master/docs/t-digest-paper/histo.pdf) are biased digests that allow for accurate estimates of tail quantiles, while using a small amount of memory and without requiring the user to know anything about U.

The scaling function in the paper is slow, because it uses arcsin. This approach uses a sqrt function instead. Details of why this works in the comments of the TDigest file.

Reviewed By: simpkins

Differential Revision: D7512432

fbshipit-source-id: 01460433ec7380fed88e5e942ef15f032dd794bb

5e1a06ae

TDigestBenchmark.cpp 6.87 KB

Replace TDigestBenchmark.cpp