• Marc Celani's avatar
    TDigest for estimating quantiles · 5e1a06ae
    Marc Celani authored
    Summary:
    The existing histogram approach for estimating quantiles is inadequate for common use cases. This diff introduces TDigests for estimating quantiles of streaming values.
    
    The current histogram approach suffers from two issues:
      - The user needs to know something about the universe of values, U, and the underlying distribution in order to get reasonably accurate values
      - Because the histogram is not biased, estimating tail quantiles such as p99.9 requires a large amount of memory.
    
    t-digests (https://github.com/tdunning/t-digest/blob/master/docs/t-digest-paper/histo.pdf) are biased digests that allow for accurate estimates of tail quantiles, while using a small amount of memory and without requiring the user to know anything about U.
    
    The scaling function in the paper is slow, because it uses arcsin. This approach uses a sqrt function instead. Details of why this works in the comments of the TDigest file.
    
    Reviewed By: simpkins
    
    Differential Revision: D7512432
    
    fbshipit-source-id: 01460433ec7380fed88e5e942ef15f032dd794bb
    5e1a06ae
TDigestBenchmark.cpp 6.87 KB