fbpx
Wikipedia

Count–min sketch

In computing, the count–min sketch (CM sketch) is a probabilistic data structure that serves as a frequency table of events in a stream of data. It uses hash functions to map events to frequencies, but unlike a hash table uses only sub-linear space, at the expense of overcounting some events due to collisions. The count–min sketch was invented in 2003 by Graham Cormode and S. Muthu Muthukrishnan[1] and described by them in a 2005 paper.[2]

Count–min sketch is an alternative to count sketch and AMS sketch and can be considered an implementation of a counting Bloom filter (Fan et al., 1998[3]) or multistage-filter.[1] However, they are used differently and therefore sized differently: a count–min sketch typically has a sublinear number of cells, related to the desired approximation quality of the sketch, while a counting Bloom filter is more typically sized to match the number of elements in the set.

Data structure edit

The goal of the basic version of the count–min sketch is to consume a stream of events, one at a time, and count the frequency of the different types of events in the stream. At any time, the sketch can be queried for the frequency of a particular event type i from a universe of event types  , and will return an estimate of this frequency that is within a certain distance of the true frequency, with a certain probability.[a]

The actual sketch data structure is a two-dimensional array of w columns and d rows. The parameters w and d are fixed when the sketch is created, and determine the time and space needs and the probability of error when the sketch is queried for a frequency or inner product. Associated with each of the d rows is a separate hash function; the hash functions must be pairwise independent. The parameters w and d can be chosen by setting w = ⌈e/ε and d = ⌈ln 1/δ, where the error in answering a query is within an additive factor of ε with probability 1 − δ (see below), and e is Euler's number.

When a new event of type i arrives we update as follows: for each row j of the table, apply the corresponding hash function to obtain a column index k = hj(i). Then increment the value in row j, column k by one.

Several types of queries are possible on the stream.

  • The simplest is the point query, which asks for the count of an event type i. The estimated count is given by the least value in the table for i, namely  , where   is the table.

Obviously, for each i, one has  , where   is the true frequency with which i occurred in the stream.

Additionally, this estimate has the guarantee that   with probability  , where   is the stream size, i.e. the total number of items seen by the sketch.

  • An inner product query asks for the inner product between the histograms represented by two count–min sketches,   and  .

Small modifications to the data structure can be used to sketch other different stream statistics.

Like the count sketch, the Count–min sketch is a linear sketch. That is, given two streams, constructing a sketch on each stream and summing the sketches yields the same result as concatenating the streams and constructing a sketch on the concatenated streams. This makes the sketch mergeable and appropriate for use in distributed settings in addition to streaming ones.

Reducing bias and error edit

One potential problem with the usual min estimator for count–min sketches is that they are biased estimators of the true frequency of events: they may overestimate, but never underestimate the true count in a point query. Furthermore, while the min estimator works well when the distribution is highly skewed, other sketches such as the Count sketch based on means are more accurate when the distribution is not sufficiently skewed. Several variations on the sketch have been proposed to reduce error and reduce or eliminate bias.[4]

To remove bias, the hCount* estimator [5] repeatedly randomly selects d random entries in the sketch and takes the minimum to obtain an unbiased estimate of the bias and subtracts it off.

A maximum likelihood estimator (MLE) was derived in Ting.[6] By using the MLE, the estimator is always able to match or better the min estimator and works well even if the distribution is not skewed. This paper also showed the hCount* debiasing operation is a bootstrapping procedure that can be efficiently computed without random sampling and can be generalized to any estimator.

Since errors arise from hash collisions with unknown items from the universe, several approaches correct for the collisions when multiple elements of the universe are known or queried for simultaneously [7][8][6]. For each of these, a large proportion of the universe must be known to observe a significant benefit.

Conservative updating changes the update, but not the query algorithms. To count c instances of event type i, one first computes an estimate  , then updates   for each row j. While this update procedure makes the sketch not a linear sketch, it is still mergeable.

See also edit

Notes edit

  1. ^ The following discussion assumes that only "positive" events occur, i.e., the frequency of the various types cannot decrease over time. Modifications of the following algorithms exist for the more general case where frequencies are allowed to decrease.

References edit

  1. ^ a b Cormode, Graham (2009). "Count-min sketch" (PDF). Encyclopedia of Database Systems. Springer. pp. 511–516.
  2. ^ Cormode, Graham; S. Muthukrishnan (2005). (PDF). J. Algorithms. 55: 29–38. doi:10.1016/j.jalgor.2003.12.001. Archived from the original (PDF) on 2023-05-25.
  3. ^ Fan, Li; Cao, Pei; Almeida, Jussara; Broder, Andrei (2000), "Summary Cache: A Scalable Wide-Area Web Cache Sharing Protocol", IEEE/ACM Transactions on Networking, 8 (3): 281–293, CiteSeerX 10.1.1.41.1487, doi:10.1109/90.851975, S2CID 4779754. A preliminary version appeared at SIGCOMM '98.
  4. ^ Goyal, Amit; Daumé, Hal III; Cormode, Graham (2012). Sketch algorithms for estimating point queries in NLP. Proc. EMNLP/CoNLL.
  5. ^ Jin, C.; Qian, W.; Xu, X.; Zhou, A. (2003), Dynamically maintaining frequent items over a data stream, CiteSeerX 10.1.1.151.5909
  6. ^ a b Ting, Daniel (2018). "Count-Min". Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. pp. 2319–2328. doi:10.1145/3219819.3219975. ISBN 9781450355520.
  7. ^ Deng, Fan; Rafiei, Davood (2007), New estimation algorithms for streaming data: Count-min can do more, CiteSeerX 10.1.1.552.1283
  8. ^ Lu, Yi; Montanari, Andrea; Prabhakar, Balaji; Dharmapurikar, Sarang; Kabbani, Abdul (2008). "Counter braids". ACM SIGMETRICS Performance Evaluation Review. 36 (1): 121–132. doi:10.1145/1384529.1375472. ISSN 0163-5999.

Further reading edit

  • Dwork, Cynthia; Naor, Moni; Pitassi, Toniann; Rothblum, Guy N.; Yekhanin, Sergey (2010). Pan-private streaming algorithms. Proc. ICS. CiteSeerX 10.1.1.165.5923.
  • Schechter, Stuart; Herley, Cormac; Mitzenmacher, Michael (2010). Popularity is everything: A new approach to protecting passwords from statistical-guessing attacks. USENIX Workshop on Hot Topics in Security. CiteSeerX 10.1.1.170.9356.

External links edit

  • Count–min FAQ

count, sketch, computing, count, sketch, sketch, probabilistic, data, structure, that, serves, frequency, table, events, stream, data, uses, hash, functions, events, frequencies, unlike, hash, table, uses, only, linear, space, expense, overcounting, some, even. In computing the count min sketch CM sketch is a probabilistic data structure that serves as a frequency table of events in a stream of data It uses hash functions to map events to frequencies but unlike a hash table uses only sub linear space at the expense of overcounting some events due to collisions The count min sketch was invented in 2003 by Graham Cormode and S Muthu Muthukrishnan 1 and described by them in a 2005 paper 2 Count min sketch is an alternative to count sketch and AMS sketch and can be considered an implementation of a counting Bloom filter Fan et al 1998 3 or multistage filter 1 However they are used differently and therefore sized differently a count min sketch typically has a sublinear number of cells related to the desired approximation quality of the sketch while a counting Bloom filter is more typically sized to match the number of elements in the set Contents 1 Data structure 2 Reducing bias and error 3 See also 4 Notes 5 References 6 Further reading 7 External linksData structure editThe goal of the basic version of the count min sketch is to consume a stream of events one at a time and count the frequency of the different types of events in the stream At any time the sketch can be queried for the frequency of a particular event type i from a universe of event types U displaystyle mathcal U nbsp and will return an estimate of this frequency that is within a certain distance of the true frequency with a certain probability a The actual sketch data structure is a two dimensional array of w columns and d rows The parameters w and d are fixed when the sketch is created and determine the time and space needs and the probability of error when the sketch is queried for a frequency or inner product Associated with each of the d rows is a separate hash function the hash functions must be pairwise independent The parameters w and d can be chosen by setting w e e and d ln 1 d where the error in answering a query is within an additive factor of e with probability 1 d see below and e is Euler s number When a new event of type i arrives we update as follows for each row j of the table apply the corresponding hash function to obtain a column index k hj i Then increment the value in row j column k by one Several types of queries are possible on the stream The simplest is the point query which asks for the count of an event type i The estimated count is given by the least value in the table for i namely a i minjcount j hj i displaystyle hat a i min j mathrm count j h j i nbsp where count displaystyle mathrm count nbsp is the table Obviously for each i one has ai a i displaystyle a i leq hat a i nbsp where ai displaystyle a i nbsp is the true frequency with which i occurred in the stream Additionally this estimate has the guarantee that a i ai eN displaystyle hat a i leq a i varepsilon N nbsp with probability 1 d displaystyle 1 delta nbsp where N i Uai displaystyle N sum i in mathcal U a i nbsp is the stream size i e the total number of items seen by the sketch An inner product query asks for the inner product between the histograms represented by two count min sketches counta displaystyle mathrm count a nbsp and countb displaystyle mathrm count b nbsp Small modifications to the data structure can be used to sketch other different stream statistics Like the count sketch the Count min sketch is a linear sketch That is given two streams constructing a sketch on each stream and summing the sketches yields the same result as concatenating the streams and constructing a sketch on the concatenated streams This makes the sketch mergeable and appropriate for use in distributed settings in addition to streaming ones Reducing bias and error editOne potential problem with the usual min estimator for count min sketches is that they are biased estimators of the true frequency of events they may overestimate but never underestimate the true count in a point query Furthermore while the min estimator works well when the distribution is highly skewed other sketches such as the Count sketch based on means are more accurate when the distribution is not sufficiently skewed Several variations on the sketch have been proposed to reduce error and reduce or eliminate bias 4 To remove bias the hCount estimator 5 repeatedly randomly selects d random entries in the sketch and takes the minimum to obtain an unbiased estimate of the bias and subtracts it off A maximum likelihood estimator MLE was derived in Ting 6 By using the MLE the estimator is always able to match or better the min estimator and works well even if the distribution is not skewed This paper also showed the hCount debiasing operation is a bootstrapping procedure that can be efficiently computed without random sampling and can be generalized to any estimator Since errors arise from hash collisions with unknown items from the universe several approaches correct for the collisions when multiple elements of the universe are known or queried for simultaneously 7 8 6 For each of these a large proportion of the universe must be known to observe a significant benefit Conservative updating changes the update but not the query algorithms To count c instances of event type i one first computes an estimate a i minjcount j hj i displaystyle hat a i min j mathrm count j h j i nbsp then updates count j hj i max count j hj i ai c displaystyle mathrm count j h j i leftarrow max mathrm count j h j i hat a i c nbsp for each row j While this update procedure makes the sketch not a linear sketch it is still mergeable See also editFeature hashing Locality sensitive hashing MinHashNotes edit The following discussion assumes that only positive events occur i e the frequency of the various types cannot decrease over time Modifications of the following algorithms exist for the more general case where frequencies are allowed to decrease References edit a b Cormode Graham 2009 Count min sketch PDF Encyclopedia of Database Systems Springer pp 511 516 Cormode Graham S Muthukrishnan 2005 An Improved Data Stream Summary The Count Min Sketch and its Applications PDF J Algorithms 55 29 38 doi 10 1016 j jalgor 2003 12 001 Archived from the original PDF on 2023 05 25 Fan Li Cao Pei Almeida Jussara Broder Andrei 2000 Summary Cache A Scalable Wide Area Web Cache Sharing Protocol IEEE ACM Transactions on Networking 8 3 281 293 CiteSeerX 10 1 1 41 1487 doi 10 1109 90 851975 S2CID 4779754 A preliminary version appeared at SIGCOMM 98 Goyal Amit Daume Hal III Cormode Graham 2012 Sketch algorithms for estimating point queries in NLP Proc EMNLP CoNLL Jin C Qian W Xu X Zhou A 2003 Dynamically maintaining frequent items over a data stream CiteSeerX 10 1 1 151 5909 a b Ting Daniel 2018 Count Min Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery amp Data Mining pp 2319 2328 doi 10 1145 3219819 3219975 ISBN 9781450355520 Deng Fan Rafiei Davood 2007 New estimation algorithms for streaming data Count min can do more CiteSeerX 10 1 1 552 1283 Lu Yi Montanari Andrea Prabhakar Balaji Dharmapurikar Sarang Kabbani Abdul 2008 Counter braids ACM SIGMETRICS Performance Evaluation Review 36 1 121 132 doi 10 1145 1384529 1375472 ISSN 0163 5999 Further reading editDwork Cynthia Naor Moni Pitassi Toniann Rothblum Guy N Yekhanin Sergey 2010 Pan private streaming algorithms Proc ICS CiteSeerX 10 1 1 165 5923 Schechter Stuart Herley Cormac Mitzenmacher Michael 2010 Popularity is everything A new approach to protecting passwords from statistical guessing attacks USENIX Workshop on Hot Topics in Security CiteSeerX 10 1 1 170 9356 External links editCount min FAQ Retrieved from https en wikipedia org w index php title Count min sketch amp oldid 1204831567, wikipedia, wiki, book, books, library,

article

, read, download, free, free download, mp3, video, mp4, 3gp, jpg, jpeg, gif, png, picture, music, song, movie, book, game, games.