fbpx
Wikipedia

Count-distinct problem

In computer science, the count-distinct problem[1] (also known in applied mathematics as the cardinality estimation problem) is the problem of finding the number of distinct elements in a data stream with repeated elements. This is a well-known problem with numerous applications. The elements might represent IP addresses of packets passing through a router, unique visitors to a web site, elements in a large database, motifs in a DNA sequence, or elements of RFID/sensor networks.

Formal definition edit

Instance: Consider a stream of elements   with repetitions. Let   denote the number of distinct elements in the stream, with the set of distinct elements represented as  .
Objective: Find an estimate   of   using only   storage units, where  .

An example of an instance for the cardinality estimation problem is the stream:  . For this instance,  .

Naive solution edit

The naive solution to the problem is as follows:

 Initialize a counter, c, to zero,  . Initialize an efficient dictionary data structure, D, such as hash table or search tree in which insertion and membership can be performed quickly. For each element  , a membership query is issued. If   is not a member of D ( ) Add   to D Increase c by one,   Otherwise ( ) do nothing. Output  . 

As long as the number of distinct elements is not too big, D fits in main memory and an exact answer can be retrieved. However, this approach does not scale for bounded storage, or if the computation performed for each element   should be minimized. In such a case, several streaming algorithms have been proposed that use a fixed number of storage units.

HyperLogLog algorithm edit

Streaming algorithms edit

To handle the bounded storage constraint, streaming algorithms use a randomization to produce a non-exact estimation of the distinct number of elements,  . State-of-the-art estimators hash every element   into a low-dimensional data sketch using a hash function,  . The different techniques can be classified according to the data sketches they store.

Min/max sketches edit

Min/max sketches[2][3] store only the minimum/maximum hashed values. Examples of known min/max sketch estimators: Chassaing et al.[4] presents max sketch which is the minimum-variance unbiased estimator for the problem. The continuous max sketches estimator[5] is the maximum likelihood estimator. The estimator of choice in practice is the HyperLogLog algorithm.[6]

The intuition behind such estimators is that each sketch carries information about the desired quantity. For example, when every element   is associated with a uniform RV,  , the expected minimum value of   is  . The hash function guarantees that   is identical for all the appearances of  . Thus, the existence of duplicates does not affect the value of the extreme order statistics.

There are other estimation techniques other than min/max sketches. The first paper on count-distinct estimation[7] describes the Flajolet–Martin algorithm, a bit pattern sketch. In this case, the elements are hashed into a bit vector and the sketch holds the logical OR of all hashed values. The first asymptotically space- and time-optimal algorithm for this problem was given by Daniel M. Kane, Jelani Nelson, and David P. Woodruff.[8]

Bottom-m sketches edit

Bottom-m sketches [9] are a generalization of min sketches, which maintain the   minimal values, where  . See Cosma et al.[2] for a theoretical overview of count-distinct estimation algorithms, and Metwally [10] for a practical overview with comparative simulation results.

CVM Algorithm edit

Compared to other approximation algorithms for the count-distinct problem the CVM Algorithm[11] (named by Donald Knuth after the initials of Sourav Chakraborty, N. V. Vinodchandran, and Kuldeep S. Meel) uses sampling instead of hashing. The CVM Algorithm provides an unbiased estimator for the number of distinct elements in a stream,[12] in addition to the standard (ε-δ) guarantees. Below is the CVM algorithm, including the slight modification by Donald Knuth, that adds the while loop to ensure B is reduced. [12]

 Initialize   Initialize an empty buffer, B For each element   in data stream   of size   do: If   is in B then Delete   from B   random number in   If   then Insert   into B While   then Remove every element of   with probability     End While End For return  . 

Weighted count-distinct problem edit

In its weighted version, each element is associated with a weight and the goal is to estimate the total sum of weights. Formally,

Instance: A stream of weighted elements   with repetitions, and an integer  . Let   be the number of distinct elements, namely  , and let these elements be  . Finally, let   be the weight of  .
Objective: Find an estimate   of   using only   storage units, where  .

An example of an instance for the weighted problem is:  . For this instance,  , the weights are   and  .

As an application example,   could be IP packets received by a server. Each packet belongs to one of   IP flows  . The weight   can be the load imposed by flow   on the server. Thus,   represents the total load imposed on the server by all the flows to which packets   belong.

Solving the weighted count-distinct problem edit

Any extreme order statistics estimator (min/max sketches) for the unweighted problem can be generalized to an estimator for the weighted problem .[13] For example, the weighted estimator proposed by Cohen et al.[5] can be obtained when the continuous max sketches estimator is extended to solve the weighted problem. In particular, the HyperLogLog algorithm[6] can be extended to solve the weighted problem. The extended HyperLogLog algorithm offers the best performance, in terms of statistical accuracy and memory usage, among all the other known algorithms for the weighted problem.

See also edit

References edit

  1. ^ Ullman, Jeff; Rajaraman, Anand; Leskovec, Jure. "Mining data streams" (PDF). {{cite journal}}: Cite journal requires |journal= (help)
  2. ^ a b Cosma, Ioana A.; Clifford, Peter (2011). "A statistical analysis of probabilistic counting algorithms". Scandinavian Journal of Statistics. arXiv:0801.3552.
  3. ^ Giroire, Frederic; Fusy, Eric (2007). 2007 Proceedings of the Fourth Workshop on Analytic Algorithmics and Combinatorics (ANALCO). pp. 223–231. CiteSeerX 10.1.1.214.270. doi:10.1137/1.9781611972979.9. ISBN 978-1-61197-297-9.
  4. ^ Chassaing, Philippe; Gerin, Lucas (2006). "Efficient estimation of the cardinality of large data sets". Proceedings of the 4th Colloquium on Mathematics and Computer Science. arXiv:math/0701347. Bibcode:2007math......1347C.
  5. ^ a b Cohen, Edith (1997). "Size-estimation framework with applications to transitive closure and reachability". J. Comput. Syst. Sci. 55 (3): 441–453. doi:10.1006/jcss.1997.1534.
  6. ^ a b Flajolet, Philippe; Fusy, Eric; Gandouet, Olivier; Meunier, Frederic (2007). "HyperLoglog: the analysis of a near-optimal cardinality estimation algorithm" (PDF). Analysis of Algorithms.
  7. ^ Flajolet, Philippe; Martin, G. Nigel (1985). "Probabilistic counting algorithms for data base applications" (PDF). J. Comput. Syst. Sci. 31 (2): 182–209. doi:10.1016/0022-0000(85)90041-8.
  8. ^ Kane, Daniel M.; Nelson, Jelani; Woodruff, David P. (2010). "An Optimal Algorithm for the Distinct Elements Problem". Proceedings of the 29th Annual ACM Symposium on Principles of Database Systems (PODS).
  9. ^ Cohen, Edith; Kaplan, Haim (2008). "Tighter estimation using bottom k sketches" (PDF). PVLDB.
  10. ^ Metwally, Ahmed; Agrawal, Divyakant; Abbadi, Amr El (2008), Why go logarithmic if we can go linear?: Towards effective distinct counting of search traffic, Proceedings of the 11th international conference on Extending Database Technology: Advances in Database Technology, pp. 618–629, CiteSeerX 10.1.1.377.4771
  11. ^ Chakraborty, Sourav; Vinodchandran, N. V.; Meel, Kuldeep S. (2022). "Distinct Elements in Streams: An Algorithm for the (Text) Book": 6 pages, 727571 bytes. doi:10.4230/LIPIcs.ESA.2022.34. ISSN 1868-8969. {{cite journal}}: Cite journal requires |journal= (help)
  12. ^ a b Knuth, Donald (May 2023). "The CVM Algorithm for Estimating Distinct Elements in Streams" (PDF). {{cite journal}}: Cite journal requires |journal= (help)
  13. ^ Cohen, Reuven; Katzir, Liran; Yehezkel, Aviv (2014). "A Unified Scheme for Generalizing Cardinality Estimators to Sum Aggregation". Information Processing Letters. 115 (2): 336–342. doi:10.1016/j.ipl.2014.10.009.

count, distinct, problem, computer, science, count, distinct, problem, also, known, applied, mathematics, cardinality, estimation, problem, problem, finding, number, distinct, elements, data, stream, with, repeated, elements, this, well, known, problem, with, . In computer science the count distinct problem 1 also known in applied mathematics as the cardinality estimation problem is the problem of finding the number of distinct elements in a data stream with repeated elements This is a well known problem with numerous applications The elements might represent IP addresses of packets passing through a router unique visitors to a web site elements in a large database motifs in a DNA sequence or elements of RFID sensor networks Contents 1 Formal definition 2 Naive solution 3 HyperLogLog algorithm 4 Streaming algorithms 4 1 Min max sketches 4 2 Bottom m sketches 4 3 CVM Algorithm 5 Weighted count distinct problem 6 Solving the weighted count distinct problem 7 See also 8 ReferencesFormal definition editInstance Consider a stream of elements x 1 x 2 x s displaystyle x 1 x 2 ldots x s nbsp with repetitions Let n displaystyle n nbsp denote the number of distinct elements in the stream with the set of distinct elements represented as e 1 e 2 e n displaystyle e 1 e 2 ldots e n nbsp Objective Find an estimate n displaystyle widehat n nbsp of n displaystyle n nbsp using only m displaystyle m nbsp storage units where m n displaystyle m ll n nbsp An example of an instance for the cardinality estimation problem is the stream a b a c d b d displaystyle a b a c d b d nbsp For this instance n a b c d 4 displaystyle n left a b c d right 4 nbsp Naive solution editThe naive solution to the problem is as follows Initialize a counter c to zero c 0 displaystyle c leftarrow 0 nbsp Initialize an efficient dictionary data structure D such as hash table or search tree in which insertion and membership can be performed quickly For each element x i displaystyle x i nbsp a membership query is issued If x i displaystyle x i nbsp is not a member of D x i D displaystyle x i notin D nbsp Add x i displaystyle x i nbsp to D Increase c by one c c 1 displaystyle c leftarrow c 1 nbsp Otherwise x i D displaystyle x i in D nbsp do nothing Output n c displaystyle n c nbsp As long as the number of distinct elements is not too big D fits in main memory and an exact answer can be retrieved However this approach does not scale for bounded storage or if the computation performed for each element x i displaystyle x i nbsp should be minimized In such a case several streaming algorithms have been proposed that use a fixed number of storage units HyperLogLog algorithm editMain article HyperLogLogStreaming algorithms editTo handle the bounded storage constraint streaming algorithms use a randomization to produce a non exact estimation of the distinct number of elements n displaystyle n nbsp State of the art estimators hash every element e j displaystyle e j nbsp into a low dimensional data sketch using a hash function h e j displaystyle h e j nbsp The different techniques can be classified according to the data sketches they store Min max sketches edit Min max sketches 2 3 store only the minimum maximum hashed values Examples of known min max sketch estimators Chassaing et al 4 presents max sketch which is the minimum variance unbiased estimator for the problem The continuous max sketches estimator 5 is the maximum likelihood estimator The estimator of choice in practice is the HyperLogLog algorithm 6 The intuition behind such estimators is that each sketch carries information about the desired quantity For example when every element e j displaystyle e j nbsp is associated with a uniform RV h e j U 0 1 displaystyle h e j sim U 0 1 nbsp the expected minimum value of h e 1 h e 2 h e n displaystyle h e 1 h e 2 ldots h e n nbsp is 1 n 1 displaystyle 1 n 1 nbsp The hash function guarantees that h e j displaystyle h e j nbsp is identical for all the appearances of e j displaystyle e j nbsp Thus the existence of duplicates does not affect the value of the extreme order statistics There are other estimation techniques other than min max sketches The first paper on count distinct estimation 7 describes the Flajolet Martin algorithm a bit pattern sketch In this case the elements are hashed into a bit vector and the sketch holds the logical OR of all hashed values The first asymptotically space and time optimal algorithm for this problem was given by Daniel M Kane Jelani Nelson and David P Woodruff 8 Bottom m sketches edit Bottom m sketches 9 are a generalization of min sketches which maintain the m displaystyle m nbsp minimal values where m 1 displaystyle m geq 1 nbsp See Cosma et al 2 for a theoretical overview of count distinct estimation algorithms and Metwally 10 for a practical overview with comparative simulation results CVM Algorithm edit Compared to other approximation algorithms for the count distinct problem the CVM Algorithm 11 named by Donald Knuth after the initials of Sourav Chakraborty N V Vinodchandran and Kuldeep S Meel uses sampling instead of hashing The CVM Algorithm provides an unbiased estimator for the number of distinct elements in a stream 12 in addition to the standard e d guarantees Below is the CVM algorithm including the slight modification by Donald Knuth that adds the while loop to ensure B is reduced 12 Initialize p 1 displaystyle p leftarrow 1 nbsp Initialize an empty buffer B For each element a t displaystyle a t nbsp in data stream A displaystyle A nbsp of size n displaystyle n nbsp do If a t displaystyle a t nbsp is in B then Delete a t displaystyle a t nbsp from B u displaystyle u leftarrow nbsp random number in 0 1 displaystyle 0 1 nbsp If u p displaystyle u leq p nbsp then Insert a t u displaystyle a t u nbsp into B While B s displaystyle B s nbsp then Remove every element of B displaystyle B nbsp with probability 1 2 displaystyle frac 1 2 nbsp p p 2 displaystyle p leftarrow frac p 2 nbsp End While End For return B p displaystyle B p nbsp Weighted count distinct problem editIn its weighted version each element is associated with a weight and the goal is to estimate the total sum of weights Formally Instance A stream of weighted elements x 1 x 2 x s displaystyle x 1 x 2 ldots x s nbsp with repetitions and an integer m displaystyle m nbsp Let n displaystyle n nbsp be the number of distinct elements namely n x 1 x 2 x s displaystyle n left x 1 x 2 ldots x s right nbsp and let these elements be e 1 e 2 e n displaystyle left e 1 e 2 ldots e n right nbsp Finally let w j displaystyle w j nbsp be the weight of e j displaystyle e j nbsp Objective Find an estimate w displaystyle widehat w nbsp of w j 1 n w j displaystyle w sum j 1 n w j nbsp using only m displaystyle m nbsp storage units where m n displaystyle m ll n nbsp An example of an instance for the weighted problem is a 3 b 4 a 3 c 2 d 3 b 4 d 3 displaystyle a 3 b 4 a 3 c 2 d 3 b 4 d 3 nbsp For this instance e 1 a e 2 b e 3 c e 4 d displaystyle e 1 a e 2 b e 3 c e 4 d nbsp the weights are w 1 3 w 2 4 w 3 2 w 4 3 displaystyle w 1 3 w 2 4 w 3 2 w 4 3 nbsp and w j 12 displaystyle sum w j 12 nbsp As an application example x 1 x 2 x s displaystyle x 1 x 2 ldots x s nbsp could be IP packets received by a server Each packet belongs to one of n displaystyle n nbsp IP flows e 1 e 2 e n displaystyle e 1 e 2 ldots e n nbsp The weight w j displaystyle w j nbsp can be the load imposed by flow e j displaystyle e j nbsp on the server Thus j 1 n w j displaystyle sum j 1 n w j nbsp represents the total load imposed on the server by all the flows to which packets x 1 x 2 x s displaystyle x 1 x 2 ldots x s nbsp belong Solving the weighted count distinct problem editAny extreme order statistics estimator min max sketches for the unweighted problem can be generalized to an estimator for the weighted problem 13 For example the weighted estimator proposed by Cohen et al 5 can be obtained when the continuous max sketches estimator is extended to solve the weighted problem In particular the HyperLogLog algorithm 6 can be extended to solve the weighted problem The extended HyperLogLog algorithm offers the best performance in terms of statistical accuracy and memory usage among all the other known algorithms for the weighted problem See also editCount min sketch Streaming algorithm Maximum likelihood Minimum variance unbiased estimatorReferences edit Ullman Jeff Rajaraman Anand Leskovec Jure Mining data streams PDF a href Template Cite journal html title Template Cite journal cite journal a Cite journal requires journal help a b Cosma Ioana A Clifford Peter 2011 A statistical analysis of probabilistic counting algorithms Scandinavian Journal of Statistics arXiv 0801 3552 Giroire Frederic Fusy Eric 2007 2007 Proceedings of the Fourth Workshop on Analytic Algorithmics and Combinatorics ANALCO pp 223 231 CiteSeerX 10 1 1 214 270 doi 10 1137 1 9781611972979 9 ISBN 978 1 61197 297 9 Chassaing Philippe Gerin Lucas 2006 Efficient estimation of the cardinality of large data sets Proceedings of the 4th Colloquium on Mathematics and Computer Science arXiv math 0701347 Bibcode 2007math 1347C a b Cohen Edith 1997 Size estimation framework with applications to transitive closure and reachability J Comput Syst Sci 55 3 441 453 doi 10 1006 jcss 1997 1534 a b Flajolet Philippe Fusy Eric Gandouet Olivier Meunier Frederic 2007 HyperLoglog the analysis of a near optimal cardinality estimation algorithm PDF Analysis of Algorithms Flajolet Philippe Martin G Nigel 1985 Probabilistic counting algorithms for data base applications PDF J Comput Syst Sci 31 2 182 209 doi 10 1016 0022 0000 85 90041 8 Kane Daniel M Nelson Jelani Woodruff David P 2010 An Optimal Algorithm for the Distinct Elements Problem Proceedings of the 29th Annual ACM Symposium on Principles of Database Systems PODS Cohen Edith Kaplan Haim 2008 Tighter estimation using bottom k sketches PDF PVLDB Metwally Ahmed Agrawal Divyakant Abbadi Amr El 2008 Why go logarithmic if we can go linear Towards effective distinct counting of search traffic Proceedings of the 11th international conference on Extending Database Technology Advances in Database Technology pp 618 629 CiteSeerX 10 1 1 377 4771 Chakraborty Sourav Vinodchandran N V Meel Kuldeep S 2022 Distinct Elements in Streams An Algorithm for the Text Book 6 pages 727571 bytes doi 10 4230 LIPIcs ESA 2022 34 ISSN 1868 8969 a href Template Cite journal html title Template Cite journal cite journal a Cite journal requires journal help a b Knuth Donald May 2023 The CVM Algorithm for Estimating Distinct Elements in Streams PDF a href Template Cite journal html title Template Cite journal cite journal a Cite journal requires journal help Cohen Reuven Katzir Liran Yehezkel Aviv 2014 A Unified Scheme for Generalizing Cardinality Estimators to Sum Aggregation Information Processing Letters 115 2 336 342 doi 10 1016 j ipl 2014 10 009 Retrieved from https en wikipedia org w index php title Count distinct problem amp oldid 1226437500, wikipedia, wiki, book, books, library,

article

, read, download, free, free download, mp3, video, mp4, 3gp, jpg, jpeg, gif, png, picture, music, song, movie, book, game, games.