
Streaming algorithm

In computer science, streaming algorithms are algorithms for processing data streams in which the input is presented as a sequence of items and can be examined in only a few passes, typically just one. These algorithms are designed to operate with limited memory, generally logarithmic in the size of the stream and/or in the maximum value in the stream, and may also have limited processing time per item.

As a result of these constraints, streaming algorithms often produce approximate answers based on a summary or "sketch" of the data stream.

History

Though streaming algorithms had already been studied by Munro and Paterson[1] as early as 1978, as well as Philippe Flajolet and G. Nigel Martin in 1982/83,[2] the field of streaming algorithms was first formalized and popularized in a 1996 paper by Noga Alon, Yossi Matias, and Mario Szegedy.[3] For this paper, the authors later won the Gödel Prize in 2005 "for their foundational contribution to streaming algorithms." There has since been a large body of work centered around data streaming algorithms that spans a diverse spectrum of computer science fields such as theory, databases, networking, and natural language processing.

Semi-streaming algorithms were introduced in 2005 as a relaxation of streaming algorithms for graphs,[4] in which the space allowed is linear in the number of vertices n, but only logarithmic in the number of edges m. This relaxation is still meaningful for dense graphs, and can solve interesting problems (such as connectivity) that are insoluble in o(n) space.

Models

Data stream model

In the data stream model, some or all of the input is represented as a finite sequence of integers (from some finite domain) which is generally not available for random access, but instead arrives one at a time in a "stream".[5] If the stream has length n and the domain has size m, algorithms are generally constrained to use space that is logarithmic in m and n. They can generally make only some small constant number of passes over the stream, sometimes just one.[6]

Turnstile and cash register models

Much of the streaming literature is concerned with computing statistics on frequency distributions that are too large to be stored. For this class of problems, there is a vector a = (a_1, …, a_n) (initialized to the zero vector 0) that has updates presented to it in a stream. The goal of these algorithms is to compute functions of a using considerably less space than it would take to represent a precisely. There are two common models for updating such streams, called the "cash register" and "turnstile" models.[7]

In the cash register model, each update is of the form ⟨i, c⟩, so that a_i is incremented by some positive integer c. A notable special case is when c = 1 (only unit insertions are permitted).

In the turnstile model, each update is of the form ⟨i, c⟩, so that a_i is incremented by some (possibly negative) integer c. In the "strict turnstile" model, no a_i may ever be less than zero.
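As a minimal illustration (the dictionary-based vector and the sample stream are hypothetical), the turnstile model can be simulated directly; the cash register model is the special case in which every increment c is positive:

```python
from collections import defaultdict

# Frequency vector a, initialized to the zero vector.
a = defaultdict(int)

# Hypothetical turnstile stream of <i, c> updates; negative c is allowed.
stream = [(1, 3), (2, 5), (1, -2), (3, 4)]

for i, c in stream:
    a[i] += c  # update <i, c>: increment a_i by c

print(dict(a))  # {1: 1, 2: 5, 3: 4}
```

In the "strict turnstile" model one would additionally check that no a_i ever drops below zero after an update.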

Sliding window model

Several papers also consider the "sliding window" model.[citation needed] In this model, the function of interest is computed over a fixed-size window in the stream. As the stream progresses, items from the end of the window are removed from consideration while new items from the stream take their place.
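A minimal sketch of the sliding-window model (the function name and window size are illustrative): an exact running sum over the last w items, evicting the item that falls out of the window. Note that real sliding-window algorithms aim for sublinear space rather than storing the whole window, as done here for clarity.

```python
from collections import deque

def sliding_window_sum(stream, w):
    """Exact running sum over the most recent w items of the stream."""
    window = deque()
    total = 0
    out = []
    for x in stream:
        window.append(x)
        total += x
        if len(window) > w:
            total -= window.popleft()  # oldest item leaves the window
        out.append(total)
    return out

print(sliding_window_sum([1, 2, 3, 4, 5], 3))  # [1, 3, 6, 9, 12]
```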

Besides the above frequency-based problems, some other types of problems have also been studied. Many graph problems are solved in the setting where the adjacency matrix or the adjacency list of the graph is streamed in some unknown order. There are also some problems that are very dependent on the order of the stream (i.e., asymmetric functions), such as counting the number of inversions in a stream and finding the longest increasing subsequence.[citation needed]

Evaluation

The performance of an algorithm that operates on data streams is measured by three basic factors:

  • The number of passes the algorithm must make over the stream.
  • The available memory.
  • The running time of the algorithm.

These algorithms have many similarities with online algorithms since they both require decisions to be made before all data are available, but they are not identical. Data stream algorithms only have limited memory available but they may be able to defer action until a group of points arrive, while online algorithms are required to take action as soon as each point arrives.

If the algorithm is an approximation algorithm then the accuracy of the answer is another key factor. The accuracy is often stated as an (ε, δ)-approximation, meaning that the algorithm achieves an error of less than ε with probability 1 − δ.

Applications

Streaming algorithms have several applications in networking such as monitoring network links for elephant flows, counting the number of distinct flows, estimating the distribution of flow sizes, and so on.[8] They also have applications in databases, such as estimating the size of a join[citation needed].

Some streaming problems

Frequency moments

The kth frequency moment of a set of frequencies a is defined as F_k(a) = Σ_{i=1}^{n} a_i^k.

The first moment F_1 is simply the sum of the frequencies (i.e., the total count). The second moment F_2 is useful for computing statistical properties of the data, such as the Gini coefficient of variation. F_∞ is defined as the frequency of the most frequent items.

The seminal paper of Alon, Matias, and Szegedy dealt with the problem of estimating the frequency moments.[citation needed]

Calculating frequency moments

A direct approach to finding the frequency moments requires maintaining a register m_i for every distinct element a_i ∈ (1, 2, 3, 4, …, N), which requires memory of order at least Ω(N).[3] But we have space limitations and require an algorithm that computes in much lower memory. This can be achieved by using approximations instead of exact values: an algorithm that computes an (ε, δ)-approximation of F_k, where F′_k is the (ε, δ)-approximated value of F_k,[9] ε is the approximation parameter, and δ is the confidence parameter.[10]

Calculating F0 (distinct elements in a data stream)
FM-Sketch algorithm

Flajolet et al.[2] introduced a probabilistic method of counting, inspired by a paper by Robert Morris.[11] Morris showed that if the accuracy requirement is dropped, a counter n can be replaced by a counter log n, which can be stored in log log n bits.[12] Flajolet et al.[2] improved this method by using a hash function h that is assumed to distribute elements uniformly in the hash space (a binary string of length L).
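Morris's approximate counter can be sketched as follows (a toy illustration; the function name and seed are arbitrary): the counter c is incremented only with probability 2^−c, so c stays near log2(n), and 2^c − 1 is an unbiased estimate of the true count n.

```python
import random

def morris_count(n_events, seed=0):
    """Morris counter: increment c with probability 2^-c; estimate n as 2^c - 1."""
    rng = random.Random(seed)
    c = 0
    for _ in range(n_events):
        if rng.random() < 2.0 ** -c:
            c += 1
    return 2 ** c - 1
```

Since only c ≈ log2(n) is stored, about log log n bits suffice, at the cost of accuracy.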

h : [m] → [0, 2^L − 1]

Let bit(y, k) represent the kth bit in the binary representation of y, so that

y = Σ_{k≥0} bit(y, k)·2^k

Let ρ(y) represent the position of the least significant 1-bit in the binary representation of y, with a suitable convention for ρ(0):

ρ(y) = min{k : bit(y, k) = 1} if y > 0, and ρ(y) = L if y = 0.

Let A be the data stream sequence of length M whose cardinality needs to be determined. Let BITMAP[0…L − 1] be the hash space where the ρ(hashed values) are recorded. The algorithm below then determines the approximate cardinality of A.

Procedure FM-Sketch:
    for i in 0 to L − 1 do
        BITMAP[i] := 0
    end for
    for x in A do
        index := ρ(hash(x))
        if BITMAP[index] = 0 then
            BITMAP[index] := 1
        end if
    end for
    B := position of leftmost 0 bit of BITMAP[]
    return 2^B

If there are N distinct elements in a data stream, then:

  • For i ≫ log(N), BITMAP[i] is certainly 0
  • For i ≪ log(N), BITMAP[i] is certainly 1
  • For i ≈ log(N), BITMAP[i] is a fringe of 0's and 1's
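The procedure above can be sketched in code as follows (the SHA-256-based hash is an assumption standing in for the uniform hash h, and L = 32 is arbitrary):

```python
import hashlib

L = 32  # length of the binary hash strings

def h(x):
    """Stand-in uniform hash mapping items into [0, 2^L - 1]."""
    d = hashlib.sha256(str(x).encode()).digest()
    return int.from_bytes(d[:4], "big")

def rho(y):
    """Position of the least significant 1-bit of y; rho(0) = L by convention."""
    return (y & -y).bit_length() - 1 if y else L

def fm_sketch(stream):
    """Estimate the number of distinct elements as 2^B (B = leftmost 0 in BITMAP)."""
    bitmap = [0] * (L + 1)  # index L absorbs rho(0)
    for x in stream:
        bitmap[rho(h(x))] = 1
    b = bitmap.index(0) if 0 in bitmap else L
    return 2 ** b
```

Duplicates do not change the estimate, since re-setting a bit to 1 has no effect. The raw estimate 2^B is only within a constant factor of N; the original paper corrects it by a constant (≈ 0.77351) and averages over several hash functions.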
K-minimum value algorithm

The previous algorithm describes the first attempt to approximate F0 in the data stream by Flajolet and Martin. Their algorithm picks a random hash function which they assume distributes the hash values uniformly over the hash space.

Bar-Yossef et al.[10] introduced the k-minimum value (KMV) algorithm for determining the number of distinct elements in a data stream. They used a similar hash function h, which can be normalized to [0, 1] as h : [m] → [0, 1]. But they fixed a limit t on the number of values kept from the hash space, where t is assumed to be of the order O(1/ε²) (i.e., a smaller approximation error ε requires a larger t). The KMV algorithm keeps only the t smallest hash values. After all m values of the stream have arrived, υ = Max(h(a_i)) over the kept values is used to calculate F′_0 = t/υ. That is, in a close-to-uniform hash space, they expect at least t elements to be less than O(t/F_0).

Procedure 2 K-Minimum Value:
    initialize first t values of KMV
    for a in a_1 to a_n do
        if h(a) < Max(KMV) then
            remove Max(KMV) from KMV set
            insert h(a) into KMV
        end if
    end for
    return t/Max(KMV)
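The procedure can be sketched as follows (the SHA-256-based hash normalized to [0, 1) is an assumption; a sorted list stands in for a more efficient structure such as a heap or binary tree):

```python
import hashlib

def h01(x):
    """Stand-in uniform hash normalized to [0, 1)."""
    d = hashlib.sha256(str(x).encode()).digest()
    return int.from_bytes(d[:8], "big") / 2.0 ** 64

def kmv_estimate(stream, t):
    """Keep the t smallest distinct hash values; estimate F0 = t / Max(KMV)."""
    kmv = []  # sorted list of the t smallest hash values seen so far
    for a in stream:
        v = h01(a)
        if v in kmv:
            continue  # duplicate element: its hash is already kept
        if len(kmv) < t:
            kmv.append(v)
            kmv.sort()
        elif v < kmv[-1]:
            kmv[-1] = v  # replace Max(KMV) with the smaller hash
            kmv.sort()
    if len(kmv) < t:
        return float(len(kmv))  # fewer than t distinct values: count is exact
    return t / kmv[-1]
```

With t of order 1/ε², the estimate t/Max(KMV) concentrates around F0 up to a (1 ± ε) factor.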
Complexity analysis of KMV

The KMV algorithm can be implemented in O((1/ε²)·log(m)) memory bits. Each hash value requires space of order O(log(m)) memory bits, and there are hash values of the order O(1/ε²). The access time can be reduced by storing the t hash values in a binary tree, which reduces the time complexity to O(log(1/ε)·log(m)).

Calculating Fk

Alon et al. estimate F_k by defining random variables that can be computed within the given space and time.[3] The expected value of these random variables gives the approximate value of F_k.

Assume the length of the sequence m is known in advance. Then construct a random variable X as follows:

  • Select a_p, a random member of the sequence A at index p; let l = a_p ∈ {1, 2, 3, …, n}.
  • Let r = |{q : q ≥ p, a_q = l}| represent the number of occurrences of l among the members of the sequence A from position p onward.
  • Define the random variable X = m(r^k − (r − 1)^k).

Assume S_1 is of the order O(n^{1−1/k}/λ²) and S_2 is of the order O(log(1/ε)). The algorithm takes S_2 random variables Y_1, Y_2, …, Y_{S_2} and outputs the median Y, where Y_i is the average of X_{ij} for 1 ≤ j ≤ S_1.

Now calculate the expectation of the random variable, E(X):

E(X) = Σ_{i=1}^{n} Σ_{j=1}^{m_i} (j^k − (j − 1)^k)
     = (m/m)·[(1^k + (2^k − 1^k) + … + (m_1^k − (m_1 − 1)^k))
            + (1^k + (2^k − 1^k) + … + (m_2^k − (m_2 − 1)^k)) + …
            + (1^k + (2^k − 1^k) + … + (m_n^k − (m_n − 1)^k))]
     = Σ_{i=1}^{n} m_i^k = F_k
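The construction of X and the median-of-means output can be sketched as follows (the particular values of s1 and s2 are illustrative defaults, not the asymptotic orders given above, and the code scans a stored list for clarity; the streaming version counts occurrences of a_p on the fly):

```python
import random

def ams_x(stream, k, rng):
    """One basic estimator X = m * (r^k - (r-1)^k) for a random position p."""
    m = len(stream)
    p = rng.randrange(m)
    l = stream[p]  # a_p = l
    r = sum(1 for q in range(p, m) if stream[q] == l)  # occurrences of l from p on
    return m * (r ** k - (r - 1) ** k)

def ams_fk(stream, k, s1=40, s2=9, seed=1):
    """Median of s2 group means, each averaging s1 independent copies of X."""
    rng = random.Random(seed)
    means = sorted(
        sum(ams_x(stream, k, rng) for _ in range(s1)) / s1 for _ in range(s2)
    )
    return means[s2 // 2]
```

For k = 1 every sample equals m, so the estimate is exactly the stream length; for larger k, averaging reduces variance and the median boosts the confidence.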
Complexity of Fk

From the algorithm to calculate F_k discussed above, we can see that each random variable X stores the values of a_p and r. So to compute X we need to maintain only log(n) bits for storing a_p and log(n) bits for storing r. The total number of random variables X will be S_1·S_2.

Hence, the total space complexity the algorithm takes is of the order of O((k·log(1/ε)/λ²)·n^{1−1/k}·(log(n) + log(m))).

Simpler approach to calculate F2

The previous algorithm calculates F_2 in order of O(√n·(log(m) + log(n))) memory bits. Alon et al.[3] simplified this algorithm using a four-wise independent random variable with values mapped to {−1, 1}.

This further reduces the complexity of calculating F_2 to O((log(1/ε)/λ²)·(log(n) + log(m))).
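This F_2 estimator (often called the "tug-of-war" sketch) can be sketched as follows. Here fully random ±1 signs, drawn once per distinct value, stand in for the four-wise independent family of Alon et al., and the number of repetitions is arbitrary:

```python
import random

def ams_f2(stream, reps=64, seed=0):
    """Average of Z^2 over several sign assignments, where Z = sum_i s(i) * f_i."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(reps):
        signs = {}  # s: value -> {-1, +1}
        z = 0
        for x in stream:
            if x not in signs:
                signs[x] = rng.choice((-1, 1))
            z += signs[x]  # accumulate Z one stream item at a time
        total += z * z  # E[Z^2] = F2 because cross terms cancel in expectation
    return total / reps
```

Only the counter Z and the signs need to be maintained per repetition; with a four-wise independent hash the signs themselves need only O(log n) bits.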

Frequent elements

In the data stream model, the frequent elements problem is to output a set of elements that constitute more than some fixed fraction of the stream. A special case is the majority problem, which is to determine whether or not any value constitutes a majority of the stream.

More formally, fix some positive constant c > 1, let the length of the stream be m, and let fi denote the frequency of value i in the stream. The frequent elements problem is to output the set { i | fi > m/c }.[13]

Some notable algorithms are:

  • Boyer–Moore majority vote algorithm
  • Count–Min sketch
  • Lossy counting
  • Multi-stage Bloom filters
  • Misra–Gries heavy hitters algorithm
  • Misra–Gries summary
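As an example, the Misra–Gries summary can be sketched with k − 1 counters; any value with frequency greater than m/k is guaranteed to survive as a candidate:

```python
def misra_gries(stream, k):
    """Misra-Gries summary: at most k - 1 counters; every value occurring more
    than m/k times in a stream of length m is guaranteed to be kept."""
    counters = {}
    for x in stream:
        if x in counters:
            counters[x] += 1
        elif len(counters) < k - 1:
            counters[x] = 1
        else:
            for key in list(counters):  # decrement every counter
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters
```

The kept counters undercount the true frequencies; where a second pass is allowed, it turns the candidates into exact counts.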

Event detection

Detecting events in data streams is often done using a heavy hitters algorithm as listed above: the most frequent items and their frequency are determined using one of these algorithms, then the largest increase over the previous time point is reported as the trend. This approach can be refined by using exponentially weighted moving averages and variance for normalization.[14]
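The normalization step can be sketched as follows (the function name and smoothing factor α are illustrative): each new count is scored against an exponentially weighted moving average and variance, so a sustained high baseline does not register as a trend while a sudden jump does.

```python
def ewma_zscores(counts, alpha=0.3):
    """Score each count against an exponentially weighted mean and variance."""
    mean, var = float(counts[0]), 0.0
    scores = [0.0]
    for x in counts[1:]:
        std = var ** 0.5
        scores.append((x - mean) / std if std > 0 else 0.0)
        diff = x - mean
        mean += alpha * diff                              # EWMA update
        var = (1 - alpha) * (var + alpha * diff * diff)   # EW variance update
    return scores
```

Items whose score exceeds a significance threshold at the current time point are reported as emerging trends.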

Counting distinct elements

Counting the number of distinct elements in a stream (sometimes called the F0 moment) is another problem that has been well studied. The first algorithm for it was proposed by Flajolet and Martin. In 2010, Daniel Kane, Jelani Nelson and David Woodruff found an asymptotically optimal algorithm for this problem.[15] It uses O(ε⁻² + log d) space, with O(1) worst-case update and reporting times, as well as universal hash functions and an r-wise independent hash family, where r = Ω(log(1/ε) / log log(1/ε)).

Entropy

The (empirical) entropy of a set of frequencies a is defined as H(a) = −Σ_{i=1}^{n} (a_i/m)·log(a_i/m), where m = Σ_{i=1}^{n} a_i.

Online learning

Learn a model (e.g., a classifier) by a single pass over a training set.

  • Feature hashing
  • Stochastic gradient descent
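For instance, a single-pass perceptron touches each labeled example exactly once and keeps only the weight vector (the toy data below is a hypothetical, linearly separable stream):

```python
def perceptron_single_pass(stream, dim):
    """Train a linear classifier in one pass: update w only on mistakes."""
    w = [0.0] * dim
    for x, y in stream:  # y in {-1, +1}
        score = sum(wi * xi for wi, xi in zip(w, x))
        if y * score <= 0:  # mistake (or tie): move w toward y * x
            w = [wi + y * xi for wi, xi in zip(w, x)]
    return w

# Hypothetical stream: the label is the sign of the first coordinate.
data = [((1.0, 0.2), 1), ((-1.0, 0.1), -1), ((0.8, -0.3), 1), ((-0.9, -0.2), -1)]
w = perceptron_single_pass(data, dim=2)
```

Only the model (here, dim floats) is stored, not the examples, which is what makes the pass streaming.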


Lower bounds

Lower bounds have been computed for many of the data streaming problems that have been studied. By far, the most common technique for computing these lower bounds has been using communication complexity.[citation needed]

See also

  • Data stream mining
  • Data stream clustering
  • Online algorithm
  • Stream processing
  • Sequential algorithm

Notes

  1. ^ Munro, J. Ian; Paterson, Mike (1978). "Selection and Sorting with Limited Storage". 19th Annual Symposium on Foundations of Computer Science, Ann Arbor, Michigan, USA, 16–18 October 1978. IEEE Computer Society. pp. 253–258. doi:10.1109/SFCS.1978.32.
  2. ^ Flajolet & Martin (1985)
  3. ^ Alon, Matias & Szegedy (1996)
  4. ^ Feigenbaum, Joan; Sampath, Kannan (2005). "On graph problems in a semi-streaming model". Theoretical Computer Science. 348 (2): 207–216. doi:10.1016/j.tcs.2005.09.013.
  5. ^ Babcock, Brian; Babu, Shivnath; Datar, Mayur; Motwani, Rajeev; Widom, Jennifer (2002). "Models and issues in data stream systems". Proceedings of the twenty-first ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems. PODS '02. New York, NY, USA: ACM. pp. 1–16. CiteSeerX 10.1.1.138.190. doi:10.1145/543613.543615. ISBN 978-1581135077. S2CID 2071130.
  6. ^ Bar-Yossef, Ziv; Jayram, T. S.; Kumar, Ravi; Sivakumar, D.; Trevisan, Luca (2002-09-13). "Counting Distinct Elements in a Data Stream". Randomization and Approximation Techniques in Computer Science. Lecture Notes in Computer Science. Vol. 2483. Springer, Berlin, Heidelberg. pp. 1–10. CiteSeerX 10.1.1.12.6276. doi:10.1007/3-540-45726-7_1. ISBN 978-3540457268. S2CID 4684185.
  7. ^ Gilbert et al. (2001)
  8. ^ Xu (2007)
  9. ^ Indyk, Piotr; Woodruff, David (2005-01-01). "Optimal approximations of the frequency moments of data streams". Proceedings of the thirty-seventh annual ACM symposium on Theory of computing. STOC '05. New York, NY, USA: ACM. pp. 202–208. doi:10.1145/1060590.1060621. ISBN 978-1-58113-960-0. S2CID 7911758.
  10. ^ Bar-Yossef, Ziv; Jayram, T. S.; Kumar, Ravi; Sivakumar, D.; Trevisan, Luca (2002-09-13). Rolim, José D. P.; Vadhan, Salil (eds.). Counting Distinct Elements in a Data Stream. Lecture Notes in Computer Science. Springer Berlin Heidelberg. pp. 1–10. CiteSeerX 10.1.1.12.6276. doi:10.1007/3-540-45726-7_1. ISBN 978-3-540-44147-2. S2CID 4684185.
  11. ^ Morris (1978)
  12. ^ Flajolet, Philippe (1985-03-01). "Approximate counting: A detailed analysis". BIT Numerical Mathematics. 25 (1): 113–134. CiteSeerX 10.1.1.64.5320. doi:10.1007/BF01934993. ISSN 0006-3835. S2CID 2809103.
  13. ^ Cormode, Graham (2014). "Misra-Gries Summaries". In Kao, Ming-Yang (ed.). Encyclopedia of Algorithms. Springer US. pp. 1–5. doi:10.1007/978-3-642-27848-8_572-1. ISBN 9783642278488.
  14. ^ Schubert, E.; Weiler, M.; Kriegel, H. P. (2014). SigniTrend: scalable detection of emerging topics in textual streams by hashed significance thresholds. Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining - KDD '14. pp. 871–880. doi:10.1145/2623330.2623740. ISBN 9781450329569.
  15. ^ Kane, Nelson & Woodruff (2010)

References

  • Alon, Noga; Matias, Yossi; Szegedy, Mario (1999), "The space complexity of approximating the frequency moments", Journal of Computer and System Sciences, 58 (1): 137–147, doi:10.1006/jcss.1997.1545, ISSN 0022-0000. First published as Alon, Noga; Matias, Yossi; Szegedy, Mario (1996), "The space complexity of approximating the frequency moments", Proceedings of the 28th ACM Symposium on Theory of Computing (STOC 1996), pp. 20–29, CiteSeerX 10.1.1.131.4984, doi:10.1145/237814.237823, ISBN 978-0-89791-785-8, S2CID 1627911.
  • Babcock, Brian; Babu, Shivnath; Datar, Mayur; Motwani, Rajeev; Widom, Jennifer (2002), "Models and issues in data stream systems" (PDF), Proceedings of the 21st ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS 2002), pp. 1–16, CiteSeerX 10.1.1.138.190, doi:10.1145/543613.543615, ISBN 978-1581135077, S2CID 2071130, archived from the original (PDF) on 2017-07-09, retrieved 2013-07-15.
  • Gilbert, A. C.; Kotidis, Y.; Muthukrishnan, S.; Strauss, M. J. (2001), "Surfing Wavelets on Streams: One-Pass Summaries for Approximate Aggregate Queries" (PDF), Proceedings of the International Conference on Very Large Data Bases: 79–88.
  • Kane, Daniel M.; Nelson, Jelani; Woodruff, David P. (2010). "An optimal algorithm for the distinct elements problem". Proceedings of the Twenty-Ninth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems. PODS '10. New York, NY, USA: ACM. pp. 41–52. CiteSeerX 10.1.1.164.142. doi:10.1145/1807085.1807094. ISBN 978-1-4503-0033-9. S2CID 10006932.
  • Karp, R. M.; Papadimitriou, C. H.; Shenker, S. (2003), "A simple algorithm for finding frequent elements in streams and bags", ACM Transactions on Database Systems, 28 (1): 51–55, CiteSeerX 10.1.1.116.8530, doi:10.1145/762471.762473, S2CID 952840.
  • Lall, Ashwin; Sekar, Vyas; Ogihara, Mitsunori; Xu, Jun; Zhang, Hui (2006), "Data streaming algorithms for estimating entropy of network traffic", Proceedings of the Joint International Conference on Measurement and Modeling of Computer Systems (ACM SIGMETRICS 2006) (PDF), p. 145, doi:10.1145/1140277.1140295, hdl:1802/2537, ISBN 978-1595933195, S2CID 240982.
  • Xu, Jun (Jim) (2007), A Tutorial on Network Data Streaming (PDF).
  • Heath, D.; Kasif, S.; Kosaraju, R.; Salzberg, S.; Sullivan, G. (1991), "Learning Nested Concepts With Limited Storage", Proceedings of the 12th International Joint Conference on Artificial Intelligence (IJCAI'91), Vol. 2, pp. 777–782, Morgan Kaufmann, San Francisco, CA, USA.
  • Morris, Robert (1978), "Counting large numbers of events in small registers", Communications of the ACM, 21 (10): 840–842, doi:10.1145/359619.359627, S2CID 36226357.

streaming, algorithm, computer, science, streaming, algorithms, algorithms, processing, data, streams, which, input, presented, sequence, items, examined, only, passes, typically, just, these, algorithms, designed, operate, with, limited, memory, generally, lo. In computer science streaming algorithms are algorithms for processing data streams in which the input is presented as a sequence of items and can be examined in only a few passes typically just one These algorithms are designed to operate with limited memory generally logarithmic in the size of the stream and or in the maximum value in the stream and may also have limited processing time per item As a result of these constraints streaming algorithms often produce approximate answers based on a summary or sketch of the data stream Contents 1 History 2 Models 2 1 Data stream model 2 2 Turnstile and cash register models 2 3 Sliding window model 3 Evaluation 4 Applications 5 Some streaming problems 5 1 Frequency moments 5 1 1 Calculating frequency moments 5 1 1 1 Calculating F0 distinct elements in a DataStream 5 1 1 1 1 FM Sketch algorithm 5 1 1 1 2 K minimum value algorithm 5 1 1 1 3 Complexity analysis of KMV 5 1 1 2 Calculating Fk 5 1 1 2 1 Complexity of Fk 5 1 1 2 2 Simpler approach to calculate F2 5 2 Frequent elements 5 3 Event detection 5 4 Counting distinct elements 5 5 Entropy 5 6 Online learning 6 Lower bounds 7 See also 8 Notes 9 ReferencesHistory editThough streaming algorithms had already been studied by Munro and Paterson 1 as early as 1978 as well as Philippe Flajolet and G Nigel Martin in 1982 83 2 the field of streaming algorithms was first formalized and popularized in a 1996 paper by Noga Alon Yossi Matias and Mario Szegedy 3 For this paper the authors later won the Godel Prize in 2005 for their foundational contribution to streaming algorithms There has since been a large body of work centered around data streaming algorithms that spans a diverse spectrum of computer science fields such as 
theory databases networking and natural language processing Semi streaming algorithms were introduced in 2005 as a relaxation of streaming algorithms for graphs 4 in which the space allowed is linear in the number of vertices n but only logarithmic in the number of edges m This relaxation is still meaningful for dense graphs and can solve interesting problems such as connectivity that are insoluble in o n displaystyle o n nbsp space Models editData stream model edit In the data stream model some or all of the input is represented as a finite sequence of integers from some finite domain which is generally not available for random access but instead arrives one at a time in a stream 5 If the stream has length n and the domain has size m algorithms are generally constrained to use space that is logarithmic in m and n They can generally make only some small constant number of passes over the stream sometimes just one 6 Turnstile and cash register models edit Much of the streaming literature is concerned with computing statistics on frequency distributions that are too large to be stored For this class of problems there is a vector a a 1 a n displaystyle mathbf a a 1 dots a n nbsp initialized to the zero vector 0 displaystyle mathbf 0 nbsp that has updates presented to it in a stream The goal of these algorithms is to compute functions of a displaystyle mathbf a nbsp using considerably less space than it would take to represent a displaystyle mathbf a nbsp precisely There are two common models for updating such streams called the cash register and turnstile models 7 In the cash register model each update is of the form i c displaystyle langle i c rangle nbsp so that a i displaystyle a i nbsp is incremented by some positive integer c displaystyle c nbsp A notable special case is when c 1 displaystyle c 1 nbsp only unit insertions are permitted In the turnstile model each update is of the form i c displaystyle langle i c rangle nbsp so that a i displaystyle a i nbsp is 
incremented by some possibly negative integer c displaystyle c nbsp In the strict turnstile model no a i displaystyle a i nbsp at any time may be less than zero Sliding window model edit Several papers also consider the sliding window model citation needed In this model the function of interest is computing over a fixed size window in the stream As the stream progresses items from the end of the window are removed from consideration while new items from the stream take their place Besides the above frequency based problems some other types of problems have also been studied Many graph problems are solved in the setting where the adjacency matrix or the adjacency list of the graph is streamed in some unknown order There are also some problems that are very dependent on the order of the stream i e asymmetric functions such as counting the number of inversions in a stream and finding the longest increasing subsequence citation needed Evaluation editThis section does not cite any sources Please help improve this section by adding citations to reliable sources Unsourced material may be challenged and removed April 2021 Learn how and when to remove this template message The performance of an algorithm that operates on data streams is measured by three basic factors The number of passes the algorithm must make over the stream The available memory The running time of the algorithm These algorithms have many similarities with online algorithms since they both require decisions to be made before all data are available but they are not identical Data stream algorithms only have limited memory available but they may be able to defer action until a group of points arrive while online algorithms are required to take action as soon as each point arrives If the algorithm is an approximation algorithm then the accuracy of the answer is another key factor The accuracy is often stated as an ϵ d displaystyle epsilon delta nbsp approximation meaning that the algorithm achieves an error 
of less than ϵ displaystyle epsilon nbsp with probability 1 d displaystyle 1 delta nbsp Applications editStreaming algorithms have several applications in networking such as monitoring network links for elephant flows counting the number of distinct flows estimating the distribution of flow sizes and so on 8 They also have applications in databases such as estimating the size of a join citation needed Some streaming problems editFrequency moments edit The k th frequency moment of a set of frequencies a displaystyle mathbf a nbsp is defined as F k a i 1 n a i k displaystyle F k mathbf a sum i 1 n a i k nbsp The first moment F 1 displaystyle F 1 nbsp is simply the sum of the frequencies i e the total count The second moment F 2 displaystyle F 2 nbsp is useful for computing statistical properties of the data such as the Gini coefficient of variation F displaystyle F infty nbsp is defined as the frequency of the most frequent items The seminal paper of Alon Matias and Szegedy dealt with the problem of estimating the frequency moments citation needed Calculating frequency moments edit A direct approach to find the frequency moments requires to maintain a register mi for all distinct elements ai 1 2 3 4 N which requires at least memory of order W N displaystyle Omega N nbsp 3 But we have space limitations and require an algorithm that computes in much lower memory This can be achieved by using approximations instead of exact values An algorithm that computes an e d approximation of Fk where F k is the e d approximated value of Fk 9 Where e is the approximation parameter and d is the confidence parameter 10 Calculating F0 distinct elements in a DataStream edit Main article Count distinct problem FM Sketch algorithm edit Flajolet et al in 2 introduced probabilistic method of counting which was inspired from a paper by Robert Morris 11 Morris in his paper says that if the requirement of accuracy is dropped a counter n can be replaced by a counter log n which can be stored 
in log log n bits 12 Flajolet et al in 2 improved this method by using a hash function h which is assumed to uniformly distribute the element in the hash space a binary string of length L h m 0 2 L 1 displaystyle h m rightarrow 0 2 L 1 nbsp Let bit y k represent the kth bit in binary representation of y y k 0 b i t y k 2 k displaystyle y sum k geq 0 mathrm bit y k 2 k nbsp Let r y displaystyle rho y nbsp represents the position of least significant 1 bit in the binary representation of yi with a suitable convention for r 0 displaystyle rho 0 nbsp r y M i n k b i t y k 1 if y gt 0 L if y 0 displaystyle rho y begin cases mathrm Min k mathrm bit y k 1 amp text if y gt 0 L amp text if y 0 end cases nbsp Let A be the sequence of data stream of length M whose cardinality need to be determined Let BITMAP 0 L 1 be thehash space where the r hashedvalues are recorded The below algorithm then determines approximate cardinality of A Procedure FM Sketch for i in 0 to L 1 do BITMAP i 0 end for for x in A do Index r hash x if BITMAP index 0 then BITMAP index 1 end if end for B Position of left most 0 bit of BITMAP return 2 BIf there are N distinct elements in a data stream For i log N displaystyle i gg log N nbsp then BITMAP i is certainly 0 For i log N displaystyle i ll log N nbsp then BITMAP i is certainly 1 For i log N displaystyle i approx log N nbsp then BITMAP i is a fringes of 0 s and 1 sK minimum value algorithm edit The previous algorithm describes the first attempt to approximate F0 in the data stream by Flajolet and Martin Their algorithm picks a random hash function which they assume to uniformly distribute the hash values in hash space Bar Yossef et al in 10 introduced k minimum value algorithm for determining number of distinct elements in data stream They used a similar hash function h which can be normalized to 0 1 as h m 0 1 displaystyle h m rightarrow 0 1 nbsp But they fixed a limit t to number of values in hash space The value of t is assumed of the order O 1 e 
2 displaystyle O left dfrac 1 varepsilon 2 right nbsp i e less approximation value e requires more t KMV algorithm keeps only t smallest hash values in the hash space After all the m values of stream have arrived y M a x h a i displaystyle upsilon mathrm Max h a i nbsp is used to calculateF 0 t y displaystyle F 0 dfrac t upsilon nbsp That is in a close to uniform hash space they expect at least t elements to be less than O t F 0 displaystyle O left dfrac t F 0 right nbsp Procedure 2 K Minimum Value Initialize first t values of KMV for a in a1 to an do if h a lt Max KMV then Remove Max KMV from KMV set Insert h a to KMV end if end for return t Max KMV Complexity analysis of KMV edit KMV algorithm can be implemented in O 1 e 2 log m displaystyle O left left dfrac 1 varepsilon 2 right cdot log m right nbsp memory bits space Each hash value requires space of order O log m displaystyle O log m nbsp memory bits There are hash values of the order O 1 e 2 displaystyle O left dfrac 1 varepsilon 2 right nbsp The access time can be reduced if we store the t hash values in a binary tree Thus the time complexity will be reduced to O log 1 e log m displaystyle O left log left dfrac 1 varepsilon right cdot log m right nbsp Calculating Fk edit Alon et al estimates Fk by defining random variables that can be computed within given space and time 3 The expected value of random variables gives the approximate value of Fk Assume length of sequence m is known in advance Then construct a random variable X as follows Select ap be a random member of sequence A with index at p a p l 1 2 3 n displaystyle a p l in 1 2 3 ldots n nbsp Let r q q p a q l displaystyle r q q geq p a q l nbsp represents the number of occurrences of l within the members of the sequence A following ap Random variable X m r k r 1 k displaystyle X m r k r 1 k nbsp Assume S1 be of the order O n 1 1 k l 2 displaystyle O n 1 1 k lambda 2 nbsp and S2 be of the order O log 1 e displaystyle O log 1 varepsilon nbsp Algorithm 
takes S2 random variable Y 1 Y 2 Y S 2 displaystyle Y 1 Y 2 Y S2 nbsp and outputs the median Y displaystyle Y nbsp Where Yi is the average of Xij where 1 j S1 Now calculate expectation of random variable E X E X i 1 n i 1 m i j k j 1 k m m 1 k 2 k 1 k m 1 k m 1 1 k 1 k 2 k 1 k m 2 k m 2 1 k 1 k 2 k 1 k m n k m n 1 k i 1 n m i k F k displaystyle begin array lll E X amp amp sum i 1 n sum i 1 m i j k j 1 k amp amp frac m m 1 k 2 k 1 k ldots m 1 k m 1 1 k amp amp 1 k 2 k 1 k ldots m 2 k m 2 1 k ldots amp amp 1 k 2 k 1 k ldots m n k m n 1 k amp amp sum i 1 n m i k F k end array nbsp Complexity of Fk edit From the algorithm to calculate Fk discussed above we can see that each random variable X stores value of ap and r So to compute X we need to maintain only log n bits for storing ap and log n bits for storing r Total number of random variable X will be the S 1 S 2 displaystyle S 1 S 2 nbsp Hence the total space complexity the algorithm takes is of the order of O k log 1 e l 2 n 1 1 k log n log m displaystyle O left dfrac k log 1 over varepsilon lambda 2 n 1 1 over k left log n log m right right nbsp Simpler approach to calculate F2 edit The previous algorithm calculates F 2 displaystyle F 2 nbsp in order of O n log m log n displaystyle O sqrt n log m log n nbsp memory bits Alon et al in 3 simplified this algorithm using four wise independent random variable with values mapped to 1 1 displaystyle 1 1 nbsp This further reduces the complexity to calculate F 2 displaystyle F 2 nbsp to O log 1 e l 2 log n log m displaystyle O left dfrac log 1 over varepsilon lambda 2 left log n log m right right nbsp Frequent elements edit In the data stream model the frequent elements problem is to output a set of elements that constitute more than some fixed fraction of the stream A special case is the majority problem which is to determine whether or not any value constitutes a majority of the stream More formally fix some positive constant c gt 1 let the length of the stream be m and let 
fi denote the frequency of value i in the stream The frequent elements problem is to output the set i fi gt m c 13 Some notable algorithms are Boyer Moore majority vote algorithm Count Min sketch Lossy counting Multi stage Bloom filters Misra Gries heavy hitters algorithm Misra Gries summaryEvent detection edit Detecting events in data streams is often done using a heavy hitters algorithm as listed above the most frequent items and their frequency are determined using one of these algorithms then the largest increase over the previous time point is reported as trend This approach can be refined by using exponentially weighted moving averages and variance for normalization 14 Counting distinct elements edit Counting the number of distinct elements in a stream sometimes called the F0 moment is another problem that has been well studied The first algorithm for it was proposed by Flajolet and Martin In 2010 Daniel Kane Jelani Nelson and David Woodruff found an asymptotically optimal algorithm for this problem 15 It uses O e2 log d space with O 1 worst case update and reporting times as well as universal hash functions and a r wise independent hash family where r W log 1 e log log 1 e Entropy edit The empirical entropy of a set of frequencies a displaystyle mathbf a nbsp is defined as F k a i 1 n a i m log a i m displaystyle F k mathbf a sum i 1 n frac a i m log frac a i m nbsp where m i 1 n a i displaystyle m sum i 1 n a i nbsp Online learning edit Main article Online machine learning Learn a model e g a classifier by a single pass over a training set Feature hashing Stochastic gradient descentLower bounds editLower bounds have been computed for many of the data streaming problems that have been studied By far the most common technique for computing these lower bounds has been using communication complexity citation needed See also editData stream mining Data stream clustering Online algorithm Stream processing Sequential algorithmNotes edit Munro J Ian Paterson Mike 
Notes

1. Munro, J. Ian; Paterson, Mike (1978). "Selection and Sorting with Limited Storage". 19th Annual Symposium on Foundations of Computer Science, Ann Arbor, Michigan, USA, 16–18 October 1978. IEEE Computer Society. pp. 253–258. doi:10.1109/SFCS.1978.32.
2. Flajolet & Martin (1985).
3. Alon, Matias & Szegedy (1996).
4. Feigenbaum, Joan; Kannan, Sampath (2005). "On graph problems in a semi-streaming model". Theoretical Computer Science. 348 (2): 207–216. doi:10.1016/j.tcs.2005.09.013.
5. Babcock, Brian; Babu, Shivnath; Datar, Mayur; Motwani, Rajeev; Widom, Jennifer (2002). "Models and issues in data stream systems". Proceedings of the Twenty-First ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS '02). New York, NY, USA: ACM. pp. 1–16. doi:10.1145/543613.543615. ISBN 978-1581135077.
6. Bar-Yossef, Ziv; Jayram, T. S.; Kumar, Ravi; Sivakumar, D.; Trevisan, Luca (2002). "Counting Distinct Elements in a Data Stream". Randomization and Approximation Techniques in Computer Science. Lecture Notes in Computer Science. Vol. 2483. Springer, Berlin, Heidelberg. pp. 1–10. doi:10.1007/3-540-45726-7_1. ISBN 978-3540457268.
7. Gilbert et al. (2001).
8. Xu (2007).
9. Indyk, Piotr; Woodruff, David (2005). "Optimal approximations of the frequency moments of data streams". Proceedings of the Thirty-Seventh Annual ACM Symposium on Theory of Computing (STOC '05). New York, NY, USA: ACM. pp. 202–208. doi:10.1145/1060590.1060621. ISBN 978-1-58113-960-0.
10. Bar-Yossef, Ziv; Jayram, T. S.; Kumar, Ravi; Sivakumar, D.; Trevisan, Luca (2002). "Counting Distinct Elements in a Data Stream". In Rolim, José D. P.; Vadhan, Salil (eds.). Lecture Notes in Computer Science. Springer, Berlin, Heidelberg. pp. 1–10. doi:10.1007/3-540-45726-7_1. ISBN 978-3-540-44147-2.
11. Morris (1978).
12. Flajolet, Philippe (1985). "Approximate counting: A detailed analysis". BIT Numerical Mathematics. 25 (1): 113–134. doi:10.1007/BF01934993. ISSN 0006-3835.
13. Cormode,
Graham (2014). "Misra–Gries Summaries". In Kao, Ming-Yang (ed.). Encyclopedia of Algorithms. Springer US. pp. 1–5. doi:10.1007/978-3-642-27848-8_572-1. ISBN 9783642278488.
14. Schubert, E.; Weiler, M.; Kriegel, H.-P. (2014). "SigniTrend: scalable detection of emerging topics in textual streams by hashed significance thresholds". Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '14). pp. 871–880. doi:10.1145/2623330.2623740. ISBN 9781450329569.
15. Kane, Nelson & Woodruff (2010).

References

Alon, Noga; Matias, Yossi; Szegedy, Mario (1999). "The space complexity of approximating the frequency moments". Journal of Computer and System Sciences. 58 (1): 137–147. doi:10.1006/jcss.1997.1545. ISSN 0022-0000. First published as: Alon, Noga; Matias, Yossi; Szegedy, Mario (1996). "The space complexity of approximating the frequency moments". Proceedings of the 28th ACM Symposium on Theory of Computing (STOC 1996). pp. 20–29. doi:10.1145/237814.237823. ISBN 978-0-89791-785-8.
Babcock, Brian; Babu, Shivnath; Datar, Mayur; Motwani, Rajeev; Widom, Jennifer (2002). "Models and issues in data stream systems" (PDF). Proceedings of the 21st ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS 2002). pp. 1–16. doi:10.1145/543613.543615. ISBN 978-1581135077.
Gilbert, A. C.; Kotidis, Y.; Muthukrishnan, S.; Strauss, M. J. (2001). "Surfing Wavelets on Streams: One-Pass Summaries for Approximate Aggregate Queries" (PDF). Proceedings of the International Conference on Very Large Data Bases: 79–88.
Kane, Daniel M.; Nelson, Jelani; Woodruff, David P. (2010). "An optimal algorithm for the distinct elements problem". Proceedings of the Twenty-Ninth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS '10). New York, NY, USA: ACM. pp. 41–52. doi:10.1145/1807085.1807094. ISBN 978-1-4503-0033-9.
Karp, R. M.; Papadimitriou, C. H.; Shenker, S. (2003). "A simple algorithm for finding
frequent elements in streams and bags". ACM Transactions on Database Systems. 28 (1): 51–55. doi:10.1145/762471.762473.
Lall, Ashwin; Sekar, Vyas; Ogihara, Mitsunori; Xu, Jun; Zhang, Hui (2006). "Data streaming algorithms for estimating entropy of network traffic" (PDF). Proceedings of the Joint International Conference on Measurement and Modeling of Computer Systems (ACM SIGMETRICS 2006). p. 145. doi:10.1145/1140277.1140295. ISBN 978-1595933195.
Xu, Jun (Jim) (2007). A Tutorial on Network Data Streaming (PDF).
Heath, D.; Kasif, S.; Kosaraju, R.; Salzberg, S.; Sullivan, G. (1991). "Learning Nested Concepts With Limited Storage". Proceedings of IJCAI-91, the 12th International Joint Conference on Artificial Intelligence. Vol. 2. Morgan Kaufmann. pp. 777–782.
Morris, Robert (1978). "Counting large numbers of events in small registers". Communications of the ACM. 21 (10): 840–842. doi:10.1145/359619.359627.

Retrieved from https://en.wikipedia.org/w/index.php?title=Streaming_algorithm&oldid=1187109795
