fbpx
Wikipedia

Lempel–Ziv complexity

The Lempel–Ziv complexity is a measure that was first presented in the article On the Complexity of Finite Sequences (IEEE Trans. On IT-22,1 1976), by two Israeli computer scientists, Abraham Lempel and Jacob Ziv. This complexity measure is related to Kolmogorov complexity, but the only function it uses is the recursive copy (i.e., the shallow copy).

The underlying mechanism in this complexity measure is the starting point for some algorithms for lossless data compression, like LZ77, LZ78 and LZW. Even though it is based on an elementary principle of words copying, this complexity measure is not too restrictive in the sense that it satisfies the main qualities expected by such a measure: sequences with a certain regularity do not have a too large complexity, and the complexity grows as the sequence grows in length and irregularity.

The Lempel–Ziv complexity can be used to measure the repetitiveness of binary sequences and text, like song lyrics or prose. Fractal dimension estimates of real-world data have also been shown to correlate with Lempel–Ziv complexity.[1][2]

Principle

Let S be a binary sequence, of length n, for which we have to compute the Lempel–Ziv complexity, denoted C(S). The sequence is read from the left.

Imagine you have a delimiting line, which can be moved in the sequence during the calculation. At first, this line is set just after the first symbol, at the beginning of the sequence. This initial position is called position 1, from where we have to move it to a position 2, which is considered the initial position for the next step (and so on). We have to move the delimiter (starting in position 1) the further possible to the right, so that the sub-word between position 1 and the delimiter position be a word of the sequence that starts before the position 1 of the delimiter.

As soon as the delimiter is set on a position where this condition is not met, we stop, move the delimiter to this position, and start again by marking this position as a new initial position (i.e., position 1). Keep iterating until the end of the sequence. The Lempel–Ziv complexity corresponds to the number of iterations needed to finish this procedure.

Said differently, the Lempel–Ziv complexity is the number of different sub-strings (or sub-words) encountered as the binary sequence is viewed as a stream (from left to right).

Formal explanations

The method proposed by Lempel and Ziv uses three notions: reproducibility, producibility and exhaustive history of a sequence, that we defined here.

Notations

Let S be a binary sequence of length n (i.e., n symbols taking value 0 or 1). Let S(i,j), with  , be the sub-word of S from index i to index j (if j<i, S(i,j) is the empty string). The length n of S is denoted l(S), and a sequence Q is said to be a fixed prefix of S if:

 

Reproducibility and producibility

 
Example of reproducibility Click here

On the one hand, a sequence S of length n is said to be reproducible from its prefix S(1,j) when S(j+1,n) is a sub-word of S(1,j). This is denoted S(1,j)→S.

Said differently, S is reproducible from its prefix S(1,j) if the rest of the sequence, S(j+1,n), is nothing but a copy of another sub-word (starting at an index i < j+1) of S(1,n−1).

To prove that the sequence S can be reproduced by one of its prefix S(1,j), you have to show that:

 

 
Example of Productibility Click here

On the other hand, the producibility, is defined from the reproducibility: a sequence S is producible from its prefix S(1,j) if S(1,n−1) is reproducible from S(1,j). This is denoted S(1,j)⇒S. Said differently, S(j+1,n−1) has to be a copy of another sub-word of S(1,n-2). The last symbol of S can be a new symbol (but could not be), possibly leading to the production of a new sub-word (hence the term producibility).

 
Comparison between reproducibility and producibility Click here

Exhaustive history and complexity

From the definition of productibility, the empty string Λ=S(1,0) ⇒ S(1,1). So by a recursive production process, at step i we have S(1,hi) ⇒ S(1,hi+1), so we can build S from its prefixes. And as S(1,i) ⇒ S(1,i+1) (with hi+1 =hi + 1) is always true, this production process of S takes at most n=l(S) steps. Let m,  , be the number of steps necessary for this product process of S. S can be written in a decomposed form, called history of S, and denoted H(S), defined like this:

   

 
Comparison between reproductibility and productibility Click here

A component of S, Hi(S), is said to be exhaustive if S(1,hi) is the longest sequence produced by S(1,hi−1) (i.e., S(1,hi−1) ⇒ S(1,hi)) but so that S(1,hi−1) does not produce S(1,hi) (denoted).   The index p which allows to have the longest production is called pointer.

The history of S is said to be exhaustive if all its component are exhaustive, except possibly the last one. From the definition, one can show that any sequence S has only one exhaustive history, and this history is the one with the smallest number of component from all the possible histories of S. Finally, the number of component of this unique exhaustive history of S is called the Lempel–Ziv complexity of S.

Algorithm

Fortunately, there exists a very efficient method for computing this complexity, in a linear number of operation (  for   length of the sequence S).

A formal description of this method is given by the following algorithm:

  • i = p − 1, p is the pointer (see above)
  • u is the length of the current prefix
  • v is the length of the current component for the current pointer p
  • vmax is the final length used for the current component (largest on all the possible pointers p)
  • and C is the Lempel–Ziv Complexity, incremented iteratively.
// S is a binary sequence of size n i := 0 C := 1 u := 1 v := 1 vmax := v while u + v <= n do if S[i + v] = S[u + v] then v := v + 1 else vmax := max(v, vmax) i := i + 1 if i = u then // all the pointers have been treated C := C + 1 u := u + vmax v := 1 i := 0 vmax := v else v := 1 end if end if end while if v != 1 then C := C+1 end if 

See also

  • LZ77 and LZ78 – Compression algorithms that use a similar idea of finding matching substrings.

Notes and references

References

  1. ^ Burns, T.; Rajan, R. (2015). "Burns & Rajan (2015) Combining complexity measures of EEG data: multiplying measures reveal previously hidden information. F1000Research. 4:137". F1000Research. 4: 137. doi:10.12688/f1000research.6590.1. PMC 4648221. PMID 26594331.
  2. ^ Burns, T.; Rajan, R. (2019). "A Mathematical Approach to Correlating Objective Spectro-Temporal Features of Non-linguistic Sounds With Their Subjective Perceptions in Humans". Frontiers in Neuroscience. 13: 794. doi:10.3389/fnins.2019.00794. PMC 6685481. PMID 31417350.

Bibliography

  • Abraham Lempel and Jacob Ziv, « On the Complexity of Finite Sequences », IEEE Trans. on Information Theory, January 1976, p. 75–81, vol. 22, n°1

Application

  • « Are Pop Lyrics Getting More Repetitive? », By Colin Morris, is a blog post explaining how to use the Lempel–Ziv complexity to measure repetitiveness of song lyrics (with source code available).
  • Burns & Rajan (2015) Combining complexity measures of EEG data: multiplying measures reveal previously hidden information. F1000Research. 4:137. [1] (with public MATLAB code available).
  • Burns & Rajan (2019) A Mathematical Approach to Correlating Objective Spectro-Temporal Features of Non-linguistic Sounds With Their Subjective Perceptions in Humans. Frontiers in Neuroscience 13:794. [2] (with public MATLAB code available).

External links

  • Example of a Python implementation and discussion on StackOverflow #41336798
  • Open-Source (MIT) implementation on Python and Cython on GitHub available on PyPi

lempel, complexity, measure, that, first, presented, article, complexity, finite, sequences, ieee, trans, 1976, israeli, computer, scientists, abraham, lempel, jacob, this, complexity, measure, related, kolmogorov, complexity, only, function, uses, recursive, . The Lempel Ziv complexity is a measure that was first presented in the article On the Complexity of Finite Sequences IEEE Trans On IT 22 1 1976 by two Israeli computer scientists Abraham Lempel and Jacob Ziv This complexity measure is related to Kolmogorov complexity but the only function it uses is the recursive copy i e the shallow copy The underlying mechanism in this complexity measure is the starting point for some algorithms for lossless data compression like LZ77 LZ78 and LZW Even though it is based on an elementary principle of words copying this complexity measure is not too restrictive in the sense that it satisfies the main qualities expected by such a measure sequences with a certain regularity do not have a too large complexity and the complexity grows as the sequence grows in length and irregularity The Lempel Ziv complexity can be used to measure the repetitiveness of binary sequences and text like song lyrics or prose Fractal dimension estimates of real world data have also been shown to correlate with Lempel Ziv complexity 1 2 Contents 1 Principle 2 Formal explanations 2 1 Notations 2 2 Reproducibility and producibility 2 3 Exhaustive history and complexity 3 Algorithm 4 See also 5 Notes and references 5 1 References 5 2 Bibliography 5 3 Application 6 External linksPrinciple EditLet S be a binary sequence of length n for which we have to compute the Lempel Ziv complexity denoted C S The sequence is read from the left Imagine you have a delimiting line which can be moved in the sequence during the calculation At first this line is set just after the first symbol at the beginning of the sequence This initial position is called position 1 from where we have to move it to a position 2 which is considered the initial position for the next step and so on We have to move the delimiter starting in position 1 the further possible to the right so that the sub word between position 1 and the delimiter position be a word of the sequence that starts before the position 1 of the delimiter As soon as the delimiter is set on a position where this condition is not met we stop move the delimiter to this position and start again by marking this position as a new initial position i e position 1 Keep iterating until the end of the sequence The Lempel Ziv complexity corresponds to the number of iterations needed to finish this procedure Said differently the Lempel Ziv complexity is the number of different sub strings or sub words encountered as the binary sequence is viewed as a stream from left to right Formal explanations EditThe method proposed by Lempel and Ziv uses three notions reproducibility producibility and exhaustive history of a sequence that we defined here Notations Edit Let S be a binary sequence of length n i e n symbols taking value 0 or 1 Let S i j with 1 i j n displaystyle 1 leq i j leq n be the sub word of S from index i to index j if j lt i S i j is the empty string The length n of S is denoted l S and a sequence Q is said to be a fixed prefix of S if j lt l S s t S 1 j Q displaystyle exists j lt text l S s t S 1 j Q Reproducibility and producibility Edit Example of reproducibility Click here On the one hand a sequence S of length n is said to be reproducible from its prefix S 1 j when S j 1 n is a sub word of S 1 j This is denoted S 1 j S Said differently S is reproducible from its prefix S 1 j if the rest of the sequence S j 1 n is nothing but a copy of another sub word starting at an index i lt j 1 of S 1 n 1 To prove that the sequence S can be reproduced by one of its prefix S 1 j you have to show that p j s t S j 1 n S p l S j 1 n p 1 displaystyle exists p leq j text s t S j 1 n S p l S j 1 n p 1 Example of Productibility Click here On the other hand the producibility is defined from the reproducibility a sequence S is producible from its prefix S 1 j if S 1 n 1 is reproducible from S 1 j This is denoted S 1 j S Said differently S j 1 n 1 has to be a copy of another sub word of S 1 n 2 The last symbol of S can be a new symbol but could not be possibly leading to the production of a new sub word hence the term producibility Comparison between reproducibility and producibility Click here Exhaustive history and complexity Edit From the definition of productibility the empty string L S 1 0 S 1 1 So by a recursive production process at step i we have S 1 hi S 1 hi 1 so we can build S from its prefixes And as S 1 i S 1 i 1 with hi 1 hi 1 is always true this production process of S takes at most n l S steps Let m 1 m l S displaystyle 1 leq text m leq l S be the number of steps necessary for this product process of S S can be written in a decomposed form called history of S and denoted H S defined like this H S S 1 h 1 S h 1 1 h 2 S h m 1 1 h m displaystyle H S S 1 h 1 S h 1 1 h 2 dotsm S h m 1 1 h m H i S S h i 1 1 h i i 1 2 m where h 0 0 h 1 1 h m l S is called component of H S displaystyle H i S S h i 1 1 h i i 1 2 dotsm m text where h 0 0 h 1 1 h m l S text is called component of H S Comparison between reproductibility and productibility Click here A component of S Hi S is said to be exhaustive if S 1 hi is the longest sequence produced by S 1 hi 1 i e S 1 hi 1 S 1 hi but so that S 1 hi 1 does not produce S 1 hi denoted S 1 h i 1 S 1 h i displaystyle S 1 h i 1 nrightarrow S 1 h i The index p which allows to have the longest production is called pointer The history of S is said to be exhaustive if all its component are exhaustive except possibly the last one From the definition one can show that any sequence S has only one exhaustive history and this history is the one with the smallest number of component from all the possible histories of S Finally the number of component of this unique exhaustive history of S is called the Lempel Ziv complexity of S Algorithm EditFortunately there exists a very efficient method for computing this complexity in a linear number of operation O n displaystyle mathcal O n for n l S displaystyle n l S length of the sequence S A formal description of this method is given by the following algorithm i p 1 p is the pointer see above u is the length of the current prefix v is the length of the current component for the current pointer p vmax is the final length used for the current component largest on all the possible pointers p and C is the Lempel Ziv Complexity incremented iteratively S is a binary sequence of size n i 0 C 1 u 1 v 1 vmax v while u v lt n do if S i v S u v then v v 1 else vmax max v vmax i i 1 if i u then all the pointers have been treated C C 1 u u vmax v 1 i 0 vmax v else v 1 end if end if end while if v 1 then C C 1 end ifSee also EditLZ77 and LZ78 Compression algorithms that use a similar idea of finding matching substrings Notes and references EditReferences Edit Burns T Rajan R 2015 Burns amp Rajan 2015 Combining complexity measures of EEG data multiplying measures reveal previously hidden information F1000Research 4 137 F1000Research 4 137 doi 10 12688 f1000research 6590 1 PMC 4648221 PMID 26594331 Burns T Rajan R 2019 A Mathematical Approach to Correlating Objective Spectro Temporal Features of Non linguistic Sounds With Their Subjective Perceptions in Humans Frontiers in Neuroscience 13 794 doi 10 3389 fnins 2019 00794 PMC 6685481 PMID 31417350 Bibliography Edit Abraham Lempel and Jacob Ziv On the Complexity of Finite Sequences IEEE Trans on Information Theory January 1976 p 75 81 vol 22 n 1Application Edit Are Pop Lyrics Getting More Repetitive By Colin Morris is a blog post explaining how to use the Lempel Ziv complexity to measure repetitiveness of song lyrics with source code available Burns amp Rajan 2015 Combining complexity measures of EEG data multiplying measures reveal previously hidden information F1000Research 4 137 1 with public MATLAB code available Burns amp Rajan 2019 A Mathematical Approach to Correlating Objective Spectro Temporal Features of Non linguistic Sounds With Their Subjective Perceptions in Humans Frontiers in Neuroscience 13 794 2 with public MATLAB code available External links EditExample of a Python implementation and discussion on StackOverflow 41336798 Open Source MIT implementation on Python and Cython on GitHub available on PyPi Retrieved from https en wikipedia org w index php title Lempel Ziv complexity amp oldid 1088349399, wikipedia, wiki, book, books, library,

article

, read, download, free, free download, mp3, video, mp4, 3gp, jpg, jpeg, gif, png, picture, music, song, movie, book, game, games.