Burrows–Wheeler transform

The Burrows–Wheeler transform (BWT, also called block-sorting compression) rearranges a character string into runs of similar characters. This is useful for compression, since a string with runs of repeated characters is easy to compress with techniques such as the move-to-front transform and run-length encoding. More importantly, the transformation is reversible without storing any additional data except the position of the first original character. The BWT is thus a "free" method of improving the efficiency of text compression algorithms, costing only some extra computation. It is used to prepare data for compression in tools such as bzip2. It was invented by Michael Burrows and David Wheeler in 1994 while Burrows was working at DEC Systems Research Center in Palo Alto, California, and is based on a previously unpublished transformation discovered by Wheeler in 1983. The algorithm can be implemented efficiently using a suffix array, reaching linear time complexity.[1]

Burrows–Wheeler transform
Class: preprocessing for lossless compression
Data structure: string
Worst-case performance: O(n)
Worst-case space complexity: O(n)

Description

When a character string is transformed by the BWT, the transformation permutes the order of the characters. If the original string had several substrings that occurred often, then the transformed string will have several places where a single character is repeated multiple times in a row.

For example:

Input SIX.MIXED.PIXIES.SIFT.SIXTY.PIXIE.DUST.BOXES
Output TEXYDST.E.IXIXIXXSSMPPS.B..E.S.EUSFXDIIOIIIT[2]

The output is easier to compress because it has many repeated characters. In this example the transformed string contains six runs of identical characters: XX, SS, PP, .., II, and III, which together make 13 out of the 44 characters.
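The run counts quoted above can be checked mechanically. The following is a minimal sketch rather than part of the cited source; the helper name `repeated_runs` is an illustrative assumption.

```python
from itertools import groupby


def repeated_runs(s: str) -> list[str]:
    """Return every maximal run of two or more identical characters in s."""
    runs = ("".join(group) for _, group in groupby(s))
    return [run for run in runs if len(run) >= 2]


text = "SIX.MIXED.PIXIES.SIFT.SIXTY.PIXIE.DUST.BOXES"
transformed = "TEXYDST.E.IXIXIXXSSMPPS.B..E.S.EUSFXDIIOIIIT"

print(repeated_runs(text))         # [] (the input has no repeated runs)
print(repeated_runs(transformed))  # ['XX', 'SS', 'PP', '..', 'II', 'III']
print(sum(len(run) for run in repeated_runs(transformed)), "of", len(transformed))  # 13 of 44
```

A run-length encoder benefits directly from these runs, which is why the transformed string is the better input for it.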

Example

The transform is done by sorting all the circular shifts of a text in lexicographic order and by extracting the last column and the index of the original string in the set of sorted rotations of S.

Given an input string S = ^BANANA$ (step 1 in the table below), rotate it N times (step 2), where N = 8 is the length of S, counting the ^ character that marks the start of the string and the $ character that represents the 'EOF' pointer; these rotations, or circular shifts, are then sorted lexicographically (step 3). In this example the letters sort before ^, and ^ sorts before $. The output of the encoding phase is the last column L = BNN^AA$A after step 3, and the index (0-based) I of the row containing the original string S, in this case I = 6.

It is not necessary to use both $ and ^, but at least one must be used, else we cannot invert the transform, since all circular permutations of a string have the same Burrows–Wheeler transform.

Transformation
1. Input: ^BANANA$

2. All rotations   3. Sort into lexical order   4. Take the last column
^BANANA$           ANANA$^B                     B
$^BANANA           ANA$^BAN                     N
A$^BANAN           A$^BANAN                     N
NA$^BANA           BANANA$^                     ^
ANA$^BAN           NANA$^BA                     A
NANA$^BA           NA$^BANA                     A
ANANA$^B           ^BANANA$                     $
BANANA$^           $^BANANA                     A

5. Output: BNN^AA$A
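The table can be reproduced with a few lines of Python. This is a sketch under one assumption: the example sorts the letters first, then ^, then $, which differs from plain ASCII order (where $ comes first), so a custom sort key is needed. The names `sort_key` and `bwt_with_index` are illustrative.

```python
def sort_key(rotation: str) -> list[int]:
    """Order used in the table: letters first, then '^', then '$'."""
    # '^' already follows the capital letters in ASCII; only '$' has to move to the end.
    return [0x7F if ch == "$" else ord(ch) for ch in rotation]


def bwt_with_index(s: str) -> tuple[str, int]:
    """Return the last column of the sorted rotation table and the row index of s."""
    rotations = [s[i:] + s[:i] for i in range(len(s))]
    table = sorted(rotations, key=sort_key)
    last_column = "".join(row[-1] for row in table)
    return last_column, table.index(s)


print(bwt_with_index("^BANANA$"))  # ('BNN^AA$A', 6)
```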

The following pseudocode gives a simple (though inefficient) way to calculate the BWT and its inverse. It assumes that the input string s contains a special character 'EOF' which is the last character and occurs nowhere else in the text.

function BWT (string s)
    create a table, where the rows are all possible rotations of s
    sort rows alphabetically
    return (last column of the table)

function inverseBWT (string s)
    create empty table
    repeat length(s) times
        // first insert creates first column
        insert s as a column of table before first column of the table
        sort rows of the table alphabetically
    return (row that ends with the 'EOF' character)

Explanation

To understand why this creates more-easily-compressible data, consider transforming a long English text frequently containing the word "the". Sorting the rotations of this text will group rotations starting with "he " together, and the last character of that rotation (which is also the character before the "he ") will usually be "t", so the result of the transform would contain a number of "t" characters along with the perhaps less-common exceptions (such as if it contains "ache ") mixed in. So it can be seen that the success of this transform depends upon one value having a high probability of occurring before a sequence, so that in general it needs fairly long samples (a few kilobytes at least) of appropriate data (such as text).
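The grouping effect can be observed directly on a small text. The sketch below lists the last character of every sorted rotation that begins with a given context; the sample sentence, the "\0" end marker, and the helper name `preceding_characters` are assumptions made for illustration.

```python
def preceding_characters(text: str, context: str) -> list[str]:
    """Last characters of the sorted rotations of text that begin with context."""
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return [row[-1] for row in rotations if row.startswith(context)]


# Hypothetical sample text; "\0" is an assumed end marker that occurs nowhere else.
sample = "the cat saw the dog chase the ball and the ache faded\0"
print(preceding_characters(sample, "he "))  # ['t', 't', 't', 't', 'c']
```

Because rotations sharing a prefix are adjacent after sorting, these characters sit next to each other in the transform's output: four "t" characters from "the " and one "c" from "ache ".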

The remarkable thing about the BWT is not that it generates a more easily encoded output—an ordinary sort would do that—but that it does this reversibly, allowing the original document to be re-generated from the last column data.

The inverse can be understood this way. Take the final table in the BWT algorithm, and erase all but the last column. Given only this information, you can easily reconstruct the first column. The last column tells you all the characters in the text, so just sort these characters alphabetically to get the first column. Then, the last and first columns (of each row) together give you all pairs of successive characters in the document, where pairs are taken cyclically so that the last and first character form a pair. Sorting the list of pairs gives the first and second columns. Continuing in this manner, you can reconstruct the entire list. Then, the row with the "end of file" character at the end is the original text. Reversing the example above is done like this:

Inverse transformation
Input: BNN^AA$A

Add 1   Sort 1   Add 2   Sort 2
B       A        BA      AN
N       A        NA      AN
N       A        NA      A$
^       B        ^B      BA
A       N        AN      NA
A       N        AN      NA
$       ^        $^      ^B
A       $        A$      $^

Add 3   Sort 3   Add 4   Sort 4
BAN     ANA      BANA    ANAN
NAN     ANA      NANA    ANA$
NA$     A$^      NA$^    A$^B
^BA     BAN      ^BAN    BANA
ANA     NAN      ANAN    NANA
ANA     NA$      ANA$    NA$^
$^B     ^BA      $^BA    ^BAN
A$^     $^B      A$^B    $^BA

Add 5    Sort 5   Add 6    Sort 6
BANAN    ANANA    BANANA   ANANA$
NANA$    ANA$^    NANA$^   ANA$^B
NA$^B    A$^BA    NA$^BA   A$^BAN
^BANA    BANAN    ^BANAN   BANANA
ANANA    NANA$    ANANA$   NANA$^
ANA$^    NA$^B    ANA$^B   NA$^BA
$^BAN    ^BANA    $^BANA   ^BANAN
A$^BA    $^BAN    A$^BAN   $^BANA

Add 7     Sort 7    Add 8      Sort 8
BANANA$   ANANA$^   BANANA$^   ANANA$^B
NANA$^B   ANA$^BA   NANA$^BA   ANA$^BAN
NA$^BAN   A$^BANA   NA$^BANA   A$^BANAN
^BANANA   BANANA$   ^BANANA$   BANANA$^
ANANA$^   NANA$^B   ANANA$^B   NANA$^BA
ANA$^BA   NA$^BAN   ANA$^BAN   NA$^BANA
$^BANAN   ^BANANA   $^BANANA   ^BANANA$
A$^BANA   $^BANAN   A$^BANAN   $^BANANA

Output: ^BANANA$

Optimization

A number of optimizations can make these algorithms run more efficiently without changing the output. There is no need to represent the table in either the encoder or decoder. In the encoder, each row of the table can be represented by a single pointer into the strings, and the sort performed using the indices. In the decoder, there is also no need to store the table, and in fact no sort is needed at all. In time proportional to the alphabet size and string length, the decoded string may be generated one character at a time from right to left. A "character" in the algorithm can be a byte, or a bit, or any other convenient size.
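A minimal sketch of both ideas follows, assuming Python's default character order and a non-empty input; the function names are illustrative. The encoder sorts rotation start indices rather than rotation strings, and the decoder uses a last-to-first mapping (commonly called LF-mapping) to emit the text right to left without building a table. For brevity the encoder still creates temporary slices as sort keys, which a practical implementation would avoid.

```python
from collections import Counter


def bwt_encode(s: str) -> tuple[str, int]:
    """Represent each rotation by its start index and sort the indices."""
    n = len(s)
    doubled = s + s  # doubled[i:i + n] is the rotation starting at index i
    order = sorted(range(n), key=lambda i: doubled[i:i + n])
    last_column = "".join(s[i - 1] for i in order)  # character preceding each rotation
    return last_column, order.index(0)  # row index of the unrotated string


def bwt_decode(last_column: str, index: int) -> str:
    """Rebuild the text right to left; only the alphabet is sorted, never the rows."""
    # first_row[c]: first row of the sorted table whose rotation starts with c
    counts = Counter(last_column)
    first_row, total = {}, 0
    for ch in sorted(counts):
        first_row[ch] = total
        total += counts[ch]
    # lf[i]: row that holds row i's rotation shifted right by one character
    seen = Counter()
    lf = []
    for ch in last_column:
        lf.append(first_row[ch] + seen[ch])
        seen[ch] += 1
    out = []
    row = index
    for _ in range(len(last_column)):
        out.append(last_column[row])  # last character of the current rotation
        row = lf[row]
    return "".join(reversed(out))


print(bwt_encode("banana"))      # ('nnbaaa', 3)
print(bwt_decode("nnbaaa", 3))   # banana
```

Passing the row index alongside the last column replaces the explicit end marker, so the alphabet does not grow.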

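One may also observe that, mathematically, the encoded string can be computed as a simple modification of the suffix array, and suffix arrays can be computed in linear time and memory. The BWT can be defined with regard to the suffix array SA of text T as follows (with positions in T numbered from 0, and $ denoting the end-of-text marker):

\[
BWT[i] = \begin{cases} T[SA[i] - 1], & \text{if } SA[i] > 0 \\ \$, & \text{otherwise} \end{cases}
\] [3]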
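The suffix-array definition can be sketched in Python as follows: the i-th output character is the one that precedes the i-th smallest suffix, or the end marker when that suffix is the whole text. The sketch assumes the marker "$" does not occur in the text and sorts before all of its characters, and it builds the suffix array naively by sorting suffixes, which is not linear time. The function name is illustrative.

```python
def bwt_from_suffix_array(text: str, sentinel: str = "$") -> str:
    """Compute the BWT of text + sentinel from a (naively built) suffix array."""
    t = text + sentinel
    # Naive suffix array for illustration only: sort suffix start positions
    # by the suffix they denote. Linear-time constructions exist but are longer.
    sa = sorted(range(len(t)), key=lambda i: t[i:])
    # Character before each suffix; the suffix starting at 0 has none, so emit the sentinel.
    return "".join(t[i - 1] if i > 0 else sentinel for i in sa)


print(bwt_from_suffix_array("banana"))  # annb$aa
```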

There is no need to have an actual 'EOF' character. Instead, a pointer can be used that remembers where in a string the 'EOF' would be if it existed. In this approach, the output of the BWT must include both the transformed string, and the final value of the pointer. The inverse transform then shrinks it back down to the original size: it is given a string and a pointer, and returns just a string.

A complete description of the algorithms can be found in Burrows and Wheeler's paper, or in a number of online sources.[1] The algorithms vary somewhat by whether EOF is used, and in which direction the sorting was done. In fact, the original formulation did not use an EOF marker.[4]

Bijective variant

Since any rotation of the input string will lead to the same transformed string, the BWT cannot be inverted without adding an EOF marker to the end of the input or doing something equivalent, making it possible to distinguish the input string from all its rotations. Increasing the size of the alphabet (by appending the EOF character) makes later compression steps awkward.

There is a bijective version of the transform, by which the transformed string uniquely identifies the original, and the two have the same length and contain exactly the same characters, just in a different order.[5][6]

The bijective transform is computed by factoring the input into a non-increasing sequence of Lyndon words; such a factorization exists and is unique by the Chen–Fox–Lyndon theorem,[7] and may be found in linear time and constant space.[8] The algorithm sorts the rotations of all the words; as in the Burrows–Wheeler transform, this produces a sorted sequence of n strings. The transformed string is then obtained by picking the final character of each string in this sorted list. The one important caveat here is that strings of different lengths are not ordered in the usual way; the two strings are repeated forever, and the infinite repeats are sorted. For example, "ORO" precedes "OR" because "OROORO..." precedes "OROROR...".

For example, the text "^BANANA$" is transformed into "ANNBAA^$" through these steps (the $ character indicates the EOF pointer in the original string). The EOF character is unneeded in the bijective transform, so it is dropped during the transform and re-added to its proper place in the file.

The string is broken into Lyndon words so the words in the sequence are decreasing using the comparison method above. (Note that we're sorting '^' as succeeding other characters.) "^BANANA" becomes (^) (B) (AN) (AN) (A).
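A compact sketch of the forward bijective transform, assuming plain ASCII order (in which '^' already follows the capital letters). The factorization uses Duval's algorithm; the function names and the fixed-length sort key are illustrative choices, not taken from the cited papers.

```python
def lyndon_factors(s: str) -> list[str]:
    """Duval's algorithm: factor s into a non-increasing sequence of Lyndon words."""
    factors = []
    i, n = 0, len(s)
    while i < n:
        j, k = i + 1, i
        while j < n and s[k] <= s[j]:
            k = i if s[k] < s[j] else k + 1
            j += 1
        while i <= k:  # emit every repetition of the Lyndon word just found
            factors.append(s[i:i + j - k])
            i += j - k
    return factors


def bijective_bwt(s: str) -> str:
    """Sort the rotations of all Lyndon factors by their infinite repetitions."""
    n = len(s)
    rotations = [w[i:] + w[:i] for w in lyndon_factors(s) for i in range(len(w))]

    def infinite_key(w: str) -> str:
        # The first 2n characters of each infinite repetition decide the order,
        # because any two rotations have a combined length of at most 2n.
        return (w * (2 * n // len(w) + 1))[:2 * n]

    return "".join(w[-1] for w in sorted(rotations, key=infinite_key))


print(lyndon_factors("^BANANA"))  # ['^', 'B', 'AN', 'AN', 'A']
print(bijective_bwt("^BANANA"))   # ANNBAA^
```

Repeating each rotation to a common length turns the "compare the infinite repeats" rule into an ordinary string comparison, at the cost of extra memory.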

Bijective transformation
Input: ^BANANA$

All rotations        Sorted alphabetically   Last column of rotated Lyndon word
^^^^^^^^... (^)      AAAAAAAA... (A)         A
BBBBBBBB... (B)      ANANANAN... (AN)        N
ANANANAN... (AN)     ANANANAN... (AN)        N
NANANANA... (NA)     BBBBBBBB... (B)         B
ANANANAN... (AN)     NANANANA... (NA)        A
NANANANA... (NA)     NANANANA... (NA)        A
AAAAAAAA... (A)      ^^^^^^^^... (^)         ^

Output: ANNBAA^$
Inverse bijective transform
Input: ANNBAA^

Add 1   Sort 1   Add 2   Sort 2
A       A        AA      AA
N       A        NA      AN
N       A        NA      AN
B       B        BB      BB
A       N        AN      NA
A       N        AN      NA
^       ^        ^^      ^^

Add 3   Sort 3   Add 4   Sort 4
AAA     AAA      AAAA    AAAA
NAN     ANA      NANA    ANAN
NAN     ANA      NANA    ANAN
BBB     BBB      BBBB    BBBB
ANA     NAN      ANAN    NANA
ANA     NAN      ANAN    NANA
^^^     ^^^      ^^^^    ^^^^

Output: ^BANANA

Up until the last step, the process is identical to the inverse Burrows–Wheeler process, but here it will not necessarily give rotations of a single sequence; it instead gives rotations of Lyndon words (which will start to repeat as the process is continued). Here, we can see (repetitions of) four distinct Lyndon words: (A), (AN) (twice), (B), and (^). (NANA... doesn't represent a distinct word, as it is a cycle of ANAN....) At this point, these words are sorted into reverse order: (^), (B), (AN), (AN), (A). These are then concatenated to get

^BANANA
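The same result can be obtained without building the table. The sketch below swaps in a different technique from the column-by-column process: it stably sorts the positions of the transformed string, follows the cycles of the resulting permutation (each cycle spells one Lyndon word), and concatenates the words in reverse order. It assumes plain ASCII order, and the function name is illustrative.

```python
def inverse_bijective_bwt(last_column: str) -> str:
    """Invert the bijective transform by following cycles of the stable-sort permutation."""
    n = len(last_column)
    # pi[r]: position in last_column that supplies the first character of sorted row r.
    # Python's sort is stable, so equal characters keep their relative order.
    pi = sorted(range(n), key=lambda i: last_column[i])
    visited = [False] * n
    words = []
    for start in range(n):  # cycles are discovered in order of their smallest row
        if visited[start]:
            continue
        word = []
        row = start
        while not visited[row]:
            visited[row] = True
            row = pi[row]
            word.append(last_column[row])
        words.append("".join(word))
    # Words come out in increasing order; the original text lists them in reverse.
    return "".join(reversed(words))


print(inverse_bijective_bwt("ANNBAA^"))  # ^BANANA
```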

The Burrows–Wheeler transform can indeed be viewed as a special case of this bijective transform; instead of the traditional introduction of a new letter from outside our alphabet to denote the end of the string, we can introduce a new letter that compares as preceding all existing letters that is put at the beginning of the string. The whole string is now a Lyndon word, and running it through the bijective process will therefore result in a transformed result that, when inverted, gives back the Lyndon word, with no need for reassembling at the end.

Relatedly, the transformed text will only differ from the result of BWT by one character per Lyndon word; for example, if the input is decomposed into six Lyndon words, the output will only differ in six characters. For example, applying the bijective transform gives:

Input SIX.MIXED.PIXIES.SIFT.SIXTY.PIXIE.DUST.BOXES
Lyndon words S | IX | .MIXED.PIXIES.SIFT.SIXTY.PIXIE | .DUST | .BOXES
Output STEYDST.E.IXXIIXXSMPPXS.B..EE..SUSFXDIOIIIIT

The bijective transform includes eight runs of identical characters. These runs are, in order: XX, II, XX, PP, .., EE, .., and IIII.

In total, 18 characters are used in these runs.

Dynamic Burrows–Wheeler transform

When a text is edited, its Burrows–Wheeler transform will change. Salson et al.[9] propose an algorithm that deduces the Burrows–Wheeler transform of an edited text from that of the original text, doing a limited number of local reorderings in the original Burrows–Wheeler transform, which can be faster than constructing the Burrows–Wheeler transform of the edited text directly.

Sample implementation

This Python implementation sacrifices speed for simplicity: the program is short, but takes more than the linear time that would be desired in a practical implementation. It essentially does what the pseudocode section does.

Using the STX/ETX control codes to mark the start and end of the text, and using s[i:] + s[:i] to construct the ith rotation of s, the forward transform takes the last character of each of the sorted rows:

def bwt(s: str) -> str:
    """Apply Burrows–Wheeler transform to input string."""
    assert "\002" not in s and "\003" not in s, "Input string cannot contain STX and ETX characters"
    s = "\002" + s + "\003"  # Add start and end of text marker
    table = sorted(s[i:] + s[:i] for i in range(len(s)))  # Table of rotations of string
    last_column = [row[-1:] for row in table]  # Last characters of each row
    return "".join(last_column)  # Convert list of characters into string

The inverse transform repeatedly inserts r as the left column of the table and sorts the table. After the whole table is built, it returns the row that ends with ETX, minus the STX and ETX.

def inverse_bwt(r: str) -> str:
    """Apply inverse Burrows–Wheeler transform."""
    table = [""] * len(r)  # Make empty table
    for _ in range(len(r)):
        table = sorted(r[i] + table[i] for i in range(len(r)))  # Add a column of r
    s = next((row for row in table if row.endswith("\003")), "")  # Find the row that ends with ETX
    return s.rstrip("\003").strip("\002")  # Strip the start and end markers

Following implementation notes from Manzini, it is equivalent to use a simple null character suffix instead. The sorting should be done in colexicographic order (strings read right-to-left), i.e. sorted(..., key=lambda s: s[::-1]) in Python.[4] (The control codes above do not actually satisfy the condition that EOF be the last character in sort order; both codes sort first instead. The rotation-based transform and its inverse work nevertheless.)

BWT applications

As a lossless transform, the Burrows–Wheeler transform has the important property that its encoding is reversible, so the original data can be recovered exactly. This property has led to its use in algorithms with a range of purposes, including sequence alignment, image compression, and data compression. The following is a compilation of some uses of the Burrows–Wheeler transform.

BWT for sequence alignment

The advent of next-generation sequencing (NGS) techniques at the end of the 2000s decade has led to another application of the Burrows–Wheeler transformation. In NGS, DNA is fragmented into small pieces, of which the first few bases are sequenced, yielding several millions of "reads", each 30 to 500 base pairs ("DNA characters") long. In many experiments, e.g., in ChIP-Seq, the task is now to align these reads to a reference genome, i.e., to the known, nearly complete sequence of the organism in question (which may be up to several billion base pairs long). A number of alignment programs, specialized for this task, were published, which initially relied on hashing (e.g., Eland, SOAP,[10] or Maq[11]). In an effort to reduce the memory requirement for sequence alignment, several alignment programs were developed (Bowtie,[12] BWA,[13] and SOAP2[14]) that use the Burrows–Wheeler transform.

BWT for image compression

The Burrows–Wheeler transformation has proved to be fundamental for image compression applications. For example, Collin et al.[15] presented a compression pipeline that applies the Burrows–Wheeler transformation followed by inversion, run-length, and arithmetic encoders. This pipeline is known as the Burrows–Wheeler transform with an inversion encoder (BWIC). BWIC is reported to outperform well-known and widely used algorithms such as Lossless JPEG and JPEG 2000, reducing the final compressed size of radiography medical images on the order of 5.1% and 4.1% respectively. The improvements are achieved by combining BWIC with a pre-BWIC scan of the image in a vertical snake order. More recently, works such as that of Devadoss and Sankaragomathi[16] have combined the Burrows–Wheeler transform with the move-to-front transform (MTF) to achieve near-lossless compression of images.

BWT for compression of genomic databases

Cox et al.[17] presented a genomic compression scheme that uses the BWT as the first stage of compression for several genomic datasets, including human genomic data. Their work proposed that BWT compression could be enhanced by a second-stage compression mechanism called same-as-previous encoding ("SAP"), which makes use of the fact that suffixes of two or more prefix letters could be equal. Cox et al. showed that BWT-SAP compresses the genomic database ERA015743, 135.5 GB in size, by around 94%, to 8.2 GB.

BWT for sequence prediction

The BWT has also proved useful for sequence prediction, a common area of study in machine learning and natural-language processing. In particular, Ktistakis et al.[18] proposed a sequence prediction scheme called SuBSeq that exploits the lossless compression of data provided by the Burrows–Wheeler transform. SuBSeq extracts the FM-index from the BWT and then performs a series of operations called backwardSearch, forwardSearch, neighbourExpansion, and getConsequents to search for predictions given a suffix. The predictions are then ranked by weight and put into an array, from which the element with the highest weight is returned as the prediction of the SuBSeq algorithm. SuBSeq has been shown to outperform state-of-the-art algorithms for sequence prediction in terms of both training time and accuracy.
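For orientation, the sketch below shows the generic backward-search idea on which FM-index queries are built: the pattern is processed right to left, narrowing a range of rows in the sorted table until it covers exactly the suffixes that start with the pattern. This is not SuBSeq's implementation; the function names are hypothetical, the "$" end marker is assumed to be absent from the text and to sort first, and occurrence counts are recomputed by scanning rather than read from the precomputed tables a real FM-index would store.

```python
from collections import Counter


def build_bwt(text: str) -> str:
    """BWT of text + '$', via a naively sorted suffix array (illustration only)."""
    t = text + "$"
    sa = sorted(range(len(t)), key=lambda i: t[i:])
    return "".join(t[i - 1] for i in sa)  # t[-1] is '$' when i == 0


def count_occurrences(bwt: str, pattern: str) -> int:
    """Count occurrences of pattern in the original text using only its BWT."""
    # first_row[c]: number of characters in bwt that are strictly smaller than c
    counts = Counter(bwt)
    first_row, total = {}, 0
    for ch in sorted(counts):
        first_row[ch] = total
        total += counts[ch]
    lo, hi = 0, len(bwt)  # half-open range of rows that match the suffix read so far
    for ch in reversed(pattern):
        if ch not in first_row:
            return 0
        lo = first_row[ch] + bwt[:lo].count(ch)
        hi = first_row[ch] + bwt[:hi].count(ch)
        if lo >= hi:
            return 0
    return hi - lo


bwt = build_bwt("banana")
print(bwt)                            # annb$aa
print(count_occurrences(bwt, "ana"))  # 2
print(count_occurrences(bwt, "nab"))  # 0
```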

References

  1. Burrows, Michael; Wheeler, David J. (May 10, 1994), A block sorting lossless data compression algorithm, Technical Report 124, Digital Equipment Corporation, archived from the original on January 5, 2003
  2. "adrien-mogenet/scala-bwt". GitHub. Retrieved 19 April 2018.
  3. Simpson, Jared T.; Durbin, Richard (2010-06-15). "Efficient construction of an assembly string graph using the FM-index". Bioinformatics. 26 (12): i367–i373. doi:10.1093/bioinformatics/btq217. ISSN 1367-4803. PMC 2881401. PMID 20529929.
  4. Manzini, Giovanni (1999-08-18). "The Burrows–Wheeler Transform: Theory and Practice" (PDF). Mathematical Foundations of Computer Science 1999: 24th International Symposium, MFCS'99 Szklarska Poreba, Poland, September 6-10, 1999 Proceedings. Springer Science & Business Media. ISBN 9783540664086. Archived (PDF) from the original on 2022-10-09.
  5. Gil, J.; Scott, D. A. (2009), A Bijective String-Sorting Transform (PDF), archived from the original (PDF) on 2011-10-08, retrieved 2009-07-09
  6. Kufleitner, Manfred (2009), "On bijective variants of the Burrows–Wheeler transform", in Holub, Jan; Žďárek, Jan (eds.), Prague Stringology Conference, pp. 65–69, arXiv:0908.0239, Bibcode:2009arXiv0908.0239K.
  7. Lothaire, M. (1997), Combinatorics on words, Encyclopedia of Mathematics and Its Applications, vol. 17, Perrin, D.; Reutenauer, C.; Berstel, J.; Pin, J. E.; Pirillo, G.; Foata, D.; Sakarovitch, J.; Simon, I.; Schützenberger, M. P.; Choffrut, C.; Cori, R.; Lyndon, Roger; Rota, Gian-Carlo. Foreword by Roger Lyndon (2nd ed.), Cambridge University Press, p. 67, ISBN 978-0-521-59924-5, Zbl 0874.20040
  8. Duval, Jean-Pierre (1983), "Factorizing words over an ordered alphabet", Journal of Algorithms, 4 (4): 363–381, doi:10.1016/0196-6774(83)90017-2, ISSN 0196-6774, Zbl 0532.68061.
  9. Salson M, Lecroq T, Léonard M, Mouchard L (2009). "A Four-Stage Algorithm for Updating a Burrows–Wheeler Transform". Theoretical Computer Science. 410 (43): 4350–4359. doi:10.1016/j.tcs.2009.07.016.
  10. Li R; et al. (2008). "SOAP: short oligonucleotide alignment program". Bioinformatics. 24 (5): 713–714. doi:10.1093/bioinformatics/btn025. PMID 18227114.
  11. Li H, Ruan J, Durbin R (2008-08-19). "Mapping short DNA sequencing reads and calling variants using mapping quality scores". Genome Research. 18 (11): 1851–1858. doi:10.1101/gr.078212.108. PMC 2577856. PMID 18714091.
  12. Langmead B, Trapnell C, Pop M, Salzberg SL (2009). "Ultrafast and memory-efficient alignment of short DNA sequences to the human genome". Genome Biology. 10 (3): R25. doi:10.1186/gb-2009-10-3-r25. PMC 2690996. PMID 19261174.
  13. Li H, Durbin R (2009). "Fast and accurate short read alignment with Burrows–Wheeler Transform". Bioinformatics. 25 (14): 1754–1760. doi:10.1093/bioinformatics/btp324. PMC 2705234. PMID 19451168.
  14. Li R; et al. (2009). "SOAP2: an improved ultrafast tool for short read alignment". Bioinformatics. 25 (15): 1966–1967. doi:10.1093/bioinformatics/btp336. PMID 19497933.
  15. Collin P, Arnavut Z, Koc B (2015). "Lossless compression of medical images using Burrows–Wheeler Transformation with Inversion Coder". 2015 37th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC). Vol. 2015. pp. 2956–2959. doi:10.1109/EMBC.2015.7319012. ISBN 978-1-4244-9271-8. PMID 26736912. S2CID 4460328.
  16. Devadoss CP, Sankaragomathi B (2019). "Near lossless medical image compression using block BWT–MTF and hybrid fractal compression techniques". Cluster Computing. 22: 12929–12937. doi:10.1007/s10586-018-1801-3. S2CID 33687086.
  17. Cox AJ, Bauer MJ, Jakobi T, Rosone G (2012). "Large-scale compression of genomic sequence databases with the Burrows–Wheeler transform". Bioinformatics. 28 (11). Oxford University Press: 1415–1419. arXiv:1205.0192. doi:10.1093/bioinformatics/bts173. PMID 22556365.
  18. Ktistakis R, Fournier-Viger P, Puglisi SJ, Raman R (2019). "Succinct BWT-Based Sequence Prediction". Database and Expert Systems Applications. Lecture Notes in Computer Science. Vol. 11707. pp. 91–101. doi:10.1007/978-3-030-27618-8_7. ISBN 978-3-030-27617-1. S2CID 201058996.

External links

  • Article by Mark Nelson on the BWT, archived 2017-03-25 at the Wayback Machine
  • A Bijective String-Sorting Transform, by Gil and Scott, archived 2011-10-08 at the Wayback Machine
  • On Bijective Variants of the Burrows–Wheeler Transform, by Kufleitner
  • Blog post and project page for an open-source compression program and library based on the Burrows–Wheeler algorithm
  • MIT open courseware lecture on BWT (Foundations of Computational and Systems Biology)
  • League Table Sort (LTS) or The Weighting algorithm to BWT by Abderrahim Hechachena

burrows, wheeler, transform, also, called, block, sorting, compression, rearranges, character, string, into, runs, similar, characters, this, useful, compression, since, tends, easy, compress, string, that, runs, repeated, characters, techniques, such, move, f. The Burrows Wheeler transform BWT also called block sorting compression rearranges a character string into runs of similar characters This is useful for compression since it tends to be easy to compress a string that has runs of repeated characters by techniques such as move to front transform and run length encoding More importantly the transformation is reversible without needing to store any additional data except the position of the first original character The BWT is thus a free method of improving the efficiency of text compression algorithms costing only some extra computation The Burrows Wheeler transform is an algorithm used to prepare data for use with data compression techniques such as bzip2 It was invented by Michael Burrows and David Wheeler in 1994 while Burrows was working at DEC Systems Research Center in Palo Alto California It is based on a previously unpublished transformation discovered by Wheeler in 1983 The algorithm can be implemented efficiently using a suffix array thus reaching linear time complexity 1 Burrows Wheeler transformClasspreprocessing for lossless compressionData structurestringWorst case performanceO n Worst case space complexityO n Contents 1 Description 2 Example 3 Explanation 4 Optimization 5 Bijective variant 6 Dynamic Burrows Wheeler transform 7 Sample implementation 8 BWT applications 8 1 BWT for sequence alignment 8 2 BWT for image compression 8 3 BWT for compression of genomic databases 8 4 BWT for sequence prediction 9 References 10 External linksDescription editWhen a character string is transformed by the BWT the transformation permutes the order of the characters If the original string had several substrings that occurred often then the transformed string 
will have several places where a single character is repeated multiple times in a row For example Input SIX MIXED PIXIES SIFT SIXTY PIXIE DUST BOXESOutput TEXYDST E IXIXIXXSSMPPS B E S EUSFXDIIOIIIT 2 The output is easier to compress because it has many repeated characters In this example the transformed string contains six runs of identical characters XX SS PP II and III which together make 13 out of the 44 characters Example editThe transform is done by sorting all the circular shifts of a text in lexicographic order and by extracting the last column and the index of the original string in the set of sorted permutations of S Given an input string S span style color red span BANANA span style color red span step 1 in the table below rotate it N times step 2 where N 8 is the length of the S string considering also the red span style color red span character representing the start of the string and the red span style color red span character representing the EOF pointer these rotations or circular shifts are then sorted lexicographically step 3 The output of the encoding phase is the last column L BNN span style color red span AA span style color red span A after step 3 and the index 0 based I of the row containing the original string S in this case I 6 It is not necessary to use both span style color red span and span style color red span but at least one must be used else we cannot invert the transform since all circular permutations of a string have the same Burrows Wheeler transform Transformation1 Input 2 Allrotations 3 Sort intolexical order 4 Take thelast column 5 Output BANANA BANANA BANANA A BANAN NA BANA ANA BAN NANA BA ANANA B BANANA ANANA B ANA BAN A BANAN BANANA NANA BA NA BANA BANANA BANANA ANANA B ANA BA N A BANA N BANANA NANA B A NA BAN A BANANA BANAN A BNN AA AThe following pseudocode gives a simple though inefficient way to calculate the BWT and its inverse It assumes that the input string s contains a special character EOF which is the last 
character and occurs nowhere else in the text function BWT string s create a table where the rows are all possible rotations of s sort rows alphabetically return last column of the table function inverseBWT string s create empty table repeat length s times first insert creates first column insert s as a column of table before first column of the table sort rows of the table alphabetically return row that ends with the EOF character Explanation editTo understand why this creates more easily compressible data consider transforming a long English text frequently containing the word the Sorting the rotations of this text will group rotations starting with he together and the last character of that rotation which is also the character before the he will usually be t so the result of the transform would contain a number of t characters along with the perhaps less common exceptions such as if it contains ache mixed in So it can be seen that the success of this transform depends upon one value having a high probability of occurring before a sequence so that in general it needs fairly long samples a few kilobytes at least of appropriate data such as text The remarkable thing about the BWT is not that it generates a more easily encoded output an ordinary sort would do that but that it does this reversibly allowing the original document to be re generated from the last column data The inverse can be understood this way Take the final table in the BWT algorithm and erase all but the last column Given only this information you can easily reconstruct the first column The last column tells you all the characters in the text so just sort these characters alphabetically to get the first column Then the last and first columns of each row together give you all pairs of successive characters in the document where pairs are taken cyclically so that the last and first character form a pair Sorting the list of pairs gives the first and second columns Continuing in this manner you can 
reconstruct the entire list Then the row with the end of file character at the end is the original text Reversing the example above is done like this Inverse transformationInputBNN AA AAdd 1 Sort 1 Add 2 Sort 2B N N A A A A A A B N N BA NA NA B AN AN A AN AN A BA NA NA B Add 3 Sort 3 Add 4 Sort 4BAN NAN NA BA ANA ANA B A ANA ANA A BAN NAN NA BA B BANA NANA NA BAN ANAN ANA BA A B ANAN ANA A B BANA NANA NA BAN BAAdd 5 Sort 5 Add 6 Sort 6BANAN NANA NA B BANA ANANA ANA BAN A BA ANANA ANA A BA BANAN NANA NA B BANA BAN BANANA NANA NA BA BANAN ANANA ANA B BANA A BAN ANANA ANA B A BAN BANANA NANA NA BA BANAN BANAAdd 7 Sort 7 Add 8 Sort 8BANANA NANA B NA BAN BANANA ANANA ANA BA BANAN A BANA ANANA ANA BA A BANA BANANA NANA B NA BAN BANANA BANAN BANANA NANA BA NA BANA BANANA ANANA B ANA BAN BANANA A BANAN ANANA B ANA BAN A BANAN BANANA NANA BA NA BANA BANANA BANANAOutput BANANA Optimization editA number of optimizations can make these algorithms run more efficiently without changing the output There is no need to represent the table in either the encoder or decoder In the encoder each row of the table can be represented by a single pointer into the strings and the sort performed using the indices In the decoder there is also no need to store the table and in fact no sort is needed at all In time proportional to the alphabet size and string length the decoded string may be generated one character at a time from right to left A character in the algorithm can be a byte or a bit or any other convenient size One may also make the observation that mathematically the encoded string can be computed as a simple modification of the suffix array and suffix arrays can be computed with linear time and memory The BWT can be defined with regards to the suffix array SA of text T as 1 based indexing BWT i T SA i 1 if SA i gt 0 otherwise displaystyle BWT i begin cases T SA i 1 amp text if SA i gt 0 amp text otherwise end cases nbsp 3 There is no need to have an actual EOF character Instead a 
pointer can be used that remembers where in a string the EOF would be if it existed In this approach the output of the BWT must include both the transformed string and the final value of the pointer The inverse transform then shrinks it back down to the original size it is given a string and a pointer and returns just a string A complete description of the algorithms can be found in Burrows and Wheeler s paper or in a number of online sources 1 The algorithms vary somewhat by whether EOF is used and in which direction the sorting was done In fact the original formulation did not use an EOF marker 4 Bijective variant editSince any rotation of the input string will lead to the same transformed string the BWT cannot be inverted without adding an EOF marker to the end of the input or doing something equivalent making it possible to distinguish the input string from all its rotations Increasing the size of the alphabet by appending the EOF character makes later compression steps awkward There is a bijective version of the transform by which the transformed string uniquely identifies the original and the two have the same length and contain exactly the same characters just in a different order 5 6 The bijective transform is computed by factoring the input into a non increasing sequence of Lyndon words such a factorization exists and is unique by the Chen Fox Lyndon theorem 7 and may be found in linear time and constant space 8 The algorithm sorts the rotations of all the words as in the Burrows Wheeler transform this produces a sorted sequence of n strings The transformed string is then obtained by picking the final character of each string in this sorted list The one important caveat here is that strings of different lengths are not ordered in the usual way the two strings are repeated forever and the infinite repeats are sorted For example ORO precedes OR because OROORO precedes OROROR For example the text BANANA is transformed into ANNBAA through these steps the red 
| character indicates the EOF pointer in the original string. The EOF character is unneeded in the bijective transform, so it is dropped during the transform and re-added to its proper place in the file.

The string is broken into Lyndon words so that the words in the sequence are decreasing under the comparison method above. (Note that '^' is sorted as succeeding the other characters.) "^BANANA" becomes (^) (B) (AN) (AN) (A).

Bijective transformation

Input: ^BANANA|

All rotations        Sorted alphabetically   Last character of rotated Lyndon word
^^^^^^^^... (^)      AAAAAAAA... (A)         A
BBBBBBBB... (B)      ANANANAN... (AN)        N
ANANANAN... (AN)     ANANANAN... (AN)        N
NANANANA... (NA)     BBBBBBBB... (B)         B
ANANANAN... (AN)     NANANANA... (NA)        A
NANANANA... (NA)     NANANANA... (NA)        A
AAAAAAAA... (A)      ^^^^^^^^... (^)         ^

Output: ANNBAA^|

Inverse bijective transform

Input: ANNBAA^

Add 1   Sort 1   Add 2   Sort 2
A       A        AA      AA
N       A        NA      AN
N       A        NA      AN
B       B        BB      BB
A       N        AN      NA
A       N        AN      NA
^       ^        ^^      ^^

Add 3   Sort 3   Add 4   Sort 4
AAA     AAA      AAAA    AAAA
NAN     ANA      NANA    ANAN
NAN     ANA      NANA    ANAN
BBB     BBB      BBBB    BBBB
ANA     NAN      ANAN    NANA
ANA     NAN      ANAN    NANA
^^^     ^^^      ^^^^    ^^^^

Output: ^BANANA

Up until the last step, the process is identical to the inverse Burrows–Wheeler process, but here it will not necessarily give rotations of a single sequence; it instead gives rotations of Lyndon words (which will start to repeat as the process is continued). Here, we can see (repetitions of) four distinct Lyndon words: (A), (AN) (twice), (B), and (^). (NANA... doesn't represent a distinct word, as it is a cycle of ANAN....) At this point, these words are sorted into reverse order: (^), (B), (AN), (AN), (A). These are then concatenated to get ^BANANA.

The Burrows–Wheeler transform can indeed be viewed as a special case of this bijective transform: instead of the traditional introduction of a new letter from outside our alphabet to denote the end of the string, we can introduce a new letter that compares as preceding all existing letters and is put at the beginning of the string. The whole string is now a Lyndon word, and running it through the bijective process will therefore give a transformed result that, when inverted, gives back the Lyndon
word, with no need for reassembling at the end.

Relatedly, the transformed text will only differ from the result of BWT by one character per Lyndon word; for example, if the input is decomposed into six Lyndon words, the output will only differ in six characters. For example, applying the bijective transform gives:

Input: SIX.MIXED.PIXIES.SIFT.SIXTY.PIXIE.DUST.BOXES
Lyndon words: (S) (IX) (.MIXED.PIXIES.SIFT.SIXTY.PIXIE) (.DUST) (.BOXES)
Output: STEYDST.E.IXXIIXXSMPPXS.B..EE..SUSFXDIOIIIIT

The bijective transform includes eight runs of identical characters. These runs are, in order: XX, II, XX, PP, .., EE, .., and IIII. In total, 18 characters are used in these runs.

Dynamic Burrows–Wheeler transform

When a text is edited, its Burrows–Wheeler transform will change. Salson et al.[9] propose an algorithm that deduces the Burrows–Wheeler transform of an edited text from that of the original text, doing a limited number of local reorderings in the original Burrows–Wheeler transform, which can be faster than constructing the Burrows–Wheeler transform of the edited text directly.

Sample implementation

This Python implementation sacrifices speed for simplicity: the program is short, but takes more than the linear time that would be desired in a practical implementation. It essentially does what the pseudocode section does.

Using the STX/ETX control codes to mark the start and end of the text, and using s[i:] + s[:i] to construct the ith rotation of s, the forward transform takes the last character of each of the sorted rows:

    def bwt(s: str) -> str:
        """Apply Burrows–Wheeler transform to input string."""
        assert "\002" not in s and "\003" not in s, "Input string cannot contain STX and ETX characters"
        s = "\002" + s + "\003"  # Add start and end of text marker
        table = sorted(s[i:] + s[:i] for i in range(len(s)))  # Table of rotations of string
        last_column = [row[-1:] for row in table]  # Last characters of each row
        return "".join(last_column)  # Convert list of characters into string

The inverse transform repeatedly inserts r as the left column of the table and sorts the table. After the whole table is built, it returns the row that ends with ETX, minus the STX and ETX.

    def inverse_bwt(r: str) -> str:
        """Apply inverse Burrows–Wheeler transform."""
        table = [""] * len(r)  # Make empty table
        for _ in range(len(r)):
            table = sorted(r[i] + table[i] for i in range(len(r)))  # Add a column of r
        s = next((row for row in table if row.endswith("\003")), "")  # Find the row that ends with ETX
        return s.rstrip("\003").strip("\002")  # Get rid of start and end markers

Following implementation notes from Manzini, it is equivalent to use a simple null character suffix instead. The sorting should be done in colexicographic order (string read right-to-left), i.e. sorted(..., key=lambda s: s[::-1]) in Python.[4] (The above control codes actually fail to satisfy EOF being the last character; the two codes are actually the first. The rotation holds nevertheless.)

BWT applications

As a lossless compression algorithm, the Burrows–Wheeler transform offers the important quality that its encoding is reversible, and hence the original data may be recovered from the resulting compression. This lossless quality has led to a variety of algorithms built on the transform for different purposes. To name a few, the Burrows–Wheeler transform is used in algorithms for sequence alignment, image compression, and data compression. The following is a compilation of some uses given to the Burrows–Wheeler transform.

BWT for sequence alignment

The advent of next-generation sequencing (NGS) techniques at the end of the 2000s decade has led to another application of the Burrows–Wheeler transformation. In NGS, DNA is fragmented into small pieces, of which the first few bases are sequenced, yielding several millions of "reads", each 30 to 500 base pairs ("DNA characters") long. In many experiments, e.g., in ChIP-Seq, the task is now to align these reads to a reference genome, i.e., to the known, nearly
complete sequence of the organism in question (which may be up to several billion base pairs long). A number of alignment programs specialized for this task were published, which initially relied on hashing (e.g., Eland, SOAP,[10] or Maq[11]). In an effort to reduce the memory requirement for sequence alignment, several alignment programs were developed (Bowtie,[12] BWA,[13] and SOAP2[14]) that use the Burrows–Wheeler transform.

BWT for image compression

The Burrows–Wheeler transformation has proved to be fundamental for image compression applications. For example, Collin et al.[15] showed a compression pipeline based on the application of the Burrows–Wheeler transformation followed by inversion, run-length, and arithmetic encoders. The pipeline developed in this case is known as Burrows–Wheeler transform with an inversion encoder (BWIC). BWIC is reported to outperform well-known and widely used algorithms such as Lossless JPEG and JPEG 2000 in final compressed size of radiography medical images, on the order of 5.1% and 4.1% respectively. The improvements are achieved by combining BWIC with a pre-BWIC scan of the image in a vertical snake order fashion. More recently, additional works such as that of Devadoss and Sankaragomathi[16] have shown that the Burrows–Wheeler transform in conjunction with the known move-to-front transform (MTF) achieves near-lossless compression of images.

BWT for compression of genomic databases

Cox et al.[17] presented a genomic compression scheme that uses BWT as the algorithm applied during the first stage of compression of several genomic datasets, including the human genomic information. Their work proposed that BWT compression could be enhanced by including a second-stage compression mechanism called same-as-previous encoding ("SAP"), which makes use of the fact that suffixes of two or more prefix letters could be equal. With the compression mechanism BWT-SAP, Cox et al. showed that in the genomic database
ERA015743, 135.5 GB in size, the compression scheme BWT-SAP compresses the ERA015743 dataset by around 94%, to 8.2 GB.

BWT for sequence prediction

BWT has also been proved to be useful for sequence prediction, which is a common area of study in machine learning and natural-language processing. In particular, Ktistakis et al.[18] proposed a sequence prediction scheme called SuBSeq that exploits the lossless compression of data of the Burrows–Wheeler transform. SuBSeq exploits BWT by extracting the FM-index and then performing a series of operations called backwardSearch, forwardSearch, neighbourExpansion, and getConsequents in order to search for predictions given a suffix. The predictions are then classified based on a weight and put into an array, from which the element with the highest weight is given as the prediction from the SuBSeq algorithm. SuBSeq has been shown to outperform state-of-the-art algorithms for sequence prediction both in terms of training time and accuracy.

References

1. Burrows, Michael; Wheeler, David J. (May 10, 1994), A block sorting lossless data compression algorithm, Technical Report 124, Digital Equipment Corporation, archived from the original on January 5, 2003.
2. "adrien-mogenet/scala-bwt". GitHub. Retrieved 19 April 2018.
3. Simpson, Jared T.; Durbin, Richard (2010-06-15). "Efficient construction of an assembly string graph using the FM-index". Bioinformatics. 26 (12): i367–i373. doi:10.1093/bioinformatics/btq217. ISSN 1367-4803. PMC 2881401. PMID 20529929.
4. Manzini, Giovanni (1999-08-18). "The Burrows-Wheeler Transform: Theory and Practice" (PDF). Mathematical Foundations of Computer Science 1999: 24th International Symposium, MFCS'99, Szklarska Poreba, Poland, September 6–10, 1999, Proceedings. Springer Science & Business Media. ISBN 9783540664086. Archived (PDF) from the original on 2022-10-09.
5. Gil, J.; Scott, D. A. (2009), A bijective string sorting transform (PDF), archived from the original (PDF) on 2011-10-08, retrieved 2009-07-09.
6. Kufleitner, Manfred (2009), "On bijective variants of the Burrows–Wheeler transform", in Holub, Jan; Zdarek, Jan (eds.), Prague Stringology Conference, pp. 65–69, arXiv:0908.0239, Bibcode:2009arXiv0908.0239K.
7. Lothaire, M. (1997), Combinatorics on words, Encyclopedia of Mathematics and Its Applications, vol. 17, Perrin, D.; Reutenauer, C.; Berstel, J.; Pin, J. E.; Pirillo, G.; Foata, D.; Sakarovitch, J.; Simon, I.; Schutzenberger, M. P.; Choffrut, C.; Cori, R.; Lyndon, Roger; Rota, Gian-Carlo. Foreword by Roger Lyndon (2nd ed.), Cambridge University Press, p. 67, ISBN 978-0-521-59924-5, Zbl 0874.20040.
8. Duval, Jean-Pierre (1983), "Factorizing words over an ordered alphabet", Journal of Algorithms, 4 (4): 363–381, doi:10.1016/0196-6774(83)90017-2, ISSN 0196-6774, Zbl 0532.68061.
9. Salson, M.; Lecroq, T.; Leonard, M.; Mouchard, L. (2009). "A Four-Stage Algorithm for Updating a Burrows–Wheeler Transform". Theoretical Computer Science. 410 (43): 4350–4359. doi:10.1016/j.tcs.2009.07.016.
10. Li, R.; et al. (2008). "SOAP: short oligonucleotide alignment program". Bioinformatics. 24 (5): 713–714. doi:10.1093/bioinformatics/btn025. PMID 18227114.
11. Li, H.; Ruan, J.; Durbin, R. (2008-08-19). "Mapping short DNA sequencing reads and calling variants using mapping quality scores". Genome Research. 18 (11): 1851–1858. doi:10.1101/gr.078212.108. PMC 2577856. PMID 18714091.
12. Langmead, B.; Trapnell, C.; Pop, M.; Salzberg, S. L. (2009). "Ultrafast and memory-efficient alignment of short DNA sequences to the human genome". Genome Biology. 10 (3): R25. doi:10.1186/gb-2009-10-3-r25. PMC 2690996. PMID 19261174.
13. Li, H.; Durbin, R. (2009). "Fast and accurate short read alignment with Burrows–Wheeler Transform". Bioinformatics. 25 (14): 1754–1760. doi:10.1093/bioinformatics/btp324. PMC 2705234. PMID 19451168.
14. Li, R.; et al. (2009). "SOAP2: an improved ultrafast tool for short read alignment". Bioinformatics. 25 (15): 1966–1967. doi:10.1093/bioinformatics/btp336. PMID 19497933.
15. Collin, P.; Arnavut, Z.; Koc, B. (2015). "Lossless compression of medical images using Burrows–Wheeler Transformation with Inversion Coder". 2015 37th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC). Vol. 2015. pp. 2956–2959. doi:10.1109/EMBC.2015.7319012. ISBN 978-1-4244-9271-8. PMID 26736912. S2CID 4460328.
16. Devadoss, C. P.; Sankaragomathi, B. (2019). "Near lossless medical image compression using block BWT–MTF and hybrid fractal compression techniques". Cluster Computing. 22: 12929–12937. doi:10.1007/s10586-018-1801-3. S2CID 33687086.
17. Cox, A. J.; Bauer, M. J.; Jakobi, T.; Rosone, G. (2012). "Large-scale compression of genomic sequence databases with the Burrows–Wheeler transform". Bioinformatics. 28 (11). Oxford University Press: 1415–1419. arXiv:1205.0192. doi:10.1093/bioinformatics/bts173. PMID 22556365.
18. Ktistakis, R.; Fournier-Viger, P.; Puglisi, S. J.; Raman, R. (2019). "Succinct BWT-Based Sequence Prediction". Database and Expert Systems Applications. Lecture Notes in Computer Science. Vol. 11707. pp. 91–101. doi:10.1007/978-3-030-27618-8_7. ISBN 978-3-030-27617-1. S2CID 201058996.

External links

Article by Mark Nelson on the BWT, archived 2017-03-25 at the Wayback Machine
A Bijective String-Sorting Transform, by Gil and Scott, archived 2011-10-08 at the Wayback Machine
Yuta's openbwt-v1.5.zip contains source code for various BWT routines, including BWTS for the bijective version
On Bijective Variants of the Burrows–Wheeler Transform, by Kufleitner
Blog post and project page for an open-source compression program and library based on the Burrows–Wheeler algorithm
MIT open courseware lecture on BWT (Foundations of Computational and Systems Biology)
League Table Sort (LTS) or The Weighting algorithm to BWT, by Abderrahim Hechachena

Retrieved from "https://en.wikipedia.org/w/index.php?title=Burrows–Wheeler_transform&oldid=1214062423"
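
The Optimization section states that the encoded string can be read off a suffix array and that the decoder needs neither the table nor a sort of its rows. The following is a minimal sketch of both ideas, not a reference implementation. The function names `bwt_from_suffix_array` and `inverse_bwt_lf` and the `$` end marker are assumptions made for illustration, and the suffix array is built by naive sorting rather than by a linear-time construction.

```python
def bwt_from_suffix_array(text: str, eof: str = "$") -> str:
    """Forward BWT via BWT[i] = T[SA[i] - 1] (0-based), with eof appended."""
    assert eof not in text, "eof marker must not occur in the input"
    t = text + eof
    # Naive suffix array: sort suffix start positions by the suffix itself.
    # This stands in for a linear-time suffix array construction.
    sa = sorted(range(len(t)), key=lambda i: t[i:])
    # Emit the character preceding each suffix; the whole-text suffix
    # (start position 0) is preceded by the end marker.
    return "".join(t[i - 1] if i > 0 else eof for i in sa)


def inverse_bwt_lf(last: str, eof: str = "$") -> str:
    """Invert the BWT without building the table of rotations."""
    n = len(last)
    # A stable sort of positions by character yields the first column and,
    # with it, the map from each row's last character to the row that
    # starts with that same occurrence (the "LF mapping").
    order = sorted(range(n), key=lambda i: last[i])
    lf = [0] * n
    for first_row, last_row in enumerate(order):
        lf[last_row] = first_row
    # Start at the row holding the original text (it ends with eof) and
    # walk backwards, producing the text from right to left.
    row = last.index(eof)
    out = []
    for _ in range(n):
        out.append(last[row])
        row = lf[row]
    return "".join(reversed(out))[:-1]  # drop the trailing eof marker


print(bwt_from_suffix_array("BANANA"))   # ANNB$AA
print(inverse_bwt_lf("ANNB$AA"))         # BANANA
```

The sketch uses a comparison sort in the decoder for brevity; replacing it with a counting sort over the alphabet would give the running time proportional to alphabet size and string length that the text describes.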

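
The forward bijective transform described in the Bijective variant section (Lyndon factorization, rotations of every factor, ordering by infinite repetition, last characters) can be sketched as follows. This is an illustrative sketch under stated assumptions: the names `lyndon_factors` and `bijective_bwt` are hypothetical, the factorization follows Duval's algorithm as cited in the text, and the infinite-repeat comparison is approximated by comparing prefixes of length 2n, which is enough to separate any two distinct periodic strings whose periods are at most n.

```python
def lyndon_factors(s: str) -> list[str]:
    """Factor s into a non-increasing sequence of Lyndon words (Duval)."""
    factors = []
    i, n = 0, len(s)
    while i < n:
        j, k = i + 1, i
        # Extend while the prefix s[i:j] is a power of a Lyndon word,
        # possibly followed by a proper prefix of that word.
        while j < n and s[k] <= s[j]:
            k = i if s[k] < s[j] else k + 1
            j += 1
        # Emit every complete copy of the Lyndon word of length j - k.
        while i <= k:
            factors.append(s[i:i + j - k])
            i += j - k
    return factors


def bijective_bwt(s: str) -> str:
    """Last characters of all factor rotations, sorted by infinite repeat."""
    n = len(s)
    rotations = [w[i:] + w[:i]
                 for w in lyndon_factors(s)
                 for i in range(len(w))]

    def repeat_key(r: str) -> str:
        # Prefix of length 2n of r repeated forever.
        return (r * (2 * n // len(r) + 1))[:2 * n]

    rotations.sort(key=repeat_key)
    return "".join(r[-1] for r in rotations)


print(lyndon_factors("BANANA"))  # ['B', 'AN', 'AN', 'A']
print(bijective_bwt("BANANA"))   # ANNBAA
```

No end marker is added here, so the output has the same length and the same characters as the input; the inverse direction is not covered by this sketch.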