
CRAM (file format)

Compressed Reference-oriented Alignment Map (CRAM) is a compressed columnar file format for storing biological sequences aligned to a reference sequence, initially devised by Markus Hsi-Yang Fritz et al.[1]

CRAM
Filename extension: .cram
Developed by: Markus Hsi-Yang Fritz et al.; Vadim Zalunin; James Bonfield
Type of format: Bioinformatics
Open format? Yes
Website: www.ga4gh.org/cram/, www.ebi.ac.uk/ena/software/cram-toolkit

CRAM was designed to be an efficient reference-based alternative to the Sequence Alignment Map (SAM) and Binary Alignment Map (BAM) file formats. It optionally uses a genomic reference to describe differences between the aligned sequence fragments and the reference sequence, reducing storage costs. Additionally, each column in the SAM format is separated into its own blocks, improving the compression ratio. CRAM files are typically 30 to 60% smaller than the equivalent BAM files, depending on the data held within them.
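The idea of describing a read by its differences from the reference can be illustrated with a toy sketch. This is not the CRAM encoding itself: the function names `encode_read` and `decode_read` are hypothetical, and the sketch assumes a read that aligns without insertions or deletions and lies entirely within the reference.

```python
from typing import List, Tuple

Diff = Tuple[int, str]  # (offset within the read, substituted base)


def encode_read(reference: str, pos: int, read: str) -> List[Diff]:
    """Return the mismatches between a read and the reference.

    Assumes the read starts at 0-based reference position `pos`,
    has no insertions or deletions, and fits inside the reference.
    """
    window = reference[pos:pos + len(read)]
    return [(i, base) for i, (ref_base, base) in enumerate(zip(window, read))
            if ref_base != base]


def decode_read(reference: str, pos: int, length: int, diffs: List[Diff]) -> str:
    """Rebuild the read from the reference plus the stored mismatches."""
    bases = list(reference[pos:pos + length])
    for offset, base in diffs:
        bases[offset] = base
    return "".join(bases)


# Made-up example data: a 12-base read aligned at position 4 with one mismatch.
reference = "ACGTACGTTAGCCGATAGGC"
read = "ACGTTAGACGAT"

diffs = encode_read(reference, 4, read)
restored = decode_read(reference, 4, len(read), diffs)
print(diffs, restored == read)
```

Only the position, the read length and the list of mismatches need to be stored, which is much smaller than the read itself when the read closely matches the reference.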

Implementations of CRAM exist in htsjdk,[2] htslib,[3] JBrowse,[4] and Scramble.[5]

The file format specification is maintained by the Global Alliance for Genomics and Health (GA4GH)[6] with the specification document available from the EBI cram toolkit page.[7]

File format

The basic structure of a CRAM file is a series of containers, the first of which holds a compressed copy of the SAM header. Subsequent containers consist of a container Compression Header followed by a series of slices which in turn hold the alignment records themselves, formatted as a series of blocks.

CRAM file:

Magic number | Container (SAM header) | Container (Data) | ... | Container (Data) | Container (EOF)

Container:

Container Header | Compression Header | Slice | ... | Slice

Slice:

Slice Header | Block | Block | ... | Block
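As a small illustration of the file-level layout, the following sketch checks the magic number at the start of a stream. It assumes the 26-byte file definition described in the CRAM specification (the 4-byte magic `CRAM`, one byte each for the major and minor version, and a 20-byte file identifier); that layout is not stated in this article, and the names `CramFileDefinition` and `read_file_definition` are hypothetical.

```python
import io
import struct
from typing import BinaryIO, NamedTuple


class CramFileDefinition(NamedTuple):
    major: int
    minor: int
    file_id: bytes


def read_file_definition(stream: BinaryIO) -> CramFileDefinition:
    """Read and validate the leading bytes of a CRAM stream.

    Assumed layout: 4-byte magic, 1-byte major version,
    1-byte minor version, 20-byte file identifier.
    """
    raw = stream.read(26)
    if len(raw) != 26:
        raise ValueError("stream too short to hold a CRAM file definition")
    magic, major, minor, file_id = struct.unpack("<4sBB20s", raw)
    if magic != b"CRAM":
        raise ValueError("magic number mismatch: not a CRAM file")
    return CramFileDefinition(major, minor, file_id)


# Made-up in-memory example standing in for the start of a version 3.0 file.
fake_start = b"CRAM" + bytes([3, 0]) + b"example".ljust(20, b"\0")
definition = read_file_definition(io.BytesIO(fake_start))
print(definition.major, definition.minor)
```

Parsing the containers that follow requires the variable-length integer encodings defined in the specification and is left out of this sketch.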

CRAM constructs records from a set of data series, describing the components of an alignment. The container Compression Header specifies which data series is encoded in which block, what codec will be used, and any codec specific meta-data (for example a table of Huffman symbol code lengths). While data series can be mixed together within the same block, keeping them separate usually improves compression and provides the opportunity for efficient selective decoding where only some data types are required.
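The effect of keeping data series in separate blocks can be explored with a toy experiment. The sketch below is not CRAM's codec machinery: it uses zlib as a stand-in compressor, made-up records with three fields (position, flag, mapping quality), and the hypothetical helpers `to_columns` and `from_columns`. It compares row-oriented and column-oriented compression and shows that a single column can be decoded without touching the others.

```python
import random
import zlib
from typing import List, Tuple

Record = Tuple[int, int, int]  # (position, flag, mapping quality), illustrative only


def to_columns(records: List[Record]) -> List[bytes]:
    """Compress each field into its own block (a toy 'data series')."""
    fields = list(zip(*records))
    return [zlib.compress("\n".join(map(str, f)).encode(), 9) for f in fields]


def from_columns(blocks: List[bytes]) -> List[Record]:
    """Decompress every block and zip the fields back into records."""
    fields = [[int(x) for x in zlib.decompress(b).decode().split("\n")]
              for b in blocks]
    return list(zip(*fields))


rng = random.Random(0)  # fixed seed so the made-up data is reproducible
records = [(1000 + 3 * i, rng.choice([83, 99, 147, 163]), rng.randint(0, 60))
           for i in range(2000)]

# Row-oriented: all fields interleaved in a single compressed stream.
row_bytes = "\n".join("\t".join(map(str, r)) for r in records).encode()
row_size = len(zlib.compress(row_bytes, 9))

# Column-oriented: one compressed block per field.
blocks = to_columns(records)
col_size = sum(len(b) for b in blocks)
print("row-oriented bytes:", row_size, "column-oriented bytes:", col_size)

# Selective decoding: only the position block is decompressed here.
positions = [int(x) for x in zlib.decompress(blocks[0]).decode().split("\n")]
print(positions[:3])
```

The measured sizes depend on the data and the compressor, so the printed numbers are only indicative of the general tendency described above.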

Selective access to a CRAM file is provided by the index (with file-name suffix ".crai"). On data sorted by chromosome and position, the index indicates which region is covered by each slice. On unsorted data, the index may be used simply to fetch the Nth container. Selective decoding may also be achieved by using the Compression Header to skip specified data series if only partial records are required.
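A region lookup against the index might be sketched as follows. The sketch assumes the index layout given in the CRAM specification, namely gzip-compressed text with six tab-separated integer columns per slice (reference sequence id, alignment start, alignment span, container byte offset, slice offset within the container, slice size); this layout is not stated in this article. The names `CraiEntry`, `parse_crai` and `slices_overlapping`, and all numbers in the example, are made up for illustration.

```python
import gzip
from typing import List, NamedTuple


class CraiEntry(NamedTuple):
    seq_id: int            # reference sequence id
    start: int             # 1-based alignment start of the slice
    span: int              # number of reference bases covered
    container_offset: int  # byte offset of the container in the file
    slice_offset: int      # byte offset of the slice within the container
    slice_size: int        # slice size in bytes


def parse_crai(data: bytes) -> List[CraiEntry]:
    """Parse gzip-compressed index text into one entry per line."""
    text = gzip.decompress(data).decode("ascii")
    return [CraiEntry(*map(int, line.split("\t")))
            for line in text.splitlines() if line]


def slices_overlapping(entries: List[CraiEntry], seq_id: int,
                       start: int, end: int) -> List[CraiEntry]:
    """Return entries whose range overlaps the 1-based inclusive region."""
    return [e for e in entries
            if e.seq_id == seq_id
            and e.start <= end
            and e.start + e.span - 1 >= start]


# Made-up index with three slices on two reference sequences.
index_text = (
    "0\t1\t10000\t1000\t200\t5000\n"
    "0\t10001\t10000\t6500\t200\t4800\n"
    "1\t1\t8000\t11700\t200\t3000\n"
)
entries = parse_crai(gzip.compress(index_text.encode("ascii")))

hits = slices_overlapping(entries, seq_id=0, start=9500, end=10500)
print([e.container_offset for e in hits])
```

A reader would then seek to each returned container offset and decode only those slices, rather than scanning the whole file.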

History

Year | Version(s) | Notes
2010-11 | pre-CRAM | Initial paper describing the reference based format. This did not use the name CRAM, but called it mzip. This software was implemented in Python as a prototype and demonstration of the basic concepts.[1]
2011-12 | 0.3–0.86 | Vadim Zalunin of the European Bioinformatics Institute (EBI) produced the first implementation named CRAM as a package called CRAMtools,[8] written in the Java programming language.
2012 | 1.0[9] | Implemented in Java CRAMtools.[10]
2013 | | C implementation added to the Scramble[11][5] tool, by James Bonfield of the Wellcome Sanger Institute.
2013 | 2.0 | Changes included support for more than one reference per slice (useful with highly fragmented assemblies), better encoding of SAM auxiliary tags, splitting soft-clip and inserted bases into their own data-series, meta-data to track the number of records and bases per slice, and corrections to the BF (BAM flag) data-series.
2013 | | Added to htslib (0.2.0).
2014 | 2.1[12] | Added EOF blocks, to help identify truncated files.
2014 | | Added to htsjdk (1.127).
2014 | 3.0[13] | Inclusion of lzma and rANS codecs for block compression, along with multiple checksums for ensuring data integrity.
2018 | | JavaScript implementation as part of JBrowse[4] (1.15.0), by Rob Buels.
2021 | | Rust implementation in Noodles.[14]
2023 | 3.1[15] | Officially adopted. (Draft from 2019)

CRAM version 4.0 exists as a prototype in Scramble,[5] initially demonstrated in 2015, but has yet to be adopted as a standard.

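See also

SAM (file format)
Binary Alignment Map
Compression of Genomic Re-Sequencing Data
List of file formats for molecular biology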

References

  1. Hsi-Yang Fritz, Markus; Leinonen, Rasko; Cochrane, Guy; Birney, Ewan (May 2011). "Efficient storage of high throughput DNA sequencing data using reference-based compression". Genome Research. 21 (5): 734–740. doi:10.1101/gr.114819.110. ISSN 1549-5469. PMC 3083090. PMID 21245279.
  2. "Htsjdk by Broad Institute". samtools.github.io. Retrieved 2018-10-14.
  3. "Samtools". www.htslib.org. Retrieved 2018-10-14.
  4. "JBrowse · A fast, embeddable genome browser built with HTML5 and JavaScript". jbrowse.org. Retrieved 2018-10-14.
  5. Bonfield, James K. (2014-06-14). "The Scramble conversion tool". Bioinformatics. 30 (19): 2818–2819. doi:10.1093/bioinformatics/btu390. ISSN 1460-2059. PMC 4173023. PMID 24930138.
  6. "GA4GH". www.ga4gh.org. Retrieved 2018-10-14.
  7. EMBL-EBI. "CRAM toolkit < Software < European Nucleotide Archive < EMBL-EBI". www.ebi.ac.uk. Retrieved 2018-10-14.
  8. "vadimzalunin/crammer". GitHub. 2017-08-08. Retrieved 2018-10-14.
  9. "CRAM 1.0 Specification" (PDF).
  10. "enasequence/cramtools". GitHub. 2018-10-02. Retrieved 2018-10-14.
  11. "jkbonfield/io_lib". GitHub. 2018-10-16. Retrieved 2018-10-14.
  12. "CRAM 2.1 Specification" (PDF).
  13. "CRAM 3.0 Specification" (PDF).
  14. "zaeleus/noodles". GitHub. https://github.com/zaeleus/noodles/
  15. "CRAM 3.1 Specification" (PDF).
