Open Access   Article Go Back

A Survey of Genome Compression Methodology

Rituparna Mitra1 , Subhankar Roy2

Section:Survey Paper, Product Type: Journal Paper
Volume-6 , Issue-8 , Page no. 983-991, Aug-2018

CrossRef-DOI:   https://doi.org/10.26438/ijcse/v6i8.983991

Online published on Aug 31, 2018

Copyright © Rituparna Mitra, Subhankar Roy . This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

View this paper at   Google Scholar | DPI Digital Library

How to Cite this Paper

  • IEEE Citation
  • MLA Citation
  • APA Citation
  • BibTex Citation
  • RIS Citation

IEEE Style Citation: Rituparna Mitra, Subhankar Roy, “A Survey of Genome Compression Methodology,” International Journal of Computer Sciences and Engineering, Vol.6, Issue.8, pp.983-991, 2018.

MLA Style Citation: Rituparna Mitra, Subhankar Roy "A Survey of Genome Compression Methodology." International Journal of Computer Sciences and Engineering 6.8 (2018): 983-991.

APA Style Citation: Rituparna Mitra, Subhankar Roy, (2018). A Survey of Genome Compression Methodology. International Journal of Computer Sciences and Engineering, 6(8), 983-991.

BibTex Style Citation:
@article{Mitra_2018,
author = {Rituparna Mitra, Subhankar Roy},
title = {A Survey of Genome Compression Methodology},
journal = {International Journal of Computer Sciences and Engineering},
issue_date = {8 2018},
volume = {6},
Issue = {8},
month = {8},
year = {2018},
issn = {2347-2693},
pages = {983-991},
url = {https://www.ijcseonline.org/full_paper_view.php?paper_id=2805},
doi = {https://doi.org/10.26438/ijcse/v6i8.983991}
publisher = {IJCSE, Indore, INDIA},
}

RIS Style Citation:
TY - JOUR
DO = {https://doi.org/10.26438/ijcse/v6i8.983991}
UR - https://www.ijcseonline.org/full_paper_view.php?paper_id=2805
TI - A Survey of Genome Compression Methodology
T2 - International Journal of Computer Sciences and Engineering
AU - Rituparna Mitra, Subhankar Roy
PY - 2018
DA - 2018/08/31
PB - IJCSE, Indore, INDIA
SP - 983-991
IS - 8
VL - 6
SN - 2347-2693
ER -

VIEWS PDF XML
565 292 downloads 252 downloads
  
  
           

Abstract

Storing the information about human nucleotide is become essential now a day for various medical research purposes. A human genome consists of almost 3.2 billion nucleotides. It is unmanageable to store, access and retrieve the desired information from the massive bulk of unprocessed data. So the possible solution is genome compression. By compression we mean that we are restricting on the data storage. Existing data compression methodology is not suitable to deal with this massive data. In this paper we provide a survey analysis on various types of genome compression and read compression algorithm which are specially designed to handle this voluminous raw DNA information. To extract the unique non repeated information from the whole sequence is actually a tough challenge .These compression algorithms not only save time but also provide high compression rate. We have discussed all the types of compression algorithm with their distinctive approach. Each of them having some benefit over other. We also briefly discuss on various file formats used while compression.

Key-Words / Index Term

Genome compression, Read compression, Data formats

References

[1] P.Raja Rajeswari (1) Allam Apparo (2), V.K. Kumar, Genbit Compress Tool(GBC): A Java-Based Tool to Compress DNA Sequences and Compute Compression Ratio(bits/base) of Genomes , Acharya Nagarjuna University, India, (2) Jawaharlal Nehru Technological University, India and (3) S.V.H. College Of Engineering, India7 Jun 2010
[2] Jacob Ziv and Abraham Lempel. A universal algorithm for sequential data compression. IEEE Transactions on Information Theory, 23(3):337-343, 1977
[3] Ateet Mehta and Bankim Patel. Dna compression using hash based data structure. International Journal of Information Technology & Knowledge Management, 3:383-386, 2010.
[4] Piyuan Lin, Shaopeng Liu, Lixia Zhang, et al. Compressed pattern matching in dna sequences using multithreaded technology. In 3rd International Conference on Bioinformatics and Biomedical Engineering, ICBBE`09, 2009.
[5] Pothuraju Rajeswari and Allam Apparao. Dnabit compress -genome compression algorithm. Bioinformation, 5(8):350-60, 2011.
[6] Grumbach, S. and Tahi, F. (1993). Compression of DNA sequences. In DCC`93:Proceedings of the Conference on Data Compression, pages 340-350.
[7] Grumbach, S. and Tahi, F. (1994). A new challenge for compression algorithms:Genetic sequences. Information Processing and Management, 30(6), 875-886
[8] Rivals, E., Delahaye, J., Dauchet, M., and Delgrange, O. (1996). A guaranteed compression scheme for repetitive DNA sequences. In DCC `96: Proceedings of the Conference on Data Compression, page 453.
[9] Chen, X., Kwong, S., and Li, M. (2000). A compression algorithm for DNA sequences and its applications in genome comparison. In RECOMB`00: Proceedings of the 4th Annual International Conference on Computational Molecular Biology, pages 107-117.
[10] Chen, X., Li, M., Ma, B., and Tromp, J. (2002). DNACompress: fast and effective DNA sequence compression. Bioinformatics, 18(12), 1696-1698.
[11] T. Matsumoto, K. Sadakane, and H. Imai. Biological sequence compression algorithms. Genome l.Informatics, 11:43–52, 2000.
[12] Kuruppu S, Beresford-Smith B, Conway T, et al. Iterative dictionary construction for compression of large DNA datasets. IEEE-ACM Trans Computat Biol Bioinformatics 2012;9:137-49
[13] Manzini, G., and Rastero, M., 2004, A Simple and Fast DNA Compressor, Software: Practice and Experience, 34(14), 1397–1411.
[14] A. J. T. Lee, C. Chang and C. Chen, "DNAC: An Efficient Compression Algorithm for DNA Sequences," National Taiwan University, Taipei, Taiwan 10617, R.O.C., 2004.
[15] Dimitris Antoniou, Evangelos Theodoridis, and Athanasios Tsakalidis. Compressing biological sequences using selfadjusting data structures. In Information Technology and Applications in Biomedicine, 2010
[16] Kalyan Kumar Kaipa, Ajit S Bopardikar, Srikantha Abhilash, et al. Algorithm for dnasequence compression based on prediction of mismatch bases and repeat location. In Bioinformatics and Biomedicine Workshops, BIBMW, 2010.
[17] Behzadi, B. and Le Fessant, F. (2005). DNA compression challenge revisited: A dynamic programming approach. In CPM`05: Proceedings of the 16th Annual Symposium on Combinatorial Pattern Matching, volume 3537 of LNCS, pages 85-96.
[18] Matsumoto, T., Sadakane, K., and Imai, H. (2000). Biological sequence compression algorithms. Genome Informatics, 11, 43-52.
[19] I. Tabus, G. Korodi, and J. Rissanen, "DNA sequence compression using the normalized maximum likelihood model for discrete regression," in Proc. of the Data Compression Conf. (DCC2003), 2003, 253–262.
[20] Korodi, G. and Tabus, I. (2005). An effcient normalized maximum likelihood algorithm for DNA sequence compression. ACM Transactions on Information Systems, 23(1), 3-34.
[21] Mishra, K. N., Aaggarwal, A., Abdelhadi, E., et al., 2010, An Efficient Horizontal and Vertical Method for Online DNA Sequence Compression, International Journal of Computer Applications, 3(1), 39-46.
[22] David A Huffman. A method for the construction of minimum-redundancy codes. Proceedings of the Institute of Radio Engineers, 40(9):1098-1101, 1952.
[23] D. Loewenstern, and P. N. Yianilos, "Significantly lower entropy estimates for natural DNA sequences," in Proc. of the Data Compression Conf., (DCC `97), 1997, 151–160.
[24] Allison, L., Edgoose, T., and Dix, T. I., 1998, Compression of strings with approximate repeats, Proc. ISMB, 8–16.
[25] M. D. Cao, T. I. Dix, L. Allison, et al., "A Simple Statistical Algorithm for Biological Sequence Compression," in Proc. of the Data Compression Conf., (DCC `07), 2007, 43–52.
[26] Diogo Pratas and Armando J. Pinho. Compressing the human genome using exclusively markov models. In Miguel P. Rocha, Juan M. Corchado Rodrguez, Florentino Fdez-Riverola, and Alfonso Valencia, editors, PACBB, volume 93 of Advances in Intelligent and Soft Computing, pages 213-220. Springer, 2011.
[27] K. R. Venugopal, K. G. Srinivasa, and Lalit Patnaik. Probabilistic Approach for DNA Compression, chapter 14, pages 279-289. Springer, 2009.
[28] I. Tabus and G. Korodi. Genome compression using normalized maximum likelihood models for constrained markov sources. In Information Theory Workshop, 2008.
[29] Kalyan Kumar Kaipa, Kyusang Lee, Taejin Ahn, et al. System for random access dna sequence compression. In International Conference on Bioinformatics and Biomedicine Workshops, 2010.
[30] Marty C. Brandon, Douglas C. Wallace, and Pierre Baldi. Data structures and compression algorithms for genomic sequence data. Bioinformatics, 25(14):1731-1738, 2009.
[31] Golomb S. Run-length encodings. IEEETrans InformTheory 1965;12:399–401.
[32] Elias P. Universal codeword sets and representations of the integers. IEEETrans InformTheory 1975;21:194–203.
[33] Huffman DA. A method for the construction of minimum redundancy codes. Proc IRE 1952;40:1098–101.
[34] Scott Christley, Yiming Lu, Chen Li, et al. Human genomes as email attachments. Bioinfor-matics, 25(2):274-275, 2009.
[35] Congmao Wang and Dabing Zhang. A novel compression tool for efficient storage of genome resequencing data. Nucleic Acids Research, 39(7):e45, 2011.
[36] Shanika Kuruppu, Simon J. Puglisi, and Justin Zobel. Relative lempel-ziv compression of genomes for large-scale storage and retrieval. In Proceedings of the 17th International Conference on String Processing and Information Retrieval, SPIRE`10, pages 201-206, 2010.
[37] Shanika Kuruppu, Simon Puglisi, and Justin Zobel. Optimized relative lempel-ziv compression of genomes. In Australasian Computer Science Conference, 2011.
[38] Szymon Grabowski and Sebastian Deorowicz. Engineering relative compression of genomes. CoRR, abs/1103.2351, 2011.
[39] Armando J. Pinho, Diogo Pratas, and Sara P. Garcia. Green: a tool for efficient compression of genome resequencing data. Nucleic Acids Research, 2011.
[40] Sebastian Kreft and Gonzalo Navarro. Lz77-like compression with fast random access. In Proceedings of the 2010 Conference on Data Compression, DCC`10, pages 239-248, 2010.
[41] Heba Afify, Muhammad Islam, and Manal Abdel Wahed. Dna lossless differential compression algorithm based on similarity of genomic sequence database. CoRR, abs/1109.0094, 2011.
[42] Heba Afify, Muhammad Islam, and Manal Abdel Wahed. Genomic sequences differential compressionmodel. International Journal of Computer Science and Information Technology,3:145-154, 2011.
[43] Deorowicz S, Grabowski S. Robust relative compression of genomes with random access. Bioinformatics 2011;27:2979–86.
[44] Mohammed MH, Dutta A, Bose T, et al. DELIMINATE-afast and efficient method for loss-less compression of genomic sequences. Bioinformatics 2012;28:2527–9.
[45] Pinho AJ, Ferreira PJSG, Neves AJR, et al. On the representability of complete genomes by multiple competing finite-context (Markov) models. PLoS One 2011;6:e21588.
[46] Hunt JJ, Vo K-P, Tichy WF. Delta algorithms: an empirical analysis. ACMTrans Software EngMethodol (TOSEM) 1998;7:192–214.
[47] Pinho AJ, Ferreira PJSG, Neves AJR, et al. On the representability of complete genomes by multiple competing finite-context (Markov) models. PLoS One 2011;6:e21588.
[48] Oscar Herrera and Angel Kuri-Morales. Lossless compression of biological sequences with evolutionary metadictionaries. In Workshop on Machine Learning and Data Mining, 2009.
[49] Giulia Menconi, Vieri Benci, and Marcello Buiatti. Data compression and genomes: a two dimensional life domain map. Journal of Theoretical Biology, 253(2):281-288, 2008.
[50] Zexuan Zhu, Jiarui Zhou, Zhen Ji, et al. Dna sequence compression using adaptive particle swarm optimization-based memetic algorithm. IEEE Transactions on Evolutionary Computation, 15(5):643-658, 2011.
[51] Vishal Bhola, Ajit Bopardikar, Rangavittal Narayanan, et al. No-reference compression of genomic data stored in fastq format. In Proceedings of the 2011 IEEE International Conference on Bioinformatics and Biomedicine, BIBM`11, pages 147-150, 2011.
[52] Raymond Wan, Vo N. Anh, and Kiyoshi Asai. Transformations for the compression of fastq quality scores of next generation sequencing data. Bioinformatics, 2011
[53] Waibhav Tembe, James Lowey, and Edward Suh. G-sqz: compact encoding of genomic sequence and quality data. Bioinformatics, 26(17):2192-2194, 2010.
[54] Sebastian Deorowicz and Szymon Grabowski. Compression of dna sequence reads in fastq format. Bioinformatics, 27(6):860-862, 2011.
[55] Wei-Hsin Chen, Yu-Wen Lu, Feipei Lai, et al. Integrating human genome database into electronic health record with sequence alignment and compression mechanism. Journal of Medical Systems, 36(3):2587-2597, 2011.
[56] Kenny Daily, Paul Rigor, Scott Christley, et al. Data structures and compression algorithms for high-throughput sequencing technologies. BMC Bioinformatics, 11(1):514+, 2010.
[57] Christos Kozanitis, Chris Saunders, Semyon Kruglyak, et al. Compressing genomic sequence fragments using slimgene. In Proceedings of the 14th Annual International Conference on Research in Computational Molecular Biology, RECOMB`10, pages 310-324, 2010.
[58] Fritz MH-Y, Leinonen R, Cochrane G, et al. Efficient storage of high throughput DNA sequencing data using reference based compression. GenomeRes 2011; 21:734–40.
[59] Jones DC, Ruzzo WL, Peng X, et al. Compression of nextgeneration sequencing reads aided by highly efficient denovo assembly. Nucleic Acids Res 2012;40:e171.
[60] Popitsch N, von Haeseler A. NGC: lossless and lossy compression of aligned high-throughput sequencing data. Nucleic Acids Res 2013;41:e27.
[61] Bonfield JK, Mahoney MV. Compression of FASTQ and SAM format sequencing data. PLoS One 2013;8:e59190.
[62] Markus H. Fritz, Rasko Leinonen, Guy Cochrane, et al. Effcient storage of high throughput dna sequencing data using reference-based compression. Genome Research, 21(5):734-740,2011.
[63] Yanovsky V. ReCoil - an algorithm for compression of extremely large datasets of DNA data. Algorithms Mol Biol 2011;6:23.
[64] Cox AJ, Bauer MJ, Jakobi T, et al. Large-scale compression of genomic sequence databases with the Burrows-Wheeler transform. Bioinformatics 2012;28:1415–9.
[65] Hach F, Numanagic I, Alkan C, et al. SCALCE: boosting sequence compression algorithms using locally consistent encoding. Bioinformatics 2012;28:3051–7.
[66] Heba Afify, Muhammad Islam and Manal Abdel Wahed. DNA LOSSLESS DIFFERENTIAL COMPRESSION ALGORITHM BASED ON SIMILARITY OF GENOMIC SEQUENCE DATABASE. International Journal of Computer Science & Information Technology (IJCSIT) Vol 3, No 4, August 2011 .
[67] Bacem Saada, Member, IAENG, Jing Zhang. DNA Sequences Compression Techniques Based on Modified DNABIT Algorithm. Proceedings of the World Congress on Engineering 2016 Vol I WCE 2016, June 29 - July 1, 2016, London, U.K.
[68] Rajesh Mukherjee , Subhrajyoti Mandal , Bijoy Mandal. Reverse Sequencing based Genome Sequence using Lossless Compression Algorithm. International Research Journal of Engineering and Technology (IRJET) Volume: 03 Issue: 05 ,May-2016 .
[69] Rexline S J, Trujilla Lobo F. DNA Compression Algorithm Using Pattern Hunter. International Journal on Computer Science and Engineering (IJCSE).
[70] Peter J. A. Cock,Christopher J. Fields, Naohisa Goto, Michael L. Heuer and Peter M. Rice. The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants. Plant Pathology, SCRI, Invergowrie, Dundee DD2 5DA, UK.