Bioinformatics tools for the sequence complexity estimates

We review current methods and bioinformatics tools for estimating text complexity (information and entropy measures). The search for DNA regions with extreme statistical characteristics, such as low complexity regions, is important for biophysical models of chromosome function and gene transcription regulation at the genome scale. We discuss complexity profiling for the segmentation and delineation of genome sequences, the search for genome repeats and transposable elements, and applications to next-generation sequencing reads. We review complexity methods and new application fields: analysis of mutation hotspot loci, analysis of short sequencing reads with quality control, and alignment-free genome comparison. Algorithms implementing various numerical measures of text complexity, including combinatorial and linguistic measures, were developed before the genome sequencing era. A series of tools estimate sequence complexity using compression approaches, mainly modifications of Lempel–Ziv compression. Most of the tools are available online and provide large-scale services for whole-genome analysis. Novel machine learning applications for the classification of complete genome sequences also include sequence compression and complexity algorithms. We present a comparison of the complexity methods on different sequence sets and their applications to the analysis of gene transcription regulatory regions. Furthermore, we discuss approaches to and applications of sequence complexity for proteins. Complexity measures for amino acid sequences can be calculated by the same entropy and compression-based algorithms, but the functional and evolutionary roles of low complexity regions in proteins have specific features that differ from those in DNA. The tools for protein sequence complexity are aimed at protein structural constraints. It has been shown that low complexity regions in protein sequences are evolutionarily conserved and have important biological and structural functions. Finally, we summarize recent findings in large-scale genome complexity comparison and applications to coronavirus genome analysis. © 2023, International Union for Pure and Applied Biophysics (IUPAB) and Springer-Verlag GmbH Germany, part of Springer Nature.
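
As an illustration of the two families of measures named in the abstract (entropy and compression-based estimates), the following is a minimal sketch and not the implementation of any tool covered by the review. It assumes a plain string over the DNA alphabet, uses Shannon entropy of k-mer frequencies, and uses a simple LZ78-style phrase count as a stand-in for the Lempel–Ziv modifications mentioned above. The function names, window size, and step are hypothetical choices made for this example.

```python
from collections import Counter
from math import log2


def shannon_entropy(seq, k=1):
    """Shannon entropy (in bits) of the k-mer frequency distribution of seq."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    if not kmers:
        return 0.0
    counts = Counter(kmers)
    total = len(kmers)
    return -sum((c / total) * log2(c / total) for c in counts.values())


def lz78_phrase_count(seq):
    """Number of phrases in an LZ78-style parsing of seq.

    Each phrase is the shortest prefix of the remaining text that has not
    been seen as a phrase before. Fewer phrases suggests lower complexity.
    This is an illustrative stand-in, not a specific published modification.
    """
    seen = set()
    phrase = ""
    count = 0
    for ch in seq:
        phrase += ch
        if phrase not in seen:
            seen.add(phrase)
            count += 1
            phrase = ""
    if phrase:
        count += 1  # trailing phrase that repeats an earlier one
    return count


def complexity_profile(seq, window=20, step=5):
    """Sliding-window profile: (start, entropy, phrase count) per window."""
    profile = []
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        profile.append((start, shannon_entropy(chunk), lz78_phrase_count(chunk)))
    return profile


if __name__ == "__main__":
    # Hypothetical toy sequence: a homopolymer run followed by a mixed region.
    toy = "A" * 20 + "ACGT" * 5
    for start, h, c in complexity_profile(toy, window=20, step=20):
        print(start, round(h, 3), c)
```

In such a profile, windows with low entropy or few phrases are candidates for low complexity regions. Real tools typically normalise these values by window length and alphabet size, which this sketch omits.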

Authors
Orlov Y.L. , Orlova N.G.
Number of issue
5
Language
English
Pages
1367-1378
Status
Published
Volume
15
Year
2023
Organizations
  • 1 The Digital Health Institute, I.M. Sechenov First Moscow State Medical University of the Russian Ministry of Health (Sechenov University), Moscow, 119991, Russian Federation
  • 2 Institute of Cytology and Genetics SB RAS, Novosibirsk, 630090, Russian Federation
  • 3 Agrarian and Technological Institute, Peoples’ Friendship University of Russia, Moscow, 117198, Russian Federation
  • 4 Department of Mathematics, Financial University under the Government of the Russian Federation, Moscow, 125167, Russian Federation
Keywords
Alignment-free; Bioinformatics; Entropy; Genetic codes; Genome comparison; Genomic rearrangement; Lempel–Ziv compression; Low complexity regions; Online tools; Sequence information; Sequencing artefacts; Text complexity
