Bioinformatics tools for the sequence complexity estimates

Orlov, Y.L.; Orlova, N.G.

We review current methods and bioinformatics tools for estimating text complexity (information and entropy measures). The search for DNA regions with extreme statistical characteristics, such as low complexity regions, is important for biophysical models of chromosome function and gene transcription regulation at the genome scale. We discuss complexity profiling for segmentation and delineation of genome sequences, the search for genome repeats and transposable elements, and applications to next-generation sequencing reads. We review the complexity methods and new application fields: analysis of mutation hotspot loci, analysis of short sequencing reads with quality control, and alignment-free genome comparisons. Algorithms implementing various numerical measures of text complexity, including combinatorial and linguistic measures, were developed before the genome sequencing era. A series of tools for estimating sequence complexity use compression approaches, mainly modifications of Lempel–Ziv compression. Most of the tools are available online, providing large-scale services for whole-genome analysis. Novel machine learning applications for classification of complete genome sequences also include sequence compression and complexity algorithms. We present a comparison of the complexity methods on different sequence sets and their applications to the analysis of gene transcription regulatory regions. Furthermore, we discuss approaches to and applications of sequence complexity for proteins. Complexity measures for amino acid sequences can be calculated by the same entropy and compression-based algorithms, but the functional and evolutionary roles of low complexity regions in proteins have specific features that differ from those in DNA. The tools for protein sequence complexity are aimed at protein structural constraints. It has been shown that low complexity regions in protein sequences are evolutionarily conserved and have important biological and structural functions. Finally, we summarize recent findings in large-scale genome complexity comparison and applications to coronavirus genome analysis.
© 2023, International Union for Pure and Applied Biophysics (IUPAB) and Springer-Verlag GmbH Germany, part of Springer Nature.
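
As an illustration of the two families of measures named in the abstract (entropy and Lempel–Ziv compression), the following is a minimal Python sketch of a sliding-window complexity profile. It is not taken from the reviewed tools: the function names, the window and step parameters, and the choice of a 1976-style Lempel–Ziv phrase count are all assumptions made for this example.

```python
import math
from collections import Counter


def shannon_entropy(seq: str) -> float:
    """Shannon entropy of the symbol frequencies in seq, in bits per symbol."""
    if not seq:
        return 0.0
    n = len(seq)
    return sum((c / n) * math.log2(n / c) for c in Counter(seq).values())


def lz_phrase_count(seq: str) -> int:
    """Number of phrases in a Lempel-Ziv (1976-style) parsing of seq.

    Each phrase is the longest prefix of the remaining text that can be
    copied from a position before the phrase start, plus one new symbol.
    Fewer phrases means a more repetitive (lower complexity) sequence.
    """
    i, phrases, n = 0, 0, len(seq)
    while i < n:
        length = 1
        # Extend the candidate while it already occurs starting before i
        # (the earlier copy may overlap the candidate itself).
        while i + length <= n and seq[i:i + length] in seq[:i + length - 1]:
            length += 1
        phrases += 1
        i += length
    return phrases


def complexity_profile(seq: str, window: int = 8, step: int = 1):
    """Return (start, entropy, lz_phrases) for each window along seq."""
    profile = []
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        profile.append((start, shannon_entropy(chunk), lz_phrase_count(chunk)))
    return profile


if __name__ == "__main__":
    # Hypothetical toy sequence: a mixed region followed by a homopolymer run.
    for row in complexity_profile("ACGTACGT" + "AAAAAAAA", window=8, step=8):
        print(row)
```

In this toy input the homopolymer window scores lowest on both measures, which is the kind of signal a low complexity region search looks for. Real tools typically add normalization by window length and alphabet size, which this sketch omits.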

Authors

Orlov Y.L., Orlova N.G.

Journal

Biophysical Reviews

Language

English

Pages

1367-1378

Status

Published

DOI

10.1007/S12551-023-01140-Y

Year

2023

Organizations

¹ The Digital Health Institute, I.M. Sechenov First Moscow State Medical University of the Russian Ministry of Health (Sechenov University), Moscow, 119991, Russian Federation
² Institute of Cytology and Genetics SB RAS, Novosibirsk, 630090, Russian Federation
³ Agrarian and Technological Institute, Peoples’ Friendship University of Russia, Moscow, 117198, Russian Federation
⁴ Department of Mathematics, Financial University under the Government of the Russian Federation, Moscow, 125167, Russian Federation

Keywords

Alignment-free; Bioinformatics; Entropy; Genetic codes; Genome comparison; Genomic rearrangement; Lempel–Ziv compression; Low complexity regions; Online tools; Sequence information; Sequencing artefacts; Text complexity
