dc.contributor.author: Hackenberg, Michael
dc.contributor.author: Rueda, Antonio
dc.contributor.author: Carpena, Pedro
dc.contributor.author: Bernaola-Galván, Pedro
dc.contributor.author: Barturen, Guillermo
dc.contributor.author: Oliver Jiménez, José Lutgardo
dc.date.accessioned: 2026-02-20T12:21:23Z
dc.date.available: 2026-02-20T12:21:23Z
dc.date.issued: 2011-12-30
dc.identifier.citation: Hackenberg, M.; Rueda, A.; Carpena, P. [et al]. (2012). Clustering of DNA words and biological function: A proof of principle. Journal of Theoretical Biology 297 (2012) 127–136. doi:10.1016/j.jtbi.2011.12.024 [es_ES]
dc.identifier.issn: 0022-5193
dc.identifier.uri: https://hdl.handle.net/10481/111318
dc.description: This work was supported by the Ministry of Innovation and Science of the Spanish Government [BIO2008-01353 to JLO and BIO2010-20219 to MH], ‘Juan de la Cierva’ grant (M.H.) and Basque country ‘AE’ grant (G.B.). We gratefully acknowledge the valuable comments of two anonymous referees, which significantly improved the manuscript. We thank Ángel M. Alganza for help with system administration and database support. [es_ES]
dc.description.abstract: Relevant words in literary texts (key words) are known to be clustered, while common words are randomly distributed. Given the clustered distribution of many functional genome elements, we hypothesize that the biological text par excellence, the DNA sequence, might behave in the same way: k-length words (k-mers) with a clear function may be spatially clustered along the one-dimensional chromosome sequence, while less important, non-functional words may be randomly distributed. To explore this linguistic analogy, we calculate a clustering coefficient for each k-mer (k=2–9 bp) in human and mouse chromosome sequences and then check whether clustered words are enriched in the functional part of the genome. First, we found a positive general trend relating clustering level and word enrichment within exons and Transcription Factor Binding Sites (TFBSs), while a much weaker relation exists for repeats, and no relation at all exists for introns. Second, we found that 38.45% of the 200 top-clustered 8-mers, but only 7.70% of the non-clustered words, are represented in known motif databases. Third, enrichment/depletion experiments show that highly clustered words are significantly enriched in exons and TFBSs, while they are depleted in introns and repetitive DNA. Considering exons and TFBSs together, 1417 (72.26%) of the top-clustered 8-mers in human and 1385 (72.97%) in mouse showed a statistically significant association with either exons or TFBSs, thus strongly supporting the link between word clustering and biological function. Lastly, we identified a subset of clustered, diagnostic words that are enriched in exons but depleted in introns, and therefore might help to discriminate between these two gene regions. The clustering of DNA words thus appears as a novel principle to detect functionality in genome sequences. As evolutionary conservation is not a prerequisite, the proof of principle described here may open new ways to detect species-specific functional DNA sequences and to improve gene and promoter predictions, thus contributing to the quest for function in the genome. [es_ES]
dc.description.sponsorship: Ministry of Innovation and Science of the Spanish Government [BIO2008-01353 and BIO2010-20219] [es_ES]
dc.description.sponsorship: Juan de la Cierva grant [es_ES]
dc.description.sponsorship: Basque country ‘AE’ grant [es_ES]
dc.language.iso: eng [es_ES]
dc.publisher: Elsevier [es_ES]
dc.rights: Creative Commons Attribution-NonCommercial-NoDerivs 3.0 License [es_ES]
dc.rights: Attribution 4.0 International [*]
dc.rights.uri: http://creativecommons.org/licenses/by/4.0/ [*]
dc.subject: DNA-words [es_ES]
dc.subject: Word clustering [es_ES]
dc.subject: Enrichment/depletion experiments [es_ES]
dc.title: Clustering of DNA words and biological function: A proof of principle [es_ES]
dc.type: journal article [es_ES]
dc.rights.accessRights: open access [es_ES]
dc.identifier.doi: 10.1016/j.jtbi.2011.12.024
dc.type.hasVersion: VoR [es_ES]
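
The abstract mentions a clustering coefficient computed for each k-mer but does not say how it is defined. As a rough illustration only, the sketch below measures clustering from the gaps between consecutive occurrences of a word: the ratio of the standard deviation to the mean of the gaps, divided by sqrt(1 - p), the value that ratio takes for geometrically distributed gaps (a randomly placed word) with occurrence probability p. This choice of measure, and the names kmer_positions and clustering_level, are assumptions made for this example and are not taken from the record.

```python
import math
from statistics import mean, pstdev


def kmer_positions(seq, word):
    """Return the start positions of all (possibly overlapping) occurrences of word."""
    positions = []
    start = seq.find(word)
    while start != -1:
        positions.append(start)
        start = seq.find(word, start + 1)
    return positions


def clustering_level(seq, word):
    """Illustrative clustering measure (an assumption, not the paper's definition).

    Values near 1 are what random placement would give, values well above 1
    suggest clustering, and values near 0 suggest regular spacing.
    Returns None when there are too few occurrences to compute gap statistics.
    """
    pos = kmer_positions(seq, word)
    if len(pos) < 3:
        return None
    # Distances between consecutive occurrences of the word.
    gaps = [b - a for a, b in zip(pos, pos[1:])]
    sigma = pstdev(gaps) / mean(gaps)
    # Estimated probability of the word starting at any given position.
    p = len(pos) / len(seq)
    if p >= 1.0:
        return None
    # For geometric gaps, std/mean equals sqrt(1 - p); normalize by it.
    return sigma / math.sqrt(1.0 - p)


if __name__ == "__main__":
    regular = "ACGT" * 50
    clustered = "ACGT" * 5 + "T" * 200 + "ACGT" * 5
    print(clustering_level(regular, "ACGT"))
    print(clustering_level(clustered, "ACGT"))
```

On real chromosome sequences one would also have to decide how to treat overlapping occurrences, unsequenced regions and the reverse strand; the sketch ignores all of these.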


Except where otherwise noted, this item's license is described as Creative Commons Attribution-NonCommercial-NoDerivs 3.0 License