Querying Highly Similar Structured Sequences via Binary Encoding and Word Level Operations
Abstract
In the post-genomic era, the amount of available genomic data has exploded, and the primary research problem has shifted from producing interesting biological data to processing and storing it efficiently. In this paper we present efficient data structures and algorithms for the High Similarity Sequencing Problem, in which we are given sequences $S_0, S_1, \dots, S_k$, where $S_j = e_{j_1} I_{\sigma_1} e_{j_2} I_{\sigma_2} e_{j_3} I_{\sigma_3} \dots e_{j_\ell} I_{\sigma_\ell}$, and must perform pattern matching on the set of sequences. By exploiting the extensive similarity among the sequences, we obtain time- and memory-efficient data structures: our solution achieves a query time of $O(m + vk \log \ell + \frac{m \, occ_v \, v}{w} + \frac{PSC(p) \, m}{w})$ with a memory usage of $O(N \log N + vk \log vk)$.
Domains
Computer Science [cs]