Algorithms that may help genome assembly
Hey there! I can share some techniques for genome assembly. Genome sequencing data comes as fragments (reads), typically around a hundred nucleotides long. We need to assemble these fragments into a single long sequence by placing them in the right order so that neighboring reads overlap. A sequence-matching program identifies potential overlaps, and then we optimize the order of the reads based on those overlaps. This problem is equivalent to the Hamiltonian path problem: each read is a node, each overlap is an edge, and we look for a path that visits every node exactly once.
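To make the mapping from reads to a Hamiltonian path concrete, here is a minimal sketch in plain Python. It is not the code from my repository: the function names, the toy reads, and the minimum overlap of 3 are assumptions chosen for illustration, and the brute-force search over permutations only works for a handful of reads.

```python
from itertools import permutations

def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b (0 if below min_len)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-k:] == b[:k]:
            return k
    return 0

def overlap_graph(reads, min_len=3):
    """Directed graph: edge (i, j) with weight k if read i overlaps read j by k bases."""
    edges = {}
    for i, a in enumerate(reads):
        for j, b in enumerate(reads):
            if i != j:
                k = overlap(a, b, min_len)
                if k:
                    edges[(i, j)] = k
    return edges

def best_hamiltonian_path(n, edges):
    """Brute force: try every ordering and keep the valid one with the largest total overlap."""
    best, best_score = None, -1
    for order in permutations(range(n)):
        steps = list(zip(order, order[1:]))
        if all(s in edges for s in steps):
            score = sum(edges[s] for s in steps)
            if score > best_score:
                best, best_score = order, score
    return best

def assemble(reads, min_len=3):
    """Merge the reads along the best Hamiltonian path, or return None if no path exists."""
    edges = overlap_graph(reads, min_len)
    path = best_hamiltonian_path(len(reads), edges)
    if path is None:
        return None
    seq = reads[path[0]]
    for i, j in zip(path, path[1:]):
        seq += reads[j][edges[(i, j)]:]  # append only the non-overlapping tail
    return seq

# Toy reads (hypothetical), given in shuffled order
reads = ["ACGTTA", "ATGCGT", "GTTAGC", "CGTACG"]
assembled = assemble(reads)
print(assembled)
```

For these four toy reads the only valid ordering is ATGCGT, CGTACG, ACGTTA, GTTAGC, and the script prints ATGCGTACGTTAGC. The number of permutations grows factorially with the number of reads, which is why a dedicated optimizer is needed for real data.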
I wrote some code to solve Hamiltonian path problems. The main finding so far is that we can design the algorithm in a TSP-like manner or in a different, edge-based manner. The TSP-like implementation needs N x N binary variables, which limits us to about 100 nodes on the current hybrid solver (100 x 100 = 10,000 variables). The other implementation assigns a binary variable to each edge, so it can handle more than 100 nodes. For example, we can optimize 1000 nodes if there are 10 outgoing edges per node on average (1000 x 10 = 10,000 variables). I put these algorithms on GitHub (https://github.com/tadakado/dwave_hamilton_01).
You can test the code without installing D-Wave's SDK. After you create a free account at https://cloud.dwavesys.com/leap/signup/, you can open the IDE with my code at https://ide.dwavesys.io/#https://github.com/tadakado/dwave_hamilton_01. There are two Python programs, one for each of the two algorithms above. Let me know if you have any questions, or come and join the virtual BioHackathon Slack. See more details at https://github.com/virtual-biohackathons/covid-19-bh20.
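To illustrate where the N x N variable count comes from, here is a rough sketch of a TSP-like (position-based) QUBO for the Hamiltonian path problem. This is an assumption-laden illustration, not the code in the repository: the function names and the toy graph are hypothetical, the penalty weight of 1.0 is an arbitrary choice, and an exhaustive search stands in for the hybrid solver so that it runs without any D-Wave package.

```python
from itertools import product

def hamiltonian_path_qubo(n, edges, penalty=1.0):
    """Position-based QUBO: variable (v, p) = 1 means node v sits at position p."""
    Q = {}

    def add(a, b, w):
        key = (a, b) if a <= b else (b, a)
        Q[key] = Q.get(key, 0.0) + w

    variables = [(v, p) for v in range(n) for p in range(n)]
    # Expanding (1 - sum x)^2 for "each node once" and "each position once"
    # gives -1 per variable per constraint and +2 per pair in the same row/column.
    for a in variables:
        add(a, a, -2 * penalty)
    for i, a in enumerate(variables):
        for b in variables[i + 1:]:
            if a[0] == b[0] or a[1] == b[1]:
                add(a, b, 2 * penalty)
    # Penalize placing u directly before v when there is no edge u -> v.
    for u in range(n):
        for v in range(n):
            if u != v and (u, v) not in edges:
                for p in range(n - 1):
                    add((u, p), (v, p + 1), penalty)
    offset = 2 * n * penalty  # constant term, so a valid path has energy 0
    return Q, offset, variables

def brute_force_minimum(Q, offset, variables):
    """Exhaustive search over all assignments (stand-in for a real sampler)."""
    index = {v: i for i, v in enumerate(variables)}
    terms = [(index[a], index[b], w) for (a, b), w in Q.items()]
    best_bits, best_energy = None, float("inf")
    for bits in product((0, 1), repeat=len(variables)):
        energy = offset + sum(w for i, j, w in terms if bits[i] and bits[j])
        if energy < best_energy:
            best_bits, best_energy = bits, energy
    return dict(zip(variables, best_bits)), best_energy

def decode_path(sample, n):
    """Read the node order off the one-hot position variables."""
    order = [None] * n
    for (v, p), bit in sample.items():
        if bit:
            order[p] = v
    return order

# Toy directed graph (hypothetical) with 4 nodes and 4 edges
edges = {(0, 1), (1, 2), (2, 3), (0, 2)}
Q, offset, variables = hamiltonian_path_qubo(4, edges)
sample, energy = brute_force_minimum(Q, offset, variables)
path = decode_path(sample, 4)
print(path, energy)
print("position-based variables:", len(variables), "| edge-based variables:", len(edges))
```

Even on this toy graph the position-based model needs 16 variables, while an edge-based model would need only one per edge (4 here). The edge-based formulation also needs extra constraints to keep the selected edges in a single path, which this sketch does not attempt to show.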
Comments
Hello Tadashi,
Hi Tadashi and Marcus,
Spike and I are working on something similar. If you would like to collaborate, let us know! We would love to hear your insights. Message me at sandeepb724@gmail.com and I can put us all in touch.
Thanks,
Sandeep
Hi Sandeep,
Currently, I'm collaborating with other researchers on the #dwave channel of the virtual BioHackathon Slack at https://app.slack.com/client/T010K0KRLTF/C0120JS537H
We are working on algorithms for visualizing variation among genomes. The Slack channel is an open community, so you can join it, or we can start a separate collaboration depending on the topic we are tackling.
Thanks,
Hello Tadashi and Marcus,
I'm afraid I don't follow the part about "reads are mixed with other organisms...". However, it is true that ncbi.nlm.nih.gov seems to be the "oz" of genetic information. E.g., ncbi.nlm.nih.gov/nuccore/NC_045512 has just about all there is, informationally, on SARS-CoV-2, i.e., the [sic] "complete genome".
I think that visualization algorithm concepts are becoming invaluable. I also think that sub-atomic antigenic fit is a very cool and increasingly important thing, i.e., a theoretically perfect physical fit between the paratope and epitope, especially visible with the new cryo-EM technology that keeps getting more precise every year.
Take care all.
Hello Perez,
The current algorithm finds partial paths in a variation graph (https://github.com/vgteam/vg). I think it could be used in, or combined with, existing visualization algorithms. The code is available at https://github.com/tadakado/dwave_hamilton_01/. We haven't integrated it with a visualization algorithm, but we hope this chunk of code helps the community.
Best,