# Algorithms that may help with genome assembly

Hey there! I can share some techniques for genome assembly. Genome sequencing data is fragmented: each read is typically around a hundred nucleotides long. We need to assemble these fragments into a single long sequence by placing the reads in the right order according to their overlaps. A sequence-matching program identifies potential overlaps, and then we optimize the order of the reads based on those overlaps. This is equivalent to the Hamiltonian path problem on the graph whose nodes are reads and whose edges are overlaps.
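To make the idea concrete, here is a minimal classical sketch in plain Python (no D-Wave SDK). It is not the code from my repository: the function names, the toy reads, and the `min_len` threshold are all made up for illustration, and the brute-force search over permutations only works for a handful of reads.

```python
from itertools import permutations

def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 when the longest match is shorter than `min_len`.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-k:] == b[:k]:
            return k
    return 0

def build_overlap_graph(reads, min_len=3):
    """Directed graph as {(i, j): overlap length} for every ordered pair with an overlap."""
    graph = {}
    for i, a in enumerate(reads):
        for j, b in enumerate(reads):
            if i != j:
                k = overlap(a, b, min_len)
                if k:
                    graph[(i, j)] = k
    return graph

def best_hamiltonian_path(n, graph):
    """Brute force: try every ordering and keep the valid path with the largest total overlap."""
    best, best_score = None, -1
    for candidate in permutations(range(n)):
        steps = list(zip(candidate, candidate[1:]))
        if all(step in graph for step in steps):
            score = sum(graph[step] for step in steps)
            if score > best_score:
                best, best_score = candidate, score
    return best

def merge(reads, path, graph):
    """Concatenate reads along the path, dropping the overlapping prefix of each next read."""
    seq = reads[path[0]]
    for u, v in zip(path, path[1:]):
        seq += reads[v][graph[(u, v)]:]
    return seq

# Hypothetical toy reads, deliberately given out of order.
reads = ["ACAGTTCC", "TTCCGGAA", "GATTACAG"]
graph = build_overlap_graph(reads)
order = best_hamiltonian_path(len(reads), graph)
assembled = merge(reads, order, graph)
print(order, assembled)
```

The permutation search grows factorially with the number of reads, which is exactly why handing the ordering step to an optimizer is attractive.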

I wrote some code to solve Hamiltonian path problems. The main finding so far is that we can formulate the problem in a TSP-like way or in a different, edge-based way. The TSP-like implementation needs NxN binary variables, which limits us to about 100 nodes on the current hybrid solver (100x100 = 10,000 variables). The other implementation assigns one binary variable to each edge, so it can handle more than 100 nodes. For example, we can optimize 1,000 nodes if there are 10 outgoing edges per node on average (1,000x10 = 10,000 variables). I put these algorithms on GitHub (https://github.com/tadakado/dwave_hamilton_01). You can test the code without installing D-Wave's SDK: after you create a free account at https://cloud.dwavesys.com/leap/signup/, you can open the IDE with my code at https://ide.dwavesys.io/#https://github.com/tadakado/dwave_hamilton_01. There are two Python programs, which correspond to the two algorithms above. Let me know if you have any questions, or come and join the virtual BioHackathon Slack. See more details at https://github.com/virtual-biohackathons/covid-19-bh20.
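For readers who want to see where the NxN variables come from, here is a rough sketch of the TSP-like formulation. It uses the textbook node-and-position encoding, which is not necessarily identical to what the repository does. The QUBO is a plain dict, and an exhaustive search stands in for the hybrid solver, so it only runs on toy sizes. The names `hamiltonian_path_qubo`, `energy`, and `brute_force_minimum` are hypothetical.

```python
from itertools import combinations, product

def hamiltonian_path_qubo(n, edges, penalty=1.0):
    """QUBO over variables (node, position); zero energy means a valid Hamiltonian path."""
    Q = {}

    def add(a, b, w):
        key = (a, b) if a <= b else (b, a)
        Q[key] = Q.get(key, 0.0) + w

    # One-hot constraints: every node takes one position, every position holds one node.
    by_node = [[(v, p) for p in range(n)] for v in range(n)]
    by_pos = [[(v, p) for v in range(n)] for p in range(n)]
    for group in by_node + by_pos:
        for a in group:
            add(a, a, -penalty)            # linear part of (1 - sum x)^2
        for a, b in combinations(group, 2):
            add(a, b, 2 * penalty)         # quadratic part of (1 - sum x)^2

    # Penalize consecutive positions occupied by nodes that are not joined by an edge.
    edge_set = set(edges)
    for u, v in product(range(n), repeat=2):
        if u != v and (u, v) not in edge_set:
            for p in range(n - 1):
                add((u, p), (v, p + 1), penalty)

    offset = 2 * n * penalty               # constant term from the 2n one-hot constraints
    return Q, offset

def energy(Q, offset, x):
    """Evaluate the QUBO for an assignment x: {variable: 0 or 1}."""
    return offset + sum(w * x[a] * x[b] for (a, b), w in Q.items())

def brute_force_minimum(n, edges):
    """Exhaustive search over all 2^(n*n) assignments (toy sizes only)."""
    Q, offset = hamiltonian_path_qubo(n, edges)
    variables = [(v, p) for v in range(n) for p in range(n)]
    best_e, best_x = None, None
    for bits in product((0, 1), repeat=len(variables)):
        x = dict(zip(variables, bits))
        e = energy(Q, offset, x)
        if best_e is None or e < best_e:
            best_e, best_x = e, x
    return best_e, best_x, variables

# Toy directed graph with 3 nodes and edges 2 -> 0 -> 1.
n, edges = 3, [(2, 0), (0, 1)]
min_energy, best_x, variables = brute_force_minimum(n, edges)
order = [v for p in range(n) for v in range(n) if best_x[(v, p)]]
print(len(variables), "node-position variables vs", len(edges), "edge variables")
print(min_energy, order)
```

Even on this toy graph the node-position encoding needs 9 variables while an edge-based one would need only 2, which is the same scaling gap described above.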

The NCBI has next generation sequencing data for SARS-CoV-2 at this URL:

I think faster assembly could be valuable when the reads are mixed with reads from other organisms. Perhaps this would be plausible in a pandemic situation where a lab has more equipment than people?

Here's one case that took Bowtie2 a long time to align to the reference sequence.
Luckily, at least for GenBank, it seems that the submitters have been doing a better job removing host genetic material.

Marcus

Spike and I are working on something similar. If you would like to collaborate, let us know! We would love to hear your insights. Message me at sandeepb724@gmail.com and I can put us all in touch.

Thanks,

Sandeep

• Hi Sandeep,

Currently, I'm collaborating with other researchers on the #dwave channel in the virtual hackathon Slack at

https://app.slack.com/client/T010K0KRLTF/C0120JS537H

We are working on algorithms for visualizing variation among genomes. The Slack channel is an open community, so you can join it, or we can start a separate collaboration depending on the topic we are tackling.

Thanks,

I'm afraid I don't follow the part about "reads are mixed with other organisms..." However, it is true that ncbi.nlm.nih.gov seems to be the "Oz" of genetic information. E.g., ncbi.nlm.nih.gov/nuccore/NC_045512 has just about all there is, informationally, on SARS-CoV-2, i.e., the [sic] "complete genome".

I think that visualization algorithm concepts are becoming invaluable. I also think that sub-atomic antigenic fit is a very cool and increasingly important thing, i.e., a theoretically perfect physical fit between the paratope and epitope, especially as made visible by the new cryo-EM technology that keeps getting more precise every year.

Take care all.

• Hello Perez,

The current algorithm finds partial paths in a variation graph (https://github.com/vgteam/vg). I think it could be used in, or combined with, the existing visualization algorithm. The code is available at https://github.com/tadakado/dwave_hamilton_01/. We haven't integrated it with the visualization algorithm, but we hope this chunk of code helps the community.

Best,