The comparison would be useful for us to find the intermediate host. If we can find the intermediate host, we might be able to figure out how this happened. We can even go even further and ask the question of whether it's a natural or unnatural transition.
One way would be to fit maximum likelihood phylogenetic tree to some sequences, and then do ancestral reconstruction near the leaves of the tree. Where there were dramatic changes in probability for particular parts of the viral sequence, one could infer selection pressure. In this way one could learn about cross-species and within-species transmission events and any mutational changes that occurred along the way. Further, D-Wave samplers could be used to do ascertain which mutations were associated with phenotypes of interest. For humans, phenotypes might include properties of their immune system (from sequencing their genome rather that the virus), or whether someone was hospitalized or not.
Spike Campbell and I are discussing the same issue. The comparison of COVID-19 genome to other viruses as well (especially among the Corona family) is very beneficial. Let us know if you are interested on collaborating.
The Nextstrain code appears to be a reasonable starting point, depending on what folks want to do. Looking under the build instructions, I see it can use several alternative ML phylogenetic codes (IQ-TREE, RAxML, or FastTree). While this code displays geographic information, it doesn't compute evolutionary bottlenecks or selection pressure, nor does it have the ability to correlate mutational patterns to facts about patients. Within ML codes is a representation at each branch in the tree that has probabilities for each nucleotide in each position of the sequence. Some nucleotides may be mutating faster than others depending on the different kinds of (immune system) pressure they encounter. For zoonotic disease, there may be bigger changes that are required to adapt to new species.
Following-up on Pau's remark: Does anyone have experience with short read sequencing? Are we likely to get more detail on emerging mutations, or is it just going to be a complicated alignment exercise riddled with noise?
Since I'm waiting for GISAID access, I thought I'd try out the Genbank's sequence read archive (SRA). The idea is to find variants that come out of next-generation (deep) sequencing machines and get a sense of population diversity within a patient (if any). I haven't found a lot of literature on this kind of analysis for COVID-19, so maybe it is not commonplace?
I installed htslib, bcftools, samtools, sra-tools, and bowtie2 to do the alignment. Attached are scripts to 1) grab and index a reference sequence for the virus, and 2) compute a consensus sequence for the sample. The first script downloads a reference alignment and indexes it for Bowtie2. The second script pulls reads from a sample and aligns and indexes them. Finally I used a tool to show high-quality variant calls.
# Do this once to get the tool that the attached script setup_reference below uses.
$ pip install ncbi-acc-download
# This creates the file ref_MN908947.3.fasta below, a SARs-Cov2 isolate from China.
$ ./setup_reference
# This creates the file SRR11313286_consensus.fasta and supporting files
##bcftools_viewCommand=view -i %QUAL>=50 SRR11313286_calls.vcf.gz; Date=Sun Apr 5 13:25:42 2020 #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SRR11313286_sorted.bam MN908947.3 15090 . A T 99 . DP=27;VDB=4.13851e-08;SGB=-0.676189;RPB=0.770381;MQB=0.536296;MQSB=0.734889;BQB=0.173571;MQ0F=0.0740741;ICB=1;HOB=0.5;AC=1;AN=2;DP4=8,3,6,5;MQ=23 GT:PL 0/1:132,0,99 MN908947.3 15153 . T C 109 . DP=137;VDB=2.01517e-20;SGB=-0.693145;RPB=0.2716;MQB=0.997273;MQSB=0.469397;BQB=0.661197;MQ0F=0.0437956;ICB=1;HOB=0.5;AC=1;AN=2;DP4=49,29,21,19;MQ=23 GT:PL 0/1:142,0,245 MN908947.3 22120 . CAAAAA CAAAA 50 . INDEL;IDV=49;IMF=0.385827;DP=124;VDB=7.66172e-15;SGB=-0.683931;MQSB=0.81328;MQ0F=0.0322581;ICB=1;HOB=0.5;AC=1;AN=2;DP4=11,3,7,6;MQ=27 GT:PL 0/1:83,0,86 MN908947.3 23964 . ATTTTT ATTTT 63 . INDEL;IDV=54;IMF=0.296703;DP=177;VDB=3.80248e-19;SGB=-0.692562;MQSB=0.437343;MQ0F=0.039548;ICB=1;HOB=0.5;AC=1;AN=2;DP4=0,16,14,8;MQ=22 GT:PL 0/1:96,0,50 MN908947.3 28144 . T C 210 . DP=195;VDB=0;SGB=-0.693147;RPB=0.468292;MQB=0.95976;MQSB=0.0578741;BQB=0.301333;MQ0F=0.0717949;AC=2;AN=2;DP4=1,2,82,80;MQ=18 GT:PL 1/1:237,255,0 MN908947.3 28214 . CTTTTT CTTTT 57 . INDEL;IDV=91;IMF=0.450495;DP=202;VDB=2.80938e-39;SGB=-0.69312;MQSB=0.998167;MQ0F=0.0693069;ICB=1;HOB=0.5;AC=1;AN=2;DP4=15,9,22,10;MQ=23 GT:PL 0/1:90,0,104 MN908947.3 29095 . C T 165 . DP=164;VDB=1.85494e-37;SGB=-0.693147;MQSB=0.217919;MQ0F=0.103659;AC=2;AN=2;DP4=0,0,38,58;MQ=15 GT:PL 1/1:195,255,0
I have not experimented with this kind of data before, so I don't know what kind of systematic biases exist or what the best practices are. There seems to be a big community around these databases and tools, though.
Apparently this forum does not allow upload of non-image files, so here are the (simple) scripts inline:
Comments
https://github.com/enformatik/samtools
Bioinformatics Tools: Introduction to Bioinformatics by Mehmet Keçeci, Page 60, ISBN-13: 978-1095890714, https://www.amazon.com/gp/product/1095890719
Hi Mehmet,
I understand you'd like to compare Covid-19 genome to other viruses? Would you start with SAM or with FASTA for that?
The comparison would be useful for us to find the intermediate host. If we can find the intermediate host, we might be able to figure out how this happened. We can even go even further and ask the question of whether it's a natural or unnatural transition.
Hi Mehmet,
One way would be to fit maximum likelihood phylogenetic tree to some sequences, and then do ancestral reconstruction near the leaves of the tree. Where there were dramatic changes in probability for particular parts of the viral sequence, one could infer selection pressure. In this way one could learn about cross-species and within-species transmission events and any mutational changes that occurred along the way. Further, D-Wave samplers could be used to do ascertain which mutations were associated with phenotypes of interest. For humans, phenotypes might include properties of their immune system (from sequencing their genome rather that the virus), or whether someone was hospitalized or not.
Marcus
Hi Mehmet/Marcus,
Spike Campbell and I are discussing the same issue. The comparison of COVID-19 genome to other viruses as well (especially among the Corona family) is very beneficial. Let us know if you are interested on collaborating.
Thanks,
Sandeep
Hi Sandeep,
Great, D-Wave is interested in collaborating.
https://www.gisaid.org/ is a possible source for sequence data. I'm tooling-up with phylogenetics codes this weekend.
Cheers,
Marcus
https://github.com/nextstrain/ncov
https://nextstrain.org/ncov
https://www.gisaid.org/epiflu-applications/next-hcov-19-app/
The Nextstrain code appears to be a reasonable starting point, depending on what folks want to do. Looking under the build instructions, I see it can use several alternative ML phylogenetic codes (IQ-TREE, RAxML, or FastTree). While this code displays geographic information, it doesn't compute evolutionary bottlenecks or selection pressure, nor does it have the ability to correlate mutational patterns to facts about patients. Within ML codes is a representation at each branch in the tree that has probabilities for each nucleotide in each position of the sequence. Some nucleotides may be mutating faster than others depending on the different kinds of (immune system) pressure they encounter. For zoonotic disease, there may be bigger changes that are required to adapt to new species.
I see IQ-TREE can do ancestral reconstruction.
Following-up on Pau's remark: Does anyone have experience with short read sequencing? Are we likely to get more detail on emerging mutations, or is it just going to be a complicated alignment exercise riddled with noise?
Hi Mehmet, all,
Since I'm waiting for GISAID access, I thought I'd try out the Genbank's sequence read archive (SRA). The idea is to find variants that come out of next-generation (deep) sequencing machines and get a sense of population diversity within a patient (if any). I haven't found a lot of literature on this kind of analysis for COVID-19, so maybe it is not commonplace?
I installed htslib, bcftools, samtools, sra-tools, and bowtie2 to do the alignment. Attached are scripts to 1) grab and index a reference sequence for the virus, and 2) compute a consensus sequence for the sample. The first script downloads a reference alignment and indexes it for Bowtie2. The second script pulls reads from a sample and aligns and indexes them. Finally I used a tool to show high-quality variant calls.
# Do this once to get the tool that the attached script setup_reference below uses.
$ pip install ncbi-acc-download
# This creates the file ref_MN908947.3.fasta below, a SARs-Cov2 isolate from China.
$ ./setup_reference
# This creates the file SRR11313286_consensus.fasta and supporting files
$ ./create_consensus ref_MN908947.3.fasta SRR11313286
# show some variants that are unlikely to happen by chance
$ bcftools view -i '%QUAL>=50' SRR11313286_calls.vcf.gz
##bcftools_viewCommand=view -i %QUAL>=50 SRR11313286_calls.vcf.gz; Date=Sun Apr 5 13:25:42 2020
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SRR11313286_sorted.bam
MN908947.3 15090 . A T 99 . DP=27;VDB=4.13851e-08;SGB=-0.676189;RPB=0.770381;MQB=0.536296;MQSB=0.734889;BQB=0.173571;MQ0F=0.0740741;ICB=1;HOB=0.5;AC=1;AN=2;DP4=8,3,6,5;MQ=23 GT:PL 0/1:132,0,99
MN908947.3 15153 . T C 109 . DP=137;VDB=2.01517e-20;SGB=-0.693145;RPB=0.2716;MQB=0.997273;MQSB=0.469397;BQB=0.661197;MQ0F=0.0437956;ICB=1;HOB=0.5;AC=1;AN=2;DP4=49,29,21,19;MQ=23 GT:PL 0/1:142,0,245
MN908947.3 22120 . CAAAAA CAAAA 50 . INDEL;IDV=49;IMF=0.385827;DP=124;VDB=7.66172e-15;SGB=-0.683931;MQSB=0.81328;MQ0F=0.0322581;ICB=1;HOB=0.5;AC=1;AN=2;DP4=11,3,7,6;MQ=27 GT:PL 0/1:83,0,86
MN908947.3 23964 . ATTTTT ATTTT 63 . INDEL;IDV=54;IMF=0.296703;DP=177;VDB=3.80248e-19;SGB=-0.692562;MQSB=0.437343;MQ0F=0.039548;ICB=1;HOB=0.5;AC=1;AN=2;DP4=0,16,14,8;MQ=22 GT:PL 0/1:96,0,50
MN908947.3 28144 . T C 210 . DP=195;VDB=0;SGB=-0.693147;RPB=0.468292;MQB=0.95976;MQSB=0.0578741;BQB=0.301333;MQ0F=0.0717949;AC=2;AN=2;DP4=1,2,82,80;MQ=18 GT:PL 1/1:237,255,0
MN908947.3 28214 . CTTTTT CTTTT 57 . INDEL;IDV=91;IMF=0.450495;DP=202;VDB=2.80938e-39;SGB=-0.69312;MQSB=0.998167;MQ0F=0.0693069;ICB=1;HOB=0.5;AC=1;AN=2;DP4=15,9,22,10;MQ=23 GT:PL 0/1:90,0,104
MN908947.3 29095 . C T 165 . DP=164;VDB=1.85494e-37;SGB=-0.693147;MQSB=0.217919;MQ0F=0.103659;AC=2;AN=2;DP4=0,0,38,58;MQ=15 GT:PL 1/1:195,255,0
I have not experimented with this kind of data before, so I don't know what kind of systematic biases exist or what the best practices are. There seems to be a big community around these databases and tools, though.
Apparently this forum does not allow upload of non-image files, so here are the (simple) scripts inline:
$ cat setup_reference
#!/bin/sh
acc=MN908947.3
ncbi-acc-download -F fasta -o ref_${acc}.fasta $acc
bowtie2-build ref_${acc}.fasta sars-cov2
$ cat create_consensus
#!/bin/sh
if test -z "$2"; then
echo Usage: $0 ref.fasta SRA_Run_Accession_Number
exit 1
fi
ref=$1
acc=$2
fastq-dump ${acc}
bowtie2 -x sars-cov2 -U ${acc}.fastq -S ${acc}.sam
samtools view -bS ${acc}.sam > ${acc}.bam
samtools sort ${acc}.bam -o ${acc}_sorted.bam
samtools mpileup -u -f $ref ${acc}_sorted.bam | bcftools call -mv -Oz -o ${acc}_calls.vcf.gz
bcftools index ${acc}_calls.vcf.gz
cat $ref | bcftools consensus ${acc}_calls.vcf.gz > ${acc}_consensus.fasta
Marcus
Here's one paper on the topic:
https://jvi.asm.org/content/94/7/e00127-20
Please sign in to leave a comment.