Covid-19 genome

Programs that can compare the Covid-19 genome to other viruses

1

Comments

12 comments
  • https://github.com/enformatik/samtools

    Bioinformatics Tools: Introduction to Bioinformatics by Mehmet Keçeci, Page 60, ISBN-13: 978-1095890714, https://www.amazon.com/gp/product/1095890719

    1
    Comment actions Permalink
  • Hi Mehmet,

    I understand you'd like to compare Covid-19 genome to other viruses? Would you start with SAM or with FASTA for that?

    1
    Comment actions Permalink
  • The comparison would be useful for us to find the intermediate host. If we can find the intermediate host, we might be able to figure out how this happened. We can even go even further and ask the question of whether it's a natural or unnatural transition.

    0
    Comment actions Permalink
  • Hi Mehmet,

    One way would be to fit maximum likelihood phylogenetic tree to some sequences, and then do ancestral reconstruction near the leaves of the tree.   Where there were dramatic changes in probability for particular parts of the viral sequence, one could infer selection pressure.  In this way one could learn about cross-species and within-species transmission events and any mutational changes that occurred along the way.   Further, D-Wave samplers could be used to do ascertain which mutations were associated with phenotypes of interest.   For humans, phenotypes might include properties of their immune system (from sequencing their genome rather that the virus), or whether someone was hospitalized or not.

    Marcus

    1
    Comment actions Permalink
  • Hi Mehmet/Marcus,

    Spike Campbell and I are discussing the same issue. The comparison of COVID-19 genome to other viruses as well (especially among the Corona family) is very beneficial. Let us know if you are interested on collaborating. 

    Thanks,

    Sandeep

    1
    Comment actions Permalink
  • Hi Sandeep,

    Great, D-Wave is interested in collaborating. 

    https://www.gisaid.org/ is a possible source for sequence data.   I'm tooling-up with phylogenetics codes this weekend.

    Cheers,

    Marcus

     

    1
    Comment actions Permalink
  • The Nextstrain code appears to be a reasonable starting point, depending on what folks want to do.  Looking under the build instructions, I see it can use several alternative ML phylogenetic codes (IQ-TREE, RAxML, or FastTree).   While this code displays geographic information, it doesn't compute evolutionary bottlenecks or selection pressure, nor does it have the ability to correlate mutational patterns to facts about patients.  Within ML codes is a representation at each branch in the tree that has probabilities for each nucleotide in each position of the sequence.   Some nucleotides may be mutating faster than others depending on the different kinds of (immune system) pressure they encounter.  For zoonotic disease, there may be bigger changes that are required to adapt to new species. 

    1
    Comment actions Permalink
  • I see IQ-TREE can do ancestral reconstruction.

     

    1
    Comment actions Permalink
  • Following-up on Pau's remark:  Does anyone have experience with short read sequencing?  Are we likely to get more detail on emerging mutations, or is it just going to be a complicated alignment exercise riddled with noise?

    1
    Comment actions Permalink
  • Hi Mehmet, all,

    Since I'm waiting for GISAID access, I thought I'd try out the Genbank's sequence read archive (SRA).   The idea is to find variants that come out of next-generation (deep) sequencing machines and get a sense of population diversity within a patient (if any).   I haven't found a lot of literature on this kind of analysis for COVID-19, so maybe it is not commonplace?

    I installed htslib, bcftools, samtools, sra-tools, and bowtie2 to do the alignment.   Attached are scripts to 1) grab and index a reference sequence for the virus, and 2) compute a consensus sequence for the sample.   The first script downloads a reference alignment and indexes it for Bowtie2.  The second script pulls reads from a sample and aligns and indexes them.  Finally I used a tool to show high-quality variant calls.

     

    # Do this once to get the tool that the attached script setup_reference below uses.

    $ pip install ncbi-acc-download 

     

    # This creates the file ref_MN908947.3.fasta below, a SARs-Cov2 isolate from China.

    $ ./setup_reference 

     

    # This creates the file SRR11313286_consensus.fasta and supporting files

    $ ./create_consensus ref_MN908947.3.fasta SRR11313286

     

    # show some variants that are unlikely to happen by chance

    $ bcftools view -i '%QUAL>=50' SRR11313286_calls.vcf.gz

    ##bcftools_viewCommand=view -i %QUAL>=50 SRR11313286_calls.vcf.gz; Date=Sun Apr 5 13:25:42 2020
    #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SRR11313286_sorted.bam
    MN908947.3 15090 . A T 99 . DP=27;VDB=4.13851e-08;SGB=-0.676189;RPB=0.770381;MQB=0.536296;MQSB=0.734889;BQB=0.173571;MQ0F=0.0740741;ICB=1;HOB=0.5;AC=1;AN=2;DP4=8,3,6,5;MQ=23 GT:PL 0/1:132,0,99
    MN908947.3 15153 . T C 109 . DP=137;VDB=2.01517e-20;SGB=-0.693145;RPB=0.2716;MQB=0.997273;MQSB=0.469397;BQB=0.661197;MQ0F=0.0437956;ICB=1;HOB=0.5;AC=1;AN=2;DP4=49,29,21,19;MQ=23 GT:PL 0/1:142,0,245
    MN908947.3 22120 . CAAAAA CAAAA 50 . INDEL;IDV=49;IMF=0.385827;DP=124;VDB=7.66172e-15;SGB=-0.683931;MQSB=0.81328;MQ0F=0.0322581;ICB=1;HOB=0.5;AC=1;AN=2;DP4=11,3,7,6;MQ=27 GT:PL 0/1:83,0,86
    MN908947.3 23964 . ATTTTT ATTTT 63 . INDEL;IDV=54;IMF=0.296703;DP=177;VDB=3.80248e-19;SGB=-0.692562;MQSB=0.437343;MQ0F=0.039548;ICB=1;HOB=0.5;AC=1;AN=2;DP4=0,16,14,8;MQ=22 GT:PL 0/1:96,0,50
    MN908947.3 28144 . T C 210 . DP=195;VDB=0;SGB=-0.693147;RPB=0.468292;MQB=0.95976;MQSB=0.0578741;BQB=0.301333;MQ0F=0.0717949;AC=2;AN=2;DP4=1,2,82,80;MQ=18 GT:PL 1/1:237,255,0
    MN908947.3 28214 . CTTTTT CTTTT 57 . INDEL;IDV=91;IMF=0.450495;DP=202;VDB=2.80938e-39;SGB=-0.69312;MQSB=0.998167;MQ0F=0.0693069;ICB=1;HOB=0.5;AC=1;AN=2;DP4=15,9,22,10;MQ=23 GT:PL 0/1:90,0,104
    MN908947.3 29095 . C T 165 . DP=164;VDB=1.85494e-37;SGB=-0.693147;MQSB=0.217919;MQ0F=0.103659;AC=2;AN=2;DP4=0,0,38,58;MQ=15 GT:PL 1/1:195,255,0

    I have not experimented with this kind of data before, so I don't know what kind of systematic biases exist or what the best practices are.   There seems to be a big community around these databases and tools, though.  

    Apparently this forum does not allow upload of non-image files, so here are the (simple) scripts inline:

    $ cat setup_reference
    #!/bin/sh
    acc=MN908947.3
    ncbi-acc-download -F fasta -o ref_${acc}.fasta $acc

    bowtie2-build ref_${acc}.fasta sars-cov2

    $ cat create_consensus
    #!/bin/sh
    if test -z "$2"; then
    echo Usage: $0 ref.fasta SRA_Run_Accession_Number
    exit 1
    fi
    ref=$1
    acc=$2
    fastq-dump ${acc}

    bowtie2 -x sars-cov2 -U ${acc}.fastq -S ${acc}.sam

    samtools view -bS ${acc}.sam > ${acc}.bam

    samtools sort ${acc}.bam -o ${acc}_sorted.bam

    samtools mpileup -u -f $ref ${acc}_sorted.bam | bcftools call -mv -Oz -o ${acc}_calls.vcf.gz

    bcftools index ${acc}_calls.vcf.gz

    cat $ref | bcftools consensus ${acc}_calls.vcf.gz > ${acc}_consensus.fasta

     

    Marcus

     

    1
    Comment actions Permalink
  • Here's one paper on the topic:

    https://jvi.asm.org/content/94/7/e00127-20

    0
    Comment actions Permalink

Please sign in to leave a comment.

Didn't find what you were looking for?

New post