Home
Team
Connect
Impact
  • Overview
  • Society
  • Outreach
VPRE 1.0
  • Click to use VPRE
  • Concept
  • Biological Context
  • VPRE's Process
  • Results
  • Discussion
  • Future Directions
VPRE 2.0

Discussion

What is strong about our project?

  • Compared to traditional modelling, our model supports continuous learning and can improve itself over time
  • It uses machine learning in a novel way
  • The model can extract dependencies and correlations between subsequences that would otherwise go unnoticed by other bioinformatic techniques
  • Our product carries no financial burden, as it is entirely computer-based
  • Our product has no environmental footprint, and is continuously recyclable and reusable

What is weak about our project?

  • Simulated mutations have less biological support, as the underlying concept is based on text processing
  • We have not yet incorporated a chronological evolutionary path into the training process
  • We currently have a relatively small dataset, which can lead to bias in the results
  • We do not yet have a way of modelling antigenic shift

Why did we choose to use the methods that we did?

Is there a better way to do things?

Why did we use a recurrent neural network?

RNNs can be trained on sequential data (e.g. sentences or genetic sequences) in order to make predictions.


RNNs are called recurrent because they perform the same task for every element of the sequence, with each output depending on the previous computations (e.g. to predict the next nucleotide in a sequence, we need to know what came before).


RNNs can use gated memory units (such as GRUs and LSTMs), which allow them to capture and retain information about what has been calculated and seen so far.


RNNs are often used for natural language processing and text generation (our data is nucleotide sequences, which are also text).
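
As a minimal sketch of the recurrence described above (not VPRE's actual model), the following plain-Python example applies the same update to every nucleotide and carries a hidden state forward, so the prediction for the next base depends on everything seen before it. The class name `TinyRNN`, the hidden size, and the random, untrained weights are all assumptions made for illustration. A real model would use trained GRU or LSTM units.

```python
import math
import random

NUCLEOTIDES = "ACGT"

def one_hot(base):
    """Encode a nucleotide as a 4-element vector."""
    return [1.0 if base == n else 0.0 for n in NUCLEOTIDES]

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

class TinyRNN:
    """Hypothetical, untrained vanilla RNN used only to show the recurrence."""

    def __init__(self, hidden_size=8, seed=0):
        rng = random.Random(seed)
        self.hidden_size = hidden_size
        self.w_in = random_matrix(hidden_size, 4, rng)             # input -> hidden
        self.w_rec = random_matrix(hidden_size, hidden_size, rng)  # hidden -> hidden
        self.w_out = random_matrix(4, hidden_size, rng)            # hidden -> output

    def step(self, base, hidden):
        """Same computation for every element: mix the new base with the old state."""
        from_input = matvec(self.w_in, one_hot(base))
        from_memory = matvec(self.w_rec, hidden)
        return [math.tanh(a + b) for a, b in zip(from_input, from_memory)]

    def next_base_probs(self, sequence):
        """Read the sequence left to right, then score each possible next base."""
        hidden = [0.0] * self.hidden_size
        for base in sequence:
            hidden = self.step(base, hidden)
        return dict(zip(NUCLEOTIDES, softmax(matvec(self.w_out, hidden))))

model = TinyRNN()
probs = model.next_base_probs("ATGGCT")
```

Because the hidden state summarises the whole prefix, two sequences that end in the same base but differ earlier (for example "AT" and "GT") generally yield different next-base distributions.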

Why did we build our prototype on SARS-CoV-2 and not another virus?

Because we are living through the COVID-19 pandemic, which largely inspired this project, it made sense to build the prototype on SARS-CoV-2.


Moreover, with labs all over the world working on the coronavirus, new data and findings are constantly streaming into the literature, giving us more material to work with.


This pandemic is also a wake-up call for humanity, and we wanted to respond to it specifically.

Why did we model evolution on nucleotide sequences and not amino acid sequences?

We use nucleotide sequences rather than amino acid (protein) sequences in order to increase the diversity of our data.


Working at the nucleotide level lets our model capture more detail in how the spike protein changes. Proteins are translated from nucleotide sequences, and several codons can encode the same amino acid, so examining this starting material lets us detect changes that are invisible at the protein level.


Modelling evolution and mutations at the nucleotide level is closer to how evolution actually occurs, since mutations arise in the genome.
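
The extra sensitivity of nucleotide data can be illustrated with a short sketch. It builds the standard genetic code table and compares a reference sequence with two single-base mutants. The three 9-base sequences are toy examples chosen for illustration, not real spike protein data, and the helper names `translate` and `differences` are hypothetical.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate(dna):
    """Translate a DNA coding sequence into one-letter amino acids."""
    usable = len(dna) - len(dna) % 3  # ignore any trailing partial codon
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))

def differences(a, b):
    """Return the positions at which two equal-length sequences differ."""
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]

# Toy sequences (assumed for illustration only).
reference = "ATGGCTAAA"
silent = "ATGGCCAAA"   # GCT -> GCC, both encode alanine
changed = "ATGGATAAA"  # GCT -> GAT, alanine -> aspartate

print(differences(reference, silent))                        # [5]
print(differences(translate(reference), translate(silent)))  # []
print(differences(translate(reference), translate(changed))) # [1]
```

The first mutant differs from the reference at the nucleotide level but translates to the same protein, so a model trained only on amino acid sequences would never see that change.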

In our coronavirus prototype, why did we train our model only on spike proteins that bind to ACE2 and not on other coronavirus spike proteins?

ACE2 is the human cell receptor that SARS-CoV-2 binds in order to enter human cells.


We hypothesize that spike proteins with a better ability to infect humans have a survival advantage, and are therefore more 'fit'. This fitness can be measured by their receptor binding ability.


We validated our predictions by deeming a mutated sequence 'unfit' if it can no longer bind to ACE2.


Crossman (2020) found that predicted sequences lose specificity to ACE2 when the predictor is trained on generalized coronavirus sequences. Therefore, if we did not train our model on ACE2-binding spike proteins, our predictions would more often be 'unfit', because many of them would lose their specificity to ACE2.
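
The validation rule above can be sketched as a simple filter. This is only an outline under stated assumptions: `split_by_fitness` is a hypothetical helper, and `binds_ace2` stands for whatever binding check is actually used, which this page does not specify. The lookup table below is a placeholder with made-up labels, not real binding data.

```python
def split_by_fitness(predicted_sequences, binds_ace2):
    """Separate predictions into 'fit' and 'unfit' using a caller-supplied check.

    binds_ace2 is an assumed callable that returns True when a predicted
    sequence is judged to still bind ACE2.
    """
    fit, unfit = [], []
    for sequence in predicted_sequences:
        if binds_ace2(sequence):
            fit.append(sequence)
        else:
            unfit.append(sequence)
    return fit, unfit

# Placeholder labels for two toy sequences (assumed, not measured).
toy_labels = {"ATGGCTAAA": True, "ATGGATAAA": False}
fit, unfit = split_by_fitness(list(toy_labels), toy_labels.get)
```

Keeping the binding check as a parameter means the same filter works whether binding is assessed by a docking score, a structural model, or experimental data.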


Crossman, L. (2020). Leveraging deep learning to simulate coronavirus spike proteins has the potential to predict future zoonotic sequences. bioRxiv preprint. doi:10.1101/2020.04.20.046920.

What could we do better?

There are likely better and more accurate ways to model evolution than with a recurrent neural network. We weighed many variables when developing our solution and identified many components that we can improve upon in the future. The Future Directions page covers these in more detail.

UBC Virosight
