What is strong about our project?
What is weak about our project?
Why did we choose to use the methods that we did?
Is there a better way to do things?
RNNs can be trained on sequential data (e.g., sentences or genetic sequences) in order to make predictions.
RNNs are called recurrent because they perform the same computation for every element of a sequence, with each output depending on the previous computations (e.g., to predict the next nucleotide in a sequence, we need to know what came before).
Gated RNN units such as LSTMs and GRUs act as memory, allowing the network to capture and retain information about what it has seen and calculated so far.
RNNs are often used for natural language processing and text generation, and our nucleotide sequences can be treated as text as well.
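To make the idea of recurrence concrete, here is a minimal sketch of a plain (ungated) recurrent cell reading a nucleotide sequence one base at a time. It is not our trained model: the weights are random and untrained, and the names `predict_next`, `hidden_size`, and `seed` are assumptions made for illustration only. The point is that the same weights are reused at every position and that the hidden state carries information forward.

```python
import math
import random

NUCLEOTIDES = "ACGT"


def one_hot(base):
    """Encode a single nucleotide as a 4-element vector."""
    return [1.0 if base == n else 0.0 for n in NUCLEOTIDES]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows, cols, rng):
    """Small random weights; a real model would learn these from data."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def predict_next(sequence, hidden_size=8, seed=0):
    """Run an untrained recurrent cell over `sequence` (hypothetical helper).

    Returns a probability for each possible next nucleotide.
    """
    rng = random.Random(seed)
    w_in = random_matrix(hidden_size, 4, rng)             # input -> hidden
    w_rec = random_matrix(hidden_size, hidden_size, rng)  # hidden -> hidden
    w_out = random_matrix(4, hidden_size, rng)            # hidden -> output

    hidden = [0.0] * hidden_size  # the "memory" starts empty
    for base in sequence:
        # The same weights are applied at every position (recurrence),
        # and the new hidden state depends on the previous one.
        from_input = matvec(w_in, one_hot(base))
        from_memory = matvec(w_rec, hidden)
        hidden = [math.tanh(a + b) for a, b in zip(from_input, from_memory)]

    return dict(zip(NUCLEOTIDES, softmax(matvec(w_out, hidden))))


probabilities = predict_next("ATGTTTGTT")
print(probabilities)
```

Because the weights are untrained, the printed probabilities are meaningless as predictions. LSTM and GRU cells replace the single `tanh` update with gated updates so that information can be retained over longer sequences.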
Because this project was largely inspired by the ongoing COVID-19 pandemic, it made sense to build the prototype on SARS-CoV-2.
Moreover, with labs all over the world working on the coronavirus, new data and findings are constantly streaming into the literature, giving us more material to work with.
This pandemic is also a wake-up call for humanity, and we wanted to respond to it specifically.
We use nucleotide sequences rather than amino acid (protein) sequences in order to increase the diversity of our data.
This lets our model capture more detail in how the spike protein changes. Proteins are produced from nucleotide sequences, so examining the protein's starting material lets us detect changes more sensitively, including mutations that alter a codon without altering the amino acid it encodes.
Modelling mutations at the nucleotide level is also closer to real evolution, since mutations arise in the genome rather than in the protein.
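The difference in sensitivity can be illustrated with a small sketch. The sequences below are toy examples, not real spike data, and the codon table is only a partial excerpt of the standard genetic code (just the codons this example needs). The helper names `translate` and `count_differences` are assumptions for illustration.

```python
# Partial excerpt of the standard genetic code (DNA codons -> amino acids).
CODON_TABLE = {
    "AAT": "N", "AAC": "N",                          # asparagine
    "TAT": "Y", "TAC": "Y",                          # tyrosine
    "GAT": "D", "GAC": "D",                          # aspartate
    "GGT": "G", "GGC": "G", "GGA": "G", "GGG": "G",  # glycine
}


def translate(dna):
    """Translate a DNA string codon by codon, ignoring a trailing partial codon."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


def count_differences(a, b):
    """Count positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


reference = "AATGATGGT"  # toy sequence: Asn-Asp-Gly
silent = "AACGATGGT"     # AAT -> AAC: the codon changes, the amino acid does not
missense = "TATGATGGT"   # AAT -> TAT: the amino acid changes from Asn to Tyr

for name, variant in [("silent", silent), ("missense", missense)]:
    nucleotide_changes = count_differences(reference, variant)
    protein_changes = count_differences(translate(reference), translate(variant))
    print(name, nucleotide_changes, protein_changes)
```

Both variants differ from the reference by one nucleotide, but only the second is visible at the protein level. A model trained on amino acid sequences would never see the first kind of change.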
ACE2 is the human cell receptor that the SARS-CoV-2 spike protein must bind in order to enter human cells.
We hypothesize that spike proteins with a better ability to infect humans have a survival advantage, and are therefore more 'fit'. This fitness can be measured by their receptor-binding ability.
We validated our predictions by deeming a predicted sequence 'unfit' if its mutations leave it unable to bind ACE2.
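The validation rule above can be sketched as a simple labelling step. This is an illustration under stated assumptions, not our actual pipeline: `label_fitness`, `toy_binding_score`, `TOY_REFERENCE`, and the threshold value are all hypothetical, and the toy score (fraction of positions matching a made-up reference) merely stands in for whatever real measure of ACE2 binding is available.

```python
from typing import Callable, Dict, Iterable


def label_fitness(
    sequences: Iterable[str],
    binding_score: Callable[[str], float],
    threshold: float,
) -> Dict[str, str]:
    """Label each predicted sequence 'fit' or 'unfit'.

    `binding_score` is supplied by the caller and is assumed to return a
    higher number for sequences that bind ACE2 better.
    """
    return {
        seq: "fit" if binding_score(seq) >= threshold else "unfit"
        for seq in sequences
    }


# Hypothetical placeholder: a made-up reference, NOT a real binding motif.
TOY_REFERENCE = "ATGTTTGT"


def toy_binding_score(seq: str) -> float:
    """Fraction of positions identical to the toy reference (placeholder only)."""
    matches = sum(a == b for a, b in zip(seq, TOY_REFERENCE))
    return matches / len(TOY_REFERENCE)


predictions = ["ATGTTTGT", "ATGTTTGA", "CCCCCCCC"]
labels = label_fitness(predictions, toy_binding_score, threshold=0.75)
print(labels)
```

Keeping the scoring function as a parameter means the labelling logic stays the same if the placeholder is swapped for a real binding estimate.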
Crossman (2020) found that predicted sequences lose specificity to ACE2 when the predictor is trained on generalized coronavirus sequences. Had we not trained our model specifically on ACE2-binding spike proteins, more of our predictions would therefore have been 'unfit', because many of them would have lost their specificity to ACE2.
Crossman, L. (2020). Leveraging deep learning to simulate coronavirus spike proteins has the potential to predict future zoonotic sequences. bioRxiv preprint. doi:10.1101/2020.04.20.046920.
There are likely better and more accurate ways to model evolution than a recurrent neural network. We weighed many variables when developing our solution, and we identified many components that we can improve upon in the future.