VPRE's workflow looks something like this:

- Data collection of viral sequences
- Bioinformatic analyses
- Variational autoencoder to compress sequences into numerical variables
- Gaussian process regression to model evolutionary trajectory
- Assessment of the validity and fitness of predicted sequences

Let's delve into each step.

Machine learning refers to computer algorithms that are able to improve automatically through experience.

Machine learning models can also adapt as their inputs change over time. For example, a text auto-completion system gradually adjusts its suggestions to match its owner's writing style.

Deep learning uses artificial neural networks - algorithms inspired by the workings of the human brain - to learn from large amounts of data.

We use deep learning in our daily lives. Functions like sentence autocomplete in Gmail, Google Translate, self-driving cars, image recognition, and many others all incorporate deep learning.

A latent space is a lower-dimensional representation of compressed data in which similar data points are closer together in space.
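To make this concrete, here is a toy sketch of the idea. The "encoder" below is a made-up, hand-written function (a real variational autoencoder learns its encoding from data): it squeezes a six-letter DNA sequence down to just two numbers, and sequences that differ by only one base land closer together in that two-number space than very different sequences do.

```python
# Toy illustration of a latent space: compress a DNA sequence into two
# numbers. The mapping here is hand-crafted for illustration only; a real
# VAE learns its compression from data.
CODES = {"A": 0.0, "C": 1.0, "G": 2.0, "T": 3.0}

def encode(seq):
    """Map a sequence to a 2-number latent point: the average base code
    of its first half and of its second half."""
    nums = [CODES[base] for base in seq]
    half = len(nums) // 2
    return (sum(nums[:half]) / half, sum(nums[half:]) / (len(nums) - half))

def distance(p, q):
    """Euclidean distance between two latent points."""
    return ((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2) ** 0.5

a = encode("ATGCAT")  # original sequence
b = encode("ATGCAA")  # one base changed: lands near a in latent space
c = encode("TTTTTT")  # very different sequence: lands far from a

print(distance(a, b) < distance(a, c))  # True
```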

Just like our own brains, neural networks get better and better at a task when you train them on data to practice with.

Training is similar to the way we learn new material by doing practice exams. After we answer a practice question, we check our answer against the correct one and reflect on what we did right or wrong, thus improving our understanding of the material.

The machine learning model is given a set of "practice questions" called a "training dataset". For each item in the training dataset, the model passes the input through a set of mathematical equations and produces an output. It then compares that output with the "correct answer" and adjusts the parameters of those equations to see whether the adjustment improves the correctness of the output. The performance of the model is therefore highly dependent on the quality of the training set, just as we need good practice exams to test our knowledge.
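The practice-exam loop above can be sketched in a few lines. This toy model has a single adjustable parameter `w` and learns the (made-up) rule "answer = 2 × question" from a tiny training dataset; the variable names and data are invented for illustration, not taken from VPRE's actual model.

```python
# Toy "training dataset": practice questions (inputs) and correct answers.
# The hidden rule is answer = 2 * question.
training_data = [(x, 2 * x) for x in range(1, 6)]

w = 0.0              # the model's single adjustable parameter
learning_rate = 0.01

for epoch in range(200):
    for question, answer in training_data:
        prediction = w * question              # model produces an output
        error = prediction - answer            # compare with correct answer
        w -= learning_rate * error * question  # adjust to improve correctness

print(round(w, 2))  # after training, w is close to 2
```

Each pass "reflects" on the error and nudges `w` in the direction that reduces it; with enough passes over a good dataset, the model recovers the underlying rule.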

Using deep learning algorithms, we would like to ask our computers to imitate the spike protein gene sequences in the training set and simulate other possible sequences.

Statistical models are often used for modeling evolution, and provide us with a tried and true method of generating likely mutations in viral spike proteins. These predictions are accompanied by the likelihood of their occurrence, and will be used to complement results from our deep learning algorithm.

For VPRE, the events we want to model are mutations in the spike protein. Using the same training data we used for deep learning, we count all the different kinds of mutations that have occurred in the past at each position of the genetic sequence. For example, we might have fifty A to C mutations in the training data at position 500 in the spike protein. We use these counts to calculate the probability of each type of mutation occurring at each position in the spike protein.
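The counting step can be sketched as follows. The reference and strain sequences here are tiny made-up examples (real spike protein sequences are thousands of bases long), but the logic is the same: tally every (position, from-base, to-base) mutation across strains, then turn the tallies into probabilities.

```python
from collections import Counter

# Hypothetical aligned sequences: an ancestral reference and later strains.
reference = "ATGCA"
strains = ["ATGCC", "ATGCA", "CTGCC"]

# Count each kind of mutation (position, from-base, to-base) across strains.
mutation_counts = Counter()
for strain in strains:
    for pos, (ref_base, new_base) in enumerate(zip(reference, strain)):
        if ref_base != new_base:
            mutation_counts[(pos, ref_base, new_base)] += 1

# Convert counts to probabilities: the fraction of strains showing
# that mutation at that position.
mutation_probs = {m: n / len(strains) for m, n in mutation_counts.items()}

print(mutation_probs[(4, "A", "C")])  # A-to-C at position 4: seen in 2 of 3
```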

The most likely mutations are applied to current strains to generate the most likely next strains and their associated probabilities of occurring. These statistical predictions will be used to inform the deep learning model.
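A minimal sketch of that final step, continuing the hypothetical numbers from the counting example: mutations above a (made-up) probability threshold are applied to the current strain, and their probabilities are multiplied to estimate how likely the resulting strain is.

```python
# Hypothetical mutation probabilities:
# (position, from_base, to_base) -> probability of occurring.
mutation_probs = {
    (4, "A", "C"): 0.66,
    (0, "A", "C"): 0.33,
}

current_strain = "ATGCA"
threshold = 0.5  # illustrative cutoff for "most likely" mutations

# Apply each sufficiently likely mutation to generate a predicted strain,
# multiplying probabilities to estimate that strain's likelihood.
predicted = list(current_strain)
strain_prob = 1.0
for (pos, ref_base, new_base), p in mutation_probs.items():
    if p >= threshold and predicted[pos] == ref_base:
        predicted[pos] = new_base
        strain_prob *= p

print("".join(predicted), strain_prob)  # ATGCC 0.66
```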