While building the VPRE prototype, we generated a variety of visuals to represent the data we worked with. This data included thousands of SARS-CoV-2 genome sequences from GISAID. We used these sequences to train our software to model evolution accurately and predict future coronavirus vaccine targets.
We generated a choropleth map to show the global spread of COVID-19 from January to May 2020.
We tested many different parameters (for example, training time) while we trained the model. Our best prediction achieved over 99% similarity to a known SARS-CoV-2 spike protein, meaning we could confidently map our predicted sequence back to a known spike protein.
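To make the similarity figure concrete, here is a minimal sketch of one common way to score two sequences: percent identity over a pairwise alignment. It is an illustration only. The text above does not say which tool or definition produced the 99% figure, and the function name `percent_identity` and the toy sequences are hypothetical.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent of aligned columns where both sequences carry the same residue.

    Assumption: both inputs are already aligned (equal length, '-' for gaps).
    Columns where both sequences have a gap are ignored; a gap opposite a
    residue counts as a mismatch.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("sequences must be aligned to the same length")

    compared = 0
    matches = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == "-" and b == "-":
            continue  # shared gap column carries no information
        compared += 1
        if a == b:
            matches += 1

    if compared == 0:
        raise ValueError("no comparable columns")
    return 100.0 * matches / compared


if __name__ == "__main__":
    # Toy sequences, not real spike protein data.
    print(percent_identity("ACGTACGT", "ACGTACGA"))
```

Other definitions (for example, dividing by the length of the shorter sequence, or scoring similar residues as partial matches) give different numbers, so the definition should always be reported alongside the percentage.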
Shown on the right is a dot plot that compares our prediction to a known spike protein. Both axes represent nucleotide positions. A line indicates that the corresponding region on the x-axis matches the region on the y-axis. Our prediction has three fragments that map to the known spike protein. The first fragment matches perfectly at the expected position.
The second fragment maps slightly later than where we would expect to find it on an actual spike protein, which is why there is a gap between fragments 1 and 2.
The third fragment maps to an earlier region of the actual spike protein, so our prediction is missing the final portion of a spike protein.
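The dot plot described above can be sketched in a few lines of code. This is a simplified illustration, not the tool used to produce the figure: it marks a dot wherever a short word from one sequence exactly matches a word in the other, then groups consecutive dots on the same diagonal into fragments. The function names, the word size, and the `min_len` threshold are all hypothetical choices.

```python
from collections import defaultdict


def dot_plot(seq_x: str, seq_y: str, word: int = 3):
    """Return (x, y) start positions where a word of length `word` matches exactly."""
    # Index every word in seq_y by its start position.
    index = defaultdict(list)
    for j in range(len(seq_y) - word + 1):
        index[seq_y[j:j + word]].append(j)

    points = []
    for i in range(len(seq_x) - word + 1):
        for j in index.get(seq_x[i:i + word], []):
            points.append((i, j))
    return points


def diagonal_fragments(points, min_len: int = 2):
    """Group dots into runs along diagonals.

    Returns sorted (x_start, y_start, length_in_dots) tuples. Dots on the
    same diagonal share the same offset y - x.
    """
    by_diag = defaultdict(list)
    for x, y in points:
        by_diag[y - x].append(x)

    fragments = []
    for offset, xs in by_diag.items():
        xs.sort()
        start = prev = xs[0]
        for x in xs[1:]:
            if x == prev + 1:
                prev = x  # run continues along the diagonal
                continue
            if prev - start + 1 >= min_len:
                fragments.append((start, start + offset, prev - start + 1))
            start = prev = x
        if prev - start + 1 >= min_len:
            fragments.append((start, start + offset, prev - start + 1))
    return sorted(fragments)


if __name__ == "__main__":
    # Toy sequences, not real spike protein data.
    predicted = "CGTGA"
    known = "ACGTGA"
    print(diagonal_fragments(dot_plot(predicted, known)))
```

In this representation, a fragment on the main diagonal (offset 0) sits at the expected position, a positive offset means the fragment maps later on the y-axis sequence, and a negative offset means it maps to an earlier region. Changes in offset between consecutive fragments appear as gaps or shifts in the plot.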
We also recognize that our training dataset may not be large enough. With too little data, the model can follow the pattern of an actual spike protein so closely that its predictions are nearly identical to it, which means the model is 'overfitted'.
We are currently running receptor-binding simulations of our predicted spike proteins. Stay tuned for the results.
Pictured above is a SARS-CoV-2 spike protein binding domain (blue) interacting with the ACE2 receptor (green).