0:04
I'd also like to mention that our company, Enpicom, is a Dutch company that specialises in software solutions for this field, precisely because of the challenges we just heard about.
0:18
And I'm going to talk today about these challenges and how to solve them in your discovery workflows.
0:25
I'm going to start by talking about the data that generates those challenges, namely high-throughput sequencing and in vitro and in silico characterization. Then I'll talk about the challenges and how to solve them, and finish with an applied example of data-driven discovery.
0:41
So it's not all just theory.
0:45
So as you all know and was again just mentioned, the immune repertoire has massive capacity to create diverse receptors even in a single sample.
0:55
The number of unique clones that one can find is astounding.
1:00
So it's no wonder that NGS has allowed us to look at that diversity and interrogate it in a way that Sanger sequencing never could.
1:08
And this is probably why the number of studies using NGS for TCRs or BCRs has grown massively over the years.
1:19
And there is a lot of good evidence, a lot of publications out there, showing how interrogating NGS data can help you find better antibodies, be it from immunizations, compared to those you would get from hybridoma or single B cell isolation, or from display enrichment analysis, compared to the panning isolates.
1:40
These are just some examples. And the sequencing quality and breadth is getting better.
1:47
Illumina, for example, put out the new 2x300 chemistry for the NextSeq last year, which brings 4 to 12 times the amount of sequencing data you would get with the MiSeq, but also has better quality.
2:03
We ran a test with our sequencing partners and found especially good quality at the ends of the reads, which is quite important for long amplicons.
2:14
If you want to hear more about these results, we had a webinar with Illumina last year where we show the results and discuss the new chemistry and the things to take into account.
2:23
You can watch it on our website.
2:28
But as I was saying as well, in vitro characterization has also had many updates.
2:35
For example, the Beacon allows you to isolate and screen thousands of antibodies, and 10x can be paired with barcoded antigens to link paired-chain information with epitope binding information.
2:48
And Carterra's LSA allows you to get highly accurate measurements of affinity and epitope binning for hundreds of antibodies at a time.
3:02
And they have put out the LSA XT, which improves not only the sensitivity but also the speed of acquiring the affinity measurements.
3:13
And these datasets,
3:13
as was again highlighted in the talk before mine, are very interesting, especially when you combine them with the NGS repertoire.
3:23
You can extrapolate the findings on subsets of antibodies to a much larger pool of clones to interrogate.
3:31
This is one of the reasons we are collaborating with Carterra: to make sure their high-throughput antibody characterization can be easily integrated with these larger datasets, and also to enable modelling and new in silico predictions. And speaking of in silico predictions, that is one of the most exciting changes in the field.
3:52
And I wanted to take the opportunity to highlight some of the publications that we think are quite interesting.
3:59
As you all know, AlphaFold made a real leap forward in our capacity to model protein structures.
4:05
But nowadays we also have faster prediction models specialised for antibodies and TCRs, like ImmuneBuilder.
4:16
There are also models that learn what a human antibody looks like and predict a humanness score for a given antibody, and can even fix only the residues that will improve humanness without potentially hampering the affinity of that antibody.
4:34
And even general-purpose protein language models can allow you to optimise properties, including binding properties.
4:41
Just by modelling what a normal antibody looks like,
4:44
you can generate variants of an antibody that have better properties.
4:49
But of course that prediction of binding doesn't need to be blind.
4:54
If you have binding information, even binary information, you can use it to accurately predict both affinity and specificity.
5:03
These are just a couple of examples.
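To make that concrete: below is a minimal sketch of what learning from binary binding labels can look like, using scikit-learn and a handful of hypothetical CDRH3 sequences. This is an illustration of the general idea, not the models from those publications.

```python
# Minimal sketch: predicting binding from binary labels with scikit-learn.
# The encoding and the sequences are placeholders for illustration only.
import numpy as np
from sklearn.linear_model import LogisticRegression

AA = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {a: i for i, a in enumerate(AA)}

def one_hot(seq: str, length: int) -> np.ndarray:
    """One-hot encode an amino acid sequence, zero-padded to `length`."""
    x = np.zeros((length, len(AA)))
    for pos, aa in enumerate(seq[:length]):
        x[pos, AA_INDEX[aa]] = 1.0
    return x.ravel()

# Hypothetical CDRH3 sequences with binary binding labels (1 = binder).
seqs = ["ARDYYGSSYFDY", "ARDGGNSFDY", "ARWGGDGFDI", "ARDLLTGYYFDY"]
labels = [1, 0, 1, 0]

max_len = max(len(s) for s in seqs)
X = np.stack([one_hot(s, max_len) for s in seqs])
y = np.array(labels)

clf = LogisticRegression(max_iter=1000).fit(X, y)
# Probability of binding for each variant; with real data you would
# hold out a test set or cross-validate rather than score training data.
print(clf.predict_proba(X)[:, 1])
```

With a real campaign you would use far more sequences and richer features, such as language-model embeddings, but the principle is the same: binary labels are enough to start ranking variants.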
5:06
And the optimization of these properties can be done at the same time, so you're not losing, for example, affinity or thermostability while you're making those changes.
5:22
But to be able to generate those accurate in silico models, you need access not only to the large NGS datasets, to know what an antibody is, but also to the associated metadata, especially when you're doing multi-parameter optimization.
5:40
And this brings us to the challenges: how to combine all these different datasets efficiently to make the most out of them.
5:49
When you look at high throughput sequencing and screening, you need flexible and fast processing tools and an infrastructure that can house all these large datasets and make them accessible.
6:02
You also need intuitive visualisations, because, as we just explained, having all these different parameters means more complex datasets to look at and harder decisions on which antibodies to express and test in the lab.
6:16
And machine learning offers value, but it's only as good as the datasets you train it with.
6:22
So having this data organised and accessible to train the models, as well as having a way to feed those predictions back into your discovery campaign, is crucial to making the most out of it.
6:31
And to solve these problems and help navigate the funnel of selecting therapeutics, we created the IGX platform, which is our software solution designed to make data-driven antibody discovery accessible to everyone.
6:54
It can basically take any kind of sequencing information, Sanger, single cell, or NGS, and associate all the available metadata, in vitro and in silico, to support the decision on which leads to proceed with.
7:11
We have an array of different apps covering different functionalities, from gene annotation to QC of whole repertoires, but also functionalities tailored to specific workflows, like enrichment analysis, or hit expansion and hit selection from hybridoma, for example.
7:33
And if we look more specifically at those challenges: we built, for example, our annotation tool to handle data of any size, like the datasets we were describing before from NovaSeq or NextSeq.
7:45
As an example, annotating 1 billion reads can take less than 7 hours, and a full NextSeq run about 1.5 hours.
7:54
And once the data is processed, it's warehoused in an infrastructure designed to house all of this information and make it accessible.
8:03
Which means that you can have large NGS datasets and all historical data in one environment and access it for your analysis or for training your models.
8:14
You can query these billions of clones within seconds.
8:22
To make that integration into your discovery ecosystem even easier, we have built an API.
8:29
This first and foremost allows you to synchronise your raw sequencing files, because when the files are large, it's harder to drop them into an environment by hand.
8:36
So doing it through the API facilitates the process, as does exporting these datasets to LIMS systems, where you can keep your clones organised.
8:48
It also allows you to run your own custom algorithms.
8:51
If you have a preferred method that we're not yet covering, you can still analyse with that algorithm, and you can automate workflows to minimise clicks.
9:00
If a part of the process is always the same, it's better not to have to click 20 times to do it.
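As an illustration of what such automation can look like in practice, here is a hypothetical sketch against a generic HTTP API. Every endpoint, URL, and payload below is an invented placeholder, not the actual IGX API:

```python
# Hypothetical sketch of automating a discovery workflow over an HTTP API.
# Endpoint paths, parameters, and tokens are illustrative placeholders,
# not Enpicom's actual IGX API.
import requests

BASE = "https://example-igx-instance/api/v1"   # placeholder URL
HEADERS = {"Authorization": "Bearer <token>"}  # placeholder credential

# 1) Upload a raw sequencing file instead of dragging it into the browser.
with open("run_R1.fastq.gz", "rb") as fh:
    requests.post(f"{BASE}/files", headers=HEADERS,
                  files={"file": fh}).raise_for_status()

# 2) Kick off a predefined processing pipeline on the uploaded run.
job = requests.post(f"{BASE}/jobs", headers=HEADERS,
                    json={"pipeline": "annotation", "input": "run_R1.fastq.gz"})
job.raise_for_status()

# 3) Export the annotated clones to a LIMS once the job finishes.
clones = requests.get(f"{BASE}/jobs/{job.json()['id']}/clones",
                      headers=HEADERS).json()
requests.post("https://example-lims/api/clones", json=clones).raise_for_status()
```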
9:08
And last but not least, the fact that we have this infrastructure makes the platform quite an ideal environment for training machine learning models and applying them back to your discovery.
9:22
Just to zoom in on this last point: this infrastructure allows you to retrieve all your annotated clones and associated metadata to train your models, and to feed back in predictions, new clones, new information that can be crucial to selecting the best candidates.
9:42
And if you don't have sufficient expertise in house, we have our own proprietary protein language model, as well as a data science team that can assist you in making the most of the data you already have available.
9:56
This is an example of a use case that we did not so long ago where we created a protein language model to assess humanization.
10:06
Protein language models trained on human antibodies allow you to know the contribution of each residue towards the humanness of an antibody.
10:15
And with those contributions, you can calculate a log-likelihood per antibody, which on the whole gives you a perplexity score for each clone; as you can see in the graph generated from our data, this correlates quite nicely with how close to human these antibodies are.
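For reference, a general formulation of these concepts (not necessarily the exact definition used in our model): with per-residue log-likelihoods from a masked language model, the per-clone score is

```latex
% For an antibody sequence x = (x_1, ..., x_L), a masked language model
% gives per-residue log-likelihoods, which combine into a pseudo-perplexity:
\log p(x) \approx \sum_{i=1}^{L} \log p_\theta\!\left(x_i \mid x_{\setminus i}\right),
\qquad
\mathrm{PPL}(x) = \exp\!\left(-\frac{1}{L}\sum_{i=1}^{L} \log p_\theta\!\left(x_i \mid x_{\setminus i}\right)\right)
```

where a lower pseudo-perplexity means the sequence looks more like the human antibodies the model was trained on.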
10:32
And we used these concepts to build a humanization tool that converts your antibodies, one change at a time, into a more human antibody.
10:45
And it does so while you monitor the structure, or, for example, co-optimise for other properties, so you make the minimal number of changes that will allow your antibody to be used as a therapeutic.
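A minimal sketch of that greedy, one-change-at-a-time idea is below. The two scoring functions are toy placeholders standing in for a humanness model and a binding predictor; only the loop structure reflects the approach described:

```python
# Greedy one-change-at-a-time humanization sketch. The two scoring
# functions are toy stand-ins for a language-model humanness score and
# a binding predictor; only the loop structure is the point here.
AA = "ACDEFGHIKLMNPQRSTVWY"

def humanness(seq: str) -> float:
    """Placeholder: stand-in for a language-model humanness score."""
    return -seq.count("W") - seq.count("M")  # toy heuristic

def binding_ok(seq: str) -> bool:
    """Placeholder: stand-in for 'predicted binding not harmed'."""
    return "DY" in seq  # toy motif constraint

def humanize_step(seq: str, frozen: set[int]) -> str | None:
    """Try every single-residue change outside frozen positions (e.g. CDRs)
    and return the one that most improves humanness without, per the
    binding model, hurting binding. None if no change helps."""
    best, best_score = None, humanness(seq)
    for i in range(len(seq)):
        if i in frozen:
            continue
        for aa in AA:
            if aa == seq[i]:
                continue
            cand = seq[:i] + aa + seq[i + 1:]
            if binding_ok(cand) and humanness(cand) > best_score:
                best, best_score = cand, humanness(cand)
    return best

seq = "QVWLMQSGAEVKKPGADYW"
frozen = set(range(8, 12))  # e.g. positions covering a CDR, left untouched
while (nxt := humanize_step(seq, frozen)) is not None:
    seq = nxt  # you could stop earlier, whenever the trajectory suffices
print(seq)
```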
11:01
Yeah, so now I'll get to the actual example.
11:04
And for that I chose a publication that I was part of during my PhD.
11:09
And it's a study where we immunised macaques several times against the HIV envelope glycoprotein, and we isolated antibodies that were specific for the envelope glycoprotein by FACS.
11:24
We also captured samples from different tissues and did NGS after the immunizations.
11:33
So we basically have more than 600 sorted binders, of which just a subset was expressed, and even fewer were neutralising.
11:43
The slide says 5 binders, but it's 5 neutralizers of the final virus, and then more than 10 NGS datasets from different subsets and different time points as well.
11:57
One of the most interesting findings of the study is that when we looked at the lineage of the best binder, we didn't just find better binders in the NGS; the neutralising capacity, the functionality of those antibodies, also correlated quite nicely with the SHM.
12:16
So in order to recap this study, you need to be able to process these samples, Sanger and NGS, group clones across the different samples, and identify those lineages and the clones that were expressed.
12:31
So we've built our processing tool, as I mentioned before, to be flexible.
12:35
You can create an amplicon template that matches your library preparation technique exactly.
12:41
For example, in this case there's a UMI that has a specific pattern, and you can specify it so it's recognised during processing.
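As an illustration of that kind of pattern: a UMI design of, say, twelve random bases followed by a fixed anchor can be captured with a regular expression. This is a generic sketch, not IGX's configuration syntax:

```python
# Generic sketch of UMI recognition during read processing; the pattern
# here (12 random bases followed by a fixed GACT anchor) is hypothetical.
import re

# N{12} UMI, then a fixed 4-base anchor, then the biological sequence.
UMI_PATTERN = re.compile(r"^(?P<umi>[ACGT]{12})GACT(?P<insert>[ACGT]+)$")

def split_umi(read: str):
    """Return (umi, insert) if the read matches the library design."""
    m = UMI_PATTERN.match(read)
    return (m["umi"], m["insert"]) if m else None

print(split_umi("ACGTACGTACGTGACTCAGGTGCAGCTG"))
# -> ('ACGTACGTACGT', 'CAGGTGCAGCTG')
```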
12:50
Once the data is processed, it's very easy to select all the datasets together, Sanger and NGS, and group them with our clustering algorithm, where you can define the regions and the similarity threshold you want between the clones in a given cluster.
13:06
And in this case it's 80% CDR3 similarity and the same V gene, J gene, and CDR3 length.
13:13
So we were looking for actual B cell lineages.
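Conceptually, this kind of lineage grouping can be sketched as: bucket clones by V gene, J gene, and CDR3 length, then single-linkage cluster within each bucket at 80% CDR3 identity. A simplified illustration (not Enpicom's actual algorithm):

```python
# Simplified clonal-lineage grouping sketch: same V gene, same J gene,
# same CDR3 length, then single-linkage clusters at >= 80% CDR3 identity.
# Thresholds mirror the ones in the talk; the algorithm is illustrative.
from collections import defaultdict

def identity(a: str, b: str) -> float:
    """Fraction of matching positions for equal-length CDR3s."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def lineages(clones, threshold=0.8):
    """clones: list of (clone_id, v_gene, j_gene, cdr3). Returns clusters."""
    buckets = defaultdict(list)
    for clone in clones:
        _, v, j, cdr3 = clone
        buckets[(v, j, len(cdr3))].append(clone)

    clusters = []
    for members in buckets.values():
        groups = []  # single-linkage merging within the bucket
        for clone in members:
            hits = [g for g in groups
                    if any(identity(clone[3], c[3]) >= threshold for c in g)]
            merged = [clone] + [c for g in hits for c in g]
            groups = [g for g in groups if g not in hits] + [merged]
        clusters.extend(groups)
    return clusters

clones = [("mAb1", "IGHV1-2", "IGHJ4", "ARDYYGSSYFDY"),
          ("ngs7", "IGHV1-2", "IGHJ4", "ARDYYGSSYLDY"),
          ("ngs9", "IGHV3-23", "IGHJ4", "ARDYYGSSYFDY")]
print(lineages(clones))  # mAb1 and ngs7 cluster together; ngs9 is separate
```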
13:18
Once they're clustered, you can see all the clusters generated and easily identify the clusters that contain your Sanger clones.
13:25
That's the blue dot.
13:27
You can also easily overlay any information you have, like the neutralisation capacity that we saw before, so you can recognise which clusters could contain more antibodies of interest.
13:39
You can also directly look for that GM9_TH8 antibody that had the best characteristics by searching for its clone ID, which means you spend less time looking at NGS clones that might not bind or might not be specific, and more time looking at the clones that could be of interest.
13:59
And you can plot the phylogenetic tree in one click.
14:02
And then inside the phylogenetic tree you can overlay for example the sample origin so you can find the original clone.
14:09
You can see here the different samples that were clustered together.
14:13
You can also add, for example, the SHM, which in the study correlated with the neutralisation, so you can select clones for expression, as we did in the study.
14:25
Then, once you've expressed those clones, you can put that information back in and, without having to re-cluster, see which clones you've expressed.
14:34
In this case, they're marked with triangles.
14:36
These are the clones that were expressed in the study.
14:39
And if you've measured the neutralisation, it's quite easy to add that too, so you can visualise it and continue to select candidates from the branches of the lineages that could have the best binding.
14:54
So in summary, we managed to recap this study in one morning: processing 21 million reads in just over 20 minutes, clustering all the samples together in an hour, and selecting the clones that were chosen in the study in just a few minutes.
15:10
And the figure you see here is the one obtained with the software, as compared to the one from the study.
15:16
But we thought that it would be interesting to also show this additional step.
15:22
It is true that in the platform we already have ImmuneBuilder to assess developability.
15:28
So that would be an in silico prediction that you can add to the study.
15:32
But we wanted to use that humanization tool that I mentioned before to take the best binder and try to make it more human for a potential therapeutic.
15:42
The advantage we have is that we do have information on binding, which means that while you're humanising you can take that into account. We combined that binding information with non-binders from the pre-immunisation samples and trained a model to predict the correlates of binding.
16:02
And even though the dataset is not that large, we did find some correlation between the predicted binding and the neutralisation we had measured for the clones that were not included in the training set.
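That held-out check boils down to a rank correlation between predicted binding and measured neutralisation; with made-up numbers, for instance:

```python
# Sketch of the held-out check described above: rank correlation between
# predicted binding probability and measured neutralisation. All numbers
# here are invented for illustration.
from scipy.stats import spearmanr

pred_binding = [0.91, 0.74, 0.35, 0.62, 0.18]    # model output, held-out clones
neutralisation = [0.85, 0.60, 0.20, 0.70, 0.05]  # measured in the assay

rho, pval = spearmanr(pred_binding, neutralisation)
print(f"Spearman rho = {rho:.2f} (p = {pval:.3f})")
```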
16:16
So that means that when you do humanization, you can incorporate that optimization for binding, so that you at least don't make any changes that would go against the binding prediction.
16:29
And our model basically generates iterative changes to your sequence, so you can select at what point you want to stop the humanization.
16:39
In this case, we ran it leaving out the CDRs, but you can choose which regions to humanise.
16:45
And if you compare with a run that doesn't take binding into account, you can see that there are amino acids that would otherwise have been changed, but that our prediction says could have an impact on binding.
17:00
So in summary, I hope I managed to convince you that combining these datasets is really interesting and that these approaches, especially machine learning, are here to stay.
17:07
And the IGX platform can help you analyse these datasets by providing a scalable infrastructure, streamlining the analysis with intuitive visualisations, and supporting your in silico modelling for discovery.
17:24
Thanks for your attention and I'll be happy to answer any questions.
