So, I'm here to present some machine learning applications in antibody discovery, but mainly focusing on high level 

And if you don't know PipeBio, we're a cloud-based bioinformatics platform for diverse bioinformatics workflows, mostly focused on antibodies, but also other molecules in biologics discovery. 

I'm a field application specialist at PipeBio. I was originally going to be joined by Yannick, but I'm going to let the old man rest. So, it's just me here. 

 

So yeah, before getting started, we also have a booth down here. 
If you haven't seen us before, we're the purple booth over there. 

 
So after the talk, please come over, stop by and have a chat with us. 
 

And also if you're into running, just for future conferences, it's a bit late now, but if you're into running, we usually organise these running clubs in the mornings. 

 
So, if you're an early bird and like running, then join Antibody Athletes. 

 
It's great fun and a great way to meet other people working with antibodies. 

I’m just going to talk about general data challenges and the current bottlenecks that many companies and organisations, especially larger ones, see with antibody discovery: generating heaps of data, both sequence data and functional data, and how to then use that effectively. 

 
I'm also going to introduce our platform and give some concrete examples of analyses and of what you can do with our platform. 

 
So just starting with the challenges, I'm sure you're very familiar with these. The common struggle that we see is essentially data silos in organisations. 

 
So where you store your data, how you store your data and managing different data types. 

 
And also managing data flows, whether that's large amounts of data or smaller amounts of data. 

 
You can see that we have a bunch of different keywords here. 

 
They all make a mishmash of data types, or maybe properties that you want to predict or store, properties of antibodies. 

 
And surprisingly often we also see Excel files being used as a database by lots of scientists. 

 
 
I'm sure you're very familiar with this as well. 

 
 
And of course, Excel is very error prone. 

 
It's hard to manage. 

 
Of course, you can e-mail it back and forth, but that's not the ideal way to handle things. 

 
But it's difficult. 

 
And then in terms of larger scale data, we're generating more and more sequence data these days, functional assay data, as well as public repertoires of antibody sequences. 

 
So how do we use all of this data then? 

 
That's one of the challenges that I'm here to talk about a bit. 

 
So trying to make a bit of sense of this, we're looking at sequence data. 

 
So this is specifically from our perspective looking at sequence data, public databases as well as then assay data. 

 
Then for machine learning applications, in an ideal world, of course, we want to feed all of this into either sequence-based prediction, structure-based prediction or both. 

 
And on the side you'll see that we're also accruing, well, not us specifically, public data sets from high-throughput screenings. 

 
This could be, for example, through LIBRA-seq, which you're probably aware of, or BEAM from 10x Genomics (Barcode Enabled Antigen Mapping), as well as deep screening approaches for gathering almost library-scale volumes of antigen-specific binding data for specific sequences, specific antibodies. 

 
And then we want to feed all of this into machine learning models that we can use to predict either particular binding properties or developability parameters for the antibodies that we eventually want to push out to the clinic. 

 
And these are just some examples of the high throughput labelled data that we are able to generate these days. 

 
It's just one example, from Porebski and Holliger as well as a group at AstraZeneca, who developed a high-throughput way of obtaining this antigen-specific sequence data in a few days’ time, which is quite a lot of data that can then be used to feed the models, which are very data hungry, as we know. 

 
And of course, we're also interested in structural data. 

 
We have structure prediction tools whose output we can also use as a basis for the models that then predict specific protein properties, or properties of antibodies. 

 
So of course you're aware of AlphaFold, its capabilities and the large database of proteins that AlphaFold has generated so far. 

 
And then this is just an example from the Greiff Lab, where they also predicted antibody-antigen structure complexes, essentially using machine learning. 

 
And these are all very data hungry methods. 

 
And what we're essentially trying to do with this is predict the developability parameters. 

 
You can see both high-level and low-level parameters here, so specific ones such as immunogenicity, stomach interaction, instructability, secondary structure of antibodies, solvent accessibility, physicochemical and electrochemical properties, and stability of the antibody. 

 
So these are individual properties that you can predict. 

 
But then the question also comes up about high-level prediction. 

 
So how can we use all these parameters to maybe build a model on top of a model to predict a number of these from a single model? 

 
So these are just some examples of both open source and tools under licence. 

 
7:39 
For example, for humanization and immunogenicity prediction, there's Sapiens from BioPhi, or BioPhi as a toolkit. 

 
7:46 
Sapiens is the humanization module. AbNatiV, a VQ-VAE-based antibody and nanobody humanization toolkit from the Sormanni lab, is a machine learning model that helps with humanization. And then there's NetMHCpan and NetMHCIIpan, so MHC class I and class II prediction, for immunogenicity. 

 
8:21 
These are very widely used in the industry as well. Other tools cover SASA, solvent-accessible surface area, prediction of molecular and electrochemical properties, and overall developability parameters. 

 
8:39 
For example, TAP from Charlotte Deane's lab in Oxford, as well as CamSol or SoluProt for solubility, and also sequence optimization in the case of CamSol. 

 
8:51 
And all these take in either sequences or structures as inputs. 

 
8:58 
So again, referencing the vast amounts of data, these are just some sequencing platforms that are very frequently used, specifically in antibody discovery but also for peptides and other molecules. 

 
9:14 
So we're still looking at traditional Sanger sequencing, but then also high-throughput methods like PacBio for long-read sequencing, where you can obtain high-quality long reads. 

 
9:26 
So you can sequence both heavy and light chains together, for example in a single-chain Fv format, or even full length. 

 
9:34 
And then the Illumina platform, as you all know, is quite widely used for short-read sequencing, up to NovaSeq with billions of reads. 

 
9:47 
And we also have, as mentioned, different microfluidics workflows or platforms, such as the Beacon from Bruker, enabling assays at the single-cell level, single B cell, for example, and generating a lot of data from that as well. 

 
10:12 
Again, 10x Genomics is used for single-cell screening, and Nanovials as well are used for these single-cell screens. 

 
10:22 
So all this comes down to we're generating billions of sequences and millions of data points from assays. 

 
10:32 
And where does the data go? 

 
10:33 
We have public databases. 

 
10:36 
I'm not going to go through all of these, but you'll see OAS, PLAbDab and Thera-SAbDab on the left side, maintained by the Oxford group. 

 
10:46 
And then we have SRA, which is the Sequence Read Archive, holding massive deposits of data for publications, and PDB for structures. 

 
11:01 
And you'll see that the amount of data is growing quite rapidly with the high throughput sequencing that is enabled by today's technologies. 

 
11:12 
So where are we now, or what is our vantage point? 

 
11:16 
This is likely what most of us in this room are interested in. 

 
11:20 
So discovery, engineering and optimization. 

 
11:26 
So just briefly about where we sit in all of this. 

 
11:31 
So of course we placed ourselves in the middle, but this is just a small sliver of the whole ecosystem. 

 
11:39 
So what PipeBio does is basically enable sequence analysis. 

 
11:46 
So annotation of sequences, labelling this data and integrating it with sequence registries, which could be ELN/LIMS systems, LabKey or Genedata, for example. 

 
11:59 
And then also utilising data from sequencing vendors and from instruments by Carterra and Bruker, for example the LSA or the Beacon, and bringing all this data together in one place where you can actually make sense of it. 

 
12:19 
So. 

 
12:25 
What is then needed for this kind of standard data analysis? 

 
12:32 
Well, in the ideal world we have reproducibility. 

 
12:34 
So you can always go back to the results, reproduce those results. 

 
12:38 
We want to reduce manual mistakes. 

 
12:40 
As mentioned, Excel is not always the best way. 

 
12:44 
We want to save time, instead of having to send files here and there or request data from different departments. Communicating with scientists on a different side of the company can always take time. 

 
13:01 
You want something that's easy to use as well, that you can access, with results that you can visualise. 

 
13:09 
So if it's stored somewhere in a database and you only have numbers in a table, that's not very useful for you. 

 
13:16 
And in many organisations where you have a bioinformatics team, you have some commercial software, internal development, internal tools. 

 
13:25 
Those don't always have user interfaces that you can then as an end user use readily. 

 
13:31 
You might have to ask for help to get some results or to run an analysis. 

 
13:37 
And then you have open-source tools which also must be implemented in many cases. 

 
13:42 
Of course, many of these have web interfaces as well. 

 
13:45 
But then the question is, can you send the sequence data that you have to a public web server, or a web server that's not internal or secure? 

 
13:56 
That's then a trade-off between cost, usability, flexibility and of course the results that you obtain and how you obtain them. 

 
14:07 
So just looking at PipeBio, what the typical analysis workflow looks like on our platform is you'd import data. 

 
14:18 
We typically take in sequence data, but it could also be structural data and functional data. 

 
14:24 
So assay data from diverse instruments and in different formats. 

 
14:30 
So we try to be sequence agnostic in that sense, or molecule agnostic. We're focused on antibodies and TCRs, but we also support different peptides. 

 
14:40 
It could be DARPins, affibodies, bicyclic peptides, or short peptides. 

 
14:46 
And then when you bring this data into the platform, it's very important to be able to annotate and label it correctly. 

 
14:53 
And that's where we essentially help process the data and label it. In terms of antibodies, you'll see the regions that we typically annotate: framework regions, CDRs, germline genes and sequence liabilities. We then validate whether the sequence is correct or incorrect. 
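
As a rough illustration of the liability labelling mentioned here, a minimal Python sketch might scan a sequence for a few common textbook liability motifs. The motif table and the function name `find_liabilities` are assumptions made for this example and are not PipeBio's actual rule set or API.

```python
import re

# Hypothetical motif table: common textbook liability patterns,
# chosen for illustration only (not PipeBio's actual rules).
LIABILITY_PATTERNS = {
    "n_glycosylation": r"N[^P][ST]",  # N-X-S/T sequon, X is not proline
    "deamidation": r"N[GS]",          # Asn followed by Gly or Ser
    "isomerization": r"D[GS]",        # Asp followed by Gly or Ser
}


def find_liabilities(seq):
    """Return (name, 0-based position, motif) for every motif hit in seq."""
    hits = []
    for name, pattern in LIABILITY_PATTERNS.items():
        # A lookahead lets overlapping motifs be reported as well.
        for match in re.finditer(f"(?=({pattern}))", seq):
            hits.append((name, match.start(), match.group(1)))
    # Sort by position, then by name, so the output order is deterministic.
    return sorted(hits, key=lambda hit: (hit[1], hit[0]))


if __name__ == "__main__":
    for hit in find_liabilities("CARDGNGTW"):
        print(hit)
```

In practice the hits would be reported relative to annotated regions (for example, only within CDRs), which this sketch leaves out.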

 
15:11 
So even if you're using public repositories such as the SRA, you'll have a lot of un-QC'd data there, and if you want to use that, you want to process it ahead of using it. 

 
15:23 
And then in terms of bringing this sequence data together, it's about clustering and reducing it down to groups, and then using the functional data that you might accrue after this to visualise it. 
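
To make the clustering step concrete, here is a minimal sketch that greedily groups CDR3 sequences of equal length by fractional identity. The 80% threshold, the greedy strategy and the function names are assumptions for illustration; the talk does not say which clustering method the platform uses.

```python
def hamming_identity(a, b):
    """Fraction of identical positions; 0.0 if the lengths differ."""
    if len(a) != len(b):
        return 0.0
    if not a:
        return 1.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cluster_cdr3(cdr3s, threshold=0.8):
    """Greedy clustering: each sequence joins the first cluster whose
    representative (its first member) is at least `threshold` identical."""
    clusters = []
    for seq in cdr3s:
        for cluster in clusters:
            if hamming_identity(seq, cluster[0]) >= threshold:
                cluster.append(seq)
                break
        else:
            # No existing cluster was close enough, so start a new one.
            clusters.append([seq])
    return clusters


if __name__ == "__main__":
    print(cluster_cdr3(["ARDYW", "ARDFW", "AKGGGYW", "ARDYW"]))
```

A greedy pass like this depends on input order; sorting sequences by abundance first is one common way to pick more meaningful representatives.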

 
15:39 
So ELISA, SPR and BLI, and bringing this into a single database that you can then easily access and analyse, and then finally lead selection and optimization. 

 
15:53 
So once you've selected the subset of sequences that you are interested in, it's essentially about being able to use either structure or other developability predictions to then actually pick and optimise these clones. 

 
16:18 
This is very similar to before, so I'll probably just skip this slide. 

 
16:24 
So in terms of formats, we're also seeing quite a wide range of different formats, not only antibodies, but these are all antibody based. 

 
16:35 
So there are a couple of nice papers from a few years back, if you've missed them, by Wilkinson and Hale. They went through the INN lists, quite a big manual task, but the results and the data were very interesting to look at. 

 
16:52 
So they essentially gathered data for both IgGs, or antibody-based molecules, and Fc fusion proteins. 

 
17:04 
And you'll see that there are roughly 57 different molecular formats from the last decades. 

 
17:14 
And you'll see some of the most common ones, IgG being still the most common of these. 

 
17:20 
But you'll also see Fc fusion proteins, Fabs, and different conjugates here. 

 
17:30 
So you also need to be able to work with multiple different formats, analyse these, store these, and keep track of them. 

 
17:41 
So coming back to which formats you might want to use or analyse: TCRs, bispecifics, IgGs, Fabs, single-chain Fvs, VHHs, peptides. 

 
17:54 
So it could be short peptides or then longer ones as well and custom scaffolds, something out of the ordinary as well, different domains. 

 
18:04 
And in terms of diversity between the actual antibodies, you'd then have different species and different germlines. 

 
18:17 
You might want to use synthetic libraries as well. 

 
18:20 
So all of this then comes down to what you want to analyse and whether you have the flexibility to analyse this data. 

 
18:29 
And then what about other scaffolds? 

 
18:33 
I mentioned these before, again, just referencing back to our platform, we have customers working with some of these molecules as well. 

 
18:44 
And this is just an example of an annotation of an affibody on the platform. 

 
18:50 
And it's essentially customizable. 

 
18:52 
So you can customise different regions to annotate as required, based on your own definitions. 

 
19:02 
And what about quirky antibodies? 

 
19:06 
Well, one example is of course bovine antibodies with the ultralong CDR-H3s. They have the stalk and the knob. 

 
19:15 
And yeah, then you can of course use these to engineer more interesting molecules as well. 

 
19:23 
There was a good talk by UCB yesterday about some knob-engineered peptides. 

 
19:34 
So what does the platform actually look like? 

 
19:39 
This is all within the application. 

 
19:41 
So essentially fully controlling the data, having an interactive user interface and being able to apply labels to the data is one of our strengths and visualising this as well. 

 
19:54 
So you'll see different flags that we can output by default when you analyse any kind of NGS or Sanger sequencing data, and you can then use this. 

 
20:05 
Well, you would store this and then likely use it for optimization purposes as well. 

 
20:11 
So just gathering this data is very, very crucial. 

 
20:16 
And one example application just to show again some of the tools that we're working on and have developed. 

 
20:28 
Here's one example of machine learning guided hit picking. 

 
20:31 
So one of the challenges with using display technologies might be detecting grassroots-level binders. 

 
20:41 
So depending on your phage display setup, you might face different challenges for extracting the potential binders among the not so enriched binders. 

 
20:57 
So based on enrichment scores, using a public data set, we then created a machine learning model to predict a set of positive and negative rules based on the fold change of the sequences for each panning round. 

 
21:14 
So basically turning this phage display panning data into labelled data and then using a machine learning model to identify discontinuous motifs of amino acids. 
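
The labelling step described here, turning panning read counts into positive and negative labels by fold change, could be sketched roughly as follows. The pseudocount, the cutoffs of plus or minus 1 in log2 units, and the function names are assumptions for illustration; the talk does not specify the scoring or thresholds actually used.

```python
import math


def log2_fold_change(count_before, count_after, total_before, total_after,
                     pseudocount=1.0):
    """log2 ratio of pseudocount-smoothed frequencies between two rounds."""
    freq_before = (count_before + pseudocount) / (total_before + pseudocount)
    freq_after = (count_after + pseudocount) / (total_after + pseudocount)
    return math.log2(freq_after / freq_before)


def label_by_enrichment(counts_before, counts_after,
                        pos_cutoff=1.0, neg_cutoff=-1.0):
    """Label each sequence 'positive', 'negative' or 'unlabelled'."""
    total_before = sum(counts_before.values())
    total_after = sum(counts_after.values())
    labels = {}
    for seq in set(counts_before) | set(counts_after):
        lfc = log2_fold_change(counts_before.get(seq, 0),
                               counts_after.get(seq, 0),
                               total_before, total_after)
        if lfc >= pos_cutoff:
            labels[seq] = "positive"
        elif lfc <= neg_cutoff:
            labels[seq] = "negative"
        else:
            labels[seq] = "unlabelled"
    return labels


if __name__ == "__main__":
    round_1 = {"ARDYW": 10, "AKGGGYW": 80, "ATSFDY": 10}
    round_2 = {"ARDYW": 60, "AKGGGYW": 10, "ATSFDY": 15}
    print(label_by_enrichment(round_1, round_2))
```

Labels produced this way would then serve as training targets for a model; the sequences left unlabelled are the ambiguous middle that a fold-change cutoff alone cannot resolve.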

 
21:30 
So this is not based on traditional clustering. It groups sequences by potential properties that can indicate whether a sequence is a binder towards the antigen used in the panning. 

 
21:51 
So this is just showing the discontinuous motifs in the rule set, and you'll see that the diversity in the CDR3 is there for this data set. Although it's labelled as a single rule, there's still diversity and divergent residues in the CDR3. 

 
22:15 
And essentially, mapping this onto a phylogenetic tree, you would then see some of the sequences that are picked by the machine learning algorithm versus by traditional fold change calculations. 

 
22:38 
So moving on to then some of the next steps. 

 
22:42 
Obviously we have a lot of different open-source tools for structure prediction and we can implement these in different analyses. 

 
22:54 
You'll see some of the most common ones used for antibody structure prediction, ImmuneBuilder 2 by the OPIG group and AlphaFold as well. 

 
23:07 
So these two are probably the most prevalent ones that we've seen being used. 

 
23:14 
But you have multiple options as well. 

 
23:17 
And everything depends on how you then go ahead and benchmark. 

 
23:21 
Do you have internal data that shows you that a particular model works better for the antibodies that you're developing? 

 
23:30 
And I guess finally the question is then whether the models are accurate enough to then be able to educate you or to be used as input for predicting other protein properties among the antibodies that you're working with? 

 
23:50 
There was an interesting paper published yesterday by Andrew Martin's group. 

 
24:00 
That was about whether antibody CDR loops change conformation upon binding. They looked at 177 antibodies, essentially compared bound structures to unbound structures, and found that the conformation doesn't change or vary that much for CDR-H1 and CDR-H2. 

 
24:25 
Whereas for CDR-H3 there is more flexibility. 
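
One simple way to quantify this kind of bound versus unbound comparison is a per-loop RMSD over matched atoms. The sketch below is a generic illustration and not the method from the paper mentioned; it assumes the two structures have already been superposed (for example on the framework) and that the atoms in both lists are in the same order.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equally long lists of
    (x, y, z) tuples. No superposition is performed here."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and equally long")
    squared = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(squared / len(coords_a))


if __name__ == "__main__":
    # Made-up coordinates for two atoms of a loop, unbound vs bound.
    unbound = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
    bound = [(0.0, 0.0, 0.0), (1.0, 0.0, 2.0)]
    print(round(rmsd(unbound, bound), 3))
```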

 
24:28 
So that's also interesting to consider for modelling and docking applications, for antibody-antigen structure predictions. And just wrapping up on this side. 

 
24:45 
So again, coming back to the platform, it's very important then to be able to tie some of these tools together. 

 
24:54 
And of course, security is always a consideration for proprietary data and storing the results. 

 
25:01 
So whether you store the Excel file on your desktop or in the database in a structured form, that makes a huge impact on how that data can be used and shared with colleagues and how you can actually aggregate this data and effectively use it then for predictions in terms of machine learning models. 
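
To illustrate the difference a structured store makes compared with a spreadsheet on a desktop, here is a minimal sketch using Python's built-in `sqlite3` module. The table names, column names and values are invented for the example and do not reflect PipeBio's data model.

```python
import sqlite3

# In-memory database with a hypothetical two-table schema:
# one row per clone, and one row per assay measurement.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sequences (
        clone_id TEXT PRIMARY KEY,
        cdr3     TEXT NOT NULL
    );
    CREATE TABLE assays (
        clone_id TEXT REFERENCES sequences(clone_id),
        assay    TEXT NOT NULL,
        value    REAL NOT NULL
    );
""")
conn.executemany("INSERT INTO sequences VALUES (?, ?)",
                 [("c1", "ARDYW"), ("c2", "AKGGGYW")])
conn.executemany("INSERT INTO assays VALUES (?, ?, ?)",
                 [("c1", "elisa_od450", 1.8),
                  ("c2", "elisa_od450", 0.2),
                  ("c1", "spr_kd_nm", 5.0)])


def clones_above(connection, assay, cutoff):
    """Join sequence and assay data; return clones at or above the cutoff."""
    rows = connection.execute(
        """SELECT s.clone_id, s.cdr3, a.value
           FROM sequences AS s
           JOIN assays AS a ON a.clone_id = s.clone_id
           WHERE a.assay = ? AND a.value >= ?
           ORDER BY a.value DESC""",
        (assay, cutoff),
    )
    return rows.fetchall()


if __name__ == "__main__":
    print(clones_above(conn, "elisa_od450", 1.0))
```

Once sequences and measurements share a key like this, aggregating them into a training table for a model is a query rather than a manual copy-and-paste exercise.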

 
25:25 
And of course, accessibility of data. 

 
25:27 
So interpreting the data, visualising the data and finally also communicating results within an organisation is something that is our focus as well. 

 
25:41 
And just going through some visualisations, not actually spending a lot of time for these. 

 
25:48 
But again, going back to interpretability, being able to visualise the sequences, visualise the functional data, overlay this data and interpret the outputs of machine learning models is something that we invest heavily in. 

 
26:03 
Again, some eye candy to wrap up this talk, just a few examples showing the flexibility and the analysis capabilities of the platform. 

 
26:18 
We've done a few collaborations. 

 
26:20 
These are just some examples, where a group used PipeBio to analyse sequences in an HCAb B cell interrogation, so of heavy-chain-only antibody sequences. They used the Beacon as the main way to identify potential binders and then used NGS to mine similar clones in a repertoire. 

 
26:57 
Or sorry, used our platform to mine similar clones in an NGS repertoire that they even sequenced from the same cells. 

 
27:04 
So who do we work with? 

 
27:07 
We have a range of customers and we can't, of course, mention all of them, but here's just a handful, and you'll see, if you recognise any of these names, that it's not only antibodies. 

 
27:19 
So there's quite a wide range of molecules that we are also involved in working with. 

 
27:30 
And just finally, we're always looking for new members for our team. 

 
27:35 
So if you or anyone you know would be interested in joining our team, then please let us know. 

 
27:43 
You can reach out to us at jobs@pipebio.com. 

 
27:46 
And we're also fully remote, so that doesn't put geographical boundaries on that. 

 
27:53 
So something good to keep in mind and I believe this is the last slide. 

 
27:59 
So thank you all.