0:00 

So this is, I think, a pretty perfect segue from the previous talk, because I'm going to talk about what comes after de novo, and I really agree with Talib that you want something end to end.

 
0:10 
And I think one of the places that ML shines most is actually after de novo, when you get into lead optimization.

 
0:17 
So, quick order of affairs: I'm going to do a quick intro to Cradle and who we are, then a really brief look at the limits of de novo.

 
0:24 
I think that's mostly been covered or most people at this point are familiar. 

 
0:28 
And then I'm going to talk pretty extensively about our journey to the current lab-in-the-loop platform that we have, why we made the choices we made, and the challenges we encountered along the way.

 
0:39 
So first and foremost, Cradle is software. 

 
0:42 
So we're a software platform used for protein engineering by scientists.

 
0:47 
So we really target the full swath, from sophisticated computationalists to bench scientists who don't know anything about ML or stats.

 
0:58 
Our real focus is bringing you the best possible lab-in-the-loop machine learning.

 
1:02 
So we like to say we want to squeeze every bit of juice out of your data that we possibly can, because that's where the real benefit of ML is.

 
1:08 
It's not necessarily in these zero-shot models, although there is some benefit to be had there.

 
1:17 
We are sort of private by design. 

 
1:19 
So all your data stays with you, all the models trained on it stay with you. 

 
1:23 
So we don't let anybody else peek at models that are trained using your data. 

 
1:27 
We have a transparent software subscription license and you own the IP for all the sequences generated. 

 
1:33 
So we're not looking to be a CRO taking milestone or royalty payments. 

 
1:38 
We're really looking to be software. 

 
1:41 
We have 21 customers, 32 programmes. 

 
1:44 
Across those 21 customers, we've worked on 6 different modalities, so peptides, nanobodies, antibodies, enzymes, all sorts of things. 

 
1:56 
And we've worked on up to six properties within a single lead-optimization round. A demo is worth a thousand slides.

 
2:06 
So we may do a demo at the end if we have time or if not, you can come find me afterwards and we can set up a time to actually show you what the software is like. 

 
2:16 
OK, but I do want to talk a little bit about the science. 

 
2:19 
So: limits of de novo.

 
2:21 
I also considered facetiously titling this slide:

 
2:24 
The lab work isn't going anywhere. 

 
2:27 
I think that should be obvious, but where de novo is really effective is in this target-to-hit stage for really difficult-to-drug targets, or targets without known binders: getting just something that has some weak, specific binding.

 
2:42 
But it really struggles to get strong binders reliably, especially within a plate-based throughput. 

 
2:47 
And even when it does, those binders often need significant improvement to developability, which is another talk.

 
2:54 
I also want to point out here, if you look at the cost per launch of a drug, so, speaking to all the execs in the room: lead optimization is actually where the majority of the money gets spent in the development process.

 
3:07 
And so shortening that, making that more reliable and getting a better product going into clinical is actually the really critical part. 

 
3:14 
If you're able to get sort of that weak binder, that starting point. 

 
3:19 
So how does Cradle do this? 

 
3:21 
Well, I'm pretty limited in what results I can show you. 

 
3:24 
I'm going to show you mostly results from our internal lab, which we use to A/B test our models, as well as public competitions that we participated in.

 
3:31 
But this is one that was recently published. 

 
3:33 
So we came out on top in the Adaptyv protein design competition.

 
3:37 
But I really want to talk about this in the context of de novo, in terms of what we did do here and what we didn't do here.

 
3:43 
So, you know, we have a lot of great zero-shot technology that I'd love to talk about, but I really want to talk about what we didn't do here, which is: this is not de novo.

 
3:52 
You know, we started from a Cetuximab scFv variant, so a published sequence, and we re-engineered the framework regions and were able to get much better binding. And, a little teaser: we can really consistently get much better binding.

 
4:08 
But I think where this contrasts with the other competitors, who were doing generally more traditional de novo structure-based approaches, is that this sequence was ranked nearly last in the in silico screen that Adaptyv Bio used to select which things to take to the lab.

 
4:26 
And that's basically because it was based on some of these normal filters that people are using for de novo.

 
4:32 
So they really struggle to get the resolution on small numbers of changes.

 
4:38 
So this was about 10 mutations, and the effect that has on binding.

 
4:42 
Cool. 

 
4:44 
So I'm going to talk a little bit about our lab-in-the-loop platform, and I'm going to talk about this in terms of what we view as climbing the ladder of lab-in-the-loop maturity.

 
4:54 
So I think there's a lot of stages of building a lab in the loop pipeline and we've seen now with customers who are also building internal programmes that typically people hit the same challenges and they need to overcome the same thresholds that we did. 

 
5:10 
And so I'm just kind of going to walk you through them here. 

 
5:13 
So I've lumped these first two together because I think this challenge kind of goes without saying. 

 
5:18 
So: a single-property predictor, moving to multi-property predictors.

 
5:21 
Lead optimization is fundamentally a multi-property optimization problem.

 
5:27 
You need to hit all of your developability targets in order to move into the clinic, into animal models. 

 
5:33 
And if you try to do them one at a time, you'll be stuck on an endless merry-go-round of optimising one property and breaking the others.

 
5:42 
So we really want to do these all at once. 

 

5:46 
How do we do on this? 

 
5:46 
Well, we have another competition that we participated in, with Align to Innovate.

 
5:52 
We placed either first or second across all the multi-property prediction challenges, and we performed particularly well in the challenges that were hard for almost all the contestants.

 
6:02 
So there were thirty other teams here and these were primarily enzyme projects. 

 
6:07 
There's also 2 bars here for Cradle, and that's a little bit of a teaser for what comes next. 

 
6:12 
So the first one is a predictor. 

 
6:14 
So those turn sequences into scores. 

 
6:16 
The second one is actually us benchmarking our generator as a predictor. 

 
6:20 
So Talib actually alluded to that as well.

 
6:26 
The other thing we're really excited about is that we actually see multi-property prediction helping our rounds stay on the Pareto frontier, and not focus too much on one property over the others.

 
6:41 
So this was a bit of an A/B test. Here we have the parental sequence in red, melting temperature on the Y axis and expression on the X.

 
6:52 
And in the first round, zero-shot, we were able to stabilise it, getting an additional 9.2° of stability, but expression was on average 7 times lower.

 
7:03 
And when we optimised for only thermostability, when we used only a thermostability predictor, we actually lost 85% of the samples, because they wouldn't express enough for us to even measure thermostability.

 
7:15 
So that's why there are fewer orange points and they're kind of clustered up and to the left. 

 
7:20 
When we took a multi-property model and used that for generation, we got an even larger jump in stability, and we saw we were able to recover that expression.

 
7:32 
So we're able to get sort of higher expression and higher thermal stability by teaching a single model both properties. 

 
7:42 
What are the challenges that even a perfect predictor will not overcome? 

 
7:48 
So the sort of fundamental one, say you're doing like site saturation and then you're scoring all your single mutants, right? 

 
7:57 
You're going to get some cases, right, in which there are just no significantly improving single mutations.

 
8:05 
These are kind of what we would call local minima in an ML space. 

 
8:09 
And that's going to really prevent you from moving forward in your project. 

 
8:12 
And we don't want that to kill your project. 

 
8:14 
The other problem is that the more properties you try to optimise at once, the more likely this is to happen, because your functional landscape becomes harder, right?

 
8:21 
So say you're doing 6 properties at once, you're doing aggregation, thermal stability, expression, etcetera. 

 
8:28 
It might be actually quite likely at that point that you don't have any beneficial single mutations. 

 
8:34 
You kind of need to hop over this local minimum in the energy landscape.
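A toy illustration of that trap (hypothetical fitness values, purely to show the mechanism): a greedy walk over single mutants stalls at a local optimum even when a better multi-mutant exists one hop away.

```python
# Toy fitness landscape: "AA" is a local optimum for single-mutant search.
# Both single mutations away from "AA" score worse, but the double mutant "BB"
# scores best. Values are made up for illustration.

SCORES = {"AA": 1.0, "AB": 0.5, "BA": 0.5, "BB": 2.0}

def single_mutants(seq, alphabet="AB"):
    """Yield every sequence one substitution away from `seq`."""
    for i in range(len(seq)):
        for aa in alphabet:
            if aa != seq[i]:
                yield seq[:i] + aa + seq[i + 1:]

def greedy_step(seq):
    """Move to the best single mutant, but only if it improves the score."""
    best = max(single_mutants(seq), key=SCORES.get)
    return best if SCORES[best] > SCORES[seq] else seq

# Greedy search cannot leave "AA", even though "BB" scores twice as high:
assert greedy_step("AA") == "AA"
```

Scoring all single mutants of a real protein hits exactly this wall, just in a vastly larger alphabet and sequence length.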

 
8:39 
And unfortunately these really high-quality predictors are also expensive to do inference with. 

 
8:46 
So the computational costs make it prohibitive to do exhaustive searches in multi-mutant space.

 
8:53 
So I think we did a back-of-the-envelope calculation and found that doing inference on all triple mutants with our predictors, for a standard scFv, would cost something like $250K in GPU costs alone.

 
9:10 
So it's really not feasible. 

 
9:11 
We need to have a smarter exploration of the space. 
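The combinatorics behind that back-of-the-envelope number are easy to reproduce (my arithmetic, assuming a roughly 240-residue scFv and 19 alternative amino acids per position):

```python
# Back-of-the-envelope count of all triple mutants of a ~240-residue scFv.
# The length is an assumption for illustration, not an exact figure from the talk.
import math

length, alternatives = 240, 19
n_triples = math.comb(length, 3) * alternatives ** 3  # choose 3 sites, 19 options each
print(f"{n_triples:.2e}")  # on the order of 1.6e10 candidate sequences
```

Tens of billions of forward passes through a large predictor is what turns "score everything" into a six-figure GPU bill, hence the need for smarter exploration.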

 
9:15 
And so the way you make it over this hurdle is by moving to machine-learning-based generators.

 
9:20 
And there's some problems that come with that, which will necessitate doing sort of single property conditioning. 

 
9:25 
So OK, so how are we at that? 

 
9:27 
I'm just going to flash this slide up again. 

 
9:29 
We're pretty good at building generators.

 
9:31 
This was an out-of-the-box, no-data, zero-shot competition.

 
9:37 
But actually we have some updated results, which I don't think we've published yet, which are very exciting. As I mentioned, we placed pretty low by the in silico rankers.

 
9:46 
And actually we submitted 10 designs, nine of which were screened out by the in silico rankers.

 
9:52 
But after we won, we went back to Adaptyv and said, hey, can you screen those other nine? 

 
9:56 
It turns out all ten of our designs would have placed in the top ten in the competition.

 
10:03 
So we're doing this pretty reliably. 

 
10:06 
I also think this is again, really strong evidence that if you have a weak binder, right, if you have a starting place or if you're able to do screening, you're really best off starting from that weak binder. 

 
10:18 
And really, de novo has quite a way to go yet before it's ready to compete with that.

 
10:27 
We also have done quite a lot of work adding features to our generator which might not be present for others. 

 
10:33 
So this is allowing insertions, deletions and substitutions across the entire sequence. 

 
10:36 
So this is an alignment of a round we designed for a VHH, and you'll see quite a lot of gaps and different-length CDRs in CDR1, 2 and 3.

 
10:49 
This is actually the distribution of different sequence lengths and also substitution distances.

 
10:55 
So we're able to get up to almost 50 Levenshtein distance within this one round, with lengths spanning from, I believe, 112 to 125 amino acids.

 
11:06 
So quite a lot of diversity within this round. 
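For reference, the Levenshtein distance quoted here is the standard edit distance counting insertions, deletions and substitutions, which is why it can capture the variable-length CDRs above. A textbook dynamic-programming version (illustrative sequences only, not from the round shown):

```python
# Standard Levenshtein (edit) distance: minimum number of insertions, deletions,
# and substitutions to turn string `a` into string `b`. Row-by-row DP table.

def levenshtein(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))          # distances from "" to prefixes of b
    for i, ca in enumerate(a, 1):
        curr = [i]                           # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion from a
                            curr[j - 1] + 1,             # insertion into a
                            prev[j - 1] + (ca != cb)))   # substitution (or match)
        prev = curr
    return prev[-1]

# Toy example: one substitution (Q->K) plus one deletion (V) gives distance 2
assert levenshtein("QVQLVE", "QVKLE") == 2
```

A Levenshtein distance of ~50 on a ~120-residue VHH means a substantial fraction of the sequence was rewritten within a single round.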

 
11:09 
And within that round we're actually still able to find some sub-nM binders. 

 
11:13 
So this is the starting sequence here. 

 
11:15 
This was really not a super high hit rate, but we were just trying to see how much diversity we could pack into a single round and still get some good results out of a 384-well plate.

 
11:25 
So we were able to see, I think, 10 improved binders, 2 of which are sub-nanomolar, from the 5 nanomolar binder that came out of the immunisation campaign.

 
11:38 
And again, this is without any sort of training data. 

 
11:42 
OK, so what if we do have training data? 

 
11:44 
Actually, one of the things Talib mentioned that can be challenging is these generators not understanding the fine-grained mutation landscape.

 
11:53 
And one way that you can fix that is actually by conditioning your generators on the data that you've observed so far. 

 
11:59 
So that log-likelihood-to-function correlation: you can actually train the generators such that the correlation improves.
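As a rough sketch of what that conditioning can look like: the talk later mentions a multi-property DPO method, and the generic single-property DPO preference loss on assay-ranked sequence pairs has the shape below. This is the published DPO formulation in miniature, not necessarily Cradle's exact method; all numbers are illustrative.

```python
# DPO-style preference loss on a single (winner, loser) sequence pair.
# Given log-likelihoods from the trainable generator and a frozen reference
# model, push the generator to prefer the sequence that won in the assay:
#   loss = -log sigmoid(beta * ((logp_w - ref_w) - (logp_l - ref_l)))
import math

def dpo_loss(logp_good, logp_bad, ref_good, ref_bad, beta=0.1):
    margin = beta * ((logp_good - ref_good) - (logp_bad - ref_bad))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# When the generator already prefers the assay winner more strongly than the
# reference model does, the loss is smaller than in the reversed case:
assert dpo_loss(-10.0, -12.0, -11.0, -11.0) < dpo_loss(-12.0, -10.0, -11.0, -11.0)
```

Averaged over many observed pairs, minimising this loss is one way to make the generator's log-likelihood track the measured function, which is the correlation improvement described here.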

 
12:10 
And this is the result of doing that. 

 
12:11 
So on the bottom, we have kind of our control generator, which is a greedy exploit generator. 

 
12:17 
On the top, we have one of our generator-conditioning methods benchmarked, and you see this massive hump develop to the right.

 
12:24 
Unfortunately, I can't show you the axes here because this is actually a customer project, so I can't even tell you what the property is, but yeah, this has been really positive for us.

 
12:35 
In this case, these are, by the way, all predicted, because there are like 500,000 sequences in these distributions.

 
12:43 
So these are all predicted scores. 

 
12:44 
But the conditioned generator's hit rate went from 15% to 77% when we added that knowledge of the data into the generator.

 
12:56 
What is our next hurdle in climbing the lab-in-the-loop maturity curve?

 
13:06 
So one is that generators conditioned on a single property tend to focus on that property to the exclusion of everything else.

 
13:11 
So in the previous one, we conditioned on a property and we did see really good improvement for the generated sequences in the predicted value of that property. 

 
13:24 
But if we were, say, doing 6 properties, too much failure in the other properties would lead to a failed design round for us, and we'd have to recalibrate the parameters to try to get something else.

 
13:40 
So we want to fix this. 

 
13:42 
We also see, just in general, that people working with ML and generative ML tend to struggle with these generators and predictors favouring small areas of sequence space, really narrowing in on a solution that they're sure works.

 
13:57 
And fundamentally biology is a process that happens in batch. 

 
14:02 
You know, you're running plates of sequences and running a bunch of plates of nearly the same sequence is not a good use of your experimental bandwidth. 

 
14:12 
So the two methods we have here to overcome these problems are downselection and multi-property conditioning.

 
14:19 
So this is an example of multi property conditioning. 

 
14:22 
So it's basically a multi-dimensional version of that KDE plot that I showed you before. 

 
14:27 
Here we have sort of multi property DPO in blue. 

 
14:30 
So this is one of the methods that we've developed. 

 
14:33 
And then the single-property one, which is focusing on thermal stability, you can see on the bottom and more towards the right.

 
14:42 
And it's really kind of crushing expression again. 

 
14:44 
And the sort of unconditioned model in yellow. 

 
14:49 
And you can see that really the blue multi property generator is actually dominating the Pareto frontier here. 

 
14:57 
So we're quite happy with that result and this is some of the newer work. 

 
15:02 
So we're in the process of deploying this multi property conditioning into our production pipeline. 

 
15:09 
I think the final thing and I alluded to it a little bit before is selecting sequences such that they contribute positively to plate outcomes. 

 
15:16 
So "build plates, not sequences" is something we often say. You're not actually building a sequence-design pipeline.

 
15:23 
You're building a plate design pipeline because a plate is what the customer is going to take to the lab. 

 
15:28 
And you really want to build a diverse plate. 

 
15:32 
You want to build a plate that produces good training data downstream. 

 
15:36 
And you want a plate where you know you're going to get some results that look good so that the project doesn't get shut down and you know, you can keep working so that you can make nice plots and convince people to give you the money to run the next round. 

 
15:49 
And so we have developed an algorithm here which uses the confidence estimates of our models to generate hypothetical outcomes of the plate.

 
16:03 
And to maximise the chance that the best things in your plate are good, so that they hit your target product profile: when you're looking at that multi-property optimization problem, they're clearing all the bars simultaneously.

 
16:18 
We've seen this not only improve performance outcomes, but also result in a 5-fold increase in plate diversity as measured by redundant mutations. 

 
16:26 
So we're getting a lot more unique mutations without explicitly filtering for it here. 
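A hypothetical sketch of that "simulate plate outcomes" idea (my illustration, not Cradle's actual algorithm): sample assay outcomes from each candidate's predictive mean and uncertainty, and score a whole plate by how often its best member clears the target. Candidate numbers and the threshold are made up.

```python
# Hypothetical plate-design scoring: Monte Carlo over model uncertainty.
# Each candidate is summarised as a (predictive_mean, predictive_std) pair for
# one property; a plate is good if its best simulated variant clears the bar.
import random

def plate_success_prob(candidates, threshold, n_sims=2000, seed=0):
    """Fraction of simulated plates whose best variant clears `threshold`."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        outcomes = [rng.gauss(mu, sd) for mu, sd in candidates]
        if max(outcomes) >= threshold:
            hits += 1
    return hits / n_sims

# Four near-identical conservative bets vs a diverse plate with some
# uncertain-but-high-upside members: the diverse plate clears a high bar
# far more often, which is one argument for plate-level diversity.
safe_plate = [(0.9, 0.05)] * 4
diverse_plate = [(0.9, 0.05), (0.7, 0.4), (0.6, 0.5), (0.8, 0.2)]
assert plate_success_prob(diverse_plate, 1.2) > plate_success_prob(safe_plate, 1.2)
```

Selecting the plate that maximises this kind of success probability, rather than greedily picking the top-N sequences, naturally spreads picks across sequence space, consistent with the diversity gain described above.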

 
16:32 
The other thing that's really exciting is we've used this algorithm now for hit identification coming from large library screening, so from early discovery, as well as lead selection for the next round. 

 
16:45 
So often we were asking people, what do you want the template sequence to be for your next round? 

 
16:51 
Or what do you want to design based off of? 

 
16:53 
And they were like, well, you're the ML people. 

 
16:55 
Why can't you tell us? 

 
16:56 
And so we're like, OK, we should figure out how to do that. 

 
16:59 
And it turns out this algorithm can be repurposed for that really well. 

 
17:03 
And I have some results, hot off the press, but I don't think I quite have time to show them to you.

 
17:10 
So find me afterwards and I can click you through a very unpretty plot.

 
17:19 
So what's next for Cradle? 

 
17:22 
So I think the top things on our mind are automatically training and deploying customer-wide base models for developability properties.

 
17:30 
So you know you at your company may have a standard expression system for example. 

 
17:35 
And expression is a property that, we've seen from some of our early work in this area, generalises quite well across protein families from an ML perspective.

 
17:44 
And so, after we've seen your 10th project, we may be able to give you a model that works out-of-the-box for filtering on expression or generating high expressors.

 
18:01 
And you may not have to worry about that so much downstream in your pipelines anymore. 

 
18:07 
And we think there are quite a few properties this may be possible for at the company-wide level.

 
18:13 
The other one is allowing deployment of third-party predictors inside Cradle's pipeline.

 
18:19 
So if you're on some of those lower rungs of maturity, but you feel you're doing really good work there, right, we can integrate those predictors that you're building into our full generative pipeline. 

 
18:28 
You take advantage of all those higher rungs while keeping the problem-specific knowledge you've built at your company.

 
18:41 
Designing multi-chain complexes to support more antibody modalities is just kind of an obvious one, but we need to be able to design multi-chain complexes. And generating large libraries for early discovery is an area we really want to move into.

 
18:52 
So if you're interested in learning more about any of these things, come find me, or Sam who's in the back there, and chat to us, or e-mail sales@cradle.com

 
19:05 
Thanks. 

 
19:18 
We can also do a demo if I don't know how much time there is. 

 
19:21 
5 minutes. 

 
19:21 
OK, perfect. 

 
19:25 
So this is actually the live Cradle software, if anyone's curious.

 
19:30 
So I can go create a new project here. 

 
19:32 
Right now this is just a project title and you give us kind of the seed sequence that you're working off of. 

 
19:36 
I'm not going to do that because that actually kicks off some machine learning already. 

 
19:40 
So I'm going to, you know, pull a cake from under the table for the first time. 

 
19:46 
Then you add your project assays: if you want to add project assays, you go in here.

 
19:52 
We already have stability added, but you can add whatever you want, you know, and this one is "foos per second".

 
20:02 
And I'm just going to add that; say I have some custom assay that I'm running in house. Then I'm going to import project data, and I'll grab the example spreadsheet and drag that in.

 
20:22 
And so I basically just uploaded a CSV. I need to match the columns, you know, round ID, batch ID, and we have, you know, foos per second here and stability here.

 
20:38 
That's kind of high for a stability, but whatever. Cool, we're going to run the import, and then we're going to set objectives.

 
20:52 
So this is like I mentioned, it's fundamentally a multi property optimization. 

 
20:57 
So we have a primary objective. 

 
20:58 
We want to increase stability, but we also want our foo to be above our project's base sequence.

 
21:04 
Cool, keep that great. 

 
21:12 
And now we're ready to start our first round.

 
21:18 
There are a lot of parameters that we can configure when starting a round, so I'll just run through them really quickly.

 
21:24 
Obviously we say how much we want. 

 
21:26 
We pick template sequences. 

 
21:28 
We're in the process of plumbing the template sequence selection through to the software. 

 
21:33 
We can select the minimum and maximum number of mutations, and we can specify blocked positions.

 
21:38 
So say we don't want any cysteines hanging out in the first four residues.

 
21:45 
This is not a very long protein, so let's say for these two residues.

 
21:51 
Yeah, this is C and A.

 
21:56 
Cool. 

 
21:58 
So we're going to design out that initial cysteine in "cradle", and, I don't know, we'll end up with "ladle" or something.

 
22:04 
Great. 

 
22:05 
So this would then kick off a whole bunch of ML in the background, and you would come back to something like this report.

 
22:13 
By the way, we actually use this Cradle software for our internal lab as well.

 
22:20 
So we eat our own dog food, as they say. 

 
22:22 
And if I go back to here, you might see a report and this report would basically just give you a lot of metrics as well as the generated sequences that you could download and use in your lab. 

 
22:35 
So it's going to show you the 3D structure, it's going to show you the frequency of all the mutations etcetera. 

 
22:41 
And we're sort of adding new analysis to these reports every day. 

 
22:46 
That is the software. 

 
22:48 
I think hopefully I'm within my 5 minutes and we still have a little time for questions.