00:07
Good afternoon everyone, and a very warm welcome to today's online Thought Leadership session, ‘Smart Molecules: Harnessing AI and Data to Advance Antibody and Protein Engineering’. My name is Cerlin Roberts, Managing Director of Oxford Global, and I'm delighted to be your chair for this webinar. We are thrilled to have you join us for what promises to be a very engaging hour exploring how artificial intelligence and data-driven strategies are transforming the landscape of antibody and protein engineering. Today we'll dive into these themes through an exclusive 20-minute interview with Rahmad Akbar, Senior Data Scientist in antibody design at Novo Nordisk.
01:00
Rahmad brings deep expertise at the intersection of data science and antibody design. Following this, we'll transition into a 30-minute panel discussion where we'll explore how best to normalise, integrate and share biological data.
01:24
There will also be time for the panellists to answer the questions you have posted.
01:31
And so, before we begin, just a quick reminder that this is a live virtual session and will be recorded.
01:39
So, let's get started, and it's my pleasure to introduce our Thought Leader for the session, Rahmad Akbar from Novo Nordisk. Rahmad, could you start by telling us a bit about your background and your role at Novo Nordisk?
01:54
Yeah, absolutely- a pleasure. Thank you for the kind introduction Cerlin.
02:00
I usually describe it to people this way: as an organisation, we design best-in-class antibodies. What does that mean? I used to be in academia.
02:12
In academia, it meant establishing proofs of principle for generative antibody design and for antibody-antigen binding prediction. At Novo Nordisk, or any other organisation pursuing commercial interests, the focus has shifted
02:30
to operationalising those proofs of principle. That means we are leveraging our legacy internal datasets, generating new datasets or acquiring datasets, and putting layers of machine learning models on top of them so that we can flow those models through portfolio research projects. This is how we accelerate Research and Early Development computationally.
02:55
So, what inspired you to specialize in the intersection of data science and antibody engineering?
03:03
I love it when people can be the best versions of themselves, and data science, AI/ML and antibody engineering are the tools I have chosen to realise this aspiration. This is primarily because these are the best set of tools that we have right now, and AI/ML is advancing not just biologics design but also
03:29
other areas outside of biologics. In fact, we are borrowing many technologies and algorithms from fields outside biologics design and bringing them to biologics design. The protein language model is one good example: it came from the Natural Language Processing field, which has large language models; we borrowed that architecture and created what we call the protein language model.
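As a concrete illustration of the borrowing Rahmad describes, here is a minimal sketch of embedding protein sequences with a publicly available protein language model. It assumes the open ESM-2 checkpoint on Hugging Face and the `transformers` and `torch` packages; it is not Novo Nordisk's internal tooling, and the sequences are made-up placeholders.

```python
# Minimal sketch: embed protein sequences with a public protein language model (ESM-2).
# Assumes `pip install torch transformers`; the checkpoint name and sequences are illustrative.
import torch
from transformers import AutoTokenizer, EsmModel

MODEL_NAME = "facebook/esm2_t6_8M_UR50D"  # small public ESM-2 checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = EsmModel.from_pretrained(MODEL_NAME)
model.eval()

# Toy antibody-like sequences (placeholders, not real therapeutic candidates).
sequences = [
    "EVQLVESGGGLVQPGGSLRLSCAAS",
    "QVQLQESGPGLVKPSETLSLTCTVS",
]

inputs = tokenizer(sequences, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Mean-pool the per-residue representations into one fixed-length vector per sequence.
mask = inputs["attention_mask"].unsqueeze(-1)
embeddings = (outputs.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)
print(embeddings.shape)  # (2, hidden_size): features for downstream property models
```

These fixed-length vectors are what typically feed the property-prediction layers discussed later in the interview.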
04:01
And they are the sharpest tools we have compared with the tools we had previously, physics-based design for example. The antibody is, in my opinion, the best binder format out there, not too dissimilar from the enzyme; in biology, the enzyme is, I think, the best catalyst, simply because nature has invested millions of years in designing these formats. For the antibody, that investment has allowed it to acquire intricate specificity against a target. So, it's a really strong platform for biologics.
04:45
So, how would you describe Novo Nordisk’s approach to leveraging AI in biologics discovery?
04:53
I hope you've seen some of the news from Novo Nordisk that was circulated.
05:00
I think it was around a couple of months ago, when we had a shift in focus in Research and Early Development. If you haven't seen it, it was said in many of the public communications that data and AI are at the centre of Novo Nordisk research. So, we're not treating AI and data separately; it's more like they are a package deal. You cannot do one without the other.
05:26
So, we are putting effort into generating data internally, leveraging our legacy datasets, of course, as well as acquiring datasets from external parties. And as I mentioned at the beginning, we want to have a robust machine learning layer on top of that, so that we can
05:44
extract knowledge from this data and then deploy those machine learning models to accelerate decision making when you are doing your design or optimising your molecules.
05:57
So, can you give us a real-world example where AI-driven methods have significantly accelerated or improved antibody design at Novo Nordisk?
06:12
A good example would be that, as a research organisation, we are pretty good right now at predicting a couple of properties.
06:24
We're pretty good at predicting and optimising expression, for example, for the expression systems that we care about. We're also pretty good at predicting the developability parameters of antibodies.
06:35
And just having these high-level predictors matters, because when you do design, the first thing you need to satisfy is that the molecules you design can be produced internally for characterisation. So, expression is the first thing you want to hit. What we've seen is that as we train and deploy machine learning models, and as we continue to improve them, because we generate data and refine the models at the same time, we are getting better and better at satisfying those criteria, both expression and developability. And that has allowed us to enrich
07:19
the plate, for example the design budget, with not just good binders but also well-behaved binders. I see, yes, which brings me to asking you: how does binder discovery and optimisation work today? Are we close to designing functional antibodies entirely in silico?
07:46
That's a tough question. So, I'm going to go with - we are closer than ever.
07:55
We've seen completely in silico designed drugs coming out of generative AI, for example, but that is in the small-molecule field.
08:06
We have generative-AI-designed small molecules that have gone into the clinic. And what I've also seen more and more for rigid binders, that is, for soluble proteins such as mini-proteins, is that we're getting better and better at designing binders
08:22
in silico for these types of molecules as well.
08:33
So, we see that traditional drug discovery relies very heavily on high throughput screening. How is AI enabling a shift towards a much more efficient, planned approach? I'm going to highlight a couple of points here.
08:55
The first point is that we are moving further away from high throughput screening.
09:03
And number two is that the quality of the datasets we're generating is getting better and better. So, I'm going to unpack the first one. High throughput screening relies mostly on chance and diversity: you screen 10⁷, 10⁸, 10¹⁰ molecules, for example, and that's why we need high throughput technology to screen. Generative AI in general has allowed us to move to a different way of doing this, with more efficient processes like an efficient design, build, test and learn cycle.
09:37
So here, instead of screening a huge diversity, you screen a plate, and the plate you screen is enriched with binders. This is a lot more efficient. So, we're moving from large, high throughput screens to leaner and leaner processes with AI.
09:59
I see. So, what are the challenges you face when trying to reduce experimental screening in favour of AI predictions?
10:08
We touched a little bit on this in the previous questions, where we said we want to enrich the design plate, or design budget, with good binders and well-behaved binders. In general, that is the challenge. With the technology we have right now, it works better for soluble proteins; it is still a large challenge to design flexible binders like antibodies with generative AI, for example, but we are making good progress on other properties. Prediction of binding is probably the hardest problem to solve, but predicting expression, as I mentioned earlier, and developability, we're getting pretty good at. What I want to highlight here is that the main bottleneck we are experiencing right now is predicting binding between the two molecules, between the antibody and the target. However you want to look at it, from sequence or structure, we still have a huge opportunity to improve there.
11:18
So how do you see, say, the balance evolving between in silico predictions and wet lab validation in the coming years?
11:28
As we get better and better at in silico prediction, it will assume the role of the workhorse, and I think wet lab validation will operate more and more at the QC level. Let's say you have a really good predictor, or your in silico generative models let you generate and screen large numbers of molecules in silico reliably. Of course, you always want to make sure that the quality of the prediction does not drop as you scale your discovery or your design, and this is where I think wet lab validation will play a role as QC, just like in manufacturing: you have a huge number of things that you manufacture, and sometimes you pull out one or two samples, or thirty samples, and do QC on them. So the workhorse role, I imagine, will be assumed by computational technologies, but of course the gold standard is still the wet lab experiment for QC.
12:40
That's brilliant. Thank you very much. So, what do you see as the biggest opportunity for AI in antibody R&D over the next three to five years?
12:55
The largest opportunity, in my opinion, is to improve the success rate. As I mentioned earlier, compared to rigid binders, the success rate for flexible binders like antibodies is much, much lower. Say you have a double-digit percentage success rate when designing binders to rigid targets; we often see zero to one per cent for antibodies or flexible binders. And if we want to convince not just ourselves but our colleagues, the ones doing the experiments, or the world, or other organisations, the improvement cannot be marginal. What we need is a large step forward. We have seen proofs of concept and proofs of principle for in silico designed antibodies.
13:48
But they're not en masse; they're still siloed, separate cases. We need to be able to reproduce this across different targets, essentially democratising the design and increasing the success rate. That will be the largest opportunity I see in the next few years: no longer relying on proofs of concept, but really using this as a workhorse.
14:16
I see, yes. I suppose the main concerns at the moment are data quality and model interpretability. How does your team address these issues in practice? Yeah, these are really exciting
14:34
challenges that you brought up. I'm going to take them one by one. The first is data quality. We touched on this at the beginning: we are moving away from high throughput towards leaner processes like design, build, test and learn. In such processes, since we are screening and characterising a smaller set of proteins, we are able to screen and characterise more thoroughly, so the data quality coming out of such a workflow, or such cycles, is a lot higher than from a high throughput workflow, for example. As we continue to pivot and accumulate data from this type of workflow, data quality will increase.
15:23
And the second point you made, on interpretability: this is also a large opportunity in the data-driven and AI field. I'm going to start from this definition: AI and ML are data-driven, statistically grounded technologies, so they rely on many samples and statistical rules to find patterns in the dataset. It's inherently not mechanistic, so you are not provided with interpretability baked in.
16:08
Recent advances in architecture may change this. For example, we are hearing now that the next wave of architectures is probably large reasoning models, where the model can reason, and that type of architecture may have interpretability integrated into the way the algorithm operates. Since we are still working with large language models, and this type of architecture is a bit obscure in terms of interpretability, there are efforts to look into interpretability, such as gradient-based attribution, where you try to trace back what is going on within the network from the output layer to the input layer. What we also typically do is pair a large model with a simpler, smaller, shallow model that has interpretability baked in, things like tree-based models; with a decision tree you can trace how the model prioritises features and how those features impact the prediction. So, we can pair a large, complex model with a simpler, more tractable model in order to deconvolute and de-risk the interpretability challenge.
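A minimal sketch of the simpler, more tractable pairing Rahmad mentions: a gradient-boosted tree trained on named, interpretable biophysical descriptors, whose feature importances can be inspected directly. The descriptor names and synthetic data are illustrative assumptions, not a real antibody dataset.

```python
# Sketch: a shallow, interpretable model on named biophysical features.
# Data is synthetic; feature names are illustrative placeholders.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
feature_names = ["net_charge", "hydrophobicity", "pI", "aggregation_score"]
X = rng.normal(size=(500, len(feature_names)))
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.3, size=500)  # toy target

model = GradientBoostingRegressor().fit(X, y)

# Feature importances show how the model prioritises each descriptor.
for name, importance in zip(feature_names, model.feature_importances_):
    print(f"{name}: {importance:.3f}")
```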
17:28
And what was the last point again?
17:31
So, how does your team address these issues in practice? In practice, we do a combination of these things. When we have a large language model and we want to deconvolute its features, for example: a large language model outputs, say, a thousand features, but you usually cannot associate them with biophysical features that you understand. So we take a set of biophysical features, train a simpler model, and look at the correlation between the large language model embeddings, or features, and the biophysical features we know and are familiar with. We see where the largest correspondence is, and that is how we deconvolute the embedding from the large language model. That's one example.
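One way to read the deconvolution Rahmad describes, sketched with synthetic numbers: correlate each embedding dimension from the large model against a small set of familiar biophysical features and look for the strongest correspondences. Dimensions, feature names and values are all illustrative assumptions.

```python
# Sketch: correlate language-model embedding dimensions with known biophysical features.
# All data here is synthetic; in practice both matrices come from the same set of molecules.
import numpy as np

rng = np.random.default_rng(1)
n_molecules, n_embed, bio_names = 200, 64, ["net_charge", "hydrophobicity", "pI"]

embeddings = rng.normal(size=(n_molecules, n_embed))          # e.g. from a protein LM
biophysical = rng.normal(size=(n_molecules, len(bio_names)))  # e.g. computed descriptors

# Pearson correlation of every embedding dimension against every biophysical feature.
corr = np.corrcoef(embeddings.T, biophysical.T)[:n_embed, n_embed:]

for j, name in enumerate(bio_names):
    best_dim = int(np.abs(corr[:, j]).argmax())
    print(f"{name}: strongest correspondence with embedding dim {best_dim} "
          f"(r = {corr[best_dim, j]:.2f})")
```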
18:28
Thank you for that. So how do you envisage AI transforming the entire drug discovery pipeline beyond antibodies? Do you have any view on that?
18:40
Beyond antibodies. So, what I have a good overview of is, let's say, molecular design, like antibody and target design. We are operating at the molecular design level, but there are large opportunities in downstream processes, for example the translation between computational and in vitro data, and between in vitro and in vivo data, things like this. And there are large opportunities in predicting clinical outcomes, and in accelerating and improving steps in the clinical stages with AI and ML as well.
19:26
I think we're approaching it stepwise. We are not yet so good at designing molecules that we don't need to invest in the downstream processes, but I think we are accumulating data there as well, and this translational aspect is, I think, what will ensure the success of the transition from hit ID to lead optimisation and on to the final drug product.
20:05
That's great, thank you ever so much. So, as a closing insight, if you had one piece of advice for organisations looking to adopt AI for antibody design, what would it be?
20:19
If you haven't invested in it, if you haven't tried it, please try it, because the technology is maturing. For antibody design it is still at an early stage, but it is maturing in protein design in general, and that benefit will transfer to antibody design. And if you're not ready to adopt it, then you might be left behind.
20:54
Thank you. Thank you ever so much for the insightful thought leadership interview. We are now moving on to our panel discussion, entitled ‘How to Normalise, Integrate and Share Biological Data’. So, I'm going to introduce Rahmad as our moderator for this panel discussion.
21:22
Okay, so we are now in the panel session, and I have the great pleasure of introducing three great people for the panel discussion. We have Talip Ucar, a Senior Director at AstraZeneca. We also have with us Esmaiel Jabbari, a Professor at the University of South Carolina. And lastly, we have
21:54
Thomas Kraft, a Principal Scientist at Roche. So, we have a good mix of panellists here from industry and academia, and we can cover a broad range of topics. Maybe I can pull in Talip first, because I'm most familiar with the industry scenario and I think I will share many of his answers. In general, in Big Pharma, in my organisation as well, much of the data capture was done pre-ML, so the data is a bit noisy, a bit unharmonised and a bit siloed; the whole thing is fragmented. How do you see that going forward, while at the same time taking advantage of the legacy data you have accumulated in the past?
22:53
That's a very good question. I don't know how to go about it without giving too much away, but I will try to answer. Basically, when it comes to data, when we started our journey three or four years ago, we first thought, okay, even before the data,
23:12
you should think about data infrastructure. We didn't have a proper data infrastructure that you could use to write and read data throughout the pipeline, so we started with that: what do you need in terms of infrastructure? Then, in terms of data, we went back and looked at what data we have, what historical data we have, whether it is usable and whether it is ready. And unfortunately, Big Pharma traditionally is not good at
23:53
keeping data in a good format. As you said, it's quite noisy. And unfortunately, traditionally, they used to keep the positive data; if the data was negative, it was usually pushed away and put aside. So, it was quite challenging to collect the data, curate it and make it useful. However, we were able to do it,
24:24
but we made a decision, and this is probably true for everyone else as well, that if you want to integrate AI into your processes in a really good way, you have to start almost from scratch: generate new data, with new policies and newly established standards. That's how we would want to do AI, by the way. So, yes, we are trying to take advantage of the legacy data, but we are investing a lot in data generation.
25:07
All right, so keeping with this data theme, I'm going to move to Thomas. We touched on this in the first session: we have many things to optimise for, developability, immunogenicity, manufacturability, human relevance, translation between in vitro and in vivo, and so on. And typically, we have
25:33
very little data for these different things we are interested in, and that is the challenge with machine learning models: it's probably tough to learn when the data is not sufficient. So how do we address it? Is it still a real challenge right now, in general or in your organisation, or has it been solved? How are we operating under the data sparsity we have right now?
26:10
Yeah, I think this is a very good point that probably we all are facing.
26:17
And you've touched on several of these aspects already. I think there are several things one can do. Of course, what we want to correlate with, or train towards, is ideally human data, and of course we have the fewest of those, the fewest compounds for which we actually have human data for antibodies. I think one way of addressing that is, if we want to use, for example, in silico or in vitro data to predict human outcomes, we can use an intermediate step, so to speak a bridge: an in vivo preclinical model and data from that, where we have much more data, and data which is also less biased towards well-behaved compounds. So that is, I think, one thing we can try to leverage, using that as our training set
27:10
to correlate to. The other thing is, of course, that if we talk about large language models or classic deep learning, that is really for large datasets. I think the opportunity here is really to focus on decision tree models, or gradient-boosted decision trees, where we can also use data that is not always complete: you don't have the same readout for every compound, sometimes there are gaps, and these boosted models in particular can be used to help fill that, or be used for these things.
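Thomas's point about gaps in the readouts can be handled directly by some boosted-tree implementations. A minimal sketch with scikit-learn's histogram-based gradient boosting, which accepts NaN entries natively; the data is synthetic and the setup is an assumption, not Roche's workflow.

```python
# Sketch: boosted trees trained on a feature matrix with missing readouts (NaN).
# HistGradientBoostingRegressor handles NaN natively, so incomplete assay panels can still be used.
import numpy as np
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
X = rng.normal(size=(400, 6))          # e.g. six in vitro readouts per compound
y = X[:, 0] - 0.5 * X[:, 2] + rng.normal(scale=0.2, size=400)

# Randomly blank out ~20% of entries to mimic compounds lacking some readouts.
mask = rng.random(X.shape) < 0.2
X[mask] = np.nan

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = HistGradientBoostingRegressor().fit(X_train, y_train)
print(f"R^2 on held-out compounds: {model.score(X_test, y_test):.2f}")
```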
27:49
I think those are probably some of the things we can do. And maybe to also answer your question: it is a challenge. It is not easy, and it's not solved. So those are the things we can do, and we can produce a few really specific data points to fill our gaps.
28:10
Thank you, Thomas, then I'll come to Esmaiel.
28:13
Since you are an academic, Esmaiel, I'm going to touch on publicly available datasets and publicly available solutions. We have publicly available datasets, and their origins are usually different, the way they are treated is different; everything about them is different. So, what do you need to take
28:43
into consideration when working with publicly available datasets? Do you need to do feature engineering, and how do you integrate them as well as you can, so that the data is informative for your model and not biased towards a certain origin, and the signal is not lost because of the high level of noise inherent in the dataset?
29:10
Thank you very much. So, we use some publicly available datasets in our research.
29:24
What we are doing in academia: I'm trying to identify peptide sequences with, you know, morphogenetic properties, and there is very little data out there, especially on peptide sequences with specific activities.
29:54
So, one of the things that we are doing: there is a lot of public data on proteins available.
30:07
For example, there is the Protein Data Bank, available for the structures of proteins.
30:14
There is another piece of software called AlphaFold.
30:22
It's software that predicts the structure of proteins for you, and we use that quite a bit in our research, because we are working with peptides with unknown structures; sometimes we use AlphaFold to find the structure of a new peptide. But one of the challenges in using something like AlphaFold, which is a very well-known
30:55
piece of software, is that the predictions are based on what is observed in proteins. As you know, short peptides, like 20 or 40 amino acid peptides, don't behave the same way when they are free as when they are part of a protein. That's one of the challenges we have, especially working with short peptides. So, one of the things we are doing because of that is trying to combine molecular simulation and
31:38
machine learning to predict the structure of peptides. We do molecular simulation at the molecular scale and try to find the structure of the peptide; that's one of the things we have been trying to do in my lab. And there are a lot of challenges when you use molecular simulation, because you may not get the exact three-dimensional structure unless you go to the atomic scale, but atomic-scale simulation is very time-consuming. There are also other publicly available databases, such as PubChem and ChEMBL, which cover the structures and properties of small-molecule compounds and which we sometimes use in our research.
32:43
There are a lot that are useful, and we use them a lot, the Protein Data Bank in particular. But we should be aware of some of the shortcomings of these software packages and data banks.
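For readers who want to try the public resources Esmaiel lists, here is a minimal sketch pulling one structure from the Protein Data Bank and one small-molecule record from PubChem's PUG REST service. The entry IDs are arbitrary examples, and the endpoints are the publicly documented ones as I understand them.

```python
# Sketch: fetch public data from the PDB and PubChem.
# IDs are arbitrary examples; both services are free and need no API key.
import json
import urllib.request

# 1) Download a PDB structure file by its four-character ID.
pdb_id = "1HZH"  # an example intact-antibody structure ID
urllib.request.urlretrieve(f"https://files.rcsb.org/download/{pdb_id}.pdb", f"{pdb_id}.pdb")

# 2) Query PubChem PUG REST for basic properties of a named compound.
url = ("https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/aspirin/"
       "property/MolecularWeight,CanonicalSMILES/JSON")
with urllib.request.urlopen(url) as response:
    record = json.load(response)
print(record["PropertyTable"]["Properties"][0])
```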
33:11
So, I think all of our panellists have echoed data challenges, whether low-N, noisy or unreliable data. Actually, Esmaiel, we see the same thing in antibodies: that loopy, flexible part is very difficult for an algorithm like AlphaFold to predict, which is why the multimer predictions, the binding predictions, the complex predictions for antibody-antigen complexes, score much lower than for rigid binders. Now, since we're talking about low-N, sparse, noisy data, I see an opportunity here: instead of working in silos, work together in a federated format. But we don't see much of this in the public domain, federated learning, for example. And this is a question to all our panellists: why do you think it's unpopular to
34:12
have this out there so people can work together, bring together different datasets from different sources, and train machine learning models on top of them, so that we can have a better model overall?
34:28
Yeah, maybe I can jump in. I think this has been tried before. A couple of years back, several small and large companies came together to do federated learning for small molecules; the project was called MELLODDY.
34:48
And in that exercise, they actually saw some benefit. I think after the very first exercise, they tried to follow up with a second one.
35:00
So, we talk about this data scarcity when it comes to, let's say, things like developability. And it is true that there's not much data out there,
35:12
and it is public knowledge that recently AstraZeneca and four or five other Big Pharma companies came together to do federated learning for
35:24
developability properties. So basically, there is some activity around that. It wasn't easy; it took a lot of conversations and discussions to convince everyone that this is a good thing to do. One of the big barriers was, I guess, at the executive level: we had to show that this is worth it. And the way to get buy-in from executives is to ask what it would cost to collect that data if we were to do it ourselves, versus doing federated learning with other companies and seeing how much cost we can save. That was a big selling point.
36:09
I think most people bought into it, and now we are in the first phase of this federated learning exercise. We will be working on that this year, and if it is really successful, the plan is hopefully to follow up with a second one.
36:30
So, there is some activity there.
36:34
Any comments from the other panellists?
36:38
So, I just want to add, and I think you alluded to this before: part of the reason is that the legacy data, the data that was previously collected, was very limited data with a purpose,
37:06
and actually, what should be done is for companies or institutions to collect complete data on a compound, on an antibody, on a peptide sequence, on a protein, to
37:28
produce complete data like gene expression, RNA expression, protein expression, binding. When data is limited and it has a purpose, it's very difficult to share that data, because there are privacy concerns and intellectual property concerns. But if the data is complete and less purposeful when it is collected, then it can be shared with others, and everyone can share it; the data could be in the public domain,
38:13
but when there is no intent or purpose attached to it, then every company or entity can use it for its own intent and purpose. There are some issues right now with privacy concerns, because it's limited data with a purpose, so it's not easy to share; perhaps that will change in the future. And I just want to say that it's really important to build very large international databases, because the future of Pharma is not so much doing experimental measurements as doing data mining: having very large databases, and when I say large, I mean millions and billions of data points. The future is to mine these databases, find information in them, and use it to produce new products. So that's my take on this.
39:19
Thank you, Esmaiel. Thomas, thoughts?
39:22
Yeah, very good points, and maybe to add to those: not only is the completeness of the types of readouts per antibody important, but what we have learned from our experience is that the molecules we designed not for a specific portfolio project, to hit a certain target, but as tool compounds to sample the entire available space, for example a certain biophysical space, not with the intention of serving a specific project but for learning, those were the most valuable ones for these types of data analysis.
40:00
Back then we did this, of course, for mechanistic understanding, like which biophysical property relates to PK properties and so on. But now these data, because they really sample the entire space, are really useful for learning, and I think they are also much easier to share, because there's no IP attached to them in the sense of 'this is a target, this goes with a specific project', and so on. So, I think there could be an opportunity, when we as an industry see where we have data gaps, to put some money together, design specific compounds for that purpose to close the gaps, agree on the type of data we want to generate with those compounds, generate them, and then learn together from them. So, I think there's an opportunity there as well.
40:51
This is a very exciting discussion, and I have so many follow-up questions. But maybe, Cerlin, can we take one or two questions from the audience? Yes. So, the first question is: how do you check the applicability of AI-generated data outputs?
41:18
So, if I get it right here.
41:22
The question is about synthetic data, data that is generated in silico by the model, and how we ensure or check whether it's applicable for our use cases?
41:36
That's right, yes. Panellists?
41:40
Yes, maybe I can start. So basically, as we build our machine learning pipelines,
41:49
one stage of the pipeline is to build predictive models, which we call oracles, several oracles. You use them to predict the properties of the molecules that you design, so that you can prioritise them before you send them to the lab. So, you predict things like affinity, stability,
42:11
aggregation and so on. So, basically, we have quite a few oracles to screen those molecules.
42:19
And as Rahmad mentioned, we do see some success there. For some properties we are more successful than for others. And again, as you said, one of the most challenging things is to predict binding ability.
42:39
And we are working on that.
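A minimal sketch of the oracle idea Talip outlines: several property predictors score each designed candidate, and only the top of the ranked list is sent to the lab. The predictor functions here are placeholders standing in for trained models, not AstraZeneca's; the weights and sequences are made up.

```python
# Sketch: rank designed candidates with several property "oracles" before lab work.
# The oracle functions are placeholders standing in for trained predictive models.
from typing import Callable, Dict, List

def predicted_affinity(seq: str) -> float:          # placeholder oracle
    return seq.count("W") + 0.5 * seq.count("Y")

def predicted_stability(seq: str) -> float:         # placeholder oracle
    return -abs(len(seq) - 120) / 10.0

def predicted_aggregation_risk(seq: str) -> float:  # placeholder oracle (lower is better)
    return seq.count("F") / max(len(seq), 1)

ORACLES: Dict[str, Callable[[str], float]] = {
    "affinity": predicted_affinity,
    "stability": predicted_stability,
    "aggregation_risk": predicted_aggregation_risk,
}

def score(seq: str) -> float:
    """Combine oracle outputs into one prioritisation score (weights are arbitrary)."""
    return (ORACLES["affinity"](seq)
            + ORACLES["stability"](seq)
            - 5.0 * ORACLES["aggregation_risk"](seq))

candidates: List[str] = ["EVQLVESGGGLVQPGGSWYLRLSCAAS", "QVQLQESGPGLVKPSETLSLTCTVSF"]
ranked = sorted(candidates, key=score, reverse=True)
print("Send to lab first:", ranked[0])
```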
42:45
So, I just wanted to add to what Talip just said.
42:55
One of the things that we do, for example, is take the predictions from the machine learning model: we get the peptide sequences with the properties we are looking for, and then we go back and do ligand-receptor interaction studies in silico, using the peptide sequences predicted by the ML model, to see whether there is an interaction. Of course, doing ligand-receptor interaction studies in silico is really time consuming; it takes a lot of computer time. So, let's say we get 50 to 100
43:51
peptide sequences from the machine learning model; then, from those 50 sequences, we use
44:01
molecular mechanics modelling software to do ligand-receptor interaction studies. From those 50, we find five to ten peptide sequences that we think do have the activities we are looking for. The next step is then to experimentally investigate the activity using cell culture. That's what we do next,
44:31
so, you know, we do quite a lot of computation, machine learning
44:41
modelling and machine learning prediction before we go into the experimental stage, because the experimental stage is obviously the costliest step. Then, when we go to experiment and we look at, say, five to ten sequences, we do extensive
45:00
studies of those sequences: we do genomics, we do proteomics, and we find out about the activity of the peptide.
45:14
Thanks, Esmaiel. Thomas?
45:17
Yes, I think another aspect that is also super important, and that we try to do, is to always ensure that the model is not over-predicting, especially for these prediction models. So, quality control steps like these 80/20 splits into training and test sets, everything that probably everybody is doing, are super important so as not to overtrain. Then, of course, you can train the model with all the data you have, stop training, and check against new data coming out of your normal pipeline, your normal experimental or in vivo results, whether the model you stopped training predicts it correctly. Wait for three months, collect whatever comes in terms of new data (especially at a Big Pharma company you do, fortunately, have an ongoing data stream) and see if that holds true. So, I think these are the high hurdles we have to apply to convince at least our internal stakeholders, who are often the hardest to convince, and then of course the field and the industry.
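A minimal sketch of the two checks Thomas describes: a standard 80/20 split during training, then a prospective check against data generated after the model was frozen. All data here is synthetic; the three-month batch is simulated by a later slice of the array.

```python
# Sketch: 80/20 validation plus a prospective check on data generated after training stopped.
# Synthetic data; the "new batch" stands in for results arriving from the ongoing pipeline.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
X = rng.normal(size=(600, 5))
y = X @ np.array([1.0, -0.5, 0.3, 0.0, 0.2]) + rng.normal(scale=0.3, size=600)

# Pretend the first 500 compounds existed at training time; the rest arrived later.
X_hist, y_hist, X_new, y_new = X[:500], y[:500], X[500:], y[500:]

X_train, X_test, y_train, y_test = train_test_split(X_hist, y_hist, test_size=0.2, random_state=0)
model = RandomForestRegressor(random_state=0).fit(X_train, y_train)

print(f"80/20 hold-out R^2: {model.score(X_test, y_test):.2f}")
print(f"Prospective R^2 on later batch: {model.score(X_new, y_new):.2f}")  # watch for a drop
```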
46:24
Yes, thank you, panellists.
46:27
I think for that first question I just want to add one sentence. The ultimate test of a predictive or generative model is the zero-shot setting; this is usually where a model fails to generalise, when the data lies well outside the training set. So that is a good benchmark to hit as well: if you're working with a predictive or generative model and the zero-shot performance is decent, then you probably have a good model in your hands. And maybe we take one more question from the audience? Yes, of course. The second question from the audience is: what approach have you taken to data registration, for example for antibodies? Direct all to one system and process (FAIR at source), or allow multiple setups and then aggregate/normalise as needed?
47:24
This is a tough one on the data side, so I'm just going to outsource it to our panellists, because they are super fluent on this. I'm going to start with Thomas this time.
47:36
Yes, very good question. It plays right into the pre-AI times, when data was mostly stored in Excel sheets and SharePoint sites and so on, and where we now, of course, have big efforts to move into FAIR data systems. My experience is that, specifically for compound-centric data, it works quite well to store this type of data in really one place, where we have all the pre-clinical data for a certain compound; that is super useful and really powerful. Having multiple different databases can work, and we have experience with that, but you need data crawlers that can access those, I guess for the everyday user,
48:24
and they need to be FAIR; they really need to qualify against the FAIR criteria, and then interoperability is really important. So, I think both can work, but I think the first one is easier if you have it all in one place, especially for compound-centric data.
48:45
Maybe then I move to Talip?
48:49
Yes, I mean, I do agree with Thomas. Maybe, instead of giving too much detail, what I will say is this: what we are trying to do is track the entire lineage of the
49:06
molecules that we design. Which means that, basically, from the moment we design a molecule, we should be able to track it all the way to the clinic; we should be able to track that whole lineage.
49:19
And we are building a data infrastructure to be able to do that. What it will do in the future is enable us to pull all that data in
49:33
whenever we want to do, let's say, multimodal training, or whenever we want to bring models that are needed downstream into the earlier stages of the pipeline.
49:48
And Esmaiel? Yes, I just want to add to what Talip and Thomas said; they gave a very good response to the question.
50:00
I just want to add that I find metadata is really important. By that I mean that a lot of the time, especially to build
50:14
very large databases, you need to bring in data from different labs, and especially when using machine learning for AI purposes, it's important to have an extensive description of the data, even a plain-language description. It's really important to record, for example, the conditions under which the experimental data were collected.
50:49
A lot of the time this type of information is missing from the data: you just have a table with, say, an antibody, binding, and so on, but other pieces of data, about the
51:12
researchers who actually did the experiments and the decisions they made in the process of doing those experiments, are normally missing. So I think it is very important to keep that information, even as text. There are now large language models that can actually read text and
51:37
turn it into a machine-readable form. So those are very important, especially for building large databases.
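Sketching Esmaiel's point about metadata, here is one possible minimal record for an assay measurement, with the experimental context carried alongside the value. The field names are illustrative assumptions, not any company's or consortium's schema.

```python
# Sketch: carry experimental context as structured metadata next to each measurement.
# Field names are illustrative, not a standard schema.
from dataclasses import dataclass, field, asdict
import json

@dataclass
class AssayRecord:
    compound_id: str
    readout: str                 # e.g. "binding_kd_nM"
    value: float
    instrument: str
    temperature_c: float
    buffer: str
    operator: str
    protocol_notes: str = ""     # free text that an LLM could later structure
    extra: dict = field(default_factory=dict)

record = AssayRecord(
    compound_id="AB-0001",
    readout="binding_kd_nM",
    value=3.2,
    instrument="SPR platform",
    temperature_c=25.0,
    buffer="PBS pH 7.4",
    operator="lab_member_A",
    protocol_notes="Regeneration step shortened due to ligand instability.",
)
print(json.dumps(asdict(record), indent=2))
```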
51:48
Excellent responses. Thomas?
51:53
A short response to that. I think this is also something where we often see a tension in our organisation. If you want it standardised, we can talk about standardised data, meaning the experiment was done in a standardised fashion, which is really good for machine learning because all the datasets are comparable; or you have standardised metadata, meaning it's not a free-text field but is always filled out in the same way, which again is ideal for learning. But it often limits the researchers, or it feels limiting in their daily work: I would like to have different time points for this particular study, or I would like to add certain parameters as free text because the other way is too cumbersome, so it limits me too much in my ability to add what I want to add. And that is a tension we are definitely facing, at least at Roche. I would be curious to hear how it is at other companies and places. But yes, I don't have a good solution for that; I think having it standardised is really good, but it comes with a tension.
53:09
So maybe I can jump in. Yes, I see the same tension in our organisation, Thomas, and the way we are going about it is that we start with a standard format and iterate on it.
53:23
So basically, it's a live format that we keep changing until it starts to work, and it is still a work in progress, so to speak. And I just want to add another thing here: sometimes data quality, or
53:47
changes in the data, depend on who did the experiment, or in which location or on which machine it was done. One of the keys to resolving some of these problems with the data is automation.
54:04
So, I think we should try to optimise as much of the process as possible, and that will enable us to standardise data collection as well. But I do agree, the metadata is super important.
54:21
I share that perspective. I think we have five more minutes. Do we have space for one more question? I think we should close soon.
54:37
Thank you ever so much for joining us today for this very insightful session on ‘Smart Molecules: Harnessing AI and Data to Advance Antibody & Protein Engineering’, and a very special thank you to our Thought Leader, Rahmad Akbar, for sharing his expert perspectives.
55:00
Thank you also to our panel for a very thought-provoking discussion, Talip, Esmaiel, and Thomas.
55:09
Thank you so much for your time, and to the delegates of this webinar, we thank you for your time and the good questions submitted throughout the hour. Please keep an eye on your inbox for a follow-up email, which will include a recording of the session as well as additional resources in this area. Thank you for being with us, and we look forward to welcoming you to our future webinars.
55:35
So, take care and enjoy the rest of your day. Thank you very much. Thank you, Cerlin. Thank you, panellists.