Building reliable AI systems at Google DeepMind: Lessons from the trenches
What's discussed in the video
Now I'd like to welcome on stage Paige, representing DeepMind, to talk a little bit more about that.

Amazing. I'm so excited to be here and to talk with you all a little bit about how we're building models at DeepMind, how we're incorporating evals along the way, and how we make sure these models are resilient, that we have a good heartbeat on their capabilities, and that we understand how they fit into the many application systems we're deploying around Google and Alphabet. I'm sure all of y'all in the room are deeply committed to this as part of your business.

My name is Paige Bailey. I lead the developer relations team at Google DeepMind, and I started doing machine learning and AI around 2009, back before it was cool, but also when everything in general felt a lot simpler. AI is certainly transforming everything we do at Google, from the way we build applications to the way we deploy them. Sundar mentioned a while back that 25 percent of the code checked in every single week is generated by AI, and that number has only increased as we've gotten Gemini 2.0 and Gemini 2.5 into the hands of more developers internally.

We've been thinking about AI for a long time, from the days when we released our first machine learning frameworks, through things like AlphaZero, AlphaFold, and AlphaStar, all the way up to our latest flagship model, Gemini. Gemini 2.5 Pro and 2.5 Flash are the most recent models we've released. Flash is still in early testing; we announced it last week at Google Cloud Next. But it's remarkable for a number of reasons. One is that it's multimodal: it can understand video, images, audio, text, code, and full code bases, all at once. Even cooler, it can also output multiple modalities. You can generate text and code, sure, but you can also create images, edit images, and generate audio natively with the same model, as opposed to having a diffusion model stacked on top. It's really cool the kinds of possibilities this opens up, but it also means you need to be considering many different possibilities in the realm of multimodal evals.

As an example of something Gemini can do, you can give it an image of an automobile, ask it in natural language to convert it into a convertible, and, pixel perfect, all of those changes are made to the image itself. There are also examples of Gemini operating robots that we have on our Mountain View campus. Since the model is capable of understanding both video and audio input feeds, you can ask things like "please go make me a salad" or "please go clean up that spill," and the model is able to reason out, step by step, everything it needs to do to accomplish the task, as well as generate code behind the scenes that operates the robot.
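As a rough illustration of the request pattern behind multimodal examples like these, here is a minimal sketch of sending one image plus a natural-language instruction to a Gemini model. It assumes the google-genai Python SDK and an API key from AI Studio; the model name, file path, and prompt are placeholders rather than anything shown in the talk, and the image-editing flow described above would additionally require an image-output-capable model.

```python
# Hedged sketch: image-in, text-out request with the google-genai SDK.
# Assumes `pip install google-genai` and a valid API key; "car.jpg" and the
# model name are placeholders, not values from the talk.
from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_API_KEY")

with open("car.jpg", "rb") as f:
    image_bytes = f.read()

response = client.models.generate_content(
    model="gemini-2.5-pro",
    contents=[
        types.Part.from_bytes(data=image_bytes, mime_type="image/jpeg"),
        "Describe what changes would be needed to turn this car into a convertible.",
    ],
)
print(response.text)
```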
These are self-navigating robots that are able to flexibly handle new environments, new locations, and new tasks, and that would have been completely impossible without these multimodal capabilities. It's also a little bit surreal, to be honest, that this is stuff that exists in the world today and not something you're reading about in sci-fi.

We also have a deep commitment to AI applied to the sciences: things like AlphaFold, which recently won the Nobel Prize, AlphaCode for competition-level programming, GraphCast for weather prediction, plasma control for accelerating fusion science, and the list goes on. And we're incorporating Gemini into everything we do at Google as a company, from Waymo, which y'all might have experienced if you're based in San Francisco (if not, I encourage you to take a ride, it is very fun), to AI applied to chip design, where we're using Gemini models to help re-architect a lot of the hardware used to train and serve the models, as well as things like quantum chips. All of these applications, as varied as they are, have different evals associated with them, to make sure the models are giving the right assessments and that, as new models come out, as they tend to do at least once every couple of weeks, we have a good heartbeat on whether there are any performance regressions or whether capabilities are increasing along the way.

These are the different model sizes. Pro is our largest. Flash is the fastest, most performant one, and the one I see most commonly used in production. And Nano is small enough to fit on mobile devices and to be embedded within the Chrome browser, so you have a model hosted locally and data can be used for inference on-device rather than sent over the wire to a server. They've been doing pretty well on all of the arena boards, which is where we really start getting into evals.

I think folks in the audience have probably seen benchmark performance for some of these large language models: things like MMLU (raise of hands if you've seen that one), HumanEval, which we'll take a look at in a second, GSM8K, some of these more academic-style evals, and also things that are a little more aligned with users, though not perfect, like LMArena. Fun fact: LMArena can be gamed to get higher model scores. There are also things like the Aider polyglot coding leaderboard, which gives you insight into how well models perform on specific coding tasks, and Fiction LiveBench, which is a good measure of how well models retain information across their full context window. On that chart, each line corresponds to a different model, and higher is better, because it means the model has closer to perfect recall across these longer-context use cases, up to 120,000 tokens. There are also things like Artificial Analysis that attempt to combine multiple evals into one, and of course the plots of pricing versus model capability, which folks are always excited to see.
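The long-context recall chart described above is essentially a retrieval test: hide a fact deep in a long context and see whether the model can still find it. Here is a minimal sketch of that kind of probe, under stated assumptions; ask_model is a placeholder for whatever client call you use, and the words-per-token ratio is a rough approximation.

```python
# Hedged sketch of a needle-in-a-haystack recall probe, in the spirit of the
# long-context benchmarks above. `ask_model` is a placeholder for your own
# LLM client call; the words-per-token ratio is a rough approximation.
import random
from typing import Callable

FILLER = "The quick brown fox jumps over the lazy dog. "

def recall_probe(ask_model: Callable[[str], str],
                 context_tokens: int,
                 needle: str = "The secret launch code is 7419.",
                 words_per_token: float = 0.75) -> bool:
    """Bury one fact in ~context_tokens of filler and check if the model finds it."""
    n_words = int(context_tokens * words_per_token)
    haystack = (FILLER * (n_words // len(FILLER.split()) + 1)).split()[:n_words]
    haystack.insert(random.randint(0, len(haystack)), needle)
    prompt = (" ".join(haystack)
              + "\n\nQuestion: What is the secret launch code? Answer with digits only.")
    answer = ask_model(prompt)
    return "7419" in "".join(c for c in answer if c.isdigit())

# Example sweep (assuming `my_client` wraps your model API):
# for size in (4_000, 16_000, 120_000):
#     hits = sum(recall_probe(my_client, size) for _ in range(20))
#     print(size, hits / 20)
```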
But I think one thing that is very hard for folks outside the AI world to understand is what these numbers actually mean. You're telling me a score. It's like a score on a report card or a GPA, but what does it actually correspond to? That's where I'd like to double-click for a second and talk about what makes a great eval and how you can start thinking about evals for your companies.

So for folks in the audience, how many of y'all have heard of HumanEval? Excellent, great show of hands. HumanEval is one of the metrics most often championed as a measure of a model's ability to write code and be a good software engineer. But when you take a look at it, if I pop over to another tab and pull it up, HumanEval is nothing more or less than something that looks like a CSV file. You have an input prompt on the left: some imports, a function signature, and a docstring explaining what you want generated. And then there's the completion you want the model to produce, such that it passes all of the tests you see off to the right. It's 164 examples, so a very small number of use cases. It's all Python, and it's all very simplistic function completion, mostly toy problems, things you would see in a Python 101-level course. And this is how most companies in the world are gauging their models' ability to perform well on software engineering tasks. For folks in the room who might have been software engineers in a previous life, or are in your current life, software engineering looks dramatically different from that. If all I had to do for my entire job was complete Python function definitions, the world would be very simple and very clean.

Another challenge is that this data set in particular has been copied and pasted all over the internet so many times that, for the model building process, it's saturated: most models have already seen these data. So when you give a model this list of inputs, the function and the definition of what it's supposed to do, it's almost like you've given it a sneak peek at the exam before the day of the test. You can't really expect the score to reflect the model's own capabilities, because it's already seen the answers.

The reason I mention this is that many of us are investing a lot of time and energy in looking over benchmarks and trying to understand what they mean, when the reality is that the only benchmarks that are really compelling and meaningful to you and your business are the ones you create yourself and build aligned with the expectations of your customers. Hopefully I'll convince you of that before the end of the presentation.

So evals are certainly critical as a diagnostic and as a way to encode requirements: being able to say, hey, engineer, you have a model, it's currently performing at this level, and your job is to get it from 50 percent to 70 percent. Engineers love nothing more than being given a test that says they're performing low and a target to get up to. The way you do that is by incorporating similar examples into the pre-training data, so the model sees them at scale, or into the post-training data, where you give it a large set of really high-quality examples similar to the task you're trying to accomplish.
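To make the anatomy of a HumanEval-style row concrete, here is a hedged sketch of that style of execution-based check: a prompt, a model completion, and tests that must pass. The field names are illustrative rather than the benchmark's actual schema, and running exec on model output is unsafe outside a sandbox.

```python
# Sketch of what a HumanEval-style row boils down to: a prompt, a completion,
# and tests that must pass. Names are illustrative, not the benchmark schema.
# WARNING: exec() on model output is unsafe outside a sandbox.
from dataclasses import dataclass

@dataclass
class CodeEvalCase:
    prompt: str   # function signature + docstring shown to the model
    test: str     # assertions run against the completed function

CASE = CodeEvalCase(
    prompt=(
        "def add(a: int, b: int) -> int:\n"
        '    """Return the sum of a and b."""\n'
    ),
    test="assert add(2, 3) == 5\nassert add(-1, 1) == 0\n",
)

def passes(case: CodeEvalCase, completion: str) -> bool:
    """Stitch prompt + model completion together and run the tests."""
    namespace: dict = {}
    try:
        exec(case.prompt + completion, namespace)   # define the function
        exec(case.test, namespace)                  # run the assertions
        return True
    except Exception:
        return False

# e.g. passes(CASE, "    return a + b\n") -> True
```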
Evals are also a really great way to explain how good or how poor models are for specific domains. The way this works: you give a prompt to the model being tested, along with any additional information that might be useful; you have a golden response, or a metric you're comparing against for quality; and then you get some final result. Many of these are probably steps y'all are taking today, but they're good to keep in mind while you're thinking about evals.

And there are really three different kinds. There are auto evals, which are very similar to the HumanEval example we just saw. HumanEval, by the way, is ironically named, because it's not human evaluation at all; it's just that very simple CSV-style set of inputs and outputs. So you have these more academic benchmarks like HumanEval, GSM8K, et cetera. There are manual evals, where you're effectively deploying humans to manually score your model's outputs. And there are autoraters, where you're using models as judges to do those assessments themselves.

Some strengths and weaknesses of each approach. Automatic evals are very quick to run and the scores are easy to interpret, which is part of the reason they get so much attention in the world of large language models. But they do have downsides: they're very simplistic, they can leak into a model's pre-training and post-training data, and they're really not good at measuring questions with no one right answer. The only time you should really use them is when there's a clearly correct answer and you care about speed. I'd also call to folks' attention that for many of these academic evals, with one, two, three hundred examples, the "correct" answer is often not correct at all. So if a model scores a hundred percent on HumanEval, that's actually an indication it got things wrong at least a portion of the time, because some of the reference outputs are incorrect. If you aren't double-clicking on your evals, that's something I'd encourage you to do.

For human evaluation, the benefits are that you can have more nuanced prompts, and the results might be more reliable because, hopefully, your users are humans. But raters are not truly representative of real users, and often there isn't inter-rater agreement between groups. If you've used a company like Scale or Surge to run evaluations with real human raters, you need to invest the time to make sure those raters are actually representative of your real-world user cohorts. Human evals are also quite expensive and take longer, because they aren't something you can run programmatically. You should use them when there's no clearly correct answer and when it's hard to evaluate whether something is good.

And autoraters are currently my favourite method of model evaluation, because they're quick to run and you can use models to produce the scores. There are also great open source frameworks and libraries for this, things like PromptQL, if folks have heard of the PromptQL library.
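A minimal sketch of the autorater pattern just described: a judge model compares a candidate answer against a golden response and returns a score. The judge callable, rubric, and 1-to-5 scale are illustrative assumptions, not a DeepMind-internal format.

```python
# Hedged autorater sketch: a judge model grades a candidate against a golden
# answer. `judge` is a placeholder for your own client call; the rubric and
# 1-5 scale are illustrative assumptions.
from typing import Callable

JUDGE_PROMPT = """You are grading an assistant's answer.
Question: {question}
Golden reference answer: {golden}
Candidate answer: {candidate}

Score the candidate from 1 (wrong or unhelpful) to 5 (matches the reference
in substance). Reply with the number only."""

def autorate(judge: Callable[[str], str],
             question: str, golden: str, candidate: str) -> int:
    reply = judge(JUDGE_PROMPT.format(
        question=question, golden=golden, candidate=candidate))
    digits = [c for c in reply if c.isdigit()]
    return int(digits[0]) if digits else 1   # lowest score on unparseable output

# e.g. score = autorate(my_judge_client, "What is 2+2?", "4", "It is 4.")
```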
And you can build these systems around your model and your applications to make them a little more resilient. The challenge is that autoraters can be expensive to run at scale if you're using larger models for the rating, and they can potentially be less accurate than humans, though they're getting better by the day.

So if you're building evals, you have to make sure they satisfy a lot of criteria, starting with being aligned with your company's goals; you are the experts in knowing what your users want and what they need to be successful. As you build out your evals, think of them as a golden set of questions and tasks that you need models to be really great at, and that you want to be able to evaluate quickly for every new model as it comes along. You also need to make sure they give a high-quality signal, so that results improve when model capabilities increase and there isn't a lot of variability in the scoring, and that they're easy to run and interpret. And once you create this eval set, it's very important that you keep it close, inside your company, and don't leak it externally. As soon as you do, it's a guarantee the models will, again, see the questions and answers for the exam, memorize them, and suddenly you have no confidence in the score.

Another challenge: no eval is perfect, which means any single eval won't be representative of a whole cohort of questions. Just like we saw, measuring the ability to complete Python functions gives you no insight into whether the model can do the same in TypeScript, JavaScript, or Go, whether it can make pull requests, or whether it can spot security vulnerabilities. For each of the nuanced tasks you care about, you need a collection of evals, questions and examples, that measure the model's ability to perform them.

These are just a collection of tweets from around the internet of people lamenting the state of evals in the generative models world. No one is happy with evals. It's also one of the easiest paths into model performance work: creating a great eval and talking about why it's great. These are more discussions about MMLU in particular, about how you can't really build confidence around any of the static academic benchmarks that get produced, and even some things around Goodhart's law: when a measure becomes a target, it ceases to be a good measure. There's a lot of Goodharting going on. If you start measuring one specific task, there's a high likelihood that model-building teams will hyper-optimize for that task, as opposed to trying to create something general-purpose and useful along the way.

And the only thing that is truly consistent, honestly, is change. We already discussed that new models are coming out every couple of weeks. You need to be able to sprint quickly to make sure these new models and their capabilities are aligned with what your users and customers expect, and that is a challenge for many. The software world is not used to moving at this speed, so many of us have to dramatically reimagine how our teams work.
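Pulling the golden-set idea into something concrete, here is a hedged sketch of a tiny internal eval runner that scores a handful of tasks you own, broken out by category, so every new model release gets the same quick regression check. run_model and score are placeholders for your own client and grader, for example a wrapper around the autorater sketched earlier.

```python
# Sketch of a private "golden set" runner: a handful of tasks you own, scored
# per category, so each new model release gets the same quick regression check.
# `run_model` and `score` are placeholders for your own plumbing.
from collections import defaultdict
from typing import Callable

GOLDEN_SET = [
    {"category": "sql", "prompt": "...", "golden": "..."},
    {"category": "support_email", "prompt": "...", "golden": "..."},
    # keep this file internal: once it leaks, models may simply memorize it
]

def run_golden_set(run_model: Callable[[str], str],
                   score: Callable[[str, str, str], int]) -> dict:
    """Return the mean score per category for one model version."""
    totals, counts = defaultdict(float), defaultdict(int)
    for case in GOLDEN_SET:
        candidate = run_model(case["prompt"])
        totals[case["category"]] += score(case["prompt"], case["golden"], candidate)
        counts[case["category"]] += 1
    return {cat: totals[cat] / counts[cat] for cat in totals}

# Compare run_golden_set(...) across model versions to spot regressions
# before rolling a new model into production.
```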
So there are trade-offs between speed and comprehensiveness. My recommendation is to have more evals if you're using autoraters; if you're truly dependent on humans doing the evaluation, have fewer, but spend a lot of time thinking carefully about how each one impacts your business. The strong recommendation is to consider that balance between more or fewer evals, and the spectrum of whether you automate them or have human raters guiding them.

I also want to point out that, in addition to all of these teams trying to sprint and do more, there has never been an easier time for a small number of humans to do a lot of great work. There was a study produced recently by the Carta team, surveying on the order of 46,000, maybe a little more, start-ups in the United States over the last year, and 38 percent of them are now solo-founder companies. That's pretty magical when you consider that these solo-founder companies often have recurring revenue and are able to get multiple products to market much faster. So a short plug: if anybody in the audience is a founder, we do have a startup program at Google for folks who are bootstrapping and just getting into the industry, including ecosystem grants as well as up to three hundred and fifty thousand dollars in cloud credits over the course of two years for anybody who's institutionally backed.

And if you haven't had a chance to play with it yet, I encourage everyone to go to AI Studio and take a look at the new Gemini 2.5 Pro models, particularly for things like video analysis, image generation, image editing, and our Multimodal Live feature, where you can interact with the model like you're talking with a human, plus things like code execution and search grounding. There's a lot to test out. Hopefully we're still on time, and we have time for a couple of questions as well.

Yeah, excellent. Pardon? So we do have Gemini deeply embedded within Android Studio, so you can use it to do things like generate Android apps. You can also zoom over to AI Studio and generate API keys, which can then be used in things like Cursor or Windsurf or VS Code with any given extension. And the things you can do are pretty magical. So if I open up VS Code, hopefully it opens relatively quickly, this is using Gemini behind the scenes; you can see it's using 2.5 Pro Experimental. You can do a broad spectrum of things, if folks haven't seen it. You can say: please create a checklist for going on a camping trip in the Grand Canyon, use HTML, JavaScript, and CSS, be creative, and also make sure to add local data storage. Then hit Enter, and behind the scenes the Gemini model starts writing code. You can see the HTML being created, some CSS for the design of the web application, and then it should also create, I think, a JavaScript file to handle some of the interactions. Yeah, save it. Sometimes it will come back with a question about what kind of persistent data storage you want, IndexedDB or something similar, but it creates the checklist app. You can add things like "sunscreen, extremely high SPF," and it adds it to the bottom.
So you can incorporate that, and then even make stylistic changes: please change the background of the website to be pink, because it's the best colour, and add emoji. This also works for deployments, so if you want to add deployment configs for things like Cloud Run, or for deploying to any other infrastructure, you can do that as well. It's editing files, and then if I refresh, you can see that the background is pink and that it's also added emojis. Yep, excellent.

"Could you share some information about the, if I'm not mistaken, one million token window size? Is there any degradation in quality?" Yeah, that's an excellent question. If I pull the slides back up, there we go, there was one slide in particular with the Fiction LiveBench score. What it's measuring is the model's recall over the context window as it gets larger and larger. For folks who might not be able to see, the right-most point is 120,000 tokens. You can see this red line, which is Gemini 2.5 Pro: it has one hundred percent recall up to around four to eight thousand tokens, there's a slight dip around sixteen thousand, and then it gets back up to around 92 percent recall at 120K and stays there for pretty much the entirety of the one-million-token context window. For some of the others, the orange line is GPT-4.1, you can see that when you get up to 120,000 tokens in the context window, recall suddenly drops to around 20 percent. So my recommendation would be to take a look at some of these longer-context benchmarks and really pay attention to what the recall is for the models you care about. Some of them might advertise 120,000 tokens or more, but if the model only remembers 20 percent of the information in those 120,000 tokens, it's not really a representative number. Yeah.

Excellent questions in both respects. The first was around why the models all seem to have this drop in quality around a sixteen-thousand-token context length. That's something the teams are actively researching. Sometimes it's because the hardware changes; at some point you end up having to do distributed inference across multiple accelerators, and it might be something along those lines, but people are still digging into why it occurs. The second question was around synthetic data generation, and whether there will be a degradation in performance and model quality as we generate more and more synthetic data and use it for pre-training and post-training. We're already using synthetic data to train models, much of which is better in quality than some of the human examples. In instances like image generation it does tend to degrade quality, but that might also be because generated images are usually not as high-resolution as photos you take with a camera, so the amount of information you can distil into one image isn't as much as you'd get from real-world photos. My personal opinion is that, long term, what we're going to see are smaller and smaller models able to do outsized numbers of things, especially local models.
So as an example, with MediaPipe, which is an open source library that Google has released, you can take small models, even language models up to eight billion parameters in size, and embed them within browsers, and the smaller versions can be embedded within mobile devices. And they're getting really good: an eight-billion-parameter language model today is better than the original model behind ChatGPT, it costs zero dollars to run, and you just leverage local compute. It can do really cool things, like pose landmark detection. I'm not sure if folks can see in the back, but this runs completely locally and just works as expected. You can also do gesture recognition, so it can track your hand and classify whether it's a thumbs up, a thumbs down, or anything else you want to fine-tune for. You can do interactive segmentation, taking images and creating segmentation masks, and things like selfie segmentation or hair segmentation, all of it locally. I could turn off my Wi-Fi and this would still work. So my personal opinion is that we're going to see a surge in these on-device models and their capabilities, and we'll see much more composition of on-device models with server-side model calls whenever you have complex requirements around logic and reasoning.

Great stuff. Oh, sorry, that's all the time for questions we've got, so thanks ever so much, Paige. Yeah, thank you.
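As a footnote to the on-device discussion above, here is a minimal sketch of fully local pose landmark detection using MediaPipe's Python solutions interface. It assumes pip install mediapipe opencv-python and a local image file; everything runs on-device with no network call, though treat it as illustrative rather than the exact pipeline demoed in the talk.

```python
# Hedged sketch: fully local pose-landmark detection with MediaPipe.
# Assumes `pip install mediapipe opencv-python`; "person.jpg" is a placeholder.
import cv2
import mediapipe as mp

mp_pose = mp.solutions.pose

image_bgr = cv2.imread("person.jpg")                     # load a local image
image_rgb = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2RGB)   # MediaPipe expects RGB

with mp_pose.Pose(static_image_mode=True) as pose:
    results = pose.process(image_rgb)                    # inference runs on-device

if results.pose_landmarks:
    # Each landmark has normalized x/y coordinates and a visibility score.
    for idx, lm in enumerate(results.pose_landmarks.landmark):
        print(idx, round(lm.x, 3), round(lm.y, 3), round(lm.visibility, 3))
```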



