PromptQL Reliability Score
What's discussed in the video
Vamshi and I are going to talk about something we've been building over the last few weeks, something that emerged as a natural artifact of what we can do once you have planning. One of the big questions you have when you're talking to an agent, or to a human colleague doing analysis for you, is: how do you trust their work? Say you ask this person a question about the product or about customer support: help me understand this, fix this for me, analyze this for me, tell me how many active users we had yesterday. In any of these tasks there's a gigantic implicit assumption that the other person understands what you're saying, and that they're solving the problem reliably, with no holes in their understanding that would cause an incomplete or inaccurate analysis.

For example: "tell me all of the products in the dairy segment." Suppose the human gave you an answer of 1, or 10, or 100. It's an answer, but how do I know it's reliable? You judge by the exhaustiveness and rigor of how they thought about the problem, and how they reacted to ambiguities in the data or in the question itself. They'd think: dairy segment, hmm. Do I have a category called "dairy"? I don't. Will product names have the word "dairy" in them? Will product descriptions? Maybe I should semantically analyze the product descriptions to see if they belong to the dairy segment. There are so many different ways (there's a sketch of these competing strategies just below). When you build trust with another human who's solving a problem for you, it's because they understand the data and your question well enough to figure out the right plan.

So when PromptQL is doing this for you, you need something that helps both you, the user, and PromptQL itself get a sense of whether its work was reliable. We call this the reliability score. This is our first release of it, and we'll gradually amp up the complexity of what it can do. Let's take a quick look at it in action, and then Vamshi can talk about what kinds of things it looks for, what kinds of reliability issues it solves for, and what's on our roadmap.

We were debating what project to use, and Vamshi is a football person. Are any of you football people, or soccer people? Do let me know, and please feel free to post questions in the chat that we can try out. Vamshi, where did you get this data from? I think it's a public dataset on Kaggle; I can share a link. Awesome, so it's a Kaggle dataset, and it has a bunch of data about players, player valuations, games, and things like that.
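To make the "dairy segment" example above concrete, here is a hypothetical sketch of the competing query strategies an analyst (or PromptQL) might weigh. The table and column names, the Postgres-style ILIKE operator, and the embed helper are all illustrative assumptions, not the actual demo schema:

```python
# Three ways to interpret "products in the dairy segment".
# All names here (products, category, name, description) are assumed.

# 1. Exact category match: only works if a 'dairy' category actually exists.
BY_CATEGORY = "SELECT * FROM products WHERE LOWER(category) = 'dairy'"

# 2. Keyword match on names/descriptions: catches products that merely
#    mention "dairy", but misses, say, "Whole Milk 1L".
BY_KEYWORD = """SELECT * FROM products
                WHERE name ILIKE '%dairy%' OR description ILIKE '%dairy%'"""

def by_semantics(products, embed, threshold=0.8):
    """3. Semantic match: embed each description and compare it to the
    concept 'dairy product'. `embed` is an assumed function returning
    unit-length vectors, so the dot product is cosine similarity."""
    query_vec = embed("dairy product")
    return [
        p for p in products
        if sum(a * b for a, b in zip(embed(p["description"]), query_vec)) >= threshold
    ]
```

Each strategy can give a different count, which is exactly why an unexplained answer of 1, 10, or 100 is hard to trust on its own.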
So let's start doing some interesting stuff with it. One of the questions I was trying out was: help me identify players that are underperforming given their player valuation in 2023. Is that how you would have framed the question, Vamshi, as a football person? Would you say "player valuation" or just "given their valuation"? It doesn't really matter. I want to find the players who joined a club for a bazillion euros: are they performing up to the mark?

PromptQL does what you saw in the earlier demo: it understands the question, creates a plan, figures out what to do, and comes up with a way of thinking about performance stats. It decides to use goals and assists. What PromptQL is trained to do is try to solve the problem. PromptQL is like an energetic, try-hard intern in your org; it's trying to have your back. You gave it a problem, so it looks at the data it has and, to the best of its ability, solves it. It found a bunch of data and then tried to use that data to come up with an answer.

If you look at the plan, it says: I'm going to look at goals and assists in players' appearances in 2023, and identify players who have high valuations but low performance. That's how PromptQL decided to define performance, and it made some assumptions that it told me about: how to determine performance, and what counts as a high valuation.

There's some data here, and it's fun to see this chart of valuation versus performance if you're curious, but I'll draw your focus to the bottom right corner, where a reliability score pops up asynchronously. The reliability score gives you a meta-commentary: think of it like a code review on your interaction and on PromptQL's work itself. It comments, as a neutral entity, both on you and on PromptQL. It says: this analysis only considers goals and assists, which is not the only definition of performance, because you're not capturing what defenders or goalkeepers do. A goalkeeper scoring even one goal is a superlative performance, perhaps. It also flags the ten million euros threshold: is that what you were looking for or not? And it notes that the analysis includes appearances from all competitions, when restricting to a single league, where players and teams compete at the same level, is perhaps the better analysis.

So it gives me that extra note on what could make this answer less reliable. Then you would either ask PromptQL to retry with those improvements by itself, or explicitly say: think about this, use ten million as a threshold, change the threshold, whatever it is. And then you would have a better interaction.
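As a rough illustration, a plan like the one described could compile down to something like the following SQL. The schema and both cutoffs are illustrative guesses, not PromptQL's actual output; the comments mark exactly the assumptions the reliability score flagged:

```python
# Hypothetical rendering of the "underperformers vs. valuation" plan.
UNDERPERFORMERS_SQL = """
SELECT p.name,
       v.market_value_eur,
       SUM(a.goals + a.assists) AS goal_contributions
FROM players p
JOIN player_valuations v ON v.player_id = p.player_id
JOIN appearances a       ON a.player_id = p.player_id
WHERE a.season = 2023                   -- flagged: all competitions included
  AND v.market_value_eur > 10000000     -- flagged: "high valuation" assumed as 10M+
GROUP BY p.name, v.market_value_eur
HAVING SUM(a.goals + a.assists) < 5     -- flagged: "performance" = goals + assists only
ORDER BY v.market_value_eur DESC;
"""
```

Every commented line is a judgment call, and that is precisely the kind of thing the reliability score surfaces for review.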
That interaction eventually gets fed into our self-learning system, so that the next time, say next week or the week after, you come back and ask this question, the assumptions you made are captured automatically, because that's the language you use to reason about things here. When you say "high," that typically means this valuation threshold, and that gets captured automatically. But the reliability score is going to be a key ingredient in helping both the user and PromptQL understand how reliable their work and their interaction is.

I'll take maybe one more question if you have any on the football dataset, but because we're running out of time, I won't show the other questions I had prepared. Vamshi, do you want to give us a quick sense of the kinds of things you look at when you think about reliability? What are the buckets you've come up with, and what buckets might we look at in the future?

Right. Broadly, there are two categories of issues. The first is query plan quality: the query plan that PromptQL picked may not be the most relevant, or may not be what you're looking for when you ask a question. The second is the actual implementation of the query plan. That second category is less important, because models these days are quite good at writing code, and even if a model writes an incorrect SQL query for a step it has planned, the error message is fed back to the LLM and the LLM corrects itself. Still, if there is a logic error in the implementation with high impact, you would want to see it.

Let me quickly switch to the first category, which is what you're seeing here: what could hurt the quality of the query plan. The number one reason is that your query is ambiguous. When someone asks this question, as a person who understands the sport, what are you even asking? Are you looking for attackers? Are you looking for defenders? That's probably the first question I would ask. Because PromptQL made a bunch of assumptions, the reliability evaluation complains: if the query is ambiguous, the query plan that PromptQL picked may not be the right one. If you clarify, "I'm actually looking for attackers," that goes away.

Similarly, can you quickly try searching for a random player? This is my question that's going to be offensive to football fans, but I don't follow football, so when Vamshi gave me this dataset, the first question I asked was: how is Smith playing these days? I don't even know if any player is called Smith. Go ahead, Vamshi. Right, so a couple of things would happen. One, the question is ambiguous. But also, during implementation, how is the system looking for a player named Smith? Is it doing an exact match? For example, step one of the query plan here is "search for players with Smith in their name." What does that mean? Is it a last-name search, a first-name search, or some similarity search?
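Those alternatives trade off very differently in how much data they fetch. A hedged sketch, with assumed table and column names:

```python
# How the "Smith" lookup might be implemented, and what each choice misses.
# The players/name schema is an illustrative assumption.

# Exact match: under-fetches; misses "John Smith", "Smith-Rowe", etc.
EXACT_MATCH = "SELECT * FROM players WHERE name = 'Smith'"

# Substring (LIKE) match, a simple similarity search: catches Smith as a
# first name, a last name, or part of a hyphenated name.
SUBSTRING_MATCH = "SELECT * FROM players WHERE name LIKE '%Smith%'"

# Over-broad match: over-fetches and buries the answer in irrelevant rows.
TOO_BROAD = "SELECT * FROM players WHERE name LIKE '%S%'"
```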
So that's what the evaluation looks at for the second category of issues, implementation quality. There are other classes of issues too. In this case, for example, it checks the quality of the search filtering: is the filter over-fetching or under-fetching data? The way you filter might sort of work, but if you're drastically under-fetching, the reliability is bad, because you're not capturing enough information.

Can you open up the code there? Yeah. Because it's doing a LIKE search, you can see it's a similarity search, so it found a set of players. If it had not managed to find any, a quality issue would have been raised about the implementation. In this case we're saying the quality of the implementation is OK, because this is the best you can do here. But the interaction is still not reliable, because if your question is about one particular player named Smith, none of this is going to help; you're going to have to choose the Smith you're interested in. That makes a lot of sense.

I know we're over time, so Rob, I'll hand it back to you. If you folks have any questions, please engage us in the chat and we'll continue from there. The other thing is that you can try this on a project you have today. It's already live, right, Vamshi? Yes, you'll have to go into your settings to enable the feature. Awesome. All right, we'll let you continue, and if we have time at the end we'll switch over to one of these projects and retry with improvements. That'll be really fun.

There's one question that came through recently that I want to address while we're on the call. Harsh, I'll read out the latest question on the stage, which came through about a minute ago: how do you deal with syntactic issues? I imagine you use similarity searches a lot.

The search strategy is chosen depending on the data layer that's available and on the question. For example, as came up in a previous question, if you have semantic search available as a tool, whether it's a semantic search function your database exposes, like a vector database, or a custom search you've written yourself, PromptQL will use that. PromptQL uses whatever is available in the system, depending on the task. In this football dataset, the only things it had were these entities with the default operators, which gives you LIKE-style search. But if I had a function called playerSearch available that did a better search, it would use playerSearch automatically: it would call playerSearch("Smith") and try to get the result that way.
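A sketch of that data-layer-dependent selection. The catalog shape and the function names (playerSearch, semantic_search) are assumptions for illustration, not PromptQL internals:

```python
# Pick the best available search strategy for the connected data layer.
def pick_search_plan(catalog: dict, term: str) -> str:
    functions = catalog.get("functions", [])
    if "playerSearch" in functions:
        # A purpose-built search function beats generic operators.
        return f"SELECT * FROM playerSearch('{term}')"
    if "semantic_search" in functions:
        # e.g. a vector database or a custom semantic-search function.
        return f"SELECT * FROM semantic_search('{term}')"
    # Otherwise fall back to the source's default operators.
    return f"SELECT * FROM players WHERE name LIKE '%{term}%'"

print(pick_search_plan({"functions": ["playerSearch"]}, "Smith"))
# -> SELECT * FROM playerSearch('Smith')
```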
So PromptQL chooses its strategy depending on the question and on the data layer you've given it, which is what makes it so useful: you don't have to pre-commit to a one-size-fits-all strategy. And that's the answer to that question. That's a great answer, and I love that it's not one-size-fits-all.
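One more illustrative sketch, on the self-correction Vamshi mentioned earlier: when a generated SQL step fails, the error message is fed back to the LLM, which corrects itself. Something in this spirit, where run_sql and llm_fix_sql are hypothetical stand-ins rather than real PromptQL APIs:

```python
# Feed database errors back to the model until the step succeeds.
def execute_step(sql, run_sql, llm_fix_sql, max_retries=3):
    for _ in range(max_retries):
        try:
            return run_sql(sql)
        except Exception as err:
            # Implementation errors are largely self-healing: the error
            # message goes back into the prompt and the model retries.
            sql = llm_fix_sql(sql, error=str(err))
    # Persistent failures surface as high-impact implementation-quality issues.
    raise RuntimeError("step failed after retries")
```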



