Test the Hell Out of AI Before You Trust It, Says Galileo DevRel Head Jim

Posted by Kevin Blanco|Published on Aug 10, 2025

9 min read

Table of contents

When AI gets things wrong (and it certainly does!), the impact on customers and companies can be devastating. Jim Bennett, Head of DevRel at Galileo, spends no time sugarcoating the current state of AI and explains how Galileo is honing its tools that monitor AI solutions and fix problems before shipping and in production.

Galileo (and Jim) is worried about the reliability of your AI

“”
AI is actually quite rubbish, and it gets a lot of things wrong. Imagine if there was a way you could prove that it was getting things wrong and use that information to stop it from getting things wrong. That’s what AI needs right now.

From Air Canada’s chatbot offering non-existent refunds to the Chicago Sun-Times publishing fake books in its reading lists and even actual lawyers citing AI-generated hooey in court documents, the aftermath of unreliable AI outputs can range from reputational damage to significant legal consequences. While some of the famous AI blunders are due to the outright misunderstanding of AI by end users, many of them could have been prevented if developers had done more thorough testing or implemented live protections.

Galileo is an AI reliability, evaluation, and observability platform that uses a specialized AI designed to monitor and check your other AI tools’ outputs. It inspects your AI workflows to catch hallucinations, data misuse, and unsafe outputs. When a potential problem is detected, it flags or fixes it before it can cause damage.

Expensive AI misfires: They won’t happen to you — until they do

Jim sees it daily as part of his DevRel role: Business leadership or development teams believe they’re immune to AI’s many quirks (or aren’t aware of them), right until their chatbot gives the wrong refund advice or leaks private customer data. In most regulated businesses, mistakes don’t just hurt your ego; the cost of a slip-up can mean millions in fines. But it’s not just fear of punishment that should drive AI reliability — real people are impacted when their travel plans are disrupted or their private information is compromised.

“”
Everybody has an obligation to their customers to do the right thing.

Regardless of your company’s size, if you’re going to adopt AI tools for internal use or for customers, now’s the time to be thinking about the risks, not after something spirals out of control. Jim has over 30 years of software experience and knows it’s only a matter of time before a company gets hacked. He applies the same logic to AI malfunctions and encourages adopting the same mindset toward AI development as there was toward the rapid adoption of observability in software development a decade ago.

“”
The best time to set up monitoring is before you start production. The second-best time is now.

Galileo supports customers at different stages of the development cycle. From teams starting evaluations on day one to developers adding checks to CI/CD pipelines and even post-launch scrambles to plug holes after a close call, Jim advises getting started with evaluations as early as possible. This way, you can establish a baseline for what “good” looks like and catch problems before your business makes a news headline in the worst possible way.

AI agents that can act on behalf of users and interact with other software without prompting will soon be widespread. Their ability to act autonomously introduces many new risks that require observability and evaluation before companies can responsibly allow their use. They might call the wrong tools or misinterpret your prompt (or delete your production database!), making it essential to catch slip-ups early.

“”
You don’t want to be the face of the next big AI blunder.

Can AI be trusted to test AI?

Unlike normal code that has a deterministic nature and lets you write a unit test to check for unwanted behavior, AI outputs can be wishy-washy, making it hard to define a “good” response. Defining success is especially difficult with agentic AI because you can have a million different modalities running simultaneously with multi-step queries.

Jim gives the example of a telecom chatbot that needs to handle complex processes, like recommending phone plans to customers based on their predicted usage, while preventing private data from leaking when discussing account changes. Every conversation has a different definition of success and different consequences if something goes wrong (even if it’s just the customer thinking the AI sounds rude). Potentially millions of these conversations happen every month, with each additional AI-generated message increasing the probability of a misfire.

In complex situations where human reviewers can quickly gain context and assess whether a response is adequate, AI often struggles. This is the catch-22: You need to have AI test AI outputs, and those checks can’t have the same flaws or biases. The AI doing the checking must be specifically trained for evaluation tasks.

“”
You need a domain expert to help you define what success looks like, and the AI needs to explain why it thinks it was successful.

Jim explains that embedding AI governance starts with domain experts creating metrics to define what success looks like for each use case. This is especially important when compliance with standards and laws is required. This governance layer can be handled separately from development teams but fed into a single platform.

Because AI responses aren’t always black and white, the chatbot should provide explanations for its decisions that can be verified by domain experts. Jim believes AI reliability gives you the confidence that what you are selling/providing your customers is going to be as compliant as you can make it.

To build confidence in AI, you have to shine a light on how unreliable it can be

“”
We can’t make 100% guarantees because AI is ✨magic✨. There’s always going to be a certain fuzziness because of the intrinsic nature of AI. But you can put the guardrails in, and you can go, “Oh, this product’s doing badly. Let’s pull the plug before we end up with any serious problems.”

Somewhat counterintuitively, the best way to make AI more reliable is to acknowledge how unreliable it is when there are no guardrails in place. Raising awareness of AI risks is the best way to improve reliability, which will, in turn, build confidence.

The biggest friction point is awareness. It often comes as a surprise to those in the AI sphere when many developers who are well aware of the importance of observability and evaluations in traditional software give a blank stare if it’s mentioned in the context of AI.

Everyone should get on the evaluations hype train. (We talked about hype last week with Thor from ElevenLabs!) This is where DevRel comes in, spreading awareness and making reliability a native part of AI development.

Developers working on legacy systems have been told, ”Thou shalt do AI.” So you’ve got developers who have spent the last 20 years building business applications, and now they suddenly have to put AI in. They don’t know about evaluation, they don’t know that they need protections, and they don’t realize there’s this whole toolset out there to help them.

A common misconception AI evaluation faces from the development community is that you can just slap metrics on a tool (or worse, just add a disclaimer to the UI) and call it a day. However, gone are the simple days of just running unit tests: In reality, it needs a lot of customization.

“”
You cannot just use an out-of-the-box metric and expect that to be successful in production because everybody’s use case is different.

An example of this is preventing the disclosure of sensitive personal information. Tests that can successfully detect a Social Security number from the US may not detect a National Insurance number from the UK, a Medicare number from Australia, or one of the other countless pieces of protected information covered by worldwide privacy laws.

Galileo tackles this by providing starter metrics like toxicity and completeness to give developers a starting point and the tools to customize them dynamically to meet any use case. They use Continuous Learning with Human Feedback (CLHF) to tell the system when it’s wrong and update the metrics accordingly. Custom models can also be tuned for specific data and use cases based on the workflow.

You also need to make users accountable for what AI creates under their watch

“”
We have to be stewards of the materials that we create. We have a responsibility to ensure that when we create content in whatever form, it is not going to harm people.

AI governance is no longer only about detecting potentially harmful text and code. Realistic images, video, and audio can now be generated and mixed together, and evaluating these modalities simultaneously presents a new set of challenges. Trust in AI is undermined if it is seen as a tool for mischief.

“”
When you’ve got images on screen, text on screen, voice-overs, and all that, there are so many different components. I know we’ve got research folks looking at this.

Early detection of potentially malicious prompts is one effective guardrail. Along with reliability, it can be engendered and addressed earlier in workflows while improving UX with preemptive, not just proactive, protections. Jim believes cracking down on the toxic implications of AI hallucinations and misleading/incorrect outputs with guardrails could prevent real harm down the line, rather than waiting to clean up afterwards.

“”
Can I evaluate a prompt as I’m typing it? We have auto-complete; we get red squiggly lines as we’re typing. Could you have that preemptive evaluation earlier on in the cycle? That’s an interesting idea to make the user experience — not even for our customers but our customers’ customers — much nicer.

Jim’s hot take: “I don’t see the point in burning down the planet”

“”
I don’t think large language models have a future. Building bigger and bigger models and throwing huge data centers at them wastes resources. I like small, specialized models.

Jim thinks we should ditch the big LLMs and embrace smaller, specialized models: The future of AI is smaller local models that do one job well instead of wastefully burning CPU cycles for features that aren’t even used.

“”
I can ask for a recipe and then burn down a planet because the AI is so big. It’s been trained on Japanese culture, French history, oceanography... You know, I don’t need that. I just need a recipe LLM. So, I think the future is small language models that just do one job — and do it very, very well — that I can run on my laptop.

As hardware keeps improving and models become better, Jim expects these smaller models to catch up to the performance of state-of-the-art giants for their specific tasks. He also thinks they’ll be executable on local machines, making them not just more convenient but more environmentally friendly.

TL;DR: Our chat with Jim

“”
If you’re using AI, whatever you do, test it! Please!

AI reliability isn’t optional; it’s a business and ethical requirement. DevRel has an important responsibility to make developers aware of how their tools can misfire or be misused and harm not just their business but the public and the individual. Reliability should be a native feature that’s baked in from the start, not something that’s only addressed when someone notices something has gone wrong.

The tooling may be new and the challenges hard to conceptualize, but early guardrails and checks for AI models are the future Galileo’s paving the path for, making reliability a native part of development.

You can connect with Jim and scope out the future of AI reliability with Galileo’s free tier. Put your AI tools under the lens to see if things are working the way you want them to.