
By Philip Dawson
AI is making decisions that matter: who gets hired, who gets credit, who gets flagged as a risk. And when those decisions go wrong, the consequences land on real people, and increasingly on the organizations responsible for them.
Armilla AI and Eticas were both founded with similar missions: to build justified trust in AI by measuring real-world performance and risks. Both built practices around independently evaluating AI systems: getting inside them, testing how they actually behave, and surfacing the risks that internal teams miss. For Armilla, that foundation in third-party evaluation and red-teaming eventually pointed toward a larger gap: building the underwriting infrastructure needed to unlock AI-specific insurance coverage for risks that traditional policies were not designed to cover. The move opened the door to partnerships with pioneering companies like Eticas, which continue to advance the field of AI auditing and assurance.
The partnership reflects a shared conviction that safe, reliable AI requires evidence, ongoing testing, and accountability. We sat down with Gemma Galdon-Clavell, Eticas' founder and CEO, to talk about the work behind that partnership: what AI auditing has actually taught her about how systems behave in the wild, where safety efforts are falling short, and why more organizations are starting to ask not just "is our AI safe?" but "how do we know?" and "what happens when it fails?"
Philip (Armilla): You built your career at the intersection of technology, ethics, and governance, well before that combination became fashionable. What first drew you into this space, and what ultimately led you to start Eticas?
Gemma (Eticas): Thank you for having me! When I started working in this space, over 15 years ago, it was very niche. Most of the issues related to AI ethics and technical evaluation were only salient in the security domain. That was when social media scraping was becoming a thing, together with biometric identification and data collection at scale. Many actors in security and safety needed experts who understood the legal context, social expectations and technical realities, and my socio-technical profile put my work in high demand even then, so I left university soon after my PhD and started Eticas. Since then, my field of expertise has literally exploded before my eyes, and the capabilities of systems that I once saw only in control rooms are now in everyone’s pockets.
Philip (Armilla): When you launched Eticas, AI looked very different from what we see today. What problem were you trying to solve at the beginning, and how has the company evolved as AI systems moved into the core of how organizations actually operate?
Gemma (Eticas): When I started, I was working on Big Data. Back then, most of the concerns were around privacy, security and misuse, which continue to be huge challenges today. From a technical perspective, my work has not changed that much: at Eticas we have always been at the intersection of innovation, social values and legal requirements, obsessed with tackling real-life problems and auditing, evaluating and opening up black boxes. Our early methods transitioned easily into the world of automation and predictive AI, and the same has happened more recently with generative AI. The risks, their scope and contexts change, but our work continues to revolve around identifying a risk or a set of risks, getting into systems, measuring those risks, mitigating them and creating auditing software to monitor risk-relevant indicators. Our technical audits provide AI developers and implementers with an independent assessment of the effectiveness of their systems and safety features, which they can use to improve AI systems and outputs, build trust with their clients or users, and improve their compliance efforts with actual system data.
Philip (Armilla): You’re often described as one of the pioneers of operational AI auditing and testing. At this point, what do we genuinely know how to do well, and where do you think the biggest technical or methodological blind spots still are?
Gemma (Eticas): Impact auditing continues to be a huge blind spot. Most efforts to understand and build guardrails against AI risks are performed at early stages of development, which is necessary but limited. A good engine does not guarantee that a car is safe, any more than a clean data set or a bias guardrail at the model level can prevent risks down the line. The data and the models can be great, and the system can and will continue to create risks up until the point of interaction with real-world conditions. Other sectors know this well and have built systems and standards to make sure risks are tackled throughout the life-cycle of whatever is being produced: in medicine, vaccines must go through clinical trials and thorough post-market surveillance; incident reporting is mandatory in many fields, like aviation. With AI, it is not only that we do not yet have mechanisms to monitor downstream risks, it’s that we often have zero visibility into them! It’s like working on plane engineering and innovation and never having any data on whether the planes actually fly. My focus is operational because I can’t afford to work at the level of abstract ideas or risks, or to build guardrails that do not stand the test of real-life performance. We audit AI systems, so we need clear hypotheses, measurable risks, metrics, benchmarks and contextual data.
Philip (Armilla): There’s growing attention on auditing frontier AI models, often centered on capability benchmarks and red-teaming exercises. How does that work differ from what you do at Eticas when auditing specific AI applications in deployment? Where should priorities sit, and what gets lost when we focus too heavily on models rather than full systems?
Gemma (Eticas): Our findings at the point of impact are what drive the tweaks and guardrails that must be developed upstream, and what test their effectiveness once they hit real life. AI safety is not linear, but circular. At the model level, before an AI system is released, engineers can test hypotheses, build personas, do red-teaming, assess risks, build guardrails… All these steps are good and necessary, but not sufficient. Risks will re-emerge as a model is retrained or adapted to a specific context, suffers an attack, or when AI implementers or users try to use it for things it was not designed for. Risks can be mitigated but not eliminated, and so they are always latent in AI systems. The lack of safety in AI is like gravity: it is always there, it is the context and the ecosystem, and all efforts must be made to minimize this force that threatens to make all our efforts fail. While part of the AI industry promises reliability and continuous improvement, what we see in our auditing are systems where unreliability and bias are features, not bugs. Systems where performance degrades with use. Systems on top of systems in complex supply chains where all good intentions at the model level are eroded at each interaction. AI is not a person we must “educate well” and then hope will become a wholesome human being. These are data systems with embedded drift and vulnerabilities. So when we audit at the impact level, we are measuring the effectiveness of the efforts made at the model and development level, and producing key insights that shape the next iteration of guardrails and risk mitigation measures. But when no one is auditing at the impact level, the circle does not exist, and developers lack the input they need to understand and mitigate risks upstream. This blind spot does not exist in any other sector, and it is the reason why a lot of safety and regulatory efforts today are a waste of time and money.
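To make the feedback loop described above concrete, here is a minimal sketch in Python (not Eticas' actual tooling) of an impact-level monitor: it compares a deployed system's live outcomes against a baseline captured at release time and raises a flag when drift exceeds an agreed tolerance, so the finding can be routed back to the teams building upstream guardrails. The class, numbers and tolerance are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class OutcomeSnapshot:
    """Aggregate outcomes for one monitoring window (illustrative only)."""
    total: int    # decisions made in the window
    errors: int   # decisions later identified as wrong or harmful

    @property
    def error_rate(self) -> float:
        return self.errors / self.total if self.total else 0.0


def drift_alert(baseline: OutcomeSnapshot,
                current: OutcomeSnapshot,
                tolerance: float = 0.02) -> bool:
    """True if the live error rate has drifted past the tolerance agreed at
    release time. The 2% default is a placeholder, not a standard."""
    return (current.error_rate - baseline.error_rate) > tolerance


# Hypothetical system that looked fine at release but degrades with use.
baseline = OutcomeSnapshot(total=10_000, errors=150)
month_six = OutcomeSnapshot(total=10_000, errors=420)

if drift_alert(baseline, month_six):
    print("Impact-level drift detected: feed findings back to the model team.")
```

Nothing in this loop is exotic; the point is simply that without impact-level measurement there is no signal with which to close the circle.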
Philip (Armilla): Auditing a live, continuously evolving AI system is very different from auditing financial statements or even traditional software. What makes it uniquely challenging in practice, and what do most organizations underestimate when they say, “Yes, we’ve tested our model”?
Gemma (Eticas): I don’t think it’s difficult! Actually, AI is making our job as AI auditors a lot easier, too. What we do is clearly very different from a financial audit, and traditional software does not have the same level of risk and user interaction that we see with AI. So our work is different and new, but it is not difficult for us, because of our unique combination of social and technical skills. Anyone with a background in quantitative social sciences knows that we can measure everything. Our ability to turn complex context into metrics, coupled with a deep understanding of AI engineering, allows us to build robust checks that measure the right things when they matter. I believe that what we do feels hard for traditional engineers because measuring social dynamics and interactions, or AI impacts on education or mental health, feels impossible when the datasets do not exist. I often say that engineers cannot code a world they don’t understand. AI developers need to step up their game and improve their ability to interact with complex data, so that when they say “we’ve tested our model” we can be certain that those tests have been robust, in line with actual risks and validated by independent auditors. Having said that, there will always be risks that we know we don’t know, or dynamics we can’t predict. But I’ll worry about those once we start effectively tackling the assurance risks we do know today.
Philip (Armilla): Through your work with the IAAA and other standards efforts, you’ve argued that principles alone aren’t enough. If we were serious about system-level AI assurance, what would a meaningful standard need to include to change behavior rather than just satisfy compliance requirements?
Gemma (Eticas): Standards need to set clear benchmarks. Currently, most standards out there are either too general (focusing on organizational governance, not product safety) or too specific (requiring the measurement of very niche model metrics that do not contribute to making AI safer), when what we need is a reasonable and practical middle ground. The low-hanging fruit at the standards level is determining acceptable impact metrics and ranges. When auditing automated decision-making systems (ADMs), we often use proportionality as a key metric: your system needs to be as close as possible to proportional to the demographics it is impacting, from the training data to final outcomes, whether we are inspecting an HR system or a cancer screening system. In other instances, like LLMs deployed in education settings, we have suggested to our clients that the acceptable rate of unidentified instances of harm or drift must be below 2%. These are benchmarks that are general but also contextual, and so they enable us to provide developers with a clear idea of what we measure and what they need to control. Standards need to operate at this socio-technical level to enter technical specifications, but unfortunately none of them do at the moment.
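As a rough illustration of the proportionality metric Gemma mentions, the sketch below compares each demographic group's share of positive outcomes with its share of the affected population and reports the gaps. The group names, figures and the 5-point gap threshold are hypothetical assumptions, not Eticas' benchmarks.

```python
def proportionality_gaps(population_share: dict[str, float],
                         outcome_share: dict[str, float]) -> dict[str, float]:
    """For each group, the absolute gap between its share of positive
    outcomes and its share of the impacted population."""
    return {group: round(abs(outcome_share.get(group, 0.0) - share), 3)
            for group, share in population_share.items()}


# Hypothetical hiring system: applicant pool vs. candidates the AI shortlists.
applicants = {"group_a": 0.55, "group_b": 0.30, "group_c": 0.15}
shortlisted = {"group_a": 0.68, "group_b": 0.22, "group_c": 0.10}

gaps = proportionality_gaps(applicants, shortlisted)
print(gaps)                        # {'group_a': 0.13, 'group_b': 0.08, 'group_c': 0.05}
print(max(gaps.values()) <= 0.05)  # False: outcomes are not proportional to demographics
```

The same comparison can be run at each stage of the pipeline, from training data to final outcomes, which is what makes a benchmark like this both general and contextual.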
Philip (Armilla): Many of your customers are deploying AI in settings where mistakes can quickly turn into lawsuits, regulatory action, or significant financial loss. What are you hearing from them about the need for AI insurance, and why did partnering with Armilla feel like the right way to connect technical assurance with real-world risk?
Gemma (Eticas): We are seeing a global shift in AI safety from compliance to risk. With AI being deployed at the speed of light against a moving, often unpredictable regulatory context, companies are quickly coming to understand that implementing AI brings benefits but also risks, and that those risks must be measured and mitigated like financial risks or any other risk impacting their business. This is an unexpected but very fruitful development: this approach makes AI safety necessarily technical and a part of doing business, not a legal afterthought or an effort with no bearing on the robustness of the business. This focus on AI safety as a risk will greatly benefit from the quantification of those risks, and the insurance field has been crucial in the past in framing how risks are quantified and mitigated. We believe that an alliance between AI auditors and AI insurers can change the conversation in ways that benefit everyone, but especially businesses seeking certainty and controlled risks as they build and integrate AI. Both our fields can accelerate the emergence of an ecosystem of AI assurance with clear rules, standards and roles, which will immediately impact the quality and safety of the AI systems being deployed around us. Eticas teaming up with Armilla is one of the most fruitful collaborations I can think of, both for our organizations individually and for the field of AI safety.
Eticas’ work is grounded in a simple but demanding idea: AI systems should be understandable, testable, and accountable in the environments where they are actually used.
Rather than treating governance as a box-ticking exercise, the company has focused on building practical methods to evaluate real systems, surface hidden risks, and give organizations defensible evidence about how their technology behaves and where it can fail. That emphasis on measurement and proof, rather than slogans, makes meaningful oversight possible.
Pairing that technical rigor with mechanisms like insurance that introduce real economic consequences helps push the industry toward a future where safety is not just a design goal, but an operational requirement.
In a world where AI increasingly shapes access to jobs, credit, healthcare, and opportunity, building systems that can be audited, challenged, and insured may be one of the most important steps toward making AI not only powerful, but genuinely responsible.
Founded by Gemma Galdon-Clavell over a decade ago, Eticas.ai is a mission-driven corporation that provides measurable testing, independent verification and continuous monitoring at scale, so organizations can innovate with AI while minimizing its risks. Working with developers, deployers, and partners across the private and public sectors, Eticas.ai delivers accountability solutions that measure and monitor the real-world impact of AI systems, turning rigorous, field-tested evidence into practical tools for responsible AI adoption.
For more information, visit Eticas.ai.
Armilla is a purpose-built AI insurance and underwriting firm closing the coverage gap that traditional policies leave open. As generative AI and AI agents move into production, Armilla provides affirmative insurance coverage for the risks that matter most: model errors and hallucinations, data privacy liability, biased or harmful outputs, regulatory violations, and IP infringement.
Armilla's approach combines deep AI risk expertise with robust insurance infrastructure, offering independent model verification and AI liability coverage of up to $25M per company. Recognized by Lloyd's Lab and Y Combinator, and backed by reinsurers including Chaucer, Swiss Re, Axis Capital, Convex and Greenlight Re, Armilla works with AI developers, deployers, and the brokers who serve them to make AI adoption faster, safer, and financially protected.
To learn more, reach out at armilla.ai.