Why is good AI so hard to achieve?

What nearly all AI failures share is that they tend to arise from inadequate quality and robustness testing.
Karthik Ramakrishnan
February 17, 2023


Humans have a long habit of not always using technology responsibly. From the earliest days of the Industrial Revolution to the Atomic Age, history is filled with examples of humanity unleashing new technologies without fully considering the potential risks and dangers first.

Unfortunately, AI and ML do not appear to be exceptions to this trend thus far.

As we’ve previously discussed in Armilla’s first post on this blog, there have already been many thousands of high-profile and harmful AI failures. Moreover, this number is likely only to increase as new AI applications become more widespread and ever larger in scope.

Crucially, one element that nearly all AI failures share is that they tend to arise from inadequate quality and robustness testing.

The downstream risks of improper model-testing are diverse:

  • From a purely financial perspective, these failures can be tremendously expensive — whether from lost revenue, increased R&D costs, or broken trust in a company’s brand or reputation.
  • From an ethical perspective, the deployment of inaccurate or biased models can have significant negative impacts on users, resulting in widespread social harms and the perpetuation of demographic inequalities.
  • Finally, there are also significant legal risks, both from lawsuits brought by individual victims and from a growing number of regulatory liabilities.

In our opening series of posts on this blog, we’ll be exploring the promises and risks of emerging AI/ML technologies in depth, as well as how the missteps described above can be proactively avoided. Ultimately, we’ll conclude by charting some potential paths forward that we at Armilla believe can help the AI/ML industry mature into a responsible field.

In this first post, we’ll begin by briefly reviewing some of the common historical reasons behind these AI failures, in order to better understand what causes them. We’ll then explore how taking a more thoughtful and proactive approach to QA testing in AI/ML can help to avoid these costly errors, as well as some of the main technical challenges in developing a robust QA process.

Why do AI failures arise?

Historically, the smaller scope and scale of AI applications meant that developers could often “get away” with more limited forms of QA testing. The following are some of the major reasons why we see AI systems failing:

A. Inadequate Testing Approaches

For many AI developers, model evaluation often begins and ends with simple forms of accuracy testing, such as building a model with a set of “training data” and checking how well it performs on a “holdout” or “validation” dataset. This is a starting point: it can give you a sense of whether your model is overtrained (if the model’s performance degrades on the holdout set)… but that’s about it.

While simple holdout tests such as these are certainly one important aspect of model-testing, the fact remains that they are simply not comprehensive enough to prevent costly and large-scale errors in modern ML applications.

For instance, split-data testing cannot tell you anything about whether a model is biased against certain types of customers. Nor can it help explain how your model is weighting its various inputs to actually produce decisions, or identify various other potential unintended consequences of the model.
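To make the point concrete, here is a minimal sketch of how a single holdout accuracy figure can hide a problem that only shows up when results are broken out by customer segment. The data, the segment labels "a" and "b", and the helper names are all hypothetical and chosen purely for illustration; this is not a full fairness audit.

```python
from collections import defaultdict


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def accuracy_by_group(y_true, y_pred, groups):
    """Accuracy computed separately for each distinct value in `groups`."""
    buckets = defaultdict(lambda: ([], []))
    for t, p, g in zip(y_true, y_pred, groups):
        buckets[g][0].append(t)
        buckets[g][1].append(p)
    return {g: accuracy(ts, ps) for g, (ts, ps) in buckets.items()}


# Hypothetical holdout set: 8 customers in segment "a", 2 in segment "b".
y_true = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0, 0, 1]
groups = ["a"] * 8 + ["b"] * 2

overall = accuracy(y_true, y_pred)                      # 0.8
per_group = accuracy_by_group(y_true, y_pred, groups)   # {"a": 1.0, "b": 0.0}
print(overall, per_group)
```

In this toy case the overall holdout accuracy looks respectable, yet every prediction for the smaller segment is wrong. A plain split-data test would report only the first number.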

In other words, the limited QA tests of the past have left a large number of critical issues, such as questions of ethics and accountability, completely unaddressed. Oversights such as these have led to many of the critical, high-profile failures we’ve seen in recent history.

B. Inadequate Business Context and Input

There is still a large divide between data scientists and the businesses they serve when it comes to proper testing. Domain knowledge often flows only one way: a data scientist is taught about a domain so that they can build a model. Even as we get better at building models, that knowledge often doesn’t extend to how we test them, and the QA process itself frequently lacks business context.

A classic example of missing business context is the rare-event detection problem, such as deciding whether a consumer payment is fraudulent. I could build a model that always chooses the common outcome and, statistically, have extremely high precision and recall for that outcome. But is that what the business is trying to accomplish? How we measure success is very important, and it must tie back to the business context.
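The rare-event trap can be shown in a few lines. The sketch below assumes a hypothetical set of 1,000 payments, of which 10 are fraudulent, and a "model" that always predicts the common outcome. The labels, counts, and the `precision_recall` helper are illustrative assumptions, not figures from any real system.

```python
def precision_recall(y_true, y_pred, positive):
    """Precision and recall, treating `positive` as the class of interest."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # By convention here, an undefined ratio (zero denominator) is reported as 0.0.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical data: 990 legitimate payments and 10 fraudulent ones.
y_true = ["legit"] * 990 + ["fraud"] * 10
# A "model" that always chooses the common outcome.
y_pred = ["legit"] * 1000

legit_precision, legit_recall = precision_recall(y_true, y_pred, "legit")
fraud_precision, fraud_recall = precision_recall(y_true, y_pred, "fraud")

print(legit_precision, legit_recall)  # 0.99 1.0
print(fraud_precision, fraud_recall)  # 0.0 0.0
```

Measured on the common outcome, the model looks excellent. Measured on the outcome the business actually cares about, it catches nothing, which is why the choice of metric has to come from the business context.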

In fact, business context matters in all respects. The model’s degree of robustness and explainability must be appropriate for the business case. Even bias or fairness is relative to the problem and to the regulatory context or business requirements. For example, I maintain that I’m a better driver than my wife, but my insurance company, with their rate plan, disagrees… and they are allowed to price based on gender. In other contexts, this would likely not be allowed. (Side note: my wife agrees with my insurance company.)

A proper QA test plan must take into account this business context and cannot just be a data science exercise. The field of AI/ML will inevitably need to adopt more rigorous and comprehensive forms of testing which can take these kinds of concerns into account, and which can catch and correct these mistakes before they can do harm in the real world.

C. Inadequate or Bad Data

There is a classic adage in data management: ‘garbage in, garbage out.’ While that’s definitely true, bad data is no reason to be held hostage or to wave a white flag and surrender. Proper QA of an ML system also includes a close inspection of the data used to train the system. Is the underlying data biased? Is the data sparse, and is that an artifact of the sampling or a genuine distribution of cases in the wild?

Proper QA also involves understanding how the model interacts with its data and inputs. Take, for example, the case of rare-event detection: how has the model been trained in relation to the likely skewed dataset? Often this involves data manipulation (up-sampling or down-sampling).
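As a rough illustration of the up-sampling mentioned above, the sketch below duplicates randomly chosen rows from the smaller classes until every class matches the size of the largest one. This is one simple approach among several, and the function name, row format, and class counts are assumptions made for the example. A QA reviewer would typically also want to confirm that any such resampling was applied only to the training data and not to the holdout set.

```python
import random


def upsample_minority(rows, label_of, seed=0):
    """Randomly duplicate rows of smaller classes until all classes are the same size."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    by_label = {}
    for row in rows:
        by_label.setdefault(label_of(row), []).append(row)

    target = max(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(group)
        # Sample with replacement to fill the gap up to the largest class size.
        balanced.extend(rng.choices(group, k=target - len(group)))
    rng.shuffle(balanced)
    return balanced


# Hypothetical skewed training set: 95 legitimate rows and 5 fraudulent ones.
rows = [(i, "legit") for i in range(95)] + [(i, "fraud") for i in range(95, 100)]
balanced = upsample_minority(rows, label_of=lambda row: row[1])

fraud_count = sum(1 for row in balanced if row[1] == "fraud")
legit_count = sum(1 for row in balanced if row[1] == "legit")
print(len(balanced), fraud_count, legit_count)  # 190 95 95
```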

D. Things Change… and Your Model Doesn’t

Things change. COVID happens, for instance, and the inputs into your model might change. For example, ICU bed numbers might rise to levels never seen in a hospital capacity model’s historical data, a range the model was never tested on, and its predictions break down. Or the predictions a model makes may no longer be relevant. For example, what was a good credit lending risk pre-COVID might not be a good credit lending risk post-COVID… and the model won’t inherently know this.

This is both an operational problem (you need to monitor your models in production) and a QA problem (you need to know how your models will perform before they go into production, so that you know when they need to be retrained).
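One very simple form of such a check, sketched below under stated assumptions, is to record the range of each input seen in the historical data and flag any production input that falls outside it. The feature names and numbers are hypothetical, and real monitoring would usually compare whole distributions rather than just minimum and maximum values.

```python
def training_ranges(rows):
    """Minimum and maximum observed for each numeric feature in the historical data."""
    ranges = {}
    for row in rows:
        for name, value in row.items():
            lo, hi = ranges.get(name, (value, value))
            ranges[name] = (min(lo, value), max(hi, value))
    return ranges


def out_of_range_features(row, ranges):
    """Names of features in `row` whose value lies outside the historical range."""
    return [
        name
        for name, value in row.items()
        if name in ranges and not (ranges[name][0] <= value <= ranges[name][1])
    ]


# Hypothetical history for a hospital capacity model.
history = [
    {"icu_beds_in_use": 12, "admissions": 40},
    {"icu_beds_in_use": 20, "admissions": 55},
    {"icu_beds_in_use": 17, "admissions": 48},
]
ranges = training_ranges(history)

# A new observation with ICU usage far above anything in the history.
flagged = out_of_range_features({"icu_beds_in_use": 63, "admissions": 50}, ranges)
print(flagged)  # ['icu_beds_in_use']
```

A flagged input does not prove that the prediction is wrong, only that the model is operating in a region where it was never tested.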

As we will see, however, developing more robust QA protocols to ensure trustworthy AI is not a simple matter. Rather, rising to these challenges involves a number of complex and interlocking problems that developers will continually have to solve as they generate and fine-tune new ML models.

In our next post we will dive into some of the challenges in testing ML.