“Essentially, all models are wrong, but some are useful.” —George E.P. Box
The nascent cybersecurity industry has plenty of useful models, from models for prioritizing vulnerabilities to models for catching anomalies. These models are all wrong in the same way: They are global models, trained on centralized data and distributed to enterprises worldwide.
Simple ones, like the Common Vulnerability Scoring System (CVSS), are designed by committees with little outcome data driving the choice of variables. More sophisticated models, like the Exploit Prediction Scoring System (EPSS), use data sourced centrally from contributing firms.
This article will focus on why these types of global models have become state of the art in cybersecurity and where the industry can go from here.
Why global models?
Cybersecurity has quickly shot past human scale in recent years, with the number of vulnerabilities published daily climbing from about 20 to nearly 80 on average over the past decade. Teams worldwide are now struggling to keep up with a machine-scale problem. Many solutions attempt to cut down on the insurmountable level of noise. But why are global models the only solutions?
Most vendors struggle to gather enough data to train models of their own, because most cybersecurity solutions in place today were built in a world where data engineering was expensive and data scientists were rare. As a result, more than 100 vendors use the open EPSS model as part of their algorithms. It falls to enterprises to tailor these models to their individual environments, often through custom spreadsheets or reporting solutions.
Each retraining of a model is a time-consuming and often fruitless endeavor. A modeling team may spend half a year building a model only to learn that there’s no efficiency or coverage gain compared to the global model.
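To make that comparison concrete, here is a minimal sketch of how efficiency (the share of remediated vulnerabilities that were actually exploited) and coverage (the share of exploited vulnerabilities that were remediated) could be computed for a candidate local model against the global baseline. The scores, threshold and exploit labels below are made up for illustration.

```python
def efficiency_and_coverage(scores, exploited, threshold):
    """Score-based remediation policy: fix everything at or above `threshold`.

    efficiency: of the vulnerabilities we chose to remediate, how many
                were actually exploited (precision).
    coverage:   of the vulnerabilities that were exploited, how many
                did we choose to remediate (recall).
    """
    remediated = [vid for vid, s in scores.items() if s >= threshold]
    hits = [vid for vid in remediated if vid in exploited]
    efficiency = len(hits) / len(remediated) if remediated else 0.0
    coverage = len(hits) / len(exploited) if exploited else 0.0
    return efficiency, coverage


# Hypothetical hold-out data: per-CVE scores and the set of CVEs later exploited.
global_scores = {"CVE-A": 0.91, "CVE-B": 0.62, "CVE-C": 0.07, "CVE-D": 0.65}
local_scores = {"CVE-A": 0.88, "CVE-B": 0.12, "CVE-C": 0.05, "CVE-D": 0.71}
exploited = {"CVE-A", "CVE-D"}

for name, scores in [("global", global_scores), ("local", local_scores)]:
    eff, cov = efficiency_and_coverage(scores, exploited, threshold=0.5)
    print(f"{name}: efficiency={eff:.2f}, coverage={cov:.2f}")
```

If the local model cannot beat the global one on either metric over a hold-out period like this, the half year of modeling work bought nothing.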
Are global models good enough?
The limitations of global models are becoming more apparent with the rise and everyday acceptance of global generative AI models. An enterprise may not want to train GitHub's Copilot model on its own code or OpenAI's GPT on its sensitive data. Security and privacy dictate that sensitive data—often the most useful data for model training—needs to be kept local.
Some examples of this data include internal cybersecurity controls, past incident data, the business value of assets or even something as simple as the text of an IT ticket. This data can give a model grounds to downgrade or upgrade a specific risk, but it cannot be released into a global model, since doing so could expose sensitive details to anyone using that model.
One way to think about global models is to recall the accuracy of weather forecasts in the 1970s.
Data was gathered and used to make regional forecasts. While useful, those forecasts were not always accurate at the city level. As data sources multiplied, from satellites to individual weather stations, many different kinds of local forecasts came to market.
In the same way, there’s a future where enterprises can use a local—and more accurate—cybersecurity model built on top of the global model.
What can local models do for us?
Do we really need local models if a model trained on an enterprise’s individual data can underperform a global one? After all, global models usually include more data.
One of the obvious benefits of local models is defensibility. Remediating a CVSS 10 vulnerability but ignoring a CVSS 7 vulnerability that’s on a critical server without any mitigating controls isn’t a good look in a breach investigation.
From the standpoint of the global model, however, it’s the right move. This is because global models can very easily label something as “critical” or “high” but struggle to label something as “not risky.” They just don’t have the data to make that claim credibly. In contrast, a local model that’s aware of enterprise-specific data would be able to justify a “no” decision.
Future models will be local: aware that the global probability of exploitation for a particular issue is 92%, but able, given mitigating factors at the enterprise (for example, past incidents and the kind of software running across the environment), to downgrade that risk to 45%. Such a model creates a defensible risk-acceptance justification because it is aware of the necessary context.
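One way such a downgrade could work, purely as an illustrative sketch, is to treat the global score as a prior and apply enterprise-specific adjustment factors in odds space. The factor names and values below are hypothetical, not calibrated against any real data.

```python
def adjust_exploit_probability(global_prob, local_factors):
    """Scale the global exploitation probability by enterprise-specific
    likelihood factors (<1 lowers risk, >1 raises it), working in odds
    space so the result stays a valid probability."""
    odds = global_prob / (1.0 - global_prob)
    for factor in local_factors.values():
        odds *= factor
    return odds / (1.0 + odds)


# Hypothetical mitigating factors for one vulnerability in one environment.
factors = {
    "affected_software_not_internet_facing": 0.30,
    "no_related_incidents_in_two_years": 0.24,
}

local_prob = adjust_exploit_probability(0.92, factors)
print(f"global 92% -> local {local_prob:.0%}")  # roughly 45%
```

The point is not the particular math but that every factor in the calculation is traceable to enterprise context, which is what makes the resulting "no" defensible.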
What are the challenges?
On the infrastructure side, the biggest challenge is automating the ingestion of various types of enterprise data and the retraining and evaluation of models. Modern data stacks and the separation between storage and processing are making this a reality, but work remains. New AI models can categorize data as it comes in, eliminating the need to structure it during ingestion.
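As a minimal sketch of what categorize-at-ingest could look like, the snippet below stores the raw record as-is and only attaches a predicted category label. The classify_text helper is a hypothetical stand-in for whatever classifier the enterprise runs locally; a keyword heuristic fills in for it here.

```python
import json
from datetime import datetime, timezone


def classify_text(text: str) -> str:
    """Placeholder for a locally hosted text classifier (hypothetical)."""
    lowered = text.lower()
    if "phish" in lowered or "suspicious email" in lowered:
        return "incident:phishing"
    if "patch" in lowered or "cve-" in lowered:
        return "ticket:vulnerability"
    return "uncategorized"


def ingest(record: dict) -> dict:
    """Attach a category at ingest time instead of forcing a schema up front."""
    return {
        "raw": record,  # keep the original, unstructured payload
        "category": classify_text(record.get("text", "")),
        "ingested_at": datetime.now(timezone.utc).isoformat(),
    }


ticket = {"id": "IT-1042", "text": "User reported a suspicious email with a login link"}
print(json.dumps(ingest(ticket), indent=2))
```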
Retraining must preserve the separation of global and local data and models. This presents security and machine-learning automation challenges that remain unsolved, but cybersecurity, because it handles so much sensitive data, is well positioned to solve them first.
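One possible shape for that separation, sketched here with scikit-learn and made-up feature names, is to consume the global model only through its score and train a small local layer on data that never leaves the enterprise. This is an assumption about architecture, not a description of any existing product.

```python
from sklearn.linear_model import LogisticRegression

# The global model is consumed as-is: its score is just an input feature.
# Only the local layer below is (re)trained, and only on data that stays
# inside the enterprise. Feature names and values are hypothetical.
# Columns: [global_exploit_score, asset_business_value, has_mitigating_control]
X_local = [
    [0.92, 0.9, 1],
    [0.92, 0.2, 1],
    [0.40, 0.8, 0],
    [0.07, 0.1, 0],
    [0.65, 0.7, 0],
    [0.85, 0.3, 1],
]
# Labels drawn from the enterprise's own history: was the issue exploited here?
y_local = [0, 0, 1, 0, 1, 0]

local_layer = LogisticRegression().fit(X_local, y_local)

# Scoring a new vulnerability: global score in, enterprise-adjusted risk out.
new_vuln = [[0.92, 0.6, 1]]
print(f"enterprise-adjusted risk: {local_layer.predict_proba(new_vuln)[0][1]:.2f}")
```

Retraining the local layer never touches the global model's weights, and the enterprise's labels never leave its own data store.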
Enterprise-specific models are still rare because, paradoxically, they require economies of scale. Each model must be tailored to a company's data, yet the data types involved, and the vendors that supply them, are limited in number—and that overlap is what makes building such models at scale feasible.
Despite varying environments, the challenges—data engineering, schema mapping and testing against baselines—are consistent. These tasks demand significant investment, and the scarcity of data science expertise in security teams further complicates the process.
How can we prepare for the AI future?
Enterprises worldwide are preparing for the changes AI will bring to the way they do business. These same models are already showing promise for attackers as well; witness the rise in phishing and automated exploitation attempts.
To gain a real advantage in the age of generally available machine learning, defenders will have to move to the best possible models—and that means local models, using all the data at hand. It remains to be seen if these models will stay artisanal and hand-crafted or if the industry itself can become a little more crafty.