
Watching the Watcher: How we evaluated DeepTempo with BNY’s help – Source: securityboulevard.com

Source: securityboulevard.com – Author: Evan Powell

Recently we reached a milestone in our design partnership with BNY, one of the world’s preeminent financial institutions and our nation’s oldest bank. You can read more about this milestone in our official announcement of our graduation from BNY’s Ascent program, and you can read much more about our approach in many of our blogs located here and in resources on our website. We can also provide a detailed write-up of our methodology, architecture, and results to users under NDA.

Users can also try a flavor of what we outline here, using our Tempo model running as a NativeApp on Snowflake.

Looking deeply at Deep Learning

This blog focuses on the evaluation process, as opposed to the value proposition for our Tempo LogLM. I hope that by sharing these tips and lessons learned we will accelerate the use of advanced technology in incident identification. As mentioned below, we are open-sourcing our approaches, including Python code and example logs. We welcome industry and academic collaboration.

Start with criteria

We have found a true partner in BNY, one with the breadth and depth of understanding of the domains necessary to better protect an institution from increasingly sophisticated attackers. In our experience, BNY is forward-looking while also being locked down and focused. BNY’s acquisition of an NVIDIA SuperPOD years ahead of its peers in the financial services industry shows its commitment to innovation and vision. For these and other reasons, BNY is an ideal partner for us as we reshape cybersecurity via deep learning for collective defense.

Together, we determined that BNY would be interested in at least the following criteria in the first phase and additional attributes in a subsequent phase.

Phase I:

  • Accuracy — including F1 score
  • Adaptability

Phase II:

  • Explainability
  • Efficiency
  • Return on investment

Accuracy

In the short term, the accuracy of the model was our focus. This is a standard consideration in any approach that attempts to find incidents; it answers the question of how effective the incident identification is. Like most teams of deep learning engineers, we are extremely proud of our F1 scores, which combine precision and recall into a single aggregate measure of accuracy.
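For readers who want the arithmetic spelled out, here is a minimal sketch of how F1 can be computed from analyst-reviewed alert counts. The counts in the example are illustrative placeholders, not results from this evaluation.

```python
# Minimal sketch: precision, recall, and F1 computed from analyst-reviewed alerts.
# The counts below are illustrative placeholders, not results from this evaluation.

def f1_score(true_positives: int, false_positives: int, false_negatives: int) -> float:
    """F1 is the harmonic mean of precision and recall."""
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return 2 * precision * recall / (precision + recall)

# Example: 42 flagged sequences confirmed malicious, 3 flagged sequences found benign,
# and 5 known-malicious sequences missed.
print(f"F1 = {f1_score(42, 3, 5):.3f}")
```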

Measuring the accuracy of a cybersecurity model is complicated by a lack of labels and by the risk that an approach overfits to a particular attack pattern while failing to identify derivatives of that pattern. Rules, in particular, are unable to see similar attacks, and machine learning models that have been trained to spot specific indicators can have the same weakness.

The lack of labels is inherent to cyber security: there are relatively few known attacks that most enterprises will have available. This is one reason that transfer learning from a foundation model is so important. More on that below, under Adaptability.

The lack of labels can make determining the precision of a model challenging. One approach is to manually review the output of a model or set of indicators. BNY, thankfully, was willing to conduct this analysis for us. Our LogLM identified dozens of apparently concerning sequences of attacks and flagged the IP addresses contained within those sequences. BNY manually investigated each of these sequences, using the flagged IP addresses and related IP addresses as anchors for the investigation.

As an aside, the lack of labels is also why a self-supervised approach to pretraining foundation models is the right one. Very large-scale pretraining of foundation models was impractical in the past; transformer-based approaches now allow models to be built at previously unattainable scale, with corresponding gains in accuracy, adaptability, and even explainability.

Another approach we have undertaken is to build MITRE ATT&CK-aligned attack sequences and then use these to help calibrate a classifier. We discussed this approach in a prior blog, and we will be open-sourcing this work in the weeks to come. You can raise your hand to assist, whether that means providing wish lists or spending some time contributing useful Python in the Open Security Community here: https://github.com/deepsecoss
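As a rough illustration of what such calibration data can look like, here is a minimal sketch that represents an ATT&CK-aligned sequence as an ordered list of flow-log events tagged with the technique it emulates. The field names and technique ID are illustrative assumptions, not the contents of the forthcoming open-source repository.

```python
# Minimal sketch: representing an ATT&CK-aligned attack sequence as labelled data.
# Field names and the technique ID are illustrative assumptions.
from dataclasses import dataclass, field

@dataclass
class LabeledSequence:
    technique: str                               # MITRE ATT&CK technique ID, e.g. "T1046"
    events: list = field(default_factory=list)   # ordered flow-log records

scan_sequence = LabeledSequence(
    technique="T1046",  # Network Service Discovery
    events=[
        {"src": "10.0.0.5", "dst": "10.0.0.17", "dst_port": 22, "bytes": 120},
        {"src": "10.0.0.5", "dst": "10.0.0.17", "dst_port": 443, "bytes": 96},
        {"src": "10.0.0.5", "dst": "10.0.0.17", "dst_port": 3389, "bytes": 88},
    ],
)

# Sequences like this can serve as labelled positives when calibrating a classifier
# or when spot-checking a model's recall on known patterns.
print(scan_sequence.technique, len(scan_sequence.events), "events")
```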

Additionally, we must credit the Canadian Institute for Cybersecurity. The work they did years ago helped us pre-train our Tempo LogLM over the last two years. You can find more about their work here.

BNY, like all mature security enterprises, also uses red team evaluations, and we are participating in these activities as well.

Lastly, in cyber security, the SOC often has the last word. Do they trust the solution once it runs in production? Can they quickly act on the alerts provided, or are those alerts deprioritized by the SOC? I view this as the most critical test of all: how will the insights of our Tempo LogLM be acted upon in production?

Adaptability

Adaptability is underemphasized in cyber security. Most machine learning solutions take months of tuning to show effectiveness in a new environment. And of course, most rules are hard-coded to work best within a particular environment. This lack of adaptability has left the industry with a very brittle set of systems that are expensive to maintain and slow to deliver benefits.

By comparison, we build foundation models, which we call LogLMs. These LogLMs generalize well and can show their capabilities quickly, often without ANY time spent on customization at all.

To quantify this adaptation, our BNY partners suggested we follow a four-part approach. In all cases, the model identified concerning sequences and, as mentioned above, BNY detection engineers examined the validity of these results.

The four parts of the evaluation were as follows:

  1. No adaptation — most limited data set
  2. No adaptation — additional data provided
  3. Rapid, classifier-based adaptation — limited data set
  4. Rapid, classifier-based adaptation — additional data provided

This four-part approach sought to first establish a baseline for the model and to then demonstrate the ability of the model to adapt within the BNY domain with the assistance of more data and the aid of a classifier.
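For concreteness, here is a minimal sketch of what rapid, classifier-based adaptation can look like: the foundation model stays frozen, and a lightweight classifier head is fit on its sequence embeddings using a small, environment-specific labelled set. The random arrays stand in for embeddings produced by the frozen model; this is an illustration, not our production adaptation code.

```python
# Minimal sketch: "rapid, classifier-based adaptation" with a frozen foundation model.
# Random arrays stand in for embeddings produced by the frozen model; labels stand in
# for a small set of analyst-confirmed verdicts from the new environment.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(300, 768))              # frozen-model sequence embeddings
labels = rng.choice([0, 1], size=300, p=[0.9, 0.1])   # 1 = confirmed concerning

# Fit only a lightweight head; the foundation model itself is never retrained.
head = LogisticRegression(max_iter=1000, class_weight="balanced")
head.fit(embeddings, labels)

new_embeddings = rng.normal(size=(5, 768))            # embeddings of new sequences
print(head.predict_proba(new_embeddings)[:, 1])       # probability each is concerning
```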

Whereas many machine learning models and products take months of tuning to achieve acceptable precision (a precision all too easily lost as the environment changes), our Tempo LogLM compared well with these models before any adaptation, and then showed further improvement in tests 2, 3, and 4.

This approach should be rather simple for any deep learning based solution to follow to demonstrate whether it adapts well. If it does, it is likely a foundation model, with all that entails. Not all deep learning is the same, by the way: graph neural networks (GNNs) have not been shown to generalize well, an Achilles heel that may have contributed to Lacework’s challenges, as I previously explained.

Explainability

As mentioned above, the real test for any set of indicators is the extent to which the SOC comes to rely upon them. We heard early on from our advisors and investors, such as Chris Bates, the long-time former CISO and chief trust officer at SentinelOne, that a more accurate black box would be of limited interest to the SOC operator. Our founding engineer, Josiah Langley, has shared that, as a former threat hunter and engineer at Dragos, he had to deeply understand the rules and other indicators to know how or whether to act upon their alerts.

As our underlying model relies upon a many-to-many comparison of 768-dimensional tensors, explainability was a challenge for us, and one we started to address even before founding DeepTempo.
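To make the idea concrete, here is a minimal sketch of a many-to-many comparison between sets of 768-dimensional embeddings using cosine similarity. This is a generic illustration of the comparison style, not our proprietary scoring.

```python
# Minimal sketch: a many-to-many comparison of 768-dimensional embeddings via
# cosine similarity. Random vectors stand in for real sequence embeddings.
import numpy as np

rng = np.random.default_rng(0)
new_sequences = rng.normal(size=(5, 768))     # embeddings of freshly observed sequences
reference_set = rng.normal(size=(100, 768))   # embeddings of known or expected behavior

def cosine_similarity_matrix(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    a_norm = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_norm = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a_norm @ b_norm.T

sims = cosine_similarity_matrix(new_sequences, reference_set)
# A sequence whose best match against expected behavior is weak may merit review.
print("weakest best-match score:", sims.max(axis=1).min())
```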

It is hard to know exactly how to measure explainability. Our approach includes the following — and some other proprietary techniques:

Provide and measure the accuracy of a mapping of incidents to MITRE ATT&CK patterns

  • We may not think in 768 dimensions, but if you work in a SOC you speak MITRE ATT&CK. A sketch of one way to score such a mapping appears after this list.

Create and provide the sequences within the logs themselves

  • It is a simple enough requirement: can a user immediately see the concerning sequence? We have put a lot of thought into how to make all of the embeddings our model creates useful, including those that are flagged as concerning and those that are not.

This usefulness can be evaluated largely by human feedback, as well as by the accuracy of the model in predicting the ground truth of the sequences themselves.

Dashboards

  • Splunk, Snowflake, and other vendors have the eyeballs of the data analysts and the SOC. We provide dashboards to our users that fit our information into these environments. Whether they use our dashboards or their own, our users can apply all the context of these solutions, for example looking at everything impacting an IP, including the outputs of our Tempo LogLM.
  • Please review a demo of an example dashboard on our YouTube channel: https://www.youtube.com/@DeepTempo-ai
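As referenced above, here is a minimal sketch of one way to score an incident-to-ATT&CK mapping: top-k accuracy of predicted technique IDs against analyst-assigned labels. The technique IDs shown are illustrative placeholders.

```python
# Minimal sketch: top-k accuracy of predicted ATT&CK techniques against analyst labels.
# The technique IDs below are illustrative placeholders.
def top_k_accuracy(predictions: list[list[str]], labels: list[str], k: int = 3) -> float:
    hits = sum(label in preds[:k] for preds, label in zip(predictions, labels))
    return hits / len(labels)

predicted = [["T1046", "T1071", "T1110"], ["T1071", "T1048", "T1046"]]
analyst_labels = ["T1071", "T1030"]
print(f"top-3 accuracy: {top_k_accuracy(predicted, analyst_labels):.2f}")
```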

Efficiency

To measure efficiency, we capture concrete metrics and also attempt to gauge softer, harder-to-quantify benefits.

First, the soft metrics: when discussing our approach with potential users, we want to get to know them better. We often ask: how do you build and maintain your rules-based indicators? Who built them? How are they documented? How are they tested?

The point of these questions is to understand what the team is doing and to surface the cost of the massive technical debt under which most of the cybersecurity industry is struggling. This segues naturally into the efficiency gains from having a quick-to-adapt and extremely accurate solution with baked-in explainability. Still, these gains are often hard to quantify.

Easier to quantify is the ability of our models and other software to handle large streams of data while using a relatively small amount of compute and memory. Without getting into proprietary details here, we have shown that the approach can scale horizontally with the help of standard containerized approaches from NVIDIA and Snowflake. Additionally, in many cases, the bottleneck is getting the logs back to a central location for analysis. In those cases, our models and related software run in a decentralized manner.
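As a simple illustration of the concrete side of efficiency, here is a minimal sketch that measures throughput (sequences scored per second) and peak memory around a batch-scoring call. The score_batch function is a hypothetical stand-in for real model inference, not our actual pipeline.

```python
# Minimal sketch: measuring throughput (sequences scored per second) and peak memory
# around a batch-scoring call. score_batch() is a stand-in for real model inference.
import time
import tracemalloc

def score_batch(batch):
    # Placeholder scoring: real code would run model inference over each sequence.
    return [sum(len(str(event)) for event in seq) for seq in batch]

batch = [[{"dst_port": p, "bytes": p * 10} for p in range(50)] for _ in range(10_000)]

tracemalloc.start()
start = time.perf_counter()
score_batch(batch)
elapsed = time.perf_counter() - start
_, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

print(f"{len(batch) / elapsed:,.0f} sequences/sec, peak scoring memory {peak / 1e6:.1f} MB")
```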

ROI

A friend who is a deep technology investor, and who generally avoids cyber security, explained to us in an all-company meeting last fall that there are two kinds of solutions in cyber security: those that just document things and keep the lawyers and regulators happy, and those that apply deep technology to address the fundamental job of cyber security, i.e. greater security.

In our case, we add value by reducing the risk of especially advanced attacks. How can we quantify this? What is the value of reducing the risks of potentially successful attacks? In the case of BNY, they are a bedrock of capitalism itself, as old as the United States. What is the benefit of further securing that foundation?

We also have a hard ROI from cost avoidance. In particular, users decrease their retention of flow logs in expensive systems as they come to trust our solution to better identify and alert on certain attack vectors. The use of our embeddings for retroactive use cases, along with the log sequences we parse out and make immediately available, also increases users’ confidence in pushing a greater percentage of their logs into lower-cost data lakes such as Snowflake, object storage, and other platforms.
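As an illustration of the cost-avoidance arithmetic, here is a minimal sketch comparing the cost of retaining flow logs in a premium analytics tier with shifting a share of them to lower-cost storage. All prices, volumes, and percentages are hypothetical assumptions, not quoted rates or customer figures.

```python
# Minimal sketch: cost avoidance from shifting flow-log retention to cheaper storage.
# All prices, volumes, and percentages are hypothetical assumptions for illustration.
daily_ingest_gb = 2_000        # hypothetical flow-log volume per day
premium_cost_per_gb = 0.50     # hypothetical cost in a premium analytics tier
low_cost_per_gb = 0.03         # hypothetical data-lake / object-storage cost
share_moved = 0.60             # fraction of flow logs shifted to the cheaper tier

annual_savings = daily_ingest_gb * 365 * share_moved * (premium_cost_per_gb - low_cost_per_gb)
print(f"Illustrative annual cost avoidance: ${annual_savings:,.0f}")
```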

Our approach to pricing is to attempt to share the hard ROI benefits and to leave all of the soft ROI benefits to the user.

It gets a bit more complicated than that, of course. Like many enterprise vendors, we have a detailed ROI model that larger customers use to document their decision to rely upon our Tempo LogLM. Other users just try it out, burning off some of their Snowflake credits to get started, and go from there.

Conclusion

Cyber security has some inherent measurement challenges, which is likely one reason the industry seems to be making unwise and backward-facing investments. While spending on cyber security is increasing rapidly, so are losses. Thanks to their success, attackers have much more to spend than the $200-$250bn we collectively spend on cyber security. These attackers do not share our measurement challenges; they have a very simple method of measuring their success.

As this blog outlines, measurement across at least the following criteria has proven helpful to our users: accuracy, adaptability, explainability, efficiency, and ROI. We hope this blog and other work, including our open source contributions, will help buyers make more fact-based decisions about necessary investments in improved cyber security.


Watching the Watcher: How we evaluated DeepTempo with BNY’s help was originally published in DeepTempo on Medium, where people are continuing the conversation by highlighting and responding to this story.

*** This is a Security Bloggers Network syndicated blog from Stories by Evan Powell on Medium authored by Evan Powell. Read the original post at: https://medium.com/deeptempo/watching-the-watcher-how-we-evaluated-deeptempo-with-bnys-help-836e477d24cd?source=rss-36584a5b84a——2

Original Post URL: https://securityboulevard.com/2025/02/watching-the-watcher-how-we-evaluated-deeptempo-with-bnys-help/
