To protect people, we need a different type of machine learning

  • By Ed Bishop
  • 29 February 2020

Despite thousands of cybersecurity products, data breaches are at an all-time high. The reason? For decades, businesses have focused on securing the machine layer — layering defenses on top of their networks, devices, and finally cloud applications. But these measures haven’t solved the biggest security problem — an organization’s own people.

Traditional machine learning methods that are used to detect threats at the machine layer aren’t equipped to account for the complexities of human relationships and behaviors across businesses over time. These methods have no concept of “state,” the additional variable that makes human-layer security problems so complex. This is why “stateful machine learning” models are critical to security stacks.

The people problem

“Today, people have more control over company data and systems than ever before. In just a few clicks, employees can transfer thousands of dollars to a bank account or send 50,000 patient records in a single Excel file via email. An unbelievably slim margin of error determines whether these interactions end up being business as usual or a complete disaster, which is why so many data breaches are caused by human error.”

The problem is that people make mistakes, break the rules, and are easily hacked. When faced with overwhelming workloads, constant distractions, and schedules that have us running from meeting to meeting, we rarely have cybersecurity top of mind. And things we were taught in cybersecurity training go out the window in moments of stress. But one mistake could result in someone sharing sensitive data with the wrong person or falling victim to a phishing attack.

Securing the human layer is particularly challenging because no two humans are the same. We all communicate differently — and with natural language, not static machine protocols. What’s more, our relationships and behaviors change over time. We make new connections or take on projects. These complexities make solving human-layer security problems substantially more difficult than addressing those at the machine layer — we simply cannot codify human behavior with “if-this-then-that” logic.

The time factor

We can use machine learning to identify normal patterns and signals, allowing us to detect anomalies when they arise in real time. The technology has allowed businesses to detect attacks at the machine layer more quickly and accurately than ever before.

One example of this is detecting when malware has been deployed by malicious actors to attack company networks and systems. By inputting a sequence of bytes from a computer program into a machine learning model, it is possible to predict whether there is enough commonality with previously seen malware attacks — while successfully ignoring any obfuscation techniques used by the attacker. Like many other threat detection problem areas at the machine layer, this application of machine learning is arguably “standard” because of the nature of malware: A malware program will always be malware.
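To make the idea of "commonality with previously seen malware" concrete, here is a minimal sketch. It swaps in a simple byte n-gram overlap (Jaccard similarity) for a trained model, so it illustrates the stateless nature of the problem rather than any real detector. The byte strings and function names are hypothetical, and the only "obfuscation" it tolerates is padding around an otherwise unchanged payload.

```python
from __future__ import annotations


def byte_ngrams(data: bytes, n: int = 4) -> set[bytes]:
    """Return the set of all length-n byte windows in data."""
    return {data[i:i + n] for i in range(len(data) - n + 1)}


def similarity_to_known(sample: bytes, known: list[bytes], n: int = 4) -> float:
    """Highest Jaccard overlap between the sample and any known-bad sample."""
    grams = byte_ngrams(sample, n)
    best = 0.0
    for other in known:
        other_grams = byte_ngrams(other, n)
        union = grams | other_grams
        if union:
            best = max(best, len(grams & other_grams) / len(union))
    return best


# Hypothetical byte strings standing in for real binaries (assumption).
KNOWN_BAD = [b"\x90\x90connect;download;encrypt_files;demand_ransom"]
padded_variant = b"\x00\x00\x00" + KNOWN_BAD[0] + b"\x00junk"
benign = b"open_document;render_page;save_file;close_window"

score_variant = similarity_to_known(padded_variant, KNOWN_BAD)
score_benign = similarity_to_known(benign, KNOWN_BAD)
```

The score depends only on the bytes passed in: the same program produces the same score today, next month, or next year. That is the property the human-layer problems discussed next do not have.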

Human behavior, however, changes over time. So solving the threat of data breaches caused by human error requires stateful machine learning. 

Consider the example of trying to detect and prevent data loss caused by an employee accidentally sending an email to the wrong person. That may seem like a harmless mistake, but misdirected emails were the leading cause of online data breaches reported to regulators in 2019. All it takes is a clumsy mistake, like adding the wrong person to an email chain, for data to be leaked. And it happens more often than you might think. In organizations with over 10,000 workers, employees collectively send around 130 emails a week to the wrong person. At 52 weeks a year, that is close to 7,000 potential data breaches annually.

For example, an employee named Jane sends an email to her client Eva with the subject “Project Update.” To accurately predict whether this email is intended for Eva or is being sent by mistake, we need to understand — at that exact moment in time — the nature of Jane’s relationship with Eva. What do they typically discuss, and how do they normally communicate? We also need to understand Jane’s other email relationships to see if there is a more appropriate intended recipient for this email. We essentially need an understanding of all of Jane’s historical email relationships up until that moment.

Now let’s say Jane and Eva were working on a project that concluded six months ago. Jane recently started working on another project with a different client, Evan. She’s just hit send on an email accidentally addressed to Eva, which will result in sharing confidential information with Eva instead of Evan. Six months ago, our stateful model might have predicted that a “Project Update” email to Eva looked normal. But now it would treat the email as anomalous and predict that the correct and intended recipient is Evan. Understanding “state,” or the exact moment in time, is absolutely critical.
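The Jane, Eva, and Evan scenario can be sketched in a few lines. This is an illustration under stated assumptions, not a description of any production model: the email addresses, the sent-mail log, the 30-day half-life, and the function names are all hypothetical, and a simple recency-weighted count stands in for a learned model. The point it shows is that the same email gets a different answer depending on the moment at which it is evaluated.

```python
from __future__ import annotations

from datetime import datetime, timedelta

NOW = datetime(2020, 2, 29)
EVA = "eva@client-a.example"    # hypothetical address
EVAN = "evan@client-b.example"  # hypothetical address

# Hypothetical sent-mail log for Jane: (sent_at, recipient, subject).
# Weekly updates to Eva ended about six months ago; updates to Evan are recent.
HISTORY = [
    (NOW - timedelta(days=d), EVA, "Project Update") for d in range(180, 360, 7)
] + [
    (NOW - timedelta(days=d), EVAN, "Project Update") for d in range(1, 60, 7)
]


def relationship_score(history, recipient, subject, as_of, half_life_days=30.0):
    """Sum of exponentially decayed weights for matching emails sent before as_of."""
    score = 0.0
    for sent_at, to, subj in history:
        if to == recipient and subj == subject and sent_at < as_of:
            age_days = (as_of - sent_at).total_seconds() / 86400
            score += 0.5 ** (age_days / half_life_days)
    return score


def likely_recipient(history, candidates, subject, as_of):
    """Candidate with the strongest relationship for this subject as of the given time."""
    return max(candidates, key=lambda r: relationship_score(history, r, subject, as_of))


six_months_ago = NOW - timedelta(days=180)
then = likely_recipient(HISTORY, [EVA, EVAN], "Project Update", six_months_ago)
today = likely_recipient(HISTORY, [EVA, EVAN], "Project Update", NOW)
```

Evaluated six months ago, the sketch favors Eva; evaluated today, it favors Evan, so a "Project Update" addressed to Eva would be flagged as a likely mistake.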

Why stateful machine learning?

With a “standard” machine learning problem, you can input raw data directly into the model, like a sequence of bytes in the malware example, and it can generate its own features and make a prediction. As previously mentioned, this application of machine learning is invaluable in helping businesses quickly and accurately detect threats at the machine layer, like malicious programs or fraudulent activity.

However, the most sophisticated and dangerous threats occur at the human layer when people use digital interfaces, like email. To predict whether an employee is about to leak sensitive data or determine whether they’ve received a message from a suspicious sender, for example, we can’t simply give that raw email data to the model. It wouldn’t understand the state or context within the individual’s email history.

  • Stateful machine learning and email security

    With stateful machine learning, we can look across each employee’s historical email data set and calculate important features by aggregating all of the relevant data points leading up to that moment in time. We can then pass these features into the machine learning model. The time variable makes this a non-trivial task; features now need to be calculated outside of the model itself, which requires significant engineering infrastructure and a lot of computing power, especially if predictions need to be made in real time. But failure to adopt this type of machine learning means you will never be able to truly protect your people or the sensitive data they access.
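As a rough sketch of what "calculating features outside the model" might look like, the function below aggregates a sender's email log up to a chosen moment and returns a small feature dictionary. Everything here is an assumption made for illustration: the `SentEmail` record, the three feature names, and the 90-day window are hypothetical, and a real system would need far richer features and infrastructure to serve them in real time.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class SentEmail:
    sent_at: datetime
    sender: str
    recipient: str


def features_as_of(log, sender, recipient, as_of, window_days=90):
    """Aggregate one sender-recipient pair using only emails sent before as_of."""
    past = [e for e in log if e.sender == sender and e.sent_at < as_of]
    pair = [e for e in past if e.recipient == recipient]
    window_start = as_of - timedelta(days=window_days)
    recent = [e for e in past if e.sent_at >= window_start]
    recent_pair = [e for e in recent if e.recipient == recipient]
    last = max((e.sent_at for e in pair), default=None)
    return {
        "total_emails_to_recipient": len(pair),
        "days_since_last_email": None if last is None else (as_of - last).days,
        "share_of_recent_outbound": len(recent_pair) / len(recent) if recent else 0.0,
    }


# Hypothetical log: Jane wrote to Eva long ago and to Evan recently.
T0 = datetime(2020, 1, 1)
LOG = [SentEmail(T0 - timedelta(days=d), "jane", "eva") for d in (200, 190, 100)]
LOG += [SentEmail(T0 - timedelta(days=d), "jane", "evan") for d in (10, 3)]

eva_now = features_as_of(LOG, "jane", "eva", T0)
evan_now = features_as_of(LOG, "jane", "evan", T0)
eva_earlier = features_as_of(LOG, "jane", "eva", T0 - timedelta(days=95))
```

Filtering on `sent_at < as_of` is the important design choice: the features for a given email must reflect only what was known at the moment it was sent, which is why they have to be recomputed for every prediction instead of being learned once from raw data.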

People are unpredictable and error prone, and training and policies won’t change that simple fact. As employees continue to control and share more sensitive company data, businesses need a more robust, people-centric approach to cybersecurity. They need advanced technologies that understand how individuals’ relationships and behaviors change over time in order to effectively detect and prevent threats caused by human error.

*This article is part of a VentureBeat special issue. Read the full series here: AI and Security.

Ed Bishop is a co-founder and Chief Technology Officer.