Big Data systems, or, more correctly, data analytics techniques applied in an IT security context to Big Data-style repositories of log and sensor data, promised to transform IT security by giving organizations and IT security teams deep and automatic insights into malicious behavior. Vendors touted data analytics for intrusion detection, insider threat activity detection, malware behavior detection, and phishing prevention. Data analytics techniques (advanced statistical analysis, data mining, machine learning, natural language processing, and so on) would reveal insights that manual methods were simply unable to produce.
This promise remains unfulfilled. Some theorists have begun to argue that fundamental limitations of the data set itself will prevent the highest hopes of security analytics from EVER being realized. How can that be? And what should IT security teams be doing with security analytics products?
In an IT security context, it’s not so much the data management aspects of Big Data solutions that we care about but the analytical methods we can apply to our collected security data. The purpose of data analytics, from a general perspective, is to achieve some form of insight by extracting interesting or meaningful patterns from large and perhaps dissimilar data sets. Security analytics applies data analytics methods to security-relevant data in order to assist in identifying the fingerprint of bad actors, malware, and malicious insiders so that incident response plans can be triggered and further action can be initiated.
What exactly do we mean by data analytics? Data analytics, in this context, means applying methods such as statistical analysis, data mining, machine learning, and natural language processing to computer system and application logs along with security sensor data (primarily network traffic sensors) in order to detect and identify improper behavior. Despite products having been available for years (mostly in the SIEM and DLP spaces), the promises of security analytics are still largely just promises.
“The noise about security analytics has grown deafening in the industry, but operational reality still lags far behind.”
- Gartner, 27 Mar 2017
In the cybersecurity space, the boundary between normal and anomalous behavior isn’t always obvious. There are specific challenges and limitations inherent in the available data that impact the applicability and accuracy of data analytics techniques.
Some specific challenges in using data analytics with cybersecurity data include:
- Data set availability – There are few reference data sets available for things like insider attacks. Many patterns that current tools look for are only theoretical. Attack patterns available on the internet are often either sanitized or only applicable to a given enterprise’s IT fabric.
- Asymmetrical costs for errors – Depending on the use case, mistaken categorization can have a disproportionate cost. For example, in phishing detection a legitimate email being classified as a phishing attack is an annoyance but has little cost. Most mail filtering products allow filtered emails to be easily viewed and released. However, allowing even a single actual phishing attempt through can have significant consequences, should the targeted user release the malicious payload.
- Active adversary – Most data analytics activities are applied against a stream of data whose characteristics are relatively constant and where observing the data doesn’t affect the data generator. In the cybersecurity space, however, malicious adversaries are constantly modifying and upgrading their techniques. They know their footprint is being scrutinized. They actively camouflage attacks (e.g. polymorphic viruses) and try new methods when old ones fail. This adversarial learning means that the value of training data sets for machine learning, for example, will degrade quickly.
- Dynamic and complex environments – Data analytics methods rely heavily on ‘normal’ activity happening in regular repeating patterns. Known-good business processes happen over and over in regular, knowable patterns which make anomalous behavior stand out, right? If only our IT environments behaved that way. IT environments are messy, constantly changing, noisy, fraught with one-time events, and are almost always poorly inventoried (despite what the guy who runs your CMDB tells you). Virtual technologies, mobile devices, and cloud services have aggravated the situation. Servers and services will be spun up, run for a bit, then vanish, never to be seen again. Was that a new production feature or malware? Only a long and tedious investigation MIGHT tell you whether it was one or the other.
- Base rate fallacy – The base rate fallacy is an error in probabilistic reasoning that occurs when detecting low probability events: the underlying rarity of the event is ignored when judging how trustworthy a detection is. Adversarial attacks are by nature low density, where there are thousands or hundreds of thousands of legitimate transactions to each handful of actual attacks. Because legitimate events so heavily outnumber attacks, even a detector with a small false positive rate will produce false positives that greatly outnumber true positives. This can be frustrating for security staff who investigate alert after alert without finding an actual attack.
- Attack time scales – The time scale of malicious activity varies widely. Attacks can take place in seconds or a patient adversary might deliberately slow his pace to allow an attack to proceed over the course of days or weeks. Analysis methods dependent on time ranges can fail depending on the attacker’s mode of operations.
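To make the base rate point above concrete, here is a minimal sketch in Python that applies Bayes' theorem to a detector's alerts. The numbers (one attack per 100,000 events, a 99% detection rate, and a 1% false positive rate) are hypothetical values chosen only for illustration, and the function name `alert_precision` is made up for this example.

```python
def alert_precision(base_rate, true_positive_rate, false_positive_rate):
    """Return the probability that an alert is a real attack (Bayes' theorem).

    base_rate           -- fraction of all events that are attacks
    true_positive_rate  -- fraction of attacks that raise an alert
    false_positive_rate -- fraction of legitimate events that raise an alert
    """
    true_alerts = base_rate * true_positive_rate
    false_alerts = (1.0 - base_rate) * false_positive_rate
    total_alerts = true_alerts + false_alerts
    if total_alerts == 0:
        raise ValueError("detector never alerts with these rates")
    return true_alerts / total_alerts


# Hypothetical, illustrative numbers only (not measurements).
precision = alert_precision(
    base_rate=1e-5,            # 1 attack per 100,000 events
    true_positive_rate=0.99,   # detector catches 99% of attacks
    false_positive_rate=0.01,  # and flags 1% of legitimate events
)
false_alerts_per_real_attack = (1.0 - precision) / precision

print(f"Chance an alert is a real attack: {precision:.4%}")
print(f"False alerts per real attack: {false_alerts_per_real_attack:.0f}")
```

Under these assumed rates, roughly one alert in a thousand corresponds to a real attack, even though the detector looks accurate on paper. Changing `base_rate` shows how quickly the picture improves for more common events, which is one reason narrow, well-understood use cases tend to be easier to operate.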
Given these challenges, it’s not surprising that security analytics products have largely failed to deliver the value we were promised. It’s not that they can’t deliver value. It’s more that the marketing hype has never been realistic, given what detecting security events actually involves.
So what’s an IT security professional to do? The key right now is to focus on specific use cases and avoid general purpose solutions that try to boil the ocean. Set manageable, measurable, and specific detection targets, then use metrics to gauge (and demonstrate) your progress. Make an active effort not to overwhelm your follow-up team with false positives. Finally, make sure that you, and your boss, understand that using data analytics isn’t the security panacea that many security product and managed service vendors would have you think that it is.
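One possible way to put the "measurable targets" advice into practice is to track, per use case, what fraction of triaged alerts turned out to be real. The sketch below is only an illustration: the use case names, the sample data, the 50% target, and the function names `precision_by_use_case` and `below_target` are all hypothetical, not taken from any product.

```python
from collections import Counter


def precision_by_use_case(triaged_alerts):
    """Compute alert precision per use case.

    triaged_alerts -- iterable of (use_case, was_real_attack) pairs,
                      one pair per alert the follow-up team investigated.
    """
    totals = Counter()
    confirmed = Counter()
    for use_case, was_real_attack in triaged_alerts:
        totals[use_case] += 1
        if was_real_attack:
            confirmed[use_case] += 1
    return {uc: confirmed[uc] / totals[uc] for uc in totals}


def below_target(precisions, target):
    """Return the use cases whose precision falls short of the target."""
    return sorted(uc for uc, p in precisions.items() if p < target)


# Hypothetical triage outcomes for two narrowly scoped use cases.
alerts = [
    ("phishing", True), ("phishing", False),
    ("phishing", True), ("phishing", True),
    ("insider_threat", False), ("insider_threat", False),
    ("insider_threat", False), ("insider_threat", True),
]

precisions = precision_by_use_case(alerts)
noisy = below_target(precisions, target=0.5)

for use_case, value in sorted(precisions.items()):
    print(f"{use_case}: {value:.0%} of alerts were real")
print("Use cases needing tuning:", noisy)
```

A per-use-case number like this gives you something to show progress against, and it flags which detections are sending the most wasted work to the follow-up team.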
Security is a process, not a product, and no amount of vendor promises will make your environment either more compliant or more secure. Just buying tools and implementing them without an understanding of what you’re trying to accomplish will just add to the noise, not increase the signal. Improving the security of your IT environment, using security analytics or any other technology for that matter, requires time, resource, and manpower commitment and should be driven by your use cases and your security framework, not by having tools for the sake of having tools.
For further reading, I recommend:
- “Security Analytics: Essential Data Analytics Knowledge for Cybersecurity Professionals and Students”, Verma et al, IEEE Computing Edge, May 2016
- Gartner Research Note, “Demystifying Security Analytics: Sources, Methods and Use Cases”, 27 Mar 2017
- Gartner Research Note, “Solution Path for Implementing Threat Detection and Incident Response”, 7 Jan 2019
For more information on particular subjects:
- For a good explanation of base rate fallacy, take a look at the Wikipedia page at https://en.wikipedia.org/wiki/Base_rate_fallacy
- For more on AI and machine learning subjects like Bayesian methods and deep learning, see “Making AI More Human” in the June 2017 issue of Scientific American. The article includes a discussion of applying machine learning techniques to spam filtering.