Sunday, January 20, 2019

The Failed Promise of Big Data for IT Security

[I started thinking about applying Big Data technology to security-related data when I was first asked to write security guidance for a proof-of-concept Hadoop implementation.  Nothing ever came of my efforts.  The nut was too hard for me to crack with the limited data sets that I had and only a beginner’s understanding of R.  Then security product vendors started coming along claiming that their data analytics engines were going to turn the world upside down for security operations teams.  That never happened either.  I ran into a few articles recently that delved into the dismal record of security analytics and I’ve tried to capture their essence, along with a few thoughts of my own, in this post.]

Big Data systems, or, more correctly, data analytics techniques applied in an IT security context to Big Data-style repositories of log and sensor data, promised to transform IT security by giving organizations and IT security teams deep and automatic insights into malicious behavior.  Vendors touted data analytics for intrusion detection, insider threat activity detection, malware behavior detection, and phishing prevention.  Data analytics techniques (advanced statistical analysis, data mining, machine learning, natural language processing, and so on) would reveal insights that manual methods were simply unable to produce.

This promise remains unfulfilled.  Some theorists have begun to argue that fundamental limitations of the data set itself will prevent the highest hopes of security analytics from EVER being realized.  How can that be?  And what should IT security teams be doing with security analytics products?

In an IT security context, it’s not so much the data management aspects of Big Data solutions that we care about but the analytical methods we can apply to our collected security data.  The purpose of data analytics, from a general perspective, is to achieve some form of insight by extracting interesting or meaningful patterns from large and perhaps dissimilar data sets.  Security analytics applies data analytics methods to security-relevant data in order to assist in identifying the fingerprint of bad actors, malware, and malicious insiders so that incident response plans can be triggered and further action can be initiated.

What exactly do we mean by data analytics?  Data analytics, in this context, means applying methods such as statistical analysis, data mining, machine learning, and natural language processing to computer system and application logs along with security sensor data (primarily network traffic sensors) in order to detect and identify improper behavior.  Despite products having been available for years (mostly in the SIEM and DLP spaces), the promises of security analytics are still largely just promises.

“The noise about security analytics has grown deafening in the industry, but operational reality still lags far behind.”
- Gartner, 27 Mar 2017

In the cybersecurity space, the boundary between normal and anomalous behavior isn’t always obvious.  There are specific challenges and limitations inherent in the available data that impact the applicability and accuracy of data analytics techniques.

Some specific challenges in using data analytics with cybersecurity data include:


  • Data set availability – There are few reference data sets available for things like insider attacks.  Many patterns that current tools look for are only theoretical.  Attack patterns available on the internet are often either sanitized or only applicable to a given enterprise’s IT fabric.
  • Asymmetrical costs for errors – Depending on the use case, mistaken categorization can have a disproportionate cost.  For example, in phishing detection a legitimate email being classified as a phishing attack is an annoyance but has little cost.  Most mail filtering products allow filtered emails to be easily viewed and released.  However, allowing even a single actual phishing attempt through can have significant consequences, should the targeted user release the malicious payload.
  • Active adversary – Most data analytics activities are applied against a stream of data whose characteristics are relatively constant and where observing the data doesn’t affect the data generator.  In the cybersecurity space, however, malicious adversaries are constantly modifying and upgrading their techniques.  They know their footprint is being scrutinized.  They actively camouflage attacks (e.g. polymorphic viruses) and try new methods when old ones fail.  This adversarial learning means that the value of training data sets for machine learning, for example, will degrade quickly.
  • Dynamic and complex environments – Data analytics methods rely heavily on ‘normal’ activity happening in regular repeating patterns.  Known-good business processes happen over and over in regular, knowable patterns which make anomalous behavior stand out, right?  If only our IT environments behaved that way.  IT environments are messy, constantly changing, noisy, fraught with one-time events, and are almost always poorly inventoried (despite what the guy who runs your CMDB tells you).  Virtual technologies, mobile devices, and cloud services have aggravated the situation.  Servers and services will be spun up, run for a bit, then vanish, never to be seen again.  Was that a new production feature or malware?  Only a long and tedious investigation MIGHT tell you whether it was one or the other.
  • Base rate fallacy – The base rate fallacy is an error in probabilistic reasoning that arises when trying to detect low-probability events.  Attacks are by nature low density: there are thousands or hundreds of thousands of legitimate transactions for each handful of actual attacks.  When events are this rare, even a detector with a low false positive rate will generate false positives that greatly outnumber true positives.  This can be frustrating for security staff who investigate alert after alert without finding an actual attack.
  • Attack time scales – The time scale of malicious activity varies widely.  Attacks can take place in seconds or a patient adversary might deliberately slow his pace to allow an attack to proceed over the course of days or weeks.  Analysis methods dependent on time ranges can fail depending on the attacker’s mode of operations.
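
To make the base rate problem in the list above concrete, here is a minimal Python sketch that applies Bayes' theorem to a detector.  All of the numbers are hypothetical (one attack per 10,000 events, a 99% detection rate, a 1% false positive rate) and the function name is my own; plug in your own estimates.

```python
def alert_precision(attack_rate, detection_rate, false_positive_rate):
    """Probability that a given alert is a real attack (Bayes' theorem)."""
    true_alerts = attack_rate * detection_rate
    false_alerts = (1.0 - attack_rate) * false_positive_rate
    return true_alerts / (true_alerts + false_alerts)


# Hypothetical figures, chosen only for illustration.
precision = alert_precision(
    attack_rate=1e-4,          # 1 attack per 10,000 events
    detection_rate=0.99,       # detector catches 99% of attacks
    false_positive_rate=0.01,  # and mis-flags 1% of legitimate events
)
false_per_true = (1.0 - precision) / precision

print(f"Share of alerts that are real attacks: {precision:.2%}")
print(f"False alerts per real attack: {false_per_true:.0f}")
```

With these assumed inputs, roughly one alert in a hundred is a real attack, even though the detector looks excellent on paper.  The rarer the attacks, the worse that ratio gets.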


Given these challenges, it’s not surprising that security analytics products have largely failed to deliver the value we were promised.  It’s not that they can’t deliver value.  It’s more that the marketing hype, given the reality of the nature of detecting security events, has never been realistic.

So what’s an IT security professional to do?  The key right now is to focus on specific use cases and avoid general purpose solutions that try to boil the ocean.  Set manageable, measurable, and specific detection targets then use metrics to gauge (and demonstrate) your progress.  Make an active effort not to overwhelm your follow-up team with false positives.  Finally, make sure that you, and your boss, understand that using data analytics isn’t the security panacea that many security product and managed service vendors would have you think that it is.
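
As one way to put numbers on that progress, the sketch below tracks precision (how many alerts were real) and recall (how many real incidents raised an alert) for a single use case from week to week.  The class name, field names, and weekly counts are all hypothetical; the point is only that a handful of triage counts is enough to show whether tuning is helping.

```python
from dataclasses import dataclass


@dataclass
class TriageCounts:
    confirmed: int  # alerts investigated and confirmed as real incidents
    dismissed: int  # alerts investigated and dismissed as false positives
    missed: int     # incidents discovered by other means (no alert fired)

    @property
    def precision(self):
        total = self.confirmed + self.dismissed
        return self.confirmed / total if total else 0.0

    @property
    def recall(self):
        total = self.confirmed + self.missed
        return self.confirmed / total if total else 0.0


# Hypothetical triage results for one phishing-detection use case.
weekly = {
    "week 1": TriageCounts(confirmed=4, dismissed=196, missed=4),
    "week 2": TriageCounts(confirmed=6, dismissed=54, missed=2),
}

for label, counts in weekly.items():
    print(f"{label}: precision={counts.precision:.0%} recall={counts.recall:.0%}")
```

In this made-up example the follow-up team's wasted effort drops sharply between the two weeks while more real incidents are caught, which is exactly the kind of trend you want to be able to show your boss.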

Security is a process, not a product, and no amount of vendor promises will make your environment either more compliant or more secure.  Buying tools and implementing them without an understanding of what you’re trying to accomplish will just add to the noise, not increase the signal.  Improving the security of your IT environment, using security analytics or any other technology for that matter, requires a commitment of time, resources, and manpower and should be driven by your use cases and your security framework, not by having tools for the sake of having tools.


For further reading, I recommend:

  • “Security Analytics: Essential Data Analytics Knowledge for Cybersecurity Professionals and Students”, Verma et al, IEEE Computing Edge, May 2016
  • Gartner Research Note, “Demystifying Security Analytics: Sources, Methods and Use Cases”, 27 Mar 2017
  • Gartner Research Note, “Solution Path for Implementing Threat Detection and Incident Response”, 7 Jan 2019

For more information on particular subjects:

  • For a good explanation of base rate fallacy, take a look at the Wikipedia page at https://en.wikipedia.org/wiki/Base_rate_fallacy 
  • For more on AI and machine learning subjects like Bayesian methods and deep learning, see “Making AI More Human” in the June 2017 issue of Scientific American.  The article includes a discussion of applying machine learning techniques to spam filtering.

Sunday, January 13, 2019

The Chicken Tax

Ever wonder why you see TV commercial after TV commercial for pickups when light trucks make up only about a sixth of the vehicles on the road? Ever wonder why pickups seem to be so expensive compared to cars and why there are far fewer pickup truck models to choose from than car models? As with any economic phenomenon, there are lots of reasons but a big one is the leftover tariffs from a long-forgotten trade war that filled the headlines in the early 1960s. That trade war was primarily between the US and Western Europe and became known as the Chicken War.

Prior to the 1950s, chicken wasn't anywhere near the staple food that it is today. Chicken was expensive. The Hoover political slogan "A chicken in every pot" was a promise of luxury for all. Chicken farming methods advanced rapidly in the post-WWII years and soon the U.S. dominated the world chicken market. Cheap U.S. chicken exports hit small Western European farmers particularly hard, and their governments responded with tariffs on imported American chicken. The U.S. responded with tariffs of its own. One such tariff was aimed at West Germany's Volkswagen, particularly the incredibly popular VW Bus (ever wonder why they vanished from the roads?).

Over the intervening years, virtually all of the tariffs from the Chicken War have been repealed except for the U.S. import tax on light trucks and pickups. Car companies have moved to circumvent the tariff, to one extent or another, by either manufacturing their trucks in North America or at least doing final assembly here. Some cargo vehicles are even manufactured overseas as passenger vehicles and brought to the U.S., where their seats are ripped out and cargo beds installed (which is still cheaper than the tariff).

Protected from competition, some economists argue, light trucks have become a huge profit center for U.S. car companies and their incentive to develop new models and keep prices down has been curtailed... all because of cheap chicken.

Tuesday, January 8, 2019

Securing Big Data Systems


Big Data is one of the current buzzwords in the cybersecurity space, both in terms of using Big Data solutions to improve overall IT security and in terms of securing Big Data implementations.  IT security needs to provide guidance to the application teams that are implementing Big Data solutions so that these new applications are implemented with security built in from the beginning, rather than bolted on later.  Security hasn't been at the top of the development list for many Big Data products as they rapidly evolve, so it's up to users to make sure that these products get implemented and used correctly.

What is IT security's role in Big Data solution implementation?  In the context of Big Data, the security team's primary deliverables are (1) security guidance for the Big Data solution architecture and (2) direction for implementing existing security controls and tools in a new technology environment.  What the security team, particularly the security architecture function, needs to do is to use existing patterns and guidance as a baseline and generate draft guidance based on those patterns, on research, and on vendor input.  That guidance should be updated based on feedback from the functional teams and initial informal audits.  As your guidance cycles through the various teams, it will firm up and will eventually become concrete enough to add to your security standards.

The Cloud Security Alliance (www.cloudsecurityalliance.org) Big Data Working Group has done great baseline work related to Big Data and security.  If you're responsible for securing a Big Data implementation, reading their "Expanded Top Ten Big Data Security and Privacy Challenges" is a must.

Consider a generic Big Data solution ecosystem:


You can map your security concerns using the CSA taxonomy.  They map as:


When you apply these concerns to the generic ecosystem, you get something that looks like:


This mapping can give you a jumping-off point for any security guidance document that you develop.  Apply your set of basic security practices (e.g. RBAC, centralized authentication, encryption, server standards, etc.) and categorize them according to ecosystem components and the CSA top ten, and your guidance will have a pre-built skeleton that will be easy to flesh out for your specific tools and solution.
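
A rough sketch of how that skeleton could be generated is shown below, in Python.  The component names ("data store", "cluster nodes") are placeholders of my own, and the assignment of each control to a component and a CSA challenge is purely illustrative; substitute the components from your own ecosystem and check the challenge titles against the CSA document.

```python
from collections import defaultdict

# (control, ecosystem component, CSA challenge) -- illustrative mappings only.
CONTROLS = [
    ("RBAC", "data store", "Granular access control"),
    ("Centralized authentication", "cluster nodes", "Granular access control"),
    ("Encryption at rest", "data store", "Secure data storage and transactions logs"),
    ("Audit log collection", "cluster nodes", "Granular audits"),
]


def build_skeleton(controls):
    """Group controls by ecosystem component, then by CSA challenge."""
    skeleton = defaultdict(lambda: defaultdict(list))
    for control, component, challenge in controls:
        skeleton[component][challenge].append(control)
    return skeleton


def render(skeleton):
    """Return the skeleton as an indented plain-text outline."""
    lines = []
    for component in sorted(skeleton):
        lines.append(component)
        for challenge in sorted(skeleton[component]):
            controls = ", ".join(skeleton[component][challenge])
            lines.append(f"  {challenge}: {controls}")
    return "\n".join(lines)


print(render(build_skeleton(CONTROLS)))
```

Each heading in the resulting outline becomes a section of the guidance document, and any empty component/challenge combination is a gap worth a second look.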