INSIGHTS | January 21, 2014

Scientifically Protecting Data

This is not “yet another Snapchat pwnage blog post,” nor do I want to rehash the pros and cons of vulnerability disclosure. A vulnerability was made public, and somebody abused it by publishing 4.6 million records. Tough luck! Perhaps the most interesting article in the whole Snapchat debacle was the one published at www.diyevil.com [1], which explains how data correlation can yield interesting results in targeted attacks. The question then becomes, “How can I protect against this?”

Stored personal data is always at risk of being traced back to its original owner; given enough metadata, a skilled attacker can often do exactly that, and short of not storing the data at all, there is very little you can do to prevent it. Anonymity and privacy are not new concepts. Industries such as healthcare have wrestled with them for decades, if not centuries, and protecting patient data remains one of that industry’s most challenging problems. Where does the balance tip between protecting privacy by not disclosing that person X is a diabetic, and protecting health by giving EMTs information about allergies and existing conditions? It’s no surprise that researchers studying data anonymity and privacy often use healthcare information for their test cases. In this blog post, I want to focus on two key principles relating to this.

k-Anonymity [2] 
In 2000, Latanya Sweeney used US Census data to prove that 87% of US citizens are uniquely identifiable by the combination of their birth date, gender, and ZIP code [3]. That isn’t surprising from a mathematical point of view: there are approximately 310 million Americans and roughly 2 billion possible combinations of the {birth date, gender, ZIP code} tuple. You can easily find out how unique you really are through an online application that uses the latest US Census data [4]. Ideally, you would not store “unique identifiers” like names, usernames, or social security numbers at all, but that is rarely practical. Assuming that data storage is a requirement, k-Anonymity comes into play. By using data suppression, where data is replaced by an *, and data generalization, where, as an example, a specific age is replaced by an age range, companies can anonymize a dataset to a level where each row is identical to at least k-1 other rows. Who would have thought an anonymity level could actually be proven mathematically?
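To make this less abstract, here is a minimal Python sketch that applies suppression and generalization to a toy dataset and then computes the k it achieves. The records and the generalization rules are invented purely for illustration:

    from collections import Counter

    # Toy records: birth date, gender, and ZIP code are the quasi-identifiers.
    records = [
        {"birth": "1984-03-12", "gender": "M", "zip": "02139", "disease": "diabetes"},
        {"birth": "1984-07-29", "gender": "M", "zip": "02142", "disease": "flu"},
        {"birth": "1985-11-02", "gender": "F", "zip": "02139", "disease": "diabetes"},
        {"birth": "1985-01-15", "gender": "F", "zip": "02143", "disease": "asthma"},
    ]

    def generalize(rec):
        """Generalization: keep only the birth year and the first three ZIP digits.
        Suppression: replace gender with '*'."""
        return (rec["birth"][:4], "*", rec["zip"][:3] + "**")

    # k is the size of the smallest group of rows that share the same
    # generalized quasi-identifier tuple.
    groups = Counter(generalize(r) for r in records)
    k = min(groups.values())
    print(f"dataset is {k}-anonymous")  # each row is identical to at least k-1 others

Run against the toy data above, this prints “dataset is 2-anonymous”: every row has at least one twin it cannot be distinguished from.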

k-Anonymity has known weaknesses. Imagine that you know the data of your Person of Interest (POI) is among the four possible records of a 4-anonymous dataset. If those four records share a common trait like “disease = diabetes”, you know that your POI suffers from this disease without knowing which record is theirs. Add sufficient metadata about the POI, and another concept comes into play. Coincidentally, this is also where we find a possible solution for preventing correlation attacks against breached databases.

l-Diversity [5]
One thing companies cannot control is how much knowledge about a POI an adversary has. That does not, however, release us from our responsibility to protect user data, and this is where l-Diversity comes into play. Rather than looking at the fields an attacker can use to identify a person, this concept focuses on the sensitive information in the dataset: it requires that each group of records sharing the same quasi-identifiers contains at least l distinct, well-represented values for the sensitive attribute. By applying the l-Diversity principle to a dataset, companies can make correlating information notably more expensive for attackers by increasing the number of data points required to do so.
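Measuring l-Diversity is equally mechanical. Building on the toy dataset and generalize() function from the k-Anonymity sketch above, we group records by their generalized quasi-identifiers and count the distinct sensitive values in each group:

    from collections import defaultdict

    # Group the records by their generalized quasi-identifier tuple and
    # collect the distinct sensitive values ("disease") seen in each group.
    classes = defaultdict(set)
    for r in records:
        classes[generalize(r)].add(r["disease"])

    # l is the smallest number of distinct sensitive values in any group.
    l = min(len(diseases) for diseases in classes.values())
    print(f"dataset is {l}-diverse")

A group in which every record carries the same disease (l = 1) is exactly the weakness described earlier: the attacker learns the sensitive value without ever pinpointing the record.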

Solving Problems 
All of this sounds very academic, but the question remains whether we can apply it in real-life scenarios to better protect user data. In my opinion, we definitely can.

Social application developers should become familiar with the principles of k-Anonymity and l-Diversity. It’s also a good idea to define KPIs that datasets can be measured against. Whenever personal data is involved, organizations should agree on minimum values for k and l.

More and more applications allow a user’s email address to double as the username, which directly lowers the database’s l-Diversity score. Organizations should let users select their own usernames and should also offer to auto-generate usernames for them. Both tactics have drawbacks, but from a security point of view, they make sense.

Users, too, have some control. This becomes clear when analyzing the common data points that almost every application requires:

  • Email address:
    • Do not use your corporate email address for online services, unless it is absolutely necessary
    • If you have multiple email addresses, randomize the email addresses that you use for registration
    • If your email provider allows you to append random strings to your email address, such as name+random@gmail.com, use this randomization, especially if your email address is also your username for this service (see the sketch after this list)
  • Username:
    • If you can select your username, make it unique
    • If your email address is also your username, see my previous comments on this
  • Password:
    • Select a unique password for each service
    • In some cases, select a phone number for 2FA or other purposes
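To illustrate the advice above, the following Python sketch generates a throwaway registration identity per service: a plus-addressed email, an auto-generated username, and a unique password. The names, lengths, and the gmail.com domain are placeholders, not recommendations of specific values:

    import secrets
    import string

    def plus_address(local: str, domain: str) -> str:
        """Append a random tag (name+tag@domain); providers such as Gmail
        deliver plus-addressed mail to the base mailbox."""
        return f"{local}+{secrets.token_hex(4)}@{domain}"

    def random_username(length: int = 12) -> str:
        """Auto-generate a username that carries no personal information."""
        alphabet = string.ascii_lowercase + string.digits
        return "".join(secrets.choice(alphabet) for _ in range(length))

    def random_password(length: int = 20) -> str:
        """A unique, high-entropy password for each service."""
        alphabet = string.ascii_letters + string.digits + string.punctuation
        return "".join(secrets.choice(alphabet) for _ in range(length))

    # A fresh identity for one service registration:
    print(plus_address("name", "gmail.com"))
    print(random_username())
    print(random_password())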

With the concepts of k-Anonymity and l-Diversity in hand, we can see that this statement in the www.diyevil.com article is incorrect:

“While the techniques described above depend on a bit of luck and may be of limited usefulness, they are yet another tool in the pen tester’s toolset for open source intelligence gathering and user attack vectors.”

The success of the techniques discussed in this blog post depends on science and math. And where science and math are in play, solutions can be devised. I can only hope that the troves of “data scientists” currently being recruited also understand the principles I have described, and that we will eventually evolve into a world where not only big data matters, but where anonymity and privacy are no longer empty words.

[1] http://www.diyevil.com/using-snapchat-to-compromise-users/
[2] /wp-content/uploads/2014/01/K-Anonymity.pdf
[3] /wp-content/uploads/2014/01/paper1.pdf
[4] http://aboutmyinfo.org/index.html
[5] /wp-content/uploads/2014/01/ldiversityTKDDdraft.pdf

INSIGHTS | June 11, 2013

Tools of the Trade – Incident Response, Part 1: Log Analysis

There was a time when I imagined I was James Bond zip lining into a compromised environment, equipped with all kinds of top-secret tools. I would wave my hands over the boxes needing investigation, use my forensics glasses to extract all malware samples, and beam them over to Miss Moneypenny (or “Q” for APT concerns) for analysis. I would produce the report from my top-notch armpit laser printer in minutes. I was a hero.
As wonderful as it sounds, this doesn’t ever happen in real life. Instead of sporting a classy tuxedo, we are usually knee deep in data… often without boots! I have recently given a few presentations(1) on Incident Response (IR). The question I am most often asked concerns the tool chain that would enable an individual or a team to perform the basic actions one would expect from an Incident Responder.

Let me be clear. There is an immense difference between walking onto a crime scene where your organization intends to bring a perpetrator to court, and the approximately 90% of actual IR work you will perform. The biggest difference lies in how you collect and handle evidence, and you can rest assured that any improperly handled evidence will be relentlessly attacked by the defense.
However, I am writing this blog post from the point of view of a blue team member. I’ll focus on the tools that are available to determine what has happened, when it happened, and (most importantly) how to make sure it does not happen again.
TL;DR: Many, if not all, of the tools you will need are available in the SANS Investigative Forensic Toolkit (SIFT) VMware image(2). You can skip the rest of this post and start playing with these tools right away if you want to, but I hope I can at least give you some pointers on where to look first.
As I illustrated in the introduction, IR work is incredibly unsexy. Almost without exception, you will run into a special case that requires you to work on the command line for hours, trying to figure out what has occurred. The next-next-finish installation experience of many security products is something you will nostalgically reflect on over a burning log (pun intended).
You will typically have in your possession three important pieces of evidence:
  • Log files
  • Memory images
  • Disk images
I will address these in three separate blog posts and try to give you some ideas about the tools you can use during your investigation.
In this blog post, we will start off by looking at log files.
Every device on the network creates a gigantic amount of data. Some of it is used for monitoring purposes, but most of it is (hopefully) stored, never to be looked at again. This data is generally useless right up until a system is compromised; then, a few golden threads hidden within it will give you a good starting point for figuring out what has happened, and they might also be useful for tying the whole story together. Essentially, you must figure out what the log files want to tell you. (Cue the X-Files soundtrack.)
Sifting through log files is like panning for gold. Your initial sift will be very coarse-grained; as you get rid of more and more dirt, the mesh of your sieve will get finer and finer.
In my humble opinion, there is nothing better than a thorough understanding of the Unix command-line tools ‘grep’, ‘sed’, and ‘awk’. The good thing about these tools is that you will find them on almost any Unix or Linux box you can get your hands on. The bad thing is that you will need to know, or find out, exactly what you are looking for.
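As a contrived example of that first coarse pass, the following Python snippet pulls failed SSH logins out of a syslog-style auth log and ranks the noisiest (user, source) pairs. The log path and message format are assumptions about your environment, so treat this as a sketch rather than a recipe:

    import re
    from collections import Counter

    # Coarse sift: extract failed SSH password attempts from a syslog-style log.
    FAILED = re.compile(r"Failed password for (?:invalid user )?(\S+) from (\S+)")

    attempts = Counter()
    with open("auth.log", encoding="utf-8", errors="replace") as fh:
        for line in fh:
            m = FAILED.search(line)
            if m:
                attempts[m.groups()] += 1  # (username, source IP)

    # Finer sift: only the ten noisiest (user, source) pairs.
    for (user, source), count in attempts.most_common(10):
        print(f"{count:6d}  {user:<16} {source}")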
In the absence of a full-fledged SIEM infrastructure, two tools exist that I recommend you become familiar with:
  • OSSEC(3)
  • log2timeline(6)
First up, OSSEC. Not only is this a good (and open-source) host-based intrusion detection system, but it also allows you to feed it collected logs through its command-line tools! The events will trigger OSSEC’s out-of-the-box alerts and give you a good overview of the most interesting events that have taken place. You can also easily develop custom rule sets based on, for example, publicly available data from other compromises(4).
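As a sketch of what feeding collected logs to OSSEC can look like, the snippet below pipes a log file into the ossec-logtest utility, which decodes each line and reports the rule it would trigger. The binary path is the default install location and collected.log is a placeholder; adjust both for your setup:

    import subprocess

    # Feed collected log lines to OSSEC's logtest utility; it decodes each
    # line and reports which rule (if any) it would trigger.
    OSSEC_LOGTEST = "/var/ossec/bin/ossec-logtest"

    with open("collected.log", encoding="utf-8", errors="replace") as fh:
        result = subprocess.run(
            [OSSEC_LOGTEST], stdin=fh, capture_output=True, text=True
        )
    print(result.stdout)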
As you become more experienced with OSSEC, you can build your own set of scripts, drawing inspiration from a variety of places. In cases where ye olde *nix is not available, make sure you are also familiar with a Windows shell tool. (PowerShell is one heck of an alternative!) Also, keep your Command Line Kung Fu(5) handy!
You will not be able to perform IR without the powerful yet extremely elegant log2timeline(6) tool. This blog post does not offer enough space for a decent overview of log2timeline, but in essence, it groks data from different sources and creates a ‘Super Timeline’ that helps you home in on the timeframes relevant to your investigation. It is one thing to dig through information on a command line, but it is a completely different ball game when you can visualize events as they occurred.
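As a trivial illustration of narrowing in on a timeframe, the sketch below filters a super timeline that has been exported to CSV down to a window of interest. The column names and timestamp format are assumptions about the export, so adjust them to match your own timeline file:

    import csv
    from datetime import datetime

    # Keep only timeline rows that fall inside the window of interest.
    START = datetime(2013, 6, 1, 0, 0, 0)
    END = datetime(2013, 6, 3, 23, 59, 59)

    with open("supertimeline.csv", newline="", encoding="utf-8") as fh:
        for row in csv.DictReader(fh):
            ts = datetime.strptime(f"{row['date']} {row['time']}",
                                   "%m/%d/%Y %H:%M:%S")
            if START <= ts <= END:
                print(row["date"], row["time"], row["source"], row["desc"])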
In upcoming blog posts, I will dig into the juicy secrets that can be found in memory and disk images. A major lesson to learn as you climb the IR ladder is that many of the tools you will use are produced by your peers. Sure, commercial IR software is available and works well. However, the bulk of the tools at your disposal were developed by fellow incident responders to address particular problems, and many responders generously release them to the public domain for others to use. We learn from others probably more than we teach.
 

(1) BYO-IR: Build Your Own Incident Response: http://www.slideshare.net/wremes/isc2-byo-ir
(2) http://computer-forensics.sans.org/community/downloads
(3) http://www.ossec.net
(4) e.g. APT1 Indicators of Compromise @ http://intelreport.mandiant.com/
(5) http://blog.commandlinekungfu.com/
(6) http://log2timeline.net/