»Finding the best threat intelligence provider for a specific purpose: trials and tribulations«
2018-10-17, 15:15–16:00, Europe

We undertook a large project to evaluate the quality of APT TI/IoC sources and encountered multiple expected and unexpected challenges. We will present our approach, the challenges encountered and the results.

We operate a worldwide setup of network security sensors based on Suricata and additional self-developed detection capabilities. Standard crime & malware detection on these sensors is achieved using a combination of ET PRO IDS rules, our own Suricata rules, commercial IoCs, OSINT and Yara for parsing extracted files. We are actively maintaining our own APT detection rules and have built custom parsers for some APTs. This approach has proven to work quite well; it provides a lot of visibility and value. Nevertheless, we wanted to increase our APT detection capabilities by adding a commercial Threat Intelligence provider with an APT focus into the mix. But how do you find "the best" APT-focused Threat Intelligence provider?
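To make this concrete, here is a minimal, purely illustrative sketch of what feeding IoCs into such a sensor fleet could look like: a small Python script that turns domain indicators from an assumed JSON feed into Suricata DNS rules. The feed format, field names, sid range and rule options are assumptions for illustration, not our actual tooling or any vendor's format.

    # Illustrative sketch: turn domain IoCs from a hypothetical JSON feed into
    # Suricata DNS rules. Feed format, field names and sid range are assumptions.
    import json

    SID_BASE = 9000000  # assumed local sid range for generated rules

    def iocs_to_dns_rules(feed_path):
        with open(feed_path) as fh:
            # assumed feed format: [{"type": "domain", "value": "...", "reference": "..."}, ...]
            indicators = json.load(fh)

        rules = []
        domains = (i for i in indicators if i.get("type") == "domain")
        for offset, ioc in enumerate(domains):
            domain = ioc["value"].strip().rstrip(".")
            rules.append(
                'alert dns any any -> any any '
                f'(msg:"TI domain {domain}"; dns.query; content:"{domain}"; nocase; '
                f'reference:url,{ioc.get("reference", "unknown")}; '
                f'sid:{SID_BASE + offset}; rev:1;)'
            )
        return rules

    if __name__ == "__main__":
        for rule in iocs_to_dns_rules("feed.json"):
            print(rule)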

Over the last few years, security researchers have published interesting work on how to measure the quality of IoC feeds. Alexandre Pinto and Kyle Maxwell presented a statistical analysis of public TI feeds at DEF CON 22 and published tiq-test on GitHub. Their approach works well for IPs but does not look at context beyond ASN and geolocation. Paweł Pawliński and Piotr Kijewski of CERT.PL presented "Towards a Methodology for Evaluating Threat Intelligence Feeds" at the 28th FIRST Conference in 2016. They focused on comparison against blacklists and a few basic statistics of feeds over a period of six months. Such long-term analysis is very valuable but not possible with commercial vendors; sometimes even getting a 30-day trial period is difficult. These approaches, while valid, were not suitable to answer our questions, and so we had to develop our own approach for the comparison of TI providers. One aspect where this becomes very obvious is timeliness. When searching for APTs, well-labelled historical data is important, as we have seen time and again. For fast-changing malspam crime, old data usually equals useless data.
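As a rough illustration of the timeliness point, one simple metric to look at is how old an indicator already was (publication time versus first sighting on a sensor) when it first matched. The sketch below is hypothetical; the data structures are invented and not any provider's API.

    # Illustrative timeliness metric: days between an indicator's publication and
    # its first match on a sensor. Inputs are assumed dicts of indicator -> datetime.
    from datetime import datetime
    from statistics import median

    def age_at_first_sighting(published, first_seen):
        return {
            ioc: (first_seen[ioc] - published[ioc]).total_seconds() / 86400
            for ioc in published.keys() & first_seen.keys()
        }

    # For fast-moving crimeware a large median age suggests a stale feed; for APT
    # hunting, a match on an old but well-labelled indicator can still be gold.
    ages = age_at_first_sighting(
        {"evil.example": datetime(2018, 1, 2)},
        {"evil.example": datetime(2018, 3, 1)},
    )
    print(median(ages.values()))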

The challenges we encountered spanned several areas. Many were organizational challenges, both internal and external:

  • coordinating across different departments / teams
  • coordinating with different providers such that we could get access to their threat intelligence services at the same time

However, even when we were able to overcome some of the above, the nature of the different threat intelligence services meant that our original plan "Compare the quality of APT indicators of services X, Y and Z from time t to time t + x" simply wasn't feasible. Indeed, the reality is that:

  • the formats of the services differ from one another: MISP, STIX, JSON, you name it. (We knew that before, but just how much they differ when trying to answer a specific question is mind-blowing.)
  • indicators are not published in the same way: some providers like to use masks, others exact URLs. Even calculating the overlap between providers suddenly became a tricky task (see the sketch after this list).
  • some providers gave us all their TI except for their "APT-related TI", because that is too valuable. And some of them failed to tell us so...
  • some providers have multiple TI APIs with different content, but the documentation isn't always clear about that
  • context attached to indicators is not always delivered in a machine-readable way.
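To give a feel for why even the overlap computation becomes tricky, here is a simplified sketch of the mask-versus-exact-URL problem. The providers, formats and indicator values are invented for illustration.

    # Simplified sketch: one hypothetical provider ships exact URLs, another ships
    # wildcard masks. Naive set intersection finds no overlap; matching masks
    # against URLs does. All values are invented.
    from fnmatch import fnmatch
    from urllib.parse import urlparse

    exact_urls = {"http://evil.example/payload/x.exe",
                  "https://benign.example/index.html"}
    masks      = {"*evil.example/payload/*",
                  "*.othercampaign.example/*"}

    naive_overlap = exact_urls & masks            # empty: a string never equals a mask
    mask_overlap  = {u for u in exact_urls
                     if any(fnmatch(u, m) for m in masks)}

    # A coarser normalisation (compare hostnames only) changes the numbers yet
    # again, which is exactly why "overlap" needs a precise definition first.
    hostnames = {urlparse(u).hostname for u in exact_urls}
    print(len(naive_overlap), len(mask_overlap), hostnames)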

The data and structure of a threat intelligence feed reflect the "purpose" of the feed. The goal of our project was to improve the coverage of our network sensors, not to provide more context to our human analysts (which is a perfectly valid use case, but it was not the goal of our project).

And then there were challenges related to the specific threat we were interested in: APT. Trying to frame the question "How do you assess the 'APT-ness' of an indicator?" as a data question proved tricky. While a security analyst can dig into their own experience and vast domain knowledge to assess the quality of an indicator, a data scientist must rely on carefully labelled data to draw their conclusions.

We did not only compare the data and the provided context by themselves, as that would have been a purely quantitative approach; we also put the indicators to work in a lab setup with PCAPs from real-world networks. We did this mostly for false-positive testing of the indicators and to estimate the "analyst impact".
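To sketch what that replay looks like in practice, assume the provider indicators have already been converted into a Suricata rule file; the paths, file names and alert counting below are illustrative assumptions, not our actual lab setup.

    # Rough sketch of the lab replay: run Suricata offline over recorded PCAPs with
    # only the provider-derived rules loaded, then count alerts per signature as a
    # proxy for false-positive rate and analyst impact. Paths are assumptions.
    import json
    import subprocess
    from collections import Counter
    from pathlib import Path

    PCAP_DIR = Path("pcaps")          # hypothetical directory of real-world captures
    RULES    = "provider_x.rules"     # indicators converted into Suricata rules
    LOG_DIR  = Path("suricata-logs")

    def replay_and_count():
        LOG_DIR.mkdir(exist_ok=True)
        hits = Counter()
        for pcap in sorted(PCAP_DIR.glob("*.pcap")):
            # -r: read pcap offline, -S: load only this rule file, -l: log directory
            subprocess.run(["suricata", "-r", str(pcap), "-S", RULES, "-l", str(LOG_DIR)],
                           check=True)
            eve = LOG_DIR / "eve.json"
            with eve.open() as fh:
                for line in fh:
                    event = json.loads(line)
                    if event.get("event_type") == "alert":
                        hits[event["alert"]["signature"]] += 1
            eve.unlink()  # start fresh for the next capture
        return hits

    if __name__ == "__main__":
        for signature, count in replay_and_count().most_common(20):
            print(f"{count:6d}  {signature}")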

In the end we managed to compare most of the commercial providers on our list, and we did get some meaningful results. But what became very clear during our project is that

  • the current commercial TI landscape makes it extremely difficult to compare TI providers
  • we need a lot more research into the "how to answer specific TI questions?" area

We will present our approach, the challenges encountered and the results.