Hack the Haystack: UK Government Hosts Data Security Hackathon

by Simon McHattie

Organizations commonly view viruses and external hackers as the biggest threat to their data security, but 2,500 internal security breaches occur every day in the U.S. alone.

Finding insights into these “insider threats” requires sorting through vast amounts of complex data stored by organizations. It’s possible, but not easy.

With its “Hack the Haystack” hackathon, the U.K. Government set out to investigate how machine learning can help detect and even predict such threats.

Mitigating Threats

Not all data breaches are nefarious. An employee may unwittingly use a personal USB stick to transfer files from a corporate device. Or they may store work files on their personal cloud accounts. Both increase the possibility that potentially sensitive data may find its way outside an enterprise’s security controls.

But insider threats are not always innocent mistakes. Malicious actions include employees taking contact databases when they leave a company, subverting a colleague’s data, or moving data for criminal purposes.

Whatever the cause, it’s a problem enterprises and governments can’t afford to ignore. Diagnosing a data breach does little to mitigate its effects, but AI is making it possible to predict problems before they cause damage.

Hack the Haystack

To help improve its security, the U.K. Government invited top data scientists to tackle the problem at the “Hack the Haystack” hackathon using a variety of machine learning and deep learning approaches. Eighteen teams from a range of government departments, academia and industries gathered.

Each team had access to a range of tools, including NVIDIA GPUs supplied by Microsoft Azure. And experts from NVIDIA were on hand to support the teams with technical know-how.

Competitors faced a complex, data-intensive task. Predicting insider threats requires fast analysis of a giant “haystack” of network data, such as firewall logs, user logins and logoffs, email traffic and transfer protocols. This data must also be scanned for subtle relationships between activities and any anomalies.

After two days of activity on synthetic data, the teams presented their results. Suggested solutions ranged from handcrafted rules-based filtering to the use of recurrent neural networks. The most commonly used methods employed traditional machine learning techniques such as isolation forests, k-means clustering and auto-encoders.
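To give a flavor of one of the methods named above, here is a minimal sketch of an isolation forest written in plain Python. It is not any team's actual solution: the features (login hour and megabytes transferred), the synthetic events, and every function name and parameter are assumptions made for illustration. The idea is that random splits separate unusual records from the rest in fewer steps, so a short average path through the trees yields a score closer to 1.

```python
import math
import random

def c_factor(n):
    """Average path length of an unsuccessful binary search tree lookup over n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    harmonic = math.log(n - 1) + 0.5772156649  # approximation of H(n - 1)
    return 2.0 * harmonic - 2.0 * (n - 1) / n

def build_tree(rows, depth, max_depth, rng):
    """Recursively split rows on a random feature at a random threshold."""
    if depth >= max_depth or len(rows) <= 1:
        return {"size": len(rows)}
    feature = rng.randrange(len(rows[0]))
    lo = min(r[feature] for r in rows)
    hi = max(r[feature] for r in rows)
    if lo == hi:
        return {"size": len(rows)}  # cannot split identical values
    split = rng.uniform(lo, hi)
    left = [r for r in rows if r[feature] < split]
    right = [r for r in rows if r[feature] >= split]
    return {
        "feature": feature,
        "split": split,
        "left": build_tree(left, depth + 1, max_depth, rng),
        "right": build_tree(right, depth + 1, max_depth, rng),
    }

def path_length(row, node, depth=0):
    """Depth at which row reaches a leaf, adjusted for points left unsplit there."""
    if "feature" not in node:
        return depth + c_factor(node["size"])
    child = node["left"] if row[node["feature"]] < node["split"] else node["right"]
    return path_length(row, child, depth + 1)

def anomaly_scores(rows, n_trees=100, sample_size=256, seed=0):
    """Return one score per row in (0, 1); higher means easier to isolate."""
    if len(rows) < 2:
        raise ValueError("need at least two rows")
    rng = random.Random(seed)
    sample_size = min(sample_size, len(rows))
    max_depth = math.ceil(math.log2(sample_size))
    trees = [
        build_tree(rng.sample(rows, sample_size), 0, max_depth, rng)
        for _ in range(n_trees)
    ]
    norm = c_factor(sample_size)
    scores = []
    for row in rows:
        mean_depth = sum(path_length(row, t) for t in trees) / n_trees
        scores.append(2.0 ** (-mean_depth / norm))
    return scores

# Hypothetical synthetic events: (login hour, megabytes transferred).
data_rng = random.Random(42)
events = [(data_rng.gauss(9, 1), data_rng.gauss(50, 10)) for _ in range(200)]
events.append((3.0, 900.0))  # injected outlier: a 3 a.m. login with a large transfer

scores = anomaly_scores(events)
most_suspicious = max(range(len(events)), key=lambda i: scores[i])
print(most_suspicious, round(scores[most_suspicious], 3))
```

In this toy setup the injected 3 a.m. event should come out as the highest-scoring record. Real network logs would need far more feature engineering, and a production system would more likely rely on a tested library implementation than on hand-rolled code like this.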

Scalability and the ability to interpret very large datasets were crucial success factors in developing predictive insider threat solutions. And NVIDIA GPU-accelerated computing proved its potential in developing solutions quickly and cost-effectively.