Artificial intelligence (AI) is now a major priority for government and defense worldwide — one that some countries, such as China and Russia, consider the new global arms race. AI has the potential to support a number of national and international security initiatives, from cybersecurity to logistics and counter-terrorism.

The overwhelming amount of public data available online is crucial for supporting a number of these use cases. These sources include unstructured social media data from both fringe and mainstream platforms, as well as deep and dark web data.

User offering an updated version of govRAT malware on a deep web hacking forum—discovered by Echosec Systems

While valuable, these sources are not always easily accessible through commercial threat intelligence platforms. Additionally, commercial data solutions, such as APIs, often deliver raw data in formats unsuitable for developing AI in the intelligence community.

How does public online data support AI and national security, and how can these feeds more effectively meet defense requirements for AI development?


AI and national security: The value of online data

AI applications in defense rely on training data from a variety of inputs. These could include technical cybersecurity feeds, aerial photography, or data from physical sensors in the field.

From these datasets, data scientists can develop machine learning models that automatically detect cyberattacks, monitor on-the-ground enemy activity, direct autonomous vehicles, and inform many other national security strategies.

Publicly available online data, specifically from social, deep, and dark web sources, is increasingly valuable for supporting a variety of AI applications in defense. For example:

  • Communication channels across the deep and dark web often signal targeted cybersecurity threats, like leaked classified data or coordinated malware attacks. Combining these sources with technical feeds like network traffic data creates a more robust artificial intelligence and national security strategy for addressing cyber risks.
  • Extremist groups worldwide use a variety of online spaces, from mainstream social networks to fringe sites like 4chan and 8kun, to sow disinformation, recruit, and plan violent attacks. Machine learning models are increasingly needed to monitor online extremism, because its growth and obfuscation techniques are outpacing current detection algorithms and human analysis. AI can help locate intentionally obfuscated chatter and imminent threat indicators like manifestos and planned attacks.
  • Foreign nation-states use AI to conduct information warfare both domestically and abroad. Conversely, AI helps defense analysts detect and monitor these targeted disinformation campaigns for intelligence purposes.
  • For some military operations, AI supports stronger command and control systems, which analyze data feeds from multiple domains in a centralized display. Cross-referencing data points from online social, deep, and dark web sources allows defense analysts to get more value from other feeds, expand AI functionality, and persistently monitor environments more effectively.
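As a rough illustration of cross-referencing online sources with a technical feed, the sketch below pulls IPv4 addresses out of post text and checks them against addresses seen in network logs. It is a minimal example under stated assumptions, not any vendor's implementation: the function names `extract_ips` and `cross_reference`, the post dictionary shape, and the idea of matching on IP addresses alone are all hypothetical choices made for illustration.

```python
import re
from ipaddress import ip_address

# Loose pattern for dotted-quad candidates; real validation happens below.
IPV4_CANDIDATE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def extract_ips(text):
    """Return the set of valid IPv4 addresses mentioned in a piece of text."""
    found = set()
    for candidate in IPV4_CANDIDATE.findall(text):
        try:
            found.add(str(ip_address(candidate)))
        except ValueError:
            # Not a real address (for example, an octet above 255); skip it.
            continue
    return found


def cross_reference(posts, network_log_ips):
    """Map each post ID to the addresses it mentions that also appear in logs.

    `posts` is assumed to be a dict of {post_id: text}; `network_log_ips`
    is assumed to be an iterable of address strings from a technical feed.
    """
    seen = set(network_log_ips)
    matches = {}
    for post_id, text in posts.items():
        overlap = extract_ips(text) & seen
        if overlap:
            matches[post_id] = overlap
    return matches
```

In practice, an analyst would likely match on more indicator types (domains, file hashes, usernames), but the same set-intersection idea applies.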

Data leak targeting an ordnance supplier on Pastebin — discovered by Echosec Systems


Making online data “AI-ready”

While online data sources are valuable for developing AI in defense, efficiently aggregating data from a variety of online spaces is only half the battle. Data scientists in defense must also be able to collect, organize, and store data optimally for AI applications, a process that the U.S. Department of Defense's Joint Artificial Intelligence Center (JAIC) describes as getting “AI-ready.”

“...the transition to AI ready systems will require the implementation of methodical and highly deliberative processes for collecting and curating data.”
The JAIC, June 2020

As stated by the United States Congressional Research Service, most commercial innovations supporting AI serve the private sector, not federal requirements. Consequently, many off-the-shelf threat intelligence platforms and APIs gathering social, deep, and dark web data do not organize and store data for effective AI development in defense.

Data scientists in defense require solutions that not only aggregate relevant data efficiently but are also underpinned by a well-maintained data lake. This means gathering a wide variety of data sources and types, cataloguing this data effectively, and collecting a large enough dataset to build effective machine learning models.

With a well-maintained data lake in place, any structured or unstructured data collected online is ready to support AI development.
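One way to picture the cataloguing step is a small normalization routine that maps raw items from different sources onto a single record shape before storage. The sketch below is illustrative only: the `CatalogRecord` fields, the `source_type` labels, and the assumption that each raw item carries `id` and `body` keys are hypothetical and do not describe any particular data lake or vendor schema.

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone
import hashlib
import json


@dataclass(frozen=True)
class CatalogRecord:
    """One normalized item in a hypothetical data lake catalog."""
    record_id: str     # stable hash of source name + original item ID
    source: str        # e.g. "pastebin" or a forum name
    source_type: str   # illustrative label such as "social", "deep", "dark"
    collected_at: str  # ISO 8601 timestamp in UTC
    text: str          # unstructured body with whitespace collapsed


def normalize(raw, source, source_type, collected_at):
    """Convert a raw item (assumed to have 'id' and 'body') to a CatalogRecord."""
    text = " ".join(str(raw.get("body", "")).split())
    digest = hashlib.sha256(f"{source}:{raw['id']}".encode("utf-8")).hexdigest()[:16]
    return CatalogRecord(
        record_id=digest,
        source=source,
        source_type=source_type,
        collected_at=collected_at.astimezone(timezone.utc).isoformat(),
        text=text,
    )


def to_json_line(record):
    """Serialize a record as one JSON line, a common format for bulk storage."""
    return json.dumps(asdict(record), sort_keys=True)
```

Deriving `record_id` from the source and the original ID means the same item collected twice maps to the same catalog entry, which makes deduplication straightforward.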

To meet this need, many vendors have developed proprietary APIs that combine well-known sources like dark web marketplaces and mainstream social networks with obscure social sources on the deep and dark web. Built on a data lake, such a solution allows data scientists to integrate unstructured data from these sources and effectively develop machine learning models for defense initiatives.

These APIs can also include built-in machine learning models, which allow analysts to get up and running quickly on a number of common defense use cases, including automatic detection of data disclosure and personally identifiable information (PII).
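To make the idea of automatic PII detection concrete, here is a deliberately simple rule-based baseline. It is not a machine learning model and is not how any vendor's built-in models work; it only flags two pattern types (email addresses and strings shaped like U.S. Social Security numbers), and the `PII_PATTERNS` table and `find_pii` function are hypothetical names used for illustration.

```python
import re

# Illustrative patterns only; production systems cover far more PII types
# and typically combine rules with trained models.
PII_PATTERNS = {
    "email": re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def find_pii(text):
    """Return {label: [matches]} for each pattern that appears in the text."""
    hits = {}
    for label, pattern in PII_PATTERNS.items():
        matches = pattern.findall(text)
        if matches:
            hits[label] = matches
    return hits
```

A baseline like this can serve as a first-pass filter or as a sanity check against a trained model's output, though it will miss obfuscated or unusually formatted identifiers.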

Public social, deep, and dark web data is increasingly valuable for informing national security initiatives. However, data scientists require this unstructured data to be collected, curated, and stored specifically for AI development — which is not always possible through existing commercial APIs and threat intelligence platforms.

Even as defense departments worldwide invest more in AI, emerging technology often evolves faster than public policy. Solutions that deliver “AI-ready” data will allow governments to keep up with AI technologies and more effectively integrate them into defense environments. This will ultimately drive more effective, scalable, and better-informed national security strategies.