Darpa Is Developing a Search Engine for the Dark Web

A new search engine being developed by Darpa aims to shine a light on the dark web and uncover patterns and relationships in online data to help law enforcement and others track illegal activity.

The project, dubbed Memex, has been in the works for a year and is being developed by 17 different contractor teams who are working with the military's Defense Advanced Research Projects Agency. Google and Bing, with search results influenced by popularity and ranking, are only able to capture approximately five percent of the internet. The goal of Memex is to build a better map of more internet content.

"The main issue we’re trying to address is the one-size-fits-all approach to the internet where [search results are] based on consumer advertising and ranking," says Dr. Chris White, the program manager for Memex, who gave a demo of the engine to the 60 Minutes news program.

To achieve this goal, Memex will not only scrape content from the millions of regular web pages that get ignored by commercial search engines but will also chronicle thousands of sites on the so-called Dark Web---sites like the former Silk Road drug emporium that are part of the TOR network's Hidden Services.

These sites, which have .onion web addresses, are accessible only through the TOR browser and only to those who know a site's specific address. Although sites do exist that index some Hidden Services pages---often around a specific topic---and there is even already a search engine called Grams for uncovering sites selling illicit drugs and other contraband, the majority of Hidden Services remain well under the radar.

White says part of the Memex project is aimed at determining just how much of TOR traffic is related to Hidden Services sites. "The best estimates before were in the single digits---in the one-thousands," he says. "But we think there are, at any given time, between 30,000 and 40,000 Hidden Service Onion sites that have content on them that one could index."

The content on Hidden Services is public---in the sense that it's not password protected---but is not readily accessible through a commercial search engine. "We’re trying to move toward an automated mechanism of finding [Hidden Services sites] and making the public content on them accessible," White says. The Darpa team also wants to find a way to better understand the turnover of such sites---the relationships that exist, for example, between two sites when one goes down and a seemingly unrelated site pops up.

But the creators of Memex don't just want to index content on previously undiscovered sites. They also want to use automated methods to analyze that content in order to uncover hidden relationships that would be useful to law enforcement, the military, and even the private sector. The Memex project currently has eight partners involved in testing and deploying prototypes. White won't say who the partners are, but they plan to test the system around various subject areas or domains. The first domain they targeted was sites that appear to be involved in human trafficking. But the same technique could be applied to tracking Ebola outbreaks or "any domain where there is a flood of online content, where you’re not going to get it if you do queries one at a time and one link at a time," he says.

In a demo conducted for 60 Minutes, White's team showed how law enforcement could possibly track the movement of people---both trafficked and traffickers---based on data related to online advertisements for sex. The 60 Minutes piece wasn't clear about how this was done and appeared to focus on the IP address of where the ads were hosted, implying that tracking where an ad moves from one IP address to another could reveal to law enforcement where the trafficker is located. But White says the IP address is the least important information they analyze. Instead they focus on other data points.

"Sometimes it's a function of IP address, but sometimes it's a function of a phone number or address in the ad or the geolocation of a device that posted the ad," he says. "There are sometimes other artifacts that contribute to location."

For example, an ad attempting to sell the sexual services of a woman or child in one locale might pop up in another location and include a regional address or phone number. White says this kind of data has been used by investigators to find women who were being trafficked.

"You can imagine a scenario where people are moving around the country with women and are interested in advertising them---they post ads in different places. It can involve the same women and some of the same info like phone numbers. Via methods of connecting content through shared attributes---meaning the same number or image appearing on ads---you can create a network to understand where these things are connected and where they may be located."

He notes that the connection from the online ads to the real world is not always accurate or a one-to-one match. "But that’s why there are investigators and prosecutors involved to do interpretation and make decisions. Darpa just creates the tech, and organizations adopt the technology to use it."
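The shared-attribute linking White describes can be illustrated with a small sketch. This is not Memex's code, which has not been published; it is one common way to do this kind of grouping, using a union-find structure to cluster ads that share a phone number or an image. The ad records, the field names (`phone`, `image`, `city`) and the function names are all hypothetical, invented here for illustration.

```python
from collections import defaultdict

# Hypothetical ad records. Field names and values are invented for illustration.
ADS = [
    {"id": "ad1", "city": "Dallas", "phone": "555-0101", "image": "img_a"},
    {"id": "ad2", "city": "Houston", "phone": "555-0101", "image": "img_b"},
    {"id": "ad3", "city": "Austin", "phone": "555-0199", "image": "img_b"},
    {"id": "ad4", "city": "Denver", "phone": "555-0142", "image": "img_c"},
]


def link_ads(ads, keys=("phone", "image")):
    """Group ads into clusters connected by at least one shared attribute value."""
    parent = {ad["id"]: ad["id"] for ad in ads}

    def find(x):
        # Follow parent pointers to the cluster root, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Remember the first ad seen with each (attribute, value) pair and
    # merge any later ad carrying the same pair into that ad's cluster.
    first_seen = {}
    for ad in ads:
        for key in keys:
            value = ad.get(key)
            if value is None:
                continue
            attr = (key, value)
            if attr in first_seen:
                union(first_seen[attr], ad["id"])
            else:
                first_seen[attr] = ad["id"]

    clusters = defaultdict(list)
    for ad in ads:
        clusters[find(ad["id"])].append(ad)
    return list(clusters.values())


def cluster_summary(ads):
    """Return (sorted ad ids, sorted cities) for each cluster, in sorted order."""
    return sorted(
        (sorted(ad["id"] for ad in c), sorted({ad["city"] for ad in c}))
        for c in link_ads(ads)
    )


if __name__ == "__main__":
    for ids, cities in cluster_summary(ADS):
        print(ids, cities)
```

In this toy data, ad1 and ad2 share a phone number and ad2 and ad3 share an image, so all three land in one cluster spanning three cities even though ad1 and ad3 have nothing directly in common. As White cautions, a cluster like this is a lead for an investigator to interpret, not proof of a real-world connection.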

White won't say how much the program costs, but says it's comparable to other data science projects that have been funded at $10 million to $20 million.