WIRED is old enough to legally drink. And while those 20-plus years of bound paper will look beautiful on the shelf for a long time to come, we don’t have that luxury on the web. And so we've done some much-needed housecleaning.
The story of WIRED.com reflects the boom-bust-boom Silicon Valley we cover. Shortly after WIRED magazine launched in 1993, HotWired became a pioneer of online journalism. But after the dot-com bubble burst, the website was sold to Lycos and the magazine to Condé Nast. The two organizations went their own ways until 2006, when Condé Nast bought WIRED.com and reunited print and digital. Shortly thereafter, WIRED.com moved to WordPress, updated its site, and preserved its old content in an independent archive.
You can think of that archive as a digital walk-in freezer. It contained 34,220 web pages frozen in time, long separated from the ancient platforms that birthed them. This trove of content chronicles the ongoing tech revolution, from the birth of Google and our predictions of Apple's imminent demise (whoops) to the rise of social media. The code underpinning these articles illustrates a parallel revolution in web development, with a variety of once cutting-edge standards and styles on display. While the pages serve as an interesting historical record of digital trends, they lack the readability and reusability of our current site. Even worse, the archive is a black box, with no sitemap or consistent layout to indicate its size or organization. The one exception is a (mostly) functional series of magazine landing pages that date back to the publication’s founding.
Enter Cyphon.
Cyphon is a portmanteau of “cyber” and “siphon” (aren’t I clever!). I built it to crawl our vast archive and store the relevant data using a single standard format. Starting in April, Cyphon quickly consumed most of my working hours. I decided to build it as a command line tool using Node.js, a trending server-side platform with broad developer support. First I tried to rough out any patterns in the archive’s content that could be defined programmatically. After some poking around, I discovered there were three main layouts that covered the majority of the archive’s content, each corresponding to a different publishing era. Using this information, I outlined the migration process and commenced fashioning its tools.
First, I wrote a function to crawl through the archive using our library of nearly 200 magazine landing pages as starting points. Such onerous crawling can take hours, but by safely storing the unprocessed pages in a database, I reduced the risk of corrupting the data and avoided additional crawls whenever possible. Next, I wrote a method to convert this raw data into structured post information: title, author, publish date, etc. For each layout era I fashioned a unique “digester.” With these, we can feed the raw page’s HTML into one side of the digester and get extracted, cleaned post information out the other. Finally, I crafted a way to export the purified data into a simple file that WordPress could easily import.
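To make the crawl-once, digest-later idea concrete, here is a minimal sketch of those steps. It is not Cyphon's actual code: the `rawStore` Map stands in for the database, `getRawPage`, `fetchPage`, and `digestEraOne` are hypothetical names, the sample page and its class names are invented, and regular expressions replace a real HTML parser only to keep the example free of dependencies.

```javascript
// Stand-in for the database of unprocessed pages (assumption: a plain Map).
const rawStore = new Map();

// Return the stored copy when there is one; otherwise fetch it once and keep it.
// `fetchPage` is a hypothetical function supplied by the caller.
function getRawPage(url, fetchPage) {
  if (!rawStore.has(url)) {
    rawStore.set(url, fetchPage(url));
  }
  return rawStore.get(url);
}

// Return the first capture group of `pattern` in `html`, trimmed, or null.
function firstMatch(html, pattern) {
  const match = pattern.exec(html);
  return match ? match[1].trim() : null;
}

// Remove tags and collapse runs of whitespace.
function stripTags(fragment) {
  return fragment.replace(/<[^>]*>/g, '').replace(/\s+/g, ' ').trim();
}

// Hypothetical digester for one layout era: raw HTML in, post fields out.
function digestEraOne(rawHtml) {
  const title = firstMatch(rawHtml, /<h1[^>]*>([\s\S]*?)<\/h1>/i);
  const author = firstMatch(rawHtml, /<span class="byline">([\s\S]*?)<\/span>/i);
  return {
    title: title ? stripTags(title) : null,
    author: author ? stripTags(author).replace(/^By\s+/i, '') : null,
    publishDate: firstMatch(rawHtml, /<span class="date">([\s\S]*?)<\/span>/i),
    body: firstMatch(rawHtml, /<div class="story">([\s\S]*?)<\/div>/i),
  };
}

// Invented sample page and a fake fetcher that counts how often it is called.
const samplePage = [
  '<html><body>',
  '<h1>An <i>Example</i> Headline</h1>',
  '<span class="byline">By Jane Doe</span>',
  '<span class="date">1996-11-01</span>',
  '<div class="story"><p>First paragraph.</p></div>',
  '</body></html>',
].join('');

let fetchCount = 0;
const fakeFetch = () => {
  fetchCount += 1;
  return samplePage;
};

const url = 'http://archive.example/story.html';
const post = digestEraOne(getRawPage(url, fakeFetch));
getRawPage(url, fakeFetch); // served from rawStore, so no second fetch

console.log(post, fetchCount);
```

Keeping the raw HTML separate from the digested output means a digester can be rewritten and rerun as often as needed without touching the network again.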
Crawling the archive and scraping the content of the correct pages proved the biggest technical challenge. Thankfully, as is often the case with the massive Node community, a developer named Christopher Giffard had already written a fantastic crawler that I was able to use (Thanks, Chris!). To address the varying content quality and the many exceptions to any identifiable rule, I created numerous filters that did their best to smooth out the subtle differences between pages. This made the output adhere to modern standards as much as possible. Most importantly, I built Cyphon to be extensible so additional crawling rules and digesters could be added easily in case a new layout was discovered. For instance, after all the magazine posts had been successfully digested, I extended the tool to crawl an arbitrary list of URLs so we could efficiently scrape the archive’s large number of non-magazine blog posts.
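One way to picture that extensibility is a registry of digesters plus a chain of filters. This is only a sketch under my own assumptions: `registerDigester`, `digestPage`, `applyFilters`, the layout name, and the sample markup are all hypothetical and are not taken from Cyphon itself.

```javascript
// Hypothetical filters: each takes a post object and returns a cleaned copy.
const filters = [
  // Collapse stray whitespace in titles.
  (post) => ({ ...post, title: post.title.replace(/\s+/g, ' ').trim() }),
  // Drop old presentational <font> tags but keep their contents.
  (post) => ({ ...post, body: post.body.replace(/<\/?font[^>]*>/gi, '') }),
];

// Run a post through every filter in order.
function applyFilters(post, filterList) {
  return filterList.reduce((current, filter) => filter(current), post);
}

// Registry of digesters. Supporting a newly discovered layout means adding
// one more entry rather than changing the code below.
const digesters = [];

function registerDigester(name, matches, digest) {
  digesters.push({ name, matches, digest });
}

// Pick the first digester whose `matches` test accepts the page.
// Returns null when no registered layout recognizes it.
function digestPage(rawHtml) {
  const entry = digesters.find((candidate) => candidate.matches(rawHtml));
  if (!entry) {
    return null;
  }
  return { layout: entry.name, ...applyFilters(entry.digest(rawHtml), filters) };
}

// Return the first capture group of `pattern` in `html`, or an empty string.
function capture(html, pattern) {
  const match = pattern.exec(html);
  return match ? match[1] : '';
}

// An invented layout, registered purely for illustration.
registerDigester(
  'table-era',
  (html) => /<table/i.test(html),
  (html) => ({
    title: capture(html, /<title>([\s\S]*?)<\/title>/i),
    body: capture(html, /<td>([\s\S]*?)<\/td>/i),
  })
);

const oldPage =
  '<html><title>  Old   Headline </title>' +
  '<table><tr><td><font size="2">Hello, web.</font></td></tr></table></html>';

const digested = digestPage(oldPage);
const unknown = digestPage('<p>No known layout here.</p>');

console.log(digested, unknown);
```

Pages that no digester recognizes come back as `null`, which makes them easy to collect and inspect when deciding whether a new layout deserves its own digester.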
Cyphon by the numbers:
- 34,220 pages scraped
- 11,195 distinct archive articles
- 14,799 new posts in production
- 97 percent of data scraped successfully produced a full-fledged post
- 1,076 tags generated for content
With the hard-won data in hand, I was finally able to get a peek at my finished work in WordPress. This is when it really hit me: The web has come so far since those pioneering days in the early '90s. Perusing the imported posts, you can observe the rise of images on the web, the death of table layouts, and the transition from ghastly video players to YouTube embeds. At first there were quite a few visual differences compared to the modern web. However, after adding a last layer of special styling and polish, many posts looked almost identical to content published just this year. To account for migration anomalies, we decided to add a disclaimer to all imported archive posts to help explain the odd post that may appear out of place.
So what to do with 11,195 new-old posts? How about a weekly “Throwback Thursday” featuring posts published 10 years ago? Or a resurrection of WIRED’s “Failed Predictions” series? Perhaps even an article-driven timeline of famous companies, technologies, and ideas we’ve covered over the years. I'm sure the editors will come up with something.
Part of me takes great pleasure in making order out of chaos. That’s largely why I pursued engineering. With this massive undertaking I have helped do that in the extreme. We have reclaimed our publication’s digital heritage and ensured it will be passed on to the next WIRED website and its readers. Now that’s something we can proudly display on our virtual shelf.
If you’d like to see a handpicked selection of some archive highlights, please check out these 20th anniversary classics:
- Crypto Rebels - Steven Levy, May/June 1993
- Disneyland with the Death Penalty - William Gibson, September/October 1993
- Web Dreams - Josh Quittner, November 1996
- Mother Earth Mother Board - Neal Stephenson, December 1996
- The Epic Saga of the Well - Katie Hafner, May 1997
- The Long Boom: A History of the Future, 1980 - 2020 - Peter Schwartz and Peter Leyden, July 1997
- Gen Equity - Po Bronson, July 1999
- The Wurmanizer - Gary Wolf, February 2000
- Why the Future Doesn’t Need Us - Bill Joy, April 2000
- Welcome to the Luvvyplex - Charles Platt, November 2000
- The Truth, The Whole Truth, and Nothing But The Truth - John Heilemann, November 2000
- The Geek Syndrome - Steve Silberman, December 2001
- The Long Tail - Chris Anderson, October 2004
- La Vida Robot - Joshua Davis, April 2005
- The Rise of Crowdsourcing - Jeff Howe, June 2006
- The Pedal-to-the-Metal, Totally Illegal, Cross-Country Sprint for Glory - Charles Graeber, November 2007
- High Tech Cowboys of the Deep Seas - Joshua Davis, March 2008
I’d also like to personally thank these Node.js projects that helped make Cyphon a reality:
- Cheerio: parsed out HTML content
- IconV-Lite: handled those odd characters and accents
- Node Simplecrawler: crawled the archive for scraping
- XML Builder-JS: built the export file for WordPress