O’Reilly Network: How the Wayback Machine Works

[ Thanks to Jason for this link. ]

“The Internet Archive made headlines back in November
with the release of the Wayback Machine, a Web interface to the
Archive’s five-year, 100-terabyte collection of Web pages. The
archive is the result of the efforts of its director, Brewster
Kahle, to capture the ephemeral pages of the Web and store them in
a publicly accessible library. In addition to the other millions of
web pages you can find in the Wayback Machine, it has direct
pointers to some of the pioneer sites from the early days of the
Web, including the NCSA What’s New page, The Trojan Room Coffee
Pot, and Feed magazine.

How big is 100 terabytes? Kahle, who serves as archive director
and president of Alexa Internet, a wholly-owned subsidiary of
Amazon.com, says it’s about five times as large as the Library of
Congress, with its 20 million books.

“What we have on the Web is phenomenal,” Kahle says. “There are
more than 10 million people’s voices evidenced on the Web. It’s the
people’s medium, the opportunity for people to publish about
anything — the great, the noble, the absolute picayune, and the

