By Christopher B. Daly
The Internet is many things, and most of us in the developed world have come, in a matter of just a few years, to depend on it for all sorts of things. Like a lot of people, I use it to do most of my work, to keep track of my finances, to take and share photos, to keep in touch with loved ones, and for plenty of other activities that are fun, expressive, or important. More and more, I rely on the Internet to store and remember things for me. I have exported much of my deteriorating capacity to recall.
There’s more at stake than just my inability to find an old story or locate a picture I think I took a while back on my cellphone.
As a historian, I have to express my alarm about one feature of the Internet that most of us choose not to think about: LINK ROT. That’s the term for all those links you try to follow that bring you to an error page instead of the place you thought you were going. As these bad, broken links proliferate across the Internet (and its subset known as the Web), we have to wonder what kinds of things future historians will not know about us. They may be able to find out what was for lunch at our local middle school on a given day, but those researchers may be unable to find many other things.
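Link rot is easy enough to detect mechanically: follow a link and see whether the server answers with an error. Here is a minimal sketch in Python of how one might audit a list of links for rot; the status-code categories are illustrative, not any standard taxonomy.

```python
# Minimal link-rot checker sketch. The classification below is a rough
# heuristic: 404/410 are the classic link-rot signatures, while redirects
# may mean the content survives at a new address.
from urllib.request import Request, urlopen
from urllib.error import HTTPError, URLError

def classify_status(status: int) -> str:
    """Map an HTTP status code to a rough link-health label."""
    if 200 <= status < 300:
        return "alive"
    if status in (301, 302, 307, 308):
        return "moved"    # content may survive at a new address
    if status in (404, 410):
        return "rotten"   # the classic link-rot signature
    return "unknown"

def check_link(url: str, timeout: float = 10.0) -> str:
    """Fetch a URL's headers and classify the result.

    Note: urlopen follows redirects automatically, so "moved" is
    mostly reachable via classify_status on raw codes.
    """
    try:
        req = Request(url, method="HEAD")
        with urlopen(req, timeout=timeout) as resp:
            return classify_status(resp.status)
    except HTTPError as err:
        return classify_status(err.code)
    except URLError:
        return "unreachable"  # DNS failure, refused connection, etc.
```

Run over a citation list, a script like this gives a quick census of how much of your own footnote trail has already decayed.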
Here is a recent New Yorker piece by Harvard historian Jill Lepore that explores the problems inherent in trying to save everything online. Can it be done? Should it? Lepore goes to the heart of the matter by visiting the Internet Archive at its real-world headquarters, in the old Presidio in San Francisco.
The Wayback Machine has archived more than four hundred and thirty billion Web pages. The Web is global, but, aside from the Internet Archive, a handful of fledgling commercial enterprises, and a growing number of university Web archives, most Web archives are run by national libraries. They collect chiefly what’s in their own domains (the Web Archive of the National Library of Sweden, for instance, includes every Web page that ends in “.se”). The Library of Congress has archived nine billion pages, the British Library six billion. Those collections, like the collections of most national libraries, are in one way or another dependent on the Wayback Machine; the majority also use Heritrix, the Internet Archive’s open-source code. The British Library and the Bibliothèque Nationale de France backfilled the early years of their collections by using the Internet Archive’s crawls of the .uk and .fr domains. The Library of Congress doesn’t actually do its own Web crawling; it contracts with the Internet Archive to do it instead.
[Image: Mr. Peabody and Sherman using the original “Wayback Machine.”]
All well and good, I suppose, but it’s not that simple. As Lepore points out, there are copyright issues, and there are lots of technical issues, too, involving how URLs are stored and retrieved.
In my own experience, this came home to me recently when I had to compile a dossier of my own work for a promotion. Turns out, if I wrote something for a magazine that went out of business (like the much-missed New England Monthly, for instance), no one has a stake in bringing that material onto the Web or archiving it. So, it is pretty much gone, unless I can find my paper “clips” and scan them and post them somewhere myself. I also ran into a roadblock when I tried to retrieve my own work from a former employer, The Washington Post. Since I am no longer under contract to the Post, I had to pay for the privilege of getting access to my own work. (The Post also grabbed the copyright from me, but that’s another story.) In some cases, the only version I have access to is the one stored on the floppy disk I was using when I first wrote the piece back in the 1990s. But that led to another problem: I have a stack of floppies, but I no longer own a computer that can read them.
The issue is not going away any time soon. What can historians, “content-creators,” archivists and others do about it?
Here is a list of terrific suggestions from the Journalist’s Resource at Harvard’s Shorenstein Center. Part of the answer may involve a new site called Perma.cc. (But at the speed I am working, I can’t make heads or tails of it.)
Here’s an excerpt from the JR essay by Leighton Walter Kille:
To address some of these issues, academic journals are adopting use of digital object identifiers (DOIs), which provide both persistence and traceability. But as Zittrain, Albert and Lessig point out, many people who produce content for the Web are likely to be “indifferent to the problems of posterity.” The scholars’ solution, supported by a broad coalition of university libraries, is perma.cc — the service takes a snapshot of a URL’s content and returns a permanent link (known as a permalink) that users employ rather than the original link.
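The same snapshot-and-permalink idea already exists in public form: the Internet Archive exposes an "availability" API that, given a URL, returns the closest archived capture in the Wayback Machine. Here is a hedged sketch of using it from Python; the endpoint and JSON shape follow the Archive's documented API, but error handling is deliberately minimal.

```python
# Sketch: query the Internet Archive's public "availability" API for the
# closest archived snapshot of a URL. A successful response looks like:
#   {"archived_snapshots": {"closest": {"available": true,
#       "url": "http://web.archive.org/web/...", "timestamp": "..."}}}
import json
from urllib.parse import urlencode
from urllib.request import urlopen

API = "https://archive.org/wayback/available"

def availability_url(url, timestamp=None):
    """Build the query URL; timestamp (YYYYMMDD) asks for the nearest capture."""
    params = {"url": url}
    if timestamp:
        params["timestamp"] = timestamp
    return API + "?" + urlencode(params)

def closest_snapshot(payload):
    """Extract the archived snapshot URL from an API response, if any."""
    snap = payload.get("archived_snapshots", {}).get("closest")
    if snap and snap.get("available"):
        return snap["url"]
    return None

def lookup(url):
    """Network call: return the closest Wayback snapshot URL, or None."""
    with urlopen(availability_url(url)) as resp:
        return closest_snapshot(json.load(resp))
```

A broken citation could then be repaired semi-automatically: if `check` on the live link fails, fall back to `lookup` and cite the archived copy instead. Perma.cc adds what this lacks, namely a guarantee that the snapshot was taken at the moment of citation.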
Anyway, there is a whole pile of useful tips in his essay.
And, finally, here is an entirely different perspective, from a scholar who says it’s important to curate the Web by deleting stuff. Bruce Schneier, a fellow at Harvard’s Berkman Center, argues that we save too much material that used to be ephemeral (like emails that took the place of previously unrecorded phone calls) or so trivial that it once would literally have been thrown away (like that receipt from lunch).
An organization-wide deletion policy makes sense. Customer data should be deleted as soon as it isn’t immediately useful. Internal e-mails can probably be deleted after a few months, IM chats even more quickly, and other documents in one to two years. There are exceptions, of course, but they should be exceptions. Individuals should need to deliberately flag documents and correspondence for longer retention. But unless there are laws requiring an organization to save a particular type of data for a prescribed length of time, deletion should be the norm.
When it comes to archiving the Web, how much is too much?
How much is too little?
And how will we know?
[To be on the safe side, I am printing all my work and storing copies in plastic tubs with tight-fitting lids. You never know. -CBD]