Nieman Foundation at Harvard
March 24, 2022, 2:57 p.m.

After 25 years, Brewster Kahle and the Internet Archive are still working to democratize knowledge

“Corporations continue to control access to materials that are in the library, which is controlling preservation, and it’s killing us.”

Brewster Kahle has been at this a long time.

Consider the photo above evidence. (And yes, children, computer monitors were once the size of a mini-fridge.) It was taken by internet legend (and open records hero) Carl Malamud in December 1991, when he was reporting out what would become Exploring the Internet: A Technical Travelogue, which aimed to put some faces to the burbling sense that something exciting was happening with connected computers.

These were still early days online — only four months after Tim Berners-Lee mentioned his “WorldWideWeb” project for the first time in the newsgroup alt.hypertext. The first version of Netscape was still three years away. And there was Kahle, just 31, but already with a stuffed resume: researcher at MIT’s AI Lab, lead engineer at supercomputer maker Thinking Machines, lead developer of WAIS (Wide Area Information Server), something like an alpha version of what the web would become.

“After delving into the arcana of message-passing protocols for massively parallel processors,” Malamud wrote, “Brewster turned his attention to the much more difficult problem of finding and using information on networks.”

Brewster ushered me into his office, where he sat down on a beat-up old easy chair and balanced a keyboard on his lap. The screen and rollerball mouse were conveniently nearby, making this a highly comfortable work or play station. There was no need to start up his WAIS client since it was already up and running. Deployed for only a few months on the Internet, WAIS was quickly becoming a part of people’s routines, and had certainly been integrated into Brewster’s daily work.

Brewster typed in a query: “Is there any information about Biology?” The query was sent, in its entirety, to the server of servers that Brewster maintained, quake.think.com. Servers of servers were no different than document servers, they simply kept a list of other servers and a description of the information they maintained.

We got back a list of servers throughout the world that had information on biology, such as a database of 981 metabolic intermediate compounds maintained in the Netherlands. At this point, we refined our query and sent it out to many servers simply by pointing to them on the screen. Servers returned lists of document descriptions; pointing to those documents retrieved the full text.

Brewster’s goal was to enable anybody with a computer, even a lowly PC, to become a publisher. The first PC-based WAIS server had recently gone online, running in somebody’s basement, and Brewster was quite excited by the prospect.

Brewster’s interest in publishing was personal as well as professional. His fiancée ran a printing museum and in the basement was an old printing press.1

That’s how someone who started out in AI and microchip design ended up being the internet’s librarian.

In 1996, Kahle founded the Internet Archive, which stands alongside Wikipedia as one of the great not-for-profit knowledge-enhancing creations of modern digital technology. You may know it best for the Wayback Machine, its now quarter-century-old tool for deriving some sort of permanent record from the inherently transient medium of the web. (It’s collected 668 billion web pages so far.) But its ambitions extend far beyond that, creating a free-to-all library of 38 million books and documents, 14 million audio recordings, 7 million videos, and more. (Malamud’s book is, of course, among them.)

That work has not been without controversy, but it’s an enormous public service — not least to journalists, who rely on it for reporting every day. (Not to mention the Wayback Machine is often the only place to find the first two decades of web-based journalism, most of which has been wiped away from its original URLs.)

A little while back, the Internet Archive celebrated its 25th birthday, and I used that as an excuse to chat with Kahle about how his vision for it had changed along with the internet it tries to preserve in amber — and about why there is still so much human knowledge locked away on microfilm. Here are some bits of our conversation, lightly edited to make me sound more coherent on Zoom calls.

Kahle on stage at the Internet Archive's 25th anniversary celebration in San Francisco, October 21, 2021.

Joshua Benton: I’m 46, so I arrived at college right in the earliest days of the web. I have an enormous fondness for the optimism and the idealism people had about technology back then. The Internet Archive feels like a project from that era — free, open to all, assembled from millions of different parts and sources. How close is the archive today to what you were imagining 25 years ago? Is it recognizable compared to what you were planning, or hoping for?

Brewster Kahle: I think so, roughly, yes. I think the way other organizations participate with the Archive is different than what I would have imagined.

I would have thought that libraries would have just digitized all their books, and that they would have followed the same course as with the digitization of the card catalog. People went and copied their physical card catalogs into software that was running on their machines.

But what really happened was, you know, not as much. We had the Million Books Project.2 We were digitizing away. But then Google Books came along and said, “We’ll take it all.” And that was a complete surprise. And then some people said, “We’ll get the books scanned, but we’ll only share it among ourselves.” That was HathiTrust.3 That I found not that encouraging, in terms of public-spiritedness and the opportunity of the internet to make it available to anybody, anywhere. You know, let’s break open the walls of academia!

There was this guy, Binkley — I really loved Binkley.4 I really wanted to learn more about him. In the 1930s, he was a thinker and a promoter of microfilm — but microfilm as a mechanism of distributing knowledge, specifically to rural populations, to break the city elite. He thought that this was a way of democratizing knowledge.

It turned out that instead, you know, they microfilmed things and mostly kept it just for themselves.

Benton: You know, the Nieman Foundation at Harvard, where I work, was initially, back in the 1930s, supposed to be centered around this giant collection of journalism on microfilm. The head of Nieman is still titled the “curator” all these years later, because the original job was supposed to be to curate this collection. Microfilm was really having a moment in the ’30s, I guess.

Kahle: It was a thing. I was really clued into this by — I don’t remember her name, she’s retired now from the MIT library. But when I gave this talk about the Internet Archive — you know, my rousing “universal access to all knowledge” blah blah blah — at the Boston Public Library, she came up to me afterward and said, in that quiet librarian way: “Brewster, I’ve heard this speech before. It was all about microfilm.”

Benton: I really don’t understand why there’s anything left in the world that’s still only available on microfilm. Digitizing all the world’s books — okay, that’s a giant challenge. That’s a huge, unknowable data set. But why hasn’t every academic library digitized all its out-of-copyright microfilmed manuscripts, which I would think is much, much easier?

Kahle: It’s all about licensing, the licensing plague. It’s the shift from libraries owning things to corporations licensing and controlling access to materials that are in libraries. Corporations continue to control access to materials that are in the library, which is controlling preservation, and it’s killing us.

Benton: So it’s the big academic publishing companies that bought up rights to microfilm that was created 50 or 80 years ago?

Kahle: There’s a play I really want to see put on at the Repertory Theatre in Harvard Square — a two-person play, fictitious, of Binkley meeting Eugene Power.5

Eugene Power started University Microfilms, and Binkley had this dream of microfilm playing a different role. And basically, Eugene Power won — Binkley died. And we ended up with it being a corporation, which then got bought and bought and bought and bought again. And then they think that, if you want to move something to the next medium, you need to go back and get a new license. That transaction cost is so high, right? You don’t do it very often. So things get left behind because of this idea of licensing.

Enabling news and archiving news

Benton: I wanted to ask you specifically about how you see the role that journalism has played and does play in the Archive’s history.

Kahle: There are two dimensions, right? There’s being a useful tool for journalists, having materials that they want to use. And then there’s documenting the output of journalism, of news. And those are both probably best illustrated with the Wayback Machine.

Being a resource for journalists has been a major goal of ours. We’ve got an internal Slack channel that uses Google Alerts to find uses of the Wayback Machine in news stories, and they come in all the time. I actually find that a useful stream of news to read, because it indicates that the journalist has done some work.

Benton: Journalists’ use of the Wayback Machine reminds me a bit of the way that Jon Stewart’s Daily Show was able to gain a certain amount of rhetorical authority by finding all these old clips of a politician saying something six months ago that was the opposite of what he is saying today. Using that archive of video information to build accountability. I think journalists use the Wayback Machine for the same reason. It’s “This company says X now, but only three months ago, on their website, they said Y.”

Kahle: Absolutely. And Jon Stewart’s Daily Show was really inspiring to us. We did a grant-funded program to try to build a tool that would allow anyone to become a Jon Stewart research intern. And that was what became the Television Archive.

We’d been archiving television, and then we wanted to make it available. And so we tried to make it so you could search on what people said and then make clips. And it didn’t happen as much as I thought it would.

So those are tools that we’ve helped make that are useful to journalists. Then there’s trying to archive news. And we’ve really done a lot of work to try to make sure that we capture news from around the world. What’s becoming really tricky now is paywalls and robot traps. Newspapers are becoming very sophisticated to try to make sure that people don’t crawl them. They’re employing more and more sophisticated tools.

Benton: Are they doing that specifically to block crawlers, or is that just a side effect of their attempts to create harder paywalls for consumers? Like, are they specifically targeting, you know, Google’s spider and your bot and everything?

Kahle: Well, I think they let Google through. But they don’t necessarily let us through. They’re targeting people that are crawling their sites. And so that will make it very, very difficult for us going forward — and for all libraries.

Benton: So a site like The New York Times has a metered paywall, where you only get so many free articles per month. But I don’t think I’ve ever seen in the Wayback Machine a “You’ve run out of articles for this month” message. So are you paying somehow for that access?

Kahle: We’re trying all sorts of different things — conversations, relationships. It’s a work in progress.

Benton: Is there anything that you would want news organizations to do that would be helpful for you?

Kahle: Let us subscribe and download a copy. And recognize that we’re just not going to crater your business. I mean, people are just not going to go to the Wayback Machine every day to get your news instead of going to your site. They just don’t — we’ve been around for long enough.

You can imagine people coming up with scenarios: “Oh my God, you know, is one copy on the internet going to make it so that we don’t have a business? Oh, wait a minute — that doesn’t happen.” Sort of a theoretical la-la-land of some people’s imaginations. We have a long history and it hasn’t happened, right?

Benton: Have you followed the attempts by a number of countries — Australia most significantly — to get Google and Facebook to pay their local news organizations for the right to index their content?

Kahle: Link taxes, basically. Just from afar. There are other people at the Archive watching that kind of thing more carefully. It’s the sort of shift that could be life changing in terms of what libraries can do. Will there be libraries in 25 years? We’ve been around for 25 years now. Will there be libraries in this whole era of rent, lease, and license? What will a library look like?

The three wars of the internet

Benton: Is there any sense in which you’re more optimistic about these issues than you used to be?

It certainly seems to me, in the 14 years I’ve been at Harvard, that there’s been a very significant push within the university in the direction of open access and pushing back against academic publishers. It feels like, in this incredibly privileged institution, at least, that there’s been some movement in a positive direction. I’m curious if there are areas in this whole question of access that you’re seeing positive movements.

Kahle: I think of the internet as having three wars.

War No. 1 was about the plumbing. The ARPANET, evolving into the internet, versus AT&T. We did really well on that. Part of the reason was that AT&T was broken up in 1986, and so it was temporarily enfeebled. It’s now back, and it’s called AT&T again, which is just chutzpah. We now have very few choices for Tier 1 or last-mile solutions. So that was War No. 1.

War No. 2 was about open protocols versus closed. AOL versus the World Wide Web, right? And that was about Stallman, you know, and Tim Berners-Lee, and to have open protocols — open, free, and open source software. That’s huge, and hugely influential towards not having a Microsoft-dominated, AOL-dominated world. Just draw a through line forward from the IBM days, you know — without free and open source software, and protocols that were open, life would have been very different.

That’s still doing okay. But the attacks on free and open source software are being so successful by companies like Facebook and Google. They used a loophole: If you used open-source software, in the old days, you had to go and share whatever you added to it. So, you know, share and share-alike, as Larry Lessig put it. But it only applied for software that you distributed — that, you know, other people could use. But when you started getting cloud services, where you ran all of the software on your own servers, right, you never actually distributed it.

Benton: Just the results of it, to everyone’s web browser.

Kahle: Yes. And so you get to leverage everybody else’s work without sharing. And that’s a problem.

War No. 3 is the content level. That’s always what the Internet Archive has been designed for. And so we’ve had, you know, open educational resources, we’ve had Creative Commons. But I don’t know how successful it’s been in opening up access to academic work. Have the journals shifted over — the key journals in your area, are they open access, or are they still closed?

Benton: I’m totally spoiled, because as someone with a Harvard ID and the Harvard Library, I can have access to almost everything. But yeah, a lot of it is certainly still in Taylor & Francis or De Gruyter or wherever.

Kahle: We haven’t made as much progress on that.

Some of the servers on which the Internet Archive is stored, 2015. The organization and its servers are located in a former Christian Science church in San Francisco.

Building permanence

Benton: I wanted to ask how you’re thinking about permanence. In Europe, we’ve seen the rise of the right to be forgotten. A lot of news organizations have gone back into their archives and tried to be thoughtful about: “This person’s arrest that we mentioned in a story 19 years ago is still their No. 1 Google result. Is that okay?” The archive has been flattened, and it’s a lot easier to find certain kinds of things that used to require a lot of focused digging.

At the same time that’s happening, we have social media companies moving in the direction of intentionally impermanent media — you know, a Snapchat Story or Instagram Story that’s designed to disappear after 24 hours. Or Clubhouse, you know — audio conversations meant to be experienced in real time, not time-shifted as a podcast. I’m just wondering how you’ve been thinking about those issues as someone who runs this giant archive designed to keep everything forever.

Kahle: So let’s take the case of videos that some people would object to. Can those be in a digital library at all? You can say, “Well, you know, if it’s accessible online to one person, it’s accessible to everybody in the globe, to 3 billion people.”

But something being accessed 10 times or 100 times on the Internet Archive isn’t the same as the mass distribution of being on YouTube. People like to get binary about it, very black and white, but I think you can try to have some level of gray understanding. I say it’s important that it be preserved.

Back to the future

Benton: One last question. If you were to go back in time to 1996 Brewster — you know, filled with optimism about the internet and about the web — and basically just describe what the internet looks like today 25 years later, how happy do you think young Brewster would be about that?

Kahle: I think young Brewster would be surprised at how long it’s all taken. You know: “Aren’t you further along than that by now?”

In 1996, things were moving along pretty fast. There’s Google starting up, which would definitely outstrip AltaVista and Inktomi. You had Wikipedia formed in 2001. Why, in 2021, are you still talking about digitizing books? I mean, come on, guys.

Haven’t you applied the AI technologies that we already had? You know, I was at the AI Lab at MIT to help make sense out of what’s going on out there, to go and help give people context. In the words of those days, “context is king.” And where are we on that? Well, that’s rhetorical — we’re almost nowhere on that. And it’s causing huge problems, with people being confused about what it is they’re seeing. Everything looks like a scientific paper. And so you can go and pick one out and find a scientific paper to say whatever it is you want, and then that gets promoted on Fox News.

Benton: Just within journalism, I think back to the first, oh, five years of the mainstream web, where people who wanted news online would still go to NYTimes.com or LATimes.com or whatever and were still seeing stories that had been packaged and ordered by an editor, so you still had the context of “this is important, it’s the top story.” With social media, that context was gone.

I think of it as: The internet is an absolutely amazing, astonishing, wonderful thing for people who are sort of lean-forward information consumers. Dedicated infovores, people who enjoy consuming information, who seek it out with purpose and love having access to everything. But if you’re more of a lean-back information consumer, the information you used to get was often of middling quality, but it was still socially responsible in some broad sense. Your local daily newspaper wasn’t going to be pushing QAnon.

Kahle: It was a very narrow spectrum, and we’ve widened it out. It is a radical experiment in radical sharing. I think the winner, the hero of the last 25 years, is the everyman. They’ve been the heroes. The institutions are the ones who haven’t adjusted. Large corporations have found this technology as a mechanism of becoming global monopolies. It’s been a boom time for monopolists.

Photo of Kahle in 1991 by Carl Malamud. Photo of the Internet Archive’s 25th anniversary by Cory Doctorow used under a Creative Commons license. Video of Kahle reading Robert C. Binkley by Binkley’s grandson, Peter Binkley. Photo of Internet Archive servers in 2015 by Peter Theony used under a Creative Commons license.

  1. And here is a photo by Malamud of Kahle in that basement.
  2. A Carnegie Mellon-led initiative in 2007-08 to digitize 1 million books in 20 languages.
  3. HathiTrust is a consortium of more than 250 academic libraries that contains digital copies of much of their collections. But access to most of the material is restricted to students, faculty, and staff of those universities.
  4. Robert C. Binkley (1897-1940) was a historian and early advocate for using technology to share academic information more broadly. He led the Joint Committee on Materials for Research of the Social Science Research Council and the American Council of Learned Societies, which promoted scanning print documents onto microfilm, which could be more readily distributed far away from traditional academic centers.
  5. Eugene Power (1905-1993) was in the U.K. during World War II and had the idea to microfilm the rare contents of British academic libraries that were threatened by German bombing. He was then able to sell copies to American libraries. He started University Microfilms, which evolved into today’s academic-publishing powerhouse ProQuest.
Joshua Benton is the senior writer and former director of Nieman Lab. You can reach him via email (joshua_benton@harvard.edu) or Twitter DM (@jbenton).