Episode 97

Archiving The Internet with Jason Scott

February 27, 2015

The Internet Archive is a treasure trove of digitized culture: films, software, audio, websites and more. How is it being collected, and how might the Internet Archive be our best hope for preserving the history of this era, as we invent the web? Jason Scott joins Jen Simmons to talk about the challenges of archiving in the digital age.

In This Episode

The strange collection of digital things that is the Internet Archive
What will it take to save websites from disappearing long-term?
How do certain sites get saved?
How can companies or people better preserve our web heritage?
How digital files degrade faster and have a shorter shelf life than older forms of media.
How that affects archiving, and increases the need to preserve as much as possible, so that later, when humanity is able to determine what is important to still have, the files do exist and are accessible.

There are all these ethics and concerns, and they never got cooked in, because the internet went from experiment to critical infrastructure way too quickly.

Transcript

Jen

[00:00:00] This is The Web Ahead, a weekly conversation about changing technologies and the future of the web. I'm your host Jen Simmons and this is episode 97.

I also want to say thank you to Pantheon for powering The Web Ahead's new website. The new website at thewebahead.net. You can check out Pantheon at pantheon.io and learn all about their magic tools for deploying and hosting websites. Bandwidth has been provided by CacheFly, the fastest, most reliable CDN in the business. cachefly.com.

Before I get rolling today, I want to say thank you to everybody who has been tweeting and writing and sending emails and using every possible electronic way of communicating to tell me what you think about the new site. People have been reporting bugs, which is embarrassing, but also really kind of awesome. Thank you. Just so many kind words.

Today, for you listeners, is actually two weeks after the site launched. But for me, it's the next day. And I am exhausted. [Laughs] I am so exhausted, it's amazing trying to launch the site, finish the code, finish the bugs, fix the little design things. Oh, right, and then do all of the monitoring, oh, there's a PHP error, I've got to fix that. Speed the site up, fix this, fix this. At some point I think I'll get a break. I'm just excited. I'm really excited. I'm very grateful and very grateful to start to see the beginnings of what I really want the site to do, which is build a community around the show, and be a place where you can discuss things and bring up ideas and share corrections or just bits of information and stories and examples with each other. Comments are open. For this episode, episode 97, you can go to thewebahead.net/97 and you can leave comments for the show. Any kind of responses that you want to.

If you're interested in helping with the show, you can go to thewebahead.net/help and there's all sorts of information there about how you can help out. From leaving an iTunes review to helping with transcriptions to sponsoring the show. There's lots of ways you can help me make this show better. So if you're interested, go check it out.

Let's get into today's show. I thought a really great show to do — I've been wanting to do for a really long time — is a show about archiving the internet and the Internet Archive.

Today, I have Jason Scott on the show. He works over at the Internet Archive and he's been deeply involved in understanding digital media and computers and games and where all of this stuff came from and where it's going. Archiving and preserving it for us to understand better right now, and for generations to understand and see it and be able to research it in the future.

Hello, Jason Scott.

Jason

[00:03:16] Why hello there!

Jen

[00:03:17] I'm really glad to have you on the show. Thank you for making time in your busy schedule to be here.

Jason

[00:03:21] No problem.

Jen

[00:03:23] Will you explain for people who haven't had a chance to pay attention? What is the Internet Archive?

Jason

[00:03:31] Ok, so, the Internet Archive — which is at archive.org — is a non-profit library. It's physically located in California but it thinks of itself as simply being on the internet. It is a collection of human knowledge.

The founder, Brewster Kahle, made a little bit of dot-com money before dot-coms really went crazy. Instead of just building a boat and going into the sunset, he had a dream from when he was very young to be a librarian. So he started out on that mission. He founded a non-profit and registered archive.org and started to build knowledge.

His business — the one that made him a lot of money — did Nielsen ratings for websites. It could tell you that you were the 4,605th ranked website. Which, apparently, Amazon really wanted to be able to do. So they bought it. In doing so, they had to hit every website and analyze them and analyze changes. So he made a deal and took the technology for the purposes of saving it. All of these websites from somewhere in 1996 onward, they were being scanned, saved, and stored away.

Eventually, he oversaw the building of a technology that is called the Wayback Machine. Which allows you to go to a lot of domains, going back many years, and be able to see what they looked like over time. You can say, "How did Wired change over the past 15 years?" You can actually go up and down a slider and be able to see pages change and URLs change.

He's been doing that steadily ever since the late 90s. The Wayback Machine is how most people think of the Internet Archive. They think of it as the Wayback Machine. They might not even know that there is an Internet Archive behind it.

Meanwhile, with the archive, he's also been gathering everything — from books and movies to music and software. For instance, the Internet Archive and its partners scan about 1,000 books a day and put them on the Internet Archive. There are millions of books up. Same thing with lots of included music and movies. People can contribute media that they make — podcasts and artwork and everything else. There's petabytes of all of this information up on the Archive.

I was brought into the organization in 2011. But I had been doing lots of archiving of my own. For a lot of people, they thought I already worked there. The best kind of hire is where they hire somebody and people go, "I thought they were already there in the first place." We've gotten along really well and we're getting up on four years now.

Primarily, at the Internet Archive, my title is Free Range Archivist. I'm also the software curator these days. Of course, I do whatever else is needed. As a Free Range Archivist, that job position, which is totally made up, is that I reach out to organizations that have digital data and say, "Hey, maybe you want to host it with us, on this library." Because it's still a problem for places. Let's say they have four gigabytes of podcasts. They'll say, "Who are we going to pay to host this?" We're like, "We'll just do it under our non-profit umbrella." We'll do that for a lot of places. A lot of places, of course, don't even think about their web presence. I'm like, "You know, you could put those with us and send us a hard drive and it will be great."

It's a very rewarding job. You're basically walking around like a fairy godfather saying, "So, your wish, Cinderella, is to come to the tower and have a prince." [Jen laughs] That's basically what I'm doing.

Also, I've been working really, really hard on software for the organization. So software is playable in the browser. Because I feel like we do this with movies and music. We're so used to it now. But the idea of having a little window with video in it that just plays, is just amazing. If you did that in 1995 and said, "Here you go! High definition pictures from a movie!" You'd be like, "Oh my god, what am I looking at?" We're kind of there now. We can do various machines inside of a browser running at pretty good speed, so in the future, the software just doesn't sit inside of a CD-ROM or in a zip file. You can say, "Let me look at it for awhile. Why don't we play this game?" Or, "Let me try this spreadsheet. Let me see how it looks."

The place is a relatively small employee set, so there's a lot of self-direction. But it's all aimed towards the same thing. Which is, we just want to be the internet's library. We want people to come to archive.org, browse around, see things they didn't even know they wanted, and find the things they're looking for. It's been very rewarding.

Jen

[00:08:57] archive.org really is an incredible treasure trove of digital material.

One of the things I used it a lot for — many years ago — was the Prelinger archives. Correct me if any of this is wrong. Rick Prelinger collected this big archive of films from the 20s, 30s, 40s, and 50s. Not commercial feature films, but little education films. The kind of films like, "Let's hide under our desks when the atom bomb comes," that was made in the 50s by the government of the United States. Or TV commercials of cigarettes from the 40s. I imagine those existed in some kind of physical library at some point. But they have been digitized. None of them have any copyright on them.

Jason

[00:09:53] Right. Rick and I get along well.

Basically, Rick had been involved in a bunch of archiving products and projects and everything else. What got people interested in the Prelinger archives was, he turned around and started digitizing all of these government films he was getting that were just going to be thrown out.

So many times you walk into a place. I do it in my side. I keep bringing this back to software just because I happen to know it, but it's the same exact theory. Just before we got on this conversation, somebody mailed and said, "You know, for a couple of years, I was doing a technical program. Over the years, I got hundreds of these educational software programs. Would you like that?" "Why, yes! Yes! Yes, we would like that! That would actually be very nice. Send it to us." Somebody saying, "Well, I have 3,000 newsletters from various user groups. Would that be interesting?" "Yes! Yes, that would be interesting!"

He was very good at getting into contact with groups and places that were going to get rid of films and saying, "I'll do it. I'll take the film." Then the value, once they became accessible, was just inherent. As soon as you could see these old videos.

I go to film festivals a lot. I watch mostly documentaries. It's not out of control to say that maybe a third of the documentaries I see thank his archives. Because they're like, "In the 30s, they used to..." and show a picture of cars going by. It's like, "Yup. That's from something that Rick digitized."

What's fascinating to me about him, and I follow the same path, is that he's actually kind of gotten away from those. Not in terms of, "I never did that. Don't talk to me about it." Just in terms of, his big thing now is acquiring home movies. He's really big on that. He gets people to send him home movies from the 30s onward. There are some real early adopters who did home movies in the 30s, and also the 50s and 60s. He's digitizing them because they are, as he calls it, non-sponsored, non-scripted recordings of places.

He will have a family who will go up and down some part of San Francisco that nobody else recorded because, why would they do that? It's kind of weird. From there, you can go, "Oh my god, that's how the pier looked," or, "That's the kind of cars they had," or, "That's what that part of the area looked like before they developed it." He has this show, The Roadtrip Series. He will find family vacation footage.

I'm doing the same thing with software. I'm really getting big into having personal collections. Frankly, we've done a really good job of archiving Mario. I think Mario is going to hold out for the next century. [Jen laughs] We're going to do really well with Sonic. We're going to do pretty well with Crash Bandicoot. We're going to do fantastic when it comes to things like Pac-Man. But there is software that people made for their little user groups in, like, the middle of Texas or Kansas. It might have had a certain bent or approach, and maybe only a couple dozen copies went out in the first place.

It's funny how over time you follow this path of, "I must have everything." It's like, some things have advocates already. I've stopped... this is... [pause] Before our interview started, I said I go off the rails all the time. [Jen laughs]

The thing is, I'm not as worried about gaming magazines. A lot of people played video games and, of course, they'll digitize video game magazines. But I have a newsletter about an obscure word processing program. I've got six or seven issues of this handmade newsletter. That's the stuff I want to scan because I'm afraid nobody will, because it's too weird and boring and obscure.

The nice thing about the archive is that there's room for all of that. It's a weird, Willy Wonka factory. We have a new interface that's come out called version 2. You can see, it says, "Try the new beta interface." It's much easier to find things. But even I am sometimes surprised when I walk the stacks. I've tried to describe it to my coworkers. It's like this endless room of drawers and you open one and it's like, "Oh my god, it's 1,000 glass eyeballs. What's this one? Every hedgehog whisker from the past 100 years." It's just this weird thing where you'll find 50,000 manuals or you'll find a full run of a video podcast series that somebody did about Burning Man.

Just weird, wonderful things that are all living in there and being saved. It's a weird library. As time goes on, there's an inherent hope and faith that computers will generally do metadata for us. Within the realm of the world, you know how it is from the outside, if you're five feet away, you're like, "Really? That's a big fight?"

I mean, everybody's got this, right? It's like JavaScript versus sodium... [Jen laughs] There's lots of these weird little fights, and you're like, "Really? Is that a problem for you folks? Could you just have a meeting?"

Within the librarian world, there is a fight over machine-assisted metadata and whether or not a machine could ever possibly do the job of a human. Nothing beats a person working on handcrafting better structured information about a range of data. John Henry aside, you're going to end up with, "I didn't incorporate these 40,000 manuals." Well, I can have a machine go through and reasonably tell you, "I think this is about a microwave." And do it very quickly. As opposed to a person going through, "This is a microwave. This is a certain kind of microwave." It's actually a raging debate, whether or not machine-assisted metadata can go.

Meanwhile, Facebook is literally saying, "Do you want to be the friend that identifies this person in the photo who we know is in the photo?" It literally does that, right? "Alright, so we know this is Lynn. Do you want to tell us it's Lynn? Or we'll just ask one of Lynn's 100 other friends. All it takes is one of you to agree." [Laughs] I think that's the greatest example. It's such a beautifully sociopathic company. [Jen laughs] It's just funny how they do that.

The fact is, the facial recognition knows it's you. Just like you can look at a photo in Apple and it will say, "I think this is a two person shot. This is a group shot. This is a portrait." Over time, we're just going to see more and more... I mean, yes, we're going to identify faces on garbage cans. That's always going to be the case.

Our biggest problem, and it will always be our problem, is... look at this, somehow I led back to this [Jen laughs]... the hardest part is to have the data to work with, going forward.

Jen

[00:17:46] Let me jump in here with our first sponsor today, Code School. You can put away the programming books and learn JavaScript, HTML, CSS, Ruby, iOS, git, and more, by going over to Code School at codeschool.com.

They have a pretty awesome... tutorial system, perhaps I could call it. These courses you can take. You sign up. You can do an individual course or they've got these paths set up. You can follow a path. Maybe, for example, you want to learn iOS. They've got a whole language path, where you start out with the really basic stuff and work your way up into more complex stuff. As you work through any of these courses, you're earning badges and points. The great thing is they've got a really good interface going on, inside a web browser, where you're typing code, you're seeing the results of that code immediately, and there's little hints and tutorials popping up. I was just trying the HTML one, and I'm like, "This should be an h2 and this should be an h3," and you hit this button and you're like, "Hey! Check my work!" and it's like, "Yes. Let's give you a hint. This is what you need to do next." You can click on the little thing and it will tell you the answer. A lot of interactivity. A lot of guiding you through a process of learning by doing this stuff. These videos pop up and they give you... I was really impressed with the HTML video. Not just, "Here's what HTML is. Type this thing." It started out with a history of where HTML came from, and Tim Berners-Lee and Robert Cailliau and CERN. Then jumping very quickly into understanding what HTML is, then onto the basics of the markup and how to wrap different pieces of content into certain kinds of tags. When you jump into the actual activity, they've already got all of the content right there. You're just typing the tags around the stuff. Really, really well-done.

I was poking around — this is totally not part of this sponsorship read — but I thought, "Oh, this is really helpful, people might like this." For folks who are developers, they've got a seven-part course of Chrome Developer Tools. Which is free. The whole thing is free. They actually have a free level on all of their different courses. So you can take the first unit for HTML for free. You can take the first unit for jQuery for free. The very first Intro to Git, free. Try it out, see if you like it, and if you do like it, then you can sign up for a paid membership. But this entire Chrome Dev Tools thing — which they did with the folks over at the Chrome Developer evangelist world. Paul Irish and Peter Lubbers — both of whom have been guests on the show, and Peter several times — helped out on this. They made these videos along with the folks over at Code School. This is all free. This whole seven part thing. I think Chrome probably really wanted to teach people Chrome Dev Tools and they thought, "Hey, we'll pair up with Code School, we'll put this thing together." I'll drop this in the links on the show notes on the page today. Which is at thewebahead.net/97.

You can check them out. One monthly fee gives you access to every Code School course. If you ever need a breather, you can suspend your account at any time, and then you can start it up again. Don't worry, your history and points and badges and everything are still there. You know how life goes. You sign up, you do two months of it, then you get busy. Freeze the thing. Go back and start up again a couple of months later when you have time. Check them out. codeschool.com. Thanks to them for supporting the show.

A lot of people who are listening to the show right now are young and probably have just grown up in a digital world where a lot of this was written and created before they were aware that there could be a life for humans before computers. [Laughs] There are a great number of us who remember what it was like, growing up with no computers anywhere. All of the computers that first showed up, it was a very few of us who even had access, and slowly it became more and more people. In those days, back in the 70s and 80s and 90s, there was no giant corporation, giant corporate culture all over all of this stuff. Pretty much everything was some obscure little company with some obscure little idea and there were many, many, word processing programs, for example. There were many, many games. They were all made out of text and they weren't that complicated because computers were not that powerful.

I think it's hard for us, in these days where everything is super commercial and it's either Google or Facebook or Microsoft or Apple, to realize there could be a little tiny company with a little tiny idea and a little tiny program. They would print it in the back of a magazine and you would type the program from the magazine (the text) into your computer. [Laughs] Or you'd get a floppy disk and you'd put it into your machine. Or cassette tapes.

To preserve all of that stuff, because otherwise... I think many of us have a pile of CDs, somewhere in our house, of computer programs. Do they even mount anymore? Have they degraded to the point where it can't be read? I've got all of these Microsoft Word documents from 1991. I can't open them. The new version of Microsoft Word doesn't open the old documents from Microsoft Word.

It feels like there's a real danger. It seems like a lot of the work that you're doing around software is not only about preserving the files themselves, but also preserving the ability to open those files and run those files. The operating systems that are needed to run the programs which are needed to open the actual archived files.

Jason

[00:23:51] Right. Certainly, there's a whole bunch of things there, right? Fundamentally, I'm doing something that has been done — and had to have been dealt with — for generations before. Even entire eras before. The main difference between then and now is just that the half life is ridiculous. That's the only main strangeness.

I love the whole thing with silk murals. You could have a silk mural, and there are currently, I believe, between four and six people in the world who repair them. You have to study under them for 10 years to apprentice before you can fix any of your own. But it's a thing. Those silk murals can be preserved. They can sit in someone's attic or sit in a basement for 50 years. It will suck for it, and it might get a little hurt, but on the whole, we can probably bring it back and get a pretty good idea of what we're looking at. In other words, there's a lot of room for error, in the way that we screw up what's important and what's not important.

This is a common piece of news juice that will show up every once in a while. "Incredible example of early photography found in somebody's basement," or, "This old painting, they thought it was the guy's apprentice. It was him!" That's a common theme. Part of it is because the half life of those things... it was pretty good. You can get away with 50 years of not doing anything. Or 20 or 100.

With software, with the web, and with online... no. If I had to say, "What's the number at which you should start getting nervous?" I'm like, "About four years." Four is when you start to see cracks in the facade. Things not running quite the same way. Context being lost. Operating systems being upgraded and deciding, "We're not doing that today." Of course, websites disappearing all the time. Even the websites that are the backups of the websites, or referencing the websites, going away.

All of those things with the online world are conspiring. We can get information as fast and quickly as we ever could in the history of the world, and our capability to lose it with no chance of recovery is even better. We don't even have the little dent in the earth where it used to stand anymore. We can actually lose that. We can't even take an aerial photo and go, "There was a house there once." We can't. We can have a whole bunch of hard drives die, shred them, and be confirmed that they are utterly gone. There's not going to be any kind of an impression.

It's a weird situation. It's a real paradox. There's a lot of initiative. We've spent a lot of time... I'm saying "we" as a people. When I used computers in the 80s, there was this wonderful promise that was built into it. This beautiful world that was painted. We have gotten that world. We have gotten that world wonderfully, to be honest. It's wonderful. But we've done it at a great cost. Certain things never got followed up.

People talk about the loss of the monopolized telephone system. There were issues of corruption and slow service, but there was a consistency and nominal legislative need for everyone to have access. Although, not everyone did. It was weird not to have the ability to get some phone access. We had a system there. When it went away, we lost certain things. Certain things became cheaper, which was great, but we also lost that consistency of repair and infrastructure. We've suffered ever since.

Similarly, we're seeing a situation where we've built up this new infrastructure that's faster, cheaper, and awesome. Yet, there are ethics and concerns we should have that never got cooked in, because the internet went from experiment to critical infrastructure way too quickly. It went from, "I can aim a webcam at a coffee pot and not have to go down the hall," and then the weird shock of having 10,000 people visit your coffee pot every day because it's kind of cool. People will send you email because the coffee pot is empty. [Jen laughs] That's neat. That's fantastic. To, "All of my financial transactions are now being done across this crazy thing." That's a pretty big leap. As we were already discovering with some of the security bugs... we've had some incredible security bugs, right? Where it's like, "Wow. It turns out every Microsoft product has been insecure for the past 15 years, along this vector." And that's just one of six major vectors we'll have discovered in the past couple of years.

Everything is on sand. All of it is on sand. All of it is weird. The young people — I guess you want to call them, the ones who are coming out of high school and going into college or starting out their lives, just trying to be a part of this world — they've always known a world like this.

I'm just old enough that I remember when you could start touching televisions and have them do something. Touch screens are now ubiquitous. I remember marveling. It had to be '99 that I saw two kids walk up to an airport terminal to press it. They were playing with it. I was like, "Those kids have always lived in a world where you could touch a screen and it would do something." I thought that was neat. That was a cool thing. But there was a time when that wasn't the case. They've always known a situation where companies host something called the internet, it mostly involves things being fast, it lives on their phone and it lives on their computers. From it, they do stuff. They do a lot of stuff on it. The jokes about, nobody understands the weird things with my computer and my phone, are starting to run out. Because everybody has to know. It's like showing up in drama programs, now. Everyone has a phone. When it says, "The police tracked his cell phone." Nobody's like, "What's a cell phone? How do you track a cell phone?" We just get it.

Along with that, for example — this is a weird one — Facebook has no phone number. If your Facebook has a problem, there's no IT department to call. To say, "Hey, Facebook, my Facebook isn't working." You can't do that with your Google. You can't do that with your Twitter. You can't do that with your eBay. There used to be a site that listed some of the secret phone numbers that you could use to get a person on the line. That's a major deal. When you lost the ability to talk to a person, they lost the ability to keep track of people as being anything other than analytics. You're just a number, and with that, some really, really, really sociopathic choices start being made.

Jen

[00:32:10] Will you talk about how companies... it feels very much like big tech companies today — even the little ones — are very focused on the present and the future. Focused on, "What are we doing right now? Are the servers up? Is it fast? What are we going to be doing in six months or a year? How are we going to monetize everything? How are we going to increase shareholder value? How are we going to make more money?" A lot of these companies — I'm thinking of Yahoo and others. Yahoo buys Geocities because they think it's going to be a really great business for them in the future. They think they're going to make a lot of money. They buy it up. This that and the other. Fast forward a bunch of years and they realize, "Ok, we're not going to do anything with this. It didn't turn out to be this awesome thing that we thought it would be. Let's kill it." Then they just delete the whole thing.

Will you talk about that? What you've been doing about that, and what you think about companies that are making those decisions and... I don't know.

[Both laugh]

Jason

[00:33:15] First, let me very clearly demarcate that I don't buy very much into startup culture. I don't get along well with a lot of the aspects of tech culture that are fundamentally aimed towards looking at the world as a series of unexploited opportunities that are waiting for you to slap together enough Perl and AJAX to be able to get people to give you enough money for somebody else to give you even more money to get those people. I don't like that world. I don't like what it's become. Right now, the poster child for that is Uber. But it's going to be someone else in two months, in six months. That idea of, "Here's a thing and we're going to do this." Then news stories come out: "It turns out a lot of them are jerks." Other people are like, "Hooray that they're jerks. That's fantastic." Adam Smith and Ayn Rand are doing a waltz and it's wonderful. I just don't like it. I realize that it's very dominant. It's a wonderful story. To start with a dream and you want your dream to come true and you all pulled together and you made the dream come true. It's just, the dreams are so small. They're all one of like, "I coded until somebody gave me a lot of money and then I taught myself how to surf for 10 years." I don't enjoy that. I always preface that. Because somebody might go, "Pfft. This guy." I totally get that. I hope that you respect me enough to go, "He said that at the beginning instead of the end. That's much better."

Prefacing it with all of that, the fact is, not only is it a problem that companies don't keep track of their past, many of them have now worked themselves into self-directed incentives to not want to. They don't want previous versions to show up. They don't want other pieces of their history to become public. They don't want to be crippled by having somebody tell them, "You said this six months ago. I don't see that." They don't like that. They have a lot of incentive to say, "We've always been here. We've always been at war with lack of shareholder value. We're going to go and continue to grow into this great thing."

I get that. But it means they tend to not keep their original story. Or they craft it. There's a story of eBay that isn't true that's out there. There's a story of Amazon that isn't true that's kind of their story. There's incentive to tell a cute story and act like every previous version was a terrible disease and move on to the next great chapter of their life. I get that.

If you're talking about a supermarket, only the weirdest people are going to care about the previous incarnations of supermarkets. The story of the Great Atlantic & Pacific Tea Company (A&P) is fascinating to a small group of people. But it doesn't affect you going in and buying a can of beans. It's not going to wreck you. You just want beans. You don't care. It's a commodity.

The problem is, somewhere along the line, these companies started to absorb end-user culture and aspects of humanity that we associate with creativity and personal history. That's when I get worried. For instance, we're going to host all of your media for free. Your photos, your podcasts, your writing. Then you notice they have no contingency for shutdown. They have no contingency for export. They have no contingency for being able to get your hands on it if the company happens to go into receivership. It's not built into their DNA because it's not profitable. There's no reason to have that living will.

When things get weird, when they get to the end, that's when we see these character flaws. The final act is always awful. Ends are terrible. My talk at Brooklyn Beta was called The End. It was about the fact that a lot of people love the beginning of these things: "We're going to change the world and make a lot of money." But they don't like the end. "Oh, that's right, we now have 100,000 people's only copies of their children's photos," or, "We're the only place that hosts this podcast," or, "We're the only place that hosts all of this writing that was done by this college." Then... mrrrgh. Literally. That's literally the sound they should make when they go down. "Mrrrgh. It's overrrrr. Okaaay. Good to talk to you. Remember me at the next VC meeting."

What got me involved from an activist standpoint was, "Somebody needs to make them feel vaguely bad. Let's try a little bad." So I got involved in it. It turns out, these places actually don't want to feel bad, at all.

Jen

[00:39:33] So what's a solution? What if there's a company out there that has 100,000 users' photos of their children and they want to do the right thing?

Jason

[00:39:43] There's a step in between, which is what's going on now. The answer is, ridiculous activism. I'm involved with an ad-hoc volunteer group called Archive Team. Archive Team is a bunch of volunteers who will go in and try to make a copy of a site that's going down. Because these places give no warning. They don't listen. Then they go down. The average warning time that I see is somewhere on the order of 30 to 50 days.

I'll give you a good example. RapidShare, a file-hosting service that's been around since 2002, that's survived all sorts of onslaughts and had layoffs and everything. They announced about two days ago that they're shutting down at the end of the month. After being up for 13 years. Thirteen years and 30 days of warning. For basically hosting of ridiculous amounts of data. I'm sure they got their share of, "Do we need another copy of The Expendables?"

On the other hand, there are piles and piles of unique utilities, one-off songs. Things that people thought in the mid-2000s were "too big". "This thing's almost the size of a CD! We'd better put it somewhere. Let's give it to RapidShare." All of that is going to disappear. Poof.

Archive Team is stepping in and going, "Can we save some portion of it? So down the road somebody could get their copy of it?" We'll do the same thing with whatever else we can find that's shutting down. Right now, I think we have 12 major projects going on. Places that announced they're shutting down. That's a stopgap measure. It's an important measure. I have a lot of people that spend a lot of hours a week trying to make this happen.

Jen

[00:41:46] It seems almost like a volunteer fire truck.

Jason

[00:41:49] Oh, it's absolutely that. Absolutely.

Jen

[00:41:50] You see smoke and you go running over there to grab everything you possibly can before it gets destroyed.

Jason

[00:41:57] Sure. In one way, we're actually really embarrassing. There will be times when we're like, "You're dying. We're getting a copy." The company's like, "We're doing fine!" We're like, "No, no, you're really not." [Jen laughs] We're like that dog that can smell disease. We're just like, "[Sniffs] No, no, you're doomed." We'll start grabbing and they'll be like, "There's been a terrible chain of events!" We'll be like, "Yeah, yeah, I'm sure it's been terrible and unexpected."

Jen

[00:42:26] What are some other projects that you've grabbed up?

Jason

[00:42:29] Quizilla was a really good example. User quizzes. Xanga, a Microsoft hosting platform. That went away. At some point, Verizon decided it was out of the user hosting business.

Jen

[00:42:50] Oh, right.

Jason

[00:42:51] They dumped all those user pages. Everything with a tilde went away. We went and grabbed a copy of all of that. We pump all of these into the Wayback Machine just because it's a really good interface for looking at the old stuff.

There's archiveteam.org. We have projects we've worked on. There will be a lot of small photo sites. The worst part is, they're not particularly big. We have a couple of sites that are terabytes. The reason a lot of these places are shutting down is because nobody used them. The people who did use them are getting caught flat footed. But size-wise, not so bad.

Posterous was a good example. Oh, man! Did I have a great time with the Posterous guys. I'm sure they don't want to have to discuss my interactions with them. One of them called me "The Glenn Beck of Archivists," because I was ridiculing them. Here's why. I'm more than happy to mention this.

What happened was, Twitter bought Posterous. They did this ridiculous thing called an acqui-hire. Which is a very recent innovation. Here's the problem. Because of software patents and other liabilities, it's very difficult to know that when you buy a company, you're not buying something else. Some disputed creation, some broken software patent, some minor misstep done by a financial guy that left after the first six months that makes it illegal in twelve countries. You just don't know.

What they'll do is, they'll buy a company. They'll fire everybody and offer them brand new jobs at the parent company. They will bankrupt the original company and shove it into a file folder. If anybody goes, "We think you owe..." They'll just go, "Here's the file folder. Have fun. You can take the file folder to court."

They did this with Posterous. They knew they were doing it. They bought it and knew they were going to do this. Yet for one year after the purchase, they let Posterous linger on its own and exist and not do as many updates and not give very good blog entries. Basically, "Users... eh, why do they need to know anything about their stuff?" At some point, they declared they were shutting it down. I got in everyone's face. Of course, the people that worked at Posterous are all very much like... they're in new Twitter world. Who knows? Someone is probably a VP now. Somebody is probably a manager in some part. It's ancient history to them. It's, "That's how we got here."

All of those accounts — there were millions, well over 6 million accounts — were just going to be deleted. We got up in that. I was really cranking them off. Then a deal was struck. It's a secret deal, that's what makes talking about it great. "If you stop insulting us, we'll give you an engineer who will help you get a copy of the public-facing stuff." We were like, "Ok." The best part was, that engineer would disappear. We were like, "We have to get his attention." So we would turn the spigot up really high and he'd show up, like, "What's going on? You're destroying..." And we're like, "Yeah, yeah, good, we got you." He did not like that.

Anyway. When it finally finished, I actually drove to Twitter headquarters and demanded to see the engineer. When he showed up at the front desk, I gave him a cake. At some point they said, "What's your favorite cake?" and he said, "Cheesecake." So I showed up with a cheesecake and said, "Thank you, Vincent, for all you've done!" and walked away. I took a picture of it. The team liked that. They thought that was pretty funny.

That's a case where it's this weird, silly, stupid situation. It's all very nice and good, in a narrative way, to have this Robin Hood ridiculousness that I'm doing and that this group is doing. It's because there's a dearth of legislation and architectural priorities built into this thing.

We have HIPAA, which prevents a medical company from dropping all of your medical information into an insecure website. That's really bad. There are ranges of what you can do if you decide you want to process credit cards. There are really intense rules, down to the networking configuration of what you can do. Then we have zero protections, controls, or information around user-generated data.

It's because 20 years ago we were on Zip disks and CD-ROMs and we had to rely on Trumpet Winsock to get onto the web. Then, boom! Suddenly everything depends on it and we're turning off banks and replacing everything with machines and mobile apps so quickly that we forget that stuff. I think we'll remember it, but you've got to have a few really, really tragic screwups to do it.

Danger Hiptop went offline for five days because they screwed up a backup and everyone was like, "Oh my god. Everything I own." That was a wakeup call to some people, I think. If Facebook goes down for more than a week. Or Twitter goes down for a week. I think it would take a week. One day, we'd be fine with it. But one good week, just because someone forgot to carry the five or they went from this version of the operating system to the next patch upgrade and there was a regressive bug and it doesn't work for five days. I think people would wake up very quickly to, "We really depend on this and we have no way to get it out. That's really bad." There is no real export function on Facebook. There's a pretty good function on Twitter. But even that's kind of weird and screwed up.

It's not baked into our culture yet and I'm hoping it will be. Because it's not a great job to be the screaming activist for years on end. Just ask Ralph Nader.

I don't mind it. It's good to hear heartfelt stories. Certainly Archive Team gets its share of people going, "Thank you. Somebody thought of us and a year later I was able to get my father's stuff that he had put up there and died and never gave us the password." Heroic stories are good but those heroic stories aren't as good when they never should have happened in the first place. [Laughs]

Jen

[00:50:29] It feels like there's a way in which having one really great non-profit archiving the entire digital history of these three or four decades and having groups of volunteers and staff of that one non-profit responsible for deciding and sending little bands of troops over to save things, is not scalable or sustainable or what librarians and archivists 100 or 300 years from now will look back and wish that we had during this time.

Jason

[00:51:09] Right.

Jen

[00:51:10] It feels like we need something bigger than this.

Jason

[00:51:28] If you're a hero long enough, you become the villain. I'm both delighted and annoyed that in some places, in some areas, the conversation has turned from, "Wow, this Internet Archive thing is awesome!" to, "Why is there only one?" [Jen laughs] "Why are we dependent on this monopoly?"

Jen

[00:51:47] Well, there was a fire. I heard there was a fire at the building. There's a building and there was a fire in the building. It was like, "Oh no!"

Jason

[00:51:57] That was notable.

Jen

[00:52:00] Are all the servers in that one building?

Jason

[00:52:03] I know, that's the thing. Luckily, it's in multiple buildings. It should be in multiple buildings in multiple countries.

Jen

[00:52:08] Yes. Multiple countries. Multiple sides of the globe.

Jason

[00:52:13] The whole structure of the Internet Archive — and I'm very happy that they do this and it's baked into the core — for any object on the archive, you can always download the original item.

Jen

[00:52:32] That's fantastic. It's not just a compressed, small version. The original is there, too.

Jason

[00:52:37] Right. You click here, you get the MP4. It's easy to browse on the web. But then you can go to the download link and say, "Give me this 17 gig AVI." In fact, we'll even generate a torrent for it, if it's something that's wanted. There are scripts that will enable you to point it at various things. To say, "Just give me everything." I'm going to try to put up more education about how to do that for more people.
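
As a rough illustration of the "give me the original" idea, the sketch below builds direct download links and picks the original files out of an item's metadata. The `/download/` and `/metadata/` paths follow the public archive.org URL layout. The item identifier, the file names, and the sample metadata dictionary are made up for illustration, and the `source` field with its `original` value is an assumption about the metadata layout that should be checked against a real item.

```python
from urllib.parse import quote

ARCHIVE_HOST = "https://archive.org"


def download_url(identifier: str, filename: str) -> str:
    """Direct link to one file stored inside an item."""
    return f"{ARCHIVE_HOST}/download/{quote(identifier)}/{quote(filename)}"


def metadata_url(identifier: str) -> str:
    """Link to the JSON description of an item and its files."""
    return f"{ARCHIVE_HOST}/metadata/{quote(identifier)}"


def original_files(metadata: dict) -> list[str]:
    """Names of files marked as originals (assumed 'source' field)."""
    return [
        f["name"]
        for f in metadata.get("files", [])
        if f.get("source") == "original"
    ]


# Hypothetical item: one uploaded AVI plus a derived MP4 for web viewing.
sample = {
    "files": [
        {"name": "home movie.avi", "source": "original"},
        {"name": "home movie.mp4", "source": "derivative"},
    ]
}

for name in original_files(sample):
    print(download_url("example-item", name))
```

Nothing here touches the network; fetching `metadata_url(...)` and feeding the parsed JSON to `original_files` would be the next step for anyone who wants to walk a real item.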

Fundamentally baked in is not, "Hooray! Data hoard! Get away from our shiny collection of toilet rolls!" We're very much trying to be this digital repository and balance off it.

Here's the thing that drives me nuts. I'm hoping to see this change. The Internet Archive costs $12 million a year to run. That's about 20 Facebook promotional parties. I see a company spend a quarter of a million dollars at some web conference. I'll be like, "We could scan thousands of books with that money!" It's such an efficient organization, for what it is. It's such a small budget to accomplish as much as it does. It's one of the best values out there. I'm just like, "Why aren't places writing at least some guilt checks?" Just to be like, "We don't really care about people but, look, we gave a million dollars to the Internet Archive. At least something is being done."

I would love to see other places do what we do. Our Wayback Machine is completely open source. The crawlers we use are open source. It's all out there. We work very hard to make sure that no secret thing does the real work. It's not that difficult to do. You just need a consistency of vision and you need to be unafraid to scrape websites. Some places are afraid to scrape websites. Some places don't prioritize having web space.

We're involved in accessibility. I believe we're in the top 200 websites. And we don't keep logs. I'm always in the process of... people don't know that we don't do that, do they? If you browse us, we don't log you. We keep them for two hours so we can go, "Wow, a lot of people are hitting us." But then we get rid of them. Like a real library should. We just don't keep logs. There aren't many top 200 websites that don't keep logs. We switched over to HTTPS because we were like, "Better do that!" And so on. It's a wonderful organization, in terms of having a core set of values that they believe in so strongly.

I sound almost zealous when I talk about it, but this is a really good group of people, doing what they do. I wish there were three and we bumped into each other at conventions. "Hey, there's the Internet Archive. There's those guys. Over there is that person." Be like, "Did you get that site? Oh man, that thing was crazy." It would be great to have that situation. I think a little competition would be nice in that context.

I'd like places to understand that funding the archive has amazing returns. It's a tax-deductible donation. We take Bitcoin. It's hard to argue with that. Until other organizations step forward in a meaningful way in the same area, it's going to depend on one place. But it's a pretty good place and it's definitely better than zero places.

I'm glad we've hit that level of success that people are like, "Wait a minute, there's only one." Without realizing that there used to be zero. I think there's a quote from me in a documentary, where I said that, essentially, we have these young people on Archive Team, and we have people like myself who are weirdly organized towards this. It's a tragedy of the modern era that we have to depend on children and maniacs to save our history. [Jen laughs] But that's where we are. It's fun to bring the message to people.

Jen

[00:57:42] Let me jump in here with our second sponsor today, DreamHost. Since 1997, DreamHost has been hosting all kinds of awesome websites for all kinds of awesome people. They've got domain name registrations, web hosting, cloud storage, and different levels of hosting solutions. If you need something more complex, or you need a VPS, you need some kind of enterprise, high-volume website, you can start out with something more basic and work your way up. Staying on the same platform and same company. Dedicated servers, all of that kind of stuff. They also have this WordPress hosting. Optimized hosting for WordPress sites and blogs that makes it really easy to set up WordPress websites. One-click install. Basically, you pay for one account and you get unlimited domain name support and unlimited file hosting. Unlimited whatever whatever.

Maybe you've got a little shop. This is what I used to do. I had one account and I would just host a whole bunch of WordPress websites for my clients, all on the same system that I had set up. Which made it super easy for me, because I didn't have to keep learning everybody's different hosting. Oh my god, it was driving me crazy. Going from random host to random host to random host, having to learn system after system after system. It was much easier for me to say, "Hey, client, I will host this for you." Give them a little bit of a deal, but charge them an amount so that, by the time I added up all the people who were paying me, it was far more than I was paying for the one account. You could do that at DreamHost, because you can hook up as many different domain names to this plan as you want. Or as many different websites. All of that kind of stuff. It's pretty standard, pretty cool, pretty awesome, pretty solid. DreamHost has been around forever. Very nice and solid. They're just going to deliver what you would expect from a basic hosting system.

You can get a special deal. I mean, this is a really good deal. US$3.95 a month for hosting and a free domain name if you use the code THEWEBAHEAD395. Or go to dreamhost.com/promo/thewebahead395. Check them out. You can, as always, go to the show notes for this particular episode and find the link and the code. The show notes are at thewebahead.net/97 for this episode. Thank you so much to DreamHost for supporting the show.

Jason

[01:00:27] Oh my god. RapidShare is going down. Oh my god. Inkblazers — a comics hosting network — is going down. Or justin.tv, which went down.

Jen

[01:00:44] Or Blip, blip.tv.

Jason

[01:00:47] Blip.tv was great. I had a great fight with those guys. That was a great day. We got on the phone and were literally yelling at each other. It was as loud and fervent as you can imagine. We both thought we were right. Finally, I said, "Let's take one day off and talk to each other in a day. Because you and I both obviously get what we want by yelling. Let's try relaxing for a moment." Then we came back and negotiated something that was good for both parties. They agreed to leave certain things up, we agreed to not destroy their servers downloading as much as possible. Which, apparently, a company on the outs was not prepared to handle. It was great, it was a good time.

Jen

[01:01:36] Yeah, that was frustrating. Listeners of the show might remember episode 76, where I had Michael Verdi, Jay Dedman, and Ryanne Hodson on to talk about video blogging and the early days of video blogging. We talked a lot about what it meant to put video on the web back in 2002, 2003, 2004. Where to put the files was really hard. You couldn't put them on a regular web host for $20 a month. It just wasn't going to happen. The Internet Archive was one of the first places where we could put videos up and not pay anything at all. They would actually work. But the interface was a little clunky and it wasn't automatic in certain ways. Blip came along with this little startup saying, "Hey, we believe in you, video bloggers. We think this is the beginning of something really amazing." This is all before YouTube was invented or built. "We're going to be the cool guys. We're going to be the ones who get it. We're going to be the ones with the values that we care about content creators more than we care about shareholder value. We care about our service and doing great work." They did that for many, many, many years. Until the founders decided to sell the company and move on to something else. They exited. Then Blip pivoted and it became this very different company with a different mission and they decided to purge and delete the video blogging archive. The invention of video on the web. "Oh, we'll just hit the delete button on that." Like, ah. And I get it. I don't necessarily think that every company has to have enough money to exist forever. [Laughs]

Jason

[01:03:19] That's the thing. You walk yourself through it. You look at a real world example. There's going to be two different kinds of worlds. One is "Come Dancing" by The Kinks. It's about place, and things happen there, and it was magic and mystery every night. "There's a parking lot where that used to stand," as the lyric goes. That was experience and it's going to live on in the minds of people and maybe someone will write a book about it. It will live as an echo and maybe some photographs of it having existed and maybe some posters. You understand that group experience. It's another thing to have somebody driving around with an ice cream truck, ringing a bell, going, "We'll store your family photos." Then later you're like, "I'd love to get my family photos." Or they come around at 2:00 on a Sunday and go, "We're burning all the photos next Friday." It was like, "Oh, you weren't home Sunday? That's a bummer." Then there's no way to retrieve it afterwards. I love that myth that once they shut it down, it can't ever come back. Because they have to tell that myth. "We shut the servers down on April 30th and it's May 10th. No way."

Jen

[01:04:56] Because they did not have a shredding party on May 1st, you mean?

Jason

[01:05:01] We actually have a real world test example of this. Apple. Apple had MobileMe. In MobileMe, as far as I can determine, they did the right thing. They said, "Hey everybody on MobileMe. We've decided to move away from MobileMe and we've invented some idiot new Apple service. We're going to shut you down in one year." They said, "One year from now, all of your stuff is going away. Press this button and it will go to our new service. Otherwise, you're screwed." And, "Hey everybody! Six months!" and "Hey everybody! Five months!" They had announcements, every month, on the month. Then they shut it off and a whole bunch of people were like, "Hey!" They were still like, "What? What was this?" A lot of people just didn't know that thing that had their stuff was called MobileMe. They just didn't know.

It used to be that when planes first started — in the 20s — you could be reasonably sure that everyone on the plane knew how to fly the plane. It would be like, two people. With scarves. That's neat, but it's not very profitable. You have to spend a lot of time training those people on that plane. Now, there are 700 people on a really big plane. Maybe four or five people could maybe fly that plane. You did that because it was cheaper. Now those people only pay $300 and they fly in that big plane and they get there. Then everybody else knows how to open up the door if the plane has an issue. But only four or five could actually operate it. That was a choice. You have a lot of people who get into the plane, have no idea how the plane really works, and then land in the plane.

Jen

[01:07:15] And they don't want to. They just want to get there. [Laughs]

Jason

[01:07:17] They really don't want to, yeah. Through the 70s and 80s, if you were using a computer, you had either opened it or built it.

Jen

[01:07:29] You assembled the computer.

Jason

[01:07:30] Right. If something goes wrong, you're like, "Oh, that."

Jen

[01:07:34] I'll just open the computer, shake that wire.

Jason

[01:07:37] You go look and you're like, "Look at that chip. That smells weird. Ok. That's fine." As opposed to someone hitting the side and going, "It doesn't work! Make it work!" It's easier to buy your commodity PC and log into your commodity setup.

Jen

[01:07:55] It's much easier to use. Much more reliable.

Jason

[01:07:59] The scary part is how they've reengineered all of these boxes. To be honest, a power restart does solve everything. It assumes that they basically focused the operation of the machinery down to one switch: the power switch. Which also functions as the diagnostic switch. "I'm not working? Turn it off and turn it back on again, because that will kick in the diagnostics." If you were from an earlier time, you'd be like, "No, that's a showing of failure. That's just voodoo." But now we do that. With these services, it's like, "Aim your phone at this and your photos are taken." People take the photo and the photo goes immediately into the photo service — let's say Instagram — and that's it. Your copy might be sitting on your phone, but people switch out their phones an average of every two to three years. They might not know to pull every piece of photo and data off of the phone when they upgrade. They don't know it's there. If the service goes down, are you going to go through your filing cabinet of phones to find your original picture?

That's a case where the engineering has made things very simple, yet fundamentally we're in a place where if anything goes wrong, it's like the movie Brazil. Weird pipes are behind the tile and you have no idea what you're looking at. It's that way in our cars, our medical equipment. It's the way of the world.

Jen

[01:09:41] There's so much to understand and know how to do. We can barely keep track of the part that we're supposed to be keeping track of. Never mind worry about all of the rest of it. We need to trust each other and let everything else be taken care of by other people.

Jason

[01:09:55] Right. This ultimately leads to my theory: legislation is the only thing that's going to do it. Believe me, I deal with a lot of people who have very, very strong opinions about the government's ability to get involved in anything. The fact is, unless you level the playing field so that everyone has to have the stupid export, victory favors the ridiculously bold.

I've dealt with companies who are trying to do the right thing. They've contacted me and others, asking, "How do we do the right thing?" It turns out to be pretty hard. Especially if you don't bake it in on day one. [Laughs] I got into an argument with a company who I won't name. I was punching them and people were like, "What are you doing? They're new." [Jen laughs] I'm like, "That's the time to punch them! Wouldn't you want to punch Facebook in the first six months?" So I did that to this company. One of their developers who, if this company was to grow, would probably become the VP of blah-blah-blah-blah, become completely untouchable and have his own private helicopter. He and I had this long conversation. During it, he said, "Look, we're working on export and we hope to get export." I'm like, "I'm going to take everything you just said to me and replace export with backup." You know, "We're thinking of having backups. We know backups are important. Backups are very hard. It would be really nice. I know people sometimes ask for backups and we'd listen to them, but we still haven't really focused. We're hoping in the next quarter to have backups." I'm like, "How hard is it to do an export versus a backup?" It's just backing up to a different location with a different file format. It's because there's no priority baked in because you will never get in trouble for not doing an export, but you'll get in trouble for A, B, and C. Come on.
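
To make the "backing up to a different location with a different file format" point concrete, here is a minimal sketch in which a backup and a per-user export are nearly the same operation. Everything in it is hypothetical: the record fields, the `internal_id` key standing in for service-only data, the file names, and the choice of JSON are assumptions for illustration, not a description of any real service.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical records; "internal_id" stands in for fields only the service needs.
RECORDS = [
    {"internal_id": 1, "owner": "ana", "title": "Beach", "body": "photo-bytes-1"},
    {"internal_id": 2, "owner": "ben", "title": "Essay", "body": "text-2"},
    {"internal_id": 3, "owner": "ana", "title": "Podcast", "body": "audio-3"},
]


def backup_all(records, dest: Path) -> Path:
    """Operator-facing backup: everything, in the service's own layout."""
    dest.mkdir(parents=True, exist_ok=True)
    out = dest / "backup.jsonl"
    with out.open("w", encoding="utf-8") as fh:
        for rec in records:
            fh.write(json.dumps(rec) + "\n")
    return out


def export_user(records, owner: str, dest: Path) -> Path:
    """User-facing export: one person's items, minus internal fields."""
    dest.mkdir(parents=True, exist_ok=True)
    mine = [
        {k: v for k, v in rec.items() if k != "internal_id"}
        for rec in records
        if rec["owner"] == owner
    ]
    out = dest / f"{owner}-export.json"
    out.write_text(json.dumps(mine, indent=2), encoding="utf-8")
    return out


# Write both into a throwaway directory and read the export back.
with tempfile.TemporaryDirectory() as tmp:
    backup = backup_all(RECORDS, Path(tmp) / "backups")
    export = export_user(RECORDS, "ana", Path(tmp) / "exports")
    exported = json.loads(export.read_text(encoding="utf-8"))
    print(len(backup.read_text(encoding="utf-8").splitlines()), "records backed up")
    print([item["title"] for item in exported])
```

The only differences between the two functions are the filter on owner, the dropped internal field, and the destination, which is the sense in which an export is a backup pointed somewhere else.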

That's been the deal with some places. There are going to be people who listen to this and go, "This whiner is never going to make something of value." Because their value is, "How fast can I strap this code onto some human process and monetize it?" I get that. It's a cultural thing. I think, personally, the equation changes once you put human beings and their culture into it. That's mostly what I care about. I'm trying to think of a company where there's very little stake in that.

Jen

[01:12:46] It's hard to think of one.

Jason

[01:12:48] I know! It's crazy, isn't it?

Jen

[01:12:50] With something like Instagram, it's clear that users are trusting you with their photos. They'd like to not have them disappear. But you think of a site like nytimes.com, which is not user-generated content, but we care deeply about The New York Times and having an archive.

Jason

[01:13:05] We still have comments and contributions. When a newspaper goes under and all of that stuff goes away, oh my god, people flip out.

Jen

[01:13:15] They might have microfiche of the current New York Times paper. The actual, physical paper-paper. But it would make more sense to have digital-only. You want those histories preserved.

Jason

[01:13:29] Right. I know you just chose The New York Times as an example, but The New York Times is a really good example of, they've been around for 100 years.

Jen

[01:13:39] So they know how to do this.

Jason

[01:13:40] They've had to go through three or four mediums, and three or four approaches. "We'd better do video. We'd better do radio." They've built up processes over the years. Even the BBC has occasionally really sucked wind on archiving. Seriously. I'm not supposed to tell anybody. They have a couple of warehouses where collateral damage from leaky roofs was a known thing.

To me, that's just the erosion of existence. That's like waking up in the morning with a slightly chipped tooth. That's because you're alive. The fact that it's generating all of this stuff over time: entropy, misfortune, malpractice. New intern comes in, wipes the wrong tapes. Uh oh. Lost Benny Hill. Ok, I get that. But when it's a needless, senseless, baked-in approach to destruction. An utter disavowing of the meaning of that. That's when I get all activist and angry and "it doesn't have to be this way" and pointy finger.

That's always going to be the deep education. I would love an attitude and a world where it became news that data was being lost. In the way that privacy breaches are now. "Oh my god, this hospital leaked medical records." That makes news because it's weird.

So many places shut down now. Of course, now, I am the nexus of things shutting down. I get told things are shutting down all the time. Sometimes before anyone else knows. Somebody in the organization will be like, "Better tell Jason Scott." In this way, like, "I'd better tell him. Because this is going to happen."

Jen

[01:16:01] "He's going to find out anyway." [Laughs] "I don't want him to be mad at me when he hears it."

Jason

[01:16:07] I know of at least one company that shut down, in a way, to avoid us. It's crazy. They shut down so the Archive Team wouldn't cause them more bandwidth bills. It happens.

Then there's the speculative stuff. For instance, if anyone remembers FTP sites. We've been downloading from FTP sites because our opinion — which I think is correct — is that, as they die, they're never going to come back. Once that FTP site that was up in the 90s shuts down, all of that disappears. These are going to be computer utilities and information packs.

Jen

[01:16:45] For people who don't know, this was before the web was taking off.

Jason

[01:16:49] Yeah. It's been around since, like, '82.

Jen

[01:16:54] You would just log in to someone else's machine and poke around on their hard drive and take what you wanted and leave what you wanted. That's how people shared stuff.

Jason

[01:17:03] Right. It got used a lot by game companies in the 90s because it was an easy way to say, "Go here. Download this patch. Apply this patch. Now your game will work." My group said, "Man, somebody ought to grab a copy of these things. I think they're just going to disappear." Just stopping into church basements and being like, "You guys got any..."

Jen

[01:17:27] "... floppy disks laying around?" [Laughs]

Jason

[01:17:29] Floppy disks, too. Sermons, believe it or not. I helped oversee a site that's put up 23,000 or 24,000 sermons. They're just recordings from various churches. Where they would just make tape recordings of these pastors, going back 100 years. They had them and I'm like, "We'll mirror them." Someone in the mid 30s or 40s might have said, "I got my hands on a reel-to-reel. We should record the word of the lord from this guy." Then the reel-to-reel sits around for 40 years.

Jen

[01:18:04] And it will start degrading, so it needs to be digitized.

Jason

[01:18:06] People will make these recordings. Now, where do we host them? That was a speculative thing done decades ago. Otherwise, there wouldn't be a record of how the pastor sounded or what was important to him and everything else. The only problem is, now, instead of a slowly rotting tape, we have an immediately rotting hard drive. [Laughs] That's the big difference. Congratulations, humanity! You did it! You found a way! You let the snake eat its own tail. Good job. Glad to have been a part of that.

Jen

[01:18:43] People should go poke around. If you've never poked around on the Wayback Machine or any of the Internet Archive, I highly suggest it. Go over to the Internet Archive, poke around in all of the different kinds of media and libraries.

Jason

[01:18:57] Go to archive.org/v2. You can try out our new interface. I really suggest it because they've been working on it for well over a year. It's designed for browsing and digging deep. You can look at a bunch of covers or frames and be able to click around. As a browsing experience, it's unparalleled. Just walking around, stunned. At the top is the webpages. You can type in yahoo.com or wired.com and it will show you a calendar of all of the times that the Wayback Machine stopped by and grabbed a copy. Going to Google in 1999... they had terrible design aesthetics.

Jen

[01:19:53] I'm looking at Yahoo from December 96. When Yahoo was the best way ever to find things on the internet.
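
For readers who want to do the same lookup from a script rather than the calendar, here is a minimal sketch in Python. It assumes the Internet Archive's public availability endpoint at `https://archive.org/wayback/available`, which returns JSON describing the capture closest to a requested date. The function names are illustrative only, not part of any official client.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Public Wayback availability endpoint (assumed reachable; no API key needed).
AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def availability_url(site: str, timestamp: str) -> str:
    """Build the query URL. `timestamp` is digits such as YYYYMMDD."""
    query = urlencode({"url": site, "timestamp": timestamp})
    return AVAILABILITY_ENDPOINT + "?" + query


def closest_snapshot(response: dict):
    """Return (timestamp, url) of the closest capture, or None if there is none."""
    closest = response.get("archived_snapshots", {}).get("closest")
    if not closest or not closest.get("available"):
        return None
    return closest["timestamp"], closest["url"]


def lookup(site: str, timestamp: str):
    """Ask the endpoint for the capture nearest `timestamp` (needs network access)."""
    with urlopen(availability_url(site, timestamp), timeout=30) as resp:
        return closest_snapshot(json.load(resp))


# Example (not run here because it needs network access):
#   lookup("yahoo.com", "19961201")
```

The lookup returns whichever capture is nearest the requested date, so asking for December 1996 may hand back a snapshot from a neighboring month if nothing was grabbed that day.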

Jason

[01:20:02] Right. They were a web directory.

Jen

[01:20:04] It was the only major big player where you could find everything. Long before Google.

Jason

[01:20:11] At the time, I believe it was the #1 or #2 website in the world.

Jen

[01:20:20] It was the entryway into everything else.

Jason

[01:20:21] The worst part is, last month, they shut off the web directory.

Jen

[01:20:27] [Sighs] How could you do that?

Jason

[01:20:30] Good job there! That's great!

Jen

[01:20:36] They're so far away from their original business that they just shut their original business down.

Jason

[01:20:39] They didn't find any way to upgrade it or whatever.

Jen

[01:20:43] I get it. But still.

Jason

[01:20:45] I know. It drives me nuts. But this is the story you get all the time. "I didn't know somebody could shut that off! That's fascinating."

Jen

[01:20:57] There's 23 gopher websites listed in the directory. There's five internet phone websites listed. It wasn't a search so much. It was like a phone book, a yellow pages phone book for the web, with every website that existed.

Jason

[01:21:15] What I loved about the story of Yahoo was that Yahoo started out as an email list done by these two guys, these two graduate students. They start to put things together, and then they hired this woman who was a library science major. She redid all of their website listings along library science subject headings and everything else. So it changed. It became very easy to find things because she was a librarian. The cool thing is that we could look at this site — because my boss had the foresight to keep it — and be able to go back and be like, "That's what went on."

Right now, as we do this interview, Jeb Bush is getting his hand stuck in the machinery of gotcha. He's putting out old historical stuff and people are like, "You really screwed this up." There was a great moment. He put up all of his old emails. In it, somebody is like, "I want to become a private investigator. Here's my social security number." People are like, "Oh my god, his social security number is in there." I watched this discussion. Someone goes, "When you go to the governor's website, it has a disclaimer that your stuff will become public." Someone else goes, "I used the Wayback Machine. That was added in July 2013. This was in 2006." [Both laugh] We can use it that way, too. You can't act like it was a flat surface and everything happened at once before last week.

Jen

[01:23:03] Does the Wayback Machine attempt to scrape every webpage on the whole internet? How much of the actual web does it succeed in scraping?

Jason

[01:23:13] It grabs over a billion URLs a week. It's a hot flashlight that skims over things. It will do things pretty well. It will use some level of social media; people tweeting links or other things, to get a sense of what's out there. It will always encounter depth issues, like if somebody is like, "I need to know what changes were made." It's great that people think we're capable of this. "Last week I had something on my website but I deleted it this week. Can you get it back?" The answer is, "Probably not. Unless you had a huge media presence and your file system died because of the weight of it." Then maybe we came in and saw it.

Archive Team created a separate project called ArchiveBot. ArchiveBot allows something like 80 people to weigh in on what websites are in the news and important and whether we should grab them. I watch them do this. We pull in about 400 gigabytes of data per day on that. Comedian dies, let's grab every website the comedian ran plus any websites the comedian was involved in, plus any organizations he was contributor or board member to, plus any news articles that mention that comedian. They'll try to do that.

A really good example with Archive Team was when Gaddafi got deposed. We went in and grabbed all of the government sites and all of his propaganda. Ten days later, it all went down. The rebels gave what I call the most insistent takedown notice ever. [Jen laughs] "Stop hosting this, or we'll kill you." And they complied. That's even faster than a DMCA! But I'm like, "Don't you want to know what the guy was saying before he got deposed? You should at least hear his propaganda and what he was saying and why people were ticked off." He had a pretty good propaganda machine going. His website was in five languages. He was like, "Check me out!" There's no evidence of that anywhere in Libya. There's no website in Libya to tell you what happened with this guy who was pretty involved in the place for a while. Now we have a recording of it.

Jen

[01:25:56] So there's a layer. The bots are going around doing one thing and there's a human layer on top of that.

Jason

[01:26:02] Right. We've actually moved to taking screenshots, in-browser screenshots, so you can see how a place was laid out. There's lots of technical issues with screenshots.

Jen

[01:26:15] It seems like the older pages... if you go back to the 90s, the images are missing.

Jason

[01:26:25] Right. You'll get these weird problems.

Jen

[01:26:28] Which is frustrating, especially since for many years, we put so much of the navigation or the header for the site or the logo... really important parts of the site were in images. When those images didn't get archived, you ended up with big boxes of broken image links instead of actual menu items.

Jason

[01:26:50] That's actually always going to be a problem. You've got a machine doing a lot of work. It is scraping things as well as it can. It's going to miss some pieces. It's going to be really annoying. In some cases, you're going to get this pockmarked crater. "Look at this crazy photo of this person riding a motorcycle!" And... nothing.

Jen

[01:27:26] It seems like it's gotten better, though. Back in the 90s — probably because computers were weaker and storage was expensive — maybe there was a while when there wasn't even an attempt to scrape the images.

Jason

[01:27:38] We were definitely scraping the images but I think the approach could be thwarted or run into problems.

Jen

[01:27:46] Because it seems like in recent archives of webpages, all of the images are in place. I don't know about things like audio and video.

Jason

[01:27:56] Sometimes it works. We've got a lot of MP3s. A lot of WAVs. Nothing beats a person showing up and gathering together a collection of a thing. It's the difference between going into a building and recording what's in the building, versus hooking a vacuum cleaner up to a convenient window and sucking everything in. [Jen laughs] It's a completely different experience. That's what we do. There's no way around it. You don't do a billion URLs a week unless you do these things.

Sometimes it's just stupid things, like interstitials or odd JavaScript programming that isn't compatible with an older WebKit. That can thwart it. They're not being protective, they've just set up some really weird thing. Even in the ArchiveBot, they've had to put in conditions to be like, "Oh, looks like we're pulling from a forum." They make all sorts of choices.

This is where I'm a fan of the screenshots. Because you lose some of the content. By having a screenshot of the main page, you can go, "This is what it looked like." They're not hard to save. I've been happy that we do that. We're taking hundreds of thousands of screenshots.

Jen

[01:29:35] If you go to the Wayback Machine, you can manually enter a page and say, "Save this right now." As you said, it doesn't save every page, every hour, or every day. It saves your website once a month or three times a year.
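
As a rough illustration of that manual save, the sketch below builds the two URLs involved. It assumes the commonly used `https://web.archive.org/save/` prefix for requesting a capture and the `https://web.archive.org/web/<timestamp>/<url>` pattern for viewing one. The helper names are hypothetical, and whether an anonymous request is accepted at any given time is not something this sketch guarantees.

```python
from urllib.request import Request, urlopen

# URL prefixes assumed from the public Wayback Machine site.
SAVE_PREFIX = "https://web.archive.org/save/"
REPLAY_PREFIX = "https://web.archive.org/web/"


def save_url(page_url: str) -> str:
    """URL that asks the Wayback Machine to capture `page_url` now."""
    return SAVE_PREFIX + page_url


def replay_url(page_url: str, timestamp: str) -> str:
    """URL for viewing the capture nearest `timestamp` (digits, e.g. YYYYMMDD)."""
    return REPLAY_PREFIX + timestamp + "/" + page_url


def request_capture(page_url: str) -> int:
    """Send the save request and return the HTTP status (needs network access)."""
    # The User-Agent string here is a made-up placeholder.
    req = Request(save_url(page_url), headers={"User-Agent": "save-sketch/0.1"})
    with urlopen(req, timeout=120) as resp:
        return resp.status


# Example (not run here because it needs network access):
#   request_capture("http://thewebahead.net/")
```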

Jason

[01:29:55] That's a relatively recent addition and it's been a really good one.

Jen

[01:29:58] I like it. Yesterday I launched this brand new website at thewebahead.net and I came over here because I knew I would be talking to you today. I was like, "I should save the first day of my website!" So now I have it saved.

Jason

[01:30:13] You're in front of the building, you're holding your first dollar, you're smiling. "We're in business!" is written underneath. We have that snapshot now.

Jen

[01:30:22] Yeah! It's so nice. When it was still full of bugs, this is exactly what it looked like on day one.

Jason

[01:30:30] "And you had no idea how sucky 2016 was going to be, during that one interview!" Yeah. Everything's happiness and light. You try to gather. Just being there, in the moment, and capturing it. You can always reference that.

Jen

[01:30:43] What's interesting is, I went and looked at it. "Let me go look at the archive of the page." And it looks very, very different. Because the web font is not loading. Everything is in Helvetica instead of Avenir. Which is ok. That's how progressive enhancement works. This is how the site is designed. It's designed to still work, and I've tested it periodically without the web font loading. But I think because I'm using a web font service and the font is not on my server as one of my assets, next to my other assets like the CSS and images. There's JavaScript on the site that's running and grabbing the font from fonts.com. They're copyright protecting it and you've got to come from the right place. You've got to have the right domain. The domain has to be whitelisted. Blah blah blah. When it's coming off a different domain, it doesn't work.
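
To make that failure concrete, here is a small Python sketch of the kind of domain check a hosted font service might run. It is an assumption about how such a check could work, not fonts.com's actual implementation. The allowlist, the function names, and the archived URL with its placeholder `2015` timestamp are all hypothetical.

```python
from urllib.parse import urlparse

# Hypothetical allowlist a font service might keep for one customer.
ALLOWED_DOMAINS = frozenset({"thewebahead.net"})


def host_of(url: str) -> str:
    """Lower-cased hostname of a URL, or an empty string if there is none."""
    return (urlparse(url).hostname or "").lower()


def font_request_allowed(referring_page: str, allowed=ALLOWED_DOMAINS) -> bool:
    """Allow the font only if the referring page is on an allowed domain."""
    host = host_of(referring_page)
    return any(host == d or host.endswith("." + d) for d in allowed)


live_page = "http://thewebahead.net/"
# Placeholder archived address: the page is now served from web.archive.org.
archived_page = "https://web.archive.org/web/2015/http://thewebahead.net/"

print(font_request_allowed(live_page))      # True: request comes from the allowed domain
print(font_request_allowed(archived_page))  # False: host is web.archive.org
```

The original address still appears inside the archived URL, but only as part of the path. The host doing the asking is the archive's, so a check like this one refuses the font and the page falls back to Helvetica.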

Jason

[01:31:39] That's the thing. You're running into that fundamental piece of it. It's going to do an ok job. It's going to verify certain things. You existed on that day, here's what you wrote about, here's the slogan before they changed it, here's what was important to them on that day.

The example that I love to give is Shakespeare. Words were pronounced differently, so now there are words that don't rhyme that used to. You'll lose that. We still enjoy Shakespeare, but we're not getting that joke. I've actually seen live performances where there will be a particularly ribald joke and you can see the actor do a gesture to go, "Wait, think twice about what I'm saying here." To be like, "Ohhh, he's not talking about flowers." They're helping the text along because we've lost that.

Jen

[01:32:51] Right, because the context has changed.

Jason

[01:32:53] That gets into general history theory. There was an Activision product that lets you play an Atari 2600 game on your PC. There was a button that would cause your mom to randomly yell that you were playing it too much. They had some recorded woman going, "You never see your friends!" and, "It's time for dinner" and, "When do you take the garbage out?" It was this random voice, this dumb thing.

I know of The Arcade Ambiance Project, where this guy tried to put together the humming sound of arcades so you could play it in the background on speakers while you were playing video games. [Jen laughs]

Another one did speaker deflection coils, so you could modify the arcade game to the level of damage that it had when you played it. If you were playing a 1980 video game in 1987, the speaker was starting to buzz. You could add the buzz in. [Jen laughs]

This kind of bespoke, historical... it's like, walk through London, but you don't have an incurable disease. It doesn't smell like crap because the horses are there. You lost the horse smell.

Jen

[01:34:17] You're going to be able to see Yahoo from 1996 but you're not going to be able to see it on the 16-bit screen with 72 DPI that we used back then.

Jason

[01:34:26] Exactly. The 640x480 maximum. It's funny because we recently got a prototype working that boots up Windows 3.1 and starts up Trumpet Winsock and you can browse websites using Netscape 1.0. It works except most websites don't care about that anymore. They won't do non-HTTPS connections or they've got JavaScript or something else. When it works, it's funny.

The classic example of a website that's never gone away is the infamous Space Jam website. We got this running. It was a circa-1995 browser. Windows 3.1, Netscape 1.0n — which was the public version. I go to the Space Jam site going, "We're coming back!" By the way, it does shove itself into the web log, which to me, is hysterical. I'm trying to imagine the metrics on these sites going, "Yeah, yeah, four million websites. We got hit with the Mozilla engine, the WebKit engine. They're doing stuff over here. We have one hit from Netscape 1.0n running on Windows 3.1." [Jen laughs]

Anyway, it hit it, and it rendered a little bit of it wrong. They did something between 1995 and 1996 and the Space Jam website reflects that in the frames. So it's wrong.

Jen

[01:36:08] So it doesn't load on Netscape 1.

Jason

[01:36:10] It does...

Jen

[01:36:11] But it's broken.

Jason

[01:36:13] Well, the planets are a little bit too short. One of the fonts is a little weird. I was just like, "Bwah ha!" People will be like, "In the good ol' days, it was always so easy. Before this new upgrade." I'm like, "No no no! It was terrible!"

Jen

[01:36:26] Ah, it was not easy. It was so much harder. People complain now that the browser support is so hard. It's like, "No, you don't even know."

Jason

[01:36:34] It's like, "Poor baby. Your website might be a little wider than you expected." You don't even know. 1024x768 backgrounds.

I was having a nice conversation with the museum host that I was hanging out with yesterday. He said they've discovered a radical difference between recording an interview — an oral history with someone — and recording the same person in front of things from that time. The difference is so markedly different. If you go to a programmer's house, or you call them up on Skype and you go, "Hey, do you remember working on star boof?" And the guy's like, "Yeah, star boof. That was fun. It was great." Then if you bring that guy a box, the star boof box, they'll be like, "Oh my god! Manni! Manni the lunch guy!" He'll just go off. "The box. I was holding this box when Manni told me this."

We're designed this way as creatures. If you bring people these old things and show it to them and don't just make it an email saying, "What do you remember of the old country?" or, "What do you remember of 1997?" You say to them, "Star Wars" and they'll be like, "I loved Star Wars." Then you show them the theater they saw Star Wars in, they're like, "A girl broke my heart in front of that sign." [Jen laughs] They won't even remember her name for 25 years. Our brains store our memories and context in such strange areas.

That's kind of what's going on here. Let's grab as much context as we can, knowing that our mission is doomed. [Jen laughs] Let's just go at it.

Jen

[01:38:40] Thanks for being on the show today.

Jason

[01:38:43] Thank you. Obviously I care about these things.

Jen

[01:38:44] That sounds like a good mission.

Jason

[01:38:46] It's such a fundamental part of us. The things we make and things we store and things we want to share and communicate to others. So you can go anywhere with it. I'm a little sad that we've made this devil's bargain on such a shaky medium. But it has brought us so much in the past 20 years. I'm willing to make the bet. I'm very nearsighted now. I was like, "Years of screens." They're like, "It's a shame." I'm going, "No, no, it was totally worth it! Burn the eyes out a little bit. See amazing things for 20 years. I'm doing much better than my great grandparents. I'm doing really good."

It's well worth it. The fight goes on. I wish more people would join us in the fight. I wish more people would enjoy what we're working on. archive.org is well worth visiting. archiveteam.org is well worth visiting. I'm Jason Scott.

Jen

[01:39:47] Where can people follow you on Twitter?

Jason

[01:39:50] @textfiles on Twitter.

Jen

[01:39:54] And your own website?

Jason

[01:39:56] I have textfiles.com, which is my computer bulletin board history site. I have a weblog there where people can reach me and see me rant about something that's giving me trouble or thoughts on something. I've always got something to say. Obviously. [Jen laughs]

Jen

[01:40:16] I'll get other links from you, Jason, after this. I'll put all of those links together in the show notes. thewebahead.net/97. Where there will also be a transcript of this episode. You can sign up for the newsletter on the new website. There's an email newsletter that's not set up yet, but will be. Follow the show at @thewebahead or I'm @jensimmons on Twitter. Until next week. Thanks for listening.
