Episode 44

The Web Behind with Tom Bruce

November 28, 2012

Tom Bruce joins Eric Meyer and Jen Simmons to talk about the very earliest days of the web, writing the first Windows web browser, inventing 'marquee', and taking a road trip to NCSA with Tim Berners-Lee.

It was very, very hard at the time to conceive of a mass audience for the web.

Transcript

Thanks to Matt Sugihara for transcribing this episode

Jen
This is The Web Ahead, a weekly conversation about changing technologies and the future of the web. I'm your host Jen Simmons, and this is episode 44. I first want to say thank you to today's sponsors, Shutterstock and In Control Conference. We'll talk more about them later in the show. And I want to say hello again to Eric Meyer. Hi Eric.
Eric
Hey Jen.
Jen
Eric's joining us to do another episode in The Web Behind series, a series about the history of the web, where we find out where the web came from.
Eric
Yes, well, we're on what would be for this series, episode six, am I right? This is episode six?
Jen
Ummmmm… I think it's something like that.
Eric
Yeah, starting to get some really good stories and I'm really excited that we're joined today by Thomas R. Bruce, Tom Bruce, who many many people listening to this podcast probably will not have heard of. And I couldn't be more pleased to correct that now. Tom has a very interesting background, which we'll get into, but he founded the Legal Information Institute at Cornell in 1992, and has been the director since 2004. He's written a lot of software, including, and this is one of the main reasons why I want to have him on, Cello, which was the first web browser for Microsoft Windows. He's done a lot of things. He has degrees that have absolutely nothing to do with either law or computers, so of course it makes sense that he'd be the director of the Legal Information Institute and have written a web browser. So we're just going to start talking about that. So welcome Tom, thanks for being here.
Tom
Ah, it's great to be with you Eric.
Eric
So, you founded the Legal Information Institute in 1992. Had you encountered the web by then, or did that come along just a little bit later?
Tom
It came along just a little bit later.
Eric
Ok, so what was it that led you to found the Legal Information Institute? How did you get to that point?
Tom

Well, a couple of things. It may help to review a little history. And because I work in a law school, I want to start in 1581. In that year, there was a printer named Richard Tottel in London, who started the practice of using consistent page numbers across different editions of a law book. Fast forward a few hundred years to 1873 and a guy named Frank Shepard invents the forward citator, which we would now know as something like CiteSeer. In 1973, the legal profession got access to all of its primary materials online through a mainframe computer system called Lexis, which has now evolved into LexisNexis. And by the late 80s, that's where you really wanted to start, a lot of legal material was becoming available on CD-ROM in hypertext form. The principal reason for that was that cross references and precedents are extremely important to lawyers and law people. So something that you could click and get to that reference, where it said, "See this other case" or, "See section 52 of the tax code," was something that lawyers really, really wanted and that they had started to do on local platforms.

I had gotten fascinated with the internet in a kind of 8-year-old-with-a-hammer idiot way. We began thinking, Peter Martin and I, that there was some room there to begin distributing legal materials via the web, but we were thinking at that point primarily in terms of teaching materials. As an early experiment, we began putting things like the copyright law online in Gopher. And when we discovered the web in late '91 or early '92, when Tim had put up the first CERN server, we realized that we had there something that was at least as capable as a local CD-ROM platform but with much, much wider reach, and much, much better suited to the kind of work that we wanted to do in putting law online. And that was the initial impetus. Like everybody else online at that point, let's throw some stuff up there and see what happens.

Eric
Yeah, it's interesting the way that law, that legal codes and legal texts, were kind of hypertexted before there was a web. I was working at a firm you've probably actually heard of, Tom, called Banks Baldwin.
Tom
Uh huh.
Eric
In Cleveland in 1991 and they had the Ohio legal code, in that case in a DOS thing on CD-ROM, but it was hypertext because there was, as you say, it would refer to Ohio Code Section 721 and then you could use your arrow keys to move between these references and hit the spacebar and it would jump to that thing. And that's part of what I was doing at Banks Baldwin, a couple years before I had seen the web. This whole forward and backward referencing thing just seems kind of natural. So you came across the web in '92, and you had material that you wanted up there.
Tom
Yeah, late '91 or early '92. I think at the time we put stuff up, we were somewhere around the 30th webserver in the world, and the first one actually with a professional orientation that was outside of high-energy physics.
Eric
Wow.
Tom
Because at that point, everybody was a physicist.
Eric
Yeah...that's....woooooo. We're not worthy! [Both laugh] thirtieth webserver in the world.
Tom
You're just not old. There's a difference there.
Eric
[00:05:41] Well, erm...anyway. So you started to put this information you had on webservers, but I gather that there was some problem in actually getting it to the people who you wanted to get it to. Like, all the physicists could find it, but...
Tom
Yeah, well at that point, I want to say the number that we had in our heads was that somewhere around 95% of all the machines that were on lawyer desktops, and that found its way into law schools as well, were running Microsoft Windows, and in a way, the web itself was kind of ghettoized at that point, because it was running on Unix and Macs, and except in those places where people were developing web browsers, they really weren't that popular a platform. It's not what the consumer market was using at all. And so we thought, "Well, if the purpose is to distribute law to anybody who actually has a use for it," as we then thought, "we should probably put it on some kind of a machine that they can actually get it on." And honestly, I don't really know that my intention was as much to write a web browser as it was to shame somebody else, who was actually competent to do it, into writing a web browser, but that was how we started out.
Eric
Ok, so that's why you started to write a web browser, was so that lawyers who were running Windows could actually see the stuff that you were putting up there.
Tom
Yeah!
Eric
You were both the publisher and the distributor at that point.
Tom
Oh yeah, it was what I think we would now call a content-driven strategy. [Both laugh]
Eric
[00:07:23] I'm curious, where did the name Cello come from?
Tom
Oh, that's a very long story involving a failed romance.
Eric
Awesome!
Tom
Not suitable for podcasting.
Eric
Oh, alright. Did it have anything to do with Viola, or did Viola pick because you...?
Tom
Well, actually it did. I had seen Viola, and I had actually been quite impressed with what Pei Wei had done with it, and thought, "Ok, well we can use the play on words here."
Eric
So what did it take to write a web browser at that point?
Tom
Well, more than you might think. One of the problems was that the code libraries that were being used by all of the Unix developers, certainly, and I think also by the people at NCSA who were working on what was then the only Mac browser being written, were just...they would not run under Windows. There was no memory management there, and so in Windows memory space it wouldn't work at all. And while I had an awful lot of people telling me that I was off the res for not using the same libraries, my answer was, "Well, look folks: I can't. They don't run." So basically, it was done from the ground up. It was done in C++ using initially Borland's libraries and with a later move into the Microsoft Foundation Classes. It was old guard C++ from the get go.
Eric
Wow and you basically had to implement everything.
Tom
Yeah. Pretty much. I think when we started doing images, I think we used packaged image code, but that was it. I wasn't going to sit there and write a GIF renderer.
Eric
Right, so you had to implement HTTP. You had to do document parsing.
Tom
Yeah.
Eric
Was there a DOM tree?
Tom
Oh no. No, no. No, not at all. In fact, that notion was barely thought of at that point, and in fact, ultimately, I suppose you'll probably end up asking me why we dropped development on it, and I look at the need to start dealing with the DOM tree, and the need to start dealing with JavaScript as the thing that permanently put one-man development of web browsers totally out of reach. At that point, you'd reach the scale of stuff, at least not for me. It was just not possible to do with a single developer anymore. It was just too much.
Eric
When did that happen?
Tom
Oh, I'd want to say early '94 maybe. So the whole thing had a run of about 18 months.
Eric
So just imagine if you'd stuck it out till CSS came along.
Tom
Well yeah, I'd have been much happier. I mean, people had all sorts of crazy ideas at that point. And I have to say that it may be that Tim's genius was not so much in inventing the web as it was keeping web standards light enough that people could develop to them. There was a crowd running around that wanted to require that everybody do a full SGML rendering in browsers, and I sort of sat there and started thinking, "Well that adds five years to the development process." And there was a crowd that wanted to use a system that I'm pretty sure you're probably familiar with, Eric, called DSSSL, which was a rendering thing for SGML, as the basis for stylesheets.

You look at these things and it was worse than that class you regretted taking in Common Lisp. All kinds of ideas for doing these sort of technically perfect things that just could not be put into software in less than geologic time. Fortunately, we resisted that. Otherwise, we would never have had the popularity that the thing ultimately attained.

Eric
[00:11:30] So at the time, '93, '94, when you're writing your own Windows web browser, which, as you say, today the concept of one person writing a web browser would be an astonishing thing. But at the time, it could be done. Pei Wei did it with Viola. Tim Berners-Lee obviously did it with the first web browser, although he had the advantage of whatever he came up with, that was the way that things worked, and you did it with Cello. What went into the rendering engine, such as it was? I mean, there was no CSS, there wasn't JavaScript that you had to worry about. You basically had to parse the document and figure out how to display it. What were some of the things that you did to make that happen?
Tom
Well, it was really all about mapping. In a funny way, we almost ended up thinking of the tags as being presentational in nature, right? Which of course they weren't, but you had to think that way if you were going to render something. And it was mostly a matter of mapping the very strange Windows font engine to something configurable that could be made to correspond to the then very limited tag vocabulary. Recall, there was no need to parse stylesheets or anything of that sort. Those styles did not exist. It was all rendering h1, h2, h3, h4, h5. If you could build a map that essentially allowed the user to map those fonts to something in the Windows font set, you were pretty much done as far as presentation was concerned. The rest was just printing it to the screen as you printed to anything else, really. We didn't have any kind of complicated layout. Everything else was really just indentation. It's hard to think now exactly how primitive that was, but it really was kind of a teletype font in some ways.
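[Transcript note: the tag-to-font map Tom describes can be sketched roughly as below. This is a minimal illustration in Python, not Cello's code, which was written in C++. The font names, sizes, and the names `DEFAULT_STYLES`, `StyleMapper`, and `render` are all assumptions made up for the example.]

```python
from html.parser import HTMLParser

# Hypothetical defaults: tag name -> (font face, point size, bold).
# A user-editable table like this stands in for a stylesheet.
DEFAULT_STYLES = {
    "h1": ("Times New Roman", 24, True),
    "h2": ("Times New Roman", 18, True),
    "h3": ("Times New Roman", 14, True),
    "p": ("Times New Roman", 12, False),
}
BODY_STYLE = ("Times New Roman", 12, False)  # fallback for unmapped tags


class StyleMapper(HTMLParser):
    """Turns a tiny HTML subset into (style, text) runs; no DOM is built."""

    def __init__(self, styles):
        super().__init__()
        self.styles = styles
        self.stack = []  # currently open tags, innermost last
        self.runs = []   # output: list of (style, text)

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        # Ignore stray close tags; otherwise pop back to the matching open tag.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse runs of whitespace
        if not text:
            return
        style = BODY_STYLE
        for tag in reversed(self.stack):  # innermost mapped tag wins
            if tag in self.styles:
                style = self.styles[tag]
                break
        self.runs.append((style, text))


def render(html, user_overrides=None):
    """Return the (style, text) runs for html, applying user font choices."""
    styles = {**DEFAULT_STYLES, **(user_overrides or {})}
    mapper = StyleMapper(styles)
    mapper.feed(html)
    mapper.close()
    return mapper.runs
```

[Calling `render("<h1>Title</h1><p>Body</p>", {"h1": ("Arial", 30, True)})` would give the heading the user's chosen font and leave the paragraph on the default, which is the whole of "presentation" in the sense Tom describes.]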
Eric
That's an interesting way to put it.
Tom
The biggest sticking point was of course Microsoft's font engine in those early versions of Windows, which I used to liken to one of those Pacific island cargo cults. You would pray and hope that something good washed ashore.
Eric
Well, in the end we can always blame everything on Microsoft, I guess.
Tom
It's true!
Jen
What version of Windows was this? Which era?
Tom
I think 3.1 was advanced stuff, Jen.
Eric
Woooo!
Tom
There was in fact, at the time we did that, there was no Windows TCP/IP stack.
Eric
Ok, so did you have to write your own then?
Tom
No, we actually licensed one. There was a company that came and very rapidly went, called Distinct Corporation, I think they were in Baltimore, that had done a TCP/IP stack for Windows. I have no idea why. We arranged to license it from them on a very favorable basis because they saw all of this opening up a lot of possibilities for their business, and we used that for about nine months. And during that time, that was the time that Winsock was being developed. And I recall having these crazy conversations at very strange hours with Peter Tattam, who was in Australia, who wrote the original Winsock stack. Because I was one of the apps he was testing with. Our clocks would intersect at these bizarre hours of the day and night because of the time difference. And we'd go back and forth about the minutiae of the spec and whether or not the API was working and that sort of stuff.
Eric
So you had to create everything, not just the rendering engine, but also the browser chrome. How much of the UI of Cello did you base on what little already existed, and how much of it did you come up with on your own?
Tom
A little bit of each. An awful lot of it, it was meant to look like a Windows application, and it did. There were some choices that we made based on what we had seen in other hypertext engines that were running locally, most particularly something I'm sure nobody remembers called OwlGuide, which was a very very early Windows hypertext engine that was used for local publishing, CD-ROM publishing. For example, in the OwlGuide world, hyperlinks were shown off by lightly dotted boxes around the link. And originally, that was how Cello displayed them. We later changed it to using underlines under links just like everybody does now. So an awful lot of that stuff we were begging, borrowing, and stealing from wherever we could, but on the other hand there were things that no one had really thought of at all. To give you some idea of how small the community was at that point: for legal text, I very badly needed the funny looking backwards 'P' that's P with two stems, paragraph symbol, and the funny looking twisted 'S' section symbol. And I would just shoot an email to Dave Raggett and say, 'You know, Dave, I need these funny symbols,' and the next thing I knew, they would be in the character entity list.
Eric
Wow. Um, ok.
Tom
It was that small.
Eric
Well, yeah, I mean, there weren't that many of you. When you switched to underlining, was that because of Mosaic, or...?
Tom
Yeah, they pretty much put that forward. And there was a lot of...The dotted box thing didn't really work well at high link density, you just couldn't read the display anymore.
Eric
Mmm.
Tom
And even I had to acknowledge that this looked like a document that had broken out with some sort of skin rash, and that the underlining was much, much cleaner.
Eric
Yeah, it's interesting because now there's been some, not a ton, but there's been some criticism of underlining of words because it destroys the descenders basically when you visually scan over, it makes it harder to read, but we forget how much harder the alternatives were and how much harder they were to read.
Tom
Well yeah. Your choices are limited. I'm not sure anybody's eyes really track changes in hovering mice very much, especially when they're trying to scan text.
Eric
There aren't a ton of screenshots of Cello, unfortunately, these days, but there are a few and when you look at the UI, it's kind of the same. There's a forward button and then a back button and a home button, and there's a place to enter the URL or the URI depending on which side of that terminology debate you fall on. That sort of stuff, how much was borrowed?
Tom
Oh, well that was all being developed simultaneously. The concept of the home button and the home page was pretty much endemic to everything that Tim had done from the beginning and everybody had that. I remember a certain amount of raging debate on www-talk about exactly how the back button ought to behave, because people were even then encountering the same sort of problems with, 'Well, is that doing what's expected?' that I think every Ajax developer deals with now. I remember some discussions about that. We were still working out what the proper treatment of whitespace was under certain circumstances, just a ton of stuff like that. Behind it all was this real question, at least in my mind, about what your responsibilities were as a browser developer. Because at that point, it was still possible to point fingers back and forth. Some people who were developing software were very, very tolerant of problems in HTML, and I tended not to be. My feeling was, 'Well, if the document's broken, then why would I shoot novocaine into that injured knee by fixing it on behalf of the guy who has a broken document?' And there were some of the same sort of negotiations going back and forth about 'What do you do with errors that are really emanating from the server? Do you try to repair them? Do you try to fail gracefully? What do you do?' None of that stuff had been worked out, I'm not sure it has been yet, frankly.
Eric
[00:20:12] Yeah, well, that's a whole other debate. So, the guys at the NCSA and you and Pei Wei and Tim were all hashing this out as a group basically?
Tom
Yeah, there was a mailing list, and an awful lot of stuff was flying around. We had all met via email; very few of us had met in person before. In the late summer of '93, O'Reilly sponsored a get together for all the web developers in the world, and I think they were all there. 35 of them [Laughs]. At O'Reilly's office in Cambridge, and it was the first time I'd met most of these guys face to face, so it was me and the NCSA guys, and the SLAC people, and the guys from O'Reilly, the University of Kansas team that had done Lynx, I remember there were some odd bods from here and there who were very active in web development at the time. But it was a very, very, very small group. If you had a mailing list for all the developers in the world, you had 50 people.
Eric
And when you say web developers, you don't just mean browser developers, you mean...
Tom
I mean server developers and everybody else. I remember jumping in a car with Tim at one point, I was in Chicago for something else altogether. He was in Chicago to visit people at Fermilab, and we jumped in a car and went down to NCSA, and that afternoon basically hashed out the beginnings of what would turn into CGI.
Eric
Oh, ok. These things that....So I have to ask--
Tom
Andreessen was sitting there saying, 'Well, you know we need some way to hook up the database' [Laughs]
Eric
Uh huh.

Eric
[00:24:08] So I have to ask, how was Sir Tim as a road trip companion?
Tom
Oh, he's great.
Eric
Really?
Tom
Yeah! We got horribly lost. Actually, he wasn't driving, it was a friend of his from Fermilab. We got horribly lost [laughs]. Eventually we got to NCSA, and eventually we got back.
Eric
Ok. So he's not one of these guys who sings the same song 20 times in a row?
Tom
No, no, no, no, no. No, not at all. We had a lot of fun, actually.
Eric
Did you introduce him to the American custom of slug bug? I just have to ask.
Tom
I don't think I know it myself, Eric.
Eric
Oh. You didn't have a thing growing up--?
Tom
I did introduce him to the American habit of rolling down your windows while going across the Loop in Chicago and screaming "out of state" every time you make a turn the wrong way onto a one-way street.
Eric
That's excellent. Ok, so right. You and Tim got in a car, went down, found Marc Andreessen eventually, and figured out the beginnings of CGI, so that's where we got cgi-bin from.
Tom
Well, yeah. The NCSA guys were well on their way to doing it. They were working on some other project at the time, his name escapes me at this point[?], but it was basically a data visualizer that they were doing for mathematical and physics experiments, and that kind of thing. They were doing it under some NSF grant. It had some of the same needs for that stuff, so they had gotten part of the way there, but there was this talk about, 'Well, how do we do it through the server? How do we communicate with it through the server?' And at that point the whole CGI notion started up. And it wasn't like we walked in there at one o'clock in the afternoon and by two thirty everybody had solved this problem. It was the beginning of something that probably came to fruition between three and six months later.
Eric
Ok.
Tom
None of this stuff happened instantly. I don't recall anybody being struck by lightning and going to the keyboard and starting furiously typing.
Eric
Yeah, well, I guess all the stories can't be awesome, but....
Tom
No.
Eric
So CGI, for those who may be listening who have never come across CGI except in the context of Jurassic Park and that sort of thing for "Computer Generated Imagery," was actually the Common Gateway Interface, which became an RFC eventually. CGI 1.1 became an RFC. Previous to that, your webserver couldn't talk to databases.
Tom
No, not really at all. Not in any way that you could present stuff dynamically. You were just looking at HTML pages and that was really it. I'm sure that various people had come up with clever ways of dynamically generating HTML pages, but we hadn't seen many of them. I'm a little unclear, there was a project running at the time at Xerox PARC that was a mapping project that I suspect made very intensive use of database rendering, but I'm not really sure how he was doing it.
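[Transcript note: for listeners who have never seen one, a CGI program can be sketched as below. Under the Common Gateway Interface the server hands the request to a separate program through environment variables such as `QUERY_STRING`, and the program writes a header block, a blank line, and then the page to standard output. This is a minimal illustration in Python, not anything from NCSA; the function name `cgi_response` and the `section` parameter are hypothetical.]

```python
import os
import sys
from html import escape
from urllib.parse import parse_qs


def cgi_response(environ):
    """Build a CGI response (headers, blank line, body) from the environment.

    `environ` is a mapping like os.environ. The `section` query parameter is
    a made-up example, e.g. a request for /cgi-bin/lookup?section=52.
    """
    query = parse_qs(environ.get("QUERY_STRING", ""))
    section = query.get("section", ["(none)"])[0]
    # Escape user input before placing it in the generated HTML.
    body = f"<html><body><h1>Section {escape(section)}</h1></body></html>"
    return "Content-Type: text/html\n\n" + body


if __name__ == "__main__":
    # A web server configured for CGI would run this file as a program
    # and relay whatever it prints back to the browser.
    sys.stdout.write(cgi_response(os.environ))
```

[The page is generated per request rather than read from a static file, which is the capability the discussion above says was missing. A real script would query a database where this one only echoes the parameter.]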
Eric
Yeah, no...it's....
Tom
I guess the main point, Eric, is that at that point, all the stuff that we now take as being fundamental, was at a state where the need was apparent, but none of the implementation had been done. So a lot of stuff ended up happening very quickly because it wasn't like you had to sit there thinking, "Well, what should I do next?" There was always a list of problems that needed to be solved. Because everybody could see that they were there and they knew what they wanted to do with this thing.
Eric
Right. So, with the three of you in the room talking about CGI. Was it basically, someone would come up with an idea and then...?
Tom
I think there were about nine of us there, and it was more like, "Well, what about this?" So I think that's a good way of putting it. I don't recall there being a lot of religious warfare in those days, so it wasn't like these things turned into fist pounding sessions. People were willing to give a fair hearing to anything that anybody was willing to implement, and it really was very much in that environment that everybody romanticizes now of rough consensus and running code, I think is the stock phrase.
Eric
Yeah.
Tom
It very much operated that way. If you wanted to put the work into it, god bless you.
Eric
Right. Jeff Eaton has done a talk where one of his examples is how the img element came to be. Everyone knew that you wanted images embedded in pages. It hadn't been done, and the NCSA guys said, "We're gonna do img," so that's why we have the image tag today.
Tom
Oh yeah. And the blink tag. [Laughs]
Eric
[00:29:09] And the marquee tag. I heard a rumor that you're to blame for that one.
Tom
Yeah, I actually am. I don't know if this is an appropriate story for a family podcast, but sometime in early '94, I was contacted by some guys at, well, actually, I have to go back a step. When I started working on the second version of Cello, which never came to light except as a sub-licensed part of other people's products, where it had a lifespan that actually went on for a number of years, we knew that we needed better image rendering, and my answer to that was basically to buy it off the shelf. There was an image rendering package that was a library for C++ and it wasn't expensive to license, and we thought, "Ok, that'll save a bunch of work." So we went out and licensed that, and one of the things that it would do, if appropriately provoked, was cause whatever was in your window title and windows to rotate around in this sort of marquee-like ticker tape-like fashion. And I put it in there as a kind of joke for the guys who were alpha testing for me, to see if anybody would notice. So if you use the tag marquee in anything that this version of Cello read, it would cause the title to spin around like a marquee.
Eric
Ok.
Tom
I got a call from this guy at Microsoft who was looking around for some browser code. They didn't have one at that point. And would I sign a bunch of NDAs and send in our stuff. And I said, "Oh, sure." And off it went. And it came back and ultimately they did a deal with somebody else. But curiously enough, IE is the only place where the marquee tag has ever worked. I believe it works there to this day.
Eric
Hmmm. Yeah. Ok. So you burying an easter egg in Cello led to the marquee tag?
Tom
Yes.
Eric
Well, I hope you've learned your lesson, sir.
Tom
I have. If you want to slap me, you can.
Eric
Well, I don't think TCP/IP has come quite that far.
Tom
Ah, ok.
Eric
Maybe the next time I'm in your town.
Tom
Well, I'm gaily waiting for the boxing glove to come out of the screen.
Eric
Boink.
Tom
Exactly.
Eric
[00:31:30] What were some other things that you added that persisted far past your expectations that they ever would have or that they died an early death and you're sad about?
Tom
We had nothing that died an early death, including the browser itself, I might add, or anything that I'm very sad about. I don't know, I remember doing a lot of initial work on how mailto: was going to work. I was the one who thought we needed it, and I did an initial implementation of it that got...one word for it would be 'refined'. Another word for it would be 'fixed'. That ended up where it is now. I remember an awful lot of back and forth about white space that I had a lot to say about, but I don't think any opinion I had on the subject ever particularly prevailed. But this was all cranking along in multiple locations at once, and in those days, there wasn't that much to disagree about. There were a few conventions that we put in that I think have started to come back. I was very, very eager, even in the first version of Cello, to have not only direct access to some kind of search engine off of the browser bar, but to have user control over what search engine that would be. I was putting a fair amount of development time into solving what I saw as problems that people would have if they were looking for information from multiple sources. I really thought of this thing as a researchers' tool, in many ways. Researcher not in the sense of chemicals and test tubes, but researcher in the sense of people who look stuff up in books. So we supported Z39.50 directly for some stuff, and in one version we had multiple search engine selection in the way that you now do in some browsers. And I always thought that it was sad that a lot of the rest of the development community didn't pick up on that. What was fascinating was that, at the point where Netscape began introducing heavily visual elements into its HTML, some of them like blink, but some I thought at the time grossly inappropriate for rendering logical markup, pretty much everybody veered off to the "let's make pretty pictures" place, because that's what many people who were either doing graphic design or eCommerce wanted. From there, it always seemed to me it took about ten years to get back to the idea that people wanted to find stuff to read.
Eric
What kind of fascinates me at the moment is that you're talking about in 1994 or so...
Tom
Yeah, thereabouts.
Eric
Giving people the ability to choose multiple search engines. I was around then, but I don't really remember multiple search engines.
Tom
Well, by that time what did we have? We had AltaVista, we had...I can't even remember who the players were. There were three or four general search engines. But there were also specialized collections that were really library offerings based on the kind of stuff that Brewster Kahle was doing then with Z39.50. So the library access protocols that people were actually quite interested in. Mostly academics.
Eric
Actually, can you explain Z39.50, because I don't thi--
Tom
No, I probably can't! [Laughs]
Eric
Oh! Can you explain what it was for?
Tom
It was basically meant for bibliographic records retrieval between large online public access catalogs. So you could go into, oh I don't know, I'll make something up, this probably isn't accurate, but you could go into something like the HOLLIS system at Harvard and pull out a bunch of bibliographic records. MIT was running it. There were numbers of others, and it was used, and may still be for all I know, as a kind of back end protocol for exchange among basically library catalog machines.
Eric
Mmmkay.
Tom
It was a generalized information retrieval protocol.
Eric
Yeah, and of course library catalog machines at the time were also very new.
Tom
Oh, exactly. The little bit I knew about it has unfortunately faded from my memory. Brewster is probably out there listening somewhere and wishing he hadn't listened to this before dinner, but it was an interesting thing to try to deal with. It was not an easy protocol to implement as I recall.
Eric
[00:36:17] So it wasn't just general search engines that you were thinking to let someone say, "If I'm searching, I always want to be searching the Library of Congress catalog."
Tom
Yeah, exactly.
Eric
Ok. Well that makes some sense.
Tom
I think we were a little more attuned than most people were at that point that there really were people who were professional searchers.
Eric
Mmmmm.
Tom
So in that sense we were doing something....This is a very very loose analogy, to say the least, it sort of comes from another universe, but we were doing the same kind of thing in a way that...we had the same set of concerns that the people who were doing something like Zotero would have now. We saw ourselves as being tied into a community of professional academic researchers, or legal researchers, that nobody thinks in terms of anymore now that search engines have broadly generalized.
Eric
Right.
Jen
It's always--
Tom
I think one of the things that's getting lost a bit in our discussion here is that it was very very hard at the time to conceive of a mass audience for the web.
Eric
Interesting.
Tom
At least it was for me. I think it was always in Tim's head, it was always in the head of...oh, I'm thinking of Dale Dougherty from O'Reilly, who really did GNN initially. There were people who got that. I'm not sure I did. I remember actually, not long after I first met you, because I think I met you at some crazy thing at Case Western in '95 or '96, and I remember going to an American Association of Law Libraries meeting in Pittsburgh right from there, and seeing my first URL going by on the side of a bus advertising a bus and thinking, "Huh! [laughs] I never thought that would happen."
Eric
Yeah, I think the first URL I might have ever seen was www.honda.com.
Tom
Yeah.
Eric
Right across the bottom of a Honda commercial on TV. Back when it was http://www.honda.com/ because if you didn't have the slash, the server might not have responded.
Tom
Right, it would break. Yeah.
Eric
Stuff like that. Sort of that reaction of, "Wait, the muggles can see that now, too?"
Tom
Yeah, pretty much. Yeah, that was it. I remember the one that got my attention, we were way more concerned with the stuff we were publishing than we were with the browser by the time '95 had rolled around. By then, we had a full version of the United States Code up and all that. We'd been doing Supreme Court decisions for a while. I came in one morning very early, it was about four or five in the morning to switch out a hard disk, when I thought I could take the server down without complaint. And it got to be about 6:30, I was having trouble with this. The phone rang, and it was the Washington Bureau of the New York Times saying, "Excuse me, when are the Supreme Court decisions going to be back?" [Laughs].
Eric
Wow.
Tom
And I thought, "Well..." and these were the days when we were sitting there all agog at having 200 hits a day or something, and here's the Times calling up to say, "When are you guys going to be back online?" At that point I hadn't yet developed my well-honed response that was, "Well when are you sending a donation?" but...
Eric
Nice. Ok. Yeah, it's interesting. Do you think you'd have approached what you were doing differently if you had thought that there were going to be a billion people on the web within a decade?
Tom
Not really, no. I mean, I don't see really how we could have. That's really much more of a content question.
Eric
Interesting.
Tom
And there, I think we just got lucky, frankly. Because in a funny way we did anticipate that. We saw it happening and we managed to keep up with it. I'm not sure it would have made one iota of a difference to how we approached the technology although again, I don't know. We did have the initial insight that the then mass market computing platform should have a browser.
Eric
Right.
Tom
Which really nobody was thinking about at all.
Eric
Jen, did you have a question?
Jen
[00:40:58] Oh, I was just going to comment because it always fascinates me each time to listen again about how the web was started by people who had lots of content and long form content, like law cases and physics papers and documents that are at the Library of Congress. And somewhere along the way it feels like the web became this brochure rack where nothing should be longer than 500 words and no one wants to read anything more than two paragraphs. And if you need more space than that then you need to put a pager at the bottom and click to go to the next page because people hate scrolling...but meanwhile go ahead and clutter up the whole page with lots of ads. It feels like we're just barely getting back in 2012 to the idea that like, "Let's get rid of all the clutter. And let's give people a reading experience where you can sit down and read something long and meaty and understand on a deeper level some kind of concept or something."
Tom
Oh yeah, the most used application on my iPad for a long time now has been Readability for exactly that reason. It sort of reminds me of some things that used to happen. I did a certain amount of work with corporate communication stuff at one point and it sort of reminds me of some of the fads that were coming and going with things like documentary video. For years and years and years it was considered the kiss of death to use static talking heads on screen. Or static images. And then Ken Burns comes along and does Eyes on the Prize and then everything else he did. And all of a sudden, all of that convention was out the window. Because you've got a guy who's creative enough to break all of those alleged rules and have it work. And I sort of think the Jakob Nielsen tyranny of 'everything above the fold' is breaking down because people are realizing, "That's silly, you can't do that if you want to read the United States Constitution." It's all very well to say, "We have to take the text and somehow jam it into the usability model," but that's not a very realistic usability.
Eric
[00:43:21] Yeah, it's interesting. Whitespace seems to come up again and again. Even from the beginning you said there was a lot of argumentation about, or maybe discussion, maybe it wasn't argument, about what the whitespace should be. Do you remember any details about what the positions were, or just an example of what was being discussed in terms of what the whitespace should be in a certain place?
Tom
Well, I think it was the usual tension. It was the tension between the people who believed, I think for the very valid document parsing reasons, all whitespace should be collapsed to a single space. Or alternately, that things should be set off in pre tags that were in effect designed to preserve the original intention of whoever had done it. I don't know that anyone's putting that much ASCII art on the web, but you see my point.
Eric
Mhmm.
Tom
And that was always it. It was always, "Well, when we render these things..." this must be something that you're intimately familiar with, Eric, right, there's always this question hovering around, "Well, as we introduce these stylistic controls," and whitespace is the most fundamental one, "to what extent do we need to respect the wishes of the original document publisher?"
Eric
So the question there is really, when we're rendering the content, how much of the visual whitespace should be based on the source whitespace?
Tom
Yeah, exactly.
Eric
Ok. So it wasn't just visual, the discussions weren't necessarily how much space should there be before and after an h1, though I'm sure there was at least some of that, but more, "Hey, if there are six returns between the last paragraph before this heading, does that mean we should have more visual whitespace when we lay out the document?"
Tom
Yeah, exactly. You have to remember that there were an awful lot more naïve documents out there in those days. I would like to think, this is probably grossly untrue, I would like to think that we're seeing very few tables out there anymore that have been laid out with a spacebar.
Eric
Ok.
Tom
I would like to think that. I sleep better at night believing that. But I have a feeling I'm wrong about that, but you see my point. Nobody really knew at that point, people were just throwing stuff up there. And the other thing you have to remember was that there was this general desire on the part of hypertext guys to demonstrate that HTML was not hard. Because prior systems, I'm particularly thinking of Gopher, were just kinda dump it in a directory and go, whereas with HTML you had to make links and you had to put tags in, and all this stuff. There was a certain community that saw that as being very, very burdensome.
Eric
Mmmm, as opposed to the other community, like the SGML community, who would have seen that as grossly inadequate.
Tom
Yeah, exactly. There's a certain amount of tension there between a bunch of guys who just have a bunch of documents to put up, and a bunch of guys who are prepared to be purists about how it's done. Frankly without much regard for the expense of people who have a lot of retro conversion to do. We see a bunch of that around the document authentication issue that exists around legal documents. Some people are cost sensitive and some people aren't. The people who are cost sensitive tend to be the people who have huge amounts of retrofitted content that they have to put up there.
Eric
So, by retrofitted content, you mean what?
Tom
Legacy stuff of whatever kind. You really can't put up some of the decisions of the United States Supreme Court. You really can't put up some of the Statutes at Large, I mean we have, but you really shouldn't. There are places where historical completeness is important, and there are people who for that reason have to do very, very, very large document archives. And they're expensive to convert.
Eric
Back in the mid '90s when you were transitioning away from this whole thing of, 'Hey, let's write a web browser,' and you're putting up Supreme Court decisions and stuff like that, to the point that the New York Times is calling you and asking you when it was going to be back, what process did you go through to put those up? Because Supreme Court decisions are not necessarily short.
Tom
No. Interestingly enough, in your happy home town there, there was a guy at Cleveland Free Net who had been putting up Supreme Court decisions for some time, and I think the place he grabbed onto that process had to do with the fact that the Court was then using an Atex layout system to do all of their printed stuff. Atex was a very, very popular formatting engine for things like newspapers. And a lot of people in print publishing were using it. You might know Atex as what gave rise to XyWrite, which was a popular word processing program in the mid '90s. Which was largely based on Atex. Anyway, this guy at Cleveland Free Net had figured out a way to do XyWrite to, I think he was just putting them up in ASCII, he was stripping stuff out in an ASCII conversion, and we ultimately figured out how to do a not-so-good Atex to HTML conversion, did that for a long time. We also wrote converters for WordPerfect and all manner of stuff. But our basic criterion was that we get them in some electronic form, we never did scanning here to any great degree, or if we did we'd send it out. It was mostly a question of converting legacy electronic formats. And in some ways it's still a problem that hasn't been solved. Because it's just hard to do.
Eric
Well, yeah.
Tom
As you know yourself from cut and paste operations, the basic problem with non-structured word processors is that they just have too many degrees of freedom and it makes conversion very, very hard to write.
Eric
Yeah, and from my own experiences in case with converting legacy content, and in some cases not legacy content. We had a project briefly with The Plain Dealer, the local daily newspaper, to every Friday they would dump, I think it was an .rtf file of, "Here's the entertainment news for the weekend," and we would slice it up and it would turn into a site, and that ran for a while, I don't know what happened with that exactly, but yeah, with these things, in a lot of cases you would have to write a new converter every time. It's interesting that you had an electronic source for at least current decisions and current law, it was going backwards that was really hard.
Tom
It was, although to some extent again we were able to get that off of CD-ROM, and in fact to this day, the US Code that we have up there, which is 200,000 pages, something like that, updated as it's updated by Congress, is done from what amounts to typesetting code. It looks very much like old line printer escape code that's a system called Microcomp that the Government Printing Office has used, well essentially since the mid 1980s. And we've gotten quite adept at converting that, and have done so for many years. They're going to start originating in XML soon, but they have not yet.
Eric
Do you have a timeframe on 'soon'? I'm just curious.
Tom
Uhhh, they have a contract out right now that's a little open ended, but I'd expect to see it within the year.
Eric
Oh wow! So legal changes will actually start coming out in XML?
Tom
Oh yeah, a lot of stuff has been for a long time. State legislatures have been doing it for a long time. They are publishing XML versions, and have been for a while, but the official source stuff has been Microcomp for a while. But that's changing. There's been internal XML at the House of Representatives for a while, can't say about the Senate, I don't know as much about that.

Eric
[00:54:20] As you've said, a lot of the challenges that you faced then, you still face today, how do you feel about how things have changed, or how they're similar, whether in the browser space or just in general?
Tom
Well, you know I mentioned a few minutes ago that there was an initial perception that hypertext is hard. And I think we're starting to see some of the same stuff with semantic web concepts and linked data. One of the more clever things that happened was the shift of work in the semantic web from‒excuse me my telephone is ringing here and I think I'm just going to continue to let it do so because it will go three rings and stop. There was this initial thrust of semantic web toward AI reasoners and other kinds of highly sophisticated agent technology, and all of that became the darling of the AI crowd to some extent. And one of the better things that happened was that around 2005 we shifted to the notion of linked data and this idea that it really should be easier to mash up data than it is. That was a tremendous step, but I think that working in that world, working in the world of stuff like schema.org is still seen as being enormously burdensome. And frankly to some degree it is, because it brings with it all of the issues that people have always had with policy and for lack of a better word, craftsmanship around issues of metadata. It's a new level of thinking about how we do stuff. I think it's a singularly important level of thinking about how we do stuff, but it feels to me as if the world of linked data is where the web was circa 1993, 1994: we're sort of just short of critical mass to get some really good applications going. There's been a lot of good thinking on it, there's also been some stuff that's remarkably weak because it was done as initial experiments that are now being tested in ways that they can't particularly withstand.
I'm thinking there particularly of, I don't know if I'm wandering off too far into the semantic web deep weeds here, but I'm thinking particularly of something like FOAF, the friend of a friend thing, which was just originally a system for describing relationships among people, but which has no temporal dimension, and so it makes it very, very hard to model groups. You can say that "Fred is a member of X", but you can't say in FOAF, "Fred was a member of X from 1976 to 1979."
Eric
Yeah, ok.
Tom
There's a bunch of stuff like that that as we move it out into the semantic web, and this is particularly true of identifiers. As we take things like identifiers‒call them URIs, call them what you want‒and move them out of the scope for which they were originally intended, and bring them out so that they can be viewed at web scale, suddenly we realize we've got redundancies, and duplications and stuff that doesn't parse, and things that just don't make sense when they're lifted out of the context that they were originally intended for. There's a lot, a lot, a lot of spackling and patching to do around that set of issues.
Eric
Mmmm, interesting.
Tom
I have a good friend, a metadata librarian, who worked for a long time in a photo archive where she was cataloging stuff in an online catalog, and at one point it was her job to place descriptions of this one photo archive online. The archive was entirely of pictures of Ulysses S. Grant. She said the fascinating thing about the descriptions was that they couldn't just lift them directly because they all said things like, 'On a horse'. And of course, because it was in the Ulysses S. Grant collection, everybody knew that it was Ulysses S. Grant on a horse.
Eric
Right.
Tom
But once you lifted it into a larger cataloging regime, you had to start thinking in terms of those things, of supplying that context along with it. And we've got an awful lot of data out there that is context dependent in those kinds of ways. And it's going to be a big job.
Eric
It seems like an interesting content strategy challenge in a lot of ways. Just spitting out the caption "on a horse" it's like, "Great, who's on a horse?"
Tom
Yeah, exactly. You can look around and find some people who are dealing with it very successfully, of course the projects everybody points to are Freebase and DBpedia and stuff like GeoNames. But I actually think the New York Times is doing fantastic work with that based on stuff that they've been doing essentially since the 1860s. Really just a very, very, very solid transformation of this kind of stuff they've done historically to answer the question every reporter has, which is, "What has the New York Times said about this subject previously?"
Eric
Right.
Tom
They've had that problem for 150 years. And they've solved it one way, and now they're essentially solving it in linked data space.
Eric
[00:59:26] Huh, very interesting. So, are there any other comparisons that strike you between now and then, maybe when it comes to browsers, or just in general?
Tom
I don't know, I'm sort of with where Jen was a few minutes ago, I have some sense that browser technology is kind of, I guess I've thought this for a long time, and maybe it's just the lament of somebody who on some level wishes that it were within the grasp of one individual to write a browser, I don't know. But it does seem to me that browsers have gotten so feature intensive that they are sort of becoming Microsoft Word, in that there are many, many, many features there that people may or may not use. And they're not very fast. And they're sort of ponderous. If I'm allowed to express preferences, I like Chrome because it is simple and clean in many ways. I don't think that's true of everything. We've had this constant tension, and this is something that has not changed, between people who want to capture the browser space because they see that as being able to put their frame around the content that everyone is viewing, and of course there's commercial value in that, versus people who simply want to build an information access tool.
Eric
Yeah. That's...yeah, I can hardly agree more, actually.
Tom
And I think that Jen's right, I think we are coming back around to an era in which people are starting to realize, "Hey, wait a minute, this really is an information access tool and we probably shouldn't bitch it up with a lot of stuff that's extraneous to that process."
Eric
Yeah.
Tom
There's more of that thinking. I suspect that the tablet space is going to have a little bit of an effect that way, because tablets are a much more personal device in many ways. And they're also real estate limited. One of the things I used to say when images were first coming in was, "A lot of people think a picture is worth 1,000 words, and given the bandwidth that it takes, it'd better be."
Eric
Nice.
Tom
I think we may come back around to that way of thinking just a little bit more than we have.
Eric
[01:02:06] Huh. Well, I think that's a great place to wrap this up, although I would love to spend another hour. We never did quite find a way to work in the anecdote about the elephant and the leopard, so that the live people get a little treat.
Tom
We can do that later, that's a show biz story though.
Eric
Oh, ok. Fantastic. Well, Tom, I'd like to thank you very much for joining us. It's been a pleasure.
Tom
Oh, it's been delightful.
Eric
And I've learned a ton.
Tom
Well, always happy to help out. Thanks.
Jen
And if people want to see your work, Tom, and follow you. Are you on Twitter? What websites should they go to?
Tom
I am on Twitter as @trbruce, and you can find us at www.law.cornell.edu where we have been since 1992.
Jen
Nice. And I want to say thanks to Shutterstock and the In Control Conference for sponsoring the show today and making it happen. Making it possible. Eric as always, thank you so much, and people should check the schedule at 5by5.tv/schedule to see when we'll be live, if you want to listen to us live, or of course subscribe in iTunes blah-di-blah if you want to download the show in the future. Thanks everybody!

Show Notes