internet & web

You may be aware that Tim Berners-Lee (aka Tim BL), a computer scientist at CERN, proposed what would become the World Wide Web in 1989.

The initial proposal is a quick and interesting read if you like that kind of thing.

A somewhat modest proposal on the whole, it ends with the bold proclamation:

“We should work toward a universal linked information system, in which generality and portability are more important than fancy graphics techniques and complex extra facilities.”

This document aims to answer the question “how did we wind up with the web we have today?”

The Internet

It’s useful to start with an important distinction.

web != internet

The internet is a global network, made up of physical cables and wireless signals. It is the manner in which a signal can travel between my computer and yours, through many others along the way. Whether it is you reading this text, or us chatting on a Zoom call, the internet is to thank.

You may be aware of IP addresses. IP, or Internet Protocol, forms the basis of the internet. Computers inside your home or office form a small network that is interconnected with countless other networks via the internet.

An IP address works like a phone number: either a local number that lets computers inside your house talk to one another, or a global number that lets computers reach one another all over the world.

A related protocol, DNS, works like a phone book, mapping domain names like view-source.org to IP addresses.
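
To make the phone book analogy concrete, here is a minimal sketch (Python standard library only, assuming a working network connection) of asking DNS for the address behind the domain name mentioned above. The address printed will vary by network and over time.

```python
# A minimal sketch: asking DNS to translate a domain name into an IP address,
# using only the Python standard library. The address printed will vary
# by network and over time.
import socket

# The "phone book" lookup: domain name -> IP address.
ip = socket.gethostbyname("view-source.org")
print("view-source.org resolves to", ip)
```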

4-layer model

There are a few different models of how the various internet protocols stack on top of one another.

A useful model for understanding the web is a four-layer model often referred to as the TCP/IP or DoD model.

Layer        Protocols       Description
Link         Ethernet/WiFi   Physical connections between machines.
Network      IP              Addressing and routing of packets between remote machines.
Transport    TCP/UDP         End-to-end delivery of data between programs (reliable with TCP, best-effort with UDP).
Application  HTTP/SMTP       Application-level protocols for things like the web, email, and video.

Protocols

The internet relies on protocols.

Protocols allow multiple computers to communicate via agreed-upon convention.

If you’ve ever heard people speak over a walkie-talkie, you’ll know they use certain carefully chosen words to indicate common needs.

Alice: “Base, this is Alice, over.”

Base: “Go ahead, Alice, over.”

Alice: “Reached Springfield, heading to Shelbyville, over.”

Base: “Copy, Alice. Keep an eye on traffic updates, over.”

Alice: “Will do, Base. Alice out.”

Base: “Safe travels, Alice. Base out.”

This is in many ways similar to the protocols that power the internet. Each line of dialog is delivered concisely and ends with a clear sign-off. This ensures that if there is difficulty communicating, the other party will know the message was truncated.

Additionally, the two take special care to introduce themselves first. This initiates the conversation: the person on the other end may be distracted, so if Alice gave her update without the “Base, this is Alice, over” before it, she would not know that Base was listening.

While speaking, they also use specific language, and may even replace certain words with others that are harder to mishear.

The NATO phonetic alphabet (Alpha, Bravo, Charlie, Delta, etc.) is used to pronounce letters in situations where they may be misunderstood.

Similarly, there are many places in the HTTP and HTML specifications where special care is taken to avoid misinterpreted characters:

Since the character < typically begins an HTML tag like <div>, anyone who needs to display that character literally writes it as &lt;. (Note: to display a literal & one must write &amp;, since & denotes that what follows has special meaning.)

URLs restrict the available characters as well, so if a URL has a space in it, the space is replaced with %20; other characters have similar %xx encodings.
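
As a small illustration, here is a sketch of both escaping rules using Python’s standard library; the input strings are made up for the example.

```python
# A small sketch of the two escaping rules described above,
# using the Python standard library. The input strings are made up.
import html
import urllib.parse

# HTML: < and & have special meaning, so literal uses are escaped.
print(html.escape("x < y & z"))            # x &lt; y &amp; z

# URLs: disallowed characters (like the space) are percent-encoded.
print(urllib.parse.quote("my page.html"))  # my%20page.html
```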

The Web

Then we have the web.

In 1989 the internet already existed; email and even real-time chat[1] predate the web.

The result of Tim BL’s 1989 proposal was HTTP and HTML.

HTTP is a protocol designed for specialized programs (browsers) to request documents from servers.

HTML is a markup language for those documents that allows styling and, most importantly, cross-linking.

(See the links for more detail.)

If IP lets computers call one another up, HTTP is the special language they use to communicate once they’re on the phone.
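
To make that analogy concrete, here is a hedged sketch in Python: opening the TCP connection is the “call,” and the plain text sent over it is the HTTP “language.” example.com is used as a stand-in server, and real-world code would normally use a higher-level HTTP library instead.

```python
# A minimal sketch of the analogy: TCP/IP places the "call",
# HTTP is the language spoken once connected.
# example.com is a stand-in server; real code would normally use an HTTP library.
import socket

# Place the call: resolve the name and open a TCP connection to port 80.
conn = socket.create_connection(("example.com", 80))

# Speak HTTP: a request is just structured plain text.
request = (
    "GET / HTTP/1.1\r\n"
    "Host: example.com\r\n"
    "Connection: close\r\n"
    "\r\n"
)
conn.sendall(request.encode("ascii"))

# The reply is in the same language: a status line, headers, then the HTML document.
response = b""
while chunk := conn.recv(4096):
    response += chunk
conn.close()

print(response.decode("utf-8", errors="replace")[:300])
```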

Beyond the Browser

It used to be that we could say the web is everything you do in your browser: Chrome, Edge, Firefox, etc.

Today that’d meet a few objections:

Strictly speaking, on a protocol basis, email, video chat, and music streaming do not take place over HTTP. On a pedantic level, then, they are not the web despite being in the browser.

A lot of things outside the browser seem like part of the web, and indeed use HTTP. Modern applications rely on APIs that transmit JSON, images, video, etc. via HTTP.

There are other protocols, like Gemini, that are used to link together documents. Some would argue they too are part of the web.

Our Definition

I’ll take the position here that the web is no longer defined purely by protocol.

The purpose of the web is to build an interconnected library of information.

Sites that use HTTP for other purposes, like sending private messages, are not part of the web despite using the protocol.

Similarly, sites that don’t use the protocol but are interoperable with the idea of the web arguably are.

Evolution of the Web

Web 1.0

Initially conceived as a network of ideas, a place for researchers to share information with one another, the web quickly became a collection of personal sites showcasing people’s unique personalities. If you loved The Simpsons and enjoyed painting watercolors, you could spend an afternoon putting together a small site with a page that had your favorite quotes and a few scans of your recent paintings.

From the early 1990s until the mid-2000s, the web was mostly made up of web pages (as opposed to the more general website or the modern web application). Each site was a collection of documents: mostly text, but occasionally images, audio, and/or video.

Of course, such sites still exist; a particular page represents a particular piece of content. The specific content can change, but the purpose of the URL does not.

This is a fundamental idea, documented in the essay Cool URIs don’t change by Tim BL.

This means other pages can link to them and expect the content to be there. This makes the web a web.

Search Engines

As the web grew, the need to discover content did as well.

Of course, as we know, by the mid-2000s Google dominated the search industry.

Google’s success was based upon the PageRank algorithm. Simply put, the more incoming links a page had (especially links from other highly ranked pages), the higher its value to the web.
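
As a rough illustration of the idea (a toy sketch, not Google’s actual implementation), the snippet below scores a made-up link graph: each page repeatedly passes its score along its outgoing links, so pages with more, and better-ranked, incoming links end up on top.

```python
# A toy sketch of the idea behind PageRank (not Google's actual implementation).
# The link graph below is made up for illustration: keys link to the pages they list.
links = {
    "a.example": ["b.example", "c.example"],
    "b.example": ["c.example"],
    "c.example": ["a.example"],
    "d.example": ["c.example"],
}

damping = 0.85
pages = list(links)
rank = {page: 1 / len(pages) for page in pages}

# Repeatedly redistribute each page's score across its outgoing links.
for _ in range(50):
    new_rank = {page: (1 - damping) / len(pages) for page in pages}
    for page, outgoing in links.items():
        share = rank[page] / len(outgoing)
        for target in outgoing:
            new_rank[target] += damping * share
    rank = new_rank

# c.example has the most incoming links, so it ends up ranked highest.
for page, score in sorted(rank.items(), key=lambda kv: -kv[1]):
    print(f"{page}: {score:.3f}")
```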

This meant that links increasingly became signals for search engines rather than paths for humans to follow.

The legitimate practice of “Search Engine Optimization” refers to tailoring a site’s presentation to make sure that search engines can properly index its content. The monetary value of a highly ranked page, however, spawned an entire shadow industry of SEO hustlers. Countless shady services promise to boost your site’s rankings by generating inbound links. If you aren’t sure what “generating inbound links” means: it means spamming other sites’ comments, setting up fake sites that link to yours, and all sorts of other unethical practices that are damaging to the web.

These are unintended consequences of how search engines typically work, and the major players like Google and Microsoft are constantly battling the worst offenders. At the same time, they are increasingly making the field obsolete by selling the top spots in search results to the highest bidder. Who cares how many of your peers link to your website if the top spots can be bought?

This is the reason search quality has deteriorated so much in recent years. The onslaught of AI-generated content only exacerbates this.

Web 2.0 & “User-Generated Content”

Today, we know that isn’t how many sites work. Visiting facebook.com means an ever-changing carousel of information, mostly designed to drive a type of addiction that we’re finally taking seriously.

Let’s look at how we got there.

You may remember the term “Web 2.0”, which broadly refers to a shift in the web from being published primarily as static sites to more and more “user-generated content.”

User-Generated Content is an interesting phrase. One might hear it and think the alternative is “Company-generated content” and view UGC as the more democratic option.

But what came before Web 2.0 was not a corporate internet. Of course, corporations had websites; eBay and Amazon existed. But individuals were all over the internet too, and they created so much content that search engines became necessary.

The difference was that they did so on personal sites, sometimes provided by schools or employers, but often as a completely personal endeavor: a server at home, or space on a hosting provider, which, depending on the quality of service, could be had for free or for a small monthly fee.

I don’t mean to suggest the idea itself was sinister; the goal of many of us working at that time was to bring more voices into the fold.

Doing that meant lowering technical barriers, and the easiest solution to that was to centralize things.

Blogging platforms like Blogger and LiveJournal meant that people could start publishing thoughts and ideas online without learning HTML or messing around with web hosting, at the cost of a bit of control. This was also the golden age of online forums, mostly small niche communities focused on a given topic. While these platforms had their problems (trolls have existed on the internet from day one), both the small size and the unified interests of each community served as a meaningful check on abuse. There were lots of choices, so if a community became particularly toxic, a small subset of users could quite easily found their own and enforce their desired norms.

By the mid-2000s, in the name of further lowering the technical barriers, sites like Myspace, Facebook, and Twitter became where most people were spending their time.

These sites have a critical difference. When you post on a blog, whether on your own domain like view-source.org or on a hosted site like something.wordpress.com, anyone can interact with your content. That means if I find it interesting, I can share it with people, link to it from my own site, and perhaps even write my own post inspired by it. It is truly part of the web.

But on Facebook and Twitter, and more and more sites, this isn’t the case. A user needs to log in. This partially breaks the web: users now need an account on Facebook, or Medium, or Substack, or … to read your ideas. And there are a lot of good reasons not to want that, or to simply decide not to take the time to create yet another account.

The term for this is walled garden.

When we refer to users posting content on these sites as user-generated content, we might more correctly think of it as corporate-captured content.

Of course, there may be times when you wish to publish content only to a select group of people.

These sites are perhaps better suited for that purpose, though safer alternatives exist that do not rely on selling your private information to advertisers.

Web3 (is bullshit)

I’m not going to spend much time on this except to say that the term Web 3/3.0 is used to refer to cryptocurrency-related technologies that have nothing to do with the goals of the web.

If you are interested in Web 3, I’d check out Molly White’s Web 3 is Going Just Great.

Those of us who care about the web continue to reject this attempt to hijack it as a marketing term.



[1] IRC was introduced in 1988.

