How the Internet Works

One of my favorite questions to ask (or be asked) during an interview is the classic “how does the Internet work?” question. It usually goes something like this:

You open your favorite web browser, type in “www.mysite.com”, and hit return. Almost like magic, a fully-rendered web page shows up on your screen ready for you to view. Tell me, with as much detail as you can muster, what just happened.

The reason I like this question so much is that it isn’t just academic; it is a peek behind the curtain at what this person knows. It reveals what they’ve dug into in the past, learned in school, or dealt with while troubleshooting. The details show how much time they’ve spent demystifying the world (at least the parts related to our working environment). Thinking about this made me realize that really fleshing out a solid answer would make a great blog post, so here we are.

The tempting answer is the simple one: focus on the web browser speaking HTTP to a web server. While this is arguably the most important part of the answer, it glosses over a ton of details. Every component in that brief answer is dripping with opportunities to ask “how does that work?” and probably “tell me more about that” a few times too.

Setting a Maximum Depth

This topic, in the hands of a truly knowledgeable person, will probably take much longer than most interviews allow. When I hit “return” on my keyboard, we could dig into how keyboards interface with computers, detour into electrical engineering to discuss how keyboard contacts work, or trace how keyboard input makes its way to applications like our hypothetical web browser. All of these are almost certainly more detail than any interviewer would want to hear. This isn’t to say they’re not worth learning about, but I’ll skip these details.

Likewise, I won’t discuss much that happens at the physical layer. I’m willing to breeze past the line coding and electromagnetic signaling for network communication. I won’t touch on how CPUs or other computer components function either. This isn’t a blog post about how computers work, though the basics of how they work really is fascinating and worth learning. If you want to see a physical computer made out of dominos, check out this excellent Numberphile video.

Constructing network packets should be fair game but is probably too advanced unless there’s a need to really emphasize networking. For me, asking someone to construct a packet on a whiteboard is torturing both of us. No hex editors required for this post, don’t worry. I also won’t get into how GUI applications work. And I’m going to focus only on Linux, though most details should be roughly the same for other operating systems.

All that preamble aside, nearly every subheading in this post could be even more detailed and/or be an entire post unto itself. The goal, though, is to strike a good balance between providing relevant details and providing links for more information. I hope I walk this line well enough to keep you interested.

The Web Browser

The web browser seems like a decent place to start. As a user space application, it isn’t terribly different from most other applications. The choice of web browser does matter, though, as it impacts how DNS works. Firefox, at least in the US, uses DNS over HTTPS by default. Kudos to them for finally encrypting DNS for us; I hope other browsers do the same. Chrome is also known to do unusual things with DNS resolution, such as pre-fetching, allegedly using Google’s IPv6 resolvers, and allowing extensions to completely change how DNS works. I’ll ignore all this DNS weirdness and focus on more traditional OS-based DNS lookups.

Not all browsers respect your OS settings for web proxies either; Firefox, for instance, can use either the system settings or its own manual configuration. Other browsers, like the Tor Browser, communicate over an anonymizing overlay network called Tor. As with DNS, I’ll mostly ignore these and focus on the basics.

No matter which browser we imagine using, the first thing it does (at least for web sites running on some other machine) is retrieve the site’s content. To do this, it needs to connect to the server hosting the site, which requires figuring out that server’s IP address.

Name Resolution

Not all processes on Linux behave exactly the same way for DNS resolution. There is, however, a POSIX-standardized API for name resolution, getaddrinfo(3), which is provided by glibc and musl on Linux and by other Unix-like operating systems. Most applications use this approach to name resolution unless they implement their own DNS client.

Applications that do use the getaddrinfo approach to DNS resolution leverage your OS’s Name Service Switch (NSS) and its API. This means the contents of your nsswitch.conf impact how names are looked up. In the end, name resolution boils down to the application calling the getaddrinfo C function provided by the OS’s libc (usually glibc). You might be tempted to say “wait, my application isn’t written in C”, but most other languages at some point make “native” calls (here’s a guide for how Java does this).
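
For example, the hosts line of nsswitch.conf on many Linux systems looks something like this, meaning /etc/hosts (“files”) is consulted before DNS:

    hosts: files dns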

Assuming you don’t have an /etc/hosts file entry for your site, this getaddrinfo call will contact DNS servers and retrieve the address(es) of the servers you’d like to contact. For the gory details on exactly how this interaction works, check out this blog post.
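
For the curious, here is a minimal sketch of that call in C, using the hypothetical hostname from this post. It resolves a name the way most applications do, honoring /etc/hosts and nsswitch.conf:

    /* resolve.c - a minimal getaddrinfo(3) sketch.
       Build: cc resolve.c -o resolve */
    #include <stdio.h>
    #include <string.h>
    #include <sys/types.h>
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <arpa/inet.h>
    #include <netdb.h>

    int main(void) {
        struct addrinfo hints, *res, *p;
        memset(&hints, 0, sizeof hints);
        hints.ai_family = AF_UNSPEC;     /* IPv4 or IPv6 */
        hints.ai_socktype = SOCK_STREAM; /* we want a TCP connection */

        int err = getaddrinfo("www.mysite.com", "https", &hints, &res);
        if (err != 0) {
            fprintf(stderr, "getaddrinfo: %s\n", gai_strerror(err));
            return 1;
        }
        for (p = res; p != NULL; p = p->ai_next) {
            char buf[INET6_ADDRSTRLEN];
            void *addr = (p->ai_family == AF_INET)
                ? (void *)&((struct sockaddr_in *)p->ai_addr)->sin_addr
                : (void *)&((struct sockaddr_in6 *)p->ai_addr)->sin6_addr;
            printf("%s\n", inet_ntop(p->ai_family, addr, buf, sizeof buf));
        }
        freeaddrinfo(res);
        return 0;
    }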

What about DNS?

Yeah, the whole Name Resolution section there didn’t really cover how DNS works, I know. Let’s cover that. When getaddrinfo talks to a DNS server, it will probably be the DNS server on your local network. Usually, this is your home router/modem/wifi/all-the-things box. This device probably runs a tiny process like dnsmasq with the purpose of forwarding DNS queries to upstream recursive nameservers and caching the results. Your machine talks to this nearby DNS caching server and asks it what the IP of “www.mysite.com” is. Specifically, it queries the DNS server for the “A record” for “www.mysite.com.” (notice the extra “.” at the end). If this DNS server knows the answer (meaning it has cached a previous lookup), it’ll return the cached result and a modified Time to Live (TTL) showing how much longer that cached item is valid.

If the DNS server doesn’t know the answer, it forwards the query to one (or all) of its upstream servers. Each upstream goes through the same cache lookup. Assuming it doesn’t know the answer either, it talks to a root DNS server (that final “.” in the name) and asks, via an “NS record” query, which server(s) are authoritative for “com.”. It then asks one of those servers which server(s) to talk to for “mysite.com.”. Finally, one of those servers is asked the original question about “www.mysite.com.”, and the answer will either be an “A record” (a list of IPv4 addresses) or a “CNAME record”, meaning it points to another name. Note that CNAME records may not lead to your machine making multiple requests as you might expect; your original request was for an “A record”, so the recursive server you asked should perform the follow-up lookup for you.
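
If you want to watch that delegation walk yourself, dig +trace does it interactively. And as a rough sketch, this is what issuing the underlying “A record” query looks like in C with libresolv, skipping /etc/hosts and NSS entirely (error handling trimmed):

    /* query.c - query the A record of "www.mysite.com." via libresolv.
       Build: cc query.c -lresolv -o query */
    #include <stdio.h>
    #include <netdb.h>
    #include <netinet/in.h>
    #include <arpa/nameser.h>
    #include <arpa/inet.h>
    #include <resolv.h>

    int main(void) {
        unsigned char answer[NS_PACKETSZ];
        int len = res_query("www.mysite.com.", ns_c_in, ns_t_a,
                            answer, sizeof answer);
        if (len < 0) {
            herror("res_query");
            return 1;
        }
        ns_msg handle;
        ns_initparse(answer, len, &handle);
        for (int i = 0; i < ns_msg_count(handle, ns_s_an); i++) {
            ns_rr rr;
            /* The answer section may mix CNAME and A records. */
            if (ns_parserr(&handle, ns_s_an, i, &rr) == 0 &&
                ns_rr_type(rr) == ns_t_a) {
                char ip[INET_ADDRSTRLEN];
                inet_ntop(AF_INET, ns_rr_rdata(rr), ip, sizeof ip);
                printf("A %s (TTL %u)\n", ip, (unsigned)ns_rr_ttl(rr));
            }
        }
        return 0;
    }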

[Figure: dns-workflow — the DNS resolution process described above]

This entire interaction happens over the network (usually over UDP), which I’ll describe shortly.

Establishing a Connection

Now that we’ve established how DNS works, the web browser can attempt a connection. To do so, it first creates a socket via the socket(2) system call, then initiates a connection via connect(2). This establishes a TCP connection to our site. But here is where, if I’m conducting the interview, I might ask something about routing. Maybe I’d phrase it like “tell me how that connection makes it to the server” and see where things go.
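
In code, those two system calls look roughly like this (a sketch reusing getaddrinfo from earlier, with error handling trimmed):

    /* connect.c - socket(2) + connect(2) sketch.
       Build: cc connect.c -o connect */
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/socket.h>
    #include <netdb.h>

    int main(void) {
        struct addrinfo hints, *res;
        memset(&hints, 0, sizeof hints);
        hints.ai_family = AF_UNSPEC;
        hints.ai_socktype = SOCK_STREAM;
        if (getaddrinfo("www.mysite.com", "443", &hints, &res) != 0)
            return 1;

        int fd = socket(res->ai_family, res->ai_socktype, res->ai_protocol);
        if (fd < 0 || connect(fd, res->ai_addr, res->ai_addrlen) < 0) {
            perror("connect");
            return 1;
        }
        /* By the time connect() returns, the TCP three-way handshake
           (SYN, SYN-ACK, ACK) has already completed in the kernel. */
        close(fd);
        freeaddrinfo(res);
        return 0;
    }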

First, local packet routing decisions are handled in the kernel. Honestly, unless you plan on doing some kernel development, knowing exactly how this works is probably overkill. A brief, mere-mortal explanation is that the kernel consults the route table to determine which interface the packet should leave through, and evaluates iptables rules for things like NAT and firewalling. If the kernel knows which interface to use and is allowed to send the packet, it does. Since this packet is destined for a remote network, it must be routed, so Linux uses the Address Resolution Protocol (ARP) to figure out the MAC address of the gateway router. That MAC address is used to craft an Ethernet frame containing our IP packet, which can then be sent out the interface, where the switching/networking infrastructure takes over.

Network Infrastructure

Switches use a MAC address table to determine which port to send an incoming frame out of. This includes uplinks/connections to other switches. If the destination MAC address is already in the table, the switch forwards the frame to the appropriate interface. If it is not (this is the first frame destined for that device in a while), the frame is “flooded”, meaning it is sent to all interfaces other than the one it was received on. Switches “learn” MAC addresses by recording the source address of each frame they see, so the reply to a flooded frame teaches the switch where that device actually lives. Here is a good overview of this process.

Once the packet arrives at the router, the router makes a routing decision. While determining where to send a packet might be more complicated for routers (given routing protocols like OSPF, BGP, or even MPLS), the process is ultimately similar to what happened on our local Linux machine. We can skip forward to the point where the packet reaches our destination. I’m also going to skip over the TCP three-way handshake. I know this may upset some networking professionals out there, but for the purposes of this post I’ll assume that if the first packet makes it, the rest will too (and that packets can return too).

CDNs and Other Web Infrastructure

Now the question is: what is our destination anyway? For most big web sites, the destination isn’t actually a webserver. For most sites I deal with, it’ll actually be a Content Delivery Network (CDN). The CDN will do things like edge caching and geographic distribution, and may even have features that protect my servers, like DDoS mitigation. In some cases, the CDN may even do some of the work of generating content through edge computing. In many scenarios, the request never needs to hit a “real” server; everything might be served up directly from the CDN.

If the CDN doesn’t have what it needs, it will contact its origin. This might be a webserver, but is more likely a security device like a Web Application Firewall (WAF), which uses rules and/or heuristics to block attackers that might exploit or overload my web or application servers. If the request makes it through the WAF, it’ll probably hit a load balancer of some kind, which keeps my site highly available by spreading requests across multiple web servers, possibly in different datacenters. This load balancer might have rules to determine which webserver receives the request, or it might use a simple algorithm like round-robin.

What about SSL?

Each hop between the above pieces of web infrastructure probably has its own TLS connection. Our web browser to the CDN is one connection, CDN to WAF is a different connection, WAF to load balancer another, and finally load balancer to the web servers might be yet another. Each of these connections could vary slightly in specific algorithms, exact protocol versions, and so on, but for the most part they’re pretty similar. Our browser, after establishing a TCP session with the CDN, performs what is called a TLS handshake. When our browser (the client) sends a ClientHello message to start this handshake, a couple of interesting things can happen. A few TLS extensions, available in most modern browsers, have revolutionized web hosting, most notably Server Name Indication (SNI) and Application-Layer Protocol Negotiation (ALPN).
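
As a sketch of how a client sets those two extensions on the ClientHello, here is roughly what it looks like with OpenSSL, assuming fd is an already-connected TCP socket like the one from earlier (certificate verification setup omitted for brevity):

    /* tls.c - TLS handshake sketch with SNI and ALPN via OpenSSL.
       Build: cc tls.c -lssl -lcrypto */
    #include <openssl/ssl.h>

    SSL *tls_handshake(int fd) {
        SSL_CTX *ctx = SSL_CTX_new(TLS_client_method());

        /* ALPN: offer HTTP/1.1. Each protocol entry is length-prefixed. */
        static const unsigned char alpn[] = "\x08http/1.1";
        SSL_CTX_set_alpn_protos(ctx, alpn, sizeof alpn - 1);

        /* Real code would also enable certificate verification, e.g.
           SSL_CTX_set_default_verify_paths() and SSL_set_verify(). */
        SSL *ssl = SSL_new(ctx);
        SSL_set_fd(ssl, fd);

        /* SNI: name the virtual host we want, so a server hosting many
           sites on one IP can present the right certificate. */
        SSL_set_tlsext_host_name(ssl, "www.mysite.com");

        if (SSL_connect(ssl) != 1) { /* sends the ClientHello */
            SSL_free(ssl);
            SSL_CTX_free(ctx);
            return NULL;
        }
        return ssl;
    }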

Servers present certificates containing their public keys, signed by some certificate authority, along with a chain of intermediate certificates used to establish trust with that authority from some common, trusted root. This is how PKI and public-key cryptography work on the web. Exactly what happens next varies based on the TLS version and the cipher suite, but in modern configurations (and always in TLS 1.3), an ephemeral Diffie-Hellman key exchange establishes the keys for that session. From here, HTTP takes over.

The Anatomy of an HTTP Request

A Request

HTTP (at least through version 1.1) is human-readable, which makes it really easy to describe in clear terms. A simple, manually-created request might look like this:
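
    GET / HTTP/1.1
    Host: www.mysite.com
    User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101 Firefox/68.0
    Accept: */*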

That final empty line is important; it lets the server know you’re done with your request.

  • The first line (and really the only required line aside from the empty one) tells the server you’re performing a GET request (a request that has no “body”), you’re looking for the root (/), and you’re speaking HTTP version 1.1.
  • The second line tells the server what hostname you’d like to communicate with. This is useful for virtual hosting (meaning you’re running more than one site on your server).
  • The third line tells the server about your client. Some servers manipulate content to make it work better for your specific browser (Internet Explorer tends to have lots of these tricks). Servers also track this information so they know which types of browsers visit their site most.
  • The final line describes what type of content our client supports or expects. In this case, we’re saying we’ll take whatever the server gives us. When communicating with an API, we might change this to Accept: application/json.

There are a lot of other common headers.

A Response

If the server can handle the request, it might respond with something like this:
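
    HTTP/1.1 200 OK
    Server: nginx
    Date: Sun, 01 Mar 2020 18:30:00 GMT
    Content-Type: text/html
    Content-Length: 71
    Last-Modified: Sat, 29 Feb 2020 15:00:00 GMT
    ETag: "5e5a8c3c-47"

    <html>
    <head><title>My Site</title></head>
    <body>Hello!</body>
    </html>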

  • The first line lets us know that the server accepted our request. A 200 OK means everything is fine with the GET request.
  • The second line tells us about the server software providing the response. Sometimes this is obfuscated or not sent at all (for security).
  • The third line is just the current date from the server’s perspective. This is useful to know when the server constructed and sent the response.
  • The fourth line tells us the type of content the server is sending. This is the server-side equivalent of the Accept header in our request.
  • The fifth line tells us how much data the server will be sending in the body of the response.
  • The next two lines are used for caching. They tell us when the content was last modified and provide a checksum of sorts (the ETag) that our client can hand back later to ask whether the content has changed; if the ETag matches, the body we already have should still be good (see the sketch just after this list).
  • Everything after the empty line is called the response body, which is the stuff we actually wanted from the server. This is the HTML content of the site.
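
As promised above, here is a sketch of that revalidation on a later visit; if nothing changed, the server can skip re-sending the body entirely:

    GET / HTTP/1.1
    Host: www.mysite.com
    If-None-Match: "5e5a8c3c-47"

    HTTP/1.1 304 Not Modified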

Multi-tier Applications

Once our request makes it to the webserver, this server might have everything it needs to fulfill it. The above example was a very simple static site, so it is likely that Nginx could handle everything. For more complicated sites, however, the webserver might be responsible for things like images, CSS, and JavaScript, while the HTML content comes from a backend server. That backend might be a Content Management System (CMS) like AEM or WordPress (which this site runs on), or an application server written in just about any programming language. In any case, there might be multiple layers or tiers required to produce the site we’re visiting.
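
As a sketch of that split, a simplified Nginx configuration (hypothetical paths and backend name) might serve static assets itself and proxy everything else to the application tier:

    # nginx.conf fragment (TLS directives omitted for brevity)
    server {
        listen 80;
        server_name www.mysite.com;

        # Static assets are served straight from disk by Nginx.
        location /static/ {
            root /var/www/mysite;
        }

        # Everything else goes to the backend (a CMS or app server).
        location / {
            proxy_pass http://app-backend:8080;
            proxy_set_header Host $host;
        }
    }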

This blog, for instance, also leverages a database server. I’m using MariaDB for that, but there are many others. Some sites may take a lot of CPU or time to generate content, so they can use a caching server like Redis to store expensive content and make retrieving it much faster. Larger sites might have many of these components and include communication between several servers or even calls to entirely different domains.

Are We Done?

No, not really. This just retrieves the initial HTML. That HTML will contain links to various types of media that need to be downloaded, like CSS, JavaScript, images, possibly videos, analytics resources, and more. Each of these items repeats much of this process, though DNS caching and persistent (keep-alive) connections let browsers skip some steps.

Phew, that was quite a walkthrough. I hope this was valuable for someone. If you have questions or if you think I missed something important, please comment and let me know. Thanks for reading!
