One of my favorite questions to ask (or be asked) during an interview is the classic “how does the Internet work?” question. It usually goes something like this:
You open your favorite web browser, type in “www.mysite.com”, and hit return. Almost like magic, a fully-rendered web page shows up on your screen ready for you to view. Tell me, with as much detail as you can muster, what just happened.
The reason I like this question so much is that it isn’t just academic; it’s a peek behind the curtains at what this person knows. It reveals what they’ve dug into in the past, learned in school, or dealt with while troubleshooting. The details show how much time they’ve spent demystifying the world (at least the parts related to our working environment). Thinking about this made me realize that really fleshing out a solid answer would make a great blog post, so here we are.
The tempting answer is the simple one: focus on the web browser speaking HTTP to a web server. While this is arguably the most important part of the answer, it glosses over a ton of details. Every component in that brief answer is dripping with opportunities to ask “how does that work?” and probably “tell me more about that” a few times too.
Setting a Maximum Depth
This topic, in the hands of a truly knowledgeable person, will probably take much longer than most interviews allow. When I hit “return” on my keyboard, we could go into how keyboards interface with computers, or electrical engineering and talk about how keyboard contacts work, or how keyboard input makes its way to applications like our hypothetical web browser. All of these are almost certainly more detailed than any interviewer would want to hear. This isn’t to say that they’re not worth learning about, but I’ll skip these details.
Likewise, I won’t discuss much that happens at the physical layer. I’m willing to breeze past the line coding and electromagnetic signaling for network communication. I won’t touch on how CPUs or other computer components function either. This isn’t a blog post about how computers work, though the basics of how they work really is fascinating and worth learning. If you want to see a physical computer made out of dominos, check out this excellent Numberphile video.
Constructing network packets should be fair game but is probably too advanced unless there’s a need to really emphasize networking. For me, asking someone to construct a packet on a whiteboard is torturing both of us. No hex editors required for this post, don’t worry. I also won’t get into how GUI applications work. Finally, I’m only going to focus on Linux for this post, though most details should be roughly the same for other operating systems.
All that preamble aside, nearly every subheading in this post could be even more detailed and/or be an entire post unto itself. The goal though is to try to strike a good balance between providing relevant details and providing links for more information. I hope I walk this line well enough to keep you interested.
The Web Browser
The web browser seems like a decent place to start. It, as a user space application, isn’t terribly different from most other applications. The choice of web browser does matter though, as it impacts how DNS works. Firefox, at least in the US, uses DNS over HTTPS by default. Kudos to them for finally encrypting DNS for us; I hope other browsers do the same. Chrome is also known to do unusual things regarding DNS resolution, such as pre-fetching, allegedly using Google’s IPv6 resolvers, and allowing extensions to completely change how DNS works. I’ll ignore all this DNS weirdness and focus on more traditional OS-based DNS lookups.
Not all browsers respect your OS settings for web proxies either; Firefox allows both using system settings and manual configurations. Other browsers like the Tor Browser communicate over an overlay network called Tor for anonymity. As with DNS, I’ll mostly ignore these and focus on the basics.
No matter which browser we imagine using, the first thing it does (at least for web sites running on some other machine) is retrieve the site’s content. To do that, the browser needs to connect to the server, which first requires figuring out the server’s IP address.
Name Resolution
Not all processes on Linux behave exactly the same way for DNS resolution. There is, however, a POSIX-standard API for name resolution called getaddrinfo(3), which is available in glibc for Linux, in musl, and in other unix-like operating systems. Most applications will use this approach to name resolution unless they implement their own DNS API.
For applications that do use the getaddrinfo approach to DNS resolution, they’ll leverage your OS’s Name Service Switch (NSS) and its API. This means the contents of your nsswitch.conf impacts how names are looked up. In the end, name resolution works by the application calling the getaddrinfo C function provided by the OS’s libc (usually glibc). You might be tempted to say “wait, my application isn’t written in C”, but most other languages at some point make “native” calls (here’s a guide for how Java does this).
Assuming you don’t have an /etc/hosts file entry for your site, this getaddrinfo call will contact DNS servers and retrieve the address(es) of the servers you’d like to contact. For the gory details on exactly how this interaction works, check out this blog post.
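You can poke at this from Python, whose socket.getaddrinfo is a thin wrapper around the libc getaddrinfo(3) call described above (so it honors /etc/hosts and nsswitch.conf too). A quick sketch, using localhost so it works without any network access; swap in “www.mysite.com” to see a real lookup:

```python
import socket

# socket.getaddrinfo wraps the libc getaddrinfo(3) call, so /etc/hosts
# and nsswitch.conf both influence what comes back.
results = socket.getaddrinfo("localhost", 443, proto=socket.IPPROTO_TCP)
for family, type_, proto, canonname, sockaddr in results:
    print(family.name, sockaddr)
```

Each tuple carries everything needed to open a socket to one of the returned addresses, which is exactly why browsers reach for this API.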
What about DNS?
Yeah, the whole Name Resolution section there didn’t really cover how DNS works, I know. Let’s cover that. When getaddrinfo talks to a DNS server, it will probably be the DNS server on your local network. Usually, this is your home router/modem/wifi/all-the-things box. This device probably runs a tiny process like dnsmasq with the purpose of forwarding DNS queries to upstream recursive nameservers and caching the results. Your machine talks to this nearby DNS caching server and asks it what the IP of “www.mysite.com” is. Specifically, it queries the DNS server for the “A record” for “www.mysite.com.” (notice the extra “.” at the end). If this DNS server knows the answer (meaning it has cached a previous lookup), it’ll return the cached result and a modified Time to Live (TTL) showing how much longer that cached item is valid.
If the DNS server doesn’t know the answer, it forwards the query to one (or all) of its upstream servers. This upstream goes through the same cache lookup. Assuming it doesn’t know the answer either, it talks to a root DNS server (that final “.” in the name) and asks for the server(s) authoritative for “com.” via an “NS record” query. It then asks this server which server(s) to talk to for “mysite.com.”. Finally, one of those servers is queried for the original “www.mysite.com.”, and the answer will either be an “A record” (a list of IPs) or a “CNAME record”, meaning it points to another name. Note that CNAME records may not lead to your machine doing multiple requests as you might expect; your original request was for an “A record”, so the recursive server you asked should perform the second lookup for you.
This entire interaction happens over the network (usually over UDP), which I’ll describe shortly.
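True to the no-hex-editors promise above, you don’t need to read raw bytes to get a feel for how small these queries are. Here’s a Python sketch that builds the packet a stub resolver would send over UDP port 53 for an “A record” lookup; the query ID here is an arbitrary example value, and this builds the packet only, without sending it:

```python
import struct

def build_dns_query(name, qtype=1):
    """Build a minimal DNS query packet (qtype 1 = "A record")."""
    # Header: arbitrary ID, flags (recursion desired), QDCOUNT=1, three zero counts
    header = struct.pack(">HHHHHH", 0x1234, 0x0100, 1, 0, 0, 0)
    # QNAME: each label is length-prefixed; the final zero byte is that trailing "."
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    return header + qname + struct.pack(">HH", qtype, 1)  # QTYPE=A, QCLASS=IN

query = build_dns_query("www.mysite.com")
print(len(query), "bytes:", query.hex())
```

The entire question fits in a few dozen bytes, which is a big part of why DNS gets away with plain UDP most of the time.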
Establishing a Connection
Now that we’ve established how DNS works, the web browser can attempt a connection. To do so, a Linux system call is made to first create a socket via socket(2), then a connection is made via connect(2). This creates a TCP connection to our site. But here is where, if I’m conducting the interview, I might ask something about routing. Maybe I’d phrase it like “tell me how that connection makes it to the server” and see where things go.
First, local packet routing decisions are handled in the kernel. Honestly, unless you plan on doing some kernel development, knowing exactly how this works is probably overkill. A brief, mere mortal explanation is that the kernel looks at the route table, determines which interface to send the packet out, and assesses iptables rules and filters for things like NAT and firewall rules. If the kernel knows which interface to send the packet out of and is allowed to, it does. Since this packet will no doubt need to be routed, Linux uses Address Resolution Protocol (ARP) to figure out the MAC address of the router. This MAC address is used to craft an Ethernet Frame containing our IP packet. This frame can then be sent out the interface and the switching/networking infrastructure takes over.
Switches use a MAC address table to determine which port to send the incoming frame to. This includes uplinks/connections to other switches. If the MAC address is already known, it forwards the frame to the appropriate interface. If it is not known (this is the first frame destined for that device in a while), the frame is “flooded”, meaning it is sent to all interfaces other than the one it was received on. This is how MAC addresses are “learned” by switches. Here is a good overview of this process.
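That learn-or-flood behavior is simple enough to sketch in a few lines of Python. This is a toy model of a switch’s MAC table, not real switching silicon, and the port numbers and MAC strings are made-up examples:

```python
class LearningSwitch:
    """Toy model of a switch's MAC address table."""

    def __init__(self, ports):
        self.ports = set(ports)
        self.mac_table = {}  # MAC address -> port it was last seen on

    def handle_frame(self, src_mac, dst_mac, in_port):
        self.mac_table[src_mac] = in_port  # learn where the sender lives
        if dst_mac in self.mac_table:
            return [self.mac_table[dst_mac]]   # known: forward to one port
        return sorted(self.ports - {in_port})  # unknown: flood everywhere else

switch = LearningSwitch(ports=[1, 2, 3, 4])
print(switch.handle_frame("aa:aa", "bb:bb", in_port=1))  # floods: [2, 3, 4]
print(switch.handle_frame("bb:bb", "aa:aa", in_port=2))  # learned: [1]
```

Notice that the second frame doesn’t flood: the switch learned where “aa:aa” lives from the first frame’s source address.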
Once the packet arrives at the router, it makes a routing decision. While the process of determining where to send a packet might be more complicated for routers (given routing protocols like OSPF, BGP, or even MPLS), ultimately the process is similar to what happened on our local Linux server. We can skip forward to the point where a packet will reach our destination. I’m also going to skip over the TCP three-way handshake. I know this may upset some networking professionals out there, but I’m going to assume for the purposes of this post that if the first packet makes it, the rest will too (and that packets can return too).
CDNs and Other Web Infrastructure
Now the question is: what is our destination anyway? For most big web sites, the destination isn’t actually a webserver. For most I deal with, it’ll actually be a Content Delivery Network (CDN). The CDN will do things like edge caching, geographic distribution, and may even have some features to protect my servers like DDoS protection. In some cases, the CDN may even do some of the work of generating content through edge computing. In many scenarios, the request never even needs to hit a “real” server; everything might be served up directly from the CDN.
If the CDN doesn’t have what it needs, it will contact its origin. This might be a webserver, but more likely is a security device like a Web Application Firewall (WAF). This device has rules and/or heuristics to block attackers that might exploit or slow down my web or application servers. If the request makes it through the WAF, it’ll probably hit a load balancer of some kind. This allows my site to be highly available by having multiple web servers in different datacenters. This load balancer might have rules to help determine which webserver to send the request to, or it might use a simple algorithm like round-robin.
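Round-robin really is as simple as it sounds. Here’s a tiny sketch with a hypothetical backend pool (the hostnames are invented for illustration):

```python
import itertools

# Hypothetical pool of web servers behind the load balancer
backends = ["web1.internal", "web2.internal", "web3.internal"]
pool = itertools.cycle(backends)  # round-robin: cycle through in order

for request_number in range(5):
    print(f"request {request_number} -> {next(pool)}")
```

Real load balancers layer on health checks, weights, and session stickiness, but the core scheduling idea is just this loop.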
What about SSL?
Each of the above pieces of web infrastructure probably involves a TLS connection. Our web browser to the CDN is one connection, CDN to WAF is a different connection, WAF to load balancer another, and finally load balancer to the web servers might be another. Each of these connections could have slight variations in how they work for things like specific algorithms, exact protocol versions, etc., but for the most part, they’re pretty similar. Our browser, after establishing a TCP session with the CDN, performs what is called a TLS handshake. When our browser (the client) sends a ClientHello message to start this handshake, a couple of interesting things can happen. A few TLS extensions, available in most modern browsers, have revolutionized web hosting, including ALPN and SNI.
Servers present public keys, signed by some authority, along with a chain of certificates used to establish trust with that authority from some common, trusted root. This is how PKI and public-key cryptography work. Exactly what happens varies based on the TLS version and the cipher suite, but in modern configurations (and always in TLS 1.3), an ephemeral Diffie-Hellman key exchange establishes the session keys. From here, HTTP takes over.
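Python’s ssl module makes some of this visible. The sketch below configures a client context roughly the way a browser would; the commented-out lines show where a real handshake against a remote server would happen (SNI gets sent automatically via server_hostname when wrapping the socket):

```python
import ssl

# A client context with sane defaults: certificate verification on,
# hostname checking on, old protocol versions disabled.
ctx = ssl.create_default_context()
ctx.set_alpn_protocols(["h2", "http/1.1"])  # ALPN: offer HTTP/2, fall back to 1.1

# Against a real server, the handshake would look something like:
# with socket.create_connection(("www.mysite.com", 443)) as tcp:
#     with ctx.wrap_socket(tcp, server_hostname="www.mysite.com") as tls:  # SNI
#         print(tls.version(), tls.selected_alpn_protocol(), tls.cipher())

print(ctx.check_hostname, ctx.minimum_version)
```

SNI is what lets one IP address serve certificates for many different hostnames, and ALPN is how the client and server agree on HTTP/2 without an extra round trip.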
The Anatomy of an HTTP Request
HTTP is human-readable, which makes it really easy to describe in clear terms. A simple, manually-created request might look like this:
```
GET / HTTP/1.1
Host: www.mysite.com
User-Agent: Mozilla/5.0 (X11; Linux x86_64) Firefox/85.0
Accept: */*

```
That final empty line is important; it lets the server know you’re done with your request.
- The first line (and really the only required line aside from the empty one) tells the server you’re performing a GET request (a request that has no “body”), you’re looking for the root (/), and you’re speaking HTTP version 1.1.
- The second line tells the server what hostname you’d like to communicate with. This is useful for virtual hosting (meaning you’re running more than one site on your server).
- The third line tells the server about your client. Some servers manipulate content to make it work better for your specific browser (Internet Explorer tends to have lots of these tricks). Servers also track this information so they know which types of browsers visit their site most.
- The final line describes what type of content our client will support or is expecting. In this case, we’re saying we’ll take whatever the server gives us. When communicating with an API, we might change this to something like application/json.
There are a lot of other common headers.
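To prove there’s no magic here, this sketch speaks the protocol by hand: it starts a throwaway local server (standing in for a real site) and sends a request just like the one above over a plain socket:

```python
import http.server
import socket
import threading

# A throwaway local server standing in for a real site.
server = http.server.HTTPServer(("127.0.0.1", 0),
                                http.server.SimpleHTTPRequestHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
port = server.server_address[1]

# Speak HTTP by hand, just like the request above.
with socket.create_connection(("127.0.0.1", port)) as s:
    s.sendall(b"GET / HTTP/1.1\r\n"
              b"Host: 127.0.0.1\r\n"
              b"Connection: close\r\n\r\n")
    reply = b""
    while chunk := s.recv(4096):
        reply += chunk

status_line = reply.split(b"\r\n", 1)[0].decode()
print(status_line)
server.shutdown()
```

You could do the same thing with telnet or netcat against a real server; the browser just automates this exchange and parses what comes back.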
If the server can handle the request, it might respond with something like this:
```
HTTP/1.1 200 OK
Server: nginx
Date: Tue, 16 Feb 2021 00:54:44 GMT
Content-Type: text/html
Content-Length: 175
Last-Modified: Tue, 16 Feb 2021 00:54:20 GMT
ETag: "602b17bc-af"

<html>
<head>
<title>Welcome to my site!</title>
<link rel="stylesheet" href="/assets/style.css" type="text/css">
</head>
<body>
<h1>Welcome to my site!</h1>
</body>
</html>
```
- The first line lets us know that the server accepted our request. A 200 OK means everything is fine with the request.
- The second line tells us about the server software providing the response. Sometimes this is obfuscated or not sent at all (for security).
- The third line is just the current date from the server’s perspective, which tells us when the server constructed and sent the response.
- The fourth line tells us the type of content the server is sending. This is the server-side equivalent of the Accept header in our request.
- The fifth line tells us how much data the server will be sending in the body of the response.
- The next two lines are used for caching. They tell us when the content was last modified, along with a checksum of sorts (the ETag) we can use later to ask the server whether the content has changed. If the ETag is the same, the response body should be equivalent.
- Everything after the empty line is called the response body, which is the stuff we actually wanted from the server. This is the HTML content of the site.
Beyond the web server itself, many sites lean on additional backend services. This blog, for instance, leverages a database server. I’m using MariaDB for that, but there are many others. Some sites may take a lot of CPU or time to generate content, so they can use a caching server like Redis to store expensive content and make retrieving it much faster. Larger sites might have many of these components and include communication between several servers or even calls to entirely different domains.
Are We Done?
Phew, that was quite a walkthrough. I hope this was valuable for someone. If you have questions or if you think I missed something important, please comment and let me know. Thanks for reading!