
Did you know URLs have fingerprints?

Written by Alberto Giusti | 01/03/18 11.47



Table of Contents

  • The network is the devil. Caching is your best friend
  • The complexity of caching
  • The URL fingerprinting technique

The network is the devil. Caching is your best friend.

When a website first loads, a number of things need to happen. All of them depend on one crucial first step: the browser needs to download resources from the server.

Unfortunately, this is also the slowest step of the pipeline: no matter how fast your network connection is, fetching a resource over it is never going to be as fast as reading it straight from the device.

As Ilya Grigorik put it:

“From a performance optimization perspective, the best request is a request that doesn't need to communicate with the server: a local copy of the response allows you to eliminate all network latency and avoid data charges for the data transfer.”

(From Google Developers blog, emphasis mine)

In other words, fetching resources from the network is always going to be the slowest part of the process.

This means that the most dreadful performance bottleneck you need to overcome is the network. It is slow, unreliable and inconsistent. What’s worse, it is a necessary piece of the web stack: without the network, loading your page is just not possible. Or is it?

Caching is a technique that aims to make the network inconsequential: it consists of temporarily storing assets on the client’s device, so that the browser doesn’t need to hit the network the next time it needs them.
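The server controls this behaviour through HTTP headers: it tells the browser how long it may keep a response around. Here is a minimal Node.js sketch of that idea; the styles.css filename and the single-route setup are just assumptions to keep it short, not production code:

```typescript
import { createServer } from "http";
import { readFileSync } from "fs";

// Tell the browser it may keep this asset in its cache for a year
// ("max-age" is in seconds) and never needs to revalidate it ("immutable").
createServer((_req, res) => {
  res.setHeader("Content-Type", "text/css");
  res.setHeader("Cache-Control", "max-age=31536000, immutable");
  res.end(readFileSync("styles.css"));
}).listen(8080);
```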

If the browser has your entire website in its cache, it can bypass the network altogether and load your website in a fast, reliable and consistent way.

This introduces a problem: how can we make sure that the browser will update its cached resources every time we update some of our files?


The complexity of caching

In other words, if the browser is storing my styles.css so that it can skip a network request, what happens when I modify styles.css? Nothing! The browser has no way of knowing that the file was updated on the server, so it will not download the fresh content, and users will potentially get an outdated version of the website.

Computer science literature is full of mechanisms that aim at invalidating cached items smartly; most of them are complicated and come with trade-offs. Fortunately for us, there is a simple, algorithm-free solution: since the browser stores assets in its cache by their name, we can simply change the name of the file every time we change its content. This way, the browser would look for a name that is not in its cache, and would fetch the file from the network.

A naming scheme is then required, to make sure that we always have the power to push changes to our websites. A straightforward one is to add a version number to the end of the filename.

For example, styles.css could become styles.v1.css. At this point, every time you want to force the browser to download the new version from the server, you can just bump the version number in the name and see your changes go through.
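You can even script the bump. A toy sketch, assuming the .vN. naming convention from the example above (the bumpVersion helper is hypothetical, not a standard API):

```typescript
// Toy helper: bump the version suffix in a filename.
// bumpVersion("styles.v1.css") -> "styles.v2.css"
function bumpVersion(filename: string): string {
  return filename.replace(/\.v(\d+)\./, (_match, n) => `.v${Number(n) + 1}.`);
}

console.log(bumpVersion("styles.v1.css")); // styles.v2.css
```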

Another approach, one that guarantees that names are always unique and tied to the content of the file, is hashing. Basically, a hash function is a device that takes any string of data and turns it into a (fairly) short sequence of hex characters. For instance, this entire article can be reduced to 6da4585708e12eeafa1cb7bc85b10494 (using MD5, for example).
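As a concrete illustration, here is a minimal Node.js sketch that computes such a digest for a stylesheet (styles.css is just an example filename):

```typescript
import { createHash } from "crypto";
import { readFileSync } from "fs";

// Compute the MD5 digest of a file's content.
// (MD5 is fine for cache busting; it is not used as a security measure here.)
const content = readFileSync("styles.css");
const digest = createHash("md5").update(content).digest("hex");

console.log(digest); // prints a 32-character hex string
```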

Don’t worry though: this is a perfect task to delegate to our machine friends. There are plenty of tools out there that can help you out and handle the whole process automatically.


The URL fingerprinting technique

The process just described is also called URL fingerprinting. Pretty much the same way we identify people by their fingerprints, we can use short strings of characters to tell two resources apart.

The whole idea behind this technique lies in the fact that the fingerprint uniquely represents the content of the resource itself. In other words, by using a hash function, we can be practically certain that two different resources will never end up with the same filename!

This is extremely important, as it means we can confidently make changes to our resources without ever breaking our sites.

Let’s look at the URL fingerprinting technique in a bit more depth. There are three components:

  1. The file name
  2. The file content
  3. The hash function

The file name is what the browser uses as an index to store resources in its cache. As we’ve seen, by changing the file name we signal to the browser that this is a new resource it should download.

The file content is what the hash function is going to use in order to generate its digest.

The hash function is the device that transforms the file content into a relatively short string of characters. In other words, the hash function is an algorithm capable of mapping an arbitrarily large amount of data to a much shorter bit string. Such a string also has the fundamental property of being a practically unique representation of the initial data, meaning that the probability of a collision (two different files yielding the same fingerprint) is negligible.
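Both properties are easy to see in a couple of lines of Node.js; the CSS snippets below are arbitrary example inputs:

```typescript
import { createHash } from "crypto";

const md5 = (data: string) => createHash("md5").update(data).digest("hex");

// The same content always maps to the same digest...
console.log(md5("body { color: red; }") === md5("body { color: red; }")); // true

// ...while even a tiny change produces a completely different one.
console.log(md5("body { color: red; }"));
console.log(md5("body { color: blue; }"));
```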

Given these three components, the technique consists of the following steps (sketched in code right after the list):

  1. Running the hash function on the file content to generate a short string of characters, also known as the digest
  2. Appending the digest to the filename
  3. Updating every reference to the old name with the new name, throughout the site
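Putting the three steps together, here is a minimal end-to-end sketch in Node.js. Everything in it is illustrative: it assumes a single styles.css referenced from an index.html in the same directory, and it truncates the digest to 8 characters, a common convention rather than a requirement.

```typescript
import { createHash } from "crypto";
import { readFileSync, writeFileSync, renameSync } from "fs";

// 1. Run the hash function on the file content to generate the digest.
const content = readFileSync("styles.css");
const digest = createHash("md5").update(content).digest("hex").slice(0, 8);

// 2. Append the digest to the filename: styles.css -> styles.<digest>.css
const fingerprinted = `styles.${digest}.css`;
renameSync("styles.css", fingerprinted);

// 3. Update every reference to the old name, here a single index.html.
const html = readFileSync("index.html", "utf8");
writeFileSync("index.html", html.split("styles.css").join(fingerprinted));

console.log(`styles.css -> ${fingerprinted}`);
```

A build tool does essentially this for every asset, rewriting references across the whole dependency graph instead of a single HTML file.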

Now you know: URLs have fingerprints, and they play a crucial role in web performance optimization.


Are you ready to leverage browser caching and skyrocket your website speed? Hit me up for a quick call and I'll be happy to help!