23 Sep

Dynamic Caching through Learning

We often get questions about the difference between our Standard and Advanced Caching modes, so we decided to take some extra time to explain how it all works. While webpage caching is a huge topic that can (literally) fill a book, I will do my best to provide a concise explanation of the fundamentals of the various Caching methods.

Standard Mode: Using HTTP Caching Headers

In a perfect world, every web server would be able to notify browsers, proxies and CDNs of what type of content it delivers, and whether it expects that content to change. Using this information, these clients can Cache the content, reducing the load on the server and making pages render much faster.

Servers use HTTP Headers to communicate to the client what they can and can’t Cache. There are many different headers that can serve this purpose, but generally, they fall under the following categories:

  • The [Expires] and [Cache-Control: max-age] headers allow the web server to communicate how long it expects the content to remain the same. This is a rare Caching Nirvana.

  • The [Last-Modified] and [ETag] headers are used to mark static resources that have no definite expiration date. These headers are more useful than they may seem, because even without an expiration date, browsers and CDNs can still store the content locally and use the header data to quickly re-validate its freshness. This still saves precious “fetching time” and bandwidth.

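  • Finally, the server can also communicate that a resource is completely dynamic, for example by using a [Cache-Control: max-age=0] header, which marks the response as stale the moment it is delivered (or [Cache-Control: no-store], which forbids storing it at all). This type of content cannot be served from Cache without going back to the server, at least not by using ordinary Cache-Control headers.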
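The three categories above can be sketched as a small classifier over response headers. This is a minimal illustration, not Incapsula's implementation: the `Cacheability` names and the `classify` function are hypothetical, and treating `no-cache` and `no-store` as "dynamic" alongside `max-age=0` is a simplifying assumption.

```python
from enum import Enum


class Cacheability(Enum):
    EXPLICIT_LIFETIME = "explicit lifetime"  # Expires or max-age > 0
    VALIDATOR_ONLY = "validator only"        # Last-Modified or ETag, no lifetime
    DYNAMIC = "dynamic"                      # max-age=0, no-cache, no-store
    UNKNOWN = "no caching headers"


def parse_cache_control(value):
    """Split a Cache-Control value into a {directive: argument-or-None} dict."""
    directives = {}
    for part in value.split(","):
        part = part.strip().lower()
        if not part:
            continue
        name, _, arg = part.partition("=")
        directives[name.strip()] = arg.strip().strip('"') or None
    return directives


def classify(headers):
    """Map a dict of response headers to one of the categories above."""
    h = {k.lower(): v for k, v in headers.items()}
    cc = parse_cache_control(h.get("cache-control", ""))

    # Assumption: any of these directives means "do not reuse without asking".
    if "no-store" in cc or "no-cache" in cc or cc.get("max-age") == "0":
        return Cacheability.DYNAMIC
    if (cc.get("max-age") or "").isdigit() or "expires" in h:
        return Cacheability.EXPLICIT_LIFETIME
    if "etag" in h or "last-modified" in h:
        return Cacheability.VALIDATOR_ONLY
    return Cacheability.UNKNOWN
```

For example, `classify({"ETag": '"abc"'})` falls into the validator-only category, while a response with no caching headers at all is reported as unknown rather than guessed at.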

This is essentially how Incapsula’s Standard Caching Mode works - Caching the content, while obeying the directives of the HTTP headers. In theory, this should suffice. In practice it doesn’t, because these HTTP headers are not always accurate, or even present.
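To make "obeying the directives of the HTTP headers" concrete, here is a rough sketch of a header-driven cache. It is only an illustration under stated assumptions: `StandardCache` and the `fetch(url, etag)` origin callable are hypothetical names, and the sketch handles only [Cache-Control: max-age] and [ETag] re-validation, ignoring [Expires], [no-store], [Vary] and everything else a production cache must respect.

```python
import time


class StandardCache:
    """Toy header-obeying cache. `fetch` is a hypothetical origin callable:
    fetch(url, etag) -> (status, headers, body), returning 304 on an ETag match."""

    def __init__(self, fetch, clock=time.monotonic):
        self.fetch = fetch
        self.clock = clock
        self.entries = {}  # url -> (body, etag, expires_at)

    @staticmethod
    def _max_age(headers):
        for part in headers.get("Cache-Control", "").split(","):
            name, _, arg = part.strip().partition("=")
            if name.lower() == "max-age" and arg.isdigit():
                return int(arg)
        return 0  # no explicit lifetime: treat the entry as immediately stale

    def get(self, url):
        now = self.clock()
        entry = self.entries.get(url)
        if entry and now < entry[2]:
            return entry[0]  # fresh hit: the origin is not contacted

        # Stale or missing: ask the origin, offering our ETag if we have one.
        etag = entry[1] if entry else None
        status, headers, body = self.fetch(url, etag)
        if status == 304 and entry:
            body = entry[0]  # validator matched: reuse the stored body
            etag = headers.get("ETag", etag)
        else:
            etag = headers.get("ETag")
        self.entries[url] = (body, etag, now + self._max_age(headers))
        return body
```

Even when the lifetime has run out, a matching [ETag] means only a small 304 response crosses the wire instead of the full body, which is the "fetching time" saving described earlier.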


Balancing Cache Performance with Accuracy

In reality, servers usually don’t know when content is going to change or expire. So out of the box, most servers will just send the [Last-Modified] and [ETag] headers and leave it at that. This is wasteful, because the content must keep being re-validated even though it typically remains the same for long periods of time.

Even worse, many of today’s web servers generate content using dynamic languages such as PHP and Ruby, and whole frameworks built on top of them, such as WordPress, Joomla and others.

Some of this content is essentially static HTML that will be requested repeatedly. As far as the web server is concerned, however, it is still dynamic content, and all Caching will be forbidden.

This is a known problem. Today, websites will employ all sorts of solutions to try and mitigate these issues, most commonly by using plugins and tools that essentially “compile” the pages into static content, or even automatically generate and insert caching headers.

Yet even these solutions are not fully automatic, are still prone to mistakes, are rarely implemented and require intimate knowledge of the application.

Faced with the above-mentioned realities, we looked for a way to determine what is “Cacheable” and what is not, without relying on the HTTP header mechanism alone.

A.) We didn’t want to blindly cache all content for short periods of time, mainly because this runs the risk of sending someone’s personalized content to someone else.

B.) We didn’t want to naively cache pages according to their extensions (e.g. .js or .jpg) because this would still be ambiguous and thus ineffective.

That’s why we came up with our Advanced Caching Mode.


Advanced Caching Mode: Dynamic through Learning

At its core, the Advanced Mode is a heuristic process. When we see a page for the first time, we will do the following:

  • If the server says it’s cacheable, then all is great and we will Cache the resource.

  • If the server says the content is static, then we will Cache it first and start learning more about its usage (i.e. to better understand how often it changes).

  • If the server says the content is dynamic, we will take it with a grain of salt and attempt to validate this claim by observing the resource’s behavior.
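The three first-visit rules above amount to a small lookup from the server's claim to an initial action. The sketch below is purely illustrative: the labels, the `FirstVisitPolicy` type and the `first_visit_policy` function are hypothetical, and the assumption that an explicitly cacheable resource needs no further learning is mine, not something the rules state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FirstVisitPolicy:
    cache_now: bool  # store the response right away
    learn: bool      # keep observing how the resource behaves


def first_visit_policy(server_claim):
    """server_claim is one of 'cacheable', 'static' or 'dynamic' (hypothetical labels)."""
    policies = {
        # Server gave an explicit lifetime: trust it (assumed: no learning needed).
        "cacheable": FirstVisitPolicy(cache_now=True, learn=False),
        # Static but no lifetime: cache first, then learn how often it changes.
        "static": FirstVisitPolicy(cache_now=True, learn=True),
        # Claimed dynamic: do not cache yet, but observe to validate the claim.
        "dynamic": FirstVisitPolicy(cache_now=False, learn=True),
    }
    return policies[server_claim]
```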

The learning phase is used to validate that the content doesn’t change over time, across different visits and visitors. Once we find out that the server’s directive is misleading, we start Caching the page for longer and longer periods of time, periodically checking that the page indeed remains unchanged.

If at any point we detect content changes, we will suspend learning for the resource for some time, placing it in what we call “the penalty box”.
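The learning loop and the penalty box can be pictured as a tiny per-resource state machine. What follows is a sketch under loudly stated assumptions, not the actual Incapsula algorithm: the SHA-256 fingerprint, the doubling of the caching period, and every constant (`MIN_STABLE`, `BASE_TTL`, `MAX_TTL`, `PENALTY`) are illustrative choices. It also does not distinguish between visitors, which a real implementation would need to do to avoid caching personalized content.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional

# Illustrative constants, not real product settings.
MIN_STABLE = 3     # identical observations required before caching starts
BASE_TTL = 60.0    # first caching period, in seconds
MAX_TTL = 3600.0   # upper bound on the caching period
PENALTY = 900.0    # how long learning is suspended after a change


@dataclass
class ResourceState:
    fingerprint: Optional[str] = None  # hash of the last observed body
    stable_count: int = 0              # consecutive identical observations
    ttl: float = 0.0                   # current caching period; 0 means "do not cache"
    penalty_until: float = 0.0         # learning is suspended before this time


def observe(state, body, now):
    """Update one resource's state after seeing `body` (bytes) at time `now`."""
    if now < state.penalty_until:
        return state  # in the penalty box: ignore observations

    digest = hashlib.sha256(body).hexdigest()
    if state.fingerprint is not None and digest != state.fingerprint:
        # Content changed: stop caching and suspend learning for a while.
        state.fingerprint = None
        state.stable_count = 0
        state.ttl = 0.0
        state.penalty_until = now + PENALTY
        return state

    state.fingerprint = digest
    state.stable_count += 1
    if state.stable_count >= MIN_STABLE:
        # Cache for longer and longer periods while the content stays the same.
        state.ttl = min(MAX_TTL, BASE_TTL if state.ttl == 0 else state.ttl * 2)
    return state
```

Keeping the state this small matters because one such record has to exist for every resource being tracked.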

Though the basic idea is quite simple, the implementation is very complex, as it requires maintaining a huge number of dynamic data structures, with a small state machine for each resource in the system.

The implementation of this learning process results in a noticeable improvement in Caching results, increasing overall effectiveness by 30% or more, depending on just how much dynamic content there is to Cache.

Still, with all its capabilities, Advanced Caching is just one of our Acceleration features.

Stay tuned for more upcoming updates, including our new Async Validation technique that pushes the envelope even further, making sites running on Incapsula even faster.

Gur Shatz, Incapsula Co-Founder, CEO