What is it about web browser cache design that causes performance to degrade when the cache grows - browser-cache

The nearly universal advice one gets when describing slow performance or erratic performance in a web browser is "Have you tried to clear the cache?"
Since I've gotten this advice with every browser I've ever used, I assume their cache implementations share things in common.
What is it about large amounts of cached data that degrades performance or even causes browsers to malfunction?
How is browser caching implemented at the data storage and retrieval level?

Related

Loading 1 1MB large image (spritesheet) vs loading 100 10KB images

Say I have 100 images that are each 10KB in size. What are the benefits of putting all those into a single spritesheet? I understand there are fewer HTTP requests, and therefore less of a load on the server, but I'm curious as to the specifics. With modern pipelining, is it still worth the performance gains? How significant are the performance gains? Does it result in faster load time for the client, as well as less of a load on the server or just the same amount of load time, but less of a load on the server?
Are there any test cases anyone can point to that answers these questions?
Basically, what I'm asking is -- is it worth it?
Under HTTP/1.1 (which most sites are still
using) there is a massive overhead to downloading many small resources compared to one big one. This is why spriting became popular as an optimisation technique. HTTP/2 mostly solves that so there is less requirement for spriting (and in fact it's now being considered an anti-pattern). Not sure what you mean by "modern pipelining" but that mostly means HTTP/2 as the pipelining in HTTP/1.1 isn't as fully featured or used much.
How bad a performance hit is it over HTTP/1.1? Pretty shockingly bad actually - it can make load time 10 times as slow on an example site I created. It doesn't really impact server or client load too much - the same amount of data needs to be sent either way - but does massively impact load time.
Saying that there are downsides to spriting of images (and concatenation of text files which is similar). You have to download whole sprite even if only using one image, updating it invalidates the old version in the cache, it requires a build step... etc.
Ultimately the best test is to try it, as it will be different from site to site. However once HTTP/2 becomes ubiquitous this will become a lot less common.
More discussion on this topic on this answer: Optimizing File Cacheing and HTTP2

Does std.log have a performance impact

In Varnish, does the std.log subroutine have a performance impact I should be concerned with? For example, if I call it 3-4 times a request, will that have a cumulative effect when dealing with a large number of requests?
From what I can tell, std.log logs to shared memory by requesting a lock, writing the message, and releasing the lock. This should be pretty fast, but if it happens during every single request wouldn't that affect concurrent requests?
Varnish uses a shared memory log (shm-log) for all logging. This works as a circular buffer and stores a small amount of log data - 80MB by default. It is fast.
Other tools are provided for analysing and generating output from the shm-log area. These tools are relatively slow since they must output data either to screen or disk, but they don't interfere with the performance of Varnish itself.
I'd be surprised if adding an additional 3 or 4 log entries per request has any measurable performance impact at all, seeing as each request already generates far more than that (one for every request header, for example). I'd say you are far more likely to encounter performance problems with your backend/s.

In real terms, how much faster is a production tier Heroku database than the starter database?

A basic production level database in Heroku implements a 400Mb cache. I have a production site running 2 dynos and a worker which is pretty heavy on reads and writes. The database is the bottleneck in my app.
A write to the database will invalidate many queries, as searches are performed across the database.
My question is, given the large jump in price between the $9 starter and $50 first level production database, would migrating be likely to give a significant performance improvement?
"Faster" is an odd metric here. This implies something like CPU, but CPU isn't always a huge factor in databases, especially if you're not doing heavy writes. You Basic database has 0mb of cache – every query hits disk. Even a 400mb cache will seem amazing compared to this. Examine your approximate dataset size; a general rule of thumb is for your dataset to fit into cache. Postgres will manage this cache itself, and optimize for the most referenced data.
Ultimately, Heroku Postgres doesn't sell raw performance. The benefits of the Production-tier are multiple, but to name a few: In-memory Cache, Fork/Follow support, 500 available connections, 99.95% expected uptime.
You will definitely see performance boost in upgrading to a Production-tier plan, however it's near impossible to claim it to be "3x faster" or similar, as this is dependent on how you're using the database.
It sure is a steep step, so the question really is how bad is the bottleneck? It will cost you 40 dollar extra, but once your app runs smooth again it could also mean more revenue. Of course you could also consider other hosting services, but personally I like Heroku the best (albeit cheaper options are available). Besides, you are already familiar with Heroku. There is some more information on Heroku devcenter, regarding their different plans:
https://devcenter.heroku.com/articles/heroku-postgres-plans:
Production plans
Non-production applications, or applications with minimal data storage, performance or availability requirements can choose between one of the two starter tier plans, dev and basic, depending on row requirements. However, production applications, or apps that require the features of a production tier database plan, have a variety of plans to choose from. These plans vary primarily by the size of their in-memory data cache.
Cache size
Each production tier plan’s cache size constitutes the total amount of RAM given to Postgres. While a small amount of RAM is used for managing each connection and other tasks, Postgres will take advantage of almost all this RAM for its cache.
Postgres constantly manages the cache of your data: rows you’ve written, indexes you’ve made, and metadata Postgres keeps. When the data needed for a query is entirely in that cache, performance is very fast. Queries made from cached data are often 100-1000x faster than from the full data set.
Well engineered, high performance web applications will have 99% or more of their queries be served from cache.
Conversely, having to fall back to disk is at least an order of magnitude slower. Additionally, columns with large data types (e.g. large text columns) are stored out-of-line via TOAST, and accessing large amounts of TOASTed data can be slow.

Concurrent page request comparisons

I have been hoping to find out what different server setups equate to in theory for concurrent page requests, and the answer always seems to be soaked in voodoo and sorcery. What is the approximation of max concurrent page requests for the following setups?
apache+php+mysql(1 server)
apache+php+mysql+caching(like memcached or similiar (still one server))
apache+php+mysql+caching+dedicated Database Server (2 servers)
apache+php+mysql+caching+dedicatedDB+loadbalancing(multi webserver/single dbserver)
apache+php+mysql+caching+dedicatedDB+loadbalancing(multi webserver/multi dbserver)
+distributed (amazon cloud elastic) -- I know this one is "as much as you can afford" but it would be nice to know when to move to it.
I appreciate any constructive criticism, I am just trying to figure out when its time to move from one implementation to the next, because they each come with their own implementation feat either programming wise or setup wise.
In your question you talk about caching and this is probably one of the most important factors in a web architecture r.e performance and capacity.
Memcache is useful, but actually, before that, you should be ensuring proper HTTP cache directives on your server responses. This does 2 things; it reduces the number of requests and speeds up server response times (if you have Apache configured correctly). This can also be improved by using an HTTP accelerator like Varnish and a CDN.
Another factor to consider is whether your system is stateless. By stateless, it usually means that it doesn't store sessions on the server and reference them with every request. A good systems architecture relies on state as little as possible. The less state the more horizontally scalable a system. Most people introduce state when confronted with issues of personalisation - i.e serving up different content for different users. In such cases you should first investigate using the HTML5 session storage (i.e store the complete user data in javascript on the client, obviously over https) or if the data set is smaller, secure javascript cookies. That way you can still serve up cached resources and then personalise with javascript on the client.
Finally, your stack includes a database tier, another potential bottleneck for performance and capacity. If you are only reading data from the system then again it should be quite easy to horizontally scale. If there are reads and writes, its typically better to separate the read write datasets into a separate database and have the read only in another. You can then use more relevant methods to scale.
These setups do not spit out a single answer that you can then compare to each other. The answer will vary on way more factors than you have listed.
Even if they did spit out a single answer, then it is just one metric out of dozens. What makes this the most important metric?
Even worse, each of these alternatives is not free. There is engineering effort and maintenance overhead in each of these. Which could not be analysed without understanding your organisation, your app and your cost/revenue structures.
Options like AWS not only involve development effort but may "lock you in" to a solution so you also need to be aware of that.
I know this response is not complete, but I am pointing out that this question touches on a large complicated area that cannot be reduced to a single metric.
I suspect you are approaching this from exactly the wrong end. Do not go looking for technologies and then figure out how to use them. Instead profile your app (measure, measure, measure), figure out the actual problem you are having, and then solve that problem and that problem only.
If you understand the problem and you understand the technology options then you should have an answer.
If you have already done this and the problem is concurrent page requests then I apologise in advance, but I suspect not.

SQLite caching vs Application Caching

So I'm writing an application that is very heavy on SQLite usage. I'm working on writing into my application an in memory caching system that will allow me to sort and filter my data (my own personal Core Data...in essence). I'm doing this because it seems to me that this is a better/faster option than to constantly make read requests from the SQLite database. Plus, most fields/columns will be searchable/sortable, and to set up indexes on each one seems less than ideal. But I'm not sure. I know the SQLite database is cached some in memory but I don't know to what extent or how much of an advantage that would be for me. Implementing my own caching system will be complex and will probably add to my memory footprint, especially since I'm loading each table completely into memory to perform the sort/filters. I'm more than willing to do it if it helps the performance of my app, but will it? Is the SQLite caching sufficient for me to rely solely on that or will it get bogged down when the tables start getting large (10,000+ rows)? I guess I'm asking if anyone has enough experience with SQLite to recommend one over the other.
Before anyone asks: no I can't use Core Data. Core Data isn't flexible enough for me to use in my application.
Ok, so here's what I've figured: the choice depends greatly on your requirements. I ended up removing (as much as possible) the SQLite cache, loading up what I needed, and sorting/filtering it using my own routines. This works remarkably well for me. But I've realized by implementing this that this wouldn't work in a lot of situations. Specifically, I've done a lot to make sure that my DB size is as small as possible. I'm basically storing only simple/small text and numbers. Everything else is a reference to an outside file. This makes my database small enough to use less as a database and more as an indexing service, which works well for loading the info into memory and sorting/filtering.
So, the answer depends a lot on the database. If you're storing large fields that might potentially take a lot of memory, it's probably best to let SQLite handle the cache. On the other hand, if you know the fields will be small, the SQLite cache will only inflate your memory and the round trips to the database to sort/filter data will only increase your latency. Instead, it's better to do the sorting/filtering yourself, though I will admit that's a lot work. But in the end it made my app a lot faster than round-tripping it to the DB.