In Varnish, does the std.log subroutine have a performance impact I should be concerned with? For example, if I call it 3-4 times a request, will that have a cumulative effect when dealing with a large number of requests?
From what I can tell, std.log writes to the shared memory log by acquiring a lock, writing the message, and releasing the lock. This should be pretty fast, but if it happens on every single request, wouldn't that affect concurrent requests?
Varnish uses a shared memory log (shm-log) for all logging. This works as a circular buffer and stores a small amount of log data - 80MB by default. It is fast.
Other tools are provided for analysing and generating output from the shm-log area. These tools are relatively slow since they must output data either to screen or disk, but they don't interfere with the performance of Varnish itself.
I'd be surprised if adding an additional 3 or 4 log entries per request has any measurable performance impact at all, seeing as each request already generates far more than that (one for every request header, for example). I'd say you are far more likely to encounter performance problems with your backend(s).
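For a rough sense of scale - this is plain Python, not anything Varnish-specific - the sketch below compares appending log entries to a bounded in-memory buffer under a lock (the general shape of shm logging) with forcing each entry to disk. The absolute numbers are meaningless; the point is the order-of-magnitude gap.

# Rough, Varnish-agnostic illustration: in-memory logging under a lock
# versus synchronously flushing every entry to disk.
import os
import threading
import time
from collections import deque

N = 2_000
lock = threading.Lock()
ring = deque(maxlen=1024)          # stand-in for a fixed-size circular buffer

start = time.perf_counter()
for i in range(N):
    with lock:                     # acquire lock, write entry, release lock
        ring.append(f"VCL_Log      custom message {i}")
mem_elapsed = time.perf_counter() - start

start = time.perf_counter()
with open("scratch.log", "w") as f:
    for i in range(N):
        f.write(f"custom message {i}\n")
        f.flush()
        os.fsync(f.fileno())       # force each entry to disk (worst case)
disk_elapsed = time.perf_counter() - start

print(f"in-memory (locked): {mem_elapsed:.4f}s   disk (fsync'd): {disk_elapsed:.4f}s")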
Say I have 100 images that are each 10KB in size. What are the benefits of putting all of those into a single spritesheet? I understand there are fewer HTTP requests, and therefore less load on the server, but I'm curious about the specifics. With modern pipelining, is it still worth it? How significant are the performance gains? Does it result in faster load times for the client as well as less load on the server, or the same load time with just less load on the server?
Are there any test cases anyone can point to that answers these questions?
Basically, what I'm asking is -- is it worth it?
Under HTTP/1.1 (which most sites are still using) there is a massive overhead to downloading many small resources compared to one big one. This is why spriting became popular as an optimisation technique. HTTP/2 mostly solves that, so there is less need for spriting (and in fact it's now considered an anti-pattern). I'm not sure what you mean by "modern pipelining", but that mostly means HTTP/2, as the pipelining in HTTP/1.1 isn't as fully featured and isn't used much.
How bad a performance hit is it over HTTP/1.1? Pretty shockingly bad, actually - it can make load time 10 times as slow on an example site I created. It doesn't really impact server or client load much - the same amount of data needs to be sent either way - but it does massively impact load time.
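As a back-of-envelope illustration under assumed numbers (100 ms round-trip time, 6 parallel HTTP/1.1 connections, 100 images of 10 KB, a 5 Mbit/s downlink): the model below only counts round-trip waits and ignores request/response headers, TCP slow start and connection setup, so it understates the real gap, which is why measured differences can be much larger.

# Assumed numbers for illustration only.
RTT = 0.100                    # round-trip time, seconds
CONNS = 6                      # typical HTTP/1.1 per-host connection limit
IMAGES = 100
IMAGE_BYTES = 10 * 1024
BANDWIDTH = 5_000_000 / 8      # 5 Mbit/s downlink, in bytes/second

transfer = IMAGES * IMAGE_BYTES / BANDWIDTH          # same bytes either way

# Many small requests: each "wave" of 6 requests costs roughly one round trip.
many_small = (IMAGES / CONNS) * RTT + transfer

# One sprite: a single request, so roughly one round trip plus the transfer.
one_sprite = 1 * RTT + transfer

print(f"100 small images ~ {many_small:.2f}s, one sprite ~ {one_sprite:.2f}s")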
That said, there are downsides to spriting images (and to concatenating text files, which is similar): you have to download the whole sprite even if you only use one image, updating it invalidates the old version in the cache, it requires a build step... etc.
Ultimately the best test is to try it, as it will differ from site to site. However, once HTTP/2 becomes ubiquitous, spriting will become a lot less common.
There is more discussion on this topic in this answer: Optimizing File Cacheing and HTTP2
As far as I know, HTTP/2 no longer uses a separate TCP connection for every request, which is the main performance booster of the protocol.
Does that mean it doesn't matter whether I use 10 XHRs with 10kB of content each or one XHR with 100kB and then split the parts client-side?
A precise answer would require a benchmark for your specific case.
In more general terms, from the client's point of view: if you can make the 10 XHRs at the same time (for example, in a tight loop), those 10 requests will leave the client more or less at the same time, incur the latency between the client and the server, and be processed on the server (more or less in parallel, depending on the server architecture). So the result could be similar to a single XHR, although I would expect the single request to be more efficient.
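A rough client-side sketch of that comparison, with Python standing in for browser XHRs; the URLs and the delimiter are hypothetical placeholders to swap for your own endpoints when benchmarking.

# Compare 10 concurrent small requests with one aggregated request.
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

SMALL_URLS = [f"https://example.com/part/{i}" for i in range(10)]   # 10 x 10 kB (assumed)
BIG_URL = "https://example.com/all-parts"                           # 1 x 100 kB (assumed)

def fetch(url: str) -> bytes:
    with urllib.request.urlopen(url) as resp:
        return resp.read()

# Ten concurrent small requests: they leave at roughly the same time and each
# pays the client-server latency, largely in parallel.
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=10) as pool:
    parts = list(pool.map(fetch, SMALL_URLS))
print(f"10 small requests: {time.perf_counter() - start:.3f}s")

# One aggregated request: a single round trip, split into parts client-side.
start = time.perf_counter()
blob = fetch(BIG_URL)
parts = blob.split(b"\n--boundary--\n")   # hypothetical delimiter chosen server-side
print(f"1 combined request: {time.perf_counter() - start:.3f}s")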
From the server point of view, however, things may be different.
If you multiply by 10 what could have been done with a single request, now your server sees a 10x increase in request rate.
Reading from the network, request parsing and request dispatching are all activities that are heavily optimized in servers, but they do have a cost, and a 10x increase in that cost may be noticeable.
And that 10x increase in request rate to the server may impact the database as well, the filesystem, etc. so there may be ripple effects that can only be noticed by actually performing the benchmark.
Other things you have to weigh are the amount of work the server needs to do to aggregate things and the client needs to do to split them, along with less measurable things like code clarity and maintainability.
I would say that common pragmatic judgement applies here: if you can do the same work with one request, why make 10 requests? Do you have a more specific example?
If you are in doubt, measure.
I've been playing with ImageResizer for a bit now and, in trying to do something with it, I'm having trouble understanding the way to go about it.
Mainly I would like to stick to the idea of using the pipeline and not try to cheat it.
So... let's say I use ImageResizer in a pretty standard way, for something like:
giants_logo.jpg?w=280&h=100
The file is giants_logo.jpg, and the request being processed is for a resized version ('w=280&h=100').
In a clustered environment, this same request could be served by 3 machines.
All 3 would end up doing the resize and then storing their cached version in a local folder on disk. I could leverage a shared drive or something, but that has its own limitations.
What I am looking to do is take the processed file and copy it back up to the DB or S3, where the main images are served from.
My thought is that I might have to write something like DiskCache, but with completely different guts, using the DB or S3 as the back end instead of the file system.
I realize the point of caching is speed, and what I am suggesting negates that aspect... but maybe that's not the case if we layer things.
Anyway, what I am focused on is keeping track of the files generated, as well as avoiding processing on multiple servers.
Any thoughts on the route I should look at to accomplish this?
TLDR; When DiskCache actually stops working well (usually between 1 and 20 million unique images), then switch to a CDN (unless it's too expensive), or a reverse proxy (unless your data set is really too huge to be bound by mortal infrastructure).
For petabyte data sets on the cheap when performance isn't king, it's a good plan. But for most people, it's premature. Even users with upwards of 20TB (source images) still use DiskCache. Really. Terabyte drives are cheap.
Latency is the killer.
To make this work you would need a central Redis server. MSSQL won't cut it (at least not on a VM or commodity hardware, we've tried). Given a Redis server, you can track what is done and stored (and perhaps even what is in progress, to de-duplicate effort in real time, as DiskCache does).
If you can track it, you can reuse it, and you can delete it. Reuse will be slower, since you're doubling the network traffic, moving the result twice. (But also decreasing it linearly with the number of servers in the cluster for source image fetches).
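To make that concrete, here is a rough, language-agnostic sketch of the bookkeeping (Python with redis-py and boto3 purely for illustration; an actual ImageResizer plugin would be .NET). The key names, the bucket, and the resize_locally() helper are hypothetical stand-ins.

# Sketch: use Redis to record finished variants and to claim in-progress work,
# storing the rendered bytes in S3 so every node in the cluster can reuse them.
import boto3
import redis

r = redis.Redis(host="redis.internal", port=6379)
s3 = boto3.client("s3")
BUCKET = "my-image-derivatives"                      # hypothetical bucket

def resize_locally(source_key: str, query: str) -> bytes:
    raise NotImplementedError("stand-in for the actual ImageResizer pipeline call")

def get_or_create_variant(source_key: str, query: str) -> str:
    variant_key = f"{source_key}?{query}"            # e.g. giants_logo.jpg?w=280&h=100

    # Already produced by some node in the cluster? Just reuse it.
    if r.sismember("variants:done", variant_key):
        return variant_key

    # Claim the in-progress slot so only one node does the resize.
    if r.set(f"variants:inprogress:{variant_key}", "1", nx=True, ex=60):
        data = resize_locally(source_key, query)
        s3.put_object(Bucket=BUCKET, Key=variant_key, Body=data,
                      ContentType="image/jpeg")
        r.sadd("variants:done", variant_key)
        r.delete(f"variants:inprogress:{variant_key}")
    # else: another node is working on it; either resize locally anyway or
    # poll variants:done, depending on how much duplicated effort you accept.
    return variant_key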
If bandwidth saturation is your bottleneck (very common), this could make performance worse. In fact, unless your workload is write- and CPU-heavy, you'll likely see worse performance than with duplicated CPU effort under individual disk caches.
If you have the infrastructure to test it, put DiskCache on a SAN or shared drive; this will give you a solid estimate of the performance you can expect (assuming said drive and your blob storage system have comparable IO perf).
However, it's a fair amount of work, and you're essentially duplicating a subset of the functionality of a reverse proxy (but with worse performance, since every response has to be proxied through the unlucky cluster server instead of being spooled directly from disk).
CDNs and Reverse proxies to the rescue
Amazon CloudFront or Varnish can serve quite well as reverse proxies/caches for a web farm or cluster. Now, you'll have a bit less control over the 'garbage collection' process, but... also less code to maintain.
There's also ARR, but I've heard neither success nor failure stories about it.
But it sounds fun!
Send me a Github link and I'll help out.
I'd love to get a Redis-coordinated, cloud-agnostic poor-man's blob cache system out there. You bring the petabytes and infrastructure, I'll help you with the integration and troublesome bits. Efficient HTTP proxying is probably the hardest part; the rest is state management and basic threading.
You might want to have a look at a modified AzureReader2 plugin at https://github.com/orbyone/Sensible.ImageResizer.Plugins.AzureReader2
This implementation stores the transformed image back to the Azure blob container on the initial request, so subsequent requests are redirected to that copy.
I know little about how leading RDBMSs go about retrieving data. So these questions may seem a bit rudimentary:
Does each SELECT in commonly used RDBMSs such as Oracle, SQL Server, MySQL, PostgreSQL, etc. always mean a trip to read the data from disk, or do they, to the extent the hardware allows, cache commonly requested data to avoid the expensive I/O operation?
How do they determine which data segments to cache?
How do they go about synchronizing the cache once an update of some of the cached data occurs by a different process?
Is there a comparison matrix on how different RDBMSs cache frequently requested data?
Thanks
I'll answer for SQL Server:
Reads are served from cache if possible; otherwise, an I/O occurs.
From what has been written and from what I observe, it is an LRU algorithm. I don't think this is documented anywhere. The LRU items are database pages of 8KB.
SQL Server is the only process which has access to the database files. So no other process can cause modifications. Regarding concurrent transactions: Multiple transactions can modify the same page. Locking (mostly at row-level, sometimes page or table level) ensures that the transactions do not disturb each other.
I don't know.
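To make the "LRU over fixed-size pages" idea above concrete, here is a toy sketch. It illustrates the general concept only, not SQL Server's actual buffer manager, which uses more elaborate schemes than a strict LRU list.

# Toy buffer pool: an LRU cache of fixed-size pages keyed by page id.
from collections import OrderedDict

PAGE_SIZE = 8 * 1024                     # 8 KB pages, as in the answer above

class BufferPool:
    def __init__(self, capacity_pages: int, read_page_from_disk):
        self.capacity = capacity_pages
        self.read_page_from_disk = read_page_from_disk   # callable(page_id) -> bytes
        self.pages = OrderedDict()                       # page_id -> page bytes

    def get(self, page_id: int) -> bytes:
        if page_id in self.pages:                        # cache hit: no I/O
            self.pages.move_to_end(page_id)              # mark as most recently used
            return self.pages[page_id]
        page = self.read_page_from_disk(page_id)         # cache miss: one read I/O
        self.pages[page_id] = page
        if len(self.pages) > self.capacity:
            self.pages.popitem(last=False)               # evict least recently used
        return page

# Example with a fake "disk" that returns zeroed pages.
pool = BufferPool(capacity_pages=4, read_page_from_disk=lambda pid: bytes(PAGE_SIZE))
for pid in [1, 2, 3, 1, 4, 5, 1]:        # page 1 stays hot; page 2 is evicted first
    pool.get(pid)
print(list(pool.pages))                  # most recently used pages remain resident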
The answers for Informix are pretty similar to those given for SQL Server:
Reads and writes both use the cache if at all possible. If the page needed is not already in cache, an appropriate collection of I/O operations occurs (typically, evicting some page from cache, perhaps a dirty page that must be written before a new page can be read in, and then reading the new page where the old one was).
There are various algorithms, but page size and usage are the key parts. There are LRU queues for each page size.
The DBMS as a whole is an ensemble of processes that use a buffer pool in shared memory (and, where possible, direct disk I/O instead of going through the kernel cache), and uses various forms of locking (semaphores, spin-locks, mutexes, etc) to handle concurrency and synchronization. (On Windows, Informix uses a single process with multiple threads; on Unix, it uses multiple processes.)
Probably not.
I'm thinking of optimizing a program by taking a linear array and writing each element to an arbitrary location (random-like from the perspective of the CPU) in another array. I am only doing simple writes and not reading the elements back.
I understand that a scattered read on a classical CPU can be quite slow, as each access causes a cache miss and thus a processor stall. But I was thinking that a scattered write could technically be fast, because the processor isn't waiting for a result and thus may not have to wait for the transaction to complete.
I am unfortunately unfamiliar with all the details of the classical CPU memory architecture, so there may be complications that make this quite slow as well.
Has anyone tried this?
(I should say that I am trying to invert a problem I have. I currently have a linear array from which I read arbitrary values - a scattered read - and it is incredibly slow because of all the cache misses. My thought is that I can invert this operation into a scattered write for a significant speed benefit.)
In general you pay a high penalty for scattered writes to addresses which are not already in cache, since you have to load and store an entire cache line for each write, hence FSB and DRAM bandwidth requirements will be much higher than for sequential writes. And of course you'll incur a cache miss on every write (a couple of hundred cycles typically on modern CPUs), and there will be no help from any automatic prefetch mechanism.
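To get a rough feel for the gap, here is a small sketch using NumPy so the loops run in native code. The exact ratio depends on the CPU, the array size, and the memory system, so treat the numbers as illustrative only.

# Compare sequential writes with scattered (random-destination) writes.
import time
import numpy as np

N = 10_000_000
src = np.arange(N, dtype=np.int64)
dst = np.empty(N, dtype=np.int64)
perm = np.random.permutation(N)          # "random-like" destinations

start = time.perf_counter()
dst[:] = src                             # sequential writes
seq = time.perf_counter() - start

start = time.perf_counter()
dst[perm] = src                          # scattered writes: most touch a cold cache line
scat = time.perf_counter() - start

print(f"sequential: {seq:.3f}s   scattered: {scat:.3f}s   (~{scat/seq:.1f}x slower)")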
I must admit, this sounds kind of hardcore. But I take the risk and answer anyway.
Is it possible to divide the input array into pages and read/scan each page multiple times? On every pass through a page, you only process (or output) the data that belongs to a limited number of destination pages. This way you only get cache misses at the start of each input page loop.
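A minimal sketch of that pattern follows. It is plain Python, so it only illustrates the access pattern; the actual speedup needs a compiled language where the loop is memory-bound rather than interpreter-bound.

# Multi-pass scatter: several passes over the (sequentially read) input, each
# pass performing only the writes whose destination falls inside one block of
# the output, so the active write region stays cache-sized.
def scatter_blocked(values, dest_index, out, block_elems=64 * 1024):
    n_out = len(out)
    for block_start in range(0, n_out, block_elems):
        block_end = block_start + block_elems
        # One full scan of the input per output block.
        for v, d in zip(values, dest_index):
            if block_start <= d < block_end:
                out[d] = v                # all writes land in the current block
    return out

# Tiny usage example with made-up data.
values = [10, 20, 30, 40]
dest_index = [3, 0, 2, 1]
print(scatter_blocked(values, dest_index, [None] * 4, block_elems=2))
# -> [20, 40, 30, 10]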