Varnish: performance impact large ban list - reverse-proxy

We were wondering if anyone has experience with a large amount of bans in Varnish. We consider a ban strategy which could result in a couple of hundred (smart) bans each night (on X million cache objects).
Although I am aware that this is highly dependent on environment variables we were wondering if this have a significant performance impact.

Bans are quite CPU intensive so care should be taken not to overuse them. If you do, CPU usage will rise and you'll notice a huge amount of regular expression matches will be executed each second.
In general one ban will match against every object in memory at the point it is entered, so having a million object each ban will result in a million ban evaluation. This might sound like a lot but modern servers are fast and today a modern server is be capable of doing tens of millions of regular expression matches each second. My four year old laptop does something like 15 million regex matches a second running on a single core, just to give you an idea of the scale.
In addition there is another feature of Varnish that comes into play. The ban lurker. The ban lurker is a thread that walks the cache and evaluates bans trying to kill of objects before they are requested, thereby reducing the size of the ban list. If your bans don't use the req object they are candidates for evaluation by the lurker. If you plan on using a few bans you should take care to write your bans in a lurker friendly fashion. So called "smart bans", which you seem to be familiar with.
All in all I think your setup sounds sane. Issuing a couple of hundred smart bans with a few million objects in cache will probably work just fine. There will of course be a bit of CPU load when the bans are deployed and the TTFB will increase somewhat, but I think you'll be fine. You might want to play somewhat with the parameters that control how the ban lurker works, but try the defaults first, they are pretty sane.

Related

What is the expected performance gap switching from SQL to TSDB for handling time series?

We are in the case of using a SQL database for a single node storage of roughly 1 hour of high frequency metrics (several k inserts a second). We quickly ran into I/O issues which proper buffering would not simply handle, and we are willing to put time into solving the performance issue.
I suggested to switch to a specialised database for handling time series, but my colleague stayed pretty skeptical. His argument is that the gain "out of the box" is not guaranteed as he knows SQL well and already spent time optimizing the storage, and we in comparison do not have any kind of TSDB experience to properly optimize it.
My intuition is that using a TSDB would be much more efficient even with an out of box configuration but I don't have any data to measure this, and internet benchs such as InfluxDB's are nowhere near trustable. We should run our own, except we can't affoard to loose time in a dead end or a mediocre improvement.
What would be, in my use case but very roughly, the performance gap between relational storage and TSDB, when it comes to single node throughput ?
This question may be bordering on a software recommendation. I just want to point one important thing out: You have an existing code base so switching to another data store is expensive in terms of development costs and time. If you have someone experienced with the current technology, you are probably better off with a good-faith effort to make that technology work.
Whether you switch or not depends on the actual requirements of your application. For instance, if you don't need the data immediately, perhaps writing batches to a file is the most efficient mechanism.
Your infrastructure has ample opportunity for in-place growth -- more memory, more processors, solid-state disk (for example). These might meet your performance needs with a minimal amount of effort.
If you cannot make the solution work (and 10k inserts per second should be quite feasible), then there are numerous solutions. Some NOSQL databases relax some of the strict ACID requirements of traditional RDBMSs, providing faster throughout.

Why is the n+1 selects pattern slow?

I'm rather inexperienced with databases and have just read about the "n+1 selects issue". My follow-up question: Assuming the database resides on the same machine as my program, is cached in RAM and properly indexed, why is the n+1 query pattern slow?
As an example let's take the code from the accepted answer:
SELECT * FROM Cars;
/* for each car */
SELECT * FROM Wheel WHERE CarId = ?
With my mental model of the database cache, each of the SELECT * FROM Wheel WHERE CarId = ? queries should need:
1 lookup to reach the "Wheel" table (one hashmap get())
1 lookup to reach the list of k wheels with the specified CarId (another hashmap get())
k lookups to get the wheel rows for each matching wheel (k pointer dereferenciations)
Even if we multiply that by a small constant factor for an additional overhead because of the internal memory structure, it still should be unnoticeably fast. Is the interprocess communication the bottleneck?
Edit: I just found this related article via Hacker News: Following a Select Statement Through Postgres Internals. - HN discussion thread.
Edit 2: To clarify, I do assume N to be large. A non-trivial overhead will add up to a noticeable delay then, yes. I am asking why the overhead is non-trivial in the first place, for the setting described above.
You are correct that avoiding n+1 selects is less important in the scenario you describe. If the database is on a remote machine, communication latencies of > 1ms are common, i.e. the cpu would spend millions of clock cycles waiting for the network.
If we are on the same machine, the communication delay is several orders of magnitude smaller, but synchronous communication with another process necessarily involves a context switch, which commonly costs > 0.01 ms (source), which is tens of thousands of clock cycles.
In addition, both the ORM tool and the database will have some overhead per query.
To conclude, avoiding n+1 selects is far less important if the database is local, but still matters if n is large.
Assuming the database resides on the same machine as my program
Never assume this. Thinking about special cases like this is never a good idea. It's quite likely that your data will grow, and you will need to put your database on another server. Or you will want redundancy, which involves (you guessed it) another server. Or for security, you might want not want your app server on the same box as the DB.
why is the n+1 query pattern slow?
You don't think it's slow because your mental model of performance is probably all wrong.
1) RAM is horribly slow. Your CPU is wasting around 200-400 CPU cycles each time it needs to read something something from RAM. CPUs have a lot of tricks to hide this (caches, pipelining, hyperthreading)
2) Reading from RAM is not "Random Access". It's like a hard drive: sequential reads are faster.
See this article about how accessing RAM in the right order is 76.6% faster http://lwn.net/Articles/255364/ (Read the whole article if you want to know how horrifyingly complex RAM actually is.)
CPU cache
In your "N+1 query" case, the "loop" for each N includes many megabytes of code (on client and server) swapping in and out of caches on each iteration, plus context switches (which usually dumps the caches anyway).
The "1 query" case probably involves a single tight loop on the server (finding and copying each row), then a single tight loop on the client (reading each row). If those loops are small enough, they can execute 10-100x faster running from cache.
RAM sequential access
The "1 query" case will read everything from the DB to one linear buffer, send it to the client who will read it linearly. No random accesses during data transfer.
The "N+1 query" case will be allocating and de-allocating RAM N times, which (for various reasons) may not be the same physical bit of RAM.
Various other reasons
The networking subsystem only needs to read one or two TCP headers, instead of N.
Your DB only needs to parse one query instead of N.
When you throw in multi-users, the "locality/sequential access" gets even more fragmented in the N+1 case, but stays pretty good in the 1-query case.
Lots of other tricks that the CPU uses (e.g. branch prediction) work better with tight loops.
See: http://blogs.msdn.com/b/oldnewthing/archive/2014/06/13/10533875.aspx
Having the database on a local machine reduces the problem; however, most applications and databases will be on different machines, where each round trip takes at least a couple of milliseconds.
A database will also need a lot of locking and latching checks for each individual query. Context switches have already been mentioned by meriton. If you don't use a surrounding transaction, it also has to build implicit transactions for each query. Some query parsing overhead is still there, even with a parameterized, prepared query or one remembered by string equality (with parameters).
If the database gets filled up, query times may increase, compared to an almost empty database in the beginning.
If your database is to be used by other application, you will likely hammer it: even if your application works, others may slow down or even get an increasing number of failures, such as timeouts and deadlocks.
Also, consider having more than two levels of data. Imagine three levels: Blogs, Entries, Comments, with 100 blogs, each with 10 entries and 10 comments on each entry (for average). That's a SELECT 1+N+(NxM) situation. It will require 100 queries to retrieve the blog entries, and another 1000 to get all comments. Some more complex data, and you'll run into the 10000s or even 100000s.
Of course, bad programming may work in some cases and to some extent. If the database will always be on the same machine, nobody else uses it and the number of cars is never much more than 100, even a very sub-optimal program might be sufficient. But beware of the day any of these preconditions changes: refactoring the whole thing will take you much more time than doing it correctly in the beginning. And likely, you'll try some other workarounds first: a few more IF clauses, memory cache and the like, which help in the beginning, but mess up your code even more. In the end, you may be stuck in a "never touch a running system" position, where the system performance is becoming less and less acceptable, but refactoring is too risky and far more complex than changing correct code.
Also, a good ORM offers you ways around N+1: (N)Hibernate, for example, allows you to specify a batch-size (merging many SELECT * FROM Wheels WHERE CarId=? queries into one SELECT * FROM Wheels WHERE CarId IN (?, ?, ..., ?) ) or use a subselect (like: SELECT * FROM Wheels WHERE CarId IN (SELECT Id FROM Cars)).
The most simple option to avoid N+1 is a join, with the disadvantage that each car row is multiplied by the number of wheels, and multiple child/grandchild items likely ending up in a huge cartesian product of join results.
There is still overhead, even if the database is on the same machine, cached in RAM and properly indexed. The size of this overhead will depend on what DBMS you're using, the machine it's running on, the amount of users, the configuration of the DBMS (isolation level, ...) and so on.
When retrieving N rows, you can choose to pay this cost once or N times. Even a small cost can become noticeable if N is large enough.
One day someone might want to put the database on a separate machine or to use a different dbms. This happens frequently in the business world (to be compliant with some ISO standard, to reduce costs, to change vendors, ...)
So, sometimes it's good to plan for situations where the database isn't lightning fast.
All of this depends very much on what the software is for. Avoiding the "select n+1 problem" isn't always necessary, it's just a rule of thumb, to avoid a commonly encountered pitfall.

High disk IO rate

My rails application always reaches the threshold of the disk I/O rate set by my VPS at Linode. It's set at 3000 (I up it from 2000), and every hour or so I will get a notification that it reaches 4000-5000+.
What are the methods that I can use to minimize the disk IO rate? I mostly use Sphinx (Thinking Sphinx plugin) and Latitude and Longitude distance search.
What are the methods to avoid?
I'm using Rails 2.3.11 and MySQL.
Thanks.
did you check if your server is swapping itself to death? what does "top" say?
your Linode may have limited RAM, and it could be very likely that it is swapping like crazy to keep things running..
If you see red in the IO graph, that is swapping activity! You need to upgrade your Linode to more RAM,
or limit the number / size of processes which are running. You should also add approximately 2x the RAM size as Swap space (swap partition).
http://tinypic.com/view.php?pic=2s0b8t2&s=7
Since your question is too vague to answer concisely, this is generally a sign of one of a few things:
Your data set is too large because of historical data that you could prune. Delete what is no longer relevant.
Your tables are not indexed properly and you are hitting a lot of table scans. Check with EXAMINE on each of your slow queries.
Your data structure is not optimized for the way you are using it, and you are doing too many joins. Some tactical de-normalization would help here. Make sure all your JOIN queries are strictly necessary.
You are retrieving more data than is required to service the request. It is, sadly, all too common that people load enormous TEXT or BLOB columns from a user table when displaying only a list of user names. Load only what you need.
You're being hit by some kind of automated scraper or spider robot that's systematically downloading your entire site, page by page. You may want to alter your robots.txt if this is an issue, or start blocking troublesome IPs.
Is it going high and staying high for a long time, or is it just spiking temporarily?
There aren't going to be specific methods to avoid (other than not writing to disk).
You could try using a profiler in production like NewRelic to get more insight into your performance. A profiler will highlight the actions that are taking a long time, however, and when you examine the specific algorithm you're using in that action, you might discover what's inefficient about that particular action.

Scattered-write speed versus scattered-read speed on modern Intel or AMD CPUs?

I'm thinking of optimizing a program via taking a linear array and writing each element to a arbitrary location (random-like from the perspective of the CPU) in another array. I am only doing simple writes and not reading the elements back.
I understand that a scatted read for a classical CPU can be quite slow as each access will cause a cache miss and thus a processor wait. But I was thinking that a scattered write could technically be fast because the processor isn't waiting for a result, thus it may not have to wait for the transaction to complete.
I am unfortunately unfamiliar with all the details of the classical CPU memory architecture and thus there may be some complications that may cause this also to be quite slow.
Has anyone tried this?
(I should say that I am trying to invert a problem I have. I currently have an linear array from which I am read arbitrary values -- a scattered read -- and it is incredibly slow because of all the cache misses. My thoughts are that I can invert this operation into a scattered write for a significant speed benefit.)
In general you pay a high penalty for scattered writes to addresses which are not already in cache, since you have to load and store an entire cache line for each write, hence FSB and DRAM bandwidth requirements will be much higher than for sequential writes. And of course you'll incur a cache miss on every write (a couple of hundred cycles typically on modern CPUs), and there will be no help from any automatic prefetch mechanism.
I must admit, this sounds kind of hardcore. But I take the risk and answer anyway.
Is it possible to divide the input array into pages, and read/scan each page multiple times. Every pass through the page, you only process (or output) the data that belongs in a limited amount of pages. This way you only get cache-misses at the start of each input page loop.

First Time Architecturing?

I was recently given the task of rebuilding an existing RIA. The new RIA that I've designed is based on Silverlight, with a WCF service to connect to MS SQL Server. This is my first time doing something like this, so I'm not sure how to design the entire thing.
Basically, the client can look through graphs of "stocks" (allowing the client to choose different time periods, settings, etc). I've written the whole application essentially, but I'm not sure how to put it together.
The graphs are supposed to be directly based on the database, and to create the datapoints on the graph, some calculations need to be done (not very expensive ones).
The problem I'm having is to decide where to put the calculations (client or serverside? Or half and half?)
What factors should I look for to help me decide where the calculations should be done? And how can I go about optimizing this (caching, etc)?
Obviously this is a very broad subject, so I'm not expecting an immediate answer, but any help/pointing in the right direction/resources would be appreciated.
A few tips for this kind of app.
Put as much logic as possible on the client.
Make the client responsible for session data, making all your server code stateless.
Try to minimize traffic to and from the server (Bigger requests are more efficient than multiple smaller ones) so consolidate requests when possible.
If this project is likely to grow beyond it's current feature set I think it's probably a good idea to perform the calculations client side. This can avoid scaling issues, because you're using all the client side CPUs ratther than you're single, precious server CPU. This does however rely on being able to transfer the required data to the client in an efficient way, otherwise you replace a processor bottleneck with a network bottleneck.
As for caching it depends on your inputs, what variables can users of the client affect? If any of the variables they can alter are discrete (ie they can be a fixed set of values) then they're candidates for caching. For example if a user can select a date range of stock variations to view then that's probably not so useful, if however they can only select a year then you could cache your data sets by year (download each data set to the client and perform your calculation). I'd not worry about caching too much unless you find it's a real performance problem, it'll only make your code more complex, so don't add it until you have proven you need it.
One other thing, if this project is unlikely to be a long term concern then implement the calculations wherever is easiest and fastest, you can revisit if the project becomes more important later on.
Be REALLY REALLY careful about implementing client-side caching. Caching is INSANELY hard to do right while maintaining performance, security and correctness. Note that your DB Server's caching mechanism is already likely to be way better than any local caching mechanism you're likely to implement in less than 2 weeks' effort!
I would urge you to do as much work on the back-end as possible and to limit your client to render the data in a manner that is appropriate for your users. While many may balk at this suggestion, it's based on a number of observations from building many such systems in the past:
If you're going to filter some of the data returned by your service, you've just wasted thousands of clock cycles shipping data that need never have left your server
If you're going to sort your data, your DB could have done the sorting for you (often using otherwise idle CPU ticks) while waiting for the data to be read from its disks.
Your server most likely has more CPU and RAM available than your clients and has a surprising amount of "free time" to use for sorting, filtering, running inline calculations, etc., while its waiting for disks to read sectors etc.
As Roman suggested: Minimize your round-trips between your client and your server as much as possible.
But perhaps most importantly:
BEFORE YOU START DESIGNING YOUR SYSTEM, state your performance goals
Design what you think will achieve those goals. Try to find bottlenecks in your design, particularly areas where you make blocking calls. Re-design those areas to use async patterns wherever you can.
Build your intended solution
Measure your actual perforamnce under actual real-world load
If you're within your expected performance goals, then you're done.
If not, work out where you're spending too long and tune the design of that portion of the system. Goto 3.
Don't try to build the perfect system in one try - chances are that you won't manage it, no matter how hard you try, for a variety of reasons including user expectations, your servers ability to process the required load, your clients' ability to handle the returned data, your network's ability to carry the traffic, etc.
They're a little old now, but I suggest you read through some of the earlier posts at http://blogs.msdn.com/richardt for more thoughts around designing and constructing Service Oriented and distributed systems.