How to share the APC user cache between CLI and Web Server instances? - apc

I am using PHP's APC to store a large amount of information (with apc_fetch(), etc.). This information occasionally needs analyzed and dumped elsewhere.
The story goes, I'm getting several hundred hits/sec. These hits increase various counters (with apc_inc(), and friends). Every hour, I would like to iterate over all the values I've accumulated, and do some other processing with them, and then save them on disk.
I could do this as a random or time-based switch in each request, but it's a potentially long operation (may require 20-30 sec, if not several minutes) and I do not want to hang a request for that long.
I thought a simple PHP cronjob would do the task. However, I can't even get it to read back cahe information.
<?php
print_r(apc_cache_info());
?>
Yeilds a seemingly different APC memory segment, with:
[num_entries] => 1
(The single entry seems to be a opcode cache of itself)
While my webserver, powered by nginx/php5-fpm, yields:
[num_entries] => 3175
So, they are obviously not sharing the same chunk of memory. How can I either access the same chunk of memory in the CLI script (preferred), or if that is simply not possible, what would be the absolute safest way to execute a long running sequence on say, a random HTTP request every hour?
For the latter, would using register_shutdown_function() and immediately set_time_limit(0) and ignore_user_abort(true) do the trick to ensure execution completes and doesn't "hang" anyone's browser?
And yes, I am aware of redis, memcache, etc that would not have this problem, but I am stuck to APC for now as neither could demonstrate the same speed as APC.

This is really a design issue and a matter of selecting preferred costs vs. payoffs.
You are thrilled by the speed of APC since you do not spend time to persist the data. You also want to persist the data but now the performance hit is too big. You have to balance these out somehow.
If persistence is important, take the hit and persist (file, DB, etc.) on every request. If speed is all you care about, change nothing - this whole question becomes moot. There are cache systems with persistent storage that can optimize your disk writes by aggregating what gets written to disk and when but you will generally always have a payoff between the two with varying tipping points. You just have to choose which of those suits your objectives.
There will probably never exist an enduring, wholesome technological solution to the wolf being sated and the lamb being whole.
If you really must do it your way, you could have a cron that CURLs a special request to your application which would trigger persisting your cache to disk. That way you control the request, its timeout, etc., and don't have to worry about everything users might do to kill their requests.
Potential risks in this case, however, are data integrity (as you will be writing the cache to disk while it is being updated by other requests in the meantime) as well as requests being served while you are persisting the cache paying the performance hit of your server being busy.
Essentially, we introduced a bundle of hay to the wolf/lamb dilemma ;)

Related

How can I deal with the webserver UI of one machine being out of sync with backend/API of another?

The system my company sells is software for a multi-machine solution. In some cases, there is a UI on one of the machines and a backend/API on another. These systems communicate and both use their own clocks for various operations and storage values.
When the UI's system clock gets ahead of the backend by 30 seconds or more, the queries start to misbehave due to the UI's timestamp being sent over as key information to the REST request. There is a "what has been updated by me" query that happens every 30 seconds and the desync will cause the updated data to be missed since they are outside the timing window.
Since I do not have any control over the systems that my software is installed on, I need a solution on my code's side. I can't force customers to keep their clocks in sync.
Possible solutions I have considered:
The UI can query the backend for it's system time and cache that.
The backend/API can reach back further in time when looking for updates. This will give the clocks some room to slip around, but will cause a much heavier query load on systems with large sets of data.
Any ideas?
Your best bet is to restructure your API somewhat.
First, even though NTP is a good idea, you can't actually guarantee it's in use. Additionally, even when it is enabled, OSs (Windows at least) may reject packets that are too far out of sync, to prevent certain attacks (on the order of minutes, though).
When dealing with distributed services like this, the mantra is "do not trust the client". This applies even when you actually control the client, too, and doesn't necessarily mean the client is attempting anything malicious - it just means that the client isn't the authoritative source.
This should include timestamps.
Consider; the timestamps are a problem here because you're trying to use the client's time to query the server - except, we shouldn't trust the client. Instead, what we should do is have the server return a timestamp of when the request was processed, or the update stamp for the latest entry of the database, that can be used in subsequent queries to retrieve new updates (how far back you go on initial query is up to you).
Dealing with concurrent updates safely is a little harder, and depends on what is supposed to happen on collision. There's nothing really different here from most of the questions and answers dealing with database-centric versions of the problem, I'm just mentioning it to note you may need to add extra fields to your API to correctly handle or detect the situation, if you haven't already.

Does Redis persist data?

I understand that Redis serves all data from memory, but does it persist as well across server reboot so that when the server reboots it reads into memory all the data from disk. Or is it always a blank store which is only to store data while apps are running with no persistence?
I suggest you read about this on http://redis.io/topics/persistence . Basically you lose the guaranteed persistence when you increase performance by using only in-memory storing. Imagine a scenario where you INSERT into memory, but before it gets persisted to disk lose power. There will be data loss.
Redis supports so-called "snapshots". This means that it will do a complete copy of whats in memory at some points in time (e.g. every full hour). When you lose power between two snapshots, you will lose the data from the time between the last snapshot and the crash (doesn't have to be a power outage..). Redis trades data safety versus performance, like most NoSQL-DBs do.
Most NoSQL-databases follow a concept of replication among multiple nodes to minimize this risk. Redis is considered more a speedy cache instead of a database that guarantees data consistency. Therefore its use cases typically differ from those of real databases:
You can, for example, store sessions, performance counters or whatever in it with unmatched performance and no real loss in case of a crash. But processing orders/purchase histories and so on is considered a job for traditional databases.
Redis server saves all its data to HDD from time to time, thus providing some level of persistence.
It saves data in one of the following cases:
automatically from time to time
when you manually call BGSAVE command
when redis is shutting down
But data in redis is not really persistent, because:
crash of redis process means losing all changes since last save
BGSAVE operation can only be performed if you have enough free RAM (the amount of extra RAM is equal to the size of redis DB)
N.B.: BGSAVE RAM requirement is a real problem, because redis continues to work up until there is no more RAM to run in, but it stops saving data to HDD much earlier (at approx. 50% of RAM).
For more information see Redis Persistence.
It is a matter of configuration. You can have none, partial or full persistence of your data on Redis. The best decision will be driven by the project's technical and business needs.
According to the Redis documentation about persistence you can set up your instance to save data into disk from time to time or on each query, in a nutshell. They provide two strategies/methods AOF and RDB (read the documentation to see details about then), you can use each one alone or together.
If you want a "SQL like persistence", they have said:
The general indication is that you should use both persistence methods if you want a degree of data safety comparable to what PostgreSQL can provide you.
The answer is generally yes, however a fuller answer really depends on what type of data you're trying to store. In general, the more complete short answer is:
Redis isn't the best fit for persistent storage as it's mainly performance focused
Redis is really more suitable for reliable in-memory storage/cacheing of current state data, particularly for allowing scalability by providing a central source for data used across multiple clients/servers
Having said this, by default Redis will persist data snapshots at a periodic interval (apparently this is every 1 minute, but I haven't verified this - this is described by the article below, which is a good basic intro):
http://qnimate.com/redis-permanent-storage/
TL;DR
From the official docs:
RDB persistence [the default] performs point-in-time snapshots of your dataset at specified intervals.
AOF persistence [needs to be explicitly configured] logs every write operation received by the server, that will be played again at server startup, reconstructing the
original dataset.
Redis must be explicitly configured for AOF persistence, if this is required, and this will result in a performance penalty as well as growing logs. It may suffice for relatively reliable persistence of a limited amount of data flow.
You can choose no persistence at all.Better performance but all the data lose when Redis shutting down.
Redis has two persistence mechanisms: RDB and AOF.RDB uses a scheduler global snapshooting and AOF writes update to an apappend-only log file similar to MySql.
You can use one of them or both.When Redis reboots,it constructes data from reading the RDB file or AOF file.
All the answers in this thread are talking about the possibility of redis to persist the data: https://redis.io/topics/persistence (Using AOF + after every write (change)).
It's a great link to get you started, but it is defenently not showing you the full picture.
Can/Should You Really Persist Unrecoverable Data/State On Redis?
Redis docs does not talk about:
Which redis providers support this (AOF + after every write) option:
Almost none of them - redis labs on the cloud does NOT provide this option. You may need to buy the on-premise version of redis-labs to support it. As not all companies are willing to go on-premise, then they will have a problem.
Other Redis Providers does not specify if they support this option at all. AWS Cache, Aiven,...
AOF + after every write - This option is slow. you will have to test it your self on your production hardware to see if it fits your requirements.
Redis enterpice provide this option and from this link: https://redislabs.com/blog/your-cloud-cant-do-that-0-5m-ops-acid-1msec-latency/ let's see some banchmarks:
1x x1.16xlarge instance on AWS - They could not achieve less than 2ms latency:
where latency was measured from the time the first byte of the request arrived at the cluster until the first byte of the ‘write’ response was sent back to the client
They had additional banchmarking on a much better harddisk (Dell-EMC VMAX) which results < 1ms operation latency (!!) and from 70K ops/sec (write intensive test) to 660K ops/sec (read intensive test). Pretty impresive!!!
But it defenetly required a (very) skilled devops to help you create this infrastructure and maintain it over time.
One could (falsy) argue that if you have a cluster of redis nodes (with replicas), now you have full persistency. this is false claim:
All DBs (sql,non-sql,redis,...) have the same problem - For example, running set x 1 on node1, how much time it takes for this (or any) change to be made in all the other nodes. So additional reads will receive the same output. well, it depends on alot of fuctors and configurations.
It is a nightmare to deal with inconsistency of a value of a key in multiple nodes (any DB type). You can read more about it from Redis Author (antirez): http://antirez.com/news/66. Here is a short example of the actual ngihtmare of storing a state in redis (+ a solution - WAIT command to know how much other redis nodes received the latest change change):
def save_payment(payment_id)
redis.rpush(payment_id,”in progress”) # Return false on exception
if redis.wait(3,1000) >= 3 then
redis.rpush(payment_id,”confirmed”) # Return false on exception
if redis.wait(3,1000) >= 3 then
return true
else
redis.rpush(payment_id,”cancelled”)
return false
end
else
return false
end
The above example is not suffeint and has a real problem of knowing in advance how much nodes there actually are (and alive) at every moment.
Other DBs will have the same problem as well. Maybe they have better APIs but the problem still exists.
As far as I know, alot of applications are not even aware of this problem.
All in all, picking more dbs nodes is not a one click configuration. It involves alot more.
To conclude this research, what to do depends on:
How much devs your team has (so this task won't slow you down)?
Do you have a skilled devops?
What is the distributed-system skills in your team?
Money to buy hardware?
Time to invest in the solution?
And probably more...
Many Not well-informed and relatively new users think that Redis is a cache only and NOT an ideal choice for Reliable Persistence.
The reality is that the lines between DB, Cache (and many more types) are blurred nowadays.
It's all configurable and as users/engineers we have choices to configure it as a cache, as a DB (and even as a hybrid).
Each choice comes with benefits and costs. And this is NOT an exception for Redis but all well-known Distributed systems provide options to configure different aspects (Persistence, Availability, Consistency, etc). So, if you configure Redis in default mode hoping that it will magically give you highly reliable persistence then it's team/engineer fault (and NOT that of Redis).
I have discussed these aspects in more detail on my blog here.
Also, here is a link from Redis itself.

How can we setup DB and ORM for the absence of Data Consistency requierement?

Imagine we have a web-site which sends write and read requests into some DB via Hibernate. I use Java, but it doesn't matter for this question.
Usually we want to read the fresh data from DB. But I want to introduce some delay between the written data becomes visible to reads just to increase the performance. I.e. I dont need to "publish" the rows inserted into DB immediately. Its OK for me to "publish" fresh data after some delay.
How can I achieve it?
As far as I understand this can be set up on several different tiers of my system.
I can cache some requests in front-end. Probably I should set up proxy server for this. But this will work only if all the parameters of the query match.
I can cache the read requests in Hibernate. OK, but can I specify or estimate the average time the read query will return stale data after some fresh insert occurred? In other words how can I control the delay time between fresh data becomes visible to the users?
Or may be I should use something like a memcached system instead of Hibernate cache?
Probably I can set something in DB. I dont know what should I do with DB. Probably I can ease the isolation level to burst the performance of my DB.
So, which way is the best one?
And the main question, of course: does the relaxation of requirements I introduce here may REALLY help to increase the performance of my system?
If I am reading your architecture correct you have client -> server -> database server
Answers to each point
This will put the burden on the client to implement the caching if you only use your own client I would go for this method. It will have the side effect of improving client performance possibly and put less load on the server and database server so they will scale better.
Now caching on the server will improve scalability of the database server and possibly performance in the client but will put a memory burden on the server. This would be my second option
Implement something in the database. At this point what are you gaining? the database server still has to do work to determine what rows to send back. And also you will get no scalability benefits.
So to sum up I would cache at the client first if you can if not cache at the server. Leave the DB out of the loop.
To answer your main question - caching is one of the most effective ways of increasing both performance and scalability of web applications which are constrained by database performance - your application may or may not fall into this category.
In general, I'd recommend setting up a load testing rig, and measure the various parts of your app to identify the bottleneck before starting to optimize.
The most effective cache is one outside your system - a CDN or the user's browser. Read up on browser caching, and see if there's anything you can cache locally. Browsers have caching built in as a standard feature - you control them via HTTP headers. These caches are very effective, because they stop requests even reaching your infrastructure; they are very efficient for static web assets like images, javascript files or stylesheets. I'd consider a proxy server to be in the same category. The major drawback is that it's hard to manage this cache - once you've said to the browser "cache this for 2 weeks", refreshing it is hard.
The next most effective caching layer is to cache (parts of) web pages on your application server. If you can do this, you avoid both the cost of rendering the page, and the cost of retrieving data from the database. Different web frameworks have different solutions for this.
Next, you can cache at the ORM level. Hibernate has a pretty robust implementation, and it provides a lot of granularity in your cache strategies. This article shows a sample implementation, including how to control the expiration time. You get a lot of control over caching here - you can specify the behaviour at the table level, so you can cache "lookup" data for days, and "transaction" data for seconds.
The database already implements a cache "under the hood" - it will load frequently used data into memory, for instance. In some applications, you can further improve the database performance by "de-normalizing" complex data - so the import routine might turn a complex data structure into a simple one. This does trade of data consistency and maintainability against performance.

How to control number of running worker processes for MongoDB?

Well, as the question simply explains itself, let me clear it up little more.
I am running MongoDB primarily for read-only purposes at back-end. My crons do the writes and they don't really push it hard when they are triggered. Some updates, some new documents etc.
The thing is requests usually do not even hit the application level because of entire page caching handled within MemCached by Nginx. So the application doesn't query database for another hour per page.
But so far as I can see in my process list, there are 21 MongoDB worker processes that are using none of the CPU but reasonably large amount of memory because of the previous queries.
I checked the configuration settings and googled around but couldn't find any answer.. So, is there any way to limit those processes or at least to tell MongoDB reduce/empty its memory usage after a while?
Workers are using for talking to config server and other replica as well apart from just serving user request. This is documented in here.
we can limit net.maxIncomingConnections as par recommendation on this page to limit the number of workers processing user request. But this should be used with precaution as setting this number too low and then sending more concurrent calls will result in some calls being queued.

SQL Server 2005, Caches and all that jazz

Background to question: I'm looking to implement a caching system for my website. Currently we're exploring memcache as a means of doing this. However, I am looking to see if something similar exists for SQL Server. I understand that MySQL has query cache which although is not distributed works as a sort of 'stop gap' measure. Is MySQL query cache equivalent to the buffer cache in SQL Server?
So here are my questions:
Is there a way to know is currently stored in the buffer cache?
Follow up to this, is there a way to force certain tables or result sets into the cache
How much control do I have over what goes on in the buffer and procedure cache? I understand there used to be a DBCC PINTABLE command but that has since been discontinued.
Slightly off topic: Should the caching even exists on the database layer? Or it is more prudent to manage caches using Velocity/Memcache? Is so, why? It seems like cache invalidation is something of a pain when handling many objects with overlapping triggers.
Thanks!
SQL Server implements a buffer pool same way every database product under the sun does (more or less) since System R showed the way. The gory details are explain in Transaction Processing: Concepts and Techniques. I addition it has a caching framework used by the procedure cache, permission token cache and many many other caching classes. This framework is best described in Clock Hands - what are they for.
But this is not the kind of caching applications are usually interested in. The internal database cache is perfect for scale-up scenarios where a more powerfull back end database is able to respond faster to more queries by using these caches, but the modern application stack tends to scale out the web servers and the real problem is caching the results of query interogations in a cache used by the web farm. Ideally, this cache should be shared and distributed. Memcached and Velocity are examples of such application caching infrastructure. Memcache has a long history by now, its uses and shortcommings are understood, there is significant know-how around how to use it, deploy it, manage it and monitor it.
The biggest problem with caching in the application layer, and specially with distributed caching, is cache invalidation. How to detect the changes that occur in the back end data and mark cached entries invalid so that new requests don't use stale data.
The simplest (for some definition of simple...) alternative is proactive invalidation from the application. The code knows when it changes an entity in the database, and after the change occurs it takes the extra step to mark the cached entries invalid. This has several short commings:
Is difficult to know exactly which cached entries are to be invalidated. Dependencies can be quite complex, things are always more that just a simple table/entry, there are aggregate queries, joins, partitioned data etc etc.
Code discipline is required to ensure all paths that modify data also invalidate the cache.
Changes to the data that occur outside the application scope are not detected. In practice, there are always changes that occur outside the application scope: other applications using the same data, import/export and ETL jobs, manual intervention etc etc.
A more complicated alternative is a cache that is notified by the database itself when changes occur. Not many technologies are around to support this though, it cannot work without an active support from the database. SQL Server has Query Notifications for such scenarios, you can read more about it at The Mysterious Notification. Implementing QN based caching in a standalone application is fairly complicated (and often done badly) but it works fine when implemented correctly. Doing so in a shared scaled out cache like Memcached is quite a feats of strength, but is doable.
Nai,
Answers to your questions follow:
From Wiki - Always correct... ? :-). For a more Microsoft answer, here is their description on Buffer Cache.
Buffer management
SQL Server buffers pages in RAM to
minimize disc I/O. Any 8 KB page can
be buffered in-memory, and the set of
all pages currently buffered is called
the buffer cache. The amount of memory
available to SQL Server decides how
many pages will be cached in memory.
The buffer cache is managed by the
Buffer Manager. Either reading from or
writing to any page copies it to the
buffer cache. Subsequent reads or
writes are redirected to the in-memory
copy, rather than the on-disc version.
The page is updated on the disc by the
Buffer Manager only if the in-memory
cache has not been referenced for some
time. While writing pages back to
disc, asynchronous I/O is used whereby
the I/O operation is done in a
background thread so that other
operations do not have to wait for the
I/O operation to complete. Each page
is written along with its checksum
when it is written. When reading the
page back, its checksum is computed
again and matched with the stored
version to ensure the page has not
been damaged or tampered with in the
meantime.
For this answer, please refer to the above answer:
Either reading from or writing to any page copies it to the buffer cache. Subsequent reads or writes are redirected to the in-memory copy, rather than the on-disc version.
You can query the bpool_commit_target and bpool_committed columns in the sys.dm_os_sys_info catalog view to return the number of pages reserved as the memory target and the number of pages currently committed in the buffer cache, respectively.
I feel like Microsoft has had time to figure out caching for their product and should be trusted.
I hope this information was helpful,
Thanks!
Caching can take many different meaning for an ASP.Net application spread from the browser all the way to your hardware with the IIS, Application, Database thrown in the middle.
The caching you are talking about is Database level caching, this is mostly transparent to your application. This level of caching will include buffer pools, statement caches etc. Make sure your DB server has plenty of RAM. In theory a DB server should be able to load the entire DB store in memory. There is not much you can do at this level unless you pre-fetch some anticipated data when you start the application and ensure that it is in DB cache.
On the other hand is in-memory distributed caching system. Apart from memcache and velocity, you can look at some commercial solutions like NCache or Oracle Coherence. I have no experience in either of them to recommend. This level of caching promises scalability at a cheaper cost. It is expensive to scale the DB tier compared to this. You may have to consider aspects like network bandwidth though. This type of caching, specially with invalidation and expiry can be complicated
You can cache at Web Service tier using output caching at IIS level (in IIS 7) and ASP.Net level.
At the application level you can use ASP.Net cache. This is the one that you can control most and gives you good benefits.
Then there is caching going on at client web proxy tier that can be controlled by cache-control HTTP header.
Finally you have browser level caching, view state and cookies for small data.
And don't forget that hardware like SAN caches at physical disk access level too.
In summary caching can occur at many levels and it for you to analyse and implement the best solution for your scenario. You have find out stability and volatility of your data, expected load etc. I believe caching at ASP.Net level (specially for objects) gives you most flexibility and control.
Your specific technical questions about SQL Server's buffer cache are going down the wrong path when it comes to "implement a caching system for my website".
Sure, SQL Server is going to cache data so it can improve its performance (and it does so rather well), but the point of implementing a caching layer on your web front-ends is to avoid from having to talk to the database at all - because there is still overhead and resource contention even when your query is fulfilled entirely from SQL Server's cache.
You want to be looking into is: memcached, Velocity, ASP.NET Cache, P&P Caching Application Block, etc.