How to control number of running worker processes for MongoDB? - optimization

Well, as the question simply explains itself, let me clear it up little more.
I am running MongoDB primarily for read-only purposes at back-end. My crons do the writes and they don't really push it hard when they are triggered. Some updates, some new documents etc.
The thing is requests usually do not even hit the application level because of entire page caching handled within MemCached by Nginx. So the application doesn't query database for another hour per page.
But so far as I can see in my process list, there are 21 MongoDB worker processes that are using none of the CPU but reasonably large amount of memory because of the previous queries.
I checked the configuration settings and googled around but couldn't find any answer.. So, is there any way to limit those processes or at least to tell MongoDB reduce/empty its memory usage after a while?

Workers are using for talking to config server and other replica as well apart from just serving user request. This is documented in here.
we can limit net.maxIncomingConnections as par recommendation on this page to limit the number of workers processing user request. But this should be used with precaution as setting this number too low and then sending more concurrent calls will result in some calls being queued.

Related

Deploy REST API over multiple servers world-wide, but stay in sync

I've built a REST API with a pretty decent latency. Each request is answered in ~100 ms with a thousand requests per second. This is however with a relatively low physical distance to the data center. The users of this API would, however, be spread all over the globe. From the US for example (to a data center in Germany), the response time for a single request is ~400 ms under no load.
What would be the best approach to deploying this API? I suspect multiple servers at different locations, with a load balancer in front. How would I ensure that the MySQL database would stay in sync between the servers in that case?
With multiple servers and a load balancer, the costs rise exponentially, which is something I can hopefully afford in the future, but not at the moment.
I'd love to hear your ideas!
Afaik. for big projects people use event sourcing with an event storage and microservices and message queues between them or a basic solution is polling the event storage through a simple REST API something like send me the latest events since the last event I received. If you can accept eventual consistency, then I think this approach can work pretty well. It makes writing somewhat slower, but reading can be very fast with it. No need to sync the MySQL databases directly, you just pull the latest events and use a projection to update the local MySQL database. So the event storage is the single source of truth.

Bigquery streaming inserts taking time

During load testing of our module we found that bigquery insert calls are taking time (3-4 s). I am not sure if this is ok. We are using java biguqery client libarary and on an average we push 500 records per api call. We are expecting a million records per second traffic to our module so bigquery inserts are bottleneck to handle this traffic. Currently it is taking hours to push data.
Let me know if we need more info regarding code or scenario or anything.
Thanks
Pankaj
Since streaming has a limited payload size, see Quota policy it's easier to talk about times, as the payload is limited in the same way to both of us, but I will mention other side effects too.
We measure between 1200-2500 ms for each streaming request, and this was consistent over the last month as you can see in the chart.
We seen several side effects although:
the request randomly fails with type 'Backend error'
the request randomly fails with type 'Connection error'
the request randomly fails with type 'timeout' (watch out here, as only some rows are failing and not the whole payload)
some other error messages are non descriptive, and they are so vague that they don't help you, just retry.
we see hundreds of such failures each day, so they are pretty much constant, and not related to Cloud health.
For all these we opened cases in paid Google Enterprise Support, but unfortunately they didn't resolved it. It seams the recommended option to take for these is an exponential-backoff with retry, even the support told to do so. Which personally doesn't make me happy.
The approach you've chosen if takes hours that means it does not scale, and won't scale. You need to rethink the approach with async processes. In order to finish sooner, you need to run in parallel multiple workers, the streaming performance will be the same. Just having 10 workers in parallel it means time will be 10 times less.
Processing in background IO bound or cpu bound tasks is now a common practice in most web applications. There's plenty of software to help build background jobs, some based on a messaging system like Beanstalkd.
Basically, you needed to distribute insert jobs across a closed network, to prioritize them, and consume(run) them. Well, that's exactly what Beanstalkd provides.
Beanstalkd gives the possibility to organize jobs in tubes, each tube corresponding to a job type.
You need an API/producer which can put jobs on a tube, let's say a json representation of the row. This was a killer feature for our use case. So we have an API which gets the rows, and places them on tube, this takes just a few milliseconds, so you could achieve fast response time.
On the other part, you have now a bunch of jobs on some tubes. You need an agent. An agent/consumer can reserve a job.
It helps you also with job management and retries: When a job is successfully processed, a consumer can delete the job from the tube. In the case of failure, the consumer can bury the job. This job will not be pushed back to the tube, but will be available for further inspection.
A consumer can release a job, Beanstalkd will push this job back in the tube, and make it available for another client.
Beanstalkd clients can be found in most common languages, a web interface can be useful for debugging.

How to share the APC user cache between CLI and Web Server instances?

I am using PHP's APC to store a large amount of information (with apc_fetch(), etc.). This information occasionally needs analyzed and dumped elsewhere.
The story goes, I'm getting several hundred hits/sec. These hits increase various counters (with apc_inc(), and friends). Every hour, I would like to iterate over all the values I've accumulated, and do some other processing with them, and then save them on disk.
I could do this as a random or time-based switch in each request, but it's a potentially long operation (may require 20-30 sec, if not several minutes) and I do not want to hang a request for that long.
I thought a simple PHP cronjob would do the task. However, I can't even get it to read back cahe information.
<?php
print_r(apc_cache_info());
?>
Yeilds a seemingly different APC memory segment, with:
[num_entries] => 1
(The single entry seems to be a opcode cache of itself)
While my webserver, powered by nginx/php5-fpm, yields:
[num_entries] => 3175
So, they are obviously not sharing the same chunk of memory. How can I either access the same chunk of memory in the CLI script (preferred), or if that is simply not possible, what would be the absolute safest way to execute a long running sequence on say, a random HTTP request every hour?
For the latter, would using register_shutdown_function() and immediately set_time_limit(0) and ignore_user_abort(true) do the trick to ensure execution completes and doesn't "hang" anyone's browser?
And yes, I am aware of redis, memcache, etc that would not have this problem, but I am stuck to APC for now as neither could demonstrate the same speed as APC.
This is really a design issue and a matter of selecting preferred costs vs. payoffs.
You are thrilled by the speed of APC since you do not spend time to persist the data. You also want to persist the data but now the performance hit is too big. You have to balance these out somehow.
If persistence is important, take the hit and persist (file, DB, etc.) on every request. If speed is all you care about, change nothing - this whole question becomes moot. There are cache systems with persistent storage that can optimize your disk writes by aggregating what gets written to disk and when but you will generally always have a payoff between the two with varying tipping points. You just have to choose which of those suits your objectives.
There will probably never exist an enduring, wholesome technological solution to the wolf being sated and the lamb being whole.
If you really must do it your way, you could have a cron that CURLs a special request to your application which would trigger persisting your cache to disk. That way you control the request, its timeout, etc., and don't have to worry about everything users might do to kill their requests.
Potential risks in this case, however, are data integrity (as you will be writing the cache to disk while it is being updated by other requests in the meantime) as well as requests being served while you are persisting the cache paying the performance hit of your server being busy.
Essentially, we introduced a bundle of hay to the wolf/lamb dilemma ;)

SQL Server 2005, Caches and all that jazz

Background to question: I'm looking to implement a caching system for my website. Currently we're exploring memcache as a means of doing this. However, I am looking to see if something similar exists for SQL Server. I understand that MySQL has query cache which although is not distributed works as a sort of 'stop gap' measure. Is MySQL query cache equivalent to the buffer cache in SQL Server?
So here are my questions:
Is there a way to know is currently stored in the buffer cache?
Follow up to this, is there a way to force certain tables or result sets into the cache
How much control do I have over what goes on in the buffer and procedure cache? I understand there used to be a DBCC PINTABLE command but that has since been discontinued.
Slightly off topic: Should the caching even exists on the database layer? Or it is more prudent to manage caches using Velocity/Memcache? Is so, why? It seems like cache invalidation is something of a pain when handling many objects with overlapping triggers.
Thanks!
SQL Server implements a buffer pool same way every database product under the sun does (more or less) since System R showed the way. The gory details are explain in Transaction Processing: Concepts and Techniques. I addition it has a caching framework used by the procedure cache, permission token cache and many many other caching classes. This framework is best described in Clock Hands - what are they for.
But this is not the kind of caching applications are usually interested in. The internal database cache is perfect for scale-up scenarios where a more powerfull back end database is able to respond faster to more queries by using these caches, but the modern application stack tends to scale out the web servers and the real problem is caching the results of query interogations in a cache used by the web farm. Ideally, this cache should be shared and distributed. Memcached and Velocity are examples of such application caching infrastructure. Memcache has a long history by now, its uses and shortcommings are understood, there is significant know-how around how to use it, deploy it, manage it and monitor it.
The biggest problem with caching in the application layer, and specially with distributed caching, is cache invalidation. How to detect the changes that occur in the back end data and mark cached entries invalid so that new requests don't use stale data.
The simplest (for some definition of simple...) alternative is proactive invalidation from the application. The code knows when it changes an entity in the database, and after the change occurs it takes the extra step to mark the cached entries invalid. This has several short commings:
Is difficult to know exactly which cached entries are to be invalidated. Dependencies can be quite complex, things are always more that just a simple table/entry, there are aggregate queries, joins, partitioned data etc etc.
Code discipline is required to ensure all paths that modify data also invalidate the cache.
Changes to the data that occur outside the application scope are not detected. In practice, there are always changes that occur outside the application scope: other applications using the same data, import/export and ETL jobs, manual intervention etc etc.
A more complicated alternative is a cache that is notified by the database itself when changes occur. Not many technologies are around to support this though, it cannot work without an active support from the database. SQL Server has Query Notifications for such scenarios, you can read more about it at The Mysterious Notification. Implementing QN based caching in a standalone application is fairly complicated (and often done badly) but it works fine when implemented correctly. Doing so in a shared scaled out cache like Memcached is quite a feats of strength, but is doable.
Nai,
Answers to your questions follow:
From Wiki - Always correct... ? :-). For a more Microsoft answer, here is their description on Buffer Cache.
Buffer management
SQL Server buffers pages in RAM to
minimize disc I/O. Any 8 KB page can
be buffered in-memory, and the set of
all pages currently buffered is called
the buffer cache. The amount of memory
available to SQL Server decides how
many pages will be cached in memory.
The buffer cache is managed by the
Buffer Manager. Either reading from or
writing to any page copies it to the
buffer cache. Subsequent reads or
writes are redirected to the in-memory
copy, rather than the on-disc version.
The page is updated on the disc by the
Buffer Manager only if the in-memory
cache has not been referenced for some
time. While writing pages back to
disc, asynchronous I/O is used whereby
the I/O operation is done in a
background thread so that other
operations do not have to wait for the
I/O operation to complete. Each page
is written along with its checksum
when it is written. When reading the
page back, its checksum is computed
again and matched with the stored
version to ensure the page has not
been damaged or tampered with in the
meantime.
For this answer, please refer to the above answer:
Either reading from or writing to any page copies it to the buffer cache. Subsequent reads or writes are redirected to the in-memory copy, rather than the on-disc version.
You can query the bpool_commit_target and bpool_committed columns in the sys.dm_os_sys_info catalog view to return the number of pages reserved as the memory target and the number of pages currently committed in the buffer cache, respectively.
I feel like Microsoft has had time to figure out caching for their product and should be trusted.
I hope this information was helpful,
Thanks!
Caching can take many different meaning for an ASP.Net application spread from the browser all the way to your hardware with the IIS, Application, Database thrown in the middle.
The caching you are talking about is Database level caching, this is mostly transparent to your application. This level of caching will include buffer pools, statement caches etc. Make sure your DB server has plenty of RAM. In theory a DB server should be able to load the entire DB store in memory. There is not much you can do at this level unless you pre-fetch some anticipated data when you start the application and ensure that it is in DB cache.
On the other hand is in-memory distributed caching system. Apart from memcache and velocity, you can look at some commercial solutions like NCache or Oracle Coherence. I have no experience in either of them to recommend. This level of caching promises scalability at a cheaper cost. It is expensive to scale the DB tier compared to this. You may have to consider aspects like network bandwidth though. This type of caching, specially with invalidation and expiry can be complicated
You can cache at Web Service tier using output caching at IIS level (in IIS 7) and ASP.Net level.
At the application level you can use ASP.Net cache. This is the one that you can control most and gives you good benefits.
Then there is caching going on at client web proxy tier that can be controlled by cache-control HTTP header.
Finally you have browser level caching, view state and cookies for small data.
And don't forget that hardware like SAN caches at physical disk access level too.
In summary caching can occur at many levels and it for you to analyse and implement the best solution for your scenario. You have find out stability and volatility of your data, expected load etc. I believe caching at ASP.Net level (specially for objects) gives you most flexibility and control.
Your specific technical questions about SQL Server's buffer cache are going down the wrong path when it comes to "implement a caching system for my website".
Sure, SQL Server is going to cache data so it can improve its performance (and it does so rather well), but the point of implementing a caching layer on your web front-ends is to avoid from having to talk to the database at all - because there is still overhead and resource contention even when your query is fulfilled entirely from SQL Server's cache.
You want to be looking into is: memcached, Velocity, ASP.NET Cache, P&P Caching Application Block, etc.

Index replication and Load balancing

Am using Lucene API in my web portal which is going to have 1000s of concurrent users.
Our web server will call Lucene API which will be sitting on an app server.We plan to use 2 app servers for load balancing.
Given this, what should be our strategy for replicating lucene indexes on the 2nd app server?any tips please?
You could use solr, which contains built in replication. This is possibly the best and easiest solution, since it probably would take quite a lot of work to implement your own replication scheme.
That said, I'm about to do exactly that myself, for a project I'm working on. The difference is that since we're using PHP for the frontend, we've implemented lucene in a socket server that accepts queries and returns a list of db primary keys. My plan is to push changes to the server and store them in a queue, where I'll first store them into the the memory index, and then flush the memory index to disk when the load is low enough.
Still, it's a complex thing to do and I'm set on doing quite a lot of work before we have a stable final solution that's reliable enough.
From experience, Lucene should have no problem scaling to thousands of users. That said, if you're only using your second App server for load balancing and not for fail over situations, you should be fine hosting Lucene on only one of those servers and accessing it via NDS (if you have a unix environment) or shared directory (in windows environment) from the second server.
Again, this is dependent on your specific situation. If you're talking about having millions (5 or more) of documents in your index and needing your lucene index to be failoverable, you may want to look into Solr or Katta.
We are working on a similar implementation to what you are describing as a proof of concept. What we see as an end-product for us consists of three separate servers to accomplish this.
There is a "publication" server, that is responsible for generating the indices that will be used. There is a service implementation that handles the workflows used to build these indices, as well as being able to signal completion (a custom management API exposed via WCF web services).
There are two "site-facing" Lucene.NET servers. Access to the API is provided via WCF Services to the site. They sit behind a physical load balancer and will periodically "ping" the publication server to see if there is a more current set of indicies than what is currently running. If it is, it requests a lock from the publication server and updates the local indices by initiating a transfer to a local "incoming" folder. Once there, it is just a matter of suspending the searcher while the index is attached. It then releases its lock and the other server is available to do the same.
Like I said, we are only approaching the proof of concept stage with this, as a replacement for our current solution, which is a load balanced Endeca cluster. The size of the indices and the amount of time it will take to actually complete the tasks required are the larger questions that have yet to be proved out.
Just some random things that we are considering:
The downtime of a given server could be reduced if two local folders are used on each machine receiving data to achieve a "round-robin" approach.
We are looking to see if the load balancer allows programmatic access to have a node remove and add itself from the cluster. This would lessen the chance that a user experiences a hang if he/she accesses during an update.
We are looking at "request forwarding" in the event that cluster manipulation is not possible.
We looked at solr, too. While a lot of it just works out of the box, we have some bench time to explore this path as a learning exercise - learning things like Lucene.NET, improving our WF and WCF skills, and implementing ASP.NET MVC for a management front-end. Worst case scenario, we go with something like solr, but have gained experience in some skills we are looking to improve on.
I'm creating the Indices on the publishing Backend machines into the filesystem and replicate those over to the marketing.
That way every single, load & fail balanced, node has it's own index without network latency.
Only drawback is, you shouldn't try to recreate the index within the replicated folder, as you'll have the lockfile lying around at every node, blocking the indexreader until your reindex finished.