I'm beginning a new project using CakePHP. I like the "auto-magic" features and I think it's a good fit for the project. I'm wondering about the potential to scale CakePHP to several million IP hits a day and hundreds of thousands of database reads and writes a day, with roughly 50,000 to 500,000 users and often around 3,000 of them using the site concurrently. I'm making heavy use of stored procedures to offset some of this, and the application runs across several servers behind a load balancer.
I'm wondering about the computational cost of some of the auto-magic, and how well Cake copes with requests whose sessions make many database hits. Has anyone had success running Cake on a single server-array setup at this level of traffic? I'm not using the cloud or a distributed database (yet). I'm really worried about potential bottlenecks in the framework, so I'm interested in advice from anyone who has worked with Cake in production. I've researched this, but I would love a second opinion. Thank you for your time.
This is not a problem in itself, but the optimization is up to you.
There are different cache methods you can implement: Memcache, Redis, full-page caching... All of that is already supported by Cake. What you cache, and where, is up to you.
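For example (a minimal sketch in CakePHP 2.x style; the engine choices, host and durations are placeholders you would adapt), cache configurations are declared in app/Config/bootstrap.php and then used explicitly in your code:

```php
<?php
// app/Config/bootstrap.php (CakePHP 2.x) -- illustrative only, adjust to your setup.

// Default cache used by the framework itself (model descriptions, etc.).
Cache::config('default', array(
    'engine'   => 'File',
    'duration' => '+1 hours',
    'prefix'   => 'cake_default_',
));

// A shared cache for your own hot data, e.g. backed by Memcache.
Cache::config('hot_data', array(
    'engine'   => 'Memcache',                // or 'Redis' if that engine/extension is available
    'servers'  => array('127.0.0.1:11211'),  // placeholder host
    'duration' => '+10 minutes',
    'prefix'   => 'cake_hot_',
));
```

In a controller or model you then write and read against that configuration, e.g. `Cache::write('expensive_result', $rows, 'hot_data')` and `Cache::read('expensive_result', 'hot_data')`, so the expensive query only runs on a cache miss.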
For search, you could try Elasticsearch to speed things up.
There are dispatcher filters that run before dispatch and can bypass controller instantiation entirely (you might want to do that in special cases; check the built-in asset filter for an example).
Use nginx, not Apache.
Also, I would not start by over-optimizing and over-thinking this before any code is written. Start well and think about caching, but when you start to come across bottlenecks, analyse and fix them. Otherwise you'll waste a lot of time on premature optimization before you've written anything that works.
Cake itself is very fast. Just to prove the bullshit factor of the fancy benchmarks some frameworks publish, we did one of our own using a dispatcher filter to "optimize" Cake and even beat Yii, which seems pretty eager to show how fast it is. But benchmarks are pointless, especially in a huge project where so much human error can be introduced.
I've been playing with ImageResizer for a bit now, and while trying to do something with it I'm having trouble understanding the right way to go about it.
Mainly I would like to stick to the idea of using the pipeline, and not trying to cheat it.
So... let's say I use ImageResizer in a pretty standard way, for something like:
giants_logo.jpg?w=280&h=100
The file is giants_logo.jpg.
The request being processed is for a resized version with 'w=280&h=100'.
In a clustered environment, what will happen is that this same request may be served by 3 machines. All 3 would end up doing the resize and then storing their cached version in a local folder on disk. I could leverage a shared drive or something, but that has its own limitations.
What I am looking to do, is get the processed file, and then copy it back up to the DB or S3 where the main images are served from.
My thought is... I might have to write something like DiskCache, but with completely different guts, using the DB or S3 as the back end instead of the file system.
I realize the point of caching is speed, and what I am suggesting negates that aspect... but maybe that's not the case if we layer things.
Anyway, what I am focused on is keeping track of the files generated, as well as avoiding the same processing on multiple servers.
Any thoughts on the route I should look at to accomplish this?
TL;DR: When DiskCache actually stops working well (usually between 1 and 20 million unique images), switch to a CDN (unless it's too expensive) or a reverse proxy (unless your data set is really too huge to be bound by mortal infrastructure).
For petabyte data sets on the cheap when performance isn't king, it's a good plan. But for most people, it's premature. Even users with upwards of 20TB (source images) still use DiskCache. Really. Terabyte drives are cheap.
Latency is the killer.
To make this work you would need a central Redis server. MSSQL won't cut it (at least not on a VM or commodity hardware, we've tried). Given a Redis server, you can track what is done and stored (and perhaps even what is in progress, to de-duplicate effort in real time, as DiskCache does).
If you can track it, you can reuse it, and you can delete it. Reuse will be slower, since you're doubling the network traffic, moving the result twice. (But also decreasing it linearly with the number of servers in the cluster for source image fetches).
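An actual ImageResizer plugin would be written in C#, but the coordination idea is language-neutral. Here is a rough sketch (in PHP with the phpredis extension, since that's the stack discussed elsewhere on this page); the key names, TTLs and blob URL are made up, and it only illustrates claiming the work with SETNX and recording where the rendered variant ended up:

```php
<?php
// Rough, hypothetical sketch of cluster-wide de-duplication via a central Redis server.
$redis = new Redis();
$redis->connect('redis.internal', 6379);              // placeholder host

$variant = sha1('giants_logo.jpg?w=280&h=100');        // identifies source image + transform

// Already rendered somewhere? Redirect to that blob instead of resizing again.
$blobUrl = $redis->get("img:done:$variant");
if ($blobUrl !== false) {
    header("Location: $blobUrl", true, 302);
    exit;
}

// Claim the work so the other nodes don't duplicate the resize.
if ($redis->setnx("img:busy:$variant", gethostname())) {
    $redis->expire("img:busy:$variant", 60);           // let the claim lapse if this node dies

    // ... do the resize locally and upload the result to S3 / blob storage ...
    $blobUrl = 'https://example-bucket.s3.amazonaws.com/cache/' . $variant . '.jpg';

    $redis->set("img:done:$variant", $blobUrl);
    $redis->del("img:busy:$variant");
} else {
    // Another node is already rendering this variant: poll briefly, or just resize locally anyway.
}
```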
If bandwidth saturation is your bottleneck (very common), this could make performance worse. In fact, unless your workload is write- and CPU-heavy, you'll likely see worse performance than you would with duplicated CPU effort and individual disk caches.
If you have the infrastructure to test it, put DiskCache on a SAN or shared drive; this will give you a solid estimate of the performance you can expect (assuming said drive and your blob storage system have comparable IO perf).
However, it's a fair amount of work, and you're essentially duplicating a subset of the functionality of reverse proxy (but with worse performance, since every response has to be proxied through the unlucky cluster server, instead of being spooled directly from disk).
CDNs and Reverse proxies to the rescue
Amazon CloudFront or Varnish can serve quite well as reverse proxies/caches for a web farm or cluster. Now, you'll have a bit less control over the 'garbage collection' process, but... also less code to maintain.
There's also ARR, but I've heard neither success nor failure stories about it.
But it sounds fun!
Send me a Github link and I'll help out.
I'd love to get a Redis-coordinated, cloud-agnostic poor-man's blob cache system out there. You bring the petabytes and infrastructure, I'll help you with the integration and troublesome bits. Efficient HTTP proxying is probably the hardest part; the rest is state management and basic threading.
You might want to have a look at a modified AzureReader2 plugin at https://github.com/orbyone/Sensible.ImageResizer.Plugins.AzureReader2
This implementation stores the transformed image back to the Azure blob container on the initial requests, so subsequent requests are redirected to that copy.
I have been hoping to find out what different server setups equate to in theory for concurrent page requests, and the answer always seems to be soaked in voodoo and sorcery. What is the approximation of max concurrent page requests for the following setups?
apache+php+mysql (1 server)
apache+php+mysql+caching (memcached or similar; still 1 server)
apache+php+mysql+caching+dedicated database server (2 servers)
apache+php+mysql+caching+dedicated DB+load balancing (multiple web servers / single DB server)
apache+php+mysql+caching+dedicated DB+load balancing (multiple web servers / multiple DB servers)
+distributed (Amazon elastic cloud) -- I know this one is "as much as you can afford", but it would be nice to know when to move to it.
I appreciate any constructive criticism; I am just trying to figure out when it's time to move from one implementation to the next, because each comes with its own implementation effort, either programming-wise or setup-wise.
In your question you talk about caching, and this is probably one of the most important factors in a web architecture with regard to performance and capacity.
Memcache is useful, but before that you should be ensuring proper HTTP cache directives on your server responses. This does two things: it reduces the number of requests and it speeds up server response times (if you have Apache configured correctly). This can be improved further by using an HTTP accelerator like Varnish and a CDN.
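As a minimal illustration in PHP (the max-age and the render_page() helper are placeholders), proper cache headers plus a validator let browsers, Varnish and a CDN answer repeat requests without ever reaching PHP or MySQL:

```php
<?php
// Minimal sketch of a cacheable response with an ETag validator; the TTL is a placeholder.
$content = render_page();                       // hypothetical function that builds the page
$etag    = '"' . md5($content) . '"';

header('Cache-Control: public, max-age=300');   // allow browsers/proxies to reuse it for 5 minutes
header('ETag: ' . $etag);

// If the client (or Varnish / the CDN) already holds this version, skip the body entirely.
if (isset($_SERVER['HTTP_IF_NONE_MATCH']) && trim($_SERVER['HTTP_IF_NONE_MATCH']) === $etag) {
    http_response_code(304);
    exit;
}

echo $content;
```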
Another factor to consider is whether your system is stateless. Stateless usually means that it doesn't store sessions on the server and reference them with every request. A good systems architecture relies on state as little as possible: the less state, the more horizontally scalable a system is. Most people introduce state when confronted with personalisation, i.e. serving up different content for different users. In such cases you should first investigate using HTML5 session storage (i.e. store the complete user data in JavaScript on the client, obviously over HTTPS) or, if the data set is smaller, secure JavaScript cookies. That way you can still serve up cached resources and then personalise with JavaScript on the client.
Finally, your stack includes a database tier, another potential bottleneck for performance and capacity. If you are only reading data from the system, then again it should be quite easy to scale horizontally. If there are both reads and writes, it's typically better to separate them: keep the writes in one database and serve reads from a read-only copy. You can then use the methods most relevant to each side to scale.
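A simple way to picture that split at the application level (a PDO sketch; host names, credentials and schema are placeholders, and in practice you also have to allow for replication lag):

```php
<?php
// Sketch of a read/write split at the connection level; hosts and schema are made up.
$master  = new PDO('mysql:host=db-master.internal;dbname=app',  'user', 'pass');
$replica = new PDO('mysql:host=db-replica.internal;dbname=app', 'user', 'pass');

// Writes always go to the master...
$stmt = $master->prepare('INSERT INTO comments (user_id, body) VALUES (?, ?)');
$stmt->execute(array(42, 'hello'));

// ...while read-only queries can be spread across one or more replicas.
$rows = $replica
    ->query('SELECT body FROM comments ORDER BY id DESC LIMIT 10')
    ->fetchAll(PDO::FETCH_ASSOC);
```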
These setups do not spit out a single answer that you can then compare to each other. The answer will vary on way more factors than you have listed.
Even if they did spit out a single answer, then it is just one metric out of dozens. What makes this the most important metric?
Even worse, each of these alternatives is not free. There is engineering effort and maintenance overhead in each of them, which cannot be analysed without understanding your organisation, your app and your cost/revenue structures.
Options like AWS not only involve development effort but may "lock you in" to a solution so you also need to be aware of that.
I know this response is not complete, but I am pointing out that this question touches on a large complicated area that cannot be reduced to a single metric.
I suspect you are approaching this from exactly the wrong end. Do not go looking for technologies and then figure out how to use them. Instead profile your app (measure, measure, measure), figure out the actual problem you are having, and then solve that problem and that problem only.
If you understand the problem and you understand the technology options then you should have an answer.
If you have already done this and the problem is concurrent page requests then I apologise in advance, but I suspect not.
So I'm writing an application that is very heavy on SQLite usage. I'm working on writing an in-memory caching system into my application that will allow me to sort and filter my data (my own personal Core Data, in essence). I'm doing this because it seems to me that this is a better/faster option than constantly making read requests against the SQLite database. Plus, most fields/columns will be searchable/sortable, and setting up indexes on each one seems less than ideal. But I'm not sure. I know SQLite does some caching in memory, but I don't know to what extent or how much of an advantage that would be for me. Implementing my own caching system will be complex and will probably add to my memory footprint, especially since I'm loading each table completely into memory to perform the sorts/filters. I'm more than willing to do it if it helps the performance of my app, but will it? Is SQLite's caching sufficient for me to rely solely on it, or will it get bogged down when the tables start getting large (10,000+ rows)? I guess I'm asking if anyone has enough experience with SQLite to recommend one over the other.
Before anyone asks: no I can't use Core Data. Core Data isn't flexible enough for me to use in my application.
Ok, so here's what I've figured: the choice depends greatly on your requirements. I ended up removing (as much as possible) the SQLite cache, loading up what I needed, and sorting/filtering it using my own routines. This works remarkably well for me. But I've realized by implementing this that this wouldn't work in a lot of situations. Specifically, I've done a lot to make sure that my DB size is as small as possible. I'm basically storing only simple/small text and numbers. Everything else is a reference to an outside file. This makes my database small enough to use less as a database and more as an indexing service, which works well for loading the info into memory and sorting/filtering.
So, the answer depends a lot on the database. If you're storing large fields that might take a lot of memory, it's probably best to let SQLite handle the cache. On the other hand, if you know the fields will be small, the SQLite cache will only inflate your memory usage, and the round trips to the database to sort/filter data will only increase your latency. In that case it's better to do the sorting/filtering yourself, though I will admit it's a lot of work. But in the end it made my app a lot faster than round-tripping to the DB.
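The question is about an iOS app, so the real code would be Objective-C against SQLite, but the load-once-then-sort/filter-in-memory approach is easy to sketch (here in PHP; the table and column names are invented):

```php
<?php
// Sketch of "treat the DB as an index, sort/filter in memory"; the schema is invented.
$db   = new SQLite3('catalog.db');
$rows = array();

// One pass over the table: pull only the small, sortable fields into memory.
$result = $db->query('SELECT id, title, rating, file_ref FROM items');
while ($row = $result->fetchArray(SQLITE3_ASSOC)) {
    $rows[] = $row;
}

// From here on, filtering and sorting are pure in-memory operations, no further round trips.
$filtered = array_filter($rows, function ($r) {
    return $r['rating'] >= 4;
});
usort($filtered, function ($a, $b) {
    return strcmp($a['title'], $b['title']);
});
```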
I'm working on a new project right now and thinking of using an ORM such as OpenAccess, LLBLGen Pro or SubSonic. This project may see large data volumes and many concurrent hits, so our performance requirements are very high.
Please compare them and recommend one to me.
Thanks
Jim.
Jim,
For the best results in answering this question, you'll need to do your own comparison since your specific requirements and data access scenarios will likely affect the results of any such performance testing.
That said, we use LLBLGen for a high-throughput web application and the performance is exceptional. We have found that the big issue is in the application design itself. Using SQL Server Profiler, we are able to see (during development) which parts of the application create a lot of hits on the database. The biggest penalty we found was loading a grid and then doing further database operations in the OnDataBinding / DataBound events. This creates a huge amount of traffic to the SQL Server database: a lot of reads and a lot of disk swapping. We have been very well served by making sure we get all the data in the first query, by making good design choices about the set of data/joins/etc. when building the application -- or by refactoring later once we find that performance is slow.
The overhead for LLBLGen, at least, is very minimal. It's fast even when creating huge numbers of objects. The much, much bigger performance hit comes when we make queries that spawn other queries (example above) and our DB reads go through the roof.
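The "queries that spawn other queries" pattern is the classic N+1 problem, and it bites in any data layer, not just LLBLGen. A plain PDO sketch (hypothetical schema) of what the fix looks like:

```php
<?php
// Hypothetical schema; the point is the query shape, not the particular ORM.
$db = new PDO('mysql:host=localhost;dbname=shop', 'user', 'pass');

// N+1 anti-pattern: one query for the grid, then one more per row while data-binding.
$orders = $db->query('SELECT id, customer_id, total FROM orders')->fetchAll(PDO::FETCH_ASSOC);
foreach ($orders as &$order) {
    $stmt = $db->prepare('SELECT name FROM customers WHERE id = ?');
    $stmt->execute(array($order['customer_id']));     // one extra round trip per row
    $order['customer'] = $stmt->fetchColumn();
}
unset($order);

// Better: let the database do the join, one round trip for the whole grid.
$orders = $db->query(
    'SELECT o.id, o.total, c.name AS customer
       FROM orders o
       JOIN customers c ON c.id = o.customer_id'
)->fetchAll(PDO::FETCH_ASSOC);
```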
You may wish to evaluate both for which one you feel is a better match for your skills and productivity as well.
I was recently given the task of rebuilding an existing RIA. The new RIA that I've designed is based on Silverlight, with a WCF service to connect to MS SQL Server. This is my first time doing something like this, so I'm not sure how to design the entire thing.
Basically, the client can look through graphs of "stocks" (allowing the client to choose different time periods, settings, etc). I've written the whole application essentially, but I'm not sure how to put it together.
The graphs are supposed to be directly based on the database, and to create the datapoints on the graph, some calculations need to be done (not very expensive ones).
The problem I'm having is deciding where to put the calculations (client- or server-side? Or half and half?).
What factors should I look for to help me decide where the calculations should be done? And how can I go about optimizing this (caching, etc)?
Obviously this is a very broad subject, so I'm not expecting an immediate answer, but any help/pointing in the right direction/resources would be appreciated.
A few tips for this kind of app.
Put as much logic as possible on the client.
Make the client responsible for session data, making all your server code stateless.
Try to minimize traffic to and from the server (Bigger requests are more efficient than multiple smaller ones) so consolidate requests when possible.
If this project is likely to grow beyond its current feature set, I think it's probably a good idea to perform the calculations client-side. This can avoid scaling issues, because you're using all the client-side CPUs rather than your single, precious server CPU. It does, however, rely on being able to transfer the required data to the client efficiently; otherwise you replace a processor bottleneck with a network bottleneck.
As for caching, it depends on your inputs: which variables can users of the client affect? If any of the variables they can alter are discrete (i.e. drawn from a fixed set of values), then they're candidates for caching. For example, if a user can select an arbitrary date range of stock variations to view, that's probably not very cacheable; if, however, they can only select a year, then you could cache your data sets by year (download each data set to the client and perform your calculation there). I'd not worry about caching too much unless you find it's a real performance problem; it'll only make your code more complex, so don't add it until you have proven you need it.
One other thing: if this project is unlikely to be a long-term concern, then implement the calculations wherever is easiest and fastest; you can revisit the decision if the project becomes more important later on.
Be REALLY REALLY careful about implementing client-side caching. Caching is INSANELY hard to do right while maintaining performance, security and correctness. Note that your DB Server's caching mechanism is already likely to be way better than any local caching mechanism you're likely to implement in less than 2 weeks' effort!
I would urge you to do as much work on the back-end as possible and to limit your client to render the data in a manner that is appropriate for your users. While many may balk at this suggestion, it's based on a number of observations from building many such systems in the past:
If you're going to filter some of the data returned by your service, you've just wasted thousands of clock cycles shipping data that need never have left your server
If you're going to sort your data, your DB could have done the sorting for you (often using otherwise idle CPU ticks) while waiting for the data to be read from its disks.
Your server most likely has more CPU and RAM available than your clients, and has a surprising amount of "free time" to use for sorting, filtering, running inline calculations, etc., while it's waiting for disks to read sectors and so on.
As Roman suggested: Minimize your round-trips between your client and your server as much as possible.
But perhaps most importantly:
1. BEFORE YOU START DESIGNING YOUR SYSTEM, state your performance goals.
2. Design what you think will achieve those goals. Try to find bottlenecks in your design, particularly areas where you make blocking calls, and re-design those areas to use async patterns wherever you can.
3. Build your intended solution.
4. Measure your actual performance under actual real-world load.
5. If you're within your expected performance goals, then you're done.
6. If not, work out where you're spending too long and tune the design of that portion of the system. Goto 3.
Don't try to build the perfect system in one try - chances are that you won't manage it, no matter how hard you try, for a variety of reasons including user expectations, your server's ability to process the required load, your clients' ability to handle the returned data, your network's ability to carry the traffic, etc.
They're a little old now, but I suggest you read through some of the earlier posts at http://blogs.msdn.com/richardt for more thoughts around designing and constructing Service Oriented and distributed systems.