ZooKeeper and RabbitMQ/Qpid together - overkill or a good combination? - locking

Greetings,
I'm evaluating some components for a multi-data center distributed system. We're going to be using message queues (via either RabbitMQ or Qpid) so agents can make asynchronous requests to other agents without worrying about addressing, routing, load balancing or retransmission.
In many cases, the agents will be interacting with components that were not designed for highly concurrent access, so locking and cross-agent coordination will be needed to avoid race conditions. Also, we'd like the system to automatically respond to agent or data center failures.
With the above use cases in mind, ZooKeeper seemed like it might be a good fit. But I'm wondering if trying to use both ZK and message queuing is overkill. It seems like what Zookeeper does could be accomplished by my own cluster manager using AMQP messaging, but that would be hard to get really right. On the other hand, I've seen some examples where ZooKeeper was used to implement message queuing, but I think RabbitMQ/Qpid are a more natural fit for that.
Has anyone out there used a combination like this?
Thanks in advance,
-Chris

Coming into this late, but maybe it will be of some use. The primary consideration should be the performance characteristics of your system. ZooKeeper, like you said, is more than capable of implementing a task distribution system using a distributed queue, but zk currently, is more optimized for reads than it is for writes (this only comes into play in the 1000's of ops per second range). If your throughput needs are less than this, then using just zk to implement your system would reduce number of runtime components and make it simpler. Of course, you should always run your performance tests before deciding.
Distributed coordination is really hard to get right, so I would definitely recommend using zookeeper for that and not rolling your own.

Not quite sure what ZooKeeper exactly is, but I guess that using a component from Apache (if it does fit your needs well) is preferred before managing such things as distributed synchronization and group services at your own. You could of course hire a team of developers especially for that purpose, but that doesn't guarantee you a better implementation.
I guess, that it would be anyways implemented as a separate component, cuz other way could bring much complexity and decelerate the workflow; so the preference of ZooKeeper or anything similar is kind of obvious (to me).
And surely, unless you're in the global optimization phase of your project workflow, I guess it would be better to use RabbitMQ or such (I would even stress that, cuz implementations (especially commercial) of the AMQP would be more reliable than everything that you'd come up with).
So I would go for both, carefully chosing the appropriate thirdparty products, but using as much of them as it is needed. And that's just my opinion; thanks for reading :)

Related

Scaling CakePHP Version 2.3.0

I'm beginning a new project using CakePHP. I like the "auto-magic" features, I think its a good fit for the project. I'm wondering about the potential to scale CakePHP to several million IP hits a day. and hundreds of thousands of database writes and reads a day. Also about 50,000 to 500,000 users, often with 3000 concurrently using the site. I'm making use of heavy stored procedures to offset this, and I'm accessing several servers including a load balancer.
I'm wondering about the computational time of some of the auto-magic and how well Cake is able to assist with session requests making many db hits. Has anyone has had success with cake running from a single server array setup with this level of traffic? I'm not using the cloud or a distributed database (yet). I'm really worried about potential bottlenecks with using this framework. I'm interested in advice from anyone who has worked with Cake in production. I've reseached, but I would love a second opinion. Thank you for your time.
This is not a problem but optimization is up to you.
There are different cache methods available you can implement, memcache, redis, full page caching... All of that is supported by cacke already. What you cache and where is up to you.
For searching you could try elastic search to speedup things
There are before dispatcher filters to by pass controller instantiation (you might want to do that in special cases, check the asset filter for example)
Use nginx not apache
Also I would not start with over optimizing and over-thinking this before any code is written, start well, think about caching but when you start to come across bottleneck analyse and fix them. Otherwise you'll waste a lot of time with over optimization before you even have written anything that works.
Cake itself is very fast. Just to proof the bullshit factor of these fancy benchmarks some frameworks do we did one using a dispatcher filter to "optimize" it and even beat Yii who seems to be pretty eager to show how fast it is, but benchmarks are pointless, specially in a huge project where so many human made fail can be introduced.

Does Redis support strong consistency

I am looking at porting a Java application to .NET, the application currently uses EhCache quite heavily and insists that it wants to support strong consistency (http://ehcache.org/documentation/get-started/consistency-options).
I am would like to use Redis in place of EhCache but does Redis support strong consistency or just support eventual consistency?
I've seen talk of a Redis Cluster but I guess this is a little way off release yet.
Or am I looking at this wrong? If Redis instance sat on a different server altogether and served two frontend servers how big could it get before we'd need to look at a Master / Slave style affair?
A single instance of Redis is consistent. There are options for consistency across many instances. #antirez (Redis developer) recently wrote a blog post, Redis data model and eventual consistency, and recommended Twemproxy for sharding Redis, which would give you consistency over many instances.
I don't know EhCache, so can't comment on whether Redis is a suitable replacement. One potential problem (porting to .NET) with Twemproxy is it seems to only run on Linux.
How big can a single Redis instance get? Depends on how much RAM you have.
How quickly will it get this big? Depends on how your data looks.
That said, in my experience Redis stores data quite efficiently. One app I have holds info for 200k users, 20k articles, all relationships between objects, weekly leader boards, stats, etc. (330k keys in total) in 400mb of RAM.
Redis is easy to use and fun to work with. Try it out and see if it meets your needs. If you do decide to use it and might one day want to shard, shard your data from the beginning.
Redis is not strongly consistent out of the box. You will probably need to apply 3rd party solutions to make it consistent. Here is a quote from docs:
Write safety
Redis Cluster uses asynchronous replication between nodes, and last failover wins implicit merge function. This means that the last elected master dataset eventually replaces all the other replicas. There is always a window of time when it is possible to lose writes during partitions. However these windows are very different in the case of a client that is connected to the majority of masters, and a client that is connected to the minority of masters.
Usually you need to have synchronous replication to achieve strong consistence in a distributed partitioned systems.

Concurrent page request comparisons

I have been hoping to find out what different server setups equate to in theory for concurrent page requests, and the answer always seems to be soaked in voodoo and sorcery. What is the approximation of max concurrent page requests for the following setups?
apache+php+mysql(1 server)
apache+php+mysql+caching(like memcached or similiar (still one server))
apache+php+mysql+caching+dedicated Database Server (2 servers)
apache+php+mysql+caching+dedicatedDB+loadbalancing(multi webserver/single dbserver)
apache+php+mysql+caching+dedicatedDB+loadbalancing(multi webserver/multi dbserver)
+distributed (amazon cloud elastic) -- I know this one is "as much as you can afford" but it would be nice to know when to move to it.
I appreciate any constructive criticism, I am just trying to figure out when its time to move from one implementation to the next, because they each come with their own implementation feat either programming wise or setup wise.
In your question you talk about caching and this is probably one of the most important factors in a web architecture r.e performance and capacity.
Memcache is useful, but actually, before that, you should be ensuring proper HTTP cache directives on your server responses. This does 2 things; it reduces the number of requests and speeds up server response times (if you have Apache configured correctly). This can also be improved by using an HTTP accelerator like Varnish and a CDN.
Another factor to consider is whether your system is stateless. By stateless, it usually means that it doesn't store sessions on the server and reference them with every request. A good systems architecture relies on state as little as possible. The less state the more horizontally scalable a system. Most people introduce state when confronted with issues of personalisation - i.e serving up different content for different users. In such cases you should first investigate using the HTML5 session storage (i.e store the complete user data in javascript on the client, obviously over https) or if the data set is smaller, secure javascript cookies. That way you can still serve up cached resources and then personalise with javascript on the client.
Finally, your stack includes a database tier, another potential bottleneck for performance and capacity. If you are only reading data from the system then again it should be quite easy to horizontally scale. If there are reads and writes, its typically better to separate the read write datasets into a separate database and have the read only in another. You can then use more relevant methods to scale.
These setups do not spit out a single answer that you can then compare to each other. The answer will vary on way more factors than you have listed.
Even if they did spit out a single answer, then it is just one metric out of dozens. What makes this the most important metric?
Even worse, each of these alternatives is not free. There is engineering effort and maintenance overhead in each of these. Which could not be analysed without understanding your organisation, your app and your cost/revenue structures.
Options like AWS not only involve development effort but may "lock you in" to a solution so you also need to be aware of that.
I know this response is not complete, but I am pointing out that this question touches on a large complicated area that cannot be reduced to a single metric.
I suspect you are approaching this from exactly the wrong end. Do not go looking for technologies and then figure out how to use them. Instead profile your app (measure, measure, measure), figure out the actual problem you are having, and then solve that problem and that problem only.
If you understand the problem and you understand the technology options then you should have an answer.
If you have already done this and the problem is concurrent page requests then I apologise in advance, but I suspect not.

architecture for high availability

I have this scenario:
You have a factory process line which runs 24/7. Downtime is extremely expensive.
The software controlling all different parts must use a shared form of database storage
The main reason for this is to know in which state the factory is in. For example some products can be mixed when using the same set of equipement and others DEFINITELY not.
requirements:
I want to the software be able to detect that an error in one part of
the plant must result in some machine shutdown more then 1 km away. so stoing data in the plc's is not an option.
Updates and upgrades to the factory environment are frequent
load (in computer terms) will be really low.
The systems handles a few hunderd assignments a day for which calculations / checks are done followed by instructions send for the factory machines. Systems will be bored most of the time. Most important requirement is the central computer system must be correct and always working.
I was thinking to use a dynamo based database (riak or cassandra) where data gets written to multiple machines with each machine having the whole database
When one system goes down it will go down unoticed. A Traditional sql databse might be more of a pain to upgrade when tables changes and this master slave is harder to configure.
What would be your solution?
Network has been made redundant and most other single points of failure to. The database system is critical because downtime of the db means downtime for the entire plant not just one of the machines which is acceptable.
How to solve shared state problem.
complexity in the database will not be a problem. I will be more like a simple key value store to get the most current and correct data.
I don't think this is a sql/nosql question. All of Postgres, MySQL and MS SQL Server have some kind of cluster or hot standby option.
Configuration is a one-time thing, but any NoSQL option is going to give you headaches from top-to-bottom of code, if you are trying to do something fundamentally relational on a platform that has given up relational for the purposes of running things like Amazon or Facebook. The configuration is once, the coding is forever.
So I would say stick with a tried and true solution and get that hot replication going.
This also provides a solution for upgrades. The typical sequence is to "fail over" to the standby, upgrade the master, flip back to the master, upgrade the standby, and resume. With details specific to the situation of course.
Use an established RDBMS that supports such things natively
Do you really want to run a 24/7 mission critical system on something that may be consistent at any point in time?
You need to avoid single points of failure.
All the major players in our dbms world offer at least one way to avoid making the database itself a single point of failure. I might question whether they can propagate changes fast enough for your manufacturing processes. (Or is data update not really an issue? Can't really tell from your question.) My db work in manufacturing is limited to the car and the chemical industry. Microseconds didn't matter to them.
But the dbms isn't the only thing that can fail. "Always working" means that the clients have to always be working, too. Client hardware, connections to the network, the network and network servers themselves all probably have single points of failure. Failure-tolerant servers have multiple power supplies, multiple NICs, etc.
"Always working" is really expensive. I have a feeling that the database isn't going to be the biggest problem for your company.

First Time Architecturing?

I was recently given the task of rebuilding an existing RIA. The new RIA that I've designed is based on Silverlight, with a WCF service to connect to MS SQL Server. This is my first time doing something like this, so I'm not sure how to design the entire thing.
Basically, the client can look through graphs of "stocks" (allowing the client to choose different time periods, settings, etc). I've written the whole application essentially, but I'm not sure how to put it together.
The graphs are supposed to be directly based on the database, and to create the datapoints on the graph, some calculations need to be done (not very expensive ones).
The problem I'm having is to decide where to put the calculations (client or serverside? Or half and half?)
What factors should I look for to help me decide where the calculations should be done? And how can I go about optimizing this (caching, etc)?
Obviously this is a very broad subject, so I'm not expecting an immediate answer, but any help/pointing in the right direction/resources would be appreciated.
A few tips for this kind of app.
Put as much logic as possible on the client.
Make the client responsible for session data, making all your server code stateless.
Try to minimize traffic to and from the server (Bigger requests are more efficient than multiple smaller ones) so consolidate requests when possible.
If this project is likely to grow beyond it's current feature set I think it's probably a good idea to perform the calculations client side. This can avoid scaling issues, because you're using all the client side CPUs ratther than you're single, precious server CPU. This does however rely on being able to transfer the required data to the client in an efficient way, otherwise you replace a processor bottleneck with a network bottleneck.
As for caching it depends on your inputs, what variables can users of the client affect? If any of the variables they can alter are discrete (ie they can be a fixed set of values) then they're candidates for caching. For example if a user can select a date range of stock variations to view then that's probably not so useful, if however they can only select a year then you could cache your data sets by year (download each data set to the client and perform your calculation). I'd not worry about caching too much unless you find it's a real performance problem, it'll only make your code more complex, so don't add it until you have proven you need it.
One other thing, if this project is unlikely to be a long term concern then implement the calculations wherever is easiest and fastest, you can revisit if the project becomes more important later on.
Be REALLY REALLY careful about implementing client-side caching. Caching is INSANELY hard to do right while maintaining performance, security and correctness. Note that your DB Server's caching mechanism is already likely to be way better than any local caching mechanism you're likely to implement in less than 2 weeks' effort!
I would urge you to do as much work on the back-end as possible and to limit your client to render the data in a manner that is appropriate for your users. While many may balk at this suggestion, it's based on a number of observations from building many such systems in the past:
If you're going to filter some of the data returned by your service, you've just wasted thousands of clock cycles shipping data that need never have left your server
If you're going to sort your data, your DB could have done the sorting for you (often using otherwise idle CPU ticks) while waiting for the data to be read from its disks.
Your server most likely has more CPU and RAM available than your clients and has a surprising amount of "free time" to use for sorting, filtering, running inline calculations, etc., while its waiting for disks to read sectors etc.
As Roman suggested: Minimize your round-trips between your client and your server as much as possible.
But perhaps most importantly:
BEFORE YOU START DESIGNING YOUR SYSTEM, state your performance goals
Design what you think will achieve those goals. Try to find bottlenecks in your design, particularly areas where you make blocking calls. Re-design those areas to use async patterns wherever you can.
Build your intended solution
Measure your actual perforamnce under actual real-world load
If you're within your expected performance goals, then you're done.
If not, work out where you're spending too long and tune the design of that portion of the system. Goto 3.
Don't try to build the perfect system in one try - chances are that you won't manage it, no matter how hard you try, for a variety of reasons including user expectations, your servers ability to process the required load, your clients' ability to handle the returned data, your network's ability to carry the traffic, etc.
They're a little old now, but I suggest you read through some of the earlier posts at http://blogs.msdn.com/richardt for more thoughts around designing and constructing Service Oriented and distributed systems.