architecture for high availability - sql

I have this scenario:
You have a factory process line which runs 24/7. Downtime is extremely expensive.
The software controlling all different parts must use a shared form of database storage
The main reason for this is to know in which state the factory is in. For example some products can be mixed when using the same set of equipement and others DEFINITELY not.
requirements:
I want to the software be able to detect that an error in one part of
the plant must result in some machine shutdown more then 1 km away. so stoing data in the plc's is not an option.
Updates and upgrades to the factory environment are frequent
load (in computer terms) will be really low.
The systems handles a few hunderd assignments a day for which calculations / checks are done followed by instructions send for the factory machines. Systems will be bored most of the time. Most important requirement is the central computer system must be correct and always working.
I was thinking to use a dynamo based database (riak or cassandra) where data gets written to multiple machines with each machine having the whole database
When one system goes down it will go down unoticed. A Traditional sql databse might be more of a pain to upgrade when tables changes and this master slave is harder to configure.
What would be your solution?
Network has been made redundant and most other single points of failure to. The database system is critical because downtime of the db means downtime for the entire plant not just one of the machines which is acceptable.
How to solve shared state problem.
complexity in the database will not be a problem. I will be more like a simple key value store to get the most current and correct data.

I don't think this is a sql/nosql question. All of Postgres, MySQL and MS SQL Server have some kind of cluster or hot standby option.
Configuration is a one-time thing, but any NoSQL option is going to give you headaches from top-to-bottom of code, if you are trying to do something fundamentally relational on a platform that has given up relational for the purposes of running things like Amazon or Facebook. The configuration is once, the coding is forever.
So I would say stick with a tried and true solution and get that hot replication going.
This also provides a solution for upgrades. The typical sequence is to "fail over" to the standby, upgrade the master, flip back to the master, upgrade the standby, and resume. With details specific to the situation of course.

Use an established RDBMS that supports such things natively
Do you really want to run a 24/7 mission critical system on something that may be consistent at any point in time?

You need to avoid single points of failure.
All the major players in our dbms world offer at least one way to avoid making the database itself a single point of failure. I might question whether they can propagate changes fast enough for your manufacturing processes. (Or is data update not really an issue? Can't really tell from your question.) My db work in manufacturing is limited to the car and the chemical industry. Microseconds didn't matter to them.
But the dbms isn't the only thing that can fail. "Always working" means that the clients have to always be working, too. Client hardware, connections to the network, the network and network servers themselves all probably have single points of failure. Failure-tolerant servers have multiple power supplies, multiple NICs, etc.
"Always working" is really expensive. I have a feeling that the database isn't going to be the biggest problem for your company.

Related

What is a good access time to a database (SQL)?

Hey.. i wanna know which time is a good accesstime, because i'm searching for a good sql database and hsqldb says their accesstime is 12ms... <-- good?
I think it would depend on your needs. Is it for a web server or a desktop application? The amount of data is also important, because reading lots of small records will perform differently than reading a few large records. Access time is also based upon your hardware, software and maybe even some other factors.
For example, you can use a database with lightning-fast access, but if your users need to connect to it over a 5 megabit VPN connection, passing through three different proxies and with trafic world-wide, your database would then just be a waste of power.
Basically, it's a marketing thing that they're claiming. It's a good product but don't just focus on access time. Make sure you also look at your other needs. Another system might just perform better, even if it has a slower acess time, because it is more optimized in reading it's indices and stuff.
So, what do you want, exactly?
I don't think access time tells you anything, really. If you have slow or incorrectly configured storage, then this access time metric will be dwarfed by how much time is spent on waits and split I/Os. Network latency is also a factor, since I'm guessing you probably won't want to have your code on the same machine as your database, and you will most likely have a few network devices you'll need to traverse in your production environment.
In my experience, all the database platforms these days will all perform adequately if configured correctly and paired with a complementary application. Pick the DBMS that best fits your requirements, follow the best practices for configuration of the DBMS on your hardware, and you should be please with the outcome.

First Time Architecturing?

I was recently given the task of rebuilding an existing RIA. The new RIA that I've designed is based on Silverlight, with a WCF service to connect to MS SQL Server. This is my first time doing something like this, so I'm not sure how to design the entire thing.
Basically, the client can look through graphs of "stocks" (allowing the client to choose different time periods, settings, etc). I've written the whole application essentially, but I'm not sure how to put it together.
The graphs are supposed to be directly based on the database, and to create the datapoints on the graph, some calculations need to be done (not very expensive ones).
The problem I'm having is to decide where to put the calculations (client or serverside? Or half and half?)
What factors should I look for to help me decide where the calculations should be done? And how can I go about optimizing this (caching, etc)?
Obviously this is a very broad subject, so I'm not expecting an immediate answer, but any help/pointing in the right direction/resources would be appreciated.
A few tips for this kind of app.
Put as much logic as possible on the client.
Make the client responsible for session data, making all your server code stateless.
Try to minimize traffic to and from the server (Bigger requests are more efficient than multiple smaller ones) so consolidate requests when possible.
If this project is likely to grow beyond it's current feature set I think it's probably a good idea to perform the calculations client side. This can avoid scaling issues, because you're using all the client side CPUs ratther than you're single, precious server CPU. This does however rely on being able to transfer the required data to the client in an efficient way, otherwise you replace a processor bottleneck with a network bottleneck.
As for caching it depends on your inputs, what variables can users of the client affect? If any of the variables they can alter are discrete (ie they can be a fixed set of values) then they're candidates for caching. For example if a user can select a date range of stock variations to view then that's probably not so useful, if however they can only select a year then you could cache your data sets by year (download each data set to the client and perform your calculation). I'd not worry about caching too much unless you find it's a real performance problem, it'll only make your code more complex, so don't add it until you have proven you need it.
One other thing, if this project is unlikely to be a long term concern then implement the calculations wherever is easiest and fastest, you can revisit if the project becomes more important later on.
Be REALLY REALLY careful about implementing client-side caching. Caching is INSANELY hard to do right while maintaining performance, security and correctness. Note that your DB Server's caching mechanism is already likely to be way better than any local caching mechanism you're likely to implement in less than 2 weeks' effort!
I would urge you to do as much work on the back-end as possible and to limit your client to render the data in a manner that is appropriate for your users. While many may balk at this suggestion, it's based on a number of observations from building many such systems in the past:
If you're going to filter some of the data returned by your service, you've just wasted thousands of clock cycles shipping data that need never have left your server
If you're going to sort your data, your DB could have done the sorting for you (often using otherwise idle CPU ticks) while waiting for the data to be read from its disks.
Your server most likely has more CPU and RAM available than your clients and has a surprising amount of "free time" to use for sorting, filtering, running inline calculations, etc., while its waiting for disks to read sectors etc.
As Roman suggested: Minimize your round-trips between your client and your server as much as possible.
But perhaps most importantly:
BEFORE YOU START DESIGNING YOUR SYSTEM, state your performance goals
Design what you think will achieve those goals. Try to find bottlenecks in your design, particularly areas where you make blocking calls. Re-design those areas to use async patterns wherever you can.
Build your intended solution
Measure your actual perforamnce under actual real-world load
If you're within your expected performance goals, then you're done.
If not, work out where you're spending too long and tune the design of that portion of the system. Goto 3.
Don't try to build the perfect system in one try - chances are that you won't manage it, no matter how hard you try, for a variety of reasons including user expectations, your servers ability to process the required load, your clients' ability to handle the returned data, your network's ability to carry the traffic, etc.
They're a little old now, but I suggest you read through some of the earlier posts at http://blogs.msdn.com/richardt for more thoughts around designing and constructing Service Oriented and distributed systems.

Restart daily or 100% uptime for enterprise applications?

I have a general question that is rather open-ended (i.e. "depends on platform, application type, etc.") but I am looking for general guidelines as an answer.
When is it preferable to design an application for continuous operation (100% uptime) vs. scheduled daily shutdown/restart?
Obviously, web apps need to be up all the time, so assume for this question that we are discussing an internal enterprise application, such as an accounting system, or a B2B system that is only used actively during weekday business hours.
Arguments I've heard for each are as follows:
Pro 100% Uptime: "once you get an application running, it's better to keep it up, because there's a chance it won't restart when you shut it down."
Pro daily restarts: "an application that is up continuously for 3 years might one day go down, and nobody will know how to bring it back online."
Other considerations are memory growth, performance, need for maintenance, etc. This is a programming issue because the choice you make can affect your technical design. For example, you don't need to code certain batch jobs and clear state daily if you know the application will be shutdown/restarted daily.
Thoughts?
The arguments you state both for and against 100% uptime are foolish arguments, in my opinion. If you're worried about the application not restarting when it is shutdown then you have larger issues than uptime concerns. Likewise, if you feel that nobody will know how to bring it back online after a prolonged period of uptime you have training and documentation issues.
The reality is that you should always design an application to be efficient when it comes to memory consumption and performance. Generally, by doing this you end up with an application that can sucessfully survive as a long running process or one that restarts frequently. Keep in mind that your typical computer system is rebooted periodically anyway due to OS updates, etc.
Unless you have requirements and service level agreements that guarantee 100% uptime, this isn't usually something you have to be overly concerned about as long as you design an application efficiently.
Sorry, but I'm not getting the point or this question is totally pointless.
An application, any application, should be designed, IMO, to stay up unless it's needed. If an application/platform needs to be restarted daily, then it has memory leaks, or bugs, or it's, in general, poorly written.
The point "don't make it stay up too long, otherwise you'd risk nobody will ever remember how to turn it up again" is quite laughable. I do Application Management (Operations) as my daily job, and I've never seen an application staying up for more than one month. After that period, you have to cope with OS maintainance, db patching, software upgrades, etc.
So, to summarize: write applications that can stay up as long as it's needed.
When is it preferable to design an application for continuous operation (100% uptime) vs. scheduled daily shutdown/restart?
I think this is really an orthogonal question to application design. Many web servers and application containers can support hot restarts. In other words, this is not a question so much of "application design" but rather a choice of technology. For example, you can avoid the question entirely by simply having N copies of your application (N > 1), then systematically bringing a particular instance down for maintenance and restarting as needed.
Furthermore, business needs and requirements should be determining the appropriate downtime, not your choice of technology.
Pro daily restarts: "an application that is up continuously for 3 years might one day go down, and nobody will know how to bring it back online."
Hogwash. That is a social/organizational argument, not a technical one. This is solved by having an obvious build process which includes starting the server as one of its possible tasks. That reduces the task of "restarting" to a single command.
If you're not extremely confident in your team, it might be better to go down from time to time, to clear everything. Once a day could do it, but there is a range from this to "never" ...
But this is generally dictated by business contraints. If you don't have those constraints yet ...
Well, why don't you also postpone your decision then ?
As others said, if you can't trust your app to start up again you have much larger issues.
From experience my general, personal, recommendation for web-apps is to cycle them once a day (in the early hours of the morning i.e. at the lowest traffic point) staggered over the whole server cluster. No matter how memory efficient your app is web-apps in particular can always have cache bloat issues over extended periods of up-time and one you accept the inevitability of a restart you absolutely want that to happen on your schedule and not t the whim of w3wp.exe.
Of course this all depends on the number of servers you have, the traffic manager you have (if any) and what your traffic profile looks like.
Apart from "Your app is not good enough if you need to restart it" ideas (which I see them perfect and I like them), I would prefer something in the middle as a preventive measure.
If you application is not too big, and one person can restart it without much trouble, it would be fine to restart it maybe once per month or 3/4 times per year. This way you will ensure that the sysadmin knows well how to do it (people sometimes comes and go form the companies) and also his knowledge keeps fresh.
If you have a problem and your sysadmin has not restarted the application since two years ago, he will have several manuals available and courses done, but probably he has forget some steps, or he is not so quick to solve the problem.
Other topic to consider is: "Is a fully implemented application or are you still working on it?" If it's an application made for yourselves, you still code on it and make frequent upgrades for new features, it can be interesting to restart it more frequently. If a problem appears, it has more probabilities to be hidden on the new code. It will help your programmers to fix it and your sysadmin to keep updated about what's happening with the app.
Of course, making a perfect application is always a top-prio element, but... ok, we all know that not always is possible

Is DB4O Replication faster than SQL Server Merge Replication?

Does the replication system that comes with DB4O work well? Basically I would like to know if anyone has some good numbers on the record throughput of their replication system and if it handles concurrency errors gracefully or not. What is the relative performance difference between SQL Server's merge replication between two SQL servers and using DRS between two DB4O databases?
We are currently working on improving the replication system further and improving performance certainly is a goal.
I think it's quite hard to produce comparable figures. Every object that needs to be replicated requires a lookup in the UUID BTree. If you know what you are doing, you can finetune that to run completely in memory. Then again the throughput will depend very much on how many indexes you have on each side and how big indexes are. db4o and the SQL server of your choice (and any other SQL server) may scale differently with size and that may very much depend on the hardware you use (db4o loves solid state discs with short seek times).
This is like with any other benchmark: You can only find out how things really will work for you if you mock up the scenario that you think you need and run it on your hardware.
As to handling concurrency: Any conflict will call back into your code and it's your choice how you handle it. You can resolve by hand by merging changes to either side and you can also ignore objects. It's up to your code to find out what it thinks is right.
With respect to concurrency if you have a replication session running side-by-side with another live session that constantly modifies objects: Currently released dRS code is not yet strong for this case. While we implement replication between db4o and the high-end object database Versant VOD we will try to cover these kind of concurrency cases also.

How to prove that code isn't broken, but the hardware is?

I'm sure it repeats everywhere. You can 'feel' network is slow, or machine or slow or something. But the server/chassis logs are not showing anything, so IT doesn't believe you. What do you do?
Your regressions are taking twice the time ... but that's not enough
Okay you transfer 100 GB using dd etc, but ... that's not enough.
Okay you get server placed in different chassis for 2 week, it works fine ... but .. that's not enough...
so HOW do you get IT to replace the chassis ?
More specifically:
Is there any suite which I can run on two setups ( supposed to be identical ), which can show up difference in network/cpu/disk access .. which IT will believe ?
Computers don't age and slow down the same way we do. If your server is getting slower -- actually slower, not just feels slower because every other computer you use is getting faster -- then there is a reason and it is possible that you may be able to fix it. I'd try cleaning up some disk space, de-fragmenting the disk, and checking what other processes are running (perhaps someone's added more apps to the system and you're just not getting as many cycles).
If your app uses a database, you may want to analyze your query performance and see if some indices are in order. Queries that perform well when you have little data can start taking a long time as the amount of data grows if they have to use table scans. As a former "IT" guy, I'd also be reluctant to throw hardware at a problem because someone tells me the system is slowing down. I'd want to know what has changed and see if I could get the system running the way it should be. If the app has simply out grown the hardware -- after you've made suitable optimizations -- then upgrading is a reasonable choice.
Run a standard benchmark suite. See if it pinpoints memory, cpu, bus or disk, when compared to a "working" similar computer.
See http://en.wikipedia.org/wiki/Benchmark_(computing)#Common_benchmarks for some tips.
The only way to prove something is to do a stringent audit.
Now traditionally, we should keep the system constant between two different sets while altering the variable we are interested. In this case the variable is the hardware that your code is running on. So in simple terms, you should audit the running of your software on two different sets of hardware, one being the hardware you are unhappy about. And see the difference.
Now if you are to do this properly, which I am sure you are, you will first need to come up with a null hypothesis, something like:
"The slowness of the application is
unrelated to the specific hardware we
are using"
And now you set about disproving that hypothesis in favour of an alternative hypothesis. Once you have collected enough results, you can apply statistical analyses on them, to decide whether any differences are statistically significant. There are analyses to find out how much data you need, and then compare the two sets to decide if the differences are random, or not random (which would disprove your null hypothesis). The type of tests you do will mostly depend on your data, but clever people have made checklists to help us decide.
It sounds like your main problem is being listened to by IT, but raw technical data may not be persuasive to the right people. Getting backup from the business may help you and that means talking about money.
Luckily, both platforms already contain a common piece of software - the application itself - designed to make or save money for someone. Why not measure how quickly it can do that e.g. how long does it take to process an order?
By measuring how long your application spends dealing with each sub task or data source you can get a rough idea of the underlying hardware which is under performing. Writing to a local database, or handling a data structure larger than RAM will impact the disk, making network calls will impact the network hardware, CPU bound calculations will impact there.
This data will never be as precise as a benchmark, and it may require expensive coding, but its easier to translate what it finds into money terms. Log4j's NDC and MDC features, and Springs AOP might be good enabling tools for you.
Run perfmon.msc from Start / Run in Windows 2000 through to Vista. Then just add counters for CPU, disk etc..
For SQL queries you should capture the actual queries then run them manually to see if they are slow.
For instance if using SQL Server, run the profiler from Tools, SQL Server Profiler. Then perform some operations in your program and look at the capture for any suspicous database calls. Copy and paste one of the queries into a new query window in management studio and run it.
For networking you should try artificially limiting your network speed to see how it affects your code (e.g. Traffic Shaper XP is a simple freeware limiter).