I need to decide which database to use for a system where I need AP from the CAP theorem. Data is constantly but slowly coming in. Big queries are expected. It should be reliable, with no single point of failure. I can use up to 3 instances on different nodes. In-memory solutions are bad for me because of data size: it should be running for years and I expect up to terabyte data sizes. Most guys in my team prefer SQL. But I understand that traditional SQL databases are not fault tolerant in terms of hardware failure. Any ideas?
Since this question was asked there have been some significant changes in the Distributed SQL or NewSQL landscape...the most noteworthy being the viability of CockroachDB. That appears to be the best option in a situation like the one referenced in this question.
No single points of failure. Easy to scale. Can handle tons of volume. You can run it wherever you want. Speaks postgres. Super fault tolerant.
Amazon Redshift seems to be the best answer (thank you kuujo). But we will try RethinkDB because it has some nice features.
I am currently addressing a situation where our web application receives at least a million requests per 30 seconds. These requests generate 3-5 million row inserts across 5 tables. This is a pretty heavy load to handle. Currently we are using multithreading to handle this situation (which is a bit faster but does not give us better CPU throughput). However, the load will definitely increase in the future and we will have to account for that too. Six months from now we expect double the load we are currently receiving, so I am looking for a new solution that is scalable and can easily accommodate any further increase in load.
Multithreading also makes debugging quite complicated, and we sometimes have trouble tracing issues.
FYI we are already utilizing the SQL Bulk Insert/Copy that is mentioned in this previous post
Sql server 2008 - performance tuning features for insert large amount of data
However, I am looking for a more capable solution (which I think should exist) that will address this situation.
Note: I am not looking for any code snippets or code examples. I am just looking for a big picture of a concept that I could possibly use and I am sure that I can take that further to an elegant solution :)
Also the solution should have a better utilization of the threads and processes. And I do not want my threads/processes to even wait to execute something because of some other resource.
Any suggestions will be deeply appreciated.
Update: Not every request will lead to an insert. However, most of them will lead to some SQL operation. The application performs different types of transactions and these will lead to a lot of bulk SQL operations. I am more concerned about inserts and updates, and these operations need not be real time; there can be a bit of lag. However, processing them in real time would be much more helpful.
I think your problem is mostly about getting better CPU throughput, which will lead to better performance. So I would probably look at something like asynchronous processing, in which a thread never sits idle, and you maintain a queue in the form of a linked list or any other data structure that suits your programming model.
The way this would work is that your threads try to perform a given job immediately, and if anything would stop them from doing it, they push that job into the queue. The pushed items are then processed in the order in which the container/queue stores them.
In your case since you are already using bulk sql operations you should be good to go with this strategy.
lemme know if this helps you.
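The queue idea above can be sketched roughly as follows. This is a simplified producer/consumer variant, not the exact scheme described: request threads only enqueue work, and a single writer thread drains the queue and flushes rows in batches. The names `run_batch_writer` and `flush` are hypothetical, and `flush` stands in for whatever bulk insert call the application already uses.

```python
import queue
import threading


def run_batch_writer(jobs, flush, batch_size=100):
    """Drain `jobs` until a None sentinel arrives, flushing rows in batches.

    `flush` is a stand-in for the real bulk insert (an assumption here).
    """
    batch = []
    while True:
        item = jobs.get()          # blocks only the writer, never the producers
        if item is None:           # sentinel: no more work will arrive
            break
        batch.append(item)
        if len(batch) >= batch_size:
            flush(batch)
            batch = []             # start a new list; the flushed one is kept as-is
    if batch:                      # flush the final partial batch
        flush(batch)


jobs = queue.Queue()
flushed = []                       # stands in for the database in this demo

writer = threading.Thread(target=run_batch_writer, args=(jobs, flushed.append, 100))
writer.start()

for i in range(250):               # request handlers would do this part
    jobs.put({"id": i})
jobs.put(None)                     # tell the writer to finish
writer.join()

batch_sizes = [len(b) for b in flushed]
```

Because producers never wait on the database, a slow insert only delays the writer thread; the trade-off is that queued rows are lost if the process dies before they are flushed.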
Can you partition the database so that the inserts are spread around? How is this data used after insert? Is there a natural partition to the data by client or geography or some other factor?
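If a natural key such as a client identifier exists, spreading inserts could look something like this minimal sketch. The function `shard_for`, the key format, and the shard count of 4 are all assumptions made for illustration; a stable hash (`zlib.crc32`) is used because Python's built-in `hash()` for strings varies between processes.

```python
import zlib
from collections import defaultdict


def shard_for(client_id, n_shards):
    """Map a client key to a shard index using a stable hash."""
    return zlib.crc32(client_id.encode("utf-8")) % n_shards


# Hypothetical incoming rows: (client key, payload).
rows = [("client-%d" % (i % 50), i) for i in range(1000)]

# Group rows per shard so each group can be bulk-inserted separately.
shards = defaultdict(list)
for client_id, payload in rows:
    shards[shard_for(client_id, 4)].append((client_id, payload))
```

All rows for one client land on the same shard, so per-client queries touch a single partition.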
Since you are using SQL Server, I would suggest you get several of the books on high availability and high performance for SQL Server. The internals book might help as well. Amazon has a bunch of these. This is a complex subject and requires too much depth for a simple answer on a bulletin board. But basically there are several keys to high performance design, including hardware choices, partitioning, correct indexing, correct queries, etc. To do this effectively, you have to understand in depth what SQL Server does under the hood and how changes can make a big difference in performance.
Since you do not need to have your inserts/updates in real time, you might consider having two databases: one for reads and one for writes. This is similar to having an OLTP db and an OLAP db:
Read Database:
Indexed as much as needed to maximize read performance.
Possibly denormalized if performance requires it.
Not always up to date.
Insert/Update database:
No indexes at all. This will help maximize insert/update performance
Try to normalize as much as possible.
Always up to date.
You would basically direct all insert/update actions to the Insert/Update db. You would then create a publication process that moves data over to the read database at certain time intervals. When I have seen this in the past, the data was usually moved over on a nightly basis, when few people were using the site. There are a number of options for moving the data over, but I would start by looking at SSIS.
This will depend on your ability to do a few things:
have read data be up to one day out of date
complete your nightly Read db update process in a reasonable amount of time.
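The publication step described above might look roughly like the following sketch. It uses two in-memory SQLite databases purely as stand-ins (the answer suggests SSIS for SQL Server), and the `orders` table, the `publish` function, and the high-water-mark approach are assumptions for illustration. It copies new inserts only; propagating updates would need extra bookkeeping.

```python
import sqlite3


def publish(write_db, read_db, last_id):
    """Copy rows newer than `last_id` to the read database.

    Returns the new high-water mark (the largest id copied so far).
    """
    rows = write_db.execute(
        "SELECT id, client, amount FROM orders WHERE id > ? ORDER BY id",
        (last_id,),
    ).fetchall()
    with read_db:  # one transaction for the whole batch
        read_db.executemany(
            "INSERT INTO orders (id, client, amount) VALUES (?, ?, ?)", rows
        )
    return rows[-1][0] if rows else last_id


write_db = sqlite3.connect(":memory:")
read_db = sqlite3.connect(":memory:")
schema = "CREATE TABLE orders (id INTEGER PRIMARY KEY, client TEXT, amount REAL)"
write_db.execute(schema)                 # write side: no secondary indexes
read_db.execute(schema)
read_db.execute("CREATE INDEX idx_orders_client ON orders (client)")  # read side

write_db.executemany(
    "INSERT INTO orders (client, amount) VALUES (?, ?)", [("a", 10.0), ("b", 5.0)]
)
write_db.commit()
mark = publish(write_db, read_db, 0)     # first scheduled run

write_db.execute("INSERT INTO orders (client, amount) VALUES (?, ?)", ("a", 2.5))
write_db.commit()
stale_count = read_db.execute("SELECT COUNT(*) FROM orders").fetchone()[0]

mark = publish(write_db, read_db, mark)  # next scheduled run catches up
fresh_count = read_db.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
```

Between runs the read database lags behind the write database, which is exactly the staleness the two conditions above ask you to accept.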
I have this scenario:
You have a factory process line which runs 24/7. Downtime is extremely expensive.
The software controlling all different parts must use a shared form of database storage
The main reason for this is to know which state the factory is in. For example, some products can be mixed when using the same set of equipment and others DEFINITELY not.
requirements:
I want the software to be able to detect that an error in one part of the plant must result in some machine shutting down more than 1 km away, so storing data in the PLCs is not an option.
Updates and upgrades to the factory environment are frequent
load (in computer terms) will be really low.
The system handles a few hundred assignments a day, for which calculations and checks are done, followed by instructions sent to the factory machines. Systems will be bored most of the time. The most important requirement is that the central computer system must be correct and always working.
I was thinking of using a Dynamo-based database (Riak or Cassandra) where data gets written to multiple machines, with each machine having the whole database.
When one system goes down, it will go down unnoticed. A traditional SQL database might be more of a pain to upgrade when tables change, and a master-slave setup is harder to configure.
What would be your solution?
The network has been made redundant, and most other single points of failure too. The database system is critical because downtime of the db means downtime for the entire plant, not just one of the machines, which would be acceptable.
How to solve the shared state problem.
Complexity in the database will not be a problem. It will be more like a simple key-value store to get the most current and correct data.
I don't think this is a sql/nosql question. All of Postgres, MySQL and MS SQL Server have some kind of cluster or hot standby option.
Configuration is a one-time thing, but any NoSQL option is going to give you headaches from top-to-bottom of code, if you are trying to do something fundamentally relational on a platform that has given up relational for the purposes of running things like Amazon or Facebook. The configuration is once, the coding is forever.
So I would say stick with a tried and true solution and get that hot replication going.
This also provides a solution for upgrades. The typical sequence is to "fail over" to the standby, upgrade the master, flip back to the master, upgrade the standby, and resume. With details specific to the situation of course.
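The upgrade sequence above can be written down as a small model to make the ordering explicit. This is only an illustration: the `Node` class and `rolling_upgrade` function are hypothetical, and real failover involves replication catch-up and client redirection that are not modeled here.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    version: int
    active: bool = False  # True while this node is serving traffic


def rolling_upgrade(master, standby, new_version):
    """Upgrade both nodes while exactly one of them stays active."""
    steps = []

    # 1. Fail over so the standby serves traffic.
    master.active, standby.active = False, True
    steps.append(f"fail over to {standby.name}")

    # 2. Upgrade the idle master.
    master.version = new_version
    steps.append(f"upgrade {master.name}")

    # 3. Flip back to the upgraded master.
    master.active, standby.active = True, False
    steps.append(f"flip back to {master.name}")

    # 4. Upgrade the now-idle standby and resume normal operation.
    standby.version = new_version
    steps.append(f"upgrade {standby.name}")
    return steps


master = Node("db1", version=1, active=True)
standby = Node("db2", version=1)
steps = rolling_upgrade(master, standby, new_version=2)
```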
Use an established RDBMS that supports such things natively
Do you really want to run a 24/7 mission-critical system on something that may or may not be consistent at any given point in time?
You need to avoid single points of failure.
All the major players in our dbms world offer at least one way to avoid making the database itself a single point of failure. I might question whether they can propagate changes fast enough for your manufacturing processes. (Or is data update not really an issue? Can't really tell from your question.) My db work in manufacturing is limited to the car and the chemical industry. Microseconds didn't matter to them.
But the dbms isn't the only thing that can fail. "Always working" means that the clients have to always be working, too. Client hardware, connections to the network, the network and network servers themselves all probably have single points of failure. Failure-tolerant servers have multiple power supplies, multiple NICs, etc.
"Always working" is really expensive. I have a feeling that the database isn't going to be the biggest problem for your company.
I am interested in implementing an architecture that has two databases one for read operations and the other for writes. I have never implemented something like this and have always built single database, highly normalised systems so I am not quite sure where to begin. I have a few parts to this question.
1. What would be a good resource to find out more about this architecture?
2. Is it just a question of replicating between two identical schemas, or would your schemas differ depending on the operations, would normalisation vary too?
3. How do you ensure that data written to one database is immediately available for reading from the second?
Any further help, tips, resources would be appreciated. Thanks.
EDIT
After some research I found this article, which may be informative for those interested:
http://www.codefutures.com/database-sharding/
I found this highscalability article very informative
I'm not a specialist but the read/write master database and read-only slaves pattern is a "common" pattern, especially for big applications doing mostly read accesses or data warehouses:
it allows you to scale (you add more read-only slaves if required)
it allows you to tune the databases differently (for either efficient reads or efficient writes)
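At the application level, the pattern above often comes down to routing: writes go to the master and reads are spread over the slaves. A minimal sketch, assuming a hypothetical `ReplicatedRouter` class and using plain strings in place of real database connections:

```python
import itertools


class ReplicatedRouter:
    """Send writes to the master and rotate reads across read-only slaves."""

    def __init__(self, master, replicas):
        self.master = master
        self._replicas = itertools.cycle(replicas)  # simple round-robin

    def for_write(self):
        return self.master

    def for_read(self):
        return next(self._replicas)


# Strings stand in for connection objects in this demo.
router = ReplicatedRouter("master", ["slave-1", "slave-2"])
write_target = router.for_write()
read_targets = [router.for_read() for _ in range(4)]
```

Adding read capacity then means adding an entry to the replica list; reads may still see slightly stale data unless replication is synchronous.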
What would be a good resource to find out more about this architecture?
There are good resources available on the Internet. For example:
Highscalability.com has good examples (e.g. Wikimedia architecture, the master-slave category,...)
Handling Data in Mega Scale Systems (starting from slide 29)
MySQL Scale-Out approach for better performance and scalability as a key factor for Wikipedia’s growth
Chapter 24. High Availability and Load Balancing in PostgreSQL documentation
Chapter 16. Replication in MySQL documentation
http://www.google.com/search?q=read%2Fwrite+master+database+and+read-only+slaves
Is it just a question of replicating between two identical schemas, or would your schemas differ depending on the operations, would normalisation vary too?
I'm not sure - I'm eager to read answers from experts - but I think the schemas are identical in traditional replication scenarios (the tuning may be different though). Maybe people are doing more exotic things, but I wonder if they rely on database replication in that case; it sounds more like "real-time ETL".
How do you ensure that data written to one database is immediately available for reading from the second?
I guess you would need synchronous replication for that (which is of course slower than asynchronous). While some databases do support this mode, not all do AFAIK. But have a look at this answer or this one for SQL Server.
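The difference between the two modes can be shown with a toy model. This is a sketch under simplifying assumptions, not how any particular database implements replication: the `Primary` and `Replica` classes are hypothetical, and "shipping" is an explicit method call rather than a background process.

```python
class Replica:
    def __init__(self):
        self.data = {}


class Primary:
    def __init__(self, replica, synchronous):
        self.data = {}
        self.replica = replica
        self.synchronous = synchronous
        self.pending = []  # changes not yet applied on the replica

    def write(self, key, value):
        self.data[key] = value
        if self.synchronous:
            # The write returns only after the replica has applied it.
            self.replica.data[key] = value
        else:
            # The write returns at once; the replica catches up later.
            self.pending.append((key, value))

    def ship(self):
        """Apply queued changes on the replica (asynchronous mode)."""
        for key, value in self.pending:
            self.replica.data[key] = value
        self.pending.clear()


async_replica = Replica()
async_primary = Primary(async_replica, synchronous=False)
async_primary.write("x", 1)
read_before_ship = async_replica.data.get("x")  # replica has not seen the write yet
async_primary.ship()
read_after_ship = async_replica.data.get("x")

sync_replica = Replica()
sync_primary = Primary(sync_replica, synchronous=True)
sync_primary.write("x", 1)
sync_read = sync_replica.data.get("x")          # visible immediately
```

In the synchronous case every write pays the cost of the replica round trip, which is why it is slower.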
You might look up data warehouses.
These serve as 'normalized for reporting' type databases, while you can keep a normalized OLTP style instance for the data maintenance.
I don't think the idea of 'immediate' equivalence will be a reality. There will be some delay while the new data and changes are migrated in to the other system. The schedule and scope will be your big decisions here.
In regard to question 2:
It really depends on what you are trying to achieve by having two databases. If it is for performance reasons (which I suspect it may be), I would suggest you look into denormalizing the read-only database as needed for performance. If performance isn't an issue, then I wouldn't mess with the read-only schema.
I've worked on similar systems where there would be a read/write database that was only lightly used by administrative users. That database would then be replicated to the read only database during a nightly process.
Question 3:
How immediate are we talking here? Less than a second? 10 seconds? Minutes?
I have done database optimization for databases up to 3 GB in size. I need a really large database to test optimization.
Simply generating a lot of data and throwing it into a table proves nothing about the DBMS, the database itself, the queries being issued against it, or the applications interacting with them, all of which factor into the performance of a database-dependent system.
The phrase "I have done database optimization for [databases] up to 3 GB" is highly suspect. What databases? On what platform? Using what hardware? For what purposes? For what scale? What was the model? What were you optimizing? What was your budget?
These same questions apply to any database, regardless of size. I can tell you first-hand that "optimizing" a 250 GB database is not the same as optimizing a 25 GB database, which is certainly not the same as optimizing a 3 GB database. But that is not merely on account of the database size, it is because databases that contain 250 GB of data invariably deal with requirements that are vastly different from those addressed by a 3 GB database.
There is no magic size barrier at which you need to change your optimization strategy; every optimization requires in-depth knowledge of the specific data model and its usage requirements. Maybe you just need to add a few indexes. Maybe you need to remove a few indexes. Maybe you need to normalize, denormalize, rewrite a couple of bad queries, change locking semantics, create a data warehouse, implement caching at the application layer, or look into the various kinds of vertical scaling available for your particular database platform.
I submit that you are wasting your time attempting to create a "really big" database for the purposes of trying to "optimize" it with no specific requirements in mind. Various data-generation tools are available for when you need to generate data fitting specific patterns for testing against a specific set of scenarios, but until you have that information on hand, you won't accomplish very much with a database full of unorganized test data.
The best way to do this is to create your schema and write a script to populate it with lots of random(ish) dummy data. Random, meaning that your text-fields don't necessarily have to make sense. 'ish', meaning that the data distribution and patterns should generally reflect your real-world DB usage.
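A populate script along these lines could look like the sketch below. Everything specific in it is an assumption for illustration: the `orders` schema, the `populate` function, SQLite as the target, and a Zipf-like weighting that makes a few customers account for most rows. Replace the weights with whatever distribution matches your real usage.

```python
import random
import sqlite3
import string


def populate(conn, n_rows, n_customers=100, seed=42):
    """Fill `orders` with random-ish rows skewed toward low customer ids."""
    rng = random.Random(seed)                  # seeded so runs are repeatable
    customers = list(range(1, n_customers + 1))
    weights = [1 / c for c in customers]       # Zipf-like: customer 1 is busiest
    rows = []
    for _ in range(n_rows):
        customer_id = rng.choices(customers, weights=weights)[0]
        note = "".join(rng.choices(string.ascii_lowercase, k=20))  # meaningless text
        amount = round(rng.uniform(1, 500), 2)
        rows.append((customer_id, note, amount))
    with conn:                                 # single transaction for speed
        conn.executemany(
            "INSERT INTO orders (customer_id, note, amount) VALUES (?, ?, ?)", rows
        )


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders ("
    "id INTEGER PRIMARY KEY, customer_id INTEGER, note TEXT, amount REAL)"
)
populate(conn, 10_000)

count_sql = "SELECT COUNT(*) FROM orders WHERE customer_id = ?"
total = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
busiest = conn.execute(count_sql, (1,)).fetchone()[0]
quietest = conn.execute(count_sql, (100,)).fetchone()[0]
```

The skew matters more than the volume: uniformly random keys would hide the hot spots that real traffic produces.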
Edit: a quick Google search reveals a number of commercial tools that will do this for you if you don't want to write your own populate scripts: DB Data Generator, DTM Data Generator. Disclaimer: I've never used either of these and can't really speak to their quality or usefulness.
Here is a free procedure I wrote to generate Random person names. Quick and dirty, but it works and might help.
http://www.joebooth-consulting.com/products/genRandNames.sql
I use Red-Gate's Data Generator regularly to test out problems as well as loads on real systems, and it works quite well. That said, I would agree with Aaronnaught's sentiment that the overall size of the database isn't nearly as important as the usage patterns and the business model. For example, generating 10 GB of data in a table that will eventually get no traffic will not provide any insight into optimization. The goal is to replicate the transaction and storage loads you anticipate in order to identify bottlenecks before they occur.
Does the replication system that comes with DB4O work well? Basically I would like to know if anyone has some good numbers on the record throughput of their replication system and if it handles concurrency errors gracefully or not. What is the relative performance difference between SQL Server's merge replication between two SQL servers and using DRS between two DB4O databases?
We are currently working on improving the replication system further and improving performance certainly is a goal.
I think it's quite hard to produce comparable figures. Every object that needs to be replicated requires a lookup in the UUID BTree. If you know what you are doing, you can fine-tune that to run completely in memory. Then again, the throughput will depend very much on how many indexes you have on each side and how big the indexes are. db4o and the SQL server of your choice (and any other SQL server) may scale differently with size, and that may very much depend on the hardware you use (db4o loves solid state disks with short seek times).
This is like with any other benchmark: You can only find out how things really will work for you if you mock up the scenario that you think you need and run it on your hardware.
As to handling concurrency: Any conflict will call back into your code and it's your choice how you handle it. You can resolve by hand by merging changes to either side and you can also ignore objects. It's up to your code to find out what it thinks is right.
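The callback style described above can be illustrated with a generic sketch. This is not the dRS API: the `replicate` function, the dictionaries standing in for the two databases, and the explicit "changed" sets are all hypothetical. The callback returns the winning value, or `None` to ignore the object.

```python
def replicate(a, b, changed_a, changed_b, on_conflict):
    """Bring two key/value stores in sync, delegating conflicts to a callback.

    `changed_a` / `changed_b` hold keys modified on each side since the
    last replication run.
    """
    for key in sorted(changed_a | changed_b):
        both_changed = key in changed_a and key in changed_b
        if both_changed and a.get(key) != b.get(key):
            winner = on_conflict(key, a.get(key), b.get(key))
            if winner is None:
                continue               # caller chose to ignore this object
            a[key] = b[key] = winner   # apply the caller's decision to both sides
        elif key in changed_a:
            b[key] = a[key]            # only side A changed (or both agree)
        else:
            a[key] = b[key]            # only side B changed


side_a = {"x": 10, "y": 2, "z": 3}
side_b = {"x": 1, "y": 20, "z": 30}
conflicts_seen = []


def keep_larger(key, value_a, value_b):
    """Example policy: record the conflict and keep the larger value."""
    conflicts_seen.append(key)
    return max(value_a, value_b)


replicate(side_a, side_b, {"x", "z"}, {"y", "z"}, keep_larger)
```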
With respect to concurrency, if you have a replication session running side-by-side with another live session that constantly modifies objects: the currently released dRS code is not yet strong in this case. While we implement replication between db4o and the high-end object database Versant VOD, we will try to cover these kinds of concurrency cases as well.