Is DB4O Replication faster than SQL Server Merge Replication?

Does the replication system that comes with DB4O work well? Basically I would like to know if anyone has some good numbers on the record throughput of their replication system and if it handles concurrency errors gracefully or not. What is the relative performance difference between SQL Server's merge replication between two SQL servers and using DRS between two DB4O databases?

We are currently working on improving the replication system further, and improving performance is certainly a goal.
I think it's quite hard to produce comparable figures. Every object that needs to be replicated requires a lookup in the UUID BTree. If you know what you are doing, you can fine-tune that to run completely in memory. Then again, the throughput will depend very much on how many indexes you have on each side and how big those indexes are. db4o and the SQL server of your choice (or any other SQL server) may scale differently with size, and that may very much depend on the hardware you use (db4o loves solid-state discs with short seek times).
This is like with any other benchmark: You can only find out how things really will work for you if you mock up the scenario that you think you need and run it on your hardware.
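If it helps, here is a minimal sketch of such a mock-up harness in Python; `replicate_batch` is a hypothetical stand-in for whatever actually drives one sync in your stack (a dRS session, a merge agent run, etc.), not a real API:

```python
import time

def measure_throughput(batches, replicate_batch):
    """Time a replication step over several batches and report objects/second.
    `replicate_batch` is whatever function drives one sync in your stack."""
    total = 0
    start = time.perf_counter()
    for batch in batches:
        replicate_batch(batch)
        total += len(batch)
    elapsed = time.perf_counter() - start
    return total / elapsed

# Dummy no-op step, just to show the shape of the harness:
if __name__ == "__main__":
    fake_batches = [list(range(1000)) for _ in range(50)]
    print(f"{measure_throughput(fake_batches, lambda batch: None):.0f} objects/sec")
```

Run it against both setups, with your real data volumes and indexes, on your real hardware.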
As to handling concurrency: any conflict will call back into your code, and it's your choice how you handle it. You can resolve a conflict by hand, merging changes to either side, or you can ignore objects altogether. It's up to your code to decide what it thinks is right.
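To make the callback idea concrete, here is a conceptual sketch in Python; this is not the actual dRS listener API (which is Java/.NET), just an illustration of a conflict being handed to user code that can pick a side, merge, or ignore it:

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Optional

@dataclass
class Conflict:
    """Hypothetical conflict record: the same object was changed on both sides."""
    object_id: str
    version_a: Dict[str, Any]
    version_b: Dict[str, Any]

def resolve_newest_wins(conflict: Conflict) -> Optional[Dict[str, Any]]:
    """One possible policy: keep whichever side was modified last
    (assumes each version carries a 'modified_at' field).
    Returning None means 'ignore this object'."""
    a, b = conflict.version_a, conflict.version_b
    return a if a["modified_at"] >= b["modified_at"] else b

def replicate(changes_a, changes_b, resolver: Callable[[Conflict], Optional[Dict[str, Any]]]):
    """Sketch of one replication pass: non-conflicting changes go straight through,
    conflicting ones are delegated to the user-supplied resolver."""
    merged = {}
    for oid in set(changes_a) | set(changes_b):
        if oid in changes_a and oid in changes_b:
            winner = resolver(Conflict(oid, changes_a[oid], changes_b[oid]))
            if winner is not None:
                merged[oid] = winner
        else:
            merged[oid] = changes_a[oid] if oid in changes_a else changes_b[oid]
    return merged
```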
With respect to concurrency, if you have a replication session running side by side with another live session that constantly modifies objects: the currently released dRS code is not yet robust for this case. While we implement replication between db4o and the high-end object database Versant VOD, we will try to cover these kinds of concurrency cases as well.

Related

Distributed SQL Database

I need to decide which database to use for a system where I need AP from the CAP theorem. Data is constantly but slowly going in. Big queries are expected. It should be reliable - no single point of failure. I can use up to 3 instances on different nodes. In-memory solutions are bad for me because of data size - it should be running for years and I expect up to terabyte data sizes. Most guys on my team prefer SQL, but I understand that traditional SQL databases are not fault tolerant in terms of hardware failure. Any ideas?
Since this question was asked there have been some significant changes in the Distributed SQL or NewSQL landscape, the most noteworthy being the viability of CockroachDB. That appears to be the best option in a situation like the one referenced in this question.
No single point of failure. Easy to scale. Can handle tons of volume. You can run it wherever you want. Speaks the Postgres wire protocol. Super fault tolerant.
Amazon Redshift seems to be the best answer (thank you kuujo), but we will try RethinkDB because it has some nice features.

Web Caching Servers for SQL Server OLTP Env. Recommendations

I inherited a high-volume OLTP DB which I have free rein to improve as much as I find reasonably possible. The improvements made so far were very helpful, but I want to take it to the next level. The data access patterns I found make it a good candidate, IMO, to cache the data on other servers, and I would love to hear anyone's experience or recommendations with this type of setup.
We have a DB that gets about 3GB of data added to a table every day, and reporting on it used to be very slow. The data does not change once it's put in, and no data gets inserted that is over a week old. Rows inputted within the last 3 days tend to see thousands of inserts among tens of millions of rows.
I was thinking of having data over 2 weeks old pushed out to MongoDB. I could then have the 2-week sliding window of data that is not pushed out to Mongo cached by some kind of caching software, so those queries are served from the cache instead of the data being read out of the DB the whole time. I figure this way we still get full ACID compliance by having the DB engine validate all the data, we get high read performance because reads are not hitting the DB, and then Mongo can take the data when it is no longer a 'transaction'.
Anyone have any recommended solutions? I was looking at MemCached, but not quite sure if that's a good or even plausible solution. Thanks!
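As a rough sketch of the "push anything older than two weeks out to MongoDB" step described above (the dbo.Events table, its columns, and the connection strings are all made up; in practice you would also batch the copy rather than pull everything into memory at once):

```python
from datetime import datetime, timedelta, timezone

import pymongo
import pyodbc

# Placeholder connection details and a hypothetical dbo.Events(EventId, CreatedAt, Payload) table.
sql = pyodbc.connect("DRIVER={ODBC Driver 17 for SQL Server};SERVER=oltp01;DATABASE=Reporting;Trusted_Connection=yes")
archive = pymongo.MongoClient("mongodb://archive01:27017")["archive"]["events"]

cutoff = datetime.now(timezone.utc) - timedelta(days=14)

cur = sql.cursor()
cur.execute("SELECT EventId, CreatedAt, Payload FROM dbo.Events WHERE CreatedAt < ?", cutoff)
batch = [{"_id": row.EventId, "created_at": row.CreatedAt, "payload": row.Payload}
         for row in cur.fetchall()]

if batch:
    try:
        # ordered=False lets the insert continue past duplicate _id errors on re-runs.
        archive.insert_many(batch, ordered=False)
    except pymongo.errors.BulkWriteError:
        pass  # some rows were already archived by a previous run
    cur.execute("DELETE FROM dbo.Events WHERE CreatedAt < ?", cutoff)
    sql.commit()
```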
Another thing you could consider is using the new In-Memory OLTP feature in SQL Server 2014. That feature improves efficiency and scaling for OLTP workloads. You will potentially be able to get a lot more out of your existing server, without the need to consider specific caching mechanisms.
I don't have specific experience with SQL Server, but what you are describing does seem like a valid use case for MongoDB.
Note that while MongoDB can't directly handle transactions, it is capable of handling certain operations in an atomic fashion (see findAndModify, for instance). Additionally, with journaling enabled, you shouldn't have any reason to worry about durability. MongoDB is a reliable data store and will not lose or corrupt your data.
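For instance, with pymongo you can claim a document atomically and require the write to be journaled before it is acknowledged (the database, collection and field names here are illustrative):

```python
from pymongo import MongoClient, ReturnDocument
from pymongo.write_concern import WriteConcern

client = MongoClient("mongodb://localhost:27017")
# j=True: acknowledge the write only once it has been written to the on-disk journal.
jobs = client["app"].get_collection("jobs", write_concern=WriteConcern(j=True))

claimed = jobs.find_one_and_update(
    {"status": "pending"},                # find one pending job...
    {"$set": {"status": "in_progress"}},  # ...and claim it in a single atomic operation
    return_document=ReturnDocument.AFTER,
)
print(claimed)
```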
MongoDB itself can also act as a performant cache if you run a second deployment with journaling disabled. In this instance, writes will take place in memory and only be persisted to disk every 60 seconds (unless otherwise configured). This will provide performance comparable to memcached, which is solely in-memory, while allowing you to keep your stack a bit simpler.
Hope this helps!

Single logical SQL Server possible from multiple physical servers?

With Microsoft SQL Server 2005, is it possible to combine the processing power of multiple physical servers into a single logical sql server? Is it possible on SQL Server 2008?
I'm thinking that if the database files were located on a SAN, and somehow one of the SQL servers acted as a kind of master, then processing could be spread out over multiple physical servers, for instance even allowing simultaneous updates where there is no overlap, and with no limit in the case of read-only queries on unlocked tables.
We have an application that is limited by the speed of our SQL server, and we are probably stuck with SQL Server 2005 for now. Is the only option to get a single, more powerful physical server?
Sorry I'm not an expert, I'm not sure if the question is a stupid one.
TIA
Before rushing out and buying new hardware, find out where your bottlenecks really are. Many locking problems can be solved with the appropriate indexes for your workload.
For example, I've seen instances where placing tempDB on SSD solved performance issues and saved the client buying an expensive new server.
Analyse your workload: How Can I Log and Find the Most Expensive Queries?
With SQL Server 2008 you can utilise the Management Data Warehouse (MDW) to capture your workload.
White Paper: SQL Server 2008 Performance and Scale
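To get a quick look at the most expensive statements without any extra tooling, the standard plan-cache DMV query below works on SQL Server 2005 and later (the pyodbc connection string is only a placeholder):

```python
import pyodbc

QUERY = """
SELECT TOP 20
    qs.total_worker_time / qs.execution_count AS avg_cpu_time,
    qs.total_elapsed_time / qs.execution_count AS avg_elapsed_time,
    qs.execution_count,
    SUBSTRING(st.text, (qs.statement_start_offset / 2) + 1,
        ((CASE qs.statement_end_offset WHEN -1 THEN DATALENGTH(st.text)
          ELSE qs.statement_end_offset END - qs.statement_start_offset) / 2) + 1) AS statement_text
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
ORDER BY avg_cpu_time DESC;
"""

conn = pyodbc.connect("DRIVER={ODBC Driver 17 for SQL Server};SERVER=.;Trusted_Connection=yes")
for row in conn.cursor().execute(QUERY):
    print(row.execution_count, row.avg_cpu_time, row.statement_text[:120])
```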
Also: please be aware that a SAN solution is not necessarily a faster I/O solution than direct-attached storage. It depends on the SAN, the number of physical disks in a LUN, LUN subscription and usage, the speed of the HBAs and several other hardware factors...
Optimizing the app may be a big job of going through all the business logic and lines of code, but looking for the most expensive query can quickly locate the bottleneck area. Maybe it only happens in a couple of the biggest tables, views or stored procedures. Adding or fine-tuning an index may help right away. If bumping up the RAM is possible, try that option as well; it is cheap and easy to configure.
Good luck.
You might want to google for "sql server scalable shared database". Yes you can store your db files on a SAN and use multiple servers, but you're going to have to meet some pretty rigid criteria for it to be a performance boost or even useful (high ratio of reads to writes, small enough dataset to fit in memory or a fast enough SAN, multiple concurrent accessors, etc, etc).
Clustering is complicated and probably much more expensive in the long run than a bigger server, and far less effective than properly optimized application code. You should definitely make sure your app is well optimized.

How would I implement separate databases for reading and writing operations?

I am interested in implementing an architecture that has two databases, one for read operations and the other for writes. I have never implemented something like this and have always built single-database, highly normalised systems, so I am not quite sure where to begin. I have a few parts to this question.
1. What would be a good resource to find out more about this architecture?
2. Is it just a question of replicating between two identical schemas, or would the schemas differ depending on the operations, and would normalisation vary too?
3. How do you ensure that data written to one database is immediately available for reading from the second?
Any further help, tips, resources would be appreciated. Thanks.
EDIT
After some research I found this article, which I found very informative, for those interested:
http://www.codefutures.com/database-sharding/
I found this highscalability article very informative
I'm not a specialist, but the read/write master database with read-only slaves is a "common" pattern, especially for big applications doing mostly reads, or for data warehouses:
it allows you to scale (you add more read-only slaves if required)
it allows you to tune the databases differently (for either efficient reads or efficient writes)
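At the application level, the routing itself can be quite simple; a minimal sketch, assuming a master and a single replica reachable through pyodbc (connection strings are placeholders, and with asynchronous replication reads may briefly be stale):

```python
import pyodbc

class ReadWriteRouter:
    """Send writes to the master and reads to a read-only replica."""

    def __init__(self, master_dsn, replica_dsn):
        self.master = pyodbc.connect(master_dsn)
        self.replica = pyodbc.connect(replica_dsn)

    def execute_write(self, sql, *params):
        cur = self.master.cursor()
        cur.execute(sql, *params)
        self.master.commit()
        return cur.rowcount

    def execute_read(self, sql, *params):
        # With asynchronous replication this may lag slightly behind the master.
        return self.replica.cursor().execute(sql, *params).fetchall()

# Hypothetical usage:
# db = ReadWriteRouter("DSN=master_db", "DSN=replica_db")
# db.execute_write("INSERT INTO orders (customer_id) VALUES (?)", 42)
# rows = db.execute_read("SELECT COUNT(*) FROM orders")
```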
What would be a good resource to find out more about this architecture?
There are good resources available on the Internet. For example:
Highscalability.com has good examples (e.g. Wikimedia architecture, the master-slave category,...)
Handling Data in Mega Scale Systems (starting from slide 29)
MySQL Scale-Out approach for better performance and scalability as a key factor for Wikipedia’s growth
Chapter 24. High Availability and Load Balancing in PostgreSQL documentation
Chapter 16. Replication in MySQL documentation
http://www.google.com/search?q=read%2Fwrite+master+database+and+read-only+slaves
Is it just a question of replicating between two identical schemas, or would your schemas differ depending on the operations, and would normalisation vary too?
I'm not sure - I'm eager to read answers from experts - but I think the schemas are identical in traditional replication scenarios (the tuning may be different, though). Maybe people are doing more exotic things, but I wonder whether they rely on database replication in that case; it sounds more like "real-time ETL".
How do you ensure that data written to one database is immediately available for reading from the second?
I guess you would need synchronous replication for that (which is of course slower than asynchronous). While some databases do support this mode, not all do AFAIK. But have a look at this answer or this one for SQL Server.
You might look up data warehouses.
These serve as 'normalized for reporting' type databases, while you can keep a normalized OLTP style instance for the data maintenance.
I don't think the idea of 'immediate' equivalence will be a reality. There will be some delay while the new data and changes are migrated to the other system. The schedule and scope will be your big decisions here.
In regard to question 2:
It really depends on what you are trying to achieve by having two databases. If it is for performance reasons (which I suspect it may be), I would suggest you look into denormalizing the read-only database as needed for performance. If performance isn't an issue, then I wouldn't mess with the read-only schema.
I've worked on similar systems where there would be a read/write database that was only lightly used by administrative users. That database would then be replicated to the read only database during a nightly process.
Question 3:
How immediate are we talking here? Less than a second? 10 seconds? Minutes?

When do transactions become more of a burden than a benefit?

Transactional programming is, in this day and age, a staple of modern development. Concurrency and fault tolerance are critical to an application's longevity and, rightly so, transactional logic has become easy to implement. As applications grow, though, transactional code tends to become more and more of a burden on the scalability of the application, and when you bridge into distributed transactions and mirrored data sets the issues start to become very complicated. I'm curious what seems to be the point, in data size or application complexity, where transactions frequently start becoming the source of issues (causing timeouts, deadlocks, performance problems in mission-critical code, etc.) that are more bothersome to fix, troubleshoot or work around than designing a data model that is more fault-tolerant in itself, or using other means to ensure data integrity. Also, what design patterns serve to minimize these impacts or make standard transactional logic obsolete or a non-issue?
--
EDIT: We've got some answers of reasonable quality so far, but I think I'll post an answer myself to bring up some of the things I've heard about, to try to inspire some additional creativity; most of the responses I'm getting are pessimistic views of the problem.
Another important note is that not all deadlocks are a result of poorly coded procedures; sometimes there are mission-critical operations that depend on similar resources in different orders, or complex joins in different queries that step on each other. This is an issue that can sometimes seem unavoidable, but I've been part of reworking workflows to facilitate an execution order that is less likely to cause one.
I think no design pattern can solve this issue by itself. Good database design, good stored procedure programming and especially learning how to keep your transactions short will ease most of the problems.
There is no 100% guaranteed method of not having problems though.
In basically every case I've seen in my career, though, deadlocks and slowdowns were solved by fixing the stored procedures:
making sure all tables are accessed in a consistent order prevents deadlocks
fixing indexes and statistics makes everything faster (and hence diminishes the chance of a deadlock)
sometimes there was no real need for transactions; it just "looked" like there was
sometimes transactions could be eliminated by turning multiple-statement stored procedures into single-statement ones
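To illustrate the consistent-access-order and short-transaction points above, here is a sketch of a retry wrapper that keeps each transaction small, always touches tables in the same order, and retries when the session is chosen as the deadlock victim (SQL Server reports this as SQLSTATE 40001 through ODBC; the table names are made up):

```python
import time
import pyodbc

DEADLOCK_SQLSTATE = "40001"  # serialization failure / chosen as deadlock victim

def run_with_deadlock_retry(conn, work, retries=3, backoff=0.1):
    """Run `work(cursor)` in a short transaction, retrying on deadlock."""
    for attempt in range(retries):
        cur = conn.cursor()
        try:
            work(cur)
            conn.commit()
            return
        except pyodbc.Error as exc:
            conn.rollback()
            if exc.args and exc.args[0] == DEADLOCK_SQLSTATE and attempt < retries - 1:
                time.sleep(backoff * (attempt + 1))
                continue
            raise

def transfer(cur):
    # Always touch Accounts before AccountHistory, in every procedure,
    # so two sessions can never wait on each other in opposite orders.
    cur.execute("UPDATE Accounts SET Balance = Balance - 10 WHERE Id = ?", 1)
    cur.execute("UPDATE Accounts SET Balance = Balance + 10 WHERE Id = ?", 2)
    cur.execute("INSERT INTO AccountHistory (AccountId, Delta) VALUES (?, ?)", 1, -10)
```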
The use of shared resources is wrong in the long run, because by reusing an existing environment you are creating more and more possibilities. Just review the busy beavers :) The way Erlang goes is the right way to produce fault-tolerant and easily verifiable systems.
But transactional memory is essential for many applications in widespread use. If you consider a bank with its millions of customers, for example, you can't just copy the data for the sake of efficiency.
I think monads are a cool way to handle the difficult concept of changing state.
One approach I've heard of is a versioned, insert-only model where no updates ever occur. During selects, the version is used to select only the latest rows. One downside I know of with this approach is that the database can get rather large very quickly.
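A toy in-memory illustration of the insert-only idea (a real system would do the same thing with a version or valid-from column on the table):

```python
from collections import defaultdict
from itertools import count

class VersionedStore:
    """Insert-only store: updates append a new version, reads pick the latest.
    Nothing is ever modified in place, so writers never rewrite what readers see."""

    def __init__(self):
        self._rows = defaultdict(list)   # key -> [(version, value), ...]
        self._versions = count(1)

    def put(self, key, value):
        self._rows[key].append((next(self._versions), value))

    def get(self, key):
        versions = self._rows.get(key)
        return versions[-1][1] if versions else None

    def history(self, key):
        return list(self._rows.get(key, []))

store = VersionedStore()
store.put("order:1", {"status": "new"})
store.put("order:1", {"status": "shipped"})   # an "update" is just another insert
print(store.get("order:1"))                   # {'status': 'shipped'}
```

The growth problem mentioned above is visible here too: history() never shrinks unless you add some form of pruning.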
I also know that some solutions, such as FogBugz, don't use enforced foreign keys, which I believe would also help mitigate some of these problems, because the SQL query plan can lock linked tables during selects or updates even if no data is changing in them, and if a highly contended table gets locked it can increase the chance of a deadlock or timeout.
I don't know much about these approaches though since I've never used them, so I assume there are pros and cons to each that I'm not aware of, as well as some other techniques I've never heard about.
I've also been looking into some of the material from Carlo Pescio's recent post, which I've unfortunately not had enough time to do justice to, but the material seems very interesting.
If you are talking 'cloud computing' here, the answer would be to localize each transaction to the place where it happens in the cloud.
There is no need for the entire cloud to be consistent, as that would kill performance (as you noted). Simply keep track of what is changed and where, and handle multiple small transactions as the changes propagate through the system.
The situation where user A updates record R and user B at the other end of the cloud does not see it (yet) is the same as the one where user A hasn't made the change yet in the current strict-transactional environment. This could lead to discrepancies in an update-heavy system, so systems should be architected to rely on updates as little as possible, moving towards aggregation of data and pulling out the aggregates only once the exact figure is critical (i.e. moving the requirement for consistency from write time to critical read time).
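A small sketch of that idea, with each node recording cheap local deltas and the exact figure only being aggregated when a consistent answer is actually needed:

```python
from collections import defaultdict

class EventualCounter:
    """Each node records local deltas without coordination; the exact total
    is only computed at critical read time."""

    def __init__(self):
        self._deltas = defaultdict(list)   # node_id -> list of local deltas

    def record(self, node_id, delta):
        # Cheap local write, no cross-node transaction.
        self._deltas[node_id].append(delta)

    def approximate_total(self, node_id):
        # What a single node can answer immediately from its own data.
        return sum(self._deltas[node_id])

    def critical_read(self):
        # The expensive, consistent aggregation, done only when the exact figure matters.
        return sum(sum(deltas) for deltas in self._deltas.values())
```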
Well, just my POV. It's hard to conceive of a system that is application-agnostic in this case.
Try to make changes at the database level in the fewest possible instructions.
The general rule is to lock a resource for the least possible time. Using T-SQL, PL/SQL, Java on Oracle or any similar approach, you can reduce the time that each transaction locks a shared resource. In fact, transactions in the database are optimized with row-level locks, multi-versioning and other kinds of intelligent techniques. If you can run the transaction at the database, you save the network latency, apart from other layers like ODBC/JDBC/OLEDB.
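For example, the difference between iterating over rows from the client inside one transaction and issuing a single set-based statement that runs entirely on the server (the DSN, table and column names are hypothetical):

```python
import pyodbc

conn = pyodbc.connect("DSN=orders_db")  # placeholder DSN

# Slow: many round trips inside one transaction; locks are held for the whole loop.
def apply_discount_row_by_row(cur, customer_ids):
    for cid in customer_ids:
        cur.execute("UPDATE Orders SET Total = Total * 0.9 WHERE CustomerId = ?", cid)

# Better: one set-based statement, executed entirely on the server.
def apply_discount_set_based(cur):
    cur.execute("""
        UPDATE Orders
        SET Total = Total * 0.9
        WHERE CustomerId IN (SELECT CustomerId FROM DiscountedCustomers)
    """)

cur = conn.cursor()
apply_discount_set_based(cur)
conn.commit()
```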
Sometimes the programmer tries to obtain the good things of a database (it is transactional, parallel, distributed) but keeps a cache of the data. Then they need to manually add back some of the database features.