MongoDB: a few questions

1. I often hear that MongoDB can provide atomicity at the level of a single collection. Why is that, and how is it linked to sharding?
2. Is the only difference between master/slave replication and replica sets that both are master/slave (primary/secondary), but replica sets hold an election if the master goes down?
3. Of the ACID properties, which are and are not supported by MongoDB 2.x?
4. Can durability in MongoDB be guaranteed with safe=true?
Thank you!

MongoDB can currently provide atomicity only at the "update a single document" level; that's it. This is completely unrelated to sharding.
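For illustration, here is a minimal PyMongo sketch of that single-document atomicity; the "accounts" collection and its fields are hypothetical.

```python
# Minimal sketch with PyMongo; collection and field names are made up.
# Both modifications target one document, so MongoDB applies them atomically.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
accounts = client.bank.accounts

accounts.update_one(
    {"_id": "alice"},
    {"$inc": {"balance": -50}, "$set": {"status": "debited"}},
)
```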
More or less. Replica sets are newer, and you should basically always use them now. Master/Slave replication is only around for backwards compatibility these days. It's pretty likely that only replica sets will be getting new features going forwards.
Atomicity is provided for an update to a single document (see #1). Consistency and Isolation aren't really provided at all — your application will have to do that. Durability can be provided (in a fashion) by requiring that a write operation is persisted to multiple nodes before the driver reports success (see #4).
Durability can be provided by tweaking the write concern, either by requiring w > 1 and/or (although this is slow) by using fsync. See the WriteConcern documentation or the connection string documentation.
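As an illustration, here is a hedged PyMongo sketch of raising the write concern; the host names, database, and collection are assumptions.

```python
# Sketch: report success only after 2 replica set members acknowledge the write
# and it has been journaled. Host names and namespaces are placeholders.
from pymongo import MongoClient
from pymongo.write_concern import WriteConcern

client = MongoClient("mongodb://node1:27017,node2:27017/?replicaSet=rs0")
orders = client.shop.orders.with_options(write_concern=WriteConcern(w=2, j=True))

orders.insert_one({"event": "order_created"})
```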

Related

Can we have forest or data replication between different versions of MarkLogic?

Currently we are migrating from MarkLogic version 8 to MarkLogic version 10.
We can do replication between forests of the same version (MarkLogic v10 to MarkLogic v10), so I need to understand whether we can have replication between forests of ML v8 and ML v10 and, if we can, what problems can arise if we try to do that.
Flexible Replication works between MarkLogic 8 and MarkLogic 10; Database Replication, as far as I know, does not. Where Database Replication operates at a fairly low level (it sends journal frames and forest data across) with a configurable allowed replication lag, Flexible Replication operates at a fairly high level of abstraction and uses asynchronous communication by design (the master does not wait for its completion).
Database Replication allows for a certain level of consistency. Flexible Replication, however, can be used for instance in master-master scenarios. Each has its own use cases, but if you need to maintain different major versions of MarkLogic, Flexible Replication is the only option of the two.
Next to that, there are also ways to move data between clusters outside of MarkLogic, using tools like MLCP, Corb, or NiFi for instance. But those are usually more suited for one-time migrations in such cases.
Please note that MarkLogic 8 has reached End of Life: https://help.marklogic.com/
HTH!

Difference between Partial Replication and Sharding?

I was wondering whether sharding is an alternate name for partial replication or not. What I have figured out so far:
Partial Replication – each data item has copies at some, but not all, of the nodes ('Sharding'?)
Pure Partial Replication – nodes hold copies of only a subset of the data items, but no node contains a full copy of the database
Hybrid Partial Replication – one set of nodes are full replicas and another set of nodes are partial replicas
Partial replication is an interesting approach in which you distribute the data via replication from a master to slaves, each holding a portion of the data. You end up with an array of smaller, read-only DBs, each containing a portion of the data. Reads can be distributed and parallelized very well.
But what about the writes?
Those are still funneled into one big, fat, lazy master database. Tasks such as buffer management, locking, thread locks/semaphores, and recovery are the real bottleneck of OLTP; they make writes impossible to scale. See more in my blog post here: http://database-scalability.blogspot.com/2012/08/scale-up-partitioning-scale-out.html. BTW - your topic right here just gave me a great idea for another post. I'll link to this question and give you the credit! :)
Sharding is where data appears only once, within an array of DBs. Each database is the complete owner of its data: data is read from there and written to there. This way, reads and writes are distributed and parallelized, and real scale-out can be achieved.
Sharding is a mess to handle and maintain; it's hard as hell. ScaleBase (I work there) enables automatic, transparent scale-out: just throw it in the middle and you'll have 10 DBs at the back, while it looks like 1 to your app. Automatic, transparent super-sharding - in a box.
Sharding is a method of horizontal partitioning of a table. It isn't related to replication.
Traditionally an RDBMS server sits at the center of the system in a star-like topology. That's why it becomes:
1. the single point of failure
2. the performance bottleneck of the system
To resolve issue #1 you use replication: if the original server dies, you fail over to a replica.
To resolve issue #2 you can:
1. use sharding
   1.1 do the sharding yourself in the application (a minimal sketch follows after this answer)
   1.2 use your RDBMS's "out of the box" clustering mechanism
2. migrate to a NoSQL solution
Sharding allows you to scale the database out to many servers by splitting the data among them. However, sharding is a trade-off: it limits your ability to join/intersect data across shards, etc.
You still have issue #1 if you use sharding. So it's a good practice to replicate sharded nodes.
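To make option 1.1 concrete, here is a minimal, hypothetical sketch of application-level hash sharding; the connection URLs, key format, and shard count are made up, and a real implementation would also have to handle resharding, cross-shard queries, and replication of each shard (issue #1).

```python
# Hypothetical sketch of application-level hash sharding.
# Each shard is an independent database; a given key always maps to the same shard.
import hashlib

SHARD_URLS = [
    "postgresql://db0.example.com/app",
    "postgresql://db1.example.com/app",
    "postgresql://db2.example.com/app",
]

def shard_for(key: str) -> str:
    """Map a key to one shard using a stable hash."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return SHARD_URLS[int(digest, 16) % len(SHARD_URLS)]

# Reads and writes for "user:alice" always go to the same database.
print(shard_for("user:alice"))
```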

Does Redis support strong consistency

I am looking at porting a Java application to .NET, the application currently uses EhCache quite heavily and insists that it wants to support strong consistency (http://ehcache.org/documentation/get-started/consistency-options).
I would like to use Redis in place of EhCache, but does Redis support strong consistency or just eventual consistency?
I've seen talk of a Redis Cluster but I guess this is a little way off release yet.
Or am I looking at this wrong? If a Redis instance sat on a different server altogether and served two frontend servers, how big could it get before we'd need to look at a master/slave style setup?
A single instance of Redis is consistent. There are options for consistency across many instances. antirez (the Redis developer) recently wrote a blog post, Redis data model and eventual consistency, and recommended Twemproxy for sharding Redis, which would give you consistency over many instances.
I don't know EhCache, so can't comment on whether Redis is a suitable replacement. One potential problem (porting to .NET) with Twemproxy is it seems to only run on Linux.
How big can a single Redis instance get? Depends on how much RAM you have.
How quickly will it get this big? Depends on how your data looks.
That said, in my experience Redis stores data quite efficiently. One app I have holds info for 200k users, 20k articles, all relationships between objects, weekly leader boards, stats, etc. (330k keys in total) in 400mb of RAM.
Redis is easy to use and fun to work with. Try it out and see if it meets your needs. If you do decide to use it and might one day want to shard, shard your data from the beginning.
Redis is not strongly consistent out of the box. You will probably need to apply 3rd party solutions to make it consistent. Here is a quote from docs:
Write safety
Redis Cluster uses asynchronous replication between nodes, and last failover wins implicit merge function. This means that the last elected master dataset eventually replaces all the other replicas. There is always a window of time when it is possible to lose writes during partitions. However these windows are very different in the case of a client that is connected to the majority of masters, and a client that is connected to the minority of masters.
Usually you need synchronous replication to achieve strong consistency in a distributed, partitioned system.
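For illustration, here is a hedged redis-py sketch using the WAIT command, which blocks until a given number of replicas have acknowledged preceding writes; this narrows the window for lost writes but still does not give strong consistency. Host, port, and key names are assumptions.

```python
# Sketch: block until at least 1 replica acknowledges the write, or 100 ms pass.
# WAIT reduces, but does not eliminate, the window in which a write can be lost.
# Host, port, and key are placeholders.
import redis

r = redis.Redis(host="localhost", port=6379)
r.set("session:42", "active")

# WAIT <numreplicas> <timeout-ms>; returns how many replicas acknowledged.
acked = r.execute_command("WAIT", 1, 100)
print(f"{acked} replica(s) acknowledged the write")
```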

How would I implement separate databases for reading and writing operations?

I am interested in implementing an architecture that has two databases, one for read operations and the other for writes. I have never implemented something like this; I have always built single-database, highly normalised systems, so I am not quite sure where to begin. I have a few parts to this question.
1. What would be a good resource to find out more about this architecture?
2. Is it just a question of replicating between two identical schemas, or would your schemas differ depending on the operations, would normalisation vary too?
3. How do you ensure that data written to one database is immediately available for reading from the second?
Any further help, tips, resources would be appreciated. Thanks.
EDIT
After some research I found this article, which those interested may find very informative:
http://www.codefutures.com/database-sharding/
I found this highscalability article very informative
I'm not a specialist, but the read/write master database with read-only slaves is a "common" pattern, especially for big applications that mostly do reads, and for data warehouses:
it allows you to scale (you add more read-only slaves if required)
it allows you to tune the databases differently (for either efficient reads or efficient writes)
What would be a good resource to find out more about this architecture?
There are good resources available on the Internet. For example:
Highscalability.com has good examples (e.g. Wikimedia architecture, the master-slave category,...)
Handling Data in Mega Scale Systems (starting from slide 29)
MySQL Scale-Out approach for better performance and scalability as a key factor for Wikipedia’s growth
Chapter 24. High Availability and Load Balancing in PostgreSQL documentation
Chapter 16. Replication in MySQL documentation
http://www.google.com/search?q=read%2Fwrite+master+database+and+read-only+slaves
Is it just a question of replicating between two identical schemas, or would your schemas differ depending on the operations, would normalisation vary too?
I'm not sure - I'm eager to read answers from experts - but I think the schemas are identical in traditional replication scenarios (the tuning may be different though). Maybe people are doing more exotic things, but I wonder whether they rely on database replication in that case; it sounds more like "real-time ETL".
How do you ensure that data written to one database is immediately available for reading from the second?
I guess you would need synchronous replication for that (which is of course slower than asynchronous). While some databases do support this mode, not all do AFAIK. But have a look at this answer or this one for SQL Server.
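To illustrate the pattern in application code, here is a minimal, hypothetical routing sketch; the connection strings and the psycopg2 driver are assumptions, and it ignores replication lag, so a read issued right after a write may still see stale data on the replica unless replication is synchronous.

```python
# Hypothetical sketch: route writes to the primary and reads to a replica.
# Connection strings are placeholders; replication lag is not handled here.
import psycopg2

primary = psycopg2.connect("host=primary.example.com dbname=app")
replica = psycopg2.connect("host=replica.example.com dbname=app")
replica.autocommit = True  # reads don't need explicit transactions

def execute_write(sql, params=()):
    """All INSERT/UPDATE/DELETE statements go to the primary."""
    with primary, primary.cursor() as cur:
        cur.execute(sql, params)

def execute_read(sql, params=()):
    """SELECT statements go to the read-only replica."""
    with replica.cursor() as cur:
        cur.execute(sql, params)
        return cur.fetchall()

execute_write("INSERT INTO events (name) VALUES (%s)", ("signup",))
print(execute_read("SELECT count(*) FROM events"))
```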
You might look up data warehouses.
These serve as 'denormalized for reporting' type databases, while you keep a normalized OLTP-style instance for data maintenance.
I don't think the idea of 'immediate' equivalence will be a reality. There will be some delay while the new data and changes are migrated in to the other system. The schedule and scope will be your big decisions here.
In regard to question 2:
It really depends on what you are trying to achieve by having two databases. If it is for performance reasons (which I suspect it is), I would suggest you look into denormalizing the read-only database as needed for performance. If performance isn't an issue, then I wouldn't mess with the read-only schema.
I've worked on similar systems where there would be a read/write database that was only lightly used by administrative users. That database would then be replicated to the read only database during a nightly process.
Question 3:
How immediate are we talking here? Less than a second? 10 seconds? Minutes?

Is DB4O Replication faster than SQL Server Merge Replication?

Does the replication system that comes with DB4O work well? Basically I would like to know if anyone has some good numbers on the record throughput of their replication system and if it handles concurrency errors gracefully or not. What is the relative performance difference between SQL Server's merge replication between two SQL servers and using DRS between two DB4O databases?
We are currently working on improving the replication system further and improving performance certainly is a goal.
I think it's quite hard to produce comparable figures. Every object that needs to be replicated requires a lookup in the UUID BTree. If you know what you are doing, you can fine-tune that to run completely in memory. Then again, the throughput will depend very much on how many indexes you have on each side and how big the indexes are. db4o and the SQL server of your choice (and any other SQL server) may scale differently with size, and that may very much depend on the hardware you use (db4o loves solid state discs with short seek times).
This is like with any other benchmark: You can only find out how things really will work for you if you mock up the scenario that you think you need and run it on your hardware.
As to handling concurrency: any conflict will call back into your code, and it's your choice how you handle it. You can resolve a conflict by hand, merging changes to either side, or you can ignore objects. It's up to your code to decide what it thinks is right.
With respect to concurrency, if you have a replication session running side-by-side with another live session that constantly modifies objects: the currently released dRS code is not yet strong for this case. While we implement replication between db4o and the high-end object database Versant VOD, we will try to cover these kinds of concurrency cases as well.