Can we have forest or data replication between different versions of MarkLogic?

Currently we are migrating from MarkLogic version 8 to MarkLogic version 10.
We can do replication between forests of the same version (MarkLogic 10 to MarkLogic 10), so I need to understand whether we can have replication between forests of ML 8 and ML 10 and, if we can, what problems can arise if we try to do that?

Flexible Replication works between MarkLogic 8 and MarkLogic 10; Database Replication, as far as I know, does not. Database Replication operates at a fairly low level (it sends journal frames and forest data across) with a configurable allowed Replication Lag, whereas Flexible Replication operates at a fairly high level of abstraction and uses asynchronous communication by design (the master does not wait for its completion).
Database Replication allows for a certain level of consistency. Flexible Replication, however, can be used for instance in master-master scenarios. Each has its own use cases, but if you need to maintain different major versions of MarkLogic, Flexible Replication is the only option of the two.
Beyond that, there are also ways to move data between clusters outside of MarkLogic, using tools like MLCP, Corb, or NiFi. Those are usually better suited for one-time migrations in cases like this.
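To illustrate the out-of-band option, here is a minimal Python sketch that copies documents between clusters through MarkLogic's REST API (MLCP's copy command does the same thing at scale and is the better tool for a real migration). The hosts, credentials, and URI list are made up, and it assumes a REST app server is reachable on port 8000 on both clusters:

```python
import requests
from requests.auth import HTTPDigestAuth

# Made-up endpoints: source ML8 cluster and target ML10 cluster.
src_base, src_auth = "http://ml8-host:8000", HTTPDigestAuth("admin", "admin")
dst_base, dst_auth = "http://ml10-host:8000", HTTPDigestAuth("admin", "admin")

# In practice you would page through a real URI listing; one URI is shown here.
for uri in ["/docs/example-1.xml"]:
    # Read the document from the MarkLogic 8 cluster...
    doc = requests.get(src_base + "/v1/documents",
                       params={"uri": uri}, auth=src_auth)
    doc.raise_for_status()
    # ...and write it, same URI and content type, to the MarkLogic 10 cluster.
    put = requests.put(dst_base + "/v1/documents",
                       params={"uri": uri},
                       data=doc.content,
                       headers={"Content-Type": doc.headers.get(
                           "Content-Type", "application/xml")},
                       auth=dst_auth)
    put.raise_for_status()
```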
Please note that MarkLogic 8 has reached End of Life: https://help.marklogic.com/
HTH!

Related

Moving from SQL Server to Cassandra

I have a data-intensive project for which I wrote the code recently; the data and stored procedures live in an MS SQL database. My initial estimate is that the database will grow to 50TB, then become fairly static in growth. The final application will perform lots of row-level lookups and reads, with a very small percentage of writes back to the database.
With the above scenario in mind, it's being suggested that I should look at a NoSQL option in order to scale to the large load of data and transactions, and after a bit of research the road leads to Cassandra (while considering MongoDB as a second alternative).
I would appreciate your guidance with the following set of initial questions:
-Does Cassandra support the concept of stored procedures?
-Would I be able to install and run the 50TB db on a single node (a single Windows Server)?
-Does Cassandra support/leverage multiple CPUs in a single server (e.g. 4 CPUs)?
-Would the open source version be able to support the 50TB db, or would I need to purchase the Enterprise version?
Regards,
-r
Does Cassandra support the concept of stored procedures?
Cassandra does not support stored procedures. However, there is a feature called "prepared statements" which allows you to submit a CQL query once and then have it executed multiple times with different parameters. But the set of things you can do with prepared statements is limited to regular CQL; in particular, you cannot do things like loops, conditional statements, or other interesting things. You do get some measure of protection against injection attacks, plus savings on repeated compilation.
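For illustration, a minimal sketch of a prepared statement with the DataStax Python driver; the keyspace, table, and column names are made up:

```python
# pip install cassandra-driver
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("my_keyspace")  # made-up keyspace

# Prepare once: the server parses and caches the statement.
prepared = session.prepare("SELECT name, email FROM users WHERE user_id = ?")

# Execute many times with different bound parameters; the values travel
# separately from the query text, which is what guards against injection.
for user_id in (1, 2, 3):
    row = session.execute(prepared, [user_id]).one()
    print(row)
```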
Would I be able to install and run the 50TB db on a single node (single Windows Server)?
I am not aware of anything that would prevent you from running a 50TB database on one node, but you may require lots of memory to keep things relatively smooth, as your RAM/storage ratio is likely to be very low and thus impact your ability to cache disk data meaningfully. What is not recommended, however, is running a production setup on Windows. Cassandra uses some Linux-specific IO optimizations and is tested much more thoroughly on Linux. Far-out setups like the one you're suggesting are especially likely to be untested on Windows.
Does Cassandra support/leverage multiple CPUs in single server (ex: 4 CPUs)?
Yes
Would the open source version be able to support the 50TB db, or would I need to purchase the Enterprise version?
The Apache distro does not have any usage limits baked into it (it makes little sense in an open source project, if you think about it). Neither does the free version from DataStax, the Community Edition.

Does Redis support strong consistency

I am looking at porting a Java application to .NET. The application currently uses EhCache quite heavily and insists on supporting strong consistency (http://ehcache.org/documentation/get-started/consistency-options).
I would like to use Redis in place of EhCache, but does Redis support strong consistency, or just eventual consistency?
I've seen talk of a Redis Cluster but I guess this is a little way off release yet.
Or am I looking at this wrong? If a Redis instance sat on a different server altogether and served two frontend servers, how big could it get before we'd need to look at a master/slave style affair?
A single instance of Redis is consistent. There are options for consistency across many instances. antirez (the Redis developer) recently wrote a blog post, Redis data model and eventual consistency, and recommended Twemproxy for sharding Redis, which would give you consistency over many instances.
I don't know EhCache, so I can't comment on whether Redis is a suitable replacement. One potential problem (porting to .NET) with Twemproxy is that it seems to run only on Linux.
How big can a single Redis instance get? Depends on how much RAM you have.
How quickly will it get this big? Depends on how your data looks.
That said, in my experience Redis stores data quite efficiently. One app I have holds info for 200k users, 20k articles, all relationships between objects, weekly leaderboards, stats, etc. (330k keys in total) in 400MB of RAM.
Redis is easy to use and fun to work with. Try it out and see if it meets your needs. If you do decide to use it and might one day want to shard, shard your data from the beginning.
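If you do shard, the idea looks roughly like this minimal client-side sharding sketch with redis-py (host names made up; in practice you would use Twemproxy or, later, Redis Cluster). It shows why consistent key routing matters from day one:

```python
# pip install redis
import hashlib
import redis

# Made-up shard endpoints.
SHARDS = [
    redis.Redis(host="redis-0.example.com", port=6379),
    redis.Redis(host="redis-1.example.com", port=6379),
]

def shard_for(key: str) -> redis.Redis:
    # Hash the key and map it onto one shard. Every client must use the
    # same function, or reads will miss data written through another client.
    digest = hashlib.md5(key.encode()).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

shard_for("user:42").set("user:42", "Alice")
print(shard_for("user:42").get("user:42"))
```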
Redis is not strongly consistent out of the box. You will probably need to apply third-party solutions to make it consistent. Here is a quote from the docs:
Write safety
Redis Cluster uses asynchronous replication between nodes, and last failover wins implicit merge function. This means that the last elected master dataset eventually replaces all the other replicas. There is always a window of time when it is possible to lose writes during partitions. However these windows are very different in the case of a client that is connected to the majority of masters, and a client that is connected to the minority of masters.
Usually you need synchronous replication to achieve strong consistency in a distributed, partitioned system.
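For what it's worth, newer Redis versions (3.0+) added the WAIT command, which blocks until a write has been acknowledged by a given number of replicas. It narrows the window for lost writes, but it still does not give you strong consistency. A minimal redis-py sketch, assuming at least one replica is attached to this master:

```python
# pip install redis
import redis

r = redis.Redis(host="localhost", port=6379)

r.set("balance:42", 100)
# Block up to 1000 ms until at least 1 replica has acknowledged the write.
acked = r.wait(1, 1000)
print(f"write acknowledged by {acked} replica(s)")
```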

MongoDB : few questions

I often hear that MongoDB can provide atomicity at the single-collection level.
Do you know why, and how is this linked to sharding?
The only difference between master/slave replication and replica sets is that both are master/slave (primary/secondary), but replica sets have an election if the master is down, right?
Of the ACID properties, which are supported/not supported by MongoDB 2?
Can durability in MongoDB be guaranteed with safe=true?
Thank you!
MongoDB can currently provide atomicity at the "update a single document" level, and that's it. This is completely unrelated to sharding.
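For example (a minimal sketch using the modern PyMongo API; the database, collection, and field names are made up), a single update can modify several fields of one document atomically:

```python
# pip install pymongo
from pymongo import MongoClient

coll = MongoClient()["shop"]["inventory"]  # made-up db/collection

# The whole update is atomic: no reader ever sees the quantity decremented
# without the matching entry appended to the log array.
coll.update_one(
    {"_id": "sku-123", "qty": {"$gte": 1}},
    {"$inc": {"qty": -1}, "$push": {"log": "sold one"}},
)
```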
More or less. Replica sets are newer, and you should basically always use them now. Master/slave replication is only around for backwards compatibility these days. It's pretty likely that only replica sets will be getting new features going forward.
Atomicity is provided for an update to a single document (see #1). Consistency and isolation aren't really provided at all; your application will have to handle those. Durability can be provided (in a fashion) by requiring that a write operation be persisted to multiple nodes before the driver reports success (see #4).
Durability can be provided by tweaking the write concern, either by using a value of w > 1 and/or (although this is slow) by using fsync. See the WriteConcern documentation or the connection string documentation.
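A minimal sketch with the modern PyMongo API (the replica set name, database, and collection are made up): this write concern waits until the write has reached two members and been journaled before the driver reports success:

```python
# pip install pymongo
from pymongo import MongoClient
from pymongo.write_concern import WriteConcern

# Made-up replica set name and connection string.
client = MongoClient("mongodb://localhost:27017/?replicaSet=rs0")
coll = client["shop"].get_collection(
    "orders", write_concern=WriteConcern(w=2, j=True)
)

# Raises an error if the write cannot be acknowledged by two members.
coll.insert_one({"order_id": 1, "total": 9.99})
```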

Are services, such as Redis or ActiveMQ, also highly available?

I wonder whether every service can also be made highly available.
I want to use the Redis and ActiveMQ services and I want to avoid a single point of failure. I also need to keep writing data to the Redis and ActiveMQ servers.
I found many articles about MySQL high availability, but only a few about other database solutions, so my question is whether there is a common high-availability solution that suits many products.
Availability is one of the principles in the CAP theorem, and many NoSQL database systems favor availability at the expense of data consistency. Replication is often used to achieve high availability for reads, but what happens to writes depends on the type of replication being used. Take a look at the current Redis replication docs or the upcoming Redis Cluster presentation for more information on this.
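A minimal redis-py sketch of that read-scaling pattern (the host names are made up, and the replica is assumed to be configured with `replicaof` pointing at the master): writes go to the master, reads can be served by replicas, and because replication is asynchronous a read may briefly return stale data:

```python
# pip install redis
import redis

master = redis.Redis(host="master.example.com", port=6379)
replica = redis.Redis(host="replica.example.com", port=6379)

master.set("greeting", "hello")  # all writes go to the master

# Reads can hit any replica; asynchronous replication means this value
# may lag the master for a moment after the write.
print(replica.get("greeting"))
```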

Is there a way to shard and replicate neo4j data?

I am considering Neo4j for some of the new projects I am working on. For the given data needs (inherently graph-based), Neo4j fits well, and a quick prototype is giving good response times for me. What I want to understand is how to scale a Neo4j deployment. Specifically:
How do I shard my data across Neo4j deployments? Since Neo4j is deployed on a single machine, there is a limit to how much data I can store on one machine, and hence I would like to know how to distribute it. Clearly, if I split it by users, then relationships between disconnected users (across shards) cannot be maintained.
How do I replicate the Neo4j data? I am potentially thinking of putting up an SQL-like setup with masters used for writes and slaves used for reads, so that we can scale up both our readers and our writers, and also have a real-time backup of our data. I understand that all the Neo4j data is stored in a filesystem, which is inherently non-replicable. Is there a way I can do it here? Perhaps something akin to a MySQL binlog?
Sharding is as of now not handled by Neo4j itself but by the domain, much as you describe. Neo4j 2.0 is going to target that problem.
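A minimal sketch of what domain-level sharding can look like with the official Neo4j Python driver (the URIs, credentials, and partitioning rule are all made up for illustration):

```python
# pip install neo4j
from neo4j import GraphDatabase

# Made-up shard endpoints.
SHARDS = [
    GraphDatabase.driver("bolt://neo4j-0.example.com:7687",
                         auth=("neo4j", "secret")),
    GraphDatabase.driver("bolt://neo4j-1.example.com:7687",
                         auth=("neo4j", "secret")),
]

def driver_for(user_id: int):
    # Domain rule: partition users by id. Relationships that cross shards
    # cannot be stored as native graph edges, which is exactly the
    # trade-off the question points out.
    return SHARDS[user_id % len(SHARDS)]

with driver_for(42).session() as session:
    session.run("MERGE (u:User {id: $id})", id=42)
```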
For replication, Online Backup is working and real High Availability with Master failover is in the works, using ZooKeeper to track the cluster nodes and elect new masters, etc.
Any more details on your app's sharding requirements? What domain, etc.?