Selective replication with CouchDB

I'm currently evaluating possible solutions to the following problem:
A set of data entries must be synchronized between multiple clients, where each client may only view (or even know about the existence of) a subset of the data.
Each client "owns" some of the elements, and the decision about who else can read or modify those elements may only be made by the owner. To complicate the situation even more, each element (and each element revision) must have a unique identifier that is the same for all clients.
While the latter sounds like a perfect task for CouchDB (and a document-based data model would fit my needs perfectly), I'm not sure whether CouchDB's authentication/authorization subsystem can handle these requirements: while it should be possible to restrict write access using validation functions, there doesn't seem to be a way to authorize read access. All the solutions I've found for this problem propose routing all CouchDB requests through a proxy (or an application layer) that handles authorization.
So, the question is: is it possible to implement an authorization layer that filters requests to the database, so that access is granted only to documents the requesting client may read, while still using CouchDB's replication mechanism? Simplified, this would be a kind of "selective replication" where only some of the documents, rather than the whole database, are replicated.
I would also be thankful for pointers to detailed information about how replication works. The CouchDB wiki and even the "Definitive Guide" book are not very specific about that.

This begs for replication filters. You filter outbound replication based on whatever criteria you impose, and give the owner of the target unrestricted access to their own copy.
I haven't had the opportunity to play with replication filters directly, but the idea is that each document carries some information about who has access to it, and the filtering mechanism then allows outbound replication of only those documents the client has access to. Replication from the target back to the master would be unrestricted, allowing the master to remain a rollup copy and potentially multicast changes to overlapping sets of data.
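As a sketch of how such a filter might look (the `readers` field and the `user` query parameter are assumptions for illustration, not CouchDB built-ins):

```javascript
// Hypothetical design document: the filter passes a document only if the
// requesting user appears in its assumed "readers" field.
var designDoc = {
  _id: "_design/auth",
  filters: {
    by_reader:
      "function (doc, req) { return doc.readers && doc.readers.indexOf(req.query.user) !== -1; }"
  }
};

// Body for POST /_replicate: name the filter and hand it the user via
// query_params, which the filter reads as req.query.user.
var replicationRequest = {
  source: "master",
  target: "http://client.example:5984/copy",
  filter: "auth/by_reader",
  query_params: { user: "alice" }
};
```

Note that this filters what reaches the client's copy; it is not an access-control mechanism on the master itself, which is why the proxy/application-layer approach keeps coming up.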

What you are after is replication filters. According to Chris Anderson, it is a 0.11 feature.
"The current status is that there is
an API for filtering the _changes
feed. The replicator in 0.10 consumes
the changes feed, so the next step is
getting the replicator to use the
filter API.
There is work in progress on this, so
it should be fully ready to go in
0.11."
See the original post.

Here is a newer link to some documentation about this:
http://blog.couchbase.com/what%E2%80%99s-new-apache-couchdb-011-%E2%80%94-part-three-new-features-replication

Indeed, as others have said, replication filters are the way to go for this. Here is a link with some information on using them.
One caveat I would add is that, at scale, replication filters can be extremely slow. More information about this and other CouchDB nuances can be found in the excellent blog post "what every developer should know about couchdb". For large-scale systems, performing replication in the application layer has proven faster and more reliable.

Related

Advantage of LDAP over RDBMS?

I have an application with a database backend.
The application follows a sort of pub-sub model, where users post changes to the application and other peers subscribe to those changes. These changes may happen very frequently or periodically, and all of them have to be written to the database.
Now I am being asked to evaluate the possibility of replacing this RDBMS with LDAP. They probably want a unified DB for all applications, but in any case I have to find the advantages and disadvantages of both approaches.
I cannot directly compare an RDBMS with LDAP, as I have almost no idea of LDAP, though I have tried to learn some.
I understand that LDAP is designed for directory access and is optimized for read access, so it is write-once, read-many. I have read that frequent writes will reduce the performance of an LDAP server, as each write triggers the indexing process.
To give a scenario regarding indexing in LDAP: my table has just two columns, Name and Desc. In LDAP I suppose these would become two attributes, Name and Desc. In my scenario it's Desc that will be frequently updated. I assume only Name will be indexed, so even if Desc changes frequently it won't trigger the indexing process.
It is also worth mentioning that the database will be hosted on some cloud platform.
I tried to find out the differences, but I could find nothing conclusive.
LDAP is a protocol; REST is a service style built on HTTP (also a protocol). So if the LDAP server shall not be exposed to the internet, how do you want to get the data from it? As LDAP is the protocol, you would need direct access to the LDAP server. It's like a database server that you would not expose directly to the internet: you would build an interface to encapsulate it, and that might as well be a REST interface.
I'd try to get the point across that one is a transfer protocol plus a storage backend, and the other is the public interface to its data. It's a bit like asking why MySQL is better than a web interface. You'd never make the MySQL server publicly available, but would encapsulate its protocol inside an application.
REST is an interface. It doesn't matter how you organize your data behind that interface. When you decide that you want to organize it differently, you can do so without the consumers of your API noticing any change, and you can provide different versions of your API as your service improves.
LDAP, on the other hand, is an implementation. You can't change the way your data is handled without the consumers noticing. So there's no way to rearrange your backend without affecting them.
With REST you can therefore change the backend from MySQL to PostgreSQL or even to LDAP without notice, which you won't be able to do with LDAP itself.
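As a minimal sketch of that encapsulation (assuming Node.js with Express; the route and the `findUser` helper are hypothetical), the consumer only ever sees the HTTP interface, so the storage behind `findUser` could be MySQL, PostgreSQL, or LDAP:

```javascript
var express = require('express');
var app = express();

// Stub data-access helper; in a real app only this function would know
// whether the backend speaks SQL or LDAP.
function findUser(id, cb) {
  cb(null, { id: id, name: 'example user' }); // canned record for the sketch
}

app.get('/users/:id', function (req, res) {
  findUser(req.params.id, function (err, user) {
    if (err) return res.status(500).send('lookup failed');
    if (!user) return res.status(404).send('no such user');
    res.json(user); // consumers only ever see JSON over HTTP
  });
});

app.listen(3000);
```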
Hope that helps
Now that we finally know what you're actually asking, which has nothing to do with your title, the body of your question, or REST, the simple answer is that there is no particular reason to believe that an LDAP server will perform significantly better than an RDBMS in this application, with two riders:
it may not even be feasible, due to the schema issue, and
if it is feasible it may not be semantically suitable, due to the lack of ACID properties, lack of JOINs, and the other issues mentioned in comments.
I will say that this is one of the worst-formulated questions I have seen here for some considerable time, and extracting the actual question was extremely difficult.

Which db should I choose for my transaction logging

I have a database question. I am developing an application where users send requests and get an answer from a vendor. I have a server receiving the requests (through a REST call or a running web service; I haven't decided which yet).
Whenever a new request comes in, it should be logged in a database, and when the vendor responds the record should be updated to indicate whether it was accepted or not, and so on. The only reason for storing these transactions is reporting and logging. So now that I have stated my requirement, I need help from someone with more expertise in this.
What I've come up with so far is that it would be best to use a structured database, since all records will have the same type and the same fields, so there's no need to waste space on a semi-structured database where each record contains both structure and data.
But I don't know if there are any databases that are particularly good for this kind of "create/update operations only" workload. As I said, I only need to read the data perhaps once a month or so.
Any inputs are appreciated!
You can use any open-source database like PostgreSQL, since you are mostly going to do inserts and don't need many other features. My suggestion would be to run the logging process in separate threads rather than in the one you use for processing, to get better performance for your API calls.
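A minimal sketch of that insert-then-update pattern (assuming Node.js with the `pg` client; the `requests` table and its columns are hypothetical):

```javascript
var { Pool } = require('pg');
var pool = new Pool({ connectionString: process.env.DATABASE_URL });

// Hypothetical table: requests(id, payload, status).
// Log the incoming request and hand back its generated id.
function logRequest(payload, cb) {
  pool.query(
    'INSERT INTO requests (payload, status) VALUES ($1, $2) RETURNING id',
    [JSON.stringify(payload), 'pending'],
    function (err, result) {
      cb(err, err ? null : result.rows[0].id);
    }
  );
}

// Update the record once the vendor has answered.
function recordResponse(id, accepted, cb) {
  pool.query(
    'UPDATE requests SET status = $1 WHERE id = $2',
    [accepted ? 'accepted' : 'rejected', id],
    cb
  );
}
```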
I'm developing an application with a lot of create/update queries and am currently using Neo4j.
It's fast and works really well with Java EE and PHP. NoSQL is really quick to learn with it, and the web interface is really user-friendly :)

How can I control Cassandra replication?

When I make an insert into a specified keyspace, I want the data to be stored only on a specified node (or list of nodes). The information contained in the insert may be confidential and should not be distributed to arbitrary nodes.
I first thought about implementing my own AbstractReplicationStrategy, but it seems that the first node chosen depends on the token (selected by the partitioner) and not on the implemented strategy.
How can I be sure that the information contained in a keyspace only goes where I allow it?
I don't think it is possible to do what you are asking. Cassandra actively tries to maintain a certain number of replicas of each piece of data; even if you managed to force only a single node to store your insert (which is fairly straightforward), you'd have no control over which node that was (as you found, this is controlled by the partitioner), and if that node went down your data would be lost.
The short answer is that controlling replication is not the way to achieve data security; you should use proper security techniques such as encryption, segregated networks, controlled access, etc.
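For example, encrypting the confidential field before it is ever sent to the cluster means no node can read it, regardless of where the partitioner places the replicas. A minimal sketch using Node.js's built-in crypto module (key management is deliberately simplified; `DATA_KEY_HEX` is a hypothetical environment variable):

```javascript
var crypto = require('crypto');

// Hypothetical 32-byte key; in real use, load it from a secrets store.
var key = Buffer.from(process.env.DATA_KEY_HEX, 'hex');

// Encrypt a confidential value before inserting it into the cluster.
function encryptField(plaintext) {
  var iv = crypto.randomBytes(12); // fresh nonce per value
  var cipher = crypto.createCipheriv('aes-256-gcm', key, iv);
  var encrypted = Buffer.concat([cipher.update(plaintext, 'utf8'), cipher.final()]);
  // Store the iv and auth tag alongside the ciphertext for later decryption.
  return {
    iv: iv.toString('hex'),
    tag: cipher.getAuthTag().toString('hex'),
    data: encrypted.toString('hex')
  };
}
```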

HTML5 local datastore and sync across devices

I am building a full-featured web application. Naturally, you can save to the local datastore when you are in 'offline' mode. I want to be able to sync across devices, so people can work on one machine, save, then get on another machine and load their stuff.
The questions are:
1) Is it a bad idea to store JSON on the server? Why parse the JSON on the server into model objects when it is just going to be passed back to the (other) client(s) as JSON?
2) I'm not sure whether I would want to try a NoSQL technology for this. I am not breaking the JSON down; for now, the only relationships in the db would be from a user account to their entries. Other than the user data, the domain model would be a string, which is the JSON. Advice welcome.
In theory, in the future I might want to do some processing on the server or set up more complicated relationships. In other words, right now I would just be saving the JSON, but in the future I might want a more traditional relational system. Would a NoSQL approach get in the way of this?
3) Are there any security concerns with this? JS injection, for example? In theory, for this use case, the user doesn't get to enter anything, at least right now.
Thank you in advance.
EDIT - Thanks for the answers. I chose the answer I did because it went into the most detail on the advantages and disadvantages of NoSQL.
JSON on the SERVER
It's not a bad idea at all to store JSON on the server, especially if you go with a NoSQL solution like MongoDB or CouchDB. Both use JSON as their native format (MongoDB actually uses BSON, but it's quite similar).
NoSQL Approach: Assuming CouchDB as the storage engine
Baked-in replication and concurrency handling
Very simple REST API; you talk to the database over HTTP (see the sketch after this list)
Stores data natively as JSON, not in blobs or text fields
Powerful view/query engine that will let you keep growing the complexity of your documents
Offline mode: you can talk to CouchDB directly from JavaScript and have the entire app continue to run on the client if the internet isn't available
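As a quick illustration of that HTTP API (the database and document names here are made up), fetching a document is a single GET that returns plain JSON:

```javascript
// Minimal sketch: read a document straight from CouchDB's REST API.
// "mydb" and "some-doc-id" are hypothetical names.
var xhr = new XMLHttpRequest();
xhr.open('GET', 'http://localhost:5984/mydb/some-doc-id', true);
xhr.onload = function () {
  if (xhr.status === 200) {
    var doc = JSON.parse(xhr.responseText); // documents are plain JSON
    console.log(doc._id, doc._rev);
  }
};
xhr.send();
```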
Security
Make sure you're parsing the JSON documents with the browser's JSON.parse or a JavaScript library that is safe (json2.js).
Conclusion
I think the reason I'd suggest going with NoSQL here, CouchDB in particular, is that it's going to handle all of the hard stuff for you. Replication is going to be a snap to set up. You won't have to worry about concurrency, etc.
That said, I don't know what kind of app you're building, what your relationship with the clients will be, or how easy it'll be to get them to put CouchDB on their machines.
Links
CouchDB # Apache
CouchOne
CouchDB the definitive guide
MongoDB
Update:
After looking at the app, I don't think CouchDB will be a good client-side option, as you're not going to require folks to install a database engine to play sudoku. That said, I still think it'd be a great server-side option. If you wanted to sync the server CouchDB instance with the client, you could use something like BrowserCouch, which is a JavaScript implementation of CouchDB on top of local storage.
If most of your processing is going to be done on the client side using JavaScript, I don't see any problem in storing JSON directly on the server.
If you just want to play around with new technologies, you're most welcome to try something different, but for most applications, there isn't a real reason to depart from traditional databases, and SQL makes life simple.
You're safe as long as you use the standard JSON.parse function to parse JSON strings; some browsers (Firefox 3.5 and above, for example) already have a native version, while Crockford's json2.js can replicate this functionality in others.
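A quick illustration of the difference (the payload string is made up):

```javascript
var payload = '{"entries": [1, 2, 3]}'; // hypothetical data from the server

// Safe: JSON.parse accepts only valid JSON and never executes code.
var data = JSON.parse(payload);

// Unsafe: eval runs whatever the string contains, so a tampered payload
// could execute arbitrary JavaScript in the user's browser.
// var data = eval('(' + payload + ')');
```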
Just read your post, and I have to say I quite like your approach; it heralds the way many web applications will probably work in the future, with both an element of local storage (for the disconnected state) and online storage (the master database, keeping all customers' records in one place and syncing to other client devices).
Here are my answers:
1) Storing JSON on the server: I'm not sure I would store the objects as JSON. It's possible to do so if your application is quite simple, but it will hamper efforts to use the data later (running reports and emailing them from a batch job, for example). I would prefer to use JSON for transferring the information and a SQL database for storing it.
2) NoSQL approach: I think you've answered your own question there. My preferred approach would be to set up a SQL database now (if the extra resources needed are not a problem); that way you'll save yourself the work of setting up a data access layer for NoSQL, which you would probably have to remove in the future. SQLite is a good choice if you don't want a fully featured RDBMS.
If writing a schema is too much hassle and you still want to save JSON on the server, you can hack up a JSON object management system with a single table and some parsing on the server side to return the relevant records. Doing this will be easier and require less permissioning than saving/deleting files.
3) Security: You mentioned there is no user input at the moment: "for this use case, the user doesn't get to enter anything". However, at the beginning of the question you also mentioned that the user can "work on one machine, save, then get on another machine and load their stuff". If this is the case, then your application will be storing user data; it doesn't matter that you haven't provided a nice GUI for them to do so. You will have to worry about security from more than one standpoint, and JSON.parse or similar tools only solve half the problem (the client side).
Basically, you will also have to check the contents of your POST requests on the server to determine whether the data being sent is valid and realistic. The integrity of the JSON object (or any data you are trying to save) must be validated on the server (using PHP or another server-side language) before saving it to your data store, because someone can easily bypass your JavaScript-layer "security" and tamper with the POST request, and your application would then serve the malicious input back out to clients anyway.
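As a sketch of that server-side check (shown here with Node.js and a recent Express rather than PHP; the endpoint and field names are hypothetical):

```javascript
var express = require('express');
var app = express();
app.use(express.json()); // malformed JSON is rejected before our handler runs

// Hypothetical save endpoint: validate on the server, never trusting
// the client-side checks, and copy only whitelisted fields.
app.post('/entries', function (req, res) {
  var doc = req.body;
  if (typeof doc.title !== 'string' || doc.title.length > 200) {
    return res.status(400).send('invalid document');
  }
  if (typeof doc.body !== 'string' || doc.body.length > 100000) {
    return res.status(400).send('invalid document');
  }
  saveEntry({ title: doc.title, body: doc.body }); // drop unknown fields
  res.status(201).end();
});

// Stub persistence helper for the sketch.
function saveEntry(entry) { /* write to the data store here */ }

app.listen(3000);
```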
If you have the server side of things tidied up, then JSON.parse becomes somewhat redundant for preventing JS injection. Still, it's not bad to have the extra layer, especially if you are relying on remote website APIs for some of your data.
Hope this is useful to you.

Application Level Replication Technologies

I am building out a solution that will be deployed in multiple data centers in multiple regions around the world, with each data center having a replicated copy of data actively updated in each region. I will have a combination of multiple databases and file systems in each data center, the state of which must be kept consistent (within a data center). These multiple repositories will be fronted by a SOA service tier.
I can tolerate some latency in the replication, and need to allow for regions to be off-line, and then catch up later.
Given the multiple back-end repositories of data, I can't easily rely on independent replication solutions for each one to maintain a consistent state. I am thus led to implementing replication at the application layer, by replicating the SOA requests in some manner. I'll need to make sure that replication loops don't occur, and that last-writer conditions are sorted out correctly.
In your experience, what is the best pattern for solving this problem, and are there good products (free or otherwise) that should be investigated?
Lotus/Domino is your answer. I've been working with it for ten years and it's exactly what you need. It may not be trendy (a perception that I would challenge), but it's powerful, adaptable and very secure. The latest version, R8, is the best yet.
You should definitely consider IBM Lotus Domino. A Lotus Notes database can replicate between sites on a predefined schedule. Replication in Notes/Domino is definitely a very powerful feature and enables full replication of data between sites. Even if a server is unavailable, the next time it connects it will simply replicate and get back in sync.
As for the SOA service tier, you could then use Domino Designer to write a web service. Since Notes/Domino 7.5.x (I believe), Domino has been able to provide and consume web services.
As others have advised, I will also recommend Lotus Notes/Domino. 8.5 is a really powerful application development platform.
You don't give enough specifics to be certain of your needs, but I think you should check out SQL Server merge replication. It allows asynchronous replication of multiple databases with full conflict resolution. You will need to designate a global master, and all the other databases will replicate with that one, but all the database instances are fully functional (read/write), so you can schedule replication at whatever intervals suit you. If any region goes offline, it can catch up later with no issues; if the master goes offline, everyone will work independently until replication can resume.
I would be interested to know of other solutions this flexible (apart from Lotus Notes/Domino, of course, which is not very trendy these days).
I think your answer is going to have to be based on a pub/sub architecture. I am assuming that you have reliable messaging between your data centers, so that you can rely on published updates being received eventually. If all of your access to the data repositories is via services, you can add an event notification to the orchestration of each of your update services that notifies all interested data centers of the event. Ideally the master database is the only one that sends out these updates; if so, you can avoid routing the notifications back to the node that generated them in the first place, thus avoiding update loops.
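As a sketch of the loop-avoidance part (the message bus API and the `applyChange` handler are hypothetical; the origin-tagging idea is what matters):

```javascript
var ORIGIN = 'dc-east'; // this data center's assumed identifier

// Publish an update tagged with its origin and a timestamp that a
// last-writer-wins rule can compare against the stored record.
function publishUpdate(bus, change) {
  bus.publish('data.updates', { origin: ORIGIN, ts: Date.now(), change: change });
}

// Subscribers drop events they produced themselves, which is what
// prevents replication loops.
function onUpdate(event) {
  if (event.origin === ORIGIN) {
    return; // our own event coming back around; ignore it
  }
  applyChange(event.change);
}

// Stub local handler for the sketch.
function applyChange(change) { /* apply to the local repositories here */ }
```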