What is a Commutative Replicated Data Type, and how does it replicate data without requiring consensus?

I am trying to do research on Commutative Replicated Data Types, and I cannot find any good definitions that aren't mired in technical jargon, which makes it hard to understand how they allow data to be replicated in a distributed environment without using consensus.

In layman's terms, you can think of CRDTs as follows:
CRDTs are data types that achieve strong eventual consistency in distributed environments without explicit synchronization. Their attractive property is that they are both conflict-free and require no synchronization, which is a bit confusing at first, since you'd think there must be some sort of synchrony: e.g. what happens if I write 2 and then 3 to the replicas, and replica A receives update 3 before 2 while replica B receives them in the correct order, first 2 then 3? Don't we have a conflict?
The key to CRDTs is that they are restricted to specific operations for which the scenario above does not produce a conflict. The simplest case is incrementing an aggregated value: if A and B just add up all the incoming values, they will both eventually end up with the value 5, without conflict, relying only on the weak guarantees of eventual consistency. Specifically, the requirements are typically that the operations are commutative and that they don't violate causal order.
Basically, CRDTs guarantee that all concurrent operations commute with each other.
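To make the counter example concrete, here is a minimal sketch in Java (not taken from any CRDT library; it assumes every increment operation is delivered to each replica exactly once, though in any order):

    import java.util.List;

    // An operation-based, increment-only counter: because addition is commutative
    // and associative, replicas that receive the same set of increments in any
    // order converge to the same value.
    class CounterReplica {
        private long value = 0;

        // Called whenever an increment operation arrives (locally or from a peer).
        void apply(long increment) {
            value += increment;
        }

        long value() {
            return value;
        }

        public static void main(String[] args) {
            CounterReplica a = new CounterReplica();
            CounterReplica b = new CounterReplica();

            // The same two operations, delivered in different orders.
            for (long op : List.of(3L, 2L)) a.apply(op);
            for (long op : List.of(2L, 3L)) b.apply(op);

            // Both replicas converge to 5 without any coordination.
            System.out.println(a.value() + " == " + b.value());
        }
    }

The whole trick is that addition commutes; a remove operation, for example, does not commute with a concurrent add, which is why richer CRDTs need extra machinery (tombstones, unique identifiers, etc.).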
Of course, if CRDTs could only implement simple summation they would not be very interesting. However, it turns out that clever people have developed CRDT algorithms for more useful things, such as editing a shared document (see Logoot, a CRDT for ordered sequences), grow-only sets, etc.
But still, bear in mind that by removing the need for consensus, CRDTs are inherently limited and there are many simple things they cannot do.
Hope this description makes sense; for a more exact description I think the mathematical definition is the most intuitive.

Related

Data model design guidelines with GEODE

We are soon going to start working with GEODE on reference data, and I would like to get some guidelines for it.
As you know, in the financial reference data world there exist complex relationships between various reference data entities like Instrument, Account, Client, etc., which might be stored in the database in 3NF.
If my queries are mostly read-intensive and require joins across tables (2-5 tables), what's the best way to deal with them in an in-memory data grid?
Case 1:
Create separate regions for all the tables in your database and then do a similar join using OQL, as you would in the database?
Even if you do so, you will have to take great care in the design so that related entities are always co-located within the same partition.
Model 1-to-many and many-to-many relationships using an object graph?
Case 2:
If you know what your join queries look like, create a view model per join query with equi-join characteristics.
Confusion:
(1) I have one join query requiring Employee and Department using emp.deptId = dept.deptId [OK, fantastic: one region with such a view model exists]
(2) I have another join query requiring Employee, Department, Salary, and Address joins to address a different requirement
So again I have to create a view model to address (2), which will contain Employee and Department data similar to (1). This may soon hit the memory threshold.
Changes in the database can still be managed by event listeners, but what are the recommendations for that?
Thanks,
Dharam
I think your general question is pretty broad and there isn't just one recommended approach to cover all use cases (primarily all the analytical views/models of your data required by your application(s)).
Such questions involve many factors, such as the size of individual data elements, the volume of data, the frequency of access or access patterns originating from the application or applications, the timely delivery of information, how accurate the data needs to be, the size of your cluster, the physical resources of each (virtual) machine, and so on. Thus, any given approach will undoubtedly require application tuning, tuning GemFire accordingly and JVM tuning regardless of your data model. Still, a carefully crafted data model can determine the extent of such tuning.
In GemFire specifically, such tuning will involve different configuration such as, but not limited to: data management policies, eviction (Overflow) and expiration (LRU, or perhaps custom) settings along with different eviction/expiration thresholds, maybe storing data in Off-Heap memory, employing different partition strategies (PartitionResolver), and so on and so forth.
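For instance, eviction and expiration are set per region; the following is only a rough sketch using the Geode Java API (the region name and thresholds are invented, and in practice you would typically configure the same things in cache.xml or with gfsh):

    import java.util.concurrent.TimeUnit;

    import org.apache.geode.cache.Cache;
    import org.apache.geode.cache.CacheFactory;
    import org.apache.geode.cache.EvictionAction;
    import org.apache.geode.cache.EvictionAttributes;
    import org.apache.geode.cache.ExpirationAction;
    import org.apache.geode.cache.ExpirationAttributes;
    import org.apache.geode.cache.Region;
    import org.apache.geode.cache.RegionShortcut;

    public class TuningSketch {
        public static void main(String[] args) {
            Cache cache = new CacheFactory().create();

            Region<String, Object> instruments = cache
                .<String, Object>createRegionFactory(RegionShortcut.PARTITION)
                // Overflow least-recently-used entries to disk past 100,000 entries.
                .setEvictionAttributes(EvictionAttributes.createLRUEntryAttributes(
                    100_000, EvictionAction.OVERFLOW_TO_DISK))
                // Statistics must be enabled for expiration to take effect.
                .setStatisticsEnabled(true)
                // Destroy entries that have been neither read nor written for an hour.
                .setEntryIdleTimeout(new ExpirationAttributes(
                    (int) TimeUnit.HOURS.toSeconds(1), ExpirationAction.DESTROY))
                .create("Instrument");
        }
    }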
For example, if your Address information is relatively static, unchanging (i.e. actual "reference" data) then you might consider storing Address data in a REPLICATE Region. Data that is written to frequently (typically "transactional" data) is better off in a PARTITION Region.
Of course, as you know, any PARTITION data (managed in separate Regions) you "join" in a query (using OQL) must be collocated. GemFire/Geode does not currently support distributed joins.
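As a rough illustration using the Geode Java API (region and type names here are hypothetical, and the same layout can equally be declared in cache.xml): static Address reference data can live in a REPLICATE region, while Department and Employee are PARTITION regions, with Employee colocated with Department so the equi-join on deptId can be satisfied within each member:

    import org.apache.geode.cache.Cache;
    import org.apache.geode.cache.CacheFactory;
    import org.apache.geode.cache.PartitionAttributesFactory;
    import org.apache.geode.cache.Region;
    import org.apache.geode.cache.RegionShortcut;

    public class RegionLayoutSketch {
        public static void main(String[] args) {
            Cache cache = new CacheFactory().create();

            // Static "reference" data: fully replicated on every data node.
            Region<String, Object> addresses = cache
                .<String, Object>createRegionFactory(RegionShortcut.REPLICATE)
                .create("Address");

            // Frequently written ("transactional") data: partitioned across the cluster.
            Region<Long, Object> departments = cache
                .<Long, Object>createRegionFactory(RegionShortcut.PARTITION)
                .create("Department");

            // Employee buckets are colocated with Department buckets; in practice you
            // would also supply a PartitionResolver so Employee entries route by deptId.
            Region<Long, Object> employees = cache
                .<Long, Object>createRegionFactory(RegionShortcut.PARTITION)
                .setPartitionAttributes(new PartitionAttributesFactory<Long, Object>()
                    .setColocatedWith("Department")
                    .create())
                .create("Employee");
        }
    }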
Additionally, certain nodes could host certain Regions, thus dividing your cluster into "transactional" vs. "analytical" nodes, where the analytical-based nodes are updated from CacheListeners on Regions in transactional nodes (be careful of this), or perhaps better yet, asynchronously using an AEQ with AsyncEventListeners. AEQs can be separately made highly available and durable as well. This transactional vs analytical approach is the basis for CQRS.
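A rough sketch of that AEQ wiring (the queue id, region name, and listener body are placeholders; configuration details vary by Geode version):

    import java.util.List;

    import org.apache.geode.cache.Cache;
    import org.apache.geode.cache.CacheFactory;
    import org.apache.geode.cache.RegionShortcut;
    import org.apache.geode.cache.asyncqueue.AsyncEvent;
    import org.apache.geode.cache.asyncqueue.AsyncEventListener;
    import org.apache.geode.cache.asyncqueue.AsyncEventQueue;

    public class AeqSketch {
        public static void main(String[] args) {
            Cache cache = new CacheFactory().create();

            // Listener that pushes changes from the "transactional" region into the
            // analytical view model, off the critical path of the write.
            AsyncEventListener listener = new AsyncEventListener() {
                @Override
                public boolean processEvents(List<AsyncEvent> events) {
                    for (AsyncEvent event : events) {
                        // update the analytical model here
                        System.out.println(event.getKey() + " -> " + event.getDeserializedValue());
                    }
                    return true; // true = processed; events are removed from the queue
                }

                @Override
                public void close() {
                }
            };

            AsyncEventQueue queue = cache.createAsyncEventQueueFactory()
                .setPersistent(true)  // survive member restarts (durable)
                .setParallel(true)    // process per primary bucket on partitioned regions
                .create("employeeEvents", listener);

            // Attach the queue to the region whose updates should be propagated.
            cache.createRegionFactory(RegionShortcut.PARTITION)
                .addAsyncEventQueueId(queue.getId())
                .create("Employee");
        }
    }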
The size of your data is also impacted by the form in which it is stored, i.e. serialized vs. not serialized, and GemFire's proprietary serialization format (PDX) is considerably more efficient than Java Serialization. It all depends on how "portable" your data needs to be and whether you can keep your data in serialized form.
Also, you might consider how expensive it is to join the data on the fly. Meaning, if you are able to aggregate, transform and enrich data at runtime relatively cheaply (compute vs. memory/storage), then you might consider using GemFire's Function Execution service, bringing your logic to the data rather than the data to your logic (the fundamental basis of MapReduce).
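A sketch of what that looks like with the Function Execution service (the function id, region, and the trivial aggregation are placeholders, and a recent Geode Function API is assumed):

    import org.apache.geode.cache.Region;
    import org.apache.geode.cache.execute.Function;
    import org.apache.geode.cache.execute.FunctionContext;
    import org.apache.geode.cache.execute.RegionFunctionContext;
    import org.apache.geode.cache.partition.PartitionRegionHelper;

    // Executed on every member hosting the region; each member aggregates its own
    // local (primary) data, so only small per-member results cross the network.
    public class HeadcountFunction implements Function<Object> {

        @Override
        public void execute(FunctionContext<Object> context) {
            RegionFunctionContext rfc = (RegionFunctionContext) context;
            Region<Object, Object> localData = PartitionRegionHelper.getLocalDataForContext(rfc);
            context.getResultSender().lastResult(localData.size());
        }

        @Override
        public String getId() {
            return "headcount";
        }
    }

    // Caller side, e.g.:
    //   ResultCollector<?, ?> rc = FunctionService.onRegion(employeeRegion)
    //       .execute(new HeadcountFunction());
    //   List<?> perMemberCounts = (List<?>) rc.getResult();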
You should know, and I am sure you are aware, that GemFire is a Key-Value store, and therefore mapping a complex object graph into separate Regions is not a trivial problem. Dividing objects up by references (especially many-to-many) and knowing exactly when to load them eagerly vs. lazily is a hard problem, especially in a distributed, replicated data store such as GemFire where consistency and availability trade-offs exist.
There are different APIs and frameworks to simplify persistence and querying with GemFire. One of the more notable approaches is Spring Data GemFire's extension of Spring Data Commons Repository abstraction.
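For instance, a repository might look roughly like this (a sketch only; the Employee type and region name are hypothetical, and the exact package of the @Region mapping annotation varies between Spring Data GemFire versions):

    import org.springframework.data.annotation.Id;
    import org.springframework.data.gemfire.mapping.annotation.Region;
    import org.springframework.data.repository.CrudRepository;

    // Maps the domain type to the /Employee region.
    @Region("Employee")
    class Employee {
        @Id
        Long id;
        Long deptId;
        String name;
    }

    // Spring Data generates the implementation: basic CRUD plus derived
    // queries (translated to OQL) come for free.
    interface EmployeeRepository extends CrudRepository<Employee, Long> {
        Iterable<Employee> findByDeptId(Long deptId);
    }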
It also might be a matter of using the right data model for the job. If you have very complex data relationships, then perhaps creating analytical models using a graph database (such as Neo4j) would be a simpler option. Spring also provides great support for Neo4j, led by the Neo4j team.
No doubt any design choice you make will involve a hybrid approach. Often the path is not clear, since it really "depends" (i.e. it depends on the application and data access patterns, load, all that).
But one thing is for certain: make sure you have a good working knowledge and understanding of the underlying data store and its data management capabilities, particularly as they pertain to consistency and availability, beginning with this.
Note, there is also a GemFire slack channel as well as an Apache DEV mailing list you can use to reach out to the GemFire experts and the community of (advanced) GemFire/Geode users if you have more specific problems as you proceed down this architectural design path.

What is the resource impact from normalizing a database?

When taking a database from a relatively un-normalized form and normalizing it, what, if any, changes in resource utilization might one expect?
For example, normalization often means more tables are created from fewer, so the database ends up with a higher number of tables, but many of them are quite small, allowing the frequently used ones to fit into memory better.
The higher number of tables also means that more joins are (potentially) needed to get at the data that was factored out, so one would expect some impact from the additional joins the system needs to perform.
So, what impact on resource usage (i.e. what will change) does normalizing an un-normalized database have?
Edit:
To add a bit of context, I have an existing (i.e. legacy) database with over 300 horrible tables. About half of the data is TEXT and the other half is either char fields or integers. There are no constraints of any kind. The reason I ask is primarily to get more information for convincing others that things need to change and that there won't be a decrease in performance or maintainability. Unfortunately, those I have to convince know just enough about the performance benefits of a de-normalized database to want to avoid normalization as much as possible.
This cannot really be answered in a general manner, as the impact will vary heavily depending on the specifics of the database in question and the apps using it.
So you basically stated the general expectations concerning the impact:
Overall memory demands for storage should go down, as redundant data gets removed
CPU needs might go up, as queries might get more expensive (Note that in many cases queries on a normalized database will actually be faster, even if they are more complex, as there are more optimization options for the query engine)
Development resource needs might go up, as developers might need to construct more elaborate queries (But on the other hand, you need less development effort to maintain data integrity)
So the only real answer is the usual: it depends ;)
Note: This assumes that we are talking about cautious and intentional denormalization. If you are referring to the 'just throw some tables together as data comes along' approach, all too common with inexperienced developers, I'd risk the statement that normalization will reduce resource needs on all levels ;)
Edit: Concerning the specific context added by cdeszaq, I'd say 'Good luck getting your point through' ;)
Obviously, with over 300 tables and no constraints (!), the answer to your question is definitely 'normalizing will reduce resource needs on all levels' (and probably very substantially), but:
Refactoring such a mess will be a major undertaking. If there is only one app using this database, it is already dreadful - if there are many, it might become a nightmare!
So even if normalizing would substantially reduce resource needs in the long run, it might not be worth the trouble, depending on circumstances. The main questions here are about long term scope - how important is this database, how long will it be used, will there be more apps using it in the future, is the current maintenance effort constant or increasing, etc. ...
Don't ignore that it is a running system - even if it's ugly and horrible, according to your description it is not (yet) broken ;-)
"Normalization" applies only and exclusively to the logical design of a database.
The logical design of a database and the physical design of a database are two completely distinct things. Database theory has always intended for things to be this way. The fact that the developers who overlook/disregard this distinction (out of ignorance or out of carelessness or out of laziness or out of whatever other so-called-but-invalid "reason") are the vast majority, does not make them right.
A logical design can be said to be normalized or not, but a logical design does not inherently carry any "performance characteristic" whatsoever. Just like 'c:=c+1;' does not inherently carry any performance characteristic.
A physical design does determine "performance characteristics", but then again a physical design simply does not have the quality of being "normalized or not".
This flawed perception of "normalization hurting performance" is really nothing else than concrete proof that all the DBMS engines that exist today are just seriously lacking in physical design options.
There's a very simple answer to your question: it depends.
Firstly, I'd re-phrase your question as 'what is the benefit of denormalization', because normalization is something that should be done by default (as the result of a pure logical model), and then denormalization can be applied for very specific tables where performance is critical. The main problem with denormalization is that it can complicate data integrity management, but the benefits in some cases outweigh the risks.
My advice for denormalization: do it only when it really hurts, and make sure you have all scenarios covered when it comes to maintaining data integrity after any inserts, updates or deletes.
To underscore some points made by prior posters: is your current schema really denormalized? The proper way (imho) to design a database is to:
Understand as best you can the system/information to be modeled
Build a fully normalized model
Then, if and as you find it necessary, denormalize in a controlled fashion to enhance performance
(There may be other reasons to denormalize, but the only ones I can think of off-hand are political ones--have to match the existing code, the developers/managers don't like it, etc.)
My point is, if you never fully normalized, you don't have a denormalized database, you've got an unnormalized one. And I'm sure you can come up with more descriptive, if less polite, terms for those databases.
I've found that normalization, in some cases, will improve performance.
Small tables read more quickly. A badly denormalized database will often have (a) longer rows and (b) more rows than a normalized design.
Reading fewer shorter rows means less physical I/O.
For one thing, with the normalized design you'll end up having to do result-set calculations. For example, if you have a Blog with a number of Posts, you could either do:
select count(*) from Post where BlogID = #BlogID
which is more expensive than
select PostCount from Blog where ID = #BlogID
and can lead to the SELECT N+1 problem, if you're not careful.
Of course with the second option you have to deal with keeping the data integrity, but if the first option is painful enough, then you make it work.
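For example, a hypothetical JDBC sketch (table and column names taken from the queries above): keeping the denormalized PostCount honest means updating it in the same transaction that inserts the Post.

    import java.sql.Connection;
    import java.sql.PreparedStatement;

    // Insert a Post and bump the denormalized Blog.PostCount atomically, so the
    // counter read by "select PostCount from Blog where ID = ?" never drifts.
    public class AddPost {
        public static void addPost(Connection conn, long blogId, String body) throws Exception {
            boolean oldAutoCommit = conn.getAutoCommit();
            conn.setAutoCommit(false);
            try (PreparedStatement insertPost = conn.prepareStatement(
                     "insert into Post (BlogID, Body) values (?, ?)");
                 PreparedStatement bumpCount = conn.prepareStatement(
                     "update Blog set PostCount = PostCount + 1 where ID = ?")) {
                insertPost.setLong(1, blogId);
                insertPost.setString(2, body);
                insertPost.executeUpdate();

                bumpCount.setLong(1, blogId);
                bumpCount.executeUpdate();

                conn.commit();
            } catch (Exception e) {
                conn.rollback();   // keep Post rows and PostCount in step
                throw e;
            } finally {
                conn.setAutoCommit(oldAutoCommit);
            }
        }
    }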
Be careful you don't fall foul of premature optimisation. Do it in the normalised fashion, then measure performance against requirements, and only if it falls short should you look to denormalise.
Normalized schemas tend to perform better for INSERT/UPDATE/DELETE because there are no "update anomalies" and the actual changes that need to be made are more localized.
SELECTs are mixed. Denormalization is essentially a materialized join. There's no doubt that materializing a join sometimes helps; however, materialization is often very pessimistic (probably more often than not), so don't assume that denormalization will help you. Also, normalized schemas are generally smaller and therefore might require less I/O. A join is not necessarily expensive, so don't automatically assume that it will be.
I wanted to elaborate on Henrik Opel's #3 bullet point. Development costs might go up, but they don't have to. In fact, normalization of a database should simplify or enable the use of tools like ORMs, Code Generators, Report Writers, etc. These tools can significantly reduce the time spent on the data access layer of your applications and move development on through to adding business value.
You can find a good StackOverflow discussion here about the development aspect of normalized databases. There were many good answers, comments and things to think about.

Any good literature on join performance vs systematic denormalization?

As a corollary to this question, I was wondering if there were good comparative studies I could consult and pass along about the advantages of letting the RDBMS do the join optimization versus systematically denormalizing in order to always access a single table at a time.
Specifically, I want information about:
Performance of normalisation versus denormalisation.
Scalability of normalized vs denormalized systems.
Maintainability issues of denormalization.
Model consistency issues with denormalization.
A bit of history to see where I am going here: our system uses an in-house database abstraction layer, but it is very old and cannot handle more than one table at a time. As such, all complex objects have to be instantiated using multiple queries, one on each of the related tables. To make sure the system always uses a single table, heavy systematic denormalization is used throughout the tables, sometimes flattening two or three levels deep. As for n-n relationships, they seem to have worked around them by carefully crafting the data model to avoid such relations and always falling back on 1-n or n-1.
The end result is a convoluted, overly complex system where customers often complain about performance. When analyzing such bottlenecks they never question the basic premises on which the system is based and always look for other solutions.
Did I miss something? I think the whole idea is wrong but somehow lack the irrefutable evidence to prove (or disprove) it, so I am turning to your collective wisdom to point me towards good, well-accepted literature that can convince the other fellows on my team that this approach is wrong (or convince me that I am just too paranoid and dogmatic about consistent data models).
My next step is to build my own test bench and gather results; since I hate reinventing the wheel, I want to know what already exists on the subject.
---- EDIT
Notes: the system was first built with flat files, without a database system... only later was it ported to a database because a client insisted on the system using Oracle. They did not refactor but simply added support for relational databases to the existing system. Flat file support was later dropped, but we are still awaiting refactoring to take advantage of the database.
A thought: you have a clear impedance mismatch, a data access layer that allows access to only one table? Stop right there, this is simply inconsistent with optimal use of a relational database. Relational databases are designed to do complex queries really well. To have no option other than returning a single table, and presumably doing any joining in the business layer, just doesn't make sense.
For justification of normalisation, and the potential consistency costs you can refer to all the material from Codd onwards, see the Wikipedia article.
I predict that benchmarking this kind of stuff will be a never-ending activity; special cases will abound. I claim that normalisation is "normal": people get good enough performance from a clean database design. Perhaps an approach might be a survey: "How normalised is your data? Scale 0 to 4."
As far as I know, Dimensional Modeling is the only technique of systematic denormalization that has some theory behind it. This is the basis of data warehousing techniques.
DM was pioneered by Ralph Kimball in "A Dimensional Modeling Manifesto" in 1997. Kimball has also written a raft of books. The book that seems to have the best reviews is "The Data Warehouse Toolkit: The Complete Guide to Dimensional Modeling (Second Edition)" (2002), although I haven't read it yet.
There's no doubt that denormalization improves performance of certain types of queries, but it does so at the expense of other queries. For example, if you have a many-to-many relationship between, say, Products and Orders (in a typical ecommerce application), and you need it to be fastest to query the Products in a given Order, then you can store data in a denormalized way to support that, and gain some benefit.
But this makes it more awkward and inefficient to query all Orders for a given Product. If you have an equal need to make both types of queries, you should stick with the normalized design. This strikes a compromise, giving both queries similar performance, though neither will be as fast as they would be in the denormalized design that favored one type of query.
Additionally, when you store data in a denormalized way, you need to do extra work to ensure consistency. I.e. no accidental duplication and no broken referential integrity. You have to consider the cost of adding manual checks for consistency.

Are we asking too much of transactional memory?

I've been reading up a lot about transactional memory lately. There is a bit of hype around TM, so a lot of people are enthusiastic about it, and it does provide solutions for painful problems with locking, but you regularly also see complaints:
You can't do I/O
You have to write your atomic sections so they can run several times (be careful with your local variables!)
Software transactional memory offers poor performance
[Insert your pet peeve here]
I understand these concerns: more often than not, you find articles about STMs that only run on some particular hardware that supports some really nifty atomic operation (like LL/SC), or that have to be supported by some imaginary compiler, or that require all accesses to memory to be transactional, or that introduce monad-style type constraints, etc. And above all: these are real problems.
This has led me to ask myself: what speaks against local use of transactional memory as a replacement for locks? Would this already bring enough value, or must transactional memory be used all over the place if used at all?
Yes, some of the problems you mention can be real ones now, but things evolve.
As with any new technology, first there is hype, then the new technology shows that there are some unresolved problems, and then some of these problems are solved and others are not. The result is another option for solving your problems, one for which this technology is better adapted.
I would say that you can use STM for the parts of your application that can live with the constraints of the current state of the art, for example parts that don't mind some loss of efficiency.
Communication between the transactional and non-transactional parts is the big problem. There are STMs that are lock-aware, so they can interact in a consistent way with non-transactional parts.
I/O is also possible, but your transaction becomes irrevocable, that is, it cannot be aborted. That means only one transaction can use I/O at a time. You can also perform the I/O once the top-level transaction has succeeded, in the non-transactional world, as is done now.
Most library-based STM systems force the user to distinguish between transactional and non-transactional data. So yes, you need to understand exactly what that means. On the other hand, compilers can deduce which accesses must be transactional; the problem is that they can be too conservative, losing the efficiency we can gain when we manage the different kinds of variables explicitly. This is the same as having static, local and dynamic variables: you need to know the constraints of each to write a correct program.
I've been reading up a lot about transactional memory lately.
You might also be interested in this podcast on software transactional memory, which also introduces STM using an analogy based on garbage collection:
The paper is about an analogy between garbage collection and transactional memory. In addition to seeing the beauty of the analogy, the discussion also serves as a good introduction to transactional memory (which was mentioned in the Goetz/Holmes episode) and, to some extent, to garbage collection.
If you use transactional memory as a replacement for locks, all the code that used to execute with the lock held becomes an atomic section that may be rolled back and re-executed. Thus the code that was previously using locks must be transactional, and will have all the same drawbacks (and benefits).
So, you could possibly restrict the influence of TM to only those parts of the code that hold locks, right? Every piece of code that can be called while a lock is held must support TM in that scenario. How much of your program does not hold locks and is never called by code that holds locks?
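To see what "must support TM" means in practice, here is a tiny sketch using plain java.util.concurrent (not a real STM, and it only covers a single memory location; it merely mimics the optimistic retry behaviour an atomic block gives you in place of a lock, which is why the section must be side-effect-free and re-runnable):

    import java.util.concurrent.atomic.AtomicReference;
    import java.util.function.UnaryOperator;

    // The "atomic section" is the function passed in. Under contention it is
    // re-executed against a fresh snapshot, so it must be free of I/O and of
    // side effects on anything other than its inputs - the same restriction
    // real STM systems place on transactions.
    class OptimisticCell<T> {
        private final AtomicReference<T> state;

        OptimisticCell(T initial) {
            state = new AtomicReference<>(initial);
        }

        T update(UnaryOperator<T> atomicSection) {
            while (true) {
                T snapshot = state.get();
                T proposed = atomicSection.apply(snapshot);  // may run many times
                if (state.compareAndSet(snapshot, proposed)) {
                    return proposed;                          // "commit" succeeded
                }
                // another thread won the race: discard the attempt and retry
            }
        }
    }

    // Usage: replacing "synchronized (lock) { counter++; }" with
    //   cell.update(n -> n + 1);

A real TM generalizes this to atomic sections spanning many locations, which is exactly where the constraints on I/O and local side effects come from.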

When do transactions become more of a burden than a benefit?

Transactional programming is, in this day and age, a staple of modern development. Concurrency and fault-tolerance are critical to an application's longevity and, rightly so, transactional logic has become easy to implement. As applications grow, though, it seems that transactional code tends to become more and more burdensome for the scalability of the application, and when you bridge into distributed transactions and mirrored data sets the issues start to become very complicated. I'm curious what seems to be the point, in data size or application complexity, where transactions frequently start becoming the source of issues (causing timeouts, deadlocks, performance issues in mission-critical code, etc.) that are more bothersome to fix, troubleshoot or work around than designing a data model that is more fault-tolerant in itself, or using other means to ensure data integrity. Also, what design patterns serve to minimize these impacts or make standard transactional logic obsolete or a non-issue?
--
EDIT: We've got some answers of reasonable quality so far, but I think I'll post an answer myself to bring up some of the things I've heard about to try to inspire some additional creativity; most of the responses I'm getting are pessimistic views of the problem.
Another important note is that not all deadlocks are a result of poorly coded procedures; sometimes there are mission-critical operations that depend on similar resources in different orders, or complex joins in different queries that step on each other; this is an issue that can sometimes seem unavoidable, but I've been part of reworking workflows to facilitate an execution order that is less likely to cause one.
I think no design pattern can solve this issue by itself. Good database design, good stored procedure programming, and especially learning how to keep your transactions short will ease most of the problems.
There is no 100% guaranteed method of not having problems though.
In basically every case I've seen in my career though, deadlocks and slowdowns were solved by fixing the stored procedures:
making sure all tables are accessed in a consistent order prevents deadlocks (see the sketch after this list)
fixing indexes and statistics makes everything faster (and hence diminishes the chance of deadlock)
sometimes there was no real need for transactions, it just "looked" like there was
sometimes transactions could be eliminated by turning multi-statement stored procedures into single-statement ones.
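On the first point, a hypothetical JDBC sketch (table names invented for illustration): as long as every transaction that touches both tables locks Account rows before Orders rows, two concurrent transactions can block each other but cannot form the lock cycle that a deadlock requires.

    import java.sql.Connection;
    import java.sql.PreparedStatement;

    // Every code path that updates both tables does so in the same global order
    // (Account first, then Orders). Reversing the order in just one procedure is
    // what creates the lock cycle behind the typical deadlock.
    public class ConsistentOrderUpdate {
        public static void settle(Connection conn, long accountId, long orderId) throws Exception {
            conn.setAutoCommit(false);
            try (PreparedStatement debit = conn.prepareStatement(
                     "update Account set Balance = Balance - 10 where ID = ?");
                 PreparedStatement close = conn.prepareStatement(
                     "update Orders set Status = 'SETTLED' where ID = ?")) {
                debit.setLong(1, accountId);   // lock rows in Account first...
                debit.executeUpdate();
                close.setLong(1, orderId);     // ...then rows in Orders, never the reverse
                close.executeUpdate();
                conn.commit();
            } catch (Exception e) {
                conn.rollback();
                throw e;
            }
        }
    }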
The use of shared resources is wrong in the long run, because by reusing an existing environment you are creating more and more possibilities. Just review the busy beavers :) The way Erlang goes about it is the right way to produce fault-tolerant and easily verifiable systems.
But transactional memory is essential for many applications in widespread use. If you consider a bank with its millions of customers, for example, you can't just copy the data for the sake of efficiency.
I think monads are a cool concept to handle the difficult concept of changing state.
One approach I've heard of is to use a versioned, insert-only model where no updates ever occur. During selects, the version is used to select only the latest rows. One downside I know of with this approach is that the database can get rather large very quickly.
I also know that some solutions, such as FogBugz, don't use enforced foreign keys, which I believe would also help mitigate some of these problems, because the SQL query plan can lock linked tables during selects or updates even if no data is changing in them, and if a highly contended table gets locked it can increase the chance of deadlocks or timeouts.
I don't know much about these approaches though since I've never used them, so I assume there are pros and cons to each that I'm not aware of, as well as some other techniques I've never heard about.
I've also been looking into some of the material from Carlo Pescio's recent post, which I unfortunately haven't had enough time to do justice, but the material seems very interesting.
If you are talking 'cloud computing' here, the answer would be to localize each transaction to the place where it happens in the cloud.
There is no need for the entire cloud to be consistent, as that would kill performance (as you noted). Simply, keep track of what is changed and where and handle multiple small transactions as changes propagate through the system.
The situation where user A updates record R and user B at the other end of the cloud does not see it (yet) is the same as the one where user A hasn't made the change yet in the current strict-transactional environment. This could lead to discrepancies in an update-heavy system, so systems should be architected to rely on updates as little as possible - moving things towards aggregation of data and pulling out the aggregates only once the exact figure is critical (i.e. moving the consistency requirement from write time to critical read time).
Well, just my POV. It's hard to conceive of a system that is application-agnostic in this case.
Try to make changes at the database level in the smallest possible number of instructions.
The general rule is to lock a resource for the least possible time. Using T-SQL, PL/SQL, Java on Oracle, or any similar approach, you can reduce the time that each transaction locks a shared resource. In fact, transactions in the database are optimized with row-level locks, multi-versioning, and other kinds of intelligent techniques. If you can perform the transaction in the database, you save the network latency, apart from that of other layers like ODBC/JDBC/OLE DB.
Sometimes the programmer tries to get the good properties of a database (it is transactional, parallel, distributed) but keeps a cache of the data. Then they need to manually add back some of the database features.