Transactions for patches in RavenDB

RavenDB does not support transactions for patches. But in pretty much any high-load project, patching is the only way to go (due to concurrency problems). Inevitably you have situations where you need to execute a sequence of patches:
AddLikeToCommentLikes(...);
AddLikeToUserLikes(...) //needed to show a user which comment he liked
In my experience there is typically a primary change and a secondary one (which could be recovered if lost), but sometimes there are several equally important changes.
So what is the best practice in this case to ensure that either all of the patches happen or none of them do? Is there any way to emulate a transaction across several patches?
The only idea I have in this regard is to merge the changes into one script patch: a script that adds the comment like and then, in the same script, fetches the UserLikes document with LoadDocument and modifies it. But I doubt RavenDB guarantees that UserLikes is not modified by someone else during this operation.

RavenDB does support transactions for patches, as long as you are patching by id.
If you are patching by query, the operation is handled as a transaction for each individual document, but not globally.
Calling LoadDocument in a patch results in all documents modified by that patch sharing the same transaction, so either all of the patch's changes are saved or none of them are; the whole thing happens in an atomic fashion.

Related

Does DynamoDB have locking by default?

I'm looking over the DynamoDB documentation and it looks like they have optimistic locking. I'm wondering if this is used by default or not.
From the documentation, it looks like you need to code up the Java application to use the @DynamoDBVersionAttribute annotation and get and set the versions. Without doing this, it looks like you can write to DynamoDB without any sort of locking.
Is that correct?
On a side note, I'm not too familiar with DBs without some sort of locking, so what would happen if two people wrote to the same item at the same time in DynamoDB without any locking? Say the item we're writing to has 4 fields: would one write completely fail, or is it possible that DynamoDB updates 2 of the 4 fields with one write and the other 2 fields with the other write?
You are correct. DynamoDB does NOT have optimistic locking by default. There are various SDKs for DynamoDB and as far as I am aware the only one which provides optimistic locking functionality is the Java SDK.
Here's what the Java SDK optimistic locking actually supports:
Creates an attribute in your table called version
You must load an item from the database before updating it
When you try to save an item, the SDK checks that the client item's version number matches the one in the table; if it does, the save is completed and the version number is incremented
This is pretty simple to implement yourself if you are using a different SDK. You would create the version attribute yourself, create a wrapper for the putItem method (and any other save/update operations you require), and use a condition expression to test that the version number in the database is one less than the version you are saving.
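As a hedged illustration only (not part of any official SDK feature), here is roughly what that do-it-yourself check looks like with boto3, the Python SDK; the table name, key, and the "version" attribute are invented for the example:

import boto3
from botocore.exceptions import ClientError

table = boto3.resource("dynamodb").Table("Comments")  # hypothetical table

def save_with_version_check(item):
    """Write the item only if the stored version is the one we originally read (item['version'] - 1)."""
    try:
        table.put_item(
            Item=item,
            # Succeed for brand-new items, or when nobody has bumped the version since we read it.
            ConditionExpression="attribute_not_exists(#v) OR #v = :expected",
            ExpressionAttributeNames={"#v": "version"},
            ExpressionAttributeValues={":expected": item["version"] - 1},
        )
    except ClientError as err:
        if err.response["Error"]["Code"] == "ConditionalCheckFailedException":
            raise RuntimeError("item was modified by someone else; reload and retry") from err
        raise

If the condition fails, DynamoDB rejects the whole write, which is the same optimistic locking behaviour the Java SDK gives you via the annotation.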
To answer the second part of your question, both updates would succeed (assuming you had put no conditions on your update). The first one would make any updates specified, and the second one would come along and overwrite them.
DynamoDB doesn't support optimistic locking by default. As you mentioned, you need to use the annotation in the Java model class in order to use optimistic locking.
If two threads write to the same item, the DynamoDB item will hold the data from the last write (i.e. the last thread to write wins).

Use cases of Event Sourcing, when we don't care about past states

I have been reading about the Event Sourcing pattern and have seen it used in projects I have worked on, but I am yet to see any benefit from it, while it makes the design much more complicated.
That is, many sources mention that Event Sourcing is good if you want an audit log or the ability to reconstruct the state from 15 days ago, and I see that Event Sourcing solves all of that beautifully. But apart from that, what is the point?
Yes, I can imagine that if you are in the relational world, then writes are comparatively slow as they lock the data and so on. But it is much easier to solve this problem by going NoSQL and using something like Cassandra. Cassandra's writes are super fast, as they are append-only (a kind of temporary event source, if you like), and it scales beautifully as well. Sources also mention that Event Sourcing helps scaling - but how on earth can it help you scale, when instead of storing ~1 row of data per user you now have 9000, and instead of retrieving that single row you are now replaying 9000 rows (or fewer, if you complicate the design even more, add some temporal snapshots of state, and replay the current state from the last snapshot)?
Any examples of real life problems that Event Sourcing solves or links would be much appreciated.
While I haven't implemented a distributed, event-sourced sub-system as yet (so I'm no expert), I have been researching and evaluating the approach. Event sourcing provides a number of key benefits:
Reliability
Scalability
Evolvability
Audit
I'm sure there are more. To a large extent, the benefits of event sourcing depend on the baseline you are comparing it against (CRUD, event-driven DDD, CQRS, or whatever), and the domain.
Let's look at each of those in turn:
Reliability
With event-driven systems that fire events whenever the system is updated, you often have a problem: how do you both update the system state and fire the event in one go? If the second operation fails, your system is in a broken, inconsistent state. Event sourcing provides a neat solution to this, since the system only requires a single operation for the state change, which will either succeed or fail atomically: the writing of the event. Other solutions tend to be more complex and less scalable - two-phase commit, etc.
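To make that concrete, here is a minimal single-process sketch of the idea (in Python, with invented names, not any particular event-store product): appending the events is the one and only write, guarded by an optimistic version check, and any downstream handlers are fed the events afterwards.

from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class Event:
    stream_id: str
    type: str
    data: dict

@dataclass
class EventStore:
    _streams: Dict[str, List[Event]] = field(default_factory=dict)
    _subscribers: List[Callable[[Event], None]] = field(default_factory=list)

    def append(self, stream_id: str, expected_version: int, new_events: List[Event]) -> None:
        stream = self._streams.setdefault(stream_id, [])
        if len(stream) != expected_version:        # optimistic concurrency check
            raise RuntimeError(f"conflict on {stream_id}: someone appended first")
        stream.extend(new_events)                  # the single, all-or-nothing state change
        for handler in self._subscribers:          # projections catch up afterwards
            for event in new_events:
                handler(event)

    def subscribe(self, handler: Callable[[Event], None]) -> None:
        self._subscribers.append(handler)

There is no separate "update the state, then publish the event" pair that could half-fail; if the append fails, nothing happened.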
This is a big benefit in a large, high transaction system, where components are failing, being updated or replaced all the time while transactions are going on. The ability to terminate a process at any time without any worry about data corruption or consistency is a big benefit and helps you sleep at night.
In many domains you won't have concurrent writes to the same entities, or you won't require events since a state change has no knock-on effects, in which case event sourcing is unlikely to be a good approach, and simpler approaches like CRUD may be fine.
Scalability
First of all, event streams make consistent writes very efficient - it's just an append-only log, which makes replication and 'compare and set' simple to optimise. Something like Cassandra is quite slow in the scenario where you need to protect your invariants - that is, when you need to validate a command against the current state of a 'row' and reject the update if the row changes before you have a chance to update it. You either need to use 'lightweight transactions' to ensure consistency, or have a single writer thread per partition, so that you can be sure you can successfully validate a command against the current state of the system before allowing the update. Of course you can implement an event store in Cassandra using either of these approaches (single thread/lightweight transactions).
Read scalability is the biggest performance benefit though - since you can build as many different eventually consistent projections (views) on the data as you want by reading from event streams, and horizontally scale query services on these views as much as you want. These views can use custom databases (Cassandra, graph databases) as necessary to allow queries to be optimised as much as you want. They can store denormalised data, to allow all required data to be fetched in a single (non-joined) database query. They can even store the projected state in memory, for maximum performance. While this can potentially be achieved without event sourcing, it is much more complex to implement.
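For illustration (with invented event names, not tied to any framework), a projection is just a fold over the event stream, which is why a view can be thrown away and rebuilt, stored denormalised, or kept entirely in memory:

events = [
    {"type": "CommentPosted", "comment_id": "c1"},
    {"type": "CommentLiked", "comment_id": "c1", "user_id": "u1"},
    {"type": "CommentLiked", "comment_id": "c1", "user_id": "u2"},
]

def project_like_counts(history):
    """Rebuildable read view: likes per comment, denormalised for a single-key lookup."""
    view = {}
    for event in history:
        if event["type"] == "CommentLiked":
            view[event["comment_id"]] = view.get(event["comment_id"], 0) + 1
    return view

print(project_like_counts(events))  # {'c1': 2}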
If you don't have complex querying and high scalability requirements, event sourcing may not be the right solution.
Evolvability
If you need to look at your data in a new way, say you create a new client app or screen in an app, it's very easy to add new projections of the event streams as new, independent services. If you need to add some data to an existing read view that you missed, or fix a bug in the read view, you can just rebuild the views using the event streams and throw away the old ones. The advantages here vs. the non-event sourced case are:
You don't need to write both DB migration code and then code to keep the view up to date as events come in. Instead, you just write the code to keep it up to date, and run it on the events from the start of time.
Related to this, you can do the update without having to bring down the query service to do a schema change - instead, just leave the old service version running against the old DB, generate a new DB with the new service version, and when it's caught up with the event streams, just atomically switch over then clean up the old service and DB once you're happy the new one is stable (noting that the old service will be keeping itself up to date in the meantime, if you need to roll back!). This is likely to be extremely difficult to achieve without event sourcing.
If you need any temporal information to be added to your views (e.g. when was the last update, when was this created), that's already available and easy to add, but impossible to add retrospectively without event sourcing.
Note that the above isn't about modifying event streams (which is trickier, see my comment on challenges below) - it's about using the existing event streams to enhance a view or create a new one.
There are simple ways to do this without event sourcing, such as using database views (with an RDBMS), but they aren't as scalable.
Event sourcing also has some challenges for evolvability - you need to take care of event versioning, probably using a combination of weak event schema (so you can add properties with default values) and stream replacement (when you want to do a bigger change to your events). Greg Young is writing a good book on this.
Audit
As you mentioned, you're not interested in this.

Incremental linearizing of git DAG

I'm the author of GitX. One of the features GitX has is the visualization of branches, as can be seen here.
This visualization is currently done by reading commits which are emitted from git in the correct order. For each commit the parents are known, so it's fairly easy to build up the lanes in the correct way.
I'd like to speed up this process by using my own commit pool and linearizing the commits myself. This allows me to reuse existing loaded commits and allows git to emit commits faster because it doesn't have to emit them in the correct order.
However, I'm not sure what algorithm to use to accomplish this. It is important that the building is incremental, as the loading of commits can take a long time (>5 seconds for 100,000 commits, which should all be displayed).
Gitk has gone the same way, and there's a patch here that shows how it is implemented, but my Tcl skills are weak and the patch isn't very thoroughly commented and is a bit hard to follow.
I'd also like this algorithm to be efficient, as it'll have to handle hundreds of thousands of commits. It also has to be displayed in a table, so it's important that access to specific rows is fast.
I'll describe the input I have so far, the output that I want and a few observations.
Input:
I have a current pool of commits in the form of a hash table that maps commit ids to commit objects. This pool does not have to be complete (have all commits necessary)
I have a separate thread loading in new commits from git, with a callback that can be called every time a new commit is loaded. There is no guaranteed order in which the commits come in, but in most of the cases the next commit is a parent of the previous commit.
A commit object has its own revision id and the revision ids of all its parents
I have a list of branch heads that should be listed. That is, there isn't a single 'top' of the DAG that should be displayed. There also does not have to be a single graph root.
Output:
I'll need to linearize these commits in topological order. That is, a commit cannot be listed after its parents have been listed.
I also need the 'branch lines' that can be seen in the screenshot above. These probably need to be precomputed as most of them depend on their children.
A few remarks:
It's necessary to be able to relocate a list of commits. For example, we might have two commits (branch heads) that appear unrelated, until a commit shows up which makes one head an ancestor of the other.
Multiple branch tips must be shown
It's important that this process is incremental, so that at least a partial view is available while the data is still loading. This means that new data has to be inserted halfway and that the branch lines have to be readjusted.
The standard topological sort is O(n) (OK, O(V+E)), i.e. you should be able to sort a million commits in memory in a fraction of a second. No incremental hacks like those in the Tcl patch are needed.
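For reference, a rough sketch of Kahn's algorithm over a commit pool (plain Python with illustrative names, nothing GitX-specific): children are emitted before their parents, the branch heads form the initial working set, and parents that haven't been loaded yet are skipped, so the sort can simply be re-run as the pool grows.

from collections import deque

def topo_order(commits):
    """commits maps a commit id to the list of its parent ids (missing parents are tolerated)."""
    child_count = {cid: 0 for cid in commits}       # how many loaded children point at each commit
    for cid, parents in commits.items():
        for p in parents:
            if p in commits:
                child_count[p] += 1

    heads = deque(cid for cid, n in child_count.items() if n == 0)  # branch tips: nothing points at them
    ordered = []
    while heads:
        cid = heads.popleft()
        ordered.append(cid)
        for p in commits[cid]:
            if p in commits:
                child_count[p] -= 1
                if child_count[p] == 0:             # all of its loaded children are already listed
                    heads.append(p)
    return ordered

# Two unrelated heads that turn out to share a root commit.
print(topo_order({"feature": ["root"], "master": ["root"], "root": []}))
# ['feature', 'master', 'root']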
BTW, I use GitX every day (it looks much better than Gitk on OS X) and don't have any issues with it (maybe because I don't have those crazy merges in my repositories) :)
OK, so I'm having a similarly hard time reading the entirety of that patch, but let's see if I can piece it together from what I did figure out.
To start with, gitk simplifies things by condensing a string of commits into an arc, containing a series of commits that each only have one parent and one child. Aside from anything else, doing this should cut down pretty dramatically on the number of nodes you have to consider for your sort, which will help out any algorithm you use. As a bonus, related commits will end up grouped together.
This does introduce some complexity in terms of finding an arc when you read a new commit. There are a few situations:
The new commit has a single parent, or no parents. It extends a (possibly empty) arc. Most of the time, you'll just extend the most recent arc. There are a few interesting subcases:
It may cause an existing arc to be split, if its parent already has a child (i.e. its parent turns out to be a branch point, which I gather you don't know ahead of time).
It could be a "missing link" that connects two arcs together.
You may already know that this commit has multiple children
The new commit has multiple parents (a merge commit).
You may want to include the multi-child or multi-parent commits in arcs, or it may make more sense to keep them separate. Either way, it shouldn't be too difficult to build up this set of arcs incrementally.
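As a rough, non-incremental sketch of the condensation step only (my own naive take in Python, not gitk's actual implementation), the idea is to chain together every run of commits that have exactly one parent and one child:

def condense_into_arcs(parents):
    """parents maps a commit id to the list of its parent ids (unloaded parents are ignored)."""
    children = {cid: [] for cid in parents}
    for cid, ps in parents.items():
        for p in ps:
            if p in children:
                children[p].append(cid)

    def is_simple(cid):
        # a commit that can live inside an arc: exactly one parent and one child
        return len(parents[cid]) == 1 and len(children[cid]) == 1

    arcs, seen = [], set()
    for cid in parents:
        if cid in seen or not is_simple(cid):
            continue
        top = cid
        while is_simple(children[top][0]):          # walk towards the newest commit of the run
            top = children[top][0]
        arc, cur = [], top
        while is_simple(cur) and cur not in seen:   # collect the run towards the parents
            arc.append(cur)
            seen.add(cur)
            cur = parents[cur][0]
        arcs.append(arc)
    return arcs

Making this incremental, as described above, then becomes a matter of splitting or joining these arcs as new commits arrive.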
Once you have these arcs, you're still left with trying to linearize them. In your case, the first algorithm described on the Wikipedia page for topological sorting (Kahn's algorithm, sketched earlier) sounds useful, as you have a known set of branch heads to use as your initial set S.
Other notes:
Relocating commits should be manageable. First of all, you only have to care when you connect two arcs, either through a new merge commit, a newly-discovered branch point, or combining two arcs into one. Any given arc can easily maintain its current row number range (assuming you're fine with putting an arc on sequential rows), so traversing up the tree checking that all new ancestors show up later should be pretty quick.
I don't know enough to say much about drawing the graph lines, but I imagine it won't be too different from what you do now.
Anyway, I hope that helps. It was interesting to think about, at least.
Do you really need to display 100k commits at once? What kind of user can soak up that kind of info?
Have you thought about paging? I.e. just compute ~100 commits or so at a time. If a branch line goes way back (off-page), you could use something like GitHub's back-pointing arrow to show that.
I haven't used GitX, so maybe I'm missing something, but it seems like you could walk back from child to parent(s) from the head of each current branch until you can draw a few screens of the graph.
That might not give you the optimal visual layout of branches that are rooted earlier. But it seems like responsiveness would be more important than waiting to draw a graph with the fewest crossings, since most users are likely to be interested in recent activity.

Migrate ClearCase to Perforce

I have a large quantity of ClearCase data which needs to be migrated into Perforce. The revisions span the better part of a decade and I need to preserve as much branch and tag information as possible. Additionally, we make extensive use of symbolic links, which are supported in ClearCase but not in Perforce. What advice or tools can you suggest that might make this easier?
The first step is to decide if you need to migrate everything, or just certain key versions. If you only migrate the important versions (releases and major milestones) you'll end up with a much simpler history in Perforce, without losing anything important. ClearCase can then be kept as a historical archive in case it is ever needed. (Unless IBM has changed things, ClearCase licenses do not expire when maintenance runs out; you just lose the right to new upgrades and patches, and access to support.)
Keep in mind that Perforce does not version control directories and does not keep a full per-element version tree - this means a 1:1 conversion with exact results is going to be impossible. Recreating the important snapshots is a much more achievable goal; keeping everything may be impossible, as Perforce lacks features ClearCase relies upon.
To see what Perforce says about the migration, check out
http://perforce.com/perforce/ccaseconv.html
This explains the key differences and covers a few approaches you can take.
Start by doing a Google search on "clearcase to perforce conversion".
Then read the ClearCase to Perforce Conversion Guide.
Once you're done crying, you're going to have to decide (1) how much effort you can afford, and (2) what you really need to capture as part of the conversion. You're not going to get it all, so you might as well just focus on getting the important branches.
Another consideration would be to just capture the current state of each supported branch as a snapshot, import that into Perforce, and then turn off the old ClearCase server, saving it in a known good state for that day when you need to access something from the deep, dark, pre-Perforce days...
The other answers are outdated. Nowadays you can import from ClearCase to Perforce with many options, including preserving history.
http://www.perforce.com/sites/default/files/pdf/migration-planning-guide-clearcase-to-perforce.pdf
What you also have to keep in mind is that your importer script may commit in a slightly different sequence than the original ClearCase commits (maybe you are traversing directories, maybe file histories, etc.).
So, unless you gather all the version information into a (large) database and sort it afterwards, you will end up with commits that are not very useful to look at (except, of course, for the history of single files). Assuming you (hopefully) change your commit policy to commit atomic changes into Perforce, it will be obvious where the imported history ends and real development begins: the imported commits just do not make any sense at the project scope.
So you really should think about leaving the ClearCase history behind. Recreating tags and branches is also a separate problem, as you need your old config specs for your old branches.
In the end you will get wrong filenames in old tags (as Perforce does not support directory versioning), so you will still need ClearCase for this (and it is very tricky to get the correct filename for each version of a file!).
The last problem you will encounter is importer run time: if you have large VOBs (e.g. 10 years of history, 50 GB in size), you will wait days for the importer to gather all the information and convert it into a nice shiny Perforce repo. All that time your dev team will be unable to work.
Just a quick note on the one import I saw from ClearCase to Perforce.
As noted in the ClearCase to Perforce Conversion Guide:
Perforce supports atomic change transactions; ClearCase doesn't.
Note that labels are often used to simply denote a snapshot in time for a particular easily-specified set of files; this is inherently easy to do in Perforce without using a label, due to Perforce's use of atomic change transactions and file naming syntax.
For example, the state of all the files in //depot/projecta as of change 42 can be obtained with
p4 sync //depot/projecta/...#42
That means the ClearCase project that got imported was a UCM one, since the concept of a baseline closely follows that of a global revision.
Only files with a baseline on them were imported; the other versions were discarded.

Best Database Change Control Methodologies

As a database architect, developer, and consultant, there are many questions I can answer. One, though, that I was asked recently and still can't answer well is...
"What is one of, or some of, the best methods or techniques to keep database changes documented, organized, and yet able to roll out effectively either in a single-developer or multi-developer environment."
This may involve stored procedures and other object scripts, but especially schemas - from documentation, to the new physical update scripts, to rollout, and then full circle. There are applications to make this happen, but they require schema hooks and overhead. I would rather learn about techniques that work without a lot of extra third-party involvement.
The easiest way I have seen this done without the aid of an external tool is to create a "schema patch", if you will. The schema patch is just a simple T-SQL script. Each patch is given a version number within the script, and this number is stored in a table in the database that receives the changes.
Any new change to the database involves creating a new schema patch. The patches can then be run in sequence: the process detects what version the database is currently on and runs all the schema patches in between. Afterwards, the schema version table is updated with the version and the date/time the patch was executed, ready for the next run.
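As a rough sketch of that mechanism (Python and SQLite are used here only to keep the example self-contained and runnable; the table layout and patch contents are invented, and in practice the patches would be your own T-SQL files):

import sqlite3

PATCHES = {
    1: "CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT);",
    2: "ALTER TABLE customer ADD COLUMN email TEXT;",
}

def apply_patches(conn):
    conn.execute("CREATE TABLE IF NOT EXISTS schema_version "
                 "(version INTEGER PRIMARY KEY, applied_at TEXT DEFAULT CURRENT_TIMESTAMP)")
    current = conn.execute("SELECT COALESCE(MAX(version), 0) FROM schema_version").fetchone()[0]
    for version in sorted(v for v in PATCHES if v > current):
        conn.executescript(PATCHES[version])        # run only the patches the database hasn't seen
        conn.execute("INSERT INTO schema_version (version) VALUES (?)", (version,))
        conn.commit()

apply_patches(sqlite3.connect(":memory:"))          # on a persistent database, a second run applies nothing new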
A good book that goes into details like this is called Refactoring Databases.
If you wish to use an external tool you can look at Ruby's Migrations project or a similar tool in C# called Migrator.NET. These tools work by creating C# or Ruby classes with a "Forward" and a "Backward" migration. They are more feature-rich because they know how to go forward as well as backwards through the schema patches. As you stated, however, you are not interested in an external tool, but I thought I would add that for other readers anyway.
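Purely to illustrate that forward/backward shape (this is not the actual API of either tool, just a hedged sketch of the idea in Python): each change is a small class that knows how to apply itself and how to undo itself.

import sqlite3

class AddCustomerEmailIndex:
    version = 2

    def forward(self, db):
        db.execute("CREATE INDEX idx_customer_email ON customer(email)")

    def backward(self, db):
        db.execute("DROP INDEX idx_customer_email")

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, email TEXT)")
migration = AddCustomerEmailIndex()
migration.forward(db)       # migrate up
migration.backward(db)      # roll back down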
I rather liked this series:
http://odetocode.com/Blogs/scott/archive/2008/02/03/11746.aspx
In my case, I generate a script every time I change the database. I name the scripts like 00001.sql up to n.sql, and I have a table holding the number of the last script I have executed. You can also see Database Documentation.
As long as you only add columns/tables to your database, it is an easy task: script these changes in advance in SQL files and just execute them, perhaps in a defined order.
A good solution would be to make one file per table, so that all changes belonging to that table are visible to whoever is working on it (it's like working on a class). The same goes for stored procedures and views.
A more difficult task (and one where tools might help) is stepping back. As long as you only added tables/columns, this may not be a big issue. But if you dropped columns in an update and now have to undo that update, the data is not there anymore; you will need to get it back from a backup. Keep in mind that if you have more than a few tables this can be a big job, and normally you need to undo an update very quickly!
If you can simply restore the backup, then you're fine for the moment. But if you update on Monday, your clients work until Wednesday, and only then do they notice that some data is missing (data you dropped from a table), you cannot just restore the old database.
I have a model-based approach in mind (sorry, not implemented at the moment) in which schema changes are "modeled" (e.g. in XML), and during an update a processor (e.g. a C# program) generates all the necessary SQL and, for example, moves data into a "dropDatabase". The data can stay there, and if for some reason I need to restore some of the dropped data, I can do it with the processor. I think that over time (years) this approach pays off, because otherwise developers stop touching "old" tables since they no longer know whether a table or column is really necessary. With this approach you don't risk much if you drop something!
What I do is:
All the DDL commands required to recreate the schema (and the stored procedures and the indexes, etc) are in a script.
To be sure the script is OK, it is tested from time to time (create a database, run the script, restore a backup of the data, and check that the database works well).
For change control, the script is kept in a Version Control System (I typically use Subversion).
The trick is that, if the database cannot be brought down to be recreated when, say, a column is added, I have two changes to make: an ALTER TABLE plus a modification to the script. A bit more work but, in the long term, it wins.