Rails and MongoDB: how to get the last document inserted and be sure it is thread safe? - ruby-on-rails-3

When I add a new document to my collection X, I need to get the last document that was inserted into that same collection, because some values of that document must influence the document I am currently inserting.
Basically, as a simple example, I would need to do this:
class X
  include Mongoid::Document
  include Mongoid::Timestamps

  field :sum, :type => Integer
  field :misc, :type => Integer

  before_save :set_sum

  def set_sum
    self.sum = X.last.sum + self.misc
  end
end
How can I make sure this kind of process will never break if there are concurrent inserts? I must make sure that when self.sum = X.last.sum + self.misc is calculated, X.last.sum absolutely represents the last document inserted into the collection.
This is critical to my system. It needs to be thread safe.
Alex
PS: this also needs to be performant; when there are 50k documents in the collection, it can't take long to get the last value...

This kind of behavior is equivalent to having an auto-increment id.
http://www.mongodb.org/display/DOCS/How+to+Make+an+Auto+Incrementing+Field
The cleanest way is to have a side collection with one (or more) docs representing the current total values.
Then in your client, before inserting the new doc, do a findAndModify() that atomically updates the totals AND retrieves the current total doc.
Part of the current doc can be an auto increment _id, so that even if there are concurrent inserts, your document will then be correctly ordered as long as you sort by _id.
Only caveat: if your client app dies after findAndModify and before insert, you will have a gap in there.
Either that's ok or you need to add extra protections like keeping a side log.
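For illustration, here is a minimal sketch of that pattern in Python with pymongo (the counter document and collection names are assumptions, not anything from the question):

from pymongo import MongoClient, ReturnDocument

db = MongoClient().mydb

def insert_with_running_sum(misc):
    # Atomically bump the counter and the running total in one side document.
    # find_one_and_update is pymongo's interface to findAndModify.
    totals = db.counters.find_one_and_update(
        {"_id": "x_totals"},
        {"$inc": {"seq": 1, "sum": misc}},
        upsert=True,
        return_document=ReturnDocument.AFTER,
    )
    # seq doubles as the auto-increment _id, so even concurrent inserts
    # stay correctly ordered when sorted by _id.
    db.x.insert_one({"_id": totals["seq"], "sum": totals["sum"], "misc": misc})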
If you want to be 100% safe you can also take inspiration from two-phase commit:
http://www.mongodb.org/display/DOCS/two-phase+commit
Basically it is the proper way to do transactions with any DB that spans more than one server (even SQL wouldn't help there).
best
AG

If you need to keep a running sum, this should probably be done on another document in a different collection. The best way to keep this running sum is to use the $inc atomic operation. There's really no need to perform any reads while calculating this sum.
You'll want to insert your X document into its collection, then also $inc a value on a different document that is meant for keeping this tally of all the misc values from the X documents.
Note: This still won't be transactional, because you're updating two documents in two different collections separately, but it will be highly performant, and fully thread safe.
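For example, a hedged sketch in Python with pymongo (the tally document and collection names are made up for illustration):

from pymongo import MongoClient

db = MongoClient().mydb

def add_x(misc):
    db.x.insert_one({"misc": misc})
    # $inc adds misc to the tally atomically; there is no read-modify-write
    # cycle, so concurrent writers can never lose an update.
    db.totals.update_one({"_id": "x_sum"}, {"$inc": {"sum": misc}}, upsert=True)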
For more info, check out all the MongoDB Atomic Operations.

Related

How to generate an incremental item number in BOPF

I created a custom table to store reasons for modifying an object. I'm doing a POC with BOPF in order to learn, even if it may not make sense to use it here.
This is what the persistent structure looks like (simplified):
define type zobject_modifications {
  object_id : zobject_id;
  #EndUserText.label : 'Modification Number'
  mod_num : abap.numc(4);
  reason_id : zreason_id;
  #EndUserText.label : 'Modification Comments'
  comments : abap.string(256);
}
The alternative key consists of object_id + mod_num. The mod_num should be an auto-generated counter, always adding 1 to the last modification for the object_id.
I created a before_save determination to generate it, checking the MAX mod_num from the database BOs and from the currently instantiated BOs, and increasing it by 1.
But when I try to create 2 BOs for the same object in a single transaction, I get an error because of the duplicated alternative key, since the field MOD_NUM is still initial and the before_save would only be triggered later. I tried changing the determination to "After Modify" but I still get the same problem.
The question is: When and how should I generate the next MOD_NUM to be able to create multiple nodes for the same object ID safely?
This must be a very common problem so there must be a best practice way to do it, but I was not able to find it.
Use a number range to produce sequential identifiers. They ensure that you won't get duplicates if there are ongoing and concurrent transactions.
If you want to insist on determining the next identifier on your own, use the io_read input parameter of the determination to retrieve the biggest mod_num:
The database contains only those nodes that have already been committed, but your new nodes are not committed yet, so you won't get them.
io_read, in contrast, accesses BOPF's transactional buffer, which also contains the nodes you just created, and hence sees the more current data.

Cocoa Scripting: Deletion of elements in a loop getting out of sync

While adding scriptability to my Mac program, I am struggling with the common programming problem of deleting items from an indexed array where the item indexes shift due to removal of items.
Let's say my app maintains a data store in which objects of type "Person" are stored. In the sdef, I've defined the Cocoa key allPersons to access these elements. My app declares an NSArray *allPersons.
So far, this works well. E.g., this script works:
repeat with p in every person
    get name of p
end repeat
The problem starts when I want to support deletion of items, like this:
repeat with p in (get every person)
    delete p
end repeat
(I realize that I could just write "delete every person", which works fine, but I want to show how "repeat" makes things more complicated).
This does not work because AppleScript keeps using the original item numbers to reference the items even after some of them have been deleted, which naturally shifts the remaining items and their numbering.
So, considering we have 3 Persons, "Adam", "Bonny" and "Clyde", this will happen:
get every person
--> {person 1, person 2, person 3}
delete person 1
delete person 2
delete person 3
--> error number -1719 from person 3
After deleting item 1 (Adam), the other items get renumbered to item 1 and 2. The second iteration deletes item 2 (which is now Clyde), and the third iteration attempts to delete item 3, which doesn't exist any more at that point.
How do I solve this?
Can I force the scripting engine to not address the items by their index number but instead by their unique ID so that this won't happen?
It's not your ObjC code, it's your misunderstanding of how repeat with VAR in EXPR loops work. (Not really your fault either: they're 1. counterintuitive, and 2. poorly explained.) When it first encounters your repeat statement, AppleScript sends your app a count event to get the number of items specified by EXPR, which in this case is an object specifier (query) that identifies all of the person elements in whatever. It then uses that information to generate its own sequence of by-index object specifiers, counting from 1 up to the result of the aforementioned count:
person 1 of whatever
person 2 of whatever
...
person N of whatever
What you need to realize is that an object specifier is a first-class query, not an object pointer (not that Apple tell you this either): it describes a request, not an object. Ignore the purloined jargon: Apple event IPC's nearest living relatives are RDBMSes, not Cocoa or SOAP or any of the OO messaging crud that modern developers so fixate on as The One True Way To Do... well, EVERYTHING.
It's only when that query is sent to your application in an Apple event that it's evaluated against the relational graph your Apple event IPC View-Controller – aka "Apple Event Object Model" – presents as an idealized, user-friendly representation of your Model's user data that it actually resolves to a specific Model object, or objects, with which the event handler should perform the requested operation.
Thus, when the delete command in your repeat loop tells your app to delete person 1 of whatever, all your remaining elements move down by one. But on the next iteration the repeat loop still generates the object specifier person 2 of whatever, which your script then sends off to your app, which resolves it to the second item in the collection – which was originally the third item, of course, until you shifted them all about.
Or, to borrow a phrase:
Nothing in AppleScript makes sense except in light of relational queries.
In fact, Apple events' query-based approach actually makes a lot of sense considering it was originally designed to be efficient over very high-latency connections (i.e. System 7's abysmally inefficient process switcher), allowing a single Apple event carrying one or more complex queries to manipulate many objects at once. It's even quite elegant [when it works right], but is quite undone by idiots at Cupertino who think the best way to make programmers not hate the technology is to lie even harder about how it actually works.
So here, I suggest you go read this, which is not the best explanation either but still a damn sight better than anything you'll get from those muppets. And this, by its original designer, which explains a lot of the rationale for creating a high-level coarse-grained query-based IPC system instead of the usual low-level fine-grained OO message-passing crap.
Oh, and once you've done that, you might want to consider trying this instead:
delete every person whose name is "bob"
which is pretty much the whole point of creating a thick declarative-y abstraction that does all the work so the user doesn't have to.
And when nothing but an imperative client-side loop will do, you either want to get a list of by-ID object specifiers (which are the closest things to safe, persistent pointers that AEOMs can do) from the app first and then iterate over that, or at least use your own iterator loop that counts over elements in reverse:
repeat with i from (count every person) to 1 by -1
    tell person i
        ..
    end tell
end repeat
so that, assuming it's iterating over an ordered array on the server side, it will delete from last to first, and so avoid the embarrassing off-by-N errors of your original script.
HTH
re: "If you want your scripable elements to be deletable, make sure you use NSUniqueIDSpecifiers to identify them."
Yes, Apple recommends using formUniqueId or formName for object specifiers, but you can't always do that. For instance, in the Text Suite, you really only have indexing to work with; e.g. character 1, word 3, paragraph 7, etc. You don't have unique IDs for text elements. In addition to deletion, ordering can be affected by other Standard Suite commands: open, close, duplicate, make, and move.
The app implementer is a programmer, but so is the scripter. So it is reasonable to expect the scripter to solve some problems themselves. For instance, if the app has 5 persons, and the scripter wants to delete persons 2 and 4, they can easily do so even with indexed deletion:
delete person 4
delete person 2
Deleting from the end of an ordered list toward the front solves the problem. AS also supports negative indexes, which can be used for the same purpose:
delete person -2
delete person -4
The key to solving this lies in implementing the objectSpecifier method correctly so that it does return an NSUniqueIDSpecifier.
So far, my code only returned an index specifier, which was wrong for this purpose. I guess that had I posted my code (which is, unfortunately, too complex for that), someone might have noticed my mistake.
So, I guess the rule is: if you want your scriptable elements to be deletable, make sure you use NSUniqueIDSpecifiers to identify them. For read-only element arrays, using an NSIndexSpecifier is (probably) safe, though, provided your element array has persistent ordering behavior.
Update
As #foo points out, it's also important that the repeat command fetch the references to the items using … in (get every person) and not just … in every person, because only the former leads to addressing the items by their id, whereas the latter keeps indexing them as item N.

DynamoDB: Have sequencing within Items

I am developing forums on DynamoDB.
There is a table posts which contains all the posts in a thread.
I need to have a notion of sequence in the posts, i.e. I need to know which post came first and which came later.
My service would be running in a distributed env.
I am not sure if using a timestamp is the best solution for deciding the sequence, as the hosts might have slightly different times and might be off by milliseconds/seconds.
Is there another way to do this?
Can I get DynamoDB to populate the date so it is consistent?
Or is there a sequence generator that I can use in a distributed env?
You can't get DynamoDB to auto-populate dates. You can use other services to provide you with auto-generated numbers, or use DynamoDB's atomic increment to create your own sequential IDs.
This can become a bottleneck if your forum is very successful (i.e. it needs lots of numbers per second). I think you should start with a timestamp and later add complexity to your ID generation (concatenate timestamp+uuid or timestamp+atomic counter).
It is always a best practice to keep your servers' clocks in sync (ntpd).
Use a dedicated sequence table. If you have only one sequence (say, PostId), then there's going to be only one row with two attributes in the table.
Yes, there's the extra cost and effort of managing another table, but this is the best solution I know of by far, and I haven't seen anyone else mention it.
The table should have a key attribute as primary partition key, and a numeric value attribute with initial value of 1 (or whatever you want the initial value to be).
Every time you want to get the next available key, you tell DynamoDB to do this:
Increment the value where key = PostId by 1, and return the value before incrementing.
Note that this is one single atomic operation. DynamoDB handles the auto-incrementing, so there are no concurrency issues.
In code, there's more than one way of implementing this. Here's one example:
Map<String,AttributeValue> key = new HashMap<>();
key.put("key", new AttributeValue("PostId"));
Map<String, AttributeValueUpdate> item = new HashMap<String, AttributeValueUpdate>();
item.put("value",
new AttributeValueUpdate()
.withAction(AttributeAction.ADD)
.withValue(1));
UpdateItemRequest request = new UpdateItemRequest("Sequences", key, item).withReturnValues(ReturnValue.ALL_OLD);
UpdateItemResult result = dynamoDBClient.updateItem(request);
Integer postId = Integer.parseInt(result.getAttributes().get("value").getN()); // <- this is the sequential ID you want to set to your post
Another variation of Chen's suggestion is to have strict ordering of posts within a given Forum Thread, as opposed to globally across all Threads. One way to do this is to have a Reply table with the Hash key of ThreadId, and a range key of ReplyId. The ReplyId would be a Number type attribute starting at 0. Every time someone replies, your app does a Query on the Reply table for the one most recent reply on that thread (ScanIndexForward: false, Limit: 1, ThreadId: ). To insert your new reply use the ReplyId of the one returned in the Query, + 1. Then use PutItem, using a Conditional Write, so that if someone else replies at the same time, an error will be returned, and your app can start again with the query.
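A hedged sketch of that query-then-conditional-put flow in Python with boto3 (the table and attribute names are assumptions):

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("Reply")

def add_reply(thread_id, body):
    # Fetch only the most recent reply for this thread.
    resp = table.query(
        KeyConditionExpression=Key("ThreadId").eq(thread_id),
        ScanIndexForward=False,
        Limit=1,
    )
    last_id = resp["Items"][0]["ReplyId"] if resp["Items"] else -1
    # Conditional write: fails if a concurrent writer claimed this ReplyId
    # first, in which case the caller should re-query and retry.
    table.put_item(
        Item={"ThreadId": thread_id, "ReplyId": last_id + 1, "Body": body},
        ConditionExpression="attribute_not_exists(ReplyId)",
    )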
If you want the simplest initial solution possible, then the timestamp+uuid concatenation Chen suggests is the simplest approach. A global atomic counter item will be a scaling bottleneck, as Chen mentions, and based on what you've described, a global sequence number isn't required for your app.

Redis Sorted Set ... store data in "member"?

I am learning Redis and using an existing app (e.g. converting pieces of it) for practice.
I'm really struggling to understand first IF and then (if applicable) HOW to use Redis in one particular use-case ... apologies if this is super basic, but I'm so new that I'm not even sure if I'm asking correctly :/
Scenario:
Images are received by a server, and info like time_taken and resolution is saved in a database entry. Images are then associated (e.g. "belong to") with one Event ... all very straightforward for an RDBMS.
I'd like to use Redis to maintain a list of the 50 most-recently-uploaded image objects for each Event, to be delivered to the client when requested. I'm thinking that a Sorted Set might be appropriate, but here are my concerns:
First, I'm not sure if a Sorted Set can/should be used in this associative manner? Can it reference other objects in Redis? Or is there just a better way to do this altogether?
Secondly, I need the ability to delete elements that are greater than X minutes old. I know about the EXPIRE command for keys, but I can't use this because not all images need to expire at the same periodicity, etc.
This second part seems more like a query on a field, which makes me think that Redis cannot be used ... but then I've read that I could maybe use the Sorted Set score to store a timestamp and find "older than X" in that way.
Can someone provide some clarity on these two issues? Thank you very much!
UPDATE
Knowing that the amount of data I need to store for each image is small and will be delivered to the client's browser, is there anything wrong with storing it in the member "field" of a sorted set?
For example: ZADD event:14:pictures <time_taken> "{id:3,url:/images/3.png,lat:22.8573}"
This saves the data I need and creates a rapidly-updatable list of the last X pictures for a given event, with the ability, if needed, to identify pictures that are more than X minutes old ...
First, I'm not sure if a Sorted Set can/should be used in this
associative manner? Can it reference other objects in Redis?
Why do you need to reference other objects? An event may have n image objects, each with a time_taken and image data; a sorted set is perfect for this. The event is the sorted set key (e.g. event:14:pictures), the score is time_taken, and the member is the image data as JSON/XML, whatever; you're good to go there.
Secondly, I need the ability to delete elements that are greater than
X minutes old
If you want to delete elements greater than X minutes old, use ZREMRANGEBYSCORE:
ZREMRANGEBYSCORE event:14:pictures -inf (currentTime - X minutes)
-inf is just another way of saying the oldest member, without having to know the oldest member's time; for the top of the range, though, you need to calculate it based on the current time before issuing this command (the above is just an example).
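Putting both pieces together, a minimal sketch with redis-py (the key layout follows the event:14:pictures example above; the 30-minute cutoff is an arbitrary stand-in for X):

import json
import time
import redis

r = redis.Redis()

def add_picture(event_id, pic, time_taken):
    key = "event:%d:pictures" % event_id
    # Member is the serialized picture data, score is the capture time.
    r.zadd(key, {json.dumps(pic): time_taken})
    # Trim members older than X minutes (here X = 30).
    r.zremrangebyscore(key, "-inf", time.time() - 30 * 60)

def latest_pictures(event_id, n=50):
    # Highest scores (newest) first, up to n members.
    key = "event:%d:pictures" % event_id
    return [json.loads(m) for m in r.zrevrange(key, 0, n - 1)]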

Represent Ordering in a Relational Database

I have a collection of objects in a database. Images in a photo gallery, products in a catalog, chapters in a book, etc. Each object is represented as a row. I want to be able to arbitrarily order these objects, storing that ordering in the database so that when I display them, they're in the right order.
For example, let's say I'm writing a book, and each chapter is an object. I write my book, and put the chapters in the following order:
Introduction, Accessibility, Form vs. Function, Errors, Consistency, Conclusion, Index
It goes to the editor, and comes back with the following suggested order:
Introduction, Form, Function, Accessibility, Consistency, Errors, Conclusion, Index
How can I store this ordering in the database in a robust, efficient way?
I've had the following ideas, but I'm not thrilled with any of them:
Array. Each row has an ordering ID; when the order is changed (via a removal followed by an insertion), the ordering IDs are updated. This makes retrieval easy, since it's just an ORDER BY, but it seems easy to break.
-- REMOVAL
UPDATE ... SET orderingID=NULL WHERE orderingID=removedID
UPDATE ... SET orderingID=orderingID-1 WHERE orderingID > removedID
-- INSERTION
UPDATE ... SET orderingID=orderingID+1 WHERE orderingID >= insertionID
UPDATE ... SET orderingID=insertionID WHERE ID=addedID
Linked list. Each row has a column for the id of the next row in the ordering. Traversal seems costly here, though there may be some way to use ORDER BY that I'm not thinking of.
Spaced array. Set the orderingID (as used in #1) to be large, so the first object is 100, the second is 200, etc. Then when an insertion happens, you just place it at (objectBefore + objectAfter)/2. Of course, this would need to be rebalanced occasionally, so you don't have things too close together (even with floats, you'd eventually run into rounding errors).
None of these seem particularly elegant to me. Does anyone have a better way to do it?
Another alternative would be (if your RDBMS supports it) to use columns of type array. While this breaks normalization rules, it can be useful in situations like this. One database I know of that has arrays is PostgreSQL.
The acts_as_list mixin in Rails handles this basically the way you outlined in #1. It looks for an INTEGER column called position (whose name you can override, of course) and uses that to do an ORDER BY. When you want to re-order things you update the positions. It has served me just fine every time I've used it.
As a side note, you can remove the need to always re-position on INSERTs/DELETEs by using sparse numbering -- kind of like BASIC back in the day... you can number your positions 10, 20, 30, etc. and if you need to insert something between 10 and 20 you just insert it with a position of 15. Likewise, when deleting you can just delete the row and leave the gap. You only need to re-number when you actually change the order, or if you try to do an insert and there is no appropriate gap to insert into.
Of course depending on your particular situation (e.g. whether you have the other rows already loaded into memory or not) it may or may not make sense to use the gap approach.
If the objects aren't heavily keyed by other tables, and the lists are short, deleting everything in the domain and just re-inserting the correct list is the easiest. But that's not practical if the lists are large and you have lots of constraints to slow down the delete. I think your first method is really the cleanest. If you run it in a transaction you can be sure nothing odd happens while you're in the middle of the update to screw up the order.
Just a thought considering option #1 vs #3: doesn't the spaced array option (#3) only postpone the problem of the normal array (#1)? Whatever algorithm you choose, either it's broken, and you'll run into problems with #3 later, or it works, and then #1 should work just as well.
I did this in my last project, but it was for a table that only occasionally needed to be specifically ordered, and wasn't accessed too often. I think the spaced array would be the best option, because reordering would be cheapest in the average case, just involving a change to one value and a query on two.
Also, I would imagine ORDER BY would be pretty heavily optimized by database vendors, so leveraging that function would be advantageous for performance as opposed to the linked list implementation.
Use a floating point number to represent the position of each item:
Item 1 -> 0.0
Item 2 -> 1.0
Item 3 -> 2.0
Item 4 -> 3.0
You can place any item between any other two items by simple bisection:
Item 1 -> 0.0
Item 4 -> 0.5
Item 2 -> 1.0
Item 3 -> 2.0
(Moved item 4 between items 1 and 2).
The bisection process can continue almost indefinitely due to the way floating point numbers are encoded in a computer system.
Item 4 -> 0.5
Item 1 -> 0.75
Item 2 -> 1.0
Item 3 -> 2.0
(Move item 1 to the position just after Item 4)
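In code, the bisection is a one-liner; a sketch in Python (the neighbor positions would come from the rows adjacent to the insertion point):

def position_between(before, after):
    # before is None when inserting at the head, after is None at the tail.
    if before is None and after is None:
        return 0.0
    if before is None:
        return after - 1.0
    if after is None:
        return before + 1.0
    return (before + after) / 2.0

# Moving item 4 between item 1 (0.0) and item 2 (1.0):
# position_between(0.0, 1.0) -> 0.5, matching the example above.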
Since I've mostly run into this with Django, I've found this solution to be the most workable. It seems that there isn't any "right way" to do this in a relational database.
I'd do a consecutive number, with a trigger on the table that "makes room" for a priority if it already exists.
I had this problem as well. I was under heavy time pressure (aren't we all) and I went with option #1, and only updated rows that changed.
If you swap item 1 with item 10, just do two updates to update the order numbers of item 1 and item 10. I know it is algorithmically simple, and it is O(n) worst case, but that worst case is when you have a total permutation of the list. How often is that going to happen? That's for you to answer.
I had the same issue and probably spent at least a week worrying about the proper data modeling, but I think I've finally got it. Using the array datatype in PostgreSQL, you can store the primary key of each ordered item and update that array accordingly using insertions or deletions when your order changes. Referencing a single row will allow you to map all your objects based on the ordering in the array column.
It's still a bit choppy of a solution but it will likely work better than option #1, since option 1 requires updating the order number of all the other rows when ordering changes.
Scheme #1 and Scheme #3 have the same complexity in every operation except INSERT writes. Scheme #1 has O(n) writes on INSERT and Scheme #3 has O(1) writes on INSERT.
For every other database operation, the complexity is the same.
Scheme #2 should not even be considered because its DELETE requires O(n) reads and writes. Scheme #1 and Scheme #3 have O(1) DELETE for both read and write.
New method
If your elements have a distinct parent element (i.e. they share a foreign key row), then you can try the following ...
Django offers a database-agnostic solution to storing lists of integers within CharField(). One drawback is that the max length of the stored string can't be greater than max_length, which is DB-dependent.
In terms of complexity, this would give Scheme #1 O(1) writes for INSERT, because the ordering information would be stored as a single field in the parent element's row.
Another drawback is that a JOIN to the parent row is now required to update ordering.
https://docs.djangoproject.com/en/dev/ref/validators/#django.core.validators.validate_comma_separated_integer_list
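A hedged sketch of that idea (the models are invented for illustration; the validator is the one linked above):

from django.core.validators import validate_comma_separated_integer_list
from django.db import models

class Book(models.Model):
    # Ordered child primary keys, e.g. "3,1,2". Reordering chapters rewrites
    # this single field instead of touching every Chapter row (O(1) writes).
    chapter_order = models.CharField(
        max_length=1000,  # caps how many children the list can hold
        validators=[validate_comma_separated_integer_list],
        blank=True,
    )

class Chapter(models.Model):
    title = models.CharField(max_length=200)
    book = models.ForeignKey(Book, on_delete=models.CASCADE)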