Represent Ordering in a Relational Database - sql

I have a collection of objects in a database. Images in a photo gallery, products in a catalog, chapters in a book, etc. Each object is represented as a row. I want to be able to arbitrarily order these images, storing that ordering in the database so when I display the objects, they will be in the right order.
For example, let's say I'm writing a book, and each chapter is an object. I write my book, and put the chapters in the following order:
Introduction, Accessibility, Form vs. Function, Errors, Consistency, Conclusion, Index
It goes to the editor, and comes back with the following suggested order:
Introduction, Form, Function, Accessibility, Consistency, Errors, Conclusion, Index
How can I store this ordering in the database in a robust, efficient way?
I've had the following ideas, but I'm not thrilled with any of them:
Array. Each row has an ordering ID, when order is changed (via a removal followed by an insertion), the order IDs are updated. This makes retrieval easy, since it's just ORDER BY, but it seems easy to break.
// REMOVAL
UPDATE ... SET orderingID=NULL WHERE orderingID=removedID
UPDATE ... SET orderingID=orderingID-1 WHERE orderingID > removedID
// INSERTION
UPDATE ... SET orderingID=orderingID+1 WHERE orderingID > insertionID
UPDATE ... SET orderID=insertionID WHERE ID=addedID
Linked list. Each row has a column for the id of the next row in the ordering. Traversal seems costly here, though there may by some way to use ORDER BY that I'm not thinking of.
Spaced array. Set the orderingID (as used in #1) to be large, so the first object is 100, the second is 200, etc. Then when an insertion happens, you just place it at (objectBefore + objectAfter)/2. Of course, this would need to be rebalanced occasionally, so you don't have things too close together (even with floats, you'd eventually run into rounding errors).
None of these seem particularly elegant to me. Does anyone have a better way to do it?

An other alternative would be (if your RDBMS supports it) to use columns of type array. While this breaks the normalization rules, it can be useful in situations like this. One database which I know about that has arrays is PostgreSQL.

The acts_as_list mixin in Rails handles this basically the way you outlined in #1. It looks for an INTEGER column called position (of which you can override to name of course) and using that to do an ORDER BY. When you want to re-order things you update the positions. It has served me just fine every time I've used it.
As a side note, you can remove the need to always do re-positioning on INSERTS/DELETES by using sparse numbering -- kind of like basic back in the day... you can number your positions 10, 20, 30, etc. and if you need to insert something in between 10 and 20 you just insert it with a position of 15. Likewise when deleting you can just delete the row and leave the gap. You only need to do re-numbering when you actually change the order or if you try to do an insert and there is no appropriate gap to insert into.
Of course depending on your particular situation (e.g. whether you have the other rows already loaded into memory or not) it may or may not make sense to use the gap approach.

If the objects aren't heavily keyed by other tables, and the lists are short, deleting everything in the domain and just re-inserting the correct list is the easiest. But that's not practical if the lists are large and you have lots of constraints to slow down the delete. I think your first method is really the cleanest. If you run it in a transaction you can be sure nothing odd happens while you're in the middle of the update to screw up the order.

Just a thought considering option #1 vs #3: doesn't the spaced array option (#3) only postpone the problem of the normal array (#1)? Whatever algorithm you choose, either it's broken, and you'll run into problems with #3 later, or it works, and then #1 should work just as well.

I did this in my last project, but it was for a table that only occasionally needed to be specifically ordered, and wasn't accessed too often. I think the spaced array would be the best option, because it reordering would be cheapest in the average case, just involving a change to one value and a query on two).
Also, I would imagine ORDER BY would be pretty heavily optimized by database vendors, so leveraging that function would be advantageous for performance as opposed to the linked list implementation.

Use a floating point number to represent the position of each item:
Item 1 -> 0.0
Item 2 -> 1.0
Item 3 -> 2.0
Item 4 -> 3.0
You can place any item between any other two items by simple bisection:
Item 1 -> 0.0
Item 4 -> 0.5
Item 2 -> 1.0
Item 3 -> 2.0
(Moved item 4 between items 1 and 2).
The bisection process can continue almost indefinitely due to the way floating point numbers are encoded in a computer system.
Item 4 -> 0.5
Item 1 -> 0.75
Item 2 -> 1.0
Item 3 -> 2.0
(Move item 1 to the position just after Item 4)

Since I've mostly run into this with Django, I've found this solution to be the most workable. It seems that there isn't any "right way" to do this in a relational database.

I'd do a consecutive number, with a trigger on the table that "makes room" for a priority if it already exists.

I had this problem as well. I was under heavy time pressure (aren't we all) and I went with option #1, and only updated rows that changed.
If you swap item 1 with item 10, just do two updates to update the order numbers of item 1 and item 10. I know it is algorithmically simple, and it is O(n) worst case, but that worst case is when you have a total permutation of the list. How often is that going to happen? That's for you to answer.

I had the same issue and have probably spent at least a week concerning myself about the proper data modeling, but I think I've finally got it. Using the array datatype in PostgreSQL, you can store the primary key of each ordered item and update that array accordingly using insertions or deletions when your order changes. Referencing a single row will allow you to map all your objects based on the ordering in the array column.
It's still a bit choppy of a solution but it will likely work better than option #1, since option 1 requires updating the order number of all the other rows when ordering changes.

Scheme #1 and Scheme #3 have the same complexity in every operation except INSERT writes. Scheme #1 has O(n) writes on INSERT and Scheme #3 has O(1) writes on INSERT.
For every other database operation, the complexity is the same.
Scheme #2 should not even be considered because its DELETE requires O(n) reads and writes. Scheme #1 and Scheme #3 have O(1) DELETE for both read and write.
New method
If your elements have a distinct parent element (i.e. they share a foreign key row), then you can try the following ...
Django offers a database-agnostic solution to storing lists of integers within CharField(). One drawback is that the max length of the stored string can't be greater than max_length, which is DB-dependent.
In terms of complexity, this would give Scheme #1 O(1) writes for INSERT, because the ordering information would be stored as a single field in the parent element's row.
Another drawback is that a JOIN to the parent row is now required to update ordering.
https://docs.djangoproject.com/en/dev/ref/validators/#django.core.validators.validate_comma_separated_integer_list

Related

Infinite scroll algorithm for random items with different weight ( probability to show to the user )

I have a web / mobile application that should display an infinite scroll view (the continuation of the list of items is loaded periodically in a dynamic way) with items where each of the items have a weight, the bigger is the weight in comparison to the weights of other items the higher should be the chances/probability to load the item and display it in the list for the users, the items should be loaded randomly, just the chances for the items to be in the list should be different.
I am searching for an efficient algorithm / solution or at least hints that would help me achieve that.
Some points worth to mention:
the weight has those boundaries: 0 <= w < infinite.
the weight is not a static value, it can change over time based on some item properties.
every item with a weight higher than 0 should have a chance to be displayed to the user even if the weight is significantly lower than the weight of other items.
when the users scrolls and performs multiple requests to API, he/she should not see duplicate items or at least the chance should be low.
I use a SQL Database (PostgreSQL) for storing items so the solution should be efficient for this type of database. (It shouldn't be a purely SQL solution)
Hope I didn't miss anything important. Let me know if I did.
The following are some ideas to implement the solution:
The database table should have a column where each entry is a number generated as follows:
log(R) / W,
where—
W is the record's weight greater than 0 (itself its own column), and
R is a per-record uniform random number in (0, 1)
(see also Arratia, R., "On the amount of dependence in the prime factorization of a uniform random integer", 2002). Then take the records with the highest values of that column as the need arises.
However, note that SQL has no standard way to generate random numbers; DBMSs that implement SQL have their own ways to do so (such as RANDOM() for PostgreSQL), but how they work depends on the DBMS (for example, compare MySQL's RAND() with T-SQL's NEWID()).
Peter O had a good idea, but had some issues. I would expand it a bit in favor of being able to shuffle a little better as far as being user-specific, at a higher database space cost:
Use a single column, but store in multiple fields. Recommend you use the Postgres JSONB type (which stores it as json which can be indexed and queried). Use several fields where the log(R) / W. I would say roughly log(U) + log(P) where U is the number of users and P is the number of items with a minimum of probably 5 columns. Add an index over all the fields within the JSONB. Add more fields as the number of users/items get's high enough.
Have a background process that is regularly rotating the numbers in #1. This can cause duplication, but if you are only rotating a small subset of the items at a time (such as O(sqrt(P)) of them), the odds of the user noticing are low. Especially if you are actually querying for data backwards and forwards and stitch/dedup the data together before displaying the next row(s). Careful use of manual pagination adjustments helps a lot here if it's an issue.
Before displaying items, randomly pick one of the index fields and sort the data on that. This means you have a 1 in log(P) + log(U) chance of displaying the same data to the user. Ideally the user would pick a random subset of those index fields (to avoid seeing the same order twice) and use that as the order, but can't think of a way to make that work and be practical. Though a random shuffle of the index and sorting by that might be practical if the randomized weights are normalized, such that the sort order matters.

Natural way of indexing elements in Flink

Is there a built-in way to index and access indices of individual elements of DataStream/DataSet collection?
Like in typical Java collections, where you know that e.g. a 3rd element of an ArrayList can be obtained by ArrayList.get(2) and vice versa ArrayList.indexOf(elem) gives us the index of (the first occurence of) the specified element. (I'm not asking about extracting elements out of the stream.)
More specifically, when joining DataStreams/DataSets, is there a "natural"/easy way to join elements that came (were created) first, second, etc.?
I know there is a zipWithIndex transformation that assigns sequential indices to elements. I suspect the indices always start with 0? But I also suspect that they aren't necessarily assigned in the order the elements were created in (i.e. by their Event Time). (It also exists only for DataSets.)
This is what I currently tried:
DataSet<Tuple2<Long, Double>> tempsJoIndexed = DataSetUtils.zipWithIndex(tempsJo);
DataSet<Tuple2<Long, Double>> predsLinJoIndexed = DataSetUtils.zipWithIndex(predsLinJo);
DataSet<Tuple3<Double, Double, Double>> joinedTempsJo = tempsJoIndexed
.join(predsLinJoIndexed).where(0).equalTo(0)...
And it seems to create wrong pairs.
I see some possible approaches, but they're either non-Flink or not very nice:
I could of course assign an index to each element upon the stream's
creation and have e.g. a stream of Tuples.
Work with event-time timestamps. (I suspect there isn't a way to key by timestamps, and even if there was, it wouldn't be useful for
joining multiple streams like this unless the timestamps are
actually assigned as indices.)
We could try "collecting" the stream first but then we wouldn't be using Flink anymore.
The 1. approach seems like the most viable one, but it also seems redundant given that the stream should by definition be a sequential collection and as such, the elements should have a sense of orderliness (e.g. `I'm the 36th element because 35 elements already came before me.`).
I think you're going to have to assign index values to elements, so that you can partition the data sets by this index, and thus ensure that two records which need to be joined are being processed by the same sub-task. Once you've done that, a simple groupBy(index) and reduce() would work.
But assigning increasing ids without gaps isn't trivial, if you want to be reading your source data with parallelism > 1. In that case I'd create a RichMapFunction that uses the runtimeContext sub-task id and number of sub-tasks to calculate non-overlapping and monotonic indexes.

Cocoa Scripting: Deletion of elements in a loop getting out of sync

While adding scriptability to my Mac program, I am struggling with the common programming problem of deleting items from an indexed array where the item indexes shift due to removal of items.
Let's say my app maintains a data store in which objects of type "Person" are stored. In the sdef, I've define the Cocoa Key allPersons to access these elements. My app declares an NSArray *allPersons.
That far, it works well. E.g, this script works well:
repeat with p in every person
get name of p
end repeat
The problem starts when I want to support deletion of items, like this:
repeat with p in (get every person)
delete p
end repeat
(I realize that I could just write "delete every person", which works fine, but I want to show how "repeat" makes things more complicated).
This does not work because AppleScript keep using the original item numbers to reference the items even after deleting some of them, which naturally shifts the items and their numbering.
So, considering we have 3 Persons, "Adam", "Bonny" and "Clyde", this will happen:
get every person
--> {person 1, person 2, person 3}
delete person 1
delete person 2
delete person 3
--> error number -1719 from person 3
After deleting item 1 (Adam), the other items get renumbered to item 1 and 2. The second iteration deletes item 2 (which is now Clyde), and the third iteration attempts to delete item 3, which doesn't exist any more at that point.
How do I solve this?
Can I force the scripting engine to not address the items by their index number but instead by their unique ID so that this won't happen?
It's not your ObjC code, it's your misunderstanding of how repeat with VAR in EXPR loops work. (Not really your fault either: they're 1. counterintuitive, and 2. poorly explained.) When it first encounters your repeat statement, AppleScript sends your app a count event to get the number of items specified by EXPR, which in this case is an object specifier (query) that identifies all of the person elements in whatever. It then uses that information to generate its own sequence of by-index object specifiers, counting from 1 up to the result of the aforementioned count:
person 1 of whatever
person 2 of whatever
...
person N of whatever
What you need to realize is that an object specifier is a first-class query, not an object pointer (not that Apple tell you this either): it describes a request, not an object. Ignore the purloined jargon: Apple event IPC's nearest living relatives are RDBMSes, not Cocoa or SOAP or any of the OO messaging crud that modern developers so fixate on as The One True Way To Do... well, EVERYTHING.
It's only when that query is sent to your application in an Apple event that it's evaluated against the relational graph your Apple event IPC View-Controller – aka "Apple Event Object Model" – presents as an idealized, user-friendly representation of your Model's user date that it actually resolves to a specific Model object, or objects, with which the event handler should perform the requested operation.
Thus, when the delete command in your repeat loop tells your app to delete person 1 of whatever, all your remaining elements move down by one. But on the next iteration the repeat loop still generates the object specifier person 2 of whatever, which your script then sends off to your app, which resolves it to the second item in the collection – which was originally the third item, of course, until you shifted them all about.
Or, to borrow a phrase:
Nothing in AppleScript makes sense except in light of relational queries.
..
In fact, Apple events' query-based approach it actually makes a lot of sense considering it was originally designed to be efficient over very high-latency connections (i.e. System 7's abysmally inefficient process switcher), allowing a single Apple event carrying one or more complex queries to manipulate many objects at once. It's even quite elegant [when it works right], but is quite undone by idiots at Cupertino who think the best way to make programmers not hate the technology is to lie even harder about how it actually works.
So here, I suggest you go read this, which is not the best explanation either but still a damn sight better than anything you'll get from those muppets. And this, by its original designer that explains a lot of the rationale for creating a high-level coarse-grained query-based IPC system instead of the usual low-level fine-grained OO message passing crap.
Oh, and once you've done that, you might want to consider try running this instead:
delete every person whose name is "bob"
which is pretty much the whole point of creating a thick declarative-y abstraction that does all the work so the user doesn't have to.
And when nothing but an imperative client-side loop will do, you either want to get a list of by-ID object specifiers (which are the closest things to safe, persistent pointers that AEOMs can do) from the app first and then iterate over that, or at least use your own iterator loop that counts over elements in reverse:
repeat with i from (count every person) to 1 by -1
tell person i
..
end tell
end repeat
so that, assuming it's iterating over an ordered array on the server side, will delete from last to first, and so avoid the embarrassing off-by-N errors of your original script.
HTH
re: "If you want your scripable elements to be deletable, make sure you use NSUniqueIDSpecifiers to identify them."
Yes, Apple recommends using formUniqueId or formName for object specifiers, but you can't always do that. For instance, in the Text Suite, you really only have indexing to work with; e.g. character 1, word 3, paragraph 7, etc. You don't have unique IDs for text elements. In addition to deletion, ordering can be affected by other Standard Suite commands: open, close, duplicate, make, and move.
The app implementer is a programmer, but so is the scripter. So it is reasonable to expect the scripter to solve some problems themselves. For instance, if the app has 5 persons, and the scripter wants to delete persons 2 and 4, they can easily do so even with indexed deletion:
delete person 4
delete person 2
Deleting from the end of an ordered list forward solves the problem. AS also supports negative indexes, which can be used for the same purpose:
delete person -2
delete person -4
The key to solving this lies in implementing the objectSpecifier method correctly so that it does return an NSUniqueIDSpecifier.
My code did so far only return an index specifier and that was wrong for this purpose. I guess that had I posted my code (which is, unfortunately, too complex for that), someone may have noticed my mistake.
So, I guess the rule is: If you want your scripable elements to be deletable, make sure you use NSUniqueIDSpecifiers to identify them. For read-only element arrays, using an NSIndexSpecifier is (probably) safe, though, if your element array has persistent ordering behavior.
Update
As #foo points out, it's also important that the repeat command fetches the references to the items by using … in (get every person) and not just … in every person, because only the former leads to addressing the items by their id whereas the latter keeps indexing them as item N.

Bind Top 5 Values of a To-Many Core Data Relationship to Text Fields

I am making an application that represents a cell phone bill using Core Data. I have three entities: Bill, Line, and Calls. Bills can have many lines, and lines can have many calls. All of this is set up with relationships. Right now, I have a table view that displays all of the bills. When you double click on a bill, a sheet comes down with a popup box that lists all of the lines on the bill. Below the popup box is a box that has many labels that display various information about that line. Below that information I want to list the top 5 numbers called by that line in that month. Lines has a to-many relationship with Calls, which has two fields, number and minutes. I have all of the calls for the selected line loaded into an NSArrayController with a sort descriptor that properly arranges the values. How do I populate 5 labels with the top 5 values of this array controller?
EDIT: The array of calls is already unique, when I gather the data, I combine all the individual calls into total minutes per number for each month. I just need to sort and display the first 5 records of this combined array.
I may be wrong (and really hope I am), but it looks like you'll need to use brute force on this one. There are no set / array operators that can help, nor does NSPredicate appear to help.
I think this is actually a bit tricky and it looks like you'll have to do some coding. The Core Data Programming Guide says:
If you execute a fetch directly, you
should typically not add
Objective-C-based predicates or sort
descriptors to the fetch request.
Instead you should apply these to the
results of the fetch. If you use an
array controller, you may need to
subclass NSArrayController so you can
have it not pass the sort descriptors
to the persistent store and instead do
the sorting after your data has been
fetched.
I think this applies to your case because it's important to consider whether sorting or filtering takes place first in a fetch request (when the fetch requests predicate and sort descriptors are set). This is because you'll be tempted to use the #distinctUnionOfObjects set/array operator. If the list is collapsed to uniques before sorting, it won't help. If it's applied after sorting, you can just set the fetch request's limit to 5 and there're your results.
Given the documentation, I don't know that this is how it will work. Also, in this case, it might be easier to avoid NSArrayController for this particular problem and just use NSTableViewDataSource protocol, but that's beyond the scope of this Q&A.
So, here's one way to do it:
Create a predicate to filter for the
selected bill's line items.*
Create a sort descriptor to sort the
line items by their telephone number
(which are hopefully in a
standardized format internally, else
trouble awaits) via #"call.number" in your case.
Create a fetch request for the line
item entity, with the predicate and
sort descriptors then execute it**.
With those sorted results, it would be nice if you could collapse and "unique" them easily, and again, you'll be tempted to use #distinctUnionOfObjects. Unfortunately, set/array operators won't be any help here (you can't use them directly on NSArray/NSMutableArray or NSSet/NSMutableSet instances). Brute force it is, then.
I'd create a topFive array and loop through the results, adding the number to topFive if it's not there already, until topFive has 5 items or until I'm out of results.
Displaying it in your UI (using Bindings or not) is, as I said, beyond the scope of this Q&A, so I'll leave it there. I'd LOVE to hear if there's a better way to do this - it's definitely one of those "looks like it should be easy but it's not" kind of things. :-)
*You could also use #unionOfObjects in your key path during the fetch itself to get the numbers of the calls of the line items of the selected bill, which would probably be more efficient a fetch, but I'm getting tired of typing, and you get the idea. ;-)
**In practice I'd probably limit the fetch request to something reasonable - some bills (especially for businesses and teenagers) can be quite large.

Which data type to use for ordinal?

Whenever I have some records/objects that I want to be in a certain order, I usually create a field called Ordinal.
I often wonder if it would be better to use an integer or a decimal value for the ordinal field.
This is a consideration when moving an object to a different position in the order:
If you use consecutive integers, you have to do some serious reworking of all of the ordinals (or at least the ordinals that fall before the original position of the object being moved).
If you use integers but space them out (maybe at 1000 intervals), then you can just change the ordinal to a mid point value between the surrounding objects where you want to move the object. This could fail if somewhere down the line you end up with consecutive integers.
If you use decimal numbers you could just find the average of the surround object's ordinals and use that for the object to be moved.
Maybe it would be possible to use a string, but I could see that getting pretty goofy.
I'm sure there are other considerations I haven't thought of.
What do you use and why?
"This could fail if somewhere down the line you end up with consecutive integers."
For this (probably rare and thus not performance important) case, you could implement a renumber method that spaces out again. When I used to program in COMAL (anyone know that language?), you could do this very thing with line numbers.
Decimals seem to solve your problem pretty well. Since Decimals are just base 10 floats, you actually have a lot of digits available. Unless you've seen cases where you've gotten out to quite a few digits and had reason to suspect a reason for an unlimited number of digits being necessary, I'd let it ride.
If you really need an alternative and don't see a need to stick with a basic data bype, you might go with tumbler arithmetic. The basic idea is that it's a place notation that is infinitely expandable at each position. Pretty simple conceptually.
I used to use a decimal type for a field of this kind to order records in a table, which we actually exposed to the customer so that they could set their own order. Although it sounds cheesy our customers liked it; they found it very intuitive. They caught on very quickly that they could use numbers like 21.5 to move something between 21 and 22.
Maybe it's because they were accountants.
I use integers and just rearrange as necessary when a new item needs to be inserted in the middle of the order. Since you can create the necessary gap with a single update statement, it's fairly trivial. However, I've only ever done this on lookup tables of a few dozen rows at most, obviously this scales a bit poorly. But I would say that if you need a solution to this problem for a large number of rows, the process(es) for maintaining the order should be proceduralized anyway, which makes the choice of data type largely moot.
I remember this being a similar question to a previous post. It can be found here:
SQL Server Priority Ordering
The linked list would still work, but this is a much easier solution if you don't want to track a parent child relationship.
Sounds like what you want is a linked list. That way you always know what comes next and you don't have to guess. So the position field would be a pointer to the object following it.
The problem I have always had with using arbitrary numbers for position, is that it can quickly fall to entropy. What if more items get added and the number become consecutive etc. etc. It can quickly become unmanageable if the list of items changes position.
To implement this in sql server table, add another field with the same data type as the primary key. If the field is null then it is the bottom element in the list. If you are storing multiple lists in the same table you will probably want to add another field called ListID which designates all rows with the same ListID pertain to the same list. So something like this.
Table:
ID INT
ListID INT
Child INT
Pararent Row For first list:
1, 1, 2
First Child
2, 1, 3
Second Child
3, 1, NULL
Parent Row for second list:
4, 2, 5
First Child
5, 2, 6
Second Child
6, 2, NULL
You'll probably have to do an insert and an update every time you add a row, which can be a little tedious, but it will always make the list line up.
Is the "certain order" based on data outside of the table? If so, why not include it so you can do the sorting dynamically? If it's already in the table, adding a field is redundant.