Which data type to use for ordinal? - sql

Whenever I have some records/objects that I want to be in a certain order, I usually create a field called Ordinal.
I often wonder if it would be better to use an integer or a decimal value for the ordinal field.
This is a consideration when moving an object to a different position in the order:
If you use consecutive integers, you have to do some serious reworking of all of the ordinals (or at least the ordinals that fall before the original position of the object being moved).
If you use integers but space them out (maybe at 1000 intervals), then you can just change the ordinal to a mid point value between the surrounding objects where you want to move the object. This could fail if somewhere down the line you end up with consecutive integers.
If you use decimal numbers, you could just find the average of the surrounding objects' ordinals and use that for the object to be moved.
Maybe it would be possible to use a string, but I could see that getting pretty goofy.
I'm sure there are other considerations I haven't thought of.
What do you use and why?

"This could fail if somewhere down the line you end up with consecutive integers."
For this (probably rare and thus not performance important) case, you could implement a renumber method that spaces out again. When I used to program in COMAL (anyone know that language?), you could do this very thing with line numbers.
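A minimal sketch of the spaced-integer idea with a renumber fallback, in Python. The gap of 1000 comes from the question; the function names (`midpoint`, `renumber`, `insert_between`) are made up for illustration, and a real implementation would issue UPDATE statements rather than rebuild a list.

```python
GAP = 1000  # assumed spacing between freshly numbered ordinals


def midpoint(before, after):
    """Return an integer strictly between before and after, or None if there is no room."""
    if after - before < 2:
        return None
    return (before + after) // 2


def renumber(ordinals):
    """Respace a sorted list of ordinals to GAP, 2*GAP, ... keeping their order."""
    return [GAP * (i + 1) for i in range(len(ordinals))]


def insert_between(ordinals, index):
    """Insert a new ordinal between ordinals[index] and ordinals[index + 1]."""
    new = midpoint(ordinals[index], ordinals[index + 1])
    if new is None:
        # The neighbours are consecutive integers: respace everything, then retry.
        ordinals = renumber(ordinals)
        new = midpoint(ordinals[index], ordinals[index + 1])
    return ordinals[: index + 1] + [new] + ordinals[index + 1:]


spaced = insert_between([1000, 2000, 3000], 0)
crowded = insert_between([1, 2, 3], 0)
```

The renumber path touches every row, but as the answer says, it should be rare enough not to matter.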

Decimals seem to solve your problem pretty well. Since decimals store base-10 digits, you actually have a lot of digits available. Unless you've actually run out of digits, or have a concrete reason to think you'll need an unlimited number of them, I'd let it ride.
If you really need an alternative and don't see a need to stick with a basic data type, you might go with tumbler arithmetic. The basic idea is that it's a place notation that is infinitely expandable at each position. Pretty simple conceptually.

I used to use a decimal type for a field of this kind to order records in a table, which we actually exposed to the customer so that they could set their own order. Although it sounds cheesy, our customers liked it; they found it very intuitive. They caught on very quickly that they could use numbers like 21.5 to move something between 21 and 22.
Maybe it's because they were accountants.
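The 21.5 trick is just averaging the two neighbours. A small sketch using Python's `decimal` module, which does the arithmetic in base 10; the helper name `between` is hypothetical. Python's default context keeps 28 significant digits, whereas a database DECIMAL column is limited by its declared precision and scale.

```python
from decimal import Decimal


def between(before, after):
    """Average two Decimal ordinals using base-10 arithmetic."""
    return (before + after) / 2


first = between(Decimal("21"), Decimal("22"))
second = between(Decimal("21"), first)
```

Each insertion into the same gap adds a digit, so the column's scale is what eventually limits how often this can be repeated.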

I use integers and just rearrange as necessary when a new item needs to be inserted in the middle of the order. Since you can create the necessary gap with a single update statement, it's fairly trivial. However, I've only ever done this on lookup tables of a few dozen rows at most; obviously this scales a bit poorly. But I would say that if you need a solution to this problem for a large number of rows, the process(es) for maintaining the order should be proceduralized anyway, which makes the choice of data type largely moot.
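For illustration, here is roughly what "create the gap with a single update statement" could look like, using Python's built-in `sqlite3`. The table name `lookup` and its columns are assumptions, not anything from the answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE lookup (name TEXT, ordinal INTEGER)")
conn.executemany(
    "INSERT INTO lookup VALUES (?, ?)",
    [("a", 1), ("b", 2), ("c", 3)],
)


def insert_at(conn, name, position):
    """Shift every row at or after position down by one, then insert the new row."""
    with conn:  # both statements commit together, or roll back together
        conn.execute(
            "UPDATE lookup SET ordinal = ordinal + 1 WHERE ordinal >= ?",
            (position,),
        )
        conn.execute("INSERT INTO lookup VALUES (?, ?)", (name, position))


insert_at(conn, "x", 2)
names = [row[0] for row in conn.execute("SELECT name FROM lookup ORDER BY ordinal")]
```

The UPDATE rewrites every row after the insertion point, which is why this only stays cheap on small tables.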

I remember this being a similar question to a previous post. It can be found here:
SQL Server Priority Ordering
The linked list would still work, but this is a much easier solution if you don't want to track a parent child relationship.
Sounds like what you want is a linked list. That way you always know what comes next and you don't have to guess. So the position field would be a pointer to the object following it.
The problem I have always had with using arbitrary numbers for position is that it can quickly fall to entropy. What if more items get added and the numbers become consecutive, etc.? It can quickly become unmanageable if the items in the list change position.
To implement this in sql server table, add another field with the same data type as the primary key. If the field is null then it is the bottom element in the list. If you are storing multiple lists in the same table you will probably want to add another field called ListID which designates all rows with the same ListID pertain to the same list. So something like this.
Table:
ID INT
ListID INT
Child INT

ID, ListID, Child
1, 1, 2      (parent row for first list)
2, 1, 3      (first child)
3, 1, NULL   (second child)
4, 2, 5      (parent row for second list)
5, 2, 6      (first child)
6, 2, NULL   (second child)
You'll probably have to do an insert and an update every time you add a row, which can be a little tedious, but it will always make the list line up.
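Reading such a list back in order means following the Child pointers from the head row. A sketch in plain Python over the sample rows above; treating the head as "the row nobody points to" is my assumption, and `ordered_ids` is a hypothetical helper.

```python
# (ID, ListID, Child) rows from the example table
rows = [
    (1, 1, 2), (2, 1, 3), (3, 1, None),
    (4, 2, 5), (5, 2, 6), (6, 2, None),
]


def ordered_ids(rows, list_id):
    """Return the IDs of one list in order by following the Child pointers."""
    next_of = {row_id: child for row_id, lid, child in rows if lid == list_id}
    pointed_to = set(next_of.values())
    # The head is the only row that no other row points to.
    heads = [row_id for row_id in next_of if row_id not in pointed_to]
    if len(heads) != 1:
        raise ValueError("expected exactly one head row per list")
    order = []
    current = heads[0]
    while current is not None:
        order.append(current)
        current = next_of[current]
    return order


first_list = ordered_ids(rows, 1)
second_list = ordered_ids(rows, 2)
```

Note that the ordering cannot be produced by a plain ORDER BY; the rows have to be walked, either in application code like this or with a recursive query where the database supports one.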

Is the "certain order" based on data outside of the table? If so, why not include it so you can do the sorting dynamically? If it's already in the table, adding a field is redundant.

Related

Is there a Postgres feature or built-in function that limits the display of uuids to only that needed to make them uniquely identifiable?

It would have to return the portion necessary to uniquely identify the row even if a select statement didn't return all rows, of course, to be of any use. And I'm not sure how it would work if the uuid column were not part of a pk/index and was repeated.
Does this exist?
I think you would have to decide what constitutes uniquely identifiable by assuming that a number of places from the right make it uniquely identifiable. I think this is folly but the way you would do that is something like this:
SELECT RIGHT(uuid_column_name::text, 7) as your_truncated_uuid FROM table_with_uuid_column;
That takes the 7 places from the right of the text value of the uuid column.
No, there is not. A UUID is a 128-bit value, normally shown as hex; in the v4 variant, 122 of those bits are random. It's not even guaranteed to be unique, though it likely is.
You have a few options to implement this:
shave off characters and hope you don't introduce a collision. For instance, if you make d8366842-8c1d-4a31-a4c0-f1765b8ab108 d8366842, you have 16**8 possible combinations, or 4,294,967,296. How likely is your dataset to have a collision with 4.3 billion (2**32) possibilities? Perhaps you can add 8c1d back in to make it 16**12, or 281,474,976,710,656, possibilities.
process and hash each row looking for collisions and recursively increase the frame of characters until no collisions are found, or hash every possible permutation.
That all said, another idea is to use ints and not uuids and then to use http://hashids.org/ which has a plugin for PostgreSQL. This is the method YouTube uses afaik.
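As a rough sketch of the "increase the frame of characters until no collisions are found" option, the following Python function finds the shortest hex prefix that tells a given set of UUIDs apart. The function name is made up, and the answer only holds for the rows you pass in: a row added later can collide with an existing prefix.

```python
def shortest_unique_prefix_length(values):
    """Return the smallest number of hex digits that distinguishes every UUID given."""
    texts = [str(v).replace("-", "") for v in values]
    if len(set(texts)) != len(texts):
        raise ValueError("duplicate values cannot be told apart by any prefix")
    for length in range(1, 33):
        # Grow the prefix until every truncated value is distinct.
        if len({t[:length] for t in texts}) == len(texts):
            return length
    return 32


sample = [
    "ab000000-0000-0000-0000-000000000000",
    "ac000000-0000-0000-0000-000000000000",
]
needed = shortest_unique_prefix_length(sample)
```

This scans the whole set once per candidate length, so it is only a starting point for large tables.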

Efficient management of hierarchyid values in MS SQL Server

With the hierarchyid datatype in SQL Server 2008 and onward, would there be any benefit to trying to optimize the issuing of the next child of /1/1/8/ [ /1/1/8/x/ ] such that x is the closest non-negative whole number to 1 possible?
An easy solution seems to be to find the maximum assigned child value and get the sibling to the right, but it seems like you'd eventually exhaust this (in theory if not in practice), since you're never reclaiming any of the values and, to my understanding, negatives and non-wholes consume more space.
EXAMPLE: If I've got a parent /1/1/8/ who has these children (and order of the children doesn't matter and reassignment of the values is ok):
/1/1/8/-400/
/1/1/8/1/
/1/1/8/4/
/1/1/8/40/
/1/1/8/18/
/1/1/8/9999999999/
wouldn't I want the next child to have /1/1/8/2/ ?
Here's the thing.
What you are saying will be "optimal" is not necessarily optimal.
When I am inserting values into a hierarchy, I generally do not care what the order is for the child nodes of a particular node.
If I do, that is why there are two parameters in GetDescendant.
If I want to prepend the node into the order (i.e. make it first), I use a first parameter of NULL and a second parameter that is the lowest value of the other children.
If I want to append the node into the order (i.e. make it last), I use a first parameter of the maximum value of the other children and a second parameter of NULL.
If I want to insert between two other child nodes, I need both the one that will be before and the one that will be after the node I am inserting.
In any case, generally the values in the hierarchy field don't really matter, because you will order by a different field like Name or something.
Ergo, the most "efficient" method of adding things into a hierarchy is to either prepend or append, since finding the MIN or MAX hierarchy value is easy, and doing what you are describing requires several queries to find the first "hole" in the tree.
In other words, don't put a lot of meaning onto the string representation of a hierarchy unless you are using them for an application in which you are using the hierarchy value to sort by.
Even in that case, you probably don't want to fill in hierarchy values as you describe, and probably want to append to the end anyway.
Hope this helped.

Redis Sorted Sets: How do I get the first intersecting element?

I have a number of large sorted sets (5m-25m) in Redis and I want to get the first element that appears in a combination of those sets.
e.g. I have 20 sets and want to take sets 1, 5, 7 and 12 and get only the first intersection of those sets.
It would seem that a ZINTERSTORE followed by a "ZRANGE foo 0 0" would be doing a lot more work than I require, as it would calculate all the intersections and then return the first one. Is there an alternative solution that does not need to calculate all the intersections?
There is no direct, native alternative, although I'd suggest this:
Create a hash whose fields are your elements. Upon each addition to one of your sorted sets, increment the relevant field (using HINCRBY). Of course, you'll make the increment only after you check that the element does not already exist in the sorted set you are attempting to add to.
That way, you can quickly know which elements appear in 4 sets.
UPDATE: Now that I think about it again, it might be too expensive to query your hash to find items with a value of 4 (O(n)). Another option would be creating another sorted set whose members are your elements and whose scores get incremented (as I described before, but using ZINCRBY); you can then quickly pull all elements with score 4 (using ZRANGEBYSCORE).
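To make the bookkeeping concrete, here is an in-memory Python stand-in for the counter idea; it does not talk to Redis, and the class and method names are invented for the sketch. One caveat worth seeing in code: the counter covers every set it tracks, so it answers "appears in n sets" rather than "appears in sets 1, 5, 7 and 12" specifically. Keeping one counter per combination of interest would be an extra assumption on top of the answer.

```python
from collections import defaultdict


class SetMembershipCounter:
    """In-memory stand-in for sorted sets plus a per-element membership counter."""

    def __init__(self):
        self.sets = defaultdict(dict)   # set name -> {member: score}
        self.counts = defaultdict(int)  # member -> number of sets containing it

    def add(self, set_name, member, score):
        # Only count the member the first time it enters this particular set.
        if member not in self.sets[set_name]:
            self.counts[member] += 1
        self.sets[set_name][member] = score

    def members_in(self, n):
        """Return the members that appear in exactly n of the tracked sets."""
        return sorted(m for m, c in self.counts.items() if c == n)


counter = SetMembershipCounter()
counter.add("s1", "a", 1)
counter.add("s2", "a", 2)
counter.add("s1", "b", 1)
counter.add("s1", "a", 5)  # score update only, not a new membership
```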

how to store an approximate number? (number is too small to be measured)

I have a table representing standards of alloys. The standard is partly based on the chemical composition of the alloys. The composition is presented in percentages. The percentage is determined by a chemical composition test. Sample data.
But sometimes, the lab cannot measure below a certain percentage. So they indicate that the element is present, but the percentage is less than they can measure.
I was unsure how to accurately store such a number in an SQL database. I thought to store the number with a negative sign. No element can have a negative composition, of course, but I can interpret this as less than the specified value. Another option is to add another column for each element!! The latter option I really don't like.
Any other ideas? It's a small issue if you think about it, but I think a crowd is always wiser. Somebody might have a neater solution.
Question updated:
Thanks for all the replies.
The test results come from different labs, so there is no common lower bound.
When the percentage of Titanium is less than 0.0004, for example, the number is still important; only the formula will differ slightly in this case.
Hence the value cannot be stored as NULL, and I don't know the lower bound for all values.
Tricky one.
Another possibility I thought of is to store it as a string. Any other ideas?
What you're talking about is a sentinel value. It's a common technique. Strings in most languages after all use 0 as a sentinel end-of-string value. You can do that. You just need to find a number that makes sense and isn't used for anything else. Many string functions will return -1 to indicate what you're looking for isn't there.
0 might work because if the element isn't there there shouldn't even be a record. You also face the problem that it might be mistaken for actually meaning 0. -1 is another option. It doesn't have that same problem obviously.
Another column to indicate if the amount is measurable or not is also a viable option. The case for this one becomes stronger if you need to store different categories of trace elements (eg <1%, <0.1%, <0.01%, etc). Storing the negative of those numbers seems a bit hacky to me.
You could just store it as NULL, meaning that the value exists but is undefined.
Any arithmetic operation with a NULL yields a NULL.
Division by NULL is safe.
NULL's are ignored by the aggregation functions, so queries like these:
SELECT metal, SUM(metal_percent), COUNT(metal_percent)
FROM alloys
GROUP BY metal
will give you the sum and the count of the actual, defined values, not taking the unfilled values into account.
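A quick way to see this behaviour is to run the query against a throwaway SQLite database from Python. The sample rows are invented; `COUNT(*)` is added alongside to show the difference between counting rows and counting non-NULL values.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE alloys (metal TEXT, metal_percent REAL)")
conn.executemany(
    "INSERT INTO alloys VALUES (?, ?)",
    [("Ti", 0.5), ("Ti", None), ("Ti", 1.5), ("Cu", None)],
)

# SUM and COUNT(column) skip NULLs; COUNT(*) counts every row.
result = conn.execute(
    "SELECT metal, SUM(metal_percent), COUNT(metal_percent), COUNT(*) "
    "FROM alloys GROUP BY metal ORDER BY metal"
).fetchall()
```

For the group that contains only NULLs, the sum itself comes back as NULL, so the calling code still has to handle that case.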
I would use a threshold value which is at least one significant digit smaller than your smallest expected value. This way you can logically say that any value less than, say, 0.01 can be presented to your application as a "trace" amount. This remains easy to understand and gives you flexibility in determining where your threshold should lie.
Since the constraints of the values are well defined (cannot have negative composition), I would go for the "negative value to indicate less-than" approach. As long as this use of sentinel values is sufficiently documented, it should be reasonably easy to implement and maintain.
An alternative but similar method would be to add 100 to the values, assuming that you can't get more than 100%. So <0.001 becomes 100.001.
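If you go the negative-sign route, it helps to keep the convention in one place. A small Python sketch, assuming a stored value of -0.0004 means "less than 0.0004"; the function names are hypothetical.

```python
def encode(value, less_than=False):
    """Store a 'less than value' result as the negative of the value."""
    if value <= 0:
        raise ValueError("a composition must be positive to be encoded this way")
    return -value if less_than else value


def decode(stored):
    """Return (value, less_than) for a stored number."""
    if stored < 0:
        return (-stored, True)
    return (stored, False)


def display(stored):
    """Format a stored number the way a lab report would show it."""
    value, less_than = decode(stored)
    return f"<{value}" if less_than else f"{value}"


label = display(encode(0.0004, less_than=True))
```

Any query that sums or averages the column has to decode first, which is the main maintenance cost of this approach.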
I would have a table modeling the certificate, in a one to many relation with another table, storing the values for elements. Then, I would still have the elements table containing the value in one column and a flag (less than) as a separate column.
Draft:
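create table CERTIFICATES
(
    PK_ID integer,           -- certificate key
    NAME varchar(128)
);
create table ELEMENTS
(
    ELEMENT_ID varchar(2),   -- chemical symbol, e.g. Ti
    CERTIFICATE_ID integer,  -- refers to CERTIFICATES.PK_ID
    CONCENTRATION number,
    MEASURABLE integer       -- flag; assumed here: 0 means CONCENTRATION is a "less than" bound
);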
Depending on the database engine you're using, the types of the columns may vary.
Why not add another column to store whether or not it's a trace amount?
This will allow you to save the amount that the trace is less than, too.
Since there is no common lowest threshold value and NULL is not acceptable, the cleanest solution now is to have a marker column which indicates whether there is a quantifiable amount or a trace amount present. A value of "Trace" would indicate to anybody reading the raw data that only a trace amount was present. A value of "Quantity" would indicate that you should check an amount column to find the actual quantity present.
I would have to warn against storing numerical values as strings. It will inevitably add additional pain, since you now lose the assertions a strong type definition gives you. When your application consumes the values in that column, it has to read the string to determine whether it's a sentinel value, a numeric value or simply some other string it can't interpret. Trying to handle data conversion errors at this point in your application is something I'm sure you don't want to be doing.
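A sketch of the marker-column layout in SQLite via Python. The table and column names are assumptions; the 'Quantity' and 'Trace' values come from the answer above, and for a trace row the amount column is assumed to hold the lab's detection limit.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE element_results (
        certificate_id INTEGER NOT NULL,
        element        TEXT    NOT NULL,
        amount         REAL    NOT NULL,
        kind           TEXT    NOT NULL CHECK (kind IN ('Quantity', 'Trace'))
    )
    """
)
conn.executemany(
    "INSERT INTO element_results VALUES (?, ?, ?, ?)",
    [(1, "Fe", 98.2, "Quantity"), (1, "Ti", 0.0004, "Trace")],
)


def describe(conn):
    """Render each result, prefixing trace amounts with '<'."""
    rows = conn.execute(
        "SELECT element, amount, kind FROM element_results ORDER BY element"
    )
    return [
        f"{element} <{amount}" if kind == "Trace" else f"{element} {amount}"
        for element, amount, kind in rows
    ]


report = describe(conn)
```

The amount stays numeric, so the strong typing the answer argues for is preserved, and the CHECK constraint keeps the marker to the two known values.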
Another field seems like the way to go; call it 'MinMeasurablePercent'.

Represent Ordering in a Relational Database

I have a collection of objects in a database. Images in a photo gallery, products in a catalog, chapters in a book, etc. Each object is represented as a row. I want to be able to arbitrarily order these images, storing that ordering in the database so when I display the objects, they will be in the right order.
For example, let's say I'm writing a book, and each chapter is an object. I write my book, and put the chapters in the following order:
Introduction, Accessibility, Form vs. Function, Errors, Consistency, Conclusion, Index
It goes to the editor, and comes back with the following suggested order:
Introduction, Form, Function, Accessibility, Consistency, Errors, Conclusion, Index
How can I store this ordering in the database in a robust, efficient way?
I've had the following ideas, but I'm not thrilled with any of them:
Array. Each row has an ordering ID, when order is changed (via a removal followed by an insertion), the order IDs are updated. This makes retrieval easy, since it's just ORDER BY, but it seems easy to break.
-- REMOVAL
UPDATE ... SET orderingID=NULL WHERE orderingID=removedID
UPDATE ... SET orderingID=orderingID-1 WHERE orderingID > removedID
-- INSERTION
UPDATE ... SET orderingID=orderingID+1 WHERE orderingID >= insertionID
UPDATE ... SET orderingID=insertionID WHERE ID=addedID
Linked list. Each row has a column for the id of the next row in the ordering. Traversal seems costly here, though there may be some way to use ORDER BY that I'm not thinking of.
Spaced array. Set the orderingID (as used in #1) to be large, so the first object is 100, the second is 200, etc. Then when an insertion happens, you just place it at (objectBefore + objectAfter)/2. Of course, this would need to be rebalanced occasionally, so you don't have things too close together (even with floats, you'd eventually run into rounding errors).
None of these seem particularly elegant to me. Does anyone have a better way to do it?
Another alternative would be (if your RDBMS supports it) to use columns of type array. While this breaks the normalization rules, it can be useful in situations like this. One database which I know about that has arrays is PostgreSQL.
The acts_as_list mixin in Rails handles this basically the way you outlined in #1. It looks for an INTEGER column called position (you can override the name, of course) and uses it to do an ORDER BY. When you want to re-order things you update the positions. It has served me just fine every time I've used it.
As a side note, you can remove the need to always do re-positioning on INSERTS/DELETES by using sparse numbering, kind of like BASIC back in the day: you can number your positions 10, 20, 30, etc., and if you need to insert something in between 10 and 20 you just insert it with a position of 15. Likewise, when deleting you can just delete the row and leave the gap. You only need to do re-numbering when you actually change the order or if you try to do an insert and there is no appropriate gap to insert into.
Of course depending on your particular situation (e.g. whether you have the other rows already loaded into memory or not) it may or may not make sense to use the gap approach.
If the objects aren't heavily keyed by other tables, and the lists are short, deleting everything in the domain and just re-inserting the correct list is the easiest. But that's not practical if the lists are large and you have lots of constraints to slow down the delete. I think your first method is really the cleanest. If you run it in a transaction you can be sure nothing odd happens while you're in the middle of the update to screw up the order.
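Here is a sketch of the first method (dense positions) with the remove-then-insert steps wrapped in a single transaction, using Python's `sqlite3` and the chapter list from the question. The table layout and the `move` helper are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE chapters (id INTEGER PRIMARY KEY, title TEXT, position INTEGER)"
)
titles = ["Introduction", "Accessibility", "Form vs. Function", "Errors",
          "Consistency", "Conclusion", "Index"]
conn.executemany(
    "INSERT INTO chapters (title, position) VALUES (?, ?)",
    [(title, i) for i, title in enumerate(titles, start=1)],
)


def move(conn, chapter_id, new_position):
    """Move one chapter to new_position, keeping positions dense (1..n)."""
    with conn:  # all three updates commit together or roll back together
        (old_position,) = conn.execute(
            "SELECT position FROM chapters WHERE id = ?", (chapter_id,)
        ).fetchone()
        # Close the gap left behind by the chapter being moved.
        conn.execute(
            "UPDATE chapters SET position = position - 1 WHERE position > ?",
            (old_position,),
        )
        # Open a gap at the destination, leaving the moved chapter alone.
        conn.execute(
            "UPDATE chapters SET position = position + 1 "
            "WHERE position >= ? AND id != ?",
            (new_position, chapter_id),
        )
        conn.execute(
            "UPDATE chapters SET position = ? WHERE id = ?",
            (new_position, chapter_id),
        )


move(conn, 2, 3)  # move "Accessibility" after "Form vs. Function"
order = [r[0] for r in conn.execute("SELECT title FROM chapters ORDER BY position")]
```

If any statement fails, the `with conn:` block rolls the whole move back, so readers never see a half-shifted list.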
Just a thought considering option #1 vs #3: doesn't the spaced array option (#3) only postpone the problem of the normal array (#1)? Whatever algorithm you choose, either it's broken, and you'll run into problems with #3 later, or it works, and then #1 should work just as well.
I did this in my last project, but it was for a table that only occasionally needed to be specifically ordered, and wasn't accessed too often. I think the spaced array would be the best option, because reordering would be cheapest in the average case, involving just a change to one value (and a query on two).
Also, I would imagine ORDER BY would be pretty heavily optimized by database vendors, so leveraging that function would be advantageous for performance as opposed to the linked list implementation.
Use a floating point number to represent the position of each item:
Item 1 -> 0.0
Item 2 -> 1.0
Item 3 -> 2.0
Item 4 -> 3.0
You can place any item between any other two items by simple bisection:
Item 1 -> 0.0
Item 4 -> 0.5
Item 2 -> 1.0
Item 3 -> 2.0
(Moved item 4 between items 1 and 2).
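The bisection process can continue for a long time because of the way floating point numbers are encoded in a computer system, but not indefinitely: repeatedly bisecting the same gap eventually produces a midpoint equal to one of its neighbours, at which point the positions need to be respaced.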
Item 4 -> 0.5
Item 1 -> 0.75
Item 2 -> 1.0
Item 3 -> 2.0
(Move item 1 to the position just after Item 4)
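To get a feel for how far the bisection can go, this Python sketch keeps inserting just after the lower neighbour until the midpoint stops being a new value. It assumes ordinary IEEE 754 double-precision floats, which is what Python's `float` uses on common platforms; the function name is made up.

```python
def bisections_until_collision(lo, hi):
    """Count how many times a new position fits between lo and a shrinking hi."""
    count = 0
    while True:
        mid = (lo + hi) / 2
        if mid == lo or mid == hi:
            # No distinct float is left between the two neighbours.
            return count
        hi = mid  # the next insert goes between lo and the item just placed
        count += 1


steps = bisections_until_collision(1.0, 2.0)
```

Always inserting into the same gap is the worst case; scattered inserts last much longer. Either way a rebalancing step is still needed eventually.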
Since I've mostly run into this with Django, I've found this solution to be the most workable. It seems that there isn't any "right way" to do this in a relational database.
I'd do a consecutive number, with a trigger on the table that "makes room" for a priority if it already exists.
I had this problem as well. I was under heavy time pressure (aren't we all) and I went with option #1, and only updated rows that changed.
If you swap item 1 with item 10, just do two updates to update the order numbers of item 1 and item 10. I know it is algorithmically simple, and it is O(n) worst case, but that worst case is when you have a total permutation of the list. How often is that going to happen? That's for you to answer.
I had the same issue and have probably spent at least a week concerning myself about the proper data modeling, but I think I've finally got it. Using the array datatype in PostgreSQL, you can store the primary key of each ordered item and update that array accordingly using insertions or deletions when your order changes. Referencing a single row will allow you to map all your objects based on the ordering in the array column.
It's still a bit choppy of a solution but it will likely work better than option #1, since option 1 requires updating the order number of all the other rows when ordering changes.
Scheme #1 and Scheme #3 have the same complexity in every operation except INSERT writes. Scheme #1 has O(n) writes on INSERT and Scheme #3 has O(1) writes on INSERT.
For every other database operation, the complexity is the same.
Scheme #2 should not even be considered because its DELETE requires O(n) reads and writes. Scheme #1 and Scheme #3 have O(1) DELETE for both read and write.
New method
If your elements have a distinct parent element (i.e. they share a foreign key row), then you can try the following ...
Django offers a database-agnostic solution to storing lists of integers within CharField(). One drawback is that the max length of the stored string can't be greater than max_length, which is DB-dependent.
In terms of complexity, this would give Scheme #1 O(1) writes for INSERT, because the ordering information would be stored as a single field in the parent element's row.
Another drawback is that a JOIN to the parent row is now required to update ordering.
https://docs.djangoproject.com/en/dev/ref/validators/#django.core.validators.validate_comma_separated_integer_list
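As a rough illustration of keeping the order in a single comma-separated field on the parent row, here is a plain-Python sketch. It does not use Django; the linked validator only checks the string format, and the helper names below are hypothetical.

```python
def parse_order(text):
    """Turn '3,1,2' into [3, 1, 2]; an empty string means an empty list."""
    return [int(part) for part in text.split(",")] if text else []


def format_order(ids):
    """Turn [3, 1, 2] back into '3,1,2' for storage."""
    return ",".join(str(i) for i in ids)


def move_id(text, item_id, new_index):
    """Move item_id to new_index and return the new string to store."""
    ids = parse_order(text)
    ids.remove(item_id)
    ids.insert(new_index, item_id)
    return format_order(ids)


stored = move_id("1,2,3,4", 4, 1)
```

Reordering becomes one write to the parent row, at the cost of losing referential integrity for the ids inside the string.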