When would you ever set the increment value on a database identity field? - sql

Given the table:
CREATE TABLE Table1
(
    UniqueID int IDENTITY(1,1)
    ...etc
)
Now why would you ever set the increment to something other than 1?
I can understand setting the initial seed value differently. For example, say you're creating one database table per month of data (e.g. Table1_082009, Table1_092009) and want to start the UniqueID of the new table where the old one left off. (I probably wouldn't use that strategy myself, but hey, I can see people doing it.)
But for the increment? I can only imagine it being of any use in really odd situations, for example:
After the initial data is inserted, maybe someone will later want to turn IDENTITY_INSERT on and insert new rows in the gaps, but will want the rows close together for efficient index lookups?
If you're looking up IDs directly from a URL and want to make it harder for people to arbitrarily access other items? For example, instead of the user being able to work out that changing the URL suffix from /GetData?id=1000 to /GetData?id=1001 yields the next item, you set an increment of 437 so that the next URL is actually /GetData?id=1437. Of course, if this is your "security" then you're probably already in trouble...
I can't think of anything else. Has anyone used an increment that wasn't 1, and why? I'm really just curious.

One idea might be using this to facilitate partitioning of data (though there might be more "automated" ways to do that):
Considering you have two servers:
On one server, you start at 1 and increment by 2.
On the other server, you start at 2 and increment by 2.
Then, from your application, you send half of your inserts to one server and the other half to the second server, as a kind of software load-balancing.
This way, you still have the ability to identify your entries: the "UniqueID" is still unique, even if the data is split across two servers / tables.
But that's only a wild idea -- there are probably other uses for it...
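A minimal T-SQL sketch of that odd/even scheme (the table and column names here are made up for illustration):

-- Server A: generates odd IDs (1, 3, 5, ...)
CREATE TABLE Orders
(
    UniqueID int IDENTITY(1, 2) PRIMARY KEY,
    Payload nvarchar(100)
);

-- Server B: generates even IDs (2, 4, 6, ...)
CREATE TABLE Orders
(
    UniqueID int IDENTITY(2, 2) PRIMARY KEY,
    Payload nvarchar(100)
);

The application then alternates which server receives each INSERT, and the two UniqueID ranges never collide.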

Once, for pure fun (oh yeah, we have a wild side to us), we decided to use a negative increment. It was strange to see the numbers grow in size and shrink in value at the same time.
I could hardly sit still in my chair.
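For anyone who wants to share the fun, a minimal sketch (the table and column names are made up):

-- Counts down: -1, -2, -3, ... growing in digits, shrinking in value
CREATE TABLE Countdown
(
    ID int IDENTITY(-1, -1) PRIMARY KEY,
    Note varchar(50)
);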
edit (afterthought):
Do you think the creator of IDENTITY was in love with FOR loops? You know...
for (i = 0; i <= 99; i += 17)
or, for those non-semicolon folks out there,
For i = 0 To 100 Step 17

Only for entertainment. And you have to be REALLY bored.

Related

sdiff - limit the result set to X items

I want to get the diff of two sets in redis, but I don't need to return the entire array, just 10 items for example. Is there any way to limit the results?
I was thinking something like this:
SDIFF set1 set2 LIMIT 10
If not, are there any other options to achieve this in a performant way, considering that set1 can be millions of objects and set2 is much, much smaller (hundreds)?
More info on what you want to achieve would be helpful. Something like this might require you to duplicate your data, though I don't know if that's something you want.
One option is chunking them:
1. Create a set with a uniquely generated id that can hold a max of 10 items.
2. Create a sorted set like so: zadd(key, timestamp, chunkid), where the timestamp is a Unix time and the chunkid is the key that connects to the set from step 1. The key can be whatever name you would like it to be, or another uniquely generated id.
3. Use ZRANGE to grab a specific one.
4. Repeat steps 1-3 for the second set.
Once you have one result from each of your sorted sets ("zset"), you can do your SDIFF using the chunkids.
Note that there are advantages and disadvantages to doing this, like more connection consumption (if calling from a client) and, obviously, a little more processing. It will help immensely if you put this in a Lua script, though.
Hope this helps, or at least gives you an idea of how to model your data. If this is critical data, you might need an automated script of some sort to move your data around to meet the modeling requirement.
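A rough redis-cli sketch of that chunking bookkeeping (all key names here are made up for illustration):

SADD chunk:a1 "item1" "item2" "item3"      # a chunk set; the app caps it at 10 members
ZADD myset:chunks 1712000000 chunk:a1      # index the chunk id by Unix timestamp
ZRANGE myset:chunks 0 0                    # grab a specific chunk id from the sorted set
SDIFF chunk:a1 chunk:b7                    # diff two small chunks instead of the full sets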

DynamoDB: Have sequencing within Items

I am developing forums on DynamoDB.
There is a table posts which contains all the posts in a thread.
I need to have a notion of sequence in the posts, i.e. I need to know which post came first and which came later.
My service would be running in a distributed env.
I am not sure if using a timestamp is the best solution for deciding the sequence, as the hosts might have slightly different times and might be off by milliseconds/seconds.
Is there another way to do this?
Can I get DynamoDB to populate the date so it is consistent?
Or is there a sequence generator that I can use in a distributed env?
You can't use DynamoDB to auto-populate dates. You can use other services to provide you with auto-generated numbers, or use DynamoDB's atomic increment to create your own IDs.
This can become a bottleneck if your forum is very successful (needs lots of numbers per second). I think you should start with a timestamp and later add complexity to your ID generation (concatenate timestamp+uuid or timestamp+atomiccounter).
It is always best practice to sync your servers' clocks (ntpd).
Use a dedicated sequence table. If you have only one sequence (say, PostId), then there's going to be only one row with two attributes in the table.
Yes, there's extra cost and effort in managing another table, but this is the best solution I know of by far, and I haven't seen anyone else mention it.
The table should have a key attribute as primary partition key, and a numeric value attribute with initial value of 1 (or whatever you want the initial value to be).
Every time you want to get the next available key, you tell DynamoDB to do this:
Increment the value where key = PostId by 1, and return the value before incrementing.
Note that this is a single atomic operation. DynamoDB handles the incrementing, so there are no concurrency issues.
In code, there's more than one way of implementing this. Here's one example:
import java.util.HashMap;
import java.util.Map;
import com.amazonaws.services.dynamodbv2.model.*;

// Assumes the Sequences row {key: "PostId", value: <initial value>} was
// created up front, and dynamoDBClient is an AmazonDynamoDB client.
Map<String, AttributeValue> key = new HashMap<>();
key.put("key", new AttributeValue("PostId"));

Map<String, AttributeValueUpdate> item = new HashMap<>();
item.put("value", new AttributeValueUpdate()
        .withAction(AttributeAction.ADD)
        .withValue(new AttributeValue().withN("1")));  // DynamoDB numbers are passed as strings

UpdateItemRequest request = new UpdateItemRequest("Sequences", key, item)
        .withReturnValues(ReturnValue.ALL_OLD);
UpdateItemResult result = dynamoDBClient.updateItem(request);

// This is the sequential ID you want to assign to your post.
Integer postId = Integer.parseInt(result.getAttributes().get("value").getN());
Another variation of Chen's suggestion is to have strict ordering of posts within a given forum thread, as opposed to globally across all threads. One way to do this is to have a Reply table with a hash key of ThreadId and a range key of ReplyId. The ReplyId would be a Number attribute starting at 0. Every time someone replies, your app does a Query on the Reply table for the single most recent reply on that thread (ScanIndexForward: false, Limit: 1, and the ThreadId). To insert your new reply, use the ReplyId returned by the Query, plus 1. Then use PutItem with a conditional write, so that if someone else replies at the same time, an error is returned and your app can start again with the Query.
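A rough Java sketch of that query-then-conditional-put flow (the attribute names follow the description above; the thread id, reply body, and client setup are placeholders):

import java.util.HashMap;
import java.util.List;
import java.util.Map;
import com.amazonaws.services.dynamodbv2.model.*;

// Find the most recent reply in the thread.
// dynamoDBClient is an AmazonDynamoDB client, as in the example above.
Map<String, AttributeValue> values = new HashMap<>();
values.put(":tid", new AttributeValue("thread-123"));
QueryRequest query = new QueryRequest("Reply")
        .withKeyConditionExpression("ThreadId = :tid")
        .withExpressionAttributeValues(values)
        .withScanIndexForward(false)  // newest first
        .withLimit(1);
List<Map<String, AttributeValue>> items = dynamoDBClient.query(query).getItems();
long nextReplyId = items.isEmpty()
        ? 0
        : Long.parseLong(items.get(0).get("ReplyId").getN()) + 1;

// Write the reply; fail if another writer took this ReplyId first.
Map<String, AttributeValue> reply = new HashMap<>();
reply.put("ThreadId", new AttributeValue("thread-123"));
reply.put("ReplyId", new AttributeValue().withN(Long.toString(nextReplyId)));
reply.put("Body", new AttributeValue("Hello"));
PutItemRequest put = new PutItemRequest("Reply", reply)
        .withConditionExpression("attribute_not_exists(ReplyId)");
dynamoDBClient.putItem(put);  // on ConditionalCheckFailedException, re-query and retry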
If you want the simplest initial solution possible, then the timestamp+uuid concatenation Chen suggests is the simplest approach. A global atomic counter item will be a scaling bottleneck, as Chen mentions, and based on what you've described, a global sequence number isn't required for your app.

How predictable is NEWSEQUENTIALID?

According to Microsoft's documentation on NEWSEQUENTIALID, the output of NEWSEQUENTIALID is predictable. But how predictable is predictable? Say I have a GUID that was generated by NEWSEQUENTIALID, how hard would it be to:
Calculate the next value?
Calculate the previous value?
Calculate the first value?
Calculate the first value, even without knowing any GUIDs at all?
Calculate the number of rows? E.g. when using integers, /order?id=842 tells me that there are 842 orders in the application.
Below is some background information about what I am doing and what the various tradeoffs are.
One of the security benefits of using GUIDs over integers as primary keys is that GUIDs are hard to guess. E.g. if a hacker sees a URL like /user?id=845, he might try to access /user?id=0, since it is probable that the first user in the database is an administrative user. Moreover, a hacker can iterate over /user?id=0..1..2 to quickly gather all users.
Similarly, a privacy downside of integers is that they leak information. /order?id=482 tells me that the web shop has had 482 orders since its implementation.
Unfortunately, using GUIDs as primary keys has well-known performance downsides. To this end, SQL Server introduced the NEWSEQUENTIALID function. In this question, I would like to learn how predictable the output of NEWSEQUENTIALID is.
The underlying OS function is UuidCreateSequential. The value is derived from one of your network cards' MAC addresses and a per-OS-boot incremental value. See RFC 4122. SQL Server does some byte-shuffling to make the result sort properly. So the value is highly predictable, in a sense: if you know one value, you can immediately predict a range of similar values.
However, one cannot predict the equivalent of id=0, nor that 52DE358F-45F1-E311-93EA-00269E58F20D means the store has sold at least 482 items.
The only 'approved' random generator is CRYPT_GEN_RANDOM (which wraps CryptGenRandom), but that is obviously a horrible key candidate.
In most cases, the next newsequentialid can be predicted by taking the current value and adding one to the first hex pair.
In other words:
1E29E599-45F1-E311-80CA-00155D008B1C
is followed by
1F29E599-45F1-E311-80CA-00155D008B1C
is followed by
2029E599-45F1-E311-80CA-00155D008B1C
Occasionally, the sequence will restart from a new value.
So, it's very predictable
NEWSEQUENTIALID is a wrapper around the Windows function UuidCreateSequential.
You can try this code:
DECLARE @tbl TABLE (
    PK uniqueidentifier DEFAULT NEWSEQUENTIALID(),
    Num int
);

INSERT INTO @tbl (Num) VALUES (1), (2), (3), (4), (5);
SELECT * FROM @tbl;
On my machine, the result at that time was:
PK Num
52DE358F-45F1-E311-93EA-00269E58F20D 1
53DE358F-45F1-E311-93EA-00269E58F20D 2
54DE358F-45F1-E311-93EA-00269E58F20D 3
55DE358F-45F1-E311-93EA-00269E58F20D 4
56DE358F-45F1-E311-93EA-00269E58F20D 5
You should try it several times at different times/dates to see the behaviour.
I ran it several times, and the first part changed every time (you can see it in the results: 52..., 53..., 54..., etc.). I waited a while to check again, and after some time the second part was incremented too. I suppose the incrementing carries on through all the parts. Basically, it looks like a simple +=1 increment transformed into a GUID.
EDIT:
If you want a sequential GUID and you want to have control over the values, you can use sequences.
Sample code (the CREATE SEQUENCE definition here is one assumption of how it might be set up):
CREATE SEQUENCE [dbo].[MySequence] AS bigint START WITH 1 INCREMENT BY 1;

SELECT CAST(CAST(NEXT VALUE FOR [dbo].[MySequence] AS varbinary(max)) AS uniqueidentifier);
Calculate the next value? Yes.
Microsoft says:
If privacy is a concern, do not use this function. It is possible to guess the value of the next generated GUID and, therefore, access data associated with that GUID.
So it is possible to guess the next value. I haven't found information on whether it is possible to get the previous one.
from: http://msdn.microsoft.com/en-us/library/ms189786.aspx
Edit: a few more words about NEWSEQUENTIALID and security: http://vadivel.blogspot.com/2007/09/newid-vs-newsequentialid.html
Edit:
NewSequentialID contains the server's MAC address (or one of them), therefore knowing a sequential ID gives a potential attacker information that may be useful as part of a security or DoS attack.
from: Are there any downsides to using NewSequentialID?

Which data type to use for ordinal?

Whenever I have some records/objects that I want to be in a certain order, I usually create a field called Ordinal.
I often wonder if it would be better to use an integer or a decimal value for the ordinal field.
This is a consideration when moving an object to a different position in the order:
If you use consecutive integers, you have to do some serious reworking of all of the ordinals (or at least the ordinals between the object's old and new positions).
If you use integers but space them out (maybe at intervals of 1000), then you can just change the ordinal to the midpoint value between the surrounding objects where you want to move the object. This could fail if somewhere down the line you end up with consecutive integers.
If you use decimal numbers, you could just find the average of the surrounding objects' ordinals and use that for the object being moved.
Maybe it would be possible to use a string, but I could see that getting pretty goofy.
I'm sure there are other considerations I haven't thought of.
What do you use and why?
"This could fail if somewhere down the line you end up with consecutive integers."
For this (probably rare and thus not performance-important) case, you could implement a renumber method that spaces the ordinals out again, as shown below. When I used to program in COMAL (anyone know that language?), you could do this very thing with line numbers.
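A minimal T-SQL sketch of such a renumber pass, assuming a hypothetical Items table with an int Ordinal column:

-- Re-space the ordinals in steps of 1000, preserving the current order.
WITH Ranked AS (
    SELECT Ordinal, ROW_NUMBER() OVER (ORDER BY Ordinal) AS rn
    FROM Items
)
UPDATE Ranked SET Ordinal = rn * 1000;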
Decimals seem to solve your problem pretty well. Since a decimal gives you plenty of base-10 digits to subdivide, you have a lot of room available. Unless you've seen cases where you've gotten out to quite a few digits and have reason to suspect an unlimited number of digits could be necessary, I'd let it ride.
If you really need an alternative and don't see a need to stick with a basic data type, you might go with tumbler arithmetic. The basic idea is that it's a place notation that is infinitely expandable at each position. Pretty simple conceptually.
I used to use a decimal type for a field of this kind to order records in a table, and we actually exposed it to the customer so that they could set their own order. Although it sounds cheesy, our customers liked it; they found it very intuitive. They caught on very quickly that they could use numbers like 21.5 to move something between 21 and 22.
Maybe it's because they were accountants.
I use integers and just rearrange as necessary when a new item needs to be inserted in the middle of the order. Since you can create the necessary gap with a single update statement (see the sketch below), it's fairly trivial. However, I've only ever done this on lookup tables of a few dozen rows at most; obviously it scales a bit poorly. But I would say that if you need a solution to this problem for a large number of rows, the process(es) for maintaining the order should be proceduralized anyway, which makes the choice of data type largely moot.
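A minimal sketch of that gap-making update (the Items table and position 5 are made up for illustration):

-- Shift everything at position 5 and above up by one, then insert.
UPDATE Items SET Ordinal = Ordinal + 1 WHERE Ordinal >= 5;
INSERT INTO Items (Name, Ordinal) VALUES ('New item', 5);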
I remember this being a similar question to a previous post. It can be found here:
SQL Server Priority Ordering
The linked list would still work, but this is a much easier solution if you don't want to track a parent-child relationship.
Sounds like what you want is a linked list. That way you always know what comes next and you don't have to guess. So the position field would be a pointer to the object following it.
The problem I have always had with using arbitrary numbers for position is that it can quickly fall into entropy. What if more items get added and the numbers become consecutive, etc.? It can quickly become unmanageable if the list of items changes position.
To implement this in a SQL Server table, add another field with the same data type as the primary key. If the field is null, then it is the bottom element in the list. If you are storing multiple lists in the same table, you will probably want to add another field called ListID, which designates that all rows with the same ListID pertain to the same list. So something like this:
Table:
ID INT
ListID INT
Child INT

Parent row for first list:  1, 1, 2
First child:                2, 1, 3
Second child:               3, 1, NULL

Parent row for second list: 4, 2, 5
First child:                5, 2, 6
Second child:               6, 2, NULL
You'll probably have to do an insert and an update every time you add a row, which can be a little tedious, but it will always make the list line up.
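Reading a list back in order takes a recursive query. A minimal T-SQL sketch, assuming the table above is named ListItems and the head of each list is the row that no other row points at:

WITH Walk AS (
    -- Anchor: the head of list 1 (no other row's Child points to it).
    SELECT ID, Child, 1 AS Position
    FROM ListItems
    WHERE ListID = 1
      AND ID NOT IN (SELECT Child FROM ListItems
                     WHERE ListID = 1 AND Child IS NOT NULL)
    UNION ALL
    -- Step: follow each row's Child pointer.
    SELECT l.ID, l.Child, w.Position + 1
    FROM ListItems l
    JOIN Walk w ON l.ID = w.Child AND l.ListID = 1
)
SELECT ID, Position FROM Walk ORDER BY Position;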
Is the "certain order" based on data outside of the table? If so, why not include it so you can do the sorting dynamically? If it's already in the table, adding a field is redundant.

Represent Ordering in a Relational Database

I have a collection of objects in a database. Images in a photo gallery, products in a catalog, chapters in a book, etc. Each object is represented as a row. I want to be able to arbitrarily order these images, storing that ordering in the database so when I display the objects, they will be in the right order.
For example, let's say I'm writing a book, and each chapter is an object. I write my book, and put the chapters in the following order:
Introduction, Accessibility, Form vs. Function, Errors, Consistency, Conclusion, Index
It goes to the editor, and comes back with the following suggested order:
Introduction, Form, Function, Accessibility, Consistency, Errors, Conclusion, Index
How can I store this ordering in the database in a robust, efficient way?
I've had the following ideas, but I'm not thrilled with any of them:
Array. Each row has an orderingID; when the order changes (via a removal followed by an insertion), the orderingIDs are updated. This makes retrieval easy, since it's just an ORDER BY, but it seems easy to break.
-- REMOVAL
UPDATE ... SET orderingID=NULL WHERE orderingID=removedID
UPDATE ... SET orderingID=orderingID-1 WHERE orderingID > removedID
-- INSERTION
UPDATE ... SET orderingID=orderingID+1 WHERE orderingID >= insertionID
UPDATE ... SET orderingID=insertionID WHERE ID=addedID
Linked list. Each row has a column for the id of the next row in the ordering. Traversal seems costly here, though there may be some way to use ORDER BY that I'm not thinking of.
Spaced array. Set the orderingID (as used in #1) to be large, so the first object is 100, the second is 200, etc. Then when an insertion happens, you just place it at (objectBefore + objectAfter)/2. Of course, this would need to be rebalanced occasionally, so you don't have things too close together (even with floats, you'd eventually run into rounding errors).
None of these seem particularly elegant to me. Does anyone have a better way to do it?
Another alternative would be (if your RDBMS supports it) to use columns of an array type. While this breaks the normalization rules, it can be useful in situations like this. One database I know of that has arrays is PostgreSQL.
The acts_as_list mixin in Rails handles this basically the way you outlined in #1. It looks for an INTEGER column called position (whose name you can override, of course) and uses that to do an ORDER BY. When you want to re-order things, you update the positions. It has served me just fine every time I've used it.
As a side note, you can remove the need to always re-position on INSERTs/DELETEs by using sparse numbering -- kind of like BASIC back in the day... you can number your positions 10, 20, 30, etc., and if you need to insert something between 10 and 20 you just insert it with a position of 15. Likewise, when deleting, you can just delete the row and leave the gap. You only need to renumber when you actually change the order, or when you try to insert and there is no appropriate gap to insert into.
Of course, depending on your particular situation (e.g. whether you have the other rows already loaded into memory or not), it may or may not make sense to use the gap approach.
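A minimal SQL sketch of that sparse numbering (the chapters table is made up for illustration):

-- Positions were seeded as 10, 20, 30, ... so inserting between
-- 10 and 20 needs no renumbering of the other rows.
INSERT INTO chapters (title, position) VALUES ('Accessibility', 15);
SELECT title FROM chapters ORDER BY position;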
If the objects aren't heavily keyed by other tables, and the lists are short, deleting everything in the domain and just re-inserting the correct list is the easiest. But that's not practical if the lists are large and you have lots of constraints to slow down the delete. I think your first method is really the cleanest. If you run it in a transaction you can be sure nothing odd happens while you're in the middle of the update to screw up the order.
Just a thought considering option #1 vs #3: doesn't the spaced array option (#3) only postpone the problem of the normal array (#1)? Whatever algorithm you choose, either it's broken, and you'll run into problems with #3 later, or it works, and then #1 should work just as well.
I did this in my last project, but it was for a table that only occasionally needed to be specifically ordered and wasn't accessed too often. I think the spaced array would be the best option, because reordering would be cheapest in the average case (just a change to one value and a query on two).
Also, I would imagine ORDER BY would be pretty heavily optimized by database vendors, so leveraging that function would be advantageous for performance as opposed to the linked list implementation.
Use a floating point number to represent the position of each item:
Item 1 -> 0.0
Item 2 -> 1.0
Item 3 -> 2.0
Item 4 -> 3.0
You can place any item between any other two items by simple bisection:
Item 1 -> 0.0
Item 4 -> 0.5
Item 2 -> 1.0
Item 3 -> 2.0
(Moved item 4 between items 1 and 2).
The bisection process can continue almost indefinitely due to the way floating point numbers are encoded in a computer system.
Item 4 -> 0.5
Item 1 -> 0.75
Item 2 -> 1.0
Item 3 -> 2.0
(Move item 1 to the position just after Item 4)
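A minimal SQL sketch of that move (the items table and values are made up for illustration):

-- Move item 1 between item 4 (position 0.5) and item 2 (position 1.0).
UPDATE items SET position = (0.5 + 1.0) / 2 WHERE id = 1;
SELECT id FROM items ORDER BY position;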
Since I've mostly run into this with Django, I've found this solution to be the most workable. It seems that there isn't any "right way" to do this in a relational database.
I'd do a consecutive number, with a trigger on the table that "makes room" for a priority if it already exists.
I had this problem as well. I was under heavy time pressure (aren't we all?) and I went with option #1, only updating rows that changed.
If you swap item 1 with item 10, just do two updates to change the order numbers of items 1 and 10. I know it is algorithmically simple, and it is O(n) in the worst case, but that worst case is a total permutation of the list. How often is that going to happen? That's for you to answer.
I had the same issue and probably spent at least a week concerning myself with the proper data modeling, but I think I've finally got it. Using the array datatype in PostgreSQL, you can store the primary key of each ordered item in an array and update that array with insertions or deletions when your order changes. Referencing a single row will let you map all your objects based on the ordering in the array column.
It's still a bit choppy of a solution, but it will likely work better than option #1, since option #1 requires updating the order numbers of all the other rows when the ordering changes.
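A minimal PostgreSQL sketch of that idea (the table and column names are made up for illustration):

CREATE TABLE gallery_order (
    gallery_id integer PRIMARY KEY,
    photo_ids  integer[] NOT NULL  -- photo primary keys in display order
);

-- Move photo 7 to the front of gallery 1's ordering.
UPDATE gallery_order
SET photo_ids = 7 || array_remove(photo_ids, 7)
WHERE gallery_id = 1;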
Scheme #1 and Scheme #3 have the same complexity in every operation except INSERT writes. Scheme #1 has O(n) writes on INSERT and Scheme #3 has O(1) writes on INSERT.
For every other database operation, the complexity is the same.
Scheme #2 should not even be considered because its DELETE requires O(n) reads and writes. Scheme #1 and Scheme #3 have O(1) DELETE for both read and write.
New method
If your elements have a distinct parent element (i.e. they all reference the same foreign key row), then you can try the following...
Django offers a database-agnostic solution to storing lists of integers within CharField(). One drawback is that the max length of the stored string can't be greater than max_length, which is DB-dependent.
In terms of complexity, this would give Scheme #1 O(1) writes for INSERT, because the ordering information would be stored as a single field in the parent element's row.
Another drawback is that a JOIN to the parent row is now required to update ordering.
https://docs.djangoproject.com/en/dev/ref/validators/#django.core.validators.validate_comma_separated_integer_list