Is INT the correct datatype for ABS(CHECKSUM(NEWID()))?

I'm in the process of creating unique customer IDs to serve as an alternative ID for external use.
I've added a new column "cust_uid" with datatype INT for my unique IDs.
When I do an INSERT into this new column:
Insert Into Customers(cust_uid)
Select ABS(CHECKSUM(NEWID()))
I get an error:
Could not create an acceptable cursor. OLE DB provider "SQLNCLI" for linked server "SHQ2IIS1" returned message "Multiple-step OLE DB operation generated errors. Check each OLE DB status value, if available. No work was done."
I've checked all the data types on both tables, and the only thing that has changed is the new column in both tables.
The update is being done on one Big #$$ table... and for reasons above my pay grade, we would like to have new UIDs that are different from the ones we currently have, "so users don't know how many accounts we actually have."
Is INT the correct datatype for ABS(CHECKSUM(NEWID())) ?

For a moment, forget your issue with what must be an attempt to insert into a linked server (though it is not obvious from your code; Customers must either be a synonym, or you simplified the statement).
Ask yourself: why would you use random numbers for uniqueness? Random and unique may seem like similar concepts, but they're not.
I also see a lack of error handling (again, this may just be because you simplified your code to "help" us). Eventually you will get duplicates. You may want to read this tip and this blog post. Essentially, as you insert more and more "unique" values, the likelihood that you will get a collision increases. So rather than solving the issue with your current approach, I think you should step back and reconsider the problem.
Why are you using random numbers instead of simpler concepts that - at least by default - help assure uniqueness in a much more predictable way, like IDENTITY or SEQUENCE? Is it to prevent people from guessing the next value, or being able to determine how many values you generate in a time period? If so, then pre-populate a table with a bunch of random values, and pull one off the stack when you need one, as I described here. If this isn't the crucial issue, then stop breaking your back and just use an existing methodology for generating unique - and not random - numbers.
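A minimal sketch of that pre-populated pool, with made-up table and column names (the retry caveat in the comments matters, since ABS(CHECKSUM(NEWID())) can repeat):
CREATE TABLE RandomIdPool
(
    pool_id INT NOT NULL PRIMARY KEY  -- pre-generated random values
);

-- fill the pool up front; the PRIMARY KEY rejects a duplicate batch,
-- so in practice you would load it in a retry loop
INSERT INTO RandomIdPool (pool_id)
SELECT DISTINCT ABS(CHECKSUM(NEWID()))
FROM sys.all_objects;  -- any convenient row source

-- hand out one value and remove it from the pool atomically
DELETE TOP (1) FROM RandomIdPool
OUTPUT DELETED.pool_id;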

The update is being done on one Big #$$ table... and for reasons above my pay grade, we would like to have new UIDs that are different from the ones we currently have, "so users don't know how many accounts we actually have."
Pick a constant and XOR it into the existing identifier to get yourself a somewhat obfuscated number. XOR it again to get the original identifier back.
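For example (T-SQL's ^ operator is bitwise XOR; the constant here is arbitrary):
-- obfuscate: XOR the identity value with an arbitrary constant
SELECT 1584 ^ 1597414427 AS obfuscated_id;

-- de-obfuscate: XOR with the same constant restores the original
SELECT (1584 ^ 1597414427) ^ 1597414427 AS original_id;  -- 1584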

Again, a bad choice for generating a unique ID.
With that said, this does not throw an error, so I think something else is going on:
declare @id int
set @id = ABS(CHECKSUM(NEWID()))
print @id
Your update explaining that you don't want users to know how many accounts you have, and that custID is an identity, should have been in the original problem statement.

Related

Informix select trigger to update column

Is it possible to increase the value of a number in a column with a trigger every time it gets selected? We have special tables where we store the new ID, and when we update it in the app, it tends to get conflicts before the update happens, even when it all takes less than a second. So I was wondering: is it not possible to set the database to increase the value after every select on that column? Do not ask me why we do not use autoincrement for IDs, because I do not know.
Informix provides the SERIAL and BIGSERIAL types (and also SERIAL8, but don't use that) which provide autoincrement support. It also provides SEQUENCES with more sophisticated autoincrements. You should aim to use one of those.
Trying to use a SELECT trigger to update the table being selected from is, at best, fraught with problems about transactions and the like (problems which both the types and sequences carefully avoid).
If your design team needs help making effective use of these, ask a new question outlining what you want to achieve.
Normally, the correct way to proceed is to make the ID column in each table that defines 'something' (the Orders table, the Customer table, …) into a SERIAL column and either not insert a value into the ID column or insert 0 into it. The generated value can be retrieved and used when creating auxiliary information — order items, etc.
Note that you could think about using:
CREATE TABLE xyz_sequence
(
xyz SERIAL NOT NULL PRIMARY KEY
);
and using:
INSERT INTO xyz_sequence VALUES(0);
and then retrieving the inserted value — in Informix ESQL/C, you'd use sqlca.sqlerrd[1], in other languages, other techniques. You can also delete the newly inserted record, or even all the records in the table. You can afford to ignore errors from the DELETE statement; sooner or later, the rows will be deleted. The next value inserted will continue where the prior ones left off.
In a stored procedure, you'd use DBINFO('sqlca.sqlerrd1') to get the inserted value. You'd use DBINFO('bigserial') to get the value if you use a BIGSERIAL type.
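A minimal SPL sketch of that pattern (the procedure name is made up; it assumes the xyz_sequence table above):
CREATE PROCEDURE next_xyz() RETURNING INT;
    DEFINE new_id INT;

    -- inserting 0 into a SERIAL column generates the next value
    INSERT INTO xyz_sequence VALUES (0);
    LET new_id = DBINFO('sqlca.sqlerrd1');

    -- cleanup is optional; errors here could safely be ignored
    DELETE FROM xyz_sequence;

    RETURN new_id;
END PROCEDURE;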
I found a possible answer in this question: update with return value instead of doing it with a select. It seems better to return the value directly from the update; since an update takes locks, it should be safer, even in a multithreaded application. But these are just my assumptions. Hopefully it will help someone.

Using an auto-assigned primary key or setting it on INSERT?

I just answered this question: Can I get the ID of the data that I just inserted?. I explained that in order to know the ID of the last record inserted in a table, what I would do is insert it manually, instead of using some sequence or serial field.
What I like to do is run a MAX(id) query before the INSERT, add 1 to that result, and use that number as the ID for the record I'm about to insert.
Now, what I would like to ask: is this a good idea? Can it give some trouble? What are the reasons to use automatically set field on IDs fields?
Note: this is not exactly a question, but looking help center it seems like a good question to ask. If you find it to be off-topic, please tell me and I'll remove it.
This is a bad idea, and it will fail in a multi-threaded (or multi-user) environment.
Please note that the surrogate-key vs natural-key debate is still far from having a concrete definitive solution - but putting that aside for a minute - even if you do go with a surrogate key - you should never try to manually auto-increment columns. Let the database do that for you and avoid all kinds of problems that can occur if you try to do that manually - such as primary key constraint violations in the best case, or duplicate values in the worst case.
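To illustrate (a hypothetical T-SQL sketch; table and column names are made up, and the two patterns assume different table definitions):
-- race-prone: two concurrent sessions can read the same MAX and collide
DECLARE @next_id INT;
SELECT @next_id = MAX(id) + 1 FROM Customers;
INSERT INTO Customers (id, name) VALUES (@next_id, 'Alice');

-- safe: let the database allocate the value
-- (assumes id is declared as INT IDENTITY(1,1) PRIMARY KEY)
INSERT INTO Customers (name) VALUES ('Alice');
SELECT SCOPE_IDENTITY() AS new_id;  -- the id that was just assigned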
If an entity uses an ID as the primary key, it is in general a good idea to let the DB auto-create it, so you don't need to determine an unused one while creating the entity in your code. Furthermore, a DataAccessObject (DAO) does not need to operate on the ID.
Depending on which DB you use, you might not even be allowed to retrieve all the IDs of that table.
I guess there might be other good reasons to let the DB manage this part.

advantages and disadvantages of database automatic number generator for each row vs manual numbering for each row

Imagine two tables implemented as in the following description:
The first table's row numbers are created automatically by the database system.
The second table's row numbers are created manually by the programmer, in sequential order.
The main question is: what are the advantages and disadvantages of these two approaches?
One distinct advantage of having the database manage auto-numbering over creating the numbers manually is that the database implementation is thread-safe, and a manual implementation usually (in 99.9% of cases) is not (it's hard to do correctly).
On the other hand, the database implementation does not guarantee sequential numbering - there can be gaps in the numbers.
Given these two facts, an auto-increment column should be used only as a surrogate key, when the values of the column do not have any business meaning but are simply used as a row identifier.
Please note that when using a surrogate key, it's best to also enforce uniqueness of a natural key - otherwise you might get rows where all the data is duplicated except the surrogate key.
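For instance (a hypothetical sketch; the table and column names are made up):
CREATE TABLE Customers
(
    customer_id INT IDENTITY(1,1) PRIMARY KEY,  -- surrogate key
    email       VARCHAR(255) NOT NULL UNIQUE    -- natural key kept unique
);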
When the database automatically creates the numbers, you have less work.
Think about a sign-up system where you have fields like name, email, password and so on:
1.) if the number is generated by the database, you can just insert the data into the table.
2.) if this is not the case, you have to get the last number first; so instead of a single INSERT INTO you have a SELECT + INSERT INTO.
Another consideration: what happens when you delete a row in your table?
Maybe in a forum you want to delete an account but not all of its posts, so you use a workaround: when a post's user_id does not resolve to a user, you know it is/was a deleted or banned account. If you give a new user the number from a deleted user, you will get into trouble.

How important are lookup tables?

A lot of the applications I write make use of lookup tables, since that was just the way I was taught (normalization and such). The problem is that the queries I write are often more complicated because of this. They often look like this:
get all posts that are still open
"SELECT * FROM posts WHERE status_id = (SELECT id FROM statuses WHERE name = 'open')"
Oftentimes, the lookup tables themselves are very short. For instance, there may only be 3 or so different statuses. In this case, would it be okay to search for a certain type by using a constant in the application? Something like:
get all posts that are still open
"SELECT * FROM posts WHERE status_id = ".Status::OPEN
Or, what if instead of using a foreign key, I set it as an enum and queried off of that?
Thanks.
The answer depends a little on whether you are limited to freeware such as PostgreSQL (not fully SQL-compliant), or whether you are thinking about SQL (i.e. SQL-compliant) and large databases.
In SQL-compliant, Open Architecture databases, where there are many apps using one database and many users using different report tools (not just the apps) to access the data, standards, normalisation, and open architecture requirements are important.
Despite the people who attempt to change the definition of "normalisation", etc. to suit their ever-changing purpose, Normalisation (the science) has not changed.
If you have data values such as {Open; Closed; etc} repeated in data tables, that is data duplication, a simple Normalisation error: if those values change, you may have to update millions of rows, which is a very limited design.
Such values should be Normalised into a Reference or Lookup table, with a short CHAR(2) PK:
O Open
C Closed
U [NotKnown]
The data values {Open;Closed;etc} are no longer duplicated in the millions of rows. It also saves space.
The second point is ease of change: if Closed were changed to Expired, only one row needs to be changed, and that is reflected in the entire database; whereas in the un-normalised files, millions of rows would need to be changed.
Adding new data values, eg. (H,HalfOpen) is then simply a matter of inserting one row.
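A DDL sketch of such a Reference table (the names are hypothetical, and it assumes the posts table carries a status_code CHAR(2) column):
CREATE TABLE StatusRef
(
    status_code CHAR(2)     NOT NULL PRIMARY KEY,  -- short, stable key
    status_name VARCHAR(30) NOT NULL               -- the single copy of the value
);

INSERT INTO StatusRef VALUES ('O', 'Open');
INSERT INTO StatusRef VALUES ('C', 'Closed');
INSERT INTO StatusRef VALUES ('U', '[NotKnown]');

-- data tables carry only the short code
ALTER TABLE posts
    ADD CONSTRAINT fk_posts_status
    FOREIGN KEY (status_code) REFERENCES StatusRef (status_code);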
in Open Architecture terms, the Lookup table is an ordinary table. It exists in the [SQL compliant] catalogue; as long as the FOREIGN KEY relation has been defined, the report tool can find that as well.
ENUM is non-SQL; do not use it. In SQL, the "enum" is a Lookup table.
The next point relates to the meaningfulness of the key.
If the Key is meaningless to the user, fine, use an {INT;BIGINT;GUID;etc} or whatever is suitable; do not number them incrementally; allow "gaps".
But if the Key is meaningful to the user, do not use a meaningless number, use a meaningful Relational Key.
Now some people will go off on tangents regarding the permanence of PKs. That is a separate point. Yes, of course, always use a stable value for a PK (not "immutable", because no such thing exists, and a system-generated key does not provide row uniqueness).
{M,F} are unlikely to change
if you have used {0,1,2,4,6}, well, don't change it; why would you want to? Those values were supposed to be meaningless, remember; only a meaningful Key would need to be changed.
if you do use meaningful keys, use short alphabetic codes, that developers can readily understand (and infer the long description from). You will appreciate this only when you code SELECT and realise you do not have to JOIN every Lookup table. Power users too, appreciate it.
Since PKs are stable, particularly in Lookup tables, you can safely code:
WHERE status_code = 'O' -- Open
You do not have to JOIN the Lookup table and obtain the data value Open; as a developer, you are supposed to know what the Lookup PKs mean.
Last, if the database were large, and supported BI or DSS or OLAP functions in addition to OLTP (as properly Normalised databases can), then the Lookup table is actually a Dimension or Vector, in Dimension-Fact analyses. If it was not there, then it would have to be added in, to satisfy the requirements of that software, before such analyses can be mounted.
If you do that to your database from the outset, you will not have to upgrade it (and the code) later.
Your Example
SQL is a low-level language, and thus it is cumbersome, especially when it comes to JOINs. That is what we have, so we need to accept the encumbrance and deal with it. Your example code is fine. But simpler forms can do the same thing.
A report tool would generate:
SELECT p.*,
s.name
FROM posts p,
status s
WHERE p.status_id = s.status_id
AND p.status_id = 'O'
Another Example
For banking systems, where we use short codes which are meaningful (since they are meaningful, we do not change them with the seasons, we just add to them), given a Lookup table such as (carefully chosen, similar to ISO Country Codes):
Eq Equity
EqCS Equity/Common Share
OTC OverTheCounter
OF OTC/Future
Code such as this is common:
WHERE InstrumentTypeCode LIKE 'Eq%'
And the users of the GUI would choose the value from a drop-down that displays
{Equity/Common Share;Over The Counter},
not {Eq;OTC;OF}, not {M;F;U}.
Without a lookup table, you can't do that, either in the apps, or in the report tool.
For look-up tables I use a sensible primary key -- usually just a CHAR(1) that makes sense in the domain with an additional Title (VARCHAR) field. This can maintain relationship enforcement while "keeping the SQL simple". The key to remember here is the look-up table does not "contain data". It contains identities. Some other identities might be time-zone names or assigned IOC country codes.
For instance gender:
ID Label
M Male
F Female
N Neutral
select * from people where gender = 'M'
Alternatively, an ORM could be used and manual SQL generation might never have to be done -- in this case the standard "int" surrogate key approach is fine because something else deals with it :-)
Happy coding.
Create a function for each lookup.
There is no easy way: you want both performance and query simplicity. Ensure the function and the lookup values stay in sync; you could create an SP_TestAppEnums procedure to compare the existing lookup values against the function and look for out-of-sync or zero-returned values.
CREATE FUNCTION [Enum_Post](@postname varchar(10))
RETURNS int
AS
BEGIN
    DECLARE @postId int
    SET @postId =
        CASE @postname
            WHEN 'Open' THEN 1
            WHEN 'Closed' THEN 2
        END
    RETURN @postId
END
GO
/* Calling the function */
SELECT dbo.Enum_Post('Open')
SELECT dbo.Enum_Post('Closed')
The question is: do you need to include the lookup tables (domain tables 'round my neck of the woods) in your queries? Presumably, these sorts of tables are usually
pretty static in nature — the domain might get extended, but it probably won't get shortened, and
their primary key values are pretty unlikely to change as well (e.g., the status_id for a status of 'open' is unlikely to suddenly get changed to something other than what it was created as).
If the above assumptions are correct, there's no real need to add all those extra tables to your joins just so your WHERE clause can use a friendly name instead of an id value. Just filter on status_id directly where you need to. I'd suspect the non-key attribute in the WHERE clause ('name' in your example above) is more likely to get changed than the key attribute ('id' in your example above): you're more protected by referencing the desired key value(s) of the domain table in your join.
Domain tables serve
to limit the domain of the variable via a foreign key relationship,
to allow the domain to be expanded by adding data to the domain table, and
to populate UI controls and the like with user-friendly information.
Naturally, you'd need to pull domain tables into your queries where you actually require the non-key attributes from the domain table (e.g., the descriptive name of the value).
YMMV: a lot depends on context and the nature of the problem space.
The answer is "whatever makes sense".
Lookup tables involve joins or subqueries, which are not always efficient. I make use of enums a lot to do this job; it's efficient and fast.
Where possible (and it is not always...), I use this rule of thumb: if I need to hard-code a value into my application (vs. letting it remain a record in the database), and also store that value in my database, then something is amiss with my design. It's not ALWAYS true, but basically, whatever the value in question is, it either represents a piece of DATA, or a piece of PROGRAM LOGIC. It is a rare case that it is both.
NOT that you won't find yourself discovering which one it is halfway into the project. But as the others said above, there can be trade-offs either way. Just as we don't always achieve "perfect" normalization in a database design (for reasons of performance, or simply because you CAN take things too far in pursuit of academic perfection...), we may make some conscious choices about where we locate our "look-up" values.
Personally, though, I try to stand by my rule above. It is either DATA, or PROGRAM LOGIC, and rarely both. If it ends up as (or IN) a record in the database, I try to keep it out of the application code (except, of course, to retrieve it from the database...). If it is hard-coded in my application, I try to keep it out of my database.
In cases where I can't observe this rule, I DOCUMENT THE CODE with my reasoning, so that three years later some poor soul will be able to figure out how it broke, if that happens.
The commenters have convinced me of the error of my ways. This answer and the discussion that went along with it, however, remain here for reference.
I think a constant is appropriate here, and a database table is not. As you design your application, you expect that table of statuses to never, ever change, since your application has hard-coded into it what those statuses mean, anyway. The point of a database is that the data within it will change. There are cases where the lines are fuzzy (e.g. "this data might change every few months or so…"), but this is not one of the fuzzy cases.
Statuses are a part of your application's logic; use constants to define them within the application. It's not only more strictly organized that way, but it will also allow your database interactions to be significantly speedier.

Database-wide unique-yet-simple identifiers in SQL Server

First, I'm aware of this question, and the suggestion (using GUID) doesn't apply in my situation.
I want simple UIDs so that my users can easily communicate this information over the phone:
Hello, I've got a problem with order
1584
as opposed to
hello, I've got a problem with order
4daz33-d4gerz384867-8234878-14
I want those to be unique (database-wide) because I have a few different kinds of 'objects'... there are order IDs, and delivery IDs, and billing IDs, and since there's no one-to-one relationship between those, I have no way to guess what kind of object an ID is referring to.
With database-wide unique IDs, I can immediately tell what object my customer is referring to. My user can just input an ID in a search tool, and I save him the extra click to further refine what he is looking for.
My current idea is to use identity columns with different seeds (1, 2, 3, etc.) and an increment value of 100.
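In other words, something like this (the table names are hypothetical):
-- each table gets its own seed; the shared increment of 100 keeps the ranges disjoint
CREATE TABLE Orders     (order_id    BIGINT IDENTITY(1, 100) PRIMARY KEY, placed_on  DATETIME);
CREATE TABLE Deliveries (delivery_id BIGINT IDENTITY(2, 100) PRIMARY KEY, shipped_on DATETIME);
-- Orders get 1, 101, 201, ...; Deliveries get 2, 102, 202, ...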
This raises a few questions, though:
What if I eventually get more than 100 object types? Granted, I could use 1000 or 10000, but something that doesn't scale well "smells".
Is there a possibility the seed is "lost" (during a replication, a database problem, etc?)
more generally, are there other issues I should be aware of?
is it possible to use a non-integer (I currently use bigints) as an identity column, so that I can prefix the ID with something representing the object type? (for example, a varchar column)
would it be a good idea to use a "master table" containing only an identity column, and maybe the object type, so that I can just insert a row into it whenever I need a new ID? I feel like it might be a bit overkill, and I'm afraid it would complicate all my insert requests. Plus, I won't be able to determine an object type without looking at the database.
are there other clever ways to address my problem?
Why not use identities on all the tables, but any time you present an ID to the user, simply tack on a single character for the type? E.g. O1234 is an order, D123213 is a delivery, etc. That way you don't have to engineer some crazy scheme...
Handle it at the user interface--add a prefix letter (or letters) onto the ID number when reporting it to the users. So o472 would be an order, b531 would be a bill, and so on. People are quite comfortable mixing letters and digits when giving "numbers" over the phone, and are more accurate than with straight digits.
You could use an autoincrement column to generate the unique ID. Then have a computed column which takes the value of this column and prepends it with a fixed identifier that reflects the entity type; for example, OR1542 and DL1542 would represent order #1542 and delivery #1542, respectively. Your prefix could be extended as much as you want, and the format could be arranged to help distinguish between items with the same autoincrement value, say OR011542 and DL021542, with the prefixes being OR01 and DL02.
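A sketch of that computed-column variant (the names are hypothetical):
CREATE TABLE Orders
(
    order_id  INT IDENTITY(1, 1) PRIMARY KEY,
    -- computed column: fixed prefix + zero-padded identity value, e.g. OR001542
    public_id AS ('OR' + RIGHT('000000' + CAST(order_id AS VARCHAR(10)), 6))
);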
I would implement this by defining a generic root table. For lack of a better name, call it Entity. The Entity table should have, at a minimum, a single identity column. You could also include other fields that are common across all your objects, or even metadata that tells you, for example, that this row is an order.
Each of your actual Order, Delivery... tables will have an FK reference back to the Entity table. This will give you a single unique ID column.
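Something along these lines (a hypothetical sketch):
CREATE TABLE Entity
(
    entity_id   BIGINT IDENTITY(1, 1) PRIMARY KEY,
    entity_type CHAR(1) NOT NULL  -- e.g. 'O' = order, 'D' = delivery
);

CREATE TABLE Orders
(
    order_id  BIGINT NOT NULL PRIMARY KEY REFERENCES Entity (entity_id),
    placed_on DATETIME NOT NULL
);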
Using the seeds is, in my opinion, a bad idea, and one that could lead to problems.
Edit
You mentioned some of the problems already. I also see this being a pain to track, as well as ensuring you set up all new entities correctly. Imagine a developer updating the system two years from now.
After I wrote this answer, I thought a bit more about why you're doing this, and I came to the same conclusion that Matt did.
MS's intentional programming project had a GUID-to-word system that gave pronounceable names from random IDs.
Why not a simple Base36 representation of a bigint? http://en.wikipedia.org/wiki/Base_36
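A rough sketch of such a conversion as a T-SQL scalar function (the name and length choices are mine; it handles non-negative values only):
CREATE FUNCTION dbo.ToBase36 (@value BIGINT)
RETURNS VARCHAR(13)  -- 13 base-36 digits cover the full positive BIGINT range
AS
BEGIN
    DECLARE @digits CHAR(36) = '0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ';
    DECLARE @result VARCHAR(13) = '';

    IF @value = 0 RETURN '0';

    WHILE @value > 0
    BEGIN
        SET @result = SUBSTRING(@digits, @value % 36 + 1, 1) + @result;
        SET @value = @value / 36;
    END

    RETURN @result;
END
-- usage: SELECT dbo.ToBase36(1584);  -- returns '180'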
We faced a similar problem on a project. We solved it by first creating a simple table that has only one column: a BIGINT set as an auto-increment identity.
Then we created an sproc that inserts a new row into that table, using default values and inside a transaction. It then stores the SCOPE_IDENTITY in a variable, rolls back the transaction, and returns the stored SCOPE_IDENTITY.
This gives us a unique ID inside the database without filling up a table.
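A sketch of that sproc (the names are hypothetical; it relies on the fact that SQL Server does not reuse identity values after a rollback):
CREATE TABLE IdSource (id BIGINT IDENTITY(1, 1) PRIMARY KEY);
GO
CREATE PROCEDURE GetNextId
    @newId BIGINT OUTPUT
AS
BEGIN
    BEGIN TRANSACTION;
    INSERT INTO IdSource DEFAULT VALUES;
    SET @newId = SCOPE_IDENTITY();
    -- rolling back keeps the table empty; the identity value is still consumed
    ROLLBACK TRANSACTION;
END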
If you want to know what kind of object the ID is referring to, I'd lose the transaction rollback and also store the type of object alongside the ID. That way, finding out what kind of object the ID is referring to is only one select (or inner join) away.
I use a high/low algorithm for this. I can't find a description of it online, though. I must blog about it.
In my database, I have an ID table with a counter field. This is the high part. In my application, I have a counter that goes from 0 to 99. This is the low part. The generated key is 100 * high + low.
To get a key, I do the following
initially high = -1
initially low = 0

method GetNewKey()
begin
    if high = -1 then
        high = GetNewHighFromDatabase()
    newKey = 100 * high + low
    inc low
    if low = 100 then
        low = 0
        high = -1
    return newKey
end
The real code is more complicated, with locks and so on, but that is the general gist.
There are a number of ways of getting the high value from the database including auto inc keys, generators etc. The best way depends on the db you are using.
This algorithm gives simple keys while avoiding most of the DB hit of looking up a new key every time. In testing, I found it had similar performance to GUIDs and vastly better performance than retrieving an auto-inc key every time.
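The database side of that scheme can be as small as this (a hypothetical T-SQL sketch of GetNewHighFromDatabase):
CREATE TABLE HiCounter (hi BIGINT NOT NULL);
INSERT INTO HiCounter VALUES (0);
GO
CREATE PROCEDURE GetNewHigh
    @hi BIGINT OUTPUT
AS
BEGIN
    -- increment and read the high value in one atomic statement
    UPDATE HiCounter SET @hi = hi = hi + 1;
END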
You could create a master UniqueObject table with your identity and a subtype field. Subtables (Orders, Users, etc.) would have a FK to UniqueObject. INSTEAD OF INSERT triggers should keep the pain to a minimum.
Maybe an itemType-year-week-orderNumberThisWeek variant?
o2009-22-93402
Such an identifier can consist of several database column values, and the software can simply format them into the form of an identifier.
I had a similar situation with a project.
My solution: By default, users only see the first 7 characters of the GUID.
It's sufficiently random that collisions are extremely unlikely (1 in 268 million), and it's efficient for speaking and typing.
Internally, of course, I'm using the entire GUID.