Sanity Check: Floats as primary keys? - sql

I'm working with an old SQL Server 2000 database, mixing some of its information with a new app I'm building. I noticed that some of the primary keys in several of the tables are floats rather than any type of int. They aren't foreign keys and are all unique. I can't think of any reason anyone would want to make their unique primary key IDs floats, but I'm not a SQL expert by any means. So I guess what I'm asking is: does whoever designed this fairly extensive database know something I don't?

I'm currently working with a rather big accounting package where EACH of its 350+ tables has a primary key of FLOAT(53). All actual values are integers, and the system strictly checks that they indeed are (there are special functions that do all the incrementing work).
I did wonder at this design, yet I can understand why it was chosen and give it some credit.
On the one hand, the system is big enough to have billions of records in some tables. On the other hand, those primary keys must be easily readable from external applications like Excel or VB6, in which case you don't really want to make them BIGINT.
Hence, float is fine.

I worked with someone who used floats as PKs in SQL Server databases. He was worried about running out of numbers for identifiers if he stuck to INTs (32-bit on SQL Server). He just looked at the range of float and did not consider that, for larger numbers, the ones place is no longer held in the number due to limited precision. So his code to take MAX(PK) + 1.0 would at some point return a number equal to MAX(PK). Not good. I finally convinced him not to use float for surrogate primary keys in future databases. He quit before fixing the DB he was working on.
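A minimal sketch of that failure mode: 2^53 is the point where a double-precision float can no longer represent consecutive integers, so adding 1.0 does nothing. The variable name is made up, and the inline DECLARE initialisation needs SQL Server 2008 or later.

    DECLARE @pk float = POWER(CAST(2 AS float), 53);   -- 9007199254740992
    SELECT @pk       AS MaxPk,
           @pk + 1.0 AS NextPk,                         -- rounds back to the same value
           CASE WHEN @pk + 1.0 = @pk THEN 'collision' ELSE 'ok' END AS Outcome;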
To answer your question, "So I guess what I'm asking is does whoever designed this fairly extensive database know something I don't?" Most likely NO! At least not with respect to choosing datatypes.

Floats have one interesting property: it is always possible to insert a value between two others, except in the pathological case where you run out of bits. The disadvantage is that representational issues may keep you from referring to a row by its key; it can be hard to get two floating-point values to compare exactly equal.
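A quick illustration of the equality problem, pure arithmetic with no tables involved:

    -- 0.1 and 0.2 have no exact binary representation, so the sum
    -- does not compare equal to 0.3 when all three are floats.
    SELECT CASE WHEN CAST(0.1 AS float) + CAST(0.2 AS float) = CAST(0.3 AS float)
                THEN 'equal' ELSE 'not equal' END AS Comparison;   -- 'not equal'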

Is it a NUMERIC(x, y) format and an IDENTITY? If so, it might be an upgrade from an older version of SQL Server. Back in the day, IDENTITY could only be a NUMERIC format, not the common INT we use today.
Otherwise, there's no way to tell whether a float is suitable as a primary key -- it depends on your domain. It's a bit harder to compare (integer comparison is more efficient than IEEE float comparison), and most people use monotonically increasing numbers (IDENTITY), so integers are often what people really want.
To answer the original question more directly: since it looks like you're storing ints, use an integer datatype. It's more efficient to store and compare.
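A minimal sketch of what the integer-surrogate-key alternative looks like; the table and column names are invented for illustration:

    CREATE TABLE dbo.Invoices (
        InvoiceID   int IDENTITY(1,1) NOT NULL
            CONSTRAINT PK_Invoices PRIMARY KEY,
        InvoiceDate datetime NOT NULL
    );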

I have been working with the Cerner Millennium database for a few years (under the covers it uses Oracle). Initially I was very surprised to see that it used floats for IDs on tables. Then I encountered an ID in our database greater than 2^32, and a query I wrote gave the wrong results because I had incorrectly cast the ID to an INT; at that point I realized why they did it. I don't find any of the arguments above against using floats persuasive in the real world, where for keys you just need numbers only "somewhat greater" than 2^32 and the value of the ID is always of the form ######.0. (No one is talking about an ID of the form ######.######.) However, now that we are importing this data into a SQL Server warehouse and bigint is available, we are going to go with bigint instead of float.

FYI -- there is another way of looking at this:
I work in real-time process control, and as such most of my row entries are time-based and are generated automatically at high rates by non-ASCII machines. Time is what my users usually search on, and many of my 'users' are actually machines themselves. Hence, UTC-based primary keys.

SQL Server Implicit Data Type Conversion Challenges

I've worked with many databases over the last 20 years and have only run into this "interesting" type of implicit data conversion problem with SQL Server.
If I create a table with one smallint column, insert two rows with the values 1 and 2, and then run "Select Avg(Column) From table", I get a truncated result instead of the 1.5 I would get from pretty much any other DB on the planet, which would automatically upsize the data type to hold the entire result rather than truncating/rounding to the column's data type. Now, I know I can cast my way around this for every possible scenario, but that is not a good dynamic solution, especially for data analytics with data analytic products (e.g. Cognos, MicroStrategy, etc.).
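A minimal repro of that behaviour; the table and column names are made up, and the multi-row VALUES syntax needs SQL Server 2008 or later:

    CREATE TABLE dbo.AvgDemo (Val smallint);
    INSERT INTO dbo.AvgDemo (Val) VALUES (1), (2);
    SELECT AVG(Val) AS IntAvg FROM dbo.AvgDemo;   -- returns 1, not 1.5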
I am in data warehousing and have fact tables with millions of rows in them. I would love to store small columns and still get proper aggregation results. My current approach to work around this nuance is to define the smallest quantifiable columns as Numeric(19,5) to account for all situations, even though these columns often only store 1 or 0, for which a tinyint would be great but will not naturally aggregate well.
Is there not some directive that tells SQL Server to do what every other DB (Oracle/DB2/Informix/Access, etc.) does? Which is promote to a larger type, show the entire result, and let me do what I want with it?
You could create views on the tables which cast the smallint or tinyint to float, and only publish these views to the users. This would keep the memory usage small. The conversion should be no overhead compared to other database systems, which must do the same conversion if they use a different data type for aggregation.
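A sketch of that view approach, with made-up table and column names:

    CREATE VIEW dbo.vFactSales
    AS
    SELECT SaleID,
           CAST(Quantity AS float) AS Quantity   -- stored as tinyint, published as float
    FROM dbo.FactSales;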
While it might frustrate you, a lot of programming languages also behave this way with ints; 1 / 2 will spit out 0. See:
With c++ integers, does 1 divided by 2 reliably equal 0, and 3/2 = 1, 5/2 = 2 etc.?
It's a design quirk, and it would break a lot of things if they changed it. You're asking whether you can change a fairly fundamental way SQL Server behaves, and thus potentially break anyone else's code running on the server.
Simply put, no you can't.
And you're wrong that every other DB product behaves this way; Derby also does the same thing:
http://docs.oracle.com/javadb/10.6.2.1/ref/rrefsqlj32693.html
In the Oracle docs they specifically warn you that AVG will return a float regardless of the original type. This is because every language has to make the choice, do I return the original type or the most precise answer? To stop overflows, a lot of languages chose the former to the constant frustration of programmers everywhere.
So in SQL Server, to get a float out, put a float in.
To my best knowledge, the fastest way is to do an implicit cast: SELECT AVG(Field * 1.0). You could of course do an explicit cast the same way. As far as I know there is no way to tell SQL Server that you want integers converted to floats when you average them, and arguably that is actually correct behavior.
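Both workarounds, shown against the hypothetical repro table used in the question above:

    SELECT AVG(Val * 1.0)          AS ImplicitCastAvg,   -- promotes to numeric, returns 1.5
           AVG(CAST(Val AS float)) AS ExplicitCastAvg    -- returns 1.5
    FROM dbo.AvgDemo;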

SQL primary key, INT or GUID or..? [duplicate]

Is there any reason why I should not use an Integer as primary key for my tables?
The database is SQL CE, with two main tables of approximately 50,000 entries per year, and a few minor tables. Only two connections will be constantly open to the database, but updates will be triggered through multiple TCP socket connections, so many threads will access and use the same database connection. Activity is very low, though, so simultaneous updates are quite unlikely, but they may occur maybe a couple of times per day at most.
Will probably use LINQ2SQL for DAL, or typed datasets.
Not sure if this info is relevant, but that's why I'm asking, since I don't know :)
You should use an integer - it is smaller, meaning less memory, less IO (disk and network), less work to join on.
The database should handle the concurrency issues, regardless of the type of PK.
The advantage of using a GUID primary key is that it should be unique in the world, which matters if you ever need to move data from one database to another: you know the row is unique.
But since we are talking about a small DB, I prefer an integer.
Edit:
If you are using SQL Server 2005 or later, you can also use NEWSEQUENTIALID().
This generates sequential GUIDs, so the index fragmentation problem you get with NEWID() is not there anymore.
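A sketch of how NEWSEQUENTIALID() is typically wired up -- it can only be used as a column default; the table and constraint names are invented:

    CREATE TABLE dbo.Events (
        EventID uniqueidentifier NOT NULL
            CONSTRAINT DF_Events_EventID DEFAULT NEWSEQUENTIALID()
            CONSTRAINT PK_Events PRIMARY KEY CLUSTERED,
        Payload nvarchar(200) NOT NULL
    );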
I see no reason not to use an auto-increment integer in this scenario. If you ever get to the point where an integer can't handle the volume of data then you're talking about an application scaled up to the point that a lot more work is involved anyway.
Keep in mind a few things:
An integer is the native word size for the hardware. It's about as fast and simple and easy on the computer as a data type gets.
When considering possibly using a GUID, know that they make for terrible primary keys. Relational databases in general (I can't speak for all, but MS SQL Server is a good example) don't index GUIDs well. There are hacks out there to try to make more index-friendly GUIDs; take them or leave them. But in general, a GUID should be avoided as a PK for performance reasons.
Is there any reason why I should not use an Integer as primary key for my tables?
Nope, as long as each one is unique, integers are fine. GUIDs sound like a good idea at first, but in reality they are much too large. Most of the time it's using a sledgehammer to kill a fly, and the size of the GUID makes it much slower than using an integer.
Definitely use an integer, you do not want to use a GUID in a clustered index (PK) as it will cause the table to unnecessarily fragment.

Why do we care about data types?

Specifically, in relational database management systems, why do we need to know the data type of a column (more likely, the attribute of an object) at creation time?
To me, data types feel like an optimization, because one data point can be implemented in any number of ways. Wouldn't it be better to assign semantic roles and constraints to a data point and then have the engine internally examine and optimize which data type best serves the user?
I suspect this is where the heavy lifting is and why it's easier to just ask the user rather than to do the work.
What do you think? Where are we headed? Is this a realistic expectation? Or do I have a misguided assumption?
The type expresses a desired constraint on the values of the column.
The answer is storage space and fixed size rows.
Fixed-size rows are much, MUCH faster to search than variable length rows, because you can seek directly to the correct byte if you know which record number and field you want.
Edit: Having said that, if you use proper indexing in your database tables, the fixed-size rows thing isn't as important as it used to be.
SQLite does not care.
Other RDBMSs use principles that were designed in the early '80s, when it was vital for performance.
Oracle, for instance, does not distinguish between a NULL and an empty string, and stores its NUMBERs as sets of centesimal (base-100) digits.
That hardly makes sense today, but these were very clever solutions when Oracle was being developed.
In one of the databases I developed, though, non-indexed values were stored as VARCHAR2s and cast dynamically into appropriate datatypes depending on several conditions.
That was quite a special thing, though: it was used for bulk loading key-value pairs in one call to the database using collections.
Dynamic SQL statements were used for parsing data and putting them into appropriate tables based on key name.
All values were loaded into the temporary VARCHAR2 column as-is and then converted into NUMBERs and DATETIMEs to be put into their columns.
Explicit data types are huge for efficiency and storage. If they are implicit, they have to be figured out and therefore incur speed costs. Indexes would be hard to implement as well.
I would suspect, although I'm not positive, that having explicit types also incurs less storage space on average. For numbers especially, there is no comparison between a binary int and a string of digit characters.
Hm... Your question is sort of confusing.
If I understand it correctly, you're asking why it is that we specify data types for table columns, rather than having the "engine" automatically determine what is needed for the user.
Data types act as a constraint - they secure the data's integrity. An int column will never have letters in it, which is a good thing. The data type isn't automatically decided for you, you specify it when you create the database - almost always using SQL.
You're right: assigning a data type to a column is an implementation detail and has nothing to do with the set theory or calculus behind a database engine. As a theoretical model, a database ought to be "typeless" and able to store whatever we throw at it.
But we have to implement the database on a real computer with real constraints. It's not practical, from a performance standpoint, to have the computer dynamically try to figure out how to best store the data.
For example, let's say you have a table in which you store a few million integers. The computer could -- correctly -- figure out that it should store each datum as an integral value. But if you were to one day suddenly try to store a string in that table, should the database engine stop everything until it converts all the data to a more general string format?
Unfortunately, specifying a data type is a necessary evil.
If you know that some data item is supposed to be a numeric integer, and you deliberately choose NOT to let the DBMS take care of enforcing this, then it becomes YOUR responsibility to ensure all sorts of things: data integrity (ensuring that the value 'A' can never be entered in the column, and that the value 1.5 can never be entered in the column), consistency of system behaviour (ensuring that the value '01' is considered equal to the value '1', which is not the behaviour you get from type String), and so on.
Types take care of all those sorts of things for you.
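A rough sketch of what "doing it yourself" looks like if the integer is kept in a string column; the table and constraint names are invented:

    CREATE TABLE dbo.Untyped (
        Qty varchar(10) NOT NULL
            CONSTRAINT CK_Untyped_Qty CHECK (Qty NOT LIKE '%[^0-9]%')  -- digits only
    );
    -- Even with the check, '01' and '1' still compare as different strings:
    SELECT CASE WHEN '01' = '1' THEN 'equal' ELSE 'not equal' END AS Comparison;  -- 'not equal'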
I'm not sure of the history of datatypes in databases, but to me it makes sense to know the datatype of a field.
When would you want to do a sum of some fields which are entirely varchar?
If I know that a field is an integer, it makes perfect sense to do a sum, avg, max, etc.
Not all databases work this way. SQLite was mentioned earlier, but a much older family of databases also does this: multivalued databases.
Consider UniVerse (now an IBM property). It does not do any data validation, nor does it require that you specify what type a field is. Searches are still (relatively) fast, and it takes up less space (due to the way it stores data dynamically).
You can describe what the data may look like using meta-data (dictionary items), but that is the limit of how you restrict the data.
See the wikipedia article on UniVerse
When you're pushing half a billion rows in 5 months after go live, every byte counts (in our system)
There is no such anti-pattern as "premature optimisation" in database design.
Disk space is cheap, of course, but you use the data in memory.
You should care about data types when it comes to filtering (WHERE clause) or sorting (ORDER BY). For example, "200" is LOWER than "3" if those values are strings, and the opposite is true when they are integers.
I believe sooner or later you will have to sort or filter your data ("200" > "3"?) or use some aggregate functions in reports (like SUM() or AVG()). Until then you are good with a text datatype :)
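A quick illustration of that ordering difference:

    SELECT CASE WHEN '200' < '3' THEN 'as strings:  200 sorts before 3'    END AS StringCompare,
           CASE WHEN  200  >  3  THEN 'as integers: 200 is greater than 3' END AS IntCompare;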
A book I've been reading on database theory tells me that the SQL standard defines a concept of a domain. For instance, height and width could be two different domains. Although both might be stored as numeric(10,2), a height and a width column could not be compared without casting. This allows for a "type" constraint that is not related to implementation.
I like this idea in general, though since I've never seen it implemented, I don't know what it would be like to use it. I can see that it would reduce the chance of errors when using values whose implementations happen to be the same but whose conceptual domains are quite different. It might also help keep people from comparing cm and inches, for instance.
Constraints are perhaps the most important thing mentioned here. Data types exist to ensure the correctness of your data, so you can be sure you manipulate it correctly. There are two ways we could store a date: in a date type, or as a string such as "4th of January 1893". But the string could also have been "4/1 1893", "1/4 1893", or similar. Data types constrain that and define a canonical form for a date.
Furthermore, a data type has the advantage that it can undergo checks. The string "0th of February 1975" is accepted as a string, but should not be accepted as a date. How about "30th of February 1983"? Poor databases, like MySQL, do not make these checks by default (although you can configure MySQL to do it -- and you should!).
Data types will ensure the consistency of your data. This is one of the most important concepts, as keeping your data sane will spare your head from insanity.
RDBMSs generally require column types to be defined so they can perform lookups quickly. If you want to get the 5th column of every row in a huge dataset, having the columns defined is a huge optimisation.
Instead of scanning each row for some kind of delimiter to find the 5th column (as it would have to if column widths were not fixed), the RDBMS can jump straight to the byte offset given by the combined size of columns 1 through 4 and read sizeof(column5) bytes from there. Imagine how much quicker that is on a table of, say, 10,000,000 rows.
Alternatively, if you don't want to specify the types of each column, you have two options that I'm aware of. Specify each column as a varchar(255) and decide what you want to do with it within the calling program. Or you can use a different database system that uses key-value pairs such as Redis.
A database is ultimately about physical storage, and data types define how that storage is laid out.

Char(4) versus int as StatusID/StatusCode column in a table

I need a status column that will have about a dozen possible values.
Is there any reason why I should choose int (StatusID) over char(4) (StatusCode)?
Since SQL Server doesn't support named constants, char is far more descriptive than int when used in stored procedures and views as a constant.
To clarify, I would still use a lookup table either way, since I will need more descriptive text for the UI. So this decision is only to help me as the developer when I'm maintaining the stored procedures and views.
Right now I'm leaning toward char(4). Especially since designing views in SQL Server Management Studio prevents me from adding comments (I know it's possible to add it in the script editor, but realistically I will use the View Designer far more often, especially if the view is trivial). StateCODE = 'NEW' is much more readable than StateID = 1000.
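To make the trade-off concrete, here is roughly what the two alternative designs look like in a query; the table and column names are invented:

    -- char(4) status code, readable inline:
    SELECT * FROM dbo.Orders WHERE StatusCode = 'NEW';

    -- int surrogate key, needs the lookup table for readability:
    SELECT o.*
    FROM dbo.Orders o
    JOIN dbo.OrderStatus s ON s.StatusID = o.StatusID
    WHERE s.StatusName = 'New';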
I guess the question is: will there be cases where char(4) is problematic? Since the database is pretty small, I'm not too concerned about a slight performance hit (like using tinyint versus int); I'm more afraid of code maintenance problems.
Database purists will say a key should have no meaning in the business domain, and that you should create a status table where you look up the description and other meanings of the status.
But for operators and end users, having a descriptive status code can be a blessing. And it doesn't even have to be char(4), you can make it varchar(20). This allows them to query without joins, and inspect the database in an easier way.
In the end, I think the char(20) organization will run more smoothly and go home earlier on Friday. But the int organization has a better abstraction of the database, and they can enjoy metaprogramming on Friday evening (or boasting on forums).
(All of this assumes that you're writing business support software. One of the more successful business support systems, SAP, makes successful use of meaningful keys.)
There are many pros and cons to each method. I'm sure other arguments will come up in favour of using a char(4). My reasons for choosing an int over a char include:
I always use lookup tables. They allow for an audit trail of the value to be retained and easily examined. For example, if one of your status codes is 'MING' and a business decision is made to change it from 'MING' to 'MONG' from a certain date, my lookup table handles this.
Smaller index - if you need to index this column, it will be thinner.
Extendability - OK, I made that word up, but if you need to go from 4 chars to 5 chars for example, a lookup table would be a blessing.
Descriptions: We use a lot of TLAs here, which is great once you know what they are, but if I gave a business user a report that said "GDA's 2007 1001", they wouldn't necessarily twig that GDA = Good Dead on Arrival. With a lookup table, I can add this description.
Best practice: I can't find the link to hand, but it might be something I read in a K. Tripp article. Aim to make your clustered primary key incrementing integers to optimise the index.
Of course if you are absolutely positive that you will never need any more than a handful of 4 characters, there is no reason not to bang it in the table.
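A sketch of the lookup-table pattern described in this answer, reusing the invented table names from the example in the question above (the ValidFrom/ValidTo columns are just one possible way to keep the audit trail mentioned earlier):

    CREATE TABLE dbo.OrderStatus (
        StatusID   tinyint     NOT NULL CONSTRAINT PK_OrderStatus PRIMARY KEY,
        StatusName varchar(50) NOT NULL,
        ValidFrom  datetime    NOT NULL,
        ValidTo    datetime    NULL
    );

    CREATE TABLE dbo.Orders (
        OrderID  int IDENTITY(1,1) NOT NULL CONSTRAINT PK_Orders PRIMARY KEY,
        StatusID tinyint NOT NULL
            CONSTRAINT FK_Orders_OrderStatus REFERENCES dbo.OrderStatus (StatusID)
    );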
The best approach is a lookup table with defined values, related to the original table that uses that enumeration.
Collation ambiguities are one reason to say no to char(4): does ABcD = abCD = äBCd?
If you have 12 possible values, why not a tinyint/byte and a Status table?
If you have to store the status for 10 million rows, the 3-byte difference and the collation/string comparisons add up.
The place where I've run into this use case is columns that would map onto things I would typically use an Enum for when programming. Do you store the integer value of the Enum or the name of the Enum in the database column? Honestly, I've done it both ways. Usually, I ask myself whether the database will be used outside the application I'm building. If so, I choose the human-readable format to store in the database. If not, I choose the integer value, as it saves a little time when reconstituting the Enum in code (it's just a cast instead of a parse operation).
You could also use a tinyint over an int
I always choose ints, simply because they are easier to map to enums in code.
If you're dealing with huge amounts of data and high throughput then a smallint or tinyint can give better performance and a smaller footprint on the hard disk. If the data in your application is often viewed directly through applications like Access or Cognos then your business people will probably appreciate the descriptive values. I know that when I'm analyzing data as part of my Database Developer role I get tired of joining a lot of lookup tables because I can't remember if 1 = Foo and 2 = Bar or 1 = Bar and 2 = Foo.
Also, although performance will be better if you have to look up rows by these codes (the indexes can be smaller), it can also be hurt (in a minor way) by having to do the joins if you are often looking up rows regardless of the code but have to include the text value. In most applications that's not an issue, though, and would probably only come into play in large data warehousing/reporting environments.

Why use "Y"/"N" instead of a bit field in Microsoft SQL Server?

I'm working on an application developed by another mob and am confounded by the use of a char field instead of bit for all the boolean columns in the database. It uses "Y" for true and "N" for false (these have to be uppercase). The type name itself is then aliased with some obscure name like ybln.
This is very annoying to work with for a lot of reasons, not the least of which is that it just looks downright aesthetically unpleasing.
But maybe it's me that's stupid - why would anyone do this? Is it a database compatibility issue or some design pattern that I am not aware of?
Can anyone enlighten me?
I've seen this practice in older database schemas quite often. One advantage I've seen is that using CHAR(1) fields provides support for more than Y/N options, like "Yes", "No", "Maybe".
Other posters have mentioned that Oracle might have been used. The schema I referred to was in fact deployed on both Oracle and SQL Server. It limited the usage of data types to a common subset available on both platforms.
They did diverge in a few places between Oracle and SQL Server but for the most part they used a common schema between the databases to minimize the development work needed to support both DBs.
Welcome to brownfield. You've inherited an app designed by old-schoolers. It's not a design pattern (at least not a design pattern with something good going for it), it's a vestige of coders who cut their teeth on databases with limited data types. Short of refactoring the DB and lots of code, grit your teeth and gut your way through it (and watch your case)!
Other platforms (e.g. Oracle) do not have a bit SQL type. In which case, it's a choice between NUMBER(1) and a single character field. Maybe they started on a different platform or wanted cross platform compatibility.
I don't like the Y/N char(1) field as a replacement for a bit column either, but there is one major downside to a bit field in a table: you can't create an index on a bit column or include it in a compound index (at least not in SQL Server 2000).
Sure, you could discuss if you'll ever need such an index. See this request on a SQL Server forum.
They may have started development back with Microsoft SQL Server 6.5.
Back then, adding a bit field to an existing table with data in place was a royal pain in the rear. Bit fields couldn't be null, so the only way to add one to an existing table was to create a temp table with all the existing fields of the target table plus the bit field, and then copy the data over, populating the bit field with a default value. Then you had to delete the original table and rename the temp table to the original name. Throw in some foreign key relationships and you've got a long script to write.
Having said that, there were always 3rd-party tools to help with the process. If the previous developer chose to use char fields in lieu of bit fields, the reason, in a nutshell, was probably laziness.
The reasons are as follows (btw, they are not good reasons):
1) Y/N can quickly become "X" (for unknown), "L" (for likely), etc. What I mean by this is that I have personally worked with programmers who were so used to not collecting requirements correctly that they just started with Y/N as a sort of 'flag', with the superstition that it might need to expand (in which case they should use an int as a status ID).
2) "Performance" - but as was mentioned above, SQL indexes are ruled out if they are not 'selective' enough; a field that only has 2 possible values will never use that index.
3) Laziness. Sometimes developers want to output the letter "Y" or "N" directly to some visual display for human readability, and they don't want to convert it themselves :)
Those are all three bad reasons that I've heard/seen before.
I can't imagine any disadvantage in not being able to index a "BIT" column, as it would be unlikely to have enough different values to help the execution of a query at all.
I also imagine that in most cases the storage difference between BIT and CHAR(1) is negligible (is that CHAR an NCHAR? does it store a 16-bit, 24-bit, or 32-bit Unicode char? Do we really care?)
This is terribly common in mainframe files, COBOL, etc.
If you only have one such column in a table, it's not that terrible in practice (no real bit-wasting); after all, SQL Server will not let you write the natural WHERE BooleanColumn -- you have to say WHERE BitColumn = 1, and IF @BitFlag = 1 instead of the far more natural IF @BooleanFlag. When you have multiple bit columns, SQL Server will pack them. The case of the Y/N should only be an issue if a case-sensitive collation is used, and to stop invalid data there is always the option of a constraint.
Having said all that, my personal preference is for bits and only allowing NULLs after careful consideration.
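A sketch of the constraint option mentioned above, next to the equivalent bit column; the table and constraint names are invented:

    CREATE TABLE dbo.Accounts (
        AccountID    int IDENTITY(1,1) NOT NULL CONSTRAINT PK_Accounts PRIMARY KEY,
        IsActiveChar char(1) NOT NULL
            CONSTRAINT CK_Accounts_IsActiveChar CHECK (IsActiveChar IN ('Y', 'N')),
        IsActiveBit  bit NOT NULL
    );

    -- Either way, the predicate still needs an explicit comparison:
    SELECT AccountID FROM dbo.Accounts WHERE IsActiveChar = 'Y';
    SELECT AccountID FROM dbo.Accounts WHERE IsActiveBit = 1;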
Apparently, bit columns aren't a good idea in MySQL.
They probably were used to using Oracle and didn't properly read up on the available datatypes for SQL Server. I'm in exactly that situation myself (and the Y/N field is driving me nuts).
I've seen worse ...
One O/R mapper I had occasion to work with used 'true' and 'false' as they could be cleanly cast into Java booleans.
Also, on a reporting database such as a data warehouse, the DB is the user interface (metadata-based reporting tools notwithstanding). You might want to do this sort of thing as an aid to people developing reports. Also, an index with two values will still get used by index intersection operations on a star schema.
Sometimes such quirks are more associated with the application than the database. For example, handling booleans between PHP and MySQL is a bit hit-and-miss and makes for non-intuitive code. Using CHAR(1) fields and 'Y' and 'N' makes for much more maintainable code.
I don't have any strong feelings either way. I can't see any great benefit to doing it one way over another. I know philosophically the bit fields are better for storage. My reality is that I have very few databases that contain a lot of logical fields in a single record. If I had a lot then I would definitely want bit fields. If you only have a few I don't think it matters. I currently work with Oracle and SQL server DB's and I started with Cullinet's IDMS database (1980) where we packed all kinds of data into records and worried about bits and bytes. While I do still worry about the size of data, I long ago stopped worrying about a few bits.