Fixed or Seldom-changing values in SQL - How to best code them?
Writing some SQL code yesterday, I was about to type some fixed constant values (derisively known as "magic numbers") into my code. Every time I do that, I feel a pang of guilt. It's like knowingly creating technical debt: you just know you're feeding the future questions of "How many places does StatusX = 6 matter in our system?" and "Where might the system break or need to change if we added another status value?"
Answering this "where-used" question is always a scavenger hunt: sometimes fixed values are used directly (StatusX = 6), sometimes in IN clauses (StatusX IN (2,6,7)), and sometimes in ranges (StatusX < 7). You can do a pretty good job finding them, but it's time-consuming. There HAS to be a better way...
Solution: Fixed Value Views
I went in search of this and found two articles in particular (see links below - one of them is here, on StackOverflow) proposing SQL views with fixed values for this. At first I passed over the idea; it didn't click with me right away. But after surveying the alternative approaches and their drawbacks, and then running into this concept a few more times, I revisited it. Upon closer review, I find this approach makes a LOT of sense, has a ton of benefits, and costs next to nothing performance-wise.
The scenario where this approach applies best is where one has a fixed set of values, usually stored in a codes table. These values often have foreign key constraints pointing at them to make sure that dependent data structures only use valid and/or active code values. Traditional referential integrity covers the validity of the data, but it doesn't make it any easier to manage or write SQL code that uses the code values sitting in these mostly static tables. That's where these fixed-value views come in!
I'm reposting this idea here, as I think it has a LOT of merit and could or should be used by many. Glad to share it, and glad to hear of any improvements or variations on it. I also added a bit (literally - see below) to the idea that improves its code-friendliness.
Benefits
Using this method, you can use SQL Server's dependency information to detect where the fixed values exposed by these views have been used. If you ever actually needed to add a value, it's easy, and you quickly know where to check for usage. If you wanted to (gasp) remove a value, you could tell what code it would break, because the view would have to be modified. All in all, this idea is very appealing to me, and my brain no longer feels a bit dirty every time I code a fixed value.
This approach does not suggest doing away with actual codes tables, or using these views instead of them. As much as possible, it is best to have defined codes tables, with referential integrity enforced from the tables that use them. Where the constant views discussed below really shine is in SQL coding: in functions, stored procedures, and so on.
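For example, on SQL Server the "where-used" question reduces to a single query against the built-in dependency DMV. A sketch, using the example view defined below:

-- Lists every module (view, proc, function) that references the constant view
SELECT referencing_schema_name, referencing_entity_name
FROM sys.dm_sql_referencing_entities('dbo.cvw_OrderStatusID', 'OBJECT');

Dependency tracking only sees objects, which is exactly why exposing the fixed values through a view makes them findable.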
Example View
CREATE VIEW dbo.cvw_OrderStatusID AS
SELECT
     CAST(1 AS bit) AS [Match Exists] -- Helper column for LEFT OUTER JOINs to check for record presence
    ,CAST(1 AS int) AS [Draft]
    ,CAST(2 AS int) AS [New]
    ,CAST(3 AS int) AS [Open]
    ,CAST(4 AS int) AS [Delivered]
Basic Usage
Here's an example query that looks for Orders that are in a status that should still allow them to be edited:
-- Use INNER JOIN to select 1 to n values
SELECT
    O.OrderNumber,
    O.CustomerNumber
FROM
    dbo.[Order] O
    INNER JOIN dbo.cvw_OrderStatusID OS_AllowEdit
        ON O.OrderStatusID IN (OS_AllowEdit.[Draft], OS_AllowEdit.[New], OS_AllowEdit.[Open])
-- Use CROSS APPLY to pass a fixed value to a table-valued function, for example:
SELECT orders.*
FROM dbo.cvw_OrderTypeID OrderTypes
CROSS APPLY dbo.fn_GetOrders(OrderTypes.[Express], @this, @that, @theother) orders
ORDER BY orders.DateSold
Small improvement on the idea - [Match Exists]
To make it more evident and easy to test when you are checking against multiple values for a match, add a field called [Match Exists] to your constant views: a bit column fixed at 1 (true).
When you use your SQL constant views, use an INNER JOIN where the data must match, or a LEFT OUTER JOIN where the data may or may not match one or more of the values. Structure your SQL like this:
SELECT
    IIF(OS_OrderDelivered.[Match Exists] = 1, O.DeliveredDate, NULL) AS OrderDeliveryDate,
    ISNULL(OS_AllowEdit.[Match Exists], 0) AS AllowEditing
FROM
    dbo.[Order] O
    LEFT OUTER JOIN dbo.cvw_OrderStatusID OS_OrderDelivered
        ON O.OrderStatusID = OS_OrderDelivered.[Delivered]
    LEFT OUTER JOIN dbo.cvw_OrderStatusID OS_AllowEdit
        ON O.OrderStatusID IN (OS_AllowEdit.[Draft], OS_AllowEdit.[New], OS_AllowEdit.[Open])
Giving credit where it is due:
For the full treatment, read these posts, as they go deeper into performance considerations.
https://learn.microsoft.com/en-us/archive/blogs/sql_server_appendix_z/sql-server-variables-parameters-or-literals-or-constants
https://www.red-gate.com/hub/product-learning/sql-prompt/misuse-scalar-user-defined-function-constant-pe017
Related: Performance of string comparison vs int join in SQL
Yesterday I was looking at queries like this:
SELECT <some fields>
FROM Thing
WHERE thing_type_id = 4
... and couldn't help but think this wasn't very "readable". What's '4'? What does it mean? I did the same thing in coding languages before, but now I would use constants for this, turning the 4 into a THING_TYPE_AVAILABLE or some such name. No more arcane numbers with no meaning!
I asked about this on here and got answers as to how to achieve this in SQL.
I'm mostly partial to using JOINS with existing type tables where you have an ID and a Code, with other solutions possibly of use when there are no such tables (not every database is perfect...)
SELECT thing_id
FROM Thing
JOIN ThingType USING (thing_type_id)
WHERE thing_type_code IN ('OPENED', 'ONHOLD')
So I started using this in a query or two, and my colleagues were soon upon me: "Hey, you have literal codes in the query!" "Um, you know, we usually go with PKs for that."
While I can understand that this method is not the usual method (hey, it wasn't for me either until now), is it really so bad?
What are the pros and cons of doing things this way? My main goal was readability, but I'm worried about performance and would like to confirm whether the idea is sound or not.
EDIT: Note that I'm not talking about PL/SQL but straight-up queries, the kind that usually starts with a SELECT.
EDIT 2:
To further clarify my situation with fake (but structurally similar) examples, here are the tables I have:
Thing
---------------------------------------------
thing_id | <attributes...> | thing_type_id
---------+-----------------+---------------
    1    |       ...       |       3
    4    |       ...       |       7
    5    |       ...       |       3

ThingType
---------------------------------------------------
thing_type_id | thing_type_code | <attributes...>
--------------+-----------------+-----------------
      3       | 'TYPE_C'        |       ...
      5       | 'TYPE_E'        |       ...
      7       | 'TYPE_G'        |       ...
thing_type_code is just as unique as thing_type_id. It is currently also used as a display string, which is a mistake in my opinion, but would be easily fixable by adding a thing_type_label field duplicating thing_type_code for now, and changeable at any time later on if needed.
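That fix could be as small as this (a sketch; thing_type_label is the hypothetical column just mentioned):

-- Add a display label, seeded from the code for now and changeable later
ALTER TABLE ThingType ADD thing_type_label VARCHAR(50);
UPDATE ThingType SET thing_type_label = thing_type_code;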
Presumably, filtering on thing_type_code = 'TYPE_C', I'm sure to get exactly the one row that happens to be thing_type_id = 3. Joins can (and quite probably should) still be done on the numeric IDs.
Primary key values should not be coded as literals in queries.
The reasons are:
Relational theory says that PKs should not convey any meaning, not even a specific identity; they should be strictly row identifiers and should not be relied upon to hold any particular value
Due to operational reasons, PKs are often different in different environments (like dev, qa and prod), even for "lookup" tables
For these reasons, coding literal IDs in queries is brittle.
Coding data literals like 'OPENED' and 'ONHOLD' is GOOD practice, because these values are going to be consistent across all servers and environments. If they do change, changing queries to be in sync will be part of the change script.
I assume that the question is about the two versions of the query -- one with the numeric comparison and the other with the join and string comparison.
Your colleagues are correct that the form with WHERE thing_type_id IN (list of ids) will perform better than the join. The difference in performance, however, might be quite minor if thing_type_id is not indexed, since the query already requires a full table scan of the original table.
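Concretely, the two forms being compared look like this (a sketch using the tables and values from EDIT 2 above):

-- Version 1: filter directly on the magic numbers
SELECT thing_id
FROM Thing
WHERE thing_type_id IN (3, 7);

-- Version 2: join to the reference table and filter on the codes
SELECT t.thing_id
FROM Thing t
JOIN ThingType tt ON tt.thing_type_id = t.thing_type_id
WHERE tt.thing_type_code IN ('TYPE_C', 'TYPE_G');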
In most other respects, your version with the join is better. In particular, it makes the intent of the query clearer and makes the query more maintainable overall. For a small reference table, the performance hit may not be noticeable. In fact, in some databases this form could even be faster: that happens when the IN is evaluated as a series of OR expressions, in which case, for a long list, an index lookup through the join can win.
There is one downside to the join approach: if the code values change, then the queries also need to change. I wouldn't be surprised if the colleague who suggests using primary keys has had this experience. He or she builds an application using joins. Great. Lots of code. All clear. All maintainable. Then, every week, the users decide to change the definitions of the codes. That can make almost any sane person prefer primary keys over the reference table.
See Mark's comment. I assume you're OK, but I can give my 2 cents on the matter.
If the value is in the scope of a single query, I like to write it this readable way:
DECLARE @HOLD int = 4;

SELECT <some fields>
FROM Thing
WHERE thing_type_id = @HOLD;
If the values are used many times in many places (queries, stored procedures, views, etc.), I create a domain table.
CREATE TABLE ThingType (id int NOT NULL PRIMARY KEY, description varchar(50));
GO
INSERT INTO ThingType VALUES (4, 'HOLD'), (5, 'ONHOLD');
GO
That way I can reuse those types in my SELECTs like an enumerator:
DECLARE @TYPE int;
SET @TYPE = (SELECT id FROM ThingType WHERE description = 'HOLD');

SELECT <some fields>
FROM Thing
WHERE thing_type_id = @TYPE;
That way I keep both meaning and performance (and can also enforce referential integrity over the domain values).
I can also just use an enumerator at the app level and pass numeric values to the queries. A quick glimpse at that enumerator will give me the number's meaning.
In SQL queries, JOINs definitely introduce a performance hit (effectively multiple lookups take place inside the SQL server). The question is whether the hit is significant enough to offset the benefits.
If it's just a readability thing, then you may prefer to go for better performance and avoid the JOINs. But take potential integrity problems into account (e.g., what happens if the typed value of 4 in your example is changed by another process further down the line? The entire application may fail).
If the values will NEVER change, then use PKs. This is a decision for you as the developer; there is no rule. One option may be best for one query and not for another.
In case of PL/SQL it makes sense to define constants in your package, e.g.
DECLARE
C_OPENED CONSTANT NUMBER := 3;
C_ONHOLD CONSTANT NUMBER := 4;
BEGIN
SELECT <some fields>
INTO ...
FROM Thing
WHERE thing_type_id in (C_OPENED, C_ONHOLD);
END;
Sometimes it is useful to create a global package (without a body) where all commonly used constants are defined. If a literal changes, you only have to modify the constant definition in a single place.
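A minimal sketch of such a bodiless package spec, assuming the names from the example above:

CREATE OR REPLACE PACKAGE app_constants AS
  C_OPENED CONSTANT NUMBER := 3;
  C_ONHOLD CONSTANT NUMBER := 4;
END app_constants;
/

Any PL/SQL block can then reference app_constants.C_OPENED in its queries, and a changed literal is fixed in exactly one place.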
Of course this is not much of a factor for most small transactions, but with larger data sets, assuming there are useful indexes in place, is it faster to do an INSERT using a JOIN instead of using CASE?
Example using CASE (in T-SQL):
INSERT INTO Foo (CarID, CarMake)
SELECT CarID,
       CASE WHEN CarModel = 'Odyssey' THEN 'Honda'
            WHEN CarModel = 'Sienna'  THEN 'Toyota'
       END AS CarMake
FROM FooCar;
Example using JOIN:
INSERT INTO Foo (CarID, CarMake)
SELECT A.CarID, B.CarMake
FROM FooCar AS A
JOIN FooMake AS B ON A.CarModel = B.CarModel;
Which of the two statements would be faster? Please look past the sample data, which does not represent the problem very well in terms of magnitude.
I can't find another question like this, but it seems like something that should be asked and answered already.
The case statement would, in general, be faster. The join is better in most other respects. The purpose of having reference tables is to avoid having to hard-code specific values in the code. It is better to use the reference table.
The performance difference would generally be very small, especially if FooMake is well designed and has an index on CarModel. You should not be looking for such micro-optimizations on queries. Strive first to get the query to work correctly, using reasonable SQL, and then move toward performance.
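If that index is missing, it is cheap to add (a sketch; the index name and INCLUDE column are assumptions based on the question's tables):

-- Covering index: the join resolves CarModel -> CarMake from the index alone
CREATE INDEX IX_FooMake_CarModel ON FooMake (CarModel) INCLUDE (CarMake);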
The two versions are only equivalent if every CarModel in FooCar appears exactly once in FooMake.
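To mimic the CASE version's behavior of producing NULL when no match exists, rather than dropping the row, a LEFT JOIN variant would do. A sketch:

INSERT INTO Foo (CarID, CarMake)
SELECT A.CarID, B.CarMake -- NULL when the model has no row in FooMake, like CASE with no matching WHEN
FROM FooCar AS A
LEFT JOIN FooMake AS B ON A.CarModel = B.CarModel;
-- Note: a model appearing twice in FooMake would still duplicate rows.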
Hard-coded query values (like the string literals in your CASE statement) get translated into in-memory machine-language commands and will always be faster than the disk I/O required to perform the join.
The reason you have the table instead of CASE statements is maintainability not speed. What if you have the CASE statement act on two fields instead of one? Or add some branching logic? Hard-coding your queries for speed is like racing across the frozen tundra on a snowmobile which flips over, trapping you underneath. At night, the ice-weasels come.
Every time a database diagram gets looked at, one area people are critical of is inner joins. They look at them hard and ask questions to see if an inner join really needs to be there.
Simple Library Example:
A many-to-many relationship is normally defined in SQL with three tables: Book, Category, BookCategory.
In this situation, Category is a table that contains two columns: ID, CategoryName.
In this situation, I have gotten questions about the Category table: is it needed? Could it be dropped as a lookup table, with the BookCategory table storing the CategoryName instead of the CategoryID, to avoid having to do an additional INNER JOIN? (For this question, we are going to ignore changing or deleting any CategoryNames.)
The question is, what is so bad about inner joins? At what point is doing them a negative thing (general guidelines like # of transactions, # of records, # of joins in a statement, etc)?
Your example is a good counterexample. How do you rename categories if they're spread throughout the various rows of the BookCategory table? Your UPDATE to do the rename would touch all the rows in the same category.
With the separate table, you only have to update one row. There is no duplicate information.
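In code, the difference is stark (a sketch; the category name and ID value are hypothetical):

-- With a separate Category table, a rename touches exactly one row:
UPDATE Category SET CategoryName = 'Speculative Fiction' WHERE ID = 42;

-- With the name duplicated into BookCategory, every row in the category
-- must change, and any row missed by a typo is silently stranded:
UPDATE BookCategory SET CategoryName = 'Speculative Fiction'
WHERE CategoryName = 'Science Fiction';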
I would be more concerned about OUTER joins, and the potential to pick up info that wasn't intended.
In your example, having the Category table means that a book is limited to being filed under a preset Category (via a foreign key relationship); if you just shoved multiple entries into the BookCategory table, it would be harder to limit what is selected for the Category.
Doing an INNER join is not so bad, it is what databases are made for. The only time it is bad is when you are doing it on a table or column that is inadequately indexed.
I am not sure there is something wrong with inner joins per se; it is like how each IF you add to your code impacts performance (or should I say every line...), and yet you need a minimum number of them to make your system work (yes, yes, I know about Turing machines).
So if you have something that is not needed, it will be frowned upon.
When you map your domain model onto the relational model, you have to split the information across multiple relations in order to get a normalized model; there is no way around that. And then you have to use joins to combine the relations again and get your information back. The only bad thing about this is that joins are relatively expensive.
The other option would be not to normalize your relational model. This will fill your database with much redundant data, give you many opportunities to turn your data inconsistent and make updates a nightmare.
The only reason not to normalize a relational model (I can think of at the moment) is that reading performance is extremely - and I mean extremely - critical.
By the way, why do you (they) only mention inner joins? How are left, right, and full outer joins significantly different from inner joins?
Nobody can offer much about general guidelines - they'd be specific to the server, hardware, database design, and expectations... way too many variables.
Specifically about INNER JOINs being inefficient or bad... JOINs are the center of relational DBs, and they've been around for decades. They're only wrong when you use them wrong; obviously somebody is doing them right, since they're not extinct yet. Personally, I'd assume anyone throwing out blanket statements like that either doesn't know SQL or knows just enough to get in trouble. Next time it comes up, teach them how to use the query cache.
(And that's without even mentioning updates and deletes: the increased maintainability from keeping humans and their typos out of the data can easily be worth at least 10x the time a join takes.)
Sometimes when I'm writing moderately complex SELECT statements with a few JOINs, the wrong key columns get used in a JOIN condition yet still return valid-looking results.
Because the auto-numbering values (especially early in development) all tend to fall in similar ranges (sub-100s or so), the SELECT still produces some results. These results often look valid at first glance, and the problem is not detected until much, much later, making debugging much harder because familiarity with the data structures and code has gone stale in the dev's mind.
I just spent several hours tracking down yet another instance of this issue, one I've run into too many times before. I name my tables and columns carefully and write my SQL statements methodically, but this is an issue I can't seem to completely avoid. It comes back and bites me for hours of productivity about twice a year, on average.
My question is: has anyone come up with a clever method for avoiding what I assume is a common SQL bug/mistake?
I have thought of auto-numbering with different starting values per table, but this feels kludgy and would get ugly to keep straight for data models with dozens of tables... Any better ideas?
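For what it's worth, that kludge would look like this in T-SQL (hypothetical tables): each table gets a distinct identity seed, so a join on the wrong key column produces no matches instead of plausible-looking ones.

CREATE TABLE Patient  (PatientId  int IDENTITY(10000, 1) PRIMARY KEY, FullName     nvarchar(100) NOT NULL);
CREATE TABLE Facility (FacilityId int IDENTITY(20000, 1) PRIMARY KEY, FacilityName nvarchar(100) NOT NULL);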
P.S.
I am very careful and methodical in naming my tables and columns. The Patient table gets a PatientId column, Facility gets a FacilityId, etc. The issue tends to arise when join tables are involved and the linkage takes on extra meaning, such as RelatedPatientId, ReferingPatientId, FavoriteItemId, etc.
When writing long, complex SELECT statements, try to limit the result to one record at first.
For instance, assume you have this gigantic, enormous, awesome CMS system, and you have to write internal reports because the reports that come with it are horrendous. You notice that there are about 500 tables, and your SELECT statement joins 30 of them. Limit the row count with a WHERE clause.
My advice: rather than getting all this code written and generalized for all cases up front, break the problem up. Use WHERE to limit the result to a single record, check all the fields, and if they look OK, let your code return more rows. Only after further checking should you generalize.
This bites a lot of us who keep adding more and more joins until the output looks OK, but only after Joe Blow the accountant runs the report does he realize that the PO for 4 million was really the telephone bill for the entire year. Somehow that join got messed up!
One option would be to use your natural keys.
More practically, Red Gate SQL Prompt picks the FK columns for me.
I also tend to build up one JOIN at a time to see how things look.
If you have a visualization or diagramming tool for your SQL statements, you can follow the joins visually, and any errors will become immediately apparent, provided you have followed a sensible naming scheme for your primary and foreign keys.
Your column names should take care of this unless you named them all "ID". Are you writing multiple select statement using the same tables? You may want to create views for the more common ones.
If you're using SQL Server, you can use GUID columns as primary keys (that's what we do). You won't have problems with collisions again.
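A sketch of what that looks like in T-SQL (hypothetical table, reusing the Patient example from above; NEWSEQUENTIALID() is the insert-friendly variant of NEWID() and is only usable in a default constraint):

CREATE TABLE Patient (
    PatientId uniqueidentifier NOT NULL
        CONSTRAINT DF_Patient_PatientId DEFAULT NEWSEQUENTIALID()
        CONSTRAINT PK_Patient PRIMARY KEY,
    FullName nvarchar(100) NOT NULL
);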
You could use GUIDs as your primary keys, but it has its pros and cons.
This pro is actually not mentioned on that page.
I have never tried doing this myself - I use a tool on top of SQL that makes incorrect joins very unlikely, so I don't have this problem. I just thought I'd mention it as another option though!
For IDs, use TableNameID; for example, for table Person, use PersonID.
Use a DB model and look at the diagram when writing queries.
This way join looks like:
... ON p.PersonID = d.PersonID
as opposed to:
... ON p.ID = d.ID
Auto-increment integer PKs are among your best friends.
I've started a new project, and they have a very normalized database. Everything that can be a lookup is stored as a foreign key to a lookup table. This is normalized and fine, but I end up doing 5-table joins for the simplest queries.
from va in VehicleActions
join vat in VehicleActionTypes on va.VehicleActionTypeId equals vat.VehicleActionTypeId
join ai in ActivityInvolvements on va.VehicleActionId equals ai.VehicleActionId
join a in Agencies on va.AgencyId equals a.AgencyId
join vd in VehicleDescriptions on ai.VehicleDescriptionId equals vd.VehicleDescriptionId
join s in States on vd.LicensePlateStateId equals s.StateId
where va.CreatedDate > DateTime.Now.AddHours(-DateTime.Now.Hour)
select new {va.VehicleActionId,a.AgencyCode,vat.Description,vat.Code,
vd.LicensePlateNumber,LPNState = s.Code,va.LatestDateTime,va.CreatedDate}
I'd like to recommend that we denormalize some stuff, like the state code. I don't see the state codes changing in my lifetime. It's a similar story with the 3-letter agency code: these are handed out by the agency of agencies and will never change.
When I approached the DBA with the state code issue and the 5-table joins, I got the response that "we are normalized" and that "joins are fast".
Is there a compelling argument to denormalize? I'd do it for sanity if nothing else.
The same query in T-SQL:
SELECT VehicleAction.VehicleActionID
, Agency.AgencyCode AS ActionAgency
, VehicleActionType.Description
, VehicleDescription.LicensePlateNumber
, State.Code AS LPNState
, VehicleAction.LatestDateTime AS ActionLatestDateTime
, VehicleAction.CreatedDate
FROM VehicleAction INNER JOIN
VehicleActionType ON VehicleAction.VehicleActionTypeId = VehicleActionType.VehicleActionTypeId INNER JOIN
ActivityInvolvement ON VehicleAction.VehicleActionId = ActivityInvolvement.VehicleActionId INNER JOIN
Agency ON VehicleAction.AgencyId = Agency.AgencyId INNER JOIN
VehicleDescription ON ActivityInvolvement.VehicleDescriptionId = VehicleDescription.VehicleDescriptionId INNER JOIN
State ON VehicleDescription.LicensePlateStateId = State.StateId
WHERE VehicleAction.CreatedDate >= FLOOR(CAST(GETDATE() AS float))
I don't know if I would even call what you want to do denormalization -- it looks more like you just want to replace artificial foreign keys (StateId, AgencyId) with natural foreign keys (State Abbreviation, Agency Code). Using varchar fields instead of integer fields will slow down join/query performance, but (a) if you don't even need to join the table most of the time because the natural FK is what you want anyway it's not a big deal and (b) your database would need to be pretty big/have a high load for it to be noticeable.
But djna is correct in that you need a complete understanding of current and future needs before making a change like this. Are you SURE the three letter agency codes will never change, even five years from now? Really, really sure?
Some denormalization may be needed for performance (and sanity) reasons at times. Hard to tell without seeing all your tables, needs, etc...
But why not just build a few convenience views (to do a few joins) and then use these to be able to write simpler queries?
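For example, a convenience view over the five-table join from the question might look like this (a sketch reusing the question's names):

CREATE VIEW dbo.vw_VehicleActionDetail AS
SELECT va.VehicleActionId,
       a.AgencyCode,
       vat.Description,
       vat.Code,
       vd.LicensePlateNumber,
       s.Code AS LPNState,
       va.LatestDateTime,
       va.CreatedDate
FROM VehicleAction va
INNER JOIN VehicleActionType vat ON va.VehicleActionTypeId = vat.VehicleActionTypeId
INNER JOIN ActivityInvolvement ai ON va.VehicleActionId = ai.VehicleActionId
INNER JOIN Agency a ON va.AgencyId = a.AgencyId
INNER JOIN VehicleDescription vd ON ai.VehicleDescriptionId = vd.VehicleDescriptionId
INNER JOIN State s ON vd.LicensePlateStateId = s.StateId;

Day-to-day queries then shrink to a single-table SELECT against the view, with the normalized schema untouched underneath.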
Beware of wanting to shape things to your current idioms. Right now the unfamiliar code seems unwieldy and obstructive to your understanding. In time, it's possible that you will become acclimatised.
If current (or known future) requirements, such as performance are not being met then that's a whole different issue. But remember anything can be performance tuned, the objective is not to make things as fast as possible, but to make them fast enough.
This previous post dealt with a similar issue to the one you're having. Hopefully it will be helpful to you.
Dealing with "hypernormalized" data
My own personal take on normalization is to normalize as much as possible, and denormalize only for performance. And even denormalization for performance is something to avoid; I'd go the route of profiling, setting correct indexes, etc., before I'd denormalize.
Sanity... That's overrated. Especially in our profession.
Well, what about the performance? If the performance is okay, just make the five table JOIN into a view and, for sanity, SELECT from the view when you need the data.
State abbreviations are one of the cases in which I think meaningful keys are okay. For very simple lookup tables with a limited number of rows and where I'm in complete control of the data (meaning it's not populated from some outside source) I'll sometimes create meaningful four or five character keys so that the key value can proxy for the fully descriptive lookup value in some queries.
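A sketch of that kind of table (columns assumed; for US states the meaningful key is simply the two-letter abbreviation):

CREATE TABLE State (
    StateCode char(2)     NOT NULL PRIMARY KEY, -- e.g. 'AZ'; the meaningful key itself
    StateName varchar(50) NOT NULL
);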
Create a view (or an inline table-valued function, to get parameterization). In any case, I usually put all my code into SPs (some of it code-generated), whether they use views or not, and that's that: you pretty much only ever write the join once.
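A sketch of the inline table-valued function flavor (hypothetical function name, using the VehicleAction table from the question):

CREATE FUNCTION dbo.fn_VehicleActionsSince (@since datetime)
RETURNS TABLE
AS RETURN
(
    -- Parameterized like a proc, composable like a view
    SELECT va.VehicleActionId, va.AgencyId, va.VehicleActionTypeId,
           va.LatestDateTime, va.CreatedDate
    FROM dbo.VehicleAction va
    WHERE va.CreatedDate >= @since
);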
An argument (for this "normalization") that the three-letter codes might change isn't very compelling without a plan for what you will do if the codes do change, and how your artificial-key scenario will address this eventuality better than using the codes as keys. Unless you've implemented a fully temporal schema (which is horribly difficult to do and not suggested by your example), it's not obvious to me how your normalization benefits you at all. Now if you work with agencies from multiple sources and standards that might have colliding code names, or if "state" might eventually mean a two-letter code for state, province, department, canton, or estado, that's another matter. You then need your own keys or you need a two-column key with more information than that code.