Is using JOINs to avoid numerical IDs a bad thing? [duplicate] - sql

This question already has answers here:
Performance of string comparison vs int join in SQL
(5 answers)
Closed 9 years ago.
Yesterday I was looking at queries like this:
SELECT <some fields>
FROM Thing
WHERE thing_type_id = 4
... and couldn't help but think this was not very readable. What's '4'? What does it mean? I did the same thing in programming languages before, but now I would use constants for this, turning the 4 into a THING_TYPE_AVAILABLE or some such name. No arcane number with no meaning anymore!
I asked about this on here and got answers as to how to achieve this in SQL.
I'm mostly partial to using JOINS with existing type tables where you have an ID and a Code, with other solutions possibly of use when there are no such tables (not every database is perfect...)
SELECT thing_id
FROM Thing
JOIN ThingType USING (thing_type_id)
WHERE thing_type_code IN ('OPENED', 'ONHOLD')
So I started using this on a query or two and my colleagues were soon upon me: "hey, you have literal codes in the query!" "Um, you know, we usually go with pks for that".
While I can understand that this method is not the usual method (hey, it wasn't for me either until now), is it really so bad?
What are the pros and cons of doing things this way? My main goal was readability, but I'm worried about performance and would like to confirm whether the idea is sound or not.
EDIT: Note that I'm not talking about PL/SQL but straight-up queries, the kind that usually starts with a SELECT.
EDIT 2:
To further clarify my situation with fake (but structurally similar) examples, here are the tables I have:
Thing
------------------------------------------
thing_id | <attributes...> | thing_type_id
1 3
4 7
5 3
ThingType
--------------------------------------------------
thing_type_id | thing_type_code | <attributes...>
3 'TYPE_C'
5 'TYPE_E'
7 'TYPE_G'
thing_type_code is just as unique as thing_type_id. It is currently also used as a display string, which is a mistake in my opinion, but would be easily fixable by adding a thing_type_label field duplicating thing_type_code for now, and changeable at any time later on if needed.
Supposedly, filtering with thing_type_code = 'TYPE_C', I'm sure to get that one line which happens to be thing_type_id = 3. Joins can (and quite probably should) still be done with the numerical IDs.

Primary key values should not be coded as literals in queries.
The reasons are:
Relational theory says that PKs should not convey any meaning. Not even a specific identity. They should be strictly row identifiers and not relied upon to be a specific value
Due to operational reasons, PKs are often different in different environments (like dev, qa and prod), even for "lookup" tables
For these reasons, coding literal IDs in queries is brittle.
Coding data literals like 'OPENED' and 'ONHOLD' is GOOD practice, because these values are going to be consistent across all servers and environments. If they do change, changing queries to be in sync will be part of the change script.
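A minimal sketch of the brittleness argument, using Python's sqlite3 module in place of whatever DBMS is actually in use. The two "environments", the CLOSED code, the name column and the specific ID values are all made up for illustration; only the Thing/ThingType shape and the OPENED/ONHOLD codes come from the question.

```python
import sqlite3

def make_env(type_rows):
    """Build an in-memory database whose ThingType IDs are environment-specific."""
    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE ThingType (thing_type_id INTEGER PRIMARY KEY,
                                thing_type_code TEXT NOT NULL UNIQUE);
        CREATE TABLE Thing (thing_id INTEGER PRIMARY KEY,
                            name TEXT NOT NULL,
                            thing_type_id INTEGER NOT NULL
                                REFERENCES ThingType(thing_type_id));
    """)
    con.executemany("INSERT INTO ThingType VALUES (?, ?)", type_rows)
    ids = {code: type_id for type_id, code in type_rows}
    # Same logical data everywhere: a is OPENED, b is CLOSED, c is ONHOLD.
    con.executemany(
        "INSERT INTO Thing (name, thing_type_id) VALUES (?, ?)",
        [("a", ids["OPENED"]), ("b", ids["CLOSED"]), ("c", ids["ONHOLD"])],
    )
    return con

BY_CODE = """
    SELECT name FROM Thing
    JOIN ThingType USING (thing_type_id)
    WHERE thing_type_code IN ('OPENED', 'ONHOLD')
    ORDER BY name
"""
# Literal IDs that happen to be right in the hypothetical dev environment.
BY_ID = "SELECT name FROM Thing WHERE thing_type_id IN (3, 4) ORDER BY name"

dev = make_env([(3, "OPENED"), (4, "ONHOLD"), (5, "CLOSED")])
prod = make_env([(3, "CLOSED"), (4, "OPENED"), (7, "ONHOLD")])

dev_by_code = [r[0] for r in dev.execute(BY_CODE)]
prod_by_code = [r[0] for r in prod.execute(BY_CODE)]
dev_by_id = [r[0] for r in dev.execute(BY_ID)]
prod_by_id = [r[0] for r in prod.execute(BY_ID)]
```

The code-based query selects the same things in both databases, while the ID-based one silently picks up a CLOSED row in the second.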

I assume that the question is about the two versions of the query -- one with the numeric comparison and the other with the join and string comparison.
Your colleagues are correct that the form with where thing_type_id in (list of ids) will perform better than the join. The difference in performance, however, might be quite minor if thing_type_id is not indexed, because the query will already require a full table scan on the original table.
In most other respects, your version with the join is better. In particular, it makes the intent of the query clearer and makes the query more maintainable overall. For a small reference table, the performance hit may not be noticeable. In fact, in some databases, this form could be faster. This would occur when the in is evaluated as a series of or expressions. If the list is long, it might be faster to do an index lookup.
There is one downside to the join approach. If the values in the columns change, then the code also needs to be changed. I wouldn't be surprised if your colleague who suggests using primary keys has had this experience. S/he is working on an application and builds it using joins. Great. Lots of code. All clear. All maintainable. Then every week, the users decide to change the definitions of the codes. That can make almost any sane person prefer primary keys over using the reference table.

See Mark's comment. I assume you are OK, but I can give my 2 cents on the matter.
If the value is in the scope of one query, I like to write it this readable way:
DECLARE @HOLD int = 4
SELECT <some fields>
FROM Thing
WHERE thing_type_id = @HOLD
If those values are used many times in many places (queries, SPs, views, etc.),
I create a domain table.
create table ThingType (id int not null primary key, description varchar(50))
GO
insert into ThingType values (4,'HOLD'),(5, 'ONHOLD')
GO
That way I can reuse those types in my selects as an enumerator:
DECLARE @TYPE int
SET @TYPE = (select id from ThingType where description = 'HOLD')
SELECT <some fields>
FROM Thing
WHERE thing_type_id = @TYPE
That way I keep meaning and performance (and can also enforce relational integrity over domain values).
I can also just use an enumerator at the app level and pass numeric values to the queries. A quick glance at that enumerator will tell me what a number means.
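The "look the ID up once, then filter by it" idea can be sketched from application code as well. This uses Python's sqlite3 rather than SQL Server, and the sample Thing rows are invented; the ThingType values mirror the insert statement in this answer.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE ThingType (id INTEGER PRIMARY KEY,
                            description TEXT NOT NULL UNIQUE);
    CREATE TABLE Thing (thing_id INTEGER PRIMARY KEY,
                        thing_type_id INTEGER NOT NULL REFERENCES ThingType(id));
    INSERT INTO ThingType VALUES (4, 'HOLD'), (5, 'ONHOLD');
    INSERT INTO Thing VALUES (1, 4), (2, 5), (3, 4);  -- made-up sample rows
""")

# Resolve the readable name to its numeric key once...
(hold_id,) = con.execute(
    "SELECT id FROM ThingType WHERE description = ?", ("HOLD",)
).fetchone()

# ...then filter the big table on the integer column only, with no join.
held = [
    row[0]
    for row in con.execute(
        "SELECT thing_id FROM Thing WHERE thing_type_id = ? ORDER BY thing_id",
        (hold_id,),
    )
]
```

The main query never mentions the number 4, yet it still compares integers only.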

In SQL queries you will definitely introduce a performance hit for JOINs (effectively multiple queries are taking place inside the SQL server). The question is whether the performance hit is significant enough to offset the benefits.
If it's just a readability thing then you may prefer to go for better performance and avoid the JOINs, but I would suggest you take into account potential integrity problems (e.g. what happens if the typed value of 4 in your example is changed by another process further down the line - the entire application may fail).
If the values will NEVER change then use PKs - this is a decision for you as the developer - there is no rule. One option may be best for one query and not for another.

In case of PL/SQL it makes sense to define constants in your package, e.g.
DECLARE
C_OPENED CONSTANT NUMBER := 3;
C_ONHOLD CONSTANT NUMBER := 4;
BEGIN
SELECT <some fields>
INTO ...
FROM Thing
WHERE thing_type_id in (C_OPENED, C_ONHOLD);
END;
Sometimes it is useful to create a global package (without a body) where all commonly used constants are defined. If a literal changes, you only have to modify the constant definition in a single place.

Related

Storing flags in SQL column, and indexing them

I need to store a set of flags that are related to an entity into database. Flags might not be the best word because these are not binary information (on/off), but rather a to-be-defined set of codes.
Normally, you would store each piece of information (say, each flag value) in a distinct column, but I'm exploring alternatives to the one-column-per-attribute layout to prevent a dramatic increase in column mappings. Since each flag is valid for each attribute of an entity, for large entities that intrinsically require a large number of columns the total number of columns may grow as 2n.
Eventually, these codes can be mapped to a positional string.
I'm thinking about something like: 02A not being interpreted as dec 42 but rather as:
Flag 0 in position 1 (or zero if you prefer...)
Flag 2 in position 2
Flag A in position 3
Data formatted in such a way can be easily processed by high-level programming languages. PL/SQL is out of the scope of the question, since all these values are supposed to be processed by Java.
Now the real problem
One of my specs is to optimize searching. I have been asked to find an efficient way to search for entities that show a certain flag (or a special 0 flag) in a given position.
Normally, in SQL, given the RDBMS-specific substring function, you would
SELECT * FROM ENTITIES WHERE SUBSTRING(FLAGS,{POSITION},1) = {VALUE};
This works, but I'm afraid it may be a little slow on all platforms but Oracle, which, AFAIK, supports creating secondary indexes mapped to a substring.
However, my solution must work in MySQL, Oracle, SQL Server and DB2 thanks to Hibernate.
Given such a design, is there some, possibly cross-platform, indexing strategy that I'm missing?
If performance is an issue, I would go for a different model here.
Say, a table that stores entities and a 1->N relation to another table (say, a flags table: entId (fk), flag, position), and this table would have an index on flag and position.
The issue here would be getting these flags back into a single column, which can be done in Java or even in the database (but it would be difficult to write a cross-platform query for this).
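A rough sketch of that 1->N flags model, run against SQLite from Python purely for illustration. The table, column and index names are assumptions, and the sample flag strings other than 02A are invented.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE entities (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE entity_flags (
        ent_id   INTEGER NOT NULL REFERENCES entities(id),
        position INTEGER NOT NULL,
        flag     TEXT    NOT NULL,
        PRIMARY KEY (ent_id, position)
    );
    -- Plain composite index: no substring or function-based index needed.
    CREATE INDEX idx_flags_pos_flag ON entity_flags (position, flag);
""")

def store(ent_id, name, flags):
    """Split a positional string such as '02A' into one row per position."""
    con.execute("INSERT INTO entities VALUES (?, ?)", (ent_id, name))
    con.executemany(
        "INSERT INTO entity_flags VALUES (?, ?, ?)",
        [(ent_id, pos, ch) for pos, ch in enumerate(flags, start=1)],
    )

def find(position, value):
    """Entities that show `value` at `position`, via an ordinary equality search."""
    rows = con.execute(
        "SELECT ent_id FROM entity_flags "
        "WHERE position = ? AND flag = ? ORDER BY ent_id",
        (position, value),
    )
    return [r[0] for r in rows]

def flags_of(ent_id):
    """Rebuild the positional string in application code."""
    rows = con.execute(
        "SELECT flag FROM entity_flags WHERE ent_id = ? ORDER BY position",
        (ent_id,),
    )
    return "".join(r[0] for r in rows)

store(1, "first", "02A")
store(2, "second", "12B")
store(3, "third", "0FA")
```

Here find(3, "A") returns the entities with flag A in position 3, and flags_of(1) reassembles the original string on the application side, which sidesteps the cross-platform string aggregation problem.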
If you want a database-independent, reasonable method for storing such flags, then use typical SQL data types. For a binary flag, you can use bit or boolean (this differs among databases). For other flags, you can use tinyint or smallint.
Doing bit-fiddling is not going to be portable. If nothing else, the functions used to extract particular bits from data differ among databases.
Second, if performance is an issue, then you may need to create indexes to avoid full table scans. You can create indexes on normal SQL data types (although some databases may not allow indexes on bits).
It sounds like you are trying to be overly clever. You should first get the application to work using reasonable data structures. Then you will understand where the performance issues are and can work on fixing them.
I have improved my design and performed a benchmark and found an interesting result.
I created a dummy demographic entity with first/last name columns, birthdate, birthplace, email, SSN...
Then in version 1
I added a column VALIDATION VARCHAR(40) NULL DEFAULT NULL with an index on it.
Instead of positional flags, the new column contains an unordered set of codes each representing a specific format error (e.g. A01 means "last name not specified", etc.). Each code is terminated by a colon : symbol.
Example columns look like
NULL
'A01:A03:A10:'
'A05:'
Typical queries are:
SELECT * FROM ENTITIES WHERE VALIDATION IS {NOT} NULL
Search for entities that are valid/invalid (NULL = no problem)
SELECT * FROM ENTITIES WHERE VALIDATION LIKE '%AXX:%';
Selects entities with a specific problem
Then in version 2
I added a column VALID TINYINT NOT NULL with an index which is 0=invalid, 1=valid (Hibernate maps a Boolean to a TINYINT in MySQL).
I added a lookup table
CREATE TABLE ENTITY_VALIDATION (
ID BIGINT NOT NULL PRIMARY KEY,
PERSON_ID BIGINT NOT NULL, --REFERENCES PERSONS(ID) --Omitted for performance
ERROR CHAR(3) NOT NULL
)
With index on both PERSON_ID and ERROR. This represents the 1:N relationship
Queries:
SELECT * FROM ENTITIES WHERE VALID = {0|1}
Select invalid/valid entities
SELECT * FROM ENTITIES JOIN ENTITY_VALIDATION ON ENTITIES.ID = ENTITY_VALIDATION.PERSON_ID WHERE ERROR = 'Axx';
Selects entities with a given problem
Then I benchmarked
the count(*) function via JUnit+JDBC. So the same queries you see above replace * with COUNT(*).
I did several benchmarks, with the entity table containing 100k, 250k, 500k, 750k and 1M entities, with a mean entity:flag ratio of 1:3 (there are on average 3 errors for each entity).
The result
is displayed below. While the valid/invalid entity lookups perform equally, it looks like MySQL is faster with the LIKE operator than with a JOIN, even though there are indexes.
Of course,
This was only a benchmark on MySQL. While the approach is cross-platform, the benchmark does not (yet) compare performance in different DBMSes
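For anyone who wants to reproduce the comparison on another engine, here is a small correctness harness (not a benchmark) that checks both designs return the same counts. It runs on SQLite through Python's sqlite3, which is an assumption of this sketch; the sample data and the helper names are invented, and both designs share one table here only to keep the sketch short.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE entities (id INTEGER PRIMARY KEY,
                           validation TEXT,          -- packed codes, NULL = valid
                           valid INTEGER NOT NULL);  -- 0 = invalid, 1 = valid
    CREATE TABLE entity_validation (id INTEGER PRIMARY KEY,
                                    person_id INTEGER NOT NULL,
                                    error TEXT NOT NULL);
    CREATE INDEX idx_ev_person ON entity_validation (person_id);
    CREATE INDEX idx_ev_error ON entity_validation (error);
""")

# Hypothetical sample: entity id -> list of error codes.
sample = {1: [], 2: ["A01", "A03", "A10"], 3: ["A05"], 4: ["A03"]}
for ent_id, errors in sample.items():
    packed = "".join(code + ":" for code in errors) or None
    con.execute("INSERT INTO entities VALUES (?, ?, ?)",
                (ent_id, packed, int(not errors)))
    con.executemany(
        "INSERT INTO entity_validation (person_id, error) VALUES (?, ?)",
        [(ent_id, code) for code in errors],
    )

def count_like(code):
    """Packed-string design: the code may sit anywhere in the column."""
    return con.execute(
        "SELECT COUNT(*) FROM entities WHERE validation LIKE ?",
        (f"%{code}:%",),
    ).fetchone()[0]

def count_join(code):
    """Lookup-table design: equality on an indexed column."""
    return con.execute(
        "SELECT COUNT(*) FROM entities "
        "JOIN entity_validation ON entities.id = entity_validation.person_id "
        "WHERE error = ?",
        (code,),
    ).fetchone()[0]
```

Timing the two functions over larger generated data would be the next step. Note that a LIKE pattern with a leading wildcard generally cannot use an ordinary B-tree index, so results may differ between engines and data sizes.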

SQL Server - Select * vs Select Column in a Stored Procedure

In an ad-hoc query using Select ColumnName is better, but does it matter in a Stored Procedure after it's saved in the plan guide?
Always explicitly state the columns, even in a stored procedure. SELECT * is considered bad practice.
For instance you don't know the column order that will be returned, some applications may be relying on a specific column order.
I.e. the application code may look something like:
Id = Column[0]; // bad design
If you've used SELECT *, ID may no longer be the first column, which can cause the application to crash. Also, if the database is modified and an additional 5 fields have been added, you are returning additional fields that may not be relevant.
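To make the column-order point concrete, here is a small sketch with Python's sqlite3 (not SQL Server). The two person tables are hypothetical and stand for the same table before and after a schema revision that reorders its columns.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.row_factory = sqlite3.Row  # rows allow access by index and by column name
con.executescript("""
    CREATE TABLE person_v1 (id INTEGER PRIMARY KEY, name TEXT);
    INSERT INTO person_v1 VALUES (7, 'Ann');
    -- Hypothetical later revision with the columns in a different order.
    CREATE TABLE person_v2 (name TEXT, id INTEGER PRIMARY KEY);
    INSERT INTO person_v2 (id, name) VALUES (7, 'Ann');
""")

v1 = con.execute("SELECT * FROM person_v1").fetchone()
v2 = con.execute("SELECT * FROM person_v2").fetchone()

# Positional access on SELECT * depends on the table's column order.
first_column_v1 = v1[0]
first_column_v2 = v2[0]

# An explicit column list pins the order regardless of the table layout.
explicit = con.execute("SELECT id, name FROM person_v2").fetchone()
```

With SELECT *, index 0 is the id in one layout and the name in the other; with the explicit list, index 0 is always the id.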
These topics always elicit blanket statements like ALWAYS do this or NEVER do that, but the reality is, like with most things it depends on the situation. I'll concede that it's typically good practice to list out columns, but whether or not it's bad practice to use SELECT * depends on the situation.
Consider a variety of tables that all have a common field or two, for example we have a number of tables that have different layouts, but they all have 'access_dt' and 'host_ip'. These tables aren't typically used together, but there are instances when suspicious activity prompts a full report of all activity. These aren't common, and they are manually reviewed, as such, they are well served by a stored procedure that generates a report by looping through every log table and using SELECT * leveraging the common fields between all tables.
It would be a waste of time to list out fields in this situation.
Again, I agree that it's typically good practice to list out fields, but it's not always bad practice to use SELECT *.
Edit: Tried to clarify example a bit.
It's a best practice in general, but if you actually do need all the columns, you'd better use the quickly read "SELECT *".
The important thing is to avoid retrieving data you don't need.
It is considered bad practice in situations like stored procedures when you are querying large datasets with table scans. You want to avoid using table scans because it causes a hit to the performance of the query. It's also a matter of readability.
Some other food for thought. If your query has any joins at all, you are returning data you don't need, because the data in the join columns is the same. Further, if the table is later changed to add some things you don't need (such as columns for audit purposes), you may be returning data to the user that they should not be seeing.
Nobody has mentioned the case when you need ALL columns from a table, even if the columns change, e.g. when archiving table rows as XML. I agree one should not use "SELECT *" as a replacement for "I need all the columns that currently exist in the table," just out of laziness or for readability. There needs to be a valid reason. It could be essential when one needs "all the columns that could exist in the table."
Also, how about when creating "wrapper" views for tables?

Select * from table vs Select col1,col2,col3 from table [duplicate]

This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
select * vs select column
I was just having a discussion with one of my colleagues about SQL Server performance when specifying the query command in a stored procedure.
So I want to know which one is preferred over the other and what the concrete reason behind that is.
Suppose, We do have one table called
Employees(EmpName,EmpAddress)
And we want to select all the records from the table. So we can write the query in two ways,
Select * from Employees
Select EmpName, EmpAddress from Employees
So I would like to know is there any specific difference or performance issue in the above queries or are they just equal to the SQL Server Engine.
UPDATE:
Let's say the table schema won't change anymore. So future maintenance is not a concern.
Performance-wise, let's say the usage is very, very high, i.e. millions of hits per second on the database server. I want a clear and precise performance rating of both approaches.
No Indexing is done on the entire table.
The specific difference would show its ugly head if you add a column to the table.
Suddenly, the query you expected to return two columns now returns three. If you coded specifically for the two columns, the rest of your code is now broken.
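A short illustration of that breakage, sketched with Python's sqlite3 rather than SQL Server. The EmpPhone column and the sample row are invented; the Employees table matches the question.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Employees (EmpName TEXT, EmpAddress TEXT)")
con.execute("INSERT INTO Employees VALUES ('Ann', '1 Main St')")

def read_star():
    # Code written for exactly two columns.
    name, address = con.execute("SELECT * FROM Employees").fetchone()
    return name, address

def read_explicit():
    name, address = con.execute(
        "SELECT EmpName, EmpAddress FROM Employees"
    ).fetchone()
    return name, address

before = read_star()  # works while the table has two columns

# Someone adds a column later (hypothetical EmpPhone).
con.execute("ALTER TABLE Employees ADD COLUMN EmpPhone TEXT")

try:
    read_star()
    star_broke = False
except ValueError:
    # Three values no longer unpack into two names.
    star_broke = True

after = read_explicit()  # unaffected by the new column
```

The explicit version keeps returning the same two values, while the SELECT * version fails as soon as the table grows.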
Performance-wise, there shouldn't be a difference.
I always take the approach that being as specific as possible is the best when dealing with databases. If the table has two columns and you only need those two columns, be specific. Specify those two columns. It'll save you headaches in the future.
I am an avid advocate of the "be as specific as possible" rule, too. Not following it will hurt you in the long run. However, your question seems to be coming from a different background, so let me attempt to answer it.
When you submit a query to SQL Server it goes through several stages:
1. Transmitting the query string over the network
2. Parsing the query string, producing a parse tree
3. Linking the referenced objects in the parse tree to existing objects
4. Optimizing based on statistics and row count/size estimates
5. Executing
6. Transmitting the result data over the network
Let's look at each one:
1. The * query is a few bytes shorter, so this step will be faster.
2. The * query contains fewer "tokens", so this should(!) be faster.
3. During linking, the list of columns needs to be pulled and compared to the query string. Here the "*" gets resolved to the actual column references. Without access to the code it is impossible to say which version takes fewer cycles; however, the amount of data accessed is about the same, so this should be similar.
4-6. In these stages there is no difference between the two example queries, as they will both get compiled to the same execution plan.
Taking all this into account, you will probably save a few nanoseconds when using the * notation. However, your example is very simplistic. In a more complex example it is possible that specifying a subset of the columns of a table in a multi-table join will lead to a different plan than using a *. If that happens, we can be pretty certain that the explicit query will be faster.
The above comparison also assumes that the SQL Server process is running alone on a single processor and no other queries are submitted at the same time. If the process has to yield during the compilation, those extra cycles will be far more than the ones we are trying to save.
So, the amount of saving we are talking about is very minute compared to the actual execution time and should not be used as an excuse for a "bad" coding practice.
I hope this answers your question.
You should always reference columns explicitly. This way, if the table structure changes (and such changes are made in an intelligent, backward-compatible way), your queries will continue to work and can be modified over time.
Also, unless you actually need all of the columns from the table (not typical), using SELECT * is bringing more data to your application than is necessary, and potentially forcing a clustered index scan instead of what might have been satisfied by a narrower covering index.
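The covering-index effect can be observed in a query plan. The sketch below uses SQLite's EXPLAIN QUERY PLAN through Python's sqlite3, not SQL Server, so the plan text and the optimizer's choices are only an analogy; the EmpId column and the index name are assumptions added for the example.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE Employees (EmpId INTEGER PRIMARY KEY,
                            EmpName TEXT,
                            EmpAddress TEXT);
    CREATE INDEX idx_emp_name ON Employees (EmpName);
""")

def plan(sql):
    """Return the human-readable detail column of SQLite's query plan."""
    rows = con.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[3] for row in rows)

# Only EmpName is needed, so the narrow index alone can answer the query.
narrow_plan = plan("SELECT EmpName FROM Employees WHERE EmpName = 'Ann'")

# SELECT * also needs EmpAddress, so the table itself must be visited too.
star_plan = plan("SELECT * FROM Employees WHERE EmpName = 'Ann'")
```

Printing the two plans shows the narrow query reported as using a covering index, while the SELECT * version uses the index only to locate rows and then reads the table for the remaining columns. SQL Server shows the analogous difference in its own execution plans.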
Bad habits to kick : using SELECT * / omitting the column list
Performance-wise there is no difference between those two, I think. But they are used in different cases, and that may be the difference.
Consider a slightly larger table. If your table (Employees) contains 10 columns, then the 1st query will return all of the information in the table. But with the 2nd query, you can specify which columns you need. So when you need all of the information about employees, no. 1 is the better one rather than specifying all of the column names.
Of course, when you need to ALTER a table, those two would not be equal.

Strategy for avoiding a common sql development error (misleading result on join bug)

Sometimes when I'm writing moderately complex SELECT statements with a few JOINs, the wrong key columns end up in a JOIN condition and the query still returns valid-looking results.
Because the auto-numbering values (especially early in development) all tend to fall in similar ranges (sub-100s or so), the SELECT still produces some results. These results often look valid at first glance, and the problem is not detected until much, much later, which makes debugging much more difficult because familiarity with the data structures and code has gone stale in the dev's mind.
I just spent several hours tracking down yet another instance of this issue, which I've run into too many times before. I name my tables and columns carefully and write my SQL statements methodically, but this is an issue I can't seem to completely avoid. It comes back and bites me for hours of productivity about twice a year on average.
My question is: has anyone come up with a clever method for avoiding what I assume is a common SQL bug/mistake?
I have thought of trying to auto-number starting with different start values, but this feels kludgy and would get ugly trying to keep such a scheme straight for data models with dozens of tables... Any better ideas?
P.S.
I am very careful and methodical in naming my tables and columns. The Patient table gets a PatientId column, Facility gets a FacilityId, etc. This issue tends to arise when there are join tables involved where the linkage takes on extra meaning, such as RelatedPatientId, ReferingPatientId, FavoriteItemId, etc.
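The failure mode described in this question is easy to reproduce. Below is a sketch with Python's sqlite3; the Visit table and all sample rows are invented, and the facility_base parameter only illustrates the "different start values" idea mentioned above, not a recommendation.

```python
import sqlite3

def run(facility_base):
    """Return (correct_rows, wrong_rows) for a given Facility ID offset."""
    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE Patient  (PatientId  INTEGER PRIMARY KEY, Name TEXT);
        CREATE TABLE Facility (FacilityId INTEGER PRIMARY KEY, Name TEXT);
        CREATE TABLE Visit    (VisitId    INTEGER PRIMARY KEY,
                               PatientId  INTEGER, FacilityId INTEGER);
        INSERT INTO Patient VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cy');
    """)
    north, south = facility_base + 1, facility_base + 2
    con.executemany("INSERT INTO Facility VALUES (?, ?)",
                    [(north, "North"), (south, "South")])
    con.executemany("INSERT INTO Visit VALUES (?, ?, ?)",
                    [(1, 3, north), (2, 1, south), (3, 2, south)])

    query = """
        SELECT v.VisitId, p.Name
        FROM Visit v JOIN Patient p ON p.PatientId = v.{column}
        ORDER BY v.VisitId
    """
    correct = con.execute(query.format(column="PatientId")).fetchall()
    # The bug: joining Patient on the facility key instead of the patient key.
    wrong = con.execute(query.format(column="FacilityId")).fetchall()
    return correct, wrong

overlapping_correct, overlapping_wrong = run(facility_base=0)
offset_correct, offset_wrong = run(facility_base=1000)
```

With overlapping ID ranges the buggy join returns three plausible-looking rows; with disjoint ranges it returns nothing, which is at least noticeable right away.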
When writing long complex SELECT statements try to limit the result to one record.
For instance, assume you have this gigantic enormous awesome CMS system and you have to write internal reports because the reports that come with it are horrendous. You notice that there are about 500 tables. Your select statement joins 30 of these tables. Your result should limit your row count by using a WHERE clause.
My advice is that, rather than getting all this code written and generalized for all cases, you break the problem up and use WHERE to limit the row count to, say, a single record. Check all fields; if they look OK, let your code return more rows. Only after further checking should you generalize.
It bites a lot of us who keep adding more and more joins until it seems to look ok, but only after Joe Blow the accountant runs the report does he realize that the PO for 4 million was really the telephone bill for the entire year. Somehow that join got messed up!
One option would be to use your natural keys.
More practically, Red Gate SQL Prompt picks the FK columns for me.
I also tend to build up one JOIN at a time to see how things look.
If you have a visualization or diagramming tool for your SQL statements, you can follow the joins visually, and any errors will become immediately apparent, provided you have followed a sensible naming scheme for your primary and foreign keys.
Your column names should take care of this unless you named them all "ID". Are you writing multiple select statements using the same tables? You may want to create views for the more common ones.
If you're using SQL Server, you can use GUID columns as primary keys (that's what we do). You won't have problems with collisions again.
You could use GUIDs as your primary keys, but it has its pros and cons.
This pro is actually not mentioned on that page.
I have never tried doing this myself - I use a tool on top of SQL that makes incorrect joins very unlikely, so I don't have this problem. I just thought I'd mention it as another option though!
For IDs use TableNameID, for example for table Person, use PersonID
Use db model and look at the drawing when writing queries.
This way join looks like:
... ON p.PersonID = d.PersonID
as opposed to:
... ON p.ID = d.ID
Auto-increment integer PKs are among your best friends.

Char(4) versus int as StatusID/StatusCode column in a table

I need a status column that will have about a dozen possible values.
Is there any reason why I should choose int (StatusID) over char(4) (StatusCode)?
Since SQL Server doesn't support named constants, char is far more descriptive than int when used as constants in stored procedures and views.
To clarify, I would still use a lookup table either way, since I will need more descriptive text for the UI. So this decision is only to help me as the developer when I'm maintaining the stored procedures and views.
Right now I'm leaning toward char(4). Especially since designing views in SQL Server Management Studio prevents me from adding comments (I know it's possible to add it in the script editor, but realistically I will use the View Designer far more often, especially if the view is trivial). StateCODE = 'NEW' is much more readable than StateID = 1000.
I guess the question is will there be cases where char(4) is problematic, and since the database is pretty small, I'm not too concerned about slight performance hit (like using TinyInt versus int), but more afraid of code maintenance problems.
Database purists will say a key should have no meaning in the business domain, and that you should create a status table where you look up the description and other meanings of the status.
But for operators and end users, having a descriptive status code can be a blessing. And it doesn't even have to be char(4), you can make it varchar(20). This allows them to query without joins, and inspect the database in an easier way.
In the end, I think the char(20) organization will run more smoothly, and go home earlier on Friday. But the int organization has a better abstraction of the database, and they can enjoy meta programming on Friday evening (or boasting on forums.)
(All of this assuming that you're writing business support software. One of the more successful business support systems, SAP, makes successful use of meaningful keys.)
There are many pros and cons to each method. I'm sure other arguments will come up in favour of using a char(4). My reasons for choosing an int over a char include:
I always use lookup tables. They allow for an audit trail of the value to be retained and easily examined. For example, if one of your status codes is 'MING' and a business decision is made to change it from 'MING' to 'MONG' from a certain date, my lookup table handles this.
Smaller index - if you need to index this column, it will be thinner.
Extendability - OK, I made that word up, but if you need to go from 4 chars to 5 chars for example, a lookup table would be a blessing.
Descriptions: We use a lot of TLA's here which once you know what they are is great but if I gave a business user a report that said "GDA's 2007 1001", they wouldn't necessarily twig that GDA = Good Dead on Arrival. With a lookup table, I can add this description.
Best practice: Can't find the link to hand but it might be something I read in a K.Tripp article. Aim to make your clustered primary key incrementing integers to optimise the index.
Of course if you are absolutely positive that you will never need any more than a handful of 4 characters, there is no reason not to bang it in the table.
The best thing should be a lookup table with defined values and then relate it to original table, that uses that enumeration.
Collation ambiguities are one reason to say no to char(4): Does ABcD = abCD = äBCd?
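A small demonstration of how the collation changes the answer for a string key. It uses SQLite's built-in BINARY and NOCASE collations via Python's sqlite3, which is an assumption of this sketch; SQL Server collations have different names and rules (accent sensitivity, for instance), so treat it as an analogy. The table names are invented.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    -- Same data, two collations on the status code column.
    CREATE TABLE status_bin (code TEXT COLLATE BINARY);
    CREATE TABLE status_ci  (code TEXT COLLATE NOCASE);
    INSERT INTO status_bin VALUES ('ABcD');
    INSERT INTO status_ci  VALUES ('ABcD');
""")

def matches(table, literal):
    """Count rows whose code equals the literal under the column's collation."""
    return con.execute(
        f"SELECT COUNT(*) FROM {table} WHERE code = ?", (literal,)
    ).fetchone()[0]

binary_hits = matches("status_bin", "abCD")   # case-sensitive comparison
nocase_hits = matches("status_ci", "abCD")    # case-insensitive comparison
```

The same literal matches in one table and not in the other, so a char key's behaviour depends on server or column settings in a way an integer key's does not.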
If you have 12 possible values, why not tinyint/byte and a Status table?
If you have to store the status for 10 million rows, the 3-byte difference and the collation/string compares add up.
The place where I've run into this use case is columns that would map onto things that I would typically use an Enum for when programming. Do you store the integer value of the Enum or the name of the Enum in the database column? Honestly, I've done it both ways. Usually, I ask myself if the database will be used outside the application I'm building. If so, I will choose the human readable format to store in the database. If not, then I'll choose the integer value as it saves a little time when reconstituting (it's just a cast instead of a parse operation) the Enum in code.
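The cast-versus-parse distinction looks like this with a Python IntEnum. The Status members and their numbers are hypothetical, and other languages will differ in the details.

```python
from enum import IntEnum

class Status(IntEnum):
    # Hypothetical status values; the numbers are what an int column would hold.
    NEW = 1
    OPEN = 2
    HOLD = 3

# Column stores the integer: rebuilding the enum is a direct value lookup.
from_int_column = Status(2)

# Column stores the name: rebuilding the enum is a lookup by string.
from_name_column = Status["OPEN"]

# Going the other way, either representation is one attribute access.
stored_as_int = int(Status.HOLD)
stored_as_name = Status.HOLD.name
```

Both routes arrive at the same member; the difference is mainly in what a person sees when reading the table directly.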
You could also use a tinyint over an int
I always choose ints simply because they are easier to map to enums in code.
If you're dealing with huge amounts of data and high throughput then a smallint or tinyint can give better performance and a smaller footprint on the hard disk. If the data in your application is often viewed directly through applications like Access or Cognos then your business people will probably appreciate the descriptive values. I know that when I'm analyzing data as part of my Database Developer role I get tired of joining a lot of lookup tables because I can't remember if 1 = Foo and 2 = Bar or 1 = Bar and 2 = Foo.
Also, although performance will be enhanced if you have to lookup rows by these codes which can have smaller indexes, it can also be hurt (in a minor way) by having to do the joins if you are often looking up rows regardless of the code but where you have to include the text value. In most applications that's not an issue though and would probably only come into play in large data warehousing/reporting environments.