Why are positional queries bad? - sql

I'm reading CJ Date's SQL and Relational Theory: How to Write Accurate SQL Code, and he makes the case that positional queries are bad — for example, this INSERT:
INSERT INTO t VALUES (1, 2, 3)
Instead, you should use attribute-based queries like this:
INSERT INTO t (one, two, three) VALUES (1, 2, 3)
Now, I understand that the first query is out of line with the relational model since tuples (rows) are unordered sets of attributes (columns). I'm having trouble understanding where the harm is in the first query. Can someone explain this to me?

The first query breaks pretty much any time the table schema changes. The second query accommodates any schema change that leaves its columns intact and doesn't add defaultless columns.
People who do SELECT * queries and then rely on positional notation for extracting the values they're concerned about are software maintenance supervillains for the same reason.
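To make that concrete, here is a minimal sketch using the question's table; the added column is hypothetical:
-- Hypothetical table matching the question's example
CREATE TABLE t (
    one   INT,
    two   INT,
    three INT
);
INSERT INTO t VALUES (1, 2, 3);                   -- positional: works today
INSERT INTO t (one, two, three) VALUES (1, 2, 3); -- named: works today
-- A routine schema change: add a nullable column
-- (some dialects omit the COLUMN keyword)
ALTER TABLE t ADD COLUMN four INT;
-- The positional insert now breaks on most engines (column count mismatch);
-- PostgreSQL instead quietly fills only the first three columns.
INSERT INTO t VALUES (1, 2, 3);
-- The named insert keeps working: 'four' is simply left NULL.
INSERT INTO t (one, two, three) VALUES (1, 2, 3);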

While the order of columns is defined in the schema, it should generally not be regarded as important because it's not conceptually important.
Also, it means that anyone reading the first version has to consult the schema to find out what the values are meant to mean. Admittedly this is just like using positional arguments in most programming languages, but somehow SQL feels slightly different in this respect - I'd certainly understand the second version much more easily (assuming the column names are sensible).

I don't really care about theoretical concepts in this regard (as in practice, a table does have a defined column order). The primary reason I would prefer the second one to the first is an added layer of abstraction. You can modify columns in a table without screwing up your queries.

You should try to make your SQL queries depend on the exact layout of the table as little as possible.
The first query relies on the table only having three fields, and in that exact order. Any change at all to the table will break the query.
The second query only relies on there being those three fields in the table, and the order of the fields is irrelevant. You can change the order of fields in the table without breaking the query, and you can even add fields as long as they allow null values or have a default value.
Although you don't rearrange the table layout very often, adding more fields to a table is quite common.
Also, the second query is more readable. You can tell from the query itself what the values put in the record mean.

Something that hasn't been mentioned yet is that you will often be having a surrogate key as your PK, with auto_increment (or something similar) to assign a value. With the first one, you'd have to specify something there — but what value can you specify if it isn't to be used? NULL might be an option, but that doesn't really fit in considering the PK would be set to NOT NULL.
But apart from that, the whole "locked to a specific schema" is a much more important reason, IMO.
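A sketch of the surrogate-key point, using MySQL-style AUTO_INCREMENT and a made-up table:
CREATE TABLE person (
    person_id  INT AUTO_INCREMENT PRIMARY KEY,  -- surrogate key
    first_name VARCHAR(50),
    last_name  VARCHAR(50)
);
-- With a column list you simply leave the key out and let the database assign it.
INSERT INTO person (first_name, last_name) VALUES ('Ada', 'Lovelace');
-- Positionally you are forced to supply something for person_id.
-- Some engines accept NULL or 0 as "generate a value", but that is
-- engine-specific behaviour, not something to rely on.
INSERT INTO person VALUES (NULL, 'Grace', 'Hopper');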

SQL gives you syntax for specifying the name of the column for both INSERT and SELECT statements. You should use this because:
Your queries are stable to changes in the column ordering, so that maintenance takes less work.
Column names map better to how people think than positions do, so it's more readable. It's clearer to think of a column as the "Name" column rather than the 2nd column.

I prefer to use the UPDATE-like syntax:
INSERT t SET one = 1, two = 2, three = 3
which is far easier to read and maintain than either of the examples above. (Note that INSERT ... SET is MySQL-specific syntax.)

Long term, if you add one more column to your table, your INSERT will not work unless you explicitly specify the list of columns. If someone changes the order of columns, your INSERT may silently succeed, inserting values into the wrong columns.

I'm going to add one more thing: the second query is less prone to error in the first place, even before tables are changed. Why do I say that? Because with the second form you can (and should, when you write the query) visually check whether the columns in the insert list and the data in the values clause or select clause are in fact in the right order to begin with. Otherwise you may end up putting the Social Security Number in the Honoraria field by accident and paying speakers their SSN instead of the amount they should make for a speech (example not chosen at random, except we did catch it before it actually happened thanks to that visual check!).
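For illustration, a hypothetical version of that near-miss; lining the column list up against the values makes the swap easy to spot before it runs:
-- The column list and the values read side by side, so a reviewer can see
-- at a glance that ssn and honorarium are in the wrong order.
INSERT INTO speaker_payments (speaker_name, ssn, honorarium)
VALUES ('J. Smith', 125000.00, '078-05-1120');   -- caught on the visual check
-- Corrected:
INSERT INTO speaker_payments (speaker_name, ssn, honorarium)
VALUES ('J. Smith', '078-05-1120', 125000.00);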

Related

Determine if a SQL Insert/Update statement affects the result from a stored Select Statement

Thought this would be a good place to ask for some "brainstorming." Apologies if it's a little broad/off subject.
I was wondering if anyone here had any ideas on how to approach the following problem:
First assume that I have a select statement stored somewhere as an object (this can be the tree form of the query). For example (for simplicity):
SELECT A, B FROM table_A WHERE A > 10;
It's easy to determine the below would change the result of the above query:
INSERT INTO table_A (A,B) VALUES (12,15);
But, given any possible Insert/Update/Whatever statement, as well as any possible starting Select (but we know the Selects and can analyze them all day) I'd like to determine if it would affect the result of the Select Statement.
It's fine to assume that there won't be any "outside" queries, and that we know about all the queries being sent to the DB. It is also assumed we know the DB schema.
No, this isn't for homework. Just a brain teaser I've been thinking about and started to get stuck on (obviously, SQL can get very complicated.)
Based on the reply to the comment, I'd say that without additional criteria, this ranges between very hard and impossible.
Very hard (leastways, it would be for me) because you'd have to write something to parse and interpret your SQL statements into a workable frame of reference for your goals. Doable, but is it worth the effort?
Impossible because some queries transcend phrases like "Byzantinely complex". (Think nested queries, correlated subqueries, views, common table expressions, triggers, outer joins, and who knows what all.) Without setting criteria such as "no subqueries, no views or triggers, no more than X joins" and so forth, the problem becomes open-ended enough to warrant an NP Complete answer.
My first thought would be to put a trigger on table_A, where if any of the columns you're affecting (col A in this case) changes to meet (or no longer meet) the condition (> 10 here), then the trigger records that an "affecting" change has taken place.
E.g. have another little table to record a "last update timestamp", which the trigger could pop a getdate() into when it detects such a change.
Then, you could check that table to see if the timestamp has changed since the last time you ran the select query - if it has, then you know you need to re-run it, if it hasn't, then you know the results would be the same.
The table could hold many such timestamps (one per row, perhaps with the table/trigger name as a key value in another column) to service many such triggers.
Advantage? Being done in a trigger on the table means no risk of a change that could affect the select statement being missed.
Disadvantage? I guess depending on how your select statements come into existence, you might have an undesirable/unmanageable overhead in creating the trigger(s).
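A minimal sketch of that trigger idea in SQL Server syntax; the bookkeeping table and all names are made up for illustration:
-- One row per watched query.
CREATE TABLE query_change_log (
    watcher_name VARCHAR(100) PRIMARY KEY,
    last_change  DATETIME NOT NULL
);
-- Seed a row for the example select ("SELECT A, B FROM table_A WHERE A > 10").
INSERT INTO query_change_log VALUES ('tableA_A_gt_10', GETDATE());
GO
CREATE TRIGGER trg_tableA_watch
ON table_A
AFTER INSERT, UPDATE, DELETE
AS
BEGIN
    -- Only changes involving rows with A > 10 (before or after the change) matter.
    IF EXISTS (SELECT 1 FROM inserted WHERE A > 10)
       OR EXISTS (SELECT 1 FROM deleted WHERE A > 10)
    BEGIN
        UPDATE query_change_log
        SET last_change = GETDATE()
        WHERE watcher_name = 'tableA_A_gt_10';
    END
END;
The application then compares that timestamp with the time it last ran the select to decide whether a re-run is needed.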

SQL Server - Select * vs Select Column in a Stored Procedure

In an ad-hoc query using Select ColumnName is better, but does it matter in a Stored Procedure after it's saved in the plan guide?
Always explicitly state the columns, even in a stored procedure. SELECT * is considered bad practice.
For instance, you don't know the column order that will be returned, and some applications may be relying on a specific column order.
I.e. the application code may look something like:
Id = Column[0]; // bad design
If you've used SELECT *, ID may no longer be the first column, which will cause the application to crash. Also, if the database is modified and an additional 5 fields have been added, you are returning additional fields that may not be relevant.
These topics always elicit blanket statements like ALWAYS do this or NEVER do that, but the reality is, like with most things it depends on the situation. I'll concede that it's typically good practice to list out columns, but whether or not it's bad practice to use SELECT * depends on the situation.
Consider a variety of tables that all have a common field or two, for example we have a number of tables that have different layouts, but they all have 'access_dt' and 'host_ip'. These tables aren't typically used together, but there are instances when suspicious activity prompts a full report of all activity. These aren't common, and they are manually reviewed, as such, they are well served by a stored procedure that generates a report by looping through every log table and using SELECT * leveraging the common fields between all tables.
It would be a waste of time to list out fields in this situation.
Again, I agree that it's typically good practice to list out fields, but it's not always bad practice to use SELECT *.
Edit: Tried to clarify example a bit.
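One way such a report loop might look in SQL Server, with entirely hypothetical table names; only access_dt and host_ip are taken from the example above:
-- The log tables share access_dt and host_ip but otherwise have different layouts.
DECLARE @tables TABLE (table_name SYSNAME);
INSERT INTO @tables VALUES ('web_log'), ('ftp_log'), ('db_audit_log');
DECLARE @name SYSNAME, @sql NVARCHAR(MAX);
DECLARE cur CURSOR FOR SELECT table_name FROM @tables;
OPEN cur;
FETCH NEXT FROM cur INTO @name;
WHILE @@FETCH_STATUS = 0
BEGIN
    -- SELECT * is the point here: the report is reviewed manually and
    -- should show everything each table happens to contain.
    SET @sql = N'SELECT ''' + @name + N''' AS source_table, * FROM ' + QUOTENAME(@name)
             + N' WHERE host_ip = @ip ORDER BY access_dt';
    EXEC sp_executesql @sql, N'@ip VARCHAR(45)', @ip = '203.0.113.7';
    FETCH NEXT FROM cur INTO @name;
END
CLOSE cur;
DEALLOCATE cur;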
It's a best practice in general, but if you actually do need all the columns, the quicker-to-read "SELECT *" is fine.
The important thing is to avoid retrieving data you don't need.
It is considered bad practice in situations like stored procedures where you are querying large datasets: selecting every column tends to force table scans rather than narrower index reads, which hurts the performance of the query. It's also a matter of readability.
Some other food for thought: if your query has any joins at all, you are returning data you don't need, because the data in the join columns is the same. Further, if the table is later changed to add some things you don't need (such as columns for audit purposes), you may be returning data to the user that they should not be seeing.
Nobody has mentioned the case when you need ALL columns from a table, even if the columns change, e.g. when archiving table rows as XML. I agree one should not use "SELECT *" as a replacement for "I need all the columns that currently exist in the table," just out of laziness or for readability. There needs to be a valid reason. It could be essential when one needs "all the columns that could exist in the table."
Also, how about when creating "wrapper" views for tables?
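For example, a thin wrapper view is one of the few places where SELECT * arguably expresses the intent exactly (table and view names are hypothetical):
CREATE VIEW v_customer AS
SELECT *
FROM customer;
One caveat: many engines (SQL Server and PostgreSQL among them) expand the * when the view is created, so columns added to the base table later do not show up until the view is refreshed or recreated.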

SQL INSERT without specifying columns. What happens?

Was looking through the beloved W3Schools and found this page and actually learned something interesting. I didn't know you could run an insert command without specifying which columns the values map to. For example:
INSERT INTO table_name
VALUES (value1, value2, value3,...)
Pulling from my hazy memory, I seem to remember the SQL prof mentioning that you have to treat fields as if they are not in any particular order (although there is an order on the RDBMS side, it's not guaranteed).
My question is, how does the server know which values get assigned to which fields? I would test this myself, but I'm not going to use a production server to do it, and that's all I have access to at the moment.
If this is technology-specific, I am working on PostgreSQL. How is this particular syntax even useful?
Your prof was right - you should name the columns explicitly before naming the values.
In this case though the values will be inserted in the order that they appear in the table definition.
The problem with this is that if that order changes, or columns are removed or added (even if they are nullable), then the insert will break.
In terms of its usefulness, not that much in production code. If you're hand coding a quick insert then it might just help save you typing all the column names out.
They get inserted into the fields in the order they are in the table definition.
So if your table has fields (a,b,c), a=value1, b=value2, c=value3.
Your professor was right, this is lazy, and liable to break. But useful for a quick and dirty lazy insert.
I cannot resist putting an "RTFM" here.
The PostgreSQL manual details what happens in the chapter on INSERT:
The target column names can be listed in any order. If no list of column names is given at all, the default is all the columns of the table in their declared order; or the first N column names, if there are only N columns supplied by the VALUES clause or query. The values supplied by the VALUES clause or query are associated with the explicit or implicit column list left-to-right.
Bold emphasis mine.
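A quick illustration of that first-N rule in PostgreSQL, with a throwaway table:
CREATE TABLE t (
    one   integer,
    two   integer,
    three integer DEFAULT 0
);
-- No column list and only two values: they are matched to the first two
-- columns in declared order (one, two); three gets its default.
INSERT INTO t VALUES (1, 2);
-- Equivalent explicit form:
INSERT INTO t (one, two) VALUES (1, 2);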
The values are just added in the same order as the columns appear in the table. It's useful in situations where you don't know the names of the columns you're working with but know what data needs to go in. It's generally not a good idea to do this, though, as it of course breaks if the order of the columns changes or new columns are inserted in the middle.
That syntax works without specifying columns only if you provide the same number of values as there are columns. The second, more important thing is that columns in a SQL table are always in the same order, which depends on your table definition. The only thing that has no innate order in a SQL table is the rows.
When the table is created, each column gets an order number in the system catalog, so each value is inserted according to that order: the first value goes to the first column, and so on.
In SQL Server, the system table syscolumns maintains this order; PostgreSQL should have something similar.

Insert Query: Why is it a bad idea to not include column names?

I was told at my last position never to do this; it wasn't explained why. I understand it enhances the opportunity for mistakes, but if there are just a couple of columns and I'm confident of their order, why can't I shorthand it and just insert the values in the right order without explicitly matching up the column names? Is there a large performance difference? If so, does it matter on a small scale?
If there's no performance hit and this isn't a query that will be saved for others to view, why shouldn't I?
Thanks in advance.
This is acceptable only when you type your query by hand into an interactive DB tool. When your SQL statement is executed by your program, you cannot be absolutely confident about the order of columns in a table, unless you are the only developer who has access to your database. In other words, in any team environment there is an opportunity that someone would break your query simply by re-ordering columns in your database. Logically, your table would remain the same, but your program would still break.
Devil's advocate: if there are only a couple of columns, what does short-handing it gain you? You saved a few keystrokes, big deal.
For ad hoc queries you're writing once and throwing away, it's not a big deal. I suspect you were told to never do this in production code or anything anyone else would later have to reverse engineer (either simply to understand it, or to account for underlying schema changes). Remember that code that you write may only ever be viewed and maintained by you right now but you should be writing code with the intention that it will outlast you.
Another reason including the column list is good is if you later want to search for all references to a specific column name in your data model...
The reason is to make the code more robust.
Specifying the fields makes the code less dependent on the table layout staying exactly the same, and also gives you the ability to add fields to the table without needing to change the code, as long as you provide default values for the new fields.
It also makes it easier to see what the query is supposed to do, without needing to look up the table layout to see where the data will end up.
In most development shops, you will not be the only developer working on a given project, in which case you run the risk of unintended inserts. You may be confident now that there won't be any schema changes, but things can change... now or after you're gone.
Also, if a column is added, your inserts will fail.
It's mostly for readability. If it's just you, it's fine, but if you do muck up the order it can be a hard one to explain if people find out you were not using good practices, especially if someone modifies the structure while you were writing your query. Also, if you have complex things to do in a column it can help visualisation, and if, as you say, the table is small, then adding a few column names is hardly going to take long, so you may as well.
Because things change. If you code that query in your program thinking that the column order will never change, and then someone else comes along and decides to alter the table by adding a new column, your query will fail.
Also, there's a quick trick if you don't want to type column names out. In management studio, you can set it up so that your result set returns in a CSV (with column headers). CTRL-T.
Then do a quick,
select top 1 * from <tablename>
and copy and paste the column list from the resultset window.
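Another option, if you would rather not copy from the results grid: pull the column list straight from the metadata (INFORMATION_SCHEMA is standard; the table name here is a placeholder):
SELECT COLUMN_NAME
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_NAME = 'YourTable'
ORDER BY ORDINAL_POSITION;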
A relation has no left-to-right ordering of columns. The same cannot be said of a SQL table, which is not a good thing. However, just because SQL has non-relational features doesn't mean you have to use them! Specifying a column list as part of a table or row value constructor is one way of mitigating SQL's dependence on column ordering.
Consider that the following SQL statements are semantically equivalent:
INSERT INTO Table1 (col1, col2, col3) VALUES (1, 2, 3);
INSERT INTO Table1 (col2, col1, col3) VALUES (2, 1, 3);
INSERT INTO Table1 (col3, col2, col1) VALUES (3, 2, 1);

When to use SQL Table Alias

I'm curious to know how people are using table aliases. The other developers where I work always use table aliases, and always use the alias of a, b, c, etc.
Here's an example:
SELECT a.TripNum, b.SegmentNum, b.StopNum, b.ArrivalTime
FROM Trip a, Segment b
WHERE a.TripNum = b.TripNum
I disagree with them, and think table aliases should be used more sparingly.
I think they should be used when including the same table twice in a query, or when the table name is very long and using a shorter name in the query will make the query easier to read.
I also think the alias should be a descriptive name rather than just a letter. In the above example, if I felt I needed to use a one-letter table alias I would use t for the Trip table and s for the Segment table.
There are two reasons for using table aliases.
The first is cosmetic. The statements are easier to write, and perhaps also easier to read when table aliases are used.
The second is more substantive. If a table appears more than once in the FROM clause, you need table aliases in order to keep them distinct. Self joins are common in cases where a table contains a foreign key that references the primary key of the same table.
Two examples: an employees table that contains a supervisorID column that references the employeeID of the supervisor.
The second is a parts explosion. Often, this is implemented in a separate table with three columns: ComponentPartID, AssemblyPartID, and Quantity. In this case, there won't be any self joins, but there will often be a three way join between this table and two different references to the table of Parts.
It's a good habit to get into.
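Sketches of both cases; the key columns come from the description above, while the table and display column names are made up:
-- Self join: each employee with his or her supervisor.
SELECT e.EmployeeName, s.EmployeeName AS SupervisorName
FROM Employees e
JOIN Employees s ON s.EmployeeID = e.SupervisorID;
-- Parts explosion: two different references to the Parts table.
SELECT asm.PartName AS Assembly, cmp.PartName AS Component, bom.Quantity
FROM BillOfMaterials bom
JOIN Parts asm ON asm.PartID = bom.AssemblyPartID
JOIN Parts cmp ON cmp.PartID = bom.ComponentPartID;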
I use them to save typing. However, I always use letters similar to the function. So, in your example, I would type:
SELECT t.TripNum, s.SegmentNum, s.StopNum, s.ArrivalTime
FROM Trip t, Segment s
WHERE t.TripNum = s.TripNum
That just makes it easier to read, for me.
As a general rule I always use them, as there are usually multiple joins going on in my stored procedures. It also makes it easier when using code generation tools like CodeSmith to have it generate the alias name automatically for you.
I try to stay away from single letters like a & b, as I may have multiple tables that start with the letter a or b. I go with a longer approach: concatenating the names of the joined tables, for example CustomerContact, which would be the alias for the Customer table when joining to a Contact table.
The other reason I don't mind longer names is that most of my stored procedures are generated via CodeSmith. I don't mind hand-typing the few that I may have to build myself.
Using the current example, I would do something like:
SELECT Trip.TripNum, TripSegment.SegmentNum, TripSegment.StopNum, TripSegment.ArrivalTime
FROM Trip, Segment TripSegment
WHERE Trip.TripNum = TripSegment.TripNum
Can I add to a debate that is already several years old?
There is another reason that no one has mentioned. The SQL parser in certain databases works better with an alias. I cannot recall if Oracle changed this in later versions, but when it came to an alias, it looked up the columns in the database and remembered them. When it came to a table name, even if it was already encountered in the statement, it re-checked the database for the columns. So using an alias allowed for faster parsing, especially of long SQL statements. I am sure someone knows if this is still the case, if other databases do this at parse time, and if it changed, when it changed.
I use it always, reasons:
leaving full table names in statements makes them hard to read, plus you cannot reference the same table twice
not qualifying columns at all is a very bad idea, because later you could add some field to one of the tables that is already present in some other table
Consider this example:
select col1, col2
from tab1
join tab2 on tab1.col3 = tab2.col3
Now, imagine a few months later you decide to add a column named 'col1' to tab2. The database will silently allow you to do that, but applications would break when executing the above query because of the ambiguity between tab1.col1 and tab2.col1.
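The alias-qualified version of the same query is immune to that kind of change (assuming col1 and col2 were meant to come from tab1):
select t1.col1, t1.col2
from tab1 t1
join tab2 t2 on t1.col3 = t2.col3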
But, I agree with you on the naming: a, b, c is fine, but t and s would be much better in your example. And when I have the same table more than once, I would use t1, t2, ... or s1, s2, s3...
Using the full name makes it harder to read, especially for larger queries or the Order/Product/OrderProduct scenario.
I'd use t and s. Or o/p/op
If you use SCHEMABINDING then columns must be qualified anyway
If you add a column to a base table, then the qualification reduces the chance of a duplicate in the query (for example a "Comment" column)
Because of this qualification, it makes sense to always use aliases.
Using a and b is blind obedience to a bizarre standard.
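For reference, a minimal schema-bound view in SQL Server; SCHEMABINDING forbids SELECT * and requires objects to be referenced by two-part names, which pushes you toward qualifying everything anyway (the table and columns here are hypothetical):
CREATE VIEW dbo.v_ActiveOrders
WITH SCHEMABINDING
AS
SELECT o.OrderID, o.CustomerID, o.OrderDate
FROM dbo.Orders AS o
WHERE o.Status = 'Active';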
In simple queries I do not use aliases. In queries with multiple tables I always use them because:
they make queries more readable (my aliases are 2 or more capital letters that act as a shortcut for the table name and, if possible, hint at its relationship to other tables)
they allow faster developing and rewriting (my table names are long and have prefixes depending on the role they play)
so instead of for example:
SELECT SUM(a.VALUE)
FROM Domesticvalues a, Foreignvalues b
WHERE a.Value>b.Value
AND a.Something ...
I write:
select SUM(DVAL.Value)
from DomesticValues DVAL, ForeignValues FVAL
where DVAL.Value > FVAL.Value
and DVAL.Something ...
There are many good ideas in the posts above about when and why to alias table names. What no one else has mentioned is that it is also beneficial in helping a maintainer understand the scope of tables. At our company we are not allowed to create views. (Thank the DBA.) So, some of our queries become large, even exceeding the 50,000 character limit of a SQL command in Crystal Reports. When a query aliases its tables as a, b, c, and a subquery of that does the same, and multiple subqueries in that one each use the same aliases, it is easy for one to mistake what level of the query is being read. This can even confuse the original developer when enough time has passed. Using unique aliases within each level of a query makes it easier to read because the scope remains clear.
I feel that you should use them as often as possible but I do agree that t & s represent the entities better than a & b.
This boils down to, like everything else, preferences. I like that you can depend on your stored procedures following the same conventions when each developer uses the alias in the same manner.
Go convince your coworkers to get on the same page as you or this is all worthless. The alternative is you could have a table Zebra as first table and alias it as a. That would just be cute.
I only use them when they are necessary to distinguish which table a field is coming from:
select PartNumber, I.InventoryTypeId, InventoryTypeDescription
from dbo.Inventory I
inner join dbo.InventoryType IT on IT.InventoryTypeId = I.InventoryTypeId
In the example above both tables have an InventoryTypeId field, but the other field names are unique.
Always use an abbreviation for the table as the name so that the code makes more sense - ask your other developers if they name their local variables A, B, C, etc!
The only exception is in the rare cases where the SQL syntax requires a table alias but it isn't referenced, e.g.
select *
from (
select field1, field2, sum(field3) as total
from someothertable
group by field1, field2
) X
In the above, SQL syntax requires the table alias for the subselect, but it isn't referenced anywhere so I get lazy and use X or something like that.
I find it nothing more than a preference. As mentioned above, aliases save typing, especially with long table/view names.
One thing I've learned is that especially with complex queries; it is far simpler to troubleshoot six months later if you use the alias as a qualifier for every field reference. Then you aren't trying to remember which table that field came from.
We tend to have some ridiculously long table names, so I find it easier to read if the tables are aliased. And of course you must do it if you are using a derived table or a self join, so being in the habit is a good idea. I find most of our developers end up using the same alias for each table in all their sps,so most of the time anyone reading it will immediately know what pug is the alias for or mmh.
I always use them. I formerly skipped them in queries involving just one table, but then I realized a) queries involving just one table are rare, and b) queries involving just one table rarely stay that way for long. So I always put them in from the start so that I (or someone else) won't have to retrofit them later. Oh, and BTW: I call them "correlation names", as per the SQL-92 standard :)
Tables aliases should be four things:
Short
Meaningful
Always used
Used consistently
For example if you had tables named service_request, service_provider, user, and affiliate (among many others) a good practice would be to alias those tables as "sr", "sp", "u", and "a", and do so in every query possible. This is especially convenient if, as is often the case, these aliases coincide with acronyms used by your organization. So if "SR" and "SP" are the accepted terms for Service Request and Service Provider respectively, the aliases above carry a double payload of intuitively standing in for both the table and the business object it represents.
The obvious flaws with this system are first that it can be awkward for table names with lots of "words" e.g. a_long_multi_word_table_name which would alias to almwtn or something, and that it's likely you'll end up with tables named such that they abbreviate the same. The first flaw can be dealt with however you like, such as by taking the last 3 or 4 letters, or whichever subset you feel is most representative, most unique, or easiest to type. The second I've found in practice isn't as troublesome as it might seem, perhaps just by luck. You can also do things like take the second letter of a "word" in the table as well, such as aliasing account_transaction to "atr" instead of "at" to avoid conflicting with account_type.
Of course whether you use the above approach or not, aliases should be short because you'll be typing them very very frequently, and they should always be used because once you've written a query against a single table and omitted the alias, it's inevitable that you'll later need to edit in a second table with duplicate column names.
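Applying that convention to the tables above; the join columns are assumed for the sake of the example:
SELECT sr.id, sp.company_name, u.email, a.name AS affiliate_name
FROM service_request sr
JOIN service_provider sp ON sp.id = sr.service_provider_id
JOIN "user" u ON u.id = sr.user_id          -- "user" usually needs quoting, since it is a reserved word
LEFT JOIN affiliate a ON a.id = u.affiliate_id;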
Because I always fully qualify my tables, I use aliases to provide a shorter name for the fields being SELECTed when JOINed tables are involved. I think it makes my code easier to follow.
If the query deals with only one source - no JOINs - then I don't use an alias.
Just a matter of personal preference.
Always. Make it a habit.