Which SQL improvements are you waiting for? - sql

Dealing with SQL exposes some limitations and gives us an opportunity to imagine what could be.
Which improvements to SQL are you waiting for? Which would you put at the top of your wish list?
It would be nice if you also mentioned in your answer which database lacks the feature you're requesting.

T-SQL Specific: A decent way to select from a result set returned by a stored procedure that doesn't involve putting it into a temporary table or using some obscure function.
SELECT * FROM EXEC [master].[dbo].[xp_readerrorlog]
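For reference, the workaround this wish would eliminate looks something like the following (a sketch; the column list for xp_readerrorlog is approximate and version-dependent):
-- today you must materialize the result set into a temp table first
CREATE TABLE #log (LogDate datetime, ProcessInfo nvarchar(50), Text nvarchar(max));
INSERT INTO #log
EXEC [master].[dbo].[xp_readerrorlog];
SELECT * FROM #log WHERE Text LIKE '%error%';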

I know it's wildly unrealistic, but I wish they'd make the syntax of INSERT and UPDATE consistent. Talk about gratuitous non-orthogonality.
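For anyone who hasn't felt this pain, the two statements express the same column/value pairs in completely different shapes (table and columns invented for illustration):
-- same data, two unrelated syntaxes
INSERT INTO customer (name, city) VALUES ('Acme', 'Oslo');
UPDATE customer SET name = 'Acme', city = 'Oslo' WHERE id = 42;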

Operator to manage range of dates (or numbers):
where interval(date0, date1) intersects interval(date3, date4)
EDIT: Dates or numbers are, of course, the same case.
EDIT 2: It seems Oracle has something along these lines: the undocumented OVERLAPS predicate. More info here.
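The SQL standard actually defines an OVERLAPS predicate for datetime pairs (PostgreSQL, among others, implements it); where it is missing, the classic workaround is the two-comparison form. A sketch, with endpoint/equality semantics adjusted to taste:
-- standard form, where supported
WHERE (date0, date1) OVERLAPS (date3, date4)
-- portable equivalent
WHERE date0 <= date4 AND date3 <= date1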

A decent way of walking a tree with hierarchical data. Oracle has CONNECT BY but the simple and common structure of storing an object and a self-referential join back to the table for 'parent' is hard to query in a natural way.
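For what it's worth, recursive CTEs (SQL:1999; in SQL Server 2005+, and with the RECURSIVE keyword in PostgreSQL and others) now cover this fairly naturally. A minimal sketch, assuming a table with id and parent_id columns:
WITH subtree AS (
    SELECT id, name, 0 AS depth
    FROM category
    WHERE parent_id IS NULL          -- start at the roots
    UNION ALL
    SELECT c.id, c.name, s.depth + 1 -- then walk down one level per iteration
    FROM category c
    INNER JOIN subtree s ON c.parent_id = s.id
)
SELECT * FROM subtree;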

More a SQL Server issue than SQL itself, but: better integration with source control. Preferably SVN rather than VSS.

Implicit joins, or whatever it should be called (that is, predefined views bound to the table definition):
SELECT CUSTOMERID, SUM(C.ORDERS.LINES.VALUE) FROM
CUSTOMER C
A redesign of the whole GROUP BY thing so that every expression in the SELECT clause doesn't have to be repeated in the GROUP BY clause
Some support for let expressions, or otherwise more legal places to use an alias. This is a bit related to the GROUP BY issue, but there are other times when I just hate Oracle for forcing me to use an outer select just to reference a big expression by alias.
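The pattern being complained about, sketched with invented columns: to reuse a big expression by name, you have to wrap the whole query in an inline view.
-- wished for: SELECT price * qty * (1 - discount) AS x, x * 2 FROM t
-- reality: name it once in an inline view, then reference it outside
SELECT x, x * 2
FROM (SELECT price * qty * (1 - discount) AS x FROM t) AS sub;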

I would like to see the ability to use Regular Expressions in string handling.

A way of dynamically specifying columns/tables without having to resort to full dynamic sql that executes in another context.

Ability to define columns based on other columns ad infinitum (including disambiguation).
This is a contrived example and not a real world case, but I think you'll see where I'm going:
SELECT LTRIM(t1.a) AS [a.new]
,REPLICATE(' ', 20 - LEN([a.new])) + [a.new] AS [a.conformed]
,LEN([a.conformed]) as [a.length]
FROM t1
INNER JOIN t2
ON [a.new] = t2.a
ORDER BY [a.new]
instead of:
SELECT LTRIM(t1.a) AS [a.new]
,REPLICATE(' ', 20 - LEN(LTRIM(t1.a))) + LTRIM(t1.a) AS [a.conformed]
,LEN(REPLICATE(' ', 20 - LEN(LTRIM(t1.a))) + LTRIM(t1.a)) as [a.length]
FROM t1
INNER JOIN t2
ON LTRIM(t1.a) = t2.a
ORDER BY LTRIM(t1.a)
Right now, in SQL Server 2005 and up, I would use a CTE and build up in successive layers.
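That CTE workaround, applied to the example above (dots in the aliases swapped for underscores to keep the CTE simple), looks roughly like this:
WITH step1 AS (
    SELECT t1.*, LTRIM(t1.a) AS a_new
    FROM t1
),
step2 AS (
    SELECT step1.*, REPLICATE(' ', 20 - LEN(a_new)) + a_new AS a_conformed
    FROM step1
)
SELECT s.a_new, s.a_conformed, LEN(s.a_conformed) AS a_length
FROM step2 s
INNER JOIN t2 ON s.a_new = t2.a
ORDER BY s.a_new;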

I'd like the vendors to actually standardise their SQL. They're all guilty of it. The LIMIT/OFFSET clause from MySQL and PostgreSQL is a good solution that no one else appears to adopt. Oracle has its own syntax for explicit JOINs whilst everyone else uses ANSI-92. MySQL should deprecate the CONCAT() function and use || like everyone else. And there are numerous clauses and statements outside the standard that could be more widespread. MySQL's REPLACE is a good example. There's more, with issues about casting and comparing types, quirks of column types, sequences, etc etc etc.

Parameterized ORDER BY, as in:
select * from tableA order by @columnName
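Until something like that exists, the usual workaround (short of dynamic SQL) is a CASE expression; a sketch assuming a known set of sortable columns:
SELECT *
FROM tableA
ORDER BY CASE @columnName
             WHEN 'name' THEN name
             WHEN 'city' THEN city  -- all branches must share a comparable type, or need CAST
         END;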

Support in SQL to specify if you want your query plan to be optimized to return the first rows quickly, or all rows quickly.
Oracle has the concept of FIRST_ROWS hint, but a standard approach in the language would be useful.
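For comparison, the Oracle hint looks like this (table invented); the wish is for an equivalent in the language itself rather than a vendor comment:
SELECT /*+ FIRST_ROWS(10) */ *
FROM orders
ORDER BY order_date DESC;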

Automatic denormalization.
But I may be dreaming.

Improved pivot tables. I'd like to tell it to automatically create the columns based on the keys found in the data.
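Today's T-SQL PIVOT illustrates the complaint: the output columns must be spelled out in advance (tables invented):
SELECT region, [Q1], [Q2], [Q3], [Q4]
FROM (SELECT region, quarter, amount FROM sales) AS s
PIVOT (SUM(amount) FOR quarter IN ([Q1], [Q2], [Q3], [Q4])) AS p;  -- keys hard-coded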

On my wish list is a database supporting sub-queries in CHECK constraints, without having to rely on materialized-view tricks. And a database which supports the SQL standard's "assertions", i.e. constraints which may span more than one table (sketched below).
Something else: a metadata-related function which would return the possible values of a given column, if the set of possible values is low. I.e., if a column has a foreign key to another column, it would return the existing values in the column being referred to. Or, if the column has a CHECK constraint like "CHECK foo IN(1,2,3)", it would return 1,2,3. This would make it easier to create GUI elements based on a table schema: if the function returned a list of two values, the programmer could decide that a radio button widget would be relevant; if it returned, e.g., 10 values, the application could show a dropdown widget instead. Etc.
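For reference, the standard's assertion syntax, which to my knowledge no mainstream engine implements, would allow cross-table constraints like this (schema invented):
CREATE ASSERTION no_double_booking
CHECK (NOT EXISTS (
    SELECT 1
    FROM booking b
    GROUP BY b.room_id, b.day
    HAVING COUNT(*) > 1));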

UPSERT or MERGE in PostgreSQL. It's the one feature whose absence just boggles my mind. Postgres has everything else; why can't they get their act together and implement it, even in limited form?
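For readers elsewhere, the SQL:2003 MERGE being wished for goes roughly like this (names invented; SQL Server 2008's dialect is very close to this sketch):
MERGE INTO account AS a
USING (VALUES (42, 100.00)) AS src (id, amount)
    ON a.id = src.id
WHEN MATCHED THEN
    UPDATE SET balance = a.balance + src.amount
WHEN NOT MATCHED THEN
    INSERT (id, balance) VALUES (src.id, src.amount);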

Check constraints with subqueries, I mean something like:
CHECK ( 1 > (SELECT COUNT(*) FROM TABLE WHERE A = COLUMN))

These are all MS Sql Server/T-SQL specific:
"Natural" joins based on an existing Foreign Key relationship.
Easily use a stored proc result as a resultset
Some other loop construct besides while
Unique constraints across non-NULL values (a filtered-index workaround is sketched after this list)
EXCEPT, IN, ALL clauses instead of LEFT|RIGHT JOIN WHERE x IS [NOT] NULL
Schema bound stored proc (to ease #2)
Relationships, schema bound views, etc. across multiple databases
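On the non-NULL uniqueness point: SQL Server 2008's filtered indexes can emulate it, for what it's worth (names invented):
CREATE UNIQUE NONCLUSTERED INDEX UQ_customer_taxid
ON customer (tax_id)
WHERE tax_id IS NOT NULL;  -- uniqueness enforced only among non-NULL rows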

A WITH clause for statements other than SELECT, i.e. for UPDATE and DELETE.
For instance:
WITH table as (
SELECT ...
)
DELETE from table2 where not exists (SELECT ...)

Something which I call a REFERENCE JOIN. It joins two tables together by implicitly using the FOREIGN KEY...REFERENCES constraint between them.

A relational algebra DIVIDE operator. I hate always having to re-derive how to find all the elements of table A that are related to all the elements of a given set from table B.
http://www.tc.umn.edu/~hause011/code/SQLexample.txt
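Until then, the usual workaround is the double NOT EXISTS formulation, which is exactly the kind of re-derivation being complained about (schema invented; AB is the relationship table):
-- rows of A related to ALL rows of B
SELECT DISTINCT a.id
FROM A a
WHERE NOT EXISTS (
    SELECT 1
    FROM B b
    WHERE NOT EXISTS (
        SELECT 1
        FROM AB ab
        WHERE ab.a_id = a.id
          AND ab.b_id = b.id));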

String aggregation on GROUP BY (in Oracle this is possible with a trick):
SELECT deptno, string_agg(ename) AS employees
FROM emp
GROUP BY deptno;
DEPTNO EMPLOYEES
---------- --------------------------------------------------
10 CLARK,KING,MILLER
20 SMITH,FORD,ADAMS,SCOTT,JONES
30 ALLEN,BLAKE,MARTIN,TURNER,JAMES,WARD
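For SQL Server users, the usual equivalent in the absence of built-in support is the FOR XML PATH trick; a sketch against the same emp table:
SELECT e.deptno,
       STUFF((SELECT ',' + e2.ename
              FROM emp e2
              WHERE e2.deptno = e.deptno
              FOR XML PATH('')), 1, 1, '') AS employees
FROM emp e
GROUP BY e.deptno;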

More OOP features:
stored procedures and user functions
CREATE PROCEDURE tablename.spname ( params ) AS ...
called via
EXECUTE spname
FROM tablename
WHERE conditions
ORDER BY
which implicitly passes a cursor or a current record to the SP. (similar to inserted and deleted pseudo-tables)
table definitions with inheritance
table definition as derived from base table, inheriting common columns etc
Btw, this is not necessarily real OOP, only syntactic sugar over existing technology, but it would simplify development a lot.

Abstract tables and sub-classing
create abstract table person
(
id primary key,
name varchar(50)
);
create table concretePerson extends person
(
birth date,
death date
);
create table fictionalCharacter extends person
(
creator int references concretePerson.id
);

Increased temporal database support in Sql Server. Intervals, overlaps, etc.
Increased OVER support in Sql Server, including LAG, LEAD, and TOP.

Arrays
I'm not sure what's holding this back, but the lack of arrays leads to temp tables and related mess.

Some kind of UPGRADE TABLE statement which would alter an existing table to match the given definition:
CREATE OR UPGRADE TABLE
(
a VARCHAR,
---
)

My wish list (for SQL Server)
Ability to store/use multiple execution plans for a stored procedure concurrently and have the system automatically understand the best stored plan to use at each execution.
Currently there's one plan; if it is no longer optimal, it's used anyway, or a brand-new one is computed in its place.
Native UTF-8 storage
Database mirroring with more than one standby server and the ability to use a recovery model approaching 'simple' provided of course all servers are up and the transaction commits everywhere.
PCRE in replace functions
Some clever way of reusing fragments of large SQL queries, stored match conditions, select conditions, etc. Similar to functions, but actually implemented more like preprocessor macros.

Comments for check constraints. With this feature, an application (or the database itself when raising an error) can query the metadata and retrieve that comment to show it to the user.

Automated DBA notification in the case where the optimizer generates a plan different from the plan the query was tested with.
In other words, every query can be registered. At that time, the plan is saved. Later, when the query is executed, if there is a change to the plan, the DBA receives a notice that something unexpected occurred.


Can I rely on the order of fields in a SQL view?

If I create a view and select my fields in the order I want to "receive" them in, can I be fully assured that I can call "SELECT * FROM myView" from my apps instead of specifying ALL of the field names yet again in my select query?
I ask this because I pass whole datarows to my DataModels and construct the objects by assigning properties from the different indexes in the item array attached to the datarow. If these fields get out of order, there's no telling what could happen to my objects.
I know that I can't rely on an ORDER BY that lives inside of a view (been burned on this one before). But the order of the fields I was not sure about.
Sorry if this is SQL noob level. We all start somewhere with it. Right now all the extraneous field names in my app code make readability somewhat difficult, so if I can safely go back and replace a lot of syntax with a * then that would be great.
These tables are small, so I'm not worried about the implications of using a * over individual fields. I'm just looking to avoid coding unnecessary syntax.
Column order is guaranteed, row order (as you noted) is not.
Column order may not be guaranteed or reliable if both of these are true
the view definition has SELECT * or SELECT tableA.* internally
any changes are made to the table(s) concerned
You'd need to run sp_refreshview: see this question/answer for potential issues.
Of course, if you have simple SELECT * FROM table in a view, why not just use the table and save some maintenance pain?
Finally, and I have to say it, it isn't recommended to use SELECT *... :-)
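The refresh mentioned above is a one-liner (view name invented):
EXEC sp_refreshview 'dbo.myView';  -- re-binds the view's column metadata to the current table definitions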
Yes, left-to-right ordering of columns is guaranteed in SQL. In fact, it's one of the top three flaws used to prove that SQL is not truly relational (e.g. see The Importance of Column Names by Hugh Darwen), duplicate rows and the NULL value being the other two.
Yes, I've always relied on select * returning fields in the order specified in the view or table.
For example Microsoft SQL - "* Specifies that all columns from all tables and views in the FROM clause should be returned. The columns are returned by table or view, as specified in the FROM clause, and in the order in which they exist in the table or view."

SQL Code Smells

Could you please list some of the bad practices in SQL that novices commit?
I have found the use of WHILE loops in scenarios which could be resolved using set operations.
Another example is inserting data only if it does not exist. This can be achieved using a LEFT OUTER JOIN (a sketch of what I mean follows below). Some people go for an IF.
Any other thoughts?
Edit: What I am looking for is specific scenarios (as mentioned in the question) that could be achieved using SQL without procedural constructs.
Thanks
Lijo
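Here is roughly what I mean by the set-based insert-if-not-exists (table names invented):
-- instead of a per-row IF NOT EXISTS (...) INSERT:
INSERT INTO target (id, name)
SELECT s.id, s.name
FROM source s
LEFT OUTER JOIN target t ON t.id = s.id
WHERE t.id IS NULL;  -- only rows not already present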
Here are some I have seen:
Using cursors instead of equivalent (and faster) set operations (joins etc).
Dynamic SQL for everything.
Code that is open to SQL Injection attacks.
Full outer joins even when they are not needed.
Huge stored procedures (hundreds/thousands of lines).
No comments.
Placing ODBC or dynamic SQL calls all over the code. Often it is better to define a data abstraction layer that provides access to the database; all the SQL code can hide in that layer. This often avoids replication of similar queries, and makes changing data models easier to do.
Personally for me: anything that is not a plain INSERT, UPDATE, DELETE or SELECT statement
I don't like logic in SQL.
My biggest beef here is definitely repetitive SQL. As an example, multiple stored procedures that perform the exact same joins but different filters.
Using Views in such cases can make your database MUCH easier to look at and work with
Creating vendor-specific SQL, when generic SQL would do.
Creating tables dynamically at runtime (other than TEMPORARY tables).
Letting your application code have table create or super user privs.
Since the question asks for a list of SQL smells, no answer can be exhaustive. I will be expanding my answer as time permits and memory serves:
Redundant grouping
Redundant grouping is the application of the GROUP BY clause, and consequently of aggregate functions, to more columns than required. It occurs when the author starts by collecting most of or all the data that he needs, only to group it at the very end. Redundant grouping is, therefore, late grouping, for the correct approach is to group early, and only the data that needs grouping.
If a main entity (main) has a journal (jrnl) and refers to another entity, appendage (apnd), then the following query:
SELECT
    main.Id,
    main.Name,
    MAX(jrnl.Entry) AS Entry,
    MAX(jrnl.Date)  AS Date,
    apnd.Reference,
    apnd.Status
FROM main
JOIN jrnl ON jrnl.Parent = main.Id
JOIN apnd ON apnd.Id = main.ApndId
GROUP BY main.Id, main.Name, apnd.Reference, apnd.Status
has redundant grouping, because the sole purpose of the GROUP BY clause is to obtain the latest journal entry. It should be rewritten in a non-redundant manner as follows:
SELECT
    main.Id,
    main.Name,
    skel.MaxEntry AS Entry,
    jrnl.Date,
    apnd.Reference,
    apnd.Status
FROM
    ( SELECT
          jrnl.Parent     AS MainId,
          MAX(jrnl.Entry) AS MaxEntry
      FROM jrnl            -- group only the journal, the narrowest set that needs it
      GROUP BY jrnl.Parent
    ) skel
JOIN main ON main.Id    = skel.MainId
JOIN jrnl ON jrnl.Entry = skel.MaxEntry
JOIN apnd ON apnd.Id    = main.ApndId
That is, we group on the narrowest dataset possible, and join the rest on afterwards, even if it means referencing the same tables!

Is WHERE ID IN (1, 2, 3, 4, 5, ...) the most efficient?

I know that this topic has been beaten to death, but it seems that many articles on the Internet are often looking for the most elegant way instead of the most efficient way to solve it. Here is the problem. We are building an application where one of the common database queries will involve manipulation (SELECTs and UPDATEs) based on a user-supplied list of IDs. The table in question is expected to have hundreds of thousands of rows, and the user-provided lists of IDs can potentially be unbounded, but they will most likely be in the tens or hundreds (we may limit them for performance reasons later).
If my understanding of how databases work in general is correct, the most efficient approach is to simply use the WHERE ID IN (1, 2, 3, 4, 5, ...) construct and build queries dynamically. The core of the problem is that the input lists of IDs will be really arbitrary, and so no matter how clever the database is or how cleverly we implement it, we always have a random subset of integers to start with, and so eventually every approach has to internally boil down to something like WHERE ID IN (1, 2, 3, 4, 5, ...) anyway.
One can find many approaches all over the web. For instance, one involves declaring a table variable, passing the list of IDs to a stored procedure as a comma-delimited string, splitting it in the stored procedure, inserting the IDs into the table variable and joining the master table on it, i.e. something like this:
-- 1. Temporary table for IDs:
DECLARE @IDS TABLE (ID int);
-- 2. Split the given string of IDs, and insert each ID into @IDS.
-- Omitted for brevity.
-- 3. Join the main table to @IDS:
SELECT MyTable.ID, MyTable.SomeColumn
FROM MyTable INNER JOIN @IDS AS IDS ON MyTable.ID = IDS.ID;
Putting the problems with string manipulation aside, I think what essentially happens in this case is that in the third step the SQL Server says: “Thank you, that’s nice, but I just need a list of the IDs”, and it scans the table variable @IDS, and then does n seeks in MyTable where n is the number of IDs. I’ve done some elementary performance evaluations and inspected the query plan, and it seems that this is what happens. So the table variable, the string concatenation and splitting and all the extra INSERTs are for nothing.
Am I correct? Or am I missing anything? Is there really some clever and more efficient way? Basically, what I’m saying is that the SQL Server has to do n index seeks no matter what and formulating the query as WHERE ID IN (1, 2, 3, 4, 5, ...) is the most straightforward way to ask for it.
Well, it depends on what's really going on. How is the user choosing these IDs?
Also, it's not just efficiency; there's also security and correctness to worry about. When and how does the user tell the database about their ID choices? How do you incorporate them into the query?
It might be much better to put the selected IDs into a separate table that you can join against (or use a WHERE EXISTS against).
I'll give you that you're not likely to do much better performance-wise than IN (1,2,3..n) for a small (user-generated) n. But you need to think about how you generate that query. Are you going to use dynamic SQL? If so, how will you secure it from injection? Will the server be able to cache the execution plan?
Also, using an extra table is often just easier. Say you're building a shopping cart for an eCommerce site. Rather than worrying about keeping track of the cart client-side or in a session, it's likely better to update the ShoppingCart table every time the user makes a selection. This also avoids the whole problem of how to safely set the parameter value for your query, because you're only making one change at a time.
Don't forget the old adage (with apologies to Benjamin Franklin):
He who would trade correctness for performance deserves neither
Be careful; on many databases, IN (...) is limited to a fixed number of things in the IN clause. For example, I think it's 1000 in Oracle. That's big, but possibly worth knowing.
The IN clause does not guarantee an INDEX SEEK. I faced this problem before using SQL Server Mobile Edition on a Pocket PC with very little memory. Replacing IN (list) with a list of OR clauses boosted my query by approximately 400%.
Another approach is to have a temp table that stores the IDs and join it against the target table, but if this operation is used too often a permanent/indexed table can help the optimizer.
For me the IN (...) is not the preferred option due to many reasons, including the limitation on the number of parameters.
Following up on a note from Jan Zich regarding the performance using various temp-table implementations, here are some numbers from SQL execution plan:
XML solution: 99% time - xml parsing
comma-separated procedure using UDF from CodeProject: 50% temp table scan, 50% index seek. One can argue whether this is the most optimal implementation of string parsing, but I did not want to create one myself (I will happily test another one).
CLR UDF to split string: 98% - index seek.
Here is the code for CLR UDF:
using System;
using System.Collections;
using Microsoft.SqlServer.Server;

public class SplitString
{
    [SqlFunction(FillRowMethodName = "FillRow")]
    public static IEnumerable InitMethod(String inputString)
    {
        // each element of the split array becomes one row of the table-valued function
        return inputString.Split(',');
    }

    public static void FillRow(Object obj, out int ID)
    {
        string strID = (string)obj;
        ID = Int32.Parse(strID);
    }
}
So I will have to agree with Jan that the XML solution is not efficient. Therefore, if a comma-separated list is to be passed as a filter, a simple CLR UDF seems to be optimal in terms of performance.
I tested the search of 1K records in a table of 200K.
A table variable has issues, whereas using a temp table with an index has benefits for statistics: a table variable is assumed to always have one row, while a temp table has stats the optimiser can use.
Parsing a CSV is easy: see the related questions...
Essentially, I would agree with your observation; SQL Server's optimizer will ultimately pick the best plan for analyzing a list of values and it will typically equate to the same plan, regardless of whether or not you are using
WHERE IN
or
WHERE EXISTS
or
JOIN someholdingtable ON ...
Obviously, there are other factors which influence plan choice (like covering indexes, etc). The reason that people have various methods for passing in this list of values to a stored procedure is that before SQL 2008, there really was no simple way of passing in multiple values. You could do a list of parameters (WHERE IN (@param1, @param2)...), or you could parse a string (the method you show above). As of SQL 2008, you can also pass table variables around, but the overall result is the same.
So yes, it doesn't matter how you get the list of variables to the query; however, there are other factors which may have some effect on the performance of said query once you get the list of variables in there.
Once upon a long time ago, I found that on the particular DBMS I was working with, the IN list was more efficient up to some threshold (which was, IIRC, something like 30-70), and after that, it was more efficient to use a temp table to hold the list of values and join with the temp table. (The DBMS made creating temp tables very easy, but even with the overhead of creating and populating the temp table, the queries ran faster overall.) This was with up-to-date statistics on the main data tables (but it also helped to update the statistics for the temp table too).
There is likely to be a similar effect in modern DBMS; the threshold level may well have changed (I am talking about depressingly close to twenty years ago), but you need to do your measurements and consider your strategy or strategies. Note that optimizers have improved since then - they may be able to make sensible use of bigger IN lists, or automatically convert an IN list into an anonymous temp table. But measurement will be key.
In SQL Server 2008 or later you should be looking to use table-valued parameters.
2008 makes it simple to pass a comma-separated list to SQL Server using this method.
Here is an excellent source of information and performance tests on the subject:
Arrays-in-sql-2008
Here is a great tutorial:
passing-table-valued-parameters-in-sql-server-2008
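A minimal sketch of the table-valued-parameter approach (type and procedure names invented), reusing MyTable from the question:
CREATE TYPE IdList AS TABLE (ID int PRIMARY KEY);
GO
CREATE PROCEDURE GetByIds (@ids IdList READONLY)  -- TVPs must be declared READONLY
AS
BEGIN
    SELECT m.ID, m.SomeColumn
    FROM MyTable AS m
    INNER JOIN @ids AS i ON i.ID = m.ID;
END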
For many years I used the third approach, but when I started using an O/RM it seemed to be unnecessary.
Even loading each row by ID is not as inefficient as it looks.
If the problems with string manipulation are put aside, I think that:
WHERE ID=1 OR ID=2 OR ID=3 ...
is more efficient; nevertheless I wouldn't do it.
You could compare performance between both approaches.
To answer the question directly: there is no way to pass a (dynamic) list of arguments to an SQL Server 2005 procedure. Therefore what most people do in these cases is pass a comma-delimited list of identifiers, which I did as well.
Since SQL 2005, though, I prefer passing an XML string, which is also very easy to create on the client side (C#, Python, another SQL SP), and "native" to work with since 2005:
CREATE PROCEDURE myProc(@MyXmlAsSTR NVARCHAR(MAX)) AS BEGIN
DECLARE @x XML
SELECT @x = CONVERT(XML, @MyXmlAsSTR)
Then you can join your base query directly with the XML select as (not tested):
SELECT t.*
FROM myTable t
INNER JOIN @x.nodes('/ROOT/ROW') AS R(x)
ON t.ID = x.value('@ID', 'INTEGER')
when passing <ROOT><ROW ID="1"/><ROW ID="2"/></ROOT>. Just remember that XML is CaSe-SensiTive.
select t.*
from (
select id = 35 union all
select id = 87 union all
select id = 445 union all
...
select id = 33643
) ids
join my_table t on t.id = ids.id
If the set of ids to search on is small, this may improve performance by permitting the query engine to do an index seek. If the optimizer judges that a table scan would be faster than, say, one hundred index seeks, then the optimizer will so instruct the query engine.
Note that query engines tend to treat
select t.*
from my_table t
where t.id in (35, 87, 445, ..., 33643)
as equivalent to
select t.*
from my_table t
where t.id = 35 or t.id = 87 or t.id = 445 or ... or t.id = 33643
and note that query engines tend not to be able to perform index seeks on queries with disjunctive search criteria. As an example, Google AppEngine datastore will not execute a query with a disjunctive search criteria at all, because it will only execute queries for which it knows how to perform an index seek.

Why is using '*' to build a view bad?

Why is using '*' to build a view bad?
Suppose that you have a complex join and all fields may be used somewhere.
Then you just have to choose the fields needed.
SELECT field1, field2 FROM aview WHERE ...
The view "aview" could be SELECT table1.*, table2.* ... FROM table1 INNER JOIN table2 ...
We have a problem if 2 fields have the same name in table1 and table2.
Is this the only reason why using '*' in a view is bad?
With '*', you may use the view in a different context because the information is there.
What am I missing?
Regards
I don't think there's much in software that is "just bad", but there's plenty of stuff that is misused in bad ways :-)
The example you give is a reason why * might not give you what you expect, and I think there are others. For example, if the underlying tables change, maybe columns are added or removed, a view that uses * will continue to be valid, but might break any applications that use it. If your view had named the columns explicitly then there was more chance that someone would spot the problem when making the schema change.
On the other hand, you might actually want your view to blithely accept all changes to the underlying tables, in which case a * would be just what you want.
Update: I don't know if the OP had a specific database vendor in mind, but it is now clear that my last remark does not hold true for all types. I am indebted to user12861 and Jonny Leeds for pointing this out, and sorry it's taken over 6 years for me to edit my answer.
Although many of the comments here are very good and reference one common problem of using wildcards in queries, such as causing errors or different results if the underlying tables change, another issue that hasn't been covered is optimization. A query that pulls every column of a table tends to not be quite as efficient as a query that pulls only those columns you actually need. Granted, there are those times when you need every column and it's a major PIA having to reference them all, especially in a large table, but if you only need a subset, why bog down your query with more columns than you need.
Another reason why "*" is risky, not only in views but in queries, is that columns can change name or change position in the underlying tables. Using a wildcard means that your view accommodates such changes easily without needing to be changed. But if your application references columns by position in the result set, or if you use a dynamic language that returns result sets keyed by column name, you could experience problems that are hard to debug.
I avoid using the wildcard at all times. That way if a column changes name, I get an error in the view or query immediately, and I know where to fix it. If a column changes position in the underlying table, specifying the order of the columns in the view or query compensates for this.
These other answers all have good points, but on SQL server at least they also have some wrong points. Try this:
create table temp (i int, j int)
go
create view vtemp as select * from temp
go
insert temp select 1, 1
go
alter table temp add k int
go
insert temp select 1, 1, 1
go
select * from vtemp
SQL Server doesn't learn about the "new" column when it is added. Depending on what you want this could be a good thing or a bad thing, but either way it's probably not good to depend on it. So avoiding it just seems like a good idea.
To me this weird behavior is the most compelling reason to avoid select * in views.
The comments have taught me that MySQL has similar behavior and Oracle does not (it will learn about changes to the table). This inconsistency to me is all the more reason not to use select * in views.
Using '*' for anything production is bad. It's great for one-off queries, but in production code you should always be as explicit as possible.
For views in particular, if the underlying tables have columns added or removed, the view will either be wrong or broken until it is recompiled.
Using SELECT * within the view does not incur much of a performance overhead if columns aren't used outside the view - the optimizer will optimize them out; SELECT * FROM TheView can perhaps waste bandwidth, just like any time you pull more columns across a network connection.
In fact, I have found that views which link almost all the columns from a number of huge tables in my data warehouse have not introduced any performance issues at all, even though relatively few of those columns are requested from outside the view. The optimizer handles that well and is able to push the external filter criteria down into the view very well.
However, for all the reasons given above, I very rarely use SELECT *.
I have some business processes where a number of CTEs are built on top of each other, effectively building derived columns from derived columns from derived columns (which will hopefully one day be refactored as the business rationalizes and simplifies these calculations), and in that case, I need all the columns to drop through each time, and I use SELECT * - but SELECT * is not used at the base layer, only in between the first CTE and the last.
The situation on SQL Server is actually even worse than the answer by @user12861 implies: if you use SELECT * against multiple tables, adding columns to a table referenced early in the query will actually cause your view to return the values of the new columns under the guise of the old columns. See the example below:
-- create two tables
CREATE TABLE temp1 (ColumnA INT, ColumnB DATE, ColumnC DECIMAL(2,1))
CREATE TABLE temp2 (ColumnX INT, ColumnY DATE, ColumnZ DECIMAL(2,1))
GO
-- populate with dummy data
INSERT INTO temp1 (ColumnA, ColumnB, ColumnC) VALUES (1, '1/1/1900', 0.5)
INSERT INTO temp2 (ColumnX, ColumnY, ColumnZ) VALUES (1, '1/1/1900', 0.5)
GO
-- create a view with a pair of SELECT * statements
CREATE VIEW vwtemp AS
SELECT *
FROM temp1 INNER JOIN temp2 ON 1=1
GO
-- SELECT showing the columns properly assigned
SELECT * FROM vwTemp
GO
-- add a few columns to the first table referenced in the SELECT
ALTER TABLE temp1 ADD ColumnD varchar(1)
ALTER TABLE temp1 ADD ColumnE varchar(1)
ALTER TABLE temp1 ADD ColumnF varchar(1)
GO
-- populate those columns with dummy data
UPDATE temp1 SET ColumnD = 'D', ColumnE = 'E', ColumnF = 'F'
GO
-- notice that the original columns have the wrong data in them now, causing any datatype-specific queries (e.g., arithmetic, dateadd, etc.) to fail
SELECT *
FROM vwtemp
GO
-- clean up
DROP VIEW vwTemp
DROP TABLE temp2
DROP TABLE temp1
It's because you don't always need every variable, and also to make sure that you are thinking about what you specifically need.
There's no point getting all the hashed passwords out of the database when building a list of users on your site for instance, so a select * would be unproductive.
Once upon a time, I created a view against a table in another database (on the same server) with
Select * From dbname..tablename
Then one day, a column was added to the targeted table. The view started returning totally incorrect results until it was redeployed.
Totally incorrect : no rows.
This was on Sql Server 2000.
I speculate that this is because of syscolumns values that the view had captured, even though I used *.
A SQL query is basically a functional unit designed by a programmer for use in some context. For long-term stability and supportability (possibly by someone other than you) everything in a functional unit should be there for a purpose, and it should be reasonably evident (or documented) why it's there - especially every element of data.
If I were to come along two years from now with the need or desire to alter your query, I would expect to grok it pretty thoroughly before I would be confident that I could mess with it. Which means I would need to understand why all the columns are called out. (This is even more obviously true if you are trying to reuse the query in more than one context. Which is problematic in general, for similar reasons.) If I were to see columns in the output that I couldn't relate to some purpose, I'd be pretty sure that I didn't understand what it did, and why, and what the consequences would be of changing it.
It's generally a bad idea to use *. Some code certification engines mark this as a warning and advise you to explicitly refer only to the necessary columns. The use of * can lead to performance losses, as you might only need some columns and not all. But, on the other hand, there are some cases where the use of * is ideal. Imagine that, no matter what, using the example you provided, for this view (aview) you would always need all the columns in these tables. In the future, when a column is added, you wouldn't need to alter the view. This can be good or bad depending on the case you are dealing with.
I think it depends on the language you are using. I prefer to use select * when the language or DB driver returns a dict (Python, Perl, etc.) or associative array (PHP) of the results. It makes your code a lot easier to understand if you are referring to the columns by name instead of as an index in an array.
No one else seems to have mentioned it, but within SQL Server you can also set up your view with the schemabinding attribute.
This prevents modifications to any of the base tables (including dropping them) that would affect the view definition.
This may be useful to you for some situations. I realise that I haven't exactly answered your question, but thought I would highlight it nonetheless.
And if you have joins, using select * automatically means you are returning more data than you need, as the data in the join fields is repeated. This is wasteful of database and network resources.
If you are naive enough to use views that call other views, using select * can make them even worse performers (this technique is bad for performance on its own; pulling multiple columns you don't need makes it much worse).

When to use SQL Table Alias

I'm curious to know how people are using table aliases. The other developers where I work always use table aliases, and always use the alias of a, b, c, etc.
Here's an example:
SELECT a.TripNum, b.SegmentNum, b.StopNum, b.ArrivalTime
FROM Trip a, Segment b
WHERE a.TripNum = b.TripNum
I disagree with them, and think table aliases should be used more sparingly.
I think they should be used when including the same table twice in a query, or when the table name is very long and using a shorter name in the query will make the query easier to read.
I also think the alias should be a descriptive name rather than just a letter. In the above example, if I felt I needed to use one-letter table aliases I would use t for the Trip table and s for the Segment table.
There are two reasons for using table aliases.
The first is cosmetic. The statements are easier to write, and perhaps also easier to read when table aliases are used.
The second is more substantive. If a table appears more than once in the FROM clause, you need table aliases in order to keep them distinct. Self joins are common in cases where a table contains a foreign key that references the primary key of the same table.
Two examples: an employees table that contains a supervisorID column that references the employeeID of the supervisor.
The second is a parts explosion. Often, this is implemented in a separate table with three columns: ComponentPartID, AssemblyPartID, and Quantity. In this case, there won't be any self joins, but there will often be a three way join between this table and two different references to the table of Parts.
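A sketch of the first example (the name column is invented), where the aliases are what keep the two uses of the table apart:
SELECT e.employeeID, e.name AS employee, s.name AS supervisor
FROM employees e
INNER JOIN employees s ON s.employeeID = e.supervisorID;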
It's a good habit to get into.
I use them to save typing. However, I always use letters similar to the function. So, in your example, I would type:
SELECT t.TripNum, s.SegmentNum, s.StopNum, s.ArrivalTime
FROM Trip t, Segment s
WHERE t.TripNum = s.TripNum
That just makes it easier to read, for me.
As a general rule I always use them, as there are usually multiple joins going on in my stored procedures. It also makes it easier when using code generation tools like CodeSmith to have it generate the alias name automatically for you.
I try to stay away from single letters like a & b, as I may have multiple tables that start with the letter a or b. I go with a longer approach: the concatenation of the referenced foreign key with the aliased table, for example CustomerContact ... this would be the alias for the Customer table when joining to a Contact table.
The other reason I don't mind longer names is that most of my stored procedures are being generated via CodeSmith. I don't mind hand-typing the few that I may have to build myself.
Using the current example, I would do something like:
SELECT TripNum, TripSegment.SegmentNum, TripSegment.StopNum, TripSegment.ArrivalTime
FROM Trip, Segment TripSegment
WHERE TripNum = TripSegment.TripNum
Can I add to a debate that is already several years old?
There is another reason that no one has mentioned. The SQL parser in certain databases works better with an alias. I cannot recall if Oracle changed this in later versions, but when it came to an alias, it looked up the columns in the database and remembered them. When it came to a table name, even if it was already encountered in the statement, it re-checked the database for the columns. So using an alias allowed for faster parsing, especially of long SQL statements. I am sure someone knows if this is still the case, if other databases do this at parse time, and if it changed, when it changed.
I use them always, for these reasons:
leaving full table names in statements makes them hard to read, plus you cannot have the same table twice
not qualifying at all is a very bad idea, because later you could add some field to one of the tables that is already present in some other table
Consider this example:
select col1, col2
from tab1
join tab2 on tab1.col3 = tab2.col3
Now, imagine a few months later, you decide to add column named 'col1' to tab2. Database will silently allow you to do that, but applications would break when executing the above query because of ambiguity between tab1.col1 and tab2.col1.
But, I agree with you on the naming: a, b, c is fine, but t and s would be much better in your example. And when I have the same table more than once, I would use t1, t2, ... or s1, s2, s3...
Using the full name makes it harder to read, especially for larger queries or the Order/Product/OrderProduct scenario.
I'd use t and s. Or o/p/op.
If you use SCHEMABINDING then columns must be qualified anyway
If you add a column to a base table, then the qualification reduces the chance of a duplicate in the query (for example a "Comment" column)
Because of this qualification, it makes sense to always use aliases.
Using a and b is blind obedience to a bizarre standard.
In simple queries I do not use aliases. In queries with multiple tables I always use them because:
they make queries more readable (my aliases are 2 or more capital letters that are a shortcut for the table name and, if possible, a relationship to other tables)
they allow faster developing and rewriting (my table names are long and have prefixes depending on the role they play)
so instead of for example:
SELECT SUM(a.VALUE)
FROM Domesticvalues a, Foreignvalues b
WHERE a.Value>b.Value
AND a.Something ...
I write:
select SUM(DVAL.Value)
from DomesticValues DVAL, ForeignValues FVAL
where DVAL.Value > FVAL.Value
and DVAL.Something ...
There are many good ideas in the posts above about when and why to alias table names. What no one else has mentioned is that it is also beneficial in helping a maintainer understand the scope of tables. At our company we are not allowed to create views. (Thank the DBA.) So, some of our queries become large, even exceeding the 50,000 character limit of a SQL command in Crystal Reports. When a query aliases its tables as a, b, c, and a subquery of that does the same, and multiple subqueries in that one each use the same aliases, it is easy for one to mistake what level of the query is being read. This can even confuse the original developer when enough time has passed. Using unique aliases within each level of a query makes it easier to read because the scope remains clear.
I feel that you should use them as often as possible but I do agree that t & s represent the entities better than a & b.
This boils down to, like everything else, preferences. I like that you can depend on your stored procedures following the same conventions when each developer uses the alias in the same manner.
Go convince your coworkers to get on the same page as you or this is all worthless. The alternative is you could have a table Zebra as first table and alias it as a. That would just be cute.
I only use them when they are necessary to distinguish which table a field is coming from:
select PartNumber, I.InventoryTypeId, InventoryTypeDescription
from dbo.Inventory I
inner join dbo.InventoryType IT on IT.InventoryTypeId = I.InventoryTypeId
In the example above both tables have an InventoryTypeId field, but the other field names are unique.
Always use an abbreviation for the table as the name so that the code makes more sense - ask your other developers if they name their local variables A, B, C, etc!
The only exception is in the rare cases where the SQL syntax requires a table alias but it isn't referenced, e.g.
select *
from (
select field1, field2, sum(field3) as total
from someothertable
) X
In the above, SQL syntax requires the table alias for the subselect, but it isn't referenced anywhere so I get lazy and use X or something like that.
I find it nothing more than a preference. As mentioned above, aliases save typing, especially with long table/view names.
One thing I've learned is that, especially with complex queries, it is far simpler to troubleshoot six months later if you use the alias as a qualifier for every field reference. Then you aren't trying to remember which table each field came from.
We tend to have some ridiculously long table names, so I find it easier to read if the tables are aliased. And of course you must do it if you are using a derived table or a self join, so being in the habit is a good idea. I find most of our developers end up using the same alias for each table in all their sps, so most of the time anyone reading one will immediately know what pug or mmh is the alias for.
I always use them. I formerly only used them in queries involving just one table but then I realized a) queries involving just one table are rare, and b) queries involving just one table rarely stay that way for long. So I always put them in from the start so that I (or someone else) won't have to retro fit them later. Oh and BTW: I call them "correlation names", as per the SQL-92 Standard :)
Tables aliases should be four things:
Short
Meaningful
Always used
Used consistently
For example if you had tables named service_request, service_provider, user, and affiliate (among many others) a good practice would be to alias those tables as "sr", "sp", "u", and "a", and do so in every query possible. This is especially convenient if, as is often the case, these aliases coincide with acronyms used by your organization. So if "SR" and "SP" are the accepted terms for Service Request and Service Provider respectively, the aliases above carry a double payload of intuitively standing in for both the table and the business object it represents.
The obvious flaws with this system are first that it can be awkward for table names with lots of "words" e.g. a_long_multi_word_table_name which would alias to almwtn or something, and that it's likely you'll end up with tables named such that they abbreviate the same. The first flaw can be dealt with however you like, such as by taking the last 3 or 4 letters, or whichever subset you feel is most representative, most unique, or easiest to type. The second I've found in practice isn't as troublesome as it might seem, perhaps just by luck. You can also do things like take the second letter of a "word" in the table as well, such as aliasing account_transaction to "atr" instead of "at" to avoid conflicting with account_type.
Of course whether you use the above approach or not, aliases should be short because you'll be typing them very very frequently, and they should always be used because once you've written a query against a single table and omitted the alias, it's inevitable that you'll later need to edit in a second table with duplicate column names.
Because I always fully qualify my tables, I use aliases to provide a shorter name for the fields being SELECTed when JOINed tables are involved. I think it makes my code easier to follow.
If the query deals with only one source - no JOINs - then I don't use an alias.
Just a matter of personal preference.
Always. Make it a habit.