SQL Table / Sub-Query Alias Conventions - sql

I've been writing SQL for a number of years now on various DBMSs (Oracle, SQL Server, MySQL, Access etc.) and one thing that has always struck me is the seeming lack of a naming convention when it comes to table and sub-query aliases.
I've always read that table aliases are the way to go and, although I haven't always used them, when I do I'm always stuck on which names to use. I've gone from using descriptive names to single characters such as 't', 's' or 'q' and back again. Take, for example, this MS Access query I've just written. I'm still not entirely happy with the aliases I'm using, and even for a relatively simple query like this I don't think it's all that easy to read:
SELECT stkTrans.StockName
, stkTrans.Sedol
, stkTrans.BookCode
, SUM(IIF(stkTrans.TransactionType="S", -1 * stkTrans.Units, 0)) AS [Sell Shares]
, SUM(IIF(stkTrans.TransactionType="B", stkTrans.Units, 0)) AS [Buy Shares]
, SUM(IIF(stkTrans.TransactionType="B", -1 * stkTrans.Price, 0) * stkTrans1.Min_Units) + SUM(IIF(stkTrans.TransactionType="S", stkTrans.Price, 0) * stkTrans1.Min_Units) AS [PnL]
, "" AS [Comment]
FROM tblStockTransactions AS stkTrans
INNER JOIN (SELECT sT1.BookCode
, sT1.Sedol
, MIN(sT1.Units) AS [Min_Units]
FROM tblStockTransactions sT1
GROUP BY sT1.BookCode, sT1.Sedol
HAVING (SUM(IIF(sT1.TransactionType="S", 1, 0)) > 0
AND SUM(IIF(sT1.TransactionType="B", 1, 0)) > 0)) AS stkTrans1 ON (stkTrans.BookCode = stkTrans1.BookCode) AND (stkTrans.Sedol = stkTrans1.Sedol)
GROUP BY stkTrans.BookCode, stkTrans.StockName, stkTrans.Sedol;
What do you think? Thought I would throw it out there to see what everyone else's feelings are about this.

I don't know of any canonical style rules for naming table/query aliases across databases, although I understand that Oracle recommends abbreviations of three to four characters.
I would generally steer clear of single letter abbreviations, except where the query is sufficiently simple that these should be completely unambiguous to anyone having to maintain the code - typically no more than two or three tables per query.
I would also generally avoid long alias names that conform to the general style of your database table-naming conventions, since it can become unclear what is a database table name and what is an alias.
In the example provided, the alias sT1 inside the inline view is utterly unnecessary, as there is only one table being accessed within that inline view. That leaves one table being joined to one inline view (based on the same table) in the query - in these circumstances, I would use s as the alias for the table, and s1 as the alias for the inline view (to indicate that it was querying the same underlying database table).
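Applied to the query in the question, that suggestion might look like the sketch below. It is Access SQL against the asker's tblStockTransactions table and has not been run against their data. The inner alias is dropped, and the outer aliases become s and s1. Access SQL has no comment syntax, so the notes are here rather than inline.

```sql
SELECT s.StockName
     , s.Sedol
     , s.BookCode
     , SUM(IIF(s.TransactionType = "S", -1 * s.Units, 0)) AS [Sell Shares]
     , SUM(IIF(s.TransactionType = "B", s.Units, 0)) AS [Buy Shares]
     , SUM(IIF(s.TransactionType = "B", -1 * s.Price, 0) * s1.Min_Units)
     + SUM(IIF(s.TransactionType = "S", s.Price, 0) * s1.Min_Units) AS [PnL]
     , "" AS [Comment]
FROM tblStockTransactions AS s
INNER JOIN (
        SELECT BookCode
             , Sedol
             , MIN(Units) AS [Min_Units]
        FROM tblStockTransactions
        GROUP BY BookCode, Sedol
        HAVING SUM(IIF(TransactionType = "S", 1, 0)) > 0
           AND SUM(IIF(TransactionType = "B", 1, 0)) > 0
     ) AS s1
    ON (s.BookCode = s1.BookCode) AND (s.Sedol = s1.Sedol)
GROUP BY s.BookCode, s.StockName, s.Sedol;
```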

DT1 and DT2 seem to be a good approach.
I am in the process of understanding some existing procedures, and some of them use this naming convention (DT1, DT2, ...).
It is fairly simple to follow, more so than giving each derived table some short name as an alias.

If I was using SQL Server I'd probably put the derived table in a Common Table Expression (CTE) with a logical name (to indicate what the table is) then shorten it with a correlation name ("table alias") in the main query (to aid readability) e.g.
WITH StockTransactions__type_S_or_B__smallest_units
AS
(
<derived table query here>
)
SELECT stkTrans.StockName
...
FROM tblStockTransactions AS stkTrans
INNER JOIN StockTransactions__type_S_or_B__smallest_units AS stkTrans1
ON (stkTrans.BookCode = stkTrans1.BookCode) AND (stkTrans.Sedol = stkTrans1.Sedol)
GROUP BY stkTrans.BookCode, stkTrans.StockName, stkTrans.Sedol;
Obviously, this isn't an option in Access so you go straight to the correlation name and lose the full name entirely. This is not ideal but acceptable, IMO.
SQL requires a name to be assigned to a derived table for no reason at all. This example is from Hugh Darwen, and I think we can safely assume from it that the obligation annoys him:
SELECT DISTINCT E#, TOTAL_PAY
FROM ( SELECT E#, SALARY + BONUS AS TOTAL_PAY
FROM EMP ) AS TEETH_GNASHER
WHERE TOTAL_PAY >= 500
Personally, for such a meaningless requirement I choose the almost meaningless and uncontroversial name, DT1 being a contraction of first derived table and allowing for DT2, DT3, etc e.g.
SELECT DISTINCT E#, TOTAL_PAY
FROM ( SELECT E#, SALARY + BONUS AS TOTAL_PAY
FROM EMP ) AS DT1
WHERE TOTAL_PAY >= 500

Related

Divide by 'Over' clause in MSSQL works but divide by 'Alias' does not

In MS SQL Server, I spent too much time trying to resolve this. I finally figured it out, except I don't know the reason. How come dividing by the CAST expression on line 4 below works:
SELECT
cast(dbo.FACTINVOICEHEADER.TOTAL_NET_AMOUNT_AMOUNT AS decimal(18,8))
AS TOTAL_NET_AMOUNT_AMOUNT,
cast((SUM(dbo.FACTINVOICEHEADER.TOTAL_NET_AMOUNT_AMOUNT)
OVER (PARTITION BY dbo.DIMPROJECT.PROJECT_KEY)) AS decimal(18,8))
AS ActualAmountPaidOnProjectGroupedByInvoice,
((dbo.FACTINVOICEHEADER.TOTAL_NET_AMOUNT_AMOUNT)
/
(cast((SUM(dbo.FACTINVOICEHEADER.TOTAL_NET_AMOUNT_AMOUNT)
OVER (PARTITION BY dbo.DIMPROJECT.PROJECT_KEY)) AS decimal(18,8))))
AS 'Allocation_Amount',
But when I try to divide by the alias that I created, 'ActualAmountPaidOnMatterGroupedByInvoice', on line 3, I get an error message:
Msg 207, Level 16, State 1, Line 131 Invalid column name 'ActualAmountPaidOnMatterGroupedByInvoice'
Sample incorrect code:
SELECT
cast(dbo.FACTINVOICEHEADER.TOTAL_NET_AMOUNT_AMOUNT AS decimal(18,8))
AS TOTAL_NET_AMOUNT_AMOUNT,
cast((SUM(dbo.FACTINVOICEHEADER.TOTAL_NET_AMOUNT_AMOUNT)
OVER (PARTITION BY dbo.DIMPROJECT.PROJECT_KEY))
AS decimal(18,8))
AS ActualAmountPaidOnProjectGroupedByInvoice,
((dbo.FACTINVOICEHEADER.TOTAL_NET_AMOUNT_AMOUNT)
/
(ActualAmountPaidOnProjectGroupedByInvoice) AS decimal(18,8))))
AS 'Allocation_Amount'
How come? Thanks all!
The reason you cannot use the alias in the query is that the alias has not been recognized by the query engine yet. The engine evaluates queries in stages, in the following order:
FROM -> WHERE -> GROUP BY -> HAVING -> SELECT -> ORDER BY -> LIMIT
One of the last steps in the SELECT stage is to apply the aliases specified in the query to the resulting dataset. Since these are not applied until the end of the SELECT stage, they are not available in the evaluation of the data to be returned nor in the WHERE, GROUP BY, or HAVING stages.
Additionally, some query engines do allow aliases (or ordinal position) to be used in the ORDER BY stage. As pointed out by Julian in the comments, MSSQL does allow for ordinal position ordering syntax.
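To make the ordering concrete, here is a small T-SQL sketch. The temp table #invoice and its columns are made up for illustration and are not from the question. It shows an alias being accepted in ORDER BY but rejected in WHERE.

```sql
-- Hypothetical sample data; table and column names are invented for this sketch.
CREATE TABLE #invoice (project_key INT, net_amount DECIMAL(18,2));
INSERT INTO #invoice (project_key, net_amount)
VALUES (1, 100.00), (1, 300.00), (2, 50.00);

-- Works: ORDER BY is evaluated after SELECT, so the alias is visible.
SELECT project_key, net_amount * 2 AS doubled
FROM #invoice
ORDER BY doubled;

-- Also works in SQL Server: ordering by ordinal position (2 = second select column).
SELECT project_key, net_amount * 2 AS doubled
FROM #invoice
ORDER BY 2;

-- Fails with Msg 207 (Invalid column name 'doubled'): WHERE is evaluated before SELECT.
-- SELECT project_key, net_amount * 2 AS doubled
-- FROM #invoice
-- WHERE doubled > 150;

-- Repeating the expression in WHERE is accepted.
SELECT project_key, net_amount * 2 AS doubled
FROM #invoice
WHERE net_amount * 2 > 150;

DROP TABLE #invoice;
```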
I think you might be misunderstanding where aliased columns are available/able to be referenced by the aliased name, particularly because you said (paraphrase) "an alias I created on line 3 of the sql wasn't available on line 4":
Wrong:
SELECT
1200 as games_won,
25 as years_played,
--can't use these aliases below in the same select block that they were declared in
games_won / years_played as games_won_per_year
...
Right:
SELECT
1200 as games_won,
25 as years_played,
--can use the values though
1200 / 25 as games_won_per_year
Right:
SELECT
games_won / years_played as games_won_per_year --alias from inner scope is available in this outer scope
FROM
(
SELECT
--these aliases only become available outside the brackets
1200 as games_won,
25 as years_played
) x
You can't alias a column and use the alias again in the same select block; you can only alias in an inner/subquery and use the alias in an outer query. SQL is not like a programming language that operates line by line:
int gameswon = 1200;
int yearsplayed = 25;
int winsperyear = gameswon / yearsplayed;
Here in this C# you can see we declare variables (aliases) on earlier lines and use them on later lines, but that's because the programming language operates line by line: the results of an earlier line's execution are available to later lines. SQL doesn't work like that; SQL works on entire sections of the query at a time. Your columns don't acquire the aliases you gave them until the entire select block has finished being processed, so you cannot give a column or calculation an alias and then use that alias again in the same select block. The usual way round this, to create an alias that you will later use repeatedly, is to create the alias in a subquery.
Here's another example:
SELECT
fih.tot_amt / fih.amt_per_proj AS allocation_amount
FROM
(
SELECT
CAST(f.total_net_amount_amount AS DECIMAL(18,8)) AS tot_amt,
CAST(SUM(f.total_net_amount_amount) OVER (PARTITION BY p.project_key) AS DECIMAL(18,8)) AS amt_per_proj
FROM
dbo.factinvoiceheader f
INNER JOIN
dbo.dimproject p
ON ...
) fih
Here you can see I pulled the columns I wanted and aliased them in an inner query and then used the aliases in the outer query. It works because the aliases declared inside the inner block are made available to the outer block.
Always remember that SQL is not line by line like a typical programming language, but block by block. Indeed, in most programming languages, things declared in inner code blocks are not available in outer code blocks (unless they're some globalised thing like javascript var), so SQL is a departure from what you're used to. Every time you create a block of instructions in SQL you have an opportunity to re-alias the columns of data.
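A common table expression is another way to introduce that inner block. It is a different construct from the derived table used above, but the scoping rule is the same. A minimal sketch for SQL Server, reusing the made-up games_won figures (the CTE name stats is hypothetical):

```sql
-- The CTE "stats" plays the role of the inner query: the aliases are declared here ...
WITH stats AS
(
    SELECT
        1200 AS games_won,
        25   AS years_played
)
-- ... and are available by name in the outer SELECT.
SELECT
    s.games_won / s.years_played AS games_won_per_year
FROM
    stats s;
```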
Because SQL is block-based, I indent my SQL in blocks to make it easy to see what gets processed together. Keywords like SELECT, FROM, WHERE, GROUP BY and ORDER BY denote blocks, and aliases can be created for columns in a SELECT and for tables in a FROM. In taking your example above I've applied aliases not just to the calculations and columns but to the tables as well. It makes the query massively easier to read when it's indented and aliased throughout: give your table names an alias rather than writing dbo.factinvoiceheader. before every column name.
Here's a set of tips for making your SQLs neater and easier to read and debug:
don't put them all on one line or at the same indent level - indent according to how deep or shallow the block of instructions is
select, from, where, group by, order by etc denote the start of a block of operations - indent them all to the same level and indent their sub-instructions another level (if your select is indent level 2, the columns being selected should be indent level 3)
when you have an inner query indent that too unless it's really simple and reads nicely as a one liner
use lowercase for column and table names, upper case for reserved words, functions, datatypes (some people prefer camel case for functions)
decide whether to use camelCase or underscore_style to split your words and keep to it
always alias tables, and always select columns as tablealias.columnname - this prevents your query breaking in future if a table has a column added that is the same name as an original column you selected without qualifying what table the original column came from
aliasing tables allows another vital operation; repeatedly joining the same table into a query. If your Person table has a WorkAddress and a HomeAddress the only way you can join the address table in twice to get both addresses for a person, is to alias the table (person join address h on p.homeaddressid = h.id join address w on p.workaddressid = w.id)
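The last tip can be sketched as a complete script. The person and address tables below are a hypothetical schema built only from the column names mentioned in that tip; the data types and sample rows are assumptions.

```sql
-- Hypothetical schema, for illustration only.
CREATE TABLE address (
    id   INT PRIMARY KEY,
    city VARCHAR(50)
);

CREATE TABLE person (
    id            INT PRIMARY KEY,
    name          VARCHAR(50),
    homeaddressid INT REFERENCES address (id),
    workaddressid INT REFERENCES address (id)
);

INSERT INTO address (id, city) VALUES (1, 'Leeds');
INSERT INTO address (id, city) VALUES (2, 'York');
INSERT INTO person (id, name, homeaddressid, workaddressid) VALUES (1, 'Ann', 1, 2);

-- address is joined twice; the aliases h and w keep the two roles apart.
SELECT
    p.name,
    h.city AS home_city,
    w.city AS work_city
FROM
    person p
    INNER JOIN address h ON p.homeaddressid = h.id
    INNER JOIN address w ON p.workaddressid = w.id;
```

Without the aliases there would be no way to say which copy of address a given city column comes from.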

SQL Server: Using columns with identical names

I'm writing a migration script to move data from one data model to another in Microsoft SQL Server Management Studio. The problem I'm running into is that, in the source database, some tables have foreign key columns that I need to compare. A snippet of code:
INSERT INTO TargetDB.dbo.Encounter(EncounterID, PATID, DRG)
Select
visit_occurrence_id,
person_id,
(Select
Case when ((Select top 1 observation_concept_id from SourceDB.dbo.Observation where visit_occurrence_id = visit_occurrence_id) = 3040464)
Then (Select top 1 value_as_string from SourceDB.dbo.Observation where visit_occurrence_id = visit_occurrence_id)
Else NULL End
)
from SourceDB.dbo.Visit_occurrence
As you can see, I need to compare visit_occurrence_id in SourceDB.dbo.Observation to visit_occurrence_id in SourceDB.dbo.Visit_occurrence. As it is, it's just returning values from the first row in SourceDB.dbo.Observation, since visit_occurrence_id will always equal itself.
What's the proper way to do this? Can I assign the first visit_occurrence_id value to a variable within the query, so it has a distinct name? I'm pretty lost here.
I'm going to add a little more detail for you here in an answer. You can always refer to an object by its fully-qualified name, but it isn't always necessary:
Database.Schema.Table
or
Database.Schema.Table.Column
With SQL Server, it can even include the server, for linked-server scenarios.
This is also true of other objects like views, procedures, functions, etc. Aliasing of tables and/or columns can be a good strategy for shortening this qualification.
Any time there is ambiguity, qualification is necessary. However, it is good practice to be fairly explicit, because it can save you future headaches. As an example, consider this view:
CREATE VIEW vwEmployeesWithLocation AS
SELECT
E.EmployeeId -- from employees
, LastName -- from employees
, Status -- from employees
, LocationName -- from locations
FROM
Employees AS E
INNER JOIN
EmployeeLocations AS EL ON E.EmployeeId = EL.EmployeeId
INNER JOIN
Locations AS L ON EL.LocationId = L.LocationId
Right now, everything is fine because other than EmployeeId, the column names are distinct. However, someone might add a Status column to the Locations table in the future and break this view. So, it would be better to explicitly include the table prefix for all columns in the select.
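A fully qualified version of the view could look like the following T-SQL sketch. The three CREATE TABLE statements are hypothetical minimal definitions added only so the script is complete; the real schema is not shown in the thread.

```sql
-- Hypothetical minimal tables; column types are assumptions.
CREATE TABLE Employees (EmployeeId INT PRIMARY KEY, LastName VARCHAR(50), Status VARCHAR(20));
CREATE TABLE Locations (LocationId INT PRIMARY KEY, LocationName VARCHAR(50));
CREATE TABLE EmployeeLocations (EmployeeId INT, LocationId INT);
GO

-- Every column carries its table alias, so the source table is never guessed.
CREATE VIEW vwEmployeesWithLocation AS
SELECT
      E.EmployeeId
    , E.LastName
    , E.Status
    , L.LocationName
FROM
    Employees AS E
INNER JOIN
    EmployeeLocations AS EL ON E.EmployeeId = EL.EmployeeId
INNER JOIN
    Locations AS L ON EL.LocationId = L.LocationId;
GO

-- A Status column added to Locations later does not make E.Status ambiguous.
ALTER TABLE Locations ADD Status VARCHAR(20);
GO

SELECT * FROM vwEmployeesWithLocation;
```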
In your case, your query is cross database, so again, be explicit about the database in all parts of your query.
Used snow_FFFFFF's answer in the comments: Just used SourceDB.dbo.Observation.visit_occurrence_id.
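The same fix can also be written with table aliases instead of full names. The sketch below keeps the question's CASE logic and object names; the aliases vo and o are my own, and it has not been run against the asker's databases.

```sql
INSERT INTO TargetDB.dbo.Encounter (EncounterID, PATID, DRG)
SELECT
    vo.visit_occurrence_id,
    vo.person_id,
    CASE
        -- o.* is the Observation row, vo.* is the outer Visit_occurrence row,
        -- so the comparison is no longer a column compared with itself.
        WHEN (SELECT TOP 1 o.observation_concept_id
              FROM SourceDB.dbo.Observation AS o
              WHERE o.visit_occurrence_id = vo.visit_occurrence_id) = 3040464
        THEN (SELECT TOP 1 o.value_as_string
              FROM SourceDB.dbo.Observation AS o
              WHERE o.visit_occurrence_id = vo.visit_occurrence_id)
        ELSE NULL
    END
FROM SourceDB.dbo.Visit_occurrence AS vo;
-- Note: TOP 1 without ORDER BY returns an arbitrary matching row, as in the original.
```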

Why is selecting specified columns, and all, wrong in Oracle SQL?

Say I have a select statement that goes..
select * from animals
That gives a query result of all the columns in the table.
Now, if the 42nd column of the table animals is is_parent, and I want to return that in my results, just after gender, so I can see it more easily. But I also want all the other columns.
select is_parent, * from animals
This returns ORA-00936: missing expression.
The same statement will work fine in Sybase, and I know that you need to add a table alias to the animals table to get it to work ( select is_parent, ani.* from animals ani), but why must Oracle need a table alias to be able to work out the select?
Actually, it's easy to solve the original problem. You just have to qualify the *.
select is_parent, animals.* from animals;
should work just fine. Aliases for the table names also work.
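Both forms side by side, as an Oracle sketch. The animals table definition here is hypothetical; only the is_parent and gender column names come from the question.

```sql
-- Hypothetical table for illustration; column types are assumptions.
CREATE TABLE animals (id NUMBER, gender VARCHAR2(10), is_parent CHAR(1));
INSERT INTO animals (id, gender, is_parent) VALUES (1, 'F', 'Y');

-- Fails in Oracle with ORA-00936: a bare * cannot be combined with other select items.
-- SELECT is_parent, * FROM animals;

-- Works: qualify the * with the table name ...
SELECT is_parent, animals.* FROM animals;

-- ... or with a table alias (the prefix must match the alias declared in FROM).
SELECT a.is_parent, a.* FROM animals a;
```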
There is no merit in doing this in production code. We should explicitly name the columns we want rather than using the SELECT * construct.
As for ad hoc querying, get yourself an IDE - SQL Developer, TOAD, PL/SQL Developer, etc - which allows us to manipulate queries and result sets without needing extensions to SQL.
Good question, I've often wondered this myself but have then accepted it as one of those things...
Similar problem is this:
sql>select geometrie.SDO_GTYPE from ngg_basiscomponent
ORA-00904: "GEOMETRIE"."SDO_GTYPE": invalid identifier
where geometrie is a column of type mdsys.sdo_geometry.
Add an alias and the thing works.
sql>select a.geometrie.SDO_GTYPE from ngg_basiscomponent a;
Lots of good answers so far on why select * shouldn't be used and they're all perfectly correct. However, I don't think any of them answer the original question on why the particular syntax fails.
Sadly, I think the reason is... "because it doesn't".
I don't think it's anything to do with single-table vs. multi-table queries:
This works fine:
select *
from
person p inner join user u on u.person_id = p.person_id
But this fails:
select p.person_id, *
from
person p inner join user u on u.person_id = p.person_id
While this works:
select p.person_id, p.*, u.*
from
person p inner join user u on u.person_id = p.person_id
It might be some historical compatibility thing with 20-year old legacy code.
Another for the "but why!!!" bucket, along with why can't you group by an alias?
The use case for the alias.* format is as follows
select parent.*, child.col
from parent join child on parent.parent_id = child.parent_id
That is, selecting all the columns from one table in a join, plus (optionally) one or more columns from other tables.
The fact that you can use it to select the same column twice is just a side-effect. There is no real point to selecting the same column twice and I don't think laziness is a real justification.
Select * in the real world is only dangerous when referring to columns by index number after retrieval rather than by name. The bigger problem is inefficiency when not all columns are required in the result set (network traffic, CPU and memory load).
Of course, if you're adding columns from other tables (as is the case in this example) it can be dangerous, as these tables may over time gain columns with matching names: select *, x would in that case fail if a column x is added to the table that previously didn't have it.
why must Oracle need a table alias to be able to work out the select
Teradata requires the same. As both are quite old (maybe better to call it mature :-) DBMSs, this might be for historical reasons.
My usual explanation is: an unqualified * means everything/all columns and the parser/optimizer is simply confused because you request more than everything.

Sub-query Optimization Talk with an example case

I need advice and want to share my experience with query optimization. This week, I found myself stuck in an interesting dilemma.
I'm a novice in MySQL (2 years of theory, less than one of practice).
Environment:
I have a table articles with a column 'type', another table article_version that contains the date on which an article was added to the DB, and a third table that contains all the article types along with type labels and so on.
The first 2 tables are huge (800000+ rows and growing daily); the 3rd one is naturally small. The articles table has a lot of columns, but to simplify things we only need 'ID' and 'type' in articles and 'dateAdded' in article_version.
What I want to do:
A query that, for a specified 'dateAdded', returns the number of articles for each type (there are ~50 types to scan).
What was already in place is 50 separate counts, one for each document type (not efficient, and slow: ~5 sec in general).
I wanted to do it all in one query and I came up with this:
SELECT type,
(SELECT COUNT(DISTINCT articles.ID)
FROM articles
INNER JOIN article_version
ON article_version.ARTI_ID = articles.ID
WHERE type = td.NEW_ID
AND dateAdded = '2009-01-01 00:00:00') AS nbrArti
FROM type_document td
WHERE td.NEW_ID != ''
GROUP BY td.NEW_ID;
The external select (type_document) allows me to get the 55 types of documents I need.
The sub-query counts the articles for each type_document for the given date '2009-01-01'.
A typical result looks like:
* type * nbrArti *
*************************
* 123456 * 23 *
* 789456 * 5 *
* 16578 * 98 *
* .... * .... *
* .... * .... *
*************************
This query gets the job done, but the join in the sub-query makes it extremely slow. The reason, if I'm right, is that the join is performed by the server once for each type, so 50+ times. This solution is even slower than doing the 50 queries independently, one for each type. Awesome :/
A Solution
I came up with a solution myself that drastically improves performance with the same result: I just created a view corresponding to the sub-query, making the join on IDs for each type... And boom, it's f.a.s.t.
I think (correct me if I'm wrong) that the reason is that the server only runs the JOIN once.
This solution is ~5 times faster than the solution that was already there, and ~20 times faster than my first attempt. Sweet.
Questions / thoughts
With yet another view, I'll now need to check that I don't lose more than I gain when documents get inserted...
Is there a way to improve the original query by getting the JOIN out of the sub-query (and getting rid of the view)?
Any other tips/thoughts? (In server optimization, for example...)
Apologies for my approximate English; it is not my primary language.
You cannot create a single index on (type, date_added), because these fields are in different tables.
Without the view, the subquery most probably selects articles as the leading table and uses the index on type, which is not very selective.
By creating the view, you force the subquery to calculate the counts for all types first (using a selective index on the date) and then use a JOIN BUFFER (which is fast enough for only 55 types).
You can achieve similar results by rewriting your query as this:
SELECT new_id, COALESCE(cnt, 0) AS cnt
FROM type_document td
LEFT JOIN
(
SELECT type, COUNT(DISTINCT article_id) AS cnt
FROM article_versions av
JOIN articles a
ON a.id = av.article_id
WHERE av.date = '2009-01-01 00:00:00'
GROUP BY
type
) q
ON q.type = td.new_id
Unfortunately, MySQL is not able to do table spools or hash joins, so to improve the performance you'll need to denormalize your tables: add type to article_version and create a composite index on (date, type).
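As a rough illustration of that denormalization, the MySQL sketch below uses the table and column names from the question (articles, article_version, ARTI_ID, dateAdded, type_document.NEW_ID). The INT type for the copied column and the index name are assumptions, and keeping the copy in sync on later inserts is not covered.

```sql
-- Assumption: articles.type is an integer; adjust the column type to match the real schema.
ALTER TABLE article_version ADD COLUMN type INT;

-- Copy each article's type onto its version rows (MySQL multi-table UPDATE syntax).
UPDATE article_version av
JOIN articles a ON a.ID = av.ARTI_ID
SET av.type = a.type;

-- Composite index: rows for one date can then be read grouped by type.
CREATE INDEX idx_article_version_date_type ON article_version (dateAdded, type);

-- The per-type count no longer needs the join to articles.
SELECT td.NEW_ID, COALESCE(q.cnt, 0) AS nbrArti
FROM type_document td
LEFT JOIN (
    SELECT av.type, COUNT(DISTINCT av.ARTI_ID) AS cnt
    FROM article_version av
    WHERE av.dateAdded = '2009-01-01 00:00:00'
    GROUP BY av.type
) q ON q.type = td.NEW_ID
WHERE td.NEW_ID != '';
```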

SQL - table alias scope

I've just learned ( yesterday ) to use "exists" instead of "in".
BAD
select * from table where nameid in (
select nameid from othertable where otherdesc = 'SomeDesc' )
GOOD
select * from table t where exists (
select nameid from othertable o where t.nameid = o.nameid and otherdesc = 'SomeDesc' )
And I have some questions about this:
1) The explanation as I understood it was: "The reason why this is better is because only the matching values will be returned instead of building a massive list of possible results". Does that mean that while the first subquery might return 900 results the second will return only 1 ( yes or no )?
2) In the past I have had the RDBMS complaining: "only the first 1000 rows might be retrieved". Would this second approach solve that problem?
3) What is the scope of the alias in the second subquery? Does the alias only live inside the parentheses?
for example
select * from table t where exists (
select nameid from othertable o where t.nameid = o.nameid and otherdesc = 'SomeDesc' )
AND
select nameid from othertable o where t.nameid = o.nameid and otherdesc = 'SomeOtherDesc' )
That is, if I use the same alias ( o for table othertable ) in the second "exists", will it cause any problem with the first exists, or are they totally independent?
Is this Oracle-specific, or is it valid for most RDBMSs?
Thanks a lot
It's specific to each DBMS and depends on the query optimizer. Some optimizers detect the IN clause and translate it.
In all the DBMSes I tested, the alias is only valid inside the ( ).
BTW, you can rewrite the query as:
select t.*
from table t
join othertable o on t.nameid = o.nameid
and o.otherdesc in ('SomeDesc','SomeOtherDesc');
And, to answer your questions:
1) Yes
2) Yes
3) Yes
You are treading into complicated territory, known as 'correlated sub-queries'. Since we don't have detailed information about your tables and the key structures, some of the answers can only be 'maybe'.
In your initial IN query, the notation would be valid whether or not OtherTable contains a column NameID (and, indeed, whether OtherDesc exists as a column in Table or OtherTable - which is not clear in any of your examples, but presumably is a column of OtherTable). This behaviour is what makes a correlated sub-query into a correlated sub-query. It is also a routine source of angst for people when they first run into it - invariably by accident. Since the SQL standard mandates the behaviour of interpreting a name in the sub-query as referring to a column in the outer query if there is no column with the relevant name in the tables mentioned in the sub-query but there is a column with the relevant name in the tables mentioned in the outer (main) query, no product that wants to claim conformance to (this bit of) the SQL standard will do anything different.
The answer to your Q1 is "it depends", but given plausible assumptions (NameID exists as a column in both tables; OtherDesc only exists in OtherTable), the results should be the same in terms of the data set returned, but may not be equivalent in terms of performance.
The answer to your Q2 is that in the past, you were using an inferior if not defective DBMS. If it supported EXISTS, then the DBMS might still complain about the cardinality of the result.
The answer to your Q3 as applied to the first EXISTS query is "t is available as an alias throughout the statement, but o is only available as an alias inside the parentheses". As applied to your second example box - with AND connecting two sub-selects (the second of which is missing the open parenthesis when I'm looking at it), then "t is available as an alias throughout the statement and refers to the same table, but there are two different aliases both labelled 'o', one for each sub-query". Note that the query might return no data if OtherDesc is unique for a given NameID value in OtherTable; otherwise, it requires two rows in OtherTable with the same NameID and the two OtherDesc values for each row in Table with that NameID value.
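The scoping rule for Q3 can be checked with a small script. The tables below are hypothetical stand-ins for "table" and "othertable" (main_table is used because table is a reserved word), and the second sub-query is written with its own EXISTS and opening parenthesis.

```sql
-- Hypothetical tables and data, for illustration only.
CREATE TABLE main_table (nameid INT, label VARCHAR(20));
CREATE TABLE othertable (nameid INT, otherdesc VARCHAR(20));

INSERT INTO main_table (nameid, label) VALUES (1, 'both');
INSERT INTO main_table (nameid, label) VALUES (2, 'one only');
INSERT INTO othertable (nameid, otherdesc) VALUES (1, 'SomeDesc');
INSERT INTO othertable (nameid, otherdesc) VALUES (1, 'SomeOtherDesc');
INSERT INTO othertable (nameid, otherdesc) VALUES (2, 'SomeDesc');

-- t is visible in both sub-queries; each sub-query declares its own, independent o.
SELECT t.nameid, t.label
FROM main_table t
WHERE EXISTS (
        SELECT 1
        FROM othertable o
        WHERE o.nameid = t.nameid AND o.otherdesc = 'SomeDesc')
  AND EXISTS (
        SELECT 1
        FROM othertable o
        WHERE o.nameid = t.nameid AND o.otherdesc = 'SomeOtherDesc');
```

Only nameid 1 qualifies, because it is the only value with a row for each of the two OtherDesc values.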
1) Oracle-specific: When you write a query using the IN clause, you're telling the rule-based optimizer that you want the inner query to drive the outer query. When you write EXISTS in a where clause, you're telling the optimizer that you want the outer query to be run first, using each value to fetch a value from the inner query. See "Difference between IN and EXISTS in subqueries".
2) Probably.
3) Alias declared inside subquery lives inside subquery. By the way, I don't think your example with 2 ANDed subqueries is valid SQL. Did you mean UNION instead of AND?
Personally I would use a join, rather than a subquery for this.
SELECT t.*
FROM yourTable t
INNER JOIN otherTable ot
ON (t.nameid = ot.nameid AND ot.otherdesc = 'SomeDesc')
It is difficult to generalize that EXISTS is always better than IN. Logically, if that were the case, the SQL community would have replaced IN with EXISTS.
Also, please note that IN and EXISTS are not the same; the results may be different when you use the two.
With IN, it's usually a full table scan of the inner table once, without removing NULLs (so if you have NULLs in your inner table, IN will not remove NULLs by default), while EXISTS removes NULLs and, in the case of a correlated subquery, runs the inner query for every row of the outer query.
Assuming there are no NULLs and it's a simple query (with no correlation), EXISTS might perform better if the row you are looking for is not the last row. If it happens to be the last row, EXISTS may need to scan to the end like IN, so performance is similar.
But IN and EXISTS are not interchangeable.
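The NULL difference is easiest to see with the negated forms, NOT IN and NOT EXISTS. The sketch below uses two hypothetical one-column tables; it shows standard three-valued logic rather than anything specific to one DBMS.

```sql
-- Hypothetical tables; inner_t deliberately contains a NULL.
CREATE TABLE outer_t (id INT);
CREATE TABLE inner_t (id INT);
INSERT INTO outer_t (id) VALUES (1);
INSERT INTO outer_t (id) VALUES (2);
INSERT INTO inner_t (id) VALUES (1);
INSERT INTO inner_t (id) VALUES (NULL);

-- IN and EXISTS agree here: both return id = 1.
SELECT o.id FROM outer_t o WHERE o.id IN (SELECT i.id FROM inner_t i);
SELECT o.id FROM outer_t o WHERE EXISTS (SELECT 1 FROM inner_t i WHERE i.id = o.id);

-- NOT IN returns no rows: 2 <> NULL is UNKNOWN, so the predicate is never TRUE for id = 2.
SELECT o.id FROM outer_t o WHERE o.id NOT IN (SELECT i.id FROM inner_t i);

-- NOT EXISTS returns id = 2: the NULL row simply fails the correlation test.
SELECT o.id FROM outer_t o WHERE NOT EXISTS (SELECT 1 FROM inner_t i WHERE i.id = o.id);
```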