In other PostgreSQL-derived DBMSes (e.g., Netezza), I can do something like this without errors:
select store_id
,sum(sales) as total_sales
,count(distinct(txn_id)) as d_txns
,total_sales/d_txns as avg_basket
from my_tlog
group by 1
I.e., I can use aggregate values within the same SQL query that defined them.
However, when I go to do the same sort of thing on Amazon Redshift, I get the error "Column total_sales does not exist..." Which it doesn't, that's correct; it's not really a column. But is there a way to preserve this idiom, rather than restructuring the query? I ask because there would be a lot of code to change.
Thanks.
You simply need to repeat the expressions (or use a subquery or CTE):
select store_id,
sum(sales) as total_sales,
count(distinct txn_id) as d_txns,
sum(sales)/count(distinct txn_id) as avg_basket
from my_tlog
group by store_id;
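For the CTE route mentioned above, here is a minimal sketch. It runs against an in-memory SQLite database through Python's built-in sqlite3 module rather than Redshift, and the sample rows are invented for illustration. The WITH query itself is plain standard SQL, so it should carry over, but that is an assumption worth checking on your own cluster.

```python
import sqlite3

# Assumption: a toy stand-in for my_tlog with made-up rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_tlog (store_id INTEGER, txn_id INTEGER, sales REAL)")
conn.executemany(
    "INSERT INTO my_tlog VALUES (?, ?, ?)",
    [(1, 100, 10.0), (1, 100, 5.0), (1, 101, 15.0), (2, 200, 8.0)],
)

# The aggregates are named once in the CTE; the outer query can then
# refer to them as ordinary columns, so nothing has to be repeated.
query = """
WITH per_store AS (
    SELECT store_id,
           SUM(sales)             AS total_sales,
           COUNT(DISTINCT txn_id) AS d_txns
    FROM my_tlog
    GROUP BY store_id
)
SELECT store_id,
       total_sales,
       d_txns,
       total_sales / d_txns AS avg_basket
FROM per_store
ORDER BY store_id
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

Note that if sales were an integer column, the division would be integer division on many engines, so a cast may be needed there.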
Most databases do not support the re-use of column aliases in the select. The reason is twofold (at least):
The designers of the database engine do not want to specify the order of processing of expressions in the select.
There is ambiguity when a column alias is also a valid column in a table in the from clause.
Personally, I love this construct in Netezza. It is compact and the syntax is not ambiguous: any 'duplicate' column name defaults to the (new) alias in the current query, and if you need to reference the column of the underlying table, simply put the table name in front of the column. The above example would become:
select store_id
,sum(sales) as sales ---- duplicate name
,count(distinct(txn_id)) as d_txns
,my_tlog.sales/d_txns as avg_basket --- this illustrates but may not make sense
from my_tlog
group by 1
I recently moved away from SQL Server, and on that database I used a construct like this to avoid repeating the expressions:
Select *, total_sales/d_txns as avg_basket
From (
select store_id
,sum(sales) as total_sales
,count(distinct(txn_id)) as d_txns
from my_tlog
group by store_id
)x
Most (if not all) databases support this construct, and have done so for ten years or more.
Are the following queries identical, or might I get different results (in any major DB system, e.g. MSSQL, MySQL, Postgres, SQLite):
Doing both in the same query:
SELECT group, some_agg_func(some_value)
FROM my_table
GROUP BY group
ORDER BY some_other_value
vs. ordering in a subquery:
SELECT group, some_agg_func(some_value)
FROM (
SELECT group, some_value
FROM my_table
ORDER BY some_other_value
) as alias
GROUP BY group
Looking at the first sample:
SELECT group, some_agg_func(some_value)
FROM my_table
GROUP BY group
ORDER BY some_other_value
Let's think about what GROUP BY does by looking at this imaginary sample data:
A B
- -
1 1
1 2
Then think about this query:
SELECT A
FROM SampleData
GROUP BY A
ORDER BY B
The GROUP BY clause puts the two rows into a single group. Then we want to order by B... but the two rows in the group have different values for B. Which should it use?
Obviously in this situation it doesn't really matter: there's only one row in the results, so the order is not relevant. But generally, how does the database know what to do?
The database could guess which one you want, or just take the first value, or the last (whatever those mean in a setting where the data is unordered by definition). In fact, this is what MySQL will try to do for you when its ONLY_FULL_GROUP_BY mode is turned off: it will try to guess at your meaning. But this response is really inappropriate. You specified an inexact query; the only correct thing to do is throw an error, which is what most databases will do.
Now let's look at the second sample:
SELECT group, some_agg_func(some_value)
FROM (
SELECT group, some_value
FROM my_table
ORDER BY some_other_value
) as alias
GROUP BY group
Here it is important to remember databases have their roots in relational set theory, and what we think of as "tables" are more formally described as Unordered Relations. Again: the idea of being "unordered" is baked into the very nature of a table at the deepest level.
In this case the inner query can run and create results in the specified order, and then the outer query can use that with GROUP BY to create a new set... but just like tables, query results are unordered relations. Without an ORDER BY clause the final result is also unordered by definition.
Now you might tend to get results in the order you want, but the reality is all bets are off. In fact, the databases that run this query will tend to give you results in the order in which they first encountered each group, which will not tend to match the ORDER BY because the GROUP BY expression is looking at completely different columns. Other databases (SQL Server is in this group) will not even allow the query to run, though I might prefer a warning here.
So now we come to the final section, where we must re-think the question, like this:
How can I use GROUP BY on the one group column, while also ordering by some_other_column not in the group?
The answer is each group can contain multiple rows, and so you must tell the database which row to look at to get the correct (specific) some_other_column value. The typical way to do this is with another aggregate function, which might look like this:
SELECT group, some_agg_func(some_value)
FROM my_table
GROUP BY group
ORDER BY some_other_agg_func(some_other_column)
That code will run without error on pretty much any database.
Just be careful here. On one hand, when people want to do this it's often for the common case where they know every record for some_other_column in each group will have the same value. For example, you might GROUP BY UserID, but ORDER BY Email, where of course every record with the same UserID should have the same Email address. As humans, we have the ability to make that kind of inference. Computers, however, don't handle that kind of thinking as well, and so we help them out with an extra aggregate function like MIN() or MAX().
On the other hand, if you're not careful, the two aggregate functions sometimes don't match up: you end up showing a value from one row in the group while sorting by a value from a completely different row in the same group, so the displayed data and the ordering no longer correspond.
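To make the aggregate-in-ORDER-BY pattern concrete, here is a small sketch using Python's built-in sqlite3 module. The table contents are invented, and the group column is called grp here (an assumption for illustration) because group is a reserved word. It also shows how the choice between MIN() and MAX() changes the resulting order.

```python
import sqlite3

# Assumption: made-up rows; "grp" stands in for the "group" column.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE my_table (grp TEXT, some_value INTEGER, some_other_value INTEGER)"
)
conn.executemany(
    "INSERT INTO my_table VALUES (?, ?, ?)",
    [("a", 1, 30), ("a", 2, 40), ("b", 5, 10), ("b", 7, 50), ("c", 3, 20)],
)

# Order the groups by the smallest some_other_value found in each group.
rows_by_min = conn.execute("""
    SELECT grp, SUM(some_value) AS total
    FROM my_table
    GROUP BY grp
    ORDER BY MIN(some_other_value)
""").fetchall()

# Same grouping, but ordered by the largest some_other_value per group.
rows_by_max = conn.execute("""
    SELECT grp, SUM(some_value) AS total
    FROM my_table
    GROUP BY grp
    ORDER BY MAX(some_other_value)
""").fetchall()

print(rows_by_min)  # groups ordered b, c, a
print(rows_by_max)  # groups ordered c, a, b
```

Both queries are well defined, yet they sort the same groups differently, which is why the aggregate used for ordering should be chosen deliberately.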
Tables are unordered sets of data. A query result is a table. So if you select from a subquery that contains an ORDER BY clause, that clause means nothing; the data set is unordered by definition. The DBMS is free to ignore the ORDER BY clause. Some DBMSs may even issue a warning or error, but I suppose it's more common that the ORDER BY clause simply has no guaranteed effect.
In this query
SELECT group, some_agg_func(some_value)
FROM my_table
GROUP BY group
ORDER BY some_other_value
you try to order your results by some_other_value. If this is meant to be a column, you can't, because that other column is not part of your results. You'll get an error. If some_other_value is a fixed value, then there is nothing ordered, because you'd have the same sort key for every row. But it can be an expression based on your result data (group key and aggregation results) and you can order your result rows by that.
In this query
SELECT group, some_agg_func(some_value)
FROM (
SELECT group, some_value
FROM my_table
ORDER BY some_other_value
) as alias
GROUP BY group
the ORDER BY clause has no effect. You could just as well just select FROM my_table directly:
SELECT group, some_agg_func(some_value)
FROM my_table as alias
GROUP BY group
This returns the results unordered (or at least the order you see is not guaranteed to be the same every time you run the query), because your query doesn't have an ORDER BY clause.
My company has recently switched to BigQuery. One issue I am having right now is that BigQuery standard SQL does not accept column aliases in the WHERE clause of a query.
For example, the query below returns "Unrecognized name: product_code at [3:5]".
Does anyone know a workaround for this issue?
select sales, t_001 as product_code
from `project_01.sales_001.trans_datamart`
where product_code = '001-40040-00'
According to the documentation, you cannot reference an alias from the SELECT list in a WHERE clause. The WHERE clause filters each row against a bool_expression.
However, there is a way for you to achieve what you want. Below is the syntax:
select sales, product_code
from (select *, t_001 as product_code from `project_01.sales_001.trans_datamart`)
where product_code = '001-40040-00'
This way, the alias becomes a new column name within your FROM clause, which makes it possible to filter on it in your WHERE clause.
I would also encourage you to check the BigQuery documentation's explanation of aliases.
I'm not aware of any SQL dialect that allows the use of a column alias in the WHERE clause.
Sticking to just the clauses in your example, SQL engines generally evaluate the FROM clause first, determining which tables to pull data from, then evaluate the WHERE clause to filter the retrieved data, and then the SELECT clause to determine what to display and how to display it.
Given that, the column alias is unknown to the SQL engine at the point that it's reading the WHERE clause.
So your options are to either use the column name in the WHERE clause, or, as Gordon suggests in the comments, put the alias in a sub-query or CTE that will be evaluated as part of the FROM clause.
Column name:
select sales, t_001 as product_code
from `project_01.sales_001.trans_datamart`
where t_001 = '001-40040-00' --<--- Modification here.
Sub-query:
select
sales,
product_code
from
(
select sales, t_001 as product_code
from `project_01.sales_001.trans_datamart`
) as d
where product_code = '001-40040-00'
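The CTE variant of the same idea can be sketched as follows. This is only an illustration that runs on an in-memory SQLite database via Python's sqlite3 module, not on BigQuery; the unqualified table name trans_datamart and its rows are assumptions made up for the example.

```python
import sqlite3

# Assumption: a toy stand-in for the datamart table with invented rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trans_datamart (sales REAL, t_001 TEXT)")
conn.executemany(
    "INSERT INTO trans_datamart VALUES (?, ?)",
    [(12.5, "001-40040-00"), (7.0, "001-99999-00"), (3.0, "001-40040-00")],
)

# The alias is defined inside the CTE, so by the time the outer WHERE
# clause is resolved, product_code is an ordinary column of "d".
query = """
WITH d AS (
    SELECT sales, t_001 AS product_code
    FROM trans_datamart
)
SELECT sales, product_code
FROM d
WHERE product_code = ?
ORDER BY sales
"""
rows = conn.execute(query, ("001-40040-00",)).fetchall()
print(rows)
```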
I'm trying to understand why some DBMSs allow the query below while most don't. Assume table X has attributes name, id, data.
SELECT id, count(*) as count
FROM TABLE X
GROUP BY id
HAVING count > X.data
In most databases, it's illegal to use a field that is neither a grouping column nor inside an aggregate in the HAVING clause condition. Some systems seem to allow it. Would you be able to explain why they would allow the HAVING condition to use an attribute which may not have a unique value throughout the group?
I have referred to the documentation for DB2, PostgreSQL, and MySQL.
The first issue with this query:
SELECT name, count(*) as count
FROM TABLE X
GROUP BY id
HAVING count > X.data;
is that you have name in the SELECT but id in the GROUP BY. Because you are grouping by id, I assume that there can be multiple values of name for each id in X. Hence, this is incorrect syntax.
There are some cases where this is allowed by the standard -- and even in some databases. However, that requires that id be unique in the table (the technical jargon is that the columns in the SELECT are functionally dependent on the columns in the GROUP BY).
The next issue is the use of count in the HAVING clause. This is fine conceptually. However, not all databases may support it.
Finally, you have x.data in the HAVING clause. If that is functionally dependent on a subset of the GROUP BY keys, then the usage conforms with the standard. However, that is unlikely in this case.
The standard is quite explicit that x.data is out-of-scope after the aggregation. So, this should result in a syntax error -- and it does in almost all databases.
There are a dwindling number of databases that support this construct -- happily MySQL no longer supports it by default. Such databases take an arbitrary and indeterminate value of data from a row in each group and use that for the comparison.
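One way to make the intent explicit is to wrap data in an aggregate, so the HAVING condition is well defined for every group. The sketch below is an assumption about what the asker might want (comparing the group size against the largest data value in the group), with invented rows, run on SQLite through Python's sqlite3 module. The alias is spelled cnt here, a hypothetical name chosen to avoid shadowing the COUNT function.

```python
import sqlite3

# Assumption: made-up rows for a table X(name, id, data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE X (name TEXT, id INTEGER, data INTEGER)")
conn.executemany(
    "INSERT INTO X VALUES (?, ?, ?)",
    [("a", 1, 1), ("b", 1, 2), ("c", 1, 1), ("d", 2, 5), ("e", 2, 5)],
)

# Every expression in HAVING is either a grouping key or an aggregate,
# so no arbitrary row has to be picked from the group.
query = """
SELECT id, COUNT(*) AS cnt
FROM X
GROUP BY id
HAVING COUNT(*) > MAX(data)
ORDER BY id
"""
rows = conn.execute(query).fetchall()
print(rows)
```

Group 1 has three rows and a largest data value of 2, so it passes; group 2 has two rows and a largest data value of 5, so it is filtered out.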
I'm having trouble wrapping my head around a fairly elementary concept. Any help appreciated!
I have learned from resources like this that the processing order of SQL operations is:
1) from
2) where
3) group by
4) having
5) select
6) order by
7) limit
However, I am perplexed when looking at this below query taken from DataCamp. If SQL is processing GROUP BY before SELECT, how can I use a field that was created within the SELECT statement (home_team) in the GROUP BY clause?
Thank you!
-- Identify the home team as Bayern Munich, Schalke 04, or neither
SELECT
CASE WHEN hometeam_id = 10189 THEN 'FC Schalke 04'
WHEN hometeam_id = 9823 THEN 'FC Bayern Munich'
ELSE 'Other' END AS home_team,
COUNT(id) AS total_matches
FROM matches_germany
-- Group by the CASE statement alias
GROUP BY home_team;
Your particular query has:
GROUP BY hometeam_id
---------^
This is a column in the original data, not in the SELECT. The data is aggregated at the hometeam_id level. Then the CASE expression is applied after the aggregation.
Your question supposed that the query is written using:
GROUP BY home_team
And this might or might not work, depending on the database.
SQL does not have an "order of processing". The SQL engine analyzes the query and develops a directed acyclic graph (DAG) representing the operations that need to be performed on the data.
What you are thinking of are rules for the scoping of identifiers in SQL. The big question is where an alias defined in a SELECT can be used.
Basically no databases allow column aliases to be used in the following clauses:
SELECT
FROM
WHERE
All databases allow column aliases in the following clauses:
ORDER BY.
Some databases allow column aliases in the GROUP BY and HAVING clauses.
Your database appears to be one that allows such usage in the GROUP BY.
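If the query has to run on a database that does not resolve SELECT aliases in GROUP BY, one portable rewrite is to compute the CASE expression in a derived table and group on its column. The following is a minimal sketch with invented match rows, run on SQLite via Python's sqlite3 module; the derived-table name labelled is hypothetical.

```python
import sqlite3

# Assumption: a toy matches_germany table with made-up rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE matches_germany (id INTEGER, hometeam_id INTEGER)")
conn.executemany(
    "INSERT INTO matches_germany VALUES (?, ?)",
    [(1, 10189), (2, 9823), (3, 9823), (4, 8000), (5, 8001), (6, 8002)],
)

# home_team is a real column of the derived table, so the outer
# GROUP BY does not depend on alias scoping rules at all.
query = """
SELECT home_team, COUNT(id) AS total_matches
FROM (
    SELECT id,
           CASE WHEN hometeam_id = 10189 THEN 'FC Schalke 04'
                WHEN hometeam_id = 9823  THEN 'FC Bayern Munich'
                ELSE 'Other' END AS home_team
    FROM matches_germany
) AS labelled
GROUP BY home_team
ORDER BY home_team
"""
rows = conn.execute(query).fetchall()
print(rows)
```

Repeating the full CASE expression in the GROUP BY clause is the other common option; the derived table simply avoids writing it twice.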
I'm trying to reference the new variable DATE_YEAR instead of the original DATE_YEAR that I have in my Teradata table TEST.
How can I do that?
I found nothing in the Teradata documentation.
SELECT DATE_YEAR+1 AS DATE_YEAR, COUNT(1)
FROM TEST
WHERE DATE_YEAR = 2016
GROUP BY 1;
Based on Standard SQL the columns in the SELECT-list are created after FROM/WHERE/GROUP BY/HAVING/OLAP, but before ORDER BY, so you can use an alias only in ORDER BY.
For historical reasons, Teradata allows reuse of an alias in any place. This is very convenient, as you don't have to cut and paste or use nested SELECTs. But there are some scoping rules: only when a name is not found as a column is it looked up in the alias list. So the rough rule of thumb is: never assign an alias that matches an existing column name, and then you can easily use it in any place:
SELECT DATE_YEAR+1 AS DATE_YR, COUNT(1)
FROM TEST
WHERE DATE_YR = 2016
GROUP BY 1;
Wrap it up in a derived table:
select DATE_YEAR_ADJUSTED, count(*)
from
(
SELECT DATE_YEAR+1 AS DATE_YEAR_ADJUSTED
FROM TEST
) as dt
WHERE DATE_YEAR_ADJUSTED = 2016
GROUP BY DATE_YEAR_ADJUSTED
That GROUP BY doesn't make much sense... Only one year anyway.
PS. I don't know Teradata, but I hope this works. ANSI SQL.