Still confusing the rules around selecting columns, group by, and joins - sql

I am still confused by the syntax rules of using GROUP BY. I understand we use GROUP BY when there is some aggregate function. If I have even one aggregate function in a SQL statement, do I need to put all of my selected columns into my GROUP BY statement? I don't have a specific query to ask about but when I try to do joins, I get errors. In particular, when I use a count(*) in a statement and/or a join, I just seem to mess it up.
I use BigQuery at my job. I am regularly floored by strange gaps in knowledge.
Thank you!

This is a little complicated.
First, no aggregation functions are needed in an aggregation query. So this is allowed:
select a
from t
group by a;
This is equivalent, by the way, to:
select distinct a
from t;
If there are aggregation functions, then no group by is needed. So, this is allowed:
select max(a)
from t;
Such an aggregation query -- with no group by -- always returns one row. This is true even if the table is empty or a where clause filters out all the rows. In that case, most aggregation functions return NULL, with the notable exception of count() that returns 0.
Next, if you mix aggregation functions and non-aggregation expressions in the select, then in general you want the non-aggregation, non-constant expressions in the group by. I should note that you can do:
select a, concat(a, 'bcd'), count(*)
from t
group by a;
This should work, but sometimes BigQuery gets confused and will want the expression in the group by.
Finally, the SQL standard supports a query like this:
select t.*, count(*)
from t join
u
using (foo)
group by t.a;
When a is the primary key (or equivalent) in t. However, BigQuery does not have primary keys, so this is not relevant to that database.

Related

Can LAG be used with HAVING?

I distinctly recall that T-SQL will never let you mix LAG and WHERE. For example,
SELECT FOO
WHERE LAG(BAR) OVER (ORDER BY DATE) > 7
will never work. T-SQL will not run it no matter what you do. But does T-SQL ever let you mix LAG with HAVING?
Note: All that an answer needs to do is either give a theory-based or documentation-based reason why it does not, or give any example at all of where it does.
From Logical Processing Order of the SELECT statement:
The following steps show the logical processing order, or binding
order, for a SELECT statement......
FROM
ON
JOIN
WHERE
GROUP BY
WITH CUBE or WITH ROLLUP
HAVING
SELECT
DISTINCT
ORDER BY
TOP
Window functions are evaluated at the level of SELECT, which comes after HAVING, so the answer is no you can't use window functions in the HAVING clause.
Having clause can only be used with Group by clause. In order to use Group by the listed columns should be aggregated using Group by columns. Group by can only be used with aggregate functions like min,max,sum,count functions. Hence it is not possible to combine having clause along with the LAG analytical function.
In order to use LAG and Having, one should use CTE or subquery.

SQL Group By Column Part Number giving the data from most recent received date

New qith SQL my group by is not working and I am wanting it to pull the most recent POReleases.DateReceived date and group by part number. Here is what I have
SELECT POReleases.PONum, POReleases.PartNo, POReleases.JobNo, POReleases.Qty, POReleases.QtyRejected, POReleases.QtyCanceled, POReleases.DueDate, POReleases.DateReceived, PODet.ProdCode, PODet.Unit, PODet.UnitCost, PODet.QtyOrd, PODet.QtyRec, PODet.QtyReject, PODet.QtyCancel
FROM Waples.dbo.PODet PODet, Waples.dbo.POReleases POReleases
WHERE PODet.PartNo = POReleases.PartNo AND PODet.PONum = POReleases.PONum AND ((POReleases.DateReceived>{ts '2010-01-01 00:00:00'}))
GROUP BY PartNo
For starters, columns specified in the GROUP BY should be present in the select statement too. Here in your case only "PartNo" is used in GROUP BY clause whereas so many columns are used in the SELECT statement.
You can try WITH CTE to achieve this,
WITH CTE AS (
SELECT *, ROW_NUMBER() OVER( PARTITION BY PartNo ORDER BY POReleases.DateReceived DESC) AS PartNoCount
FROM TABLENAME
) SELECT * FROM CTE
When you write an SQL statement, you should think about the logical flow, which might be technically slightly inaccurate due to optimizations, but still, it is a good thing to think about it like this:
without the from clause specifying the source relation, the filter cannot be evaluated, so at least logically, the from is the first thing to evaluate
without the where clause specifying which records should be kept from the source relation, the filtered records cannot be grouped, so, at least logically, the where precedes the group by
without the group by, specifying the groups, you cannot select values from the groups, so, at least logically, group by precedes select
So, the projection (select) is executed on the groups of filtered records, which are groups themselves. Since the groups have an attribute, namely PartNo, it becomes an aggregated column. The other columns, which were reachable before the group by, can no longer be reached in the select. If you want to reach them, you need to group by them as well, or use aggregated functions for them, since if you have a group by, you will be able to select only the aggregated columns, which are either aggregated functions or columns which became aggregated due to their presence in the group by.
Since you did not specify how this query is not working, I will have to assume that you have a syntax error in the selection, due to the fact that you refer to columns which are not aggregated. Also, you might want to use join instead of Descartes multiplication and finally, if you want to filter the groups, not the records of the initial relation (which is the result of a Descartes multiplication in your case), then you might consider using a having clause.

Aggregate window function output in postgresql (redshift)

I really want to use the median window function as an aggregate function.
I currently am forced to use the window function in a sub-select, and then aggregate over it like this:
SELECT id, MIN(avg) AS mean, MIN(median) AS median, COUNT(*)
FROM (
SELECT id, AVG(metric) OVER(PARTITION BY id), MEDIAN(metric) OVER(PARTITION BY id)
FROM data_table
)
GROUP BY id;
Is there a way to aggregate over a window function result so there's only one SELECT statement?
Strictly speaking, your example query could be rewritten:
SELECT id,
AVG(metric),
MEDIAN(metric),
COUNT(*)
FROM data_table
GROUP BY id;
But I'm wondering if you just picked a poor example that happens to be mathematically capable of simplification. This is a special case because the subquery and the main query are aggregating on the same field, and the outer aggregates are picking a minimum from what would be a set of identical values.
If that's not the case and your actual query and subquery are not grouping by the same field, then the answer is no, you need a subquery for two reasons:
First, by ANSI definition, window functions are evaluated after the WHERE, GROUP BY, and HAVING clauses. There is no clause to specify your desired behavior of aggregating after a window function, so you must use a subquery or CTE.
Second, even if you eliminated the windowing from the OVER() clause you need to GROUP BY data you only know after the first round of aggregation has been completed.

why there is invalid identifier in this query?

I want to assign the result of count to a variable, so I can use it later in the query, here is my code:
select distinct(Artist), count(distinct(Instrument)) as allins
from performers
where allins = (select count(distinct(x.Instrument))
from performers x)
group by Artist;
the error: ORA-00904: "ALLINS": invalid identifier
This is your query:
select distinct(Artist), count(distinct(Instrument)) as allins
from performers
where allins = (select count(distinct(x.Instrument)) from performers x)
group by Artist;
Naughty, naughty. You cannot use a column alias defined in the select in a where clause. You also can't use aggregation function in the where clause, so the code doesn't make sense. What you want is a having clause:
select Artist, count(distinct(Instrument)) as allins
from performers
group by Artist
having count(distinct Instrument) = (select count(distinct x.Instrument) from performers x)
Note: you almost never need select distinct when you have an aggregation query. And, distinct isn't a function so parenthesis are unnecessary.
Standard SQL disallows references to column aliases in a WHERE clause.
This restriction is imposed because when the WHERE clause is evaluated, the column value may not yet have been determined.
Column_alias can be used in an ORDER BY clause, but it cannot be used in a WHERE, GROUP BY, or HAVING clause.
Documentation references:
http://dev.mysql.com/doc/refman/5.5/en/problems-with-alias.html
http://msdn.microsoft.com/en-us/library/ms173451.aspx
The execution for SQL is definitely not the same as Java or C, so often it trips up new programmers getting into the language.
More often than not, the database's order for understanding SQL instructions goes like this:
FROM -> WHERE -> GROUP BY -> ... -> SELECT
In order to do this properly you can't state that something is an alias in the SELECT clause, and expect it to work, because the program would most likely start from the FROM clause.
Also, from my experience, column aliases don't work with each other nicely when referencing each other in the SELECT clause.
My rather unfortunate solution would be to just not use aliases and type the whole thing out.
Also, you absolutely, positively, should not use an aggregate function in the WHERE clause. It must always be in the HAVING clause. You also need to use a GROUP BY clause if you don't want Oracle to complain about asking for the Artist. Also, since you are grouping by the Artist and the other function is an aggregate, you don't need to use DISTINCT.

What is the difference between HAVING and WHERE in SQL?

What is the difference between HAVING and WHERE in an SQL SELECT statement?
EDIT: I have marked Steven's answer as the correct one as it contained the key bit of information on the link:
When GROUP BY is not used, HAVING behaves like a WHERE clause
The situation I had seen the WHERE in did not have GROUP BY and is where my confusion started. Of course, until you know this you can't specify it in the question.
HAVING: is used to check conditions after the aggregation takes place.
WHERE: is used to check conditions before the aggregation takes place.
This code:
select City, CNT=Count(1)
From Address
Where State = 'MA'
Group By City
Gives you a table of all cities in MA and the number of addresses in each city.
This code:
select City, CNT=Count(1)
From Address
Where State = 'MA'
Group By City
Having Count(1)>5
Gives you a table of cities in MA with more than 5 addresses and the number of addresses in each city.
HAVING specifies a search condition for a
group or an aggregate function used in SELECT statement.
Source
Number one difference for me: if HAVING was removed from the SQL language then life would go on more or less as before. Certainly, a minority queries would need to be rewritten using a derived table, CTE, etc but they would arguably be easier to understand and maintain as a result. Maybe vendors' optimizer code would need to be rewritten to account for this, again an opportunity for improvement within the industry.
Now consider for a moment removing WHERE from the language. This time the majority of queries in existence would need to be rewritten without an obvious alternative construct. Coders would have to get creative e.g. inner join to a table known to contain exactly one row (e.g. DUAL in Oracle) using the ON clause to simulate the prior WHERE clause. Such constructions would be contrived; it would be obvious there was something was missing from the language and the situation would be worse as a result.
TL;DR we could lose HAVING tomorrow and things would be no worse, possibly better, but the same cannot be said of WHERE.
From the answers here, it seems that many folk don't realize that a HAVING clause may be used without a GROUP BY clause. In this case, the HAVING clause is applied to the entire table expression and requires that only constants appear in the SELECT clause. Typically the HAVING clause will involve aggregates.
This is more useful than it sounds. For example, consider this query to test whether the name column is unique for all values in T:
SELECT 1 AS result
FROM T
HAVING COUNT( DISTINCT name ) = COUNT( name );
There are only two possible results: if the HAVING clause is true then the result with be a single row containing the value 1, otherwise the result will be the empty set.
The HAVING clause was added to SQL because the WHERE keyword could not be used with aggregate functions.
Check out this w3schools link for more information
Syntax:
SELECT column_name, aggregate_function(column_name)
FROM table_name
WHERE column_name operator value
GROUP BY column_name
HAVING aggregate_function(column_name) operator value
A query such as this:
SELECT column_name, COUNT( column_name ) AS column_name_tally
FROM table_name
WHERE column_name < 3
GROUP
BY column_name
HAVING COUNT( column_name ) >= 3;
...may be rewritten using a derived table (and omitting the HAVING) like this:
SELECT column_name, column_name_tally
FROM (
SELECT column_name, COUNT(column_name) AS column_name_tally
FROM table_name
WHERE column_name < 3
GROUP
BY column_name
) pointless_range_variable_required_here
WHERE column_name_tally >= 3;
The difference between the two is in the relationship to the GROUP BY clause:
WHERE comes before GROUP BY; SQL evaluates the WHERE clause before it groups records.
HAVING comes after GROUP BY; SQL evaluates HAVING after it groups records.
References
SQLite SELECT Statement Syntax/Railroad Diagram
Informix SELECT Statement Syntax/Railroad Diagram
HAVING is used when you are using an aggregate such as GROUP BY.
SELECT edc_country, COUNT(*)
FROM Ed_Centers
GROUP BY edc_country
HAVING COUNT(*) > 1
ORDER BY edc_country;
WHERE is applied as a limitation on the set returned by SQL; it uses SQL's built-in set oeprations and indexes and therefore is the fastest way to filter result sets. Always use WHERE whenever possible.
HAVING is necessary for some aggregate filters. It filters the query AFTER sql has retrieved, assembled, and sorted the results. Therefore, it is much slower than WHERE and should be avoided except in those situations that require it.
SQL Server will let you get away with using HAVING even when WHERE would be much faster. Don't do it.
WHERE clause does not work for aggregate functions
means : you should not use like this
bonus : table name
SELECT name
FROM bonus
GROUP BY name
WHERE sum(salary) > 200
HERE Instead of using WHERE clause you have to use HAVING..
without using GROUP BY clause, HAVING clause just works as WHERE clause
SELECT name
FROM bonus
GROUP BY name
HAVING sum(salary) > 200
Difference b/w WHERE and HAVING clause:
The main difference between WHERE and HAVING clause is, WHERE is used for row operations and HAVING is used for column operations.
Why we need HAVING clause?
As we know, aggregate functions can only be performed on columns, so we can not use aggregate functions in WHERE clause. Therefore, we use aggregate functions in HAVING clause.
One way to think of it is that the having clause is an additional filter to the where clause.
A WHERE clause is used filters records from a result. The filter occurs before any groupings are made. A HAVING clause is used to filter values from a group
In an Aggregate query, (Any query Where an aggregate function is used) Predicates in a where clause are evaluated before the aggregated intermediate result set is generated,
Predicates in a Having clause are applied to the aggregate result set AFTER it has been generated. That's why predicate conditions on aggregate values must be placed in Having clause, not in the Where clause, and why you can use aliases defined in the Select clause in a Having Clause, but not in a Where Clause.
I had a problem and found out another difference between WHERE and HAVING. It does not act the same way on indexed columns.
WHERE my_indexed_row = 123 will show rows and automatically perform a "ORDER ASC" on other indexed rows.
HAVING my_indexed_row = 123 shows everything from the oldest "inserted" row to the newest one, no ordering.
When GROUP BY is not used, the WHERE and HAVING clauses are essentially equivalent.
However, when GROUP BY is used:
The WHERE clause is used to filter records from a result. The
filtering occurs before any groupings are made.
The HAVING clause is used to filter values from a group (i.e., to
check conditions after aggregation into groups has been performed).
Resource from Here
From here.
the SQL standard requires that HAVING
must reference only columns in the
GROUP BY clause or columns used in
aggregate functions
as opposed to the WHERE clause which is applied to database rows
While working on a project, this was also my question. As stated above, the HAVING checks the condition on the query result already found. But WHERE is for checking condition while query runs.
Let me give an example to illustrate this. Suppose you have a database table like this.
usertable{ int userid, date datefield, int dailyincome }
Suppose, the following rows are in table:
1, 2011-05-20, 100
1, 2011-05-21, 50
1, 2011-05-30, 10
2, 2011-05-30, 10
2, 2011-05-20, 20
Now, we want to get the userids and sum(dailyincome) whose sum(dailyincome)>100
If we write:
SELECT userid, sum(dailyincome) FROM usertable WHERE
sum(dailyincome)>100 GROUP BY userid
This will be an error. The correct query would be:
SELECT userid, sum(dailyincome) FROM usertable GROUP BY userid HAVING
sum(dailyincome)>100
WHERE clause is used for comparing values in the base table, whereas the HAVING clause can be used for filtering the results of aggregate functions in the result set of the query
Click here!
When GROUP BY is not used, the WHERE and HAVING clauses are essentially equivalent.
However, when GROUP BY is used:
The WHERE clause is used to filter records from a result. The
filtering occurs before any groupings are made.
The HAVING clause is
used to filter values from a group (i.e., to check conditions after
aggregation into groups has been performed).
I use HAVING for constraining a query based on the results of an aggregate function. E.G. select * in blahblahblah group by SOMETHING having count(SOMETHING)>0