I am doing the SQL transformation lesson from Codecademy. I am not sure why they use those numbers after the GROUP BY clause or what those numbers do. Could anyone who has passed the course kindly explain?
SELECT dep_month,
dep_day_of_week,
dep_date,
COUNT(*) AS flight_count
FROM flights
GROUP BY 1,2,3
The numbers in the GROUP BY clause simply refer to the columns in the SELECT list, from left to right. Hence, your query is identical to the following:
SELECT
dep_month,
dep_day_of_week,
dep_date,
COUNT(*) AS flight_count
FROM flights
GROUP BY
dep_month,
dep_day_of_week,
dep_date
The above query, which I wrote, is what I would use in practice. The reason is that GROUP BY 1,2,3 refers to positions rather than columns. If someone refactors the SELECT list later, they run the risk of breaking your query.
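To check the equivalence yourself, here is a minimal sketch using Python's built-in sqlite3 module. The sample rows are invented for illustration, and SQLite is only one of the engines that accept positional GROUP BY, so treat this as a demonstration rather than a portability guarantee.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE flights (dep_month INTEGER, dep_day_of_week TEXT, dep_date TEXT)"
)
# Invented sample rows; only the column names come from the question.
conn.executemany(
    "INSERT INTO flights VALUES (?, ?, ?)",
    [
        (1, "Mon", "2000-01-03"),
        (1, "Mon", "2000-01-03"),
        (1, "Tue", "2000-01-04"),
        (2, "Tue", "2000-02-01"),
    ],
)

# Positional form: 1, 2, 3 refer to the first three SELECT columns.
positional = """
    SELECT dep_month, dep_day_of_week, dep_date, COUNT(*) AS flight_count
    FROM flights
    GROUP BY 1, 2, 3
    ORDER BY 1, 2, 3
"""
# Named form: the same grouping spelled out by column name.
named = """
    SELECT dep_month, dep_day_of_week, dep_date, COUNT(*) AS flight_count
    FROM flights
    GROUP BY dep_month, dep_day_of_week, dep_date
    ORDER BY dep_month, dep_day_of_week, dep_date
"""
positional_rows = conn.execute(positional).fetchall()
named_rows = conn.execute(named).fetchall()
print(positional_rows == named_rows)
```

If you reorder the SELECT list in the positional version, the grouping silently changes with it, which is the refactoring risk mentioned above.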
Obviously these are position numbers. So this is a GROUP BY on the first three columns:
GROUP BY 1,2,3
means
GROUP BY dep_month, dep_day_of_week, dep_date
here.
This is not compliant with the SQL standard, because the GROUP BY clause is supposed to be evaluated before the SELECT clause, so the positions cannot be known at that point. They are only known in the ORDER BY clause, because that is evaluated after the SELECT clause. Only a few DBMSs make an exception and allow positional references in GROUP BY. Hence it's a bad idea to show this in a tutorial.
It's basically group by column 1, column 2 and column 3 from your select query.
SELECT reviews.*, COUNT(comments.review_id)
AS comment_count
FROM reviews
LEFT JOIN comments ON comments.review_id = reviews.review_id
GROUP BY reviews.review_id
ORDER BY reviews.review_id ASC;
When I run this code I get exactly what I want from my SQL query, however if I run the following
SELECT *, COUNT(comments.review_id)
AS comment_count
FROM reviews
LEFT JOIN comments ON comments.review_id = reviews.review_id
GROUP BY reviews.review_id
ORDER BY reviews.review_id ASC;
then I get the error "column must appear in GROUP BY clause or be used in an aggregate function".
Just wondered what the difference was and why the behaviour is different.
Thanks
In the first example, the columns are taken only from the reviews table. Although not all databases allow the use of SELECT * with GROUP BY, it is allowed by Standard SQL, assuming that review_id is the primary key.
The issue is that you are including columns in the SELECT that are not included in the GROUP BY. This is only allowed -- in certain databases -- under very special circumstances, where the columns in the GROUP BY are declared to uniquely identify each row (which a primary key does).
The second example has columns from comments that do not meet this condition. Hence it is not allowed.
In the SELECT part of a query with GROUP BY, you can choose only those columns which you used in the GROUP BY (plus aggregate expressions).
Since you grouped by reviews.review_id, you get output in the first case. In the second query you are trying to select every column from both tables, and that is not possible with GROUP BY.
You can use a window function if you need to select columns which are not present in your GROUP BY clause. Hope it makes sense.
https://www.windowfunctions.com/
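Another option that does not depend on how a given database treats SELECT * with GROUP BY is to count the comments in a derived table first and then join that to reviews. The sketch below uses Python's sqlite3 module. The title column and the sample rows are assumptions made up for the example, not part of the original schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical minimal schema; only review_id comes from the question.
    CREATE TABLE reviews (review_id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE comments (comment_id INTEGER PRIMARY KEY, review_id INTEGER);
    INSERT INTO reviews VALUES (1, 'first'), (2, 'second'), (3, 'third');
    INSERT INTO comments VALUES (10, 1), (11, 1), (12, 3);
""")

# Aggregate comments on their own, then join: the outer query has no
# GROUP BY, so selecting reviews.* raises no grouping question at all.
query = """
    SELECT reviews.*, COALESCE(c.comment_count, 0) AS comment_count
    FROM reviews
    LEFT JOIN (
        SELECT review_id, COUNT(*) AS comment_count
        FROM comments
        GROUP BY review_id
    ) AS c ON c.review_id = reviews.review_id
    ORDER BY reviews.review_id ASC
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

COALESCE turns the NULL produced by the LEFT JOIN for reviews without comments into 0, matching what COUNT(comments.review_id) gives in the original query.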
Are the following queries identical, or might I get different results (in any major DB system, e.g. MSSQL, MySQL, Postgres, SQLite):
Doing both in the same query:
SELECT group, some_agg_func(some_value)
FROM my_table
GROUP BY group
ORDER BY some_other_value
vs. ordering in a subquery:
SELECT group, some_agg_func(some_value)
FROM (
SELECT group, some_value
FROM my_table
ORDER BY some_other_value
) as alias
GROUP BY group
Looking at the first sample:
SELECT group, some_agg_func(some_value)
FROM my_table
GROUP BY group
ORDER BY some_other_value
Let's think about what GROUP BY does by looking at this imaginary sample data:
A B
- -
1 1
1 2
Then think about this query:
SELECT A
FROM SampleData
GROUP BY A
ORDER BY B
The GROUP BY clause puts the two rows into a single group. Then we want to order by B... but the two rows in the group have different values for B. Which should it use?
Obviously in this situation it doesn't really matter: there's only one row in the results, so the order is not relevant. But generally, how does the database know what to do?
The database could guess which one you want, or just take the first value, or the last (whatever those mean in a setting where the data is unordered by definition). And in fact this is what MySQL will try to do for you: it will try to guess at your meaning. But this response is really inappropriate. You specified an inexact query; the only correct thing to do is throw an error, which is what most databases will do.
Now let's look at the second sample:
SELECT group, some_agg_func(some_value)
FROM (
SELECT group, some_value
FROM my_table
ORDER BY some_other_value
) as alias
GROUP BY group
Here it is important to remember databases have their roots in relational set theory, and what we think of as "tables" are more formally described as Unordered Relations. Again: the idea of being "unordered" is baked into the very nature of a table at the deepest level.
In this case the inner query can run and create results in the specified order, and then the outer query can use that with GROUP BY to create a new set... but just like tables, query results are unordered relations. Without an ORDER BY clause the final result is also unordered by definition.
Now you might tend to get results in the order you want, but the reality is all bets are off. In fact, the databases that run this query will tend to give you results in the order in which they first encountered each group, which will not tend to match the ORDER BY because the GROUP BY expression is looking at completely different columns. Other databases (SQL Server is in this group) will not even allow the query to run, though I might prefer a warning here.
So now we come to the final section, where we must re-think the question, like this:
How can I use GROUP BY on the one group column, while also ordering by some_other_column not in the group?
The answer is each group can contain multiple rows, and so you must tell the database which row to look at to get the correct (specific) some_other_column value. The typical way to do this is with another aggregate function, which might look like this:
SELECT group, some_agg_func(some_value)
FROM my_table
GROUP BY group
ORDER BY some_other_agg_func(some_other_column)
That code will run without error on pretty much any database.
Just be careful here. On one hand, when people want to do this it's often for the common case where they know every record for some_other_column in each group will have the same value. For example, you might GROUP BY UserID, but ORDER BY Email, where of course every record with the same UserID should have the same Email address. As humans, we have the ability to make that kind of inference. Computers, however, don't handle that kind of thinking as well, and so we help them out with an extra aggregate function like MIN() or MAX().
On the other hand, if you're not careful, the two aggregate functions may not match up: you end up showing a value from one row in the group while the ORDER BY expression uses a completely different row from the same group, which can give misleading results.
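As a rough illustration of the UserID/Email case, here is a sketch with Python's sqlite3 module. The orders table, its columns and its rows are all hypothetical. The point is only that wrapping the non-grouped column in MIN() gives the database one well-defined sort key per group.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (user_id INTEGER, email TEXT, amount INTEGER)")
# Hypothetical data: every row of a given user carries the same email.
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        (1, "zoe@example.com", 10),
        (1, "zoe@example.com", 5),
        (2, "amy@example.com", 7),
        (3, "bob@example.com", 1),
        (3, "bob@example.com", 2),
    ],
)

# email is not in the GROUP BY, so it is wrapped in MIN() to yield
# exactly one value per group for the sort.
query = """
    SELECT user_id, SUM(amount) AS total
    FROM orders
    GROUP BY user_id
    ORDER BY MIN(email)
"""
rows = conn.execute(query).fetchall()
print(rows)
```

Because each user has a single email here, MIN(email) and MAX(email) would sort identically. If the values could differ within a group, the choice of aggregate would start to matter.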
Tables are unordered sets of data. A query result is a table. So if you select from a subquery that contains an ORDER BY clause, that clause means nothing; the data set is unordered by definition. The DBMS is free to ignore the ORDER BY clause. Some DBMS may even issue a warning or error, but I suppose it's more common that the ORDER BY clause just has no effect - at least not guaranteed.
In this query
SELECT group, some_agg_func(some_value)
FROM my_table
GROUP BY group
ORDER BY some_other_value
you try to order your results by some_other_value. If this is meant to be a column, you can't, because that other column is not part of your results. You'll get a syntax error. If some_other_value is a fixed value, then there is nothing ordered, because you'd have the same sort key for every row. But it can be an expression based on your result data (group key and aggregation results) and you can order your result rows by that.
In this query
SELECT group, some_agg_func(some_value)
FROM (
SELECT group, some_value
FROM my_table
ORDER BY some_other_value
) as alias
GROUP BY group
the ORDER BY clause has no effect. You could just as well just select FROM my_table directly:
SELECT group, some_agg_func(some_value)
FROM my_table as alias
GROUP BY group
This gets the results unordered (or at least the order you see is not guaranteed to be thus every time you run that query), because your query doesn't have an ORDER BY clause.
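A small experiment along these lines, sketched with Python's sqlite3 module: the column is named grp instead of group (a reserved word), SUM stands in for some_agg_func, and the data is invented. Since neither of the first two queries guarantees an output order, the comparison sorts both results in Python first.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE my_table (grp TEXT, some_value INTEGER, some_other_value INTEGER)"
)
conn.executemany(
    "INSERT INTO my_table VALUES (?, ?, ?)",
    [("a", 1, 30), ("b", 2, 10), ("a", 3, 20), ("b", 4, 40)],
)

# Subquery with an ORDER BY that the outer GROUP BY does not depend on.
with_inner_order = conn.execute("""
    SELECT grp, SUM(some_value)
    FROM (
        SELECT grp, some_value
        FROM my_table
        ORDER BY some_other_value
    ) AS alias
    GROUP BY grp
""").fetchall()

# Same aggregation straight from the table.
direct = conn.execute("""
    SELECT grp, SUM(some_value)
    FROM my_table
    GROUP BY grp
""").fetchall()

# Compare as unordered results: the aggregated values are identical.
print(sorted(with_inner_order) == sorted(direct))

# The only way to get a guaranteed order is an ORDER BY on the outer query.
explicit = conn.execute("""
    SELECT grp, SUM(some_value)
    FROM my_table
    GROUP BY grp
    ORDER BY grp DESC
""").fetchall()
print(explicit)
```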
I have two tables, Census and Crime.
From the Crime table, I need to find the most frequently occurring community_area_number
and link the crime's community_area_number to the Census table's community_area_number to get the community_area_name.
I am able to do the first step, but I fail at linking to the other table. Please advise where I have gone wrong. Thanks
%%sql
SELECT COUNT(CR.COMMUNITY_AREA_NUMBER) AS MOST_FREQ, CR.COMMUNITY_AREA_NUMBER, CE.COMMUNITY_AREA_NAME from CRIME AS CR, CENSUS AS CE
WHERE CR.COMMUNITY_AREA_NUMBER = CE.COMMUNITY_AREA_NUMBER
GROUP BY CR.COMMUNITY_AREA_NUMBER
ORDER BY COUNT(CR.COMMUNITY_AREA_NUMBER) DESC LIMIT 1
Expected output
MOST_FREQ  COMMUNITY_AREA_NUMBER  COMMUNITY_AREA_NAME
43         25                     Uptown
Sample CENSUS
SAMPLE CRIME
You should be writing the query like this:
SELECT COUNT(*) AS MOST_FREQ,
CR.COMMUNITY_AREA_NUMBER, CE.COMMUNITY_AREA_NAME
FROM CRIME CR JOIN
CENSUS CE
ON CR.COMMUNITY_AREA_NUMBER = CE.COMMUNITY_AREA_NUMBER
GROUP BY CR.COMMUNITY_AREA_NUMBER, CE.COMMUNITY_AREA_NAME
ORDER BY COUNT(*) DESC
LIMIT 1;
Note the use of proper, explicit, standard, readable JOIN syntax. Never use commas in the FROM clause.
The relevant change, though, is to include CE.COMMUNITY_AREA_NAME in the GROUP BY. All non-aggregated columns should be in the GROUP BY as a general rule.
Also, COUNT(*) is simpler for counting matches, so this query uses that instead of counting the non-NULL values of a column.
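To try the corrected query on something small, here is a sketch using Python's sqlite3 module. The two tables are cut down to the columns used in the query, and the rows are made up, so the count will not match the 43 in the expected output.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Reduced, hypothetical versions of the two tables.
    CREATE TABLE census (community_area_number INTEGER, community_area_name TEXT);
    CREATE TABLE crime (case_id INTEGER, community_area_number INTEGER);
    INSERT INTO census VALUES (25, 'Uptown'), (23, 'Humboldt Park');
    INSERT INTO crime VALUES (1, 25), (2, 25), (3, 25), (4, 23);
""")

# Every non-aggregated SELECT column also appears in the GROUP BY.
query = """
    SELECT COUNT(*) AS most_freq,
           cr.community_area_number,
           ce.community_area_name
    FROM crime cr
    JOIN census ce
      ON cr.community_area_number = ce.community_area_number
    GROUP BY cr.community_area_number, ce.community_area_name
    ORDER BY COUNT(*) DESC
    LIMIT 1
"""
top = conn.execute(query).fetchone()
print(top)
```

Note that LIMIT 1 returns a single row even if two areas tie for the highest count. Which of the tied rows you get is not defined unless you add a tie-breaker to the ORDER BY.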
You are using an aggregate function, COUNT(CR.COMMUNITY_AREA_NUMBER) AS MOST_FREQ,
and all other (non-aggregate) returned columns need to be in the GROUP BY clause.
For your query, that means adding CE.COMMUNITY_AREA_NAME to the GROUP BY.
I have got the table MYTABLE with 2 columns: A and B
I have got the following pieces of the code:
SELECT MYTABLE.A FROM MYTABLE
HAVING SUM(MYTABLE.B) > 100
GROUP BY MYTABLE.A
and
SELECT MYTABLE.A FROM MYTABLE
GROUP BY MYTABLE.A
HAVING SUM(MYTABLE.B) > 100
Are they the same? Is it possible that these two queries will return different sets of results?
Thank you in advance
As documented, there is no difference. People are just used to seeing HAVING after GROUP BY.
http://docs.oracle.com/cd/B28359_01/server.111/b28286/statements_10002.htm#SQLRF20040
Specify GROUP BY and HAVING after the where_clause and hierarchical_query_clause. If you specify both GROUP BY and HAVING, then they can appear in either order.
http://sqlfiddle.com/#!4/66e33/1
I originally wrote:
I am not sure your 1st query is valid. As far as I know, HAVING should always come after GROUP BY.
I was corrected by David Aldridge: the Oracle docs state that the order does not matter. Although I don't recommend putting HAVING before GROUP BY, for readability reasons (and to prevent confusion with a WHERE clause), it is technically correct. So that makes the answer to your question 'yes, it's the same'.
You can't have a HAVING before a GROUP BY, the HAVING is like the "WHERE" but for the GROUP BY condition.
The clauses are evaluated in order. You can have a HAVING clause following immediately the FROM clause. In this case, the HAVING clause will apply to the entire rows of the result set. The select list may only contain, in this case, one/more/all of the aggregation functions contained in the HAVING clause.
So, your first query is not valid because of the above. A valid query would be
SELECT SUM(MYTABLE.B) AS s FROM MYTABLE
HAVING SUM(MYTABLE.B) > 100
The above query will return one or no row, depending on whether the condition SUM(MYTABLE.B) > 100 is verified or not.
Still, there is one more reason for which your first query is not valid. The GROUP BY clause may refer only to columns in the data set to which it applies. So going on with my valid query above, you can write the following valid query (though it will be useless and nonsense, as it is applied to either one or no rows):
SELECT SUM(s)
FROM
(
SELECT SUM(MYTABLE.B) s
FROM MYTABLE
HAVING SUM(MYTABLE.B) > 100
) q
GROUP BY s
So, just to answer: no, they're not the same. One of them is not even valid.
Both WHERE and HAVING allow you to impose conditions in the query. The difference:
We use WHERE to filter the rows selected from the table.
We use HAVING to filter the groups produced by GROUP BY.
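The difference is easy to see on a tiny table. This is a sketch with Python's sqlite3 module using the MYTABLE(A, B) layout from the question and invented values. The same threshold of 100 is applied once per row with WHERE and once per group with HAVING.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (A TEXT, B INTEGER)")
conn.executemany(
    "INSERT INTO mytable VALUES (?, ?)",
    [("x", 60), ("x", 50), ("y", 150), ("y", -100), ("z", 200)],
)

# HAVING filters groups: keep the A values whose summed B exceeds 100.
having_rows = conn.execute("""
    SELECT A
    FROM mytable
    GROUP BY A
    HAVING SUM(B) > 100
    ORDER BY A
""").fetchall()

# WHERE filters individual rows before any grouping happens.
where_rows = conn.execute("""
    SELECT A
    FROM mytable
    WHERE B > 100
    GROUP BY A
    ORDER BY A
""").fetchall()

print(having_rows)
print(where_rows)
```

Group x passes the HAVING test (60 + 50 = 110) although no single x row exceeds 100, while group y has one row above 100 but a sum of only 50, so the two filters select different groups.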
The GROUP BY clause groups the rows, but it does not necessarily sort the results in any particular order. To change the order, use the ORDER BY clause, which follows the GROUP BY clause. The columns used in the ORDER BY clause must appear in the SELECT list, which is unlike the normal use of ORDER BY. [Oracle by Example, fourth Edition, page 274]
Why is that? Why does using GROUP BY influence the required columns in the SELECT clause?
Also, in the case where I do not use GROUP BY: Why would I want to ORDER BY some columns but then select only a subset of the columns?
Actually the statement is not entirely true as Dave Costa's example shows.
The Oracle documentation says that an expression can be used but the expression must be based on the columns in the selection list.
expr - expr orders rows based on their value for expr. The expression is based on columns in the select list or columns in the tables, views, or materialized views in the FROM clause.
Source: Oracle® Database SQL Language Reference, 11g Release 2 (11.2), E26088-01, September 2011, page 19-33.
From the same work, pages 19-13 and 19-33 (pages 1355 and 1365 in the PDF):
http://docs.oracle.com/cd/E11882_01/server.112/e26088/statements_10002.htm#SQLRF01702
http://docs.oracle.com/cd/E11882_01/server.112/e26088/statements_10002.htm#i2171079
The bold text from your quote is incorrect (it's probably an oversimplification that is true in many common use cases, but it is not strictly true as a requirement). For instance, this statement executes just fine, although AVG(val) is not in the select list:
WITH DATA AS (SELECT mod(LEVEL,3) grp, LEVEL val FROM dual CONNECT BY LEVEL < 100)
SELECT grp,MIN(val),MAX(val)
FROM DATA
GROUP BY grp
ORDER BY AVG(val)
The expressions in the ORDER BY clause simply have to be possible to evaluate in the context of the GROUP BY. For instance, ORDER BY val would not work in the above example, because the expression val does not have a distinct value for each row produced by the grouping.
As to your second question, you may care about the ordering but not about the value of the ordering expression. Excluding unneeded expressions from the select lists reduces the amount of data that must actually be sent from the server to the client.
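For anyone without an Oracle instance at hand, roughly the same experiment can be run with Python's sqlite3 module. This is a sketch: the rows 1 to 99 are generated in Python instead of with CONNECT BY, and the table name data_rows is my own choice.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data_rows (grp INTEGER, val INTEGER)")
# Same shape as the Oracle example: val runs from 1 to 99, grp = val mod 3.
conn.executemany(
    "INSERT INTO data_rows VALUES (?, ?)",
    [(val % 3, val) for val in range(1, 100)],
)

# AVG(val) is used only for sorting; it does not appear in the select list.
rows = conn.execute("""
    SELECT grp, MIN(val), MAX(val)
    FROM data_rows
    GROUP BY grp
    ORDER BY AVG(val)
""").fetchall()
print(rows)
```

The group averages are 49, 50 and 51 for grp 1, 2 and 0 respectively, so the rows come back in that group order even though the average itself is never returned.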
First:
The implementation of group by is one which creates a new resultset that differs in structure to the original from clause (table view or some joined tables). That resultset is defined by what is selected.
Not every SQL RDBMS has this restriction, though it is always a requirement that what is ordered by be either an aggregate function of the non-grouped columns (AVG, SUM, etc.), or one of the grouped columns, or an expression over more than one of those results (like adding two columns), because this is a logical requirement of the result of the grouping operation.
Second:
Because you only care about that column for the ordering. For example, you might have a list of the top selling singles without giving their sales (the NYT Bestsellers list keeps some details of its data secret, but does publish a ranked list). Of course, you can get around this by just selecting that column and then not using it.
The data is aggregated before it is sorted for the ORDER BY.
If you try to order by any other column (that is not in the group by list or an aggregation function), what value would be used? There is no single value to use for ordering.
I believe that you can use combinations of the values for sorting. So you can say:
order by a+b
If a and b are in the group by. You just cannot introduce columns not mentioned in the SELECT. I believe you can use aggregation functions not mentioned in the SELECT, however.
Sample table
sample.grades
Name Grade Score
Adam A 95
Bob A 97
Charlie C 75
First Query using GROUP BY
Select grade, count(Grade) from sample.grades GROUP BY Grade
Output
Grade Count
A 2
C 1
Second Query using order by
select Name, Grade, Score from sample.grades order by Score desc
Output
Name Grade Score
Bob A 97
Adam A 95
Charlie C 75
Third Query using GROUP BY and ordering
Select grade, count(Grade) from sample.grades GROUP BY Grade ORDER BY Grade
Output
Grade Count
A 2
C 1
Once you select a plain column alongside an aggregate like COUNT, you must have a GROUP BY. You can use GROUP BY and ORDER BY together, but they have very different uses, as I hope the examples clearly show.
To try to answer the question of why GROUP BY affects the items in the SELECT list: that is what GROUP BY is meant to do. You can't get a count per grade unless you group by that column.
Second question, why would you want to order by but not select all the columns?
If I want to order by the score, but do not care about the actual grade or even the score I might do
select name from sample.grades order by score desc
Output
Name
Bob
Adam
Charlie
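If you want to run these yourself, here is a sketch using Python's sqlite3 module. The table is simply called grades, because a sample. prefix would need an attached database in SQLite, and the sort direction is written out explicitly as DESC so the highest score comes first.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE grades (name TEXT, grade TEXT, score INTEGER)")
conn.executemany(
    "INSERT INTO grades VALUES (?, ?, ?)",
    [("Adam", "A", 95), ("Bob", "A", 97), ("Charlie", "C", 75)],
)

# GROUP BY collapses rows into one row per grade; ORDER BY sorts the groups.
per_grade = conn.execute("""
    SELECT grade, COUNT(grade)
    FROM grades
    GROUP BY grade
    ORDER BY grade
""").fetchall()

# ORDER BY on a column that is not selected: score decides the order,
# but only the names are returned.
names_by_score = [
    row[0]
    for row in conn.execute("SELECT name FROM grades ORDER BY score DESC")
]

print(per_grade)
print(names_by_score)
```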
Which results would you expect to see when ordering by columns that are neither listed in the select list nor part of the GROUP BY clause? In any case, a sort on such columns cannot be applied, so the Oracle developers were right to add the restriction.
with c as (
select 1 id, 2 value from dual
union all
select 1 id, 3 value from dual
union all
select 2 id, 3 value from dual
)
select id
from c
group by id
order by count(*) desc
Here is my understanding:
"The GROUP BY clause groups the rows, but it does not necessarily sort the results in any particular order."
-> you can use Group by without order by
"To change the order, use the ORDER BY clause, which follows the GROUP BY clause."
-> by default the rows are returned in primary-key order, and if you add ORDER BY, it must come after GROUP BY
"The columns used in the ORDER BY clause must appear in the SELECT list, which is unlike the normal use of ORDER BY."