DB2: SELECT the most frequent occurrence and link it with another table - sql

I have two tables, CENSUS and CRIME.
From the CRIME table, I need to find the most frequently occurring community_area_number
and then link the crime's community_area_number to the CENSUS table's community_area_number to get the community_area_name.
I am able to do the first step, but I fail at linking to the other table. Please advise what I have done wrong. Thanks
%%sql
SELECT COUNT(CR.COMMUNITY_AREA_NUMBER) AS MOST_FREQ, CR.COMMUNITY_AREA_NUMBER, CE.COMMUNITY_AREA_NAME from CRIME AS CR, CENSUS AS CE
WHERE CR.COMMUNITY_AREA_NUMBER = CE.COMMUNITY_AREA_NUMBER
GROUP BY CR.COMMUNITY_AREA_NUMBER
ORDER BY COUNT(CR.COMMUNITY_AREA_NUMBER) DESC LIMIT 1
Expected output
MOST_FREQ  COMMUNITY_AREA_NUMBER  COMMUNITY_AREA_NAME
43         25                     Uptown
Sample CENSUS
SAMPLE CRIME

You should be writing the query like this:
SELECT COUNT(*) AS MOST_FREQ,
CR.COMMUNITY_AREA_NUMBER, CE.COMMUNITY_AREA_NAME
FROM CRIME CR JOIN
CENSUS CE
ON CR.COMMUNITY_AREA_NUMBER = CE.COMMUNITY_AREA_NUMBER
GROUP BY CR.COMMUNITY_AREA_NUMBER, CE.COMMUNITY_AREA_NAME
ORDER BY COUNT(*) DESC
LIMIT 1;
Note the use of proper, explicit, standard, readable JOIN syntax. Never use commas in the FROM clause.
The relevant change, though, is to include CE.COMMUNITY_AREA_NAME in the GROUP BY. All non-aggregated columns should be in the GROUP BY as a general rule.
Also, COUNT(*) is simpler for counting matches, so this query uses that instead of counting the non-NULL values of a column.
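To see the corrected query run end to end, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for DB2. The table contents are made-up sample rows, not the asker's data, and SQLite is assumed to behave like DB2 for this particular query.

```python
import sqlite3

# In-memory SQLite stands in for DB2; all rows below are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE CENSUS (COMMUNITY_AREA_NUMBER INTEGER, COMMUNITY_AREA_NAME TEXT);
    CREATE TABLE CRIME  (ID INTEGER, COMMUNITY_AREA_NUMBER INTEGER);
    INSERT INTO CENSUS VALUES (25, 'Uptown'), (3, 'Elsewhere');
    INSERT INTO CRIME  VALUES (1, 25), (2, 25), (3, 3), (4, 25), (5, NULL);
""")

# Every non-aggregated column in the SELECT list also appears in the GROUP BY.
most_frequent = conn.execute("""
    SELECT COUNT(*) AS MOST_FREQ,
           CR.COMMUNITY_AREA_NUMBER,
           CE.COMMUNITY_AREA_NAME
    FROM CRIME CR
    JOIN CENSUS CE
      ON CR.COMMUNITY_AREA_NUMBER = CE.COMMUNITY_AREA_NUMBER
    GROUP BY CR.COMMUNITY_AREA_NUMBER, CE.COMMUNITY_AREA_NAME
    ORDER BY COUNT(*) DESC
    LIMIT 1
""").fetchone()

print(most_frequent)
```

The crime row with a NULL area number never matches a census row, so the inner join drops it before COUNT(*) is evaluated.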

You are using an aggregate function, COUNT(CR.COMMUNITY_AREA_NUMBER) AS MOST_FREQ,
and all other (non-aggregate) return values need to be in the GROUP BY clause.
For your query, that means adding CE.COMMUNITY_AREA_NAME to the GROUP BY.

Related

SQL - Difference between .* and * in aggregate function query

SELECT reviews.*, COUNT(comments.review_id)
AS comment_count
FROM reviews
LEFT JOIN comments ON comments.review_id = reviews.review_id
GROUP BY reviews.review_id
ORDER BY reviews.review_id ASC;
When I run this code I get exactly what I want from my SQL query, however if I run the following
SELECT *, COUNT(comments.review_id)
AS comment_count
FROM reviews
LEFT JOIN comments ON comments.review_id = reviews.review_id
GROUP BY reviews.review_id
ORDER BY reviews.review_id ASC;
then I get the error "column must appear in GROUP BY clause or be used in an aggregate function".
Just wondered what the difference was and why the behaviour is different.
Thanks
In the first example, the columns are taken only from the reviews table. Although not all databases allow the use of SELECT * with GROUP BY, it is allowed by Standard SQL, assuming that review_id is the primary key.
The issue is that you are including columns in the SELECT that are not included in the GROUP BY. This is only allowed, in certain databases, under very special circumstances, where the columns in the GROUP BY are declared to uniquely identify each row (which a primary key does).
The second example has columns from comments that do not meet this condition. Hence it is not allowed.
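One way to see what each select list actually expands to is to inspect the result column names. The sketch below uses Python's sqlite3 with a hypothetical schema (the title, comment_id and body columns are assumptions, since the question does not show the tables). SQLite itself is lenient about GROUP BY, so the GROUP BY is left out and only the column expansion is shown.

```python
import sqlite3

# Hypothetical schema; only review_id appears in the original question.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE reviews  (review_id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE comments (comment_id INTEGER PRIMARY KEY, review_id INTEGER, body TEXT);
""")

join = "FROM reviews LEFT JOIN comments ON comments.review_id = reviews.review_id"

# cursor.description lists the result columns even when no rows come back.
reviews_only = [col[0] for col in conn.execute(f"SELECT reviews.* {join}").description]
everything = [col[0] for col in conn.execute(f"SELECT * {join}").description]

print(reviews_only)  # columns of reviews only
print(everything)    # columns of reviews followed by columns of comments
```

The second list contains the comments columns as well, and those are not functionally dependent on reviews.review_id, which is why a strict database rejects the query.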
In the SELECT part of a query with GROUP BY, you can choose only those columns that you used in the GROUP BY (or aggregates of other columns).
Since you grouped by reviews.review_id, you get output in the first case. In the second query you are trying to get all the columns, and that is not possible with GROUP BY.
You can use a window function if you need to select columns that are not present in your GROUP BY clause. Hope it makes sense.
https://www.windowfunctions.com/
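As a rough illustration of the window-function route, here is a sketch in Python's sqlite3 (window functions need SQLite 3.25 or newer). The schema and rows are invented for the example. Each joined row keeps its own comment columns and also carries the per-review count.

```python
import sqlite3

# Hypothetical schema and data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE reviews  (review_id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE comments (comment_id INTEGER PRIMARY KEY, review_id INTEGER, body TEXT);
    INSERT INTO reviews  VALUES (1, 'Good'), (2, 'Bad');
    INSERT INTO comments VALUES (10, 1, 'agree'), (11, 1, 'nope');
""")

# No GROUP BY: COUNT(...) OVER counts within each review_id partition,
# so non-grouped columns such as c.body can still be selected.
rows = conn.execute("""
    SELECT r.review_id, r.title, c.body,
           COUNT(c.review_id) OVER (PARTITION BY r.review_id) AS comment_count
    FROM reviews r
    LEFT JOIN comments c ON c.review_id = r.review_id
    ORDER BY r.review_id, c.comment_id
""").fetchall()

for row in rows:
    print(row)
```

Note the trade-off: the result has one row per comment rather than one row per review, so this is not a drop-in replacement for the grouped query.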

SQL group by, what am I doing wrong?

The situation is as follows:
Find the top 5 Community Areas by average College Enrollment.
The DB is stored as SCHOOLS.
%sql SELECT COLLEGE_ENROLLMENT, COMMUNITY_AREA_NAME FROM SCHOOLS GROUP BY COLLEGE_ENROLLMENT;
I understand that this would give me the college enrollment by community, but I get the error message of this:
(ibm_db_dbi.ProgrammingError) ibm_db_dbi::ProgrammingError: Exception('SQLNumResultCols failed: [IBM][CLI Driver][DB2/LINUXX8664] SQL0119N An expression starting with "COMMUNITY_AREA_NAME" specified in a SELECT clause, HAVING clause, or ORDER BY clause is not specified in the GROUP BY clause or it is in a SELECT clause, HAVING clause, or ORDER BY clause with a column function and no GROUP BY clause is specified. SQLSTATE=42803\r SQLCODE=-119')
Can anyone give me a lead on what I'm doing wrong here?
Thank you!
When using GROUP BY anything you put after the SELECT clause has to be used in the GROUP BY clause or an aggregate function, like SUM(). In your case you would need to place COMMUNITY_AREA_NAME in the GROUP BY clause or remove it from the SELECT clause to get the error to go away. That said, I don't think this query is quite what you want - I would do something like this:
SELECT COMMUNITY_AREA_NAME, SUM(COLLEGE_ENROLLMENT) AS TOTAL_ENROLLED FROM SCHOOLS GROUP BY COMMUNITY_AREA_NAME ORDER BY TOTAL_ENROLLED DESC;
Explanation:
SUM(COLLEGE_ENROLLMENT): Total up the enrollment of all schools that are in a single COMMUNITY_AREA_NAME.
AS TOTAL_ENROLLED: Give the result from SUM() a name so we can easily refer to it later in the ORDER BY clause.
ORDER BY TOTAL_ENROLLED DESC: Sort the output by TOTAL_ENROLLED and put the biggest numbers at the top.
Try the following; it should work.
Find the top 5 Community Areas by average College Enrollment.
SELECT
COMMUNITY_AREA_NAME,
AVG(COLLEGE_ENROLLMENT) AS AVG_ENROLL
FROM SCHOOLS
GROUP BY
COMMUNITY_AREA_NAME
ORDER BY
AVG(COLLEGE_ENROLLMENT) DESC
LIMIT 5
;
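A quick way to sanity-check the AVG version is to run it against a toy table. This sketch uses Python's sqlite3 instead of DB2, and the area names and enrollment figures are made up.

```python
import sqlite3

# Hypothetical SCHOOLS rows: (community area, enrollment of one school).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE SCHOOLS (COMMUNITY_AREA_NAME TEXT, COLLEGE_ENROLLMENT INTEGER)")
conn.executemany(
    "INSERT INTO SCHOOLS VALUES (?, ?)",
    [("A", 100), ("A", 200), ("B", 300), ("C", 50),
     ("D", 250), ("D", 150), ("E", 10), ("F", 400), ("F", 100)],
)

# One row per community area, ranked by the average enrollment of its schools.
top5 = conn.execute("""
    SELECT COMMUNITY_AREA_NAME, AVG(COLLEGE_ENROLLMENT) AS AVG_ENROLL
    FROM SCHOOLS
    GROUP BY COMMUNITY_AREA_NAME
    ORDER BY AVG(COLLEGE_ENROLLMENT) DESC
    LIMIT 5
""").fetchall()

top5_names = [name for name, _ in top5]
print(top5_names)
```

With six areas in the table, the area with the lowest average (E) is the one cut off by LIMIT 5.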

SQL GROUP BY usages

I am doing the SQL transformation lesson from Codecademy here. I am not sure why they are using those numbers after the GROUP BY clause and what those numbers are doing. Could anyone who has passed the course be so kind as to let me know?
SELECT dep_month,
dep_day_of_week,
dep_date,
COUNT(*) AS flight_count
FROM flights
GROUP BY 1,2,3
The numbers in the GROUP BY clause simply refer to the columns in the SELECT list, from left to right. Hence, your query is identical to the following:
SELECT
dep_month,
dep_day_of_week,
dep_date,
COUNT(*) AS flight_count
FROM flights
GROUP BY
dep_month,
dep_day_of_week,
dep_date
The query I wrote above is what I would use in practice. The reason is that GROUP BY 1,2,3 refers to positions rather than columns. Anyone who later reorders or changes the SELECT list runs the risk of breaking the query.
Obviously these are position numbers. So this is a GROUP BY on the first three columns:
GROUP BY 1,2,3
means
GROUP BY dep_month, dep_day_of_week, dep_date
here.
This is not compliant with the SQL standard, because the GROUP BY clause is supposed to be evaluated before the SELECT clause, so the positions cannot be known at that point. They are known only in the ORDER BY clause, because that is evaluated after the SELECT clause. Only a few DBMSs make an exception and allow this positional notation in GROUP BY. It is therefore poor practice to show this in a tutorial.
It's basically group by column 1, column 2 and column 3 from your select query.
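To check that the positional and the named form give the same result, here is a small sketch in Python's sqlite3 (SQLite is one of the engines that accepts positions in GROUP BY). The flights rows are invented, and an ORDER BY is added only to make the comparison deterministic.

```python
import sqlite3

# Hypothetical flights data with the three columns from the lesson.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE flights (dep_month TEXT, dep_day_of_week TEXT, dep_date TEXT)")
conn.executemany(
    "INSERT INTO flights VALUES (?, ?, ?)",
    [("2000-01", "Monday", "2000-01-03"),
     ("2000-01", "Monday", "2000-01-03"),
     ("2000-01", "Tuesday", "2000-01-04")],
)

select = """
    SELECT dep_month, dep_day_of_week, dep_date, COUNT(*) AS flight_count
    FROM flights
"""

# Positional form: 1, 2, 3 refer to the first three SELECT columns.
positional = conn.execute(select + " GROUP BY 1, 2, 3 ORDER BY dep_date").fetchall()
# Named form: the same grouping spelled out.
named = conn.execute(
    select + " GROUP BY dep_month, dep_day_of_week, dep_date ORDER BY dep_date"
).fetchall()

print(positional == named)
```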

GROUP BY combined with ORDER BY

The GROUP BY clause groups the rows, but it does not necessarily sort the results in any particular order. To change the order, use the ORDER BY clause, which follows the GROUP BY clause. The columns used in the ORDER BY clause must appear in the SELECT list, which is unlike the normal use of ORDER BY. [Oracle by Example, fourth Edition, page 274]
Why is that? Why does using GROUP BY influence the required columns in the SELECT clause?
Also, in the case where I do not use GROUP BY: Why would I want to ORDER BY some columns but then select only a subset of the columns?
Actually the statement is not entirely true as Dave Costa's example shows.
The Oracle documentation says that an expression can be used but the expression must be based on the columns in the selection list.
expr - expr orders rows based on their value for expr. The expression is based on columns in the select list or columns in the tables, views, or materialized views in the FROM clause.
Source: Oracle® Database SQL Language Reference, 11g Release 2 (11.2), E26088-01, September 2011, page 19-33.
From the same work, pages 19-13 and 19-33 (pages 1355 and 1365 in the PDF):
http://docs.oracle.com/cd/E11882_01/server.112/e26088/statements_10002.htm#SQLRF01702
http://docs.oracle.com/cd/E11882_01/server.112/e26088/statements_10002.htm#i2171079
The bold text from your quote is incorrect (it's probably an oversimplification that is true in many common use cases, but it is not strictly true as a requirement). For instance, this statement executes just fine, although AVG(val) is not in the select list:
WITH DATA AS (SELECT mod(LEVEL,3) grp, LEVEL val FROM dual CONNECT BY LEVEL < 100)
SELECT grp,MIN(val),MAX(val)
FROM DATA
GROUP BY grp
ORDER BY AVG(val)
The expressions in the ORDER BY clause simply have to be possible to evaluate in the context of the GROUP BY. For instance, ORDER BY val would not work in the above example, because the expression val does not have a distinct value for each row produced by the grouping.
As to your second question, you may care about the ordering but not about the value of the ordering expression. Excluding unneeded expressions from the select lists reduces the amount of data that must actually be sent from the server to the client.
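The same behaviour can be reproduced outside Oracle. Below is a sketch in Python's sqlite3 that rebuilds the 1..99 data set with a plain table (instead of CONNECT BY) and orders by an aggregate that is not in the select list. The table name is an assumption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data (grp INTEGER, val INTEGER)")
# Same shape as the Oracle example: val runs from 1 to 99, grp = val mod 3.
conn.executemany("INSERT INTO data VALUES (?, ?)", [(v % 3, v) for v in range(1, 100)])

# AVG(val) is used only for ordering; it is never returned to the client.
rows = conn.execute("""
    SELECT grp, MIN(val), MAX(val)
    FROM data
    GROUP BY grp
    ORDER BY AVG(val)
""").fetchall()

print(rows)
```

The averages are 49, 50 and 51 for groups 1, 2 and 0, so the groups come back in that order even though the average itself is not selected.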
First:
The implementation of group by is one which creates a new resultset that differs in structure to the original from clause (table view or some joined tables). That resultset is defined by what is selected.
Not every SQL RDBMS has this restriction, though it is always a requirement that what is ordered by be either an aggregate function of the non-grouped columns (AVG, SUM, etc.), one of the columns grouped by, or a function of more than one of those results (like adding two columns), because this is a logical requirement of the result of the grouping operation.
Second:
Because you only care about that column for the ordering. For example, you might have a list of the top selling singles without giving their sales (the NYT Bestsellers keeps some details of their data a secret, but do have a ranked list). Of course, you can get around this by just selecting that column and then not using it.
The data is aggregated before it is sorted for the ORDER BY.
If you try to order by any other column (that is not in the group by list or an aggregation function), what value would be used? There is no single value to use for ordering.
I believe that you can use combinations of the values for sorting. So you can say:
order by a+b
If a and b are in the group by. You just cannot introduce columns not mentioned in the SELECT. I believe you can use aggregation functions not mentioned in the SELECT, however.
Sample table
sample.grades
Name Grade Score
Adam A 95
Bob A 97
Charlie C 75
First Query using GROUP BY
Select grade, count(Grade) from sample.grades GROUP BY Grade
Output
Grade Count
A 2
C 1
Second Query using order by
select Name, Grade, Score from sample.grades order by Score desc
Output
Bob A 97
Adam A 95
Charlie C 75
Third Query using GROUP BY and ordering
Select grade, count(Grade) from sample.grades GROUP BY Grade ORDER BY Grade
Output
Grade Count
A 2
C 1
Once you start using things like COUNT alongside ordinary columns, you must have GROUP BY. You can use GROUP BY and ORDER BY together, but they have very different uses, as I hope the examples clearly show.
To try to answer the question of why GROUP BY affects the items in the SELECT list: that is what GROUP BY is meant to do. You can't get a count per grade unless you group by the grade column.
Second question, why would you want to order by but not select all the columns?
If I want to order by the score, but do not care about the actual grade or even the score I might do
select name from sample.grades order by score desc
Output
Name
Bob
Adam
Charlie
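The sample table is small enough to try directly. Here is a sketch with Python's sqlite3, where the table is called grades because an in-memory SQLite database has no sample schema. It runs the grouped count and the "order by a column you do not select" query.

```python
import sqlite3

# The three rows from the sample table above.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE grades (Name TEXT, Grade TEXT, Score INTEGER);
    INSERT INTO grades VALUES ('Adam', 'A', 95), ('Bob', 'A', 97), ('Charlie', 'C', 75);
""")

# GROUP BY collapses rows into one row per grade; ORDER BY only sorts the result.
counts = conn.execute(
    "SELECT Grade, COUNT(Grade) FROM grades GROUP BY Grade ORDER BY Grade"
).fetchall()

# Without GROUP BY, ordering by a column that is not selected is fine:
# every output row still has exactly one Score to sort on.
names = [row[0] for row in conn.execute("SELECT Name FROM grades ORDER BY Score DESC")]

print(counts)
print(names)
```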
What results would you expect to see when ordering by columns that are neither listed in the select list nor part of the GROUP BY clause? In any case, a sort on a column that is not mentioned in the SELECT list has no single value per group to work with, so the Oracle developers were right to add the restriction. Ordering by an aggregate, on the other hand, works:
with c as (
select 1 id, 2 value from dual
union all
select 1 id, 3 value from dual
union all
select 2 id, 3 value from dual
)
select id
from c
group by id
order by count(*) desc
Here is my understanding:
"The GROUP BY clause groups the rows, but it does not necessarily sort the results in any particular order."
-> you can use Group by without order by
"To change the order, use the ORDER BY clause, which follows the GROUP BY clause."
-> by default the rows come back in primary key order, and if you add ORDER BY it must come after GROUP BY
"The columns used in the ORDER BY clause must appear in the SELECT list, which is unlike the normal use of ORDER BY."

Can I group by something that isn't in the SELECT line?

Given a command in SQL;
SELECT ...
FROM ...
GROUP BY ...
Can I group by something that isn't in the SELECT line?
Yes.
This is often used in the superaggregate queries like this:
SELECT AVG(cnt)
FROM (
SELECT COUNT(*) AS cnt
FROM sales
GROUP BY
product
HAVING COUNT(*) > 10
) q
, which aggregate the aggregates.
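For a concrete run of this pattern, here is a sketch using Python's sqlite3 with invented sales rows. The inner query groups by product without selecting it, and the outer query averages the surviving counts.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (product TEXT)")
# Hypothetical data: 12 sales of A, 20 of B, 5 of C.
conn.executemany(
    "INSERT INTO sales VALUES (?)",
    [("A",)] * 12 + [("B",)] * 20 + [("C",)] * 5,
)

# Inner query: one count per product, keeping only products with more than 10 sales.
# Outer query: the average of those counts.
avg_cnt = conn.execute("""
    SELECT AVG(cnt)
    FROM (
        SELECT COUNT(*) AS cnt
        FROM sales
        GROUP BY product
        HAVING COUNT(*) > 10
    ) q
""").fetchone()[0]

print(avg_cnt)
```

Product C is filtered out by HAVING, so the result is the average of 12 and 20.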
Yes of course e.g.
select
count(*)
from
some_table_with_updated_column
group by
trunc(updated, 'MM')
Yes you can do it, but if you do that you won't be able to tell which result is for which group.
As a result, you almost always want to return the columns you've grouped by in the select clause. But you don't have to.
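The difference is easy to see side by side. This sketch uses Python's sqlite3 with a made-up sales table and runs the grouping once without and once with the grouped column in the select list.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (salesman_id INTEGER, amount INTEGER)")
# Hypothetical rows: salesman 1 has two sales, salesman 2 has three.
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [(1, 10), (1, 20), (2, 5), (2, 15), (2, 25)],
)

# Legal, but the counts come back with no indication of which group they belong to.
unlabelled = sorted(
    row[0] for row in conn.execute("SELECT COUNT(*) FROM sales GROUP BY salesman_id")
)

# Usually more useful: return the grouping column alongside the aggregate.
labelled = conn.execute("""
    SELECT salesman_id, COUNT(*)
    FROM sales
    GROUP BY salesman_id
    ORDER BY salesman_id
""").fetchall()

print(unlabelled)
print(labelled)
```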
Yes, you can. Example:
select count(1)
from sales
group by salesman_id
What you can't do, of course, is have something in your select clause (other than aggregate functions) that is not part of the group by clause.
Hmm, I think the question should have been in the other way round like,
Can I SELECT something that is not there in the GROUP BY?
It's all right to write code like:
SELECT customerId, count(orderId) FROM orders
GROUP BY customerId, orderedOn
If you want to find out the number of orders done by a customer datewise.
But you cannot do it the other way round:
SELECT customerId, orderedOn, count(orderId) FROM orders
GROUP BY customerId
You can apply an aggregate function to a column that is not in the GROUP BY, but you cannot put that column in the SELECT list without an aggregate function, because it would not make much sense. Take the query above: you group by customerId alone to count orders, yet you also want the date printed in the output. If the date plays no part in the grouping used for the count, what would a date in the result even mean?
I don't know about other DBMSs, but DB2/z, for one, does this just fine. The column is not required in the select portion, but of course the data still has to be extracted from the table in order to aggregate, so you're probably not saving any time by leaving it off. You should select only the columns that you need; aggregating the data is a separate task from that.
I'm pretty certain the SQL standard allows this (although that's only based on the knowledge that the mainframe DB2 product follows it pretty closely).