Suppose I'm writing a query with aggregate functions, and I want the result filtered by conditions on both a table column and an aggregate. Is it possible to use the WHERE and HAVING clauses together, without a GROUP BY clause, to get the expected result?
I wrote the following query for this condition.
select *
from ORDER_DETAILS
where item_price > 1000
having count(item) >= 5 ;
First of all, HAVING works just like WHERE, but it can also be applied to aggregate function results.
You should keep track of the data rows and columns after each clause.
Suppose we define a row_id property that can be used to locate a single row of a table. The WHERE clause doesn't change the row_id.
Aggregate functions take multiple rows as input and produce a single result, which changes the row_id. In fact, having no GROUP BY clause means everything goes into one bucket, and the output has only one row.
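To make the row_id idea concrete, here is a minimal sketch using Python's built-in sqlite3 module. The sample rows are made up for illustration; only the ORDER_DETAILS table and its column names come from the question.

```python
import sqlite3

# Hypothetical sample data; only the table and column names come from the question.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ORDER_DETAILS (order_id INTEGER, item TEXT, item_price REAL)")
conn.executemany(
    "INSERT INTO ORDER_DETAILS VALUES (?, ?, ?)",
    [(1, "laptop", 1500), (1, "phone", 1200), (2, "tv", 2000), (2, "cable", 10)],
)

# WHERE only drops rows; every surviving row is still one of the original rows.
filtered = conn.execute(
    "SELECT order_id, item FROM ORDER_DETAILS WHERE item_price > 1000"
).fetchall()

# An aggregate with no GROUP BY treats all surviving rows as a single group.
aggregated = conn.execute(
    "SELECT COUNT(item) FROM ORDER_DETAILS WHERE item_price > 1000"
).fetchall()
```

Here `filtered` holds three original rows, while `aggregated` holds a single row containing the count of those rows.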
My best guess is that you want the original data rows whose attributes pass an aggregated-value check. E.g. find order details where the item price > 1000 (original filter) and the order contains at least 5 items (aggregated filter).
So GROUP BY + aggregate + HAVING gives you the aggregate-filtered dataset. You can join this dataset back to the original table, and the result then has the same row_id as the original ORDER_DETAILS:
select *
from ORDER_DETAILS
where item_price > 1000
and order_id in (
select order_id
from ORDER_DETAILS
group by order_id
having count(item) >= 5
);
Note:
order_id is just an example of the column used for the aggregated filter.
I use an IN subquery for convenience; you can change it into a join.
If you are working with big-data SQL, like Hive/Spark, you can also use window functions to get the aggregate result on each row of the original table.
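The window-function idea can be sketched as follows, again with Python's sqlite3 module and made-up rows. This assumes the underlying SQLite library is version 3.25 or newer, since older versions do not support window functions. The IN-subquery version from the answer is run alongside for comparison.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ORDER_DETAILS (order_id INTEGER, item TEXT, item_price REAL)")
# Made-up data: order 1 has five items (prices 1000..1400), order 2 has two.
sample = [(1, f"item{i}", 900 + 100 * i) for i in range(1, 6)]
sample += [(2, "tv", 2000), (2, "cable", 10)]
conn.executemany("INSERT INTO ORDER_DETAILS VALUES (?, ?, ?)", sample)

# Version from the answer: filter rows, then check the order against a grouped subquery.
in_subquery = conn.execute("""
    SELECT order_id, item FROM ORDER_DETAILS
    WHERE item_price > 1000
      AND order_id IN (SELECT order_id FROM ORDER_DETAILS
                       GROUP BY order_id HAVING COUNT(item) >= 5)
    ORDER BY item
""").fetchall()

# Window-function version: attach the per-order count to every row, then filter.
windowed = conn.execute("""
    SELECT order_id, item
    FROM (SELECT d.*, COUNT(item) OVER (PARTITION BY order_id) AS items_in_order
          FROM ORDER_DETAILS d) t
    WHERE item_price > 1000 AND items_in_order >= 5
    ORDER BY item
""").fetchall()
```

Both queries return the four rows of order 1 priced above 1000. The window version needs the inner SELECT because a window function result cannot be referenced in the WHERE clause of the same query level.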
Related
Can you filter a SQL table based on an aggregated value, but still show column values that weren't in the aggregate statement?
My table has only 3 columns: "Composer_Tune", "_Year", and "_Rank".
I want to use SQL to find which "Composer_Tune" values are repeated in each annual list, as well as which ranks the duplicated items had.
Since I am grouping by "Composer_Tune" & "Year", I can't list "_Rank" with my current code.
The image shows the results of my original "find the duplicates" query vs what I want:
Current vs Desired Results
I tried applying the concepts in this Aggregate Subquery StackOverflow post but am still getting "_Rank is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause" from this code:
WITH DUPE_DB AS (SELECT * FROM DB.dbo.[NAME] GROUP BY Composer_Tune, _Year HAVING COUNT(*)>1)
SELECT Composer_Tune, _Year, _Rank
FROM DUPE_DB
You need to explicitly declare the columns used in the Group By expression in the select columns.
You can use the following documentation if you are using transact sql for the proper use of Group By.
Simply join the aggregated resultset to original unit level table:
WITH DUPE_DB AS (
SELECT Composer_Tune, _Year
FROM DB.dbo.[NAME]
GROUP BY Composer_Tune, _Year
HAVING COUNT(*) > 1
)
SELECT n.Composer_Tune, n._Year, n._Rank
FROM DB.dbo.[NAME] n
INNER JOIN DUPE_DB
ON n.Composer_Tune = DUPE_DB.Composer_Tune
AND n._Year = DUPE_DB._Year
ORDER BY n.Composer_Tune, n._Year
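As a quick check of the join-back approach, here is a small sketch run against SQLite through Python's sqlite3 module. The table name tunes and the sample rows are hypothetical stand-ins for DB.dbo.[NAME]; the column names are the ones from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# "tunes" is a hypothetical stand-in for DB.dbo.[NAME].
conn.execute("CREATE TABLE tunes (Composer_Tune TEXT, _Year INTEGER, _Rank INTEGER)")
conn.executemany("INSERT INTO tunes VALUES (?, ?, ?)", [
    ("Bach - Air", 2020, 3),
    ("Bach - Air", 2020, 17),   # duplicate within the 2020 list
    ("Holst - Mars", 2020, 5),
    ("Holst - Mars", 2021, 4),  # same tune, different year: not a duplicate
])

duplicates = conn.execute("""
    WITH DUPE_DB AS (
        SELECT Composer_Tune, _Year
        FROM tunes
        GROUP BY Composer_Tune, _Year
        HAVING COUNT(*) > 1
    )
    SELECT n.Composer_Tune, n._Year, n._Rank
    FROM tunes n
    INNER JOIN DUPE_DB d
        ON n.Composer_Tune = d.Composer_Tune
       AND n._Year = d._Year
    ORDER BY n.Composer_Tune, n._Year, n._Rank
""").fetchall()
```

The CTE selects only the grouped columns, so it is valid, and _Rank comes from the unaggregated table on the other side of the join.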
I have a view written in SQLite called V_Order_Calculator3, and there are 1.5 million records in the orders table.
Consider the code below:
select * from (
SELECT
order_id
FROM V_Order_Calculator3
WHERE /* CHOICE1 order_id = '00002092-03b4-4661-a9f4-afa73984860a'*/
GROUP BY
(CASE
WHEN order_type IN ('webshop_sell','sell','buy') THEN order_id
WHEN order_type IN ('sell_return','buy_return') THEN order_linked_id
ELSE 0
END)
) a /* CHOICE2 where a.order_id = '00002092-03b4-4661-a9f4-afa73984860a'*/
If I uncomment CHOICE1, the query takes 15 milliseconds to run; if I uncomment CHOICE2, it takes 18000 milliseconds.
It seems the SQLite query planner does not apply the CHOICE2 WHERE clause before evaluating the view.
I got confused and tried many approaches, but with no luck.
You think that the time should be the same because you assume that the queries have the same meaning and should give the same results, so sqlite should optimize the second query by pushing the WHERE down in the inner query, but your assumption is wrong.
The first one (inside WHERE) will filter the table for a single order_id, then will create groups based on the values of order_type, order_id and order_linked_id that it finds on the filtered rows. For every group it will return the order_id of an unspecified row of the group. Since you filtered for a specific order_id, it will always be the same value for every group.
The second one (outside WHERE) will scan all the table, creating groups based on the values of order_type, order_id and order_linked_id that it finds on all rows. For every group it will return the order_id of an unspecified row of the group, which at this point could be any value. The outside WHERE will then filter the result based on THESE values.
Here is example data for which your two queries give different results: https://www.db-fiddle.com/f/xqiFACuv8PzU3VBnLbEoWF/2
It doesn't matter that in your application this set of data would not be possible or meaningful. Since the queries are not semantically equivalent, sqlite cannot transform the second one into the first.
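A minimal sketch of why the two filters are not interchangeable, using Python's sqlite3 module. The orders table and its two rows are made up for illustration (the real view is not shown in the question); only the CASE expression is taken from the query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table standing in for V_Order_Calculator3.
conn.execute("CREATE TABLE orders (order_id TEXT, order_type TEXT, order_linked_id TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", [
    ("A", "sell", None),
    ("B", "sell_return", "A"),  # a return linked to order A
])

GROUP_KEY = """CASE
    WHEN order_type IN ('webshop_sell','sell','buy') THEN order_id
    WHEN order_type IN ('sell_return','buy_return') THEN order_linked_id
    ELSE 0 END"""

# Without a filter, rows A and B fall into the same group (key 'A').
groups = {
    key: sorted(ids.split(","))
    for key, ids in conn.execute(
        f"SELECT {GROUP_KEY}, group_concat(order_id) FROM orders GROUP BY {GROUP_KEY}"
    )
}

# Inner WHERE: only row B is grouped, so the group can only report 'B'.
inner = conn.execute(
    f"SELECT order_id FROM orders WHERE order_id = 'B' GROUP BY {GROUP_KEY}"
).fetchall()

# Outer WHERE: the group holds A and B, and SQLite may report either order_id,
# so this returns one row or none depending on which row it picks.
outer = conn.execute(
    f"SELECT * FROM (SELECT order_id FROM orders GROUP BY {GROUP_KEY}) a "
    "WHERE a.order_id = 'B'"
).fetchall()
```

The inner filter always finds order B, while the outer filter finds it only if SQLite happens to pick B as the representative row of the mixed group.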
Using 'HAVING' without 'GROUP BY' is not allowed:
SELECT *
FROM products
HAVING unitprice > avg(unitprice)
Column 'products.UnitPrice' is invalid in the HAVING clause because it is not contained in either an aggregate function or the GROUP BY clause.
But when placing the same code under 'EXISTS' - no problems:
SELECT *
FROM products p
WHERE EXISTS (SELECT 1
FROM products
HAVING p.unitprice > avg(unitprice))
Can you please explain why?
Well, the error is clear: in the first query, UnitPrice is neither part of an aggregate nor listed in a GROUP BY.
In your second query, however, you are comparing p.unitprice from the outer table "products p", which doesn't need to be part of an aggregate or GROUP BY. Your second query is equivalent to:
select * from products p
where p.unitprice > (select avg(unitprice) FROM products)
This form may make it clearer that SQL calculates avg(unitprice) first and then compares it with the unitprice column from products.
HAVING filters after aggregation according to the SQL standard and in most databases.
Without a GROUP BY, there is still aggregation.
But in your case, you simply want a subquery and WHERE:
SELECT p.*
FROM products p
WHERE p.unitprice > (SELECT AVG(p2.unitprice) FROM products p2);
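A small runnable sketch of this subquery form, using Python's sqlite3 module with three made-up products (average price 30):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (name TEXT, unitprice REAL)")
# Made-up rows; the average unit price is (10 + 20 + 60) / 3 = 30.
conn.executemany("INSERT INTO products VALUES (?, ?)",
                 [("a", 10), ("b", 20), ("c", 60)])

# The scalar subquery yields the overall average; each row is then compared with it.
above_average = [row[0] for row in conn.execute("""
    SELECT p.name
    FROM products p
    WHERE p.unitprice > (SELECT AVG(p2.unitprice) FROM products p2)
    ORDER BY p.name
""")]
```

Only product "c" is priced above the average, so it is the only row returned.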
The problem comes from the columns you select :
SELECT *
and
SELECT 1
Unlike ordinary functions, which are evaluated for each row, aggregate functions are computed once the whole dataset has been processed. This means that in theory (at least without a GROUP BY clause) you can't select both aggregates and regular columns in the same select list (even if some DBMSs still tolerate this).
It's easier to see when considering SUM(). You're not supposed to have access to the total of a column before all rows have been read, which prevents you from writing something like SELECT price, SUM(price), for instance.
GROUP BY, in turn, lets you group your rows according to given criteria (in practice, a set of columns), so that these aggregate functions are computed at the end of each group instead of over the whole dataset. Since all the columns specified in GROUP BY have the same value within a given group, you're allowed to include them in your SELECT list.
This leads us to the actual cause of the failure: in the first query, you select all columns. In the second one, you select none: only the constant 1, which is not part of the table itself.
Select * from FMN_XX.order odr where
exists(
select (1) from FMN_XX.order_expired exp
where odr.order_id = exp.order_id
);
Above is the example query for exists. I have tried looking around and reading about it but I just can't get my head wrapped around it.
When I query individually the query inside the EXISTS bracket, it returns 1 as expected and no order_id from order_expired since I didn't query for column there.
But when I run the whole query, it returns the correct number of rows! My question is, how does it know the order_ID from order_expired table when I don't even query for order_id from the order_expired table? How does it compare to get the right rows?
Extra note: Currently, the order table has 19779 rows and the order_expired table has 8506 rows. When I add a count at the outer query layer, the final result is 8506 rows, meaning the EXISTS clause has somehow filtered the rows. If EXISTS just returns true when at least one order_id is hit, shouldn't the whole query return all 19779 rows?
how does it know the order_ID from order_expired table when I don't even query for order_id from the order_expired table? How does it compare to get the right rows?
The condition in the WHERE clause of the EXISTS subselect provides this information:
the odr.order_id is the column from main SELECT, whereas
the exp.order_id is the column from exists SUBSELECT
where odr.order_id = exp.order_id
If the condition above evaluates to TRUE for at least one row of order_expired, the record from the main SELECT will appear in the result set.
https://en.wikipedia.org/wiki/Correlated_subquery
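The correlation can be seen in a small sketch with Python's sqlite3 module. The tables orders and order_expired and the alias ex are hypothetical stand-ins for the FMN_XX tables, with made-up rows. The helper runs the same EXISTS query with two different select lists to show that what the subquery selects does not matter.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical stand-ins for FMN_XX.order and FMN_XX.order_expired.
conn.execute("CREATE TABLE orders (order_id INTEGER)")
conn.execute("CREATE TABLE order_expired (order_id INTEGER)")
conn.executemany("INSERT INTO orders VALUES (?)", [(i,) for i in range(1, 6)])
conn.executemany("INSERT INTO order_expired VALUES (?)", [(2,), (4,)])

def orders_with_expiry(select_list):
    # Conceptually the subquery is evaluated once per outer row,
    # with odr.order_id bound to that row's value.
    sql = f"""
        SELECT odr.order_id FROM orders odr
        WHERE EXISTS (SELECT {select_list} FROM order_expired ex
                      WHERE odr.order_id = ex.order_id)
        ORDER BY odr.order_id
    """
    return [row[0] for row in conn.execute(sql)]

with_one = orders_with_expiry("1")
with_null = orders_with_expiry("NULL")
```

Both calls return orders 2 and 4: EXISTS only asks whether the subquery produces any row for the current outer row, not what that row contains.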
EXISTS is similar to a join: you restrict your output based on values in another table (or even the same table with a different condition).
The practical difference is that EXISTS does not care about duplicate values; it only checks whether any rows exist that match your condition.
In other words, if order_id is unique in your order_expired table, then you should get the same result from your query as from this query:
Select odr.* from FMN_XX.order odr
join FMN_XX.order_expired exp on odr.order_id = exp.order_id;
However, if it is not unique, the join would still restrict your results, but it would also repeat each order once for every matching row in order_expired.
One more difference: with EXISTS, the outer query can't use any columns from the table inside the EXISTS subquery, whereas with a join you can use any columns from the joined tables.
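The duplicate behaviour is easy to check with a sketch in Python's sqlite3 module. The tables and rows are hypothetical; order 2 deliberately appears twice in order_expired.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical stand-ins for the two tables; order 2 is listed twice as expired.
conn.execute("CREATE TABLE orders (order_id INTEGER)")
conn.execute("CREATE TABLE order_expired (order_id INTEGER)")
conn.executemany("INSERT INTO orders VALUES (?)", [(1,), (2,), (3,)])
conn.executemany("INSERT INTO order_expired VALUES (?)", [(2,), (2,), (3,)])

via_join = [row[0] for row in conn.execute("""
    SELECT odr.order_id
    FROM orders odr
    JOIN order_expired ex ON odr.order_id = ex.order_id
    ORDER BY odr.order_id
""")]

via_exists = [row[0] for row in conn.execute("""
    SELECT odr.order_id
    FROM orders odr
    WHERE EXISTS (SELECT 1 FROM order_expired ex
                  WHERE odr.order_id = ex.order_id)
    ORDER BY odr.order_id
""")]
```

The join returns order 2 twice, once per matching expired row, while EXISTS returns each matching order exactly once.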
You said:
When I query individually the query inside the EXISTS bracket, it returns 1 as expected and no order_id from order_expired since I didn't query for column there.
However, I guess that you haven't really run the EXISTS subquery exactly as written, because it would have been:
select (1) from FMN_XX.order_expired exp
where odr.order_id = exp.order_id
and it would return an error because it doesn't know what odr is.
The clause where odr.order_id = exp.order_id is exactly what gives the correlation between the main query and the EXISTS subquery.
So, the query would be roughly translated in natural language as:
select all the orders that exist in the expired orders table, looking each one up by its order_id field
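The point about the subquery failing on its own can be reproduced with Python's sqlite3 module. The table below is a hypothetical stand-in, and other databases will word the error differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE order_expired (order_id INTEGER)")  # hypothetical stand-in

try:
    # Run outside the main query, the alias odr refers to nothing.
    conn.execute("SELECT 1 FROM order_expired ex WHERE odr.order_id = ex.order_id")
    standalone_error = None
except sqlite3.OperationalError as exc:
    standalone_error = str(exc)  # SQLite cannot resolve odr.order_id
```

The subquery only becomes valid once it sits inside the outer query that defines odr.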
I need a bit of help straightening something out. I know it's a very easy question, but it's something that is slightly confusing me in SQL.
This SQL query throws a 'not a GROUP BY expression' error in Oracle. I understand why, as I know that once I group by an attribute of a tuple, I can no longer access any other attribute.
SELECT *
FROM order_details
GROUP BY order_no
However this one does work
SELECT SUM(order_price)
FROM order_details
GROUP BY order_no
Just to cement my understanding of this: assuming that there are multiple tuples in order_details for each order that is made, once I group the tuples according to order_no, I can still access the order_price attribute for each individual tuple in the group, but only using an aggregate function?
In other words, aggregate functions when used in the SELECT clause are able to drill down into the group to see the 'hidden' attributes, where simply using 'SELECT order_no' will throw an error?
In standard SQL (but not MySQL), when you use GROUP BY, you must list all the result columns that are not aggregates in the GROUP BY clause. So, if order_details has 6 columns, then you must list all 6 columns (by name - you can't use * in the GROUP BY or ORDER BY clauses) in the GROUP BY clause.
You can also do:
SELECT order_no, SUM(order_price)
FROM order_details
GROUP BY order_no;
That will work because all the non-aggregate columns are listed in the GROUP BY clause.
You could do something like:
SELECT order_no, order_price, MAX(order_item)
FROM order_details
GROUP BY order_no, order_price;
This query isn't really meaningful (or most probably isn't meaningful), but it will 'work'. It will list each separate order number and order price combination, and will give the maximum order item (number) associated with that price. If all the items in an order have distinct prices, you'll end up with groups of one row each. OTOH, if there are several items in the order at the same price (say £0.99 each), then it will group those together and return the maximum order item number at that price. (I'm assuming the table has a primary key on (order_no, order_item) where the first item in the order has order_item = 1, the second item is 2, etc.)
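Here is a small sketch of that last query in Python's sqlite3 module. The rows are made up to match the scenario described: order 1 has two items at 0.99 and one at 5.00, and order 2 has a single item.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE order_details (order_no INTEGER, order_item INTEGER, order_price REAL, "
    "PRIMARY KEY (order_no, order_item))"
)
# Made-up rows: (order_no, order_item, order_price)
conn.executemany("INSERT INTO order_details VALUES (?, ?, ?)", [
    (1, 1, 0.99), (1, 2, 0.99), (1, 3, 5.00),
    (2, 1, 2.50),
])

grouped = conn.execute("""
    SELECT order_no, order_price, MAX(order_item)
    FROM order_details
    GROUP BY order_no, order_price
    ORDER BY order_no, order_price
""").fetchall()
```

The two 0.99 items of order 1 collapse into one group whose maximum item number is 2, while the other rows each form a group of their own.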
The order in which SQL is written is not the same order it is executed.
Normally, you would write SQL like this:
SELECT
FROM
JOIN
WHERE
GROUP BY
HAVING
ORDER BY
Under the hood, SQL is executed like this:
FROM
JOIN
WHERE
GROUP BY
HAVING
SELECT
ORDER BY
The reason you need to put all the non-aggregate columns from SELECT into GROUP BY is the top-down behaviour of programming: you cannot refer to something that has not been declared yet.
Read more: https://sqlbolt.com/lesson/select_queries_order_of_execution
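One consequence of this ordering can be checked directly: WHERE runs before grouping, so it cannot see an aggregate, while HAVING runs after grouping and can. The sketch below uses Python's sqlite3 module with made-up rows. The exact error text is SQLite-specific, so it is only captured, not relied on.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE order_details (order_no INTEGER, order_price REAL)")
# Made-up rows: order 1 totals 13, order 2 totals 4.
conn.executemany("INSERT INTO order_details VALUES (?, ?)",
                 [(1, 6.0), (1, 7.0), (2, 4.0)])

# An aggregate in WHERE is rejected: the groups do not exist yet at that stage.
try:
    conn.execute(
        "SELECT order_no FROM order_details WHERE SUM(order_price) > 10 GROUP BY order_no"
    )
    where_error = None
except sqlite3.Error as exc:
    where_error = str(exc)

# The same condition in HAVING is evaluated after GROUP BY, so it works.
big_orders = conn.execute("""
    SELECT order_no, SUM(order_price)
    FROM order_details
    GROUP BY order_no
    HAVING SUM(order_price) > 10
""").fetchall()
```

The first query fails at preparation time, and the second returns only order 1 with its total.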
SELECT *
FROM order_details
GROUP BY order_no
In the above query you are selecting all the columns, which is why it throws the "not a GROUP BY expression" error.
To avoid that, every column in the SELECT statement must also be listed in the GROUP BY clause:
SELECT *
FROM order_details
GROUP BY order_no,order_details,etc
Here etc stands for all the remaining columns of the order_details table.
To use the GROUP BY clause, you have to list every column from the SELECT statement in the GROUP BY clause, except the columns that appear inside an aggregate function.
Instead of GROUP BY, you can use a PARTITION BY clause, which lets you partition on a single column.
You can also write it as PARTITION BY 1.
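A minimal sketch of the PARTITION BY alternative, using Python's sqlite3 module with made-up rows and assuming SQLite 3.25 or newer for window-function support. The order_total alias is a name introduced here for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE order_details (order_no INTEGER, order_item INTEGER, order_price REAL)"
)
# Made-up rows: (order_no, order_item, order_price)
conn.executemany("INSERT INTO order_details VALUES (?, ?, ?)",
                 [(1, 1, 6.0), (1, 2, 7.0), (2, 1, 4.0)])

# Every original row is kept; the per-order sum is attached as an extra column.
with_totals = conn.execute("""
    SELECT d.*, SUM(order_price) OVER (PARTITION BY order_no) AS order_total
    FROM order_details d
    ORDER BY order_no, order_item
""").fetchall()
```

Unlike GROUP BY, this keeps SELECT * legal because no rows are collapsed: each row simply carries the total of its own order.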
Use a common table expression (CTE) to avoid this issue.
Multiple CTEs also come in handy. Here is a case where I have used them, which may be helpful:
with ranked_cte1 as (
select r.mov_id, DENSE_RANK() over (order by r.rev_stars desc) as rankked from ratings r
),
ranked_cte2 as (
select * from movie where mov_id = (select mov_id from ranked_cte1 where rankked = 7)
)
select * from ranked_cte2;
select * from movie where mov_id=902