SQL Distinct / GroupBy - sql

Ok, I’m stuck on an SQL query and tried long enough that it’s time to ask for help :) I'm using Objection.js – but that's not super relevant as I really just can't figure out how to structure the SQL.
I have the following example data set:
Items
id  name
1   Test 1
2   Test 2
3   Test 3
Listings
id  item_id  price  created_at
1   1        100    1654640000
2   1        60     1654640001
3   1        80     1654640002
4   2        90     1654640003
5   2        90     1654640004
6   3        50     1654640005
What I’m trying to do:
Return the lowest priced listing for each item
If several listings for an item share the lowest price, I want to return the newest of them
Overall, I want to return the resulting items by price
I’m trying to write a query that returns the data:
id  item_id  name    price  created_at
6   3        Test 3  50     1654640005
2   1        Test 1  60     1654640001
5   2        Test 2  90     1654640004
Any help would be greatly appreciated! I'm also starting fresh, so I can add new columns to the data if that would help at all :)
An example of where my query is right now:
select *
from "listings"
inner join (
    select "item_id", MIN(price) as "min_price"
    from "listings"
    group by "item_id"
) as "grouped_listings"
    on "listings"."item_id" = "grouped_listings"."item_id"
    and "listings"."price" = "grouped_listings"."min_price"
where "listings"."sold_at" is null
    and "listings"."expires_at" > ?
order by CAST(price AS DECIMAL) ASC
limit ?;
This gets me listings – but if two listings have the same price, it returns multiple listings with the same item_id – not ideal.

Given the postgresql tag, this should work:
with listings_numbered as (
select *, row_number() over (
partition by item_id
order by price asc, created_at desc
) as rownum
from listings
)
select l.id, l.item_id, i.name, l.price, l.created_at
from listings_numbered l
join items i on l.item_id=i.id
where l.rownum=1
order by price asc;
This is a bit of an advanced query, using window functions and a common table expression, but we can break it down.
with listings_numbered as (...) select simply means to run the query inside of the ..., and then we can refer to the results of that query as listings_numbered inside of the select, as though it was a table.
We're selecting all of the columns in listings, plus one more:
row_number() over (partition by item_id order by price asc, created_at desc). partition by item_id means that we would like the row number to reset for each new item_id, and the order by specifies the ordering that the rows should get within each partition before we number them: first increasing by price, then decreasing by creation time to break ties.
The result of the CTE listings_numbered looks like:
id  item_id  price  created_at  rownum
2   1        60     1654640001  1
3   1        80     1654640002  2
1   1        100    1654640000  3
5   2        90     1654640004  1
4   2        90     1654640003  2
6   3        50     1654640005  1
If you look at only the rows where rownum (the last column) is 1, then you can see that it's exactly the set of listings that you're interested in.
The outer query then selects from this dataset, joins on items to get the name, filters to only the listings where rownum is 1, and sorts by price, to get the final result:
id  item_id  name    price  created_at
6   3        Test 3  50     1654640005
2   1        Test 1  60     1654640001
5   2        Test 2  90     1654640004
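To check the query against the sample data, here is a minimal sketch that loads both tables into an in-memory SQLite database from Python and runs the same statement. SQLite stands in for PostgreSQL purely for convenience (an assumption on my part; window functions need SQLite 3.25 or newer), and the function name lowest_listing_per_item is made up for illustration.

```python
import sqlite3

# Same query as in the answer; only the final ORDER BY is qualified with "l.".
QUERY = """
with listings_numbered as (
    select *, row_number() over (
        partition by item_id
        order by price asc, created_at desc
    ) as rownum
    from listings
)
select l.id, l.item_id, i.name, l.price, l.created_at
from listings_numbered l
join items i on l.item_id = i.id
where l.rownum = 1
order by l.price asc
"""


def lowest_listing_per_item():
    """Load the sample data from the question and run the query."""
    conn = sqlite3.connect(":memory:")
    conn.execute("create table items (id integer primary key, name text)")
    conn.execute(
        "create table listings "
        "(id integer primary key, item_id integer, price integer, created_at integer)"
    )
    conn.executemany(
        "insert into items values (?, ?)",
        [(1, "Test 1"), (2, "Test 2"), (3, "Test 3")],
    )
    conn.executemany(
        "insert into listings values (?, ?, ?, ?)",
        [
            (1, 1, 100, 1654640000),
            (2, 1, 60, 1654640001),
            (3, 1, 80, 1654640002),
            (4, 2, 90, 1654640003),
            (5, 2, 90, 1654640004),
            (6, 3, 50, 1654640005),
        ],
    )
    rows = conn.execute(QUERY).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for row in lowest_listing_per_item():
        print(row)
```

Each returned tuple is (id, item_id, name, price, created_at), so the rows can be compared directly with the expected result table in the question.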

Aggregate functions, such as the MIN function you used in your query, are a viable option, but if you want an efficient query for this problem, window functions can be your best friends. This class of functions computes values over "windows" (partitions) of your table, defined by the columns you specify.
For the solution to this problem I'm going to compute two values using window functions:
the minimum value of "listings.price", partitioning by "listings.item_id",
the maximum value of "created_at", partitioning by "listings.item_id" and "listings.price"
SELECT *,
MIN(price) OVER(PARTITION BY item_id) AS min_price,
MAX(created_at) OVER(PARTITION BY item_id, price) AS max_created_at
FROM listings
Once every listings record carries the corresponding minimum price and latest date, you need to select the records where
price equals the minimum price
created_at equals the most recent created_at
WITH cte AS (
SELECT *,
MIN(price) OVER(PARTITION BY item_id) AS min_price,
MAX(created_at) OVER(PARTITION BY item_id, price) AS max_created_at
FROM listings
)
SELECT id,
item_id,
price,
created_at
FROM cte
WHERE price = min_price
AND created_at = max_created_at
If you need to order by price, it's sufficient to add an ORDER BY price clause.
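One caveat with filtering on equality: if two listings for the same item share both the lowest price and the same created_at, both rows satisfy the condition, whereas row_number() always keeps exactly one. The sketch below shows the difference on hypothetical data (not the data from the question), using Python's sqlite3 module as a stand-in for PostgreSQL (assumes SQLite 3.25 or newer). The extra id desc tie-breaker in the second query is my own addition to make the result deterministic.

```python
import sqlite3


def compare_tie_handling():
    """Return (ids from the MIN/MAX filter, ids from the row_number filter)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "create table listings "
        "(id integer primary key, item_id integer, price integer, created_at integer)"
    )
    # Hypothetical data: listings 1 and 2 have the same item, price AND timestamp.
    conn.executemany(
        "insert into listings values (?, ?, ?, ?)",
        [(1, 1, 60, 1654640001), (2, 1, 60, 1654640001), (3, 1, 80, 1654640002)],
    )
    min_max = conn.execute("""
        with cte as (
            select *,
                   min(price) over (partition by item_id) as min_price,
                   max(created_at) over (partition by item_id, price) as max_created_at
            from listings
        )
        select id from cte
        where price = min_price and created_at = max_created_at
        order by id
    """).fetchall()
    numbered = conn.execute("""
        with cte as (
            select *, row_number() over (
                partition by item_id
                order by price asc, created_at desc, id desc
            ) as rownum
            from listings
        )
        select id from cte where rownum = 1
    """).fetchall()
    conn.close()
    return min_max, numbered


if __name__ == "__main__":
    print(compare_tie_handling())
```

If created_at is guaranteed to be unique per item, the two approaches return the same rows.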

Related

How to enforce uniqueness in postgresql per row for a specific column

I have the following table (stripped down for demonstration)
products
- id
- part_number
- group_id
I want to be able to query against products and only return a single row per group_id (whichever is noticed first in the query is fine). All rows with group_id = null return as well.
Example:
ID part_number group_id
2314 ABB19 1
4543 GFH54 1
3454 GHT56 2
3657 QWT56 2
7689 GIT56 2
3465 HG567 null
5675 FG345 null
I would want to query against this table and get the following results:
ID part_number group_id
2314 ABB19 1
3454 GHT56 2
3465 HG567 null
5675 FG345 null
I have tried using group by but wasn't able to get it working without selecting the group_id and grouping on it, which just returned a list of unique group_ids. Given the complexity of my real products table, it's important that I am able to keep using select * rather than naming each column I need to return.
row_number() and filtering might be more efficient than distinct on and union all, which incur two table scans.
select *
from (
select p.*,
row_number() over(partition by group_id order by id) rn
from products p
) p
where rn = 1 or group_id is null
I was able to solve this with a combination of DISTINCT ON and a UNION
SELECT DISTINCT ON (group_id) * from products
WHERE group_id IS NOT NULL
UNION
SELECT * FROM products
WHERE group_id IS NULL
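As a quick check of the row_number() variant, here is a minimal sketch that runs it on the example rows through Python's sqlite3 module. SQLite is only a stand-in for PostgreSQL here (assumes SQLite 3.25 or newer; DISTINCT ON is PostgreSQL-specific, so that variant is not reproduced). The explicit column list and the final order by id are my additions so that the output is stable and does not include the helper column rn.

```python
import sqlite3


def one_row_per_group():
    """Keep the lowest id per group_id, plus every row whose group_id is NULL."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "create table products "
        "(id integer primary key, part_number text, group_id integer)"
    )
    conn.executemany(
        "insert into products values (?, ?, ?)",
        [
            (2314, "ABB19", 1),
            (4543, "GFH54", 1),
            (3454, "GHT56", 2),
            (3657, "QWT56", 2),
            (7689, "GIT56", 2),
            (3465, "HG567", None),
            (5675, "FG345", None),
        ],
    )
    rows = conn.execute("""
        select id, part_number, group_id
        from (
            select p.*,
                   row_number() over (partition by group_id order by id) as rn
            from products p
        ) p
        where rn = 1 or group_id is null
        order by id
    """).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for row in one_row_per_group():
        print(row)
```

All NULL group_id rows land in a single partition, so only one of them gets rn = 1; the "or group_id is null" condition is what keeps the rest.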

SQL Server 2008 - ROWNUMBER OVER - filtering the result

I have the following SQL which works and returns products with duplicate names and the rownum column is a count of how many times that name appears.
Adding where rownum > 1 at the end gives me the duplicates only.
SELECT *
FROM
(SELECT
id, productname,
ROW_NUMBER() OVER (PARTITION BY productname
ORDER BY productname) Rownum
FROM products
GROUP BY id, productname) result
REQUIREMENT
I need to produce a list of products where if the rownum column has a value greater than one, I want to see all the rows pertaining to that product grouped by the name column.
If the rownum value for a product is 1 only, and no value greater than one (so no duplicate) I don't want to see that row.
So for example if "Blue umbrella" appears three times, I want to see the result for this product as:
ID Name Rownum
35 Blue umbrella 1
41 Blue umbrella 2
90 Blue umbrella 3
How would I go about achieving this please?
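Change ROW_NUMBER() OVER to COUNT(1) OVER, remove the GROUP BY, and select the rows where the count is greater than 1. The ORDER BY inside OVER has to go as well, because SQL Server 2008 does not accept ORDER BY in the OVER clause of an aggregate function. Note that Rownum then holds the number of rows per name rather than a running number.
SELECT *
FROM (SELECT id, productname,
COUNT(1) OVER (PARTITION BY productname) Rownum
FROM products
) result
WHERE Rownum > 1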
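If the numbered output from the question (1, 2, 3 for each duplicate) is wanted, one option is to compute both a row number and a per-name count in the same subquery and filter on the count. The sketch below is only an illustration of that idea: it runs on SQLite via Python's sqlite3 module rather than SQL Server (assumes SQLite 3.25 or newer), the "Red hat" row is invented to have a non-duplicate in the data, and ordering the row numbers by id is my assumption.

```python
import sqlite3


def duplicates_with_row_numbers():
    """Return every row of each duplicated product name, numbered per name."""
    conn = sqlite3.connect(":memory:")
    conn.execute("create table products (id integer primary key, productname text)")
    conn.executemany(
        "insert into products values (?, ?)",
        [
            (35, "Blue umbrella"),
            (41, "Blue umbrella"),
            (90, "Blue umbrella"),
            (12, "Red hat"),  # hypothetical non-duplicate row
        ],
    )
    rows = conn.execute("""
        select id, productname, rownum
        from (
            select id, productname,
                   row_number() over (partition by productname order by id) as rownum,
                   count(*) over (partition by productname) as cnt
            from products
        ) result
        where cnt > 1
        order by productname, rownum
    """).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for row in duplicates_with_row_numbers():
        print(row)
```

Both window functions used here are written without ORDER BY on the aggregate, so the same shape of query should carry over to SQL Server 2008.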

How to do a complex calculation as this sample

In a stored procedure (I'm using SQL Server 2008), I have a business case like this sample:
ID City Price Sold
1 A 10 3
1 B 10 5
1 A 10 1
1 B 10 3
1 C 10 5
1 C 10 2
2 A 10 1
2 B 10 6
2 A 10 3
2 B 10 4
2 C 10 3
2 C 10 4
What I want to do is:
For each ID, sort by City first.
After sorting, for each row of that ID, recalculate Sold from top to bottom so that the running total of Sold for the ID does not exceed Price (see the result below).
The result should look like this:
ID City Price Sold_Calculated
1 A 10 3
1 A 10 1
1 B 10 5
1 B 10 1 (the last one equal '1': Total of Sold = Price)
1 C 10 0 (begin from this row, Sold = 0)
1 C 10 0
2 A 10 1
2 A 10 3
2 B 10 6
2 B 10 0 (begin from this row, Sold = 0)
2 C 10 0
2 C 10 0
Right now I'm using a cursor for this task: get each ID, sort by City, calculate Sold, and save the result to a temp table. After the calculation finishes, I union all the temp tables. But it takes a long time.
What I know is that people advise NOT to use cursors.
So, for this task, can you give me an example (using select ... from ... where ... group) that does this? Or are there other ways to solve it quickly?
I understand this task is not easy, but I'm posting it here in the hope that someone can help me work through it.
I'd really appreciate your help.
Thanks.
To accomplish your task you'll need to calculate a running sum and use a CASE expression.
Previously I used a JOIN to compute the running sum, and LAG with the CASE expression.
However, using a recursive CTE to calculate the running total, as described here by Aaron Bertrand, and the CASE expression by Andriy M, we can construct the following, which should offer the best performance and doesn't need to "peek at the previous row":
WITH cte
AS (SELECT Row_number()
OVER ( partition BY id ORDER BY id, city, sold DESC) RN,
id,
city,
price,
sold
FROM table1),
rcte
AS (
--Anchor
SELECT rn,
id,
city,
price,
sold,
runningTotal = sold
FROM cte
WHERE rn = 1
--Recursion
UNION ALL
SELECT cte.rn,
cte.id,
cte.city,
cte.price,
cte.sold,
rcte.runningtotal + cte.sold
FROM cte
INNER JOIN rcte
ON cte.id = rcte.id
AND cte.rn = rcte.rn + 1)
SELECT id,
city,
price,
sold,
runningtotal,
rn,
CASE
WHEN runningtotal <= price THEN sold
WHEN runningtotal > price
AND runningtotal < price + sold THEN price + sold - runningtotal
ELSE 0
END Sold_Calculated
FROM rcte
ORDER BY id,
rn;
As @Gordon Linoff commented, the sort order is not clear from the question. For the purpose of this answer, I have assumed the sort order to be city, sold.
select id, city, price, sold, running_sum,
lag_running_sum,
case when running_sum <= price then Sold
when running_sum > price and price > coalesce(lag_running_sum,0) then price - coalesce(lag_running_sum,0)
else 0
end calculated_sold
from
(
select id, city, price, sold,
sum(sold) over (partition by id order by city, sold
rows between unbounded preceding and current row) running_sum,
sum(sold) over (partition by id order by city, sold
rows between unbounded preceding and 1 preceding) lag_running_sum
from n_test
) n_test_running
order by id, city, sold;
Here is the demo for Oracle.
Let me break down the query.
I have used SUM as an analytic function to calculate the running sum.
The first SUM partitions the rows by id and, within each partition, orders them by city and sold.
The rows between clause tells which rows to add up. Here I have specified the current row and all rows above it, which gives the running sum.
The second SUM does the same thing, except that the current row is excluded. This essentially gives the running sum as of the previous row.
Using this result as an inline view, the outer select uses a CASE expression to determine the value of the new column.
As long as the running sum is less than or equal to price, it returns sold.
On the row where the running sum crosses the price, the value is adjusted so that the total becomes equal to price.
For the rest of the rows below it, the value is set to 0.
Hope my explanation is clear.
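To see the two running sums at work, here is a small sketch that runs the same query on the sample rows from the question. It uses Python's sqlite3 module in place of Oracle (an assumption made only so the example is runnable; it needs SQLite 3.25 or newer), and it returns just the columns needed to follow the calculation. Because this answer sorts by city, sold ascending, the row order within each city differs from the expected output in the question, but the capped total per ID is the same.

```python
import sqlite3

SAMPLE = [
    (1, "A", 10, 3), (1, "B", 10, 5), (1, "A", 10, 1),
    (1, "B", 10, 3), (1, "C", 10, 5), (1, "C", 10, 2),
    (2, "A", 10, 1), (2, "B", 10, 6), (2, "A", 10, 3),
    (2, "B", 10, 4), (2, "C", 10, 3), (2, "C", 10, 4),
]


def capped_sold():
    """Return (id, city, sold, calculated_sold) rows ordered by id, city, sold."""
    conn = sqlite3.connect(":memory:")
    conn.execute("create table n_test (id integer, city text, price integer, sold integer)")
    conn.executemany("insert into n_test values (?, ?, ?, ?)", SAMPLE)
    rows = conn.execute("""
        select id, city, sold,
               case when running_sum <= price then sold
                    when running_sum > price and price > coalesce(lag_running_sum, 0)
                        then price - coalesce(lag_running_sum, 0)
                    else 0
               end calculated_sold
        from (
            select id, city, price, sold,
                   sum(sold) over (partition by id order by city, sold
                       rows between unbounded preceding and current row) running_sum,
                   sum(sold) over (partition by id order by city, sold
                       rows between unbounded preceding and 1 preceding) lag_running_sum
            from n_test
        ) n_test_running
        order by id, city, sold
    """).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for row in capped_sold():
        print(row)
```

For the first row of each partition the second frame is empty, so lag_running_sum is NULL there; that is why the query wraps it in coalesce.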
To me, it sounds like you could use window functions in a case like this. Is this applicable?
Although in my case your end result would possibly look like:
ID City Price Sold_Calculated
2 A 10 4
2 B 10 6
2 C 10 0
Which could have an aggregation like
SUM(Sold_Calculated) OVER (PARTITION BY ID, City, Price, Sold_Calculated)
depending on how far down you want to go.. You could even use a case statement if need be
Are you looking to do this entirely in SQL? A simple approach would be this:
SELECT C.ID,
C.City,
C.Price,
calculate_Sold_Function(C.ID, C.Price) AS Sold_Calculated
FROM CITY_TABLE C
ORDER BY C.ID, C.City
Where calculate_Sold_Function is a user-defined function (T-SQL, MySQL, etc.) taking the ID and Price as parameters. No idea how you plan on calculating price.

Summing and ordering at once

I have a table of orders. There I need to find out which 3 partner_id's have made the largest sum of amount_totals, and sort those 3 from biggest to smallest.
testdb=# SELECT amount_total, partner_id FROM sale_order;
amount_total | partner_id
--------------+------------
1244.00 | 9
3065.90 | 12
3600.00 | 3
2263.00 | 25
3000.00 | 10
3263.00 | 3
123.00 | 25
5400.00 | 12
(8 rows)
Just starting SQL, I find it confusing ...
Aggregated amounts
If you want to list aggregated amounts, it can be as simple as:
SELECT partner_id, sum(amount_total) AS amount_supertotal
FROM sale_order
GROUP BY 1
ORDER BY 2 DESC
LIMIT 3;
The 1 in GROUP BY 1 is a positional reference to the first item in the SELECT list. It is just a notational shortcut for GROUP BY partner_id in this case.
This ignores the special case where more than three partners would qualify and picks 3 arbitrarily (for lack of definition).
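Here is a minimal sketch of the aggregated query run on the eight sample rows, using Python's sqlite3 module as a stand-in for PostgreSQL (an assumption made for runnability). It spells out the column names instead of the positional GROUP BY 1 / ORDER BY 2 shortcut, and the alias total and the function name top_partners are mine.

```python
import sqlite3

ORDERS = [
    (1244.00, 9), (3065.90, 12), (3600.00, 3), (2263.00, 25),
    (3000.00, 10), (3263.00, 3), (123.00, 25), (5400.00, 12),
]


def top_partners(limit=3):
    """Return (partner_id, total) for the partners with the largest totals."""
    conn = sqlite3.connect(":memory:")
    conn.execute("create table sale_order (amount_total real, partner_id integer)")
    conn.executemany("insert into sale_order values (?, ?)", ORDERS)
    rows = conn.execute(
        """
        select partner_id, sum(amount_total) as total
        from sale_order
        group by partner_id
        order by total desc
        limit ?
        """,
        (limit,),
    ).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for partner_id, total in top_partners():
        print(partner_id, round(total, 2))
```

The amounts are stored as floating-point values in this sketch, so the totals are rounded only when printed; a real schema would more likely use a numeric type.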
Individual amounts
SELECT partner_id, amount_total
FROM sale_order
JOIN (
SELECT partner_id, rank() OVER (ORDER BY sum(amount_total) DESC) As rnk
FROM sale_order
GROUP BY 1
ORDER BY 2
LIMIT 3
) top3 USING (partner_id)
ORDER BY top3.rnk;
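This one, on the other hand, returns the individual orders of the top partners. The window function rank() gives tied partners the same rank. Note, however, that LIMIT 3 still cuts the subquery off after three partners, so to include all peers when more than 3 partners qualify for the top 3 you would filter on rnk <= 3 instead of using LIMIT.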
The technique here is to group by partner_id in the subquery top3 and have the window function rank() attach ranks after the aggregation (window functions execute after aggregate functions). ORDER BY is applied after window functions and LIMIT is applied last. All in one subquery.
Then I join the base table to this subquery, so that only the top dogs remain in the result and order by rnk.
Window functions require PostgreSQL 8.4 or later.
This is rather advanced stuff. You should start learning SQL with something simpler probably.
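How rank() behaves on ties is easiest to see with a tie for third place. The sketch below uses hypothetical totals (not the data from the question) and compares cutting the ranked list with LIMIT 3 against filtering on rnk <= 3. It runs on SQLite through Python's sqlite3 module as a stand-in for PostgreSQL (assumes SQLite 3.25 or newer), and it aggregates in an inner subquery before ranking, which is a restructuring of mine to keep each step explicit.

```python
import sqlite3

RANKED = """
    select partner_id, total, rank() over (order by total desc) as rnk
    from (
        select partner_id, sum(amount_total) as total
        from sale_order
        group by partner_id
    ) totals
"""


def compare_limit_and_rank():
    """Return (partner ids kept by LIMIT 3, partner ids kept by rnk <= 3)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("create table sale_order (amount_total integer, partner_id integer)")
    # Hypothetical data: partners 3 and 4 tie for third place with 300 each.
    conn.executemany(
        "insert into sale_order values (?, ?)",
        [(500, 1), (250, 2), (150, 2), (300, 3), (100, 4), (200, 4), (100, 5)],
    )
    limited = conn.execute(
        f"select partner_id from ({RANKED}) r order by rnk, partner_id limit 3"
    ).fetchall()
    by_rank = conn.execute(
        f"select partner_id from ({RANKED}) r where rnk <= 3 order by rnk, partner_id"
    ).fetchall()
    conn.close()
    return [p for (p,) in limited], [p for (p,) in by_rank]


if __name__ == "__main__":
    print(compare_limit_and_rank())
```

With these rows the ranks are 1, 2, 3, 3, 5, so the rank filter keeps four partners while LIMIT keeps three.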
select amount_total, partner_id
from (
select
sum(amount_total) amount_total,
partner_id
from sale_order
group by partner_id
) s
order by amount_total desc
limit 3

SQL query to get status as of a given date

I'm sure this has been answered before but couldn't find it.
I have a table of items which change status every few weeks. I want to look at an arbitrary day and figure out how many items were in each status.
For example:
tbl_ItemHistory
ItemID
StatusChangeDate
StatusID
Sample data:
1001, 1/1/2010, 1
1001, 4/5/2010, 2
1001, 6/15/2010, 4
1002, 4/1/2010, 1
1002, 6/1/2010, 3
...
So I need to figure out how many items were in each status for a given day. So on 5/1/2010, there was one item (1001) in status 2 and one item in status 1 (1002).
Since these items don't change status very often, maybe I could create a cached table every night that has a row for every item and every day of the year? I'm not sure if that's best or how to do that though
I'm using SQL Server 2008R2
For an arbitrary day, you can do something like this:
select ih.*
from (select ih.*,
row_number() over (partition by itemId order by StatusChangeDate desc) as seqnum
from tbl_ItemHistory ih
where StatusChangeDate <= @YOURDATEGOESHERE
) ih
where seqnum = 1
The idea is to enumerate all the history records for each item on or before the date, using row_number(). The ordering is reverse chronological, so the most recent record on or before the date has a value of 1.
The query then just chooses the records whose value is 1.
To aggregate the results to get each status for the date, use:
select statusId, count(*)
from (select ih.*,
row_number() over (partition by itemId order by StatusChangeDate desc) as seqnum
from tbl_ItemHistory ih
where StatusChangeDate <= @YOURDATEGOESHERE
) ih
where seqnum = 1
group by StatusId
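A minimal sketch of the aggregated query on the sample rows, run through Python's sqlite3 module instead of SQL Server (an assumption for runnability; needs SQLite 3.25 or newer). Dates are stored as ISO strings (YYYY-MM-DD) so that plain string comparison orders them correctly, the as-of date is passed as a bound parameter, and the function name status_counts_as_of is hypothetical.

```python
import sqlite3

HISTORY = [
    (1001, "2010-01-01", 1),
    (1001, "2010-04-05", 2),
    (1001, "2010-06-15", 4),
    (1002, "2010-04-01", 1),
    (1002, "2010-06-01", 3),
]


def status_counts_as_of(as_of):
    """Return (StatusID, item count) pairs for the given ISO date string."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "create table tbl_ItemHistory "
        "(ItemID integer, StatusChangeDate text, StatusID integer)"
    )
    conn.executemany("insert into tbl_ItemHistory values (?, ?, ?)", HISTORY)
    rows = conn.execute(
        """
        select StatusID, count(*)
        from (
            select ih.*,
                   row_number() over (
                       partition by ItemID order by StatusChangeDate desc
                   ) as seqnum
            from tbl_ItemHistory ih
            where StatusChangeDate <= ?
        ) ih
        where seqnum = 1
        group by StatusID
        order by StatusID
        """,
        (as_of,),
    ).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    print(status_counts_as_of("2010-05-01"))
```

For 2010-05-01 this gives one item in status 1 and one in status 2, matching the example in the question; items with no history on or before the date simply do not appear.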