MAX() OVER PARTITION BY not working as intended

I'm having some issues when I try to obtain the MAX value of a field within a set of records, and I hope some of you can help me find what I am doing wrong.
I'm trying to get the ID of the item of the most expensive line within an order.
Given this query:
SELECT
orderHeader.orderKey, orderLines.lineKey, orderLines.itemKey, orderLines.OrderedQty,
orderLines.price, (orderLines.price*orderLines.OrderedQty) as LinePrice,
ROW_NUMBER() OVER(PARTITION BY orderHeader.orderKey ORDER BY orderLines.lineKey asc) AS [ItemLineNum],
ROW_NUMBER() OVER(PARTITION BY orderHeader.orderKey ORDER BY (orderLines.price*orderLines.OrderedQty) DESC) AS [LineMaxPriceNum],
max(orderLines.itemKey) OVER (PARTITION BY orderHeader.orderKey ORDER BY (orderLines.price*orderLines.OrderedQty) DESC) as [MaxPriceItem]
FROM
orderHeader inner join orderLines on orderHeader.orderKey=orderLines.orderKey
I'm getting these results:
| orderKey | lineKey | itemKey | OrderedQty | Price | LinePrice | ItemLineNum | LineMaxPriceNum | MaxPriceItem |
|----------|---------|---------|------------|-------|-----------|-------------|-----------------|--------------|
| 176141 | 367038 | 15346 | 3 | 1000 | 3000 | 2 | 1 | 15346 |
| 176141 | 367037 | 15159 | 2 | 840 | 1680 | 1 | 2 | 15346 |
| 176141 | 367039 | 15374 | 5 | 100 | 500 | 3 | 3 | 15374 |
As you can see, for the same "orderKey" I have three lines (lineKey), each of them with a different item (itemKey), a different quantity, a different price and a different total cost (LinePrice).
In the column MaxPriceItem I want the key of the item with the highest "LinePrice", but the result is wrong. All three lines should show 15346 as the most expensive item, but the last one does not, and I can't see why. Meanwhile, the ROW_NUMBER ordered by the same expression (LineMaxPriceNum) is giving me the right order.
If I change the expression of the ORDER BY within the MAX, like this (ordering by "OrderedQty"):
SELECT
orderHeader.orderKey, orderLines.lineKey, orderLines.itemKey, orderLines.OrderedQty,
orderLines.price, (orderLines.price*orderLines.OrderedQty) as LinePrice,
ROW_NUMBER() OVER(PARTITION BY orderHeader.orderKey ORDER BY orderLines.lineKey asc) AS [ItemLineNum],
ROW_NUMBER() OVER(PARTITION BY orderHeader.orderKey ORDER BY (orderLines.price*orderLines.OrderedQty) DESC) AS [LineMaxPriceNum],
max(orderLines.itemKey) OVER (PARTITION BY orderHeader.orderKey ORDER BY orderLines.OrderedQty DESC) as [MaxPriceItem]
FROM
orderHeader inner join orderLines on orderHeader.orderKey=orderLines.orderKey
Then it works:
| orderKey | lineKey | itemKey | OrderedQty | Price | LinePrice | ItemLineNum | LineMaxPriceNum | MaxPriceItem |
|----------|---------|---------|------------|-------|-----------|-------------|-----------------|--------------|
| 176141 | 367038 | 15346 | 3 | 1000 | 3000 | 2 | 1 | 15374 |
| 176141 | 367037 | 15159 | 2 | 840 | 1680 | 1 | 2 | 15374 |
| 176141 | 367039 | 15374 | 5 | 100 | 500 | 3 | 3 | 15374 |
The item with the highest "OrderedQty" is 15374 so the results are correct.
If I change the expression of the ORDER BY within the MAX again, like this (ordering by "Price"):
SELECT
orderHeader.orderKey, orderLines.lineKey, orderLines.itemKey, orderLines.OrderedQty,
orderLines.price, (orderLines.price*orderLines.OrderedQty) as LinePrice,
ROW_NUMBER() OVER(PARTITION BY orderHeader.orderKey ORDER BY orderLines.lineKey asc) AS [ItemLineNum],
ROW_NUMBER() OVER(PARTITION BY orderHeader.orderKey ORDER BY (orderLines.price*orderLines.OrderedQty) DESC) AS [LineMaxPriceNum],
max(orderLines.itemKey) OVER (PARTITION BY orderHeader.orderKey ORDER BY orderLines.price DESC) as [MaxPriceItem]
FROM
orderHeader inner join orderLines on orderHeader.orderKey=orderLines.orderKey
Then the same thing happens as in the first example; the results are wrong:
| orderKey | lineKey | itemKey | OrderedQty | Price | LinePrice | ItemLineNum | LineMaxPriceNum | MaxPriceItem |
|----------|---------|---------|------------|-------|-----------|-------------|-----------------|--------------|
| 176141 | 367038 | 15346 | 3 | 1000 | 3000 | 2 | 1 | 15346 |
| 176141 | 367037 | 15159 | 2 | 840 | 1680 | 1 | 2 | 15346 |
| 176141 | 367039 | 15374 | 5 | 100 | 500 | 3 | 3 | 15374 |
The item with the highest price is 15346, but the MAX for the last record does not show this.
What am I missing here? Why am I getting these different results?
Sorry if the formatting is not properly done; it's my first question here and I've tried my best.
Thanks in advance for any help you can give me.

I'm trying to get the ID of the item of the most expensive line, within an order.
You misunderstand the purpose of the order by clause in the window function: it is meant to define the window frame, not to compare the values. max() gives you the maximum value of the expression given as argument within the window frame.
On the other hand, you want the itemKey of the most expensive order line. I think that first_value() would do what you want:
first_value(orderLines.itemKey) over(
partition by orderHeader.orderKey
order by orderLines.price * orderLines.OrderedQty desc
) as [MaxPriceItem]
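To try this out without a SQL Server instance, here is a minimal sketch that runs the same first_value() expression against the question's three order lines. It assumes Python's built-in sqlite3 module with SQLite 3.25 or later (needed for window functions) as a stand-in for SQL Server, and a simplified single orderLines table without the join to orderHeader.

```python
import sqlite3

# In-memory SQLite database standing in for SQL Server (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orderLines "
    "(orderKey INT, lineKey INT, itemKey INT, OrderedQty INT, price INT)"
)
conn.executemany(
    "INSERT INTO orderLines VALUES (?, ?, ?, ?, ?)",
    [
        (176141, 367037, 15159, 2, 840),
        (176141, 367038, 15346, 3, 1000),
        (176141, 367039, 15374, 5, 100),
    ],
)

# first_value() picks the itemKey of the first row in the frame, i.e. the
# line with the highest price * OrderedQty, for every row of the order.
rows = conn.execute(
    """
    SELECT lineKey,
           FIRST_VALUE(itemKey) OVER (
               PARTITION BY orderKey
               ORDER BY price * OrderedQty DESC
           ) AS MaxPriceItem
    FROM orderLines
    ORDER BY lineKey
    """
).fetchall()

for row in rows:
    print(row)
```

Every line of the order reports 15346, the item on the line with the highest LinePrice.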

The accepted answer provides a reasonable alternate solution to the original problem, but doesn't really explain why the max() function appears to work inconsistently. (And spoiler alert, you actually can use max() as originally intended with a small tweak.)
You have to understand that aggregation functions actually operate on a window frame within a partition. By default, the frame is the entire partition. And so aggregation operations like max() and sum() do operate over the entire partition, exactly like you assumed. This default specification is defined as RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING. This just means that whatever record we're on, max() looks back all the way to the first row in the partition, and all the way forward to the last row in the partition, in order to calculate the value.
But there's an insidious gotcha: adding an ORDER BY clause to the window changes the default frame specification to RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW. This means that whatever record we're on, max() looks back all the way to the first row in the partition, and then only up to the current row, in order to calculate the value. You can see this clearly in your last example (simplified a bit):
SELECT orderKey, itemKey, price,
ROW_NUMBER() OVER(PARTITION BY orderKey ORDER BY price DESC) AS [PartitionRowNum],
MAX(itemKey) OVER (PARTITION BY orderKey ORDER BY price DESC) as [MaxPriceItem]
FROM orders
Result/explanation:
| orderKey | itemKey | Price | PartitionRowNum | MaxPriceItem | Commentary |
|----------|---------|-------|-----------------|--------------|------------------------|
| 176141 | 15346 | 1000 | 1 | 15346 | Taking max of rows 1-1 |
| 176141 | 15159 | 840 | 2 | 15346 | Taking max of rows 1-2 |
| 176141 | 15374 | 100 | 3 | 15374 | Taking max of rows 1-3 |
SOLUTION
We can explicitly indicate the window frame specification by adding RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING to the partition as follows:
SELECT orderKey, itemKey, price,
ROW_NUMBER() OVER(PARTITION BY orderKey ORDER BY price DESC) AS [PartitionRowNum],
MAX(itemKey) OVER (PARTITION BY orderKey ORDER BY price DESC RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) as [MaxPriceItem]
FROM orders
Result/explanation:
| orderKey | itemKey | Price | PartitionRowNum | MaxPriceItem | Commentary |
|----------|---------|-------|-----------------|--------------|------------------------|
| 176141 | 15346 | 1000 | 1 | 15374 | Taking max of rows 1-3 |
| 176141 | 15159 | 840 | 2 | 15374 | Taking max of rows 1-3 |
| 176141 | 15374 | 100 | 3 | 15374 | Taking max of rows 1-3 |
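The two frame specifications can be compared side by side with a small sketch. It assumes Python's built-in sqlite3 module (SQLite 3.25 or later) as a stand-in for SQL Server, and the simplified orders table from the examples above. One thing worth noticing: with the full frame, MAX(itemKey) returns the largest itemKey in the order (15374), which is not the itemKey of the highest-priced line (15346). A FIRST_VALUE column is included for comparison.

```python
import sqlite3

# In-memory SQLite database standing in for SQL Server (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (orderKey INT, itemKey INT, price INT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(176141, 15346, 1000), (176141, 15159, 840), (176141, 15374, 100)],
)

query = """
SELECT itemKey,
       -- default frame once ORDER BY is present: start of partition to current row
       MAX(itemKey) OVER (PARTITION BY orderKey ORDER BY price DESC) AS default_frame,
       -- explicit frame covering the whole partition
       MAX(itemKey) OVER (PARTITION BY orderKey ORDER BY price DESC
                          RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS full_frame,
       -- itemKey of the highest-priced row, for comparison
       FIRST_VALUE(itemKey) OVER (PARTITION BY orderKey ORDER BY price DESC) AS top_item
FROM orders
ORDER BY price DESC
"""
results = conn.execute(query).fetchall()

for row in results:
    print(row)
```

The default_frame column reproduces the running maximum from the first table, full_frame is 15374 on every row, and top_item is 15346 on every row.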

Related

Find groups in data by relative difference between records

I have some rows that are sorted by price:
| id | price |
|----|-------|
| 1 | 2.00 |
| 2 | 2.10 |
| 3 | 2.11 |
| 4 | 2.50 |
| 5 | 2.99 |
| 6 | 3.02 |
| 7 | 9.01 |
| 8 | 9.10 |
| 9 | 9.11 |
| 10 | 13.01 |
| 11 | 13.51 |
| 12 | 14.10 |
I need to group them in "price groups". An item belongs to a different group when difference in price between it and the previous item is greater than some fixed value, say 1.50.
So the expected result is something like this:
| MIN(price) | MAX(price) |
|------------|------------|
| 2.00 | 3.02 |
| 9.01 | 9.11 |
| 13.01 | 14.10 |
I'm not even sure how to call this type of grouping. Group by "rolling difference"? Not exactly...
Can this be done in SQL (or in Postgres in particular)?

Your results are consistent with looking at the previous value and saying a group starts when the difference is greater than 1.5. You can do this with lag(), a cumulative sum, and aggregation:
select min(price), max(price)
from (select t.*,
             count(*) filter (where prev_price is null or prev_price < price - 1.5) over (order by price) as grp
      from (select t.*,
                   lag(price) over (order by price) as prev_price
            from t
           ) t
     ) t
group by grp
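Here is a small runnable sketch of the same lag-plus-cumulative-count idea on the question's data. It assumes Python's built-in sqlite3 module (SQLite 3.25 or later) rather than Postgres, and it swaps count(*) filter (...) for an equivalent SUM(CASE ...) so it does not depend on FILTER support.

```python
import sqlite3

prices = [2.00, 2.10, 2.11, 2.50, 2.99, 3.02,
          9.01, 9.10, 9.11, 13.01, 13.51, 14.10]

# In-memory SQLite database standing in for Postgres (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, price REAL)")
conn.executemany("INSERT INTO t (price) VALUES (?)", [(p,) for p in prices])

groups = conn.execute(
    """
    SELECT MIN(price), MAX(price)
    FROM (
        SELECT price,
               -- running count of "group starts" = group number
               SUM(CASE WHEN prev_price IS NULL OR prev_price < price - 1.5
                        THEN 1 ELSE 0 END) OVER (ORDER BY price) AS grp
        FROM (
            -- previous price in sorted order, NULL for the first row
            SELECT price, LAG(price) OVER (ORDER BY price) AS prev_price
            FROM t
        )
    )
    GROUP BY grp
    ORDER BY grp
    """
).fetchall()

for low, high in groups:
    print(low, high)
```

This yields the three price groups from the expected result: 2.00 to 3.02, 9.01 to 9.11, and 13.01 to 14.10.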

Thanks Gordon Linoff for his answer, it is exactly what I was after!
I ended up using this query here simply because I understand it better. I guess it is more noobish, but so am I.
Both queries sort a table of 1M rows into 34 groups in about a second. This query is a bit more performant on 11M rows, sorting them into 380 groups in 15 seconds, vs 23 seconds in Gordon's answer.
SELECT results.group_i, MIN(results.price), MAX(results.price), AVG(results.price)
FROM (
SELECT *,
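-- Note: this window has no ORDER BY, so it relies on rows arriving sorted by price from the inner query, which SQL does not guarantee
SUM(new_group) OVER (ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS group_i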
FROM (
SELECT annotated.*,
CASE
WHEN prev_price IS NULL OR price - prev_price > 1.5 THEN 1
ELSE 0
END AS new_group
FROM (
SELECT *,
LAG(price) OVER (ORDER BY price) AS prev_price
FROM prices
) AS annotated
) AS grouppable
) AS results
GROUP BY results.group_i
ORDER BY results.group_i;

SQL Query to Find Min and Max Values between Values, dates and companies in the same Query

I want to find the historic max and min closing price of each stock over the past 10 days from the current date, in a single query. The data is below. I've tried the query shown further down, but I get the same high and low for all the rows. The high and low need to be calculated per stock for a period of 10 days.
RDBMS: SQL Server 2014
Note: the duration might also be longer if required, e.g. the past 30 or 60 days.
For example, the output needs to be like ABB, 16-12-2019, 1480 (MaxClose), 1222 (MinClose) (test data) for the last 10 days.
+------+------------+-------------+
| Name | Date | Close |
+------+------------+-------------+
| ABB | 26-12-2019 | 1272.15 |
| ABB | 24-12-2019 | 1260.15 |
| ABB | 23-12-2019 | 1261.3 |
| ABB | 20-12-2019 | 1262 |
| ABB | 19-12-2019 | 1476 |
| ABB | 18-12-2019 | 1451.45 |
| ABB | 17-12-2019 | 1474.4 |
| ABB | 16-12-2019 | 1480.4 |
| ABB | 13-12-2019 | 1487.25 |
| ABB | 12-12-2019 | 1484.5 |
| INFY | 26-12-2019 | 73041.66667 |
| INFY | 24-12-2019 | 73038.33333 |
| INFY | 23-12-2019 | 73036.66667 |
| INFY | 20-12-2019 | 73031.66667 |
| INFY | 19-12-2019 | 73030 |
| INFY | 18-12-2019 | 73028.33333 |
| INFY | 17-12-2019 | 73026.66667 |
| INFY | 16-12-2019 | 73025 |
| INFY | 13-12-2019 | 73020 |
| INFY | 12-12-2019 | 73018.33333 |
+------+------------+-------------+
The query I tried, with no luck:
select max([close]) over (PARTITION BY name) AS MaxClose,
min([close]) over (PARTITION BY name) AS MinClose,
[Date],
name
from historic
where [DATE] between [DATE] -30 and [DATE]
and name='ABB'
group by [Date],
[NAME],
[close]
order by [DATE] desc

If you just want the highest and lowest close per name, then simple aggregation is enough:
select name, max([close]) max_close, min([close]) min_close
from historic
where [date] >= dateadd(day, -10, getdate())
group by name
order by name
If you want the entire corresponding records, then rank() is a solution:
select name, [date], [close]
from (
select
h.*,
rank() over(partition by name order by [close]) rn1,
rank() over(partition by name order by [close] desc) rn2
from historic h
where [date] >= dateadd(day, -10, getdate())
) t
where rn1 = 1 or rn2 = 1
order by name, [date]
Top and bottom ties will show up if any.
You can add a where condition to filter on a given name.

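If you are looking for a running min/max, a row-based window frame is one option:
Select *
,MinClose = min([Close]) over (partition by name order by date rows between 10 preceding and current row)
,MaxClose = max([Close]) over (partition by name order by date rows between 10 preceding and current row)
From YourTable
Note that "10 preceding and current row" spans up to 11 rows, and those rows are trading days rather than calendar days; use "9 preceding" for a 10-row window.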
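As a rough illustration of how a row-based frame behaves, here is a sketch using the five most recent ABB rows from the question. It assumes Python's built-in sqlite3 module (SQLite 3.25 or later) instead of SQL Server, dates rewritten in ISO format so they sort correctly as text, and a deliberately short 3-row window (2 preceding plus the current row) so the result is easy to check by hand.

```python
import sqlite3

# In-memory SQLite database standing in for SQL Server (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE historic (name TEXT, date TEXT, close REAL)")
conn.executemany(
    "INSERT INTO historic VALUES (?, ?, ?)",
    [
        ("ABB", "2019-12-19", 1476.0),
        ("ABB", "2019-12-20", 1262.0),
        ("ABB", "2019-12-23", 1261.3),
        ("ABB", "2019-12-24", 1260.15),
        ("ABB", "2019-12-26", 1272.15),
    ],
)

# Rolling min/max over the current row and the two rows before it.
rolling = conn.execute(
    """
    SELECT date,
           MIN(close) OVER (PARTITION BY name ORDER BY date
                            ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS MinClose,
           MAX(close) OVER (PARTITION BY name ORDER BY date
                            ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS MaxClose
    FROM historic
    ORDER BY date
    """
).fetchall()

for row in rolling:
    print(row)
```

The 1476 high drops out of MaxClose on 2019-12-24, once it is more than two rows behind the current row.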

SQL - group by a change of value in a given column

Apologies for the confusing title, I was unsure how to phrase it.
Below is my dataset:
+----+-----------------------------+--------+
| Id | Date | Amount |
+----+-----------------------------+--------+
| 1 | 2019-02-01 12:14:08.8056282 | 10 |
| 1 | 2019-02-04 15:23:21.3258719 | 10 |
| 1 | 2019-02-06 17:29:16.9267440 | 15 |
| 1 | 2019-02-08 14:18:14.9710497 | 10 |
+----+-----------------------------+--------+
It is an example of a bank trying to collect money from a debtor: first, 10% of the owed sum is attempted; if the card is charged successfully, 15% is attempted; if that throws an error (for example insufficient funds), 10% is attempted again.
The desired output would be:
+----+--------+---------+
| Id | Amount | Attempt |
+----+--------+---------+
| 1 | 10 | 1 |
| 1 | 15 | 2 |
| 1 | 10 | 3 |
+----+--------+---------+
I have tried:
SELECT Id, Amount
FROM table1
GROUP BY Id, Amount
I am struggling to create a new column based on when value changes in the Amount column as I assume that could be used as another grouping variable that could fix this.

If you just want when a value changes, use lag():
select t.id, t.amount,
row_number() over (partition by id order by date) as attempt
from (select t.*, lag(amount) over (partition by id order by date) as prev_amount
from table1 t
) t
where prev_amount is null or prev_amount <> amount
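A quick way to check this on the sample data is the sketch below. It assumes Python's built-in sqlite3 module (SQLite 3.25 or later) as a stand-in database, with the timestamps shortened to whole seconds.

```python
import sqlite3

# In-memory SQLite database as a stand-in (assumption); timestamps shortened.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (id INT, date TEXT, amount INT)")
conn.executemany(
    "INSERT INTO table1 VALUES (?, ?, ?)",
    [
        (1, "2019-02-01 12:14:08", 10),
        (1, "2019-02-04 15:23:21", 10),
        (1, "2019-02-06 17:29:16", 15),
        (1, "2019-02-08 14:18:14", 10),
    ],
)

# Keep only rows where the amount differs from the previous row's amount,
# then number the surviving rows per id.
attempts = conn.execute(
    """
    SELECT id, amount,
           ROW_NUMBER() OVER (PARTITION BY id ORDER BY date) AS attempt
    FROM (
        SELECT id, date, amount,
               LAG(amount) OVER (PARTITION BY id ORDER BY date) AS prev_amount
        FROM table1
    )
    WHERE prev_amount IS NULL OR prev_amount <> amount
    ORDER BY id, attempt
    """
).fetchall()

for row in attempts:
    print(row)
```

The repeated 10 on 2019-02-04 is filtered out, and the outer ROW_NUMBER() numbers only the remaining rows because window functions are evaluated after the WHERE clause.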

SQL formula for Row number

I'm trying to rank the rows in the following table that looks like this:
| ID | Key | Date | Row |
|------|-----|---------|-----|
| P175 | 5 | 2017-01 | 2 |
| P175 | 5 | 2017-02 | 2 |
| P175 | 5 | 2017-03 | 2 |
| P175 | 12 | 2017-03 | 1 |
| P175 | 12 | 2017-04 | 1 |
| P175 | 12 | 2017-05 | 1 |
This person has two Keys at once during 2017-03, but I want the formula to put '1' for the rows where Key=12 since it reflects the most recent records.
I want the same formula to also work for the people who don't have overlapping Keys, putting '1' for the most recent records:
| ID | Key | Date | Row |
|------|-----|---------|-----|
| P170 | 8 | 2017-01 | 2 |
| P170 | 8 | 2017-02 | 2 |
| P170 | 8 | 2017-03 | 2 |
| P170 | 6 | 2017-04 | 1 |
| P170 | 6 | 2017-05 | 1 |
I've tried variations of ROW_NUMBER() OVER PARTITION BY and DENSE_RANK but cannot figure out the correct formula. Thanks for your help.

First calculate the max date for the key. Then use dense_rank():
select t.*,
dense_rank() over (partition by id order by max_date desc, key) as row
from (select t.*, max(date) over (partition by id, key) as max_date
from t
) t;
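The first query can be checked against both sample tables with a sketch like this one. It assumes Python's built-in sqlite3 module (SQLite 3.25 or later) as a stand-in database; the column names mykey and rn are substitutes for Key and Row, chosen only to steer clear of keyword clashes.

```python
import sqlite3

# In-memory SQLite database as a stand-in (assumption).
# mykey / rn are hypothetical names replacing Key / Row from the question.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id TEXT, mykey INT, date TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    [
        ("P175", 5, "2017-01"), ("P175", 5, "2017-02"), ("P175", 5, "2017-03"),
        ("P175", 12, "2017-03"), ("P175", 12, "2017-04"), ("P175", 12, "2017-05"),
        ("P170", 8, "2017-01"), ("P170", 8, "2017-02"), ("P170", 8, "2017-03"),
        ("P170", 6, "2017-04"), ("P170", 6, "2017-05"),
    ],
)

rows = conn.execute(
    """
    SELECT id, mykey, date,
           -- keys with the most recent max_date get rank 1
           DENSE_RANK() OVER (PARTITION BY id ORDER BY max_date DESC, mykey) AS rn
    FROM (
        -- latest date seen for each (id, key) pair
        SELECT id, mykey, date,
               MAX(date) OVER (PARTITION BY id, mykey) AS max_date
        FROM t
    )
    ORDER BY id, date, mykey
    """
).fetchall()

# Collapse to one rank per (id, key) pair for easy inspection.
ranks = {(pid, key): rn for pid, key, _, rn in rows}
print(ranks)
```

Key 12 gets rank 1 for P175 despite overlapping with key 5 in 2017-03, and key 6 gets rank 1 for P170.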
If the ranges for each key did not overlap, you could do this with a cumulative count distinct:
select t.*, count(distinct key) over (partition by id order by date desc) as rank
from t;
However, this would not work in the first case. I just find it interesting that this does almost the same thing as the first query.

I guess you are looking for something like this
select personid, mykey, month,
dense_rank() over (partition by personid order by mykey desc) rown
from personkeys
order by month
see the example
http://sqlfiddle.com/#!15/cf751/8

Select latest values for group of related records

I have a table that accommodates data that is logically groupable by multiple properties (foreign key for example). Data is sequential over continuous time interval; i.e. it is a time series data. What I am trying to achieve is to select only latest values for each group of groups.
Here is example data:
+-----------------------------------------+
| code | value | date | relation_id |
+-----------------------------------------+
| A | 1 | 01.01.2016 | 1 |
| A | 2 | 02.01.2016 | 1 |
| A | 3 | 03.01.2016 | 1 |
| A | 4 | 01.01.2016 | 2 |
| A | 5 | 02.01.2016 | 2 |
| A | 6 | 03.01.2016 | 2 |
| B | 1 | 01.01.2016 | 1 |
| B | 2 | 02.01.2016 | 1 |
| B | 3 | 03.01.2016 | 1 |
| B | 4 | 01.01.2016 | 2 |
| B | 5 | 02.01.2016 | 2 |
| B | 6 | 03.01.2016 | 2 |
+-----------------------------------------+
And here is example of desired output:
+-----------------------------------------+
| code | value | date | relation_id |
+-----------------------------------------+
| A | 3 | 03.01.2016 | 1 |
| A | 6 | 03.01.2016 | 2 |
| B | 3 | 03.01.2016 | 1 |
| B | 6 | 03.01.2016 | 2 |
+-----------------------------------------+
To put this in perspective: for every related object I want to select each code with its latest date.
Here is a select I came up with. I've used the ROW_NUMBER() OVER (PARTITION BY ...) approach:
SELECT indicators.code, indicators.dimension, indicators.unit, x.value, x.date, x.ticker, x.name
FROM (
SELECT
ROW_NUMBER() OVER (PARTITION BY indicator_id ORDER BY date DESC) AS r,
t.indicator_id, t.value, t.date, t.company_id, companies.sic_id,
companies.ticker, companies.name
FROM fundamentals t
INNER JOIN companies on companies.id = t.company_id
WHERE companies.sic_id = 89
) x
INNER JOIN indicators on indicators.id = x.indicator_id
WHERE x.r <= (SELECT count(*) FROM companies where sic_id = 89)
It works, but the problem is that it is painfully slow: when working with about 5% of production data, which equals roughly 3 million fundamentals records, this select takes about 10 seconds to finish. My guess is that this happens because the subselect selects a huge number of records first.
Is there any way to speed this query up, or am I digging in the wrong direction by doing it this way?

Postgres offers the convenient distinct on for this purpose:
select distinct on (relation_id, code) t.*
from t
order by relation_id, code, date desc;

So your query uses different column names than your sample data, so it's hard to tell, but it looks like you just want to group by everything except for date? Assuming you don't have multiple most recent dates, something like this should work. Basically don't use the window function, use a proper group by, and your engine should optimize the query better.
SELECT mytable.code,
mytable.value,
mytable.date,
mytable.relation_id
FROM mytable
JOIN (
SELECT code,
max(date) as date,
relation_id
FROM mytable
GROUP BY code, relation_id
) Q1
ON Q1.code = mytable.code
AND Q1.date = mytable.date
AND Q1.relation_id = mytable.relation_id
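For a quick check of the group-by-and-join approach, here is a sketch on the question's sample data. It assumes Python's built-in sqlite3 module as a stand-in database and dates rewritten in ISO format (2016-01-03 rather than 03.01.2016) so that max(date) compares correctly on text.

```python
import sqlite3

# In-memory SQLite database as a stand-in (assumption); ISO dates (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE mytable (code TEXT, value INT, date TEXT, relation_id INT)"
)
sample = [
    (code, base + day, f"2016-01-0{day}", relation_id)
    for code in ("A", "B")
    for relation_id, base in ((1, 0), (2, 3))
    for day in (1, 2, 3)
]
conn.executemany("INSERT INTO mytable VALUES (?, ?, ?, ?)", sample)

# Find the latest date per (code, relation_id), then join back to get the row.
latest = conn.execute(
    """
    SELECT m.code, m.value, m.date, m.relation_id
    FROM mytable AS m
    JOIN (
        SELECT code, relation_id, MAX(date) AS date
        FROM mytable
        GROUP BY code, relation_id
    ) AS q1
      ON q1.code = m.code
     AND q1.date = m.date
     AND q1.relation_id = m.relation_id
    ORDER BY m.code, m.relation_id
    """
).fetchall()

for row in latest:
    print(row)
```

This returns the four rows from the desired output. Whether it is faster than the ROW_NUMBER() version on the real tables depends on the available indexes, so it is worth measuring rather than assuming.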

Other option:
SELECT DISTINCT Code,
Relation_ID,
FIRST_VALUE(Value) OVER (PARTITION BY Code, Relation_ID ORDER BY Date DESC) Value,
FIRST_VALUE(Date) OVER (PARTITION BY Code, Relation_ID ORDER BY Date DESC) Date
FROM mytable
This will return the top value for whatever you partition by and whatever you order by.

I believe we can try something like this
SELECT CODE,Relation_ID,Date,MAX(value)value FROM mytable
GROUP BY CODE,Relation_ID,Date
