I have a table with the following data:
dt device id count
2018-10-05 computer 7541185957382 6
2018-10-20 computer 7541185957382 3
2018-10-14 computer 7553187775734 6
2018-10-17 computer 7553187775734 10
2018-10-21 computer 7553187775734 2
2018-10-22 computer 7549187067178 5
2018-10-20 computer 7553187757256 3
2018-10-11 computer 7549187067178 10
I want to get the last and first dt for each id. Hence, I used the window functions first_value and last_value as follows:
select id,last_value(dt) over (partition by id order by dt) last_dt
from table
order by id
;
But I am getting this error:
FAILED: SemanticException Failed to breakup Windowing invocations into Groups. At least 1 group must only depend on input columns. Also check for circular dependencies.
Underlying error: Primitve type DATE not supported in Value Boundary expression
I am not able to diagnose the problem, and I would appreciate any help.
If you add a ROWS BETWEEN clause to the window definition, your query will work:
hive> select id,last_value(dt) over (partition by id order by dt
rows between unbounded preceding and unbounded following) last_dt
from table order by id;
Result:
+----------------+-------------+--+
| id | last_dt |
+----------------+-------------+--+
| 7541185957382 | 2018-10-20 |
| 7541185957382 | 2018-10-20 |
| 7549187067178 | 2018-10-22 |
| 7549187067178 | 2018-10-22 |
| 7553187757256 | 2018-10-20 |
| 7553187775734 | 2018-10-21 |
| 7553187775734 | 2018-10-21 |
| 7553187775734 | 2018-10-21 |
+----------------+-------------+--+
There is a Jira issue regarding this primitive type support, and it was fixed in Hive 2.1.0.
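To see why the explicit frame matters for last_value, here is a minimal sketch. It runs on SQLite through Python's sqlite3 module rather than on Hive, so it only illustrates the standard frame semantics, and it assumes the bundled SQLite is 3.25 or newer (window function support). The table name t is hypothetical.

```python
import sqlite3

# Assumption: SQLite >= 3.25; this is SQLite, not Hive.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (dt TEXT, id INTEGER)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [("2018-10-05", 7541185957382), ("2018-10-20", 7541185957382)],
)

query = """
SELECT dt,
       -- default frame with ORDER BY: partition start up to the current row
       last_value(dt) OVER (PARTITION BY id ORDER BY dt) AS default_frame,
       -- explicit frame: the whole partition
       last_value(dt) OVER (PARTITION BY id ORDER BY dt
                            ROWS BETWEEN UNBOUNDED PRECEDING
                                     AND UNBOUNDED FOLLOWING) AS full_frame
FROM t
ORDER BY dt
"""
frame_rows = conn.execute(query).fetchall()
for row in frame_rows:
    print(row)
```

With the default frame, last_value simply returns the current row's dt; only the full frame yields the last dt of the partition on every row.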
UPDATE:
For distinct records you can use the ROW_NUMBER window function and keep only the first row of each partition.
hive> select id,last_dt from
(select id,last_value(dt) over (partition by id order by dt
rows between unbounded preceding and unbounded following) last_dt,
ROW_NUMBER() over (partition by id order by dt)rn
from so )t
where t.rn=1;
Result:
+----------------+-------------+--+
| id | last_dt |
+----------------+-------------+--+
| 7541185957382 | 2018-10-20 |
| 7553187757256 | 2018-10-20 |
| 7553187775734 | 2018-10-21 |
| 7549187067178 | 2018-10-22 |
+----------------+-------------+--+
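If only one row per id is needed, with both the first and the last dt, a plain GROUP BY with MIN and MAX may be enough and avoids window functions entirely. The sketch below is an alternative to the answer above, not a restatement of it; it runs on SQLite through Python's sqlite3 module, and the reduced two-column table so is an assumption made for brevity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Only the two columns needed for the illustration are kept.
conn.execute("CREATE TABLE so (dt TEXT, id INTEGER)")
conn.executemany(
    "INSERT INTO so VALUES (?, ?)",
    [
        ("2018-10-05", 7541185957382),
        ("2018-10-20", 7541185957382),
        ("2018-10-14", 7553187775734),
        ("2018-10-17", 7553187775734),
        ("2018-10-21", 7553187775734),
        ("2018-10-22", 7549187067178),
        ("2018-10-20", 7553187757256),
        ("2018-10-11", 7549187067178),
    ],
)

# One row per id: earliest and latest dt.
first_last = conn.execute(
    """
    SELECT id, MIN(dt) AS first_dt, MAX(dt) AS last_dt
    FROM so
    GROUP BY id
    ORDER BY id
    """
).fetchall()
for row in first_last:
    print(row)
```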
I've got a long table that tracks a numerical 'state' value (0=new, 1=setup mode, 2=retired, 3=active, 4=inactive) of a collection of 'devices' historically. These devices may be activated/deactivated throughout the year, so the table is a continuous collection of state changes, mostly states 3 and 4, ordered by id, with a timestamp at the end, for example:
id | device_id | new_state | when
----------+-----------+-----------+----------------------------
218010581 | 2505 | 0 | 2022-06-06 16:28:11.174084
218010580 | 2505 | 1 | 2022-06-06 16:28:11.174084
218010634 | 2505 | 3 | 2022-06-06 16:29:25.129019
218087737 | 659 | 3 | 2022-06-07 22:55:48.705208
218087744 | 1392 | 3 | 2022-06-07 22:55:59.016974
218087757 | 1556 | 3 | 2022-06-07 22:56:09.811876
218087758 | 2071 | 1 | 2022-06-07 22:56:20.850095
218087765 | 2071 | 3 | 2022-06-07 22:56:29.122074
When I want to look for a list of devices and see their 'history', I know I can use something like:
select *
from devstatechange
where device_id = 2345
order by "when";
id | device_id | new_state | when
-----------+-----------+-----------+----------------------------
184682659 | 2345 | 0 | 2021-05-27 17:03:36.894429
184682658 | 2345 | 1 | 2021-05-27 17:03:36.894429
184684721 | 2345 | 3 | 2021-05-27 17:31:01.968314
194933399 | 2345 | 4 | 2021-08-31 23:30:05.555407
195213746 | 2345 | 3 | 2021-09-03 16:53:39.043005
206278232 | 2345 | 4 | 2021-12-31 22:30:08.820068
206515355 | 2345 | 3 | 2022-01-03 16:06:01.223759
215709888 | 2345 | 4 | 2022-04-30 23:30:30.309389
215846807 | 2345 | 3 | 2022-05-02 19:40:31.525514
select *
from devstatechange
where device_id = 2351
order by "when";
id | device_id | new_state | when
-----------+-----------+-----------+----------------------------
186091252 | 2351 | 0 | 2021-06-09 15:36:02.775035
186091253 | 2351 | 1 | 2021-06-09 15:36:02.775035
186091349 | 2351 | 3 | 2021-06-09 15:37:56.965599
197880878 | 2351 | 4 | 2021-09-30 23:30:06.691835
197945073 | 2351 | 3 | 2021-10-01 15:32:35.907913
208981857 | 2351 | 4 | 2022-01-31 22:30:09.521694
209722639 | 2351 | 3 | 2022-02-09 15:20:12.412816
217666572 | 2351 | 4 | 2022-05-31 23:30:30.881928
What I am really looking for is a query that returns a unique list of devices whose latest dated entry has a state of '4' (the 'inactive' state), excluding devices that do not match.
So, using the above data samples, even though both devices 2345 and 2351 have states of 3 and 4 throughout their history, only device 2351 has its last dated entry with a state of 4, meaning it is currently in an 'inactive' state. Device 2345 would not appear in the result set since its last dated entry has a state of 3: it's still active.
Stabbing in the dark, I've tried variants of:
SELECT DISTINCT *
FROM devstatechange
WHERE MAX("when") AND new_state = 4
ORDER BY "when";
SELECT DISTINCT device_id, new_state, MAX("when")
FROM devstatechange
WHERE new_state = 4
ORDER BY "when";
with obviously no success.
I'm thinking I might need to 'group' the entries together, but I don't know how to specify 'return last entry only if new_state = 4' in SQL, or rather PostgreSQL.
Any tidbits or pokes in the right direction would be appreciated.
SELECT * FROM (
SELECT DISTINCT ON (device_id)
*
FROM devstatechange
ORDER BY device_id, "when" DESC
) AS latest
WHERE new_state = 4;
The DISTINCT ON clause together with the ORDER BY pulls the newest row for each device. The outer query then filters these rows by your condition.
You may use the ROW_NUMBER() function, partitioned by device_id and ordered by "when" descending.
Try the following CTE:
with cte as
(
Select id, device_id, new_state, "when",
row_number() over (partition by device_id order by "when" desc) as rn
from devstatechange
)
select * from cte where rn=1 and new_state=4
See a demo from db-fiddle.
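The ROW_NUMBER approach can be tried locally with a small sketch. It uses SQLite through Python's sqlite3 module instead of PostgreSQL (an assumption made so the example is self-contained; SQLite 3.25+ is needed for window functions) and only the last two rows of each device from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE devstatechange '
    '(id INTEGER, device_id INTEGER, new_state INTEGER, "when" TEXT)'
)
conn.executemany(
    "INSERT INTO devstatechange VALUES (?, ?, ?, ?)",
    [
        (215709888, 2345, 4, "2022-04-30 23:30:30.309389"),
        (215846807, 2345, 3, "2022-05-02 19:40:31.525514"),
        (209722639, 2351, 3, "2022-02-09 15:20:12.412816"),
        (217666572, 2351, 4, "2022-05-31 23:30:30.881928"),
    ],
)

# Rank each device's rows newest first, then keep the newest row
# only when its state is 4 (inactive).
inactive = conn.execute(
    """
    WITH ranked AS (
        SELECT id, device_id, new_state, "when",
               row_number() OVER (PARTITION BY device_id
                                  ORDER BY "when" DESC) AS rn
        FROM devstatechange
    )
    SELECT device_id, new_state, "when"
    FROM ranked
    WHERE rn = 1 AND new_state = 4
    """
).fetchall()
print(inactive)
```

Device 2345 is dropped because its newest row has state 3, even though it has older rows with state 4.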
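The problem with:
SELECT DISTINCT * FROM devstatechange WHERE MAX("when") AND new_state=4 ORDER BY "when";
is that MAX("when") is not a valid condition: an aggregate cannot be used in WHERE, and even if it could, it would refer to all the entries in the table rather than to each device.
You should change that condition to a correlated subquery (with the outer table aliased as dev1):
dev1."when" = (select max(dev2."when") from devstatechange dev2 where dev2.device_id = dev1.device_id)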
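For completeness, here is a sketch of the full query built around that correlated subquery. It runs on SQLite through Python's sqlite3 module rather than PostgreSQL, which is an assumption made to keep it self-contained, and it uses a handful of rows from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE devstatechange '
    '(id INTEGER, device_id INTEGER, new_state INTEGER, "when" TEXT)'
)
conn.executemany(
    "INSERT INTO devstatechange VALUES (?, ?, ?, ?)",
    [
        (206278232, 2345, 4, "2021-12-31 22:30:08.820068"),
        (215846807, 2345, 3, "2022-05-02 19:40:31.525514"),
        (208981857, 2351, 4, "2022-01-31 22:30:09.521694"),
        (209722639, 2351, 3, "2022-02-09 15:20:12.412816"),
        (217666572, 2351, 4, "2022-05-31 23:30:30.881928"),
    ],
)

# Keep a state-4 row only if it carries the device's newest timestamp.
currently_inactive = conn.execute(
    """
    SELECT dev1.device_id, dev1."when"
    FROM devstatechange AS dev1
    WHERE dev1.new_state = 4
      AND dev1."when" = (SELECT MAX(dev2."when")
                         FROM devstatechange AS dev2
                         WHERE dev2.device_id = dev1.device_id)
    """
).fetchall()
print(currently_inactive)
```

If two rows of a device share the newest timestamp, both are compared against the maximum, so a device could match on one of them; a tiebreaker such as id would be needed in that case.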
You can use a CTE to obtain the last state of each device and then select only those whose last state is 4, like this:
WITH device_last_state AS (
SELECT DISTINCT ON (device_id)
id,
device_id,
first_value (new_state) over (partition by device_id order by "when" desc) as new_state,
"when"
FROM devstatechange
ORDER BY device_id, "when" DESC
)
SELECT * FROM device_last_state
WHERE new_state = 4
Check a demo
I'm trying to merge overlapping date ranges defined by the admit and discharge dates of patients. There are a few edge cases which I couldn't cover in the query.
Input
+----+------------+--------------+
| ID | Admit_Dt | Discharge_Dt |
+----+------------+--------------+
| 1 | 12/30/2020 | 07/14/2021 |
+----+------------+--------------+
| 1 | 01/02/2021 | 07/14/2021 |
+----+------------+--------------+
| 1 | 06/16/2021 | 07/14/2021 |
+----+------------+--------------+
| 2 | 03/04/2021 | 03/25/2021 |
+----+------------+--------------+
| 2 | 05/01/2021 | 05/10/2021 |
+----+------------+--------------+
| 3 | 06/01/2021 | 06/05/2021 |
+----+------------+--------------+
Expected Output
+----+------------+--------------+
| ID | Admit_dt | Discharge_dt |
+----+------------+--------------+
| 1 | 12/30/2020 | 07/14/2021 |
+----+------------+--------------+
| 2 | 03/04/2021 | 03/25/2021 |
+----+------------+--------------+
| 2 | 05/01/2021 | 05/10/2021 |
+----+------------+--------------+
| 3 | 06/01/2021 | 06/05/2021 |
+----+------------+--------------+
Query
I used the logic that was here, but this doesn't cover the edge cases for IDs 2 and 3. Also, the subquery is slower when the data is huge. Is it possible to tackle this problem using LAG?
SELECT dr1.* FROM Member_Discharges dr1
INNER JOIN Member_Discharges dr2
ON dr2.ADMIT_DT> dr1.ADMIT_DT
and dr2.ADMIT_DT< dr1.DISCHARGE_DT
This is a type of gaps-and-islands problem. I would suggest using a cumulative max to determine when an "island" starts and then aggregate:
select id, min(admit_dt), max(discharge_dt)
from (select t.*,
sum(case when prev_Discharge_dt >= Admit_Dt then 0 else 1 end) over (partition by id order by admit_dt, discharge_dt) as grp
from (select t.*,
max(Discharge_dt) over (partition by id
order by Admit_Dt, Discharge_dt
rows between unbounded preceding and 1 preceding) as prev_Discharge_dt
from t
) t
) t
group by id, grp;
Here is a db<>fiddle.
The innermost subquery is retrieving the maximum discharge date before each row. This allows you to check for an overlap. The middle subquery counts up the number of times there is no overlap -- the beginning of a group. And the outer query aggregates.
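The three layers described above can be checked against the question's sample data with a short sketch. It runs on SQLite (3.25+) through Python's sqlite3 module, which is an assumption; the table name member_discharges follows the question, and the dates are stored in ISO format so that string comparison matches date order.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE member_discharges (id INTEGER, admit_dt TEXT, discharge_dt TEXT)"
)
conn.executemany(
    "INSERT INTO member_discharges VALUES (?, ?, ?)",
    [
        (1, "2020-12-30", "2021-07-14"),
        (1, "2021-01-02", "2021-07-14"),
        (1, "2021-06-16", "2021-07-14"),
        (2, "2021-03-04", "2021-03-25"),
        (2, "2021-05-01", "2021-05-10"),
        (3, "2021-06-01", "2021-06-05"),
    ],
)

merged = conn.execute(
    """
    SELECT id, MIN(admit_dt) AS first_admit, MAX(discharge_dt) AS last_discharge
    FROM (
        -- running count of "no overlap with anything earlier" = island number
        SELECT s.*,
               SUM(CASE WHEN prev_discharge_dt >= admit_dt THEN 0 ELSE 1 END)
                   OVER (PARTITION BY id ORDER BY admit_dt, discharge_dt) AS grp
        FROM (
            -- latest discharge date among all earlier rows of the same id
            SELECT m.*,
                   MAX(discharge_dt) OVER (
                       PARTITION BY id
                       ORDER BY admit_dt, discharge_dt
                       ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING
                   ) AS prev_discharge_dt
            FROM member_discharges AS m
        ) AS s
    ) AS g
    GROUP BY id, grp
    ORDER BY id, first_admit
    """
).fetchall()
for row in merged:
    print(row)
```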
Here is another "gaps-and-islands" solution
Use LAG to determine if the previous Discharge_Dt is earlier than the current Admit_Dt, if so we have a starting point
Number the islands using COUNT OVER
Group by the ID and the new grouping number, and take the min and max dates
WITH StartPoints AS (
SELECT *,
IsStart = CASE WHEN LAG(Discharge_Dt, 1, '19000101')
OVER (PARTITION BY ID ORDER BY Admit_Dt)
< Admit_Dt THEN 1 END
FROM YourTable t
),
Groupings AS (
SELECT *,
GroupId = COUNT(IsStart) OVER (PARTITION BY ID
ORDER BY Admit_Dt ROWS UNBOUNDED PRECEDING)
FROM StartPoints
)
SELECT ID, Admit_Dt = MIN(Admit_Dt), Discharge_Dt = MAX(Discharge_Dt)
FROM Groupings
GROUP BY ID, GroupId
ORDER BY ID, GroupId;
db<>fiddle
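One case worth checking before relying on the LAG variant is a short stay nested inside a longer one, because LAG only sees the immediately preceding row. The sketch below compares LAG with a running MAX over all earlier rows on such data. It runs on SQLite (3.25+) through Python's sqlite3 module; the table stays and its three rows are hypothetical and not taken from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stays (id INTEGER, admit_dt TEXT, discharge_dt TEXT)")
conn.executemany(
    "INSERT INTO stays VALUES (?, ?, ?)",
    [
        (1, "2021-01-01", "2021-12-31"),  # long stay
        (1, "2021-02-01", "2021-02-15"),  # nested in the long stay
        (1, "2021-03-01", "2021-03-10"),  # also nested in the long stay
    ],
)

# {prev} is replaced by the expression that yields the "previous" discharge date.
TEMPLATE = """
WITH start_points AS (
    SELECT id, admit_dt, discharge_dt,
           CASE WHEN COALESCE({prev}, '1900-01-01') < admit_dt THEN 1 END AS is_start
    FROM stays
),
groupings AS (
    SELECT *, COUNT(is_start) OVER (PARTITION BY id ORDER BY admit_dt
                                    ROWS UNBOUNDED PRECEDING) AS group_id
    FROM start_points
)
SELECT id, MIN(admit_dt), MAX(discharge_dt)
FROM groupings
GROUP BY id, group_id
ORDER BY id, group_id
"""
LAG_PREV = "LAG(discharge_dt) OVER (PARTITION BY id ORDER BY admit_dt)"
MAX_PREV = (
    "MAX(discharge_dt) OVER (PARTITION BY id ORDER BY admit_dt "
    "ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING)"
)

lag_result = conn.execute(TEMPLATE.format(prev=LAG_PREV)).fetchall()
max_result = conn.execute(TEMPLATE.format(prev=MAX_PREV)).fetchall()
print("LAG:        ", lag_result)
print("running MAX:", max_result)
```

On this data the LAG version starts a second island at the third row, because the previous row's discharge date is earlier even though the first stay still covers it; the running MAX keeps all three rows in one island.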
I'm having some issues when I try to obtain the MAX value of a field within a set of records, and I hope some of you can help me find what I am doing wrong.
I'm trying to get the ID of the item of the most expensive line, within an order.
Given this query:
SELECT
orderHeader.orderKey, orderLines.lineKey, orderLines.itemKey, orderLines.OrderedQty,
orderLines.price, (orderLines.price*orderLines.OrderedQty) as LinePrice,
ROW_NUMBER() OVER(PARTITION BY orderHeader.orderKey ORDER BY orderLines.lineKey asc) AS [ItemLineNum],
ROW_NUMBER() OVER(PARTITION BY orderHeader.orderKey ORDER BY (orderLines.price*orderLines.OrderedQty) DESC) AS [LineMaxPriceNum],
max(orderLines.itemKey) OVER (PARTITION BY orderHeader.orderKey ORDER BY (orderLines.price*orderLines.OrderedQty) DESC) as [MaxPriceItem]
FROM
orderHeader inner join orderLines on orderHeader.orderKey=orderLines.orderKey
I'm getting these results:
Sorry, as I'm not allowed to insert images directly in the post, I'll try with snippets for formatting the tables.
These are the results:
| orderKey | lineKey | itemKey | OrderedQty | Price | LinePrice | ItemLineNum | LineMaxPriceNum | MaxPriceItem |
|----------|---------|---------|------------|-------|-----------|-------------|-----------------|--------------|
| 176141 | 367038 | 15346 | 3 | 1000 | 3000 | 2 | 1 | 15346 |
| 176141 | 367037 | 15159 | 2 | 840 | 1680 | 1 | 2 | 15346 |
| 176141 | 367039 | 15374 | 5 | 100 | 500 | 3 | 3 | 15374 |
As you can see, for the same "orderKey" I have three lines (lineKey), each of them with a different item (itemKey), a different quantity, a different price and a different total cost (LinePrice).
I want the column MaxPriceItem to hold the key of the item with the highest "LinePrice", but the result is wrong. All three lines should show 15346 as the most expensive item, but the last one is not right, and I can't see why. Also, the ROW_NUMBER ordered by the same expression (LineMaxPriceNum) is giving me the right order.
If I change the expression of the ORDER BY within the MAX, like this (ordering by "OrderedQty"):
SELECT
orderHeader.orderKey, orderLines.lineKey, orderLines.itemKey, orderLines.OrderedQty,
orderLines.price, (orderLines.price*orderLines.OrderedQty) as LinePrice,
ROW_NUMBER() OVER(PARTITION BY orderHeader.orderKey ORDER BY orderLines.lineKey asc) AS [ItemLineNum],
ROW_NUMBER() OVER(PARTITION BY orderHeader.orderKey ORDER BY (orderLines.price*orderLines.OrderedQty) DESC) AS [LineMaxPriceNum],
max(orderLines.itemKey) OVER (PARTITION BY orderHeader.orderKey ORDER BY orderLines.OrderedQty DESC) as [MaxPriceItem]
FROM
orderHeader inner join orderLines on orderHeader.orderKey=orderLines.orderKey
Then it works:
| orderKey | lineKey | itemKey | OrderedQty | Price | LinePrice | ItemLineNum | LineMaxPriceNum | MaxPriceItem |
|----------|---------|---------|------------|-------|-----------|-------------|-----------------|--------------|
| 176141 | 367038 | 15346 | 3 | 1000 | 3000 | 2 | 1 | 15374 |
| 176141 | 367037 | 15159 | 2 | 840 | 1680 | 1 | 2 | 15374 |
| 176141 | 367039 | 15374 | 5 | 100 | 500 | 3 | 3 | 15374 |
The item with the highest "OrderedQty" is 15374 so the results are correct.
If I change, again, the expression of the ORDER BY within the MAX, like this (ordering by "Price"):
SELECT
orderHeader.orderKey, orderLines.lineKey, orderLines.itemKey, orderLines.OrderedQty,
orderLines.price, (orderLines.price*orderLines.OrderedQty) as LinePrice,
ROW_NUMBER() OVER(PARTITION BY orderHeader.orderKey ORDER BY orderLines.lineKey asc) AS [ItemLineNum],
ROW_NUMBER() OVER(PARTITION BY orderHeader.orderKey ORDER BY (orderLines.price*orderLines.OrderedQty) DESC) AS [LineMaxPriceNum],
max(orderLines.itemKey) OVER (PARTITION BY orderHeader.orderKey ORDER BY orderLines.price DESC) as [MaxPriceItem]
FROM
orderHeader inner join orderLines on orderHeader.orderKey=orderLines.orderKey
Then the same thing happens as in the first example; the results are wrong:
| orderKey | lineKey | itemKey | OrderedQty | Price | LinePrice | ItemLineNum | LineMaxPriceNum | MaxPriceItem |
|----------|---------|---------|------------|-------|-----------|-------------|-----------------|--------------|
| 176141 | 367038 | 15346 | 3 | 1000 | 3000 | 2 | 1 | 15346 |
| 176141 | 367037 | 15159 | 2 | 840 | 1680 | 1 | 2 | 15346 |
| 176141 | 367039 | 15374 | 5 | 100 | 500 | 3 | 3 | 15374 |
The item with the highest price is 15346 but the MAX for the last record is not showing this.
What am I missing here? Why am I getting those different results?
Sorry if the formatting is not properly done; it's my first question here and I've tried my best.
Thanks in advance for any help you can give me.
I'm trying to get the ID of the item of the most expensive line, within an order.
You misunderstand the purpose of the order by clause of the window function; it is meant to define the window frame, not to compare the values; max() gives you the maximum value of the expression given as argument within the window frame.
On the other hand, you want the itemKey of the most expensive order line. I think that first_value() would do what you want:
first_value(orderLines.itemKey) over(
partition by orderHeader.orderKey
order by orderLines.price * orderLines.OrderedQty desc
) as [MaxPriceItem]
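A quick way to check the first_value() suggestion is to run it on the three order lines from the question. The sketch below uses SQLite (3.25+) through Python's sqlite3 module instead of SQL Server, which is an assumption, and a single orderLines table without the join to orderHeader.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orderLines "
    "(orderKey INTEGER, lineKey INTEGER, itemKey INTEGER, OrderedQty INTEGER, price INTEGER)"
)
conn.executemany(
    "INSERT INTO orderLines VALUES (?, ?, ?, ?, ?)",
    [
        (176141, 367037, 15159, 2, 840),
        (176141, 367038, 15346, 3, 1000),
        (176141, 367039, 15374, 5, 100),
    ],
)

# first_value picks the itemKey of the first row in the ordering,
# i.e. the line with the highest price * OrderedQty.
top_item_rows = conn.execute(
    """
    SELECT lineKey, itemKey, price * OrderedQty AS LinePrice,
           first_value(itemKey) OVER (
               PARTITION BY orderKey
               ORDER BY price * OrderedQty DESC
           ) AS MaxPriceItem
    FROM orderLines
    ORDER BY lineKey
    """
).fetchall()
for row in top_item_rows:
    print(row)
```

Because the default frame always starts at the first row of the partition, first_value returns the same itemKey on every line of the order.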
The accepted answer provides a reasonable alternate solution to the original problem, but doesn't really explain why the max() function appears to work inconsistently. (And spoiler alert, you actually can use max() as originally intended with a small tweak.)
You have to understand that aggregate functions actually operate on a window frame within a partition. By default, when the OVER clause has no ORDER BY, the frame is the entire partition. And so aggregate operations like max() and sum() do operate over the entire partition, exactly like you assumed. This default specification is defined as RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING. This just means that whatever record we're on, max() looks back all the way to the first row in the partition, and all the way forward to the last row in the partition, in order to calculate the value.
But there's an insidious gotcha: adding an ORDER BY clause to the window changes the default frame specification to RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW. This means that whatever record we're on, max() looks back all the way to the first row in the partition, and then only up to the current row, in order to calculate the value. You can see this clearly in your last example (simplified a bit):
SELECT orderKey, itemKey, price,
ROW_NUMBER() OVER(PARTITION BY orderKey ORDER BY price DESC) AS [PartitionRowNum],
MAX(itemKey) OVER (PARTITION BY orderKey ORDER BY price DESC) as [MaxPriceItem]
FROM orders
Result/explanation:
| orderKey | itemKey | Price | PartitionRowNum | MaxPriceItem | Commentary |
|----------|---------|-------|-----------------|--------------|------------------------|
| 176141 | 15346 | 1000 | 1 | 15346 | Taking max of rows 1-1 |
| 176141 | 15159 | 840 | 2 | 15346 | Taking max of rows 1-2 |
| 176141 | 15374 | 100 | 3 | 15374 | Taking max of rows 1-3 |
SOLUTION
We can explicitly indicate the window frame specification by adding RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING to the OVER clause as follows:
SELECT orderKey, itemKey, price,
ROW_NUMBER() OVER(PARTITION BY orderKey ORDER BY price DESC) AS [PartitionRowNum],
MAX(itemKey) OVER (PARTITION BY orderKey ORDER BY price DESC RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) as [MaxPriceItem]
FROM orders
Result/explanation:
| orderKey | itemKey | Price | PartitionRowNum | MaxPriceItem | Commentary |
|----------|---------|-------|-----------------|--------------|------------------------|
| 176141 | 15346 | 1000 | 1 | 15374 | Taking max of rows 1-3 |
| 176141 | 15159 | 840 | 2 | 15374 | Taking max of rows 1-3 |
| 176141 | 15374 | 100 | 3 | 15374 | Taking max of rows 1-3 |
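The two frame behaviours can be reproduced side by side with a small sketch. It uses SQLite (3.25+) through Python's sqlite3 module rather than SQL Server, which is an assumption; the simplified orders table follows the answer's example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (orderKey INTEGER, itemKey INTEGER, price INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(176141, 15346, 1000), (176141, 15159, 840), (176141, 15374, 100)],
)

frame_comparison = conn.execute(
    """
    SELECT itemKey,
           -- default frame: partition start up to the current row
           MAX(itemKey) OVER (PARTITION BY orderKey
                              ORDER BY price DESC) AS running_max,
           -- explicit frame: the whole partition
           MAX(itemKey) OVER (PARTITION BY orderKey
                              ORDER BY price DESC
                              RANGE BETWEEN UNBOUNDED PRECEDING
                                        AND UNBOUNDED FOLLOWING) AS partition_max
    FROM orders
    ORDER BY price DESC
    """
).fetchall()
for row in frame_comparison:
    print(row)
```

With the full frame, MAX(itemKey) returns the largest itemKey in the order, 15374 in this data. That is the key with the highest numeric value, which need not be the item on the most expensive line.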
I have a table with multiple records for each patient.
My end goal is a table that is 1-to-1 between Patient_id and Value.
I would like to de-duplicate (in respect to patient_id) my rows based on "a hierarchical series of aggregate functions" (if someone has a better way to phrase this, I'd appreciate that as well.)
+----+------------+------------+------------+----------+-----------------+-------+
| ID | patient_id | Date | Date2 | Priority | Source | Value |
+----+------------+------------+------------+----------+-----------------+-------+
| 1 | 1 | 2017-09-09 | 2018-09-09 | 1 | 'verified' | 55 |
| 2 | 1 | 2017-09-09 | 2018-11-11 | 2 | 'verified' | 78 |
| 3 | 1 | 2017-11-11 | 2018-09-09 | 3 | 'verified' | 23 |
| 4 | 1 | 2017-11-11 | 2018-11-11 | 1 | 'self_reported' | 11 |
| 5 | 1 | 2017-09-09 | 2018-09-09 | 2 | 'self_reported' | 90 |
| 5 | 1 | 2017-09-09 | 2018-09-09 | 3 | 'self_reported' | 34 |
| 6 | 2 | 2017-11-11 | 2018-09-09 | 2 | 'self_reported' | 21 |
+----+------------+------------+------------+----------+-----------------+-------+
For each patient_id, I would like to get the row(s) that has/have the MAX(Date). In the case that there are still duplicated patient_id, I would like to get the row(s) with the MIN(Priority). In the case that there are still duplicated rows I would like to get the row(s) with the MIN(Date2).
The way I've approached this problem is using a series of queries like this to de-duplicate on the columns one at a time.
SELECT *
FROM #table t1
LEFT JOIN
(SELECT
patient_id,
MIN(priority) AS min_priority
FROM #table
GROUP BY patient_id) t2 ON t2.patient_id = t1.patient_id
WHERE t2.min_priority = t1.priority
Is there a way to do this that allows me to de-dup on multiple columns at once? Is there a more elegant way to do this?
I'm able to get my results, but my solution feels very inefficient, and I keep running into this. Thank you for any input.
You could use row_number(), if your RDBMS supports it:
select ID, patient_id, Date, Date2, Priority, Source, Value
from (
select
t.*,
row_number() over(partition by patient_id order by Date desc, Priority, Date2) rn
from mytable t
) x where rn = 1
Another option is to filter with a correlated subquery that sorts the records according to your criteria, like so:
select t.*
from mytable t
where id = (
select id
from mytable t1
where t1.patient_id = t.patient_id
order by t1.Date desc, t1.Priority, t1.Date2
limit 1
)
The actual syntax for limit varies across RDBMS.
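As a sanity check of the row_number() ordering, the sketch below applies it to the sample rows from the question. It runs on SQLite (3.25+) through Python's sqlite3 module, which is an assumption, and mytable is the placeholder name used in the answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE mytable (id INTEGER, patient_id INTEGER, date TEXT, "
    "date2 TEXT, priority INTEGER, source TEXT, value INTEGER)"
)
conn.executemany(
    "INSERT INTO mytable VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        (1, 1, "2017-09-09", "2018-09-09", 1, "verified", 55),
        (2, 1, "2017-09-09", "2018-11-11", 2, "verified", 78),
        (3, 1, "2017-11-11", "2018-09-09", 3, "verified", 23),
        (4, 1, "2017-11-11", "2018-11-11", 1, "self_reported", 11),
        (5, 1, "2017-09-09", "2018-09-09", 2, "self_reported", 90),
        (5, 1, "2017-09-09", "2018-09-09", 3, "self_reported", 34),
        (6, 2, "2017-11-11", "2018-09-09", 2, "self_reported", 21),
    ],
)

# The ORDER BY encodes the hierarchy: newest date, then lowest priority,
# then earliest date2. Row number 1 is the winner for each patient.
deduped = conn.execute(
    """
    SELECT id, patient_id, value
    FROM (
        SELECT t.*,
               row_number() OVER (PARTITION BY patient_id
                                  ORDER BY date DESC, priority, date2) AS rn
        FROM mytable AS t
    ) AS x
    WHERE rn = 1
    ORDER BY patient_id
    """
).fetchall()
print(deduped)
```

If rows can still tie on all three columns, row_number() picks one of them arbitrarily; rank() would keep all tied rows instead.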
I have a table and need to get the difference between two dates for a very similar set of records. I've tried a few methods today but cannot seem to get this one to work.
Example Table:
Payment_ID | Created_Date | Version_ID | Status
----------------------------------------------------------
1526 | 20/10/2015 | 1 | Opened
1526 | 20/10/2015 | 2 | Verified Open
1526 | 22/10/2015 | 3 | Assigned
1526 | 23/10/2015 | 4 | Contact Made
1859 | 20/10/2015 | 1 | Opened
1859 | 20/10/2015 | 2 | Verified Open
1859 | 22/10/2015 | 3 | Assigned
1859 | 22/10/2015 | 3.5 | Re-Assigned
1859 | 22/10/2015 | 4.5 | Contact Failed
1859 | 23/10/2015 | 4 | Contact Made
1859 | 24/10/2015 | 5 | Assigned Updated
1859 | 25/10/2015 | 6 | Contact Made
1859 | 26/10/2015 | 7 | Resolved
1859 | 21/10/2015 | 8 | Closed
1852 | 26/10/2015 | 1 | Opened
1778 | 21/09/2015 | 1 | Opened
1778 | 22/09/2015 | 2 | Verified Open
1778 | 23/09/2015 | 3 | Assigned
1778 | 24/09/2015 | 4 | Contact Made
1778 | 25/09/2015 | 5 | Assigned Updated
The requirement is to return the Payment_ID and StatusDateDiff for a given Status, in this case "Contact Made" (only the first one if a Payment_ID has more than one), taking the difference between that date and the date of the previous status, whatever that status was.
So, taking 1526, "Contact Made" was on 23/10/2015 and the previous status, regardless of what that was, is on 22/10/2015, so the difference is 1.
For the above it would look like this:
Payment_ID | StatusDateDiff
-----------------------------
1526 | 1
1859 | 1
1852 | 0
1778 | 1
I tried a few sub queries to get the distinct Payment_ID and Min(Created_Date), but that resulted in duplicates once put together.
Also tried a Common Table Expression, but that led to the same result, though I'm not too familiar with them.
Any thoughts would be appreciated.
Use LAG() (available in SQL Server 2012+):
select payment_id, datediff(day, prev_created_date, created_date)
from (select t.*,
lag(created_date) over (partition by payment_id order by created_date) as prev_created_date,
row_number() over (partition by payment_id, status order by created_date) as seqnum
from t
) t
where status = 'Contact Made' and seqnum = 1;
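The LAG() query can be exercised with a short sketch on a subset of the sample rows. It runs on SQLite (3.25+) through Python's sqlite3 module rather than SQL Server, so DATEDIFF is replaced by a julianday() difference; the table name payments and the version_id tiebreaker in the ORDER BY are assumptions added for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE payments "
    "(payment_id INTEGER, created_date TEXT, version_id INTEGER, status TEXT)"
)
conn.executemany(
    "INSERT INTO payments VALUES (?, ?, ?, ?)",
    [
        (1526, "2015-10-20", 1, "Opened"),
        (1526, "2015-10-20", 2, "Verified Open"),
        (1526, "2015-10-22", 3, "Assigned"),
        (1526, "2015-10-23", 4, "Contact Made"),
        (1859, "2015-10-22", 3, "Assigned"),
        (1859, "2015-10-23", 4, "Contact Made"),
        (1859, "2015-10-24", 5, "Assigned Updated"),
        (1859, "2015-10-25", 6, "Contact Made"),
        (1778, "2015-09-23", 3, "Assigned"),
        (1778, "2015-09-24", 4, "Contact Made"),
    ],
)

status_diffs = conn.execute(
    """
    SELECT payment_id,
           CAST(ROUND(julianday(created_date) - julianday(prev_created_date))
                AS INTEGER) AS status_date_diff
    FROM (
        SELECT p.*,
               -- date of the previous status row, whatever its status
               LAG(created_date) OVER (PARTITION BY payment_id
                                       ORDER BY created_date, version_id)
                   AS prev_created_date,
               -- numbers repeated statuses so only the first one is kept
               ROW_NUMBER() OVER (PARTITION BY payment_id, status
                                  ORDER BY created_date, version_id) AS seqnum
        FROM payments AS p
    ) AS x
    WHERE status = 'Contact Made' AND seqnum = 1
    ORDER BY payment_id
    """
).fetchall()
print(status_diffs)
```

The second "Contact Made" row of payment 1859 gets seqnum 2 and is filtered out, so each payment contributes at most one row.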
This is untested, but this should point you in the right direction. You can use a windowed ROW_NUMBER() function to determine which values are the latest, and do a DATEDIFF() to find the number of days they are different.
Edit: I just noticed you have a SQL Server tag and an Oracle tag - this answer is for SQL Server
;With Ver As
(
Select *,
Row_Number() Over (Partition By Payment_ID Order By Version_ID Desc) As RowNum
From [Table]
)
Select Latest.Payment_ID,
DateDiff(Day, Coalesce(Previous.Created_Date, Latest.Created_Date), Latest.Created_Date) As StatusDateDiff
From Ver As Latest
Left Join Ver As Previous On Latest.Payment_ID = Previous.Payment_ID
And Previous.RowNum = 2
Where Latest.RowNum = 1