I'm working on a query to organize install and removal dates for car part numbers. I want a record of every part install, plus the matching removal if the part has since been removed from a vehicle, identified by its VIN. I'm having trouble associating these events because the only thing tying them together is the dates: a removal must occur after an install, and a part cannot be installed again unless it has been removed first.
I have been able to summarize the data into separate rows by event type (e.g. each install has its own row and each removal has its own row).
What I've tried is using DECODE() by event type, but that keeps the records in separate rows. Maybe COALESCE() can help here, but I'm not sure.
Here's a summary of how the data looks:
part_no | serial_no | car_vin | event_type | event_date
12345 | a1b2c3 | 9876543 | INSTALL | 01-JAN-2019
12345 | a1b2c3 | 9876543 | REMOVE | 01-AUG-2019
54321 | t3c4a8 | 9876543 | INSTALL | 01-MAR-2019
12345 | a1b2c3 | 3456789 | INSTALL | 01-SEP-2019
And here's what the expected outcome is:
part_no | serial_no | car_vin | install_date | remove_date
12345 | a1b2c3 | 9876543 | 01-JAN-2019 | 01-AUG-2019
12345 | a1b2c3 | 3456789 | 01-SEP-2019 |
54321 | t3c4a8 | 9876543 | 01-MAR-2019 |
We can use pivoting logic here:
SELECT
part_no,
serial_no,
car_vin,
MAX(CASE WHEN event_type = 'INSTALL' THEN event_date END) AS install_date,
MAX(CASE WHEN event_type = 'REMOVE' THEN event_date END) AS remove_date
FROM yourTable
GROUP BY
part_no,
serial_no,
car_vin
ORDER BY
part_no;
This approach is a typical way to transform a key-value style table, which is essentially what your table is, into the output you want to see.
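One caveat, in case a part can go through more than one install/remove cycle on the same vehicle: the MAX(CASE ...) pivot collapses everything to a single row per part/serial/VIN, so repeated cycles would get merged. A hedged sketch (reusing the yourTable name from the answer, untested) that pairs the nth install with the nth removal via ROW_NUMBER():

```sql
-- Number installs and removals separately per part/serial/vin,
-- then pair the 1st install with the 1st removal, 2nd with 2nd, etc.
-- Relies on the stated rule that a removal always follows its install.
WITH numbered AS (
  SELECT part_no, serial_no, car_vin, event_type, event_date,
         ROW_NUMBER() OVER (
           PARTITION BY part_no, serial_no, car_vin, event_type
           ORDER BY event_date
         ) AS cycle_no
  FROM yourTable
)
SELECT i.part_no, i.serial_no, i.car_vin,
       i.event_date AS install_date,
       r.event_date AS remove_date
FROM numbered i
LEFT JOIN numbered r
  ON  r.part_no    = i.part_no
  AND r.serial_no  = i.serial_no
  AND r.car_vin    = i.car_vin
  AND r.event_type = 'REMOVE'
  AND r.cycle_no   = i.cycle_no
WHERE i.event_type = 'INSTALL'
ORDER BY i.part_no, i.event_date;
```

For the sample data this produces the same three rows as the pivot; it only differs once a part is installed on the same vehicle twice.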
You can use the SQL for Pattern Matching (MATCH_RECOGNIZE):
WITH t(part_no,serial_no,car_vin,event_type,event_date) AS
(SELECT 12345, 'a1b2c3', 9876543, 'INSTALL', DATE '2019-01-01' FROM dual
UNION ALL SELECT 12345, 'a1b2c3', 9876543, 'REMOVE', DATE '2019-08-01' FROM dual
UNION ALL SELECT 54321, 't3c4a8', 9876543, 'INSTALL', DATE '2019-03-01' FROM dual
UNION ALL SELECT 12345, 'a1b2c3', 3456789, 'INSTALL', DATE '2019-09-01' FROM dual)
SELECT part_no,serial_no,car_vin, INSTALL_DATE, REMOVE_DATE
FROM t
MATCH_RECOGNIZE (
PARTITION BY part_no,serial_no,car_vin
ORDER BY event_date
MEASURES
FINAL MAX(REMOVE.event_date) AS REMOVE_DATE,
FINAL MAX(INSTALL.event_date) AS INSTALL_DATE
PATTERN ( INSTALL REMOVE? )
DEFINE
REMOVE AS event_type = 'REMOVE',
INSTALL AS event_type = 'INSTALL'
)
ORDER BY part_no, INSTALL_DATE, REMOVE_DATE;
+--------------------------------------------------+
|PART_NO|SERIAL_NO|CAR_VIN|INSTALL_DATE|REMOVE_DATE|
+--------------------------------------------------+
|12345 |a1b2c3 |9876543|01.01.2019 |01.08.2019 |
|12345 |a1b2c3 |3456789|01.09.2019 | |
|54321 |t3c4a8 |9876543|01.03.2019 | |
+--------------------------------------------------+
The key clause here is PATTERN ( INSTALL REMOVE? ). It means: exactly one INSTALL event followed by zero or one REMOVE event.
If you can have more than one INSTALL event, use PATTERN ( INSTALL+ REMOVE? ).
If you can have more than one INSTALL event and optionally more than one REMOVE event, use PATTERN ( INSTALL+ REMOVE* ).
You can simply add more events, e.g. ORDER, DISPOSAL, etc.
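For illustration, the most permissive variant only requires swapping the PATTERN line in the query above; this is a sketch of that clause in context (note that with repeated events in one match, the FINAL MAX measures report the latest install and removal dates of the run, which may or may not be what you want):

```sql
-- Same MATCH_RECOGNIZE clause as above, with the relaxed pattern.
MATCH_RECOGNIZE (
  PARTITION BY part_no, serial_no, car_vin
  ORDER BY event_date
  MEASURES
    FINAL MAX(REMOVE.event_date)  AS REMOVE_DATE,
    FINAL MAX(INSTALL.event_date) AS INSTALL_DATE
  PATTERN ( INSTALL+ REMOVE* )   -- one or more installs, zero or more removals
  DEFINE
    REMOVE  AS event_type = 'REMOVE',
    INSTALL AS event_type = 'INSTALL'
)
```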
I'm looking for some kind of SQL window function that calculates values based on a calculated value from a previous iteration when looping over the window. I'm not looking for lag(), which just takes the original value of the previous row.
Here is the case: we have web analytics sessions, and we would like to attribute each session to the last relevant channel. There are 3 channels: direct, organic and paid. However, they have different priorities: paid is always relevant; organic is only relevant if there was no paid channel in the last 30 days; and direct is only relevant if there was neither a paid nor an organic channel in the last 30 days.
So in the example table we would like to calculate the values in the 'attributed' column based on the channel and date columns. Note that the data is there for several users, so this should be calculated per user.
+─────────────+───────+──────────+─────────────+
| date | user | channel | attributed |
+─────────────+───────+──────────+─────────────+
| 2022-01-01 | 123 | direct | direct |
| 2022-01-14 | 123 | paid | paid |
| 2022-02-01 | 123 | direct | paid |
| 2022-02-12 | 123 | direct | paid |
| 2022-02-13 | 123 | organic | paid |
| 2022-03-08 | 123 | direct | direct |
| 2022-03-10 | 123 | paid | paid |
+─────────────+───────+──────────+─────────────+
So in the table above, row 1 is attributed direct because it's the first line. The second is paid, as that has priority over direct. It stays paid for the next 2 sessions since direct has lower priority, then it switches to organic as the paid attribution is older than 30 days. The last one is then paid again, as it has a higher priority than organic.
I would know how to solve it if the decision whether a new channel needs to be attributed could be made based only on the current row and the previous one. Here is the SQL to do that:
with source as ( -- example data
select cast("2022-01-01" as date) as date, 123 as user, "direct" as channel
union all
select "2022-01-14", 123, "paid"
union all
select "2022-02-01", 123, "direct"
union all
select "2022-02-12", 123, "direct"
union all
select "2022-02-13", 123, "organic"
union all
select "2022-03-08", 123, "direct"
union all
select "2022-03-10", 123, "paid"
),
flag_new_channel as( -- flag sessions that would override channel information; this only works statically here
select *,
case
when lag(channel) over (partition by user order by date) is null then 1
when date_diff(date,lag(date) over (partition by user order by date),day)>30 then 1
when channel = "paid" then 1
when channel = "organic" and lag(channel) over (partition by user order by date)!='paid' then 1
else 0
end flag
from source
qualify flag=1
)
select s.*,
f.channel attributed_channel,
row_number() over (partition by s.user, s.date order by f.date desc) rn -- number of flagged previous sessions
from source s
left join flag_new_channel f
on s.date>=f.date
qualify rn=1 --only keep the last flagged session at or before the current session
However, this would for example set "organic" in row 5 because it doesn't know "paid" is still relevant.
+─────────────+───────+──────────+─────────────────────+
| date | user | channel | attributed_channel |
+─────────────+───────+──────────+─────────────────────+
| 2022-01-01 | 123 | direct | direct |
| 2022-01-14 | 123 | paid | paid |
| 2022-02-01 | 123 | direct | paid |
| 2022-02-12 | 123 | direct | paid |
| 2022-02-13 | 123 | organic | organic |
| 2022-03-08 | 123 | direct | organic |
| 2022-03-10 | 123 | paid | paid |
+─────────────+───────+──────────+─────────────────────+
Any ideas? Not sure recursive queries can help or udfs. I’m using BigQuery usually but if you know solutions in other dialects it would still be interesting to know.
Here's one approach:
Updated: corrected; I initially lost track of your expected result due to the confusing description.
For PostgreSQL, we can do something like this (with CTE and window functions):
The fiddle for PG 14
pri - provides a table of (channel, priority) pairs
cte0 - provides the test data
cte1 - determines the minimum priority over the last 30 days per user
final - the final query expression obtains the attributed channel name
WITH pri (channel, pri) AS (
VALUES ('paid' , 1)
, ('organic' , 2)
, ('direct' , 3)
)
, cte0 (date, xuser, channel) AS (
VALUES
('2022-01-01'::date, 123, 'direct')
, ('2022-01-14' , 123, 'paid')
, ('2022-02-01' , 123, 'direct')
, ('2022-02-12' , 123, 'direct')
, ('2022-02-13' , 123, 'organic')
, ('2022-03-08' , 123, 'direct')
, ('2022-03-10' , 123, 'paid')
)
, cte1 AS (
SELECT cte0.*
, pri.pri
, MIN(pri.pri) OVER (PARTITION BY xuser ORDER BY date
RANGE BETWEEN INTERVAL '30' DAY PRECEDING AND CURRENT ROW
) AS mpri
FROM cte0
JOIN pri
ON pri.channel = cte0.channel
)
SELECT cte1.*
, pri.channel AS attributed
FROM cte1
JOIN pri
ON pri.pri = cte1.mpri
;
The result:
+─────────────+───────+──────────+─────+──────+─────────────+
| date        | xuser | channel  | pri | mpri | attributed  |
+─────────────+───────+──────────+─────+──────+─────────────+
| 2022-01-01  | 123   | direct   | 3   | 3    | direct      |
| 2022-01-14  | 123   | paid     | 1   | 1    | paid        |
| 2022-02-01  | 123   | direct   | 3   | 1    | paid        |
| 2022-02-12  | 123   | direct   | 3   | 1    | paid        |
| 2022-02-13  | 123   | organic  | 2   | 1    | paid        |
| 2022-03-08  | 123   | direct   | 3   | 2    | organic     |
| 2022-03-10  | 123   | paid     | 1   | 1    | paid        |
+─────────────+───────+──────────+─────+──────+─────────────+
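Since the question mentions BigQuery: the same minimum-priority-in-a-window idea should carry over, though BigQuery's RANGE frames want a numeric ordering key, so one option is to order by UNIX_DATE(). This is an untested sketch assuming a sessions table with (date, user, channel) columns like the example:

```sql
-- BigQuery sketch: lowest channel priority seen in the trailing 30 days wins.
WITH pri AS (
  SELECT 'paid' AS channel, 1 AS pri UNION ALL
  SELECT 'organic', 2 UNION ALL
  SELECT 'direct', 3
),
ranked AS (
  SELECT s.date, s.user, s.channel, p.pri,
         MIN(p.pri) OVER (
           PARTITION BY s.user
           ORDER BY UNIX_DATE(s.date)        -- days since epoch, numeric
           RANGE BETWEEN 30 PRECEDING AND CURRENT ROW
         ) AS mpri
  FROM sessions s
  JOIN pri p ON p.channel = s.channel
)
SELECT r.date, r.user, r.channel, p.channel AS attributed
FROM ranked r
JOIN pri p ON p.pri = r.mpri
ORDER BY r.user, r.date;
```

It should match the PostgreSQL output above, including the organic attribution on 2022-03-08 rather than the direct shown in the question's expected table.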
Our data table looks like this:
Machine Name | Lot Number | Qty | Load TxnDate | Unload TxnDate
M123 | ABC | 500 | 10/1/2020 | 10/2/2020
M741 | DEF | 325 | 10/1/2020 |
M123 | ZZZ | 100 | 10/5/2020 | 10/7/2020
M951 | AAA | 550 | 10/5/2020 | 10/9/2020
M123 | BBB | 550 | 10/7/2020 |
I need to create an SQL query that shows the currently loaded Lot number - Machines with no Unload TxnDate - and the last loaded Lot number based on the unload TxnDate.
So in the example, when I run a query for M123, the result will show:
Machine Name | Lot Number | Qty | Load TxnDate | Unload TxnDate
M123 | ZZZ | 100 | 10/5/2020 | 10/7/2020
M123 | BBB | 550 | 10/7/2020 |
As you can see, although machine M123 has 3 records, the results only show the currently loaded lot and the last loaded one. Is there any way to replicate this? The Machine Name is dynamic: my users can enter a Machine Name and see that machine's results based on the missing Unload TxnDate and the last Unload TxnDate.
You seem to want the last two rows. That would be something like this:
select t.*
from t
where machine_name = 'M123'
order by load_txn_date desc
fetch first 2 rows only;
Note: not all databases support the fetch first clause. Some spell it limit, or select top, or something else entirely.
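For illustration, the same query in two other common spellings (a sketch; exact syntax varies by product and version):

```sql
-- MySQL / PostgreSQL / SQLite: LIMIT
select t.*
from t
where machine_name = 'M123'
order by load_txn_date desc
limit 2;

-- SQL Server: TOP
select top (2) t.*
from t
where machine_name = 'M123'
order by load_txn_date desc;
```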
If you want two rows per machine, one option uses window functions:
select *
from (
select t.*,
row_number() over(
partition by machine_name, (case when unload_txn_date is null then 0 else 1 end)
order by coalesce(unload_txn_date, load_txn_date) desc
) rn
from mytable t
) t
where rn = 1
The idea is to split the rows into those that have an unload date and those that do not; we can then take the top record per group.
For your sample data, this returns:
Machine_Name | Lot_Number | Qty | Load_Txn_Date | Unload_Txn_Date | rn
:----------- | :--------- | --: | :------------ | :-------------- | -:
M123 | BBB | 550 | 2020-10-07 | null | 1
M123 | ZZZ | 100 | 2020-10-05 | 2020-10-07 | 1
M741 | DEF | 325 | 2020-10-01 | null | 1
M951 | AAA | 550 | 2020-10-05 | 2020-10-09 | 1
You might use the following query, presuming you're on a database that has window (or analytic) functions:
WITH t AS
(
SELECT COALESCE(Unload_Txn_Date -
LAG(Load_Txn_Date) OVER
(PARTITION BY Machine_Name ORDER BY Load_Txn_Date DESC),0) AS lg,
MAX(CASE WHEN Unload_Txn_Date IS NULL THEN Load_Txn_Date END) OVER
(PARTITION BY Machine_Name) AS mx,
t.*
FROM tab t
), t2 AS
(
SELECT DENSE_RANK() OVER (ORDER BY mx DESC NULLS LAST) AS dr, t.*
FROM t
WHERE mx IS NOT NULL
)
SELECT Machine_Name,Lot_Number,Qty,Load_Txn_Date,Unload_Txn_Date
FROM t2
WHERE dr = 1 AND lg = 0
ORDER BY Load_Txn_Date
Here, if a row's Unload_Txn_Date equals the next row's Load_Txn_Date, it's accepted that the job continues without interruption (lg = 0), while mx marks, per machine, the Load_Txn_Date of the row with no unload date (the currently loaded lot). The final query then filters on these window-function values to return the currently loaded lot and the lot unloaded immediately before it.
I have a table in Amazon Redshift named 'inventory'.
This is a data pull from external systems. It happens twice a day, once in the morning (right at opening) and once right after closing. There are multiple locations, identified by the location_id column below.
I want to figure out the total items sold based on the column 'total_inventory'.
There is a column 'import_time' which has two possible values, 'am' and 'pm'.
All of this should be done by date, using the column 'import_date'.
Data may look like this:
item_id | location_id | total_inventory | import_date | import_time
-------------------------------------------------------------------
10123 | 3 | 10 | 2019-10-01 | am
10123 | 3 | 3 | 2019-10-01 | pm
10123 | 3 | 7 | 2019-10-02 | am
10123 | 3 | 6 | 2019-10-02 | pm
I would ideally like to be able to see results of total_sold such as:
item_id | location_id | total_sold | import_date
------------------------------------------------
10123 | 3 | 7 | 2019-10-01
10123 | 3 | 1 | 2019-10-02
Note: Daily start levels have nothing to do with previous stock levels as they are replenished over night.
Also note: I have inherited this issue, and if structural changes are required, I can do so, but if possible to avoid it would be helpful.
I have attempted to look at other answers where arithmetic is being done based on column values, but I did not see (or rather, understand) a fit that would work for me.
Full Transparency: My SQL skills are fairly weak as of late due to not using them in a long while, so please go easy on me if I have asked a foolish question.
If the pm value is always less than the am value, you can do:
select import_date, item_id, location_id,
max(total_inventory) - min(total_inventory) as total_sold
from t
group by import_date, item_id, location_id;
However, I suspect you really want conditional aggregation, subtracting the pm level from the am level:
select import_date, item_id, location_id,
(max(case when import_time = 'am' then total_inventory end) -
max(case when import_time = 'pm' then total_inventory end)
) as total_sold
from t
group by import_date, item_id, location_id;
So let me describe the problem:
-I have a task table with an assignee column, a created column and a resolved column
(both created and resolved are timestamps)
+---------+----------+------------+------------+
| task_id | assignee | created | resolved |
+---------+----------+------------+------------+
| tsk1 | him | 2000-01-01 | 2018-01-03 |
+---------+----------+------------+------------+
-I have a change log table with a task_id, a from column, a to column and a date column that records each time the assignee is changed
+---------+----------+------------+------------+
| task_id | from | to | date |
+---------+----------+------------+------------+
| tsk1 | me | you | 2017-04-06 |
+---------+----------+------------+------------+
| tsk1 | you | him | 2017-04-08 |
+---------+----------+------------+------------+
I want to select a table that shows a list of all the assignees that worked on a task within an interval
+---------+----------+------------+------------+
| task_id | assignee | from | to |
+---------+----------+------------+------------+
| tsk1 | me | 2000-01-01 | 2017-04-06 |
+---------+----------+------------+------------+
| tsk1 | you | 2017-04-06 | 2017-04-08 |
+---------+----------+------------+------------+
| tsk1 | him | 2017-04-08 | 2018-01-03 |
+---------+----------+------------+------------+
I'm having trouble with the first (and last) row, where from (and to) should be set to created (and resolved); I don't know how to build a column from data in two different tables...
I've tried making them in their own selects and then merging all rows with a union, but I don't think this is a very good solution...
Hmmm . . . This is trickier than it seems. The idea is to use lead() to get the next date, but you need to "augment" the data with information from the tasks table:
select task_id, "to" as assignee, date as fromdate,
coalesce(lead(date) over (partition by task_id order by date),
max(resolved) over (partition by task_id)
) as todate
from ((select task_id, "to", date, null::timestamp as resolved
from log
) union all
(select distinct on (t.task_id) t.task_id, l."from", t.created, t.resolved
from task t join
log l
on t.task_id = l.task_id
order by t.task_id, l.date
)
) t;
SELECT
l.task_id,
assignee_from as assignee,
COALESCE(
lag(assign_date) OVER (ORDER BY assign_date),
created
) as date_from,
assign_date as date_to
FROM
log l
JOIN
task t
ON l.task_id = t.task_id
UNION ALL
SELECT * FROM (
SELECT DISTINCT ON (l.task_id)
l.task_id, assignee_to, assign_date, resolved
FROM
log l
JOIN
task t
ON l.task_id = t.task_id
ORDER BY l.task_id, assign_date DESC
) s
ORDER BY task_id, date_from
The UNION consists of two parts: the rows from the log, and finally the last row from the task table.
The first part uses the LAG() window function to get the date previous to the current row. Because "me" has no previous row, that would yield a NULL value, which is caught by taking the created date from the task table instead.
The second part gets the last row: here I take the last row of the log via DISTINCT ON and ORDER BY assign_date DESC, which gives me the last assignee_to. The rest is similar to the first part: taking the resolved value from the task table.
Thanks to the answer from S-Man and Gordon Linoff, I was able to come up with this solution:
SELECT t.task_id,
t.item_from AS assignee,
COALESCE(lag(t.changelog_created) OVER (
PARTITION BY t.task_id ORDER BY t.changelog_created),
max(t.creationdate) OVER (PARTITION BY t.task_id)) AS fromdate,
t.changelog_created as todate
FROM ( SELECT ch.task_id,
ch.item_from,
ch.changelog_created,
NULL::timestamp without time zone AS creationdate
FROM changelog_generic_expanded_view ch
WHERE ch.field::text = 'assignee'::text
UNION ALL
( SELECT DISTINCT ON (t_1.id_task) t_1.id_task,
t_1.assigneekey,
t_1.resolutiondate,
t_1.creationdate
FROM task_jira t_1
ORDER BY t_1.id_task)) t;
Note: this is the final version, so the names are a bit different, but the idea stays the same.
This is basically the same code as Gordon Linoff's, but I go through the changelog in the opposite direction.
I use the 2nd part of the UNION ALL to generate the last assignee instead of the first (this handles the case where there is no changelog at all: the last assignee is then generated without involving changelogs).
I want to design a query to find out whether there is at least one cat (select count(*) where rownum = 1) that hasn't been checked out.
One weird condition is that the result should exclude all but the most recent cat that hasn't been checked out, so that:
TABLE schedule
-------------------------------------
| type | checkin | checkout
-------------------------------------
| cat | 20:10 | (null)
| dog | 19:35 | (null)
| dog | 19:35 | (null)
| cat | 15:31 | (null) ----> exclude this cat in this scenario
| dog | 12:47 | 13:17
| dog | 10:12 | 12:45
| cat | 08:27 | 11:36
should return 1, the first record
| cat | 20:10 | (null)
I came up with a query like
select * from schedule where type = 'cat' and checkout is null order by checkin desc
however, this query does not handle the exclusion. I can certainly handle it in the service layer (e.g. Java), but I'm wondering whether it can be designed into the query, with good performance when there is a large amount of data in the table (checkin and checkout are indexed, but type is not).
How about this?
Select *
From schedule
Where type='cat' and checkin=(select max(checkin) from schedule where type='cat' and checkout is null);
Assuming the checkin and checkout data type is string (which it shouldn't be; it should be DATE), to_date(checkin, 'hh24:mi') will create a value of the proper data type, DATE, assuming the first day of the current month as the "date" portion. It shouldn't matter to you, since presumably all the times are from the same date. If in fact checkin/checkout are already the proper DATE data type, you don't need the to_date() call in the order by (in two places).
I left out the checkout column from the output, since you are only looking for the rows with null in that column, so including it would provide no information. I would have left out type as well, but perhaps you'll want to have this for cats AND dogs at some later time...
with
schedule( type, checkin, checkout ) as (
select 'cat', '20:10', null from dual union all
select 'dog', '19:35', null from dual union all
select 'dog', '19:35', null from dual union all
select 'cat', '15:31', null from dual union all
select 'dog', '12:47', '13:17' from dual union all
select 'dog', '10:12', '12:45' from dual union all
select 'cat', '08:27', '11:36' from dual
)
-- end of test data; actual solution (SQL query) begins below this line
select type, checkin
from ( select type, checkin,
row_number() over (order by to_date(checkin, 'hh24:mi')) as rn
from schedule
where type = 'cat' and checkout is null
)
where rn > 1
order by to_date(checkin, 'hh24:mi') -- ORDER BY is optional
;
TYPE CHECKIN
---- -------
cat 20:10