I currently have the dataset below:
Group | Start      | End
:---- | :--------- | :---------
A     | 2021-01-01 | 2021-04-05
A     | 2021-01-01 | 2021-06-05
A     | 2021-03-01 | 2021-06-05
B     | 2021-06-13 | 2021-08-05
B     | 2021-06-13 | 2021-09-05
B     | 2021-07-01 | 2021-09-05
C     | 2021-10-07 | 2021-10-17
C     | 2021-10-07 | 2021-11-15
C     | 2021-11-12 | 2021-11-15
I want the final dataset to look like the following. Essentially, I would like to remove, within each group, every observation whose Start does not equal the group's minimum Start.
Group | Start      | End
:---- | :--------- | :---------
A     | 2021-01-01 | 2021-04-05
A     | 2021-01-01 | 2021-06-05
B     | 2021-06-13 | 2021-08-05
B     | 2021-06-13 | 2021-09-05
C     | 2021-10-07 | 2021-10-17
C     | 2021-10-07 | 2021-11-15
I tried the following code, but an aggregate like MIN() isn't allowed in a WHERE clause. Any help would be appreciated.
Delete from #df1
where start != min(start)
If you want to remove all rows that don't have the earliest [Start], you can join a subquery that finds the earliest day per [Group]; you can add additional ON clauses if you need to match other rows as well:
DELETE
o1
FROM observations o1
INNER JOIN (SELECT MIN([Start]) AS minstart, [Group] FROM observations GROUP BY [Group]) o2
ON o1.[Group] = o2.[Group] AND o1.[Start] <> o2.minstart
SELECT *
FROM observations
Group | Start | End
:---- | :--------- | :---------
A | 2021-01-01 | 2021-04-05
A | 2021-01-01 | 2021-06-05
B | 2021-06-13 | 2021-08-05
B | 2021-06-13 | 2021-09-05
C | 2021-10-07 | 2021-10-17
C | 2021-10-07 | 2021-11-15
You can try a correlated subquery that compares each row's start with its own group's minimum:
DELETE t FROM table_name t
WHERE t.start <> (SELECT MIN(t2.start) FROM table_name t2 WHERE t2.[Group] = t.[Group])
Another alternative using a CTE:
with keepers as (
select [Group], min(Start) as mStart
from #df1 group by [Group]
)
delete src
from #df1 as src
where not exists (select * from keepers
where keepers.[Group] = src.[Group] and keepers.mStart = src.Start)
;
You should make an effort to avoid using reserved words (Group, Start, End) as names, since they force extra bracket-quoting every time you write SQL against the table.
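A quick way to sanity-check the delete-all-but-the-minimum-Start idea is an in-memory SQLite sketch; the table name `observations` and column name `grp` are illustrative stand-ins (avoiding the reserved words mentioned above), and the correlated subquery is equivalent to the join-based answers:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE observations (grp TEXT, start TEXT, [end] TEXT)")
conn.executemany("INSERT INTO observations VALUES (?, ?, ?)", [
    ("A", "2021-01-01", "2021-04-05"), ("A", "2021-01-01", "2021-06-05"),
    ("A", "2021-03-01", "2021-06-05"), ("B", "2021-06-13", "2021-08-05"),
    ("B", "2021-06-13", "2021-09-05"), ("B", "2021-07-01", "2021-09-05"),
    ("C", "2021-10-07", "2021-10-17"), ("C", "2021-10-07", "2021-11-15"),
    ("C", "2021-11-12", "2021-11-15"),
])

# Delete every row whose start differs from its own group's minimum start.
conn.execute("""
    DELETE FROM observations
    WHERE start <> (SELECT MIN(o2.start) FROM observations o2
                    WHERE o2.grp = observations.grp)
""")

remaining = conn.execute(
    "SELECT grp, start, [end] FROM observations ORDER BY grp, [end]"
).fetchall()
print(remaining)  # six rows, two per group, all with the group's minimum start
```

The same correlated-subquery shape should also work against the #df1 temp table in SQL Server.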
Related
Currently, I am doing an ETL task on event-record data for a process mining task. The goal is to make a "Directly Follows (DF)" matrix based on the record data. This is the flow:
I have a record (event) data, for example:
ID ev_ID Act Complete
1 1 A 2020-01-13 11:46
2 1 B 2020-01-13 11:50
3 1 C 2020-01-13 11:55
4 1 D 2020-01-13 12:50
5 1 E 2020-01-13 12:52
6 2 A 2020-01-06 09:13
7 2 B 2020-01-06 09:15
8 2 C 2020-01-06 11:46
9 2 D 2020-01-06 11:46
10 3 A 2020-01-06 08:11
11 3 C 2020-01-06 08:10
12 3 B 2020-01-06 09:46
13 3 D 2020-01-06 11:23
14 3 E 2020-01-06 16:05
As I mentioned above, I want to create a DF matrix that shows the directly-follows relation. However, I want the output represented as a table (not a matrix).
The (desired) output:
From To Frequency
A A 0
A B 3
A C 1
… … …
D E 2
… … …
E E 0
The idea is to calculate the frequency of "direct follow relation" for each activity per ev_id. For example:
We have ev_1 = [ABCD]
The ev_1 has direct follow relation: AB, BC, and CD.
So, we can calculate the direct follow frequency for each activity.
My question:
Is there anyone who can suggest how to make the output using a SQL query?
I am doing the task with PostgreSQL now.
Any help is appreciated. Thank you very much.
I tried it myself, but the result does not seem 100% correct.
This is my code:
with ev_data as (
    select
        ID as eid,
        ev_ID as ci,
        Act as ea,
        Complete as ec
    from table_name
),
A0 as (
    select
        eid,
        ci::int,
        row_number() over (partition by ci order by ci, ec) as idx,
        ea as act1,
        ea as act2
    from ev_data
),
A1 as (
    select
        L1.ci as ci1,
        L1.idx as idx1,
        L1.act1 as afrom,
        L2.ci as ci2,
        L2.idx as idx2,
        L2.act2 as ato
    from A0 as L1
    join A0 as L2
      on L1.ci = L2.ci
     and L2.idx = L1.idx + 1
)
select
    afrom,
    ato,
    count(*) as count
from A1
group by afrom, ato
order by afrom
Let me assume that your goal is the first matrix. You have two issues:
Getting the adjacent counts.
Generating the rows with 0 values.
Neither is really difficult. The first uses lead() and aggregation. The second uses cross join:
select a_f.act_from, a_t.act_to,
count(t.id)
from (select distinct act as act_from from table_name
) a_f cross join
(select distinct act as act_to from table_name
) a_t left join
(select t.*,
lead(act) over (partition by ev_id order by complete) as next_act
from table_name t
) t
on t.act = a_f.act_from and
t.next_act = a_t.act_to
group by a_f.act_from, a_t.act_to;
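The lead()-plus-cross-join answer above can be exercised with Python's sqlite3 module (window functions need SQLite 3.25+, bundled with recent Python; the table name `events` is illustrative). Note that in ev_id 2 the C and D rows share a timestamp, so the B→C and C→D counts are order-dependent under any engine:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INT, ev_id INT, act TEXT, complete TEXT)")
conn.executemany("INSERT INTO events VALUES (?,?,?,?)", [
    (1, 1, 'A', '2020-01-13 11:46'), (2, 1, 'B', '2020-01-13 11:50'),
    (3, 1, 'C', '2020-01-13 11:55'), (4, 1, 'D', '2020-01-13 12:50'),
    (5, 1, 'E', '2020-01-13 12:52'), (6, 2, 'A', '2020-01-06 09:13'),
    (7, 2, 'B', '2020-01-06 09:15'), (8, 2, 'C', '2020-01-06 11:46'),
    (9, 2, 'D', '2020-01-06 11:46'), (10, 3, 'A', '2020-01-06 08:11'),
    (11, 3, 'C', '2020-01-06 08:10'), (12, 3, 'B', '2020-01-06 09:46'),
    (13, 3, 'D', '2020-01-06 11:23'), (14, 3, 'E', '2020-01-06 16:05'),
])

# Cross join every (from, to) activity pair, then left join the
# lead()-derived adjacent pairs so absent combinations count as 0.
result = conn.execute("""
    SELECT a_f.act_from, a_t.act_to, COUNT(t.id) AS frequency
    FROM (SELECT DISTINCT act AS act_from FROM events) a_f
    CROSS JOIN (SELECT DISTINCT act AS act_to FROM events) a_t
    LEFT JOIN (SELECT e.*,
                      LEAD(act) OVER (PARTITION BY ev_id
                                      ORDER BY complete) AS next_act
               FROM events e) t
      ON t.act = a_f.act_from AND t.next_act = a_t.act_to
    GROUP BY a_f.act_from, a_t.act_to
    ORDER BY a_f.act_from, a_t.act_to
""").fetchall()
freq = {(f, t): c for f, t, c in result}
print(freq[('A', 'B')], freq[('A', 'A')])  # 3 0
```

All 25 pairs come back, including the zero-frequency ones, which is exactly what the cross join buys over plain aggregation.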
TableA

ID | Counter | Value
:- | :------ | :----
1  | 1       | 10
1  | 2       | 28
1  | 3       | 34
1  | 4       | 22
1  | 5       | 80
2  | 1       | 15
2  | 2       | 50
2  | 3       | 39
2  | 4       | 33
2  | 5       | 99
TableB

StartDate  | EndDate
:--------- | :---------
2020-01-01 | 2020-01-11
2020-01-02 | 2020-01-12
2020-01-03 | 2020-01-13
2020-01-04 | 2020-01-14
2020-01-05 | 2020-01-15
2020-01-06 | 2020-01-16
TableC (output)

ID | Counter | StartDate  | EndDate    | Val
:- | :------ | :--------- | :--------- | :--
1  | 1       | 2020-01-01 | 2020-01-11 | 10
2  | 1       | 2020-01-01 | 2020-01-11 | 15
1  | 2       | 2020-01-02 | 2020-01-12 | 28
2  | 2       | 2020-01-02 | 2020-01-12 | 50
1  | 3       | 2020-01-03 | 2020-01-13 | 34
2  | 3       | 2020-01-03 | 2020-01-13 | 39
1  | 4       | 2020-01-04 | 2020-01-14 | 22
2  | 4       | 2020-01-04 | 2020-01-14 | 33
1  | 5       | 2020-01-05 | 2020-01-15 | 80
2  | 5       | 2020-01-05 | 2020-01-15 | 99
1  | 1       | 2020-01-06 | 2020-01-16 | 10
2  | 1       | 2020-01-06 | 2020-01-16 | 15
I am attempting to come up with some SQL to create TableC. TableC takes the date ranges from TableB in chronological order and, for each ID in TableA, assigns the next Counter in the sequence to that Start/End date combination; when the Counter reaches its end, it wraps back around to 1.
Is something like this even possible with SQL?
Yes, this is possible. Try the following:
Calculate the maximal Counter value in TableA using SELECT MAX(Counter) ... into max_counter.
Add a row_number identifier to each row in TableB so a matching Counter value can be found, using SELECT ROW_NUMBER() OVER (ORDER BY StartDate) ....
Establish the relation between the row number in TableB and Counter in TableA like this: ... FROM TableB JOIN TableA ON COALESCE(NULLIF(TableB.row_number % max_counter, 0), max_counter) = TableA.Counter.
Then gather all these queries into one using a CTE (Common Table Expression), as the official documentation shows.
Consider below approach
select id, counter, StartDate, EndDate, value
from tableA
join (
select *, mod(row_number() over(order by StartDate) - 1, 5) + 1 as counter
from tableB
)
using (counter)
If applied to the sample data in your question, this produces the TableC output above.
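Both answers can be checked with a small sqlite3 sketch (SQLite syntax here, adapted from the BigQuery query; the cycle length is derived from TableA instead of hard-coding 5):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tableA (id INT, counter INT, value INT);
    CREATE TABLE tableB (StartDate TEXT, EndDate TEXT);
""")
conn.executemany("INSERT INTO tableA VALUES (?,?,?)",
                 [(1, 1, 10), (1, 2, 28), (1, 3, 34), (1, 4, 22), (1, 5, 80),
                  (2, 1, 15), (2, 2, 50), (2, 3, 39), (2, 4, 33), (2, 5, 99)])
conn.executemany("INSERT INTO tableB VALUES (?,?)",
                 [('2020-01-01', '2020-01-11'), ('2020-01-02', '2020-01-12'),
                  ('2020-01-03', '2020-01-13'), ('2020-01-04', '2020-01-14'),
                  ('2020-01-05', '2020-01-15'), ('2020-01-06', '2020-01-16')])

# Step 1 of the first answer: derive the cycle length from TableA.
max_counter = conn.execute("SELECT MAX(counter) FROM tableA").fetchone()[0]

# Number TableB's rows chronologically, wrap the number back to 1 once it
# passes max_counter, then join TableA on the resulting counter.
result = conn.execute("""
    SELECT a.id, counter, b.StartDate, b.EndDate, a.value
    FROM tableA a
    JOIN (SELECT StartDate, EndDate,
                 (ROW_NUMBER() OVER (ORDER BY StartDate) - 1) % ? + 1 AS counter
          FROM tableB) b
      USING (counter)
    ORDER BY b.StartDate, a.id
""", (max_counter,)).fetchall()
print(len(result))  # 12 rows: 6 date ranges x 2 ids
```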
I have two tables that I am trying to join. The tables have a primary and foreign key but there are some instances where the keys don't match and I need to join on the next best match.
I tried to use a CASE statement, and it mostly works, but because the join isn't perfect it will either grab the incorrect value or duplicate the record.
The way the table works is: if the Info_IDs don't match up, we can fall back to matching on Lev1 together with Cust_Start falling between Info_Start and Info_End.
I need a way to match on the IDs first and have the SQL stop matching on that row, but I'm not sure if that is something BigQuery can do.
Customer Table
Cust_ID Cust_InfoID Cust_name Cust_Start Cust_Lev1
1111 1 Amy 2021-01-01 A
1112 3 John 2020-01-01 D
1113 8 Bill 2020-01-01 D
Info Table
Info_ID Info_Lev1 Info_Start Info_End state
1 A 2021-01-15 2021-01-14 NJ
3 D 2020-01-01 2020-12-31 NY
5 A 2021-01-01 2022-01-31 CA
Expected Result
Cust_ID Cust_InfoID Info_ID Cust_Lev1 Cust_Start Info_Start Info_End state
1111 1 1 A 2021-01-01 2021-01-15 2021-01-14 NJ
1112 3 3 D 2020-01-01 2020-01-01 2020-12-31 NY
1113 8 3 D 2020-01-01 2020-01-01 2020-12-31 NY
Join Idea 1:
CASE
WHEN
(Cust_InfoID = Info_ID) = true
AND (Cust_Start BETWEEN Info_Start AND Info_End) = true
THEN
Cust_InfoID = Info_ID
ELSE
Cust_Start BETWEEN Info_Start AND Info_End
and Info_Lev1 = Cust_Lev1
END
Output:
Cust_ID Cust_InfoID Info_ID Cust_Lev1 Cust_Start Info_Start Info_End state
1111 1 5 A 2021-01-01 2021-01-01 2022-01-31 CA
1112 3 3 D 2020-01-01 2020-01-01 2020-12-31 NY
1113 8 3 D 2020-01-01 2020-01-01 2020-12-31 NY
The problem here is that the IDs match but the dates don't, so the join falls through to the ELSE condition and matches the wrong row. This is incorrect.
Join Idea 2:
CASE
WHEN
Cust_InfoID = Info_ID
THEN
Cust_InfoID = Info_ID
ELSE
Cust_Start BETWEEN Info_Start AND Info_End
and Info_Lev1 = Cust_Lev1
END
Output:
Cust_ID Cust_InfoID Info_ID Cust_Lev1 Cust_Start Info_Start Info_End state
1111 1 1 A 2021-01-01 2021-01-15 2021-01-14 NJ
1111 1 5 A 2021-01-01 2021-01-01 2022-01-31 CA
1112 3 3 D 2020-01-01 2020-01-01 2020-12-31 NY
1113 8 3 D 2020-01-01 2020-01-01 2020-12-31 NY
The problem here is that the IDs match, but the ELSE condition also matches an extra, wrong duplicate row. This is also incorrect.
Example tables here:
with customer as (
SELECT 1111 Cust_ID,1 Cust_InfoID,'Amy' Cust_name,'2021-01-01' Cust_Start,'A' Cust_Lev1
UNION ALL
SELECT 1112,3,'John','2020-01-01','D'
union all
SELECT 1113,8,'Bill','2020-01-01','D'
),
info as (
select 1 Info_ID,'A' Info_Lev1,'2021-01-15' Info_Start,'2021-01-14' Info_End,'NJ' state
union all
select 3,'D','2020-01-01','2020-12-31','NY'
union all
select 5,'A','2021-01-01','2022-01-31','CA'
)
select Cust_ID,Cust_InfoID,Info_ID,Cust_Lev1,Cust_Start,Info_Start,Info_End,state
from customer
join info on
[case statement here]
The way the table works is if the Info_IDs don't match up we can use a combination of Lev1 and if the cust_start date is between Info_Start and Info_End
Use two left joins, one for each of the conditions:
select c.*,
coalesce(ii.info_start, il.info_start),
coalesce(ii.info_end, il.info_end),
coalesce(ii.state, il.state)
from customer c left join
info ii
on c.cust_infoid = ii.info_id left join
info il
on ii.info_id is null and
c.cust_lev1 = il.info_lev1 and
c.cust_start between il.info_start and il.info_end
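A minimal sqlite3 sketch of this two-left-join fallback (column names lower-cased; the point is that the second join only fires when the first one found no match):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (cust_id INT, cust_infoid INT, cust_name TEXT,
                           cust_start TEXT, cust_lev1 TEXT);
    CREATE TABLE info (info_id INT, info_lev1 TEXT, info_start TEXT,
                       info_end TEXT, state TEXT);
    INSERT INTO customer VALUES
      (1111, 1, 'Amy',  '2021-01-01', 'A'),
      (1112, 3, 'John', '2020-01-01', 'D'),
      (1113, 8, 'Bill', '2020-01-01', 'D');
    INSERT INTO info VALUES
      (1, 'A', '2021-01-15', '2021-01-14', 'NJ'),
      (3, 'D', '2020-01-01', '2020-12-31', 'NY'),
      (5, 'A', '2021-01-01', '2022-01-31', 'CA');
""")

# First left join matches on the id; the second fires only when the first
# found nothing (ii.info_id IS NULL) and falls back to lev1 + date range.
result = conn.execute("""
    SELECT c.cust_id, COALESCE(ii.info_id, il.info_id) AS info_id,
           COALESCE(ii.state, il.state) AS state
    FROM customer c
    LEFT JOIN info ii ON c.cust_infoid = ii.info_id
    LEFT JOIN info il ON ii.info_id IS NULL
                     AND c.cust_lev1 = il.info_lev1
                     AND c.cust_start BETWEEN il.info_start AND il.info_end
    ORDER BY c.cust_id
""").fetchall()
print(result)  # [(1111, 1, 'NJ'), (1112, 3, 'NY'), (1113, 3, 'NY')]
```

Bill (cust_infoid 8) has no matching info_id, so only for him does the fallback join supply row 3's NY data, matching the expected result.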
Consider below ("with one JOIN and a CASE statement" as asked)
select any_value(c).*,
array_agg(i order by
case when c.cust_infoid = i.info_id then 1 else 2 end
limit 1
)[offset(0)].*
from `project.dataset.customer` c
join `project.dataset.info` i
on c.cust_infoid = i.info_id
or(
c.cust_lev1 = i.info_lev1 and
c.cust_start between i.info_start and i.info_end
)
group by format('%t', c)
When applied to the sample data in your question, this produces the expected result.
I am working with a SQLite RDB and have the following problem.
PID EID EPISODETYPE START_TIME END_TIME
123 556 emergency_room 2020-03-29 15:09:00 2020-03-30 20:36:00
123 558 ward 2020-04-30 20:35:00 2020-05-04 22:12:00
123 660 ward 2020-05-04 22:12:00 2020-05-21 08:59:00
123 661 icu 2020-05-21 09:00:00 2020-07-01 17:00:00
Basically, PID is each patient's unique identifier. Each patient has an episode identifier (EID) for every bed they occupy during a single stay.
What I wish to accomplish is to group all episodes belonging to the same hospital stay and return a stay number.
I would want my query to result in this :
PID EID StayNumber
123 556 1
123 558 2
123 660 2
123 661 2
The 1st row gets StayNumber 1, as it's the first stay.
The 2nd, 3rd, and 4th rows are from the same hospital stay (we can tell by the overlapping or relatively close start and end times), so they are all labeled StayNumber 2.
A hospital stay is defined as the period of time during which the patient never left the hospital.
I tried to write the query by starting off with:
GROUP BY PID (to isolate the process for each individual patient)
and using datetime to compute a simple time-difference rule, but I have trouble writing a query that compares the end time of one row with the start time of the next row.
Thank you in advance.
I am a SQL learner
Use window function LAG() to flag the groups for each hospital stay and window function SUM() to get the numbers:
SELECT PID, EID,
SUM(flag) OVER (PARTITION BY PID ORDER BY START_TIME) StayNumber
FROM (
SELECT *,
strftime('%s', START_TIME) -
strftime('%s', LAG(END_TIME, 1, datetime(START_TIME, '-1 hour')) OVER (PARTITION BY PID ORDER BY START_TIME)) > 60 flag
FROM tablename
)
Results:
|PID | EID | StayNumber
|:-- | :-- | ---------:
|123 | 556 | 1
|123 | 558 | 2
|123 | 660 | 2
|123 | 661 | 2
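The LAG()-plus-SUM() query can be verified with Python's sqlite3 module (window functions require SQLite 3.25+; the 60-second gap threshold is taken from the answer above):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE episodes (PID INT, EID INT, EPISODETYPE TEXT, "
             "START_TIME TEXT, END_TIME TEXT)")
conn.executemany("INSERT INTO episodes VALUES (?,?,?,?,?)", [
    (123, 556, 'emergency_room', '2020-03-29 15:09:00', '2020-03-30 20:36:00'),
    (123, 558, 'ward',           '2020-04-30 20:35:00', '2020-05-04 22:12:00'),
    (123, 660, 'ward',           '2020-05-04 22:12:00', '2020-05-21 08:59:00'),
    (123, 661, 'icu',            '2020-05-21 09:00:00', '2020-07-01 17:00:00'),
])

# flag is 1 when the gap since the previous episode exceeds 60 seconds;
# LAG's default (start minus one hour) forces flag = 1 on each first row.
result = conn.execute("""
    SELECT PID, EID,
           SUM(flag) OVER (PARTITION BY PID ORDER BY START_TIME) AS StayNumber
    FROM (SELECT *,
                 strftime('%s', START_TIME) -
                 strftime('%s', LAG(END_TIME, 1, datetime(START_TIME, '-1 hour'))
                                OVER (PARTITION BY PID ORDER BY START_TIME)) > 60
                   AS flag
          FROM episodes)
    ORDER BY PID, START_TIME
""").fetchall()
print(result)  # [(123, 556, 1), (123, 558, 2), (123, 660, 2), (123, 661, 2)]
```

Episode 661 starts exactly 60 seconds after 660 ends, so with this threshold it stays in stay number 2, as in the results table.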
I have a table covid, my table looks something like this:
location | date | new_cases | total_deaths | new_deaths
----------------------------------------------------------------
Afghanistan 2020-04-07 38 7 0
Afghanistan 2020-04-08 30 11 4
Afghanistan 2020-04-09 56 14 3
Afghanistan 2020-04-10 61 15 1
Afghanistan 2020-04-11 37 15 0
Afghanistan 2020-04-12 34 18 3
In this case, I want to get each location's row based on max(new_cases). This is my query:
select a.*
from covid a
join (
select location, max(new_cases) highest_case
from covid
group by location
) b
on a.location = b.location
and a.new_cases = b.highest_case
but I found the same location and max(new_cases) value appearing with different date values; this is the result:
location | date | new_cases | total_deaths | new_deaths
----------------------------------------------------------------
Bhutan 2020-06-08 11 0 0
Bolivia 2020-07-28 2382 2647 64
Bonaire Sint 2020-04-02 2 0 0
Bonaire Sint 2020-07-15 2 0 0
Botswana 2020-07-24 164 1 0
Now, how can I keep only the row with the min(date)? Please give me advice on how to fix this; the output should be like this:
location | date | new_cases | total_deaths | new_deaths
----------------------------------------------------------------
Bhutan 2020-06-08 11 0 0
Bolivia 2020-07-28 2382 2647 64
Bonaire Sint 2020-04-02 2 0 0
Botswana 2020-07-24 164 1 0
Use distinct on:
select distinct on (location) c.*
from covid c
order by location, new_cases desc;
To break ties by keeping the earliest date, add it as a secondary sort key:
order by location, new_cases desc, date;
You can use the window function MAX() to get max_cases (per location) and then number the rows to fetch the min date:
select location, date, new_cases, total_deaths, new_deaths
from (
    -- get the min date among the max-case rows
    select row_number() over (partition by location order by date) n,
           date, location, new_cases, total_deaths, new_deaths
    from (
        -- get max_case per location
        select location, date,
               max(new_cases) over (partition by location) max_case,
               new_cases, total_deaths, new_deaths
        from covid
    ) X
    where new_cases = max_case  -- fetch only max-case rows
) Y
where n = 1
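For engines without Postgres's DISTINCT ON, the same greatest-row-per-group pick with an earliest-date tie-break can be sketched via ROW_NUMBER() in sqlite3; the three sample rows below are taken from the question's Bhutan/Bonaire Sint output:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE covid (location TEXT, date TEXT, new_cases INT, "
             "total_deaths INT, new_deaths INT)")
conn.executemany("INSERT INTO covid VALUES (?,?,?,?,?)", [
    ('Bhutan',       '2020-06-08', 11, 0, 0),
    ('Bonaire Sint', '2020-04-02', 2, 0, 0),
    ('Bonaire Sint', '2020-07-15', 2, 0, 0),
])

# Rank rows per location: highest new_cases first, earliest date breaks ties.
result = conn.execute("""
    SELECT location, date, new_cases
    FROM (SELECT *,
                 ROW_NUMBER() OVER (PARTITION BY location
                                    ORDER BY new_cases DESC, date) AS rn
          FROM covid)
    WHERE rn = 1
    ORDER BY location
""").fetchall()
print(result)  # [('Bhutan', '2020-06-08', 11), ('Bonaire Sint', '2020-04-02', 2)]
```

The duplicate Bonaire Sint row (same case count, later date) is dropped, which is exactly the deduplication the question asks for.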