How can I convert gaps in dates to discrete intervals? - sql

Can't think of a better title for this, and I've had a lot of problems coming up with a solution in SQL.
Basically, I have a table of events with dates and an associated entity, like the following:
event                    entity  date
er_visit                 bob     2020-01-01
triage                   bob     2020-01-01
admitted_to_icu          bob     2020-01-01
inpatient_bed_rest       bob     2020-01-02
inpatient_bed_rest       bob     2020-01-03
physical_therapy         bob     2020-01-03
hospital_discharge       bob     2020-01-11
physical_therapy         bob     2020-01-12
physical_therapy         bob     2020-01-13
inpatient_followup       bob     2020-02-01
inpatient_followup       bob     2021-02-11
pregnancy_checkup        alice   2020-01-01
admitted_maternity_ward  alice   2020-02-01
inpatient_birth          alice   2020-02-02
bed_rest                 alice   2020-02-02
bed_rest                 alice   2020-02-03
bed_rest                 alice   2020-02-04
hospital_discharge       alice   2021-02-04
I need to turn this into a list of events per entity, with the gaps between consecutive dates encoded as discrete intervals (say 1 day, 1 week, 1 month, 1 year) and any other gap encoded as a combination of those. For example, a gap of 14 months and 15 days would be 1 year + 1 month + 1 month + 1 week + 1 week + 1 day.
event                    entity  seq
er_visit                 bob     1
triage                   bob     2
admitted_to_icu          bob     3
nextday                  bob     4
inpatient_bed_rest       bob     5
nextday                  bob     6
inpatient_bed_rest       bob     7
physical_therapy         bob     8
nextweek                 bob     9
nextday                  bob     10
hospital_discharge       bob     11
nextday                  bob     12
physical_therapy         bob     13
nextday                  bob     14
physical_therapy         bob     15
nextweek                 bob     16
nextweek                 bob     17
nextday                  bob     18
nextday                  bob     19
nextday                  bob     20
nextday                  bob     21
nextday                  bob     22
inpatient_followup       bob     23
nextyear                 bob     24
nextweek                 bob     25
nextday                  bob     26
nextday                  bob     27
nextday                  bob     28
inpatient_followup       bob     29
pregnancy_checkup        alice   1
nextmonth                alice   2
admitted_maternity_ward  alice   3
nextday                  alice   4
inpatient_birth          alice   5
bed_rest                 alice   6
nextday                  alice   7
bed_rest                 alice   8
nextday                  alice   9
bed_rest                 alice   10
hospital_discharge       alice   11
Is there a reasonable solution for this in Athena (Presto) SQL?

I tested this in Trino, but these functions should behave the same in Athena.
I've output the intervals as a number of days. Converting from there to your unusual desired output format is possible in SQL, but far more complicated than seems worthwhile.
WITH the_table AS (
    SELECT event, entity, date(date) date
    FROM (VALUES
        ('er_visit', 'bob', '2020-01-01'), ('triage', 'bob', '2020-01-01'),
        ('admitted_to_icu', 'bob', '2020-01-01'), ('inpatient_bed_rest', 'bob', '2020-01-02'),
        ('inpatient_bed_rest', 'bob', '2020-01-03'), ('physical_therapy', 'bob', '2020-01-03'),
        ('hospital_discharge', 'bob', '2020-01-11'), ('physical_therapy', 'bob', '2020-01-12'),
        ('physical_therapy', 'bob', '2020-01-13'), ('inpatient_followup', 'bob', '2020-02-01'),
        ('inpatient_followup', 'bob', '2021-02-11'), ('pregnancy_checkup', 'alice', '2020-01-01'),
        ('admitted_maternity_ward', 'alice', '2020-02-01'), ('inpatient_birth', 'alice', '2020-02-02'),
        ('bed_rest', 'alice', '2020-02-02'), ('bed_rest', 'alice', '2020-02-03'),
        ('bed_rest', 'alice', '2020-02-04'), ('hospital_discharge', 'alice', '2021-02-04'))
    AS data(event, entity, date)
)
SELECT event
     , entity
     , date_diff('day', lag(date) OVER (PARTITION BY entity ORDER BY date), date) days_since_last
FROM the_table
event                    entity  days_since_last
pregnancy_checkup        alice   NULL
admitted_maternity_ward  alice   31
inpatient_birth          alice   1
bed_rest                 alice   0
bed_rest                 alice   1
bed_rest                 alice   1
hospital_discharge       alice   366
er_visit                 bob     NULL
triage                   bob     0
admitted_to_icu          bob     0
inpatient_bed_rest       bob     1
inpatient_bed_rest       bob     1
physical_therapy         bob     0
hospital_discharge       bob     8
physical_therapy         bob     1
physical_therapy         bob     1
inpatient_followup       bob     19
inpatient_followup       bob     376
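If the final formatting step can happen outside SQL, the decomposition into interval tokens is short to write. The following is only a minimal Python sketch, not part of the original answer: the helper names add_months and gap_tokens are hypothetical, and it assumes calendar-based years and months (so 2020-01-01 to 2020-02-01 counts as exactly one month), with the leftover days split into weeks and days.

```python
import calendar
from datetime import date


def add_months(d, months):
    """Shift a date by whole calendar months, clamping the day to the month's end."""
    total = d.year * 12 + (d.month - 1) + months
    year, month_index = divmod(total, 12)
    month = month_index + 1
    day = min(d.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def gap_tokens(start, end):
    """Encode the gap between two dates as nextyear/nextmonth/nextweek/nextday tokens."""
    # Count the whole calendar months that fit between start and end.
    months = 0
    while add_months(start, months + 1) <= end:
        months += 1
    # Whatever is left after the whole months is split into weeks and days.
    leftover_days = (end - add_months(start, months)).days
    years, months = divmod(months, 12)
    weeks, days = divmod(leftover_days, 7)
    return (["nextyear"] * years + ["nextmonth"] * months
            + ["nextweek"] * weeks + ["nextday"] * days)


# bob's last gap in the sample data (376 days in the query output above)
print(gap_tokens(date(2020, 2, 1), date(2021, 2, 11)))
# ['nextyear', 'nextweek', 'nextday', 'nextday', 'nextday']
```

Events on the same date produce an empty token list, so they simply follow each other in the sequence.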

Related

Pandas iterate over rows and conditional count

I am trying to iterate over rows in a pandas DataFrame and compute a conditional count in a new column called Stage. For each name the stage should start at 1, and while the name stays the same between rows, a new stage should start after each "Healthy" status. A "Healthy" event belongs to the same stage as the preceding "Sick" events, if there are any. I've done this in Excel before, but I'm not sure how to do it in Python.
What I have now is:
Date        Name  Status
2020-01-02  Mary  Healthy
2020-01-05  Mary  Sick
2020-01-15  Mary  Sick
2020-01-20  Mary  Healthy
2020-02-03  Mary  Healthy
2020-02-06  Mary  Sick
2020-02-10  Mary  Sick
2020-02-15  Mary  Healthy
2020-01-02  Bob   Healthy
2020-01-05  Bob   Healthy
2020-01-15  Bob   Healthy
2020-01-20  Bob   Sick
2020-02-03  Bob   Sick
2020-02-06  Bob   Sick
2020-02-10  Bob   Sick
2020-02-15  Bob   Healthy
What I would like to have:
Date        Name  Status   Stage
2020-01-02  Mary  Healthy  1
2020-01-05  Mary  Sick     2
2020-01-15  Mary  Sick     2
2020-01-20  Mary  Healthy  2
2020-02-03  Mary  Healthy  3
2020-02-06  Mary  Sick     4
2020-02-10  Mary  Sick     4
2020-02-15  Mary  Healthy  4
2020-01-02  Bob   Healthy  1
2020-01-05  Bob   Healthy  2
2020-01-15  Bob   Healthy  3
2020-01-20  Bob   Sick     4
2020-02-03  Bob   Sick     4
2020-02-06  Bob   Sick     4
2020-02-10  Bob   Sick     4
2020-02-15  Bob   Healthy  4
You don't need an explicit loop. You need the following:
- group by the name column
- apply to each group:
  - shift the Status column to look at the previous value
  - take the cumulative sum of the following series:
    - if the previous value is null and the current value is Healthy, we're at the first row, so call it one
    - if the previous row is Healthy, call it one
    - otherwise, call it zero
from io import StringIO
import numpy
import pandas

df = pandas.read_csv(StringIO("""\
|Date|Name|Status|
|2020-01-02|Mary|Healthy|
|2020-01-05|Mary|Sick|
|2020-01-15|Mary|Sick|
|2020-01-20|Mary|Healthy|
|2020-02-03|Mary|Healthy|
|2020-02-06|Mary|Sick|
|2020-02-10|Mary|Sick|
|2020-02-15|Mary|Healthy|
|2020-01-02|Bob|Healthy|
|2020-01-05|Bob|Healthy|
|2020-01-15|Bob|Healthy|
|2020-01-20|Bob|Sick|
|2020-02-03|Bob|Sick|
|2020-02-06|Bob|Sick|
|2020-02-10|Bob|Sick|
|2020-02-15|Bob|Healthy|
"""), sep='|').loc[:, ['Date', 'Name', 'Status']]  # drop the empty edge columns

output = (
    # group_keys=False keeps the original row index so the result aligns with df
    df.assign(Stage=lambda df: df.groupby('Name', group_keys=False)['Status'].apply(lambda g:
        numpy.bitwise_or(  # True if either of the two conditions is met
            g.shift().eq('Healthy'),  # general case: previous row was Healthy
            g.shift().isnull() & g.eq('Healthy')  # handles the first row of a group
        ).cumsum()
    ))
)
print(output.to_string())
And I get:
          Date  Name   Status  Stage
0   2020-01-02  Mary  Healthy      1
1   2020-01-05  Mary     Sick      2
2   2020-01-15  Mary     Sick      2
3   2020-01-20  Mary  Healthy      2
4   2020-02-03  Mary  Healthy      3
5   2020-02-06  Mary     Sick      4
6   2020-02-10  Mary     Sick      4
7   2020-02-15  Mary  Healthy      4
8   2020-01-02   Bob  Healthy      1
9   2020-01-05   Bob  Healthy      2
10  2020-01-15   Bob  Healthy      3
11  2020-01-20   Bob     Sick      4
12  2020-02-03   Bob     Sick      4
13  2020-02-06   Bob     Sick      4
14  2020-02-10   Bob     Sick      4
15  2020-02-15   Bob  Healthy      4

How to find the Effective end Date for the below table using select statement only

This is the actual table:
EMID ENAME DEPT_NO EFDT
101 ANUJ 10 1/1/2018
101 ANUJ 11 1/1/2020
101 ANUJ 12 5/1/2020
102 KUNAL 12 1/1/2019
102 KUNAL 14 1/1/2020
102 KUNAL 15 5/1/2020
103 AJAY 11 1/1/2018
103 AJAY 12 1/1/2020
104 RAJAT 10 1/1/2018
104 RAJAT 12 1/1/2020
This is desired output:
EMID ENAME DEPTNO EFDT EF_ENDT
101 ANUJ 10 1/1/2018 12/31/2019
101 ANUJ 11 1/1/2020 4/30/2020
101 ANUJ 12 5/1/2020 NULL
102 KUNAL 12 1/1/2019 12/31/2019
102 KUNAL 14 1/1/2020 4/30/2020
102 KUNAL 15 5/1/2020 NULL
103 AJAY 11 1/1/2018 12/31/2019
103 AJAY 12 1/1/2020 NULL
104 RAJAT 10 1/1/2018 12/31/2019
104 RAJAT 12 1/1/2020 NULL
The EF_ENDT column needs to be populated using a select statement only.
How can we do this?
Ideally the code should be generic enough to work on any database.
Basically, you want a lead and then to subtract one day. The standard SQL for this is:
select t.*,
       lead(efdt) over (partition by emid order by efdt) - interval '1 day' as ef_endt
from t;
Date/time functions vary significantly among databases. All provide some method for subtracting one day, but you'll probably have to adapt this to your particular (unstated) database.
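As one example of such an adaptation, here is a minimal sketch that runs the same lead-minus-one-day idea against an in-memory SQLite database from Python. It is an illustration under assumptions, not the dialect of the original question: the dates are stored as ISO strings (YYYY-MM-DD) rather than the M/D/YYYY format shown, SQLite's date(..., '-1 day') modifier replaces the interval arithmetic, only two of the employees are loaded, and window functions require SQLite 3.25 or newer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (emid INTEGER, ename TEXT, dept_no INTEGER, efdt TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?, ?)",
    [
        (101, "ANUJ", 10, "2018-01-01"),
        (101, "ANUJ", 11, "2020-01-01"),
        (101, "ANUJ", 12, "2020-05-01"),
        (103, "AJAY", 11, "2018-01-01"),
        (103, "AJAY", 12, "2020-01-01"),
    ],
)

# The end date is the next start date for the same employee, minus one day.
# The latest row per employee has no next row, so its end date stays NULL.
rows = conn.execute(
    """
    SELECT emid, dept_no, efdt,
           date(lead(efdt) OVER (PARTITION BY emid ORDER BY efdt), '-1 day') AS ef_endt
    FROM t
    ORDER BY emid, efdt
    """
).fetchall()

for row in rows:
    print(row)
```

The first row printed is (101, 10, '2018-01-01', '2019-12-31'), and the last row for each employee has None as its end date.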

SQL server select from 3 tables

I have three tables in my database Books, Borrowers and Movement:
Books
BookID      Title                          Author                    Category        Published
----------- ------------------------------ ------------------------- --------------- ----------
101         Ulysses                        James Joyce               Fiction         1922-06-16
102         Huckleberry Finn               Mark Twain                Fiction         1884-03-24
103         The Great Gatsby               F. Scott Fitzgerald       Fiction         1925-06-17
104         1984                           George Orwell             Fiction         1949-04-19
105         War and Peace                  Leo Tolstoy               Fiction         1869-08-01
106         Gullivers Travels              Jonathan Swift            Fiction         1726-07-01
107         Moby Dick                      Herman Melville           Fiction         1851-08-01
108         Pride and Prejudice            Jane Austen               Fiction         1813-08-13
110         The Second World War           Winston Churchill         NonFiction      1953-09-01
111         Relativity                     Albert Einstein           NonFiction      1917-01-09
112         The Right Stuff                Tom Wolfe                 NonFiction      1979-09-07
121         Hitchhikers Guide to Galaxy    Douglas Adams             Humour          1975-10-27
122         Dad Is Fat                     Jim Gaffigan              Humour          2013-03-01
131         Kick-Ass 2                     Mark Millar               Comic           2012-03-03
133         Beautiful Creatures: The Manga Kami Garcia               Comic           2014-07-01
Borrowers
BorrowerID  Name                      Birthday
----------- ------------------------- ----------
2           Bugs Bunny                1938-09-08
3           Homer Simpson             1992-09-09
5           Mickey Mouse              1928-02-08
7           Fred Flintstone           1960-06-09
11          Charlie Brown             1965-06-05
13          Popeye                    1933-03-03
17          Donald Duck               1937-07-27
19          Mr. Magoo                 1949-09-14
23          George Jetson             1948-04-08
29          SpongeBob SquarePants     1984-08-04
31          Stewie Griffin            1971-11-17
Movement
MoveID      BookID      BorrowerID  DateOut    DateIn     ReturnCondition
----------- ----------- ----------- ---------- ---------- ---------------
1           131         31          2012-06-01 2013-05-24 good
2           101         23          2012-02-10 2012-03-24 good
3           102         29          2012-02-01 2012-04-01 good
4           105         7           2012-03-23 2012-05-11 good
5           103         7           2012-03-22 2012-04-22 good
6           108         7           2012-01-23 2012-02-12 good
7           112         19          2012-01-12 2012-02-10 good
8           122         11          2012-04-14 2013-05-01 poor
9           106         17          2013-01-24 2013-02-01 good
10          104         2           2013-02-24 2013-03-10 bitten
11          121         3           2013-03-01 2013-04-01 good
12          131         19          2013-04-11 2013-05-23 good
13          111         5           2013-05-22 2013-06-22 poor
14          131         2           2013-06-12 2013-07-23 bitten
15          122         23          2013-07-10 2013-08-12 good
16          107         29          2014-01-01 2014-02-14 good
17          110         7           2014-01-11 2014-02-01 good
18          105         2           2014-02-22 2014-03-02 bitten
What is a query I can use to find out which book was borrowed by the oldest borrower?
I am new to SQL and am using Microsoft SQL Server 2014
Here are two different solutions.
First, using two subqueries and one equi-join:
select Title
from Books b, Movement m
where b.BookID = m.BookID
  and m.BorrowerID = (select BorrowerID
                      from Borrowers
                      where Birthday = (select MIN(Birthday)
                                        from Borrowers))
Second, using two equi-joins and one subquery:
select Title
from Books b, Borrowers r, Movement m
where b.BookID = m.BookID
  and m.BorrowerID = r.BorrowerID
  and Birthday = (select MIN(Birthday) from Borrowers)
Both above queries give the following answer:
Title
------------------------------
Relativity
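The same lookup can also be written with explicit JOIN ... ON syntax instead of comma joins. The sketch below is an illustration only: it runs in SQLite from Python rather than SQL Server, loads just a few rows from each table, and leaves out the columns the query does not need.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Books (BookID INTEGER, Title TEXT);
    CREATE TABLE Borrowers (BorrowerID INTEGER, Name TEXT, Birthday TEXT);
    CREATE TABLE Movement (MoveID INTEGER, BookID INTEGER, BorrowerID INTEGER);

    INSERT INTO Books VALUES (104, '1984'), (111, 'Relativity'), (131, 'Kick-Ass 2');
    INSERT INTO Borrowers VALUES
        (2, 'Bugs Bunny', '1938-09-08'),
        (5, 'Mickey Mouse', '1928-02-08'),
        (31, 'Stewie Griffin', '1971-11-17');
    INSERT INTO Movement VALUES (1, 131, 31), (10, 104, 2), (13, 111, 5);
    """
)

# The oldest borrower is the one with the earliest birthday.
titles = conn.execute(
    """
    SELECT b.Title
    FROM Movement AS m
    JOIN Books AS b ON b.BookID = m.BookID
    JOIN Borrowers AS r ON r.BorrowerID = m.BorrowerID
    WHERE r.Birthday = (SELECT MIN(Birthday) FROM Borrowers)
    """
).fetchall()

print(titles)  # [('Relativity',)]
```

If the oldest borrower had taken out several books, or two borrowers shared the earliest birthday, the query would return one row per matching movement.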

SQL Queries (Difference between tables)

I'm trying to find a difference between two tables. The tables are
Sample Data
PERSON_PHOTO
ID USERID FNAME
801 uid01 Geroge
801 uid05 George
803 uid01 George
901 uid01 Alice
201 uid01 Alice
330 uid01 Alice
802 uid05 Alice
803 uid05 Alice
804 uid05 Alice
901 uid05 Alice
701 uid05 Alice
201 uid05 Alice
101 uid05 Alice
330 uid05 Alice
501 uid05 Alice
501 uid12 Jane
330 uid12 Jane
101 uid12 Jane
201 uid12 Jane
701 uid12 Jane
801 uid12 Jane
901 uid12 Jane
101 uid07 Mary
101 uid03 Mary
201 uid03 Mary
801 uid03 Mary
901 uid03 Mary
201 uid15 Tom
801 uid15 Tom
Table VALID_FRIEND
FNAME USERID
Bill uid02
George uid01
Mary uid07
Jane uid12
Tom uid15
Alice uid05
Mary uid03
SAMPLE OUTPUT
USERID PHOTOS NOT IN
uid02 0
uid01 5
uid07 9
uid12 3
uid15 8
uid05 8
uid03 6
The query I'm trying to write should find the number of photos that each person is not in, and output the USERID together with that number. I think I need to take the count of distinct photo IDs in PERSON_PHOTO and subtract the number of photos each USERID appears in. Thanks for any help.
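The subtraction described above can be sketched as follows. This is only one possible reading of the requirement, run in SQLite from Python on a small subset of the rows, so the numbers will not match the sample output; the lowercase table names and the photos_not_in alias are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE person_photo (id INTEGER, userid TEXT, fname TEXT);
    CREATE TABLE valid_friend (fname TEXT, userid TEXT);

    INSERT INTO person_photo VALUES
        (801, 'uid01', 'George'), (803, 'uid01', 'George'),
        (801, 'uid15', 'Tom'), (201, 'uid15', 'Tom'),
        (101, 'uid07', 'Mary');
    INSERT INTO valid_friend VALUES
        ('Bill', 'uid02'), ('George', 'uid01'), ('Mary', 'uid07'), ('Tom', 'uid15');
    """
)

# Total number of distinct photos minus the photos each friend appears in.
# The LEFT JOIN keeps friends who appear in no photo at all (their count is 0).
rows = conn.execute(
    """
    SELECT f.userid,
           (SELECT COUNT(DISTINCT id) FROM person_photo)
               - COUNT(DISTINCT p.id) AS photos_not_in
    FROM valid_friend AS f
    LEFT JOIN person_photo AS p ON p.userid = f.userid
    GROUP BY f.userid
    ORDER BY f.userid
    """
).fetchall()

print(rows)  # [('uid01', 2), ('uid02', 4), ('uid07', 3), ('uid15', 2)]
```

Under this reading, a friend who appears in no photo is missing from every photo, so uid02 gets the full distinct count of 4 here.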

SQL: Grouping by 2 columns

I have a table points:
event_time       | name | points |
------------------------------------
2014-07-16 11:40   Bob    10
2014-07-16 10:00   Jim    20
2014-07-16 09:20   Jim    30
2014-07-15 11:20   Bob    5
2014-07-15 10:20   Anna   10
2014-07-15 09:40   Bob    30
2014-07-15 09:00   Anna   10
Is it possible to make a query that results with:
event_date | name | total_points |
------------------------------------
2014-07-16   Bob    10
2014-07-16   Jim    50
2014-07-15   Bob    35
2014-07-15   Anna   20
Where total_points is a sum of all points for the given name during the day?
select date(event_time) as event_date,
       name,
       sum(points) as total_points
from points
group by date(event_time), name