Pandas iterate over rows and conditional count - pandas

I am trying to iterate over rows in a pandas DataFrame and add a conditional count in a new column called Stage. For each name the stage should start at 1, and within the same name a new stage should start after each "Healthy" status. A "Healthy" event belongs to the same stage as the preceding "Sick" events, if there are any. I've done this in Excel before, but I'm not sure how to do it in Python.
What I have now is:
Date        Name  Status
2020-01-02  Mary  Healthy
2020-01-05  Mary  Sick
2020-01-15  Mary  Sick
2020-01-20  Mary  Healthy
2020-02-03  Mary  Healthy
2020-02-06  Mary  Sick
2020-02-10  Mary  Sick
2020-02-15  Mary  Healthy
2020-01-02  Bob   Healthy
2020-01-05  Bob   Healthy
2020-01-15  Bob   Healthy
2020-01-20  Bob   Sick
2020-02-03  Bob   Sick
2020-02-06  Bob   Sick
2020-02-10  Bob   Sick
2020-02-15  Bob   Healthy
What I would like to have:
Date        Name  Status   Stage
2020-01-02  Mary  Healthy  1
2020-01-05  Mary  Sick     2
2020-01-15  Mary  Sick     2
2020-01-20  Mary  Healthy  2
2020-02-03  Mary  Healthy  3
2020-02-06  Mary  Sick     4
2020-02-10  Mary  Sick     4
2020-02-15  Mary  Healthy  4
2020-01-02  Bob   Healthy  1
2020-01-05  Bob   Healthy  2
2020-01-15  Bob   Healthy  3
2020-01-20  Bob   Sick     4
2020-02-03  Bob   Sick     4
2020-02-06  Bob   Sick     4
2020-02-10  Bob   Sick     4
2020-02-15  Bob   Healthy  4

You don't need an explicit loop. You need the following:
- group by the name column
- apply to each group:
    - shift the Status column to look at the previous value
    - take the cumulative sum of the following series:
        - if the previous value is null and the current value is Healthy, we're at the first row, so call it one
        - if the previous row is Healthy, call it one
        - otherwise, call it zero
from io import StringIO
import numpy
import pandas

# The leading and trailing '|' produce two empty columns, dropped by .loc below
df = pandas.read_csv(StringIO("""\
|Date|Name|Status|
|2020-01-02|Mary|Healthy|
|2020-01-05|Mary|Sick|
|2020-01-15|Mary|Sick|
|2020-01-20|Mary|Healthy|
|2020-02-03|Mary|Healthy|
|2020-02-06|Mary|Sick|
|2020-02-10|Mary|Sick|
|2020-02-15|Mary|Healthy|
|2020-01-02|Bob|Healthy|
|2020-01-05|Bob|Healthy|
|2020-01-15|Bob|Healthy|
|2020-01-20|Bob|Sick|
|2020-02-03|Bob|Sick|
|2020-02-06|Bob|Sick|
|2020-02-10|Bob|Sick|
|2020-02-15|Bob|Healthy|
"""), sep='|').loc[:, ['Date', 'Name', 'Status']]

output = (
    # group_keys=False keeps the original row index so the result aligns with df
    df.assign(Stage=lambda df: df.groupby('Name', group_keys=False)['Status'].apply(lambda g:
        numpy.bitwise_or(  # True if either of the two conditions is met
            g.shift().eq('Healthy'),  # general case
            g.shift().isnull() & g.eq("Healthy")  # handles first row of a group
        ).cumsum()
    ))
)
print(output.to_string())
And I get:
          Date  Name   Status  Stage
0   2020-01-02  Mary  Healthy      1
1   2020-01-05  Mary     Sick      2
2   2020-01-15  Mary     Sick      2
3   2020-01-20  Mary  Healthy      2
4   2020-02-03  Mary  Healthy      3
5   2020-02-06  Mary     Sick      4
6   2020-02-10  Mary     Sick      4
7   2020-02-15  Mary  Healthy      4
8   2020-01-02   Bob  Healthy      1
9   2020-01-05   Bob  Healthy      2
10  2020-01-15   Bob  Healthy      3
11  2020-01-20   Bob     Sick      4
12  2020-02-03   Bob     Sick      4
13  2020-02-06   Bob     Sick      4
14  2020-02-10   Bob     Sick      4
15  2020-02-15   Bob  Healthy      4
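
The first-row special case can probably be avoided by starting every group at 1 and adding the number of earlier rows in that group whose status was Healthy. The following is only a sketch of that idea. It assumes the question's column names (Status as input, Stage as the new column) and that rows are already in date order within each name. Unlike the rule above, it also gives stage 1 to a group whose first row is Sick.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Mary"] * 8 + ["Bob"] * 8,
    "Status": [
        "Healthy", "Sick", "Sick", "Healthy", "Healthy", "Sick", "Sick", "Healthy",
        "Healthy", "Healthy", "Healthy", "Sick", "Sick", "Sick", "Sick", "Healthy",
    ],
})

# True where the previous row of the same name was Healthy
# (the first row of each group compares NaN to "Healthy", which is False)
prev_healthy = df.groupby("Name")["Status"].shift().eq("Healthy")

# Running count of those flags per name, offset so every group starts at 1
df["Stage"] = prev_healthy.astype(int).groupby(df["Name"]).cumsum() + 1

print(df.to_string())
```

Both steps return results on the original row index, so no `apply` is needed and the new column lines up with `df` directly.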

Related

How can I convert gaps in dates to discrete intervals?

Can't think of a better title for this, and I've had a lot of problems coming up with a solution in SQL.
Basically, I have a table of events with dates and an associated entity, like the following:
event                    entity  date
er_visit                 bob     2020-01-01
triage                   bob     2020-01-01
admitted_to_icu          bob     2020-01-01
inpatient_bed_rest       bob     2020-01-02
inpatient_bed_rest       bob     2020-01-03
physical_therapy         bob     2020-01-03
hospital_discharge       bob     2020-01-11
physical_therapy         bob     2020-01-12
physical_therapy         bob     2020-01-13
inpatient_followup       bob     2020-02-01
inpatient_followup       bob     2021-02-11
pregnancy_checkup        alice   2020-01-01
admitted_maternity_ward  alice   2020-02-01
inpatient_birth          alice   2020-02-02
bed_rest                 alice   2020-02-02
bed_rest                 alice   2020-02-03
bed_rest                 alice   2020-02-04
hospital_discharge       alice   2021-02-04
I need to turn this into a list of events per entity, with the relative dates encoded as discrete intervals - say 1 day, 1 week, 1 month, 1 year - and other intervals encoded as a combination of those intervals. 14 months, 15 days would be 1 year + 1 month + 1 month + 1 week + 1 week + 1 day.
event                    entity  seq
er_visit                 bob     1
triage                   bob     2
admitted_to_icu          bob     3
nextday                  bob     4
inpatient_bed_rest       bob     5
nextday                  bob     6
inpatient_bed_rest       bob     7
physical_therapy         bob     8
nextweek                 bob     9
nextday                  bob     10
hospital_discharge       bob     11
nextday                  bob     12
physical_therapy         bob     13
nextday                  bob     14
physical_therapy         bob     15
nextweek                 bob     16
nextweek                 bob     17
nextday                  bob     18
nextday                  bob     19
nextday                  bob     20
nextday                  bob     21
nextday                  bob     22
inpatient_followup       bob     23
nextyear                 bob     24
nextweek                 bob     25
nextday                  bob     26
nextday                  bob     27
nextday                  bob     28
inpatient_followup       bob     29
pregnancy_checkup        alice   1
nextmonth                alice   2
admitted_maternity_ward  alice   3
nextday                  alice   4
inpatient_birth          alice   5
bed_rest                 alice   6
nextday                  alice   7
bed_rest                 alice   8
nextday                  alice   9
bed_rest                 alice   10
hospital_discharge       alice   11
Is there a reasonable solution for this in Athena (Presto) SQL?
I did this in Trino, but these functions should work the same way in Athena (Presto).
I've output the intervals as a number of days. Converting from there to your strangely formatted desired output is possible, but way more complicated than makes sense.
WITH the_table AS (
    SELECT event, entity, date(date) date
    FROM (VALUES
        ('er_visit', 'bob', '2020-01-01'), ('triage', 'bob', '2020-01-01'),
        ('admitted_to_icu', 'bob', '2020-01-01'), ('inpatient_bed_rest', 'bob', '2020-01-02'),
        ('inpatient_bed_rest', 'bob', '2020-01-03'), ('physical_therapy', 'bob', '2020-01-03'),
        ('hospital_discharge', 'bob', '2020-01-11'), ('physical_therapy', 'bob', '2020-01-12'),
        ('physical_therapy', 'bob', '2020-01-13'), ('inpatient_followup', 'bob', '2020-02-01'),
        ('inpatient_followup', 'bob', '2021-02-11'), ('pregnancy_checkup', 'alice', '2020-01-01'),
        ('admitted_maternity_ward', 'alice', '2020-02-01'), ('inpatient_birth', 'alice', '2020-02-02'),
        ('bed_rest', 'alice', '2020-02-02'), ('bed_rest', 'alice', '2020-02-03'),
        ('bed_rest', 'alice', '2020-02-04'), ('hospital_discharge', 'alice', '2021-02-04'))
    AS data(event, entity, date)
)
SELECT event
     , entity
     , date_diff('day', lag(date) OVER (PARTITION BY entity ORDER BY date), date) days_since_last
FROM the_table
event                    entity  days_since_last
pregnancy_checkup        alice   NULL
admitted_maternity_ward  alice   31
inpatient_birth          alice   1
bed_rest                 alice   0
bed_rest                 alice   1
bed_rest                 alice   1
hospital_discharge       alice   366
er_visit                 bob     NULL
triage                   bob     0
admitted_to_icu          bob     0
inpatient_bed_rest       bob     1
inpatient_bed_rest       bob     1
physical_therapy         bob     0
hospital_discharge       bob     8
physical_therapy         bob     1
physical_therapy         bob     1
inpatient_followup       bob     19
inpatient_followup       bob     376
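
The remaining step, turning a gap into the question's nextyear / nextmonth / nextweek / nextday markers, may be easier outside SQL. Below is a rough Python sketch, not a definitive implementation. The function name `gap_tokens` is hypothetical. It also assumes years and months are counted on the calendar (so 2020-01-01 to 2020-02-01 is one month, as in the desired output) and that only the leftover days are split into weeks and days.

```python
import calendar
from datetime import date


def gap_tokens(start: date, end: date) -> list[str]:
    """Split the gap between two dates into year/month/week/day markers.

    Hypothetical helper: assumes end >= start and calendar-based years/months.
    """
    # Whole calendar months between the two dates
    months = (end.year - start.year) * 12 + (end.month - start.month)
    if end.day < start.day:
        months -= 1

    # Move the start date forward by that many months, clamping the day
    # to the length of the target month (e.g. Jan 31 + 1 month -> Feb 29)
    total = start.year * 12 + (start.month - 1) + months
    year, month0 = divmod(total, 12)
    day = min(start.day, calendar.monthrange(year, month0 + 1)[1])
    anchor = date(year, month0 + 1, day)

    # Whatever is left is split into whole weeks and single days
    weeks, days = divmod((end - anchor).days, 7)
    years, months = divmod(months, 12)
    return (["nextyear"] * years + ["nextmonth"] * months
            + ["nextweek"] * weeks + ["nextday"] * days)


print(gap_tokens(date(2020, 1, 3), date(2020, 1, 11)))
print(gap_tokens(date(2020, 2, 1), date(2021, 2, 11)))
```

Each marker would then become its own row between the two events, with seq assigned by numbering the combined rows per entity.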

How to find the Effective end Date for the below table using select statement only

This is the actual table:
EMID  ENAME  DEPT_NO  EFDT
101   ANUJ   10       1/1/2018
101   ANUJ   11       1/1/2020
101   ANUJ   12       5/1/2020
102   KUNAL  12       1/1/2019
102   KUNAL  14       1/1/2020
102   KUNAL  15       5/1/2020
103   AJAY   11       1/1/2018
103   AJAY   12       1/1/2020
104   RAJAT  10       1/1/2018
104   RAJAT  12       1/1/2020
This is desired output:
EMID  ENAME  DEPT_NO  EFDT      EF_ENDT
101   ANUJ   10       1/1/2018  12/31/2019
101   ANUJ   11       1/1/2020  4/30/2020
101   ANUJ   12       5/1/2020  NULL
102   KUNAL  12       1/1/2019  12/31/2019
102   KUNAL  14       1/1/2020  4/30/2020
102   KUNAL  15       5/1/2020  NULL
103   AJAY   11       1/1/2018  12/31/2019
103   AJAY   12       1/1/2020  NULL
104   RAJAT  10       1/1/2018  12/31/2019
104   RAJAT  12       1/1/2020  NULL
EF_ENDT needs to be populated using a SELECT statement only.
How can we do this?
Ideally the code should be generic enough to work on any database.
Basically, you want a lead and then to subtract one day. The standard SQL for this is:
select t.*,
       lead(efdt) over (partition by emid order by efdt) - interval '1 day' as ef_endt
from t;
Date/time functions vary significantly among databases. All provide some method for subtracting one day. You'll probably have to adapt this to your particular (unstated) database.
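
To check the logic independently of any SQL dialect, the same "next start date minus one day" idea can be sketched in pandas. This is only an illustration on a subset of the rows above. It assumes EFDT is written month/day/year, which the desired 4/30/2020 end date suggests.

```python
import pandas as pd

df = pd.DataFrame({
    "EMID": [101, 101, 101, 103, 103],
    "DEPT_NO": [10, 11, 12, 11, 12],
    "EFDT": pd.to_datetime(
        ["1/1/2018", "1/1/2020", "5/1/2020", "1/1/2018", "1/1/2020"],
        format="%m/%d/%Y",  # assumption: month/day/year
    ),
})

# Equivalent of lead(efdt) over (partition by emid order by efdt)
df = df.sort_values(["EMID", "EFDT"])
next_start = df.groupby("EMID")["EFDT"].shift(-1)

# The last row of each employee has no next start date, so it stays NaT (NULL)
df["EF_ENDT"] = next_start - pd.Timedelta(days=1)

print(df)
```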

How to find sum of count, grouped by the id?

I would like to get the output for the following problem.
I have the following data:
id   start      end     count  Time   Train
001  Paris      London  01     05:00  Yes
001  Paris      London  01     05:00  Yes
002  Prague     Vienna  15     15:00  No
003  Frankfurt  London  01     17:00  Yes
015  Paris      London  08     21:00  No
019  Barcelona  Vienna  15     15:00  No
003  Frankfurt  London  01     07:00  Yes
002  Prague     Vienna  15     05:00  No
I would like to find the sum of count, grouped by id, ignoring rows that have the same id, start and end. My data is about 4 GB, and I would also like to find the start and end cities of the top 5 counts. Thank you.
I would like to get output similar to this:
Prague -> Vienna Count: 15
Barcelona -> Vienna Count: 15
Paris -> London Count: 09
Frankfurt -> London Count: 02
.....
You can use drop_duplicates + groupby with aggregating sum:
# df is assumed to already hold the table from the question
df['count'] = df['count'].astype(int)
# keep only the first row for each (id, start, end) combination
df = df.drop_duplicates(['id','start','end'])
print(df)
    id      start     end  count   Time Train
0  001      Paris  London      1  05:00   Yes
2  002     Prague  Vienna     15  15:00    No
3  003  Frankfurt  London      1  17:00   Yes
4  015      Paris  London      8  21:00    No
5  019  Barcelona  Vienna     15  15:00    No
df1 = df.groupby('id', as_index=False)['count'].sum()
print(df1)
    id  count
0  001      1
1  002     15
2  003      1
3  015      8
4  019     15
df11 = df.groupby(['id', 'start', 'end'], as_index=False)['count'].sum()
print(df11)
    id      start     end  count
0  001      Paris  London      1
1  002     Prague  Vienna     15
2  003  Frankfurt  London      1
3  015      Paris  London      8
4  019  Barcelona  Vienna     15
df12 = df.groupby(['start', 'end'], as_index=False)['count'].sum()
print(df12)
       start     end  count
0  Barcelona  Vienna     15
1  Frankfurt  London      1
2      Paris  London      9
3     Prague  Vienna     15
For top values use nlargest:
df2 = df.nlargest(5, 'count')[['start','end']]
print(df2)
       start     end
2     Prague  Vienna
5  Barcelona  Vienna
4      Paris  London
0      Paris  London
3  Frankfurt  London
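
Note that `nlargest` on the de-duplicated rows ranks individual rows, so Paris/London appears twice. If the goal is the top routes by total count, as the desired output suggests, one option is to aggregate by start and end first and rank afterwards. The sketch below rebuilds the question's data inline so it can run on its own. The variable name `routes` is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "id": ["001", "001", "002", "003", "015", "019", "003", "002"],
    "start": ["Paris", "Paris", "Prague", "Frankfurt",
              "Paris", "Barcelona", "Frankfurt", "Prague"],
    "end": ["London", "London", "Vienna", "London",
            "London", "Vienna", "London", "Vienna"],
    "count": [1, 1, 15, 1, 8, 15, 1, 15],
})

routes = (
    df.drop_duplicates(["id", "start", "end"])                    # ignore repeated id/start/end rows
      .groupby(["start", "end"], as_index=False)["count"].sum()   # total per route
      .nlargest(5, "count")                                       # top 5 routes by total
)

for start, end, count in zip(routes["start"], routes["end"], routes["count"]):
    print(f"{start} -> {end} Count: {count}")
```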
SELECT T.* FROM
(
    SELECT *, COUNT(id) AS count FROM TABLE1 GROUP BY id, start, end
) T
GROUP BY id ORDER BY count DESC LIMIT 0,5

SQL server select from 3 tables

I have three tables in my database Books, Borrowers and Movement:
Books
BookID      Title                          Author                    Category        Published
----------- ------------------------------ ------------------------- --------------- ----------
101         Ulysses                        James Joyce               Fiction         1922-06-16
102         Huckleberry Finn               Mark Twain                Fiction         1884-03-24
103         The Great Gatsby               F. Scott Fitzgerald       Fiction         1925-06-17
104         1984                           George Orwell             Fiction         1949-04-19
105         War and Peace                  Leo Tolstoy               Fiction         1869-08-01
106         Gullivers Travels              Jonathan Swift            Fiction         1726-07-01
107         Moby Dick                      Herman Melville           Fiction         1851-08-01
108         Pride and Prejudice            Jane Austen               Fiction         1813-08-13
110         The Second World War           Winston Churchill         NonFiction      1953-09-01
111         Relativity                     Albert Einstein           NonFiction      1917-01-09
112         The Right Stuff                Tom Wolfe                 NonFiction      1979-09-07
121         Hitchhikers Guide to Galaxy    Douglas Adams             Humour          1975-10-27
122         Dad Is Fat                     Jim Gaffigan              Humour          2013-03-01
131         Kick-Ass 2                     Mark Millar               Comic           2012-03-03
133         Beautiful Creatures: The Manga Kami Garcia               Comic           2014-07-01
Borrowers
BorrowerID  Name                      Birthday
----------- ------------------------- ----------
2           Bugs Bunny                1938-09-08
3           Homer Simpson             1992-09-09
5           Mickey Mouse              1928-02-08
7           Fred Flintstone           1960-06-09
11          Charlie Brown             1965-06-05
13          Popeye                    1933-03-03
17          Donald Duck               1937-07-27
19          Mr. Magoo                 1949-09-14
23          George Jetson             1948-04-08
29          SpongeBob SquarePants     1984-08-04
31          Stewie Griffin            1971-11-17
Movement
MoveID      BookID      BorrowerID  DateOut    DateIn     ReturnCondition
----------- ----------- ----------- ---------- ---------- ---------------
1           131         31          2012-06-01 2013-05-24 good
2           101         23          2012-02-10 2012-03-24 good
3           102         29          2012-02-01 2012-04-01 good
4           105         7           2012-03-23 2012-05-11 good
5           103         7           2012-03-22 2012-04-22 good
6           108         7           2012-01-23 2012-02-12 good
7           112         19          2012-01-12 2012-02-10 good
8           122         11          2012-04-14 2013-05-01 poor
9           106         17          2013-01-24 2013-02-01 good
10          104         2           2013-02-24 2013-03-10 bitten
11          121         3           2013-03-01 2013-04-01 good
12          131         19          2013-04-11 2013-05-23 good
13          111         5           2013-05-22 2013-06-22 poor
14          131         2           2013-06-12 2013-07-23 bitten
15          122         23          2013-07-10 2013-08-12 good
16          107         29          2014-01-01 2014-02-14 good
17          110         7           2014-01-11 2014-02-01 good
18          105         2           2014-02-22 2014-03-02 bitten
What is a query I can use to find out which book was borrowed by the oldest borrower?
I am new to SQL and am using Microsoft SQL Server 2014
Here are two different solutions.
The first uses two subqueries and one equi-join:
select Title
from Books b, Movement m
where b.BookID = m.BookID
  and m.BorrowerID = (select BorrowerID
                      from Borrowers
                      where Birthday = (select MIN(Birthday)
                                        from Borrowers))
The second uses two equi-joins and one subquery:
select Title
from Books b, Borrowers r, Movement m
where b.BookID = m.BookID
  and m.BorrowerID = r.BorrowerID
  and Birthday = (select MIN(Birthday) from Borrowers)
Both above queries give the following answer:
Title
------------------------------
Relativity
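
The same lookup can also be written with explicit JOIN ... ON syntax, which some readers find easier to follow than comma-separated tables. The sketch below runs such a query from Python against an in-memory SQLite database. SQLite stands in for SQL Server here, and the tables hold only a trimmed subset of the rows above, so treat it as an illustration rather than a drop-in script.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Books (BookID INTEGER PRIMARY KEY, Title TEXT);
CREATE TABLE Borrowers (BorrowerID INTEGER PRIMARY KEY, Name TEXT, Birthday TEXT);
CREATE TABLE Movement (MoveID INTEGER PRIMARY KEY, BookID INTEGER, BorrowerID INTEGER);

-- Trimmed subset of the question's data
INSERT INTO Books VALUES (104, '1984'), (111, 'Relativity'), (131, 'Kick-Ass 2');
INSERT INTO Borrowers VALUES
    (2, 'Bugs Bunny', '1938-09-08'),
    (5, 'Mickey Mouse', '1928-02-08'),
    (31, 'Stewie Griffin', '1971-11-17');
INSERT INTO Movement VALUES (1, 131, 31), (10, 104, 2), (13, 111, 5), (14, 131, 2);
""")

# The oldest borrower is the one with the earliest birthday;
# ISO-formatted date strings sort chronologically, so MIN works on them.
query = """
SELECT b.Title
FROM Books AS b
JOIN Movement AS m ON m.BookID = b.BookID
JOIN Borrowers AS r ON r.BorrowerID = m.BorrowerID
WHERE r.Birthday = (SELECT MIN(Birthday) FROM Borrowers)
"""
titles = [row[0] for row in conn.execute(query)]
print(titles)
```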

SQL: Grouping by 2 columns

I have a table points:
event_time        | name | points |
------------------------------------
2014-07-16 11:40    Bob    10
2014-07-16 10:00    Jim    20
2014-07-16 09:20    Jim    30
2014-07-15 11:20    Bob    5
2014-07-15 10:20    Anna   10
2014-07-15 09:40    Bob    30
2014-07-15 09:00    Anna   10
Is it possible to make a query that results with:
event_date | name | total_points |
------------------------------------
2014-07-16   Bob    10
2014-07-16   Jim    50
2014-07-15   Bob    35
2014-07-15   Anna   20
Where total_points is a sum of all points for the given name during the day?
select date(event_time) as event_date,
       name,
       sum(points) as total_points
from points
group by date(event_time), name
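
For comparison, the same two-column grouping can be sketched in pandas: derive the calendar date from the timestamp, then group by that date and the name. The frame below is rebuilt from the question's table, and `totals` is just an illustrative variable name.

```python
import pandas as pd

points = pd.DataFrame({
    "event_time": pd.to_datetime([
        "2014-07-16 11:40", "2014-07-16 10:00", "2014-07-16 09:20",
        "2014-07-15 11:20", "2014-07-15 10:20", "2014-07-15 09:40",
        "2014-07-15 09:00",
    ]),
    "name": ["Bob", "Jim", "Jim", "Bob", "Anna", "Bob", "Anna"],
    "points": [10, 20, 30, 5, 10, 30, 10],
})

totals = (
    points.assign(event_date=points["event_time"].dt.date)       # like date(event_time)
          .groupby(["event_date", "name"], as_index=False)["points"].sum()
          .rename(columns={"points": "total_points"})
)

print(totals)
```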