Filling na values conditionally by group and date - pandas

The following list
[[22415, 7, Timestamp('2022-02-15 00:00:00'), 'KEY', nan], [22415, 7, Timestamp('2022-02-24 00:00:00'), 'MELOXICA', nan], [22415, 7, Timestamp('2022-10-11 00:00:00'), 'CEPFR', 12.0], [22415, 7, Timestamp('2022-10-11 00:00:00'), 'MELOXICA', nan], [25302, 8, Timestamp('2022-06-05 00:00:00'), 'TOX FL', 11.0], [25302, 8, Timestamp('2022-06-05 00:00:00'), 'FLUNIX', nan], [25302, 8, Timestamp('2022-06-07 00:00:00'), 'FLUNIX', nan], [25302, 8, Timestamp('2022-07-07 00:00:00'), 'MAS', nan], [25302, 8, Timestamp('2022-07-07 00:00:00'), 'FLUNIX', nan], [26662, 8, Timestamp('2022-07-08 00:00:00'), 'FR', 12.0], [26662, 8, Timestamp('2022-07-08 00:00:00'), 'FLUNIX', nan], [26662, 8, Timestamp('2022-07-17 00:00:00'), 'SFFR', 12.0], [26662, 8, Timestamp('2022-07-17 00:00:00'), 'MELOXICA', nan]]
Translates to the following dataframe example
ID LACT Date Remark QUARTER
2105 22415 7 2022-02-15 KEY NaN
2106 22415 7 2022-02-24 MELOXICA NaN
4 22415 7 2022-10-11 CEPFR 12.0
2107 22415 7 2022-10-11 MELOXICA NaN
9 25302 8 2022-06-05 TOX FL 11.0
2116 25302 8 2022-06-05 FLUNIX NaN
2117 25302 8 2022-06-07 FLUNIX NaN
10 25302 8 2022-07-07 MAS NaN
2118 25302 8 2022-07-07 FLUNIX NaN
14 26662 8 2022-07-08 FR 12.0
2125 26662 8 2022-07-08 FLUNIX NaN
15 26662 8 2022-07-17 SFFR 12.0
2126 26662 8 2022-07-17 MELOXICA NaN
I would like to forward and backward fill "QUARTER" when it is missing and there is a value for the same ID and Lact and the interval between the "Date" is < 7 days
I have used bfill and ffill with groupby with other data when there has not been the requirement for the date constraint .
The output I am looking for in this example is.
ID LACT Date Remark QUARTER
2105 22415 7 2022-02-15 KEY NaN
2106 22415 7 2022-02-24 MELOXICA NaN
4 22415 7 2022-10-11 CEPFR 12.0
2107 22415 7 2022-10-11 MELOXICA 12.0
9 25302 8 2022-06-05 TOX FL 11.0
2116 25302 8 2022-06-05 FLUNIX 11.0
2117 25302 8 2022-06-07 FLUNIX 11.0
10 25302 8 2022-07-07 MAS 11.0
2118 25302 8 2022-07-07 FLUNIX 11.0
14 26662 8 2022-07-08 FR 12.0
2125 26662 8 2022-07-08 FLUNIX 12.0
15 26662 8 2022-07-17 SFFR 12.0
2126 26662 8 2022-07-17 MELOXICA 12.0
The source dataset is large with varied intervals between dates for the same ID and lact
Appreciate any ideas how to fill na values based on the id lact grouping within the 6 day date constraint.
Thanks

Let's group the dataframe by ID and LACT and apply a custom function to backfill and forward fill the values
def pad(grp):
g = pd.Grouper(key='Date', freq='7D', origin='start')
return grp.groupby(g)['QUARTER'].apply(lambda s: s.ffill().bfill())
df['QUARTER'] = df.groupby(['ID', 'LACT'], group_keys=False).apply(pad)
Result
ID LACT Date Remark QUARTER
0 22415 7 2022-02-15 KEY NaN
1 22415 7 2022-02-24 MELOXICA NaN
2 22415 7 2022-10-11 CEPFR 12.0
3 22415 7 2022-10-11 MELOXICA 12.0
4 25302 8 2022-06-05 TOX FL 11.0
5 25302 8 2022-06-05 FLUNIX 11.0
6 25302 8 2022-06-07 FLUNIX 11.0
7 25302 8 2022-07-07 MAS NaN
8 25302 8 2022-07-07 FLUNIX NaN
9 26662 8 2022-07-08 FR 12.0
10 26662 8 2022-07-08 FLUNIX 12.0
11 26662 8 2022-07-17 SFFR 12.0
12 26662 8 2022-07-17 MELOXICA 12.0

Use DataFrameGroupBy.diff for test intervals between dates and for less like 6 days create helper groups for use last forward filling missing values per ID, LACT and groups by GroupBy.ffill and bfill:
#if necesary convert to datetimes
df['Date'] = pd.to_datetime(df['Date'])
#if necessary sorting dates per ID LACT
df = df.sort_values(['ID','LACT','Date'])
groups = df.groupby(['ID','LACT'])['Date'].diff().dt.days.fillna(0).gt(6).cumsum()
f = lambda x: x.ffill().bfill()
df['QUARTER'] = df.groupby(['ID','LACT', groups])['QUARTER'].transform(f)
print (df)
ID LACT Date Remark QUARTER
2105 22415 7 2022-02-15 KEY NaN
2106 22415 7 2022-02-24 MELOXICA NaN
4 22415 7 2022-10-11 CEPFR 12.0
2107 22415 7 2022-10-11 MELOXICA 12.0
9 25302 8 2022-06-05 TOX FL 11.0
2116 25302 8 2022-06-05 FLUNIX 11.0
2117 25302 8 2022-06-07 FLUNIX 11.0
10 25302 8 2022-07-07 MAS NaN
2118 25302 8 2022-07-07 FLUNIX NaN
14 26662 8 2022-07-08 FR 12.0
2125 26662 8 2022-07-08 FLUNIX 12.0
15 26662 8 2022-07-17 SFFR 12.0
2126 26662 8 2022-07-17 MELOXICA 12.0

Related

Diff() function use with groupby for pandas

I am encountering an errors each time i attempt to compute the difference in readings for a meter in my dataset. The dataset structure is this.
id paymenttermid houseid houseid-meterid quantity month year cleaned_quantity
Datetime
2019-02-01 255 water 215 215M201 23.0 2 2019 23.0
2019-02-01 286 water 193 193M181 24.0 2 2019 24.0
2019-02-01 322 water 172 172M162 22.0 2 2019 22.0
2019-02-01 323 water 176 176M166 61.0 2 2019 61.0
2019-02-01 332 water 158 158M148 15.0 2 2019 15.0
I am attempting to generate a new column called consumption that computes the difference in quantities consumed for each house(identified by houseid-meterid) after every month of the year.
The code i am using to implement this is:
water_df["consumption"] = water_df.groupby(["year", "month", "houseid-meterid"])["cleaned_quantity"].diff(-1)
After executing this code, the consumption column is filled with NaN values. How can I correctly implement this logic.
The end result looks like this:
id paymenttermid houseid houseid-meterid quantity month year cleaned_quantity consumption
Datetime
2019-02-01 255 water 215 215M201 23.0 2 2019 23.0 NaN
2019-02-01 286 water 193 193M181 24.0 2 2019 24.0 NaN
2019-02-01 322 water 172 172M162 22.0 2 2019 22.0 NaN
2019-02-01 323 water 176 176M166 61.0 2 2019 61.0 NaN
2019-02-01 332 water 158 158M148 15.0 2 2019 15.0 NaN
Many thank in advance.
I have attempted to use
water_df["consumption"] = water_df.groupby(["year", "month", "houseid-meterid"])["cleaned_quantity"].diff(-1)
and
water_df["consumption"] = water_df.groupby(["year", "month", "houseid-meterid"])["cleaned_quantity"].diff(0)
and
water_df["consumption"] = water_df.groupby(["year", "month", "houseid-meterid"])["cleaned_quantity"].diff()
all this commands result in the same behaviour as stated above.
Expected output should be:
Datetime houseid-meterid cleaned_quantity consumption
2019-02-01 215M201 23.0 20
2019-03-02 215M201 43.0 9
2019-04-01 215M201 52.0 12
2019-05-01 215M201 64.0 36
2019-06-01 215M201 100.0 20
what steps should i take?
Sort values by Datetime (if needed) then group by houseid-meterid before compute the diff for cleaned_quantity values then shift row to align with the right data:
df['consumption'] = (df.sort_values('Datetime')
.groupby('houseid-meterid')['cleaned_quantity']
.transform(lambda x: x.diff().shift(-1)))
print(df)
# Output
Datetime houseid-meterid cleaned_quantity consumption
0 2019-02-01 215M201 23.0 20.0
1 2019-03-02 215M201 43.0 9.0
2 2019-04-01 215M201 52.0 12.0
3 2019-05-01 215M201 64.0 36.0
4 2019-06-01 215M201 100.0 NaN

Moving Average Pandas Across Group

My data has the following structure:
np.random.seed(25)
tdf = pd.DataFrame({'person_id' :[1,1,1,1,
2,2,
3,3,3,3,3,
4,4,4,
5,5,5,5,5,5,5,
6,
7,7,
8,8,8,8,8,8,8,
9,9,
10,10
],
'Date': ['2021-01-02','2021-01-05','2021-01-07','2021-01-09',
'2021-01-02','2021-01-05',
'2021-01-02','2021-01-05','2021-01-07','2021-01-09','2021-01-11',
'2021-01-02','2021-01-05','2021-01-07',
'2021-01-02','2021-01-05','2021-01-07','2021-01-09','2021-01-11','2021-01-13','2021-01-15',
'2021-01-02',
'2021-01-02','2021-01-05',
'2021-01-02','2021-01-05','2021-01-07','2021-01-09','2021-01-11','2021-01-13','2021-01-15',
'2021-01-02','2021-01-05',
'2021-01-02','2021-01-05'
],
'Quantity': np.floor(np.random.random(size=35)*100)
})
And I want to calculate moving average (2 periods) over Date. So, the final output looks like the following. For first MA, we are taking 2021-01-02 & 2021-01-05 across all observations & calculate the MA (50). Similarly for other dates. The output need not be in the structure I'm showing the report. I just need date & MA column in the final data.
Thanks!
IIUC, you can aggregate the similar dates first, getting the sum and count.
Then take the sum per rolling 2 dates (here it doesn't look like you want to take care of a defined period but rather raw successive values, so I am assuming here prior sorting).
Finally, perform the ratio of sum and count to get the mean:
g = tdf.groupby('Date')['Quantity']
out = g.sum().rolling(2).sum()/g.count().rolling(2).sum()
output:
Date
2021-01-02 NaN
2021-01-05 50.210526
2021-01-07 45.071429
2021-01-09 41.000000
2021-01-11 44.571429
2021-01-13 48.800000
2021-01-15 50.500000
Name: Quantity, dtype: float64
joining the original data:
g = tdf.groupby('Date')['Quantity']
s = g.sum().rolling(2).sum()/g.count().rolling(2).sum()
tdf.merge(s.rename('Quantity_MA(2)'), left_on='Date', right_index=True)
output:
person_id Date Quantity Quantity_MA(2)
0 1 2021-01-02 87.0 NaN
4 2 2021-01-02 41.0 NaN
6 3 2021-01-02 68.0 NaN
11 4 2021-01-02 11.0 NaN
14 5 2021-01-02 16.0 NaN
21 6 2021-01-02 51.0 NaN
22 7 2021-01-02 38.0 NaN
24 8 2021-01-02 51.0 NaN
31 9 2021-01-02 90.0 NaN
33 10 2021-01-02 45.0 NaN
1 1 2021-01-05 58.0 50.210526
5 2 2021-01-05 11.0 50.210526
7 3 2021-01-05 43.0 50.210526
12 4 2021-01-05 44.0 50.210526
15 5 2021-01-05 52.0 50.210526
23 7 2021-01-05 99.0 50.210526
25 8 2021-01-05 55.0 50.210526
32 9 2021-01-05 66.0 50.210526
34 10 2021-01-05 28.0 50.210526
2 1 2021-01-07 27.0 45.071429
8 3 2021-01-07 55.0 45.071429
13 4 2021-01-07 58.0 45.071429
16 5 2021-01-07 32.0 45.071429
26 8 2021-01-07 3.0 45.071429
3 1 2021-01-09 18.0 41.000000
9 3 2021-01-09 36.0 41.000000
17 5 2021-01-09 69.0 41.000000
27 8 2021-01-09 71.0 41.000000
10 3 2021-01-11 40.0 44.571429
18 5 2021-01-11 36.0 44.571429
28 8 2021-01-11 42.0 44.571429
19 5 2021-01-13 83.0 48.800000
29 8 2021-01-13 43.0 48.800000
20 5 2021-01-15 48.0 50.500000
30 8 2021-01-15 28.0 50.500000

creating multi index from data grouped by month in Pandas

Consider this sample data:
Month Location Products Sales Profit
JAN 1 43 32 20
JAN 2 82 54 25
JAN 3 64 43 56
FEB 1 37 28 78
FEB 2 18 15 34
FEB 3 5 2 4
MAR 1 47 40 14
The multi-index transformation I am trying to achieve is this:
JAN FEB MAR
Location Products Sales Profit Products Sales Profit Products Sales Profit
1 43 32 29 37 28 78 47 40 14
2 82 54 25 18 15 34 null null null
3 64 43 56 5 2 4 null null null
I tried versions of this:
df.stack().to_frame().T
It put all the data into one row. So, that's not the goal.
I presume I am close in that it should be a stacking or unstacking, melting or unmelting, but my attempts have all resulted in data oatmeal at this point. Appreciate your time trying to solve this one.
You can use pivot with reorder_levels and sort_index():
df.pivot(index='Location',columns='Month').reorder_levels(order=[1,0],axis=1).sort_index(axis=1)
Month FEB JAN MAR
Products Profit Sales Products Profit Sales Products Profit Sales
Location
1 37.0 78.0 28.0 43.0 20.0 32.0 47.0 14.0 40.0
2 18.0 34.0 15.0 82.0 25.0 54.0 NaN NaN NaN
3 5.0 4.0 2.0 64.0 56.0 43.0 NaN NaN NaN
In case you are interested, this answer elaborates between swaplevel and reoder_levels.
Use pivot:
>>> df.pivot('Location', 'Month').swaplevel(axis=1).sort_index(axis=1)
Month FEB JAN MAR
Products Profit Sales Products Profit Sales Products Profit Sales
Location
1 37.0 78.0 28.0 43.0 20.0 32.0 47.0 14.0 40.0
2 18.0 34.0 15.0 82.0 25.0 54.0 NaN NaN NaN
3 5.0 4.0 2.0 64.0 56.0 43.0 NaN NaN NaN
To preserve order, you have to transform your Month column as CategoricalDtype before:
df['Month'] = df['Month'].astype(pd.CategoricalDtype(df['Month'].unique(), ordered=True))
out = df.pivot('Location', 'Month').swaplevel(axis=1).sort_index(axis=1)
print(out)
# Output:
Month JAN FEB MAR
Products Profit Sales Products Profit Sales Products Profit Sales
Location
1 43.0 20.0 32.0 37.0 78.0 28.0 47.0 14.0 40.0
2 82.0 25.0 54.0 18.0 34.0 15.0 NaN NaN NaN
3 64.0 56.0 43.0 5.0 4.0 2.0 NaN NaN NaN
Update 2
Try to force the order of level 2 columns:
df1 = df.set_index(['Month', 'Location'])
df1.columns = pd.CategoricalIndex(df1.columns, ordered=True)
df1 = df1.unstack('Month').swaplevel(axis=1).sort_index(axis=1)

Pandas combine two dataframes based on time difference

I have two data frames that stores different types of medical information of patients. The common elements of both the data frames are the encounter ID (hadm_id), the time the information was recorded ((n|c)e_charttime).
One data frame (df_str) contains structured information such as vital signs and lab test values and values derived from these (such as change statistics over 24 hours). The other data frame (df_notes) contains a column with a clinical note recorded at a specified time for an encounter. Both these data frames contain multiple encounters, but the common element is the encounter ID (hadm_id).
Here are examples of the data frames for ONE encounter ID (hadm_id) with a subset of variables:
df_str
hadm_id ce_charttime hr resp magnesium hr_24hr_mean
0 196673 2108-03-05 15:34:00 95.0 12.0 NaN 95.000000
1 196673 2108-03-05 16:00:00 85.0 11.0 NaN 90.000000
2 196673 2108-03-05 16:16:00 85.0 11.0 1.8 88.333333
3 196673 2108-03-05 17:00:00 109.0 12.0 1.8 93.500000
4 196673 2108-03-05 18:00:00 97.0 12.0 1.8 94.200000
5 196673 2108-03-05 19:00:00 99.0 16.0 1.8 95.000000
6 196673 2108-03-05 20:00:00 98.0 13.0 1.8 95.428571
7 196673 2108-03-05 21:00:00 97.0 14.0 1.8 95.625000
8 196673 2108-03-05 22:00:00 101.0 12.0 1.8 96.222222
9 196673 2108-03-05 23:00:00 97.0 13.0 1.8 96.300000
10 196673 2108-03-06 00:00:00 93.0 13.0 1.8 96.000000
11 196673 2108-03-06 01:00:00 89.0 12.0 1.8 95.416667
12 196673 2108-03-06 02:00:00 88.0 10.0 1.8 94.846154
13 196673 2108-03-06 03:00:00 87.0 12.0 1.8 94.285714
14 196673 2108-03-06 04:00:00 97.0 19.0 1.8 94.466667
15 196673 2108-03-06 05:00:00 95.0 11.0 1.8 94.500000
16 196673 2108-03-06 05:43:00 95.0 11.0 2.0 94.529412
17 196673 2108-03-06 06:00:00 103.0 17.0 2.0 95.000000
18 196673 2108-03-06 07:00:00 101.0 12.0 2.0 95.315789
19 196673 2108-03-06 08:00:00 103.0 20.0 2.0 95.700000
20 196673 2108-03-06 09:00:00 84.0 11.0 2.0 95.142857
21 196673 2108-03-06 10:00:00 89.0 11.0 2.0 94.863636
22 196673 2108-03-06 11:00:00 91.0 14.0 2.0 94.695652
23 196673 2108-03-06 12:00:00 85.0 10.0 2.0 94.291667
24 196673 2108-03-06 13:00:00 98.0 14.0 2.0 94.440000
25 196673 2108-03-06 14:00:00 100.0 18.0 2.0 94.653846
26 196673 2108-03-06 15:00:00 95.0 12.0 2.0 94.666667
27 196673 2108-03-06 16:00:00 96.0 20.0 2.0 95.076923
28 196673 2108-03-06 17:00:00 106.0 21.0 2.0 95.360000
df_notes
hadm_id ne_charttime note
0 196673 2108-03-05 16:54:00 Nursing\nNursing Progress Note\nPt is a 43 yo ...
1 196673 2108-03-05 17:54:00 Physician \nPhysician Resident Admission Note\...
2 196673 2108-03-05 18:09:00 Physician \nPhysician Resident Admission Note\...
3 196673 2108-03-06 06:11:00 Nursing\nNursing Progress Note\nPain control (...
4 196673 2108-03-06 08:06:00 Physician \nPhysician Resident Progress Note\n...
5 196673 2108-03-06 12:40:00 Nursing\nNursing Progress Note\nChief Complain...
6 196673 2108-03-06 13:01:00 Nursing\nNursing Progress Note\nPain control (...
7 196673 2108-03-06 17:09:00 Nursing\nNursing Transfer Note\nChief Complain...
8 196673 2108-03-06 17:12:00 Nursing\nNursing Transfer Note\nPain control (...
9 196673 2108-03-07 15:25:00 Radiology\nCHEST (PA & LAT)\n[**2108-3-7**] 3:...
10 196673 2108-03-07 18:34:00 Radiology\nCTA CHEST W&W/O C&RECONS, NON-CORON...
11 196673 2108-03-09 09:10:00 Radiology\nABDOMEN (SUPINE & ERECT)\n[**2108-3...
12 196673 2108-03-09 12:22:00 Radiology\nCT ABDOMEN W/CONTRAST\n[**2108-3-9*...
13 196673 2108-03-10 05:26:00 Radiology\nABDOMEN (SUPINE & ERECT)\n[**2108-3...
14 196673 2108-03-10 05:27:00 Radiology\nCHEST (PA & LAT)\n[**2108-3-10**] 5...
What I want to do is to combine both the data frames based on the time when that information was recorded. More specifically, for each row in df_notes, I want a corresponding row from df_str with ce_charttime <= ne_charttime.
As an example, the first row in df_notes has ne_charttime = 2108-03-05 16:54:00. There are three rows in df_str with record times less than this time: ce_charttime = 2108-03-05 15:34:00, ce_charttime = 2108-03-05 16:00:00, ce_charttime = 2108-03-05 16:16:00. The most recent of these is the row with ce_charttime = 2108-03-05 16:16:00. So in my resulting data frame, for ne_charttime = 2108-03-05 16:54:00, I will have hr = 85.0, resp = 11.0, magnesium = 1.8, hr_24hr_mean = 88.33.
Essentially, in this example the resulting data frame will look like this:
hadm_id ne_charttime note hr resp magnesium hr_24hr_mean
0 196673 2108-03-05 16:54:00 Nursing\nNursing Progress Note\nPt is a 43 yo ... 85.0 11.0 1.8 88.333333
1 196673 2108-03-05 17:54:00 Physician \nPhysician Resident Admission Note\... 109.0 12.0 1.8 93.500000
2 196673 2108-03-05 18:09:00 Physician \nPhysician Resident Admission Note\... 97.0 12.0 1.8 94.200000
3 196673 2108-03-06 06:11:00 Nursing\nNursing Progress Note\nPain control (... 103.0 17.0 2.0 95.000000
4 196673 2108-03-06 08:06:00 Physician \nPhysician Resident Progress Note\n... 103.0 20.0 2.0 95.700000
5 196673 2108-03-06 12:40:00 Nursing\nNursing Progress Note\nChief Complain... 85.0 10.0 2.0 94.291667
6 196673 2108-03-06 13:01:00 Nursing\nNursing Progress Note\nPain control (... 98.0 14.0 2.0 94.440000
7 196673 2108-03-06 17:09:00 Nursing\nNursing Transfer Note\nChief Complain... 106.0 21.0 2.0 95.360000
8 196673 2108-03-06 17:12:00 Nursing\nNursing Transfer Note\nPain control (... NaN NaN NaN NaN
9 196673 2108-03-07 15:25:00 Radiology\nCHEST (PA & LAT)\n[**2108-3-7**] 3:... NaN NaN NaN NaN
10 196673 2108-03-07 18:34:00 Radiology\nCTA CHEST W&W/O C&RECONS, NON-CORON... NaN NaN NaN NaN
11 196673 2108-03-09 09:10:00 Radiology\nABDOMEN (SUPINE & ERECT)\n[**2108-3... NaN NaN NaN NaN
12 196673 2108-03-09 12:22:00 Radiology\nCT ABDOMEN W/CONTRAST\n[**2108-3-9*... NaN NaN NaN NaN
13 196673 2108-03-10 05:26:00 Radiology\nABDOMEN (SUPINE & ERECT)\n[**2108-3... NaN NaN NaN NaN
14 196673 2108-03-10 05:27:00 Radiology\nCHEST (PA & LAT)\n[**2108-3-10**] 5... NaN NaN NaN NaN
The resulting data frame will be of the same length as df_notes. I have been able to come with a very inefficient piece of code using for loops and explicit indexing to get this result:
cols = list(df_str.columns[2:])
final_df = df_notes.copy()
for col in cols:
final_df[col] = np.nan
idx = 0
for i, note_row in final_df.iterrows():
ne = note_row['ne_charttime']
for j, str_row in df_str.iterrows():
ce = str_row['ce_charttime']
if ne < ce:
idx += 1
for col in cols:
final_df.iloc[i, final_df.columns.get_loc(col)] = df_str.iloc[j-1][col]
break
for col in cols:
final_df.iloc[idx, final_df.columns.get_loc(col)] = df_str.iloc[-1][col]
This piece of code is bad because it is very inefficient and while it may work for this example, in my example dataset, I have over 30 different columns of structured variables, and over 10,000 encounters.
EDIT-2:
#Stef has provided an excellent answer which seems to work and replace my elaborate loopy code with a single line (amazing). However, while that works for this particular example, I am running into problems when I apply it to a bigger subset which includes multiple encounters. For example, consider the following example:
df_str.shape, df_notes.shape
((217, 386), (35, 4))
df_notes[['hadm_id', 'ne_charttime']]
hadm_id ne_charttime
0 100104 2201-06-21 20:00:00
1 100104 2201-06-21 22:51:00
2 100104 2201-06-22 05:00:00
3 100104 2201-06-23 04:33:00
4 100104 2201-06-23 12:59:00
5 100104 2201-06-24 05:15:00
6 100372 2115-12-20 02:29:00
7 100372 2115-12-21 10:15:00
8 100372 2115-12-22 13:05:00
9 100372 2115-12-25 17:16:00
10 100372 2115-12-30 10:58:00
11 100372 2115-12-30 13:07:00
12 100372 2115-12-30 14:16:00
13 100372 2115-12-30 22:34:00
14 100372 2116-01-03 09:10:00
15 100372 2116-01-07 11:08:00
16 100975 2126-03-02 06:06:00
17 100975 2126-03-02 17:44:00
18 100975 2126-03-03 05:36:00
19 100975 2126-03-03 18:27:00
20 100975 2126-03-04 05:29:00
21 100975 2126-03-04 10:48:00
22 100975 2126-03-04 16:42:00
23 100975 2126-03-05 22:12:00
24 100975 2126-03-05 23:01:00
25 100975 2126-03-06 11:02:00
26 100975 2126-03-06 13:38:00
27 100975 2126-03-08 13:39:00
28 100975 2126-03-11 10:41:00
29 101511 2199-04-30 09:29:00
30 101511 2199-04-30 09:53:00
31 101511 2199-04-30 18:06:00
32 101511 2199-05-01 08:28:00
33 111073 2195-05-01 01:56:00
34 111073 2195-05-01 21:49:00
This example has 5 encounters. The dataframe is sorted by hadm_id and within each hadm_id, ne_charttime is sorted. However, the column ne_charttime by itself is NOT sorted as seen from row 0 ce_charttime=2201-06-21 20:00:00 and row 6 ne_charttime=2115-12-20 02:29:00. When I try to do a merge_asof, I get the following error:
ValueError: left keys must be sorted. Is this because of the fact that ne_charttime column is not sorted? If so, how do I rectify this while maintaining the integrity of the encounter ID group?
EDIT-1:
I was able to loop over the encounters as well:
cols = list(dev_str.columns[1:]) # get the cols to merge (everything except hadm_id)
final_dfs = []
grouped = dev_notes.groupby('hadm_id') # get groups of encounter ids
for name, group in grouped:
final_df = group.copy().reset_index(drop=True) # make a copy of notes for that encounter
for col in cols:
final_df[col] = np.nan # set the values to nan
idx = 0 # index to track the final row in the given encounter
for i, note_row in final_df.iterrows():
ne = note_row['ne_charttime']
sub = dev_str.loc[(dev_str['hadm_id'] == name)].reset_index(drop=True) # get the df corresponding to the ecounter
for j, str_row in sub.iterrows():
ce = str_row['ce_charttime']
if ne < ce: # if the variable charttime < note charttime
idx += 1
# grab the previous values for the variables and break
for col in cols:
final_df.iloc[i, final_df.columns.get_loc(col)] = sub.iloc[j-1][col]
break
# get the last value in the df for the variables
for col in cols:
final_df.iloc[idx, final_df.columns.get_loc(col)] = sub.iloc[-1][col]
final_dfs.append(final_df) # append the df to the list
# cat the list to get final df and reset index
final_df = pd.concat(final_dfs)
final_df.reset_index(inplace=True, drop=True)
Again this very inefficient but does the job.
Is there a better way to achieve what I want? Any help is appreciated.
Thanks.
You can use merge_asof (both dataframes must be sorted by the columns you're merging them on, which is already the case in your example):
final_df = pd.merge_asof(df_notes, df_str, left_on='ne_charttime', right_on='ce_charttime', by='hadm_id')
Result:
hadm_id ne_charttime note ce_charttime hr resp magnesium hr_24hr_mean
0 196673 2108-03-05 16:54:00 Nursing\nNursing Progress Note\nPt is a 43 yo ... 2108-03-05 16:16:00 85.0 11.0 1.8 88.333333
1 196673 2108-03-05 17:54:00 Physician \nPhysician Resident Admission Note\... 2108-03-05 17:00:00 109.0 12.0 1.8 93.500000
2 196673 2108-03-05 18:09:00 Physician \nPhysician Resident Admission Note\... 2108-03-05 18:00:00 97.0 12.0 1.8 94.200000
3 196673 2108-03-06 06:11:00 Nursing\nNursing Progress Note\nPain control (... 2108-03-06 06:00:00 103.0 17.0 2.0 95.000000
4 196673 2108-03-06 08:06:00 Physician \nPhysician Resident Progress Note\n... 2108-03-06 08:00:00 103.0 20.0 2.0 95.700000
5 196673 2108-03-06 12:40:00 Nursing\nNursing Progress Note\nChief Complain... 2108-03-06 12:00:00 85.0 10.0 2.0 94.291667
6 196673 2108-03-06 13:01:00 Nursing\nNursing Progress Note\nPain control (... 2108-03-06 13:00:00 98.0 14.0 2.0 94.440000
7 196673 2108-03-06 17:09:00 Nursing\nNursing Transfer Note\nChief Complain... 2108-03-06 17:00:00 106.0 21.0 2.0 95.360000
8 196673 2108-03-06 17:12:00 Nursing\nNursing Transfer Note\nPain control (... 2108-03-06 17:00:00 106.0 21.0 2.0 95.360000
9 196673 2108-03-07 15:25:00 Radiology\nCHEST (PA & LAT)\n[**2108-3-7**] 3:... 2108-03-06 17:00:00 106.0 21.0 2.0 95.360000
10 196673 2108-03-07 18:34:00 Radiology\nCTA CHEST W&W/O C&RECONS, NON-CORON... 2108-03-06 17:00:00 106.0 21.0 2.0 95.360000
11 196673 2108-03-09 09:10:00 Radiology\nABDOMEN (SUPINE & ERECT)\n[**2108-3... 2108-03-06 17:00:00 106.0 21.0 2.0 95.360000
12 196673 2108-03-09 12:22:00 Radiology\nCT ABDOMEN W/CONTRAST\n[**2108-3-9*... 2108-03-06 17:00:00 106.0 21.0 2.0 95.360000
13 196673 2108-03-10 05:26:00 Radiology\nABDOMEN (SUPINE & ERECT)\n[**2108-3... 2108-03-06 17:00:00 106.0 21.0 2.0 95.360000
14 196673 2108-03-10 05:27:00 Radiology\nCHEST (PA & LAT)\n[**2108-3-10**] 5... 2108-03-06 17:00:00 106.0 21.0 2.0 95.360000
PS: This gives you the correct result for all rows. There's a logical flaw in your code: you look for the first time ce_charttime > ne_charttime and then take the previous row. If there's no such time, you'll never have the chance to take the previous row, hence the NaNs in your result table starting from row 8.
PPS: This includes ce_charttime in the final dataframe. You can replace it by a column of how old the information is and/or remove it:
final_df['info_age'] = final_df.ne_charttime - final_df.ce_charttime
final_df = final_df.drop(columns='ce_charttime')
UPDATE for EDIT-2: As I wrote at the very beginning, repeated in the comments and as the docs clearly states: both ce_charttime and ne_charttime must be sorted (hadm_id need not be sorted). If this condition is not met, you'll have to (temporarily) sort your dataframes as required. See the following example:
import pandas as pd, string
df_str = pd.DataFrame( {'hadm_id': pd.np.tile([111111, 222222],10), 'ce_charttime': pd.date_range('2019-10-01 00:30', periods=20, freq='30T'), 'hr': pd.np.random.randint(80,120,20)})
df_notes = pd.DataFrame( {'hadm_id': pd.np.tile([111111, 222222],3), 'ne_charttime': pd.date_range('2019-10-01 00:45', periods=6, freq='40T'), 'note': [''.join(pd.np.random.choice(list(string.ascii_letters), 10)) for _ in range(6)]}).sort_values('hadm_id')
final_df = pd.merge_asof(df_notes.sort_values('ne_charttime'), df_str, left_on='ne_charttime', right_on='ce_charttime', by='hadm_id').sort_values(['hadm_id', 'ne_charttime'])
print(df_str); print(df_notes); print(final_df)
Output:
hadm_id ce_charttime hr
0 111111 2019-10-01 00:30:00 118
1 222222 2019-10-01 01:00:00 93
2 111111 2019-10-01 01:30:00 92
3 222222 2019-10-01 02:00:00 86
4 111111 2019-10-01 02:30:00 88
5 222222 2019-10-01 03:00:00 86
6 111111 2019-10-01 03:30:00 106
7 222222 2019-10-01 04:00:00 91
8 111111 2019-10-01 04:30:00 109
9 222222 2019-10-01 05:00:00 95
10 111111 2019-10-01 05:30:00 113
11 222222 2019-10-01 06:00:00 92
12 111111 2019-10-01 06:30:00 104
13 222222 2019-10-01 07:00:00 83
14 111111 2019-10-01 07:30:00 114
15 222222 2019-10-01 08:00:00 98
16 111111 2019-10-01 08:30:00 110
17 222222 2019-10-01 09:00:00 89
18 111111 2019-10-01 09:30:00 98
19 222222 2019-10-01 10:00:00 109
hadm_id ne_charttime note
0 111111 2019-10-01 00:45:00 jOcRWVdPDF
2 111111 2019-10-01 02:05:00 mvScJNrwra
4 111111 2019-10-01 03:25:00 FBAFbJYflE
1 222222 2019-10-01 01:25:00 ilNuInOsYZ
3 222222 2019-10-01 02:45:00 ysyolaNmkV
5 222222 2019-10-01 04:05:00 wvowGGETaP
hadm_id ne_charttime note ce_charttime hr
0 111111 2019-10-01 00:45:00 jOcRWVdPDF 2019-10-01 00:30:00 118
2 111111 2019-10-01 02:05:00 mvScJNrwra 2019-10-01 01:30:00 92
4 111111 2019-10-01 03:25:00 FBAFbJYflE 2019-10-01 02:30:00 88
1 222222 2019-10-01 01:25:00 ilNuInOsYZ 2019-10-01 01:00:00 93
3 222222 2019-10-01 02:45:00 ysyolaNmkV 2019-10-01 02:00:00 86
5 222222 2019-10-01 04:05:00 wvowGGETaP 2019-10-01 04:00:00 91
You can do full merge and then filter with query:
df_notes.merge(df_str, on=hadm_id).query('ce_charttime <= ne_charttime')

Pandas dataframe column math when row conditions is met

I have a dataframe containing the following data. I would like to query the age column of each dataframe (1-4) for values between 295.0 and 305.0. For each dataframe there will be a single age value in this range and a corresponding subsidence value. I would like to take the subsidence value and add it to the remaining values in the dataframe.
For instance in the first dataframe; at age 300.0 subsidence= 274.057861. In this case, 274.057861 would be added to the rest of the subsidence values in dataframe 1.
In the second data frame; at age 299.0 subsidence= 77.773720. So, 77.773720 would be added to to the rest of the subsidence values in dataframe 2. Etc, etc. Is it possible to do this easily in Pandas or am I better off working towards an alternate solution.
Thanks :)
1 2 3 4 \
age subsidence age subsidence age subsidence age
0 0.0 -201.538712 0.0 -235.865433 0.0 134.728821 0.0
1 10.0 -77.446548 8.0 -102.183365 10.0 88.796074 10.0
2 20.0 44.901043 18.0 35.316868 20.0 35.871178 20.0
3 31.0 103.172806 28.0 98.238434 30.0 -17.901653 30.0
4 41.0 124.625687 38.0 124.719254 40.0 -13.381897 40.0
5 51.0 122.877541 48.0 130.725235 50.0 -25.396996 50.0
6 61.0 138.810898 58.0 140.301117 60.0 -37.057205 60.0
7 71.0 119.818176 68.0 137.433670 70.0 -11.587639 70.0
8 81.0 77.867607 78.0 96.285652 80.0 21.854662 80.0
9 91.0 33.612885 88.0 32.740803 90.0 67.754501 90.0
10 101.0 15.885051 98.0 8.626043 100.0 150.172699 100.0
11 111.0 118.089211 109.0 88.812439 100.0 150.172699 100.0
12 121.0 247.301956 119.0 212.000061 110.0 124.367874 110.0
13 131.0 268.748627 129.0 253.204819 120.0 157.066010 120.0
14 141.0 231.799255 139.0 292.828461 130.0 145.811783 130.0
15 151.0 259.626343 149.0 260.067993 140.0 175.388763 140.0
16 161.0 288.704651 159.0 240.051605 150.0 265.435791 150.0
17 171.0 249.121857 169.0 203.727097 160.0 336.471924 160.0
18 181.0 339.038055 179.0 245.738480 170.0 283.483582 170.0
19 191.0 395.920410 189.0 318.751160 180.0 381.575500 180.0
20 201.0 404.843445 199.0 338.245209 190.0 491.534424 190.0
21 211.0 461.865784 209.0 418.997559 200.0 495.025604 200.0
22 221.0 518.710632 219.0 446.496216 200.0 495.025604 200.0
23 231.0 483.963867 224.0 479.213287 210.0 571.982361 210.0
24 239.0 445.292389 229.0 492.352905 220.0 611.698608 220.0
25 249.0 396.609497 239.0 445.322144 230.0 645.545776 230.0
26 259.0 321.553558 249.0 429.429932 240.0 596.046265 240.0
27 269.0 306.150177 259.0 297.355103 250.0 547.157654 250.0
28 279.0 259.717468 269.0 174.210785 260.0 457.071472 260.0
29 289.0 301.114410 279.0 114.175957 270.0 438.705170 270.0
30 300.0 274.057861 289.0 91.768898 280.0 397.985535 280.0
31 310.0 216.760361 299.0 77.773720 290.0 426.858276 290.0
32 320.0 192.317093 309.0 73.767090 300.0 410.508331 300.0
33 330.0 179.511917 319.0 63.295345 300.0 410.508331 300.0
34 340.0 231.126053 329.0 -4.296405 310.0 355.303558 310.0
35 350.0 142.894958 339.0 -62.745190 320.0 284.932892 320.0
36 360.0 51.547047 350.0 -60.224789 330.0 251.817078 330.0
37 370.0 -39.064964 360.0 -85.826874 340.0 302.303925 340.0
38 380.0 -54.111374 370.0 -81.139206 350.0 207.799942 350.0
39 390.0 -68.999535 380.0 -40.080212 360.0 77.729439 360.0
40 400.0 -47.595322 390.0 -29.945852 370.0 -127.037209 370.0
41 410.0 13.159509 400.0 -26.656607 380.0 -109.327545 380.0
42 NaN NaN 410.0 -13.723764 390.0 -127.160942 390.0
43 NaN NaN NaN NaN 400.0 -61.404510 400.0
44 NaN NaN NaN NaN 410.0 13.058900 410.0
For the first Dataframe:
df1['subsidence'] = df1[(df1.age >295) & (df1.age <305)]['subsidence'].value
You need to update each dataframes accordingly.