Pandas combine two dataframes based on time difference
I have two data frames that store different types of medical information for patients. The common elements of both data frames are the encounter ID (hadm_id) and the time the information was recorded ((n|c)e_charttime).
One data frame (df_str) contains structured information such as vital signs and lab test values, as well as values derived from these (such as change statistics over 24 hours). The other data frame (df_notes) contains a column with a clinical note recorded at a specified time for an encounter. Both data frames contain multiple encounters, and the common element is the encounter ID (hadm_id).
Here are examples of the data frames for ONE encounter ID (hadm_id) with a subset of variables:
df_str
hadm_id ce_charttime hr resp magnesium hr_24hr_mean
0 196673 2108-03-05 15:34:00 95.0 12.0 NaN 95.000000
1 196673 2108-03-05 16:00:00 85.0 11.0 NaN 90.000000
2 196673 2108-03-05 16:16:00 85.0 11.0 1.8 88.333333
3 196673 2108-03-05 17:00:00 109.0 12.0 1.8 93.500000
4 196673 2108-03-05 18:00:00 97.0 12.0 1.8 94.200000
5 196673 2108-03-05 19:00:00 99.0 16.0 1.8 95.000000
6 196673 2108-03-05 20:00:00 98.0 13.0 1.8 95.428571
7 196673 2108-03-05 21:00:00 97.0 14.0 1.8 95.625000
8 196673 2108-03-05 22:00:00 101.0 12.0 1.8 96.222222
9 196673 2108-03-05 23:00:00 97.0 13.0 1.8 96.300000
10 196673 2108-03-06 00:00:00 93.0 13.0 1.8 96.000000
11 196673 2108-03-06 01:00:00 89.0 12.0 1.8 95.416667
12 196673 2108-03-06 02:00:00 88.0 10.0 1.8 94.846154
13 196673 2108-03-06 03:00:00 87.0 12.0 1.8 94.285714
14 196673 2108-03-06 04:00:00 97.0 19.0 1.8 94.466667
15 196673 2108-03-06 05:00:00 95.0 11.0 1.8 94.500000
16 196673 2108-03-06 05:43:00 95.0 11.0 2.0 94.529412
17 196673 2108-03-06 06:00:00 103.0 17.0 2.0 95.000000
18 196673 2108-03-06 07:00:00 101.0 12.0 2.0 95.315789
19 196673 2108-03-06 08:00:00 103.0 20.0 2.0 95.700000
20 196673 2108-03-06 09:00:00 84.0 11.0 2.0 95.142857
21 196673 2108-03-06 10:00:00 89.0 11.0 2.0 94.863636
22 196673 2108-03-06 11:00:00 91.0 14.0 2.0 94.695652
23 196673 2108-03-06 12:00:00 85.0 10.0 2.0 94.291667
24 196673 2108-03-06 13:00:00 98.0 14.0 2.0 94.440000
25 196673 2108-03-06 14:00:00 100.0 18.0 2.0 94.653846
26 196673 2108-03-06 15:00:00 95.0 12.0 2.0 94.666667
27 196673 2108-03-06 16:00:00 96.0 20.0 2.0 95.076923
28 196673 2108-03-06 17:00:00 106.0 21.0 2.0 95.360000
df_notes
hadm_id ne_charttime note
0 196673 2108-03-05 16:54:00 Nursing\nNursing Progress Note\nPt is a 43 yo ...
1 196673 2108-03-05 17:54:00 Physician \nPhysician Resident Admission Note\...
2 196673 2108-03-05 18:09:00 Physician \nPhysician Resident Admission Note\...
3 196673 2108-03-06 06:11:00 Nursing\nNursing Progress Note\nPain control (...
4 196673 2108-03-06 08:06:00 Physician \nPhysician Resident Progress Note\n...
5 196673 2108-03-06 12:40:00 Nursing\nNursing Progress Note\nChief Complain...
6 196673 2108-03-06 13:01:00 Nursing\nNursing Progress Note\nPain control (...
7 196673 2108-03-06 17:09:00 Nursing\nNursing Transfer Note\nChief Complain...
8 196673 2108-03-06 17:12:00 Nursing\nNursing Transfer Note\nPain control (...
9 196673 2108-03-07 15:25:00 Radiology\nCHEST (PA & LAT)\n[**2108-3-7**] 3:...
10 196673 2108-03-07 18:34:00 Radiology\nCTA CHEST W&W/O C&RECONS, NON-CORON...
11 196673 2108-03-09 09:10:00 Radiology\nABDOMEN (SUPINE & ERECT)\n[**2108-3...
12 196673 2108-03-09 12:22:00 Radiology\nCT ABDOMEN W/CONTRAST\n[**2108-3-9*...
13 196673 2108-03-10 05:26:00 Radiology\nABDOMEN (SUPINE & ERECT)\n[**2108-3...
14 196673 2108-03-10 05:27:00 Radiology\nCHEST (PA & LAT)\n[**2108-3-10**] 5...
What I want to do is combine both data frames based on the time the information was recorded. More specifically, for each row in df_notes, I want the row from df_str with the most recent ce_charttime such that ce_charttime <= ne_charttime.
As an example, the first row in df_notes has ne_charttime = 2108-03-05 16:54:00. There are three rows in df_str with record times earlier than this: ce_charttime = 2108-03-05 15:34:00, 2108-03-05 16:00:00, and 2108-03-05 16:16:00. The most recent of these is the row with ce_charttime = 2108-03-05 16:16:00. So in my resulting data frame, the row for ne_charttime = 2108-03-05 16:54:00 will have hr = 85.0, resp = 11.0, magnesium = 1.8, hr_24hr_mean = 88.33.
Essentially, in this example the resulting data frame will look like this:
hadm_id ne_charttime note hr resp magnesium hr_24hr_mean
0 196673 2108-03-05 16:54:00 Nursing\nNursing Progress Note\nPt is a 43 yo ... 85.0 11.0 1.8 88.333333
1 196673 2108-03-05 17:54:00 Physician \nPhysician Resident Admission Note\... 109.0 12.0 1.8 93.500000
2 196673 2108-03-05 18:09:00 Physician \nPhysician Resident Admission Note\... 97.0 12.0 1.8 94.200000
3 196673 2108-03-06 06:11:00 Nursing\nNursing Progress Note\nPain control (... 103.0 17.0 2.0 95.000000
4 196673 2108-03-06 08:06:00 Physician \nPhysician Resident Progress Note\n... 103.0 20.0 2.0 95.700000
5 196673 2108-03-06 12:40:00 Nursing\nNursing Progress Note\nChief Complain... 85.0 10.0 2.0 94.291667
6 196673 2108-03-06 13:01:00 Nursing\nNursing Progress Note\nPain control (... 98.0 14.0 2.0 94.440000
7 196673 2108-03-06 17:09:00 Nursing\nNursing Transfer Note\nChief Complain... 106.0 21.0 2.0 95.360000
8 196673 2108-03-06 17:12:00 Nursing\nNursing Transfer Note\nPain control (... NaN NaN NaN NaN
9 196673 2108-03-07 15:25:00 Radiology\nCHEST (PA & LAT)\n[**2108-3-7**] 3:... NaN NaN NaN NaN
10 196673 2108-03-07 18:34:00 Radiology\nCTA CHEST W&W/O C&RECONS, NON-CORON... NaN NaN NaN NaN
11 196673 2108-03-09 09:10:00 Radiology\nABDOMEN (SUPINE & ERECT)\n[**2108-3... NaN NaN NaN NaN
12 196673 2108-03-09 12:22:00 Radiology\nCT ABDOMEN W/CONTRAST\n[**2108-3-9*... NaN NaN NaN NaN
13 196673 2108-03-10 05:26:00 Radiology\nABDOMEN (SUPINE & ERECT)\n[**2108-3... NaN NaN NaN NaN
14 196673 2108-03-10 05:27:00 Radiology\nCHEST (PA & LAT)\n[**2108-3-10**] 5... NaN NaN NaN NaN
The resulting data frame will be of the same length as df_notes. I have been able to come up with a very inefficient piece of code using for loops and explicit indexing to get this result:
import numpy as np

cols = list(df_str.columns[2:])
final_df = df_notes.copy()
for col in cols:
    final_df[col] = np.nan

idx = 0
for i, note_row in final_df.iterrows():
    ne = note_row['ne_charttime']
    for j, str_row in df_str.iterrows():
        ce = str_row['ce_charttime']
        if ne < ce:
            idx += 1
            for col in cols:
                final_df.iloc[i, final_df.columns.get_loc(col)] = df_str.iloc[j-1][col]
            break
for col in cols:
    final_df.iloc[idx, final_df.columns.get_loc(col)] = df_str.iloc[-1][col]
This code is bad because it is very inefficient, and while it may work for this example, my actual dataset has over 30 columns of structured variables and over 10,000 encounters.
EDIT-2:
@Stef has provided an excellent answer which seems to work, replacing my elaborate loopy code with a single line (amazing). However, while it works for this particular example, I am running into problems when I apply it to a bigger subset that includes multiple encounters. Consider the following example:
df_str.shape, df_notes.shape
((217, 386), (35, 4))
df_notes[['hadm_id', 'ne_charttime']]
hadm_id ne_charttime
0 100104 2201-06-21 20:00:00
1 100104 2201-06-21 22:51:00
2 100104 2201-06-22 05:00:00
3 100104 2201-06-23 04:33:00
4 100104 2201-06-23 12:59:00
5 100104 2201-06-24 05:15:00
6 100372 2115-12-20 02:29:00
7 100372 2115-12-21 10:15:00
8 100372 2115-12-22 13:05:00
9 100372 2115-12-25 17:16:00
10 100372 2115-12-30 10:58:00
11 100372 2115-12-30 13:07:00
12 100372 2115-12-30 14:16:00
13 100372 2115-12-30 22:34:00
14 100372 2116-01-03 09:10:00
15 100372 2116-01-07 11:08:00
16 100975 2126-03-02 06:06:00
17 100975 2126-03-02 17:44:00
18 100975 2126-03-03 05:36:00
19 100975 2126-03-03 18:27:00
20 100975 2126-03-04 05:29:00
21 100975 2126-03-04 10:48:00
22 100975 2126-03-04 16:42:00
23 100975 2126-03-05 22:12:00
24 100975 2126-03-05 23:01:00
25 100975 2126-03-06 11:02:00
26 100975 2126-03-06 13:38:00
27 100975 2126-03-08 13:39:00
28 100975 2126-03-11 10:41:00
29 101511 2199-04-30 09:29:00
30 101511 2199-04-30 09:53:00
31 101511 2199-04-30 18:06:00
32 101511 2199-05-01 08:28:00
33 111073 2195-05-01 01:56:00
34 111073 2195-05-01 21:49:00
This example has 5 encounters. The dataframe is sorted by hadm_id, and within each hadm_id, ne_charttime is sorted. However, the ne_charttime column by itself is NOT sorted, as seen from row 0 (ne_charttime = 2201-06-21 20:00:00) and row 6 (ne_charttime = 2115-12-20 02:29:00). When I try to do a merge_asof, I get the following error:

ValueError: left keys must be sorted

Is this because the ne_charttime column is not sorted? If so, how do I rectify this while maintaining the integrity of the encounter ID groups?
EDIT-1:
I was able to loop over the encounters as well:
cols = list(dev_str.columns[1:])  # get the cols to merge (everything except hadm_id)
final_dfs = []
grouped = dev_notes.groupby('hadm_id')  # get groups of encounter ids
for name, group in grouped:
    final_df = group.copy().reset_index(drop=True)  # make a copy of notes for that encounter
    for col in cols:
        final_df[col] = np.nan  # set the values to nan
    idx = 0  # index to track the final row in the given encounter
    for i, note_row in final_df.iterrows():
        ne = note_row['ne_charttime']
        sub = dev_str.loc[dev_str['hadm_id'] == name].reset_index(drop=True)  # get the df corresponding to the encounter
        for j, str_row in sub.iterrows():
            ce = str_row['ce_charttime']
            if ne < ce:  # if the variable charttime < note charttime
                idx += 1
                # grab the previous values for the variables and break
                for col in cols:
                    final_df.iloc[i, final_df.columns.get_loc(col)] = sub.iloc[j-1][col]
                break
    # get the last value in the df for the variables
    for col in cols:
        final_df.iloc[idx, final_df.columns.get_loc(col)] = sub.iloc[-1][col]
    final_dfs.append(final_df)  # append the df to the list

# concat the list to get the final df and reset the index
final_df = pd.concat(final_dfs)
final_df.reset_index(inplace=True, drop=True)
Again, this is very inefficient, but it does the job.
Is there a better way to achieve what I want? Any help is appreciated.
Thanks.
You can use merge_asof (both dataframes must be sorted by the columns you're merging them on, which is already the case in your example):
final_df = pd.merge_asof(df_notes, df_str, left_on='ne_charttime', right_on='ce_charttime', by='hadm_id')
Result:
hadm_id ne_charttime note ce_charttime hr resp magnesium hr_24hr_mean
0 196673 2108-03-05 16:54:00 Nursing\nNursing Progress Note\nPt is a 43 yo ... 2108-03-05 16:16:00 85.0 11.0 1.8 88.333333
1 196673 2108-03-05 17:54:00 Physician \nPhysician Resident Admission Note\... 2108-03-05 17:00:00 109.0 12.0 1.8 93.500000
2 196673 2108-03-05 18:09:00 Physician \nPhysician Resident Admission Note\... 2108-03-05 18:00:00 97.0 12.0 1.8 94.200000
3 196673 2108-03-06 06:11:00 Nursing\nNursing Progress Note\nPain control (... 2108-03-06 06:00:00 103.0 17.0 2.0 95.000000
4 196673 2108-03-06 08:06:00 Physician \nPhysician Resident Progress Note\n... 2108-03-06 08:00:00 103.0 20.0 2.0 95.700000
5 196673 2108-03-06 12:40:00 Nursing\nNursing Progress Note\nChief Complain... 2108-03-06 12:00:00 85.0 10.0 2.0 94.291667
6 196673 2108-03-06 13:01:00 Nursing\nNursing Progress Note\nPain control (... 2108-03-06 13:00:00 98.0 14.0 2.0 94.440000
7 196673 2108-03-06 17:09:00 Nursing\nNursing Transfer Note\nChief Complain... 2108-03-06 17:00:00 106.0 21.0 2.0 95.360000
8 196673 2108-03-06 17:12:00 Nursing\nNursing Transfer Note\nPain control (... 2108-03-06 17:00:00 106.0 21.0 2.0 95.360000
9 196673 2108-03-07 15:25:00 Radiology\nCHEST (PA & LAT)\n[**2108-3-7**] 3:... 2108-03-06 17:00:00 106.0 21.0 2.0 95.360000
10 196673 2108-03-07 18:34:00 Radiology\nCTA CHEST W&W/O C&RECONS, NON-CORON... 2108-03-06 17:00:00 106.0 21.0 2.0 95.360000
11 196673 2108-03-09 09:10:00 Radiology\nABDOMEN (SUPINE & ERECT)\n[**2108-3... 2108-03-06 17:00:00 106.0 21.0 2.0 95.360000
12 196673 2108-03-09 12:22:00 Radiology\nCT ABDOMEN W/CONTRAST\n[**2108-3-9*... 2108-03-06 17:00:00 106.0 21.0 2.0 95.360000
13 196673 2108-03-10 05:26:00 Radiology\nABDOMEN (SUPINE & ERECT)\n[**2108-3... 2108-03-06 17:00:00 106.0 21.0 2.0 95.360000
14 196673 2108-03-10 05:27:00 Radiology\nCHEST (PA & LAT)\n[**2108-3-10**] 5... 2108-03-06 17:00:00 106.0 21.0 2.0 95.360000
PS: This gives you the correct result for all rows. There's a logical flaw in your code: you look for the first ce_charttime > ne_charttime and then take the previous row. If there is no such time, you never get the chance to take the previous row, hence the NaNs in your result table from row 8 onwards.
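If, on the other hand, you genuinely want NaNs once the most recent reading is too old, merge_asof accepts a tolerance parameter; a minimal sketch, where the 24-hour cutoff is an assumption you would tune:

# assumption: readings older than 24 hours should not be matched to a note
final_df = pd.merge_asof(df_notes, df_str,
                         left_on='ne_charttime', right_on='ce_charttime',
                         by='hadm_id', tolerance=pd.Timedelta('24h'))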
PPS: This includes ce_charttime in the final dataframe. You can replace it with a column giving the age of the information and/or remove it:
final_df['info_age'] = final_df.ne_charttime - final_df.ce_charttime
final_df = final_df.drop(columns='ce_charttime')
UPDATE for EDIT-2: As I wrote at the very beginning, repeated in the comments, and as the docs clearly state: both ce_charttime and ne_charttime must be sorted (hadm_id need not be sorted). If this condition is not met, you'll have to (temporarily) sort your dataframes as required. See the following example:
import string
import numpy as np
import pandas as pd

df_str = pd.DataFrame({
    'hadm_id': np.tile([111111, 222222], 10),
    'ce_charttime': pd.date_range('2019-10-01 00:30', periods=20, freq='30T'),
    'hr': np.random.randint(80, 120, 20)})
df_notes = pd.DataFrame({
    'hadm_id': np.tile([111111, 222222], 3),
    'ne_charttime': pd.date_range('2019-10-01 00:45', periods=6, freq='40T'),
    'note': [''.join(np.random.choice(list(string.ascii_letters), 10)) for _ in range(6)]
}).sort_values('hadm_id')

# sort the left dataframe by its time key before merging, then restore the original order
final_df = pd.merge_asof(df_notes.sort_values('ne_charttime'), df_str,
                         left_on='ne_charttime', right_on='ce_charttime',
                         by='hadm_id').sort_values(['hadm_id', 'ne_charttime'])

print(df_str)
print(df_notes)
print(final_df)
Output:
hadm_id ce_charttime hr
0 111111 2019-10-01 00:30:00 118
1 222222 2019-10-01 01:00:00 93
2 111111 2019-10-01 01:30:00 92
3 222222 2019-10-01 02:00:00 86
4 111111 2019-10-01 02:30:00 88
5 222222 2019-10-01 03:00:00 86
6 111111 2019-10-01 03:30:00 106
7 222222 2019-10-01 04:00:00 91
8 111111 2019-10-01 04:30:00 109
9 222222 2019-10-01 05:00:00 95
10 111111 2019-10-01 05:30:00 113
11 222222 2019-10-01 06:00:00 92
12 111111 2019-10-01 06:30:00 104
13 222222 2019-10-01 07:00:00 83
14 111111 2019-10-01 07:30:00 114
15 222222 2019-10-01 08:00:00 98
16 111111 2019-10-01 08:30:00 110
17 222222 2019-10-01 09:00:00 89
18 111111 2019-10-01 09:30:00 98
19 222222 2019-10-01 10:00:00 109
hadm_id ne_charttime note
0 111111 2019-10-01 00:45:00 jOcRWVdPDF
2 111111 2019-10-01 02:05:00 mvScJNrwra
4 111111 2019-10-01 03:25:00 FBAFbJYflE
1 222222 2019-10-01 01:25:00 ilNuInOsYZ
3 222222 2019-10-01 02:45:00 ysyolaNmkV
5 222222 2019-10-01 04:05:00 wvowGGETaP
hadm_id ne_charttime note ce_charttime hr
0 111111 2019-10-01 00:45:00 jOcRWVdPDF 2019-10-01 00:30:00 118
2 111111 2019-10-01 02:05:00 mvScJNrwra 2019-10-01 01:30:00 92
4 111111 2019-10-01 03:25:00 FBAFbJYflE 2019-10-01 02:30:00 88
1 222222 2019-10-01 01:25:00 ilNuInOsYZ 2019-10-01 01:00:00 93
3 222222 2019-10-01 02:45:00 ysyolaNmkV 2019-10-01 02:00:00 86
5 222222 2019-10-01 04:05:00 wvowGGETaP 2019-10-01 04:00:00 91
You can do a full merge and then filter with query:

df_notes.merge(df_str, on='hadm_id').query('ce_charttime <= ne_charttime')
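Note that this keeps every earlier df_str reading for each note rather than just the most recent one, and the full merge can get large with 10,000+ encounters. A sketch of one way to then reduce to the latest reading per note (notes with no earlier reading are dropped here, unlike with merge_asof):

merged = df_notes.merge(df_str, on='hadm_id').query('ce_charttime <= ne_charttime')
# keep only the most recent structured row per note
final_df = (merged.sort_values('ce_charttime')
                  .groupby(['hadm_id', 'ne_charttime'], as_index=False)
                  .last())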
Related
Moving Average Pandas Across Group
My data has the following structure:

np.random.seed(25)
tdf = pd.DataFrame({
    'person_id': [1,1,1,1, 2,2, 3,3,3,3,3, 4,4,4, 5,5,5,5,5,5,5, 6, 7,7,
                  8,8,8,8,8,8,8, 9,9, 10,10],
    'Date': ['2021-01-02','2021-01-05','2021-01-07','2021-01-09',
             '2021-01-02','2021-01-05',
             '2021-01-02','2021-01-05','2021-01-07','2021-01-09','2021-01-11',
             '2021-01-02','2021-01-05','2021-01-07',
             '2021-01-02','2021-01-05','2021-01-07','2021-01-09','2021-01-11','2021-01-13','2021-01-15',
             '2021-01-02',
             '2021-01-02','2021-01-05',
             '2021-01-02','2021-01-05','2021-01-07','2021-01-09','2021-01-11','2021-01-13','2021-01-15',
             '2021-01-02','2021-01-05',
             '2021-01-02','2021-01-05'],
    'Quantity': np.floor(np.random.random(size=35)*100)
})

And I want to calculate the moving average (2 periods) over Date. For the first MA, we take 2021-01-02 & 2021-01-05 across all observations and calculate the MA (50). Similarly for the other dates. The output need not be in the structure shown in the report; I just need the date & MA columns in the final data. Thanks!
IIUC, you can aggregate the similar dates first, getting the sum and count. Then take the sum per rolling 2 dates (it doesn't look like you want a defined time period but rather raw successive values, so I am assuming prior sorting here). Finally, take the ratio of the sums and counts to get the mean:

g = tdf.groupby('Date')['Quantity']
out = g.sum().rolling(2).sum()/g.count().rolling(2).sum()

output:

Date
2021-01-02          NaN
2021-01-05    50.210526
2021-01-07    45.071429
2021-01-09    41.000000
2021-01-11    44.571429
2021-01-13    48.800000
2021-01-15    50.500000
Name: Quantity, dtype: float64

joining the original data:

g = tdf.groupby('Date')['Quantity']
s = g.sum().rolling(2).sum()/g.count().rolling(2).sum()
tdf.merge(s.rename('Quantity_MA(2)'), left_on='Date', right_index=True)

output:

    person_id        Date  Quantity  Quantity_MA(2)
0           1  2021-01-02      87.0             NaN
4           2  2021-01-02      41.0             NaN
6           3  2021-01-02      68.0             NaN
11          4  2021-01-02      11.0             NaN
14          5  2021-01-02      16.0             NaN
21          6  2021-01-02      51.0             NaN
22          7  2021-01-02      38.0             NaN
24          8  2021-01-02      51.0             NaN
31          9  2021-01-02      90.0             NaN
33         10  2021-01-02      45.0             NaN
1           1  2021-01-05      58.0       50.210526
5           2  2021-01-05      11.0       50.210526
7           3  2021-01-05      43.0       50.210526
12          4  2021-01-05      44.0       50.210526
15          5  2021-01-05      52.0       50.210526
23          7  2021-01-05      99.0       50.210526
25          8  2021-01-05      55.0       50.210526
32          9  2021-01-05      66.0       50.210526
34         10  2021-01-05      28.0       50.210526
2           1  2021-01-07      27.0       45.071429
8           3  2021-01-07      55.0       45.071429
13          4  2021-01-07      58.0       45.071429
16          5  2021-01-07      32.0       45.071429
26          8  2021-01-07       3.0       45.071429
3           1  2021-01-09      18.0       41.000000
9           3  2021-01-09      36.0       41.000000
17          5  2021-01-09      69.0       41.000000
27          8  2021-01-09      71.0       41.000000
10          3  2021-01-11      40.0       44.571429
18          5  2021-01-11      36.0       44.571429
28          8  2021-01-11      42.0       44.571429
19          5  2021-01-13      83.0       48.800000
29          8  2021-01-13      43.0       48.800000
20          5  2021-01-15      48.0       50.500000
30          8  2021-01-15      28.0       50.500000
Creating values from datetime objects in certain fixed divisions
I am trying to create a new column in which e.g. the time 14:02 should be saved as 14.0, whereas 14:16 should be 14.5. This would equal half-hour units. Of course, 15-minute units should also be creatable, and so on. This is my approach for full hours, but I need a higher resolution:

df["Time"] = df.StartDateTime.apply(lambda x: x.hour)
So long as the units evenly divide an hour, you can round with that frequency and then divide by an hour:

import pandas as pd

df = pd.DataFrame({'Time': pd.timedelta_range('14:00:00', freq='4min', periods=10)})
for freq in ['30min', '15min', '20min', '10min']:
    df[freq] = df['Time'].dt.round(freq)/pd.Timedelta('1H')

       Time  30min  15min      20min      10min
0  14:00:00   14.0  14.00  14.000000  14.000000
1  14:04:00   14.0  14.00  14.000000  14.000000
2  14:08:00   14.0  14.25  14.000000  14.166667
3  14:12:00   14.0  14.25  14.333333  14.166667
4  14:16:00   14.5  14.25  14.333333  14.333333
5  14:20:00   14.5  14.25  14.333333  14.333333
6  14:24:00   14.5  14.50  14.333333  14.333333
7  14:28:00   14.5  14.50  14.333333  14.500000
8  14:32:00   14.5  14.50  14.666667  14.500000
9  14:36:00   14.5  14.50  14.666667  14.666667

If you start from a datetime64[ns] column, you can isolate the time by subtracting off the normalized date. For example:

df = pd.DataFrame({'Time': pd.date_range('2010-01-01 14:00:00', freq='4min', periods=5)})
df['Time_only'] = df['Time'] - df['Time'].dt.normalize()

#                 Time Time_only
#0 2010-01-01 14:00:00  14:00:00
#1 2010-01-01 14:04:00  14:04:00
#2 2010-01-01 14:08:00  14:08:00
#3 2010-01-01 14:12:00  14:12:00
#4 2010-01-01 14:16:00  14:16:00

print(df.dtypes)
#Time         datetime64[ns]
#Time_only   timedelta64[ns]
#dtype: object
how to group by date in pandas. I have 48 entries for a single date, i.e. 30 min intervals
Please find the input and output below (input in the code, output in the image).

input:

time value index_no date block_out no_load
0 2018-07-16 00:30:00 1 2.0 2018-07-16
1 2018-07-16 01:00:00 -1 3.0 2018-07-16 3.0
2 2018-07-16 01:30:00 -1 4.0 2018-07-16 4.0
3 2018-07-16 02:00:00 -1 5.0 2018-07-16 5.0
4 2018-07-16 02:30:00 1 6.0 2018-07-16
5 2018-07-16 03:00:00 1 7.0 2018-07-16
6 2018-07-16 03:30:00 0 8.0 2018-07-16 8.0
7 2018-07-16 04:00:00 1 9.0 2018-07-16
8 2018-07-16 04:30:00 -1 10.0 2018-07-16 10.0
9 2018-07-16 05:00:00 2 11.0 2018-07-16
10 2018-07-16 05:30:00 3 12.0 2018-07-16
11 2018-07-16 06:00:00 2 13.0 2018-07-16
12 2018-07-16 06:30:00 2 14.0 2018-07-16
13 2018-07-16 07:00:00 -1 15.0 2018-07-16 15.0
14 2018-07-16 07:30:00 1 16.0 2018-07-16
15 2018-07-16 08:00:00 -1 17.0 2018-07-16 17.0
16 2018-07-16 08:30:00 2 18.0 2018-07-16
17 2018-07-16 09:00:00 2 19.0 2018-07-16
18 2018-07-16 09:30:00 3 20.0 2018-07-16
19 2018-07-16 10:00:00 -1 21.0 2018-07-16 21.0
This is how you group by a column in pandas:

import pandas as pd
import numpy as np

# Generating random data
data = np.random.randint(0, 4, 15).reshape(5, 3)
# Wrapping it with a pandas DataFrame
df = pd.DataFrame(data, columns=['A', 'B', 'C'])
# Group by column 'A'
groupbys = df.groupby('A')

In your specific problem you want to split your 'time' column into 'date' and 'time' and then group by 'date', but that alone probably won't get you anywhere, because you didn't define what kind of aggregation you want to do. If you don't know how to split your "time" column, you can use a map function and convert it to a datetime object like so:

from datetime import datetime

def to_datetime(s):
    return datetime.strptime(s, "%Y-%m-%d %H:%M:%S")

and then create a new column like so:

df['date'] = list(map(to_datetime, df['time']))
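For the splitting step specifically, a vectorized sketch (assuming the 'time' column holds strings like those in the question; the daily sum is just an example aggregation):

df['time'] = pd.to_datetime(df['time'])    # parse the strings to datetime64
df['date'] = df['time'].dt.date            # keep only the calendar date
daily = df.groupby('date')['value'].sum()  # example: total 'value' per day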
Pandas dataframe column math when row condition is met
I have a dataframe containing the following data. I would like to query the age column of each dataframe (1-4) for values between 295.0 and 305.0. For each dataframe there will be a single age value in this range and a corresponding subsidence value. I would like to take that subsidence value and add it to the remaining values in the dataframe. For instance, in the first dataframe, at age 300.0, subsidence = 274.057861; in this case, 274.057861 would be added to the rest of the subsidence values in dataframe 1. In the second dataframe, at age 299.0, subsidence = 77.773720, so 77.773720 would be added to the rest of the subsidence values in dataframe 2. Etc. Is it possible to do this easily in Pandas, or am I better off working towards an alternate solution? Thanks :)

          1                      2                     3                  4  \
      age   subsidence      age   subsidence      age   subsidence      age
0     0.0  -201.538712      0.0  -235.865433      0.0   134.728821      0.0
1    10.0   -77.446548      8.0  -102.183365     10.0    88.796074     10.0
2    20.0    44.901043     18.0    35.316868     20.0    35.871178     20.0
3    31.0   103.172806     28.0    98.238434     30.0   -17.901653     30.0
4    41.0   124.625687     38.0   124.719254     40.0   -13.381897     40.0
5    51.0   122.877541     48.0   130.725235     50.0   -25.396996     50.0
6    61.0   138.810898     58.0   140.301117     60.0   -37.057205     60.0
7    71.0   119.818176     68.0   137.433670     70.0   -11.587639     70.0
8    81.0    77.867607     78.0    96.285652     80.0    21.854662     80.0
9    91.0    33.612885     88.0    32.740803     90.0    67.754501     90.0
10  101.0    15.885051     98.0     8.626043    100.0   150.172699    100.0
11  111.0   118.089211    109.0    88.812439    100.0   150.172699    100.0
12  121.0   247.301956    119.0   212.000061    110.0   124.367874    110.0
13  131.0   268.748627    129.0   253.204819    120.0   157.066010    120.0
14  141.0   231.799255    139.0   292.828461    130.0   145.811783    130.0
15  151.0   259.626343    149.0   260.067993    140.0   175.388763    140.0
16  161.0   288.704651    159.0   240.051605    150.0   265.435791    150.0
17  171.0   249.121857    169.0   203.727097    160.0   336.471924    160.0
18  181.0   339.038055    179.0   245.738480    170.0   283.483582    170.0
19  191.0   395.920410    189.0   318.751160    180.0   381.575500    180.0
20  201.0   404.843445    199.0   338.245209    190.0   491.534424    190.0
21  211.0   461.865784    209.0   418.997559    200.0   495.025604    200.0
22  221.0   518.710632    219.0   446.496216    200.0   495.025604    200.0
23  231.0   483.963867    224.0   479.213287    210.0   571.982361    210.0
24  239.0   445.292389    229.0   492.352905    220.0   611.698608    220.0
25  249.0   396.609497    239.0   445.322144    230.0   645.545776    230.0
26  259.0   321.553558    249.0   429.429932    240.0   596.046265    240.0
27  269.0   306.150177    259.0   297.355103    250.0   547.157654    250.0
28  279.0   259.717468    269.0   174.210785    260.0   457.071472    260.0
29  289.0   301.114410    279.0   114.175957    270.0   438.705170    270.0
30  300.0   274.057861    289.0    91.768898    280.0   397.985535    280.0
31  310.0   216.760361    299.0    77.773720    290.0   426.858276    290.0
32  320.0   192.317093    309.0    73.767090    300.0   410.508331    300.0
33  330.0   179.511917    319.0    63.295345    300.0   410.508331    300.0
34  340.0   231.126053    329.0    -4.296405    310.0   355.303558    310.0
35  350.0   142.894958    339.0   -62.745190    320.0   284.932892    320.0
36  360.0    51.547047    350.0   -60.224789    330.0   251.817078    330.0
37  370.0   -39.064964    360.0   -85.826874    340.0   302.303925    340.0
38  380.0   -54.111374    370.0   -81.139206    350.0   207.799942    350.0
39  390.0   -68.999535    380.0   -40.080212    360.0    77.729439    360.0
40  400.0   -47.595322    390.0   -29.945852    370.0  -127.037209    370.0
41  410.0    13.159509    400.0   -26.656607    380.0  -109.327545    380.0
42    NaN          NaN    410.0   -13.723764    390.0  -127.160942    390.0
43    NaN          NaN      NaN          NaN    400.0   -61.404510    400.0
44    NaN          NaN      NaN          NaN    410.0    13.058900    410.0
For the first dataframe:

df1['subsidence'] = df1['subsidence'] + df1[(df1.age > 295) & (df1.age < 305)]['subsidence'].values[0]

You need to update each dataframe accordingly.
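Since the same shift applies to each dataframe, a short loop keeps this manageable; a sketch, assuming the four dataframes are collected in a hypothetical list dfs:

dfs = [df1, df2, df3, df4]  # hypothetical names for the four dataframes
for df in dfs:
    # the single subsidence value whose age falls between 295 and 305
    baseline = df.loc[(df.age > 295) & (df.age < 305), 'subsidence'].values[0]
    df['subsidence'] = df['subsidence'] + baseline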
multi index (time series) slicing error in pandas
I have the dataframe below, where date/time form a multi-index. When I run this code:

idx = pd.IndexSlice
print(df_per_wday_temp.loc[idx[:, datetime.time(4, 0, 0):datetime.time(7, 0, 0)]])

I get the error 'MultiIndex Slicing requires the index to be fully lexsorted tuple len (2), lexsort depth (1)'. This may be an error in index slicing, but I don't know why it happens. Can anybody solve it?

                         a      b
date       time
2018-01-26 19:00:00  25.08  -7.85
           19:15:00  24.86  -7.81
           19:30:00  24.67  -8.24
           19:45:00    NaN  -9.32
           20:00:00    NaN  -8.29
           20:15:00    NaN  -8.58
           20:30:00    NaN  -9.48
           20:45:00    NaN  -8.73
           21:00:00    NaN  -8.60
           21:15:00    NaN  -8.70
           21:30:00    NaN  -8.53
           21:45:00    NaN  -8.90
           22:00:00    NaN  -8.55
           22:15:00    NaN  -8.48
           22:30:00    NaN  -9.90
           22:45:00    NaN  -9.70
           23:00:00    NaN  -8.98
           23:15:00    NaN  -9.17
           23:30:00    NaN  -9.07
           23:45:00    NaN  -9.45
           00:00:00    NaN  -9.64
           00:15:00    NaN -10.08
           00:30:00    NaN  -8.87
           00:45:00    NaN  -9.91
           01:00:00    NaN  -9.91
           01:15:00    NaN  -9.93
           01:30:00    NaN  -9.55
           01:45:00    NaN  -9.51
           02:00:00    NaN  -9.75
           02:15:00    NaN  -9.44
...                    ...    ...
           03:45:00    NaN  -9.28
           04:00:00    NaN  -9.96
           04:15:00    NaN -10.19
           04:30:00    NaN -10.20
           04:45:00    NaN  -9.85
           05:00:00    NaN -10.33
           05:15:00    NaN -10.18
           05:30:00    NaN -10.81
           05:45:00    NaN -10.51
           06:00:00    NaN -10.41
           06:15:00    NaN -10.49
           06:30:00    NaN -10.13
           06:45:00    NaN -10.36
           07:00:00    NaN -10.71
           07:15:00    NaN -12.11
           07:30:00    NaN -10.76
           07:45:00    NaN -10.76
           08:00:00    NaN -11.63
           08:15:00    NaN -11.18
           08:30:00    NaN -10.49
           08:45:00    NaN -11.18
           09:00:00    NaN -10.67
           09:15:00    NaN -10.60
           09:30:00    NaN -10.36
           09:45:00    NaN  -9.39
           10:00:00    NaN  -9.77
           10:15:00    NaN  -9.54
           10:30:00    NaN  -8.99
           10:45:00    NaN  -9.01
           11:00:00    NaN -10.01

Thanks in advance.
If sorting the index is not possible, it is necessary to create a boolean mask and filter by boolean indexing:

from datetime import time

mask = df1.index.get_level_values(1).to_series().between(time(4, 0, 0), time(7, 0, 0)).values
df = df1[mask]
print(df)

                       a      b
date       time
2018-01-26 04:00:00  NaN  -9.96
           04:15:00  NaN -10.19
           04:30:00  NaN -10.20
           04:45:00  NaN  -9.85
           05:00:00  NaN -10.33
           05:15:00  NaN -10.18
           05:30:00  NaN -10.81
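If sorting is acceptable, a sketch of the alternative: fully lexsorting the MultiIndex first lets the original IndexSlice code work directly (assuming df1 is the multi-indexed frame from the question):

import datetime
import pandas as pd

df1 = df1.sort_index()  # fully lexsort both index levels
idx = pd.IndexSlice
print(df1.loc[idx[:, datetime.time(4, 0, 0):datetime.time(7, 0, 0)], :])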