pandas multiindex add labels to an index level - pandas

I have a pandas dataframe with multiindex as the following:
TALLY
DAY NODE CLASS
2018-02-04 pdk2r08o005 3 7.0
2018-02-05 pdk2r08o005 3 24.0
2018-02-06 dsvtxvCsdbc02 3 2.0
pdk2r08o005 3 28.0
2018-02-07 pdk2r08o005 3 24.0
2018-02-08 dsvtxvCsdbc02 3 3.0
pdk2r08o005 3 24.0
2018-02-09 pdk2r08o005 3 24.0
2018-02-10 dsvtxvCsdbc02 3 2.0
pdk2r08o005 3 24.0
2018-02-11 pdk2r08o005 3 31.0
2018-02-12 pdk2r08o005 3 24.0
2018-02-13 pdk2r08o005 3 20.0
2018-02-14 dsvtxvCsdbc02 3 4.0
pdk2r08o005 3 24.0
2018-02-15 dsvtxvCsdbc02 3 2.0
pdk2r08o005 3 24.0
2018-02-16 dsvtxvCsdbc02 3 121.0
pdk2r08o005 3 26.0
2018-02-17 dsvtxvCsdbc02 3 401.0
pdk2r08o005 3 24.0
2018-02-18 dsvtxvCsdbc02 3 327.0
pdk2r08o005 3 24.0
2018-02-19 dsvtxvCsdbc02 3 164.0
pdk2r08o005 3 24.0
2018-02-20 dsvtxvCsdbc02 3 26.0
pdk2r08o005 3 38.0
2018-02-21 pdk2r08o005 3 24.0
2018-02-22 pdk2r08o005 3 24.0
2018-02-23 pdk2r08o005 3 24.0
...
2018-03-01 pdk2r08o005 3 24.0
2018-03-02 pdk2r08o005 3 24.0
2018-03-03 pdk2r08o005 3 24.0
2018-03-04 pdk2r08o005 3 36.0
2018-03-05 pdk2r08o005 3 24.0
2018-03-06 dsvtxvCsdbc02 3 2.0
pdk2r08o005 3 24.0
2018-03-07 dsvtxvCsdbc02 3 8.0
pdk2r08o005 3 24.0
2018-03-08 pdk2r08o005 3 31.0
2018-03-09 pdk2r08o005 3 24.0
2018-03-10 pdk2r08o005 3 24.0
2018-03-11 dsvtxvCsdbc02 3 2.0
pdk2r08o005 3 39.0
2018-03-12 pdk2r08o005 3 24.0
2018-03-13 pdk2r08o005 3 24.0
2018-03-14 dsvtxvCsdbc02 3 4.0
pdk2r08o005 3 24.0
2018-03-15 dsvtxvCsdbc02 3 2.0
pdk2r08o005 3 24.0
2018-03-16 dsvtxvCsdbc02 3 2.0
pdk2r08o005 3 24.0
2018-03-17 dsvtxvCsdbc02 3 4.0
pdk2r08o005 3 24.0
2018-03-18 dsvtxvCsdbc02 3 12.0
9 2.0
pdk2r08o005 3 24.0
2018-03-19 pdk2r08o005 3 44.0
2018-03-20 pdk2r08o005 3 24.0
2018-03-21 pdk2r08o005 3 18.0
[68 rows x 1 columns]
In this dataset, "DAY", "NODE" and "CLASS" are all levels of the index.
Now I have to fill in some missing dates in the "DAY" level.
For example:
date_range = pd.date_range('02-06-2018','03-18-2018')
indices = pd.MultiIndex.from_product(dataset.index.levels)
How can I use this date_range to add the missing dates to the index of the dataset?

I have figured out the answer for this and it is as follows:
Read the dataframe "df" in the following structure.
NODE CLASS TALLY
DAY
2018-02-04 pdk2r08o005 3 7.0
2018-02-05 pdk2r08o005 3 24.0
2018-02-06 dsvtxvCsdbc02 3 2.0
2018-02-06 pdk2r08o005 3 28.0
2018-02-07 pdk2r08o005 3 24.0
2018-02-08 dsvtxvCsdbc02 3 3.0
2018-02-08 pdk2r08o005 3 24.0
2018-02-09 pdk2r08o005 3 24.0
2018-02-10 dsvtxvCsdbc02 3 2.0
2018-02-10 pdk2r08o005 3 24.0
2018-02-11 pdk2r08o005 3 31.0
2018-02-12 pdk2r08o005 3 24.0
2018-02-13 pdk2r08o005 3 20.0
2018-02-14 dsvtxvCsdbc02 3 4.0
2018-02-14 pdk2r08o005 3 24.0
2018-02-15 dsvtxvCsdbc02 3 2.0
2018-02-15 pdk2r08o005 3 24.0
2018-02-16 dsvtxvCsdbc02 3 121.0
2018-02-16 pdk2r08o005 3 26.0
2018-02-17 dsvtxvCsdbc02 3 401.0
2018-02-17 pdk2r08o005 3 24.0
2018-02-18 dsvtxvCsdbc02 3 327.0
2018-02-18 pdk2r08o005 3 24.0
2018-02-19 dsvtxvCsdbc02 3 164.0
2018-02-19 pdk2r08o005 3 24.0
2018-02-20 dsvtxvCsdbc02 3 26.0
2018-02-20 pdk2r08o005 3 38.0
2018-02-21 pdk2r08o005 3 24.0
2018-02-22 pdk2r08o005 3 24.0
2018-02-23 pdk2r08o005 3 24.0
... ... ...
2018-03-01 pdk2r08o005 3 24.0
2018-03-02 pdk2r08o005 3 24.0
2018-03-03 pdk2r08o005 3 24.0
2018-03-04 pdk2r08o005 3 36.0
2018-03-05 pdk2r08o005 3 24.0
2018-03-06 dsvtxvCsdbc02 3 2.0
2018-03-06 pdk2r08o005 3 24.0
2018-03-07 dsvtxvCsdbc02 3 8.0
2018-03-07 pdk2r08o005 3 24.0
2018-03-08 pdk2r08o005 3 31.0
2018-03-09 pdk2r08o005 3 24.0
2018-03-10 pdk2r08o005 3 24.0
2018-03-11 dsvtxvCsdbc02 3 2.0
2018-03-11 pdk2r08o005 3 39.0
2018-03-12 pdk2r08o005 3 24.0
2018-03-13 pdk2r08o005 3 24.0
2018-03-14 dsvtxvCsdbc02 3 4.0
2018-03-14 pdk2r08o005 3 24.0
2018-03-15 dsvtxvCsdbc02 3 2.0
2018-03-15 pdk2r08o005 3 24.0
2018-03-16 dsvtxvCsdbc02 3 2.0
2018-03-16 pdk2r08o005 3 24.0
2018-03-17 dsvtxvCsdbc02 3 4.0
2018-03-17 pdk2r08o005 3 24.0
2018-03-18 dsvtxvCsdbc02 3 12.0
2018-03-18 dsvtxvCsdbc02 9 2.0
2018-03-18 pdk2r08o005 3 24.0
2018-03-19 pdk2r08o005 3 44.0
2018-03-20 pdk2r08o005 3 24.0
2018-03-21 pdk2r08o005 3 18.0
I am reading it from a table as follows:
df = pd.read_sql('select DAY,NODE,CLASS,TALLY FROM TABLE', con=cnx, index_col=['DAY'])
df.index = pd.to_datetime(df.index)
Create a new dataframe "df1" with a similar structure for the given date range (scalar values are broadcast over the whole index):
date_range = pd.date_range(start='02-01-2018',end='03-21-2018',name='DAY')
df1 = pd.DataFrame({"NODE":np.nan,"CLASS":np.nan,"TALLY":np.nan},index=date_range)
Append the new dataframe to the old one (DataFrame.append was removed in pandas 2.0, so pd.concat is used here):
df = pd.concat([df, df1])
Get the multiindex (this assumes "NODE" and "CLASS" have first been moved into the index, e.g. with df = df.set_index(['NODE','CLASS'], append=True), so that df.index has levels):
indices = pd.MultiIndex.from_product(df.index.levels)
Reindex the dataset:
df = df.reindex(indices,fill_value=0)
And voilà, the requested data structure is the new output.
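A minimal sketch of the reindexing idea on a three-row toy frame, assuming the index already has the levels DAY, NODE and CLASS as in the question. The toy values and the shortened date range are illustrative only. MultiIndex.from_product builds every (DAY, NODE, CLASS) combination, and reindex fills the combinations that were absent with 0.

```python
import pandas as pd

# Toy data shaped like the question's frame (values are illustrative).
idx = pd.MultiIndex.from_tuples(
    [
        (pd.Timestamp("2018-02-04"), "pdk2r08o005", 3),
        (pd.Timestamp("2018-02-06"), "dsvtxvCsdbc02", 3),
        (pd.Timestamp("2018-02-06"), "pdk2r08o005", 3),
    ],
    names=["DAY", "NODE", "CLASS"],
)
dataset = pd.DataFrame({"TALLY": [7.0, 2.0, 28.0]}, index=idx)

# Every day that should be present, including the missing ones.
date_range = pd.date_range("2018-02-04", "2018-02-07", name="DAY")

# Cartesian product of the full date range with the existing NODE and CLASS levels.
full_index = pd.MultiIndex.from_product(
    [date_range, dataset.index.levels[1], dataset.index.levels[2]],
    names=dataset.index.names,
)

# Rows that were absent get TALLY = 0; existing rows keep their values.
filled = dataset.reindex(full_index, fill_value=0)
print(filled)
```

Note that the product contains every NODE/CLASS pair for every day, so the result can be considerably larger than the input.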

Related

Pandas combine two dataframes based on time difference

I have two data frames that store different types of medical information about patients. The common elements of both data frames are the encounter ID (hadm_id) and the time the information was recorded ((n|c)e_charttime).
One data frame (df_str) contains structured information such as vital signs and lab test values, plus values derived from these (such as change statistics over 24 hours). The other data frame (df_notes) contains a column with a clinical note recorded at a specified time for an encounter. Both data frames contain multiple encounters, and the key that links them is the encounter ID (hadm_id).
Here are examples of the data frames for ONE encounter ID (hadm_id) with a subset of variables:
df_str
hadm_id ce_charttime hr resp magnesium hr_24hr_mean
0 196673 2108-03-05 15:34:00 95.0 12.0 NaN 95.000000
1 196673 2108-03-05 16:00:00 85.0 11.0 NaN 90.000000
2 196673 2108-03-05 16:16:00 85.0 11.0 1.8 88.333333
3 196673 2108-03-05 17:00:00 109.0 12.0 1.8 93.500000
4 196673 2108-03-05 18:00:00 97.0 12.0 1.8 94.200000
5 196673 2108-03-05 19:00:00 99.0 16.0 1.8 95.000000
6 196673 2108-03-05 20:00:00 98.0 13.0 1.8 95.428571
7 196673 2108-03-05 21:00:00 97.0 14.0 1.8 95.625000
8 196673 2108-03-05 22:00:00 101.0 12.0 1.8 96.222222
9 196673 2108-03-05 23:00:00 97.0 13.0 1.8 96.300000
10 196673 2108-03-06 00:00:00 93.0 13.0 1.8 96.000000
11 196673 2108-03-06 01:00:00 89.0 12.0 1.8 95.416667
12 196673 2108-03-06 02:00:00 88.0 10.0 1.8 94.846154
13 196673 2108-03-06 03:00:00 87.0 12.0 1.8 94.285714
14 196673 2108-03-06 04:00:00 97.0 19.0 1.8 94.466667
15 196673 2108-03-06 05:00:00 95.0 11.0 1.8 94.500000
16 196673 2108-03-06 05:43:00 95.0 11.0 2.0 94.529412
17 196673 2108-03-06 06:00:00 103.0 17.0 2.0 95.000000
18 196673 2108-03-06 07:00:00 101.0 12.0 2.0 95.315789
19 196673 2108-03-06 08:00:00 103.0 20.0 2.0 95.700000
20 196673 2108-03-06 09:00:00 84.0 11.0 2.0 95.142857
21 196673 2108-03-06 10:00:00 89.0 11.0 2.0 94.863636
22 196673 2108-03-06 11:00:00 91.0 14.0 2.0 94.695652
23 196673 2108-03-06 12:00:00 85.0 10.0 2.0 94.291667
24 196673 2108-03-06 13:00:00 98.0 14.0 2.0 94.440000
25 196673 2108-03-06 14:00:00 100.0 18.0 2.0 94.653846
26 196673 2108-03-06 15:00:00 95.0 12.0 2.0 94.666667
27 196673 2108-03-06 16:00:00 96.0 20.0 2.0 95.076923
28 196673 2108-03-06 17:00:00 106.0 21.0 2.0 95.360000
df_notes
hadm_id ne_charttime note
0 196673 2108-03-05 16:54:00 Nursing\nNursing Progress Note\nPt is a 43 yo ...
1 196673 2108-03-05 17:54:00 Physician \nPhysician Resident Admission Note\...
2 196673 2108-03-05 18:09:00 Physician \nPhysician Resident Admission Note\...
3 196673 2108-03-06 06:11:00 Nursing\nNursing Progress Note\nPain control (...
4 196673 2108-03-06 08:06:00 Physician \nPhysician Resident Progress Note\n...
5 196673 2108-03-06 12:40:00 Nursing\nNursing Progress Note\nChief Complain...
6 196673 2108-03-06 13:01:00 Nursing\nNursing Progress Note\nPain control (...
7 196673 2108-03-06 17:09:00 Nursing\nNursing Transfer Note\nChief Complain...
8 196673 2108-03-06 17:12:00 Nursing\nNursing Transfer Note\nPain control (...
9 196673 2108-03-07 15:25:00 Radiology\nCHEST (PA & LAT)\n[**2108-3-7**] 3:...
10 196673 2108-03-07 18:34:00 Radiology\nCTA CHEST W&W/O C&RECONS, NON-CORON...
11 196673 2108-03-09 09:10:00 Radiology\nABDOMEN (SUPINE & ERECT)\n[**2108-3...
12 196673 2108-03-09 12:22:00 Radiology\nCT ABDOMEN W/CONTRAST\n[**2108-3-9*...
13 196673 2108-03-10 05:26:00 Radiology\nABDOMEN (SUPINE & ERECT)\n[**2108-3...
14 196673 2108-03-10 05:27:00 Radiology\nCHEST (PA & LAT)\n[**2108-3-10**] 5...
What I want to do is to combine both the data frames based on the time when that information was recorded. More specifically, for each row in df_notes, I want a corresponding row from df_str with ce_charttime <= ne_charttime.
As an example, the first row in df_notes has ne_charttime = 2108-03-05 16:54:00. There are three rows in df_str with record times less than this time: ce_charttime = 2108-03-05 15:34:00, ce_charttime = 2108-03-05 16:00:00, ce_charttime = 2108-03-05 16:16:00. The most recent of these is the row with ce_charttime = 2108-03-05 16:16:00. So in my resulting data frame, for ne_charttime = 2108-03-05 16:54:00, I will have hr = 85.0, resp = 11.0, magnesium = 1.8, hr_24hr_mean = 88.33.
Essentially, in this example the resulting data frame will look like this:
hadm_id ne_charttime note hr resp magnesium hr_24hr_mean
0 196673 2108-03-05 16:54:00 Nursing\nNursing Progress Note\nPt is a 43 yo ... 85.0 11.0 1.8 88.333333
1 196673 2108-03-05 17:54:00 Physician \nPhysician Resident Admission Note\... 109.0 12.0 1.8 93.500000
2 196673 2108-03-05 18:09:00 Physician \nPhysician Resident Admission Note\... 97.0 12.0 1.8 94.200000
3 196673 2108-03-06 06:11:00 Nursing\nNursing Progress Note\nPain control (... 103.0 17.0 2.0 95.000000
4 196673 2108-03-06 08:06:00 Physician \nPhysician Resident Progress Note\n... 103.0 20.0 2.0 95.700000
5 196673 2108-03-06 12:40:00 Nursing\nNursing Progress Note\nChief Complain... 85.0 10.0 2.0 94.291667
6 196673 2108-03-06 13:01:00 Nursing\nNursing Progress Note\nPain control (... 98.0 14.0 2.0 94.440000
7 196673 2108-03-06 17:09:00 Nursing\nNursing Transfer Note\nChief Complain... 106.0 21.0 2.0 95.360000
8 196673 2108-03-06 17:12:00 Nursing\nNursing Transfer Note\nPain control (... NaN NaN NaN NaN
9 196673 2108-03-07 15:25:00 Radiology\nCHEST (PA & LAT)\n[**2108-3-7**] 3:... NaN NaN NaN NaN
10 196673 2108-03-07 18:34:00 Radiology\nCTA CHEST W&W/O C&RECONS, NON-CORON... NaN NaN NaN NaN
11 196673 2108-03-09 09:10:00 Radiology\nABDOMEN (SUPINE & ERECT)\n[**2108-3... NaN NaN NaN NaN
12 196673 2108-03-09 12:22:00 Radiology\nCT ABDOMEN W/CONTRAST\n[**2108-3-9*... NaN NaN NaN NaN
13 196673 2108-03-10 05:26:00 Radiology\nABDOMEN (SUPINE & ERECT)\n[**2108-3... NaN NaN NaN NaN
14 196673 2108-03-10 05:27:00 Radiology\nCHEST (PA & LAT)\n[**2108-3-10**] 5... NaN NaN NaN NaN
The resulting data frame will be of the same length as df_notes. I have been able to come up with a very inefficient piece of code using for loops and explicit indexing to get this result:
cols = list(df_str.columns[2:])
final_df = df_notes.copy()
for col in cols:
    final_df[col] = np.nan
idx = 0
for i, note_row in final_df.iterrows():
    ne = note_row['ne_charttime']
    for j, str_row in df_str.iterrows():
        ce = str_row['ce_charttime']
        if ne < ce:
            idx += 1
            for col in cols:
                final_df.iloc[i, final_df.columns.get_loc(col)] = df_str.iloc[j-1][col]
            break
for col in cols:
    final_df.iloc[idx, final_df.columns.get_loc(col)] = df_str.iloc[-1][col]
This piece of code is bad because it is very inefficient. While it may work for this example, my actual dataset has over 30 different columns of structured variables and over 10,000 encounters.
EDIT-2:
@Stef has provided an excellent answer which seems to work and replaces my elaborate loopy code with a single line (amazing). However, while that works for this particular example, I am running into problems when I apply it to a bigger subset which includes multiple encounters. For example, consider the following:
df_str.shape, df_notes.shape
((217, 386), (35, 4))
df_notes[['hadm_id', 'ne_charttime']]
hadm_id ne_charttime
0 100104 2201-06-21 20:00:00
1 100104 2201-06-21 22:51:00
2 100104 2201-06-22 05:00:00
3 100104 2201-06-23 04:33:00
4 100104 2201-06-23 12:59:00
5 100104 2201-06-24 05:15:00
6 100372 2115-12-20 02:29:00
7 100372 2115-12-21 10:15:00
8 100372 2115-12-22 13:05:00
9 100372 2115-12-25 17:16:00
10 100372 2115-12-30 10:58:00
11 100372 2115-12-30 13:07:00
12 100372 2115-12-30 14:16:00
13 100372 2115-12-30 22:34:00
14 100372 2116-01-03 09:10:00
15 100372 2116-01-07 11:08:00
16 100975 2126-03-02 06:06:00
17 100975 2126-03-02 17:44:00
18 100975 2126-03-03 05:36:00
19 100975 2126-03-03 18:27:00
20 100975 2126-03-04 05:29:00
21 100975 2126-03-04 10:48:00
22 100975 2126-03-04 16:42:00
23 100975 2126-03-05 22:12:00
24 100975 2126-03-05 23:01:00
25 100975 2126-03-06 11:02:00
26 100975 2126-03-06 13:38:00
27 100975 2126-03-08 13:39:00
28 100975 2126-03-11 10:41:00
29 101511 2199-04-30 09:29:00
30 101511 2199-04-30 09:53:00
31 101511 2199-04-30 18:06:00
32 101511 2199-05-01 08:28:00
33 111073 2195-05-01 01:56:00
34 111073 2195-05-01 21:49:00
This example has 5 encounters. The dataframe is sorted by hadm_id and, within each hadm_id, ne_charttime is sorted. However, the column ne_charttime by itself is NOT sorted, as seen from row 0 (ne_charttime=2201-06-21 20:00:00) and row 6 (ne_charttime=2115-12-20 02:29:00). When I try to do a merge_asof, I get the following error:
ValueError: left keys must be sorted.
Is this because the ne_charttime column is not sorted? If so, how do I rectify this while maintaining the integrity of the encounter ID groups?
EDIT-1:
I was able to loop over the encounters as well:
cols = list(dev_str.columns[1:]) # get the cols to merge (everything except hadm_id)
final_dfs = []
grouped = dev_notes.groupby('hadm_id') # get groups of encounter ids
for name, group in grouped:
    final_df = group.copy().reset_index(drop=True) # make a copy of notes for that encounter
    for col in cols:
        final_df[col] = np.nan # set the values to nan
    idx = 0 # index to track the final row in the given encounter
    for i, note_row in final_df.iterrows():
        ne = note_row['ne_charttime']
        sub = dev_str.loc[(dev_str['hadm_id'] == name)].reset_index(drop=True) # get the df corresponding to the encounter
        for j, str_row in sub.iterrows():
            ce = str_row['ce_charttime']
            if ne < ce: # if the note charttime < variable charttime
                idx += 1
                # grab the previous values for the variables and break
                for col in cols:
                    final_df.iloc[i, final_df.columns.get_loc(col)] = sub.iloc[j-1][col]
                break
    # get the last value in the df for the variables
    for col in cols:
        final_df.iloc[idx, final_df.columns.get_loc(col)] = sub.iloc[-1][col]
    final_dfs.append(final_df) # append the df to the list
# cat the list to get final df and reset index
final_df = pd.concat(final_dfs)
final_df.reset_index(inplace=True, drop=True)
Again, this is very inefficient but does the job.
Is there a better way to achieve what I want? Any help is appreciated.
Thanks.
You can use merge_asof (both dataframes must be sorted by the columns you're merging them on, which is already the case in your example):
final_df = pd.merge_asof(df_notes, df_str, left_on='ne_charttime', right_on='ce_charttime', by='hadm_id')
Result:
hadm_id ne_charttime note ce_charttime hr resp magnesium hr_24hr_mean
0 196673 2108-03-05 16:54:00 Nursing\nNursing Progress Note\nPt is a 43 yo ... 2108-03-05 16:16:00 85.0 11.0 1.8 88.333333
1 196673 2108-03-05 17:54:00 Physician \nPhysician Resident Admission Note\... 2108-03-05 17:00:00 109.0 12.0 1.8 93.500000
2 196673 2108-03-05 18:09:00 Physician \nPhysician Resident Admission Note\... 2108-03-05 18:00:00 97.0 12.0 1.8 94.200000
3 196673 2108-03-06 06:11:00 Nursing\nNursing Progress Note\nPain control (... 2108-03-06 06:00:00 103.0 17.0 2.0 95.000000
4 196673 2108-03-06 08:06:00 Physician \nPhysician Resident Progress Note\n... 2108-03-06 08:00:00 103.0 20.0 2.0 95.700000
5 196673 2108-03-06 12:40:00 Nursing\nNursing Progress Note\nChief Complain... 2108-03-06 12:00:00 85.0 10.0 2.0 94.291667
6 196673 2108-03-06 13:01:00 Nursing\nNursing Progress Note\nPain control (... 2108-03-06 13:00:00 98.0 14.0 2.0 94.440000
7 196673 2108-03-06 17:09:00 Nursing\nNursing Transfer Note\nChief Complain... 2108-03-06 17:00:00 106.0 21.0 2.0 95.360000
8 196673 2108-03-06 17:12:00 Nursing\nNursing Transfer Note\nPain control (... 2108-03-06 17:00:00 106.0 21.0 2.0 95.360000
9 196673 2108-03-07 15:25:00 Radiology\nCHEST (PA & LAT)\n[**2108-3-7**] 3:... 2108-03-06 17:00:00 106.0 21.0 2.0 95.360000
10 196673 2108-03-07 18:34:00 Radiology\nCTA CHEST W&W/O C&RECONS, NON-CORON... 2108-03-06 17:00:00 106.0 21.0 2.0 95.360000
11 196673 2108-03-09 09:10:00 Radiology\nABDOMEN (SUPINE & ERECT)\n[**2108-3... 2108-03-06 17:00:00 106.0 21.0 2.0 95.360000
12 196673 2108-03-09 12:22:00 Radiology\nCT ABDOMEN W/CONTRAST\n[**2108-3-9*... 2108-03-06 17:00:00 106.0 21.0 2.0 95.360000
13 196673 2108-03-10 05:26:00 Radiology\nABDOMEN (SUPINE & ERECT)\n[**2108-3... 2108-03-06 17:00:00 106.0 21.0 2.0 95.360000
14 196673 2108-03-10 05:27:00 Radiology\nCHEST (PA & LAT)\n[**2108-3-10**] 5... 2108-03-06 17:00:00 106.0 21.0 2.0 95.360000
PS: This gives you the correct result for all rows. There's a logical flaw in your code: you look for the first time ce_charttime > ne_charttime and then take the previous row. If there's no such time, you'll never have the chance to take the previous row, hence the NaNs in your result table starting from row 8.
PPS: This includes ce_charttime in the final dataframe. You can replace it by a column of how old the information is and/or remove it:
final_df['info_age'] = final_df.ne_charttime - final_df.ce_charttime
final_df = final_df.drop(columns='ce_charttime')
UPDATE for EDIT-2: As I wrote at the very beginning, repeated in the comments, and as the docs clearly state: both ce_charttime and ne_charttime must be sorted (hadm_id need not be sorted). If this condition is not met, you'll have to (temporarily) sort your dataframes as required. See the following example:
import string
import numpy as np
import pandas as pd
# pd.np was removed in pandas 2.0, so NumPy is imported directly; 'min' replaces the deprecated 'T' frequency alias
df_str = pd.DataFrame( {'hadm_id': np.tile([111111, 222222],10), 'ce_charttime': pd.date_range('2019-10-01 00:30', periods=20, freq='30min'), 'hr': np.random.randint(80,120,20)})
df_notes = pd.DataFrame( {'hadm_id': np.tile([111111, 222222],3), 'ne_charttime': pd.date_range('2019-10-01 00:45', periods=6, freq='40min'), 'note': [''.join(np.random.choice(list(string.ascii_letters), 10)) for _ in range(6)]}).sort_values('hadm_id')
final_df = pd.merge_asof(df_notes.sort_values('ne_charttime'), df_str, left_on='ne_charttime', right_on='ce_charttime', by='hadm_id').sort_values(['hadm_id', 'ne_charttime'])
print(df_str); print(df_notes); print(final_df)
Output:
hadm_id ce_charttime hr
0 111111 2019-10-01 00:30:00 118
1 222222 2019-10-01 01:00:00 93
2 111111 2019-10-01 01:30:00 92
3 222222 2019-10-01 02:00:00 86
4 111111 2019-10-01 02:30:00 88
5 222222 2019-10-01 03:00:00 86
6 111111 2019-10-01 03:30:00 106
7 222222 2019-10-01 04:00:00 91
8 111111 2019-10-01 04:30:00 109
9 222222 2019-10-01 05:00:00 95
10 111111 2019-10-01 05:30:00 113
11 222222 2019-10-01 06:00:00 92
12 111111 2019-10-01 06:30:00 104
13 222222 2019-10-01 07:00:00 83
14 111111 2019-10-01 07:30:00 114
15 222222 2019-10-01 08:00:00 98
16 111111 2019-10-01 08:30:00 110
17 222222 2019-10-01 09:00:00 89
18 111111 2019-10-01 09:30:00 98
19 222222 2019-10-01 10:00:00 109
hadm_id ne_charttime note
0 111111 2019-10-01 00:45:00 jOcRWVdPDF
2 111111 2019-10-01 02:05:00 mvScJNrwra
4 111111 2019-10-01 03:25:00 FBAFbJYflE
1 222222 2019-10-01 01:25:00 ilNuInOsYZ
3 222222 2019-10-01 02:45:00 ysyolaNmkV
5 222222 2019-10-01 04:05:00 wvowGGETaP
hadm_id ne_charttime note ce_charttime hr
0 111111 2019-10-01 00:45:00 jOcRWVdPDF 2019-10-01 00:30:00 118
2 111111 2019-10-01 02:05:00 mvScJNrwra 2019-10-01 01:30:00 92
4 111111 2019-10-01 03:25:00 FBAFbJYflE 2019-10-01 02:30:00 88
1 222222 2019-10-01 01:25:00 ilNuInOsYZ 2019-10-01 01:00:00 93
3 222222 2019-10-01 02:45:00 ysyolaNmkV 2019-10-01 02:00:00 86
5 222222 2019-10-01 04:05:00 wvowGGETaP 2019-10-01 04:00:00 91
You can do a full merge and then filter with query:
df_notes.merge(df_str, on='hadm_id').query('ce_charttime <= ne_charttime')
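A sketch of how the merge-then-filter idea could be completed, since the query alone keeps every earlier df_str row for each note rather than only the most recent one. The tiny frames and the extra sort_values, drop_duplicates and left-merge steps are assumptions added for illustration, not part of the answer above.

```python
import pandas as pd

# Toy frames with the same key columns as the question (values are made up).
df_str = pd.DataFrame({
    "hadm_id": [1, 1, 1, 2],
    "ce_charttime": pd.to_datetime([
        "2108-03-05 15:34", "2108-03-05 16:00",
        "2108-03-05 16:16", "2108-03-05 12:00",
    ]),
    "hr": [95.0, 85.0, 86.0, 70.0],
})
df_notes = pd.DataFrame({
    "hadm_id": [1, 1, 2],
    "ne_charttime": pd.to_datetime([
        "2108-03-05 16:54", "2108-03-05 15:00", "2108-03-05 13:00",
    ]),
    "note": ["a", "b", "c"],
})

# Full merge within each encounter, then keep measurements taken at or before the note.
merged = df_notes.merge(df_str, on="hadm_id").query("ce_charttime <= ne_charttime")

# Keep only the most recent measurement for each note.
latest = (merged.sort_values("ce_charttime")
                .drop_duplicates(subset=["hadm_id", "ne_charttime"], keep="last"))

# Left-merge back so notes without any earlier measurement are kept with NaN.
result = df_notes.merge(
    latest.drop(columns="note"), on=["hadm_id", "ne_charttime"], how="left"
)
print(result)
```

The intermediate merge holds one row per note and measurement pair within an encounter, so for large inputs merge_asof is likely to be much lighter on memory.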

how to fill missing datetime row with pandas

index valuve
2017-01-25 01:00:00:00 1
2017-01-25 02:00:00:00 5
2017-01-25 03:00:00:00 7
2017-01-25 07:00:00:00 34
2017-01-25 20:00:00:00 45
2017-01-25 24:00:00:00 45
2017-01-26 1:00:00:00 31
This dataframe holds an hourly record for each day, but some records are missing. How can I insert the missing rows in the right places and fill the corresponding values with NaN?
The complication here is the 24:00 hour in the timestamps, which is not a valid hour for parsing, so it is necessary to replace it with 23:00 and then add one hour. Finally, use DataFrame.asfreq to add the missing rows to the hourly DatetimeIndex:
mask = df.index.str.contains(' 24:')
idx = df.index.where(~mask, df.index.str.replace(' 24:', ' 23:'))
idx = pd.to_datetime(idx, format='%Y-%m-%d %H:%M:%S:%f')
df.index = idx.where(~mask, idx + pd.Timedelta(1, unit='H'))
df = df.asfreq('H')
print (df)
valuve
index
2017-01-25 01:00:00 1.0
2017-01-25 02:00:00 5.0
2017-01-25 03:00:00 7.0
2017-01-25 04:00:00 NaN
2017-01-25 05:00:00 NaN
2017-01-25 06:00:00 NaN
2017-01-25 07:00:00 34.0
2017-01-25 08:00:00 NaN
2017-01-25 09:00:00 NaN
2017-01-25 10:00:00 NaN
2017-01-25 11:00:00 NaN
2017-01-25 12:00:00 NaN
2017-01-25 13:00:00 NaN
2017-01-25 14:00:00 NaN
2017-01-25 15:00:00 NaN
2017-01-25 16:00:00 NaN
2017-01-25 17:00:00 NaN
2017-01-25 18:00:00 NaN
2017-01-25 19:00:00 NaN
2017-01-25 20:00:00 45.0
2017-01-25 21:00:00 NaN
2017-01-25 22:00:00 NaN
2017-01-25 23:00:00 NaN
2017-01-26 00:00:00 45.0
2017-01-26 01:00:00 31.0
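A self-contained sketch of the same idea on a shortened series, assuming the index holds strings in the question's format. It uses a slightly different route from the answer above: the 24:00 label is rewritten to 00:00 and shifted forward by one day, which lands on the same timestamp. The variable names and the shortened data are illustrative only.

```python
import pandas as pd

# String-indexed series in the question's format (shortened, illustrative).
raw = pd.Series(
    [1, 5, 45, 31],
    index=[
        "2017-01-25 01:00:00:00",
        "2017-01-25 03:00:00:00",
        "2017-01-25 24:00:00:00",
        "2017-01-26 01:00:00:00",
    ],
    name="valuve",
)

labels = pd.Series(raw.index)
# Flag the labels that use the invalid hour 24.
is_24h = labels.str.contains(" 24:", regex=False)

# Parse with hour 00 instead, then move the flagged rows to the next day.
parsed = pd.to_datetime(
    labels.str.replace(" 24:", " 00:", regex=False),
    format="%Y-%m-%d %H:%M:%S:%f",
)
parsed = parsed + pd.to_timedelta(is_24h.astype(int), unit="D")

raw.index = pd.DatetimeIndex(parsed)

# Insert the missing hours; the new rows hold NaN.
filled = raw.asfreq("h")
print(filled)
```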

Resampling and doing Linear Interpolation in Pandas

I have data that contains Id, DateTime and Value columns. The data is supposed to be collected every 10 minutes. However, some readings have gaps of more than 10 minutes (for example gaps of over 20 minutes, 1 hour or 2 hours). The data was collected for one full month. I want to resample and linearly interpolate the Value column so that it contains data at a fixed interval (say every 1 hour, or weekly).
This is my sample data
Id DateTime Value
20 2018-04-08 00:28:52 10
20 2018-04-08 00:38:34 11
20 2018-04-08 00:48:57 9
20 2018-04-08 01:18:22 7
............................
205 2018-04-08 01:08:28 11
205 2018-04-08 01:18:33 13
205 2018-04-08 01:27:22 8
205 2018-04-08 01:37:02 7
205 2018-04-08 01:56:44 6
205 2018-04-08 02:16:14 10
.....
2053 2018-04-08 02:06:03 11
2053 2018-04-08 02:17:10 12
2053 2018-04-08 02:26:33 8
2053 2018-04-08 02:36:53 9
2053 2018-04-08 03:26:33 13
Any suggestions ?
Thanks
I believe you need:
print (df)
Id DateTime Value
0 20 2018-04-08 00:28:52 10
1 20 2018-04-08 00:38:34 11
2 20 2018-04-08 00:48:57 9
3 20 2018-04-08 01:18:22 7
4 205 2018-04-08 01:08:28 11
5 205 2018-04-08 01:18:33 13
6 205 2018-04-08 01:27:22 8
7 205 2018-04-08 01:37:02 7
8 205 2018-04-08 01:56:44 6
9 205 2018-04-08 02:16:14 10
10 2053 2018-04-08 10:06:03 11
11 2053 2018-04-08 10:17:10 12
12 2053 2018-04-08 10:26:33 8
13 2053 2018-04-08 10:36:53 9
14 2053 2018-04-08 10:26:33 13
df = df.set_index('DateTime')['Value'].resample('1H').mean().interpolate()
print (df)
DateTime
2018-04-08 00:00:00 10.000000
2018-04-08 01:00:00 8.666667
2018-04-08 02:00:00 10.000000
2018-04-08 03:00:00 10.075000
2018-04-08 04:00:00 10.150000
2018-04-08 05:00:00 10.225000
2018-04-08 06:00:00 10.300000
2018-04-08 07:00:00 10.375000
2018-04-08 08:00:00 10.450000
2018-04-08 09:00:00 10.525000
2018-04-08 10:00:00 10.600000
Freq: H, Name: Value, dtype: float64
EDIT:
If you need to resample per group, also add groupby, with reindex to give every unique Id the same DatetimeIndex:
df = df.set_index('DateTime').groupby('Id')['Value'].resample('1H').mean()
mux = pd.MultiIndex.from_product([df.index.levels[0], pd.date_range(df.index.levels[1].min(), df.index.levels[1].max(), freq='h')])
df = df.reindex(mux)
df = df.groupby(level=0).apply(lambda x: x.interpolate())
print (df)
20 2018-04-08 00:00:00 10.0
2018-04-08 01:00:00 7.0
2018-04-08 02:00:00 7.0
2018-04-08 03:00:00 7.0
2018-04-08 04:00:00 7.0
2018-04-08 05:00:00 7.0
2018-04-08 06:00:00 7.0
2018-04-08 07:00:00 7.0
2018-04-08 08:00:00 7.0
2018-04-08 09:00:00 7.0
2018-04-08 10:00:00 7.0
205 2018-04-08 00:00:00 NaN
2018-04-08 01:00:00 9.0
2018-04-08 02:00:00 10.0
2018-04-08 03:00:00 10.0
2018-04-08 04:00:00 10.0
2018-04-08 05:00:00 10.0
2018-04-08 06:00:00 10.0
2018-04-08 07:00:00 10.0
2018-04-08 08:00:00 10.0
2018-04-08 09:00:00 10.0
2018-04-08 10:00:00 10.0
2053 2018-04-08 00:00:00 NaN
2018-04-08 01:00:00 NaN
2018-04-08 02:00:00 NaN
...
2018-04-08 07:00:00 NaN
2018-04-08 08:00:00 NaN
2018-04-08 09:00:00 NaN
2018-04-08 10:00:00 10.6
Name: Value, dtype: float64
Another solution, if you need to interpolate each group separately:
df = (df.set_index('DateTime')
        .groupby('Id')['Value']
        .resample('1H')
        .mean()
        .groupby(level=0)
        .apply(lambda x: x.interpolate()))
print (df)
Id DateTime
20 2018-04-08 00:00:00 10.0
2018-04-08 01:00:00 7.0
205 2018-04-08 01:00:00 9.0
2018-04-08 02:00:00 10.0
2053 2018-04-08 10:00:00 10.6
Name: Value, dtype: float64
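A compact sketch of per-group resampling followed by linear interpolation, on made-up readings. It uses transform rather than apply so the (Id, DateTime) index is kept unchanged; that choice, the lowercase '1h' alias and the sample values are assumptions for illustration.

```python
import pandas as pd

# Illustrative readings: Id 20 has a gap that leaves the 01:00 bin empty.
df = pd.DataFrame({
    "Id": [20, 20, 205, 205],
    "DateTime": pd.to_datetime([
        "2018-04-08 00:10:00", "2018-04-08 02:30:00",
        "2018-04-08 01:05:00", "2018-04-08 01:45:00",
    ]),
    "Value": [10, 20, 4, 6],
})

# Hourly mean per Id; empty hourly bins inside a group's range become NaN.
hourly = (df.set_index("DateTime")
            .groupby("Id")["Value"]
            .resample("1h")
            .mean())

# Linear interpolation within each Id, never across Ids.
interpolated = hourly.groupby(level="Id").transform(lambda s: s.interpolate())
print(interpolated)
```

Each Id is resampled only over its own time span here; reindexing to a shared range, as in the answer above, is needed if all Ids must cover identical timestamps.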

Select previous row which satisfies a condition in hive

I have product data like this
Product Date Sales Availbility
xyz 2017-12-31 724.5 6.0
xyz 2018-01-07 362.25 7.0
xyz 2018-01-14 281.75 7.0
xyz 2018-01-21 442.75 7.0
xyz 2018-01-28 442.75 6.0
xyz 2018-02-04 402.5 7.0
xyz 2018-02-11 201.25 3.0
xyz 2018-02-18 120.75 0.0
xyz 2018-02-25 40.25 0.0
xyz 2018-03-11 201.25 0.0
xyz 2018-03-18 483.0 5.0
xyz 2018-03-25 322.0 7.0
xyz 2018-04-01 241.5 7.0
xyz 2018-04-08 281.75 7.0
xyz 2018-04-15 523.25 7.0
xyz 2018-04-22 241.5 7.0
xyz 2018-04-29 362.25 7.0
The data is not ordered (a minor issue); what I want to do is that wherever we have 0 in the availbility column (4th column) I want to take the average of the previous 3 weeks that have full availability (i.e. 7),
something like below:
xyz 2017-12-31 724.5 6.0 Null
xyz 2018-01-07 362.25 7.0 362.25 ( Same value for weeks with availbility = 7)
xyz 2018-01-14 281.75 7.0 281.75
xyz 2018-01-21 442.75 7.0 442.75
xyz 2018-01-28 442.75 6.0 361 (362 + 281 + 362/3)the prior fully availble week avg which is avilble)
xyz 2018-02-04 402.5 7.0 402
xyz 2018-02-11 201.25 3.0 375 (402 + 442 + 281 /3)
xyz 2018-02-18 120.75 0.0 375 ( Same since 375 is the most recent 4 fully availble average)
xyz 2018-02-25 40.25 0.0 375
xyz 2018-03-11 201.25 0.0 375
xyz 2018-03-18 483.0 5.0 375
xyz 2018-03-25 322.0 7.0 322
xyz 2018-04-01 241.5 7.0 241
xyz 2018-04-08 281.75 7.0 281
xyz 2018-04-15 523.25 7.0 523
xyz 2018-04-22 241.5 7.0 241
xyz 2018-04-29 362.25 7.0 362
I approached it by computing the 3-week average over only the fully available weeks and unioning that with the rest of the weeks, then tried using the lag function to retrieve the most recent average.
select a.*,lag(case when a.Full_availble_sales >0 then a.Full_availble_sales end,1) over (partition by a.asin order by a.week_beginning) as Four_wk_avg from (select asin,week_beginning,avg(sales) as weekly_sales,sum(available_to_purchase) as weekly_availbility,0 as Full_availble_sales from t1 where asin = 'xyz' group by asin,week_beginning having sum(available_to_purchase) < 7
union all
select t.asin,t.week_beginning,t.weekly_sales,t.weekly_availbility,avg(t.weekly_sales) over (partition by t.asin order by t.week_beginning rows between 3 preceding and current row ) as Full_availble_sales from
(select asin,week_beginning,avg(sales) as weekly_sales,sum(available_to_purchase) as weekly_availbility from t1 where asin = 'xyz' group by asin,week_beginning having sum(available_to_purchase) = 7)t ) a order by a.week_beginning
The output was:
xyz 2017-12-31 724.5 6.0 0.0 NULL
xyz 2018-01-07 362.25 7.0 362.25 NULL
xyz 2018-01-14 281.75 7.0 322.0 362.25
xyz 2018-01-21 442.75 7.0 362.25 322.0
xyz 2018-01-28 442.75 6.0 0.0 362.25
xyz 2018-02-04 402.5 7.0 372.3125 NULL
xyz 2018-02-11 201.25 3.0 0.0 372.3125
xyz 2018-02-18 120.75 0.0 0.0 NULL
xyz 2018-02-25 40.25 0.0 0.0 NULL
xyz 2018-03-11 201.25 0.0 0.0 NULL
xyz 2018-03-18 483.0 5.0 0.0 NULL
xyz 2018-03-25 322.0 7.0 362.25 NULL
xyz 2018-04-01 241.5 7.0 352.1875 362.25
xyz 2018-04-08 281.75 7.0 311.9375 352.1875
xyz 2018-04-15 523.25 7.0 342.125 311.9375
xyz 2018-04-22 241.5 7.0 322.0 342.125
xyz 2018-04-29 362.25 7.0 352.1875 322.0
which was not what I intended.
This will do the job (using the aggregate functions avg and max_by over a window):
WITH
tt1 (Product,Date_week_beginning,Sales,Availbility) AS
( SELECT * FROM ( VALUES
('xyz','2017-12-31', 724.5 ,6.0),
('xyz','2018-01-07', 362.25 ,7.0),
('xyz','2018-01-14', 281.75 ,7.0),
('xyz','2018-01-21', 442.75 ,7.0),
('xyz','2018-01-28', 442.75 ,6.0),
('xyz','2018-02-04', 402.5 ,7.0),
('xyz','2018-02-11', 201.25 ,3.0),
('xyz','2018-02-18', 120.75 ,0.0),
('xyz','2018-02-25', 40.25 ,0.0),
('xyz','2018-03-11', 201.25 ,0.0),
('xyz','2018-03-18', 483.0 ,5.0),
('xyz','2018-03-25', 322.0 ,7.0),
('xyz','2018-04-01', 241.5 ,7.0),
('xyz','2018-04-08', 281.75 ,7.0),
('xyz','2018-04-15', 523.25 ,7.0),
('xyz','2018-04-22', 241.5 ,7.0),
('xyz','2018-04-29', 362.25 ,7.0) )
), tt2 AS (
SELECT *, avg(sales) OVER (partition by Product order by if(Availbility = 7.0,1),Date_week_beginning ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) avg3
FROM tt1
)
SELECT Product,Date_week_beginning,Sales,Availbility,
CASE WHEN Availbility = 7.0 THEN Sales
ELSE
max_by(if(Availbility = 7.0,avg3),(Availbility = 7.0, Date_week_beginning) ) OVER (partition by Product order by Date_week_beginning)
END new_col
FROM tt2
ORDER BY Product,Date_week_beginning
And the results are exactly as requested:
Product Date_week_beginning Sales Availbility new_col
xyz 2017-12-31 724.5 6.0 NULL
xyz 2018-01-07 362.25 7.0 362.25
xyz 2018-01-14 281.75 7.0 281.75
xyz 2018-01-21 442.75 7.0 442.75
xyz 2018-01-28 442.75 6.0 362.25
xyz 2018-02-04 402.5 7.0 402.5
xyz 2018-02-11 201.25 3.0 375.6666666666667
xyz 2018-02-18 120.75 0.0 375.6666666666667
xyz 2018-02-25 40.25 0.0 375.6666666666667
xyz 2018-03-11 201.25 0.0 375.6666666666667
xyz 2018-03-18 483.0 5.0 375.6666666666667
xyz 2018-03-25 322.0 7.0 322.0
xyz 2018-04-01 241.5 7.0 241.5
xyz 2018-04-08 281.75 7.0 281.75
xyz 2018-04-15 523.25 7.0 523.25
xyz 2018-04-22 241.5 7.0 241.5
xyz 2018-04-29 362.25 7.0 362.25
I assume there are two typos in the question (as can be verified from the example):
This line in the question is a mistake: 361 (362 + 281 + 362/3)the prior fully availble week avg which is avilble) should be 361 (442 + 281 + 362/3)the prior fully availble week avg which is avilble)
The sentence: "what I want to do is that wherever we have 0 in the availbility column (4th column).." should be "what I want to do is that wherever we do not have 7.0 in the availbility column (4th column).."
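For readers who want to check the logic outside SQL, here is a rough pandas rendition of the same rule, not a Hive query. It assumes the corrected reading of the question: weeks with availability 7 keep their own sales, and every other week takes the 3-week average of the most recent fully available weeks. Only the first seven rows of the sample are used, and new_col mirrors the answer's column name.

```python
import pandas as pd

# First seven rows of the question's sample data.
df = pd.DataFrame({
    "Product": ["xyz"] * 7,
    "Date": pd.to_datetime([
        "2017-12-31", "2018-01-07", "2018-01-14", "2018-01-21",
        "2018-01-28", "2018-02-04", "2018-02-11",
    ]),
    "Sales": [724.5, 362.25, 281.75, 442.75, 442.75, 402.5, 201.25],
    "Availbility": [6.0, 7.0, 7.0, 7.0, 6.0, 7.0, 3.0],
}).sort_values(["Product", "Date"], ignore_index=True)

full = df["Availbility"].eq(7.0)

# Rolling mean over the last (up to) 3 fully available weeks, per product.
avg3 = (df[full].groupby("Product")["Sales"]
        .rolling(3, min_periods=1).mean()
        .droplevel("Product"))

# Carry the latest average forward onto the weeks that are not fully available.
carried = avg3.reindex(df.index).groupby(df["Product"]).ffill()

# Fully available weeks keep their own sales; the others take the carried average.
df["new_col"] = carried.where(~full, df["Sales"])
print(df)
```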

Week difference from current week to last day previous week

I have a pivoted pandas data frame (sales by region) that was created from another pandas data frame (sales by store) using the pivot_table method.
As an example:
df = pd.DataFrame(
{'store':['A','B','C','D','E']*7,
'region':['NW','NW','SW','NE','NE']*7,
'date':['2017-03-30']*5+['2017-04-05']*5+['2017-04-07']*5+['2017-04-12']*5+['2017-04-13']*5+['2017-04-17']*5+['2017-04-20']*5,
'sales':[30,1,133,9,1,30,3,135,9,11,30,1,140,15,15,25,10,137,9,3,29,10,137,9,11,30,19,145,20,10,30,8,141,25,25]
})
df_sales = df.pivot_table(index = ['region'], columns = ['date'], aggfunc = [np.sum], margins = True)
df_sales = df_sales.iloc[:, :-1]  # drop the last ('All') column; .ix was removed in pandas 1.0
My goal is to do the following to the sales data frame.
Add a column called week difference that computes the difference between the total sales for the latest date of this week and the latest value (by date) of the previous week. Assumption: I always have data for some days of each week, but not on fixed days.
The week difference column will be different as new data comes in, but for the latest data would look like:
>>> df_sales
sum \
sales
date 2017-03-30 2017-04-05 2017-04-07 2017-04-12 2017-04-13 2017-04-17
region
NE 10.0 20.0 30.0 12.0 20.0 30.0
NW 31.0 33.0 31.0 35.0 39.0 49.0
SW 133.0 135.0 140.0 137.0 137.0 145.0
All 174.0 188.0 201.0 184.0 196.0 224.0
date 2017-04-20 WeekDifference
region
NE 50.0 50.0-20.0
NW 38.0 38.0-39.0
SW 141.0 141.0-137.0
All 229.0 229-196.0
Because it's the difference between the latest date and the latest day of the previous week. In this specific example, we are on week 2017-04-20, and the last day of data from previous week is 2017-04-13.
I'd want to do this in a general way as data gets updated.
df = pd.DataFrame(
{'store':['A','B','C','D','E']*7,
'region':['NW','NW','SW','NE','NE']*7,
'date':['2017-03-30']*5+['2017-04-05']*5+['2017-04-07']*5+['2017-04-12']*5+['2017-04-13']*5+['2017-04-17']*5+['2017-04-20']*5,
'sales':[30,1,133,9,1,30,3,135,9,11,30,1,140,15,15,25,10,137,9,3,29,10,137,9,11,30,19,145,20,10,30,8,141,25,25]
})
df_sales = df.pivot_table(index = ['region'], columns = ['date'], aggfunc = [np.sum], margins = True)
df_sales = df_sales.iloc[:, :-1]  # drop the last ('All') column; .ix was removed in pandas 1.0
Input:
sum \
sales
date 2017-03-30 2017-04-05 2017-04-07 2017-04-12 2017-04-13 2017-04-17
region
NE 10.0 20.0 30.0 12.0 20.0 30.0
NW 31.0 33.0 31.0 35.0 39.0 49.0
SW 133.0 135.0 140.0 137.0 137.0 145.0
All 174.0 188.0 201.0 184.0 196.0 224.0
date 2017-04-20 weekdiffernce
region
NE 50.0 50.0 - 20.0
NW 38.0 38.0 - 39.0
SW 141.0 141.0 - 137.0
All 229.0 229.0 - 196.0
Calculate the last week and the one-week offset:
last_column = pd.to_datetime(df_sales.iloc[:,-1].name[2])
last_week_column = last_column + pd.DateOffset(weeks=-1)
col_mask = (pd.to_datetime(df_sales.columns.get_level_values(2)).weekofyear == (last_column.weekofyear-1))
df_sales.loc[:,('sum','sales','weekdiffernce')]=df_sales.iloc[:,-1].astype(str) + ' - '+df_sales.loc[:,('sum','sales',last_week_column.strftime('%Y-%m-%d'))].astype(str)
df_sales.loc[:,('sum','sales','weekdiffernce')]=df_sales.iloc[:,-1].astype(str) + ' - '+df_sales.loc[:,('sum','sales',list(col_mask))].iloc[:,-1].astype(str)
print(df_sales)
Output:
sum \
sales
date 2017-03-30 2017-04-05 2017-04-07 2017-04-12 2017-04-13 2017-04-17
region
NE 10.0 20.0 30.0 12.0 20.0 30.0
NW 31.0 33.0 31.0 35.0 39.0 49.0
SW 133.0 135.0 140.0 137.0 137.0 145.0
All 174.0 188.0 201.0 184.0 196.0 224.0
date 2017-04-20 weekdiffernce
region
NE 50.0 50.0 - 20.0
NW 38.0 38.0 - 39.0
SW 141.0 141.0 - 137.0
All 229.0 229.0 - 196.0
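A simplified sketch of the week-difference calculation as a numeric column, assuming a pivot with plain datetime columns (no 'sum'/'sales' levels and no margins) and weeks that start on Monday. It locates the previous week by date boundaries rather than by week number, which is a design choice made here so that the first week of a year needs no special case. Only the last three dates of the sample are used.

```python
import pandas as pd

# Last three dates of the question's sample data.
df = pd.DataFrame({
    "region": ["NW", "NW", "SW", "NE", "NE"] * 3,
    "date": ["2017-04-12"] * 5 + ["2017-04-13"] * 5 + ["2017-04-20"] * 5,
    "sales": [25, 10, 137, 9, 3, 29, 10, 137, 9, 11, 30, 8, 141, 25, 25],
})
df["date"] = pd.to_datetime(df["date"])
pivot = df.pivot_table(index="region", columns="date", values="sales", aggfunc="sum")

latest = pivot.columns.max()

# Monday of the latest date's week, and Monday of the week before it.
week_start = latest.normalize() - pd.Timedelta(days=latest.weekday())
prev_week_start = week_start - pd.Timedelta(days=7)

# Latest available date that falls inside the previous week.
in_prev_week = (pivot.columns >= prev_week_start) & (pivot.columns < week_start)
prev_latest = pivot.columns[in_prev_week].max()

# Difference between the latest date and the last date of the previous week.
week_diff = pivot[latest] - pivot[prev_latest]
print(week_diff)
```

If the previous week has no data at all, prev_latest is NaT and the lookup fails, so real code would need to handle that case explicitly.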