select avg for specific values of date - sql

I have this table 'meteorecords' with date, temperature, rh and the meteo station which made the record.
rerowid date temp rh meteostid
1 2019-09-9 28.8 55.6 AITNIA2
2 2019-09-10 30.3 51.3 AITNIA2
3 2019-09-11 28.6 49.0 AITNIA2
4 2019-09-12 26.7 51.9 AITNIA2
5 2019-09-13 25.3 48.1 AITNIA2
6 2019-09-14 25.3 38.5 AITNIA2
7 2019-09-15 25.0 42.2 AITNIA2
8 2019-09-16 24.1 52.1 AITNIA2
9 2019-09-17 23.3 65.2 AITNIA2
10 2019-09-18 22.7 72.2 AITNIA2
11 2019-09-19 23.4 73.9 AITNIA2
12 2019-09-20 23.1 76.7 AITNIA2
13 2019-09-21 22.5 60.3 AITNIA2
14 2019-09-22 20.9 61.6 AITNIA2
15 2019-09-23 21.9 73.9 AITNIA2
16 2019-09-24 23.2 79.6 AITNIA2
17 2019-09-25 21.8 73.6 AITNIA2
18 2019-09-26 22.2 77.6 AITNIA2
19 2019-09-27 22.9 77.1 AITNIA2
20 2019-09-28 22.8 68.4 AITNIA2
21 2019-09-29 22.6 75.5 AITNIA2
...........................
I want to select all the fields plus the average temperature of the last 3 days.
I'm using PostgreSQL because I have some geometric and spatial data in the db.
I tried this with no luck:
SELECT rerowid,redate,retemp,rerh,meteostid,
(SELECT AVG(retemp)
FROM meteorecords m
WHERE meteostid = m.meteostid AND m.redate BETWEEN redate-2 AND redate)
FROM meteorecords
which returns a result like this:
rerowid date temp rh meteostid AVG_Last_3_Days
1 2019-09-09 28.8 55.6 AITNIA2 22.2824
2 2019-09-10 30.3 51.3 AITNIA2 22.2824
3 2019-09-11 28.6 49.0 AITNIA2 22.2824
4 2019-09-12 26.7 51.9 AITNIA2 22.2824
5 2019-09-13 25.3 48.1 AITNIA2 22.2824
6 2019-09-14 25.3 38.5 AITNIA2 22.2824
7 2019-09-15 25.1 42.2 AITNIA2 22.2824
..................
But I want a result like this:
rerowid date temp rh meteostid AVG_Last_3_Days
1 2019-09-09 28.8 55.6 AITNIA2 28.8
2 2019-09-10 30.3 51.3 AITNIA2 29.5
3 2019-09-11 28.6 49.0 AITNIA2 29.2
4 2019-09-12 26.7 51.9 AITNIA2 28.5
5 2019-09-13 25.3 48.1 AITNIA2 26.9
6 2019-09-14 25.3 38.5 AITNIA2 25.8
7 2019-09-15 25.1 42.2 AITNIA2 25.2
..................

Use window functions. If you have one row per date, or you want the previous three dates *in the data*:
SELECT rerowid, redate, retemp, rerh, meteostid,
       AVG(retemp) OVER (PARTITION BY meteostid
                         ORDER BY redate
                         ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) as avg_retemp_3
FROM meteorecords;
If you want 3 chronological days, use RANGE (an interval offset in RANGE requires PostgreSQL 11 or later):
SELECT rerowid, redate, retemp, rerh, meteostid,
       AVG(retemp) OVER (PARTITION BY meteostid
                         ORDER BY redate
                         RANGE BETWEEN INTERVAL '2 days' PRECEDING AND CURRENT ROW) as avg_retemp_3
FROM meteorecords;

Related

What code will iterate rows using pandas and append the data to a new df?

I am trying to reorganize a temperature data set to get it in the same format as other data sets I have been using. I am having trouble iterating through the data frame and appending the data to a new data frame.
Here is the data:
Year Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
0 1901 -3.16 -4.14 2.05 6.85 13.72 18.27 22.22 20.54 15.30 10.50 2.60 -2.68
1 1902 -3.73 -2.67 1.78 7.62 14.35 18.21 20.51 19.81 14.97 9.93 3.20 -4.02
2 1903 -3.93 -4.39 2.44 7.18 13.07 17.22 20.25 19.67 15.00 9.35 1.52 -2.84
3 1904 -5.49 -3.92 1.83 7.22 13.46 17.78 20.22 19.25 15.87 9.60 3.20 -2.31
4 1905 -4.89 -4.40 4.54 8.01 13.20 18.24 20.25 20.21 16.15 8.42 3.47 -3.28
.. ... ... ... ... ... ... ... ... ... ... ... ... ...
116 2017 -2.07 1.77 3.84 10.02 14.21 19.69 22.57 20.38 17.15 10.85 4.40 -0.77
117 2018 -2.36 -0.56 3.39 7.49 16.39 20.09 22.39 21.01 17.57 10.37 2.48 -0.57
118 2019 -2.38 -1.85 2.93 9.53 14.10 19.21 22.38 21.31 18.41 9.37 3.00 -0.08
119 2020 -1.85 -0.98 4.50 8.34 14.61 19.66 22.42 21.69 16.75 9.99 4.92 -0.38
120 2021 -0.98 -3.86 3.94 8.41 14.06 20.63 22.22 21.23 17.48 11.47 3.54 0.88
Here is the code that I have tried:
df = pds.read_excel("Temp_Data.xlsx")
data = pds.dataframe()
for i in range(len(df)):
data1 = df.iloc[i]
data.append(data1)
Here is the result of that code:
print(data)
Feb -0.72
Mar 0.75
Apr 6.77
May 14.44
Jun 18.40
Jul 20.80
Aug 20.13
Sep 16.17
Oct 10.64
Nov 2.71
Dec -2.80
Name: 43, dtype: float64, Year 1945.00
Jan -2.62
Feb -0.75
Mar 4.00
Apr 7.29
May 12.31
Jun 16.98
Jul 20.76
Aug 20.11
Sep 16.08
Oct 9.82
Nov 2.09
Dec -3.87
Note: for some reason the data starts at 1945 and goes to 2021.
Here is how I am trying to format the data eventually:
Date Temp
0 190101 -3.16
1 190102 -4.14
2 190103 2.05
3 190104 6.85
4 190105 13.72
5 190106 18.27
6 190107 22.22
7 190108 20.54
8 190109 15.30
9 190110 10.50
10 190111 2.60
11 190112 -2.68
12 190201 -3.73
13 190202 -2.67
14 190203 1.78
15 190204 7.62
16 190205 14.35
17 190206 18.21
18 190207 20.51
19 190208 19.81
20 190209 14.97
21 190210 9.93
22 190211 3.20
23 190212 -4.02
You can use melt to reshape your dataframe then create the Date column from Year and Month columns:
months = {'Jan': '01', 'Feb': '02', 'Mar': '03', 'Apr': '04',
          'May': '05', 'Jun': '06', 'Jul': '07', 'Aug': '08',
          'Sep': '09', 'Oct': '10', 'Nov': '11', 'Dec': '12'}
# Convert the Year and Month columns to a YYYYMM string
to_date = lambda x: x.pop('Year').astype(str) + x.pop('Month').map(months)
out = (df.melt(id_vars='Year', var_name='Month', value_name='Temp')
         .assign(Date=to_date).set_index('Date').sort_index().reset_index())
Output:
>>> out
Date Temp
0 190101 -3.16
1 190102 -4.14
2 190103 2.05
3 190104 6.85
4 190105 13.72
.. ... ...
115 202108 21.23
116 202109 17.48
117 202110 11.47
118 202111 3.54
119 202112 0.88
[120 rows x 2 columns]
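If you'd rather end up with a real datetime column instead of the YYYYMM string, a small variant of the same idea (a sketch, not part of the original answer) is:
import pandas as pd

out = (df.melt(id_vars='Year', var_name='Month', value_name='Temp')
         # assumes English month abbreviations ('Jan', 'Feb', ...) in the column names
         .assign(Date=lambda x: pd.to_datetime(x['Year'].astype(str) + x['Month'], format='%Y%b'))
         .sort_values('Date', ignore_index=True)[['Date', 'Temp']])
# out['Date'].dt.strftime('%Y%m') recovers the YYYYMM string form if needed.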

Write & Apply Python Function with Grouped Pandas Data

I have data that is grouped by a column 'plant_name', and I need to write and apply a function to test for a trend in one of the columns, e.g. the one named '10%' or '90%'.
My data looks like this -
plant_name year count mean std min 10% 50% 90% max
0 ARIZONA I 2005 8760.0 8.25 2.21 1.08 5.55 8.19 11.09 15.71
1 ARIZONA I 2006 8760.0 7.87 2.33 0.15 4.84 7.82 10.74 16.75
2 ARIZONA I 2007 8760.0 8.31 2.25 0.03 5.52 8.27 11.23 16.64
3 ARIZONA I 2008 8784.0 7.67 2.46 0.21 4.22 7.72 10.78 15.73
4 ARIZONA I 2009 8760.0 6.92 2.33 0.23 3.79 6.95 9.96 14.64
5 ARIZONA I 2010 8760.0 8.07 2.21 0.68 5.51 7.85 11.14 17.31
6 ARIZONA I 2011 8760.0 7.54 2.38 0.33 4.44 7.45 10.54 17.77
7 ARIZONA I 2012 8784.0 8.61 1.92 0.33 6.37 8.48 11.07 15.84
8 ARIZONA I 2015 8760.0 8.21 2.13 0.60 5.58 8.24 10.88 16.74
9 ARIZONA I 2016 8784.0 8.39 2.27 0.46 5.55 8.32 11.34 16.09
10 ARIZONA I 2017 8760.0 8.32 2.11 0.85 5.70 8.25 11.12 17.96
11 ARIZONA I 2018 8760.0 7.94 2.28 0.07 5.17 7.72 11.04 16.31
12 ARIZONA I 2019 8760.0 7.71 2.49 0.38 4.28 7.75 10.87 15.79
13 ARIZONA I 2020 8784.0 7.57 2.43 0.50 4.36 7.47 10.78 15.69
14 CAETITE I 2005 8760.0 8.11 3.15 0.45 3.76 8.38 12.08 18.89
15 CAETITE I 2006 8760.0 7.70 3.21 0.05 3.50 7.66 12.05 19.08
16 CAETITE I 2007 8760.0 8.64 3.18 0.01 4.05 8.83 12.63 18.57
17 CAETITE I 2008 8784.0 7.87 3.09 0.28 3.75 7.80 11.92 18.54
18 CAETITE I 2009 8760.0 7.31 3.02 0.17 3.46 7.21 11.40 19.46
19 CAETITE I 2010 8760.0 8.00 3.24 0.34 3.63 8.03 12.29 17.27
I'm using this function from here -
import pymannkendall as mk
and you apply the function like this:
mk.original_test(dataframe)
I need the final dataframe to look like this, built from the fields returned by the function (mk.original_test):
trend, h, p, z, Tau, s, var_s, slope, intercept = mk.original_test(data)
plant_name trend h p z Tau s var_s slope intercept
0 ARIZONA I no trend False 0.416 0.812 xxx x x x x
1 CAETITE I increasing True 0.002 3.6 xxx x x x x
I am just not sure how to use groupby to group by the plant_name column and then apply the mk function to one of the columns in the data shown. Thank you.
For a given column, you can run the test in a GroupBy.apply() and return the result as a Series indexed by result._fields:
def mktest(x):
    result = mk.original_test(x)
    return pd.Series(result, index=result._fields)

column = '10%'
df.groupby('plant_name', as_index=False)[column].apply(mktest)
 plant_name     trend      h         p          z        Tau    s       var_s      slope  intercept
 ARIZONA I   no trend  False  0.956276  -0.054827  -0.021978  -2.0  332.666667  -0.003333   5.361667
 CAETITE I   no trend  False  0.452370  -0.751469  -0.333333  -5.0   28.333333  -0.026000   3.755000
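For context, mk.original_test returns a namedtuple, which is why pd.Series(result, index=result._fields) yields the labelled values above. A minimal sketch using the ARIZONA I '10%' values from the question (assumes pymannkendall is installed):
import pandas as pd
import pymannkendall as mk

sample = [5.55, 4.84, 5.52, 4.22, 3.79, 5.51, 4.44, 6.37, 5.58, 5.55, 5.70, 5.17, 4.28, 4.36]
result = mk.original_test(sample)              # namedtuple: (trend, h, p, z, Tau, s, var_s, slope, intercept)
row = pd.Series(result, index=result._fields)  # labelled Series, one entry per field
print(row['trend'], round(row['p'], 6))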

Pandas combine two dataframes based on time difference

I have two data frames that stores different types of medical information of patients. The common elements of both the data frames are the encounter ID (hadm_id), the time the information was recorded ((n|c)e_charttime).
One data frame (df_str) contains structured information such as vital signs and lab test values and values derived from these (such as change statistics over 24 hours). The other data frame (df_notes) contains a column with a clinical note recorded at a specified time for an encounter. Both these data frames contain multiple encounters, but the common element is the encounter ID (hadm_id).
Here are examples of the data frames for ONE encounter ID (hadm_id) with a subset of variables:
df_str
hadm_id ce_charttime hr resp magnesium hr_24hr_mean
0 196673 2108-03-05 15:34:00 95.0 12.0 NaN 95.000000
1 196673 2108-03-05 16:00:00 85.0 11.0 NaN 90.000000
2 196673 2108-03-05 16:16:00 85.0 11.0 1.8 88.333333
3 196673 2108-03-05 17:00:00 109.0 12.0 1.8 93.500000
4 196673 2108-03-05 18:00:00 97.0 12.0 1.8 94.200000
5 196673 2108-03-05 19:00:00 99.0 16.0 1.8 95.000000
6 196673 2108-03-05 20:00:00 98.0 13.0 1.8 95.428571
7 196673 2108-03-05 21:00:00 97.0 14.0 1.8 95.625000
8 196673 2108-03-05 22:00:00 101.0 12.0 1.8 96.222222
9 196673 2108-03-05 23:00:00 97.0 13.0 1.8 96.300000
10 196673 2108-03-06 00:00:00 93.0 13.0 1.8 96.000000
11 196673 2108-03-06 01:00:00 89.0 12.0 1.8 95.416667
12 196673 2108-03-06 02:00:00 88.0 10.0 1.8 94.846154
13 196673 2108-03-06 03:00:00 87.0 12.0 1.8 94.285714
14 196673 2108-03-06 04:00:00 97.0 19.0 1.8 94.466667
15 196673 2108-03-06 05:00:00 95.0 11.0 1.8 94.500000
16 196673 2108-03-06 05:43:00 95.0 11.0 2.0 94.529412
17 196673 2108-03-06 06:00:00 103.0 17.0 2.0 95.000000
18 196673 2108-03-06 07:00:00 101.0 12.0 2.0 95.315789
19 196673 2108-03-06 08:00:00 103.0 20.0 2.0 95.700000
20 196673 2108-03-06 09:00:00 84.0 11.0 2.0 95.142857
21 196673 2108-03-06 10:00:00 89.0 11.0 2.0 94.863636
22 196673 2108-03-06 11:00:00 91.0 14.0 2.0 94.695652
23 196673 2108-03-06 12:00:00 85.0 10.0 2.0 94.291667
24 196673 2108-03-06 13:00:00 98.0 14.0 2.0 94.440000
25 196673 2108-03-06 14:00:00 100.0 18.0 2.0 94.653846
26 196673 2108-03-06 15:00:00 95.0 12.0 2.0 94.666667
27 196673 2108-03-06 16:00:00 96.0 20.0 2.0 95.076923
28 196673 2108-03-06 17:00:00 106.0 21.0 2.0 95.360000
df_notes
hadm_id ne_charttime note
0 196673 2108-03-05 16:54:00 Nursing\nNursing Progress Note\nPt is a 43 yo ...
1 196673 2108-03-05 17:54:00 Physician \nPhysician Resident Admission Note\...
2 196673 2108-03-05 18:09:00 Physician \nPhysician Resident Admission Note\...
3 196673 2108-03-06 06:11:00 Nursing\nNursing Progress Note\nPain control (...
4 196673 2108-03-06 08:06:00 Physician \nPhysician Resident Progress Note\n...
5 196673 2108-03-06 12:40:00 Nursing\nNursing Progress Note\nChief Complain...
6 196673 2108-03-06 13:01:00 Nursing\nNursing Progress Note\nPain control (...
7 196673 2108-03-06 17:09:00 Nursing\nNursing Transfer Note\nChief Complain...
8 196673 2108-03-06 17:12:00 Nursing\nNursing Transfer Note\nPain control (...
9 196673 2108-03-07 15:25:00 Radiology\nCHEST (PA & LAT)\n[**2108-3-7**] 3:...
10 196673 2108-03-07 18:34:00 Radiology\nCTA CHEST W&W/O C&RECONS, NON-CORON...
11 196673 2108-03-09 09:10:00 Radiology\nABDOMEN (SUPINE & ERECT)\n[**2108-3...
12 196673 2108-03-09 12:22:00 Radiology\nCT ABDOMEN W/CONTRAST\n[**2108-3-9*...
13 196673 2108-03-10 05:26:00 Radiology\nABDOMEN (SUPINE & ERECT)\n[**2108-3...
14 196673 2108-03-10 05:27:00 Radiology\nCHEST (PA & LAT)\n[**2108-3-10**] 5...
What I want to do is to combine both the data frames based on the time when that information was recorded. More specifically, for each row in df_notes, I want a corresponding row from df_str with ce_charttime <= ne_charttime.
As an example, the first row in df_notes has ne_charttime = 2108-03-05 16:54:00. There are three rows in df_str with record times less than this time: ce_charttime = 2108-03-05 15:34:00, ce_charttime = 2108-03-05 16:00:00, ce_charttime = 2108-03-05 16:16:00. The most recent of these is the row with ce_charttime = 2108-03-05 16:16:00. So in my resulting data frame, for ne_charttime = 2108-03-05 16:54:00, I will have hr = 85.0, resp = 11.0, magnesium = 1.8, hr_24hr_mean = 88.33.
Essentially, in this example the resulting data frame will look like this:
hadm_id ne_charttime note hr resp magnesium hr_24hr_mean
0 196673 2108-03-05 16:54:00 Nursing\nNursing Progress Note\nPt is a 43 yo ... 85.0 11.0 1.8 88.333333
1 196673 2108-03-05 17:54:00 Physician \nPhysician Resident Admission Note\... 109.0 12.0 1.8 93.500000
2 196673 2108-03-05 18:09:00 Physician \nPhysician Resident Admission Note\... 97.0 12.0 1.8 94.200000
3 196673 2108-03-06 06:11:00 Nursing\nNursing Progress Note\nPain control (... 103.0 17.0 2.0 95.000000
4 196673 2108-03-06 08:06:00 Physician \nPhysician Resident Progress Note\n... 103.0 20.0 2.0 95.700000
5 196673 2108-03-06 12:40:00 Nursing\nNursing Progress Note\nChief Complain... 85.0 10.0 2.0 94.291667
6 196673 2108-03-06 13:01:00 Nursing\nNursing Progress Note\nPain control (... 98.0 14.0 2.0 94.440000
7 196673 2108-03-06 17:09:00 Nursing\nNursing Transfer Note\nChief Complain... 106.0 21.0 2.0 95.360000
8 196673 2108-03-06 17:12:00 Nursing\nNursing Transfer Note\nPain control (... NaN NaN NaN NaN
9 196673 2108-03-07 15:25:00 Radiology\nCHEST (PA & LAT)\n[**2108-3-7**] 3:... NaN NaN NaN NaN
10 196673 2108-03-07 18:34:00 Radiology\nCTA CHEST W&W/O C&RECONS, NON-CORON... NaN NaN NaN NaN
11 196673 2108-03-09 09:10:00 Radiology\nABDOMEN (SUPINE & ERECT)\n[**2108-3... NaN NaN NaN NaN
12 196673 2108-03-09 12:22:00 Radiology\nCT ABDOMEN W/CONTRAST\n[**2108-3-9*... NaN NaN NaN NaN
13 196673 2108-03-10 05:26:00 Radiology\nABDOMEN (SUPINE & ERECT)\n[**2108-3... NaN NaN NaN NaN
14 196673 2108-03-10 05:27:00 Radiology\nCHEST (PA & LAT)\n[**2108-3-10**] 5... NaN NaN NaN NaN
The resulting data frame will be of the same length as df_notes. I have been able to come up with a very inefficient piece of code using for loops and explicit indexing to get this result:
cols = list(df_str.columns[2:])
final_df = df_notes.copy()
for col in cols:
final_df[col] = np.nan
idx = 0
for i, note_row in final_df.iterrows():
ne = note_row['ne_charttime']
for j, str_row in df_str.iterrows():
ce = str_row['ce_charttime']
if ne < ce:
idx += 1
for col in cols:
final_df.iloc[i, final_df.columns.get_loc(col)] = df_str.iloc[j-1][col]
break
for col in cols:
final_df.iloc[idx, final_df.columns.get_loc(col)] = df_str.iloc[-1][col]
This piece of code is bad because it is very inefficient, and while it may work for this example, my full dataset has over 30 different columns of structured variables and over 10,000 encounters.
EDIT-2:
#Stef has provided an excellent answer which seems to work and replaces my elaborate loopy code with a single line (amazing). However, while that works for this particular example, I am running into problems when I apply it to a bigger subset that includes multiple encounters. Consider the following example:
df_str.shape, df_notes.shape
((217, 386), (35, 4))
df_notes[['hadm_id', 'ne_charttime']]
hadm_id ne_charttime
0 100104 2201-06-21 20:00:00
1 100104 2201-06-21 22:51:00
2 100104 2201-06-22 05:00:00
3 100104 2201-06-23 04:33:00
4 100104 2201-06-23 12:59:00
5 100104 2201-06-24 05:15:00
6 100372 2115-12-20 02:29:00
7 100372 2115-12-21 10:15:00
8 100372 2115-12-22 13:05:00
9 100372 2115-12-25 17:16:00
10 100372 2115-12-30 10:58:00
11 100372 2115-12-30 13:07:00
12 100372 2115-12-30 14:16:00
13 100372 2115-12-30 22:34:00
14 100372 2116-01-03 09:10:00
15 100372 2116-01-07 11:08:00
16 100975 2126-03-02 06:06:00
17 100975 2126-03-02 17:44:00
18 100975 2126-03-03 05:36:00
19 100975 2126-03-03 18:27:00
20 100975 2126-03-04 05:29:00
21 100975 2126-03-04 10:48:00
22 100975 2126-03-04 16:42:00
23 100975 2126-03-05 22:12:00
24 100975 2126-03-05 23:01:00
25 100975 2126-03-06 11:02:00
26 100975 2126-03-06 13:38:00
27 100975 2126-03-08 13:39:00
28 100975 2126-03-11 10:41:00
29 101511 2199-04-30 09:29:00
30 101511 2199-04-30 09:53:00
31 101511 2199-04-30 18:06:00
32 101511 2199-05-01 08:28:00
33 111073 2195-05-01 01:56:00
34 111073 2195-05-01 21:49:00
This example has 5 encounters. The dataframe is sorted by hadm_id and, within each hadm_id, ne_charttime is sorted. However, the ne_charttime column by itself is NOT sorted, as seen from row 0 (ne_charttime=2201-06-21 20:00:00) and row 6 (ne_charttime=2115-12-20 02:29:00). When I try to do a merge_asof, I get the following error:
ValueError: left keys must be sorted
Is this because the ne_charttime column is not sorted? If so, how do I rectify this while maintaining the integrity of the encounter ID group?
EDIT-1:
I was able to loop over the encounters as well:
cols = list(dev_str.columns[1:]) # get the cols to merge (everything except hadm_id)
final_dfs = []
grouped = dev_notes.groupby('hadm_id') # get groups of encounter ids
for name, group in grouped:
final_df = group.copy().reset_index(drop=True) # make a copy of notes for that encounter
for col in cols:
final_df[col] = np.nan # set the values to nan
idx = 0 # index to track the final row in the given encounter
for i, note_row in final_df.iterrows():
ne = note_row['ne_charttime']
sub = dev_str.loc[(dev_str['hadm_id'] == name)].reset_index(drop=True) # get the df corresponding to the ecounter
for j, str_row in sub.iterrows():
ce = str_row['ce_charttime']
if ne < ce: # if the variable charttime < note charttime
idx += 1
# grab the previous values for the variables and break
for col in cols:
final_df.iloc[i, final_df.columns.get_loc(col)] = sub.iloc[j-1][col]
break
# get the last value in the df for the variables
for col in cols:
final_df.iloc[idx, final_df.columns.get_loc(col)] = sub.iloc[-1][col]
final_dfs.append(final_df) # append the df to the list
# cat the list to get final df and reset index
final_df = pd.concat(final_dfs)
final_df.reset_index(inplace=True, drop=True)
Again, this is very inefficient, but it does the job.
Is there a better way to achieve what I want? Any help is appreciated.
Thanks.
You can use merge_asof (both dataframes must be sorted by the columns you're merging them on, which is already the case in your example):
final_df = pd.merge_asof(df_notes, df_str, left_on='ne_charttime', right_on='ce_charttime', by='hadm_id')
Result:
hadm_id ne_charttime note ce_charttime hr resp magnesium hr_24hr_mean
0 196673 2108-03-05 16:54:00 Nursing\nNursing Progress Note\nPt is a 43 yo ... 2108-03-05 16:16:00 85.0 11.0 1.8 88.333333
1 196673 2108-03-05 17:54:00 Physician \nPhysician Resident Admission Note\... 2108-03-05 17:00:00 109.0 12.0 1.8 93.500000
2 196673 2108-03-05 18:09:00 Physician \nPhysician Resident Admission Note\... 2108-03-05 18:00:00 97.0 12.0 1.8 94.200000
3 196673 2108-03-06 06:11:00 Nursing\nNursing Progress Note\nPain control (... 2108-03-06 06:00:00 103.0 17.0 2.0 95.000000
4 196673 2108-03-06 08:06:00 Physician \nPhysician Resident Progress Note\n... 2108-03-06 08:00:00 103.0 20.0 2.0 95.700000
5 196673 2108-03-06 12:40:00 Nursing\nNursing Progress Note\nChief Complain... 2108-03-06 12:00:00 85.0 10.0 2.0 94.291667
6 196673 2108-03-06 13:01:00 Nursing\nNursing Progress Note\nPain control (... 2108-03-06 13:00:00 98.0 14.0 2.0 94.440000
7 196673 2108-03-06 17:09:00 Nursing\nNursing Transfer Note\nChief Complain... 2108-03-06 17:00:00 106.0 21.0 2.0 95.360000
8 196673 2108-03-06 17:12:00 Nursing\nNursing Transfer Note\nPain control (... 2108-03-06 17:00:00 106.0 21.0 2.0 95.360000
9 196673 2108-03-07 15:25:00 Radiology\nCHEST (PA & LAT)\n[**2108-3-7**] 3:... 2108-03-06 17:00:00 106.0 21.0 2.0 95.360000
10 196673 2108-03-07 18:34:00 Radiology\nCTA CHEST W&W/O C&RECONS, NON-CORON... 2108-03-06 17:00:00 106.0 21.0 2.0 95.360000
11 196673 2108-03-09 09:10:00 Radiology\nABDOMEN (SUPINE & ERECT)\n[**2108-3... 2108-03-06 17:00:00 106.0 21.0 2.0 95.360000
12 196673 2108-03-09 12:22:00 Radiology\nCT ABDOMEN W/CONTRAST\n[**2108-3-9*... 2108-03-06 17:00:00 106.0 21.0 2.0 95.360000
13 196673 2108-03-10 05:26:00 Radiology\nABDOMEN (SUPINE & ERECT)\n[**2108-3... 2108-03-06 17:00:00 106.0 21.0 2.0 95.360000
14 196673 2108-03-10 05:27:00 Radiology\nCHEST (PA & LAT)\n[**2108-3-10**] 5... 2108-03-06 17:00:00 106.0 21.0 2.0 95.360000
PS: This gives you the correct result for all rows. There's a logical flaw in your code: you look for the first time ce_charttime > ne_charttime and then take the previous row. If there's no such time, you'll never have the chance to take the previous row, hence the NaNs in your result table starting from row 8.
PPS: This includes ce_charttime in the final dataframe. You can replace it by a column of how old the information is and/or remove it:
final_df['info_age'] = final_df.ne_charttime - final_df.ce_charttime
final_df = final_df.drop(columns='ce_charttime')
UPDATE for EDIT-2: As I wrote at the very beginning, repeated in the comments, and as the docs clearly state: both ce_charttime and ne_charttime must be sorted (hadm_id need not be sorted). If this condition is not met, you'll have to (temporarily) sort your dataframes as required. See the following example:
import string
import numpy as np, pandas as pd
df_str = pd.DataFrame({'hadm_id': np.tile([111111, 222222], 10), 'ce_charttime': pd.date_range('2019-10-01 00:30', periods=20, freq='30T'), 'hr': np.random.randint(80, 120, 20)})
df_notes = pd.DataFrame({'hadm_id': np.tile([111111, 222222], 3), 'ne_charttime': pd.date_range('2019-10-01 00:45', periods=6, freq='40T'), 'note': [''.join(np.random.choice(list(string.ascii_letters), 10)) for _ in range(6)]}).sort_values('hadm_id')
final_df = pd.merge_asof(df_notes.sort_values('ne_charttime'), df_str, left_on='ne_charttime', right_on='ce_charttime', by='hadm_id').sort_values(['hadm_id', 'ne_charttime'])
print(df_str); print(df_notes); print(final_df)
Output:
hadm_id ce_charttime hr
0 111111 2019-10-01 00:30:00 118
1 222222 2019-10-01 01:00:00 93
2 111111 2019-10-01 01:30:00 92
3 222222 2019-10-01 02:00:00 86
4 111111 2019-10-01 02:30:00 88
5 222222 2019-10-01 03:00:00 86
6 111111 2019-10-01 03:30:00 106
7 222222 2019-10-01 04:00:00 91
8 111111 2019-10-01 04:30:00 109
9 222222 2019-10-01 05:00:00 95
10 111111 2019-10-01 05:30:00 113
11 222222 2019-10-01 06:00:00 92
12 111111 2019-10-01 06:30:00 104
13 222222 2019-10-01 07:00:00 83
14 111111 2019-10-01 07:30:00 114
15 222222 2019-10-01 08:00:00 98
16 111111 2019-10-01 08:30:00 110
17 222222 2019-10-01 09:00:00 89
18 111111 2019-10-01 09:30:00 98
19 222222 2019-10-01 10:00:00 109
hadm_id ne_charttime note
0 111111 2019-10-01 00:45:00 jOcRWVdPDF
2 111111 2019-10-01 02:05:00 mvScJNrwra
4 111111 2019-10-01 03:25:00 FBAFbJYflE
1 222222 2019-10-01 01:25:00 ilNuInOsYZ
3 222222 2019-10-01 02:45:00 ysyolaNmkV
5 222222 2019-10-01 04:05:00 wvowGGETaP
hadm_id ne_charttime note ce_charttime hr
0 111111 2019-10-01 00:45:00 jOcRWVdPDF 2019-10-01 00:30:00 118
2 111111 2019-10-01 02:05:00 mvScJNrwra 2019-10-01 01:30:00 92
4 111111 2019-10-01 03:25:00 FBAFbJYflE 2019-10-01 02:30:00 88
1 222222 2019-10-01 01:25:00 ilNuInOsYZ 2019-10-01 01:00:00 93
3 222222 2019-10-01 02:45:00 ysyolaNmkV 2019-10-01 02:00:00 86
5 222222 2019-10-01 04:05:00 wvowGGETaP 2019-10-01 04:00:00 91
You can do full merge and then filter with query:
df_notes.merge(df_str, on='hadm_id').query('ce_charttime <= ne_charttime')
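Note that the full merge keeps every earlier reading for each note. One way to then reduce it to just the most recent reading per note (a sketch of how the idea could be finished; it is not spelled out in the answer, and unlike merge_asof it drops notes that have no earlier reading):
merged = (df_notes.merge(df_str, on='hadm_id')
                  .query('ce_charttime <= ne_charttime'))
# keep the latest structured reading for each (hadm_id, ne_charttime) pair
final_df = (merged.sort_values('ce_charttime')
                  .drop_duplicates(['hadm_id', 'ne_charttime'], keep='last'))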

Pandas: Days since last event per id

I want to build a column for my dataframe, df['days_since_last'], that shows, for each event_id, the days since that player_id's previous match, and NaN if the row is the player's first match in the dataset.
Example of my data:
event_id player_id match_date
0 1470993 227485 2015-11-29
1 1492031 227485 2016-07-23
2 1489240 227485 2016-06-19
3 1495581 227485 2016-09-02
4 1490222 227485 2016-07-03
5 1469624 227485 2015-11-14
6 1493822 227485 2016-08-13
7 1428946 313444 2014-08-10
8 1483245 313444 2016-05-21
9 1472260 313444 2015-12-13
I tried the code in Find days since last event pandas dataframe but got nonsensical results.
It seems you need to sort first:
df['days_since_last_event'] = (df.sort_values(['player_id', 'match_date'])
                                 .groupby('player_id')['match_date'].diff()
                                 .dt.days)
print (df)
event_id player_id match_date days_since_last_event
0 1470993 227485 2015-11-29 15.0
1 1492031 227485 2016-07-23 20.0
2 1489240 227485 2016-06-19 203.0
3 1495581 227485 2016-09-02 20.0
4 1490222 227485 2016-07-03 14.0
5 1469624 227485 2015-11-14 NaN
6 1493822 227485 2016-08-13 21.0
7 1428946 313444 2014-08-10 NaN
8 1483245 313444 2016-05-21 160.0
9 1472260 313444 2015-12-13 490.0
Demo (note: this variant measures days from each match to the player's most recent match, not to the previous one):
In [174]: df['days_since_last'] = (df.groupby('player_id')['match_date']
.transform(lambda x: (x.max()-x).dt.days))
In [175]: df
Out[175]:
event_id player_id match_date days_since_last
0 1470993 227485 2015-11-29 278
1 1492031 227485 2016-07-23 41
2 1489240 227485 2016-06-19 75
3 1495581 227485 2016-09-02 0
4 1490222 227485 2016-07-03 61
5 1469624 227485 2015-11-14 293
6 1493822 227485 2016-08-13 20
7 1428946 313444 2014-08-10 650
8 1483245 313444 2016-05-21 0
9 1472260 313444 2015-12-13 160

How to select most recent values?

I have a logging table collecting values from many probes:
CREATE TABLE [Log]
(
[LogID] int IDENTITY (1, 1) NOT NULL,
[Minute] datetime NOT NULL,
[ProbeID] int NOT NULL DEFAULT 0,
[Value] FLOAT(24) NOT NULL DEFAULT 0.0,
CONSTRAINT Log_PK PRIMARY KEY([LogID])
)
GO
CREATE INDEX [Minute_ProbeID_Value] ON [Log]([Minute], [ProbeID], [Value])
GO
Typically, each probe generates a value every minute or so. Some example output:
LogID Minute ProbeID Value
====== ================ ======= =====
873875 2014-07-27 09:36 1972 24.4
873876 2014-07-27 09:36 2001 29.7
873877 2014-07-27 09:36 3781 19.8
873878 2014-07-27 09:36 1963 25.6
873879 2014-07-27 09:36 2002 22.9
873880 2014-07-27 09:36 1959 -30.1
873881 2014-07-27 09:36 2005 20.7
873882 2014-07-27 09:36 1234 23.8
873883 2014-07-27 09:36 1970 19.9
873884 2014-07-27 09:36 1991 22.4
873885 2014-07-27 09:37 1958 1.7
873886 2014-07-27 09:37 1962 21.3
873887 2014-07-27 09:37 1020 23.1
873888 2014-07-27 09:38 1972 24.1
873889 2014-07-27 09:38 3781 20.1
873890 2014-07-27 09:38 2001 30
873891 2014-07-27 09:38 2002 23.4
873892 2014-07-27 09:38 1963 26
873893 2014-07-27 09:38 2005 20.8
873894 2014-07-27 09:38 1234 23.7
873895 2014-07-27 09:38 1970 19.8
873896 2014-07-27 09:38 1991 22.7
873897 2014-07-27 09:39 1958 1.4
873898 2014-07-27 09:39 1962 22.1
873899 2014-07-27 09:39 1020 23.1
What is the most efficient way to get just the latest reading for each Probe?
e.g. of desired output (note: the "Value" is not e.g. a Max() or an Avg()):
LogID Minute ProbeID Value
====== ================= ======= =====
873899 27-Jul-2014 09:39 1020 23.1
873894 27-Jul-2014 09:38 1234 23.7
873897 27-Jul-2014 09:39 1958 1.4
873880 27-Jul-2014 09:36 1959 -30.1
873898 27-Jul-2014 09:39 1962 22.1
873892 27-Jul-2014 09:38 1963 26
873895 27-Jul-2014 09:38 1970 19.8
873888 27-Jul-2014 09:38 1972 24.1
873896 27-Jul-2014 09:38 1991 22.7
873890 27-Jul-2014 09:38 2001 30
873891 27-Jul-2014 09:38 2002 23.4
873893 27-Jul-2014 09:38 2005 20.8
873889 27-Jul-2014 09:38 3781 20.1
This is another approach
select *
from log l
where minute =
(select max(x.minute) from log x where x.probeid = l.probeid)
You can compare the execution plan w/ a fiddle - http://sqlfiddle.com/#!3/1d3ff/3/0
Try this:
SELECT T1.*
FROM Log T1
INNER JOIN (SELECT Max(Minute) Minute,
ProbeID
FROM Log
GROUP BY ProbeID)T2
ON T1.ProbeID = T2.ProbeID
AND T1.Minute = T2.Minute
You can play around with it on SQL Fiddle
Your question is: "What is the most efficient way to get just the latest reading for each Probe?"
To really answer this question, you need to test different solutions. I would generally go with the row_number() method suggested by #jyparask. However, the following might have better performance:
select l.*
from log l
where not exists (select 1
from log l2
where l2.probeid = l.probeid and
l2.minute > l.minute
);
For performance, you want an index on log(probeid, minute).
Although not exactly your problem, here is an example of where not exists performs better than other methods on SQL Server.
;WITH MyCTE AS
(
SELECT LogID,
Minute,
ProbeID,
Value,
ROW_NUMBER() OVER(PARTITION BY ProbeID ORDER BY Minute DESC) AS rn
FROM LOG
)
SELECT LogID,
Minute,
ProbeID,
Value
FROM MyCTE
WHERE rn = 1