I am looking for an efficient way to summarize rows (in groupby-style) that fall in a certain time period, using Pandas in Python. Specifically:
The time period is given in dataframe A: there is a column for "start_timestamp" and a column for "end_timestamp", specifying the start and end time of the time period that is to be summarized. Hence, every row represents one time period that is meant to be summarized.
The rows to be summarized are given in dataframe B: there is a column for "timestamp" and a column "metric" with the values to be aggregated (with mean, max, min etc.). In reality, there might be more than just 1 "metric" column.
For every row's time period from dataframe A, I want to summarize the values of the "metric" column in dataframe B that fall in the given time period. Hence, the number of rows of the output dataframe will be exactly the same as the number of rows of dataframe A.
Any hints would be much appreciated.
Additional Requirements
The number of rows in dataframe A and dataframe B may be large (several thousand rows).
There may be many metrics to summarize in dataframe B (~100).
I want to avoid solving this problem with a for loop (as in the reproducible example below).
Reproducible Example
Input Dataframe A
import pandas as pd

# Input dataframe A
df_a = pd.DataFrame({
    "start_timestamp": ["2022-08-09 00:30", "2022-08-09 01:00", "2022-08-09 01:15"],
    "end_timestamp": ["2022-08-09 03:30", "2022-08-09 04:00", "2022-08-09 08:15"]
})
df_a["start_timestamp"] = pd.to_datetime(df_a["start_timestamp"])
df_a["end_timestamp"] = pd.to_datetime(df_a["end_timestamp"])
print(df_a)
      start_timestamp       end_timestamp
0 2022-08-09 00:30:00 2022-08-09 03:30:00
1 2022-08-09 01:00:00 2022-08-09 04:00:00
2 2022-08-09 01:15:00 2022-08-09 08:15:00
Input Dataframe B
# Input dataframe B
df_b = pd.DataFrame({
    "timestamp": [
        "2022-08-09 01:00",
        "2022-08-09 02:00",
        "2022-08-09 03:00",
        "2022-08-09 04:00",
        "2022-08-09 05:00",
        "2022-08-09 06:00",
        "2022-08-09 07:00",
        "2022-08-09 08:00",
    ],
    "metric": [1, 2, 3, 4, 5, 6, 7, 8],
})
df_b["timestamp"] = pd.to_datetime(df_b["timestamp"])
print(df_b)
            timestamp  metric
0 2022-08-09 01:00:00       1
1 2022-08-09 02:00:00       2
2 2022-08-09 03:00:00       3
3 2022-08-09 04:00:00       4
4 2022-08-09 05:00:00       5
5 2022-08-09 06:00:00       6
6 2022-08-09 07:00:00       7
7 2022-08-09 08:00:00       8
Expected Output Dataframe
# Expected output dataframe
df_target = df_a.copy()
for i, row in df_target.iterrows():
    condition = (df_b["timestamp"] >= row["start_timestamp"]) & (df_b["timestamp"] <= row["end_timestamp"])
    df_b_sub = df_b.loc[condition, :]
    df_target.loc[i, "metric_mean"] = df_b_sub["metric"].mean()
    df_target.loc[i, "metric_max"] = df_b_sub["metric"].max()
    df_target.loc[i, "metric_min"] = df_b_sub["metric"].min()
print(df_target)
      start_timestamp       end_timestamp  metric_mean  metric_max  metric_min
0 2022-08-09 00:30:00 2022-08-09 03:30:00          2.0         3.0         1.0
1 2022-08-09 01:00:00 2022-08-09 04:00:00          2.5         4.0         1.0
2 2022-08-09 01:15:00 2022-08-09 08:15:00          5.0         8.0         2.0
You can use pd.IntervalIndex and contains to create a dataframe with selected metric values and then compute the mean, max, min:
ai = pd.IntervalIndex.from_arrays(
    df_a["start_timestamp"], df_a["end_timestamp"], closed="both"
)
t = df_b.apply(
    lambda x: pd.Series(ai.contains(x["timestamp"]) * x["metric"]), axis=1
)
df_a[["metric_mean", "metric_max", "metric_min"]] = (
    t[t.ne(0)].agg(["mean", "max", "min"]).T.values
)
print(df_a)
start_timestamp end_timestamp metric_mean metric_max metric_min
0 2022-08-09 00:30:00 2022-08-09 03:30:00 2.0 3.0 1.0
1 2022-08-09 01:00:00 2022-08-09 04:00:00 2.5 4.0 1.0
2 2022-08-09 01:15:00 2022-08-09 08:15:00 5.0 8.0 2.0
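One caveat: masking with t.ne(0) treats a genuine metric value of 0 as "outside the interval". If zeros can occur in your data, a minimal NumPy broadcasting sketch along these lines (my variable names, not from the question) keeps them by masking with NaN instead:

import numpy as np

# Boolean matrix: mask[i, j] is True when df_b row j falls inside df_a interval i.
ts = df_b["timestamp"].to_numpy()
mask = (ts >= df_a["start_timestamp"].to_numpy()[:, None]) & (
    ts <= df_a["end_timestamp"].to_numpy()[:, None]
)

# Replace out-of-interval values with NaN so legitimate zeros survive, then
# aggregate row-wise (one result per interval in df_a). Assumes every interval
# contains at least one observation, otherwise NumPy warns about all-NaN slices.
vals = np.where(mask, df_b["metric"].to_numpy(), np.nan)
df_a["metric_mean"] = np.nanmean(vals, axis=1)
df_a["metric_max"] = np.nanmax(vals, axis=1)
df_a["metric_min"] = np.nanmin(vals, axis=1)

The mask only has to be built once, so with ~100 metric columns you can reuse it and loop over (or stack) the metric columns.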
Check the below code using SQLite3:
import sqlite3

conn = sqlite3.connect(':memory:')
df_a.to_sql('df_a', con=conn, index=False)
df_b.to_sql('df_b', con=conn, index=False)

pd.read_sql("""SELECT df_a.start_timestamp, df_a.end_timestamp
                    , AVG(df_b.metric) AS metric_mean
                    , MAX(df_b.metric) AS metric_max
                    , MIN(df_b.metric) AS metric_min
               FROM df_a INNER JOIN df_b
                 ON df_b.timestamp BETWEEN df_a.start_timestamp AND df_a.end_timestamp
              GROUP BY df_a.start_timestamp, df_a.end_timestamp""", con=conn)
Output: the same per-period mean, max, and min as the expected dataframe shown in the question.
Overview
For each row of a dataframe I want to calculate the x day high and low.
An x day high is higher than previous x days.
An x day low is lower than previous x days.
The for loop is explained in further detail in this post
Update:
The answer by @mozway below completes in around 20 seconds on a dataset containing 18k rows. Can this be improved with NumPy broadcasting, etc.?
Example
2020-03-20 has an x_day_low value of 1 as it is lower than the previous day.
2020-03-27 has an x_day_high value of 8 as it is higher than the previous 8 days.
See the desired output and test code below, where the result is calculated with a for loop in the findHighLow function. How would I vectorize findHighLow, given that the actual dataframe is somewhat larger?
Test data
import numpy as np
import pandas as pd

def genMockDataFrame(days, startPrice, colName, startDate, seed=None):
    periods = days * 24
    np.random.seed(seed)
    steps = np.random.normal(loc=0, scale=0.0018, size=periods)
    steps[0] = 0
    P = startPrice + np.cumsum(steps)
    P = [round(i, 4) for i in P]
    fxDF = pd.DataFrame({
        'ticker': np.repeat([colName], periods),
        'date': np.tile(pd.date_range(startDate, periods=periods, freq='H'), 1),
        'price': P})
    fxDF.index = pd.to_datetime(fxDF.date)
    fxDF = fxDF.price.resample('D').ohlc()
    fxDF.columns = [i.title() for i in fxDF.columns]
    return fxDF
#rows set to 15 for minimal example but actual dataframe contains around 18000 rows.
number_of_rows = 15
df = genMockDataFrame(number_of_rows,1.1904,'tttmmm','19/3/2020',seed=157)
def findHighLow(df):
    df['x_day_high'] = 0
    df['x_day_low'] = 0
    for n in reversed(range(len(df['High']))):
        for i in reversed(range(n)):
            if df['High'][n] > df['High'][i]:
                df['x_day_high'][n] = n - i
            else:
                break
    for n in reversed(range(len(df['Low']))):
        for i in reversed(range(n)):
            if df['Low'][n] < df['Low'][i]:
                df['x_day_low'][n] = n - i
            else:
                break
    return df

df = findHighLow(df)
Desired output should match this:
df[["High","Low","x_day_high","x_day_low"]]
High Low x_day_high x_day_low
date
2020-03-19 1.1937 1.1832 0 0
2020-03-20 1.1879 1.1769 0 1
2020-03-21 1.1767 1.1662 0 2
2020-03-22 1.1721 1.1611 0 3
2020-03-23 1.1819 1.1690 2 0
2020-03-24 1.1928 1.1807 4 0
2020-03-25 1.1939 1.1864 6 0
2020-03-26 1.2141 1.1964 7 0
2020-03-27 1.2144 1.2039 8 0
2020-03-28 1.2099 1.2018 0 1
2020-03-29 1.2033 1.1853 0 4
2020-03-30 1.1887 1.1806 0 6
2020-03-31 1.1972 1.1873 1 0
2020-04-01 1.1997 1.1914 2 0
2020-04-02 1.1924 1.1781 0 9
Here are two solutions. Both produce the desired output, as posted in the question.
The first solution uses Numba and completes in 0.5 seconds on my machine for 20k rows. If you can use Numba, this is the way to go. The second solution uses only Pandas/Numpy and completes in 1.5 seconds for 20k rows.
Numba
import numba

@numba.njit
def count_smaller(arr):
    # Scan backwards from the second-to-last element and count how many
    # consecutive values are <= the last value.
    current = arr[-1]
    count = 0
    for i in range(arr.shape[0] - 2, -1, -1):
        if arr[i] > current:
            break
        count += 1
    return count

@numba.njit
def count_greater(arr):
    # Scan backwards from the second-to-last element and count how many
    # consecutive values are >= the last value.
    current = arr[-1]
    count = 0
    for i in range(arr.shape[0] - 2, -1, -1):
        if arr[i] < current:
            break
        count += 1
    return count

df["x_day_high"] = df.High.expanding().apply(count_smaller, engine='numba', raw=True)
df["x_day_low"] = df.Low.expanding().apply(count_greater, engine='numba', raw=True)
Pandas/Numpy
def count_consecutive_true(bool_arr):
    # Number of consecutive True values counted from the end of the array.
    return bool_arr[::-1].cumprod().sum()

def count_smaller(arr):
    return count_consecutive_true(arr <= arr[-1]) - 1

def count_greater(arr):
    return count_consecutive_true(arr >= arr[-1]) - 1

df["x_day_high"] = df.High.expanding().apply(count_smaller, raw=True)
df["x_day_low"] = df.Low.expanding().apply(count_greater, raw=True)
This last solution is similar to mozway's. However, it runs faster because it doesn't need to perform a join and uses NumPy as much as possible. It also looks arbitrarily far back.
You can use rolling to get the last N days, a comparison + cumprod on the reversed boolean array to keep only the last consecutive valid values, and sum to count them. Apply on each column using agg and join the output after adding a prefix.
# number of days
N = 8

df.join(df.rolling(f'{N+1}d', min_periods=1)
          .agg({'High': lambda s: s.le(s.iloc[-1])[::-1].cumprod().sum() - 1,
                'Low': lambda s: s.ge(s.iloc[-1])[::-1].cumprod().sum() - 1,
                })
          .add_prefix(f'{N}_days_')
        )
Output:
Open High Low Close 8_days_High 8_days_Low
date
2020-03-19 1.1904 1.1937 1.1832 1.1832 0.0 0.0
2020-03-20 1.1843 1.1879 1.1769 1.1772 0.0 1.0
2020-03-21 1.1755 1.1767 1.1662 1.1672 0.0 2.0
2020-03-22 1.1686 1.1721 1.1611 1.1721 0.0 3.0
2020-03-23 1.1732 1.1819 1.1690 1.1819 2.0 0.0
2020-03-24 1.1836 1.1928 1.1807 1.1922 4.0 0.0
2020-03-25 1.1939 1.1939 1.1864 1.1936 6.0 0.0
2020-03-26 1.1967 1.2141 1.1964 1.2114 7.0 0.0
2020-03-27 1.2118 1.2144 1.2039 1.2089 7.0 0.0
2020-03-28 1.2080 1.2099 1.2018 1.2041 0.0 1.0
2020-03-29 1.2033 1.2033 1.1853 1.1880 0.0 4.0
2020-03-30 1.1876 1.1887 1.1806 1.1879 0.0 6.0
2020-03-31 1.1921 1.1972 1.1873 1.1939 1.0 0.0
2020-04-01 1.1932 1.1997 1.1914 1.1914 2.0 0.0
2020-04-02 1.1902 1.1924 1.1781 1.1862 0.0 7.0
How can I calculate the slope of the closing price over a rolling 10-row window and add it as the last column of the table (after Close, similar to a moving-average column)?
Based on the post "Python Dataframe Find n rows rolling slope without for loop".
Fake data:
import pandas as pd
from scipy import stats

df = pd.DataFrame([[38611.38, '2022-03-08 22:23:00.000000'],
                   [38604.02, '2022-03-08 22:24:00.000000'],
                   [38611.76, '2022-03-08 22:25:00.000000'],
                   [38609.75, '2022-03-08 22:26:00.000000'],
                   [38601.35, '2022-03-08 22:27:00.000000']],
                  columns=['Close', 'Open time'])
df
      Close                   Open time
0  38611.38  2022-03-08 22:23:00.000000
1  38604.02  2022-03-08 22:24:00.000000
2  38611.76  2022-03-08 22:25:00.000000
3  38609.75  2022-03-08 22:26:00.000000
4  38601.35  2022-03-08 22:27:00.000000
df.reset_index(drop=True)
window = 10
df['Close'].rolling(window).apply(lambda x: stats.linregress(x, x.index+1)[0], raw=False)
Result:
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
...
995 0.017796
996 0.038905
997 0.052710
998 0.047330
999 0.043615
Name: Close, Length: 1000, dtype: float64
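A closed-form alternative avoids calling stats.linregress inside every window. The sketch below is a minimal example, assuming the goal is the OLS slope of Close against the bar position (0..window-1) within each rolling window; the helper name rolling_close_slope is mine, not from the question:

import numpy as np
import pandas as pd

def rolling_close_slope(close: pd.Series, window: int = 10) -> pd.Series:
    # OLS slope of y against t = 0..window-1 for each window:
    # slope = sum((t - t_mean) * y) / sum((t - t_mean)^2),
    # since the centered time axis sums to zero.
    t = np.arange(window)
    w = t - t.mean()
    denom = (w ** 2).sum()
    y = close.to_numpy(dtype=float)
    out = np.full(len(y), np.nan)
    if len(y) >= window:
        # Rolling dot product of each window with w via a "valid" convolution.
        out[window - 1:] = np.convolve(y, w[::-1], mode="valid") / denom
    return pd.Series(out, index=close.index, name="slope")

df["slope"] = rolling_close_slope(df["Close"], window=10)

Note that the lambda in the question passes (x, x.index+1) to stats.linregress, i.e. the prices as the first argument, which regresses the index on the price; if that inverse slope is what you actually want, swap the roles of t and y accordingly.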
I have a dataframe that has, let's say 5 entries.
   moment  stress  strain
0    0.12      13    0.11
1    0.23      14    0.12
2    0.56      15    0.56
I would like to get a 1D float list in the order of [moment, stress, strain], based on linear interpolation at strain = 0.45.
I have read a couple of threads about the interpolate() method from pandas, but it is used when you have NaN entries and want to fill in the numbers.
How do I accomplish a similar task with my case?
Thank you
One method is to add a new row with NaN values to your dataframe and sort it:
import numpy as np

# NOTE: DataFrame.append was removed in pandas 2.0; there, use
# pd.concat([df, pd.DataFrame([{...}])], ignore_index=True) instead.
df = df.append(
    {"moment": np.nan, "stress": np.nan, "strain": 0.45}, ignore_index=True
)
df = df.sort_values(by="strain").set_index("strain")
df = df.interpolate(method="index")
print(df)
Prints:
moment stress
strain
0.11 0.1200 13.00
0.12 0.2300 14.00
0.45 0.4775 14.75
0.56 0.5600 15.00
To get the values back:
df = df.reset_index()
print(
    df.loc[df.strain == 0.45, ["moment", "stress", "strain"]]
    .to_numpy()
    .tolist()[0]
)
Prints:
[0.47750000000000004, 14.75, 0.45]
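As a side note, a minimal sketch using np.interp (assuming the original dataframe, sorted by ascending strain; the name target is mine) gives the same list without appending a placeholder row:

import numpy as np

target = 0.45
# np.interp requires the x-coordinates ("strain") to be increasing.
result = [float(np.interp(target, df["strain"], df[col])) for col in ["moment", "stress"]]
result.append(target)
print(result)  # [0.4775, 14.75, 0.45] (up to float rounding)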
I'm trying to sort a data frame based on groups meeting conditions.
I'm getting a syntax error for the way I'm sorting the groups, and I'm losing the initial order of the data frame before attempting the above.
This is the order of sorting that I'm trying to achieve:
1) Sort on the First and Test columns.
2) For Test==1 groups, sort on Secondary, then by the Final column.
   For Test==0 groups, sort on the Final column only.
import pandas as pd
df=pd.DataFrame({"First":[100,100,100,100,100,100,200,200,200,200,200],"Test":[1,1,1,0,0,0,0,1,1,1,0],"Secondary":[.1,.1,.1,.2,.2,.3,.3,.3,.3,.4,.4],"Final":[1.1,2.2,3.3,4.4,5.5,6.6,7.7,8.8,9.9,10.10,11.11]})
def sorter(x):
    if x["Test"] == 1:
        x.sort_values(['Secondary', 'Final'], inplace=True)
    else:
        x = x.sort_values('Final', inplace=True)
df=df.sort_values(["First","Test"],ascending=[False, False]).reset_index(drop=True)
df.groupby(['First','Test']).apply(lambda x: sorter(x))
df
Expected result:
First Test Secondary Final
200 1 0.4 10.1
200 1 0.3* 9.9*
200 1 0.3* 8.8*
200 0 0.4 11.11*
200 0 0.3 7.7*
100 1 0.5 2.2
100 1 0.1* 3.3*
100 1 0.1* 1.1*
100 0 0.3 6.6*
100 0 0.2 5.5*
100 0 0.2 4.4*
You can try sorting in descending order. With respect to the sequence you gave, the order of sorting will change; will it work for you?
df=pd.DataFrame({"First":[100,100,100,100,100,100,200,200,200,200,200],"Test":[1,1,1,0,0,0,0,1,1,1,0],"Secondary":[.1,.5,.1,.9,.4,.1,.3,.3,.3,.4,.4],"Final":[1.1,2.2,3.3,4.4,5.5,6.6,7.7,8.8,9.9,10.10,11.11]})
df = df.groupby(['First','Test']).apply(lambda x: x.sort_values(['First','Test','Secondary','Final'],ascending=False) if x.iloc[0]['Test']==1 else x.sort_values(['First','Test','Final'],ascending=False)).reset_index(drop=True)
df.sort_values(['First','Test'],ascending=[True,False])
Out:
Final First Secondary Test
3 2.20 100 0.5 1
4 3.30 100 0.1 1
5 1.10 100 0.1 1
0 6.60 100 0.1 0
1 5.50 100 0.4 0
2 4.40 100 0.9 0
8 10.10 200 0.4 1
9 9.90 200 0.3 1
10 8.80 200 0.3 1
6 11.11 200 0.4 0
7 7.70 200 0.3 0
The trick was to sort subsets separately and replace the values in the original df.
This came up in other solutions to pandas sorting problems.
import pandas as pd
df=pd.DataFrame({"First":[100,100,100,100,100,100,200,200,200,200,200],"Test":[1,1,1,0,0,0,0,1,1,1,0],"Secondary":[.1,.5,.1,.9,.4,.1,.3,.3,.3,.4,.4],"Final":[1.1,2.2,3.3,4.4,5.5,6.6,7.7,8.8,9.9,10.10,11.11]})
# Full descending sort first, then re-sort only the Test == 0 subset by Final
# and write those rows back into the same index positions.
df.sort_values(['First', 'Test', 'Secondary', 'Final'], ascending=False, inplace=True)
index_subset = df[df["Test"] == 0].index
sorted_subset = df[df["Test"] == 0].sort_values(['First', 'Final'], ascending=False)
df.loc[index_subset, :] = sorted_subset.values
print(df)
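An alternative sketch that avoids the write-back entirely: build a helper sort key that neutralizes Secondary for the Test==0 rows, so a single sort_values call yields the desired order. The column name _sec_key is mine, and this assumes Secondary is non-negative (otherwise pick a sentinel below its minimum):

# Zero out Secondary for Test == 0 rows so those groups are ordered by Final only.
key = df["Secondary"].where(df["Test"] == 1, other=0)
out = (df.assign(_sec_key=key)
         .sort_values(["First", "Test", "_sec_key", "Final"], ascending=False)
         .drop(columns="_sec_key"))
print(out)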