Merging 2 or more data frames and transposing the result - pandas

I have several DataFrames derived from a pandas binning (resampling) process using the code below:
df2 = df.resample(rule=timedelta(milliseconds=250))['diffA'].mean().dropna()
df3 = df.resample(rule=timedelta(milliseconds=250))['diffB'].mean().dropna()
.. etc
Every DataFrame has a 'time' column in datetime format (example: 2019-11-22 13:18:00.000) and a second column containing a number (e.g. 0.06). Different DataFrames have different 'time' bins. I am trying to concatenate all DataFrames into one, where certain elements of the resulting DataFrame may contain NaN.
The datetime format of the DataFrames gives an error when using:
method 1) df4 = pd.merge(df2, df3, left_on='time', right_on='time')
method 2) pd.pivot_table(df2, values='diffA', index=['time'], columns='time').reset_index()
Once the DataFrames have been combined, I also want to transpose the result, where:
Rows: are 'diffA', 'diffB', etc.
Columns: are the time bins accordingly.
I have tried the transpose() method on individual DataFrames, just to try, but I get an error because my time index is in datetime format.
Once that is in place, I am looking for a method to extract rows from the resulting transposed DataFrame as individual data series.
Please advise how I can achieve the above; I would appreciate any guidance or feedback. Thank you for your help.
DataFrames (two, for example):
time DiffA
2019-11-25 08:18:01.250 0.06
2019-11-25 08:18:01.500 0.05
2019-11-25 08:18:01.750 0.04
2019-11-25 08:18:02.000 0
2019-11-25 08:18:02.250 0.22
2019-11-25 08:18:02.500 0.06
time DiffB
2019-11-26 08:18:01.250 0.2
2019-11-27 08:18:01.500 0.05
2019-11-25 08:18:01.000 0.6
2019-11-25 08:18:02.000 0.01
2019-11-25 08:18:02.250 0.8
2019-11-25 08:18:02.500 0.5
The resulting merged DataFrame should be as follows (text only):
time (first row):
2019-11-25 08:18:01.000,
2019-11-25 08:18:01.250,
2019-11-25 08:18:01.500,
2019-11-25 08:18:01.750,
2019-11-25 08:18:02.000,
2019-11-25 08:18:02.250,
2019-11-25 08:18:02.500,
2019-11-26 08:18:01.250,
2019-11-27 08:18:01.500
(second row)
diffA nan 0.06 0.05 0.04 0 0.22 0.06 nan nan
(third row)
diffB 0.6 nan nan nan 0.01 0.8 0.5 0.2 0.05

Solution
The core logic: use an outer join on the 'time' column to merge each of the resampled DataFrames together. Finally, setting the index to the 'time' column and transposing completes the solution.
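A minimal sketch applied directly to the question's frames, assuming df2 and df3 are the resampled results shown above (Series with a DatetimeIndex named 'time', holding diffA and diffB respectively):
import pandas as pd

df2 = df2.reset_index()            # back to two columns: 'time' and 'diffA'
df3 = df3.reset_index()            # back to two columns: 'time' and 'diffB'

# Outer join keeps every time bin from either frame; missing values become NaN.
merged = pd.merge(df2, df3, on='time', how='outer').sort_values('time')

# Rows become diffA / diffB, columns become the time bins.
print(merged.set_index('time').T)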
I will use the dummy data I created below to create a reproducible solution.
Note: I have used df as the final dataframe and df0 as the original dataframe. My df0 is your df.
column_names = list('ABCDE')   # assumed labels for the sampled frames (not defined in the original answer)
df = pd.DataFrame()
for i, column_name in zip(range(5), column_names):
    if i == 0:
        df = df0.sample(n=10, random_state=i).rename(columns={'data': f'df{column_name}'})
    else:
        df_other = df0.sample(n=10, random_state=i).rename(columns={'data': f'df{column_name}'})
        df = pd.merge(df, df_other, on='time', how='outer')
print(df.set_index('time').T)
Dummy Data
import numpy as np
import pandas as pd

# dummy data:
df0 = pd.DataFrame()
df0['time'] = pd.date_range(start='2020-02-01', periods=15, freq='D')
df0['data'] = np.random.randint(0, high=9, size=15)
print(df0)
Output:
time data
0 2020-02-01 6
1 2020-02-02 1
2 2020-02-03 7
3 2020-02-04 0
4 2020-02-05 8
5 2020-02-06 8
6 2020-02-07 1
7 2020-02-08 6
8 2020-02-09 2
9 2020-02-10 6
10 2020-02-11 8
11 2020-02-12 3
12 2020-02-13 0
13 2020-02-14 1
14 2020-02-15 0
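For the last part of the question (pulling rows out of the transposed frame as individual series), plain label-based indexing with .loc is enough. A minimal sketch, assuming the transposed dummy frame built above with the assumed column labels dfA, dfB, ...:
df_t = df.set_index('time').T      # rows: dfA, dfB, ...; columns: time bins
row_a = df_t.loc['dfA']            # one row extracted as a pandas Series
print(type(row_a))                 # <class 'pandas.core.series.Series'>
print(row_a.dropna())              # drop the NaN bins if they are unwanted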

Related

Summarize rows from a Pandas dataframe B that fall in certain time periods from another dataframe A

I am looking for an efficient way to summarize rows (in groupby-style) that fall in a certain time period, using Pandas in Python. Specifically:
The time period is given in dataframe A: there is a column for "start_timestamp" and a column for "end_timestamp", specifying the start and end time of the time period that is to be summarized. Hence, every row represents one time period that is meant to be summarized.
The rows to be summarized are given in dataframe B: there is a column for "timestamp" and a column "metric" with the values to be aggregated (with mean, max, min etc.). In reality, there might be more than just 1 "metric" column.
For every row's time period from dataframe A, I want to summarize the values of the "metric" column in dataframe B that fall in the given time period. Hence, the number of rows of the output dataframe will be exactly the same as the number of rows of dataframe A.
Any hints would be much appreciated.
Additional Requirements
The number of rows in dataframe A and dataframe B may be large (several thousand rows).
There may be many metrics to summarize in dataframe B (~100).
I want to avoid solving this problem with a for loop (as in the reproducible example below).
Reproducible Example
Input Dataframe A
import pandas as pd

# Input dataframe A
df_a = pd.DataFrame({
    "start_timestamp": ["2022-08-09 00:30", "2022-08-09 01:00", "2022-08-09 01:15"],
    "end_timestamp": ["2022-08-09 03:30", "2022-08-09 04:00", "2022-08-09 08:15"]
})
df_a.loc[:, "start_timestamp"] = pd.to_datetime(df_a["start_timestamp"])
df_a.loc[:, "end_timestamp"] = pd.to_datetime(df_a["end_timestamp"])
print(df_a)
      start_timestamp       end_timestamp
0 2022-08-09 00:30:00 2022-08-09 03:30:00
1 2022-08-09 01:00:00 2022-08-09 04:00:00
2 2022-08-09 01:15:00 2022-08-09 08:15:00
Input Dataframe B
# Input dataframe B
df_b = pd.DataFrame({
    "timestamp": [
        "2022-08-09 01:00",
        "2022-08-09 02:00",
        "2022-08-09 03:00",
        "2022-08-09 04:00",
        "2022-08-09 05:00",
        "2022-08-09 06:00",
        "2022-08-09 07:00",
        "2022-08-09 08:00",
    ],
    "metric": [1, 2, 3, 4, 5, 6, 7, 8],
})
df_b.loc[:, "timestamp"] = pd.to_datetime(df_b["timestamp"])
print(df_b)
            timestamp  metric
0 2022-08-09 01:00:00       1
1 2022-08-09 02:00:00       2
2 2022-08-09 03:00:00       3
3 2022-08-09 04:00:00       4
4 2022-08-09 05:00:00       5
5 2022-08-09 06:00:00       6
6 2022-08-09 07:00:00       7
7 2022-08-09 08:00:00       8
Expected Output Dataframe
# Expected output dataframe
df_target = df_a.copy()
for i, row in df_target.iterrows():
    condition = (df_b["timestamp"] >= row["start_timestamp"]) & (df_b["timestamp"] <= row["end_timestamp"])
    df_b_sub = df_b.loc[condition, :]
    df_target.loc[i, "metric_mean"] = df_b_sub["metric"].mean()
    df_target.loc[i, "metric_max"] = df_b_sub["metric"].max()
    df_target.loc[i, "metric_min"] = df_b_sub["metric"].min()
print(df_target)
      start_timestamp       end_timestamp  metric_mean  metric_max  metric_min
0 2022-08-09 00:30:00 2022-08-09 03:30:00          2.0         3.0         1.0
1 2022-08-09 01:00:00 2022-08-09 04:00:00          2.5         4.0         1.0
2 2022-08-09 01:15:00 2022-08-09 08:15:00          5.0         8.0         2.0
You can use pd.IntervalIndex and contains to create a dataframe with selected metric values and then compute the mean, max, min:
ai = pd.IntervalIndex.from_arrays(
    df_a["start_timestamp"], df_a["end_timestamp"], closed="both"
)
t = df_b.apply(
    lambda x: pd.Series((ai.contains(x["timestamp"])) * x["metric"]), axis=1
)
df_a[["metric_mean", "metric_max", "metric_min"]] = t[t.ne(0)].agg(
    ["mean", "max", "min"]
).T.values
print(df_a)
Output:
start_timestamp end_timestamp metric_mean metric_max metric_min
0 2022-08-09 00:30:00 2022-08-09 03:30:00 2.0 3.0 1.0
1 2022-08-09 01:00:00 2022-08-09 04:00:00 2.5 4.0 1.0
2 2022-08-09 01:15:00 2022-08-09 08:15:00 5.0 8.0 2.0
Check the code below using sqlite3:
import sqlite3
conn = sqlite3.connect(':memory:')
df_a.to_sql('df_a',con=conn, index=False)
df_b.to_sql('df_b',con=conn, index=False)
pd.read_sql("""SELECT df_a.start_timestamp, df_a.end_timestamp
, AVG(df_b.metric) as metric_mean
, MAX(df_b.metric) as metric_max
, MIN(df_b.metric) as metric_min
FROM
df_a INNER JOIN df_b
ON df_b.timestamp BETWEEN df_a.start_timestamp AND df_a.end_timestamp
GROUP BY df_a.start_timestamp, df_a.end_timestamp""", con=conn)
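For reference, a rough pandas-only equivalent of the SQL join above is a cross merge followed by a filter and a groupby. A sketch, assuming the df_a and df_b frames from the question (note that a cross merge can get memory-hungry for large inputs):
import pandas as pd

# Pair every period in df_a with every row in df_b, keep only the in-range pairs,
# then aggregate per period.
merged = df_a.merge(df_b, how="cross")
in_range = merged["timestamp"].between(merged["start_timestamp"], merged["end_timestamp"])
result = (merged[in_range]
          .groupby(["start_timestamp", "end_timestamp"], as_index=False)["metric"]
          .agg(metric_mean="mean", metric_max="max", metric_min="min"))
print(result)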

Vectorize for loop and return x day high and low

Overview
For each row of a dataframe I want to calculate the x day high and low.
An x day high is higher than previous x days.
An x day low is lower than previous x days.
The for loop is explained in further detail in this post
Update:
The answer by @mozway below completes in around 20 seconds with a dataset containing 18k rows. Can this be improved with numpy, broadcasting, etc.?
Example
2020-03-20 has an x_day_low value of 1 as it is lower than the previous day.
2020-03-27 has an x_day_high value of 8 as it is higher than the previous 8 days.
See the desired output and test code below, where the values are calculated with a for loop in the findHighLow function. How would I vectorize findHighLow, as the actual dataframe is somewhat larger?
Test data
import numpy as np
import pandas as pd

def genMockDataFrame(days, startPrice, colName, startDate, seed=None):
    periods = days*24
    np.random.seed(seed)
    steps = np.random.normal(loc=0, scale=0.0018, size=periods)
    steps[0] = 0
    P = startPrice + np.cumsum(steps)
    P = [round(i, 4) for i in P]
    fxDF = pd.DataFrame({
        'ticker': np.repeat([colName], periods),
        'date': np.tile(pd.date_range(startDate, periods=periods, freq='H'), 1),
        'price': (P)})
    fxDF.index = pd.to_datetime(fxDF.date)
    fxDF = fxDF.price.resample('D').ohlc()
    fxDF.columns = [i.title() for i in fxDF.columns]
    return fxDF
#rows set to 15 for minimal example but actual dataframe contains around 18000 rows.
number_of_rows = 15
df = genMockDataFrame(number_of_rows,1.1904,'tttmmm','19/3/2020',seed=157)
def findHighLow(df):
    df['x_day_high'] = 0
    df['x_day_low'] = 0
    for n in reversed(range(len(df['High']))):
        for i in reversed(range(n)):
            if df['High'][n] > df['High'][i]:
                df['x_day_high'][n] = n - i
            else:
                break
    for n in reversed(range(len(df['Low']))):
        for i in reversed(range(n)):
            if df['Low'][n] < df['Low'][i]:
                df['x_day_low'][n] = n - i
            else:
                break
    return df

df = findHighLow(df)
Desired output should match this:
df[["High","Low","x_day_high","x_day_low"]]
High Low x_day_high x_day_low
date
2020-03-19 1.1937 1.1832 0 0
2020-03-20 1.1879 1.1769 0 1
2020-03-21 1.1767 1.1662 0 2
2020-03-22 1.1721 1.1611 0 3
2020-03-23 1.1819 1.1690 2 0
2020-03-24 1.1928 1.1807 4 0
2020-03-25 1.1939 1.1864 6 0
2020-03-26 1.2141 1.1964 7 0
2020-03-27 1.2144 1.2039 8 0
2020-03-28 1.2099 1.2018 0 1
2020-03-29 1.2033 1.1853 0 4
2020-03-30 1.1887 1.1806 0 6
2020-03-31 1.1972 1.1873 1 0
2020-04-01 1.1997 1.1914 2 0
2020-04-02 1.1924 1.1781 0 9
Here are two solutions. Both produce the desired output, as posted in the question.
The first solution uses Numba and completes in 0.5 seconds on my machine for 20k rows. If you can use Numba, this is the way to go. The second solution uses only Pandas/Numpy and completes in 1.5 seconds for 20k rows.
Numba
import numba

@numba.njit
def count_smaller(arr):
    current = arr[-1]
    count = 0
    for i in range(arr.shape[0]-2, -1, -1):
        if arr[i] > current:
            break
        count += 1
    return count

@numba.njit
def count_greater(arr):
    current = arr[-1]
    count = 0
    for i in range(arr.shape[0]-2, -1, -1):
        if arr[i] < current:
            break
        count += 1
    return count

df["x_day_high"] = df.High.expanding().apply(count_smaller, engine='numba', raw=True)
df["x_day_low"] = df.Low.expanding().apply(count_greater, engine='numba', raw=True)
Pandas/Numpy
def count_consecutive_true(bool_arr):
    return bool_arr[::-1].cumprod().sum()

def count_smaller(arr):
    return count_consecutive_true(arr <= arr[-1]) - 1

def count_greater(arr):
    return count_consecutive_true(arr >= arr[-1]) - 1

df["x_day_high"] = df.High.expanding().apply(count_smaller, raw=True)
df["x_day_low"] = df.Low.expanding().apply(count_greater, raw=True)
This last solution is similar to mozway's. However, it runs faster because it doesn't need to perform a join and uses numpy as much as possible. It also looks arbitrarily far back.
You can use rolling to get the last N days, a comparison + cumprod on the reversed boolean array to keep only the last consecutive valid values, and sum to count them. Apply on each column using agg and join the output after adding a prefix.
# number of days
N = 8
df.join(df.rolling(f'{N+1}d', min_periods=1)
          .agg({'High': lambda s: s.le(s.iloc[-1])[::-1].cumprod().sum()-1,
                'Low': lambda s: s.ge(s.iloc[-1])[::-1].cumprod().sum()-1,
                })
          .add_prefix(f'{N}_days_')
        )
Output:
Open High Low Close 8_days_High 8_days_Low
date
2020-03-19 1.1904 1.1937 1.1832 1.1832 0.0 0.0
2020-03-20 1.1843 1.1879 1.1769 1.1772 0.0 1.0
2020-03-21 1.1755 1.1767 1.1662 1.1672 0.0 2.0
2020-03-22 1.1686 1.1721 1.1611 1.1721 0.0 3.0
2020-03-23 1.1732 1.1819 1.1690 1.1819 2.0 0.0
2020-03-24 1.1836 1.1928 1.1807 1.1922 4.0 0.0
2020-03-25 1.1939 1.1939 1.1864 1.1936 6.0 0.0
2020-03-26 1.1967 1.2141 1.1964 1.2114 7.0 0.0
2020-03-27 1.2118 1.2144 1.2039 1.2089 7.0 0.0
2020-03-28 1.2080 1.2099 1.2018 1.2041 0.0 1.0
2020-03-29 1.2033 1.2033 1.1853 1.1880 0.0 4.0
2020-03-30 1.1876 1.1887 1.1806 1.1879 0.0 6.0
2020-03-31 1.1921 1.1972 1.1873 1.1939 1.0 0.0
2020-04-01 1.1932 1.1997 1.1914 1.1914 2.0 0.0
2020-04-02 1.1902 1.1924 1.1781 1.1862 0.0 7.0

How to calculate the slope parameter of the 10th closing price on a rolling basis

How do I calculate the slope of the last 10 closing prices on a rolling basis and add it as the last column of the table (after Close, similar to a moving average price)?
Based on Python Dataframe Find n rows rolling slope without for loop
Fake data:
import pandas as pd
from scipy import stats

df = pd.DataFrame([[38611.38, '2022-03-08 22:23:00.000000'],
                   [38604.02, '2022-03-08 22:24:00.000000'],
                   [38611.76, '2022-03-08 22:25:00.000000'],
                   [38609.75, '2022-03-08 22:26:00.000000'],
                   [38601.35, '2022-03-08 22:27:00.000000']], columns=['Close', 'Open time'])
df
Close Open time
0 38611.38 2022-03-08 22:23:00.000000
1 38604.02 2022-03-08 22:24:00.000000
2 38611.76 2022-03-08 22:25:00.000000
3 38609.75 2022-03-08 22:26:00.000000
4 38601.35 2022-03-08 22:27:00.000000
df.reset_index(drop=True)
window = 10
df['Close'].rolling(window).apply(lambda x: stats.linregress(x, x.index+1)[0], raw=False)
Result:
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
...
995 0.017796
996 0.038905
997 0.052710
998 0.047330
999 0.043615
Name: Close, Length: 1000, dtype: float64
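A minimal sketch for attaching that rolling slope as a new column next to Close. Here the regression is of Close against its position in the window, which is an assumption about the intended x/y ordering; swap the linregress arguments if the inverse slope from the snippet above is what is actually wanted.
import pandas as pd
from scipy import stats

window = 10
# Slope of Close vs. bar position over each 10-row window, stored as a new column.
df['slope'] = df['Close'].rolling(window).apply(
    lambda x: stats.linregress(range(len(x)), x).slope, raw=True
)
print(df.tail())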

return list by dataframe linear interpolation

I have a dataframe that has, let's say 5 entries.
   moment  stress  strain
0    0.12      13    0.11
1    0.23      14    0.12
2    0.56      15    0.56
I would like to get a 1D float list in the order of [moment, stress, strain], based on the linear interpolation of strain = 0.45
I have read a couple of threads talking about the interpolate() method from pandas, but it is used when you have a NaN entry and you fill in the number.
How do I accomplish a similar task with my case?
Thank you
One method is to add a new row with NaN values to your dataframe and sort it:
import numpy as np
import pandas as pd

# DataFrame.append was removed in pandas 2.0; pd.concat does the same job here.
df = pd.concat(
    [df, pd.DataFrame([{"moment": np.nan, "stress": np.nan, "strain": 0.45}])],
    ignore_index=True,
)
df = df.sort_values(by="strain").set_index("strain")
df = df.interpolate(method="index")
print(df)
Prints:
moment stress
strain
0.11 0.1200 13.00
0.12 0.2300 14.00
0.45 0.4775 14.75
0.56 0.5600 15.00
To get the values back:
df = df.reset_index()
print(
    df.loc[df.strain == 0.45, ["moment", "stress", "strain"]]
    .to_numpy()
    .tolist()[0]
)
Prints:
[0.47750000000000004, 14.75, 0.45]
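A shorter alternative sketch (not the method above): np.interp per column, applied to the original three-row frame from the question. It assumes 'strain' is sorted in ascending order, which it is in the example data.
import numpy as np

target = 0.45
# Interpolate moment and stress at strain == 0.45, column by column.
result = [np.interp(target, df["strain"], df["moment"]),
          np.interp(target, df["strain"], df["stress"]),
          target]
print(result)   # roughly [0.4775, 14.75, 0.45]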

Sorting Pandas data frame with groupby and conditions

I'm trying to sort a data frame based on groups meeting conditions.
I'm getting a syntax error for the way I'm sorting the groups.
And I'm losing the initial order of the data frame before attempting the above.
This is the order of sorting that I'm trying to achieve:
1) Sort on the First and Test columns.
2) For Test==1 groups, sort on Secondary and then by the Final column.
--- For Test==0 groups, sort on the Final column only.
import pandas as pd

df = pd.DataFrame({"First": [100,100,100,100,100,100,200,200,200,200,200],
                   "Test": [1,1,1,0,0,0,0,1,1,1,0],
                   "Secondary": [.1,.1,.1,.2,.2,.3,.3,.3,.3,.4,.4],
                   "Final": [1.1,2.2,3.3,4.4,5.5,6.6,7.7,8.8,9.9,10.10,11.11]})

def sorter(x):
    if x["Test"]==1:
        x.sort_values(['Secondary','Final'], inplace=True)
    else:
        x = x.sort_values('Final', inplace=True)

df = df.sort_values(["First","Test"], ascending=[False, False]).reset_index(drop=True)
df.groupby(['First','Test']).apply(lambda x: sorter(x))
df
Expected result:
First Test Secondary Final
200 1 0.4 10.1
200 1 0.3* 9.9*
200 1 0.3* 8.8*
200 0 0.4 11.11*
200 0 0.3 7.7*
100 1 0.5 2.2
100 1 0.1* 3.3*
100 1 0.1* 1.1*
100 0 0.3 6.6*
100 0 0.2 5.5*
100 0 0.2 4.4*
You can try sorting in descending order (done here within a groupby); with respect to the sequence you gave, the order of sorting will change. Will it work for you?
df = pd.DataFrame({"First": [100,100,100,100,100,100,200,200,200,200,200],
                   "Test": [1,1,1,0,0,0,0,1,1,1,0],
                   "Secondary": [.1,.5,.1,.9,.4,.1,.3,.3,.3,.4,.4],
                   "Final": [1.1,2.2,3.3,4.4,5.5,6.6,7.7,8.8,9.9,10.10,11.11]})
df = df.groupby(['First','Test']).apply(
    lambda x: x.sort_values(['First','Test','Secondary','Final'], ascending=False)
    if x.iloc[0]['Test'] == 1
    else x.sort_values(['First','Test','Final'], ascending=False)
).reset_index(drop=True)
df.sort_values(['First','Test'], ascending=[True,False])
Out:
Final First Secondary Test
3 2.20 100 0.5 1
4 3.30 100 0.1 1
5 1.10 100 0.1 1
0 6.60 100 0.1 0
1 5.50 100 0.4 0
2 4.40 100 0.9 0
8 10.10 200 0.4 1
9 9.90 200 0.3 1
10 8.80 200 0.3 1
6 11.11 200 0.4 0
7 7.70 200 0.3 0
The trick was to sort subsets separately and replace the values in the original df.
This came up in other solutions to pandas sorting problems.
import pandas as pd
df = pd.DataFrame({"First": [100,100,100,100,100,100,200,200,200,200,200],
                   "Test": [1,1,1,0,0,0,0,1,1,1,0],
                   "Secondary": [.1,.5,.1,.9,.4,.1,.3,.3,.3,.4,.4],
                   "Final": [1.1,2.2,3.3,4.4,5.5,6.6,7.7,8.8,9.9,10.10,11.11]})
df.sort_values(['First','Test','Secondary','Final'],ascending=False, inplace=True)
index_subset=df[df["Test"]==0].index
sorted_subset=df[df["Test"]==0].sort_values(['First','Final'],ascending=False)
df.loc[index_subset,:]=sorted_subset.values
print(df)