Dataframe iteration using Numba instead of itertuples() for faster code - pandas

My problem
I have three DataFrames that I loop through using itertuples().
itertuples() worked well for a while, but now I am running too many iterations for it to be efficient enough.
I'd like to use vectorisation or perhaps Numba, as I have heard that both are very fast, but I've tried to make them work and can't figure it out.
All three DataFrames contain Open, High, Low, Close candlestick data plus a few other columns, e.g. 'FG_Top'.
The DataFrames are:
dflong - 15 minute candlestick data
dfshort - 5 minute candlestick data
dfshorter - 1 minute candlestick data
DataFrame creation code, as requested in the comments:
import numpy as np
import pandas as pd

idx15m = ['2022-10-29 06:59:59.999', '2022-10-29 07:14:59.999', '2022-10-29 07:29:59.999', '2022-10-29 07:44:59.999',
          '2022-10-29 07:59:59.999', '2022-10-29 08:14:59.999', '2022-10-29 08:29:59.999']
opn15m = [19010, 19204, 19283, 19839, 19892, 20000, 20192]
hgh15m = [19230, 19520, 19921, 19909, 20001, 20203, 21065]
low15m = [18782, 19090, 19245, 19809, 19256, 19998, 20016]
cls15m = [19204, 19283, 19839, 19892, 20000, 20192, 20157]
FG_Bottom = [np.nan, np.nan, np.nan, np.nan, np.nan, 19909, np.nan]
FG_Top = [np.nan, np.nan, np.nan, np.nan, np.nan, 19998, np.nan]
dflong = pd.DataFrame({'Open': opn15m, 'High': hgh15m, 'Low': low15m, 'Close': cls15m,
                       'FG_Bottom': FG_Bottom, 'FG_Top': FG_Top},
                      index=idx15m)

idx5m = ['2022-10-29 06:59:59.999', '2022-10-29 07:05:59.999', '2022-10-29 07:10:59.999', '2022-10-29 07:15:59.999',
         '2022-10-29 07:20:59.999', '2022-10-29 07:25:59.999', '2022-10-29 07:30:59.999']
opn5m = [19012, 19102, 19165, 19747, 19781, 20009, 20082]
hgh5m = [19132, 19423, 19817, 19875, 20014, 20433, 21068]
low5m = [18683, 19093, 19157, 19758, 19362, 19893, 20018]
cls5m = [19102, 19165, 19747, 19781, 20009, 20082, 20154]
price_end5m = [np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan]
dfshort = pd.DataFrame({'Open': opn5m, 'High': hgh5m, 'Low': low5m, 'Close': cls5m, 'price_end': price_end5m},
                       index=idx5m)

idx1m = ['2022-10-29 06:59:59.999', '2022-10-29 07:01:59.999', '2022-10-29 07:02:59.999', '2022-10-29 07:03:59.999',
         '2022-10-29 07:04:59.999', '2022-10-29 07:05:59.999', '2022-10-29 07:06:59.999']
opn1m = [19010, 19104, 19163, 19748, 19783, 20000, 20087]
hgh1m = [19130, 19420, 19811, 19878, 20011, 20434, 21065]
low1m = [18682, 19090, 19154, 19754, 19365, 19899, 20016]
cls1m = [19104, 19163, 19748, 19783, 20000, 20087, 20157]
price_end1m = [np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan]
dfshorter = pd.DataFrame({'Open': opn1m, 'High': hgh1m, 'Low': low1m, 'Close': cls1m, 'price_end': price_end1m},
                         index=idx1m)
This gives three DataFrames similar to the following example:
Example DataFrame
Open High ... FG_Top FG_Bottom
2022-10-29 06:59:59.999 20687.83 20700.46 ... NaN NaN
2022-10-29 07:14:59.999 20686.82 20695.74 ... NaN NaN
2022-10-29 07:29:59.999 20733.62 20745.30 ... 20733.62 20700.46
2022-10-29 07:44:59.999 20741.42 20762.75 ... NaN NaN
2022-10-29 07:59:59.999 20723.86 20777.00 ... NaN NaN
... ... ... ... ... ...
2022-11-10 02:14:59.999 16140.29 16167.09 ... NaN NaN
2022-11-10 02:29:59.999 16119.99 16195.19 ... NaN NaN
2022-11-10 02:44:59.999 16136.63 16263.15 ... NaN NaN
2022-11-10 02:59:59.999 16238.91 16238.91 ... NaN NaN
2022-11-10 03:14:59.999 16210.23 16499.00 ... NaN NaN
Code explanation:
I loop over my first DataFrame with the first loop, then loop over it again with a second, nested loop. I have if statements checking certain conditions on each iteration, and if those conditions are met I set some values in the first DataFrame to np.nan.
One of the conditions checked in the second loop calls a function which contains a third loop and checks certain conditions in the other two DataFrames.
# First loop
for fg_candle_idx, row in enumerate(dflong.itertuples()):
    top = row.FG_Top
    bottom = row.FG_Bottom
    fg_candle_time = row.Index
    if pd.notnull(top):
        # Second loop
        for future_candle_idx, r in enumerate(dflong.itertuples()):
            future_candle_time = r.Index
            next_future_candle = future_candle_time + timedelta(minutes=minutes)
            future_candle_high = r.High
            future_candle_low = r.Low
            future_candle_close = r.Close
            future_candle_open = r.Open
            if future_candle_idx > fg_candle_idx:
                div = r.price_end
                # Check conditions, call function check_no_divs
                if (pd.isnull(check_no_divs(dfshort, future_candle_time, next_future_candle))) & (
                        pd.isnull(check_no_divs(dfshorter, future_candle_time, next_future_candle))) & (
                        pd.isnull(div)):
                    if future_candle_high < bottom:
                        continue
                    elif future_candle_low > top:
                        continue
                    elif (future_candle_close < bottom) & \
                            (future_candle_open > top):
                        dflong.loc[fg_candle_time, 'FG_Bottom'] = np.nan
                        dflong.loc[fg_candle_time, 'FG_Top'] = np.nan
                        continue
                    # Many additional conditions checked...
The following code is the function check_no_divs
def check_no_divs(df, candle_time, next_candle):
    no_divs = []
    # Third loop
    for idx, row in enumerate(df.itertuples()):
        compare_candle_time = row.Index
        div = row.price_end
        if (compare_candle_time >= candle_time) & (compare_candle_time <= next_candle):
            if pd.notnull(div):
                no_divs.append(True)
            else:
                no_divs.append(False)
        elif compare_candle_time < candle_time:
            continue
        elif compare_candle_time > next_candle:
            break
    if all(no_divs) == False:
        return np.nan
    elif any(no_divs) == True:
        return 1
Ideal Solution
Clearly using itertuples() is far too inefficient for this problem. I think there would be a much faster solution using efficient vectorisation or Numba.
Does anyone know how to make this work?
P.S. I'm still quite new to coding. I think my current code could be made somewhat more efficient while still using itertuples(), but probably not efficient enough. I'd appreciate it if someone knows a way to greatly increase the speed of this code.

I spent a lot of time researching and testing different code and came up with this solution using Numba, which gives a significant speed boost.
First, import the required libraries:
import numpy as np
import pandas as pd
from numba import njit, prange
Then define the function using Numba's njit decorator:
@njit
def filled_fg(fg_top, fg_bottom, dflongindex, Open, High, Low, Close, dflongprice_end,
              dfshortprice_end, shortindex, dfshorterprice_end, shorterindex):
    # First loop
    for i in prange(len(fg_top)):
        top = fg_top[i]
        bottom = fg_bottom[i]
        # NaN checks must use np.isnan inside njit code; "is"/"==" comparisons
        # against np.nan are always False
        if not np.isnan(top):
            if (bottom - top) > 0:
                fg_top[i] = np.nan
                fg_bottom[i] = np.nan
            # Second loop ("- 1" keeps dflongindex[j + 1] in bounds)
            for j in prange(len(fg_top) - 1):
                if j > i:
                    future_candle_time = dflongindex[j]
                    next_future_candle = dflongindex[j + 1]
                    future_candle_high = High[j]
                    future_candle_low = Low[j]
                    future_candle_close = Close[j]
                    future_candle_open = Open[j]
                    long_div = dflongprice_end[j]
                    # Check conditions
                    if (np.isnan(check_no_divs(dfshortprice_end, shortindex,
                                               future_candle_time, next_future_candle))
                            and np.isnan(check_no_divs(dfshorterprice_end, shorterindex,
                                                       future_candle_time, next_future_candle))
                            and np.isnan(long_div)):
                        if future_candle_high < bottom:
                            continue
                        elif future_candle_low > top:
                            continue
                        # Do something when conditions are met...
                        elif (future_candle_close < bottom) & \
                                (future_candle_open > top):
                            fg_bottom[i] = np.nan
                            fg_top[i] = np.nan
                            continue
    return fg_bottom, fg_top
Define the second function, also with Numba's njit decorator:
@njit
def check_no_divs(div_data, div_candle_time, first_future_candle, second_future_candle):
    no_divs = []
    for i in prange(len(div_data)):
        if (div_candle_time[i] >= first_future_candle) & (div_candle_time[i] <= second_future_candle):
            # np.isnan is required here too; "is not np.nan" does not work on array elements
            if not np.isnan(div_data[i]):
                return 1.0
            else:
                no_divs.append(0)
        elif div_candle_time[i] < first_future_candle:
            continue
        elif div_candle_time[i] > second_future_candle:
            break
    div_count = 0
    for i in no_divs:
        div_count = div_count + i
    # return a float on every path so Numba can infer a single return type
    if div_count == 0:
        return np.nan
    return 1.0
Before calling the function, the DataFrame indexes need to be reset (reset_index() moves the index into a regular column named 'index'):
dflong = dflong.reset_index()
dfshort = dfshort.reset_index()
dfshorter = dfshorter.reset_index()
Now call the function, using .values to pass a NumPy representation of each column:
fg_bottom, fg_top = filled_fg(dflong['FG_Top'].values,
dflong['FG_Bottom'].values,
dflong['index'].values,
dflong['Open'].values,
dflong['High'].values,
dflong['Low'].values,
dflong['Close'].values,
dflong['price_end'].values,
dfshort['price_end'].values,
dfshort['index'].values,
dfshorter['price_end'].values,
dfshorter['index'].values)
Finally, the returned data needs to be added back to the original DataFrame dflong:
dflong['FG_Bottom'] = fg_bottom
dflong['FG_Top'] = fg_top
Speed test results:
Original itertuples solution = 7.641393423080444 seconds
New Numba solution = 0.5985264778137207 seconds
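The exact numbers will depend on your data and hardware. If you want to reproduce the comparison, a minimal timing sketch for the Numba call is below; note that the first call to an njit function includes JIT compilation time.
import time

start = time.perf_counter()
fg_bottom, fg_top = filled_fg(dflong['FG_Top'].values, dflong['FG_Bottom'].values, dflong['index'].values,
                              dflong['Open'].values, dflong['High'].values, dflong['Low'].values,
                              dflong['Close'].values, dflong['price_end'].values,
                              dfshort['price_end'].values, dfshort['index'].values,
                              dfshorter['price_end'].values, dfshorter['index'].values)
elapsed = time.perf_counter() - start
# The very first call includes Numba's JIT compilation; subsequent calls with the
# same argument types reuse the compiled machine code and are faster.
print(f"Numba solution: {elapsed:.3f} seconds")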

Related

Remove the requirement to loop through numpy array

Overview
The code below contains a NumPy array clusters whose values are compared against each row of a pandas DataFrame using np.where. The SoFunc function returns rows where all conditions are True and takes the clusters array as input.
Question
I can loop through this array to compare each array element against the respective np.where conditions. How do I remove the need to loop but still get the same output?
I appreciate that looping through NumPy arrays is inefficient and want to improve this code. The actual dataset will be much larger.
Prepare the reproducible mock data
def genMockDataFrame(days, startPrice, colName, startDate, seed=None):
    periods = days * 24
    np.random.seed(seed)
    steps = np.random.normal(loc=0, scale=0.0018, size=periods)
    steps[0] = 0
    P = startPrice + np.cumsum(steps)
    P = [round(i, 4) for i in P]
    fxDF = pd.DataFrame({
        'ticker': np.repeat([colName], periods),
        'date': np.tile(pd.date_range(startDate, periods=periods, freq='H'), 1),
        'price': (P)})
    fxDF.index = pd.to_datetime(fxDF.date)
    fxDF = fxDF.price.resample('D').ohlc()
    fxDF.columns = [i.title() for i in fxDF.columns]
    return fxDF

def SoFunc(clust):
    # generate mock data
    df = genMockDataFrame(10, 1.1904, 'eurusd', '19/3/2020', seed=157)
    df["Upper_Band"] = 1.1928
    df.loc["2020-03-27", "Upper_Band"] = 1.2118
    df.loc["2020-03-26", "Upper_Band"] = 1.2200
    df["Level"] = np.where((df["High"] >= clust)
                           & (df["Low"] <= clust)
                           & (df["High"] >= df["Upper_Band"]), 1, np.NaN
                           )
    return df.dropna()
Loop through the clusters array
clusters = np.array([1.1929, 1.2118])
l = []
for i in range(len(clusters)):
    l.append(SoFunc(clusters[i]))
pd.concat(l)
Output
Open High Low Close Upper_Band Level
date
2020-03-19 1.1904 1.1937 1.1832 1.1832 1.1928 1.0
2020-03-25 1.1939 1.1939 1.1864 1.1936 1.1928 1.0
2020-03-27 1.2118 1.2144 1.2039 1.2089 1.2118 1.0
(Edited based on #tdy's comment below)
pandas.merge allows you to make len(clusters) copies of your DataFrame and then pare them down according to the conditions in your SoFunc function.
The cross merge creates a DataFrame with a copy of df for each record in clusters_df. The overall result ought to be faster for large DataFrames than the loop-based approach, provided you have enough memory to temporarily hold the merged DataFrame (if not, the operation may spill over into page/swap space and slow down drastically).
import numpy as np
import pandas as pd

def genMockDataFrame(days, startPrice, colName, startDate, seed=None):
    ''' identical to the example provided '''
    periods = days * 24
    np.random.seed(seed)
    steps = np.random.normal(loc=0, scale=0.0018, size=periods)
    steps[0] = 0
    P = startPrice + np.cumsum(steps)
    P = [round(i, 4) for i in P]
    fxDF = pd.DataFrame({
        'ticker': np.repeat([colName], periods),
        'date': np.tile(pd.date_range(startDate, periods=periods, freq='H'), 1),
        'price': (P)})
    fxDF.index = pd.to_datetime(fxDF.date)
    fxDF = fxDF.price.resample('D').ohlc()
    fxDF.columns = [i.title() for i in fxDF.columns]
    return fxDF

# create the base dataframe according to the former SoFunc
df = genMockDataFrame(10, 1.1904, 'eurusd', '19/3/2020', seed=157)
df["Upper_Band"] = 1.1928
df.loc["2020-03-27", "Upper_Band"] = 1.2118  # .loc[row, col] avoids chained assignment
df.loc["2020-03-26", "Upper_Band"] = 1.2200

# create a df out of the cluster array
clusters = np.array([1.1929, 1.2118])
clusters_df = pd.DataFrame({"clust": clusters})

# perform the merge, then filter and finally clean up
result_df = (
    pd
    .merge(df.reset_index(), clusters_df, how="cross")  # for each entry in clusters, make a copy of df
    .loc[lambda z: (z.Low <= z.clust) & (z.High >= z.clust) & (z.High >= z.Upper_Band), :]  # filter the copies down
    .drop(columns=["clust"])   # not needed in result
    .assign(Level=1.0)         # to match your result; not really needed
    .set_index("date")         # bring back the old index
)
print(result_df)
I recommend inspecting just the result of pd.merge(df.reset_index(), clusters_df, how="cross") to see how it works.
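If it helps, here is a tiny standalone sketch (with made-up numbers, not your data) of what how="cross" produces before any filtering; note that the "cross" option requires pandas 1.2 or newer.
import pandas as pd

left = pd.DataFrame({"High": [1.20, 1.22], "Low": [1.18, 1.19]})
right = pd.DataFrame({"clust": [1.19, 1.21]})

# how="cross" pairs every row of `left` with every row of `right`
# (2 x 2 = 4 rows here); the .loc filter above then pares these copies down.
print(pd.merge(left, right, how="cross"))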

I am having a problem with a for loop that involves DataFrames

I have a DataFrame with 8 columns. If two of those columns satisfy a condition, I have to fill two columns with the product of two others. After running the algorithm it is not working.
I have tried to use Series, and I have tried import warnings; warnings.filterwarnings("ignore"), but it is not working.
for i in seq:
    if dataframefinal['trade'][i] == 1 and dataframefinal['z'][i] > 0:
        dataframefinal['CloseAdj2'][i] = dataframefinal['Close2'][i] * dataframefinal['trancosshort'][i]
        dataframefinal['CloseAdj1'][i] = dataframefinal['Close1'][i] * dataframefinal['trancostlong'][i]
    elif dataframefinal['trade'][i] == 1 and dataframefinal['z'][i] < 0:
        dataframefinal['CloseAdj2'][i] = dataframefinal['Close1'][i] * dataframefinal['trancosshort'][i]
        dataframefinal['CloseAdj1'][i] = dataframefinal['Close2'][i] * dataframefinal['trancostlong'][i]
    else:
        dataframefinal['CloseAdj1'][i] = dataframefinal['Close1'][i]
        dataframefinal['CloseAdj2'][i] = dataframefinal['Close2'][i]
You can use the vectorized conditional function numpy.select() to do this quickly:
import numpy as np
import pandas as pd
from numpy.random import randn, randint

n = 10
df_data = pd.DataFrame(dict(trade=randint(0, 2, n),
                            z=randn(n),
                            Close1=randn(n),
                            Close2=randn(n),
                            trancosshort=randn(n),
                            trancostlong=randn(n)))
df_data["CloseAdj1"] = 0
df_data["CloseAdj2"] = 0
seq = [1, 3, 5, 7, 9]
df = df_data.loc[seq]
cond1 = df.eval("trade==1 and z > 0")
cond2 = df.eval("trade==1 and z < 0")
df["CloseAdj2"] = np.select([cond1, cond2],
                            [df.eval("Close2 * trancosshort"),
                             df.eval("Close1 * trancosshort")], df.Close2)
df["CloseAdj1"] = np.select([cond1, cond2],
                            [df.eval("Close1 * trancostlong"),
                             df.eval("Close2 * trancostlong")], df.Close1)
df_data.loc[seq, ["CloseAdj1", "CloseAdj2"]] = df[["CloseAdj1", "CloseAdj2"]]
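As a minimal standalone illustration (toy numbers, not your data) of how numpy.select picks values element-wise from the choice list, with a default where no condition matches:
import numpy as np

a = np.array([1, 2, 3, 4])
# For each element, np.select returns the choice whose condition is True,
# falling back to the default (-1 here) when no condition matches.
out = np.select([a < 2, a > 3], [a * 10, a * 100], default=-1)
print(out)  # [ 10  -1  -1 400]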

Pandas accumulate data for linear regression

I'm trying to adjust my data so that total_gross per day is accumulated, e.g.:
Created   total_gross   total_gross_accumulated
Day 1     100           100
Day 2     100           200
Day 3     100           300
Day 4     100           400
Any idea how I have to change my code to make total_gross_accumulated available?
Here is my data.
My code:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import linear_model

def load_event_data():
    df = pd.read_csv('sample-data.csv', usecols=['created', 'total_gross'])
    df['created'] = pd.to_datetime(df.created)
    return df.set_index('created').resample('D').sum().fillna(0)

event_data = load_event_data()
X = event_data.index
y = event_data.total_gross
plt.xticks(rotation=90)
plt.plot(X, y)
plt.show()
List comprehension is the most pythonic way to do this.
SHORT answer:
This should give you the new column that you want:
n = event_data.shape[0]
# skip line 0 and start by accumulating from 1 until the end
total_gross_accumulated =[event_data['total_gross'][:i].sum() for i in range(1,n+1)]
# add the new variable in the initial pandas dataframe
event_data['total_gross_accumulated'] = total_gross_accumulated
OR faster
event_data['total_gross_accumulated'] = event_data['total_gross'].cumsum()
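As a quick sanity check with the Day 1-4 numbers from your example table, cumsum() reproduces the running total:
import pandas as pd

s = pd.Series([100, 100, 100, 100], index=['Day 1', 'Day 2', 'Day 3', 'Day 4'])
# cumsum() returns the running total of the column
print(s.cumsum().tolist())  # [100, 200, 300, 400]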
LONG answer:
Full code using your data:
import pandas as pd
import matplotlib.pyplot as plt

def load_event_data():
    df = pd.read_csv('sample-data.csv', usecols=['created', 'total_gross'])
    df['created'] = pd.to_datetime(df.created)
    return df.set_index('created').resample('D').sum().fillna(0)

event_data = load_event_data()
n = event_data.shape[0]
# skip line 0 and start by accumulating from 1 until the end
total_gross_accumulated = [event_data['total_gross'][:i].sum() for i in range(1, n+1)]
# add the new variable to the initial pandas DataFrame
event_data['total_gross_accumulated'] = total_gross_accumulated
Results:
event_data.head(6)
# total_gross total_gross_accumulated
#created
#2019-03-01 3481810 3481810
#2019-03-02 4690 3486500
#2019-03-03 0 3486500
#2019-03-04 0 3486500
#2019-03-05 0 3486500
#2019-03-06 0 3486500
X = event_data.index
y = event_data.total_gross_accumulated
plt.xticks(rotation=90)
plt.plot(X, y)
plt.show()

How to apply a rolling Kalman Filter to a column in a DataFrame?

How to apply a rolling Kalman Filter to a DataFrame column (without using external data)?
That is, pretending that each row is a new point in time, so the descriptive statistics must be updated (in a rolling manner) after each row.
For example, how to apply the Kalman Filter to any column in the below DataFrame?
n = 2000
index = pd.date_range(start='2000-01-01', periods=n)
data = np.random.randn(n, 4)
df = pd.DataFrame(data, columns=list('ABCD'), index=index)
I've seen previous responses (1 and 2); however, they do not apply it to a DataFrame column (and they are not vectorized).
Exploiting some good features of NumPy and using the pykalman library, applying the Kalman Filter to column D with a rolling window of 3, we can write:
import pandas as pd
from pykalman import KalmanFilter
import numpy as np

def rolling_window(a, step):
    shape = a.shape[:-1] + (a.shape[-1] - step + 1, step)
    strides = a.strides + (a.strides[-1],)
    return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)

def get_kf_value(y_values):
    kf = KalmanFilter()
    Kc, Ke = kf.em(y_values, n_iter=1).smooth(0)
    return Kc

n = 2000
index = pd.date_range(start='2000-01-01', periods=n)
data = np.random.randn(n, 4)
df = pd.DataFrame(data, columns=list('ABCD'), index=index)

wsize = 3
arr = rolling_window(df.D.values, wsize)
zero_padding = np.zeros(shape=(wsize-1, wsize))
arrst = np.concatenate((zero_padding, arr))
arrkalman = np.zeros(shape=(len(arrst), 1))

for i in range(len(arrst)):
    arrkalman[i] = get_kf_value(arrst[i])

kalmandf = pd.DataFrame(arrkalman, columns=['D_kalman'], index=index)
df = pd.concat([df, kalmandf], axis=1)
df.head() should yield something like this:
A B C D D_kalman
2000-01-01 -0.003156 -1.487031 -1.755621 -0.101233 0.000000
2000-01-02 0.172688 -0.767011 -0.965404 -0.131504 0.000000
2000-01-03 -0.025983 -0.388501 -0.904286 1.062163 0.013633
2000-01-04 -0.846606 -0.576383 -1.066489 -0.041979 0.068792
2000-01-05 -1.505048 0.498062 0.619800 0.012850 0.252550
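As a side note, on NumPy 1.20 or newer the hand-rolled rolling_window helper can, I believe, be replaced by numpy.lib.stride_tricks.sliding_window_view, which returns the same (n - wsize + 1, wsize) windows without manual stride arithmetic:
import numpy as np

x = np.arange(6.0)
# Each row is one window of length 3 over x, equivalent to rolling_window(x, 3)
windows = np.lib.stride_tricks.sliding_window_view(x, 3)
print(windows.shape)  # (4, 3)
print(windows[0])     # [0. 1. 2.]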

Cleaner pandas apply with function that cannot use pandas.Series and non-unique index

In the following, func represents a function that uses multiple columns (with coupling across the group) and cannot operate directly on pandas.Series. The 0*d['x'] syntax was the lightest I could think of to force the conversion, but I think it's awkward.
Additionally, the resulting pandas.Series (s) still includes the group index, which must be removed before adding as a column to the pandas.DataFrame. The s.reset_index(...) index manipulation seems fragile and error-prone, so I'm curious if it can be avoided. Is there an idiom for doing this?
import pandas
import numpy

df = pandas.DataFrame(dict(i=[1]*8, j=[1]*4+[2]*4, x=list(range(4))*2))
df['y'] = numpy.sin(df['x']) + 1000*df['j']
df = df.set_index(['i', 'j'])
print('# df\n', df)

def func(d):
    x = numpy.array(d['x'])
    y = numpy.array(d['y'])
    # I want to do math with x,y that cannot be applied to
    # pandas.Series, so explicitly convert to numpy arrays.
    #
    # We have to return an appropriately-indexed pandas.Series
    # in order for it to be admissible as a column in the
    # pandas.DataFrame. Instead of simply "return x + y", we
    # have to make the conversion.
    return 0*d['x'] + x + y

s = df.groupby(df.index).apply(func)
# The Series is still adorned with the (unnamed) group index,
# which will prevent adding as a column of df due to
# Exception: cannot handle a non-unique multi-index!
s = s.reset_index(level=0, drop=True)
print('# s\n', s)
df['z'] = s
print('# df\n', df)
Instead of
0*d['x'] + x + y
you could use
pd.Series(x+y, index=d.index)
When using groupby-apply, instead of dropping the group key index using:
s = df.groupby(df.index).apply(func)
s = s.reset_index(level=0, drop=True)
df['z'] = s
you can tell groupby to drop the keys using the keyword parameter group_keys=False:
df['z'] = df.groupby(df.index, group_keys=False).apply(func)
import pandas as pd
import numpy as np

df = pd.DataFrame(dict(i=[1]*8, j=[1]*4+[2]*4, x=list(range(4))*2))
df['y'] = np.sin(df['x']) + 1000*df['j']
df = df.set_index(['i', 'j'])

def func(d):
    x = np.array(d['x'])
    y = np.array(d['y'])
    return pd.Series(x+y, index=d.index)

df['z'] = df.groupby(df.index, group_keys=False).apply(func)
print(df)
yields
x y z
i j
1 1 0 1000.000000 1000.000000
1 1 1000.841471 1001.841471
1 2 1000.909297 1002.909297
1 3 1000.141120 1003.141120
2 0 2000.000000 2000.000000
2 1 2000.841471 2001.841471
2 2 2000.909297 2002.909297
2 3 2000.141120 2003.141120