calculate distance matrix with mixed categorical and numerics - pandas

I have a data frame with a mixture of numeric (15 fields) and categorical (5 fields) data.
I can create a complete distance matrix of the numeric fields by following the question "create distance matrix using own calculation pandas".
I want to include the categorical fields as well.
Using as template:
import scipy
from scipy.spatial import distance_matrix
from scipy.spatial.distance import squareform
from scipy.spatial.distance import pdist
pd.DataFrame(squareform(pdist(df2.values, lambda u, v: np.sqrt((w*(u-v)**2).sum()))), index=df2.index, columns=df2.index)
In the squareform calculation, I would like to include the test np.where(u[2]==v[2], 0, 10) (and the same test for the other categorical columns).
How do I modify the lambda function to carry out this test as well?
Here, the distance between [0,1]
= sqrt((2-1)^2 + (6-5)^2 + (cat - cat)^2)
= sqrt(1 + 1 + 0)
and the distance between [0,2]
= sqrt((3-1)^2 + (7-5)^2 + (dog - cat)^2)
= sqrt(4 + 4 + 100)
Can anyone suggest how I can implement this algorithm?

import pandas as pd
import numpy as np
from scipy.spatial.distance import pdist, squareform
df2 = pd.DataFrame({'col1':[1,2,3,4],'col2':[5,6,7,8],'col3':['cat','cat','dog','bird']})
def fun(u,v):
    const = 0 if u[2] == v[2] else 10
    return np.sqrt((u[0]-v[0])**2 + (u[1]-v[1])**2 + const**2)
pd.DataFrame(squareform(pdist(df2.values, fun)), index=df2.index, columns=df2.index)
           0          1          2          3
0   0.000000   1.414214  10.392305  10.862780
1   1.414214   0.000000  10.099505  10.392305
2  10.392305  10.099505   0.000000  10.099505
3  10.862780  10.392305  10.099505   0.000000
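For larger frames, calling a Python function for every pair of rows can become slow. Below is a minimal sketch of the same distance, assuming every categorical mismatch contributes the same penalty of 10 as in the example. It computes the numeric part with the built-in 'sqeuclidean' metric, counts the mismatching categorical columns with 'hamming', and then combines the two. The names num_cols, cat_cols and penalty are illustrative only.

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import pdist, squareform

df2 = pd.DataFrame({'col1': [1, 2, 3, 4],
                    'col2': [5, 6, 7, 8],
                    'col3': ['cat', 'cat', 'dog', 'bird']})

num_cols = ['col1', 'col2']   # numeric fields (assumed names)
cat_cols = ['col3']           # categorical fields (assumed names)
penalty = 10.0                # distance contribution of one categorical mismatch

# Squared Euclidean distance over the numeric columns.
num_sq = squareform(pdist(df2[num_cols].to_numpy(dtype=float), 'sqeuclidean'))

# Encode each categorical column as integer codes, then count mismatches:
# 'hamming' returns the fraction of differing columns, so scale by the column count.
codes = df2[cat_cols].apply(lambda s: s.astype('category').cat.codes).to_numpy()
mismatches = squareform(pdist(codes, 'hamming')) * len(cat_cols)

dist = np.sqrt(num_sq + penalty ** 2 * mismatches)
result = pd.DataFrame(dist, index=df2.index, columns=df2.index)
print(result)
```

Adding more categorical fields only requires extending cat_cols; per-column penalties would need one mismatch matrix per column instead of a single Hamming count.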


Dataframe iteration using Numba instead of itertuples() for faster code

My problem
I have three dataframes which I loop through using itertuples.
itertuples worked well for a while, but I am now running too many iterations for it to be efficient enough.
I'd like to use vectorisation or perhaps Numba, as I have heard that both are very fast. I've tried to make them work but I can't figure it out.
All three dataframes contain Open, High, Low, Close candlestick data with a few other columns, e.g. 'FG_Top'.
The dataframes are
dflong - 15 minute candlestick data
dfshort - 5 minute candlestick data
dfshorter - 1 minute candlestick data
Dataframe creation code as requested in comments
import numpy as np
import pandas as pd
idx15m = ['2022-10-29 06:59:59.999', '2022-10-29 07:14:59.999', '2022-10-29 07:29:59.999', '2022-10-29 07:44:59.999',
'2022-10-29 07:59:59.999', '2022-10-29 08:14:59.999', '2022-10-29 08:29:59.999']
opn15m = [19010, 19204, 19283, 19839, 19892, 20000, 20192]
hgh15m = [19230, 19520, 19921, 19909, 20001, 20203, 21065]
low15m = [18782, 19090, 19245, 19809, 19256, 19998, 20016]
cls15m = [19204, 19283, 19839, 19892, 20000, 20192, 20157]
FG_Bottom = [np.nan, np.nan, np.nan, np.nan, np.nan, 19909, np.nan]
FG_Top = [np.nan, np.nan, np.nan, np.nan, np.nan, 19998, np.nan]
dflong = pd.DataFrame({'Open': opn15m, 'High': hgh15m, 'Low': low15m, 'Close': cls15m, 'FG_Bottom': FG_Bottom, 'FG_Top': FG_Top},
                      index=pd.to_datetime(idx15m))  # assumed: the original line is cut off after the comma
idx5m = ['2022-10-29 06:59:59.999', '2022-10-29 07:05:59.999', '2022-10-29 07:10:59.999', '2022-10-29 07:15:59.999',
'2022-10-29 07:20:59.999', '2022-10-29 07:25:59.999', '2022-10-29 07:30:59.999']
opn5m = [19012, 19102, 19165, 19747, 19781, 20009, 20082]
hgh5m = [19132, 19423, 19817, 19875, 20014, 20433, 21068]
low5m = [18683, 19093, 19157, 19758, 19362, 19893, 20018]
cls5m = [19102, 19165, 19747, 19781, 20009, 20082, 20154]
price_end5m = [np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan]
dfshort = pd.DataFrame({'Open': opn5m, 'High': hgh5m, 'Low': low5m, 'Close': cls5m, 'price_end': price_end5m},
                       index=pd.to_datetime(idx5m))  # assumed: the original line is cut off after the comma
idx1m = ['2022-10-29 06:59:59.999', '2022-10-29 07:01:59.999', '2022-10-29 07:02:59.999', '2022-10-29 07:03:59.999',
'2022-10-29 07:04:59.999', '2022-10-29 07:05:59.999', '2022-10-29 07:06:59.999']
opn1m = [19010, 19104, 19163, 19748, 19783, 20000, 20087]
hgh1m = [19130, 19420, 19811, 19878, 20011, 20434, 21065]
low1m = [18682, 19090, 19154, 19754, 19365, 19899, 20016]
cls1m = [19104, 19163, 19748, 19783, 20000, 20087, 20157]
price_end1m = [np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan]
dfshorter = pd.DataFrame({'Open': opn1m, 'High': hgh1m, 'Low': low1m, 'Close': cls1m, 'price_end': price_end1m},
                         index=pd.to_datetime(idx1m))  # assumed: the original line is cut off after the comma
This gives three DataFrames that are similar to the following DataFrame.
Example Dataframe
Open High ... FG_Top FG_Bottom
2022-10-29 06:59:59.999 20687.83 20700.46 ... NaN NaN
2022-10-29 07:14:59.999 20686.82 20695.74 ... NaN NaN
2022-10-29 07:29:59.999 20733.62 20745.30 ... 20733.62 20700.46
2022-10-29 07:44:59.999 20741.42 20762.75 ... NaN NaN
2022-10-29 07:59:59.999 20723.86 20777.00 ... NaN NaN
... ... ... ... ... ...
2022-11-10 02:14:59.999 16140.29 16167.09 ... NaN NaN
2022-11-10 02:29:59.999 16119.99 16195.19 ... NaN NaN
2022-11-10 02:44:59.999 16136.63 16263.15 ... NaN NaN
2022-11-10 02:59:59.999 16238.91 16238.91 ... NaN NaN
2022-11-10 03:14:59.999 16210.23 16499.00 ... NaN NaN
Code explanation:
I loop over the first dataframe with an outer loop, then loop over it again with a nested second loop. On each iteration, if statements check certain conditions, and if those conditions are met I set some values in the first dataframe to np.nan.
One of the conditions checked in the second loop calls a function that contains a third loop and checks for certain conditions in the other two dataframes.
# First loop
for fg_candle_idx, row in enumerate(dflong.itertuples()):
    top = row.FG_Top
    bottom = row.FG_Bottom
    fg_candle_time = row.Index
    if (pd.notnull(top)):
        # Second loop
        for future_candle_idx, r in enumerate(dflong.itertuples()):
            future_candle_time = r.Index
            next_future_candle = future_candle_time + timedelta(minutes=minutes)
            future_candle_high = r.High
            future_candle_low = r.Low
            future_candle_close = r.Close
            future_candle_open = r.Open
            if future_candle_idx > fg_candle_idx:
                div = r.price_end
                # Check conditions, call function check_no_divs
                if (pd.isnull(check_no_divs(dfshort, future_candle_time, next_future_candle))) & (
                        pd.isnull(check_no_divs(dfshorter, future_candle_time, next_future_candle))) & (
                        ...):  # third condition is cut off in the original post
                    if future_candle_high < bottom:
                        pass  # branch body not shown in the original post
                    elif future_candle_low > top:
                        pass  # branch body not shown in the original post
                    elif (future_candle_close < bottom) & \
                            (future_candle_open > top):
                        dflong.loc[fg_candle_time, 'FG_Bottom'] = np.nan
                        dflong.loc[fg_candle_time, 'FG_Top'] = np.nan
                        # Many additional conditions checked...
The following code is the function check_no_divs
def check_no_divs(df, candle_time, next_candle):
    no_divs = []
    # Third Loop
    for idx, row in enumerate(df.itertuples()):
        compare_candle_time = row.Index
        div = row.price_end
        if (compare_candle_time >= candle_time) & (compare_candle_time <= next_candle):
            if pd.notnull(div):
                pass  # branch body not shown in the original post
        elif compare_candle_time < candle_time:
            pass  # branch body not shown in the original post
        elif compare_candle_time > next_candle:
            pass  # branch body not shown in the original post
    if all(no_divs) == False:
        return np.nan
    elif any(no_divs) == True:
        return 1
Ideal Solution
Clearly, itertuples is far too inefficient for this problem. I think there would be a much faster solution using vectorisation or Numba.
Does anyone know how to make this work?
P.S. I'm still quite new to coding. I think my current code could be made more efficient while still using itertuples, but probably not efficient enough. I'd appreciate it if someone knows a way to greatly increase the speed of this code.
I spent a lot of time researching and testing different code and came up with this solution using Numba, which gives a significant speed boost.
First, import the required libraries:
import numpy as np
import pandas as pd
from numba import njit, prange
Then define the function using Numba's njit decorator:
@njit
def filled_fg(fg_top, fg_bottom, dflongindex, Open, High, Low, Close, dflongprice_end,
              dfshortprice_end, shortindex, dfshorterprice_end, shorterindex, conflu_top,
              ...):  # remaining parameters are cut off in the original post
    # First loop
    for i in prange(len(fg_top)):
        top = fg_top[i]
        bottom = fg_bottom[i]
        # NaN never compares equal to anything, so test with np.isnan
        if not np.isnan(top):
            if (bottom - top) > 0:
                fg_top[i] = np.nan
                fg_bottom[i] = np.nan
            # Second loop
            for j in prange(len(fg_top)):
                if j > i:
                    future_candle_time = dflongindex[j]
                    next_future_candle = dflongindex[j + 1]
                    future_candle_high = High[j]
                    future_candle_low = Low[j]
                    future_candle_close = Close[j]
                    future_candle_open = Open[j]
                    long_div = dflongprice_end[j]
                    # Check conditions
                    if (np.isnan(new_check_no_divs(dfshortprice_end, shortindex, future_candle_time,
                                                   next_future_candle))) & (np.isnan(new_check_no_divs(
                            dfshorterprice_end, shorterindex, future_candle_time,
                            next_future_candle))) & (np.isnan(long_div)):
                        if future_candle_high < bottom:
                            pass  # branch body not shown in the original post
                        elif future_candle_low > top:
                            # Do something when conditions are met...
                            pass
                        elif (future_candle_close < bottom) & \
                                (future_candle_open > top):
                            fg_bottom[i] = np.nan
                            fg_top[i] = np.nan
Define the second function, also with Numba's njit decorator:
@njit
def new_check_no_divs(div_data, div_candle_time, first_future_candle, second_future_candle):
    no_divs = []
    for i in prange(len(div_data)):
        if (div_candle_time[i] >= first_future_candle) & (div_candle_time[i] <= second_future_candle):
            # NaN never compares equal to anything, so test with np.isnan
            if not np.isnan(div_data[i]):
                return 1
        elif div_candle_time[i] < first_future_candle:
            pass  # branch body not shown in the original post
        elif div_candle_time[i] > second_future_candle:
            pass  # branch body not shown in the original post
    div_count = 0
    for i in no_divs:
        div_count = div_count + i
    if div_count == 0:
        return np.nan
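One caveat when checking for missing values in loops like these: NaN never compares equal to anything, including itself, so a test written as x == np.nan is always False, and np.isnan is the reliable check. A small illustration in plain NumPy (the variable names are only for demonstration):

```python
import numpy as np

value = np.nan

# Equality against NaN is always False, even when the value is NaN.
eq_test = (value == np.nan)

# np.isnan works on scalars and element-wise on arrays.
is_nan = bool(np.isnan(value))
mask = np.isnan(np.array([1.0, np.nan, 3.0]))

print(eq_test, is_nan, mask.tolist())  # False True [False, True, False]
```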
Before calling the function, the dataframe indexes need to be reset:
dflong = dflong.reset_index()
dfshort = dfshort.reset_index()
dfshorter = dfshorter.reset_index()
Now call the function, using .values to get a NumPy representation of each DataFrame column.
fg_bottom, fg_top = filled_fg(dflong['FG_Top'].values,
Finally, the returned data needs to be added back to the original DataFrame dflong:
dflong['FG_Bottom'] = fg_bottom
dflong['FG_Top'] = fg_top
Speed test results:
Original itertuples solution = 7.641393423080444 seconds
New Numba solution = 0.5985264778137207 seconds
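The window check done by the third loop can also be written without an explicit loop. The following is a minimal sketch, not the code used for the timings above: it assumes the timestamps are sorted in ascending order, and the helper name has_div_in_window and the sample values are hypothetical. np.searchsorted finds the slice of rows that fall inside the window, and np.isnan tests that slice for non-missing price_end values.

```python
import numpy as np

def has_div_in_window(times, price_end, start, end):
    """Return True if any non-NaN price_end lies in start <= time <= end.

    Assumes `times` is sorted ascending.
    """
    lo = np.searchsorted(times, start, side="left")
    hi = np.searchsorted(times, end, side="right")
    return bool(np.any(~np.isnan(price_end[lo:hi])))

# Hypothetical sample data: four candles, one of which has a price_end value.
times = np.array(["2022-10-29T07:00", "2022-10-29T07:05",
                  "2022-10-29T07:10", "2022-10-29T07:15"], dtype="datetime64[m]")
price_end = np.array([np.nan, np.nan, 19165.0, np.nan])

hit = has_div_in_window(times, price_end,
                        np.datetime64("2022-10-29T07:05"), np.datetime64("2022-10-29T07:10"))
miss = has_div_in_window(times, price_end,
                         np.datetime64("2022-10-29T07:00"), np.datetime64("2022-10-29T07:05"))
print(hit, miss)  # True False
```

Because each lookup is a binary search rather than a full scan, the cost per window grows with the logarithm of the number of rows instead of linearly.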

Using mystic to solve a parameter-dependent optimization problem

I have a non-convex quadratic optimization problem of type
x' * B * x,
where all entries of x are between 0 and 1 and the entries sum to 1.
In scipy.optimize I would try to solve this optimization problem via
import numpy as np
from scipy.optimize import minimize, LinearConstraint
N = 2 # dimension 2 for this example
B = np.array([[2,-1],[-1,-1]]) # symmetric, but indefinite matrix
fnc = lambda x: x.T @ B @ x
res = minimize(fnc, x0 = np.ones((N,))/N, bounds = [(0,1) for m in range(N)], constraints = (LinearConstraint(np.ones((N,)),0.99, 1.01)))
So I start with initial guess [0.5, 0.5], I apply bounds (0,1) on each dimension and the equality constraint is handled by a very narrow double inequality constraint.
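As a side note on the scipy formulation, the equality does not have to be approximated by a narrow band: LinearConstraint accepts equal lower and upper bounds. A minimal sketch, in which the wrapper name solve_simplex_qp is hypothetical, that builds everything from B alone so the dimension can vary:

```python
import numpy as np
from scipy.optimize import minimize, LinearConstraint

def solve_simplex_qp(B):
    """Minimize x' B x subject to 0 <= x <= 1 and sum(x) == 1 (local solver)."""
    B = np.asarray(B, dtype=float)
    N = B.shape[0]
    fnc = lambda x: x @ B @ x
    # Equal lower and upper bounds express the equality sum(x) == 1 exactly.
    sum_to_one = LinearConstraint(np.ones((1, N)), 1.0, 1.0)
    return minimize(fnc, x0=np.ones(N) / N, bounds=[(0, 1)] * N,
                    constraints=[sum_to_one])

res = solve_simplex_qp([[2, -1], [-1, -1]])
print(res.x, res.fun)
```

This is still a local method, so for non-convex B in higher dimensions the result depends on the starting point.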
Now I would like to translate this to mystic because scipy does not work well with high-dimensional non-convex settings (which I am interested in).
What I was not able to find out is how to write the constraints in a form such that I only need to supply the matrix B, with variable dimension. All examples in mystic which I found so far do something like this:
def objective(x):
    x0,x1,x2,x3,x4,x5,x6,x7,x8,x9 = x
    return x0**2 + x1**2 + x0*x1 - 14*x0 - 16*x1 + (x2-10)**2 + \
           4*(x3-5)**2 + (x4-3)**2 + 2*(x5-1)**2 + 5*x6**2 + \
           7*(x7-11)**2 + 2*(x8-10)**2 + (x9-7)**2 + 45.0
bounds = [(-10,10)]*10
from mystic.symbolic import generate_constraint, generate_solvers, simplify
from mystic.symbolic import generate_penalty, generate_conditions
equations = """
4.0*x0 + 5.0*x1 - 3.0*x6 + 9.0*x7 - 105.0 <= 0.0
10.0*x0 - 8.0*x1 - 17.0*x6 + 2.0*x7 <= 0.0
-8.0*x0 + 2.0*x1 + 5.0*x8 - 2.0*x9 - 12.0 <= 0.0
3.0*(x0-2)**2 + 4.0*(x1-3)**2 + 2.0*x2**2 - 7.0*x3 - 120.0 <= 0.0
5.0*x0**2 + 8.0*x1 + (x2-6)**2 - 2.0*x3 - 40.0 <= 0.0
0.5*(x0-8)**2 + 2.0*(x1-4)**2 + 3.0*x4**2 - x5 - 30.0 <= 0.0
x0**2 + 2.0*(x1-2)**2 - 2.0*x0*x1 + 14.0*x4 - 6.0*x5 <= 0.0
-3.0*x0 + 6.0*x1 + 12.0*(x8-8)**2 - 7.0*x9 <= 0.0
"""
cf = generate_constraint(generate_solvers(simplify(equations, target=['x5','x3'])))
pf = generate_penalty(generate_conditions(equations))
This is highly verbose and requires manually inserting all the constraints and parameters as a string, which I would like to avoid: the dimensionality and the form of the matrix B will be different each time I run the optimization. What I'd like to have (in a perfect world) would be something like
def objective(x):
    return x @ B @ x # numpy syntax
equations = """
np.ones((1,N)) @ x == 1.0
"""
# constraint in a form which can handle variable dimension of x
Is that possible?
Mystic uses lists, by default, so you have to convert to an array in the cost function. There are a lot of other ways to create constraints without using symbolic strings, and in your particular case, there's one that works out of the box. I'd do something like this:
>>> import mystic as my
>>> import numpy as np
>>> N = 2 # dimension 2 for this example
>>> B = np.array([[2,-1],[-1,-1]]) # symmetric, but indefinite matrix
>>> c = my.constraints.normalized()(lambda x:x)
>>> bounds = [(0,1)]*N
>>> mon = my.monitors.VerboseMonitor(10)
>>> fnc = lambda x: np.array(x).T @ B @ x
>>> res = my.solvers.diffev2(fnc, x0=bounds, npop=10, bounds=bounds, ftol=1e-4, gtol=100, full_output=1, itermon=mon, constraints=c)
Generation 0 has ChiSquare: -0.920151
Generation 10 has ChiSquare: -0.999667
Generation 20 has ChiSquare: -1.000000
Generation 30 has ChiSquare: -1.000000
Generation 40 has ChiSquare: -1.000000
Generation 50 has ChiSquare: -1.000000
Generation 60 has ChiSquare: -1.000000
Generation 70 has ChiSquare: -1.000000
Generation 80 has ChiSquare: -1.000000
Generation 90 has ChiSquare: -1.000000
Generation 100 has ChiSquare: -1.000000
Generation 110 has ChiSquare: -1.000000
STOP("ChangeOverGeneration with {'tolerance': 0.0001, 'generations': 100}")
Optimization terminated successfully.
Current function value: -1.000000
Iterations: 113
Function evaluations: 1140
>>> res[0]
array([1.07421473e-07, 9.99999993e-01])
>>> res[1]
>>> my.scripts.log_reader(mon)
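Since the solver passes candidate solutions as plain lists, the cost function can be generated from B alone, which keeps the setup independent of the dimension. The sketch below is plain NumPy and makes no mystic calls: the helper names make_cost and normalize are hypothetical, and normalize only mimics what a sum-to-one constraint does by rescaling, which is an assumption about the general idea rather than mystic's implementation.

```python
import numpy as np

def make_cost(B):
    """Build the quadratic cost x' B x for a matrix B of any size."""
    B = np.asarray(B, dtype=float)
    def cost(x):
        x = np.asarray(x, dtype=float)  # accept lists as well as arrays
        return float(x @ B @ x)
    return cost

def normalize(x, mass=1.0):
    """Rescale x so that its entries sum to `mass`."""
    x = np.asarray(x, dtype=float)
    return x * (mass / x.sum())

B = np.array([[2, -1], [-1, -1]])
cost = make_cost(B)
x = normalize([0.0, 2.0])        # rescaled to [0.0, 1.0]
print(x.tolist(), cost(x))       # [0.0, 1.0] -1.0
```

The value -1.0 at x = [0, 1] matches the function value reported by the solver output above.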

Pandas accumulate data for linear regression

I am trying to adjust my data so that total_gross is accumulated per day. E.g.
`Created` `total_gross` `total_gross_accumulated`
Day 1 100 100
Day 2 100 200
Day 3 100 300
Day 4 100 400
Any idea how I have to change my code to make total_gross_accumulated available?
Here is my data.
My code:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import linear_model
def load_event_data():
    df = pd.read_csv('sample-data.csv', usecols=['created', 'total_gross'])
    df['created'] = pd.to_datetime(df.created)
    return df.set_index('created').resample('D').sum().fillna(0)
event_data = load_event_data()
X = event_data.index
y = event_data.total_gross
plt.plot(X, y)
A list comprehension is one way to do this, although the built-in cumsum is simpler and faster.
SHORT answer:
This should give you the new column that you want:
n = event_data.shape[0]
# for each i from 1 to n, sum total_gross over the first i rows
total_gross_accumulated = [event_data['total_gross'].iloc[:i].sum() for i in range(1, n+1)]
# add the new variable in the initial pandas dataframe
event_data['total_gross_accumulated'] = total_gross_accumulated
OR faster
event_data['total_gross_accumulated'] = event_data['total_gross'].cumsum()
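To make the equivalence of the two variants concrete, here is a small self-contained check on made-up data that mirrors the Day 1 to Day 4 example from the question (the dates are arbitrary):

```python
import pandas as pd

# Four days with 100 each, as in the question's example table.
event_data = pd.DataFrame(
    {"total_gross": [100, 100, 100, 100]},
    index=pd.date_range("2019-03-01", periods=4, freq="D"),
)

n = len(event_data)
# Variant 1: sum the first i rows for every i (quadratic work).
by_comprehension = [int(event_data["total_gross"].iloc[:i].sum()) for i in range(1, n + 1)]
# Variant 2: running total in a single pass.
event_data["total_gross_accumulated"] = event_data["total_gross"].cumsum()

print(by_comprehension)                                 # [100, 200, 300, 400]
print(event_data["total_gross_accumulated"].tolist())   # [100, 200, 300, 400]
```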
LONG answer:
Full code using your data:
import pandas as pd
import matplotlib.pyplot as plt
def load_event_data():
    df = pd.read_csv('sample-data.csv', usecols=['created', 'total_gross'])
    df['created'] = pd.to_datetime(df.created)
    return df.set_index('created').resample('D').sum().fillna(0)
event_data = load_event_data()
n = event_data.shape[0]
# for each i from 1 to n, sum total_gross over the first i rows
total_gross_accumulated = [event_data['total_gross'].iloc[:i].sum() for i in range(1, n+1)]
# add the new variable to the initial pandas dataframe
event_data['total_gross_accumulated'] = total_gross_accumulated
# total_gross total_gross_accumulated
#2019-03-01 3481810 3481810
#2019-03-02 4690 3486500
#2019-03-03 0 3486500
#2019-03-04 0 3486500
#2019-03-05 0 3486500
#2019-03-06 0 3486500
X = event_data.index
y = event_data.total_gross_accumulated
plt.plot(X, y)
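Once the accumulated column exists, a straight line can be fitted to it. The sketch below uses np.polyfit instead of the sklearn linear_model imported in the question, purely to stay self-contained, and it uses the day number as the regressor; both choices are assumptions, as is the made-up data.

```python
import numpy as np
import pandas as pd

# Made-up data: 100 per day for four days.
event_data = pd.DataFrame(
    {"total_gross": [100, 100, 100, 100]},
    index=pd.date_range("2019-03-01", periods=4, freq="D"),
)
event_data["total_gross_accumulated"] = event_data["total_gross"].cumsum()

# Use the day number (0, 1, 2, ...) as x, since dates are not numeric.
days = np.arange(len(event_data))
y = event_data["total_gross_accumulated"].to_numpy(dtype=float)

# Degree-1 least-squares fit: y is approximately slope * day + intercept.
slope, intercept = np.polyfit(days, y, 1)
print(round(slope, 6), round(intercept, 6))  # 100.0 100.0
```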

How to apply a rolling Kalman Filter to a column in a DataFrame?

How to apply a rolling Kalman Filter to a DataFrame column (without using external data)?
That is, treating each row as a new point in time, so that the descriptive statistics must be updated (in a rolling manner) after each row.
For example, how to apply the Kalman Filter to any column in the below DataFrame?
n = 2000
index = pd.date_range(start='2000-01-01', periods=n)
data = np.random.randn(n, 4)
df = pd.DataFrame(data, columns=list('ABCD'), index=index)
I've seen previous responses (1 and 2); however, they do not apply it to a DataFrame column (and they are not vectorized).
Exploiting some good features of numpy and using pykalman library, and applying the Kalman Filter on column D for a rolling window of 3, we can write:
import pandas as pd
from pykalman import KalmanFilter
import numpy as np
def rolling_window(a, step):
    shape = a.shape[:-1] + (a.shape[-1] - step + 1, step)
    strides = a.strides + (a.strides[-1],)
    return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)
def get_kf_value(y_values):
    kf = KalmanFilter()
    Kc, Ke = kf.em(y_values, n_iter=1).smooth(0)
    return Kc
n = 2000
index = pd.date_range(start='2000-01-01', periods=n)
data = np.random.randn(n, 4)
df = pd.DataFrame(data, columns=list('ABCD'), index=index)
wsize = 3
arr = rolling_window(df.D.values, wsize)
zero_padding = np.zeros(shape=(wsize-1,wsize))
arrst = np.concatenate((zero_padding, arr))
arrkalman = np.zeros(shape=(len(arrst),1))
for i in range(len(arrst)):
    arrkalman[i] = get_kf_value(arrst[i])
kalmandf = pd.DataFrame(arrkalman, columns=['D_kalman'], index=index)
df = pd.concat([df,kalmandf], axis=1)
df.head() should yield something like this:
                   A         B         C         D  D_kalman
2000-01-01 -0.003156 -1.487031 -1.755621 -0.101233  0.000000
2000-01-02  0.172688 -0.767011 -0.965404 -0.131504  0.000000
2000-01-03 -0.025983 -0.388501 -0.904286  1.062163  0.013633
2000-01-04 -0.846606 -0.576383 -1.066489 -0.041979  0.068792
2000-01-05 -1.505048  0.498062  0.619800  0.012850  0.252550
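The rolling_window helper above builds overlapping windows by hand with as_strided. On NumPy 1.20 or later, sliding_window_view produces the same windows with bounds checking and a read-only result. A short comparison on a toy array, which is illustrative only and leaves the Kalman step out:

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided, sliding_window_view

def rolling_window(a, step):
    # Same construction as in the answer: one row per window of length `step`.
    shape = a.shape[:-1] + (a.shape[-1] - step + 1, step)
    strides = a.strides + (a.strides[-1],)
    return as_strided(a, shape=shape, strides=strides)

a = np.arange(5.0)                    # [0., 1., 2., 3., 4.]
manual = rolling_window(a, 3)
builtin = sliding_window_view(a, 3)   # requires NumPy >= 1.20

print(builtin.tolist())  # [[0.0, 1.0, 2.0], [1.0, 2.0, 3.0], [2.0, 3.0, 4.0]]
print(np.array_equal(manual, builtin))  # True
```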

Cleaner pandas apply with function that cannot use pandas.Series and non-unique index

In the following, func represents a function that uses multiple columns (with coupling across the group) and cannot operate directly on pandas.Series. The 0*d['x'] syntax was the lightest I could think of to force the conversion, but I think it's awkward.
Additionally, the resulting pandas.Series (s) still includes the group index, which must be removed before adding as a column to the pandas.DataFrame. The s.reset_index(...) index manipulation seems fragile and error-prone, so I'm curious if it can be avoided. Is there an idiom for doing this?
import pandas
import numpy
df = pandas.DataFrame(dict(i=[1]*8,j=[1]*4+[2]*4,x=list(range(4))*2))
df['y'] = numpy.sin(df['x']) + 1000*df['j']
df = df.set_index(['i','j'])
print('# df\n', df)
def func(d):
    x = numpy.array(d['x'])
    y = numpy.array(d['y'])
    # I want to do math with x,y that cannot be applied to
    # pandas.Series, so explicitly convert to numpy arrays.
    # We have to return an appropriately-indexed pandas.Series
    # in order for it to be admissible as a column in the
    # pandas.DataFrame. Instead of simply "return x + y", we
    # have to make the conversion.
    return 0*d['x'] + x + y
s = df.groupby(df.index).apply(func)
# The Series is still adorned with the (unnamed) group index,
# which will prevent adding as a column of df due to
# Exception: cannot handle a non-unique multi-index!
s = s.reset_index(level=0, drop=True)
print('# s\n', s)
df['z'] = s
print('# df\n', df)
Instead of
0*d['x'] + x + y
you could use
pd.Series(x+y, index=d.index)
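A quick check that the two forms agree, on a small made-up frame with a non-unique index (the frame and variable names are illustrative only). The values and index are identical; the only difference is that the 0*d['x'] form carries over the column name 'x', while the explicit constructor leaves the Series unnamed.

```python
import numpy as np
import pandas as pd

# Small frame with a deliberately non-unique index.
d = pd.DataFrame({"x": [0, 1, 2], "y": [10.0, 20.0, 30.0]}, index=["a", "a", "b"])
x = d["x"].to_numpy()
y = d["y"].to_numpy()

old = 0 * d["x"] + x + y              # index borrowed from d['x']
new = pd.Series(x + y, index=d.index)  # index stated explicitly

print(np.array_equal(old.to_numpy(), new.to_numpy()))  # True
print(old.index.equals(new.index))                     # True
print(old.name, new.name)                              # x None
```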
When using groupby-apply, instead of dropping the group key index using:
s = df.groupby(df.index).apply(func)
s = s.reset_index(level=0, drop=True)
df['z'] = s
you can tell groupby to drop the keys using the keyword parameter group_keys=False:
df['z'] = df.groupby(df.index, group_keys=False).apply(func)
import pandas as pd
import numpy as np
df = pd.DataFrame(dict(i=[1]*8,j=[1]*4+[2]*4,x=list(range(4))*2))
df['y'] = np.sin(df['x']) + 1000*df['j']
df = df.set_index(['i','j'])
def func(d):
    x = np.array(d['x'])
    y = np.array(d['y'])
    return pd.Series(x+y, index=d.index)
df['z'] = df.groupby(df.index, group_keys=False).apply(func)
print(df)
     x            y            z
i j
1 1  0  1000.000000  1000.000000
  1  1  1000.841471  1001.841471
  1  2  1000.909297  1002.909297
  1  3  1000.141120  1003.141120
  2  0  2000.000000  2000.000000
  2  1  2000.841471  2001.841471
  2  2  2000.909297  2002.909297
  2  3  2000.141120  2003.141120