calculate distance matrix with mixed categorical and numerics - pandas

I have a data frame with a mixture of numeric (15 fields) and categorical (5 fields) data.
I can create a complete distance matrix of the numeric fields following the approach in "create distance matrix using own calculation pandas".
I want to include the categorical fields as well.
Using this as a template:
import pandas as pd
import numpy as np
import scipy
from scipy.spatial import distance_matrix
from scipy.spatial.distance import squareform
from scipy.spatial.distance import pdist
df2=pd.DataFrame({'col1':[1,2,3,4],'col2':[5,6,7,8],'col3':['cat','cat','dog','bird']})
df2
# w is a weight vector (taken from the linked question)
pd.DataFrame(squareform(pdist(df2.values, lambda u, v: np.sqrt((w*(u-v)**2).sum()))), index=df2.index, columns=df2.index)
In the distance calculation, I would like to include the test np.where(u[2]==v[2], 0, 10) (and likewise for the other categorical columns).
How do I modify the lambda function to carry out this test as well?
Here, the distance between [0,1]
= sqrt((2-1)^2 + (6-5)^2 + (cat - cat)^2)
= sqrt(1 + 1 + 0)
and the distance between [0,2]
= sqrt((3-1)^2 + (7-5)^2 + (dog - cat)^2)
= sqrt(4 + 4 + 100)
etc.
Can anyone suggest how I can implement this algorithm?

import pandas as pd
import numpy as np
from scipy.spatial.distance import pdist, squareform
df2 = pd.DataFrame({'col1':[1,2,3,4],'col2':[5,6,7,8],'col3':['cat','cat','dog','bird']})
def fun(u, v):
    const = 0 if u[2] == v[2] else 10
    return np.sqrt((u[0]-v[0])**2 + (u[1]-v[1])**2 + const**2)
pd.DataFrame(squareform(pdist(df2.values, fun)), index=df2.index, columns=df2.index)
Result:
0 1 2 3
0 0.000000 1.414214 10.392305 10.862780
1 1.414214 0.000000 10.099505 10.392305
2 10.392305 10.099505 0.000000 10.099505
3 10.862780 10.392305 10.099505 0.000000
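For the real data with 15 numeric and 5 categorical fields, writing out every index in the function gets unwieldy. Below is a minimal sketch of one way to generalize; the column split and the flat mismatch penalty of 10 for every categorical column are assumptions for illustration, and on the toy frame it reproduces the matrix above:
import numpy as np
import pandas as pd
from scipy.spatial.distance import pdist, squareform

df2 = pd.DataFrame({'col1': [1, 2, 3, 4], 'col2': [5, 6, 7, 8], 'col3': ['cat', 'cat', 'dog', 'bird']})

num_cols = df2.select_dtypes(include=np.number).columns   # numeric fields
cat_cols = df2.columns.difference(num_cols)                # categorical fields
penalty = 10                                               # assumed mismatch penalty per categorical column

num = df2[num_cols].to_numpy(dtype=float)
cat = df2[cat_cols].to_numpy()

def mixed_dist(u, v):
    # pdist is run over row indices so the metric can look up numeric and categorical parts separately
    i, j = int(u[0]), int(v[0])
    d_num = num[i] - num[j]
    d_cat = np.where(cat[i] == cat[j], 0, penalty)
    return np.sqrt((d_num**2).sum() + (d_cat**2).sum())

row_ids = np.arange(len(df2)).reshape(-1, 1)
dist_df = pd.DataFrame(squareform(pdist(row_ids, mixed_dist)), index=df2.index, columns=df2.index)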

Dataframe iteration using Numba instead of itertuples() for faster code

My problem
I have three dataframes which I loop through using itertuples.
itertuples worked well for a time; however, now I am running too many iterations for it to be efficient enough.
I'd like to use vectorisation or perhaps Numba, as I have heard that they are both very fast. I've tried to make them work but I can't figure it out.
All three dataframes are Open, High, Low, Close candlestick data with a few other columns, e.g. 'FG_Top'.
The dataframes are
dflong - 15 minute candlestick data
dfshort - 5 minute candlestick data
dfshorter - 1 minute candlestick data
Dataframe creation code as requested in comments
import numpy as np
import pandas as pd
idx15m = ['2022-10-29 06:59:59.999', '2022-10-29 07:14:59.999', '2022-10-29 07:29:59.999', '2022-10-29 07:44:59.999',
'2022-10-29 07:59:59.999', '2022-10-29 08:14:59.999', '2022-10-29 08:29:59.999']
opn15m = [19010, 19204, 19283, 19839, 19892, 20000, 20192]
hgh15m = [19230, 19520, 19921, 19909, 20001, 20203, 21065]
low15m = [18782, 19090, 19245, 19809, 19256, 19998, 20016]
cls15m = [19204, 19283, 19839, 19892, 20000, 20192, 20157]
FG_Bottom = [np.nan, np.nan, np.nan, np.nan, np.nan, 19909, np.nan]
FG_Top = [np.nan, np.nan, np.nan, np.nan, np.nan, 19998, np.nan]
dflong = pd.DataFrame({'Open': opn15m, 'High': hgh15m, 'Low': low15m, 'Close': cls15m, 'FG_Bottom': FG_Bottom, 'FG_Top': FG_Top},
index=idx15m)
idx5m = ['2022-10-29 06:59:59.999', '2022-10-29 07:05:59.999', '2022-10-29 07:10:59.999', '2022-10-29 07:15:59.999',
'2022-10-29 07:20:59.999', '2022-10-29 07:25:59.999', '2022-10-29 07:30:59.999']
opn5m = [19012, 19102, 19165, 19747, 19781, 20009, 20082]
hgh5m = [19132, 19423, 19817, 19875, 20014, 20433, 21068]
low5m = [18683, 19093, 19157, 19758, 19362, 19893, 20018]
cls5m = [19102, 19165, 19747, 19781, 20009, 20082, 20154]
price_end5m = [np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan]
dfshort = pd.DataFrame({'Open': opn5m, 'High': hgh5m, 'Low': low5m, 'Close': cls5m, 'price_end': price_end5m},
index=idx5m)
idx1m = ['2022-10-29 06:59:59.999', '2022-10-29 07:01:59.999', '2022-10-29 07:02:59.999', '2022-10-29 07:03:59.999',
'2022-10-29 07:04:59.999', '2022-10-29 07:05:59.999', '2022-10-29 07:06:59.999']
opn1m = [19010, 19104, 19163, 19748, 19783, 20000, 20087]
hgh1m = [19130, 19420, 19811, 19878, 20011, 20434, 21065]
low1m = [18682, 19090, 19154, 19754, 19365, 19899, 20016]
cls1m = [19104, 19163, 19748, 19783, 20000, 20087, 20157]
price_end1m = [np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan]
dfshorter = pd.DataFrame({'Open': opn1m, 'High': hgh1m, 'Low': low1m, 'Close': cls1m, 'price_end': price_end1m},
index=idx1m)
This gives 3 DataFrames similar to the following example DataFrame:
Example Dataframe
Open High ... FG_Top FG_Bottom
2022-10-29 06:59:59.999 20687.83 20700.46 ... NaN NaN
2022-10-29 07:14:59.999 20686.82 20695.74 ... NaN NaN
2022-10-29 07:29:59.999 20733.62 20745.30 ... 20733.62 20700.46
2022-10-29 07:44:59.999 20741.42 20762.75 ... NaN NaN
2022-10-29 07:59:59.999 20723.86 20777.00 ... NaN NaN
... ... ... ... ... ...
2022-11-10 02:14:59.999 16140.29 16167.09 ... NaN NaN
2022-11-10 02:29:59.999 16119.99 16195.19 ... NaN NaN
2022-11-10 02:44:59.999 16136.63 16263.15 ... NaN NaN
2022-11-10 02:59:59.999 16238.91 16238.91 ... NaN NaN
2022-11-10 03:14:59.999 16210.23 16499.00 ... NaN NaN
Code explanation:
I loop over my first dataframe with the first loop, then loop over it again with a second, nested loop. If statements check certain conditions on each iteration, and if those conditions are met I set some values in the first dataframe to np.nan.
One of the conditions checked in the second loop calls a function which contains a third loop and checks for certain conditions in the other 2 dataframes.
# (assumes `from datetime import timedelta` and a `minutes` value defined earlier in the script)
# First loop
for fg_candle_idx, row in enumerate(dflong.itertuples()):
    top = row.FG_Top
    bottom = row.FG_Bottom
    fg_candle_time = row.Index
    if (pd.notnull(top)):
        # Second loop
        for future_candle_idx, r in enumerate(dflong.itertuples()):
            future_candle_time = r.Index
            next_future_candle = future_candle_time + timedelta(minutes=minutes)
            future_candle_high = r.High
            future_candle_low = r.Low
            future_candle_close = r.Close
            future_candle_open = r.Open
            if future_candle_idx > fg_candle_idx:
                div = r.price_end
                # Check conditions, call function check_no_divs
                if (pd.isnull(check_no_divs(dfshort, future_candle_time, next_future_candle))) & (
                        pd.isnull(check_no_divs(dfshorter, future_candle_time, next_future_candle))) & (
                        pd.isnull(div)):
                    if future_candle_high < bottom:
                        continue
                    elif future_candle_low > top:
                        continue
                    elif (future_candle_close < bottom) & \
                            (future_candle_open > top):
                        dflong.loc[fg_candle_time, 'FG_Bottom'] = np.nan
                        dflong.loc[fg_candle_time, 'FG_Top'] = np.nan
                        continue
                    # Many additional conditions checked...
The following code is the function check_no_divs
def check_no_divs(df, candle_time, next_candle):
    no_divs = []
    # Third Loop
    for idx, row in enumerate(df.itertuples()):
        compare_candle_time = row.Index
        div = row.price_end
        if (compare_candle_time >= candle_time) & (compare_candle_time <= next_candle):
            if pd.notnull(div):
                no_divs.append(True)
            else:
                no_divs.append(False)
        elif compare_candle_time < candle_time:
            continue
        elif compare_candle_time > next_candle:
            break
    if all(no_divs) == False:
        return np.nan
    elif any(no_divs) == True:
        return 1
Ideal Solution
Clearly, using itertuples is far too inefficient for this problem. I think there would be a much faster solution using vectorisation or Numba.
Does anyone know how to make this work?
P.S. I'm still quite new to coding. I think my current code could be made more efficient while still using itertuples, but probably not efficient enough. I'd appreciate it if someone knows a way to greatly increase the speed of this code.
I spent a lot of time researching and testing different code and came up with this solution using numba, which gives a significant speed boost.
First, import the required libraries:
import numpy as np
import pandas as pd
from numba import njit, prange
Then define the function using numba's njit decorator:
@njit
def filled_fg(fg_top, fg_bottom, dflongindex, Open, High, Low, Close, dflongprice_end,
              dfshortprice_end, shortindex, dfshorterprice_end, shorterindex, conflu_top,
              conflu_bottom):
    # First loop
    for i in prange(len(fg_top)):
        top = fg_top[i]
        bottom = fg_bottom[i]
        if top is not np.nan:
            if (bottom - top) > 0:
                fg_top[i] = np.nan
                fg_bottom[i] = np.nan
            # Second loop
            for j in prange(len(fg_top)):
                if j > i:
                    future_candle_time = dflongindex[j]
                    next_future_candle = dflongindex[j + 1]
                    future_candle_high = High[j]
                    future_candle_low = Low[j]
                    future_candle_close = Close[j]
                    future_candle_open = Open[j]
                    long_div = dflongprice_end[j]
                    # Check conditions
                    if ((new_check_no_divs(dfshortprice_end, shortindex, future_candle_time,
                                           next_future_candle)) == np.nan) & ((new_check_no_divs(
                            dfshorterprice_end, shorterindex, future_candle_time,
                            next_future_candle)) == np.nan) & (long_div == np.nan):
                        if future_candle_high < bottom:
                            continue
                        elif future_candle_low > top:
                            continue
                        # Do something when conditions are met...
                        elif (future_candle_close < bottom) & \
                                (future_candle_open > top):
                            fg_bottom[i] = np.nan
                            fg_top[i] = np.nan
                            continue
    return fg_bottom, fg_top  # assumed return (not shown in the original), matching the unpacking in the call below
Define the second function (called new_check_no_divs above), also with numba's njit decorator:
@njit
def new_check_no_divs(div_data, div_candle_time, first_future_candle, second_future_candle):
    no_divs = []
    for i in prange(len(div_data)):
        if (div_candle_time[i] >= first_future_candle) & (div_candle_time[i] <= second_future_candle):
            if div_data[i] is not np.nan:
                return 1
            else:
                no_divs.append(0)
        elif div_candle_time[i] < first_future_candle:
            continue
        elif div_candle_time[i] > second_future_candle:
            break
    div_count = 0
    for i in no_divs:
        div_count = div_count + i
    if div_count == 0:
        return np.nan
Before calling the function, the DataFrame indexes need to be reset:
dflong = dflong.reset_index()
dfshort = dfshort.reset_index()
dfshorter = dfshorter.reset_index()
Now call the function, using .values to pass a NumPy representation of each column.
fg_bottom, fg_top = filled_fg(dflong['FG_Top'].values,
                              dflong['FG_Bottom'].values,
                              dflong['index'].values,
                              dflong['Open'].values,
                              dflong['High'].values,
                              dflong['Low'].values,
                              dflong['Close'].values,
                              dflong['price_end'].values,
                              dfshort['price_end'].values,
                              dfshort['index'].values,
                              dfshorter['price_end'].values,
                              dfshorter['index'].values)
Finally, the returned data needs to be re-added to the original DataFrame dflong:
dflong['FG_Bottom'] = fg_bottom
dflong['FG_Top'] = fg_top
Speed test results:
Original itertuples solution = 7.641393423080444 seconds
New Numba solution = 0.5985264778137207 seconds
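For context, numba's first call also pays one-off JIT compilation time, so steady-state comparisons are best taken from a repeated call. A minimal timing sketch follows; run_itertuples_version and run_numba_version are hypothetical wrappers around the two code paths above, not functions from the original post:
import time

def best_time(fn, *args, repeats=3):
    # Smallest wall-clock time over a few repeats.
    best = float('inf')
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        best = min(best, time.perf_counter() - start)
    return best

# best_time(run_itertuples_version)   # hypothetical wrapper around the itertuples loops
# best_time(run_numba_version)        # hypothetical wrapper around filled_fg(...)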

Using mystic to solve a parameter-dependent optimization problem

I have a non-convex quadratic optimization problem of the form
x' * B * x,
where all entries of x are between 0 and 1 and the sum of all entries equals 1.
In scipy.optimize I would try to solve this optimization problem via
import numpy as np
from scipy.optimize import minimize, LinearConstraint
N = 2 # dimension 2 for this example
B = np.array([[2,-1],[-1,-1]]) # symmetric, but indefinite matrix
fnc = lambda x: x.T @ B @ x
res = minimize(fnc, x0 = np.ones((N,))/N, bounds = [(0,1) for m in range(N)], constraints = (LinearConstraint(np.ones((N,)), 0.99, 1.01)))
So I start with initial guess [0.5, 0.5], I apply bounds (0,1) on each dimension and the equality constraint is handled by a very narrow double inequality constraint.
Now I would like to translate this to mystic because scipy does not work well with high-dimensional non-convex settings (which I am interested in).
What I have not been able to find out is how to write the constraints in a form where I only need to supply the matrix B, with variable dimension. All the mystic examples I have found so far do something like this:
def objective(x):
    x0,x1,x2,x3,x4,x5,x6,x7,x8,x9 = x
    return x0**2 + x1**2 + x0*x1 - 14*x0 - 16*x1 + (x2-10)**2 + \
           4*(x3-5)**2 + (x4-3)**2 + 2*(x5-1)**2 + 5*x6**2 + \
           7*(x7-11)**2 + 2*(x8-10)**2 + (x9-7)**2 + 45.0
bounds = [(-10,10)]*10
from mystic.symbolic import generate_constraint, generate_solvers, simplify
from mystic.symbolic import generate_penalty, generate_conditions
equations = """
4.0*x0 + 5.0*x1 - 3.0*x6 + 9.0*x7 - 105.0 <= 0.0
10.0*x0 - 8.0*x1 - 17.0*x6 + 2.0*x7 <= 0.0
-8.0*x0 + 2.0*x1 + 5.0*x8 - 2.0*x9 - 12.0 <= 0.0
3.0*(x0-2)**2 + 4.0*(x1-3)**2 + 2.0*x2**2 - 7.0*x3 - 120.0 <= 0.0
5.0*x0**2 + 8.0*x1 + (x2-6)**2 - 2.0*x3 - 40.0 <= 0.0
0.5*(x0-8)**2 + 2.0*(x1-4)**2 + 3.0*x4**2 - x5 - 30.0 <= 0.0
x0**2 + 2.0*(x1-2)**2 - 2.0*x0*x1 + 14.0*x4 - 6.0*x5 <= 0.0
-3.0*x0 + 6.0*x1 + 12.0*(x8-8)**2 - 7.0*x9 <= 0.0
"""
cf = generate_constraint(generate_solvers(simplify(equations, target=['x5','x3'])))
pf = generate_penalty(generate_conditions(equations))
This is highly verbose and needs manual insertion of all the constraints, parameters, etc. as a string, which I would like to avoid: the dimensionality and the form of the matrix B will be different each time I need to run the optimization. What I'd like to have (in a perfect world) would be something like:
def objective(x):
    return x @ B @ x  # numpy syntax
equations = """
np.ones((1,N)) @ x == 1.0
"""
# constraint in a form which can handle variable dimension of x
Is that possible?
Mystic uses lists, by default, so you have to convert to an array in the cost function. There are a lot of other ways to create constraints without using symbolic strings, and in your particular case, there's one that works out of the box. I'd do something like this:
>>> import mystic as my
>>> import numpy as np
>>> N = 2 # dimension 2 for this example
>>> B = np.array([[2,-1],[-1,-1]]) # symmetric, but indefinite matrix
>>> c = my.constraints.normalized()(lambda x:x)
>>> bounds = [(0,1)]*N
>>> mon = my.monitors.VerboseMonitor(10)
>>> fnc = lambda x: np.array(x).T @ B @ x
>>> res = my.solvers.diffev2(fnc, x0=bounds, npop=10, bounds=bounds, ftol=1e-4, gtol=100, full_output=1, itermon=mon, constraints=c)
Generation 0 has ChiSquare: -0.920151
Generation 10 has ChiSquare: -0.999667
Generation 20 has ChiSquare: -1.000000
Generation 30 has ChiSquare: -1.000000
Generation 40 has ChiSquare: -1.000000
Generation 50 has ChiSquare: -1.000000
Generation 60 has ChiSquare: -1.000000
Generation 70 has ChiSquare: -1.000000
Generation 80 has ChiSquare: -1.000000
Generation 90 has ChiSquare: -1.000000
Generation 100 has ChiSquare: -1.000000
Generation 110 has ChiSquare: -1.000000
STOP("ChangeOverGeneration with {'tolerance': 0.0001, 'generations': 100}")
Optimization terminated successfully.
Current function value: -1.000000
Iterations: 113
Function evaluations: 1140
>>> res[0]
array([1.07421473e-07, 9.99999993e-01])
>>> res[1]
-1.0000001999996087
>>> my.scripts.log_reader(mon)
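Since the only constraint here is normalization, the same pattern should scale to any dimension without symbolic strings. A minimal sketch, where N = 10 and the random symmetric B are purely illustrative assumptions (not from the original question or answer):
import mystic as my
import numpy as np

N = 10                                   # any dimension
rng = np.random.default_rng(0)
A = rng.standard_normal((N, N))
B = (A + A.T) / 2                        # symmetric, generally indefinite matrix

fnc = lambda x: np.array(x) @ B @ np.array(x)     # convert the list mystic passes in
c = my.constraints.normalized()(lambda x: x)      # sum(x) == 1, as in the answer above
bounds = [(0, 1)] * N

res = my.solvers.diffev2(fnc, x0=bounds, npop=40, bounds=bounds,
                         ftol=1e-6, gtol=200, full_output=1, constraints=c)
print(res[0], res[1])                    # minimizer and objective value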

Pandas accumulate data for linear regression

I am trying to adjust my data so that total_gross per day is accumulated. E.g.:
`Created` `total_gross` `total_gross_accumulated`
Day 1 100 100
Day 2 100 200
Day 3 100 300
Day 4 100 400
Any idea how I have to change my code to make total_gross_accumulated available?
Here is my data.
My code:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import linear_model

def load_event_data():
    df = pd.read_csv('sample-data.csv', usecols=['created', 'total_gross'])
    df['created'] = pd.to_datetime(df.created)
    return df.set_index('created').resample('D').sum().fillna(0)
event_data = load_event_data()
X = event_data.index
y = event_data.total_gross
plt.xticks(rotation=90)
plt.plot(X, y)
plt.show()
A list comprehension is one pythonic way to do this (though, as shown below, pandas' built-in cumsum is simpler and faster).
SHORT answer:
This should give you the new column that you want:
n = event_data.shape[0]
# skip line 0 and start by accumulating from 1 until the end
total_gross_accumulated = [event_data['total_gross'][:i].sum() for i in range(1,n+1)]
# add the new variable in the initial pandas dataframe
event_data['total_gross_accumulated'] = total_gross_accumulated
OR faster
event_data['total_gross_accumulated'] = event_data['total_gross'].cumsum()
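As a quick check against the example in the question (toy numbers only, with a plain Day index for display), cumsum reproduces the running total directly:
import pandas as pd

toy = pd.DataFrame({'total_gross': [100, 100, 100, 100]},
                   index=['Day 1', 'Day 2', 'Day 3', 'Day 4'])
toy['total_gross_accumulated'] = toy['total_gross'].cumsum()
# total_gross_accumulated is now 100, 200, 300, 400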
LONG answer:
Full code using your data:
import pandas as pd
import matplotlib.pyplot as plt

def load_event_data():
    df = pd.read_csv('sample-data.csv', usecols=['created', 'total_gross'])
    df['created'] = pd.to_datetime(df.created)
    return df.set_index('created').resample('D').sum().fillna(0)
event_data = load_event_data()
n = event_data.shape[0]
# skip line 0 and start by accumulating from 1 until the end
total_gross_accumulated = [event_data['total_gross'][:i].sum() for i in range(1,n+1)]
# add the new variable in the initial pandas dataframe
event_data['total_gross_accumulated'] = total_gross_accumulated
Results:
event_data.head(6)
# total_gross total_gross_accumulated
#created
#2019-03-01 3481810 3481810
#2019-03-02 4690 3486500
#2019-03-03 0 3486500
#2019-03-04 0 3486500
#2019-03-05 0 3486500
#2019-03-06 0 3486500
X = event_data.index
y = event_data.total_gross_accumulated
plt.xticks(rotation=90)
plt.plot(X, y)
plt.show()

How to apply a rolling Kalman Filter to a column in a DataFrame?

How to apply a rolling Kalman Filter to a DataFrame column (without using external data)?
That is, pretending that each row is a new point in time, so the descriptive statistics need to be updated (in a rolling manner) after each row.
For example, how to apply the Kalman Filter to any column in the below DataFrame?
import numpy as np
import pandas as pd

n = 2000
index = pd.date_range(start='2000-01-01', periods=n)
data = np.random.randn(n, 4)
df = pd.DataFrame(data, columns=list('ABCD'), index=index)
I've seen previous responses (1 and 2); however, they do not apply it to a DataFrame column (and they are not vectorized).
How to apply a rolling Kalman Filter to a column in a DataFrame?
Exploiting some useful features of numpy and the pykalman library, we can apply the Kalman Filter to column D with a rolling window of 3:
import pandas as pd
from pykalman import KalmanFilter
import numpy as np
def rolling_window(a, step):
    shape = a.shape[:-1] + (a.shape[-1] - step + 1, step)
    strides = a.strides + (a.strides[-1],)
    return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)

def get_kf_value(y_values):
    kf = KalmanFilter()
    Kc, Ke = kf.em(y_values, n_iter=1).smooth(0)
    return Kc
n = 2000
index = pd.date_range(start='2000-01-01', periods=n)
data = np.random.randn(n, 4)
df = pd.DataFrame(data, columns=list('ABCD'), index=index)
wsize = 3
arr = rolling_window(df.D.values, wsize)
zero_padding = np.zeros(shape=(wsize-1,wsize))
arrst = np.concatenate((zero_padding, arr))
arrkalman = np.zeros(shape=(len(arrst),1))
for i in range(len(arrst)):
arrkalman[i] = get_kf_value(arrst[i])
kalmandf = pd.DataFrame(arrkalman, columns=['D_kalman'], index=index)
df = pd.concat([df,kalmandf], axis=1)
df.head() should yield something like this:
A B C D D_kalman
2000-01-01 -0.003156 -1.487031 -1.755621 -0.101233 0.000000
2000-01-02 0.172688 -0.767011 -0.965404 -0.131504 0.000000
2000-01-03 -0.025983 -0.388501 -0.904286 1.062163 0.013633
2000-01-04 -0.846606 -0.576383 -1.066489 -0.041979 0.068792
2000-01-05 -1.505048 0.498062 0.619800 0.012850 0.252550

Cleaner pandas apply with function that cannot use pandas.Series and non-unique index

In the following, func represents a function that uses multiple columns (with coupling across the group) and cannot operate directly on pandas.Series. The 0*d['x'] syntax was the lightest I could think of to force the conversion, but I think it's awkward.
Additionally, the resulting pandas.Series (s) still includes the group index, which must be removed before adding as a column to the pandas.DataFrame. The s.reset_index(...) index manipulation seems fragile and error-prone, so I'm curious if it can be avoided. Is there an idiom for doing this?
import pandas
import numpy
df = pandas.DataFrame(dict(i=[1]*8,j=[1]*4+[2]*4,x=list(range(4))*2))
df['y'] = numpy.sin(df['x']) + 1000*df['j']
df = df.set_index(['i','j'])
print('# df\n', df)
def func(d):
    x = numpy.array(d['x'])
    y = numpy.array(d['y'])
    # I want to do math with x,y that cannot be applied to
    # pandas.Series, so explicitly convert to numpy arrays.
    #
    # We have to return an appropriately-indexed pandas.Series
    # in order for it to be admissible as a column in the
    # pandas.DataFrame. Instead of simply "return x + y", we
    # have to make the conversion.
    return 0*d['x'] + x + y
s = df.groupby(df.index).apply(func)
# The Series is still adorned with the (unnamed) group index,
# which will prevent adding as a column of df due to
# Exception: cannot handle a non-unique multi-index!
s = s.reset_index(level=0, drop=True)
print('# s\n', s)
df['z'] = s
print('# df\n', df)
Instead of
0*d['x'] + x + y
you could use
pd.Series(x+y, index=d.index)
When using groupby-apply, instead of dropping the group key index using:
s = df.groupby(df.index).apply(func)
s = s.reset_index(level=0, drop=True)
df['z'] = s
you can tell groupby to drop the keys using the keyword parameter group_keys=False:
df['z'] = df.groupby(df.index, group_keys=False).apply(func)
import pandas as pd
import numpy as np
df = pd.DataFrame(dict(i=[1]*8,j=[1]*4+[2]*4,x=list(range(4))*2))
df['y'] = np.sin(df['x']) + 1000*df['j']
df = df.set_index(['i','j'])
def func(d):
    x = np.array(d['x'])
    y = np.array(d['y'])
    return pd.Series(x+y, index=d.index)
df['z'] = df.groupby(df.index, group_keys=False).apply(func)
print(df)
yields
x y z
i j
1 1 0 1000.000000 1000.000000
1 1 1000.841471 1001.841471
1 2 1000.909297 1002.909297
1 3 1000.141120 1003.141120
2 0 2000.000000 2000.000000
2 1 2000.841471 2001.841471
2 2 2000.909297 2002.909297
2 3 2000.141120 2003.141120