Vectorization of selective cumulative sum - pandas

I have a pandas Series where each element is a list with indices:
series_example = pd.Series([[1, 3, 2], [1, 2]])
In addition, I have an array with values associated to every index:
arr_example = np.array([3., 0.5, 0.25, 0.1])
I want to create a new Series in which each row holds the cumulative sums of the array values selected by that row's indices. For the example above, the output Series would have the following contents:
0    [0.5, 0.6, 0.85]
1    [0.5, 0.75]
dtype: object
The non-vectorized way to do it would be the following:
def non_vector_transform(series, array):
    series_output = pd.Series(np.zeros(len(series)), dtype=object)
    for i in range(len(series)):
        element_list = series[i]
        series_output[i] = []
        acum = 0
        for element in element_list:
            acum += array[element]
            series_output[i].append(acum)
    return series_output
I would like to do this in a vectorized way. Any vectorization magicians around to help me here?

Use Series.apply and np.cumsum:
import numpy as np
import pandas as pd
series_example = pd.Series([[1, 3, 2], [1, 2]])
arr_example = np.array([3., 0.5, 0.25, 0.1])
result = series_example.apply(lambda x: np.cumsum(arr_example[x]))
print(result)
Or if you prefer a for loop:
import numpy as np
import pandas as pd
series_example = pd.Series([[1, 3, 2], [1, 2]])
arr_example = np.array([3., 0.5, 0.25, 0.1])
# Copy only if you do not want to overwrite the original series
result = series_example.copy()
for i, x in result.items():  # Series.iteritems() was removed in pandas 2.0
    result[i] = np.cumsum(arr_example[x])
print(result)
Output:
0    [0.5, 0.6, 0.85]
1    [0.5, 0.75]
dtype: object
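If you want to avoid the per-row Python call entirely, here is a fully vectorized sketch of the same computation (an addition, assuming every row is a list of valid indices): concatenate all indices, take one global cumulative sum, and subtract the running total reached at each row boundary.
import numpy as np
import pandas as pd

series_example = pd.Series([[1, 3, 2], [1, 2]])
arr_example = np.array([3., 0.5, 0.25, 0.1])

# One global cumsum over all selected values; per-row resets are done by
# subtracting the cumulative total at the end of the previous row.
lengths = np.fromiter((len(x) for x in series_example), dtype=int)
flat = np.concatenate(series_example.to_list())
csum = np.cumsum(arr_example[flat])
bounds = np.cumsum(lengths)[:-1]  # split points between rows
offsets = np.repeat(np.concatenate(([0.0], csum[bounds - 1])), lengths)
result = pd.Series(np.split(csum - offsets, bounds), dtype=object)
print(result)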

Related

Python: create (sparse) stacked diagonal block matrix

I need to create a matrix with the form
M=[
[a1, 0, 0],
[0, b1, 0],
[0, 0, c1],
[a2, 0, 0],
[0, b2, 0],
[0, 0, c2],
[a3, 0, 0],
[0, b3, 0],
[0, 0, c3],
...]
where a(i), b(i) and c(i) are [1xp] blocks. The resulting matrix M has the form [3m x 3p]. I am given the input data in the form of 3 matrices [m x p]:
A = [[a1.T, a2.T, a3.T, ...]].T
B = [[b1.T, b2.T, b3.T, ...]].T
C = [[c1.T, c2.T, c3.T, ...]].T
How can I create the matrix M? Ideally it would be sparse, using the scipy.sparse library, but I am struggling even to create it as a dense matrix using numpy. Is there no way around a loop, or at least a list comprehension, in this case?
No need to make it complicated. For your scale, the following executes in less than a second.
import numpy as np
import scipy.sparse
from numpy.random import default_rng
rand = default_rng(seed=0)
m = 70_000
p = 20
abc = rand.random((3, m, p))
M_dense = np.zeros((m, 3, 3*p))
for i in range(3):
    M_dense[:, i, i*p:(i+1)*p] = abc[i, ...]
M_sparse = scipy.sparse.csr_matrix(M_dense.reshape((-1, 3*p)))
print(M_sparse.shape)
(210000, 60)
Far better, though, is to construct the sparse matrix directly. Note the permuted shape of abc.
abc = rand.random((m, 3, p))
data = abc.ravel()
indices = np.tile(np.arange(3*p), m)
indptr = np.arange(0, data.size+1, p)
M_sparse = scipy.sparse.csr_matrix((data, indices, indptr))
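As a quick sanity check (an addition; it materializes the full dense matrix, so use it for verification only), you can rebuild the dense matrix from the permuted abc and compare it against the directly constructed sparse one:
M_check = np.zeros((m, 3, 3*p))
for i in range(3):
    M_check[:, i, i*p:(i+1)*p] = abc[:, i, :]
assert np.array_equal(M_sparse.toarray(), M_check.reshape((-1, 3*p)))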

How do I input a Time Series in spmvg nfoursid

I want to use this algorithm for N4SID model estimation. However, in the documentation the input DataFrame is generated from random samples, whereas I want to input a time-series DataFrame. Calling the NFourSID method leads to a TypeError or ValueError.
Documentation:
https://github.com/spmvg/nfoursid/blob/master/examples/Overview.ipynb
Imported libs:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from nfoursid.kalman import Kalman
from nfoursid.nfoursid import NFourSID
from nfoursid.state_space import StateSpace
import time
import datetime
import math
import scipy as sp
My input time series as a DataFrame (this part works flawlessly):
import yfinance as yfin
from pandas_datareader import data as pdr

yfin.pdr_override()
spy = pdr.get_data_yahoo('AAPL', start='2022-08-23', end='2022-10-24')
spy['Log Return'] = np.log(spy['Adj Close'] / spy['Adj Close'].shift(1))
AAPL = pd.DataFrame(spy['Log Return'])
The input DataFrame as proposed in the documentation:
state_space = StateSpace(A, B, C, D)
for _ in range(NUM_TRAINING_DATAPOINTS):
    input_state = np.random.standard_normal((INPUT_DIM, 1))
    noise = np.random.standard_normal((OUTPUT_DIM, 1)) * NOISE_AMPLITUDE
    state_space.step(input_state, noise)
The call using the input proposed in the documentation:
#----> libs already imported
pd.set_option('display.max_columns', None)
np.random.seed(0)  # reproducible results

NUM_TRAINING_DATAPOINTS = 1000  # create a training-set by simulating a state-space model with this many datapoints
NUM_TEST_DATAPOINTS = 20  # same for the test-set
INPUT_DIM = 3  #----> this probably needs to be adapted to the AAPL dimensions
OUTPUT_DIM = 2
INTERNAL_STATE_DIM = 4  # actual order of the state-space model in the training- and test-set
NOISE_AMPLITUDE = .1  # add noise to the training- and test-set
FIGSIZE = 8

# define system matrices for the state-space model of the training- and test-set
A = np.array([
    [1, .01, 0, 0],
    [0, 1, .01, 0],
    [0, 0, 1, .02],
    [0, -.01, 0, 1],
]) / 1.01
B = np.array([
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
    [0, 1, 1],
]) / 3
C = np.array([
    [1, 0, 1, 1],
    [0, 0, 1, -1],
])
D = np.array([
    [1, 0, 1],
    [0, 1, 0],
]) / 10
#---->maybe I have to input the DataFrame already here at the state-space model:
state_space = StateSpace(A, B, C, D)
for _ in range(NUM_TRAINING_DATAPOINTS):
    input_state = np.random.standard_normal((INPUT_DIM, 1))
    noise = np.random.standard_normal((OUTPUT_DIM, 1)) * NOISE_AMPLITUDE
    state_space.step(input_state, noise)
#----
#---->This is the method with the input DF, in this case the random state-space model
nfoursid = NFourSID(
    state_space.to_dataframe(),  # the state-space model can summarize inputs and outputs as a dataframe
    output_columns=state_space.y_column_names,
    input_columns=state_space.u_column_names,
    num_block_rows=10
)
nfoursid.subspace_identification()
Pasting my DataFrame into the call of the NFourSID method leads to an error:
df2 = pd.DataFrame()
nfoursid = NFourSID(
    output_columns=df2,
    input_columns=AAPL,
    num_block_rows=10
)
TypeError: NFourSID.__init__() missing 1 required positional argument: 'dataframe'
Pasting the DataFrame into state_space led to:
ValueError: Dimensions of u (43, 1) are inconsistent. Expected (3, 1).
and
TypeError: 'DataFrame' object is not callable
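For what it is worth, the TypeError already hints at the shape of the fix: NFourSID expects the DataFrame itself as the first positional argument, while output_columns and input_columns are lists of column names inside that DataFrame, not DataFrames. A minimal sketch under those assumptions (the column list is an assumption, and whether identification without input columns is supported should be checked against the linked Overview notebook):
# Sketch, not a verified fix: the dataframe goes first, positionally.
AAPL_clean = AAPL.dropna()  # shift(1) leaves a NaN in the first row
nfoursid = NFourSID(
    AAPL_clean,
    output_columns=['Log Return'],  # assumed: column name inside AAPL
    num_block_rows=10,
)
nfoursid.subspace_identification()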

Two Pandas dataframes, how to interpolate row-wise using scipy

How can I use scipy interpolate on two dataframes, interpolating row-wise?
For example, if I have:
dfx = pd.DataFrame({"a": [0.1, 0.2, 0.5, 0.6], "b": [3.2, 4.1, 1.1, 2.8]})
dfy = pd.DataFrame({"a": [0.8, 0.2, 1.1, 0.1], "b": [0.5, 1.3, 1.3, 2.8]})
display(dfx)
display(dfy)
And say I want to interpolate for y(x=0.5), how can I get the results into an array that I can put in a new dataframe?
Expected result is: [0.761290323 0.284615385 1.1 -0.022727273]
For example, for the first row, you can see the expected value is 0.761290323:
import matplotlib.pyplot as plt
import scipy.interpolate

x = [0.1, 3.2]  # from dfx, row 0
y = [0.8, 0.5]  # from dfy, row 0
fig, ax = plt.subplots(1, 1)
ax.plot(x, y)
f = scipy.interpolate.interp1d(x, y)
out = f(0.5)
print(out)
I tried the following but received ValueError: x and y arrays must be equal in length along interpolation axis.
f = scipy.interpolate.interp1d(dfx, dfy)
out = np.exp(f(0.5))
print(out)
Since you are looking for linear interpolation, you can do:
def interpolate(val, dfx, dfy):
    t = (dfx['b'] - val) / (dfx['b'] - dfx['a'])
    return dfy['a'] * t + dfy['b'] * (1-t)
interpolate(0.5, dfx, dfy)
Output:
0    0.761290
1    0.284615
2    1.100000
3   -0.022727
dtype: float64
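The formula above hard-codes the two columns a and b. If the frames may have more than two columns, here is a row-wise scipy sketch (an addition; it assumes plain linear interpolation per row, and fill_value='extrapolate' covers targets outside a row's x-range, as in the last row of the example):
import numpy as np
from scipy import interpolate as si

def interp_rows(val, dfx, dfy):
    # build one 1-D interpolant per row and evaluate it at val
    out = np.empty(len(dfx))
    for i, (x, y) in enumerate(zip(dfx.to_numpy(), dfy.to_numpy())):
        f = si.interp1d(x, y, fill_value='extrapolate')
        out[i] = f(val)
    return out

print(interp_rows(0.5, dfx, dfy))
# [ 0.76129032  0.28461538  1.1        -0.02272727]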

Numpy equivalent of pandas replace (dictionary mapping)

I know that working on a numpy array can be quicker than working on a pandas DataFrame.
I am wondering whether there is an equivalent (and quicker) way to do pandas.replace on a numpy array.
In the example below, I have created a dataframe and a dictionary. The dictionary contains the names of columns and their corresponding mappings. I wonder if there is any function that would allow me to feed a dictionary to a numpy array to do the mapping and yield a quicker processing time.
import pandas as pd
import numpy as np
# Dataframe
d = {'col1': [1, 2, 3], 'col2': [4, 5, 6]}
df = pd.DataFrame(data=d)
# dictionary I want to map
d_mapping = {'col1' : {1:2 , 2:1} , 'col2' : {4:1}}
# result using pandas replace
print(df.replace(d_mapping))
# Instead of a pandas dataframe, I want to perform the same operation on a numpy array
df_np = df.to_records(index=False)
You can try np.select(). Whether it is faster likely depends on the number of unique values to replace.
def replace_values(df, d_mapping):
    def replace_col(col):
        # extract numpy array and column name from pd.Series
        col, name = col.values, col.name
        # generate condlist and choicelist:
        # for every key in the mapping, create a boolean mask
        condlist = [col == x for x in d_mapping[name].keys()]
        choicelist = d_mapping[name].values()
        # np.select keeps the existing value (the default) wherever no condition matches
        return np.select(condlist, choicelist, col)
    return df.apply(replace_col)
Usage:
replace_values(df, d_mapping)
I also believe that you can speed up the code above if you use lists/arrays in the mapping instead of dicts, replacing the keys() and values() calls with index lookups (the repeated dict lookups are also expensive):
d_mapping = {"col1": [[1, 2], [2, 1]], "col2": [[4], [1]]}
...
m = d_mapping[name]
condlist = [col == x for x in m[0]]
choicelist = m[1]
...
A mask such as np.isin(col, m[0]) could also be used to restrict the work to values that actually need replacing.
Update: here is the benchmark.
import pandas as pd
import numpy as np
# Dataframe
df = pd.DataFrame({"col1": [1, 2, 3], "col2": [4, 5, 6]})
# dictionary I want to map
d_mapping = {"col1": [[1, 2], [2, 1]], "col2": [[4], [1]]}
d_mapping_2 = {
    col: dict(zip(*replacement)) for col, replacement in d_mapping.items()
}

def replace_values(df, mapping):
    def replace_col(col):
        col, (m0, m1) = col.values, mapping[col.name]
        return np.select([col == x for x in m0], m1, col)
    return df.apply(replace_col)
from timeit import timeit
print("np.select: ", timeit(lambda: replace_values(df, d_mapping), number=5000))
print("df.replace: ", timeit(lambda: df.replace(d_mapping_2), number=5000))
On my 6-year-old laptop this prints:
np.select: 3.6562702230003197
df.replace: 4.714512745998945
So np.select is roughly 20% faster here.
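If you want to drop pandas entirely, another common numpy idiom for value mapping is np.searchsorted over sorted keys. A sketch (an addition; function and variable names are illustrative):
import numpy as np

def map_with_searchsorted(col, keys, values):
    # sort the keys once so searchsorted can locate each element's key
    keys, values = np.asarray(keys), np.asarray(values)
    order = np.argsort(keys)
    keys, values = keys[order], values[order]
    idx = np.clip(np.searchsorted(keys, col), 0, len(keys) - 1)
    mask = keys[idx] == col  # True only where a mapping exists
    out = col.copy()
    out[mask] = values[idx[mask]]
    return out

col = np.array([1, 2, 3])
print(map_with_searchsorted(col, [1, 2], [2, 1]))  # [2 1 3]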

Value from iterative function in pandas

I have a dataframe and would like to set the values in one column through an iterative function, as below.
import pandas as pd
import numpy as np
d = {'col1': [0.4444, 25.4615],
     'col2': [0.5, 0.7],
     'col3': [7, 7]}
df = pd.DataFrame(data=d)
df['col4'] = df['col1'] * df['col3']/4

def func(df):
    a = np.exp(-df['col4'])
    n = 1
    while df['col2'] < a:
        a = a + df['col4'] * 4 / n
        n += 1
    return n

df['col5'] = func(df)
I get an error message "ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()." How can I run the function per row to solve the series/ambiguity problem?
EDIT: Added expected output.
out = {'col1': [0.4444, 25.4615],
       'col2': [0.5, 0.7],
       'col3': [7, 7],
       'col4': [0.7777, 44.557625],
       'col5': [0, 49]}
dfout = pd.DataFrame(out)
I am not sure about the exact values of col4 and col5, but according to the calculation I am trying to replicate, these should be the values.
EDIT2: I had missed n += 1 in the while loop; it is added now.
EDIT3: I am trying to apply
f(0) = e^(-col4)
f(n) = col4 * f(n-1) / n for n > 0
until f(n) > col2, and then return the value of n per row.
Using the information you provided, this seems to be the solution:
import pandas as pd
import numpy as np
d = {'col1': [0.4444, 25.4615],
     'col2': [0.5, 0.7],
     'col3': [7, 7]}
df = pd.DataFrame(data=d)
df['col4'] = df['col1'] * df['col3']/4

def func(df):
    n = 1
    return n

df['col5'] = func(df)
For what it is worth, here is an inefficient solution: after each iteration, keep track of which coefficients have started satisfying the condition.
import pandas as pd
import numpy as np
d = {'col1': [0.4444, 25.4615],
     'col2': [0.5, 0.7],
     'col3': [7, 7]}
df = pd.DataFrame(data=d)
df['col4'] = df['col1'] * df['col3']/4

def func(df):
    a = np.exp(-df['col4'])
    n = 1
    ns = [None] * len(df['col2'])
    status = a > df['col2']
    for i in range(len(status)):
        if ns[i] is None and status[i]:
            ns[i] = n
    # stops when all coefficients satisfy the condition
    while not status.all():
        a = a * df['col4'] * n
        status = a > df['col2']
        n += 1
        for i in range(len(status)):
            if ns[i] is None and status[i]:
                ns[i] = n
    return ns

df['col5'] = func(df)
print(df['col5'])
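To answer the "per row" part of the question directly: df.apply(..., axis=1) runs the function once per row with plain scalars, which sidesteps the ambiguous-truth-value error. A sketch implementing the EDIT3 recursion literally (an addition; note that f(n) = col4 * f(n-1) / n eventually shrinks toward zero, so on the sample data some rows never satisfy f(n) > col2 and would loop forever, hence the max_iter guard, which is also an addition and suggests the intended recursion may differ from EDIT3 as written):
def func_row(row, max_iter=1000):
    f = np.exp(-row['col4'])  # f(0) = e^(-col4)
    n = 0
    while f <= row['col2'] and n < max_iter:
        n += 1
        f = row['col4'] * f / n  # f(n) = col4 * f(n-1) / n
    return n

df['col5'] = df.apply(func_row, axis=1)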