How can I vectorize a function made with SciPy and NumPy?

I created a function with SciPy to get the slope and r values. The idea is to select by "key", "Days", and "Fruit weights" and see how they increase over time. Here is my code (a screenshot of the raw data is not reproduced here).
The function:
from scipy import stats

def fx_stats(df, key):
    df = df[df.key == key].copy()
    slope, intercept, r_value, p_value, std_err = stats.linregress(df.Days, df.Calculated_fruit_weight_g)
    df.name = key
    df.dat = df.Datapoints[df.key == key].unique()[0]
    return df.name, slope, r_value, r_value**2, slope*7, df.dat
The loop:
import pandas as pd

results = []
keys = d.key.unique()
for i in keys:
    results.append(fx_stats(d, i))
results = pd.DataFrame(results)
fruits = results.rename(columns={0: 'key', 1: 'slope_g_per_day', 2: 'Pearson_r', 3: 'R2', 4: 'gain per week', 5: 'Datapoints'})
fruits
Question: How can I do this in a more elegant (shorter) way? I want to use NumPy vectorization to skip the loop, but I haven't been successful.
Expected result: a table with one row per key and the columns named in the rename call above (screenshot not reproduced).
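One way to shorten this without the explicit loop is to let pandas drive the per-key work with groupby().apply(). This is only a sketch, assuming the same DataFrame d and column names (key, Days, Calculated_fruit_weight_g, Datapoints) as above; it is not a true NumPy vectorization of linregress:

from scipy import stats
import pandas as pd

def per_key_stats(g):
    # g is the sub-DataFrame for one key
    slope, intercept, r, p, se = stats.linregress(g.Days, g.Calculated_fruit_weight_g)
    return pd.Series({'slope_g_per_day': slope,
                      'Pearson_r': r,
                      'R2': r ** 2,
                      'gain per week': slope * 7,
                      'Datapoints': g.Datapoints.iloc[0]})

fruits = d.groupby('key').apply(per_key_stats).reset_index()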

Related

How to vectorize an exponential probability function

I believe the code below is a roughly correct implementation of this exponential heatmap function:
import numpy as np

def expfunc(image, landmark, sigma=6):
    # image: array of shape (512, 512); landmark: array of shape (2,)
    a = np.sqrt(np.log(2) / 2) / sigma
    for i in range(image.shape[0]):
        for j in range(image.shape[1]):
            prob = np.exp(-a * (np.abs(i - landmark[0]) + np.abs(j - landmark[1])))
            if prob > 0.01:
                image[i][j] = prob
            else:
                image[i][j] = 0
    return image
My questions are:
How could I vectorize this code?
The probability function assigns a value to every pixel, so how should I handle very small values? For now I am using a threshold of 0.01 and setting everything below it to zero.
Let me know if this works for you:
i = np.arange(image.shape[0])
j = np.arange(image.shape[1])
prob = np.exp(-a*(np.abs(i[:,None]-landmark[0])+np.abs(j-landmark[1])))
image = np.where(prob>0.01, prob, 0)
First compute the array prob for all of the indices i and j. Then prob has the same shape as image, and you can redefine image based on the values of prob using numpy.where.
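Put together as a complete function, this is a sketch under the same assumptions as the original (including the 0.01 cut-off):

import numpy as np

def expfunc_vectorized(image, landmark, sigma=6):
    # image: array of shape (H, W); landmark: array of shape (2,)
    a = np.sqrt(np.log(2) / 2) / sigma
    i = np.arange(image.shape[0])
    j = np.arange(image.shape[1])
    # Broadcasting i (as a column) against j (as a row) yields an (H, W) grid.
    prob = np.exp(-a * (np.abs(i[:, None] - landmark[0]) + np.abs(j - landmark[1])))
    # Keep the same 0.01 threshold as the loop version.
    return np.where(prob > 0.01, prob, 0)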

Best way to find the modes of an array along the columns

Suppose that I have an array
a = np.array([[1,2.5,3,4],[1, 2.5, 3,3]])
I want to find the mode of each column without using stats.mode().
The only way I can think of is the following:
result = np.zeros(a.shape[1])
for i in range(len(result)):
    curr_col = a[:, i]
    result[i] = curr_col[np.argmax(np.unique(curr_col, return_counts=True))]
update:
There is some error in the above code and the correct one should be:
values, counts = np.unique(a[:,i], return_counts = True)
result[i] = values[np.argmax(counts)]
I have to use the loop because np.unique does not return results of a compatible shape for every column, and there is no way to use np.bincount because the dtype is not integer.
If you look at the numpy.unique documentation, this function returns the values and the associated counts (because you specified return_counts=True). A slight modification of your code is necessary to give the correct result. What you are trying to do is find the value associated with the highest count:
import numpy as np

a = np.array([[1, 5, 3, 4], [1, 5, 3, 3], [1, 5, 3, 3]])
result = np.zeros(a.shape[1])
for i in range(len(result)):
    values, counts = np.unique(a[:, i], return_counts=True)
    result[i] = values[np.argmax(counts)]
print(result)
Output:
% python3 script.py
[1. 5. 3. 4.]
Here is some code that compares your solution with the scipy.stats.mode function:
import numpy as np
import scipy.stats as sps
import time

a = np.random.randint(1, 100, (100, 100))

t_start = time.time()
result = np.zeros(a.shape[1])
for i in range(len(result)):
    values, counts = np.unique(a[:, i], return_counts=True)
    result[i] = values[np.argmax(counts)]
print('Timer 1: ', (time.time() - t_start), 's')

t_start = time.time()
result_2 = sps.mode(a, axis=0).mode
print('Timer 2: ', (time.time() - t_start), 's')

print('Matrices are equal!' if np.allclose(result, result_2) else 'Matrices differ!')
Output:
% python3 script.py
Timer 1: 0.002721071243286133 s
Timer 2: 0.003339052200317383 s
Matrices are equal!
I tried several parameter values, and your code is actually faster than the scipy.stats.mode function, so it is probably close to optimal.
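If the goal is just to avoid writing the loop yourself (rather than to speed it up), np.apply_along_axis can express the same per-column computation. A minimal sketch:

import numpy as np

def column_mode(col):
    values, counts = np.unique(col, return_counts=True)
    return values[np.argmax(counts)]

a = np.array([[1, 2.5, 3, 4], [1, 2.5, 3, 3]])
print(np.apply_along_axis(column_mode, 0, a))  # [1.  2.5 3.  3. ] -- ties resolve to the smallest value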

How to concatenate two tensors with intervals in tensorflow?

I want to concatenate two tensors checkerboard-ly in TensorFlow 2, like the examples shown below:
example 1:
a = [[1,1],[1,1]]
b = [[0,0],[0,0]]
concated_a_and_b = [[1,0,1,0],[0,1,0,1]]
example 2:
a = [[1,1,1],[1,1,1],[1,1,1]]
b = [[0,0,0],[0,0,0],[0,0,0]]
concated_a_and_b = [[1,0,1,0,1,0],[0,1,0,1,0,1],[1,0,1,0,1,0]]
Is there a decent way in tensorflow2 to concatenate them like this?
A bit of background for this:
I first split a tensor c with a checkerboard mask into two halves a and b. After some transformations I have to concatenate them back into the original shape and order.
What I mean by checkerboard-ly is illustrated by the examples above.
Step 1: Generate a matrix with alternated values
You can do this by first concatenating into [1, 0] pairs, and then by applying a final reshape.
Step 2: Reverse some rows
I split the matrix into two parts, reverse the second part, and then rebuild the full matrix by picking alternately from the first and second parts.
Code sample:
import math
import numpy as np
import tensorflow as tf

a = tf.ones(shape=(3, 4))
b = tf.zeros(shape=(3, 4))

x = tf.expand_dims(a, axis=-1)
y = tf.expand_dims(b, axis=-1)
paired_ones_zeros = tf.concat([x, y], axis=-1)
alternated_values = tf.reshape(paired_ones_zeros, [-1, a.shape[1] + b.shape[1]])

num_samples = alternated_values.shape[0]
middle = math.ceil(num_samples / 2)
is_num_samples_odd = middle * 2 != num_samples

# Gather first part of the matrix, don't do anything to it
first_elements = tf.gather_nd(alternated_values, [[index] for index in range(middle)])
# Gather second part of the matrix and reverse its elements
second_elements = tf.reverse(tf.gather_nd(alternated_values, [[index] for index in range(middle, num_samples)]), axis=[1])

# Pick alternatively between first and second part of the matrix
indices = np.concatenate([[[index], [index + middle]] for index in range(middle)], axis=0)
if is_num_samples_odd:
    indices = indices[:-1]

output = tf.gather_nd(
    tf.concat([first_elements, second_elements], axis=0),
    indices
)
print(output)
I know this is not a decent way, since it hurts time and space complexity, but it solves the above problem:
def concat(tf1, tf2):
    result = []
    for (index, (tf_item1, tf_item2)) in enumerate(zip(tf1, tf2)):
        item = []
        for (subitem1, subitem2) in zip(tf_item1, tf_item2):
            if index % 2 == 0:
                item.append(subitem1)
                item.append(subitem2)
            else:
                item.append(subitem2)
                item.append(subitem1)
        result.append(item)
    return result
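A shorter alternative, sketched under the same assumptions (a and b are 2-D tensors of equal shape): build both column interleavings with tf.stack and tf.reshape, then pick one per row with tf.where.

import tensorflow as tf

def checkerboard_concat(a, b):
    rows = tf.shape(a)[0]
    # Even rows get [a0, b0, a1, b1, ...], odd rows get [b0, a0, b1, a1, ...].
    ab = tf.reshape(tf.stack([a, b], axis=-1), [rows, -1])
    ba = tf.reshape(tf.stack([b, a], axis=-1), [rows, -1])
    row_is_even = tf.range(rows) % 2 == 0
    return tf.where(row_is_even[:, None], ab, ba)

a = tf.ones((3, 3))
b = tf.zeros((3, 3))
print(checkerboard_concat(a, b))
# [[1. 0. 1. 0. 1. 0.]
#  [0. 1. 0. 1. 0. 1.]
#  [1. 0. 1. 0. 1. 0.]]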

PyMC - variance-covariance matrix estimation

I read the following paper (http://www3.stat.sinica.edu.tw/statistica/oldpdf/A10n416.pdf), where they model the variance-covariance matrix Σ as:
Σ = diag(S)*R*diag(S) (Equation 1 in the paper)
S is the k×1 vector of standard deviations, diag(S) is the diagonal matrix with diagonal elements S, and R is the k×k correlation matrix.
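For concreteness, here is a quick NumPy check of Equation 1; the numbers are the ones used for the simulated data further down, and this is only a sketch, not part of the PyMC model:

import numpy as np

S = np.array([3.0, 5.0, 10.0])        # standard deviations
R = np.array([[1.0, 0.0, -0.1],
              [0.0, 1.0,  0.5],
              [-0.1, 0.5, 1.0]])      # correlation matrix
Sigma = np.diag(S) @ R @ np.diag(S)   # Equation 1: diag(S) * R * diag(S)
print(Sigma)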
How can I implement this using PyMC?
Here is some initial code I wrote:
import numpy as np
import pandas as pd
import pymc as pm

k = 3
prior_mu = np.ones(k)
prior_var = np.eye(k)
prior_corr = np.eye(k)
prior_cov = prior_var * prior_corr * prior_var

# simulated data (n_obs must be defined before the Wishart prior below)
muVector = [10, 5, -2]
varMatrix = np.diag([10, 20, 10])
corrMatrix = np.matrix([[1, .2, 0], [.2, 1, 0], [0, 0, 1]])
cov_matrix = varMatrix * corrMatrix * varMatrix
n_obs = 10000
x = np.random.multivariate_normal(muVector, cov_matrix, n_obs)

post_mu = pm.Normal("returns", prior_mu, 1, size=k)
post_var = pm.Lognormal("variance", np.diag(prior_var), 1, size=k)
post_corr_inv = pm.Wishart("inv_corr", n_obs, np.linalg.inv(prior_corr))
post_cov_matrix_inv = ???  # this is the part I am missing

obs = pm.MvNormal("observed returns", post_mu, post_cov_matrix_inv, observed=True, value=x)
model = pm.Model([obs, post_mu, post_cov_matrix_inv])
mcmc = pm.MCMC(model)
mcmc.sample(5000, 2000, 3)
Thanks
[edit]
I think that can be done using the following:
@pm.deterministic
def post_cov_matrix_inv(post_sdev=post_sdev, post_corr_inv=post_corr_inv):
    return np.diag(post_sdev) * post_corr_inv * np.diag(post_sdev)
Here is the solution for the benefit of someone who stumbles onto this post:
p = 3
prior_mu = np.ones(p)
prior_sdev = np.ones(p)
prior_corr_inv = np.eye(p)

# simulated data
muVector = [10, 5, 1]
sdevVector = [3, 5, 10]
corrMatrix = np.matrix([[1, 0, -.1], [0, 1, .5], [-.1, .5, 1]])
cov_matrix = np.diag(sdevVector) * corrMatrix * np.diag(sdevVector)
n_obs = 2000
x = np.random.multivariate_normal(muVector, cov_matrix, n_obs)

prior_cov = np.diag(prior_sdev) * np.linalg.inv(prior_corr_inv) * np.diag(prior_sdev)

post_mu = pm.Normal("returns", prior_mu, 1, size=p)
post_sdev = pm.Lognormal("sdev", prior_sdev, 1, size=p)
post_corr_inv = pm.Wishart("inv_corr", n_obs, prior_corr_inv)
# post_cov_matrix_inv = pm.Wishart("inv_cov_matrix", n_obs, np.linalg.inv(prior_cov))

# cov2corr below is assumed to rescale a matrix to unit diagonal
# (e.g. statsmodels.stats.moment_helpers.cov2corr)
@pm.deterministic
def post_cov_matrix_inv(post_sdev=post_sdev, post_corr_inv=post_corr_inv, nobs=n_obs):
    post_sdev_inv = post_sdev ** -1
    return np.diag(post_sdev_inv) * cov2corr(post_corr_inv / nobs) * np.diag(post_sdev_inv)

obs = pm.MvNormal("observed returns", post_mu, post_cov_matrix_inv, observed=True, value=x)
model = pm.Model([obs, post_mu, post_sdev, post_corr_inv])
mcmc = pm.MCMC(model)
mcmc.sample(25000, 15000, 1, progress_bar=False)

Numpy: regrid by averaging?

I'm trying to regrid a numpy array onto a new grid. In this specific case, I'm trying to regrid a power spectrum onto a logarithmic grid so that the data are evenly spaced logarithmically for plotting purposes.
Doing this with straight interpolation using np.interp results in some of the original data being ignored entirely. Using digitize gets the result I want, but I have to use some ugly loops to get it to work:
xfreq = np.fft.fftfreq(100)[1:50] # only positive, nonzero freqs
psw = np.arange(xfreq.size) # dummy array for MWE
# new logarithmic grid
logfreq = np.logspace(np.log10(np.min(xfreq)), np.log10(np.max(xfreq)), 100)
inds = np.digitize(xfreq,logfreq)
# interpolation: ignores data *but* populates all points
logpsw = np.interp(logfreq, xfreq, psw)
# so average down where available...
logpsw[np.unique(inds)] = [psw[inds==i].mean() for i in np.unique(inds)]
# the new plot
loglog(logfreq, logpsw, linewidth=0.5, color='k')
Is there a nicer way to accomplish this in numpy? I'd be satisfied with just a replacement of the inline loop step.
You can use bincount() twice to calculate the average value of every bin:
logpsw2 = np.interp(logfreq, xfreq, psw)
counts = np.bincount(inds)
mask = counts != 0
logpsw2[mask] = np.bincount(inds, psw)[mask] / counts[mask]
or use unique(inds, return_inverse=True) and bincount() twice:
logpsw4 = np.interp(logfreq, xfreq, psw)
uinds, inv_index = np.unique(inds, return_inverse=True)
logpsw4[uinds] = np.bincount(inv_index, psw) / np.bincount(inv_index)
Or if you use Pandas:
import pandas as pd
logpsw4 = np.interp(logfreq, xfreq, psw)
s = pd.Series(psw).groupby(inds).mean()
logpsw4[s.index] = s.values