Python exponential curve fitting in pandas: define function parameters per row

I have a [300 x 11] dataframe where the column headers hold 'x' ([0.75, 1, 1.25, 1.5, 1.75, 2, 2.25, 2.5, 2.75, 3, 3.25]) and each row holds the corresponding 'y' values. Each row can be described by an exponential function of the form a * x**k + b.
The goal is to add three additional columns describing a, k and b for that specific row, just like in: Python curve fitting on pandas dataframe then add coef to new columns
Instead of a polynomial function, my data needs to be described in the format a * x**k + b.
As I cannot find any way to derive these coefficients with np.polyfit, I split my dataframe into separate arrays.
import numpy as np
from scipy.optimize import curve_fit

x = np.array([0.75, 1, 1.25, 1.5, 1.75, 2, 2.25, 2.5, 2.75, 3, 3.25])
y1 = np.array([288.79,238.32,199.42,181.22,165.50,154.74,152.25,152.26,144.81,144.81,144.81])
y2 = np.array([309.92,255.75,214.02,194.48,177.61,166.06,163.40,163.40,155.41,155.41,155.41])
...
y300 = np.array([352.18,290.63,243.20,221.00,201.83,188.71,185.68,185.68,176.60,176.60,176.60])
def func(x, a, k, b):
    return a * (x**k) + b
popt1, pcov = curve_fit(func,x,y1, p0 = (300,-0.5,0))
...
popt300, pcov = curve_fit(func,x,y300, p0 = (300,-0.5,0))
output:
popt1
[107.73727907 -1.545475 123.48621504]
...
popt300
[131.38411712 -1.5454452 150.59522147]
This works when I split all dataframe rows into separate arrays and define popt for every array/row.
To avoid splitting all 300 rows, I would prefer to apply the same methodology as in Python curve fitting on pandas dataframe then add coef to new columns:
my_coep_array = pd.DataFrame(np.polyfit(x, df.values,1)).T
But how can I fit my model a * x**k + b, given that np.polyfit only handles polynomials?
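For what it's worth, np.polyfit cannot express a * x**k + b (it only fits polynomials), but the per-row curve_fit calls can be hidden behind df.apply. A minimal sketch, assuming df is the [300 x 11] frame of y-values and fit_row is a hypothetical helper:
import numpy as np
import pandas as pd
from scipy.optimize import curve_fit

def func(x, a, k, b):
    return a * (x**k) + b

x = np.array([0.75, 1, 1.25, 1.5, 1.75, 2, 2.25, 2.5, 2.75, 3, 3.25])

def fit_row(row):
    # one curve_fit call per row of y-values against the shared x-values
    popt, _ = curve_fit(func, x, row.to_numpy(), p0=(300, -0.5, 0))
    return pd.Series(popt, index=['a', 'k', 'b'])

coefs = df.apply(fit_row, axis=1)  # 300 x 3 frame of coefficients
df = df.join(coefs)                # adds the a, k and b columns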

Related

Plot Frequency of Values of Multiple Columns

I want to create a pandas plot of the frequency of occurrences of values in two columns. The scatter plot is to contain a regression line. The result is a heat map-like plot with a regression line.
First, combine columns 'A' and 'B' into a unique value. In this case both columns are numeric, so I'm combining them arithmetically. Next, use value_counts to create a frequency. Use the pandas scatter plot to create the scatter/bubble/heatmap. Finally, use numpy.polyfit to draw a regression line.
import numpy as np

combined = plotdf['A']*plotdf['B'].nunique() + plotdf['B']  # combine numeric values of columns A and B into one unique value
vcounts = combined.value_counts()  # get value counts of combined values
frequency = combined.map(vcounts)  # look up the count for each row
ax = plotdf.plot(x='A', y='B', c=frequency, s=frequency, colormap='viridis',
                 kind='scatter', figsize=(16, 8), title='Frequency of A and B')
ax.set(xlabel='A', ylabel='B')
x = plotdf['A'].values
y = plotdf['B'].values
m, b = np.polyfit(x, y, 1)  # slope and intercept of the regression line
ax.plot(x, m*x + b, 'r')    # 'r' draws it in red

In pandas, how to apply a function to every column that returns two columns

I would like to apply a function to every column of my grouped multiindex pandas dataframe.
If I had a function my_function() that returns a scalar, I would use
data_grouped = data.groupby(['type'])
data_transf = data_grouped.apply(lambda x: my_function(x))
However, consider another function, my_function_array(), that takes an array (all m rows within one group) as input and returns an n x 2 array as output.
How can I apply this to every column of my grouped dataframe data_grouped? That is, I want to take every column of my grouped data of m rows and replace it by the n x 2 output of my_function_array().
Here's some sample data; there are other groups (types), but I only show one:
type   frame  x           y
F1675  1      77.369027   108.013249
       2      107.784096  22.177883
       3      22.385162   65.024619
       4      65.152003   77.749700
def my_function_array(data_vec, D=2, T=2):
    N = len(data_vec) - (D-1)*T  # length of embedded signal
    embed_data = np.zeros([N, D])
    for di in range(-D//2, D//2):
        embed_data[:, di] = data_vec[np.arange((D//2+di)*T, N+(D//2+di)*T)]
    return embed_data
Applying the function to the second column y,
my_function_array(np.array([108.013249, 22.177883, 65.024619, 77.74970]))
I have
array([[ 65.024619, 108.013249],
[ 77.7497 , 22.177883]])
So, the expected output is
type   frame  x_1        x_2         y_1        y_2
F1675  1      22.385162  77.369027   65.024619  108.013249
       2      65.152003  107.784096  77.749700  22.177883
where x_1 and x_2 are the two columns resulting from x (the naming is not important, can be anything). Note that the groups have become shorter and wider.
I think you need to return a pd.DataFrame:
def my_function_array(data_vec, D=2, T=2):
    # print(data_vec.name)
    N = len(data_vec) - (D-1)*T  # length of embedded signal
    embed_data = np.zeros([N, D])
    for di in range(-D//2, D//2):
        embed_data[:, di] = data_vec[np.arange((D//2+di)*T, N+(D//2+di)*T)]
    return pd.DataFrame(embed_data).add_prefix(data_vec.name)

f = lambda x: pd.concat([my_function_array(x[y]) for y in x], axis=1)
data_transf = data.groupby(['type']).apply(f)
print(data_transf)
                x0          x1         y0          y1
type
F1675 0  22.385162   77.369027  65.024619  108.013249
      1  65.152003  107.784096  77.749700   22.177883

pandas dataframe: apply function to previous rows based on "current" row

I need to add a column to a pandas dataframe where each value is an accumulation of the previous rows. The challenge that I am facing is that the previous values are a function of the "current" value, so I am not able to use cumsum().
The original code uses a double loop which I replaced with apply and the performance improved significantly even for a small dataset, but I feel that there should be a better approach. The code below shows the exact calculation that I need to perform.
from math import log

def apply_formula(row, a, b):  # where a and b are Series
    cum = 0
    for j in range(0, row.name + 1):  # iterate through the previous rows
        c = ((a[j] - a[j - 1]) / row.a) * log(row.b - b[j - 1]) / 2.3
        cum += c  # accumulate
    return cum
df["new"] = df.apply(apply_formula, axis = 1, args = [df.a, df.b])
What pandas functions could help me solve this problem?
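For comparison, here is a minimal vectorized sketch (not the asker's code, and assuming df has a default RangeIndex, as row.name + 1 already requires): build the full matrix of terms with broadcasting and sum its lower triangle.
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [2.0, 3.0, 5.0, 8.0],   # hypothetical sample data
                   'b': [10.0, 20.0, 35.0, 5.0]})

a = df['a'].to_numpy()
b = df['b'].to_numpy()
da = a - np.roll(a, 1)  # a[j] - a[j-1]; np.roll reproduces the loop's wrap-around at j = 0
b_prev = np.roll(b, 1)  # b[j-1], with the same wrap-around

# M[i, j] = ((a[j] - a[j-1]) / a[i]) * log(b[i] - b[j-1]) / 2.3
with np.errstate(invalid='ignore', divide='ignore'):
    M = (da[None, :] / a[:, None]) * np.log(b[:, None] - b_prev[None, :]) / 2.3

df['new'] = np.tril(M).sum(axis=1)  # accumulate over j <= i
# NaN/-inf appear exactly where the loop's log() would fail (non-positive
# arguments); e.g. the last row's j = 0 term is log(b[-1] - b[-1]) = log(0),
# a quirk inherited from the original wrap-around indexing.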

How to efficiently compute an L2 distance between rows of two array using only basic numpy operations? [duplicate]

I have 2 lists of points as numpy.ndarrays, where each row holds the coordinates of one point:
a = np.array([[1,0,0],[0,1,0],[0,0,1]])
b = np.array([[1,1,0],[0,1,1],[1,0,1]])
Here I want to calculate the Euclidean distance between all pairs of points in the 2 lists: for each point p_a in a, I want the distance between it and every point p_b in b. So the result is
d = np.array([[1,sqrt(3),1],[1,1,sqrt(3)],[sqrt(3),1,1]])
How to use matrix multiplication in numpy to compute the distance matrix?
Using direct numpy broadcasting, you can do this:
dist = np.sqrt(((a[:, None, :] - b[None, :, :]) ** 2).sum(-1))
Alternatively, scipy has a routine that will compute this slightly more efficiently (particularly for large matrices)
from scipy.spatial.distance import cdist
dist = cdist(a, b)
I would avoid solutions that depend on factoring-out matrix products (of the form A^2 + B^2 - 2AB), because they can be numerically unstable due to floating point roundoff errors.
To compute the squared Euclidean distance between each pair of rows from the two arrays, X and Y, we need to find:
(Xik-Yjk)**2 = Xik**2 + Yjk**2 - 2*Xik*Yjk
and then sum along k to get the squared distance for the corresponding pair as dist(Xi, Yj).
Using associativity, it reduces to :
dist(Xi,Yj) = sum_k(Xik**2) + sum_k(Yjk**2) - 2*sum_k(Xik*Yjk)
Bringing in matrix-multiplication for the last part, we would have all the distances, like so -
dist(Xi, Yj) = sum_rows(X^2)[i] + sum_rows(Y^2)[j] - 2*matrix_multiplication(X, Y.T)[i, j]
Hence, putting into NumPy terms, we would end up with the euclidean distances for our case with a and b as the inputs, like so -
np.sqrt((a**2).sum(1)[:,None] + (b**2).sum(1) - 2*a.dot(b.T))
Leveraging np.einsum, we could replace the first two summation-reductions with -
np.einsum('ij,ij->i',a,a)[:,None] + np.einsum('ij,ij->i',b,b)
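Putting those pieces together as a self-contained sketch (with a clamp at zero before the square root, since the roundoff mentioned earlier can push a squared distance slightly negative):
import numpy as np

a = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1]], dtype=float)
b = np.array([[1, 1, 0], [0, 1, 1], [1, 0, 1]], dtype=float)

# squared distances: ||a_i||^2 + ||b_j||^2 - 2 * a_i . b_j
sq = (np.einsum('ij,ij->i', a, a)[:, None]
      + np.einsum('ij,ij->i', b, b)[None, :]
      - 2 * a.dot(b.T))
dist = np.sqrt(np.maximum(sq, 0.0))  # clamp tiny negatives from roundoff
# dist -> [[1, sqrt(3), 1], [1, 1, sqrt(3)], [sqrt(3), 1, 1]]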
More info could be found on eucl_dist package's wiki page (disclaimer: I am its author).
If you have two 1-dimensional arrays, x and y, you can convert them into matrices with repeating columns, transpose, and apply the distance formula. This assumes that x and y hold coordinate pairs (the i-th point is (x[i], y[i])). The result is a symmetric distance matrix.
import numpy as np

x = [1, 2, 3]
y = [4, 5, 6]
xx = np.repeat(x, 3, axis=0).reshape(3, 3)  # xx[i, j] = x[i]
yy = np.repeat(y, 3, axis=0).reshape(3, 3)  # yy[i, j] = y[i]
dist = np.sqrt((xx - xx.T)**2 + (yy - yy.T)**2)
dist
Out[135]:
array([[0. , 1.41421356, 2.82842712],
[1.41421356, 0. , 1.41421356],
[2.82842712, 1.41421356, 0. ]])
L2 distance = (a^2 + b^2 - 2ab)^0.5
import numpy as np

a = np.random.randn(5, 3)
b = np.random.randn(2, 3)
a2 = np.sum(np.square(a), axis=1)[..., None]  # ||a_i||^2 as a column vector
b2 = np.sum(np.square(b), axis=1)[None, ...]  # ||b_j||^2 as a row vector
ab = -2*np.dot(a, b.T)                        # -2 * a_i . b_j
dist = np.sqrt(a2 + b2 + ab)                  # broadcasts to a (5, 2) distance matrix
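Continuing from that snippet, a quick sanity check against SciPy's cdist (assuming SciPy is available):
from scipy.spatial.distance import cdist

# should agree up to roundoff; if NaNs ever show up, clamp the negatives
# with np.maximum(..., 0) before taking the square root
assert np.allclose(dist, cdist(a, b))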

Python numpy percentile vs scipy percentileofscore

I am confused as to what I am doing incorrectly.
I have the following code:
import numpy as np
from scipy import stats
df
Out[29]: array([66., 69., 67., 75., 69., 69.])
val = 73.94
z1 = stats.percentileofscore(df, val)
print(z1)
Out[33]: 83.33333333333334
np.percentile(df, z1)
Out[34]: 69.999999999
I was expecting that np.percentile(df, z1) would give me back val = 73.94
I think you're not quite understanding what percentileofscore and percentile actually do. They are not inverses of each other.
From the docs for scipy.stats.percentileofscore:
The percentile rank of a score relative to a list of scores.
A percentileofscore of, for example, 80% means that 80% of the scores in a are below the given score. In the case of gaps or ties, the exact definition depends on the optional keyword, kind.
So when you supply the value 73.94, there are 5 elements of df that fall below that score, and 5/6 gives you your 83.3333% result.
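As a small illustration of the kind keyword (same data as above; a tied score such as 69 shows the difference):
from scipy import stats
import numpy as np

df = np.array([66., 69., 67., 75., 69., 69.])
stats.percentileofscore(df, 69)                 # ~66.67, default kind='rank'
stats.percentileofscore(df, 69, kind='strict')  # ~33.33, % strictly below the score
stats.percentileofscore(df, 69, kind='weak')    # ~83.33, % at or below the score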
Now in the Notes for numpy.percentile:
Given a vector V of length N, the q-th percentile of V is the value q/100 of the way from the minimum to the maximum in a sorted copy of V.
The default interpolation parameter is 'linear' so:
'linear': i + (j - i) * fraction, where fraction is the fractional part of the index surrounded by i and j.
Since you have provided 83.33 as your input parameter, you're looking at a value 83.33/100 of the way from the minimum to the maximum in your array.
If you're interested, you can dig through the NumPy source, but here is a simplified look at the calculation being done:
ap = np.asarray(sorted(df))            # sorted copy of the data
Nx = df.shape[0]
indices = z1 / 100 * (Nx - 1)          # fractional index for the requested percentile
indices_below = np.floor(indices).astype(int)
indices_above = indices_below + 1
weight_above = indices - indices_below
weight_below = 1 - weight_above
x1 = ap[indices_below] * weight_below  # 57.50000000000004
x2 = ap[indices_above] * weight_above  # 12.499999999999956
x1 + x2                                # 70.0