I want to create a pandas plot of the frequency of occurrences of values in two columns, drawn as a heat-map-like scatter plot with a regression line on top.
First, combine columns 'A' and 'B' into a value that is unique for each (A, B) pair; since both columns are numeric here, scaling A by the number of distinct B values and adding B does the job. Next, use value_counts to get the frequency of each pair. Then use the pandas scatter plot to draw the scatter/bubble/heatmap. Finally, use numpy.polyfit to fit a regression line and draw it.
import numpy as np

combined = plotdf['A'] * plotdf['B'].nunique() + plotdf['B']  # unique code per (A, B) pair
vcounts = combined.value_counts()   # how often each pair occurs
frequency = combined.map(vcounts)   # look up the count for each row
ax = plotdf.plot(x='A', y='B', c=frequency, s=frequency, colormap='viridis',
                 kind='scatter', figsize=(16, 8), title='Frequency of A and B')
ax.set(xlabel='A', ylabel='B')
x = plotdf['A'].values
y = plotdf['B'].values
m, b = np.polyfit(x, y, 1)  # slope and intercept of the least-squares line
ax.plot(x, m*x + b, 'r')    # 'r' draws the line in red
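To try this end to end, a made-up DataFrame works; the seed, size, and value ranges below are my assumptions, not from the question:

import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
plotdf = pd.DataFrame({'A': rng.integers(0, 10, 500),
                       'B': rng.integers(0, 10, 500)})
# running the snippet above on this plotdf draws one bubble per (A, B) pair,
# sized and coloured by how often that pair occurs, plus the red fit line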
I would like to apply a function to every column of my grouped multiindex pandas dataframe.
If I had a function my_function() that returns a scalar, I would use
data_grouped = data.groupby(['type'])
data_transf = data_grouped.apply(lambda x: my_function(x))
However, consider another function my_function_array() that takes an array (all m rows within one group) as input and returns an n x 2 array as output, where n is smaller than m.
How can I apply this to every column of my grouped dataframe data_grouped? That is, I want to take every column of my grouped data of m rows and replace it by the n x 2 output of my_function_array().
Here's some sample data; there are other groups (types) but I only show one:
type   frame           x           y
F1675      1   77.369027  108.013249
           2  107.784096   22.177883
           3   22.385162   65.024619
           4   65.152003   77.749700
def my_function_array(data_vec, D=2, T=2):
    N = len(data_vec) - (D-1)*T  # length of embedded signal
    embed_data = np.zeros([N, D])
    for di in range(-D//2, D//2):
        embed_data[:, di] = data_vec[np.arange((D//2+di)*T, N+(D//2+di)*T)]
    return embed_data
Applying the function to the second column y,
my_function_array(np.array([108.013249, 22.177883, 65.024619, 77.74970]))
I get

array([[ 65.024619, 108.013249],
       [ 77.7497  ,  22.177883]])
So, the expected output is
type   frame         x_1          x_2        y_1         y_2
F1675      1   22.385162    77.369027  65.024619  108.013249
           2   65.152003   107.784096  77.749700   22.177883
where x_1 and x_2 are the two columns resulting from x (the naming is not important, can be anything). Note that the groups have become shorter and wider.
I think you need to return a pd.DataFrame:
def my_function_array(data_vec, D=2, T=2):
    # print(data_vec.name)
    N = len(data_vec) - (D-1)*T  # length of embedded signal
    embed_data = np.zeros([N, D])
    for di in range(-D//2, D//2):
        embed_data[:, di] = data_vec[np.arange((D//2+di)*T, N+(D//2+di)*T)]
    # wrap the array in a DataFrame and prefix the new columns with the
    # original column name, e.g. x -> x0, x1
    return pd.DataFrame(embed_data).add_prefix(data_vec.name)
f = lambda x: pd.concat([my_function_array(x[y]) for y in x], axis=1)
data_transf = data.groupby(['type']).apply(f)
print(data_transf)

                 x0          x1         y0          y1
type
F1675 0   22.385162   77.369027  65.024619  108.013249
      1   65.152003  107.784096  77.749700   22.177883
I need to add a column to a pandas dataframe where each value is an accumulation of the previous rows. The challenge that I am facing is that the previous values are a function of the "current" value, so I am not able to use cumsum().
The original code uses a double loop which I replaced with apply and the performance improved significantly even for a small dataset, but I feel that there should be a better approach. The code below shows the exact calculation that I need to perform.
from math import log

def apply_formula(row, a, b):  # a and b are the full columns, passed in as Series
    cum = 0
    for j in range(0, row.name + 1):  # iterate through this row and all previous rows
        # note: at j == 0 this references a[j - 1] and b[j - 1], i.e. index -1
        c = ((a[j] - a[j - 1]) / row.a) * log(row.b - b[j - 1]) / 2.3
        cum += c  # accumulate
    return cum

df["new"] = df.apply(apply_formula, axis=1, args=[df.a, df.b])
What pandas functions could help me solve this problem?
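One vectorized direction, offered as a sketch rather than a definitive answer (my addition, not from the original post; it assumes df has a default RangeIndex, treats the j = 0 lookup of index -1 as wrap-around the way numpy does, and assumes the log argument is positive wherever a term is actually kept):

import numpy as np

a = df.a.to_numpy()
b = df.b.to_numpy()
da = a - np.roll(a, 1)                            # a[j] - a[j-1], wrapping at j = 0
L = np.log(b[:, None] - np.roll(b, 1)[None, :])   # L[i, j] = log(b[i] - b[j-1])
# entries above the diagonal may be invalid (negative log argument) and can
# raise warnings, but np.tril discards them before the row sums
M = np.tril(L * da[None, :] / a[:, None]) / 2.3   # keep only terms with j <= i
df["new"] = M.sum(axis=1)                         # row i accumulates j = 0..i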
I have two lists of points stored as numpy.ndarrays, where each row is the coordinates of one point:
a = np.array([[1,0,0],[0,1,0],[0,0,1]])
b = np.array([[1,1,0],[0,1,1],[1,0,1]])
Here I want to calculate the euclidean distance between all pairs of points from the two lists: for each point p_a in a, the distance to every point p_b in b. So the result is
d = np.array([[1,sqrt(3),1],[1,1,sqrt(3)],[sqrt(3),1,1]])
How to use matrix multiplication in numpy to compute the distance matrix?
Using direct numpy broadcasting, you can do this:
dist = np.sqrt(((a[:, None] - b[None, :]) ** 2).sum(-1))
# a[:, None] is (3, 1, 3) and b[None, :] is (1, 3, 3); the difference broadcasts
# to (3, 3, 3) and summing over the last (coordinate) axis gives the (3, 3)
# distance matrix. The (m, n, d) intermediate can be memory-hungry for large inputs.
Alternatively, scipy has a routine that will compute this slightly more efficiently (particularly for large matrices):
from scipy.spatial.distance import cdist
dist = cdist(a, b)
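As a quick sanity check that the two approaches agree (this check is mine, using the arrays from the question):

import numpy as np
from scipy.spatial.distance import cdist

a = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1]])
b = np.array([[1, 1, 0], [0, 1, 1], [1, 0, 1]])
d1 = np.sqrt(((a[:, None] - b[None, :]) ** 2).sum(-1))
d2 = cdist(a, b)
print(np.allclose(d1, d2))  # True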
I would avoid solutions that depend on factoring-out matrix products (of the form A^2 + B^2 - 2AB), because they can be numerically unstable due to floating point roundoff errors.
To compute the squared euclidean distance between each pair of rows, one from X and one from Y, we need:

(Xik - Yjk)**2 = Xik**2 + Yjk**2 - 2*Xik*Yjk

and then sum along k to get the squared distance for the corresponding pair of points, dist(Xi, Yj).
Since the sum distributes over the three terms, this reduces to:

dist(Xi, Yj) = sum_k(Xik**2) + sum_k(Yjk**2) - 2*sum_k(Xik*Yjk)

Bringing in matrix multiplication for the last term, we get all the distances at once, like so -

dist = sum_rows(X**2) + sum_rows(Y**2) - 2*matrix_multiplication(X, Y.T)
Hence, putting into NumPy terms, we would end up with the euclidean distances for our case with a and b as the inputs, like so -
np.sqrt((a**2).sum(1)[:,None] + (b**2).sum(1) - 2*a.dot(b.T))
Leveraging np.einsum, we could replace the first two summation-reductions with -
np.einsum('ij,ij->i',a,a)[:,None] + np.einsum('ij,ij->i',b,b)
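Putting those pieces together as one function (the np.maximum clamp is my addition, guarding against tiny negative values from the roundoff issue mentioned in the previous answer):

import numpy as np

def pairwise_dist(a, b):
    aa = np.einsum('ij,ij->i', a, a)[:, None]  # squared row norms of a, shape (m, 1)
    bb = np.einsum('ij,ij->i', b, b)           # squared row norms of b, shape (n,)
    sq = aa + bb - 2 * a.dot(b.T)              # broadcasts to the (m, n) squared distances
    return np.sqrt(np.maximum(sq, 0))          # clamp small negatives from cancellation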
More info could be found on eucl_dist package's wiki page (disclaimer: I am its author).
If you have two 1-dimensional arrays, x and y, you can convert each into a matrix with repeating columns, transpose, and apply the distance formula. This assumes x[i] and y[i] are the coordinates of point i. The result is a symmetric distance matrix.
x = [1, 2, 3]
y = [4, 5, 6]
xx = np.repeat(x, 3, axis=0).reshape(3, 3)  # each column of xx is a copy of x
yy = np.repeat(y, 3, axis=0).reshape(3, 3)  # each column of yy is a copy of y
dist = np.sqrt((xx - xx.T)**2 + (yy - yy.T)**2)  # (xx - xx.T)[i, j] == x[i] - x[j]
dist
Out[135]:
array([[0.        , 1.41421356, 2.82842712],
       [1.41421356, 0.        , 1.41421356],
       [2.82842712, 1.41421356, 0.        ]])
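The same pairwise differences can also be formed without materialising the repeated matrices; np.subtract.outer is one alternative (my variant, not part of the answer above):

import numpy as np

x = np.array([1, 2, 3])
y = np.array([4, 5, 6])
# np.subtract.outer(x, x)[i, j] == x[i] - x[j], same as xx - xx.T above
dist = np.sqrt(np.subtract.outer(x, x)**2 + np.subtract.outer(y, y)**2)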
L2 distance = (a^2 + b^2 - 2ab)^0.5
a = np.random.randn(5, 3)
b = np.random.randn(2, 3)
a2 = np.sum(np.square(a), axis=1)[..., None]  # shape (5, 1): squared row norms of a
b2 = np.sum(np.square(b), axis=1)[None, ...]  # shape (1, 2): squared row norms of b
ab = -2 * np.dot(a, b.T)                      # shape (5, 2): cross terms
dist = np.sqrt(a2 + b2 + ab)                  # broadcasts to the (5, 2) distance matrix
I am confused as to what I am doing incorrectly.
I have the following code:
import numpy as np
from scipy import stats
df
Out[29]: array([66., 69., 67., 75., 69., 69.])
val = 73.94
z1 = stats.percentileofscore(df, val)
z1
Out[33]: 83.33333333333334
np.percentile(df, z1)
Out[34]: 69.999999999
I was expecting that np.percentile(df, z1) would give me back val = 73.94
I think you're not quite understanding what percentileofscore and percentile actually do. They are not inverses of each other.
From the docs for scipy.stats.percentileofscore:
The percentile rank of a score relative to a list of scores.
A percentileofscore of, for example, 80% means that 80% of the scores in a are below the given score. In the case of gaps or ties, the exact definition depends on the optional keyword, kind.
So when you supply the value 73.94, there are 5 elements of df that fall below that score, and 5/6 gives you your 83.3333% result.
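A quick way to confirm that count (using the array from the question):

import numpy as np
from scipy import stats

df = np.array([66., 69., 67., 75., 69., 69.])
print((df < 73.94).mean() * 100)            # 83.333..., since 5 of the 6 scores are below 73.94
print(stats.percentileofscore(df, 73.94))   # 83.333...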
Now in the Notes for numpy.percentile:
Given a vector V of length N, the q-th percentile of V is the value q/100 of the way from the minimum to the maximum in a sorted copy of V.
The default interpolation parameter is 'linear' so:
'linear': i + (j - i) * fraction, where fraction is the fractional part of the index surrounded by i and j.
Since you have provided 83.33 as your input parameter, you're looking at a value 83.33/100 of the way from the minimum to the maximum in your array.
If you're interested in digging through the source, you can, but here is a simplified look at the calculation being done:
ap = np.asarray(sorted(df))                    # [66., 67., 69., 69., 69., 75.]
Nx = df.shape[0]                               # 6
indices = z1 / 100 * (Nx - 1)                  # 4.1666...
indices_below = np.floor(indices).astype(int)  # 4
indices_above = indices_below + 1              # 5
weight_above = indices - indices_below         # 0.1666...
weight_below = 1 - weight_above                # 0.8333...
x1 = ap[indices_below] * weight_below          # 57.50000000000004
x2 = ap[indices_above] * weight_above          # 12.499999999999956
x1 + x2
70.0

So the 83.33rd percentile interpolates between the 5th and 6th sorted values and lands at 70.0; percentileofscore and percentile are not inverses of each other, which is why you don't get 73.94 back.