Calculate weighted statistical moments in Python - numpy

I've been looking for a function or package that would allow me to calculate the skew and kurtosis of a distribution in a weighted way, as I have histogram data.
For instance I have the data
import numpy as np
np.array([[1, 2],
[2, 5],
[3, 6],
[4,12],
[5, 1])
where the first column [1,2,3,4,5] are the values and the second column [2,5,6,12,1] are the frequencies of the values.
I have found out how to do the first two moments (mean, standard deviation) in a weighted way using the weighted_avg_and_std function specified in this thread, but I was not quite sure how I could extend this to both the skew and kurtosis, or even the nth statistical moment.
I have found the definitions themselves here and could manually write functions to implement this from scratch, but before I go and do that I was wondering if there were any existing packages or functions that might be able to do this.
Thanks
EDIT:
I figured it out, the following code works (please note that this is for population moments)
skewnewss = np.average(((values-average)/np.sqrt(variance))**3, weights=weights)
and
kurtosis=np.average(((values-average)/np.sqrt(variance))**4-3, weights=weights)

I think you have already listed all the ingredients that you need, following the formulas in the link you provided:
import numpy as np
a = np.array([[1,2],[2,5],[3,6],[4,12],[5,1]])
values, weights = a.T
def n_weighted_moment(values, weights, n):
assert n>0 & (values.shape == weights.shape)
w_avg = np.average(values, weights = weights)
w_var = np.sum(weights * (values - w_avg)**2)/np.sum(weights)
if n==1:
return w_avg
elif n==2:
return w_var
else:
w_std = np.sqrt(w_var)
return np.sum(weights * ((values - w_avg)/w_std)**n)/np.sum(weights)
#Same as np.average(((values - w_avg)/w_std)**n, weights=weights)
Which results in:
for n in range(1,5):
print(f'Moment {n} value is {n_weighted_moment(values, weights, n)}')
Moment 1 value is 3.1923076923076925
Moment 2 value is 1.0784023668639053
Moment 3 value is -0.5962505715592139
Moment 4 value is 2.384432138280637
Notice that while you are calculating the excess kurtosis, the formula implemented for a generic n-moment doesn't account for that.

Taken from here
Here is the code
def weighted_mean(var, wts):
"""Calculates the weighted mean"""
return np.average(var, weights=wts)
def weighted_variance(var, wts):
"""Calculates the weighted variance"""
return np.average((var - weighted_mean(var, wts))**2, weights=wts)
def weighted_skew(var, wts):
"""Calculates the weighted skewness"""
return (np.average((var - weighted_mean(var, wts))**3, weights=wts) /
weighted_variance(var, wts)**(1.5))
def weighted_kurtosis(var, wts):
"""Calculates the weighted skewness"""
return (np.average((var - weighted_mean(var, wts))**4, weights=wts) /
weighted_variance(var, wts)**(2))

Related

Optimize this loss function (any way to vectorize it?)

def get_model_score(preds, actuals, sw):
total_loss = 0
for i in range(len(preds)):
for idx, v in enumerate(actuals[i]):
if v != 0:
total_loss += sw[i] * abs(preds[i][idx] - actuals[i][idx])
loss = total_loss / (sum(sw) * len(preds))
return loss
I have a loss function which essentially is a weighted absolute mean error. However, we can expect every "true" sample to only have one non-zero value ex. [0, 0, 1]. We only want to account for the loss between this non-zero value and the corresponding predicted value.
Take the following examples:
True: [0, 0, 1]
Predicted: [0.5, -0.5, 0.5]
The loss for this sample would simply just be 0.5. (In the actual function though we do also have an array of sample-wise weights as well- "sw")
That being said I'm having trouble figuring out if my function can be vectorized and put into Numpy.
Looks like this is what did the trick
> np.sum(np.abs(actuals[~np.isnan(actuals)] - preds[~np.isnan(actuals)])
> * sw) / (sum(sw) * len(preds))
Actually ended up going with nulls instead of zeros so the condition is ~np.isnan(actuals).
But yea, I think the trick was the use the condition on both the actual and pred array. On the pred array it will grab the correct index based on the condition from the actual array. If this helps for anyone doing something similar.

How to draw a sample from a categorical distribution

I have a 3D numpy array with the probabilities of each category in the last dimension. Something like:
import numpy as np
from scipy.special import softmax
array = np.random.normal(size=(10, 100, 5))
probabilities = softmax(array, axis=2)
How can I sample from a categorical distribution with those probabilities?
EDIT:
Right now I'm doing it like this:
def categorical(x):
return np.random.multinomial(1, pvals=x)
samples = np.apply_along_axis(categorical, axis=2, arr=probabilities)
But it's very slow so I want to know if there's a way to vectorize this operation.
Drawing samples from a given probability distribution is done by building the evaluating the inverse cumulative distribution for a random number in the range 0 to 1. For a small number of discrete categories - like in the question - you can find the inverse using a linear search:
## Alternative test dataset
probabilities[:, :, :] = np.array([0.1, 0.5, 0.15, 0.15, 0.1])
n1, n2, m = probabilities.shape
cum_prob = np.cumsum(probabilities, axis=-1) # shape (n1, n2, m)
r = np.random.uniform(size=(n1, n2, 1))
# argmax finds the index of the first True value in the last axis.
samples = np.argmax(cum_prob > r, axis=-1)
print('Statistics:')
print(np.histogram(samples, bins=np.arange(m+1)-0.5)[0]/(n1*n2))
For the test dataset, a typical test output was:
Statistics:
[0.0998 0.4967 0.1513 0.1498 0.1024]
which looks OK.
If you have many, many categories (thousands), it's probably better to do a bisection search using a numba compiled function.

Knn give more weight to specific feature in distance

I'm using the Kobe Bryant Dataset.
I wish to predict the shot_made_flag with KnnRegressor.
I've used game_date to extract year and month features:
# covert season to years
kobe_data_encoded['season'] = kobe_data_encoded['season'].apply(lambda x: int(re.compile('(\d+)-').findall(x)[0]))
# add year and month using game_date
kobe_data_encoded['year'] = kobe_data_encoded['game_date'].apply(lambda x: int(re.compile('(\d{4})').findall(x)[0]))
kobe_data_encoded['month'] = kobe_data_encoded['game_date'].apply(lambda x: int(re.compile('-(\d+)-').findall(x)[0]))
kobe_data_encoded = kobe_data_encoded.drop(columns=['game_date'])
and I wish to use season, year, month features to give them more weight in the distance function so events with closer date to the current event will be closer neighbors but still maintain reasonable distances to potential other datapoints, so for example I don't wish an event withing the same day would be the closest neighbor just because of the date features but it'll take into account the other features such as shot_range etc..
To give it more weight I've tried to use metric argument with custom distance function but the arguments of the function are just numpy array without column information of pandas so I'm not sure what I can do and how to implement what I'm trying to do.
EDIT:
Using larger weights for date features to find the optimal k with cv of 10 running on k from [1, 100]:
from IPython.display import display
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_val_score
# scaling
min_max_scaler = preprocessing.MinMaxScaler()
scaled_features_df = kobe_data_encoded.copy()
column_names = ['loc_x', 'loc_y', 'minutes_remaining', 'period',
'seconds_remaining', 'shot_distance', 'shot_type', 'shot_zone_range']
scaled_features = min_max_scaler.fit_transform(scaled_features_df[column_names])
scaled_features_df[column_names] = scaled_features
not_classified_df = scaled_features_df[scaled_features_df['shot_made_flag'].isnull()]
classified_df = scaled_features_df[scaled_features_df['shot_made_flag'].notnull()]
X = classified_df.drop(columns=['shot_made_flag'])
y = classified_df['shot_made_flag']
cv = StratifiedKFold(n_splits=10, shuffle=True)
neighbors = [x for x in range(1, 100)]
cv_scores = []
weight = np.ones((X.shape[1],))
weight[[X.columns.get_loc("season"),
X.columns.get_loc("year"),
X.columns.get_loc("month")
]] = 5
weight = weight/weight.sum() #Normalize weights
def my_distance(x, y):
dist = ((x-y)**2)
return np.dot(dist, weight)
for k in neighbors:
print('k: ', k)
knn = KNeighborsClassifier(n_neighbors=k, metric=my_distance)
cv_scores.append(np.mean(cross_val_score(knn, X, y, cv=cv, scoring='roc_auc')))
#optimal K
optimal_k_index = cv_scores.index(min(cv_scores))
optimal_k = neighbors[optimal_k_index]
print('best k: ', optimal_k)
plt.plot(neighbors, cv_scores)
plt.xlabel('Number of Neighbors K')
plt.ylabel('ROC AUC')
plt.show()
Runs really slow, any idea on how to make it faster?
The idea of the weighted features is to find neighbors more close to the data point date to avoid data leakage and cv for finding optimal k.
First, you have to prepare a numpy 1D weight array, specifying weight for each feature. You could do something like:
weight = np.ones((M,)) # M is no of features
weight[[1,7,10]] = 2 # Increase weight of 1st,7th and 10th features
weight = weight/weight.sum() #Normalize weights
You can use kobe_data_encoded.columns to find indexes of season, year, month features in your dataframe to replace 2nd line above.
Now define a distance function, which by guideline have to take two 1D numpy array.
def my_dist(x,y):
global weight #1D array, same shape as x or y
dist = ((x-y)**2) #1D array, same shape as x or y
return np.dot(dist,weight) # a scalar float
And initialize KNeighborsRegressor as:
knn = KNeighborsRegressor(metric=my_dist)
EDIT:
To make things efficient, you can precompute distance matrix, and reuse it in KNN. This should bring in significant speedup by reducing calls to my_dist, since this non-vectorized custom python distance function is quite slow. So now -
dist = np.zeros((len(X),len(X))) #Computing NXN distance matrix
for i in range(len(X)): # You can halve this by using the fact that dist[i,j] = dist[j,i]
for j in range(len(X)):
dist[i,j] = my_dist(X[i],X[j])
for k in neighbors:
print('k: ', k)
knn = KNeighborsClassifier(n_neighbors=k, metric='precomputed') #Note: metric='precomputed'
cv_scores.append(np.mean(cross_val_score(knn, dist, y, cv=cv, scoring='roc_auc'))) #Note: passing dist instead of X
I couldn't test it, so let me know if something isn't alright.
Just add on Shihab's answer regarding distance computation. Can use scipy pdist as suggested in this post, which is faster and more efficient.
from scipy.spatial.distance import pdist, minkowski, squareform
# create the custom weight array
weight = ...
# calculate pairwise distances, using Minkowski norm with custom weights
distances = pdist(X, minkowski, 2, weight)
# reformat the result as a square matrix
distances_as_2d_matrix = squareform(distances)

How to use df.rolling(window, min_periods, win_type='exponential').sum()

I would like to calculate the rolling exponentially weighted mean with df.rolling().mean(). I get stuck at the win_type = 'exponential'.
I have tried other *win_types such as 'gaussian'. I think there would be sth a little different from 'exponential'.
dfTemp.rolling(window=21, min_periods=10, win_type='gaussian').mean(std=1)
# works fine
but when it comes to 'exponential',
dfTemp.rolling(window=21, min_periods=10, win_type='exponential').mean(tau=10)
# ValueError: The 'exponential' window needs one or more parameters -- pass a tuple.
How to use win_type='exponential'... Thanks~~~
I faced same issue and asked it on Russian SO:
Got the following answer:
x.rolling(window=(2,10), min_periods=1, win_type='exponential').mean(std=0.1)
You should pass tau value to window=(2, 10) parameter directly where 10 is a value for tau.
I hope it will help! Thanks to #MaxU
You can easily implement any kind of window by definining your kernel function.
Here's an example for a backward-looking exponential average:
import pandas as pd
import numpy as np
# Kernel function ( backward-looking exponential )
def K(x):
return np.exp(-np.abs(x)) * np.where(x<=0,1,0)
# Exponenatial average function
def exp_average(values):
N = len(values)
exp_weights = list(map(K, np.arange(-N,0) / N ))
return values.dot(exp_weights) / N
# Create a sample DataFrame
df = pd.DataFrame({
'date': [pd.datetime(2020,1,1)]*50 + [pd.datetime(2020,1,2)]*50,
'x' : np.random.randn(100)
})
# Finally, compute the exponenatial moving average using `rolling` and `apply`
df['mu'] = df.groupby(['date'])['x'].rolling(5).apply(exp_average, raw=True).values
df.head(10)
Notice that, if N is fixed, you can significantly reduce the execution time by keeping the weights constant:
N = 10
exp_weights = list(map(K, np.arange(-N,0) / N ))
def exp_average(values):
return values.dot(exp_weights) / N
Short answer: you should use pass tau to the applied function, e.g., rolling(d, win_type='exponential').sum(tau=10). Note that the mean function does not respect the exponential window as expected, so you may need to use sum(tau=10)/window_size to calculate the exponential mean. This is a BUG of current version Pandas (1.0.5).
Full example:
# To calculate the rolling exponential mean
import numpy as np
import pandas as pd
window_size = 10
tau = 5
a = pd.Series(np.random.rand(100))
rolling_mean_a = a.rolling(window_size, win_type='exponential').sum(tau=tau) / window_size
The answer of #Илья Митусов is not correct. With pandas 1.0.5, running the following code raises ValueError: exponential window requires tau:
import pandas as pd
import numpy as np
pd.Series(np.arange(10)).rolling(window=(4, 10), min_periods=1, win_type='exponential').mean(std=0.1)
This code has many problems. First, the 10 in window=(4, 10) is not tau, and will lead to wrong answers. Second, exponential window does not need the parameter std -- only gaussian window needs. Last, the tau should be provided to mean (although mean does not respect the win_type).

Minimizing negative log-likelihood of logistic regression, scipy returning warning: "Desired error not necessarily achieved due to precision loss."

I'm trying to sort out why scipy optimize isn't converging on a solution for the minimum negative-log-likelihood of the logistic regression function (as implemented below).
It seems to converge for smaller data sets, but for the larger data sets scipy returns the warning: "Desired error not necessarily achieved due to precision loss."
I thought this was a well-behaved optimization problem, so I'm anxious that I'm missing an obvious mistake.
Can anyone spot a mistake in my implementation or make a suggestion that I might try?
I'm using the default method, but I have had little luck with the other various methods that miminize allows.
Many thanks!
Quick summary of the implementation. I'm minimizing the following statement:
with the caveat that since b is a constant, I'm using the exponent -(w*x + b). I think I've implemented that function correct, but maybe I'm not seeing something. Since the data are constants with respect to the function being minimized, I just output a function definition that retains the data within it; thus, the function to be minimized only accepts the weights.
The data is a pandas dataframe of the format: rows == samples, columns == attributes, but LAST column == label (0 or 1). I've transformed all the data to make sure it is continuous, and I've normalized it to have a mean of 0 and a standard deviation of 1. I'm also starting with random weights between [0, 0.1], treating the first weight as 'b'.
def get_optimization_func_call(data, sheepda):
#
# Extract pos/neg data without label
pos_df = data[data[LABEL] == 1].as_matrix()[:, :-1]
neg_df = data[data[LABEL] == 0].as_matrix()[:, :-1]
#
# Def evaluation of positive terms by row
def eval_pos_row(pos_row, w, b):
cur_exponent = np.dot(w, pos_row) + b
cur_val = expit(cur_exponent)
if cur_val == 0:
print("pos", cur_exponent)
return (-1 * np.log(cur_val))
#
# Def evaluation of positive terms by row
def eval_neg_row(neg_row, w, b):
cur_exponent = np.dot(w, neg_row) + b
cur_val = 1.0 - expit(cur_exponent)
if cur_val == 0:
print("neg", cur_exponent)
return (-1 * np.log(cur_val))
#
# Define the function used for optimization
def log_likelihood(weights):
#
# Separate weights
w = weights[1:]
b = weights[0]
#
# Ge the norm of weights
w_norm = np.dot(w, w)
#
# Sum over positive examples
pos_sum = np.sum(
np.apply_along_axis(eval_pos_row, 1, pos_df, w, b)
)
neg_sum = np.sum(
np.apply_along_axis(eval_neg_row, 1, neg_df, w, b)
)
#
return (0.5 * w_norm) + sheepda * (pos_sum + neg_sum)
return log_likelihood
w = uniform.rvs(size=20) / 10.0
LL = get_optimization_func_call(clean_test_data, 0.5)
res = minimize(LL, w, options={"maxiter": 1e4, "disp": True})