Knn give more weight to specific feature in distance - pandas
I'm using the Kobe Bryant Dataset.
I wish to predict the shot_made_flag with KnnRegressor.
I've used game_date to extract year and month features:
import re

# convert season to years
kobe_data_encoded['season'] = kobe_data_encoded['season'].apply(lambda x: int(re.compile(r'(\d+)-').findall(x)[0]))
# add year and month using game_date
kobe_data_encoded['year'] = kobe_data_encoded['game_date'].apply(lambda x: int(re.compile(r'(\d{4})').findall(x)[0]))
kobe_data_encoded['month'] = kobe_data_encoded['game_date'].apply(lambda x: int(re.compile(r'-(\d+)-').findall(x)[0]))
kobe_data_encoded = kobe_data_encoded.drop(columns=['game_date'])
I wish to give the season, year and month features more weight in the distance function, so that events with a date closer to the current event become closer neighbors, while the distances to other datapoints stay reasonable. For example, I don't want an event within the same day to be the closest neighbor just because of the date features; the other features, such as shot_range, should still be taken into account.
To give these features more weight I tried the metric argument with a custom distance function, but the arguments of that function are plain numpy arrays without the pandas column information, so I'm not sure how to implement what I'm trying to do.
EDIT:
Using larger weights for the date features, I search for the optimal k with 10-fold cross-validation over k in [1, 100]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn import preprocessing
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_val_score
# scaling
min_max_scaler = preprocessing.MinMaxScaler()
scaled_features_df = kobe_data_encoded.copy()
column_names = ['loc_x', 'loc_y', 'minutes_remaining', 'period',
                'seconds_remaining', 'shot_distance', 'shot_type', 'shot_zone_range']
scaled_features = min_max_scaler.fit_transform(scaled_features_df[column_names])
scaled_features_df[column_names] = scaled_features
not_classified_df = scaled_features_df[scaled_features_df['shot_made_flag'].isnull()]
classified_df = scaled_features_df[scaled_features_df['shot_made_flag'].notnull()]
X = classified_df.drop(columns=['shot_made_flag'])
y = classified_df['shot_made_flag']
cv = StratifiedKFold(n_splits=10, shuffle=True)
neighbors = list(range(1, 101))  # k = 1, ..., 100
cv_scores = []
weight = np.ones((X.shape[1],))
weight[[X.columns.get_loc("season"),
        X.columns.get_loc("year"),
        X.columns.get_loc("month")
        ]] = 5
weight = weight/weight.sum() #Normalize weights
def my_distance(x, y):
    dist = ((x-y)**2)
    return np.dot(dist, weight)
for k in neighbors:
    print('k: ', k)
    knn = KNeighborsClassifier(n_neighbors=k, metric=my_distance)
    cv_scores.append(np.mean(cross_val_score(knn, X, y, cv=cv, scoring='roc_auc')))
#optimal K
optimal_k_index = cv_scores.index(max(cv_scores))  # higher ROC AUC is better
optimal_k = neighbors[optimal_k_index]
print('best k: ', optimal_k)
plt.plot(neighbors, cv_scores)
plt.xlabel('Number of Neighbors K')
plt.ylabel('ROC AUC')
plt.show()
This runs really slowly. Any idea how to make it faster?
The idea behind the weighted features is to find neighbors that are closer in date to the data point, in order to avoid data leakage, and to use cross-validation to find the optimal k.
First, you have to prepare a 1D numpy weight array that specifies a weight for each feature. You could do something like:
weight = np.ones((M,)) # M is the number of features
weight[[1,7,10]] = 2 # Increase the weight of the features at indices 1, 7 and 10
weight = weight/weight.sum() # Normalize weights
You can use kobe_data_encoded.columns to find the indexes of the season, year and month features in your dataframe and use them in the second line above.
Now define a distance function, which must take two 1D numpy arrays and return a scalar.
def my_dist(x,y):
    global weight #1D array, same shape as x or y
    dist = ((x-y)**2) #1D array, same shape as x or y
    return np.dot(dist,weight) # a scalar float
And initialize KNeighborsRegressor as:
knn = KNeighborsRegressor(metric=my_dist)
EDIT:
To make things efficient, you can precompute the distance matrix and reuse it in KNN. This should bring a significant speedup by reducing the calls to my_dist, since this non-vectorized custom Python distance function is quite slow. So now:
X_arr = X.to_numpy() # plain numpy array, so that X_arr[i] is row i
dist = np.zeros((len(X_arr),len(X_arr))) # Computing the NxN distance matrix
for i in range(len(X_arr)): # You can halve this by using the fact that dist[i,j] = dist[j,i]
    for j in range(len(X_arr)):
        dist[i,j] = my_dist(X_arr[i],X_arr[j])
for k in neighbors:
    print('k: ', k)
    knn = KNeighborsClassifier(n_neighbors=k, metric='precomputed') #Note: metric='precomputed'
    cv_scores.append(np.mean(cross_val_score(knn, dist, y, cv=cv, scoring='roc_auc'))) #Note: passing dist instead of X
I couldn't test it, so let me know if something isn't alright.
Just adding to Shihab's answer regarding the distance computation: you can use scipy's pdist, as suggested in this post, which is faster and more efficient.
from scipy.spatial.distance import pdist, squareform
# create the custom weight array
weight = ...
# calculate pairwise distances, using the Minkowski norm (p=2) with custom weights
distances = pdist(X, 'minkowski', p=2, w=weight)
# reformat the result as a square matrix
distances_as_2d_matrix = squareform(distances)
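If the custom metric is only a weighted squared Euclidean distance, as my_dist is, another option is to rescale each feature by the square root of its weight and keep the default Euclidean metric. The following is a minimal sketch under that assumption, not something tested on the Kobe data: the toy array and the positions of the heavier features are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((6, 4))              # toy data: 6 samples, 4 features (made up)

weight = np.ones(X.shape[1])
weight[[1, 3]] = 5                  # hypothetical positions of the "date" features
weight = weight / weight.sum()      # normalize weights

def my_dist(x, y):
    # Weighted squared Euclidean distance, as in the answer above.
    return np.dot((x - y) ** 2, weight)

# Reference: pairwise distances through the Python-level function.
n = len(X)
slow = np.array([[my_dist(X[i], X[j]) for j in range(n)] for i in range(n)])

# Equivalent: scale each feature by sqrt(weight), then use the plain
# squared Euclidean distance on the scaled data.
X_scaled = X * np.sqrt(weight)
diff = X_scaled[:, None, :] - X_scaled[None, :, :]
fast = (diff ** 2).sum(axis=-1)

print(np.allclose(slow, fast))
```

Because the square root is monotonic, the ranking of neighbors is the same for the squared and the unsquared distance, so fitting KNeighborsClassifier on X_scaled with its default metric should pick the same neighbors (ties aside) without calling a Python function for every pair.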
Related
Is nx.eigenvector_centrality_numpy() using the Arnoldi iteration instead of the basic power method?
Since nx.eigenvector_centrality_numpy() uses ARPACK, does that mean it uses the Arnoldi iteration instead of the basic power method? I ask because when I compute the centrality manually with the basic power method, my result differs from the result of nx.eigenvector_centrality_numpy(). Can someone explain this to me?
To make it clearer, here is my code, the result I got from the function, and the result of my manual computation.
import networkx as nx
G = nx.DiGraph()
G.add_edge('a', 'b', weight=4)
G.add_edge('b', 'a', weight=2)
G.add_edge('b', 'c', weight=2)
G.add_edge('b','d', weight=2)
G.add_edge('c','b', weight=2)
G.add_edge('d','b', weight=2)
centrality = nx.eigenvector_centrality_numpy(G, weight='weight')
centrality
The result:
{'a': 0.37796447300922725, 'b': 0.7559289460184545, 'c': 0.3779644730092272, 'd': 0.3779644730092272}
Below is the code from Power Method Python Program, which I modified slightly:
# Power Method to Find Largest Eigen Value and Eigen Vector
# Importing NumPy Library
import numpy as np
import sys

# Reading order of matrix
n = int(input('Enter order of matrix: '))

# Making numpy array of n x n size and initializing
# to zero for storing matrix
a = np.zeros((n,n))

# Reading matrix
print('Enter Matrix Coefficients:')
for i in range(n):
    for j in range(n):
        a[i][j] = float(input( 'a['+str(i)+']['+ str(j)+']='))

# Making numpy array n x 1 size and initializing to zero
# for storing initial guess vector
x = np.zeros((n))

# Reading initial guess vector
print('Enter initial guess vector: ')
for i in range(n):
    x[i] = float(input( 'x['+str(i)+']='))

# Reading tolerable error
tolerable_error = float(input('Enter tolerable error: '))

# Reading maximum number of steps
max_iteration = int(input('Enter maximum number of steps: '))

# Power Method Implementation
lambda_old = 1.0
condition = True
step = 1
while condition:
    # Multiplying a and x
    ax = np.matmul(a,x)

    # Finding new Eigen value and Eigen vector
    x = ax/np.linalg.norm(ax)
    lambda_new = np.vdot(ax,x)

    # Displaying Eigen value and Eigen Vector
    print('\nSTEP %d' %(step))
    print('----------')
    print('Eigen Value = %0.5f' %(lambda_new))
    print('Eigen Vector: ')
    for i in range(n):
        print('%0.5f\t' % (x[i]))

    # Checking maximum iteration
    step = step + 1
    if step > max_iteration:
        print('Not convergent in given maximum iteration!')
        break

    # Calculating error
    error = abs(lambda_new - lambda_old)
    print('errror='+ str(error))
    lambda_old = lambda_new
    condition = error > tolerable_error
I used the same matrix and the result:
STEP 99
----------
Eigen Value = 3.70328
Eigen Vector:
0.51640
0.77460
0.25820
0.25820
errror=0.6172133998483682
STEP 100
----------
Eigen Value = 4.32049
Eigen Vector:
0.71714
0.47809
0.35857
0.35857
Not convergent in given maximum iteration!
I also tried computing it with my calculator, and I know it is not convergent because |lambda1|=|lambda2|=4. I need to understand the theory behind nx.eigenvector_centrality_numpy() properly so I can describe it correctly in my thesis. Help me, please.
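The non-convergence described in the question can be checked numerically. The sketch below assumes the node order a, b, c, d and that the centrality is the dominant eigenvector of the transposed weighted adjacency matrix (that is, it is built from in-edges). It shows that the two largest eigenvalue magnitudes are both 4, which is why plain power iteration oscillates. As one illustrative workaround, not taken from networkx itself, it then runs power iteration on the shifted matrix M + I, whose eigenvalues are shifted by 1 while its eigenvectors stay the same.

```python
import numpy as np

# Weighted adjacency matrix of the graph in the question, node order a, b, c, d.
# A[i, j] is the weight of the edge i -> j.
A = np.array([
    [0, 4, 0, 0],
    [2, 0, 2, 2],
    [0, 2, 0, 0],
    [0, 2, 0, 0],
], dtype=float)

# Assumption: the centrality uses in-edges, i.e. eigenvectors of A transposed.
M = A.T

# The two largest eigenvalue magnitudes are equal (+4 and -4), so the plain
# power method has no single dominant eigenvalue to converge to.
eigenvalues = np.linalg.eigvals(M)
largest_two = np.sort(np.abs(eigenvalues))[-2:]

def power_iteration(matrix, n_steps=200):
    """Basic power method with Euclidean normalization at every step."""
    x = np.ones(matrix.shape[0])
    for _ in range(n_steps):
        x = matrix @ x
        x /= np.linalg.norm(x)
    return x

# Shifting by the identity moves the eigenvalues to 5, -3, 1, 1, so the
# eigenvalue 5 is strictly dominant and the iteration converges.
shifted = power_iteration(M + np.eye(4))

print(largest_two)
print(shifted)
```

The vector obtained from the shifted matrix agrees with the values reported by nx.eigenvector_centrality_numpy() in the question, which suggests the difference comes from the tie between the eigenvalues +4 and -4 rather than from a different definition of centrality.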
Cluster groups continuously instead of discrete - python
I'm trying to cluster a group of points in a probabilistic manner. In the code below, I have a single set of xy points, which are recorded in X and Y. I want to cluster them into groups using a reference point, which is given in X2 and Y2.
With the help of an answer, the current approach is to measure the distance from the reference point and group using k-means. Although this provides a method to cluster using the reference point, the hard cutoff and the adherence to k clusters make it somewhat unsuitable when dealing with numerous datasets. For instance, the number of clusters needed for this example is probably 3, but a separate example may differ. I'd have to manually go through and alter k every time.
Given the non-probabilistic nature of k-means, a separate option could be GMM. Is it possible to account for the reference point when modelling? If I attach the output below, the underlying model isn't clustering as I'm hoping for. If I look at the probability that each point is within a group, it's not clustered as I'd hoped.
With this I run into the same problem of manually altering the number of components. Because the points are distributed randomly, using "AIC" or "BIC" to select the appropriate number of clusters doesn't work. There is no optimal number.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import mixture
from sklearn.cluster import KMeans

df = pd.DataFrame({
    'X' : [-1.0,-1.0,0.5,0.0,0.0,2.0,3.0,5.0,0.0,-2.5,2.0,8.0,-10.5,15.0,-20.0,-32.0,-20.0,-20.0,-10.0,20.5,0.0,20.0,-30.0,-15.0,20.0,-15.0,-10.0],
    'Y' : [0.0,1.0,-0.5,0.5,-0.5,0.0,1.0,4.0,5.0,-3.5,-2.0,-8.0,-0.5,-10.5,-20.5,0.0,16.0,-15.0,5.0,13.5,20.0,-20.0,2.0,-17.5,-15,19.0,20.0],
    'X2' : [0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0],
    'Y2' : [0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0],
    })
k-means:
# distance of every point from the reference point (X2, Y2)
df['distance'] = np.sqrt((df['X'] - df['X2'])**2 + (df['Y'] - df['Y2'])**2)
model = KMeans(n_clusters = 2)
model_data = np.array([df['distance'].values, np.zeros(df.shape[0])])
model.fit(model_data.T)
df['group'] = model.labels_
plt.scatter(df['X'], df['Y'], c = model.labels_, cmap = 'bwr', marker = 'o', s = 5)
plt.scatter(df['X2'], df['Y2'], c ='k', marker = 'o', s = 5)
GMM:
Y_sklearn = df[['X','Y']].values
gmm = mixture.GaussianMixture(n_components=3, covariance_type='diag', random_state=42)
gmm.fit(Y_sklearn)
labels = gmm.predict(Y_sklearn)
df['group'] = labels
plt.scatter(Y_sklearn[:, 0], Y_sklearn[:, 1], c=labels, s=5, cmap='viridis');
plt.scatter(df['X2'], df['Y2'], c='red', marker = 'x', edgecolor = 'k', s = 5, zorder = 10)
proba = pd.DataFrame(gmm.predict_proba(Y_sklearn).round(2)).reset_index(drop = True)
df_pred = pd.concat([df, proba], axis = 1)
In my opinion, if you want to define clusters as "regions where points are close to each other", you should use DBSCAN. This clustering algorithm finds clusters by looking at regions where points are close to each other (i.e. dense regions) and that are separated from other clusters by regions where points are less dense. This algorithm can categorize points as noise (outliers). Outliers are labelled -1. They are points that do not belong to any cluster.
Here is some code to perform DBSCAN clustering and to insert the cluster labels as a new categorical column in the Y_sklearn DataFrame. It also prints how many clusters and how many outliers are found.
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN

Y_sklearn = df.loc[:, ["X", "Y"]].copy()
n_points = Y_sklearn.shape[0]
dbs = DBSCAN()
labels_clusters = dbs.fit_predict(Y_sklearn)

#Number of found clusters (outliers are not considered a cluster).
n_clusters = labels_clusters.max() + 1
print(f"DBSCAN found {n_clusters} clusters in dataset with {n_points} points.")

#Number of found outliers (possibly no outliers found).
n_outliers = np.count_nonzero((labels_clusters == -1))
if n_outliers:
    print(f"{n_outliers} outliers were found.\n")
else:
    print(f"No outliers were found.\n")

#Add cluster labels as a new column to original DataFrame.
Y_sklearn["cluster"] = labels_clusters
#Setting `cluster` column to Categorical dtype makes seaborn function properly treat
#cluster labels as categorical, and not numerical.
Y_sklearn["cluster"] = Y_sklearn["cluster"].astype("category")
If you want to plot the results, I suggest you use Seaborn. Here is some code to plot the points of the Y_sklearn DataFrame and color them by the cluster they belong to. I also define a new color palette, which is just the default Seaborn color palette, but where outliers (with label -1) will be in black.
import matplotlib.pyplot as plt
import seaborn as sns

name_palette = "tab10"
palette = sns.color_palette(name_palette)
if n_outliers:
    color_outliers = "black"
    palette.insert(0, color_outliers)
else:
    pass
sns.set_palette(palette)
fig, ax = plt.subplots()
sns.scatterplot(data=Y_sklearn, x="X", y="Y", hue="cluster", ax=ax, )
Using default hyperparameters, the DBSCAN algorithm finds no cluster in the data you provided: all points are considered outliers, because there is no region where points are significantly more dense. Is that your whole dataset, or is it just a sample? If it is a sample, the whole dataset will have many more points, and DBSCAN will certainly find some high density regions.
Or you can try tweaking the hyperparameters, min_samples and eps in particular. If you want to "force" the algorithm to find more clusters, you can decrease min_samples (default is 5) or increase eps (default is 0.5). Of course, the optimal hyperparameter values depend on the specific dataset, but default values are considered quite good for DBSCAN. So, if the algorithm considers all points in your dataset to be outliers, it means that there are no "natural" clusters!
Do you mean density estimation? You can model your data as a Gaussian mixture and then get the probability that a point belongs to the mixture. You can use sklearn.mixture.GaussianMixture for that. By changing the number of components you can control how many clusters you will have.
The metric to cluster on is the Euclidean distance from the reference point, so the GMM model will provide you with a prediction of which cluster each data point should be assigned to. Since your metric is 1D, you will get a set of Gaussian distributions, i.e. a set of means and variances. So you can easily calculate the probability of any point being in a certain cluster, just by calculating how far it is from the reference point and putting that value into the normal distribution pdf formula.
To make the image clearer, I'm changing the reference point to (-5, 5) and selecting number of clusters = 4. In order to get the best number of clusters, use some metric that minimizes the total variance and penalizes growth in the number of mixtures. For example argmin(model.covariances_.sum()*num_clusters)
import pandas as pd
from sklearn.mixture import GaussianMixture
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

df = pd.DataFrame({
    'X' : [-1.0,-1.0,0.5,0.0,0.0,2.0,3.0,5.0,0.0,-2.5,2.0,8.0,-10.5,15.0,-20.0,-32.0,-20.0,-20.0,-10.0,20.5,0.0,20.0,-30.0,-15.0,20.0,-15.0,-10.0],
    'Y' : [0.0,1.0,-0.5,0.5,-0.5,0.0,1.0,4.0,5.0,-3.5,-2.0,-8.0,-0.5,-10.5,-20.5,0.0,16.0,-15.0,5.0,13.5,20.0,-20.0,2.0,-17.5,-15,19.0,20.0],
    })
ref_X, ref_Y = -5, 5
dist = np.sqrt((df.X-ref_X)**2 + (df.Y-ref_Y)**2)
n_mix = 4
gmm = GaussianMixture(n_mix)
model = gmm.fit(dist.values.reshape(-1,1))
x = np.linspace(-35., 35.)
y = np.linspace(-30., 30.)
X, Y = np.meshgrid(x, y)
XX = np.sqrt((X.ravel() - ref_X)**2 + (Y.ravel() - ref_Y)**2)
Z = model.score_samples(XX.reshape(-1,1))
Z = Z.reshape(X.shape)
# plot grid points probabilities
plt.set_cmap('plasma')
plt.contourf(X, Y, Z, 40)
plt.scatter(df.X, df.Y, c=model.predict(dist.values.reshape(-1,1)), edgecolor='black')
You can read more here and here
P.S. score_samples() returns log likelihoods, use exp() to convert to probability
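The step of putting a distance into the normal pdf to get cluster probabilities can be sketched by hand. The mixture parameters below (means, variances, weights) are made-up illustrative values, not fitted ones; with a fitted model they would presumably come from the means_, covariances_ and weights_ attributes of GaussianMixture, and predict_proba should give the same kind of result directly.

```python
import numpy as np

def responsibilities(dist, means, variances, weights):
    """Probability of each 1-D Gaussian component given each distance value."""
    dist = np.asarray(dist, dtype=float)[:, None]            # shape (n_points, 1)
    pdf = np.exp(-0.5 * (dist - means) ** 2 / variances)
    pdf /= np.sqrt(2.0 * np.pi * variances)                  # normal pdf per component
    weighted = weights * pdf                                 # weight times likelihood
    return weighted / weighted.sum(axis=1, keepdims=True)    # normalize per point

# Hypothetical 1-D mixture over "distance from the reference point".
means = np.array([2.0, 12.0, 25.0])
variances = np.array([1.5, 9.0, 16.0])
weights = np.array([0.4, 0.3, 0.3])

# Three example points and a reference point at the origin (assumed values).
ref_x, ref_y = 0.0, 0.0
points = np.array([[1.0, 1.0], [8.0, -8.0], [-20.0, 16.0]])
dist = np.sqrt((points[:, 0] - ref_x) ** 2 + (points[:, 1] - ref_y) ** 2)

resp = responsibilities(dist, means, variances, weights)
print(resp.round(3))
```

Each row of resp sums to 1, so it can be read as a soft assignment of that point to the rings around the reference point instead of a hard label.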
Taking your centre point of 0,0, we can calculate the Euclidean distance from this point to all points in your df.
df['distance'] = np.sqrt(df['X']**2 + df['Y']**2)
If you have a centre point other than zero it would be:
df['distance'] = np.sqrt((centre_point_x - df['X'])**2 + (centre_point_y - df['Y'])**2)
Using your data and chart as before, we can plot this and see the distance metric increasing as we move away from the centre.
fig, ax = plt.subplots(figsize = (6,6))
ax.scatter(df['X'], df['Y'], c = df['distance'], cmap = 'viridis', marker = 'o', s = 30)
ax.set_xlim([-35, 35])
ax.set_ylim([-35, 35])
plt.show()
K-means
We can now use this distance data to calculate K-means clusters as you did before, but this time using the distance data and an array of zeros (zeros because this k-means requires a 2D array, but we only want to split the 1D array of distance data, so the zeros act as 'filler').
model = KMeans(n_clusters = 2) #choose how many clusters
# create this 2d array for the KMeans model
model_data = np.array([df['distance'].values, np.zeros(df.shape[0])])
model.fit(model_data.T) # transformed array because the above code produces
# data with 27 columns and 2 rows but we want it the other way round
df['group'] = model.labels_ # put the labels into the dataframe
Then we can plot the results
fig, ax = plt.subplots(figsize = (6,6))
ax.scatter(df['X'], df['Y'], c = df['group'], cmap = 'viridis', marker = 'o', s = 30)
ax.set_xlim([-35, 35])
ax.set_ylim([-35, 35])
plt.show()
With three clusters we get the following result:
Other clustering methods
Check out SKlearn's clustering page for more options. I experimented with DBSCAN with some good results, but it depends on what exactly you are trying to achieve. Check out the table underneath their example charts to see how they each compare.
How to draw a sample from a categorical distribution
I have a 3D numpy array with the probabilities of each category in the last dimension. Something like:
import numpy as np
from scipy.special import softmax
array = np.random.normal(size=(10, 100, 5))
probabilities = softmax(array, axis=2)
How can I sample from a categorical distribution with those probabilities?
EDIT: Right now I'm doing it like this:
def categorical(x):
    return np.random.multinomial(1, pvals=x)
samples = np.apply_along_axis(categorical, axis=2, arr=probabilities)
But it's very slow, so I want to know if there's a way to vectorize this operation.
Drawing samples from a given probability distribution is done by evaluating the inverse cumulative distribution function at a random number in the range 0 to 1. For a small number of discrete categories, like in the question, you can find the inverse using a linear search:
## Alternative test dataset
probabilities[:, :, :] = np.array([0.1, 0.5, 0.15, 0.15, 0.1])
n1, n2, m = probabilities.shape
cum_prob = np.cumsum(probabilities, axis=-1) # shape (n1, n2, m)
r = np.random.uniform(size=(n1, n2, 1))
# argmax finds the index of the first True value in the last axis.
samples = np.argmax(cum_prob > r, axis=-1)
print('Statistics:')
print(np.histogram(samples, bins=np.arange(m+1)-0.5)[0]/(n1*n2))
For the test dataset, a typical test output was:
Statistics:
[0.0998 0.4967 0.1513 0.1498 0.1024]
which looks OK.
If you have many, many categories (thousands), it's probably better to do a bisection search using a numba compiled function.
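The multinomial call in the question returns one-hot vectors, while the inverse-CDF snippet returns category indices. Here is a minimal sketch of getting from one to the other, assuming NumPy's Generator API and a hand-written softmax in place of scipy.special.softmax. Pinning the last cumulative value to 1 is an added guard against floating-point rounding, not part of the original answer.

```python
import numpy as np

rng = np.random.default_rng(0)

# Probabilities with the categories on the last axis, as in the question.
logits = rng.normal(size=(10, 100, 5))
probabilities = np.exp(logits - logits.max(axis=-1, keepdims=True))
probabilities /= probabilities.sum(axis=-1, keepdims=True)

# Inverse-CDF sampling: first index where the cumulative sum exceeds r.
cum_prob = np.cumsum(probabilities, axis=-1)
cum_prob[..., -1] = 1.0     # guard: rounding can leave the last value below 1
r = rng.uniform(size=probabilities.shape[:-1] + (1,))
samples = np.argmax(cum_prob > r, axis=-1)            # shape (10, 100)

# Convert indices to one-hot rows, matching the multinomial(1, p) layout.
n_categories = probabilities.shape[-1]
one_hot = np.eye(n_categories, dtype=int)[samples]    # shape (10, 100, 5)

print(samples.shape, one_hot.shape)
```

Indexing an identity matrix with the sampled indices keeps everything vectorized, so no Python-level loop over the first two axes is needed.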
Efficient way to calculate the pairwise matrix product between one tensor and all the rolling of another tensor
Suppose we have two tensors:
tensor A whose shape is (d,m,n)
tensor B whose shape is (d,n,l).
If we want to get the pairwise matrix product of the right-most matrices of A and B, I think we can use np.einsum('dmn,...nl->d...ml',A,B), whose result has shape (d,d,m,l). However, I would like to get the pairwise product of only some of the pairs. Introducing a parameter k, 1<=k<=d, I want to get the following pairwise matrix products:
from A(0,...)@B(0,...) to A(0,...)@B(k-1,...) ;
from A(1,...)@B(1,...) to A(1,...)@B(k,...) ;
.... ;
from A(d-2,...)@B(d-2,...), A(d-2,...)@B(d-1,...) to A(d-2,...)@B(k-3,...) ;
from A(d-1,...)@B(d-1,...) to A(d-1,...)@B(k-2,...) .
Note that here we treat tensor B in a rolling way (like numpy.roll). Finally, we get a tensor whose shape is (d,k,m,l). What's the most efficient way to do this?
I know several ways, like:
First get np.einsum('dmn,...nl->d...ml',A,B), then use a mask to extract the (d,k) pairs.
Tile B first, then use einsum in some way.
But I think there exists a better way.
I doubt you can do much better than a for loop. Here is, for example, a vectorized version using einsum and stride_tricks compared to a double for loop:
Code:
from simple_benchmark import BenchmarkBuilder, MultiArgument
import numpy as np
from numpy.lib.stride_tricks import as_strided

B = BenchmarkBuilder()

@B.add_function()
def loopy(A,B,k):
    d,m,n = A.shape
    l = B.shape[-1]
    out = np.empty((d,k,m,l),int)
    for i in range(d):
        for j in range(k):
            out[i,j] = A[i]@B[(i+j)%d]
    return out

@B.add_function()
def vectory(A,B,k):
    d,m,n = A.shape
    l = B.shape[-1]
    BB = np.concatenate([B,B[:k-1]],0)
    BB = as_strided(BB,(d,k,n,l),np.repeat(BB.strides,(2,1,1)))
    return np.einsum("ikl,ijln->ijkn",A,BB)

@B.add_arguments('d x k x m x n x l')
def argument_provider():
    for exp in range(10):
        d,k,m,n,l = (np.r_[1.6,1.5,1.5,1.5,1.5]**exp*(4,2,2,2,2)).astype(int)
        print(d,k,m,n,l)
        A = np.random.randint(0,10,(d,m,n))
        B = np.random.randint(0,10,(d,n,l))
        yield k*d*m*n*l,MultiArgument([A,B,k])

r = B.run()
r.plot()
import pylab
pylab.savefig('diagwa.png')
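For reference, the rolling pairs can also be gathered with integer indexing and multiplied with a broadcast matmul. This is only a sketch to make the indexing explicit and to check it against the double loop; the function names are made up here, and no claim is made that it is faster than the versions above, since B[idx] copies the (d,k,n,l) block, unlike the as_strided view.

```python
import numpy as np

def rolling_pairwise_loop(A, B, k):
    """Reference: out[i, j] = A[i] @ B[(i + j) % d] via a double loop."""
    d, m, _ = A.shape
    l = B.shape[-1]
    out = np.empty((d, k, m, l), dtype=np.result_type(A, B))
    for i in range(d):
        for j in range(k):
            out[i, j] = A[i] @ B[(i + j) % d]
    return out

def rolling_pairwise_indexed(A, B, k):
    """Same result using an index table and broadcasting in matmul."""
    d = A.shape[0]
    idx = (np.arange(d)[:, None] + np.arange(k)) % d   # (d, k) rolled indices into B
    # A[:, None] has shape (d, 1, m, n); B[idx] has shape (d, k, n, l).
    return A[:, None] @ B[idx]

rng = np.random.default_rng(0)
A = rng.integers(0, 10, size=(5, 2, 3))
B = rng.integers(0, 10, size=(5, 3, 4))
k = 3

out_loop = rolling_pairwise_loop(A, B, k)
out_indexed = rolling_pairwise_indexed(A, B, k)
print(out_indexed.shape, np.array_equal(out_loop, out_indexed))
```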
Linear Regression overfitting
I'm pursuing course 2 of this coursera specialization on linear regression (https://www.coursera.org/specializations/machine-learning). I've solved the training using graphlab but wanted to try out sklearn for the experience and learning. I'm using sklearn and pandas for this.
The model overfits on the data. How can I fix this? This is the code.
poly1_data = polynomial_dataframe(sales["sqft_living"], 1)
poly1_data["price"] = sales["price"]
model1 = LinearRegression()
model1.fit(poly1_data, sales["price"])
print(model1.coef_)
plt.plot(poly1_data['power_1'], poly1_data['price'], '.',poly1_data['power_1'], model1.predict(poly1_data),'-')
plt.show()
These are the coefficients I'm getting.
[ -3.33628603e-13 1.00000000e+00]
The plotted line is like this. As you see, it connects every data point.
And this is the plot of the input data.
I wouldn't even call this overfit. I'd say you aren't doing what you think you should be doing. In particular, you forgot to add a column of 1's to your design matrix, X. For example:
import numpy as np
import pandas as pd

# generate some univariate data
x = np.arange(100)
y = 2*x + x*np.random.normal(0,1,100)
df = pd.DataFrame([x,y]).T
df.columns = ['x','y']
You're doing the following:
model1 = LinearRegression()
X = df["x"].values.reshape(1,-1)[0] # reshaping data
y = df["y"].values.reshape(1,-1)[0]
model1.fit(X,y)
Which leads to:
plt.plot(df['x'].values, df['y'].values,'.')
plt.plot(X[0], model1.predict(X)[0],'-')
plt.show()
Instead, you want to add a column of 1's to your design matrix (X):
X = np.column_stack([np.ones(len(df['x'])),df["x"].values.reshape(1,-1)[0]])
y = df["y"].values.reshape(-1,1) # one target value per row of X
model1.fit(X,y)
And (after some reshaping) you get:
plt.plot(df['x'].values, df['y'].values,'.')
plt.plot(df['x'].values, model1.predict(X),'-')
plt.show()
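One more thing worth checking in the question's code, offered as a possible reading rather than a confirmed diagnosis: poly1_data gets a price column before it is passed to fit, so the target is also one of the features. Coefficients like [-3.3e-13, 1.0] are what a least-squares fit returns in that situation. The sketch below uses made-up data and plain numpy least squares instead of sklearn, and the column names in the comments are only stand-ins for the question's power_1 and price.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up data: one feature (think "sqft_living" in thousands) and a noisy target.
sqft = rng.uniform(0.5, 4.0, size=50)
price = 2.0 * sqft + rng.normal(0.0, 0.5, size=50)

ones = np.ones_like(sqft)

# Leaky design matrix: intercept, feature AND the target itself as a column.
X_leaky = np.column_stack([ones, sqft, price])
coef_leaky, *_ = np.linalg.lstsq(X_leaky, price, rcond=None)

# Clean design matrix: intercept and feature only.
X_clean = np.column_stack([ones, sqft])
coef_clean, *_ = np.linalg.lstsq(X_clean, price, rcond=None)

print(coef_leaky.round(6))   # weight goes entirely to the leaked price column
print(coef_clean.round(3))   # intercept and slope of an ordinary straight-line fit
```

With the leaked column the fit reproduces every data point exactly, which would match a plotted line that connects every point; dropping the target from the features gives an ordinary straight-line fit.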