Cluster groups continuously instead of discretely - python - pandas

I'm trying to cluster a group of points in a probabilistic manner. In the code below, I have a single set of xy points, which are recorded in X and Y. I want to cluster them into groups using a reference point, which is stored in X2 and Y2.
With the help of an answer, the current approach is to measure the distance from the reference point and group using k-means. Although it provides a way to cluster using the reference point, the hard cutoff and adherence to k clusters make it somewhat unsuitable when dealing with numerous datasets. For instance, the number of clusters needed for this example is probably 3, but a separate example may differ. I'd have to manually go through and alter k every time.
Given the non-probabilistic nature of k-means, a separate option could be a GMM. Is it possible to account for the reference point when modelling? As the attached output below shows, the underlying model isn't clustering the way I'm hoping.
If I look at the probability that each point is within a group, it's not clustered as I'd hoped. I also run into the same problem of manually altering the number of components. Because the points are distributed randomly, using AIC or BIC to select the appropriate number of clusters doesn't work; there is no optimal number.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
df = pd.DataFrame({
'X' : [-1.0,-1.0,0.5,0.0,0.0,2.0,3.0,5.0,0.0,-2.5,2.0,8.0,-10.5,15.0,-20.0,-32.0,-20.0,-20.0,-10.0,20.5,0.0,20.0,-30.0,-15.0,20.0,-15.0,-10.0],
'Y' : [0.0,1.0,-0.5,0.5,-0.5,0.0,1.0,4.0,5.0,-3.5,-2.0,-8.0,-0.5,-10.5,-20.5,0.0,16.0,-15.0,5.0,13.5,20.0,-20.0,2.0,-17.5,-15,19.0,20.0],
'X2' : [0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0],
'Y2' : [0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0],
})
k-means:
df['distance'] = np.sqrt(df['X']**2 + df['Y']**2)   # distance from a reference point at the origin
df['distance'] = np.sqrt((df['X'] - df['X2'])**2 + (df['Y'] - df['Y2'])**2)   # general form using the reference point columns
model = KMeans(n_clusters = 2)
model_data = np.array([df['distance'].values, np.zeros(df.shape[0])])
model.fit(model_data.T)
df['group'] = model.labels_
plt.scatter(df['X'], df['Y'], c = model.labels_, cmap = 'bwr', marker = 'o', s = 5)
plt.scatter(df['X2'], df['Y2'], c ='k', marker = 'o', s = 5)
GMM:
from sklearn import mixture

Y_sklearn = df[['X','Y']].values
gmm = mixture.GaussianMixture(n_components=3, covariance_type='diag', random_state=42)
gmm.fit(Y_sklearn)
labels = gmm.predict(Y_sklearn)
df['group'] = labels
plt.scatter(Y_sklearn[:, 0], Y_sklearn[:, 1], c=labels, s=5, cmap='viridis');
plt.scatter(df['X2'], df['Y2'], c='red', marker = 'x', edgecolor = 'k', s = 5, zorder = 10)
proba = pd.DataFrame(gmm.predict_proba(Y_sklearn).round(2)).reset_index(drop = True)
df_pred = pd.concat([df, proba], axis = 1)

In my opinion, if you want to define clusters as "regions where points are close to each other", you should use DBSCAN.
This clustering algorithm finds clusters by looking for regions where points are close to each other (i.e. dense regions) that are separated from other clusters by regions where points are less dense.
The algorithm can also categorize points as noise (outliers), labelled -1: these are points that do not belong to any cluster.
Here is some code to perform DBSCAN clustering, and to insert the cluster labels as a new categorical column in the original Y_sklearn DataFrame. It also prints how many clusters and how many outliers are found.
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN
Y_sklearn = df.loc[:, ["X", "Y"]].copy()
n_points = Y_sklearn.shape[0]
dbs = DBSCAN()
labels_clusters = dbs.fit_predict(Y_sklearn)
#Number of found clusters (outliers are not considered a cluster).
n_clusters = labels_clusters.max() + 1
print(f"DBSCAN found {n_clusters} clusters in dataset with {n_points} points.")
#Number of found outliers (possibly no outliers found).
n_outliers = np.count_nonzero((labels_clusters == -1))
if n_outliers:
    print(f"{n_outliers} outliers were found.\n")
else:
    print("No outliers were found.\n")
#Add cluster labels as a new column to original DataFrame.
Y_sklearn["cluster"] = labels_clusters
#Setting `cluster` column to Categorical dtype makes seaborn function properly treat
#cluster labels as categorical, and not numerical.
Y_sklearn["cluster"] = Y_sklearn["cluster"].astype("category")
If you want to plot the results, I suggest you use Seaborn. Here is some code to plot the points of the Y_sklearn DataFrame and color them by the cluster they belong to. I also define a new color palette, which is just the default Seaborn palette, but where outliers (with label -1) will be in black.
import matplotlib.pyplot as plt
import seaborn as sns
name_palette = "tab10"
palette = sns.color_palette(name_palette)
if n_outliers:
    color_outliers = "black"
    palette.insert(0, color_outliers)
sns.set_palette(palette)
fig, ax = plt.subplots()
sns.scatterplot(data=Y_sklearn,
                x="X",
                y="Y",
                hue="cluster",
                ax=ax,
                )
Using the default hyperparameters, the DBSCAN algorithm finds no cluster in the data you provided: all points are considered outliers, because there is no region where points are significantly more dense. Is that your whole dataset, or is it just a sample? If it is a sample, the whole dataset will have many more points, and DBSCAN will certainly find some high-density regions.
Alternatively, you can try tweaking the hyperparameters, min_samples and eps in particular. If you want to "force" the algorithm to find more clusters, you can decrease min_samples (default is 5) or increase eps (default is 0.5). Of course, the optimal hyperparameter values depend on the specific dataset, but the default values are considered quite good for DBSCAN. So, if the algorithm considers all points in your dataset to be outliers, it means that there are no "natural" clusters!
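For example, a minimal sketch of such tweaking; the eps=5 and min_samples=3 values below are purely illustrative and would have to be tuned to the scale of your data:
from sklearn.cluster import DBSCAN

# Illustrative values only: a larger eps and a smaller min_samples make the
# algorithm more permissive, so sparser regions get merged into clusters.
dbs_loose = DBSCAN(eps=5, min_samples=3)
labels_loose = dbs_loose.fit_predict(df[["X", "Y"]])
print(labels_loose)   # -1 still marks outliers; other integers are cluster ids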

Do you mean density estimation? You can model your data as a Gaussian Mixture and then get the probability that a point belongs to the mixture. You can use sklearn.mixture.GaussianMixture for that. By changing the number of components you can control how many clusters you will have. The metric to cluster on is the Euclidean distance from the reference point, so the GMM will provide a prediction of which cluster each data point should be assigned to.
Since your metric is 1D, you will get a set of Gaussian distributions, i.e. a set of means and variances. So you can easily calculate the probability of any point being in a certain cluster, just by calculating how far it is from the reference point and putting that value into the normal distribution pdf formula.
To make the image clearer, I'm changing the reference point to (-5, 5) and selecting 4 clusters. To get the best number of clusters, use some metric that minimizes total variance and penalizes growth in the number of mixtures, for example argmin(model.covariances_.sum() * num_clusters).
import pandas as pd
from sklearn.mixture import GaussianMixture
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm
df = pd.DataFrame({
'X' : [-1.0,-1.0,0.5,0.0,0.0,2.0,3.0,5.0,0.0,-2.5,2.0,8.0,-10.5,15.0,-20.0,-32.0,-20.0,-20.0,-10.0,20.5,0.0,20.0,-30.0,-15.0,20.0,-15.0,-10.0],
'Y' : [0.0,1.0,-0.5,0.5,-0.5,0.0,1.0,4.0,5.0,-3.5,-2.0,-8.0,-0.5,-10.5,-20.5,0.0,16.0,-15.0,5.0,13.5,20.0,-20.0,2.0,-17.5,-15,19.0,20.0],
})
ref_X, ref_Y = -5, 5
dist = np.sqrt((df.X-ref_X)**2 + (df.Y-ref_Y)**2)
n_mix = 4
gmm = GaussianMixture(n_mix)
model = gmm.fit(dist.values.reshape(-1,1))
x = np.linspace(-35., 35.)
y = np.linspace(-30., 30.)
X, Y = np.meshgrid(x, y)
XX = np.sqrt((X.ravel() - ref_X)**2 + (Y.ravel() - ref_Y)**2)
Z = model.score_samples(XX.reshape(-1,1))
Z = Z.reshape(X.shape)
# plot grid points probabilities
plt.set_cmap('plasma')
plt.contourf(X, Y, Z, 40)
plt.scatter(df.X, df.Y, c=model.predict(dist.values.reshape(-1,1)), edgecolor='black')
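As a sketch of the selection heuristic mentioned above (an ad-hoc criterion, not AIC/BIC), one could loop over candidate component counts and keep the one that minimizes total variance times the number of mixtures; this reuses the dist series from the code above:
# Sketch only: pick the number of components via the heuristic from the text.
candidates = range(1, 8)
scores = []
for k in candidates:
    m = GaussianMixture(k, random_state=0).fit(dist.values.reshape(-1, 1))
    scores.append(m.covariances_.sum() * k)
best_k = candidates[int(np.argmin(scores))]
print(best_k, scores)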
P.S. score_samples() returns log likelihoods; use exp() to convert them to probabilities.
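Building on that, here is a minimal sketch of the per-component probability mentioned earlier, assuming model, ref_X and ref_Y are the objects fitted above; predict_proba gives the per-component posterior directly, and the same number can be rebuilt by hand from the fitted means and variances with the normal pdf:
# Sketch only: probability that an arbitrary point (x0, y0) belongs to each component.
x0, y0 = 10.0, -3.0
d = np.sqrt((x0 - ref_X)**2 + (y0 - ref_Y)**2)

# Posterior component probabilities straight from the fitted model.
print(model.predict_proba([[d]]).round(3))

# The same quantity built by hand from the fitted 1D Gaussians.
means = model.means_.ravel()
stds = np.sqrt(model.covariances_.ravel())
weighted_pdfs = model.weights_ * norm.pdf(d, means, stds)
print((weighted_pdfs / weighted_pdfs.sum()).round(3))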

Taking your centre point of (0, 0), we can calculate the Euclidean distance from this point to all points in your df.
df['distance'] = np.sqrt(df['X']**2 + df['Y']**2)
If you have a centre point other than zero it would be:
df['distance'] = np.sqrt((centre_point_x - df['X'])**2 + (centre_point_y - df['Y'])**2)
Using your data and chart as before, we can plot this and see the distance metric increasing as we move away from the centre.
fig, ax = plt.subplots(figsize = (6,6))
ax.scatter(df['X'], df['Y'], c = df['distance'], cmap = 'viridis', marker = 'o', s = 30)
ax.set_xlim([-35, 35])
ax.set_ylim([-35, 35])
plt.show()
K-means
We can now use this distance data to calculate k-means clusters as you did before, but this time using the distance data and an array of zeros (the zeros are needed because KMeans requires a 2D array, while we only want to split on the 1D array of distance data, so the zeros act as 'filler').
model = KMeans(n_clusters = 2) #choose how many clusters
# create this 2d array for the KMeans model
model_data = np.array([df['distance'].values, np.zeros(df.shape[0])])
model.fit(model_data.T) # transposed, because the array above has 2 rows and 27 columns
                        # but we want it the other way round
df['group'] = model.labels_ # put the labels into the dataframe
Then we can plot the results
fig, ax = plt.subplots(figsize = (6,6))
ax.scatter(df['X'], df['Y'], c = df['group'], cmap = 'viridis', marker = 'o', s = 30)
ax.set_xlim([-35, 35])
ax.set_ylim([-35, 35])
plt.show()
With three clusters we get the following result:
Other clustering methods
Check out SKlearn's clustering page for more options. I experimented with DBSCAN with some good results but it depends on what you are trying to achieve exactly. Check out the table underneath their example charts to see how they each compare.

Related

How to plot multiple corresponding labels in dendrogram that is truncated at an upper level?

When a dendrogram is truncated at a set level (using truncate_mode = 'level', p = 5), leaf nodes in the visualization may correspond to multiple objects instead of singletons. The default behavior of the original function is to label those leaf nodes with the number of objects.
Is there a way to plot all corresponding labels instead of counts? I suspect the 'leaf_label_func' parameter of 'scipy.cluster.hierarchy.dendrogram' could be utilized to this end, but I was not able to make it work.
A sample code and part of its output is below.
from scipy.cluster.hierarchy import dendrogram
import matplotlib.pyplot as plt
def show_dendrogram(Z, labs, txt):
    plt.figure(figsize=(10, 30))
    ...
    dendrogram(
        Z,
        labels = labs,
        truncate_mode = 'level',
        p = 5,
        ...
    )
    plt.show()
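Not a tested answer, but a minimal sketch of how leaf_label_func could be wired up: the callback receives node ids, where ids below n correspond to original observations, and ids of truncated (merged) leaves are cluster nodes whose members can be recovered with to_tree/pre_order. Z and labs are assumed to be the same objects as above.
from scipy.cluster.hierarchy import dendrogram, to_tree
import matplotlib.pyplot as plt

root, nodes = to_tree(Z, rd=True)   # nodes[i] is the ClusterNode with id i
n = len(labs)

def llf(node_id):
    if node_id < n:
        # Singleton leaf: maps directly to one original observation.
        return str(labs[node_id])
    # Merged leaf: collect the labels of all observations under this node.
    members = nodes[node_id].pre_order(lambda leaf: labs[leaf.id])
    return ", ".join(str(m) for m in members)

dendrogram(Z, truncate_mode='level', p=5, leaf_label_func=llf)
plt.show()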

statsmodels: IntegrationWarning: The maximum number of subdivisions (50) has been achieved

Trying to plot a CDF with seaborn, I encountered this warning:
../venv/lib/python3.7/site-packages/statsmodels/nonparametric/kde.py:178: IntegrationWarning: The maximum number of subdivisions (50) has been achieved.
If increasing the limit yields no improvement it is advised to analyze
the integrand in order to determine the difficulties. If the position of a
local difficulty can be determined (singularity, discontinuity) one will
probably gain from splitting up the interval and calling the integrator
on the subranges. Perhaps a special-purpose integrator should be used.
args=endog)[0] for i in range(1, gridsize)]
Some minutes after pressing the return key
../venv/lib/python3.7/site-packages/statsmodels/nonparametric/kde.py:178: IntegrationWarning: The integral is probably divergent, or slowly convergent.
args=endog)[0] for i in range(1, gridsize)]
Code:
plt.figure()
plt.title('my distribution')
plt.ylabel('CDF')
plt.xlabel('x-labelled')
sns.kdeplot(data,cumulative=True)
plt.show()
If it could be of help:
print(len(data))
4360700
Sample data:
print(data[:10])
[ 0.00362846 0.00123409 0.00013711 -0.00029235 0.01515175 0.02780404
0.03610236 0.03410224 0.03887933 0.0307084 ]
I have no idea what the subdivisions are; is there a way to increase them?
A kde plot is created by summing one Gaussian bell shape for every data point. Summing 4 million curves will create memory and performance problems, which might cause some functions to fail. The exact error message can be very cryptic.
The easiest way to work around the problem is to subsample the data, as for a more or less smooth distribution the kde (and the cumulative kde or cdf) will look very similar whether the data is subsampled or not. Subsampling every 100th entry is easy using slicing: data[::100].
Alternatively, with that much data, the "real" cdf can be drawn by plotting the sorted data versus N evenly spaced numbers from 0 to 1 (where N is the number of data points).
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
N = 1000000
data = np.random.normal(np.repeat(np.random.uniform(10, 20, 10), N // 10), 1)
sns.kdeplot(data[::100], cumulative=True, color='g', label='cumulative kde')
q = np.linspace(0, 1, data.size)
data.sort()
plt.plot(data, q, ':r', lw=2, label='cdf from sorted data')
plt.legend()
plt.show()
Note that in a similar, though slightly more involved, way you can draw a "more honest" kde given the differences of a large enough array of sorted data. np.interp interpolates the quantiles to a regularly spaced x-axis. As the raw differences are rather jagged, some smoothing is needed.
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import statsmodels.api as sm
N = 1000000
data = np.random.normal(np.repeat(np.random.uniform(10, 20, 10), N // 10), 1)
sns.kdeplot(data[::100], cumulative=False, color='g', label='kde')
p = np.linspace(0, 1, data.size)
data.sort()
x = np.linspace(data.min(), data.max(), 1000)
y = np.interp(x, data, p)
# use lowess filter to smoothen the curve
lowess = sm.nonparametric.lowess(np.diff(y) * 1000 / (data.max() - data.min()), (x[:-1] + x[1:]) / 2, frac=0.05)
plt.plot(lowess[:, 0], lowess[:, 1], '-r', label='smoothed diff of sorted data')
# plt.plot((x[:-1]+x[1:])/2,
# np.convolve(np.diff(y), np.ones(20)/20, mode='same')*1000/(data.max() - data.min()),
# label='test np.diff')
plt.legend()
plt.show()

kNN: give more weight to specific features in the distance

I'm using the Kobe Bryant Dataset.
I wish to predict the shot_made_flag with KnnRegressor.
I've used game_date to extract year and month features:
# covert season to years
kobe_data_encoded['season'] = kobe_data_encoded['season'].apply(lambda x: int(re.compile('(\d+)-').findall(x)[0]))
# add year and month using game_date
kobe_data_encoded['year'] = kobe_data_encoded['game_date'].apply(lambda x: int(re.compile('(\d{4})').findall(x)[0]))
kobe_data_encoded['month'] = kobe_data_encoded['game_date'].apply(lambda x: int(re.compile('-(\d+)-').findall(x)[0]))
kobe_data_encoded = kobe_data_encoded.drop(columns=['game_date'])
I wish to use the season, year and month features and give them more weight in the distance function, so that events with a date closer to the current event will be closer neighbors, while still maintaining reasonable distances to other potential data points. For example, I don't want an event within the same day to become the closest neighbor just because of the date features; the other features, such as shot_range etc., should still be taken into account.
To give them more weight I've tried to use the metric argument with a custom distance function, but the function's arguments are just numpy arrays without the pandas column information, so I'm not sure what I can do and how to implement what I'm trying to do.
EDIT:
Using larger weights for the date features to find the optimal k, with 10-fold CV, running k over [1, 100]:
from IPython.display import display
import numpy as np
import matplotlib.pyplot as plt
from sklearn import preprocessing
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_val_score
# scaling
min_max_scaler = preprocessing.MinMaxScaler()
scaled_features_df = kobe_data_encoded.copy()
column_names = ['loc_x', 'loc_y', 'minutes_remaining', 'period',
'seconds_remaining', 'shot_distance', 'shot_type', 'shot_zone_range']
scaled_features = min_max_scaler.fit_transform(scaled_features_df[column_names])
scaled_features_df[column_names] = scaled_features
not_classified_df = scaled_features_df[scaled_features_df['shot_made_flag'].isnull()]
classified_df = scaled_features_df[scaled_features_df['shot_made_flag'].notnull()]
X = classified_df.drop(columns=['shot_made_flag'])
y = classified_df['shot_made_flag']
cv = StratifiedKFold(n_splits=10, shuffle=True)
neighbors = [x for x in range(1, 100)]
cv_scores = []
weight = np.ones((X.shape[1],))
weight[[X.columns.get_loc("season"),
X.columns.get_loc("year"),
X.columns.get_loc("month")
]] = 5
weight = weight/weight.sum() #Normalize weights
def my_distance(x, y):
    dist = ((x - y)**2)
    return np.dot(dist, weight)
for k in neighbors:
    print('k: ', k)
    knn = KNeighborsClassifier(n_neighbors=k, metric=my_distance)
    cv_scores.append(np.mean(cross_val_score(knn, X, y, cv=cv, scoring='roc_auc')))
#optimal k (highest ROC AUC)
optimal_k_index = cv_scores.index(max(cv_scores))
optimal_k = neighbors[optimal_k_index]
print('best k: ', optimal_k)
plt.plot(neighbors, cv_scores)
plt.xlabel('Number of Neighbors K')
plt.ylabel('ROC AUC')
plt.show()
This runs really slowly; any idea how to make it faster?
The idea of the weighted features is to find neighbors closer to the data point's date, to avoid data leakage, and to use CV to find the optimal k.
First, you have to prepare a 1D numpy weight array, specifying a weight for each feature. You could do something like:
weight = np.ones((M,)) # M is no of features
weight[[1,7,10]] = 2 # Increase weight of 1st,7th and 10th features
weight = weight/weight.sum() #Normalize weights
You can use kobe_data_encoded.columns to find indexes of season, year, month features in your dataframe to replace 2nd line above.
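For instance, a small sketch of that lookup (assuming the feature DataFrame still contains the season, year and month columns, as in the question):
# Sketch: build the weight vector from column names instead of hard-coded indexes.
date_cols = ["season", "year", "month"]
weight = np.ones(kobe_data_encoded.shape[1])
weight[[kobe_data_encoded.columns.get_loc(c) for c in date_cols]] = 2   # up-weight date features
weight = weight / weight.sum()   # normalize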
Now define a distance function, which per the guidelines has to take two 1D numpy arrays.
def my_dist(x, y):
    global weight                # 1D array, same shape as x or y
    dist = ((x - y)**2)          # 1D array, same shape as x or y
    return np.dot(dist, weight)  # a scalar float
And initialize KNeighborsRegressor as:
knn = KNeighborsRegressor(metric=my_dist)
EDIT:
To make things efficient, you can precompute distance matrix, and reuse it in KNN. This should bring in significant speedup by reducing calls to my_dist, since this non-vectorized custom python distance function is quite slow. So now -
dist = np.zeros((len(X), len(X)))  # Computing the NxN distance matrix
X_arr = X.values                   # use the underlying numpy array for row indexing
for i in range(len(X)):            # You can halve this by using the fact that dist[i,j] = dist[j,i]
    for j in range(len(X)):
        dist[i, j] = my_dist(X_arr[i], X_arr[j])

for k in neighbors:
    print('k: ', k)
    knn = KNeighborsClassifier(n_neighbors=k, metric='precomputed')  # Note: metric='precomputed'
    cv_scores.append(np.mean(cross_val_score(knn, dist, y, cv=cv, scoring='roc_auc')))  # Note: passing dist instead of X
I couldn't test it, so let me know if something isn't alright.
Just to add to Shihab's answer regarding distance computation: you can use scipy's pdist, as suggested in this post, which is faster and more efficient.
from scipy.spatial.distance import pdist, minkowski, squareform
# create the custom weight array
weight = ...
# calculate pairwise distances, using Minkowski norm with custom weights
distances = pdist(X, minkowski, 2, weight)
# reformat the result as a square matrix
distances_as_2d_matrix = squareform(distances)
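The resulting square matrix can then be passed to the classifier exactly like the hand-rolled dist matrix above; a short sketch, reusing y and cv from the question:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

# Sketch: feed the precomputed pairwise distances to kNN instead of the raw features.
knn = KNeighborsClassifier(n_neighbors=5, metric='precomputed')
scores = cross_val_score(knn, distances_as_2d_matrix, y, cv=cv, scoring='roc_auc')
print(scores.mean())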

How to get the array corresponding exactly to contourf?

I have a rather complicated two-dimensional function that I represent with a few levels using contourf. How do I get the array corresponding exactly to the filled contours?
For example :
import numpy as np
import matplotlib.pyplot as plt
x = np.linspace(0,1,100)
y = np.linspace(0,1,100)
[X,Y] = np.meshgrid(x,y)
Z = X**2 + Y**2
plt.subplot(1,2,1)
plt.imshow(Z, origin = 'lower')
plt.subplot(1,2,2)
plt.contourf(Z,[0,1,2])
plt.show()
plt.savefig('test.png')
I'd like to have the array from the contourplot that returns constant values between the different contours.
So far I did some thresholding:
test = Z.copy()
test[test < 1] = 0
test[test >= 1] = 2
plt.contourf(X,Y,test,[0,1,2])
plt.savefig('test1.png')
But contourf does a much better job at interpolating things. Furthermore, thresholding 'by hand' becomes a bit long when I have multiple contours.
I guess that, because contourf does all the work, there is a way to get this from the contourf object?
P.S.: why does my first piece of code produce subplots of different sizes?
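I don't know of a direct attribute on the contourf object for this, but one way to build the piecewise-constant array yourself, without thresholding each level by hand, is np.digitize with the same level boundaries; this is a sketch using the example's Z and levels:
import numpy as np
import matplotlib.pyplot as plt

levels = [0, 1, 2]
# Index of the band each value of Z falls into (0 for [0,1), 1 for [1,2]).
band = np.clip(np.digitize(Z, levels) - 1, 0, len(levels) - 2)
# Piecewise-constant array: every value replaced by the lower boundary of its band.
Z_const = np.asarray(levels)[band]

plt.imshow(Z_const, origin='lower')
plt.show()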

Fitting partial Gaussian

I'm trying to fit a sum of Gaussians using scikit-learn, because the scikit-learn GaussianMixture seems much more robust than using curve_fit.
Problem: it doesn't do a great job of fitting a truncated part of even a single Gaussian peak:
from sklearn import mixture
import matplotlib.pyplot as plt
import matplotlib.mlab as mlab
import numpy as np
clf = mixture.GaussianMixture(n_components=1, covariance_type='full')
data = np.random.randn(10000)
data = [[x] for x in data]
clf.fit(data)
data = [item for sublist in data for item in sublist]
rangeMin = int(np.floor(np.min(data)))
rangeMax = int(np.ceil(np.max(data)))
h = plt.hist(data, range=(rangeMin, rangeMax), normed=True);
plt.plot(np.linspace(rangeMin, rangeMax),
         mlab.normpdf(np.linspace(rangeMin, rangeMax),
                      clf.means_, np.sqrt(clf.covariances_[0]))[0])
gives
Now changing data = [[x] for x in data] to data = [[x] for x in data if x < 0] in order to truncate the distribution returns:
Any ideas how to get the truncation fitted properly?
Note: The distribution isn't necessarily truncated in the middle, there could be anything between 50% and 100% of the full distribution left.
I would also be happy if anyone can point me to alternative packages. I've only tried curve_fit but couldn't get it to do anything useful as soon as more than two peaks are involved.
A bit brutish, but a simple solution would be to split the curve into two halves (data = [[x] for x in data if x < 0]), mirror the left part (data.append([-data[d][0]])), and then do the regular Gaussian fit.
import numpy as np
from sklearn import mixture
import matplotlib.pyplot as plt
import matplotlib.mlab as mlab
np.random.seed(seed=42)
n = 10000
clf = mixture.GaussianMixture(n_components=1, covariance_type='full')
#split the data and mirror it
data = np.random.randn(n)
data = [[x] for x in data if x < 0]
n = len(data)
for d in range(n):
    data.append([-data[d][0]])
clf.fit(data)
data = [item for sublist in data for item in sublist]
rangeMin = int(np.floor(np.min(data)))
rangeMax = int(np.ceil(np.max(data)))
h = plt.hist(data[0:n], bins=20, range=(rangeMin, rangeMax), normed=True);
plt.plot(np.linspace(rangeMin, rangeMax),
         mlab.normpdf(np.linspace(rangeMin, rangeMax),
                      clf.means_, np.sqrt(clf.covariances_[0]))[0] * 2)
plt.show()
@lhcgeneva, the problem is that once the data doesn't include the maximum of the curve, more and more possible Gaussians can fit it:
In the figure, black points represent the data to fit and red points the fitted results. This result was achieved by using "A Simple Algorithm for Fitting a Gaussian Function".
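For reference, a minimal sketch of the basic idea behind that kind of algorithm (fit a parabola to the log of the data); the sample values below are made up purely for illustration:
import numpy as np

def fit_gaussian_log_parabola(x, y):
    # ln(y) of a Gaussian A*exp(-(x-mu)^2/(2*sigma^2)) is a parabola in x,
    # so a least-squares quadratic fit to ln(y) recovers A, mu and sigma.
    c2, c1, c0 = np.polyfit(x, np.log(y), 2)    # ln(y) ~ c2*x^2 + c1*x + c0
    sigma = np.sqrt(-1.0 / (2.0 * c2))
    mu = -c1 / (2.0 * c2)
    A = np.exp(c0 - c1**2 / (4.0 * c2))
    return A, mu, sigma

# Works even if only one flank of the peak is sampled (truncated data).
x = np.linspace(-3.0, -0.5, 50)
y = 2.0 * np.exp(-(x - 0.3)**2 / (2 * 0.8**2))
print(fit_gaussian_log_parabola(x, y))          # approximately (2.0, 0.3, 0.8)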