Retrieve Indices while performing K-Means algorithm - pandas

I have a data frame of the following form:
import pandas as pd

dict_new = {'var1': [1, 0, 1, 0, 2], 'var2': [1, 1, 0, 2, 0], 'var3': [1, 1, 1, 2, 1]}
dfnew = pd.DataFrame(dict_new, index=['word1', 'word2', 'word3', 'word4', 'word5'])
Please note that the actual dataset is quite big; the example above is just for illustration. I then ran the K-means algorithm from scikit-learn, choosing 2 cluster centroids for simplicity.
from sklearn.cluster import KMeans
num_clusters = 2
km = KMeans(n_clusters=num_clusters,verbose=1)
km.fit(dfnew.to_numpy())
Suppose the new cluster centroids are given by
centers=km.cluster_centers_
centers
array([[0.        , 1.5       , 1.5       ],
       [1.33333333, 0.33333333, 1.        ]])
The goal is to identify the two closest words for each cluster centroid. I used distance_matrix from the scipy package and got the output as a 2 x 5 matrix, corresponding to the 2 centers and 5 words. Please see the code below.
import numpy as np
from scipy.spatial import distance_matrix

distance_matrix(centers, np.asmatrix(dfnew.to_numpy()))
array([[1.22474487, 0.70710678, 1.87082869, 0.70710678, 2.54950976],
       [0.74535599, 1.49071198, 0.47140452, 2.3570226 , 0.74535599]])
But we don't see the word indices here, so I am not able to identify the two closest words for each centroid. Can I kindly get help on how to retrieve the indices (which were defined in the original data frame)? Help is appreciated.

Given that I understand properly what you want to do, here is a minimal working example of how to find the indices of the words.
First, let's generate a similar reproducible environment:
# import packages
import pandas as pd
import numpy as np
from sklearn.cluster import KMeans
from scipy.spatial.distance import cdist
from scipy.spatial import distance_matrix
# set up the DataFrame
dict_new={'var1':[1,0,1,0,2],'var2':[1,1,0,2,0],'var3':[1,1,1,2,1]}
df = pd.DataFrame(dict_new,index= ['word1','word2','word3','word4','word5'])
# get the cluster centers
kmeans = KMeans(n_clusters=2, random_state=0).fit(np.array(df))
centers = kmeans.cluster_centers_
If you only need to know the one closest word
Now, if you want to use a distance matrix, you could do the following:
def closest(df, centers):
    # define the distance matrix
    mat = distance_matrix(centers, df.to_numpy())
    # get an ordered list of the closest word for each cluster centroid
    closest_words = [df.index[i] for i in np.argmin(mat, axis=1)]
    return closest_words
# example of it working for all centroids
print(closest(df, centers))
# > ['word3', 'word2']
If you need to know the 2 closest words
Now, if we want the two closest words:
def two_closest(df, centers):
    # define the distance matrix
    mat = distance_matrix(centers, df.to_numpy())
    # get an ordered list of lists of the two closest words for each cluster centroid
    closest_two_words = [[df.index[i] for i in l] for l in np.argsort(mat, axis=1)[:, 0:2]]
    return closest_two_words
# example of it working for all centroids
print(two_closest(df, centers))
# > [['word3', 'word5'], ['word2', 'word4']]
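The same distances can also be computed with cdist (already imported above); a minimal sketch, assuming the same df and centers as in the setup, with np.argsort doing the ranking (ties at equal distance may come out in either order):
# equivalent sketch using cdist instead of distance_matrix
mat = cdist(centers, df.to_numpy())                 # shape (2, 5): centroids x words
order = np.argsort(mat, axis=1)[:, :2]              # indices of the two smallest distances per row
two_closest_words = [[df.index[i] for i in row] for row in order]
print(two_closest_words)
# expected output similar to [['word3', 'word5'], ['word2', 'word4']]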
Please tell me if this is not what you wanted to do or if my answer does not fit your needs! And don't forget to mark the question as answered if it solved your problem.

Related

Why Do My K-means Cluster Scatterplot Colors Match Far-away Centroids

I have a simple notebook to read in text files, vectorize the text, use K-means clustering to label, and then plot the documents. For testing purposes I chose a small number of documents from three distinct sources (Edgar Allan Poe fiction, Russian Troll Twitter, and Ukraine news) and deliberately chose K=3 as a kind of surface validity check. My problem is that in one visible cluster, several plot points are colored the same as a (visibly) far-away cluster.
3 Clusters w/ Edge Points Colored as Distant Clusters
My code:
# import pandas to use dataframes and handle tabular data
import pandas as pd
# read in the data using panda's "read_csv" function
col_list = ["DOC_ID", "TEXT"]
data = pd.read_csv('/User/Documents/NLP/Three_Genre_Samples.csv', usecols=col_list)
# use a regular expression to clean annoying "\n" newline characters
data = data.replace(r'\n',' ', regex=True)
#import sklearn for TfidfVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
# vectorize text in the df and fit the TEXT data.
# using the Inverse Document Frequency (IDF) vector collected feature-wise over the corpus.
vectorizer = TfidfVectorizer(stop_words={'english'})
X = vectorizer.fit_transform(data.TEXT)# deliberate "K" value from document sources
true_k = 3
# define an unsupervised clustering "model" using KMeans
from sklearn.cluster import KMeans
model = KMeans(n_clusters=true_k, init='k-means++', max_iter=200, n_init=10)
#fit model to data
model.fit(X)
# define cluster labels (which are integers--a human needs to make them interpretable)
labels=model.labels_
title=[data.DOC_ID]
#make a "clustered" version of the dataframe
data_cl=data
# add label values as a new column, "Cluster"
data_cl['Cluster'] = labels
# output new, clustered dataframe to a csv file
data_cl.to_csv('/Users/Documents/NLP/Three_Genre_Samples_clustered.csv.csv')
# plot document clusters:
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
model_indices = model.fit_predict(X)
pca = PCA(n_components=2)
scatter_plot_points = pca.fit_transform(X.toarray())
colors = ["r", "b", "c"]
x_axis = [o[0] for o in scatter_plot_points]
y_axis = [o[1] for o in scatter_plot_points]
fig, ax = plt.subplots(figsize=(20,10))
ax.scatter(x_axis, y_axis, c=[colors[d] for d in model_indices])
for i, txt in enumerate(labels):
    ax.annotate(txt, (x_axis[i]+.005, y_axis[i]), size=16)
I'd be grateful for any insight.
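One thing that may explain the mismatch: model_indices = model.fit_predict(X) re-runs K-means on data that was already fitted, so the cluster IDs it returns are not guaranteed to line up with the earlier labels = model.labels_; the point colors then come from one clustering while the annotated numbers come from another. A minimal sketch (assuming the setup above) that fits once and reuses the same labels for both the colors and the annotations:
# fit K-means once and reuse the resulting labels for colors and annotations
model = KMeans(n_clusters=true_k, init='k-means++', max_iter=200, n_init=10)
labels = model.fit_predict(X)                 # equivalent to model.fit(X) followed by model.labels_
pca = PCA(n_components=2)
points_2d = pca.fit_transform(X.toarray())
colors = ["r", "b", "c"]
fig, ax = plt.subplots(figsize=(20, 10))
ax.scatter(points_2d[:, 0], points_2d[:, 1], c=[colors[d] for d in labels])
for i, txt in enumerate(labels):
    ax.annotate(txt, (points_2d[i, 0] + .005, points_2d[i, 1]), size=16)
plt.show()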

How to average data for a variable over a number of timesteps

I was wondering if anyone could shed some light on how I can average this data:
I have a .nc file with data (dimensions: 2029, 64, 32) which relate to time, latitude and longitude. Using these commands I can plot individual timesteps:
timestep = data.variables['precip'][0]
plt.imshow(timestep)
plt.colorbar()
plt.show()
Giving a graph in this format for the 0th timestep:
I was wondering if there was any way to average this first dimension (the snapshots in time).
If you are looking to take a mean over all times, try np.mean with the axis keyword to specify which axis to average over:
time_averaged = np.mean(data.variables['precip'], axis=0)
If you have NaN values then np.mean will give NaN for that lon/lat point. If you'd rather ignore them, use np.nanmean.
If you want specific times only, e.g. the first 1000 time steps, then you could do
time_averaged = np.mean(data.variables['precip'][:1000, :, :], axis=0)
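Putting this together, a minimal sketch of reading the file and plotting the time average (the file name 'precip.nc' is an assumption, and netCDF4 is assumed as the reader):
# minimal sketch: average 'precip' over the time axis and plot the result
from netCDF4 import Dataset
import numpy as np
import matplotlib.pyplot as plt

data = Dataset('precip.nc')                   # hypothetical file name
precip = data.variables['precip'][:]          # shape (2029, 64, 32): time, lat, lon
time_averaged = np.nanmean(precip, axis=0)    # ignores NaNs; use np.mean if none are present

plt.imshow(time_averaged)
plt.colorbar()
plt.show()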
If you're using pandas and numpy, this may also help; see the pandas rolling documentation for more details:
import pandas as pd
import numpy as np
data = np.array([10,5,8,9,15,22,26,11,15,16,18,7])
d = pd.Series(data)
print(d.rolling(4).mean())

Fastest way to find nearest nonzero value in array from columns in pandas dataframe

I am looking for the nearest nonzero cell in a numpy 3d array based on the i,j,k coordinates stored in a pandas dataframe. My solution below works, but it is slower than I would like. I know my optimization skills are lacking, so I am hoping someone can help me find a faster option.
It takes 2 seconds to find the nearest non-zero for a 100 x 100 x 100 binary array, and I have hundreds of files, so any speed enhancements would be much appreciated!
import numpy as np
import pandas as pd

a = np.random.randint(0, 2, size=(100, 100, 100))
# df with i,j,k coordinates of interest
df = pd.DataFrame(np.random.randint(100, size=(100, 3)).tolist(),
                  columns=['i', 'j', 'k'])
def find_nearest(a, df):
    import numpy as np
    import pandas as pd
    import time
    t0 = time.time()
    nzi = np.nonzero(a)
    for i, r in df.iterrows():
        dist = ((r['k'] - nzi[0])**2 +
                (r['i'] - nzi[1])**2 +
                (r['j'] - nzi[2])**2)
        nidx = dist.argmin()
        df.loc[i, ['nk', 'ni', 'nj']] = (nzi[0][nidx],
                                         nzi[1][nidx],
                                         nzi[2][nidx])
    print(time.time() - t0)
    return df
The problem that you are trying to solve looks like a nearest-neighbor search.
The worst-case complexity of the current code is O(n m), with n the number of points to search and m the number of neighbour candidates. With n = 100 and m up to 100**3 = 1,000,000, this means on the order of a hundred million distance evaluations. To solve this efficiently, one can use a better algorithm.
The common way to solve this kind of problem is to put all candidate elements in a space-partitioning tree data structure (such as a quadtree, octree, or k-d tree). Such a data structure lets you locate the nearest elements to a query location in O(log m) time. As a result, the overall complexity of this method is O(n log m)! SciPy already implements k-d trees.
Vectorization generally also helps to speed up the computation.
def find_nearest_fast(a, df):
    from scipy.spatial import KDTree
    import numpy as np
    import pandas as pd
    import time
    t0 = time.time()
    candidates = np.array(np.nonzero(a)).transpose().copy()
    tree = KDTree(candidates, leafsize=1024, compact_nodes=False)
    searched = np.array([df['k'], df['i'], df['j']]).transpose()
    distances, indices = tree.query(searched)
    nearestPoints = candidates[indices, :]
    df[['nk', 'ni', 'nj']] = nearestPoints
    print(time.time() - t0)
    return df
This implementation is 16 times faster on my machine. Note that the results can differ a bit, since a given input point may have several nearest candidates at exactly the same distance.
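A quick usage sketch on the random example above, running both versions on copies of the same dataframe (each function prints its own elapsed time, which will vary by machine):
# compare the original and the KD-tree based version on the same inputs
df_slow = find_nearest(a, df.copy())       # prints its elapsed time
df_fast = find_nearest_fast(a, df.copy())  # prints its elapsed time
print(df_fast[['nk', 'ni', 'nj']].head())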

Loop over columns in a dataframe to produce histograms by category

I would like to investigate the frequency distribution of all the features (columns) in my df based on the outcome variable (target column). Having searched for solutions, I found this beautiful snippet from here, which loops over features and generates histograms for the features in the cancer dataset from scikit-learn.
from sklearn.datasets import load_breast_cancer
import numpy as np
import matplotlib.pyplot as plt

cancer = load_breast_cancer()

fig, axes = plt.subplots(10, 3, figsize=(12, 9))  # 3 columns each containing 10 figures, total 30 features
malignant = cancer.data[cancer.target == 0]  # define malignant
benign = cancer.data[cancer.target == 1]     # define benign
ax = axes.ravel()  # flat axes with numpy ravel

for i in range(30):
    _, bins = np.histogram(cancer.data[:, i], bins=40)
    ax[i].hist(malignant[:, i], bins=bins, color='r', alpha=.5)
    ax[i].hist(benign[:, i], bins=bins, color='g', alpha=0.3)
    ax[i].set_title(cancer.feature_names[i], fontsize=9)
    ax[i].axes.get_xaxis().set_visible(False)  # the x-axis coordinates are not so useful; we just want to see how well separated the histograms are
    ax[i].set_yticks(())

ax[0].legend(['malignant', 'benign'], loc='best', fontsize=8)
plt.tight_layout()  # let's make good plots
plt.show()
Assuming that I have a df with all features and the target variable organised across successive columns, how would I be able to loop over my columns to reproduce the histograms? One solution that I have considered is a groupby method:
df.groupby("class").col01.plot(kind='kde', ax=axs[1])
Any ideas are much appreciated!
Actually you can use sns.FacetGrid for this:
import numpy as np
import pandas as pd
import seaborn as sns

# Random data:
np.random.seed(1)
df = pd.DataFrame(np.random.uniform(0, 1, (100, 6)), columns=list('ABCDEF'))
df['class'] = np.random.choice([0, 1], p=[0.3, 0.7], size=len(df))

# plot
g = sns.FacetGrid(df.melt(id_vars='class'),
                  col='variable',
                  hue='class',
                  col_wrap=3)  # change this to your liking
g = g.map(sns.kdeplot, "value", alpha=0.5)
Output:
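If you prefer to stay closer to the loop in the cancer-dataset snippet above, here is a minimal sketch that iterates over the feature columns of a DataFrame directly (it reuses the random df with columns A-F and the 'class' target from the example above):
# loop over feature columns and overlay per-class histograms
import numpy as np
import matplotlib.pyplot as plt

features = df.columns.drop('class')
fig, axes = plt.subplots(2, 3, figsize=(12, 6))
ax = axes.ravel()

for i, col in enumerate(features):
    _, bins = np.histogram(df[col], bins=40)              # shared bins for both classes
    ax[i].hist(df.loc[df['class'] == 0, col], bins=bins, color='r', alpha=0.5)
    ax[i].hist(df.loc[df['class'] == 1, col], bins=bins, color='g', alpha=0.3)
    ax[i].set_title(col, fontsize=9)
    ax[i].set_yticks(())

ax[0].legend(['class 0', 'class 1'], loc='best', fontsize=8)
plt.tight_layout()
plt.show()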

How to check if any point from a list of points is contained by any polygon from a list of polygons?

I have the following problem: I have a list of shapely points and a list of shapely polygons.
Now I want to check in which polygon a given point is.
At the moment I am using the following code, which seems not very clever:
# polygons_df is a pandas dataframe that contains the geometry of the polygons and the usage of the polygons (landuses in this case, e.g. residential)
# point_df is a pandas dataframe that contains the geometry of the points and the usage of the points (landuses in this case, e.g. residential)
# polylist is my list of shapely polygons
# pointlist is my list of shapely points
from shapely.geometry import Point, Polygon
import pandas as pd
import geopandas as gpd

i = 0
while i < len(polygons_df.index):
    j = 0
    while j < len(point_df.index):
        if polylist[i].contains(pointlist[j]):
            point_df.at[j, 'tags.landuse'] = polygons_df.iloc[i]['tags.landuse']
        else:
            pass
        j += 1
    i += 1
Can I somehow speed this up? I have more than 100,000 points and more than 10,000 polygons, and these loops take a while. Thanks!
I know a solution was found in the comments for the particular problem, but to answer a related question of how to check if an array of points is inside a shapely Polygon, I found the following solution:
>>> poly = Polygon([(0,0), (1,0), (0,1)])
>>> contains = np.vectorize(lambda p: poly.contains(Point(p)), signature='(n)->()')
>>> contains(np.array([[0.5,0.49],[0.5,0.51],[0.5,0.52]]))
array([ True, False, False])
I don't know that this necessarily speeds up the calculations, but at least you can avoid the for-loop.
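For the scale in the original question (over 100,000 points and 10,000 polygons), a GeoPandas spatial join is usually much faster than nested Python loops. A minimal sketch, assuming the existing shapely lists can be wrapped in GeoDataFrames (the column names here are illustrative):
import geopandas as gpd

# wrap the existing shapely geometries in GeoDataFrames
points_gdf = gpd.GeoDataFrame({'geometry': pointlist})
polygons_gdf = gpd.GeoDataFrame({'geometry': polylist,
                                 'tags.landuse': polygons_df['tags.landuse'].values})

# spatial join: each point receives the attributes of the polygon containing it
# (older GeoPandas versions use op='within' instead of predicate='within')
joined = gpd.sjoin(points_gdf, polygons_gdf, how='left', predicate='within')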