I am trying to run k-means clustering on a dataset. My data is a pandas DataFrame with the following dimensions:
People_reduced.shape
Out[155]:
(417837, 13)
While k-means itself runs fine, passing the resulting cluster labels and the original data frame to sklearn's silhouette_score raises a strange error.
Here is the code I used:
kmeans=KMeans(n_clusters=2,init='k-means++',n_init=10, max_iter=20)
kmeans.fit(People_reduced.ix[:,1:])
cluster_labels = kmeans.labels_
# The silhouette_score gives the average value for all the samples.
# This gives a perspective into the density and separation of the formed
# clusters
silhouette_avg = silhouette_score(People_reduced.ix[:,1:].values,cluster_labels)
Error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-154-b392e118f64a> in <module>()
19 # This gives a perspective into the density and separation of the formed
20 # clusters
---> 21 silhouette_avg = silhouette_score(People_reduced.ix[:,1:].values,cluster_labels)
22 #silhouette_avg = silhouette_score(People_reduced.ix[:,1:], cluster_labels)
23
TypeError: 'list' object is not callable
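No answer was recorded for this question. One common cause of this message (an assumption here, since the traceback does not show the earlier notebook cells) is that a name such as silhouette_score was rebound to a list earlier in the session, so the later call reaches the list instead of the function. A minimal sketch with a hypothetical stand-in function, not sklearn's own:

```python
def silhouette_score(X, labels):
    """Hypothetical stand-in for sklearn.metrics.silhouette_score."""
    return 0.0

# Assumed earlier notebook cell: the same name is reused to collect results,
# which silently replaces the function with a list.
silhouette_score = [0.41, 0.37]

try:
    silhouette_score([[0.0], [1.0]], [0, 1])
except TypeError as exc:
    message = str(exc)

print(message)
```

If this is what happened, re-running `from sklearn.metrics import silhouette_score` (or restarting the kernel) rebinds the name to the function.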
I'm trying to simulate the segregation process in a city for a school project. I've managed to plot the city when initialized and after segregation, but I don't manage to create the animation showing the city's inhabitants moving to show the evolution.
I have two methods in my Ville class (I'm coding in French) that should make the animation together.
def afficher(self, inclure_satisfaction=False, inclure_carte_categories=False, size=5):
    carte = self.carte_categories(inclure_satisfaction=inclure_satisfaction)
    if inclure_carte_categories:
        print("Voici la carte des catégories (à titre de vérification)")
        print(carte)
    mat_rs = masked_array(carte, carte!=1.5)
    mat_ri = masked_array(carte, carte!=1)
    mat_bs = masked_array(carte, carte!=2.5)
    mat_bi = masked_array(carte, carte!=2)
    plt.figure(figsize=(size, size))
    affichage_rs = plt.imshow(mat_rs, cmap=cmap_rs)
    affichage_ri = plt.imshow(mat_ri, cmap=cmap_ri)
    affichage_bs = plt.imshow(mat_bs, cmap=cmap_bs)
    affichage_bi = plt.imshow(mat_bi, cmap=cmap_bi)
    return plt.figure()
(This function plots the map: it first gets an array from the carte_categories method, based on the category of each inhabitant, and then builds one masked array per category value to plot.)
def resoudre2(self):
    fig = plt.figure(figsize=(5,5))
    list_of_artists = []
    while self.habitants_insatisfaits != []:
        self.demenagement_insatisfait_aleatoire()
        list_of_artists.append([self.afficher(inclure_satisfaction=True)])
    ani = ArtistAnimation(fig, list_of_artists, interval=200, blit=True)
    return ani
(habitants_insatisfaits is a list containing the dissatisfied inhabitants: there are too few people of their own category around them, so they want to move somewhere else. resoudre means "solve", and this function loops until all inhabitants are satisfied where they are, at which point the society is mechanically segregated.)
The initialized city is shown in the linked image "initialized city" (dark colors mark dissatisfied inhabitants), and the segregated city in the linked image "segregated city".
But when I enter
a = ville1.resoudre2(compter=True)
I don't get an animation but only this error message:
/usr/local/lib/python3.7/dist-packages/ipykernel_launcher.py:211: RuntimeWarning: More than 20 figures have been opened. Figures created through the pyplot interface (`matplotlib.pyplot.figure`) are retained until explicitly closed and may consume too much memory. (To control this warning, see the rcParam `figure.max_open_warning`).
/usr/local/lib/python3.7/dist-packages/ipykernel_launcher.py:206: RuntimeWarning: More than 20 figures have been opened. Figures created through the pyplot interface (`matplotlib.pyplot.figure`) are retained until explicitly closed and may consume too much memory. (To control this warning, see the rcParam `figure.max_open_warning`).
Traceback (most recent call last):
File "/usr/local/lib/python3.7/dist-packages/matplotlib/cbook/__init__.py", line 196, in process
func(*args, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/matplotlib/animation.py", line 951, in _start
self._init_draw()
File "/usr/local/lib/python3.7/dist-packages/matplotlib/animation.py", line 1533, in _init_draw
fig.canvas.draw_idle()
AttributeError: 'NoneType' object has no attribute 'canvas'
/usr/local/lib/python3.7/dist-packages/matplotlib/image.py:452: UserWarning: Warning: converting a masked element to nan.
dv = np.float64(self.norm.vmax) - np.float64(self.norm.vmin)
/usr/local/lib/python3.7/dist-packages/matplotlib/image.py:459: UserWarning: Warning: converting a masked element to nan.
a_min = np.float64(newmin)
/usr/local/lib/python3.7/dist-packages/matplotlib/image.py:464: UserWarning: Warning: converting a masked element to nan.
a_max = np.float64(newmax)
<string>:6: UserWarning: Warning: converting a masked element to nan.
/usr/local/lib/python3.7/dist-packages/matplotlib/colors.py:993: UserWarning: Warning: converting a masked element to nan.
data = np.asarray(value)
(first problem) and then every map (corresponding to each step of the segregating city) is plotted (second problem; see here). And when I try to type
print(a)
from IPython.display import HTML
HTML(a.to_html5_video())
to plot the animation, I only get
<matplotlib.animation.ArtistAnimation object at 0x7f4cd376bfd0>
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-20-d7ca1fcdadb6> in <module>()
1 print(a)
2 from IPython.display import HTML
----> 3 HTML(a.to_html5_video())
2 frames
/usr/local/lib/python3.7/dist-packages/matplotlib/animation.py in _init_draw(self)
1531 # Flush the needed figures
1532 for fig in figs:
-> 1533 fig.canvas.draw_idle()
1534
1535 def _pre_draw(self, framedata, blit):
AttributeError: 'NoneType' object has no attribute 'canvas'
So I don't understand why I get this error and not just my animation...
Thank you for your help, it's the first time I ask questions here so don't hesitate if you need more details about my code! :)
Nathan
I had the same issue; downgrading Matplotlib fixed it for me.
pip install matplotlib==3.5.1
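Independently of the Matplotlib version, one possible reading of the traceback (an assumption, not something confirmed in the thread) is that afficher returns plt.figure(), a brand-new Figure, whereas ArtistAnimation expects each frame to be a list of artists drawn on the figure passed to it. That would also explain the "More than 20 figures" warning. A minimal sketch with random grids standing in for the city maps; build_animation and grids are hypothetical names:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
from matplotlib.animation import ArtistAnimation


def build_animation(grids, interval=200):
    """Draw every grid on ONE figure and collect the image artists as frames."""
    fig, ax = plt.subplots(figsize=(5, 5))
    frames = []
    for grid in grids:
        # imshow returns an AxesImage artist; each frame is a list of artists.
        image = ax.imshow(grid, animated=True)
        frames.append([image])
    return ArtistAnimation(fig, frames, interval=interval, blit=True)


# Hypothetical stand-in data: five random 10x10 maps with categories 1 and 2.
rng = np.random.default_rng(0)
grids = [rng.integers(1, 3, size=(10, 10)) for _ in range(5)]
ani = build_animation(grids)
```

In the original class, this would mean having afficher draw on an axes it receives and return the four imshow artists, rather than opening a figure per step.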
I get TypeError: 'DataFrame' object is not callable with the following code. Can anyone help? Thanks.
%cd -
dataset_orig = df_data_1(protected_attribute_names=['Gender'],
privileged_classes=['Male'],
features_to_drop=[])
dataset_orig_train, dataset_orig_test = dataset_orig.split([0.7], shuffle=True)
privileged_groups = [{'Gender': 1}]
unprivileged_groups = [{'Gender': 0}]
/home/wsuser/work
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-59-8c624cfec261> in <module>
5 # consider in this evaluation
6 privileged_classes=['Male'], # male is considered privileged
----> 7 features_to_drop=[]) # ignore all other attributes
8
9 dataset_orig_train, dataset_orig_test = dataset_orig.split([0.7], shuffle=True)
TypeError: 'DataFrame' object is not callable
It appears that df_data_1 is your dataset dataframe, right? If it is, you would need to update your script to convert it to StandardDataset:
from aif360.datasets import StandardDataset
dataset_orig = StandardDataset(df_data_1,
protected_attribute_names=['Gender'],
privileged_classes=['Male'],
features_to_drop=[],
favorable_classes=[1] # Update this with label values which are considered favorable in your dataset
)
I am not sure what your dataset looks like, but you can adapt the complete reproducible example here to do this process for your dataset.
I have a pyspark dataframe child with columns like:
lat1 lon1
80 70
65 75
I am trying to convert it into numpy matrix using IndexedRowMatrix as below:
from pyspark.mllib.linalg.distributed import IndexedRow, IndexedRowMatrix
mat = IndexedRowMatrix(child.select('lat','lon').rdd.map(lambda row: IndexedRow(row[0], Vectors.dense(row[1:]))))
But it throws an error. I want to avoid converting to a pandas DataFrame to get the matrix.
error:
Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 33.0 failed 4 times, most recent failure: Lost task 0.3 in stage 33.0 (TID 733, ebdp-avdc-d281p.sys.comcast.net, executor 16): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/data/02/yarn/nm/usercache/mbansa001c/appcache/application_1506130884691_56333/container_e48_1506130884691_56333_01_000017/pyspark.zip/pyspark/worker.py", line 174, in main
process()
You want to avoid pandas, but you try to convert to an RDD, which is severely suboptimal...
Anyway, assuming you can collect the selected columns of your child dataframe (a reasonable assumption, since you aim to put them in a Numpy array), it can be done with plain Numpy:
import numpy as np
np.array(child.select('lat1', 'lon1').collect())
# array([[80, 70],
# [65, 75]])
I have 2 issues nested as one:
n_rows, n_cols = np.shape(Z)
ZT = Z.transpose()
ZTZ = np.dot(ZT,Z) # does return a value
ZTZ1 = np.matmul(ZT,Z) # error
print("Close?")
print(np.allclose(ZTZ,ZTZ1))
print("----")
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-211-f26bdaebc910> in <module>()
26
27 print
---> 28 coV = getCovariance(df)
29 #print(coV)
30 print
<ipython-input-211-f26bdaebc910> in getCovariance(df)
13 ZT = Z.transpose()
14 ZTZ = np.dot(ZT,Z)
---> 15 ZTZ1 = np.matmul(ZT,Z)
16 print("Close?")
17 print(np.allclose(ZTZ,ZTZ1))
AttributeError: 'module' object has no attribute 'matmul'
Okay ... so obviously matmul doesn't exist on my machine. Got it. Now how do I confirm that dot is doing the same thing? I have a matrix that was once a pandas.DataFrame object, which I converted to a matrix through its .as_matrix() method, and I am getting rounding errors and need to check where things went wrong ... I also tried the standard * operator, but that doesn't work either on np.ndarray matrix objects.
SIDE NOTE: if there are any pro tips on rounding from someone with pandas experience, they would be much appreciated, because I can't work out why pandas gives me a different matrix than a built-in function from numpy (I have been asked to reimplement the function).
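No answer was recorded for this question. For 2-D arrays, np.dot computes the ordinary matrix product, so one way to check it without matmul is to compare it against an explicit sum of products. The sketch below uses made-up random data and also compares a hand-built covariance with np.cov; the divisor n - 1 is an assumption about which covariance convention is wanted.

```python
import numpy as np

# Hypothetical stand-in for the matrix obtained from the DataFrame.
rng = np.random.RandomState(0)
Z = rng.randn(6, 3)
n_rows, n_cols = Z.shape

# Matrix product via np.dot.
ZTZ = np.dot(Z.T, Z)

# The same product written out entry by entry as a sum of products.
ZTZ_loop = np.zeros((n_cols, n_cols))
for i in range(n_cols):
    for j in range(n_cols):
        ZTZ_loop[i, j] = sum(Z[k, i] * Z[k, j] for k in range(n_rows))

# Covariance by hand: center the columns, then divide by n - 1.
centered = Z - Z.mean(axis=0)
cov_manual = np.dot(centered.T, centered) / (n_rows - 1)
cov_numpy = np.cov(Z, rowvar=False)

print(np.allclose(ZTZ, ZTZ_loop))
print(np.allclose(cov_manual, cov_numpy))
```

Note that * on ndarrays is element-wise multiplication, not a matrix product. Small discrepancies between implementations are often a matter of the divisor (n versus n - 1) or of forgetting to center the columns, rather than of rounding.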
I'm trying to cluster some data with Python and scipy, but the following code does not work for reasons I do not understand:
from scipy.sparse import *
matrix = dok_matrix((en,en), int)
for pub in pubs:
    authors = pub.split(";")
    for auth1 in authors:
        for auth2 in authors:
            if auth1 == auth2: continue
            id1 = e2id[auth1]
            id2 = e2id[auth2]
            matrix[id1, id2] += 1
from scipy.cluster.vq import vq, kmeans2, whiten
result = kmeans2(matrix, 30)
print result
It says:
Traceback (most recent call last):
File "cluster.py", line 40, in <module>
result = kmeans2(matrix, 30)
File "/usr/lib/python2.7/dist-packages/scipy/cluster/vq.py", line 683, in kmeans2
clusters = init(data, k)
File "/usr/lib/python2.7/dist-packages/scipy/cluster/vq.py", line 576, in _krandinit
return init_rankn(data)
File "/usr/lib/python2.7/dist-packages/scipy/cluster/vq.py", line 563, in init_rankn
mu = np.mean(data, 0)
File "/usr/lib/python2.7/dist-packages/numpy/core/fromnumeric.py", line 2374, in mean
return mean(axis, dtype, out)
TypeError: mean() takes at most 2 arguments (4 given)
When I use kmeans instead of kmeans2, I get the following error:
Traceback (most recent call last):
File "cluster.py", line 40, in <module>
result = kmeans(matrix, 30)
File "/usr/lib/python2.7/dist-packages/scipy/cluster/vq.py", line 507, in kmeans
guess = take(obs, randint(0, No, k), 0)
File "/usr/lib/python2.7/dist-packages/numpy/core/fromnumeric.py", line 103, in take
return take(indices, axis, out, mode)
TypeError: take() takes at most 3 arguments (5 given)
I think I have these problems because I'm using sparse matrices, but my matrices are too big to fit in memory otherwise. Is there a way to use the standard clustering algorithms from scipy with sparse matrices, or do I have to re-implement them myself?
I created a new version of my code to work with a vector space:
el = len(experts)
pl = len(pubs)
print el, pl
from scipy.sparse import *
P = dok_matrix((pl, el), int)
p_id = 0
for pub in pubs:
    authors = pub.split(";")
    for auth1 in authors:
        if len(auth1) < 2: continue
        id1 = e2id[auth1]
        P[p_id, id1] = 1
from scipy.cluster.vq import kmeans, kmeans2, whiten
result = kmeans2(P, 30)
print result
But I'm still getting the error:
TypeError: mean() takes at most 2 arguments (4 given)
What am I doing wrong?
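The tracebacks are consistent with the scipy.cluster.vq functions expecting dense NumPy arrays. One option for the vector-space version, swapped in here as an alternative rather than taken from the thread, is scikit-learn's KMeans, which accepts scipy sparse matrices directly. A minimal sketch on a tiny made-up publication-by-author matrix:

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.cluster import KMeans

# Hypothetical data: 6 publications (rows) by 4 authors (columns).
# Publications 0-2 are by authors 0 and 1, publications 3-5 by authors 2 and 3.
rows = [0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5]
cols = [0, 1, 0, 1, 0, 1, 2, 3, 2, 3, 2, 3]
data = np.ones(len(rows))
P = csr_matrix((data, (rows, cols)), shape=(6, 4))

# KMeans works on the sparse matrix without converting it to a dense array.
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(P)
labels = model.labels_
print(labels)
```

A dok_matrix built as in the question can be converted with its tocsr() method before fitting.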
K-means cannot be run on distance matrices.
It needs a vector space to compute means in, that is why it is called k-means. If you want to use a distance matrix, you need to look into purely distance based algorithms such as DBSCAN and OPTICS (both on Wikipedia).
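As a sketch of the distance-based route, scikit-learn's DBSCAN accepts a precomputed distance matrix through metric="precomputed". The six 1-D positions and the eps and min_samples values below are made up for illustration and would need tuning on real data.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical points: two tight groups far apart on a line.
positions = np.array([0.0, 0.1, 0.2, 10.0, 10.1, 10.2])

# Pairwise distance matrix; this is all DBSCAN sees, not the coordinates.
distances = np.abs(positions[:, None] - positions[None, :])

clusterer = DBSCAN(eps=0.5, min_samples=2, metric="precomputed")
labels = clusterer.fit_predict(distances)
print(labels.tolist())  # [0, 0, 0, 1, 1, 1]
```

A co-authorship count matrix is a similarity rather than a distance, so it would have to be turned into a distance first (the choice of transformation is left open here).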
May I suggest, "Affinity Propagation" from scikit-learn? On the work I've been doing with it, I find that it has generally been able to find the 'naturally' occurring clusters within my data set. The inputs into the algorithm are an affinity matrix, or similarity matrix, of any arbitrary similarity measure.
I don't have a good handle on the kind of data you have on hand, so I can't speak to the exact suitability of this method to your data set, but it may be worth a try, perhaps?
Alternatively, if you're looking to cluster graphs, I'd take a look at NetworkX. That might be a useful tool for you. The reason I suggest this is that the data you're working with looks like a network of authors. With NetworkX, you can put in an adjacency matrix and find out which authors are clustered together.
For a further elaboration on this, you can see a question that I had asked earlier for inspiration here.
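The NetworkX suggestion can be sketched as follows, assuming publications are ";"-separated author strings as in the question. The author names are made up, and connected components are used as the simplest notion of "clustered together"; real co-author networks would likely call for a community-detection method instead.

```python
import itertools
import networkx as nx

# Hypothetical publications, each a ";"-separated list of authors.
pubs = ["alice;bob", "bob;carol", "dave;erin"]

graph = nx.Graph()
for pub in pubs:
    authors = [name for name in pub.split(";") if name]
    # Link every pair of co-authors; the edge weight counts shared publications.
    for a, b in itertools.combinations(authors, 2):
        if graph.has_edge(a, b):
            graph[a][b]["weight"] += 1
        else:
            graph.add_edge(a, b, weight=1)

# Authors connected through any chain of co-authorship end up in one group.
groups = sorted(sorted(component) for component in nx.connected_components(graph))
print(groups)  # [['alice', 'bob', 'carol'], ['dave', 'erin']]
```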