Can't extract clusters from fcluster after using scipy's hierarchical clustering - numpy

After performing hierarchical clustering on my dataset and plotting it with the dendrogram function, the data appears to be clustered correctly, but when I call fcluster to extract the cluster ids I always get just one cluster id.
Why is this happening?
My code:
for key, values in use_case_idx.items():
    vectors = []
    labels = []
    for value in values:
        labels.append(value[0])
        vectors.append(value[1])
    try:
        distance_matrix = pdist(vectors, metric='cosine')
        Z = linkage(distance_matrix, 'ward')
        plt.title("Ward")
        dendrogram(Z, labels=labels)
    except:
        continue
    plt.show()

    clusters = fcluster(Z, 10, criterion='distance')
    print(clusters)
And thus, the output:
More examples on: https://imgur.com/a/kEfub
What's wrong with this code?
Note: Each vector has 50 dimensions

The y-axis of the dendrogram shows the cophenetic distance between different nodes. Because you are using the distance criterion with a large value (much larger than the cophenetic distance), all elements are grouped into the same cluster.
Try using a smaller threshold (e.g. 0.025 for the first dendrogram you show). The dendrogram can act as a guide to choose "good" thresholds, although "good" is very subjective.
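For example, a small sketch based on the question's code (0.025 is only the illustrative value read off the first dendrogram, not a universal setting):

from scipy.cluster.hierarchy import fcluster

# Cut the tree well below the large merge distances visible in the dendrogram.
clusters = fcluster(Z, 0.025, criterion='distance')
print(clusters)  # should now contain several distinct cluster ids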

If you want to cluster your data into n distinct clusters, you can use the criterion 'maxclust', for example fcluster(Z, n, criterion='maxclust').
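A minimal sketch, assuming Z is the linkage matrix from the question and that you want, say, 4 clusters:

from scipy.cluster.hierarchy import fcluster

# Request exactly 4 flat clusters, regardless of the merge distances.
clusters = fcluster(Z, 4, criterion='maxclust')
print(clusters)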

Related

Importance of seed and num_runs in the KMeans clustering

New to ML, so I'm trying to make sense of the following code. Specifically:
1. In for run in np.arange(1, num_runs+1), what is the need for this loop? Why didn't the author use the setMaxIter method of KMeans?
2. What is the importance of seeding in clustering?
3. Why did the author choose to set the seed explicitly rather than using the default one?
import time

import numpy as np
import pandas as pd

from pyspark.ml.clustering import KMeans
from pyspark.ml.evaluation import ClusteringEvaluator

def optimal_k(df_in, index_col, k_min, k_max, num_runs):
    '''
    Determine the optimal number of clusters by using Silhouette Score Analysis.
    :param df_in: the input dataframe
    :param index_col: the name of the index column
    :param k_min: the minimum number of clusters
    :param k_max: the maximum number of clusters
    :param num_runs: the number of runs for each fixed number of clusters
    :return k: optimal number of clusters
    :return silh_lst: Silhouette score
    :return r_table: the running results table
    :author: Wenqiang Feng
    :email: von198#gmail.com.com
    '''
    start = time.time()
    silh_lst = []
    k_lst = np.arange(k_min, k_max+1)
    r_table = df_in.select(index_col).toPandas()
    r_table = r_table.set_index(index_col)
    centers = pd.DataFrame()
    for k in k_lst:
        silh_val = []
        for run in np.arange(1, num_runs+1):
            # Trains a k-means model.
            kmeans = KMeans()\
                .setK(k)\
                .setSeed(int(np.random.randint(100, size=1)))
            model = kmeans.fit(df_in)
            # Make predictions
            predictions = model.transform(df_in)
            r_table['cluster_{k}_{run}'.format(k=k, run=run)] = predictions.select('prediction').toPandas()
            # Evaluate clustering by computing the Silhouette score
            evaluator = ClusteringEvaluator()
            silhouette = evaluator.evaluate(predictions)
            silh_val.append(silhouette)
        silh_array = np.asanyarray(silh_val)
        silh_lst.append(silh_array.mean())
    elapsed = time.time() - start
    silhouette = pd.DataFrame(list(zip(k_lst, silh_lst)), columns=['k', 'silhouette'])
    print('+------------------------------------------------------------+')
    print("| The finding optimal k phase took %8.0f s. |" % (elapsed))
    print('+------------------------------------------------------------+')
    return k_lst[np.argmax(silh_lst, axis=0)], silhouette, r_table
I'll try to answer your questions based on my reading of the material.
The reason for this loop is that the author sets a new seed for every loop using int(np.random.randint(100, size=1)). If the feature variables exhibit patterns that automatically group them into visible clusters, then the starting seed should not have an impact on the final cluster memberships. However, if the data is evenly distributed, then we might end up with different cluster members based on the initial random variable. I believe the author is changing these seeds for each run to test different initial distributions. Using setMaxIter would set maximum iterations for the same seed (initial distribution).
Similar to the above: the seed defines the initial distribution of the k points around which you're going to cluster. Depending on your underlying data distribution, the clusters can converge to different final distributions.
The author has control over the seed, as discussed in points 1 and 2. You can see for which seeds your code converges to the desired clusters and for which it might not converge. Also, if you iterate over, say, 100 different seeds and your code still converges to the same final clusters, the choice of seed likely doesn't matter for your data. Another use, from a more software engineering perspective, is that setting an explicit seed is important if you want to, for example, write tests for your code and don't want them to fail randomly.
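To illustrate the reproducibility point, a minimal sketch (assuming, as above, that df_in is a Spark DataFrame with a features column; k=3 and seed=42 are arbitrary example values):

from pyspark.ml.clustering import KMeans

# With a fixed seed, the same data always yields the same cluster assignments,
# which is what you want in tests; the loop above instead draws a new random
# seed per run to probe how sensitive the clustering is to initialization.
kmeans = KMeans().setK(3).setSeed(42)
model = kmeans.fit(df_in)
predictions = model.transform(df_in)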

Miscalculation of new function "pcsegdist" in Matlab R2018b

I am trying to test the new function "pcsegdist" in Matlab R2018b. However, the result of segmenting the point cloud into clusters based on Euclidean distance is wrong.
Example: I test with 1797 3D data points (please see the attached test.txt file). Note that the smallest distance between two neighboring points is 0.3736.
tic
clear; clc;
filename = 'test.txt';
load('test.txt');
P = test(:,1:3); % get the (x,y,z) coordinates from columns 1, 2, 3 of the data
ptCloud = pointCloud(P);
minDistance = 0.71; % this value should be less than the smallest 3D distance between 2 clusters
[labels,numClusters] = pcsegdist(ptCloud,minDistance); % numClusters: the number of clusters
% labels: a k-by-1 matrix holding the cluster index of each point
toc

%% Generate the cell_cluster
cell_cluster = {};
x = P(:,1); y = P(:,2); z = P(:,3);
for i = 1:numClusters
    cluster_i = [x(labels==i), y(labels==i), z(labels==i)]; % x,y,z coordinates of all points belonging to cluster i
    cell_cluster{end+1} = cluster_i; % a (1 x k) cell, where k = number of clusters
end
figure; Plot_cell(cell_cluster); view(3); % plot the resulting clusters (using a helper plot function)
But when I verify manually (against ground truth data), the result should be as in the figure below:
Thus, I question the result of the new function "pcsegdist" in Matlab R2018b. Or have I misunderstood something or made a mistake somewhere?

Confused by random.randn()

I am a bit confused by the numpy function random.randn(), which returns random values from the standard normal distribution in an array of the size of your choosing.
My question is that I have no idea when this would ever be useful in applied practices.
For reference about me I am a complete programming noob but studied math (mostly stats related courses) as an undergraduate.
The numpy function randn is incredibly useful for adding a random noise element to a dataset that you create for initial testing of a machine learning model. Say, for example, that you want to create a million-point dataset that is roughly linear, for testing a regression algorithm. You create a million data points using
x_data = np.linspace(0.0,10.0,1000000)
You generate a million random noise values using randn
noise = np.random.randn(len(x_data))
To create your linear dataset you follow the formula
y = mx + b + noise, with the following code (setting m = 0.5 and b = 5 in this example):
y_data = (0.5 * x_data ) + 5 + noise
Finally the dataset is created with
my_data = pd.concat([pd.DataFrame(data=x_data,columns=['X Data']),pd.DataFrame(data=y_data,columns=['Y'])],axis=1)
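Putting the snippets above together (a minimal runnable sketch; the imports are the only additions to the original answer):

import numpy as np
import pandas as pd

# A roughly linear dataset: y = 0.5*x + 5, plus standard-normal noise.
x_data = np.linspace(0.0, 10.0, 1000000)
noise = np.random.randn(len(x_data))
y_data = (0.5 * x_data) + 5 + noise

my_data = pd.concat([pd.DataFrame(data=x_data, columns=['X Data']),
                     pd.DataFrame(data=y_data, columns=['Y'])], axis=1)
print(my_data.head())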
It could also be used in 3D programming to generate non-overlapping random values, which can be useful when optimizing graphical effects.
Another possible statistical use is to apply a formula with added random noise in order to test how some factor, such as the length of a time interval, affects a given quantity. This would give you a statistic indicating, for example, whether your formula is more effective over shorter or longer intervals.
np.random.randn(d0, d1, ..., dn) returns a sample (or samples) from the “standard normal” distribution (mu=0, stdev=1).
For random samples from N(mu, sigma^2), use:
sigma * np.random.randn(...) + mu
This is because if Z is a standard normal deviate, then sigma*Z + mu will have a normal distribution with expected value mu and standard deviation sigma.
https://docs.scipy.org/doc/numpy-1.14.0/reference/generated/numpy.random.randn.html
https://en.wikipedia.org/wiki/Normal_distribution
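As a quick sanity check of the shift-and-scale rule (a small sketch; mu=5 and sigma=2 are arbitrary example values):

import numpy as np

mu, sigma = 5.0, 2.0
samples = sigma * np.random.randn(100000) + mu  # draws from N(mu, sigma^2)

# The empirical mean and standard deviation should be close to mu and sigma.
print(samples.mean(), samples.std())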

Interpreting the Y values of a normal distribution

I've written this code to generate a normal distribution of the set of values 1, 2, 3:
import pandas as pd
import random
import numpy as np
import matplotlib.pyplot as plt

df = pd.DataFrame({'col1': [1, 2, 3]})
print(df)
fig, ax = plt.subplots(1, 1)
df.plot(kind='hist', normed=True, ax=ax)
Returns:
The X values are the range of possible values, but how are the Y values interpreted?
Reading http://www.stat.yale.edu/Courses/1997-98/101/normal.htm, the Y value is calculated using:
A normal distribution has a bell-shaped density curve described by its
mean mu and standard deviation sigma. The density curve is symmetrical,
centered about its mean, with its spread determined by its standard
deviation. The height of a normal density curve at a given point x is
given by
f(x) = (1 / (sigma * sqrt(2*pi))) * exp(-(x - mu)^2 / (2*sigma^2))
What is the meaning of this formula?
I think you are confusing two concepts here. A histogram just plots how many times a certain value appears. So for your list of [1, 2, 3], the value 1 appears once, and the same for 2 and 3. If you had set normed=False, you would get the same plot but with a bar height of 1.0.
However, when you set normed=True, you turn on normalization. Note that this does not have anything to do with a normal distribution. Have a look at the documentation for hist, which you can find here: http://matplotlib.org/api/pyplot_api.html?highlight=hist#matplotlib.pyplot.hist
There you can see what the normed option does:
If True, the first element of the return tuple will be the counts normalized to form a probability density, i.e., n/(len(x)*dbin), i.e., the integral of the histogram will sum to 1. If stacked is also True, the sum of the histograms is normalized to 1.
So it gives you the formula right there. In your case you have three points, i.e. len(x)=3. If you look at your plot you can see that your bins have a width of 0.2, so dbin=0.2. Each value appears only once, so for each of 1, 2, and 3 you have n=1. Thus the height of your bars should be 1/(3*0.2) = 1.67, which is exactly what you see in your histogram.
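You can reproduce that arithmetic directly (a small sketch; the ten bins of width 0.2 are an assumption matching the plot described above):

import numpy as np

values = [1, 2, 3]
# Ten bins of width 0.2 spanning 1.0 to 3.0, roughly matching the pandas histogram.
counts, edges = np.histogram(values, bins=np.linspace(1.0, 3.0, 11), density=True)
print(counts)  # the non-empty bins have height 1 / (3 * 0.2) = 1.67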
Now for the normal distribution: that is just a specific probability density function, defined by the formula you gave. It is useful in many fields because it relates to uncertainty; you'll see it a lot in statistics, for example. The Wikipedia article on it has lots of info.
If you want to generate a list of values that conform to a normal distribution, I would suggest reading the documentation of numpy.random.normal, which will do this for you: https://docs.scipy.org/doc/numpy/reference/generated/numpy.random.normal.html
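For example (a short sketch; the mean of 2, standard deviation of 0.5, sample size, and bin count are arbitrary illustrative choices):

import numpy as np
import matplotlib.pyplot as plt

# 10,000 draws from a normal distribution with mean 2 and standard deviation 0.5.
samples = np.random.normal(loc=2.0, scale=0.5, size=10000)

# With density normalization the histogram approximates the bell-shaped density curve.
plt.hist(samples, bins=50, density=True)  # density=True is the newer name for normed=True
plt.show()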

Interpolate in one direction

I have sampled data and plotted it with imshow():
I would like to interpolate just along the horizontal axis so that I can more easily distinguish samples and spot features.
Is it possible to interpolate in just one direction with MPL?
Update:
SciPy has a whole package with various interpolation methods.
I used the simplest one, interp1d, as suggested by tcaswell:
import numpy as np
from numpy import arange
from scipy import interpolate

def smooth_inter_fun(r):
    s = interpolate.interp1d(arange(len(r)), r)
    xnew = arange(0, len(r) - 1, .1)
    return s(xnew)

new_data = np.vstack([smooth_inter_fun(r) for r in data])
Linear and cubic results:
As expected :)
This tutorial covers a range of interpolation methods available in numpy/scipy. If you want to interpolate in just one direction, I would work on each row independently and then re-assemble the results. You might also be interested in simply smoothing your data (for example, Python Smooth Time Series Data or Using strides for an efficient moving average filter).
def smooth_inter_fun(r):
    # whatever process you want to use on a single row
    ...

new_data = np.vstack([smooth_inter_fun(r) for r in data])
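For the smoothing route mentioned above, a row-wise moving average is one simple option (a minimal sketch; the window size and the random test data are assumptions):

import numpy as np

def smooth_row(r, window=5):
    # Moving average along a single row; mode='same' keeps the original length.
    kernel = np.ones(window) / window
    return np.convolve(r, kernel, mode='same')

# Hypothetical 2D data array: each row is smoothed independently,
# so nothing is mixed across the vertical direction.
data = np.random.rand(10, 100)
smoothed = np.vstack([smooth_row(r) for r in data])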