Maximal cliques in the Wikipedia "Clique" article

The maximal cliques in the graph model shown on Wikipedia
include only cliques of size 3 and 4. But I think cliques of size 2 (the ones connecting the size-3 and size-4 cliques) should also be counted as maximal cliques according to the definition given. Can anybody correct me if I am wrong?
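To check the definition concretely, networkx can enumerate all maximal cliques. The graph below is a hypothetical stand-in for the Wikipedia figure, not the figure itself: two triangles joined by a single bridge edge.

```python
import networkx as nx

# Hypothetical example graph (not the exact Wikipedia figure):
# two triangles joined by one bridge edge (3, 4).
G = nx.Graph()
G.add_edges_from([(1, 2), (1, 3), (2, 3),   # triangle {1, 2, 3}
                  (4, 5), (4, 6), (5, 6),   # triangle {4, 5, 6}
                  (3, 4)])                  # bridge edge

maximal_cliques = sorted(sorted(c) for c in nx.find_cliques(G))
print(maximal_cliques)
```

Here the bridge edge {3, 4} does appear as a maximal clique of size 2, because 3 and 4 have no common neighbour that could extend it; a size-2 clique is only excluded when both endpoints share a neighbour.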


Elbow Method for optimal no. of clusters

I have a dataset that I am analysing to find the optimal number of clusters using k-means.
I am testing the number of clusters from [1..11] - which produces the following plot:
The original dataset has six classes, but the elbow plot shows the bend really occurring at 3 clusters. Out of curiosity I overlaid a line on the plot working backwards from 11 clusters, and it is almost a straight line down to 6 clusters, which suggests to me that the real elbow is at 6, though it is subtle to see.
So, visually 3 looks to be the right answer, but given the known number of classes (6), the straight line I drew points to 6...
Question: how should you correctly interpret an elbow plot like this (especially when you are not given the classes)? Would you say the elbow is at 3 or 6?
Based on the plot I'd say that there are 6 clusters. From my experience and intuition, it makes sense to say that the "elbow" is where the within-cluster sum of squares begins to decrease linearly.
However, for cluster validation I recommend using silhouette coefficients, since they give a more objective answer. The silhouette coefficient also takes the separation between clusters into account.
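A minimal sketch of the silhouette approach, using synthetic blob data with 6 generated classes as a stand-in for the real dataset:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the dataset: 6 well-separated classes.
X, _ = make_blobs(n_samples=600, centers=6, cluster_std=0.6, random_state=0)

scores = {}
for k in range(2, 11):  # silhouette is only defined for k >= 2
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)  # in [-1, 1], higher is better

best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```

Unlike eyeballing the elbow, the silhouette gives a single number per k, so the choice is reproducible even when the true classes are unknown.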

Find bounds of a Voronoi diagram that result in equal-area regions

I have a Voronoi diagram that contains 4 sites.
I am trying to find bounds B that would divide my Voronoi regions into areas that are equal, or as close to equal as possible. The only requirement is that B has a constant aspect ratio c; in other words, the width of B divided by its height always equals c (c = width / height).
I am looking for a general solution that works for any 4 sites. I plan to use it in real-time software with constantly changing sites, so it is preferred that it not require a huge number of iterations.
I am curious whether there is an algorithm to solve this. So far I have tried:
Lloyd's relaxation, which is used to find equal-area regions, but it modifies the sites.
Reinforcement learning, but I could not get anything relevant out of it.
A direct solution for 3 sites, but it did not scale well to 4 sites.
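One possible direction (a sketch, not a known closed-form solution): parameterise B by its centre and width (height follows from the aspect ratio c), estimate each site's clipped region area by nearest-site assignment on a sample grid, and minimise the spread of the areas with a derivative-free optimiser. The 4 sites and the value of c below are made up for illustration.

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical sites and aspect ratio, for illustration only.
sites = np.array([[0.0, 0.0], [2.0, 0.3], [0.5, 1.8], [2.2, 2.1]])
c = 16 / 9  # required aspect ratio: width / height

def region_area_spread(params, n=40):
    cx, cy, w = params
    w = abs(w) + 1e-9            # keep the width positive
    h = w / c                    # height fixed by the aspect ratio
    # Grid of sample points inside the candidate bounds B.
    xs = np.linspace(cx - w / 2, cx + w / 2, n)
    ys = np.linspace(cy - h / 2, cy + h / 2, n)
    pts = np.stack(np.meshgrid(xs, ys), axis=-1).reshape(-1, 2)
    # Assign each sample to its nearest site, i.e. its Voronoi cell.
    d2 = ((pts[:, None, :] - sites[None, :, :]) ** 2).sum(axis=2)
    counts = np.bincount(d2.argmin(axis=1), minlength=len(sites))
    areas = counts / counts.sum() * (w * h)
    return areas.std()           # 0 would mean perfectly equal areas

# Start from the centroid of the sites and shrink the spread.
x0 = np.array([sites[:, 0].mean(), sites[:, 1].mean(), 3.0])
res = minimize(region_area_spread, x0, method="Nelder-Mead")
cx, cy, w = res.x
print(cx, cy, w, res.fun)
```

Nelder-Mead typically converges in a few dozen objective evaluations here, and in a real-time setting the previous frame's bounds can seed x0 so each update needs very few iterations.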

k-means clustering - inertia only gets larger

I am trying to use the KMeans clustering from faiss on a human-pose dataset of body joints. I have 16 body parts with two coordinates each, so a dimension of 32. The joints are scaled to the range between 0 and 1. My dataset consists of ~900,000 instances. As mentioned in the faiss FAQ (faiss_FAQ):
As a rule of thumb there is no consistent improvement of the k-means quantizer beyond 20 iterations and 1000 * k training points
Applying this to my problem, I randomly select 50,000 instances for training, as I want to check numbers of clusters k between 1 and 30.
Now to my "problem":
The inertia increases as the number of clusters increases (n_cluster on the x-axis):
I tried varying the number of iterations, the number of redos, verbose, and spherical, but the results stay the same or get worse. I do not think it is a problem with my implementation; I tested it on a small example with 2D data and very clear clusters, and there it worked.
Is the data just badly clustered, or is there another problem or mistake I have missed? Maybe the scaling of the values between 0 and 1? Should I try another approach?
I found my mistake: I had to increase the parameter max_points_per_centroid. Because I have so many data points, faiss sampled a sub-batch for the fit, and for a larger number of clusters this sub-batch is larger. See the faiss FAQ:
max_points_per_centroid * k: there are too many points, making k-means unnecessarily slow. Then the training set is sampled
The larger sub-batch of course has a larger inertia, as there are more points in total.
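The effect can be reproduced with plain NumPy (random data standing in for the pose dataset and for the trained centroids): total inertia is a sum over training points, so a bigger sub-batch gives a bigger number even with identical centroids, and only per-point averages are comparable across runs.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.random((20000, 32))     # stand-in for the scaled pose dataset
centroids = rng.random((10, 32))   # stand-in for trained centroids

def total_inertia(x, cents):
    # Sum of squared distances from each point to its nearest centroid.
    d2 = ((x[:, None, :] - cents[None, :, :]) ** 2).sum(axis=2)
    return d2.min(axis=1).sum()

small = data[:5000]    # small training sub-batch
large = data[:20000]   # larger sub-batch, as sampled for larger k

i_small = total_inertia(small, centroids)
i_large = total_inertia(large, centroids)
# i_large exceeds i_small simply because more points are summed;
# dividing by the number of points puts both on the same scale.
print(i_small / len(small), i_large / len(large))
```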

Why is the explained variance ratio so low in a binary classification problem?

My input data $X$ has 100 samples and 1724 features. I want to classify the data into two classes, but the explained variance ratio values are very low, around 0.05 or 0.04, no matter how many principal components I keep when performing PCA. I have seen in the literature that the explained variance ratio is usually close to 1. I am unable to understand my mistake. I tried performing PCA with fewer features and with more samples, but it does not make any significant difference to the result.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

pca = PCA(n_components=10).fit(X)
Xtrain = pca.transform(X)
explained_ratio = pca.explained_variance_ratio_
EX = explained_ratio
fig, ax1 = plt.subplots(ncols=1, nrows=1)
ax1.plot(np.arange(len(EX)), EX, marker='o', color='blue')
ax1.set_ylabel(r"$\lambda$")
ax1.set_xlabel("l")

Is the number in RFECV grid_scores_ equivalent to the selected features?

I am seeking some clarity surrounding the number associated with selector.grid_scores_ in RFECV.
I have used the following:
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import RFECV
estimator_RFECV = ExtraTreesClassifier(random_state=0)
estimator_RFECV = RFECV(estimator_RFECV, min_features_to_select=20, step=1, cv=5, scoring='accuracy', verbose=1, n_jobs=-1)
estimator_RFECV = estimator_RFECV.fit(X_train, y_train)
Using estimator_RFECV.ranking_, 27 features are selected through CV; however, when I look at estimator_RFECV.grid_scores_, the value at 27 features (the accuracy) is not the highest. Am I interpreting grid_scores_ incorrectly, and should I not expect 27 features to give the highest accuracy?
Here, estimator_RFECV.ranking_ gives you an array of feature rankings, such that ranking_[i] corresponds to the ranking position of the i-th feature. Selected (i.e., estimated best) features are assigned rank 1; a feature ranked 2 is less important than one ranked 1, and so on.
So estimator_RFECV.ranking_ gives us the ranking of the features, i.e. their relative importance.
estimator_RFECV.grid_scores_, however, gives us an array of cross-validation scores whose length is determined by min_features_to_select, step, and the total number of features available. In the above case (27 features in total) it should contain 8 elements, each representing the accuracy obtained with the top X features, where X runs from 20 to 27.
And yes, it is entirely possible for a model with fewer features to have higher accuracy, because some of the features may simply be irrelevant.
The RFECV page in the official scikit-learn documentation could also be helpful.
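A small self-contained sketch of how the per-step scores line up with feature counts, using synthetic data in place of X_train/y_train. Note that recent scikit-learn versions replaced grid_scores_ with cv_results_; the code below handles both.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import RFECV

# Synthetic stand-in data: 10 features, only 3 of them informative.
X, y = make_classification(n_samples=200, n_features=10, n_informative=3,
                           random_state=0)

selector = RFECV(ExtraTreesClassifier(n_estimators=50, random_state=0),
                 min_features_to_select=2, step=1, cv=3, scoring="accuracy")
selector = selector.fit(X, y)

# Recent scikit-learn exposes per-step CV scores in cv_results_;
# older versions exposed (roughly) the same array as grid_scores_.
scores = (selector.cv_results_["mean_test_score"]
          if hasattr(selector, "cv_results_") else np.asarray(selector.grid_scores_))

# scores[i] is the mean CV accuracy using (min_features_to_select + i)
# features, so with step=1 there are n_features - min_features_to_select + 1
# = 9 entries here, covering 2 through 10 features.
print(len(scores), selector.n_features_)
```

Plotting scores against np.arange(2, 11) makes it easy to see when the selected count does not coincide with the highest scoring count.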