I checked unsupervised clustering in gensim, fastText, and sklearn, but did not find any documentation on clustering my text data with unsupervised learning without specifying the number of clusters to be identified.
For example, in sklearn KMeans clustering:
km = KMeans(n_clusters=true_k, init='k-means++', max_iter=100)
Where I have to provide n_clusters.
In my case, I have text, and the number of clusters should be identified automatically before the text is clustered. Any reference article or link would be much appreciated.
DBSCAN is a density-based clustering method for which we don't have to specify the number of clusters beforehand.
sklearn implementation: http://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html
Here is a good tutorial that gives an intuitive understanding on DBSCAN: http://mccormickml.com/2016/11/08/dbscan-clustering/
I extracted the following from the above tutorial, which may be useful for you.
k-means requires specifying the number of clusters, ‘k’. DBSCAN does not, but does require specifying two parameters which influence the decision of whether two nearby points should be linked into the same cluster.
These two parameters are a distance threshold, ε (epsilon), and “MinPts” (minimum number of points), to be explained.
There are other methods as well (follow the link given in the comments); however, DBSCAN is a popular choice.
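For a concrete starting point, here is a minimal sketch of clustering raw text with TF-IDF features and DBSCAN; the example sentences and the eps/min_samples values are illustrative and would need tuning on real data:

from sklearn.cluster import DBSCAN
from sklearn.feature_extraction.text import TfidfVectorizer

texts = [
    "the cat sat on the mat",
    "a cat was sitting on a mat",
    "stock markets fell sharply today",
    "shares dropped as markets tumbled",
]

# vectorize the documents, then cluster with a cosine-distance threshold
X = TfidfVectorizer(stop_words="english").fit_transform(texts)
db = DBSCAN(eps=0.7, min_samples=2, metric="cosine").fit(X)
print(db.labels_)  # -1 marks noise; other integers are cluster ids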
I would like to use PyTextRank for keyphrase extraction. How can I feed 5 million documents (each document consisting of a few paragraphs) to the package?
This is the example I see in the official tutorial.
text = "Compatibility of systems of linear constraints over the set of natural numbers. Criteria of compatibility of a system of linear Diophantine equations, strict inequations, and nonstrict inequations are considered. Upper bounds for components of a minimal set of solutions and algorithms of construction of minimal generating sets of solutions for all types of systems are given. These criteria and the corresponding algorithms for constructing a minimal supporting set of solutions can be used in solving all the considered types systems and systems of mixed types.\n"
doc = nlp(text)
for phrase in doc._.phrases:
ic(phrase.rank, phrase.count, phrase.text)
ic(phrase.chunks)
Is my only option to concatenate the several million documents into a single string and pass it to nlp(text)? I do not think I can use nlp.pipe(texts), because I want to create one network by computing words/phrases from all documents.
No; instead it would almost certainly be better to run these tasks in parallel. Many use cases of pytextrank have used Spark, Dask, Ray, etc., to parallelize running documents through a spaCy pipeline with pytextrank to extract entities.
For an example of parallelization with Ray, see https://github.com/Coleridge-Initiative/rclc/blob/4d5347d8d1ac2693901966d6dd6905ba14133f89/bin/index_phrases.py#L45
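As a rough illustration (not the script linked above), here is a minimal sketch of batching documents through spaCy + pytextrank with Ray; the batch size, model name, and the (doc_id, text) input format are my assumptions:

import ray
import spacy
import pytextrank  # noqa: F401 -- registers the "textrank" pipeline component

ray.init()

@ray.remote
def extract_phrases(batch):
    # each worker builds its own pipeline; spaCy objects don't serialize well
    nlp = spacy.load("en_core_web_sm")
    nlp.add_pipe("textrank")
    results = []
    for doc_id, text in batch:
        doc = nlp(text)
        results.append((doc_id, [(p.text, p.rank, p.count) for p in doc._.phrases[:20]]))
    return results

# documents: iterable of (doc_id, text) pairs, e.g. read from disk or a database
documents = [("doc1", "first document text ..."), ("doc2", "second document text ...")]
batches = [documents[i:i + 1000] for i in range(0, len(documents), 1000)]
futures = [extract_phrases.remote(b) for b in batches]
for batch_result in ray.get(futures):
    for doc_id, phrases in batch_result:
        print(doc_id, phrases)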
One question would be: how are you associating the extracted entities with documents? Are these being collected into a dataset, or perhaps a database or key/value store?
However these results get collected, you could then construct a graph of co-occurring phrases and also include additional semantics to help structure the results. A sister project, kglab (https://github.com/DerwenAI/kglab), was created for these kinds of use cases. There are some examples in the Jupyter notebooks included with the kglab project; see https://derwen.ai/docs/kgl/tutorial/
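As one rough illustration of the co-occurrence graph idea (using networkx rather than kglab, with made-up phrase lists standing in for the extraction output):

import itertools
import networkx as nx

doc_phrases = {
    "doc1": ["linear constraints", "natural numbers", "minimal set"],
    "doc2": ["natural numbers", "minimal set", "diophantine equations"],
}

G = nx.Graph()
for phrases in doc_phrases.values():
    for a, b in itertools.combinations(sorted(set(phrases)), 2):
        # increment the edge weight each time two phrases co-occur in a document
        w = G.get_edge_data(a, b, default={"weight": 0})["weight"]
        G.add_edge(a, b, weight=w + 1)

print(G.number_of_nodes(), "phrases,", G.number_of_edges(), "co-occurrence edges")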
FWIW, we'll have tutorials coming up at ODSC West about using kglab and pytextrank and there are several videos online (under Graph Data Science) for previous tutorials at conferences. We also have monthly public office hours through https://www.knowledgegraph.tech/ – message me #pacoid on Tw for details.
I have a question regarding the Box-Cox transformation (or log transformation). I am working on a dataset in which I have lots of skewed features. When I apply the Box-Cox transformation, I get quite a nice distribution, but the correlation with the target decreases. If I were working with linear models, I would just consider the correlation to decide whether or not to transform the feature. But as I mentioned, I am working with tree-based models, so should I transform the feature to get a more dispersed distribution, or leave it as it is to avoid the decrease in correlation?
I added a screenshot of the distribution and its relationship with the target variable, for both the transformed and untransformed feature (the left two plots show the original feature and target).
PS: Guessing from the plots, it seems to me that if I transform the feature it will be easier for the tree to find a split on this particular feature.
Thanks a lot,
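For reference, a minimal sketch of the comparison being described, computing the Pearson correlation with the target before and after a Box-Cox transform; the feature and target here are synthetic placeholders:

import numpy as np
from scipy.stats import boxcox

rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=1.0, size=1000)          # skewed, strictly positive feature
y = 2.0 * np.log(x) + rng.normal(scale=0.5, size=1000)     # target

x_bc, lmbda = boxcox(x)  # Box-Cox requires strictly positive values

print("corr(raw x, y):    ", np.corrcoef(x, y)[0, 1])
print("corr(boxcox(x), y):", np.corrcoef(x_bc, y)[0, 1])
print("fitted lambda:", lmbda)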
I am starting with my study of RL and was wondering how one would approach observation features that cannot fully represent the (hidden) state.
Is there some systematic approach, or are there guidelines on what one would prefer the feature vector to look like? Discreteness, dimension, Markov properties, embedding quality...?
I would like to process machine-operation data streams; I actually have a lot of direct measurements and many high-dimensional feature vectors (also streamed).
Thank you very much for your input.
I want to use some clustering method for a large social network dataset. The problem is how to evaluate the clustering method. Yes, I can use some external, internal, and relative cluster validation methods. I used Normalized Mutual Information (NMI) as an external validation method based on synthetic data: I produced synthetic datasets with 5 clusters of equal size, with strongly connected links inside each cluster and weak links between clusters, to check the clustering method, and then analysed spectral clustering and modularity-based community detection methods on these synthetic datasets. I then used the clustering with the best NMI on my real-world dataset and checked the error (cost function) of my algorithm, and the result was good. Is my testing method for my cost function good, or should I also validate the clusters of my real-world data again?
Thanks.
Try more than one measure.
There are a dozen cluster validation measures, and it's hard to predict which one is most appropriate for a problem. The differences between them are not really understood yet, so it's best if you consult more than one.
Also note that if you don't use a normalized measure, the baseline may be really high. So the measures are mostly useful to say "result A is more similar to result B than result C", but should not be taken as an absolute measure of quality. They are a relative measure of similarity.
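For example, scikit-learn exposes several such measures that can be computed side by side; the label arrays below are made-up placeholders for a ground truth and a clustering result:

from sklearn.metrics import (
    normalized_mutual_info_score,
    adjusted_mutual_info_score,
    adjusted_rand_score,
)

true_labels = [0, 0, 0, 1, 1, 1, 2, 2, 2]   # synthetic ground truth
pred_labels = [0, 0, 1, 1, 1, 1, 2, 2, 0]   # output of a clustering method

print("NMI:", normalized_mutual_info_score(true_labels, pred_labels))
print("AMI:", adjusted_mutual_info_score(true_labels, pred_labels))
print("ARI:", adjusted_rand_score(true_labels, pred_labels))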
I have a sequence of GPS values, each containing: timestamp, latitude, longitude, n_sats, gps_speed, gps_direction, ... (some subset of NMEA data). I'm not sure what quality the direction and speed values are. Further, I cannot expect the sequence to be evenly spaced w.r.t. the timestamp. I want to get a smooth trajectory at an even time step.
I've read that the Kalman filter is the tool of choice for such tasks. Is this indeed the case?
I've found some implementations of the Kalman Filter for Python:
http://www.scipy.org/Cookbook/KalmanFiltering
http://ascratchpad.blogspot.de/2010/03/kalman-filter-in-python.html
These, however, appear to assume regularly spaced data, i.e. a fixed time step per iteration.
What would it take to integrate support of irregularly spaced observations?
One thing I could imagine is to repeat/adapt the prediction step to a time-based model. Can you recommend such a model for this application? Would it need to take into account the NMEA speed values?
Having looked all over for an understandable resource on Kalman filters, I'd highly recommend this one: https://github.com/rlabbe/Kalman-and-Bayesian-Filters-in-Python
To your particular question regarding irregularly spaced observations: Look at Chapter 8 in the reference above, and under the heading "Nonstationary Processes". To summarize, you'll need to use a different state transition function and process noise covariance for each iteration. Those are the only things you'll need to change at each iteration, since they're the only components dependent on delta t.
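As a minimal sketch of that idea, using filterpy (the library behind the book linked above) with a 1-D constant-velocity model; the timestamps, positions, and noise values are made-up placeholders, not tuned settings:

import numpy as np
from filterpy.kalman import KalmanFilter
from filterpy.common import Q_discrete_white_noise

kf = KalmanFilter(dim_x=2, dim_z=1)      # state: [position, velocity]
kf.x = np.array([[0.0], [0.0]])
kf.H = np.array([[1.0, 0.0]])            # we only measure position
kf.P *= 100.0                            # initial state uncertainty
kf.R = np.array([[5.0]])                 # measurement noise (GPS accuracy)

timestamps = [0.0, 1.0, 2.5, 2.8, 5.0]   # irregularly spaced
positions  = [0.0, 1.1, 2.4, 2.9, 5.2]

prev_t = timestamps[0]
for t, z in zip(timestamps, positions):
    dt = t - prev_t
    prev_t = t
    # the only per-step changes: state transition and process noise depend on dt
    kf.F = np.array([[1.0, dt],
                     [0.0, 1.0]])
    kf.Q = Q_discrete_white_noise(dim=2, dt=dt, var=0.1)
    kf.predict()
    kf.update(np.array([[z]]))
    print(t, kf.x[0, 0])

A real GPS filter would track latitude/longitude (or projected x/y) with a 4-dimensional constant-velocity state, but the per-iteration handling of dt is the same.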
You could also try kinematic interpolation to see if the results fit what you expect.
Here's a Python implementation of one of these algorithms: https://gist.github.com/talespaiva/128980e3608f9bc5083b