How to combine TF-IDF with edit distance or Jaro-Winkler distance

I am looking for ways to improve the accuracy of the TF-IDF weighting scheme in string matching (similarity). The main issue is that TF-IDF is sensitive to typographical errors in strings, and most large datasets tend to have typos.
I realised that variants of edit distance (character-based similarity metrics such as Levenshtein, affine-gap, Jaro and Jaro-Winkler) are suitable for computing similarity between strings that contain typographical errors, but not when the words in the strings are out of order.
Hence I would like to use the error-correcting ability of edit distance to enhance the accuracy of TF-IDF.
Any ideas on how to address this challenge will be highly appreciated.
Thanks in advance.

A paper published by CMU researchers in 2003 explains how to combine TF-IDF with Jaro-Winkler:
https://www.cs.cmu.edu/~pradeepr/papers/ijcai03.pdf
Their Java code is also available on SourceForge as the SecondString project:
https://sourceforge.net/projects/secondstring/
Here is a link to the Javadocs:
http://secondstring.sourceforge.net/javadoc/
The SecondString project page:
http://secondstring.sourceforge.net/
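
To make the idea concrete, here is a minimal Python sketch of a "soft" TF-IDF score: token weights come from TF-IDF, but two tokens count as a match when a character-level similarity exceeds a threshold, so both typos and word order are tolerated. It is only an illustration under assumptions, not the SecondString implementation: difflib's SequenceMatcher ratio stands in for Jaro-Winkler, and the function names, the weighting formula and the 0.8 threshold are choices made for this example.

```python
import math
from collections import Counter
from difflib import SequenceMatcher


def char_sim(a, b):
    # Stand-in for Jaro-Winkler (an assumption of this sketch): a ratio in [0, 1].
    return SequenceMatcher(None, a, b).ratio()


def tfidf_vectors(docs):
    """Return one L2-normalised {token: weight} dict per tokenised document."""
    n = len(docs)
    df = Counter(tok for doc in docs for tok in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        raw = {t: math.log(1 + c) * math.log(1 + n / df[t]) for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in raw.values())) or 1.0
        vectors.append({t: w / norm for t, w in raw.items()})
    return vectors


def soft_tfidf(vec_s, vec_t, threshold=0.8):
    """Sum weight products over tokens of S whose closest token in T is similar enough."""
    score = 0.0
    for w, weight_s in vec_s.items():
        best_tok, best_sim = None, 0.0
        for v in vec_t:
            s = char_sim(w, v)
            if s > best_sim:
                best_tok, best_sim = v, s
        if best_tok is not None and best_sim >= threshold:
            score += weight_s * vec_t[best_tok] * best_sim
    return score


docs = [
    "william cohen carnegie mellon".split(),
    "mellon carnegie wiliam cohen".split(),   # reordered, with one typo
    "stephen fienberg statistics".split(),
]
vecs = tfidf_vectors(docs)
print(soft_tfidf(vecs[0], vecs[1]))  # high: the typo and the word order are both tolerated
print(soft_tfidf(vecs[0], vecs[2]))  # low: no token is close to any other
```

Swapping char_sim for a real Jaro-Winkler function from a string-matching library of your choice would bring this closer to what the paper describes.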

How is hashing implemented in SGNN (Self-Governing Neural Networks)?

So I've read the paper named Self-Governing Neural Networks for On-Device Short Text Classification which presents an embedding-free approach to projecting words into a neural representation. To quote them:
The key advantage of SGNNs over existing work is that they surmount the need for pre-trained word embeddings and complex networks with huge parameters. [...] our method is a truly embedding-free approach unlike majority of the widely-used state-of-the-art deep learning techniques in NLP
Basically, from what I understand, they proceed as follows:
You'd first need to compute n-grams (side question: is that skip-gram in the old sense, or skip-gram as in word2vec? I assume the former for the rest of this question) on the characters of words to obtain a featurized representation of the words in a text. As an example, with 4-grams you could get a 1M-dimensional sparse feature vector per word. Hopefully, it's sparse, so memory needn't be fully used for it, because it's almost one-hot (or count-vectorized, or tf-idf vectorized n-grams with lots of zeros).
Then you'd need to hash those sparse n-gram vectors using locality-sensitive hashing (LSH). They seem to use random projection, from what I've understood. Also, instead of n-gram vectors, they use tuples of an n-gram feature index and its value for each non-zero n-gram feature (which is also by definition a "sparse matrix" computed on the fly, for example from a default dictionary of non-zero features instead of a full vector).
I found an implementation of random projection in scikit-learn. From my tests, it doesn't seem to yield a binary output, although the whole thing uses sparse on-the-fly computations within scikit-learn's sparse matrices, as expected for a memory-efficient (dictionary-like, non-zero features only) implementation, I guess.
What doesn't work in all of this, and where my question lies, is how they could end up with binary features from the sparse projection (the hashing). They seem to be saying that the hashing is done at the same time as computing the features, which is confusing. I would have expected the hashing to come in the order I wrote above, as steps 1-2-3, but their steps 1 and 2 seem to be somehow merged.
My confusion arises mostly from the paragraphs starting with the phrase "On-the-fly Computation." on page 888 (page 2 of the PDF) of the paper, in the right column. [Image: excerpt of the "On-the-fly Computation" passage]
I'd like to bring my school project to a success (trying to mix BERT with SGNNs instead of using word embeddings). So, how would you demystify that? More precisely, how could a similar random hashing projection be achieved with scikit-learn, TensorFlow or PyTorch? I've researched this extensively trying to connect the dots, but the paper doesn't give the implementation details that I'd like to reproduce. I at least know that the SGNN uses 80 fourteen-dimensional LSHs on character-level n-grams of words (is my understanding right in the first place?).
Thanks!
EDIT: after starting to code, I realized that the output of scikit-learn's SparseRandomProjection() looks like this:
[0.7278244729081154,
-0.7278244729081154,
0.0,
0.0,
0.7278244729081154,
0.0,
...
]
For now, this looks fine: it's closer to binary, and it could be cast to an integer instead of a float by using the right ratio in the first place. I still wonder about the skip-gram thing; I assume n-grams of the characters of words for now, but that's probably wrong. Will post code soon to GitHub.
EDIT #2: I coded something here, but with n-grams instead of skip-grams: https://github.com/guillaume-chevalier/SGNN-Self-Governing-Neural-Networks-Projection-Layer
More discussion threads on this here: https://github.com/guillaume-chevalier/SGNN-Self-Governing-Neural-Networks-Projection-Layer/issues?q=is%3Aissue
First of all, thanks for your implementation of the projection layer, it helped me get started with my own.
I read your discussion with #thinline72, and I agree with him that the features are calculated over the whole line of text, char by char, not word by word. I am not sure this difference in features is very relevant, though.
Answering your question: my interpretation is that they do steps 1 and 2 separately, as you suggested and did. True, in the article excerpt that you include they talk about hashing in both feature construction and projection, but I think those are two different hashes. And my interpretation is that the first hashing (feature construction) is done automatically by the CountVectorizer method.
Feel free to take a look at my implementation of the paper, where I built the end-to-end network and trained on the SwDA dataset, as split in the SGNN paper. I obtain a max of 71% accuracy, which is somewhat lower than the paper claims. I also used the binary hasher that #thinline72 recommended, and nltk's implementation of skipgrams (I am quite certain the SGNN paper is talking about "old" skipgrams, not "word2vec" skipgrams).
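For readers who want to see how binary features can come out of a random projection without storing a projection matrix, here is a small pure-Python sketch. It is one possible reading of "on-the-fly" computation, not the paper's confirmed implementation: each projection weight is derived by hashing the n-gram together with a bit index and a seed (hashed_weight is a hypothetical helper), and each output bit is the sign of the resulting dot product. Plain character n-grams are used rather than skip-grams, and the 80 x 14 layout follows the question's reading of the paper.

```python
import hashlib


def char_ngrams(text, n=3):
    """Character n-grams over the whole string (plain n-grams, not skip-grams)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def hashed_weight(feature, bit_index, seed=0):
    """Derive a pseudo-random +1/-1 weight from a hash, so no matrix is stored."""
    key = f"{seed}:{bit_index}:{feature}".encode("utf-8")
    return 1 if hashlib.sha256(key).digest()[0] & 1 else -1


def binary_projection(text, n_bits=14, n=3, seed=0):
    """Project a text to n_bits binary values: the sign of a hashed random projection."""
    counts = {}
    for gram in char_ngrams(text, n):
        counts[gram] = counts.get(gram, 0) + 1   # sparse features: only non-zero entries
    bits = []
    for b in range(n_bits):
        dot = sum(c * hashed_weight(g, b, seed) for g, c in counts.items())
        bits.append(1 if dot > 0 else 0)
    return bits


def sgnn_like_features(text, n_projections=80, n_bits=14):
    """Concatenate several independent projections (80 x 14 bits here)."""
    out = []
    for p in range(n_projections):
        out.extend(binary_projection(text, n_bits=n_bits, seed=p))
    return out


features = sgnn_like_features("hello world")
print(len(features), features[:14])
```

Because the weights are recomputed from the hash each time, feature extraction and projection happen in a single pass over the n-grams, which may be what makes the two steps look merged in the paper.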

How to solve ambiguity in sentiment analysis?

I'm quite new to text mining and I'm challenging myself to do sentiment analysis today. But I ran into some problems while doing it.
In my language, a word can have several different meanings. For example, "setan" means: 1) devil 2) a curse word. How can I resolve this ambiguity in sentiment analysis?
Also for everyone's information, the algorithm that I use is naive bayes classifier. And for the tools, I'm using RapidMiner.
I need your help. Any tips would be great. Thank you!
Training your data on a Naive Bayes classifier would make the model assign a probability for each word for every different class that you are trying to classify. In your case, since it's sentiment analysis, if you have Positive and Negative as the two classes, you would have probability for setan being Positive and Negative.
Keeping this in mind, if a word has multiple meanings that could account for both positive and negative sentiment, I would make sure to include both kinds of instances in your data, so that during training the model learns the corresponding probabilities and uses them to classify new text as Positive or Negative.
In your case, it seems that both meanings of setan have a negative connotation, which really shouldn't be a problem. Words like "the" and "a", which are present in both Positive and Negative instances and are commonly called stopwords, should be removed since they don't really contribute to the classification.
If you are trying to train the model using the word meanings specifically, you can refer to this paper: https://pdfs.semanticscholar.org/fc01/b42df3077a512620456d8a2714951eccbd67.pdf.
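To illustrate how a Naive Bayes model ends up with one probability per word and class, here is a minimal sketch in plain Python. The training sentences are invented English stand-ins that reuse the token "setan", and the function names are hypothetical; RapidMiner's own operator will differ in its details.

```python
import math
from collections import Counter, defaultdict


def train_nb(samples):
    """samples: list of (tokens, label). Returns the counts needed for prediction."""
    word_counts = defaultdict(Counter)
    class_counts = Counter()
    vocab = set()
    for tokens, label in samples:
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return word_counts, class_counts, vocab


def word_prob(word, label, word_counts, vocab):
    """P(word | label) with add-one (Laplace) smoothing."""
    total = sum(word_counts[label].values())
    return (word_counts[label][word] + 1) / (total + len(vocab))


def predict(tokens, word_counts, class_counts, vocab):
    """Pick the class with the highest log prior plus summed log likelihoods."""
    n = sum(class_counts.values())
    scores = {}
    for label in class_counts:
        log_p = math.log(class_counts[label] / n)
        for tok in tokens:
            if tok in vocab:                      # ignore words never seen in training
                log_p += math.log(word_prob(tok, label, word_counts, vocab))
        scores[label] = log_p
    return max(scores, key=scores.get)


# Invented toy data: "setan" appears in both classes.
samples = [
    ("setan movie was great fun".split(), "positive"),
    ("great story about setan".split(), "positive"),
    ("setan you are awful".split(), "negative"),
    ("awful service setan".split(), "negative"),
    ("awful boring movie".split(), "negative"),
]
model = train_nb(samples)
print(word_prob("setan", "positive", model[0], model[2]))
print(word_prob("setan", "negative", model[0], model[2]))
print(predict("great setan movie".split(), *model))
```

When an ambiguous word occurs about equally often in both classes, its two probabilities are close, so the surrounding words decide the prediction.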

How to train a reverse embedding, like vec2word?

How do you train a neural network to map from a vector representation to one-hot vectors? The example I'm interested in is where the vector representation is the output of a word2vec embedding, and I'd like to map onto the individual words that were in the language used to train the embedding, so I guess this is vec2word?
In a bit more detail: if I understand correctly, a cluster of points in the embedded space represents similar words. Thus, if you sample from points in that cluster and use them as the input to vec2word, the output should be a mapping to similar individual words?
I guess I could do something similar to an encoder-decoder, but does it have to be that complicated or use so many parameters?
There's this TensorFlow tutorial on how to train word2vec, but I can't find any help on doing the reverse. I'm happy to do it using any deep learning library, and it's OK to do it using sampling or a probabilistic approach.
Thanks a lot for your help, Ajay.
The easiest thing you can do is to use the nearest-neighbor word. Given a query feature fq of an unknown word and a reference feature set of known words R={fr}, you can find the fr* that is nearest to fq and use the word corresponding to fr* as the word for fq.
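A minimal sketch of that nearest-neighbor lookup, assuming cosine similarity as the distance and a tiny made-up embedding table (real word2vec vectors would have hundreds of dimensions):

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def vec2word(query, embeddings, top_k=1):
    """Return the top_k words whose vectors are closest to the query vector."""
    ranked = sorted(embeddings, key=lambda w: cosine(query, embeddings[w]), reverse=True)
    return ranked[:top_k]


# Hypothetical 3-d embeddings, for illustration only.
embeddings = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.85, 0.9, 0.15],
    "apple": [0.1, 0.2, 0.95],
}
print(vec2word([0.88, 0.82, 0.1], embeddings, top_k=2))
```

No training is needed for this; sampling a point inside a cluster and asking for top_k neighbors returns the similar words the question describes.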

How to create a synthetic dataset

I want to run some Machine Learning clustering algorithms on some big data.
The problem is that I'm having trouble finding interesting data for this purpose on the web. Also, such data is often inconvenient to use because the format doesn't suit me.
I need a txt file in which each line represents a mathematical vector, with the elements separated by spaces, for example:
1 2.2 3.1
1.12 0.13 4.46
1 2 54.44
Therefore, I decided to first run those algorithms on some synthetic data which I'll create myself. How can I do this in a smart way with numpy?
By a smart way, I mean that it shouldn't be generated uniformly, because that's a little bit boring. How can I generate some interesting clusters?
I want to have 5GB / 10GB of data at the moment.
You need to define what you mean by "clusters", but I think what you are asking for is several random-parameter normal distributions combined together, for each of your coordinate values.
From http://docs.scipy.org/doc/numpy-1.10.0/reference/generated/numpy.random.randn.html#numpy.random.randn:
For random samples from N(\mu, \sigma^2), use:
sigma * np.random.randn(...) + mu
And use <range> * np.random.rand(<howmany>) for each of sigma and mu.
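Putting those two lines together, here is a hedged numpy sketch that draws a random mu and sigma per cluster and writes space-separated rows to a text file in chunks, so that a multi-gigabyte file never has to fit in memory. The function names, the cluster count, the value ranges and the chunk size are all arbitrary choices for this example, and it uses the newer np.random.default_rng generator rather than np.random.randn.

```python
import numpy as np


def cluster_params(n_clusters, dim, spread, rng):
    """Random centre (mu) and per-axis standard deviation (sigma) for each cluster."""
    mu = spread * rng.random((n_clusters, dim))
    sigma = 0.1 + rng.random((n_clusters, dim))
    return mu, sigma


def sample_points(n_points, mu, sigma, rng):
    """Assign each point to a random cluster, then draw sigma * N(0, 1) + mu."""
    labels = rng.integers(0, len(mu), size=n_points)
    return sigma[labels] * rng.standard_normal((n_points, mu.shape[1])) + mu[labels]


def write_dataset(path, total_rows, chunk_rows=100_000, n_clusters=5, dim=3, seed=0):
    """Write total_rows vectors to path, one per line, elements separated by spaces."""
    rng = np.random.default_rng(seed)
    mu, sigma = cluster_params(n_clusters, dim, 10.0, rng)
    with open(path, "w") as fh:
        written = 0
        while written < total_rows:
            n = min(chunk_rows, total_rows - written)
            np.savetxt(fh, sample_points(n, mu, sigma, rng), fmt="%.4f", delimiter=" ")
            written += n


rng = np.random.default_rng(0)
mu, sigma = cluster_params(5, 3, 10.0, rng)
sample = sample_points(1000, mu, sigma, rng)
print(sample.shape)
```

Calling write_dataset("clusters.txt", total_rows=...) with a large enough row count would produce the file; the number of rows needed for 5 GB depends on the dimension and the number format.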
There is no single good answer to such a question. What is interesting? For clustering, unfortunately, there is no such thing as an interesting or even well-posed problem. Clustering as such has no well-defined evaluation; consequently each method is equally good or bad, as long as it has a well-defined internal objective. So k-means will always be a good way to minimize within-cluster Euclidean distance and will struggle with sparse data and with non-convex or imbalanced clusters. DBSCAN will always be the best in a greedy density-based sense and will struggle with clusters of diverse density. GMM will always be great at fitting Gaussian mixtures, and will struggle with clusters which are not Gaussian (for example lines, squares etc.).
From the question one could deduce that you are at the very beginning of your work with clustering and so need "just anything more complex than uniform", so I suggest you take a look at dataset generators, in particular those available in scikit-learn (Python) http://scikit-learn.org/stable/datasets/ or in clusterSim (R) http://www.inside-r.org/packages/cran/clusterSim/docs/cluster.Gen or clusterGeneration (R) https://cran.r-project.org/web/packages/clusterGeneration/clusterGeneration.pdf

How does Lucene build VSM?

I understand the concept of VSM, TFIDF and cosine similarity; however, after reading the Lucene website I am still confused about how Lucene builds the VSM and calculates similarity for each query.
As I understood, VSM is a matrix where the values of TFIDF of each term are filled. When I tried building a VSM from a set of documents, it took a long time with this tool: http://sourceforge.net/projects/wvtool/
This is not really related to the coding, because intuitively building a VSM matrix for large data is time-consuming, but that does not seem to be the case for Lucene.
In addition, with a prebuilt VSM, finding the most similar document, which is basically the calculation of similarity between two documents or between a query and a document, is often time-consuming (assume millions of documents, because one has to compute similarity to everyone else), but Lucene seems to do it really fast. I guess that's also related to how it builds the VSM internally. If possible, can someone also explain this?
So please help me understand two points here:
1. How does Lucene build the VSM so fast, in a form that can be used for calculating similarity?
2. How come Lucene's similarity calculation among millions of documents is so fast?
I'd appreciate it if a real example is given.
Thanks
"As I understood, VSM is a matrix where the values of TFIDF of each term are filled."
This is more properly called a term-document matrix. The VSM is more of a conceptual framework from which this matrix, and the notion of cosine similarity arise.
Lucene stores term frequencies and document frequencies that can be used to get tf-idf weights for document and query terms. It uses those to compute a variant of cosine similarity outlined here. So, the rows of the term-document matrix are represented in the index, which is a hash table mapping terms to (document, tf) pairs plus a separate table mapping terms to their df value.
"one has to compute similarity to everyone else"
That's not true. If you review the textbook definition of cosine similarity, you'll find that it's the sum of products of corresponding term weights in a query and a document, normalized. Terms that occur in the document but not the query, or vice versa, have no effect on the similarity. It follows that, to compute cosine similarity, you only need to consider those documents that have some term in common with the query. That's how Lucene gets its speed: it does a hash table lookup for the query terms and computes similarities only to the documents that have non-zero intersection with the query's bag of words.
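As a concrete (and heavily simplified) example, here is a toy inverted index in Python that scores a query only against documents found in the postings lists of the query terms. It is a sketch of the idea, not Lucene's actual data structures or scoring formula; the tf-idf weighting and all names are assumptions for illustration.

```python
import math
from collections import Counter, defaultdict


def build_index(docs):
    """docs: {doc_id: [tokens]}. Returns postings, idf values and document norms."""
    postings = defaultdict(list)               # term -> [(doc_id, tf), ...]
    for doc_id, tokens in docs.items():
        for term, tf in Counter(tokens).items():
            postings[term].append((doc_id, tf))
    n = len(docs)
    idf = {t: math.log(1 + n / len(p)) for t, p in postings.items()}
    norms = defaultdict(float)
    for term, plist in postings.items():
        for doc_id, tf in plist:
            norms[doc_id] += (tf * idf[term]) ** 2
    return postings, idf, {d: math.sqrt(v) for d, v in norms.items()}


def search(query_tokens, postings, idf, norms):
    """Cosine-score only the documents that share at least one term with the query."""
    q_weights = {t: tf * idf[t] for t, tf in Counter(query_tokens).items() if t in idf}
    q_norm = math.sqrt(sum(w * w for w in q_weights.values())) or 1.0
    scores = defaultdict(float)
    for term, q_w in q_weights.items():
        for doc_id, tf in postings[term]:      # hash lookup, then walk one postings list
            scores[doc_id] += q_w * tf * idf[term]
    return sorted(((s / (q_norm * norms[d]), d) for d, s in scores.items()), reverse=True)


docs = {
    "d1": "lucene builds an inverted index".split(),
    "d2": "cosine similarity in the vector space model".split(),
    "d3": "the index maps terms to documents".split(),
}
postings, idf, norms = build_index(docs)
results = search("inverted index".split(), postings, idf, norms)
print(results)
```

Document d2 shares no term with the query, so it is never visited; with millions of documents, the work is proportional to the lengths of the postings lists for the query terms rather than to the size of the collection.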