Semantic distance between two words - semantics

I would like to get a value which identifies the semantic distance between two words.
I know that from WordNet I can get a set of words which are hyponyms, synonyms, etc. of a particular word.
BUT is there a way to give two words as input and get back a value representing the distance between them from WordNet, rather than the actual words?

You can get the semantic similarity between two words and use that as a distance measure: the higher the similarity, the smaller the distance.
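For example, NLTK exposes WordNet's similarity measures directly. A minimal sketch, assuming nltk and its wordnet corpus data are installed; taking only the first synset of each word is a simplification, since similarity is really defined between synsets rather than words:

from nltk.corpus import wordnet as wn

dog = wn.synsets("dog")[0]
cat = wn.synsets("cat")[0]

similarity = dog.path_similarity(cat)   # in (0, 1]; higher means more similar
distance = 1 - similarity               # simple way to turn similarity into a distance
print(similarity, distance)

# Other WordNet-based measures are available too, e.g. Wu-Palmer:
print(dog.wup_similarity(cat))

In practice you may want to take the maximum similarity over all synset pairs of the two words rather than just the first synsets.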

How to do fuzzy match in Snowflake / SQL
Here is the business logic:
The ABC Company INC, The north America, ABC (those two should show a match)
The 16K LLC, 16K LLC (those two should show a match)
I attached some test data. Thanks so much, guys!
Any matching attempt that treats string pairs like "The ABC Company INC" and "The north America, ABC" or "Preferred ABC Group" and "The Preferred Residences" as a match is probably going to give you many false positive matches, since in some of your examples there is only one word similar between the strings.
That said, Snowflake does provide a couple of functions that might help: EDITDISTANCE and JAROWINKLER_SIMILARITY.
EDITDISTANCE generates a number that represents the Levenshtein distance between two strings (basically the number of edits it would take to change one string into the other). A lower number indicates fewer edits needed and so potentially a closer match.
JAROWINKLER_SIMILARITY uses an algorithm to calculate a "similarity" score between 0 and 100 for two strings. A higher number indicates more similarity, 100 being an exact match.
You could use either or both of these functions to generate scores for each pair of strings and then decide on a threshold that best represents a match for your purposes.
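For intuition about the scores, here is a minimal pure-Python sketch of the Levenshtein metric that EDITDISTANCE computes; the normalization helper and the threshold value are illustrative assumptions, not Snowflake code:

def levenshtein(a: str, b: str) -> int:
    # Number of single-character edits (insert, delete, substitute)
    # needed to turn string a into string b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,         # deletion
                            curr[j - 1] + 1,     # insertion
                            prev[j - 1] + cost)) # substitution
        prev = curr
    return prev[-1]

def normalize(name: str) -> str:
    # Very rough cleanup; real matching might also strip suffixes like INC/LLC.
    return name.lower().replace(",", "").replace(".", "").strip()

def is_match(name1: str, name2: str, max_edits: int = 10) -> bool:
    # The threshold is a placeholder; tune it against your own data.
    return levenshtein(normalize(name1), normalize(name2)) <= max_edits

print(levenshtein("The 16K LLC", "16K LLC"))  # 4 edits: the leading "The "
print(is_match("The 16K LLC", "16K LLC"))     # True with this threshold

In Snowflake itself you would call EDITDISTANCE or JAROWINKLER_SIMILARITY directly in SQL and filter on whichever threshold works best for your data.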

Machine Learning text comparison model

I am creating a machine learning model that essentially returns the correctness of one text relative to another.
For example: “the cat and a dog”, “a dog and the cat”. The model needs to be able to identify that some words (“cat”/“dog”) are more important/significant than others (“a”/“the”). I am not interested in conjunction words etc. I would like to be able to tell the model which words are the most “significant” and have it determine how correct text 1 is relative to text 2, with the “significant” words bearing more weight than others.
It also needs to be able to recognise that phrases don’t necessarily have to be in the same order. The two sentences above should be an extremely high match.
What is the basic algorithm I should use to go about this? Is there an alternative to just creating a dataset with thousands of example texts and a score of correctness?
I am only after a broad overview/flowchart/process/algorithm.
I think TF-IDF might be a good fit for your problem, because:
Emphasis on words occurring in many documents (say, 90% of your sentences/documents contain the conjunction word 'and') is much smaller, essentially giving more weight to the more document-specific phrasing (this is the IDF part).
Ordering does not matter in Term Frequency (TF), as opposed to methods using sliding windows etc.
It is very lightweight when compared to representation-oriented methods like the one mentioned above.
Big drawback: your data, depending on the size of the corpus, may have too many dimensions (the same number of dimensions as unique words); you could use stemming/lemmatization to mitigate this problem to some degree.
You can calculate the similarity between two TF-IDF vectors using cosine similarity, for example.
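A minimal sketch of that pipeline, assuming scikit-learn is available; the two example sentences come from the question and the default vectorizer settings are not tuned:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

texts = ["the cat and a dog", "a dog and the cat"]

# Fit TF-IDF on the corpus; words appearing in most documents get a low IDF
# weight, so stop-word-like terms contribute little to the score.
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(texts)

# Cosine similarity is order-insensitive: both sentences contain the same
# terms, so the score comes out as 1.0.
score = cosine_similarity(tfidf[0], tfidf[1])[0, 0]
print(f"similarity: {score:.3f}")

To give hand-picked “significant” words extra weight, one option is to scale their columns in the TF-IDF matrix before computing cosine similarity.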
EDIT: Whoops, this question is 8 months old; sorry for the bump. Maybe it will be of use to someone else, though.

How does score multiplier operator in Watson Discovery service work?

I have a set of JSON documents uploaded to my WDS instance. I want to understand the importance of the score multiplier operator (^). The documentation just says, "Increases the score value of the search term". I have tried a simple query on one field; it multiplies the score by the number specified.
If I specify two fields & I want Watson Discovery to know which of the two fields is more important for the search, is score multiplier applicable in this case? With two fields & a score multiplier applied to one, I could not identify the difference. Also, on what datatypes is this allowed? It didn't work with a number.
I found this out through some more experiments. The score multiplier is used when you want to increase the relative importance of fields in the query. So, for example, say you want to give more importance to Name.LastName in the example below:
Name.FirstName:"ABC",Name.LastName:"DEF"^3
Here, LastName is given more relevance, and the search results are ordered accordingly.
This could be useful for someone.
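As a rough sketch of submitting such a query through the ibm-watson Python SDK (DiscoveryV1); the API key, service URL, IDs, and version date are all placeholders, and the field names are taken from the example above:

from ibm_watson import DiscoveryV1
from ibm_cloud_sdk_core.authenticators import IAMAuthenticator

discovery = DiscoveryV1(
    version="2019-04-30",
    authenticator=IAMAuthenticator("YOUR_APIKEY"),
)
discovery.set_service_url("YOUR_SERVICE_URL")

# The ^3 boost makes matches on Name.LastName count three times as much
# toward the relevance score as matches on Name.FirstName.
response = discovery.query(
    environment_id="YOUR_ENVIRONMENT_ID",
    collection_id="YOUR_COLLECTION_ID",
    query='Name.FirstName:"ABC",Name.LastName:"DEF"^3',
).get_result()

for result in response["results"]:
    print(result)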

Similarity matching algorithm

I have products with different details in different attributes, and I need to develop an algorithm to find the products most similar to the one I'm looking for.
For example, if a product has:
Weight: 100lb
Color: Black, Brown, White
Height: 10in
Conditions: new
Others can have different colors, weight, etc. Then I need to do a search where the most similar products are returned first. For example, if everything matches but the colors are only Black and White, not Brown, it's a better match than another product that is only Black but not White or Brown.
I'm open to suggestions as the project is just starting.
One approach I could take, for example, is to restrict each attribute (weight, color, size) to a limited set of options, so I can build a binary representation. Then I have something like this for each product:
Colors Weight Height Condition
00011011000 10110110 10001100 01
Then if I do an XOR between the product's binary representation and my search, I can calculate the number of set bits to see how similar they are (all zeros would mean exact match).
The problem with this approach is that I cannot index that on a database, so I would need to read all the products to make the comparison.
Any suggestions on how I can approach this? Ideally I would like to have something I can index on a database so it's fast to query.
Further question: if I could also use different weights for each attribute, that would be awesome.
You basically need to come up with a distance metric to define the distance between two objects. Calculate the distance from the object in question to each other object, then you can either sort by minimum distance or just select the best.
Without some highly specialized algorithm based on the full data set, the best you can do is a linear time distance comparison with every other item.
You can estimate the nearest by keeping sorted lists of certain fields such as Height and Weight and cap the distance at a threshold (like in broad phase collision detection), then limit full distance comparisons to only those items that meet the thresholds.
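A minimal sketch of such a hand-rolled, weighted distance metric; the attribute names, normalization constants, and weights below are illustrative assumptions, not part of the question:

# Per-attribute weights (tune these to taste).
ATTRIBUTE_WEIGHTS = {"weight_lb": 1.0, "colors": 2.0, "height_in": 1.0, "condition": 0.5}

def distance(query: dict, product: dict) -> float:
    d = 0.0
    # Numeric attributes: weighted, normalized absolute difference.
    d += ATTRIBUTE_WEIGHTS["weight_lb"] * abs(query["weight_lb"] - product["weight_lb"]) / 100.0
    d += ATTRIBUTE_WEIGHTS["height_in"] * abs(query["height_in"] - product["height_in"]) / 10.0
    # Set-valued attribute: Jaccard distance (1 - overlap / union).
    union = query["colors"] | product["colors"]
    overlap = query["colors"] & product["colors"]
    if union:
        d += ATTRIBUTE_WEIGHTS["colors"] * (1 - len(overlap) / len(union))
    # Categorical attribute: simple mismatch penalty.
    d += ATTRIBUTE_WEIGHTS["condition"] * (query["condition"] != product["condition"])
    return d

query = {"weight_lb": 100, "colors": {"black", "brown", "white"}, "height_in": 10, "condition": "new"}
products = [
    {"weight_lb": 100, "colors": {"black"}, "height_in": 10, "condition": "new"},
    {"weight_lb": 100, "colors": {"black", "white"}, "height_in": 10, "condition": "new"},
]

# Sort every candidate by distance to the query; the black/white product
# ranks above the black-only one, as the question requires.
for p in sorted(products, key=lambda p: distance(query, p)):
    print(round(distance(query, p), 3), sorted(p["colors"]))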
What you want to do is a perfect use case for Elasticsearch and other similar search-oriented databases. I don't think you need to hack around with bitmasks, etc.
You would typically maintain your primary data in your existing database (SQL/Cassandra/Mongo/etc., anything works) and copy the things that need searching into Elasticsearch.
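A rough sketch with the official elasticsearch Python client (the 8.x-style API is assumed); the index name, field names, and local URL are placeholders:

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Mirror the searchable attributes of each product into Elasticsearch.
es.index(index="products", id="prod-1", document={
    "weight_lb": 100,
    "colors": ["black", "brown", "white"],
    "height_in": 10,
    "condition": "new",
})

# Every matching "should" clause adds to the relevance score, so products
# matching more of the requested attributes (e.g. more colors) rank higher.
hits = es.search(index="products", query={
    "bool": {
        "should": [
            {"term": {"colors": "black"}},
            {"term": {"colors": "white"}},
            {"term": {"colors": "brown"}},
            {"term": {"condition": "new"}},
        ]
    }
})
for hit in hits["hits"]["hits"]:
    print(hit["_score"], hit["_source"])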
What you are describing is very similar to BK-trees. A BK-tree is a search tree built around a metric on its keys. The most common use of this tree is string correction with Levenshtein or Damerau-Levenshtein distances. It is not a static data structure, so it supports future insertions of elements.
When you search for an exact element (or insert an element), you walk the nodes of the tree, following the link whose weight equals the distance between that node's key and your element. If you want to find similar objects, you descend into several child nodes simultaneously, namely those whose edge weights satisfy your distance constraints. (It can perhaps even be done with A* to quickly find the single most similar object.)
A simple example of a BK-tree (from the second link below):
            BOOK
           /    \
       (1)/      \(4)
         /        \
     BOOKS        CAKE
       /          /  \
   (2)/       (1)/    \(2)
     /           |     |
   BOO         CAPE   CART
Your metric would be the Hamming distance (the number of differing bits between the binary representations of two objects).
BUT! Is it a good idea to compare two integers by the number of differing bits in their representations? With Hamming distance, HD(10000, 00000) == HD(10000, 10001); i.e., the difference between the numbers 16 and 0 is the same as between 16 and 17. Is that really what you need?
BK-trees in more detail:
https://hamberg.no/erlend/posts/2012-01-17-BK-trees.html
https://nullwords.wordpress.com/2013/03/13/the-bk-tree-a-data-structure-for-spell-checking/
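A compact BK-tree sketch over integer bitmasks with Hamming distance; the class, the example encodings, and the search radius are all illustrative:

def hamming(a: int, b: int) -> int:
    return bin(a ^ b).count("1")            # number of differing bits

class BKTree:
    def __init__(self, distance):
        self.distance = distance
        self.root = None                    # node = (key, {edge_distance: child})

    def insert(self, key):
        if self.root is None:
            self.root = (key, {})
            return
        node = self.root
        while True:
            d = self.distance(key, node[0])
            child = node[1].get(d)
            if child is None:
                node[1][d] = (key, {})
                return
            node = child

    def query(self, key, max_dist):
        """Return (distance, key) for all stored keys within max_dist of key."""
        results = []
        stack = [self.root] if self.root else []
        while stack:
            node_key, children = stack.pop()
            d = self.distance(key, node_key)
            if d <= max_dist:
                results.append((d, node_key))
            # Triangle inequality: only children whose edge distance lies in
            # [d - max_dist, d + max_dist] can contain matches.
            for edge, child in children.items():
                if d - max_dist <= edge <= d + max_dist:
                    stack.append(child)
        return sorted(results)

tree = BKTree(hamming)
for product_bits in (0b00011011, 0b00011010, 0b11100100):
    tree.insert(product_bits)
print(tree.query(0b00011011, max_dist=2))   # closest encodings first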

Cosine similarity--one to many

I'm wondering if there's any good way to use cosine similarity to compare a single document with a set of documents. Obviously you could calculate the cosine similarity between the single document and every document in the set, but if you did this, would you then take the average? Would you weight by the size of each of the other documents you're comparing the original document with? I'm also wondering if there's any way to combine all of the word counts in the set of documents you're comparing with, so that in the end you only compute cosine similarity once: between the original document and the "aggregated" document.
The reason I'm asking is that I have about 200,000 documents that I want to compare with a separate set of about 50,000 documents. Comparing each of the 200,000 with each of the 50,000 is a lot of calculating, and I don't know if it's actually necessary if I'm just going to take some sort of average in the end anyway. Is my aggregated document idea a big no-no?
There is a way to speed this up significantly. The key point is to notice that the word vectors are sparse. So you want to transform your documents into a table organized by word columns, one column per word. For each column you store only the non-zero entries, that is, one row per document that actually contains the word. Then you compute the partial sums by going through the columns and collecting the results per document. This has the additional advantage that it is easy to parallelize.
To speed this up further, you create one column per word per set and only compute and distribute the partial sums for the same word across documents from different sets.
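A minimal sketch of that sparse approach with scikit-learn and SciPy; the example texts and set sizes are placeholders:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import normalize

set_a = ["first document of set A ...", "another document from set A ..."]   # ~200,000 docs
set_b = ["a document from set B ...", "more text from set B ..."]            # ~50,000 docs

# Build one shared vocabulary, then vectorize each set as a sparse count matrix.
vectorizer = CountVectorizer()
vectorizer.fit(set_a + set_b)
A = normalize(vectorizer.transform(set_a))   # L2-normalized rows
B = normalize(vectorizer.transform(set_b))

# With normalized rows, a sparse matrix product yields all pairwise cosine
# similarities at once; only words shared by both documents (the non-zero
# overlaps) contribute to each sum.
similarities = A @ B.T                       # sparse, shape (len(set_a), len(set_b))

# Summarize per document without densifying the full matrix, e.g. the index
# of the best match in set B for each document in set A.
best_match = similarities.argmax(axis=1)
print(best_match)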