Different document lengths when computing cosine similarity? - cosine-similarity

Is there any rule for computing the cosine similarity between two documents that have a different number of words?

The standard formula does not require the number of words to match. You can just sum over the union of the words of both documents. All words that are in B but not in A give rise to a 0 in the word vector for A. All words that are in A but not in B give rise to a 0 in the word vector for B.
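For example, here is a minimal Python sketch of summing over the union of the two vocabularies (the two toy documents are made up for illustration):

from collections import Counter
import math

def cosine_similarity(doc_a, doc_b):
    a, b = Counter(doc_a.split()), Counter(doc_b.split())
    words = set(a) | set(b)                  # union of both vocabularies
    dot = sum(a[w] * b[w] for w in words)    # a Counter returns 0 for absent words
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b)

print(cosine_similarity("the cat sat", "the cat sat on the mat"))  # ~0.816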

Related

Performing a sparse sum on Mathematica

I want to evaluate a sum in Mathematica of the form
g[[i,j,k,l,m,n]] * g[[o,p,q,r,s,t]] * (complicated function of the indices)
But all these indices range from 0 to 3, so the total number of cases to sum over is 4^12, which will take a prohibitively long time. However, barely any elements of the array g[[i,j,k,l,m,n]] are nonzero -- there are probably around 8 nonzero entries -- so I would like to restrict the sum over {i,j,k,l,m,n,o,p,q,r,s,t} to precisely those combinations of indices for which both factors of g are nonzero.
I can't find a way to do this for summation over multiple indices, where the allowed index choices are particular combinations of {i,j,k,l,m,n} as opposed to specific values of each particular index. Any help appreciated!
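A sketch of the restricted-sum idea (in Python/NumPy for concreteness rather than Mathematica; f below is a hypothetical stand-in for the complicated function of the indices): with only ~8 nonzero entries, the double loop has ~64 terms instead of 4^12.

import numpy as np

rng = np.random.default_rng(0)
g = np.zeros((4,) * 6)
for _ in range(8):                          # plant ~8 nonzero entries, as in the question
    g[tuple(rng.integers(0, 4, size=6))] = rng.standard_normal()

def f(idx1, idx2):                          # placeholder for the real integrand
    return 1.0

nonzero = list(zip(*np.nonzero(g)))         # the handful of positions where g != 0
total = sum(g[i] * g[j] * f(i, j) for i in nonzero for j in nonzero)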

How to find most similar numerical arrays to one array, using Numpy/Scipy?

Let's say I have a list of 5 words:
[this, is, a, short, list]
Furthermore, I can classify some text by counting the occurrences of the words from the list above and representing these counts as a vector:
N = [1,0,2,5,10] # 1x this, 0x is, 2x a, 5x short, 10x list found in the given text
In the same way, I classify many other texts (count the 5 words per text, and represent them as counts - each row represents a different text which we will be comparing to N):
M = [[1,0,2,0,5],
[0,0,0,0,0],
[2,0,0,0,20],
[4,0,8,20,40],
...]
Now, I want to find the top 1 (or 2, 3, etc.) rows of M that are most similar to N. Or, in simple words, the texts most similar to my initial text.
The challenge is that just checking the distances between N and each row of M is not enough, since, for example, row M4 [4,0,8,20,40] is very far from N by distance, yet proportional to it (by a factor of 4) and therefore very similar. The text in row M4 might simply be 4x as long as the text represented by N, so naturally all its counts are 4x as high.
What is the best approach to this problem of finding the 1, 2, 3, etc. most similar texts in M to the text in N?
Generally speaking, the standard technique for measuring the similarity of bag-of-words vectors (i.e. your arrays) is the cosine similarity measure. It maps each bag of n (here 5) word counts to a point (equivalently, a vector) in an n-dimensional space; the most similar rows are the ones whose vectors make the smallest angle with your text N in that space. This automatically takes care of the proportional rows, since they point in (nearly) the same direction. Here is code for it (assuming M and N are numpy arrays of the shapes introduced in the question):
import numpy as np
# the per-row norms (axis=1) are essential: np.linalg.norm(M) alone would give
# a single Frobenius norm for the whole matrix rather than one norm per row
cos_sim = np.dot(M, N) / (np.linalg.norm(M, axis=1) * np.linalg.norm(N))
most_similar = M[np.nanargmax(cos_sim)]  # nanargmax skips the all-zero row (0/0 gives NaN)
which gives [ 4 0 8 20 40] for your inputs.
You can normalise your row counts to remove the length effect, as you discussed. Row normalisation of M can be done as M / M.sum(axis=1)[:, np.newaxis]. The residual values can then be calculated as the sum of the squared differences between N (also normalised) and each row of M. The minimum difference (ignoring the NaN or inf values obtained when a row sums to 0) then identifies the most similar row.
Here is an example:
import numpy as np
N = np.array([1,0,2,5,10])
M = np.array([[1,0,2,0,5],
              [0,0,0,0,0],
              [2,0,0,0,20],
              [4,0,8,20,40]])
# sqrt of sum of normalised square differences
similarity = np.sqrt(np.sum((M / M.sum(axis=1)[:, np.newaxis] - N / np.sum(N))**2, axis=1))
# replace NaN values (the all-zero row gives 0/0) with something larger than every real distance
similarity[np.isnan(similarity)] = np.nanmax(similarity) + 1
result = M[similarity.argmin()]
result
>>> array([ 4, 0, 8, 20, 40])
You could then use np.argsort(similarity)[:n] to get the n most similar rows.

How can I prove this language is regular?

I'm trying to prove that this language:
L = { w ∈ {0,1}* | #0(w) mod 3 = 0 }   (the number of 0's in w is divisible by 3)
is regular using the pumping lemma, but I can't find a way to do it. All the other examples I have seen have a simpler, more defined form, such as w = a^x b^y c^z.
I don't think you can use the pumping lemma to prove that a language is regular. To prove that a language is regular, you just need to give a regular expression or a DFA for it. In this case the regular expression is quite easy:
1*(01*01*01*)*
(Proof: the regular expression clearly does not accept any string whose number of 0's is not divisible by 3, so we only need to show that every string whose number of 0's is divisible by 3 is accepted. Any string containing 3n 0's can be written as 1^{n_0} 0 1^{n_1} 0 1^{n_2} 0 1^{n_3} ... 0 1^{n_{3n-2}} 0 1^{n_{3n-1}} 0 1^{n_{3n}}; the n_k's can be chosen so that this matches the string, and this form is clearly accepted by the regular expression.)
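A quick sanity check of this regular expression, sketched in Python:

import re

pattern = re.compile(r"1*(01*01*01*)*")     # fullmatch anchors it to the whole string
for s in ["", "111", "000", "0101", "010101", "0001100"]:
    matched = pattern.fullmatch(s) is not None
    print(repr(s), matched, s.count("0") % 3 == 0)   # the two columns always agree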
The pumping lemma cannot be used to prove that a language is regular, because we cannot choose the y as in Daniel Martin's answer. Here is a counter-example, in a similar format to his answer (please correct me if I'm doing something fundamentally different from his answer):
We "prove" that the language L = { w = 0^n 1^p | n ∈ N, n > 0, p is prime } is regular using the pumping lemma as follows: note that there is at least one occurrence of 0, so we take y to be a single 0, and we have xy^k z = 0^{n+k-1} 1^p, which still satisfies the language definition. Therefore L is regular.
But this is false, since we know that a language of strings with a prime number of 1's is not regular. The problem here is that we cannot just set y to any character.
Any string in this language with at least three characters in it has this property: either the string has a "1" in it, or there are three "0"s in a row.
If the string contains a 1, then you can split it as in the pumping lemma and set y equal to some 1 in the string. Then obviously the strings xyz, xyyz, xyyyz, etc. are all in the language because all those strings have the same number of zeros.
If the string does not contain a 1, it contains three 0s in a row. Setting y to those three 0s, it should be obvious that xyz, xyyz, xyyyz, etc. are all in the language because you're adding three 0 characters each time, so you always have a number of 0s divisible by 3.
#justhalf in the comments is perfectly correct; the pumping lemma can be used to prove that a regular language can be pumped or that a language that cannot be pumped is not regular, but you cannot use the pumping lemma to prove that a language is regular in the first place. Mea Culpa.
Instead, here's a proof that the given language is regular based on the Myhill-Nerode Theorem:
Consider the set of all strings of 0s and 1s. Divide these strings into three sets:
E0, all strings such that the number of 0s is a multiple of three,
E1, all strings such that the number of 0s is one more than a multiple of three,
E2, all strings such that the number of 0s is two more than a multiple of three.
Obviously, every string of 0s and 1s is in one of these three sets.
Furthermore, if x and z are both strings of 0s and 1s, then consider what it means if the concatenation xz is in L:
If x is in E0, then xz is in L if and only if z is in E0
If x is in E1, then xz is in L if and only if z is in E2
If x is in E2, then xz is in L if and only if z is in E1
Therefore, in the language of the theorem, there is no distinguishing extension for any two strings in the same one of our three Ei sets, so there are at most three equivalence classes. A finite number of equivalence classes means the language is regular.
(in fact, there are exactly three equivalence classes, but that isn't needed)
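For the curious, the concatenation table above can be brute-force checked over short strings (a small Python sketch, not part of the proof):

from itertools import product

def zeros_mod3(w):
    return w.count("0") % 3                  # which E_i the string w belongs to

def in_L(w):
    return zeros_mod3(w) == 0

# all binary strings of length <= 4
strings = ["".join(p) for k in range(5) for p in product("01", repeat=k)]
# xz is in L exactly when the remainders of x and z sum to 0 mod 3
assert all(in_L(x + z) == ((zeros_mod3(x) + zeros_mod3(z)) % 3 == 0)
           for x in strings for z in strings)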
A language is regular if and only if some nondeterministic finite automaton recognizes it.
An automaton is a finite state machine.
We have to build an automaton that recognizes L.
For each state, think:
"Where am I?"
"Where can I go with a given input symbol?"
So, for L = { w ∈ {0,1}* | #0(w) mod 3 = 0 }:
The possibilities (states) are: the remainder of the division by 3 is 0, 1, or 2, which means we need three states.
Let q0, q1, and q2 be the states representing the remainders 0, 1, and 2, respectively.
q0 is both the start state and the final state.
Now, for each "0" input, update the count of 0's modulo 3 and go to the appropriate state.
Transition functions:
f(q0, 0) = q1
f(q1, 0) = q2
f(q2, 0) = q0
For "1" entries, it just loops wherever it is, 'cause it doesn't change the machine state.
f(qx, 1) = qx
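A minimal Python simulation of this three-state machine (added here as an illustrative sketch):

def accepts(w):
    state = 0                            # q0, q1, q2 encoded as 0, 1, 2; q0 is start and final
    for ch in w:
        if ch == "0":
            state = (state + 1) % 3      # f(q0,0)=q1, f(q1,0)=q2, f(q2,0)=q0
        # a "1" loops: f(qx,1) = qx
    return state == 0

print(accepts("01010"))   # three 0's -> True
print(accepts("0110"))    # two 0's  -> False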
The pumping lemma can only be used to prove that a language is not regular.
Here is a good book on the theory of computation: Introduction to the Theory of Computation, 3rd Edition, by Michael Sipser.

Search any word inside a string in millions of rows

I have a set of 50k values, say X. I want to compare each value with a set of 10k values, say Y: if a value from X is present anywhere in a string from Y, it matches.
So for each value in X I want to check against each value in Y and assign the X value where it matches.
What would be the best method to complete this task? It is required for a data mining project.
I loaded the data into an MS Access database and then, using a VBA program, took each X and updated Y where it matched (LIKE '%X%'), but it is a never-ending process. The columns are indexed, but to no effect.
Is there an algorithm, or a step-by-step approach, that would complete the mapping faster?
Please let me know if there are any other options available besides the answers given below. I'll explain the scenario a bit more.
Table1.Data
Sentence1
Sentence2
Sentence3
Sentence4
Sentence5
Sentence6
...
Sentence100k
Table2.Phrase (a phrase means multiple words)
Phrase1
Phrase2
Phrase3
Phrase4
Phrase5
...
Phrase100k
I want to check whether Phrase1 has any match in Sentence1 through Sentence100k: an exact match of the phrase, a match of the phrase anywhere, the maximum number of words from Phrase1 matching in a sentence, etc., and create a map based on the best match (ideally the exact phrase appearing anywhere in the sentence).
Table3.Output

Data         Best Possible Phrase   Second Best Phrase (optional)
Sentence1    Phrase1000             Phrase50k
Sentence2    Phrase10               Phrase70k
Please let me know any tool or logic to perform this. The logic I tried in SQL:
1.
Select A.Data,B.Phrase from Table1 A left join Table2 B on A.Data Like '%' + B.Phrase + '%'
2.
Check for any word in the phrase appearing in the sentence: I replaced all spaces in the phrase with %, like word1%word2%word3, and then ran the query
A.Data Like '%' + B.Phrase + '%'
which becomes
A.Data Like '%word1%word2%word3%'
But it takes days to complete the task for this much data.
Any readily usable tools, indexing methods, or queries would really help. The answers given below seem too technical for me to adapt. Please guide me.
You can build a suffix tree in linear time (you can look up suffix trees online) out of the concatenation of all the strings in X and Y, with a special unique symbol terminating each string.
Then for each string Xi in X, you look it up in the suffix tree (linear time in the length of Xi) and assign Xi to each string in Y that appears somewhere in the subtree rooted at the end of Xi.
This takes time linear in the number of strings in Y that Xi is assigned to.
Thus you get an optimal O(N + k) time algorithm, where:
N is the total length of all the strings in X and Y,
and k is the total number of matches between query strings in X and target strings in Y.
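If a full suffix tree is hard to get going, the closely related Aho-Corasick automaton gives the same O(N + k) behaviour for this many-phrases-in-many-sentences problem. Here is a minimal pure-Python sketch (the phrase and sentence lists are made-up stand-ins for Table2 and Table1):

from collections import deque

def build_automaton(patterns):
    goto, fail, out = [{}], [0], [set()]     # trie transitions, failure links, matches per state
    for idx, pat in enumerate(patterns):
        state = 0
        for ch in pat:
            if ch not in goto[state]:
                goto.append({}); fail.append(0); out.append(set())
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].add(idx)                  # pattern idx ends at this state
    queue = deque(goto[0].values())          # depth-1 states keep fail = 0 (the root)
    while queue:
        s = queue.popleft()
        for ch, t in goto[s].items():
            queue.append(t)
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(ch, 0)
            out[t] |= out[fail[t]]           # inherit matches from the fallback state
    return goto, fail, out

def find_phrases(text, goto, fail, out):
    state, hits = 0, set()
    for ch in text:
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        hits |= out[state]
    return hits

phrases = ["data mining", "index"]                         # stand-in for Table2
sentences = ["a data mining project", "fast index scans"]  # stand-in for Table1
goto, fail, out = build_automaton(phrases)
for s in sentences:
    print(s, "->", [phrases[i] for i in find_phrases(s, goto, fail, out)])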

Weighted random letter in Objective-C

I need a simple way to randomly select a letter from the alphabet, weighted by the percentage I want it to come up. For example, I want the letter 'E' to come up in the random function 5.9% of the time, but I only want 'Z' to come up 0.3% of the time (and so on, based on the average occurrence of each letter in the alphabet). Any suggestions? The only way I see is to populate an array with, say, 10000 letters (590 'E's, 30 'Z's, and so on) and then randomly select a letter from that array, but it seems memory-intensive and clumsy.
Not sure if this would work, but it seems like it might do the trick:
1. Take your list of letters and frequencies and sort them from smallest frequency to largest.
2. Create a 26-element array where each element n contains the sum of all previous weights plus element n from the list of frequencies. Make note of the sum in the last element of the array.
3. Generate a random number between 0 and the sum you made note of above.
4. Do a binary search of the array of sums until you reach the element where that number would fall.
That's a little hard to follow, so it would be something like this:
If you have a 5-letter alphabet with these frequencies: a = 5%, b = 20%, c = 10%, d = 40%, e = 25%, sort them by frequency: a, c, b, e, d.
Keep a running sum of the elements: 5, 15, 35, 60, 100.
Generate a random number between 0 and 100. Say it comes out as 22.
Do a binary search for the element where 22 would fall. In this case it falls between elements 2 and 3, which corresponds to the letter "b" (rounding up is what you want here, I think).
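Sketched in Python for brevity (the same logic ports directly to Objective-C), using the 5-letter example above:

import bisect
import random

letters = ['a', 'c', 'b', 'e', 'd']         # sorted by frequency
weights = [5, 10, 20, 25, 40]               # percentages
cumulative = []
total = 0
for w in weights:
    total += w
    cumulative.append(total)                # 5, 15, 35, 60, 100

r = random.uniform(0, total)                # step 3: e.g. 22
index = bisect.bisect_left(cumulative, r)   # step 4: first sum >= r ("rounding up")
print(letters[index])                       # 22 -> 'b'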
You've already acknowledged the tradeoff between space and speed, so I won't get into that.
If you can calculate the frequency of each letter a priori, then you can pre-generate an array (or dynamically create and fill an array once) to scale up with your desired level of precision.
Since you used percentages with a single digit of precision after the decimal point, consider an array of 1000 entries. Each index represents one tenth of one percent of frequency. So you'd have letter[0] through letter[81] equal to 'a' (8.2%), letter[82] through letter[96] equal to 'b' (1.5%), and so on up to letter[999] equal to 'z'. (Values according to the relative frequencies of letters in the English language.)
Now generate a random number between 0 and 1 (using whatever favourite PRNG you have, assuming a uniform distribution), multiply the result by 1000, and truncate to an integer. That gives you the index into your array, and your weighted-random letter.
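A sketch of this lookup-table idea in Python (with a made-up 3-letter alphabet instead of the full 26, for brevity):

import random

freq_tenths = {'a': 500, 'b': 300, 'c': 200}     # tenths of a percent; must sum to 1000
table = [letter for letter, n in freq_tenths.items() for _ in range(n)]
print(table[int(random.random() * len(table))])  # uniform index -> weighted letter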
Use the method explained here; alas, it is for Python, but it could be rewritten for C, etc.:
https://stackoverflow.com/a/4113400/129202
First you need to make an NSDictionary of the letters and their frequencies.
I'll explain it with an example.
Let's say your dictionary is something like this:
@{@"a": @0.2, @"b": @0.5, @"c": @0.3};
So the frequencies of your letters cover the interval [0, 1] this way:
a -> [0, 0.2], b -> (0.2, 0.7], c -> (0.7, 1]
You generate a random number between 0 and 1. Then, by checking which interval this random number belongs to and returning the corresponding letter, you get what you want.
You seed the random function at the beginning of your program: srand48(time(0));
- (NSString *)weightedRandomForDicLetters:(NSDictionary *)letterFreq
{
    double randomNumber = drand48();
    double endOfInterval = 0;
    for (NSString *letter in letterFreq) {   // fast enumeration walks the keys
        endOfInterval += [[letterFreq objectForKey:letter] doubleValue];
        if (randomNumber < endOfInterval) {
            return letter;
        }
    }
    return nil; // only reached if the frequencies sum to less than 1
}