What is the computational cost of finding an element in a sorted array - time-complexity

Say that I have an array of size n that has been sorted using Quicksort, e.g. X = [1,2,3,6,7]. I want to match all the values in this array with the n values in another array that is in random order, e.g. Y = [3,7,6,2,1].
I can iterate through each element of Y and compare it to the middle value of X (i.e. 3), so I would only need to complete at most n/2 checks. What would be the total computational complexity of doing this for all values of Y? I am looking for a tight bound.
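The lookup the question describes is usually done as a binary search, which halves the remaining range of X on every comparison rather than stopping at the middle element. A minimal sketch in Python, using the example arrays from the question (bisect stands in for a hand-written binary search):

from bisect import bisect_left

X = [1, 2, 3, 6, 7]   # sorted array
Y = [3, 7, 6, 2, 1]   # values to match against X

matches = []
for y in Y:                        # n lookups, O(log n) comparisons each -> O(n log n) total
    i = bisect_left(X, y)          # binary search for y in X
    if i < len(X) and X[i] == y:
        matches.append((y, i))     # the value and its position in X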

Related

Performing a sparse sum on Mathematica

I want to evaluate a sum in Mathematica of the form
g[[i,j,k,l,m,n]] x g[[o,p,q,r,s,t]] x ( complicated function of the indices )
But all these indices range from 0 to 3, so the total number of cases to sum over is 4^12, which will take an unforgiving amount of time. However, barely any elements of the array g[[i,j,k,l,m,n]] are nonzero -- there are probably around 8 nonzero entries -- so I would like to restrict the sum over {i,j,k,l,m,n,o,p,q,r,s,t} to precisely those combinations of indices for which both factors of g are nonzero.
I can't find a way to do this for summation over multiple indices, where the allowed index choices are particular combinations of {i,j,k,l,m,n} as opposed to specific values of each particular index. Any help appreciated!
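The question is about Mathematica, but the underlying idea, enumerating only the nonzero index tuples of g and summing over pairs of them, can be sketched in Python/NumPy (the tensor and the function f below are hypothetical stand-ins for the objects in the question):

import itertools
import numpy as np

g = np.zeros((4,) * 6)        # sparse 6-index tensor, each index running over 0..3
g[0, 1, 2, 3, 0, 1] = 1.5     # a handful of nonzero entries
g[3, 3, 0, 0, 2, 2] = -0.7

def f(idx1, idx2):
    return 1.0                # placeholder for the complicated function of the 12 indices

nonzero = [tuple(ix) for ix in np.argwhere(g)]   # only the nonzero index tuples
total = sum(g[a] * g[b] * f(a, b)                # ~8^2 terms instead of 4^12
            for a, b in itertools.product(nonzero, nonzero))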

Finding first index before value steps outside of boundary, numpy

Imagine we have a large dataframe where column 1 is time and column 2 is the value.
I want to know all indices n where:
The values at n, n-1, n-2, n-3, ...., n-x are all inside a boundary and the value at n+1 is outside the boundary.
I tried it via np.where, which looks promising. I am able to find all points n where n+1 is outside of the boundaries, but I only need those where n-1, ..., n-x are inside the boundaries too, not just n.
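A minimal sketch of one way to express this with NumPy; the names values, lo, hi and the look-back length x below are hypothetical placeholders for the data and boundary in the question:

import numpy as np

values = np.array([0.2, 0.4, 0.5, 0.3, 1.7, 0.1, 0.2, 0.3, 0.4, 2.5])
lo, hi = 0.0, 1.0        # the boundary
x = 3                    # how many preceding samples must also be inside

inside = (values >= lo) & (values <= hi)

# candidate indices n: value n is inside but value n+1 is outside
candidates = np.where(inside[:-1] & ~inside[1:])[0]

# keep only candidates whose x preceding values are inside as well
result = np.array([n for n in candidates
                   if n >= x and inside[n - x:n + 1].all()])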

Is the time complexity of the following cases correct?

I am a bit confused about the (average case) time complexity of the following cases:
I have N=3 arrays, each with a different number of elements:
Array1 has n1 elements
Array2 has n2 elements
Array3 has n3 elements
Case A: I perform quicksort on each array in a sequential manner, starting from the first array till the last.
In this case the time complexity will be N * O(n log n) (where n is the generalized form of the number of elements of an array) or O(n1 log n1 + n2 log n2 + n3 log n3), which asymptotically is equal to O(max(n1 log n1, n2 log n2, n3 log n3))?
Case B: I perform quicksort on each array in parallel.
In this case the time complexity will be O(max(n1 log n1, n2 log n2, n3 log n3))?
Case C: There is a 50% chance of performing quicksort (on all arrays, in parallel) and 50% chance of not sorting any array.
Isn't this case essentially the same as case B? I.e. 0.5 * O(max(n1 log n1, n2 log n2, n3 log n3)), which asymptotically is equal to O(max(n1 log n1, n2 log n2, n3 log n3))?
Therefore, all cases have the same time complexity, O(max(n1 log n1, n2 log n2, n3 log n3))?
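For reference, the step from the sum to the maximum used above rests on a simple sandwich bound: writing a = n1 log n1, b = n2 log n2 and c = n3 log n3, we have max(a, b, c) <= a + b + c <= 3 * max(a, b, c), so the sum and the maximum differ by at most a constant factor and describe the same asymptotic class.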

How to find most similar numerical arrays to one array, using Numpy/Scipy?

Let's say I have a list of 5 words:
[this, is, a, short, list]
Furthermore, I can classify some text by counting the occurrences of the words from the list above and representing these counts as a vector:
N = [1,0,2,5,10] # 1x this, 0x is, 2x a, 5x short, 10x list found in the given text
In the same way, I classify many other texts (count the 5 words per text, and represent them as counts - each row represents a different text which we will be comparing to N):
M = [[1,0,2,0,5],
     [0,0,0,0,0],
     [2,0,0,0,20],
     [4,0,8,20,40],
     ...]
Now, I want to find the top 1 (2, 3, etc.) rows from M that are most similar to N. Or, in simple words, the most similar texts to my initial text.
The challenge is that just checking the distances between N and each row of M is not enough, since, for example, row M4 [4,0,8,20,40] is very different from N by distance, but still proportional (by a factor of 4) and therefore very similar. For example, the text in row M4 may simply be 4x as long as the text represented by N, so naturally all counts will be 4x as high.
What is the best approach to solve this problem (of finding the most 1,2,3 etc similar texts from M to the text in N)?
Generally speaking, the most widely used standard technique for bag-of-words similarity (i.e. your arrays) is the cosine similarity measure. This maps your bag of n (here 5) words to an n-dimensional space, and each array is a point (which is essentially also a point vector) in that space. The most similar vectors (/points) are the ones that have the smallest angle to your text N in that space (this automatically takes care of proportional ones, as they would be close in angle). Therefore, here is code for it (assuming M and N are numpy arrays with the shapes introduced in the question):
import numpy as np
# per-row cosine similarity between N and every row of M
cos_sim = (M @ N) / (np.linalg.norm(M, axis=1) * np.linalg.norm(N))
most_similar = M[np.nanargmax(cos_sim)]  # nanargmax skips the all-zero row (0/0 -> NaN)
which gives output [ 4 0 8 20 40] for your inputs.
You can normalise your row counts to remove the length effect you discussed. Row normalisation of M can be done as M / M.sum(axis=1)[:, np.newaxis]. The residual values can then be calculated as the sum of the squared differences between the normalised N and each normalised row of M. The minimum difference (ignoring NaN or inf values obtained if a row sum is 0) then identifies the most similar row.
Here is an example:
import numpy as np
N = np.array([1,0,2,5,10])
M = np.array([[1,0,2,0,5],
              [0,0,0,0,0],
              [2,0,0,0,20],
              [4,0,8,20,40]])
# sqrt of sum of normalised square differences
similarity = np.sqrt(np.sum((M / M.sum(axis=1)[:, np.newaxis] - N / np.sum(N))**2, axis=1))
# replace any NaN values obtained by dividing by 0, making them larger than every other element
similarity[np.isnan(similarity)] = np.nanmax(similarity) + 1
result = M[similarity.argmin()]
result
>>> array([ 4, 0, 8, 20, 40])
You could then use np.argsort(similarity)[:n] to get the n most similar rows.

Weighted random letter in Objective-C

I need a simple way to randomly select a letter from the alphabet, weighted on the percentage I want it to come up. For example, I want the letter 'E' to come up in the random function 5.9% of the time, but I only want 'Z' to come up 0.3% of the time (and so on, based on the average occurrence of each letter in the alphabet). Any suggestions? The only way I see is to populate an array with, say, 10000 letters (590 'E's, 30 'Z's, and so on) and then randomly select a letter from that array, but it seems memory intensive and clumsy.
Not sure if this would work, but it seems like it might do the trick:
1. Take your list of letters and frequencies and sort them from smallest frequency to largest.
2. Create a 26-element array where each element n contains the sum of all previous frequencies plus frequency n from the sorted list (a running cumulative sum). Make note of the sum in the last element of the array.
3. Generate a random number between 0 and the sum you made note of above.
4. Do a binary search of the array of sums until you reach the element where that number would fall.
That's a little hard to follow, so it would be something like this:
If you have a 5-letter alphabet with these frequencies, a = 5%, b = 20%, c = 10%, d = 40%, e = 25%, sort them by frequency: a, c, b, e, d
Keep a running sum of the elements: 5, 15, 35, 60, 100
Generate a random number between 0 and 100. Say it came out 22.
Do a binary search for the element where 22 would fall. In this case it would be between element 2 and 3, which would be the letter "b" (rounding up is what you want here, I think)
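A minimal sketch of those steps in Python, purely for illustration (bisect stands in for the hand-rolled binary search, and the frequencies are those of the 5-letter example):

import bisect
import random

freqs = [('a', 5), ('c', 10), ('b', 20), ('e', 25), ('d', 40)]   # sorted by frequency

letters = [letter for letter, _ in freqs]
sums, running = [], 0
for _, weight in freqs:
    running += weight
    sums.append(running)                         # running sums: 5, 15, 35, 60, 100

r = random.random() * sums[-1]                   # random number in [0, total)
letter = letters[bisect.bisect_right(sums, r)]   # first running sum greater than r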
You've already acknowledged the tradeoff between space and speed, so I won't get into that.
If you can calculate the frequency of each letter a priori, then you can pre-generate an array (or dynamically create and fill an array once) to scale up with your desired level of precision.
Since you used percentages with a single digit of precision after the decimal point, consider an array of 1000 entries. Each index represents one tenth of one percent of frequency. So you'd have letter[0] to letter[82] equal to 'a', letter[83] to letter[97] equal to 'b', and so on up until letter[999] equal to 'z'. (Values according to Relative frequencies of letters in the English language.)
Now generate a random number between 0 and 1 (using whatever favourite PRNG you have, assuming uniform distribution), multiply the result by 1000, and truncate to an integer. That gives you the index into your array, and hence your weighted-random letter.
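A small sketch of this lookup-table idea in Python; the three-letter frequency table below is hypothetical, and a real one would cover all 26 letters in tenths of a percent, summing to 1000:

import random

freq_tenths = {'a': 820, 'b': 150, 'z': 30}      # 82.0%, 15.0%, 3.0% -> 1000 table entries

table = []
for letter, count in freq_tenths.items():
    table += [letter] * count                    # one entry per tenth of a percent

pick = table[int(random.random() * len(table))]  # uniform index -> weighted letter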
Use the method explained here. Alas this is for Python but could be rewritten for C etc.
https://stackoverflow.com/a/4113400/129202
First you need to make an NSDictionary of the letters and their frequencies.
I'll explain it with an example. Let's say your dictionary is something like this:
@{@"a": @0.2, @"b": @0.5, @"c": @0.3};
So the frequencies of your letters cover the interval [0, 1] this way:
a -> [0, 0.2], b -> [0.2, 0.7], c -> [0.7, 1]
You generate a random number between 0 and 1. Then, by checking which interval this random number falls into and returning the corresponding letter, you get what you want.
You seed the random function at the beginning of your program: srand48(time(0));
- (NSString *)weightedRandomForDicLetters:(NSDictionary *)letterFreq
{
    double randomNumber = drand48();
    double endOfInterval = 0;
    // walk the cumulative intervals until the random number falls inside one
    for (NSString *letter in letterFreq) {
        endOfInterval += [[letterFreq objectForKey:letter] doubleValue];
        if (randomNumber < endOfInterval) {
            return letter;
        }
    }
    return nil; // only reached if the frequencies sum to less than 1
}