Np.where function - numpy

I've got a little problem understanding the where function in numpy.
The ‘times’ array contains the discrete epochs at which GPS measurements exist (rounded to the nearest second).
The ‘locations’ array contains the discrete values of the latitude, longitude and altitude of the satellite interpolated from 10 seconds intervals to 1 second intervals at the ‘times’ epochs.
The ‘tracking’ array contains an array for each epoch in ‘times’ (array within an array). The arrays have 5 columns and 32 rows. The 32 rows correspond to the 32 satellites of the GPS constellation. The 0th row corresponds to the 1st satellite, the 31st to the 32nd. The columns contain the following (in order): is the satellite tracked (0), is L1 locked (1), is L2 locked (2), is L1 unexpectedly lost (3), is L2 unexpectedly lost (4).
We need to find all the unexpected losses and put them in an array so we can plot it on a map.
What we tried to do is:
import numpy as np
i = 0
with np.load(r'folderpath\%i.npz' % i) as oneday_data:  # replace folderpath with your directory
    times = oneday_data['times']
    locations = oneday_data['locations']
    tracking = oneday_data['tracking']
A = np.where(tracking[:][:][4] == 1)
This should give us all the positions of the losses. With these indices it is easy to get the right locations. But it keeps returning useless data.
Can someone help us?

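I think the problem is your double slices: tracking[:] just returns the whole array again, so tracking[:][:][4] is the same as tracking[4], i.e. the fifth epoch rather than column 4. Further, having an array of arrays could lead to weird problems (I assume you mean an object array of 2D arrays).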
So I think you need to dstack tracking into a 3D array, then do where on that. If the array is already 3D, then you can skip the dstack part. This will get the places where L2 is unexpectedly lost, which is what you did in your example:
tracking3d = np.dstack(tracking)
A0, A2 = np.where(tracking3d[:, 4, :]==1)
A0 is the position of the 1 along axis 0 (satellite), while A2 is the position of the same 1 along axis 2 (time epoch).
If the values of tracking can only be 0 or 1, you can simplify this by just doing np.where(tracking3d[:, 4, :]).
You can also roll the axes back into the configuration you were using (0: time epoch, 1: satellite, 2: tracking status)
tracking3d = np.rollaxis(np.dstack(tracking), 2, 0)
A0, A1 = np.where(tracking3d[:, :, 4]==1)
If you want to find the locations where L1 or L2 are unexpectedly lost, you can do this:
tracking3d = np.rollaxis(np.dstack(tracking), 2, 0)
A0, A1, _ = np.where(tracking3d[:, :, 3:]==1)
In this case it is the same, except there is a dummy variable _ used for the position along the last axis, since you don't care whether it was lost for L1 or L2 (if you do care, you could just do np.where independently for each column, 3 and 4).
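As a minimal sketch of the approach above, here is a small synthetic example. The number of epochs, the positions of the losses and the contents of locations are made up for illustration; only the 32 x 5 layout per epoch comes from the question.

```python
import numpy as np

n_epochs = 4  # hypothetical number of epochs, for illustration only

# One (32 satellites x 5 flags) array per epoch, as described in the question.
tracking = [np.zeros((32, 5), dtype=int) for _ in range(n_epochs)]
tracking[1][6, 4] = 1    # epoch 1: satellite 7 unexpectedly loses L2
tracking[3][30, 4] = 1   # epoch 3: satellite 31 unexpectedly loses L2
tracking[3][0, 3] = 1    # epoch 3: satellite 1 unexpectedly loses L1

# Made-up (lat, lon, alt) rows, one per epoch.
locations = np.arange(n_epochs * 3, dtype=float).reshape(n_epochs, 3)

# Stack into a single array with axes (epoch, satellite, flag).
tracking3d = np.rollaxis(np.dstack(tracking), 2, 0)

# Indices of every unexpected L2 loss (column 4).
epoch_idx, sat_idx = np.where(tracking3d[:, :, 4] == 1)

# Location at each loss epoch, ready to be plotted on a map.
loss_locations = locations[epoch_idx]

print(tracking3d.shape)
print(epoch_idx, sat_idx)
```

Indexing locations with epoch_idx works here because locations has one row per epoch, which is how the question describes it.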

Related

Plotting an exponential function given one parameter

I'm fairly new to Python, so bear with me. I have generated some data containing a very large number of points, stored in the variable vals, and plotted a histogram of it. I have limited the histogram so that only values between 104 and 155 are taken into account. This has been done as follows:
bin_heights, bin_edges = np.histogram(vals, range=[104, 155], bins=30)
bin_centres = (bin_edges[:-1] + bin_edges[1:])/2.
plt.errorbar(bin_centres, bin_heights, np.sqrt(bin_heights), fmt=',', capsize=2)
plt.xlabel(r"$m_{\gamma\gamma} (GeV)$")  # raw string so \g is not treated as an escape sequence
plt.ylabel("Number of entries")
plt.show()
Giving the above plot:
My next step is to take into account values from vals which are less than 120. I have done this as follows:
background_data=[j for j in vals if j <= 120] #to avoid taking the signal bump, upper limit of 120 MeV set
I need to plot a curve on the same plot as the histogram, which follows the form B(x) = Ae^(-x/λ)
I then estimated a value of λ using the maximum likelihood estimator formula:
background_data=[j for j in vals if j <= 120] #to avoid taking the signal bump, upper limit of 120 MeV set
#print(background_data)
N_background=len(background_data)
print(N_background)
sigma_background_data=sum(background_data)
print(sigma_background_data)
lamb = (sigma_background_data)/(N_background) #maximum likelihood estimator for lambda
print('lambda estimate is', lamb)
where lamb = λ. I got a value of roughly lamb = 27.75, which I know is correct. I now need to get an estimate for A.
I have been advised to do this as follows:
Given a value of λ, find A by scaling the PDF to the data such that the area beneath
the scaled PDF has equal area to the data
I'm not quite sure what this means, or how I'd go about trying to do this. PDF means probability density function. I assume an integration will have to take place, so to get the area under the data (vals), I have done this:
data_area= integrate.cumtrapz(background_data, x=None, dx=1.0)
print(data_area)
plt.plot(background_data, data_area)
However, this gives me an error
ValueError: x and y must have same first dimension, but have shapes (981555,) and (981554,)
I'm not sure how to fix it. The end result should be something like:
See the cumtrapz docs:
Returns: ... If initial is None, the shape is such that the axis of integration has one less value than y. If initial is given, the shape is equal to that of y.
So you can either pass an initial value, like
data_area = integrate.cumtrapz(background_data, x=None, dx=1.0, initial = 0.0)
or discard the first value of the background_data:
plt.plot(background_data[1:], data_area)
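The shape behaviour quoted from the docs can be checked on a tiny array. This is only a sketch with made-up sample values; it assumes SciPy is installed and tries the newer function name cumulative_trapezoid first, falling back to cumtrapz on older releases.

```python
import numpy as np

try:
    from scipy.integrate import cumulative_trapezoid
except ImportError:  # older SciPy releases only provide the old name
    from scipy.integrate import cumtrapz as cumulative_trapezoid

y = np.array([1.0, 2.0, 4.0, 8.0])  # made-up sample values

# Without `initial`, the result has one element fewer than y.
without_initial = cumulative_trapezoid(y, dx=1.0)

# With `initial=0.0`, a leading 0 is prepended and the shapes match.
with_initial = cumulative_trapezoid(y, dx=1.0, initial=0.0)

print(y.shape, without_initial.shape, with_initial.shape)
```

Either result can be plotted as long as the x array passed to plt.plot has the same length as the integrated array.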

How to find most similar numerical arrays to one array, using Numpy/Scipy?

Let's say I have a list of 5 words:
[this, is, a, short, list]
Furthermore, I can classify some text by counting the occurrences of the words from the list above and representing these counts as a vector:
N = [1,0,2,5,10] # 1x this, 0x is, 2x a, 5x short, 10x list found in the given text
In the same way, I classify many other texts (count the 5 words per text, and represent them as counts - each row represents a different text which we will be comparing to N):
M = [[1,0,2,0,5],
[0,0,0,0,0],
[2,0,0,0,20],
[4,0,8,20,40],
...]
Now, I want to find the top 1 (2, 3 etc.) rows from M that are most similar to N. Or, in simple words, the texts most similar to my initial text.
The challenge is, just checking the distances between N and each row from M is not enough, since for example row M4 [4,0,8,20,40] is very different by distance from N, but still proportional (by a factor of 4) and therefore very similar. For example, the text in row M4 can be just 4x as long as the text represented by N, so naturally all counts will be 4x as high.
What is the best approach to solve this problem (of finding the 1, 2, 3 etc. texts from M that are most similar to the text in N)?
Generally speaking, the standard technique for comparing bag-of-words vectors (i.e. your arrays) is the cosine similarity measure. It maps your bag of n (here 5) words to an n-dimensional space in which each array is a point (equivalently, a vector). The most similar vectors are the ones that make the smallest angle with your text N in that space; this automatically takes care of proportional rows, since they point in the same direction. Here is code for it (assuming M and N are numpy arrays with the shapes introduced in the question):
import numpy as np
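# cosine similarity of N with each row of M (per-row norms); all-zero rows give NaN, which nanargmax skips
cos_sim = M[np.nanargmax(np.dot(M, N)/(np.linalg.norm(M, axis=1)*np.linalg.norm(N)))]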
which gives output [ 4 0 8 20 40] for your inputs.
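To get the top 1, 2, 3 etc. rows rather than only the best one, the cosine similarities can be sorted. The following is a sketch under assumptions: the helper name top_n_cosine is hypothetical, and rows of M that are all zeros are given a similarity of minus infinity so that they are ranked last.

```python
import numpy as np

def top_n_cosine(M, N, n=1):
    """Return (row indices, similarities) of the n rows of M closest in angle to N."""
    M = np.asarray(M, dtype=float)
    N = np.asarray(N, dtype=float)

    # One norm per row of M, times the norm of N.
    denom = np.linalg.norm(M, axis=1) * np.linalg.norm(N)

    # All-zero rows have no direction; rank them last instead of dividing by 0.
    sims = np.full(len(M), -np.inf)
    nonzero = denom > 0
    sims[nonzero] = (M[nonzero] @ N) / denom[nonzero]

    order = np.argsort(-sims)[:n]  # indices sorted by decreasing similarity
    return order, sims[order]

N = np.array([1, 0, 2, 5, 10])
M = np.array([[1, 0, 2, 0, 5],
              [0, 0, 0, 0, 0],
              [2, 0, 0, 0, 20],
              [4, 0, 8, 20, 40]])

order, sims = top_n_cosine(M, N, n=2)
print(order, sims)
```

For the rows given in the question, the proportional row [4, 0, 8, 20, 40] comes first with a similarity of 1.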
You can normalise your row counts to remove the length effect as you discussed. Row normalisation of M can be done as M / M.sum(axis=1)[:, np.newaxis]. The residual values can then be calculated as the sum of the square difference between N and M per row. The minimum difference (ignoring NaN or inf values obtained if the row sum is 0) is then the most similar.
Here is an example:
import numpy as np
N = np.array([1,0,2,5,10])
M = np.array([[1,0,2,0,5],
[0,0,0,0,0],
[2,0,0,0,20],
[4,0,8,20,40]])
# sqrt of sum of normalised square differences
similarity = np.sqrt(np.sum((M / M.sum(axis=1)[:, np.newaxis] - N / np.sum(N))**2, axis=1))
# rows of M that sum to 0 produce NaN; set them to infinity so argmin never picks them
similarity[np.isnan(similarity)] = np.inf
result = M[similarity.argmin()]
result
>>> array([ 4, 0, 8, 20, 40])
You could then use np.argsort(similarity)[:n] to get the n most similar rows.

Flop count for variable initialization

Consider the following pseudo code:
a <- [0,0,0] (initializing a 3d vector to zeros)
b <- [0,0,0] (initializing a 3d vector to zeros)
c <- a . b (Dot product of two vectors)
In the above pseudo code, what is the flop count (i.e. the number of floating-point operations)?
More generally, what I want to know is whether initialization of variables counts towards the total floating point operations or not, when looking at an algorithm's complexity.
In your case both a and b are zero vectors, and I don't think zeros are a good choice for explaining a flop count.
Instead, take a vector a with entries a1, a2, a3 and a vector b with entries b1, b2, b3. The dot product of the two vectors is a^T b, which gives
a^T b = a1*b1 + a2*b2 + a3*b3
Here we have 3 multiplications
(i.e. a1*b1, a2*b2, a3*b3) and 2 additions. In total we have 5 operations, or 5 flops.
If we generalize this example to n-dimensional vectors a and b, we have n multiplications and n-1 additions. In total we end up with n + (n-1) = 2n-1 operations, or flops.
I hope the example I used above gives you the intuition.
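The 2n-1 count can be checked with a small sketch that tallies the operations while computing the dot product. The function name dot_with_flop_count is hypothetical, and the counters cover only the arithmetic on vector entries, not the initialization of a and b.

```python
def dot_with_flop_count(a, b):
    """Return (dot product, multiplications, additions) for equal-length vectors."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("vectors must be non-empty and of equal length")

    total = a[0] * b[0]  # first product, no addition needed yet
    mults, adds = 1, 0
    for x, y in zip(a[1:], b[1:]):
        total += x * y   # one multiplication and one addition per extra entry
        mults += 1
        adds += 1
    return total, mults, adds

result, mults, adds = dot_with_flop_count([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])
print(result, mults, adds, mults + adds)
```

For n = 3 this reports 3 multiplications and 2 additions, i.e. 5 flops.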

Karatsuba and Toom-3 algorithms for 3-digit number multiplications

I was wondering about this problem concerning Karatsuba's algorithm.
When you apply Karatsuba you basically have to do 3 multiplications per one run of the loop
Those are (let's say ab and cd are 2-digit numbers with digits respectively a, b, c and d):
X = bd
Y = ac
Z = (a+b)(c+d)
and then the sums we were looking for are:
bd = X
ac = Y
(bc + ad) = Z - X - Y
My question is: let's say we have two 3-digit numbers, abc and def. I found out that only 5 multiplications are needed to multiply them. I also found the Toom-3 algorithm, but it uses polynomials that I can't quite follow. Could someone write down those multiplications and how to calculate the sums of interest: bd + ae, ce + bf, cd + be + af?
The basic idea is this: The number 237 is the polynomial p(x) = 2x^2 + 3x + 7 evaluated at the point x=10. So, we can think of each integer as corresponding to a polynomial whose coefficients are the digits of the number. When we evaluate the polynomial at x=10, we get our number back.
What is interesting is that to fully specify a polynomial of degree 2, we need its value at just 3 distinct points. We need 5 values to fully specify a polynomial of degree 4.
So, if we want to multiply two 3 digit numbers, we can do so by:
1. Evaluating the corresponding polynomials at 5 distinct points.
2. Multiplying the 5 values. We now have 5 function values of the polynomial of the product.
3. Finding the coefficients of this polynomial from the five values we computed in step 2.
Karatsuba multiplication works the same way, except that we only need 3 distinct points. Instead of at 10, we evaluate the polynomial at 0, 1, and "infinity", which gives us b,a+b,a and d,d+c,c which multiplied together give you your X,Z,Y.
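For the two-digit case this can be written out directly. The sketch below uses the three products X = bd, Y = ac and Z = (a+b)(c+d); the function name karatsuba_2digit is hypothetical, and the multiplications by 10 and 100 are only digit shifts, not counted as digit multiplications.

```python
def karatsuba_2digit(a, b, c, d):
    """Multiply the two-digit numbers 'ab' and 'cd' using three digit products."""
    x = b * d                # value of the product polynomial at 0
    y = a * c                # value at "infinity" (leading coefficient)
    z = (a + b) * (c + d)    # value at 1
    middle = z - x - y       # equals a*d + b*c
    return 100 * y + 10 * middle + x

print(karatsuba_2digit(2, 3, 4, 5))  # digits of 23 and 45
```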
Now, to write this all out in terms of abc and def is quite involved. In the Wikipedia article, it's actually done quite nicely:
In the Evaluation section, the polynomials are evaluated to give, for example, c,a+b+c,a-b+c,4a+2b+c,a for the first number.
In Pointwise products, the corresponding values for each number are multiplied, which gives:
X = cf
Y = (a+b+c)(d+e+f)
Z = (a-b+c)(d-e+f)
U = (4a+2b+c)(4d+2e+f)
V = ad
In the Interpolation section, these values are combined to give you the digits in the product. This involves solving a 5x5 system of linear equations, so again it's a bit more complicated than the Karatsuba case.
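The interpolation step can be sketched for the evaluation points 0, 1, -1, 2 and "infinity" used above. This is one possible way of solving the 5x5 system by hand, not necessarily the sequence given in the Wikipedia article; the function name toom3_3digit and the intermediate names s and t are hypothetical. The divisions by 2 and 6 are always exact.

```python
def toom3_3digit(a, b, c, d, e, f):
    """Coefficients [r4, r3, r2, r1, r0] of (a x^2 + b x + c)(d x^2 + e x + f)."""
    # Five pointwise products (the only multiplications of digit sums).
    X = c * f                                    # x = 0
    Y = (a + b + c) * (d + e + f)                # x = 1
    Z = (a - b + c) * (d - e + f)                # x = -1
    U = (4 * a + 2 * b + c) * (4 * d + 2 * e + f)  # x = 2
    V = a * d                                    # x = "infinity"

    # Interpolation: recover the coefficients from the five values.
    r0 = X
    r4 = V
    r2 = (Y + Z) // 2 - r0 - r4       # cd + be + af
    s = (Y - Z) // 2                  # r1 + r3
    t = U - r0 - 4 * r2 - 16 * r4     # 2*r1 + 8*r3
    r3 = (t - 2 * s) // 6             # bd + ae
    r1 = s - r3                       # ce + bf
    return [r4, r3, r2, r1, r0]

coeffs = toom3_3digit(2, 3, 7, 1, 4, 5)  # digits of 237 and 145
value = sum(r * 10 ** k for k, r in enumerate(reversed(coeffs)))
print(coeffs, value)
```

Evaluating the resulting polynomial at x=10 gives the product of the two numbers; the middle three coefficients are the sums asked about in the question.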

Fast way of multiplying two 1-D arrays

I have the following data:
A = [a0 a1 a2 a3 a4 a5 .... a24]
B = [b0 b1 b2 b3 b4 b5 .... b24]
which I then want to multiply as follows:
C = A * B' = [a0b0 a1b1 a2b2 ... a24b24]
This clearly involves 25 multiplies.
However, in my scenario, only 5 new values are shifted into A per "loop iteration" (and 5 old values are shifted out of A). Is there any fast way to exploit the fact that data is shifting through A rather than being completely new? Ideally I want to minimize the number of multiplication operations (at a cost of perhaps more additions/subtractions/accumulations). I initially thought a systolic array might help, but it doesn't (I think!?)
Update 1: Note B is fixed for long periods, but can be reprogrammed.
Update 2: the shifting of A is like the following: a[24] <= a[19], a[23] <= a[18]... a[1] <= new01, a[0] <= new00. And so on so forth each clock cycle
Many thanks!
Is there any fast way to exploit the fact that data is shifting through A rather than being completely new?
Even though all you're doing is the shifting and adding new elements to A, the products in C will, in general, all be different since one of the operands will generally change after each iteration. If you have additional information about the way the elements of A or B are structured, you could potentially use that structure to reduce the number of multiplications. Barring any such structural considerations, you will have to compute all 25 products each loop.
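A small numerical sketch of this point, using made-up values for A, B and the incoming data: after A is shifted by 5, each old element of A meets a different element of B, so the old products cannot be reused in general. As one hypothetical example of exploitable structure, if B happened to repeat with period 5, the shifted products would match the old ones and only the 5 new products would need computing.

```python
import numpy as np

def shift_in(a, new_values):
    """Shift `a` towards higher indices and put the new values at the front."""
    k = len(new_values)
    shifted = np.empty_like(a)
    shifted[k:] = a[:-k]      # a[24] <= a[19], ..., a[5] <= a[0]
    shifted[:k] = new_values  # a[4..0] <= new data
    return shifted

A = np.arange(1, 26)                         # made-up contents of A
new = np.array([101, 102, 103, 104, 105])    # made-up incoming values
A_next = shift_in(A, new)

# General B: shifted products differ from the old ones.
B = 2 * np.arange(1, 26) + 1
reusable_general = np.array_equal((A_next * B)[5:], (A * B)[:-5])

# Hypothetical structured B with period 5: old products can be shifted along.
B_periodic = np.tile([2, 3, 5, 7, 11], 5)
reusable_periodic = np.array_equal((A_next * B_periodic)[5:], (A * B_periodic)[:-5])

print(reusable_general, reusable_periodic)
```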
Ideally I want to minimize the number of multiplication operations (at a cost of perhaps more additions/subtractions/accumulations).
In theory, you can reduce the number of multiplications to 0 by shifting and adding the array elements to simulate multiplication. In practice, this will be slower than a hardware multiplication so you're better off just using any available hardware-based multiplication unless there's some additional, relevant constraint you haven't mentioned.
During the very first 5 data sets you could save up to 50 multiplications, but after that it is a flat road of multiplications, since for every set after the first 5 you need to multiply with the new set of data.
I'll assume all the arrays are initialized to zero.
I don't think those 50 saved multiplications are of much use considering the amount of multiplication on the whole.
Still, I will give you a hint on how to save those 50; maybe you can find an extension to it.
1st data set arrived: multiply the first data set in a with each of the data sets in b. Save all in a, copy only a[0] to a[4] to c. 25 multiplications here.
2nd data set arrived: multiply only a[0] to a[4] (holding new data) with b[0] to b[4] respectively. Save in a[0] to a[4], copy a[0->9] to c. 5 multiplications here.
3rd data set arrived: multiply a[0] to a[9] with b[0] to b[9] this time and copy the corresponding a[0->14] to c. 10 multiplications here.
4th data set: multiply a[0] to a[14] with the corresponding b, copy the corresponding a[0->19] to c. 15 multiplications here.
5th data set: multiply a[0] to a[19] with the corresponding b, copy the corresponding a[0->24] to c. 20 multiplications here.
Total saved multiplications: 50.
6th data set: the usual data multiplications, 25 each. This is because for each set in the array a there is a new data set available, so multiplication is unavoidable.
Can you add another array D to flag the changed/unchanged values in A? Each time, you check this array to decide whether to do new multiplications or not.