Finding percentile for each data point in a numpy array - numpy

I have the following line of code:
threshold_value = numpy.percentile(a, q)
where a is my data and q is set at 95 let us say.
And let us say that if I changed q to be 90, I would get a different threshold value.
Well for each data point in a, I would like to calculate what value of q would yield threshold_value equal to a. So what I am interested in is perhaps a data point in a is below the threshold_value, but I want a percentile value to see exactly where is it at. When I have a test dataset, I compare each value to the threshold value to see if it exceeds it or not. So I don't want to give a q value, I want to be told what the q value is for a data point.
So I want function maybe a_percentile = function(a) where a_percentile is a transformation of the original data value to the percentile.

Use scipy.stats.percentileofscore https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.percentileofscore.html:
import numpy as np
from scipy.stats import percentileofscore
a = np.linspace(0, 10, 10)
[percentileofscore(a, i, kind='strict') for i in a]
Output:
[0.0, 10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0, 90.0]

Related

Pandas wrong round decimation

I am calculating the duration of the data acquisition from some sensors. Although the data is collected faster, I would like to sample it at 10Hz. Anyways, I created a dataframe with a column called 'Time_diff' which I expect it goes [0.0, 0.1, 0.2, 0.3 ...]. However it goes somehow like [0.0, 0.1, 0.2, 0.30000004 ...]. I am rounding the data frame but still, I have this weird decimation. Is there any suggestions on how to fix it?
The code:
for i in range(self.n_of_trials):
start = np.zeros(0)
stop = np.zeros(0)
for df in self.trials[i].df_list:
start = np.append(stop, df['Time'].iloc[0])
stop = np.append(start, df['Time'].iloc[-1])
t_start = start.min()
t_stop = stop.max()
self.trials[i].duration = t_stop-t_start
t = np.arange(0, self.trials[i].duration+self.trials[i].dt, self.trials[i].dt)
self.trials[i].df_merged['Time_diff'] = t
self.trials[i].df_merged.round(1)
when I print the data it looks like this:
0 0.0
1 0.1
2 0.2
3 0.3
4 0.4
...
732 73.2
733 73.3
734 73.4
735 73.5
736 73.6
Name: Time_diff, Length: 737, dtype: float64
However when I open as csv file it is like that:
Addition
I think the problem is not csv conversion but how the float data converted/rounded. Here is the next part of the code where I merge more dataframes on 10Hz time stamps:
for j in range(len(self.trials[i].df_list)):
df = self.trials[i].df_list[j]
df.insert(0, 'Time_diff', round(df['Time']-t_start, 1))
df.round({'Time_diff': 1})
df.drop_duplicates(subset=['Time_diff'], keep='first', inplace=True)
self.trials[i].df_merged = pd.merge(self.trials[i].df_merged, df, how="outer", on="Time_diff", suffixes=(None, '_'+self.trials[i].df_list_names[j]))
#Test csv
self.trials[2].df_merged.to_csv(path_or_buf='merged.csv')
And since the inserted dataframes have exact correct decimation, it is not merged properly and create another instance with a new index.
This is not a rounding problem, it is a behavior intrinsic in how floating point numbers work. Actually 0.30000000000000004 is the result of 0.1+0.1+0.1 (try it out yourself in a Python prompt).
In practice not every decimal number is exactly representable as a floating point number so what you get is instead the closest possible value.
You have some options depending if you just want to improve the visualization or if you need to work on exact values. If for example you want to use that column for a merge you can use an approximate comparison instead of an exact one.
Another option is to use the decimal module: https://docs.python.org/3/library/decimal.html which works with exact arithmetic but can be slower.
In your case you said the column should represent frequency at steps of 10Hz so I think changing the representation so that you directly use 10, 20, 30, ... will allow you to use integers instead of floats.
If you want to see the "true" value of a floating point number in python you can use format(0.1*6, '.30f') and it will print the number with 30 digits (still an approximation but much better than the default).

How do I make a 2x2 python array, with random normalized values between 0 and 1?

I want the values randomly between 0 and 1 on a normal distrubition and filling a 2x2 array
This should do what you are requesting (see code comments for explanation):
from numpy.random import default_rng
#Create the random number generator
rng = default_rng()
#Create a 2x2 matrix of samples from a normal distribution.
#The values will be normalized later, so the default mean and standard deviation are okay.
vals = rng.standard_normal((2,2))
#Normalize values to be between 0 and 1
vals = (vals - vals.min()) / vals.ptp()
With this above method you will always have a zero and always have a 1 in your 2x2 matrix after normalization. I don't know what your use case is, but is this really what you want? Instead perhaps you could set your mean to 0.5, standard deviation to 0.17, and np.clip anything below 0 or above 1? Here's what that would look like:
import numpy as np
mean = 0.5
standard_deviation = 0.17
s = np.random.default_rng().normal(mean, standard_deviation, (2,2))
s = np.clip(s, 0.0, 1.0)
Reference Sources:
https://numpy.org/doc/stable/reference/random/index.html#random-quick-start
https://numpy.org/doc/stable/reference/random/generated/numpy.random.standard_normal.html

Numpy Random Choice with Non-regular Array Size

I'm making an array of sums of random choices from a negative binomial distribution (nbd), with each sum being of non-regular length. Right now I implement it as follows:
import numpy
from numpy.random import default_rng
rng = default_rng()
nbd = rng.negative_binomial(1, 0.5, int(1e6))
gmc = [12, 35, 4, 67, 2]
n_pp = np.empty(len(gmc))
for i in range(len(gmc)):
n_pp[i] = np.sum(rng.choice(nbd, gmc[i]))
This works, but when I perform it over my actual data it's very slow (gmc is of dimension 1e6), and I would like to vary this for multiple values of n and p in the nbd (in this example they're set to 1 and 0.5, respectively).
I'd like to work out a pythonic way to do this which eliminates the loop, but I'm not sure it's possible. I want to keep default_rng for the better random generation than the older way of doing it (np.random.choice), if possible.
The distribution of the sum of m samples from the negative binomial distribution with parameters (n, p) is the negative binomial distribution with parameters (m*n, p). So instead of summing random selections from a large, precomputed sample of negative_binomial(1, 0.5), you can generate your result directly with negative_binomial(gmc, 0.5):
In [68]: gmc = [12, 35, 4, 67, 2]
In [69]: npp = rng.negative_binomial(gmc, 0.5)
In [70]: npp
Out[70]: array([ 9, 34, 1, 72, 7])
(The negative_binomial method will broadcast its inputs, so we can pass gmc as an argument to generate all the samples with one call.)
More generally, if you want to vary the n that is used to generate nbd, you would multiply that n by the corresponding element in gmc and pass the product to rng.negative_binomial.

How to calculate average of values of a column for a particular value in another column?

I have a data frame that looks like this.
How can I get the average doc/duration for each window into another data frame?
I need it in the following way
Dataframe should contain only one column i.e mean. If there are 3000 windows then there should be 3000 rows in axis 0 which represent the windows and the mean will contain the average value. If that particular window is not present in the initial data frame the corresponding value for that window needs to be 0.
Use .groupby() method and then compute the mean:
import pandas as pd
df = pd.DataFrame({'10s_windows': [304, 374, 374, 374, 374, 3236, 3237, 3237, 3237],
'doc/duration': [0.1, 0.1, 0.2, 0.2, 0.12, 0.34, 0.32, 0.44, 0.2]})
new_df = df.groupby('10s_windows').mean()
Which results in:
Source: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html

Find subgroups of a numpy array

I have a numpy array like this one:
A = ([249, 250, 3016, 3017, 5679, 5680, 8257, 8258,
10756, 10757, 13178, 13179, 15531, 15532, 17824, 17825,
20058, 20059, 22239, 22240, 24373, 24374, 26455, 26456,
28491, 28492, 30493, 30494, 32452, 32453, 34377, 34378,
36264, 36265, 38118, 38119, 39939, 39940, 41736, 41737,
43501, 43502, 45237, 45238, 46950, 46951, 48637, 48638])
I would like to write a small script that finds a subgroup of values of the array for which the difference is smaller than a certain threshold, let say 3, and that returns the highest value of the subgroup. In the case of A array the output should be:
A_out =([250,3017,5680,8258,10757,13179,...])
Is there a numpy function for that?
Here's a vectorized Numpy approach.
First, the data (in a numpy array) and the threshold:
In [41]: A = np.array([249, 250, 3016, 3017, 5679, 5680, 8257, 8258,
10756, 10757, 13178, 13179, 15531, 15532, 17824, 17825,
20058, 20059, 22239, 22240, 24373, 24374, 26455, 26456,
28491, 28492, 30493, 30494, 32452, 32453, 34377, 34378,
36264, 36265, 38118, 38119, 39939, 39940, 41736, 41737,
43501, 43502, 45237, 45238, 46950, 46951, 48637, 48638])
In [42]: threshold = 3
The following produces the array delta. It is almost the same as delta = np.diff(A), but I want to include one more value that is greater than the threshold at the end of delta.
In [43]: delta = np.hstack((diff(A), threshold + 1))
Now the group maxima are simply A[delta > threshold]:
In [46]: A[delta > threshold]
Out[46]:
array([ 250, 3017, 5680, 8258, 10757, 13179, 15532, 17825, 20059,
22240, 24374, 26456, 28492, 30494, 32453, 34378, 36265, 38119,
39940, 41737, 43502, 45238, 46951, 48638])
Or, if you want, A[delta >= threshold]. That gives the same result for this example:
In [47]: A[delta >= threshold]
Out[47]:
array([ 250, 3017, 5680, 8258, 10757, 13179, 15532, 17825, 20059,
22240, 24374, 26456, 28492, 30494, 32453, 34378, 36265, 38119,
39940, 41737, 43502, 45238, 46951, 48638])
There is a case where this answer differs from #DrV's answer. From your description, it isn't clear to me how a set of values such as 1, 2, 3, 4, 5, 6 should be handled. The consecutive differences are all 1, but the difference between the first and last is 5. The numpy calculation above will treat these as a single group. #DrV's answer will create two groups.
Interpretation 1: The value of an item in a group must not differ more than 3 units from that of the first item of the group
This is one of the things where NumPy's capabilities are at their limits. As you will have to iterate through the list, I suggest a pure Python approach:
first_of_group = A[0]
previous = A[0]
group_lasts = []
for a in A[1:]:
# if this item no longer belongs to the group
if abs(a - first_of_group) > 3:
group_lasts.append(previous)
first_of_group = a
previous = a
# add the last item separately, because it is always a last of the group
group_lasts.append(a)
Now you have the group lasts in group_lasts.
Using any NumPy array functionality here does not seem to provide much help.
Interpretation 2: The value of an item in a group must not differ more than 3 units from the previous item
This is easier, as we can easily form a list of group breaks as in Warren Weckesser's answer. Here NumPy is of a lot of help.