Pandas qcut: returned intervals and bins do not appear consistent on the left boundary

If you simply do this:
import pandas as pd

out, bins = pd.qcut(range(10), 4, retbins=True)
out is:
[(-0.001, 2.25], (-0.001, 2.25], (-0.001, 2.25], (2.25, 4.5], (2.25, 4.5], (4.5, 6.75], (4.5, 6.75], (6.75, 9.0], (6.75, 9.0], (6.75, 9.0]]
Categories (4, interval[float64]): [(-0.001, 2.25] < (2.25, 4.5] < (4.5, 6.75] < (6.75, 9.0]]
bins is:
array([0. , 2.25, 4.5 , 6.75, 9. ])
Note that all bounds from 'out' and 'bins' are consistent except the left boundary: it appears as -0.001 in the category intervals but as 0.0 in the bins array.
This causes a problem for me since I serialize bins as-is and re-apply it to new data. The context is machine learning, where you apply exactly the same binning and categorization/embedding during inference on new data. Because of the difference in the left boundaries, the resulting categories don't match up and I have a bug.
Does anyone know why the categorical interval starts at -0.001 while 'bins' starts at 0.0, even though all other boundaries match?

That looks like rounding to me.
Your code with a slight modification:
import numpy as np
import pandas as pd

pd.qcut(np.arange(0.1, 1, 0.01), 4, retbins=True)
Here are the results:
([(0.099, 0.322], (0.099, 0.322], (0.099, 0.322], (0.099, 0.322], (0.099, 0.322], ..., (0.767, 0.99], (0.767, 0.99], (0.767, 0.99], (0.767, 0.99], (0.767, 0.99]]
Length: 90
Categories (4, interval[float64]): [(0.099, 0.322] < (0.322, 0.545] < (0.545, 0.767] < (0.767, 0.99]]
Bins are
array([0.1 , 0.3225, 0.545 , 0.7675, 0.99 ]))
Note that the left boundary appears as 0.099 in the categories but as 0.1 in the bins array.

I did the following; it feels like a hack, but it works for me for now.
As #root pointed out, the -0.001 is there because qcut uses left-open intervals: (0, 1] is anything greater than 0 and less than or equal to 1, excluding 0 itself, while (-0.001, 1] will include 0. We can control the precision like this:
out, bins = pd.qcut(range(10), 4, precision=4, retbins=True)
which gives (-0.0001, 2.25] as the first interval. I found that specifying precision is important: if you leave it at the default, the offset's magnitude can end up very small, get 'lost' in type conversions downstream, and cause problems. I.e. you don't want it ever to be something like -1e-7 (which can happen depending on your data).
Then you fix bins[0] to hold that adjusted left edge:
bins[0] = out.categories[0].left
So that now, if you run:
pd.cut(range(10), bins)
it will generate categorical values that match the original qcut, using only the bins. This solves my problem, but it still feels like a hack. I would love to hear a more robust solution.
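For reference, here is a minimal sketch of that round trip (the JSON file name and the variable names are illustrative assumptions, not part of the original code): fit the bins on the training data, patch the left edge, serialize, and re-apply with pd.cut at inference time.
import json
import numpy as np
import pandas as pd

train = range(10)

# Fit quantile bins on the training data and keep the adjusted left edge.
out, bins = pd.qcut(train, 4, precision=4, retbins=True)
bins[0] = out.categories[0].left  # patch 0.0 to the open left edge used by the categories

# Serialize the bin edges (illustrative storage format).
with open("bins.json", "w") as f:
    json.dump(bins.tolist(), f)

# At inference time, reload the edges and apply the same binning to new data.
with open("bins.json") as f:
    saved_bins = np.array(json.load(f))

new_data = [0, 1, 5, 9]
print(pd.cut(new_data, saved_bins))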


Pandas wrong rounding of decimals

I am calculating the duration of data acquisition from some sensors. Although the data is collected faster, I would like to sample it at 10 Hz. Anyway, I created a dataframe with a column called 'Time_diff', which I expect to go [0.0, 0.1, 0.2, 0.3, ...]. However, it goes something like [0.0, 0.1, 0.2, 0.30000004, ...]. I am rounding the dataframe but I still get these weird decimals. Are there any suggestions on how to fix it?
The code:
for i in range(self.n_of_trials):
    start = np.zeros(0)
    stop = np.zeros(0)
    for df in self.trials[i].df_list:
        start = np.append(stop, df['Time'].iloc[0])
        stop = np.append(start, df['Time'].iloc[-1])
    t_start = start.min()
    t_stop = stop.max()
    self.trials[i].duration = t_stop - t_start
    t = np.arange(0, self.trials[i].duration + self.trials[i].dt, self.trials[i].dt)
    self.trials[i].df_merged['Time_diff'] = t
    self.trials[i].df_merged.round(1)
when I print the data it looks like this:
0 0.0
1 0.1
2 0.2
3 0.3
4 0.4
...
732 73.2
733 73.3
734 73.4
735 73.5
736 73.6
Name: Time_diff, Length: 737, dtype: float64
However, when I open it as a CSV file, the long unrounded values are still there.
Addition
I think the problem is not the CSV conversion but how the float data is converted/rounded. Here is the next part of the code, where I merge more dataframes on the 10 Hz time stamps:
for j in range(len(self.trials[i].df_list)):
    df = self.trials[i].df_list[j]
    df.insert(0, 'Time_diff', round(df['Time'] - t_start, 1))
    df.round({'Time_diff': 1})
    df.drop_duplicates(subset=['Time_diff'], keep='first', inplace=True)
    self.trials[i].df_merged = pd.merge(self.trials[i].df_merged, df, how="outer", on="Time_diff",
                                        suffixes=(None, '_' + self.trials[i].df_list_names[j]))

# Test csv
self.trials[2].df_merged.to_csv(path_or_buf='merged.csv')
And since the inserted dataframes have the exactly correct decimals, the merge does not line up properly and creates additional rows with a new index.
This is not a rounding problem; it is behavior intrinsic to how floating point numbers work. In fact, 0.30000000000000004 is the result of 0.1+0.1+0.1 (try it yourself in a Python prompt).
In practice, not every decimal number is exactly representable as a floating point number, so what you get is the closest possible value.
You have some options, depending on whether you just want to improve the visualization or you need to work with exact values. If, for example, you want to use that column for a merge, you can use an approximate comparison instead of an exact one.
Another option is to use the decimal module (https://docs.python.org/3/library/decimal.html), which works with exact arithmetic but can be slower.
In your case you said the column should represent time sampled at 10 Hz, so changing the representation so that you directly use integers such as 10, 20, 30, ... will let you work with integers instead of floats.
If you want to see the "true" value of a floating point number in Python, you can use format(0.1*6, '.30f') and it will print the number with 30 digits (still an approximation, but much better than the default).
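A minimal sketch of that last suggestion (the frame and column names are assumptions for illustration, not the asker's actual data): build the merge key as an integer number of 0.1 s ticks so the join is exact.
import numpy as np
import pandas as pd

dt = 0.1  # 10 Hz sampling step

# Two frames whose "0.3" time stamps differ in the last binary digits,
# so merging directly on the float column would not line them up.
left = pd.DataFrame({'Time': np.arange(0, 0.5, dt), 'a': range(5)})
right = pd.DataFrame({'Time': [0.0, 0.3, 0.4], 'b': [10, 11, 12]})

# Convert the float times to exact integer tick counts and merge on those.
for df in (left, right):
    df['tick'] = np.rint(df['Time'] / dt).astype(int)

merged = pd.merge(left, right, how='outer', on='tick', suffixes=('', '_right'))
print(merged)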

Can anyone explain this code's output?

1. I have tried to understand this code but I couldn't. Would you help me?
a = np.arange(5)
hist, bin_edges = np.histogram(a, density=True)
hist
2. Why is the output like this?
array([0.5, 0. , 0.5, 0. , 0. , 0.5, 0. , 0.5, 0. , 0.5])
The default for the bins argument to np.histogram is 10. So the histogram counts which bins your array elements fall into. In this case a = np.array([0, 1, 2, 3, 4]). If we are creating a histogram with 10 bins, then we break the interval 0-4 (inclusive) into 10 equal bins. This gives us (note that 11 end points give us 10 bins):
np.linspace(0, 4, 11) = array([0. , 0.4, 0.8, 1.2, 1.6, 2. , 2.4, 2.8, 3.2, 3.6, 4. ])
We now just need to see which bins your elements in the array a fall into. We can count them as follows:
[1, 0, 1, 0, 0, 1, 0, 1, 0, 1]
Now this is still not exactly what the output is. The density=True argument states (from the docs): "If True, the result is the value of the probability density function at the bin, normalized such that the integral over the range is 1."
Each of the 5 occupied bins (of height 0.5) has a width of 0.4, so 5 × 0.5 × 0.4 = 1, as this argument requires. (Each count of 1 is normalized to 1 / (5 samples × 0.4 bin width) = 0.5.)
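A quick check of that normalization (an editorial addition, not part of the original answer):
import numpy as np

a = np.arange(5)
hist, bin_edges = np.histogram(a, density=True)

print(hist)                               # [0.5 0.  0.5 0.  0.  0.5 0.  0.5 0.  0.5]
print(bin_edges)                          # 11 edges from 0.0 to 4.0 in steps of 0.4
print((hist * np.diff(bin_edges)).sum())  # ~1.0: the density integrates to 1 (up to floating point)

# The raw counts, for comparison:
counts, _ = np.histogram(a, density=False)
print(counts)                             # [1 0 1 0 0 1 0 1 0 1]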
numpy.arange(5) generates a numpy array of 5 elements evenly spaced: array([0,1,2,3,4]).
np.histogram(a, density=True) returns the values of a histogram computed from your array a using 10 bins (the default), together with the bin edges.
bin_edges gives the edges of the bins, while hist gives the number of occurrences in each bin. Given that you set density=True, the occurrences are normalized so that the integral over the range is 1.
Look here for more information.
Please check this post. Hint: when you call np.histogram, the default bins value is 10, which is why your output has 10 elements.

NaN in DataFrame: when the first observation of a time series is NaN, fill it with the first available value, otherwise carry over the last / previous observation

I am performing an ADF test from statsmodels. The value series can have missing observations. In fact, I drop the analysis if the fraction of NaNs is larger than c. However, if the series makes it through that filter, I get the problem that adfuller cannot deal with missing data. Since this is training data with a minimum frame size, I would like to do:
1) if x(t=0) = NaN, then find the next non-NaN value (t>0)
2) otherwise if x(t) = NaN, then x(t) = x(t-1)
So I am compromising my first value here, but making sure the input data always has the same dimension. Alternatively, if the first value is missing I could fill it with 0, making use of the limit option of fillna.
From the documentation the different options are not 100% clear to me:
method : {‘backfill’, ‘bfill’, ‘pad’, ‘ffill’, None}, default None
    Method to use for filling holes in reindexed Series.
    pad / ffill: propagate last valid observation forward to next valid.
    backfill / bfill: use NEXT valid observation to fill gap.
pad / ffill: does that mean I carry over the previous value?
backfill / bfill: does that mean the value is taken from a valid one in the future?
df.fillna(method='bfill', limit=1, inplace=True)
df.fillna(method='ffill', inplace=True)
Would that work with limit? The documentation's example uses limit=1, but there a specific value to fill is predetermined.
1) if x(t=0) = NaN, then find the next non-NaN value (t>0) 2) otherwise if x(t) = NaN, then x(t) = x(t-1)
To front-fill all observations except for (possibly) the first ones, which should be backfilled, you can chain two calls to fillna, the first with method='ffill' and the second with method='bfill':
df = pd.DataFrame({'a': [None, None, 1, None, 2, None]})
>>> df.fillna(method='ffill').fillna(method='bfill')
a
0 1.0
1 1.0
2 1.0
3 1.0
4 2.0
5 2.0
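As a side note (not part of the original answer), pandas also exposes the same logic through the ffill/bfill convenience methods, so the chain can equivalently be written as:
import pandas as pd

df = pd.DataFrame({'a': [None, None, 1, None, 2, None]})

# Forward-fill first, then back-fill whatever leading NaNs remain.
filled = df.ffill().bfill()
print(filled)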

numpy.fft.fft not computing dft at frequencies given by numpy.fft.fftfreq?

This is a mathematical question, but it is tied to the numpy implementation, so I decided to ask it at SO. Perhaps I'm hugely misunderstanding something, but if so I would like to be put straight.
numpy.fft.fft computes the DFT according to the equation
A_k = Σ_{m=0}^{n-1} a_m exp(-2πi m k / n),   k = 0, …, n-1.
numpy.fft.fftfreq is supposed to return the frequencies at which the DFT was computed.
Say we have:
x = [0, 0, 1, 0, 0]
X = np.fft.fft(x)
freq = np.fft.fftfreq(5)
Then for signal x, its DFT transformation is X, and frequencies at which X is computed are given by freq. For example X[0] is DFT of x at frequency freq[0], X[1] is DFT of x at frequency freq[1], and so on.
But when I compute the DFT of a simple signal by hand with the formula quoted above, my results indicate that X[1] is the DFT of x at frequency 1, not at freq[1], X[2] is the DFT of x at frequency 2, not at freq[2], and so on.
As an example:
In [32]: x
Out[32]: [0, 0, 1, 0, 0]
In [33]: X
Out[33]:
array([ 1.00000000+0.j        , -0.80901699-0.58778525j,
        0.30901699+0.95105652j,  0.30901699-0.95105652j,
       -0.80901699+0.58778525j])
In [34]: freq
Out[34]: array([ 0. , 0.2, 0.4, -0.4, -0.2])
If I compute DFT of above signal for k = 0.2 (or freq[1]), I get
X at freq = 0.2: 0.876 - 0.482j, which isn't X[1].
If however I compute for k = 1 I get the same results as are in X[1] or -0.809 - 0.588j.
So what am I misunderstanding? If numpy.fft.fft(x)[n] is a DFT of x at frequency n, not at frequency numpy.fft.fftfreq(len(x))[n], what is the purpose of numpy.fft.fftfreq?
I think that is because the values in the array returned by numpy.fft.fftfreq are equal to (k/n) times the sampling frequency.
The frequencies of the DFT result are equal to k/n divided by the time spacing, because after the FFT a period in the time domain maps to its inverse in the frequency domain. You can think of the digital signal as a periodic sampling function convolved with the analog signal. Convolution in the time domain means multiplication in the frequency domain, so the time spacing of the input data affects the frequency spacing of the DFT result: the frequency spacing becomes the original value divided by the time spacing. Originally, the frequency spacing of the DFT result is 1/n when the time spacing is 1. So after the DFT, the frequency spacing becomes 1/n divided by the time spacing, which equals 1/n multiplied by the sampling frequency.
To calculate that, numpy.fft.fftfreq takes two arguments: the length of the input and the time spacing, which is the inverse of the sampling rate. The length of the input is n, and the time spacing is the value that k/n is divided by (the default is 1).
I tried k = 2, and the result is equal to X[2] in your example. In this situation, k/n * 1 equals freq[2].
The DFT is a dimensionless basis transform or matrix multiplication. The output or result of a DFT has nothing to do with frequencies unless you know the sampling rate represented by the input vector (samples per second, per meter, per radian, etc.)
You can compute a Goertzel filter of the same length N with k=0.2, but that result isn't contained in a DFT or FFT result of length N. A DFT only contains complex Goertzel filter results for integer k values. And to get from k to the frequency represented by X[k], you need to know the sample rate.
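To illustrate (an editorial sketch; dft_at is a hypothetical helper, not a NumPy function): evaluating the DFT sum at a non-integer bin such as k = 0.2 gives a value that is not among the FFT outputs, while integer k reproduces them exactly.
import numpy as np

x = np.array([0, 0, 1, 0, 0])
N = len(x)
m = np.arange(N)

def dft_at(k):
    # Evaluate the DFT sum at an arbitrary (possibly non-integer) bin k.
    return np.sum(x * np.exp(-2j * np.pi * m * k / N))

print(dft_at(0.2))       # ~0.876 - 0.482j, not present in np.fft.fft(x)
print(dft_at(1))         # ~-0.809 - 0.588j ...
print(np.fft.fft(x)[1])  # ... which equals the FFT output at index 1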
Yours is not a SO question
You wrote
If I compute DFT of above signal for k = 0.2...
and I reply "You shouldn't"... the DFT can be meaningfully computed only for integer values of k.
The relationship between an index k and a frequency is given by f_k = k Δf or, if you prefer circular frequencies, ω_k = k Δω where Δf = 1/T and Δω = 2πΔf, T being the period of the signal.
The arguments of fftfreq are a bit misleading... the required one is the number of samples n and the optional argument is the sampling interval, by default d=1.0, but at any rate T=n*d and Δf = 1/(n*d)
>>> fftfreq(5) # d=1
array([ 0. , 0.2, 0.4, -0.4, -0.2])
>>> fftfreq(5,2)
array([ 0. , 0.1, 0.2, -0.2, -0.1])
>>> fftfreq(5,10)
array([ 0. , 0.02, 0.04, -0.04, -0.02])
and the different T are 5, 10, 50, and the respective Δf are 0.2, 0.1, 0.02, as (I) expected.
Why doesn't fftfreq simply require the signal's period? Because it is mainly intended as a helper in untangling the Nyquist frequency issue.
As you know, the DFT is periodic: for a signal x of length N you have that
DFT(x,k) is equal to DFT(x,k+mN) where m is an integer.
This implies that there are only N/2 positive and N/2 negative distinct frequencies and that, when N/2 < k < N, the frequency most meaningfully associated with k is not k Δf but (k-N) Δf.
To do this, fftfreq needs more information than the period T, hence the choice of requiring n and computing Δf from an assumption on the sampling interval.
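As a sanity check (an editorial addition, using an assumed sampling interval d), fftfreq can be reproduced by hand: it maps the bin index k to the physical frequency k/(n*d), wrapping the upper half of the indices to negative frequencies.
import numpy as np

n, d = 5, 2.0  # number of samples and an assumed sampling interval

freq = np.fft.fftfreq(n, d)

# Reconstruct the same values by hand: k/(n*d), with the upper indices
# wrapped to (k-n)/(n*d) to represent the negative frequencies.
k = np.arange(n)
k_wrapped = np.where(k < (n + 1) // 2, k, k - n)
manual = k_wrapped / (n * d)

print(freq)                        # [ 0.   0.1  0.2 -0.2 -0.1]
print(np.allclose(freq, manual))   # True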

Find subgroups of a numpy array

I have a numpy array like this one:
A = ([249, 250, 3016, 3017, 5679, 5680, 8257, 8258,
10756, 10757, 13178, 13179, 15531, 15532, 17824, 17825,
20058, 20059, 22239, 22240, 24373, 24374, 26455, 26456,
28491, 28492, 30493, 30494, 32452, 32453, 34377, 34378,
36264, 36265, 38118, 38119, 39939, 39940, 41736, 41737,
43501, 43502, 45237, 45238, 46950, 46951, 48637, 48638])
I would like to write a small script that finds subgroups of values in the array for which the difference is smaller than a certain threshold, let's say 3, and that returns the highest value of each subgroup. In the case of array A the output should be:
A_out =([250,3017,5680,8258,10757,13179,...])
Is there a numpy function for that?
Here's a vectorized Numpy approach.
First, the data (in a numpy array) and the threshold:
In [41]: A = np.array([249, 250, 3016, 3017, 5679, 5680, 8257, 8258,
10756, 10757, 13178, 13179, 15531, 15532, 17824, 17825,
20058, 20059, 22239, 22240, 24373, 24374, 26455, 26456,
28491, 28492, 30493, 30494, 32452, 32453, 34377, 34378,
36264, 36265, 38118, 38119, 39939, 39940, 41736, 41737,
43501, 43502, 45237, 45238, 46950, 46951, 48637, 48638])
In [42]: threshold = 3
The following produces the array delta. It is almost the same as delta = np.diff(A), but I want to include one more value that is greater than the threshold at the end of delta.
In [43]: delta = np.hstack((np.diff(A), threshold + 1))
Now the group maxima are simply A[delta > threshold]:
In [46]: A[delta > threshold]
Out[46]:
array([ 250, 3017, 5680, 8258, 10757, 13179, 15532, 17825, 20059,
22240, 24374, 26456, 28492, 30494, 32453, 34378, 36265, 38119,
39940, 41737, 43502, 45238, 46951, 48638])
Or, if you want, A[delta >= threshold]. That gives the same result for this example:
In [47]: A[delta >= threshold]
Out[47]:
array([ 250, 3017, 5680, 8258, 10757, 13179, 15532, 17825, 20059,
22240, 24374, 26456, 28492, 30494, 32453, 34378, 36265, 38119,
39940, 41737, 43502, 45238, 46951, 48638])
There is a case where this answer differs from #DrV's answer. From your description, it isn't clear to me how a set of values such as 1, 2, 3, 4, 5, 6 should be handled. The consecutive differences are all 1, but the difference between the first and last is 5. The numpy calculation above will treat these as a single group. #DrV's answer will create two groups.
Interpretation 1: The value of an item in a group must not differ more than 3 units from that of the first item of the group
This is one of the things where NumPy's capabilities are at their limits. As you will have to iterate through the list, I suggest a pure Python approach:
first_of_group = A[0]
previous = A[0]
group_lasts = []
for a in A[1:]:
    # if this item no longer belongs to the group
    if abs(a - first_of_group) > 3:
        group_lasts.append(previous)
        first_of_group = a
    previous = a
# add the last item separately, because it is always a last of the group
group_lasts.append(a)
Now you have the group lasts in group_lasts.
Using any NumPy array functionality here does not seem to provide much help.
Interpretation 2: The value of an item in a group must not differ more than 3 units from the previous item
This is easier, as we can easily form a list of group breaks as in Warren Weckesser's answer; here NumPy is a lot of help.
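For completeness, a minimal sketch of interpretation 2 (an editorial addition, shown on a shortened copy of A): break the array wherever the step to the next element exceeds the threshold, then take the maximum of each piece.
import numpy as np

A = np.array([249, 250, 3016, 3017, 5679, 5680, 8257, 8258])
threshold = 3

# Indices where a new group starts: the jump from the previous element is too large.
breaks = np.where(np.diff(A) > threshold)[0] + 1

# Split into subgroups and take the maximum of each.
groups = np.split(A, breaks)
A_out = np.array([g.max() for g in groups])
print(A_out)   # [ 250 3017 5680 8258]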