Pandas wrong round decimation - pandas

I am calculating the duration of data acquisition from some sensors. Although the data is collected faster, I would like to sample it at 10 Hz. Anyway, I created a dataframe with a column called 'Time_diff' which I expect to go [0.0, 0.1, 0.2, 0.3 ...]. However, it goes something like [0.0, 0.1, 0.2, 0.30000004 ...]. I am rounding the data frame but I still get these weird decimals. Are there any suggestions on how to fix it?
The code:
for i in range(self.n_of_trials):
    start = np.zeros(0)
    stop = np.zeros(0)
    for df in self.trials[i].df_list:
        start = np.append(start, df['Time'].iloc[0])
        stop = np.append(stop, df['Time'].iloc[-1])
    t_start = start.min()
    t_stop = stop.max()
    self.trials[i].duration = t_stop - t_start
    t = np.arange(0, self.trials[i].duration + self.trials[i].dt, self.trials[i].dt)
    self.trials[i].df_merged['Time_diff'] = t
    self.trials[i].df_merged.round(1)
When I print the data it looks like this:
0 0.0
1 0.1
2 0.2
3 0.3
4 0.4
...
732 73.2
733 73.3
734 73.4
735 73.5
736 73.6
Name: Time_diff, Length: 737, dtype: float64
However, when I open it as a CSV file it looks like this:
Addition
I think the problem is not the CSV conversion but how the float data is converted/rounded. Here is the next part of the code, where I merge more dataframes on the 10 Hz time stamps:
for j in range(len(self.trials[i].df_list)):
    df = self.trials[i].df_list[j]
    df.insert(0, 'Time_diff', round(df['Time'] - t_start, 1))
    df.round({'Time_diff': 1})
    df.drop_duplicates(subset=['Time_diff'], keep='first', inplace=True)
    self.trials[i].df_merged = pd.merge(self.trials[i].df_merged, df, how="outer", on="Time_diff",
                                        suffixes=(None, '_' + self.trials[i].df_list_names[j]))

# Test csv
self.trials[2].df_merged.to_csv(path_or_buf='merged.csv')
And since the inserted dataframes have the exact correct decimals, the rows are not merged properly and another row with a new index is created.

This is not a rounding problem; it is behavior intrinsic to how floating point numbers work. In fact, 0.30000000000000004 is the result of 0.1+0.1+0.1 (try it out yourself at a Python prompt).
In practice, not every decimal number is exactly representable as a floating point number, so what you get instead is the closest possible value.
You have some options depending on whether you just want to improve the visualization or you need to work with exact values. If, for example, you want to use that column for a merge, you can use an approximate comparison instead of an exact one.
Another option is to use the decimal module (https://docs.python.org/3/library/decimal.html), which works with exact arithmetic but can be slower.
In your case you said the column should represent time stamps at a 10 Hz rate, so I think changing the representation so that you directly use 10, 20, 30, ... will allow you to work with integers instead of floats.
If you want to see the "true" value of a floating point number in Python, you can use format(0.1*6, '.30f') and it will print the number with 30 digits (still an approximation, but much closer than the default).
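A rough sketch of those options (the sizes and column names below are illustrative, not taken from the question's code):

import numpy as np
import pandas as pd

# Repeated addition of 0.1 accumulates representation error
print(0.1 + 0.1 + 0.1)           # 0.30000000000000004
print(format(0.1 * 6, '.30f'))   # the stored approximation, shown with 30 digits

# Option 1: build the time axis from integers and divide once, so every entry
# is the float closest to the intended decimal (no accumulated error)
t = np.arange(737) / 10.0        # 0.0, 0.1, ..., 73.6

# Option 2: keep an integer key (e.g. tenths of a second) and merge on that,
# which sidesteps float comparison entirely
left = pd.DataFrame({'tenths': np.arange(5)})
right = pd.DataFrame({'tenths': [0, 2, 4], 'value': [1.0, 2.0, 3.0]})
merged = pd.merge(left, right, how='outer', on='tenths')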

Related

Statistics and Pandas: What normalization means in value_counts() in Pandas

The question is not about coding but about understanding what normalize means in terms of statistics and correlation of the data.
This is an example of what I am doing.
Without normalization:
plt.subplot(111)
plt.plot(df['alcoholism'].value_counts(), marker='o')
plt.plot(df.query('no_show =="Yes"')['alcoholism'].value_counts(), color='black')
plt.show();
With normalization:
plt.subplot(111)
plt.plot(df['alcoholism'].value_counts(normalize=True), marker='o')
plt.plot(df.query('no_show =="Yes"')['alcoholism'].value_counts(normalize=True), color='black')
plt.show();
Which one better correlates the values, with or without normalization? Or is it a completely wrong idea?
I am new to data and pandas, so excuse my bad code, chaining, commenting, style :)
As you can see, when you normalize (second plot) the sum of both points is equal to 1 for each line that is plotted. Normalizing gives you the rate of occurrences of each value instead of the number of occurrences.
Here's what the doc says:
normalize : bool, default False
    Return proportions rather than frequencies.
value_counts() probably returns something like:
0 110000
1 1000
dtype: int64
and value_counts(normalize=True) probably returns something like:
0 0.990991
1 0.009009
dtype: float64
In other words, the relation between the normalized and non-normalized counts can be checked as:
>>> counts = df['alcoholism'].value_counts()
>>> rate = df['alcoholism'].value_counts(normalize=True)
>>> np.allclose(rate, counts / counts.sum())
True
where np.allclose allows properly comparing two series of floating point numbers.
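A tiny self-contained demo of the two forms (the data here is made up, not from the question's dataset):

import pandas as pd

s = pd.Series([0, 0, 0, 0, 1])           # hypothetical 'alcoholism' flags
print(s.value_counts())                   # counts:       0 -> 4,   1 -> 1
print(s.value_counts(normalize=True))     # proportions:  0 -> 0.8, 1 -> 0.2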

Julia DataFrames equivalent of pandas pct_change()

Currently, I have written the below function for percent change calculation:
function pct_change(input::AbstractVector{<:Number})::AbstractVector{Number}
    result = [NaN]
    for i in 2:length(input)
        push!(result, (input[i] - input[i-1]) / abs(input[i-1]))
    end
    return result
end
This works as expected, but I wanted to know whether there is a built-in function in Julia DataFrames similar to pandas pct_change that I can use directly. Or is there any other better way or improvement I can make to my function above?
This is a very specific function and is not provided in DataFrames.jl, but rather in TimeSeries.jl. Here is an example:
julia> using TimeSeries, Dates
julia> ta = TimeArray(Date(2018, 1, 1):Day(1):Date(2018, 12, 31), 1:365);
julia> percentchange(ta);
(there are some more options as to what should be calculated)
The drawback is that it accepts only TimeArray objects and that it drops periods for which the percent change cannot be calculated (whereas they are retained in Python).
If you want to keep your custom definition, consider denoting the first value as missing rather than NaN. Also, your function will not produce the most accurate representation of the numbers (e.g. if you wanted to use BigFloat or exact calculations with the Rational type, the values would be converted to Float64). Here are example alternative implementations that avoid these problems:
function pct_change(input::AbstractVector{<:Number})
    res = @view(input[2:end]) ./ @view(input[1:end-1]) .- 1
    [missing; res]
end
or
function pct_change(input::AbstractVector{<:Number})
    [i == 1 ? missing : (input[i] - input[i-1]) / input[i-1] for i in eachindex(input)]
end
And now you have in both cases:
julia> pct_change(1:10)
10-element Array{Union{Missing, Float64},1}:
missing
1.0
0.5
0.33333333333333326
0.25
0.19999999999999996
0.16666666666666674
0.1428571428571428
0.125
0.11111111111111116
julia> pct_change(big(1):10)
10-element Array{Union{Missing, BigFloat},1}:
missing
1.0
0.50
0.3333333333333333333333333333333333333333333333333333333333333333333333333333391
0.25
0.2000000000000000000000000000000000000000000000000000000000000000000000000000069
0.1666666666666666666666666666666666666666666666666666666666666666666666666666609
0.1428571428571428571428571428571428571428571428571428571428571428571428571428547
0.125
0.111111111111111111111111111111111111111111111111111111111111111111111111111113
julia> pct_change(1//1:10)
10-element Array{Union{Missing, Rational{Int64}},1}:
missing
1//1
1//2
1//3
1//4
1//5
1//6
1//7
1//8
1//9
with proper values returned.
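For reference, a minimal pandas sketch of the behavior being replicated, where the first period is kept as NaN rather than dropped:

import pandas as pd

s = pd.Series(range(1, 11))
print(s.pct_change())   # NaN, 1.0, 0.5, 0.333..., matching the output above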

NaN in data frame: when first observation of time series is NaN, frontfill with first available, otherwise carry over last / previous observation

I am performing an ADF test from statsmodels. The value series can have missing observations. In fact, I am dropping the analysis if the fraction of NaNs is larger than c. However, if the series makes it through, I get the problem that adfuller cannot deal with missing data. Since this is training data with a minimum frame size, I would like to do:
1) if x(t=0) = NaN, then find the next non-NaN value (t>0)
2) otherwise if x(t) = NaN, then x(t) = x(t-1)
So I am compromising my first value here, but making sure the input data always has the same dimension. Alternatively, if the first value is missing I could fill it with 0, making use of the limit option of fillna.
From the documentation the different options are not 100% clear to me:
method : {'backfill', 'bfill', 'pad', 'ffill', None}, default None
    Method to use for filling holes in reindexed Series.
    pad / ffill: propagate last valid observation forward to next valid.
    backfill / bfill: use NEXT valid observation to fill gap.
pad / ffill: does that mean I carry over the previous value?
backfill / bfill: does that mean the value is taken from a valid one in the future?
df.fillna(method='bfill', limit=1, inplace=True)
df.fillna(method='ffill', inplace=True)
Would that work with limit? The documentation uses limit=1 but has a predetermined value to be filled.
1) if x(t=0) = NaN, then find the next non-NaN value (t>0) 2) otherwise if x(t) = NaN, then x(t) = x(t-1)
To front-fill all observations except for (possibly) the first ones, which should be backfilled, you can chain two calls to fillna, the first with method='ffill' and the second with method='bfill':
>>> df = pd.DataFrame({'a': [None, None, 1, None, 2, None]})
>>> df.fillna(method='ffill').fillna(method='bfill')
a
0 1.0
1 1.0
2 1.0
3 1.0
4 2.0
5 2.0
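In more recent pandas versions the same chain can be written with the shorthand methods; a small sketch, equivalent to the fillna calls above:

import pandas as pd

df = pd.DataFrame({'a': [None, None, 1, None, 2, None]})

# Forward-fill everything, then backfill only the leading NaNs that ffill left behind
filled = df.ffill().bfill()
print(filled)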

Group DataFrame by binning a column::Float64, in Julia

Say I have a DataFrame with a column of Float64s; I'd like to group the dataframe by binning that column. I hear the cut function might help, but it's not defined over dataframes. Some work has been done (https://gist.github.com/tautologico/3925372), but I'd rather use a library function than copy-paste code from the Internet. Pointers?
EDIT Bonus karma for finding a way of doing this by month over UNIX timestamps :)
You could bin dataframes based on a column of Float64s like this. Here my bins are increments of 0.1 from 0.0 to 1.0, binning the dataframe based on a column of 100 random numbers between 0.0 and 1.0.
using DataFrames  # load DataFrames

# Make a DataFrame with some random Float64 numbers
df = DataFrame(index = rand(Float64, 100))

# Map an anonymous function that selects every row whose value falls between the two
# numbers of a tuple x, over an array of tuples generated with zip
df_array = map(x -> df[(df[:index] .>= x[1]) .& (df[:index] .< x[2]), :], zip(0.0:0.1:0.9, 0.1:0.1:1.0))
This will produce an array of 10 dataframes, each one with a different 0.1-sized bin.
As for the UNIX timestamp question, I'm not as familiar with that side of things, but after playing around a bit maybe something like this could work:
using Dates

# Make a dataframe with floats containing pretend unix time stamps
df = DataFrame(unixtime = rand(1E9:1:1.1E9, 100))

# Convert those timestamps to DateTime values
df[:date] = Dates.unix2datetime.(df[:unixtime])

# Make a "year month" string for every row in your time range
df[:year_month] = map(date -> string(Dates.Year(date)) * " " * string(Dates.Month(date)), df[:date])

# Bin based on each unique year_month string
df_array = map(ym -> df[df[:year_month] .== ym, :], unique(df[:year_month]))

Pandas: Group by rounded floating number

I have a dataframe with a column of floating numbers. For example:
df = pd.DataFrame({'A' : np.random.randn(100), 'B': np.random.randn(100)})
What I want to do is to group by column A after rounding column A to 2 decimal places.
The way I do it is highly inefficient:
df.groupby(df.A.map(lambda x: "%.2f" % x))
I particularly don't want to convert everything to a string, as speed becomes a huge problem. But I don't feel it is safe to do the following:
df.groupby(np.around(df.A, 2))
I am not sure, but I feel that there might be cases where two float64 numbers have the same string representation after rounding to 2 decimal places, but slightly different values after np.around to 2 decimal places. For example, is it possible that a value whose string representation rounds to 1.52 is represented by np.around(., 2) as 1.52000001 sometimes but 1.51999999 other times?
My question is: what is a better and more efficient way?
I think you do not need to convert the floats to strings.
import pandas as pd
from random import random

df = pd.DataFrame({'A': [random() for _ in range(100000)],
                   'B': [random() for _ in range(100000)]})
df.groupby(df['A'].apply(lambda x: round(x, 1))).count()
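For large frames, the vectorized Series.round is usually faster than a Python-level apply and also matches the question's two-decimal grouping; a minimal sketch assuming standard pandas/NumPy:

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': np.random.randn(100), 'B': np.random.randn(100)})

# Group on the rounded key directly, without converting to strings
grouped = df.groupby(df['A'].round(2)).count()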