Julia DataFrames equivalent of pandas pct_change() - dataframe

Currently, I have written the below function for percent change calculation:
function pct_change(input::AbstractVector{<:Number})::AbstractVector{Number}
    result = [NaN]
    for i in 2:length(input)
        push!(result, (input[i] - input[i-1]) / abs(input[i-1]))
    end
    return result
end
This works as expected, but I wanted to know whether there is a built-in function in Julia DataFrames, similar to pandas pct_change, that I can use directly. Or is there any other better way, or improvements I can make to my function above?

This is a very specific function and is not provided by DataFrames.jl, but rather by TimeSeries.jl. Here is an example:
julia> using TimeSeries, Dates
julia> ta = TimeArray(Date(2018, 1, 1):Day(1):Date(2018, 12, 31), 1:365);
julia> percentchange(ta);
(there are some more options to what should be calculated)
The drawback is that it accepts only TimeArray objects and that it drops periods for which the percent change cannot be calculated (whereas pandas retains them as NaN).
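For comparison, here is a minimal pandas sketch (with made-up data) showing that pct_change keeps the undefined first period as NaN instead of dropping it:

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 3.0, 4.0])
# pandas retains the first period as NaN rather than dropping it,
# so the result has the same length as the input
r = s.pct_change()
print(r)  # NaN, 1.0, 0.5, 0.0, 0.333...
```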
If you want your custom definition, consider denoting the first value as missing rather than NaN. Also, your function will not produce the most accurate representation of the numbers: because result starts as [NaN], inputs such as BigFloat or exact Rational values will be converted to Float64. Here are example alternative implementations that avoid these problems:
function pct_change(input::AbstractVector{<:Number})
    res = @view(input[2:end]) ./ @view(input[1:end-1]) .- 1
    [missing; res]
end
or
function pct_change(input::AbstractVector{<:Number})
    [i == 1 ? missing : (input[i] - input[i-1]) / input[i-1] for i in eachindex(input)]
end
And now you have in both cases:
julia> pct_change(1:10)
10-element Array{Union{Missing, Float64},1}:
missing
1.0
0.5
0.33333333333333326
0.25
0.19999999999999996
0.16666666666666674
0.1428571428571428
0.125
0.11111111111111116
julia> pct_change(big(1):10)
10-element Array{Union{Missing, BigFloat},1}:
missing
1.0
0.50
0.3333333333333333333333333333333333333333333333333333333333333333333333333333391
0.25
0.2000000000000000000000000000000000000000000000000000000000000000000000000000069
0.1666666666666666666666666666666666666666666666666666666666666666666666666666609
0.1428571428571428571428571428571428571428571428571428571428571428571428571428547
0.125
0.111111111111111111111111111111111111111111111111111111111111111111111111111113
julia> pct_change(1//1:10)
10-element Array{Union{Missing, Rational{Int64}},1}:
missing
1//1
1//2
1//3
1//4
1//5
1//6
1//7
1//8
1//9
with proper values returned.

Related

Pandas wrong round decimation

I am calculating the duration of data acquisition from some sensors. Although the data is collected faster, I would like to sample it at 10 Hz. Anyway, I created a dataframe with a column called 'Time_diff', which I expect to go [0.0, 0.1, 0.2, 0.3, ...]. However, it goes something like [0.0, 0.1, 0.2, 0.30000004, ...]. I am rounding the dataframe, but I still get these odd decimals. Any suggestions on how to fix it?
The code:
for i in range(self.n_of_trials):
    start = np.zeros(0)
    stop = np.zeros(0)
    for df in self.trials[i].df_list:
        start = np.append(stop, df['Time'].iloc[0])
        stop = np.append(start, df['Time'].iloc[-1])
    t_start = start.min()
    t_stop = stop.max()
    self.trials[i].duration = t_stop - t_start
    t = np.arange(0, self.trials[i].duration + self.trials[i].dt, self.trials[i].dt)
    self.trials[i].df_merged['Time_diff'] = t
    self.trials[i].df_merged.round(1)
when I print the data it looks like this:
0 0.0
1 0.1
2 0.2
3 0.3
4 0.4
...
732 73.2
733 73.3
734 73.4
735 73.5
736 73.6
Name: Time_diff, Length: 737, dtype: float64
However, when I open it as a csv file, it looks like this:
Addition
I think the problem is not the csv conversion but how the float data is converted/rounded. Here is the next part of the code, where I merge more dataframes on the 10 Hz time stamps:
for j in range(len(self.trials[i].df_list)):
    df = self.trials[i].df_list[j]
    df.insert(0, 'Time_diff', round(df['Time'] - t_start, 1))
    df.round({'Time_diff': 1})
    df.drop_duplicates(subset=['Time_diff'], keep='first', inplace=True)
    self.trials[i].df_merged = pd.merge(self.trials[i].df_merged, df, how="outer", on="Time_diff",
                                        suffixes=(None, '_' + self.trials[i].df_list_names[j]))

# Test csv
self.trials[2].df_merged.to_csv(path_or_buf='merged.csv')
And since the inserted dataframes have the exact, correct decimals, the rows do not merge properly and each creates another instance with a new index.
This is not a rounding problem; it is behavior intrinsic to how floating point numbers work. Actually, 0.30000000000000004 is the result of 0.1 + 0.1 + 0.1 (try it out yourself in a Python prompt).
In practice, not every decimal number is exactly representable as a floating point number, so what you get instead is the closest representable value.
You have some options, depending on whether you just want to improve the visualization or you need to work on exact values. If, for example, you want to use that column for a merge, you can use an approximate comparison instead of an exact one.
Another option is to use the decimal module (https://docs.python.org/3/library/decimal.html), which works with exact decimal arithmetic but can be slower.
In your case, you said the column represents time stamps at 10 Hz steps, so changing the representation so that you directly use integers 10, 20, 30, ... will let you use exact integer arithmetic instead of floats.
If you want to see the "true" value of a floating point number in Python, you can use format(0.1*6, '.30f'), which prints the number with 30 digits (still an approximation, but much closer than the default).
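To make these options concrete, here is a small standalone sketch (not the asker's code) contrasting plain floats, the decimal module, and an integer-tick representation:

```python
from decimal import Decimal

# Plain floats accumulate representation error:
print(0.1 + 0.1 + 0.1 == 0.3)           # False
print(format(0.1 * 3, '.30f'))           # shows the stored approximation

# The decimal module works with exact decimal arithmetic (but is slower):
print(Decimal('0.1') * 3 == Decimal('0.3'))  # True

# Integer "ticks" (e.g. tenths of a second at 10 Hz) stay exact:
ticks = [i * 100 for i in range(5)]      # 0, 100, 200, 300, 400 ms
seconds = [t / 1000 for t in ticks]      # convert to seconds only for display
```

Merging on the integer tick column instead of the float seconds avoids the mismatch entirely.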

Convert a float column with nan to int pandas

I am trying to convert a float pandas column with nans to int format, using apply.
I would like to use something like this:
df.col = df.col.apply(to_integer)
where the function to_integer is given by
def to_integer(x):
    if np.isnan(x):
        return np.NaN
    else:
        return int(x)
However, when I attempt to apply it, the column remains the same.
How could I achieve this without having to use the standard technique of dtypes?
You can't have NaN in an int column: NaN is itself a float, so a column containing it is promoted to float (unless you use object dtype, which is not a good idea since you'll lose many vectorized abilities). This is also why your apply leaves the column unchanged: to_integer returns np.NaN for missing values, which forces the result back to float.
You can, however, use the nullable integer type, whose missing value is the NA marker.
Conversion can be done with convert_dtypes:
df = pd.DataFrame({'col': [1, 2, None]})
df = df.convert_dtypes()
# type(df.at[0, 'col'])
# numpy.int64
# type(df.at[2, 'col'])
# pandas._libs.missing.NAType
output:
col
0 1
1 2
2 <NA>
Not sure how you would achieve this without using dtypes. Sometimes when loading in data, you may have a column that contains mixed dtypes; loading in a column with one dtype and attempting to turn it into mixed dtypes is not possible, though (at least, not that I know of).
So I will echo what @mozway said and suggest you use a nullable integer data type, e.g.
df['col'] = df['col'].astype('Int64')
(note the capital I)
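A self-contained sketch of that conversion (with made-up data), showing that NaN becomes the pd.NA marker while the other values become integers:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, np.nan])
out = s.astype('Int64')   # capital I: pandas nullable integer dtype
print(out.dtype)          # Int64
print(out.iloc[2] is pd.NA)  # True
```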

sparkSQL filter function not working with NaN

Good morning,
I have the following variables.
self.filters = 'px_variation > 0.15'
df
If I do df.collect() I get:
Row(px_variation=nan, subject_code='1010', list_tr_id=['X0', 'X1'], list_quantity=[3000.0, 1.0], list_cash_qty=[16500.0, 5.5])
I try to apply the following filter:
df.filter(self.filters)
And its result is:
Row(px_variation=nan, subject_code='1010', list_tr_id=['X0', 'X1'], list_quantity=[3000.0, 1.0], list_cash_qty=[16500.0, 5.5])
As you can see, px_variation in my DF is a numpy.nan, but after applying the filter function the row is not filtered out.
Why isn't Spark SQL ignoring NaN, or using it to filter?
If I do the same operation in plain Python, the result is as expected:
df.collect()[0].px_variation > 0.15 -> Result: False
Any idea? Thank you.
By Spark's NaN semantics, the special value NaN is treated as larger than any other numeric value, even "larger" than infinity.
One option is to change the filter to
filters = 'px_variation > 0.15 and not isnan(px_variation)'
Another option to handle the NaN values is to replace them with None/null:
df.replace(float('nan'), None).filter('px_variation > 0.15')
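The contrast with plain Python can be seen without Spark: in Python, every ordered comparison with NaN is False, so the row would not pass a `>` filter there, whereas Spark SQL orders NaN above all other numbers. A minimal sketch:

```python
nan = float('nan')

# In plain Python, all ordered comparisons with NaN evaluate to False:
print(nan > 0.15)   # False  (Spark SQL, by contrast, treats NaN as largest)
print(nan < 0.15)   # False
print(nan == nan)   # False
```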

How to perform calculations on pandas groupby dataframes

I have a dataframe which for the sake of showing a minimal example, I simplified to this:
df = pd.DataFrame({'pnl':[+10,+23,+15,-5,+20],'style':['obs','obs','obs','bf','bf']})
I would like to do the following:
Group the dataframe by style
Count the positive entries of pnl and divide by the total of entries of that same style.
For example, style 'bf' has 2 entries, one positive and one negative, so 1/2 (total) = 0.5.
This should yield the following result:
style win_rate
bf 0.5
obs 1.0
dtype: float64
I thought of having a list of the groups, iterate over them and build a new df... But it seems to me like an antipattern. I am pretty sure there is an easier / more pythonic solution.
Thanks.
You can group the boolean series df['pnl'].gt(0) by df['style'] and calculate the mean:
In [14]: df['pnl'].gt(0).groupby(df['style']).mean()
Out[14]:
style
bf 0.5
obs 1.0
Name: pnl, dtype: float64
You can try pd.crosstab, which uses groupby in the background, to get percentage of both positive and non-positive:
pd.crosstab(df['style'], df['pnl'].gt(0), normalize='index')
Output:
pnl False True
style
bf 0.5 0.5
obs 0.0 1.0
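Putting the first approach together as a runnable script (same toy data as in the question); the mean of a boolean series is exactly the fraction of True entries, i.e. the win rate:

```python
import pandas as pd

df = pd.DataFrame({'pnl': [10, 23, 15, -5, 20],
                   'style': ['obs', 'obs', 'obs', 'bf', 'bf']})

# Fraction of positive pnl entries per style:
win_rate = df['pnl'].gt(0).groupby(df['style']).mean()
print(win_rate)  # bf -> 0.5, obs -> 1.0
```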

Why does sum and lambda sum differ in transform?

For the dataframe:
df = pd.DataFrame({
    'key1': [1, 1, 1, 2, 3, np.nan],
    'key2': ['one', 'two', 'one', 'three', 'two', 'one'],
    'data1': [1, 2, 3, 3, 4, 5]
})
The following transform using the sum function does not produce an error:
df.groupby(['key1'])['key1'].transform(sum)
However, this transform, also using the sum function, produces an error:
df.groupby(['key1'])['key1'].transform(lambda x : sum(x))
ValueError: Length mismatch: Expected axis has 5 elements, new values have 6 elements
Why?
This is probably a bug, but the reason the two behave differently is easily explained: pandas internally overrides the builtin sum, min, and max functions, so when you pass any of these functions to pandas, they are replaced by their numpy equivalents.
Now, your grouper has NaNs, and NaN keys are automatically excluded, as the docs mention. With any of the builtin pandas agg functions, this appears to be handled by automatically inserting NaNs into the output, as you see with the first statement; the output is the same if you run df.groupby(['key1'])['key1'].transform('sum'). However, when you pass a lambda, as in the second statement, this automatic replacement of missing outputs with NaN is not done, hence the length mismatch.
A possible workaround is to group on the strings:
df.groupby(df.key1.astype(str))['key1'].transform(lambda x : sum(x))
0 3.0
1 3.0
2 3.0
3 2.0
4 3.0
5 NaN
Name: key1, dtype: float64
This way, the NaNs are not dropped, and you get rid of the length mismatch.
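For reference, the workaround as a self-contained script: grouping on the string form keeps 'nan' as an ordinary key, so the transform output retains all six rows (with NaN for the NaN-keyed row):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'key1': [1, 1, 1, 2, 3, np.nan],
    'data1': [1, 2, 3, 3, 4, 5],
})

# astype(str) turns NaN into the string 'nan', which is a valid group key,
# so no rows are dropped and the lambda works without a length mismatch
out = df.groupby(df['key1'].astype(str))['key1'].transform(lambda x: sum(x))
print(out)  # 3.0, 3.0, 3.0, 2.0, 3.0, NaN
```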