Why do sum and lambda sum differ in transform? - pandas

For the dataframe:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'key1': [1, 1, 1, 2, 3, np.nan],
    'key2': ['one', 'two', 'one', 'three', 'two', 'one'],
    'data1': [1, 2, 3, 3, 4, 5]
})
The following transform using the sum function does not produce an error:
df.groupby(['key1'])['key1'].transform(sum)
However, this transform, also using the sum function, produces an error:
df.groupby(['key1'])['key1'].transform(lambda x : sum(x))
ValueError: Length mismatch: Expected axis has 5 elements, new values have 6 elements
Why?

This is probably a bug, but the reason the two behave differently is easily explained: pandas internally overrides the builtin sum, min, and max functions. When you pass any of these functions to pandas, they are internally replaced by the numpy equivalents.
Now, your grouper contains NaN, and NaN keys are automatically excluded from the groups, as the docs mention. With any of the builtin pandas aggregation functions this is handled for you: NaN is inserted into the output for the excluded rows, as you see with the first statement. The output is the same if you run df.groupby(['key1'])['key1'].transform('sum'). However, when you pass a lambda as in the second statement, for whatever reason this automatic filling of the missing output positions with NaN is not done, so the 5-element result cannot be aligned with the 6-row index.
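As a quick check, the builtin sum and the string 'sum' take the same internal path and produce identical output (a minimal sketch reusing the df from the question):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'key1': [1, 1, 1, 2, 3, np.nan],
    'key2': ['one', 'two', 'one', 'three', 'two', 'one'],
    'data1': [1, 2, 3, 3, 4, 5]
})

# The builtin sum is swapped for pandas' own 'sum', so both calls
# fill the NaN-keyed row with NaN and line up with the 6-row index.
via_builtin = df.groupby(['key1'])['key1'].transform(sum)
via_string = df.groupby(['key1'])['key1'].transform('sum')
print(via_builtin.equals(via_string))  # True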
A possible workaround is to group on the strings:
df.groupby(df.key1.astype(str))['key1'].transform(lambda x : sum(x))
0 3.0
1 3.0
2 3.0
3 2.0
4 3.0
5 NaN
Name: key1, dtype: float64
This way, the NaNs are not dropped, and you get rid of the length mismatch.


Is dropna=True in pandas groupby useful?

I am not certain if this question is appropriate here, and apologies in advance if it is not.
I am a pandas maintainer, and recently I've been working on fixing bugs in pandas groupby when used with dropna=True and transform for the 1.5 release. For example, in pandas 1.4.2,
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 1, np.nan], 'b': [2, 3, 4]})
print(df.groupby('a', dropna=True).transform('sum'))
produces the incorrect output (note the last row in particular):
b
0 5
1 5
2 5
While working on this, I've been wondering how useful the dropna argument is in groupby. For aggregations (e.g. df.groupby('a').sum()) and filters (e.g. df.groupby('a').head(2)), it seems to me it's always possible to drop the offending rows prior to the groupby. In addition, in my own use of pandas, if I have null values in the groupers then I want them in the groupby result. For transformations, where the resulting index should match that of the input, the value is instead filled with null. For the above code block, the output should be
b
0 5.0
1 5.0
2 NaN
But I can't imagine this result ever being useful. In case it is, it is also not too difficult to accomplish:
result = df.groupby('a', dropna=False).transform('sum')
result.loc[df['a'].isnull()] = np.nan
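Similarly, for the aggregation case mentioned above, dropping the offending rows up front reproduces dropna=True exactly; a minimal sketch (assuming pandas 1.1+, where the dropna argument exists):
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 1, np.nan], 'b': [2, 3, 4]})

# dropna=True on the groupby ...
with_dropna = df.groupby('a', dropna=True).sum()

# ... matches dropping the null-keyed rows before grouping.
drop_first = df.dropna(subset=['a']).groupby('a', dropna=False).sum()

print(with_dropna.equals(drop_first))  # True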
If we were able to deprecate and then remove the dropna argument to groupby (i.e. groupby always behaves as if dropna=False), then this would help simplify a good part of the groupby code.
So I'd like to ask if there are examples where dropna=True and the operation might be otherwise hard to accomplish.
Thanks!

Convert a float column with nan to int pandas

I am trying to convert a float pandas column with nans to int format, using apply.
I would like to use something like this:
df.col = df.col.apply(to_integer)
where the function to_integer is given by
def to_integer(x):
    if np.isnan(x):
        return np.NaN
    else:
        return int(x)
However, when I attempt to apply it, the column remains the same.
How could I achieve this without having to use the standard technique of dtypes?
You can't have NaN in an int column: NaN is a float (unless you use an object dtype, which is not a good idea since you'll lose many vectorized operations).
You can however use the newer nullable integer type (which uses pd.NA).
Conversion can be done with convert_dtypes:
df = pd.DataFrame({'col': [1, 2, None]})
df = df.convert_dtypes()
# type(df.at[0, 'col'])
# numpy.int64
# type(df.at[2, 'col'])
# pandas._libs.missing.NAType
output:
col
0 1
1 2
2 <NA>
Not sure how you would achieve this without using dtypes. Sometimes when loading in data, you may have a column that contains mixed dtypes, but loading a column with one dtype and attempting to turn it into mixed dtypes is not possible (at least, not that I know of).
So I will echo what @mozway said and suggest you use the nullable integer data types,
e.g.
df['col'] = df['col'].astype('Int64')
(note the capital I)
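For example, on a column like the one above (a minimal sketch; note how NaN becomes <NA> rather than forcing the column back to float):
import numpy as np
import pandas as pd

df = pd.DataFrame({'col': [1.0, 2.0, np.nan]})

# 'Int64' (capital I) is the pandas nullable integer dtype.
df['col'] = df['col'].astype('Int64')
print(df['col'])
# 0       1
# 1       2
# 2    <NA>
# Name: col, dtype: Int64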

sparkSQL filter function not working with NaN

Good morning,
I have the following variables.
self.filters = 'px_variation > 0.15'
df
If I do df.collect() I get.
Row(px_variation=nan, subject_code='1010', list_tr_id=['X0', 'X1'], list_quantity=[3000.0, 1.0], list_cash_qty=[16500.0, 5.5])
I try to apply the following function
df.filter(self.filters)
And its result is:
Row(px_variation=nan, subject_code='1010', list_tr_id=['X0', 'X1'], list_quantity=[3000.0, 1.0], list_cash_qty=[16500.0, 5.5])
As you can see, px_variation in my DF is a numpy.nan, but after applying the filter function the row is not filtered out.
Why isn't Spark SQL ignoring the NaN, or using it so the row gets filtered?
If I do the same operation but in python the result is as expected.
df.collect()[0].px_variation > 0.15 -> Result: False
Any idea? Thank you.
The special value NaN is treated as larger than any other numeric value by Spark's NaN semantics, even "larger" than infinity.
One option is to change the filter to
filters = 'px_variation > 0.15 and not isnan(px_variation)'
Another option to handle the NaN values is to replace them with None/null:
df.replace(float('nan'), None).filter('px_variation > 0.15')
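A minimal sketch of both options (assuming a local SparkSession and a toy DataFrame with the same column name):
from pyspark.sql import SparkSession

spark = SparkSession.builder.master('local[1]').getOrCreate()
df = spark.createDataFrame([(float('nan'),), (0.2,), (0.1,)], ['px_variation'])

# NaN compares as larger than any number, so it survives a plain '> 0.15' filter.
df.filter('px_variation > 0.15').show()

# Option 1: exclude NaN explicitly.
df.filter('px_variation > 0.15 and not isnan(px_variation)').show()

# Option 2: turn NaN into null first; comparisons against null are never true.
df.replace(float('nan'), None).filter('px_variation > 0.15').show()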

OneHotEncoder gives ValueError : Input contains NaN ; even though my DataFrame doesn't contain any NaN as indicated by df.isna()

I am working on the Titanic dataset, and when trying to apply OneHotEncoder to one of the columns, 'Embarked', which has 3 possible values ('S', 'Q' and 'C'), it gives me
ValueError: Input contains NaN
I checked the contents of the column using 2 methods: the first a for-loop with value_counts, and the second writing the entire isna table to a csv:
for col in X.columns:
    print(col)
    print(X[col].value_counts(dropna=False))
X.isna().to_csv("xisna.csv")
print("notna================== :", X.notna().shape)
X.dropna(axis=0, how='any', inplace=True)
print("X.shape ", X.shape)
return pd.DataFrame(X)
Which yielded
Embarked
S 518
C 139
Q 55
Name: Embarked, dtype: int64
I checked the contents of the csv and, reading through the over 700 entries, I did not find any 'True' values.
The pipeline that blocks at the ("cat", One...) entry:
cat_attribs = ["Sex", "Embarked"]
special_attribs = {'drop_attribs': ["Name", "Cabin", "Ticket", "PassengerId"], k: [3]}
full_pipeline = ColumnTransformer([
    ("fill", fill_pipeline, list(strat_train_set)),
    ("emb_cat", OneHotEncoder(), ['Sex']),
    ("cat", OneHotEncoder(), ['Embarked']),
])
So where exactly is the NaN-value that I am missing?
I figured it out: a ColumnTransformer concatenates the outputs of its transformers instead of passing them along to the next transformer in line.
So any transformations done in fill_pipeline won't be noticed by OneHotEncoder, since it is still working on the untransformed dataset.
So I had to put the one hot encoding into the fill_pipeline instead of the ColumnTransformer.
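A sketch of that structure (the names and the SimpleImputer are illustrative stand-ins, not the actual contents of fill_pipeline): chain the fill step and the encoder inside one Pipeline, then hand that Pipeline to the ColumnTransformer.
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Steps inside a Pipeline are chained, so the encoder sees the imputed values;
# steps inside a ColumnTransformer each run on the original input and are
# merely concatenated.
cat_pipeline = Pipeline([
    ("fill", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder()),
])

full_pipeline = ColumnTransformer([
    ("cat", cat_pipeline, ["Sex", "Embarked"]),
    # ... numeric / other branches go here ...
])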

Julia DataFrames equivalent of pandas pct_change()

Currently, I have written the below function for percent change calculation:
function pct_change(input::AbstractVector{<:Number})::AbstractVector{Number}
    result = [NaN]
    for i in 2:length(input)
        push!(result, (input[i] - input[i-1]) / abs(input[i-1]))
    end
    return result
end
This works as expected. But wanted to know whether there is a built-in function for Julia DataFrames similar to pandas pct_change which I can use directly? Or any other better way or improvements that I can make to my function above?
This is a very specific function and is not provided in DataFrames.jl, but rather in TimeSeries.jl. Here is an example:
julia> using TimeSeries, Dates
julia> ta = TimeArray(Date(2018, 1, 1):Day(1):Date(2018, 12, 31), 1:365);
julia> percentchange(ta);
(there are some more options for what should be calculated)
The drawback is that it accepts only TimeArray objects and that it drops the periods for which the percent change cannot be calculated (whereas pandas retains them as NaN).
If you want your custom definition, consider denoting the first value as missing rather than NaN. Also, your function will not preserve the exact representation of the numbers (e.g. if you wanted to use BigFloat or exact calculations with the Rational type, the values would be converted to Float64). Here are example alternative implementations that avoid these problems:
function pct_change(input::AbstractVector{<:Number})
    res = @view(input[2:end]) ./ @view(input[1:end-1]) .- 1
    [missing; res]
end
or
function pct_change(input::AbstractVector{<:Number})
    [i == 1 ? missing : (input[i] - input[i-1]) / input[i-1] for i in eachindex(input)]
end
And now you have in both cases:
julia> pct_change(1:10)
10-element Array{Union{Missing, Float64},1}:
missing
1.0
0.5
0.33333333333333326
0.25
0.19999999999999996
0.16666666666666674
0.1428571428571428
0.125
0.11111111111111116
julia> pct_change(big(1):10)
10-element Array{Union{Missing, BigFloat},1}:
missing
1.0
0.50
0.3333333333333333333333333333333333333333333333333333333333333333333333333333391
0.25
0.2000000000000000000000000000000000000000000000000000000000000000000000000000069
0.1666666666666666666666666666666666666666666666666666666666666666666666666666609
0.1428571428571428571428571428571428571428571428571428571428571428571428571428547
0.125
0.111111111111111111111111111111111111111111111111111111111111111111111111111113
julia> pct_change(1//1:10)
10-element Array{Union{Missing, Rational{Int64}},1}:
missing
1//1
1//2
1//3
1//4
1//5
1//6
1//7
1//8
1//9
with proper values returned.