Pandas groupby().sum() is not ignoring None/empty/np.nan values

I am facing a weird problem when using Pandas groupby() and sum() together.
I have a DataFrame with 2 category columns and 3 numeric/float value columns; the value columns contain Nones, as shown below:
cat_1 cat_2 val_1 val_2 val_3
0 b z 0.1 NaN NaN
1 b x 0.1 NaN NaN
2 c y 0.1 1.0 NaN
3 c z 0.1 NaN NaN
4 c x 0.1 1.0 NaN
I want to group-by cat_1 and then sum the value columns: val_1, val_2 and val_3 per category.
This is what the final aggregation should look like:
cat_1 val_1 val_2 val_3
0 b 0.2 NaN NaN
1 c 0.3 2.0 NaN
The problem is that when I do df.groupby(["cat_1"], as_index=False).sum(), I get 0.0 for categories where all values are None/NaN:
cat_1 val_1 val_2 val_3
0 b 0.2 0.0 0.0
1 c 0.3 2.0 0.0
How can I fix this issue? Thank you.

It seems there is no direct way to do this: DataFrame.sum() has a skipna parameter, but it is not implemented for groupby().sum(). However, as ScootCork mentioned, the required behaviour can be achieved with groupby().sum(min_count=1) - see the documentation.
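A minimal sketch of that approach, reconstructing the DataFrame from the question (the construction is only for illustration):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "cat_1": ["b", "b", "c", "c", "c"],
    "cat_2": ["z", "x", "y", "z", "x"],
    "val_1": [0.1, 0.1, 0.1, 0.1, 0.1],
    "val_2": [np.nan, np.nan, 1.0, np.nan, 1.0],
    "val_3": [np.nan] * 5,
})

# min_count=1: a group needs at least one non-NA value to produce a sum;
# all-NA groups yield NaN instead of 0.0.
out = df.groupby("cat_1", as_index=False)[["val_1", "val_2", "val_3"]].sum(min_count=1)
print(out)

This should print NaN rather than 0.0 for val_2 of category b and for val_3 of both categories.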

Related

In pandas, replace table column with Series while joining indexes

I have a table with preexisting columns, and I want to entirely replace some of those columns with values from a series. The tricky part is that each series will have different indexes and I need to add these varying indexes to the table as necessary, like doing a join/merge operation.
For example, this code generates a table and 5 series where each series only has a subset of the indexes.
import random
import numpy as np
import pandas as pd

cols = ['a', 'b', 'c', 'd', 'e', 'f', 'g']
table = pd.DataFrame(columns=cols)
series = []
for i in range(5):
    series.append(
        pd.Series(
            np.random.randint(0, 3, 2) * 10,
            index=pd.Index(random.sample(range(3), 2))
        )
    )
series
Output:
[1 10
2 0
dtype: int32,
2 0
0 20
dtype: int32,
2 20
1 0
dtype: int32,
2 0
0 10
dtype: int32,
1 20
2 10
dtype: int32]
But when I try to replace columns of the table with the series, a simple assignment doesn't work
for i in range(5):
    col = cols[i]
    table[col] = series[i]
table
Output:
a b c d e f g
1 10 NaN 0 NaN 20 NaN NaN
2 0 0 20 0 10 NaN NaN
because the assignment won't add any more indexes after the first series is assigned
Other things I've tried:
combine or combine_first gives the same result as above. (table[col] = table[col].combine(series[i], lambda a, b: b) and table[col] = series[i].combine_first(table[col]))
pd.concat doesn't work either because of duplicate labels (table[col] = pd.concat([table[col], series[i]]) gives ValueError: cannot reindex on an axis with duplicate labels) and I can't just drop the duplicates because other columns may already have values in those indexes
DataFrame.update won't work since it only takes indexes from the table (join='left'). I need to add indexes from the series to the table as necessary.
Of course, I can always do something like this:
table = table.join(series[i].rename('new'), how='outer')
table[col] = table.pop('new')
which gives the correct result:
a b c d e f g
0 NaN 20.0 NaN 10.0 NaN NaN NaN
1 10.0 NaN 0.0 NaN 20.0 NaN NaN
2 0.0 0.0 20.0 0.0 10.0 NaN NaN
But this is quite a roundabout way of doing it, and it still isn't robust to column-name collisions, so you would have to add more code to fiddle with column names and protect against that. The result is verbose and ugly code for what is conceptually a very simple operation, so I believe there must be a better way of doing it.
pd.concat should work along the column axis:
out = pd.concat(series, axis=1)
print(out)
# Output
0 1 2 3 4
0 10.0 0.0 0.0 NaN 10.0
1 NaN 10.0 NaN 0.0 20.0
2 0.0 NaN 0.0 0.0 NaN
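If you also want the question's column names plus the trailing all-NaN columns f and g, one possible extension (an assumption, not part of the answer above) is to label the concatenated columns and then reindex:

# keys=... names the concatenated columns 'a'..'e';
# reindex(columns=cols) appends 'f' and 'g' as all-NaN columns.
out = pd.concat(series, axis=1, keys=cols[:len(series)]).reindex(columns=cols)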
You could try constructing the dataframe using a dict comprehension like this:
series:
[0 10
1 0
dtype: int64,
0 0
1 0
dtype: int64,
2 20
0 0
dtype: int64,
0 20
2 0
dtype: int64,
0 0
1 0
dtype: int64]
code:
table = pd.DataFrame({
    col: series[i]
    for i, col in enumerate(cols)
    if i < len(series)
})
table
output:
a b c d e
0 10.0 0.0 0.0 20.0 0.0
1 0.0 0.0 NaN NaN 0.0
2 NaN NaN 20.0 0.0 NaN
If you really need the NaN columns at the end, you could do:
table = pd.DataFrame({
    col: series[i] if i < len(series) else np.nan
    for i, col in enumerate(cols)
})
Output:
a b c d e f g
0 10.0 0.0 0.0 20.0 0.0 NaN NaN
1 0.0 0.0 NaN NaN 0.0 NaN NaN
2 NaN NaN 20.0 0.0 NaN NaN NaN

How to accumulate not null values of each row into separate column/series?

I have a dataframe with a subset of columns with the form:
col_A col_B col_C
0 NaN 1.0 NaN
1 NaN NaN NaN
2 NaN NaN 2.0
3 3.0 NaN 4.0
I want to create a series of the non-null values for each row. For rows with multiple non-null values, I want either the average or the first value, e.g.
new_col
0 1.0
1 NaN
2 2.0
3 3.5
I'll eventually want to add this back to the original dataframe as a separate column so I need to persist rows with all NaN values by forward filling them e.g.
new_col
0 1.0
1 1.0
2 2.0
3 3.5
I know how to determine whether a row has a non-null value, but I don't know how to select it:
df[['col_A', 'col_B', 'col_C']].count(axis=1) >= 1
You can use:
df.mean(axis=1).ffill()
Or to restrict the columns:
df[['col_A', 'col_B', 'col_C']].mean(axis=1).ffill()
Output:
0 1.0
1 1.0
2 2.0
3 3.5
dtype: float64
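If you want the first non-null value per row instead of the mean, a possible sketch (not part of the answer above) is to back-fill across the columns and take the first one:

# bfill(axis=1) pulls each row's first non-null value into col_A;
# iloc[:, 0] selects that column, and ffill() fills all-NaN rows from above as before.
first_val = df[['col_A', 'col_B', 'col_C']].bfill(axis=1).iloc[:, 0].ffill()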

groupby shows unobserved values of non-categorical columns

I created this simple example to illustrate my issue:
x = pd.DataFrame({"int_var1": range(3), "int_var2": range(3, 6), "cat_var": pd.Categorical(["a", "b", "a"]), "value": [0.1, 0.2, 0.3]})
it yields this DataFrame:
   int_var1  int_var2 cat_var  value
0         0         3       a    0.1
1         1         4       b    0.2
2         2         5       a    0.3
where the first two columns are integers, the third column is categorical with two levels, and the fourth column is floats. The issue is that when I try to use groupby followed by agg, it seems I only have two options: either I show no unobserved values, like so:
x.groupby(['int_var1', 'int_var2', 'cat_var'], observed = True).agg({"value": "sum"}).fillna(0)
int_var1 int_var2 cat_var value
0 3 a 0.1
1 4 b 0.2
2 5 a 0.3
or I can show unobserved values for all grouping variables like so:
x.groupby(['int_var1', 'int_var2', 'cat_var'], observed = False).agg({"value": "sum"}).fillna(0)
int_var1 int_var2 cat_var value
0 3 a 0.1
b 0.0
4 a 0.0
b 0.0
5 a 0.0
b 0.0
1 3 a 0.0
b 0.0
4 a 0.0
b 0.2
5 a 0.0
b 0.0
2 3 a 0.0
b 0.0
4 a 0.0
b 0.0
5 a 0.3
b 0.0
Is there a way to show unobserved values for the categorical variables only and not every possible permutation of all grouping variables?
You can unstack the level of interest, cat_var in this case:
(x.groupby(['int_var1', 'int_var2', 'cat_var'], observed=True)
   .agg({'value': 'sum'})
   .unstack('cat_var', fill_value=0)
)
Output:
value
cat_var a b
int_var1 int_var2
0 3 0.1 0.0
1 4 0.0 0.2
2 5 0.3 0.0
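If you prefer the result back in long form, with zeros only for the unobserved levels of cat_var, a possible follow-up (an assumption, not part of the answer above) is to stack that level again:

out = (x.groupby(['int_var1', 'int_var2', 'cat_var'], observed=True)
         .agg({'value': 'sum'})
         .unstack('cat_var', fill_value=0)
         .stack('cat_var'))
# After unstack(fill_value=0) there are no NaNs left, so stack() retains every
# (int_var1, int_var2, cat_var) combination that unstack created.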

Sum of NaNs to equal NaN (not zero)

I can add a TOTAL column to this DF using df['TOTAL'] = df.sum(axis=1), and it adds the row elements like this:
col1 col2 TOTAL
0 1.0 5.0 6.0
1 2.0 6.0 8.0
2 0.0 NaN 0.0
3 NaN NaN 0.0
However, I would like the total of the bottom row to be NaN, not zero, like this:
col1 col2 TOTAL
0 1.0 5.0 6.0
1 2.0 6.0 8.0
2 0.0 NaN 0.0
3 NaN NaN NaN
Is there a performant way to achieve this?
Add parameter min_count=1 to DataFrame.sum:
min_count : int, default 0
The required number of valid values to perform the operation. If fewer than min_count non-NA values are present the result will be NA.
New in version 0.22.0: Added with the default being 0. This means the sum of an all-NA or empty Series is 0, and the product of an all-NA or empty Series is 1.
df['TOTAL'] = df.sum(axis=1, min_count=1)
print(df)
col1 col2 TOTAL
0 1.0 5.0 6.0
1 2.0 6.0 8.0
2 0.0 NaN 0.0
3 NaN NaN NaN

Converting a flat table of records to an aggregate dataframe in Pandas [duplicate]

This question already has answers here:
How can I pivot a dataframe?
(5 answers)
Closed 4 years ago.
I have a flat table of records about objects. Objects have a type (ObjType) and are hosted in containers (ContainerId). The records also have some other attributes about the objects, but they are not of interest at present. So, basically, the data looks like this:
Id ObjName XT ObjType ContainerId
2 name1 x1 A 2
3 name2 x5 B 2
22 name5 x3 D 7
25 name6 x2 E 7
35 name7 x3 G 7
..
..
92 name23 x2 A 17
95 name24 x8 B 17
99 name25 x5 A 21
What I am trying to do is 're-pivot' this data to further analyze which containers are 'similar' by looking at the types of objects they host in aggregate.
So, I am looking to convert the above data to the form below:
ObjType A B C D E F G
ContainerId
2 2.0 1.0 1.0 0.0 0.0 0.0 0.0
7 0.0 0.0 0.0 1.0 2.0 1.0 1.0
9 1.0 1.0 0.0 1.0 0.0 0.0 0.0
11 0.0 0.0 0.0 2.0 3.0 1.0 1.0
14 1.0 1.0 0.0 1.0 0.0 0.0 0.0
17 1.0 1.0 0.0 0.0 0.0 0.0 0.0
21 1.0 0.0 0.0 0.0 0.0 0.0 0.0
This is how I have managed to do it currently (after a lot of stumbling and using various tips from questions such as this one). I am getting the right results but, being new to Pandas and Python, I feel that I must be taking a long route. (I have added a few comments to explain the pain points.)
import pandas as pd
rdf = pd.read_csv('.\\testdata.csv')
#The data in the below group-by is all that is needed but in a re-pivoted format...
rdfg = rdf.groupby('ContainerId').ObjType.value_counts()
#Remove 'ContainerId' and 'ObjType' from the index
#Had to do reset_index in two steps because otherwise there's a conflict with 'ObjType'.
#That is, just rdfg.reset_index() does not work!
rdx = rdfg.reset_index(level=['ContainerId'])
#Renaming the 'ObjType' column helps get around the conflict so the 2nd reset_index works.
rdx.rename(columns={'ObjType':'Count'}, inplace=True)
cdx = rdx.reset_index()
#After this a simple pivot seems to do it
cdf = cdx.pivot(index='ContainerId', columns='ObjType',values='Count')
#Replacing the NaNs because not all containers have all object types
cdf.fillna(0, inplace=True)
Ask: Can someone please share other possible approaches that could perform this transformation?
This is a use case for pd.crosstab. Docs.
e.g.
In [539]: pd.crosstab(df.ContainerId, df.ObjType)
Out[539]:
ObjType A B D E G
ContainerId
2 1 1 0 0 0
7 0 0 1 1 1
17 1 1 0 0 0
21 1 0 0 0 0
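An alternative that stays closer to the asker's original groupby route - a sketch assuming the question's rdf DataFrame, not taken from the answer above - is to count group sizes and unstack, which collapses the multi-step reset_index/pivot/fillna sequence into one chain:

# size() counts rows per (ContainerId, ObjType) pair;
# unstack(fill_value=0) moves ObjType into columns and fills missing pairs with 0.
cdf = rdf.groupby(['ContainerId', 'ObjType']).size().unstack(fill_value=0)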