Group Pandas rows by ID and forward fill them to the right retaining NaN when it appears on all the rows with the same ID - pandas

I have a Pandas DataFrame that I need to:
group by the ID column (not in index)
forward fill each row to the right with the previous value (across multiple columns), but only where that previous value is not NaN (np.nan)
For each ID value and each metric column (see the aX columns in the examples below) there is only one non-missing value (when there are multiple rows, the others are NaN - np.nan).
Take this as an example:
In [1]: import numpy as np
In [2]: import pandas as pd
In [3]: my_df = pd.DataFrame([
...: {"id": 1, "a1": 100.0, "a2": np.nan, "a3": np.nan, "a4": 90.0},
...: {"id": 1, "a1": np.nan, "a2": np.nan, "a3": 80.0, "a4": np.nan},
...: {"id": 20, "a1": np.nan, "a2": np.nan, "a3": 100.0, "a4": np.nan},
...: {"id": 20, "a1": np.nan, "a2": np.nan, "a3": np.nan, "a4": 30.0},
...: ])
In [4]: my_df.head(len(my_df))
Out[4]:
id a1 a2 a3 a4
0 1 100.0 NaN NaN 90.0
1 1 NaN NaN 80.0 NaN
2 20 NaN NaN 100.0 NaN
3 20 NaN NaN NaN 30.0
I have many more columns like a1 to a4.
I would like to:
treat np.nan as zero (0.0) when, in the same column but a different row with the same ID, there is a number, so the rows can be summed together as with groupby and a subsequent aggregation function
forward fill to the right within the resulting unique row (one per ID), but only if some column further to the left already held a number
So basically in the example this means that:
for ID 1 "a2"=100.0
for ID 20 "a1" and "a2" are both np.nan
See here:
In [5]: wanted_df = pd.DataFrame([
...: {"id": 1, "a1": 100.0, "a2": 100.0, "a3": 80.0, "a4": 90.0},
...: {"id": 20, "a1": np.nan, "a2": np.nan, "a3": 100.0, "a4": 30.0},
...: ])
In [6]: wanted_df.head(len(wanted_df))
Out[6]:
id a1 a2 a3 a4
0 1 100.0 100.0 80.0 90.0
1 20 NaN NaN 100.0 30.0
In [7]:
The forward filling to the right should apply to multiple columns on the same row, not only to the closest column to the right.
When I use my_df.interpolate(method='pad', axis=1, limit=None, limit_direction='forward', limit_area=None, downcast=None) then I still get multiple rows for the same ID.
When I use my_df.groupby("id").sum() then I see 0.0 everywhere rather than retaining the NaN values in those scenarios defined above.
When I use my_df.groupby("id").apply(np.sum) the ID columns is summed as well, so this is wrong as it should be retained.
How do I do this?

One idea is to use min_count=1 with sum:
df = my_df.groupby("id").sum(min_count=1)
print (df)
a1 a2 a3 a4
id
1 100.0 NaN 80.0 90.0
20 NaN NaN 100.0 30.0
Or if you need the first non-missing value, you can use GroupBy.first:
df = my_df.groupby("id").first()
print (df)
a1 a2 a3 a4
id
1 100.0 NaN 80.0 90.0
20 NaN NaN 100.0 30.0
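To get exactly the wanted_df from the question (also forward filling to the right within each collapsed row), one possible sketch is to chain a row-wise ffill after either aggregation; this assumes the aX columns are ordered left to right and, as in the question, there is at most one non-missing value per id and column:
# collapse rows per id, then forward fill left-to-right within each row
df = my_df.groupby("id").first().ffill(axis=1).reset_index()
print (df)
   id     a1     a2     a3    a4
0   1  100.0  100.0   80.0  90.0
1  20    NaN    NaN  100.0  30.0
Leading NaNs stay NaN because ffill(axis=1) only propagates a value once one exists to the left.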
It is more problematic if there are multiple non-missing values per group and you need all of them:
# added 20 to a1
my_df = pd.DataFrame([
    {"id": 1, "a1": 100.0, "a2": np.nan, "a3": np.nan, "a4": 90.0},
    {"id": 1, "a1": 20, "a2": np.nan, "a3": 80.0, "a4": np.nan},
    {"id": 20, "a1": np.nan, "a2": np.nan, "a3": 100.0, "a4": np.nan},
    {"id": 20, "a1": np.nan, "a2": np.nan, "a3": np.nan, "a4": 30.0},
])
print (my_df)
id a1 a2 a3 a4
0 1 100.0 NaN NaN 90.0
1 1 20.0 NaN 80.0 NaN
2 20 NaN NaN 100.0 NaN
3 20 NaN NaN NaN 30.0
def f(x):
    return x.apply(lambda s: pd.Series(s.dropna().to_numpy()))

df1 = (my_df.set_index('id')
            .groupby("id")
            .apply(f)
            .reset_index(level=1, drop=True)
            .reset_index())
print (df1)
id a1 a2 a3 a4
0 1 100.0 NaN 80.0 90.0
1 1 20.0 NaN NaN NaN
2 20 NaN NaN 100.0 30.0
The first and second solutions work differently here:
df2 = my_df.groupby("id").sum(min_count=1)
print (df2)
a1 a2 a3 a4
id
1 120.0 NaN 80.0 90.0
20 NaN NaN 100.0 30.0
df3 = my_df.groupby("id").first()
print (df3)
a1 a2 a3 a4
id
1 100.0 NaN 80.0 90.0
20 NaN NaN 100.0 30.0
If all values are of the same type, here numbers, it is also possible to use:
#https://stackoverflow.com/a/44559180/2901002
def justify(a, invalid_val=0, axis=1, side='left'):
    """
    Justifies a 2D array

    Parameters
    ----------
    a : ndarray
        Input array to be justified
    axis : int
        Axis along which justification is to be made
    side : str
        Direction of justification. It could be 'left', 'right', 'up', 'down'.
        It should be 'left' or 'right' for axis=1 and 'up' or 'down' for axis=0.
    """
    if invalid_val is np.nan:
        mask = ~np.isnan(a)
    else:
        mask = a != invalid_val
    justified_mask = np.sort(mask, axis=axis)
    if (side == 'up') | (side == 'left'):
        justified_mask = np.flip(justified_mask, axis=axis)
    out = np.full(a.shape, invalid_val)
    if axis == 1:
        out[justified_mask] = a[mask]
    else:
        out.T[justified_mask.T] = a.T[mask.T]
    return out
f = lambda x: (pd.DataFrame(justify(x.to_numpy(),
                                    invalid_val=np.nan,
                                    axis=0,
                                    side='up'),
                            columns=my_df.columns.drop('id'))
                 .dropna(how='all'))

df1 = (my_df.set_index('id')
            .groupby("id")
            .apply(f)
            .reset_index(level=1, drop=True)
            .reset_index())
print (df1)
id a1 a2 a3 a4
0 1 100.0 NaN 80.0 90.0
1 1 20.0 NaN NaN NaN
2 20 NaN NaN 100.0 30.0

Related

Pandas rolling mean only for non-NaNs

I have a DataFrame:
df = pd.DataFrame({'B': [0, 1, 2, np.nan, 4],
                   'A1': [1, 1, 2, 2, 2],
                   'A2': [1, 2, 3, 3, 3]})
I want to group by columns "A1" and "A2" and then apply a rolling mean on "B" with window 3. If fewer values are available, that is fine, the mean should still be computed. But I do not want any value where there is no original entry.
Result should be:
pd.DataFrame({'B': [0, 1, 2, np.nan, 3]})
Applying df.rolling(3, min_periods=1).mean() yields:
pd.DataFrame({'B': [0, 1, 2, 2, 3]})
Any ideas?
The reason is that the rolling mean with window=3 outputs scalars there, not NaNs; a possible solution is to set NaN manually after rolling:
df = pd.DataFrame({'B': [0, 1, 2, np.nan, 4],
'A': [1, 1, 2, 2, 2]})
df['C'] = df['B'].rolling(3, min_periods=1).mean().mask(df['B'].isna())
df['D'] = df.groupby('A')['B'].rolling(3, min_periods=1).mean().droplevel(0).mask(df['B'].isna())
print (df)
B A C D
0 0.0 1 0.0 0.0
1 1.0 1 0.5 0.5
2 2.0 2 1.0 2.0
3 NaN 2 NaN NaN
4 4.0 2 3.0 3.0
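The droplevel(0) step is needed because groupby(...).rolling(...) prepends the group key as an extra index level; dropping it realigns the result with df before the mask. A quick way to check this, assuming the same df as above:
r = df.groupby('A')['B'].rolling(3, min_periods=1).mean()
# the result is indexed by (A, original row label), so it has two index levels
print (r.index.nlevels)                          # 2
# dropping the group-key level restores the original 0..4 labels,
# so the Series aligns with df['B'] for the .mask(...) step
print (r.droplevel(0).index.equals(df.index))    # True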
EDIT: For multiple grouping columns, remove the corresponding levels with Series.droplevel:
df = pd.DataFrame({'B': [0, 1, 2, np.nan, 4],
'A1': [1, 1, 2, 2, 2],
'A2': [1, 2, 3, 3, 3]})
df['D'] = df.groupby(['A1','A2'])['B'].rolling(3, min_periods=1).mean().droplevel(['A1','A2']).mask(df['B'].isna())
print (df)
B A1 A2 D
0 0.0 1 1 0.0
1 1.0 1 2 1.0
2 2.0 2 3 2.0
3 NaN 2 3 NaN
4 4.0 2 3 3.0

Product demand down calculation in pandas df without loop

I'm having trouble with shift and diff, and I feel it should be simple.
Assume I have customers with different product demands, and they get handled with priority from the top down. I'd like to make it efficient, without looping....
df_situation = pd.DataFrame(
    {
        "cust": [1, 2, 3, 3, 4],
        "prod": [1, 1, 1, 2, 2],
        "available": [1000, np.nan, np.nan, 2000, np.nan],
        "needed": [200, 300, 1000, 1000, 1000],
    }
)
My objective is to get some additional columns like those shown below, but it looks like the difference calculations and the shift operation are in a "chicken and egg" situation.
Thanks in advance for any hint
leftover_prod is the forward-filled available minus the cumulative sum of demand per prod group:
a = df_situation['available'].ffill()
df_situation['leftover_prod'] = (
a - df_situation.groupby('prod')['demand'].cumsum()
)
0 800.0
1 500.0
2 -500.0
3 1000.0
4 0.0
Name: leftover_prod, dtype: float64
fulfilled_cust is either the demand, if there is enough leftover_prod, or the shifted leftover_prod of the prod group, combined via np.where:
s = (df_situation.groupby('prod')['leftover_prod']
.shift()
.fillna(df_situation['available']))
df_situation['fulfilled_cust'] = np.where(
s.ge(df_situation['demand']), df_situation['demand'], s
)
0 200.0
1 300.0
2 500.0
3 1000.0
4 1000.0
Name: fulfilled_cust, dtype: float64
missing_cust is the demand - the fulfilled_cust:
df_situation['missing_cust'] = (
df_situation['demand'] - df_situation['fulfilled_cust']
)
0 0.0
1 0.0
2 500.0
3 0.0
4 0.0
Name: missing_cust, dtype: float64
Together:
a = df_situation['available'].ffill()
df_situation['leftover_prod'] = (
a - df_situation.groupby('prod')['demand'].cumsum()
)
s = (df_situation.groupby('prod')['leftover_prod']
.shift()
.fillna(df_situation['available']))
df_situation['fulfilled_cust'] = np.where(
s.ge(df_situation['demand']), df_situation['demand'], s
)
df_situation['missing_cust'] = (
df_situation['demand'] - df_situation['fulfilled_cust']
)
cust prod available demand leftover_prod fulfilled_cust missing_cust
0 1 1 1000.0 200 800.0 200.0 0.0
1 2 1 NaN 300 500.0 300.0 0.0
2 3 1 NaN 1000 -500.0 500.0 500.0
3 3 2 2000.0 1000 1000.0 1000.0 0.0
4 4 2 NaN 1000 0.0 1000.0 0.0
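As a quick sanity check on the result above, fulfilled_cust and missing_cust should add back up to demand on every row:
# every unit of demand is either fulfilled or missing
assert (df_situation['fulfilled_cust'] + df_situation['missing_cust']
        == df_situation['demand']).all()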
imports and DataFrame used:
import numpy as np
import pandas as pd
df_situation = pd.DataFrame({
"cust": [1, 2, 3, 3, 4],
"prod": [1, 1, 1, 2, 2],
"available": [1000, np.nan, np.nan, 2000, np.nan],
"demand": [200, 300, 1000, 1000, 1000],
})
(changed "needed" to "demand" as it appears in the image.)

How to use apply for multiple Pandas dataset columns?

I am trying to fill the NaN values in some columns, selected from a predefined list. The code always goes down the else path and never makes the correct modifications...
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
'B': [0.0, np.nan, np.nan, 100],
'C': [20, 0.0002, 10000, np.nan],
'D': ['D0', 'D1', 'D2', 'D3']},
index=[0, 1, 2, 3])
num_cols = ['B', 'C']
fill_mean = lambda col: col.fillna(col.mean()) if col.name in num_cols else col
df1.apply(fill_mean, axis=1)
You can do this much more simply with
df1.fillna(df1.mean())
This fills the numeric columns' NaNs with the column mean:
A B C D
0 A0 0.0 20.000000 D0
1 A1 50.0 0.000200 D1
2 A2 50.0 10000.000000 D2
3 A3 100.0 3340.000067 D3
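As a side note, the original apply call takes the else branch because axis=1 passes each row, so col.name is the row label (0 to 3) and never 'B' or 'C'. A minimal sketch that keeps the apply style, assuming the df1 and num_cols defined above, applies column-wise instead (the default axis=0):
# apply over columns: col.name is now 'A', 'B', 'C' or 'D'
df2 = df1.apply(fill_mean)
# or restrict the fill to the numeric columns only
df1[num_cols] = df1[num_cols].apply(lambda col: col.fillna(col.mean()))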
I am not sure if your desired output is just the mean of all columns (a single row). If that is the case, maybe the solution below could help.
df = df1.select_dtypes(include='float').mean().to_frame().T
df = pd.concat([df, df.reindex(columns = df1.select_dtypes(exclude='float').columns)], axis=1, sort=False)
print(df)
B C A D
0 50.0 3340.000067 NaN NaN

How to join a Series to a DataFrame: cannot append a non-category item to a CategoricalIndex

I think I'm following the instructions to a T, but I still get this error I don't understand. I have a DataFrame and a Series, both with the same MultiIndex consisting of the levels "Woche" and "cluster":
DataFrame "weekly":
Cat Base Major
Woche cluster
18w46 0 9.0 NaN
D 5.0 NaN
E 35.0 NaN
F 7.0 50.0
G 80.0 15.0
Series "df2":
Woche cluster
18w46 0 9
D 4
E 1
F 5
G 94
Name: Bruch, dtype: int64
weekly = weekly.join(df2)
gives this error: TypeError: cannot append a non-category item to a CategoricalIndex.
I don't get it. weekly.index.is_categorical() and df2.index.is_categorical() both yield False.
What am I doing wrong?
The problem might be that weekly.columns -- the columns, not the index -- is a CategoricalIndex.
For example,
import numpy as np
import pandas as pd
nan = np.nan
weekly = pd.DataFrame(
{
"Woche": ["18w46", "18w46", "18w46", "18w46", "18w46"],
"cluster": ["0", "D", "E", "F", "G"],
"Base": [9.0, 5.0, 35.0, 7.0, 80.0],
"Major": [nan, nan, nan, 50.0, 15.0],
}
).set_index(["Woche", "cluster"])
weekly.columns = pd.CategoricalIndex(weekly.columns)
df2 = pd.DataFrame(
{
"Woche": ["18w46", "18w46", "18w46", "18w46", "18w46"],
"cluster": ["0", "D", "E", "F", "G"],
"Bruch": [9, 4, 1, 5, 94],
}
).set_index(["Woche", "cluster"])["Bruch"]
weekly.join(df2)
raises TypeError: cannot append a non-category item to a CategoricalIndex.
If weekly.columns.is_categorical() is True, the problem could be avoided by making weekly.columns a regular pd.Index:
weekly.columns = weekly.columns.tolist()
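Alternatively, if you want to keep the categorical columns, it should also be possible to add the incoming name to the categories before joining (assuming, as above, the joined Series is named Bruch):
# register 'Bruch' as a valid category so the join can append it to the columns
weekly.columns = weekly.columns.add_categories(['Bruch'])
weekly = weekly.join(df2)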

How to calculate the aggregate variance in pivot table

When I use aggfunc=np.var in pivot_table, I find that the values of the metrics become NaN. But with aggfunc=np.sum they don't.
Why are the original values changed with aggfunc=np.var or aggfunc=np.std? I cannot find the answer in the docs of pivot_table.
import pandas as pd
import numpy as np
df = pd.DataFrame({"A": ["foo", "foo", "foo", "foo", "foo",
"bar", "bar", "bar", "bar"],
"B": ["one", "one", "one", "two", "two",
"one", "one", "two", "two"],
"C": ["small", "large", "large", "small",
"small", "large", "small", "small",
"large"],
"D": [1, 2, 2, 3, 3, 4, 5, 6, 7],
"E": [2, 4, 5, 5, 6, 6, 8, 9, 9]})
print(df.pivot_table(
index = ['A', 'B'],
values = ['D', 'E'],
columns = ['C'],
aggfunc= np.sum,
margins=True,
margins_name = 'sum',
dropna = False
))
print('-' * 100)
df = df.pivot_table(
index = ['A', 'B'],
values = ['D', 'E'],
columns = ['C'],
aggfunc= np.var,
margins=True,
margins_name = 'var',
dropna = False
)
print(df)
D E
C large small sum large small sum
A B
bar one 4.0 5.0 9 6.0 8.0 14
two 7.0 6.0 13 9.0 9.0 18
foo one 4.0 1.0 5 9.0 2.0 11
two NaN 6.0 6 NaN 11.0 11
sum 15.0 18.0 33 24.0 30.0 54
-----------------------------------------------------------------------
D E
C large small var large small var
A B
bar one NaN NaN 0.500000 NaN NaN 2.000000
two NaN NaN 0.500000 NaN NaN 0.000000
foo one 0.000000 NaN 0.333333 0.500000 NaN 2.333333
two NaN 0.0 0.000000 NaN 0.5 0.500000
var 5.583333 3.8 3.555556 4.666667 7.5 4.888889
What's more, I found that the var of D = large is np.var([4.0, 7.0, 4.0]) = 2.0 instead of 5.583333.
What I expected is:
D E
C large small var large small var
A B
bar one 4.0 5.0 0.25 6.0 8.0 1.0
two 7.0 6.0 0.25 9.0 9.0 0
foo one 4.0 1.0 2.25 9.0 2.0 12.25
two NaN 6.0 0 NaN 11.0 0.0
var 2.0 4.25 3.6 2.0 11.25 7.34
What is the meaning of aggfunc = np.var in pivot table?
Pandas uses ddof=1 by default; see here for details on np.var.
When you have just one value, the variance with ddof=1 is NaN because you divide by zero.
Var of D = large is np.var([2,2,4,7], ddof=1) = 5.583333333333333, so everything is correct (you'll have to use the individual values, not the sums).
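A quick check of both points (the single-value NaN and the 5.583 value):
import numpy as np
print (np.var([4.0], ddof=1))           # nan: zero degrees of freedom, division by zero
print (np.var([2, 2, 4, 7], ddof=1))    # 5.583333333333333
print (np.var([2, 2, 4, 7], ddof=0))    # 4.1875, which matches the var0 margin below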
If you need var with ddof = 0 then you can provide your own function:
def var0(x):
    return np.var(x, ddof=0)
print(df.pivot_table(
index = ['A', 'B'],
values = ['D', 'E'],
columns = ['C'],
aggfunc= var0,
margins=True,
margins_name = 'var',
dropna = False
))
Result:
D E
C large small var large small var
A B
bar one 0.0000 0.00 0.250000 0.00 0.00 1.000000
two 0.0000 0.00 0.250000 0.00 0.00 0.000000
foo one 0.0000 0.00 0.222222 0.25 0.00 1.555556
two NaN 0.00 0.000000 NaN 0.25 0.250000
var 4.1875 3.04 3.555556 3.50 6.00 4.888889
UPDATE based on the edited question.
Pivot table with the sums over C and additionally the var of those sums as margin columns/row.
We first create the sum pivot table with the margin columns/row named var. Then we update these margin columns/row with the var of the sum table:
dfs = df.pivot_table(
index = ['A', 'B'],
values = ['D', 'E'],
columns = ['C'],
aggfunc= np.sum,
margins=True,
margins_name = 'var',
dropna = False)
dfs[[('D','var'),('E','var')]] = df.pivot_table(
index = ['A', 'B'],
values = ['D', 'E'],
columns = ['C'],
aggfunc= np.sum,
dropna = False).stack().groupby(level=(0,1)).apply(var0)
dfs.iloc[-1] = dfs.iloc[:-1].apply(var0)
Result:
D E
C large small var large small var
A B
bar one 4.0 5.00 0.250000 6.0 8.00 1.000000
two 7.0 6.00 0.250000 9.0 9.00 0.000000
foo one 4.0 1.00 2.250000 9.0 2.00 12.250000
two NaN 6.00 0.000000 NaN 11.00 0.000000
var 2.0 4.25 0.824219 2.0 11.25 26.792969
In the margin row (last row) the var columns are calculated as the var of the row vars. I don't understand how the OP calculated the values for these two cells; anyway, they don't seem to make much sense.