I'm having trouble with shift and diff, and I feel it should be simple.
Assume I have customers with different product demands, and they are handled with priority from top to bottom. I'd like an efficient solution without looping.
df_situation = pd.DataFrame(
    {
        "cust": [1, 2, 3, 3, 4],
        "prod": [1, 1, 1, 2, 2],
        "available": [1000, np.nan, np.nan, 2000, np.nan],
        "needed": [200, 300, 1000, 1000, 1000],
    }
)
My objective is to get some additional columns like this, but it looks like the difference calculation and the shift operation are stuck in a "chicken and egg" situation.
Thanks in advance for any hint.
leftover_prod is the forward-filled available minus the cumulative demand per prod (groupby + cumsum):
a = df_situation['available'].ffill()
df_situation['leftover_prod'] = (
    a - df_situation.groupby('prod')['demand'].cumsum()
)
0 800.0
1 500.0
2 -500.0
3 1000.0
4 0.0
Name: leftover_prod, dtype: float64
fulfilled_cust is either the demand, if enough is left before the row, or whatever remains (groupby + shift + np.where):
s = (df_situation.groupby('prod')['leftover_prod']
       .shift()
       .fillna(df_situation['available']))
df_situation['fulfilled_cust'] = np.where(
    s.ge(df_situation['demand']), df_situation['demand'], s
)
0 200.0
1 300.0
2 500.0
3 1000.0
4 1000.0
Name: fulfilled_cust, dtype: float64
missing_cust is the demand minus fulfilled_cust:
df_situation['missing_cust'] = (
    df_situation['demand'] - df_situation['fulfilled_cust']
)
0 0.0
1 0.0
2 500.0
3 0.0
4 0.0
Name: missing_cust, dtype: float64
Together:
a = df_situation['available'].ffill()
df_situation['leftover_prod'] = (
    a - df_situation.groupby('prod')['demand'].cumsum()
)
s = (df_situation.groupby('prod')['leftover_prod']
       .shift()
       .fillna(df_situation['available']))
df_situation['fulfilled_cust'] = np.where(
    s.ge(df_situation['demand']), df_situation['demand'], s
)
df_situation['missing_cust'] = (
    df_situation['demand'] - df_situation['fulfilled_cust']
)
cust prod available demand leftover_prod fulfilled_cust missing_cust
0 1 1 1000.0 200 800.0 200.0 0.0
1 2 1 NaN 300 500.0 300.0 0.0
2 3 1 NaN 1000 -500.0 500.0 500.0
3 3 2 2000.0 1000 1000.0 1000.0 0.0
4 4 2 NaN 1000 0.0 1000.0 0.0
Imports and DataFrame used:
import numpy as np
import pandas as pd

df_situation = pd.DataFrame({
    "cust": [1, 2, 3, 3, 4],
    "prod": [1, 1, 1, 2, 2],
    "available": [1000, np.nan, np.nan, 2000, np.nan],
    "demand": [200, 300, 1000, 1000, 1000],
})
(I changed "needed" to "demand", as that is the name that appears in the image.)
Related
If I have a DataFrame:
df = pd.DataFrame({'B': [0, 1, 2, np.nan, 4],
                   'A1': [1, 1, 2, 2, 2],
                   'A2': [1, 2, 3, 3, 3]})
I want to group by columns "A1" and "A2" and then apply a rolling mean on "B" with window 3. If fewer values are available, that is fine, the mean should still be computed. But I do not want any value where there is no original entry.
Result should be:
pd.DataFrame({'B': [0, 1, 2, np.nan, 3]})
Applying df.rolling(3, min_periods=1).mean() yields:
pd.DataFrame({'B': [0, 1, 2, 2, 3]})
Any ideas?
The reason is that the rolling mean with window=3 and min_periods=1 outputs scalars, not NaNs, even where the original value is missing. A possible solution is to set NaN manually after rolling, with Series.mask:
df = pd.DataFrame({'B': [0, 1, 2, np.nan, 4],
                   'A': [1, 1, 2, 2, 2]})
df['C'] = df['B'].rolling(3, min_periods=1).mean().mask(df['B'].isna())
df['D'] = df.groupby('A')['B'].rolling(3, min_periods=1).mean().droplevel(0).mask(df['B'].isna())
print (df)
B A C D
0 0.0 1 0.0 0.0
1 1.0 1 0.5 0.5
2 2.0 2 1.0 2.0
3 NaN 2 NaN NaN
4 4.0 2 3.0 3.0
EDIT: For multiple grouping columns, remove the group levels with Series.droplevel:
df = pd.DataFrame({'B': [0, 1, 2, np.nan, 4],
                   'A1': [1, 1, 2, 2, 2],
                   'A2': [1, 2, 3, 3, 3]})
df['D'] = df.groupby(['A1','A2'])['B'].rolling(3, min_periods=1).mean().droplevel(['A1','A2']).mask(df['B'].isna())
print (df)
B A1 A2 D
0 0.0 1 1 0.0
1 1.0 1 2 1.0
2 2.0 2 3 2.0
3 NaN 2 3 NaN
4 4.0 2 3 3.0
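A quick way to see why droplevel is needed: the grouped rolling result comes back with the group keys prepended to its index, so it does not align with df until those levels are dropped (a small check reusing the frame above):
r = df.groupby(['A1', 'A2'])['B'].rolling(3, min_periods=1).mean()
print(r.index.names)
# ['A1', 'A2', None] -> the last level is the original row index
print(r.droplevel(['A1', 'A2']).index.equals(df.index))
# True here, because the groups happen to come back in the original row order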
I have a Pandas DataFrame that I need to:
group by the ID column (not in index)
forward fill rows to the right with the previous value (across multiple columns), but only if it is not NaN (np.nan)
For each ID value and each metric column (see the aX columns in the examples below) there is only one value (when there are multiple rows, the others are NaN, i.e. np.nan).
Take this as an example:
In [1]: import numpy as np
In [2]: import pandas as pd
In [3]: my_df = pd.DataFrame([
...: {"id": 1, "a1": 100.0, "a2": np.nan, "a3": np.nan, "a4": 90.0},
...: {"id": 1, "a1": np.nan, "a2": np.nan, "a3": 80.0, "a4": np.nan},
...: {"id": 20, "a1": np.nan, "a2": np.nan, "a3": 100.0, "a4": np.nan},
...: {"id": 20, "a1": np.nan, "a2": np.nan, "a3": np.nan, "a4": 30.0},
...: ])
In [4]: my_df.head(len(my_df))
Out[4]:
id a1 a2 a3 a4
0 1 100.0 NaN NaN 90.0
1 1 NaN NaN 80.0 NaN
2 20 NaN NaN 100.0 NaN
3 20 NaN NaN NaN 30.0
I have many more columns like a1 to a4.
I would like to:
treat np.nan as zero (0.0) when, in the same column but a different row (with the same ID), there is a number, so I can sum them together with groupby and subsequent aggregation functions
forward fill to the right within the same collapsed row (one per ID), but only if a column to the left already holds a number
So basically in the example this means that:
for ID 1, "a2" = 100.0
for ID 20, "a1" and "a2" are both np.nan
See here:
In [5]: wanted_df = pd.DataFrame([
...: {"id": 1, "a1": 100.0, "a2": 100.0, "a3": 80.0, "a4": 90.0},
...: {"id": 20, "a1": np.nan, "a2": np.nan, "a3": 100.0, "a4": 30.0},
...: ])
In [6]: wanted_df.head(len(wanted_df))
Out[6]:
id a1 a2 a3 a4
0 1 100.0 100.0 80.0 90.0
1 20 NaN NaN 100.0 30.0
The forward filling to the right should apply across multiple columns on the same row, not only to the closest column to the right.
When I use my_df.interpolate(method='pad', axis=1, limit=None, limit_direction='forward', limit_area=None, downcast=None), I still get multiple rows for the same ID.
When I use my_df.groupby("id").sum(), I see 0.0 everywhere rather than retaining the NaN values in the scenarios defined above.
When I use my_df.groupby("id").apply(np.sum), the id column is summed as well, which is wrong, as it should be retained.
How do I do this?
One idea is to use min_count=1 with sum:
df = my_df.groupby("id").sum(min_count=1)
print (df)
a1 a2 a3 a4
id
1 100.0 NaN 80.0 90.0
20 NaN NaN 100.0 30.0
Or, if you need the first non-missing value, it is possible to use GroupBy.first:
df = my_df.groupby("id").first()
print (df)
a1 a2 a3 a4
id
1 100.0 NaN 80.0 90.0
20 NaN NaN 100.0 30.0
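If, as in wanted_df, each collapsed row should additionally be forward filled to the right (so "a2" becomes 100.0 for id 1 but "a1"/"a2" stay NaN for id 20), one possible sketch is an axis=1 ffill on top of either aggregation:
# collapse per id, then forward fill to the right within each row;
# columns with nothing to their left stay NaN (id 20 -> a1, a2 remain NaN)
out = my_df.groupby("id").sum(min_count=1).ffill(axis=1)
print(out)
#        a1     a2     a3    a4
# id
# 1   100.0  100.0   80.0  90.0
# 20    NaN    NaN  100.0  30.0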
It is more problematic if there are multiple non-missing values per group and you need all of them:
#added 20 to a1
my_df = pd.DataFrame([
    {"id": 1, "a1": 100.0, "a2": np.nan, "a3": np.nan, "a4": 90.0},
    {"id": 1, "a1": 20, "a2": np.nan, "a3": 80.0, "a4": np.nan},
    {"id": 20, "a1": np.nan, "a2": np.nan, "a3": 100.0, "a4": np.nan},
    {"id": 20, "a1": np.nan, "a2": np.nan, "a3": np.nan, "a4": 30.0},
])
print (my_df)
id a1 a2 a3 a4
0 1 100.0 NaN NaN 90.0
1 1 20.0 NaN 80.0 NaN
2 20 NaN NaN 100.0 NaN
3 20 NaN NaN NaN 30.0
def f(x):
    return x.apply(lambda x: pd.Series(x.dropna().to_numpy()))

df1 = (my_df.set_index('id')
            .groupby("id")
            .apply(f)
            .reset_index(level=1, drop=True)
            .reset_index())
print (df1)
id a1 a2 a3 a4
0 1 100.0 NaN 80.0 90.0
1 1 20.0 NaN NaN NaN
2 20 NaN NaN 100.0 30.0
The first and second solutions work differently here:
df2 = my_df.groupby("id").sum(min_count=1)
print (df2)
a1 a2 a3 a4
id
1 120.0 NaN 80.0 90.0
20 NaN NaN 100.0 30.0
df3 = my_df.groupby("id").first()
print (df3)
a1 a2 a3 a4
id
1 100.0 NaN 80.0 90.0
20 NaN NaN 100.0 30.0
If all values are of the same type, here numbers, it is also possible to use:
#https://stackoverflow.com/a/44559180/2901002
def justify(a, invalid_val=0, axis=1, side='left'):
    """
    Justifies a 2D array

    Parameters
    ----------
    A : ndarray
        Input array to be justified
    axis : int
        Axis along which justification is to be made
    side : str
        Direction of justification. It could be 'left', 'right', 'up', 'down'
        It should be 'left' or 'right' for axis=1 and 'up' or 'down' for axis=0.
    """
    if invalid_val is np.nan:
        mask = ~np.isnan(a)
    else:
        mask = a != invalid_val
    justified_mask = np.sort(mask, axis=axis)
    if (side == 'up') | (side == 'left'):
        justified_mask = np.flip(justified_mask, axis=axis)
    out = np.full(a.shape, invalid_val)
    if axis == 1:
        out[justified_mask] = a[mask]
    else:
        out.T[justified_mask.T] = a.T[mask.T]
    return out
f = lambda x: (pd.DataFrame(justify(x.to_numpy(),
                                    invalid_val=np.nan,
                                    axis=0,
                                    side='up'), columns=my_df.columns.drop('id'))
                 .dropna(how='all'))
df1 = (my_df.set_index('id')
            .groupby("id")
            .apply(f)
            .reset_index(level=1, drop=True)
            .reset_index())
print (df1)
id a1 a2 a3 a4
0 1 100.0 NaN 80.0 90.0
1 1 20.0 NaN NaN NaN
2 20 NaN NaN 100.0 30.0
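As a standalone check of what justify does to one group's block (assuming the justify function above is defined), the non-missing values of each column are packed to the top:
# the id=1 block from the modified my_df; values are packed upwards per column
arr = np.array([[100., np.nan, np.nan, 90.],
                [20., np.nan, 80., np.nan]])
print(justify(arr, invalid_val=np.nan, axis=0, side='up'))
# [[100.  nan  80.  90.]
#  [ 20.  nan  nan  nan]]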
I want to bucket volumes and build up a summary report over the aggregated data via those buckets. Currently I use apply to do this, but apply can be very slow for large data sets. Is there a generic form of the syntax given in create_lt_ten_bucket? I'm guessing this is more of a NumPy thing, with which I am less familiar.
def create_buckets(df_internal, comparison_operator, column_to_bucket, min_value, max_value, ranges_pivots):
    low = [min_value] + ranges_pivots
    high = ranges_pivots + [max_value]
    ranges = list(zip(low, high))
    max_str_len = len(str(max(high + low)))

    def get_value(row):
        count = 0
        for l, h in ranges:
            if comparison_operator(l, row[column_to_bucket]) and comparison_operator(row[column_to_bucket], h):
                return "{}|{}_to_{}".format(str(count).zfill(max_str_len), l, h)
            count += 1
        return "OUTOFBAND"

    df_internal["{}_BUCKETED".format(column_to_bucket)] = df_internal.apply(get_value, axis=1)

def create_lt_ten_bucket(df_internal, column_to_bucket):
    df_internal["{}_is_lt_ten".format(column_to_bucket)] = df_internal[column_to_bucket] < 10
dftest = pd.DataFrame([1,2,3,4,5, 44, 250, 22], columns=["value_alpha"])
create_buckets(dftest, lambda v1,v2: v1 <= v2, "value_alpha", 0, 999, [1, 2, 5, 10, 25, 50, 100, 200])
display(dftest)
create_lt_ten_bucket(dftest, "value_alpha")
display(dftest)
dftest.groupby('value_alpha_BUCKETED').sum().sort_values('value_alpha_BUCKETED')
OUTPUT
value_alpha value_alpha_BUCKETED
0 1 000|0_to_1
1 2 001|1_to_2
2 3 002|2_to_5
3 4 002|2_to_5
4 5 002|2_to_5
5 44 005|25_to_50
6 250 008|200_to_999
7 22 004|10_to_25
dftest = pd.DataFrame([1,2,3,4,5, 44, 250, 22], columns=["value_alpha"])
create_buckets(dftest, lambda v1,v2: v1 <= v2, "value_alpha", 0, 999999999, [1, 2, 5, 10, 25, 50, 100, 200])
display(dftest)
create_lt_ten_bucket(dftest, "value_alpha")
display(dftest)
OUTPUT
value_alpha value_alpha_BUCKETED value_alpha_is_lt_ten
0 1 000|0_to_1 True
1 2 001|1_to_2 True
2 3 002|2_to_5 True
3 4 002|2_to_5 True
4 5 002|2_to_5 True
5 44 005|25_to_50 False
6 250 008|200_to_999 False
7 22 004|10_to_25 False
In the end I'm trying to get a summary of the data similar to this:
dftest.groupby('value_alpha_BUCKETED').sum().sort_values('value_alpha_BUCKETED')
value_alpha value_alpha_is_lt_ten
value_alpha_BUCKETED
000|0_to_1 1 1.0
001|1_to_2 2 1.0
002|2_to_5 12 3.0
004|10_to_25 22 0.0
005|25_to_50 44 0.0
008|200_to_999 250 0.0
I'm not entirely clear on what you're asking, but what you have is roughly pd.cut and pd.DataFrame.groupby:
dftest['new_bucket'] = pd.cut(dftest['value_alpha'], [0, 1, 2, 5, 10, 25, 50, 100, 200, 999])
dftest['value_alpha_is_lt_ten'] = dftest['value_alpha'] < 10
print(dftest.groupby("new_bucket").sum())
value_alpha value_alpha_is_lt_ten
new_bucket
(0, 1] 1 1.0
(1, 2] 2 1.0
(2, 5] 12 3.0
(5, 10] 0 0.0
(10, 25] 22 0.0
(25, 50] 44 0.0
(50, 100] 0 0.0
(100, 200] 0 0.0
(200, 999] 250 0.0
If you don't want the empty buckets, you could .query on values where value_alpha > 0
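A minimal sketch of that filtering step (the summary name is just illustrative; only the numeric columns are selected so the sum is well defined):
# aggregate per bucket, then drop buckets that received no rows
summary = dftest.groupby("new_bucket")[["value_alpha", "value_alpha_is_lt_ten"]].sum()
print(summary.query("value_alpha > 0"))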
I want to create a barplot where the 'Age_round' values are grouped together and shown in ascending order. Right now the bars are all separated.
import matplotlib.pyplot as plt
df.plot(kind='bar',x='Age_round',y='number of purchased hours(mins)')
plt.xlabel('Age_round')
plt.ylabel('number of purchased hours(mins)')
# plt.xticks(np.arange(start = 4, stop = 17, step = 1))
plt.title('Age Distribution Graph')
plt.grid()
This is my dataframe below
Package Age_round gender
1 7000 9.0 1
2 7000 10.0 0
3 5000 9.0 0
4 9000 10.0 1
5 3000 12.0 1
6 5000 9.0 1
7 9000 10.0 1
8 6000 16.0 1
9 6000 12.0 0
10 6000 7.0 1
11 12000 7.0 1
12 12000 15.0 1
13 6000 10.0 1
Essentially, I would love to create a barplot where the x-axis is 'Age_round', the y-axis shows the frequency, and the 'Package' values are differentiated by bars of different colour.
I wrote a piece of code that does this job, but I am not sure it is the best way:
I made a new DataFrame holding the frequency of each Package per age and assigned the sorted ages as its index:
values = df.Age_round.unique()
values.sort()
newdf = pd.DataFrame()
for x in values:
    freq_x = df[df['Age_round'] == x]['Package'].value_counts()
    newdf = newdf.append(freq_x)
newdf.index = values
newdf.plot(kind='bar',stacked=True, figsize=(15,6))
Here is a possible implementation:
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame(columns=['Package', 'Age_round', 'gender'],
                  data=[[7000, 9.0, 1], [7000, 10.0, 0], [5000, 9.0, 0], [9000, 10.0, 1], [3000, 12.0, 1],
                        [5000, 9.0, 1], [9000, 10.0, 1], [6000, 16.0, 1], [6000, 12.0, 0], [6000, 7.0, 1],
                        [12000, 7.0, 1], [12000, 15.0, 1], [6000, 10.0, 1]])
df['Age_round'] = df['Age_round'].astype(int) # optionally round the numbers to integers
df.sort_values(['Age_round', 'Package']).plot(kind='bar', x='Age_round', y='Package', rot=0, color='deeppink')
plt.xlabel('Age (rounded)')
plt.ylabel('Number of purchased hours(mins)')
plt.title('Age Distribution Graph')
plt.grid(True, axis='y')
plt.show()
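If the y-axis should instead be the frequency per age, with one colour per Package (what the loop in the question builds), a crosstab is one possible alternative (a sketch reusing the df defined above):
# count occurrences of each Package per rounded age; crosstab sorts the ages ascending
counts = pd.crosstab(df['Age_round'], df['Package'])
counts.plot(kind='bar', stacked=True, rot=0)
plt.xlabel('Age (rounded)')
plt.ylabel('Frequency')
plt.show()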
When I use aggfunc = np.var in a pivot table, I found that the values of the metrics become NaN, but with aggfunc = np.sum they don't.
Why are the original values changed with aggfunc = np.var or aggfunc = np.std? I could not find an answer in the docs of pivot_table.
import pandas as pd
import numpy as np
df = pd.DataFrame({"A": ["foo", "foo", "foo", "foo", "foo",
"bar", "bar", "bar", "bar"],
"B": ["one", "one", "one", "two", "two",
"one", "one", "two", "two"],
"C": ["small", "large", "large", "small",
"small", "large", "small", "small",
"large"],
"D": [1, 2, 2, 3, 3, 4, 5, 6, 7],
"E": [2, 4, 5, 5, 6, 6, 8, 9, 9]})
print(df.pivot_table(
index = ['A', 'B'],
values = ['D', 'E'],
columns = ['C'],
aggfunc= np.sum,
margins=True,
margins_name = 'sum',
dropna = False
))
print('-' * 100)
df = df.pivot_table(
    index=['A', 'B'],
    values=['D', 'E'],
    columns=['C'],
    aggfunc=np.var,
    margins=True,
    margins_name='var',
    dropna=False
)
print(df)
D E
C large small sum large small sum
A B
bar one 4.0 5.0 9 6.0 8.0 14
two 7.0 6.0 13 9.0 9.0 18
foo one 4.0 1.0 5 9.0 2.0 11
two NaN 6.0 6 NaN 11.0 11
sum 15.0 18.0 33 24.0 30.0 54
-----------------------------------------------------------------------
D E
C large small var large small var
A B
bar one NaN NaN 0.500000 NaN NaN 2.000000
two NaN NaN 0.500000 NaN NaN 0.000000
foo one 0.000000 NaN 0.333333 0.500000 NaN 2.333333
two NaN 0.0 0.000000 NaN 0.5 0.500000
var 5.583333 3.8 3.555556 4.666667 7.5 4.888889
What's more, I found that the var of D for large is 5.583333, whereas I expected np.var([4.0, 7.0, 4.0]) = 2.0.
What I expected is:
D E
C large small var large small var
A B
bar one 4.0 5.0 0.25 6.0 8.0 1.0
two 7.0 6.0 0.25 9.0 9.0 0
foo one 4.0 1.0 2.25 9.0 2.0 12.25
two NaN 6.0 0 NaN 11.0 0.0
var 2.0 4.25 3.6 2.0 11.25 7.34
What is the meaning of aggfunc = np.var in pivot table?
Pandas uses ddof = 1 by default; see the NumPy documentation on np.var for details.
When you have just one value, the variance with ddof = 1 is NaN because you divide by zero.
The var of D for large is np.var([2, 2, 4, 7], ddof=1) = 5.583333333333333, so everything is correct (you have to use the individual values, not the sums).
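A quick numerical check of that value, and of its ddof=0 counterpart that shows up in the margin row further below:
import numpy as np

print(np.var([2, 2, 4, 7], ddof=1))   # 5.583333333333333 -> ddof=1, what pandas computes
print(np.var([2, 2, 4, 7], ddof=0))   # 4.1875 -> ddof=0, the value var0 gives below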
If you need var with ddof = 0 then you can provide your own function:
def var0(x):
    return np.var(x, ddof=0)

print(df.pivot_table(
    index=['A', 'B'],
    values=['D', 'E'],
    columns=['C'],
    aggfunc=var0,
    margins=True,
    margins_name='var',
    dropna=False
))
Result:
D E
C large small var large small var
A B
bar one 0.0000 0.00 0.250000 0.00 0.00 1.000000
two 0.0000 0.00 0.250000 0.00 0.00 0.000000
foo one 0.0000 0.00 0.222222 0.25 0.00 1.555556
two NaN 0.00 0.000000 NaN 0.25 0.250000
var 4.1875 3.04 3.555556 3.50 6.00 4.888889
UPDATE based on the edited question.
A pivot table with the sums, and additionally the var of those sums as the margin column/row.
We first create the sum pivot table with its margin column/row named var, then update those margin cells with the var of the sum table:
dfs = df.pivot_table(
    index=['A', 'B'],
    values=['D', 'E'],
    columns=['C'],
    aggfunc=np.sum,
    margins=True,
    margins_name='var',
    dropna=False)

dfs[[('D', 'var'), ('E', 'var')]] = df.pivot_table(
    index=['A', 'B'],
    values=['D', 'E'],
    columns=['C'],
    aggfunc=np.sum,
    dropna=False).stack().groupby(level=(0, 1)).apply(var0)

dfs.iloc[-1] = dfs.iloc[:-1].apply(var0)
Result:
D E
C large small var large small var
A B
bar one 4.0 5.00 0.250000 6.0 8.00 1.000000
two 7.0 6.00 0.250000 9.0 9.00 0.000000
foo one 4.0 1.00 2.250000 9.0 2.00 12.250000
two NaN 6.00 0.000000 NaN 11.00 0.000000
var 2.0 4.25 0.824219 2.0 11.25 26.792969
In the margin row (last row) the var columns are calculated as the var of the row vars. I don't understand how the OP calculated the values for those two cells; in any case, they don't seem to make much sense.
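For reference, those two margin cells in the result above are just var0 applied to the column of row vars:
print(np.var([0.25, 0.25, 2.25, 0.0], ddof=0))   # 0.82421875 -> the ('D', 'var') margin cell
print(np.var([1.0, 0.0, 12.25, 0.0], ddof=0))    # 26.79296875 -> the ('E', 'var') margin cell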