I have an issue with groupby aggregation: I want the result to include groups that are not present in the dataframe but should appear in the desired output. An example:
import pandas as pd
from io import StringIO
csvdata = StringIO("""day,sale
1,1
2,4
2,10
4,7
5,2.3
7,4.4
2,3.4""")
# days 3 and 6 are intentionally not included here, but I'd like to have them in the output
df = pd.read_csv(csvdata, sep=",")
df1=df.groupby(['day'])['sale'].agg('sum').reset_index().rename(columns={'sale':'dailysale'})
df1
How can I get the following? Thank you!
1 1.0
2 17.4
3 0.0
4 7.0
5 2.3
6 0.0
7 4.4
You can call Series.reindex with the full range of days after aggregating with sum:
df1 = (df.groupby(['day'])['sale']
.sum()
.reindex(range(1, 8), fill_value=0)
.reset_index(name='dailysale'))
print (df1)
day dailysale
0 1 1.0
1 2 17.4
2 3 0.0
3 4 7.0
4 5 2.3
5 6 0.0
6 7 4.4
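Another idea is to convert day to an ordered categorical, so that the sum aggregation also adds rows for the missing categories (observed=False keeps the unobserved categories; it was the default in older pandas versions):
df['day'] = pd.Categorical(df['day'], categories=range(1, 8), ordered=True)
df1 = df.groupby(['day'], observed=False)['sale'].sum().reset_index(name='dailysale')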
print (df1)
day dailysale
0 1 1.0
1 2 17.4
2 3 0.0
3 4 7.0
4 5 2.3
5 6 0.0
6 7 4.4
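If the set of days is not known in advance, the range can be derived from the data. This is a minimal sketch, assuming day holds integers and that every integer between the smallest and largest observed day should appear; the names all_days and daily are only illustrative:

```python
from io import StringIO

import pandas as pd

csvdata = StringIO("""day,sale
1,1
2,4
2,10
4,7
5,2.3
7,4.4
2,3.4""")
df = pd.read_csv(csvdata)

# Assumption: every integer day between the observed min and max should appear.
all_days = range(df['day'].min(), df['day'].max() + 1)

daily = (df.groupby('day')['sale']
           .sum()
           .reindex(all_days, fill_value=0)   # missing days get a sum of 0
           .reset_index(name='dailysale'))
print(daily)
```

Days before the first or after the last observed day would still be missing with this approach, so an explicit range is safer when the calendar is fixed.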
I'm new to Stack Overflow, and I just have a question about solving a problem in pandas. I am looking to create a function that returns the index of the first future instance where a column is less than each row's value for that column.
For example, consider the dataframe:
import numpy as np
import pandas as pd
df = pd.DataFrame({'Val': [1, 2, 3, 4, 0, 1, -1, -2, -3]}, index = np.arange(0,9))
df
Index  Val
0      1
1      2
2      3
3      4
4      0
5      1
6      -1
7      -2
8      -3
I am looking for the output:
Index  F(Val)
0      4
1      4
2      4
3      4
4      6
5      6
6      7
7      8
8      NaN
Or the series/array equivalent of F(Val).
I've been able to solve this quite easily using for loops, but this is extremely slow on the large dataset I am working with and is not a very elegant or optimal solution. My hope is that the solution is an efficient pandas function that employs vectorization.
Also, as a bonus question (if anyone can assist), how might the maximum value between each row's index and the F(Val) index be computed using vectorization? The output should look like:
Index  G(Val)
0      4
1      4
2      4
3      4
4      1
5      1
6      -1
7      -2
8      NaN
Thanks!
You can use:
grp = df['Val'].lt(df['Val'].shift()).shift(fill_value=0).cumsum()
df['F(Val)'] = df.groupby(grp).transform(lambda x: x.index[-1]).shift(-1)
print(df)
# Output
Val F(Val)
0 1 4.0
1 2 4.0
2 3 4.0
3 4 4.0
4 0 6.0
5 1 6.0
6 -1 7.0
7 -2 8.0
8 -3 NaN
Using numpy broadcasting and the lower triangle:
a = df['Val'].to_numpy()
m = np.tril(a[:,None]<=a, k=-1)
df['F(Val)'] = np.where(m.any(0), m.argmax(0), np.nan)
Same logic with expanding:
df['F(Val)'] = (df.loc[::-1, 'Val'].expanding()
.apply(lambda x: s.idxmax() if len(s:=(x.iloc[-2::-1]<=x.iloc[-1]))
else np.nan)
)
Output (with a difference to the provided one):
Val F(Val)
0 1 5.0 # here the next is 5
1 2 4.0
2 3 4.0
3 4 4.0
4 2 5.0
5 -2 7.0
6 -1 7.0
7 -2 8.0
8 -3 NaN
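For the strict "less than" comparison in the question, and for the bonus G(Val), here is a hedged sketch that is not taken from the answers above. It swaps in a single-pass stack scan (the classic "next smaller element" technique) instead of a vectorized pandas call, and assumes a default RangeIndex so that positions equal index labels. The helper name next_smaller_index is hypothetical:

```python
import numpy as np
import pandas as pd


def next_smaller_index(values):
    """Position of the first later element strictly smaller than each element (NaN if none)."""
    out = np.full(len(values), np.nan)
    stack = []  # positions still waiting for a smaller value
    for i, v in enumerate(values):
        # v resolves every pending position whose value is strictly greater than v
        while stack and values[stack[-1]] > v:
            out[stack.pop()] = i
        stack.append(i)
    return out


df = pd.DataFrame({'Val': [1, 2, 3, 4, 0, 1, -1, -2, -3]})
a = df['Val'].to_numpy()

df['F(Val)'] = next_smaller_index(a)

# Bonus: maximum of Val between each row and its F(Val) row, both ends included.
df['G(Val)'] = [a[i:int(f) + 1].max() if not np.isnan(f) else np.nan
                for i, f in enumerate(df['F(Val)'])]
print(df)
```

The scan for F(Val) visits each element at most twice, so it runs in linear time without building an n-by-n boolean matrix. The G(Val) step still slices once per row and is only meant to illustrate the definition.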
I am trying to do a groupby transform that ranks values in descending order (ascending=False), with ties ranked in order of appearance (method='first'), rather than doing a groupby rank followed by a pandas merge.
Sample code for groupby rank and pandas merge:
import pandas as pd
data = {
    "id": [1,1,2,2,3,3,4,4,5,5],
    "value": [10,10,20,20,30,30,40,40,20,20]
}
df = pd.DataFrame(data)
# copy so that adding a column does not trigger SettingWithCopyWarning
df_rank = df.drop_duplicates().copy()
df_rank["rank"] = df_rank["value"].rank(method="first", ascending=False)
df = pd.merge(df, df_rank[["id","rank"]], on="id", how="left")
df
Out[71]:
id value rank
0 1 10 5.0
1 1 10 5.0
2 2 20 3.0
3 2 20 3.0
4 3 30 2.0
5 3 30 2.0
6 4 40 1.0
7 4 40 1.0
8 5 20 4.0
9 5 20 4.0
I want it to be done by groupby transform method or a more optimized solution. Thanks!
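One possible way to avoid the merge, offered as a sketch rather than as an answer from the thread, is to rank one row per id and then map the ranks back onto the id column. It assumes each id carries a single value, as in the sample data; the name ranks is only illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 1, 2, 2, 3, 3, 4, 4, 5, 5],
    "value": [10, 10, 20, 20, 30, 30, 40, 40, 20, 20],
})

# Assumption: each id has a single value, so one row per id is enough to rank.
ranks = (df.drop_duplicates("id")
           .set_index("id")["value"]
           .rank(method="first", ascending=False))

# Look up each row's rank by its id instead of merging two frames.
df["rank"] = df["id"].map(ranks)
print(df)
```

Series.map with a Series argument is an index lookup, so it plays the role of the left merge on id without creating an intermediate frame.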
I have two dataframes: (i) one has a two-level index and two-level column headers, and (ii) the other has a single-level index and single-level headers. The second level of each axis in the first dataframe relates to the corresponding axis of the second dataframe. I need to multiply the two dataframes based on that relation between the axes.
Dataframe 1:
Dataframe 2:
Expected result (multiplication by index/header):
Try using pd.DataFrame.mul with the level parameter:
import pandas as pd
df = pd.DataFrame([[9,10,2,1,6,5],
[4, 0,3,4,6,6],
[9, 3,9,1,2,3],
[3, 5,9,3,9,0],
[4,4,8,5,10,5],
[5, 3,1,8,5,6]])
df.columns = pd.MultiIndex.from_arrays([[2020]*3+[2021]*3,[1,2,3,1,2,3]])
df.index = pd.MultiIndex.from_arrays([[1]*3+[2]*3,[1,2,3,1,2,3]])
print(df)
print('\n')
df2 = pd.DataFrame([[.1,.3,.6],[.4,.4,.3],[.5,.4,.1]], index=[1,2,3], columns=[1,2,3])
print(df2)
print('\n')
df_out = df.mul(df2, level=1)
print(df_out)
Output:
     2020        2021
        1   2  3    1   2  3
1 1     9  10  2    1   6  5
  2     4   0  3    4   6  6
  3     9   3  9    1   2  3
2 1     3   5  9    3   9  0
  2     4   4  8    5  10  5
  3     5   3  1    8   5  6

     1    2    3
1  0.1  0.3  0.6
2  0.4  0.4  0.3
3  0.5  0.4  0.1

    2020           2021
       1    2    3    1    2    3
1 1  0.9  3.0  1.2  0.1  1.8  3.0
  2  1.6  0.0  0.9  1.6  2.4  1.8
  3  4.5  1.2  0.9  0.5  0.8  0.3
2 1  0.3  1.5  5.4  0.3  2.7  0.0
  2  1.6  1.6  2.4  2.0  4.0  1.5
  3  2.5  1.2  0.1  4.0  2.0  0.6
I have a data frame:
import numpy as np
import pandas as pd
np.random.seed(42)
df = pd.DataFrame(np.random.randint(0, 10, size=(5, 2)), columns=['col1', 'col2'])
Which generates the following frame:
col1 col2
0 6 3
1 7 4
2 6 9
3 2 6
4 7 4
I want to replace all values from row 2 forward with whatever value on row 1. So I type:
df.loc[2:] = df.loc[1:1]
But the resulting frame is filled with nan:
col1 col2
0 6.0 3.0
1 7.0 4.0
2 NaN NaN
3 NaN NaN
4 NaN NaN
I know I can use fillna(method='ffill') to get what I want, but why did the broadcasting not work, and why is the result NaN? Expected result:
col1 col2
0 6 3
1 7 4
2 7 4
3 7 4
4 7 4
Edit: pandas version 0.24.2
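df.loc[1:1] is not empty: .loc slices include both endpoints, so it is a one-row DataFrame whose index is [1]. When the right-hand side is a DataFrame, pandas aligns it with the target on index and columns instead of broadcasting, and rows 2 to 4 have no matching label on the right-hand side, hence the NaN. Assign a plain array instead, which carries no index and is broadcast across the rows: df.loc[2:] = df.loc[1].to_numpy() (or df.loc[1].values).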
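A quick check suggests that df.loc[1:1] actually holds one row, and that assigning a plain array sidesteps index alignment. This is a minimal sketch; the frame is typed out by hand to match the one shown in the question rather than regenerated from the random seed, and the names row and fixed are only illustrative:

```python
import pandas as pd

df = pd.DataFrame({'col1': [6, 7, 6, 2, 7], 'col2': [3, 4, 9, 6, 4]})

# .loc slices include both endpoints, so this is one row labelled 1, not an empty frame.
row = df.loc[1:1]
print(row.shape)  # (1, 2)

# A plain array carries no index, so nothing is aligned and the row is broadcast.
fixed = df.copy()
fixed.loc[2:] = df.loc[1].to_numpy()
print(fixed)
```

Assigning row itself would reproduce the NaN result, because its only index label (1) matches none of the target labels 2, 3 and 4.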
I have a pandas data frame that looks similar to below:
A
0 1
1 NaN
2 2
3 NaN
4 NaN
5 3
6 4
8 NaN
9 5
10 NaN
What I want is:
A
0 1
1 1.1
2 2
3 2.1
4 2.2
5 3
6 4
8 4.1
9 5
10 5.1
I want to fill the missing values incrementally, in steps of 0.1 from the last valid value. I have been playing with np.arange but I cannot work out how to piece everything together. I feel I am on the right path but would appreciate some help. Thank you.
In []: import pandas as pd
In []: import numpy as np
In []: np.arange(1, 2, 0.1)
Out[]: array([1. , 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9])
In []: def up(x):
  ...:     return x.astype(str) + '.' + np.arange(len(x)).astype(str)
In []: data = pd.DataFrame([[1,0],[0,1],[1,0],[0,1]], columns=["A", "B"])
In []: out = data.apply(up).values
array([['1.0', '0.0'],
['0.1', '1.1'],
['1.2', '0.2'],
['0.3', '1.3']], dtype=object)
In []: df = pd.DataFrame(out)
A B
0 1.0 0.0
1 0.1 1.1
2 1.2 0.2
3 0.3 1.3
This one is a little tricky:
# create a group id for each run of consecutive NaN values
s = df.A.isnull().astype(int).diff().ne(0).cumsum()[df.A.isnull()]
# forward-fill, then add 0.1 times each NaN's position within its run
df.A.fillna(df.A.ffill() + s.groupby(s).cumcount().add(1).mul(0.1))
Out[1764]:
0 1.0
1 1.1
2 2.0
3 2.1
4 2.2
5 3.0
6 4.0
8 4.1
9 5.0
10 5.1
Name: A, dtype: float64
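The same idea can be written with a single grouping key. This is a hedged sketch, assuming fewer than ten consecutive NaN values per run (otherwise the offset reaches 1.0 and collides with the next whole number); the names block, offset and filled are only illustrative:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, np.nan, 2, np.nan, np.nan, 3, 4, np.nan, 5, np.nan]},
                  index=[0, 1, 2, 3, 4, 5, 6, 8, 9, 10])

# Each valid value starts a new block; the NaNs that follow it share its block id.
block = df['A'].notna().cumsum()

# Position inside the block: 0 for the valid value, then 1, 2, ... for its NaNs.
offset = df['A'].groupby(block).cumcount() * 0.1

filled = df['A'].ffill() + offset
print(filled)
```

Because the valid value sits at position 0 of its block, it receives an offset of 0 and is left unchanged, so no separate fillna step is needed.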