Get and Modify column in groups on rows that meet a condition

I have this DataFrame:
df = pd.DataFrame({'day': [1, 1, 1, 2, 2, 2, 3, 3, 3], 'hour': [10, 10, 10, 11, 11, 11, 12, 12, 12], 'sales': [0, 40, 30, 10, 80, 70, 0, 0, 20]})
day hour sales
0 1 10 0
1 1 10 40
2 1 10 30
3 2 11 10
4 2 11 80
5 2 11 70
6 3 12 0
7 3 12 0
8 3 12 20
I would like to find the first entry of each day that has sales greater than 0. In addition, I would like to change the 'hour' column of those rows to 9.
So to get something like this:
day hour sales
0 1 10 0
1 1 9 40
2 1 10 30
3 2 9 10
4 2 11 80
5 2 11 70
6 3 12 0
7 3 12 0
8 3 9 20
I only came up with this iterative solution. Is there a way to do this in a more functional style, without the explicit loop?
# Group by day:
groups = df.groupby(by=['day'])
# Get all indices of first non-zero sales entry per day:
indices = []
for name, group in groups:
    group = group[group['sales'] > 0]
    indices.append(group.index.to_list()[0])
# Change their values:
df.iloc[indices, df.columns.get_loc('hour')] = 9

You can check whether sales is greater than 0 and group that boolean Series by df['day']. Then take idxmax to get the index of the first True in each group, use any to drop groups that have no value greater than 0, and assign with loc[]:
g = df['sales'].gt(0).groupby(df['day'])
idx = g.idxmax()
df.loc[idx[g.any()],'hour']=9
print(df)
day hour sales
0 1 10 0
1 1 9 40
2 1 10 30
3 2 9 10
4 2 11 80
5 2 11 70
6 3 12 0
7 3 12 0
8 3 9 20
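
As a possible alternative to idxmax, the same rows can be found by filtering first and then taking the first remaining row of each day. This is only a sketch; the name first_idx is made up for illustration. Days without any positive sales drop out during the filter, so no extra any check is needed.

```python
import pandas as pd

df = pd.DataFrame({
    "day":   [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "hour":  [10, 10, 10, 11, 11, 11, 12, 12, 12],
    "sales": [0, 40, 30, 10, 80, 70, 0, 0, 20],
})

# Keep only rows with positive sales, then take the first such row per day.
# first_idx is a hypothetical name; it holds the index labels of those rows.
first_idx = df[df["sales"] > 0].groupby("day").head(1).index

# Assign by label, so this does not rely on the index being 0..n-1.
df.loc[first_idx, "hour"] = 9
print(df)
```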

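Create a Series m that holds, for every row, the first sales value within its group, where the groups are formed by day together with whether sales is non-zero.
Then use np.where() to set hour to 9 on the rows where sales equals m and sales is greater than 0. Note that rows are matched by value, so a later row of the same day that repeats the first non-zero sales value would be changed as well.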
import numpy as np
import pandas as pd
df = pd.DataFrame({'day': [1, 1, 1, 2, 2, 2, 3, 3, 3],
                   'hour': [10, 10, 10, 11, 11, 11, 12, 12, 12],
                   'sales': [0, 40, 30, 10, 80, 70, 0, 0, 20]})
m = df.groupby(['day', df['sales'].ne(0)])['sales'].transform('first')
df['hour'] = np.where((df['sales'] == m) & (df['sales'] > 0), 9, df['hour'])
df
Out[37]:
day hour sales
0 1 10 0
1 1 9 40
2 1 10 30
3 2 9 10
4 2 11 80
5 2 11 70
6 3 12 0
7 3 12 0
8 3 9 20
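
If the first positive sales value can appear again later in the same day, comparing values would mark every such row. A position-based mask avoids that. The sketch below uses made-up data in which day 1 contains 40 twice, and the names positive and is_first are only illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical data: day 1 has the value 40 on two rows.
df = pd.DataFrame({
    "day":   [1, 1, 1, 2, 2],
    "hour":  [10, 10, 10, 11, 11],
    "sales": [0, 40, 40, 10, 80],
})

positive = df["sales"].gt(0)

# Running count of positive rows within each day. The count is 1 on the first
# positive row (and on zero rows right after it), so the row itself must also
# be positive to qualify.
running = positive.astype(int).groupby(df["day"]).cumsum()
is_first = positive & running.eq(1)

df["hour"] = np.where(is_first, 9, df["hour"])
print(df)
```

Only rows 1 and 3 are changed here, even though row 2 has the same sales value as row 1.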

Related

create column based on column values - merge integers

I would like to create a new column "Group". The integer values from column "Step_ID" should be converted into 1 and 2. The first two values should be converted to 1, the second two values to 2, the third two values to 1 etc. See the image below.
import pandas as pd
data = {'Step_ID': [1, 1, 2, 2, 3, 4, 5, 6, 6, 7, 8, 8, 9, 10, 11, 11]}
df1 = pd.DataFrame(data)

You can try:
m = (df1.Step_ID % 2) + df1.Step_ID
df1['new_group'] = (m.ne(m.shift()).cumsum() % 2).replace(0,2)
OUTPUT:
Step_ID new_group
0 1 1
1 1 1
2 2 1
3 2 1
4 3 2
5 4 2
6 5 1
7 6 1
8 6 1
9 7 2
10 8 2
11 8 2
12 9 1
13 10 1
14 11 2
15 11 2
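
The one-liner above can be easier to follow when split into its intermediate steps. This is a sketch of the same logic; the names pair_key and run_id are introduced here for illustration, and the result is written to a column called "Group" as in the question.

```python
import pandas as pd

df1 = pd.DataFrame(
    {"Step_ID": [1, 1, 2, 2, 3, 4, 5, 6, 6, 7, 8, 8, 9, 10, 11, 11]}
)

# Round odd IDs up to the next even number, so 1 and 2 share a key,
# 3 and 4 share a key, and so on.
pair_key = df1["Step_ID"] + df1["Step_ID"] % 2

# Number the consecutive runs of equal keys: 1, 2, 3, ...
run_id = pair_key.ne(pair_key.shift()).cumsum()

# Odd-numbered runs become group 1, even-numbered runs become group 2.
df1["Group"] = (run_id % 2).replace(0, 2)
print(df1)
```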

Pandas: how to group on column change?

I am working with a log system, and I need to group data not in a standard way.
Alas with my limited knowledge of Pandas I couldn't find any example, probably because I don't know proper search terms.
This is a sample dataframe:
df = pd.DataFrame({
"speed": [2, 4, 6, 8, 8, 9, 2, 3, 8, 9, 13, 18, 25, 27, 18, 8, 6, 8, 12, 20, 27, 34, 36, 41, 44, 54, 61, 60, 61, 40, 17, 12, 15, 24],
"class": [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 4, 5, 5, 5, 3, 1, 1, 1, 2]
})
df.groupby(by="class").groups returns the indexes of the rows, all grouped together by class value:
class indexes
1: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 14, 15, 16, 17, 18, 30, 31, 32],
2: [12, 13, 19, 20, 21, 22, 33],
3: [23, 24, 29],
4: [25],
5: [26, 27, 28]
I need instead to split every time column class changes:
speed class
0 2 1
1 4 1
2 6 1
3 8 1
4 8 1
5 9 1
6 2 1
7 3 1
8 8 1
9 9 1
10 13 1
11 18 1
12 25 2 <= split here
13 27 2
14 18 1 <= split here
15 8 1
16 6 1
17 8 1
18 12 1 <= split here
19 20 2
20 27 2
21 34 2
22 36 2 <= split here
23 41 3
24 44 3 <= split here
25 54 4 <= split here
26 61 5
27 60 5
28 61 5 <= split here
29 40 3 <= split here
30 17 1 <= split here
31 12 1
32 15 1
33 24 2 <= split here
The desired grouping should return something like:
class count mean
0 1 12 7.50
1 2 2 26.00
2 1 5 10.40
3 2 4 29.25
4 3 2 42.50
5 4 1 54.00
6 5 3 60.66
7 3 1 40.00
8 1 3 14.66
9 2 1 24.00
Is there any command to do it not iteratively?

Compare the column with its shifted values for inequality, apply Series.cumsum to number the consecutive runs, and aggregate with GroupBy.agg:
g = df["class"].ne(df["class"].shift()).cumsum()
df = (df.groupby(['class', g], sort=False)['speed'].agg(['count','mean'])
.reset_index(level=1, drop=True)
.reset_index())
print (df)
class count mean
0 1 12 7.500000
1 2 2 26.000000
2 1 5 10.400000
3 2 4 29.250000
4 3 2 42.500000
5 4 1 54.000000
6 5 3 60.666667
7 3 1 40.000000
8 1 3 14.666667
9 2 1 24.000000
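
To see why the cumulative sum works as a grouping key, it can help to look at the intermediate Series on a short input. The data below is a made-up example, and changed and run_id are illustrative names.

```python
import pandas as pd

# Hypothetical short example of a column whose value changes a few times.
s = pd.Series([1, 1, 2, 2, 1, 3, 3], name="class")

# True wherever the value differs from the previous row. The first row is
# compared with NaN, so it is always True and starts run number 1.
changed = s.ne(s.shift())

# Summing the booleans cumulatively increases the counter at every change,
# which gives each consecutive run its own id.
run_id = changed.cumsum()

print(pd.DataFrame({"class": s, "changed": changed, "run_id": run_id}))
```

The value 1 appears in two separate runs here (run 1 and run 3), which is what a plain groupby on the column itself would merge together.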

You can group by the cumsum of where the class column differs from the value above it:
df.groupby(df["class"].diff().ne(0).cumsum()).speed.agg(['size', 'mean'])
size mean
class
1 12 7.500000
2 2 26.000000
3 5 10.400000
4 4 29.250000
5 2 42.500000
6 1 54.000000
7 3 60.666667
8 1 40.000000
9 3 14.666667
10 1 24.000000
Update: I hadn't seen how you wanted the class column. What you can do is group by the original class column as well as the cumsum above, and do a bit of index sorting and resetting (but at this point this answer just converges with @jezrael's answer :P)
result = (
df.groupby(["class", df["class"].diff().ne(0).cumsum()])
.speed.agg(["size", "mean"])
.sort_index(level=1)
.reset_index(level=0)
.reset_index(drop=True)
)
class size mean
0 1 12 7.500000
1 2 2 26.000000
2 1 5 10.400000
3 2 4 29.250000
4 3 2 42.500000
5 4 1 54.000000
6 5 3 60.666667
7 3 1 40.000000
8 1 3 14.666667
9 2 1 24.000000

Pandas - Add many new columns based on many aggregate functions

Pandas 1.0.5
import pandas as pd
d = pd.DataFrame({
"card_id": [1, 1, 2, 2, 1, 1, 2, 2],
"day": [1, 1, 1, 1, 2, 2, 2, 2],
"amount": [1, 2, 10, 20, 3, 4, 30, 40]
})
#add columns
d['count'] = d.groupby(['card_id', 'day'])["amount"].transform('count')
d['min'] = d.groupby(['card_id', 'day'])["amount"].transform('min')
d['max'] = d.groupby(['card_id', 'day'])["amount"].transform('max')
I would like to change the three transform lines to one line. I tried this:
d['count', 'min', 'max'] = d.groupby(['card_id', 'day'])["amount"].transform('count', 'min', 'max')
Error: "TypeError: count() takes 1 positional argument but 3 were given"
I also tried this:
d[('count', 'min', 'max')] = d.groupby(['card_id', 'day']).agg(
count = pd.NamedAgg('amount', 'count')
,min = pd.NamedAgg('amount', 'min')
,max = pd.NamedAgg('amount', 'max')
)
Error: "TypeError: incompatible index of inserted column with frame index"

Use merge:
d = pd.DataFrame({
"card_id": [1, 1, 2, 2, 1, 1, 2, 2],
"day": [1, 1, 1, 1, 2, 2, 2, 2],
"amount": [1, 2, 10, 20, 3, 4, 30, 40]
})
df_out = d.groupby(['card_id', 'day']).agg(
count = pd.NamedAgg('amount', 'count')
,min = pd.NamedAgg('amount', 'min')
,max = pd.NamedAgg('amount', 'max')
)
d.merge(df_out, left_on=['card_id', 'day'], right_index=True)
Output:
card_id day amount count min max
0 1 1 1 2 1 2
1 1 1 2 2 1 2
2 2 1 10 2 10 20
3 2 1 20 2 10 20
4 1 2 3 2 3 4
5 1 2 4 2 3 4
6 2 2 30 2 30 40
7 2 2 40 2 30 40
The output of your groupby has a multilevel index, and that index doesn't match the index of d, hence the error. However, we can join the columns in d to the index of the groupby output by using merge with the column names and right_index=True.

You could use the assign function to get the results in one go:
grouping = d.groupby(["card_id", "day"])["amount"]
d.assign(
    count=grouping.transform("count"),
    min=grouping.transform("min"),
    max=grouping.transform("max"),
)
card_id day amount count min max
0 1 1 1 2 1 2
1 1 1 2 2 1 2
2 2 1 10 2 10 20
3 2 1 20 2 10 20
4 1 2 3 2 3 4
5 1 2 4 2 3 4
6 2 2 30 2 30 40
7 2 2 40 2 30 40
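
Another option along the same lines is to compute all three aggregates in one agg call and attach them with DataFrame.join, which matches the key columns of d against the MultiIndex of the aggregated frame. This is a sketch rather than a recommendation; keys, stats and out are illustrative names.

```python
import pandas as pd

d = pd.DataFrame({
    "card_id": [1, 1, 2, 2, 1, 1, 2, 2],
    "day":     [1, 1, 1, 1, 2, 2, 2, 2],
    "amount":  [1, 2, 10, 20, 3, 4, 30, 40],
})

keys = ["card_id", "day"]

# One row per (card_id, day) with the three aggregates as columns.
stats = d.groupby(keys)["amount"].agg(["count", "min", "max"])

# Left join on the key columns; the row order and index of d are kept.
out = d.join(stats, on=keys)
print(out)
```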

How to group consecutive values in other columns into ranges based on one column

I have the following dataframe:
I would like to get the following output from the dataframe
Is there any way to group the other columns ['B', 'index'] based on column 'A', using a groupby aggregate function or pivot_table in pandas?
I couldn't think of an approach to write the code.

Use:
df=df.reset_index() # if 'index' is not already a column
g=df['A'].ne(df['A'].shift()).cumsum()
new_df=df.groupby(g,as_index=False).agg(index=('index',list),A=('A','first'),B=('B',lambda x: list(x.unique())))
print(new_df)
In pandas <0.25:
new_df=df.groupby(g,as_index=False).agg({'index':list,'A':'first','B':lambda x: list(x.unique())})
If you want to drop repeated values in the index column as well, use the same function for the index column as for B:
new_df=df.groupby(g,as_index=False).agg(index=('index',lambda x: list(x.unique())),A=('A','first'),B=('B',lambda x: list(x.unique())))
print(new_df)
Here is an example:
df=pd.DataFrame({'index':range(20),
'A':[1,1,1,1,2,2,0,0,0,1,1,1,1,1,1,0,0,0,3,3]
,'B':[1,2,3,5,5,5,7,8,9,9,9,12,12,14,15,16,17,18,19,20]})
print(df)
index A B
0 0 1 1
1 1 1 2
2 2 1 3
3 3 1 5
4 4 2 5
5 5 2 5
6 6 0 7
7 7 0 8
8 8 0 9
9 9 1 9
10 10 1 9
11 11 1 12
12 12 1 12
13 13 1 14
14 14 1 15
15 15 0 16
16 16 0 17
17 17 0 18
18 18 3 19
19 19 3 20
g=df['A'].ne(df['A'].shift()).cumsum()
new_df=df.groupby(g,as_index=False).agg(index=('index',list),A=('A','first'),B=('B',lambda x: list(x.unique())))
print(new_df)
index A B
0 [0, 1, 2, 3] 1 [1, 2, 3, 5]
1 [4, 5] 2 [5]
2 [6, 7, 8] 0 [7, 8, 9]
3 [9, 10, 11, 12, 13, 14] 1 [9, 12, 14, 15]
4 [15, 16, 17] 0 [16, 17, 18]
5 [18, 19] 3 [19, 20]

Conditional filter of entire group for DataFrameGroupBy

If I have the following data
>>> data = pd.DataFrame({'day': [1, 1, 1, 1, 2, 2, 2, 2, 3, 4],
'hour':[4, 5, 6, 7, 4, 5, 6, 7, 4, 7]})
>>> data
day hour
0 1 4
1 1 5
2 1 6
3 1 7
4 2 4
5 2 5
6 2 6
7 2 7
8 3 4
9 4 7
And I would like to keep only days where hour has 4 unique values then I would think to do something like this
>>> data.groupby('day').apply(lambda x: x[x['hour'].nunique() == 4])
But this returns KeyError: True
I am hoping to get this
>>> data
day hour
0 1 4
1 1 5
2 1 6
3 1 7
4 2 4
5 2 5
6 2 6
7 2 7
Where we see that where day == 3 and day == 4 have been filtered because when grouped by day they don't have 4 unique values of hour. I'm doing this at scale so simply filtering where (day == 3) & (day == 4) is not an option. I think grouping would be a good way to do this but can't get it to work. Anyone have experience with applying functions to DataFrameGroupBy?

I think you actually need to filter the data:
>>> data.groupby('day').filter(lambda x: x['hour'].nunique() == 4)
day hour
0 1 4
1 1 5
2 1 6
3 1 7
4 2 4
5 2 5
6 2 6
7 2 7
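
The original apply attempt fails because x['hour'].nunique() == 4 is a single boolean, so x[True] is treated as a lookup of a column named True. If a row-wise boolean mask is preferred over filter, one possible sketch uses transform to broadcast the per-day count back to every row (unique_hours and kept are illustrative names):

```python
import pandas as pd

data = pd.DataFrame({
    "day":  [1, 1, 1, 1, 2, 2, 2, 2, 3, 4],
    "hour": [4, 5, 6, 7, 4, 5, 6, 7, 4, 7],
})

# Number of distinct hours per day, repeated on every row of that day.
unique_hours = data.groupby("day")["hour"].transform("nunique")

# Ordinary boolean indexing keeps only the rows of days with 4 distinct hours.
kept = data[unique_hours == 4]
print(kept)
```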
