First time asking. I know this may be a silly question, but I'm overwhelmed. I need to group a DataFrame and count the values, but I'm having problems.
Let's suppose I have this DataFrame:
Plate  2021  2022  Total
AAA       1     1
BBB       0     1
CCC       1     1
BBB       1     1
AAA       1     1
AAA       0     1
BBB       1     0
CCC       0     1
My expected outcome would be:
Plate  2021  2022  Total
AAA       2     3      5
BBB       2     2      4
CCC       1     2      3
I tried different combinations of groupby, such as:
df.groupby('Plate').sum()
df.groupby('Plate', '2020', '2021')('total).sum()
but still didn't reach the expected result.
I would appreciate your help :)
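For reference, a minimal sketch of one way to get that outcome (the column names are taken from the sample; the empty Total column in the input is ignored and recomputed from the yearly sums):

```python
import pandas as pd

df = pd.DataFrame({
    'Plate': ['AAA', 'BBB', 'CCC', 'BBB', 'AAA', 'AAA', 'BBB', 'CCC'],
    '2021':  [1, 0, 1, 1, 1, 0, 1, 0],
    '2022':  [1, 1, 1, 1, 1, 1, 0, 1],
})

# Sum the year columns per plate, then recompute Total from the sums
out = df.groupby('Plate', as_index=False)[['2021', '2022']].sum()
out['Total'] = out['2021'] + out['2022']
print(out)
```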
I have a pandas df as follows:
MATERIAL DATE HIGH LOW
AAA 2022-01-01 10 0
AAA 2022-01-02 0 0
AAA 2022-01-03 5 2
BBB 2022-01-01 0 0
BBB 2022-01-02 10 5
BBB 2022-01-03 8 4
I want to group by MATERIAL, sort each group by DATE,
and choose all rows except the last one in each group.
The expected result is:
MATERIAL DATE HIGH LOW
AAA 2022-01-01 10 0
AAA 2022-01-02 0 0
BBB 2022-01-01 0 0
BBB 2022-01-02 10 5
I have tried df.sort_values('DATE').groupby('MATERIAL').head(-1) but this results in an empty df.
The DATE column is a datetime dtype.
Thanks!
Use Series.duplicated with keep='last' to keep all rows except the last one per group:
df = df.sort_values(['MATERIAL','DATE'])
df = df[df['MATERIAL'].duplicated(keep='last')]
print(df)
MATERIAL DATE HIGH LOW
0 AAA 2022-01-01 10 0
1 AAA 2022-01-02 0 0
3 BBB 2022-01-01 0 0
4 BBB 2022-01-02 10 5
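As an aside, recent pandas versions let GroupBy.head take a negative n, meaning "exclude that many rows from the end of each group" - so the head(-1) attempt from the question can work as-is there (the empty result is what older releases produced). A small sketch under that assumption:

```python
import pandas as pd

df = pd.DataFrame({
    'MATERIAL': ['AAA', 'AAA', 'AAA', 'BBB', 'BBB', 'BBB'],
    'DATE': pd.to_datetime(['2022-01-01', '2022-01-02', '2022-01-03',
                            '2022-01-01', '2022-01-02', '2022-01-03']),
    'HIGH': [10, 0, 5, 0, 10, 8],
    'LOW':  [0, 0, 2, 0, 5, 4],
})

# Negative n drops that many rows from the end of each group
out = df.sort_values(['MATERIAL', 'DATE']).groupby('MATERIAL').head(-1)
print(out)
```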
A groupby solution is also possible with GroupBy.cumcount using a descending count, filtering out the rows where the count is 0:
df = df.sort_values(['MATERIAL','DATE'])
df = df[df.groupby('MATERIAL').cumcount(ascending=False).ne(0)]
print(df)
MATERIAL DATE HIGH LOW
0 AAA 2022-01-01 10 0
1 AAA 2022-01-02 0 0
3 BBB 2022-01-01 0 0
4 BBB 2022-01-02 10 5
Another way is to sort by dates first, then group and take every row except the last one using indexing:
>>> df.sort_values("DATE").groupby("MATERIAL").apply(lambda group_df: group_df.iloc[:-1])
MATERIAL DATE HIGH LOW
MATERIAL
AAA 0 AAA 2022-01-01 10 0
1 AAA 2022-01-02 0 0
BBB 3 BBB 2022-01-01 0 0
4 BBB 2022-01-02 10 5
You could use:
(df.groupby('MATERIAL', as_index=False, group_keys=False)
.apply(lambda d: d.iloc[:len(d)-1])
)
output:
MATERIAL DATE HIGH LOW
0 AAA 2022-01-01 10 0
1 AAA 2022-01-02 0 0
3 BBB 2022-01-01 0 0
4 BBB 2022-01-02 10 5
Another way is to use groupby + transform with nth set to -1, compare the result with the DATE column, and select only the rows that do not match:
df = df.sort_values(['MATERIAL','DATE'])
c = df['DATE'].ne(df.groupby("MATERIAL")['DATE'].transform('nth',-1))
out = df[c].copy()
print(out)
MATERIAL DATE HIGH LOW
0 AAA 2022-01-01 10 0
1 AAA 2022-01-02 0 0
3 BBB 2022-01-01 0 0
4 BBB 2022-01-02 10 5
Side note: since you have a date column, you can also use transform with max or last, but that limits you to dropping the last row; to drop, say, the second-to-last row instead, you would still need nth as shown above:
c = df['DATE'].ne(df.groupby("MATERIAL")['DATE'].transform('max'))
df.loc[df.sort_values(['MATERIAL', 'DATE'])
         .duplicated(subset='MATERIAL', keep='last')
      ].pipe(print)
MATERIAL DATE HIGH LOW
0 AAA 2022-01-01 10 0
1 AAA 2022-01-02 0 0
3 BBB 2022-01-01 0 0
4 BBB 2022-01-02 10 5
I'm a newcomer to Python. Could you help me create the "Number" column as a sequence that increases each time the name in the "test1" column changes? Thanks.
(For example, with the pandas groupby function?)
Try:
df["Number"] = (df.test1 != df.test1.shift()).cumsum()
print(df)
Prints:
test1 Number
0 AAA 1
1 AAA 1
2 AAA 1
3 AAA 1
4 BBB 2
5 BBB 2
6 BBB 2
7 AAA 3
8 AAA 3
9 AAA 3
10 CCC 4
11 CCC 4
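The input frame is not shown above, so here is a self-contained sketch (the test1 values are reconstructed from the printed output) of why this works: comparing each value with the previous one marks the start of every run of equal values, and the cumulative sum numbers those runs:

```python
import pandas as pd

df = pd.DataFrame({'test1': ['AAA'] * 4 + ['BBB'] * 3 + ['AAA'] * 3 + ['CCC'] * 2})

# True at the first row of every run of equal values; cumsum numbers the runs
df['Number'] = (df.test1 != df.test1.shift()).cumsum()
print(df)
```

Note that a plain groupby('test1') would not work here, because the two separate AAA runs would fall into one group.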
I cannot seem to figure this out after trying many different things, and I have found no answer across the web. I have values in a single column, "data", and I need to count the rows and count the occurrences of NaN in that column, grouped by the conditions in two other columns. My data resembles this:
site data day month year
0 Red NaN 20 1 2020
1 Red 5.6 31 1 2020
2 Red NaN 6 1 2020
3 Red NaN 9 2 2020
3 Blue 4.5 14 1 2020
4 Blue 6.2 19 2 2020
5 Blue NaN 11 2 2020
The outcome should look like this:
site month count sumNaN
0 Red 1 3 2
1 Red 2 1 1
2 Blue 1 1 0
3 Blue 2 2 1
Thank you very much.
Try:
(df.assign(data=df['data'].isna())
.groupby(['site','month'])
['data'].agg(['count','sum'])
.reset_index()
)
Output:
site month count sum
0 Blue 1 1 0
1 Blue 2 2 1
2 Red 1 3 2
3 Red 2 1 1
You can use named aggregation within agg:
(df.groupby(['site', 'month'], as_index=False)
   .agg(count=('data', 'size'),
        sumNaN=('data', lambda s: s.isna().sum())
   )
)
site month count sumNaN
0 Blue 1 1 0.0
1 Blue 2 2 1.0
2 Red 1 3 2.0
3 Red 2 1 1.0
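If you also want the groups in their order of first appearance (Red before Blue, matching the expected table) and integer counts, one variant is to aggregate a boolean NaN flag with sort=False - a sketch, with the sample data rebuilt inline:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'site':  ['Red', 'Red', 'Red', 'Red', 'Blue', 'Blue', 'Blue'],
    'data':  [np.nan, 5.6, np.nan, np.nan, 4.5, 6.2, np.nan],
    'month': [1, 1, 1, 2, 1, 2, 2],
})

# Flag NaNs once, then count rows ('size') and sum the flags per group;
# sort=False keeps groups in order of first appearance
out = (df.assign(is_nan=df['data'].isna())
         .groupby(['site', 'month'], as_index=False, sort=False)
         .agg(count=('is_nan', 'size'), sumNaN=('is_nan', 'sum')))
print(out)
```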
I have two dataframes, df_new and df_old; they have the same fields with the same names. I want to subtract df_old from df_new for the numerical fields. For text fields, I just want to use the df_new value.
I'm trying to put the result in a separate dataframe called changes.
I need help adding a dtype check to the formula below so it operates only on the numerical fields.
df_new.set_index('Index').subtract(df_old.set_index('Index'), fill_value=0)
NEW
Part Qty Stock Notes
0 AAA 40 10 yyy
1 BBB 40 10 yyy
2 CCC 50 20 yyy
3 DDD 40 10
4 EEE 40 10
5 FFF 40 10
OLD
Part Qty Stock Notes
0 AAA 40 10 xxx
1 BBB 40 10 xxx
2 CCC 40 10 xxx
3 DDD 40 10
4 EEE 40 10
5 FFF 40 10
CHANGES
Part Qty Stock Notes
0 AAA 0 0 yyy
1 BBB 0 0 yyy
2 CCC 10 10 yyy
3 DDD 0 0
4 EEE 0 0
5 FFF 0 0
What about:
from pandas.api.types import is_numeric_dtype
to_subtract = [column for column, dtype in df_new.dtypes.items() if is_numeric_dtype(dtype)]
df_new[to_subtract] = df_new[to_subtract] - df_old[to_subtract]
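Put together with the sample data, a sketch that leaves df_new untouched and builds the separate changes frame (assuming the two frames share the same row order, as in the sample):

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

df_new = pd.DataFrame({
    'Part':  ['AAA', 'BBB', 'CCC', 'DDD', 'EEE', 'FFF'],
    'Qty':   [40, 40, 50, 40, 40, 40],
    'Stock': [10, 10, 20, 10, 10, 10],
    'Notes': ['yyy', 'yyy', 'yyy', '', '', ''],
})
df_old = pd.DataFrame({
    'Part':  ['AAA', 'BBB', 'CCC', 'DDD', 'EEE', 'FFF'],
    'Qty':   [40, 40, 40, 40, 40, 40],
    'Stock': [10, 10, 10, 10, 10, 10],
    'Notes': ['xxx', 'xxx', 'xxx', '', '', ''],
})

# Start from df_new so text fields keep the new values,
# then overwrite only the numeric columns with the difference
changes = df_new.copy()
numeric = [c for c in df_new.columns if is_numeric_dtype(df_new[c])]
changes[numeric] = df_new[numeric] - df_old[numeric]
print(changes)
```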
I have the following dataframe:
srch_id price
1 30
1 20
1 25
3 15
3 102
3 39
Now I want to create a third column in which I determine the price position grouped by the search id. This is the result I want:
srch_id price price_position
1 30 3
1 20 1
1 25 2
3 15 1
3 102 3
3 39 2
I think I need to use the transform function. However, I can't figure out how to handle the argument it passes to my function:
def k(r):
    return min(r)
tmp = train.groupby('srch_id')['price']
train['min'] = tmp.transform(k)
Is r a list or a single element?
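On the transform question: the function receives each group as a Series (not a list or a single element), and a returned scalar is broadcast back over the group's rows. A minimal sketch with data shaped like the sample:

```python
import pandas as pd

train = pd.DataFrame({'srch_id': [1, 1, 1, 3, 3, 3],
                      'price':   [30, 20, 25, 15, 102, 39]})

def k(r):
    # r is the 'price' Series of one srch_id group
    return min(r)

# The scalar returned by k is broadcast to every row of its group
train['min'] = train.groupby('srch_id')['price'].transform(k)
print(train)
```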
You can use Series.rank() with df.groupby():
df['price_position'] = df.groupby('srch_id')['price'].rank()
print(df)
srch_id price price_position
0 1 30 3.0
1 1 20 1.0
2 1 25 2.0
3 3 15 1.0
4 3 102 3.0
5 3 39 2.0
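rank defaults to method='average', which is why the result is float and tied prices would get averaged positions. If you want integer positions with ties broken by order of appearance, method='first' plus a cast is one option:

```python
import pandas as pd

df = pd.DataFrame({'srch_id': [1, 1, 1, 3, 3, 3],
                   'price':   [30, 20, 25, 15, 102, 39]})

# method='first' assigns distinct ranks even for equal prices
df['price_position'] = (df.groupby('srch_id')['price']
                          .rank(method='first')
                          .astype(int))
print(df)
```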
What about this:
df['price_position'] = df.sort_values('price').groupby('srch_id').price.cumcount() + 1
Output:
srch_id price price_position
0 1 30 3
1 1 20 1
2 1 25 2
3 3 15 1
4 3 102 3
5 3 39 2