First time asking. I know this may be a silly question, but I'm overwhelmed. I need to group a DataFrame and count the values, but I'm having problems.
Let's suppose I have this DataFrame:
Plate  2021  2022  Total
AAA       1     1
BBB       0     1
CCC       1     1
BBB       1     1
AAA       1     1
AAA       0     1
BBB       1     0
CCC       0     1
My expected outcome would be:
Plate  2021  2022  Total
AAA       2     3      5
BBB       2     2      4
CCC       1     2      3
I tried different combinations of groupby, such as:
df.groupby('Plate').sum()
df.groupby('Plate', '2020', '2021')('total).sum()
but still didn't reach the expected result.
I would appreciate your help :)
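For reference, a minimal sketch of one way to get that outcome (the column names are taken from the sample; the empty Total column in the input is ignored and recomputed from the yearly sums):

```python
import pandas as pd

df = pd.DataFrame({
    'Plate': ['AAA', 'BBB', 'CCC', 'BBB', 'AAA', 'AAA', 'BBB', 'CCC'],
    '2021':  [1, 0, 1, 1, 1, 0, 1, 0],
    '2022':  [1, 1, 1, 1, 1, 1, 0, 1],
})

# Sum the year columns per plate, then recompute Total from the sums
out = df.groupby('Plate', as_index=False)[['2021', '2022']].sum()
out['Total'] = out['2021'] + out['2022']
print(out)
```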
I have a pandas df as follows:
MATERIAL DATE HIGH LOW
AAA 2022-01-01 10 0
AAA 2022-01-02 0 0
AAA 2022-01-03 5 2
BBB 2022-01-01 0 0
BBB 2022-01-02 10 5
BBB 2022-01-03 8 4
I want to group by MATERIAL, sort each group by DATE,
and choose all rows except the last one in each group.
The expected result is:
MATERIAL DATE HIGH LOW
AAA 2022-01-01 10 0
AAA 2022-01-02 0 0
BBB 2022-01-01 0 0
BBB 2022-01-02 10 5
I have tried df.sort_values('DATE').groupby('MATERIAL').head(-1) but this results in an empty df.
The DATE column is a datetime dtype.
Thanks!
Use Series.duplicated with keep='last' to keep all rows except the last one per group:
df = df.sort_values(['MATERIAL','DATE'])
df = df[df['MATERIAL'].duplicated(keep='last')]
print(df)
MATERIAL DATE HIGH LOW
0 AAA 2022-01-01 10 0
1 AAA 2022-01-02 0 0
3 BBB 2022-01-01 0 0
4 BBB 2022-01-02 10 5
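As an aside, recent pandas versions let GroupBy.head take a negative n, meaning "exclude that many rows from the end of each group" - so the head(-1) attempt from the question can work as-is there (the empty result is what older releases produced). A small sketch under that assumption:

```python
import pandas as pd

df = pd.DataFrame({
    'MATERIAL': ['AAA', 'AAA', 'AAA', 'BBB', 'BBB', 'BBB'],
    'DATE': pd.to_datetime(['2022-01-01', '2022-01-02', '2022-01-03',
                            '2022-01-01', '2022-01-02', '2022-01-03']),
    'HIGH': [10, 0, 5, 0, 10, 8],
    'LOW':  [0, 0, 2, 0, 5, 4],
})

# Negative n drops that many rows from the end of each group
out = df.sort_values(['MATERIAL', 'DATE']).groupby('MATERIAL').head(-1)
print(out)
```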
A groupby solution is also possible with GroupBy.cumcount using a descending count, filtering out the rows where the count is 0:
df = df.sort_values(['MATERIAL','DATE'])
df = df[df.groupby('MATERIAL').cumcount(ascending=False).ne(0)]
print(df)
MATERIAL DATE HIGH LOW
0 AAA 2022-01-01 10 0
1 AAA 2022-01-02 0 0
3 BBB 2022-01-01 0 0
4 BBB 2022-01-02 10 5
Another way is to sort by dates first, then group and take every row except the last one using indexing:
>>> df.sort_values("DATE").groupby("MATERIAL").apply(lambda group_df: group_df.iloc[:-1])
MATERIAL DATE HIGH LOW
MATERIAL
AAA 0 AAA 2022-01-01 10 0
1 AAA 2022-01-02 0 0
BBB 3 BBB 2022-01-01 0 0
4 BBB 2022-01-02 10 5
You could use:
(df.groupby('MATERIAL', as_index=False, group_keys=False)
.apply(lambda d: d.iloc[:len(d)-1])
)
output:
MATERIAL DATE HIGH LOW
0 AAA 2022-01-01 10 0
1 AAA 2022-01-02 0 0
3 BBB 2022-01-01 0 0
4 BBB 2022-01-02 10 5
Another way is to use groupby + transform with nth set to -1, compare the result with the DATE column, and select only the rows that do not match:
df = df.sort_values(['MATERIAL','DATE'])
c = df['DATE'].ne(df.groupby("MATERIAL")['DATE'].transform('nth',-1))
out = df[c].copy()
print(out)
MATERIAL DATE HIGH LOW
0 AAA 2022-01-01 10 0
1 AAA 2022-01-02 0 0
3 BBB 2022-01-01 0 0
4 BBB 2022-01-02 10 5
Side note: since you have a date column, you can also use transform with max or last, but that limits you to dropping the last row; to drop, say, the second-to-last row instead, you would still need nth as shown above:
c = df['DATE'].ne(df.groupby("MATERIAL")['DATE'].transform('max'))
df.loc[df.sort_values(['MATERIAL', 'DATE'])
         .duplicated(subset='MATERIAL', keep='last')
      ].pipe(print)
MATERIAL DATE HIGH LOW
0 AAA 2022-01-01 10 0
1 AAA 2022-01-02 0 0
3 BBB 2022-01-01 0 0
4 BBB 2022-01-02 10 5
I'm a newcomer to Python. Could you help me create the "Number" column as a sequence that increases each time the name in the "test1" column changes? Thanks.
(For example, with the pandas groupby function?)
Try:
df["Number"] = (df.test1 != df.test1.shift()).cumsum()
print(df)
Prints:
test1 Number
0 AAA 1
1 AAA 1
2 AAA 1
3 AAA 1
4 BBB 2
5 BBB 2
6 BBB 2
7 AAA 3
8 AAA 3
9 AAA 3
10 CCC 4
11 CCC 4
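The input frame is not shown above, so here is a self-contained sketch (the test1 values are reconstructed from the printed output) of why this works: comparing each value with the previous one marks the start of every run of equal values, and the cumulative sum numbers those runs:

```python
import pandas as pd

df = pd.DataFrame({'test1': ['AAA'] * 4 + ['BBB'] * 3 + ['AAA'] * 3 + ['CCC'] * 2})

# True at the first row of every run of equal values; cumsum numbers the runs
df['Number'] = (df.test1 != df.test1.shift()).cumsum()
print(df)
```

Note that a plain groupby('test1') would not work here, because the two separate AAA runs would fall into one group.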
I cannot seem to figure this out after trying many different things, and I have found no answer across the web. I have values in a single column, "data", and I need to count the rows and count the occurrences of NaN in that column, grouped by the conditions in two other columns. My data resembles this:
site data day month year
0 Red NaN 20 1 2020
1 Red 5.6 31 1 2020
2 Red NaN 6 1 2020
3 Red NaN 9 2 2020
3 Blue 4.5 14 1 2020
4 Blue 6.2 19 2 2020
5 Blue NaN 11 2 2020
The outcome should look like this:
site month count sumNaN
0 Red 1 3 2
1 Red 2 1 1
2 Blue 1 1 0
3 Blue 2 2 1
Thank you very much.
Try:
(df.assign(data=df['data'].isna())
.groupby(['site','month'])
['data'].agg(['count','sum'])
.reset_index()
)
Output:
site month count sum
0 Blue 1 1 0
1 Blue 2 2 1
2 Red 1 3 2
3 Red 2 1 1
You can use named aggregation within agg:
(df.groupby(['site', 'month'], as_index=False)
   .agg(count=('data', 'size'),
        sumNaN=('data', lambda s: s.isna().sum())
   )
)
site month count sumNaN
0 Blue 1 1 0.0
1 Blue 2 2 1.0
2 Red 1 3 2.0
3 Red 2 1 1.0
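If you also want the groups in their order of first appearance (Red before Blue, matching the expected table) and integer counts, one variant is to aggregate a boolean NaN flag with sort=False - a sketch, with the sample data rebuilt inline:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'site':  ['Red', 'Red', 'Red', 'Red', 'Blue', 'Blue', 'Blue'],
    'data':  [np.nan, 5.6, np.nan, np.nan, 4.5, 6.2, np.nan],
    'month': [1, 1, 1, 2, 1, 2, 2],
})

# Flag NaNs once, then count rows ('size') and sum the flags per group;
# sort=False keeps groups in order of first appearance
out = (df.assign(is_nan=df['data'].isna())
         .groupby(['site', 'month'], as_index=False, sort=False)
         .agg(count=('is_nan', 'size'), sumNaN=('is_nan', 'sum')))
print(out)
```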
I have two dataframes, df_new and df_old; they have the same fields with the same names. I want to subtract df_old from df_new for the numerical fields. For text fields, I just want to use the df_new value.
I'm trying to put the result in a separate dataframe called changes.
I need help adding a dtype check to the formula below so it operates only on the numerical fields.
df_new.set_index('Index').subtract(df_old.set_index('Index'), fill_value=0)
NEW
Part Qty Stock Notes
0 AAA 40 10 yyy
1 BBB 40 10 yyy
2 CCC 50 20 yyy
3 DDD 40 10
4 EEE 40 10
5 FFF 40 10
OLD
Part Qty Stock Notes
0 AAA 40 10 xxx
1 BBB 40 10 xxx
2 CCC 40 10 xxx
3 DDD 40 10
4 EEE 40 10
5 FFF 40 10
CHANGES
Part Qty Stock Notes
0 AAA 0 0 yyy
1 BBB 0 0 yyy
2 CCC 10 10 yyy
3 DDD 0 0
4 EEE 0 0
5 FFF 0 0
What about:
from pandas.api.types import is_numeric_dtype
to_subtract = [column for column, dtype in df_new.dtypes.items() if is_numeric_dtype(dtype)]
df_new[to_subtract] = df_new[to_subtract] - df_old[to_subtract]
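Put together with the sample data, a sketch that leaves df_new untouched and builds the separate changes frame (assuming the two frames share the same row order, as in the sample):

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

df_new = pd.DataFrame({
    'Part':  ['AAA', 'BBB', 'CCC', 'DDD', 'EEE', 'FFF'],
    'Qty':   [40, 40, 50, 40, 40, 40],
    'Stock': [10, 10, 20, 10, 10, 10],
    'Notes': ['yyy', 'yyy', 'yyy', '', '', ''],
})
df_old = pd.DataFrame({
    'Part':  ['AAA', 'BBB', 'CCC', 'DDD', 'EEE', 'FFF'],
    'Qty':   [40, 40, 40, 40, 40, 40],
    'Stock': [10, 10, 10, 10, 10, 10],
    'Notes': ['xxx', 'xxx', 'xxx', '', '', ''],
})

# Start from df_new so text fields keep the new values,
# then overwrite only the numeric columns with the difference
changes = df_new.copy()
numeric = [c for c in df_new.columns if is_numeric_dtype(df_new[c])]
changes[numeric] = df_new[numeric] - df_old[numeric]
print(changes)
```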
I have the following dataframe:
srch_id price
1 30
1 20
1 25
3 15
3 102
3 39
Now I want to create a third column in which I determine the price position grouped by the search id. This is the result I want:
srch_id price price_position
1 30 3
1 20 1
1 25 2
3 15 1
3 102 3
3 39 2
I think I need to use the transform function. However, I can't figure out how to handle the argument it passes to my function:
def k(r):
    return min(r)
tmp = train.groupby('srch_id')['price']
train['min'] = tmp.transform(k)
Is r a list or a single element?
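On the transform question: the function receives each group as a Series (not a list or a single element), and a returned scalar is broadcast back over the group's rows. A minimal sketch with data shaped like the sample:

```python
import pandas as pd

train = pd.DataFrame({'srch_id': [1, 1, 1, 3, 3, 3],
                      'price':   [30, 20, 25, 15, 102, 39]})

def k(r):
    # r is the 'price' Series of one srch_id group
    return min(r)

# The scalar returned by k is broadcast to every row of its group
train['min'] = train.groupby('srch_id')['price'].transform(k)
print(train)
```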
You can use Series.rank() with df.groupby():
df['price_position'] = df.groupby('srch_id')['price'].rank()
print(df)
srch_id price price_position
0 1 30 3.0
1 1 20 1.0
2 1 25 2.0
3 3 15 1.0
4 3 102 3.0
5 3 39 2.0
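rank defaults to method='average', which is why the result is float and tied prices would get averaged positions. If you want integer positions with ties broken by order of appearance, method='first' plus a cast is one option:

```python
import pandas as pd

df = pd.DataFrame({'srch_id': [1, 1, 1, 3, 3, 3],
                   'price':   [30, 20, 25, 15, 102, 39]})

# method='first' assigns distinct ranks even for equal prices
df['price_position'] = (df.groupby('srch_id')['price']
                          .rank(method='first')
                          .astype(int))
print(df)
```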
What about this:
df['price_position'] = df.sort_values('price').groupby('srch_id').price.cumcount() + 1
Output:
srch_id price price_position
0 1 30 3
1 1 20 1
2 1 25 2
3 3 15 1
4 3 102 3
5 3 39 2