Pandas groupby and choose all rows except the last one in each group

I have a pandas df as follows:
MATERIAL DATE HIGH LOW
AAA 2022-01-01 10 0
AAA 2022-01-02 0 0
AAA 2022-01-03 5 2
BBB 2022-01-01 0 0
BBB 2022-01-02 10 5
BBB 2022-01-03 8 4
I want to group by MATERIAL, sort each group by DATE,
and choose all rows except the last one in each group.
The resulting df should be:
MATERIAL DATE HIGH LOW
AAA 2022-01-01 10 0
AAA 2022-01-02 0 0
BBB 2022-01-01 0 0
BBB 2022-01-02 10 5
I have tried df.sort_values('DATE').groupby('MATERIAL').head(-1) but this results in an empty df.
The DATE column is a datetime dtype.
Thanks!

Use Series.duplicated with keep='last' to keep every row except the last one per group:
df = df.sort_values(['MATERIAL','DATE'])
df = df[df['MATERIAL'].duplicated(keep='last')]
print (df)
MATERIAL DATE HIGH LOW
0 AAA 2022-01-01 10 0
1 AAA 2022-01-02 0 0
3 BBB 2022-01-01 0 0
4 BBB 2022-01-02 10 5
A groupby solution is also possible: build a descending counter with GroupBy.cumcount and filter out the rows where it is 0:
df = df.sort_values(['MATERIAL','DATE'])
df = df[df.groupby('MATERIAL').cumcount(ascending=False).ne(0)]
print (df)
MATERIAL DATE HIGH LOW
0 AAA 2022-01-01 10 0
1 AAA 2022-01-02 0 0
3 BBB 2022-01-01 0 0
4 BBB 2022-01-02 10 5
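To see why ne(0) works: on the sorted sample data (before filtering), the descending counter labels the last row of each group with 0:
print (df.groupby('MATERIAL').cumcount(ascending=False))
0    2
1    1
2    0
3    2
4    1
5    0
dtype: int64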

Another way is to sort by dates first, then group and take every row except the last one using indexing:
>>> df.sort_values("DATE").groupby("MATERIAL").apply(lambda group_df: group_df.iloc[:-1])
MATERIAL DATE HIGH LOW
MATERIAL
AAA 0 AAA 2022-01-01 10 0
1 AAA 2022-01-02 0 0
BBB 3 BBB 2022-01-01 0 0
4 BBB 2022-01-02 10 5
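If the extra MATERIAL level in the index is unwanted, passing group_keys=False should keep the original flat index:
>>> df.sort_values("DATE").groupby("MATERIAL", group_keys=False).apply(lambda group_df: group_df.iloc[:-1])
  MATERIAL       DATE  HIGH  LOW
0      AAA 2022-01-01    10    0
1      AAA 2022-01-02     0    0
3      BBB 2022-01-01     0    0
4      BBB 2022-01-02    10    5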

You could use:
(df.groupby('MATERIAL', as_index=False, group_keys=False)
.apply(lambda d: d.iloc[:len(d)-1])
)
output:
MATERIAL DATE HIGH LOW
0 AAA 2022-01-01 10 0
1 AAA 2022-01-02 0 0
3 BBB 2022-01-01 0 0
4 BBB 2022-01-02 10 5
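Note this keeps the original row order within each group; assuming the groups should be sorted by DATE as in the question, sort first (d.iloc[:-1] is equivalent to d.iloc[:len(d)-1]):
(df.sort_values('DATE')
 .groupby('MATERIAL', as_index=False, group_keys=False)
 .apply(lambda d: d.iloc[:-1])
)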

Another way would be using groupby + transform with nth as -1, comparing this with the DATE column and selecting only the rows that do not match:
df = df.sort_values(['MATERIAL','DATE'])
c = df['DATE'].ne(df.groupby("MATERIAL")['DATE'].transform('nth',-1))
out = df[c].copy()
print(out)
MATERIAL DATE HIGH LOW
0 AAA 2022-01-01 10 0
1 AAA 2022-01-02 0 0
3 BBB 2022-01-01 0 0
4 BBB 2022-01-02 10 5
Side note: since you have a date column, you can also use transform with max or last. But that only targets the last row; to exclude, say, the second-to-last row as well, you would need nth as shown above:
c = df['DATE'].ne(df.groupby("MATERIAL")['DATE'].transform('max'))
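For example, a sketch excluding both the last and the second-to-last row per group (assuming transform accepts 'nth' with a positional argument, as in the snippet above):
g = df.groupby("MATERIAL")['DATE']
# keep rows whose DATE matches neither the last nor the second-to-last date of the group
c = df['DATE'].ne(g.transform('nth', -1)) & df['DATE'].ne(g.transform('nth', -2))
out = df[c].copy()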

df.loc[df.sort_values(['MATERIAL','DATE'])\
       .duplicated(subset='MATERIAL',keep='last')]\
       .pipe(print)
MATERIAL DATE HIGH LOW
0 AAA 2022-01-01 10 0
1 AAA 2022-01-02 0 0
3 BBB 2022-01-01 0 0
4 BBB 2022-01-02 10 5
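Since .loc aligns a boolean mask on the index, this prints the surviving rows in their original order; to keep the sorted order instead, filter after sorting, e.g. with a callable:
(df.sort_values(['MATERIAL','DATE'])
 .loc[lambda d: d.duplicated(subset='MATERIAL', keep='last')]
 .pipe(print))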

Related

Groupby problem: count values from 2 columns

First time asking. I know that this may be a silly question, but I'm overwhelmed. I need to group a DataFrame and count the values but I’m having problems.
Let's suppose that I have this DF:
Plate 2021 2022 Total
AAA 1 1
BBB 0 1
CCC 1 1
BBB 1 1
AAA 1 1
AAA 0 1
BBB 1 0
CCC 0 1
My expected outcome would be:
Plate 2021 2022 Total
AAA 2 3 5
BBB 2 2 4
CCC 1 2 3
I tried different combinations of groupby such as:
df.groupby('Plate').sum()
df.groupby('Plate', '2020', '2021')('total).sum()
but still didn't reach the expected result.
I would appreciate your help :)
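A minimal sketch of one way to get the expected output (assuming the year columns are named '2021' and '2022', and that Total is just their row sum):
out = df.groupby('Plate', as_index=False)[['2021', '2022']].sum()
out['Total'] = out['2021'] + out['2022']  # recompute Total from the summed year columns
print(out)
  Plate  2021  2022  Total
0   AAA     2     3      5
1   BBB     2     2      4
2   CCC     1     2      3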

Asking help of pandas as groupby function?

As a newcomer to Python, could you help me create the "Number" column as a sequence that increases whenever the name in the "test1" column changes? Thanks.
(for example: with the pandas groupby function?)
Try:
# a new run starts wherever test1 differs from the previous row; cumsum numbers the runs
df["Number"] = (df.test1 != df.test1.shift()).cumsum()
print(df)
Prints:
test1 Number
0 AAA 1
1 AAA 1
2 AAA 1
3 AAA 1
4 BBB 2
5 BBB 2
6 BBB 2
7 AAA 3
8 AAA 3
9 AAA 3
10 CCC 4
11 CCC 4

Subtract 2 Dataframes if Numerical Columns; for text use most recent dataframe

I have two dataframes, df_new and df_old; they have the same number and names of fields. I want to subtract df_old from df_new for cells in numerical fields. For cells in text fields, I just want to use the df_new value.
I'm trying to have the result in a separate dataframe called changes.
I need help incorporating a dtype check condition on the formula below so it will work only on numerical fields.
df_new.set_index('Index').subtract(df_old.set_index('Index'), fill_value=0)
NEW
Part Qty Stock Notes
0 AAA 40 10 yyy
1 BBB 40 10 yyy
2 CCC 50 20 yyy
3 DDD 40 10
4 EEE 40 10
5 FFF 40 10
OLD
Part Qty Stock Notes
0 AAA 40 10 xxx
1 BBB 40 10 xxx
2 CCC 40 10 xxx
3 DDD 40 10
4 EEE 40 10
5 FFF 40 10
CHANGES
Part Qty Stock Notes
0 AAA 0 0 yyy
1 BBB 0 0 yyy
2 CCC 10 10 yyy
3 DDD 0 0
4 EEE 0 0
5 FFF 0 0
What about:
from pandas.api.types import is_numeric_dtype
to_subtract = [column for column, dtype in df_new.dtypes.items() if is_numeric_dtype(dtype)]
df_new[to_subtract] = df_new[to_subtract] - df_old[to_subtract]
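To get the result into a separate changes DataFrame that keeps the text fields from df_new, a minimal sketch (assuming both frames align on the same index, e.g. after set_index('Index') as in the question):
from pandas.api.types import is_numeric_dtype

changes = df_new.copy()  # text fields keep the df_new values
num_cols = [c for c in df_new.columns if is_numeric_dtype(df_new[c])]
changes[num_cols] = df_new[num_cols] - df_old[num_cols]  # numeric fields: new minus old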

Groupby sum, sort and transpose

I am new to pandas and the groupby functionality.
I have a DataFrame of customer transactions, as shown below. I want to find the top two Dprtmnt per Cus_No based on their total Amount.
Cus_No Date Dprtmnt Amount
111 6-Jun-18 AAA 100
111 6-Jun-18 AAA 50
111 8-Jun-18 BBB 125
111 8-Aug-18 CCC 130
111 12-Dec-18 BBB 200
111 15-Feb-17 AAA 10
111 18-Jan-18 AAA 20
222 6-Jun-18 DDD 100
222 6-Jun-18 AAA 50
222 8-Jun-18 AAA 125
222 8-Aug-18 DDD 130
222 12-Dec-18 AAA 200
222 15-Feb-17 CCC 10
222 18-Jan-18 CCC 20
My expected output is shown below.
Cus_No Top1D Top1Sum Top1_Frqnc Top2D Top2Sum Top2_Frqnc
111 BBB 325 2 AAA 180 4
222 AAA 375 3 DDD 230 2
First aggregate with GroupBy.agg using sum and size, sort and take the top 2 per group with GroupBy.head, then reshape with DataFrame.unstack and create the new column names with map and join:
df = (df.groupby(['Cus_No','Dprtmnt'])['Amount']
.agg([('Sum','sum'),('Frqnc','size')])
.sort_values('Sum', ascending=False)
.groupby(level=0).head(2))
df = (df.set_index(df.groupby(level=0).cumcount().add(1).astype(str), append=True)
.reset_index(level=1)
.unstack()
.sort_index(axis=1, level=1))
df.columns = df.columns.map(''.join)
df = df.reset_index()
print (df)
Cus_No Dprtmnt1 Frqnc1 Sum1 Dprtmnt2 Frqnc2 Sum2
0 111 BBB 2 325 AAA 4 180
1 222 AAA 3 375 DDD 2 230
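Optionally, a sketch renaming to the column names from the expected output (assuming exactly two top departments per customer):
df = df.rename(columns={'Dprtmnt1': 'Top1D', 'Sum1': 'Top1Sum', 'Frqnc1': 'Top1_Frqnc',
                        'Dprtmnt2': 'Top2D', 'Sum2': 'Top2Sum', 'Frqnc2': 'Top2_Frqnc'})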

Pandas: Add new column with several values to groupby dataframe

For my dataframe, I want to add a new column for every unique value in another column. The new column consists of several datetime entries that every unique value of the other column should get.
Example:
Original Df:
ID
1
2
3
New Column DF:
Date
2015/01/01
2015/02/01
2015/03/01
Resulting Df:
ID Date
1 2015/01/01
2015/02/01
2015/03/01
2 2015/01/01
2015/02/01
2015/03/01
3 2015/01/01
2015/02/01
2015/03/01
I tried to stick to this solution: https://stackoverflow.com/a/12394122/3856569
But it gives me the following error: Length of values does not match length of index
Anyone has a simple solution to do that? Thanks a lot!
UPDATE: replicating ids 6 times:
In [172]: %paste
import io
import pandas as pd

data = """\
id
1
2
3
"""
df = pd.read_csv(io.StringIO(data))
# repeat each ID 6 times
df = pd.DataFrame(df['id'].tolist()*6, columns=['id'])
start_date = pd.to_datetime('2015-01-01')
df['date'] = start_date
df['date'] = df.groupby('id', as_index=False)\
.transform(lambda x: pd.date_range(start_date,
freq='1D',
periods=len(x)))
df.sort_values(by=['id','date'])
## -- End pasted text --
Out[172]:
id date
0 1 2015-01-01
3 1 2015-01-02
6 1 2015-01-03
9 1 2015-01-04
12 1 2015-01-05
15 1 2015-01-06
1 2 2015-01-01
4 2 2015-01-02
7 2 2015-01-03
10 2 2015-01-04
13 2 2015-01-05
16 2 2015-01-06
2 3 2015-01-01
5 3 2015-01-02
8 3 2015-01-03
11 3 2015-01-04
14 3 2015-01-05
17 3 2015-01-06
OLD more generic answer:
prepare sample DF:
start_date = pd.to_datetime('2015-01-01')
data = """\
id
1
2
2
3
1
2
3
2
1
"""
df = pd.read_csv(io.StringIO(data))
In [200]: df
Out[200]:
id
0 1
1 2
2 2
3 3
4 1
5 2
6 3
7 2
8 1
Solution:
In [201]: %paste
df['date'] = start_date
df['date'] = df.groupby('id', as_index=False)\
.transform(lambda x: pd.date_range(start_date,
freq='1D',
periods=len(x)))
## -- End pasted text --
In [202]: df
Out[202]:
id date
0 1 2015-01-01
1 2 2015-01-01
2 2 2015-01-02
3 3 2015-01-01
4 1 2015-01-02
5 2 2015-01-03
6 3 2015-01-02
7 2 2015-01-04
8 1 2015-01-03
Sorted:
In [203]: df.sort_values(by='id')
Out[203]:
id date
0 1 2015-01-01
4 1 2015-01-02
8 1 2015-01-03
1 2 2015-01-01
2 2 2015-01-02
5 2 2015-01-03
7 2 2015-01-04
3 3 2015-01-01
6 3 2015-01-02
A rather straightforward numpy approach, making use of repeat and tile:
import numpy as np
import pandas as pd
N = 3 # arbitrary number of IDs/dates
ID = np.arange(N) + 1
dates = pd.date_range('20160101', periods=N)
df = pd.DataFrame({'ID' : np.repeat(ID, N),
'dates' : np.tile(dates, N)})
Resulting DataFrame:
In [1]: df
Out[1]:
ID dates
0 1 2016-01-01
1 1 2016-01-02
2 1 2016-01-03
3 2 2016-01-01
4 2 2016-01-02
5 2 2016-01-03
6 3 2016-01-01
7 3 2016-01-02
8 3 2016-01-03
Update
Assuming you already have a DataFrame of IDs, as pointed out by MaxU, you can tile the IDs and repeat the dates:
df = pd.DataFrame({'ID' : np.tile(df['ID'], N),
                   'dates' : np.repeat(dates, N)})
# now df needs sorting
df = df.sort_values(by=['ID', 'dates'])
Resulting DataFrame:
In [5]: df
Out[5]:
ID dates
0 1 2016-01-01
3 1 2016-01-02
6 1 2016-01-03
1 2 2016-01-01
4 2 2016-01-02
7 2 2016-01-03
2 3 2016-01-01
5 3 2016-01-02
8 3 2016-01-03
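On newer pandas (>= 1.2) the same ID-by-date product can be written as a cross join; a sketch assuming the ID frame and dates from above:
df = (df[['ID']].drop_duplicates()
      .merge(pd.DataFrame({'dates': dates}), how='cross')
      .sort_values(by=['ID', 'dates']))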