Is there a way to group columns with multiple conditions using pandas?

My dataframe is similar to:
transaction date cash
0 1 2020-01-01 72
1 2 2020-01-03 100
2 2 2020-01-05 -75
3 3 2020-01-05 82
I want the output to group by transaction and to sum the cash for each transaction (if there are two amounts) BUT ALSO to return the later date. So for transaction 2 the end result would show transaction, date, cash as: 2, 1/5/2020, 25...
I'm not sure how to make tables to help the visuals in my question yet, so sorry; please let me know if there are any questions.

Use groupby + agg. Check the docs examples.
output = df.groupby('transaction').agg({'date': 'max', 'cash': 'sum'})
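A minimal, self-contained sketch of the one-liner above, assuming the sample data from the question. The frame is rebuilt by hand here, and as_index=False is an optional choice (not part of the original answer) that keeps transaction as a regular column:

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "transaction": [1, 2, 2, 3],
    "date": ["2020-01-01", "2020-01-03", "2020-01-05", "2020-01-05"],
    "cash": [72, 100, -75, 82],
})
# 'max' only means "latest" once the column holds real datetimes
df["date"] = pd.to_datetime(df["date"])

# One aggregation per column: latest date, total cash
output = (
    df.groupby("transaction", as_index=False)
      .agg({"date": "max", "cash": "sum"})
)
print(output)
```

For transaction 2 this gives the date 2020-01-05 and a cash total of 25, as requested in the question.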

This solution assumes that the date column is encoded as proper datetime instances. If this is not currently the case, try df['date'] = pd.to_datetime(df['date']) before doing the following.
>>> df
transaction date cash
0 1 2020-01-01 72
1 2 2020-01-03 100
2 2 2020-01-05 -75
3 3 2020-01-05 82
>>> transactions = df.groupby('transaction')
>>> pd.concat((transactions['cash'].sum(), transactions['date'].max()), axis=1)
cash date
transaction
1 72 2020-01-01
2 25 2020-01-05
3 82 2020-01-05
transactions['date'].max() picks the date furthest into the future of those with the same transaction ID.

Related

Create an incremental count from a cumulative count by date segmented by another series in a Pandas data frame

I have cumulative data (the series 'cumulative_count') in a data frame ('df1') that is segmented by the series 'state' and I want to create a new series in the data frame that shows the incremental count by 'state'.
So:
df1 = pd.DataFrame({'date': ['2020-01-03','2020-01-03','2020-01-03','2020-01-04','2020-01-04','2020-01-04','2020-01-05','2020-01-05','2020-01-05'],'state': ['NJ','NY','CT','NJ','NY','CT','NJ','NY','CT'], 'cumulative_count': [1,3,5,3,6,7,19,15,20]})
...is transformed to have the new series added ('incremental count') where the incremental count is calculated by date but also segmented by state with the result generated being...
df2 = pd.DataFrame({'date': ['2020-01-03','2020-01-03','2020-01-03','2020-01-04','2020-01-04','2020-01-04','2020-01-05','2020-01-05','2020-01-05'],'state': ['NJ','NY','CT','NJ','NY','CT','NJ','NY','CT'], 'cumulative_count': [1,3,5,3,6,7,19,15,20],'incremental_count': [1,3,5,2,3,2,16,9,13]})
Any recommendations on how to do this would be greatly appreciated. Thanks!
Since your DataFrame is already sorted by 'date', you are looking to take the diff within each state group. Then fillna to get the correct value for the first date within each state.
df1['incremental_count'] = (df1.groupby('state')['cumulative_count'].diff()
                               .fillna(df1['cumulative_count'], downcast='infer'))
date state cumulative_count incremental_count
0 2020-01-03 NJ 1 1
1 2020-01-03 NY 3 3
2 2020-01-03 CT 5 5
3 2020-01-04 NJ 3 2
4 2020-01-04 NY 6 3
5 2020-01-04 CT 7 2
6 2020-01-05 NJ 19 16
7 2020-01-05 NY 15 9
8 2020-01-05 CT 20 13
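A self-contained sketch of the same diff-then-fillna idea, assuming the data from the question. It swaps downcast='infer' for an explicit astype(int), because recent pandas releases deprecate the downcast keyword of fillna; that swap is a substitution made here, not part of the original answer:

```python
import pandas as pd

df1 = pd.DataFrame({
    "date": ["2020-01-03"] * 3 + ["2020-01-04"] * 3 + ["2020-01-05"] * 3,
    "state": ["NJ", "NY", "CT"] * 3,
    "cumulative_count": [1, 3, 5, 3, 6, 7, 19, 15, 20],
})

# Difference to the previous row of the same state (NaN on each state's first row)
diffs = df1.groupby("state")["cumulative_count"].diff()

# The first row of each state has no predecessor, so its increment is the
# cumulative value itself; astype(int) restores integers after the NaN step
df1["incremental_count"] = diffs.fillna(df1["cumulative_count"]).astype(int)
print(df1)
```

As in the answer, this relies on the rows already being sorted by date within each state.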

How could I remove duplicates if duplicate means less than 30 days apart?

(using sql or pandas)
I want to delete records if the date difference between two records is less than 30 days.
But the first record of each ID must be kept.
#example
ROW ID DATE
1 A 2020-01-01 -- first
2 A 2020-01-03
3 A 2020-01-31
4 A 2020-02-05
5 A 2020-02-28
6 A 2020-03-09
7 B 2020-03-06 -- first
8 B 2020-05-07
9 B 2020-06-02
#expected results
ROW ID DATE
1 A 2020-01-01
4 A 2020-02-05
6 A 2020-03-09
7 B 2020-03-06
8 B 2020-05-07
ROW 2,3 are within 30 days from ROW 1
ROW 5 is within 30 days from ROW 4
ROW 9 is within 30 days from ROW 8
This task cannot be done with vectorized methods alone.
The reason is that once a row is recognized as a duplicate, it
"does not count" when you check the rows after it.
E.g. after rows 2020-01-03 and 2020-01-31 have been deleted (as
"too close" to the previous kept row), the 2020-02-05 row should be
kept, because its distance to the previous kept row (2020-01-01)
is now big enough.
So I came up with a solution based on a "function with memory":
def isDupl(elem):
    if isDupl.prev is None:
        isDupl.prev = elem
        return False
    dDiff = (elem - isDupl.prev).days
    rv = dDiff <= 30
    if not rv:
        isDupl.prev = elem
    return rv
This function should be invoked for each DATE in the
current group (with same ID) but before that isDupl.prev
must be set to None.
So the function to apply to each group of rows is:
def isDuplGrp(grp):
    isDupl.prev = None
    return grp.DATE.apply(isDupl)
And to get the expected result, run:
df[~(df.groupby('ID').apply(isDuplGrp).reset_index(level=0, drop=True))]
(you may save it back to df).
The result is:
ROW ID DATE
0 1 A 2020-01-01
3 4 A 2020-02-05
5 6 A 2020-03-09
6 7 B 2020-03-06
7 8 B 2020-05-07
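The same sequential rule can also be sketched without storing state on a function attribute. This is only an illustrative alternative under the same assumptions (DATE already converted to datetime, rows within 30 days of the last kept row are dropped); the names keep_mask and min_gap_days are made up for this example:

```python
import pandas as pd

df = pd.DataFrame({
    "ROW": range(1, 10),
    "ID": ["A"] * 6 + ["B"] * 3,
    "DATE": pd.to_datetime([
        "2020-01-01", "2020-01-03", "2020-01-31", "2020-02-05", "2020-02-28",
        "2020-03-09", "2020-03-06", "2020-05-07", "2020-06-02",
    ]),
})

def keep_mask(dates, min_gap_days=30):
    """Return one boolean per date: True if the row should be kept."""
    keep = []
    last_kept = None
    for d in dates:
        # Compare against the last KEPT date, not the previous row
        if last_kept is None or (d - last_kept).days > min_gap_days:
            keep.append(True)
            last_kept = d
        else:
            keep.append(False)
    return keep

df = df.sort_values(["ID", "DATE"])
mask = pd.Series(False, index=df.index)
for _, dates in df.groupby("ID")["DATE"]:
    mask.loc[dates.index] = keep_mask(dates)

result = df[mask]
print(result)
```

The explicit loop over groups keeps the "memory" local to each call, so nothing has to be reset between IDs.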
And finally, a remark about the other solution:
It contains rows:
3 4 A 2020-02-05
4 5 A 2020-02-28
which are only 23 days apart, so this solution is wrong.
The same pertains to rows:
5 A 2020-02-28
6 A 2020-03-09
which are also too close in time.
You can try this:
Convert DATE to datetime64
Get the first date of each group with df.groupby('ID')['DATE'].transform('first')
Add a filter to keep only rows at least 30 days after the first date of their group
Append the first row of each group to the dataframe
Code:
df['DATE'] = pd.to_datetime(df['DATE'])
df1 = df[(df['DATE'] - df.groupby('ID')['DATE'].transform('first')) >= pd.Timedelta(30, unit='D')]
df1 = pd.concat([df1, df.groupby('ID', as_index=False).agg('first')]).sort_values(by=['ID', 'DATE'])
print(df1)
ROW ID DATE
0 1 A 2020-01-01
2 3 A 2020-01-31
3 4 A 2020-02-05
4 5 A 2020-02-28
5 6 A 2020-03-09
1 7 B 2020-03-06
7 8 B 2020-05-07
8 9 B 2020-06-02

Pandas groupby and rolling window

I'm trying to calculate the sum of one field over a specific period of time, after a grouping function is applied.
My dataset looks like this:
Date Company Country Sold
01.01.2020 A BE 1
02.01.2020 A BE 0
03.01.2020 A BE 1
03.01.2020 A BE 1
04.01.2020 A BE 1
05.01.2020 B DE 1
06.01.2020 B DE 0
I would like to add a new column that, for each row, holds the sum of Sold (per group "Company, Country") over the last 7 days, not including the current day:
Date Company Country Sold LastWeek_Count
01.01.2020 A BE 1 0
02.01.2020 A BE 0 1
03.01.2020 A BE 1 1
03.01.2020 A BE 1 1
04.01.2020 A BE 1 3
05.01.2020 B DE 1 0
06.01.2020 B DE 0 1
I tried the following, but it also includes the current date, and it gives different values for the same date, e.g. 03.01.2020:
df['LastWeek_Count'] = df.groupby(['Company', 'Country']).rolling(7, on ='Date')['Sold'].sum().reset_index()
Is there a built-in function in pandas that I can use to perform these calculations?
You can use a .rolling window of 8 rows and then subtract the current date's total (within each group), which effectively leaves the previous 7 days. For this sample data we should also pass min_periods=1 (otherwise you will get NaN values; for your actual dataset you will need to decide what to do with windows shorter than 8).
After the rolling sum, do another .groupby on the same columns plus Date, and take the max of the newly created LastWeek_Count column. You need the max because there can be multiple records per day, and the largest running sum for a date is the total accumulated through that date.
Then create a series holding the per-Date sum of Sold within each group. In the final step, subtract that per-date sum from the rolling 8-row max. This is a workaround, as there is not a parameter for an offset with .rolling:
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
df['LastWeek_Count'] = df.groupby(['Company', 'Country']).rolling(8, min_periods=1, on='Date')['Sold'].sum().reset_index()['Sold']
df['LastWeek_Count'] = df.groupby(['Company', 'Country', 'Date'])['LastWeek_Count'].transform('max')
s = df.groupby(['Company', 'Country', 'Date'])['Sold'].transform('sum')
df['LastWeek_Count'] = (df['LastWeek_Count']-s).astype(int)
Out[17]:
Date Company Country Sold LastWeek_Count
0 2020-01-01 A BE 1 0
1 2020-01-02 A BE 0 1
2 2020-01-03 A BE 1 1
3 2020-01-03 A BE 1 1
4 2020-01-04 A BE 1 3
5 2020-01-05 B DE 1 0
6 2020-01-06 B DE 0 1
One way would be to first consolidate the Sold value of each group (['Date', 'Company', 'Country']) on a single line using a temporary DF.
After that, apply your .groupby with .rolling with an interval of 8 rows.
After calculating the sum, subtract each line's Sold value from it, and add the resulting column to the original DF with .merge
#convert Date column to datetime
df['Date'] = pd.to_datetime(df['Date'], format='%d.%m.%Y')
#create a temporary DataFrame
df2 = df.groupby(['Date', 'Company', 'Country'])['Sold'].sum().reset_index()
#calc the lastweek
df2['LastWeek_Count'] = (df2.groupby(['Company', 'Country'])
.rolling(8, min_periods=1, on = 'Date')['Sold']
.sum().reset_index(drop=True)
)
#subtract the current 'Sold' from the 8-row rolling sum
df2['LastWeek_Count'] = df2['LastWeek_Count'] - df2['Sold']
#add the new column to the original DF
df.merge(df2.drop(columns=['Sold']), on = ['Date', 'Company', 'Country'])
#output:
Date Company Country Sold LastWeek_Count
0 2020-01-01 A BE 1 0.0
1 2020-01-02 A BE 0 1.0
2 2020-01-03 A BE 1 1.0
3 2020-01-03 A BE 1 1.0
4 2020-01-04 A BE 1 3.0
5 2020-01-05 B DE 1 0.0
6 2020-01-06 B DE 0 1.0
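The daily-total approach can be sketched end to end as follows, assuming the sample data from the question. Two choices here are not from the answers: the variable names (keys, daily, rolled) are made up for the example, and the rolling sum is computed with transform so that the result stays aligned with the rows of the temporary frame regardless of their order. Note that rolling(8) counts rows, so it only equals "7 previous calendar days" when each group has one row per day with no gaps.

```python
import pandas as pd

df = pd.DataFrame({
    "Date": ["01.01.2020", "02.01.2020", "03.01.2020", "03.01.2020",
             "04.01.2020", "05.01.2020", "06.01.2020"],
    "Company": ["A"] * 5 + ["B"] * 2,
    "Country": ["BE"] * 5 + ["DE"] * 2,
    "Sold": [1, 0, 1, 1, 1, 1, 0],
})
df["Date"] = pd.to_datetime(df["Date"], format="%d.%m.%Y")

keys = ["Company", "Country"]

# One row per group and date, holding that day's total
daily = df.groupby(keys + ["Date"], as_index=False)["Sold"].sum()
daily = daily.sort_values(keys + ["Date"])

# Sum over the current row plus up to 7 earlier rows of the same group
rolled = daily.groupby(keys)["Sold"].transform(
    lambda s: s.rolling(8, min_periods=1).sum()
)

# Removing the current day's total leaves only the earlier days
daily["LastWeek_Count"] = (rolled - daily["Sold"]).astype(int)

# Attach the per-date value to every original row
result = df.merge(daily.drop(columns="Sold"), on=keys + ["Date"], how="left")
print(result)
```

Both rows dated 2020-01-03 receive the same LastWeek_Count, which was the inconsistency reported in the question.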

Pandas Lambda Function Format Month and Day

I have a DF "ltyc" that looks like this:
month day wind_speed
0 1 1 11.263604
1 1 2 11.971495
2 1 3 11.989080
3 1 4 12.558736
4 1 5 11.850899
And I apply a lambda function:
ltyc['date'] = pd.to_datetime(ltyc["month"], format='%m').apply(lambda dt: dt.replace(year=2020))
To get it to look like this:
month day wind_speed date
0 1 1 11.263604 2020-01-01
1 1 2 11.971495 2020-01-01
2 1 3 11.989080 2020-01-01
3 1 4 12.558736 2020-01-01
4 1 5 11.850899 2020-01-01
Except that I need the days to change as well, as shown below, and I cannot figure out how to write the lambda statement to do this:
month day wind_speed date
0 1 1 11.263604 2020-01-01
1 1 2 11.971495 2020-01-02
2 1 3 11.989080 2020-01-03
3 1 4 12.558736 2020-01-04
4 1 5 11.850899 2020-01-05
I have tried this:
ltyc['date'] = pd.to_datetime(ltyc["month"], format='%m%d').apply(lambda dt: dt.replace(year=2020))
and I get this error:
ValueError: time data '1' does not match format '%m%d' (match)
Thank you for the help, since I'm still trying to figure out lambda functions.
Create a series with value 2020 and name year. Concat it to ['month', 'day'] and pass the result to pd.to_datetime. As long as you pass a dataframe with columns named year, month and day, pd.to_datetime will convert it to the appropriate datetime series.
#Allolz suggestion:
ltyc['date'] = pd.to_datetime(ltyc[['day', 'month']].assign(year=2020))
Out[367]:
month day wind_speed date
0 1 1 11.263604 2020-01-01
1 1 2 11.971495 2020-01-02
2 1 3 11.989080 2020-01-03
3 1 4 12.558736 2020-01-04
4 1 5 11.850899 2020-01-05
Or you may use reindex to create the sub-dataframe to pass to pd.to_datetime
ltyc['date'] = pd.to_datetime(ltyc.reindex(['year','month','day'],
axis=1, fill_value=2020))
Original:
s = pd.Series([2020]*len(ltyc), name='year')
ltyc['date'] = pd.to_datetime(pd.concat([s, ltyc[['month','day']]], axis=1))
This is similar to a previous answer, but does not persist the 'helper' column with the year. In brief, we pass a data frame with three columns (year, month, day) to the to_datetime() function.
ltyc['date'] = pd.to_datetime(ltyc
.assign(year=2020)
.filter(['year', 'month', 'day'])
)
You could also use your method and add month and day together with .astype(str) and then add %d to the format. The problem with your lambda is that you only considered month, so this is how you would consider month and day.
ltyc['date'] = (pd.to_datetime(ltyc["month"].astype(str) + '-' + ltyc["day"].astype(str),
format='%m-%d')
.apply(lambda dt: dt.replace(year=2020)))
output:
month day wind_speed date
0 1 1 11.263604 2020-01-01
1 1 2 11.971495 2020-01-02
2 1 3 11.989080 2020-01-03
3 1 4 12.558736 2020-01-04
4 1 5 11.850899 2020-01-05

Drop Duplicates based on Nearest Datetime condition

import pandas as pd
def nearest(items, pivot):
    return min(items, key=lambda x: abs(x - pivot))
df = pd.read_csv("C:/Files/input.txt", dtype=str)
# .copy() so the assignment below does not act on a view of df
duplicatesDf = df[df.duplicated(subset=['CLASS_ID', 'START_TIME', 'TEACHER_ID'], keep=False)].copy()
duplicatesDf['START_TIME'] = pd.to_datetime(duplicatesDf['START_TIME'], format='%Y/%m/%d %H:%M:%S.%f')
print(duplicatesDf)
# only duplicatesDf has START_TIME parsed as datetime, so use it for .dt
print(duplicatesDf['START_TIME'].dt.date)
df:
ID,CLASS_ID,START_TIME,TEACHER_ID,END_TIME
1,123,2020/06/01 20:47:26.000,o1,2020/06/02 00:00:00.000
2,123,2020/06/01 20:47:26.000,o1,2020/06/04 20:47:26.000
3,789,2020/06/01 20:47:26.000,o3,2020/06/03 14:47:26.000
4,789,2020/06/01 20:47:26.000,o3,2020/06/03 14:40:00.000
5,456,2020/06/01 20:47:26.000,o5,2020/06/08 20:00:26.000
So, I've got a dataframe like the one above. As you can see, I have multiple records with the same CLASS_ID, START_TIME and TEACHER_ID. Whenever multiple records like these are present, I would like to retain only one record, namely the one whose END_TIME is nearest to its START_TIME (at minute-level precision).
In this case,
for CLASS_ID 123, the record with ID 1 will be retained, as its END_TIME 2020/06/02 00:00:00.000 is nearer to its START_TIME 2020/06/01 20:47:26.000 than that of the record with ID 2, whose END_TIME is 2020/06/04 20:47:26.000. Similarly, for CLASS_ID 789, the record with ID 4 will be retained.
Hence the expected output will be:
ID,CLASS_ID,START_TIME,TEACHER_ID,END_TIME
1,123,2020/06/01 20:47:26.000,o1,2020/06/02 00:00:00.000
4,789,2020/06/01 20:47:26.000,o3,2020/06/03 14:40:00.000
5,456,2020/06/01 20:47:26.000,o5,2020/06/08 20:00:26.000
I've been going through the following links,
https://stackoverflow.com/a/32237949,
https://stackoverflow.com/a/33043374
to find a solution but have unfortunately reached an impasse.
Hence, would some kind soul mind helping me out a bit. Many thanks.
IIUC, we can use .loc and idxmin() after creating a conditional column that measures the elapsed time between the start and the end; idxmin() is applied as a groupby operation on your CLASS_ID column. This assumes START_TIME and END_TIME have already been converted with pd.to_datetime.
df.loc[
    df.assign(mins=(df["END_TIME"] - df["START_TIME"]))
    .groupby("CLASS_ID")["mins"]
    .idxmin()
]
ID CLASS_ID START_TIME TEACHER_ID END_TIME
0 1 123 2020-06-01 20:47:26 o1 2020-06-02 00:00:00
4 5 456 2020-06-01 20:47:26 o5 2020-06-08 20:00:26
3 4 789 2020-06-01 20:47:26 o3 2020-06-03 14:40:00
In steps.
First, the time delta:
print(df.assign(mins=(df["END_TIME"] - df["START_TIME"]))[['CLASS_ID','mins']])
CLASS_ID mins
0 123 0 days 03:12:34
1 123 3 days 00:00:00
2 789 1 days 18:00:00
3 789 1 days 17:52:34
4 456 6 days 23:13:00
Then, the index of the minimum time delta within each CLASS_ID group:
print(df.assign(mins=(df["END_TIME"] - df["START_TIME"]) )
.groupby("CLASS_ID")["mins"]
.idxmin())
CLASS_ID
123 0
456 4
789 3
Name: mins, dtype: int64
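Putting the steps together, here is a self-contained sketch that starts from the raw text in the question. Several details are assumptions made for this example rather than part of the answer: the data is read from an inline string instead of the file path in the question, both time columns are parsed explicitly, the grouping uses all three columns named in the question (CLASS_ID, START_TIME, TEACHER_ID), and the elapsed time is compared in seconds rather than rounded to minutes.

```python
import io
import pandas as pd

csv_text = """ID,CLASS_ID,START_TIME,TEACHER_ID,END_TIME
1,123,2020/06/01 20:47:26.000,o1,2020/06/02 00:00:00.000
2,123,2020/06/01 20:47:26.000,o1,2020/06/04 20:47:26.000
3,789,2020/06/01 20:47:26.000,o3,2020/06/03 14:47:26.000
4,789,2020/06/01 20:47:26.000,o3,2020/06/03 14:40:00.000
5,456,2020/06/01 20:47:26.000,o5,2020/06/08 20:00:26.000
"""

df = pd.read_csv(io.StringIO(csv_text), dtype=str)

# Columns were read as strings, so parse them before subtracting
fmt = "%Y/%m/%d %H:%M:%S.%f"
df["START_TIME"] = pd.to_datetime(df["START_TIME"], format=fmt)
df["END_TIME"] = pd.to_datetime(df["END_TIME"], format=fmt)

# Elapsed time per record, as plain seconds
elapsed = (df["END_TIME"] - df["START_TIME"]).abs().dt.total_seconds()

# Index label of the shortest elapsed time within each duplicate group
keys = ["CLASS_ID", "START_TIME", "TEACHER_ID"]
keep_idx = df.assign(elapsed=elapsed).groupby(keys)["elapsed"].idxmin()

# sort_index restores the original row order
result = df.loc[keep_idx].sort_index()
print(result)
```

Records that have no duplicate form a group of one, so they are kept unchanged (ID 5 here).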