Pandas groupby multi conditions and date difference calculation - pandas

I am stuck understanding the method to use. I have the following dataframe:
df = {'CODE': ['BBLGLC70M','BBLGLC70M','ZZTNRD77', 'ZZTNRD77', 'AACCBD', 'AACCBD', 'BCCDN', 'BCCDN', 'BCCDN'],
'DATE': ['16/05/2019','25/09/2019', '16/03/2020', '27/02/2020', '16/07/2020', '21/07/2020', '13/02/2020', '23/07/2020', '27/02/2020'],
'TYPE': ['PRI', 'PRI', 'PRI', 'PRI', 'PUB', 'PUB', 'PUB', 'PRI', 'PUB'],
'DESC' : ['KO', 'OK', 'KO', 'KO', 'KO', 'OK', 'KO', 'OK', 'OK']
}
df = pd.DataFrame(df)
df['DATE'] = pd.to_datetime(df['DATE'], format = '%d/%m/%Y')
df
I need to:
group by the same 'CODE',
check that the 'DESC' is not the same,
check that the 'TYPE' is the same,
calculate the month difference between the dates that satisfy the previous 2 conditions.
The expected output is the below:

The following code uses .duplicated() and .drop_duplicates() to keep or throw out rows of your dataframe that have duplicate values in the relevant columns.
How would you calculate a month's difference? A month can be 28, 29, 30 or 31 days long, so you could divide the end result by 30 to get a rough indication of the number of months. I have kept the difference in days for now; a sketch for converting it to months follows the output below.
import pandas as pd
df = {'CODE': ['BBLGLC70M','BBLGLC70M','ZZTNRD77', 'ZZTNRD77', 'AACCBD', 'AACCBD', 'BCCDN', 'BCCDN', 'BCCDN'],
'DATE': ['16/05/2019','25/09/2019', '16/03/2020', '27/02/2020', '16/07/2020', '21/07/2020', '13/02/2020', '23/07/2020', '27/02/2020'],
'TYPE': ['PRI', 'PRI', 'PRI', 'PRI', 'PUB', 'PUB', 'PUB', 'PRI', 'PUB'],
'DESC' : ['KO', 'OK', 'KO', 'KO', 'KO', 'OK', 'KO', 'OK', 'OK']
}
df = pd.DataFrame(df)
df['DATE'] = pd.to_datetime(df['DATE'], format = '%d/%m/%Y')
# only keep rows that have the same code and type
df = df[df.duplicated(subset=['CODE', 'TYPE'], keep=False)]
# throw out rows that have the same code and desc
df = df.drop_duplicates(subset=['CODE', 'DESC'], keep=False)
# find previous date
df = df.sort_values(by=['CODE', 'DATE'])
df['previous_date'] = df.groupby('CODE')['DATE'].transform('shift')
# drop rows that don't have a previous date
df = df.dropna()
# calculate the difference between current date and previous date
df['difference_in_dates'] = (df['DATE'] - df['previous_date'])
This results in the following df:
CODE DATE TYPE DESC previous_date difference_in_dates
AACCBD 2020-07-21 PUB OK 2020-07-16 5 days
BBLGLC70M 2019-09-25 PRI OK 2019-05-16 132 days
BCCDN 2020-02-27 PUB OK 2020-02-13 14 days
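If an actual month count is preferred, here is a small sketch (my own addition, not part of the answer above) that counts whole calendar months between the two dates, ignoring the day of month:
# whole calendar months between previous_date and DATE (day of month ignored)
df['difference_in_months'] = (
    (df['DATE'].dt.year - df['previous_date'].dt.year) * 12
    + (df['DATE'].dt.month - df['previous_date'].dt.month)
)
# or, as a rough approximation, divide the day difference by 30:
# df['difference_in_dates'].dt.days / 30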

Related

Filtering a specific column in a pandas DataFrame using the 'filter' keyword and a lambda function

How can I use the 'filter' keyword and a lambda function to filter a specific column in a pandas DataFrame?
import pandas as pd
data = [{
'Language': 'Python',
'Percent grow': 56
}, {
'Language': 'Java',
'Percent grow': 34
}, {
'Language': 'C',
'Percent grow': 25
}, {
'Language': 'C++',
'Percent grow': 12
}, {
'Language': 'go',
'Percent grow': 5
}]
df = pd.DataFrame(data)
f = lambda x : x['Percent grow'] > 30
df.filter(f)
IIUC, you want pandas.DataFrame.loc with boolean indexing:
df.loc[df["Percent grow"].gt(30)]
To do it with pandas.DataFrame.filter, one way would be:
(
    df
    .set_index("Percent grow")
    .pipe(lambda x: x.filter([i for i in x.index if i > 30], axis=0))
    .reset_index()
    .reindex(df.columns, axis=1)
)
Output:
Language Percent grow
0 Python 56
1 Java 34
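As a side note (my own addition, not part of the answer): DataFrame.filter selects on index or column labels, never on cell values, which is why the set_index step above is needed before filtering on the grow values. For example:
df.filter(items=[0, 2], axis=0)   # rows whose index label is 0 or 2
df.filter(like='Lang', axis=1)    # columns whose name contains 'Lang'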

grouper day and cumsum speed

I have the following df:
I want to group this df on the first column (ID) and the second column (key), and from there build a cumsum for each day. The cumsum should be over the last column (Speed).
I tried this with the following code :
df = pd.read_csv('df.csv')
df['Time'] = pd.to_datetime(df['Time'], format='%Y-%m-%d %H:%M:%S')
df = df.sort_values(['ID','key'])
grouped = df.groupby(['ID','key'])
test = pd.DataFrame()
test2 = pd.DataFrame()
for name, group in grouped:
    test = group.groupby(pd.Grouper(key='Time', freq='1d'))['Speed'].cumsum()
    test = test.reset_index()
    test['ID'] = ''
    test['ID'] = name[0]
    test['key'] = ''
    test['key'] = name[1]
    test2 = test2.append(test)
But the result seems quite off: there are more rows than 5. I want one row per day with the cumsum for each ID and key.
Does anyone see the reason for my problem?
Thanks in advance.
Friendly reminder: it's useful to include a runnable example.
import pandas as pd
data = [{"cid":33613,"key":14855,"ts":1550577600000,"value":50.0},
{"cid":33613,"key":14855,"ts":1550579340000,"value":50.0},
{"cid":33613,"key":14855,"ts":1550584800000,"value":50.0},
{"cid":33613,"key":14855,"ts":1550682000000,"value":50.0},
{"cid":33613,"key":14855,"ts":1550685900000,"value":50.0},
{"cid":33613,"key":14855,"ts":1550773380000,"value":50.0},
{"cid":33613,"key":14855,"ts":1550858400000,"value":50.0},
{"cid":33613,"key":14855,"ts":1550941200000,"value":25.0},
{"cid":33613,"key":14855,"ts":1550978400000,"value":50.0}]
df = pd.DataFrame(data)
df['ts'] = pd.to_datetime(df['ts'], unit='ms')
I believe what you need can be accomplished as follows:
df.set_index('ts').groupby(['cid', 'key'])['value'].resample('D').sum().cumsum()
Result:
cid key ts
33613 14855 2019-02-19 150.0
2019-02-20 250.0
2019-02-21 300.0
2019-02-22 350.0
2019-02-23 375.0
2019-02-24 425.0
Name: value, dtype: float64
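Note that with more than one (cid, key) combination the trailing .cumsum() above keeps accumulating across group boundaries. A hedged variant (my own addition) that restarts the running total for every group:
(df.set_index('ts')
   .groupby(['cid', 'key'])['value']
   .resample('D').sum()
   .groupby(level=['cid', 'key']).cumsum())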

GroupBy Function Not Applying

I am trying to group by the following specializations but I am not getting the expected result (or any result, for that matter). The data stays ungrouped even after this step. Any idea what's wrong in my code?
cols_specials = ['Enterprise ID','Specialization','Specialization Branches','Specialization Type']
specials = pd.read_csv(agg_specials, engine='python')
specials = specials.merge(roster, left_on='Enterprise ID', right_on='Enterprise ID', how='left')
specials = specials[cols_specials]
specials = specials.groupby(['Enterprise ID'])['Specialization'].transform(lambda x: '; '.join(str(x)))
specials.to_csv(end_report_specials, index=False, encoding='utf-8-sig')
Please try using agg. The lambda in your code calls str(x) on the whole group Series, so '; '.join ends up joining the characters of that string representation rather than the values; joining the Series itself works:
import pandas as pd
df = pd.DataFrame(
[
['john', 'eng', 'build'],
['john', 'math', 'build'],
['kevin', 'math', 'asp'],
['nick', 'sci', 'spi']
],
columns = ['id', 'spec', 'type']
)
df.groupby(['id'])[['spec']].agg(lambda x: ';'.join(x))
results in one row per id with the joined specs: eng;math for john, math for kevin, and sci for nick.
If you need to preserve the original number of rows, use transform, which returns a single column aligned to the original index:
df['spec_grouped'] = df.groupby(['id'])[['spec']].transform(lambda x: ';'.join(x))
df
results in the original four rows plus a spec_grouped column holding the joined values (eng;math for john's two rows, math for kevin, sci for nick).
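Applied back to the original pipeline, a sketch using the question's own column and variable names (so end_report_specials etc. are assumed to exist) could replace the transform line with an aggregation before writing the CSV:
specials = (specials.groupby('Enterprise ID')['Specialization']
                    .agg(lambda x: '; '.join(x.astype(str)))
                    .reset_index())
specials.to_csv(end_report_specials, index=False, encoding='utf-8-sig')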

pandas groupby with function as key

I would like to calculate the mean with a timespan of 3 years.
My data are like this:
import pandas as pd
import numpy as np
N=120
data = {'p1': np.random.randint(50,100,N),
'p2': np.random.randint(0,100,N),
'p3': np.random.randint(10,70,N)
}
df = (pd.DataFrame(data, index=pd.bdate_range(start='20100101', periods=N, freq='BM'))
.stack()
.reset_index()
.rename(columns={'level_0': 'date', 'level_1': 'type', 0: 'price'})
.sort_values('date')
)
I tried :
(df.sort_values('date')
.groupby(['type',
''.join([(df.date.dt.year-3), '-', (df.date.dt.year)]) #3 years time span
]
)
['price']
.apply(lambda x: x.mean())
)
but I get an error message:
TypeError: sequence item 0: expected str instance, Series found
I would like to calculate the mean (and other stats) on price, grouped by type and by time periods of 2010-2013, 2011-2014, 2012-2015, and so on.
The label is important, because otherwise I could just use:
(df.sort_values('date')
.groupby(['type', df.date.dt.year//3]) #3 years time span
['price']
.apply(lambda x: x.mean())
)
Any idea?
I think I found the answer to my own question (someone else might be interested). ''.join fails because join expects string items, not Series; the concatenation has to be done element-wise instead, e.g. with Series.str.cat:
(df.sort_values('date')
.groupby(['type', (df.date.dt.year-3).astype(str).str.cat((df.date.dt.year).astype(str), sep='-')
]
)
['price']
.apply(lambda x: x.mean())
)
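Note that each group built with the label above still contains only one calendar year of rows. If genuinely overlapping windows (2010-2013, 2011-2014, ...) are wanted, a rough sketch (my own addition; the exact window boundaries are an assumption):
year = df.date.dt.year
windows = {}
for start in range(year.min(), year.max() - 2):
    label = f'{start}-{start + 3}'
    # rows whose year falls inside the labelled range, averaged per type
    windows[label] = df.loc[year.between(start, start + 3)].groupby('type')['price'].mean()
result = pd.concat(windows)  # MultiIndex of (period label, type)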

How to set frequency with pd.to_datetime()?

When fitting a statsmodel, I'm receiving a warning about the date frequency.
First, I import a dataset:
import pandas as pd
import statsmodels as sm
df = sm.datasets.get_rdataset(package='datasets', dataname='airquality').data
df['Year'] = 1973
df['Date'] = pd.to_datetime(df[['Year', 'Month', 'Day']])
df.drop(columns=['Year', 'Month', 'Day'], inplace=True)
df.set_index('Date', inplace=True, drop=True)
Next I try to fit a SES model:
fit = sm.tsa.api.SimpleExpSmoothing(df['Wind']).fit()
Which returns this warning:
/anaconda3/lib/python3.6/site-packages/statsmodels/tsa/base/tsa_model.py:171: ValueWarning: No frequency information was provided, so inferred frequency D will be used.
% freq, ValueWarning)
My dataset is daily, so the inferred 'D' is fine, but I was wondering how I can set the frequency manually.
Note that the DatetimeIndex doesn't have the freq set (see the last line below):
DatetimeIndex(['1973-05-01', '1973-05-02', '1973-05-03', '1973-05-04',
'1973-05-05', '1973-05-06', '1973-05-07', '1973-05-08',
'1973-05-09', '1973-05-10',
...
'1973-09-21', '1973-09-22', '1973-09-23', '1973-09-24',
'1973-09-25', '1973-09-26', '1973-09-27', '1973-09-28',
'1973-09-29', '1973-09-30'],
dtype='datetime64[ns]', name='Date', length=153, freq=None)
As per this answer I've checked for missing dates, but there don't appear to be any:
pd.date_range(start = '1973-05-01', end = '1973-09-30').difference(df.index)
DatetimeIndex([], dtype='datetime64[ns]', freq='D')
How should I set the frequency for the index?
pd.to_datetime does not set a frequency by default; you need DataFrame.asfreq:
df = df.set_index('Date').asfreq('d')
print (df.index)
DatetimeIndex(['1973-05-01', '1973-05-02', '1973-05-03', '1973-05-04',
'1973-05-05', '1973-05-06', '1973-05-07', '1973-05-08',
'1973-05-09', '1973-05-10',
...
'1973-09-21', '1973-09-22', '1973-09-23', '1973-09-24',
'1973-09-25', '1973-09-26', '1973-09-27', '1973-09-28',
'1973-09-29', '1973-09-30'],
dtype='datetime64[ns]', name='Date', length=153, freq='D')
But if there are duplicated values in the index, you get an error:
df = pd.concat([df, df])
df = df.set_index('Date')
print (df.asfreq('d').index)
ValueError: cannot reindex from a duplicate axis
The solution is to use resample with some aggregate function:
print (df.resample('2D').mean().index)
DatetimeIndex(['1973-05-01', '1973-05-03', '1973-05-05', '1973-05-07',
'1973-05-09', '1973-05-11', '1973-05-13', '1973-05-15',
'1973-05-17', '1973-05-19', '1973-05-21', '1973-05-23',
'1973-05-25', '1973-05-27', '1973-05-29', '1973-05-31',
'1973-06-02', '1973-06-04', '1973-06-06', '1973-06-08',
'1973-06-10', '1973-06-12', '1973-06-14', '1973-06-16',
'1973-06-18', '1973-06-20', '1973-06-22', '1973-06-24',
'1973-06-26', '1973-06-28', '1973-06-30', '1973-07-02',
'1973-07-04', '1973-07-06', '1973-07-08', '1973-07-10',
'1973-07-12', '1973-07-14', '1973-07-16', '1973-07-18',
'1973-07-20', '1973-07-22', '1973-07-24', '1973-07-26',
'1973-07-28', '1973-07-30', '1973-08-01', '1973-08-03',
'1973-08-05', '1973-08-07', '1973-08-09', '1973-08-11',
'1973-08-13', '1973-08-15', '1973-08-17', '1973-08-19',
'1973-08-21', '1973-08-23', '1973-08-25', '1973-08-27',
'1973-08-29', '1973-08-31', '1973-09-02', '1973-09-04',
'1973-09-06', '1973-09-08', '1973-09-10', '1973-09-12',
'1973-09-14', '1973-09-16', '1973-09-18', '1973-09-20',
'1973-09-22', '1973-09-24', '1973-09-26', '1973-09-28',
'1973-09-30'],
dtype='datetime64[ns]', name='Date', freq='2D')
The problem is caused by the frequency not being set explicitly. In most cases you can't be sure that your data has no gaps, so generate a date range with
rng = pd.date_range(start = '1973-05-01', end = '1973-09-30', freq='D')
then reindex your DataFrame with this rng and fill the resulting np.nan values with your method or value of choice, as sketched below.
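A minimal sketch of that last step (my own addition; forward-filling is just one possible choice for the gaps):
rng = pd.date_range(start='1973-05-01', end='1973-09-30', freq='D')
df = df.reindex(rng)               # the new index carries freq='D'
df['Wind'] = df['Wind'].ffill()    # fill any gaps, e.g. by forward-filling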