How can I use the 'filter' keyword and a lambda function to filter a specific column in a pandas DataFrame?
import pandas as pd
data = [
    {'Language': 'Python', 'Percent grow': 56},
    {'Language': 'Java', 'Percent grow': 34},
    {'Language': 'C', 'Percent grow': 25},
    {'Language': 'C++', 'Percent grow': 12},
    {'Language': 'go', 'Percent grow': 5},
]
df = pd.DataFrame(data)
f = lambda x: x['Percent grow'] > 30
df.filter(f)  # raises TypeError: DataFrame.filter selects by label and does not accept a callable
If I understand correctly, you want pandas.DataFrame.loc with boolean indexing, since DataFrame.filter works on the index and column labels rather than on the data values:
df.loc[df["Percent grow"].gt(30)]
To do it with pandas.DataFrame.filter anyway, one way would be:
(
    df
    .set_index("Percent grow")
    .pipe(lambda d: d.filter([i for i in d.index if i > 30], axis=0))
    .reset_index()
    .reindex(df.columns, axis=1)
)
Output:
  Language  Percent grow
0   Python            56
1     Java            34
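For completeness, the same row filter can be written with pandas.DataFrame.query (backtick quoting of column names containing spaces needs pandas 0.25+):

df.query("`Percent grow` > 30")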
I have the following df:
I want to group this df on the first column (ID) and the second column (key), and from there build a cumulative sum per day. The cumsum should be computed on the last column (Speed).
I tried the following code:
df = pd.read_csv('df.csv')
df['Time'] = pd.to_datetime(df['Time'], format='%Y-%m-%d %H:%M:%S')
df = df.sort_values(['ID', 'key'])
grouped = df.groupby(['ID', 'key'])
test2 = pd.DataFrame()
for name, group in grouped:
    test = group.groupby(pd.Grouper(key='Time', freq='1d'))['Speed'].cumsum()
    test = test.reset_index()
    test['ID'] = name[0]
    test['key'] = name[1]
    test2 = test2.append(test)  # DataFrame.append is deprecated; pd.concat is the modern spelling
But the result seems quite off: there are more than 5 rows, instead of one row per day with the cumsum for each ID and key.
Does anyone see the reason for my problem?
Thanks in advance.
Friendly reminder: it's useful to include a runnable example:
import pandas as pd
data = [{"cid":33613,"key":14855,"ts":1550577600000,"value":50.0},
{"cid":33613,"key":14855,"ts":1550579340000,"value":50.0},
{"cid":33613,"key":14855,"ts":1550584800000,"value":50.0},
{"cid":33613,"key":14855,"ts":1550682000000,"value":50.0},
{"cid":33613,"key":14855,"ts":1550685900000,"value":50.0},
{"cid":33613,"key":14855,"ts":1550773380000,"value":50.0},
{"cid":33613,"key":14855,"ts":1550858400000,"value":50.0},
{"cid":33613,"key":14855,"ts":1550941200000,"value":25.0},
{"cid":33613,"key":14855,"ts":1550978400000,"value":50.0}]
df = pd.DataFrame(data)
df['ts'] = pd.to_datetime(df['ts'], unit='ms')
I believe what you need can be accomplished as follows. (Your loop keeps one row per original observation, because GroupBy.cumsum does not aggregate; resampling to daily sums first fixes that.)
df.set_index('ts').groupby(['cid', 'key'])['value'].resample('D').sum().cumsum()
Result:
cid    key    ts
33613  14855  2019-02-19    150.0
              2019-02-20    250.0
              2019-02-21    300.0
              2019-02-22    350.0
              2019-02-23    375.0
              2019-02-24    425.0
Name: value, dtype: float64
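One caveat: with several (cid, key) groups the trailing .cumsum() runs over the whole resampled Series, so it would accumulate across group boundaries. A sketch of a per-group variant:

daily = df.set_index('ts').groupby(['cid', 'key'])['value'].resample('D').sum()
daily.groupby(level=['cid', 'key']).cumsum()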
I am trying to group by the following specializations, but I am not getting the expected result (or any, for that matter). The data stays ungrouped even after this step. Any idea what's wrong in my code?
cols_specials = ['Enterprise ID','Specialization','Specialization Branches','Specialization Type']
specials = pd.read_csv(agg_specials, engine='python')
specials = specials.merge(roster, left_on='Enterprise ID', right_on='Enterprise ID', how='left')
specials = specials[cols_specials]
specials = specials.groupby(['Enterprise ID'])['Specialization'].transform(lambda x: '; '.join(str(x)))
specials.to_csv(end_report_specials, index=False, encoding='utf-8-sig')
In your code, str(x) turns the whole Series into one string, so join iterates over the characters of that string rather than over the values. Please try using agg instead:
import pandas as pd
df = pd.DataFrame(
    [
        ['john', 'eng', 'build'],
        ['john', 'math', 'build'],
        ['kevin', 'math', 'asp'],
        ['nick', 'sci', 'spi'],
    ],
    columns=['id', 'spec', 'type'],
)
df.groupby(['id'])[['spec']].agg(lambda x: ';'.join(x))
results in:
           spec
id
john   eng;math
kevin      math
nick        sci
If you need to preserve the original number of rows, use transform; transform returns a column aligned with the original index:
df['spec_grouped'] = df.groupby(['id'])[['spec']].transform(lambda x: ';'.join(x))
df
results in:
      id  spec   type spec_grouped
0   john   eng  build     eng;math
1   john  math  build     eng;math
2  kevin  math    asp         math
3   nick   sci    spi          sci
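Applied to the original column names from the question, a sketch of the corresponding fix: join the Series values themselves rather than str(x), and assign the result to a column instead of replacing the whole DataFrame:

specials['Specialization'] = (specials.groupby('Enterprise ID')['Specialization']
                                      .transform(lambda x: '; '.join(x.astype(str))))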
I would like to calculate the mean over a timespan of 3 years.
My data look like this:
import pandas as pd
import numpy as np
N = 120
data = {'p1': np.random.randint(50, 100, N),
        'p2': np.random.randint(0, 100, N),
        'p3': np.random.randint(10, 70, N)}
df = (pd.DataFrame(data, index=pd.bdate_range(start='20100101', periods=N, freq='BM'))
        .stack()
        .reset_index()
        .rename(columns={'level_0': 'date', 'level_1': 'type', 0: 'price'})
        .sort_values('date')
      )
I tried:
(df.sort_values('date')
   .groupby(['type',
             ''.join([(df.date.dt.year - 3), '-', (df.date.dt.year)])  # 3-year time span
            ])
   ['price']
   .apply(lambda x: x.mean())
)
but I get this error message:
TypeError: sequence item 0: expected str instance, Series found
I would like to calculate the mean (and other stats) on price, grouped by type and by time periods 2010-2013, 2011-2014, 2012-2015, and so on.
The label is important, because otherwise I could simply use:
(df.sort_values('date')
   .groupby(['type', df.date.dt.year // 3])  # 3-year time span
   ['price']
   .apply(lambda x: x.mean())
)
Any idea?
I think I found the answer to my own question (someone else might be interested):
(df.sort_values('date')
   .groupby(['type',
             (df.date.dt.year - 3).astype(str).str.cat(df.date.dt.year.astype(str), sep='-')
            ])
   ['price']
   .apply(lambda x: x.mean())
)
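Note that the (year-3)-(year) label still gives every calendar year its own group. If the intent is genuine 3-year buckets with readable labels, one sketch combines the year//3 idea with string labels (the bucket boundaries here are my assumption):

start = (df.date.dt.year // 3) * 3
labels = start.astype(str).str.cat((start + 2).astype(str), sep='-')
df.sort_values('date').groupby(['type', labels])['price'].mean()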
When fitting a statsmodels model, I'm receiving a warning about the date frequency.
First, I import a dataset:
import pandas as pd
import statsmodels.api as sm
df = sm.datasets.get_rdataset(package='datasets', dataname='airquality').data
df['Year'] = 1973
df['Date'] = pd.to_datetime(df[['Year', 'Month', 'Day']])
df.drop(columns=['Year', 'Month', 'Day'], inplace=True)
df.set_index('Date', inplace=True, drop=True)
Next I try to fit an SES model:
fit = sm.tsa.SimpleExpSmoothing(df['Wind']).fit()
Which returns this warning:
/anaconda3/lib/python3.6/site-packages/statsmodels/tsa/base/tsa_model.py:171: ValueWarning: No frequency information was provided, so inferred frequency D will be used.
% freq, ValueWarning)
My dataset is daily, so the inferred 'D' is fine, but I was wondering how I can set the frequency manually.
Note that the DatetimeIndex doesn't have its freq set (see the last line):
DatetimeIndex(['1973-05-01', '1973-05-02', '1973-05-03', '1973-05-04',
'1973-05-05', '1973-05-06', '1973-05-07', '1973-05-08',
'1973-05-09', '1973-05-10',
...
'1973-09-21', '1973-09-22', '1973-09-23', '1973-09-24',
'1973-09-25', '1973-09-26', '1973-09-27', '1973-09-28',
'1973-09-29', '1973-09-30'],
dtype='datetime64[ns]', name='Date', length=153, freq=None)
As per this answer I've checked for missing dates, but there don't appear to be any:
pd.date_range(start = '1973-05-01', end = '1973-09-30').difference(df.index)
DatetimeIndex([], dtype='datetime64[ns]', freq='D')
How should I set the frequency for the index?
I think pd.to_datetime does not set a default frequency; you need DataFrame.asfreq (used here in place of the plain set_index step):
df = df.set_index('Date').asfreq('d')
print (df.index)
DatetimeIndex(['1973-05-01', '1973-05-02', '1973-05-03', '1973-05-04',
'1973-05-05', '1973-05-06', '1973-05-07', '1973-05-08',
'1973-05-09', '1973-05-10',
...
'1973-09-21', '1973-09-22', '1973-09-23', '1973-09-24',
'1973-09-25', '1973-09-26', '1973-09-27', '1973-09-28',
'1973-09-29', '1973-09-30'],
dtype='datetime64[ns]', name='Date', length=153, freq='D')
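As a side note, if the index is already complete and evenly spaced, the frequency can, as far as I know, also be assigned on the existing index directly:

df.index.freq = 'D'  # should raise a ValueError if the dates do not conform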
But if there are duplicated values in the index, you get an error:
df = pd.concat([df, df]).reset_index()  # duplicate the rows so 'Date' repeats
df = df.set_index('Date')
print (df.asfreq('d').index)
ValueError: cannot reindex from a duplicate axis
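If the duplicates are exact repeats, another possible sketch is to drop them before calling asfreq (keep='first' is an assumption about which row to keep):

print (df[~df.index.duplicated(keep='first')].asfreq('d').index)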
The solution is to use resample with some aggregation function:
print (df.resample('2D').mean().index)
DatetimeIndex(['1973-05-01', '1973-05-03', '1973-05-05', '1973-05-07',
'1973-05-09', '1973-05-11', '1973-05-13', '1973-05-15',
'1973-05-17', '1973-05-19', '1973-05-21', '1973-05-23',
'1973-05-25', '1973-05-27', '1973-05-29', '1973-05-31',
'1973-06-02', '1973-06-04', '1973-06-06', '1973-06-08',
'1973-06-10', '1973-06-12', '1973-06-14', '1973-06-16',
'1973-06-18', '1973-06-20', '1973-06-22', '1973-06-24',
'1973-06-26', '1973-06-28', '1973-06-30', '1973-07-02',
'1973-07-04', '1973-07-06', '1973-07-08', '1973-07-10',
'1973-07-12', '1973-07-14', '1973-07-16', '1973-07-18',
'1973-07-20', '1973-07-22', '1973-07-24', '1973-07-26',
'1973-07-28', '1973-07-30', '1973-08-01', '1973-08-03',
'1973-08-05', '1973-08-07', '1973-08-09', '1973-08-11',
'1973-08-13', '1973-08-15', '1973-08-17', '1973-08-19',
'1973-08-21', '1973-08-23', '1973-08-25', '1973-08-27',
'1973-08-29', '1973-08-31', '1973-09-02', '1973-09-04',
'1973-09-06', '1973-09-08', '1973-09-10', '1973-09-12',
'1973-09-14', '1973-09-16', '1973-09-18', '1973-09-20',
'1973-09-22', '1973-09-24', '1973-09-26', '1973-09-28',
'1973-09-30'],
dtype='datetime64[ns]', name='Date', freq='2D')
The problem is caused by the frequency not being set explicitly. In most cases you can't be sure that your data has no gaps, so generate a date range with
rng = pd.date_range(start = '1973-05-01', end = '1973-09-30', freq='D')
reindex your DataFrame with this rng, and fill the resulting np.nan values with a method or value of your choice.
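A minimal sketch of that reindex-and-fill step, using forward fill as one possible choice:

rng = pd.date_range(start='1973-05-01', end='1973-09-30', freq='D')
df = df.reindex(rng)  # inserts NaN rows for any days missing from the index
df = df.ffill()       # or fillna(value=...), interpolate(), ...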