https://www.kaggle.com/anokas/time-travel-eda
What does this code mean exactly: groupby('date_x')['outcome'].mean()? I could not find it in the sklearn docs.
date_x['Class probability'] = df_train.groupby('date_x')['outcome'].mean()
date_x['Frequency'] = df_train.groupby('date_x')['outcome'].size()
date_x.plot( secondary_y='Frequency',figsize=(22, 10))
thanks!
I think it is better to use DataFrameGroupBy.agg to aggregate with size (the length of each group) and mean per group, grouping by column date_x:
d = {'mean':'Class probability','size':'Frequency'}
df = df_train.groupby('date_x')['outcome'].agg(['mean','size']).rename(columns=d)
df.plot( secondary_y='Frequency',figsize=(22, 10))
For more information, check "applying multiple functions at once" in the pandas groupby docs.
Sample:
d = {'date_x':pd.to_datetime(['2015-01-01','2015-01-01','2015-01-01',
'2015-01-02','2015-01-02']),
'outcome':[20,30,40,50,60]}
df_train = pd.DataFrame(d)
print (df_train)
date_x outcome
0 2015-01-01 20 ->1.group
1 2015-01-01 30 ->1.group
2 2015-01-01 40 ->1.group
3 2015-01-02 50 ->2.group
4 2015-01-02 60 ->2.group
d = {'mean':'Class probability','size':'Frequency'}
df = df_train.groupby('date_x')['outcome'].agg(['mean','size']).rename(columns=d)
print (df)
Class probability Frequency
date_x
2015-01-01 30 3
2015-01-02 55 2
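As an alternative, if you want to avoid the separate rename step, named aggregation (available since pandas 0.25) can set the output column names directly; a minimal sketch using the same df_train as above:
# named aggregation: output column name -> (input column, aggregation)
df = df_train.groupby('date_x').agg(**{'Class probability': ('outcome', 'mean'),
                                       'Frequency': ('outcome', 'size')})
df.plot(secondary_y='Frequency', figsize=(22, 10))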
Example:
year_month = ['201801','201801','201801','201801','201801','201802','201802','201802','201802','201802']
Services = ['23','67','23','67','23','23','23','4','4','67']
df = list(zip(year_month, Services))
df = pd.DataFrame(df, columns=['Date', 'Services'])
Help me!
My date column is already in the right format, and I've already created the YYYYMM column from it.
I tried something like:
df2 = df.loc[:, ['YYYYMM', 'Services']]
df2 = df.groupby(['YYYYMM']).count().reset_index()
EXPECTED OUTPUT
Quantity of services per month/year.
year_month 4 23 67
201801 0 3 2
201802 2 2 1
out = df.groupby('Date', as_index=False).count()
out
Date Services
0 201801 5
1 201802 5
Update
Finally, I know the desired output.
pd.crosstab(df['Date'], df['Services']).sort_index(axis=1, key=lambda x: x.astype('int'))
Services 4 23 67
Date
201801 0 3 2
201802 2 2 1
I have a column with dates in string format '2017-01-01'. Is there a way to extract day and month from it using pandas?
I have converted the column to datetime dtype but haven't figured out the latter part:
df['Date'] = pd.to_datetime(df['Date'], format='%Y-%m-%d')
df.dtypes:
Date datetime64[ns]
print(df)
Date
0 2017-05-11
1 2017-05-12
2 2017-05-13
Use dt.day and dt.month, via the Series.dt accessor:
df = pd.DataFrame({'date':pd.date_range(start='2017-01-01',periods=5)})
df.date.dt.month
Out[164]:
0 1
1 1
2 1
3 1
4 1
Name: date, dtype: int64
df.date.dt.day
Out[165]:
0 1
1 2
2 3
3 4
4 5
Name: date, dtype: int64
You can also do it with dt.strftime:
df.date.dt.strftime('%m')
Out[166]:
0 01
1 01
2 01
3 01
4 01
Name: date, dtype: object
A simple form:
df['MM-DD'] = df['date'].dt.strftime('%m-%d')
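Note that dt.strftime returns zero-padded strings (object dtype), while dt.day and dt.month return integers; a small sketch showing how to convert the string form back to integers if needed:
month_str = df['date'].dt.strftime('%m')  # '01', '01', ... (strings)
month_int = month_str.astype(int)         # 1, 1, ... (integers)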
Use dt to get the datetime attributes of the column (the example below assumes import datetime).
In [60]: df = pd.DataFrame({'date': [datetime.datetime(2018,1,1),datetime.datetime(2018,1,2),datetime.datetime(2018,1,3),]})
In [61]: df
Out[61]:
date
0 2018-01-01
1 2018-01-02
2 2018-01-03
In [63]: df['day'] = df.date.dt.day
In [64]: df['month'] = df.date.dt.month
In [65]: df
Out[65]:
date day month
0 2018-01-01 1 1
1 2018-01-02 2 1
2 2018-01-03 3 1
Timing the methods provided:
Using apply:
In [217]: %timeit(df['date'].apply(lambda d: d.day))
The slowest run took 33.66 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 210 µs per loop
Using dt.date:
In [218]: %timeit(df.date.dt.day)
10000 loops, best of 3: 127 µs per loop
Using dt.strftime:
In [219]: %timeit(df.date.dt.strftime('%d'))
The slowest run took 40.92 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 284 µs per loop
We can see that dt.day is the fastest.
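A minimal sketch to reproduce this comparison outside IPython with the standard timeit module (the exact numbers will vary with the pandas version and data size):
import timeit
import pandas as pd
df = pd.DataFrame({'date': pd.date_range('2017-01-01', periods=1000)})
print(timeit.timeit(lambda: df['date'].apply(lambda d: d.day), number=1000))
print(timeit.timeit(lambda: df['date'].dt.day, number=1000))
print(timeit.timeit(lambda: df['date'].dt.strftime('%d'), number=1000))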
This should do it:
df['day'] = df['Date'].apply(lambda r:r.day)
df['month'] = df['Date'].apply(lambda r:r.month)
I have a pandas df which I am looking to build a pivot table with.
Here is a sample table
Name Week Category Amount
ABC 1 Clothing 50
ABC 1 Food 10
ABC 1 Food 10
ABC 1 Auto 20
DEF 1 Food 10
DEF 1 Services 20
The pivot table I am looking to create should sum up the Amount per Name, per Week, per Category.
Essentially, I am looking to land up with a table as follows:
Name Week Clothing Food Auto Services Total
ABC 1 50 20 20 0 90
DEF 1 0 10 0 20 30
If a user has no value for a category in a particular week, I take it as 0, and the Total is the row sum.
I tried some of the options mentioned at https://pandas.pydata.org/docs/reference/api/pandas.pivot_table.html but couldn't get it to work. Any thoughts on how I can achieve this? I used
df.pivot_table(values=['Amount'], index=['Name','Week','Category'], aggfunc=[np.sum]) followed by df.unstack(), but that did not yield the desired result as both Week and Category got unstacked.
Thanks!
df_pvt = pd.pivot_table(df, values='Amount', index=['Name', 'Week'], columns='Category', aggfunc=np.sum, margins=True, margins_name='Total', fill_value=0)
df_pvt.columns.name = None
df_pvt = df_pvt.reset_index()
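If the column order from the question matters and the bottom margins row is not wanted, a small follow-up sketch on the df_pvt built above (this assumes the margins row shows up with Name equal to 'Total' after reset_index):
df_pvt = df_pvt[df_pvt['Name'] != 'Total']  # drop the margins row
df_pvt = df_pvt[['Name', 'Week', 'Clothing', 'Food', 'Auto', 'Services', 'Total']]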
Let us try crosstab
out = pd.crosstab(index = [df['Name'],df['Week']],
columns = df['Category'],
values=df['Amount'],
margins=True,
aggfunc='sum').fillna(0).iloc[:-1].reset_index()
Category Name Week Auto Clothing Food Services All
0 ABC 1 20.0 50.0 20.0 0.0 90
1 DEF 1 0.0 0.0 10.0 20.0 30
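To match the requested column name and restore integer dtypes after the fillna, a small optional cleanup of the out frame above (not part of the original answer):
out = out.rename(columns={'All': 'Total'})
out[['Auto', 'Clothing', 'Food', 'Services']] = out[['Auto', 'Clothing', 'Food', 'Services']].astype(int)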
I am having a hard time using a group by + where to apply a sum to a broader range.
given this code:
from io import StringIO
import numpy as np
import pandas as pd
f = pd.read_csv(StringIO("""
fund_id,l_s,val
fund1,L,10
fund1,L,20
fund1,S,30
fund2,L,15
fund2,L,25
fund2,L,35
"""))
# fund total - works as expected
f['fund_total'] = f.groupby('fund_id')['val'].transform(np.sum)
# fund L total - applied only to L rows.
f['fund_total_l'] = f[f['l_s'] == "L"].groupby('fund_id')['val'].transform(np.sum)
f
This code gets me close:
The numbers are correct, but I would like the fund_total_l column to show 30 for all rows of fund1 (not just the L rows). I want a fund-level summary, but with the sum filtered by the l_s column.
I know I can do this in multiple steps, but it needs to be a single operation. I can use a separate generic function if that helps.
playground: https://repl.it/repls/UnusualImpeccableDaemons
Use Series.where to create NaN values; these will be ignored in your sum:
f['val_temp'] = f['val'].where(f['l_s'] == "L")
f['fund_total_l'] = f.groupby('fund_id')['val_temp'].transform('sum')
f = f.drop(columns='val_temp')
Or in one line using assign:
f['fund_total_l'] = (
f.assign(val_temp=f['val'].where(f['l_s'] == "L"))
.groupby('fund_id')['val_temp'].transform('sum')
)
Another way would be to partly use your solution, but then use DataFrame.reindex to get the original index back and then use ffill and bfill to fill up our NaN:
f['fund_total_l'] = (
f[f['l_s'] == "L"]
.groupby('fund_id')['val']
.transform('sum')
.reindex(f.index)
.ffill()
.bfill()
)
fund_id l_s val fund_total_l
0 fund1 L 10 30.0
1 fund1 L 20 30.0
2 fund1 S 30 30.0
3 fund2 L 15 75.0
4 fund2 L 25 75.0
5 fund2 L 35 75.0
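The same mask-then-transform idea can also be written as a single expression without a temporary column, since Series.groupby accepts an aligned Series as the grouping key:
f['fund_total_l'] = (f['val'].where(f['l_s'] == 'L')
                             .groupby(f['fund_id'])
                             .transform('sum'))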
I think there is a more elegant solution, but I'm not able to broadcast the results back to the individual rows.
Essentially, with a boolean mask of all the "L" rows
f.groupby("fund_id").apply(lambda g:sum(g["val"]*(g["l_s"]=="L")))
you obtain
fund_id
fund1 30
fund2 75
dtype: int64
Now we can just merge after using reset_index:
pd.merge(f, f.groupby("fund_id").apply(lambda g:sum(g["val"]*(g["l_s"]=="L"))).reset_index(), on="fund_id")
to obtain
fund_id l_s val 0
0 fund1 L 10 30
1 fund1 L 20 30
2 fund1 S 30 30
3 fund2 L 15 75
4 fund2 L 25 75
5 fund2 L 35 75
However, I'd guess the merge is not necessary and the result can be obtained directly after the apply.
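One way to skip the merge is to map the per-fund sums back onto fund_id; a minimal sketch of that idea:
l_sums = f.groupby('fund_id').apply(lambda g: (g['val'] * (g['l_s'] == 'L')).sum())
f['fund_total_l'] = f['fund_id'].map(l_sums)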
I want to select all the previous 6 months records for a customer whenever a particular transaction is done by the customer.
Data looks like:
Cust_ID Transaction_Date Amount Description
1 08/01/2017 12 Moved
1 03/01/2017 15 X
1 01/01/2017 8 Y
2 10/01/2018 6 Moved
2 02/01/2018 12 Z
Here, I want to find the rows with Description "Moved" and then select all records from the last 6 months for every Cust_ID.
Output should look like:
Cust_ID Transaction_Date Amount Description
1 08/01/2017 12 Moved
1 03/01/2017 15 X
2 10/01/2018 6 Moved
I want to do this in Python. Please help.
The idea is to create a Series of datetimes filtered by Moved and shifted back by a MonthOffset, then keep rows whose Transaction_Date is greater than these offsets (mapped back to the original rows):
EDIT: Get all datetimes for each Moved value:
df['Transaction_Date'] = pd.to_datetime(df['Transaction_Date'])
df = df.sort_values(['Cust_ID','Transaction_Date'])
df['g'] = df['Description'].iloc[::-1].eq('Moved').cumsum()
s = (df[df['Description'].eq('Moved')]
.set_index(['Cust_ID','g'])['Transaction_Date'] - pd.offsets.MonthOffset(6))
mask = df.join(s.rename('a'), on=['Cust_ID','g'])['a'] < df['Transaction_Date']
df1 = df[mask].drop('g', axis=1)
EDIT1: Use only the Moved row with the minimal datetime per group; other Moved rows in each group are removed:
print (df)
Cust_ID Transaction_Date Amount Description
0 1 10/01/2017 12 X
1 1 01/23/2017 15 Moved
2 1 03/01/2017 8 Y
3 1 08/08/2017 12 Moved
4 2 10/01/2018 6 Moved
5 2 02/01/2018 12 Z
#convert to datetimes
df['Transaction_Date'] = pd.to_datetime(df['Transaction_Date'])
#mask to filter Moved rows
mask = df['Description'].eq('Moved')
#filter and sort these rows
df1 = df[mask].sort_values(['Cust_ID','Transaction_Date'])
print (df1)
Cust_ID Transaction_Date Amount Description
1 1 2017-01-23 15 Moved
3 1 2017-08-08 12 Moved
4 2 2018-10-01 6 Moved
#get duplicated filtered rows in df1
mask = df1.duplicated('Cust_ID')
#create Series for map
s = df1[~mask].set_index('Cust_ID')['Transaction_Date'] - pd.offsets.MonthOffset(6)
print (s)
Cust_ID
1 2016-07-23
2 2018-04-01
Name: Transaction_Date, dtype: datetime64[ns]
#create mask to filter out other Moved rows (keep only the first per group)
m2 = ~mask.reindex(df.index, fill_value=False)
df1 = df[(df['Cust_ID'].map(s) < df['Transaction_Date']) & m2]
print (df1)
Cust_ID Transaction_Date Amount Description
0 1 2017-10-01 12 X
1 1 2017-01-23 15 Moved
2 1 2017-03-01 8 Y
4 2 2018-10-01 6 Moved
EDIT2:
#mark all but the last Moved row per Cust_ID as duplicates
mask = df1.duplicated('Cust_ID', keep='last')
#create Series for map
s = df1[~mask].set_index('Cust_ID')['Transaction_Date']
print (s)
Cust_ID
1 2017-08-08
2 2018-10-01
Name: Transaction_Date, dtype: datetime64[ns]
m2 = ~mask.reindex(df.index, fill_value=False)
#filter rows between the Moved date and the following 6 months
df3 = df[df['Transaction_Date'].between(df['Cust_ID'].map(s), df['Cust_ID'].map(s + pd.offsets.MonthOffset(6))) & m2]
print (df3)
Cust_ID Transaction_Date Amount Description
3 1 2017-08-08 12 Moved
0 1 2017-10-01 12 X
4 2 2018-10-01 6 Moved
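For the simpler case from the question, where each customer has exactly one Moved row, a shorter map-based filter may be enough (a sketch under that single-Moved assumption, not part of the answer above):
df['Transaction_Date'] = pd.to_datetime(df['Transaction_Date'])
#date of the single Moved transaction per customer
moved = df[df['Description'].eq('Moved')].set_index('Cust_ID')['Transaction_Date']
start = df['Cust_ID'].map(moved - pd.offsets.DateOffset(months=6))
end = df['Cust_ID'].map(moved)
out = df[df['Transaction_Date'].between(start, end)]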