Difference between group sums in pandas

DF:
   fruits        date  amount
0   Apple  2018-01-01     100
1  Orange  2018-01-01     200
2   Apple  2018-01-01     150
3   Apple  2018-01-02     100
4  Orange  2018-01-02     100
5  Orange  2018-01-02     100
Code to create this:
import pandas as pd

f = [["Apple","2018-01-01",100],["Orange","2018-01-01",200],["Apple","2018-01-01",150],
     ["Apple","2018-01-02",100],["Orange","2018-01-02",100],["Orange","2018-01-02",100]]
df = pd.DataFrame(f, columns=["fruits","date","amount"])
I am trying to aggregate the fruit sales for each date and find the difference between the sums.
Expected output:
date diff
2018-01-01 50
2018-01-02 -100
That is, find the sum of sales of Apple and Orange per date, then take the difference between the two sums.
I am able to find the sum:
df.groupby(["date","fruits"])["amount"].agg("sum")
date        fruits
2018-01-01  Apple     250
            Orange    200
2018-01-02  Apple     100
            Orange    200
Name: amount, dtype: int64
Any suggestions on how to find the difference in pandas itself?

Using groupby on date and apply with a lambda function:
df.groupby("date").apply(lambda x: x.loc[x['fruits']=='Apple','amount'].sum() -
x.loc[x['fruits']=='Orange','amount'].sum())
date
2018-01-01 50
2018-01-02 -100
dtype: int64
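A variant that avoids hardcoding the subtraction inside apply is to sign the amounts first and then do a single grouped sum; a minimal sketch, assuming only the two fruits Apple and Orange appear:
import numpy as np
# +amount for Apple, -amount for Orange, then one grouped sum per date
signed = df["amount"] * np.where(df["fruits"].eq("Apple"), 1, -1)
signed.groupby(df["date"]).sum()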
Or grouping the fruits separately and finding the difference:
A = df[df.fruits.isin(['Apple'])].groupby('date')['amount'].sum()
O = df[df.fruits.isin(['Orange'])].groupby('date')['amount'].sum()
A - O
date
2018-01-01 50
2018-01-02 -100
Name: amount, dtype: int64

def get_diff(grp):
    # assumes alphabetical group order: Apple first, Orange second
    grp = grp.groupby('fruits')['amount'].sum().values
    return grp[0] - grp[1]

df.groupby('date').apply(get_diff)
Output:
date
2018-01-01 50
2018-01-02 -100

Add unstack to reshape and then subtract, with pop to extract the columns:
df = df.groupby(["date","fruits"])["amount"].sum().unstack()
df['diff'] = df.pop('Apple') - df.pop('Orange')
print (df)
fruits diff
date
2018-01-01 50
2018-01-02 -100
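A note on the design: pop extracts a column and drops it from the frame in one step, so only diff remains; an equivalent sketch without pop would be df['diff'] = df['Apple'] - df['Orange'] followed by df.drop(columns=['Apple','Orange']).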

Related

Find index range of a sequence in dataframe column

I have a timeseries:
Sales
2018-01-01 66.65
2018-01-02 66.68
2018-01-03 65.87
2018-01-04 66.79
2018-01-05 67.97
2018-01-06 96.92
2018-01-07 96.90
2018-01-08 96.90
2018-01-09 96.38
2018-01-10 95.57
Given an arbitrary sequence of values, let's say [66.79,67.97,96.92,96.90], how could I obtain the corresponding indices, for example: [2018-01-04, 2018-01-05,2018-01-06,2018-01-07]?
Use pandas.Series.isin to filter the column Sales, then pandas.DataFrame.index to return the row labels (the index, i.e. the dates in your df), and finally pandas.Series.to_list to build a list:
vals = [66.79,67.97,96.92,96.90]
result = df[df['Sales'].isin(vals)].index.to_list()
# Output:
print(result)
['2018-01-04', '2018-01-05', '2018-01-06', '2018-01-07', '2018-01-08']
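Note that isin matches the values anywhere in the column, which is why 2018-01-08 (a second 96.90) also appears. If the values must occur as one contiguous run, a minimal sketch that slides a window over the column (assuming the same df as above):
import numpy as np
seq = [66.79, 67.97, 96.92, 96.90]
n = len(seq)
vals = df['Sales'].to_numpy()
# start positions where a length-n window equals the sequence
starts = [i for i in range(len(vals) - n + 1) if np.allclose(vals[i:i + n], seq)]
matches = [df.index[i:i + n].to_list() for i in starts]
print(matches)
# [['2018-01-04', '2018-01-05', '2018-01-06', '2018-01-07']]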

Assign Day number to unique date in pandas [duplicate]

I have a column with dates in string format '2017-01-01'. Is there a way to extract day and month from it using pandas?
I have converted the column to datetime dtype but haven't figured out the latter part:
df['Date'] = pd.to_datetime(df['Date'], format='%Y-%m-%d')
df.dtypes:
Date datetime64[ns]
print(df)
Date
0 2017-05-11
1 2017-05-12
2 2017-05-13
Use dt.day and dt.month via the Series.dt accessor:
df = pd.DataFrame({'date':pd.date_range(start='2017-01-01',periods=5)})
df.date.dt.month
Out[164]:
0 1
1 1
2 1
3 1
4 1
Name: date, dtype: int64
df.date.dt.day
Out[165]:
0 1
1 2
2 3
3 4
4 5
Name: date, dtype: int64
You can also do it with dt.strftime:
df.date.dt.strftime('%m')
Out[166]:
0 01
1 01
2 01
3 01
4 01
Name: date, dtype: object
A simple form:
df['MM-DD'] = df['date'].dt.strftime('%m-%d')
Use dt to get the datetime attributes of the column (the sample below also needs import datetime):
In [60]: df = pd.DataFrame({'date': [datetime.datetime(2018,1,1), datetime.datetime(2018,1,2), datetime.datetime(2018,1,3)]})
In [61]: df
Out[61]:
date
0 2018-01-01
1 2018-01-02
2 2018-01-03
In [63]: df['day'] = df.date.dt.day
In [64]: df['month'] = df.date.dt.month
In [65]: df
Out[65]:
        date  day  month
0 2018-01-01    1      1
1 2018-01-02    2      1
2 2018-01-03    3      1
Timing the methods provided:
Using apply:
In [217]: %timeit(df['date'].apply(lambda d: d.day))
The slowest run took 33.66 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 210 µs per loop
Using dt.day:
In [218]: %timeit(df.date.dt.day)
10000 loops, best of 3: 127 µs per loop
Using dt.strftime:
In [219]: %timeit(df.date.dt.strftime('%d'))
The slowest run took 40.92 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 284 µs per loop
We can see that dt.day is the fastest.
This should do it:
df['day'] = df['Date'].apply(lambda r:r.day)
df['month'] = df['Date'].apply(lambda r:r.month)

Create a row for each year between two dates

I have a dataframe with two date columns (format: YYYY-MM-DD). I want to create one row for each year between those two dates. The rows would be identical with a new column which specifies the year. For example, if the dates are 2018-01-01 and 2020-01-01 then there would be three rows with same data and a new column with values 2018, 2019, and 2020.
You can use a custom function to compute the year range, then explode the column:
# Ensure to have datetime
df['date1'] = pd.to_datetime(df['date1'])
df['date2'] = pd.to_datetime(df['date2'])
# Create the new column
date_range = lambda x: range(x['date1'].year, x['date2'].year+1)
df = df.assign(year=df.apply(date_range, axis=1)).explode('year', ignore_index=True)
Output:
>>> df
       date1      date2  year
0 2018-01-01 2020-01-01  2018
1 2018-01-01 2020-01-01  2019
2 2018-01-01 2020-01-01  2020
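One follow-up note: after explode the new year column has object dtype; if you want integers you could cast it, e.g. df['year'] = df['year'].astype(int).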
This should work for you:
import pandas

# some sample data
df = pandas.DataFrame(data={
    'foo': ['bar', 'baz'],
    'date1': ['2018-01-01', '2022-01-01'],
    'date2': ['2020-01-01', '2017-01-01']
})
# cast date columns to datetime
for col in ['date1', 'date2']:
    df[col] = pandas.to_datetime(df[col])
# reset index to ensure that selection by length of index works
df = df.reset_index(drop=True)
# compute the range of years between the two dates, and iterate through the
# resulting series to unpack the range of years and add a new row with the
# original data and the year
# (note: Series.iteritems() was removed in pandas 2.0; use .items())
for i, years in df.apply(
    lambda x: range(
        min(x.date1, x.date2).year,
        max(x.date1, x.date2).year + 1
    ),
    axis='columns'
).items():
    for year in years:
        new_index = len(df.index)
        df.loc[new_index] = df.loc[i].values
        df.loc[new_index, 'year'] = int(year)
output:
>>> df
    foo      date1      date2    year
0   bar 2018-01-01 2020-01-01     NaN
1   baz 2022-01-01 2017-01-01     NaN
2   bar 2018-01-01 2020-01-01  2018.0
3   bar 2018-01-01 2020-01-01  2019.0
4   bar 2018-01-01 2020-01-01  2020.0
5   baz 2022-01-01 2017-01-01  2017.0
6   baz 2022-01-01 2017-01-01  2018.0
7   baz 2022-01-01 2017-01-01  2019.0
8   baz 2022-01-01 2017-01-01  2020.0
9   baz 2022-01-01 2017-01-01  2021.0
10  baz 2022-01-01 2017-01-01  2022.0
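The original rows remain with NaN in year; if you only want the expanded rows, you could drop them afterwards, e.g. df = df.dropna(subset=['year']).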

Using a custom scoring function with pandas groupby to create a column in another dataframe

This is my partial df:
dStart      y_test  y_pred
2018-01-01       1       2
2018-01-01       2       2
2018-01-02       3       3
2018-01-02       1       2
2018-01-02       2       3
I want to create a column in another dataframe (df1) with the Mathews Correlation Coefficient of each unique dStart.
from sklearn.metrics import matthews_corrcoef
def mcc_func(y_test, y_pred):
    return matthews_corrcoef(df[y_test].values, df[y_pred].values)

df1['mcc'] = df.groupby('dStart').apply(mcc_func('y_test', 'y_pred'))
This function doesn't work -- I think because the function returns a float, and 'apply' wants to use the function on the groupby data itself, but I can't figure out how to give the right function to apply.
You need to apply the function within the grouped object -
g = df.groupby('dStart')
g.apply(lambda x: matthews_corrcoef(x['y_test'], x['y_pred']))
#OUTPUT
#dStart
#2018-01-01 0.0
#2018-01-02 0.0
#dtype: float64
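To write this into a column of another dataframe as asked, one option (a sketch, assuming df1 has a dStart column to align on; df1 is not shown in the question) is to map the resulting Series:
scores = g.apply(lambda x: matthews_corrcoef(x['y_test'], x['y_pred']))
df1['mcc'] = df1['dStart'].map(scores)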
Use apply with a lambda function:
df = (df.groupby(['dStart']).apply(lambda x: matthews_corrcoef(x['y_test'], x['y_pred']))
.reset_index(name='Matthews_corrcoef'))
print(df)
dStart Matthews_corrcoef
0 2018-01-01 0.0
1 2018-01-02 0.0

Selecting all the previous 6 months data records from occurrence of a particular value in a column in pandas

I want to select all the previous 6 months records for a customer whenever a particular transaction is done by the customer.
Data looks like:
Cust_ID  Transaction_Date  Amount  Description
1        08/01/2017        12      Moved
1        03/01/2017        15      X
1        01/01/2017        8       Y
2        10/01/2018        6       Moved
2        02/01/2018        12      Z
Here, I want to look for the Description "Moved" and then select the previous 6 months of records for every Cust_ID.
Output should look like:
Cust_ID  Transaction_Date  Amount  Description
1        08/01/2017        12      Moved
1        03/01/2017        15      X
2        10/01/2018        6       Moved
I want to do this in Python. Please help.
The idea is to create a Series of the Moved datetimes shifted back by a six-month offset, map it back to the rows, and keep only rows whose Transaction_Date falls after that shifted datetime:
EDIT: Get all datetimes for each Moved value:
df['Transaction_Date'] = pd.to_datetime(df['Transaction_Date'])
df = df.sort_values(['Cust_ID','Transaction_Date'])
df['g'] = df['Description'].iloc[::-1].eq('Moved').cumsum()
# shift each Moved date back six months
s = (df[df['Description'].eq('Moved')]
     .set_index(['Cust_ID','g'])['Transaction_Date'] - pd.DateOffset(months=6))
mask = df.join(s.rename('a'), on=['Cust_ID','g'])['a'] < df['Transaction_Date']
df1 = df[mask].drop('g', axis=1)
EDIT1: Get all datetimes relative to the earliest Moved per group; any later Moved rows are removed:
print (df)
   Cust_ID Transaction_Date  Amount Description
0        1       10/01/2017      12           X
1        1       01/23/2017      15       Moved
2        1       03/01/2017       8           Y
3        1       08/08/2017      12       Moved
4        2       10/01/2018       6       Moved
5        2       02/01/2018      12           Z
#convert to datetimes
df['Transaction_Date'] = pd.to_datetime(df['Transaction_Date'])
#mask for filter Moved rows
mask = df['Description'].eq('Moved')
#filter and sorting this rows
df1 = df[mask].sort_values(['Cust_ID','Transaction_Date'])
print (df1)
   Cust_ID Transaction_Date  Amount Description
1        1       2017-01-23      15       Moved
3        1       2017-08-08      12       Moved
4        2       2018-10-01       6       Moved
#get duplicated filtered rows in df1
mask = df1.duplicated('Cust_ID')
#create Series for map
s = df1[~mask].set_index('Cust_ID')['Transaction_Date'] - pd.DateOffset(months=6)
print (s)
Cust_ID
1 2016-07-23
2 2018-04-01
Name: Transaction_Date, dtype: datetime64[ns]
#create mask for filter out another Moved (get only first for each group)
m2 = ~mask.reindex(df.index, fill_value=False)
df1 = df[(df['Cust_ID'].map(s) < df['Transaction_Date']) & m2]
print (df1)
   Cust_ID Transaction_Date  Amount Description
0        1       2017-10-01      12           X
1        1       2017-01-23      15       Moved
2        1       2017-03-01       8           Y
4        2       2018-10-01       6       Moved
EDIT2:
#get last duplicated filtered rows in df1
mask = df1.duplicated('Cust_ID', keep='last')
#create Series for map
s = df1[~mask].set_index('Cust_ID')['Transaction_Date']
print (s)
Cust_ID
1 2017-08-08
2 2018-10-01
Name: Transaction_Date, dtype: datetime64[ns]
m2 = ~mask.reindex(df.index, fill_value=False)
#filter by between Moved and next 6 months
df3 = df[df['Transaction_Date'].between(df['Cust_ID'].map(s),
                                        df['Cust_ID'].map(s + pd.DateOffset(months=6))) & m2]
print (df3)
   Cust_ID Transaction_Date  Amount Description
3        1       2017-08-08      12       Moved
0        1       2017-10-01      12           X
4        2       2018-10-01       6       Moved
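If each customer has exactly one Moved row, a much simpler sketch (assuming the question's sample data) maps that date per customer and keeps the six months up to and including it:
df['Transaction_Date'] = pd.to_datetime(df['Transaction_Date'])
# the single Moved date per customer
moved = df.loc[df['Description'].eq('Moved')].set_index('Cust_ID')['Transaction_Date']
start = df['Cust_ID'].map(moved - pd.DateOffset(months=6))
end = df['Cust_ID'].map(moved)
# between is inclusive on both ends by default
print(df[df['Transaction_Date'].between(start, end)])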