Pandas groupby calculate difference

import pandas as pd
data = [['2017-09-30','A',123],['2017-12-31','A',23],['2017-09-30','B',74892],['2017-12-31','B',52222],['2018-09-30','A',37599],['2018-12-31','A',66226]]
df = pd.DataFrame.from_records(data,columns=["Date", "Company", "Revenue YTD"])
df['Date'] = pd.to_datetime(df['Date'])
df = df.groupby(['Company',df['Date'].dt.year]).diff()
print(df)
      Date  Revenue YTD
0      NaT          NaN
1  92 days       -100.0
2      NaT          NaN
3  92 days     -22670.0
4      NaT          NaN
5  92 days      28627.0
I would like to calculate each company's revenue difference between September and December. I have tried grouping by company and year, but the result is not what I am expecting.
Expected result:
   Date Company  Revenue YTD
0  2017       A         -100
1  2018       A        28627
2  2017       B       -22670

IIUC, this should work
(df.assign(Date=df['Date'].dt.year,
           Revenue_Diff=df.groupby(['Company', df['Date'].dt.year])['Revenue YTD'].diff())
   .drop('Revenue YTD', axis=1)
   .dropna()
)
Output:
   Date Company  Revenue_Diff
1  2017       A        -100.0
3  2017       B      -22670.0
5  2018       A       28627.0
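To match the column name and 0-based row numbering in the expected output, a small follow-up sketch (same logic as above, plus a rename and reset_index; it assumes the original df from the setup, before it was overwritten by the diff() call in the question):
result = (df.assign(Date=df['Date'].dt.year,
                    Revenue_Diff=df.groupby(['Company', df['Date'].dt.year])['Revenue YTD'].diff())
            .drop('Revenue YTD', axis=1)
            .dropna()
            .rename(columns={'Revenue_Diff': 'Revenue YTD'})
            .reset_index(drop=True))
print(result)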

Try this:
Set it up:
import pandas as pd
import numpy as np
data = [['2017-09-30','A',123],['2017-12-31','A',23],['2017-09-30','B',74892],['2017-12-31','B',52222],['2018-09-30','A',37599],['2018-12-31','A',66226]]
df = pd.DataFrame.from_records(data,columns=["Date", "Company", "Revenue YTD"])
df['Date'] = pd.to_datetime(df['Date'])
Update with np.diff():
my_func = lambda x: np.diff(x)
df = (df.groupby([df.Date.dt.year, df.Company])
        .agg({'Revenue YTD': my_func}))
print(df)
              Revenue YTD
Date Company
2017 A               -100
     B             -22670
2018 A              28627
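Note that agg with a function returning an array (np.diff gives a length-1 array here) may not be accepted on every pandas version. A reduction that returns a scalar is a safer sketch of the same idea, assuming df as built in the setup step and that each group's rows are ordered September then December, as in the sample data:
diff_last_first = lambda s: s.iloc[-1] - s.iloc[0]   # December YTD minus September YTD
out = (df.groupby([df.Date.dt.year, df.Company])
         .agg({'Revenue YTD': diff_last_first}))
print(out)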
Hope this helps.

Related

How to create monthly and seasonal 24-hour average tables using pandas

I have a dataframe with 2 columns, Date and LMP, with a total of 8760 rows. This is the dummy dataframe:
import pandas as pd
import numpy as np
df = pd.DataFrame({'Date': pd.date_range('2023-01-01 00:00', '2023-12-31 23:00', freq='1H'), 'LMP': np.random.randint(10, 20, 8760)})
I extracted the month from the date and then created the season column for the specific dates, like this:
df['month'] = pd.DatetimeIndex(df['Date']).month
season = []
for i in df['month']:
    if i <= 2 or i == 12:
        season.append('Winter')
    elif 2 < i <= 5:
        season.append('Spring')
    elif 5 < i <= 8:
        season.append('Summer')
    else:
        season.append('Autumn')
df['Season'] = season
df2 = df.groupby(['month']).mean()
df3 = df.groupby(['Season']).mean()
print(df2['LMP'])
print(df3['LMP'])
Output:
month
1     20.655113
2     20.885532
3     19.416946
4     22.025248
5     26.040606
6     19.323863
7     51.117965
8     51.434093
9     21.404680
10    14.701989
11    20.009590
12    38.706160
Season
Autumn    18.661426
Spring    22.499365
Summer    40.856845
Winter    26.944382
But I want the output to be a 24-hour average (one value per hour of the day) for both the monthly and the seasonal grouping.
Desired output: a seasonal 24-hour average table and a monthly 24-hour average table.
Note: in the monthly 24-hour average table, the columns are the months (1, 2, ..., 12) and the rows are the hours (starting from 0).
Can anyone help?
try:
df['hour'] = pd.DatetimeIndex(df['Date']).hour          # hour of day, 0-23
dft = df[['Season', 'hour', 'LMP']]
dftg = dft.groupby(['hour', 'Season'])['LMP'].mean()    # mean LMP per hour and season
dftg.reset_index().pivot(index='hour', columns='Season')
result: a 24-row table with hours 0-23 as the index and one mean-LMP column per season (Autumn, Spring, Summer, Winter).
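The monthly 24-hour table the question also asks for follows the same pattern, just grouped by month instead of Season (a sketch, assuming the df built above with its month column and the hour column added by the snippet):
monthly_24h = (df.groupby(['hour', 'month'])['LMP'].mean()
                 .reset_index()
                 .pivot(index='hour', columns='month', values='LMP'))
print(monthly_24h)   # rows: hours 0-23, columns: months 1-12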

Pandas Shift Column & Remove Row

I have a dataframe 'df1' that has 2 columns, and I need to shift the 2nd column down a row and then remove the entire top row of df1.
My data looks like this:
   year   ER12
0  2017  -2.05
1  2018   1.05
2  2019  -0.04
3  2020  -0.60
4  2021 -99.99
And, I need it to look like this:
   year  ER12
0  2018 -2.05
1  2019  1.05
2  2020 -0.04
3  2021 -0.60
We can try this:
df = df.assign(ER12=df.ER12.shift()).dropna().reset_index(drop=True)
print(df)
   year  ER12
0  2018 -2.05
1  2019  1.05
2  2020 -0.04
3  2021 -0.60
This works on your example:
import pandas as pd
df = pd.DataFrame({'year':[2017,2018,2019,2020,2021], 'ER12':[-2.05,1.05,-0.04,-0.6,-99.99]})
df['year'] = df['year'].shift(-1)
df = df.dropna()
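One caveat with this variant: shift(-1) introduces a NaN, so year is upcast to float (2018.0, ...). A small cleanup sketch:
df['year'] = df['year'].astype(int)   # restore integer years after the shift/dropna
df = df.reset_index(drop=True)        # renumber rows from 0 (a no-op here, but safe in general)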

Create Dataframe from Matrix Search Calculation Pandas

I have the following DataFrame:
df = pd.DataFrame({'Identity': ['Haus1', 'Haus2', 'Haus1', 'Haus2'],
                   'kind': ['Gas', 'Gas', 'Strom', 'Strom'],
                   '2005': [2, 3, 5, 6],
                   '2006': [2, 3.5, 5.5, 7]})
Now I would like to have the following dataframe as output, with the product of the entities per kind and year:
Year  Product(Gas)  Product(Strom)
2005             6            30.0
2006             7            38.5
Thank you.
Here's a way to do it:
# multiply column values within each group
from functools import reduce
def mult(f):
    v = [reduce(lambda a, b: a * b, f['2005']), reduce(lambda a, b: a * b, f['2006'])]
    return pd.Series(v, index=['2005', '2006'])
# groupby and multiply column values
df1 = df.groupby('kind')[['2005', '2006']].apply(mult).unstack().reset_index()
df1.columns = ['Year', 'Kind', 'vals']
print(df1)
   Year   Kind  vals
0  2005    Gas   6.0
1  2005  Strom  30.0
2  2006    Gas   7.0
3  2006  Strom  38.5
# reshape the table
df1 = (df1
       .pivot_table(index='Year', columns=['Kind'], values='vals'))
# fix column names
df1 = df1.add_prefix('Product_')
df1.columns.name = None
df1 = df1.reset_index()
   Year  Product_Gas  Product_Strom
0  2005          6.0           30.0
1  2006          7.0           38.5
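If only the per-year product is needed, a shorter sketch with the built-in prod() aggregation (assumes the same df as above):
out = (df.groupby('kind')[['2005', '2006']].prod()    # product of each year column per kind
         .T                                           # transpose so the years become the index
         .add_prefix('Product_')
         .rename_axis(index='Year', columns=None)     # tidy the axis names
         .reset_index())
print(out)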

Converting ddmmyy into mmyy format using pandas?

I have a column (Month) in ddmmyy format; how can I convert it into mmyy format?
Month
6/1/2017
5/1/2017
I have used the code below; can someone help?
import pandas as pd
df = pd.read_csv(r"C:\Users\venkagop\Subbu\UK_IYA.csv")
df['Month']=pd.to_datetime(df['Month'],format='%d/%m/%y')
df.to_csv(r"C:\Users\venkagop\Subbu\my test.csv")
I think you can convert the column to datetimes directly in read_csv with the parse_dates and dayfirst parameters, and then convert it to a custom format with strftime:
df = pd.read_csv(r"C:\Users\venkagop\Subbu\UK_IYA.csv", parse_dates=['Month'], dayfirst=True)
df['Month']= df['Month'].dt.strftime('%b %y')
df.to_csv(r"C:\Users\venkagop\Subbu\my test.csv")
Your code:
df = pd.read_csv(r"C:\Users\venkagop\Subbu\UK_IYA.csv")
df['Month']=pd.to_datetime(df['Month'],format='%d/%m/%y').dt.strftime('%b %y')
df.to_csv(r"C:\Users\venkagop\Subbu\my test.csv")
Sample:
import pandas as pd
from io import StringIO

temp = """Month,sale
05/03/12,2
05/04/12,4
05/05/12,6
05/06/12,8"""
# after testing, replace 'StringIO(temp)' with 'filename.csv'
df = pd.read_csv(StringIO(temp), parse_dates=['Month'], dayfirst=True)
print (df)
       Month  sale
0 2012-03-05     2
1 2012-04-05     4
2 2012-05-05     6
3 2012-06-05     8
df['Month']= df['Month'].dt.strftime('%b %y')
print (df)
    Month  sale
0  Mar 12     2
1  Apr 12     4
2  May 12     6
3  Jun 12     8
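If you only need a month/year representation rather than a specific display string, a hedged alternative is to keep a real monthly period, which stays sortable (assumes df['Month'] is already datetime, as above):
df['Month'] = df['Month'].dt.to_period('M')   # e.g. 2012-03 instead of 'Mar 12'
print(df)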

Pandas Dataframe merging columns

I have a pandas dataframe like the following
   Year  Month  Day Securtiy Trade  Value  NewDate
0  2011      1   10     AAPL   Buy   1500        0
My question is, how can I merge the columns Year, Month, and Day into the NewDate column,
so that the NewDate column looks like the following:
2011-1-10
The best way is to parse it when reading as csv:
In [1]: df = pd.read_csv('foo.csv', sep='\s+', parse_dates=[['Year', 'Month', 'Day']])
In [2]: df
Out[2]:
        Year_Month_Day Securtiy Trade  Value  NewDate
0  2011-01-10 00:00:00     AAPL   Buy   1500        0
If the file has no header row, you can define the column names while reading:
pd.read_csv(input_file, header=None, names=['Year', 'Month', 'Day', 'Security', 'Trade', 'Value'], parse_dates=[['Year', 'Month', 'Day']])
If it's already in your DataFrame, you could use an apply:
In [11]: df['Date'] = df.apply(lambda s: pd.Timestamp('%s-%s-%s' % (s['Year'], s['Month'], s['Day'])), 1)
In [12]: df
Out[12]:
   Year  Month  Day Securtiy Trade  Value  NewDate                Date
0  2011      1   10     AAPL   Buy   1500        0 2011-01-10 00:00:00
df['NewDate'] = df['Year'].astype(str) + '-' + df['Month'].astype(str) + '-' + df['Day'].astype(str)
You can create a new Timestamp as follows:
df['newDate'] = df.apply(lambda x: pd.Timestamp('{0}-{1}-{2}'
                                                .format(x.Year, x.Month, x.Day)),
                         axis=1)
>>> df
   Year  Month  Day Securtiy Trade  Value  NewDate    newDate
0  2011      1   10     AAPL   Buy   1500        0 2011-01-10
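A more compact option is pd.to_datetime on the three columns; it can assemble datetimes from columns named year/month/day (a sketch, renaming the columns to lowercase to match that convention):
df['NewDate'] = pd.to_datetime(df[['Year', 'Month', 'Day']].rename(columns=str.lower))
print(df[['NewDate', 'Securtiy', 'Trade', 'Value']])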