Pandas 'multi-index' issue in merging dataframes

Pandas 'multi-index' issue in merging dataframes - pandas

I have a panel dataset as df
stock year date return
VOD 2017 01-01 0.05
VOD 2017 01-02 0.03
VOD 2017 01-03 0.04
... ... ... ....
BAT 2017 01-01 0.05
BAT 2017 01-02 0.07
BAT 2017 01-03 0.10
so I use this code to get the mean and skewness of the return for each stock in each year.
df2=df.groupby(['stock','year']).mean().reset_index()
df3=df.groupby(['stock','year']).skew().reset_index()
df2 and df3 look fine.
df2 is like (after I change the column name)
stock year mean_return
VOD 2017 0.09
BAT 2017 0.14
... ... ...
df3 is like (after I change the column name)
stock year return_skewness
VOD 2017 -0.34
BAT 2017 -0.04
... ... ...
The problem is when I tried to merge df2 and df3 by using
want=pd.merge(df2,df2, on=['stock','year'],how='outer')
python gave me
'The column label 'stock' is not unique.
For a multi-index, the label must be a tuple with elements corresponding to each level.'
, which confuses me alot.
I can use want = pd.merge(df2,df3, left_index=True, right_index=True, how='outer') to merge df2 and df3, but after that i have to rename the columns as column names are in parentheses.
Is there any convenient way to merge df2 and df3 ? Thanks

Better is use agg for specify aggregate function in list and column for aggregation after function:
df3 = (df.groupby(['stock','year'])['return']
.agg([('mean_return','mean'),('return_skewness','skew')])
.reset_index())
print (df3)
stock year mean_return return_skewness
0 BAT 2017 0.073333 0.585583
1 VOD 2017 0.040000 0.000000
Your solution should be changed with remove reset_index, rename and last concat, also is specified column return for aggregate:
s2=df.groupby(['stock','year'])['return'].mean().rename('mean_return')
s3=df.groupby(['stock','year'])['return'].skew().rename('return_skewness')
df3 = pd.concat([s2, s3], axis=1).reset_index()
print (df3)
stock year mean_return return_skewness
0 BAT 2017 0.073333 0.585583
1 VOD 2017 0.040000 0.000000
EDIT:
If need aggregate all numeric columns remove list after groupby first and then use map with join for flatten MultiIndex:
print (df)
stock year date return col
0 VOD 2017 01-01 0.05 1
1 VOD 2017 01-02 0.03 8
2 VOD 2017 01-03 0.04 9
3 BAT 2017 01-01 0.05 1
4 BAT 2017 01-02 0.07 4
5 BAT 2017 01-03 0.10 3
df3 = df.groupby(['stock','year']).agg(['mean','skew'])
print (df3)
return col
mean skew mean skew
stock year
BAT 2017 0.073333 0.585583 2.666667 -0.935220
VOD 2017 0.040000 0.000000 6.000000 -1.630059
df3.columns = df3.columns.map('_'.join)
df3 = df3.reset_index()
print (df3)
stock year return_mean return_skew col_mean col_skew
0 BAT 2017 0.073333 0.585583 2.666667 -0.935220
1 VOD 2017 0.040000 0.000000 6.000000 -1.630059
Your solutions should be changed:
df2=df.groupby(['stock','year']).mean().add_prefix('mean_')
df3=df.groupby(['stock','year']).skew().add_prefix('skew_')
df3 = pd.concat([df2, df3], axis=1).reset_index()
print (df3)
stock year mean_return mean_col skew_return skew_col
0 BAT 2017 0.073333 2.666667 0.585583 -0.935220
1 VOD 2017 0.040000 6.000000 0.000000 -1.630059

A easier way to bypass this issue:
df2.to_clipboard(index=False)
df2clip=pd.read_clipboard(sep='\t')
df3.to_clipboard(index=False)
df3clip=pd.read_clipboard(sep='\t')
Then merge 2 df again:
pd.merge(df2clip,df3clip,on=['stock','year'],how='outer')

Related

How to groupby year and unstack years into columns in pandas?

I have a pandas time series ser
ser
>>>
date x
2018-01-01 0.912
2018-01-02 0.704
...
2021-02-01 1.285
and I want to take a cumulative sum by year and make each year into a column as such, and the date index should now be just dates in year (e.g. Jan 01, Jan 02,... the formatting of Month and Day doesn't matter)
date 2018_x 2019_x 2020_x 2021_x 2022_x
Jan-01 0.912 ... ... ... ...
Jan-02 1.616 ... ... ... ...
...
I know how to groupby and take a cumulative sum, but then I want to do some sort of unstacking operation to get the years into columns
ser.groupby(ser.index.year).cumsum()
# what do I do next?
The standard pandas unstack() operation doesn't work here.
Can anyone please advise how to do this?

First you can aggregate sum per MM-DD with years and then reshape by Series.unstack:
df = ser.groupby([ser.index.strftime('%m-%d'), ser.index.year]).sum().unstack(fill_value=0).cumsum()
print (df)
date 2018 2021
date
01-01 0.912 0.000
01-02 1.616 0.000
02-01 1.616 1.285
Or if no duplicated datetimes create MultiIndex without groupby:
ser.index = [ser.index.strftime('%m-%d'), ser.index.year]
df = ser.unstack(fill_value=0).cumsum()
print (df)
date 2018 2021
date
01-01 0.912 0.000
01-02 1.616 0.000
02-01 1.616 1.285

Pandas Shift Column & Remove Row

I have a dataframe 'df1' that has 2 columns and i need to shift the 2nd column down a row and then remove the entire top row of the df1.
My data looks like this:
year ER12
0 2017 -2.05
1 2018 1.05
2 2019 -0.04
3 2020 -0.60
4 2021 -99.99
And, I need it to look like this:
year ER12
0 2018 -2.05
1 2019 1.05
2 2020 -0.04
3 2021 -0.60

We can try this:
df = df.assign(ER12=df.ER12.shift()).dropna().reset_index(drop=True)
print(df)
year ER12
0 2018 -2.05
1 2019 1.05
2 2020 -0.04
3 2021 -0.60

This works on your example:
import pandas as pd
df = pd.DataFrame({'year':[2017,2018,2019,2020,2021], 'ER12':[-2.05,1.05,-0.04,-0.6,-99.99]})
df['year'] = df['year'].shift(-1)
df = df.dropna()

Pandas Implement Equation & Groupby 2 Conditions

I have data that looks like this below and I'm trying to calculate the CRMSE (centered root mean squared error) by site_name and year. Maybe i need an agg function or a lambda function to do this at each groupby parameters (plant_name, year). The dataframe data for df3m1:
plant_name year month obsvals modelvals
0 ARIZONA I 2021 1 8.90 8.30
1 ARIZONA I 2021 2 7.98 7.41
2 CAETITE I 2021 1 9.10 7.78
3 CAETITE I 2021 2 6.05 6.02
The equation that I need to implement by plant_name and year looks like this:
crmse = df3m1.groupby(['plant_name','year'])(( (df3m1.obsvals - df3m1.obsvals.mean()) -
(df3m1.modelvals - df3m1.modelvals.mean()) ) ** 2).mean() ** .5
This is a bit advanced for me yet on how to integrate a groupby and a calculation at the same time. thank you. Final dataframe would look like:
plant_name year crmse
0 ARIZONA I 2021 ?
1 CAETITE I 2021 ?
I have tried things like this with groupby -
crmse = df3m1.groupby(['plant_name','year'])(( (df3m1.obsvals -
df3m1.obsvals.mean()) - (df3m1.modelvals - df3m1.modelvals.mean()) )
** 2).mean() ** .5
but get errors like this:
TypeError: 'DataFrameGroupBy' object is not callable

Using groupby is correct. After that, we would have used .agg normally, but computing csrme interacts with multiple columns (obsvals and modelvals). So we pass the entire dataframe then take columns as we want by using .apply.
Code:
def crmse(x, y):
return np.sqrt(np.mean(np.square( (x - x.mean()) - (y - y.mean()) )))
def f(df):
return pd.Series(crmse(df['obsvals'], df['modelvals']), index=['crmse'])
crmse_series = (
df3m1
.groupby(['plant_name', 'year'])
.apply(f)
)
crmse_series
crmse
plant_name year
ARIZONA I 2021 0.015
CAETITE I 2021 0.645
You can merge the series into the original dataframe with merge.
df = df.merge(crmse_series, on=['plant_name', 'year'])
df
plant_name year month obsvals modelvals crmse
0 ARIZONA I 2021 1 8.90 8.30 0.015
1 ARIZONA I 2021 2 7.98 7.41 0.015
2 CAETITE I 2021 1 9.10 7.78 0.645
3 CAETITE I 2021 2 6.05 6.02 0.645
See Also:
Apply multiple functions to multiple groupby columns

Sort a alphanumeric column in pandas and replace it with original column of the dataset [duplicate]

I have a data frame like this:
print(df)
0 1 2
0 354.7 April 4.0
1 55.4 August 8.0
2 176.5 December 12.0
3 95.5 February 2.0
4 85.6 January 1.0
5 152 July 7.0
6 238.7 June 6.0
7 104.8 March 3.0
8 283.5 May 5.0
9 278.8 November 11.0
10 249.6 October 10.0
11 212.7 September 9.0
As you can see, months are not in calendar order. So I created a second column to get the month number corresponding to each month (1-12). From there, how can I sort this data frame according to calendar months' order?

Use sort_values to sort the df by a specific column's values:
In [18]:
df.sort_values('2')
Out[18]:
0 1 2
4 85.6 January 1.0
3 95.5 February 2.0
7 104.8 March 3.0
0 354.7 April 4.0
8 283.5 May 5.0
6 238.7 June 6.0
5 152.0 July 7.0
1 55.4 August 8.0
11 212.7 September 9.0
10 249.6 October 10.0
9 278.8 November 11.0
2 176.5 December 12.0
If you want to sort by two columns, pass a list of column labels to sort_values with the column labels ordered according to sort priority. If you use df.sort_values(['2', '0']), the result would be sorted by column 2 then column 0. Granted, this does not really make sense for this example because each value in df['2'] is unique.

I tried the solutions above and I do not achieve results, so I found a different solution that works for me. The ascending=False is to order the dataframe in descending order, by default it is True. I am using python 3.6.6 and pandas 0.23.4 versions.
final_df = df.sort_values(by=['2'], ascending=False)
You can see more details in pandas documentation here.

Using column name worked for me.
sorted_df = df.sort_values(by=['Column_name'], ascending=True)

Panda's sort_values does the work.
There are various parameters one can pass, such as ascending (bool or list of bool):
Sort ascending vs. descending. Specify list for multiple sort orders. If this is a list of bools, must match the length of the by.
As the default is ascending, and OP's goal is to sort ascending, one doesn't need to specify that parameter (see the last note below for the way to solve descending), so one can use one of the following ways:
Performing the operation in-place, and keeping the same variable name. This requires one to pass inplace=True as follows:
df.sort_values(by=['2'], inplace=True)
# or
df.sort_values(by = '2', inplace = True)
# or
df.sort_values('2', inplace = True)
If doing the operation in-place is not a requirement, one can assign the change (sort) to a variable:
With the same name of the original dataframe, df as
df = df.sort_values(by=['2'])
With a different name, such as df_new, as
df_new = df.sort_values(by=['2'])
All this previous operations would give the following output
0 1 2
4 85.6 January 1.0
3 95.5 February 2.0
7 104.8 March 3.0
0 354.7 April 4.0
8 283.5 May 5.0
6 238.7 June 6.0
5 152 July 7.0
1 55.4 August 8.0
11 212.7 September 9.0
10 249.6 October 10.0
9 278.8 November 11.0
2 176.5 December 12.0
Finally, one can reset the index with pandas.DataFrame.reset_index, to get the following
df.reset_index(drop = True, inplace = True)
# or
df = df.reset_index(drop = True)
[Out]:
0 1 2
0 85.6 January 1.0
1 95.5 February 2.0
2 104.8 March 3.0
3 354.7 April 4.0
4 283.5 May 5.0
5 238.7 June 6.0
6 152 July 7.0
7 55.4 August 8.0
8 212.7 September 9.0
9 249.6 October 10.0
10 278.8 November 11.0
11 176.5 December 12.0
A one-liner that sorts ascending, and resets the index would be as follows
df = df.sort_values(by=['2']).reset_index(drop = True)
[Out]:
0 1 2
0 85.6 January 1.0
1 95.5 February 2.0
2 104.8 March 3.0
3 354.7 April 4.0
4 283.5 May 5.0
5 238.7 June 6.0
6 152 July 7.0
7 55.4 August 8.0
8 212.7 September 9.0
9 249.6 October 10.0
10 278.8 November 11.0
11 176.5 December 12.0
Notes:
If one is not doing the operation in-place, forgetting the steps mentioned above may lead one (as this user) to not be able to get the expected result.
There are strong opinions on using inplace. For that, one might want to read this.
One is assuming that the column 2 is not a string. If it is, one will have to convert it:
Using pandas.to_numeric
df['2'] = pd.to_numeric(df['2'])
Using pandas.Series.astype
df['2'] = df['2'].astype(float)
If one wants in descending order, one needs to pass ascending=False as
df = df.sort_values(by=['2'], ascending=False)
# or
df.sort_values(by = '2', ascending=False, inplace=True)
[Out]:
0 1 2
2 176.5 December 12.0
9 278.8 November 11.0
10 249.6 October 10.0
11 212.7 September 9.0
1 55.4 August 8.0
5 152 July 7.0
6 238.7 June 6.0
8 283.5 May 5.0
0 354.7 April 4.0
7 104.8 March 3.0
3 95.5 February 2.0
4 85.6 January 1.0

Just as another solution:
Instead of creating the second column, you can categorize your string data(month name) and sort by that like this:
df.rename(columns={1:'month'},inplace=True)
df['month'] = pd.Categorical(df['month'],categories=['December','November','October','September','August','July','June','May','April','March','February','January'],ordered=True)
df = df.sort_values('month',ascending=False)
It will give you the ordered data by month name as you specified while creating the Categorical object.

Just adding some more operations on data. Suppose we have a dataframe df, we can do several operations to get desired outputs
ID cost tax label
1 216590 1600 test
2 523213 1800 test
3 250 1500 experiment
(df['label'].value_counts().to_frame().reset_index()).sort_values('label', ascending=False)
will give sorted output of labels as a dataframe
index label
0 test 2
1 experiment 1

This worked for me
df.sort_values(by='Column_name', inplace=True, ascending=False)

You probably need to reset the index after sorting:
df = df.sort_values('2')
df = df.reset_index(drop=True)

Here is template of sort_values according to pandas documentation.
DataFrame.sort_values(by, axis=0,
ascending=True,
inplace=False,
kind='quicksort',
na_position='last',
ignore_index=False, key=None)[source]
In this case it will be like this.
df.sort_values(by=['2'])
API Reference pandas.DataFrame.sort_values

Just adding a few more insights
df=raw_df['2'].sort_values() # will sort only one column (i.e 2)
but ,
df =raw_df.sort_values(by=["2"] , ascending = False) # this will sort the whole df in decending order on the basis of the column "2"

If you want to sort column dynamically but not alphabetically.
and dont want to use pd.sort_values().
you can try below solution.
Problem : sort column "col1" in this sequence ['A', 'C', 'D', 'B']
import pandas as pd
import numpy as np
## Sample DataFrame ##
df = pd.DataFrame({'col1': ['A', 'B', 'D', 'C', 'A']})
>>> df
col1
0 A
1 B
2 D
3 C
4 A
## Solution ##
conditions = []
values = []
for i,j in enumerate(['A','C','D','B']):
conditions.append((df['col1'] == j))
values.append(i)
df['col1_Num'] = np.select(conditions, values)
df.sort_values(by='col1_Num',inplace = True)
>>> df
col1 col1_Num
0 A 0
4 A 0
3 C 1
2 D 2
1 B 3

This one worked for me:
df=df.sort_values(by=[2])
Whereas:
df=df.sort_values(by=['2'])
is not working.

Example:
Assume you have a column with values 1 and 0 and you want to separate and use only one value, then:
// furniture is one of the columns in the csv file.
allrooms = data.groupby('furniture')['furniture'].agg('count')
allrooms
myrooms1 = pan.DataFrame(allrooms, columns = ['furniture'], index = [1])
myrooms2 = pan.DataFrame(allrooms, columns = ['furniture'], index = [0])
print(myrooms1);print(myrooms2)

aggregate data by quarter

I have a pivot pandas data frame (sales by region) that got created from another pandas data frame (sales by store) using the pivot_table method.
As an example:
df = pd.DataFrame(
{'store':['A','B','C','D','E']*7,
'region':['NW','NW','SW','NE','NE']*7,
'date':['2017-03-30']*5+['2017-04-05']*5+['2017-04-07']*5+['2017-04-12']*5+['2017-04-13']*5+['2017-04-17']*5+['2017-04-20']*5,
'sales':[30,1,133,9,1,30,3,135,9,11,30,1,140,15,15,25,10,137,9,3,29,10,137,9,11,30,19,145,20,10,30,8,141,25,25]
})
df['date'] = pd.to_datetime(df['date'])
df_sales = df.pivot_table(index = ['region'], columns = ['date'], aggfunc = [np.sum], margins = True)
df_sales = df_sales.ix[:,range(0, df_sales.shape[1]-1)]
My goal is to do the following to the sales data frame, df_sales.
Create a new dataframe that summarizes sales by quarter. I could use the original dataframe df, or the sales_df.
As of quarter here we only have only two quarters (USA fiscal calendar year) so the quarterly aggregated data frame would look like:
2017Q1 2017Q2
10 27
31 37.5
133 139.17
I take the average for all days in Q1, and same for Q2. Thus, for example for the North east region, 'NE', the Q1 is the average of only one day 2017-03-30, i.e., 10, and for the Q2 is the average across 2017-04-05 to 2017-04-20, i.e.,
(20+30+12+20+30+50)/6=27
Any suggestions?
ADDITIONAL NOTE: I would ideally do the quarter aggregations on the df_sales pivoted table since it's a much smaller dataframe to keep in memory. The current solution does it on the original df, but I am still seeking a way to do it in the df_sales dataframe.

UPDATE:
Setup:
df.date = pd.to_datetime(df.date)
df_sales = df.pivot_table(index='region', columns='date', values='sales', aggfunc='sum')
In [318]: df_sales
Out[318]:
date 2017-03-30 2017-04-05 2017-04-07 2017-04-12 2017-04-13 2017-04-17 2017-04-20
region
NE 10 20 30 12 20 30 50
NW 31 33 31 35 39 49 38
SW 133 135 140 137 137 145 141
Solution:
In [319]: (df_sales.groupby(pd.PeriodIndex(df_sales.columns, freq='Q'), axis=1)
...: .apply(lambda x: x.sum(axis=1)/x.shape[1])
...: )
Out[319]:
date 2017Q1 2017Q2
region
NE 10.0 27.000000
NW 31.0 37.500000
SW 133.0 139.166667
Solution based on the original DF:
In [253]: (df.groupby(['region', pd.PeriodIndex(df.date, freq='Q-DEC')])
...: .apply(lambda x: x['sales'].sum()/x['date'].nunique())
...: .to_frame('avg').unstack('date')
...: )
...:
Out[253]:
avg
date 2017Q1 2017Q2
region
NE 10.0 27.000000
NW 31.0 37.500000
SW 133.0 139.166667
NOTE: df - is the original DF (before "pivoting")

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

Pandas 'multi-index' issue in merging dataframes - pandas

A easier way to bypass this issue: df2.to_clipboard(index=False) df2clip=pd.read_clipboard(sep='\t') df3.to_clipboard(index=False) df3clip=pd.read_clipboard(sep='\t') Then merge 2 df again: pd.merge(df2clip,df3clip,on=['stock','year'],how='outer')

Related

How to groupby year and unstack years into columns in pandas?

Pandas Shift Column & Remove Row

Pandas Implement Equation & Groupby 2 Conditions

Sort a alphanumeric column in pandas and replace it with original column of the dataset [duplicate]

aggregate data by quarter

Categories

Resources