Joining two pandas dataframes with multi-indexed columns

I want to join two pandas dataframes, one of which has multi-indexed columns.
This is how I make the first dataframe.
data_large = pd.DataFrame({"name":["a", "b", "c"], "sell":[10, 60, 50], "buy":[20, 30, 40]})
data_mini = pd.DataFrame({"name":["b", "c", "d"], "sell":[60, 20, 10], "buy":[30, 50, 40]})
data_topix = pd.DataFrame({"name":["a", "b", "c"], "sell":[10, 80, 0], "buy":[70, 30, 40]})
df_out = pd.concat([dfi.set_index('name') for dfi in [data_large, data_mini, data_topix]],
keys=['Large', 'Mini', 'Topix'], axis=1)\
.rename_axis(mapper=['name'], axis=0).rename_axis(mapper=['product','buy_sell'], axis=1)
df_out
And this is the second dataframe.
group = pd.DataFrame({"name":["a", "b", "c", "d"], "group":[1, 1, 2, 2]})
group
How can I join the second to the first, on the column name, keeping the multi-indexed columns?
This did not work; it flattened the multi-index:
df_final = df_out.merge(group, on=['name'], how='left')
Any help would be appreciated!

If you need a MultiIndex after the merge, the group DataFrame's columns have to be converted to a MultiIndex as well. Here the name column is converted to the index, so the frames can be merged by index; otherwise both key columns would have to be converted to MultiIndex (a sketch of that route follows the output below):
group = group.set_index('name')
group.columns = pd.MultiIndex.from_product([group.columns, ['new']])
df_final = df_out.merge(group, on=['name'], how='left')
Or:
df_final = df_out.merge(group, left_index=True, right_index=True, how='left')
print (df_final)
product  Large        Mini        Topix        group
buy_sell  sell   buy  sell   buy   sell   buy    new
name
a         10.0  20.0   NaN   NaN   10.0  70.0      1
b         60.0  30.0  60.0  30.0   80.0  30.0      1
c         50.0  40.0  20.0  50.0    0.0  40.0      2
d          NaN   NaN  10.0  40.0    NaN   NaN      2
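The other route mentioned above, with both key columns converted to MultiIndex, would look something like this; a minimal sketch, assuming the original group with name still a regular column:
df_left = df_out.reset_index()                       # 'name' becomes column ('name', '')
group2 = group.copy()
group2.columns = pd.MultiIndex.from_product([group2.columns, ['']])
df_final = df_left.merge(group2, on=[('name', '')], how='left')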
Another possible way, though it raises a warning, is to merge first and convert the flat columns to a MultiIndex afterwards:
df_final = df_out.merge(group, on=['name'], how='left')
UserWarning: merging between different levels can give an unintended result (2 levels on the left, 1 on the right)
warnings.warn(msg, UserWarning)
L = [x if isinstance(x, tuple) else (x, 'new') for x in df_final.columns.tolist()]
df_final.columns = pd.MultiIndex.from_tuples(L)
print (df_final)
  name Large        Mini        Topix        group
   new  sell   buy  sell   buy   sell   buy    new
0    a  10.0  20.0   NaN   NaN   10.0  70.0      1
1    b  60.0  30.0  60.0  30.0   80.0  30.0      1
2    c  50.0  40.0  20.0  50.0    0.0  40.0      2
3    d   NaN   NaN  10.0  40.0    NaN   NaN      2
EDIT: If you need group in the row MultiIndex:
group = group.set_index(['name'])
group.columns = pd.MultiIndex.from_product([group.columns, ['new']])
df_final = (df_out.merge(group, on=['name'], how='left')
.set_index([('group','new')], append=True)
.rename_axis(['name','group']))
print (df_final)
product     Large        Mini        Topix
buy_sell     sell   buy  sell   buy   sell   buy
name group
a    1       10.0  20.0   NaN   NaN   10.0  70.0
b    1       60.0  30.0  60.0  30.0   80.0  30.0
c    2       50.0  40.0  20.0  50.0    0.0  40.0
d    2        NaN   NaN  10.0  40.0    NaN   NaN
Or, starting from the original flat group (this merge again warns about mixed levels):
df_final = df_out.merge(group, on=['name'], how='left').set_index(['name','group'])
df_final.columns = pd.MultiIndex.from_tuples(df_final.columns)
print (df_final)
            Large        Mini        Topix
             sell   buy  sell   buy   sell   buy
name group
a    1       10.0  20.0   NaN   NaN   10.0  70.0
b    1       60.0  30.0  60.0  30.0   80.0  30.0
c    2       50.0  40.0  20.0  50.0    0.0  40.0
d    2        NaN   NaN  10.0  40.0    NaN   NaN
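Since group's key already sits in its index here, DataFrame.join (which aligns on the index) is an equivalent shortcut for the index merge; a small sketch, assuming the MultiIndex-column group built above:
df_final = df_out.join(group)                        # join defaults to how='left'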

Related

groupby transform with if condition in pandas

I have a data frame as given below
df = pd.DataFrame({'key': ['a', 'a', 'a', 'b', 'c', 'c'] , 'val' : [10, np.nan, 9 , 10, 11, 13]})
df
key val
0 a 10.0
1 a NaN
2 a 9.0
3 b 10.0
4 c 11.0
5 c 13.0
I want to perform a groupby and transform so that the new column is each value divided by its group mean, which I can do as below:
df['new'] = df.groupby('key')['val'].transform(lambda g : g/g.mean())
df.new
0 1.052632
1 NaN
2 0.947368
3 1.000000
4 0.916667
5 1.083333
Name: new, dtype: float64
Now I have a condition: if val is np.nan, then the new column value should be np.inf, which should give the result below:
0 1.052632
1 np.inf
2 0.947368
3 1.000000
4 0.916667
5 1.083333
Name: new, dtype: float64
In other words, how can I check whether val is np.nan within the groupby and transform?
Thanks in advance
Add Series.replace:
df['new'] = (df.groupby('key')['val'].transform(lambda g : g/g.mean())
.replace(np.nan, np.inf))
print (df)
key val new
0 a 10.0 1.052632
1 a NaN inf
2 a 9.0 0.947368
3 b 10.0 1.000000
4 c 11.0 0.916667
5 c 13.0 1.083333
Or numpy.where:
df['new'] = np.where(df.val.isna(),
np.inf, df.groupby('key')['val'].transform(lambda g : g/g.mean()))
print (df)
key val new
0 a 10.0 1.052632
1 a NaN inf
2 a 9.0 0.947368
3 b 10.0 1.000000
4 c 11.0 0.916667
5 c 13.0 1.083333
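A third option, if you want to overwrite only the rows where val itself is missing (rather than every NaN the transform produces, e.g. from an all-NaN group), is Series.mask; a small sketch:
df['new'] = (df.groupby('key')['val'].transform(lambda g : g/g.mean())
               .mask(df['val'].isna(), np.inf))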

Pivot to multi-index and combining columns into one level using pandas

I am trying to pivot a dataframe such that the unique values in an 'ID' column will be used for column labels and a multi-index will be created to organize the data into grouped rows. The second level of the multi-index will be the unique values obtained from the 'Date' column, and the first level will contain all other columns not considered in the pivoting operation.
Here's the dataframe sample:
df = pd.DataFrame(
data=[['A', '10/19/2020', 33, 0.2],
['A', '10/6/2020', 17, 0.6],
['A', '11/8/2020', 7, 0.3],
['A', '11/14/2020', 19, 0.2],
['B', '10/28/2020', 26, 0.6],
['B', '11/6/2020', 19, 0.3],
['B', '11/10/2020', 29, 0.1]],
columns=['ID', 'Date', 'Temp', 'PPM'])
original df
ID Date Temp PPM
0 A 10/19/2020 33 0.2
1 A 10/6/2020 17 0.6
2 A 11/8/2020 7 0.3
3 A 11/14/2020 19 0.2
4 B 10/28/2020 26 0.6
5 B 11/6/2020 19 0.3
6 B 11/10/2020 29 0.1
desired output
ID                 A     B
     Date
Temp 10/6/2020     17   NaN
     10/19/2020    33   NaN
     10/28/2020   NaN    26
     11/6/2020    NaN    19
     11/8/2020      7   NaN
     11/10/2020   NaN    29
     11/14/2020    19   NaN
PPM  10/6/2020    0.6   NaN
     10/19/2020   0.2   NaN
     10/28/2020   NaN   0.6
     11/6/2020    NaN   0.3
     11/8/2020    0.3   NaN
     11/10/2020   NaN   0.1
     11/14/2020   0.2   NaN
I took a look at this extensive answer on pivoting dataframes in pandas, but I am unable to see how it covers, or how to apply it to, the specific case I am trying to implement.
EDIT: While I've provided dates as strings in the sample, these are actually datetime64 objects in the full dataframe I'm dealing with.
Let us try set_index and unstack
out = df.set_index(['ID','Date']).unstack().T
Out[27]:
ID                    A     B
     Date
Temp 10/19/2020    33.0   NaN
     10/28/2020     NaN  26.0
     10/6/2020     17.0   NaN
     11/10/2020     NaN  29.0
     11/14/2020    19.0   NaN
     11/6/2020      NaN  19.0
     11/8/2020      7.0   NaN
PPM  10/19/2020     0.2   NaN
     10/28/2020     NaN   0.6
     10/6/2020      0.6   NaN
     11/10/2020     NaN   0.1
     11/14/2020     0.2   NaN
     11/6/2020      NaN   0.3
     11/8/2020      0.3   NaN
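Regarding the EDIT: with real datetime64 values the unstacked Date level sorts chronologically instead of lexically (note how 10/6/2020 lands after 10/28/2020 above); a minimal sketch, converting the sample's strings first:
df['Date'] = pd.to_datetime(df['Date'])
out = df.set_index(['ID','Date']).unstack().T        # Date level now sorts chronologically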

Groupby Year and other column and calculate average based on specific condition pandas

I have a data frame as shown below
Tenancy_ID Unit_ID End_Date Rental_value
1 A 2012-04-26 10
2 A 2012-08-27 20
3 A 2013-04-27 50
4 A 2014-04-27 40
1 B 2011-06-26 10
2 B 2011-09-27 30
3 B 2013-04-27 60
4 B 2015-04-27 80
From the above I would like to prepare below data frame
Expected Output:
Unit_ID Avg_2011 Avg_2012 Avg_2013 Avg_2014 Avg_2015
A NaN 15 50 40 NaN
B 20 NaN 60 NaN 80
Steps:
Unit_ID = A, has two contracts in 2012 with rental value 10 and 20, Hence the average is 15.
Avg_2012 = Average rental value in 2012.
Use pivot_table directly with Series.dt.year:
#df['End_Date']=pd.to_datetime(df['End_Date']) if dtype of End_Date is not datetime
final = (df.pivot_table('Rental_value','Unit_ID',df['End_Date'].dt.year)
.add_prefix('Avg_').reset_index().rename_axis(None,axis=1))
print(final)
Unit_ID Avg_2011 Avg_2012 Avg_2013 Avg_2014 Avg_2015
0 A NaN 15.0 50.0 40.0 NaN
1 B 20.0 NaN 60.0 NaN 80.0
You can aggregate the averages and reshape with Series.unstack, then change the column names with DataFrame.add_prefix, and finally clean up with DataFrame.reset_index and DataFrame.rename_axis:
df1 = (df.groupby(['Unit_ID', df['End_Date'].dt.year])['Rental_value']
.mean()
.unstack()
.add_prefix('Avg_')
.reset_index()
.rename_axis(None, axis=1))
print (df1)
Unit_ID Avg_2011 Avg_2012 Avg_2013 Avg_2014 Avg_2015
0 A NaN 15.0 50.0 40.0 NaN
1 B 20.0 NaN 60.0 NaN 80.0

How to put groupby result into the same row

I have the following dataframe:
import pandas as pd
df = pd.DataFrame({'id' :["c1","c1","c1","c2","c2","c3","c3","c3","c3","c4","c4","c5","c6","c6","c6","c7","c7"],'store' : ["first","second","second","first",
"second","first","third","fourth",
"fifth","second","fifth","first",
"first","second","third","fourth","fifth"],
'purchase': [10,10,10,20,20,30,30,30,30,40,40,50,60,60,60,70,70]})
After you do a groupby:
df_group= df.groupby(['id','store']).agg({'purchase': ["sum"]})
Result of df_group:
               purchase
                    sum
id store
c1 first             10
   second            20
c2 first             20
   second            20
c3 fifth             30
   first             30
   fourth            30
   third             30
c4 fifth             40
   second            40
c5 first             50
c6 first             60
   second            60
   third             60
c7 fifth             70
   fourth            70
I want each card (id) to have all of its purchases in the different stores appear in the same row, for example:
id 1_store 1_sum 2_store 2_sum 3_store 3_sum 4_store 4_sum...
0 c1 first 10 second 20
1 C2 first 20 second 20
2 c3 fifth 30 first 30 fourth 30 third 30
I don't want to use unstack on store; the reason is that there are so many stores it would create too many columns, most of them empty.
How can I achieve the above result?
Thanks
You need to create a cumcount variable to get the column labels; then this becomes a .pivot_table problem. You end up with quite the MultiIndex on the columns, which we can collapse afterwards:
df_group['idx'] = df_group.groupby(level=0).cumcount()+1
df_res = (df_group.reset_index()
.pivot_table(index='id',
columns='idx',
values=['store', 'purchase'],
aggfunc='first')
.sort_index(level=2, axis=1))
Output:
    purchase  store purchase   store purchase   store purchase  store
         sum            sum             sum             sum
idx        1      1        2       2        3       3        4      4
id
c1      10.0  first     20.0  second      NaN     NaN      NaN    NaN
c2      20.0  first     20.0  second      NaN     NaN      NaN    NaN
c3      30.0  fifth     30.0   first     30.0  fourth     30.0  third
c4      40.0  fifth     40.0  second      NaN     NaN      NaN    NaN
c5      50.0  first      NaN     NaN      NaN     NaN      NaN    NaN
c6      60.0  first     60.0  second     60.0   third      NaN    NaN
c7      70.0  fifth     70.0  fourth      NaN     NaN      NaN    NaN
If you need to collapse the columns (probably a good idea, since the MultiIndex is no longer lexsorted):
df_res.columns = ['_'.join(map(str, [y for y in x[::-1] if y != ''])) for x in df_res.columns]
    1_sum_purchase 1_store  2_sum_purchase 2_store  3_sum_purchase 3_store  4_sum_purchase 4_store
id
c1            10.0   first            20.0  second             NaN     NaN             NaN     NaN
c2            20.0   first            20.0  second             NaN     NaN             NaN     NaN
c3            30.0   fifth            30.0   first            30.0  fourth            30.0   third
c4            40.0   fifth            40.0  second             NaN     NaN             NaN     NaN
c5            50.0   first             NaN     NaN             NaN     NaN             NaN     NaN
c6            60.0   first            60.0  second            60.0   third             NaN     NaN
c7            70.0   fifth            70.0  fourth             NaN     NaN             NaN     NaN
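If you would rather avoid the ('purchase', 'sum') column level entirely, here is a sketch of a flat-column variant, assuming pandas >= 1.1 (where DataFrame.pivot accepts a list of values):
g = df.groupby(['id','store'], as_index=False)['purchase'].sum()
g['idx'] = g.groupby('id').cumcount() + 1
wide = (g.pivot(index='id', columns='idx', values=['store','purchase'])
         .sort_index(axis=1, level='idx'))           # group the columns by idx
wide.columns = [f'{i}_{c}' for c, i in wide.columns] # e.g. '1_purchase', '1_store'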

Pandas dataframe creating multiple rows at once via .loc

I can create a new row in a dataframe using .loc:
>>> df = pd.DataFrame({'a':[10, 20], 'b':[100,200]}, index='1 2'.split())
>>> df
a b
1 10 100
2 20 200
>>> df.loc[3, 'a'] = 30
>>> df
a b
1 10.0 100.0
2 20.0 200.0
3 30.0 NaN
But how can I create more than one row using the same method?
>>> df.loc[[4, 5], 'a'] = [40, 50]
...
KeyError: '[4 5] not in index'
I'm familiar with .append(), but am looking for a way that does NOT require constructing each new row as a Series before appending it to df.
Desired input:
>>> df.loc[[4, 5], 'a'] = [40, 50]
Desired output
a b
1 10.0 100.0
2 20.0 200.0
3 30.0 NaN
4 40.0 NaN
5 50.0 NaN
Where last 2 rows are newly added.
Admittedly, this is a very late answer, but I have had to deal with a similar problem and think my solution might be helpful to others as well.
After recreating your data, it is basically a two-step approach:
Recreate data:
import pandas as pd
df = pd.DataFrame({'a':[10, 20], 'b':[100,200]}, index='1 2'.split())
df.loc[3, 'a'] = 30
Extend the df.index using .reindex:
idx = list(df.index)
new_rows = list(map(str, range(4, 6))) # more easily extended than new_rows = ["4", "5"]
idx.extend(new_rows)
df = df.reindex(index=idx)
Set the values using .loc:
df.loc[new_rows, "a"] = [40, 50]
giving you
>>> df
a b
1 10.0 100.0
2 20.0 200.0
3 30.0 NaN
4 40.0 NaN
5 50.0 NaN
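A slightly more compact variant of the same idea uses Index.union, which also deduplicates and (where the labels are comparable) returns a sorted index; a sketch with an integer index:
df = pd.DataFrame({'a':[10, 20], 'b':[100,200]}, index=[1, 2])
df = df.reindex(df.index.union([3, 4, 5]))  # new rows 3-5 appear filled with NaN
df.loc[[3, 4, 5], 'a'] = [30, 40, 50]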
Example data
>>> data = pd.DataFrame({
'a': [10, 6, -3, -2, 4, 12, 3, 3],
'b': [6, -3, 6, 12, 8, 11, -5, -5],
'id': [1, 1, 1, 1, 6, 2, 2, 4]})
Case 1: Note that the range can be altered to whatever you desire; scalar .loc assignment enlarges the frame one row at a time.
>>> for i in range(10):
... data.loc[i, 'a'] = 30
...
>>> data
a b id
0 30.0 6.0 1.0
1 30.0 -3.0 1.0
2 30.0 6.0 1.0
3 30.0 12.0 1.0
4 30.0 8.0 6.0
5 30.0 11.0 2.0
6 30.0 -5.0 2.0
7 30.0 -5.0 4.0
8 30.0 NaN NaN
9 30.0 NaN NaN
Case 2: Here we are adding a new column to a data frame that had 8 rows to begin with. As we extend our new column c to length 10, the other columns are extended with NaN.
>>> for i in range(10):
... data.loc[i, 'c'] = 30
...
>>> data
a b id c
0 10.0 6.0 1.0 30.0
1 6.0 -3.0 1.0 30.0
2 -3.0 6.0 1.0 30.0
3 -2.0 12.0 1.0 30.0
4 4.0 8.0 6.0 30.0
5 12.0 11.0 2.0 30.0
6 3.0 -5.0 2.0 30.0
7 3.0 -5.0 4.0 30.0
8 NaN NaN NaN 30.0
9 NaN NaN NaN 30.0
Also somewhat late, but my solution was similar to the accepted one:
import pandas as pd
df = pd.DataFrame({'a':[10, 20], 'b':[100,200]}, index=[1,2])
# single index assignment always works
df.loc[3, 'a'] = 30
# multiple indices
new_rows = [4,5]
# there should be a nicer way to add more than one index/row at once,
# but at least this is just one extra line:
df = df.reindex(index=df.index.append(pd.Index(new_rows))) # note: Index.append() doesn't accept non-Index iterables?
# multiple new rows now works:
df.loc[new_rows, "a"] = [40, 50]
print(df)
... which yields:
a b
1 10.0 100.0
2 20.0 200.0
3 30.0 NaN
4 40.0 NaN
5 50.0 NaN
Inserting whole rows also works now that the new labels exist in the index (useful when performance matters while aggregating dataframes):
# inserting whole rows:
df.loc[new_rows] = [[41, 51], [61,71]]
print(df)
a b
1 10.0 100.0
2 20.0 200.0
3 30.0 NaN
4 41.0 51.0
5 61.0 71.0