How to join pandas dataframes which have a multiindex - pandas

Problem Description
I have a dataframe with a multi-index that is three levels deep (0, 1, 2) and I'd like to join this dataframe with another dataframe which is indexed by level 2 of my original dataframe.
In code, I'd like to turn:
pd.DataFrame(['a', 'b', 'c', 'd']).transpose().set_index([0, 1, 2])
and
pd.DataFrame(['c', 'e']).transpose().set_index(0)
into
pd.DataFrame(['a', 'b', 'c', 'd', 'e']).transpose().set_index([0, 1, 2])
What I've tried
I've tried using swaplevel and then join. That didn't work, though some of the error messages suggested that if only I could set the on argument properly, it might.
I also tried concat, but couldn't get that to work either. I'm not sure it can't work, though.
Notes:
I have seen this question in which the answer seems to dodge the question (while solving the problem).

pandas will naturally do this for you if the names of the index levels line up. You can rename the index of the second dataframe and join accordingly.
d1 = pd.DataFrame(['a', 'b', 'c', 'd']).transpose().set_index([0, 1, 2])
d2 = pd.DataFrame(['c', 'e']).transpose().set_index(0)
d1.join(d2.rename_axis(2))
       3  1
0 1 2
a b c  d  e
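The question also mentioned wanting to "set on properly". Since pandas 0.23, join's on argument can name an index level of the caller as well as a column, so the following may work too (a sketch under that version assumption, not part of the original answer):

# join d2's index against the index level of d1 that is named 2
d1.join(d2, on=2)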
More Comprehensive Example
d1 = pd.DataFrame([
    [1, 2],
    [3, 4],
    [5, 6],
    [7, 8]
], pd.MultiIndex.from_product([['A', 'B'], ['X', 'Y']], names=['One', 'Two']))
d2 = pd.DataFrame([
    list('abcdefg')
], ['Y'], columns=list('ABCDEFG'))
d3 = pd.DataFrame([
    list('hij')
], ['A'], columns=list('HIJ'))
d1.join(d2.rename_axis('Two')).join(d3.rename_axis('One'))
         0  1    A    B    C    D    E    F    G    H    I    J
One Two
A   X    1  2  NaN  NaN  NaN  NaN  NaN  NaN  NaN    h    i    j
    Y    3  4    a    b    c    d    e    f    g    h    i    j
B   X    5  6  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN
    Y    7  8    a    b    c    d    e    f    g  NaN  NaN  NaN
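If renaming the axes feels intrusive, an equivalent route (a sketch, not from the answer above) is to drop the MultiIndex to columns and merge on them explicitly:

merged = (
    d1.reset_index()
      .merge(d2, left_on='Two', right_index=True, how='left')
      .merge(d3, left_on='One', right_index=True, how='left')
      .set_index(['One', 'Two'])
)  # same result as the chained joins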

Related

Insert index to level in multiindex dataframe [duplicate]

Is it possible to add one index to a level in multiindex dataframe?
For example, I am trying to add 'new_index' to level 1 with nan value.
#Sample data
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})
df = df.set_index([['one', 'two', 'three'], [1, 2, 3]])
df.index.names = ['first', 'second']
df
#Output
              A  B  C
first second
one   1       1  4  7
two   2       2  5  8
three 3       3  6  9
#Desired Output
                   A    B    C
first second
one   1            1    4    7
      new_index  NaN  NaN  NaN
two   2            2    5    8
      new_index  NaN  NaN  NaN
three 3            3    6    9
      new_index  NaN  NaN  NaN
Thank you very much.
This is what I found (note the result of the last stack/unstack must be assigned back):
df = df.unstack("second").stack(level=0)
df["new_index"] = "NA"
df = df.stack().unstack(level=1)
df
#output
                          A           B           C
first second
one   1          1.00000000  4.00000000  7.00000000
      new_index          NA          NA          NA
three 3          3.00000000  6.00000000  9.00000000
      new_index          NA          NA          NA
two   2          2.00000000  5.00000000  8.00000000
      new_index          NA          NA          NA
Since that NA is actually just the string "NA", this can't rigorously be called an answer.
But replacing it with np.nan makes 'new_index' disappear, since stack() drops NaN rows by default.
Any other idea?
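One alternative (a sketch, not from the thread above): build the enlarged MultiIndex explicitly and reindex against it. No stack/unstack is involved, so real np.nan values survive and the original row order is preserved.

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})
df = df.set_index([['one', 'two', 'three'], [1, 2, 3]])
df.index.names = ['first', 'second']

# interleave a (first, 'new_index') tuple after each existing row;
# reindex fills the rows that don't exist yet with NaN automatically
new_index = pd.MultiIndex.from_tuples(
    [t for first, second in df.index
       for t in [(first, second), (first, 'new_index')]],
    names=df.index.names)
df.reindex(new_index)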

Find names of n largest values in each row of dataframe

I can find the n largest values in each row of a numpy array (link) but doing so loses the column information which is what I want. Say I have some data:
import pandas as pd
import numpy as np
np.random.seed(42)
data = np.random.rand(5,5)
data = pd.DataFrame(data, columns = list('abcde'))
data
          a         b         c         d         e
0  0.374540  0.950714  0.731994  0.598658  0.156019
1  0.155995  0.058084  0.866176  0.601115  0.708073
2  0.020584  0.969910  0.832443  0.212339  0.181825
3  0.183405  0.304242  0.524756  0.431945  0.291229
4  0.611853  0.139494  0.292145  0.366362  0.456070
I want the names of the largest contributors in each row. So for n = 2 the output would be:
0 b c
1 c e
2 b c
3 c d
4 a e
I can do it by looping over the dataframe but that would be inefficient. Is there a more pythonic way?
With the pandas.Series.nlargest function (note the DataFrame here is called data, not df):
data.apply(lambda x: x.nlargest(2).index.values, axis=1)
0    [b, c]
1    [c, e]
2    [b, c]
3    [c, d]
4    [a, e]
dtype: object
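If you want the flat layout from the question rather than a Series of arrays, one way (a sketch) is to expand the result into its own DataFrame:

# one column per rank, names ordered largest value first
pd.DataFrame(
    data.apply(lambda x: x.nlargest(2).index.values, axis=1).tolist(),
    index=data.index)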
Another option uses numpy.argpartition to find the top n indices per row and then extracts the column names by position (note n must be defined first):
import numpy as np
n = 2
nlargest_index = np.argpartition(data.values, data.shape[1] - n)[:, -n:]
data.columns.values[nlargest_index]
#array([['c', 'b'],
# ['e', 'c'],
# ['c', 'b'],
# ['d', 'c'],
# ['e', 'a']], dtype=object)
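Note that argpartition makes no general ordering guarantee within the top n (here the names happen to come out largest-last). If you also want them sorted by value per row, a full argsort does it at slightly higher cost (a sketch):

# column positions per row, largest value first
order = np.argsort(-data.values, axis=1)[:, :n]
data.columns.values[order]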
Can a dense ranking be used for this?
N = 2
threshold = len(data.columns) - N
nlargest = data[data.rank(method="dense", axis=1) > threshold]
>>> nlargest
          a         b         c         d         e
0       NaN  0.950714  0.731994       NaN       NaN
1       NaN       NaN  0.866176       NaN  0.708073
2       NaN  0.969910  0.832443       NaN       NaN
3       NaN       NaN  0.524756  0.431945       NaN
4  0.611853       NaN       NaN       NaN  0.456070
>>> nlargest.stack()
0  b    0.950714
   c    0.731994
1  c    0.866176
   e    0.708073
2  b    0.969910
   c    0.832443
3  c    0.524756
   d    0.431945
4  a    0.611853
   e    0.456070
dtype: float64
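To get back to one list of names per row from that mask, a small follow-up (a sketch; note this yields the names in column order, not value order):

nlargest.apply(lambda row: row.dropna().index.tolist(), axis=1)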

Fill in NA column values with values from another row based on condition

I want to replace the missing values of one row with column values of another row based on a condition. The real problem has many more columns with NA values. In this example, I want to fill na values for row 4 with values from row 0 for columns A and B, as the value 'e' maps to 'a' for column C.
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [0, 1, np.nan, 3, np.nan],
                   'B': [5, 6, np.nan, 8, np.nan],
                   'C': ['a', 'b', 'c', 'd', 'e']})
df
Out[21]:
     A    B  C
0  0.0  5.0  a
1  1.0  6.0  b
2  NaN  NaN  c
3  3.0  8.0  d
4  NaN  NaN  e
I have tried this:
df.loc[df.C == 'e', ['A', 'B']] = df.loc[df.C == 'a', ['A', 'B']]
Is it possible to use a nested np.where statement instead?
Your code fails due to index alignment: since the indices differ (0 vs 4), NaN is assigned.
Use the underlying numpy array to bypass index alignment:
df.loc[df.C == 'e', ['A', 'B']] = df.loc[df.C == 'a', ['A', 'B']].values
NB: both sides of the assignment must have the same shape.
Output:
     A    B  C
0  0.0  5.0  a
1  1.0  6.0  b
2  NaN  NaN  c
3  3.0  8.0  d
4  0.0  5.0  e
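Since the real problem has many more such rows, here is a more general sketch, starting again from the original df. It assumes a hypothetical mapping dict that says, for each incomplete row's C value, which donor row's C value to copy from, and it assumes C values are unique:

# hypothetical mapping: C value of a row with NaNs -> C value of its donor row
mapping = {'e': 'a'}

donor_keys = df['C'].map(mapping).fillna(df['C'])  # rows without a mapping fall back to themselves
donor = df.set_index('C').loc[donor_keys].reset_index(drop=True)
df[['A', 'B']] = df[['A', 'B']].fillna(donor[['A', 'B']])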

How to find column names in pandas dataframe that contain all unique values except NaN?

I want to find the columns of a pandas DataFrame whose values are all unique, ignoring NaN.
     x    y    z
a    1    2    A
b    2    2    B
c  NaN    3    D
d    4  NaN  NaN
e  NaN  NaN  NaN
Columns "x" and "z" have no duplicate values apart from NaN, so I want to pick them out and create a new DataFrame.
Let us use nunique: a column qualifies when its number of unique values equals its number of non-null values.
m = df.nunique() == df.notnull().sum()
subdf = df.loc[:, m]
     x    z
a  1.0    A
b  2.0    B
c  NaN    D
d  4.0  NaN
e  NaN  NaN
m.index[m].tolist()
['x', 'z']
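An equivalent spelling (a sketch) leans on Series.is_unique over the non-NaN values:

df.loc[:, [c for c in df.columns if df[c].dropna().is_unique]]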
Compare the number of unique values with the number of values remaining after dropna().
Try this code:
import pandas as pd
import numpy as np

df = pd.DataFrame({"x": [1, 2, np.nan, 4, np.nan],
                   "y": [2, 2, 3, np.nan, np.nan],
                   "z": ["A", "B", "D", np.nan, np.nan]})

for col in df.columns:
    if len(df[col].dropna()) == len(df[col].dropna().unique()):
        print(col)
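Since the question asks for a new DataFrame rather than printed names, a small extension of the same test collects the qualifying columns first (a sketch):

keep = [col for col in df.columns
        if len(df[col].dropna()) == len(df[col].dropna().unique())]
subdf = df[keep]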

pandas set_index with NA and None values seems not to be working

I am trying to index a pandas DataFrame using columns with occasional NA and None in them. This seems to fail. In the example below, df0 has the (None, e) combination at index 3, but df1 has (NaN, e). Any suggestions?
import pandas as pd
import numpy as np
df0 = pd.DataFrame({'k1': ['4', np.nan, '6', None, np.nan],
                    'k2': ['a', 'd', np.nan, 'e', np.nan],
                    'v': [1, 2, 3, 4, 5]})
df1 = df0.copy().set_index(['k1', 'k2'])
>>> df0
Out[3]:
     k1   k2  v
0     4    a  1
1   NaN    d  2
2     6  NaN  3
3  None    e  4
4   NaN  NaN  5
>>> df1
Out[4]:
         v
k1  k2
4   a    1
NaN d    2
6   NaN  3
NaN e    4
    NaN  5
Edit: I see the point now, so this is the expected behavior.
This is expected behaviour: the None value is converted to NaN, and because a repeated index value is sparsified in the display, it isn't shown:
In [31]:
df1.index
Out[31]:
MultiIndex(levels=[['4', '6'], ['a', 'd', 'e']],
           labels=[[0, -1, 1, -1, -1], [0, 1, -1, 2, -1]],
           names=['k1', 'k2'])
From the above you can see that -1 is used to represent NaN values. As for the display, if your df were like the following, the output would show the same behaviour:
In [34]:
df0 = pd.DataFrame({'k1': ['4', np.nan, '6', 1, 1],
                    'k2': ['a', 'd', np.nan, 'e', np.nan],
                    'v': [1, 2, 3, 4, 5]})
df1 = df0.copy().set_index(['k1', 'k2'])
df1
Out[34]:
         v
k1  k2
4   a    1
NaN d    2
6   NaN  3
1   e    4
    NaN  5
You can see that 1 is repeated for the last two rows but displayed only once, just as the NaN was.
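To convince yourself the NaN really is stored in the index despite the blank display, a quick check (the commented repr below is a sketch and depends on your pandas version):

df1.index.get_level_values('k1')
# e.g. Index(['4', nan, '6', 1, 1], dtype='object', name='k1')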