pandas set_index with NA and None values seem to be not working - pandas

I am trying to index a pandas DataFrame using columns with occasional NA and None in them. This seems to be failing. In the example below, df0 has (None,e) combination on index 3, but df1 has (NaN,e). Any suggestions?
import pandas as pd
import numpy as np
df0 = pd.DataFrame({'k1':['4',np.NaN,'6',None,np.NaN], 'k2':['a','d',np.NaN,'e',np.NaN], 'v':[1,2,3,4,5]})
df1 = df0.copy().set_index(['k1','k2'])
>>> df0
Out[3]:
k1 k2 v
0 4 a 1
1 NaN d 2
2 6 NaN 3
3 None e 4
4 NaN NaN 5
>>> df1
Out[4]:
v
k1 k2
4 a 1
NaN d 2
6 NaN 3
NaN e 4
NaN 5
Edit: I see the point--so this is the expected behavior.

This is expected behaviour, the None value is being converted to NaN and as the value is duplicated it isn't being shown:
In [31]:
df1.index
Out[31]:
MultiIndex(levels=[['4', '6'], ['a', 'd', 'e']],
labels=[[0, -1, 1, -1, -1], [0, 1, -1, 2, -1]],
names=['k1', 'k2'])
From the above you can see that -1 is being used to display NaN values, with respect to the output, if your df was like the following then the output shows the same behaviour:
In [34]:
df0 = pd.DataFrame({'k1':['4',np.NaN,'6',1,1], 'k2':['a','d',np.NaN,'e',np.NaN], 'v':[1,2,3,4,5]})
df1 = df0.copy().set_index(['k1','k2'])
df1
Out[34]:
v
k1 k2
4 a 1
NaN d 2
6 NaN 3
1 e 4
NaN 5
You can see that 1 is repeated for the last two rows

Related

Find names of n largest values in each row of dataframe

I can find the n largest values in each row of a numpy array (link) but doing so loses the column information which is what I want. Say I have some data:
import pandas as pd
import numpy as np
np.random.seed(42)
data = np.random.rand(5,5)
data = pd.DataFrame(data, columns = list('abcde'))
data
a b c d e
0 0.374540 0.950714 0.731994 0.598658 0.156019
1 0.155995 0.058084 0.866176 0.601115 0.708073
2 0.020584 0.969910 0.832443 0.212339 0.181825
3 0.183405 0.304242 0.524756 0.431945 0.291229
4 0.611853 0.139494 0.292145 0.366362 0.456070
I want the names of the largest contributors in each row. So for n = 2 the output would be:
0 b c
1 c e
2 b c
3 c d
4 a e
I can do it by looping over the dataframe but that would be inefficient. Is there a more pythonic way?
With pandas.Series.nlargest function:
df.apply(lambda x: x.nlargest(2).index.values, axis=1)
0 [b, c]
1 [c, e]
2 [b, c]
3 [c, d]
4 [a, e]
Another option using numpy.argpartition to find the top n index per row and then extract column names by index:
import numpy as np
nlargest_index = np.argpartition(data.values, data.shape[1] - n)[:, -n:]
data.columns.values[nlargest_index]
#array([['c', 'b'],
# ['e', 'c'],
# ['c', 'b'],
# ['d', 'c'],
# ['e', 'a']], dtype=object)
Can a dense ranking be used for this?
N = 2
threshold = len(data.columns) - N
nlargest = data[data.rank(method="dense", axis=1) > threshold]
>>> nlargest
a b c d e
0 NaN 0.950714 0.731994 NaN NaN
1 NaN NaN 0.866176 NaN 0.708073
2 NaN 0.969910 0.832443 NaN NaN
3 NaN NaN 0.524756 0.431945 NaN
4 0.611853 NaN NaN NaN 0.456070
>>> nlargest.stack()
0 b 0.950714
c 0.731994
1 c 0.866176
e 0.708073
2 b 0.969910
c 0.832443
3 c 0.524756
d 0.431945
4 a 0.611853
e 0.456070
dtype: float64

How to find column names in pandas dataframe that contain all unique values except NaN?

I want to find columns that contain all non-duplicates from a pandas data frame except NaN.
x y z
a 1 2 A
b 2 2 B
c NaN 3 D
d 4 NaN NaN
e NaN NaN NaN
The columns "x" and "z" have non-duplicate values except NaN, so I want to pick them out and create a new data frame.
Let us use nunique
m=df.nunique()==df.notnull().sum()
subdf=df.loc[:,m]
x z
a 1.0 A
b 2.0 B
c NaN D
d 4.0 NaN
e NaN NaN
m.index[m].tolist()
['x', 'z']
Compare length of unique values and length of values after applying dropna().
Try this code.
import pandas as pd
import numpy as np
df = pd.DataFrame({"x":[1, 2, np.nan, 4, np.nan],
"y":[2, 2, 3, np.nan, np.nan],
"z":["A", "B", "D", np.nan, np.nan]})
for col in df.columns:
if len(df[col].dropna()) == len(df[col].dropna().unique()):
print(col)

Construct DataFrame from list of dicts

Trying to construct pandas DataFrame from list of dicts
List of dicts:
a = [{'1': 'A'},
{'2': 'B'},
{'3': 'C'}]
Pass list of dicts into pd.DataFrame():
df = pd.DataFrame(a)
Actual results:
1 2 3
0 A NaN NaN
1 NaN B NaN
2 NaN NaN C
pd.DataFrame(a, columns=['Key', 'Value'])
Actual results:
Key Value
0 NaN NaN
1 NaN NaN
2 NaN NaN
Expected results:
Key Value
0 1 A
1 2 B
2 3 C
try this,
from collections import ChainMap
data = dict(ChainMap(*a))
pd.DataFrame(data.items(), columns= ['Key','Value'])
O/P:
Key Value
0 1 A
1 2 B
2 3 C
Something like this with a list comprehension:
pd.DataFrame(([(x, y) for i in a for x, y in i.items()]),columns=['Key','Value'])
Key Value
0 1 A
1 2 B
2 3 C

Assigning to a slice from another DataFrame requires matching column names?

If I want to set (replace) part of a DataFrame with values from another, I should be able to assign to a slice (as in this question) like this:
df.loc[rows, cols] = df2
Not so in this case, it nulls out the slice instead:
In [32]: df
Out[32]:
A B
0 1 -0.240180
1 2 -0.012547
2 3 -0.301475
In [33]: df2
Out[33]:
C
0 x
1 y
2 z
In [34]: df.loc[:,'B']=df2
In [35]: df
Out[35]:
A B
0 1 NaN
1 2 NaN
2 3 NaN
But it does work with just a column (Series) from df2, which is not an option if I want multiple columns:
In [36]: df.loc[:,'B']=df2['C']
In [37]: df
Out[37]:
A B
0 1 x
1 2 y
2 3 z
Or if the column names match:
In [47]: df3
Out[47]:
B
0 w
1 a
2 t
In [48]: df.loc[:,'B']=df3
In [49]: df
Out[49]:
A B
0 1 w
1 2 a
2 3 t
Is this expected? I don't see any explanation for it in docs or Stackoverflow.
Yes, this is expected. Label alignment is one of the core features of pandas. When you use df.loc[:,'B'] = df2 it needs to align two DataFrames:
df.align(df2)
Out:
( A B C
0 1 -0.240180 NaN
1 2 -0.012547 NaN
2 3 -0.301475 NaN, A B C
0 NaN NaN x
1 NaN NaN y
2 NaN NaN z)
The above shows how each DataFrame looks when aligned as a tuple (the first one is df and the second one is df2). If your df2 also had a column named B with values [1, 2, 3], it would become:
df.align(df2)
Out:
( A B C
0 1 -0.240180 NaN
1 2 -0.012547 NaN
2 3 -0.301475 NaN, A B C
0 NaN 1 x
1 NaN 2 y
2 NaN 3 z)
Since B's are aligned, your assignment would result in
df.loc[:,'B'] = df2
df
Out:
A B
0 1 1
1 2 2
2 3 3
When you use a Series, the alignment will be on a single axis (on index in your example). Since they exactly match, there will be no problem and it will assign the values from df2['C'] to df['B'].
You can either rename the labels before the alignment or use a data structure that doesn't have labels (a numpy array, a list, a tuple...).
You can use the underlying NumPy array:
df.loc[:,'B'] = df2.values
df
A B
0 1 x
1 2 y
2 3 z
Pandas indexing is always sensitive to labeling of both rows and columns. In this case, your rows check out, but your columns do not. (B != C).
Using the underlying NumPy array makes the operation index-insensitive.
The reason that this does work when df2 is a Series is because Series have no concept of columns. The only alignment is on the rows, which are aligned.

How to join pandas dataframes which have a multiindex

Problem Description
I have a dataframe with a multi-index that is three levels deep (0, 1, 2) and I'd like to join this dataframe with another dataframe which is indexed by level 2 of my original dataframe.
In code, I'd like to turn:
pd.DataFrame(['a', 'b', 'c', 'd']).transpose().set_index([0, 1, 2])
and
pd.DataFrame(['c', 'e']).transpose().set_index(0)
into
pd.DataFrame(['a', 'b', 'c', 'd', 'e']).transpose().set_index([0, 1, 2])
What I've tried
I've tried using swaplevel and then join. Didn't work, though some of the error messages suggested that if only I could set on properly this might work.
I tried concat, but couldn't get this to work either. Not sure it can't work though...
Notes:
I have seen this question in which the answer seems to dodge the question (while solving the problem).
pandas will naturally do this for you if the names of the index levels line up. You can rename the index of the second dataframe and join accordingly.
d1 = pd.DataFrame(['a', 'b', 'c', 'd']).transpose().set_index([0, 1, 2])
d2 = pd.DataFrame(['c', 'e']).transpose().set_index(0)
d1.join(d2.rename_axis(2))
3 1
0 1 2
a b c d e
More Comprehensive Example
d1 = pd.DataFrame([
[1, 2],
[3, 4],
[5, 6],
[7, 8]
], pd.MultiIndex.from_product([['A', 'B'], ['X', 'Y']], names=['One', 'Two']))
d2 = pd.DataFrame([
list('abcdefg')
], ['Y'], columns=list('ABCDEFG'))
d3 = pd.DataFrame([
list('hij')
], ['A'], columns=list('HIJ'))
d1.join(d2.rename_axis('Two')).join(d3.rename_axis('One'))
0 1 A B C D E F G H I J
One Two
A X 1 2 NaN NaN NaN NaN NaN NaN NaN h i j
Y 3 4 a b c d e f g h i j
B X 5 6 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
Y 7 8 a b c d e f g NaN NaN NaN