How to find column names in a pandas dataframe that contain all unique values except NaN?

I want to find the columns of a pandas data frame whose non-NaN values are all unique, i.e. contain no duplicates.
     x    y    z
a    1    2    A
b    2    2    B
c  NaN    3    D
d    4  NaN  NaN
e  NaN  NaN  NaN
The columns "x" and "z" contain no duplicate values apart from NaN, so I want to pick them out and create a new data frame.

Let us use nunique
m = df.nunique() == df.notnull().sum()
subdf = df.loc[:, m]
     x    z
a  1.0    A
b  2.0    B
c  NaN    D
d  4.0  NaN
e  NaN  NaN
m.index[m].tolist()
['x', 'z']
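The approach is short enough to run end to end; here is a self-contained sketch, with the frame reconstructed from the question:

```python
import pandas as pd
import numpy as np

# Frame from the question, reconstructed
df = pd.DataFrame(
    {"x": [1, 2, np.nan, 4, np.nan],
     "y": [2, 2, 3, np.nan, np.nan],
     "z": ["A", "B", "D", np.nan, np.nan]},
    index=list("abcde"),
)

# A column qualifies when its count of distinct non-NaN values
# equals its count of non-NaN values, i.e. no duplicates among them.
m = df.nunique() == df.notnull().sum()
subdf = df.loc[:, m]
print(m.index[m].tolist())  # ['x', 'z']
```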

Compare the number of unique values with the number of values after applying dropna().
Try this code.
import pandas as pd
import numpy as np
df = pd.DataFrame({"x": [1, 2, np.nan, 4, np.nan],
                   "y": [2, 2, 3, np.nan, np.nan],
                   "z": ["A", "B", "D", np.nan, np.nan]})
for col in df.columns:
    if len(df[col].dropna()) == len(df[col].dropna().unique()):
        print(col)
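The same check can be written as a one-line comprehension using Series.is_unique; a sketch equivalent to the loop above:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"x": [1, 2, np.nan, 4, np.nan],
                   "y": [2, 2, 3, np.nan, np.nan],
                   "z": ["A", "B", "D", np.nan, np.nan]})

# Keep a column when its non-NaN values are all distinct
cols = [c for c in df.columns if df[c].dropna().is_unique]
print(cols)  # ['x', 'z']
```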

Related

Fill in NA column values with values from another row based on condition

I want to replace the missing values of one row with column values of another row based on a condition. The real problem has many more columns with NA values. In this example, I want to fill na values for row 4 with values from row 0 for columns A and B, as the value 'e' maps to 'a' for column C.
df = pd.DataFrame({'A': [0, 1, np.nan, 3, np.nan],
                   'B': [5, 6, np.nan, 8, np.nan],
                   'C': ['a', 'b', 'c', 'd', 'e']})
df
Out[21]:
     A    B  C
0  0.0  5.0  a
1  1.0  6.0  b
2  NaN  NaN  c
3  3.0  8.0  d
4  NaN  NaN  e
I have tried this:
df.loc[df.C == 'e', ['A', 'B']] = df.loc[df.C == 'a', ['A', 'B']]
Is it possible to use a nested np.where statement instead?
Your code fails due to index alignment. As the indices differ (0 vs 4), NaN is assigned.
Use the underlying numpy array to bypass index alignment:
df.loc[df.C == 'e', ['A', 'B']] = df.loc[df.C == 'a', ['A', 'B']].values
NB. Both sides of the assignment must have the same shape.
Output:
     A    B  C
0  0.0  5.0  a
1  1.0  6.0  b
2  NaN  NaN  c
3  3.0  8.0  d
4  0.0  5.0  e
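To see the alignment failure and the fix side by side, here is a small sketch (using .to_numpy(), the modern spelling of .values):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [0, 1, np.nan, 3, np.nan],
                   'B': [5, 6, np.nan, 8, np.nan],
                   'C': ['a', 'b', 'c', 'd', 'e']})

# Aligned assignment: the right-hand side has index 0, the target row
# has index 4, so nothing matches and NaN is assigned.
bad = df.copy()
bad.loc[bad.C == 'e', ['A', 'B']] = bad.loc[bad.C == 'a', ['A', 'B']]
print(bad.loc[4, ['A', 'B']].tolist())  # [nan, nan]

# Dropping to a plain array bypasses index alignment.
good = df.copy()
good.loc[good.C == 'e', ['A', 'B']] = good.loc[good.C == 'a', ['A', 'B']].to_numpy()
print(good.loc[4, ['A', 'B']].tolist())  # [0.0, 5.0]
```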

Replace the values ? and n.a in all columns with NaN

In a dataframe, how can I replace all occurrences of the values ? and n.a with NaN?
I tried
df.fillna(0, inplace=True)
but '?' wasn't replaced.
To replace every non-NaN value with "?", you can try
df = df.where(df.isna(), "?")
and to replace all NaN values with 0,
df.fillna(0, inplace=True)
IIUC try with replace:
import numpy as np
import pandas as pd
df = pd.DataFrame([[1, 2, "?"],
                   ['n.a', 5, 6],
                   [7, '?', 9]],
                  columns=list('ABC'))
df = df.replace({'?': np.nan, 'n.a': np.nan})
df before replace:
     A  B  C
0    1  2  ?
1  n.a  5  6
2    7  ?  9
df after replace:
     A    B    C
0  1.0  2.0  NaN
1  NaN  5.0  6.0
2  7.0  NaN  9.0
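replace also accepts a list of values; the pd.to_numeric step at the end is an optional extra, not part of the original answer:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2, "?"],
                   ['n.a', 5, 6],
                   [7, '?', 9]],
                  columns=list('ABC'))

# Replace both placeholder strings with NaN in one call
df = df.replace(['?', 'n.a'], np.nan)

# Optionally coerce the now-clean columns back to numbers
df = df.apply(pd.to_numeric)
print(df)
```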

Condensing Wide Data Based on Column Name

Is there an elegant way to do what I'm trying to do in Pandas? My data looks something like:
df = pd.DataFrame({
'alpha': [1, np.nan, np.nan, np.nan],
'bravo': [np.nan, np.nan, np.nan, -1],
'charlie': [np.nan, np.nan, np.nan, np.nan],
'delta': [np.nan, 1, np.nan, np.nan],
})
print(df)
   alpha  bravo  charlie  delta
0    1.0    NaN      NaN    NaN
1    NaN    NaN      NaN    1.0
2    NaN    NaN      NaN    NaN
3    NaN   -1.0      NaN    NaN
and I want to transform that into something like:
  position value
0    alpha     1
1    delta     1
2      NaN   NaN
3    bravo    -1
So for each row in the original data I want to find the non-NaN value and retrieve the name of the column it was found in. Then I'll store the column and value in new columns called 'position' and 'value'.
I can guarantee that each row in the original data contains exactly zero or one non-NaN values.
My only idea is to iterate over each row but I know that idea is bad and there must be a more pandorable way to do it. I'm not exactly sure how to word my problem so I'm having trouble Googling for ideas. Thanks for any advice!
We can use DataFrame.melt to unpivot your data, then use sort_values and drop_duplicates:
df = (
    df.melt(var_name='position')
      .sort_values('value')
      .drop_duplicates('position', ignore_index=True)
)
  position  value
0    bravo   -1.0
1    alpha    1.0
2    delta    1.0
3  charlie    NaN
Another option would be to use DataFrame.bfill over the column axis to pull each row's single value into the first column, paired with idxmax on the notna mask to recover the column it came from. Since you noted that:
can guarantee that each row in the original data contains exactly zero or one non-NaN values
values = df.bfill(axis=1).iloc[:, 0]
positions = df.notna().idxmax(axis=1).where(df.notna().any(axis=1))
dfn = pd.DataFrame({'position': positions, 'value': values})
  position  value
0    alpha    1.0
1    delta    1.0
2      NaN    NaN
3    bravo   -1.0
Another way to do this. Actually, I just noticed that it is quite similar to Erfan's first proposal:
# get the index as a column
df2 = df.reset_index(drop=False)
# melt the columns, keeping the index as the id column,
# and sort the result so NaNs appear at the end
df3 = df2.melt(id_vars=['index'])
df3.sort_values('value', ascending=True, inplace=True)
# now take the values of the first row per index
df3.groupby('index')[['variable', 'value']].agg('first')
Or shorter:
(
    df.reset_index(drop=False)
      .melt(id_vars=['index'])
      .sort_values('value')
      .groupby('index')[['variable', 'value']].agg('first')
)
The result is:
      variable  value
index
0        alpha    1.0
1        delta    1.0
2        alpha    NaN
3        bravo   -1.0
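One more sketch, not from the answers above: stack() discards NaN, which pairs each row with its single surviving column; reindex then restores the all-NaN rows:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'alpha':   [1, np.nan, np.nan, np.nan],
    'bravo':   [np.nan, np.nan, np.nan, -1],
    'charlie': [np.nan, np.nan, np.nan, np.nan],
    'delta':   [np.nan, 1, np.nan, np.nan],
})

# stack() plus dropna() leaves at most one (row, column) pair per row
s = df.stack().dropna()
out = pd.DataFrame({
    'position': s.index.get_level_values(1),
    'value': s.to_numpy(),
}, index=s.index.get_level_values(0))

# Restore the all-NaN rows that were dropped
out = out.reindex(df.index)
print(out)
```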

How to join pandas dataframes which have a multiindex

Problem Description
I have a dataframe with a multi-index that is three levels deep (0, 1, 2) and I'd like to join this dataframe with another dataframe which is indexed by level 2 of my original dataframe.
In code, I'd like to turn:
pd.DataFrame(['a', 'b', 'c', 'd']).transpose().set_index([0, 1, 2])
and
pd.DataFrame(['c', 'e']).transpose().set_index(0)
into
pd.DataFrame(['a', 'b', 'c', 'd', 'e']).transpose().set_index([0, 1, 2])
What I've tried
I've tried using swaplevel and then join. It didn't work, though some of the error messages suggested that if only I could set the on parameter properly this might work.
I tried concat but couldn't get it to work either; I'm not sure it can't work, though...
Notes:
I have seen this question in which the answer seems to dodge the question (while solving the problem).
pandas will naturally do this for you if the names of the index levels line up. You can rename the index of the second dataframe and join accordingly.
d1 = pd.DataFrame(['a', 'b', 'c', 'd']).transpose().set_index([0, 1, 2])
d2 = pd.DataFrame(['c', 'e']).transpose().set_index(0)
d1.join(d2.rename_axis(2))
       3  1
0 1 2
a b c  d  e
More Comprehensive Example
d1 = pd.DataFrame([
    [1, 2],
    [3, 4],
    [5, 6],
    [7, 8]
], pd.MultiIndex.from_product([['A', 'B'], ['X', 'Y']], names=['One', 'Two']))
d2 = pd.DataFrame([
    list('abcdefg')
], ['Y'], columns=list('ABCDEFG'))
d3 = pd.DataFrame([
    list('hij')
], ['A'], columns=list('HIJ'))
d1.join(d2.rename_axis('Two')).join(d3.rename_axis('One'))
         0  1    A    B    C    D    E    F    G    H    I    J
One Two
A   X    1  2  NaN  NaN  NaN  NaN  NaN  NaN  NaN    h    i    j
    Y    3  4    a    b    c    d    e    f    g    h    i    j
B   X    5  6  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN
    Y    7  8    a    b    c    d    e    f    g  NaN  NaN  NaN
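If rename_axis feels too implicit, an alternative sketch (not from the answer above) is to reset the index and merge on the level explicitly:

```python
import pandas as pd

d1 = pd.DataFrame([[1, 2], [3, 4], [5, 6], [7, 8]],
                  pd.MultiIndex.from_product([['A', 'B'], ['X', 'Y']],
                                             names=['One', 'Two']))
d2 = pd.DataFrame([list('abcdefg')], ['Y'], columns=list('ABCDEFG'))

# Merge the 'Two' level against d2's index, keeping all d1 rows
out = d1.reset_index().merge(d2, left_on='Two', right_index=True, how='left')
out = out.set_index(['One', 'Two'])
print(out)
```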

pandas set_index with NA and None values seems not to be working

I am trying to index a pandas DataFrame using columns with occasional NA and None in them. This seems to be failing. In the example below, df0 has (None,e) combination on index 3, but df1 has (NaN,e). Any suggestions?
import pandas as pd
import numpy as np
df0 = pd.DataFrame({'k1': ['4', np.nan, '6', None, np.nan], 'k2': ['a', 'd', np.nan, 'e', np.nan], 'v': [1, 2, 3, 4, 5]})
df1 = df0.copy().set_index(['k1', 'k2'])
>>> df0
Out[3]:
     k1   k2  v
0     4    a  1
1   NaN    d  2
2     6  NaN  3
3  None    e  4
4   NaN  NaN  5
>>> df1
Out[4]:
         v
k1  k2
4   a    1
NaN d    2
6   NaN  3
NaN e    4
    NaN  5
Edit: I see the point--so this is the expected behavior.
This is expected behaviour: the None value is converted to NaN, and because the index value repeats, it isn't re-displayed:
In [31]:
df1.index
Out[31]:
MultiIndex(levels=[['4', '6'], ['a', 'd', 'e']],
           labels=[[0, -1, 1, -1, -1], [0, 1, -1, 2, -1]],
           names=['k1', 'k2'])
From the above you can see that -1 is used to mark NaN values. With respect to the display, if your df were like the following, the output would show the same behaviour:
In [34]:
df0 = pd.DataFrame({'k1': ['4', np.nan, '6', 1, 1], 'k2': ['a', 'd', np.nan, 'e', np.nan], 'v': [1, 2, 3, 4, 5]})
df1 = df0.copy().set_index(['k1', 'k2'])
df1
Out[34]:
        v
k1  k2
4   a   1
NaN d   2
6   NaN 3
1   e   4
    NaN 5
You can see that the 1 repeats over the last two rows but is only displayed once.
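In current pandas the labels attribute is exposed as codes; a small sketch to locate the NaN index entries directly (-1 marks a missing label):

```python
import pandas as pd
import numpy as np

df0 = pd.DataFrame({'k1': ['4', np.nan, '6', None, np.nan],
                    'k2': ['a', 'd', np.nan, 'e', np.nan],
                    'v': [1, 2, 3, 4, 5]})
df1 = df0.set_index(['k1', 'k2'])

# codes[0] holds the integer labels for level 'k1'; -1 means NaN
k1_missing = (df1.index.codes[0] == -1).tolist()
print(k1_missing)  # [False, True, False, True, True]
```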