Pandas: Cannot assign NaN using a boolean mask on MultiIndex columns

First, create this DataFrame:
df = pd.DataFrame([[1, -2, 3], [4, 5, -6], [-7, 8, 9]],
                  columns=pd.MultiIndex.from_tuples(
                      [('foo', 'bar'), ('foo', 'baz'), ('ignore', 'other')]))
That is:
  foo     ignore
  bar baz  other
0   1  -2      3
1   4   5     -6
2  -7   8      9
Now, try to replace the negative values under foo with NaN:
df.foo[df.foo < 0] = np.nan
That doesn't do anything but print a warning:
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
OK, let's do that:
df.loc[:,'foo'][df.foo < 0] = np.nan
That doesn't print a warning, but it also does nothing!
But it works if we use a non-NaN value:
df.loc[:,'foo'][df.foo < 0] = 666
Now I have:
  foo      ignore
  bar  baz  other
0   1  666      3
1   4    5     -6
2 666    8      9
But I want to fill with NaN, not 666. Is there an easy way that works?

You can use slicers with DataFrame.mask:
idx = pd.IndexSlice
sliced = df.loc[:, idx['foo',:]]
print (sliced)
  foo
  bar baz
0   1  -2
1   4   5
2  -7   8
df.loc[:, idx['foo',:]] = sliced.mask(sliced < 0)
print (df)
   foo      ignore
   bar  baz  other
0  1.0  NaN      3
1  4.0  5.0     -6
2  NaN  8.0      9
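Put end to end, the slicer-and-mask approach runs like so (a self-contained sketch reproducing the question's frame):

```python
import numpy as np
import pandas as pd

# Rebuild the question's DataFrame with MultiIndex columns.
df = pd.DataFrame([[1, -2, 3], [4, 5, -6], [-7, 8, 9]],
                  columns=pd.MultiIndex.from_tuples(
                      [('foo', 'bar'), ('foo', 'baz'), ('ignore', 'other')]))

# Select the 'foo' group with an IndexSlice, mask the negatives,
# and assign back through .loc so the original DataFrame is updated.
idx = pd.IndexSlice
sliced = df.loc[:, idx['foo', :]]
df.loc[:, idx['foo', :]] = sliced.mask(sliced < 0)
```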
Another solution with concat:
idx = pd.IndexSlice
df1 = df.loc[:, idx['foo',:]]
print (df1)
  foo
  bar baz
0   1  -2
1   4   5
2  -7   8
df1 = df1.mask(df1 < 0)
print (df1)
   foo
   bar  baz
0  1.0  NaN
1  4.0  5.0
2  NaN  8.0
print (pd.concat([df1, df.drop('foo', axis=1, level=0)], axis=1))
foo ignore
bar baz other
0 1.0 NaN 3
1 4.0 5.0 -6
2 NaN 8.0 9
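As a further variant (not from the answers above, so treat it as a sketch): selecting the whole 'foo' group by label and assigning the masked values back through a single .loc also avoids chained indexing; .to_numpy() sidesteps column-label alignment between the plain sub-frame and the MultiIndex target:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, -2, 3], [4, 5, -6], [-7, 8, 9]],
                  columns=pd.MultiIndex.from_tuples(
                      [('foo', 'bar'), ('foo', 'baz'), ('ignore', 'other')]))

# df['foo'] selects the sub-frame under the first level; mask the
# negatives there and write the raw values back into the 'foo' group.
df.loc[:, 'foo'] = df['foo'].mask(df['foo'] < 0).to_numpy()
```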

Related

Why do I get a warning with this function?

I am trying to generate a new column containing, for each row, boolean values indicating whether each value is null or not. I wrote the following function:
def not_null(row):
    null_list = []
    for value in row:
        null_list.append(pd.isna(value))
    return null_list
df['not_null'] = df.apply(not_null, axis=1)
But I get the following warning message,
A value is trying to be set on a copy of a slice from a DataFrame.
Is there a better way to write this function?
Note: I want to be able to apply this function to each row regardless of whether I know the column names or not
Final output ->
Column1 | Column2 | Column3 | null_idx
NaN | NaN | NaN | [0, 1, 2]
1 | 23 | 34 | []
test1 | NaN | NaN | [1, 2]
First, the warning means there is some filtering earlier in your code, which calls for DataFrame.copy:
df = df[df['col'].gt(100)].copy()
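A minimal sketch of why the explicit .copy() helps; the column names here are hypothetical:

```python
import pandas as pd

# Hypothetical data: 'col' is the assumed column name from the line above.
df = pd.DataFrame({'col': [50, 150, 200], 'other': [1, 2, 3]})

# Without .copy(), `sub` may be a view of `df`, and assigning into it
# would trigger SettingWithCopyWarning; .copy() makes ownership explicit.
sub = df[df['col'].gt(100)].copy()
sub['flag'] = sub['col'] > 180
```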
Then, your solution can be improved:
df = pd.DataFrame({'a': [np.nan, 1, np.nan],
                   'b': [np.nan, 4, 6],
                   'c': [4, 5, 3]})
df['list_boolean_for_missing'] = [x[x].tolist() for x in df.isna().to_numpy()]
print (df)
a b c list_boolean_for_missing
0 NaN NaN 4 [True, True]
1 1.0 4.0 5 []
2 NaN 6.0 3 [True]
Your function, condensed to a lambda:
dd = lambda x: [pd.isna(value) for value in x]
df['list_boolean_for_missing'] = df.apply(dd, axis=1)
If instead what you need is:
I am trying to generate a new column containing boolean values of whether a value of each row is Null or not
df['not_null'] = df.notna().all(axis=1)
print (df)
a b c not_null
0 NaN NaN 4 False
1 1.0 4.0 5 True
2 NaN 6.0 3 False
EDIT: For the list of positions, create a helper array with np.arange and filter it:
arr = np.arange(len(df.columns))
df['null_idx'] = [arr[x].tolist() for x in df.isna().to_numpy()]
print (df)
a b c null_idx
0 NaN NaN 4 [0, 1]
1 1.0 4.0 5 []
2 NaN 6.0 3 [0]
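Shaped like the asker's desired output (Column1..Column3 are the asker's names), the helper-array version becomes:

```python
import numpy as np
import pandas as pd

# Data matching the asker's desired final output.
df = pd.DataFrame({'Column1': [np.nan, 1, 'test1'],
                   'Column2': [np.nan, 23, np.nan],
                   'Column3': [np.nan, 34, np.nan]})

# For each row, collect the positional indices of its NaN cells.
arr = np.arange(len(df.columns))
df['null_idx'] = [arr[x].tolist() for x in df.isna().to_numpy()]
```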

Drop a column based on the existence of another column

I'm trying to figure out how to drop a column based on the existence of another column. Here is my problem:
I start with this DataFrame. Each "X" column is associated with a "Y" column using a number. (X_1,Y_1 / X_2,Y_2 ...)
Index X_1 X_2 Y_1 Y_2
1 4 0 A NaN
2 7 0 A NaN
3 6 0 B NaN
4 2 0 B NaN
5 8 0 A NaN
I drop the NaN values using df.dropna(axis=1). The result I get is this DataFrame:
Index X_1 X_2 Y_1
1 4 0 A
2 7 0 A
3 6 0 B
4 2 0 B
5 8 0 A
The problem is that I want to delete the "X" column associated to the "Y" column that just got dropped. I would like to use a condition that basically says :
"If Y_2 is not in the DataFrame, drop the X_2 column"
I used a for loop combined with an if, but it doesn't seem to work. Any ideas?
Thanks and have a good day.
Setup
>>> df
CHA_COEXPM1_COR CHA_COEXPM2_COR CHA_COFMAT1_COR CHA_COFMAT2_COR
Index
1 4 0 A NaN
2 7 0 A NaN
3 6 0 B NaN
4 2 0 B NaN
5 8 0 A NaN
Solution
1. Identify the columns having NaN values in any row
2. Group the identified columns using the numeric identifier and transform using any
3. Filter the columns using the boolean mask created in the previous step
m = df.isna().any()
m = m.groupby(m.index.str.extract(r'(\d+)_')[0]).transform('any')
Result
>>> df.loc[:, ~m]
CHA_COEXPM1_COR CHA_COFMAT1_COR
Index
1 4 A
2 7 A
3 6 B
4 2 B
5 8 A
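A self-contained sketch of the same idea, using the simpler X_1/Y_1 names from the question (the regex is adapted accordingly, and the group keys are passed as a plain array to avoid index-alignment surprises):

```python
import numpy as np
import pandas as pd

# Data shaped like the question's DataFrame.
df = pd.DataFrame({'X_1': [4, 7, 6, 2, 8],
                   'X_2': [0, 0, 0, 0, 0],
                   'Y_1': ['A', 'A', 'B', 'B', 'A'],
                   'Y_2': [np.nan] * 5})

# Columns that contain any NaN.
m = df.isna().any()
# Broadcast the flag to every column sharing the same number.
groups = m.index.str.extract(r'(\d+)')[0].to_numpy()
m = m.groupby(groups).transform('any')
result = df.loc[:, ~m]
```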
Slightly modified example to be closer to actual DataFrame:
df = pd.DataFrame({
    'Index': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5},
    'X_V1_C': {0: 4, 1: 7, 2: 6, 3: 2, 4: 8},
    'X_V2_C': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0},
    'Y_V1_C': {0: 'A', 1: 'A', 2: 'B', 3: 'B', 4: 'A'},
    'Y_V2_C': {0: np.nan, 1: np.nan, 2: np.nan, 3: np.nan, 4: np.nan}
})
Index X_V1_C X_V2_C Y_V1_C Y_V2_C
0 1 4 0 A NaN
1 2 7 0 A NaN
2 3 6 0 B NaN
3 4 2 0 B NaN
4 5 8 0 A NaN
1. set_index on any columns to be "saved"
2. Extract the numbers from the columns and create a MultiIndex:
df.columns = pd.MultiIndex.from_arrays([df.columns.str.extract(r'(\d+)')[0],
                                        df.columns])
0      1      2      1      2    # numbers extracted from the columns
  X_V1_C X_V2_C Y_V1_C Y_V2_C
Index
1      4      0      A    NaN
2      7      0      A    NaN
3      6      0      B    NaN
4      2      0      B    NaN
5      8      0      A    NaN
3. Check where there are groups whose columns are all NaN with DataFrame.isna + all on axis=0 (down the columns), then any relative to level=0 (the extracted number). (In recent pandas, .any(level=0) is spelled .groupby(level=0).any().)
col_mask = ~df.isna().all(axis=0).any(level=0)

0
1     True     # keep group 1
2    False     # drop group 2
dtype: bool
4. Filter the DataFrame with the mask using loc, then droplevel on the added number level:
df = df.loc[:, col_mask.index[col_mask]].droplevel(axis=1, level=0)
X_V1_C Y_V1_C
Index
1 4 A
2 7 A
3 6 B
4 2 B
5 8 A
All Together
df = df.set_index('Index')
df.columns = pd.MultiIndex.from_arrays([df.columns.str.extract(r'(\d+)')[0],
                                        df.columns])
col_mask = ~df.isna().all(axis=0).any(level=0)
df = df.loc[:, col_mask.index[col_mask]].droplevel(axis=1, level=0)
df:
X_V1_C Y_V1_C
Index
1 4 A
2 7 A
3 6 B
4 2 B
5 8 A
Drop NaNs:
df.dropna(axis=1, inplace=True)
Compute suffixes and find the columns whose suffix appears twice:
suffixes = [i[2:] for i in df.columns]
cols = [c for c in df.columns if suffixes.count(c[2:]) == 2]
Filter columns:
df[cols]
Full code:
df = df.set_index('Index').dropna(axis=1)
suffixes = [i[2:] for i in df.columns]
df[[c for c in df.columns if suffixes.count(c[2:]) == 2]]
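A runnable sketch of the suffix-counting approach on data shaped like the question's:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Index': [1, 2, 3, 4, 5],
                   'X_1': [4, 7, 6, 2, 8],
                   'X_2': [0, 0, 0, 0, 0],
                   'Y_1': ['A', 'A', 'B', 'B', 'A'],
                   'Y_2': [np.nan] * 5})

# Drop columns containing NaN, then keep only columns whose numeric
# suffix still occurs twice (i.e. both the X and the Y partner survived).
df = df.set_index('Index').dropna(axis=1)
suffixes = [c[2:] for c in df.columns]
result = df[[c for c in df.columns if suffixes.count(c[2:]) == 2]]
```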

What is the difference between the 'set' operation using loc vs iloc?

What is the difference between the 'set' operation using loc vs iloc?
df.iloc[2, df.columns.get_loc('ColName')] = 3
#vs#
df.loc[2, 'ColName'] = 3
Why does the website of iloc (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iloc.html) not have any set examples like those shown in loc website (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.loc.html)? Is loc the preferred way?
There isn't much of a difference; it comes down to what you have at hand.
If you know the index label and the column name (which is most of the time), use the loc (location) operator for assignment.
If, as with an ordinary matrix, you only have the integer positions of the row and column, use iloc (integer location) for assignment.
Pandas DataFrames support both the usual integer-based indexing and label-based indexing.
The problem arises when the index (the rows or columns) itself consists of integers rather than strings. To make it unambiguous whether the user wants integer-based or label-based indexing, the two separate operations are provided.
Main difference is iloc set values by position, loc by label.
Here are some alternatives:
Sample:
Non-default index (if label 2 exists, the cell is overwritten; otherwise a new row with that label is appended):
np.random.seed(123)
df = pd.DataFrame(np.random.randint(10, size=(3,3)), columns=['A','B','C'], index=[2,1,8])
print (df)
A B C
2 2 2 6
1 1 3 9
8 6 1 0
df.iloc[2, df.columns.get_loc('A')] = 30
print (df)
A B C
2 2 2 6
1 1 3 9
8 30 1 0
A new row with label 0 is appended:
df.loc[0, 'A'] = 70
print (df)
      A    B    C
2   2.0  2.0  6.0
1   1.0  3.0  9.0
8  30.0  1.0  0.0
0  70.0  NaN  NaN
Overwritten label 2:
df.loc[2, 'A'] = 50
print (df)
A B C
2 50 2 6
1 1 3 9
8 30 1 0
Default index (works the same, because the third positional row has label 2):
np.random.seed(123)
df = pd.DataFrame(np.random.randint(10, size=(3,3)), columns=['A','B','C'])
print (df)
df.iloc[2, df.columns.get_loc('A')] = 30
print (df)
A B C
0 2 2 6
1 1 3 9
2 30 1 0
df.loc[2, 'A'] = 50
print (df)
A B C
0 2 2 6
1 1 3 9
2 50 1 0
Non-integer index (setting by position works; selecting by the missing label 2 appends a new row):
np.random.seed(123)
df = pd.DataFrame(np.random.randint(10, size=(3,3)), columns=['A','B','C'], index=list('abc'))
print (df)
df.iloc[2, df.columns.get_loc('A')] = 30
print (df)
A B C
a 2 2 6
b 1 3 9
c 30 1 0
df.loc[2, 'A'] = 50
print (df)
A B C
a 2.0 2.0 6.0
b 1.0 3.0 9.0
c 30.0 1.0 0.0
2 50.0 NaN NaN
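A side note not covered in the answers above: for single-cell assignment like these examples, the scalar accessors .at (label-based) and .iat (position-based) behave analogously to loc/iloc and are usually faster:

```python
import numpy as np
import pandas as pd

np.random.seed(123)
df = pd.DataFrame(np.random.randint(10, size=(3, 3)),
                  columns=['A', 'B', 'C'], index=[2, 1, 8])

df.iat[2, df.columns.get_loc('A')] = 30   # by position, like iloc (row label 8)
df.at[2, 'A'] = 50                        # by label, like loc (row label 2)
```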

Warning with loc function with pandas dataframe

While working on an SO question I came across a warning when using loc; the precise details are as below:
DataFrame Samples:
First DataFrame df1:
>>> data1 = {'Sample': ['Sample_A','Sample_D', 'Sample_E'],
... 'Location': ['Bangladesh', 'Myanmar', 'Thailand'],
... 'Year':[2012, 2014, 2015]}
>>> df1 = pd.DataFrame(data1)
>>> df1.set_index('Sample')
Location Year
Sample
Sample_A Bangladesh 2012
Sample_D Myanmar 2014
Sample_E Thailand 2015
Second DataFrame df2:
>>> data2 = {'Num': ['Value_1','Value_2','Value_3','Value_4','Value_5'],
... 'Sample_A': [0,1,0,0,1],
... 'Sample_B':[0,0,1,0,0],
... 'Sample_C':[1,0,0,0,1],
... 'Sample_D':[0,0,1,1,0]}
>>> df2 = pd.DataFrame(data2)
>>> df2.set_index('Num')
Sample_A Sample_B Sample_C Sample_D
Num
Value_1 0 0 1 0
Value_2 1 0 0 0
Value_3 0 1 0 1
Value_4 0 0 0 1
Value_5 1 0 1 0
>>> samples
['Sample_A', 'Sample_D', 'Sample_E']
When I use samples to select those columns as follows, it works, but at the same time it produces a warning:
>>> df3 = df2.loc[:, samples]
>>> df3
Sample_A Sample_D Sample_E
0 0 0 NaN
1 1 0 NaN
2 0 1 NaN
3 0 1 NaN
4 1 0 NaN
Warnings:
indexing.py:1472: FutureWarning:
Passing list-likes to .loc or [] with any missing label will raise
KeyError in the future, you can use .reindex() as an alternative.
See the documentation here:
https://pandas.pydata.org/pandas-docs/stable/indexing.html#deprecate-loc-reindex-listlike
return self._getitem_tuple(key)
I would like to know how to handle this in a better way!
Use reindex like:
df3 = df2.reindex(columns=samples)
print (df3)
Sample_A Sample_D Sample_E
0 0 0 NaN
1 1 0 NaN
2 0 1 NaN
3 0 1 NaN
4 1 0 NaN
Or if you want only the intersecting columns, use Index.intersection:
df3 = df2[df2.columns.intersection(samples)]
#alternative
#df3 = df2[np.intersect1d(df2.columns, samples)]
print (df3)
Sample_A Sample_D
0 0 0
1 1 0
2 0 1
3 0 1
4 1 0
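Both options side by side, as a self-contained sketch of this answer's data:

```python
import pandas as pd

df2 = pd.DataFrame({'Sample_A': [0, 1, 0, 0, 1],
                    'Sample_B': [0, 0, 1, 0, 0],
                    'Sample_C': [1, 0, 0, 0, 1],
                    'Sample_D': [0, 0, 1, 1, 0]})
samples = ['Sample_A', 'Sample_D', 'Sample_E']

# reindex keeps every requested column, filling missing ones with NaN;
# intersection keeps only the requested columns that actually exist.
all_cols = df2.reindex(columns=samples)
common = df2[df2.columns.intersection(samples)]
```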

Pandas: Delete duplicated items in a specific column

I have a pandas DataFrame (here represented using Excel):
Now I would like to delete all duplicates (1) in a specific column (B).
How can I do it?
For this example, the result would look like this:
You can use duplicated to build a boolean mask and then set NaNs via loc, mask or numpy.where:
df.loc[df['B'].duplicated(), 'B'] = np.nan
df['B'] = df['B'].mask(df['B'].duplicated())
df['B'] = np.where(df['B'].duplicated(), np.nan,df['B'])
Alternatively, if you need to remove the duplicate rows by column B:
df = df.drop_duplicates(subset=['B'])
Sample:
df = pd.DataFrame({
    'B': [1, 2, 1, 3],
    'A': [1, 5, 7, 9]
})
print (df)
A B
0 1 1
1 5 2
2 7 1
3 9 3
df.loc[df['B'].duplicated(), 'B'] = np.nan
print (df)
A B
0 1 1.0
1 5 2.0
2 7 NaN
3 9 3.0
df = df.drop_duplicates(subset=['B'])
print (df)
A B
0 1 1
1 5 2
3 9 3
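One more knob worth knowing: duplicated takes a keep parameter that decides which occurrence survives the mask (a short sketch on the same sample):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 5, 7, 9], 'B': [1, 2, 1, 3]})

# keep='first' (default) flags later repeats; keep='last' flags earlier
# ones; keep=False flags every member of a duplicated group.
first = df['B'].mask(df['B'].duplicated(keep='first'))
last = df['B'].mask(df['B'].duplicated(keep='last'))
none = df['B'].mask(df['B'].duplicated(keep=False))
```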