Selecting NaN values in a pandas dataframe using loc [duplicate]

Given this dataframe, how can I select only those rows that have NaN in "Col2"?
df = pd.DataFrame([range(3), [0, np.nan, 0], [0, 0, np.nan], range(3), range(3)], columns=["Col1", "Col2", "Col3"])
which looks like:
   Col1  Col2  Col3
0     0   1.0   2.0
1     0   NaN   0.0
2     0   0.0   NaN
3     0   1.0   2.0
4     0   1.0   2.0
The result should be this one:
   Col1  Col2  Col3
1     0   NaN   0.0

Try the following:
df[df['Col2'].isnull()]
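
Since the question asks about loc specifically, the same boolean mask can also be passed to df.loc, which additionally accepts a column selection. A minimal sketch that rebuilds the question's frame; the names mask, rows and cells are illustrative only and not from the original post:

```python
import numpy as np
import pandas as pd

# Same data as in the question, written out as plain lists.
df = pd.DataFrame(
    [[0, 1, 2], [0, np.nan, 0], [0, 0, np.nan], [0, 1, 2], [0, 1, 2]],
    columns=["Col1", "Col2", "Col3"],
)

mask = df["Col2"].isna()                 # True where Col2 is NaN
rows = df.loc[mask]                      # whole rows, same as df[mask]
cells = df.loc[mask, ["Col1", "Col3"]]   # same rows, selected columns only

print(rows)
print(cells)
```

isna and isnull are aliases, so either spelling gives the same mask.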

@qbzenker provided the most idiomatic method, IMO.
Here are a few alternatives:
In [28]: df.query('Col2 != Col2') # Using the fact that: np.nan != np.nan
Out[28]:
Col1 Col2 Col3
1 0 NaN 0.0
In [29]: df[np.isnan(df.Col2)]
Out[29]:
Col1 Col2 Col3
1 0 NaN 0.0

If you want to select rows with at least one NaN value, then you could use isna + any on axis=1:
df[df.isna().any(axis=1)]
If you want to select rows with at least a certain number of NaN values, you could use isna + sum on axis=1 + gt (or the > operator). For example, the following fetches rows with at least 2 NaN values:
df[df.isna().sum(axis=1)>1]
If you want to limit the check to specific columns, you could select them first, then check:
df[df[['Col1', 'Col2']].isna().any(axis=1)]
If you want to select rows with all NaN values, you could use isna + all on axis=1:
df[df.isna().all(axis=1)]
If you want to select rows with no NaN values, you could use notna + all on axis=1:
df[df.notna().all(axis=1)]
This is equivalent to:
df[df['Col1'].notna() & df['Col2'].notna() & df['Col3'].notna()]
which could become tedious if there are many columns. Instead, you could use functools.reduce to chain & operators:
import functools, operator
df[functools.reduce(operator.and_, (df[i].notna() for i in df.columns))]
or numpy.logical_and.reduce:
import numpy as np
df[np.logical_and.reduce([df[i].notna() for i in df.columns])]
If you want to filter the rows where some column has no NaN using query, you can do so by passing the engine='python' parameter:
df.query('Col2.notna()', engine='python')
or use the fact that NaN != NaN, like @MaxU - stop WAR against UA:
df.query('Col2==Col2')
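
To see how the row-wise counts relate to the masks above, here is a small sketch on made-up data. The frame below is an assumption chosen so that the rows contain 0, 1, 2 and 3 NaN values; it is not the frame from the question:

```python
import numpy as np
import pandas as pd

# Hypothetical frame: row i contains exactly i NaN values.
df = pd.DataFrame({
    "Col1": [0, 0, np.nan, np.nan],
    "Col2": [1, np.nan, 0, np.nan],
    "Col3": [2, 0, np.nan, np.nan],
})

nan_count = df.isna().sum(axis=1)    # number of NaN values per row

at_least_two = df[nan_count >= 2]    # same rows as df[nan_count.gt(1)]
exactly_one = df[nan_count == 1]

# NaN is the only value that is not equal to itself,
# so 'Col2 != Col2' selects the rows where Col2 is NaN.
via_query = df.query("Col2 != Col2")
via_isna = df[df["Col2"].isna()]

print(nan_count.tolist())
print(via_query.equals(via_isna))
```

Keeping the count in its own variable makes it easy to switch between "at least", "exactly" and "at most" conditions.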

Related

Pandas: If condition on multiple columns having null values and fillna with 0

I have the dataframe below. My requirement is: if both columns are np.nan, make no change; if only one of the columns is empty, fill the NaN with 0. I wrote this code, but it is not working. Why? Please suggest a fix.
import pandas as pd
import numpy as np
data = {'Age': [np.nan, np.nan, 22, np.nan, 50, 99],
        'Salary': [217, np.nan, 262, 352, 570, np.nan]}
df = pd.DataFrame(data)
print(df)
cond1 = (df['Age'].isnull()) & (df['Salary'].isnull())
if cond1 is False:
    df['Age'] = df['Age'].fillna(0)
    df['Salary'] = df['Salary'].fillna(0)
print(df)
You can just assign it with update:
c = ['Age','Salary']
df.update(df.loc[~df[c].isna().all(1),c].fillna(0))
df
Out[341]:
Age Salary
0 0.0 217.0
1 NaN NaN
2 22.0 262.0
3 0.0 352.0
4 50.0 570.0
5 99.0 0.0
c1 = df['Age'].isna()
c2 = df['Salary'].isna()
df[np.c_[c1 & ~c2, ~c1 & c2]]=0
df
Age Salary
0 0.0 217.0
1 NaN NaN
2 22.0 262.0
3 0.0 352.0
4 50.0 570.0
5 99.0 0.0
tmp=df.loc[(df['Age'].isna() & df['Salary'].isna())]
df.fillna(0,inplace=True)
df.loc[tmp.index]=np.nan
This might be a bit less sophisticated than the other answers, but it worked for me:
First, save the row(s) where both columns are NaN at the same time,
then fillna the original df as normal,
then set np.nan back at the positions of the saved rows.
Get the rows that are all nulls and use where to exclude them during the fill:
bools = df.isna().all(axis = 1)
df.where(bools, df.fillna(0))
Age Salary
0 0.0 217.0
1 NaN NaN
2 22.0 262.0
3 0.0 352.0
4 50.0 570.0
5 99.0 0.0
Your if statement won't work because you need to check each row for True or False. cond1 is a Series that can contain many True and False values, and cond1 is False only tests whether that Series object is the False singleton. That test always evaluates to False, so the fillna lines never run.
An inefficient way would be to traverse the rows:
for row, index in zip(cond1, df.index):
    if not row:
        df.loc[index] = df.loc[index].fillna(0)
Apart from the inefficiency, you have to keep track of index positions yourself; the pandas options abstract that process away while being quite efficient, since the looping is in C.
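
The point about the if statement can be checked directly. The sketch below reuses the question's data; identity_check, truth_error and filled are illustrative names, and the .loc assignment is just one more vectorised variant, not the method of any answer above:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Age': [np.nan, np.nan, 22, np.nan, 50, 99],
                   'Salary': [217, np.nan, 262, 352, 570, np.nan]})
cond1 = df['Age'].isna() & df['Salary'].isna()

# 'is' compares object identity: a Series is never the False singleton.
identity_check = cond1 is False

# Asking for the truth value of a whole Series is ambiguous and raises.
try:
    bool(cond1)
    truth_error = None
except ValueError as exc:
    truth_error = exc

# Row-wise alternative: fill only the rows where not both columns are NaN.
filled = df.copy()
filled.loc[~cond1] = filled.loc[~cond1].fillna(0)

print(identity_check)
print(type(truth_error).__name__)
print(filled)
```

Working on a copy keeps the original df intact, which makes it easier to compare the result against the other answers.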

pandas joining strings in a group, skipping na values

I'm using a combination of str.join (let's call the joined column col_str) and groupby (let's call the grouped column col_a) in order to summarize data row-wise.
col_str may contain NaN values. Unsurprisingly, and as noted in the str.join documentation, joining a list that contains NaN returns NaN, so the result comes back empty:
df = df.join(df['col_a'].map(df.groupby('col_a')['col_str'].unique().str.join(', ')).rename('strings_concat'))
To mitigate this, I tried to convert col_str to string (e.g. df['col_str'] = df['col_str'].astype(str)). But then the empty values literally hold the string 'nan', and are therefore considered non-empty.
Not only does str.join now include the 'nan' strings, but other calculations in the script that rely on those NaNs are ruined as well.
To address that, I thought about converting just the non-empty values as follows:
df['col_str'] = np.where(pd.isnull(df['col_str']), df['col_str'],
df['col_str'].astype(str))
But now str.join returns empty values again :-(
So, I tried fillna('') and even dropna(). None provided me with the desired results.
You get the vicious cycle here, right?
astype(str) => nan strings in join and calculations ruined
Leaving as-is => str.join returns empty results.
Thanks for your assistance!
Edit:
Data is read from a csv. Sample:
Code to test -
df = pd.read_csv('/Users/goidelg/Downloads/sample_data.csv', low_memory=False)
print("---Original DF ---")
print(df)
print("---Joining NaNs as NaN---")
print(df.join(df['col_a'].map(df.groupby('col_a')['col_str'].unique().str.join(', ')).rename('strings_concat')))
print("---Converting col to str---")
df['col_str'] = df['col_str'].astype(str)
print(df.join(df['col_a'].map(df.groupby('col_a')['col_str'].unique().str.join(', ')).rename('strings_concat')))
And results for the script:
First remove missing values by DataFrame.dropna or Series.notna in boolean indexing:
df = pd.DataFrame({'col_a': [1,2,3,4,1,2,3,4,1,2],
                   'col_str': ['a','b','c','d',np.nan, np.nan, np.nan, np.nan,'a', 's']})
df1 = (df.join(df['col_a'].map(df[df['col_str'].notna()]
                                 .groupby('col_a')['col_str'].unique()
                                 .str.join(', ')).rename('labels')))
print (df1)
   col_a col_str labels
0      1       a      a
1      2       b   b, s
2      3       c      c
3      4       d      d
4      1     NaN      a
5      2     NaN   b, s
6      3     NaN      c
7      4     NaN      d
8      1       a      a
9      2       s   b, s
df2 = (df.join(df['col_a'].map(df.dropna(subset=['col_str'])
                                 .groupby('col_a')['col_str']
                                 .unique().str.join(', ')).rename('labels')))
print (df2)
   col_a col_str labels
0      1       a      a
1      2       b   b, s
2      3       c      c
3      4       d      d
4      1     NaN      a
5      2     NaN   b, s
6      3     NaN      c
7      4     NaN      d
8      1       a      a
9      2       s   b, s
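
A possible variant of the same idea drops the NaN values inside each group and broadcasts the joined string back with groupby + transform, so no separate map/join step is needed. This is only a sketch on the sample data from the answer above; the names labels and df3 are illustrative:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col_a': [1, 2, 3, 4, 1, 2, 3, 4, 1, 2],
                   'col_str': ['a', 'b', 'c', 'd', np.nan, np.nan,
                               np.nan, np.nan, 'a', 's']})

# Per group: drop NaN, keep unique strings in order of appearance, join them.
# transform broadcasts the single joined string to every row of the group.
labels = df.groupby('col_a')['col_str'].transform(
    lambda s: ', '.join(s.dropna().unique())
)

# col_str itself is left untouched, so its NaN values stay available
# for the other calculations mentioned in the question.
df3 = df.assign(labels=labels)
print(df3)
```

A group that contains only NaN values would get an empty string as its label here, which may or may not be what you want.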

Pandas columns headers split

I have a dataframe whose column headers are made up of 3 tags separated by '__'
E.g.
A__2__66 B__4__45
0
1
2
3
4
5
I know I can split the header and use just the first tag with this code: df.columns=df.columns.str.split('__').str[0]
giving:
A B
0
1
2
3
4
5
Is there a way I can use a combination of the tags, for example tags 1 and 3?
giving:
A__66 B__45
0
1
2
3
4
5
I've tried the below, but it's not working:
df.columns=df.columns.str.split('__').str[0]+'__'+df.columns.str.split('__').str[2]
With specific regex substitution:
In [124]: df.columns.str.replace(r'__[^_]+__', '__', regex=True) # regex=True is required in pandas >= 2.0, where the default is regex=False
Out[124]: Index(['A__66', 'B__45'], dtype='object')
Use Index.map with f-strings to select the first and third values of the lists:
df.columns = df.columns.str.split('__').map(lambda x: f'{x[0]}__{x[2]}')
print (df)
A__66 B__45
0 NaN NaN
1 NaN NaN
2 NaN NaN
3 NaN NaN
4 NaN NaN
5 NaN NaN
Also you can try split and join:
df.columns=['__'.join((i[0],i[-1])) for i in df.columns.str.split('__')]
#Columns: [A__66, B__45]
I found your own solution perfectly fine, and probably the most readable. It just needs a little adjustment:
df.columns = df.columns.str.split('__').str[0] + '__' + df.columns.str.split('__').str[-1]
Index(['A__66', 'B__45'], dtype='object')
Or, for the sake of efficiency, avoid calling str.split twice:
lst_split = df.columns.str.split('__')
df.columns = lst_split.str[0] + '__' + lst_split.str[-1]
Index(['A__66', 'B__45'], dtype='object')
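
If other tag combinations are needed later, the split-and-join idea can be wrapped in a small helper. keep_tags and its parameters are hypothetical names introduced here for illustration; they are not part of pandas:

```python
import pandas as pd

def keep_tags(columns, positions, sep='__'):
    """Rebuild each header from the tags at the given positions."""
    return pd.Index(
        [sep.join(name.split(sep)[p] for p in positions) for name in columns]
    )

# Empty frame with the headers from the question.
df = pd.DataFrame(columns=['A__2__66', 'B__4__45'], index=range(6))

df.columns = keep_tags(df.columns, positions=(0, 2))   # first and third tag
print(df.columns)
```

Passing positions=(0, 1) would keep the first and second tags instead, and negative positions such as -1 work as in normal list indexing.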

map one column in a df to another df where all words are present

I am trying to map a column from one dataframe onto another, for the rows where all the words of the other dataframe's entry are present.
Multiple matches are fine, as I can filter them out afterwards.
Thanks in advance!
df1
ColA
this is a sentence
with some words
in a column
and another
for fun
df2
ColB ColC
this a 123
in column 456
fun times 789
Some attempts
dfResult = df1.apply(lambda x: np.all([word in x.df1['ColA'].split(' ') for word in x.df2['ColB'].split(' ')]),axis = 1)
dfResult = df1.ColA.apply(lambda sentence: all(word in sentence for word in df2.ColB))
desired output
dfResult
ColA ColC
this is a sentence 123
with some words NaN
in a column 456
and another NaN
for fun NaN
Convert to sets and look for subsets with NumPy broadcasting
Disclaimer: No assurances that this will be fast.
A = df1.ColA.str.split().apply(set).to_numpy() # If pandas version is < 0.24 use `.values`
B = df2.ColB.str.split().apply(set).to_numpy() # instead of `.to_numpy()`
C = df2.ColC.to_numpy()
# When `dtype` is `object` Numpy falls back on performing
# the operation on each pair of values. Since these are `set` objects
# `<=` tests for subset.
i, j = np.where(B <= A[:, None])
out = pd.array([np.nan] * len(A), pd.Int64Dtype()) # Empty nullable integers
# Use `out = np.empty(len(A), dtype=object)` if pandas version is < 0.24
out[i] = C[j]
df1.assign(ColC=out)
ColA ColC
0 this is a sentence 123
1 with some words NaN
2 in a column 456
3 and another NaN
4 for fun NaN
By using loop and set.issubset
pd.DataFrame([[y if set(z.split()).issubset(set(x.split())) else np.nan for z,y in zip(df2.ColB,df2.ColC)] for x in df1.ColA ]).max(1)
Out[34]:
0 123.0
1 NaN
2 456.0
3 NaN
4 NaN
dtype: float64
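
For comparison, the subset test can also be written as a plain Python function applied with Series.map. This is a sketch under two assumptions: words are separated by whitespace, and when several df2 rows match, the first one wins. The names keys and first_match are illustrative:

```python
import pandas as pd

df1 = pd.DataFrame({'ColA': ['this is a sentence', 'with some words',
                             'in a column', 'and another', 'for fun']})
df2 = pd.DataFrame({'ColB': ['this a', 'in column', 'fun times'],
                    'ColC': [123, 456, 789]})

# Pre-split every ColB entry into a set of words, paired with its ColC value.
keys = [(set(words.split()), value)
        for words, value in zip(df2['ColB'], df2['ColC'])]

def first_match(sentence):
    """Return ColC of the first df2 row whose words all occur in sentence."""
    tokens = set(sentence.split())
    for words, value in keys:
        if words <= tokens:      # subset test on whole words
            return value
    return pd.NA

result = df1.assign(ColC=df1['ColA'].map(first_match).astype('Int64'))
print(result)
```

Comparing sets of whole words avoids the substring matches that `word in sentence` produces on raw strings, where for example 'a' is found inside 'and'.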

how to create a variable in pandas dataframe based on another variable

I have a dataframe data with multiple columns, and I want to create a variable filter in the dataframe that is 1 if activation_date is null and 0 otherwise.
I have written this code, but it fails to give the expected result: every row gets 0, regardless of whether a date is present.
data['filter'] = [0 if x is not None else 1 for x in data['activation_dt']]
I think you need isnull to check for None or NaN values, and then convert True to 1 and False to 0 with astype(int):
data = pd.DataFrame({'activation_dt':[None, np.nan, 1]})
print (data)
activation_dt
0 NaN
1 NaN
2 1.0
data['filter'] = data['activation_dt'].isnull().astype(int)
print (data)
activation_dt filter
0 NaN 1
1 NaN 1
2 1.0 0
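
The reason the list comprehension returns 0 everywhere is that pandas stores missing values as NaN (or NaT for datetimes), and neither of those is the None object. A short sketch with made-up dates, assuming the column holds datetimes; the values and the name naive are illustrative only:

```python
import pandas as pd

# Hypothetical data: the second activation date is missing.
data = pd.DataFrame(
    {'activation_dt': pd.to_datetime(['2021-01-05', None, '2021-03-01'])}
)

# The missing entry is stored as NaT, which is not None,
# so the original comprehension yields 0 for every row.
naive = [0 if x is not None else 1 for x in data['activation_dt']]

# isna() recognises None, NaN and NaT alike.
data['filter'] = data['activation_dt'].isna().astype(int)

print(naive)
print(data)
```

isna and isnull are aliases, so this matches the isnull call in the answer above.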