At each NaN value, drop the row and column it's located in from pandas DataFrame

I have some unknown DataFrame that can be of any size and shape, for example:
   first1  first2  first3  first4
a     NaN      22    56.0      65
c   380.0      40     NaN      66
b   390.0      50    80.0      64
My objective is to delete all columns and rows at which there is a NaN value.
In this specific case, the output should be:
   first2  first4
b      50      64
Also, I need to preserve the option to use "all" as in pandas.DataFrame.dropna, meaning that when "all" is passed, a column or row must be dropped only if all of its values are missing.
When I tried the following code:
def dropna_mta_style(df, how='any'):
    new_df = df.dropna(axis=0, how=how).dropna(axis=1, how=how)
    return new_df
It obviously didn't work: it drops the rows first and only then searches the remaining frame for columns with NaNs, so a column whose only NaNs sat in already-dropped rows survives.
Thanks in advance!
P.S.: for and while loops, Python built-in functions that act on iterables (all, any, map, ...), and list and dictionary comprehensions shouldn't be used.

Solution intended for readability:
rows = df.dropna(axis=0).index
cols = df.dropna(axis=1).columns
df = df.loc[rows, cols]
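Wrapped into the requested function, this becomes (a sketch; the how argument is simply forwarded to both dropna calls, each computed on the original frame):

def dropna_mta_style(df, how='any'):
    # compute surviving rows and columns from the ORIGINAL frame,
    # so that one drop cannot hide NaNs from the other
    rows = df.dropna(axis=0, how=how).index
    cols = df.dropna(axis=1, how=how).columns
    return df.loc[rows, cols]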

Would something like this work?
df.dropna(axis=1, how='any').loc[df.dropna(axis=0, how='any').index]
(Meaning: we take the index of all rows that contain no NaN in any column, df.dropna(axis=0, how='any').index, and use it to locate those rows in the frame we get after dropping every column that has at least one NaN.)
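For instance, rebuilding the sample frame from the question:

import numpy as np
import pandas as pd

df = pd.DataFrame({'first1': [np.nan, 380.0, 390.0],
                   'first2': [22, 40, 50],
                   'first3': [56.0, np.nan, 80.0],
                   'first4': [65, 66, 64]},
                  index=list('acb'))

print(df.dropna(axis=1, how='any').loc[df.dropna(axis=0, how='any').index])
#    first2  first4
# b      50      64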

This should remove all such rows and columns dynamically:
import numpy as np

# flag the rows that contain at least one NaN before any column is dropped
df['Check'] = df.isin([np.nan]).any(axis=1)
df = df.dropna(axis=1)             # drop every column containing a NaN
df = df.loc[df['Check'] == False]  # keep only the rows flagged as clean
df.drop('Check', axis=1, inplace=True)
df
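Applied to the sample frame from the question, this should leave only row b with columns first2 and first4, matching the expected output.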

def dropna_mta_style(df, how='any'):
    if how == 'all':
        # columns in which every value is missing
        null_col = df.isna().all(axis=0).to_frame(name='col')
        col_names = null_col[null_col['col'] == True].index
        # rows in which every value is missing
        null_row = df.isna().all(axis=1).to_frame(name='row')
        row_index = null_row[null_row['row'] == True].index
        new_df = df.drop(columns=col_names).drop(index=row_index)
    else:
        new_df = df.dropna(axis=0, how=how).dropna(axis=1, how=how)
    return new_df
Here is a breakdown of the changes made to the function.
BEFORE:
   first1  first2  first3  first4  first5
a     NaN    22.0     NaN    65.0     NaN
c   380.0    40.0     NaN    66.0     NaN
b   390.0    50.0     NaN    64.0     NaN
3     NaN     NaN     NaN     NaN     NaN
4     NaN     NaN     NaN     NaN     NaN
5     NaN     NaN     NaN     NaN     NaN
6     NaN     NaN     NaN     NaN     NaN
Find the all-null columns:
null_col = df.isna().all(axis=0).to_frame(name='col')
col_names = null_col[null_col['col'] == True].index
col_names
Index(['first3', 'first5'], dtype='object')
Find the all-null rows:
null_row = df.isna().all(axis=1).to_frame(name='row')
row_index = null_row[null_row['row'] == True].index
row_index
Index([3, 4, 5, 6], dtype='object')
if len(row_index) > 0:
    df2 = df.drop(columns=col_names)
df2
AFTER:
   first1  first2  first4
a     NaN    22.0    65.0
c   380.0    40.0    66.0
b   390.0    50.0    64.0
3     NaN     NaN     NaN
4     NaN     NaN     NaN
5     NaN     NaN     NaN
6     NaN     NaN     NaN
Incorporating these steps into your method gives the fixed dropna_mta_style shown above, which also drops the all-null rows.
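With that fix, calling the function with how='all' on the BEFORE frame should drop the all-NaN rows 3-6 and the all-NaN columns first3 and first5, while first1 survives because it still holds values:

dropna_mta_style(df, how='all')
   first1  first2  first4
a     NaN    22.0    65.0
c   380.0    40.0    66.0
b   390.0    50.0    64.0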

Related

How to keep all values from a dataframe except where NaN is present in another dataframe?

I am new to Pandas and I am stuck on this specific problem: I have 2 DataFrames in Pandas, e.g.
>>> df1
   A   B
0  1   9
1  2   6
2  3  11
3  4   8
>>> df2
      A     B
0   NaN  0.05
1   NaN  0.05
2  0.16   NaN
3  0.16   NaN
What I am trying to achieve is to retain all values from df1 except where there is a NaN in df2 i.e.
>>> df3
     A    B
0  NaN    9
1  NaN    6
2    3  NaN
3    4  NaN
I am talking about dfs with 10,000 rows each, so I can't do this manually. Also, indices and columns are exactly the same in each case, and I have no NaN values in df1.
As far as I understand, df.update() will either overwrite all values including NaN, or update only those that are NaN.
You can use boolean masking with DataFrame.notna:
# df2 = df2.astype(float)  # needed if your dtypes are not floats
m = df2.notna()
df1[m]
     A    B
0  NaN  9.0
1  NaN  6.0
2  3.0  NaN
3  4.0  NaN
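Equivalently, DataFrame.where keeps values where the condition holds and inserts NaN elsewhere, so the mask does not need to be stored first:

df1.where(df2.notna())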

Converting a list of list of tuples to a DataFrame (First argument: column, Second argument: value)

So I need a DataFrame out of a list_of_list_of_tuples:
My data looks like this:
tuples = [[(5,0.45),(6,0.56)],[(1,0.23),(2,0.54),(6,0.63)],[(3,0.86),(6,0.36)]]
What I need is this:
index     1     2     3    4     5     6
1       nan   nan   nan  nan  0.45  0.56
2      0.23  0.54   nan  nan   nan  0.63
3       nan   nan  0.86  nan   nan  0.36
So that the first argument in the tuple is the column, and the second is the value.
An index would be nice also.
Can anyone help me?
I have no idea how to formulate the code.
Convert each list of tuples to a dictionary, pass those to the DataFrame constructor, and finally use DataFrame.reindex to fix the column order and add the missing columns:
df = pd.DataFrame([dict(x) for x in tuples])
df = df.reindex(range(df.columns.min(), df.columns.max() + 1), axis=1)
print(df)
      1     2     3   4     5     6
0   NaN   NaN   NaN NaN  0.45  0.56
1  0.23  0.54   NaN NaN   NaN  0.63
2   NaN   NaN  0.86 NaN   NaN  0.36
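Note that the reindex call both orders the columns and inserts column 4, which never appears in the data, as an all-NaN column.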
tuples = [[(5,0.45),(6,0.56)],[(1,0.23),(2,0.54),(6,0.63)],[(3,0.86),(6,0.36)]]
for x in tuples:
    print(x)
    index = []
    values = []
    for tup in x:  # 'tup' rather than 'tuple', to avoid shadowing the built-in
        print(tup[0], tup[1])
        index.append(tup[0])
        values.append(tup[1])
    print(index, values)
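This answer stops at extracting the per-row index and value lists. One way to finish the thought (a sketch) is to turn each pair of lists into a Series and let pandas align them; the missing column 4 can then be added with the reindex trick from the previous answer:

import pandas as pd

rows = []
for x in tuples:
    index = []
    values = []
    for tup in x:
        index.append(tup[0])
        values.append(tup[1])
    # one Series per inner list: the tuple's first element becomes the label
    rows.append(pd.Series(values, index=index))

df = pd.DataFrame(rows)  # columns are the union of all labels seen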

Adding a dataframe to an existing dataframe at specific rows and columns

I have a loop that on each iteration creates a DataFrame (DF) of this form:
DF
ID LCAR RCAR ... LPCA1 LPCA2 RPCA2
0 d0129 312.255859 397.216797 ... 1.098888 1.101905 1.152332
and then adds that DataFrame to an existing DataFrame (main_exl_df) of this form:
main_exl_df
ID Date ... COGOTH3 COGOTH3X COGOTH3F
0 d0129 NaN ... NaN NaN NaN
1 d0757 NaN ... 0.0 NaN NaN
2 d2430 NaN ... NaN NaN NaN
3 d3132 NaN ... 0.0 NaN NaN
4 d0371 NaN ... 0.0 NaN NaN
... ... ... ... ... ... ...
2163 d0620 NaN ... 0.0 NaN NaN
2164 d2410 NaN ... 0.0 NaN NaN
2165 d0752 NaN ... NaN NaN NaN
2166 d0407 NaN ... 0.0 NaN NaN
At each iteration main_exl_df is saved and then loaded again for the next iteration.
I tried
main_exl_df = pd.concat([main_exl_df, DF], axis=1)
but this appends the columns to the right side of main_exl_df each time and does not align on the 'ID' column.
How can I add the new DataFrame (DF) at the row with the matching ID and the right columns?
Merge is the way to go for combining columns in such cases. When you use pd.merge, you need to specify whether the merge is inner, left, right or outer. Assuming that in this case you want to keep all the rows in main_exl_df, you should merge using:
main_exl_df = main_exl_df.merge(DF, how='left', on='ID')
If you want to keep rows from both the dataframes, use outer as argument value:
main_exl_df = main_exl_df.merge(DF, how='outer', on='ID')
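A minimal illustration with made-up stand-ins for the two frames (hypothetical values; only the 'ID' column and one LCAR value are taken from the question):

import numpy as np
import pandas as pd

main_exl_df = pd.DataFrame({'ID': ['d0129', 'd0757'], 'Date': [np.nan, np.nan]})
DF = pd.DataFrame({'ID': ['d0129'], 'LCAR': [312.255859]})

# 'left' keeps every row of main_exl_df; unmatched rows get NaN
print(main_exl_df.merge(DF, how='left', on='ID'))
#       ID  Date        LCAR
# 0  d0129   NaN  312.255859
# 1  d0757   NaN         NaN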
This is what solved the problem in the end (with the help of this answer):
I used the merge function; however, merge created duplicate columns with _x and _y suffixes. To get rid of the _x suffixes I used this function:
def drop_x(df):
    # drop the merge leftovers: every column whose name ends with '_x'
    to_drop = [x for x in df if x.endswith('_x')]
    df.drop(to_drop, axis=1, inplace=True)
and then merged the two dataframes while replacing the _y suffixes with an empty string:
# presumably meant to pick DF's columns that main_exl_df lacks, plus the
# merge key (the original drop_duplicates(main_exl_df) call is not valid)
col_to_use = DF.columns.difference(main_exl_df.columns).tolist() + ['ID']
main_exl_df = main_exl_df.merge(DF[col_to_use], on='ID', how='outer', suffixes=('_x', ''))
drop_x(main_exl_df)
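An alternative worth considering (a sketch, assuming every column of DF already exists in main_exl_df, since update never adds new columns): index both frames by 'ID' and let DataFrame.update fill the matching cells in place, which sidesteps suffix handling entirely:

main_exl_df = main_exl_df.set_index('ID')
main_exl_df.update(DF.set_index('ID'))  # overwrites with DF's non-NaN values
main_exl_df = main_exl_df.reset_index()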

Compare 2 columns and replace with None if found equal

The following command replaces all values of the matching rows with NaN:
ndf.iloc[np.where(ndf.path3=='sys_bck_20190101.tar.gz')] = np.nan
What I really need to do is to replace the value of a single column called path4 if it matches with column path3. This does not work:
ndf.iloc[np.where(ndf.path3==ndf.path4), ndf.path3] = np.nan
Update:
There is a pandas method fillna that can be used with axis='columns'.
Is there a similar method to write NA values to the duplicate columns?
I can do this, but it does not look pythonic.
ndf.loc[ndf.path1==ndf.path2, 'path1'] = np.nan
ndf.loc[ndf.path2==ndf.path3, 'path2'] = np.nan
ndf.loc[ndf.path3==ndf.path4, 'path3'] = np.nan
ndf.loc[ndf.path4==ndf.filename, 'path4'] = np.nan
Update 2
Let me explain the issue:
Assuming this dataframe:
ndf = pd.DataFrame({
    'path1': [4, 5, 4, 5, 5, 4],
    'path2': [4, 5, 4, 5, 5, 4],
    'path3': list('abcdef'),
    'path4': list('aaabef'),
    'col': list('aaabef')
})
The expected results:
  path1  path2 path3 path4 col
0   NaN    4.0   NaN   NaN   a
1   NaN    5.0     b   NaN   a
2   NaN    4.0     c   NaN   a
3   NaN    5.0     d   NaN   b
4   NaN    5.0   NaN   NaN   e
5   NaN    4.0   NaN   NaN   f
As you can see, this is the reverse of fillna, and I guess there is no easy way to do it in pandas. I have already mentioned the commands I can use; I would like to know if there is a better way to achieve this.
Use:
for c1, c2 in zip(ndf.columns, ndf.columns[1:]):
    ndf.loc[ndf[c1] == ndf[c2], c1] = np.nan
print(ndf)
  path1  path2 path3 path4 col
0   NaN    4.0   NaN   NaN   a
1   NaN    5.0     b   NaN   a
2   NaN    4.0     c   NaN   a
3   NaN    5.0     d   NaN   b
4   NaN    5.0   NaN   NaN   e
5   NaN    4.0   NaN   NaN   f
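The same loop can be written with Series.mask, which blanks out a value wherever the condition holds (a sketch, equivalent to the loc assignment above):

for c1, c2 in zip(ndf.columns, ndf.columns[1:]):
    ndf[c1] = ndf[c1].mask(ndf[c1].eq(ndf[c2]))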

In pandas, how can all columns that do not contain at least one NaN be dropped from a DataFrame?

I have a DataFrame in which some columns have NaN values. I want to drop all columns that do not have at least one NaN value in them.
I am able to identify the NaN values by creating a DataFrame filled with Boolean values (True in place of NaN values, False otherwise):
data.isnull()
Then, I am able to identify the columns that contain at least one NaN value by creating a series of column names with associated Boolean values (True if the column contains at least one NaN value, False otherwise):
data.isnull().any(axis = 0)
When I attempt to use this series to drop the columns that do not contain at least one NaN value, I run into a problem: the columns that do not contain NaN values are dropped:
data = data.loc[:, data.isnull().any(axis = 0)]
How should I do this?
Consider the dataframe df
df = pd.DataFrame([
    [1, 2, None],
    [3, None, 4],
    [5, 6, None]
], columns=list('ABC'))
df
   A    B    C
0  1  2.0  NaN
1  3  NaN  4.0
2  5  6.0  NaN
IIUC:
pandas dropna with the thresh parameter:
df.dropna(axis=1, thresh=2)
   A    B
0  1  2.0
1  3  NaN
2  5  6.0
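(thresh=2 keeps only the columns holding at least two non-NaN values, so C, with a single non-null entry, is the one dropped.)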
loc + boolean indexing:
df.loc[:, df.isnull().sum() < 2]
   A    B
0  1  2.0
1  3  NaN
2  5  6.0
I used the sample DF from @piRSquared's answer.
If you want "to drop the columns that do not contain at least one NaN value":
In [19]: df
Out[19]:
   A    B    C
0  1  2.0  NaN
1  3  NaN  4.0
2  5  6.0  NaN

In [26]: df.loc[:, df.isnull().any()]
Out[26]:
     B    C
0  2.0  NaN
1  NaN  4.0
2  6.0  NaN
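The mask df.isnull().any() is True exactly for the columns containing at least one NaN, so loc keeps those columns and drops the rest.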