How to remove all types of NaN from the dataframe? - pandas

I have a data frame, shown below. I want to merge the column values into one column, excluding NaN values.
Image 1:
When I am using the code
df3["Generation"] = df3[df3.columns[5:]].apply(lambda x: ','.join(x.dropna()), axis=1)
I am getting results like this.
Image 2:

I suspect that these columns are of type string; thus, they are not affected by x.dropna().
Here is an example I put together that gives results similar to yours.
df = pd.DataFrame({'a': [np.nan, np.nan, 1, 2], 'b': [1, 1, np.nan, None]}).astype(str)
df.apply(lambda x: ','.join(x.dropna()), axis=1)
0 nan,1.0
1 nan,1.0
2 1.0,nan
3 2.0,nan
dtype: object
-----------------
# using a simple string comparison solves the problem
df.apply(lambda x: ','.join(x[x!='nan']), axis=1)
0 1.0
1 1.0
2 1.0
3 2.0
dtype: object
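Alternatively (a sketch of my own, not part of the original answer), you could turn the literal 'nan' strings back into real missing values first, so that dropna() behaves as expected:
# replace the string 'nan' with an actual NaN, then dropna works row-wise
df.replace('nan', np.nan).apply(lambda x: ','.join(x.dropna()), axis=1)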

Related

How to merge a same-name column from two different dataframes?

I have four different datasets. I have merged three of the dataframes correctly. The 3rd and 4th datasets have a column with the same name. When I merge with the 4th dataset, the values of that same-name column are not combined cleanly: the user_id is repeated after the merge. I don't want the user_id to repeat. Where the del_keys column shows a NaN value, I want to see the actual value instead of having it appear at the end of the table. In short, I want to merge the values of the same-name column on the basis of their user_id.
In the above image you can see what kind of problem I am getting.
My expected output would look like this: there should be no repeated user_id.
Using merge on the user_id column:
import pandas as pd
import numpy as np
df1 = pd.DataFrame({
    'user_id': [1, 2, 3, 4],
    'del_keys': [1.0, np.nan, np.nan, np.nan]
})
df2 = pd.DataFrame({
    'user_id': [3, 4, 5],
    'del_keys': [1.0, 2.0, 3.0]
})
final = df1.merge(df2, on="user_id", how="outer")
Use combine_first to get rid of the NaN values, and then drop the duplicates:
final["del_keys"]=final['del_keys_y'].combine_first(final['del_keys_x'])
final.drop(columns=["del_keys_x","del_keys_y"],inplace=True)
final.drop_duplicates(subset="user_id")
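For reference, running the snippet above on these sample frames should give something along these lines (my own run-through, not output from the original answer):
   user_id  del_keys
0        1       1.0
1        2       NaN
2        3       1.0
3        4       2.0
4        5       3.0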
I'm guessing that you use pd.concat to merge the dataframes.
Some dataframes:
import pandas as pd
import numpy as np
df1 = pd.DataFrame({
    'user_id': [1, 2, 3],
    'del_keys': [1.0, np.nan, np.nan]
})
df2 = pd.DataFrame({
    'user_id': [3, 4, 5],
    'del_keys': [1.0, 2.0, 3.0]
})
Merge using pd.concat:
df = pd.concat([df1, df2])
>>> user_id del_keys
0 1 1.0
1 2 NaN
2 3 NaN
0 3 1.0
1 4 2.0
2 5 3.0
Remove duplicates using pd.drop_duplicates:
(
    df
    .sort_values('del_keys')
    .drop_duplicates('user_id', keep='first')
    .sort_values('user_id')
)
>>> user_id del_keys
0 1 1.0
1 2 NaN
0 3 1.0
1 4 2.0
2 5 3.0
First, we sort the values by del_keys so that all NaNs end up at the bottom of the dataframe. Then we can drop the duplicates and keep the first occurrence for each user_id. Lastly, we sort by user_id again to restore the order.
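As a side note (an alternative sketch of mine, not part of the original answer), groupby(...).first() already returns the first non-NaN value per column within each group, so a one-liner along these lines should give the same values:
# take the first non-NaN del_keys per user_id from the concatenated frame
df.groupby('user_id', as_index=False).first()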

Replace NaN values of pandas.DataFrame based on values of other columns (according to formula)

Demo dataframe:
import pandas as pd
df = pd.DataFrame({'a': [1,None,3], 'b': [5,10,15]})
I want to replace all NaN values in a with the corresponding values of b**2, and then set b to NaN (i.e. shift the NaN values over while applying an operation to the moved values).
Desired result:
a    b
1    5
100  NaN
3    15
How can I do this with pandas?
You can get the rows you want to change using df['a'].isnull(). Then you can use that to update the columns with loc.
import pandas as pd
import numpy as np
df = pd.DataFrame({'a': [1, None, 3], 'b': [5, 10, 15]})
change = df['a'].isnull()
df.loc[change, ['a', 'b']] = [df.loc[change, 'b']**2, np.nan]
print(df)
Note that the change variable is only to keep from repeating df['a'].isnull() on both sides of the assignment. You could replace it with that expression to do this in one line, but I think that looks cluttered.
Result:
a b
0 1.0 5.0
1 100.0 NaN
2 3.0 15.0
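For reference, the one-line variant mentioned in the note above would presumably look like this (same logic, just with df['a'].isnull() repeated inline):
# one-liner: repeat the isnull() mask on both sides of the assignment
df.loc[df['a'].isnull(), ['a', 'b']] = [df.loc[df['a'].isnull(), 'b']**2, np.nan]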

dataframe groupby nth with same behaviour as first and last

In a dataframe, when performing groupby('col').first() we get the first non-NaN value in each column (the same goes for last).
I am trying to get the second non-NaN value and I cannot find how. The only relevant function I found is groupby('col').nth(1), but it just gives me the second row, NaNs included. groupby('col').nth(1, dropna='any') doesn't do the job either, since it skips rows that contain NaNs rather than checking each column separately.
example:
df = pd.DataFrame({
    'A': [1, 1, 1, 1, 1],
    'B': [np.nan, 2, 3, 4, 5],
    'C': [np.nan, np.nan, 3, 4, 5]
}, columns=['A', 'B', 'C'])
first() behaviour:
df.groupby('A').first().reset_index()
results with:
A B C
0 1 2.0 3.0
on the other hand:
df.groupby('A').nth(0, dropna='any').reset_index()
gives:
A B C
0 1 3.0 3.0
Is there a way to get the same behaviour as first/last in the nth function, so that I can also apply it to the second or any nth item?
You can use the generic aggregate method to filter each series with notna and then pick the index you want, for example:
df.groupby('A').aggregate(lambda x: x.array[pd.notna(x)][0])
Produces:
B C
A
1 2.0 3.0
Changing the index to 1 to get the second notna value gives:
B C
A
1 3.0 4.0
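For clarity, the call that is presumably being changed here is the same aggregate with index 1 instead of 0:
# pick the second non-NaN value of each column per group
df.groupby('A').aggregate(lambda x: x.array[pd.notna(x)][1])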
Of course that lambda is a bit naive because it will raise an IndexError if the array isn't long enough. A function like this should work:
def nth_notna(n):
    def inner(series):
        a = series.array[pd.notna(series)]
        if len(a) - 1 < n:
            return np.nan
        return a[n]
    return inner
Then df.groupby('A').aggregate(nth_notna(3)) will produce:
B C
A
1 5.0 NaN

Python pandas DataFrame operations with NaN

On a pandas DataFrame, I'm trying to compute the change between two features. For example:
df = pd.DataFrame({'A': [100, 100, 100], 'B': [105, 110, 93], 'C': ['NaN', 102, 'NaN']})
I am attempting to compute df['A'] - df['C'], but on the rows where 'C' is 'NaN', I want to use the value from the 'B' column instead.
Expected result: [-5, -2, 7]
Since df['C'].loc[0] is NaN, the first value is 100 - 105 (from 'B'), but the second value is 100 - 102.
I think the simplest is to replace the missing values with the other column using Series.fillna:
#if need replace strings NaN to missing values np.nan
df['C'] = pd.to_numeric(df.C, errors='coerce')
s = df['A'] - df['C'].fillna(df.B)
print (s)
0 -5.0
1 -2.0
2 7.0
dtype: float64
Another idea is numpy.where, testing for missing values with Series.isna:
a = np.where(df.C.isna(), df['A'] - df['B'], df['A'] - df['C'])
print (a)
[-5. -2. 7.]
s = df['A'] - np.where(df.C.isna(), df['B'], df['C'])
print (s)
0 -5.0
1 -2.0
2 7.0
Name: A, dtype: float64

How to use pandas rename() on multi-index columns?

How can I simply rename a MultiIndex column of a pandas DataFrame, using the rename() function?
Let's look at an example and create such a DataFrame:
import pandas
df = pandas.DataFrame({'A': [1, 1, 1, 2, 2], 'B': range(5), 'C': range(5)})
df = df.groupby("A").agg({"B":["min","max"],"C":"mean"})
print(df)
B C
min max mean
A
1 0 2 1.0
2 3 4 3.5
I am able to select a given MultiIndex column by using a tuple for its name:
print(df[("B","min")])
A
1 0
2 3
Name: (B, min), dtype: int64
However, when using the same tuple with the rename() function, it does not seem to be accepted:
df.rename(columns={("B","min"):"renamed"},inplace=True)
print(df)
B C
min max mean
A
1 0 2 1.0
2 3 4 3.5
Any idea how rename() should be called to deal with Multi-Index columns?
PS: I am aware of the option of flattening the column names beforehand, but that prevents one-liners, so I am looking for a cleaner solution (see my previous question).
This doesn't answer the question as worded, but it will work for your given example (assuming you want them all renamed with no MultiIndex):
import pandas as pd
df = pd.DataFrame({'A': [1, 1, 1, 2, 2], 'B': range(5), 'C': range(5)})
df = df.groupby("A").agg(
    renamed=('B', 'min'),
    B_max=('B', 'max'),
    C_mean=('C', 'mean'),
)
print(df)
renamed B_max C_mean
A
1 0 2 1.0
2 3 4 3.5
For more info, you can see the pandas docs and some other related questions.
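As a side note (my own addition, not from the original answer), DataFrame.rename() does take a level argument, so on the question's original MultiIndex frame one could presumably rename just a single-level label in one line:
# rename the label 'min' in the second column level of the MultiIndex frame
df.rename(columns={"min": "renamed"}, level=1)
Note that this renames every 'min' label in that level, not specifically the one under 'B'.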