Replace NaN values of pandas.DataFrame based on values of other columns (according to formula)

Replace NaN values of pandas.DataFrame based on values of other columns (according to formula) - pandas

Demo dataframe:
import pandas as pd
df = pd.DataFrame({'a': [1,None,3], 'b': [5,10,15]})
I want to replace all NaN values in a with the corresponding values in b**2, and make b NaN (shift NaN values and make some operations on them).
Desired result:
1 5
100 NaN
3 15
How is it possible with pandas?

You can get the rows you want to change using df['a'].isnull(). Then you can use that to update the columns with loc.
import pandas as pd
import numpy as np
df = pd.DataFrame({'a': [1, None, 3], 'b': [5, 10, 15]})
change = df['a'].isnull()
df.loc[change, ['a', 'b']] = [df.loc[change, 'b']**2, np.NaN]
print(df)
Note that the change variable is only to keep from repeating df['a'].isnull() on both sides of the assignment. You could replace it with that expression to do this in one line, but I think that looks cluttered.
Result:
a b
0 1.0 5.0
1 100.0 NaN
2 3.0 15.0

Related

drop rows from a Pandas dataframe based on which rows have missing values in another dataframe

I'm trying to drop rows with missing values in any of several dataframes.
They all have the same number of rows, so I tried this:
model_data_with_NA = pd.concat([other_df,
standardized_numerical_data,
encode_categorical_data], axis=1)
ok_rows = ~(model_data_with_NA.isna().all(axis=1))
model_data = model_data_with_NA.dropna()
assert(sum(ok_rows) == len(model_data))
False!
As a newbie in Python, I wonder why this doesn't work? Also, is it better to use hierarchical indexing? Then I can extract the original columns from model_data.

In Short
I believe the all in ~(model_data_with_NA.isna().all(axis=1)) should be replaced with any.
The reason is that all checks here if every value in a row is missing, and any checks if one of the values is missing.
Full Example
import pandas as pd
import numpy as np
df1 = pd.DataFrame({'a':[1, 2, 3]})
df2 = pd.DataFrame({'b':[1, np.nan]})
df3 = pd.DataFrame({'c': [1, 2, np.nan]})
model_data_with_na = pd.concat([df1, df2, df3], axis=1)
ok_rows = ~(model_data_with_na.isna().any(axis=1))
model_data = model_data_with_na.dropna()
assert(sum(ok_rows) == len(model_data))
model_data_with_na
a
b
c
0
1
1
1
1
2
nan
2
2
3
nan
nan
model_data
a
b
c
0
1
1
1

How to merge same name column from two different dataframes?

I have four different datasets. I have merged three of the dataframes correctly. I have same name column in 3rd and 4th dataset. When I merge it with 4th dataset. I am not getting the same name column values in well mannerd way. The user_id is repeating when I merge. I don't want to repeat the user_id. I want to see the value in the del_keys column where it's showing me NaN value rather than it's showing me the value in the last of table. Moreover, I want to merge values of same name column on the basis of their user_id.
In the above image you can see what kind of problem I am getting.
My expected output will look like. There should not be repeated user_id.

using merge on user_id column
import pandas as pd
import numpy as np
df1 = pd.DataFrame({
'user_id': [1, 2, 3, 4],
'del': [1.0, np.nan, np.nan, np.nan]
})
df2 = pd.DataFrame({
'user_id': [3, 4, 5],
'del_keys': [1.0, 2.0, 3.0]
})
final=df.merge(df2,on="user_id",how="outer")
Combine first to get rid of Nan values and then drop duplicates
final["del_keys"]=final['del_keys_y'].combine_first(final['del_keys_x'])
final.drop(columns=["del_keys_x","del_keys_y"],inplace=True)
final.drop_duplicates(subset="user_id")

I'm guessing that you use pd.concat to merge the dataframes.
Some dataframes:
import pandas as pd
import numpy as np
df1 = pd.DataFrame({
'user_id': [1, 2, 3],
'del_keys': [1.0, np.nan, np.nan]
})
df2 = pd.DataFrame({
'user_id': [3, 4, 5],
'del_keys': [1.0, 2.0, 3.0]
})
Merge using pd.concat:
df = pd.concat([df1, df2])
>>> user_id del_keys
0 1 1.0
1 2 NaN
2 3 NaN
0 3 1.0
1 4 2.0
2 5 3.0
Remove duplicates using pd.drop_duplicates:
(
df
.sort_values('del_keys')
.drop_duplicates('user_id', keep='first')
.sort_values('user_id')
)
>>> user_id del_keys
0 1 1.0
1 2 NaN
0 3 1.0
1 4 2.0
2 5 3.0
First, we sort the values by del_keys such that all NaNs are the bottom of the dataframe. Then we can drop the duplicates and keep the first occurrence for each user_id. Lastly, we can sort again to restore the original order.

Dataframe columns cleaning

I am trying to clean a number of columns in a dataset and try to iterate to different columns.
import pandas as pd
df = pd.DataFrame({
'A': [7.3\N\P,nan\T\Z,11.0\R\Z],
'B': [nan\J\N, nan\A\G, 10.8\F\U],
'C': [12.4\A\I, 13.3\H\Z, 8.200000000000001\B\W]})
for name, values in df.iloc[:, 0:3].iteritems():
def myreplace(s):
for char in ['\A','\B','\C','\D','\E','\F','\G','\H','\I',
'\J','\K','\L','\M','\\N','\O','\P','\Q','\R',
'\S','\T','\V','\W','\X','\Y','\Z','\\U']:
s = s.map(lambda x: x.replace(char, ''))
return s
df = df.apply(myreplace)
I get the error: 'float' object has no attribue 'replace'
I could run this part on one column and it works, but I need to run it on several columns so this part does not work as I get an error that 'Dataframe'objec has no attribute 'str'
df_data.str.replace('[\\\|A|B|C|D|E|F|G|H|I|J|K|L|M|N|O|P|Q|R|S|T|U|V|W|X|Y|Z]', '')
I am really new to python pandas dataframe. Will appreciate the help

Given, assuming the goal is to extract numbers from the strings:
A B C
0 7.3\N\P nan\J\N 12.4\A\I
1 nan\T\Z nan\A\G 13.3\H\Z
2 11.0\R\Z 10.8\F\U 8.200000000000001\B\W
Doing:
cols = ['A', 'B', 'C']
for col in cols:
df[col] = df[col].str.extract('(\d*\.\d*)').astype(float)
Output:
A B C
0 7.3 NaN 12.4
1 NaN NaN 13.3
2 11.0 10.8 8.2

pandas devide column by another one safely ( zeros, None, str)

How to divide one column by another one zero and str safe?
I don't want to create new 'A' and 'B' cols without zeros and str for some reason. If devision is not possible, I want to get Nones.
df = pd.DataFrame({'A': [0, None, 2, 1 ,5], 'B': [1, 3,'', 'cat', 4]})
I try:
df['C'] = df['B'].divide(df['A'], fill_value=None) # error with zero devision
In fact, this works, but maybe there is more elegant way?
`df['C'] = df.apply(lambda row: row['B']/row['A'] if isinstance(row['A'], numbers.Number) and isinstance(row['B'], numbers.Number) and row['A'] != 0 else None, axis = 1) # this works perfectly but looks ugly`

Use pd.to_numeric to coerce non-numeric types:
import pandas as pd
import numpy as np
df['C'] = pd.to_numeric(df['B'], errors='coerce').divide(pd.to_numeric(df['A'], errors='coerce'))
# A B C
#0 0.0 1 inf
#1 NaN 3 NaN
#2 2.0 NaN
#3 1.0 cat NaN
#4 5.0 4 0.8
If you don't want np.inf then:
df['C'] = df.C.replace(np.inf, np.NaN)

How to use pandas rename() on multi-index columns?

How can can simply rename a MultiIndex column from a pandas DataFrame, using the rename() function?
Let's look at an example and create such a DataFrame:
import pandas
df = pandas.DataFrame({'A': [1, 1, 1, 2, 2], 'B': range(5), 'C': range(5)})
df = df.groupby("A").agg({"B":["min","max"],"C":"mean"})
print(df)
B C
min max mean
A
1 0 2 1.0
2 3 4 3.5
I am able to select a given MultiIndex column by using a tuple for its name:
print(df[("B","min")])
A
1 0
2 3
Name: (B, min), dtype: int64
However, when using the same tuple naming with the rename() function, it does not seem it is accepted:
df.rename(columns={("B","min"):"renamed"},inplace=True)
print(df)
B C
min max mean
A
1 0 2 1.0
2 3 4 3.5
Any idea how rename() should be called to deal with Multi-Index columns?
PS : I am aware of the other options to flatten the column names before, but this prevents one-liners so I am looking for a cleaner solution (see my previous question)

This doesn't answer the question as worded, but it will work for your given example (assuming you want them all renamed with no MultiIndex):
import pandas as pd
df = pd.DataFrame({'A': [1, 1, 1, 2, 2], 'B': range(5), 'C': range(5)})
df = df.groupby("A").agg(
renamed=('B', 'min'),
B_max=('B', 'max'),
C_mean=('C', 'mean'),
)
print(df)
renamed B_max C_mean
A
1 0 2 1.0
2 3 4 3.5
For more info, you can see the pandas docs and some related other questions.

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

Replace NaN values of pandas.DataFrame based on values of other columns (according to formula) - pandas

Demo dataframe: import pandas as pd df = pd.DataFrame({'a': [1,None,3], 'b': [5,10,15]}) I want to replace all NaN values in a with the corresponding values in b**2, and make b NaN (shift NaN values and make some operations on them). Desired result: 1 5 100 NaN 3 15 How is it possible with pandas?

Related

drop rows from a Pandas dataframe based on which rows have missing values in another dataframe

How to merge same name column from two different dataframes?

Dataframe columns cleaning

pandas devide column by another one safely ( zeros, None, str)

How to use pandas rename() on multi-index columns?

Categories

Resources