Python pandas DataFrame operations with NaN

On a pandas DataFrame, I'm trying to compute the difference between two columns. For example:
df = pd.DataFrame({'A': [100, 100, 100], 'B': [105, 110, 93], 'C': ['NaN', 102, 'NaN']})
I am attempting to compute df['A'] - df['C'], but on the rows where 'C' holds 'NaN' I want to use the value from column 'B' instead.
Expecting result: [-5, -2, 7]
Since df['C'].loc[0] is NaN, the first value is 100 - 105 (taken from 'B'),
but the second value is 100 - 102.

I think the simplest approach is to replace the missing values with the other column using Series.fillna:
# first convert the string 'NaN' entries to real missing values (np.nan)
df['C'] = pd.to_numeric(df.C, errors='coerce')
s = df['A'] - df['C'].fillna(df.B)
print (s)
0   -5.0
1   -2.0
2    7.0
dtype: float64
Another idea is to use numpy.where and to test for missing values with Series.isna:
a = np.where(df.C.isna(), df['A'] - df['B'], df['A'] - df['C'])
print (a)
[-5. -2. 7.]
s = df['A'] - np.where(df.C.isna(), df['B'], df['C'])
print (s)
0   -5.0
1   -2.0
2    7.0
Name: A, dtype: float64
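A third option that stays within pandas is Series.combine_first, which takes values from C and falls back to B where C is missing; a minimal sketch, equivalent to the fillna approach above (it assumes C was already converted with pd.to_numeric):
# combine_first uses C where it is not missing, otherwise B
s = df['A'] - df['C'].combine_first(df['B'])
print (s)
0   -5.0
1   -2.0
2    7.0
dtype: float64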

Related

How to Exclude NaNs from Pandas Rolling, but not return NaN if there is one in the DataFrame

Currently I have the DataFrame seen below, and I want to compute a rolling average over the last 10 occurrences that have actual values, skipping the NaNs.
[Image: example input DataFrame]
The issue is that if I run df['AST_Hit'].rolling(10).mean(skipna=True).shift(1), I get the DataFrame below, which is not what I am looking for.
[Image: example output DataFrame]
I've tried using window and min_periods, but that does not give me what I want, as I don't want the average taken over anything greater than 10 values.
Ideally I would like the rolling window to discard a NaN but still check whether there are 10 valid values in that selection. From what I am describing, I think I need some sort of max period equal to 10 as well as min_periods equal to 10, but I could not find anything in the pandas documentation on rolling about setting a max period.
Maybe it would also be best if I just dropped any NaN rows. My DataFrame is much bigger than what is seen here, so it isn't just those 3 rows that contain a NaN, but it may be the best course of action.
Any help or tips is greatly appreciated.
This might help you:
import pandas as pd
import numpy as np
# create a sample DataFrame with non-numeric columns
np.random.seed(123)
df = pd.DataFrame({
    'Date': pd.date_range(start='2022-01-01', periods=100),
    'AST_Hit': np.random.randint(0, 10, size=100).astype(float),  # float so NaN can be inserted
    'Other_Column': ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j'] * 10
})
df.iloc[3:6, 1] = np.nan
df.iloc[7, 0] = np.nan
df.iloc[10:15, 2] = np.nan
df.iloc[20:25, 1] = np.nan
df.iloc[30:40, 2] = np.nan
# mark the rows whose trailing 10-row window contains 10 non-null values
rolling_mask = df['AST_Hit'].notnull().rolling(window=10, min_periods=1).sum().eq(10)
# rolling average over the last 10 values, restricted to positions where the mask holds
result = df['AST_Hit'].rolling(window=10, min_periods=1).apply(lambda x: np.mean(x[rolling_mask]))
print(result)
which gives
0      NaN
1      NaN
2      NaN
3      NaN
4      NaN
      ...
95     5.6
96     5.1
97     4.7
98     4.2
99     3.9
Name: AST_Hit, Length: 100, dtype: float64
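If the goal is strictly the mean of the last 10 non-null values, a simpler route (a sketch, not tested against your real data) is to drop the NaNs first, take a plain 10-value rolling mean, and reindex back to the original index; rows that were NaN stay NaN, and the default min_periods (equal to the window) guarantees each average covers exactly 10 valid values:
# roll over the valid values only, then align back to the original rows
valid = df['AST_Hit'].dropna()
result = valid.rolling(window=10).mean().reindex(df.index)
Append .shift(1) as in your own attempt if the average must exclude the current row.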

How to remove all types of NaN from the DataFrame?

I have a DataFrame, shown below. I want to merge the column values into one column, excluding NaN values.
[Image 1: the original DataFrame]
When I am using the code
df3["Generation"] = df3[df3.columns[5:]].apply(lambda x: ','.join(x.dropna()), axis=1)
I am getting results like this:
[Image 2: the merged column, still containing 'nan' strings]
I suspect that these columns are of type string; thus, they are not affected by x.dropna().
One example that I made is this, which gives results similar to yours:
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [np.nan, np.nan, 1, 2], 'b': [1, 1, np.nan, None]}).astype(str)
df.apply(lambda x: ','.join(x.dropna()), axis=1)  # axis=1 joins across the columns of each row
0    nan,1.0
1    nan,1.0
2    1.0,nan
3    2.0,nan
dtype: object
-----------------
# a simple string comparison solves the problem
df.apply(lambda x: ','.join(x[x!='nan']), axis=1)
0    1.0
1    1.0
2    1.0
3    2.0
dtype: object
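Alternatively, if you want dropna itself to work, convert the literal 'nan' strings back into real missing values first; a sketch on the same toy df:
# turn the 'nan' strings back into np.nan so dropna can see them
df.replace('nan', np.nan).apply(lambda x: ','.join(x.dropna()), axis=1)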

Conditional imputation with the average of non-missing values using the pandas toolbox

This question focuses on pandas' own functions. There are existing solutions (pandas DataFrame: replace nan values with average of columns), but they rely on hand-written functions.
In SPSS there is a function MEAN.n which gives you the mean of a list of numbers only when at least n elements of that list are valid (not pandas.NA). With that function you are able to impute missing values only if a minimum number of items are valid.
Is there a pandas function to do this?
Example
Values: [1, 2, 3, 4, NA].
The mean of the valid values is 2.5.
The resulting list should be [1, 2, 3, 4, 2.5].
Now assume the rule that in a 5-item list at least 3 values must be valid for imputation; otherwise nothing is imputed.
Values: [1, 2, NA, NA, NA].
The mean of the valid values is 1.5, but it does not matter.
The resulting list should stay unchanged, [1, 2, NA, NA, NA], because imputation is not allowed.
Assuming you want to work with pandas, you can define a custom wrapper (using only pandas functions) to fillna with the mean only if a minimum number of items are not NA:
import pandas as pd
from pandas import NA

s1 = pd.Series([1, 2, 3, 4, NA])
s2 = pd.Series([1, 2, NA, NA, NA])

def fillna_mean(s, N=4):
    # fill the NAs with the mean only if at least N values are valid
    return s if s.notna().sum() < N else s.fillna(s.mean())
fillna_mean(s1)
# 0    1.0
# 1    2.0
# 2    3.0
# 3    4.0
# 4    2.5
# dtype: float64
fillna_mean(s2)
# 0       1
# 1       2
# 2    <NA>
# 3    <NA>
# 4    <NA>
# dtype: object
fillna_mean(s2, N=2)
# 0    1.0
# 1    2.0
# 2    1.5
# 3    1.5
# 4    1.5
# dtype: float64
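If you need the rule across a whole DataFrame rather than a single Series, the same wrapper can be handed to DataFrame.apply; a sketch with hypothetical column names x and y, using the 3-of-5 rule from the question:
df = pd.DataFrame({'x': [1, 2, NA, NA, NA], 'y': [1, 2, 3, 4, NA]})
df.apply(fillna_mean, N=3)          # per column: y gets imputed, x stays untouched
df.apply(fillna_mean, axis=1, N=3)  # per row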
Let's try a list comprehension, though it will be messy.
Option 1
You can use pd.Series and numpy:
s = [x if np.isnan(lst).sum() >= 3 else pd.Series(lst).mean(skipna=True) if x is np.nan else x for x in lst]
Option 2: use numpy throughout
s = [x if np.isnan(lst).sum() >= 3 else np.mean([x for x in lst if str(x) != 'nan']) if x is np.nan else x for x in lst]
Case 1
lst = [1, 2, 3, 4, np.nan]
Outcome:
[1, 2, 3, 4, 2.5]
Case 2
lst = [1, 2, np.nan, np.nan, np.nan]
Outcome:
[1, 2, nan, nan, nan]
If you want it as a pd.Series, simply:
pd.Series(s, name='lst')
How it works
s=[x if np.isnan(lst).sum()>=3 \ #give me element x if the sum of nans in the list is greater than or equal to 3
else pd.Series(lst).mean(skipna=True) if x is np.nan else x \# Otherwise replace the Nan in list with the mean of non NaN elements in the list
for x in lst\#For every element in lst
]

Replace NaN values of pandas.DataFrame based on values of other columns (according to formula)

Demo dataframe:
import pandas as pd
df = pd.DataFrame({'a': [1,None,3], 'b': [5,10,15]})
I want to replace every NaN value in a with the square of the corresponding value in b, and set b to NaN on those rows (shift the NaN values over and apply an operation along the way).
Desired result:
    a    b
    1    5
  100  NaN
    3   15
How can I do this with pandas?
You can get the rows you want to change using df['a'].isnull(). Then you can use that to update the columns with loc.
import pandas as pd
import numpy as np
df = pd.DataFrame({'a': [1, None, 3], 'b': [5, 10, 15]})
change = df['a'].isnull()
df.loc[change, 'a'] = df.loc[change, 'b'] ** 2
df.loc[change, 'b'] = np.nan
print(df)
Note that the change variable only keeps us from repeating df['a'].isnull() in each assignment. You could inline that expression instead, but I think that looks cluttered.
Result:
       a     b
0    1.0   5.0
1  100.0   NaN
2    3.0  15.0
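An alternative that computes both columns in one statement is DataFrame.assign combined with Series.mask; both new columns are built from the same original frame, so the assignment order cannot bite you. A minimal sketch, starting again from the demo df:
df = pd.DataFrame({'a': [1, None, 3], 'b': [5, 10, 15]})
null_a = df['a'].isnull()                        # remember where 'a' was missing
df = df.assign(a=df['a'].fillna(df['b'] ** 2),   # fill 'a' from b**2
               b=df['b'].mask(null_a))           # blank out 'b' on those rows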

pandas: divide one column by another safely (zeros, None, str)

How can I divide one column by another, safe against zeros and strings?
I don't want to create new 'A' and 'B' columns stripped of the zeros and strings, for reasons of my own. If division is not possible, I want to get None.
df = pd.DataFrame({'A': [0, None, 2, 1 ,5], 'B': [1, 3,'', 'cat', 4]})
I tried:
df['C'] = df['B'].divide(df['A'], fill_value=None)  # error with zero division
In fact, the following works, but maybe there is a more elegant way?
import numbers
df['C'] = df.apply(lambda row: row['B'] / row['A']
                   if isinstance(row['A'], numbers.Number)
                   and isinstance(row['B'], numbers.Number)
                   and row['A'] != 0 else None, axis=1)  # works perfectly, but looks ugly
Use pd.to_numeric to coerce non-numeric types:
import pandas as pd
import numpy as np
df['C'] = pd.to_numeric(df['B'], errors='coerce').divide(pd.to_numeric(df['A'], errors='coerce'))
#      A    B    C
# 0  0.0    1  inf
# 1  NaN    3  NaN
# 2  2.0       NaN
# 3  1.0  cat  NaN
# 4  5.0    4  0.8
If you don't want np.inf then:
df['C'] = df['C'].replace(np.inf, np.nan)
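Putting the two steps together; a sketch that also catches -np.inf, which shows up when the numerator is negative:
A = pd.to_numeric(df['A'], errors='coerce')
B = pd.to_numeric(df['B'], errors='coerce')
df['C'] = (B / A).replace([np.inf, -np.inf], np.nan)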