function to replace null values with mean - pandas

I have unemployment data for 30 countries with some missing values. In the Excel sheet all the numbers are stored as strings, so I first convert them to floats and then, if a cell is empty, I want to replace it with its column's mean value. The function runs without any error, but when I print the data I still have the null values.
data = pd.read_excel(r'C:\Users\OĞUZ\Desktop\employment.xlsx')
data = data.set_index('Unnamed: 0')
for column in data:
    for row in column:
        if len(row) > 5:
            row = float(row)
        if row.isnull():
            row = column.mean()
print(data['Argentina'].head())
This is what I get after print.
Unnamed: 0
1990 NaN
1991 NaN
1992 NaN
1993 NaN
1994 NaN
Name: Argentina, dtype: float64

You can either iterate over the columns, or use DataFrame.transform or DataFrame.apply.
Whichever approach you use, you'll want to:
Convert column values from strings to floats
Calculate the mean of the column
Use Series.fillna to fill the NaN values with the previously calculated value
Create Data
import pandas as pd
import numpy as np

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "a": rng.integers(5, size=10),
    "b": rng.integers(5, 10, size=10),
    "c": rng.integers(10, 15, size=10)
}).astype(str)
df.loc[2:5, :] = np.nan

# note all the numbers you see are actually strings
print(df)
a b c
0 4 8 11
1 3 9 14
2 NaN NaN NaN
3 NaN NaN NaN
4 NaN NaN NaN
5 NaN NaN NaN
6 0 8 12
7 0 7 10
8 0 7 13
9 4 9 13
Solution - DataFrame transform
def clean_column(series):
    series = pd.to_numeric(series, downcast="float")
    avg = series.mean()
    return series.fillna(avg)

new_df = df.transform(clean_column)
print(new_df)
a b c
0 4.000000 8.0 11.000000
1 3.000000 9.0 14.000000
2 1.833333 8.0 12.166667
3 1.833333 8.0 12.166667
4 1.833333 8.0 12.166667
5 1.833333 8.0 12.166667
6 0.000000 8.0 12.000000
7 0.000000 7.0 10.000000
8 0.000000 7.0 13.000000
9 4.000000 9.0 13.000000
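For reference, the same clean_column helper also works with the other two approaches mentioned above; a minimal sketch of the explicit column loop and the DataFrame.apply variant (both assume clean_column is already defined):
# explicit loop over the columns
cleaned = df.copy()
for col in cleaned.columns:
    cleaned[col] = clean_column(cleaned[col])

# or map clean_column over each column with apply
cleaned = df.apply(clean_column)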

To fill NaNs use df.fillna(value). For the mean use df.mean(). If your column is named Argentina this could look like this:
df.Argentina.fillna(df.Argentina.mean(), inplace=True)
Passing inplace=True performs the assignment in place. The line is equivalent to
df.Argentina = df.Argentina.fillna(df.Argentina.mean())
Example
df = pd.DataFrame({'Argentina':[1,np.nan,2,4]}, index=[1990, 1991, 1992, 1993])
>>> df
Argentina
1990 1.0
1991 NaN
1992 2.0
1993 4.0
df.Argentina.fillna(df.Argentina.mean(), inplace=True)
>>> df
Argentina
1990 1.000000
1991 2.333333
1992 2.000000
1993 4.000000
If you have many columns and you want to fill the NaNs with values that depend on the column, you can loop over the column names like below:
for name in df.columns:
    df[name].fillna(df[name].mean(), inplace=True)
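If every column should simply be filled with its own mean (and all columns are already numeric), the loop can be collapsed into a single call, since DataFrame.fillna accepts a Series of per-column fill values:
# df.mean() returns a Series indexed by column name; fillna aligns on it
df = df.fillna(df.mean(numeric_only=True))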

Related

At each NaN value, drop the row and column it's located in from pandas DataFrame

I have some unknown DataFrame that can be of any size and shape, for example:
first1 first2 first3 first4
a NaN 22 56.0 65
c 380.0 40 NaN 66
b 390.0 50 80.0 64
My objective is to delete all columns and rows at which there is a NaN value.
In this specific case, the output should be:
first2 first4
b 50 64
Also, I need to preserve the option to use "all" like in pandas.DataFrame.dropna, meaning that when "all" is passed, a column or a row must be dropped only if all of its values are missing.
When I tried the following code:
def dropna_mta_style(df, how='any'):
    new_df = df.dropna(axis=0, how=how).dropna(axis=1, how=how)
    return new_df
It obviously didn't work: it first drops the rows and only then searches for columns with NaNs, but by then those NaNs have already been removed along with the rows.
Thanks in advance!
P.S.: for and while loops, Python built-in functions that act on iterables (all, any, map, ...), and list and dictionary comprehensions shouldn't be used.
Solution intended for readability:
rows = df.dropna(axis=0).index
cols = df.dropna(axis=1).columns
df = df.loc[rows, cols]
Would something like this work?
df.dropna(axis=1,how='any').loc[df.dropna(axis=0,how='any').index]
(Meaning: we take the index of all rows that have no NaN in any column, df.dropna(axis=0, how='any').index, and then use it to select those rows from the original df after dropping every column that contains at least one NaN.)
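If the how='all' behaviour from the question also needs to be supported, a minimal sketch combining this two-pass idea with the how parameter could look like this (no loops or comprehensions involved):
def dropna_mta_style(df, how='any'):
    # rows and columns that survive dropna along each axis, selected from the original frame
    rows = df.dropna(axis=0, how=how).index
    cols = df.dropna(axis=1, how=how).columns
    return df.loc[rows, cols]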
This should remove all rows and columns dynamically
df['Check'] = df.isin([np.nan]).any(axis=1)
df = df.dropna(axis=1)
df = df.loc[df['Check'] == False]
df.drop('Check', axis=1, inplace=True)
df
def dropna_mta_style(df, how='any'):
    if how == 'all':
        # columns in which every value is NaN
        null_col = df.isna().all(axis=0).to_frame(name='col')
        col_names = null_col[null_col['col'] == True].index
        # rows in which every value is NaN (used in the breakdown below)
        null_row = df.isna().all(axis=1).to_frame(name='row')
        row_index = null_row[null_row['row'] == True].index
        new_df = df
        if len(col_names) > 0:
            new_df = df.drop(axis=1, columns=col_names)
    else:
        new_df = df.dropna(axis=0, how=how).dropna(axis=1, how=how)
    return new_df
Here is a breakdown of the change made to the function.
BEFORE:
first1 first2 first3 first4 first5
a NaN 22.0 NaN 65.0 NaN
c 380.0 40.0 NaN 66.0 NaN
b 390.0 50.0 NaN 64.0 NaN
3 NaN NaN NaN NaN NaN
4 NaN NaN NaN NaN NaN
5 NaN NaN NaN NaN NaN
6 NaN NaN NaN NaN NaN
Find the all-NaN columns:
null_col = df.isna().all(axis=0).to_frame(name='col')
col_names = null_col[null_col['col'] == True].index
col_names
Index(['first3', 'first5'], dtype='object')
Find the all-NaN rows:
null_row = df.isna().all(axis=1).to_frame(name='row')
row_index = null_row[null_row['row'] == True].index
row_index
Index([3, 4, 5, 6], dtype='object')
if len(col_names) > 0:
    df2 = df.drop(axis=1, columns=col_names)
df2
AFTER:
first1 first2 first4
a NaN 22.0 65.0
c 380.0 40.0 66.0
b 390.0 50.0 64.0
3 NaN NaN NaN
4 NaN NaN NaN
5 NaN NaN NaN
6 NaN NaN NaN
Incorporating this into your method gives the function shown above.

Adding columns with null values in pandas dataframe [duplicate]

When summing two pandas columns, I want to ignore NaN values when only one of the two values is missing. However, when NaN appears in both columns, I want to keep NaN in the output (instead of 0.0).
Initial dataframe:
Surf1 Surf2
0 0
NaN 8
8 15
NaN NaN
16 14
15 7
Desired output:
Surf1 Surf2 Sum
0 0 0
NaN 8 8
8 15 23
NaN NaN NaN
16 14 30
15 7 22
Tried code:
The code below ignores NaN values, but when it takes the sum of two NaN values it gives 0.0 in the output, whereas I want to keep NaN in that case so that empty values stay separate from sums that are actually 0.
import pandas as pd
import numpy as np
data = pd.DataFrame({"Surf1": [10,np.nan,8,np.nan,16,15], "Surf2": [22,8,15,np.nan,14,7]})
print(data)
data.loc[:,'Sum'] = data.loc[:,['Surf1','Surf2']].sum(axis=1)
print(data)
From the documentation pandas.DataFrame.sum
By default, the sum of an empty or all-NA Series is 0.
>>> pd.Series([]).sum()  # min_count=0 is the default
0.0
This can be controlled with the min_count parameter. For example, if you’d like the sum of an empty series to be NaN, pass min_count=1.
Change your code to
data.loc[:,'Sum'] = data.loc[:,['Surf1','Surf2']].sum(axis=1, min_count=1)
output
Surf1 Surf2
0 10.0 22.0
1 NaN 8.0
2 8.0 15.0
3 NaN NaN
4 16.0 14.0
5 15.0 7.0
Surf1 Surf2 Sum
0 10.0 22.0 32.0
1 NaN 8.0 8.0
2 8.0 15.0 23.0
3 NaN NaN NaN
4 16.0 14.0 30.0
5 15.0 7.0 22.0
You could mask the result by doing:
df.sum(1).mask(df.isna().all(1))
0 0.0
1 8.0
2 23.0
3 NaN
4 30.0
5 22.0
dtype: float64
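To store the masked sum back on the frame (assuming df holds only the two surface columns), it can be assigned to a new column:
# NaN only where the whole row is NaN; otherwise NaNs are treated as 0 in the sum
df['Sum'] = df.sum(axis=1).mask(df.isna().all(axis=1))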
You can do:
df['Sum'] = df.dropna(how='all').sum(1)
Output:
Surf1 Surf2 Sum
0 10.0 22.0 32.0
1 NaN 8.0 8.0
2 8.0 15.0 23.0
3 NaN NaN NaN
4 16.0 14.0 30.0
5 15.0 7.0 22.0
You can use min_count: this sums the row when there is at least one non-null value, and returns null if all values are null.
df['SUM'] = df.sum(min_count=1, axis=1)
#df.sum(min_count=1,axis=1)
Out[199]:
0 0.0
1 8.0
2 23.0
3 NaN
4 30.0
5 22.0
dtype: float64
I think all the solutions listed above work only for the cases where it is the FIRST column value that is missing. If you have cases where the first column value is non-missing but the second column value is missing, try using:
df['sum'] = df['Surf1']
df.loc[(df['Surf2'].notnull()), 'sum'] = df['Surf1'].fillna(0) + df['Surf2']

How to keep all values from a dataframe except where NaN is present in another dataframe?

I am new to Pandas and I am stuck at this specific problem where I have 2 DataFrames in Pandas, e.g.
>>> df1
A B
0 1 9
1 2 6
2 3 11
3 4 8
>>> df2
A B
0 NaN 0.05
1 NaN 0.05
2 0.16 NaN
3 0.16 NaN
What I am trying to achieve is to retain all values from df1 except where there is a NaN in df2 i.e.
>>> df3
A B
0 NaN 9
1 NaN 6
2 3 NaN
3 4 NaN
I am talking about dfs with 10,000 rows each so I can't do this manually. Also indices and columns are the exact same in each case. I also have no NaN values in df1.
As far as I understand df.update() will either overwrite all values including NaN or update only those that are NaN.
You can use boolean masking with DataFrame.notna.
# df2 = df2.astype(float) # This needed if your dtypes are not floats.
m = df2.notna()
df1[m]
A B
0 NaN 9.0
1 NaN 6.0
2 3.0 NaN
3 4.0 NaN
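An equivalent one-liner uses DataFrame.where, which keeps df1's values where the condition is True and inserts NaN elsewhere:
# keep df1 where df2 is not NaN, NaN everywhere df2 is NaN
df3 = df1.where(df2.notna())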

Create new dataframe columns from old dataframe rows using for loop --> N/A values

I created a dataframe df1:
df1 = pd.read_csv('FBK_var_conc_1.csv', names = ['Cycle', 'SQ'])
df1 = df1['SQ'].copy()
df1 = df1.to_frame()
df1.head(n=10)
SQ
0 2430.0
1 2870.0
2 2890.0
3 3270.0
4 3350.0
5 3520.0
6 26900.0
7 26300.0
8 28400.0
9 3230.0
And then created a second dataframe df2, which I want to fill with the row values of df1:
df2 = pd.DataFrame()
for x in range(12):
    y = 'Experiment %d' % (x + 1)
    df2[y] = df1.iloc[3*x:3*x+3]
df2
I get the column names Experiment 1 - Experiment 12 in df2, and the first column is filled with the right values, but all following columns are filled with N/A.
Experiment 1 Experiment 2 Experiment 3 Experiment 4 Experiment 5 Experiment 6 Experiment 7 Experiment 8 Experiment 9 Experiment 10 Experiment 11 Experiment 12
0 2430.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 2870.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 2890.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
I've been looking at this for the last 2 hours but can't figure out why the columns after column 1 aren't filled with values.
Desired output:
Experiment 1 Experiment 2 Experiment 3 Experiment 4 Experiment 5 Experiment 6 Experiment 7 Experiment 8 Experiment 9 Experiment 10 Experiment 11 Experiment 12
2430 3270 26900 3230 2940 243000 256000 249000 2880 26100 3890 33400
2870 3350 26300 3290 3180 242000 254000 250000 3390 27900 3730 30700
2890 3520 28400 3090 3140 253000 260000 237000 3510 27400 3760 29600
I found the issue: df1.iloc[3*x:3*x+3] keeps its original index (3, 4, 5, ... rather than 0, 1, 2), and pandas aligns on that index when assigning to df2, so only the first slice lines up and the rest become NaN. Using .values drops the index.
So the final line of the loop has to be:
df2[y] = df1.iloc[3*x:3*x+3].values
and I get the right output.
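If df1 is known to hold exactly 12 experiments of 3 values each (36 rows, as in the desired output), the loop could also be replaced with a single reshape; a minimal sketch under that assumption:
import pandas as pd

# 36 values -> 12 groups of 3; transpose so each experiment becomes a column
values = df1['SQ'].to_numpy().reshape(-1, 3).T
df2 = pd.DataFrame(values, columns=['Experiment %d' % (i + 1) for i in range(12)])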

Strange pandas.DataFrame.sum(axis=1) behaviour

I have a pandas DataFrame compiled from some web data (for tennis games) that exhibits strange behaviour when summing across selected rows.
DataFrame:
In [178]: tdf.shape
Out[178]: (47028, 57)
In [201]: cols
Out[201]: ['L1', 'L2', 'L3', 'L4', 'L5', 'W1', 'W2', 'W3', 'W4', 'W5']
In [177]: tdf[cols].head()
Out[177]:
L1 L2 L3 L4 L5 W1 W2 W3 W4 W5
0 4.0 2 NaN NaN NaN 6.0 6 NaN NaN NaN
1 3.0 3 NaN NaN NaN 6.0 6 NaN NaN NaN
2 7.0 5 3 NaN NaN 6.0 7 6 NaN NaN
3 1.0 4 NaN NaN NaN 6.0 6 NaN NaN NaN
4 6.0 7 4 NaN NaN 7.0 5 6 NaN NaN
I then try to compute the sum over the rows using tdf[cols].sum(axis=1). From the above table, the sum for the 1st row should be 18.0, but it is reported as 10.0, as below:
In [180]: tdf[cols].sum(axis=1).head()
Out[180]:
0 10.0
1 9.0
2 13.0
3 7.0
4 13.0
dtype: float64
The problem seems to be caused by a specific record (row 13771), because when I exclude this row, the sum is calculated correctly:
In [182]: tdf.iloc[:13771][cols].sum(axis=1).head()
Out[182]:
0 18.0
1 18.0
2 34.0
3 17.0
4 35.0
dtype: float64
whereas, including it:
In [183]: tdf.iloc[:13772][cols].sum(axis=1).head()
Out[183]:
0 10.0
1 9.0
2 13.0
3 7.0
4 13.0
dtype: float64
Gives the wrong result for the entire column.
The offending record is as follows:
In [196]: tdf[cols].iloc[13771]
Out[196]:
L1 1
L2 1
L3 NaN
L4 NaN
L5 NaN
W1 6
W2 0
W3
W4 NaN
W5 NaN
Name: 13771, dtype: object
In [197]: tdf[cols].iloc[13771].W3
Out[197]: ' '
In [198]: type(tdf[cols].iloc[13771].W3)
Out[198]: str
I'm running the following versions:
In [192]: sys.version
Out[192]: '3.4.3 (default, Nov 17 2016, 01:08:31) \n[GCC 4.8.4]'
In [193]: pd.__version__
Out[193]: '0.19.2'
In [194]: np.__version__
Out[194]: '1.12.0'
Surely a single poorly formatted record should not influence the sum of other records? Is this a bug or am I doing something wrong?
Help much appreciated!
The problem is the empty string: the dtype of column W3 is then object (i.e. string), and sum omits it.
Solutions:
Replace the problematic empty string value with NaN and then cast to float:
tdf.loc[13771, 'W3'] = np.nan
tdf.W3 = tdf.W3.astype(float)
Or replace all empty strings with NaN in the subset cols:
tdf[cols] = tdf[cols].replace({'':np.nan})
#if necessary
tdf[cols] = tdf[cols].astype(float)
Another solution is to use to_numeric on the problematic column, replacing everything non-numeric with NaN:
tdf.W3 = pd.to_numeric(tdf.W3, errors='coerce')
Or apply it generally to the columns in cols:
tdf[cols] = tdf[cols].apply(lambda x: pd.to_numeric(x, errors='coerce'))
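As a quick sanity check on hypothetical toy data, coercing a stray empty string to NaN restores a numeric dtype and the row sums then behave as expected:
import pandas as pd

toy = pd.DataFrame({'W1': [6, 6], 'W2': [0, 6], 'W3': ['', '3']})  # W3 is object because of ''
toy['W3'] = pd.to_numeric(toy['W3'], errors='coerce')              # '' -> NaN, dtype becomes float
print(toy.sum(axis=1))  # 6.0 and 15.0; the NaN is skipped instead of the whole column being omitted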