Pandas: fill in NaN values using a dictionary that references another column

I have a dictionary that looks like this
dict = {'b' : '5', 'c' : '4'}
My dataframe looks something like this
A B
0 a 2
1 b NaN
2 c NaN
Is there a way to fill in the NaN values using the dictionary mapping from columns A to B while keeping the rest of the column values?

You can map dict values inside fillna
df.B = df.B.fillna(df.A.map(dict))
print(df)
A B
0 a 2
1 b 5
2 c 4
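For readers who want to run this end to end, here is a minimal self-contained sketch of the fillna-plus-map approach. The name mapping is an assumption used to avoid shadowing Python's built-in dict; the data mirrors the question, with B holding strings because the dictionary values are strings.

```python
import numpy as np
import pandas as pd

# Assumed name: `mapping` instead of `dict`, to avoid shadowing the built-in
mapping = {"b": "5", "c": "4"}

df = pd.DataFrame({"A": ["a", "b", "c"], "B": ["2", np.nan, np.nan]})

# map() looks up every value of A in the dictionary (NaN where there is no key);
# fillna() then uses those looked-up values only where B is missing
df["B"] = df["B"].fillna(df["A"].map(mapping))
print(df)
```

Existing values in B are left untouched because fillna only writes into positions that are NaN.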

This can be done simply
df['B'] = df['B'].fillna(df['A'].apply(lambda x: dict.get(x)))
This can work effectively for a bigger dataset as well.

Unfortunately, this isn't one of the options for a built-in method like DataFrame.fillna().
Edit: Thanks for the correction. Apparently this is possible, as illustrated in @Vaishali's answer.
However, you can also subset the data frame on the missing values first and then apply the map with your dictionary.
df.loc[df['B'].isnull(), 'B'] = df['A'].map(dict)
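A small sketch of the masked assignment, under the assumption that B is numeric and the dictionary values are numbers (the question uses strings). The names mapping and mask are illustrative, not from the original.

```python
import numpy as np
import pandas as pd

# Assumed numeric variant of the question's data
mapping = {"b": 5, "c": 4}
df = pd.DataFrame({"A": ["a", "b", "c"], "B": [2.0, np.nan, np.nan]})

# Boolean mask of the rows where B is missing
mask = df["B"].isna()

# Only the masked rows are written; the right-hand side is aligned on the index
df.loc[mask, "B"] = df.loc[mask, "A"].map(mapping)
print(df)
```

Restricting the right-hand side to the masked rows is optional, since pandas aligns on the index, but it avoids mapping rows that will not be used.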

Related

What's the best way to insert columns in a pandas Dataframe when you don't know the exact number of columns?

I have an input dataframe.
I have also a list, with the same len as the number of rows in the dataframe.
Every element of the list is a dictionary: the key is the name of the new column, and the value is the value to be inserted in the cell.
I have to insert the columns from that list in the dataframe.
What is the best way to do so?
So far, given the input dataframe indf and the list l, I came up with something along the lines of:
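from copy import deepcopy
outdf = deepcopy(indf)
for index, row in indf.iterrows():
    # assumes the index labels are 0..n-1 so they can be used to index the list
    e = l[index]
    # iterate over (key, value) pairs; iterating the dict itself yields only keys
    for key, value in e.items():
        outdf.loc[index, key] = value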
But it doesn't seem pythonic and pandasnic and I get performance warnings like:
<ipython-input-5-9dde586a9c14>:8: PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
If the sorting of the list and the data frame is the same, you can convert your list of dictionaries to a data frame:
mylist = [
{'a':1,'b':2,'c':3},
{'e':11,'f':22,'c':33},
{'a':111,'b':222,'c':333}
]
mylist_df = pd.DataFrame(mylist)
     a    b    c    e    f
0    1    2    3  nan  nan
1  nan  nan   33   11   22
2  111  222  333  nan  nan
Then you can use pd.concat to merge the list to your input data frame:
result = pd.concat([input_df, mylist_df], axis=1)
This way, a column is always created for every unique key in your dictionaries, regardless of whether it exists in one dictionary and not in another.
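The whole approach can be sketched in a few lines. The frame indf and the list rows below are made-up illustration data; passing index=indf.index is an assumption that the list is in the same order as the rows, so that concat lines the two frames up even when the index is not 0..n-1.

```python
import pandas as pd

# Hypothetical input frame with a non-default index
indf = pd.DataFrame({"x": [10, 20, 30]}, index=["r1", "r2", "r3"])

# One dictionary per row; keys differ from row to row
rows = [{"a": 1, "b": 2}, {"c": 33}, {"a": 111}]

# Build all new columns at once, reusing the input index for alignment
extra = pd.DataFrame(rows, index=indf.index)

# A single concat avoids the fragmentation caused by inserting cells one by one
outdf = pd.concat([indf, extra], axis=1)
print(outdf)
```

Keys that are missing from a given dictionary end up as NaN in that row.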

How to apply a function to every two columns of a pandas data and return a new dataframe

For example I have a dataframe as follows:
 
          A         B         C         D
0  1.049380  0.512696  0.135421  1.396424
1 -0.367589 -0.741008 -1.543296  0.355291
2  1.244623 -0.295761  1.238826 -0.017174
3  0.378124  0.870361 -0.733288 -0.228948
I want to call stats.ttest_ind on every combination of two columns and get a new dataframe like the following (ignore the dummy values):
          A         B         C         D
A       nan  0.512696  0.135421  1.396424
B -0.367589       nan -1.543296  0.355291
C  1.244623 -0.295761       nan -0.017174
D  0.378124  0.870361 -0.733288       nan
You could use a list comprehension:
ttest_lists = [[stats.ttest_ind(df[col_i], df[col_j]) if col_i != col_j else np.nan
                for col_i in df] for col_j in df]
To get a DataFrame rather than a list of lists, you can then use:
ttest_df = pd.DataFrame(ttest_lists, columns=df.columns, index=df.columns)
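Note that stats.ttest_ind returns a result with both a statistic and a p-value, so each cell above holds a pair rather than a single number. The sketch below shows the same pairwise pattern as a reusable helper that stores one scalar per cell. The helper name pairwise and the stand-in function mean_diff are hypothetical, chosen so the example runs without SciPy.

```python
import numpy as np
import pandas as pd

def pairwise(df, func):
    """Apply func(col_i, col_j) to every ordered pair of distinct columns."""
    cols = df.columns
    # Start from an all-NaN square frame so the diagonal stays NaN
    out = pd.DataFrame(np.nan, index=cols, columns=cols, dtype=float)
    for i in cols:
        for j in cols:
            if i != j:
                out.loc[i, j] = func(df[i], df[j])
    return out

def mean_diff(x, y):
    # Stand-in statistic for illustration only
    return x.mean() - y.mean()

demo = pd.DataFrame({"A": [1.0, 2.0, 3.0],
                     "B": [2.0, 4.0, 6.0],
                     "C": [0.0, 0.0, 3.0]})
result = pairwise(demo, mean_diff)
print(result)

# With SciPy installed, the p-values could be collected the same way:
# from scipy import stats
# pvalues = pairwise(demo, lambda x, y: stats.ttest_ind(x, y).pvalue)
```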

Series.replace cannot use dict-like to_replace and non-None value [duplicate]

I've got a pandas DataFrame filled mostly with real numbers, but there are a few nan values in it as well.
How can I replace the nans with averages of columns where they are?
This question is very similar to this one: numpy array: replace nan values with average of columns but, unfortunately, the solution given there doesn't work for a pandas DataFrame.
You can simply use DataFrame.fillna to fill the nan's directly:
In [27]: df
Out[27]:
A B C
0 -0.166919 0.979728 -0.632955
1 -0.297953 -0.912674 -1.365463
2 -0.120211 -0.540679 -0.680481
3 NaN -2.027325 1.533582
4 NaN NaN 0.461821
5 -0.788073 NaN NaN
6 -0.916080 -0.612343 NaN
7 -0.887858 1.033826 NaN
8 1.948430 1.025011 -2.982224
9 0.019698 -0.795876 -0.046431
In [28]: df.mean()
Out[28]:
A -0.151121
B -0.231291
C -0.530307
dtype: float64
In [29]: df.fillna(df.mean())
Out[29]:
A B C
0 -0.166919 0.979728 -0.632955
1 -0.297953 -0.912674 -1.365463
2 -0.120211 -0.540679 -0.680481
3 -0.151121 -2.027325 1.533582
4 -0.151121 -0.231291 0.461821
5 -0.788073 -0.231291 -0.530307
6 -0.916080 -0.612343 -0.530307
7 -0.887858 1.033826 -0.530307
8 1.948430 1.025011 -2.982224
9 0.019698 -0.795876 -0.046431
The docstring of fillna says that value should be a scalar or a dict; however, it seems to work with a Series as well. If you want to pass a dict, you could use df.mean().to_dict().
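A compact, reproducible version of the same idea, using made-up data. The label column is an assumption added to show one practical wrinkle: on a frame with non-numeric columns, passing numeric_only=True to mean() restricts the averages to the numeric columns, and fillna then leaves the other columns alone.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [1.0, np.nan, 3.0],
    "B": [np.nan, 4.0, 8.0],
    "label": ["x", None, "z"],   # non-numeric column, not averaged
})

# Series of column means, indexed by column name
means = df.mean(numeric_only=True)

# fillna matches the Series index against the column names
filled = df.fillna(means)
print(filled)
```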
Try:
sub2['income'].fillna((sub2['income'].mean()), inplace=True)
In [16]: df = DataFrame(np.random.randn(10,3))
In [17]: df.iloc[3:5,0] = np.nan
In [18]: df.iloc[4:6,1] = np.nan
In [19]: df.iloc[5:8,2] = np.nan
In [20]: df
Out[20]:
0 1 2
0 1.148272 0.227366 -2.368136
1 -0.820823 1.071471 -0.784713
2 0.157913 0.602857 0.665034
3 NaN -0.985188 -0.324136
4 NaN NaN 0.238512
5 0.769657 NaN NaN
6 0.141951 0.326064 NaN
7 -1.694475 -0.523440 NaN
8 0.352556 -0.551487 -1.639298
9 -2.067324 -0.492617 -1.675794
In [22]: df.mean()
Out[22]:
0 -0.251534
1 -0.040622
2 -0.841219
dtype: float64
Apply fillna column by column, filling each column with its own mean:
In [23]: df.apply(lambda x: x.fillna(x.mean()),axis=0)
Out[23]:
0 1 2
0 1.148272 0.227366 -2.368136
1 -0.820823 1.071471 -0.784713
2 0.157913 0.602857 0.665034
3 -0.251534 -0.985188 -0.324136
4 -0.251534 -0.040622 0.238512
5 0.769657 -0.040622 -0.841219
6 0.141951 0.326064 -0.841219
7 -1.694475 -0.523440 -0.841219
8 0.352556 -0.551487 -1.639298
9 -2.067324 -0.492617 -1.675794
Although the code below does the job, its performance takes a big hit when you deal with a DataFrame of 100k records or more:
df.fillna(df.mean())
In my experience, one should replace NaN values (be it with the mean or the median) only where it is required, rather than applying fillna() over the whole DataFrame.
I had a DataFrame with 20 variables, and only 4 of them required NaN treatment (replacement). I tried the above code (Code 1), along with a slightly modified version of it (Code 2), which I ran selectively, i.e. only on the variables that had a NaN value.
#------------------------------------------------
#----(Code 1) Treatment on overall DataFrame-----
df.fillna(df.mean())
#------------------------------------------------
#----(Code 2) Selective Treatment----------------
#---df.isnull().any(axis=0) gives a True/False flag per column (Boolean Series),
#---which, when used to index df.columns, identifies the variables with NaN values
for i in df.columns[df.isnull().any(axis=0)]: #---Applying only on variables with NaN values
    #---assign back instead of inplace=True, which may not update df on a column selection
    df[i] = df[i].fillna(df[i].mean())
Below is the performance I observed as I kept increasing the number of records in the DataFrame:
DataFrame with ~100k records
Code 1: 22.06 Seconds
Code 2: 0.03 Seconds
DataFrame with ~200k records
Code 1: 180.06 Seconds
Code 2: 0.06 Seconds
DataFrame with ~1.6 Million records
Code 1: code kept running endlessly
Code 2: 0.40 Seconds
DataFrame with ~13 Million records
Code 1: --did not even try, after seeing performance on 1.6 Mn records--
Code 2: 3.20 Seconds
Apologies for a long answer ! Hope this helps !
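For anyone wanting to check the comparison on their own data, here is a small sketch that builds a frame shaped like the one described (20 columns, 4 of them with NaN) and confirms that both approaches give the same result. The sizes, seed and variable names are assumptions, and no timings are claimed here, since they depend on the pandas version and on the column dtypes.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((1000, 20)))
df.iloc[::10, :4] = np.nan          # only the first 4 columns get NaN values

# Code 1: fill across the whole frame
full = df.fillna(df.mean())

# Code 2: fill only the columns that actually contain NaN
selective = df.copy()
nan_cols = selective.columns[selective.isna().any()]
selective[nan_cols] = selective[nan_cols].fillna(selective[nan_cols].mean())

print(list(nan_cols))
print(np.allclose(full.to_numpy(), selective.to_numpy()))
```

Wrapping each variant in timeit on a frame of realistic size is one way to measure the difference locally.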
If you want to impute missing values with the mean and go column by column, the following imputes a column using only that column's mean. This might be a little more readable.
sub2['income'] = sub2['income'].fillna((sub2['income'].mean()))
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
# To read data from csv file
Dataset = pd.read_csv('Data.csv')
X = Dataset.iloc[:, :-1].values
# To calculate the mean, use the SimpleImputer class (here on columns 1 and 2)
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer = imputer.fit(X[:, 1:3])
X[:, 1:3] = imputer.transform(X[:, 1:3])
Use df.fillna(df.mean()) directly to fill all null values with the mean of their column.
If you want to fill the null values of a single column with that column's mean, you can use the following.
Here Item_Weight is the column name, and the filled result is assigned back to the same column:
df['Item_Weight'] = df['Item_Weight'].fillna((df['Item_Weight'].mean()))
If you want to fill null values with a string instead, use the following.
Here Outlet_Size is the column name:
df.Outlet_Size = df.Outlet_Size.fillna('Missing')
Pandas: How to replace NaN (nan) values with the average (mean), median or other statistics of one column
Say your DataFrame is df and you have one column called nr_items, i.e. df['nr_items'].
If you want to replace the NaN values of your column df['nr_items'] with the mean of the column, use the .fillna() method:
mean_value=df['nr_items'].mean()
df['nr_item_ave']=df['nr_items'].fillna(mean_value)
I have created a new df column called nr_item_ave to store the values with the NaN entries replaced by the mean of the column.
You should be careful when using the mean. If you have outliers, it is better to use the median.
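To make the point about outliers concrete, here is a short sketch with made-up values in which a single large entry pulls the mean far away from the typical value. The column nr_item_med is a hypothetical name added for the median-filled version.

```python
import numpy as np
import pandas as pd

# One outlier (100.0) among otherwise small values
df = pd.DataFrame({"nr_items": [1.0, 2.0, np.nan, 3.0, 100.0]})

# Both statistics skip NaN by default
mean_value = df["nr_items"].mean()      # (1 + 2 + 3 + 100) / 4
median_value = df["nr_items"].median()  # middle of 1, 2, 3, 100

df["nr_item_ave"] = df["nr_items"].fillna(mean_value)
df["nr_item_med"] = df["nr_items"].fillna(median_value)
print(df)
```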
Another option besides those above is the following (note that groupby with axis=1 is deprecated in recent pandas versions):
df = df.groupby(df.columns, axis = 1).transform(lambda x: x.fillna(x.mean()))
It's less elegant than the previous responses for the mean, but it could be shorter if you want to replace nulls using some other column function.
Using the SimpleImputer class from the sklearn library:
import numpy as np
from sklearn.impute import SimpleImputer
# SimpleImputer works column-wise and takes no axis argument (unlike the older Imputer class)
missingvalues = SimpleImputer(missing_values = np.nan, strategy = 'mean')
missingvalues = missingvalues.fit(x[:,1:3])
x[:,1:3] = missingvalues.transform(x[:,1:3])
Note: In recent versions, the missing_values parameter takes np.nan instead of the string 'NaN'.
I use this method to fill missing values by average of a column.
fill_mean = lambda col : col.fillna(col.mean())
df = df.apply(fill_mean, axis = 0)
You can also use value_counts to get the most frequent values. This would work on different datatypes.
df = df.apply(lambda x:x.fillna(x.value_counts().index[0]))
Here is the value_counts api reference.
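As a variation on the most-frequent-value fill, the sketch below uses Series.mode() rather than value_counts(); this is a swapped-in alternative, not the method from the answer above. It assumes every column has at least one non-null value, and the data and column names are invented.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "color": ["red", "blue", None, "red"],
    "qty": [1.0, 2.0, 2.0, np.nan],
})

# mode() ignores missing values and returns the most frequent value(s);
# iloc[0] picks the first one when there are ties
filled = df.apply(lambda col: col.fillna(col.mode().iloc[0]))
print(filled)
```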

How to update multi columns in pandas

I have a DataFrame with 5 columns: 3 are character type and the others are numeric. I want to replace the missing values in the character-type columns with "missing".
I have written an update statement like the one below, but it's not working.
df.select_dtypes(include='object') = df.select_dtypes(include='object').apply(lambda x: x.fillna('missing'))
It only works when I specify the column names.
df[['Manufacturer','Model','Type']] = df.select_dtypes(include='object').apply(lambda x: x.fillna('missing'))
Could you please tell me how I can correct my first update statement?
Here df.select_dtypes(include='object') returns a new DataFrame, so you cannot assign to it as in your first statement. A possible solution is DataFrame.update (which works in place); apply is also not necessary here.
print (df)
Manufacturer Model Type a c
0 a g NaN 4 NaN
1 NaN NaN aa 4 8.0
df.update(df.select_dtypes(include='object').fillna('missing'))
print (df)
Manufacturer Model Type a c
0 a g missing 4 NaN
1 missing missing aa 4 8.0
Or get the names of the string columns and assign back to them:
cols = df.select_dtypes(include='object').columns
df[cols] = df[cols].fillna('missing')
print (df)

Slicing and Setting Values in Pandas, with a composite of position and labels

I want to set a value in a specific cell in a pandas dataFrame.
I know which position the row is in (I can even get the row by using df.iloc[i], for example), and I know the name of the column, but I can't work out how to select the cell so that I can set a value to it.
df.loc[i,'columnName']=val
won't work because I want the row in position i, not labelled with index i. Also
df.iloc[i, 'columnName'] = val
obviously doesn't like being given a column name. So, short of converting to a dict and back, how do I go about this? Help very much appreciated, as I can't find anything that helps me in the pandas documentation.
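In older pandas versions you could use ix to set a specific cell (ix was deprecated in pandas 0.20 and removed in 1.0, so this no longer works on current versions):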
In [209]:
df = pd.DataFrame(np.random.randn(5,3), columns=list('abc'))
df
Out[209]:
a b c
0 1.366340 1.643899 -0.264142
1 0.052825 0.363385 0.024520
2 0.526718 -0.230459 1.481025
3 1.068833 -0.558976 0.812986
4 0.208232 0.405090 0.704971
In [210]:
df.ix[1,'b'] = 0
df
Out[210]:
a b c
0 1.366340 1.643899 -0.264142
1 0.052825 0.000000 0.024520
2 0.526718 -0.230459 1.481025
3 1.068833 -0.558976 0.812986
4 0.208232 0.405090 0.704971
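You can also call iloc on the column of interest, although this is chained assignment, which can raise SettingWithCopyWarning and may not modify df when copy-on-write is enabled: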
In [211]:
df['b'].iloc[2] = 0
df
Out[211]:
a b c
0 1.366340 1.643899 -0.264142
1 0.052825 0.000000 0.024520
2 0.526718 0.000000 1.481025
3 1.068833 -0.558976 0.812986
4 0.208232 0.405090 0.704971
You can get the position of the column with get_loc:
df.iloc[i, df.columns.get_loc('columnName')] = val
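A runnable sketch of the get_loc approach on a frame whose index labels differ from the row positions, which is the case the question is about. The data is made up, and the example assumes unique column and index labels. The second assignment shows an equivalent label-based form that translates the position into a label through df.index.

```python
import pandas as pd

# Index labels (10, 20, 30) deliberately differ from positions (0, 1, 2)
df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [4.0, 5.0, 6.0]},
                  index=[10, 20, 30])

i = 1  # row position, not label

# Positional row, column name translated to its position
df.iloc[i, df.columns.get_loc("b")] = 0.0

# Equivalent idea with loc: turn the row position into its label first
df.loc[df.index[2], "a"] = -1.0
print(df)
```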