Function to replace all NaN values with zero - pandas

I am trying to clean and fill out around 300 columns. I have already replaced all the empty fields with 'NaN', and now I am trying to convert those values to 0, if certain checks are passed:
NaN values need to be present in the column.
There cannot already exist 0 values in the column.
If 0 already exists, replace with 0.1 instead.
(I am still trying to figure out what to replace with, since 0 already contributes with relevant information for that particular column in the dataframe)
Thus far I have implemented:
def convert(df, col):
    if (df[col].isnull().sum() > 0): #& (df[df[col] != '0'])
        #if (df[df[col] != '0']):
        df[col].replace(np.NaN, '0', inplace = True)
for col in df.columns:
    convert(df, col)
But checking the second condition (no zeroes can already exist in the column) is not working. I tried to implement it (the commented-out part), but it returns the following error:
TypeError: Cannot perform 'rand_' with a dtyped [float64] array and scalar of type [bool]
On another note, regarding data science in general: I am not sure whether some of the columns should have their empty fields replaced by the column mean instead of 0. I have features describing weight, dimensions, prices etc.

Use a boolean mask.
Suppose the following dataframe:
>>> df
A B C
0 0.0 1 2.0
1 NaN 4 5.0 # <- NaN should be replaced by 0.1
2 6.0 7 NaN # <- NaN should be replaced by 0
m1 = df.isna().any() # Is there a NaN in columns (not mandatory)
m2 = df.eq(0).any() # Is there a 0 in columns
# Replace by 0
df.update(df.loc[:, m1 & ~m2].fillna(0))
# Replace by 0.1
df.update(df.loc[:, m1 & m2].fillna(0.1))
Only the second mask (m2) is strictly needed, since fillna leaves columns without NaN unchanged.
Output result:
>>> df
A B C
0 0.0 1 2.0
1 0.1 4 5.0
2 6.0 7 0.0
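As a compact variant of the answer above, the two update calls can be folded into a single fillna by deriving a per-column fill value from the zero mask. This is only a sketch and assumes that all columns are numeric; the names has_zero and fill_values are illustrative and not part of the original answer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [0.0, np.nan, 6.0],
                   "B": [1, 4, 7],
                   "C": [2.0, 5.0, np.nan]})

# True for every column that already contains a 0
has_zero = df.eq(0).any()

# Per-column fill value: 0.1 where a 0 already exists, otherwise 0
fill_values = pd.Series(np.where(has_zero, 0.1, 0.0), index=df.columns)

# fillna matches a Series against the column labels
filled = df.fillna(fill_values)
print(filled)
```

Columns without NaN are left untouched, so no separate "has NaN" mask is needed in this form.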

Related

How to format cells in a joined table?

Both tables that I merge have their cells formatted correctly, as numbers, but when I do a left join, the numbers from one of the original tables lose their formatting (they show up with e+ in them). What should I do to see those numbers in full?
Problem: When merging, some SKU values that appear in df1 do not appear in df2. In order to represent unavailable values, pandas automatically uses NaN, which is a floating point value. Thus, the integer ISBNs are converted to float. Given the size of the ISBNs, pandas then formats these floating point values in scientific notation.
You could solve this by defining your own floating point value formatter (pd.options.display.float_format), but in your case it might be easier / more effective to convert the ISBNs to a string before merging.
Example:
>>> import pandas as pd
>>> df1 = pd.DataFrame({"SKU": list("abcde"), "ISBN": list(range(1, 6))})
>>> df2 = pd.DataFrame({"SKU": list("bcef"), "ISBN": list(range(4, 8))})
Your problem:
>>> pd.merge(df1, df2, on="SKU", how="left")
SKU ISBN_x ISBN_y
0 a 1 NaN
1 b 2 4.0
2 c 3 5.0
3 d 4 NaN
4 e 5 6.0
>>> _.dtypes
SKU object
ISBN_x int64
ISBN_y float64 # <<< Problematic
vs possible solution:
>>> pd.merge(df1.astype(str), df2.astype(str), on="SKU", how="left")
SKU ISBN_x ISBN_y
0 a 1 NaN
1 b 2 4
2 c 3 5
3 d 4 NaN
4 e 5 6
>>> _.dtypes
SKU object
ISBN_x object
ISBN_y object
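For completeness, here is a rough sketch of two alternatives to the string conversion: the display formatter mentioned above, and pandas' nullable integer dtype "Int64", which the answer does not mention and is offered here only as an option. The ISBN values are made up for illustration.

```python
import pandas as pd

# Made-up 13-digit ISBNs, large enough to matter for float display
df1 = pd.DataFrame({"SKU": list("abcde"),
                    "ISBN": [9780000000001 + i for i in range(5)]})
df2 = pd.DataFrame({"SKU": list("bcef"),
                    "ISBN": [9780000000010 + i for i in range(4)]})

merged = pd.merge(df1, df2, on="SKU", how="left")

# Option 1: change only how floats are displayed (the data stays float64)
with pd.option_context("display.float_format", "{:.0f}".format):
    text = merged.to_string()
print(text)

# Option 2: convert to the nullable integer dtype, which can hold
# missing values (<NA>) without falling back to float
merged["ISBN_y"] = merged["ISBN_y"].astype("Int64")
print(merged.dtypes)
```

Option 1 only affects printing, while option 2 changes the stored dtype, so which one fits depends on whether the ISBNs are used in later computations or joins.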

Series.replace cannot use dict-like to_replace and non-None value [duplicate]

I've got a pandas DataFrame filled mostly with real numbers, but there are a few NaN values in it as well.
How can I replace the nans with averages of columns where they are?
This question is very similar to this one: numpy array: replace nan values with average of columns but, unfortunately, the solution given there doesn't work for a pandas DataFrame.
You can simply use DataFrame.fillna to fill the nan's directly:
In [27]: df
Out[27]:
A B C
0 -0.166919 0.979728 -0.632955
1 -0.297953 -0.912674 -1.365463
2 -0.120211 -0.540679 -0.680481
3 NaN -2.027325 1.533582
4 NaN NaN 0.461821
5 -0.788073 NaN NaN
6 -0.916080 -0.612343 NaN
7 -0.887858 1.033826 NaN
8 1.948430 1.025011 -2.982224
9 0.019698 -0.795876 -0.046431
In [28]: df.mean()
Out[28]:
A -0.151121
B -0.231291
C -0.530307
dtype: float64
In [29]: df.fillna(df.mean())
Out[29]:
A B C
0 -0.166919 0.979728 -0.632955
1 -0.297953 -0.912674 -1.365463
2 -0.120211 -0.540679 -0.680481
3 -0.151121 -2.027325 1.533582
4 -0.151121 -0.231291 0.461821
5 -0.788073 -0.231291 -0.530307
6 -0.916080 -0.612343 -0.530307
7 -0.887858 1.033826 -0.530307
8 1.948430 1.025011 -2.982224
9 0.019698 -0.795876 -0.046431
The docstring of fillna says that value should be a scalar or a dict; however, it works with a Series as well. If you want to pass a dict, you could use df.mean().to_dict().
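A minimal check of that statement, using a small made-up frame: passing the Series of column means and passing the equivalent dict should give the same result.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1.0, np.nan, 3.0],
                   "B": [np.nan, 4.0, 8.0]})

# Series of column means: values are matched to columns by label
by_series = df.fillna(df.mean())

# The same fill values passed as a plain dict {column: mean}
by_dict = df.fillna(df.mean().to_dict())

print(by_series)
```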
Try:
sub2['income'].fillna((sub2['income'].mean()), inplace=True)
In [16]: df = DataFrame(np.random.randn(10,3))
In [17]: df.iloc[3:5,0] = np.nan
In [18]: df.iloc[4:6,1] = np.nan
In [19]: df.iloc[5:8,2] = np.nan
In [20]: df
Out[20]:
0 1 2
0 1.148272 0.227366 -2.368136
1 -0.820823 1.071471 -0.784713
2 0.157913 0.602857 0.665034
3 NaN -0.985188 -0.324136
4 NaN NaN 0.238512
5 0.769657 NaN NaN
6 0.141951 0.326064 NaN
7 -1.694475 -0.523440 NaN
8 0.352556 -0.551487 -1.639298
9 -2.067324 -0.492617 -1.675794
In [22]: df.mean()
Out[22]:
0 -0.251534
1 -0.040622
2 -0.841219
dtype: float64
Fill the NaNs of each column with the mean of that column:
In [23]: df.apply(lambda x: x.fillna(x.mean()),axis=0)
Out[23]:
0 1 2
0 1.148272 0.227366 -2.368136
1 -0.820823 1.071471 -0.784713
2 0.157913 0.602857 0.665034
3 -0.251534 -0.985188 -0.324136
4 -0.251534 -0.040622 0.238512
5 0.769657 -0.040622 -0.841219
6 0.141951 0.326064 -0.841219
7 -1.694475 -0.523440 -0.841219
8 0.352556 -0.551487 -1.639298
9 -2.067324 -0.492617 -1.675794
Although the code below does the job, its performance takes a big hit once you deal with a DataFrame of 100k records or more:
df.fillna(df.mean())
In my experience, one should replace NaN values (be it with the mean or the median) only where it is required, rather than applying fillna() to the whole DataFrame.
I had a DataFrame with 20 variables, and only 4 of them required NaN treatment (replacement). I tried the above code (Code 1), along with a slightly modified version of it (Code 2), which I ran selectively, i.e. only on the variables that had a NaN value.
#------------------------------------------------
#----(Code 1) Treatment on overall DataFrame-----
df.fillna(df.mean())
#------------------------------------------------
#----(Code 2) Selective Treatment----------------
for i in df.columns[df.isnull().any(axis=0)]: #---Applying Only on variables with NaN values
    df[i].fillna(df[i].mean(),inplace=True)
#---df.isnull().any(axis=0) gives True/False flag (Boolean value series),
#---which when applied on df.columns[], helps identify variables with NaN values
Below is the performance I observed as I kept increasing the number of records in the DataFrame:
DataFrame with ~100k records
Code 1: 22.06 Seconds
Code 2: 0.03 Seconds
DataFrame with ~200k records
Code 1: 180.06 Seconds
Code 2: 0.06 Seconds
DataFrame with ~1.6 Million records
Code 1: code kept running endlessly
Code 2: 0.40 Seconds
DataFrame with ~13 Million records
Code 1: --did not even try, after seeing performance on 1.6 Mn records--
Code 2: 3.20 Seconds
Apologies for the long answer! Hope this helps!
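The selective idea can also be written without inplace=True, by assigning the filled columns back. The sketch below uses a small made-up frame and is not the code that produced the timings above; the name cols_with_nan is illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan, 3.0],
                   "b": [1.0, 2.0, 3.0],
                   "c": [np.nan, 4.0, 6.0]})

# Only the columns that actually contain at least one NaN
cols_with_nan = df.columns[df.isna().any()]

# Fill just those columns with their own means and assign the result back
df[cols_with_nan] = df[cols_with_nan].fillna(df[cols_with_nan].mean())
print(df)
```

Assigning back avoids relying on in-place modification of a column selection, which may not propagate to the parent DataFrame in newer pandas versions.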
If you want to impute missing values with mean and you want to go column by column, then this will only impute with the mean of that column. This might be a little more readable.
sub2['income'] = sub2['income'].fillna((sub2['income'].mean()))
import numpy as np
import pandas as pd
# To read data from csv file
Dataset = pd.read_csv('Data.csv')
X = Dataset.iloc[:, :-1].values
# Use the SimpleImputer class to fill missing values with the column mean
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer = imputer.fit(X[:, 1:3])
X[:, 1:3] = imputer.transform(X[:, 1:3])
Use df.fillna(df.mean()) directly to fill all null values with the mean of their column.
If you want to fill the null values of a single column with the mean of that column, you can use the following.
Suppose Item_Weight is the column name.
Here we fill the null values of the column with its mean and assign the result back to the column:
df['Item_Weight'] = df['Item_Weight'].fillna((df['Item_Weight'].mean()))
If you want to fill null values with some string instead, use the following.
Here Outlet_Size is the column name:
df.Outlet_Size = df.Outlet_Size.fillna('Missing')
Pandas: How to replace NaN (nan) values with the average (mean), median or other statistics of one column
Say your DataFrame is df and you have one column called nr_items, i.e. df['nr_items'].
If you want to replace the NaN values of your column df['nr_items'] with the mean of the column:
Use method .fillna():
mean_value=df['nr_items'].mean()
df['nr_item_ave']=df['nr_items'].fillna(mean_value)
I have created a new df column called nr_item_ave to store the new column with the NaN values replaced by the mean value of the column.
You should be careful when using the mean. If you have outliers, the median is usually the better choice.
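To illustrate the point about outliers, here is a small sketch with made-up numbers in which a single large value pulls the mean far away from the typical value, while the median stays close to it. The column name nr_items follows the answer; the data is invented.

```python
import numpy as np
import pandas as pd

# One outlier (100) among otherwise small values
df = pd.DataFrame({"nr_items": [1.0, 2.0, np.nan, 3.0, 100.0]})

mean_value = df["nr_items"].mean()      # (1 + 2 + 3 + 100) / 4 = 26.5
median_value = df["nr_items"].median()  # middle of 2 and 3 = 2.5

df["nr_item_ave"] = df["nr_items"].fillna(mean_value)
df["nr_item_med"] = df["nr_items"].fillna(median_value)
print(df)
```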
Another option besides those above is:
df = df.groupby(df.columns, axis = 1).transform(lambda x: x.fillna(x.mean()))
It's less elegant than the previous answers for the mean, but it could be shorter if you want to replace nulls using some other column function. Note that the axis=1 argument of groupby is deprecated in recent pandas versions.
Using the SimpleImputer class from the sklearn library:
import numpy as np
from sklearn.impute import SimpleImputer
# SimpleImputer always imputes column-wise and has no axis parameter
missingvalues = SimpleImputer(missing_values = np.nan, strategy = 'mean')
missingvalues = missingvalues.fit(x[:,1:3])
x[:,1:3] = missingvalues.transform(x[:,1:3])
Note: In recent versions, the value of the missing_values parameter changed from 'NaN' to np.nan.
I use this method to fill missing values by average of a column.
fill_mean = lambda col : col.fillna(col.mean())
df = df.apply(fill_mean, axis = 0)
You can also use value_counts to get the most frequent values. This would work on different datatypes.
df = df.apply(lambda x:x.fillna(x.value_counts().index[0]))
Here is the value_counts api reference.
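As a variation on the value_counts approach, the sketch below uses Series.mode instead (a different call than the one in the answer) and guards against columns that contain only NaN, where there is no most frequent value to pick. The helper name fill_most_frequent and the data are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"color": ["red", "red", None, "blue"],
                   "size": [1.0, np.nan, 1.0, 2.0]})

def fill_most_frequent(col):
    """Fill NaNs with the most frequent value, if the column has one."""
    modes = col.mode(dropna=True)
    if modes.empty:
        # Column holds only missing values: nothing sensible to fill with
        return col
    return col.fillna(modes.iloc[0])

filled = df.apply(fill_most_frequent)
print(filled)
```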

How to update multi columns in pandas

I have a DataFrame with 5 columns. 3 columns are character type and the others are numeric. I want to replace the missing values in the character-type columns with "missing".
I have written an update statement like the one below, but it's not working.
df.select_dtypes(include='object') = df.select_dtypes(include='object').apply(lambda x: x.fillna('missing'))
It only works when I specify the column names.
df[['Manufacturer','Model','Type']] = df.select_dtypes(include='object').apply(lambda x: x.fillna('missing'))
Could you please tell me how I can correct my first update statement?
Here df.select_dtypes(include='object') returns a new DataFrame, so you cannot assign to it as in the first statement. A possible solution is to use DataFrame.update (which works in place); apply is also not necessary here.
print (df)
Manufacturer Model Type a c
0 a g NaN 4 NaN
1 NaN NaN aa 4 8.0
df.update(df.select_dtypes(include='object').fillna('missing'))
print (df)
Manufacturer Model Type a c
0 a g missing 4 NaN
1 missing missing aa 4 8.0
Or get the names of the string columns first:
cols = df.select_dtypes(include='object').columns
df[cols] = df[cols].fillna('missing')
print (df)
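A self-contained version of the column-name approach could look like the sketch below. The sample data mirrors the printed frame above; the explicit dtype=object is an assumption made so that select_dtypes(include='object') picks up the text columns regardless of how the installed pandas version infers string columns.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    # dtype=object is set explicitly so the columns match include="object"
    "Manufacturer": pd.Series(["a", None], dtype=object),
    "Model": pd.Series(["g", None], dtype=object),
    "Type": pd.Series([None, "aa"], dtype=object),
    "a": [4, 4],
    "c": [np.nan, 8.0],
})

# Names of the object (text) columns only
cols = df.select_dtypes(include="object").columns

# Fill missing text values; numeric columns such as "c" are left alone
df[cols] = df[cols].fillna("missing")
print(df)
```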

Substitute some values from a field into another

I'd like to substitute some values from one field into another. For instance:
Let's say I have a pandas.DataFrame object named df (yep, very original). It has several columns, but some of them are relevant and cannot be empty.
I noticed that some of the values had been stored in another field. Let's say field1 is a relevant field and field2 is not. I have about a thousand records, and the number grows every week when I get new data. As I love to automate things, I first check for these possible values:
idx = df[df.field1.isna() & df.field2.notna()].index
Then I tried to replace them:
df.loc[idx, ['field1']] = df.loc[idx, ['field2']]
But when I look at the result, nothing has changed... why? I can make substitutions this way with a single value, but not when the values differ.
df.loc[idx, ['field1']] = "Not empty any longer" # This will work
I can't figure out how to achieve this in a ... good way? I mean, I don't want to check it manually. It doesn't matter that there are only 50 of them now; I have to do the same with other fields, and I may get more like this (and I will).
Thanks!
Try this: df.loc[idx, ['field1']] = df.loc[idx, ['field2']].values
Example:
# The None in 'field1' should be replaced by the 'field2' value
df = pd.DataFrame({'field1':[1,2,3,None,5], 'field2':[6,7,8,8,None]})
idx = df[df.field1.isna() & df.field2.notna()].index
df.loc[idx, ['field1']] = df.loc[idx, ['field2']].values
Original dataframe:
df
field1 field2
0 1.0 6.0
1 2.0 7.0
2 3.0 8.0
3 NaN 8.0
4 5.0 NaN
Modified df:
df
field1 field2
0 1.0 6.0
1 2.0 7.0
2 3.0 8.0
3 8.0 8.0
4 5.0 NaN
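A likely explanation for the original behaviour (not stated in the answer) is label alignment: when the right-hand side is a DataFrame, pandas aligns it on column labels as well as on the index, and a frame that only has a field2 column has nothing to offer for field1. Using .values drops the labels. Two label-friendly alternatives are sketched below: assigning a Series, which aligns on the row index only, or using fillna with the other column.

```python
import pandas as pd

df = pd.DataFrame({"field1": [1, 2, 3, None, 5],
                   "field2": [6, 7, 8, 8, None]})

# Rows where field1 is missing but field2 has a value
mask = df["field1"].isna() & df["field2"].notna()

# A Series on the right-hand side is aligned on the row index only
df.loc[mask, "field1"] = df.loc[mask, "field2"]
print(df)

# Equivalent one-liner on a fresh copy of the data
df2 = pd.DataFrame({"field1": [1, 2, 3, None, 5],
                    "field2": [6, 7, 8, 8, None]})
df2["field1"] = df2["field1"].fillna(df2["field2"])
```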

issue with pandas and semilog for boxplot

I have a pandas dataframe that has columns:
'video' and 'link' of click values
with an index of datetime. For some reason, when I use semilogy and boxplot with the video series, I get the error
ValueError: Data has no positive values, and therefore can not be log-scaled.
but when I do it on the 'link' series I can draw the boxplot correctly.
I have verified that both the 'video' and 'link' series have NaN values and positive values.
Any thoughts on why this is occurring? Below is the sample code I used to verify this:
# get all the non-null values of video to show that there are positive values
temp = a.types_pivot[a.types_pivot['video'].notnull()]
print(temp)
# count the NaN values to show that both 'video' and 'link' have NaN
count = 0
for item in a.types_pivot['video']:
    if item.is_integer() == False:
        count += 1
print("there is %s nan values in video" % (count))
# try to draw the plots
fig = plt.figure(figsize=(6,6), dpi=50)
ax = fig.add_subplot(111)
ax.semilogy()
plt.boxplot(a.types_pivot['video'].values)
Here is relevant output from the code for video series
type link video
created_time
2011-02-10 15:00:51+00:00 NaN 5
2011-02-17 17:50:38+00:00 NaN 5
2011-03-22 14:04:56+00:00 NaN 5
there is 5463 nan values in video
I run the same exact code except I do
a.types_pivot['link']
and I am able to draw the boxplot.
Below is the relevant output from the link series
Index: 5269 entries, 2011-01-24 20:03:58+00:00 to 2012-06-22 16:56:30+00:00
Data columns:
link 5269 non-null values
photo 0 non-null values
question 0 non-null values
status 0 non-null values
swf 0 non-null values
video 0 non-null values
dtypes: float64(6)
there is 216 nan values in link
Using the describe function
a.types_pivot['video'].describe()
count 22.000000
mean 16.227273
std 15.275040
min 1.000000
25% 5.250000
50% 9.500000
75% 23.000000
max 58.000000
Note: I'm unable to upload images due to some issue with imgur. I'll try again later.
Take advantage of pandas' matplotlib wrappers by calling pd.DataFrame.boxplot(). I believe this will take care of the NaN values for you. It will also put both Series in the same plot so you can easily compare the data.
Example
Create a dataframe with some NaN values and negative values
In [7]: df = pd.DataFrame(np.random.rand(10, 5))
In [8]: df.loc[2:4,3] = np.nan
In [9]: df.loc[2:3,4] = -0.45
In [10]: df
Out[10]:
0 1 2 3 4
0 0.391882 0.776331 0.875009 0.350585 0.154517
1 0.772635 0.657556 0.745614 0.725191 0.483967
2 0.057269 0.417439 0.861274 NaN -0.450000
3 0.997749 0.736229 0.084077 NaN -0.450000
4 0.886303 0.596473 0.943397 NaN 0.816650
5 0.018724 0.459743 0.472822 0.598056 0.273341
6 0.894243 0.097513 0.691781 0.802758 0.785258
7 0.222901 0.292646 0.558909 0.220400 0.622068
8 0.458428 0.039280 0.670378 0.457238 0.912308
9 0.516554 0.445004 0.356060 0.861035 0.433503
Note that I can count the number of NaN values like so:
In [14]: df[3].isnull().sum() # Count NaNs in the 4th column
Out[14]: 3
A box plot is simply:
In [16]: df.boxplot()
You could create a semi-log boxplot, for example, by:
In [23]: np.log(df).boxplot()
Or, more generally, modify / transform to your heart's content, and then boxplot.
In [24]: df_mod = np.log(df).dropna()
In [25]: df_mod.boxplot()
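If a true log-scaled axis is preferred over plotting log-transformed values, one option is to drop the NaN and non-positive values yourself before handing the data to matplotlib. The sketch below assumes (this is not confirmed in the thread) that the NaN values passed to plt.boxplot were what triggered the error; the video values are made up.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Made-up click counts with gaps, standing in for the 'video' series
video = pd.Series([5, np.nan, 5, 12, np.nan, 58, 1], name="video")

# Keep only values that a log axis can represent
clean = video.dropna()
clean = clean[clean > 0]

fig, ax = plt.subplots(figsize=(6, 6))
ax.boxplot(clean.values)
ax.set_yscale("log")  # log-scale the y axis after plotting the cleaned data
```

Unlike np.log(df).boxplot(), this keeps the tick labels in the original units.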