I was wondering if there's a native method to skip NaNs in a lambda function.
I have a dataframe 'y' in the form below. I'm attempting to turn the Year column into ints, but the lambda function breaks because of the NaN. I've come up with the workaround below, but I'm wondering if there are better ways to deal with this pervasive issue. Thanks!
Year
137 2005
138 NaN
To deal with it, I just used try/except. I wonder if there's a better way to deal with NaNs.
import numpy as np

def turn_int(x):
    try:
        return int(x)
    except:
        return np.nan

y.Year.apply(lambda x: turn_int(x))
int doesn't have a representation of NaN. The normal way to deal with it would be to drop all the NaNs first:
year = y.Year.dropna().astype(int)
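If you'd rather keep the NaN rows, a minimal alternative (a sketch, assuming a reasonably recent pandas) is the nullable integer dtype, which stores missing values as <NA> instead of forcing the column to float:

import numpy as np
import pandas as pd

# Reconstruct the example frame from the question.
y = pd.DataFrame({'Year': [2005.0, np.nan]}, index=[137, 138])

# 'Int64' (capital I) is pandas' nullable integer dtype; the NaN becomes <NA>.
y['Year'] = y['Year'].astype('Int64')
print(y)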
I have done this for int series.
import numpy as np
y['year'] = y['year'].apply(lambda x: x if np.isnan(x) else int(x))
I have a table with a column named "price". This column is of type object, so it contains numbers stored as strings as well as NaN and '?' characters. I want to find the mean of this column, but first I have to remove the NaN and '?' values and convert it to float.
I am using the following code:
import pandas as pd
import numpy as np
df = pd.read_csv('Automobile_data.csv', sep = ',')
df = df.dropna('price', inplace=True)
df['price'] = df['price'].astype('int')
df['price'].mean()
But, this doesn't work. The error says:
ValueError: No axis named price for object type DataFrame
How can I solve this problem?
edit: in pandas version 1.3 and earlier, you need subset=[col] wrapped in a list/array. In version 1.4 and greater you can pass a single column as a string.
You've got a few problems:
1. df.dropna() takes the axis first and then the subset: the axis is rows/columns, and the subset is which labels to look at. So you want this to be (I think) df.dropna(axis='rows', subset='price').
2. Using inplace=True makes the call return None, so you have set df = None. You don't want that. If you do use inplace=True, don't assign the result; the whole line would just be df.dropna(..., inplace=True).
3. Better yet, don't use inplace=True at all; just do the assignment: df = df.dropna(axis='rows', subset='price').
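Putting those fixes together, and also handling the '?' placeholders mentioned in the question, a minimal sketch (the file and column names are taken from the question) could look like this:

import numpy as np
import pandas as pd

df = pd.read_csv('Automobile_data.csv', sep=',')

# Turn '?' into NaN and coerce everything else to a numeric dtype;
# errors='coerce' also converts any other non-numeric leftovers to NaN.
df['price'] = pd.to_numeric(df['price'].replace('?', np.nan), errors='coerce')

# Drop rows with a missing price, assigning the result instead of using inplace=True.
df = df.dropna(subset=['price'])

print(df['price'].mean())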
How can one modify the format for the output from a groupby operation in pandas that produces scientific notation for very large numbers?
I know how to do string formatting in python but I'm at a loss when it comes to applying it here.
df1.groupby('dept')['data1'].sum()
dept
value1 1.192433e+08
value2 1.293066e+08
value3 1.077142e+08
This suppresses the scientific notation if I convert to string but now I'm just wondering how to string format and add decimals.
sum_sales_dept.astype(str)
Granted, the answer I linked in the comments is not very helpful. You can specify your own string converter like so.
In [25]: pd.set_option('display.float_format', lambda x: '%.3f' % x)
In [28]: Series(np.random.randn(3))*1000000000
Out[28]:
0 -757322420.605
1 -1436160588.997
2 -1235116117.064
dtype: float64
I'm not sure if that's the preferred way to do this, but it works.
Converting numbers to strings purely for aesthetic purposes seems like a bad idea, but if you have a good reason, this is one way:
In [6]: Series(np.random.randn(3)).apply(lambda x: '%.3f' % x)
Out[6]:
0 0.026
1 -0.482
2 -0.694
dtype: object
Here is another way of doing it, similar to Dan Allan's answer but without the lambda function:
>>> pd.options.display.float_format = '{:.2f}'.format
>>> Series(np.random.randn(3))
0 0.41
1 0.99
2 0.10
or
>>> pd.set_option('display.float_format', '{:.2f}'.format)
You can use the round function just to suppress scientific notation for a specific dataframe:
df1.round(4)
or you can suppress it globally by:
pd.options.display.float_format = '{:.4f}'.format
If you want to style the output of a data frame in a jupyter notebook cell, you can set the display style on a per-dataframe basis:
df = pd.DataFrame({'A': np.random.randn(4)*1e7})
df.style.format("{:.1f}")
See the documentation here.
Setting a fixed number of decimal places globally is often a bad idea, since it is unlikely to be an appropriate number of decimal places for all of the various data you will display, regardless of magnitude. Instead, try this, which gives you scientific notation only for very large and very small values (and adds a thousands separator unless you omit the ','):
pd.set_option('display.float_format', lambda x: '{:,g}'.format(x))
Or to almost completely suppress scientific notation without losing precision, try this:
pd.set_option('display.float_format', str)
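If you do change the option globally while experimenting, it can be put back to the default afterwards:

import pandas as pd

# Restore pandas' default float formatting.
pd.reset_option('display.float_format')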
I had multiple dataframes with different floating-point precision, so, thanks to Dan Allan's idea, I made the length dynamic:
pd.set_option('display.float_format', lambda x: f'%.{len(str(x%1))-2}f' % x)
The downside is that a trailing 0 in the float gets cut, so you get 0.00007 rather than 0.000070.
Expanding on this useful comment, here is a solution setting the formatting options only to display the results without changing options permanently:
with pd.option_context('display.float_format', lambda x: f'{x:,.3f}'):
    display(sum_sales_dept)
dept
value1 119,243,300.0
value2 129,306,600.0
value3 107,714,200.0
If you would like to use the values themselves, say when writing a CSV file with csv.writer, the numbers can be formatted before creating the list:
df['label'].apply(lambda x: '%.17f' % x).values.tolist()
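For example, the formatted values could then be written out with csv.writer (a sketch with a hypothetical output file and made-up values for the 'label' column):

import csv
import pandas as pd

df = pd.DataFrame({'label': [1.192433e+08, 1.293066e+08]})  # made-up values

rows = df['label'].apply(lambda x: '%.17f' % x).values.tolist()

with open('out.csv', 'w', newline='') as f:  # 'out.csv' is a hypothetical path
    writer = csv.writer(f)
    writer.writerow(['label'])
    writer.writerows([[v] for v in rows])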
I'm using numpy to get the median. The dataframe has two variables. Is there a way to tell it which variable I want the median for?
np.median(dataframename)
You must cast your dataframe to a numpy array. Try this:
#input data in dataframename
dataframename = np.asarray(dataframename)
dataframename = dataframename.astype(float)
np.median(dataframename)
I realized that my data was not in a dataframe. Once I put it in, this worked.
dataframename.loc[:,"var18"].median()
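If you do want to stay with numpy, one option (a sketch reusing the column name from the answer above) is to pull out the single column, drop the missing values, and only then hand it to np.median:

import numpy as np

np.median(dataframename['var18'].dropna().to_numpy())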
I have two numpy arrays, one of which contains about 1% NaNs.
import numpy as np

a = np.array([-2, 5, np.nan, 6])
b = np.array([2, 3, 1, 0])
I'd like to compute the mean squared error of a and b using sklearn's mean_squared_error.
So my question is, what's the pythonic way of removing all NaNs from a while at the same time deleting all corresponding entries from b as efficiently as possible?
You can simply use vanilla NumPy's np.nanmean for this purpose:
In [136]: np.nanmean((a-b)**2)
Out[136]: 18.666666666666668
If this didn't exist, or you really wanted to use the sklearn method, you could create a mask to index the NaNs:
In [148]: mask = ~np.isnan(a)
In [149]: mean_squared_error(a[mask], b[mask])
Out[149]: 18.666666666666668
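If NaNs could occur in either array, the same masking idea generalizes; here's a sketch (not part of the original answer) using the example data from the question:

import numpy as np
from sklearn.metrics import mean_squared_error

a = np.array([-2, 5, np.nan, 6], dtype=float)
b = np.array([2, 3, 1, 0], dtype=float)

# Keep only positions that are not NaN in either array.
mask = ~np.isnan(a) & ~np.isnan(b)
print(mean_squared_error(a[mask], b[mask]))  # 18.666...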
I use the following function to encode the categorical features of my dataset (it has 27 features, 11 of which are categorical):
from sklearn import preprocessing

def features_encoding(data):
    columnsToEncode = list(data.select_dtypes(include=['category', 'object']))
    le = preprocessing.LabelEncoder()
    for feature in columnsToEncode:
        try:
            data[feature] = le.fit_transform(data[feature])
        except:
            continue
    return data
But I get this error:
FutureWarning: numpy not_equal will not check object identity in the future. The comparison did not return the same result as suggested by the identity (`is`)) and will change.
flag = np.concatenate(([True], aux[1:] != aux[:-1]))
I don't understand this error. Can someone kindly explain what it is about and how to fix it?
This is almost certainly caused by np.nan appearing more than once in an array of dtype=object that is passed into np.unique.
This may help clarify what's going on:
>>> np.nan is np.nan
True
>>> np.nan == np.nan
False
>>> np.array([np.nan], dtype=object) == np.array([np.nan], dtype=object)
FutureWarning: numpy equal will not check object identity in the future. The comparison did not return the same result as suggested by the identity (`is`)) and will change.
array([ True], dtype=bool)
So when comparing two arrays of dtype=object, numpy checks whether the comparison function returns False even when both objects being compared are the exact same object. Right now it assumes that every object compares equal to itself, but that will change at some point in the future.
All in all, it's just a warning, so you can ignore it, at least for now...
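If you'd rather silence the warning at the source, one common workaround (a sketch, assuming it's acceptable to treat missing values as their own category) is to convert each column to strings before encoding, so every NaN becomes the single string 'nan' and np.unique never has to compare two distinct nan objects:

from sklearn import preprocessing

def features_encoding(data):
    columns_to_encode = list(data.select_dtypes(include=['category', 'object']))
    le = preprocessing.LabelEncoder()
    for feature in columns_to_encode:
        # astype(str) maps NaN to the literal string 'nan', which compares
        # equal to itself, so the FutureWarning is never triggered.
        data[feature] = le.fit_transform(data[feature].astype(str))
    return data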