How to change a string to NaN when applying astype? - pandas

I have a column in a dataframe that holds integers, like [1, 2, 3, 4, 5, 6, ...].
My problem: one of the values in this column is a string, e.g. [1, 2, 3, 2, 3, 'hello form France', 1, 2, 3],
so the dtype of the column is object.
I want to cast it to float with column.astype(float), but I get an error because of that string.
The column has over 10,000 records and only this one contains a string. How can I cast to float and turn this string into NaN, for example?

You can use pd.to_numeric with errors='coerce':
import pandas as pd
df = pd.DataFrame({
    'all_nums': range(5),
    'mixed': [1, 2, 'woo', 4, 5],
})
df['mixed'] = pd.to_numeric(df['mixed'], errors='coerce')
df.head()
Before: mixed is [1, 2, 'woo', 4, 5] with dtype object.
After: mixed is [1.0, 2.0, NaN, 4.0, 5.0] with dtype float64.
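If you also want to see which rows held the unparseable values before overwriting them (a quick sketch, using the column name mixed from above), compare the coerced result with the original:

```python
import pandas as pd

df = pd.DataFrame({'mixed': [1, 2, 'hello form France', 4, 5]})

coerced = pd.to_numeric(df['mixed'], errors='coerce')
# Rows that were non-null but failed to parse:
bad = df.loc[coerced.isna() & df['mixed'].notna(), 'mixed']
print(bad)  # 2    hello form France

df['mixed'] = coerced  # now float64, with NaN in place of the string
```

This lets you inspect (or log) the offending records in a 10,000-row column before discarding them.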

Related

Dealing with greater than and less than values in numeric data when reading csv in pandas

My csv file contains numeric data where some values have greater-than or less-than symbols, e.g. '>244'. I want the data type to be float. When reading the file into pandas:
df = pd.read_csv('file.csv')
I get a warning:
Columns (2) have mixed types. Specify dtype option on import or set low_memory=False.
I have checked this question: Pandas read_csv: low_memory and dtype options, and tried specifying the data type of the relevant column with:
df = pd.read_csv('file.csv',dtype={'column':'float'})
However, this gives an error:
ValueError: could not convert string to float: '>244'
I have also tried
df = pd.read_csv('file.csv', dtype={'column':'float'}, error_bad_lines=False)
However, this does not solve the problem, and I get the same error as above. (Note that error_bad_lines has since been deprecated in favour of on_bad_lines; it only skips malformed lines anyway, not lines that fail type conversion.)
My problem appears to be that my data has a mixture of string and floats. Can I ignore any rows containing strings in particular columns when reading in the data?
You can use:
df = pd.read_csv('file.csv', dtype={'column':'str'})
Then:
df['column'] = pd.to_numeric(df['column'], errors='coerce')
I found a workaround: first read in the data without a dtype:
df = pd.read_csv('file.csv')
Then drop any rows whose value starts with '<' or '>':
df = df.loc[df['column'].str[:1] != '<']
df = df.loc[df['column'].str[:1] != '>']
Then convert to numeric with pd.to_numeric
df['column'] = pd.to_numeric(df['column'])
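If the '>244' readings should be kept (as 244) rather than dropped, another option is to strip the comparison symbols before converting; whether treating '>244' as 244 is acceptable depends on your data. A sketch with a made-up file:

```python
import pandas as pd
from io import StringIO

csv = StringIO("column\n10\n>244\n<5\n37\n")
df = pd.read_csv(csv, dtype={'column': 'str'})

# Strip any leading '<' or '>' characters, then convert.
# Values that still fail to parse would become NaN (making the column float).
df['column'] = pd.to_numeric(df['column'].str.lstrip('<>'), errors='coerce')
print(df['column'].tolist())  # [10, 244, 5, 37]
```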

Why does astype not convert to int64 in pandas?

I am trying to convert a column's type to int64:
new_df.astype({'NUM': 'int64'})
After df.info() I still see this:
0 NUM 10 non-null object
Why?
The type casting is not done in place: DataFrame.astype returns a new DataFrame with the requested types, so you have to reassign the result to new_df.
new_df = new_df.astype({'NUM': 'int64'})
print(new_df.info())
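A minimal reproduction of the pitfall and the fix:

```python
import pandas as pd

new_df = pd.DataFrame({'NUM': ['1', '2', '3']})  # object dtype

new_df.astype({'NUM': 'int64'})                  # result is discarded
print(new_df['NUM'].dtype)                       # still object

new_df = new_df.astype({'NUM': 'int64'})         # reassign the result
print(new_df['NUM'].dtype)                       # int64
```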

How can I get the mean of a string-typed column in a pandas DataFrame?

I have a pandas DataFrame:
df = pd.read_csv('police.csv', parse_dates=['stop_date'])
df[['stop_date', 'violation_raw', 'stop_duration']]
I want to get the mean of "stop_duration" for each "violation_raw". How can I do that when "stop_duration" is of object dtype?
Use the to_datetime function to convert the object column to datetime, specifying a format that matches your data:
import pandas as pd
df["column"] = pd.to_datetime(df["column"], format="%M-%S Min")
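If the goal is a numeric mean per group, another route is to map each duration label to a representative number of minutes and then group. This is a sketch assuming the column holds range labels like '0-15 Min' (the usual form in the police-stops dataset); the midpoint values chosen here are an assumption you should adjust:

```python
import pandas as pd

df = pd.DataFrame({
    'violation_raw': ['Speeding', 'Speeding', 'Equipment'],
    'stop_duration': ['0-15 Min', '16-30 Min', '0-15 Min'],
})

# Map each duration label to an assumed representative value in minutes.
minutes = {'0-15 Min': 8, '16-30 Min': 23, '30+ Min': 45}
df['stop_minutes'] = df['stop_duration'].map(minutes)

print(df.groupby('violation_raw')['stop_minutes'].mean())
# Equipment     8.0
# Speeding     15.5
```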

Pandas Interpolation: {ValueError}Invalid fill method. Expecting pad (ffill) or backfill (bfill). Got linear

I am trying to interpolate time series data, df, which looks like:
id data lat notes analysis_date
0 17358709 NaN 26.125979 None 2019-09-20 12:00:00+00:00
1 17358709 NaN 26.125979 None 2019-09-20 12:00:00+00:00
2 17352742 -2.331365 26.125979 None 2019-09-20 12:00:00+00:00
3 17358709 -4.424366 26.125979 None 2019-09-20 12:00:00+00:00
I try:
df.groupby(['lat', 'lon']).apply(lambda group: group.interpolate(method='linear'))
and it throws:
ValueError: Invalid fill method. Expecting pad (ffill) or backfill (bfill). Got linear
I suspect the issue is with the fact that I have None values, and I do not want to interpolate those. What is the solution?
df.dtypes gives me:
id int64
data float64
lat float64
notes object
analysis_date datetime64[ns, psycopg2.tz.FixedOffsetTimezone...
dtype: object
DataFrame.interpolate has issues with timezone-aware datetime64[ns, tz] columns, which leads to that rather cryptic error message. E.g.
import pandas as pd
df = pd.DataFrame({'time': pd.to_datetime(['2010', '2011', 'foo', '2012', '2013'],
                                          errors='coerce')})
df['time'] = df.time.dt.tz_localize('UTC').dt.tz_convert('Asia/Kolkata')
df.interpolate()
ValueError: Invalid fill method. Expecting pad (ffill) or backfill
(bfill). Got linear
In this case interpolating that column is unnecessary, so only interpolate the column you need. We still want DataFrame.interpolate, so select with [[ ]] (Series.interpolate leads to some odd reshaping):
df['data'] = df.groupby(['lat', 'lon']).apply(lambda x: x[['data']].interpolate())
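The same fix can also be written with groupby(...).transform, which keeps index alignment straightforward. A sketch with made-up coordinates, touching only the numeric data column:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'lat':  [26.1, 26.1, 26.1, 26.1],
    'lon':  [50.0, 50.0, 50.0, 50.0],
    'data': [1.0, np.nan, np.nan, 4.0],
})

# Interpolate within each (lat, lon) group; transform returns a Series
# aligned with the original index, so assignment is safe.
df['data'] = df.groupby(['lat', 'lon'])['data'].transform(
    lambda s: s.interpolate(method='linear'))
print(df['data'].tolist())  # [1.0, 2.0, 3.0, 4.0]
```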
This error happens because one of the columns you are interpolating is of object data type. Interpolating works only for numerical data types such as integer or float.
If you need to interpolate an object or categorical column, first convert it to a numeric type by encoding it. The following piece of code will resolve your problem:
from sklearn.preprocessing import LabelEncoder

notes_encoder = LabelEncoder()
df['notes'] = notes_encoder.fit_transform(df['notes'])
After doing this, check the column's data type. It should be int. If it is categorical, then change its type to int using the following code:
df['notes']=df['notes'].astype('int32')

How do I replace all NaNs in a pandas dataframe with the string "None"

I have a dataframe and some of its values are empty (NaN). I want to turn them into the string 'None', which is easier for me to parse than a NaN value.
df = df.replace(np.nan, 'None', regex=True)
Use the code above. (The regex=True flag is not actually needed here, since np.nan is not a regular-expression pattern.)
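An equivalent, slightly shorter form is DataFrame.fillna. Note that a float column receiving the string 'None' becomes object dtype either way:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, np.nan], 'b': ['x', np.nan]})

# fillna replaces every NaN in the frame with the given value.
df = df.fillna('None')
print(df['b'].tolist())  # ['x', 'None']
print(df['a'].tolist())  # [1.0, 'None']
```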