Pandas throws ValueError while converting to datetime from string - pandas

I have an Excel sheet with a column that is supposed to contain date values, but pandas reads it as float64. It also has blanks.
df:
date_int
15022016
23072017
I want to convert to a datetime object. I do:
df['date_int1'] = df['date_int'].astype(str).fillna('01011900')  # to fill the blanks
df['date_int2'] = pd.to_datetime(df['date_int1'], format='%d%m%Y')
I get an error while converting to datetime:
TypeError: Unrecognized value type: <class 'str'>
ValueError: unconverted data remains: .0

You shouldn't convert to string until you've filled the NaNs. Otherwise, the NaNs are also stringified, and at that point there is nothing left to fill.
df
date_int
0 15022016.0
1 23072017.0
2 NaN
df['date_int'] = df['date_int'].fillna(1011900, downcast='infer').astype(str)
pd.to_datetime(df['date_int'], format='%d%m%Y', errors='coerce')
0 2016-02-15
1 2017-07-23
2 1900-01-10
Name: date_int, dtype: datetime64[ns]
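To see why the original order fails, a quick sketch: `astype(str)` turns each NaN into the literal string `'nan'` and keeps the `.0` suffix on the floats, so a later `fillna` finds nothing to fill.

```python
import numpy as np
import pandas as pd

s = pd.Series([15022016.0, np.nan])

# astype(str) turns NaN into the literal string 'nan' and keeps the '.0' suffix
stringified = s.astype(str)
print(stringified.tolist())  # ['15022016.0', 'nan']

# fillna now has nothing to fill: 'nan' is an ordinary string, not a missing value
print(stringified.fillna('01011900').tolist())  # unchanged
```

The stray `.0` in those strings is also exactly what the `unconverted data remains: .0` error is complaining about.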

See the comment from @Wen-Ben. Convert the data to int first.
df.date_int = df.date_int.astype(int)
Then the rest of the code will work fine.
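A minimal sketch of that route (assuming the blanks are filled before the cast, since a plain `astype(int)` raises on NaN):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'date_int': [15022016.0, 23072017.0, np.nan]})

# Fill blanks first, then cast to int so the stringified values lose the '.0' suffix
df['date_int'] = df['date_int'].fillna(1011900).astype(int)
df['date_int2'] = pd.to_datetime(df['date_int'].astype(str),
                                 format='%d%m%Y', errors='coerce')
print(df['date_int2'].iloc[0])  # 2016-02-15 00:00:00
```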

Related

How to change a string to NaN when applying astype?

I have a column in a dataframe that has integers like: [1,2,3,4,5,6, etc.]
My problem: one of the fields in this column contains a string, like this: [1,2,3,2,3,'hello form France',1,2,3]
The dtype of this column is object.
I want to cast it to float with column.astype(float), but I get an error because of that string.
The column has over 10,000 records, and only this one record is a string. How can I cast to float and change this string to NaN, for example?
You can use pd.to_numeric with errors='coerce'
import pandas as pd
df = pd.DataFrame({
    'all_nums': range(5),
    'mixed': [1, 2, 'woo', 4, 5],
})
df['mixed'] = pd.to_numeric(df['mixed'], errors='coerce')
df.head()
Before: 'mixed' is an object column containing the string 'woo'.
After: 'woo' has become NaN and 'mixed' is float64.
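As a follow-up sketch (not part of the original answer): coercing into a separate variable first lets you locate the offending raw values before overwriting the column.

```python
import pandas as pd

df = pd.DataFrame({
    'all_nums': range(5),
    'mixed': [1, 2, 'woo', 4, 5],
})

# Coerce into a separate variable so the original strings are still available
numeric = pd.to_numeric(df['mixed'], errors='coerce')

# Rows where coercion produced NaN held the non-numeric values
bad = df.loc[numeric.isna(), 'mixed']
print(bad.tolist())  # ['woo']

df['mixed'] = numeric  # now float64, with NaN where 'woo' was
```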

Pandas Interpolation: {ValueError}Invalid fill method. Expecting pad (ffill) or backfill (bfill). Got linear

I am trying to interpolate time series data, df, which looks like:
id data lat notes analysis_date
0 17358709 NaN 26.125979 None 2019-09-20 12:00:00+00:00
1 17358709 NaN 26.125979 None 2019-09-20 12:00:00+00:00
2 17352742 -2.331365 26.125979 None 2019-09-20 12:00:00+00:00
3 17358709 -4.424366 26.125979 None 2019-09-20 12:00:00+00:00
I try: df.groupby(['lat', 'lon']).apply(lambda group: group.interpolate(method='linear')), and it throws {ValueError}Invalid fill method. Expecting pad (ffill) or backfill (bfill). Got linear
I suspect the issue is with the fact that I have None values, and I do not want to interpolate those. What is the solution?
df.dtypes gives me:
id int64
data float64
lat float64
notes object
analysis_date datetime64[ns, psycopg2.tz.FixedOffsetTimezone...
dtype: object
DataFrame.interpolate has issues with timezone-aware datetime64[ns] columns, which leads to that rather cryptic error message. For example:
import pandas as pd
df = pd.DataFrame({'time': pd.to_datetime(['2010', '2011', 'foo', '2012', '2013'],
                                          errors='coerce')})
df['time'] = df.time.dt.tz_localize('UTC').dt.tz_convert('Asia/Kolkata')
df.interpolate()
ValueError: Invalid fill method. Expecting pad (ffill) or backfill
(bfill). Got linear
In this case interpolating that column is unnecessary, so only interpolate the column you need. We still want DataFrame.interpolate, so select with [[ ]] (Series.interpolate leads to some odd reshaping):
df['data'] = df.groupby(['lat', 'lon']).apply(lambda x: x[['data']].interpolate())
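An alternative sketch (assuming the goal is to interpolate every numeric column while leaving the tz-aware and object columns untouched): pick out the numeric columns with `select_dtypes` and interpolate only those. The sample data here is reconstructed to mirror the question's frame.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'data': [-2.33, np.nan, np.nan, -4.42],
    'lat': [26.125979] * 4,
    'notes': [None] * 4,
    'analysis_date': pd.date_range('2019-09-20 12:00', periods=4, tz='UTC'),
})

# Interpolate only the numeric columns; the tz-aware column never sees interpolate()
num_cols = df.select_dtypes('number').columns
df[num_cols] = df[num_cols].interpolate(method='linear')
print(df['data'].tolist())
```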
This error happens because one of the columns you are interpolating is of object data type. Interpolating works only for numerical data types such as integer or float.
If you need to use interpolating for an object or categorical data type, then first convert it to a numerical data type. For this, you need to encode your column first. The following piece of code will resolve your problem:
from sklearn.preprocessing import LabelEncoder

notes_encoder = LabelEncoder()
df['notes'] = notes_encoder.fit_transform(df['notes'])
After doing this, check the column's data type; it must be int. If it is categorical, then change its type to int using the following code:
df['notes']=df['notes'].astype('int32')
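A pandas-native alternative sketch (whether encoding free-text notes to integer codes is meaningful for interpolation is a separate question) is `pd.factorize`, which avoids the scikit-learn dependency and handles missing values directly:

```python
import pandas as pd

notes = pd.Series(['ok', None, 'check', 'ok'])

# factorize assigns each distinct value an integer code; missing values get -1
codes, uniques = pd.factorize(notes)
print(codes.tolist())    # [0, -1, 1, 0]
print(uniques.tolist())  # ['ok', 'check']
```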

pandas datetime is shown as numbers in plot

I have got a datetime variable in a pandas dataframe [1]. When I check the dtypes, it shows the right format (datetime) [2]; however, when I try to plot this variable, it is plotted as numbers and not datetimes [3].
The most surprising thing is that this variable was working fine until yesterday; I do not know what has changed, and since the dtype looks fine, I am clueless about what else could go wrong.
I would highly appreciate your feedback.
Thank you.
[1]
df.head()
reactive_power current timeofmeasurement
0 0 0.000 2018-12-12 10:43:41
1 0 0.000 2018-12-12 10:44:32
2 0 1.147 2018-12-12 10:46:16
3 262 1.135 2018-12-12 10:47:30
4 1159 4.989 2018-12-12 10:49:47
[2]
df.dtypes
reactive_power int64
current float64
timeofmeasurement datetime64[ns]
dtype: object
[3] (plot: the x-axis shows integer positions instead of timestamps)
You need to convert your datetime column from string type into datetime type, and then set it as the index. I don't have your original code, but something along these lines:
# Convert to datetime
df["timeofmeasurement"] = pd.to_datetime(df["timeofmeasurement"], format="%Y-%m-%d %H:%M:%S")
# Set date as index
df = df.set_index("timeofmeasurement")
# Then you can plot easily
df.plot()
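Putting that together as a runnable sketch (the data and the `timeofmeasurement` column name are reconstructed from the df.head() shown above):

```python
import pandas as pd

df = pd.DataFrame({
    'reactive_power': [0, 0, 0, 262, 1159],
    'timeofmeasurement': ['2018-12-12 10:43:41', '2018-12-12 10:44:32',
                          '2018-12-12 10:46:16', '2018-12-12 10:47:30',
                          '2018-12-12 10:49:47'],
})

# Parse the strings into datetime64[ns], then promote them to the index
df['timeofmeasurement'] = pd.to_datetime(df['timeofmeasurement'],
                                         format='%Y-%m-%d %H:%M:%S')
df = df.set_index('timeofmeasurement')

# df.plot() would now put timestamps, not integer positions, on the x-axis
print(df.index.dtype)  # datetime64[ns]
```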

pandas converts float64 to int

I am trying to convert the dtype of a column (A) in a dataframe from float64 to int,
df['A'].astype(numpy.int64)
but after that, A still has float64 as its dtype. I am wondering how to resolve the issue.
It seems the output is not assigned back, so you need:
df['A'] = df['A'].astype(numpy.int64)
If there are NaNs, use fillna to replace them before converting to int:
df['A'] = df['A'].fillna(0).astype(numpy.int64)
Or remove all rows with NaNs in the A column by dropna:
df = df.dropna(subset=['A'])
df['A'] = df['A'].astype(numpy.int64)
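If the NaNs need to survive the conversion, a sketch using pandas' nullable integer dtype ('Int64' with a capital I, available since pandas 0.24):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1.0, 2.0, np.nan, 4.0]})

# 'Int64' (capital I) is the nullable integer dtype: NaN is kept as <NA>
df['A'] = df['A'].astype('Int64')
print(df['A'].dtype)  # Int64
```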
If you have NaN values, then pandas can't convert them to int.
But most probably you just didn't assign the result back to the A column (as @jezrael has already said).
If you try to convert NaNs to integer, you get the following exception:
In [4]: df = pd.DataFrame({'A':[1,2,np.nan,4]})
In [5]: df
Out[5]:
A
0 1.0
1 2.0
2 NaN
3 4.0
In [6]: df['A'] = df['A'].astype(np.int64)
...
skipped
...
ValueError: Cannot convert non-finite values (NA or inf) to integer

Pandas not detecting the datatype of a Series properly

I'm running into something a bit frustrating with pandas Series. I have a DataFrame with several columns, with numeric and non-numeric data. For some reason, however, pandas thinks some of the numeric columns are non-numeric, and ignores them when I try to run aggregating functions like .describe(). This is a problem, since pandas raises errors when I try to run analyses on these columns.
I've copied some commands from the terminal as an example. When I slice the 'ND_Offset' column (the problematic column in question), pandas tags it with the dtype of object. Yet, when I call .describe(), pandas tags it with the dtype float64 (which is what it should be). The 'Dwell' column, on the other hand, works exactly as it should, with pandas giving float64 both times.
Does anyone know why I'm getting this behavior?
In [83]: subject.phrases['ND_Offset'][:3]
Out[83]:
SubmitTime
2014-06-02 22:44:44 0.3607049
2014-06-02 22:44:44 0.2145484
2014-06-02 22:44:44 0.4031347
Name: ND_Offset, dtype: object
In [84]: subject.phrases['ND_Offset'].describe()
Out[84]:
count 1255.000000
unique 432.000000
top 0.242308
freq 21.000000
dtype: float64
In [85]: subject.phrases['Dwell'][:3]
Out[85]:
SubmitTime
2014-06-02 22:44:44 111
2014-06-02 22:44:44 81
2014-06-02 22:44:44 101
Name: Dwell, dtype: float64
In [86]: subject.phrases['Dwell'].describe()
Out[86]:
count 1255.000000
mean 99.013546
std 30.109327
min 21.000000
25% 81.000000
50% 94.000000
75% 111.000000
max 291.000000
dtype: float64
And when I use the .groupby function to group the data by another attribute (when these Series are a part of a DataFrame), I get the DataError: No numeric types to aggregate error when I try to call .agg(np.mean) on the group. When I try to call .agg(np.sum) on the same data, on the other hand, things work fine.
It's a bit bizarre -- can anyone explain what's going on?
Thank you!
It might be because the ND_Offset column (what I call A below) contains a non-numeric value such as an empty string. For example,
import numpy as np
import pandas as pd
df = pd.DataFrame({'A': [0.36, ''], 'B': [111, 81]})
print(df['A'].describe())
# count 2.00
# unique 2.00
# top 0.36
# freq 1.00
# dtype: float64
try:
print(df.groupby(['B']).agg(np.mean))
except Exception as err:
print(err)
# No numeric types to aggregate
print(df.groupby(['B']).agg(np.sum))
# A
# B
# 81
# 111 0.36
Aggregation using np.sum works because
In [103]: np.sum(pd.Series(['']))
Out[103]: ''
whereas np.mean(pd.Series([''])) raises
TypeError: Could not convert to numeric
To debug the problem, you could try to find the non-numeric value(s) using:
for val in df['A']:
    if not isinstance(val, float):
        print('Error: val = {!r}'.format(val))
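A vectorized sketch of the same check: coerce with `pd.to_numeric` and compare the NaN masks, which flags values that are non-numeric rather than genuinely missing.

```python
import pandas as pd

s = pd.Series([0.36, '', 0.21, 'hello'])

# Values that coerce to NaN but were not missing originally are non-numeric
coerced = pd.to_numeric(s, errors='coerce')
bad = s[coerced.isna() & s.notna()]
print(bad.index.tolist())  # [1, 3]
```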