I have a datetime variable in a pandas DataFrame [1]. When I check the dtypes, it shows the right type (datetime64) [2]; however, when I try to plot this variable, it is plotted as numbers and not datetimes [3].
Most surprising is that this variable was working fine until yesterday. I don't know what has changed today, and since the dtype looks fine, I am clueless about what else could go wrong.
I would highly appreciate your feedback.
Thank you.
[1]
df.head()
reactive_power current timeofmeasurement
0 0 0.000 2018-12-12 10:43:41
1 0 0.000 2018-12-12 10:44:32
2 0 1.147 2018-12-12 10:46:16
3 262 1.135 2018-12-12 10:47:30
4 1159 4.989 2018-12-12 10:49:47
[2]
df.dtypes
reactive_power int64
current float64
timeofmeasurement datetime64[ns]
dtype: object
[3] (screenshot of the plot: the time values render as plain numbers instead of dates)
You need to convert your datetime column from string type into datetime type, and then set it as the index. I don't have your original code, but something along these lines:
# Convert to datetime
df["timeofmeasurement"] = pd.to_datetime(df["timeofmeasurement"], format="%Y-%m-%d %H:%M:%S")
# Set the date as index
df = df.set_index("timeofmeasurement")
# Then you can plot easily
df.plot()
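If the column already has dtype datetime64[ns] (as the df.dtypes output [2] shows), the string-parsing step isn't needed; the key step is making the datetime column the index so the plot uses it for the x-axis. A minimal sketch with made-up values shaped like the question's frame:

```python
import pandas as pd

# Toy frame shaped like the question's df (values are invented)
df = pd.DataFrame({
    "current": [0.0, 1.147, 4.989],
    "timeofmeasurement": pd.to_datetime(
        ["2018-12-12 10:43:41", "2018-12-12 10:46:16", "2018-12-12 10:49:47"]
    ),
})

# With a DatetimeIndex, df.plot() labels the x-axis with dates,
# not with the default integer positions
df = df.set_index("timeofmeasurement")
print(type(df.index))
```

An equivalent one-liner without touching the index is `df.plot(x="timeofmeasurement", y="current")`.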
I want to convert time columns (dtype: datetime64[ns]) in a pandas.DataFrame into strings representing the year and month only.
It works as expected if all values in the column are valid.
0 2019-4
1 2017-12
dtype: object
But with missing values (pandas.NaT) in the column the result confuses me.
0 -1 days +23:59:59.999979806
1 -1 days +23:59:59.999798288
2 NaT
dtype: timedelta64[ns]
Or with .unique() it is array([ -20194, -201712, 'NaT'], dtype='timedelta64[ns]').
Somehow the result becomes a timedelta64, but I don't understand why. Why does this happen?
The complete example code:
#!/usr/bin/env python3
import pandas as pd
import numpy as np
# series with missing values
series = pd.Series([
    np.datetime64('2019-04-08'),
    np.datetime64('2017-12-05')])

def year_month_string(cell):
    """Convert a datetime64 into string representation with
    year and month only.
    """
    if pd.isna(cell):
        return pd.NaT
    return '{}-{}'.format(cell.year, cell.month)
print(series.apply(year_month_string))
# 0 2019-4
# 1 2017-12
# dtype: object
# Series with a missing value
series_nat = pd.Series([
    np.datetime64('2019-04-08'),
    np.datetime64('2017-12-05'),
    pd.NaT])
result = series_nat.apply(year_month_string)
print(result)
# 0 -1 days +23:59:59.999979806
# 1 -1 days +23:59:59.999798288
# 2 NaT
# dtype: timedelta64[ns]
print(result.unique())
# array([ -20194, -201712, 'NaT'], dtype='timedelta64[ns]')
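As a side note on the "why": the timedelta64 coercion appears to be triggered by pd.NaT itself appearing in apply's output; when the function returns a plain NaN for missing cells instead, the result stays object dtype. A sketch of that variation (not a full explanation of pandas' type-inference internals):

```python
import pandas as pd
import numpy as np

series_nat = pd.Series([np.datetime64('2019-04-08'),
                        np.datetime64('2017-12-05'),
                        pd.NaT])

def year_month_string(cell):
    # Return np.nan (not pd.NaT) for missing cells so pandas does not
    # try to coerce the mixed string/NaT result to timedelta64
    if pd.isna(cell):
        return np.nan
    return '{}-{}'.format(cell.year, cell.month)

result = series_nat.apply(year_month_string)
print(result)
```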
Don't use a custom function, use strftime with %-m (the dash strips the leading zero; note that this modifier is glibc-specific, so it works on Linux/macOS but not on Windows, where %#m is the equivalent):
series_nat.dt.strftime('%Y-%-m')
output:
0 2019-4
1 2017-12
2 NaN
dtype: object
%m would keep the leading zeros:
series_nat.dt.strftime('%Y-%m')
output:
0 2019-04
1 2017-12
2 NaN
dtype: object
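Since %-m is platform-dependent, a portable sketch is to build the same strings from the .dt components; casting through the nullable Int64 and string dtypes lets the missing value propagate as NA through the concatenation:

```python
import pandas as pd
import numpy as np

series_nat = pd.Series([np.datetime64('2019-04-08'),
                        np.datetime64('2017-12-05'),
                        pd.NaT])

# .dt.year / .dt.month come back with NaN for the NaT row; the nullable
# Int64 -> string casts keep <NA>, and <NA> propagates through the '+'
out = (series_nat.dt.year.astype('Int64').astype('string')
       + '-'
       + series_nat.dt.month.astype('Int64').astype('string'))
print(out)
```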
I am trying to create a row that has columns from t0 to t(n).
I have a complete DataFrame (df) that stores the full set of data, and a Series (df_t) of the specific time markers I am interested in.
What I want is to create a row that has the time marker as t0, followed by the previous [sequence_length] rows from the complete DataFrame.
def t_data(df, df_t, col_names, sequence_length):
    df_ret = pd.DataFrame()
    for i in range(sequence_length):
        col_names_seq = [col_name + "_" + str(i) for col_name in col_names]
        df_ret[col_names_seq] = df[df.shift(i)["time"].isin(df_t)][col_names]
    return df_ret
Running:
t_data(df, df_t, ["close"], 3)
I get:
close_0 close_1 close_2
1110 1.32080 NaN NaN
2316 1.30490 NaN NaN
2549 1.30290 NaN NaN
The obvious line in issue is:
df[df.shift(i)["time"].isin(df_t)][col_names]
I have tried several ways but can't seem to select the data surrounding a subset.
Sample (df):
time open close high low volume EMA21 EMA13 EMA9
20 2005-01-10 04:10:00 1.3071 1.3074 1.3075 1.3070 32.0 1.306624 1.306790 1.306887
21 2005-01-10 04:15:00 1.3074 1.3073 1.3075 1.3073 16.0 1.306685 1.306863 1.306969
22 2005-01-10 04:20:00 1.3073 1.3072 1.3074 1.3072 35.0 1.306732 1.306911 1.307015
Sample (df_t):
1110 2005-01-13 23:00:00
2316 2005-01-18 03:30:00
2549 2005-01-18 22:55:00
Name: time, dtype: datetime64[ns]
I don't have your data, but hopefully this helps:
def t_data(df, df_T, n):
    # Get the indices of the original df that match the values of df_T
    indices = df.reset_index().merge(df_T, how="inner")['index'].tolist()
    # New index list where we will store the index - n values
    newIndex = []
    # Offsets to subtract from each index
    toSub = np.arange(n)
    # Loop over the index values, subtract each offset, and append to newIndex
    for i in indices:
        for sub in toSub:
            newIndex.append(i - sub)
    # Use iloc to get all the rows of the original df at the newIndex positions
    closeCosts = df.iloc[newIndex].reset_index(drop=True)["close"].values
    # Concat back to df_T (indices realigned first so the rows line up)
    # and reshape closeCosts into n columns
    df_final = pd.concat([df_T.reset_index(drop=True),
                          pd.DataFrame(closeCosts.reshape(-1, n))], axis=1)
    return df_final
This should do what you're asking for. The easiest way is to work out all the indices you want from the original df together with their corresponding closing values. Note: the snippet assumes numpy is imported as np, and you will have to rename the columns afterwards, but all the values are there.
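The same index-arithmetic idea can be shown end to end on made-up data (the 5-minute bars, marker positions, and prices below are invented for illustration):

```python
import pandas as pd

# Made-up 5-minute bars and two time markers
df = pd.DataFrame({
    'time': pd.date_range('2005-01-10 04:00', periods=6, freq='5min'),
    'close': [1.0, 1.1, 1.2, 1.3, 1.4, 1.5],
})
df_t = df['time'].iloc[[3, 5]]

n = 3
positions = df.index[df['time'].isin(df_t)]
# For each marker row i, take rows i, i-1, ..., i-n+1 as close_0..close_{n-1}
rows = [df['close'].iloc[[i - k for k in range(n)]].tolist() for i in positions]
out = pd.DataFrame(rows, index=positions,
                   columns=[f'close_{k}' for k in range(n)])
print(out)
```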
I have a dataframe that just has datetime stamps of data type "object". I want to convert the whole dataframe to a datetime data type, and also convert all the columns to Linux epoch nanoseconds, so I can use this dataframe in PCA.
Sample:
rng = pd.date_range('2017-04-03', periods=3).astype(str)
time_df = pd.DataFrame({'s': rng, 'a': rng})
print (time_df)
s a
0 2017-04-03 2017-04-03
1 2017-04-04 2017-04-04
2 2017-04-05 2017-04-05
Use DataFrame.apply, converting to datetimes and then to the native epoch format by casting the underlying NumPy array to integers:
import numpy as np

f = lambda x: pd.to_datetime(x, infer_datetime_format=True).values.astype(np.int64)
#pandas 0.24+
#f = lambda x: pd.to_datetime(x, infer_datetime_format=True).to_numpy().astype(np.int64)
time_df = time_df.apply(f)
print (time_df)
s a
0 1491177600000000000 1491177600000000000
1 1491264000000000000 1491264000000000000
2 1491350400000000000 1491350400000000000
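Note that infer_datetime_format was deprecated in pandas 2.0; on recent versions the same conversion can be written without it. A sketch with the sample frame from above, relying on datetime64[ns] values already being epoch nanoseconds internally:

```python
import pandas as pd

rng = pd.date_range('2017-04-03', periods=3).astype(str)
time_df = pd.DataFrame({'s': rng, 'a': rng})

# datetime64[ns] stores epoch nanoseconds under the hood,
# so casting the parsed column to int64 exposes them directly
epoch_df = time_df.apply(lambda col: pd.to_datetime(col).astype('int64'))
print(epoch_df)
```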
I have an Excel sheet with a column that is supposed to contain date values, but pandas reads it as float64. The column also has blanks.
df:
date_int
15022016
23072017
I want to convert to a datetime object. I do:
df['date_int1'] = df['date_int'].astype(str).fillna('01011900')#To fill the blanks
df['date_int2']=pd.to_datetime(df['date_int1'],format='%d%m%Y')
I get error while converting to datetime:
TypeError: Unrecognized value type: <class 'str'>
ValueError: unconverted data remains: .0
You shouldn't convert to string until you've filled the NaNs. Otherwise, the NaNs are also stringified, and at that point there is nothing left to fill.
df
date_int
0 15022016.0
1 23072017.0
2 NaN
df['date_int'] = df['date_int'].fillna(1011900, downcast='infer').astype(str)
pd.to_datetime(df['date_int'], format='%d%m%Y', errors='coerce')
0 2016-02-15
1 2017-07-23
2 1900-01-10
Name: date_int, dtype: datetime64[ns]
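Putting the fix together on a minimal frame (the 1011900 placeholder parses as 1900-01-10, as in the output above):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'date_int': [15022016.0, 23072017.0, np.nan]})

# Fill *before* stringifying, so NaN never becomes the literal string 'nan'
s = df['date_int'].fillna(1011900).astype('int64').astype(str)
dates = pd.to_datetime(s, format='%d%m%Y', errors='coerce')
print(dates)
```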
See the comment from @Wen-Ben: convert the data to int first.
df.date_int = df.date_int.astype(int)
Then the rest of the code will work fine.
I have code at the moment written to change two columns of my dataframe from strings into datetime.datetime objects similar to the following:
def converter(date):
    date = dt.strptime(date, '%m/%d/%Y %H:%M:%S')
    return date

df = pd.DataFrame({'A': ['12/31/9999 0:00:00', '1/1/2018 0:00:00'],
                   'B': ['4/1/2015 0:00:00', '11/1/2014 0:00:00']})
df['A'] = df['A'].apply(converter)
df['B'] = df['B'].apply(converter)
When I run this code and print the dataframe, it comes out like this
A B
0 9999-12-31 00:00:00 2015-04-01
1 2018-01-01 00:00:00 2014-11-01
When I checked the data types of each column, they read
A object
B datetime64[ns]
But when I check the format of the actual cells of the first row, they read
<class 'datetime.datetime'>
<class 'pandas._libs.tslib.Timestamp'>
After experimenting, I think I've run into an out-of-bounds error because of the date '12/31/9999 0:00:00' in column 'A', which causes that column to be kept as plain datetime.datetime objects. My question is how I can also convert column 'B' of my dataframe to datetime.datetime objects, so that I can run a query on the columns similar to
df.query('A > B')
without getting an error or the wrong output.
Thanks in advance
Since '9999' is just some dummy year, you can simplify your life by choosing a dummy year which is in bounds (or one that makes more sense given your actual data):
import pandas as pd
df.replace('9999', '2060', regex=True).apply(pd.to_datetime)
Output:
A B
0 2060-12-31 2015-04-01
1 2018-01-01 2014-11-01
A datetime64[ns]
B datetime64[ns]
dtype: object
As @coldspeed points out, it's perhaps better to coerce those bad dates to NaT:
df.apply(pd.to_datetime, errors='coerce')
# A B
#0 NaT 2015-04-01
#1 2018-01-01 2014-11-01
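The out-of-bounds hypothesis in the question is right: datetime64[ns] stores nanoseconds in a signed 64-bit integer, which bounds the representable range to roughly the years 1677–2262, so 12/31/9999 cannot become a Timestamp and the column falls back to object dtype. The limits are easy to check:

```python
import pandas as pd

# The nanosecond int64 representation caps the representable dates
print(pd.Timestamp.min)  # 1677-09-21 ...
print(pd.Timestamp.max)  # 2262-04-11 ...
```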