How to convert a numeric column (without a calendar date) into datetime - pandas

I am working with time series data in a pandas DataFrame that doesn't have a real calendar date, just an index value that marks an equal time interval between rows. I'm trying to convert it into a datetime type with daily or weekly frequency. Is there a way to keep the values the same while changing the type (i.e. without setting an actual calendar date)?
Index,Col1,Col2
1,6.5,0.7
2,6.2,0.3
3,0.4,2.1

pd.to_datetime can create dates when given time units relative to some origin. The default is the POSIX origin 1970-01-01 00:00:00 and time in nanoseconds.
import pandas as pd
df['date1'] = pd.to_datetime(df.index, unit='D', origin='2010-01-01')
df['date2'] = pd.to_datetime(df.index, unit='W')
Output:
#        Col1  Col2      date1      date2
# Index
# 1       6.5   0.7 2010-01-02 1970-01-08
# 2       6.2   0.3 2010-01-03 1970-01-15
# 3       0.4   2.1 2010-01-04 1970-01-22
Alternatively, you can add timedeltas to the specified start:
pd.to_datetime('2010-01-01') + pd.to_timedelta(df.index, unit='D')
or just keep them as a timedelta:
pd.to_timedelta(df.index, unit='D')
#TimedeltaIndex(['1 days', '2 days', '3 days'], dtype='timedelta64[ns]', name='Index', freq=None)
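If you want calendar-looking values while keeping the equal spacing, the two ideas can be combined: pick an arbitrary placeholder origin, add the offsets, and use the result as the index. A small sketch; the origin date here is just an assumption, not anything from your data:
# Placeholder origin plus one week per index step; the values stay equally spaced
df['date3'] = pd.to_datetime('2010-01-01') + pd.to_timedelta(df.index, unit='W')
df = df.set_index('date3')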

Related

pd.to_datetime doesn't work with %a format

I'm having trouble with the pandas to_datetime function.
When I call the function in this way:
import pandas as pd
pd.to_datetime(['Wed', 'Thu', 'Mon', 'Tue', 'Fri'], format='%a')
I get this result:
DatetimeIndex(['1900-01-01', '1900-01-01', '1900-01-01', '1900-01-01',
               '1900-01-01'],
              dtype='datetime64[ns]', freq=None)
I don't know why pandas doesn't recognize the correct format.
I want to get a datetime object that has the right day of the week, independently of the month or year.
This is not a pandas issue but one with datetime in Python.
Here is the best documentation I can find for why you get '1900-01-01': Python Datetime Technical Details.
Note:
For the datetime.strptime() class method, the default value is
1900-01-01T00:00:00.000: any components not specified in the format
string will be pulled from the default value.
Basically, 'Mon' or 'Tue' (a day-of-week name) could be any Monday or Tuesday in January 1900, so the input is ambiguous and the default value of 1900-01-01T00:00:00.000 is returned. If you put in a day of the month, that date is determinate using the given defaults of January 1900: parsing 1 with %d leads to 1900-01-01 00:00:00.000 and 2 with %d leads to 1900-01-02 00:00:00.000, whereas 'Mon' or 'Monday' alone does not determine a datetime.
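A quick illustration with the plain standard-library datetime (not pandas) shows the same behaviour:
from datetime import datetime
# A weekday name alone is ambiguous, so all unspecified components fall back to the default
print(datetime.strptime('Mon', '%a'))   # 1900-01-01 00:00:00
# A day of the month does pin down a date within the default January 1900
print(datetime.strptime('2', '%d'))     # 1900-01-02 00:00:00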
This is my interpretation of the documentation and the issue you are experiencing.

Adding variable hours to timestamp in Spark SQL

I have one column Start_Time with a timestamp, and one column Time_Zone_Offset, an integer. How can I add the Time_Zone_Offset to Start_Time as a number of hours?
Example MyTable:
id  Start_Time           Time_Zone_Offset
1   2020-01-12 00:00:00  1
2   2020-01-12 00:00:00  2
Desired output:
id  Local_Start_Time
1   2020-01-12 01:00:00
2   2020-01-12 02:00:00
Attempt:
SELECT id, Start_time + INTERVAL time_zone_offset HOURS AS Local_Start_Time
FROM MyTable
This doesn't seem to work, and I can't use from_utc_timestamp as I don't have the actual timezone details, just the time-zone offset at the time being considered.
(Hoping you are using PySpark.)
Indeed, I couldn't make it work with SQL, but I managed to get the result by converting to a Unix timestamp. It's probably not the best way, but it works (I proceeded step by step to make sure the references were working; I thought I would need a user-defined function, but apparently not).
from pyspark.sql import functions as F

# Convert the timestamp to Unix seconds
df2 = df.withColumn("Start_Time", F.unix_timestamp("Start_Time"))
df2.show()

# Add the offset, converted from hours to seconds
df3 = df.withColumn("Start_Time", F.unix_timestamp("Start_Time") + df["Time_Zone_Offset"] * 60 * 60)
df3.show()

# Convert back to a formatted timestamp (note lowercase yyyy-MM-dd; uppercase Y/D mean week-year/day-of-year)
df4 = df3.withColumn("Start_Time", F.from_unixtime("Start_Time", "yyyy-MM-dd HH:00:00"))
df4.show()
Adding an alternative to Benoit's answer using a Python UDF:
from pyspark.sql import SQLContext
from datetime import datetime, timedelta
from pyspark.sql.types import TimestampType

# Defining a Python function to add hours onto a datetime
def addHours(my_datetime, hours):
    # Accounting for NULL (None in Python) values
    if (hours is None) or (my_datetime is None):
        adjusted_datetime = None
    else:
        adjusted_datetime = my_datetime + timedelta(hours=hours)
    return adjusted_datetime

# Registering the function as a UDF for use in SQL, and declaring the return type as
# TimestampType (this is important, the default is StringType)
sqlContext.udf.register("add_hours", addHours, TimestampType())
followed by:
SELECT id, add_hours(Start_Time, Time_Zone_Offset) AS Local_Start_Time
FROM MyTable
Beginning with Spark 3.0, you can use the make_interval(years, months, weeks, days, hours, mins, secs) function to add intervals built from values in other columns.
SELECT
    id
  , Start_Time + make_interval(0, 0, 0, 0, Time_Zone_Offset, 0, 0) AS Local_Start_Time
FROM MyTable
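If you are on the DataFrame API rather than SQL, the same function can be reached through expr. This is just a sketch, assuming a DataFrame df with the columns above:
from pyspark.sql import functions as F

# make_interval via expr, added to the timestamp column (Spark 3.0+)
df_local = df.withColumn(
    "Local_Start_Time",
    F.col("Start_Time") + F.expr("make_interval(0, 0, 0, 0, Time_Zone_Offset, 0, 0)")
)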
For anyone else coming to this question and using Spark SQL via Databricks, the dateadd function works the same way as in most other SQL dialects:
select dateadd(microsecond, 30, '2022-11-04') as microsecond
      ,dateadd(millisecond, 30, '2022-11-04') as millisecond
      ,dateadd(second, 30, '2022-11-04') as second
      ,dateadd(minute, 30, '2022-11-04') as minute
      ,dateadd(hour, 30, '2022-11-04') as hour
      ,dateadd(day, 30, '2022-11-04') as day
      ,dateadd(week, 30, '2022-11-04') as week
      ,dateadd(month, 30, '2022-11-04') as month
      ,dateadd(quarter, 30, '2022-11-04') as quarter
      ,dateadd(year, 30, '2022-11-04') as year
Output (one column per unit, shown here as unit → result):
microsecond  2022-11-04T00:00:00.000+0000
millisecond  2022-11-04T00:00:00.030+0000
second       2022-11-04T00:00:30.000+0000
minute       2022-11-04T00:30:00.000+0000
hour         2022-11-05T06:00:00.000+0000
day          2022-12-04T00:00:00.000+0000
week         2023-06-02T00:00:00.000+0000
month        2025-05-04T00:00:00.000+0000
quarter      2030-05-04T00:00:00.000+0000
year         2052-11-04T00:00:00.000+0000

How to generate a time series column from today to the next 600 days in pandas?

How to generate a time series column from today to the next 600 days in pandas?
I'm a new pandas learner. I can generate a new column as follows:
dates = pd.date_range('2010-01-01', '2011-8-23', freq='D')
Output:
DatetimeIndex(['2010-01-01', '2010-01-02', '2010-01-03', '2010-01-04',
               '2010-01-05', '2010-01-06', '2010-01-07', '2010-01-08',
               '2010-01-09', '2010-01-10',
               ...
               '2011-08-14', '2011-08-15', '2011-08-16', '2011-08-17',
               '2011-08-18', '2011-08-19', '2011-08-20', '2011-08-21',
               '2011-08-22', '2011-08-23'],
              dtype='datetime64[ns]', length=600, freq='D')
My question is: what should we do if we only know the starting date and the period of 600 days, but not the ending date? How should the code be modified?
And a follow-up question: how do we set the starting date to today's or yesterday's date?
Just change periods to 600 and you should get your output:
pd.date_range(start='2010-01-01', periods=5, freq='D')
Out[335]:
DatetimeIndex(['2010-01-01', '2010-01-02', '2010-01-03', '2010-01-04',
               '2010-01-05'],
              dtype='datetime64[ns]', freq='D')
To get today's date:
pd.to_datetime('today')
Out[338]: Timestamp('2017-09-29 00:00:00')
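Putting the two together answers the follow-up directly; normalize() just drops the time-of-day component, and subtracting one day gives yesterday (a small sketch):
import pandas as pd

# 600 daily dates starting from today at midnight
dates = pd.date_range(start=pd.Timestamp('today').normalize(), periods=600, freq='D')
# Starting from yesterday instead
dates = pd.date_range(start=pd.Timestamp('today').normalize() - pd.Timedelta(days=1), periods=600, freq='D')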
First, import the core datetime package:
import datetime
Then you can instantiate a datetime object and add 600 days using timedelta():
start_date = "2010-01-01"
start_date = datetime.datetime.strptime(start_date, "%Y-%m-%d")
end_date = start_date + datetime.timedelta(days=600)
To now get the string back, we can use strftime() like:
end_date = end_date.strftime("%Y-%m-%d")
> '2011-08-24'
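If the end goal is still a pandas date range, you don't actually need the end date: the period count can be passed straight to pd.date_range. A sketch reusing start_date from above; note that a start/end pair is inclusive on both ends, so periods=600 corresponds to end = start + 599 days:
import pandas as pd

# 600 daily dates starting at start_date
dates = pd.date_range(start=start_date, periods=600, freq='D')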

Matplotlib Default date format?

I'm using pandas to read a .csv file that has a 'Timestamp' date column in the format:
31/12/2016 00:00
I use the following line to convert it to a datetime64 dtype:
time = pd.to_datetime(df['Timestamp'])
The column has an entry for every 15 minutes for almost a year, and I've run into a problem when I want to plot more than one month's worth.
Pandas seems to swap the day and month on reading (interpreting the DD/MM/YYYY values as MM/DD/YYYY), so my plots have 30-day gaps whenever the datetime rolls over to a new day. A plot of the first 5 days looks like:
This is the raw data in the file either side of the jump:
01/01/2017 23:45
02/01/2017 00:00
If I print the values being plotted (after reading) around the 1st jump, I get:
2017-01-01 23:45:00
2017-02-01 00:00:00
So is there a way to get pandas to read the dates properly?
Thanks!
You can specify a format parameter in pd.to_datetime to tell pandas how to parse the date exactly, which I suppose is what you need:
time = pd.to_datetime(df['Timestamp'], format='%d/%m/%Y %H:%M')
pd.to_datetime('02/01/2017 00:00')
#Timestamp('2017-02-01 00:00:00')
pd.to_datetime('02/01/2017 00:00', format='%d/%m/%Y %H:%M')
#Timestamp('2017-01-02 00:00:00')
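An alternative, if you'd rather not spell out the whole format string, is the dayfirst flag of pd.to_datetime; this is an extra option, not something the answer above relies on:
# dayfirst=True tells the parser to read 02/01/2017 as 2 January 2017
time = pd.to_datetime(df['Timestamp'], dayfirst=True)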

Difference between two timestamps in Pandas

I have the following two time columns, "Time1" and "Time2". I have to calculate the "Difference" column, which is (Time2 - Time1), in pandas:
Time1     Time2     Difference
8:59:45   9:27:30   -1 days +23:27:45
9:52:29   10:08:54  -1 days +23:16:26
8:07:15   8:07:53   00:00:38
When Time1 and Time2 are in different hours, I am getting the result as "-1 days +...". My desired output for the first two rows is given below:
Time1     Time2     Difference
8:59:45   9:27:30   00:27:45
9:52:29   10:08:54  00:16:26
How can I get this output in pandas?
Both time values are in 'datetime64[ns]' dtype.
The issue is not that Time1 and Time2 are in different hours; it's that the difference being computed is negative, and "-1 days +HH:MM:SS" is simply how pandas displays negative timedeltas. If you just want the difference in minutes as a negative number, you could extract the minutes before calculating the difference:
(df.Time1.dt.minute - df.Time2.dt.minute)
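Alternatively, if what you actually want is the positive magnitude of the gap (an assumption about the goal, not part of the original answer), you can take the absolute value of the timedelta:
# Keeps full resolution (hours, minutes, seconds) and drops the sign
(df.Time2 - df.Time1).abs()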
I was not able to reproduce the issue using pandas 0.17.1:
import pandas as pd
from datetime import datetime

d = {
    "start_time": [
        "8:59:45",
        "9:52:29",
        "8:07:15"
    ],
    "end_time": [
        "9:27:30",
        "10:08:54",
        "8:07:53"
    ]
}

df = pd.DataFrame(data=d)
df['start_time'] = pd.to_datetime(df['start_time'])
df['end_time'] = pd.to_datetime(df['end_time'])
df.end_time - df.start_time
0 00:27:45
1 00:16:25
2 00:00:38
dtype: timedelta64[ns]