How to properly handle Daylight Savings Time in Apache Airflow?

In airflow, everything is supposed to be UTC (which is not affected by DST).
However, we have workflows that deliver things based on time zones that are affected by DST.
An example scenario:
We have a job scheduled with a start date at 8:00 AM Eastern and a schedule interval of 24 hours.
Every day at 8 AM Eastern the scheduler sees that it has been 24 hours since the last run, and runs the job.
DST happens and we lose an hour.
Today at 8 AM Eastern the scheduler sees that it has only been 23 hours (because the machine's clock is UTC), so it doesn't run the job until 9 AM Eastern, which is a late delivery.
Is there a way to schedule dags so they run at the correct time after a time change?

Off the top of my head:
If your machine is timezone-aware, set up your DAG to run at both 8 AM EST and 8 AM EDT in UTC, something like 0 12,13 * * *. Have the first task be a ShortCircuitOperator, then use something like pytz to localize the current time. If it is within your required time, continue (i.e., run the DAG). Otherwise, return False. You'll have a tiny overhead of 2 extra task runs per day, but the latency should be minimal as long as your machine isn't overloaded.
sloppy example:
from datetime import datetime
from pytz import utc, timezone
from airflow.operators.python_operator import ShortCircuitOperator
# ...

def is8AM(**kwargs):
    ti = kwargs["ti"]
    curtime = utc.localize(datetime.utcnow())
    # If you want to use the exec date:
    # curtime = utc.localize(ti.execution_date)
    eastern = timezone('US/Eastern')  # From docs, check your local names
    loc_dt = curtime.astimezone(eastern)
    if loc_dt.hour == 8:
        return True
    return False

start_task = ShortCircuitOperator(
    task_id='check_for_8AM',
    python_callable=is8AM,
    provide_context=True,
    dag=dag
)
Hope this is helpful
Edit: the runtimes were originally wrong (the offsets were subtracted instead of added); the cron above has been corrected. Additionally, due to how runs are launched, you'll probably end up wanting to schedule for 7 AM with an hourly schedule if you want them to run at 8.

We used @apathyman's solution, but instead of a ShortCircuitOperator we used a PythonOperator that fails if it's not the hour we want, with a retry delay (timedelta) of 1 hour.
That way we have only 1 run per day instead of 2, with the schedule interval set to run only on the first possible hour.
So basically, something like this (most code taken from the above answer, thanks @apathyman):
from datetime import datetime
from datetime import timedelta
from pytz import utc, timezone
from airflow.operators.python_operator import PythonOperator

def is8AM(**kwargs):
    ti = kwargs["ti"]
    curtime = utc.localize(datetime.utcnow())
    # If you want to use the exec date:
    # curtime = utc.localize(ti.execution_date)
    eastern = timezone('US/Eastern')  # From docs, check your local names
    loc_dt = curtime.astimezone(eastern)
    if loc_dt.hour == 8:
        return True
    # Failing the task here triggers the 1-hour retry configured below
    raise ValueError("Not the time yet, wait 1 hour")

start_task = PythonOperator(
    task_id='check_for_8AM',
    python_callable=is8AM,
    provide_context=True,
    retries=1,
    retry_delay=timedelta(hours=1),
    dag=dag
)

This question was asked when Airflow was on version 1.8.x.
This functionality is built in now, as of Airflow 1.10:
https://airflow.apache.org/timezone.html
Set the timezone in airflow.cfg and DST should be handled correctly.
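For illustration, here is a minimal sketch of a timezone-aware DAG following the pattern in those docs (the dag_id and schedule are made-up placeholders; pendulum ships with Airflow 1.10+):

import pendulum
from datetime import datetime
from airflow import DAG

# An aware start_date; Airflow stores it as UTC internally but remembers the zone
local_tz = pendulum.timezone("US/Eastern")

dag = DAG(
    dag_id="eastern_8am_delivery",  # hypothetical name
    start_date=datetime(2019, 1, 1, 8, 0, tzinfo=local_tz),
    schedule_interval="0 8 * * *",  # cron is evaluated in US/Eastern, so DST shifts are handled
)

With an aware start_date, cron schedules keep firing at the same local wall-clock time across DST transitions, which is exactly the late-delivery problem described above.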

I believe we just need a PythonOperator to handle this case.
If the DAG needs to run in a TZ that has DST (for example America/New_York, Europe/London, or Australia/Sydney), here is the workaround I can think of:
First, convert the DAG schedule to UTC. Because the TZ has DST, we need to choose the larger (DST) offset when converting. For example:
With America/New_York we must use the offset -4, so the schedule */10 11-13 * * 1-5 is converted to */10 15-17 * * 1-5.
With Europe/London we must use the offset +1, so the schedule 35 */4 * * * is converted to 35 3-23/4 * * *.
With Australia/Sydney we must use the offset +11, so the schedule 15 8,9,12,18 * * * is converted to 15 21,22,1,7 * * *.
Second, use a PythonOperator to make a task that runs before all the main tasks. This task checks whether the current time is in DST for the specified TZ. If it is not, the task sleeps for 1 hour so that the main tasks still start at the intended local time.
This way we can handle a DST TZ.
import time
from datetime import datetime, timedelta

import pytz
from airflow.operators.python_operator import PythonOperator

def is_DST(zonename):
    tz = pytz.timezone(zonename)
    now = pytz.utc.localize(datetime.utcnow())
    return now.astimezone(tz).dst() != timedelta(0)

def WQ_DST_handler(TZ, **kwargs):
    if is_DST(TZ):
        print('Currently it is daylight saving time (DST) in {0}, will proceed to the next task now'.format(TZ))
    else:
        print('Currently it is not daylight saving time (DST) in {0}, will sleep 1 hour...'.format(TZ))
        time.sleep(60 * 60)

DST_handler = PythonOperator(
    task_id='DST_handler',
    python_callable=WQ_DST_handler,
    op_kwargs={'TZ': TZ_of_dag},  # TZ_of_dag: the DAG's zone name, e.g. 'America/New_York'
    dag=dag
)

DST_handler >> main_tasks
This workaround has a disadvantage: for any DAG that needs to run in a DST TZ, we have to create one extra task (DST_handler in the example above), and that task still needs to be sent to worker nodes to execute (although it is almost just a sleep command).

Related

How to fix time to 00:01:00 in current-dateTime function output in XSLT?

I am using select="fn:current-dateTime() + xs:dayTimeDuration('P1D')" to get the current date + 1 day as the final output, but I want to fix the time to 00:01 in the output. How can I do this?
I am new to XSLT. I have tried the replace function, but it is not working.
Try:
dateTime(current-date() + xs:dayTimeDuration('P1D'), xs:time('00:01:00'))
Note that this will preserve the local timezone returned by the current-date() function; for example, if your timezone is 5.5 hours ahead of GMT, you will get a result of 2022-11-03T00:01:00+05:30 if you run the evaluation today (2022-11-02).

How do I repeat BigQueryOperator Dag and pass different dates to my sql file

I have a query I want to run using the BigQueryOperator. Each day, it will run for each of the past 21 days. The SQL file stays the same, but the date passed to the file changes. For example, today it will run for today's date, then repeat for yesterday's date, then for 2 days ago, all the way back to 21 days ago. So when it runs on 7/14/2021, I need to pass that date to my SQL file; then it runs for 7/13/2021, and the date I need to pass is 7/13/2021. How can I have this DAG repeat over a date range and dynamically pass the date to the SQL file?
In the BigQueryOperator, variables are passed in the user_defined_macros section, so I don't know how to change the date I am passing. I thought about looping over an array of dates, but I don't know how to pass each date to the SQL file linked in the BigQueryOperator.
My sql file is 300 lines long, so I included a simple example below, as people seem to ask for one.
DAG
with DAG(
    dag_id,
    schedule_interval='0 12 * * *',
    start_date=datetime(2021, 1, 1),
    template_searchpath='/opt/airflow/dags',
    catchup=False,
    user_defined_macros={"varsToPass": Var1}
) as dag:

    query_one = BigQueryOperator(
        task_id='query_one',
        sql='/sql/something.sql',
        use_legacy_sql=False,
        destination_dataset_table='table',
        write_disposition='WRITE_TRUNCATE'
    )
sql file
SELECT * FROM table WHERE date = {{CHANGING_DATE}}
Your code is confusing: you describe a repeated pattern of today, today - 1 day, ..., today - 21 days, yet your code shows write_disposition='WRITE_TRUNCATE', which means only the LAST query matters because each query erases the result of the previous one. Since no more information is provided, I assume you actually mean to run a single query over the range from today back to today - 21 days.
Also, you didn't mention whether the date you are referring to is the Airflow execution_date or today's date.
If it's execution_date, you don't need to pass any parameters; the SQL needs to be:
SELECT * FROM table WHERE date BETWEEN '{{ (execution_date - macros.timedelta(days=21)).strftime("%Y-%m-%d") }}' AND '{{ ds }}'
If it's today, then you need to pass a parameter with params:
from datetime import datetime, timedelta

query_one = BigQueryOperator(
    task_id='query_one',
    sql='/sql/something.sql',
    use_legacy_sql=False,
    destination_dataset_table='table',
    write_disposition='WRITE_TRUNCATE',
    params={
        "end": datetime.utcnow().strftime('%Y-%m-%d'),
        "start": (datetime.utcnow() - timedelta(days=21)).strftime('%Y-%m-%d')
    }
)
Then in the SQL you can use it as:
SELECT * FROM table WHERE date BETWEEN '{{ params.start }}' AND '{{ params.end }}'
I'd like to point out that if you are not using execution_date, then I don't see the value of passing the date from Airflow. You can just do it directly in BigQuery by setting the query to:
SELECT *
FROM table
WHERE date BETWEEN DATE_SUB(current_date(), INTERVAL 21 DAY) AND current_date()
If my assumption was incorrect and you do want to run 21 queries, then you can do that with a loop as you described:

from datetime import datetime, timedelta
from airflow.contrib.operators.bigquery_operator import BigQueryOperator

a = []
for i in range(0, 21):
    a.append(
        BigQueryOperator(
            task_id=f'query_{i}',
            sql='/sql/something.sql',
            use_legacy_sql=False,
            destination_dataset_table='table',
            write_disposition='WRITE_TRUNCATE',  # This is probably wrong, I just copied it from your code.
            params={
                "date_value": (datetime.now() - timedelta(days=i)).strftime('%Y-%m-%d')
            }
        )
    )
    if i > 0:
        a[i - 1] >> a[i]
Then in your /sql/something.sql the query should be:
SELECT * FROM table WHERE date = '{{ params.date_value }}'
As mentioned, this will create a sequential workflow where each query task runs only after the previous one (query_0 >> query_1 >> ... >> query_20).
Note also that BigQueryOperator is deprecated. You should use BigQueryExecuteQueryOperator, which is available in the Google provider via
from airflow.providers.google.cloud.operators.bigquery import BigQueryExecuteQueryOperator
For more information about how to install the Google provider, please see the 2nd part of the following answer.
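For reference, a minimal sketch of the same task using the provider operator; the parameters mirror the deprecated BigQueryOperator, and the SQL path and destination table are the placeholders from the question:

from airflow.providers.google.cloud.operators.bigquery import BigQueryExecuteQueryOperator

query_one = BigQueryExecuteQueryOperator(
    task_id='query_one',
    sql='/sql/something.sql',           # placeholder path from the question
    use_legacy_sql=False,
    destination_dataset_table='table',  # placeholder table from the question
    write_disposition='WRITE_TRUNCATE',
)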

Visual Basic Recording amount of seconds passed

In my code I want to record the number of seconds that have passed. I was using Second(Now) to measure how much time had passed since a point and used the integer in comparisons, but realised that when the minute ends it goes back to zero. This led me to add in Minute(Now), multiplying it by 60 and adding on Second(Now), but there is a similar problem of this number becoming zero once the hour passes.
What can I use instead to record the number of seconds elapsed after a certain time?
You can use a TimeSpan to capture the difference between two DateTime values by using the DateTime.Subtract method. Here is an illustration; only after the operation do we convert the elapsed time into the unit we want to use (seconds in our case):
Static start_time As DateTime
Static stop_time As DateTime
Dim elapsed_time As TimeSpan
start_time = Now
''' Processing here
stop_time = Now
elapsed_time = stop_time.Subtract(start_time)
Dim totalSecondsStr = elapsed_time.TotalSeconds.ToString("0.000000")

Hive FROM_UNIXTIME() with milliseconds

I have seen enough posts where we divide by 1000 or cast to convert from milliseconds epoch time to a timestamp, but I would like to know how to retain the milliseconds piece in the timestamp.
For example, in 1440478800123 the last 3 digits are the milliseconds. How do I convert this to something like YYYYMMDDHHMMSS.sss?
I need to capture the millisecond portion in the converted timestamp.
Thanks
select cast(epoch_ms as timestamp)
actually works, because when casting to a timestamp (as opposed to using from_unixtime()), Hive seems to assume an int or bigint is milliseconds, while a floating point type is treated as seconds. That is undocumented as far as I can see, and possibly a bug. I wanted a string that includes the timezone (which can be important, particularly if the server changes to summer/daylight saving time), and I wanted to be explicit about the conversion in case the cast functionality changes. So this gives an ISO 8601 date (adjust the format string as needed for another format):
select from_unixtime(
floor( epoch_ms / 1000 )
, printf( 'yyyy-MM-dd HH:mm:ss.%03dZ', epoch_ms % 1000 )
)
Create a Hive UDF in Java:
package com.kishore.hiveudf;

import java.text.DateFormat;
import java.text.SimpleDateFormat;
import java.util.Date;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.hive.ql.udf.UDFType;

@UDFType(stateful = true)
public class TimestampToDateUDF extends UDF {
    String dateFormatted;

    public String evaluate(long timestamp) {
        Date date = new Date(timestamp);
        // Lowercase 'yyyy'/'dd': uppercase 'YYYY' is week-year and 'DD' is day-of-year
        DateFormat formatter = new SimpleDateFormat("yyyyMMddHHmmss.SSS");
        dateFormatted = formatter.format(date);
        return dateFormatted;
    }
}
Export it as TimestampToDateUDF.jar, then:
hive> ADD JAR /home/kishore/TimestampToDateUDF.jar;
hive> create TEMPORARY FUNCTION toDate AS 'com.kishore.hiveudf.TimestampToDateUDF';
output
select * from tableA;
OK
1440753288123
Time taken: 0.071 seconds, Fetched: 1 row(s)
hive> select toDate(timestamp) from tableA;
OK
20150828144448.123
Time taken: 0.08 seconds, Fetched: 1 row(s)

Convert Local Time to UTC in SQL Server 2005 [duplicate]

We are dealing with an application that needs to handle global time data from different time zones and daylight saving time settings. The idea is to store everything in UTC format internally and only convert back and forth for the localized user interfaces. Does SQL Server offer any mechanisms for handling these translations, given a time, a country, and a timezone?
This must be a common problem, so I'm surprised google wouldn't turn up anything usable.
Any pointers?
This works for dates that currently have the same UTC offset as SQL Server's host; it doesn't account for daylight savings changes. Replace YOUR_DATE with the local date to convert.
SELECT DATEADD(second, DATEDIFF(second, GETDATE(), GETUTCDATE()), YOUR_DATE);
7 years passed and...
actually there's this new SQL Server 2016 feature that does exactly what you need.
It is called AT TIME ZONE and it converts date to a specified time zone considering DST (daylight saving time) changes.
More info here:
https://msdn.microsoft.com/en-us/library/mt612795.aspx
While a few of these answers will get you in the ballpark, you cannot do what you're trying to do with arbitrary dates in SQL Server 2005 and earlier because of daylight saving time: using the difference between the current local time and the current UTC only gives you the offset as it exists today. I have not found a way to determine what the offset would have been for the date in question.
That said, I know that SQL Server 2008 provides some new date functions that may address the issue, but folks using an earlier version need to be aware of the limitations.
Our approach is to persist UTC and perform the conversion on the client side, where we have more control over the conversion's accuracy.
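As an illustration of that client-side approach, here is a minimal Python sketch (assuming Python 3.9+ for the standard-library zoneinfo; the zone name and sample value are arbitrary):

from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# The value persisted in the database, stored as UTC
utc_value = datetime(2021, 7, 14, 13, 0, tzinfo=timezone.utc)

# Convert only for display; the tz database picks the correct offset,
# including whether DST applied on that date
local_value = utc_value.astimezone(ZoneInfo("America/New_York"))
print(local_value)  # 2021-07-14 09:00:00-04:00 (EDT)

The historic-offset problem described above disappears because the tz database knows what the offset was on the date in question, not just today.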
Here is the code to convert a DateTime from one zone to another:

DECLARE @UTCDateTime DATETIME = GETUTCDATE();
DECLARE @ConvertedZoneDateTime DATETIME;

-- 'UTC' to 'India Standard Time' DATETIME
SET @ConvertedZoneDateTime = @UTCDateTime AT TIME ZONE 'UTC' AT TIME ZONE 'India Standard Time'
SELECT @UTCDateTime AS UTCDATE, @ConvertedZoneDateTime AS IndiaStandardTime

-- 'India Standard Time' to 'UTC' DATETIME
SET @UTCDateTime = @ConvertedZoneDateTime AT TIME ZONE 'India Standard Time' AT TIME ZONE 'UTC'
SELECT @ConvertedZoneDateTime AS IndiaStandardTime, @UTCDateTime AS UTCDATE

Note: AT TIME ZONE works only on SQL Server 2016+, and its advantage is that it automatically accounts for daylight saving time when converting to a particular time zone.
For SQL Server 2016 and newer, and Azure SQL Database, use the built in AT TIME ZONE statement.
For older editions of SQL Server, you can use my SQL Server Time Zone Support project to convert between IANA standard time zones, as listed here.
UTC to Local is like this:
SELECT Tzdb.UtcToLocal('2015-07-01 00:00:00', 'America/Los_Angeles')
Local to UTC is like this:
SELECT Tzdb.LocalToUtc('2015-07-01 00:00:00', 'America/Los_Angeles', 1, 1)
The numeric options are flags for controlling the behavior when the local time values are affected by daylight saving time. They are described in detail in the project's documentation.
SQL Server 2008 has a type called datetimeoffset. It's really useful for this type of stuff.
http://msdn.microsoft.com/en-us/library/bb630289.aspx
Then you can use the function SWITCHOFFSET to move it from one timezone to another, but still keeping the same UTC value.
http://msdn.microsoft.com/en-us/library/bb677244.aspx
Rob
I tend to lean towards using DateTimeOffset for all date-time storage that isn't related to a local event (e.g., a meeting/party, 12pm-3pm at the museum).
To get the current DTO as UTC:
DECLARE @utcNow DATETIMEOFFSET = CONVERT(DATETIMEOFFSET, SYSUTCDATETIME())
DECLARE @utcToday DATE = CONVERT(DATE, @utcNow);
DECLARE @utcTomorrow DATE = DATEADD(D, 1, @utcNow);

SELECT @utcToday [today]
    ,@utcTomorrow [tomorrow]
    ,@utcNow [utcNow]
NOTE: I will always use UTC when sending over the wire... client-side JS can easily convert to/from local time. See: new Date().toJSON() ...
The following JS will handle parsing a UTC/GMT date in ISO 8601 format to a local datetime:
if (typeof Date.fromISOString != 'function') {
  //method to handle conversion from an ISO-8601 style string to a Date object
  //  Date.fromISOString("2009-07-03T16:09:45Z")
  //  Fri Jul 03 2009 09:09:45 GMT-0700
  Date.fromISOString = function(input) {
    var date = new Date(input); //EcmaScript5 includes ISO-8601 style parsing
    if (!isNaN(date)) return date;
    //early short-circuit of invalid input
    if (typeof input !== "string" || input.length < 10 || input.length > 40) return null;
    var iso8601Format = /^(\d{4})-(\d{2})-(\d{2})((([T ](\d{2}):(\d{2})(:(\d{2})(\.(\d{1,12}))?)?)?)?)?([Zz]|([-+])(\d{2})\:?(\d{2}))?$/;
    //normalize input
    input = input.toString().replace(/^\s+/,'').replace(/\s+$/,'');
    if (!iso8601Format.test(input))
      return null; //invalid format
    var d = input.match(iso8601Format);
    var offset = 0;
    date = new Date(+d[1], +d[2]-1, +d[3], +d[7] || 0, +d[8] || 0, +d[10] || 0, Math.round(+("0." + (d[12] || 0)) * 1000));
    //use specified offset
    if (d[13] == 'Z') offset = 0 - date.getTimezoneOffset();
    else if (d[13]) offset = ((parseInt(d[15],10) * 60) + (parseInt(d[16],10)) * ((d[14] == '-') ? 1 : -1)) - date.getTimezoneOffset();
    date.setTime(date.getTime() + (offset * 60000));
    if (date.getTime() <= new Date(-62135571600000).getTime()) // CLR DateTime.MinValue
      return null;
    return date;
  };
}
Yes, to some degree as detailed here.
The approach I've used (pre-2008) is to do the conversion in the .NET business logic before inserting into the DB.
You can use the GETUTCDATE() function to get the UTC datetime.
You can probably select the difference between GETUTCDATE() and GETDATE() and use that difference to adjust your dates to UTC.
But I agree with the previous message that it is much easier to control the datetime correctly in the business layer (in .NET, for example).
SUBSTRING(CONVERT(VARCHAR(34), SYSDATETIMEOFFSET()), 29, 5)
Returns (for example):
-06:0
Not 100% positive this will always work.
Sample usage:
SELECT
Getdate=GETDATE()
,SysDateTimeOffset=SYSDATETIMEOFFSET()
,SWITCHOFFSET=SWITCHOFFSET(SYSDATETIMEOFFSET(),0)
,GetutcDate=GETUTCDATE()
GO
Returns:
Getdate SysDateTimeOffset SWITCHOFFSET GetutcDate
2013-12-06 15:54:55.373 2013-12-06 15:54:55.3765498 -08:00 2013-12-06 23:54:55.3765498 +00:00 2013-12-06 23:54:55.373