How do I repeat a BigQueryOperator DAG and pass different dates to my SQL file - google-bigquery

I have a query I want to run using the BigQueryOperator. Each day, it should run for each of the past 21 days. The SQL file stays the same, but the date passed to the file changes. For example, today it will run for today's date, then repeat for yesterday's date, then for 2 days ago, and so on up to 21 days ago. So it will run on 7/14/2021, and I need to pass that date to my SQL file. Then it will run for 7/13/2021, and the date I need to pass to my SQL file is 7/13/2021. How can I have this DAG repeat for a date range and dynamically pass each date to the SQL file?
In the BigQueryOperator, variables are passed in the "user_defined_macros" section, so I don't know how to change the date I am passing. I thought about looping over an array of dates, but I don't know how to pass each date to the SQL file linked in the BigQueryOperator.
My SQL file is 300 lines long, so I have included a simple example below, as people seem to ask for one.
DAG
with DAG(
    dag_id,
    schedule_interval='0 12 * * *',
    start_date=datetime(2021, 1, 1),
    template_searchpath='/opt/airflow/dags',
    catchup=False,
    user_defined_macros={"varsToPass": Var1}
) as dag:
    query_one = BigQueryOperator(
        task_id='query_one',
        sql='/sql/something.sql',
        use_legacy_sql=False,
        destination_dataset_table='table',
        write_disposition='WRITE_TRUNCATE'
    )
SQL file
SELECT * FROM table WHERE date = {{CHANGING_DATE}}

Your code is confusing: you describe a repeated pattern of today, today - 1 day, ..., today - 21 days, yet your code shows write_disposition = 'WRITE_TRUNCATE', which means only the LAST query matters because each query erases the result of the previous one. Since no more information was provided, I assume you actually mean to run a single query covering the range from today back to today - 21 days.
Also, you didn't mention whether the date you are referring to is the Airflow execution_date or today's date.
If it's execution_date, you don't need to pass any parameters. The SQL needs to be:
SELECT * FROM table
WHERE date BETWEEN {{ execution_date - macros.timedelta(days=21) }} AND {{ execution_date }}
If it's today, then you need to pass a parameter with params:
from datetime import datetime, timedelta

query_one = BigQueryOperator(
    task_id='query_one',
    sql='/sql/something.sql',
    use_legacy_sql=False,
    destination_dataset_table='table',
    write_disposition='WRITE_TRUNCATE',
    params={
        "end": datetime.utcnow().strftime('%Y-%m-%d'),
        "start": (datetime.utcnow() - timedelta(days=21)).strftime('%Y-%m-%d')
    }
)
Then in the SQL you can use it as:
SELECT * FROM table
WHERE date BETWEEN '{{ params.start }}' AND '{{ params.end }}'
I'd like to point out that if you are not using execution_date, then I don't see the value of passing the date from Airflow. You can just do it directly in BigQuery by setting the query to:
SELECT *
FROM table
WHERE date BETWEEN DATE_SUB(current_date(), INTERVAL 21 DAY) AND current_date()
If my assumption was incorrect and you want to run 21 separate queries, then you can do that with a loop as you described:
from datetime import datetime, timedelta
from airflow.contrib.operators.bigquery_operator import BigQueryOperator

a = []
for i in range(0, 21):
    a.append(
        BigQueryOperator(
            task_id=f'query_{i}',
            sql='/sql/something.sql',
            use_legacy_sql=False,
            destination_dataset_table='table',
            write_disposition='WRITE_TRUNCATE',  # This is probably wrong, I just copied it from your code.
            params={
                "date_value": (datetime.now() - timedelta(days=i)).strftime('%Y-%m-%d')
            }
        )
    )
    if i != 0:
        a[i - 1] >> a[i]
Then in your /sql/something.sql the query should be:
SELECT * FROM table WHERE date = '{{ params.date_value }}'
As mentioned, this will create a sequential workflow: query_0 >> query_1 >> ... >> query_20.
Note also that BigQueryOperator is deprecated. You should use BigQueryExecuteQueryOperator, which is available in the Google provider via
from airflow.providers.google.cloud.operators.bigquery import BigQueryExecuteQueryOperator
For more information about how to install the Google provider, please see the 2nd part of the following answer.
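As a rough illustration, the first operator from the question could be swapped over like this (a minimal sketch, assuming the apache-airflow-providers-google package is installed; the task id, SQL path, and destination table are just the placeholders used above):

from airflow.providers.google.cloud.operators.bigquery import BigQueryExecuteQueryOperator

query_one = BigQueryExecuteQueryOperator(
    task_id='query_one',
    sql='/sql/something.sql',             # same templated SQL file as before
    use_legacy_sql=False,
    destination_dataset_table='table',    # placeholder destination table
    write_disposition='WRITE_TRUNCATE',
    params={"date_value": '2021-07-14'},  # example literal; in the loop this would be the computed date
)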

Related

Previous day of current_date() in Hue hive

I'm trying to write a query to return records from a column called "businessdate" (which is in the YYYY-MM-DD format and is a string data type), using the previous day's "businessdate" records, and I am stuck. I've tried different things, and I get error messages about arguments or matching methods, etc. It's probably simple, and I feel dumb. Please help!
Query:
Select businessdate from dbname
Where businessdate = current_date - 1
Use the date_sub function to subtract one day from current_date:
select businessdate
from dbname.tablename
where businessdate = date_sub(current_date,1)

Compare prev_execution_date in Airflow to timestamp in BigQuery using SQL

I am trying to insert data from one BigQuery table to another using an Airflow DAG. I want to filter data such that the updateDate in my source table is greater than the previous execution date of my DAG run.
The updateDate in my source table looks like this: 2021-04-09T20:11:11Z and is of STRING data type, whereas prev_execution_date looks like this: 2021-04-10T11:00:00+00:00, which is why I am trying to convert my updateDate to TIMESTAMP first and then to ISO format as shown below.
SELECT *
FROM source_table
WHERE FORMAT_TIMESTAMP("%Y-%m-%dT%X%Ez", TIMESTAMP(UpdateDate)) > TIMESTAMP('{{ prev_execution_date }}')
But I am getting the error message: No matching signature for operator > for argument types: STRING, TIMESTAMP. Supported signature: ANY > ANY. Clearly the left hand side of my WHERE-clause above is of type STRING. How can I convert it to TIMESTAMP or to a correct format for that matter to be able to compare to prev_execution_date?
I have also tried with the following:
WHERE FORMAT_TIMESTAMP("%Y-%m-%dT%X%Ez", TIMESTAMP(UpdatedWhen)) > STRING('{{ prev_execution_date }}')
which results in the error message: Could not cast literal "2021-04-11T11:50:31.284349+00:00" to type DATE
I would appreciate some help regarding how to write my BigQuery SQL query to compare the String timestamp to previous execution date of Airflow DAG.
You probably wanted to try parse_timestamp instead:
SELECT *
FROM source_table
WHERE PARSE_TIMESTAMP("%Y-%m-%dT%X%Ez", UpdateDate) > TIMESTAMP('{{ prev_execution_date }}')
although it looks like it will work even without it:
SELECT *
FROM source_table
WHERE TIMESTAMP(UpdateDate) > TIMESTAMP('{{ prev_execution_date }}')

Shell script to delete records from Oracle DB table

I have a situation where I need to run a shell script every 30 days to delete records from an Oracle database table.
The table has a column 'updated_date'. I need to write a query to delete records where 'updated_date' is less than the current date minus 30 days.
In Java I can do this comfortably, but how do I calculate the dynamic date and pass it to the SQL query in a Unix shell script?
Can someone please help me?
You could use the delete statement suggested by Littlefoot to utilize the system date from within the database.
But since you asked how to calculate the dynamic date and pass it to the SQL query in a Unix shell script, this is how you can do it.
First, use the date command to get the current date in 'yyyy-mm-dd' format.
export dt=$(date +%Y-%m-%d)
You may then use the date variable within your shell script in sqlplus.
sqlplus -s userid/passwd@db_server<<EOF
delete from yourtable
where updated_date < DATE '${dt}' - 30;
commit;
exit
EOF
That would be something like this:
delete from your_table
where updated_date < trunc(sysdate) - 30;
SYSDATE is a function that returns the current date (and time - that's why I used the TRUNC function, which removes the time component). It also means that you don't have to calculate the date and pass it to the SQL query - Oracle knows it itself.
Though, note that SYSDATE returns the database server's date, so - if you work across different time zones - you might need to take that fact into account.
Also, "minus 30 days": is it always 30 days, or did you actually mean "minus 1 month"? If so, you'd change the condition to
where updated_date < add_months(trunc(sysdate), -1)
and it'll take care of the number of days in a month (30, 31; including February as well as leap years).

How to properly handle Daylight Savings Time in Apache Airflow?

In Airflow, everything is supposed to be UTC (which is not affected by DST).
However, we have workflows that deliver things based on time zones that are affected by DST.
An example scenario:
We have a job scheduled with a start date at 8:00 AM Eastern and a schedule interval of 24 hours.
Every day at 8 AM Eastern the scheduler sees that it has been 24 hours since the last run, and runs the job.
DST happens and we lose an hour.
Today at 8 AM Eastern the scheduler sees that it has only been 23 hours (because the time on the machine is UTC), and doesn't run the job until 9 AM Eastern, which is a late delivery.
Is there a way to schedule dags so they run at the correct time after a time change?
Off the top of my head:
If your machine is timezone-aware, set up your DAG to run at both 8 AM EST and 8 AM EDT in UTC - something like 0 12,13 * * *. Have the first task be a ShortCircuitOperator. Then use something like pytz to localize the current time. If it is within your required time, continue (i.e. run the DAG); otherwise, return False. You'll have a tiny overhead of 2 extra tasks per day, but the latency should be minimal as long as your machine isn't overloaded.
sloppy example:
from datetime import datetime
from pytz import utc, timezone
from airflow.operators.python_operator import ShortCircuitOperator  # Airflow 1.x import path
# ...

def is8AM(**kwargs):
    ti = kwargs["ti"]
    curtime = utc.localize(datetime.utcnow())
    # If you want to use the exec date:
    # curtime = utc.localize(ti.execution_date)
    eastern = timezone('US/Eastern')  # From docs, check your local names
    loc_dt = curtime.astimezone(eastern)
    if loc_dt.hour == 8:
        return True
    return False

start_task = ShortCircuitOperator(
    task_id='check_for_8AM',
    python_callable=is8AM,
    provide_context=True,
    dag=dag
)
Hope this is helpful
Edit: the runtimes were wrong - I subtracted instead of adding. Additionally, due to how runs are launched, you'll probably end up wanting to schedule for 7 AM with an hourly schedule if you want them to run at 8.
We used @apathyman's solution, but instead of a ShortCircuitOperator we just used a PythonOperator that fails if it's not the hour we want, with a retry after a timedelta of 1 hour.
That way we have only 1 run per day instead of 2, and the schedule interval is set to run only on the first hour.
So basically, something like this (most code taken from the above answer, thanks @apathyman):
from datetime import datetime
from datetime import timedelta
from pytz import utc, timezone
from airflow.operators.python_operator import PythonOperator  # Airflow 1.x import path

def is8AM(**kwargs):
    ti = kwargs["ti"]
    curtime = utc.localize(datetime.utcnow())
    # If you want to use the exec date:
    # curtime = utc.localize(ti.execution_date)
    eastern = timezone('US/Eastern')  # From docs, check your local names
    loc_dt = curtime.astimezone(eastern)
    if loc_dt.hour == 8:
        return True
    exit("Not the time yet, wait 1 hour")

start_task = PythonOperator(
    task_id='check_for_8AM',
    python_callable=is8AM,
    provide_context=True,
    retries=1,
    retry_delay=timedelta(hours=1),
    dag=dag
)
This question was asked when Airflow was on version 1.8.x.
This functionality is built in now, as of Airflow 1.10:
https://airflow.apache.org/timezone.html
Set the timezone in airflow.cfg and DST should be handled correctly.
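For example, a timezone-aware DAG along the lines of the sketch below (following the linked docs; the dag_id and schedule here are placeholders, and Airflow 1.10+ with pendulum is assumed) will keep firing at 8 AM Eastern across DST transitions, because cron schedules are evaluated in the DAG's own timezone:

import pendulum
from datetime import datetime
from airflow import DAG

local_tz = pendulum.timezone("US/Eastern")

dag = DAG(
    dag_id="eastern_8am_job",                          # placeholder name
    start_date=datetime(2021, 1, 1, tzinfo=local_tz),  # timezone-aware start date
    schedule_interval="0 8 * * *",                     # 8 AM US/Eastern, winter and summer
)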
I believe we just need a PythonOperator to handle this case.
If the DAG needs to run in a timezone that observes DST (for example America/New_York, Europe/London, Australia/Sydney), then below are the workaround steps I can think of:
Convert the DAG schedule to UTC.
Because the timezone has DST, we need to choose the bigger (daylight) offset when doing the conversion. For example:
With America/New_York: we must use the offset -4, so the schedule */10 11-13 * * 1-5 will be converted to */10 15-17 * * 1-5.
With Europe/London: we must use the offset +1, so the schedule 35 */4 * * * will be converted to 35 3-23/4 * * *.
With Australia/Sydney: we must use the offset +11, so the schedule 15 8,9,12,18 * * * will be converted to 15 21,22,1,7 * * *.
Use a PythonOperator to add a task before all the main tasks. This task checks whether the current time is in DST for the specified timezone; if it is not, the task sleeps for 1 hour.
This way we can handle the case of a DST timezone.
import time
import pytz
from datetime import datetime, timedelta
from airflow.operators.python_operator import PythonOperator  # Airflow 1.x import path

def is_DST(zonename):
    tz = pytz.timezone(zonename)
    now = pytz.utc.localize(datetime.utcnow())
    return now.astimezone(tz).dst() != timedelta(0)

def WQ_DST_handler(TZ, **kwargs):
    if is_DST(TZ):
        print('Currently it is daylight saving time (DST) in {0}, will proceed to the next task now'.format(TZ))
    else:
        print('Currently it is not daylight saving time (DST) in {0}, will sleep 1 hour...'.format(TZ))
        time.sleep(60 * 60)

DST_handler = PythonOperator(
    task_id='DST_handler',
    python_callable=WQ_DST_handler,
    op_kwargs={'TZ': TZ_of_dag},  # TZ_of_dag: the DST timezone string for this DAG, e.g. 'America/New_York'
    dag=dag
)
DST_handler >> main_tasks
This workaround has a disadvantage: for any DAG that needs to run in a DST timezone, we have to create one additional task (DST_handler in the above example), and this task still needs to be sent to the worker nodes to execute (although it is almost just a sleep command).

BigQuery UDF support

I am very new to BigQuery by Google.
I want to parse a timestamp (yyyy/mm/dd:hh:mm:ss) based on the day and the month, and I wish to bucket days into weeks.
I didn't find any BigQuery function which does this.
Hence, I was wondering if there is a way in which I can write a UDF and then access it in a BigQuery query.
There are two questions here, so two answers:
BigQuery does support UDFs: docs. (It didn't when I first answered this.)
Even without UDFs, the date bucketing is still doable. BigQuery has one time-parsing function, PARSE_UTC_USEC, which expects input in the form YYYY-MM-DD hh:mm:ss. You'll need to use REGEXP_REPLACE to get your date into the right format. Once you've done that, UTC_USEC_TO_WEEK will bucket things into weeks, and you can group by that. So, tying all that together, if your table has a column called timestamp, you could get counts by week via something like
SELECT week, COUNT(week)
FROM (SELECT UTC_USEC_TO_WEEK(
PARSE_UTC_USEC(
REGEXP_REPLACE(
timestamp,
r"(\d{4})/(\d{2})/(\d{2}):(\d{2}):(\d{2}):(\d{2})",
r"\1-\2-\3 \4:\5:\6")), 0) AS week
FROM mytable)
GROUP BY week;
Note that the 0 here is the argument for which day of the week to use as the "beginning"; I've used Sunday, but for "business"-y things using 1 (i.e. Monday) would likely make more sense.
Just in case you need it, the section on timestamp functions in the docs is helpful.
UDF support in BigQuery is now here! https://cloud.google.com/bigquery/user-defined-functions
Here is some code that will convert a string time specifier into a JavaScript Date object, and extract some properties from it; see https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Date for information on properties available for JS dates.
QUERY (replace the nested select with your table):
SELECT day_of_week, month_date
FROM parseDate(select '2015/08/01 12:00:00' as date_string);
CODE:
function parsedate(row, emit) {
var d = new Date(row.date_string);
emit({day_of_week: d.getDay(),
month_date: d.getDate()});
}
bigquery.defineFunction(
'parseDate', // Name of the function exported to SQL
['date_string'], // Names of input columns
[{'name': 'day_of_week', 'type': 'integer'},
{'name': 'month_date', 'type': 'integer'}],
parsedate
);