I want to schedule a Google Cloud BigQuery stored procedure in Apache Airflow, but I did not find anything about this in the Airflow documentation. Which operator should I use to schedule a BigQuery stored procedure from Apache Airflow? Could you show me an example? Thank you so much.
https://airflow.apache.org/docs/apache-airflow-providers-google/stable/operators/cloud/bigquery.html#execute-bigquery-jobs
The BigQueryInsertJobOperator is the operator to use in your DAG to execute SQL in BigQuery -- and calling a stored procedure is just a SQL statement.
For example:
from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator

# Submits a query job that calls the stored procedure.
call_stored_procedure = BigQueryInsertJobOperator(
    task_id="call_stored_procedure",
    configuration={
        "query": {
            "query": "CALL `project_id.dataset.stored_procedure_name`(arg1, arg2);",
            "useLegacySql": False,
        }
    },
    location=location,  # the location of your dataset, e.g. "US"
)
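To actually schedule the call, put the task inside a DAG with a schedule. A minimal sketch, assuming Airflow 2.x with the apache-airflow-providers-google package installed; the DAG id, cron schedule, and procedure name are placeholders, and on recent Airflow versions the parameter is named schedule rather than schedule_interval:

import pendulum
from airflow import DAG
from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator

with DAG(
    dag_id="call_bq_stored_procedure",            # placeholder DAG name
    schedule_interval="0 6 * * *",                # run daily at 06:00 UTC
    start_date=pendulum.datetime(2023, 1, 1, tz="UTC"),
    catchup=False,
) as dag:
    call_stored_procedure = BigQueryInsertJobOperator(
        task_id="call_stored_procedure",
        configuration={
            "query": {
                "query": "CALL `project_id.dataset.stored_procedure_name`(arg1, arg2);",
                "useLegacySql": False,
            }
        },
        location="US",                            # match your dataset's location
    )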
I need to create procedural logic over data stored in AWS S3, using Athena or Glue.
I am actually migrating a stored procedure from SQL Server to AWS, but I don't know which AWS service to use or where to do it, since there is no database involved, only S3 tables.
Thank you very much for guiding me on how to do it.
Athena doesn't support stored procedures; however, you can leverage UDFs to implement the same logic as in your source stored procedure.
Below is the syntax for a UDF; refer to the Athena UDF documentation for more information:
USING EXTERNAL FUNCTION UDF_name(variable1 data_type[, variable2 data_type][,...])
RETURNS data_type
LAMBDA 'lambda_function'
SELECT [...] UDF_name(expression) [...]
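As a rough illustration of invoking such a UDF, the query can be submitted from Python with boto3; the UDF name, the backing Lambda (which must implement the Athena UDF interface), database, table, and result bucket below are hypothetical placeholders:

import boto3

athena = boto3.client("athena", region_name="us-east-1")

# Hypothetical UDF "clean_text" backed by a Lambda function named "my-udf-lambda".
query = """
USING EXTERNAL FUNCTION clean_text(value VARCHAR)
RETURNS VARCHAR
LAMBDA 'my-udf-lambda'
SELECT clean_text(raw_column)
FROM my_table
"""

response = athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": "my_database"},
    ResultConfiguration={"OutputLocation": "s3://my-query-results/"},
)
print(response["QueryExecutionId"])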
I have a BQ script stored as a "Saved Query". I was wondering if I can execute it using the bq command-line utility, but I could not find anything relevant in the documentation.
"Saved queries" are only accessible through the console and are not accessible through the API.
An approach that might suit you better would be to use Scripts and stored procedures.
That way you define your SQL routine as a stored procedure, e.g. mydataset.myprocedure, and use "CALL mydataset.myprocedure()" to run it.
With bq it's then simply:
bq query --use_legacy_sql=false 'CALL mydataset.myprocedure()'
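If you later want to do the same from code instead of the bq CLI, here is a minimal sketch with the google-cloud-bigquery Python client; the dataset, procedure name, and procedure body are placeholders:

from google.cloud import bigquery

client = bigquery.Client()

# One-time setup: store the body of your saved query as a stored procedure.
client.query("""
CREATE OR REPLACE PROCEDURE mydataset.myprocedure()
BEGIN
  -- paste the body of your saved query here
  SELECT 1;
END
""").result()

# Then run it whenever needed (equivalent to the bq command above).
client.query("CALL mydataset.myprocedure()").result()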
We are trying to get the maximum scheduled trigger time from the list of scheduled triggers in ADF.
We have one ADF pipeline which has multiple scheduled triggers. The pipeline runs at 6:10, 6:20, 6:30, 6:40, ... until 10 AM UTC, at a gap of 10 minutes. Is there any way to get the maximum of the scheduled triggers, i.e. 10 AM UTC in my case?
We have tried several system variables, but none worked. We might take an API approach to get the job done, but I would prefer to stay native to the ADF world.
You could use the ADF REST API: Trigger Runs - Query By Factory.
In the request body, define the lastUpdatedAfter and lastUpdatedBefore properties, like the example below:
{
    "lastUpdatedAfter": "2018-06-16T00:36:44.3345758Z",
    "lastUpdatedBefore": "2018-06-16T00:49:48.3686473Z",
    "filters": [
        {
            "operand": "TriggerName",
            "operator": "Equals",
            "values": [
                "exampleTrigger"
            ]
        }
    ]
}
Then loop over the trigger runs in the response and take the maximum trigger time, as in the sketch below.
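A rough sketch of that loop from Python, using azure-identity for the management token; the subscription, resource group, factory, and trigger names are placeholders, and the triggerRunTimestamp field name is taken from the documented Trigger Runs API response, so adjust it if your response differs:

import requests
from azure.identity import DefaultAzureCredential

subscription, resource_group, factory = "<subscription-id>", "<resource-group>", "<factory-name>"
url = (
    f"https://management.azure.com/subscriptions/{subscription}"
    f"/resourceGroups/{resource_group}/providers/Microsoft.DataFactory"
    f"/factories/{factory}/queryTriggerRuns?api-version=2018-06-01"
)

token = DefaultAzureCredential().get_token("https://management.azure.com/.default").token
body = {
    "lastUpdatedAfter": "2018-06-16T00:00:00Z",
    "lastUpdatedBefore": "2018-06-16T23:59:59Z",
    "filters": [
        {"operand": "TriggerName", "operator": "Equals", "values": ["exampleTrigger"]}
    ],
}

runs = requests.post(url, json=body, headers={"Authorization": f"Bearer {token}"}).json()["value"]
# Each trigger run carries its fire time; keep the latest one.
max_trigger_time = max(run["triggerRunTimestamp"] for run in runs)
print(max_trigger_time)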
We might take an API approach to get the job done
You could use an Azure HTTP-triggered Function, or use a Web Activity in ADF to call the specific API.
I am trying to solve the problem below using Oozie. Any suggestions about a solution are much appreciated.
Background: I developed code to import data from a SQL database using Oozie (Sqoop import), did some transformations, and loaded the data into Hive. Now I need to do a count check between SQL and Hive for reconciliation.
Is there any way I can do that using Oozie?
I am thinking about executing the SQL query using "sqoop eval" and the Hive query using a "hive action" from Oozie, but I am wondering how we can get the results back to Oozie / capture the results after the query execution.
Once the results are available, I need to do the reconciliation in a subsequent action.
I implemented it using a PySpark action, computing the sqoop-eval-style source count and the Hive DataFrame count there. It's working fine.
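A minimal sketch of that kind of reconciliation inside a PySpark job; the JDBC URL, credentials, and table names are placeholders, and the source count is read over JDBC rather than literally through sqoop eval:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("count-reconciliation").enableHiveSupport().getOrCreate()

# Count on the source SQL Server table, pushed down as a JDBC subquery.
# Requires the SQL Server JDBC driver on the Spark classpath.
source_count = (
    spark.read.format("jdbc")
    .option("url", "jdbc:sqlserver://my-host:1433;databaseName=my_db")
    .option("dbtable", "(SELECT COUNT(*) AS cnt FROM dbo.my_table) t")
    .option("user", "my_user")
    .option("password", "my_password")
    .load()
    .collect()[0]["cnt"]
)

# Count on the Hive table that Sqoop loaded.
hive_count = spark.sql("SELECT COUNT(*) AS cnt FROM my_hive_db.my_table").collect()[0]["cnt"]

if source_count != hive_count:
    # Failing the job makes the Oozie action fail, so a downstream action can react.
    raise SystemExit(f"Reconciliation failed: source={source_count}, hive={hive_count}")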
I'm new to Google Cloud and would like to know the best way to schedule queries and export the results to Google Cloud Storage.
I've seen documentation on how to manually export data, but couldn't find anything specific on doing it in an automated way.
Is there a best way to approach this?
Thanks
It is possible to create scheduled export jobs with the scheduled queries feature and the EXPORT DATA statement. For example, the script below backs up data daily to GCS as Parquet files with SNAPPY compression. Each time the job is executed, it takes all the data from the day before.
DECLARE backup_date DATE DEFAULT DATE_SUB(@run_date, INTERVAL 1 DAY);  -- @run_date is supplied by BigQuery scheduled queries
EXPORT DATA
OPTIONS ( uri = CONCAT('gs://my-bucket/', CAST(backup_date AS STRING), '/*.parquet'),
format='PARQUET',
compression='SNAPPY',
overwrite=FALSE ) AS
SELECT
*
FROM
`my-project.my-dataset.my-table`
WHERE
DATE(timestamp) = backup_date
From the BigQuery UI you can then create a scheduled query and set the trigger frequency and trigger time.
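If you would rather create the scheduled query programmatically than through the UI, here is a rough sketch with the BigQuery Data Transfer Python client (scheduled queries run on the Data Transfer Service); the project id, display name, and schedule are placeholders, and the query string is the EXPORT DATA script above:

from google.cloud import bigquery_datatransfer

client = bigquery_datatransfer.DataTransferServiceClient()

export_sql = """
DECLARE backup_date DATE DEFAULT DATE_SUB(@run_date, INTERVAL 1 DAY);
EXPORT DATA OPTIONS (
  uri = CONCAT('gs://my-bucket/', CAST(backup_date AS STRING), '/*.parquet'),
  format = 'PARQUET', compression = 'SNAPPY', overwrite = FALSE
) AS
SELECT * FROM `my-project.my-dataset.my-table` WHERE DATE(timestamp) = backup_date
"""

transfer_config = bigquery_datatransfer.TransferConfig(
    display_name="daily-gcs-backup",
    data_source_id="scheduled_query",   # scheduled queries run via the Data Transfer Service
    params={"query": export_sql},
    schedule="every day 06:00",
)

transfer_config = client.create_transfer_config(
    parent=client.common_project_path("my-project"),
    transfer_config=transfer_config,
)
print(f"Created scheduled query: {transfer_config.name}")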
Implement your table export function [1] using Node.js, Python, or Go; these languages are supported by both Cloud Functions and BigQuery (a minimal Python sketch follows these steps).
Deploy the above function to the Cloud Functions [2] service, which is an event-driven serverless compute platform.
Trigger the above function using Cloud Scheduler [3] with a schedule interval of your choice, specified as a cron expression. The scheduler will trigger the function via a REST call to the function's endpoint.
Verify the success of the above operation by visiting your bucket and ensuring that the table(s) have been successfully exported.
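A minimal sketch of such an export function in Python, using the BigQuery client's extract job; the project, dataset, table, bucket, and entry-point names are placeholders:

from google.cloud import bigquery

def export_table_to_gcs(request):
    """HTTP-triggered entry point: exports one table to GCS as compressed CSV."""
    client = bigquery.Client()

    job_config = bigquery.ExtractJobConfig()
    job_config.destination_format = "CSV"
    job_config.compression = "GZIP"

    extract_job = client.extract_table(
        "my-project.my-dataset.my-table",            # source table
        "gs://my-bucket/exports/my-table-*.csv.gz",  # destination URI pattern
        job_config=job_config,
        location="US",                               # location of the source dataset
    )
    extract_job.result()  # wait for the export job to finish
    return "export complete"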