BQ client.load_table_from_uri behaves differently in standalone Python vs. Airflow - google-bigquery

I am trying to load a CSV into BQ using a custom operator in Airflow.
My custom operator uses:
load_job_config = bigquery.LoadJobConfig(
    schema=self.schema_fields,
    skip_leading_rows=self.skip_leading_rows,
    source_format=bigquery.SourceFormat.CSV
)
load_job = client.load_table_from_uri(
    'gs://' + self.source_bucket + '/' + self.source_object,
    self.dsp_tmp_dataset_table,
    job_config=load_job_config
)
The issue I am facing is that I always get this error:
google.api_core.exceptions.BadRequest: 400 Provided Schema does not match Table nonprod-cloud-composer:dsp_data_transformation.tremorvideo_daily_datafeed. Field Date has changed type from TIMESTAMP to DATE
The exact same code works fine when run outside of Airflow as a standalone Python script.
I am using exactly the same schema object and the same source CSV file; only the environment is different.
Below are the high-level steps I followed:
1. Created the table in BQ.
2. Loaded the data using
   LOAD DATA OVERWRITE XXXX
   FROM FILES (
     format = 'CSV',
     uris = ['gs://xxx.csv']);
   This worked fine and the data was loaded into the table.
3. Truncated the table and ran the custom operator that contains the code listed above. This is where I hit the errors.
4. Created a simple Python program to test the BQ load job, and that works fine too.
It's just that whenever the same load job is triggered through Airflow, the schema detection fails and leads to all sorts of errors.
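For reference, a minimal standalone sketch of what such a load boils down to (the field list, bucket name, and source object are illustrative only, since the actual self.schema_fields is not shown above; the destination table is taken from the error message):
from google.cloud import bigquery

client = bigquery.Client()

# Illustrative schema only; the real self.schema_fields is not shown above.
schema_fields = [
    bigquery.SchemaField('Date', 'DATE'),
    bigquery.SchemaField('Impressions', 'INTEGER'),
]

load_job_config = bigquery.LoadJobConfig(
    schema=schema_fields,
    skip_leading_rows=1,
    source_format=bigquery.SourceFormat.CSV,
    autodetect=False,  # rely on the explicit schema rather than detection
)

load_job = client.load_table_from_uri(
    'gs://my-bucket/daily_datafeed.csv',  # placeholder source object
    'dsp_data_transformation.tremorvideo_daily_datafeed',  # table from the error message
    job_config=load_job_config,
)
load_job.result()  # wait for the load to finish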

Related

Hudi errors with 'DELETE is only supported with v2 tables.'

I'm trying out Hudi, Delta Lake, and Iceberg in the AWS Glue v3 engine (Spark 3.1) and have both Delta Lake and Iceberg running just fine end to end using a test pipeline I built with test data. Note I am not using any of the Glue Custom Connectors. I'm using PySpark and standard Spark code (not the Glue classes that wrap the standard Spark classes).
For Hudi, the install of the Hudi jar is working fine, as I'm able to write the table in the Hudi format, create the table DDL in the Glue Catalog just fine, and read it via Athena. However, when I try to run a CRUD statement on the newly created table, I get errors. For example, trying to run a simple DELETE SparkSQL statement, I get the error: 'DELETE is only supported with v2 tables.'
I've added the following jars when building the SparkSession:
org.apache.hudi:hudi-spark3.1-bundle_2.12:0.11.0
com.amazonaws:aws-java-sdk:1.10.34
org.apache.hadoop:hadoop-aws:2.7.3
And I set the following config for the SparkSession:
self.config('spark.serializer', 'org.apache.spark.serializer.KryoSerializer')
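Put together, a session built with those packages and that config looks roughly like this (a sketch; only the jar coordinates and the serializer setting above are from my setup, the app name is arbitrary):
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName('hudi-glue-test')  # arbitrary name
    # The three jars listed above, resolved via spark.jars.packages
    .config('spark.jars.packages', ','.join([
        'org.apache.hudi:hudi-spark3.1-bundle_2.12:0.11.0',
        'com.amazonaws:aws-java-sdk:1.10.34',
        'org.apache.hadoop:hadoop-aws:2.7.3',
    ]))
    .config('spark.serializer', 'org.apache.spark.serializer.KryoSerializer')
    .getOrCreate()
)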
I've tried many different versions of writing the data/creating the table including:
hudi_options = {
    'hoodie.datasource.write.table.type': 'COPY_ON_WRITE',
    'hoodie.table.version': 2,
    'hoodie.table.name': 'db.table_name',
    'hoodie.datasource.write.recordkey.field': 'id',  # key is required in the table
    'hoodie.datasource.write.partitionpath.field': '',
    'hoodie.datasource.write.table.name': 'db.table_name',
    'hoodie.datasource.write.operation': 'upsert',
    'hoodie.datasource.write.precombine.field': 'date_modified',
    'hoodie.upsert.shuffle.parallelism': 2,
    'hoodie.insert.shuffle.parallelism': 2
}
df.write \
    .format('hudi') \
    .options(**hudi_options) \
    .mode('overwrite') \
    .save('s3://...')
sql = f"""CREATE TABLE {FULL_TABLE_NAME}
USING {DATA_FORMAT}
options (
    type = 'cow',
    primaryKey = 'id',
    preCombineField = 'date_modified',
    partitionPathField = '',
    hoodie.table.name = 'db.table_name',
    hoodie.datasource.write.recordkey.field = 'id',
    hoodie.datasource.write.precombine.field = 'date_modified',
    hoodie.datasource.write.partitionpath.field = '',
    hoodie.table.version = 2
)
LOCATION '{WRITE_LOC}'
AS SELECT * FROM {SOURCE_VIEW};"""
spark.sql(sql)
The above works fine. It's when I try to run a CRUD operation on the table created above that I get errors. For instance, I try deleting records via the SparkSQL DELETE statement and get the error 'DELETE is only supported with v2 tables.'. I can't figure out why it's complaining about not being a v2 table. Any clues would be hugely appreciated.

How to invoke an on-demand bigquery Data transfer service?

I really liked BigQuery's Data Transfer Service. I have flat files in the exact schema waiting to be loaded into BQ. It would have been awesome to just set up a DTS schedule that picked up GCS files matching a pattern and loaded them into BQ. I like the built-in options to delete source files after copy and to email in case of trouble. But the biggest bummer is that the minimum interval is 60 minutes. That is crazy. I could have lived with a 10-minute delay, perhaps.
So if I set up the DTS to be on demand, how can I invoke it from an API? I am thinking of creating a cron job that calls it on demand every 10 minutes, but I can't figure out from the docs how to call it.
Also, what is my second-best, most reliable and cheapest way of moving GCS files (no ETL needed) into BQ tables that match the exact schema? Should I use Cloud Scheduler, Cloud Functions, Dataflow, Cloud Run, etc.?
If I use a Cloud Function, how can I submit all the files in my GCS bucket at invocation time as one BQ load job?
Lastly, does anyone know if DTS will lower the limit to 10 minutes in the future?
So if I set up the DTS to be on demand, how can I invoke it from an API? I am thinking of creating a cron job that calls it on demand every 10 minutes, but I can't figure out from the docs how to call it.
StartManualTransferRuns is part of the RPC library but does not have a REST API equivalent as of now. How to use that will depend on your environment. For instance, you can use the Python Client Library (docs).
As an example, I used the following code (you'll need to run pip install google-cloud-bigquery-datatransfer for the dependencies):
import time
from google.cloud import bigquery_datatransfer_v1
from google.protobuf.timestamp_pb2 import Timestamp
client = bigquery_datatransfer_v1.DataTransferServiceClient()
PROJECT_ID = 'PROJECT_ID'
TRANSFER_CONFIG_ID = '5e6...7bc' # alphanumeric ID you'll find in the UI
parent = client.project_transfer_config_path(PROJECT_ID, TRANSFER_CONFIG_ID)
start_time = bigquery_datatransfer_v1.types.Timestamp(seconds=int(time.time() + 10))
response = client.start_manual_transfer_runs(parent, requested_run_time=start_time)
print(response)
Note that you'll need to use the right Transfer Config ID and the requested_run_time has to be of type bigquery_datatransfer_v1.types.Timestamp (for which there was no example in the docs). I set a start time 10 seconds ahead of the current execution time.
You should get a response such as:
runs {
  name: "projects/PROJECT_NUMBER/locations/us/transferConfigs/5e6...7bc/runs/5e5...c04"
  destination_dataset_id: "DATASET_NAME"
  schedule_time {
    seconds: 1579358571
    nanos: 922599371
  }
  ...
  data_source_id: "google_cloud_storage"
  state: PENDING
  params {
    ...
  }
  run_time {
    seconds: 1579358581
  }
  user_id: 28...65
}
and the transfer is triggered as expected.
Also, what is my second-best, most reliable and cheapest way of moving GCS files (no ETL needed) into BQ tables that match the exact schema? Should I use Cloud Scheduler, Cloud Functions, Dataflow, Cloud Run, etc.?
With this you can set a cron job to execute your function every ten minutes. As discussed in the comments, the minimum interval is 60 minutes so it won't pick up files less than one hour old (docs).
Apart from that, this is not a very robust solution, and this is where your follow-up questions come into play. I think these might be too broad to address in a single StackOverflow question, but I would say that, for on-demand refresh, Cloud Scheduler + Cloud Functions/Cloud Run can work very well.
Dataflow would be best if you needed ETL but it has a GCS connector that can watch a file pattern (example). With this you would skip the transfer, set the watch interval and the load job triggering frequency to write the files into BigQuery. VM(s) would be running constantly in a streaming pipeline as opposed to the previous approach but a 10-minute watch period is possible.
If you have complex workflows/dependencies, Airflow has recently introduced operators to start manual runs.
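As a rough sketch of that last option, using the operator from the Google provider package (the DAG settings and the transfer config ID here are placeholders, not values from the question):
from datetime import datetime
from airflow import DAG
from airflow.providers.google.cloud.operators.bigquery_dts import (
    BigQueryDataTransferServiceStartTransferRunsOperator,
)

with DAG(
    dag_id='dts_manual_run',
    schedule_interval='*/10 * * * *',  # the 10-minute cadence discussed above
    start_date=datetime(2020, 1, 1),   # placeholder
    catchup=False,
) as dag:
    start_transfer = BigQueryDataTransferServiceStartTransferRunsOperator(
        task_id='start_dts_run',
        project_id='PROJECT_ID',
        transfer_config_id='5e6...7bc',  # same placeholder ID style as in the snippet above
    )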
If I use a Cloud Function, how can I submit all the files in my GCS bucket at invocation time as one BQ load job?
You can use wildcards to match a file pattern when you create the transfer.
Also, this can be done on a file-by-file basis using Pub/Sub notifications for Cloud Storage to trigger a Cloud Function.
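A rough sketch of such a Cloud Function (assuming a google.storage.object.finalize trigger; the entry point name and the destination table are placeholders):
from google.cloud import bigquery

def load_new_file(event, context):
    # Triggered once per finalized object; the event carries the bucket and object name.
    client = bigquery.Client()
    uri = 'gs://{}/{}'.format(event['bucket'], event['name'])

    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
        write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    )
    # The files already match the table schema, so no transformation is needed.
    load_job = client.load_table_from_uri(uri, 'my_project.my_dataset.my_table', job_config=job_config)
    load_job.result()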
Lastly, does anyone know if DTS will lower the limit to 10 minutes in the future?
There is already a Feature Request here. Feel free to star it to show your interest and to receive updates.
Now you can easily trigger a manual run of a BigQuery Data Transfer using the REST API:
HTTP request
POST https://bigquerydatatransfer.googleapis.com/v1/{parent=projects/*/locations/*/transferConfigs/*}:startManualRuns
For the {parent=projects/*/locations/*/transferConfigs/*} part, open the CONFIGURATION tab of your transfer and use the resource name shown there.
More here:
https://cloud.google.com/bigquery-transfer/docs/reference/datatransfer/rest/v1/projects.locations.transferConfigs/startManualRuns
Following Guillem's answer and the API updates, this is my new code:
import time
from google.cloud.bigquery import datatransfer_v1
from google.protobuf.timestamp_pb2 import Timestamp
client = datatransfer_v1.DataTransferServiceClient()
config = '34y....654'
PROJECT_ID = 'PROJECT_ID'
TRANSFER_CONFIG_ID = config
parent = client.transfer_config_path(PROJECT_ID, TRANSFER_CONFIG_ID)
start_time = Timestamp(seconds=int(time.time()))
request = datatransfer_v1.types.StartManualTransferRunsRequest(
    {"parent": parent, "requested_run_time": start_time}
)
response = client.start_manual_transfer_runs(request, timeout=360)
print(response)
For this to work, you need to know the correct TRANSFER_CONFIG_ID.
In my case, I wanted to list all the BigQuery scheduled queries to get a specific ID. You can do it like this:
# Put your project ID here
PROJECT_ID = 'PROJECT_ID'
from google.cloud import bigquery_datatransfer_v1
bq_transfer_client = bigquery_datatransfer_v1.DataTransferServiceClient()
parent = bq_transfer_client.project_path(PROJECT_ID)
# Iterate over all results
for element in bq_transfer_client.list_transfer_configs(parent):
    # Print the display name of each scheduled query
    print(f'[Schedule Query Name]:\t{element.display_name}')
    # Print the full resource name (it contains the ID)
    print(f'[Name]:\t\t{element.name}')
    # Extract the ID
    TRANSFER_CONFIG_ID = element.name.split('/')[-1]
    print(f'[TRANSFER_CONFIG_ID]:\t\t{TRANSFER_CONFIG_ID}')
    # You can print the entire element for debugging purposes
    print(element)

Using Jinja template variables with BigQueryOperator in Airflow

I'm attempting to use the BigQueryOperator in Airflow by using a variable to populate the sql= attribute. The problem I'm running into is that the file extension is dropped when using Jinja variables. I've set up my code as follows:
dag = DAG(
    dag_id='data_ingest_dag',
    template_searchpath=['/home/airflow/gcs/dags/sql/'],
    default_args=DEFAULT_DAG_ARGS
)
bigquery_transform = BigQueryOperator(
    task_id='bq-transform',
    write_disposition='WRITE_TRUNCATE',
    sql="{{dag_run.conf['sql_script']}}",
    destination_dataset_table='{{dag_run.conf["destination_dataset_table"]}}',
    dag=dag
)
The passed variable contains the name of the SQL file stored in the separate SQL directory. If I pass the value as a static string, sql="example_file.sql", everything works fine. However, when I pass example_file.sql using a Jinja template variable, the file extension is automatically dropped and I receive this error:
BigQuery job failed.
Final error was: {u'reason': u'invalidQuery', u'message': u'Syntax error: Unexpected identifier "example_file" at [1:1]', u'location': u'query'}
Additionally, I've tried hardcoding ".sql" to the end of the variable, anticipating that the extension would be dropped. However, this causes the entire variable reference to be interpreted as a string.
How do you use variables to populate BigQueryOperator attributes?
Reading the BigQueryOperator docstring, it seems that you can provide the SQL statement in two ways:
1. As a string that can contain templating macros
2. As a reference to a file that can contain templating macros (the file contents, not the file name)
You cannot template the file name but only the SQL statement. In fact, your error message shows that BigQuery did not recognize the identifier "example_file". If you inspect the BigQuery history for the project which ran that query, you will see that the query string was "example_file.sql" which is not a valid SQL statement, thus the error.
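As a short illustration of those two options (a sketch only: the contrib import path assumes an older Airflow, the table name and transform.sql are placeholders, and the dag object is the one defined in the question):
from airflow.contrib.operators.bigquery_operator import BigQueryOperator  # Airflow 1.x import path

# Option 1: an inline SQL string; Jinja macros inside the string are rendered.
bq_inline = BigQueryOperator(
    task_id='bq-inline',
    sql="SELECT * FROM `{{ dag_run.conf['source_table'] }}`",  # placeholder query
    destination_dataset_table='my_project.my_dataset.my_table',  # placeholder
    write_disposition='WRITE_TRUNCATE',
    dag=dag,
)

# Option 2: a static .sql file name resolved via template_searchpath; the file's
# contents (not its name) are rendered with Jinja before the query runs.
bq_from_file = BigQueryOperator(
    task_id='bq-from-file',
    sql='transform.sql',  # placeholder file in /home/airflow/gcs/dags/sql/
    destination_dataset_table='my_project.my_dataset.my_table',  # placeholder
    write_disposition='WRITE_TRUNCATE',
    dag=dag,
)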

How do I skip header row using Python glcoud.bigquery client?

I have a daily GCP billing export file in CSV format containing GCP billing details. This export contains a header row. I've set up a load job as follows (summarized):
from google.cloud import bigquery
job = client.load_table_from_storage(job_name, dest_table, source_gs_file)
job.source_format = 'CSV'
job.skipLeadingRows=1
job.begin()
This job produces the error:
Could not parse 'Start Time' as a timestamp. Required format is YYYY-MM-DD HH:MM[:SS[.SSSSSS]]
This error means that it is still trying to parse the header row even though I specified skipLeadingRows=1. What am I doing wrong here?
You should use skip_leading_rows instead of skipLeadingRows when using the Python SDK.
skip_leading_rows: Number of rows to skip when reading data (CSV only).
Reference: https://googleapis.dev/python/bigquery/latest/generated/google.cloud.bigquery.job.LoadJobConfig.html
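For completeness, a minimal sketch of the snake_case parameter with the current client API that the reference above documents (the bucket, dataset, and table names are placeholders):
from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,  # snake_case, not skipLeadingRows
)
load_job = client.load_table_from_uri(
    'gs://my-bucket/billing_export.csv',  # placeholder source file
    'my_project.my_dataset.billing',      # placeholder destination table
    job_config=job_config,
)
load_job.result()  # wait for the load to finish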
I cannot reproduce this. I took the example you gave ("2017-02-04T00:00:00-08:00"), added 3 rows/timestamps to a csv file, uploaded it to GCS, and finally created an empty table in BigQuery with one column of type TIMESTAMP.
File contents:
2017-02-04T00:00:00-08:00
2017-02-03T00:00:00-08:00
2017-02-02T00:00:00-08:00
I then ran the example Python script found here, and it successfully loaded the file into the table:
Loaded 3 rows into timestamp_test:gcs_load_test.
def load_data_from_gcs(dataset_name, table_name, source):
    bigquery_client = bigquery.Client()
    dataset = bigquery_client.dataset(dataset_name)
    table = dataset.table(table_name)
    job_name = str(uuid.uuid4())
    job = bigquery_client.load_table_from_storage(job_name, table, source)
    job.begin()
    wait_for_job(job)
    print('Loaded {} rows into {}:{}.'.format(job.output_rows, dataset_name, table_name))

Making sure data is loaded

I use the following command to load data.
/home/bigquery/bq load --max_bad_record=30000 -F '^' company.junelog entry.gz country:STRING,telco_name:STRING,datetime:STRING, ...
It has happened that I got a non-zero return code but the data was still loaded. How do I make sure whether the command succeeded or not? Checking the return code does not seem to help. There are times when I loaded the same file again because I got an error, but the data was already available in BigQuery.
You can use bq show -j on the load job and check the job status.
If you are writing code to do the load and therefore don't know the job ID, you can pass your own job ID into the load operation (as long as it is unique), so you will know which job to check.
For instance you can run
/home/bigquery/bq load --job_id=some_unique_job_id --max_bad_record=30000 -F '^' company.junelog entry.gz country:STRING,telco_name:STRING,datetime:STRING, ...
then
/home/bigquery/bq show -j some_unique_job_id
Note if you are creating new tables for every load (as opposed to appending), you could use the write disposition WRITE_EMPTY to make sure you only did the load if the table was empty, thus preventing adding the same data twice. This isn't directly supported in bq.py, but you could use the underlying bigquery_client.py to make this call, or use the REST api directly.
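As a hedged sketch of that idea with the current Python client (rather than the bq.py internals mentioned above; the bucket and project names are placeholders): pass your own job ID and use WRITE_EMPTY so re-running the same load against a non-empty table fails instead of duplicating data.
from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    field_delimiter='^',  # matches the -F '^' flag in the bq command above
    write_disposition=bigquery.WriteDisposition.WRITE_EMPTY,  # fail if the table already has data
)
load_job = client.load_table_from_uri(
    'gs://my-bucket/entry.gz',     # placeholder source
    'my_project.company.junelog',  # placeholder project; table from the question
    job_id='some_unique_job_id',   # lets you check this exact job later
    job_config=job_config,
)
load_job.result()  # raises if the load failed
print(load_job.state, load_job.errors)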