BigQuery Scheduled Data Transfer throws "Incompatible table partitioning specification." Error - but error message is truncated - google-bigquery

I'm using the new BQ Data Transfer UI and upon scheduling a Data Transfer, the transfer fails.
The error message in Run History isn't terribly helpful as the error message seems truncated.
Incompatible table partitioning specification. Expects partitioning specification interval(type:hour), but input partitioning specification is ; JobID: xxxxxxxxxxxx
Note the part of the error that says..."but input partition specification is..." with nothing before the semicolon. Seems this error is truncated.
Some details about the run:
The run imports data from a CSV file located in a GCS Bucket on a nightly basis. Once successfully ingested the process will delete the file. The target table in BQ is a partitioned table using the default partition pseudo column (_PARTITIONTIME)
What I have done so far:
Reran the scheduled Data Transfer -- which failed and threw the same error
Deleted the target table in BQ and recreated it with different partition specifications (day, hour, month) -- then Reran Scheduled Transfer -- failed and threw same error.
Imported the data manually (I downloaded the file from GCS and uploaded it locally from my machine) using the BQ UI (Create Table, append the specific table) - Worked perfectly.
Checked to see if this was a known issue here on Stack Overflow and only found this (which is now closed) -- close, but not exactly the issue. (BigQuery Data Transfer Service with BigQuery partitioned table)
What I'm holding off doing since it would take a bit more work.
Change schema of the target BQ table to include a specified column specific for partitioning
Include a system-generated timestamp in the original file inside of GCS and ensure the process recognizes this as the partitioning field.
Am I doing something wrong? Or is this a known issue?

Alright, I believe I have solved this. It looks like you need to include runtime parameters into your target table if the destination table is being partitioned.
https://cloud.google.com/bigquery-transfer/docs/gcs-transfer-parameters
Specifically this section called "Runtime Parameter Examples" here: https://cloud.google.com/bigquery-transfer/docs/gcs-transfer-parameters#loading_a_snapshot_of_all_data_into_an_ingestion-time_partitioned_table
They also advise that minutes cannot be specified in these parameters.
You will need to append the parameters to your destination table details as shown below:

Related

Reading data from GCS with BigQuery fails with "Not Found", but the date (files) exists

I have a service that is constantly updating files in GCS bucket with hive format:
bucket
device_id=aaaa
month=01
part-0.parquet
month=02
part-0.parquet
....
device_id=bbbb
month=01
part-0.parquet
month=02
part-0.parquet
....
If today we are at month=02 and I ran the following with BigQuery:
SELECT DISTINCT event_id
FROM `project_id.dataset.table`
WHERE month = '02';
I get the error: Not found: Files /bigstore/bucket_name/device_id=aaaa/month=02/part-0.parquet
I checked and the file is there when the query ran.
If I run
SELECT DISTINCT event_id
FROM `project_id.dataset.table`
WHERE month = '01';
I get results without any errors. I guess the error is related to the fact that I'm modifying the data while querying it. But as I understand this should not be the case with GCS, this is from their docs.
Because uploads are strongly consistent, you will never receive a 404 Not Found response or stale data for a read-after-write or read-after-metadata-update operation.
I saw some posts that this could be related to my bucket been Multi-region.
Any other insights?
It could be for some reason that you get this error.
When you load data from Cloud Storage into a BigQuery table, the
dataset that contains the table must be in the same regional or
multi- regional location as the Cloud Storage bucket.
Due to consistency, for buckets, while metadata updates are strongly
consistent for read-after-metadata-update operations, the process
could take time to finish the changes.
Using a Multi-region bucket is not recommended.
In this case, it could be due to consistency, because while you are updating the files GCS at the same time you are executing the query, so when you execute a query the parquet file was available to read and you didn’t get the error, but the next time the parquet file wasn’t available because the service was updating the file and you got the error.
Unfortunately, there is not a simple way, to solve this problem, but here are some options:
You can add a pub/sub routine to the bucket and/or file and quick off
your query after the service finished updating the files.
Make a workflow that blocks the updating of the files in their
buckets until their query finishes.
If the query fails with “not found” for file ABCD and you have
verified ABCD exists in GCS, then retry the query X times.
You need to backup your data into another location where you won't
update these files constantly, just once a day.
You could move the data into a managed storage where you won't have
this problem because you can do snapshotting.

How transfer data from Google Cloud Storage to Biq Query in partitioned and clustered table?

In first time,I created a empty table with partition and cluster. After that, I would like to configure data transfere service to fill my table from Google Cloud Storage.But when I configure the transfer, I didn't see a parameter field which allows to choose the cluster field.
I tried to do the same thing without the cluster and I can fill my table easily.
Big query error when I ran the transfer:
Failed to start job for table matable$20190701 with error INVALID_ARGUMENT: Incompatible table partitioning specification. Destination table exists with partitioning specification interval(type:DAY,field:) clustering(string_field_15), but transfer target partitioning specification is interval(type:DAY,field:). Please retry after updating either the destination table or the transfer partitioning specification.
When you define the table you specify the partitioning and clustering columns. That's everything you need to do.
When you load the data (from CLI or UI) from GCS BigQuery automatically partition and cluster the data.
If you can give more detail of how you create the table and set up the transfer would be helpful to provide a more detailed explanation.
Thanks for your time.
Of course :
empty table configuration
transfer configuration
I success to transfer datat without cluster but, when I add a cluster in my empty table,the trasnfer fails.

databricks error IllegalStateException: The transaction log has failed integrity checks

I have a table that I need drop, delete transaction log and recreate, but while I am trying to drop I get following error.
I have ran repair table statement on this one and could be responsible for error but not sure.
IllegalStateException: The transaction log has failed integrity checks. We recommend you contact Databricks support for assistance. To disable this check, set spark.databricks.delta.state.corruptionIsFatal to false. Failed verification of:
Table size (bytes) - Expected: 0 Computed: 63233
Number of files - Expected: 0 Computed: 1
We think this may just be related to s3 eventual consistency. Please try waiting a few extra minutes after deleting the Delta directory before writing new data to it. Also, normal MSCK REPAIR TABLE doesn't do anything for Delta, as Delta doesn't use the Hive Metastore to store the partitions. There is an FSCK REPAIR TABLE, but that is for removing the file entries from the transaction log of a Databricks Delta table that can no longer be found in the underlying file system.
We don't recommend overwriting a Delta table in place, like you might with a normal Spark table. Delta is not like a normal table - it's a table, plus a transaction log, and many versions of your data (unless fully vacuumed). If you want to overwrite parts of the table, or even the whole table, you should use Delta's delete functionality. If you want to completely change the table, consider writing to an entirely new directory, such as /table/v2/... and separately deleting the other table.
To skip the issue from occurring can use below command (PySpark notebook):
spark.conf.set("spark.databricks.delta.state.corruptionIsFatal", False)

Is there size limit on appending ORC data files to Vora tables

I created a Vora table in Vora 1.3 and tried to append data to that table from ORC files that I got from SAP BW archiving process (NLS on Hadoop). I had 20 files, in total containing approx 50 Mio records.
When I tried to use the "files" setting in the APPEND statement as "/path/*", after approx 1 hour Vora returned this error message:
com.sap.spark.vora.client.VoraClientException: Could not load table F002_5F: [Vora [eba156.extendtec.com.au:42681.1640438]] java.lang.RuntimeException: Wrong magic number in response, expected: 0x56320170, actual: 0x00000000. An unsuccessful attempt to load a table might lead to an inconsistent table state. Please drop the table and re-create it if necessary. with error code 0, status ERROR_STATUS
Next thing I tried was appending data from each file using separate APPEND statements. On the 15th append (of 20) I've got the same error message.
The error indicates that the Vora engine on node eba156.extendtec.com.au is not available. I suspect it either crashed or ran into an out-of-memory situtation.
You can check the log directory for a crash dump. If you find one, please open a customer message for further investigation.
If you do not find a crash dump, it is likely a out-of-memory situation. You should find confirmation in either the engine log file or in /var/log/messages (if the oom killer ended the process). In that case, the available memory is not sufficient to load the data.

Loading Avro-file to BigQuery fails with an internal error

Google BigQuery has on March 23, 2016 announced "Added support for Avro source format for load operations and as a federated data source in the BigQuery API or command-line tool". It says here "This is a Beta release of Avro format support. This feature is not covered by any SLA or deprecation policy and may be subject to backward-incompatible changes.". However, I'd expect the feature to work.
I didn't find anywhere code examples on how to use Avro format for loading. Neither I did find examples on how to use bq-tool for loading.
Here's my practical issue. I haven't been able to load data into BigQuery in Avro-format.
The following happens using bq-tool. The dataset, table name and bucket name have been obfuscated:
$ bq extract --destination_format=AVRO dataset.events_avro_test gs://BUCKET/events_bq_tool.avro
Waiting on bqjob_r62088699049ce969_0000015432b7627a_1 ... (36s) Current status: DONE
$ bq load --source_format=AVRO dataset.events_avro_test gs://BUCKET/events_bq_tool.avro
Waiting on bqjob_r6cefe75ece6073a1_0000015432b83516_1 ... (2s) Current status: DONE
BigQuery error in load operation: Error processing job 'dataset:bqjob_r6cefe75ece6073a1_0000015432b83516_1': An internal error occurred and the request could not be completed.
Basically, I am extracting from a table and inserting to the same table causing an internal error.
Additionally, I have Java program that does the same (extract from table X and load to table X) with the same result (internal error). But I think the above illustrates the problem as clearly as possible, and because of that I'm not sharing the code here. In Java, If I extract from an empty table and insert that, the insert job doesn't fail.
My questions are
I think BigQuery API should never fail with internal error. Why is that happening with my test?
Is the extracted Avro file compatible with an insert job?
There seems to be no specification what the Avro schema in an insert job is like, at least I couldn't find any. Could the documentation be created?
UPDATED 2016-04-25:
So far I've managed to get an Avro load job not to give an internal error based on the hint of not using REQUIRED fields. However, I haven't managed to load non-null values.
Consider this Avro-schema:
{
"type": "record",
"name": "root",
"fields": [
{
"name": "x",
"type": "string"
}
]
}
The BigQuery table has one column, x that is NULLABLE.
If I insert N (I've tried with one and two) rows (x being e.g. 1), I got N rows in BigQuery but x always having value null.
If I change the table so that X is REQUIRED I get an internal error.
There is no exact match from a BQ schema to Avro schema, and vice versa, so when you export a BQ table to Avro file and then import back, the schema will be different. I see the destination table of your load already exists, in this case we throw an error when the schema of the destination table doesn't match the schema we converted from the Avro schema. This should be an external error though, we're investigating why it's an internal error.
We're in the middle of upgrading the export pipeline, and the new import pipeline has a bug that doesn't work with the Avro file exported by the current pipeline. The fix should be deployed in a couple weeks. After that, if you import the exported file to a non-existent destination table, or a destination table with compatible schema, it should work. Meanwhile, importing your own Avro files should work. You can also query it directly on GCS without importing it.
There's a problem with the error mapping for the AVRO reader here. The error should have been along the lines of: "The reference schema differs from the existing data: The required field 'api_key' is missing"
Looking at your load job configuration, it includes REQUIRED fields. It sounds like some of the data you are trying to load doesn't specify these required fields, so the operation fails.
I suggest avoiding required fields.
So, there's a bug in BigQuery: an insert job using Avro format does not work if the destination table exists, but gives an internal error. The workaround is to use createDisposition CREATE_IF_NEEDED and not to have the pre-existing table there. I verified that this works.
Hua Zung's comment says that the bug will be fixed in "the fix should be deployed in a couple weeks". Needless to say that existing major bugs in the live system should be documented somewhere.
While updating the system, I really recommend improving the Avro documentation. Now there's no mention on what the Avro schema should be like (type record, name root and fields array having the columns(?)) and not even the fact that each record in the Avro file maps to a row in the destination table (obvious, but should be mentioned). Also what happens with schema mismatch is not documented.
Thanks for the help, I'll be now switching to Avro-format. It's so much better than CSV.