BigQuery internal error during copy job to move tables between datasets

I'm currently migrating around 200 tables in BigQuery (BQ) from one dataset (FROM_DATASET) to another (TO_DATASET). Each of these tables has a _TABLE_SUFFIX corresponding to a date (I have three years of data for each table), and each suffix typically holds between 5 GB and 80 GB of data.
I'm doing this with a Python script that, for each table and each suffix, asks BQ to run the following query:
-- example table=T_SOME_TABLE, suffix=20190915
CREATE OR REPLACE TABLE `my-project.TO_DATASET.T_SOME_TABLE_20190915`
COPY `my-project.FROM_DATASET.T_SOME_TABLE_20190915`
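For reference, a minimal sketch of what such a script might look like, assuming the google-cloud-bigquery client (the table list and suffix range below are illustrative, not the real ones):

from google.cloud import bigquery

client = bigquery.Client(project="my-project")

TABLES = ["T_SOME_TABLE"]      # illustrative; the real list has ~200 entries
SUFFIXES = ["20190915"]        # illustrative; one suffix per date

for table in TABLES:
    for suffix in SUFFIXES:
        sql = (
            f"CREATE OR REPLACE TABLE `my-project.TO_DATASET.{table}_{suffix}` "
            f"COPY `my-project.FROM_DATASET.{table}_{suffix}`"
        )
        client.query(sql).result()  # blocks until the copy job finishes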
Everything works except for three tables (and all their suffixes), where the copy job fails for every _TABLE_SUFFIX with this error:
An internal error occurred and the request could not be completed. This is usually caused by a transient issue. Retrying the job with back-off as described in the BigQuery SLA should solve the problem: https://cloud.google.com/bigquery/sla. If the error continues to occur please contact support at https://cloud.google.com/support. Error: 4893854
Retrying the job after some time does work, but of course it slows down the whole process. Does anyone have an idea what the problem might be?
Thanks.

It turned out that the three problematic tables were legacy ones with a very large number of columns. In particular, the BQ GUI shows this warning for two of them:
"Schema and preview are not displayed because the table has too many columns and may cause the BigQuery console to become unresponsive"
This was probably the issue.
In the end, I managed to migrate everything by implementing a backoff mechanism to retry failed jobs.
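For what it's worth, a minimal sketch of such a backoff loop, assuming the google-cloud-bigquery client (the retry count and delays are arbitrary choices, not values from the SLA):

import time
from google.api_core.exceptions import GoogleAPICallError
from google.cloud import bigquery

client = bigquery.Client(project="my-project")

def run_with_backoff(sql, max_attempts=5):
    # Retry transient failures with exponentially increasing waits.
    delay = 10  # seconds; doubled after each failed attempt
    for attempt in range(1, max_attempts + 1):
        try:
            client.query(sql).result()
            return
        except GoogleAPICallError:
            if attempt == max_attempts:
                raise
            time.sleep(delay)
            delay *= 2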

Related

BigQuery Scheduled Query won't run

I've got data buckets set up in GCS and I'm using BigQuery to load all the .csv files from a bucket and build a table. That works flawlessly. I made a simple deduplication query that, when run manually, selects only distinct rows and creates a new table with "DeDupe" appended (code below). That also runs flawlessly.
CREATE OR REPLACE TABLE
`project-name-123456.dataset_2022.dataset 2022 DeDuped` AS
SELECT
DISTINCT *
FROM
`project-name-123456.dataset_2022.dataset 2022`
The issue I am having is with scheduling that query. Every time it tries to run I get the error "Error status: Not found: Dataset project-name-123456:dataset_2022 was not found in location US; JobID: project-name-123456:628d7766-0000-2d36-a82f-94eb2c0a664a"
The only thing I can figure is that my dataset's data location is "us-central1", since it has a free tier. When I go to my scheduled query, whether I select the same data location or "Default", it always changes to "US Multiple".
Is there a way to fix this?
Or do I need to create my dataset in "US Multiple"?
I'm trying to cut down on costs as much as possible by keeping it in us-central1.
EDIT: It seems I just needed to delete and recreate the scheduled query. I chatted with Google Support and they sorted it out. Sorry all!
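For anyone hitting the same mismatch, here is a hedged sketch of recreating the scheduled query with an explicit location through the BigQuery Data Transfer Service client (the project, region, schedule and display name are placeholders based on the question):

from google.cloud import bigquery_datatransfer

client = bigquery_datatransfer.DataTransferServiceClient()

# Pinning the location in the parent resource keeps the scheduled query in
# us-central1 instead of letting it default to the US multi-region.
parent = "projects/project-name-123456/locations/us-central1"

transfer_config = bigquery_datatransfer.TransferConfig(
    display_name="DeDupe dataset_2022",   # placeholder name
    data_source_id="scheduled_query",
    schedule="every 24 hours",            # placeholder schedule
    params={
        "query": """
            CREATE OR REPLACE TABLE
              `project-name-123456.dataset_2022.dataset 2022 DeDuped` AS
            SELECT DISTINCT *
            FROM `project-name-123456.dataset_2022.dataset 2022`
        """,
    },
)

config = client.create_transfer_config(parent=parent, transfer_config=transfer_config)
print("Created scheduled query:", config.name)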

SSAS Tabular Model. Column disappears after SSAS service restart

SSAS Version: 14.0.226.1
Visual Studio Version: 4.7.02558
Issue: once the model is deployed to the server, it processes without any errors. But if the SSAS server is rebooted, one of the dimensions throws an error while processing: it simply loses one of its columns. Here is the error that I get (Failed to save modifications to the server. Error returned: 'The 'Global_Code_SKU' column does not exist in the rowset.').
The model contains 2 dimensions and a fact table with 632 million rows. Could the fact table size be the issue? Maybe the dictionary is too big?
How I fix it: by deploying the model again without partitions and roles, just metadata, which resolves the issue. However, the servers are sometimes rebooted without notice, so the processing job fails the next day (it runs once a day).
Is there anything I can try to fix this permanently? I've searched for a while but haven't found a solution.
It turned out there was a hidden character right before the first symbol in one of the names; after comparing the binary representations of the two strings, we found that we just had to recreate the table, and that solved the problem.
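If anyone needs to hunt for such hidden characters, a small Python sketch (the strings here are made up to illustrate the idea):

import unicodedata

expected = "Global_Code_SKU"
actual = "\ufeffGlobal_Code_SKU"   # e.g. the name as scripted from the server

# Dump every code point so invisible characters (BOMs, zero-width
# spaces, etc.) become visible in the output.
for name in (expected, actual):
    print([f"U+{ord(c):04X} {unicodedata.name(c, '?')}" for c in name])

The second list starts with U+FEFF ZERO WIDTH NO-BREAK SPACE, which is exactly the kind of hidden character described above.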
Some suggestions to try:
After the reboot, connect to the SSAS server using SSMS, right-click the database in question and choose Script -> Script Database as. Is the Global_Code_SKU column still there? Is it hidden? Is it available in the source?
What datatype is Global_Code_SKU? I've had problems with columns with similar-looking values being auto-identified by SSAS as binary and therefore excluded from the load.

Performance issue with joining tables over 3 DB links

Recently we have faced a problem fetching data from 3 different data sources over DB links. The query ran fine when we were fetching 16 columns by joining the three sources, but when we increased the column count from 16 to 50, it started taking far too long.
We are fetching the data from 3 different data sources, say A (Singapore), B (Malaysia) and C (India), and creating a view that combines the three regions. The view is published to the front-end team (the Tableau team) to perform visualization over that data.
Any suggestions on how to solve this? I am considering the alternatives below:
applying the /*+ DRIVING_SITE */ hint so that the query runs on the remote server with up-to-date statistics (see the sketch after this list).
creating a materialized view on the local server and refreshing it overnight, but it would not have fully up-to-date data.
creating a materialized view on the local server, partitioning it, and refreshing a partition whenever changes occur at the remote site; to be alerted to changes I'm planning to build a queuing system, or use DBMS_PIPE if it helps.
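If it helps, a minimal sketch of the first alternative, issued from Python with the python-oracledb driver (the connection details, table names and DB link name are placeholders):

import oracledb

# Placeholder credentials and DSN.
conn = oracledb.connect(user="app_user", password="app_pwd", dsn="dbhost/orclpdb1")

# The DRIVING_SITE hint asks Oracle to execute the join at the named remote
# site, so only the final result set travels back over the DB link.
sql = """
SELECT /*+ DRIVING_SITE(b) */ a.col1, b.col2
FROM local_table a
JOIN remote_table@malaysia_link b ON a.id = b.id
"""

with conn.cursor() as cur:
    for row in cur.execute(sql):
        print(row)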

Get the Last Modified date for all BigQuery tables in a BigQuery Project

I have several datasets within a BigQuery project which are populated by various job engines and applications. I would like to maintain a dashboard of the Last Modified dates for every table within our project, to monitor job failures.
Are there any command line or SQL commands which could provide this list of Last Modified dates?
For a SQL command you could try this one:
#standardSQL
SELECT *, TIMESTAMP_MILLIS(last_modified_time) AS last_modified
FROM `dataset.__TABLES__`
WHERE table_id = 'table_id'
I'd recommend, though, that you see whether you can log these errors at the application level; by doing so you can also understand why something didn't work as expected.
If you are already using GCP, you can make use of Stackdriver (it works on AWS as well). We started using it in our projects and I recommend giving it a try (we only tested it with Python applications, though; the tool might behave similarly with other clients).
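If you want this across the whole project rather than one dataset at a time, a minimal sketch assuming the google-cloud-bigquery client (the project id is a placeholder):

from google.cloud import bigquery

client = bigquery.Client(project="my-project")

# Loop over every dataset and read last_modified_time from the
# __TABLES__ metadata view of each one.
for dataset in client.list_datasets():
    sql = f"""
        SELECT dataset_id, table_id,
               TIMESTAMP_MILLIS(last_modified_time) AS last_modified
        FROM `{client.project}.{dataset.dataset_id}.__TABLES__`
    """
    for row in client.query(sql).result():
        print(row.dataset_id, row.table_id, row.last_modified)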
I've just queried stacked GA4 data using the following code:
#standardSQL
SELECT table_id, TIMESTAMP_MILLIS(last_modified_time) AS last_modified
FROM `analytics_#########.__TABLES__`
WHERE table_id LIKE 'events_2%'
I have kept the 2 in the events_2% pattern to make sure my intraday tables do not get pulled through as well.

BigQuery: Unable to delete table

We have a large table (somewhat large: < 15 million rows) that we have been filling up with stress and stability testing. We are trying to delete the table, but it is resisting.
Here's what we have tried:
Deleting the table from the web console. No errors... but it doesn't delete the table.
Deleting it from the command-line interface. We get an error message: "BigQuery error in rm operation: Backend Error".
Deleting the whole dataset from the console. That fails as well, with no errors reported.
Deleting the whole dataset from the command line. We get the same error message: "BigQuery error in rm operation: Backend Error".
Other tables with the same schema can be deleted without error. Our schema does use 9999 columns (the maximum), which is the only odd thing we may be doing.
You've hit a bug with tables that have a large number of updates and a wide schema. We're working on a fix.