Bigquery load job said successful but data did not get loaded into table - google-bigquery

I submitted a Bigquery load job, it ran and returned with the status successful. But the data didn't make into the destintation table.
Here was the command that was run:
/usr/local/bin/bq load --nosynchronous_mode --project_id=ardent-course-601 --job_id=logsToBq_load_impressions_20140816_1674a956_6c39_4859_bc45_eb09db7ef99a --source_format=NEWLINE_DELIMITED_JSON dw_logs_impressions.impressions_20140816 gs://sm-uk-hadoop/queries/logsToBq_transformLogs/impressions/20140816/9307f6e3-0b3a-44ca-8571-7107c399998c/part* /opt/sm-analytics/projects/logsTobqMR/jsonschema/impressionsSchema.txt
I checked the job status of the job logsToBq_load_impressions_20140816_1674a956_6c39_4859_bc45_eb09db7ef99a. The input file count and size showed the correct number of input files and total size.
Does anyone know why the data didn't make into the table but yet the job is reported as successful?
Just in case this is not a mistake on our side, I ran the load job again but to a different destination table and this time the data made into the destination table fine.
Thank you.

I experienced this recently with BigQuery in sandbox mode without a billing account.
In this mode the partition expiration is automatically set to 60 days. If you load data into the table where the partitioned column(e.g. date) is older than 60 days it won't show up in the table. The load job still succeeds with the correct number of output rows.

This is very surprising, but I've confirmed via the logs that this is indeed the case.
Unfortunately, the detailed logs for this job, which ran on August 16, are no longer available. We're investigating whether this may have affected other jobs more recently. Please ping this thread if you see this issue again.

we had this issue in our system and the reason was like table was set with partition expiry for 30 days and table was partitioned on timestamp column.. Hence when someone was ingesting data which is older than partition expiry date bigquery load jobs were successfully completed in Spark but we see no data in ingestion tables.. since it was getting deleted moment after it was ingested.. due to partition expiry set on.
Please check your bigquery table partition expiry parameters and see the partition column value of incoming data. If it value will be lower than partition expiry.. you wont see data in bigquery tables.. it will get deleted just after the ingestion.

Related

Azure Data Factory - Rerun Failed Pipeline Against Azure SQL Table With Differential Date Filter

I am using ADF to keep an Azure SQL DB in sync with an on-prem DB. The on-prem DB is read only and the direction is one-way, from the Azure SQL DB to the on-prem DB.
My source table in the Azure SQL Cloud DB is quite large (10's of millions of rows) so I have the pipeline set to use an UPSERT (merge, trying to create a differential merge). I am using a filter on the Source table and the and the Filter Query has a WHERE condition that looks like this:
[HistoryDate] >= '#{formatDateTime(pipeline().parameters.windowStart, 'yyyy-MM-dd HH:mm' )}'
AND [HistoryDate] < '#{formatDateTime(pipeline().parameters.windowEnd, 'yyyy-MM-dd HH:mm' )}'
The HistoryDate column is auto-maintained in the source table with a getUTCDate() type approach. New records will always get a higher value and be included in the WHERE condition.
This works well, but here is my question: I am testing on my local machine before deploying to the client. When I am not working, my laptop hibernates and the pipeline rightfully fails because my local SQL Instance is "offline" during that run. When I move this to production this should not be an issue (computer hibernating), but what happens if the clients connection is temporarily lost (i.e, the client loses internet for a time)? Because my pipeline has a WHERE condition on the source to reduce the table size upsert to a practical number, any failure would result in a loss of any data created during that 5 minute window.
A failed pipeline can be rerun, but the run time would be different at that moment in time and I would effectively miss the block of records that would have been picked up if the pipeline had been run on time. pipeline().parameters.windowStart and pipeline().parameters.windowEnd will now be different.
As an FYI, I have this running every 5 minutes to keep the local copy in sync as close to real-time as possible.
Am I approaching this correctly? I'm sure others have this scenario and it's likely I am missing something obvious. :-)
Thanks...
Sorry to answer my own question, but to potentially help others in the future, it seems there was a better way to deal with this.
ADF offers a "Metadata-driven Copy Task" utility/wizard on the home screen that creates a pipeline. When I used it, it offers a "Delta Load" option for tables which takes a "Watermark". The watermark is a column for an incrementing IDENTITY column, increasing date or timestamp, etc. At the end of the wizard, it allows you to download a script that builds a table and corresponding stored procedure that maintains the values of each parameters after each run. For example, if I wanted my delta load to be based on an IDENTITY column, it stores the value of the max value of a particular pipeline run. The next time a run happens (trigger), it uses this as the MIN value (minus 1) and the current MAX value of the IDENTITY column to get the added records since the last run.
I was going to approach things this way, but it seems like ADF already does this heavy lifting for us. :-)

BigQuery Scheduled Query won't run

I've got data buckets setup in GCS and using BigQuery to run all my .csv files from that bucket to build a table. That works flawlessly. I made a simple deduplication query that when manually run, selects only distinct rows and creates a new table with "DeDupe" appended (Code below). That runs flawlessly.
CREATE OR REPLACE TABLE
`project-name-123456.dataset_2022.dataset 2022 DeDuped` AS
SELECT
DISTINCT *
FROM
`project-name-123456.dataset_2022.dataset 2022`
The issue I am having is with scheduling that query. Every time it tries to run I get the error "Error status: Not found: Dataset project-name-123456:dataset_2022 was not found in location US; JobID: project-name-123456:628d7766-0000-2d36-a82f-94eb2c0a664a"
The only thing I can figure is that I have my data location for the dataset as "us-central1" as it has a free tier. And when I go to my scheduled query, whether I select the same data location, or "Default" it always changes to "US Multiple".
Is there a way to fix this?
Or do I need to create my dataset in "US Multiple"?
Trying to cut down on costs as much as possible by keeping it in the us-central1
EDIT: Seems like I just needed to delete and recreate the scheduled query again. Chatted with Google Support and they sorted it. Sorry all!

How do I find out what is inserting data in my Azure Data Warehouse

I am using an Azure 'Synapse SQL Pool' (aka Data Warehouse) containing a table named 'DimClient'. I see in my database that new records are being added every day at a specific time. I've reviewed all the ADF pipelines and triggers but none of them are set to run at that time. I don't see any stored procedures that insert or update records in this table either. I can only conclude there is another process running that is adding those records.
I turned on 'Send to Log Analytics' to forward to a workspace and included the SqlRequests and ExecRequests categories. I waited a day and reviewed the logs using the following query:
AzureDiagnostics
| where Category == "SqlRequests" or Category == "ExecRequests"
| where Command_s contains "DimClient" ;
I get 'No Results Found' but when I query the table in SSMS, it contains new records that were added within the last 24 hours. How do I determine what is inserting these records?
you should get result. it takes time some to sync data in log analytics. also check diagnostic settings on Synpase pool

Schedule deleting a BQ table

I am streaming data into BQ, every day I run a scheduled job in Dataprep that takes 24 hours of data and modifies some data and creates a new table in the BQ dataset with 24 hours of data.
The original table though stays unmodified and keeps on gathering data.
What I would like to do is delete all rows in the table after the dataprep makes a copy so that a new 24 hours of data streaming is gathered
How can I make this automated, I can't seem to find anything in dataprep that drops the original table and creating a new table.
You can do this setting up your table as partitioned table due to you are ingesting data constantly.
This option is to do it manually:
bq rm '[YOUR_DATASET].[YOUR_TABLE]$xxxxxxx'
And with the expiration time you can set the time when the data of the table will be deleted:
bq update --time_partitioning_expiration [INTEGER] [YOUR_PROJECT_ID]:[YOUR_DATASET].[YOUR_TABLE]
You could use a Scheduled Query to clear out the table:
https://cloud.google.com/bigquery/docs/scheduling-queries
Scheduled queries support DDL so you could schedule a query that deletes all the rows from this table, or deletes the table entirely, on a daily basis. at a specific time.

Will BigQuery finish long running jobs with a destination table if my browser crashes / computer turns off?

I frequently run BigQuery jobs in the web gui that take 30 minutes or more, saving the results into another table to view later.
Since I'm not waiting for the result to come soon, and not storing them in my computer's memory, it would be great if I could start a query and then turn off my computer, to come back the next day and look at the results in the destination table.
Will this work?
The same applies if my computer crashes, or browser runs out of memory, or anything else that causes me to lose my connection to Bigquery while the job is running.
The simple answer is yes, the processing takes place in the cloud, not on your browser. As long as you set a destination table, the results will be saved there or if not, you can check the query history to see if there were any issues which caused it not to be produced.
If you don't set a destination table it will save to a temporary table which may not be available if you don't return in time.
I'm sure someone can give you a much more detailed answer.
Even if you have not defined destination table - you still can access result of the query by checking Query History. You should locate your query in the list of presented queries and then expand respective item and locate value of Destination Table.
Note: this is not regular table - rather so called anonymous table that is being available for about 24 hours after query was executed
So, knowing that table you can just use it in whatever way you want - for example just simply query it as in below
SELECT *
FROM `yourproject._1e65a8880ba6772f612fbe6ff0eee22c939f1a47.anon9139110fa21b95d8c8729cf0bb6e4bb6452946d4`
Note: anonymous table is being "saved" in a "system" dataset that is started with underscore so you will not be able to see it in UI. Also table name startes with 'anon' which I believe states for 'anonymous'