Google BigQuery - Loading a file from GCS failed with "Not Found", but the file exists

We have a strange issue that happens quite often.
We have a process that fetches files from various sources and uploads them to GCS. Then, and only if the file was uploaded successfully, we try to load it into the BigQuery table and get the error
"Not found: Uris List of uris (possibly truncated): json: file_name: ...".
After a deep investigation, everything appears to be fine, and we don't know what changed. Within the relevant time frames, the file referenced in the job exists in Cloud Storage and was uploaded to GCS two minutes before BigQuery tried to read it.
It should be noted that we load each batch as a whole directory in Cloud Storage, using a wildcard like gs://<bucket>/path_to_dir/*. Is that still supported?
Also, the files are fairly small, from a few bytes to a few KB. Does that matter?
job ids for checking:
load_file_8e4e16f737084ba59ce0ba89075241b7 load_file_6c13c25e1fc54a088af40199eb86200d

Known issue with Cloud Storage consistency
As noted by Felipe, this was indeed related to a known issue with Cloud Storage. Google Cloud Storage Incident #16036 is shown as resolved as of December 20, 2016. This was also being tracked in Issue 738. Although Cloud Storage list operations are eventually consistent, this incident involved excessive delays before operations returned consistent results.
Handling Cloud Storage inconsistency
Though this was an isolated incident, it is nevertheless a good practice to have some means of handling such inconsistencies. Two such suggestions can be found in comment #10 of the related public issue.
Retry the load job if it failed.
Verify that the Cloud Storage results are consistent with expectations: check that the expected number of files (and total size) was processed by BigQuery. You can get this information from the job metadata.
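As a rough illustration of both suggestions, here is a minimal Python sketch using the google-cloud-bigquery and google-cloud-storage clients. The bucket, prefix, and table names are placeholders, not values from the question, and the retry count is arbitrary:

```python
from google.cloud import bigquery, storage

BUCKET = "my_bucket"                         # placeholder bucket
PREFIX = "path_to_dir/"                      # placeholder prefix for the batch
TABLE_ID = "my_project.my_dataset.my_table"  # placeholder destination table

bq = bigquery.Client()
gcs = storage.Client()

# Count the objects we expect BigQuery to pick up for this batch.
expected = sum(1 for _ in gcs.list_blobs(BUCKET, prefix=PREFIX))

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON
)

# Retry the load job a few times if it fails (e.g. on a transient "Not found").
for attempt in range(3):
    job = bq.load_table_from_uri(
        f"gs://{BUCKET}/{PREFIX}*", TABLE_ID, job_config=job_config
    )
    try:
        job.result()  # wait for the load job to finish
        break
    except Exception as exc:
        print(f"Attempt {attempt + 1} failed: {exc}")
else:
    raise RuntimeError("Load job failed after 3 attempts")

# Verify the job metadata against what listing the bucket told us.
if job.input_files != expected:
    print(f"Warning: job processed {job.input_files} files, expected {expected}")
```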
Still getting unexpected results
Should you encounter such an issue again and have the appropriate error handling measures in place, I recommend first consulting the Google Cloud Status Dashboard and BigQuery public issue tracker for existing reports showing similar symptoms. If none exist, file a new issue on the issue tracker.

The solution was to move from a multi-region bucket (which had been set up before the regional type was available) to a regional bucket.
Since we moved, we have never faced this issue again.

Related

How to resolve this error in Google Data Fusion: "Stage x contains a task of very large size (2803 KB). The maximum recommended task size is 100 KB."

I need to move data from a parameterized S3 bucket into Google Cloud Storage. It's a basic data dump. I don't own the S3 bucket. The path has the following syntax:
s3://data-partner-bucket/mykey/folder/date=2020-10-01/hour=0
I was able to transfer data at the hourly granularity using the Amazon S3 client provided by Data Fusion. I wanted to bring over a day's worth of data, so I reset the path in the client to:
s3://data-partner-bucket/mykey/folder/date=2020-10-01
It seemed like it was working until it stopped. The status is "Stopped." When I review the logs just before it stopped I see a warning, "Stage 0 contains a task of very large size (2803 KB). The maximum recommended task size is 100 KB."
I examined the data in the S3 bucket. Each folder contains a series of log files. None of them are "big". The largest folder contains a total of 3MB of data.
I saw a similar question for this error, but the answer involved Spark coding that I don't have access to in Data Fusion.
Screenshot of Advanced Settings in Amazon S3 Client
These are the settings I see in the client. Maybe there is another setting somewhere that I need to change? What do I need to do so that Data Fusion can import these files from S3 to GCS?
When you deploy the pipeline you are redirected to a new page with a ribbon at the top. One of the tools in the ribbon is Configure.
In the Resources section of the Configure modal you can specify the memory resources. I fiddled around with the numbers: 1000 MB worked; 6 MB was not enough (for me).
I processed 756K records in about 46 min.

Add alert when transfer in BigQuery results in "succeeded 0 jobs"

I have a scheduled transfer running daily on BigQuery, mostly without any issues. The transfer reads a .csv file from an AWS S3 bucket and appends the information to a BigQuery table.
Recently there has been an issue where the transfer resulted in neither succeeded nor failed jobs.
transfer logs
The outcome was that no entries were imported but also no alert was triggered; I had to hear from the reports' users that something had gone wrong.
Question: is there a way to add an alert on BigQuery Transfers for when successful jobs = 0?
BigQuery does have monitoring, though it has some known issues as well. This will help: BigQuery Monitoring.
Go to Monitoring -> Dashboard -> Add Chart, then use resource type "Global" and metric type "Uploaded rows".
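As a complementary, hedged sketch (not part of the original answer), you could also poll the Data Transfer Service API on a schedule and raise your own alert when no recent run succeeded. The transfer config path below is a placeholder:

```python
from google.cloud import bigquery_datatransfer_v1

# Placeholder resource name of the transfer config to check.
TRANSFER_CONFIG = "projects/my-project/locations/us/transferConfigs/my-config-id"

client = bigquery_datatransfer_v1.DataTransferServiceClient()
runs = client.list_transfer_runs(parent=TRANSFER_CONFIG)

# In practice you would restrict this to the runs from the last day or so.
succeeded = sum(
    1 for run in runs
    if run.state == bigquery_datatransfer_v1.TransferState.SUCCEEDED
)

if succeeded == 0:
    # Hook in whatever notification channel you use (email, Pub/Sub, ...).
    print("No successful transfer runs found - raise an alert")
```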

AWS DynamoDB Strange Behavior -- Provisioned Capacity & Queries

I have some strange things occurring with my AWS DynamoDB tables. To give you some context, I have several tables for an AWS Lambda function to query and modify. The source code for the function is housed in an S3 bucket. The function is triggered by an AWS API.
A few days ago I noticed a massive spike in the amount of read and write requests I was being charged for in AWS. To be specific, the number of read and write requests increased by 3,000 from what my tables usually experience (they usually have fewer than 750 requests). Additionally, I have seen similar numbers in my Tier 1 S3 requests, with an increase of nearly 4,000 requests in the past six days.
Immediately, I suspected something malicious had happened, and I suspended all IAM roles and changed their keys. I couldn't see anything in the logs from Lambda denoting it was coming from my function, nor had the API received a volume of requests consistent with what was happening on the tables or the bucket.
When I was looking through the logs on the tables, I was met with this very strange behavior relating to the provisioned write and read capacity of the table. It seems like the table's capacities are ping-ponging back and forth wildly, as shown in the photo.
I'm relatively new to DynamoDB and AWS as a whole, but I thought I had set the table up with very specific provisioned write and read limits. The requests have continued to come in, and I am unable to figure out where in the world they're coming from.
Would one of you AWS Wizards mind helping me solve this bizarre situation?
Any advice or insight would be wildly appreciated.
It turns out that refreshing the table view in the DynamoDB management console causes the table to be read from, hence the unexplained jump in reads. I was doing it the whole time 🤦‍♂️

BigQuery Data Transfer does not delete sources

I am using an on-demand BigQuery Data Transfer job (as a test before automation) which loads data from Cloud Storage into a table.
All is working fine; however, I set "Delete source files after transfer" to true, and at the end no file is deleted. The files are not loaded again, but they are still there in my Storage folder.
This deletion is vital since the amount of data could become quite big in a short period of time. I could delete the files with another program, but then the Transfer Service would become less interesting.
The job itself does not throw any error, which means that something is failing silently. Do you know what could possibly cause this? Or maybe I am missing the meaning of this option?
Thanks
Make sure you have sufficient permissions to do the Cloud Storage transfer; it won't tell you explicitly which permissions are missing.
Required permissions:
BigQuery: bigquery.transfers.update
Cloud Storage: storage.objects.get, storage.objects.list, storage.objects.delete
For more info, refer here.
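As a quick, hedged check (the bucket name is a placeholder), you can ask Cloud Storage which of the required object permissions the caller actually holds on the source bucket; note that this tests the credentials the script runs with, which should match the identity the transfer uses:

```python
from google.cloud import storage

bucket = storage.Client().bucket("my-source-bucket")  # placeholder bucket
required = [
    "storage.objects.get",
    "storage.objects.list",
    "storage.objects.delete",
]
# Returns the subset of the requested permissions the caller actually has.
granted = bucket.test_iam_permissions(required)
missing = set(required) - set(granted)
print("Missing permissions:", missing or "none")
```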

File structure of Apache Beam DynamicDestinations write to BigQuery

I am using DynamicDestinations (from BigQueryIO) to export data from one Cassandra table to multiple Google BigQuery tables. The process consists of several steps including writing prepared data to Google Cloud Storage (as files in JSON format) and then loading the files to BQ via load jobs.
The problem is that the export process ended with an out-of-memory error at the last step (loading the files from Google Cloud Storage to BQ). However, the prepared files with all of the data remain in GCS. There are 3 directories in the BigQueryWriteTemp location:
And there are a lot of files with non-obvious names:
The question is: what is the storage structure of the files? How can I match the files with the tables (table names) they were prepared for? How can I use the files to continue the export process from the load-jobs step? Can I use some piece of Beam code for that?
These files, if you're using Beam 2.3.0 or earlier, contain JSON data to be imported into BigQuery using its load job API. However:
This is an implementation detail that you cannot rely on, in general. It is very likely to change in future versions of Beam (JSON is horribly inefficient).
It is not possible to match these files with the tables they are intended for - that information was stored in the internal state of the pipeline that failed.
There is also no way to know how much data was written to these files and how much wasn't. The files may contain only partial data: maybe your pipeline failed before creating some of the files, or after some of them were already loaded into BigQuery and deleted.
Basically, you'll need to rerun the pipeline and fix the OOM issue so that it succeeds.
For debugging OOM issues, I suggest using a heap dump. Dataflow can write heap dumps to GCS using --dumpHeapOnOOM --saveHeapDumpsToGcsPath=gs://my_bucket/. You can examine these dumps using any Java memory profiler, such as Eclipse MAT or YourKit. You can also post your code as a separate SO question and ask for advice on reducing its memory usage.
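If you want to confirm what those temporary files contain, one rough option (bucket and prefix below are placeholders, not the actual BigQueryWriteTemp path from your pipeline) is to download one of them and print a few of its newline-delimited JSON records:

```python
import json
from google.cloud import storage

client = storage.Client()
# Placeholder bucket and temp prefix; substitute your pipeline's actual location.
blobs = client.list_blobs("my_bucket", prefix="BigQueryWriteTemp/")

for blob in blobs:
    text = blob.download_as_text()
    for line in text.splitlines()[:5]:  # first few records only
        print(json.dumps(json.loads(line), indent=2))
    break  # inspect a single file
```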