Internal error while loading to Bigquery table - google-bigquery

I ran this command to load 11 files to a Bigquery table:
bq load --project_id=ardent-course-601 --source_format=NEWLINE_DELIMITED_JSON dw_test.rome_defaults_20140819_test gs://sm-uk-hadoop/queries/logsToBq_transformLogs/rome_defaults/20140819/23af7218-617d-42e8-884e-f213a583094a/part* /opt/sm-analytics/projects/logsTobqMR/jsonschema/rome_defaultsSchema.txt
I got this error:
Waiting on bqjob_r46f38146351d545_00000147ef890755_1 ... (11s) Current status: DONE
BigQuery error in load operation: Error processing job 'ardent-course-601:bqjob_r46f38146351d545_00000147ef890755_1': Too many errors encountered. Limit is: 0.
Failure details:
- File: 5: Unexpected. Please try again.
I tried many times after that and still got the same error.
To debug what went wrong, I instead load each file one by one to the Bigquery table. For example:
/usr/local/bin/bq load --project_id=ardent-course-601 --source_format=NEWLINE_DELIMITED_JSON dw_test.rome_defaults_20140819_test gs://sm-uk-hadoop/queries/logsToBq_transformLogs/rome_defaults/20140819/23af7218-617d-42e8-884e-f213a583094a/part-m-00011.gz /opt/sm-analytics/projects/logsTobqMR/jsonschema/rome_defaultsSchema.txt
There are 11 files total and each ran fine.
Could someone please help? Is this a bug on Bigquery side?
Thank you.

There was an error reading one of the files: gs://...part-m-00005.gz
Looking at the import logs, it appears that the gzip reader encountered an error decompressing the file.
It looks like that file may not actually be compressed. BigQuery samples the header of the first file in the list to determine whether it is dealing with compressed or uncompressed files and to determine the compression type. When you import all of the files at once, it only samples the first file.
When you run the files individually, bigquery reads the header of the file and determines that it isn't actually compressed (despite having the suffix '.gz') so imports it as a normal flat file.
If you run a load that doesn't mix compressed and uncompressed files, it should work successfully.
Please let me know if you think this is not the case and I'll dig in some more.

Related

dbt Error : Encountered an error: 'utf-8' codec can't decode byte 0xa0 in position 441: invalid start byte

I have upgraded my dbt version to 1.0.0 yesterday night and ran few connection test. It went well . Now when i am running the my first dbt example model , i am getting below error , even though i have not changed any code in this default example model.
Same error i am getting while running dbt seed command also for a csv dataset . The csv is utf-8 encoded and no special character in it .
I am using python 3.9
Could anyone suggest what is the issue ?
Below is my first dbt model sql
After lots of back and forth, I figured out the issue. This is more like fundamental concept issue.
Every time we execute dbt run, dbt will scan through the entire project directory ( including seeds directory even though it is not materializing the seed ) [Attached screenshot below].
If it finds any csv it also parsed it .
In case of above error, I had a csv file which looks follows :
If we see the highlighted line it contains some symbol character which dbt (i.e python) was not able to parse it causing above error.
This symbol was not visible earlier in excel or notepad++.
It could be the issue with Snowflake python connector that #PeterH has pointed out .
As temporary solution , for now we are manually removing these character from Data file.
I’d leave this as a comment but I don’t have the rep yet…
This appears to be related to a recently-opened issue.
https://github.com/dbt-labs/dbt-snowflake/issues/66
Apparently it’s something to do with the snowflake python adapter.
Since you’re seeing the error from a different context, it might be helpful for you to post in that issue that you’re seeing this outside of query preview.

Row larger than the maximum allowed size

I have successfully imported many gzipped JSON files on several occasions. For the two files BQ import choked. Both files reported the same error:
File: 0 / Offset:0 / Line:1 / Column:20971521, Row larger than the maximum allowed size
Now I've read about the row limit of 20MB and I understand that the number above is 20MB +1 but what really bugs me is that the meaning is totally off. My GZs have millions of JSONs (each on a new line). I have written a script to measure the longest line (longest JSON) in the failed GZ file and found it to be 103571 bytes. Why is the BQ import choking then?
I have inspected the longest JSON and it looks perfectly normal. How should I interpret the error? How can I fix it?
Why is BQ thinking the import is on line 1, column 20971521 when there are millions of lines in the file?
All your investigations are correct, but you must check your file as new lines are not identified, and BQ seas all the import as a large line.
That's why it reports column 20971521 for the problem.
You should try importing a sample from the file.
Some of the answers here gave me an idea so I went on a tried it. It appears as if for some strange reason BQ didn't like line endings so I wrote a quick script to rewrite the original input file to use line endings. Automagically the import worked!
This is utterly strange considering I already imported many GBs of data with pure line endings.
I am happy that it worked but I could never guess why. I hope this helps someone else.

Storing from wildcard input path

I’m having issues using wildcard input paths in Pig.
If I run the following commands:
A = load ‘/something/*.csv’ using PigStorage(‘,’)
dump A;
I see the output from all csv files in the something folder printed to my console after the job is run.
If, however, I run a store instead:
A = load ‘/something/*.csv’ using PigStorage(‘,’)
store A into ‘somedestination’;
The job fails with the following error message:
Input(s):
Failed to read data from “/something/*.csv”
It looks like the store is attempting to load from the literal path instead of globbing using the wildcard, but if that’s the case then why does it work during the dump? Is there another way to accomplish this?
You may not have the permission to write to that folder.
The dump essentially writes to the tmp folder (or another folder if the configuration is different) and then prints that to the screen.
Do a dump. Look at the log. It should say something like:
Input(s):
Successfully read 0 records from: "‘/something/*.csv’"
Output(s):
Successfully stored 0 records in: "file:/tmp/temp1865628879/tmp-1573237939"
Then next time try and store to the folder that you saw when you did the dump. If that works fine, then you have a permissions problem.

IS it possible to manage NO FILE error in Pig?

I'm trying to load simple file:
log = load 'file_1.gz' using TextLoader AS (line:chararray);
dump log
And I get an error:
2014-04-08 11:46:19,471 [main] ERROR org.apache.pig.tools.pigstats.SimplePigStats - ERROR 2997: Unable to recreate exception from backend error: org.apache.pig.backend.executionengine.ExecException: ERROR 2118: Input Pattern hdfs://hadoop1:8020/pko/file*gz matches 0 files
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:288)
at org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:1054)
Is is possible to manage such situation before error appears?
Input Pattern hdfs://hadoop1:8020/pko/file*gz matches 0 files
The error is the input file doesn't exist in the given hdfs path.
log = load 'file_1.gz' using TextLoader AS (line:chararray);
as you haven’t mentioned the absolute path of file_1.gz , it will taken the home hdfs dir of the user with which you are running your pig-script
Unfortunately in the current version of Pig (0.15.0) it is impossible to manage these errors without using UDF's.
I suggest creating a Java or Python script using try and catch to take care of this.
Here's a good website that might be of some use to you: https://wiki.apache.org/pig/PigErrorHandlingInScripts
Good luck learning Pig!
I'm facing this issue as well. My load command is:
DATA = LOAD '${qurwf_folder_input}/data/*/' AS (...);
I want to load all files from the data subfolders, but the data folder is empty and I got the same error as you. What I did, in my particular case, was to create an empty folder in the data directory. So the LOAD returns an empty dataset and the script did not fail.
By the way, I'm using Oozie workflow to run the scripts, and in the prepare, I create the empty folders.

Error loading data "operation: Unexpected"

This is a repost from a question asked on the (now disfunct) bigquery forum.
While uploading data from the bq tool I get the following error:
BigQuery error in load operation: Unexpected. Please try again.
I've tried running several files, but each gives the same exception.
The latest failed job is job_5251c0bf5eb24436a350bdfbdbdb3cd8
It looks like that job hit a SECURITY_VIOLATION error. This is likely due to a line that is longer than the maximum line length (64k).
In the next build of BigQuery (which will probably go live next week) it will give you a better error in this case -- it will tell you which lines are too long, and long lines won't cause the import to fail (subject to the maxBadRecords limit).
In the meantime, you can make sure that your input lines are shorter than 64k (note that newlines can be quoted, so stray quotes can cause lines to appear to be too long).