process tar files - google-bigquery

I am not able to process any tar files.
Errors:
Line:1 / Field:1, Bad character (ASCII 0) encountered: field starts with: <Master.c>
Line:2 / Field:1, Bad character (ASCII 0) encountered. Rest of file not processed.
Input contained no data
Job ID: cdrFinal20130123193311freeswitch8
If I upload uncompressed files, there is no issue.

tar is not a compression format - it is an archival format. BigQuery currently supports loading of files compressed using the gzip algorithm.
More here: https://developers.google.com/bigquery/articles/ingestioncookbook#compressedversusuncompressed
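If the data only exists as a tar archive, one option is to unpack it and gzip each member before loading. This is a minimal sketch using only the Python standard library; the function name `tar_to_gzip_files` and the output layout are assumptions for illustration, not something BigQuery provides.

```python
import gzip
import shutil
import tarfile
from pathlib import Path


def tar_to_gzip_files(tar_path, out_dir):
    """Write every regular file in a tar archive as <name>.gz inside out_dir.

    Returns the list of gzip paths that were written.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    with tarfile.open(tar_path, "r:*") as archive:  # "r:*" also accepts compressed tars
        for member in archive:
            if not member.isfile():
                continue  # skip directories, links, and other special entries
            source = archive.extractfile(member)
            # Keep only the bare file name so nothing is written outside out_dir.
            # Assumption: member file names are unique within the archive.
            target = out_dir / (Path(member.name).name + ".gz")
            with source, gzip.open(target, "wb") as gz_out:
                shutil.copyfileobj(source, gz_out)
            written.append(target)
    return written
```

The resulting .gz files can then be uploaded and loaded like any other gzip-compressed source.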

Related

Bigquery error (ASCII 0) encountered for external table and when loading table

I'm getting this error
"Error: Error detected while parsing row starting at position: 4824. Error: Bad character (ASCII 0) encountered."
The data is not compressed.
My external table points to multiple CSV files, and one of them contains a couple of lines with that character. In my table definition I added "MaxBadRecords", but that had no effect. I also get the same problem when loading the data in a regular table.
I know I could use DataFlow or even try to fix the CSVs, but is there an alternative that does not include writing a parser, and hopefully just as easy and efficient?

"Is there an alternative that does not include writing a parser, and hopefully just as easy and efficient?"
Try the following in the Google Cloud SDK Shell (it uses the tr utility):
gsutil cp gs://bucket/badfile.csv - | tr -d '\000' | gsutil cp - gs://bucket/fixedfile.csv
This will:
1. Read your "bad" file
2. Remove every ASCII 0 character
3. Save the result to a new "fixed" file
Once you have the new file, make sure your table points to the fixed one.

Sometimes a stray final byte ends up in the file.
Replacing the NUL characters with spaces can help:
tr '\0' ' ' < file1 > file2

You can clean the file with an external tool such as Python or PowerShell. BigQuery will not load a file that contains an ASCII 0 character.
This Python function rewrites a file in place, replacing one string with another (pass '\x00' as original_string to strip or replace the NUL characters):
import os
import shutil
from tempfile import mkstemp

def replace_chars(file_path, original_string, new_string):
    # Create a temp file to hold the cleaned copy
    fh, abs_path = mkstemp()
    with os.fdopen(fh, 'w', encoding='utf-8') as new_file:
        with open(file_path, encoding='utf-8', errors='replace') as old_file:
            print("\nCurrent line: \t")
            for i, line in enumerate(old_file):
                # Show progress on a single console line
                print(i, end="\r", flush=True)
                new_file.write(line.replace(original_string, new_string))
    # Copy the file permissions from the old file to the new file
    shutil.copymode(file_path, abs_path)
    # Remove original file
    os.remove(file_path)
    # Move new file
    shutil.move(abs_path, file_path)
The same but for PowerShell:
(Get-Content "C:\Source.DAT") -replace "`0", " " | Set-Content "C:\Destination.DAT"

error when importing gz files into bigquery

I ran into an error when importing gzipped, tab-delimited files into BigQuery.
The output I got was:
root#a20c6fbdf9b5:/opt/batch/jobs# bq show -j bqjob_r5720e2f2267a5a5b_0000014d09571f27_1
Job infra-bedrock-861:bqjob_r5720e2f2267a5a5b_0000014d09571f27_1
Job Type   State     Start Time        Duration   Bytes Processed
---------- --------- ----------------- ---------- -----------------
load       FAILURE   30 Apr 08:00:44   0:02:05
Errors encountered during job execution. Bad character (ASCII 0) encountered: field starts with: <H:|\ufc0f\ufffd(>
Failure details:
- File: 1 / Line:1 / Field:1: Bad character (ASCII 0) encountered:
field starts with: <\ufff>
- File: 1 / Line:3 / Field:1: Bad character (ASCII 0) encountered:
field starts with: <\u0475\ufffd=\ufffd\ufffd\u03d6>
- File: 1 / Line:4 / Field:1: Bad character (ASCII 0) encountered:
field starts with: <-\ufffd\ufffdY\u049a\ufffd>
- File: 1 / Line:6 / Field:1: Bad character (ASCII 0) encountered:
field starts with: <\u018e\ufffd\ufffd\ufffd\ufffd>
I tried manually downloading the files, unzipping and then uploading the files again. The uncompressed files could be imported into bigquery without any problems.
This looks like a bug in bigquery with zip files

The job configuration shows that the first URI is a non-gzip file, ending in .../20150426/_SUCCESS. BigQuery uses the first file to determine whether compression is enabled.
Assuming this file is empty, you can remove it from your load requests to fix this. If there is data in this file, attach a ".gz" suffix or reorder the list so this file is not first in the URI list.
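A small helper can keep marker files such as _SUCCESS out of the URI list before the load request is built. This is only a sketch: the function name `split_load_uris` and the bucket and object names are hypothetical, and it assumes every real data file ends in ".gz".

```python
def split_load_uris(uris, suffix=".gz"):
    """Split source URIs into (gzip_uris, other_uris), preserving order."""
    gzip_uris = [u for u in uris if u.endswith(suffix)]
    other_uris = [u for u in uris if not u.endswith(suffix)]
    return gzip_uris, other_uris


# Hypothetical bucket and object names, for illustration only.
uris = [
    "gs://example-bucket/logs/20150426/_SUCCESS",
    "gs://example-bucket/logs/20150426/part-00000.gz",
    "gs://example-bucket/logs/20150426/part-00001.gz",
]
to_load, skipped = split_load_uris(uris)
```

Passing only `to_load` to the load job means the first URI is always a gzip file; `skipped` can be logged so nothing is dropped silently.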

Issue when loading data from cloud storage, at least an error message improvement is needed

When I try to load multiple files from Cloud Storage, larger jobs almost always fail. Loading an individual file works, but loading in batches is much more convenient.
Snippet:
Recent Jobs
Load 11:24am
gs://albertbigquery.appspot.com/uep/201409/01/wpc_5012_20140901_0002.log.gz to albertbigquery:uep.201409
Load 11:23am
gs://albertbigquery.appspot.com/uep/201409/01/wpc_5012_20140901_0001.log.gz to albertbigquery:uep.201409
Load 11:22am
gs://albertbigquery.appspot.com/uep/201409/01/* to albertbigquery:uep.201409
Errors:
File: 40 / Line:1 / Field:1, Bad character (ASCII 0) encountered: field starts with: <�>
File: 40 / Line:2 / Field:1, Bad character (ASCII 0) encountered: field starts with: <5C���>}�>
File: 40 / Line:3 / Field:1, Bad character (ASCII 0) encountered: field starts with: <����W�o�>
File: 40 / Line:4, Too few columns: expected 7 column(s) but got 2 column(s). For additional help:
File: 40 / Line:5, Too few columns: expected 7 column(s) but got 1 column(s). For additional help:
File: 40 / Line:6, Too few columns: expected 7 column(s) but got 1 column(s). For additional help:
File: 40 / Line:7, Too few columns: expected 7 column(s) but got 1 column(s). For additional help:
File: 40 / Line:8 / Field:1, Bad character (ASCII 0) encountered: field starts with: <��hy�>
The worst part of this problem is that I don't know which file is "File: 40", because the order seems random. Otherwise I could remove that file and load the rest of the data, or try to find the error in the file.
I also strongly doubt that there is an actual file error. For example, in the case above, when I removed all files but _0001 and _0002 (which loaded fine as single files), I still got this output:
Recent Jobs
Load 11:44am
gs://albertbigquery.appspot.com/uep/201409/01/* to albertbigquery:uep.201409
Errors:
File: 1 / Line:1 / Field:1, Bad character (ASCII 0) encountered: field starts with: <�>
File: 1 / Line:2 / Field:3, Bad character (ASCII 0) encountered: field starts with:
File: 1 / Line:3, Too few columns: expected 7 column(s) but got 1 column(s). For additional help:
File: 1 / Line:4 / Field:3, Bad character (ASCII 0) encountered: field starts with:
Sometimes, though, the files load just fine, so multi-file loading does not seem to be completely broken.
Info:
Average file size is around 20 MB. A directory usually holds about 70 files, between 1 and 2 GB in total.

It looks like you're hitting a BigQuery bug.
When BigQuery gets a load job request with a wildcard pattern (e.g. gs://foo/bar*), we first expand the pattern to the list of files. Then we read the first one to determine the compression type.
One oddity with GCS is that there isn't a real concept of a directory. That is, gs://foo/bar/baz.csv is really bucket: 'foo', object: 'bar/baz.csv'. It looks like you have empty files as placeholders for your directories (as in gs://albertbigquery.appspot.com/uep/201409/01/).
This empty file doesn't play nicely with the BigQuery probe for compression type: when we expand the file pattern, the directory dummy file is the first thing that gets returned. We then open the dummy file, see that it doesn't appear to be a gzip file, and assume the entire load is uncompressed.
We've filed a bug and have a fix under testing. Hopefully the fix will be out next week. In the meantime, your options are to expand the pattern yourself, to use a longer pattern that won't match the directory (as in gs://albertbigquery.appspot.com/uep/201409/01/wpc*), or to delete the dummy directory file.
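To make the failure mode concrete, here is a rough sketch of a "probe the first file" check and of the workaround of dropping placeholder objects. It is an illustration of the behavior described above, not BigQuery's actual code; the function names and the in-memory listing are assumptions.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # every gzip stream starts with these two bytes


def probe_compression(objects):
    """Guess the compression of a whole load from the FIRST object only.

    `objects` is an ordered list of (name, content_bytes) pairs.
    """
    if not objects:
        return "NONE"
    _, first_content = objects[0]
    return "GZIP" if first_content[:2] == GZIP_MAGIC else "NONE"


def drop_placeholders(objects):
    """Remove empty 'directory' objects whose names end with a slash."""
    return [(name, data) for name, data in objects
            if not (name.endswith("/") and len(data) == 0)]


# Hypothetical listing: an empty placeholder sorts ahead of the real files.
listing = [
    ("uep/201409/01/", b""),
    ("uep/201409/01/wpc_5012_20140901_0001.log.gz", gzip.compress(b"a\tb\n")),
]
before = probe_compression(listing)                    # the placeholder is probed
after = probe_compression(drop_placeholders(listing))  # a real gzip file is probed
```

With the placeholder first, the probe sees an empty object and treats the load as uncompressed, so the gzip bytes of the real files are parsed as text. Filtering the placeholder out, or using a pattern that never matches it, lets the probe see a real gzip file.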

Bad character in the file

I tried to load the data from cloud and it failed 3 times.
Job ID: job_2ed0ded6ce1d4837873e0ab498b0bc1b
Start Time: 9:10pm, 1 Aug 2012
End Time: 10:55pm, 1 Aug 2012
Destination Table: 567402616005:company.ox_data_summary_ad_hourly
Source URI: gs://daily_log/ox_data_summary_ad_hourly.txt.gz
Delimiter:
Max Bad Records: 30000
Job ID: job_47447ab60d2a40f588c89dfe638aa438
Line:176073205 / Field:1, Bad character (ASCII 0) encountered. Rest of file not processed.
Too many errors encountered. Limit is: 0.
Should I try again? or is there any issue with the source file?

This is a known bug dealing with gzipped files. The only workaround currently is just to use an uncompressed file.
There are changes coming soon that should make it easier to handle large, uncompressed files (imports will be faster, and file size limits will increase).

Rest of the file not processed

The status is shown as success, but the file is not actually transferred to BigQuery.
# bq show -j abc
Job Type   State     Start Time        Duration   Bytes Processed
---------- --------- ----------------- ---------- -----------------
load       SUCCESS   05 Jul 15:32:45   0:26:24
From web interface, I can see the actual error.
Line:9732968, Too few columns: expected 27 column(s) but got 9 column(s)
Line:10893908 / Field:1, Bad character (ASCII 0) encountered. Rest of file not processed.
1) How do I know which bad character needs to be removed?
2) Why is "success" shown as the job status?
Update:
Job ID: summary_2012_07_09_to_2012_07_10a2
The error that I got at command prompt:
BigQuery error in load operation: Backend Error
A lot of lines were not processed at all. The details from web interface:
Line:9857286 / Field:1, Bad character (ASCII 0) encountered: field starts with: <15>
Line:9857287 / Field:1, Bad character (ASCII 0) encountered. Rest of file not processed.
All the lines were successfully processed in the second attempt:
job_id: summary_2012_07_09_to_2012_07_10a3
Update 2:
Line:174952407 / Field:1, Bad character (ASCII 0) encountered. Rest of file not processed.
Job ID: job_19890847cbc3410495c3cecaf79b31fb

Sorry for the slow response, the holiday weekend meant most of the bigquery team was not answering support questions. The 'bad character' looks like it may be a known bug with some gzipped files where we improperly detect an ascii 0 value at the end of the file.
If the job is actually failing but reporting success, that sounds like a problem but we'll need the job id of the failing job in order to be able to debug. Also if you can reproduce it that would be helpful since we may not have the logs around for the original job anymore.
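On question 1: the error names ASCII 0, which is the NUL byte. One way to check whether the decompressed data really contains NUL bytes near the reported lines is to scan the gzip file locally. This is a minimal sketch with the standard library; the function name `find_nul_lines` and the `limit` parameter are assumptions for illustration.

```python
import gzip


def find_nul_lines(path, limit=10):
    """Return up to `limit` (line_number, offset_in_line) pairs for lines
    of a gzip file that contain a NUL byte. Line numbers start at 1."""
    hits = []
    with gzip.open(path, "rb") as data:
        for line_number, line in enumerate(data, start=1):
            offset = line.find(b"\x00")
            if offset != -1:
                hits.append((line_number, offset))
                if len(hits) >= limit:
                    break
    return hits
```

If this returns an empty list for a file that BigQuery rejected, the data itself is probably clean and the gzip detection issue described above is the more likely cause.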
