Uploading job fails on the same file that was uploaded successfully before - google-bigquery

I'm running a regular upload job that loads a CSV into BigQuery. The job runs every hour. According to the recent failure log, it says:
Error: [REASON] invalid [MESSAGE] Invalid argument: service.geotab.com [LOCATION] File: 0 / Offset:268436098 / Line:218637 / Field:2
Error: [REASON] invalid [MESSAGE] Too many errors encountered. Limit is: 0. [LOCATION]
I went to line 218638 (the original CSV has a header line, so I assume 218638 should be the actual failed line; let me know if I'm wrong), but it seems all right. I checked the corresponding table in BigQuery, and it has that line too, which means I actually uploaded this line successfully before.
Then why does it cause a failure now?
project id: red-road-574
Job ID: Job_Upload-7EDCB180-2A2E-492B-9143-BEFFB36E5BB5

This indicates that there was a problem with the data in your file: it didn't match the schema.
The error message says it occurred at File: 0 / Offset:268436098 / Line:218637 / Field:2. This means the error was in the first file (it looks like you had just one), in the chunk of the file starting 268436098 bytes from the beginning, on the 218637th line counted from that offset.
The reason for the offset portion is that BigQuery processes large files in parallel across multiple workers. Each file worker starts at an offset from the beginning of the file, and the offset we include is the offset that the worker started from.
From the rest of the error message, it looks like the string service.geotab.com showed up in the second field, but the second field was a number, and service.geotab.com isn't a valid number. Perhaps there was a stray newline?
You can see what the lines looked like around the error by doing:
cat <yourfile> | tail -c +268436098 | tail -n +218636 | head -3
This will print out three lines: the one before the error (since I used -n +218636 instead of +218637), the one that had the error, and the next line as well.
Note that if this is just one line in the file that has a problem, you may be able to work around the issue by specifying maxBadRecords.
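If you load with the bq command-line tool, the flag is --max_bad_records; here's a minimal sketch (the dataset, table, and file names below are placeholders, not taken from your job):
# Tolerate up to 10 bad rows before the load job is failed
bq load --source_format=CSV --skip_leading_rows=1 --max_bad_records=10 mydataset.mytable ./hourly_upload.csv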

Related

Amazon CloudWatch log agent ignoring first character of log lines

If I add a batch of n test records to my log file, the awslogs agent erroneously ignores the first character of the first line of that batch. This is fully reproducible. So if I use vi to append the following test log lines:
2017-06-16 15:20:01,123 this line will not get correctly parsed. first character in date string will be skipped.
2017-06-16 15:21:23,456 this will get parsed. 1
2017-06-16 15:22:23,456 this will get parsed. 2
2017-06-16 15:23:23,456 this will get parsed. 3
2017-06-16 15:24:23,456 this will get parsed. 4
2017-06-16 15:25:23,456 this will get parsed. 5
2017-06-16 15:26:23,456 this will get parsed. 6
2017-06-16 15:27:23,456 this will get parsed. 7
The leading 2 in the first row gets omitted by the log agent. In the CloudWatch Logs web console, the event shows up as 017-06-16 15:20:01,123 this line will..., the datetime string does not get successfully parsed, and the log event must use the timestamp of the previous log.
In the common scenario where I add log events to the file one at a time, the first letter of each line is ignored and the timestamp strings do not get correctly parsed. If I append multiple lines in vi before hitting :w save, only the first line experiences this problem and the other lines in the batch get ingested correctly.
I created the log file (as a test) with touch and have only added lines manually with vi so I don't think this is a file encoding problem.
I'm using a mostly standard default configuration.
My CloudWatch Agent Config File:
[general]
state_file = /var/awslogs/state/agent-state
[/var/log/myapp/app.log]
file = /var/log/myapp/app.log
log_group_name = MyAppLogGroup
log_stream_name = MyAppLogStream
datetime_format = %Y-%m-%d %H:%M:%S,%f
Then I download the latest setup script from https://s3.amazonaws.com//aws-cloudwatch/downloads/latest/awslogs-agent-setup.py
And run sudo ./awslogs-agent-setup.py -n -r us-west-2 -c cloudwatch_logs.config
Try setting the
initial_position = start_of_file
option in your config file explicitly; do you get the same behavior?
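Applied to the config above, that would look something like this (only the initial_position line is new; everything else is unchanged):
[/var/log/myapp/app.log]
file = /var/log/myapp/app.log
log_group_name = MyAppLogGroup
log_stream_name = MyAppLogStream
datetime_format = %Y-%m-%d %H:%M:%S,%f
initial_position = start_of_file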

Google BigQuery - bq load failure displays only a file number; how do I get the file name?

I'm running the following bq command
bq load --source_format=CSV --skip_leading_rows=1 --max_bad_records=1000 --replace raw_data.order_20150131 gs://raw-data/order/order/2050131/* order.json
and getting the following message when loading data into BigQuery:
*************************************
Waiting on bqjob_r4ca10491_0000014ce70963aa_1 ... (412s) Current status: DONE
BigQuery error in load operation: Error processing job
'orders:bqjob_r4ca10491_0000014ce70963aa_1': Too few columns: expected
11 column(s) but got 1 column(s). For additional help: http://goo.gl/RWuPQ
Failure details:
- File: 844 / Line:1: Too few columns: expected 11 column(s) but got
1 column(s). For additional help: http://goo.gl/RWuPQ
**********************************
The message displays only the file number.
I checked the files' content, and most of them are good.
gsutil ls and the Cloud Console, on the other hand, display file names.
How can I tell which file it is from the file number?
There seems to be some weird spacing introduced in the question, but if the desired path to ingest is "*/order.json" - that won't work: You can only use "*" at the end of the path when ingesting data to BigQuery.
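As an illustration (reusing the command from the question; order.json appears to be the schema file), the wildcard has to be the last character of the URI:
# Valid: wildcard at the end of the path
bq load --source_format=CSV --skip_leading_rows=1 --max_bad_records=1000 --replace raw_data.order_20150131 "gs://raw-data/order/order/2050131/*" order.json
# Not valid: anything after the wildcard, e.g. gs://raw-data/order/order/2050131/*order.json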

Issue when loading data from Cloud Storage; at the least, an error message improvement is needed

When I try to load multiple files from Cloud Storage, larger jobs almost always fail. Loading an individual file works, but loading batches is much more convenient.
Snippet:
Recent Jobs
Load 11:24am
gs://albertbigquery.appspot.com/uep/201409/01/wpc_5012_20140901_0002.log.gz to albertbigquery:uep.201409
Load 11:23am
gs://albertbigquery.appspot.com/uep/201409/01/wpc_5012_20140901_0001.log.gz to albertbigquery:uep.201409
Load 11:22am
gs://albertbigquery.appspot.com/uep/201409/01/* to albertbigquery:uep.201409
Errors:
File: 40 / Line:1 / Field:1, Bad character (ASCII 0) encountered: field starts with: <�>
File: 40 / Line:2 / Field:1, Bad character (ASCII 0) encountered: field starts with: <5C���>}�>
File: 40 / Line:3 / Field:1, Bad character (ASCII 0) encountered: field starts with: <����W�o�>
File: 40 / Line:4, Too few columns: expected 7 column(s) but got 2 column(s). For additional help:
File: 40 / Line:5, Too few columns: expected 7 column(s) but got 1 column(s). For additional help:
File: 40 / Line:6, Too few columns: expected 7 column(s) but got 1 column(s). For additional help:
File: 40 / Line:7, Too few columns: expected 7 column(s) but got 1 column(s). For additional help:
File: 40 / Line:8 / Field:1, Bad character (ASCII 0) encountered: field starts with: <��hy�>
The worst part of this problem is that I don't know which file "File: 40" is (the order seems random); otherwise I could remove that file and load the data, or try to find the error in the file.
I also strongly doubt that there even is an actual file error. For example, in the above case, when I removed all files but _0001 and _0002 (which loaded fine as single files), I still get this output:
Recent Jobs
Load 11:44am
gs://albertbigquery.appspot.com/uep/201409/01/* to albertbigquery:uep.201409
Errors:
File: 1 / Line:1 / Field:1, Bad character (ASCII 0) encountered: field starts with: <�>
File: 1 / Line:2 / Field:3, Bad character (ASCII 0) encountered: field starts with:
File: 1 / Line:3, Too few columns: expected 7 column(s) but got 1 column(s). For additional help:
File: 1 / Line:4 / Field:3, Bad character (ASCII 0) encountered: field starts with:
Sometimes, though, the files load just fine; otherwise I'd expect multi-file loading to be broken across the board.
Info:
The average file size is around 20 MB; a directory is usually about 70 files totaling somewhere between 1 and 2 GB.
It looks like you're hitting a BigQuery bug.
When BigQuery gets a load job request with a wildcard pattern (e.g. gs://foo/bar*), we first expand the pattern to the list of files. Then we read the first one to determine the compression type.
One oddity with GCS is that there isn't a real concept of a directory; gs://foo/bar/baz.csv is really bucket 'foo', object 'bar/baz.csv'. It looks like you have empty files as placeholders for your directories (as in gs://albertbigquery.appspot.com/uep/201409/01/).
This empty file doesn't play nicely with BigQuery's probe for the compression type: when we expand the file pattern, the directory dummy file is the first thing that gets returned. We then open the dummy file, and since it doesn't appear to be a gzip file, we assume the compression type of the entire load is uncompressed.
We've filed a bug and have a fix under testing; hopefully the fix will be out next week. In the meantime, your options are to expand the pattern yourself, to use a longer pattern that won't match the directory placeholder (as in gs://albertbigquery.appspot.com/uep/201409/01/wpc*), or to delete the dummy directory file.
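A minimal sketch of the last two workarounds, assuming the placeholder really is a zero-byte object literally named uep/201409/01/:
# Delete the "directory" placeholder so it can't be the first wildcard match
gsutil rm "gs://albertbigquery.appspot.com/uep/201409/01/"
# ...or point the load at a pattern the placeholder can't match:
#   gs://albertbigquery.appspot.com/uep/201409/01/wpc*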

Rest of the file not processed

The status is shown as success, but the file is not actually transferred to BigQuery.
# bq show -j abc
Job Type   State     Start Time        Duration   Bytes Processed
---------- --------- ----------------- ---------- -----------------
load       SUCCESS   05 Jul 15:32:45   0:26:24
From web interface, I can see the actual error.
Line:9732968, Too few columns: expected 27 column(s) but got 9 column(s)
Line:10893908 / Field:1, Bad character (ASCII 0) encountered. Rest of file not processed.
1) How do I know which bad character needs to be removed?
2) Why is "success" shown as the job status?
Update:
Job ID: summary_2012_07_09_to_2012_07_10a2
The error that I got at command prompt:
BigQuery error in load operation: Backend Error
A lot of lines were not processed at all. The details from the web interface:
Line:9857286 / Field:1, Bad character (ASCII 0) encountered: field starts with: <15>
Line:9857287 / Field:1, Bad character (ASCII 0) encountered. Rest of file not processed.
All the lines were successfully processed in the second attempt:
job_id: summary_2012_07_09_to_2012_07_10a3
Update 2:
Line:174952407 / Field:1, Bad character (ASCII 0) encountered. Rest of file not processed.
Job ID: job_19890847cbc3410495c3cecaf79b31fb
Sorry for the slow response; the holiday weekend meant most of the BigQuery team was not answering support questions. The 'bad character' looks like it may be a known bug with some gzipped files, where we improperly detect an ASCII 0 value at the end of the file.
If the job is actually failing but reporting success, that sounds like a problem, but we'll need the job ID of the failing job in order to debug. Also, if you can reproduce it, that would be helpful, since we may no longer have the logs around for the original job.
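If you want to check locally whether the uncompressed data really contains NUL bytes, as opposed to this being the gzip-detection bug, something along these lines works (the file name is a placeholder):
# Count ASCII 0 bytes in the data; a non-zero count means they really are in the file
zcat summary.csv.gz | tr -cd '\000' | wc -c
# Stripping them out before reloading:
zcat summary.csv.gz | tr -d '\000' | gzip > summary_clean.csv.gz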

ASE isql output to file is occasionally empty or blank

Given this Unix script, which is run as a scheduled batch:
isql -U$USR -S$SRVR -P$PWD -w2000 < $SCRIPTS/sample_report.sql > $TEMP_DIR/sample_report.tmp_1
sed 's/-\{3,\}//g' $TEMP_DIR/sample_report.tmp_1 > $TEMP_DIR/sample_report.htm_1
uuencode $TEMP_DIR/sample_report.htm_1 sample_report.xls > $TEMP_DIR/sample_report.mail_1
mailx -s "Daily Sample Report" email@example.com < $TEMP_DIR/sample_report.mail_1
There are occasional cases where the sample_report.xls attached to the mail is empty: zero lines.
I have ruled out the following:
not a command-processing timeout - by adding -t30 to isql, I get the xls and it contains the error rather than being empty
not an sql error - by forcing an error in the sql, I get the xls and it contains the error rather than being empty
not sure about login timeout - by adding -l1 it does not time out, but I can't specify a number lower than 1 second, so I can't say
I cannot reproduce this, as I do not know the cause. Has anyone else experienced this, or have a way to address it? Any suggestions on how to find the cause? Is it Unix or Sybase isql?
I found the cause. This job is scheduled, and this particular report takes a long time to generate. Other scheduled scripts, I found, have this line of code:
rm -f $TEMP_DIR/*
If this long-running report overlaps with one of the scheduled scripts containing the line above, the .tmp_1 file can be deleted and hence be blank by the time it is mailed. I replicated this by manually deleting the .tmp_1 file while the report was still writing the SQL output into it.
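One way to avoid the collision, sketched under the assumption that the report script is free to choose its own scratch location (the mktemp template below is illustrative):
# Write into a private scratch directory so another job's `rm -f $TEMP_DIR/*`
# cannot delete the file while the report is still being generated.
WORK_DIR=$(mktemp -d /tmp/sample_report.XXXXXX)
trap 'rm -rf "$WORK_DIR"' EXIT
isql -U$USR -S$SRVR -P$PWD -w2000 < $SCRIPTS/sample_report.sql > $WORK_DIR/sample_report.tmp_1
sed 's/-\{3,\}//g' $WORK_DIR/sample_report.tmp_1 > $WORK_DIR/sample_report.htm_1
uuencode $WORK_DIR/sample_report.htm_1 sample_report.xls > $WORK_DIR/sample_report.mail_1
mailx -s "Daily Sample Report" email@example.com < $WORK_DIR/sample_report.mail_1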