Error loading a local file to a BigQuery table - google-bigquery

I'm trying to load a local file into BigQuery via the API, and it is failing. The file is 98 MB and has a bit over 5 million rows. Note that I have loaded tables with the same number of rows and a slightly bigger file size without problems in the past.
The code I am using is exactly the same as the one in the API documentation, which I have used successfully to upload several other tables. The error I get is the following:
Errors:
Line:2243530, Too few columns: expected 5 column(s) but got 3 column(s)
Too many errors encountered. Limit is: 0.
Job ID: job_6464fc24a4414ae285d1334de924f12d
Start Time: 9:38am, 7 Aug 2012
End Time: 9:38am, 7 Aug 2012
Destination Table: 387047224813:pos_dw_api.test
Source URI: uploaded file
Schema:
tbId: INTEGER
hdId: INTEGER
vtId: STRING
prId: INTEGER
pff: INTEGER
Note that the same file loads just fine from CloudStorage (dw_tests/TestCSV/test.csv), so the problem cannot be the reported line with too few columns: it would fail from CloudStorage too, and I have also checked that all the rows have the correct format.
The following jobs have the same problem; the only differences are the table name and the field names in the schema (but it is the same data file, fields and types). In those attempts it reported a different problematic row:
Line:4288253, Too few columns: expected 5 column(s) but got 4 column(s)
The jobs are the following:
job_cbe54015b5304785b874baafd9c7e82e load FAILURE 07 Aug 08:45:23 0:00:34
job_f634cbb0a26f4404b6d7b442b9fca39c load FAILURE 06 Aug 16:35:28 0:00:30
job_346fdf250ae44b618633ad505d793fd1 load FAILURE 06 Aug 16:30:13 0:00:34
The error that the Python script returns is the following:
{'status': '503', 'content-length': '177', 'expires': 'Fri, 01 Jan 1990 00:00:00 GMT', 'server': 'HTTP Upload Server Built on Jul 27 2012 15:58:36 (1343429916)', 'pragma': 'no-cache', 'cache-control': 'no-cache, no-store, must-revalidate', 'date': 'Tue, 07 Aug 2012 08:36:40 GMT', 'content-type': 'application/json'}
{
  "error": {
    "errors": [
      {
        "domain": "global",
        "reason": "backendError",
        "message": "Backend Error"
      }
    ],
    "code": 503,
    "message": "Backend Error"
  }
}
This looks like there may be an issue at BigQuery. How can I fix this problem?

The temporary files were still around for this import, so I was able to check out the file we tried to import. For job job_6464fc24a4414ae285d1334de924f12d, the last lines were:
222,320828,bot,2,0
222,320829,bot,4,3
222,320829,
It looks like we dropped part of the input file at some point... The input specification says that the MD5 hash should be 58eb7c2954ddfa96d109fa1c60663293, but our hash of the data is 297f958bcf94959eae49bee32cc3acdc, and the file size should be 98921024 bytes, but we only have 83886080.
I'll look into why this is occurring. In the meantime, imports through Google Storage use a much simpler path and should be fine.
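For reference, here is a minimal sketch of that Google Storage route using today's google-cloud-bigquery Python client (the original posts predate it and used the older discovery-based API); the gs:// path, dataset.table, and schema below are taken from the question, so treat them as placeholders for your own project:
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    schema=[
        bigquery.SchemaField("tbId", "INTEGER"),
        bigquery.SchemaField("hdId", "INTEGER"),
        bigquery.SchemaField("vtId", "STRING"),
        bigquery.SchemaField("prId", "INTEGER"),
        bigquery.SchemaField("pff", "INTEGER"),
    ],
)

# Load from the Cloud Storage copy instead of streaming the local file
# through the upload server that returned the 503.
load_job = client.load_table_from_uri(
    "gs://dw_tests/TestCSV/test.csv",  # CloudStorage path mentioned in the question
    "pos_dw_api.test",                 # dataset.table from the job details
    job_config=job_config,
)
load_job.result()  # waits for the load job; raises if it fails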

Related

Selenium IDE doesn't match text, despite using correct wildcard

I'm using Selenium IDE in Chrome to test a web site. When the test runs successfully, the site produces the text " Success Saving Scenario!" Selenium IDE finds this text, but I can't find the right value to match that text.
Here's my setting:
Command: Assert Text
Target: css=li > span
Value: *Success Saving Scenario*
Each time I run this test, the IDE records a failure with the message:
assertText on css=li > span with value *Success Saving Scenario* Failed:
12:23:02
Actual value "Thu, 03 Feb 2022 17:23:02 GMT - Success Saving Scenario!" did not match "*Success Saving Scenario*"
I checked the page, and sure enough the text displays Thu, 03 Feb 2022 17:23:02 GMT - Success Saving Scenario!
Why does that not match *Success Saving Scenario*? I thought the asterisks would be a wildcard that would match any characters.
I've tried these values as well with no success:
glob: *Success Saving Scenario*
regexp: *Success Saving Scenario*
* (just an asterisk by itself)
Any ideas?
I would use 'Assert Element Present' for this case. Find another locator in the dropdown that uses a 'contains' keyword, remove the timestamp prefix from the contains() expression as needed, and leave the value field empty.
Sample
Command: assert element present | Target: xpath=//span[contains(.,'Success Saving Scenario')] | Value: empty

unable to load csv file from GCS into bigquery

I am unable to load a 500 MB CSV file from Google Cloud Storage into BigQuery; I got this error:
Errors:
Too many errors encountered. (error code: invalid)
Job ID xxxx-xxxx-xxxx:bquijob_59e9ec3a_155fe16096e
Start Time Jul 18, 2016, 6:28:27 PM
End Time Jul 18, 2016, 6:28:28 PM
Destination Table xxxx-xxxx-xxxx:DEV.VIS24_2014_TO_2017
Write Preference Write if empty
Source Format CSV
Delimiter ,
Skip Leading Rows 1
Source URI gs://xxxx-xxxx-xxxx-dev/VIS24 2014 to 2017.csv.gz
I gzipped the 500 MB CSV file to .csv.gz before uploading it to GCS. Please help me solve this issue.
The internal details for your job show that there was an error reading row #1 of your CSV file. You'll need to investigate further, but it could be that you have a header row that doesn't conform to the schema of the rest of the file, so we're trying to parse a string in the header as an integer or boolean or something like that. You can set the skipLeadingRows property to skip such a row.
Other than that, I'd check that the first row of your data matches the schema you're attempting to import with.
Also, the error message you received is unfortunately very unhelpful, so I've filed a bug internally to make the error you received in this case more helpful.
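In case it helps, skipLeadingRows maps onto the current google-cloud-bigquery Python client roughly as follows (it is the same property the --skip_leading_rows flag of bq load sets); the gs:// URI below is a placeholder, and autodetect stands in for whatever schema you are importing with:
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    field_delimiter=",",
    skip_leading_rows=1,  # don't parse the header row against the schema
    autodetect=True,      # or pass an explicit schema=[...] instead
)

# BigQuery decompresses .gz CSV sources itself, so the gzipped file can be
# referenced directly.
load_job = client.load_table_from_uri(
    "gs://your-bucket/VIS24_2014_to_2017.csv.gz",  # placeholder URI
    "DEV.VIS24_2014_TO_2017",                      # dataset.table from the job details
    job_config=job_config,
)
load_job.result()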

Sails.js file upload returns empty files array when large file is uploaded in multi-upload form

I have a multipart form in my Sails.js project which submits 2 different files (first audio, then image) along with some text params. In most cases, with rather small files, everything works fine. But when I tried a bigger audio file (33 MB), I got an empty files array for my image field in the receiver.
Here is some code.
The Controller:
var uploadParamNames = ['audio', 'image'];
async.map(uploadParamNames,
  function (file, cb) {
    sails.log(req.file(file)._files)
    req.file(file).upload(
      {
        adapter: require('skipper-gridfs'),
        uri: sails.config.connections.mongoConnection.url + '.' + file
      },
      function (err, files) {
        // save the file, and then:
        return cb(err, files);
      });
  }, function doneUploading(err, files) {
    ...
  });
Basically, here I get the following logs for audio and image:
[ { stream: [Object], status: 'bufferingOrWriting' } ]
[]
I did some debugging and found that, in the case of the image field, it never reaches the line where the file is actually written (up.writeFile(part); in prototype.onFile.js).
Also the debug log prints the following:
Parser: Read a chunk of textparam through field `_csrf`
Parser: Read a chunk of textparam through field `ss-name`
Parser: Read a chunk of textparam through field `ss-desc`
Parser: Read a chunk of textparam through field `ss-category`
Parser: Read a chunk of textparam through field `ss-language`
Parser: Read a chunk of textparam through field `ss-place`
Parser: Read a chunk of textparam through field `ss-place-lat`
Parser: Read a chunk of textparam through field `ss-place-lon`
Acquiring new Upstream for field `audio`
Tue, 13 Oct 2015 10:52:54 GMT skipper Set up "maxTimeToWaitForFirstFile" timer for 10000ms
Tue, 13 Oct 2015 10:52:58 GMT skipper passed control to app because first file was received
Tue, 13 Oct 2015 10:52:58 GMT skipper waiting for any text params
Upstream: Pumping incoming file through field `audio`
Parser: Done reading textparam through field `_csrf`
Parser: Done reading textparam through field `ss-name`
Parser: Done reading textparam through field `ss-desc`
Parser: Done reading textparam through field `ss-category`
Parser: Done reading textparam through field `ss-language`
Parser: Done reading textparam through field `ss-tags`
Parser: Done reading textparam through field `ss-place`
Parser: Done reading textparam through field `ss-place-lat`
Parser: Done reading textparam through field `ss-place-lon`
Tue, 13 Oct 2015 10:53:11 GMT skipper Something is trying to read from Upstream `audio`...
Tue, 13 Oct 2015 10:53:11 GMT skipper Passing control to app...
Tue, 13 Oct 2015 10:53:16 GMT skipper maxTimeToWaitForFirstFile timer fired- as of now there are 1 file uploads pending (so it's fine)
debug: [ { stream: [Object], status: 'bufferingOrWriting' } ]
Tue, 13 Oct 2015 10:53:41 GMT skipper .upload() called on upstream
Acquiring new Upstream for field `image`
Tue, 13 Oct 2015 10:53:46 GMT skipper Set up "maxTimeToWaitForFirstFile" timer for 10000ms
debug: []
Not sure why, but it seems control is already passed to the app before the image file is written. Again, this only happens with a larger audio file. Is there a way to fix this?
EDIT:
More debugging showed that the receivedFirstFileOfRequest listener is called before the image file is written. Which is logical, because it listens for the first file upload, but what should happen with the next files?
EDIT:
Ah... the file doesn't need to be very large at all. A 29 KB file passes and a 320 KB file does not...

Bad character in the file

I tried to load the data from Cloud Storage and it failed 3 times.
Job ID: job_2ed0ded6ce1d4837873e0ab498b0bc1b
Start Time: 9:10pm, 1 Aug 2012
End Time: 10:55pm, 1 Aug 2012
Destination Table: 567402616005:company.ox_data_summary_ad_hourly
Source URI: gs://daily_log/ox_data_summary_ad_hourly.txt.gz
Delimiter:
Max Bad Records: 30000
Job ID: job_47447ab60d2a40f588c89dfe638aa438
Line:176073205 / Field:1, Bad character (ASCII 0) encountered. Rest of file not processed.
Too many errors encountered. Limit is: 0.
Should I try again? or is there any issue with the source file?
This is a known bug dealing with gzipped files. The only workaround currently is just to use an uncompressed file.
There are changes coming soon that should make it easier to handle large, uncompressed files (imports will be faster, and file size limits will increase).
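A hypothetical sketch of that workaround in Python, decompressing the export locally and re-uploading the plain file (the bucket and object names come from the job details above; adjust as needed):
import gzip
import shutil
from google.cloud import storage

# Decompress the gzipped export so BigQuery reads a plain delimited file.
with gzip.open("ox_data_summary_ad_hourly.txt.gz", "rb") as src, \
        open("ox_data_summary_ad_hourly.txt", "wb") as dst:
    shutil.copyfileobj(src, dst)

# Upload the uncompressed copy back to the same bucket, then point the load
# job at gs://daily_log/ox_data_summary_ad_hourly.txt instead of the .gz file.
bucket = storage.Client().bucket("daily_log")
blob = bucket.blob("ox_data_summary_ad_hourly.txt")
blob.upload_from_filename("ox_data_summary_ad_hourly.txt")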

Unexpected error while loading data

I am getting an "Unexpected" error. I tried a few times, and I still could not load the data. Is there any other way to load data?
gs://log_data/r_mini_raw_20120510.txt.gz to 567402616005:myv.may10c
Errors:
Unexpected. Please try again.
Job ID: job_4bde60f1c13743ddabd3be2de9d6b511
Start Time: 1:48pm, 12 May 2012
End Time: 1:51pm, 12 May 2012
Destination Table: 567402616005:myvserv.may10c
Source URI: gs://log_data/r_mini_raw_20120510.txt.gz
Delimiter: ^
Max Bad Records: 30000
Schema:
zoneid: STRING
creativeid: STRING
ip: STRING
Update:
I am using the file that can be found here:
http://saraswaticlasses.net/bad.csv.zip
bq load -F '^' --max_bad_record=30000 mycompany.abc bad.csv id:STRING,ceid:STRING,ip:STRING,cb:STRING,country:STRING,telco_name:STRING,date_time:STRING,secondary:STRING,mn:STRING,sf:STRING,uuid:STRING,ua:STRING,brand:STRING,model:STRING,os:STRING,osversion:STRING,sh:STRING,sw:STRING,proxy:STRING,ah:STRING,callback:STRING
I am getting an error "BigQuery error in load operation: Unexpected. Please try again."
The same file works from Ubuntu, while it does not work from CentOS 5.4 (Final).
Does the OS encoding need to be checked?
The file you uploaded has an unterminated quote. Can you delete that line and try again? I've filed an internal bigquery bug to be able to handle this case more gracefully.
$grep '"' bad.csv
3000^0^1.202.218.8^2f1f1491^CN^others^2012-05-02 20:35:00^^^^^"Mozilla/5.0^generic web browser^^^^^^^^
When I run a load from my workstation (Ubuntu), I get a warning about the line in question. Note that if you were using a larger file, you would not see this warning, instead you'd just get a failure.
$bq show --format=prettyjson -j job_e1d8636e225a4d5f81becf84019e7484
...
"status": {
"errors": [
{
"location": "Line:29057 / Field:12",
"message": "Missing close double quote (\") character: field starts with: <Mozilla/>",
"reason": "invalid"
}
]
My suspicion is that you have rows or fields in your input data that exceed the 64 KB limit. Perhaps re-check the formatting of your data, check that it is gzipped properly, and if all else fails, try importing uncompressed data. (One possibility is that the entire compressed file is being interpreted as a single row/field that exceeds the aforementioned limit.)
To answer your original question, there are a few other ways to import data: you could upload directly from your local machine using the command-line tool or the web UI, or you could use the raw API. However, all of these mechanisms (including the Google Storage import that you used) funnel through the same CSV parser, so it's possible that they'll all fail in the same way.
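If you want to narrow this down before re-running the load, a quick local pre-flight check along these lines will flag NUL bytes, unbalanced quotes, and rows with the wrong field count; this is just a sketch assuming the '^' delimiter and the 21-column schema from the bq load command above:
EXPECTED_COLS = 21  # number of fields in the schema passed to bq load above

with open("bad.csv", "rb") as f:
    for lineno, raw in enumerate(f, start=1):
        if b"\x00" in raw:
            print(f"line {lineno}: contains a NUL byte (ASCII 0)")
        line = raw.decode("utf-8", errors="replace").rstrip("\r\n")
        if line.count('"') % 2 != 0:
            print(f"line {lineno}: unbalanced double quote")
        fields = line.split("^")
        if len(fields) != EXPECTED_COLS:
            print(f"line {lineno}: expected {EXPECTED_COLS} fields, got {len(fields)}")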