How to resolve CSV To BigQuery Load Error - google-bigquery

I'm getting the error below while loading a CSV file into a BigQuery table. We did not hit this problem when loading files that were terabytes in size.
'Error while reading data, error message: The options set for reading CSV prevent BigQuery from splitting files to read in parallel, and at least one of the files is larger than the maximum allowed size when files cannot be split. Size is: 7561850767. Max allowed size is: 4294967296.'

The limit for compressed files is 4GB.
If your file is not compressed, check whether it contains any double quote characters ("). An unmatched double quote can make BigQuery treat everything after it as one large field (greater than 4GB) that cannot be split.
You can try loading the file from the command line using something like:
bq --project_id <project_id> load --source_format=CSV --autodetect --quote $(echo -en '\000') <dataset.table> <path_to_source>
The idea is to disable the default quote character, which is the double quote ("), by setting it to a character that does not occur in the data.
Please refer to the CLI documentation for the exact command.
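Before changing the quote option, it can help to confirm that stray quotes are really present. The following is a minimal Python sketch, not part of the original answer: the function name lines_with_odd_quotes and the sample data are made up for illustration. It flags lines that contain an odd number of double quotes. A legitimately quoted field that spans several lines would be flagged too, so treat the output as a list of suspects, not as proof.

```python
import io


def lines_with_odd_quotes(stream, quote='"', limit=10):
    """Return (line_number, line) pairs whose quote count is odd.

    An odd count suggests an unmatched quote, although a quoted field
    that legitimately contains a newline is flagged as well.
    """
    suspects = []
    for line_number, line in enumerate(stream, start=1):
        if line.count(quote) % 2 == 1:
            suspects.append((line_number, line.rstrip("\n")))
            if len(suspects) >= limit:
                break  # a handful of examples is enough for a diagnosis
    return suspects


# Hypothetical sample: the third line opens a quote and never closes it.
sample = io.StringIO('id,name\n1,"Alice"\n2,"Bob\n3,Carol\n')
suspects = lines_with_odd_quotes(sample)
print(suspects)
```

For a real file, pass an open text file handle instead of the in-memory sample.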

Related

Load compressed data from Amazon S3 to Postgres using datastage

I am trying to load data stored in .gz format in S3 into a PostgreSQL server using DataStage. I am using the ODBC connector on the target (database) side. I can load uncompressed data from S3 into PostgreSQL, but have had no luck with compressed data so far. I have tried the Expand stage, but either it does not help or I am not using it correctly. Without the Expand stage, the connector tries to parse the compressed bytes directly, fails, and throws this error:
Amazon_S3_0,1: com.ascential.e2.common.CC_Exception: Failed to initialize the parser: The row delimiter was not found within the first 132 bytes of the file. Ensure that the Row delimiter property matches the row delimiter of the file.
at com.ibm.iis.cc.cloud.CloudLogger.createCCException (CloudLogger.java: 196)
at com.ibm.iis.cc.cloud.CloudStage.processReadAndParse (CloudStage.java: 1591)
at com.ibm.iis.cc.cloud.CloudStage.process (CloudStage.java: 680)
at com.ibm.is.cc.javastage.connector.CC_JavaAdapter.run (CC_JavaAdapter.java: 443)
Amazon_S3_0,1: Failed to initialize the parser: The row delimiter was not found within the first 132 bytes of the file. Ensure that the Row delimiter property matches the row delimiter of the file. (com.ibm.iis.cc.cloud.CloudLogger::createCCException, file CloudLogger.java, line 196)
If someone has come across this, please share your valuable inputs.

Exporting large file from BigQuery to Google cloud using wildcard

I have an 8 GB table in BigQuery that I'm trying to export to Google Cloud Storage (GCS). If I specify the URL as it is, I get an error:
Errors:
Table gs://***.large_file.json too large to be exported to a single file. Specify a uri including a * to shard export. See 'Exporting data into one or more files' in https://cloud.google.com/bigquery/docs/exporting-data. (error code: invalid)
Okay, so I specify * in the file name, but the export produces only 2 files: one of 7.13 GB and one of about 150 MB.
UPD. I thought I would get about 8 files of 1 GB each. Am I wrong, or am I doing something wrong?
P.S. I tried this in the web UI as well as with the Java library.
For files of certain size or larger, BigQuery will export to multiple GCS files - that's why it asks for the "*" glob.
Once you have multiple files in GCS, you can join them into 1 with the compose operation:
gsutil compose gs://bucket/obj1 [gs://bucket/obj2 ...] gs://bucket/composite
https://cloud.google.com/storage/docs/gsutil/commands/compose
To export it to GCS, go to the table and click EXPORT > Export to GCS.
This opens the export dialog.
In Select GCS location you define the bucket, the folder and the file.
For instance, say you have a bucket named daria_bucket (Use only lowercase letters, numbers, hyphens (-), and underscores (_). Dots (.) may be used to form a valid domain name.) and want to save the file(s) in the root of the bucket with the name test. Then you write (in Select GCS location)
daria_bucket/test.csv
Because the file is too big, you get an error. To fix it, you have to split the export into several files using a wildcard. So you need to add *, like this:
daria_bucket/test*.csv
This stores all the data extracted from the table inside the bucket daria_bucket, split across several files named test000000000000.csv, test000000000001.csv, test000000000002.csv, and so on.
In my case (more than a year after you asked the question), exporting a random table of 1.25 GB produced 16 files of 80.3 MB each.
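To make the naming scheme concrete, here is a small Python sketch that expands a wildcard URI the way the file names above suggest: the * is replaced by a 12-digit, zero-padded shard number starting at 0. The helper name shard_uris is hypothetical. The number of shards is decided by BigQuery, not by the caller, so the count passed in here is only for illustration.

```python
def shard_uris(pattern, count):
    """Expand a single-wildcard export URI into the first `count` shard names."""
    if pattern.count("*") != 1:
        raise ValueError("expected exactly one '*' in the URI pattern")
    # Each shard number is zero-padded to 12 digits, e.g. 000000000000.
    return [pattern.replace("*", f"{index:012d}") for index in range(count)]


uris = shard_uris("gs://daria_bucket/test*.csv", 3)
for uri in uris:
    print(uri)
```

A list like this can be handy when building the argument list for gsutil compose after the export has finished.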

Row larger than the maximum allowed size

I have successfully imported many gzipped JSON files on several occasions. For two files, however, the BQ import choked. Both files reported the same error:
File: 0 / Offset:0 / Line:1 / Column:20971521, Row larger than the maximum allowed size
Now I've read about the row limit of 20MB and I understand that the number above is 20MB +1 but what really bugs me is that the meaning is totally off. My GZs have millions of JSONs (each on a new line). I have written a script to measure the longest line (longest JSON) in the failed GZ file and found it to be 103571 bytes. Why is the BQ import choking then?
I have inspected the longest JSON and it looks perfectly normal. How should I interpret the error? How can I fix it?
Why is BQ thinking the import is on line 1, column 20971521 when there are millions of lines in the file?
Your investigation is correct, but you should check your file: its newlines are not being recognized, so BQ sees the whole import as one large line.
That's why it reports column 20971521 for the problem.
You should try importing a sample from the file.
Some of the answers here gave me an idea, so I went on and tried it. It appears that, for some strange reason, BQ didn't like the line endings in the file, so I wrote a quick script to rewrite the original input file with different line endings. Automagically the import worked!
This is utterly strange considering I already imported many GBs of data with pure line endings.
I am happy that it worked but I could never guess why. I hope this helps someone else.
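A quick way to see what line terminators a gzipped file really contains is to count them. The sketch below is an illustration only and is not the script mentioned above: the function name line_ending_stats and the sample file are assumptions. It reads the whole decompressed content into memory, so run it on a sample, not on a multi-gigabyte file.

```python
import gzip
import os
import tempfile


def line_ending_stats(path):
    """Count CRLF, bare LF and bare CR terminators in a gzipped file."""
    with gzip.open(path, "rb") as handle:
        data = handle.read()  # fine for a sample; stream in chunks for huge files
    crlf = data.count(b"\r\n")
    return {
        "crlf": crlf,
        "lf": data.count(b"\n") - crlf,  # LF not preceded by CR
        "cr": data.count(b"\r") - crlf,  # CR not followed by LF
    }


# Hypothetical sample with mixed terminators.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.json.gz")
    with gzip.open(path, "wb") as handle:
        handle.write(b'{"a": 1}\r\n{"a": 2}\r\n{"a": 3}\n')
    stats = line_ending_stats(path)

print(stats)
```

If one of the counters is unexpectedly zero or the styles are mixed, that is a hint that the loader and the file disagree about where rows end.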

Internal error while loading to Bigquery table

I ran this command to load 11 files to a Bigquery table:
bq load --project_id=ardent-course-601 --source_format=NEWLINE_DELIMITED_JSON dw_test.rome_defaults_20140819_test gs://sm-uk-hadoop/queries/logsToBq_transformLogs/rome_defaults/20140819/23af7218-617d-42e8-884e-f213a583094a/part* /opt/sm-analytics/projects/logsTobqMR/jsonschema/rome_defaultsSchema.txt
I got this error:
Waiting on bqjob_r46f38146351d545_00000147ef890755_1 ... (11s) Current status: DONE
BigQuery error in load operation: Error processing job 'ardent-course-601:bqjob_r46f38146351d545_00000147ef890755_1': Too many errors encountered. Limit is: 0.
Failure details:
- File: 5: Unexpected. Please try again.
I tried many times after that and still got the same error.
To debug what went wrong, I instead loaded each file into the BigQuery table one by one. For example:
/usr/local/bin/bq load --project_id=ardent-course-601 --source_format=NEWLINE_DELIMITED_JSON dw_test.rome_defaults_20140819_test gs://sm-uk-hadoop/queries/logsToBq_transformLogs/rome_defaults/20140819/23af7218-617d-42e8-884e-f213a583094a/part-m-00011.gz /opt/sm-analytics/projects/logsTobqMR/jsonschema/rome_defaultsSchema.txt
There are 11 files total and each ran fine.
Could someone please help? Is this a bug on Bigquery side?
Thank you.
There was an error reading one of the files: gs://...part-m-00005.gz
Looking at the import logs, it appears that the gzip reader encountered an error decompressing the file.
It looks like that file may not actually be compressed. BigQuery samples the header of the first file in the list to determine whether it is dealing with compressed or uncompressed files and to determine the compression type. When you import all of the files at once, it only samples the first file.
When you run the files individually, BigQuery reads the header of each file, determines that it isn't actually compressed (despite having the suffix '.gz'), and imports it as a normal flat file.
If you run a load that doesn't mix compressed and uncompressed files, it should work successfully.
Please let me know if you think this is not the case and I'll dig in some more.
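One way to spot a file that carries a .gz suffix without being compressed is to look at its first two bytes, which are 0x1f 0x8b for gzip data. The following Python sketch assumes the files have been copied to a local directory first. The function name looks_gzipped and the sample file names are made up for the example, and it is not a description of how BigQuery itself performs the check.

```python
import gzip
import os
import tempfile

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def looks_gzipped(path):
    """Return True if the file starts with the gzip magic number."""
    with open(path, "rb") as handle:
        return handle.read(2) == GZIP_MAGIC


# Hypothetical sample: one real gzip file and one plain file with a .gz suffix.
with tempfile.TemporaryDirectory() as tmp:
    real = os.path.join(tmp, "part-a.gz")
    fake = os.path.join(tmp, "part-b.gz")
    with gzip.open(real, "wb") as handle:
        handle.write(b'{"a": 1}\n')
    with open(fake, "wb") as handle:
        handle.write(b'{"a": 2}\n')
    results = {os.path.basename(p): looks_gzipped(p) for p in (real, fake)}

print(results)
```

Any file reported as False can then be compressed properly or loaded in a separate job together with the other uncompressed files.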

How to create sequence files from tsv file for text classification

I have a TSV file whose columns are class, id and text, e.g.
positive 2342 This is very good.
negative 4343 I hate it.
and I'm trying to feed it to Mahout's nbayes to classify the text part as either positive or negative.
My first attempt was to write every line as a separate file in its class directory and run the mahout seqdirectory command on that. This works well with a small amount of data but eventually fails at around 30 gigabytes of data with an OutOfMemoryException. Increasing the heap size fails with "GC overhead limit exceeded", probably because of the large number of separate files.
My second attempt was to load the data into a Hive table and convert it to a sequence file, as described here [0]. This seems to work fine at first, but after creating the vector file and splitting up the data set, the trainnb step fails with an ArrayIndexOutOfBounds exception.
[0] http://files.meetup.com/6195792/Working%20With%20Mahout.pdf
Right now I'm out of ideas about what to look for. Any ideas how I can convert the TSV file or Hive table to a sequence file like the one the seqdirectory command generates from a directory?
I'm going to answer this myself in case someone else needs a solution to the same or a similar problem:
I found this code snippet at github and modified it to my needs. Additionally I had to trim the value string to get proper results.
This may be a simpler implementation for those searching for this answer in the future. This can be done completely from the command line (I tested it in EMR):
hadoop jar \
/home/hadoop/contrib/streaming/hadoop-streaming.jar \
-D mapred.reduce.tasks=0 \
-inputformat TextInputFormat \
-input {input_directory}/* \
-mapper '/bin/cat' \
-outputformat org.apache.hadoop.mapred.SequenceFileOutputFormat \
-output {output_directory}
/home/hadoop/contrib/streaming/hadoop-streaming.jar is the location of the hadoop-streaming.jar on Amazon EMR (AMI 3.4.0). It may be in a different location depending on your configuration.
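For anyone who wants to control the keys and values before writing the sequence file, the parsing step can be sketched in Python as below. This is only an illustration: the function name tsv_to_records is hypothetical, and the /class/id key layout is an assumption modelled on the directory-style keys that seqdirectory produces. The sketch trims the text value, as the self-answer above mentions, but it does not write a SequenceFile itself, which still needs Hadoop.

```python
def tsv_to_records(lines):
    """Yield (key, value) pairs from 'class<TAB>id<TAB>text' lines."""
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue  # skip blank lines
        # Split on the first two tabs only, so tabs inside the text survive.
        label, doc_id, text = line.split("\t", 2)
        # Assumed key layout: /<class>/<id>. The value is the trimmed text.
        yield f"/{label}/{doc_id}", text.strip()


# Sample rows taken from the question, with stray whitespace added.
records = list(tsv_to_records([
    "positive\t2342\t This is very good. \n",
    "negative\t4343\tI hate it.\n",
]))
print(records)
```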