BQ load error: Avro parsing error in position 893786302. Size of data block 27406834 is larger than the maximum allowed value 16777216 - google-bigquery

To BigQuery experts,
I am working on a process that requires us to represent a customer's shopping history by concatenating the last 12 months of transactions into a single column, for Solr faceting using prefixes.
While trying to load this data into BigQuery, we get the size-limit error below. Is there any way to get around this? The actual tuple size is around 64 MB, whereas the Avro limit is 16 MB.
[ ~]$ bq load --source_format=AVRO --allow_quoted_newlines --max_bad_records=10 "syw-dw-prod":"MAP_ETL_STG.mde_golden_tbl" "gs://data/final/tbl1/tbl/part-m-00005.avro"
Waiting on bqjob_r7e84784c187b9a6f_0000015ee7349c47_1 ... (5s) Current status: DONE
BigQuery error in load operation: Error processing job 'syw-dw-prod:bqjob_r7e84784c187b9a6f_0000015ee7349c47_1': Avro parsing error in position 893786302. Size of data
block 27406834 is larger than the maximum allowed value 16777216.

Update: This is no longer true; the limit has been lifted.
BigQuery's limit on a loaded Avro file's block size is 16 MB (https://cloud.google.com/bigquery/quotas#import). Unless each row is actually greater than 16 MB, you should be able to split the rows across more blocks to stay within the 16 MB block limit. Using a compression codec may also reduce the block size.
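For illustration, here is a minimal sketch in Python using the fastavro library (my choice here, not something from the question; the file paths are placeholders) that rewrites an Avro file with smaller, compressed data blocks. Note this only helps when individual rows are under the 16 MB limit.

from fastavro import reader, writer

with open("part-m-00005.avro", "rb") as src:
    avro_reader = reader(src)
    schema = avro_reader.writer_schema  # reuse the original schema
    with open("part-m-00005-reblocked.avro", "wb") as dst:
        # sync_interval is the approximate block size in bytes: fastavro
        # starts a new block once the write buffer passes this threshold,
        # so ~4 MB blocks stay well under BigQuery's 16 MB cap.
        writer(dst, schema, avro_reader, codec="deflate",
               sync_interval=4 * 1024 * 1024)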

Related

Why is Spark reading more data than I expect it to read using a read schema?

In my Spark job, I'm reading a huge Parquet table with more than 30 columns. To limit the amount of data read, I specify a schema with only one column (I need only that one). Unfortunately, the Spark UI reports that the size of files read equals 1123.8 GiB, while the filesystem read data size total equals 417.0 GiB. I was expecting that if I read one of the 30 columns, the filesystem read data size total would be around 1/30 of the initial size, not almost half.
Could you explain why that is happening?
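For reference, a minimal PySpark sketch of the setup being described (the column and path names are illustrative, not from the question):

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.appName("column-pruning").getOrCreate()

# One-column read schema: Parquet readers prune to just this column,
# but the bytes actually read depend on how that column compresses
# relative to the others, not simply on the 1-of-30 column count.
schema = StructType([StructField("my_column", StringType())])
df = spark.read.schema(schema).parquet("/path/to/huge_table")
df.count()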

Google BigQuery fails with "Resources exceeded during query execution: UDF out of memory" when loading Parquet file

We use the BigQuery Java API to upload data from a local data source as described here. When uploading a Parquet file with 18 columns (16 string, 1 float64, 1 timestamp) and 13 million rows (about 17 GB of data), the upload fails with the following exception:
Resources exceeded during query execution: UDF out of memory.; Failed
to read Parquet file . This might happen if the file contains a row
that is too large, or if the total size of the pages loaded for the
queried columns is too large.
However, when uploading the same data as CSV (17.5 GB of data), the upload succeeds. My questions are:
What is the difference when uploading Parquet or CSV?
What query is executed during upload?
Is it possible to increase the memory for this query?
Thanks
Tobias
Parquet is a columnar data format, which means that loading data requires reading all columns. In Parquet, columns are divided into pages. BigQuery keeps the entire uncompressed pages for each column in memory while reading data from them. If the input file contains too many columns, BigQuery workers can hit out-of-memory errors.
Even though a precise limit is not enforced, as it is with other formats, it is recommended that records be in the range of 50 MB; loading larger records may lead to resourcesExceeded errors.
Taking the above considerations into account, it would be great to clarify the following points:
What is the maximum row size in your Parquet file?
What is the maximum page size per column?
This info can be retrieved with publicly available tools; see the sketch below.
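As one way to do that inspection, here is a sketch using pyarrow (one publicly available option among several; the file name is a placeholder) that prints compressed and uncompressed column-chunk sizes per row group:

import pyarrow.parquet as pq

meta = pq.ParquetFile("data.parquet").metadata
for rg in range(meta.num_row_groups):
    group = meta.row_group(rg)
    for col in range(group.num_columns):
        chunk = group.column(col)  # ColumnChunkMetaData
        print(rg, chunk.path_in_schema,
              chunk.total_compressed_size, chunk.total_uncompressed_size)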
If you are thinking about increasing the allocated memory for queries, you need to read about BigQuery slots.
In my case, I ran bq load --autodetect --source_format=PARQUET ..., which failed with the same error (resources exceeded during query execution). In the end, I had to split the data into multiple Parquet files so that they could be loaded in batches.
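A hedged sketch of that split step using pyarrow (the paths and part count are illustrative; it reads the whole table into memory, so very large files would need a row-group-by-row-group variant):

import pyarrow.parquet as pq

table = pq.read_table("big.parquet")
n_parts = 10
rows_per_part = (table.num_rows + n_parts - 1) // n_parts
for i in range(n_parts):
    part = table.slice(i * rows_per_part, rows_per_part)
    pq.write_table(part, f"part-{i:05d}.parquet")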

BigQuery maximum processing data size allowance?

My question is how much data we are allowed to process on BigQuery. I am using Stack Overflow's Kaggle dataset to analyze the data, and the text I am analyzing is around 27 GB. I just want to get the average length per entry, so I do:
query_length_text = """
SELECT
    AVG(CHAR_LENGTH(title)) AS avg_title_length,
    AVG(CHAR_LENGTH(body)) AS avg_body_length
FROM
    `bigquery-public-data.stackoverflow.stackoverflow_posts`
"""
However, this says:
Query cancelled; estimated size of 26.847077486105263 exceeds limit of 1 GB
I am only returning one float, so I know that isn't the problem. Does the 1 GB limit apply to the processing too? How do I process it in batches, so that I do 1 GB at a time?
So Kaggle by default sets a 1 GB limit on requests (to prevent your monthly quota of 5 TB from running out). This is what causes the error. To get around it, you can override the limit using the max_gb_scanned parameter like this:
df = bq_assistant.query_to_pandas_safe(QUERY, max_gb_scanned=N)
where N is the amount of data processed by your query, or any number higher than it.
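Putting it together with the question's query (bq_assistant and the bq_helper package are as used in Kaggle kernels; outside that environment, treat these names as assumptions):

from bq_helper import BigQueryHelper

bq_assistant = BigQueryHelper("bigquery-public-data", "stackoverflow")

# Estimate how many GB the query would scan before running it ...
estimated_gb = bq_assistant.estimate_query_size(query_length_text)

# ... then set the safety cap just above that estimate.
df = bq_assistant.query_to_pandas_safe(query_length_text,
                                       max_gb_scanned=estimated_gb + 1)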

Is there a way to return the number of bad records loaded to BigQuery?

Is there a way, using the Python API, to get the number of bad records from a job when I load data into a BigQuery dataset?
Below are the statistics available for a load job:
statistics.load (nested object) [Output-only] Statistics for a load job.
statistics.load.inputFileBytes (long) [Output-only] Number of bytes of source data in a load job.
statistics.load.inputFiles (long) [Output-only] Number of source files in a load job.
statistics.load.outputBytes (long) [Output-only] Size of the loaded data in bytes. Note that while a load job is in the running state, this value may change.
statistics.load.outputRows (long) [Output-only] Number of rows imported in a load job. Note that while an import job is in the running state, this value may change.
If you know the expected number of rows, you can figure out the number of bad ones using outputRows.
In the meantime, you can control the number of bad records allowed in your load job:
configuration.load.allowJaggedRows
configuration.load.ignoreUnknownValues
configuration.load.maxBadRecords
All this can be found in
https://cloud.google.com/bigquery/docs/reference/v2/jobs#configuration.load
and
https://cloud.google.com/bigquery/docs/reference/v2/jobs#statistics.load
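For illustration, a minimal sketch with the google-cloud-bigquery Python client (the project, table, URI, and expected row count are placeholders) that caps bad records and derives the bad-record count from outputRows:

from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    max_bad_records=10,          # configuration.load.maxBadRecords
    ignore_unknown_values=True,  # configuration.load.ignoreUnknownValues
)
job = client.load_table_from_uri(
    "gs://my-bucket/data.csv",
    "my-project.my_dataset.my_table",
    job_config=job_config,
)
job.result()  # wait for the load to finish

expected_rows = 1000  # known from the source data
bad_rows = expected_rows - job.output_rows  # statistics.load.outputRows
print(bad_rows, job.errors)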

Per-row size limits in BigQuery data?

Is there a limit to the amount of data that can be put in a single row in BigQuery? Is there a limit on the size of a single column entry (cell)? (in bytes)
Is there a limitation when importing from Cloud Storage?
The largest single row allowed is 1 MB for CSV and 2 MB for JSON. There are no limits on individual field sizes, but obviously they must fit within the row size limit as well.
These limits are described in the BigQuery quotas documentation (https://cloud.google.com/bigquery/quotas).