Unable to extract data into a single .csv file from Google BigQuery (though the data is smaller than 1 GB) - google-bigquery

I am able to export the data in 4 different files of about 90 MB each, which doesn't make sense.
I have read the limitations of Google BigQuery, and it says that data larger than 1 GB cannot be downloaded into a single CSV file.
My data is about 250 - 300 MB in size.
This is what I usually do to export data from GBQ:
I saved the table in Google BigQuery (as it has more than 16,000 rows)
Then exported it to a bucket using a wildcard URI as follows:
gs://[your_bucket]/file-name-*.csv
I think 2M rows of data is less than 1 GB. (Let me know if I am wrong.)
Can I get this data in a single CSV file?
Thank you.

You should take out the wildcard from the name of the blob you want to write to; the wildcard is what tells BigQuery that you want to export to multiple files.
So you should instead export to gs://[your_bucket]/file-name.csv
As you noted, this won't work if your data is bigger than 1 GB, but you should be fine if the total is about 300 MB.
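
For reference, a minimal sketch of that export using the Python client library (the project, dataset, table, and bucket names are placeholders; the same idea applies in the console or any other client):

from google.cloud import bigquery

client = bigquery.Client()

# No wildcard in the destination URI, so BigQuery writes a single file
# (only works while the exported data stays under 1 GB).
destination_uri = "gs://your_bucket/file-name.csv"

extract_job = client.extract_table(
    "my-project.my_dataset.my_table",  # placeholder table
    destination_uri,
    job_config=bigquery.ExtractJobConfig(destination_format="CSV"),
)
extract_job.result()  # wait for the export to finish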

You can get a Node.js readable stream that contains the result of your query (https://cloud.google.com/nodejs/docs/reference/bigquery/2.0.x/BigQuery#createQueryStream). Each chunk of data is a row of the result set.
You can then write the data, row by row, to a CSV file (locally or to Cloud Storage).
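
If you'd rather not use Node.js, a rough Python equivalent of the same row-by-row idea (not the API linked above; the query is a placeholder) could look like this:

import csv
from google.cloud import bigquery

client = bigquery.Client()
rows = client.query("SELECT * FROM `my-project.my_dataset.my_table`").result()  # placeholder query

with open("result.csv", "w", newline="") as f:
    writer = csv.writer(f)
    header_written = False
    for row in rows:  # rows are fetched page by page, so memory stays bounded
        if not header_written:
            writer.writerow(row.keys())
            header_written = True
        writer.writerow(row.values())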

Related

Very different file sizes when exporting data from BigQuery to GCS

I am exporting data from BigQuery to GCS with the following query:
export_query = f"""
EXPORT DATA
  OPTIONS(
    uri='{uri}',
    format='PARQUET',
    overwrite=true,
    compression='GZIP')
AS {query}"""
and I am seeing that the resulting files have very different sizes: a few of them are 10x larger than the rest. I am wondering why this happens, and how I can make sure the files all have a similar size.
BigQuery supports a maximum of 1 GB of table data exported to a single file. To export more than 1 GB of data, a wildcard can be used to export the data into multiple files. When exporting data to multiple files, the file sizes vary, as mentioned in the documentation. You can check the possible options for the destinationUris property in this link.
When you export data to multiple files, the sizes of the files will vary because the number of files depends on the number of workers exporting the table/query to GCS in Parquet format. Combining the results into one file would require an additional shuffling step to ensure that all of the data ends up on the same partition, which is not something that BigQuery currently does.
If you want to customize the number of files, then you need to use Dataflow.
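
As a quick way to see how uneven the shards came out, you could list the exported objects and their sizes (a sketch assuming the google-cloud-storage client; the bucket name and prefix are placeholders for wherever the uri points):

from google.cloud import storage

client = storage.Client()
blobs = client.list_blobs("your_bucket", prefix="export/")  # placeholder bucket and prefix

# Print the exported shards from largest to smallest.
for blob in sorted(blobs, key=lambda b: b.size, reverse=True):
    print(f"{blob.name}: {blob.size / (1024 * 1024):.1f} MiB")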

Why is Spark reading more data than I expect it to read when using a read schema?

In my Spark job, I'm reading a huge Parquet table with more than 30 columns. To limit the amount of data read, I specify a schema with only one column (I need only this one). Unfortunately, the Spark UI reports that the size of files read equals 1123.8 GiB, while the filesystem read data size total equals 417.0 GiB. I was expecting that if I read one of the 30 columns, the filesystem read data size total would be around 1/30 of the initial size, not almost half.
Could you explain to me why this is happening?
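
For context, a minimal sketch of the kind of read being described (the column name, its type, and the path are placeholders, not the actual job):

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.appName("schema-pruned-read").getOrCreate()

# Supply a one-field schema so Spark only materializes that column.
one_column = StructType([StructField("my_column", StringType(), True)])

df = spark.read.schema(one_column).parquet("gs://your_bucket/huge_table/")  # placeholder path
df.count()  # trigger the read; check the I/O metrics in the Spark UI afterwards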

Google BigQuery fails with "Resources exceeded during query execution: UDF out of memory" when loading Parquet file

We use the BigQuery Java API to upload data from a local data source as described here. When uploading a Parquet file with 18 columns (16 string, 1 float64, 1 timestamp) and 13 million rows (i.e. 17 GB of data), the upload fails with the following exception:
Resources exceeded during query execution: UDF out of memory.; Failed
to read Parquet file . This might happen if the file contains a row
that is too large, or if the total size of the pages loaded for the
queried columns is too large.
However, when uploading the same data using CSV (17.5 GB of data), the upload succeeds. My questions are:
What is the difference when uploading Parquet or CSV?
What query is executed during upload?
Is it possible to increase the memory for this query?
Thanks
Tobias
Parquet is a columnar data format, which means that loading data requires reading all columns. In Parquet, columns are divided into pages. BigQuery keeps the entire uncompressed pages for each column in memory while reading data from them. If the input file contains too many columns, BigQuery workers can hit out-of-memory errors.
Even though a precise limit is not enforced, as it is with other formats, it is recommended that records stay in the range of 50 MB; loading larger records may lead to resourcesExceeded errors.
Taking into account the above considerations, it would be great to clarify the following points:
What is the maximum row size in your Parquet file?
What is the maximum page size per column?
This info can be retrieved with a publicly available tool (see the sketch below).
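
For example, pyarrow (one such publicly available tool, assumed here) can dump per-column chunk sizes; it reports column-chunk totals rather than individual page sizes, so treat this as an approximation. The file path is a placeholder:

import pyarrow.parquet as pq

metadata = pq.ParquetFile("data.parquet").metadata  # placeholder path

# Print uncompressed and compressed sizes for every column chunk.
for rg_index in range(metadata.num_row_groups):
    row_group = metadata.row_group(rg_index)
    for col_index in range(row_group.num_columns):
        column = row_group.column(col_index)
        print(
            f"row group {rg_index}, column {column.path_in_schema}: "
            f"{column.total_uncompressed_size} B uncompressed, "
            f"{column.total_compressed_size} B compressed"
        )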
If you are thinking about increasing the allocated memory for queries, you need to read about BigQuery slots.
In my case, I ran bq load --autodetect --source_format=PARQUET ... which failed with the same error (resources exceeded during query execution). Finally, I had to split the data into multiple Parquet files so that they would be loaded in batches.
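
The splitting itself can be done in many ways; one rough sketch, assuming pyarrow and placeholder paths and batch size, is to re-chunk the original file into smaller parts and then load each part separately with bq load:

import pyarrow as pa
import pyarrow.parquet as pq

source = pq.ParquetFile("big_input.parquet")  # placeholder path

# Write the data back out as several smaller Parquet files.
for i, batch in enumerate(source.iter_batches(batch_size=1_000_000)):
    pq.write_table(pa.Table.from_batches([batch]), f"part-{i:05d}.parquet")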

Google BigQuery: export a big table to multiple objects in Google Cloud Storage

I have two BigQuery tables, both bigger than 1 GB.
To export them to storage,
https://googlecloudplatform.github.io/google-cloud-php/#/docs/google-cloud/v0.39.2/bigquery/table?method=export
$destinationObject = $storage->bucket('myBucket')->object('tableOutput_*');
$job = $table->export($destinationObject);
I used a wildcard.
The strange thing is that one BigQuery table is exported to 60 files, each of them 3 - 4 MB in size.
The other table is exported to 3 files, each of them close to 1 GB (around 900 MB).
The code is the same. The only difference is that for the table exported to 3 files, I put the files into a subfolder.
The ones exported to 60 files are one level above the subfolder.
My question is: how does BigQuery decide whether a table will be broken into dozens of smaller files or just a few big files (as long as each file is less than 1 GB)?
Thanks!
BigQuery makes no guarantees on the sizes of the exported files, and there is currently no way to adjust this.

How to limit the size of the files exported from BigQuery to GCS?

I used Python code to export data from BigQuery to GCS, and then gsutil to export it to S3. But after exporting to GCS, I noticed that some files are more than 5 GB, which gsutil cannot deal with. So I want to know a way to limit the file size.
So, following the issue tracker, the correct way to read this is:
Single URI ['gs://[YOUR_BUCKET]/file-name.json']
Use a single URI if you want BigQuery to export your data to a single
file. The maximum exported data with this method is 1 GB.
Please note that the 1 GB maximum applies to the amount of data exported, not to the size of the resulting file.
Single wildcard URI ['gs://[YOUR_BUCKET]/file-name-*.json']
Use a single wildcard URI if you think your exported data set will be
larger than 1 GB. BigQuery shards your data into multiple files based
on the provided pattern. Exported file sizes may vary, and files won't
be equal in size.
So again, you need to use this method when your data size is above 1 GB; the resulting file sizes may vary and may go beyond 1 GB, so the 5 GB and 160 MB pair you mentioned can happen with this method.
Multiple wildcard URIs
['gs://my-bucket/file-name-1-*.json',
'gs://my-bucket/file-name-2-*.json',
'gs://my-bucket/file-name-3-*.json']
Use multiple wildcard URIs if you want to partition the export output.
You would use this option if you're running a parallel processing job
with a service like Hadoop on Google Cloud Platform. Determine how
many workers are available to process the job, and create one URI per
worker. BigQuery treats each URI location as a partition, and uses
parallel processing to shard your data into multiple files in each
location.
The same applies here as well: exported file sizes may vary and may go beyond 1 GB.
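
For completeness, a minimal sketch of a wildcard-based extract job with the Python client (the table, bucket, and file names are placeholders); note that none of these options let you pin down the exact shard sizes:

from google.cloud import bigquery

client = bigquery.Client()

# One partition per URI; each * expands into numbered shards.
destination_uris = [
    "gs://your_bucket/file-name-1-*.json",
    "gs://your_bucket/file-name-2-*.json",
    "gs://your_bucket/file-name-3-*.json",
]

job_config = bigquery.ExtractJobConfig(
    destination_format=bigquery.DestinationFormat.NEWLINE_DELIMITED_JSON
)

client.extract_table(
    "my-project.my_dataset.my_table",  # placeholder table
    destination_uris,
    job_config=job_config,
).result()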
Try using a single wildcard URI.
See documentation for Exporting data into one or more files
Use a single wildcard URI if you think your exported data will be
larger than BigQuery's 1 GB per file maximum value. BigQuery shards
your data into multiple files based on the provided pattern. If you
use a wildcard in a URI component other than the file name, be sure
the path component does not exist before exporting your data.
Property definition:
['gs://[YOUR_BUCKET]/file-name-*.json']
Creates:
gs://my-bucket/file-name-000000000000.json
gs://my-bucket/file-name-000000000001.json
gs://my-bucket/file-name-000000000002.json ...
Property definition:
['gs://[YOUR_BUCKET]/path-component-*/file-name.json']
Creates:
gs://my-bucket/path-component-000000000000/file-name.json
gs://my-bucket/path-component-000000000001/file-name.json
gs://my-bucket/path-component-000000000002/file-name.json