Load Limit for BigQuery Table - google-bigquery

I have tons of avro format documents saved in GCS.
I would like to use BigQuery REST API to load them back as BigQuery tables.
Is there a limit for the total amount of data (such as 10 TB) I can load per day?
Thanks,
Yefu

You are limited by several quota [1]
Maximum size per load job — 15 TB across all input files for CSV, JSON, Avro, Parquet, and ORC
Maximum number of source URIs in job configuration — 10,000 URIs
Maximum number of files per load job — 10 Million total files including all files matching all wildcard URIs
From above, you can upload up to 15TB each time (per job)
Load jobs per table per day — 1,000 (including failures)
Load jobs per project per day — 100,000 (including failures)
With these 2 limits, you can upload up to 1000 * 15TB = 15PB per table per day.
[1] https://cloud.google.com/bigquery/quotas#load_jobs

Related

Why spark is reading more data that I expect it to read using read schema?

In my spark job, I'm reading a huge table (parquet) with more than 30 columns. To limit the size of data read I specify schema with one column only (I need only this one). Unfortunately, when reading the info in spark UI I get the information that the size of files read equals 1123.8 GiB but filesystem read data size total equals 417.0 GiB. I was expecting that if I take one from 30 columns the filesystem read data size total will be around 1/30 of the initial size, not almost half.
Could you explain to me why is that happening?

Google BigQuery fails with "Resources exceeded during query execution: UDF out of memory" when loading Parquet file

We use the BigQuery Java API to upload data from local data source as described here. When uploading a Parquet file with 18 columns (16 string, 1 float64, 1 timestamp) and 13 Mio rows (e.g. 17GB of data) the upload fails with the following exception:
Resources exceeded during query execution: UDF out of memory.; Failed
to read Parquet file . This might happen if the file contains a row
that is too large, or if the total size of the pages loaded for the
queried columns is too large.
However when uploading the same data using CSV (17.5GB of data) the upload succeeds. My questions are:
What is the difference when uploading Parquet or CSV?
What query is executed during upload?
Is it possible to increase the memory for this query?
Thanks
Tobias
Parquet is columnar data format, which means that loading data requires reading all columns. In parquet, columns are divided into pages. BigQuery keeps entire uncompressed pages for each column in memory while reading data from them. If the input file contains too many columns, BigQuery workers can hit Out of Memory errors.
Even when a precise limit is not enforced as it happens with other formats, it is recommended that records should in the range of 50 Mb, loading larger records may lead to resourcesExceeded errors.
Taking into account the above considerations, it would be great to clarify the following points:
What is the maximum size of rows in your Parquet file?
What is the maximum page size per column?
This info can be retrieved by publicly available tool.
If you think about increasing the alocated memory for queries, you need to read about Bigquery slots.
In my case, I ran bq load --autodetect --source_format=PARQUET ... which failed with the same error (resources exceeded during query execution). Finally, I had to split the data into multiple Parquet files so that they would be loaded in batches.

Increasing Spark Read and Parquet Conversion Performance for Gzipped Text File

Use case:
A> Have Text Gzipped files in AWS s3 location
B> Hive Table created on top of the file, to access the data from the file as Table
C> Using Spark Dataframe to read the table and converting into Parquet Data with Snappy Compression
D> Number of fields in the table is 25, which includes 2 partition columns. Data Type is String except for two fields which has Decimal as data type.
Used following Spark Option: --executor-memory 37G --executor-cores 5 --num-executors 20
Cluster Size - 10 Data Nodes of type r3.8xLarge
Found the number of vCores used in AWS EMR is always equal to the number of files, may be because gzip files are not splittable. Gzipped files are coming from different system and size of files are around 8 GB.
Total Time taken is more than 2 hours for Parquet conversion for 6 files with total size 29.8GB.
Is there a way to improve the performance via Spark, using version 2.0.2?
Code Snippet:
val srcDF = spark.sql(stgQuery)
srcDF.write.partitionBy("data_date","batch_number").options(Map("compression"->"snappy","spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version"->"2","spark.speculation"->"false")).mode(SaveMode.Overwrite).parquet(finalPath)
It doesn't matter how many nodes you ask for, or how many cores there are, if you have 6 files, six threads will be assigned to work on them. Try to do one of
save in a splittable format (snappy)
get the source to save their data is many smaller files
do some incremental conversion into a new format as you go along (e.g a single spark-streaming core polling for new gzip files, then saving elsewhere into snappy files. Maybe try with AWS-Lambda as the trigger for this, to save dedicating a single VM to the task.

Google Dataflow not reading more than 3 input compressed files at once when there are multiple sources

Background: I have 30 days data in 30 separate compressed files stored in google storage. I have to write them to a BigQuery table in 30 different partitions in the same table. Each compressed file size was around 750MB.
I did 2 experiments on the same data set on Google Dataflow today.
Experiment 1: I read each day's compressed file using TextIO, applied a simple ParDo transform to prepare TableRow objects and wrote them directly to BigQuery using BigQueryIO. So basically 30 pairs of parallel unconnected sources and and sinks got created. But I found that at any point of time, only 3 files were read, transformed and written to BigQuery. The ParDo transformation and BigQuery writing speed of Google Dataflow was around 6000-8000 elements/sec at any point in time.
So only 3 source and sinks were being processed out of 30 at any time which significantly slowed the process. In over 90 minutes only 7 out 30 files were written to separate BigQuery partitions of a table.
Experiment 2: Here I first read each day's data from the same compressed file for 30 days, applied ParDo transformation on these the 30 PCollections and stored these 30 resultant Pcollections in a PCollectionList object. All these 30 TextIO sources were being read in parallel.
Now I wrote each PCollection corresponding to each day's data in the PCollectionList to BigQuery using BigQueryIO directly. So 30 sinks were being written into again in parallel.
I found that out of 30 parallel sources, again only 3 sources were being read and applied ParDo transformation at a speed of around 20000 elements/sec. At the time of writing of this question when 1 hr had already elapsed, reading from the all the compressed file had not even read completely 50% of the files and writing to the BigQuery table partitions had not even started.
These problems seem to occur only when Google Dataflow reads compressed files. I had asked a question about its slow reading from compressed files(Relatively poor performance when reading compressed files vis a vis normal text files kept in google storage using google dataflow) and was told that parallelizing work would make reading faster as only 1 worker reads a compressed file and multiple sources would mean multiple workers being given chance to read multiple files. But this also does not seem to be working.
Is there any way to speed up this whole process of reading from multiple compressed files and writing to separate partitions of the same table in BigQuery in dataflow job at the same time?
Each compressed file will be read by a single worker. The initial number of workers for a job can be increased with the numWorkers pipeline option, and the maximum number that can be scaled up to can be set with the maxNumWorkers pipeline option.

Cloud DataFlow performance - are our times to be expected?

Looking for some advice on how best to architect/design and build our pipeline.
After some initial testing, we're not getting the results that we were expecting. Maybe we're just doing something stupid, or our expectations are too high.
Our data/workflow:
Google DFP writes our adserver logs (CSV compressed) directly to GCS (hourly).
A day's worth of these logs has in the region of 30-70 million records, and about 1.5-2 billion for the month.
Perform transformation on 2 of the fields, and write the row to BigQuery.
The transformation involves performing 3 REGEX operations (due to increase to 50 operations) on 2 of the fields, which produces new fields/columns.
What we've got running so far:
Built a pipeline that reads the files from GCS for a day (31.3m), and uses a ParDo to perform the transformation (we thought we'd start with just a day, but our requirements are to process months & years too).
DoFn input is a String, and its output is a BigQuery TableRow.
The pipeline is executed in the cloud with instance type "n1-standard-1" (1vCPU), as we think 1 vCPU per worker is adequate given that the transformation is not overly complex, nor CPU intensive i.e. just a mapping of Strings to Strings.
We've run the job using a few different worker configurations to see how it performs:
5 workers (5 vCPUs) took ~17 mins
5 workers (10 vCPUs) took ~16 mins (in this run we bumped up the instance to "n1-standard-2" to get double the cores to see if it improved performance)
50 min and 100 max workers with autoscale set to "BASIC" (50-100 vCPUs) took ~13 mins
100 min and 150 max workers with autoscale set to "BASIC" (100-150 vCPUs) took ~14 mins
Would those times be in line with what you would expect for our use case and pipeline?
You can also write the output to files and then load it into BigQuery using command line/console. You'd probably save some dollars of instance's uptime. This is what I've been doing after running into issues with Dataflow/BigQuery interface. Also from my experience there is some overhead bringing instances up and tearing them down (could be 3-5 minutes). Do you include this time in your measurements as well?
BigQuery has a write limit of 100,000 rows per second per table OR 6M/per minute. At 31M rows of input that would take ~ 5 minutes of just flat out writes. When you add back the discrete processing time per element & then the synchronization time (read from GCS->dispatch->...) of the graph this looks about right.
We are working on a table sharding model so you can write across a set of tables and then use table wildcards within BigQuery to aggregate across the tables (common model for typical BigQuery streaming use case). I know the BigQuery folks are also looking at increased table streaming limits, but nothing official to share.
Net-net increasing instances is not going to get you much more throughput right now.
Another approach - in the mean time while we work on improving the BigQuery sync - would be to shard your reads using pattern matching via TextIO and then run X separate pipelines targeting X number of tables. Might be a fun experiment. :-)
Make sense?