BigQuery maximum processing data size allowance? - google-bigquery

My question is: how much data are we allowed to process on BigQuery? I am using Stack Overflow's Kaggle dataset, and the text I am analyzing is around 27 GB. I just want to get the average length per entry, so I do
query_length_text = """
SELECT
AVG(CHAR_LENGTH(title)) AS avg_title_length,
AVG(CHAR_LENGTH(body)) AS avg_body_length
FROM
`bigquery-public-data.stackoverflow.stackoverflow_posts`
"""
However, this fails with:
Query cancelled; estimated size of 26.847077486105263 exceeds limit of 1 GB
I am only returning a single row of averages, so I know the result size isn't the problem. Does the 1 GB limit apply to the amount of data processed too? How do I process it in batches, so I do 1 GB at a time?

Kaggle sets a 1 GB scan limit on queries by default (to keep your monthly quota of 5 TB from running out), which is what causes this error. You can override it with the max_gb_scanned parameter like this:
df = bq_assistant.query_to_pandas_safe(QUERY, max_gb_scanned=N)
where N is the amount of data (in GB) processed by your query, or any number higher than that.
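For example, with Kaggle's bq_helper package you can first estimate the scan size and then raise the limit above it. A minimal sketch, assuming the BigQueryHelper class and its estimate_query_size / query_to_pandas_safe methods are available as they are in Kaggle kernels:

from bq_helper import BigQueryHelper

# Helper for the public Stack Overflow dataset used in the question.
bq_assistant = BigQueryHelper("bigquery-public-data", "stackoverflow")

query_length_text = """
SELECT
  AVG(CHAR_LENGTH(title)) AS avg_title_length,
  AVG(CHAR_LENGTH(body))  AS avg_body_length
FROM
  `bigquery-public-data.stackoverflow.stackoverflow_posts`
"""

# Estimate how many GB the query will scan before running it.
estimated_gb = bq_assistant.estimate_query_size(query_length_text)
print(f"Estimated scan: {estimated_gb:.2f} GB")

# Raise the safety limit above the estimate (e.g. 30 GB for a ~27 GB scan).
df = bq_assistant.query_to_pandas_safe(query_length_text, max_gb_scanned=30)
print(df)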

Related

Why is Spark reading more data than I expect it to read when using a read schema?

In my Spark job, I'm reading a huge Parquet table with more than 30 columns. To limit the amount of data read, I specify a schema with only one column (the only one I need). Unfortunately, the Spark UI reports that the size of files read is 1123.8 GiB, while the total filesystem read data size is 417.0 GiB. I was expecting that if I take one of 30 columns, the total filesystem read data size would be around 1/30 of the initial size, not almost half.
Could you explain why that is happening?
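For reference, a minimal PySpark sketch of the two usual ways to restrict the read to one column (the path and column name here are placeholders, not taken from the question); with a columnar format like Parquet, both are expected to prune down to a single column, which is what makes the observed numbers surprising:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.appName("column-pruning").getOrCreate()

# Option 1: supply a read schema containing only the needed column.
one_column_schema = StructType([StructField("needed_col", StringType(), True)])
df_schema = spark.read.schema(one_column_schema).parquet("/path/to/huge_table")

# Option 2: read normally and select the column; Parquet column pruning
# should push this down so only that column's pages are read.
df_select = spark.read.parquet("/path/to/huge_table").select("needed_col")

df_select.count()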

How to interpret query processed GB in BigQuery?

I am using a free trial of Google BigQuery. This is the query I am using:
select *
from `test`.events
where subject_id = 124
  and id = 256064
  and time >= '2166-01-15T14:00:00'
  and time <= '2166-01-15T14:15:00'
  and id_1 in (3655, 223762, 223761, 678, 211, 220045, 8368, 8441, 225310, 8555, 8440)
This query is expected to return at most 300 records, no more.
However, running it shows a message reporting how many GB the query will process.
The table this query operates on is really huge. Does that figure indicate the table size? I also ran this query multiple times a day.
Because of this, I ended up with the error below:
Quota exceeded: Your project exceeded quota for free query bytes scanned. For more information, see https://cloud.google.com/bigquery/troubleshooting-errors
How long do I have to wait for this error to go away? Is the daily limit 1 TB? If yes, then I did not use even close to 400 GB.
How do I view my daily usage?
If I can edit the quota, can you let me know which option I should be editing?
Can you help me with the above questions?
According to the official documentation,
"BigQuery charges for queries by using one metric: the number of bytes processed (also referred to as bytes read)", regardless of how large the output is. What this means is that if you run a query that scans the whole of a 1 TB table, you will be charged $5, even though the final output is very minimal.
Note that, due to storage optimizations BigQuery performs internally, the bytes processed might not equal the raw size of the table as you created it.
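One way to check the bytes-processed figure before running a query is a dry run, which is not billed. A sketch using the google-cloud-bigquery Python client (the SQL here is a shortened placeholder based on the question):

from google.cloud import bigquery

client = bigquery.Client()  # uses your default project and credentials

sql = """
SELECT *
FROM `test.events`
WHERE subject_id = 124
"""

# A dry run validates the query and reports bytes processed without running it.
job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
job = client.query(sql, job_config=job_config)

print(f"This query would process {job.total_bytes_processed / 1e9:.2f} GB")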
For the error you're seeing, go to "IAM & admin" and then "Quotas" in the Google Cloud Console, where you can search for the quotas specific to the BigQuery service.
Hope this helps!
Flavien

Google BigQuery results

I am getting only part of the result from the BigQuery API.
Earlier, I solved the issue of 100,000 records per result using iterators.
However, now I'm stuck on another obstacle.
If I select more than 6-7 columns, I do not get the complete result set.
However, if I select a single column, I get the complete result.
Is there a size limit as well for results in the BigQuery API?
There are some limits for query jobs. In particular:
Maximum response size — 128 MB compressed
Of course, the result size is unlimited when writing large query results to a destination table (and then reading from there).
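A rough sketch of that workaround with the google-cloud-bigquery Python client (the project, dataset, and table names are placeholders): write the results to a destination table, which is not subject to the 128 MB response limit, then page through that table.

from google.cloud import bigquery

client = bigquery.Client()

# Destination table for the (possibly large) result set.
destination = "my-project.my_dataset.query_results"

job_config = bigquery.QueryJobConfig(
    destination=destination,
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)

query_job = client.query(
    "SELECT col1, col2, col3 FROM `my-project.my_dataset.big_table`",
    job_config=job_config,
)
query_job.result()  # wait for the query to finish

# Page through the destination table instead of relying on the query response.
for row in client.list_rows(destination, page_size=10_000):
    pass  # process each row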

BQ Load error : Avro parsing error in position 893786302. Size of data block 27406834 is larger than the maximum allowed value 16777216

To BigQuery experts,
I am working on a process that requires us to represent customers' shopping history in a way where we concatenate the last 12 months of transactions into a single column for Solr faceting using prefixes.
While trying to load this data into BigQuery, we are getting the size limit error below. Is there any way to get around this? The actual tuple size is around 64 MB, whereas the Avro limit is 16 MB.
[ ~]$ bq load --source_format=AVRO --allow_quoted_newlines --max_bad_records=10 "syw-dw-prod":"MAP_ETL_STG.mde_golden_tbl" "gs://data/final/tbl1/tbl/part-m-00005.avro"
Waiting on bqjob_r7e84784c187b9a6f_0000015ee7349c47_1 ... (5s) Current status: DONE
BigQuery error in load operation: Error processing job 'syw-dw-prod:bqjob_r7e84784c187b9a6f_0000015ee7349c47_1': Avro parsing error in position 893786302. Size of data
block 27406834 is larger than the maximum allowed value 16777216.
Update: this is no longer true; the limit has been lifted.
BigQuery's limit on a loaded Avro file's block size is 16 MB (https://cloud.google.com/bigquery/quotas#import). Unless each individual row is actually greater than 16 MB, you should be able to split the rows across more blocks to stay within the 16 MB block limit. Using a compression codec may also reduce the block size.
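If you are on an older setup that still enforces the limit, one possible workaround is to rewrite the Avro file with smaller blocks and a compression codec. A sketch using the fastavro package (file names are placeholders; this assumes no single record by itself exceeds the block limit):

import fastavro

# Read the original Avro file and rewrite it with smaller blocks.
with open("part-m-00005.avro", "rb") as src:
    reader = fastavro.reader(src)
    schema = reader.writer_schema
    records = list(reader)  # for very large files, stream instead of materialising

with open("part-m-00005-small-blocks.avro", "wb") as dst:
    fastavro.writer(
        dst,
        schema,
        records,
        codec="deflate",        # compression reduces the on-disk block size
        sync_interval=1 << 20,  # target ~1 MB blocks, well under 16 MB
    )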

Measure upload and download speed on iPhone

I would like to measure the upload and download speed of data on the iPhone. Is there any API available to achieve this? Is it correct to measure it by dividing the total bytes received by the time taken for the response?
Yes, it is correct to measure total bytes / time taken; that is exactly what the speed is. You might want to take a running average if you want to continuously show the download speed, for example using the last 500 bytes and the time it took to download those particular ones.
To do this you could keep an NSMutableData buffer that you empty every 2 seconds, say. Then [buffer length] / 2 tells you how many bytes per second you received during those 2 seconds. When you empty the buffer, append its contents to the data you are downloading.
There is no direct API to get the speed.
Total data received/sent divided by total time only gives you the average speed. There tends to be a lot of variation in speed over time, so if you want a more accurate value, do the speed calculation based on sampling.
(data transferred in 1 minute) / (60 seconds) - use this approach if you need greater accuracy in the speed calculation; the sampling duration can be changed based on the level of accuracy required.
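To illustrate the sampling idea in a language-neutral way (the original context is iOS, so treat this Python sketch as pseudocode for the same approach): accumulate the bytes received over a fixed window and divide by the window length.

import time

WINDOW_SECONDS = 2.0

bytes_in_window = 0
window_start = time.monotonic()

def on_bytes_received(chunk: bytes) -> None:
    """Called whenever a chunk of the download arrives."""
    global bytes_in_window, window_start
    bytes_in_window += len(chunk)

    elapsed = time.monotonic() - window_start
    if elapsed >= WINDOW_SECONDS:
        speed_bps = bytes_in_window / elapsed
        print(f"~{speed_bps / 1024:.1f} KB/s over the last {elapsed:.1f} s")
        # Reset the window (in the iOS version, this is where the buffered
        # data would be appended to the full download).
        bytes_in_window = 0
        window_start = time.monotonic()

# Example usage with fake chunks:
for _ in range(10):
    time.sleep(0.5)
    on_bytes_received(b"x" * 250_000)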