Tables truncated when using the BigQuery API (buffer size issue?) - google-bigquery

I am running the following:
bq query --format=csv 'SELECT GKGRECORDID, DATE, SourceCommonName, DocumentIdentifier, V2Persons, V2Tone, TranslationInfo FROM [gdelt-bq:gdeltv2.gkg_partitioned] WHERE V2Persons LIKE "%Orban%" AND _PARTITIONTIME >= TIMESTAMP("2016-11-09") AND _PARTITIONTIME < TIMESTAMP("2016-11-11")' > outputfile.csv
This should return a table with roughly 1000 rows (which I get when I run the query in the normal BigQuery browser interface). However, when I run it through the API, it only returns 100.
It seems like an issue with the size of the buffer, but I thought I'd ask if there is something that can be done on the BigQuery side (for example, a way to send query output in several chunks) to remedy this.
Thanks!

On the command line you can specify how many rows should be returned; the default maximum is 100:
bq query -n 1500 --format=csv '<your query>' > outputfile.csv
Please be aware that the maximum response size is 128 MB compressed, regardless of the number of rows requested.
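If you are going through the API rather than the CLI, the client libraries page through the full result set for you. A minimal sketch with the google-cloud-bigquery Python client (credentials and project are assumed to be configured; only a few columns are selected for brevity):

from google.cloud import bigquery

client = bigquery.Client()

query = """
SELECT GKGRECORDID, DATE, SourceCommonName
FROM `gdelt-bq.gdeltv2.gkg_partitioned`
WHERE V2Persons LIKE '%Orban%'
  AND _PARTITIONTIME >= TIMESTAMP('2016-11-09')
  AND _PARTITIONTIME < TIMESTAMP('2016-11-11')
"""

# result() returns an iterator that transparently fetches additional
# pages from the server, so it is not capped at 100 rows.
for row in client.query(query).result():
    print(row["GKGRECORDID"], row["SourceCommonName"])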

Related

How to interpret query processed GB in BigQuery?

I am using a free trial of Google BigQuery. This is the query that I am using:
select * from `test`.events where subject_id = 124 and id = 256064 and time >= '2166-01-15T14:00:00' and time <='2166-01-15T14:15:00' and id_1 in (3655,223762,223761,678,211,220045,8368,8441,225310,8555,8440)
This query is expected to return at most 300 records and not more than that.
However, I see a message (screenshot omitted) reporting a large amount of data processed. The table on which this query operates is really huge; does this figure indicate the table size? I ran this query multiple times a day, and because of that it eventually resulted in the error below:
Quota exceeded: Your project exceeded quota for free query bytes scanned. For more information, see https://cloud.google.com/bigquery/troubleshooting-errors
How long do I have to wait for this error to go away? Is the daily limit 1 TB? If so, then I didn't use anywhere close to 400 GB.
How can I view my daily usage?
If I can edit the quota, can you let me know which option I should be editing?
Can you help me with the above questions?
According to the official documentation
"BigQuery charges for queries by using one metric: the number of bytes processed (also referred to as bytes read)", regardless of how large the output size is. What this means is that if you do a count(*) on a 1TB table, you will supposedly be charged $5, even though the final output is very minimal.
Note that due to storage optimizations that BigQuery is doing internally, the bytes processed might not equal to the actual raw table size when you created it.
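If you want to see the bytes-processed figure before actually running (and paying for) a query, you can issue a dry run. A minimal sketch with the google-cloud-bigquery Python client; the table and filter below are illustrative:

from google.cloud import bigquery

client = bigquery.Client()

# A dry run validates the query and reports bytes processed without
# executing it, so it does not consume the free-tier scan quota.
job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
job = client.query("SELECT * FROM `test.events` WHERE subject_id = 124",
                   job_config=job_config)
print(f"This query would process {job.total_bytes_processed} bytes.")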
For the error you're seeing, browse to "IAM & admin" and then "Quotas" in the Google Cloud Console, where you can search for the quotas specific to the BigQuery service.
Hope this helps!
Flavien

Google BigQuery results

I am getting only part of the result from the BigQuery API.
Earlier, I solved the issue of 100,000 records per result using iterators.
However, now I'm stuck at another obstacle.
If I select more than 6-7 columns, I do not get the complete result set.
However, if I select a single column, I get the complete result.
Can there be a size limit as well for results in the BigQuery API?
There are some limits for query jobs.
In particular:
Maximum response size — 128 MB compressed
Of course, it is unlimited when writing large query results to a destination table (and then reading from there)
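A sketch of that destination-table pattern with the google-cloud-bigquery Python client (the dataset must already exist; all names below are placeholders):

from google.cloud import bigquery

client = bigquery.Client()

# Results written to a destination table are not subject to the
# 128 MB compressed response limit.
table_ref = bigquery.TableReference.from_string("my-project.my_dataset.big_results")
job_config = bigquery.QueryJobConfig(destination=table_ref)

client.query("SELECT * FROM `bigquery-public-data.samples.shakespeare`",
             job_config=job_config).result()  # wait for the query to finish

# Then read the rows back from the table, page by page.
for row in client.list_rows(table_ref):
    print(row)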

How to get cost for a query in BQ

In BigQuery, how can we get the cost for a given query? We are doing a lot of high-compute queries -- https://cloud.google.com/bigquery/pricing#high-compute -- which often multiplies the data processed by 2 or more.
Is there a way to get the "Cost" of a query with the result set?
For the API or the CLI you could use the flag --dry_run, which validates the query instead of running it, like so:
cat ../query.sql | bq query --use_legacy_sql=False --dry_run
Output:
Query successfully validated. Assuming the tables are not modified,
running this query will process 9614741466 bytes of data.
For costs, just divide the total bytes by 1024^4 (to get TB), multiply the result by 5, and then multiply by the billing tier you are in, and you have the expected cost ($0.043 in this example).
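As a quick sanity check of that arithmetic, here is a hypothetical helper (not part of any BigQuery library) applied to the dry-run figure above:

def estimated_cost_usd(bytes_processed, billing_tier=1, price_per_tb=5.0):
    # On-demand pricing: bytes -> TB (1 TB = 1024**4 bytes), then $5/TB x tier.
    return bytes_processed / 1024**4 * price_per_tb * billing_tier

print(f"${estimated_cost_usd(9614741466):.4f}")  # -> $0.0437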
If you already ran the query and want to know how much it processed, you can run:
bq show -j <job_id of your query>
And it'll return Bytes Billed and Billing Tier (it looks like you still have to do the math for the cost computation yourself).
For the web UI, you can install BQMate, which already estimates costs for you (but you still have to adjust for your billing tier).
As a final recommendation, it is sometimes possible to greatly improve the performance of analyses just by optimizing how the query processes data (at our company, several high-compute queries now run at the normal billing tier simply by using features such as ARRAYs and STRUCTs, for instance).

BigQuery Table Data Export

I am trying to export data from a BigQuery table using the Python API. The table contains 1 to 4 million rows, so I set the maxResults parameter to its maximum, i.e. 100000, and paged through the results. The problem is that I get only 2652 rows per page, so the number of pages is very large. Can anyone explain why, or suggest a way to deal with this? The format is JSON.
Alternatively, can I export data in CSV format without using GCS?
I tried inserting a job with allowLargeResults=true, but the result remained the same.
Below is my query body:
queryData = {'query': query,
             'maxResults': 100000,
             'timeoutMs': 130000}
Thanks in advance.
You can try to export data from the table without using GCS by using the bq command-line tool (https://cloud.google.com/bigquery/bq-command-line-tool), like this:
bq --format=prettyjson query --n=10000000 "SELECT * from publicdata:samples.shakespeare"
You can use --format=json depending on your needs as well.
The actual page size is determined not by row count, but rather by the total size of the rows in a given page; I think it is something around 10 MB.
Users can also set maxResults to limit the rows per page, in addition to the above criterion.
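To see this in practice, you can iterate over pages explicitly and print how many rows each one actually holds. A sketch with the google-cloud-bigquery Python client (the query is a placeholder; actual page sizes depend on row byte size, as noted above):

from google.cloud import bigquery

client = bigquery.Client()
result = client.query("SELECT * FROM `bigquery-public-data.samples.shakespeare`").result(
    page_size=100000  # an upper bound; actual pages may be smaller
)

# Each page is fetched lazily; its real size is capped by response
# byte limits, not just by the requested row count.
for page in result.pages:
    print(f"got a page of {page.num_items} rows")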

Google BigQuery unable to process larger result set getting "Response too large to return" or "Resources exceeded during query execution"

I am currently working with a large table (~105M records) in a C# application.
When I query the table with an 'Order by' or 'Order Each by' clause, I get a "Resources exceeded during query execution" error.
If I remove the 'Order by' or 'Order Each by' clause, I get a "Response too large to return" error.
Here are the sample queries for the two scenarios (I am using the Wikipedia public table):
SELECT Id,Title,Count(*) FROM [publicdata:samples.wikipedia] Group EACH by Id, title Order by Id, Title Desc
SELECT Id,Title,Count(*) FROM [publicdata:samples.wikipedia] Group EACH by Id, title
Here are the questions I have:
What is the maximum size of Big Query Response?
How do we select all the records in Query Request not in 'Export Method'?
1. What is the maximum size of a BigQuery response?
As mentioned in the quota policy, the maximum response size is 10 GB compressed (unlimited when returning large query results).
2. How do we select all the records in Query Request not in 'Export Method'?
If you plan to run a query that might return larger results, you can set allowLargeResults to true in your job configuration (see the sketch after the list below).
Queries that return large results will take longer to execute, even if the result set is small, and are subject to additional limitations:
You must specify a destination table.
You can't specify a top-level ORDER BY, TOP or LIMIT clause. Doing so negates the benefit of using allowLargeResults, because the query output can no longer be computed in parallel.
Window functions can return large query results only if used in conjunction with a PARTITION BY clause.
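A hedged sketch of such a job configuration with the google-cloud-bigquery Python client (allowLargeResults applies to legacy SQL; the destination names below are placeholders):

from google.cloud import bigquery

client = bigquery.Client()

# allowLargeResults requires legacy SQL and a destination table,
# and the query must not use a top-level ORDER BY.
table_ref = bigquery.TableReference.from_string("my-project.my_dataset.wiki_counts")
job_config = bigquery.QueryJobConfig(
    allow_large_results=True,
    destination=table_ref,
    use_legacy_sql=True,
)

sql = ("SELECT Id, Title, COUNT(*) FROM [publicdata:samples.wikipedia] "
       "GROUP EACH BY Id, Title")
client.query(sql, job_config=job_config).result()  # wait for completion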
Read more about how to paginate to get the results here, and also read the BigQuery Analytics book, the pages starting at page 200, where it is explained how Jobs.getQueryResults works together with the maxResults parameter and its blocking mode.
Update:
Query result size limitations: when you run a normal query in BigQuery, the response size is limited to 10 GB of compressed data. Sometimes it is hard to know what 10 GB of compressed data means. Does it get compressed 2x? 10x? The results are compressed within their respective columns, which means the compression ratio tends to be very good. For example, if you have one column that is the name of a country, there will likely be only a few different values. When you have only a few distinct values, there isn't a lot of unique information, and the column will generally compress well. If you return encrypted blobs of data, they will likely not compress well because they will be mostly random. (This is explained in the book linked above, on page 220.)