Flex-slots intermittently unable to execute trivial queries - google-bigquery

I am running some tests using flex-slots. I have created a flex-slot commitment/reservation of 100 using QUERY as the type, and assigned it to my project. There have no other queries running concurrently. I am using a trivial query without cached results and mode set to INTERACTIVE to test performance. Everything is running in the US multi-region.
The table I am querying has just 1.2M rows (only 86MB), and is the public wikipedia dataset found in bigquery-samples.wikipedia_benchmark.Wiki1M.
This is the query:
SELECT
SUM(views) AS total_views,
title,
LANGUAGE
FROM
`bigquery-samples.wikipedia_benchmark.Wiki1M`
WHERE
LANGUAGE='en'
AND TITLE LIKE "%Germany%"
GROUP BY
title,
LANGUAGE
ORDER BY
total_views DESC;
A successful run (e.g. bquxjob_1f741592_1853646efb9) runs in under 1 second and uses on average less than a single slot. 100 slots for this query should be more than adequate.
However, intermittently the query (e.g. bquxjob_1913d525_185365564ea) fails with:
"Resources exceeded during query execution: Your project or organization exceeded the maximum disk and memory limit available for shuffle operations. Consider provisioning more slots, reducing query concurrency, or using more efficient logic in this job."
Is there a reason why this trivial query would be failing even with 100 slots assigned and no other queries running concurrently?

Related

ST_INTERSECTS issue BigQuery Waze

I'm using the following code to query a dataset based on a polygon:
SELECT *
FROM `waze-public-dataset.partner_name.view_jams_clustered`
WHERE ST_INTERSECTS(geo, ST_GEOGFROMTEXT("POLYGON((-99.54913355822276 27.60526592074579,-99.52673174853038 27.60526592074579,-99.52673174853038 27.590813604291416,-99.54913355822276 27.590813604291416,-99.54913355822276 27.60526592074579))")) IS TRUE
The validation message says that "This query will process 1 TB when run".
It seems like there's no problem. However, when I remove the "WHERE INTERSECTS" function, the validation message says exactly the same thing: "This query will process 1 TB when run", the same 1 TB, so I'm guessing that the ST_INTERSECTS function is not working.
When you actually run this query, the amount charged should be usually much less, as expected for spatially clustered table. I've run select count(*) ... query with one partner dataset, and while editor UI announced 9TB before running query, the query reported around 150MB processed after running.
The savings come from clustered table - but the specific clusters that intersect the polygon used in filter depend on actual data in the table and how clusters were created. The clusters and thus the cost of the query can only be determined when the query runs. Editor UI in this case shows maximum possible cost of the query.

BigQuery Count Appears to be Processing Data

I noticed that running a SELECT count(*) FROM myTable on my larger BQ tables yields long running times, upwards of 30/40 seconds despite the validator claiming the query processes 0 bytes. This doesn't seem quite right when 500 GB queries run faster. Additionally, total row counts are listed under details -> Table Info. Am I doing something wrong? Is there a way to get total row counts instantly?
When you run a count BigQuery still needs to allocate resources (such as: slot units, shards etc). You might be reaching some limits which cause a delay. For example, the slots default per project is 2,000 units.
BigQuery execution plan provides very detail information about the process which can help you better understand the source of the delay.
One way to overcome this is to use an approximate method described in this link
This Slide by Google might also help you
For more details see this video about how to understand the execution plan

How to get cost for a query in BQ

In BigQuery, how can we get the cost for a given query? We are doing a lot of high-compute queries -- https://cloud.google.com/bigquery/pricing#high-compute -- which often multiplies the data processed by 2 or more.
Is there a way to get the "Cost" of a query with the result set?
For the API or the CLI you could use the flat --dry_run which validates the query instead of running it, like so:
cat ../query.sql | bq query --use_legacy_sql=False --dry_run
Output:
Query successfully validated. Assuming the tables are not modified,
running this query will process 9614741466 bytes of data.
For costs, just divide the total bytes by 1024 ^ 4, multiply the result by 5 and then multiply by the Billing Tier you are in and you have the expected cost ($0.043 in this example).
If you already ran the query and want to know how much it processed, you can run:
bq show -j (job_id of your query)
And it'll return Bytes Billed and Billing Tier (looks like you still have to do the math for cost computation).
For WebUI, you can install BQMate and it already estimates costs for you (but you still have to adapt for your Billing Tier).
As a final recommendation, sometimes it's possible to greatly improve performance of analyzes just by optimizing how the query process data (here at our company we had several high computing queries that now process data normally just by using features such as ARRAYS and STRUCTS for instance).

Why does DevCenter of Datastax has row restrictions to 1000?

There is a limit of displaying 1000 rows for the tables in Datastax Devcenter. Any reason for having this option?
Because when queried as SELECT count(*) FROM tablename;
the performance from Cassandra is going to be same whether displaying 1000 records or complete records set.
DevCenter version 1.6.0 introduces result set paging which allows you to browse all the rows in your result set.
In DevCenter 1.6.0 the "with limit" value sets the paging size, e.g. the number of records to view per page and is still limited to 1000 maximum. However, now you can page forward (and back) through all of the query results.
A related new feature allows you to export all results to a file, either as CSV or INSERT statements. Right-click in the results view area and select "Export all results to File as [CSV|Insert]".
This is by design; consider it as a safeguard that prevents you from potentially fetching thousands or millions of rows by accident, which, among other problems, could have a serious impact on your network's bandwidth usage.
When you run the query in Datastax DevCenter 1.6 it displays the 1000 record in result as selected limit but if you export the same result to CSV it will give you all the record which you are looking for.
I run Datastax Devcenter 1.4.
I run the query with a limit and it provides me the actual count.
But LIMIT is limited to maximum value of signed integer (2147483647)
select count(*) from users LIMIT 2147483647;-- ALLOW FILTERING;

Measure query execution time excluding start-up cost in postgres

I want to measure the total time taken by postgres to execute my query excluding the start-up cost. Earlier I was using \timing but now I found \timing includes start-up cost.
I also tried: "explain analyze" in which I found that actual time is specified in a particular format like: actual time=12.04..12.09
So, does this mean that the time taken to execute postgres query excluding start-up time is 0.05. If not, then is there a way to exclude start-up costs and measure query execution time?
What you want is actually quite ill-defined.
"Startup cost" could mean:
network connection overhead and back-end start cost of establishing a new connection. Avoided by re-using the same session.
network round-trip times for sending the query and getting the results. Avoided by measuring the timing server-side with log_statement_min_duration = 0 or (with timing overhead) using explain analyze or the auto_explain module.
Query planning time. Avoided by PREPAREing the query, then timing only the subsequent EXECUTE.
Lock acquisition time. There is not currently any way to exclude this.
Note that using EXPLAIN ANALYZE may not be ideal for your purposes: it throws the query result away, and it adds its own costs because of the detailed timing it does. I would set log_statement_min_duration = 0, set client_min_messages appropriately, and capture the timings from the log output.
So it sounds like you want to PREPARE a query then EXPLAIN ANALYZE EXECUTE it or just EXECUTE it with log_statement_min_duration set to 0.
For exploring PLANNING costs and EXECUTE costs separately you need to set on several postgres.conf parameters:
log_planner_stats = on
log_executor_stats = on
and explore your log file.
Update:
1. find your config file location with executing:
SHOW config_file;
2. Set parameters. Don't foget to remove comment-symbol '#'.
3. Restart postgresql service
4. Execute your query
5. Explore your log file.