Check number of slots used by a query in BigQuery - google-bigquery

Is there a way to check how many slots were used by a query over the period of its execution in BigQuery? I checked the execution plan, but I could only see the Slot Time in ms and could not find any parameter or graph showing the number of slots used over the period of execution. I even tried looking in Stackdriver Monitoring but could not find anything like this. Please let me know if it can be calculated in some way, or if it is shown somewhere I may have missed.

A BigQuery job will report the total number of slot-milliseconds in the extended query stats of the job metadata, which is analogous to computational cost. Each stage of the query plan also reports input stats for the stage, which can be used to determine the number of units of work each stage dispatched.
More details about the representation can be found in the REST reference for jobs. See statistics.query.totalSlotMs and statistics.query.queryPlan[].parallelInputs for more information.
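If you prefer to pull the job-level number with SQL rather than the REST API, the same statistic is exposed as total_slot_ms in the INFORMATION_SCHEMA jobs views. A minimal sketch, assuming the region-us qualifier matches where your jobs run and using a placeholder job ID:
SELECT
  job_id,
  total_slot_ms,
  -- average concurrent slots = slot-milliseconds / wall-clock milliseconds
  SAFE_DIVIDE(total_slot_ms, TIMESTAMP_DIFF(end_time, start_time, MILLISECOND)) AS avg_slots
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE job_id = 'bquxjob_12345'  -- placeholder: replace with the job ID of your query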

BigQuery now provides a key in the Jobs API JSON called "timeline". This structure provides "statistics.query.timeline[].completedUnits" which you can obtain either during job execution or after. If you choose to pull this information after a job has executed, "completedUnits" will be the cumulative sum of all the units of work (slots) utilised during the query execution.
The question might have two parts though: (1) the total number of slots utilised (units of work completed), or (2) the maximum number of units used in parallel at any point in time during the query.
For (1), the answer is as above, given by "completedUnits".
For (2), you might need to consider the maximum value of queryPlan.parallelInputs across all query stages, which would indicate the maximum "number of parallelizable units of work for the stage" (https://cloud.google.com/bigquery/query-plan-explanation)
If, after this, you additionally want to know whether the 2,000 parallel slots that you are allocated across your entire on-demand query project are sufficient, you'd need to find the point in time, across all queries running in your project, where the slots being utilised are at a maximum. This is not a trivial task, but Stackdriver Monitoring provides the clearest view of this.
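If you want to approximate that project-wide picture with SQL as well, here is a minimal sketch, assuming the INFORMATION_SCHEMA.JOBS_TIMELINE_BY_PROJECT view is available in your project's region. It reports slot-milliseconds consumed per one-second period across all jobs, so dividing by 1,000 approximates the number of slots in use during that second:
SELECT
  period_start,
  -- slot-milliseconds consumed by all jobs in this one-second period, converted to slots
  SUM(period_slot_ms) / 1000 AS approx_slots_in_use
FROM `region-us`.INFORMATION_SCHEMA.JOBS_TIMELINE_BY_PROJECT
WHERE job_creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
GROUP BY period_start
ORDER BY approx_slots_in_use DESC
LIMIT 10
The rows with the highest approx_slots_in_use show how close your project has come to the 2,000-slot on-demand ceiling.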

Related

How is duration being calculated in a Spark Structured Streaming UI?

We have a Spark SQL job that we would like to optimize. We are trying to
figure out which part of our pipeline is slower/faster.
In the attached SQL query graph, there are 3 WholeStageCodegen boxes, all with the same duration: 2.9s, 2.9s, 2.9s.
But if we check the Stage graph, it shows 3 seconds for the total stage.
So the durations in the WholeStageCodegen boxes do not add up; it seems that these durations refer to the total for the whole stage. Are we missing something here? Is there a way to figure out the duration for the individual boxes?
Sometimes there is some difference in the duration, but not more than
0.1s, examples:
18.3s, 18.3s, 18.4s
968ms, 967ms, 1.0s
The Stage duration is always the same as one of the WholeStageCodegen durations, or at most 0.1-0.3s larger.
How can one figure out the duration for each of the WholeStageCodegen parts, and is that actually measured? I suspect that Spark would have to trace individual operations as units of generated functions. Is that measurement actually performed there, or are these numbers more like a placeholder for a feature that does not exist?

BigQuery Count Appears to be Processing Data

I noticed that running a SELECT count(*) FROM myTable on my larger BQ tables yields long running times, upwards of 30/40 seconds despite the validator claiming the query processes 0 bytes. This doesn't seem quite right when 500 GB queries run faster. Additionally, total row counts are listed under details -> Table Info. Am I doing something wrong? Is there a way to get total row counts instantly?
When you run a count, BigQuery still needs to allocate resources (such as slots, shards, etc.). You might be reaching some limits, which can cause a delay. For example, the default slot allocation per project is 2,000 units.
The BigQuery execution plan provides very detailed information about the process, which can help you better understand the source of the delay.
One way to overcome this is to use an approximate method described in this link.
This slide deck by Google might also help you.
For more details, see this video about how to understand the execution plan.
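If the goal is simply an instant total row count rather than a faster count(*), one option is to read the row count that BigQuery already stores in the table metadata. A minimal sketch, with myProject, myDataset and myTable as placeholder names:
SELECT row_count
FROM `myProject.myDataset.__TABLES__`  -- reads table metadata, not the table data itself
WHERE table_id = 'myTable'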

What is the meaning of totalSlotMs for a BigQuery job?

What is the meaning of the statistics.query.totalSlotMs value returned for a completed BigQuery job? Except for giving an indication of the relative cost of one job vs another, it's not clear how else one should interpret the number. For example, how does the slot-milliseconds number relate to the Stackdriver-reported total slot usage for a given project (which needs to stay below 2,000 for on-demand BigQuery usage)?
The docs are a bit terse ('[Output-only] Slot-milliseconds for the job.')
The idea is to have a 'slots' metric in the same units in which slot reservations are sold to customers.
For example, imagine that you have a 20-second query that is continuously consuming 4 slots. In that case, your query is using 80,000 totalSlotMs (4 * 20,000).
This way you can determine the average number of slots even if the peak number of slots differs as, in practice, the number of workers will fluctuate over the runtime of a query.
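Conversely, dividing totalSlotMs by the elapsed time recovers that average: 80,000 slot-ms / 20,000 ms = 4 slots on average over the query's runtime.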

How do I know the number of slots used by Bigquery query?

I am trying to figure out the number of slots used by every BigQuery query. Is there a way to find this out?
As per the Google docs, this is how the (average) number of slots used can be calculated:
Number of slots = total_slot_ms / TIMESTAMP_DIFF(end_time, start_time, MILLISECOND)
SELECT
  job_id,
  total_slot_ms / TIMESTAMP_DIFF(end_time, start_time, MILLISECOND) AS num_slot
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
Alternatively, do it manually from the BQ UI execution details if you don't have access to the table above:
Number of slots = Slot time consumed (in milliseconds) / Elapsed time (in milliseconds)
This information is indeed available in job.statistics.query.timeline, which forms part of the Jobs API of BigQuery (https://cloud.google.com/bigquery/docs/reference/rest/v2/jobs#resource). When you get this information it comes in an array like this:
timeline:
[ '{"elapsedMs":"750","totalSlotMs":"2795","pendingUnits":"8","completedUnits":"66","activeUnits":"9"}',
'{"elapsedMs":"1252","totalSlotMs":"3617","pendingUnits":"1","completedUnits":"73","activeUnits":"1"}',
'{"elapsedMs":"2944","totalSlotMs":"5643","pendingUnits":"0","completedUnits":"78","activeUnits":"0"}' ],
So what you can do depends on your actual question:
1) If your question is "What is the total amount of slots utilised by the query over its running time?", then look at the final value of completedUnits
2) If your question is "How are the slots utilised during the query's running time?", then you can build an average of completedUnits over the elapsedMs per timeslice.
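For example, in the timeline shown above, the final snapshot reports "completedUnits":"78", which is the figure for (1); and the "activeUnits" value in each snapshot (9, then 1, then 0) shows how many units were being worked on in parallel at that point in time, which is one way to look at (2).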
There is a Slot Utilization chart in Stackdriver Monitoring for BigQuery.
It shows the Allocated and Available Slots for the selected project.
Unfortunately, I don't think such stats are available on a per-query basis.
You can get per-query slot utilisation using the INFORMATION_SCHEMA tables for jobs.
Example query to get the slot utilisation of today's queries of the current project:
SELECT
project_id,
job_id,
start_time,
end_time,
query,
total_slot_ms,
total_bytes_processed/1e9 AS gbs_processed,
destination_table.table_id AS destination_table
FROM
`region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE DATE(creation_time)=CURRENT_DATE
The total_slot_ms field is what you're looking for, I guess.
In Google's words, it expresses the "Slot-milliseconds for the job over its entire duration." (from the schema documentation).
There are equivalent INFORMATION_SCHEMA tables for individual users (INFORMATION_SCHEMA.JOBS_BY_USER) and for the whole organisation (INFORMATION_SCHEMA.JOBS_BY_ORGANIZATION).

How to use BigQuery Slots

Hi there.
Recently, I wanted to run a query in the BigQuery web UI using GROUP BY over some tables (the table names match the pattern xxx_mst_yyyymmdd). The row count will be over 10 million. Unfortunately, the query failed with this error:
Query Failed
Error: Resources exceeded during query execution.
I made some improvements to my query, so the error may not happen this time. But as my data grows, the error will appear again in the future. So I checked the latest BigQuery release; maybe there are two ways to solve this:
1. After 2016/01/01, BigQuery will change the query pricing tiers to include "High Compute Tiers", so that the resourcesExceeded error will not happen again.
2. BigQuery Slots.
I checked some Google documentation but didn't find how to use BigQuery Slots. Is there any sample or use case for BigQuery Slots? Or do I have to contact the BigQuery team to enable the feature?
I hope someone can help me answer this question. Thanks very much!
A couple of points:
I'm surprised that a GROUP BY with a cardinality of 10M failed with resources exceeded. Can you provide a job ID of the failed query so we can investigate? You mention that you're concerned about hitting these errors more often as your data size increases; you should likely be able to increase your data size by a few more orders of magnitude without seeing this. Most likely you've encountered a bug, or something was strange with your query or your data.
"High Compute Tiers" won't necessarily get rid of resourcesExceeded. For the most part, resourcesExceeded means that BigQuery ran into memory limitations; high compute tiers only address CPU usage. (and note, they haven't been enabled yet).
BigQuery slots enable you to process data faster and with more reliable performance. For the most part, they also wouldn't help prevent resourcesExceeded errors.
There is currently (as of Nov 5) a bug where you may need to provide an EACH keyword with a GROUP BY. Recent changes should enable BigQuery to automatically select the execution strategy, so EACH shouldn't be needed, but there are a couple of cases where it doesn't pick the right one. When in doubt, add an EACH to your JOIN and GROUP BY operations.
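As a minimal legacy SQL illustration of that workaround (the table name follows the xxx_mst_yyyymmdd pattern from the question; the grouping column is a placeholder):
SELECT some_column, COUNT(*) AS cnt
FROM [mydataset.xxx_mst_20151105]
GROUP EACH BY some_column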
To get your project eligible for using slots you need to contact support.