How do I know the number of slots used by a BigQuery query? - google-bigquery

I am trying to figure out the number of slots used by every BigQuery query. Is there a way to find it out?

As per the Google docs, this is how you can calculate the (average) number of slots used:
Number of slots = total_slot_ms / TIMESTAMP_DIFF(end_time, start_time, MILLISECOND)
SELECT job_id,
  total_slot_ms / TIMESTAMP_DIFF(end_time, start_time, MILLISECOND) AS num_slot
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
Alternatively, you can compute it manually from the BQ UI execution details if you don't have access to the table above:
Number of slots = Slot time consumed (converted to milliseconds) / Elapsed time (converted to milliseconds)
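As a quick illustration with made-up numbers: a job that consumed 45,000 slot-ms over a 15-second (15,000 ms) run averaged 45,000 / 15,000 = 3 slots. The same arithmetic as a trivial query:
# illustrative only: 45,000 slot-ms over 15,000 elapsed ms = 3 slots on average
SELECT 45000 / 15000 AS num_slot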

This information is indeed available in job.statistics.query.timeline, which forms part of the Jobs API of BigQuery (https://cloud.google.com/bigquery/docs/reference/rest/v2/jobs#resource). When you get this information it comes in an array like this:
timeline:
[ '{"elapsedMs":"750","totalSlotMs":"2795","pendingUnits":"8","completedUnits":"66","activeUnits":"9"}',
'{"elapsedMs":"1252","totalSlotMs":"3617","pendingUnits":"1","completedUnits":"73","activeUnits":"1"}',
'{"elapsedMs":"2944","totalSlotMs":"5643","pendingUnits":"0","completedUnits":"78","activeUnits":"0"}' ],
So what you can do depends on your actual question:
1) If your question is "What is the total amount of slots utilised by the query over its running time?", then look at the final value of completedUnits
2) If your question is "How are the slots utilised during the query's running time?", then you can build an average of completedUnits over the elapsedMs per timeslice.
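For instance, using the sample timeline above (and noting that elapsedMs and totalSlotMs are cumulative): the last entry gives an overall average of 5643 / 2944 ≈ 1.9 slots, while the first entry gives 2795 / 750 ≈ 3.7 slots for the opening 750 ms, suggesting heavier parallelism early in the query.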

There is a Slot Utilization chart in Stackdriver Monitoring for BigQuery.
It shows the allocated and available slots for the selected project.
Unfortunately, I don't think such stats are available on a per-query basis.

You can get per-query slot utilisation using the INFORMATION_SCHEMA tables for jobs.
Example query to get the slot utilisation of today's queries of the current project:
SELECT
project_id,
job_id,
start_time,
end_time,
query,
total_slot_ms,
total_bytes_processed/1e9 AS gbs_processed,
destination_table.table_id AS destination_table
FROM
`region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE DATE(creation_time)=CURRENT_DATE
The total_slot_ms field is what you're looking for, I guess.
In Google's words, it expresses the "Slot-milliseconds for the job over its entire duration." (from the schema documentation).
There are equivalent INFORMATION_SCHEMA tables for individual users (INFORMATION_SCHEMA.JOBS_BY_USER) and for the whole organisation (INFORMATION_SCHEMA.JOBS_BY_ORGANIZATION).
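As a minimal sketch (assuming the same region-us qualifier as above), the per-user view is queried the same way, e.g. to list your own jobs from today along with their slot consumption:
SELECT job_id, total_slot_ms
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_USER
WHERE DATE(creation_time) = CURRENT_DATE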

Related

CrUX dataset BigQuery - Query for Min/Avg/Max LCP, FID and CLS

I have been exploring the CrUX dataset in BigQuery for the last 10 days to extract data for a Data Studio report. Though I consider myself good at SQL, as I have mostly worked with Oracle and SQL Server, I am finding it very hard to write queries against this dataset. I started from this article by Rick Viscomi and explored the queries in his GitHub repo, but I am still unable to figure it out.
I am trying to use the materialized table chrome-ux-report.materialized.metrics_summary to get some of the metrics, but I am not sure whether the min/avg/max LCP (in milliseconds) for a time period (a month, for example) can be extracted from this table. What other queries could I try that require less data processing? (Some of the queries that I tried used up my free TB of data processing on BigQuery.)
Any suggestions, advice, solutions, or queries are more than welcome, since the documentation about the structure of the dataset and queries against it is not very clear.
For details about the fields used in the report, you can check the main documentation for the Chrome UX Report, especially the last part on the data format, which shows the dimensions and how they are interpreted, as shown below:
Dimension                         Value
origin                            "https://example.com"
effective_connection_type.name    4G
form_factor.name                  "phone"
first_paint.histogram.start       1000
first_paint.histogram.end         1200
first_paint.histogram.density     0.123
For example, the above shows a sample record from the Chrome User Experience Report, which indicates that 12.3% of page loads had a “first paint time” measurement in the range of 1000-1200 milliseconds when loading “http://example.com” on a “phone” device over a ”4G”-like connection. To obtain a cumulative value of users experiencing a first paint time below 1200 milliseconds, you can add up all records whose histogram’s “end” value is less than or equal to 1200.
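A hedged sketch of that cumulative calculation, assuming the same monthly CrUX table that is referenced in the sample query below (`end` is backtick-quoted because it is a reserved word in standard SQL):
SELECT
  SUM(bin.density) AS proportion_under_1200ms
FROM
  `chrome-ux-report.chrome_ux_report.201710`,
  UNNEST(first_paint.histogram.bin) AS bin
WHERE
  bin.`end` <= 1200 AND
  origin = 'http://example.com'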
For the metrics, the initial link has a section called Methodology where you can get information about the metrics and dimensions of the report. I recommend going to the actual origin source tables per country and per site rather than the summary, as the data you are looking for can be obtained there. In the BigQuery part of the documentation you will find samples of how to query those tables. I find this one relevant:
SELECT
SUM(bin.density) AS density
FROM
`chrome-ux-report.chrome_ux_report.201710`,
UNNEST(first_contentful_paint.histogram.bin) AS bin
WHERE
bin.start < 1000 AND
origin = 'http://example.com'
In the example above we’re adding all of the density values in the FCP histogram for “http://example.com” where the FCP bin’s start value is less than 1000 ms. The result is 0.7537, which indicates that ~75.4% of page loads experience the FCP in under a second.
Regarding query cost estimation, see the estimating query costs guide in the official BigQuery documentation. By their nature, queries against these tables process a lot of data, so filter them as much as possible.

Teradata Current CPU utilization (Not User level and no History data)

I want to run a heavy extraction, basically for a migration of data from Teradata to a cloud warehouse, and would like to check the current CPU utilization (as a percentage) of the overall Teradata system so I can scale the extraction processes accordingly.
I know this type of information is available in "dbc.resusagespma", but that looks like historical data rather than the current values we can see in Viewpoint.
Can we get such a run time information with the help of SQL in Teradata?
This info is returned by one of the PMPC-API functions, syslib.MonitorPhysicalSummary; of course, you need Execute Function rights:
SELECT * FROM TABLE (MonitorPhysicalSummary()) AS t

Check number of slots used by a query in BigQuery

Is there a way to check how many slots were used by a query over the period of its execution in BigQuery? I checked the execution plan, but I could only see the Slot Time in ms and could not find any parameter or graph showing the number of slots used over the period of execution. I even tried looking at Stackdriver Monitoring but could not find anything like this. Please let me know if it can be calculated in some way or if I can see it somewhere I might have missed.
A BigQuery job will report the total number of slot-milliseconds from the extended query stats in the job metadata, which is analogous to computational cost. Each stage of the query plan also indicates input stats for the stage, which can be used to indicate the number of units of work each stage dispatched.
More details about the representation can be found in the REST reference for jobs. See statistics.query.totalSlotMs and statistics.query.queryPlan[].parallelInputs for more information.
BigQuery now provides a key in the Jobs API JSON called "timeline". This structure provides "statistics.query.timeline[].completedUnits" which you can obtain either during job execution or after. If you choose to pull this information after a job has executed, "completedUnits" will be the cumulative sum of all the units of work (slots) utilised during the query execution.
The question might have two parts though: (1) Total number of slots utilised (units of work completed) or (2) Maximum parallel number of units used at a point in time by the query.
For (1), the answer is as above, given by "completedUnits".
For (2), you might need to consider the maximum value of queryPlan.parallelInputs across all query stages, which would indicate the maximum "number of parallelizable units of work for the stage" (https://cloud.google.com/bigquery/query-plan-explanation)
If, after this, you additionally want to know whether the 2000 parallel slots allocated across your entire on-demand query project are sufficient, you'd need to find the point in time, across all queries taking place in your project, where slot utilisation is at its maximum. This is not a trivial task, but Stackdriver Monitoring provides the clearest view of this.
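One hedged way to approximate that peak from SQL, assuming your region exposes the INFORMATION_SCHEMA.JOBS_TIMELINE_BY_PROJECT view (which reports per-second slot consumption per job), is to sum the per-second slot-ms across all jobs and look for the largest values:
SELECT
  period_start,
  SUM(period_slot_ms) / 1000 AS approx_slots_in_use
FROM
  `region-us`.INFORMATION_SCHEMA.JOBS_TIMELINE_BY_PROJECT
WHERE
  job_creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
GROUP BY period_start
ORDER BY approx_slots_in_use DESC
LIMIT 10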

What is the meaning of totalSlotMs for a BigQuery job?

What is the meaning of the statistics.query.totalSlotMs value returned for a completed BigQuery job? Except for giving an indication of relative cost of one job vs the other, it's not clear how else one should interpret the number. For example, how does the slot-milliseconds number relate to the stack driver reported total slot usage for a given project (which needs to stay below 2000 for on demand BigQuery usage)?
The docs are a bit terse ('[Output-only] Slot-milliseconds for the job.')
The idea is to have a 'slots' metric in the same units in which slot reservations are sold to customers.
For example, imagine that you have a 20-second query that is continuously consuming 4 slots. In that case, your query is using 80,000 totalSlotMs (4 * 20,000).
This way you can determine the average number of slots even if the peak number of slots differs as, in practice, the number of workers will fluctuate over the runtime of a query.
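To make that concrete with illustrative numbers: a query that runs at 12 slots for its first 10 seconds and 4 slots for the next 10 seconds reports totalSlotMs = 12 * 10,000 + 4 * 10,000 = 160,000; dividing by the 20,000 ms elapsed gives an average of 8 slots, even though the peak was 12.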

Export BigQuery Logs

I want to analyze the activity on BigQuery during the past month.
I went to the Cloud Console and the (very inconvenient) log viewer. I set up exports to BigQuery, and now I can run queries on the logs and analyze the activity. There is even a very convenient guide here: https://cloud.google.com/bigquery/audit-logs.
However, all this only helps to look at data collected from now on. I need to analyze the past month.
Is there a way to export existing logs (rather than new) to Bigquery (or to flat file and later load them to BQ)?
Thanks
While you cannot "backstream" BigQuery's logs of the past, there is something you can still do, depending on what kind of information you're looking for. If you need information about query jobs (job stats, config, etc.), you can call the Jobs: list method of the BigQuery API to list all jobs in your project. The data is preserved there for 6 months, and if you're a project owner, you can list the jobs of all users, regardless of who actually ran them.
If you don't want to code anything, you can even use the API Explorer to call the method, save the output as a JSON file, and then load it back into a BigQuery table.
Sample code to list jobs with the BigQuery API. It requires some modification but it should be fairly easy to get done.
You can use the Jobs: list API to collect job info and upload it to GBQ.
Since it is in GBQ, you can analyze it any way you want using the power of BigQuery.
You can either flatten the result or use the original; I recommend using the original, as it is less of a headache since no transformation is needed before loading into GBQ (you literally upload whatever you get from the API). Of course, all this still requires a simple app/script that you have to write.
Note: make sure you use the full value for the projection parameter.
I was facing the same problem when I found an article which describes how to inspect BigQuery usage via INFORMATION_SCHEMA without any script or Jobs: list call, as mentioned in the other answers.
I was able to run it and got it working:
# Monitor Query costs in BigQuery; standard-sql; 2020-06-21
# #see http://www.pascallandau.com/bigquery-snippets/monitor-query-costs/
DECLARE timezone STRING DEFAULT "Europe/Berlin";
DECLARE gb_divisor INT64 DEFAULT 1024*1024*1024;
DECLARE tb_divisor INT64 DEFAULT gb_divisor*1024;
DECLARE cost_per_tb_in_dollar INT64 DEFAULT 5;
DECLARE cost_factor FLOAT64 DEFAULT cost_per_tb_in_dollar / tb_divisor;
SELECT
DATE(creation_time, timezone) creation_date,
FORMAT_TIMESTAMP("%F %H:%M:%S", creation_time, timezone) as query_time,
job_id,
ROUND(total_bytes_processed / gb_divisor,2) as bytes_processed_in_gb,
IF(cache_hit != true, ROUND(total_bytes_processed * cost_factor,4), 0) as cost_in_dollar,
project_id,
user_email
FROM
`region-us`.INFORMATION_SCHEMA.JOBS_BY_USER
WHERE
DATE(creation_time) BETWEEN DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY) and CURRENT_DATE()
ORDER BY
bytes_processed_in_gb DESC
Credits: https://www.pascallandau.com/bigquery-snippets/monitor-query-costs/