CrUX dataset BigQuery - Query for Min/Avg/Max LCP, FID and CLS - google-bigquery

I have been exploring the CrUX dataset in BigQuery for the last 10 days to extract data for a Data Studio report. Though I consider myself good at SQL, as I have mostly worked with Oracle and SQL Server, I am finding it very hard to write queries against this dataset. I started from this article by Rick Viscomi and explored the queries in his GitHub repo, but am still unable to figure it out.
I am trying to use the materialized table chrome-ux-report.materialized.metrics_summary to get some of the metrics, but I am not sure whether the min/avg/max LCP (in milliseconds) for a time period (a month, for example) can be extracted from this table. What other queries could I try that require less data processing? (Some of the queries I tried used up my free 1 TB of data processing on BigQuery.)
Any suggestions, advice, solutions, or queries are more than welcome, since the documentation about the structure of the dataset and how to query it is not very clear.

For details about the fields used in the report, check the main documentation for the Chrome UX Report, especially the last part on the data format, which shows the dimensions and how they are interpreted, as shown below:
Dimension                        Value
origin                           "https://example.com"
effective_connection_type.name   4G
form_factor.name                 "phone"
first_paint.histogram.start      1000
first_paint.histogram.end        1200
first_paint.histogram.density    0.123
For example, the above shows a sample record from the Chrome User Experience Report, which indicates that 12.3% of page loads had a “first paint time” measurement in the range of 1000-1200 milliseconds when loading “http://example.com” on a “phone” device over a ”4G”-like connection. To obtain a cumulative value of users experiencing a first paint time below 1200 milliseconds, you can add up all records whose histogram’s “end” value is less than or equal to 1200.
For the metrics, the initial link has a section called Methodology where you can find information about the metrics and dimensions of the report. I recommend going to the actual origin source tables (per country and per site) rather than the summary, as the data you are looking for can be obtained there. In the BigQuery part of the documentation you will find samples of how to query those tables. I find this example relevant:
SELECT
  SUM(bin.density) AS density
FROM
  `chrome-ux-report.all.201710`,
  UNNEST(first_contentful_paint.histogram.bin) AS bin
WHERE
  bin.start < 1000 AND
  origin = 'http://example.com'
In the example above we’re adding all of the density values in the FCP histogram for “http://example.com” where the FCP bin’s start value is less than 1000 ms. The result is 0.7537, which indicates that ~75.4% of page loads experience the FCP in under a second.
About query cost estimation, see the estimating query costs guide in the official BigQuery documentation. Due to their nature, queries against these tables consume a lot of processing, so filter as much as possible.
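As for the min/avg/max LCP the question asks about, one option is to derive approximations from the histogram bins of the raw tables, in the same way as the density sum above. This is only a sketch: the month below is a placeholder (LCP only appears in more recent tables), and a histogram can only give bin bounds and a weighted mean, not exact minima, maxima, or averages.
-- Approximate LCP statistics for one origin, derived from histogram bins.
-- The last bin's `end` may be NULL (unbounded), which MAX simply ignores.
SELECT
  MIN(bin.start) AS min_lcp_bin_start,  -- lower bound of the fastest populated bin
  SUM(bin.start * bin.density) / SUM(bin.density) AS approx_avg_lcp,
  MAX(bin.end) AS max_lcp_bin_end       -- upper bound of the slowest populated bin
FROM
  `chrome-ux-report.all.202206`,
  UNNEST(largest_contentful_paint.histogram.bin) AS bin
WHERE
  origin = 'https://example.com'
  AND bin.density > 0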

Related

BigQuery external table - Query performance degrades with increased number of files in the source URI

I have an external BigQuery table created to read Parquet files from a GCS bucket.
The folder layout in the GCS bucket is as follows:
gs://mybucket/root/year=2022/model=abc/
gs://mybucket/root/year=2022/model=.../
gs://mybucket/root/year=2021/model=abc/
gs://mybucket/root/year=2021/model=.../
The layout is organized so that it follows the hive partitioning layout, as explained in the BigQuery documentation. The columns "year" and "model" are seen as partition columns in the external table.
**External Data Configuration**
Source URI(s)- gs://mybucket/root/*
Source format - PARQUET
Hive Partitioning Mode - CUSTOM
Hive Partitioning Source URI Prefix - gs://mybucket/root/{year:INTEGER}/{model:STRING}
Hive Partitioning Column(s)- year, model
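(For reference, this configuration corresponds to external-table DDL along these lines; the project and dataset names are placeholders, and listing the partition columns explicitly corresponds to the CUSTOM hive partitioning mode.)
-- Sketch of the external-table DDL matching the configuration above.
CREATE OR REPLACE EXTERNAL TABLE `myproject.mydataset.myTable`
WITH PARTITION COLUMNS (
  year INT64,
  model STRING
)
OPTIONS (
  format = 'PARQUET',
  uris = ['gs://mybucket/root/*'],
  hive_partition_uri_prefix = 'gs://mybucket/root'
)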
Problem: When I run queries on the external table as given below, I have observed that every query spends an initial 2-3 minutes before the actual run happens. The BigQuery console shows "Query pending" during this time, and as soon as it turns to "Query running" the output gets displayed with minimal slot time consumption (slot time shows as 1-2 seconds).
SELECT * FROM myTable WHERE year = 2022 AND model = 'abc'
The underlying file count varies and increases for every year and model. For years with more Parquet files, the initial time is sometimes around 4-5 minutes.
My understanding, as per the documentation, is that if the partition columns are present in the query, some sort of partition pruning happens, and I expect the query to be responsive immediately.
https://cloud.google.com/bigquery/docs/hive-partitioned-queries-gcs#partition_pruning
But my observations are contrary to this. If the source URIs are restricted to one year, so the table reads data from just that year, the initial time (where it remains "Query pending" in the console) is reduced to 1-2 minutes or less:
Source URI(s)- gs://mybucket/root/year=2022/*
Question: Is this the expected behavior? As the volume of files in the GCS bucket increases, the query takes even longer to run (especially the initial time; the actual run time doesn't change much), even though the WHERE clause applies the year and model partition columns.
This is likely expected behavior. Before partition pruning can happen, the objects in GCS need to be listed, which is likely where the time is being taken. We are working on improvements in this area. The fact that the slot time is so low is a good indicator that pruning is in fact happening (since most files are not being read, there isn't a lot of slot time to consume).

How to interpret the GB processed by a query in BigQuery?

I am using a free trial of Google BigQuery. This is the query that I am using:
SELECT *
FROM `test`.events
WHERE subject_id = 124
  AND id = 256064
  AND time >= '2166-01-15T14:00:00'
  AND time <= '2166-01-15T14:15:00'
  AND id_1 IN (3655, 223762, 223761, 678, 211, 220045, 8368, 8441, 225310, 8555, 8440)
This query is expected to return at most 300 records. However, I see a message telling me how much data the query will process when run. The table on which this query operates is really huge, so does the figure indicate the table size? I ran this query multiple times a day, and due to this it resulted in the error below:
Quota exceeded: Your project exceeded quota for free query bytes scanned. For more information, see https://cloud.google.com/bigquery/troubleshooting-errors
How long do I have to wait for this error to go away? Is the daily limit 1 TB? If yes, then I didn't even use close to 400 GB.
How to view my daily usage?
If I can edit the quota, can you let me know which option I should be editing?
Can you help me with the above questions?
According to the official documentation:
"BigQuery charges for queries by using one metric: the number of bytes processed (also referred to as bytes read)", regardless of how large the output is. This means that if you run a query that reads 1 TB of column data, you will be charged $5 (at the on-demand rate), even though the final output may be very small.
Note that due to storage optimizations that BigQuery performs internally, the bytes processed might not equal the actual raw table size when you created it.
For the error you're seeing, go to "IAM & admin" > "Quotas" in the Google Cloud Console, where you can search for quotas specific to the BigQuery service.
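To see your daily usage, one option is the INFORMATION_SCHEMA jobs views. A sketch, assuming your jobs run in the US multi-region:
-- Bytes billed per day by query jobs in this project over the last week.
SELECT
  DATE(creation_time) AS usage_date,
  SUM(total_bytes_billed) / POW(1024, 3) AS gb_billed
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE job_type = 'QUERY'
  AND creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
GROUP BY usage_date
ORDER BY usage_date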
Hope this helps!
Flavien

BigQuery Count Appears to be Processing Data

I noticed that running SELECT COUNT(*) FROM myTable on my larger BQ tables yields long running times, upwards of 30-40 seconds, despite the validator claiming the query processes 0 bytes. This doesn't seem quite right when 500 GB queries run faster. Additionally, total row counts are listed under Details -> Table Info. Am I doing something wrong? Is there a way to get total row counts instantly?
When you run a count, BigQuery still needs to allocate resources (such as slot units, shards, etc.). You might be reaching some limits, which causes a delay. For example, the default slot quota per project is 2,000 slots.
The BigQuery execution plan provides very detailed information about the process, which can help you better understand the source of the delay.
One way to overcome this is to use an approximate method described in this link.
This slide deck by Google might also help you.
For more details, see this video about how to understand the execution plan.
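Separately, the row count is available from table metadata without scanning any data at all. A minimal sketch using the __TABLES__ metadata view (the dataset and table names are placeholders):
-- Instant row count from dataset metadata; no table data is scanned.
SELECT table_id, row_count
FROM `mydataset.__TABLES__`
WHERE table_id = 'myTable'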

Check number of slots used by a query in BigQuery

Is there a way to check how many slots were used by a query over the period of its execution in BigQuery? I checked the execution plan, but I could only see the slot time in ms; I could not see any parameter or graph showing the number of slots used over the period of execution. I even tried looking at Stackdriver Monitoring, but I could not find anything like this. Please let me know if it can be calculated in some way, or if I can see it somewhere I might have missed.
A BigQuery job will report the total number of slot-milliseconds from the extended query stats in the job metadata, which is analogous to computational cost. Each stage of the query plan also indicates input stats for the stage, which can be used to indicate the number of units of work each stage dispatched.
More details about the representation can be found in the REST reference for jobs. See statistics.totalSlotMs and statistics.query.queryPlan[].parallelInputs for more information.
BigQuery now provides a key in the Jobs API JSON called "timeline". This structure provides "statistics.query.timeline[].completedUnits", which you can obtain either during job execution or after it completes. If you pull this information after a job has executed, "completedUnits" will be the cumulative sum of all the units of work (slots) utilised during the query execution.
The question might have two parts though: (1) the total number of slots utilised (units of work completed), or (2) the maximum number of units used in parallel at a point in time by the query.
For (1), the answer is as above, given by "completedUnits".
For (2), you might need to consider the maximum value of queryPlan.parallelInputs across all query stages, which would indicate the maximum "number of parallelizable units of work for the stage" (https://cloud.google.com/bigquery/query-plan-explanation)
If, after this, you additionally want to know if the 2000 parallel slots that you are allocated across your entire on-demand query project is sufficient, you'd need to find the point in time across all queries taking place in your project where the slots being utilised is at a maximum. This is not a trivial task, but Stackdriver monitoring provides the clearest view for you on this.
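If your region supports INFORMATION_SCHEMA, a rough average slot count per job can also be derived by dividing total_slot_ms by the job's wall-clock duration. A sketch, assuming jobs run in the US multi-region:
-- Approximate average slots per query job over the last day:
-- total slot-milliseconds divided by wall-clock milliseconds.
SELECT
  job_id,
  total_slot_ms,
  SAFE_DIVIDE(
    total_slot_ms,
    TIMESTAMP_DIFF(end_time, start_time, MILLISECOND)
  ) AS approx_avg_slots
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE job_type = 'QUERY'
  AND creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
ORDER BY approx_avg_slots DESC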

google-bigquery: Size of the data a query is going to process

When you enter a query in the BigQuery text box, it immediately reports the size of the data the query is going to process, e.g. "This query will process 839 GB when run."
Question 1: How does BigQuery know the size of the data it is going to process so quickly?
Question 2: How accurate is this figure?
Question 3: I want to get this figure through a BigQuery tool and use it in my project. Is there a way to get this figure through the API?
BigQuery looks at all the columns mentioned in your query and adds up their sizes. That's the total data to be processed: it counts only the columns mentioned, at their full size.
100% accurate, as long as the column sizes don't change in the meantime.
The API parameter to access this figure is dryRun. It doesn't use quota, so feel free to query: https://developers.google.com/bigquery/docs/reference/v2/jobs/query
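To see the first point in practice, compare the on-screen estimates for these two queries over the same (hypothetical) table; only the columns actually referenced are counted:
-- Estimated bytes = the stored size of the `title` column only.
SELECT title FROM `mydataset.myTable`;

-- Estimated bytes = the size of every column, i.e. the full table size.
SELECT * FROM `mydataset.myTable`;
The same figure is returned by the jobs API when the query is submitted with dryRun set to true, as mentioned above.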