Processing trace_info.screen_info.frozen_frame_ratio in BigQuery - google-bigquery

I'm trying to do some post-processing on trace_info.screen_info.frozen_frame_ratio, but I'm seeing something I cannot explain: whenever I compare my query results with the data in the Firebase Console, the Console figures are different (bigger).
I apply the same kind of filtering (event_type = "SCREEN_TRACE" and an event_name that starts with "_st_" followed by my screen name). It almost looks as if I should exclude the 0 values of trace_info.screen_info.frozen_frame_ratio from my query, but even then the results differ significantly from Firebase (for the same period, of course). Why is that happening? Is it because of some mysterious automatic sampling during the data export to BigQuery? How can I make it right (I need a query that returns the same figures the Firebase Console shows)?

Related

Crux dataset Bigquery - Query for Min/Avg/Max LCP, FID and CLS

I have been exploring the CrUX dataset in BigQuery for the last 10 days to extract data for a Data Studio report. Though I consider myself good at SQL, having mostly worked with Oracle and SQL Server, I am finding it very hard to write queries against this dataset. I started from this article by Rick Viscomi and explored the queries in his GitHub repo, but I am still unable to figure it out.
I am trying to use the materialized table chrome-ux-report.materialized.metrics_summary to get some of the metrics, but I am not sure whether the min/avg/max LCP (in milliseconds) for a time period (a month, for example) can be extracted from this table. What other queries could I try that require less data processing? (Some of the queries I tried used up my free TB of data processing on BigQuery.)
Any suggestions, advice, solutions, or queries are more than welcome, since the documentation about the structure of the dataset and how to query it is not very clear.
For details about the fields used in the report, check the main documentation for the Chrome UX Report, especially the last part on the data format, which shows the dimensions and how they are interpreted, as shown below:
Dimension
origin "https://example.com"
effective_connection_type.name 4G
form_factor.name "phone"
first_paint.histogram.start 1000
first_paint.histogram.end 1200
first_paint.histogram.density 0.123
For example, the above shows a sample record from the Chrome User Experience Report, which indicates that 12.3% of page loads had a “first paint time” measurement in the range of 1000-1200 milliseconds when loading “http://example.com” on a “phone” device over a ”4G”-like connection. To obtain a cumulative value of users experiencing a first paint time below 1200 milliseconds, you can add up all records whose histogram’s “end” value is less than or equal to 1200.
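The cumulative calculation described above can be sketched in a few lines. This is only an illustration: the bin values below are invented sample data shaped like the histogram record shown, and cumulativeDensity is a hypothetical helper name, not part of any CrUX tooling.

```javascript
// Hypothetical sample bins shaped like CrUX histogram records
// (the densities are invented for illustration).
const bins = [
  { start: 0, end: 200, density: 0.05 },
  { start: 200, end: 1000, density: 0.4 },
  { start: 1000, end: 1200, density: 0.123 },
  { start: 1200, end: 1400, density: 0.1 },
];

// Add up the density of every bin whose "end" is at or below the threshold.
function cumulativeDensity(histogramBins, maxEnd) {
  return histogramBins
    .filter((bin) => bin.end <= maxEnd)
    .reduce((total, bin) => total + bin.density, 0);
}

// Share of page loads with a first paint at or below 1200 ms in the sample data.
console.log(cumulativeDensity(bins, 1200).toFixed(3));
```

With real data, the same sum would be done in SQL over the unnested histogram bins rather than on the client.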
For the metrics, the initial link has a section called Methodology where you can find information about the metrics and dimensions of the report. I recommend going to the actual origin source tables, per country and per site, rather than the summary, as the data you are looking for can be obtained there. In the BigQuery part of the documentation you will find samples of how to query those tables. I find this one relevant:
SELECT
  SUM(bin.density) AS density
FROM
  `chrome-ux-report.chrome_ux_report.201710`,
  UNNEST(first_contentful_paint.histogram.bin) AS bin
WHERE
  bin.start < 1000 AND
  origin = 'http://example.com'
In the example above we’re adding all of the density values in the FCP histogram for “http://example.com” where the FCP bin’s start value is less than 1000 ms. The result is 0.7537, which indicates that ~75.4% of page loads experience the FCP in under a second.
For estimating query cost, see the guide on estimating query costs in the official Google BigQuery documentation. Because of their nature, queries against these tables process a lot of data, so filter as much as possible.

BigQuery Count Appears to be Processing Data

I noticed that running a SELECT count(*) FROM myTable on my larger BQ tables yields long running times, upwards of 30/40 seconds despite the validator claiming the query processes 0 bytes. This doesn't seem quite right when 500 GB queries run faster. Additionally, total row counts are listed under details -> Table Info. Am I doing something wrong? Is there a way to get total row counts instantly?
When you run a count, BigQuery still needs to allocate resources (such as slot units, shards, etc.). You might be reaching some limits, which would cause a delay. For example, the default number of slots per project is 2,000.
The BigQuery execution plan provides very detailed information about the process, which can help you better understand the source of the delay.
One way to overcome this is to use an approximate method described in this link.
This slide by Google might also help you.
For more details, see this video about how to understand the execution plan.

BigQuery streaming data not available instantly

For the past couple of days, some of the data I am streaming to BigQuery has not been available instantly in the BigQuery web UI after being inserted successfully (as it normally is).
My use case consists of inserting thousands of lines using:
bigquery.tabledata().insertAll(...)
The results of the streaming inserts into the table are as follows
(I am also checking for insertErrors to be sure, as described here):
BigQuery insert status : {"kind":"bigquery#tableDataInsertAllResponse"}
BigQuery insert errors : null
The total number of lines available in the BigQuery web UI is different from the total inserted.
I would be grateful for any help.
Bigquery project details :
Project ID : favorable-beach-87616
Table : mtp_UA_xxxx_1_20150410
Project dependencies on google libraries:
compile 'com.google.api-client:google-api-client:1.19.0'
compile 'com.google.http-client:google-http-client:1.19.0'
compile 'com.google.http-client:google-http-client-jackson2:1.19.0'
compile 'com.google.oauth-client:google-oauth-client:1.19.0'
compile 'com.google.oauth-client:google-oauth-client-servlet:1.19.0'
compile 'com.google.apis:google-api-services-bigquery:v2-rev171-1.19.0'
compile 'com.google.api-client:google-api-client:1.17.0-rc'
Great thanks in advance for your help!
When you say the total number of lines available in the web UI, do you mean the number of rows that show up in the 'details' pane on the table, or the number of rows that are returned if you do a SELECT COUNT(*) query?
If the former, that is expected, since that counter only returns the number of rows that have been flushed to long-term storage (as opposed to the short-term storage buffers the streaming data originally gets written to). This is admittedly confusing, and we are working on a fix.
If the latter, and the rows don't show up in a query, that is more concerning. If that is the case, please let us know and we'll investigate.

How can I trigger an email or other notification based on a BigQuery query?

I would like to receive a notification, ideally via email, when some threshold is met in Google BigQuery. For example, if the query is:
SELECT name, count(id) FROM terrible_things
WHERE date(terrible_thing) < -1d
Then I would want to get an alert when there were greater than 0 results, and I would want that alert to contain the name of each object and how many there were.
BigQuery does not provide the kinds of services you'd need to build this without involving other technologies. However, you should be able to use something like App Engine (which does have a task scheduling mechanism) to periodically issue your monitoring query, check the results of the job, and alert if there are nonzero rows in the results. Alternatively, you could do this locally with some scripting and the BQ command line tool.
You could also refine things by using BQ's table decorators to only scan the data that's arrived since you last ran your monitoring query, if you retain knowledge of the last probe's execution in the calling system.
In short: Something else needs to issue the queries and react based on the outcome, but BQ can certainly evaluate the data.
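As a rough sketch of the "react based on the outcome" part, the function below turns the rows of a finished monitoring query into an alert message. It is an assumption-laden illustration: buildAlert is a hypothetical helper, the rows are assumed to arrive as plain objects with name and count fields, and actually running the query and sending the email are left to whatever scheduler and mail service you use.

```javascript
// Hypothetical helper: takes the rows returned by the monitoring query
// (assumed shape: { name, count }) and returns an alert message,
// or null when there is nothing to report.
function buildAlert(rows) {
  if (rows.length === 0) {
    return null; // threshold not met, no alert
  }
  const lines = rows.map((row) => `${row.name}: ${row.count}`);
  return {
    subject: `${rows.length} row(s) matched the monitoring query`,
    body: lines.join("\n"),
  };
}

// Example with invented rows; a scheduled job would pass real query results.
const sampleRows = [
  { name: "widget", count: 3 },
  { name: "gadget", count: 1 },
];
console.log(buildAlert(sampleRows));
```

The scheduled job would call this after each probe and send the message only when the result is not null.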

Problems loading a series of snapshots by date

I have been running into a consistent problem using the LBAPI which I feel is probably a common use case given its purpose. I am generating a chart which uses LBAPI snapshots of a group of Portfolio Items to calculate the chart series. I know the minimum and maximum snapshot dates, and need to query once a day in between these two dates. There are two main ways I have found to accomplish this, both of which are not ideal:
Use the _ValidFrom and _ValidTo filter properties to limit the results to snapshots within the selected timeframe. This is bad because it will also load snapshots which I don't particularly care about. For instance if a PI is revised several times throughout the day, I'm really only concerned with the last valid snapshot of that day. Because some of the PIs I'm looking for have been revised several thousand times, this method requires pulling mostly data I'm not interested in, which results in unnecessarily long load times.
Use the __At filter property and send a separate request for each query date. This method is not ideal because some charts would require several hundred requests, with many requests returning redundant results. For example if a PI wasn't modified for several days, each request within that time frame would return a separate instance of the same snapshot.
My workaround for this was to simulate the effect of __At, but with several filters per request. To do this, I added this filter to my request:
Rally.data.lookback.QueryFilter.or(_.map(queryDates, function(queryDate) {
    return Rally.data.lookback.QueryFilter.and([{
        property : '_ValidFrom',
        operator : '<=',
        value : queryDate
    },{
        property : '_ValidTo',
        operator : '>=',
        value : queryDate
    }]);
}))
But of course, a new problem arises: adding this filter makes the request too large to send via the LBAPI unless I query for fewer than ~20 dates. Is there a way I can send larger filters to the LBAPI? Or will I need to break this up into several requests, which makes this solution only slightly better than the second approach above?
Any help would be much appreciated. Thanks!
Conner, my recommendation is to download all of the snapshots, even the ones you don't want, and marshal them on the client side. There is functionality in the Lumenize library that's bundled with the App SDK that makes this relatively easy, and the TimeSeriesCalculator will also accomplish this for you with even more features, like aggregating the data into series.
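To illustrate what marshaling on the client side could look like, here is a minimal plain JavaScript sketch. It does not use Lumenize or the TimeSeriesCalculator; snapshotsAt is a hypothetical helper, the sample snapshots and their State field are invented, and it assumes _ValidFrom and _ValidTo are ISO 8601 strings with a snapshot counted as valid when _ValidFrom <= date < _ValidTo.

```javascript
// Hypothetical helper: for each query date, keep only the already-downloaded
// snapshots that were valid at that moment (_ValidFrom <= date < _ValidTo).
function snapshotsAt(allSnapshots, dates) {
  const result = {};
  for (const date of dates) {
    const t = Date.parse(date);
    result[date] = allSnapshots.filter(
      (s) => Date.parse(s._ValidFrom) <= t && t < Date.parse(s._ValidTo)
    );
  }
  return result;
}

// Invented data: one Portfolio Item revised twice on the same day.
const snapshots = [
  { ObjectID: 1, State: "Discovering",
    _ValidFrom: "2015-01-01T08:00:00.000Z", _ValidTo: "2015-01-01T15:00:00.000Z" },
  { ObjectID: 1, State: "Developing",
    _ValidFrom: "2015-01-01T15:00:00.000Z", _ValidTo: "9999-01-01T00:00:00.000Z" },
];
const queryDates = ["2015-01-01T23:59:59.000Z", "2015-01-02T23:59:59.000Z"];

// Each date maps to the single snapshot that was current at that time.
const byDate = snapshotsAt(snapshots, queryDates);
console.log(byDate);
```

This way one request (or a few paged requests) replaces the per-date queries, and intermediate revisions within a day are simply filtered out locally.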