How is duration being calculated in a Spark Structured Streaming UI? - sql

We have a Spark SQL job that we would like to optimize. We are trying to
figure out which part of our pipeline is slower/faster.
In the attached SQL query graph, there are 3 WholeStageCodegen boxes,
all with the same duration: 2.9s, 2.9s, 2.9s. See the below picture:
But if we check the Stage graph, it shows 3 seconds for the total stage. See the below picture:
So the durations in the WholeStageCodegen boxes do not add up, it seems
that these durations refer to the sum of the whole stage. Do we miss
something here? Is there a way to figure out the duration for the
individual boxes?
Sometimes there is some difference in the duration, but not more than
0.1s, examples:
18.3s, 18.3s, 18.4s
968ms, 967ms, 1.0s
The Stage duration is always as much as one of the WholeStageCodegen's
duration, or at most 0.1-0.3sec larger.
How can one figure out the duration for each of the WholeStageCodegen parts, and is that actually measured? I suspect that Spark would have to trace individual operations as units of generated functions. Is that measurement actually performed there, or are these numbers more like a placeholder for a feature that does not exist?

Related

Can someone explain this PromQL query to me?

I'm new to promQL and I am using it to create grafana dashboard to visualize various API metrics like throughput, latency etc.
For measuring latency I came across these queries being used together. Can someone explain how are they working
histogram_quantile(0.99, sum(irate(http_request_duration_seconds_bucket{path="<API Endpoint>"}[2m])*30) by (path,le))
histogram_quantile(0.95, sum(irate(http_request_duration_seconds_bucket{path="<API Endpoint>"}[2m])*30) by (path,le))
Also I want to write a query which will show me number of API calls with latency greater than 4sec. Can someone please help me there as well?
The provided queries are designed to return 99th and 95th percentiles for the http_request_duration_seconds{path="..."} metric of histogram type over requests received during the last 2 minutes (see 2m in square brackets).
Unfortunately the provided queries have some issues:
They use irate() function for calculating the per-second increase rate of every bucket defined in http_request_duration_seconds histogram. This function isn't recommended to use in general case, because it tends to return jumpy results on repeated queries - see this article for details. So it is better to use rate or increase instead when calculating histogram_quantile.
They multiply the calculated irate() by 30. This has no any effect on query results, since histogram_quantile() normalizes the provided per-bucket values.
So it is recommended to use the following query instead:
histogram_quantile(0.99,
sum(
increase(http_request_duration_seconds_bucket{path="..."}[2m])
) by (le)
)
This query works in the following way:
Prometheus selects all the time series matching the http_request_duration_seconds_bucket{path="..."} time series selector on the selected time range on the graph. These time series represent histogram buckets for the http_request_duration_seconds histogram. Each such bucket contains a counter, which counts the number of requests with duration not exceeding the value specified in the le label.
Prometheus calculates the increase over the last 2 minutes per each selected time series, e.g. how many requests hit every bucket during the last 2 minutes.
Prometheus calculates per-le sums over bucket values calculated at step 2 - see sum() function docs for details.
Prometheus calculates the estimated 99th percentile for the bucket results returned at step 3 by executing histogram_quantile function. The error of the estimation depends on the number of buckets and the le values. More buckets with better le distribution usually give lower error for the estimated percentile.

Crux dataset Bigquery - Query for Min/Avg/Max LCP, FID and CLS

I have been exploring the Crux dataset in big query for last 10 days to extract data for data studio report. Though I consider myself good at SQL, as I have mostly worked with oracle and SQL server, I am finding it very hard to write queries against this dataset. I started from this article by Rick Viscomi, explored the queries on his github repo but still unable to figure it out.
I am trying to use the materialized table chrome-ux-report.materialized.metrics_summary to get some of the metrics but I am not sure if the Min/Avg/Max lcp (in milliseconds) for a time period (month for example) could be extracted from this table. What other queries could I possibly try which requires less data processing. (Some of the queries that I tried expired my free TB of data processing on big query).
Any suggestion, advise solution, queries are more than welcome since the documentation about the structure of the dataset and queries against it is not very clear.
For details about the fields used on the report you can check on the main documentation for the chrome ux report specially on the last part with data format which shows the dimensions and how its interpreted as show below:
Dimension
origin "https://example.com"
effective_connection_type.name 4G
form_factor.name "phone"
first_paint.histogram.start 1000
first_paint.histogram.end 1200
first_paint.histogram.density 0.123
For example, the above shows a sample record from the Chrome User Experience Report, which indicates that 12.3% of page loads had a “first paint time” measurement in the range of 1000-1200 milliseconds when loading “http://example.com” on a “phone” device over a ”4G”-like connection. To obtain a cumulative value of users experiencing a first paint time below 1200 milliseconds, you can add up all records whose histogram’s “end” value is less than or equal to 1200.
For the metrics, in the initial link there is a section called methodology where you can get information about the metrics and dimensions of the report. I recommend going to the actual origin source table per country and per site and not the summary as the data you are looking for can be obtained there. In the Bigquery part of the documentation you will find samples of how to query those tables. I find this relatable:
SELECT
SUM(bin.density) AS density
FROM
`chrome-ux-report.chrome_ux_report.201710`,
UNNEST(first_contentful_paint.histogram.bin) AS bin
WHERE
bin.start < 1000 AND
origin = 'http://example.com'
In the example above we’re adding all of the density values in the FCP histogram for “http://example.com” where the FCP bin’s start value is less than 1000 ms. The result is 0.7537, which indicates that ~75.4% of page loads experience the FCP in under a second.
About query estimation cost, you can see estimating query cost guide on google official bigquery documentation. But using this tables due to its nature consumes a lot of processing so filter it as much as possible.

BigQuery Count Appears to be Processing Data

I noticed that running a SELECT count(*) FROM myTable on my larger BQ tables yields long running times, upwards of 30/40 seconds despite the validator claiming the query processes 0 bytes. This doesn't seem quite right when 500 GB queries run faster. Additionally, total row counts are listed under details -> Table Info. Am I doing something wrong? Is there a way to get total row counts instantly?
When you run a count BigQuery still needs to allocate resources (such as: slot units, shards etc). You might be reaching some limits which cause a delay. For example, the slots default per project is 2,000 units.
BigQuery execution plan provides very detail information about the process which can help you better understand the source of the delay.
One way to overcome this is to use an approximate method described in this link
This Slide by Google might also help you
For more details see this video about how to understand the execution plan

Check number of slots used by a query in BigQuery

Is there a way to check how many slots were used by a query over the period of its execution in BigQuery? I checked the execution plan but I could just see the Slot Time in ms but could not see any parameter or any graph to show the number of slots used over the period of execution. I even tried looking at Stackdriver Monitoring but I could not find anything like this. Please let me know if it can be calculated in some way or if I can see it somewhere I might've missed seeing.
A BigQuery job will report the total number of slot-milliseconds from the extended query stats in the job metadata, which is analogous to computational cost. Each stage of the query plan also indicates input stats for the stage, which can be used to indicate the number of units of work each stage dispatched.
More details about the representation can be found in the REST reference for jobs. See query.statistics.totalSlotMs and statistics.query.queryPlan[].parallelInputs for more information.
BigQuery now provides a key in the Jobs API JSON called "timeline". This structure provides "statistics.query.timeline[].completedUnits" which you can obtain either during job execution or after. If you choose to pull this information after a job has executed, "completedUnits" will be the cumulative sum of all the units of work (slots) utilised during the query execution.
The question might have two parts though: (1) Total number of slots utilised (units of work completed) or (2) Maximum parallel number of units used at a point in time by the query.
For (1), the answer is as above, given by "completedUnits".
For (2), you might need to consider the maximum value of queryPlan.parallelInputs across all query stages, which would indicate the maximum "number of parallelizable units of work for the stage" (https://cloud.google.com/bigquery/query-plan-explanation)
If, after this, you additionally want to know if the 2000 parallel slots that you are allocated across your entire on-demand query project is sufficient, you'd need to find the point in time across all queries taking place in your project where the slots being utilised is at a maximum. This is not a trivial task, but Stackdriver monitoring provides the clearest view for you on this.

Graphing slow counters with prometheus and grafana

We graph fast counters with sum(rate(my_counter_total[1m])) or with sum(irate(my_counter_total[20s])). Where the second one is preferrable if you can always expect changes within the last couple of seconds.
But how do you graph slow counters where you only have some increments every couple of minutes or even hours? Having values like 0.0013232/s is not very human friendly.
Let's say I want to graph how many users sign up to our service (we expect a couple of signups per hour). What's a reasonable query?
We currently use the following to graph that in grafana:
Query: 3600 * sum(rate(signup_total[1h]))
Step: 3600s
Resolution: 1/1
Is this reasonable?
I'm still trying to understand how all those parameters play together to draw a graph. Can someone explain how the range selector ([10m]), the rate() and the irate() functions, the Step and Resolution settings in grafana influence each other?
That's a correct way to do it. You can also use increase() which is syntactic sugar for using rate() that way.
Can someone explain how the range selector
This is only used by Prometheus, and indicates what data to work over.
the Step and Resolution settings in grafana influence each other?
This is used on the Grafana side, it affects how many time slices it'll request from Prometheus.
These settings do not directly influence each other. However the resolution should work out to be smaller than the range, or you'll be undersampling and miss information.
The 3600 * sum(rate(signup_total[1h])) can be substituted with sum(increase(signup_total[1h])) . The increase(counter[d]) function returns counter increase on the given lookbehind window d. E.g. increase(signup_total[1h]) returns the number of signups during the last hour.
Note that the returned value from increase(signup_total[1h]) may be fractional even if signup_total contains only integer values. This is because of extrapolation - see this issue for technical details. There are the following solutions for this issue:
To use offset modifier: signup_total - (signup_total offset 1h) . This query returns correct results if signup_total wasn't reset to zero during the last hour. In this case the sum(signup_total - (signup_total offset 1h)) is roughly equivalent to sum(increase(signup_total[1h])), but returns more accurate integer results.
To use VictoriaMetrics. It returns the expected integer results from increase() out of the box. See this article and this comment for technical details.