BigQuery query execution costs - google-bigquery

I know there is documentation on BigQuery pricing, but I am confused about which value they charge you on. When you compose a query, the editor will show "This query will process 69.3 GB when run." But once you've executed the query, there is a Job Information tab next to the Results tab, and in that Job Information there are two values: "Bytes Processed" and "Bytes Billed".
I was informed that you are charged on the "Bytes Billed" value (seems logical based on the name!).
What's causing my confusion is that the Bytes Billed for the above 69.3 GB query is 472 MB. I'm given to believe that the WHERE clause does not impact pricing.
Why is it so much less?
How can I accurately estimate query costs if I can't see the Bytes Billed beforehand?
Thanks in advance
Edit 1
Here is my query:
SELECT
timestamp_trunc(DateTimeUTC, SECOND) as DateTimeUTC,
ANY_VALUE(if(Code = 'Aftrtmnt_1_Scr_Cat_Tank_Level', value, null)) as Aftrtmnt_1_Scr_Cat_Tank_Level,
ANY_VALUE(if(Code = 'ctv_ds_ect', value, null)) as ctv_ds_ect,
ANY_VALUE(if(Code = 'Engine_Coolant_Level', value, null)) as Engine_Coolant_Level,
ANY_VALUE(if(Code = 'ctv_batt_volt_min', value, null)) as ctv_batt_volt_min,
ANY_VALUE(if(Code = 'ctv_moderate_brake_count', value, null)) as ctv_moderate_brake_count,
ANY_VALUE(if(Code = 'ctv_amber_lamp_count', value, null)) as ctv_amber_lamp_count,
VIN,
ANY_VALUE(if(Code = 'ctv_trip_distance_miles', value, null)) as ctv_trip_distance_miles
FROM `xxxx.yyyy.zzzz`
WHERE
DATE(DateTimeUTC) > '2021-03-01' and DATE(DateTimeUTC) < '2021-06-01' and
Code in ('Aftrtmnt_1_Scr_Cat_Tank_Level', 'ctv_ds_ect', 'Engine_Coolant_Level', 'ctv_trip_distance_miles', 'ctv_batt_volt_min', 'ctv_moderate_brake_count', 'ctv_amber_lamp_count')
and event_name = 'Trip Detail'
group by timestamp_trunc(DateTimeUTC, SECOND), VIN
Essentially it just pivots the main table, and the intention is to insert the result into another table.
This article states that the WHERE clause does not impact cost, which is different to what I previously thought.

I believe that your actual cost should never be more than estimated, but could be less.
Consider a table that is both partitioned and clustered. Let's assume the partition is on a date field my_date and clustered on a string field my_type.
Then, consider the following query...
select my_date, my_type from <table>
The estimate assumes you are scanning both columns in their entirety, so your billed amount should match the estimate.
However, if you filter against the partition, you should see a reduction in both the estimation and the billed amount.
select my_date, my_type from <table> where my_date = '2021-06-17'
But, if you filter against the clustered column, I don't believe the estimate evaluates that filter, because it doesn't know what you are filtering, just which columns. However, when you execute the query, you do get the benefit of the clustering, because it won't actually scan the entire column, just the relevant clusters.
select my_date, my_type from <table> where my_type = 'A'
It is not checking 'A' against the clustering in the estimation. Consider a case where 'A' doesn't exist in that clustered column: the estimator would still show an estimate, but you would actually scan 0 bytes when you execute.
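For reference, a minimal sketch of a table laid out like that, following the my_date/my_type example above (dataset, table, and the extra value column are made up):
CREATE TABLE my_dataset.my_table (
  my_date DATE,
  my_type STRING,
  value   FLOAT64   -- illustrative payload column
)
PARTITION BY my_date   -- filtering on my_date prunes partitions and lowers both estimate and bill
CLUSTER BY my_type;    -- filtering on my_type prunes clusters, but only the billed amount reflects it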

Related

SQL Server query plan sort operator cost spikes when column count increased

I use a ROW_NUMBER() function inside a CTE that causes a SORT operator in the query plan. This SORT operator has always been the most expensive element of the query, but has recently spiked in cost after I increased the number of columns read from the CTE/query.
What confuses me is that the increase in cost is not proportional to the column count. Normally I can increase the column count without much issue. However, it seems my query has passed some threshold and now costs so much that the query execution time has doubled from 1 hour to 2+ hours.
I can't figure out what has caused the spike in cost, and it's having an impact on the business. Any ideas or next steps for troubleshooting you can advise?
Here is the query (simplified):
WITH versioned_events AS (
SELECT [event].*
,CASE WHEN [event].[handle_space] IS NOT NULL THEN [inv].[involvement_id]
ELSE [event].[involvement_id]
END AS [derived_involvement_id]
,ROW_NUMBER() OVER (PARTITION BY [event_id], [event_version] ORDER BY [event_created_date] DESC, [timestamp] DESC ) AS [latest_version]
FROM [database].[schema].[event_table] [event]
LEFT JOIN [database].[schema].[involvement] as [inv]
ON [event].[service_delivery_id] = [inv].[service_delivery_id]
AND [inv].[role_type_code] = 't'
AND [inv].latest_involvement = 1
WHERE event.deletion_type IS NULL AND (event.handle_space IS NULL
OR (event.handle_space NOT LIKE 'x%'
AND event.handle_space NOT LIKE 'y%'))
)
INSERT INTO db.schema.table (
....
)
SELECT
....
FROM versioned_events AS [event]
INNER JOIN (
SELECT DISTINCT service_delivery_id, derived_involvement_id
FROM versioned_events
WHERE latest_version = 1
AND ([versioned_events].[timestamp] > '2022-02-07 14:18:09.777610 +00:00')
) AS [delta_events]
ON COALESCE([event].[service_delivery_id],'NULL') = COALESCE([delta_events].[service_delivery_id],'NULL')
AND COALESCE([event].[derived_involvement_id],'NULL') = COALESCE([delta_events].[derived_involvement_id],'NULL')
WHERE [event].[latest_version] = 1
Here is the query plan from the version with the most columns, which experiences the cost spike (the plans for the others look the same, except this operator takes much less time, 40-50 mins):
I did a comparison of three executions, each with a different column count in the INSERT INTO ... SELECT FROM clause. I can't share the spreadsheet, but I will try to convey my findings so far. The following is true of the query with the most columns:
It takes more than twice as long to execute as the other two executions
It performs more logical & physical reads and scans
It has more CPU time
It reads the most from Tempdb
The increase in execution time is not proportional to the increase in reads or the other metrics mentioned
There is indeed a memory spill (level 8) happening. I have tried updating statistics, but it didn't help, and all versions of the query suffer the same problem, so the comparison is still like-for-like.
I know it can be hard to help with this kind of problem without being able to poke around but I would be grateful if anyone could point me in the direction for what to check / try next.
P.S. the table it reads from is a heap and the table it joins to is indexed. The heap table needs to be a heap otherwise inserts into it will take too long and the problem is kicked down the road.
Also, when I say added more columns, I mean in the SELECT FROM versioned_events statement. The columns are replaced with "...." in the above example.
UPDATE
Using a temp table halved the execution time at the high column count that caused the issue, but it actually takes longer at the reduced column count. That supports the idea that a threshold is crossed when the column count is increased :(. In any event, we've used a temp table for now to see if it helps in production.
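For illustration, a rough sketch of the temp-table variant described above, using the same tables and columns as the simplified query (not the exact production code):
SELECT [event].*
      ,CASE WHEN [event].[handle_space] IS NOT NULL THEN [inv].[involvement_id]
            ELSE [event].[involvement_id]
       END AS [derived_involvement_id]
      ,ROW_NUMBER() OVER (PARTITION BY [event_id], [event_version]
                          ORDER BY [event_created_date] DESC, [timestamp] DESC) AS [latest_version]
INTO #versioned_events
FROM [database].[schema].[event_table] [event]
LEFT JOIN [database].[schema].[involvement] AS [inv]
       ON [event].[service_delivery_id] = [inv].[service_delivery_id]
      AND [inv].[role_type_code] = 't'
      AND [inv].[latest_involvement] = 1
WHERE [event].[deletion_type] IS NULL
  AND ([event].[handle_space] IS NULL
       OR ([event].[handle_space] NOT LIKE 'x%'
           AND [event].[handle_space] NOT LIKE 'y%'));
-- The INSERT ... SELECT then reads #versioned_events twice instead of re-expanding the CTE each time.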

SQL Server Fast Way to Determine IF Exists

I need to find a fast way to determine if records exist in a database table. The normal method of IF Exists (condition) is not fast enough for my needs. I've found something that is faster but does not work quite as intended.
The normal IF Exists (condition) which works but is too slow for my needs:
IF EXISTS (SELECT *
From dbo.SecurityPriceHistory
Where FortLabel = 'EP'
and TradeTime >= '2020-03-20 15:03:53.000'
and Price >= 2345.26)
My work around that doesn't work, but is extremely fast:
IF EXISTS (SELECT IIF(COUNT(*) = 0, null, 1)
From dbo.SecurityPriceHistory
Where FortLabel = 'EP'
and TradeTime >= '2020-03-20 15:03:53.000'
and Price >= 2345.26)
The issue with the second solution is that when COUNT(*) = 0 the subquery still returns a single row containing NULL, and EXISTS only checks whether a row exists, so IF EXISTS evaluates to true anyway.
The second solution is fast because it doesn't read any data in the execution plan, while the first one does read data.
I suggested leaving the original code unchanged, but adding an index to cover one (or more) of the columns in the WHERE clause.
If I changed anything, I might limit the SELECT clause to a single non-null small column.
Switching to a column store index in my particular use case appears to solve my performance problem.
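For reference, a nonclustered columnstore index along the lines the asker describes might look something like this (index name is illustrative):
CREATE NONCLUSTERED COLUMNSTORE INDEX NCCI_SecurityPriceHistory
    ON dbo.SecurityPriceHistory (FortLabel, TradeTime, Price);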
For this query:
IF EXISTS (SELECT *
From dbo.SecurityPriceHistory
Where FortLabel = 'EP' and
TradeTime >= '2020-03-20 15:03:53.000' and
Price >= 2345.26
)
You want an index on either:
SecurityPriceHistory(FortLabel, TradeTime, Price)
SecurityPriceHistory(FortLabel, Price, TradeTime)
The difference is whether TradeTime or Price is more selective. A single-column index is probably not sufficient for this query.
The third column in the index is just there so the index covers the query and doesn't have to reference the data pages.
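In T-SQL that suggestion might look like the following (index name is illustrative; the third column could equally be an INCLUDE column, since it is only there for covering):
CREATE INDEX IX_SecurityPriceHistory_FortLabel_TradeTime_Price
    ON dbo.SecurityPriceHistory (FortLabel, TradeTime, Price);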

Optimizing spark sql query

I am using the query below to derive outliers from my data. Using DISTINCT is creating too much shuffle, and the final tasks are taking a huge amount of time to complete. Are there any optimizations that can be done to speed it up?
query = """SELECT
DISTINCT NAME,
PERIODICITY,
PERCENTILE(CAST(AMOUNT AS INT), 0.997) OVER(PARTITION BY NAME, PERIODICITY) as OUTLIER_UPPER_THRESHOLD,
CASE
WHEN PERIODICITY = "WEEKLY" THEN 100
WHEN PERIODICITY = "BI_WEEKLY" THEN 200
WHEN PERIODICITY = "MONTHLY" THEN 250
WHEN PERIODICITY = "BI_MONTHLY" THEN 400
WHEN PERIODICITY = "QUARTERLY" THEN 900
ELSE 0
END AS OUTLIER_LOWER_THRESHOLD
FROM base"""
I would suggest rephrasing this so you can filter before aggregating:
SELECT NAME, PERIODICITY, OUTLIER_LOWER_THRESHOLD,
MIN(AMOUNT) AS OUTLIER_UPPER_THRESHOLD
FROM (SELECT NAME, PERIODICITY, AMOUNT,
RANK() OVER (PARTITION BY NAME, PERIODICITY ORDER BY AMOUNT) as seqnum,
COUNT(*) OVER (PARTITION BY NAME, PERIODICITY) as cnt,
(CASE . . . END) as OUTLIER_LOWER_THRESHOLD
FROM base
) b
WHERE seqnum >= 0.997 * cnt
GROUP BY NAME, PERIODICITY, OUTLIER_LOWER_THRESHOLD;
Note: This ranks duplicate amounts based on the lowest rank. That means that some NAME/PERIODICITY pairs may not be in the results. They can easily be added back in using a LEFT JOIN.
The easiest way to deal with a large shuffle, independent of what the shuffle is, is to use a larger cluster. It's the easiest way because you don't have to think much about it. Machine time is usually much cheaper than human time refactoring code.
The second easiest way to deal with a large shuffle that is the union of some independent and constant parts is to break it into smaller shuffles. In your case, you could run separate queries for each periodicity, filtering the data down before the shuffle and then union the results.
If the first two approaches are not applicable for some reason, it's time to refactor. In your case you are doing two shuffles: first to compute OUTLIER_UPPER_THRESHOLD which you associate with every row and then to distinct the rows. In other words, you are doing a manual, two-phase GROUP BY. Why don't you just group by NAME, PERIODICITY and compute the percentile?
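For what it's worth, a sketch of that single GROUP BY rewrite (same base table and column list as in the question):
SELECT NAME,
       PERIODICITY,
       PERCENTILE(CAST(AMOUNT AS INT), 0.997) AS OUTLIER_UPPER_THRESHOLD,
       CASE
           WHEN PERIODICITY = 'WEEKLY'     THEN 100
           WHEN PERIODICITY = 'BI_WEEKLY'  THEN 200
           WHEN PERIODICITY = 'MONTHLY'    THEN 250
           WHEN PERIODICITY = 'BI_MONTHLY' THEN 400
           WHEN PERIODICITY = 'QUARTERLY'  THEN 900
           ELSE 0
       END AS OUTLIER_LOWER_THRESHOLD
FROM base
GROUP BY NAME, PERIODICITY   -- one row per NAME/PERIODICITY; no DISTINCT or window needed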

Optimizing Hive GROUP BY when rows are sorted

I have the following (very simple) Hive query:
select user_id, event_id, min(time) as start, max(time) as end,
count(*) as total, count(interaction == 1) as clicks
from events_all
group by user_id, event_id;
The table has the following structure:
user_id               event_id                  time           interaction
Ex833Lli36nxTvGTA1Dv  juCUv6EnkVundBHSBzQevw    1430481530295  0
Ex833Lli36nxTvGTA1Dv  juCUv6EnkVundBHSBzQevw    1430481530295  1
n0w4uQhOuXymj5jLaCMQ  G+Oj6J9Q1nI1tuosq2ZM/g    1430512179696  0
n0w4uQhOuXymj5jLaCMQ  G+Oj6J9Q1nI1tuosq2ZM/g    1430512217124  0
n0w4uQhOuXymj5jLaCMQ  mqf38Xd6CAQtuvuKc5NlWQ    1430512179696  1
I know for a fact that rows are sorted first by user_id and then by event_id.
The question is: is there a way to "hint" the Hive engine to optimize the query, given that the rows are sorted? The purpose of the optimization is to avoid keeping all groups in memory, since it is only necessary to keep one group at a time.
Right now this query, running on a 6-node, 16 GB Hadoop cluster with roughly 300 GB of data, takes about 30 minutes and uses most of the RAM, choking the system. I know that each group will be small, no more than 100 rows per (user_id, event_id) tuple, so I think an optimized execution would have a very small memory footprint and also be faster (since there is no need to look up group keys).
Create a bucketed, sorted table. The optimizer will know it is sorted from the metadata.
See example here (official docs): https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-BucketedSortedTables
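A hedged sketch of what such a table could look like for this data (column types and bucket count are assumptions):
CREATE TABLE events_all_bucketed (
  user_id     STRING,
  event_id    STRING,
  `time`      BIGINT,
  interaction INT
)
CLUSTERED BY (user_id, event_id)
SORTED BY (user_id, event_id)
INTO 64 BUCKETS;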
Count only interaction = 1: count(case when interaction = 1 then 1 end) as clicks - the CASE marks each row with 1 or NULL, and COUNT counts only the 1s.
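Applied to the original query (backticks added to guard against keyword collisions on start, end, and time):
select user_id, event_id,
       min(`time`) as `start`, max(`time`) as `end`,
       count(*) as total,
       count(case when interaction = 1 then 1 end) as clicks
from events_all
group by user_id, event_id;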

table design + SQL question

I have a table foodbar, created with the following DDL. (I am using MySQL 5.1.x.)
CREATE TABLE foodbar (
id INT NOT NULL AUTO_INCREMENT,
user_id INT NOT NULL,
weight double not null,
created_at date not null,
PRIMARY KEY (id)
);
I have four questions:
1. How may I write a query that returns a result set that gives me the following information: user_id, weight_gain, where weight_gain is the difference between a weight and a weight that was recorded 7 days ago?
2. How may I write a query that will return the top N users with the biggest weight gain (again, say over a week)? An 'obvious' way may be to use the query obtained in question 1 above as a subquery, but somehow picking the top N.
3. Since in question 2 (and indeed question 1) I am searching the records in the table using a calculated field, indexing would be preferable to optimise the query - however, since it is a calculated field, it is not clear which field to index (I'm guessing the 'weight' field is the one that needs indexing). Am I right in that assumption?
4. Assuming I had another field in the foodbar table (say 'height') and I wanted to select records from the table based on (say) the product (i.e. multiplication) of 'height' and 'weight' - would I be right in assuming again that I need to index 'height' and 'weight'? Do I also need to create a composite key (say (height, weight))? If this question is not clear, I would be happy to clarify.
I don't see why you should need the synthetic key, so I'll use this table instead:
CREATE TABLE foodbar (
user_id INT NOT NULL
, created_at date not null
, weight double not null
, PRIMARY KEY (user_id, created_at)
);
How may I write a query that returns a result set that gives me the following information: user_id, weight_gain where weight_gain is the difference between a weight and a weight that was recorded 7 days ago.
SELECT curr.user_id, curr.weight - prev.weight
FROM foodbar curr, foodbar prev
WHERE curr.user_id = prev.user_id
AND curr.created_at = CURRENT_DATE
AND prev.created_at = CURRENT_DATE - INTERVAL 7 DAY
;
the exact date arithmetic syntax varies by dialect (the above is MySQL's), but you get the idea
How may I write a query that will return the top N users with the biggest weight gain (again, say over a week)? An 'obvious' way may be to use the query obtained in question 1 above as a subquery, but somehow picking the top N.
see above, add ORDER BY curr.weight - prev.weight DESC and LIMIT N
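Putting that together, a sketch in MySQL (N shown as 10 for illustration):
SELECT curr.user_id, curr.weight - prev.weight AS weight_gain
FROM foodbar curr
JOIN foodbar prev
  ON prev.user_id = curr.user_id
WHERE curr.created_at = CURRENT_DATE
  AND prev.created_at = CURRENT_DATE - INTERVAL 7 DAY
ORDER BY weight_gain DESC
LIMIT 10;   -- top N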
for the last two questions: don't speculate, examine execution plans. (postgresql has EXPLAIN ANALYZE, dunno about mysql) you'll probably find you need to index columns that participate in WHERE and JOIN, not the ones that form the result set.
I think that "just somebody" covered most of what you're asking, but I'll just add that indexing columns that take part in a calculation is unlikely to help you at all unless it happens to be a covering index.
For example, it doesn't help to order the following rows by X, Y if I want to get them in the order of their product X * Y:
X Y
1 8
2 2
4 4
The products would order them as:
X Y Product
2 2 4
1 8 8
4 4 16
If MySQL supports calculated columns in a table and allows indexing on those columns, then that might help.
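For what it's worth, later MySQL versions do support this: 5.7+ has generated columns, which can be indexed. A rough sketch, assuming the height column from question 4 exists (not applicable to 5.1):
ALTER TABLE foodbar
  ADD COLUMN hw_product DOUBLE AS (height * weight) STORED,
  ADD INDEX idx_hw_product (hw_product);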
I agree with just somebody regarding the primary key, but for what you're asking regarding the weight calculation, you'd be better off storing the delta rather than the weight:
CREATE TABLE foodbar (
user_id INT NOT NULL,
created_at date not null,
weight_delta double not null,
PRIMARY KEY (user_id, created_at)
);
It means you'd store the user's initial weight in, say, the user table, and when you write records to the foodbar table, a user could supply the weight at that time, but the query would subtract the initial weight from the current weight. So you'd see values like:
user_id weight_delta
------------------------
1 2
1 5
1 -3
Looking at that, you know that user 1 gained 4 pounds/kilos/stones/etc.
This way you could use SUM, because it's possible for someone to have weighings every day - using "just somebody"'s equation of curr.weight - prev.weight wouldn't work, regardless of time span.
Getting the top x is easy in MySQL - use the LIMIT clause, but mind that you provide an ORDER BY to make sure the limit is applied correctly.
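A sketch of how the delta approach plus SUM could answer the weekly-gain question (the 7-day window and LIMIT value are illustrative):
SELECT user_id, SUM(weight_delta) AS weight_gain
FROM foodbar
WHERE created_at >= CURRENT_DATE - INTERVAL 7 DAY
GROUP BY user_id
ORDER BY weight_gain DESC
LIMIT 10;   -- top N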
It's not obvious, but there's some important information missing in the problem you're trying to solve. It becomes more noticeable when you think about realistic data going into this table. The problem is that you're unlikely to have a consistent, regular daily record of users' weights. So you need to clarify a couple of rules around determining 'current weight' and 'weight x days ago'. I'm going to assume the following simplistic rules:
The most recent weight reading is the 'current-weight'. (Even though that could be months ago.)
The most recent weight reading more than x days ago will be the weight assumed at x days ago. (Even though for example a reading from 6 days ago would be more reliable than a reading from 21 days ago when determining weight 7 days ago.)
Now to answer the questions:
1&2: Using the above extra rules provides an opportunity to produce two result sets: current weights, and previous weights:
Current weights:
select rd.*,
w.Weight
from (
select User_id,
max(Created_at) AS Read_date
from Foodbar
group by User_id
) rd
inner join Foodbar w on
w.User_id = rd.User_id
and w.Created_at = rd.Read_date
Similarly for the x days ago reading:
select rd.*,
w.Weight
from (
select User_id,
max(Created_at) AS Read_date
from Foodbar
where Created_at < DATEADD(dd, -7, GETDATE()) /*Or appropriate MySql equivalent*/
group by User_id
) rd
inner join Foodbar w on
w.User_id = rd.User_id
and w.Created_at = rd.Read_date
Now simply join these results as subqueries
select cur.User_id,
cur.Weight as Cur_weight,
prev.Weight as Prev_weight,
cur.Weight - prev.Weight as Weight_change
from (
/*Insert query #1 here*/
) cur
inner join (
/*Insert query #2 here*/
) prev on
prev.User_id = cur.User_id
If I remember correctly the MySql syntax to get the top N weight gains would be to simply add:
ORDER BY cur.Weight - prev.Weight DESC limit N
2&3: Choosing indexes requires a little understanding of how the query optimiser will process the query:
The important thing when it comes to index selection is which columns you are filtering by or joining on. The optimiser will use the index if it is determined to be selective enough (note that sometimes your filters have to be extremely selective, returning < 1% of the data, to be considered useful). There's always a trade-off between the slow disk seek times of navigating indexes and simply processing all the data in memory.
3: Although weights feature significantly in what you display, their only relevance in terms of filtering (or selection) is in #2, to get the top N weight gains. That is a complex calculation based on a number of queries and a lot of processing that has gone before, so Weight will provide zero benefit as an index.
Another note: even for #2 you have to calculate the weight change of all users in order to determine which have gained the most. Therefore, unless you have a very large number of readings per user, you will read most of the table (i.e. a table scan will be used to obtain the bulk of the data).
Where indexes can benefit:
You are trying to identify specific Foodbar rows based on User_id and Created_at.
You are also joining back to the Foodbar table again using User_id and Created_at.
This implies an index on User_id, Created_at would be useful (more so if this is the clustered index).
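Against the asker's original table definition, that index could be created as follows (index name is illustrative):
ALTER TABLE foodbar ADD INDEX idx_user_created (user_id, created_at);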
4: No, unfortunately it is mathematically impossible to determine the ordering of the product from how the individual values H and W order independently. E.g. H=3, W=3 sorts before H=5, W=1 on the H column, yet its product (3*3 = 9) is greater than 5*1 = 5.
You would have to actually store the calculated value and put an index on that additional column. However, as indicated in my answer to #3 above, it is still unlikely to prove beneficial.