BigQuery cost for complex logic in custom functions - google-bigquery

I am using relatively expensive computation in the custom function defined like this:
CREATE TEMP FUNCTION HMAC256(message STRING, secret STRING)
RETURNS STRING
LANGUAGE js
OPTIONS (
-- copy this Forge library file to Storage:
-- https://cdn.jsdelivr.net/npm/node-forge#0.7.0/dist/forge.min.js
-- #see https://github.com/digitalbazaar/forge
library=["gs://.../forge.min.js"]
)
AS
"""
var hmac = forge.hmac.create();
hmac.start('sha256', secret);
hmac.update(message);
return hmac.digest().toHex();
""";
SELECT HMAC256("test", "111");
-- Row f0_
-- 1 f8320c4eded4b06e99c1a884a25c80b2c88860e13b64df1eb6f0d3191023482b
Will this be more expensive compared to applying LOWER function for example?
I cannot see anything different in the bytes billed details while HMAC256 took 4 min on my dataset compared to 14 seconds for LOWER.
It would be awesome if price is the same. I have a feeling I miss something.

Yes. The cost will be the same. Your query can get more complicated and more expensive to BigQuery, but you're only charged of same rate until your query hit the limit. Your off-limits query will fail if you don't have reservation (aka flat-rate slots).
All high compute queries under tier 100 are billed as tier 1.
All queries above tier 100 will fail with a RESOURCES_EXCEEDED_PER_BYTE error unless the query is running in a reserved instance.

Related

SQL Query performance - UI responsiveness concerns

I am running a query on localhost, I am extremely unfamiliar with SQL. I am using a golang library to generate the query statement. This is for an enterprise app so I don't have time to evaluate and code all possible performance cases. I'd prefer good performance for the largest possible queries:
upto 6 query parameters eg. BETWEEN 'created' AND 'abandoned', BETWEEN X AND Y, IN (1,2,3.....25), IN ('A', 'B', 'C'....'Z')
JOINs/subqueries between a 2-5 tables
returning between 50K-5M records (LAT and LNG)
Currently I am using JOIN to find the lat, lng for a record, and some query parameters. Should I join differently, (left, right)? Should the FROM table be the record or the relation? Subqueries?
Is this query performance reasonable from a UI perspective? This is on localhost (docker) on a fairly low performance laptop, under WSL (16GB RAM / 6 core CPU [2.2GHz]).
-- [2547.438ms] [rows:874731]
SELECT "Longitude","Latitude"
FROM Wells
JOIN Well_Reports ON Well_Reports.Well_ID = Wells.Well_ID
JOIN Lithologies on Lithologies.Well_Report_ID = Well_Reports.Well_Report_ID
where Lithologies.Colour IN
(
'NULL',
'Gray','White','Dark','Black','Dark Gray','Dark Brown','Dark Red','Dark Blue',
'Dark Green','Dark Yellow','Bluish Green','Brownish Gray','Brownish Green','Brownish Yellow',
'Light','Light Gray','Light Red','Greenish Gray','Light Yellow','Light Green','Light Blue',
'Light Brown','Blue Gray','Greenish Yellow','Greenish Gray'
);
The UI is a heatmap. I haven't really hit performance issues returning 1-million rows.
Angular is the framework. Im breaking the HTTP response into 10K record chunks
My initial impression was 3+ seconds is too long for a UI to start populating data. I was already breaking the response to the UI into chunks, that portion was efficient and async. It never occurred to me to simply break the SQL requests into smaller chunks with LIMIT and OFFSET, so the server can start responding with data immediately (<200ms) even if it takes +5s to completely finish loading.
I'll write an answer to this effect.
Thanks and best regards,
schmorrison
A few things.
where someColumn in (null, ...)
This will not return rows where the value of someColumn is null because, x in (a, b, c ...) translates to x = a or x = b or x = c, and null is not equal to null.
You need a construct like this instead
where someColumn is null or someColumn in (...)
Second, you mentioned you're returning between 50k and 5M rows to the UI. I question the sanity of this... how is the UI rendering 5 million sets of coordinates for the user to see/use? I suppose there could be some extreme edge cases where this is really what you need to do, but it just seems really unlikely.
Third, if your concern is UI responsiveness, the proper way to handle that is to make asynchronous requests. I don't know anything about golang, but I did find this page with a quick google search. Study up on that kind of technique.
Finally, if you really do need to work with data sets this big, the critical point will be to identify the common search criteria and talk to your DBA about appropriate indexes. We can't provide much help in this regard without a lot more schema information, but if you have a specific query that is taking a long time with a particular set of parameters, you can come back and create a question for that query, along with providing the query plan, and we can help you out.
As pointed out by The Impaler,
"If you want good performance for dynamic queries that return 5
million rows, then you'll need to learn the intricacies of database
engines, not only SQL. A high level knowledge unfortunately won't cut
it."
and I expected as much.
I was simply querying the SQL DB for the whole set at once, this is wrong and I know that.
The changes I made generates the following statements (I know the specifics have changed):
-- [145.955ms] [rows:1]
SELECT count(*)
FROM "tblWellLogs"
WHERE (
DateWellCompleted BETWEEN
'1800-01-01 06:58:00.000' AND '2021-09-12 06:00:00.000'
)
AND "FinalStatusOfWellL" NOT IN
(
5,6,7,8,9,16,27,36
)
AND WaterUseL IN
(
1,29,3,8,26,4,6
)
AND (
wyRate BETWEEN
0 AND 3321
)
AND (
TotalOrFinishedDepth BETWEEN
0 AND 248
);
Calculated total(69977) chunk (17494) currentpage(3)
-- [117.195ms] [rows:17494]
SELECT "Easting","Northing"
FROM "tblWellLogs"
WHERE (
DateWellCompleted BETWEEN
'1800-01-01 06:58:00.000' AND '2021-09-12 06:00:00.000'
)
AND "FinalStatusOfWellL" NOT IN
(
5,6,7,8,9,16,27,36
)
AND WaterUseL IN
(
1,29,3,8,26,4,6
)
AND (
wyRate BETWEEN
0 AND 3321
)
AND (
TotalOrFinishedDepth BETWEEN
0 AND 248
)
ORDER BY (SELECT NULL)
OFFSET 34988 ROWS
FETCH NEXT 17494 ROWS ONLY;
Chunk count: 17494
JSON size: 385135
Request time: 278.6612ms
The response to the UI is as so:
GET /data/wells/xxxx/?page=3&completed.start=1800-01-01T06:58:36.000Z&completed.end=2021-09-12T06:00:00.000Z&status=supply,research,geothermal,other,unknown&use=domestic,commercial,industrial,municipal,irrigation,agriculture&depth.start=0&depth.end=248&rate.start=0&rate.end=3321&page=3 HTTP/1.1 200 385135
{
"data":[[0.0, 0.0],......],
"count": 87473,
"total": 874731,
"pages": 11,
"page": 1
}
In this case, the DBs are essentially immutable (I may get an updated dataset once per year). I figure I can predefine a set of chunk size variations for each DB based on query result size rather than just DB size. I am also retrieving the next page before the client requests it. I am caching requests at the browser and client layers. I only perform the Count(*) request if the client doesn't provide it, and it isn't in the cache.
I did find that running the request concurrently simply over-burdened my CPU, all the requests returned almost simultaneously but took +5s each rather than ~1second each.

How to interpret query process GB in Bigquery?

I am using a free trial of Google bigquery. This is the query that I am using.
select * from `test`.events where subject_id = 124 and id = 256064 and time >= '2166-01-15T14:00:00' and time <='2166-01-15T14:15:00' and id_1 in (3655,223762,223761,678,211,220045,8368,8441,225310,8555,8440)
This query is expected to return at most 300 records and not more than that.
However I see a message like this as below
But the table on which this query operates is really huge. Does this indicate the table size? However, I ran this query multiple times a day
Due to this, it resulted in error below
Quota exceeded: Your project exceeded quota for free query bytes scanned. For more information, see https://cloud.google.com/bigquery/troubleshooting-errors
How long do I have to wait for this error to go-away? Is the daily limit 1TB? If yes, then I didn't not use close to 400 GB.
How to view my daily usage?
If I can edit quota, can you let me know which option should I be editing?
Can you help me with the above questions?
According to the official documentation
"BigQuery charges for queries by using one metric: the number of bytes processed (also referred to as bytes read)", regardless of how large the output size is. What this means is that if you do a count(*) on a 1TB table, you will supposedly be charged $5, even though the final output is very minimal.
Note that due to storage optimizations that BigQuery is doing internally, the bytes processed might not equal to the actual raw table size when you created it.
For the error you're seeing, browse the Google Console to "IAM & admin" then "Quotas", where you can then search for quotas specific to the BigQuery service.
Hope this helps!
Flavien

Limitations in using all string columns in BigQuery

I have an input table in BigQuery that has all fields stored as strings. For example, the table looks like this:
name dob age info
"tom" "11/27/2000" "45" "['one', 'two']"
And in the query, I'm currently doing the following
WITH
table AS (
SELECT
"tom" AS name,
"11/27/2000" AS dob,
"45" AS age,
"['one', 'two']" AS info )
SELECT
EXTRACT( year from PARSE_DATE('%m/%d/%Y', dob)) birth_year,
ANY_value(PARSE_DATE('%m/%d/%Y', dob)) bod,
ANY_VALUE(name) example_name,
ANY_VALUE(SAFE_CAST(age AS INT64)) AS age
FROM
table
GROUP BY
EXTRACT( year from PARSE_DATE('%m/%d/%Y', dob))
Additionally, I tried doing a very basic group by operation casting an item to a string vs not, and I didn't see any performance degradation on a data set of ~1M rows (actually, in this particular case, casting to a string was faster):
Other than it being bad practice to "keep" this all-string table and not convert it into its proper type, what are some of the limitations (either functional or performance-wise) that I would encounter by keeping a table all-string instead of storing it as their proper type. I know there would be a slight increase in size due to storing strings instead of number/date/bool/etc., but what would be the major limitations or performance hits I'd run into if I kept it this way?
Off the top of my head, the only limitations I see are:
Queries would become more complex (though wouldn't really matter if using a query-builder).
A bit more difficult to extract non-string items from array fields.
Inserting data becomes a bit trickier (for example, need to keep track of what the date format is).
But these all seem like very small items that can be worked around. Are there are other, "bigger" reasons why using all string fields would be a huge limitation, either in limiting query-ability or having a huge performance hit in various cases?
First of all - I don't really see any bigger show-stoppers than those you already know and enlisted
Meantime,
though wouldn't really matter if using a query-builder ...
based on above excerpt - I wanted to touch upon some aspect of this approach (storing all as strings)
While we usually concerned about CASTing from string to native type to apply relevant functions and so on, I realized that building complex and generic query with some sort of query builder in some cases requires opposite - cast native type to string for applying function like STRING_AGG [just] as a quick example
So, my thoughts are:
When table is designed for direct user's access with trivial or even complex queries - having native types is beneficial and performance wise and being more friendly for user to understand, etc.
Meantime, if you are developing your own query builder and you design table such that it will be available to users for querying via that query builder with some generic logic being implemented - having all fields in string can be helpful in building the query builder itself.
So it is a balance - you can lose a little in performance but you can win in being able to better implement generic query builder. And such balance depend on nature of your business - both from data prospective and what kind of query you envision to support
Note: your question is quite broad and opinion based (which is btw not much respected on SO) so, obviously my answer - is totally my opinion but based on quite an experience with BigQuery
Are you OK to store string "33/02/2000" as a date in one row and "21st of December 2012" in another row and "22ое октября 2013" in another row?
Are you OK to store string "45" as age in one row and "young" in another row?
Are you OK when age "10" is less than age "9"?
Data types provide some basic data validation mechanism at the database level.
Does BigQuery databases have a notion of indexes?
If yes, then most likely these indexes become useless as soon as you start casting your strings to proper types, such as
SELECT
...
WHERE
age > 10 and age < 30
vs
SELECT
...
WHERE
ANY_VALUE(SAFE_CAST(age AS INT64)) > 10
and ANY_VALUE(SAFE_CAST(age AS INT64)) < 30
It is normal that with less columns/rows you don't feel the problems. You start to feel the problems when your data gets huge.
Major concerns:
Maintenance of the code: Think of future requirements that you may receive. Every conversion for data manipulation will add extra complexity to your code. For example, if your customer asks for retrieving teenagers in future, you'll need to convert string to date to get the age and then be able to do the manupulation.
Data size: The data size has broader impacts that can not be seen at the start. For example if you have N parallel test teams which require own test systems, you'll need to allocate more disk space.
Read Performance: When you have more bytes to read in huge tables it will cost you considerable time. For example typically telco operators have a couple of billions of rows data per month.
If your code complexity increase, you'll need to replicate conversions in multiple places.
Even single of above items should push one to distance from using strings for everything.
I would think the biggest issue with this would be if there are other users of this table/data, for instance if someone is trying to write reports with it and do calculations or charts or date ranges it could be a big headache having to always cast or convert the data with whatever tool they are using. You or someone would likely get a lot of complaints about it.
And if someone decided to build a layer between this data and the reporting tool which converted all of the data, then you may as well just do it one time to the table/data and be done with it.
From the solution below, you might face some storage and performance problems, you can find some guidance in the official documentation:
The main performance problem will come from the CAST operation, remember that the BigQuery Engine will have to deal with a CAST operation for each value per row.
In order to test the compute cost of this operations, I used the following query:
SELECT
street_number
FROM
`bigquery-public-data.austin_311.311_service_requests`
LIMIT
5000
Inspecting the stages executed in the execution details we are able to see the following:
READ
$1:street_number
FROM bigquery-public-data.austin_311.311_service_requests
LIMIT
5000
WRITE
$1
TO __stage00_output
Only the Read, Limit and Write operations are required. However if we execute the same query adding the the CAST operator.
SELECT
CAST(street_number AS int64)
FROM
`bigquery-public-data.austin_311.311_service_requests`
LIMIT
5000
We see that a compute operation is also required in order to perform the cast operation:
READ
$1:street_number
FROM bigquery-public-data.austin_311.311_service_requests
LIMIT
5000
COMPUTE
$10 := CAST($1 AS INT64)
WRITE
$10
TO __stage00_output
Those compute operations will consume some time, that might cause problems when escalating the operation size.
Also, remember that each time that you want to use the data type properties of each data type, you will have to cast your value, and deal with the compute operation time required.
Finally, referring to the storage performance, as you mentioned Strings do not have a fixed size, and that might cause a size increase.

Confirmation of how to calculate bigquery query costs

I want to double check what I need to look at when assessing query costs for BigQuery. I've found the quoted price per TB here which says $5 per TB, but for precisely 1 TB of what? I have been assuming up until now (before it seemed to matter) that the relevant number would be that which the BigQuery UI outputs above the results, so for this example query:
...in this case 2.34GB. So as a fraction of a terabyte and multiplied by $5 this would cost around 1.2cents assuming I'd used up my allowance for the month.
Can anyone confirm that I'm correct? Checking this before I process something I think could rack up some non-negligible costs for once. I should say I've never been stung with a sizeable BigQuery bill before it seems difficult to do.
Can anyone confirm that I'm correct?
Confirmed
Please note - BigQuery UI in fact uses DryRun which only estimates Total Bytes Processed . The final cost is based on Bytes Billed which reflects some nuances - minimum 10MB per each table involved in query as an example. You can see more details here - https://cloud.google.com/bigquery/pricing#on_demand_pricing
I know I am late but this might help you.
If you are pushing your audit logs to another dataset, you can do below on that dataset.
WITH data as
(
SELECT
protopayload_auditlog.authenticationInfo.principalEmail as principalEmail,
protopayload_auditlog.servicedata_v1_bigquery.jobCompletedEvent AS jobCompletedEvent
FROM
`administrative-audit-trail.gcp_audit_logs.cloudaudit_googleapis_com_data_access_20190227`
)
SELECT
principalEmail,
FORMAT('%9.2f',5.0 * (SUM(jobCompletedEvent.job.jobStatistics.totalBilledBytes)/POWER(2, 40))) AS Estimated_USD_Cost
FROM
data
WHERE
jobCompletedEvent.eventName = 'query_job_completed'
GROUP BY principalEmail
ORDER BY Estimated_USD_Cost DESC
Reference: https://cloud.google.com/bigquery/docs/reference/auditlogs/

How to get cost for a query in BQ

In BigQuery, how can we get the cost for a given query? We are doing a lot of high-compute queries -- https://cloud.google.com/bigquery/pricing#high-compute -- which often multiplies the data processed by 2 or more.
Is there a way to get the "Cost" of a query with the result set?
For the API or the CLI you could use the flat --dry_run which validates the query instead of running it, like so:
cat ../query.sql | bq query --use_legacy_sql=False --dry_run
Output:
Query successfully validated. Assuming the tables are not modified,
running this query will process 9614741466 bytes of data.
For costs, just divide the total bytes by 1024 ^ 4, multiply the result by 5 and then multiply by the Billing Tier you are in and you have the expected cost ($0.043 in this example).
If you already ran the query and want to know how much it processed, you can run:
bq show -j (job_id of your query)
And it'll return Bytes Billed and Billing Tier (looks like you still have to do the math for cost computation).
For WebUI, you can install BQMate and it already estimates costs for you (but you still have to adapt for your Billing Tier).
As a final recommendation, sometimes it's possible to greatly improve performance of analyzes just by optimizing how the query process data (here at our company we had several high computing queries that now process data normally just by using features such as ARRAYS and STRUCTS for instance).