So, this is somewhat of a "realtime" question. I ran a query and it has now been running for almost 12,000 seconds, but the estimate told me This query will process 239.3 GB when run. Does BQ charge for time in addition to the data processed? Should I stop it now?
I assume you are using the on-demand pricing model: you are billed based on the amount of bytes processed, and processing time is not involved.
BigQuery uses a columnar data structure. You're charged according to the total data processed in the columns you select, and the total data per column is calculated based on the types of data in the column. For more information about how your data size is calculated, see data size calculation.
You aren't charged for queries that return an error, or for queries that retrieve results from the cache.
Charges are rounded up to the nearest MB, with a minimum 10 MB data processed per table referenced by the query, and with a minimum 10 MB data processed per query.
Cancelling a running query job may incur charges up to the full cost the query would have incurred had it been allowed to run to completion.
When you run a query, you're charged according to the data processed in the columns you select, even if you set an explicit LIMIT on the results.
See more at BigQuery pricing
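To make the rounding rules above concrete, here is a small cost estimator. The $6.25/TiB figure is an assumed on-demand list price, not taken from the thread; check the current pricing page for your region before relying on it.

```python
import math

PRICE_PER_TIB = 6.25  # assumed on-demand list price; verify for your region

def billed_bytes(per_table_bytes):
    """Apply the rounding rules quoted above: round each table's scan up to
    the nearest MB with a 10 MB per-table minimum, then enforce a 10 MB
    per-query minimum."""
    MB = 1024 * 1024
    total_mb = sum(max(10, math.ceil(b / MB)) for b in per_table_bytes)
    return max(10, total_mb) * MB

def estimate_cost(per_table_bytes):
    return billed_bytes(per_table_bytes) / (1024 ** 4) * PRICE_PER_TIB

# The 239.3 GB scan from the question costs the same whether it runs
# for 12 seconds or 12,000 seconds.
print(round(estimate_cost([239.3 * 1024 ** 3]), 2))
```

As the asker's situation shows, runtime never enters the formula: only bytes scanned do.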
For on-demand queries you are charged for the amount of data processed by the BigQuery engine; only for unusually complex/expensive queries are you charged extra for compute complexity (which can manifest as increased query time).
The amount of data processed is reported as totalBytesProcessed, and also as totalBytesBilled, which is the same for ordinary queries. For complex/expensive queries the extra charge is applied by making totalBytesBilled larger than totalBytesProcessed.
More details: see this link
We use Snowflake on AWS and have some daily jobs that rebuild our data each day. Some of these jobs take ~2+ hours to run on a Large Snowflake warehouse. According to this pricing document, 2 hours of computation on a Medium warehouse cost the same as 1 hour on a Large warehouse. Should we consider running our jobs on an X-Large warehouse for ~30 minutes, and are there any engineering/SQL-related drawbacks or issues? We want to get our data modelled in less time for the same amount of money.
Thank you in advance!
https://docs.snowflake.com/en/user-guide/credits.html
The way it is supposed to work: yes, the credit rate doubles, but computational power also doubles, so the cost stays the same because the job takes half the time.
This is not always true for things like data loading, where increasing warehouse size does not always mean faster loading (many factors are involved: file size, file count, etc.).
If your tasks are ordered correctly and there is no dependency that requires a previous task to finish first, then I see no reason why you shouldn't do it.
At the very least, you should test it on one of the daily runs.
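The linear trade-off can be sketched with the per-hour credit rates from the Snowflake credits documentation linked above (verify these against the current docs):

```python
# Credits consumed per hour by warehouse size, per the Snowflake
# credits documentation; verify against current docs.
CREDITS_PER_HOUR = {"XS": 1, "S": 2, "M": 4, "L": 8, "XL": 16}

def credits(size, hours):
    return CREDITS_PER_HOUR[size] * hours

# 2 h on Medium, 1 h on Large, and 0.5 h on X-Large all cost 8 credits,
# assuming the job actually scales linearly with warehouse size.
print(credits("M", 2), credits("L", 1), credits("XL", 0.5))
```

The caveat in the answer is exactly the "assuming linear scaling" comment: if the job only speeds up 1.5x on the bigger warehouse, the larger size costs more.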
I have a BigQuery table with the following properties:
Table size: 1.64 TB
Number of rows: 9,883,491,153
The data is put there using streaming inserts (in batches of 500 rows each).
From the Google Cloud Pricing Calculator, the cost for these inserts so far should be roughly $86.
But in reality, it turns out to be around $482.
The explanation is in the pricing docs:
Streaming inserts (tabledata.insertAll): $0.010 per 200 MB (You are charged for rows that are successfully inserted. Individual rows are calculated using a 1 KB minimum size.)
So in the case of my table, each row is just 182 bytes, but I have to pay for the full 1,024 bytes per row, resulting in ~562% of the originally (incorrectly) estimated cost.
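The arithmetic behind that ~562% can be checked directly, using the $0.010-per-200 MB rate from the pricing docs quoted above:

```python
rows = 9_883_491_153          # row count from the table properties above
actual_row_bytes = 182
billed_row_bytes = max(actual_row_bytes, 1024)  # 1 KB per-row minimum
PRICE_PER_200MB = 0.010

def insert_cost(rows, row_bytes):
    return rows * row_bytes / (200 * 1024 ** 2) * PRICE_PER_200MB

naive = insert_cost(rows, actual_row_bytes)   # what the calculator estimated
billed = insert_cost(rows, billed_row_bytes)  # what actually gets billed
print(f"naive ${naive:.0f}, billed ${billed:.0f}, ratio {billed / naive:.0%}")
```

The ratio is simply 1024/182, independent of row count: the smaller your rows, the worse the 1 KB minimum hurts.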
Is there a canonical (and of course legal) way to improve the situation, i.e., reduce cost? (Something like inserting into a temp table with just one array-of-struct column, to hold multiple rows in a row, and then split-moving regularly into the actual target table?)
I can suggest you these options:
Use the BigQuery Storage Write API. You can stream records into BigQuery so they become available as soon as they are written, or batch a large number of records and commit them in a single operation.
Some advantages are:
Lower cost because you have 2 TB per month free.
It supports exactly-once semantics through the use of stream offset.
If a table schema changes while a client is streaming, BigQuery Storage Write notifies the client.
Here is more information about BigQuery Storage Write.
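A rough cost comparison illustrates why this helps. The rates below are assumed list prices (legacy streaming inserts at $0.010 per 200 MB, Storage Write at $0.025 per GiB with the first 2 TiB per month free, as the answer mentions); verify them against the current pricing page:

```python
GIB = 1024 ** 3

# Assumed list prices; check the current BigQuery pricing page.
INSERTALL_PER_200MB = 0.010               # legacy streaming inserts
STORAGE_WRITE_PER_GIB = 0.025             # Storage Write API
STORAGE_WRITE_FREE_BYTES = 2 * 1024 ** 4  # first 2 TiB per month free

def insertall_cost(nbytes):
    # Note: real insertAll billing also applies the 1 KB per-row minimum,
    # which would make this even higher for small rows.
    return nbytes / (200 * 1024 ** 2) * INSERTALL_PER_200MB

def storage_write_cost(nbytes):
    return max(0, nbytes - STORAGE_WRITE_FREE_BYTES) / GIB * STORAGE_WRITE_PER_GIB

monthly = 1.64 * 1024 ** 4  # roughly the 1.64 TB table from the question
print(f"insertAll:     ${insertall_cost(monthly):.2f}")
print(f"Storage Write: ${storage_write_cost(monthly):.2f}")
```

At this volume the whole load fits inside the Storage Write free tier, so it would cost nothing versus ~$86 (before the 1 KB minimum) with insertAll.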
Another option: you could use Beam/Dataflow to batch records for loading into BigQuery, using BigQueryIO with its batch write method.
You can see more information here.
I created a sink export to load audit logs into BigQuery. However, there are a large number of columns that I don't need from the audit log. Is there a way to pick and choose the columns in the sink export?
We need to define the reason for wanting to reduce the number of columns. My thinking is that you are concerned about costs. If we look at active storage, we find that the current price is $0.02/GB, with the first 10 GB free each month. If the data is untouched for 90 days, that storage cost drops to $0.01/GB. Next we have to estimate how much storage is used for recording all columns for a month vs. recording just the columns you want. If we can make some projections, then we can make a call on how much the cost might change if we reduced storage usage. What we will want to estimate is the number of log records exported per month, and the size of the average log record written as-is today vs. a log record with only the minimally needed fields.
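That estimate can be sketched in a few lines. The record counts and sizes below are hypothetical placeholders; only the $0.02/GB active-storage price and 10 GB free tier come from the answer above:

```python
ACTIVE_PRICE_PER_GB = 0.02  # active storage, per GB-month (from the answer)
FREE_GB = 10                # first 10 GB free each month

def monthly_storage_cost(records_per_month, avg_record_bytes):
    gb = records_per_month * avg_record_bytes / 1e9
    return max(0, gb - FREE_GB) * ACTIVE_PRICE_PER_GB

# Hypothetical volumes -- plug in your own numbers.
full = monthly_storage_cost(50_000_000, 4096)    # all columns, ~4 KB/record
trimmed = monthly_storage_cost(50_000_000, 512)  # minimal columns, ~0.5 KB
print(f"full: ${full:.2f}/month, trimmed: ${trimmed:.2f}/month")
```

With numbers like these the absolute saving is a few dollars a month, which supports the answer's caution that a Pub/Sub + Cloud Function pipeline could easily cost more than it saves.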
If we do find a distinction that makes a significant cost saving, one further thought would be to export the log entries to Pub/Sub and have them trigger a Cloud Function. However, I suspect we might find that the savings on BQ storage are then lost to the cost of Pub/Sub and Cloud Functions (and possibly BQ streaming inserts).
Another thought: the BQ log records are written to tables named by day. We could have a batch job that runs after a day's worth of records are written and copies only the columns of interest to a new table. Again, we will have to watch that we don't end up with higher costs elsewhere in our attempt to reduce storage costs.
We have two tables in bigquery: one is large (a couple billion rows) and on a 'table-per-day' basis, the other is date partitioned, has the exact same schema but a small subset of the rows (~100 million rows).
I wanted to run a (standard SQL) query with a subselect in the form of a join (same result when the subselect is in the WHERE clause) on the small partitioned dataset. I was not able to run it because billing tier 1 was exceeded.
If I run the same query on the big dataset (that contains the data I need and a lot of other data) it runs fine.
I do not understand the reason for this.
Is it because:
Partitioned tables need more resources to query
Bigquery has some internal rules that the ratio of data processed to resources needed must meet a certain threshold, i.e. I was not paying enough when I queried the small dataset given the amount of resources I needed.
If 1. is true, we could simply make the small dataset also be on a 'table-per-day' basis in order to solve the issue. But before we do that though we would like to know if it is really going to solve our problem.
Details on the queries:
Big dataset
queries 11 GB, runs 50 secs, Job Id remilon-study:bquijob_2adced71_15bf9a59b0a
Small dataset
Job Id remilon-study:bquijob_5b09f646_15bf9acd941
I'm an engineer on BigQuery, and I just took a look at your jobs. It looks like your second query has an additional filter with a nested clause that your first query does not. That extra processing is likely what makes your query exceed your tier. I would recommend running the queries in the BigQuery UI and looking at the Explanation tab to see how the query plans differ.
If you try running the exact same query (modifying only the partition syntax) for both tables and still get the same error I would recommend filing a bug.
I've got a question about BQ performance in various scenarios, especially revolving around parallelization "under the hood".
I am saving 100M records on a daily basis. At the moment, I am rotating tables every 5 days to avoid high charges due to full table scans.
If I were to run a query with a date range of "last 30 days" (for example), I would be scanning between 6 (if I am at the last day of the partition) and 7 tables.
I could, as an alternative, partition my data into a new table daily. In this case, I would optimize my expenses, as I would never query more data than I have to. The question is: will I suffer a performance penalty in getting the results back to the client, because I am now querying potentially 30 or 90 or 365 tables in parallel (union)?
To summarize:
More tables = less data scanned
More tables =(?) longer response time to the client
Can anyone shed some light on how to find the balance between cost and performance?
A lot depends on how you write your queries and on how much development costs, but that amount of data doesn't seem like a barrier, so you may be trying to optimize too early.
In legacy SQL, when you JOIN tables larger than 8 MB you need to use the EACH modifier, and such a query is internally parallelized.
This partitioning means that you can get higher effective read bandwidth because you can read from many of these disks in parallel. Dremel takes advantage of this; when you run a query, it can read your data from thousands of disks at once.
Internally, BigQuery stores tables in shards; these are discrete chunks of data that can be processed in parallel. If you have a 100 GB table, it might be stored in 5,000 shards, which allows it to be processed by up to 5,000 workers in parallel. You shouldn't make any assumptions about the size or number of shards in a table. BigQuery will repartition data periodically to optimize the storage and query behavior.
Go ahead and create tables for every day. One recommendation: write your create/patch script so that, when it runs, it creates tables far into the future, e.g. I create the next 12 months of daily tables now. This is better than having a script that creates one table each day. And make it part of your deploy/provisioning script.
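Generating the names for those future daily tables is straightforward; a minimal sketch (the `events` base name and date-suffix convention are illustrative assumptions, matching the common `table_YYYYMMDD` pattern):

```python
from datetime import date, timedelta

def daily_table_names(base, start, months_ahead=12):
    """Generate dated table names (base_YYYYMMDD) covering roughly the
    next `months_ahead` months, so tables can be created in advance."""
    names = []
    d = start
    end = start + timedelta(days=30 * months_ahead)
    while d < end:
        names.append(f"{base}_{d:%Y%m%d}")
        d += timedelta(days=1)
    return names

names = daily_table_names("events", date(2024, 1, 1))
print(names[0], names[-1], len(names))
```

The provisioning script would then loop over these names and create any table that doesn't exist yet, copying the shared schema.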
To read more, check out Chapter 11, "Managing Data Stored in BigQuery", from the book.