Snowflake Compute Capacity - SQL

We use Snowflake on AWS and have some "Daily Jobs" that rebuild our data every day. Some of the jobs take ~2+ hours on a Large Snowflake warehouse. Regarding price, per the document linked below, 2 hours of computation on a Medium warehouse costs the same as 1 hour of computation on a Large warehouse. Should we consider running our jobs on an X-Large warehouse for ~30 minutes, and are there any engineering/SQL-related drawbacks or issues? We want our data modelled in less time for the same amount of money.
Thank you in advance!
https://docs.snowflake.com/en/user-guide/credits.html

The way it is supposed to work is that yes, the per-hour cost doubles, but computational power also doubles, so the job takes half the time and the total cost stays the same.
This is not always true for things like data loading, where increasing warehouse size does not always mean faster loading (many factors are involved, file size etc.).
If your tasks are ordered correctly and you don't have a situation where a task must wait for a previous task to finish because of a dependency, then I see no reason why you shouldn't do it.
At the very least, you should test it for one of the daily runs.
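If you do test this, one low-risk approach (a minimal sketch; the warehouse name and AUTO_SUSPEND value are assumptions, not from the question) is to bump the warehouse size just before the daily rebuild and drop it back down afterwards, letting auto-suspend stop billing once the jobs finish:

    -- Hypothetical warehouse name; scale up right before the heavy daily jobs
    ALTER WAREHOUSE daily_wh SET WAREHOUSE_SIZE = 'XLARGE';

    -- ... run the daily rebuild jobs here ...

    -- Scale back down and let auto-suspend stop the meter when idle
    ALTER WAREHOUSE daily_wh SET WAREHOUSE_SIZE = 'LARGE';
    ALTER WAREHOUSE daily_wh SET AUTO_SUSPEND = 60;  -- seconds of idle time before suspending

Comparing the wall-clock time and the credits consumed for one X-Large run against a normal Large run should tell you quickly whether your particular jobs scale close to linearly.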

Related

Does Google BigQuery charge by processing time?

So, this is somewhat of a "realtime" question. I ran a query and it's currently at almost 12K seconds, but it told me "This query will process 239.3 GB when run." Does BQ charge by time in addition to the data processed? Should I stop this now?
I assume you are using the on-demand pricing model - you are billed based on the amount of bytes processed, and processing time is not involved.
BigQuery uses a columnar data structure. You're charged according to the total data processed in the columns you select, and the total data per column is calculated based on the types of data in the column. For more information about how your data size is calculated, see data size calculation.
You aren't charged for queries that return an error, or for queries that retrieve results from the cache.
Charges are rounded to the nearest MB, with a minimum 10 MB data processed per table referenced by the query, and with a minimum 10 MB data processed per query.
Cancelling a running query job may incur charges up to the full cost for the query were it allowed to run to completion.
When you run a query, you're charged according to the data processed in the columns you select, even if you set an explicit LIMIT on the results.
See more at BigQuery pricing
For on-demand queries you are charged for the amount of data processed by the BigQuery engine, and only for expensive queries are you charged extra for their complexity (which can manifest itself as increased query time).
The amount of data processed is reflected by totalBytesProcessed, and also by totalBytesBilled, which is the same for ordinary queries. For complex/expensive queries you are charged extra for the complexity; technically this is done by totalBytesBilled becoming bigger than totalBytesProcessed.
More details: see this link
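If you want to see what a finished query was actually charged for, one way (a sketch; the region qualifier and the 1-day window are assumptions) is to compare the processed and billed byte counts recorded for recent jobs:

    -- Compare processed vs. billed bytes for recent query jobs in this project
    SELECT
      job_id,
      total_bytes_processed,
      total_bytes_billed,
      TIMESTAMP_DIFF(end_time, start_time, SECOND) AS runtime_seconds
    FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
    WHERE job_type = 'QUERY'
      AND creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
    ORDER BY total_bytes_billed DESC
    LIMIT 20;

For ordinary on-demand queries the two byte counts should match; a long runtime on its own does not change the bill.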

Is there an option to limit the number to columns in a sink export into BigQuery?

I created a sink export to load audit logs into BigQuery. However, there are a large number of columns that I don't need from the audit log. Is there a way to pick and choose the columns in the sink export?
We need to define our reason for wanting to reduce the number of columns. My thinking is that you are concerned about costs. If we look at active storage, we find that the current price is $0.02/GB, with the first 10 GB free each month. If the data is untouched for 90 days, that storage cost drops to $0.01/GB. Next we have to estimate how much storage is used per month for recording all columns vs. recording just the columns you want. If we can make some projections, then we can make a call on how much the cost might change if we reduced storage usage. What we will want to estimate is the number of log records exported per month and the size of the average log record written as-is today vs. a log record with only the minimally needed fields.
If we do find a distinction that makes a significant cost saving, one further thought would be to export the log entries to Pub/Sub and have them trigger a Cloud Function. However, I'm dubious; we might end up finding that the savings on BQ storage are then lost to the cost of Pub/Sub and Cloud Functions (and possibly BQ streaming inserts).
Another thought: the BQ log records are written to tables named by day. We could have a batch job that runs after a day's worth of records has been written and copies only the columns of interest to a new table. Again, we will have to watch that we don't end up with higher costs elsewhere in our attempt to reduce storage costs.
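A sketch of that last idea, assuming standard SQL and placeholder table/column names (the actual audit-log export schema will differ), could be a scheduled CTAS that keeps only the fields you care about from one day's export table:

    -- Hypothetical daily batch: copy only the columns of interest from the day's
    -- exported audit-log table into a slimmer table (names are placeholders)
    CREATE TABLE IF NOT EXISTS my_dataset.audit_slim_20190101 AS
    SELECT
      timestamp,
      severity,
      resource.type AS resource_type,
      protopayload_auditlog.methodName AS method_name
    FROM my_dataset.cloudaudit_googleapis_com_activity_20190101;

The original wide table can then be dropped or left to age into long-term storage pricing, whichever turns out cheaper for your volumes.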

Google Bigtable under-usage performance

I have seen the warnings about not using Google Bigtable for small data sets.
Does this mean that a workload of 100 QPS could run slower (total time, not per query) than a workload of 8000 QPS?
I understand that 100 QPS is going to be incredibly inefficient on Bigtable; but could it be as drastic as 100 inserts taking 15 seconds to complete, whereas 8000 inserts could run in 1 second?
Just looking for an "in theory, from time to time, yes" vs. "probably relatively unlikely" type of answer, as a rough guide for how I structure my performance test cycles.
Thanks
There's a flat start-up cost to running any Cloud Bigtable operation. That start-up cost is generally less than 1 second. I would expect 100 operations to take less time than 8000 operations. When I see extreme slowness, I usually suspect network latency or some other unique condition.
We're having issues running small workloads on our development Bigtable instance (2.5 TB), with one node instead of 3.
We have a row key on user id, with around 100 rows per user id. There are a few million records in the database in total. When we query Bigtable, we see 1.4 seconds of latency fetching the rows associated with a single user id key. Fewer than 100 records are returned, yet we're seeing well over a second of latency. It seems to me that giant workloads are the only way to use this data store. We're looking at other NoSQL alternatives like Redis.

BigQuery table partitioning performance

I've got a question about BQ performance in various scenarios, especially revolving around parallelization "under the hood".
I am saving 100M records on a daily basis. At the moment, I rotate tables every 5 days to avoid high charges due to full table scans.
If I were to run a query with a date range of "last 30 days" (for example), I would be scanning between 6 (if I am at the last day of the partition) and 7 tables.
I could, as an alternative, partition my data into a new table daily. In that case, I would optimize my expenses, since I would never query more data than I have to. The question is: would I suffer a performance penalty in terms of getting the results back to the client, because I am now querying potentially 30 or 90 or 365 tables in parallel (UNION)?
To summarize:
More tables = less data scanned
More tables =(?) longer response time to the client
Can anyone shed some light on how to find the balance between cost and performance?
A lot depends on how you write your queries and on how much development costs, but that amount of data doesn't seem like a barrier, so you may be trying to optimize too early.
When you JOIN tables larger than 8 MB, you need to use the EACH modifier, and such a query is internally parallelized.
This partitioning means that you can get higher effective read bandwidth because you can read from many of these disks in parallel. Dremel takes advantage of this; when you run a query, it can read your data from thousands of disks at once.
Internally, BigQuery stores tables in shards; these are discrete chunks of data that can be processed in parallel. If you have a 100 GB table, it might be stored in 5,000 shards, which allows it to be processed by up to 5,000 workers in parallel. You shouldn't make any assumptions about the size or number of shards in a table. BigQuery will repartition data periodically to optimize the storage and query behavior.
Go ahead and create tables for every day. One recommendation is to write your create/patch script so that it creates tables far into the future when it runs, e.g. I create the next 12 months of daily tables now. This is better than having a script that creates a table each day. And make it part of your deploy/provisioning script.
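With one table per day, a standard-SQL query over "the last 30 days" can be limited to exactly those tables with a wildcard and a _TABLE_SUFFIX filter, so only the days in the range are scanned (the project, dataset, table prefix, and column names below are assumptions):

    -- Scan only the daily tables in the requested date range
    SELECT
      user_id,
      COUNT(*) AS events
    FROM `my_project.my_dataset.events_*`
    WHERE _TABLE_SUFFIX BETWEEN '20190101' AND '20190130'
    GROUP BY user_id;

This keeps the cost proportional to the days queried, and BigQuery still fans the per-table work out in parallel.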
To read more, check out Chapter 11, "Managing Data Stored in BigQuery," from the book.

Oracle: Fastest way of doing stats on millions of rows?

I have developed an application which allows users to enter measurements - these are stored in an Oracle database. Each measurement "session" could contain around 100 measurements. There could be around 100 measurement sessions in a "batch", so that's 10,000 measurements per batch. There could easily be around 1000 batches at some point, bringing the total number of measurements into the millions.
The problem is that calculations and statistics need to be performed on the measurements. It ranges from things like average measurements per batch to statistics across the last 6 months of measurements.
My question is: is there any way that I can make the process of calculating these statistics faster? Either through the types of queries I'm running or the structure of the database?
Thanks!
I assume that most calculations will be performed on either a single session or a single batch. If this is the case, then it's important that sessions and batches aren't distributed all over the disk.
In order to achieve the desired data clustering, you probably want to create an index-organized table (IOT) organized by batch and session. That way, the measurements belonging to the same session or the same batch are close together on disk, and the queries for a session or batch will be limited to a small number of disk pages.
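A minimal sketch of such a table (column names and types are illustrative, not from the question): with ORGANIZATION INDEX the rows are stored in primary-key order, so all measurements for a batch/session sit on neighbouring blocks.

    -- Index-organized table: rows are physically stored in primary-key order
    CREATE TABLE measurements (
      batch_id       NUMBER NOT NULL,
      session_id     NUMBER NOT NULL,
      measurement_id NUMBER NOT NULL,
      measured_at    DATE,
      value          NUMBER,
      CONSTRAINT measurements_pk
        PRIMARY KEY (batch_id, session_id, measurement_id)
    )
    ORGANIZATION INDEX;

A query such as "all measurements for batch 42" then reads a contiguous range of the IOT instead of hopping around a heap table.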
Unfortunately, as the number of calculations that needed to be carried out was not limited to just a few, I could not calculate them at the end of every measurement session.
In the end the queries did not take so long - around 3 minutes to calculate all the stats. For end users this was still an unacceptably long time to wait, but the good thing was that the stats did not necessarily have to be completely up to date.
Therefore I used a materialized view to take a 'snapshot' of the stats, and set it to refresh every morning at 2am. Then, when the user requested the stats from the materialized view, the response was instant!
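A sketch of that setup (object names and the statistics query are illustrative): a materialized view that is completely rebuilt every morning at 2am, so user requests only ever read the precomputed snapshot.

    -- Snapshot of the stats, refreshed nightly at 02:00
    CREATE MATERIALIZED VIEW batch_stats_mv
      BUILD IMMEDIATE
      REFRESH COMPLETE
      START WITH TRUNC(SYSDATE) + 1 + 2/24
      NEXT       TRUNC(SYSDATE) + 1 + 2/24
    AS
    SELECT batch_id,
           COUNT(*)   AS measurement_count,
           AVG(value) AS avg_value,
           MIN(value) AS min_value,
           MAX(value) AS max_value
    FROM measurements
    GROUP BY batch_id;

Queries against batch_stats_mv return almost instantly because the aggregation work has already been done during the nightly refresh.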