Running Impala COMPUTE STATS takes too long - impala

We have a process where tables are truncated and re-loaded every month. When running COMPUTE STATS on a partitioned table, it takes too long.
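One thing worth trying (a sketch, not a confirmed fix; the table and partition names are hypothetical) is COMPUTE INCREMENTAL STATS, which only scans partitions that do not yet have incremental stats instead of rescanning the whole table:

    -- Hypothetical names; only partitions without incremental stats are scanned
    COMPUTE INCREMENTAL STATS my_db.monthly_table;

    -- Or limit the work to the partition that was just reloaded
    COMPUTE INCREMENTAL STATS my_db.monthly_table PARTITION (load_month = '2019-01');

Keep in mind that incremental stats add per-partition metadata overhead, so the plain COMPUTE STATS can still be the better choice for tables with few partitions.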

Related

Snowflake Compute Capacity

We use Snowflake on AWS and have some "Daily Jobs" to rebuild our data on a daily basis. Some of the jobs take ~2+ hours on a Large Snowflake warehouse. Regarding price, per this document, 2 hours of computation on a Medium warehouse costs the same as 1 hour of computation on a Large warehouse. Should we consider running our workload on an X-Large warehouse for ~30 minutes, and are there any engineering/SQL-related drawbacks or issues? We want to get our data modelled in less time for the same amount of money.
Thank you in advance!
https://docs.snowflake.com/en/user-guide/credits.html
The way it is supposed to work is that yes, the credit rate doubles, but the computational power also doubles, so the cost comes out the same because the job takes half the time.
This is not always true for things like data loading, where increasing the warehouse size does not always mean faster loading (many factors are involved, such as file size).
If your tasks are ordered correctly and you don't have a situation where a previous task must finish first due to a dependency, then I see no reason why you shouldn't do it.
At the very least, you should test it on one of the daily runs.
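Resizing for such a test is a one-line change in each direction; a sketch with a hypothetical warehouse name, and note that Snowflake bills per second after the first minute, so you can scale back down as soon as the run finishes:

    -- Scale up before the daily rebuild (warehouse name is hypothetical)
    ALTER WAREHOUSE daily_jobs_wh SET WAREHOUSE_SIZE = 'XLARGE';

    -- ... run the daily jobs ...

    -- Scale back down afterwards (or rely on AUTO_SUSPEND to stop billing)
    ALTER WAREHOUSE daily_jobs_wh SET WAREHOUSE_SIZE = 'LARGE';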

Does Google BigQuery charge by processing time?

So, this is somewhat of a "realtime" question. I'm running a query that is currently at almost 12K seconds, but the estimate told me "This query will process 239.3 GB when run." Does BQ charge by time in addition to the data processed? Should I stop it now?
I assume you are using the on-demand pricing model: you are billed based on the amount of bytes processed, and processing time is not involved.
BigQuery uses a columnar data structure. You're charged according to the total data processed in the columns you select, and the total data per column is calculated based on the types of data in the column. For more information about how your data size is calculated, see data size calculation.
You aren't charged for queries that return an error, or for queries that retrieve results from the cache.
Charges are rounded to the nearest MB, with a minimum 10 MB data processed per table referenced by the query, and with a minimum 10 MB data processed per query.
Cancelling a running query job may incur charges up to the full cost for the query were it allowed to run to completion.
When you run a query, you're charged according to the data processed in the columns you select, even if you set an explicit LIMIT on the results.
See more at BigQuery pricing
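To make the last two points concrete, here is a small sketch (the project, dataset, table and column names are all hypothetical): cost follows the columns you reference, not the number of rows you return:

    -- Billed for the full size of the two referenced columns across the table,
    -- even though only 10 rows come back (all names are hypothetical)
    SELECT user_id, event_ts
    FROM `my_project.my_dataset.events`
    LIMIT 10;

    -- SELECT * would be billed for every column in the table;
    -- SELECT user_id alone would be billed only for that column.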
For on-demand queries you are charged for the amount of data processed by the BigQuery engine, and only for expensive queries are you charged extra for complexity (which can manifest itself as increased query time).
The amount of data processed is reflected by totalBytesProcessed, and also by totalBytesBilled, which is the same for ordinary queries. For complex/expensive queries you are charged extra for the complexity, and technically this is done by totalBytesBilled becoming bigger than totalBytesProcessed.
More details: see this link
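If you want to compare those two numbers for your own recent jobs, one way is the INFORMATION_SCHEMA jobs view; a sketch where the region qualifier and the one-day window are assumptions you may need to adjust:

    -- Compare bytes processed vs. bytes billed for the last day's queries
    -- (the region qualifier is an assumption; use your dataset's location)
    SELECT job_id, total_bytes_processed, total_bytes_billed
    FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
    WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
      AND job_type = 'QUERY'
    ORDER BY total_bytes_billed DESC;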

BigQuery: Exceeded quota for Number of partition modifications to a column partitioned table

I get this error when trying to run a lot of CSV import jobs into a BigQuery table that is date-partitioned on a custom TIMESTAMP column.
Your table exceeded quota for Number of partition modifications to a column partitioned table
Full error below:
{Location: "partition_modifications_per_column_partitioned_table.long"; Message: "Quota exceeded: Your table exceeded quota for Number of partition modifications to a column partitioned table. For more information, see https://cloud.google.com/bigquery/troubleshooting-errors"; Reason: "quotaExceeded"}
It is not clear to me: what is the quota for the number of partition modifications, and how is it being exceeded?
Thanks!
What is the quota for Number of partition modifications?
See Quotas for Partitioned tables
In particular:
Maximum number of partitions modified by a single job — 2,000
Each job operation (query or load) can affect a maximum of 2,000 partitions. Any query or load job that affects more than 2,000 partitions is rejected by Google BigQuery.
Maximum number of partition modifications per day per table — 5,000
You are limited to a total of 5,000 partition modifications per day for a partitioned table. A partition can be modified by using an operation that appends to or overwrites data in the partition. Operations that modify partitions include: a load job, a query that writes results to a partition, or a DML statement (INSERT, DELETE, UPDATE, or MERGE) that modifies data in a partition.
You can see more details in the link above.
If you're going to change the data often, I strongly suggest you delete the table and simply upload it again with the new values. Every time you upload a new table, the limit is refreshed.
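An alternative that keeps the partitioned table in place (a sketch under assumptions: table names are hypothetical, and the staging table is not partitioned so the quota does not apply to it) is to load all the CSVs into a staging table first and then move them across in a single statement, so each affected partition is counted once for the single INSERT job instead of once for every small load job that touches it:

    -- 1) Load all CSVs into an unpartitioned staging table
    -- 2) Write into the partitioned table with one job (names are hypothetical)
    INSERT INTO `my_project.my_dataset.events_partitioned`
    SELECT *
    FROM `my_project.my_dataset.events_staging`;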

Inserting data into U-SQL tables is taking too long?

Inserting data into a U-SQL table is taking too much time. We are using partitioned tables to recalculate previously processed data. The first insertion took almost 10-12 minutes across three tables with 11, 5 and 1 partitions, with parallelism set to 10. The second insertion of the same data took almost 4 hours. Currently we are using year-based partitions. We tested insertion and querying without partitions and performance was much better. Is this an issue with partitioned tables?
It is very strange that the same job would take that much longer for the same data and script executed with the same degree of parallelism. If you look at the job graph (or the vertex execution information) from within Visual Studio, can you see where the time is being spent?
Note that (coarse-grained) partitions are more of a data life-cycle management feature that allows you to address individual partitions of a table, and not necessarily a performance feature (although partition elimination can help with query performance). But it should not go from minutes to hours with the same script, resources and data.
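For the "recalculate previously processed data" part, a common pattern with coarse-grained partitions is to drop and re-add only the year partition being reprocessed and then insert just that year's rows. A rough sketch, assuming an integer Year partition key and hypothetical table/rowset names:

    // Reprocess a single year partition (all names are hypothetical)
    ALTER TABLE dbo.Measurements DROP IF EXISTS PARTITION (2017);
    ALTER TABLE dbo.Measurements ADD IF NOT EXISTS PARTITION (2017);

    INSERT INTO dbo.Measurements
    SELECT * FROM @recalculated
    WHERE Year == 2017;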

Oracle: Fastest way of doing stats on millions of rows?

I have developed an application which allows users to enter measurements - these are stored in an Oracle database. Each measurement "session" could contain around 100 measurements. There could be around 100 measurement sessions in a "batch", so that's 10,000 measurements per batch. There could easily be around 1000 batches at some point, bringing the total number of measurements into the millions.
The problem is that calculations and statistics need to be performed on the measurements. These range from things like the average measurement per batch to statistics across the last 6 months of measurements.
My question is: is there any way that I can make the process of calculating these statistics faster? Either through the types of queries I'm running or the structure of the database?
Thanks!
I assume that most calculations will be performed on either a single session or a single batch. If this is the case, then it's important that sessions and batches aren't scattered all over the disk.
In order to achieve the desired data clustering, you probably want to create an index-organized table (IOT) organized by batch and session. That way, the measurements belonging to the same session or the same batch are close together on disk, and the queries for a session or batch will be limited to a small number of disk pages.
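A minimal sketch of such a table, assuming hypothetical column names and that (batch_id, session_id, measurement_id) uniquely identifies a row:

    -- Index-organized table clustered by batch, then session (names are hypothetical)
    CREATE TABLE measurements (
      batch_id        NUMBER       NOT NULL,
      session_id      NUMBER       NOT NULL,
      measurement_id  NUMBER       NOT NULL,
      measured_value  NUMBER,
      measured_at     TIMESTAMP,
      CONSTRAINT measurements_pk
        PRIMARY KEY (batch_id, session_id, measurement_id)
    )
    ORGANIZATION INDEX;

With that primary key, a per-batch aggregate such as SELECT AVG(measured_value) FROM measurements WHERE batch_id = :b only has to read the leaf blocks belonging to that batch.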
Unfortunately, as the calculations that needed to be carried out were not limited to just a few, I could not compute them at the end of every measurement session.
In the end the queries did not take that long: around 3 minutes to calculate all the stats. For end users this was still an unacceptably long time to wait, but the good thing was that the stats did not necessarily have to be completely up to date.
Therefore I used a materialized view to take a 'snapshot' of the stats, and set it to update itself every morning at 2am. Then, when the user requested the stats from the materialized view it was instant!
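For reference, the kind of materialized view described above can be given a scheduled complete refresh; a sketch with hypothetical names, rebuilding every morning at 2am:

    -- Pre-computed per-batch statistics, rebuilt nightly at 2am (names are hypothetical)
    CREATE MATERIALIZED VIEW batch_stats_mv
      BUILD IMMEDIATE
      REFRESH COMPLETE
      START WITH TRUNC(SYSDATE) + 1 + 2/24
      NEXT TRUNC(SYSDATE) + 1 + 2/24
    AS
    SELECT batch_id,
           AVG(measured_value)  AS avg_value,
           COUNT(*)             AS measurement_count
    FROM   measurements
    GROUP BY batch_id;

User queries then read batch_stats_mv directly, which is why the response appears instant even though the underlying aggregation still runs for a few minutes each night.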