Why is collect statistics in databases called a resource consuming activity?

Whenever we reach out to our Teradata DBAs to collect stats on specific tables, we get feedback that it is a resource-consuming activity and that they will do it when the system is relatively free, or on the weekend when there is no load on the system.
The tables for which stats collection is required are queried on an intra-day basis. The explain plan shows "High confidence" if we collect stats on a few columns.
So I just want to understand why stats collection is called a resource-consuming activity. If we do not collect stats on tables that are loaded on an intra-day basis, aren't we burdening the system by executing SQL for which the explain plans say "collect stats"?
Thanks!

Yes, the absence of stats may result in less optimal access paths and worse performance than might otherwise be achievable.
But collecting stats is considerably more intensive than, say, checking whether a key value is present in a table. So on a loaded system, it is not the wisest idea to add a stats collection to the mix.
And at any rate, if the tables concerned are "loaded on an intra-day basis", they are highly volatile, and collecting stats for them might turn out to be not that useful after all, since any new load might render the existing stats obsolete or badly off. If you can provide reasonably accurate stats manually, do that.
EDIT
Oh yes, to answer the actual question you asked ("Why is collect statistics in databases called a resource consuming activity?"): because it consumes resources, and seriously above average compared to "normal" database transactions.

Statistics collection is a maintenance activity that requires balance in a production Teradata environment. Teradata continues to make strides in improving the efficiency of statistics maintenance with each release of the database. If I recall correctly, one of the more recent improvements is to identify unused statistics and bypass refreshing them during statistics collection. But it remains a resource-intensive operation on large tables with multiple sets of statistics present.
The frequency with which you collect statistics will vary based on the size of the table, how the table is loaded, and the number of statistics that exist on it. Tables that are "flushed and filled" require more frequent statistics collection than tables where data is appended or updated in place. The former should have statistics collected after each load. For the latter, it depends on the volume of data that has changed relative to the time since the last collection. Stale statistics can mislead the optimizer, or cause it to abandon them in favor of random sampling.
Furthermore, as a table grows in relation to the size of the system, and as the demographics of its data become well understood, relying on sampled statistics in place of full statistics becomes an option. Using the correct sample size reduces the cost of collecting the statistics.
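As an illustration, a minimal sketch of the difference in Teradata syntax (the table and column names are invented):

    -- Full statistics: scans every row; expensive on large tables.
    COLLECT STATISTICS COLUMN (customer_id) ON sales_fact;

    -- Sampled statistics: scans only a percentage of the rows, at a
    -- fraction of the cost; often adequate for large, evenly
    -- distributed data.
    COLLECT STATISTICS USING SAMPLE 10 PERCENT COLUMN (customer_id) ON sales_fact;

Whether a given sample size is safe depends on the column's skew; heavily skewed columns usually still warrant full statistics.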
It is not uncommon for statistics maintenance activities to be scheduled off hours or over the weekend. For large platforms, the collection of statistics across the system can be measured in hours. As a DBA I would be reluctant to refresh the statistics on a large production table in the middle of the day unless there was a query that was causing catastrophic problems (i.e. hot AMPing). Even then the remedy would be to prevent that query from running until statistics could be collected off hours.
If you have SLA’s defined in your environment and believe statistics collection would improve your ability to meet your SLAs, then a discussion with the DBA’s to come to a better understanding is necessary. Based on what you described, the DBA response is not surprising because they are trying to ensure the users receive the resources during the day.
Finally, if you have tables that are being loaded intra-day, the collection of SUMMARY statistics has low overhead and should be part of your ETL routine. Previously, collecting PARTITION statistics was also advisable irrespective of whether the table was actually partitioned, though I don't recall whether that has fallen out of favor in the most recent releases (16.xx) of Teradata. PARTITION statistics were also fairly low overhead; both are sketched below.
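For reference, a minimal sketch of both low-overhead collections in Teradata syntax (the table name is invented):

    -- Row count and block-level metrics only; cheap enough to run
    -- after every intra-day load.
    COLLECT SUMMARY STATISTICS ON sales_fact;

    -- PARTITION is a system-derived column; collecting on it is
    -- inexpensive even when the table is not actually partitioned.
    COLLECT STATISTICS COLUMN (PARTITION) ON sales_fact;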
Hope this helps.

Related

How to maintain high performance in a medical production database with millions of rows

I have an application that is used to chart patient data during an ICU stay (electronic record). Patients are usually connected to several devices (monitors, ventilator, dialysis, etc.) that send data at a one-minute interval. An average of 1,800 rows are inserted per hour per patient. Until now, the integration engine receives the data and stores it in files on a dedicated drive, and the application reads it from there and plots it in graphs and data grids.
As there's a requirement for analysis, we're thinking about writing the incoming signals immediately into the DB. But there are a lot of concerns with respect to performance; especially in this work environment, people are very sensitive when it comes to performance.
Are there any techniques besides proper indexing to mitigate a possible performance impact? I'm thinking of a job to load the data into a dedicated table, or maybe even into another database, e.g. one month after the record was closed. Any experience with keeping the production DB small and lightweight?
I have no idea how many patients you have in your ICU, but unless you have thousands of patients you should not have any problems, as long as you stick to inserts, use bind variables, and have as many freelists as necessary. An insert only takes a lock on the freelist, so you can do as many parallel inserts as there are freelists available to determine a free block to write the data to. You may want to look at the discussion over at Tom Kyte's site:
https://asktom.oracle.com/pls/asktom/f?p=100:11:0::::P11_QUESTION_ID:950845531436
Generally speaking, 1,800 records per hour (or 10-20 times that) is not a lot for any decent-sized Oracle DB. If you are really fancy, you could choose to partition based on the patient_id. This would be specifically useful if you:
Access the data for only one patient at a time, because you can just skip all the other partitions.
Want to remove the data for a patient en bloc once he leaves the ICU: instead of DELETEing, you could just drop the patient's partition, as sketched below.
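A minimal sketch of that layout, assuming Oracle 12.2+ automatic list partitioning (all table and column names are invented):

    CREATE TABLE vital_signs (
      patient_id  NUMBER       NOT NULL,
      recorded_at TIMESTAMP    NOT NULL,
      device      VARCHAR2(30),
      reading     NUMBER
    )
    PARTITION BY LIST (patient_id) AUTOMATIC
    (PARTITION p_seed VALUES (0));  -- new partitions are created per patient_id

    -- When a patient leaves the ICU, remove their data in one cheap
    -- operation instead of deleting row by row:
    ALTER TABLE vital_signs DROP PARTITION FOR (42) UPDATE INDEXES;

On releases before 12.2 you would have to create the list partitions explicitly as patients are admitted.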
Define "immediately". One of the best things you can do to improve INSERT performance is to batch the commands instead of running them one-at-a-time.
Every SQL statement has overhead - sending the statement to the database, parsing it (remember to use bind variables so that you don't have to hard parse each statement), returning a message, etc. In many applications, that overhead takes longer than the actual INSERT.
You only have to do a small amount of batching to significantly reduce that overhead. Running an INSERT ALL with two rows instead of two separate statements cuts the per-statement overhead in half; running with three rows cuts it by two-thirds, and so on. Waiting a minute, or even a few seconds, to accumulate rows can make a big difference.
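As an illustration (Oracle syntax; the table and column names are invented), three rows in one round trip:

    INSERT ALL
      INTO vital_signs (patient_id, recorded_at, reading)
        VALUES (42, TIMESTAMP '2020-01-01 10:00:00', 72)
      INTO vital_signs (patient_id, recorded_at, reading)
        VALUES (42, TIMESTAMP '2020-01-01 10:01:00', 75)
      INTO vital_signs (patient_id, recorded_at, reading)
        VALUES (42, TIMESTAMP '2020-01-01 10:02:00', 74)
    SELECT 1 FROM dual;

In practice the same effect is usually achieved with array binds from the client (e.g. a JDBC batch), which batches rows without building large statement texts.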
As long as you avoid the common row-by-row blunders, an Oracle database with "millions" of rows is nothing to worry about. You don't need to think about cryptic table settings or replication yet.

Rapidly changing large data processing advice

My team has the following dilemma, on which we need some architecture/resource advice:
Note: Our data is semi-structured
Over-all Task:
We have a semi-large data set that we process during the day
this "process" gets executed 1-5 times a day
each "process" takes anywhere from 30 minutes to 5 hours
semi-large data = ~1 million rows
each row gets updated anywhere from 1-10 times during the process
during each update, ALL other rows may change, as we aggregate the rows for the UI
What we are doing currently:
our current system is functional, yet expensive and inconsistent
we use a SQL db to store all the data and retrieve/update it as the process requires
Unsolved problems and desired goals:
since these processes are user-triggered, we never know when to scale up/down, which causes high spikes; Azure doesn't make it easy to autoscale on demand without a data warehouse, which we want to stay away from because of its lack of aggregates and various other "buggy" issues
because of constant IO to the db, we hit 100% of DTU as soon as one process begins (we are using an Azure P1 DB), which will of course force us to grow even larger if multiple processes start at the same time (which is very likely)
we understand that high-compute tasks come at a cost, but we think there is a better way to go about this (the SQL is about 99% optimized, so there is not much left to do there)
We are looking for some tool that can:
Process a large number of transactions QUICKLY
Handle constant updates to this large amount of data
Support all major aggregations
Be "reasonably" priced (I know this is an arguable keyword, just take it lightly...)
Considered:
Apache Spark
we don't have a ton of experience with HDP, so any pros/cons here will certainly be useful (does the use case fit the tool?)
ArangoDB
seems promising: it appears fast and has all the aggregations we need.
Azure Data Warehouse
we ran into too many issues; it just didn't work for us.
Any GPU-accelerated compute or some other high-end ideas are also welcome.
It's hard to try them all and compare which one fits best, as we have a fully functional system and would have to make adjustments for whichever way we go.
Hence, any beforehand opinions are welcome before we pull the trigger.

Why does Spark SQL consider the support of indexes unimportant?

Quoting the Spark DataFrames, Datasets and SQL manual:
A handful of Hive optimizations are not yet included in Spark. Some of these (such as indexes) are less important due to Spark SQL's in-memory computational model. Others are slotted for future releases of Spark SQL.
Being new to Spark, I'm a bit baffled by this for two reasons:
1. Spark SQL is designed to process Big Data, and at least in my use case the data size far exceeds the size of available memory. Assuming this is not uncommon, what is meant by "Spark SQL's in-memory computational model"? Is Spark SQL recommended only for cases where the data fits in memory?
2. Even assuming the data fits in memory, a full scan over a very large dataset can take a long time. I read this argument against indexing in in-memory databases, but I was not convinced. The example there discusses a scan of a 10,000,000-record table, but that's not really big data. Scanning a table with billions of records can cause simple queries of the "SELECT x WHERE y = z" type to take forever instead of returning immediately.
I understand that Indexes have disadvantages like slower INSERT/UPDATE, space requirements, etc. But in my use case, I first process and load a large batch of data into Spark SQL, and then explore this data as a whole, without further modifications. Spark SQL is useful for the initial distributed processing and loading of the data, but the lack of indexing makes interactive exploration slower and more cumbersome than I expected it to be.
I'm wondering, then, why the Spark SQL team considers indexes so unimportant that they're off the road map. Is there a different usage pattern that can provide the benefits of indexing without resorting to implementing something equivalent independently?
Indexing input data
The fundamental reason why indexing over external data sources is not in the Spark scope is that Spark is not a data management system but a batch data processing engine. Since it doesn't own the data it uses, it cannot reliably monitor changes and, as a consequence, cannot maintain indices.
If a data source supports indexing, Spark can utilize it indirectly through mechanisms like predicate pushdown, as in the sketch below.
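A minimal sketch in Spark SQL (the path and column names are invented), assuming a Parquet source: the WHERE predicate can be pushed down to the scan, so only row groups whose min/max statistics might match are read -- an index-like effect supplied by the storage format rather than by Spark:

    CREATE TEMPORARY VIEW events
    USING parquet
    OPTIONS (path '/data/events');

    SELECT x FROM events WHERE y = 'z';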
Indexing Distributed Data Structures:
- Standard indexing techniques require a persistent and well-defined data distribution, but data in Spark is typically ephemeral and its exact distribution is nondeterministic.
- A high-level data layout achieved by proper partitioning, combined with columnar storage and compression, can provide very efficient distributed access without the overhead of creating, storing, and maintaining indices. This is a common pattern used by different in-memory columnar systems; a sketch follows this list.
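A minimal Spark SQL sketch of that pattern (names invented; assumes the source table has an event_date column): queries filtering on the partition column touch only the matching directories, with no index required:

    CREATE TABLE events_by_day
    USING parquet
    PARTITIONED BY (event_date)
    AS SELECT * FROM events;

    SELECT count(*) FROM events_by_day
    WHERE event_date = DATE '2020-01-01';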
That being said some forms of indexed structures do exist in Spark ecosystem. Most notably Databricks provides Data Skipping Index on its platform.
Other projects, like Succinct (mostly inactive today), take a different approach and use advanced compression techniques with random access support.
Of course this raises a question: if you require efficient random access, why not use a system that was designed as a database from the beginning? There are many choices out there, including at least a few maintained by the Apache Foundation. At the same time, Spark as a project evolves, and the quote you used might not fully reflect future Spark directions.
In general, the utility of indexes is questionable at best. Instead, data partitioning is more important. They are very different things, and just because your database of choice supports indexes doesn't mean they make sense given what Spark is trying to do. And it has nothing to do with "in memory".
So what is an index, anyway?
Back in the days when permanent storage was crazy expensive (instead of essentially free), relational database systems were all about minimizing the usage of permanent storage. The relational model, by necessity, split a record into multiple parts -- normalized the data -- and stored them in different locations. To read a customer record, maybe you read a customer table, a customerType table, take a couple of entries out of an address table, etc. If you had a solution that required you to read an entire table to find what you want, that would be very costly, because you would have to scan so many tables.
But this is not the only way to do things. If you didn't need to have fixed-width columns, you can store the entire set of data in one place. Instead of doing a full-table scan on a bunch of tables, you only need to do it on a single table. And that's not as bad as you think it is, especially if you can partition your data.
40 years later, the laws of physics have changed. Hard drive random read/write speeds and linear read/write speeds have drastically diverged. You can basically do 350 head movements a second per disk. (A little more or less, but that's a good average number.) On the other hand, a single disk drive can read about 100 MB per second. What does that mean?
Do the math and think about it: 100 MB per second divided by 350 head movements per second comes to roughly 285 KB per seek. That means if you are reading less than about 300 KB per disk head move, you are throttling the throughput of your drive.
Seriously. Think about that for a second.
The goal of an index is to allow you to move your disk head to the precise location on disk you want and just read that record -- say just the address record joined as part of your customer record. And I say, that's useless.
If I were designing an index based on modern physics, it would only need to get me within 100KB or so of the target piece of data (assuming my data had been laid out in large chunks -- but we're talking theory here anyway). Based on the numbers above, any more precision than that is just a waste.
Now go back to your normalized table design. Say a customer record is really split across 6 rows held in 5 tables. That's 6 total disk head movements (I'll assume the index is cached in memory, so no disk movement for it). At 300 KB per seek, that means I could instead read 1.8 MB of linear, de-normalized customer records and be just as efficient.
And what about customer history? Suppose I wanted to not just see what the customer looks like today -- imagine I want the complete history, or a subset of the history? Multiply everything above by 10 or 20 and you get the picture.
What would be better than an index would be data partitioning -- making sure all of the customer records end up in one partition. That way with a single disk head move, I can read the entire customer history. One disk head move.
Tell me again why you want indexes.
Indexes vs ___ ?
Don't get me wrong -- there is value in "pre-cooking" your searches. But the laws of physics suggest a better way to do it than traditional indexes. Instead of storing the customer record in exactly one location, and creating a pointer to it -- an index -- why not store the record in multiple locations?
Remember, disk space is essentially free. Instead of trying to minimize the amount of storage we use -- an outdated artifact of the relational model -- just use your disk as your search cache.
If you think someone wants to see customers listed both by geography and by sales rep, then make multiple copies of your customer records, stored in ways that optimize those searches. Like I said, use the disk like your in-memory cache. Instead of building your in-memory cache by drawing together disparate pieces of persistent data, build your persistent data to mirror your in-memory cache, so all you have to do is read it. In fact, don't even bother trying to store it in memory -- just read it straight from disk every time you need it.
If you think that sounds crazy, consider this: if you cache it in memory, you're probably going to cache it twice. Your OS / drive controller likely uses main memory as a cache already. Don't bother caching the data, because someone else already is!
But I digress...
Long story short, Spark absolutely does support the right kind of indexing -- the ability to create complicated derived data from raw data to make future uses more efficient. It just doesn't do it the way you want it to.

Query Cost vs. Execution Speed + Parallelism

My department was recently reprimanded (nicely) by our IT department for running queries with very high costs, on the premise that our queries have a real possibility of destabilizing and/or crashing the database. None of us are DBAs; we're just researchers who write and execute queries against the database, and I'm probably the only one who had ever looked at an explain plan before the reprimand.
We were told that query costs over 100 should be very rare, and that queries with costs over 1000 should never be run. The problems I am running into are that cost seems to have no correlation with execution time, and that I'm losing productivity while trying to optimize my queries.
As an example, I have a query that executes in under 5 seconds with a cost of 10844. I rewrote the query to use a view that contains most of the information I need, and got the cost down to 109, but the new query, which retrieves the same results, takes 40 seconds to run. I found a question here with a possible explanation:
Measuring Query Performance : "Execution Plan Query Cost" vs "Time Taken"
That question led me to parallelism hints. I tried using /*+ no_parallel */ in the cost-10844 query, but the cost did not change, nor did the execution time, so I'm not sure that parallelism is the explanation for the faster execution time but higher cost. Then I tried using the /*+ parallel(n) */ hint and found that the higher the value of n, the lower the cost of the query. In the case of the cost-10844 query, I found that /*+ parallel(140) */ dropped the cost to 97, with a very minor increase in execution time.
This seemed like an ideal "cheat" to meet the requirements that our IT department set forth, but then I read this:
http://www.oracle.com/technetwork/articles/datawarehouse/twp-parallel-execution-fundamentals-133639.pdf
The article contains this sentence:
Parallel execution can enable a single operation to utilize all system resources.
So, my questions are:
Am I actually placing more strain on the server resources by using the /*+ parallel(n)*/ hint with a very high degree of parallelism, even though I am lowering the cost?
Assuming no parallelism, is execution speed a better measure of resources used than cost?
The rule your DBA gave you doesn't make a lot of sense, because worrying about the cost reported for a query is very seldom productive. First, you cannot directly compare the cost of two different queries: one query with a cost in the millions may run very quickly and consume very few system resources, while another query with a cost in the hundreds may run for hours and bring the server to its knees. Second, cost is an estimate. If the optimizer made an accurate estimate of the cost, that strongly implies it has come up with the optimal query plan, which would mean it is unlikely that you'd be able to modify the query to return the same results while using fewer resources. If the optimizer made an inaccurate estimate of the cost, that strongly implies it has come up with a poor query plan, in which case the reported cost would have no relationship to any useful metric. Most of the time, the queries you're trying to optimize are the ones where the optimizer generated an incorrect query plan because it incorrectly estimated the cost of the various steps.
Tricking the optimizer by using hints that may or may not actually change the query plan (depending on how parallelism is configured, for example) is very unlikely to solve a problem-- it's much more likely to cause the optimizer's estimates to be less accurate and make it more likely that it chooses a query plan that consumes far more resources than it needs to. A parallel hint with a high degree of parallelism, for example, would tell Oracle to drastically reduce the cost of a full table scan which makes it more likely that the optimizer would choose that over an index scan. That is seldom something that your DBAs would want to see.
If you're looking for a single metric that tells you whether a query plan is reasonable, I'd use the amount of logical I/O. Logical I/O correlates pretty well with actual query performance and with the amount of resources your query consumes. Looking at execution time can be problematic because it varies significantly based on what data happens to be cached (which is why queries often run much faster the second time they're executed), while logical I/O doesn't change based on what data is in the cache. It also lets you scale your expectations as the number of rows your queries need to process changes. If you're writing a query that needs to aggregate data from 1 million rows, for example, it should consume far more resources than a query that returns 100 rows from a table with no aggregation. If you're looking at logical I/O, you can easily scale your expectations to the size of the problem to figure out how efficient your queries could realistically be.
In Christian Antognini's "Troubleshooting Oracle Performance" (page 450), for example, he gives a rule of thumb that is pretty reasonable:
- 5 logical reads per row returned/aggregated is probably very good
- 10 logical reads per row returned/aggregated is probably adequate
- 20+ logical reads per row returned/aggregated is probably inefficient and needs to be tuned
Different systems with different data models may merit tweaking the buckets a bit, but those are likely to be good starting points; one way to measure the ratio is sketched below.
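One way to get that number on Oracle, sketched here with an invented comment tag to find your own statement (requires access to V$SQL):

    SELECT sql_id,
           buffer_gets,       -- logical reads
           rows_processed,
           ROUND(buffer_gets / NULLIF(rows_processed, 0), 1) AS gets_per_row
    FROM   v$sql
    WHERE  sql_text LIKE 'SELECT /* my_report */%';

Compare gets_per_row against the buckets above to judge whether a query is worth tuning.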
My guess is that, as researchers rather than developers, you're probably running queries that aggregate or fetch relatively large data sets, at least in comparison to those that application developers commonly write. If you're scanning a million rows of data to generate some aggregate results, your queries will naturally consume far more resources than an application developer's queries that read or write a handful of rows. Your queries may be just as efficient from a logical-I/O-per-row perspective; you may simply be looking at many more rows.
If you are running queries against the live production database, you may well be in a situation where it makes sense to start segregating workload. Most organizations reach a point where running reporting queries against the live database starts to create issues for the production system. One common solution to this sort of problem is to create a separate reporting database that is fed from the production system (either via a nightly snapshot or by an ongoing replication process) where reporting queries can run without disturbing the production application. Another common solution is to use something like Oracle Resource Manager to limit the amount of resources available to one group of users (in this case, report developers) in order to minimize the impact on higher priority users (in this case, users of the production system).

real-time data warehouse for web access logs

We're thinking about putting up a data warehouse system to load with web access logs that our web servers generate. The idea is to load the data in real-time.
To the user we want to present a line graph of the data and enable the user to drill down using the dimensions.
The question is how to balance and design the system so that:
(1) the data can be fetched and presented to the user in real time (<2 seconds),
(2) data can be aggregated on a per-hour and per-day basis, and
(3) a large amount of data can still be stored in the warehouse.
Our current data rate is roughly ~10 accesses per second, which gives us ~800k rows per day. My simple tests with MySQL and a simple star schema show that my queries start to take longer than 2 seconds once we have more than 8 million rows.
Is it possible to get real-time query performance from a "simple" data warehouse like this, and still have it store a lot of data (it would be nice to never have to throw any data away)?
Are there ways to aggregate the data into higher-level summary tables?
I have a feeling that this isn't really a new question (I've googled quite a lot, though). Could someone maybe give pointers to data warehouse solutions like this? One that comes to mind is Splunk.
Maybe I'm grasping for too much.
UPDATE
My schema looks like this:
dimensions:
client (ip-address)
server
url
facts:
timestamp (in seconds)
bytes transmitted
Seth's answer above is a very reasonable one, and I feel confident that if you invest in the appropriate knowledge and hardware, it has a high chance of success.
Mozilla does a lot of web service analytics. We keep track of details on an hourly basis, and we use a commercial DB product, Vertica. It would work very well for this approach, but since it is a proprietary commercial product, it has a different set of associated costs.
Another technology you might want to investigate is MongoDB. It is a document-store database with a few features that make it potentially a great fit for this use case: namely, capped collections (do a search for "mongodb capped collections" for more info), and the fast increment operation for things like keeping track of page views, hits, etc.
http://blog.mongodb.org/post/171353301/using-mongodb-for-real-time-analytics
Doesn't sound like it would be a problem. MySQL is very fast.
For storing logging data, use MyISAM tables -- they're much faster and well suited for web server logs. (I think InnoDB is the default for new installations these days; foreign keys and all the other features of InnoDB aren't necessary for log tables.) You might also consider using MERGE tables: you can keep individual tables at a manageable size while still being able to access them all as one big table, as sketched below.
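A minimal sketch (MySQL; the table and column names are invented): one MyISAM table per week, plus a MERGE table that exposes them all as a single logical table:

    CREATE TABLE access_log_w01 (
      ts     TIMESTAMP    NOT NULL,
      client VARCHAR(45)  NOT NULL,
      url    VARCHAR(255) NOT NULL,
      bytes  INT UNSIGNED NOT NULL
    ) ENGINE=MyISAM;

    -- Same column definitions; new weekly tables get added to the UNION list.
    CREATE TABLE access_log_all (
      ts     TIMESTAMP    NOT NULL,
      client VARCHAR(45)  NOT NULL,
      url    VARCHAR(255) NOT NULL,
      bytes  INT UNSIGNED NOT NULL
    ) ENGINE=MERGE UNION=(access_log_w01) INSERT_METHOD=LAST;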
If you're still not able to keep up, then get yourself more memory, faster disks, a RAID, or a faster system, in that order.
Also: Never throwing away data is probably a bad idea. If each line is about 200 bytes long, you're talking about a minimum of 50 GB per year, just for the raw logging data. Multiply by at least two if you have indexes. Multiply again by (at least) two for backups.
You can keep it all if you want, but in my opinion you should consider storing the raw data for a few weeks and the aggregated data for a few years. For anything older, just store the reports. (That is, unless you are required by law to keep it around. Even then, it probably won't be for more than 3-4 years.)
Also, look into partitioning, especially if your queries mostly access the latest data; you could, for example, set up weekly partitions of ~5.5M rows.
If aggregating per day and per hour, consider having date and time dimensions -- you did not list them, so I assume you do not use them. The idea is not to have any functions in a query, like HOUR(myTimestamp) or DATE(myTimestamp). The date dimension should be partitioned the same way as the fact tables.
With this in place, the query optimizer can use partition pruning, so the total size of the tables does not influence the query response the way it did before. A sketch:
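A minimal illustration (MySQL syntax; all names invented): the fact table carries an integer date key, is range-partitioned on it, and queries filter on the key directly rather than wrapping a timestamp in DATE() or HOUR():

    CREATE TABLE fact_access (
      date_key   INT      NOT NULL,  -- e.g. 20100115
      time_key   SMALLINT NOT NULL,  -- e.g. hour of day
      client_key INT      NOT NULL,
      server_key INT      NOT NULL,
      url_key    INT      NOT NULL,
      bytes      INT      NOT NULL
    )
    PARTITION BY RANGE (date_key) (
      PARTITION p201001 VALUES LESS THAN (20100201),
      PARTITION p201002 VALUES LESS THAN (20100301)
    );

    -- Only the partitions covering this range are scanned:
    SELECT SUM(bytes)
    FROM fact_access
    WHERE date_key BETWEEN 20100110 AND 20100117;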
This has become a fairly common data warehousing application. I've run one for years that supported 20-100 million rows a day with 0.1-second response time (from the database), and over a second from the web server. This isn't even on a huge server.
Your data volumes aren't too large, so I wouldn't think you'd need very expensive hardware. But I'd still go multi-core, 64-bit with a lot of memory.
But you will want to mostly hit aggregate data rather than detail data, especially for time-series graphing over days, months, etc. Aggregate data can be created periodically on your database through an asynchronous process, or, in cases like this, it typically works best to have the ETL process that transforms your data create the aggregates as well. Note that an aggregate is typically just a group-by of your fact table, as in the sketch below.
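For example, an hourly aggregate over the kind of detail fact table sketched earlier in this thread (the names are invented and should match your own schema):

    CREATE TABLE agg_access_hourly AS
    SELECT date_key, time_key, server_key, url_key,
           COUNT(*)   AS hits,
           SUM(bytes) AS total_bytes
    FROM   fact_access
    GROUP  BY date_key, time_key, server_key, url_key;

The UI then reads agg_access_hourly for graphs and only drills into the detail table when necessary.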
As others have said, partitioning is a good idea when accessing detail data, but it is less critical for the aggregate data. Also, relying on pre-created dimensional values is much better than relying on functions or stored procs. Both of these are typical data warehousing strategies.
Regarding the database: if it were me, I'd try PostgreSQL rather than MySQL. The reason is primarily optimizer maturity: PostgreSQL better handles the kinds of queries you're likely to run, while MySQL is more likely to get confused by five-way joins or go bottom-up when you run a subselect. And if this application is worth a lot, then I'd consider a commercial database like DB2, Oracle, or SQL Server. Then you'd get additional features like query parallelism, automatic query rewrite against aggregate tables, additional optimizer sophistication, etc.