I'm trying to understand the idea of "expensive". Here's an example based on my understanding. If I want to find the id of all users aged above 18:
select * from table where age > 18
select * is then expensive, as I only wanted id.
Is "expensive" a negative word, meaning it should always be avoided?
Yes, expensive and cheap are often used to measure whether one execution plan is better than another. I guess it is based on the fact that the engine calculates the cost of the possible execution plans and chooses the cheapest one.
For example, in PostgreSQL (but it is similar in other RDBMSs) we have:
The costs are in an arbitrary unit. A common misunderstanding is that
they are in milliseconds or some other unit of time, but that’s not
the case.
The cost units are anchored (by default) to a single sequential page
read costing 1.0 units (seq_page_cost). Each row processed adds 0.01
(cpu_tuple_cost), and each non-sequential page read adds 4.0
(random_page_cost).
So, based on your operators, the engine determines the cost of your query, and we can say that it is better to avoid expensive operations. Some SQL performance tuning may even involve moving some of the business logic into the application in order to avoid heavy (not fast enough) operations.
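To see those units in practice, you can put EXPLAIN in front of a query in psql; the plan comes back annotated with the estimated cost. The output below is only illustrative (the numbers and the users(id, age) table are hypothetical, matching the question's example):

EXPLAIN SELECT id FROM users WHERE age > 18;
--                        QUERY PLAN
-- ---------------------------------------------------------
--  Seq Scan on users  (cost=0.00..35.50 rows=1200 width=4)
--    Filter: (age > 18)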
select DATE(request_time) from logs.nobids_05 limit 1
gave me "3.48 GB processed", which is a bit much considering that request_time is a field that appears in each row.
There are many other cases where just touching a column automatically adds its total size to the cost. For example,
select * from logs.nobids_05 limit 1
gives me "This query will process 274 GB when run".
I am sure bigquery does not need to read 274GB for outputting 1 row of data.
2019 update: if you cluster your tables, the cost of a SELECT * LIMIT 1 will be minimal.
https://medium.com/google-cloud/bigquery-optimized-cluster-your-tables-65e2f684594b
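For reference, here is a hedged sketch of what clustering looks like in BigQuery DDL; the destination table name and the advertiser_id clustering column are hypothetical, so substitute a column you actually filter on:

-- Create a partitioned, clustered copy of the table (hypothetical names):
CREATE TABLE mydataset.nobids_05_clustered
PARTITION BY DATE(request_time)
CLUSTER BY advertiser_id
AS
SELECT * FROM logs.nobids_05;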
Running a "SELECT * FROM big_table LIMIT 1" with BigQuery would be the equivalent of doing this: https://www.youtube.com/watch?v=KZ-slvv_ZT4.
BigQuery is an analytical database. Its architecture and pricing are optimized for analysis at scale, not for single-row handling.
Every operation in BigQuery involves a full table scan, but only of the columns mentioned in the query. The goal is to have predictable costs: before running the query you are able to know how much data will be involved, and therefore its cost. It might seem like a big price to pay to query just one row, but the good news is the cost remains constant, even when the queries get far more complex and CPU intensive.
Once in a while you might need to run a single row query, and the costs might seem excessive, but the assumption here is that you are using this tool to analyze data at scale, and the overall costs of having data stored in it should be more than competitive with other tools available. Since you've been working with other tools, I'd love to see a total cost comparison of analytical sessions within real case scenarios.
By the way, BigQuery has a better way for doing the equivalent of "SELECT * LIMIT x". It's free, and it relies on the REST API instead of querying:
https://developers.google.com/bigquery/docs/reference/v2/tabledata/list
This being said, thanks for the feedback, as there is a balancing job between making pricing more complex and the tool better suited for other jobs - and this balance is built on the feedback we get.
I don't think this is a bug. "When you run a query, you're charged according to the total data processed in the columns you select, even if you set an explicit LIMIT on the results." (https://developers.google.com/bigquery/pricing#samplecosts)
Suppose the data distribution does not change and, for the same query, only the dataset is enlarged (say, doubled): will the time taken grow proportionally? And if the data distribution does not change, can the query plan still change, at least in theory?
Yes, the query plan may still change even if the data is completely static, though it probably won't.
The autovacuum daemon will ANALYZE your tables and generate new statistics. This usually happens only when they've changed, but it may happen for other reasons (wraparound-prevention vacuum, etc.).
The statistics include a random sampling to collect common values for a histogram. Being random, the outcome may be somewhat different each time.
To reduce the chances of plans shifting for a static dataset, you probably want to increase the statistics target on the table's columns and re-ANALYZE. Don't set it too high though, as the query planner has to read those histograms when it makes planning decisions, and bigger histograms mean slightly more planning time.
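A minimal sketch of raising the statistics target, assuming a hypothetical table my_table with a column my_column (500 is just an example value, not a recommendation):

ALTER TABLE my_table ALTER COLUMN my_column SET STATISTICS 500;
ANALYZE my_table;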
If your table is growing continuously but the distribution isn't changing then you want the planner to change plans at various points. A 1000-row table is almost certainly best accessed by doing a sequential scan; an index scan would be a waste of time and effort. You certainly don't want a million row table being scanned sequentially unless you're retrieving a majority of the rows, though. So the planner should - and does - adjust its decisions based not only on the data distribution, but the overall row counts.
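A rough way to watch this happen in PostgreSQL, using a hypothetical table t; the comments are hedged, since the planner's choices depend on your costing factors and statistics:

CREATE TABLE t (id serial PRIMARY KEY, col int);
CREATE INDEX t_col_idx ON t (col);

INSERT INTO t (col) SELECT (random() * 1000)::int FROM generate_series(1, 100);
ANALYZE t;
EXPLAIN SELECT * FROM t WHERE col = 42;   -- tiny table: usually a Seq Scan

INSERT INTO t (col) SELECT (random() * 1000)::int FROM generate_series(1, 1000000);
ANALYZE t;
EXPLAIN SELECT * FROM t WHERE col = 42;   -- large table, selective predicate: usually an index or bitmap scan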
Here is an example. You have records that all fit on one page, and an index. Consider the query:
select t.*
from table t
where col = x;
And assume you have an index on col. With everything on one page, the fastest way is to simply read the page and check the where clause against each record. There could be 200 records on the page, so the selectivity of the query might be less than 1%.
One of the key considerations that a SQL optimizer makes in choosing an algorithm is the number of expected page reads. So, if you have a query like the above, the engine might think "I have to read all pages in the table anyway, so let me just do a full table scan and ignore the index." Note that this will be true when the data is on a single page.
This generalizes to other operations as well. If all the records in your data fit on one data page, then "slow" algorithms are often the best or close enough to the best. So, nested loop joins might be better than using indexes, hash-based, or sort-merge based joins. Similarly, a sort-based aggregation might be better than other methods.
Alas, I am not as familiar with the Postgres query optimizer as I am with SQL Server and Oracle. I have definitely encountered changes in execution plans in those databases as data grew.
On PostgreSQL 9.0, setting both enable_indexscan and enable_seqscan to off improves query performance by 2x. Why?
This may help some queries run faster, but is almost certain to make other queries slower. It's interesting information for diagnostic purposes, but a bad idea for a long-term "solution".
PostgreSQL uses a cost-based optimizer, which looks at the costs of all possible plans based on statistics gathered by scanning your tables (normally by autovacuum) and costing factors. If it's not choosing the fastest plan, it is usually because your costing factors don't accurately model actual costs for your environment, statistics are not up-to-date, or statistics are not fine-grained enough.
After turning enable_indexscan and enable_seqscan back on:
I have generally found the cpu_tuple_cost default to be too low; I have often seen better plans chosen by setting that to 0.03 instead of the default 0.01; and I've never seen that override cause problems.
If the active portion of your database fits in RAM, try reducing both seq_page_cost and random_page_cost to 0.1.
Be sure to set effective_cache_size to the sum of shared_buffers and whatever your OS is showing as cached.
Never disable autovacuum. You might want to adjust parameters, but do that very carefully, with small incremental changes and subsequent monitoring.
You may need to occasionally run explicit VACUUM ANALYZE or ANALYZE commands, especially for temporary tables or tables which have just had a lot of modifications and are about to be used in queries.
You might want to increase default_statistics_target, from_collapse_limit, join_collapse_limit, or some geqo settings; but it's hard to tell whether those are appropriate without a lot more detail than you've given so far.
You can try out a query with different costing factors set on a single connection. When you confirm a configuration which works well for your whole mix (i.e., it accurately models costs in your environment), you should make the updates in your postgresql.conf file.
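For example, a session-level experiment along the lines suggested above might look like this (the values mirror the suggestions here, and my_table/my_column are hypothetical; nothing is written to postgresql.conf until you're satisfied):

-- Try alternative costing factors on this connection only:
SET cpu_tuple_cost = 0.03;
SET seq_page_cost = 0.1;          -- only if the active data fits in RAM
SET random_page_cost = 0.1;       -- only if the active data fits in RAM
SET effective_cache_size = '6GB'; -- example: shared_buffers + OS cache

-- Re-check the plan and timing for the problem query:
EXPLAIN ANALYZE SELECT * FROM my_table WHERE my_column = 42;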
If you want more targeted help, please show the structure of the tables, the query itself, and the results of running EXPLAIN ANALYZE for the query. A description of your OS and hardware helps a lot, too, along with your PostgreSQL configuration.
Why?
The most logical answer is because of the way your database tables are configured.
Without you posting your table schemas, I can only hazard a guess that your indexes don't have high cardinality.
That is to say, if your index doesn't narrow the results down enough to be useful, then it will be far less efficient, or indeed slower.
Cardinality is a measure of how unique a row in your index is. The lower the cardinality, the slower your query will be.
A perfect example is having a boolean field in your index; perhaps you have a Contacts table in your database and it has a boolean column that records true or false depending on whether the customer would like to be contacted by a third party.
I mean, if you ran select * from Contacts where OptIn = true, you can imagine that you'd return a lot of Contacts; say 50% of them in our case.
Now if you add this OptIn column to an index on that same table, it stands to reason that no matter how fine the other selectors are, that part of the index will always match 50% of the table, because of the value of OptIn.
This is a perfect example of low cardinality: it will be slow because any query involving that index will have to select 50% of the rows in the table, and then apply further WHERE filters to reduce the dataset again.
Long story short: if your indexes include bad fields or simply include every column in the table, then the SQL engine has to resort to testing row-by-agonizing-row.
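As a concrete sketch of that point (PostgreSQL-flavoured SQL; the table and columns are hypothetical and only illustrate the cardinality argument above):

-- Low-cardinality leading column: any lookup through this index still has
-- to wade through roughly half the table before other filters apply.
CREATE TABLE Contacts (id int PRIMARY KEY, name text, city text, OptIn boolean);
CREATE INDEX ix_contacts_optin_city ON Contacts (OptIn, city);

-- Leading with the more selective column narrows the search first:
CREATE INDEX ix_contacts_city_optin ON Contacts (city, OptIn);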
Anyway, the above is theoretical in your case, but it is a common, well-known reason why queries suddenly start taking much longer.
Please fill in the gaps regarding your data structure, index definitions and the actual query that is really slow!
I would like to populate some tables with a large amount of data in order to empirically test the performance of an SQL query in the worst case scenario (well, as close to it as possible).
I considered using random values. But this would require manual adjustment to get even close to the worst case. Unconstrained random values are no good for a worst case because they tend mostly to be unique -- in which case an index on a single column should perform about as well as a compound index. On the other hand, random values chosen from too small a set will result in a large fraction of the rows being returned, which is uninteresting because it reflects not so much search performance as listing performance.
I also considered just looking at EXPLAIN PLAN, but this is not empirical, and also the explanation varies, partly depending on the data that you already have, rather than the worst case.
Is there a tool that analyzes a given SQL query (and the db schema and ideally indexes), then generates a large data set (of a given size) that will cause the query to perform as close to worst-case as possible?
Any RDBMS is fine.
I would also be interested in alternative approaches for gaining this level of insight into worst-case behaviour.
Short answer: There is no worst case scenario because every case can be made much worse, usually just by adding more data with the same distribution.
Long answer:
I would recommend looking not for the worst-case scenario, but for an "overblown realistic scenario": start from production data, define what you consider a large amount of entities (for each table separately), multiply by a factor of two or three, and generate the data by hand from the production data you have.
For example, if your production data has 1000 car models from 150 car manufacturers and you decide you might need 10000 models from 300 manufacturers, you would first double the number of records in the referenced table (manufacturers), then generate a "copy" of the existing 1000 car models referencing those generated manufacturers, and then generate 4 more cars per existing one, each time copying the existing distribution of values based on case-by-case decisions. This means new unique values in some columns, and simply copied values in others.
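A minimal sketch of that copying step, assuming hypothetical manufacturers(id, name) and car_models(id, name, manufacturer_id) tables in PostgreSQL, with ids running 1..150 and 1..1000 respectively:

-- 1. Double the referenced table: clone every manufacturer under a new id.
INSERT INTO manufacturers (id, name)
SELECT id + 150, name || ' (copy)'
FROM manufacturers;

-- 2. Clone the car models, pointing the copies at the cloned manufacturers,
--    so the models-per-manufacturer distribution is preserved.
INSERT INTO car_models (id, name, manufacturer_id)
SELECT id + 1000, name, manufacturer_id + 150
FROM car_models;

-- Repeat with larger offsets until you reach the target size.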
Do not forget to regenerate statistics after you are done. Why exactly am I saying this? Because you want to test the best possible query plan given the query, data, and schema, and optimize that.
Rationale: queries are not algorithms. The query optimizer chooses a suitable query plan not only based on the query, but also on information about approximately how big the tables are, index coverage, operator selectivity, and so on. You are not really interested in learning how poorly chosen plans, or plans for an unrealistically populated database, execute. This could even induce you to add ill-chosen indexes, and ill-chosen indexes can make production performance worse. You want to learn and test what happens with the best plan for a realistic, albeit large, number of rows.
While you could test with 1,000,000 car models, odds are that such production content is science fiction for your specific database schema and queries. However, it would be even less useful to test with the number of car models equaling the number of car manufacturers in your database. While such a distribution might happen to be the worst possible one for your application, you will learn almost nothing from basing your metrics on it.
Let's say I have this query:
select * from table1 r where r.x = 5
Does the speed of this query depend on the number of rows that are present in table1?
There are many factors affecting the speed of a query, one of which can be the number of rows.
Others include:
index strategy (if you index column "x", you will see better performance than if it's not indexed)
server load
data caching - once you've executed a query, the data will be added to the data cache, so subsequent reruns will be much quicker because the data comes from memory rather than disk, until the point at which the data is evicted from the cache
execution plan caching - to a lesser extent. Once a query is executed for the first time, the execution plan SQL Server comes up with will be cached for a period of time, for future executions to reuse.
server hardware
the way you've written the query (often one of the biggest contributors to poor performance!), e.g. writing something using a cursor instead of a set-based operation
For databases with a large number of rows in tables, partitioning is usually something to consider (SQL Server 2005 onwards has built-in support in Enterprise Edition). This splits the data down into smaller units. Generally, smaller units = smaller tables = smaller indexes = better performance.
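As a small illustration of the indexing point in the list above, applied to the query in the question (a sketch; the index name is arbitrary):

-- Lets the engine seek to the matching rows instead of scanning the whole table:
CREATE INDEX IX_table1_x ON table1 (x);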
Yes, and it can be very significant.
If there are 100 million rows, SQL Server has to go through each of them and see whether it matches.
That takes a lot more time compared to there being 10 rows.
You probably want an index on the x column, in which case SQL Server might check the index rather than going through all the rows - which can be significantly faster, as it might not even need to check all the values in the index.
On the other hand, if there are 100 million rows matching x = 5, it will be slower than if only 10 rows match.
Almost always yes. The real question is: what is the rate at which the query slows down as the table size increases? And the answer is: by not much if r.x is indexed, and by a large amount if not.
It is not the rows per se (to a certain degree, of course) but the amount of data (columns) that can make a query slow. The data also needs to be transferred from the backend to the frontend.
The answer is yes, but the number of rows is not the only factor.
If you have done appropriate optimization and tuning, the performance drop will be negligible.
Main performance factors:
Indexing (clustered or non-clustered)
Data Caching
Table Partitioning
Execution Plan caching
Data Distribution
Hardware specs
There are some other factors, but these are the main ones considered.
Even how you designed your schema has an effect on performance.
You should assume that your query always depends on the number of rows. In fact, you should assume the worst case (linear, or O(N), for the example you provided) and exponential for more complex queries. There are database-specific manuals filled with tricks to help you avoid the worst case, but SQL itself is a language and doesn't specify how to execute your query. Instead, the database implementation decides how to execute any given query: if you have indexed a column or set of columns, then you will get O(log N) performance for a simple lookup; if the system has effective query caching, you might get an O(1) response. Here is a good introductory article: High Scalability: SQL and computational complexity