The query is basically:
SELECT DISTINCT "my_table"."foo" FROM "my_table" WHERE...
Pretend that I'm 100% certain the DISTINCT portion of the query is the reason it runs slowly. I've omitted the rest of the query to avoid confusion, since it's the DISTINCT portion's slowness that I'm primarily concerned with (DISTINCT is always a source of slowness).
The table in question has 2.5 million rows of data. The DISTINCT is needed for purposes not listed here (I don't want a modified query back, but rather general information about making DISTINCT queries run faster at the DBMS level, if possible).
How can I make DISTINCT run quicker (in Postgres 9, specifically) without altering the SQL? That is, I can't change the SQL coming in, but I do have access to optimize things at the DB level.
Oftentimes, you can make such queries run faster by working around the DISTINCT with a GROUP BY instead:
select my_table.foo
from my_table
where [whatever where conditions you want]
group by foo;
Your DISTINCT is causing it to sort the output rows in order to find duplicates. If you put an index on the column(s) selected by the query, the database may be able to read them out in index order and save the sort step. A lot will depend on the details of the query and the tables involved-- your saying you "know the problem is with the DISTINCT" really limits the scope of available answers.
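For example, a minimal sketch of such an index (the index name is assumed):

CREATE INDEX my_table_foo_idx ON my_table (foo);

With this in place, Postgres may be able to read the foo values in index order and skip the sort (and, on 9.2 or later, possibly use an index-only scan), though whether the planner actually chooses that path depends on the rest of the query.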
You can try increasing the work_mem setting. Depending on the size of your dataset, it can cause the planner to switch to hash aggregates, which are usually faster.
But before setting it too high globally, first read up on it. You can easily blow up your server, because the max_connections setting acts as a multiplier to this number.
This means that if you were to set work_mem = 128MB and you set max_connections = 100 (the default), you should have more than 12.8GB of RAM. You're essentially telling the server that it can use that much for performing queries (not even considering any other memory use by Postgres or otherwise).
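As a hedged sketch, here is how the setting could be raised for a single session rather than globally (the value is only an example):

SET work_mem = '128MB';
SELECT DISTINCT "my_table"."foo" FROM "my_table" WHERE ...;
RESET work_mem;

Since the incoming SQL can't be changed, something like ALTER ROLE app_user SET work_mem = '64MB' (the role name is hypothetical) can apply the setting to that user's connections without touching the query itself.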
I would like to know which statement (see below) will be more efficient for determining the size of a cluster table, or at least for determining whether the table size reaches a certain threshold {n}.
Efficiency here means using less PSAPTEMP tablespace.
The problem with cluster tables is that, in order to read one entry, its fields have to be looked up in the several tables of the cluster across which they are dispersed. So more than just the counted table needs to be read, and for every entry several entries have to be looked up. This makes cluster tables inefficient for reads, and it can make the program dump because the COUNT uses an INT datatype that can overflow.
SELECT COUNT(*)
...
UP TO {n} ROWS.

versus

SELECT *
...
UP TO {n} ROWS.
ENDSELECT.

and then determining the size of the result.
To me they seem equivalent, but maybe they are not when using a threshold. Maybe the limitation makes a difference depending on how the data is read. EDIT: Of course, SELECT ... ENDSELECT is a loop and thus in principle less efficient.
But I would like to know how it actually works under the hood and understand the difference better. So far it seems like I will have to try it out.
I assume the database will differ but will most often be Oracle.
We could not really create the test environment we needed, so there is no final answer. But some learnings:
Reading the data from cluster tables should be done based on the full primary key (access via the primary key gives very fast retrieval; anything else is very slow)
There are no secondary indexes
SELECT * is OK because all columns are retrieved anyway. Performing an operation on multiple rows is more efficient than single-row operations, so you still want to select into an internal table.
If many rows are being selected into the internal table, you might still like to retrieve specific columns to cut down on the memory required.
There is a way to convert a cluster table to a transparent table, but it requires downtime, which was no option for us
Aggregate SQL functions (SUM, AVG, MIN, MAX, etc.) are not supported
Basically, SELECT ... ENDSELECT runs a loop, making multiple trips to the DB server.
SELECT COUNT(*), on the other hand, processes all the data on the DB server itself, in one shot.
After which you can simply put the data in an internal table and work on the same.
As per the standards, this is not at all recommended even for normal transparent tables, let alone cluster tables.
Access to cluster tables is very expensive. To make matters worse, you cannot use any indexes on cluster tables. It's always better to provide as much data in the WHERE clause as possible.
The priority is always given to fetch the data in one shot from the Database Server using
SELECT * FROM dbtab INTO TABLE itab WHERE ...
and then loop on it on the local server.
Specifically, in your use case it will be fastest to use COUNT(*) rather than SELECT ... ENDSELECT.
Using native SQL with COUNT BIG instead of COUNT doesn't make it memory efficient, but it does prevent the dump caused by counter overflow.
Why is the performance of select * from table not as good as select col_1,col_2 from table? As far as I understand, it is locating the row that takes time, not the number of columns returned.
Selecting unnecessary columns can cause query plan changes that have a massive impact on performance. For example, if there is an index on col_1, col2 but there are other columns in the table, the select * query has to do a full table scan while the select col_1, col_2 query can simply scan the index which is likely to be much smaller and, thus, much less expensive to query. If you start dealing with queries that involve more than one table or that involve queries against views, selecting subsets of columns can also sometimes change query plans by allowing Oracle to eliminate unnecessary joins or function evaluations. Now, to be fair, it's not particularly common that the query plan will change based on what columns are selected, but when it does, the change is often significant.
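As a hedged sketch of that index-only case (table and index names assumed):

CREATE INDEX t_c1_c2_idx ON t (col_1, col_2);

-- can be satisfied by scanning the index alone:
SELECT col_1, col_2 FROM t;

-- must visit the table for the remaining columns:
SELECT * FROM t;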
If you are issuing the SQL statement from an application outside the database, selecting additional columns forces Oracle to send additional data over the network so your application will spend more time waiting on network I/O to send over data that it is not interested in. That can be very inefficient particularly if your application ever gets deployed on a WAN.
Selecting unnecessary columns can also force Oracle to do additional I/O without changing the plan. If one of the columns in the table that you don't need is a LOB, for example, Oracle would have to do additional work to fetch that LOB data. If the data is stored in a chained block on disk but the columns you are interested in happen to be in the first row piece, Oracle doesn't have to fetch the additional row pieces for the query that specifies a subset of columns. The query that does a select *, on the other hand, has to fetch every row piece.
Of course, that is before considering the maintenance aspects. If you are writing an application outside of PL/SQL, doing a SELECT * means that either your code will break when someone adds a new column to the table in the future or that your application has to dynamically determine at runtime the set of columns that are being returned in order to accommodate the new column automatically. While that is certainly possible, it is likely to lead to code that is more complex and thus more difficult to debug and maintain. If you are writing PL/SQL and fetching the data into a %ROWTYPE variable, it can be perfectly reasonable to do a SELECT * in production code; in other languages, you're generally setting yourself up for a maintenance nightmare if you do a SELECT *.
There is the issue of having to look up the table's definition in the data dictionary when you do a SELECT *.
There is also the issue of the database doing a little more work than is necessary when the only columns you require are col_1 and col_2. This is particularly an issue with large tables.
And there is the issue of network bandwidth being unnecessarily swallowed by a larger than required dataset.
It's not best practice to do a SELECT *.
It also makes embedded SQL code harder to read.
I recently answered this question based on my experience:
Counting rows before proceeding to actual searching
but I'm not 100% satisfied with the answer I gave.
The question is basically this: Can I get a performance improvement by running a COUNT over a particular query before deciding to run the query that brings back the actual rows?
My intuition is this: you will only save the I/O and wire time associated with retrieving the data rather than the count, because to count the data you still need to actually find the rows. The possible exception is when the query is a simple function of the indexes.
My question then is this: Is this always true? What other exception cases are there? From a pure performance perspective, in what cases would one want to do a COUNT before running the full query?
First, the answer to your question is highly dependent on the database.
I cannot think of a situation when doing a COUNT() before a query will shorten the overall time for both the query and the count().
In general, doing a count will pre-load tables and indexes into the page cache. Assuming the data fits in memory, this will make the subsequent query run faster (although not much faster if you have fast I/O and the database does read-ahead page reading). However, you have just shifted the time frame to the COUNT(), rather than reducing overall time.
To shorten the overall time (including the run time of the COUNT()) would require changing the execution plan. Here are two ways this could theoretically happen:
A database could update statistics as a table is read in, and these statistics, in turn, change the query plan for the main query.
A database could change the execution plan based on whether tables/indexes are already in the page cache.
Although theoretically possible, I am not aware of any database that does either of these.
You could imagine that intermediate results could be stored, but this would violate the dynamic nature of SQL databases. That is, updates/inserts could occur on the tables between the COUNT() and the query. A database engine could not maintain integrity and maintain such intermediate results.
Doing a COUNT() has disadvantages, relative to speeding up the subsequent query. The query plan for the COUNT() might be quite different from the query plan for the main query. Your example with indexes is one case. Another case would be in a columnar database, where different vertical partitions of the data do not need to be read.
Yet another case would be a query such as:
select t.*, r.val
from table t
left outer join ref r
  on t.refID = r.refID
where refID is a unique index on the ref table. This join can be eliminated for the count, since there are no duplicates and all records in t are preserved by the outer join. However, the join is clearly needed to return the rows of this query. Once again, whether a SQL optimizer recognizes and acts on this situation is entirely up to the writers of the database, but the join could theoretically be optimized away for the COUNT().
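In other words, an optimizer performing that elimination could answer the count with a single-table scan, roughly:

-- hedged sketch: with refID unique in ref, the left outer join cannot
-- change the row count, so this is equivalent for counting:
select count(*) from table t;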
This is a never-ending topic for me and I'm wondering if I might be overlooking something. Essentially I use two types of SQL statements in an application:
Regular queries with a "fallback" limit
Sorted and paged queries
Now, we're talking about queries against tables with several million records, joined to five more tables with several million records. Clearly, we don't want to fetch all of them; that's why we have the above two methods to limit user queries.
Case 1 is really simple. We just add an additional ROWNUM filter:
WHERE ...
AND ROWNUM < ?
That's quite fast, as Oracle's CBO will take this filter into consideration for its execution plan and will probably apply a FIRST_ROWS operation (similar to the one enforced by the /*+FIRST_ROWS*/ hint).
Case 2, however, is a bit more tricky with Oracle, as there is no LIMIT ... OFFSET clause as in other RDBMS. So we nest our "business" query in a technical wrapper like this:
SELECT outer.* FROM (
SELECT * FROM (
SELECT inner.*, ROWNUM as RNUM, MAX(ROWNUM) OVER(PARTITION BY 1) as TOTAL_ROWS
FROM (
[... USER SORTED business query ...]
) inner
)
WHERE ROWNUM < ?
) outer
WHERE outer.RNUM > ?
Note that the TOTAL_ROWS field is calculated so we know how many pages there will be without fetching all the data. This paging query is usually quite satisfying, but every now and then (as I said, when querying 5M+ records, possibly with non-indexed searches), it runs for 2-3 minutes.
EDIT: Please note that the potential bottleneck is not easy to circumvent, because of the sorting that has to be applied before paging!
I'm wondering, is that state-of-the-art simulation of LIMIT ... OFFSET, including TOTAL_ROWS in Oracle, or is there a better solution that will be faster by design, e.g. by using the ROW_NUMBER() window function instead of the ROWNUM pseudo-column?
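For reference, here is a minimal sketch of the ROW_NUMBER() variant I have in mind (it assumes the sort key sort_col is exposed by the business query):

SELECT * FROM (
  SELECT t.*,
         ROW_NUMBER() OVER (ORDER BY t.sort_col) AS RNUM,
         COUNT(*) OVER () AS TOTAL_ROWS
  FROM (
    [... USER SORTED business query ...]
  ) t
)
WHERE RNUM > ? AND RNUM <= ?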
The main problem with Case 2 is that in many cases the whole query result set has to be obtained and then sorted before the first N rows can be returned - unless the ORDER BY columns are indexed and Oracle can use the index to avoid a sort. For a complex query and a large set of data this can take some time. However there may be some things you can do to improve the speed:
Try to ensure that no functions are called in the inner SQL - these may get called 5 million times just to return the first 20 rows. If you can move these function calls to the outer query, they will be called fewer times.
Use a FIRST_ROWS_n hint to nudge Oracle into optimising for the fact that you will never return all the data.
EDIT:
Another thought: you are currently presenting the user with a report that could return thousands or millions of rows, but the user is never realistically going to page through them all. Can you not force them to select a smaller amount of data e.g. by limiting the date range selected to 3 months (or whatever)?
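For example, a hedged sketch with a hypothetical date column added to the business query's filter:

WHERE ...
AND order_date >= ADD_MONTHS(TRUNC(SYSDATE), -3)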
You might want to trace the query that takes a long time and look at its explain plan. Most likely the performance bottleneck is the TOTAL_ROWS calculation: Oracle has to read all the data even if you only fetch one row. This is a common problem that all RDBMSs face with this type of query, and no implementation of TOTAL_ROWS will get around it.
The radical way to speed up this type of query is to forgo the TOTAL_ROWS calculation and just display that there are additional pages. Do your users really need to know that they can page through 52486 pages? An estimation may be sufficient. That's another solution, implemented by Google search for example: estimate the number of pages instead of actually counting them.
Designing an accurate and efficient estimation algorithm might not be trivial.
A "LIMIT ... OFFSET" is pretty much syntactic sugar. It might make the query look prettier, but if you still need to read the whole of a data set and sort it and get rows "50-60", then that's the work that has to be done.
If you have an index in the right order, then that can help.
It may perform better to run two queries instead of trying to count() and return the results in the same query. Oracle may be able to answer the count() without sorting or joining all the tables (join elimination based on declared foreign key constraints). This is what we generally do in our application: for performance-critical statements, we write a separate query that we know will return the correct count, since we can sometimes do better than Oracle.
Alternatively, you can make a tradeoff between performance and recency of the data. Bringing back the first 5 pages is going to be nearly as quick as bringing back the first page. So you could consider storing the results from 5 pages in a temporary table along with an expiry date for the information. Take the result from the temporary table if valid. Put a background task in to delete the expired data periodically.
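A hedged sketch of such a cache (all names are assumed):

CREATE TABLE page_cache (
  query_key VARCHAR2(64) NOT NULL,
  rnum      NUMBER       NOT NULL,
  payload   CLOB,
  expires   DATE         NOT NULL
);

-- serve a page from the cache while it is still valid:
SELECT payload
FROM page_cache
WHERE query_key = ? AND rnum BETWEEN ? AND ? AND expires > SYSDATE;

-- periodic background cleanup:
DELETE FROM page_cache WHERE expires <= SYSDATE;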
I've heard several times that you shouldn't perform COUNT(*) or SELECT * for performance reasons, but I wasn't able to dig up any further information about it.
I can imagine that the database then uses all columns for the operation, which could be a considerable performance loss, but I'm not sure about that. Does somebody have further information about the topic?
1. On count(*) vs. count(something else)
SQL is declarative in that you specify what you want. This is different from specifying how to get what you want. That means the database engine is free to realize your query in whatever way it thinks is the most efficient. Many database optimizers rewrite your query to a less costly alternative (if such a plan is available).
Given the following table:
table(
   pk       not null
  ,color    not null
  ,nullable null
  ,unique(pk)
  ,index(color)
);
...all of the following are functionally equivalent (due to the mechanics of count and nulls):
1) select count(*) from table;
2) select count(1) from table;
3) select count(pk) from table;
4) select count(color) from table;
Regardless of which form you use, the optimizer is free to rewrite the query to another form if it is more efficient. (Again, not all optimizers are sophisticated enough to do this). The unique index(pk) would be smaller (bytes occupied) than the entire table. Therefore it would be more efficient to count the number of index entries rather than scanning through the entire table. In Oracle we have bitmap indexes, which also compress repeating strings. If we had used such an index on the color column, it would probably have been the smallest index to scan. Oracle also supports table compression which in some cases makes the physical table smaller than a composite index.
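As a hedged illustration, calling the sketched table t here:

CREATE BITMAP INDEX t_color_bix ON t (color);

-- because color is declared NOT NULL, the optimizer may answer
-- select count(*) from t by scanning this small index instead of the table.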
1. TL;DR:
Your specific dbms will have its own set of tools that enables different rewriting rules and in turn execution plans. That renders the question somewhat useless (unless we talk about a specific release of a specific dbms). I recommend COUNT(*) in all cases because it requires the least cognitive effort to grasp.
2. On select a,b,c vs. select *
There are very few valid uses of SELECT * in code you write and put into production. Imagine a table which contains Blu-ray movies (yes, the movies are stored as BLOBs in this table). So you slapped together your awesomesauce abstraction layer and put SELECT * FROM movies WHERE id = ? in the getMovies(movie_id) method. I will refrain from explaining why SELECT name FROM movies WHERE id = ? will be transported across the network just a tad faster. Of course, in most realistic cases it won't have a noticeable impact.
One last point on performance: when all the columns referenced in your query (selected and filtered) exist in an index (called a covering index), the database need not touch the table at all; the query can be fully resolved by scanning the index alone. By selecting all columns, you take this option away from the optimizer.
Another thing about SELECT *, far more serious than performance, is that it creates an implicit dependency on a specific physical layout of the table. Let me explain. Consider the following tables:
table T1(name, id)
table T2(name, id)
The following statement...
insert into t1 select * from t2;
... will break or produce a different result if any of the following happens:
Any of the tables' columns are rearranged, for example T1(id, name)
T1 gets an additional not-null column
T2 gets another column
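A hedged fix: naming the columns on both sides keeps the statement stable under all three changes:

insert into t1 (name, id) select name, id from t2;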
2. TL;DR: When possible, explicitly specify the columns you want (eventually, you'll have to do that anyway). Also, selecting fewer columns is faster than selecting more columns. A positive side-effect of explicit selects is that they give greater freedom to the optimizer.
COUNT(*) is different from COUNT(column1)!
COUNT(*) returns the number of records and does NOT use more resources, while COUNT(column1) counts the number of records where column1 is non-null.
For SELECT, it is different. SELECT * will of course request more data.
When using count(*) the * doesn't mean "all fields". Using count(field) will count all non-null values in the field, but count(*) will always count all records even if all fields in all records are null, so it doesn't need to check the data in the fields at all.
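A hedged sketch on a hypothetical table people with a nullable nickname column:

SELECT COUNT(*) FROM people;        -- counts every row
SELECT COUNT(nickname) FROM people; -- counts only rows where nickname IS NOT NULL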
Using select * means that you almost always return more data than you are going to use, which of course is wasteful. However, perhaps more serious is the maintenance problem: if you add fields to a table, your query will return these too. That might mean that the record becomes too large to fit in the buffer, resulting in an error message.
Don't confuse the * in "COUNT(*)" with the * in "SELECT * ". They are completely unrelated but sometimes confused because it's such an odd syntax. There is nothing wrong with using COUNT(*), which just means "count rows".
SELECT * on the other hand means "select all columns". That's generally poor practice because it tightly couples your code to the database schema. That means when you change the table you probably have to change the code even if it should have been unaffected. It increases the impact of any schema change.
SELECT * may also cause a sub-optimal query plan. Either because you didn't really need all columns or because it forces the DBMS to do an extra lookup at runtime to get the list of columns.
It's absolutely true that "*" means "all columns", and you're right that if you have a table with an incredibly large number of columns (say 100+), this kind of query can be bad in terms of efficiency.
I believe the best solution is to create database views that filter the set of records involved in the count operation in advance, so the performance impact isn't a big problem, because views can be cached.
On the other hand, the "*" operator should be avoided when returning records; it's far better to select only the fields you really need for the business logic at hand.
Using SELECT * can incur a performance hit. Applications that use the SELECT * syntax when they actually need only a handful of columns transfer more data across the network than they need to consume, which is wasteful.
Also, in Microsoft SQL Server at least, there's a strange problem when you use SELECT * in a view and then add a column to the underlying table. The column headings and data returned by the view don't match each other following certain changes! See my blog post for further details of this particular problem.
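A hedged T-SQL sketch of that stale-view behavior (object names assumed):

CREATE VIEW v_items AS SELECT * FROM items;
-- later, the underlying table changes:
ALTER TABLE items ADD discount DECIMAL(5,2);
-- v_items may still expose the old column list until its metadata
-- is refreshed:
EXEC sp_refreshview 'v_items';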
How inefficient it becomes depends on the size of the database; the simplest way to describe it is like so:
when you specifically do:
SELECT column1,column2,column3 FROM table1
MySQL knows exactly what columns it is looking for, but when you do
SELECT * FROM table1
MySQL does not know which columns you want; it knows you want all of them, but not their names, so it has to perform extra work to analyse the table and discover the columns, thus using extra resources.
In the case of COUNT(*), it depends on the database and its version. For example, in modern versions of MS SQL it doesn't matter [source needed].
So the best approach in the case of COUNT(*) is to measure it.
Using SELECT * is a really bad idea. * means reading all columns, which can be a heavy I/O and network operation (especially for various types of CHAR columns). Moreover, you rather rarely need all the columns.