This query ran in reasonable time when the table was small. I'm trying to identify the bottleneck, but I'm not sure how to analyze the EXPLAIN results.
SELECT
COUNT(*)
FROM performance_analyses
INNER JOIN total_sales ON total_sales.id = performance_analyses.total_sales_id
WHERE
(size > 0) AND
total_sales.customer_id IN (
SELECT customers.id FROM customers WHERE customers.active = 't'
AND customers.visible = 't' AND customers.organization_id = 3
) AND
total_sales.product_category_id IN (
SELECT product_categories.id FROM product_categories
WHERE product_categories.organization_id = 3
) AND
total_sales.period_id = 193;
I've tried both approaches: INNER JOINing the customers and product_categories tables, and using inner subselects. Both took the same time.
Here's the link to EXPLAIN: https://explain.depesz.com/s/9lhr
Postgres version:
PostgreSQL 9.4.5 on x86_64-unknown-linux-gnu, compiled by gcc (GCC) 4.8.2 20140120 (Red Hat 4.8.2-16), 64-bit
Tables and indexes:
CREATE TABLE total_sales (
id serial NOT NULL,
value double precision,
start_date date,
end_date date,
product_category_customer_id integer,
created_at timestamp without time zone,
updated_at timestamp without time zone,
processed boolean,
customer_id integer,
product_category_id integer,
period_id integer,
CONSTRAINT total_sales_pkey PRIMARY KEY (id)
);
CREATE INDEX index_total_sales_on_customer_id ON total_sales (customer_id);
CREATE INDEX index_total_sales_on_period_id ON total_sales (period_id);
CREATE INDEX index_total_sales_on_product_category_customer_id ON total_sales (product_category_customer_id);
CREATE INDEX index_total_sales_on_product_category_id ON total_sales (product_category_id);
CREATE INDEX total_sales_product_category_period ON total_sales (product_category_id, period_id);
CREATE INDEX ts_pid_pcid_cid ON total_sales (period_id, product_category_id, customer_id);
CREATE TABLE performance_analyses (
id serial NOT NULL,
total_sales_id integer,
status_id integer,
created_at timestamp without time zone,
updated_at timestamp without time zone,
size double precision,
period_size integer,
nominal_variation double precision,
percentual_variation double precision,
relative_performance double precision,
time_ago_max integer,
deseasonalized_series text,
significance character varying,
relevance character varying,
original_variation double precision,
last_level double precision,
quantiles text,
range text,
analysis_method character varying,
CONSTRAINT performance_analyses_pkey PRIMARY KEY (id)
);
CREATE INDEX index_performance_analyses_on_status_id ON performance_analyses (status_id);
CREATE INDEX index_performance_analyses_on_total_sales_id ON performance_analyses (total_sales_id);
CREATE TABLE product_categories (
id serial NOT NULL,
name character varying,
organization_id integer,
created_at timestamp without time zone,
updated_at timestamp without time zone,
external_id character varying,
CONSTRAINT product_categories_pkey PRIMARY KEY (id)
);
CREATE INDEX index_product_categories_on_organization_id ON product_categories (organization_id);
CREATE TABLE customers (
id serial NOT NULL,
name character varying,
external_id character varying,
region_id integer,
organization_id integer,
created_at timestamp without time zone,
updated_at timestamp without time zone,
active boolean DEFAULT false,
visible boolean DEFAULT false,
segment_id integer,
"group" boolean,
group_id integer,
ticket_enabled boolean DEFAULT true,
CONSTRAINT customers_pkey PRIMARY KEY (id)
);
CREATE INDEX index_customers_on_organization_id ON customers (organization_id);
CREATE INDEX index_customers_on_region_id ON customers (region_id);
CREATE INDEX index_customers_on_segment_id ON customers (segment_id);
Row counts:
customers - 6,970 rows
product_categories - 34 rows
performance_analyses - 1,012,346 rows
total_sales - 7,104,441 rows
Your query, rewritten and 100% equivalent:
SELECT count(*)
FROM product_categories pc
JOIN customers c USING (organization_id)
JOIN total_sales ts ON ts.customer_id = c.id
JOIN performance_analyses pa ON pa.total_sales_id = ts.id
WHERE pc.organization_id = 3
AND c.active -- boolean can be used directly
AND c.visible
AND ts.product_category_id = pc.id
AND ts.period_id = 193
AND pa.size > 0;
Another answer advises moving all conditions into join clauses and ordering the tables in the FROM list. This may apply to a certain other RDBMS with a comparatively primitive query planner. But while it doesn't hurt in Postgres either, it also has no effect on performance for your query - assuming default server configuration. The manual:
Explicit inner join syntax (INNER JOIN, CROSS JOIN, or unadorned JOIN)
is semantically the same as listing the input relations in FROM, so it
does not constrain the join order.
Emphasis mine. There is more; read the manual.
The key setting is join_collapse_limit (with default 8). The Postgres query planner will rearrange your 4 tables any way it expects it to be fastest, no matter how you arranged your tables and whether you write conditions as WHERE or JOIN clauses. No difference whatsoever. (The same is not true for some other types of joins that cannot be rearranged freely.)
The important point is that these different join possibilities give
semantically equivalent results but might have hugely different
execution costs. Therefore, the planner will explore all of them to
try to find the most efficient query plan.
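You can inspect, and for experimentation override, this setting per session (standard Postgres commands, shown here as a diagnostic sketch):
SHOW join_collapse_limit;    -- 8 by default
SET join_collapse_limit = 1;   -- planner keeps the join order exactly as written
RESET join_collapse_limit;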
Related:
Sample Query to show Cardinality estimation error in PostgreSQL
A: Slow fulltext search due to wildly inaccurate row estimates
Finally, WHERE id IN (<subquery>) is not generally equivalent to a join. It does not multiply rows on the left side for duplicate matching values on the right side, and columns of the subquery are not visible to the rest of the query. A join can multiply rows with duplicate values, and its columns are visible.
Your simple subqueries dig up a single unique column in both cases, so there is no effective difference in this case - except that IN (<subquery>) is generally (at least a bit) slower and more verbose. Use joins.
Your query
Indexes
product_categories has 34 rows. Unless you plan on adding many more, indexes do not help performance for this table. A sequential scan will always be faster. Drop index_product_categories_on_organization_id.
customers has 6,970 rows. Indexes start to make sense at that size. But your query selects 4,988 of them according to the EXPLAIN output, so only an index-only scan on an index much less wide than the table could help a bit. Assuming WHERE active AND visible are constant predicates, I suggest a partial multicolumn index:
CREATE INDEX customers_org_active_visible_idx ON customers (organization_id, id)
WHERE active AND visible;
I appended id to allow index-only scans. The column is otherwise useless in the index for this query.
total_sales has 7,104,441 rows. Indexes are very important. I suggest:
CREATE INDEX ts_period_category_customer_id_idx
ON total_sales (period_id, product_category_id, customer_id, id);
Again, aiming for an index-only scan. This is the most important one.
You can delete the completely redundant index index_total_sales_on_product_category_id.
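Spelled out, the two drops suggested above:
DROP INDEX index_product_categories_on_organization_id;
DROP INDEX index_total_sales_on_product_category_id;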
performance_analyses has 1,012,346 rows. Indexes are very important.
I would suggest another partial index with the condition size > 0:
CREATE INDEX pa_total_sales_id_size_idx
ON performance_analyses (total_sales_id)
WHERE size > 0;
However:
Rows Removed by Filter: 0"
Seems like this conditions serves no purpose? Are there any rows with size > 0 is not true?
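A quick sanity check (my query, not from the original post):
SELECT count(*)
FROM performance_analyses
WHERE size IS NULL OR size <= 0;
If this returns 0, the filter (and the partial index condition above) excludes nothing.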
After creating these indexes you need to ANALYZE the tables.
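For the tables touched here:
ANALYZE customers;
ANALYZE total_sales;
ANALYZE performance_analyses;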
Table statistics
Generally, I see many bad estimates. Postgres underestimates the number of rows returned at almost every step. The nested loops we see would work much better for fewer rows. Unless this is an unlikely coincidence, your table statistics are badly outdated. You need to visit your autovacuum settings, and probably also the per-table settings for your two big tables, performance_analyses and total_sales.
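Per-table autovacuum settings can be tightened like this (the values are illustrative, not a recommendation for your workload):
ALTER TABLE total_sales SET (
  autovacuum_vacuum_scale_factor = 0.02,
  autovacuum_analyze_scale_factor = 0.01
);
ALTER TABLE performance_analyses SET (
  autovacuum_vacuum_scale_factor = 0.02,
  autovacuum_analyze_scale_factor = 0.01
);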
According to your comment, you already ran VACUUM and ANALYZE, which made the query slower. That doesn't make a lot of sense. I would run VACUUM FULL on these two tables once (if you can afford an exclusive lock). Otherwise try pg_repack.
With all the fishy statistics and bad plans, I would consider running a complete vacuumdb -fz yourdb on your DB. That rewrites all tables and indexes in pristine condition, but it's not good to use on a regular basis. It's also expensive and will lock your DB for an extended period of time!
While you are at it, have a look at the cost settings of your DB as well.
Related:
Keep PostgreSQL from sometimes choosing a bad query plan
Postgres Slow Queries - Autovacuum frequency
Although in theory the optimizer should be able to work this out itself, I often find that these changes can massively improve performance:
use proper joins (instead of where id in (select ...))
order the tables in the FROM clause so that the fewest rows are returned at each join; in particular, the first table's condition (in the WHERE clause) should be the most restrictive (and should use indexes)
move all conditions on joined tables into the on condition of joins
Try this (aliases added for readability):
select count(*)
from total_sales ts
join product_categories pc on ts.product_category_id = pc.id and pc.organization_id = 3
join customers c on ts.customer_id = c.id and c.organization_id = 3
join performance_analyses pa on ts.id = pa.total_sales_id and pa.size > 0
where ts.period_id = 193
You will need to create this index for optimal performance (to allow an index-only scan on total_sales):
create index ts_pid_pcid_cid on total_sales (period_id, product_category_id, customer_id);
This approach first narrows the data to a period, so it will scale (remain roughly constant) into the future, because the number of sales per period will be roughly constant.
The estimates there are not accurate. Postgres's planner wrongly chooses a nested loop - try penalizing nested loops with the statement SET enable_nestloop TO off.
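For a session-local experiment (not a permanent fix):
SET enable_nestloop TO off;
-- re-run EXPLAIN ANALYZE on the query, then restore the default:
RESET enable_nestloop;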
Related
The question is for Firebird 2.5. Let's assume we have the following query:
SELECT
EVENTS.ID,
EVENTS.TS,
EVENTS.DEV_TS,
EVENTS.COMPLETE_TS,
EVENTS.OBJ_ID,
EVENTS.OBJ_CODE,
EVENTS.SIGNAL_CODE,
EVENTS.SIGNAL_EVENT,
EVENTS.REACTION,
EVENTS.PROT_TYPE,
EVENTS.GROUP_CODE,
EVENTS.DEV_TYPE,
EVENTS.DEV_CODE,
EVENTS.SIGNAL_LEVEL,
EVENTS.SIGNAL_INFO,
EVENTS.USER_ID,
EVENTS.MEDIA_ID,
SIGNALS.ID AS SIGNAL_ID,
SIGNALS.SIGNAL_TYPE,
SIGNALS.IMAGE AS SIGNAL_IMAGE,
SIGNALS.NAME AS SIGNAL_NAME,
REACTION.INFO,
USERS.NAME AS USER_NAME
FROM EVENTS
LEFT OUTER JOIN SIGNALS ON (EVENTS.SIGNAL_ID = SIGNALS.ID)
LEFT OUTER JOIN REACTION ON (EVENTS.ID = REACTION.EVENTS_ID)
LEFT OUTER JOIN USERS ON (EVENTS.USER_ID = USERS.ID)
WHERE (TS BETWEEN '27.07.2021 00:00:00' AND '28.07.2021 10:34:08')
AND (OBJ_ID = 8973)
AND (DEV_CODE IN (0, 1234))
AND (DEV_TYPE = 79)
AND (PROT_TYPE = 8)
ORDER BY TS;
EVENTS has about 190 million records by now and this query takes too much time to complete. As I read here, the tables have to have indexes on all the columns that are used.
Here are the CREATE INDEX statements for the EVENTS table:
CREATE INDEX FK_EVENTS_OBJ ON EVENTS (OBJ_ID);
CREATE INDEX FK_EVENTS_SIGNALS ON EVENTS (SIGNAL_ID);
CREATE INDEX IDX_EVENTS_COMPLETE_TS ON EVENTS (COMPLETE_TS);
CREATE INDEX IDX_EVENTS_OBJ_SIGNAL_TS ON EVENTS (OBJ_ID,SIGNAL_ID,TS);
CREATE INDEX IDX_EVENTS_TS ON EVENTS (TS);
Here is the data from the PLAN analyzer:
PLAN JOIN (JOIN (JOIN (EVENTS ORDER IDX_EVENTS_TS INDEX (FK_EVENTS_OBJ, IDX_EVENTS_TS), SIGNALS INDEX (PK_SIGNALS)), REACTION INDEX (IDX_REACTION_EVENTS)), USERS INDEX (PK_USERS))
As requested, the execution times:
without LEFT JOIN -> 138ms
with LEFT JOIN -> 338ms
Is there another way to speed up the execution of the query besides indexing the columns, or maybe adding another index?
If I add another index, will the optimizer choose to use it?
You can only optimize the joins themselves by making sure that the keys in the joined tables are indexed. These all look like primary keys, so they should already have appropriate indexes.
For this WHERE clause:
WHERE TS BETWEEN '27.07.2021 00:00:00' AND '28.07.2021 10:34:08' AND
      OBJ_ID = 8973 AND
      DEV_CODE IN (0, 1234) AND
      DEV_TYPE = 79 AND
      PROT_TYPE = 8
You probably want an index on (OBJ_ID, DEV_TYPE, PROT_TYPE, TS, DEV_CODE). The order of the first three keys is not particularly important because they are all equality comparisons. I am guessing that one day of data is fewer rows than two device codes.
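Spelled out as DDL, that suggestion might look like this (the index name is mine):
CREATE INDEX IDX_EVENTS_FILTER
ON EVENTS (OBJ_ID, DEV_TYPE, PROT_TYPE, TS, DEV_CODE);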
First of all, you want to find the table1 rows quickly (this answer uses generic names: read table1 as EVENTS, dt as DEV_TYPE, oid as OBJ_ID, pt as PROT_TYPE, ts as TS, and dc as DEV_CODE). You are using several columns in your WHERE clause to get them. Provide an index on these columns. Which column is the most selective, i.e. which criterion narrows the result rows the most? Let's say it's dt, so we put this first:
create index idx1 on table1 (dt, oid, pt, ts, dc);
I have put ts and dc last, because we are looking for more than one value in these columns. It may still be that putting ts or dc as the first column is a good choice. Sometimes we have to play around with this, i.e. provide several indexes with the column order changed and then see which one gets used by the DBMS - for example the variant shown below.
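A variant with the range column first might look like this (name illustrative):
create index idx1b on table1 (ts, dt, oid, pt, dc);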
Tables table2 and table4 get accessed by the primary key, for which an index exists. But table3 gets accessed by t1id, so provide an index on that, too:
create index idx2 on table3 (t1id);
To illustrate my question, I will use the following example:
CREATE INDEX supplier_idx
ON supplier (supplier_name);
Will the searching on this table only be sped up if the supplier_name column is specified in the SELECT clause? What if we select the supplier_name column as well as other columns in the SELECT clause? Is searching sped up if this column is used in a WHERE clause, even if it is not in the SELECT clause?
Do the same rules apply to the following index as well:
CREATE INDEX supplier_idx
ON supplier (supplier_name, city);
Indexes can be complex, so a full explanation would take a lot of writing. There are many resources on the internet. (Helpful link here to Oracle indexes)
However, I can just answer your questions simply.
CREATE INDEX supplier_idx
ON supplier (supplier_name);
This means that any join (or similar construct) using the column supplier_name, and any WHERE clause using supplier_name, will benefit from the index.
For example
SELECT * FROM SomeTable
WHERE supplier_name = 'Smith'
But simply returning the supplier_name column in the SELECT clause will not benefit from the index (unless you add complexity to the SELECT clause, which I will cover below). For example, this will not benefit from an index on supplier_name:
SELECT
supplier_name
FROM SomeTable WHERE ID = 1
However, if you add some complexity to your SELECT statement, your index could indeed speed it up. For example:
SELECT
supplier_name -- no index benefit
,(SELECT TOP 1 somedata FROM Table2 WHERE SomeTable.supplier_name = Table2.name) AS SomeValue
-- the line above uses the index, as supplier_name appears in a WHERE clause
, CASE WHEN supplier_name = 'Best Supplier'
THEN 'Best'
ELSE 'Worst'
END AS FindBestSupplier
-- the CASE expression above also compares against supplier_name
FROM SomeTable WHERE ID = 1
(The 'complexity' above basically shows that if the field supplier_name is used in a CASE expression or in a WHERE clause, as well as in JOINs and aggregations, then the index is very beneficial. This example is a combination of many clauses wrapped into one SELECT statement.)
But your composite index
CREATE INDEX supplier_idx
ON supplier (supplier_name, city);
would be beneficial in specific and important cases (e.g. where city is in the SELECT clause and supplier_name is used in the WHERE clause), for example:
SELECT
city
FROM SomeTable WHERE supplier_name = 'Smith'
The reason is that city is stored alongside the supplier_name values in the index, so when the index finds the supplier_name value, it immediately has a copy of the city value and does not need to hit the table data to find any more. (If city were not in the index, the engine would have to hit the table to pull the city value out, as it usually does for the rest of the data required by the SELECT statement.) This is often called a covering index.
Joins will also benefit from the index, as in this example:
SELECT
* FROM SomeTable T1
LEFT JOIN AnotherTable T2
ON T1.supplier_name = T2.supplier_name_2
AND T1.city = T2.city_2
So in summary, if you use the field in any comparison expression such as a WHERE clause, a JOIN, or a GROUP BY clause (and in aggregations such as SUM, MIN, MAX, etc.), then an index is very beneficial for tables with more than a few thousand rows.
(It usually only makes a big difference once you have at least 10,000 rows in a table, but this varies with the complexity involved.)
SQL Server (for example) does not create missing indexes for you, but it does track them and will show you hints about which indexes it thinks a certain query needs. If you do not create the correct indexes manually, the system will stay slow every time it runs those queries.
Indexes can slow down UPDATEs and INSERTs, so they must be used with a little wisdom and balance. (Sometimes indexes are dropped before a big batch of UPDATEs is performed and then re-created afterwards, although this is kinda extreme; see the sketch below.)
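The extreme variant mentioned above might look like this in T-SQL (names illustrative; only worthwhile for very large batches):
DROP INDEX supplier_idx ON supplier;
-- ... run the big batch of UPDATEs/INSERTs ...
CREATE INDEX supplier_idx ON supplier (supplier_name, city);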
I have the following Postgres query, the query takes 10 to 50 seconds to execute.
SELECT m.match_id FROM match m
WHERE m.match_id NOT IN(SELECT ml.match_id FROM message_log ml)
AND m.account_id = ?
I have created an index on match_id and account_id
CREATE INDEX match_match_id_account_id_idx ON match USING btree
(match_id COLLATE pg_catalog."default",
account_id COLLATE pg_catalog."default");
But still the query takes a long time. What can I do to speed this up and make it efficient? My server load goes to 25 when I have a few of these queries executing.
NOT IN (SELECT ...) can be considerably more expensive because it has to handle NULL separately, and it can yield surprising results when NULL values are involved. Typically, LEFT JOIN / IS NULL (or one of the other related techniques) is faster:
Select rows which are not present in other table
Applied to your query:
SELECT m.match_id
FROM match m
LEFT JOIN message_log ml USING (match_id)
WHERE ml.match_id IS NULL
AND m.account_id = ?;
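One of the other related techniques is NOT EXISTS, which Postgres typically plans as the same anti-join; a sketch:
SELECT m.match_id
FROM match m
WHERE m.account_id = ?
AND NOT EXISTS (
   SELECT 1
   FROM message_log ml
   WHERE ml.match_id = m.match_id
);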
The best index would be:
CREATE INDEX match_match_id_account_id_idx ON match (account_id, match_id);
Or just on (account_id), assuming that match_id is PK in both tables. You also already have the needed index on message_log(match_id). Else create that, too.
Also, COLLATE pg_catalog."default" in your index definition indicates that your ID columns are character types, which is typically inefficient. They would typically be better as integer types.
My educated guess from the little you have shown so far: there are probably more issues.
To find all the changes between two databases, I am left joining the tables on the PK and using a date_modified field to choose the latest record. Will using EXCEPT increase performance, since the tables have the same schema? I would like to rewrite the query with EXCEPT, but I'm not sure the implementation of EXCEPT would outperform a JOIN in every case. Hopefully someone has a more technical explanation of when to use EXCEPT.
There is no way anyone can tell you that EXCEPT will always or never out-perform an equivalent OUTER JOIN. The optimizer will choose an appropriate execution plan regardless of how you write your intent.
That said, here is my guideline:
Use EXCEPT when at least one of the following is true:
The query is more readable (this will almost always be true).
Performance is improved.
And BOTH of the following are true:
The query produces semantically identical results, and you can demonstrate this through sufficient regression testing, including all edge cases.
Performance is not degraded (again, in all edge cases, as well as environmental changes such as clearing buffer pool, updating statistics, clearing plan cache, and restarting the service).
It is important to note that it can be a challenge to write an equivalent EXCEPT query as the JOIN becomes more complex and/or you are relying on duplicates in some of the columns but not others. Writing a NOT EXISTS equivalent, while slightly less readable than EXCEPT, should be far more straightforward to accomplish - and will often lead to a better plan (but note that I would never say ALWAYS or NEVER, except in the way I just did).
In this blog post I demonstrate at least one case where EXCEPT is outperformed by both a properly constructed LEFT OUTER JOIN and of course by an equivalent NOT EXISTS variation.
In the following example, the LEFT JOIN is faster than EXCEPT by 70%
(PostgreSQL 9.4.3)
Example:
There are three tables: suppliers, parts, shipments.
We need to get all parts not supplied by any supplier in London.
Database (with indexes on all involved columns):
CREATE TABLE suppliers (
id bigint primary key,
city character varying NOT NULL
);
CREATE TABLE parts (
id bigint primary key,
name character varying NOT NULL
);
CREATE TABLE shipments (
id bigint primary key,
supplier_id bigint NOT NULL,
part_id bigint NOT NULL
);
Record counts:
db=# SELECT COUNT(*) FROM suppliers;
count
---------
1281280
(1 row)
db=# SELECT COUNT(*) FROM parts;
count
---------
1280000
(1 row)
db=# SELECT COUNT(*) FROM shipments;
count
---------
1760161
(1 row)
Query using EXCEPT.
SELECT parts.*
FROM parts
EXCEPT
SELECT parts.*
FROM parts
LEFT JOIN shipments
ON (parts.id = shipments.part_id)
LEFT JOIN suppliers
ON (shipments.supplier_id = suppliers.id)
WHERE suppliers.city = 'London'
;
-- Execution time: 3327.728 ms
Query using LEFT JOIN with a table returned by a subquery.
SELECT parts.*
FROM parts
LEFT JOIN (
SELECT parts.id
FROM parts
LEFT JOIN shipments
ON (parts.id = shipments.part_id)
LEFT JOIN suppliers
ON (shipments.supplier_id = suppliers.id)
WHERE suppliers.city = 'London'
) AS subquery_tbl
ON (parts.id = subquery_tbl.id)
WHERE subquery_tbl.id IS NULL
;
-- Execution time: 1136.393 ms
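For comparison, the NOT EXISTS variation mentioned in the previous answer could be written against the same schema like this (not timed here):
SELECT parts.*
FROM parts
WHERE NOT EXISTS (
    SELECT 1
    FROM shipments
    JOIN suppliers ON (shipments.supplier_id = suppliers.id)
    WHERE shipments.part_id = parts.id
    AND suppliers.city = 'London'
);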
I have the following three tables (there are actually many more fields, but this should give an idea of what I'm trying to achieve):
log (
eventId INTEGER,
objectId INTEGER,
PRIMARY KEY (eventId)
)
objects (
objectId INTEGER,
typeId INTEGER,
PRIMARY KEY (objectId, typeId)
)
statusBits (
typeId INTEGER,
bitNumber INTEGER
)
The log table contains a very large number of records (500,000+), while the other tables are quite small. I can join the tables using the following query:
SELECT l.eventId, o.typeId, s.bitNumber
FROM log l, objects o, statusBits s
WHERE (l.objectId = o.objectId) AND (o.typeId = s.typeId)
This query runs nice and fast. It also runs fast when I add an ORDER BY eventId clause at the end. However, when I add ORDER BY eventId, bitNumber (thus sorting by two fields rather than one) it becomes painfully slow.
How can I optimise my query to that it runs faster? I am running Oracle 10g XE if that makes any difference.
UPDATE:
I've already tried CREATE INDEX ON statusBits(bitNumber) but it doesn't seem to have a great effect.
First of all, I'll refactor your query as follows:
SELECT L.eventId
,O.typeId
,S.bitNumber
FROM log L
INNER JOIN objects O ON O.objectId = L.objectId
INNER JOIN statusBits S ON S.typeId = O.typeId
It probably won't help your execution time, but the query is much more readable, and the use of explicit INNER JOIN syntax is best practice.
Then, to optimise the execution time, the first solution that comes to mind is creating an index, but you've already tried that. It may help to try a concatenated (multi-column) index instead of a single-column index (note that Oracle requires a name for the index):
CREATE INDEX idx_statusbits_type_bit ON statusBits (typeId, bitNumber);
Hope this will help.