Any way to speed up this SQL query?

I have the following Postgres query, the query takes 10 to 50 seconds to execute.
SELECT m.match_id FROM match m
WHERE m.match_id NOT IN (SELECT ml.match_id FROM message_log ml)
AND m.account_id = ?
I have created an index on match_id and account_id
CREATE INDEX match_match_id_account_id_idx ON match USING btree
(match_id COLLATE pg_catalog."default",
account_id COLLATE pg_catalog."default");
But still the query takes a long time. What can I do to speed this up and make it efficient? My server load goes to 25 when I have a few of these queries executing.

NOT IN (SELECT ...) can be considerably more expensive because Postgres has to handle NULL separately: if the subquery produces even a single NULL, NOT IN returns no rows at all, and the planner cannot use a plain anti-join. Typically LEFT JOIN / IS NULL (or one of the other related techniques) is faster:
Select rows which are not present in other table
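A toy demonstration of the NULL pitfall (made-up values, runnable as-is in Postgres): a single NULL coming out of the subquery makes NOT IN return nothing.
SELECT 1 WHERE 1 NOT IN (SELECT unnest(ARRAY[2, 3]));    -- returns one row
SELECT 1 WHERE 1 NOT IN (SELECT unnest(ARRAY[2, NULL])); -- returns no rows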
Applied to your query:
SELECT m.match_id
FROM match m
LEFT JOIN message_log ml USING (match_id)
WHERE ml.match_id IS NULL
AND m.account_id = ?;
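NOT EXISTS is another of the related techniques and should produce the same anti-join plan here; unlike NOT IN, it is also safe when message_log.match_id can be NULL:
SELECT m.match_id
FROM   match m
WHERE  m.account_id = ?
AND    NOT EXISTS (
   SELECT 1
   FROM   message_log ml
   WHERE  ml.match_id = m.match_id
   );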
The best index would be:
CREATE INDEX match_match_id_account_id_idx ON match (account_id, match_id);
Or just on (account_id), assuming that match_id is the PK in both tables. You also need an index on message_log(match_id); if match_id is the PK there, you already have it.
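If it is missing, something along these lines would do (the index name is my own):
CREATE INDEX message_log_match_id_idx ON message_log (match_id);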
Also, COLLATE pg_catalog."default" in your index definition indicates that your ID columns are character types, which is typically inefficient. ID columns are typically better off as integer types.
My educated guess from the little you have shown so far: there are probably more issues.

Related

Best way to get distinct count from a query joining two tables

I have 2 tables, table A & table B.
Table A (has thousands of rows)
id
uuid
name
type
created_by
org_id
Table B (has a max of hundred rows)
org_id
org_name
I am trying to get the best join query to obtain a count with a WHERE clause. I need the count of distinct created_bys from table A with an org_name in Table B that contains 'myorg'. I currently have the below query (producing expected results) and wonder if this can be optimized further?
select count(distinct a.created_by)
from a
left join b on a.org_id = b.org_id
where b.org_name like '%myorg%';
You don't need a left join:
select count(distinct a.created_by)
from a
join b on a.org_id = b.org_id
where b.org_name like '%myorg%'
For this query, you want an index on b.org_id, which I assume you have.
I would use exists for this:
select count(distinct a.created_by)
from a
where exists (select 1 from b where b.org_id = a.org_id and b.org_name like '%myorg%')
An index on b(org_id) would help. But in terms of performance, key points are:
searching with LIKE and a wildcard on both sides is bad for performance (it cannot use an index); it would be far better to search for an exact match, or at least to avoid the wildcard on the left side of the pattern (see the sketch after this list).
count(distinct ...) is more expensive than a plain count(); if you don't really need distinct, don't use it.
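To illustrate the first point: if you can drop the leading wildcard, the search becomes a prefix match that an index can serve. A sketch, assuming Postgres and a made-up index name (text_pattern_ops makes a btree index usable for LIKE prefix matches under non-C collations):
CREATE INDEX b_org_name_pattern_idx ON b (org_name text_pattern_ops);

SELECT count(DISTINCT a.created_by)
FROM   a
JOIN   b ON b.org_id = a.org_id
WHERE  b.org_name LIKE 'myorg%';  -- left-anchored, can use the index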
Your query looks good already. Use a plain [INNER] JOIN instead of LEFT [OUTER] JOIN, like Gordon suggested. But that won't change much.
You mention that table B has only ...
a max of hundred rows
while table A has ...
thousands of rows
If there are many rows per created_by (which I'd expect), then there is potential for an emulated index skip scan.
(The need to emulate it might go away in one of the coming Postgres versions.)
The essential ingredient is this multicolumn index:
CREATE INDEX ON a (org_id, created_by);
It can replace a simple index on just (org_id) and works for your simple query as well. See:
Is a composite index also good for queries on the first field?
There are two complications for your case:
DISTINCT
0-n org_id resulting from org_name like '%myorg%'
So the optimization is harder to implement. But still possible with some fancy SQL:
SELECT count(DISTINCT created_by) -- does not count NULL (as desired)
FROM   b
CROSS  JOIN LATERAL (
   WITH RECURSIVE t AS (
      (  -- parentheses required
      SELECT created_by
      FROM   a
      WHERE  org_id = b.org_id
      ORDER  BY created_by
      LIMIT  1
      )
      UNION ALL
      SELECT (SELECT created_by
              FROM   a
              WHERE  org_id = b.org_id
              AND    created_by > t.created_by
              ORDER  BY created_by
              LIMIT  1)
      FROM   t
      WHERE  t.created_by IS NOT NULL -- stop recursion
      )
   TABLE  t
   ) a
WHERE  b.org_name LIKE '%myorg%';
db<>fiddle here (Postgres 12, but works in Postgres 9.6 as well).
That's a recursive CTE in a LATERAL subquery, using a correlated subquery.
It utilizes the multicolumn index from above to retrieve only a single row for every (org_id, created_by) - with index-only scans if the table is vacuumed enough.
The main objective of the sophisticated SQL is to completely avoid a sequential scan (or even a bitmap index scan) on the big table and only read very few fast index tuples.
Due to the added overhead, this can be a bit slower for an unfavorable data distribution (many org_id and/or only few rows per created_by). But it's much faster for favorable conditions and scales excellently, even for millions of rows. You'll have to test to find the sweet spot.
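To find that sweet spot, one could time both variants on real data, for example with EXPLAIN (ANALYZE, BUFFERS), and compare the actual time and buffer numbers:
EXPLAIN (ANALYZE, BUFFERS)
SELECT count(DISTINCT a.created_by)
FROM   a
JOIN   b ON a.org_id = b.org_id
WHERE  b.org_name LIKE '%myorg%';
-- then run the same on the recursive variant above and compare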
Related:
Optimize GROUP BY query to retrieve latest row per user
What is the difference between LATERAL and a subquery in PostgreSQL?
Is there a shortcut for SELECT * FROM?

Count on join of big tables with conditions is slow

This query had reasonable times when the table was small. I'm trying to identify what's the bottleneck, but I'm not sure how to analyze the EXPLAIN results.
SELECT COUNT(*)
FROM performance_analyses
INNER JOIN total_sales ON total_sales.id = performance_analyses.total_sales_id
WHERE (size > 0)
AND total_sales.customer_id IN (
    SELECT customers.id FROM customers
    WHERE customers.active = 't'
      AND customers.visible = 't'
      AND customers.organization_id = 3
)
AND total_sales.product_category_id IN (
    SELECT product_categories.id FROM product_categories
    WHERE product_categories.organization_id = 3
)
AND total_sales.period_id = 193;
I've tried both the approach of INNER JOINing the customers and product_categories tables and the approach of using an inner SELECT, as shown above. Both took the same time.
Here's the link to EXPLAIN: https://explain.depesz.com/s/9lhr
Postgres version:
PostgreSQL 9.4.5 on x86_64-unknown-linux-gnu, compiled by gcc (GCC) 4.8.2 20140120 (Red Hat 4.8.2-16), 64-bit
Tables and indexes:
CREATE TABLE total_sales (
id serial NOT NULL,
value double precision,
start_date date,
end_date date,
product_category_customer_id integer,
created_at timestamp without time zone,
updated_at timestamp without time zone,
processed boolean,
customer_id integer,
product_category_id integer,
period_id integer,
CONSTRAINT total_sales_pkey PRIMARY KEY (id)
);
CREATE INDEX index_total_sales_on_customer_id ON total_sales (customer_id);
CREATE INDEX index_total_sales_on_period_id ON total_sales (period_id);
CREATE INDEX index_total_sales_on_product_category_customer_id ON total_sales (product_category_customer_id);
CREATE INDEX index_total_sales_on_product_category_id ON total_sales (product_category_id);
CREATE INDEX total_sales_product_category_period ON total_sales (product_category_id, period_id);
CREATE INDEX ts_pid_pcid_cid ON total_sales (period_id, product_category_id, customer_id);
CREATE TABLE performance_analyses (
id serial NOT NULL,
total_sales_id integer,
status_id integer,
created_at timestamp without time zone,
updated_at timestamp without time zone,
size double precision,
period_size integer,
nominal_variation double precision,
percentual_variation double precision,
relative_performance double precision,
time_ago_max integer,
deseasonalized_series text,
significance character varying,
relevance character varying,
original_variation double precision,
last_level double precision,
quantiles text,
range text,
analysis_method character varying,
CONSTRAINT performance_analyses_pkey PRIMARY KEY (id)
);
CREATE INDEX index_performance_analyses_on_status_id ON performance_analyses (status_id);
CREATE INDEX index_performance_analyses_on_total_sales_id ON performance_analyses (total_sales_id);
CREATE TABLE product_categories (
id serial NOT NULL,
name character varying,
organization_id integer,
created_at timestamp without time zone,
updated_at timestamp without time zone,
external_id character varying,
CONSTRAINT product_categories_pkey PRIMARY KEY (id)
);
CREATE INDEX index_product_categories_on_organization_id ON product_categories (organization_id);
CREATE TABLE customers (
id serial NOT NULL,
name character varying,
external_id character varying,
region_id integer,
organization_id integer,
created_at timestamp without time zone,
updated_at timestamp without time zone,
active boolean DEFAULT false,
visible boolean DEFAULT false,
segment_id integer,
"group" boolean,
group_id integer,
ticket_enabled boolean DEFAULT true,
CONSTRAINT customers_pkey PRIMARY KEY (id)
);
CREATE INDEX index_customers_on_organization_id ON customers (organization_id);
CREATE INDEX index_customers_on_region_id ON customers (region_id);
CREATE INDEX index_customers_on_segment_id ON customers (segment_id);
Rows counts:
customers - 6,970 rows
product_categories - 34 rows
performance_analyses - 1,012,346 rows
total_sales - 7,104,441 rows
Your query, rewritten and 100 % equivalent:
SELECT count(*)
FROM product_categories pc
JOIN customers c USING (organization_id)
JOIN total_sales ts ON ts.customer_id = c.id
JOIN performance_analyses pa ON pa.total_sales_id = ts.id
WHERE pc.organization_id = 3
AND c.active -- boolean can be used directly
AND c.visible
AND ts.product_category_id = pc.id
AND ts.period_id = 193
AND pa.size > 0;
Another answer advises to move all conditions into join clauses and to order the tables in the FROM list. This may apply to certain other RDBMS with comparatively primitive query planners. While it doesn't hurt in Postgres, it also has no effect on performance for your query - assuming default server configuration. The manual:
Explicit inner join syntax (INNER JOIN, CROSS JOIN, or unadorned JOIN)
is semantically the same as listing the input relations in FROM, so it
does not constrain the join order.
Bold emphasis mine. There is more, read the manual.
The key setting is join_collapse_limit (with default 8). The Postgres query planner will rearrange your 4 tables any way it expects it to be fastest, no matter how you arranged your tables and whether you write conditions as WHERE or JOIN clauses. No difference whatsoever. (The same is not true for some other types of joins that cannot be rearranged freely.)
The important point is that these different join possibilities give
semantically equivalent results but might have hugely different
execution costs. Therefore, the planner will explore all of them to
try to find the most efficient query plan.
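To verify this yourself, you can temporarily take that freedom away from the planner; a diagnostic sketch, not something for production:
BEGIN;
SET LOCAL join_collapse_limit = 1;  -- planner now keeps the textual join order
-- EXPLAIN the query here, reorder the FROM list, EXPLAIN again, compare plans
ROLLBACK;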
Related:
Sample Query to show Cardinality estimation error in PostgreSQL
A: Slow fulltext search due to wildly inaccurate row estimates
Finally, WHERE id IN (<subquery>) is not generally equivalent to a join. It does not multiply rows on the left side for duplicate matching values on the right side. And columns of the subquery are not visible for the rest of the query. A join can multiply rows with duplicate values and columns are visible.
Your simple subqueries dig up a single unique column in both cases, so there is no effective difference in this case - except that IN (<subquery>) is generally (at least a bit) slower and more verbose. Use joins.
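A toy illustration of that semantic difference (inline values, runnable as-is):
-- IN does not multiply rows for duplicate matches on the right side:
SELECT t.x
FROM  (VALUES (1)) AS t(x)
WHERE  t.x IN (SELECT s.y FROM (VALUES (1), (1)) AS s(y));  -- 1 row

-- a join does:
SELECT t.x
FROM  (VALUES (1)) AS t(x)
JOIN  (VALUES (1), (1)) AS s(y) ON s.y = t.x;               -- 2 rows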
Your query
Indexes
product_categories has 34 rows. Unless you plan on adding many more, indexes do not help performance for this table. A sequential scan will always be faster. Drop index_product_categories_on_organization_id.
customers has 6,970 rows. Indexes start to make sense. But your query uses 4,988 of them according to the EXPLAIN output. Only an index-only scan on an index much less wide than the table could help a bit. Assuming WHERE active AND visible are constant predicates, I suggest a partial multicolumn index:
CREATE INDEX index_customers_on_organization_id ON customers (organization_id, id)
WHERE active AND visible;
I appended id to allow index-only scans. The column is otherwise useless in the index for this query.
total_sales has 7,104,441 rows. Indexes are very important. I suggest:
CREATE INDEX total_sales_pid_pcid_cid_id_idx
ON total_sales (period_id, product_category_id, customer_id, id);
Again, aiming for an index-only scan. This is the most important one.
You can delete the completely redundant index index_total_sales_on_product_category_id.
performance_analyses has 1,012,346 rows. Indexes are very important.
I would suggest another partial index with the condition size > 0:
CREATE INDEX performance_analyses_total_sales_id_idx
ON performance_analyses (total_sales_id)
WHERE size > 0;
However:
Rows Removed by Filter: 0
Seems like this condition serves no purpose? Are there any rows where size > 0 is not true?
After creating these indexes you need to ANALYZE the tables.
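For example, for the tables touched by the suggestions above:
ANALYZE customers;
ANALYZE total_sales;
ANALYZE performance_analyses;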
Table statistics
Generally, I see many bad estimates. Postgres underestimates the number of rows returned at almost every step. The nested loops we see would work much better for fewer rows. Unless this is an unlikely coincidence, your table statistics are badly outdated. You need to revisit your settings for autovacuum, and probably also the per-table settings for your two big tables, performance_analyses and total_sales.
According to your comment, you already ran VACUUM and ANALYZE, and the query got slower afterwards. That doesn't make a lot of sense. I would run VACUUM FULL on these two tables once (if you can afford an exclusive lock); else try pg_repack.
With all the fishy statistics and bad plans, I would consider running a complete vacuumdb -fz yourdb on your DB. That rewrites all tables and indexes in pristine condition, but it's not something to use on a regular basis. It's also expensive and will lock your DB for an extended period of time!
While being at it, have a look at the cost settings of your DB as well.
Related:
Keep PostgreSQL from sometimes choosing a bad query plan
Postgres Slow Queries - Autovacuum frequency
Although theoretically the optimizer should be able to do this, I often find that these changes can massively improve performance:
use proper joins (instead of where id in (select ...))
order the reference to tables in the from clause such that the fewest rows are returned at each join, especially the first table's condition (in the where clause) should be the most restrictive (and should use indexes)
move all conditions on joined tables into the on condition of joins
Try this (aliases added for readability):
select count(*)
from total_sales ts
join product_categories pc on ts.product_category_id = pc.id and pc.organization_id = 3
join customers c on ts.customer_id = c.id and c.organization_id = 3
join performance_analyses pa on ts.id = pa.total_sales_id and pa.size > 0
where ts.period_id = 193
You will need to create this index for optimal performance (to allow an index-only scan on total_sales):
create index ts_pid_pcid_cid on total_sales(period_id, product_category_id, customer_id)
This approach first narrows the data to a period, so it will scale (remain roughly constant) into the future, because the number of sales per period will be roughly constant.
The estimates there are not accurate. Postgres's planner wrongly chooses a nested loop - try penalizing nested loops with SET enable_nestloop TO off.
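A session-local way to test that without affecting other connections (diagnostic only; if the plan improves, the real fix is better statistics, as discussed above):
BEGIN;
SET LOCAL enable_nestloop = off;  -- planner avoids nested loops where it can
-- EXPLAIN ANALYZE the count(*) query here and compare timings
ROLLBACK;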

SQLite is using the wrong index in a left join

I am joining two tables with a left join:
The first table is quite simple
create table L (
id integer primary key
);
and contains only a handful of records.
The second table is
create table R (
L_id null references L,
k text not null,
v text not null
);
and contains millions of records.
The following two indexes are on R:
create index R_ix_1 on R(L_id);
create index R_ix_2 on R(k);
This select statement, imho, selects the wrong index:
select
L.id,
R.v
from
L left join
R on
L.id = R.L_id and
R.k = 'foo';
An EXPLAIN QUERY PLAN tells me that the select statement uses the index R_ix_2, and the execution of the select takes too much time. I believe the performance would be much better if SQLite chose to use R_ix_1 instead.
I also tried
select
L.id,
R.v
from
L left join
R indexed by R_ix_1 on
L.id = R.L_id and
R.k = 'foo';
but that gave me Error: no query solution.
Is there something I can do to make sqlite use the other index?
Your join condition relies on 2 columns, so your index should cover those 2 columns:
create index R_ix_1 on R(L_id, k);
If you have other queries relying on only a single column, you can keep the old indexes, but you still need this double-column index as well:
create index R_ix_1 on R(L_id);
create index R_ix_2 on R(k);
create index R_ix_3 on R(L_id, k);
I wonder if the SQLite optimizer just gets confused in this case. Does this work better?
select L.id, R.v
from L left join
R
on L.id = R.L_id
where R.k = 'foo' or R.k is NULL;
EDIT:
Of course, SQLite will only use an index if the types of the columns are the same. The question doesn't specify the type of L_id. If it is not the same as the type of the primary key, then the index (probably) will not be used.

Best index configuration for sql statement in sqlite?

I have the following compound SQL statement for a lookup, and I am trying to understand which are the optimal indexes (indices?) to create, which ones I should leave out because they aren't needed, and whether it is counterproductive to have multiple.
SELECT items.id, items.standard_part_number,
items.standard_price, items.quantity,
part_numbers.value, items.metadata,
items.image_file_name, items.updated_at
FROM items LEFT OUTER JOIN part_numbers ON items.id=part_numbers.item_id
AND part_numbers.account_id='#{account_id}'
WHERE items.standard_part_number LIKE '#{part_number}%'
UNION ALL
SELECT items.id, items.standard_part_number,
items.standard_price, items.quantity,
part_numbers.value, items.metadata,
items.image_file_name, items.updated_at
FROM items LEFT OUTER JOIN part_numbers ON items.id=part_numbers.item_id
AND part_numbers.account_id='#{account_id}'
WHERE part_numbers.value LIKE '#{part_number}%'
ORDER BY items.standard_part_number
LIMIT '#{limit}' OFFSET '#{offset}'
I have the following indices; some of them may not be necessary, or I could be missing an index. Or worse, can having too many work against the optimal performance configuration?
for items:
CREATE INDEX index_items_standard_part_number ON items (standard_part_number);
for part_numbers:
CREATE INDEX index_part_numbers_item_id ON part_numbers (item_id);
CREATE INDEX index_part_numbers_item_id_and_account_id on part_numbers (item_id,account_id);
CREATE INDEX index_part_numbers_item_id_and_account_id_and_value ON part_numbers (item_id,account_id,value);
CREATE INDEX index_part_numbers_item_id_and_value on part_numbers (item_id,value);
CREATE INDEX index_part_numbers_value on part_numbers (value);
Update:
The schema for the tables listed above
CREATE TABLE accounts (id INTEGER PRIMARY KEY,name TEXT,code TEXT UNIQUE,created_at INTEGER,updated_at INTEGER,company_id INTEGER,standard BOOLEAN,price_list_id INTEGER);
CREATE TABLE items (id INTEGER PRIMARY KEY,standard_part_number TEXT UNIQUE,standard_price INTEGER,part_number TEXT,price INTEGER,quantity INTEGER,unit_of_measure TEXT,metadata TEXT,image_file_name TEXT,created_at INTEGER,updated_at INTEGER,company_id INTEGER);
CREATE TABLE part_numbers (id INTEGER PRIMARY KEY,value TEXT,item_id INTEGER,account_id INTEGER,created_at INTEGER,updated_at INTEGER,company_id INTEGER,standard BOOLEAN);
Outer joins constrain the join order, so you should not use them unless necessary.
In the second subquery, the WHERE part_numbers.value LIKE ... clause would filter out any unmatched records anyway, so you should drop that LEFT OUTER.
SQLite can use at most one index per table per (sub)query.
So to be able to use the same index for both searching and sorting, both operations must use the same collation.
LIKE uses a case-insensitive collation, so the ORDER BY should be declared to use the same (ORDER BY items.standard_part_number COLLATE NOCASE).
This is not possible if the part numbers must be sorted case sensitively.
This is not needed if SQLite does not actually use the same index for both (check with EXPLAIN QUERY PLAN).
In the first subquery, there is no index that could be used for the items.standard_part_number LIKE '#{part_number}%' search.
You would need an index like this (NOCASE is needed for LIKE):
CREATE INDEX iii ON items(standard_part_number COLLATE NOCASE);
In the second subquery, SQLite is likely to use part_numbers as the outer table in the join because it has two filtered columns.
An index for these two searches must look like this (with NOCASE only for the second column):
CREATE INDEX ppp ON part_numbers(account_id, value COLLATE NOCASE);
With all these changes, the query and its EXPLAIN QUERY PLAN output look like this:
EXPLAIN QUERY PLAN
SELECT items.id, items.standard_part_number,
items.standard_price, items.quantity,
part_numbers.value, items.metadata,
items.image_file_name, items.updated_at
FROM items LEFT OUTER JOIN part_numbers ON items.id=part_numbers.item_id
AND part_numbers.account_id='#{account_id}'
WHERE items.standard_part_number LIKE '#{part_number}%'
UNION ALL
SELECT items.id, items.standard_part_number,
items.standard_price, items.quantity,
part_numbers.value, items.metadata,
items.image_file_name, items.updated_at
FROM items JOIN part_numbers ON items.id=part_numbers.item_id
AND part_numbers.account_id='#{account_id}'
WHERE part_numbers.value LIKE '#{part_number}%'
ORDER BY items.standard_part_number COLLATE NOCASE
LIMIT -1 OFFSET 0;
1|0|0|SEARCH TABLE items USING INDEX iii (standard_part_number>? AND standard_part_number<?)
1|1|1|SEARCH TABLE part_numbers USING COVERING INDEX index_part_numbers_item_id_and_account_id_and_value (item_id=? AND account_id=?)
2|0|1|SEARCH TABLE part_numbers USING INDEX ppp (account_id=? AND value>? AND value<?)
2|1|0|SEARCH TABLE items USING INTEGER PRIMARY KEY (rowid=?)
2|0|0|USE TEMP B-TREE FOR ORDER BY
0|0|0|COMPOUND SUBQUERIES 1 AND 2 (UNION ALL)
The second subquery cannot use an index for sorting because part_numbers is not the outer table in the join, but the speedup from looking up both account_id and value through an index is likely to be greater than the slowdown from doing an explicit sorting step.
For this query alone, you could drop all indexes not mentioned here.
If the part numbers can be searched case sensitively, you should remove all the COLLATE NOCASE stuff and replace the LIKE searches with a case-sensitive search (partnum BETWEEN 'abc' AND 'abcz').

Performance issue with select query in Firebird

I have two tables, one small (~ 400 rows), one large (~ 15 million rows), and I am trying to find the records from the small table that don't have an associated entry in the large table.
I am encountering massive performance issues with the query.
The query is:
SELECT * FROM small_table WHERE NOT EXISTS
(SELECT NULL FROM large_table WHERE large_table.small_id = small_table.id)
The column large_table.small_id references small_table's id field, which is its primary key.
The query plan shows that the foreign key index is used for the large_table:
PLAN (large_table (RDB$FOREIGN70))
PLAN (small_table NATURAL)
Statistics have been recalculated for indexes on both tables.
The query takes several hours to run. Is this expected?
If so, can I rewrite the query so that it will be faster?
If not, what could be wrong?
I'm not sure about Firebird, but in other DBs a join is often faster.
SELECT *
FROM small_table st
LEFT JOIN large_table lt
ON st.id = lt.small_id
WHERE lt.small_id IS NULL
Maybe give that a try?
Another option, if you're really stuck, and depending on the situation this needs to run in, is to extract the small_id column from large_table, possibly into a temp table, and then do a LEFT JOIN / EXISTS query against that.
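A sketch of that temp-table idea, assuming Firebird 2.1+ global temporary tables (all names below are made up):
CREATE GLOBAL TEMPORARY TABLE tmp_small_ids (
  small_id INTEGER
) ON COMMIT DELETE ROWS;  -- rows live only for the duration of the transaction

INSERT INTO tmp_small_ids
SELECT DISTINCT small_id FROM large_table;

SELECT st.*
FROM small_table st
LEFT JOIN tmp_small_ids t ON t.small_id = st.id
WHERE t.small_id IS NULL;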
If the large table only has relatively few distinct values for small_id, the following might perform better:
select *
from small_table st
left outer join (
    select distinct small_id
    from large_table
) lt on lt.small_id = st.id
where lt.small_id is null;
In this case, performance would be better with a full scan of the large table followed by index lookups in the small table - the opposite of what the current plan does. The DISTINCT can be computed with just an index scan on the large table, which is then joined using the primary key index on the small table.