I have the following compound SQL statement for a lookup, and I am trying to understand what the optimal indexes (indices?) are to create, which ones I should leave out because they aren't needed, and whether it is counterproductive to have multiple.
SELECT items.id, items.standard_part_number,
items.standard_price, items.quantity,
part_numbers.value, items.metadata,
items.image_file_name, items.updated_at
FROM items LEFT OUTER JOIN part_numbers ON items.id=part_numbers.item_id
AND part_numbers.account_id='#{account_id}'
WHERE items.standard_part_number LIKE '#{part_number}%'
UNION ALL
SELECT items.id, items.standard_part_number,
items.standard_price, items.quantity,
part_numbers.value, items.metadata,
items.image_file_name, items.updated_at
FROM items LEFT OUTER JOIN part_numbers ON items.id=part_numbers.item_id
AND part_numbers.account_id='#{account_id}'
WHERE part_numbers.value LIKE '#{part_number}%'
ORDER BY items.standard_part_number
LIMIT '#{limit}' OFFSET '#{offset}'
I have the following indices. Some of them may not be necessary, or I could be missing an index. Or worse, could having too many work against the optimal performance configuration?
for items:
CREATE INDEX index_items_standard_part_number ON items (standard_part_number);
for part_numbers:
CREATE INDEX index_part_numbers_item_id ON part_numbers (item_id);
CREATE INDEX index_part_numbers_item_id_and_account_id ON part_numbers (item_id,account_id);
CREATE INDEX index_part_numbers_item_id_and_account_id_and_value ON part_numbers (item_id,account_id,value);
CREATE INDEX index_part_numbers_item_id_and_value ON part_numbers (item_id,value);
CREATE INDEX index_part_numbers_value ON part_numbers (value);
Update:
The schema for the tables listed above
CREATE TABLE accounts (id INTEGER PRIMARY KEY,name TEXT,code TEXT UNIQUE,created_at INTEGER,updated_at INTEGER,company_id INTEGER,standard BOOLEAN,price_list_id INTEGER);
CREATE TABLE items (id INTEGER PRIMARY KEY,standard_part_number TEXT UNIQUE,standard_price INTEGER,part_number TEXT,price INTEGER,quantity INTEGER,unit_of_measure TEXT,metadata TEXT,image_file_name TEXT,created_at INTEGER,updated_at INTEGER,company_id INTEGER);
CREATE TABLE part_numbers (id INTEGER PRIMARY KEY,value TEXT,item_id INTEGER,account_id INTEGER,created_at INTEGER,updated_at INTEGER,company_id INTEGER,standard BOOLEAN);
Outer joins constrain the join order, so you should not use them unless necessary.
In the second subquery, the WHERE part_numbers.value LIKE ... clause would filter out any unmatched records anyway, so you should drop that LEFT OUTER.
SQLite can use at most one index per table per (sub)query.
So to be able to use the same index for both searching and sorting, both operations must use the same collation.
By default, LIKE is case-insensitive (it effectively compares with the NOCASE collation), so the ORDER BY should be declared to use the same collation (ORDER BY items.standard_part_number COLLATE NOCASE).
This is not possible if the part numbers must be sorted case sensitively.
This is not needed if SQLite does not actually use the same index for both (check with EXPLAIN QUERY PLAN).
In the first subquery, there is no index that could be used for the items.standard_part_number LIKE '#{part_number}%' search; the existing index on standard_part_number uses the default BINARY collation, which does not match LIKE's case-insensitive comparison.
You would need an index like this (NOCASE is needed for LIKE):
CREATE INDEX iii ON items(standard_part_number COLLATE NOCASE);
In the second subquery, SQLite is likely to use part_numbers as the outer table in the join because it has two filtered columns.
An index for these two searches must look like this (with NOCASE only for the second column):
CREATE INDEX ppp ON part_numbers(account_id, value COLLATE NOCASE);
With all these changes, the query and its EXPLAIN QUERY PLAN output look like this:
EXPLAIN QUERY PLAN
SELECT items.id, items.standard_part_number,
items.standard_price, items.quantity,
part_numbers.value, items.metadata,
items.image_file_name, items.updated_at
FROM items LEFT OUTER JOIN part_numbers ON items.id=part_numbers.item_id
AND part_numbers.account_id='#{account_id}'
WHERE items.standard_part_number LIKE '#{part_number}%'
UNION ALL
SELECT items.id, items.standard_part_number,
items.standard_price, items.quantity,
part_numbers.value, items.metadata,
items.image_file_name, items.updated_at
FROM items JOIN part_numbers ON items.id=part_numbers.item_id
AND part_numbers.account_id='#{account_id}'
WHERE part_numbers.value LIKE '#{part_number}%'
ORDER BY items.standard_part_number COLLATE NOCASE
LIMIT -1 OFFSET 0;
1|0|0|SEARCH TABLE items USING INDEX iii (standard_part_number>? AND standard_part_number<?)
1|1|1|SEARCH TABLE part_numbers USING COVERING INDEX index_part_numbers_item_id_and_account_id_and_value (item_id=? AND account_id=?)
2|0|1|SEARCH TABLE part_numbers USING INDEX ppp (account_id=? AND value>? AND value<?)
2|1|0|SEARCH TABLE items USING INTEGER PRIMARY KEY (rowid=?)
2|0|0|USE TEMP B-TREE FOR ORDER BY
0|0|0|COMPOUND SUBQUERIES 1 AND 2 (UNION ALL)
The second subquery cannot use an index for sorting because items is not the outer table in the join, but the speedup from looking up both account_id and value through an index is likely to be greater than the slowdown from doing an explicit sorting step.
For this query alone, you could drop all indexes not mentioned here.
If the part numbers can be searched case sensitively, you should remove all the COLLATE NOCASE stuff and replace the LIKE searches with a case-sensitive range search (e.g. partnum >= 'abc' AND partnum < 'abd', i.e. the upper bound is the prefix with its last character incremented).
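Applied to the first subquery, that case-sensitive range search would look like this sketch (assuming a literal prefix 'abc'; under the default BINARY collation the half-open range covers every string beginning with 'abc'):

SELECT id, standard_part_number, standard_price, quantity
FROM items
WHERE standard_part_number >= 'abc'
  AND standard_part_number < 'abd';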
The question is for Firebird 2.5. Let's assume we have the following query:
SELECT
EVENTS.ID,
EVENTS.TS,
EVENTS.DEV_TS,
EVENTS.COMPLETE_TS,
EVENTS.OBJ_ID,
EVENTS.OBJ_CODE,
EVENTS.SIGNAL_CODE,
EVENTS.SIGNAL_EVENT,
EVENTS.REACTION,
EVENTS.PROT_TYPE,
EVENTS.GROUP_CODE,
EVENTS.DEV_TYPE,
EVENTS.DEV_CODE,
EVENTS.SIGNAL_LEVEL,
EVENTS.SIGNAL_INFO,
EVENTS.USER_ID,
EVENTS.MEDIA_ID,
SIGNALS.ID AS SIGNAL_ID,
SIGNALS.SIGNAL_TYPE,
SIGNALS.IMAGE AS SIGNAL_IMAGE,
SIGNALS.NAME AS SIGNAL_NAME,
REACTION.INFO,
USERS.NAME AS USER_NAME
FROM EVENTS
LEFT OUTER JOIN SIGNALS ON (EVENTS.SIGNAL_ID = SIGNALS.ID)
LEFT OUTER JOIN REACTION ON (EVENTS.ID = REACTION.EVENTS_ID)
LEFT OUTER JOIN USERS ON (EVENTS.USER_ID = USERS.ID)
WHERE (TS BETWEEN '27.07.2021 00:00:00' AND '28.07.2021 10:34:08')
AND (OBJ_ID = 8973)
AND (DEV_CODE IN (0, 1234))
AND (DEV_TYPE = 79)
AND (PROT_TYPE = 8)
ORDER BY TS;
EVENTS has about 190 million records by now and this query takes too much time to complete. As I read here, the tables have to have indexes on all the columns that are used.
Here are the CREATE INDEX statements for the EVENTS table:
CREATE INDEX FK_EVENTS_OBJ ON EVENTS (OBJ_ID);
CREATE INDEX FK_EVENTS_SIGNALS ON EVENTS (SIGNAL_ID);
CREATE INDEX IDX_EVENTS_COMPLETE_TS ON EVENTS (COMPLETE_TS);
CREATE INDEX IDX_EVENTS_OBJ_SIGNAL_TS ON EVENTS (OBJ_ID,SIGNAL_ID,TS);
CREATE INDEX IDX_EVENTS_TS ON EVENTS (TS);
Here is the data from the PLAN analyzer:
PLAN JOIN (JOIN (JOIN (EVENTS ORDER IDX_EVENTS_TS INDEX (FK_EVENTS_OBJ, IDX_EVENTS_TS), SIGNALS INDEX (PK_SIGNALS)), REACTION INDEX (IDX_REACTION_EVENTS)), USERS INDEX (PK_USERS))
As requested, the speed of the execution:
without LEFT JOIN -> 138ms
with LEFT JOIN -> 338ms
Is there another way to speed up the execution of the query besides indexing the columns or maybe add another index?
If I add another index will the optimizer choose to use it?
You can only optimize the joins themselves by making sure that the join keys are indexed in the second (right-hand) tables. These all look like primary keys, so they should already have appropriate indexes.
For this WHERE clause:
WHERE TS BETWEEN '27.07.2021 00:00:00' AND '28.07.2021 10:34:08' AND
      OBJ_ID = 8973 AND
      DEV_CODE IN (0, 1234) AND
      DEV_TYPE = 79 AND
      PROT_TYPE = 8
You probably want an index on (OBJ_ID, DEV_TYPE, PROT_TYPE, TS, DEV_CODE). The order of the first three keys is not particularly important because they are all equality comparisons. I am guessing that one day of data is fewer rows than two device codes.
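A minimal sketch of that index (IDX_EVENTS_LOOKUP is a hypothetical name):

CREATE INDEX IDX_EVENTS_LOOKUP
ON EVENTS (OBJ_ID, DEV_TYPE, PROT_TYPE, TS, DEV_CODE);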
First of all you want to find the table1 (here: EVENTS) rows quickly. You are using several columns in your WHERE clause to get them. Provide an index on these columns. Which column is the most selective? I.e. which criterion narrows the result rows the most? Let's say it's dt, so we put this first:
create index idx1 on table1 (dt, oid, pt, ts, dc);
I have put ts and dc last, because we are looking for more than one value in these columns. It may still be that putting ts or dc as the first column is a good choice. Sometimes we have to play around with this, i.e. provide several indexes with the column order changed and then see which one gets used by the DBMS.
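For example, a variant that leads with the range column could be tried alongside idx1 (idx1b is a hypothetical name; compare the plans and keep whichever the DBMS actually uses):

create index idx1b on table1 (ts, oid, pt, dt, dc);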
Tables table2 and table4 (SIGNALS and USERS) get accessed by the primary key, for which an index exists. But table3 (REACTION) gets accessed by t1id. So provide an index on that, too:
create index idx2 on table3 (t1id);
I have 2 tables, table A & table B.
Table A (has thousands of rows)
id
uuid
name
type
created_by
org_id
Table B (has a max of hundred rows)
org_id
org_name
I am trying to get the best join query to obtain a count with a WHERE clause. I need the count of distinct created_by values from table A for rows whose org_id maps to an org_name in Table B that contains 'myorg'. I currently have the below query (producing expected results) and wonder if it can be optimized further?
select count(distinct a.created_by)
from a left join
b
on a.org_id = b.org_id
where b.org_name like '%myorg%';
You don't need a left join:
select count(distinct a.created_by)
from a join
b
on a.org_id = b.org_id
where b.org_name like '%myorg%'
For this query, you want an index on b.org_id, which I assume that you have.
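If it is missing, a minimal sketch (the index name is an arbitrary choice):

create index b_org_id_idx on b (org_id);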
I would use exists for this:
select count(distinct a.created_by)
from a
where exists (select 1 from b where b.org_id = a.org_id and b.org_name like '%myorg%')
An index on b(org_id) would help. But in terms of performance, key points are:
searching using like with a wildcard on both sides is not good for performance (this cannot take advantage of a plain B-tree index); it would be far better to search for an exact match, or at least to not have a wildcard on the left side of the string.
count(distinct ...) is more expensive than a regular count(); if you don't really need distinct, then don't use it.
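As a sketch of that last point (only applicable if duplicates are acceptable, i.e. you do not actually need DISTINCT; same tables as above):

select count(a.created_by)
from a join b on a.org_id = b.org_id
where b.org_name like '%myorg%';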
Your query looks good already. Use a plain [INNER] JOIN instead of LEFT [OUTER] JOIN, like Gordon suggested. But that won't change much.
You mention that table B has only ...
a max of hundred rows
while table A has ...
thousands of rows
If there are many rows per created_by (which I'd expect), then there is potential for an emulated index skip scan.
(The need to emulate it might go away in one of the coming Postgres versions.)
Essential ingredient is this multicolumn index:
CREATE INDEX ON a (org_id, created_by);
It can replace a simple index on just (org_id) and works for your simple query as well. See:
Is a composite index also good for queries on the first field?
There are two complications for your case:
DISTINCT
0-n org_id resulting from org_name like '%myorg%'
So the optimization is harder to implement. But still possible with some fancy SQL:
SELECT count(DISTINCT created_by)  -- does not count NULL (as desired)
FROM   b
CROSS  JOIN LATERAL (
   WITH RECURSIVE t AS (
      (  -- parentheses required
      SELECT created_by
      FROM   a
      WHERE  org_id = b.org_id
      ORDER  BY created_by
      LIMIT  1
      )
      UNION ALL
      SELECT (SELECT created_by
              FROM   a
              WHERE  org_id = b.org_id
              AND    created_by > t.created_by
              ORDER  BY created_by
              LIMIT  1)
      FROM   t
      WHERE  t.created_by IS NOT NULL  -- stop recursion
      )
   TABLE t
   ) a
WHERE  b.org_name LIKE '%myorg%';
db<>fiddle here (Postgres 12, but works in Postgres 9.6 as well.)
That's a recursive CTE in a LATERAL subquery, using a correlated subquery.
It utilizes the multicolumn index from above to retrieve only a single row for every (org_id, created_by), with index-only scans if the table is vacuumed enough.
The main objective of the sophisticated SQL is to completely avoid a sequential scan (or even a bitmap index scan) on the big table and only read very few fast index tuples.
Due to the added overhead it can be a bit slower for an unfavorable data distribution (many org_id values and/or only a few rows per created_by). But it's much faster under favorable conditions and scales excellently, even for millions of rows. You'll have to test to find the sweet spot.
Related:
Optimize GROUP BY query to retrieve latest row per user
What is the difference between LATERAL and a subquery in PostgreSQL?
To illustrate my question, I will use the following example:
CREATE INDEX supplier_idx
ON supplier (supplier_name);
Will the searching on this table only be sped up if the supplier_name column is specified in the SELECT clause? What if we select the supplier_name column as well as other columns in the SELECT clause? Is searching sped up if this column is used in a WHERE clause, even if it is not in the SELECT clause?
Do the same rules apply to the following index as well:
CREATE INDEX supplier_idx
ON supplier (supplier_name, city);
Indexes can be complex, so a full explanation would take a lot of writing. There are many resources on the internet. (Helpful link here to Oracle indexes)
However, I can just answer your questions simply.
CREATE INDEX supplier_idx
ON supplier (supplier_name);
This means that any joins (and similar operations) using the column supplier_name, and any WHERE clause filtering on supplier_name, will benefit from the index.
For example
SELECT * FROM SomeTable
WHERE supplier_name = 'Smith'
But simply returning the supplier_name column in the SELECT list will not benefit from the index (unless you add complexity to the SELECT clause, which I will cover below). For example, this will not benefit from an index on supplier_name:
SELECT
supplier_name
FROM SomeTable WHERE ID = 1
However, if you added some complexity to your SELECT statement, your index could indeed speed it up...For example:
SELECT
supplier_name -- no index benefit
,(SELECT TOP 1 somedata FROM Table2 WHERE supplier_name = Table2.name) AS SomeValue
-- the line above uses the index as supplier_name is used in WHERE
, CASE WHEN supplier_name = 'Best Supplier'
THEN 'Best'
ELSE 'Worst'
END AS FindBestSupplier
-- Also the CASE statement will use the index on supplier_name
FROM SomeTable WHERE ID = 1
(The 'complexity' above still basically shows that if the field supplier_name is used in a CASE or WHERE clause, as well as in JOINs and aggregations, then the index is very beneficial... The example above is a combination of many clauses wrapped into one SELECT statement.)
But your composite index
CREATE INDEX supplier_idx
ON supplier (supplier_name, city);
would be beneficial in specific and important cases (Eg: where the city is in the SELECT clause and the supplier_name is used in the WHERE clause), for example
SELECT
city
FROM SomeTable WHERE supplier_name = 'Smith'
The reason is that city is stored alongside the supplier_name index values, so when the index finds the supplier_name value, it immediately has a copy of the city value (stored in the index) and does not need to hit the database files to find any more data. (If city was not in the index, it would have to hit the database to pull the city value out, as it does with most data required in the SELECT statement usually)
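Some engines let you make this covering behaviour explicit by storing extra columns in the index leaf pages without adding them to the key; for example, in SQL Server syntax (a sketch on the same supplier table; the index name is hypothetical):

CREATE INDEX supplier_covering_idx
ON supplier (supplier_name)
INCLUDE (city);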
The joins will benefit from an index also, with the example:
SELECT
* FROM SomeTable T1
LEFT JOIN AnotherTable T2
ON T1.supplier_name = T2.supplier_name_2
AND T1.city = T2.city_2
So in summary, if you use the field in any comparison expression like a WHERE clause or a JOIN , or a GROUP BY clause (and the aggregations SUM, MIN, MAX etc)...then an Index is very beneficial for Tables with over a few thousand rows...
(Usually only makes a big difference when you have at least 10,000 rows in a Table, but this can vary depending on your complexity)
SQL Server (for example) can build temporary index-like structures (index spools) on the fly when a useful index is missing, and then discards them. So if you do not create the correct indexes manually, the system can slow down as it rebuilds these structures each time it needs them. (SQL Server will also show you hints on what indexes it thinks you need for a certain query.)
Indexes can slow down UPDATES or INSERTS, so they must be used with a little wisdom and balance...(Sometimes indexes are deleted before a batch of UPDATEs is performed and then the index re-created again, although this is kinda extreme)
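A sketch of that drop-and-recreate pattern, in SQL Server syntax (reusing the earlier supplier_idx example):

DROP INDEX supplier_idx ON supplier;
-- ... run the large batch of UPDATEs / INSERTs ...
CREATE INDEX supplier_idx ON supplier (supplier_name);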
I have the following Postgres query, the query takes 10 to 50 seconds to execute.
SELECT m.match_id FROM match m
WHERE m.match_id NOT IN(SELECT ml.match_id FROM message_log ml)
AND m.account_id = ?
I have created an index on match_id and account_id
CREATE INDEX match_match_id_account_id_idx ON match USING btree
(match_id COLLATE pg_catalog."default",
account_id COLLATE pg_catalog."default");
But still the query takes a long time. What can I do to speed this up and make it efficient? My server load goes to 25 when I have a few of these queries executing.
NOT IN (SELECT ...) can be considerably more expensive because it has to handle NULL separately, and it can produce surprising results when NULL values are involved. Typically, LEFT JOIN / IS NULL (or one of the other related techniques) is faster:
Select rows which are not present in other table
Applied to your query:
SELECT m.match_id
FROM match m
LEFT JOIN message_log ml USING (match_id)
WHERE ml.match_id IS NULL
AND m.account_id = ?;
The best index would be:
CREATE INDEX match_match_id_account_id_idx ON match (account_id, match_id);
Or just on (account_id), assuming that match_id is PK in both tables. You also already have the needed index on message_log(match_id). Else create that, too.
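If the message_log index is missing, a minimal sketch (the index name is an arbitrary choice):

CREATE INDEX message_log_match_id_idx ON message_log (match_id);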
Also, COLLATE pg_catalog."default" in your index definition indicates that your ID columns are character types, which is typically inefficient; integer types would typically serve better.
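A sketch of such a type change in Postgres (assumption: every existing match_id value is in fact numeric, otherwise the cast fails):

ALTER TABLE match ALTER COLUMN match_id TYPE integer USING match_id::integer;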
My educated guess from the little you have shown so far: there are probably more issues.
I am joining two tables with a left join:
The first table is quite simple
create table L (
id integer primary key
);
and contains only a handful of records.
The second table is
create table R (
L_id null references L,
k text not null,
v text not null
);
and contains millions of records.
The following two indexes are on R:
create index R_ix_1 on R(L_id);
create index R_ix_2 on R(k);
This select statement, imho, selects the wrong index:
select
L.id,
R.v
from
L left join
R on
L.id = R.L_id and
R.k = 'foo';
An EXPLAIN QUERY PLAN tells me that the select statement uses the index R_ix_2, and the execution of the select takes too much time. I believe the performance would be much better if SQLite chose to use R_ix_1 instead.
I tried also
select
L.id,
R.v
from
L left join
R indexed by R_ix_1 on
L.id = R.L_id and
R.k = 'foo';
but that gave me Error: no query solution.
Is there something I can do to make sqlite use the other index?
Your join condition relies on 2 columns, so your index should cover those 2 columns:
create index R_ix_1 on R(L_id, k);
If you run some other queries relying on only a single column, you can keep the old indexes, but you still need to have this two-column index as well:
create index R_ix_1 on R(L_id);
create index R_ix_2 on R(k);
create index R_ix_3 on R(L_id, k);
I wonder if the SQLite optimizer just gets confused in this case. Does this work better?
select L.id, R.v
from L left join
R
on L.id = R.L_id
where R.k = 'foo' or R.k is NULL;
EDIT:
Of course, SQLite will only use an index if the affinities of the compared columns match. The question doesn't specify the type of L_id; if its affinity is not the same as that of the primary key it references, then the index (probably) will not be used.
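A minimal sketch of a definition that gives L_id the same INTEGER affinity as L.id (assuming the table can be redefined):

create table R (
  L_id integer references L,
  k text not null,
  v text not null
);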