sql table index - sql

I am having a hard time understanding which indexes I should create.
I made this sample query that covers various situations (select, join, group, order, etc.).
Which index or indexes should I create for this sample?
Table A: 2 GB in size
Table B: 100 KB in size
SELECT A.AAA, A.BBB, A.CCC, B.mycol
From tableA as A
INNER JOIN tableB as B
ON A.ID = B.ID
WHERE AAA='3'
AND BBB>'2021-10-10'
AND CCC<'2021-11-01'
GROUP BY B.mycol, A.AAA, A.BBB, A.CCC
ORDER BY A.AAA desc
My understanding is that I have to create one single index with the columns A.ID, A.AAA, A.BBB, and A.CCC. Table B does not need an index because it is small and an index wouldn't make any difference.
Is this correct, or do I need to create multiple indexes?

You want to optimize execution time on the query:
SELECT A.AAA, A.BBB, A.CCC, B.mycol
From tableA as A
INNER JOIN tableB as B
ON A.ID = B.ID
WHERE AAA='3'
AND BBB>'2021-10-10'
AND CCC<'2021-11-01'
GROUP BY B.mycol, A.AAA, A.BBB, A.CCC
ORDER BY A.AAA desc
Since this query filters data using columns of tableA only, tableA will be the driving table.
The index on the driving table should include the filtering columns: equality filters first, then non-equality filters, ordered from highest to lowest selectivity. In this case:
AAA
BBB
CCC
The GROUP BY clause has no aggregate functions, so it only removes duplicate rows; it does not affect the index choice, so we'll ignore it.
The index above already returns rows in the order required by the ORDER BY clause (the engine can walk it backwards for DESC), so there is no need to tweak the index for this purpose.
Finally, the engine will perform a nested loop to retrieve rows from tableB. To do this efficiently, the query needs an index on:
ID
mycol (optional, if we want a covering index for higher performance)
In short you'll need the following two indexes:
create index ix1 on tableA (AAA, BBB, CCC);
create index ix2 on tableB (ID);
Keep in mind that the engine may ignore these indexes anyway if the table statistics (histograms) indicate that a different plan is cheaper.
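As a rough way to try this out, the sketch below builds a toy copy of the schema with Python's built-in sqlite3 module, creates the two indexes, and prints the plan the engine picks. The column types and sample rows are made up for illustration, and SQLite's planner is only a stand-in: the plan chosen by your engine on a 2 GB table may differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Assumed column types; the question does not state them.
    CREATE TABLE tableA (ID INTEGER, AAA TEXT, BBB TEXT, CCC TEXT);
    CREATE TABLE tableB (ID INTEGER, mycol TEXT);
    CREATE INDEX ix1 ON tableA (AAA, BBB, CCC);
    CREATE INDEX ix2 ON tableB (ID);
""")
conn.executemany("INSERT INTO tableA VALUES (?, ?, ?, ?)", [
    (1, "3", "2021-10-15", "2021-10-20"),  # passes all three filters
    (2, "3", "2021-09-01", "2021-10-20"),  # BBB too early
    (3, "4", "2021-10-15", "2021-10-20"),  # wrong AAA
    (4, "3", "2021-10-15", "2021-11-05"),  # CCC too late
])
conn.executemany("INSERT INTO tableB VALUES (?, ?)",
                 [(1, "x"), (2, "y"), (3, "z"), (4, "w")])

QUERY = """
    SELECT A.AAA, A.BBB, A.CCC, B.mycol
    FROM tableA AS A
    INNER JOIN tableB AS B ON A.ID = B.ID
    WHERE AAA = '3'
      AND BBB > '2021-10-10'
      AND CCC < '2021-11-01'
    GROUP BY B.mycol, A.AAA, A.BBB, A.CCC
    ORDER BY A.AAA DESC
"""

rows = conn.execute(QUERY).fetchall()
# The fourth column of each EXPLAIN QUERY PLAN row is the readable plan step.
plan = [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + QUERY)]
for step in plan:
    print(step)
```

If the printed plan shows a full scan of tableA instead of a search on ix1, the engine decided the index was not worth using for that data.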

Related

Best way to get distinct count from a query joining two tables

I have 2 tables, table A & table B.
Table A (has thousands of rows)
id
uuid
name
type
created_by
org_id
Table B (has a max of hundred rows)
org_id
org_name
I am trying to get the best join query to obtain a count with a WHERE clause. I need the count of distinct created_bys from table A with an org_name in Table B that contains 'myorg'. I currently have the below query (producing expected results) and wonder if this can be optimized further?
select count(distinct a.created_by)
from a left join
b
on a.org_id = b.org_id
where b.org_name like '%myorg%';
You don't need a left join:
select count(distinct a.created_by)
from a join
b
on a.org_id = b.org_id
where b.org_name like '%myorg%'
For this query, you want an index on b.org_id, which I assume that you have.
I would use exists for this:
select count(distinct a.created_by)
from a
where exists (select 1 from b where b.org_id = a.org_id and b.org_name like '%myorg%')
An index on b(org_id) would help. But in terms of performance, key points are:
searching using like with a wildcard on both sides is not good for performance (this cannot take advantage of an index); it would be far better to search for an exact match, or at least to not have a wildcard on the left side of the string.
count(distinct ...) is more expensive than a regular count(); if you don't really need distinct, then don't use it.
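To check that the three formulations above really count the same thing, here is a small sketch using Python's built-in sqlite3 module. The sample rows and the index name a_org_idx are invented for the test; only the three queries come from the thread.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (id INTEGER PRIMARY KEY, created_by TEXT, org_id INTEGER);
    CREATE TABLE b (org_id INTEGER PRIMARY KEY, org_name TEXT);
    CREATE INDEX a_org_idx ON a (org_id);
    INSERT INTO b VALUES (1, 'myorg east'), (2, 'the myorg west'), (3, 'other');
    INSERT INTO a (created_by, org_id) VALUES
        ('ann', 1), ('ann', 2), ('bob', 1),  -- orgs that match '%myorg%'
        ('cid', 3),                          -- org that does not match
        ('dan', 99),                         -- org_id with no row in b
        (NULL, 1);                           -- NULL is not counted
""")

queries = {
    "left_join": """
        SELECT count(DISTINCT a.created_by)
        FROM a LEFT JOIN b ON a.org_id = b.org_id
        WHERE b.org_name LIKE '%myorg%'
    """,
    "inner_join": """
        SELECT count(DISTINCT a.created_by)
        FROM a JOIN b ON a.org_id = b.org_id
        WHERE b.org_name LIKE '%myorg%'
    """,
    "exists": """
        SELECT count(DISTINCT a.created_by)
        FROM a
        WHERE EXISTS (SELECT 1 FROM b
                      WHERE b.org_id = a.org_id
                        AND b.org_name LIKE '%myorg%')
    """,
}

counts = {name: conn.execute(sql).fetchone()[0] for name, sql in queries.items()}
print(counts)
```

The LEFT JOIN behaves like an inner join here because the WHERE condition on b.org_name rejects the NULL-extended rows.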
Your query looks good already. Use a plain [INNER] JOIN instead of LEFT [OUTER] JOIN, like Gordon suggested. But that won't change much.
You mention that table B has only ...
a max of hundred rows
while table A has ...
thousands of rows
If there are many rows per created_by (which I'd expect), then there is potential for an emulated index skip scan.
(The need to emulate it might go away in one of the coming Postgres versions.)
Essential ingredient is this multicolumn index:
CREATE INDEX ON a (org_id, created_by);
It can replace a simple index on just (org_id) and works for your simple query as well. See:
Is a composite index also good for queries on the first field?
There are two complications for your case:
DISTINCT
0-n org_id resulting from org_name like '%myorg%'
So the optimization is harder to implement. But still possible with some fancy SQL:
SELECT count(DISTINCT created_by) -- does not count NULL (as desired)
FROM b
CROSS JOIN LATERAL (
WITH RECURSIVE t AS (
( -- parentheses required
SELECT created_by
FROM a
WHERE org_id = b.org_id
ORDER BY created_by
LIMIT 1
)
UNION ALL
SELECT (SELECT created_by
FROM a
WHERE org_id = b.org_id
AND created_by > t.created_by
ORDER BY created_by
LIMIT 1)
FROM t
WHERE t.created_by IS NOT NULL -- stop recursion
)
TABLE t
) a
WHERE b.org_name LIKE '%myorg%';
db<>fiddle here (Postgres 12, but works in Postgres 9.6 as well.)
That's a recursive CTE in a LATERAL subquery, using a correlated subquery.
It utilizes the multicolumn index from above to only retrieve a single row for every (org_id, created_by), using index-only scans if the table is vacuumed enough.
The main objective of the sophisticated SQL is to completely avoid a sequential scan (or even a bitmap index scan) on the big table and only read very few fast index tuples.
Due to the added overhead it can be a bit slower for an unfavorable data distribution (many org_id and/or only few rows per created_by). But it's much faster under favorable conditions and it scales excellently, even for millions of rows. You'll have to test to find the sweet spot.
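The idea behind the emulated skip scan can be sketched outside the database. In the Python sketch below, a sorted list of (org_id, created_by) tuples stands in for the multicolumn B-tree index, and the hypothetical helper distinct_created_by jumps straight to the next distinct value instead of reading every entry. It illustrates the access pattern only, not how Postgres implements it.

```python
from bisect import bisect_left, bisect_right


def distinct_created_by(index, org_id):
    """Return the distinct created_by values for one org_id.

    `index` is a sorted list of (org_id, created_by) tuples, a stand-in
    for the index on a (org_id, created_by).
    """
    found = []
    # Seek to the first entry for this org_id; (org_id,) sorts before
    # every (org_id, anything) tuple.
    pos = bisect_left(index, (org_id,))
    while pos < len(index) and index[pos][0] == org_id:
        created_by = index[pos][1]
        found.append(created_by)
        # Skip all remaining duplicates of (org_id, created_by) in one seek,
        # like "created_by > t.created_by ORDER BY created_by LIMIT 1".
        pos = bisect_right(index, (org_id, created_by))
    return found


# Made-up sample data: many rows per (org_id, created_by) is the favorable case.
index = sorted([
    (1, "ann"), (1, "ann"), (1, "ann"), (1, "bob"),
    (2, "ann"), (2, "cid"), (2, "cid"),
    (3, "dan"),
])
matching_org_ids = [1, 2]  # assumed result of org_name LIKE '%myorg%' on b

creators = set()
for org_id in matching_org_ids:
    creators.update(distinct_created_by(index, org_id))
print(len(creators))
```

Each loop iteration costs one index seek, so the work grows with the number of distinct (org_id, created_by) pairs rather than with the number of rows.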
Related:
Optimize GROUP BY query to retrieve latest row per user
What is the difference between LATERAL and a subquery in PostgreSQL?
Is there a shortcut for SELECT * FROM?

Impact of index on different columns while join

Let's say we have two tables, table A and table B, each with 5 million records. They have two fields in common, id and name. I want to know what the impact of an index would be if we put it on the join field, and what the impact would be if we put it on the selected column. The query is below:
select t1.name from tableA t1 inner join tableB t2 on t1.id=t2.id;
On which field should I create an index to get a faster result? Should I index id or name? Please help.
My expectation is that an index on the id column will make the query return faster than an index on the name field.
Looking for a performance improvement.
My expectation is that an index on the id column will make the query return faster than an index on the name field.
For this query:
select t1.name
from tableA t1 inner join
tableB t2
on t1.id = t2.id;
I would expect the best index to be tableB(id). This is the key used for the JOIN.
Under some circumstances, an index on tableA(id, name) might be the best alternative. This would be particularly true if tableA were much larger than tableB.

Appropriate Index for a JOIN clause

Let's say that I have the following tables with the given attributes:
TableA: A_ID, B_NUM,C,D
TableB: B_ID, E, F
Having the following query:
SELECT TableA.*,TableB.E,TableB.F FROM TableA
INNER JOIN TableB ON TableA.B_NUM=TableB.B_ID
What index would benefit this query?
I am having a hard time comprehending this subject, in terms of which index I should create where.
Thanks!
This query:
SELECT a.*, b.E, b.F
FROM TableA a INNER JOIN
TableB b
ON a.B_NUM = b.B_ID;
is returning all data that matches between the two tables.
The general advice for indexing a query that has no WHERE or GROUP BY is to add indexes on the columns used for the joins. I would go a little further.
My first guess of the best index would be on TableB(b_id, e, f). This is a covering index for TableB. That means that the SQL engine can use the index to fetch e and f. It does not need to go to the data pages. The index is bigger, but the data pages are not needed. (This is true in most databases; sometimes row-locking considerations make things a bit more complicated.)
On the other hand, if TableA is really big and TableB much smaller so most rows in TableA have no match in TableB, then an index on TableA(B_NUM) would be the better index.
You can include both indexes and let the optimizer decide which to use (it is possible the optimizer would decide to use both, but I think that is unlikely).
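A quick way to see what the optimizer does with both indexes in place is to ask it for the plan. The sketch below uses Python's built-in sqlite3 module as a stand-in engine; the column types, index names, sample rows, and the added ORDER BY (only there to make the output deterministic) are assumptions, and other databases may choose differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE TableA (A_ID INTEGER PRIMARY KEY, B_NUM INTEGER, C TEXT, D TEXT);
    CREATE TABLE TableB (B_ID INTEGER, E TEXT, F TEXT);
    -- Covering index: the join key plus the two selected columns.
    CREATE INDEX tableb_cover ON TableB (B_ID, E, F);
    -- Alternative index on the other side of the join.
    CREATE INDEX tablea_bnum ON TableA (B_NUM);
    INSERT INTO TableB VALUES (10, 'e10', 'f10'), (20, 'e20', 'f20');
    INSERT INTO TableA VALUES
        (1, 10, 'c1', 'd1'), (2, 20, 'c2', 'd2'), (3, 30, 'c3', 'd3');
""")

QUERY = """
    SELECT a.*, b.E, b.F
    FROM TableA a INNER JOIN TableB b ON a.B_NUM = b.B_ID
    ORDER BY a.A_ID
"""

rows = conn.execute(QUERY).fetchall()
# The fourth column of each EXPLAIN QUERY PLAN row describes one plan step.
plan = [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + QUERY)]
for step in plan:
    print(step)
```

A step that mentions a covering index means the engine read e and f from the index alone, without touching the table's data pages.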

Improving SQL cartesian product performance by reducing columns

I have an SQL query which uses cartesian product on a large table. However, I only need one column from one of the tables. Would it actually perform better, if I selected only that one column before using the cartesian product?
So, in other words, would this:
SELECT A.Id, B.Id
FROM (SELECT Id FROM Table1) AS A , Table2 AS B;
be faster than this, given that Table1 has more columns than Id?:
SELECT A.Id, B.Id
FROM Table1 AS A , Table2 AS B;
Or does the number of columns not matter?
On most databases, the two forms would have the same execution plan.
The first could be worse on a database (such as MySQL) that materializes subqueries.
The second should be better with indexes on the two tables: table1(id) and table2(id). The index would be used to get the value rather than the base data.
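Whatever the plan, the two forms must return the same rows, which is easy to confirm on a toy schema. The sketch below uses Python's built-in sqlite3 module; the extra Payload columns and the sample rows are invented to mimic a Table1 that has more columns than Id.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Table1 (Id INTEGER, Payload1 TEXT, Payload2 TEXT);
    CREATE TABLE Table2 (Id INTEGER);
    INSERT INTO Table1 VALUES (1, 'x', 'y'), (2, 'x', 'y');
    INSERT INTO Table2 VALUES (10), (20), (30);
""")

# Form 1: project Id in a subquery before the cartesian product.
with_subquery = conn.execute("""
    SELECT A.Id, B.Id
    FROM (SELECT Id FROM Table1) AS A, Table2 AS B
""").fetchall()

# Form 2: cartesian product directly on the base tables.
direct = conn.execute("""
    SELECT A.Id, B.Id
    FROM Table1 AS A, Table2 AS B
""").fetchall()

# Row order is not guaranteed without ORDER BY, so compare sorted lists.
same_rows = sorted(with_subquery) == sorted(direct)
print(same_rows, len(direct))
```

To compare speed rather than results, run both forms against realistic data volumes and look at the execution plans your database reports.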
Try it out yourself! But generally speaking having a subquery reduce the number of rows will help improve the performance. Your query should, however, be written differently:
select a.id aid, b.id bid from
(Select id from table1 where id=<specific_id>) a, table2 b

SQL Performance: Subselects in joins or direct joins?

I have a question about the performance of queries on the tables below.
Table A -- Has only 5 customer IDs (5 rows, 1 column)
Table B -- Is the master base for all customers and their information (1 million rows and 500 columns)
Query 1:-
Select A.*,
B.Age
from A
left join B
on A.Customer_id = B.Customer_id;
Query 2:-
Select a.*,
B.Age
from A
left join
(select Customer_id,age from B) C
on A.Customer_id = C.Customer_id;
The main question of performance here is because of the presence of 500 columns in Table B.
I feel the 2nd query is better, as SQL won't have to create a temporary table during the join containing all columns from table B.
Please let me know if this is wrong.
I feel the 2nd query is better, as SQL won't have to create a temporary table during the join containing all columns from table B.
You can tell whether Oracle does create a temporary table during the execution or not from the explain plan. You should also consider whether the Oracle kernel developers would not have got round such an obvious performance problem if it existed.
As it happens, there will be no temporary table, and there is nothing wrong with your first query. There is almost never a need to manipulate the query for performance reasons -- write queries that are the best encapsulation of the logic you require.
CREATE INDEX index_name ON table_b (customer_id)
then use
Select a.*,
B.Age
from A
left join (select Customer_id,
age
from B) C
on A.Customer_id = C.Customer_id;
500 columns is rather extensive.
Maybe you can create an index like:
CREATE INDEX index_name
ON table_b (customer_id,
age
);
A scalar subquery in the SELECT list can be faster than a join (whether a direct join or a join to a subselect):
select
a.*,
(select b.age
from b
where b.customer_id = a.customer_id)
from a
note:
It behaves like an outer join: age is NULL when a customer_id from a does not exist in b.
The subquery must return at most one row from b per row from a.
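The equivalence between the scalar subquery and the left join can be checked on a toy schema. The sketch below uses Python's built-in sqlite3 module rather than Oracle; the sample rows, the other_col column, and the index name b_customer_age are assumptions made for the test.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (customer_id INTEGER);
    -- other_col stands in for the remaining columns of the wide table.
    CREATE TABLE b (customer_id INTEGER, age INTEGER, other_col TEXT);
    CREATE INDEX b_customer_age ON b (customer_id, age);
    INSERT INTO a VALUES (1), (2), (3);
    INSERT INTO b VALUES (1, 30, 'x'), (2, 41, 'y');  -- customer 3 is missing
""")

left_join = conn.execute("""
    SELECT a.*, b.age
    FROM a LEFT JOIN b ON a.customer_id = b.customer_id
    ORDER BY a.customer_id
""").fetchall()

scalar_subquery = conn.execute("""
    SELECT a.*,
           (SELECT b.age FROM b WHERE b.customer_id = a.customer_id) AS age
    FROM a
    ORDER BY a.customer_id
""").fetchall()

print(left_join)
print(scalar_subquery)
```

One difference to keep in mind: if b held two rows for the same customer_id, the left join would return both, while the scalar subquery would not. Oracle raises ORA-01427 in that case, whereas SQLite silently uses the first row.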