Let's say we have two tables, table A and table B, each with 5 million records. They have common fields, id and name. I want to know what the impact of an index would be if we create it on the join field, and what it would be if we create it on the selected column instead. Below is the query:
select t1.name from tableA t1 inner join tableB t2 on t1.id=t2.id;
On which field should I create an index in order to get a faster result? Should I put the index on id or on name? Please help.
My expectation is that an index on the id column will make the query return in a shorter time than an index on the name field.
Looking for performance improvement.
For this query:
select t1.name
from tableA t1 inner join
tableB t2
on t1.id = t2.id;
I would expect the best index to be tableB(id). This is the key used for the JOIN.
Under some circumstances, an index on tableA(id, name) might be the best alternative. This would be particularly true if tableA were much larger than tableB.
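As a rough way to check this yourself, here is a minimal sketch that asks an engine for its plan. It assumes SQLite (through Python's built-in sqlite3 module) as a stand-in, since the question does not name a database. The index name ix_tableB_id and the empty tables are made up for illustration, and another engine or real data volumes may lead to a different plan.

```python
import sqlite3

# Assumption: SQLite stands in for the unnamed engine; the tables are empty stand-ins.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tableA (id INTEGER, name TEXT);
    CREATE TABLE tableB (id INTEGER, name TEXT);
    -- Hypothetical index name; it covers the join key on tableB.
    CREATE INDEX ix_tableB_id ON tableB (id);
""")

query = """
    SELECT t1.name
    FROM tableA t1
    INNER JOIN tableB t2 ON t1.id = t2.id
"""

# The last column of each EXPLAIN QUERY PLAN row is a readable plan step.
plan_steps = [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + query)]
for step in plan_steps:
    print(step)
```

If the index on the join key is useful, its name should show up in one of the printed steps. Swapping it for an index on name and rerunning is a quick way to compare the two options.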
Related
I am having a hard time understanding what indexes I should create.
I made this sample query that contains various situations (select, join, group, order, etc.).
What index/indexes should I create for this sample?
Table A: 2 gb in size
Table B: 100kb in size
SELECT A.AAA, A.BBB, A.CCC, B.mycol
From tableA as A
INNER JOIN tableB as B
ON A.ID = B.ID
WHERE AAA='3'
AND BBB>'2021-10-10'
AND CCC<'2021-11-01'
GROUP BY B.mycol, A.AAA, A.BBB, A.CCC
ORDER BY A.AAA desc
My understanding is that I have to create one single index with the columns A.ID, A.AAA, A.BBB, and A.CCC. Table B does not need an index because it is small and an index wouldn't make any difference.
Is this correct, or do I need to create multiple indexes?
You want to optimize execution time on the query:
SELECT A.AAA, A.BBB, A.CCC, B.mycol
From tableA as A
INNER JOIN tableB as B
ON A.ID = B.ID
WHERE AAA='3'
AND BBB>'2021-10-10'
AND CCC<'2021-11-01'
GROUP BY B.mycol, A.AAA, A.BBB, A.CCC
ORDER BY A.AAA desc
Since this query is filtering data using columns in tableA only, then tableA will be the driving table.
In the driving table we need to include the filtering columns, considering equality filters first, then non-equality filters, from highest to lowest selectivity. In this case:
AAA
BBB
CCC
The GROUP BY clause has no aggregate functions, so it only removes duplicate rows; we'll ignore it when choosing indexes.
The index above will provide the rows in the order required by the ORDER BY clause; the engine simply walks it backwards. Therefore, there's no need to tweak the index for this purpose.
Finally, the engine will perform a nested loop to retrieve rows from tableB. In order to do this efficiently the query will need an index on:
ID
mycol (optional, if we want a covering index for higher performance)
In short you'll need the following two indexes:
create index ix1 on tableA (AAA, BBB, CCC);
create index ix2 on tableB (ID);
Please consider that the engine may ignore them anyway, if the histograms of the table say otherwise.
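To see whether an engine actually picks these indexes up, you can ask it for the plan. The sketch below is only an illustration: it assumes SQLite (through Python's sqlite3 module) as a stand-in for the unnamed engine, with empty tables and guessed column types, so the optimizer's choice on your real data may differ.

```python
import sqlite3

# Assumption: SQLite as a stand-in engine; column types are guesses.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tableA (ID INTEGER, AAA TEXT, BBB TEXT, CCC TEXT);
    CREATE TABLE tableB (ID INTEGER, mycol TEXT);
    CREATE INDEX ix1 ON tableA (AAA, BBB, CCC);
    CREATE INDEX ix2 ON tableB (ID);
""")

query = """
    SELECT A.AAA, A.BBB, A.CCC, B.mycol
    FROM tableA AS A
    INNER JOIN tableB AS B ON A.ID = B.ID
    WHERE A.AAA = '3'
      AND A.BBB > '2021-10-10'
      AND A.CCC < '2021-11-01'
    GROUP BY B.mycol, A.AAA, A.BBB, A.CCC
    ORDER BY A.AAA DESC
"""

# Collect the readable step from each row of the plan.
plan_steps = [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + query)]
for step in plan_steps:
    print(step)
```

Look for ix1 on the tableA step and ix2 on the tableB step. If either is missing on your own engine, that is the optimizer ignoring the index, as cautioned above.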
I have 2 tables TABLE1 AND TABLE2.
TABLE1 has columns masterId, Id, col1, col2, category
TABLE2 has columns Id, col1, col2
TABLE2.Id is primary key and TABLE1.Id is foreign key.
TABLE1.masterId is primary key of TABLE1.
TABLE1 has 10 million rows with Id 1 to 10 million and first 10 rows having category = 1
TABLE2 has only 10 rows with Id 1 to 10.
Now I want col1 and col2 values with category=1 (either from TABLE1 OR TABLE2 because the values are same in both tables)
Which among below 2 queries gives output faster?
Solution1:
SELECT T1.col1, T1.col2
FROM TABLE1 T1
WHERE T1.category = 1
Solution2:
SELECT T2.col1, T2.col2
FROM TABLE2 T2
INNER JOIN TABLE1 T1 ON T1.Id = T2.Id
WHERE T1.category = 1
Does Solution2 save table scan time on the millions of rows of TABLE1?
Limitation is:
In my real db scenario, I can make Table1.Id as non clustered index and Table1.category also non clustered index. I cannot make Table1.Id as clustered index because I actually have another auto increment column as primary key in my Table1 in real scenario. So please share your thoughts with this limitation.
Please confirm and share thoughts on this.
It depends on the existing indexes. With a nonclustered index on Id in T1, solution 2 might perform better than solution 1, which would require a complete table scan to select the rows with category = 1. If instead we also have a nonclustered index on Category, then solution 1 will be faster, since it would only have to seek the nonclustered index to find the rows.
Without any index on Id in T1, a full scan would be required to find each T2.Id row, so there might be 10 full scans of T1 for solution 2 versus 1 full scan of T1 for solution 1, and solution 1 might be faster. But this depends on the query optimizer, and testing the real case to see the actual execution plans would be the best way to answer.
But the way to go is to implement the right model and then proceed to create the indexes needed to make the query run fast.
Edit: adapted the answer according to the query edits.
Edit2: a covering index would be expensive, and 10 index seeks on the PK of table 1 would not cost much.
[Notice]
This answer was given for an older version of the question, https://stackoverflow.com/revisions/65263530/7
The scenario back then was:
T2 also had a category column, and,
the second query was:
SELECT T2.col1, T2.col2
FROM TABLE2 T2
INNER JOIN TABLE1 T1 ON T1.categoryId = T2.category Id
WHERE T2.category = 1
Assuming the only indices are the PKs, nope, Solution 2 will NOT avoid the table scan. Worse:
Solution 1
Full table scan
Solution 2
Full table scan on T2 (T2.category) and then nested loops (T2.category = T1.category)
Please, what are your goals here?
To begin with, this statement shows a lack of understanding of databases:
first 10 rows having category = 1
SQL tables represent unordered sets. There is no such thing as "first 10 rows". In the context of your question, I think you mean "the 10 rows with the lowest values of the id". However, the ordering of the table is still arbitrary from the perspective of the engine. There are situations where a clustered index could reasonably be assumed to be a "table ordering", but there is never a guarantee that:
select *
from t;
returns data in a particular ordering even with a clustered index.
Two possible execution plans for the first query -- depending on the indexing -- are:
(1) Scanning the table (i.e. reading millions of rows) and doing the test for each row.
(2) Scanning an index on category and just fetching the rows that are needed.
In general, (1) would be much, much slower than (2) when the scanned rows number in the millions and the returned rows are just a few. However, this may not be true if a significant proportion of all records were returned.
I interpret your question as asking whether the second query could ever be faster than the first:
SELECT T2.col1, T2.col2
FROM TABLE2 T2 INNER JOIN
TABLE1 T1
ON T1.Id = T2.Id
WHERE T1.category = 1;
The answer is "definitely faster than the scan". This is possible if you have an index on Table1(id, category). However, the query would be better written using EXISTS:
select t2.*
from table2 t2
where exists (select 1
from table1 t1
where t1.id = t2.id and t1.category = 1
);
I would expect this to be faster than the indexed version of the first query as well. Even with an index on (category), the database still has to fetch the data for the select. If the data is on one page (as the "first" statement might suggest), then the two might be quite comparable. However, it would be hard to measure the difference in performance with the correct indexing on table1.
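A quick sanity check that the JOIN and EXISTS forms return the same rows can be sketched as follows. This is an illustration only: it assumes SQLite (through Python's sqlite3 module) rather than SQL Server, a scaled-down table1 with 1000 rows instead of 10 million, invented column values, a hypothetical index name, and the category filter applied to table1 as in the question's schema.

```python
import sqlite3

# Assumption: SQLite stand-in with a scaled-down copy of the question's schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table2 (id INTEGER PRIMARY KEY, col1 TEXT, col2 TEXT);
    CREATE TABLE table1 (
        masterId INTEGER PRIMARY KEY,
        id INTEGER, col1 TEXT, col2 TEXT, category INTEGER
    );
    -- Hypothetical name for the index on Table1(id, category).
    CREATE INDEX ix_table1_id_category ON table1 (id, category);
""")

# table2 holds ids 1..10; table1 holds ids 1..1000, only ids 1..10 in category 1.
conn.executemany(
    "INSERT INTO table2 VALUES (?, ?, ?)",
    [(i, f"a{i}", f"b{i}") for i in range(1, 11)],
)
conn.executemany(
    "INSERT INTO table1 (id, col1, col2, category) VALUES (?, ?, ?, ?)",
    [(i, f"a{i}", f"b{i}", 1 if i <= 10 else 2) for i in range(1, 1001)],
)

join_rows = conn.execute("""
    SELECT t2.col1, t2.col2
    FROM table2 t2
    INNER JOIN table1 t1 ON t1.id = t2.id
    WHERE t1.category = 1
    ORDER BY t2.id
""").fetchall()

exists_rows = conn.execute("""
    SELECT t2.col1, t2.col2
    FROM table2 t2
    WHERE EXISTS (SELECT 1 FROM table1 t1
                  WHERE t1.id = t2.id AND t1.category = 1)
    ORDER BY t2.id
""").fetchall()

print(join_rows == exists_rows, len(exists_rows))
```

The two forms agree here because table1.id is unique in the sample data. If table1 could hold several matching rows per id, the JOIN would return duplicates while EXISTS would not.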
A note about clustered indexes in SQL Server. If the id is an identity primary key and there is no other clustered index, then it is automatically used as the clustered index.
I have three tables; table3 is basically the intermediate table between table1 and table2. When I execute a query that contains "in" and joins table1 and table3, it just keeps running and I never get the result. If I use id=134 instead of id in (134,267,390,4234 ... ), the result comes up. I don't understand why "in" has this effect. Does anyone have an idea?
Query statement:
select count(*) from table1, table3 on id=table3.table1_id where table3.table2_id = 123 and id in (134,267,390,4234) and item = 30;
table structure:
table1:
id integer primary key,
item integer
table2:
id integer,
item integer
table3:
table1_id integer,
table2_id integer
-- the DB without index was 0.8 TB after the three indices is now 2.5 TB
indices on: table1.item, table3.table1_id, table3.table2_id
env: Linux, sqlite 3.7.17
from table1, table3 is a cross join on most databases, and with the size of your data a cross join is enormous, but in SQLite3 it's an inner join. From the SQLite SELECT docs:
Side note: Special handling of CROSS JOIN. There is no difference between the "INNER JOIN", "JOIN" and "," join operators. They are completely interchangeable in SQLite.
That's not your problem in this specific instance, but let's not tempt fate; always write out your joins explicitly.
select count(*)
from table1
join table3 on id=table3.table1_id
where table3.table2_id = 123
and id in (134,267,390,4234);
Since you're just counting, you don't need any data from table1 but the ID. table3 has table1_id, so there's no need to join with table1. We can do this entirely with the table3 join table.
select count(*)
from table3
where table2_id = 123
and table1_id in (134,267,390,4234);
SQLite can only use one index per table. For this to be performant on such a large data set, you need a composite index of both columns: table3(table1_id, table2_id). Presumably you don't want duplicates, so this should take the form of a unique index. That will cover queries for just table1_id and queries for both table1_id and table2_id; you should drop your table1_id index to save space and time.
create unique index table3_unique on table3(table1_id, table2_id);
The composite index will not help queries which use only table2_id, so keep your existing table2_id index.
Your query should now run lickety-split.
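Here is a small sketch of the table3-only query with the composite index in place, run through Python's sqlite3 module. The rows are made up for illustration (table1 ids 100 to 499, each linked to table2 ids 122 and 123), so only the shape of the query and the plan carry over to the real data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table3 (table1_id INTEGER, table2_id INTEGER);
    CREATE UNIQUE INDEX table3_unique ON table3 (table1_id, table2_id);
""")

# Made-up sample rows: every table1 id in 100..499 links to table2 ids 122 and 123.
sample_rows = [(t1, t2) for t1 in range(100, 500) for t2 in (122, 123)]
conn.executemany("INSERT INTO table3 VALUES (?, ?)", sample_rows)

query = """
    SELECT count(*)
    FROM table3
    WHERE table2_id = 123
      AND table1_id IN (134, 267, 390, 4234)
"""

match_count = conn.execute(query).fetchone()[0]
print("matches:", match_count)

# Each plan row ends with a readable step; look for table3_unique in it.
plan_steps = [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + query)]
for step in plan_steps:
    print(step)
```

Because both columns in the WHERE clause are in the index, SQLite can answer the count from the index alone without touching the table rows.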
For more, read about the SQLite Query Optimizer.
A terabyte is a lot of data. While SQLite technically can handle this, it might not be the best choice. It's great for small and simple databases, but it's missing a lot of features. You should look into a more powerful database such as PostgreSQL. It is not a magic bullet, all the same principles apply, but it is much more appropriate for data at that scale.
I have two tables. One is a comparatively small table with 20000 records and the columns id (unique key), name, zipcode. The other is a huge table with more than 1 billion records and the columns id (unique key), name, age, address and active status (boolean). I want to take the records which are not active in the second table and check whether those inactive records are present in the first table. I don't know how to loop over the records of the first table in a single query. How can I do that in Db2?
You may use EXISTS logic here:
SELECT t1.*
FROM Table1 t1
WHERE EXISTS (SELECT 1 FROM Table2 t2 WHERE t2.id = t1.id AND t2.status = false);
Note that the above query might benefit from the following index on the second table:
CREATE INDEX idx2 ON Table2 (id, status);
This might let the lookup proceed much faster. Note that we chose to express your logic by scanning the first table, and looking up in the second, as the first table is much smaller than the second.
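The behaviour of the EXISTS query can be tried out on a handful of rows. The sketch below is not Db2: it assumes SQLite (through Python's sqlite3 module) as a stand-in, stores the boolean status as 0/1, and uses invented sample rows.

```python
import sqlite3

# Assumption: SQLite stand-in for Db2; status is stored as 0 (inactive) or 1 (active).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Table1 (id INTEGER, name TEXT, zipcode TEXT);
    CREATE TABLE Table2 (id INTEGER, name TEXT, age INTEGER,
                         address TEXT, status INTEGER);
    CREATE INDEX idx2 ON Table2 (id, status);
""")

# Invented rows: ids 2, 3 and 4 are inactive, but id 4 is missing from Table1.
conn.executemany(
    "INSERT INTO Table1 (id, name) VALUES (?, ?)",
    [(1, "ann"), (2, "bob"), (3, "cy")],
)
conn.executemany(
    "INSERT INTO Table2 (id, name, status) VALUES (?, ?, ?)",
    [(1, "ann", 1), (2, "bob", 0), (3, "cy", 0), (4, "dee", 0)],
)

# Scan the small table and probe the big one through the (id, status) index.
inactive_ids = [row[0] for row in conn.execute("""
    SELECT t1.id
    FROM Table1 t1
    WHERE EXISTS (SELECT 1 FROM Table2 t2
                  WHERE t2.id = t1.id AND t2.status = 0)
    ORDER BY t1.id
""")]
print(inactive_ids)
```

Only ids that are both inactive in Table2 and present in Table1 come back, which is the check the question asks for.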
I have a table whose columns all store ids of other (huge) tables.
CREATE TABLE #mytable (
Table1Id int,
Table2Id int,
Table3Id int,
Table4Id int,
Table5Id int,
)
Now my select joins to all the tables whose ids are stored in the columns of my table.
select T1.col1, t2.Col1, T3.col1... from
#mytable MyTable inner join table1 T1 on MyTable.Table1Id = T1.Id
inner join table2 T2 on MyTable.Table2Id = T2.Id
inner join table3 T3 on MyTable.Table3Id = T3.Id
inner join table4 T4 on MyTable.Table4Id = T4.Id
inner join table5 T5 on MyTable.Table5Id = T5.Id
order by T1.Col1, T2.col1
At the moment I only have an index on Table1Id and on all the id columns of the other tables. Any suggestions to improve the performance?
You don't say which column your index is currently defined on, but based on your example query, you should create an index on all five columns:
Table1Id, Table2Id, Table3Id, Table4Id, Table5Id
This allows the SQL engine to resolve the query just by reading the index, which should be faster than reading the index, then reading the table.
If you run queries where you access only some of the columns, then you need an index for those columns as well. Let's say you run a query on Table3Id and Table4Id. Then you need to create an index on:
Table3Id, Table4Id
I can't tell from the information you provided in your question if these indexes should be unique or non unique. You would have to make that determination.
Examine #mytable
You have no search criteria on that table
no where
no order by
no group by
You are just going to get those rows in no particular order.
There is no use for any index on #mytable
The index Table1Id is not used by that query and will slow down inserts
I suspect #mytable is just an output table and the where conditions are used to populate that table.
The join will use the ID on the table to be joined.
So index ID on table1-x and index it as a PK (clustered) if you can.
If that index is fragmented then defrag.
That join should be an index seek and you can't do any better.
Verify the query plan has index seeks on the joins.
If you don't have index seeks on those joins then post the query plan.
You could experiment with hints on the join but I suspect the query optimizer will get it right - that may be a big query but it is not a complex query.
Since SQL Server reads whole pages, if you order #mytable by the individual columns you have a better chance of the needed page already being in memory.
A PK is free IF you can insert in the order of the PK.
In that case you would put the column with the most values in the last position.
Actually, you would put the column with the tightest groupings of PK in the last position.
And then sort by PK.
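The advice above to verify the query plan can be sketched like this. It is an illustration only: it assumes SQLite (through Python's sqlite3 module) instead of SQL Server, trims the query to two joined tables, and uses a plain table named mytable in place of the temp table. SQLite reports a primary key lookup as a SEARCH step rather than an index seek, but the idea is the same.

```python
import sqlite3

# Assumption: SQLite stand-in; mytable replaces #mytable and only two joins are shown.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (Id INTEGER PRIMARY KEY, col1 TEXT);
    CREATE TABLE table2 (Id INTEGER PRIMARY KEY, col1 TEXT);
    CREATE TABLE mytable (Table1Id INTEGER, Table2Id INTEGER);
""")

query = """
    SELECT T1.col1, T2.col1
    FROM mytable
    INNER JOIN table1 T1 ON mytable.Table1Id = T1.Id
    INNER JOIN table2 T2 ON mytable.Table2Id = T2.Id
"""

# Keep the readable step from each plan row.
plan_steps = [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + query)]

# Steps that look up a joined table through its primary key.
pk_lookups = [step for step in plan_steps if "PRIMARY KEY" in step]

for step in plan_steps:
    print(step)
```

In SQL Server the equivalent check is to look at the actual execution plan and confirm that each joined table shows a seek on its primary key rather than a scan.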
For the statement that you have put in your question, there is probably little that you can do. In fact, indexes could even hurt under some circumstances if you are in a memory limited environment.
As a first step, though, you should have indexes in the numbered tables on the id column. That is, you should be storing and then joining on the primary key of these tables (the index is automatic on a primary key).
Generally, the purpose of indexes is to prevent scanning an entire table to find a particular set of records. In this case, it looks like you want all the records anyway, so full-table scans are necessary. That limits the applicability of indexes. There is a good chance that SQL Server will turn these joins into hash joins, which is an efficient way of joining a table when you need to read all the rows.
Additional indexes might be warranted depending on where and group by clauses.