Let's say that I have the following tables with the given attributes:
TableA: A_ID, B_NUM,C,D
TableB: B_ID, E, F
Having the following query:
SELECT TableA.*,TableB.E,TableB.F FROM TableA
INNER JOIN TableB ON TableA.B_NUM=TableB.B_ID
What index would benefit this query?
I am having a hard time understanding this subject, specifically which index I should create on which table.
Thanks!
This query:
SELECT a.*, b.E, b.F
FROM TableA a INNER JOIN
TableB b
ON a.B_NUM = b.B_ID;
is returning all data that matches between the two tables.
The general advice for indexing a query that has no WHERE or GROUP BY is to add indexes on the columns used for the joins. I would go a little further.
My first guess of the best index would be on TableB(b_id, e, f). This is a covering index for TableB. That means that the SQL engine can use the index to fetch e and f. It does not need to go to the data pages. The index is bigger, but the data pages are not needed. (This is true in most databases; sometimes row-locking considerations make things a bit more complicated.)
On the other hand, if TableA is really big and TableB much smaller so most rows in TableA have no match in TableB, then an index on TableA(B_NUM) would be the better index.
You can include both indexes and let the optimizer decide which to use (it is possible the optimizer would decide to use both, but I think that is unlikely).
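As a rough illustration of the two candidate indexes, here is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in engine (an assumption; the question does not name a database, and other optimizers may choose differently). The index names ix_tableb_cover and ix_tablea_bnum and the sample rows are made up for the example.

```python
import sqlite3

# In-memory SQLite database, used only as a stand-in engine for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE TableA (A_ID INTEGER PRIMARY KEY, B_NUM INTEGER, C TEXT, D TEXT);
    CREATE TABLE TableB (B_ID INTEGER, E TEXT, F TEXT);

    -- Covering index: join key first, then the two selected columns.
    CREATE INDEX ix_tableb_cover ON TableB (B_ID, E, F);
    -- Alternative index on the other side of the join.
    CREATE INDEX ix_tablea_bnum ON TableA (B_NUM);

    -- Hypothetical sample rows; B_NUM = 99 has no match in TableB.
    INSERT INTO TableA VALUES (1, 10, 'c1', 'd1'), (2, 20, 'c2', 'd2'), (3, 99, 'c3', 'd3');
    INSERT INTO TableB VALUES (10, 'e10', 'f10'), (20, 'e20', 'f20');
""")

query = """
    SELECT TableA.*, TableB.E, TableB.F
    FROM TableA
    INNER JOIN TableB ON TableA.B_NUM = TableB.B_ID
"""

rows = sorted(conn.execute(query).fetchall())
# The fourth column of EXPLAIN QUERY PLAN holds the human-readable step.
plan = [r[3] for r in conn.execute("EXPLAIN QUERY PLAN " + query)]

print(rows)
for step in plan:
    print(step)  # shows which of the two indexes the optimizer picked
```

With both indexes in place, the printed plan shows which one the optimizer actually chose for this data; the exact wording of the plan depends on the SQLite version.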
I am having a hard time understanding which indexes I should create.
I made this sample query that contains various situations (select, join, group, order etc.).
What index/indexes should I create on this sample?
Table A: 2 GB in size
Table B: 100 KB in size
SELECT A.AAA, A.BBB, A.CCC, B.mycol
From tableA as A
INNER JOIN tableB as B
ON A.ID = B.ID
WHERE AAA='3'
AND BBB>'2021-10-10'
AND CCC<'2021-11-01'
GROUP BY B.mycol, A.AAA, A.BBB, A.CCC
ORDER BY A.AAA desc
My understanding is that I have to create one single index with the columns A.ID, A.AAA, A.BBB, and A.CCC. Table B does not need an index because it is small and an index wouldn't make any difference.
Is this correct? Or do I need to create multiple indexes?
You want to optimize execution time on the query:
SELECT A.AAA, A.BBB, A.CCC, B.mycol
From tableA as A
INNER JOIN tableB as B
ON A.ID = B.ID
WHERE AAA='3'
AND BBB>'2021-10-10'
AND CCC<'2021-11-01'
GROUP BY B.mycol, A.AAA, A.BBB, A.CCC
ORDER BY A.AAA desc
Since this query filters data using columns in tableA only, tableA will be the driving table.
In the driving table we need to include the filtering columns, considering equality filters first, then non-equality filters, from highest to lowest selectivity. In this case:
AAA
BBB
CCC
The GROUP BY clause has no aggregates, so it only removes duplicate rows and doesn't affect the index choice; we'll ignore it.
The index above already provides the rows in the order required by the ORDER BY clause: the engine can simply walk it backwards for DESC. Therefore, there's no need to tweak the index for this purpose.
Finally, the engine will perform a nested loop to retrieve rows from tableB. In order to do this efficiently the query will need an index on:
ID
mycol (optional, if we want a covering index for higher performance)
In short you'll need the following two indexes:
create index ix1 on tableA (AAA, BBB, CCC);
create index ix2 on tableB (ID);
Please consider that the engine may ignore them anyway, if the histograms of the table say otherwise.
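To make the two indexes concrete, here is a small sketch that runs the query against them using Python's sqlite3 module. SQLite is only a stand-in (the question doesn't say which engine is used), the sample rows are invented, and the plan another engine produces may differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in engine, assumption for the sketch
conn.executescript("""
    CREATE TABLE tableA (ID INTEGER, AAA TEXT, BBB TEXT, CCC TEXT);
    CREATE TABLE tableB (ID INTEGER, mycol TEXT);

    -- Equality column first, then the range columns.
    CREATE INDEX ix1 ON tableA (AAA, BBB, CCC);
    -- Lookup index for the nested loop into tableB.
    CREATE INDEX ix2 ON tableB (ID);

    -- Hypothetical rows: only ID 1 passes all three filters.
    INSERT INTO tableA VALUES
        (1, '3', '2021-10-15', '2021-10-20'),
        (2, '3', '2021-10-01', '2021-10-20'),
        (3, '4', '2021-10-15', '2021-10-20'),
        (4, '3', '2021-10-15', '2021-11-05');
    INSERT INTO tableB VALUES (1, 'x'), (2, 'y'), (3, 'z'), (4, 'w');
""")

query = """
    SELECT A.AAA, A.BBB, A.CCC, B.mycol
    FROM tableA AS A
    INNER JOIN tableB AS B ON A.ID = B.ID
    WHERE A.AAA = '3'
      AND A.BBB > '2021-10-10'
      AND A.CCC < '2021-11-01'
    GROUP BY B.mycol, A.AAA, A.BBB, A.CCC
    ORDER BY A.AAA DESC
"""

rows = conn.execute(query).fetchall()
plan = [r[3] for r in conn.execute("EXPLAIN QUERY PLAN " + query)]

print(rows)
for step in plan:
    print(step)  # check here whether ix1 / ix2 are actually used
```

Reading the printed plan is the quickest way to confirm whether the engine really uses the indexes or ignores them, as cautioned above.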
I have a SQL query that is taking hours to run. My join is on the descriptions of products. Would it be more efficient to create a unique numerical id and join on this instead since the product description is a few sentences long?
Example:
SELECT A.*, B.something
FROM tableA A JOIN tableB B
ON A.product_details = B.product_details
For this query:
SELECT A.*, B.something
FROM tableA A JOIN
tableB B
ON A.product_details = B.product_details
The best index is on B(product_details, something) -- however product_details is most important as the first key.
I generally recommend joining on a numeric key instead. Numeric keys are a bit more efficient, and they reduce the number of things to worry about, such as spaces at the ends of keys and collation conflicts.
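One possible way to move from the text join to a numeric key is sketched below with Python's sqlite3 module. Everything beyond product_details and something is an assumption made for the example: the product lookup table, the product_id, a_id and b_id columns, and the sample rows are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in engine, assumption for the sketch
conn.executescript("""
    CREATE TABLE tableA (a_id INTEGER PRIMARY KEY, product_details TEXT, product_id INTEGER);
    CREATE TABLE tableB (b_id INTEGER PRIMARY KEY, product_details TEXT,
                         something TEXT, product_id INTEGER);
    -- Hypothetical lookup table: one integer id per distinct description.
    CREATE TABLE product (product_id INTEGER PRIMARY KEY, product_details TEXT UNIQUE);

    INSERT INTO tableA (product_details) VALUES
        ('A red widget with a long description.'),
        ('A blue widget with a long description.');
    INSERT INTO tableB (product_details, something) VALUES
        ('A red widget with a long description.', 'in stock'),
        ('A green widget with a long description.', 'discontinued');

    -- Assign an id to every distinct description (UNION removes duplicates).
    INSERT INTO product (product_details)
        SELECT product_details FROM tableA
        UNION
        SELECT product_details FROM tableB;

    -- One-time backfill of the numeric key into both tables.
    UPDATE tableA SET product_id = (
        SELECT p.product_id FROM product AS p
        WHERE p.product_details = tableA.product_details);
    UPDATE tableB SET product_id = (
        SELECT p.product_id FROM product AS p
        WHERE p.product_details = tableB.product_details);

    -- Same shape as B(product_details, something), but on the small integer key.
    CREATE INDEX ix_tableb_product ON tableB (product_id, something);
""")

join_sql = """
    SELECT A.a_id, B.something
    FROM tableA A JOIN tableB B ON A.{key} = B.{key}
    ORDER BY A.a_id
"""
by_text = conn.execute(join_sql.format(key="product_details")).fetchall()
by_id = conn.execute(join_sql.format(key="product_id")).fetchall()
print(by_text, by_id)  # both joins return the same rows
```

The backfill is paid once; after that every join compares integers rather than multi-sentence strings.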
Consider a simple 3 table database in SQL Server 2012.
Table A
AId
Name
Other1
Other2
Table B
BId
Name
Table A_B
BId
AId
Simple example query:
SELECT TOP(20) A.Aid, A.Name, B.Bid, B.Name
FROM A
INNER JOIN A_B ON A.AId = A_B.Aid
INNER JOIN A as AA ON AA.Aid = A_B.Aid
INNER JOIN B ON B.BId = A_B.Bid
WHERE AA.Aid = #aid
AND A.Other1 = #other1
There are millions of rows in table A.
There are thousands of rows in table B.
There are ten times more rows in table A_B than A.
The Other1 and Other2 fields can be used to filter the queries.
Join queries using Top(20) could be done at a rate of 100 requests per second or more (specs are unclear).
The queries will almost always be using different parameters so result caching would not help that much.
What features in SQL Server 2012 can help to improve join query performance given the example above?
My initial thought is that since it's all PK int joins there isn't much that I could do. However I don't know if partitioned views could help.
I'm thinking that probably it's just about adding memory.
Well, the first thing to understand (well, maybe not the first) is that the performance model built into all current versions depends on head seek times versus continuous reads. This may well change with solid-state drives.
Your choice of clustered indexes will be important for keeping frequently queried data together. Also, having a covering index for each part of the query means that the data can be accessed without reading the table itself. Partitioning may help (but it's probably a long way down the list).
Keeping stats up to date is essential. Too often, poor performance comes from undermaintained indexes and stats. Actually, all these things are true right back to SQL 7 (except I don't think SQL 7 had partitioned views).
Having the right RAID structure can alter performance by a factor of 4. The number of tempdb data files should be equivalent to the number of processors (up to about 16), and the tempdb load balancing option should be set to true. Distribute tempdb, logs and data across different I/O paths. No auto shrink: it's evil.
These are the more obvious ones. If you really want to get to grips with large databases, then "Inside SQL" by Kalen Delaney is almost mandatory reading, though it probably costs more than a few GB of RAM. And as you said: more RAM.
First, yes, have a clustered index for the PK
If Table B has fewer rows than Int16 can hold, use Int16 (smallint) for BId
Not for disk space but for more rows in the same amount of memory
The interesting part is Table A_B
The order of that PK will probably affect performance
With just a single PK index, whichever column is second will be slower to join on
Try the order each way
Check the query plan
Check the Database Engine Tuning Advisor
My thought is
PK AId, BId
Nonclustered index on BId, on the basis that that index is smaller
Then flop them around and compare
If the same then go with AId, BId for smaller index size and speed of insert
Then you can go into hints on the joins
Defrag on a regular basis
Insert in the order of the PK
If the data comes in natural order and insert speed is an issue then use that order for the PK
If insert speed is a problem then it may help to disable the non clustered index, insert, and then rebuild the non clustered index
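The key-order suggestion for A_B above (PK on AId, BId plus a separate index on BId) can be tried out roughly as follows. This sketch uses Python's sqlite3 module rather than SQL Server, so a WITHOUT ROWID table stands in for the clustered primary key; that substitution, the index name ix_ab_bid, and the sample rows are assumptions for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite stand-in, not SQL Server
conn.executescript("""
    -- WITHOUT ROWID stores the rows in primary-key order, which roughly
    -- approximates a clustered PK on (AId, BId).
    CREATE TABLE A_B (
        AId INTEGER NOT NULL,
        BId INTEGER NOT NULL,
        PRIMARY KEY (AId, BId)
    ) WITHOUT ROWID;
    -- Secondary index so that lookups by the second key column are not scans.
    CREATE INDEX ix_ab_bid ON A_B (BId);
    INSERT INTO A_B VALUES (1, 10), (1, 11), (2, 10);
""")

def plan(sql, params):
    """Return the readable steps of SQLite's query plan for one statement."""
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql, params)]

by_a = "SELECT AId, BId FROM A_B WHERE AId = ? ORDER BY BId"
by_b = "SELECT AId, BId FROM A_B WHERE BId = ? ORDER BY AId"

rows_a = conn.execute(by_a, (1,)).fetchall()
rows_b = conn.execute(by_b, (10,)).fetchall()
plan_a = plan(by_a, (1,))   # lookup on the leading PK column
plan_b = plan(by_b, (10,))  # lookup on the second column, served by ix_ab_bid

print(rows_a, plan_a)
print(rows_b, plan_b)
```

Swapping the key order to (BId, AId) and moving the secondary index to AId gives the "flop them around and compare" variant.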
Millions and thousands is still not enormous.
And I would not write the query like that
Keep the number of joins down
SELECT TOP(20) A.Aid, A.Name, B.Bid, B.Name
FROM A_B
JOIN A
ON A.Aid = A_B.Aid
JOIN B
ON B.BId = A_B.Bid
WHERE A.Aid = #aid
AND A.Other1 = #other1
That query is very wasteful
Why join on all A.Aid = A_B.Aid to filter to a single AA.Aid in the where
Get the filter to execute early
This may perform better
SELECT TOP(20) A.Aid, A.Name, B.Bid, B.Name
FROM A_B
JOIN A
ON A.Aid = A_B.Aid
AND A.Aid = #aid
AND A.Other1 = #other1
JOIN B
ON B.BId = A_B.Bid
If you can get it to filter before it joins then less work
Check the query plan
A CTE on A with the conditions may coerce it to perform the filter first.
If you cannot get the filter to happen first with a single statement then create a #tempA with AId as a declared PK
(not a CTE; the purpose is to materialize)
INSERT INTO #tempA
SELECT A.AId, A.Name
FROM A
WHERE A.AId = #aid
AND A.Other1 = #other1
If AId is the PK on table A then that query returns 0 or 1 records
The join to #tempA is trivial
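The materialize-then-join idea can be sketched as below. This runs on SQLite through Python's sqlite3 module, so it is only an approximation of the SQL Server version: CREATE TEMP TABLE tempA stands in for #tempA, LIMIT 20 for TOP(20), and ? placeholders for the parameters. The sample rows and the added ORDER BY (there only to make the output deterministic) are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite stand-in, not SQL Server
conn.executescript("""
    CREATE TABLE A (AId INTEGER PRIMARY KEY, Name TEXT, Other1 TEXT, Other2 TEXT);
    CREATE TABLE B (BId INTEGER PRIMARY KEY, Name TEXT);
    CREATE TABLE A_B (BId INTEGER, AId INTEGER, PRIMARY KEY (AId, BId));
    -- Hypothetical sample data.
    INSERT INTO A VALUES (1, 'a1', 'x', 'p'), (2, 'a2', 'y', 'q');
    INSERT INTO B VALUES (10, 'b10'), (11, 'b11');
    INSERT INTO A_B (BId, AId) VALUES (10, 1), (11, 1), (10, 2);
""")
aid, other1 = 1, "x"

# Single statement with the filter moved into the join condition.
single = conn.execute("""
    SELECT A.AId, A.Name, B.BId, B.Name
    FROM A_B
    JOIN A ON A.AId = A_B.AId AND A.AId = ? AND A.Other1 = ?
    JOIN B ON B.BId = A_B.BId
    ORDER BY B.BId
    LIMIT 20
""", (aid, other1)).fetchall()

# Two steps: materialize the filtered row(s) of A first, then join.
conn.execute("CREATE TEMP TABLE tempA (AId INTEGER PRIMARY KEY, Name TEXT)")
conn.execute(
    "INSERT INTO tempA SELECT AId, Name FROM A WHERE AId = ? AND Other1 = ?",
    (aid, other1),
)
materialized = conn.execute("""
    SELECT t.AId, t.Name, B.BId, B.Name
    FROM tempA AS t
    JOIN A_B ON A_B.AId = t.AId
    JOIN B ON B.BId = A_B.BId
    ORDER BY B.BId
    LIMIT 20
""").fetchall()

print(single)
print(materialized)  # same rows, but the filter on A is guaranteed to run first
```

Both forms return the same rows; the temp-table version just forces the filter to run before any join work.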
Given:
Table y
id int clustered index
name nvarchar(25)
Table anothertable
id int clustered Index
name nvarchar(25)
Table someFunction
does some math then returns a valid ID
Compare:
SELECT y.name
FROM y
WHERE dbo.SomeFunction(y.id) IN (SELECT anotherTable.id
FROM AnotherTable)
vs:
SELECT y.name
FROM y
JOIN AnotherTable ON dbo.SomeFunction(y.id) ON anotherTable.id
Question:
While timing these two queries I found that on large data sets the first query, using IN, is much faster than the second query, using an INNER JOIN. I do not understand why. Can someone please help explain?
Execution Plan
Generally speaking IN is different from JOIN in that a JOIN can return additional rows where a row has more than one match in the JOIN-ed table.
From your estimated execution plan, though, it can be seen that in this case the two queries are semantically the same:
SELECT
A.Col1
,dbo.Foo(A.Col1)
,MAX(A.Col2)
FROM A
WHERE dbo.Foo(A.Col1) IN (SELECT Col1 FROM B)
GROUP BY
A.Col1,
dbo.Foo(A.Col1)
versus
SELECT
A.Col1
,dbo.Foo(A.Col1)
,MAX(A.Col2)
FROM A
JOIN B ON dbo.Foo(A.Col1) = B.Col1
GROUP BY
A.Col1,
dbo.Foo(A.Col1)
Even if duplicates are introduced by the JOIN then they will be removed by the GROUP BY as it only references columns from the left hand table. Additionally these duplicate rows will not alter the result as MAX(A.Col2) will not change. This would not be the case for all aggregates however. If you were to use SUM(A.Col2) (or AVG or COUNT) then the presence of the duplicates would change the result.
It seems that SQL Server doesn't have any logic to differentiate between aggregates such as MAX and those such as SUM and so quite possibly it is expanding out all the duplicates then aggregating them later and simply doing a lot more work.
The estimated number of rows being aggregated is 2893.54 for IN vs 28271800 for JOIN but these estimates won't necessarily be very reliable as the join predicate is unsargable.
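The point about duplicates is easy to check on a toy data set. The sketch below uses Python's sqlite3 module, with a Python function registered as Foo standing in for the scalar UDF dbo.Foo; the function body (multiply by 10) and the rows are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite stand-in, not SQL Server
# Hypothetical replacement for dbo.Foo: any deterministic mapping will do.
conn.create_function("Foo", 1, lambda v: v * 10)
conn.executescript("""
    CREATE TABLE A (Col1 INTEGER, Col2 INTEGER);
    CREATE TABLE B (Col1 INTEGER);
    INSERT INTO A VALUES (1, 5), (1, 7), (2, 3);
    -- 10 appears twice, so every matching row of A matches two rows of B.
    INSERT INTO B VALUES (10), (10), (30);
""")

in_sql = """
    SELECT A.Col1, Foo(A.Col1), {agg}(A.Col2)
    FROM A
    WHERE Foo(A.Col1) IN (SELECT Col1 FROM B)
    GROUP BY A.Col1, Foo(A.Col1)
"""
join_sql = """
    SELECT A.Col1, Foo(A.Col1), {agg}(A.Col2)
    FROM A
    JOIN B ON Foo(A.Col1) = B.Col1
    GROUP BY A.Col1, Foo(A.Col1)
"""

max_in = conn.execute(in_sql.format(agg="MAX")).fetchall()
max_join = conn.execute(join_sql.format(agg="MAX")).fetchall()
sum_in = conn.execute(in_sql.format(agg="SUM")).fetchall()
sum_join = conn.execute(join_sql.format(agg="SUM")).fetchall()

print(max_in, max_join)  # identical: duplicates cannot change a maximum
print(sum_in, sum_join)  # different: the JOIN counts each A row twice
```

MAX gives the same answer either way, while SUM doubles under the JOIN because each matching row of A is expanded once per duplicate in B.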
Your second query is a bit funny. Can you try this one instead?
SELECT y.name
FROM dbo.y
INNER JOIN dbo.AnotherTable a ON a.id = dbo.SomeFunction(y.id)
Does that make any difference?
Otherwise: look at the execution plans! And possibly post them here. Without knowing a lot more about your tables (amount and distribution of data etc.) and your system (RAM, disk etc.), it's really hard to give a "globally" valid statement.
Well, for one thing: get rid of the scalar UDF that is implied by dbo.SomeFunction(y.id). That will kill your performance real good. Even if you replace it with a one-row inline table-valued function it will be better.
As for your actual question, I have found similar results in other situations and have been similarly perplexed. The optimizer just treats them differently; I'll be interested to see what answers others provide.
I have some SQL similar to the following, which joins four tables and then orders the results by the "status" column of the first:
SELECT *
FROM a, b, c, d
WHERE b.aid=a.id AND c.id=a.cid AND a.did=d.id AND a.did='XXX'
ORDER BY a.status
It works. However, it's slow. I've worked out this is because of the ORDER BY clause and the lack of any index on table "a".
All four tables have the PRIMARY KEYs set on the "id" column.
So, I know I need to add an index to table a which includes the "status" column but what else does it need to include? Should "bid", "cid" and "did" be in there too?
I've tried to ask this in a general SQL sense but, if it's important, the target is SQLite for use with Gears.
Thanks in advance,
Jake (noob)
I would say it's slow because the engine is doing scans all over the place instead of seeks. Did you mean to do SELECT a.* instead? That would be faster as well, SELECT * here is equivalent to a.*, b.*, c.*, d.*.
You will probably get better results if you put a separate index on each of these columns:
a.did (so that a.did = 'XXX' is a seek instead of a scan, also helps a.did = d.id)
a.cid (for a.cid = c.id)
b.aid (for a.id = b.aid)
You could try adding Status to the first and second indexes with ASCENDING order, for additional performance - it doesn't hurt.
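Since the target is SQLite, the three suggested indexes can be tried directly. Here is a minimal sketch using Python's sqlite3 module; the column types, index names and sample rows are assumptions, and only a few columns are selected to keep the output short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Assumed schema: every table has its PRIMARY KEY on id, as in the question.
    CREATE TABLE a (id INTEGER PRIMARY KEY, cid INTEGER, did TEXT, status TEXT);
    CREATE TABLE b (id INTEGER PRIMARY KEY, aid INTEGER);
    CREATE TABLE c (id INTEGER PRIMARY KEY);
    CREATE TABLE d (id TEXT PRIMARY KEY);

    -- One index per column used to filter or join.
    -- Optional variant: (did, status) instead of (did) to help the ORDER BY.
    CREATE INDEX ix_a_did ON a (did);
    CREATE INDEX ix_a_cid ON a (cid);
    CREATE INDEX ix_b_aid ON b (aid);

    -- Hypothetical sample data.
    INSERT INTO c VALUES (100);
    INSERT INTO d VALUES ('XXX'), ('YYY');
    INSERT INTO a VALUES (1, 100, 'XXX', 'open'),
                         (2, 100, 'XXX', 'closed'),
                         (3, 100, 'YYY', 'open');
    INSERT INTO b VALUES (1, 1), (2, 2), (3, 3);
""")

rows = conn.execute("""
    SELECT a.id, a.status, b.id
    FROM a, b, c, d
    WHERE b.aid = a.id AND c.id = a.cid AND a.did = d.id AND a.did = 'XXX'
    ORDER BY a.status
""").fetchall()

print(rows)  # rows for did = 'XXX', sorted by status
```

The indexes don't change the result, only how the rows are found, so timing the query before and after on real data is the actual test.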
I'd be curious as to how you worked out that the problem is 'the ORDER BY clause and the lack of any index on table "a".' I find this a little suspicious because there is an index on table a, on the primary key, you later say.
Looking at the nature of the query and what I can guess about the nature of the data, I would think that this query would generally produce relatively few results compared to the size of the tables it's using, and that thus the ORDER BY would be extremely cheap. Of course, this is just a guess.
Whether an index will even help at all is dependent on the data in the table. What indices your query optimizer will use when doing a query is dependent on a lot of different factors, one of the big ones being the expected number of results produced from a lookup.
One thing that would help a lot is if you would post the output of EXPLAINing your query.
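In SQLite the plan comes from EXPLAIN QUERY PLAN. The sketch below, run through Python's sqlite3 module, uses a reduced single-table version of the query (an assumption made to keep the output readable) and a hypothetical index name, and compares the plan before and after adding an index on (did, status).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE a (id INTEGER PRIMARY KEY, cid INTEGER, did TEXT, status TEXT)"
)

# Reduced version of the original query: the filter and sort on table a only.
query = "SELECT * FROM a WHERE did = 'XXX' ORDER BY status"

def explain(sql):
    """Return the readable steps of SQLite's plan for a statement."""
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]

plan_before = explain(query)
conn.execute("CREATE INDEX ix_a_did_status ON a (did, status)")
plan_after = explain(query)

print("before:", plan_before)  # a full scan plus a separate sort step
print("after: ", plan_after)   # an index search that already returns sorted rows
```

A step mentioning a temporary B-tree for ORDER BY means SQLite is sorting the rows itself; once the index supplies them in status order, that step disappears. The exact wording of the steps varies between SQLite versions.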
Have you tried joins?
select * from a inner join b on a.id = b.aid inner join c on a.cid = c.id inner join d on a.did=d.id where a.did='XXX'
ORDER BY a.status
The correct use of joins (left, right, inner, outer) depends on the structure of the tables.
Hope this helps.