How To Create a SQL Index to Improve ORDER BY Performance

I have some SQL similar to the following, which joins four tables and then orders the results by the "status" column of the first:
SELECT *
FROM a, b, c, d
WHERE b.aid=a.id AND c.id=a.cid AND a.did=d.id AND a.did='XXX'
ORDER BY a.status
It works. However, it's slow. I've worked out this is because of the ORDER BY clause and the lack of any index on table "a".
All four tables have their PRIMARY KEY set on the "id" column.
So, I know I need to add an index to table a which includes the "status" column but what else does it need to include? Should "bid", "cid" and "did" be in there too?
I've tried to ask this in a general SQL sense but, if it's important, the target is SQLite for use with Gears.
Thanks in advance,
Jake (noob)

I would say it's slow because the engine is doing scans all over the place instead of seeks. Did you mean to do SELECT a.* instead? That would be faster as well; SELECT * here is equivalent to a.*, b.*, c.*, d.*.
You will probably get better results if you put a separate index on each of these columns:
a.did (so that a.did = 'XXX' is a seek instead of a scan, also helps a.did = d.id)
a.cid (for a.cid = c.id)
b.aid (for a.id = b.aid)
You could try adding status to the first and second indexes, in ascending order, for additional performance - it doesn't hurt.
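A quick way to sanity-check this advice is SQLite's EXPLAIN QUERY PLAN, which matters here since the question targets SQLite. Below is a minimal sketch using Python's built-in sqlite3 module; the table and column types are guessed from the question, and the index names are invented:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE a (id INTEGER PRIMARY KEY, cid INTEGER, did TEXT, status INTEGER);
CREATE TABLE b (id INTEGER PRIMARY KEY, aid INTEGER);
CREATE TABLE c (id INTEGER PRIMARY KEY);
CREATE TABLE d (id TEXT PRIMARY KEY);
-- The separate indexes suggested above, with status appended to the did index:
CREATE INDEX idx_a_did ON a(did, status);
CREATE INDEX idx_a_cid ON a(cid);
CREATE INDEX idx_b_aid ON b(aid);
""")
plan = con.execute("""
EXPLAIN QUERY PLAN
SELECT a.* FROM a, b, c, d
WHERE b.aid = a.id AND c.id = a.cid AND a.did = d.id AND a.did = 'XXX'
ORDER BY a.status
""").fetchall()
for row in plan:
    print(row[3])  # e.g. "SEARCH a USING INDEX idx_a_did (did=?)"
```

With did leading the index, the filter becomes a seek rather than a scan; and because status follows it in the same index, rows for a given did come back already sorted, which can let SQLite skip a separate sort step for the ORDER BY.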

I'd be curious as to how you worked out that the problem is 'the ORDER BY clause and the lack of any index on table "a".' I find this a little suspicious because there is an index on table a, on the primary key, you later say.
Looking at the nature of the query and what I can guess about the nature of the data, I would think that this query would generally produce relatively few results compared to the size of the tables it's using, and that thus the ORDER BY would be extremely cheap. Of course, this is just a guess.
Whether an index will even help at all is dependent on the data in the table. What indices your query optimizer will use when doing a query is dependent on a lot of different factors, one of the big ones being the expected number of results produced from a lookup.
One thing that would help a lot is if you would post the output of EXPLAINing your query.

Have you tried explicit joins?
SELECT *
FROM a
INNER JOIN b ON a.id = b.aid
INNER JOIN c ON a.cid = c.id
INNER JOIN d ON a.did = d.id
WHERE a.did = 'XXX'
ORDER BY a.status
The correct use of joins (left, right, inner, outer) depends on the structure of the tables.
Hope this helps.


Appropriate Index for a JOIN clause

Let's say that I have the following tables with the given attributes:
TableA: A_ID, B_NUM, C, D
TableB: B_ID, E, F
Having the following query:
SELECT TableA.*,TableB.E,TableB.F FROM TableA
INNER JOIN TableB ON TableA.B_NUM=TableB.B_ID
What index would benefit this query?
I am having a hard time comprehending this subject - what index should I create, and where?
Thanks!
This query:
SELECT a.*, b.E, b.F
FROM TableA a INNER JOIN
TableB b
ON a.B_NUM = b.B_ID;
is returning all data that matches between the two tables.
The general advice for indexing a query that has no WHERE or GROUP BY is to add indexes on the columns used for the joins. I would go a little further.
My first guess of the best index would be on TableB(b_id, e, f). This is a covering index for TableB. That means that the SQL engine can use the index to fetch e and f. It does not need to go to the data pages. The index is bigger, but the data pages are not needed. (This is true in most databases; sometimes row-locking considerations make things a bit more complicated.)
On the other hand, if TableA is really big and TableB much smaller so most rows in TableA have no match in TableB, then an index on TableA(B_NUM) would be the better index.
You can include both indexes and let the optimizer decide which to use (it is possible the optimizer would decide to use both, but I think that is unlikely).
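The covering-index effect is easy to observe in SQLite via Python's sqlite3 module. A sketch under assumptions: the index name idx_b_cover is made up, and since the question declares no keys, none are created here:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE TableA (A_ID INTEGER, B_NUM INTEGER, C TEXT, D TEXT);
CREATE TABLE TableB (B_ID INTEGER, E TEXT, F TEXT);
-- Covering index: holds every TableB column the query touches.
CREATE INDEX idx_b_cover ON TableB(B_ID, E, F);
""")
plan = con.execute("""
EXPLAIN QUERY PLAN
SELECT TableA.*, TableB.E, TableB.F
FROM TableA
INNER JOIN TableB ON TableA.B_NUM = TableB.B_ID
""").fetchall()
details = " ".join(row[3] for row in plan)
print(details)
```

The plan reports "USING COVERING INDEX" for TableB: E and F are fetched from the index itself, with no visit to the table's data pages.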

SQL Server 2012 join performance features?

Consider a simple 3-table database in SQL Server 2012:
Table A: AId, Name, Other1, Other2
Table B: BId, Name
Table A_B: BId, AId
Simple example query:
SELECT TOP(20) A.Aid, A.Name, B.Bid, B.Name
FROM A
INNER JOIN A_B ON A.AId = A_B.Aid
INNER JOIN A AS AA ON AA.Aid = A_B.Aid
INNER JOIN B ON B.BId = A_B.Bid
WHERE AA.Aid = @aid
AND A.Other1 = @other1
There are millions of rows in table A.
There are thousands of rows in table B.
There are ten times more rows in table A_B than A.
The Other1 and Other2 fields can be used to filter the queries.
Join queries using TOP(20) may need to run at a rate of 100 requests per second or more (the specs are unclear).
The queries will almost always be using different parameters so result caching would not help that much.
What features in SQL Server 2012 can help to improve join query performance, given the example above?
My initial thought is that since it's all PK int joins there isn't much that I could do. However I don't know if partitioned views could help.
I'm thinking that probably it's just about adding memory.
Well, the first thing to understand (well, maybe not the first) is that a performance model is built into all current versions, which depends on head seek times vs. continuous reads; this may well change with solid-state drives. Beyond that:
Your choice of clustered indexes is important, to keep frequently queried data together.
Having a covering index for each part of the query means the data can be accessed without reading the table itself.
Partitioning may help (but it's probably a long way down the list).
Keeping stats up to date is essential. Too often, poor performance comes from under-maintained indexes and stats.
Actually, all these things are true right back to SQL Server 7 (except I don't think SQL 7 had partitioned views).
Having the right RAID structure can alter performance by a factor of 4.
The number of tempdb files should be equivalent to the number of processors (up to about 16), and the tempdb load balancing option should be set to true.
Keep tempdb, logs, and data distributed across different I/O paths.
No auto-shrink - it's evil.
These are the more obvious ones. If you really want to get to grips with a large DB, then "Inside Microsoft SQL Server" by Kalen Delaney is almost mandatory reading, though it probably costs more than a few GB of RAM. And, as you said - more RAM.
First: yes, have a clustered index for the PK.
If Table B's key fits in a smallint, use smallint instead of int -
not for disk space, but for more rows in the same amount of memory.
The interesting part is Table A_B.
The order of that PK will probably affect performance:
with just a single composite PK index, whichever column is second will make for a slower join.
Try the order each way.
Check the query plan.
Check the tuning advisor.
My thought is:
PK (AId, BId),
plus a non-clustered index on BId, since that index is smaller.
Then flip them around and compare.
If the same, then go with (AId, BId) for smaller index size and speed of insert.
Then you can go into hints on the joins.
Defrag on a regular basis.
Insert in the order of the PK.
If the data comes in natural order and insert speed is an issue, then use that order for the PK.
If insert speed is a problem, it may help to disable the non-clustered index, insert, and then rebuild the non-clustered index.
Millions and thousands is still not enormous.
And I would not write the query like that.
Keep the number of joins down:
SELECT TOP(20) A.Aid, A.Name, B.Bid, B.Name
FROM A_B
JOIN A
ON A.Aid = A_B.Aid
JOIN B
ON B.BId = A_B.Bid
WHERE A.Aid = @aid
AND A.Other1 = @other1
The original query is very wasteful.
Why join on all A.Aid = A_B.Aid only to filter down to a single Aid in the WHERE?
Get the filter to execute early.
This may perform better:
SELECT TOP(20) A.Aid, A.Name, B.Bid, B.Name
FROM A_B
JOIN A
ON A.Aid = A_B.Aid
AND A.Aid = @aid
AND A.Other1 = @other1
JOIN B
ON B.BId = A_B.Bid
If you can get it to filter before it joins, there is less work.
Check the query plan
A CTE on A with the conditions may coerce it to perform the filter first.
If you cannot get the filter to happen first with a single statement, then create a #tempA with AId as a declared PK
(not a CTE - the purpose is to materialize):
INSERT INTO #tempA
SELECT AId, Name
FROM A
WHERE A.Aid = @aid
AND A.Other1 = @other1
If AId is the PK on Table A, then that query returns 0 or 1 records.
The join to #tempA is trivial
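A minimal SQLite analogue of that temp-table materialization idea, runnable via Python's sqlite3 (the original targets SQL Server, so #tempA becomes a TEMP TABLE here, TOP(20) becomes LIMIT 20, and all table contents are invented for illustration):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE A (AId INTEGER PRIMARY KEY, Name TEXT, Other1 TEXT, Other2 TEXT);
CREATE TABLE B (BId INTEGER PRIMARY KEY, Name TEXT);
CREATE TABLE A_B (BId INTEGER, AId INTEGER, PRIMARY KEY (AId, BId));
INSERT INTO A VALUES (1, 'alpha', 'x', 'y'), (2, 'beta', 'x', 'y');
INSERT INTO B VALUES (10, 'b10'), (11, 'b11');
INSERT INTO A_B VALUES (10, 1), (11, 1), (10, 2);
""")
aid, other1 = 1, 'x'
# Materialize the 0-or-1 matching A rows first, then join from there.
con.execute("CREATE TEMP TABLE tempA (AId INTEGER PRIMARY KEY, Name TEXT)")
con.execute("INSERT INTO tempA SELECT AId, Name FROM A WHERE AId = ? AND Other1 = ?",
            (aid, other1))
rows = con.execute("""
SELECT t.AId, t.Name, B.BId, B.Name
FROM tempA t
JOIN A_B ON A_B.AId = t.AId
JOIN B ON B.BId = A_B.BId
ORDER BY B.BId
LIMIT 20
""").fetchall()
print(rows)  # [(1, 'alpha', 10, 'b10'), (1, 'alpha', 11, 'b11')]
```

Because the filter ran before any join, the join work is proportional to the (tiny) materialized set rather than to all of A.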

SQL query optimization using IN over INNER JOIN

Given:
Table y: id int (clustered index), name nvarchar(25)
Table anotherTable: id int (clustered index), name nvarchar(25)
Function someFunction: does some math, then returns a valid id
Compare:
SELECT y.name
FROM y
WHERE dbo.SomeFunction(y.id) IN (SELECT anotherTable.id
FROM AnotherTable)
vs:
SELECT y.name
FROM y
JOIN AnotherTable ON dbo.SomeFunction(y.id) = anotherTable.id
Question:
While timing these two queries, I found that at large data sets the first query using IN is much faster than the second query using an INNER JOIN. I do not understand why - can someone help explain, please?
Execution Plan
Generally speaking, IN is different from JOIN in that a JOIN can return additional rows when a row has more than one match in the JOIN-ed table.
From your estimated execution plan, though, it can be seen that in this case the two queries are semantically the same:
SELECT
A.Col1
,dbo.Foo(A.Col1)
,MAX(A.Col2)
FROM A
WHERE dbo.Foo(A.Col1) IN (SELECT Col1 FROM B)
GROUP BY
A.Col1,
dbo.Foo(A.Col1)
versus
SELECT
A.Col1
,dbo.Foo(A.Col1)
,MAX(A.Col2)
FROM A
JOIN B ON dbo.Foo(A.Col1) = B.Col1
GROUP BY
A.Col1,
dbo.Foo(A.Col1)
Even if duplicates are introduced by the JOIN then they will be removed by the GROUP BY as it only references columns from the left hand table. Additionally these duplicate rows will not alter the result as MAX(A.Col2) will not change. This would not be the case for all aggregates however. If you were to use SUM(A.Col2) (or AVG or COUNT) then the presence of the duplicates would change the result.
It seems that SQL Server doesn't have any logic to differentiate between aggregates such as MAX and those such as SUM and so quite possibly it is expanding out all the duplicates then aggregating them later and simply doing a lot more work.
The estimated number of rows being aggregated is 2893.54 for IN vs 28271800 for JOIN but these estimates won't necessarily be very reliable as the join predicate is unsargable.
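The MAX-vs-SUM point above is easy to demonstrate with a toy example in SQLite (via Python's sqlite3; all data is invented, and the join is on plain columns rather than a UDF, but the duplicate-row effect is the same):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE A (Col1 INTEGER, Col2 INTEGER);
CREATE TABLE B (Col1 INTEGER);
INSERT INTO A VALUES (1, 5), (1, 7);
INSERT INTO B VALUES (1), (1);  -- duplicate match: each A row joins twice
""")
# JOIN form: the duplicates in B multiply the aggregated rows.
mx, sm = con.execute("""
SELECT MAX(A.Col2), SUM(A.Col2)
FROM A JOIN B ON A.Col1 = B.Col1
GROUP BY A.Col1
""").fetchone()
# IN form: each A row is counted at most once.
in_sum = con.execute("""
SELECT SUM(Col2) FROM A
WHERE Col1 IN (SELECT Col1 FROM B)
GROUP BY Col1
""").fetchone()[0]
print(mx, sm, in_sum)  # 7 24 12
```

MAX is unchanged by the join-introduced duplicates (still 7), but SUM is doubled (24 vs. 12), which is exactly why the two query shapes are only interchangeable for duplicate-insensitive aggregates.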
Your second query is a bit funny - can you try this one instead?
SELECT y.name
FROM dbo.y
INNER JOIN dbo.AnotherTable a ON a.id = dbo.SomeFunction(y.id)
Does that make any difference?
Otherwise: look at the execution plans! And possibly post them here. Without knowing a lot more about your tables (amount and distribution of data, etc.) and your system (RAM, disk, etc.), it's really hard to give a "globally" valid statement.
Well, for one thing: get rid of the scalar UDF that is implied by dbo.SomeFunction(y.id). That will kill your performance real good. Even if you replace it with a one-row inline table-valued function it will be better.
As for your actual question, I have found similar results in other situations and have been similarly perplexed. The optimizer just treats them differently; I'll be interested to see what answers others provide.

SQLite join optimisation

If you have a query such as:
select a.Name, a.Description from a
inner join b on a.id1 = b.id1
inner join c on b.id2 = c.id2
group by a.Name, a.Description
What would be the most optimal columns to index for this query in SQLite if you consider that there are over 100,000 rows in each of the tables?
The reason that I ask is that I do not get the performance with the query with the group by that I would expect from another RDBMS (SQL Server) when I apply the same optimisation.
Would I be right in thinking that all columns referenced on a single table in a query in SQLite need to be included in a single composite index for best performance?
The problem is that you're expecting SQLite to have the same performance characteristics as a full RDBMS. It won't. SQLite doesn't have the luxury of caching quite as much in memory, has to rebuild the cache every time you run the application, is probably limited to a set number of cores, etc., etc. Those are the tradeoffs for using an embedded RDBMS over a full one.
As far as optimizations go, try indexing the lookup columns and test. Then try creating a covering index. Be sure to test both selects and code paths that update the database, you're speeding up one at the expense of the other. Find the indexing that gives the best balance between the two for your needs and go with it.
From the SQLite query optimization overview:
When doing an indexed lookup of a row, the usual procedure is to do a binary search on the index to find the index entry, then extract the rowid from the index and use that rowid to do a binary search on the original table. Thus a typical indexed lookup involves two binary searches. If, however, all columns that were to be fetched from the table are already available in the index itself, SQLite will use the values contained in the index and will never look up the original table row. This saves one binary search for each row and can make many queries run twice as fast.
For any other RDBMS, I'd say to put a clustered index on b.id1 and c.id2. For SQLite, you might be better off including any columns from b and c that you want to lookup in those indexes too.
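That "include the looked-up columns" advice can be checked against the question's query with Python's sqlite3 (a sketch: the index names are mine, and since no keys were specified, none are declared):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE a (id1 INTEGER, Name TEXT, Description TEXT);
CREATE TABLE b (id1 INTEGER, id2 INTEGER);
CREATE TABLE c (id2 INTEGER);
CREATE INDEX idx_b ON b(id1, id2);  -- covers both columns of b the query uses
CREATE INDEX idx_c ON c(id2);
""")
plan = con.execute("""
EXPLAIN QUERY PLAN
SELECT a.Name, a.Description FROM a
INNER JOIN b ON a.id1 = b.id1
INNER JOIN c ON b.id2 = c.id2
GROUP BY a.Name, a.Description
""").fetchall()
details = " ".join(row[3] for row in plan)
print(details)
```

Because idx_b contains every column of b the query touches, the plan shows "USING COVERING INDEX" for b: one binary search per lookup instead of two, per the documentation quoted above.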
Beware: I know nothing of possible intricacies of SQLite and its execution plans.
You definitely need indexes on a.id1, b.id1, b.id2 and c.id2. I think a composite index (b.id1, b.id2) could yield a small performance increase. The same goes for (a.id1, a.Name, a.Description).
Since you're not using the other tables for your return columns, perhaps this will be faster:
SELECT DISTINCT a.Name, a.Description
FROM a, b, c
WHERE a.id1 = b.id1
AND b.id2 = c.id2
Looking at the returned columns, since the only criterion seems to be that they must be linked from a to b to c, you could look for all unique a.Name and a.Description pairs.
SELECT DISTINCT a.Name, a.Description
FROM a
WHERE a.id1 IN (
SELECT b.id1
FROM b
WHERE b.id2 IN (
SELECT c.id2
FROM c
)
)
Or, depending on whether every pair of a.Name and a.Description is already unique, there could be some gain in first finding the unique ids and then fetching the other columns.
SELECT a.Name, a.Description
FROM a
WHERE a.id1 IN (
SELECT DISTINCT a.id1
FROM a
WHERE a.id1 IN (
SELECT b.id1
FROM b
WHERE b.id2 IN (
SELECT c.id2
FROM c
)
)
)
I think indexes on a.id1 and b.id2 would give you about as much benefit as you could get in terms of the JOINs. But SQLite offers EXPLAIN, and it might help you determine if there's an avoidable inefficiency in the current execution plan.

SQL (any) Request for insight on a query optimization

I have a particularly slow query due to the vast amount of information being joined together. However I needed to add a where clause in the shape of id in (select id from table).
I want to know if there is any gain from the following and, more pressingly, whether it will even give the desired results.
select a.* from a where a.id in (select id from b where b.id = a.id)
as an alternative to:
select a.* from a where a.id in (select id from b)
Update:
MySQL
Can't be more specific sorry
table a is effectively a join between 7 different tables.
use of * is for examples
Edit: b doesn't get selected.
Your question was about the difference between these two:
select a.* from a where a.id in (select id from b where b.id = a.id)
select a.* from a where a.id in (select id from b)
The former is a correlated subquery. It may cause MySQL to execute the subquery for each row of a.
The latter is a non-correlated subquery. MySQL should be able to execute it once and cache the results for comparison against each row of a.
I would use the latter.
Both queries you list are the equivalent of:
select a.*
from a
inner join b on b.id = a.id
Almost all optimizers will execute them in the same way.
You could post a real execution plan, and someone here might give you a way to speed it up. It helps if you specify what database server you are using.
YMMV, but I've often found using EXISTS instead of IN makes queries run faster.
SELECT a.* FROM a WHERE EXISTS (SELECT 1 FROM b WHERE b.id = a.id)
Of course, without seeing the rest of the query and the context, this may not make the query any faster.
JOINing may be a preferable option, but if a.id appears more than once in the id column of b, you would have to throw a DISTINCT in there, and you would more than likely go backwards in terms of optimization.
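The semantic difference the answers are circling - IN and EXISTS deduplicate, a plain JOIN does not - can be shown with a toy SQLite example (via Python's sqlite3; table contents invented):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE a (id INTEGER, val TEXT);
CREATE TABLE b (id INTEGER);
INSERT INTO a VALUES (1, 'one'), (2, 'two');
INSERT INTO b VALUES (1), (1);  -- id 1 appears twice in b
""")
using_in = con.execute(
    "SELECT a.* FROM a WHERE a.id IN (SELECT id FROM b)").fetchall()
using_exists = con.execute(
    "SELECT a.* FROM a WHERE EXISTS (SELECT 1 FROM b WHERE b.id = a.id)").fetchall()
using_join = con.execute(
    "SELECT a.* FROM a JOIN b ON a.id = b.id").fetchall()
print(using_in)      # [(1, 'one')]
print(using_exists)  # [(1, 'one')]
print(using_join)    # [(1, 'one'), (1, 'one')]  -- duplicated by the join
```

So the JOIN form only matches the IN/EXISTS forms when b.id is unique; otherwise it needs a DISTINCT, as noted above.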
I would never use a subquery like this. A join would be much faster.
select a.*
from a
join b on a.id = b.id
Of course, don't use SELECT * either (especially never use it when doing a join, as at least one field is repeated); it wastes network resources to send unneeded data.
Have you looked at the execution plan?
How about
select a.*
from a
inner join b
on a.id = b.id
presumably the id fields are primary keys?
Select a.* from a
inner join (Select distinct id from b) c
on a.id = c.id
I tried all 3 versions and they ran about the same. The execution plan was the same (inner join, IN (with and without where clause in subquery), Exists)
Since you are not selecting any other fields from b, I prefer to use the WHERE ... IN (SELECT ...) form. Anyone would look at the query and know what you are trying to do (only show rows in a if they are in b).
Your problem is most likely in the seven tables within "a".
Make the FROM table the one that contains a.id;
make the next join: inner join b on a.id = b.id;
then join in the other six tables.
You really need to show the entire query, list all indexes, and give approximate row counts of each table if you want real help.