SQLite join optimisation

If you have a query such as:
select a.Name, a.Description from a
inner join b on a.id1 = b.id1
inner join c on b.id2 = c.id2
group by a.Name, a.Description
Which columns would be best to index for this query in SQLite, given that there are over 100,000 rows in each of the tables?
I ask because the query with the GROUP BY does not perform as well in SQLite as I would expect from another RDBMS (SQL Server) when I apply the same optimisation.
Would I be right in thinking that all columns referenced on a single table in a query in SQLite need to be included in a single composite index for best performance?

The problem is that you're expecting SQLite to have the same performance characteristics as a full RDBMS. It won't. SQLite doesn't have the luxury of caching as much in memory, has to rebuild its cache every time you run the application, is probably limited to a set number of cores, and so on. These are the tradeoffs of using an embedded RDBMS instead of a full one.
As far as optimizations go, try indexing the lookup columns and test. Then try creating a covering index. Be sure to test both the selects and the code paths that update the database, since you're speeding up one at the expense of the other. Find the indexing that gives the best balance between the two for your needs and go with it.
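As a first pass, the lookup indexes for the query in the question might look like this (a sketch; the index names are illustrative):

CREATE INDEX idx_b_id1 ON b(id1);  -- for the a-to-b join
CREATE INDEX idx_c_id2 ON c(id2);  -- for the b-to-c join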

From the SQLite query optimization overview:
When doing an indexed lookup of a row, the usual procedure is to do a binary search on the index to find the index entry, then extract the rowid from the index and use that rowid to do a binary search on the original table. Thus a typical indexed lookup involves two binary searches. If, however, all columns that were to be fetched from the table are already available in the index itself, SQLite will use the values contained in the index and will never look up the original table row. This saves one binary search for each row and can make many queries run twice as fast.
For any other RDBMS, I'd say to put a clustered index on b.id1 and c.id2. For SQLite, you might be better off including any columns from b and c that you want to look up in those indexes too.
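Concretely, for the query in the question, that might mean something like the following (a sketch; the index name is illustrative):

-- id1 is the join key coming from a; including id2 lets SQLite answer
-- the b-to-c join from the index alone, skipping the second binary search
CREATE INDEX idx_b_id1_id2 ON b(id1, id2);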

Beware: I know nothing of possible intricacies of SQLite and its execution plans.
You definitely need indexes on a.id1, b.id1, b.id2 and c.id2. I think a composite index (b.id1, b.id2) could yield a small performance increase. The same goes for (a.id1, a.Name, a.Description).

Since you're not using the other tables for your return columns, perhaps this will be faster:
SELECT DISTINCT a.Name, a.Description
FROM a, b, c
WHERE a.id1 = b.id1
AND b.id2 = c.id2
Looking at the returned columns, since the only criterion seems to be that they must be linked from a to b to c, you could look for all unique a.Name and a.Description pairs.
SELECT DISTINCT a.Name, a.Description
FROM a
WHERE a.id1 IN (
    SELECT b.id1
    FROM b
    WHERE b.id2 IN (
        SELECT c.id2
        FROM c
    )
)
Or, depending on whether every pair of a.Name and a.Description is already unique, there may be some gain in first finding the unique ids and then fetching the other columns.
SELECT a.Name, a.Description
FROM a
WHERE a.id1 IN (
    SELECT DISTINCT a.id1
    FROM a
    WHERE a.id1 IN (
        SELECT b.id1
        FROM b
        WHERE b.id2 IN (
            SELECT c.id2
            FROM c
        )
    )
)

I think indexes on a.id1 and b.id2 would give you about as much benefit as you could get in terms of the JOINs. But SQLite offers EXPLAIN, and it might help you determine if there's an avoidable inefficiency in the current execution plan.
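For example, EXPLAIN QUERY PLAN produces a more readable summary than raw EXPLAIN output and shows which index, if any, each join step uses:

EXPLAIN QUERY PLAN
SELECT a.Name, a.Description
FROM a
INNER JOIN b ON a.id1 = b.id1
INNER JOIN c ON b.id2 = c.id2
GROUP BY a.Name, a.Description;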

Related

Appropriate Index for a JOIN clause

Let's say that I have the following tables with the given attributes:
TableA: A_ID, B_NUM, C, D
TableB: B_ID, E, F
Having the following query:
SELECT TableA.*,TableB.E,TableB.F FROM TableA
INNER JOIN TableB ON TableA.B_NUM=TableB.B_ID
What index would benefit this query?
I am having a hard time comprehending this subject, in terms of what index I should create where.
Thanks!
This query:
SELECT a.*, b.E, b.F
FROM TableA a INNER JOIN
TableB b
ON a.B_NUM = b.B_ID;
is returning all data that matches between the two tables.
The general advice for indexing a query that has no WHERE or GROUP BY is to add indexes on the columns used for the joins. I would go a little further.
My first guess of the best index would be on TableB(b_id, e, f). This is a covering index for TableB. That means that the SQL engine can use the index to fetch e and f. It does not need to go to the data pages. The index is bigger, but the data pages are not needed. (This is true in most databases; sometimes row-locking considerations make things a bit more complicated.)
On the other hand, if TableA is really big and TableB much smaller so most rows in TableA have no match in TableB, then an index on TableA(B_NUM) would be the better index.
You can include both indexes and let the optimizer decide which to use (it is possible the optimizer would decide to use both, but I think that is unlikely).
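Spelled out as index definitions, the two candidates would be something like this (a sketch; the index names are illustrative):

CREATE INDEX idx_tableb_cover ON TableB(B_ID, E, F);  -- covering index for TableB
CREATE INDEX idx_tablea_bnum ON TableA(B_NUM);        -- join-key index for TableA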

Why adding a WHERE clause with NOT IN makes a SQL query slow in Postgres

I have written a SQL query that seems very simple: tableA has primary key x, and ytime is a timestamp field.
select a.x, b.z from
# tableA join tableB etc.
where a.id not in (select x from tableA where a.ytime is not null)
There is another Stack Overflow question similar to this one, but it only talks about a smaller subset of data:
Why SQL "NOT IN" is so slow?
Not sure if I need to index ytime column.
I have no science to back this up -- only anecdotal history, but for cases where the "in" list is significantly large, I have found that the semi-join (or anti-join in this case) is typically much more efficient than the in list:
select a1.x, b.z
from tableA a1
join tableB b on ...
where not exists (
    select null
    from tableA a2
    where a1.id = a2.x
      and a2.ytime is not null
)
When I say significant: a query that ran for minutes using the in-list ran in just seconds after I changed it to the semi-join.
Try it and see if it makes a difference.
I made some assumptions since your code was largely notional, but I think you get the idea.
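On the indexing question: in Postgres, a partial index can support the NOT EXISTS probe directly. A sketch, assuming x is the probed column as in the query above:

-- Only rows with a non-null ytime participate in the anti-join,
-- so a partial index keeps the structure being probed small
CREATE INDEX idx_tablea_x_notnull ON tableA (x) WHERE ytime IS NOT NULL;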

SQL - DB2 Group By and indexes with distinct aggregate functions

I'm trying to flatten out a query result set with some COUNT functions and a GROUP BY clause. Basically, I have a query that can return over a dozen rows for what will essentially be processed as one object. Furthermore, the columns that cause this are only added up after the result set is processed, so this seems an ideal time for aggregation. For example:
SELECT
    A.ID, A.NAME, A.DETAILS,
    COUNT(DISTINCT CASE WHEN B.TYPE = 'ONE' THEN B.ID2 END) AS B1,
    COUNT(DISTINCT CASE WHEN B.TYPE = 'TWO' THEN B.ID2 END) AS B2,
    COUNT(DISTINCT CASE WHEN B.TYPE = 'THREE'
                         AND B.SUBTYPE = 'ONE-ONE' THEN B.ID END) AS B3
FROM A
LEFT JOIN B ON B.A_ID = A.ID
GROUP BY A.ID, A.NAME, A.DETAILS
The idea is to get a count of all unique associated B objects of various types/subtypes. The problem is that, due to the nature of the possible queries and the way the database is structured (this is greatly simplified; there would be numerous joins, subqueries, and some parameters, but this is enough to get the gist), I could possibly get duplicate results for B.ID2 on each instance of A. That necessitates the DISTINCT; otherwise it counts all B.ID2 for that value of A and I get incorrect results.
Unfortunately, this causes each aggregate function beyond the first to do a tablescan and create a TEMP table in the explain, and no amount of indexing seems to fix it. I'm not sure I can eliminate the duplicates without some significant changes to the queries that could themselves cause larger performance problems. Without this, I would have to join to B once for every type/subtype I want, include B.ID2 in the select, and then count them up once the query returns. This is going to seriously bloat the result set and I'd rather avoid it.
Is there a viable alternative here that I'm missing or perhaps some method of indexing those columns that might eliminate the tablescan and TEMP tables? Or is there really no good solution to this?

How To Create a SQL Index to Improve ORDER BY performance

I have some SQL similar to the following, which joins four tables and then orders the results by the "status" column of the first:
SELECT *
FROM a, b, c, d
WHERE b.aid=a.id AND c.id=a.cid AND a.did=d.id AND a.did='XXX'
ORDER BY a.status
It works. However, it's slow. I've worked out this is because of the ORDER BY clause and the lack of any index on table "a".
All four tables have the PRIMARY KEYs set on the "id" column.
So, I know I need to add an index to table a which includes the "status" column but what else does it need to include? Should "bid", "cid" and "did" be in there too?
I've tried to ask this in a general SQL sense but, if it's important, the target is SQLite for use with Gears.
Thanks in advance,
Jake (noob)
I would say it's slow because the engine is doing scans all over the place instead of seeks. Did you mean to do SELECT a.* instead? That would be faster as well; SELECT * here is equivalent to a.*, b.*, c.*, d.*.
You will probably get better results if you put a separate index on each of these columns:
a.did (so that a.did = 'XXX' is a seek instead of a scan, also helps a.did = d.id)
a.cid (for a.cid = c.id)
b.aid (for a.id = b.aid)
You could try adding Status to the first and second indexes with ASCENDING order, for additional performance - it doesn't hurt.
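As index definitions, that might look like the following (a sketch; the names are illustrative, with status appended to the first two as suggested):

CREATE INDEX idx_a_did_status ON a(did, status);  -- seek for a.did = 'XXX'; may also satisfy the ORDER BY
CREATE INDEX idx_a_cid_status ON a(cid, status);  -- for a.cid = c.id
CREATE INDEX idx_b_aid ON b(aid);                 -- for a.id = b.aid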
I'd be curious as to how you worked out that the problem is 'the ORDER BY clause and the lack of any index on table "a".' I find this a little suspicious because, as you say later, there is an index on table a: the primary key.
Looking at the nature of the query and what I can guess about the nature of the data, I would think that this query would generally produce relatively few results compared to the size of the tables it's using, and that thus the ORDER BY would be extremely cheap. Of course, this is just a guess.
Whether an index will even help at all is dependent on the data in the table. What indices your query optimizer will use when doing a query is dependent on a lot of different factors, one of the big ones being the expected number of results produced from a lookup.
One thing that would help a lot is if you would post the output of EXPLAINing your query.
Have you tried explicit joins?
select *
from a
inner join b on a.id = b.aid
inner join c on a.cid = c.id
inner join d on a.did = d.id
where a.did = 'XXX'
order by a.status
The correct use of joins (LEFT, RIGHT, INNER, OUTER) depends on the structure of the tables.
Hope this helps.

Is there something wrong with joins that don't use the JOIN keyword in SQL or MySQL?

When I started writing database queries I didn't know the JOIN keyword yet and naturally I just extended what I already knew and wrote queries like this:
SELECT a.someRow, b.someRow
FROM tableA AS a, tableB AS b
WHERE a.ID=b.ID AND b.ID= $someVar
Now that I know that this is the same as an INNER JOIN I find all these queries in my code and ask myself if I should rewrite them. Is there something smelly about them or are they just fine?
My answer summary: There is nothing wrong with this query BUT using the keywords will most probably make the code more readable/maintainable.
My conclusion: I will not change my old queries but I will correct my writing style and use the keywords in the future.
Filtering joins solely using WHERE can be extremely inefficient in some common scenarios. For example:
SELECT * FROM people p, companies c
WHERE p.companyID = c.id AND p.firstName = 'Daniel'
Most databases will execute this query quite literally, first taking the Cartesian product of the people and companies tables and then filtering by those which have matching companyID and id fields. While the fully-unconstrained product does not exist anywhere but in memory and then only for a moment, its calculation does take some time.
A better approach is to group the constraints with the JOINs where relevant. This is not only subjectively easier to read but also far more efficient. Thusly:
SELECT * FROM people p JOIN companies c ON p.companyID = c.id
WHERE p.firstName = 'Daniel'
It's a little longer, but the database is able to look at the ON clause and use it to compute the fully-constrained JOIN directly, rather than starting with everything and then limiting down. This is faster to compute (especially with large data sets and/or many-table joins) and requires less memory.
I change every query I see which uses the "comma JOIN" syntax. In my opinion, the only purpose for its existence is conciseness. Considering the performance impact, I don't think this is a compelling reason.
The more verbose INNER JOIN, LEFT OUTER JOIN, RIGHT OUTER JOIN, FULL OUTER JOIN are from the ANSI SQL/92 syntax for joining. For me, this verbosity makes the join more clear to the developer/DBA of what the intent is with the join.
In SQL Server there are always query plans to check; a text output can be produced as follows:
SET SHOWPLAN_ALL ON
GO
DECLARE @TABLE_A TABLE
(
    ID INT IDENTITY(1,1) NOT NULL PRIMARY KEY,
    Data VARCHAR(10) NOT NULL
)
INSERT INTO @TABLE_A
SELECT 'ABC' UNION
SELECT 'DEF' UNION
SELECT 'GHI' UNION
SELECT 'JKL'
DECLARE @TABLE_B TABLE
(
    ID INT IDENTITY(1,1) NOT NULL PRIMARY KEY,
    Data VARCHAR(10) NOT NULL
)
INSERT INTO @TABLE_B
SELECT 'ABC' UNION
SELECT 'DEF' UNION
SELECT 'GHI' UNION
SELECT 'JKL'
SELECT A.Data, B.Data
FROM
    @TABLE_A AS A, @TABLE_B AS B
WHERE
    A.ID = B.ID
SELECT A.Data, B.Data
FROM
    @TABLE_A AS A
    INNER JOIN @TABLE_B AS B ON A.ID = B.ID
Now I'll omit the plan for the table variable creates; the plan for both queries is identical, though:
SELECT A.Data, B.Data FROM @TABLE_A AS A, @TABLE_B AS B WHERE A.ID = B.ID
  |--Nested Loops(Inner Join, OUTER REFERENCES:([A].[ID]))
       |--Clustered Index Scan(OBJECT:(@TABLE_A AS [A]))
       |--Clustered Index Seek(OBJECT:(@TABLE_B AS [B]), SEEK:([B].[ID]=@TABLE_A.[ID] as [A].[ID]) ORDERED FORWARD)

SELECT A.Data, B.Data FROM @TABLE_A AS A INNER JOIN @TABLE_B AS B ON A.ID = B.ID
  |--Nested Loops(Inner Join, OUTER REFERENCES:([A].[ID]))
       |--Clustered Index Scan(OBJECT:(@TABLE_A AS [A]))
       |--Clustered Index Seek(OBJECT:(@TABLE_B AS [B]), SEEK:([B].[ID]=@TABLE_A.[ID] as [A].[ID]) ORDERED FORWARD)
So, short answer - No need to rewrite, unless you spend a long time trying to read them each time you maintain them?
It's more of a syntax choice. I prefer grouping my join conditions with my joins, hence I use the INNER JOIN syntax
SELECT a.someRow, b.someRow
FROM tableA AS a
INNER JOIN tableB AS b
ON a.ID = b.ID
WHERE b.ID = ?
(? being a placeholder)
Nothing is wrong with the syntax in your example. The 'INNER JOIN' syntax is generally termed 'ANSI' syntax, and came after the style illustrated in your example. It exists to clarify the type/direction/constituents of the join, but is not generally functionally different than what you have.
Support for 'ANSI' joins is per-database platform, but it's more or less universal these days.
As a side note, one addition with the 'ANSI' syntax was the 'FULL OUTER JOIN' or 'FULL JOIN'.
Hope this helps.
In general:
Use the JOIN keyword to link (ie. "join") primary keys and foreign keys.
Use the WHERE clause to limit your result set to only the records you are interested in.
The one problem that can arise is when you try to mix the old "comma-style" join with SQL-92 joins in the same query, for example if you need one inner join and another outer join.
SELECT *
FROM table1 AS a, table2 AS b
LEFT OUTER JOIN table3 AS c ON a.column1 = c.column1
WHERE a.column2 = b.column2;
The problem is that recent SQL standards say that the JOIN is evaluated before the comma-join. So the reference to "a" in the ON clause gives an error, because the correlation name hasn't been defined yet as that ON clause is being evaluated. This is a very confusing error to get.
The solution is to not mix the two styles of joins. You can continue to use comma-style in your old code, but if you write a new query, convert all the joins to SQL-92 style.
SELECT *
FROM table1 AS a
INNER JOIN table2 AS b ON a.column2 = b.column2
LEFT OUTER JOIN table3 AS c ON a.column1 = c.column1;
Another thing to consider with the old join syntax is that it is very easy to get a Cartesian join by accident, since there is no ON clause. If the DISTINCT keyword is in the query and it uses the old-style joins, convert it to an ANSI-standard join and see if you still need the DISTINCT. If you are fixing accidental Cartesian joins this way, you can improve performance tremendously by rewriting to specify the join and the join fields.
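For illustration, a hypothetical sketch with made-up tables: the comma join below has no join condition between customers and orders at all, so every customer pairs with every order, and DISTINCT merely masks the blow-up:

SELECT DISTINCT c.name
FROM customers c, orders o
WHERE o.status = 'OPEN'  -- the customers/orders join condition was forgotten
-- With the ANSI syntax the omission is conspicuous, because the ON clause is mandatory:
-- FROM customers c INNER JOIN orders o ON o.customer_id = c.id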
I avoid implicit joins; when the query is really large, they make the code hard to decipher
With explicit joins, and good formatting, the code is more readable and understandable without need for comments.
It also depends on whether you are just doing inner joins this way or outer joins as well. For instance, the MS SQL Server syntax for outer joins in the WHERE clause (=* and *=) can give different results than the OUTER JOIN syntax and is no longer supported (http://msdn.microsoft.com/en-us/library/ms178653(SQL.90).aspx) in SQL Server 2005.
And what about performance?
As a matter of fact, performance is a very important concern in an RDBMS.
So the question is: which is more performant, using JOIN or joining the tables in the WHERE clause?
Because the optimizer (or planner, as it is called in PostgreSQL) ordinarily does a good job, the two execution plans are the same, so performance while executing the query will be the same...
But the devil hides in the details.
All optimizers have a limited time or a limited amount of work to find the best plan. When the limit is reached, the result is the best plan among all the plans computed so far, not the best of all possible plans!
Now the question is: do I lose time when I use the WHERE clause instead of JOINs for joining tables?
And the answer is YES!
YES, because the relational engine uses relational algebra, which knows only the JOIN operator, not pseudo-joins made in the WHERE clause. So the first thing the optimizer (in fact the parser, or the algebrizer) does is rewrite the query... and this loses some chances of finding the best of all plans!
I have seen this problem twice in my long RDBMS career (40 years...).