DELETE FROM optimization - SQL

I have this query:
DELETE FROM f
WHERE ft != 'f'
   OR fs NOT IN (SELECT fs
                 FROM f
                 GROUP BY fs
                 HAVING COUNT(fs) > 1)
It is doing its job nicely, except that it takes much more time than I expected: about 2.25 seconds to reduce ~209,000 records down to ~187,000. I think it could be improved, and I would like to know how.
Query EXPLAIN:
addr opcode p1 p2 p3 p4 p5 comment
---- ------------- ---- ---- ---- ----------------- -- -------------
0 Trace 0 0 0 00
1 Goto 0 82 0 00
2 Null 0 1 0 00
3 String8 0 3 0 f 00
4 OpenRead 0 2 0 4 00
5 Rewind 0 69 0 00
6 Column 0 2 4 00
7 Ne 3 66 4 collseq(BINARY) 63
8 If 7 53 0 00
9 Integer 1 7 0 00
10 Null 0 6 0 00
11 OpenEphemeral 4 1 0 keyinfo(1,BINARY) 00
12 OpenEphemeral 5 2 0 keyinfo(1,BINARY) 00
13 Integer 0 11 0 00
14 Integer 0 10 0 00
15 Gosub 13 50 0 00
16 OpenRead 2 2 0 4 00
17 Rewind 2 23 0 00
18 Column 2 3 16 00
19 Sequence 5 17 0 00
20 MakeRecord 16 2 4 00
21 IdxInsert 5 4 0 00
22 Next 2 18 0 01
23 Close 2 0 0 00
24 Sort 5 53 0 00
25 Column 5 0 15 00
26 Compare 14 15 1 keyinfo(1,BINARY) 00
27 Jump 28 32 28 00
28 Move 15 14 1 00
29 Gosub 12 41 0 00
30 IfPos 11 53 0 00
31 Gosub 13 50 0 00
32 Column 5 0 16 00
33 AggStep 0 16 9 count(1) 01
34 Column 5 0 8 00
35 Integer 1 10 0 00
36 Next 5 25 0 00
37 Gosub 12 41 0 00
38 Goto 0 53 0 00
39 Integer 1 11 0 00
40 Return 12 0 0 00
41 IfPos 10 43 0 00
42 Return 12 0 0 00
43 AggFinal 9 1 0 count(1) 00
44 Integer 1 4 0 00
45 Le 4 42 9 6a
46 SCopy 8 18 0 00
47 MakeRecord 18 1 4 c 00
48 IdxInsert 4 4 0 00
49 Return 12 0 0 00
50 Null 0 8 0 00
51 Null 0 9 0 00
52 Return 13 0 0 00
53 Column 0 3 4 00
54 NotNull 4 57 0 00
55 Rewind 4 66 0 00
56 Goto 0 68 0 00
57 Affinity 4 1 0 c 00
58 Found 4 65 4 1 00
59 NotNull 6 63 0 00
60 Found 4 62 6 1 00
61 Integer -1 6 0 00
62 AddImm 6 1 0 00
63 If 6 68 0 00
64 Goto 0 66 0 00
65 Goto 0 68 0 00
66 Rowid 0 2 0 00
67 RowSetAdd 1 2 0 00
68 Next 0 6 0 01
69 Close 0 0 0 00
70 OpenWrite 0 2 0 4 00
71 OpenWrite 1 3 0 keyinfo(1,BINARY) 00
72 RowSetRead 1 79 2 00
73 NotExists 0 78 2 00
74 Rowid 0 20 0 00
75 Column 0 1 19 00
76 IdxDelete 1 19 2 00
77 Delete 0 1 0 f 00
78 Goto 0 72 0 00
79 Close 1 3 0 00
80 Close 0 0 0 00
81 Halt 0 0 0 00
82 Transaction 0 1 0 00
83 VerifyCookie 0 2 0 00
84 TableLock 0 2 1 f 00
85 Goto 0 2 0 00
Table definition (no indexes yet):
CREATE TABLE f (fi INTEGER PRIMARY KEY AUTOINCREMENT,
                fn STRING,
                ft STRING,
                fs INTEGER)

I'm not sure that two seconds makes this a "killer" query (though this, of course, depends on your circumstances and needs), but one thing you could test is the effect of splitting the query into two.
Because the query currently deletes records matching either condition, it's an easy transformation into two DELETE statements (wrapped in a transaction if you want to ensure the operation is atomic). You could try:
DELETE FROM f WHERE ft != 'f';
DELETE FROM f WHERE fs NOT IN (
    SELECT fs
    FROM f
    GROUP BY fs
    HAVING COUNT(fs) > 1);
and see if that improves things. It may or may not, depending on your DBMS and the makeup of your data. It's likely to get rid of the crossover records (those satisfying both conditions) in the most-likely-faster first query. One caveat: the results can differ slightly, because if a row's only fs duplicate is removed by the first statement, the second statement will then see that fs as unique and delete the row too.
But, as with all database optimisations, measure, don't guess!
Then you can either recombine and re-evaluate, or concentrate on speeding up the second simpler query if it's still a problem.
One thing to particularly make sure of: have indexes on both the ft and fs columns. That should speed up your queries quite a bit if you don't already have them.
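As a concrete sketch of the split approach (using Python's built-in sqlite3 module on a tiny made-up dataset; the table follows the question's schema):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE f (fi INTEGER PRIMARY KEY AUTOINCREMENT,
                                fn STRING,
                                ft STRING,
                                fs INTEGER)""")
conn.executemany("INSERT INTO f (fn, ft, fs) VALUES (?, ?, ?)",
                 [("a", "f", 10), ("b", "f", 10),
                  ("c", "f", 20), ("d", "x", 10)])

# Both DELETEs run inside one transaction, so the cleanup is atomic.
with conn:
    conn.execute("DELETE FROM f WHERE ft != 'f'")
    conn.execute("""DELETE FROM f WHERE fs NOT IN (
                        SELECT fs FROM f GROUP BY fs HAVING COUNT(fs) > 1)""")

# Rows 1 and 2 survive: ft = 'f' and their fs value appears more than once.
print(conn.execute("SELECT fi FROM f ORDER BY fi").fetchall())
```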

Try this query (note that in the question's schema the primary key column is fi, not id):
DELETE FROM f
WHERE ft <> 'f'
   OR NOT EXISTS (SELECT *
                  FROM f f2
                  WHERE f2.fs = f.fs AND f2.fi <> f.fi)
Indexes:
on ft; single-column
on (fs, fi); composite, in that order
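A runnable sketch of this variant (Python's sqlite3 module; the index names are made up, and fi plays the role of the "id" column since it is the table's primary key):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE f (fi INTEGER PRIMARY KEY AUTOINCREMENT,
                                fn STRING,
                                ft STRING,
                                fs INTEGER)""")
conn.executemany("INSERT INTO f (fn, ft, fs) VALUES (?, ?, ?)",
                 [("a", "f", 10), ("b", "f", 10),
                  ("c", "f", 20), ("d", "x", 10)])

# The suggested indexes: single-column on ft, composite on (fs, fi).
conn.execute("CREATE INDEX idx_f_ft ON f (ft)")
conn.execute("CREATE INDEX idx_f_fs_fi ON f (fs, fi)")

# Delete rows with the wrong type, or whose fs has no other occurrence.
conn.execute("""DELETE FROM f
                WHERE ft <> 'f'
                   OR NOT EXISTS (SELECT 1 FROM f f2
                                  WHERE f2.fs = f.fs AND f2.fi <> f.fi)""")
print(conn.execute("SELECT fi FROM f ORDER BY fi").fetchall())
```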

DELETE f
FROM (SELECT fs
      FROM f
      GROUP BY fs
      HAVING COUNT(fs) <= 1) s
WHERE (f.ft != 'f' OR f.fs = s.fs)
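DELETE f FROM (...) is MySQL-style multi-table DELETE syntax, which SQLite does not accept. A sketch of an equivalent single-statement form for SQLite, using IN against the same derived set of singleton fs values:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE f (fi INTEGER PRIMARY KEY AUTOINCREMENT,
                                fn STRING, ft STRING, fs INTEGER)""")
conn.executemany("INSERT INTO f (fn, ft, fs) VALUES (?, ?, ?)",
                 [("a", "f", 10), ("b", "f", 10),
                  ("c", "f", 20), ("d", "x", 10)])

# Delete rows whose type is wrong, or whose fs value occurs only once.
conn.execute("""DELETE FROM f
                WHERE ft != 'f'
                   OR fs IN (SELECT fs FROM f
                             GROUP BY fs
                             HAVING COUNT(fs) <= 1)""")
print(conn.execute("SELECT fi FROM f ORDER BY fi").fetchall())
```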

How do I speed up a date range select in SQLite?

Here is my query:
const tags = await db.queryEntries(
"SELECT tag, count(tag) AS count, created_at FROM tags WHERE DATE(created_at) >= DATE('now', '-1 days') GROUP BY tag ORDER BY count DESC LIMIT 100"
);
Here is my schema:
CREATE TABLE tags (
tag TEXT,
url STRING
, created_at TEXT);
CREATE UNIQUE INDEX tag_url ON tags (tag, url)
;
CREATE INDEX idx_tags_created_at ON tags(created_at);
It's still very slow (30+ seconds) when I run the query; there are about 1.5 million records in the tags table.
Here are the results of EXPLAIN:
addr opcode p1 p2 p3 p4 p5 comment
---- ------------- ---- ---- ---- ------------- -- -------------
0 Init 0 56 0 00 Start at 56
1 OpenEphemeral 1 5 0 k(1,-B) 00 nColumn=5
2 Integer 100 1 0 00 r[1]=100; LIMIT counter
3 Noop 2 2 0 00
4 Integer 0 6 0 00 r[6]=0; clear abort flag
5 Null 0 9 9 00 r[9..9]=NULL
6 Gosub 8 46 0 00
7 OpenRead 0 7 0 3 00 root=7 iDb=0; tags
8 OpenRead 3 3693502 0 k(3,,,) 00 root=3693502 iDb=0; tag_url
9 Rewind 3 28 11 0 00
10 DeferredSeek 3 0 0 00 Move 0 to 3.rowid if needed
11 Column 0 2 12 00 r[12]=tags.created_at
12 Function 0 12 11 date(-1) 00 r[11]=func(r[12])
13 Lt 13 27 11 50 if r[11]<r[13] goto 27
14 Column 3 0 10 00 r[10]=tags.tag
15 Compare 9 10 1 k(1,-B) 00 r[9] <-> r[10]
16 Jump 17 21 17 00
17 Move 10 9 1 00 r[9]=r[10]
18 Gosub 7 32 0 00 output one row
19 IfPos 6 49 0 00 if r[6]>0 then r[6]-=0, goto 49; check abort flag
20 Gosub 8 46 0 00 reset accumulator
21 Column 3 0 11 00 r[11]=tags.tag
22 AggStep 0 11 3 count(1) 01 accum=r[3] step(r[11])
23 If 5 26 0 00
24 Column 3 0 2 00 r[2]=tags.tag
25 Column 0 2 4 00 r[4]=tags.created_at
26 Integer 1 5 0 00 r[5]=1; indicate data in accumulator
27 Next 3 10 0 01
28 Gosub 7 32 0 00 output final row
29 Goto 0 49 0 00
30 Integer 1 6 0 00 r[6]=1; set abort flag
31 Return 7 0 0 00
32 IfPos 5 34 0 00 if r[5]>0 then r[5]-=0, goto 34; Groupby result generator entry point
33 Return 7 0 0 00
34 AggFinal 3 1 0 count(1) 00 accum=r[3] N=1
35 Copy 3 14 0 00 r[14]=r[3]
36 Sequence 1 15 0 00 r[15]=cursor[1].ctr++
37 IfNotZero 1 41 0 00 if r[1]!=0 then r[1]--, goto 41
38 Last 1 0 0 00
39 IdxLE 1 45 14 1 00 key=r[14]
40 Delete 1 0 0 00
41 Copy 2 16 0 00 r[16]=r[2]
42 Copy 4 17 0 00 r[17]=r[4]
43 MakeRecord 14 4 19 00 r[19]=mkrec(r[14..17])
44 IdxInsert 1 19 14 4 00 key=r[19]
45 Return 7 0 0 00 end groupby result generator
46 Null 0 2 4 00 r[2..4]=NULL
47 Integer 0 5 0 00 r[5]=0; indicate accumulator empty
48 Return 8 0 0 00
49 Sort 1 55 0 00
50 Column 1 3 18 00 r[18]=created_at
51 Column 1 0 17 00 r[17]=count
52 Column 1 2 16 00 r[16]=tag
53 ResultRow 16 3 0 00 output=r[16..18]
54 Next 1 50 0 00
55 Halt 0 0 0 00
56 Transaction 0 0 8 0 01 usesStmtJournal=0
57 String8 0 20 0 now 00 r[20]='now'
58 String8 0 21 0 -1 days 00 r[21]='-1 days'
59 Function 3 20 13 date(-1) 00 r[13]=func(r[20])
60 Goto 0 1 0 00
Rewrite the query so that it is sargable (i.e. it can take advantage of an index):
SELECT tag, COUNT(tag) AS count
FROM tags
WHERE created_at >= strftime('%Y-%m-%d', 'now', '-1 days')
GROUP BY tag
ORDER BY count DESC
LIMIT 100;
The above query should benefit from the following index:
CREATE INDEX idx ON tags (created_at, tag);
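To see the effect of the rewrite, you can compare the query plans for the function-wrapped and the sargable WHERE clauses (a sketch with Python's sqlite3 module; the index matches the one suggested above):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tags (tag TEXT, url STRING, created_at TEXT)")
conn.execute("CREATE INDEX idx ON tags (created_at, tag)")

# DATE(created_at) hides the column inside a function call, so the
# index on created_at cannot be used and every row must be visited.
slow = """SELECT tag, COUNT(tag) AS count FROM tags
          WHERE DATE(created_at) >= DATE('now', '-1 days')
          GROUP BY tag ORDER BY count DESC LIMIT 100"""

# Comparing the bare column to a constant expression is sargable.
fast = """SELECT tag, COUNT(tag) AS count FROM tags
          WHERE created_at >= strftime('%Y-%m-%d', 'now', '-1 days')
          GROUP BY tag ORDER BY count DESC LIMIT 100"""

for label, sql in (("non-sargable", slow), ("sargable", fast)):
    for row in conn.execute("EXPLAIN QUERY PLAN " + sql):
        print(label, row[-1])
```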

SQLite Slow Select Query

I'm running the following Select query:
SELECT "entry"."id" AS "entry_id",
"entry"."input" AS "entry_input",
"entry"."output" AS "entry_output",
"entry"."numOfWords" AS "entry_numOfWords",
"entry"."times_seen" AS "entry_times_seen",
"word_class"."value" AS "word_class_value",
"dominant_noun"."noun" AS "dominant_noun_noun",
"dominant_noun"."article" AS "dominant_noun_article",
"dominant_noun"."isPluaral" AS "dominant_noun_isPluaral",
"subject"."subjectIndex" AS "subject_subjectIndex",
"last_time_visited"."value" AS "last_time_visited_value"
FROM "entry" "entry"
LEFT JOIN "word_class" "word_class" ON "word_class"."entryId"="entry"."id"
LEFT JOIN "dominant_noun" "dominant_noun" ON "dominant_noun"."entryId"="entry"."id"
LEFT JOIN "subject_entries_entry" "subject_entry" ON "subject_entry"."entryId"="entry"."id"
LEFT JOIN "subject" "subject" ON "subject"."id"="subject_entry"."subjectId"
LEFT JOIN "last_time_visited" "last_time_visited" ON "last_time_visited"."entryId"="entry"."id"
WHERE "entry"."inputLang" = 31
AND ("entry"."input" like '% hilfe %' OR "entry"."input" like 'hilfe %' OR "entry"."input" like '% hilfe')
ORDER BY "word_class"."value" DESC, "entry"."numOfWords" ASC;
Time result:
real 0m15.100s
user 0m14.072s
sys 0m1.024s
Against this Database schema:
CREATE TABLE sqlite_sequence(name,seq);
CREATE TABLE IF NOT EXISTS "subject" ("id" integer PRIMARY KEY AUTOINCREMENT NOT NULL, "subjectIndex" tinyint NOT NULL);
CREATE TABLE IF NOT EXISTS "entry" ("id" integer PRIMARY KEY AUTOINCREMENT NOT NULL, "inputLang" tinyint NOT NULL, "outputLang" tinyint NOT NULL, "input" varchar NOT NULL, "output" varchar NOT NULL, "numOfWords" tinyint NOT NULL, "times_seen" integer NOT NULL DEFAULT (0));
CREATE TABLE IF NOT EXISTS "abbr" ("id" integer PRIMARY KEY AUTOINCREMENT NOT NULL, "value" varchar NOT NULL, "entryId" integer, CONSTRAINT "REL_ca935aaf766cba1e7bfbe90275" UNIQUE ("entryId"), CONSTRAINT "FK_ca935aaf766cba1e7bfbe902757" FOREIGN KEY ("entryId") REFERENCES "entry" ("id"));
CREATE TABLE IF NOT EXISTS "word_class" ("id" integer PRIMARY KEY AUTOINCREMENT NOT NULL, "value" integer NOT NULL, "entryId" integer, CONSTRAINT "REL_94145442deb2b2209bd943a787" UNIQUE ("entryId"), CONSTRAINT "FK_94145442deb2b2209bd943a7874" FOREIGN KEY ("entryId") REFERENCES "entry" ("id"));
CREATE TABLE IF NOT EXISTS "dominant_noun" ("id" integer PRIMARY KEY AUTOINCREMENT NOT NULL, "noun" varchar NOT NULL, "article" tinyint NOT NULL, "isPluaral" boolean NOT NULL, "entryId" integer, CONSTRAINT "REL_f493eeedea653d8a89f595c82c" UNIQUE ("entryId"), CONSTRAINT "FK_f493eeedea653d8a89f595c82c4" FOREIGN KEY ("entryId") REFERENCES "entry" ("id"));
CREATE TABLE IF NOT EXISTS "last_time_visited" ("id" integer PRIMARY KEY AUTOINCREMENT NOT NULL, "value" datetime NOT NULL DEFAULT (CURRENT_TIMESTAMP), "entryId" integer, CONSTRAINT "REL_e631a6f55d59214f8e6aaa6447" UNIQUE ("entryId"), CONSTRAINT "FK_e631a6f55d59214f8e6aaa64478" FOREIGN KEY ("entryId") REFERENCES "entry" ("id"));
CREATE TABLE IF NOT EXISTS "subject_entries_entry" ("subjectId" integer NOT NULL, "entryId" integer NOT NULL, CONSTRAINT "FK_d2eaa7a84a7963ed94e472cef0b" FOREIGN KEY ("subjectId") REFERENCES "subject" ("id") ON DELETE CASCADE, CONSTRAINT "FK_5f940450dd4c681a9fecf0b14b2" FOREIGN KEY ("entryId") REFERENCES "entry" ("id") ON DELETE CASCADE, PRIMARY KEY ("subjectId", "entryId"));
CREATE INDEX "IDX_3091789786b922bee00bbb44b1" ON "entry" ("inputLang") ;
CREATE INDEX "IDX_36ab3550b9e3ef647d1230affc" ON "entry" ("outputLang") ;
CREATE INDEX "IDX_1b0f6266dffb9a7e6343e7faa4" ON "entry" ("input") ;
CREATE INDEX "IDX_a77c7936ea412ec1958007154a" ON "entry" ("numOfWords") ;
CREATE INDEX "IDX_b32699a03d36223ff9bad94ea6" ON "entry" ("times_seen") ;
Explain result:
addr opcode p1 p2 p3 p4 p5 comment
---- ------------- ---- ---- ---- ------------- -- -------------
0 Init 0 109 0 00 Start at 109
1 SorterOpen 6 14 0 k(2,-B,B) 00
2 OpenRead 0 12 0 7 00 root=12 iDb=0; entry
3 OpenRead 1 2 0 3 00 root=2 iDb=0; word_class
4 OpenRead 7 3 0 k(2,,) 02 root=3 iDb=0; sqlite_autoindex_word_class_1
5 OpenRead 2 5 0 5 00 root=5 iDb=0; dominant_noun
6 OpenRead 8 6 0 k(2,,) 02 root=6 iDb=0; sqlite_autoindex_dominant_noun_1
7 OpenRead 3 10 0 2 00 root=10 iDb=0; subject_entries_entry
8 OpenRead 4 9 0 2 00 root=9 iDb=0; subject
9 OpenRead 5 7 0 3 00 root=7 iDb=0; last_time_visited
10 OpenRead 9 8 0 k(2,,) 02 root=8 iDb=0; sqlite_autoindex_last_time_visited_1
11 Rewind 0 92 0 00
12 Column 0 1 1 00 r[1]=entry.inputLang
13 Ne 2 91 1 (BINARY) 54 if r[1]!=r[2] goto 91
14 Column 0 3 4 00 r[4]=entry.input
15 Function0 1 3 1 like(2) 02 r[1]=func(r[3..4])
16 If 1 23 0 00
17 Column 0 3 6 00 r[6]=entry.input
18 Function0 1 5 1 like(2) 02 r[1]=func(r[5..6])
19 If 1 23 0 00
20 Column 0 3 8 00 r[8]=entry.input
21 Function0 1 7 1 like(2) 02 r[1]=func(r[7..8])
22 IfNot 1 91 1 00
23 Integer 0 9 0 00 r[9]=0; init LEFT JOIN no-match flag
24 Rowid 0 10 0 00 r[10]=rowid
25 SeekGE 7 87 10 1 00 key=r[10]
26 IdxGT 7 87 10 1 00 key=r[10]
27 DeferredSeek 7 0 1 00 Move 1 to 7.rowid if needed
28 Integer 1 9 0 00 r[9]=1; record LEFT JOIN hit
29 Integer 0 11 0 00 r[11]=0; init LEFT JOIN no-match flag
30 Rowid 0 12 0 00 r[12]=rowid
31 SeekGE 8 83 12 1 00 key=r[12]
32 IdxGT 8 83 12 1 00 key=r[12]
33 DeferredSeek 8 0 2 00 Move 2 to 8.rowid if needed
34 Integer 1 11 0 00 r[11]=1; record LEFT JOIN hit
35 Once 0 44 0 00
36 OpenAutoindex 10 3 0 k(3,B,,) 00 nColumn=3; for subject_entries_entry
37 Rewind 3 44 0 00
38 Column 3 1 13 00 r[13]=subject_entries_entry.entryId
39 Column 3 0 14 00 r[14]=subject_entries_entry.subjectId
40 Rowid 3 15 0 00 r[15]=rowid
41 MakeRecord 13 3 1 00 r[1]=mkrec(r[13..15])
42 IdxInsert 10 1 0 10 key=r[1]
43 Next 3 38 0 03
44 Integer 0 16 0 00 r[16]=0; init LEFT JOIN no-match flag
45 Rowid 0 17 0 00 r[17]=rowid
46 SeekGE 10 80 17 1 00 key=r[17]
47 IdxGT 10 80 17 1 00 key=r[17]
48 Integer 1 16 0 00 r[16]=1; record LEFT JOIN hit
49 Integer 0 18 0 00 r[18]=0; init LEFT JOIN no-match flag
50 Column 10 1 19 00 r[19]=subject_entries_entry.subjectId
51 SeekRowid 4 76 19 00 intkey=r[19]
52 Integer 1 18 0 00 r[18]=1; record LEFT JOIN hit
53 Integer 0 20 0 00 r[20]=0; init LEFT JOIN no-match flag
54 Rowid 0 21 0 00 r[21]=rowid
55 SeekGE 9 72 21 1 00 key=r[21]
56 IdxGT 9 72 21 1 00 key=r[21]
57 DeferredSeek 9 0 5 00 Move 5 to 9.rowid if needed
58 Integer 1 20 0 00 r[20]=1; record LEFT JOIN hit
59 Rowid 0 24 0 00 r[24]=rowid
60 Column 0 3 25 00 r[25]=entry.input
61 Column 0 4 26 00 r[26]=entry.output
62 Column 0 6 27 00 r[27]=entry.times_seen
63 Column 2 1 28 00 r[28]=dominant_noun.noun
64 Column 2 2 29 00 r[29]=dominant_noun.article
65 Column 2 3 30 00 r[30]=dominant_noun.isPluaral
66 Column 4 1 31 00 r[31]=subject.subjectIndex
67 Column 5 1 32 00 r[32]=last_time_visited.value
68 Column 1 1 22 00 r[22]=word_class.value
69 Column 0 5 23 00 r[23]=entry.numOfWords
70 MakeRecord 22 11 35 00 r[35]=mkrec(r[22..32])
71 SorterInsert 6 35 22 11 00 key=r[35]
72 IfPos 20 76 0 00 if r[20]>0 then r[20]-=0, goto 76
73 NullRow 5 0 0 00
74 NullRow 9 0 0 00
75 Goto 0 58 0 00
76 IfPos 18 79 0 00 if r[18]>0 then r[18]-=0, goto 79
77 NullRow 4 0 0 00
78 Goto 0 52 0 00
79 Next 10 47 0 00
80 IfPos 16 83 0 00 if r[16]>0 then r[16]-=0, goto 83
81 NullRow 10 0 0 00
82 Goto 0 48 0 00
83 IfPos 11 87 0 00 if r[11]>0 then r[11]-=0, goto 87
84 NullRow 2 0 0 00
85 NullRow 8 0 0 00
86 Goto 0 34 0 00
87 IfPos 9 91 0 00 if r[9]>0 then r[9]-=0, goto 91
88 NullRow 1 0 0 00
89 NullRow 7 0 0 00
90 Goto 0 28 0 00
91 Next 0 12 0 01
92 OpenPseudo 11 36 14 00 14 columns in r[36]
93 SorterSort 6 108 0 00
94 SorterData 6 36 11 00 r[36]=data
95 Column 11 10 34 00 r[34]=last_time_visited_value
96 Column 11 9 33 00 r[33]=subject_subjectIndex
97 Column 11 8 32 00 r[32]=dominant_noun_isPluaral
98 Column 11 7 31 00 r[31]=dominant_noun_article
99 Column 11 6 30 00 r[30]=dominant_noun_noun
100 Column 11 0 29 00 r[29]=word_class_value
101 Column 11 5 28 00 r[28]=entry_times_seen
102 Column 11 1 27 00 r[27]=entry_numOfWords
103 Column 11 4 26 00 r[26]=entry_output
104 Column 11 3 25 00 r[25]=entry_input
105 Column 11 2 24 00 r[24]=entry_id
106 ResultRow 24 11 0 00 output=r[24..34]
107 SorterNext 6 94 0 00
108 Halt 0 0 0 00
109 Transaction 0 0 348 0 01 usesStmtJournal=0
110 Integer 31 2 0 00 r[2]=31
111 String8 0 3 0 % hilfe % 00 r[3]='% hilfe %'
112 String8 0 5 0 hilfe % 00 r[5]='hilfe %'
113 String8 0 7 0 % hilfe 00 r[7]='% hilfe'
114 Goto 0 1 0 00
Explain Query Plan output:
QUERY PLAN
|--SCAN TABLE entry AS entry
|--SEARCH TABLE word_class AS word_class USING INDEX sqlite_autoindex_word_class_1 (entryId=?)
|--SEARCH TABLE dominant_noun AS dominant_noun USING INDEX sqlite_autoindex_dominant_noun_1 (entryId=?)
|--SEARCH TABLE subject_entries_entry AS subject_entry USING AUTOMATIC COVERING INDEX (entryId=?)
|--SEARCH TABLE subject AS subject USING INTEGER PRIMARY KEY (rowid=?)
|--SEARCH TABLE last_time_visited AS last_time_visited USING INDEX sqlite_autoindex_last_time_visited_1 (entryId=?)
`--USE TEMP B-TREE FOR ORDER BY
Analyze output:
subject||1437631
entry|IDX_b32699a03d36223ff9bad94ea6|2348382 2348382
entry|IDX_a77c7936ea412ec1958007154a|2348382 67097
entry|IDX_1b0f6266dffb9a7e6343e7faa4|2348382 2
entry|IDX_36ab3550b9e3ef647d1230affc|2348382 1174191
entry|IDX_3091789786b922bee00bbb44b1|2348382 1174191
abbr|sqlite_autoindex_abbr_1|42575 1
dominant_noun|sqlite_autoindex_dominant_noun_1|823071 1
word_class|sqlite_autoindex_word_class_1|2005516 1
subject_entries_entry|sqlite_autoindex_subject_entries_entry_1|1437631 1 1
It often takes more than 10 seconds to get the results. This is my first time working with SQLite, but a 20-second response time seems strange. Please add a comment if I should provide extra info in order to resolve the problem.
The main factor in the long query times you're seeing is the subject_entries_entry table. It's a standard junction table used to relate rows in the entry table to rows in the subject table. The table definition uses a primary key that puts the subject id first, followed by the entry id (PRIMARY KEY ("subjectId", "entryId")).
Your query, on the other hand, joins on the entry id first and then the subject id - the opposite of the order in the key. SQLite can and does reorder the tables in a join to try to be as efficient as possible, but in this case it's not doing so. Going to the EXPLAIN QUERY PLAN output:
SEARCH TABLE subject_entries_entry AS subject_entry USING AUTOMATIC COVERING INDEX (entryId=?)
The SEARCH means it's looking up specific rows in an index instead of looking at every single one (SCAN), which is what you want, but the AUTOMATIC COVERING INDEX part is bad. AUTOMATIC means that the query planner hasn't found an existing index it can use, but thinks using an index will be better than having to scan the table - so it builds a temporary one that exists just for that query. It looks like the subject_entries_entry table has a lot of rows, so this can take a while.
Recreating the table with the primary key columns in the same order they're used in the join cut the time down by a lot (as would a separate index with the columns flipped, at the expense of more disk space).
My other advice for this table is to make it a WITHOUT ROWID one. Normal SQLite tables use a 64-bit integer primary key (known as the rowid) regardless of what the table definition uses; a non-INTEGER PRIMARY KEY is just a normal UNIQUE index in such a table. With WITHOUT ROWID, the declared primary key is the actual primary key of the table, saving space in cases like this where there's no real use for the rowid: instead of a table plus an index that duplicates the contents of each row, there is just the table. This optimization won't affect query speed here, though, since the query uses a covering index that already holds all the needed information; the actual table isn't even looked at as the query stands now.
I'm not sure about further speedups - looking at the query plan, it's using pre-existing indexes for the rest of the tables, and the join clauses are all straightforward. I'm a little surprised it's not using the index on entry(inputLang) to do a search instead of a scan on that table. It might help to rebuild the SQLite library with SQLITE_ENABLE_STAT4 turned on and then run PRAGMA optimize to rebuild the statistics tables, but that's getting into pretty advanced stuff depending on what language you're using (easy in C or C++, harder in others).
Edit:
Some other things to explore:
Enable multi-threaded sorting with PRAGMA threads=N, where N is the number of cores available.
Enable memory-mapped I/O with PRAGMA mmap_size=X.
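A sketch of the suggested junction-table fix (Python's sqlite3 module, with the surrounding tables reduced to just the columns the join touches): with the primary key order matching the join, and WITHOUT ROWID so there is no redundant rowid tree, the plan no longer contains an AUTOMATIC COVERING INDEX:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE entry (id INTEGER PRIMARY KEY, input VARCHAR);
CREATE TABLE subject (id INTEGER PRIMARY KEY, subjectIndex TINYINT);
-- Junction table with the key columns flipped to match the join order,
-- declared WITHOUT ROWID so the primary key *is* the table.
CREATE TABLE subject_entries_entry (
    entryId INTEGER NOT NULL REFERENCES entry(id) ON DELETE CASCADE,
    subjectId INTEGER NOT NULL REFERENCES subject(id) ON DELETE CASCADE,
    PRIMARY KEY (entryId, subjectId)
) WITHOUT ROWID;
""")

plan = [row[-1] for row in conn.execute("""EXPLAIN QUERY PLAN
    SELECT subject.subjectIndex
    FROM entry
    LEFT JOIN subject_entries_entry se ON se.entryId = entry.id
    LEFT JOIN subject ON subject.id = se.subjectId""")]
for line in plan:
    print(line)
```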

How can I improve performance of a slow query with LEFT JOINS, DISTINCT and ORDER in SQLite?

I have a slow query (2500+ ms) running in SQLite. This is the EXPLAIN:
sqlite> .explain
sqlite> explain SELECT DISTINCT vino_maridaje.ID_maridaje, maridaje_texto
...> FROM maridaje
...> LEFT OUTER JOIN vino_maridaje ON vino_maridaje.ID_maridaje = maridaje.ID AND maridaje.ID_categoria = 1
...> LEFT OUTER JOIN maridaje_texto ON maridaje_texto.ID_maridaje = maridaje.ID AND maridaje_texto.ID_idioma = 1
...> LEFT OUTER JOIN vino ON vino.ID = ID_vino
...> WHERE vino.ID_pais = 1 AND vino.ID_tipo = 6 AND activo = 1
...> ORDER BY maridaje_texto;
addr opcode p1 p2 p3 p4 p5 comment
---- ------------- ---- ---- ---- ------------- -- -------------
0 Init 0 83 0 00
1 SorterOpen 4 3 0 k(1,B) 00
2 OpenEphemeral 5 0 0 k(2,B,B) 08
3 OpenRead 0 718 0 2 00
4 OpenRead 1 1586 0 3 00
5 OpenRead 6 2794 0 k(2,nil,nil) 00
6 OpenRead 2 726 0 4 00
7 OpenRead 7 3698 0 k(2,nil,nil) 00
8 OpenRead 3 1412 0 20 00
9 Rewind 0 67 0 00
10 Integer 0 1 0 00
11 Rowid 0 2 0 00
12 IsNull 2 62 0 00
13 SeekGE 6 62 2 1 00
14 IdxGT 6 62 2 1 00
15 IdxRowid 6 3 0 00
16 Seek 1 3 0 00
17 Column 0 1 4 NULL 00
18 Integer 1 5 0 00
19 Ne 5 61 4 (BINARY) 6c
20 Integer 1 1 0 00
21 Integer 0 6 0 00
22 Integer 1 7 0 00
23 SeekGE 7 57 7 1 00
24 IdxGT 7 57 7 1 00
25 IdxRowid 7 8 0 00
26 Seek 2 8 0 00
27 Column 2 1 4 NULL 00
28 Rowid 0 5 0 00
29 Ne 5 56 4 (BINARY) 6b
30 Integer 1 6 0 00
31 Integer 0 9 0 00
32 Column 1 1 10 NULL 00
33 MustBeInt 10 53 0 00
34 NotExists 3 53 10 00
35 Integer 1 9 0 00
36 Column 3 7 5 NULL 00
37 Ne 11 53 5 (BINARY) 6c
38 Column 3 5 4 NULL 00
39 Ne 12 53 4 (BINARY) 6c
40 Column 3 19 13 NULL 00
41 Ne 11 53 13 (BINARY) 6c
42 Column 6 0 14 NULL 00
43 Column 2 2 15 NULL 00
44 Found 5 53 14 2 00
45 MakeRecord 14 2 16 00
46 IdxInsert 5 16 0 00
47 MakeRecord 14 2 16 00
48 Column 2 2 18 NULL 00
49 Sequence 4 19 0 00
50 Move 16 20 1 00
51 MakeRecord 18 3 17 00
52 SorterInsert 4 17 0 00
53 IfPos 9 56 0 00
54 NullRow 3 0 0 00
55 Goto 0 35 0 00
56 Next 7 24 1 00
57 IfPos 6 61 0 00
58 NullRow 2 0 0 00
59 NullRow 7 0 0 00
60 Goto 0 30 0 00
61 Next 6 14 1 00
62 IfPos 1 66 0 00
63 NullRow 1 0 0 00
64 NullRow 6 0 0 00
65 Goto 0 20 0 00
66 Next 0 10 0 01
67 Close 0 0 0 00
68 Close 1 0 0 00
69 Close 6 0 0 00
70 Close 2 0 0 00
71 Close 7 0 0 00
72 Close 3 0 0 00
73 OpenPseudo 8 16 2 00
74 OpenPseudo 9 21 3 00
75 SorterSort 4 82 0 00
76 SorterData 4 21 0 00
77 Column 9 2 16 20
78 Column 8 0 14 20
79 Column 8 1 15 00
80 ResultRow 14 2 0 00
81 SorterNext 4 76 0 00
82 Halt 0 0 0 00
83 Transaction 0 0 311 0 01
84 TableLock 0 718 0 maridaje 00
85 TableLock 0 1586 0 vino_maridaje 00
86 TableLock 0 726 0 maridaje_texto 00
87 TableLock 0 1412 0 vino 00
88 Integer 1 11 0 00
89 Integer 6 12 0 00
90 Goto 0 1 0 00
This is the EXPLAIN QUERY PLAN:
0 0 0 SCAN TABLE maridaje
0 1 1 SEARCH TABLE vino_maridaje USING INDEX idx_vino_maridaje_index_ID_maridaje (ID_maridaje=?)
0 2 2 SEARCH TABLE maridaje_texto USING INDEX idx_maridaje_texto_index_ID_idioma (ID_idioma=?)
0 3 3 SEARCH TABLE vino USING INTEGER PRIMARY KEY (rowid=?)
0 0 0 USE TEMP B-TREE FOR DISTINCT
0 0 0 USE TEMP B-TREE FOR ORDER BY
As you can see, the tables already have the appropriate indexes. So how can I make the query faster?

Why isn't my index being used in large queries with joins?

I have a table called street_names:
CREATE TABLE street_names (
id INTEGER PRIMARY KEY NOT NULL,
name TEXT UNIQUE NOT NULL
);
When searched using LIKE it uses the index and returns results instantly. However, take this larger expression:
SELECT sn.name, sa.house_number, sa.entrance, pc.postal_code, ci.name, mu.name,
co.name, sa.latitude, sa.longitude
FROM
street_addresses AS sa
INNER JOIN street_names AS sn ON sa.street_name = sn.id
INNER JOIN postal_codes AS pc ON sa.postal_code = pc.id
INNER JOIN cities AS ci ON sa.city = ci.id
INNER JOIN municipalities AS mu ON sa.municipality = mu.id
INNER JOIN counties AS co ON mu.county = co.id
WHERE
sn.name = "FORNEBUVEIEN" AND
sa.house_number = 13
ORDER BY ci.name ASC, sn.name ASC, sa.house_number ASC, sa.entrance ASC
LIMIT 0, 100;
In its current state it's lightning fast and can run 6000 times per second on my machine, but as soon as I change the = to a LIKE on the street name:
SELECT sn.name, sa.house_number, sa.entrance, pc.postal_code, ci.name, mu.name,
co.name, sa.latitude, sa.longitude
FROM
street_addresses AS sa
INNER JOIN street_names AS sn ON sa.street_name = sn.id
INNER JOIN postal_codes AS pc ON sa.postal_code = pc.id
INNER JOIN cities AS ci ON sa.city = ci.id
INNER JOIN municipalities AS mu ON sa.municipality = mu.id
INNER JOIN counties AS co ON mu.county = co.id
WHERE
sn.name LIKE "FORNEBUVEIEN" AND
sa.house_number = 13
ORDER BY ci.name ASC, sn.name ASC, sa.house_number ASC, sa.entrance ASC
LIMIT 0, 100;
It turns sour and runs perhaps 10 times per second on my machine. Why is this? The only change I made was changing an = to a LIKE on an indexed column, and the query didn't even include any wildcards.
Table schemas:
CREATE TABLE street_addresses (
id INTEGER PRIMARY KEY NOT NULL,
house_number INTEGER NOT NULL,
entrance TEXT NOT NULL,
latitude REAL NOT NULL,
longitude REAL NOT NULL,
street_name INTEGER NOT NULL REFERENCES street_names(id),
postal_code INTEGER NOT NULL REFERENCES postal_codes(id),
city INTEGER NOT NULL REFERENCES cities(id),
municipality INTEGER NOT NULL REFERENCES municipalities(id),
CONSTRAINT unique_address UNIQUE(
street_name, house_number, entrance, postal_code, city
)
);
CREATE TABLE street_names (
id INTEGER PRIMARY KEY NOT NULL,
name TEXT UNIQUE NOT NULL
);
CREATE TABLE postal_codes (
id INTEGER PRIMARY KEY NOT NULL,
postal_code INTEGER NOT NULL,
city INTEGER NOT NULL REFERENCES cities(id),
CONSTRAINT unique_postal_code UNIQUE(postal_code, city)
);
CREATE TABLE cities (
id INTEGER PRIMARY KEY NOT NULL,
name TEXT NOT NULL,
municipality INTEGER NOT NULL REFERENCES municipalities(id),
CONSTRAINT unique_city UNIQUE(name, municipality)
);
CREATE TABLE municipalities (
id INTEGER PRIMARY KEY NOT NULL,
name TEXT NOT NULL,
NUMBER INTEGER UNIQUE NOT NULL,
county INTEGER NOT NULL REFERENCES counties(id),
CONSTRAINT unique_municipality UNIQUE(name, county)
);
CREATE TABLE counties (
id INTEGER PRIMARY KEY NOT NULL,
name TEXT UNIQUE NOT NULL
);
EXPLAIN for query ... sn.name = ... :
sqlite> EXPLAIN SELECT sn.name, sa.house_number, sa.entrance, pc.postal_code, ci.name, mu.name, co.name, sa.latitude, sa.longitude FROM street_addresses AS sa INNER JOIN street_names AS sn ON sa.street_name = sn.id INNER JOIN postal_codes AS pc ON sa.postal_code = pc.id INNER JOIN cities AS ci ON sa.city = ci.id INNER JOIN municipalities AS mu ON sa.municipality = mu.id INNER JOIN counties AS co ON mu.county = co.id WHERE sn.name = "FORNEBUVEIEN" AND sa.house_number = 13 ORDER BY ci.name ASC, sn.name ASC, sa.house_number ASC, sa.entrance ASC LIMIT 0, 100;
addr opcode p1 p2 p3 p4 p5 comment
---------- ---------- ---------- ---------- ---------- ---------- ---------- ----------
0 Init 0 91 0 00
1 OpenEpheme 6 6 0 k(4,B,B,B, 00
2 Integer 100 1 0 00
3 Integer 0 2 0 00
4 MustBeInt 2 0 0 00
5 IfPos 2 7 0 00
6 Integer 0 2 0 00
7 Add 1 2 3 00
8 IfPos 1 10 0 00
9 Integer -1 3 0 00
10 OpenRead 7 12 0 k(2,nil,ni 00
11 OpenRead 0 13 0 9 00
12 OpenRead 8 14 0 k(5,nil,ni 00
13 OpenRead 2 9 0 2 00
14 OpenRead 3 7 0 2 00
15 OpenRead 4 4 0 4 00
16 OpenRead 5 2 0 2 00
17 String8 0 4 0 FORNEBUVEI 00
18 SeekGE 7 65 4 1 00
19 IdxGT 7 65 4 1 00
20 IdxRowid 7 5 0 00
21 IsNull 5 65 0 00
22 Integer 13 6 0 00
23 SeekGE 8 65 5 2 00
24 IdxGT 8 65 5 2 00
25 IdxRowid 8 7 0 00
26 Seek 0 7 0 00
27 Column 8 3 8 00
28 MustBeInt 8 64 0 00
29 NotExists 2 64 8 00
30 Column 8 4 9 00
31 MustBeInt 9 64 0 00
32 NotExists 3 64 9 00
33 Column 0 8 10 00
34 MustBeInt 10 64 0 00
35 NotExists 4 64 10 00
36 Column 4 3 11 00
37 MustBeInt 11 64 0 00
38 NotExists 5 64 11 00
39 Column 7 0 12 00
40 Column 8 1 13 00
41 Column 8 2 14 00
42 Column 2 1 15 00
43 Column 3 1 16 00
44 Column 4 1 17 00
45 Column 5 1 18 00
46 Column 0 3 19 00
47 RealAffini 19 0 0 00
48 Column 0 4 20 00
49 RealAffini 20 0 0 00
50 MakeRecord 12 9 21 00
51 Column 3 1 22 00
52 Column 7 0 23 00
53 Column 8 1 24 00
54 Column 8 2 25 00
55 Sequence 6 26 0 00
56 Move 21 27 0 00
57 MakeRecord 22 6 28 00
58 IdxInsert 6 28 0 00
59 IfZero 3 62 0 00
60 AddImm 3 -1 0 00
61 Goto 0 64 0 00
62 Last 6 0 0 00
63 Delete 6 0 0 00
64 Next 8 24 0 00
65 Close 7 0 0 00
66 Close 0 0 0 00
67 Close 8 0 0 00
68 Close 2 0 0 00
69 Close 3 0 0 00
70 Close 4 0 0 00
71 Close 5 0 0 00
72 OpenPseudo 9 21 9 00
73 Sort 6 89 0 00
74 AddImm 2 -1 0 00
75 IfNeg 2 77 0 00
76 Goto 0 88 0 00
77 Column 6 5 21 00
78 Column 9 0 12 20
79 Column 9 1 13 00
80 Column 9 2 14 00
81 Column 9 3 15 00
82 Column 9 4 16 00
83 Column 9 5 17 00
84 Column 9 6 18 00
85 Column 9 7 19 00
86 Column 9 8 20 00
87 ResultRow 12 9 0 00
88 Next 6 74 0 00
89 Close 9 0 0 00
90 Halt 0 0 0 00
91 Transactio 0 0 10 0 01
92 TableLock 0 11 0 street_nam 00
93 TableLock 0 13 0 street_add 00
94 TableLock 0 9 0 postal_cod 00
95 TableLock 0 7 0 cities 00
96 TableLock 0 4 0 municipali 00
97 TableLock 0 2 0 counties 00
98 Goto 0 1 0 00
EXPLAIN for query ... sn.name LIKE ... :
sqlite> EXPLAIN SELECT sn.name, sa.house_number, sa.entrance, pc.postal_code, ci.name, mu.name, co.name, sa.latitude, sa.longitude FROM street_addresses AS sa INNER JOIN street_names AS sn ON sa.street_name = sn.id INNER JOIN postal_codes AS pc ON sa.postal_code = pc.id INNER JOIN cities AS ci ON sa.city = ci.id INNER JOIN municipalities AS mu ON sa.municipality = mu.id INNER JOIN counties AS co ON mu.county = co.id WHERE sn.name LIKE "FORNEBUVEIEN" AND sa.house_number = 13 ORDER BY ci.name ASC, sn.name ASC, sa.house_number ASC, sa.entrance ASC LIMIT 0, 100;
addr opcode p1 p2 p3 p4 p5 comment
---------- ---------- ---------- ---------- ---------- ---------- ---------- ----------
0 Init 0 88 0 00
1 OpenEpheme 6 6 0 k(4,B,B,B, 00
2 Integer 100 1 0 00
3 Integer 0 2 0 00
4 MustBeInt 2 0 0 00
5 IfPos 2 7 0 00
6 Integer 0 2 0 00
7 Add 1 2 3 00
8 IfPos 1 10 0 00
9 Integer -1 3 0 00
10 OpenRead 0 13 0 9 00
11 OpenRead 1 11 0 2 00
12 OpenRead 4 4 0 4 00
13 OpenRead 3 7 0 2 00
14 OpenRead 5 2 0 2 00
15 OpenRead 2 9 0 2 00
16 Rewind 0 63 0 00
17 Column 0 1 4 00
18 Ne 5 62 4 (BINARY) 6c
19 Column 0 5 6 00
20 MustBeInt 6 62 0 00
21 NotExists 1 62 6 00
22 Column 1 1 9 00
23 Function 1 8 7 like(2) 02
24 IfNot 7 62 1 00
25 Column 0 8 10 00
26 MustBeInt 10 62 0 00
27 NotExists 4 62 10 00
28 Column 0 7 11 00
29 MustBeInt 11 62 0 00
30 NotExists 3 62 11 00
31 Column 4 3 12 00
32 MustBeInt 12 62 0 00
33 NotExists 5 62 12 00
34 Column 0 6 13 00
35 MustBeInt 13 62 0 00
36 NotExists 2 62 13 00
37 Column 1 1 14 00
38 Copy 4 15 0 00
39 Column 0 2 16 00
40 Column 2 1 17 00
41 Column 3 1 18 00
42 Column 4 1 19 00
43 Column 5 1 20 00
44 Column 0 3 21 00
45 RealAffini 21 0 0 00
46 Column 0 4 22 00
47 RealAffini 22 0 0 00
48 MakeRecord 14 9 7 00
49 Column 3 1 23 00
50 Column 1 1 24 00
51 Column 0 1 25 00
52 Column 0 2 26 00
53 Sequence 6 27 0 00
54 Move 7 28 0 00
55 MakeRecord 23 6 29 00
56 IdxInsert 6 29 0 00
57 IfZero 3 60 0 00
58 AddImm 3 -1 0 00
59 Goto 0 62 0 00
60 Last 6 0 0 00
61 Delete 6 0 0 00
62 Next 0 17 0 01
63 Close 0 0 0 00
64 Close 1 0 0 00
65 Close 4 0 0 00
66 Close 3 0 0 00
67 Close 5 0 0 00
68 Close 2 0 0 00
69 OpenPseudo 7 7 9 00
70 Sort 6 86 0 00
71 AddImm 2 -1 0 00
72 IfNeg 2 74 0 00
73 Goto 0 85 0 00
74 Column 6 5 7 00
75 Column 7 0 14 20
76 Column 7 1 15 00
77 Column 7 2 16 00
78 Column 7 3 17 00
79 Column 7 4 18 00
80 Column 7 5 19 00
81 Column 7 6 20 00
82 Column 7 7 21 00
83 Column 7 8 22 00
84 ResultRow 14 9 0 00
85 Next 6 71 0 00
86 Close 7 0 0 00
87 Halt 0 0 0 00
88 Transactio 0 0 10 0 01
89 TableLock 0 13 0 street_add 00
90 TableLock 0 11 0 street_nam 00
91 TableLock 0 4 0 municipali 00
92 TableLock 0 7 0 cities 00
93 TableLock 0 2 0 counties 00
94 TableLock 0 9 0 postal_cod 00
95 Integer 13 5 0 00
96 String8 0 8 0 FORNEBUVEI 00
97 Goto 0 1 0 00
The documentation says that for LIKE to use an index, the index must be case-insensitive:
CREATE INDEX ci_name ON street_names(name COLLATE NOCASE);
EXPLAIN QUERY PLAN SELECT ... sn.name LIKE "FORNEBUVEIEN" ...;
0|0|1|SEARCH TABLE street_names AS sn USING COVERING INDEX ci_name (name>? AND name<?)
...
Alternatively, use GLOB to be able to use the case-sensitive index:
EXPLAIN QUERY PLAN SELECT ... sn.name GLOB "FORNEBUVEIEN" ...;
0|0|1|SEARCH TABLE street_names AS sn USING COVERING INDEX sqlite_autoindex_street_names_1 (name>? AND name<?)
...
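Both effects are easy to reproduce with a minimal sketch. The snippet below uses a made-up one-table stand-in for `street_names` (the real schema has more columns); it shows the plan switching from a scan to an index search once a NOCASE index exists, and that LIKE matches case-insensitively:

```python
import sqlite3

# Minimal stand-in for street_names; the real table has more columns.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE street_names (id INTEGER PRIMARY KEY, name TEXT)")
con.executemany("INSERT INTO street_names(name) VALUES (?)",
                [("FORNEBUVEIEN",), ("Fornebuveien",), ("OSLOGATA",)])

# Without a NOCASE index, LIKE forces a full table scan.
plan = con.execute(
    "EXPLAIN QUERY PLAN SELECT id FROM street_names WHERE name LIKE 'FORNEBUVEIEN'"
).fetchall()
print(plan)  # a SCAN of street_names

# With an index declared COLLATE NOCASE, the LIKE optimization can
# rewrite the scan into a ranged search of the index.
con.execute("CREATE INDEX ci_name ON street_names(name COLLATE NOCASE)")
plan = con.execute(
    "EXPLAIN QUERY PLAN SELECT id FROM street_names WHERE name LIKE 'FORNEBUVEIEN'"
).fetchall()
print(plan)  # a SEARCH using ci_name

# LIKE is case-insensitive by default, so both spellings match.
rows = con.execute(
    "SELECT name FROM street_names WHERE name LIKE 'FORNEBUVEIEN' ORDER BY name"
).fetchall()
print(rows)
```

This assumes the default `PRAGMA case_sensitive_like = OFF`; with it turned on, LIKE behaves like GLOB and would instead want the BINARY-collated index.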
I am not an SQLite expert, but in any SQL dialect, LIKE is never going to be as fast as =. Perhaps, though, you can rearrange the query to optimise it:
SELECT sn.name, sa.house_number, sa.entrance, pc.postal_code, ci.name, mu.name,
co.name, sa.latitude, sa.longitude
FROM
street_addresses AS sa
INNER JOIN street_names AS sn ON sa.street_name = sn.id
AND sn.name LIKE "FORNEBUVEIEN"
INNER JOIN postal_codes AS pc ON sa.postal_code = pc.id
AND sa.house_number = 13
INNER JOIN cities AS ci ON sa.city = ci.id
INNER JOIN municipalities AS mu ON sa.municipality = mu.id
INNER JOIN counties AS co ON mu.county = co.id
ORDER BY ci.name ASC, sn.name ASC, sa.house_number ASC, sa.entrance ASC
LIMIT 0, 100;
My thinking is that by forcing the evaluation early, there is less data left for the later joins to process. Of course, if the optimiser is smart enough, it will already have chosen that access path.
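Whether moving a predicate from WHERE into an ON clause changes anything can be checked directly: SQLite documents that for INNER JOINs, ON terms are treated exactly like WHERE terms. A small two-table sketch (a made-up stand-in schema, not the real one) comparing the plans of both phrasings:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE street_names (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
    CREATE TABLE street_addresses (
        id INTEGER PRIMARY KEY,
        street_name INTEGER REFERENCES street_names(id),
        house_number INTEGER
    );
""")

# Same join, with the name predicate in the ON clause...
q_on = """EXPLAIN QUERY PLAN
    SELECT sa.house_number FROM street_addresses AS sa
    INNER JOIN street_names AS sn ON sa.street_name = sn.id
        AND sn.name GLOB 'FORNEBUVEIEN'"""
# ...and in the WHERE clause.
q_where = """EXPLAIN QUERY PLAN
    SELECT sa.house_number FROM street_addresses AS sa
    INNER JOIN street_names AS sn ON sa.street_name = sn.id
    WHERE sn.name GLOB 'FORNEBUVEIEN'"""

plan_on = [r[3] for r in con.execute(q_on)]
plan_where = [r[3] for r in con.execute(q_where)]
print(plan_on)
print(plan_where)
```

On an inner join the two plans come out identical, so for this query the rearrangement is about readability, not speed; the index changes above are what actually move the needle.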

Sqlite subselect much faster than distinct + order by

I'm confused by the drastically different running times of the following two queries, which produce identical output. The queries are running on SQLite 3.7.9, on a table with about 4.5 million rows, and each produces ~50 rows of results.
Here are the queries:
% echo "SELECT DISTINCT acolumn FROM atable ORDER BY acolumn;" | time sqlite3 mydb
sqlite3 mydb 8.87s user 15.06s system 99% cpu 23.980 total
% echo "SELECT acolumn FROM (SELECT DISTINCT acolumn FROM atable) ORDER BY acolumn;" | time sqlite3 mydb
sqlite3 mydb 1.15s user 0.10s system 98% cpu 1.267 total
Shouldn't the performance of the two queries be closer? I understand that it may be the case that the query planner is performing the "sort" and "distinct" operations in different orders, but if so, does it need to? Or should it be able to figure out how to do it fastest?
Edit: as requested here is the output of the "EXPLAIN QUERY PLAN" command for each query.
For the first (monolithic) query:
0|0|0|SCAN TABLE atable (~1000000 rows)
0|0|0|USE TEMP B-TREE FOR DISTINCT
For the second (subquery) query:
1|0|0|SCAN TABLE atable (~1000000 rows)
1|0|0|USE TEMP B-TREE FOR DISTINCT
0|0|0|SCAN SUBQUERY 1 (~1000000 rows)
0|0|0|USE TEMP B-TREE FOR ORDER BY
Your first query orders the records first by inserting all of them into a sorted temporary table, and then implements the DISTINCT by going through them and returning only those that are not identical to the previous one.
(This can be seen in the EXPLAIN output shown below; the DISTINCT actually got converted to a GROUP BY, which behaves the same.)
Your second query is, in theory, identical to the first, but SQLite's query optimizer is rather simple and cannot prove that this conversion would be safe (as explained in the subquery flattening documentation).
Therefore, it is implemented by doing the DISTINCT first, by inserting only any non-duplicates into a temporary table, and then doing the ORDER BY with a second temporary table.
This second step is completely superfluous because the first temp table was already sorted, but this happens to be faster for your data anyway because you have so many duplicates that are never stored in either temp table.
In theory, your first query could be faster, because SQLite has already recognized that the DISTINCT and ORDER BY clauses can be implemented with the same sorted temporary table.
In practice, however, SQLite is not smart enough to remember that the DISTINCT implies that duplicates do not need to be stored in the temp table.
(This particular optimization might be added to SQLite if you ask nicely on the mailing list.)
$ sqlite3 mydb
sqlite> .explain
sqlite> explain SELECT DISTINCT acolumn FROM atable ORDER BY acolumn;
addr opcode p1 p2 p3 p4 p5 comment
---- ------------- ---- ---- ---- ------------- -- -------------
0 Trace 0 0 0 00
1 SorterOpen 1 2 0 keyinfo(1,BINARY) 00
2 Integer 0 3 0 00 clear abort flag
3 Integer 0 2 0 00 indicate accumulator empty
4 Null 0 6 6 00
5 Gosub 5 37 0 00
6 Goto 0 40 0 00
7 OpenRead 0 2 0 1 00 atable
8 Rewind 0 14 0 00
9 Column 0 0 8 00 atable.acolumn
10 Sequence 1 9 0 00
11 MakeRecord 8 2 10 00
12 SorterInsert 1 10 0 00
13 Next 0 9 0 01
14 Close 0 0 0 00
15 OpenPseudo 2 10 2 00
16 SorterSort 1 39 0 00 GROUP BY sort
17 SorterData 1 10 0 00
18 Column 2 0 7 20
19 Compare 6 7 1 keyinfo(1,BINARY) 00
20 Jump 21 25 21 00
21 Move 7 6 0 00
22 Gosub 4 32 0 00 output one row
23 IfPos 3 39 0 00 check abort flag
24 Gosub 5 37 0 00 reset accumulator
25 Column 2 0 1 00
26 Integer 1 2 0 00 indicate data in accumulator
27 SorterNext 1 17 0 00
28 Gosub 4 32 0 00 output final row
29 Goto 0 39 0 00
30 Integer 1 3 0 00 set abort flag
31 Return 4 0 0 00
32 IfPos 2 34 0 00 Groupby result generator entry point
33 Return 4 0 0 00
34 Copy 1 11 0 00
35 ResultRow 11 1 0 00
36 Return 4 0 0 00 end groupby result generator
37 Null 0 1 0 00
38 Return 5 0 0 00
39 Halt 0 0 0 00
40 Transaction 0 0 0 00
41 VerifyCookie 0 2 0 00
42 TableLock 0 2 0 atable 00
43 Goto 0 7 0 00
sqlite> explain SELECT acolumn FROM (SELECT DISTINCT acolumn FROM atable) ORDER BY acolumn;
addr opcode p1 p2 p3 p4 p5 comment
---- ------------- ---- ---- ---- ------------- -- -------------
0 Trace 0 0 0 00
1 Goto 0 39 0 00
2 Goto 0 17 0 00
3 OpenPseudo 0 3 1 01 coroutine for sqlite_subquery_DA7480_
4 Integer 0 2 0 01
5 OpenEphemeral 2 0 0 keyinfo(1,BINARY) 08
6 OpenRead 1 2 0 1 00 atable
7 Rewind 1 14 0 00
8 Column 1 0 3 00 atable.acolumn
9 Found 2 13 3 1 00
10 MakeRecord 3 1 4 00
11 IdxInsert 2 4 0 00
12 Yield 1 0 0 00
13 Next 1 8 0 01
14 Close 1 0 0 00
15 Integer 1 2 0 00
16 Yield 1 0 0 00 end sqlite_subquery_DA7480_
17 SorterOpen 3 3 0 keyinfo(1,BINARY) 00
18 Integer 2 1 0 00
19 Yield 1 0 0 00 next row of co-routine sqlite_subquery_DA7480_
20 If 2 29 0 00
21 Column 0 0 5 00 sqlite_subquery_DA7480_.acolumn
22 MakeRecord 5 1 6 00
23 Column 0 0 7 00 sqlite_subquery_DA7480_.acolumn
24 Sequence 3 8 0 00
25 Move 6 9 0 00
26 MakeRecord 7 3 10 00
27 SorterInsert 3 10 0 00
28 Goto 0 19 0 00
29 OpenPseudo 4 6 1 00
30 OpenPseudo 5 11 3 00
31 SorterSort 3 37 0 00
32 SorterData 3 11 0 00
33 Column 5 2 6 20
34 Column 4 0 5 20
35 ResultRow 5 1 0 00
36 SorterNext 3 32 0 00
37 Close 4 0 0 00
38 Halt 0 0 0 00
39 Transaction 0 0 0 00
40 VerifyCookie 0 2 0 00
41 TableLock 0 2 0 atable 00
42 Goto 0 2 0 00
Inside most DBMSs, SQL statements are translated into relational algebra and then structured in an expression tree.
The DBMS then uses heuristics to optimise queries. One of the main heuristics is "perform selection early" (p. 46). I suppose the SQLite query planner does this as well, hence the difference in execution time.
Since the result of the subquery is much smaller (~50 rows as opposed to 4.5 million), the sort at the end of the expression tree happens much faster. Plain selection is not a very expensive operation; running operations on a huge number of intermediate results is.
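The "perform selection early" heuristic is easy to illustrate outside SQL entirely (a toy sketch, not SQLite internals): reducing the data before the expensive sort touches far fewer elements than sorting first.

```python
import random
import timeit

random.seed(0)
data = [random.randrange(1000) for _ in range(200_000)]

# Selection late: sort all 200k elements, then filter.
def select_late():
    return [x for x in sorted(data) if x < 10]

# Selection early: filter first, then sort the small remainder.
def select_early():
    return sorted(x for x in data if x < 10)

assert select_late() == select_early()  # same result either way

t_late = timeit.timeit(select_late, number=5)
t_early = timeit.timeit(select_early, number=5)
print(f"late:  {t_late:.3f}s")
print(f"early: {t_early:.3f}s")  # typically much faster
```

The same logic applies to the subquery form: the DISTINCT shrinks 4.5 million rows to ~50 before the outer ORDER BY ever runs.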
I believe this must be because the order and distinct operations can be implemented more efficiently when separated by the subselect, which is effectively a simpler way of saying what alexdeloy is saying.
This experiment is not complete. Please also run the following:
% echo "SELECT acolumn FROM (SELECT DISTINCT acolumn FROM atable ORDER BY acolumn) ;" | time sqlite3 mydb
Please tell me whether this takes longer than the other two on average. Thanks.
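For anyone who wants to repeat the experiment without the original 4.5-million-row database, here is a self-contained harness on synthetic data (far smaller than the real table, with ~50 distinct values and heavy duplication, to mimic the question) that runs all three variants and checks they agree:

```python
import random
import sqlite3

random.seed(42)
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE atable (acolumn)")
# 100k rows drawn from 50 distinct values: lots of duplicates.
con.executemany("INSERT INTO atable VALUES (?)",
                [(random.randrange(50),) for _ in range(100_000)])

queries = [
    "SELECT DISTINCT acolumn FROM atable ORDER BY acolumn",
    "SELECT acolumn FROM (SELECT DISTINCT acolumn FROM atable) ORDER BY acolumn",
    "SELECT acolumn FROM (SELECT DISTINCT acolumn FROM atable ORDER BY acolumn)",
]
results = [con.execute(q).fetchall() for q in queries]
print(results[0][:5])
```

Note that the third variant relies on the subquery's ORDER BY surviving into the outer query, which SQLite does not guarantee; also, on recent SQLite versions the planner has learned to share one sorter between DISTINCT and ORDER BY, so the timing gap from the question may no longer reproduce.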