Here is my query:
const tags = await db.queryEntries(
"SELECT tag, count(tag) AS count, created_at FROM tags WHERE DATE(created_at) >= DATE('now', '-1 days') GROUP BY tag ORDER BY count DESC LIMIT 100"
);
Here is my schema:
CREATE TABLE tags (
tag TEXT,
url STRING
, created_at TEXT);
CREATE UNIQUE INDEX tag_url ON tags (tag, url)
;
CREATE INDEX idx_tags_created_at ON tags(created_at);
It's still very slow (30+ seconds) when I run the query; there are about 1.5 million records in the tags table.
Here are the results of EXPLAIN:
addr opcode p1 p2 p3 p4 p5 comment
---- ------------- ---- ---- ---- ------------- -- -------------
0 Init 0 56 0 00 Start at 56
1 OpenEphemeral 1 5 0 k(1,-B) 00 nColumn=5
2 Integer 100 1 0 00 r[1]=100; LIMIT counter
3 Noop 2 2 0 00
4 Integer 0 6 0 00 r[6]=0; clear abort flag
5 Null 0 9 9 00 r[9..9]=NULL
6 Gosub 8 46 0 00
7 OpenRead 0 7 0 3 00 root=7 iDb=0; tags
8 OpenRead 3 3693502 0 k(3,,,) 00 root=3693502 iDb=0; tag_url
9 Rewind 3 28 11 0 00
10 DeferredSeek 3 0 0 00 Move 0 to 3.rowid if needed
11 Column 0 2 12 00 r[12]=tags.created_at
12 Function 0 12 11 date(-1) 00 r[11]=func(r[12])
13 Lt 13 27 11 50 if r[11]<r[13] goto 27
14 Column 3 0 10 00 r[10]=tags.tag
15 Compare 9 10 1 k(1,-B) 00 r[9] <-> r[10]
16 Jump 17 21 17 00
17 Move 10 9 1 00 r[9]=r[10]
18 Gosub 7 32 0 00 output one row
19 IfPos 6 49 0 00 if r[6]>0 then r[6]-=0, goto 49; check abort flag
20 Gosub 8 46 0 00 reset accumulator
21 Column 3 0 11 00 r[11]=tags.tag
22 AggStep 0 11 3 count(1) 01 accum=r[3] step(r[11])
23 If 5 26 0 00
24 Column 3 0 2 00 r[2]=tags.tag
25 Column 0 2 4 00 r[4]=tags.created_at
26 Integer 1 5 0 00 r[5]=1; indicate data in accumulator
27 Next 3 10 0 01
28 Gosub 7 32 0 00 output final row
29 Goto 0 49 0 00
30 Integer 1 6 0 00 r[6]=1; set abort flag
31 Return 7 0 0 00
32 IfPos 5 34 0 00 if r[5]>0 then r[5]-=0, goto 34; Groupby result generator entry point
33 Return 7 0 0 00
34 AggFinal 3 1 0 count(1) 00 accum=r[3] N=1
35 Copy 3 14 0 00 r[14]=r[3]
36 Sequence 1 15 0 00 r[15]=cursor[1].ctr++
37 IfNotZero 1 41 0 00 if r[1]!=0 then r[1]--, goto 41
38 Last 1 0 0 00
39 IdxLE 1 45 14 1 00 key=r[14]
40 Delete 1 0 0 00
41 Copy 2 16 0 00 r[16]=r[2]
42 Copy 4 17 0 00 r[17]=r[4]
43 MakeRecord 14 4 19 00 r[19]=mkrec(r[14..17])
44 IdxInsert 1 19 14 4 00 key=r[19]
45 Return 7 0 0 00 end groupby result generator
46 Null 0 2 4 00 r[2..4]=NULL
47 Integer 0 5 0 00 r[5]=0; indicate accumulator empty
48 Return 8 0 0 00
49 Sort 1 55 0 00
50 Column 1 3 18 00 r[18]=created_at
51 Column 1 0 17 00 r[17]=count
52 Column 1 2 16 00 r[16]=tag
53 ResultRow 16 3 0 00 output=r[16..18]
54 Next 1 50 0 00
55 Halt 0 0 0 00
56 Transaction 0 0 8 0 01 usesStmtJournal=0
57 String8 0 20 0 now 00 r[20]='now'
58 String8 0 21 0 -1 days 00 r[21]='-1 days'
59 Function 3 20 13 date(-1) 00 r[13]=func(r[20])
60 Goto 0 1 0 00
Wrapping created_at in DATE() on the left-hand side of the comparison forces SQLite to evaluate the function for every row, so the index on created_at cannot be used. Rewrite the query so that it is sargable (i.e. so it can take advantage of an index):
SELECT tag, COUNT(tag) AS count
FROM tags
WHERE created_at >= strftime('%Y-%m-%d', 'now', '-1 days')
GROUP BY tag
ORDER BY count DESC
LIMIT 100;
The above query should benefit from the following index:
CREATE INDEX idx ON tags (created_at, tag);
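Once that index exists, you can check that the rewritten query actually uses it with EXPLAIN QUERY PLAN; with a sargable predicate on created_at you should see a SEARCH on the index rather than a SCAN of tags. A quick check (a sketch, reusing the query above):
EXPLAIN QUERY PLAN
SELECT tag, COUNT(tag) AS count
FROM tags
WHERE created_at >= strftime('%Y-%m-%d', 'now', '-1 days')
GROUP BY tag
ORDER BY count DESC
LIMIT 100;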
I have a slow query (2500+ ms) running in SQLite. This is the EXPLAIN:
sqlite> .explain
sqlite> explain SELECT DISTINCT vino_maridaje.ID_maridaje, maridaje_texto
...> FROM maridaje
...> LEFT OUTER JOIN vino_maridaje ON vino_maridaje.ID_maridaje = maridaje.ID AND maridaje.ID_categoria = 1
...> LEFT OUTER JOIN maridaje_texto ON maridaje_texto.ID_maridaje = maridaje.ID AND maridaje_texto.ID_idioma = 1
...> LEFT OUTER JOIN vino ON vino.ID = ID_vino
...> WHERE vino.ID_pais = 1 AND vino.ID_tipo = 6 AND activo = 1
...> ORDER BY maridaje_texto;
addr opcode p1 p2 p3 p4 p5 comment
---- ------------- ---- ---- ---- ------------- -- -------------
0 Init 0 83 0 00
1 SorterOpen 4 3 0 k(1,B) 00
2 OpenEphemeral 5 0 0 k(2,B,B) 08
3 OpenRead 0 718 0 2 00
4 OpenRead 1 1586 0 3 00
5 OpenRead 6 2794 0 k(2,nil,nil) 00
6 OpenRead 2 726 0 4 00
7 OpenRead 7 3698 0 k(2,nil,nil) 00
8 OpenRead 3 1412 0 20 00
9 Rewind 0 67 0 00
10 Integer 0 1 0 00
11 Rowid 0 2 0 00
12 IsNull 2 62 0 00
13 SeekGE 6 62 2 1 00
14 IdxGT 6 62 2 1 00
15 IdxRowid 6 3 0 00
16 Seek 1 3 0 00
17 Column 0 1 4 NULL 00
18 Integer 1 5 0 00
19 Ne 5 61 4 (BINARY) 6c
20 Integer 1 1 0 00
21 Integer 0 6 0 00
22 Integer 1 7 0 00
23 SeekGE 7 57 7 1 00
24 IdxGT 7 57 7 1 00
25 IdxRowid 7 8 0 00
26 Seek 2 8 0 00
27 Column 2 1 4 NULL 00
28 Rowid 0 5 0 00
29 Ne 5 56 4 (BINARY) 6b
30 Integer 1 6 0 00
31 Integer 0 9 0 00
32 Column 1 1 10 NULL 00
33 MustBeInt 10 53 0 00
34 NotExists 3 53 10 00
35 Integer 1 9 0 00
36 Column 3 7 5 NULL 00
37 Ne 11 53 5 (BINARY) 6c
38 Column 3 5 4 NULL 00
39 Ne 12 53 4 (BINARY) 6c
40 Column 3 19 13 NULL 00
41 Ne 11 53 13 (BINARY) 6c
42 Column 6 0 14 NULL 00
43 Column 2 2 15 NULL 00
44 Found 5 53 14 2 00
45 MakeRecord 14 2 16 00
46 IdxInsert 5 16 0 00
47 MakeRecord 14 2 16 00
48 Column 2 2 18 NULL 00
49 Sequence 4 19 0 00
50 Move 16 20 1 00
51 MakeRecord 18 3 17 00
52 SorterInsert 4 17 0 00
53 IfPos 9 56 0 00
54 NullRow 3 0 0 00
55 Goto 0 35 0 00
56 Next 7 24 1 00
57 IfPos 6 61 0 00
58 NullRow 2 0 0 00
59 NullRow 7 0 0 00
60 Goto 0 30 0 00
61 Next 6 14 1 00
62 IfPos 1 66 0 00
63 NullRow 1 0 0 00
64 NullRow 6 0 0 00
65 Goto 0 20 0 00
66 Next 0 10 0 01
67 Close 0 0 0 00
68 Close 1 0 0 00
69 Close 6 0 0 00
70 Close 2 0 0 00
71 Close 7 0 0 00
72 Close 3 0 0 00
73 OpenPseudo 8 16 2 00
74 OpenPseudo 9 21 3 00
75 SorterSort 4 82 0 00
76 SorterData 4 21 0 00
77 Column 9 2 16 20
78 Column 8 0 14 20
79 Column 8 1 15 00
80 ResultRow 14 2 0 00
81 SorterNext 4 76 0 00
82 Halt 0 0 0 00
83 Transaction 0 0 311 0 01
84 TableLock 0 718 0 maridaje 00
85 TableLock 0 1586 0 vino_maridaje 00
86 TableLock 0 726 0 maridaje_texto 00
87 TableLock 0 1412 0 vino 00
88 Integer 1 11 0 00
89 Integer 6 12 0 00
90 Goto 0 1 0 00
This is the EXPLAIN QUERY PLAN:
0 0 0 SCAN TABLE maridaje
0 1 1 SEARCH TABLE vino_maridaje USING INDEX idx_vino_maridaje_index_ID_maridaje (ID_maridaje=?)
0 2 2 SEARCH TABLE maridaje_texto USING INDEX idx_maridaje_texto_index_ID_idioma (ID_idioma=?)
0 3 3 SEARCH TABLE vino USING INTEGER PRIMARY KEY (rowid=?)
0 0 0 USE TEMP B-TREE FOR DISTINCT
0 0 0 USE TEMP B-TREE FOR ORDER BY
As you can see, the tables already have the appropriate indexes. So how can I make the query faster?
I have a table called street_names:
CREATE TABLE street_names (
id INTEGER PRIMARY KEY NOT NULL,
name TEXT UNIQUE NOT NULL
);
When searched using LIKE it uses the index and returns results instantly. However, take this larger expression:
SELECT sn.name, sa.house_number, sa.entrance, pc.postal_code, ci.name, mu.name,
co.name, sa.latitude, sa.longitude
FROM
street_addresses AS sa
INNER JOIN street_names AS sn ON sa.street_name = sn.id
INNER JOIN postal_codes AS pc ON sa.postal_code = pc.id
INNER JOIN cities AS ci ON sa.city = ci.id
INNER JOIN municipalities AS mu ON sa.municipality = mu.id
INNER JOIN counties AS co ON mu.county = co.id
WHERE
sn.name = "FORNEBUVEIEN" AND
sa.house_number = 13
ORDER BY ci.name ASC, sn.name ASC, sa.house_number ASC, sa.entrance ASC
LIMIT 0, 100;
In its current state it's lightning fast and can run 6000 times per second on my machine, but as soon as I change the = to a LIKE on the street name:
SELECT sn.name, sa.house_number, sa.entrance, pc.postal_code, ci.name, mu.name,
co.name, sa.latitude, sa.longitude
FROM
street_addresses AS sa
INNER JOIN street_names AS sn ON sa.street_name = sn.id
INNER JOIN postal_codes AS pc ON sa.postal_code = pc.id
INNER JOIN cities AS ci ON sa.city = ci.id
INNER JOIN municipalities AS mu ON sa.municipality = mu.id
INNER JOIN counties AS co ON mu.county = co.id
WHERE
sn.name LIKE "FORNEBUVEIEN" AND
sa.house_number = 13
ORDER BY ci.name ASC, sn.name ASC, sa.house_number ASC, sa.entrance ASC
LIMIT 0, 100;
It turns sour and runs perhaps 10 times per second on my machine. Why is this? The only change I made was changing an = to a LIKE on an indexed column, and the query didn't even include any wildcards.
Table schemas:
CREATE TABLE street_addresses (
id INTEGER PRIMARY KEY NOT NULL,
house_number INTEGER NOT NULL,
entrance TEXT NOT NULL,
latitude REAL NOT NULL,
longitude REAL NOT NULL,
street_name INTEGER NOT NULL REFERENCES street_names(id),
postal_code INTEGER NOT NULL REFERENCES postal_codes(id),
city INTEGER NOT NULL REFERENCES cities(id),
municipality INTEGER NOT NULL REFERENCES municipalities(id),
CONSTRAINT unique_address UNIQUE(
street_name, house_number, entrance, postal_code, city
)
);
CREATE TABLE street_names (
id INTEGER PRIMARY KEY NOT NULL,
name TEXT UNIQUE NOT NULL
);
CREATE TABLE postal_codes (
id INTEGER PRIMARY KEY NOT NULL,
postal_code INTEGER NOT NULL,
city INTEGER NOT NULL REFERENCES cities(id),
CONSTRAINT unique_postal_code UNIQUE(postal_code, city)
);
CREATE TABLE cities (
id INTEGER PRIMARY KEY NOT NULL,
name TEXT NOT NULL,
municipality INTEGER NOT NULL REFERENCES municipalities(id),
CONSTRAINT unique_city UNIQUE(name, municipality)
);
CREATE TABLE municipalities (
id INTEGER PRIMARY KEY NOT NULL,
name TEXT NOT NULL,
NUMBER INTEGER UNIQUE NOT NULL,
county INTEGER NOT NULL REFERENCES counties(id),
CONSTRAINT unique_municipality UNIQUE(name, county)
);
CREATE TABLE counties (
id INTEGER PRIMARY KEY NOT NULL,
name TEXT UNIQUE NOT NULL
);
EXPLAIN for query ... sn.name = ... :
sqlite> EXPLAIN SELECT sn.name, sa.house_number, sa.entrance, pc.postal_code, ci.name, mu.name, co.name, sa.latitude, sa.longitude FROM street_addresses AS sa INNER JOIN street_names AS sn ON sa.street_name = sn.id INNER JOIN postal_codes AS pc ON sa.postal_code = pc.id INNER JOIN cities AS ci ON sa.city = ci.id INNER JOIN municipalities AS mu ON sa.municipality = mu.id INNER JOIN counties AS co ON mu.county = co.id WHERE sn.name = "FORNEBUVEIEN" AND sa.house_number = 13 ORDER BY ci.name ASC, sn.name ASC, sa.house_number ASC, sa.entrance ASC LIMIT 0, 100;
addr opcode p1 p2 p3 p4 p5 comment
---------- ---------- ---------- ---------- ---------- ---------- ---------- ----------
0 Init 0 91 0 00
1 OpenEpheme 6 6 0 k(4,B,B,B, 00
2 Integer 100 1 0 00
3 Integer 0 2 0 00
4 MustBeInt 2 0 0 00
5 IfPos 2 7 0 00
6 Integer 0 2 0 00
7 Add 1 2 3 00
8 IfPos 1 10 0 00
9 Integer -1 3 0 00
10 OpenRead 7 12 0 k(2,nil,ni 00
11 OpenRead 0 13 0 9 00
12 OpenRead 8 14 0 k(5,nil,ni 00
13 OpenRead 2 9 0 2 00
14 OpenRead 3 7 0 2 00
15 OpenRead 4 4 0 4 00
16 OpenRead 5 2 0 2 00
17 String8 0 4 0 FORNEBUVEI 00
18 SeekGE 7 65 4 1 00
19 IdxGT 7 65 4 1 00
20 IdxRowid 7 5 0 00
21 IsNull 5 65 0 00
22 Integer 13 6 0 00
23 SeekGE 8 65 5 2 00
24 IdxGT 8 65 5 2 00
25 IdxRowid 8 7 0 00
26 Seek 0 7 0 00
27 Column 8 3 8 00
28 MustBeInt 8 64 0 00
29 NotExists 2 64 8 00
30 Column 8 4 9 00
31 MustBeInt 9 64 0 00
32 NotExists 3 64 9 00
33 Column 0 8 10 00
34 MustBeInt 10 64 0 00
35 NotExists 4 64 10 00
36 Column 4 3 11 00
37 MustBeInt 11 64 0 00
38 NotExists 5 64 11 00
39 Column 7 0 12 00
40 Column 8 1 13 00
41 Column 8 2 14 00
42 Column 2 1 15 00
43 Column 3 1 16 00
44 Column 4 1 17 00
45 Column 5 1 18 00
46 Column 0 3 19 00
47 RealAffini 19 0 0 00
48 Column 0 4 20 00
49 RealAffini 20 0 0 00
50 MakeRecord 12 9 21 00
51 Column 3 1 22 00
52 Column 7 0 23 00
53 Column 8 1 24 00
54 Column 8 2 25 00
55 Sequence 6 26 0 00
56 Move 21 27 0 00
57 MakeRecord 22 6 28 00
58 IdxInsert 6 28 0 00
59 IfZero 3 62 0 00
60 AddImm 3 -1 0 00
61 Goto 0 64 0 00
62 Last 6 0 0 00
63 Delete 6 0 0 00
64 Next 8 24 0 00
65 Close 7 0 0 00
66 Close 0 0 0 00
67 Close 8 0 0 00
68 Close 2 0 0 00
69 Close 3 0 0 00
70 Close 4 0 0 00
71 Close 5 0 0 00
72 OpenPseudo 9 21 9 00
73 Sort 6 89 0 00
74 AddImm 2 -1 0 00
75 IfNeg 2 77 0 00
76 Goto 0 88 0 00
77 Column 6 5 21 00
78 Column 9 0 12 20
79 Column 9 1 13 00
80 Column 9 2 14 00
81 Column 9 3 15 00
82 Column 9 4 16 00
83 Column 9 5 17 00
84 Column 9 6 18 00
85 Column 9 7 19 00
86 Column 9 8 20 00
87 ResultRow 12 9 0 00
88 Next 6 74 0 00
89 Close 9 0 0 00
90 Halt 0 0 0 00
91 Transactio 0 0 10 0 01
92 TableLock 0 11 0 street_nam 00
93 TableLock 0 13 0 street_add 00
94 TableLock 0 9 0 postal_cod 00
95 TableLock 0 7 0 cities 00
96 TableLock 0 4 0 municipali 00
97 TableLock 0 2 0 counties 00
98 Goto 0 1 0 00
EXPLAIN for query ... sn.name LIKE ... :
sqlite> EXPLAIN SELECT sn.name, sa.house_number, sa.entrance, pc.postal_code, ci.name, mu.name, co.name, sa.latitude, sa.longitude FROM street_addresses AS sa INNER JOIN street_names AS sn ON sa.street_name = sn.id INNER JOIN postal_codes AS pc ON sa.postal_code = pc.id INNER JOIN cities AS ci ON sa.city = ci.id INNER JOIN municipalities AS mu ON sa.municipality = mu.id INNER JOIN counties AS co ON mu.county = co.id WHERE sn.name LIKE "FORNEBUVEIEN" AND sa.house_number = 13 ORDER BY ci.name ASC, sn.name ASC, sa.house_number ASC, sa.entrance ASC LIMIT 0, 100;
addr opcode p1 p2 p3 p4 p5 comment
---------- ---------- ---------- ---------- ---------- ---------- ---------- ----------
0 Init 0 88 0 00
1 OpenEpheme 6 6 0 k(4,B,B,B, 00
2 Integer 100 1 0 00
3 Integer 0 2 0 00
4 MustBeInt 2 0 0 00
5 IfPos 2 7 0 00
6 Integer 0 2 0 00
7 Add 1 2 3 00
8 IfPos 1 10 0 00
9 Integer -1 3 0 00
10 OpenRead 0 13 0 9 00
11 OpenRead 1 11 0 2 00
12 OpenRead 4 4 0 4 00
13 OpenRead 3 7 0 2 00
14 OpenRead 5 2 0 2 00
15 OpenRead 2 9 0 2 00
16 Rewind 0 63 0 00
17 Column 0 1 4 00
18 Ne 5 62 4 (BINARY) 6c
19 Column 0 5 6 00
20 MustBeInt 6 62 0 00
21 NotExists 1 62 6 00
22 Column 1 1 9 00
23 Function 1 8 7 like(2) 02
24 IfNot 7 62 1 00
25 Column 0 8 10 00
26 MustBeInt 10 62 0 00
27 NotExists 4 62 10 00
28 Column 0 7 11 00
29 MustBeInt 11 62 0 00
30 NotExists 3 62 11 00
31 Column 4 3 12 00
32 MustBeInt 12 62 0 00
33 NotExists 5 62 12 00
34 Column 0 6 13 00
35 MustBeInt 13 62 0 00
36 NotExists 2 62 13 00
37 Column 1 1 14 00
38 Copy 4 15 0 00
39 Column 0 2 16 00
40 Column 2 1 17 00
41 Column 3 1 18 00
42 Column 4 1 19 00
43 Column 5 1 20 00
44 Column 0 3 21 00
45 RealAffini 21 0 0 00
46 Column 0 4 22 00
47 RealAffini 22 0 0 00
48 MakeRecord 14 9 7 00
49 Column 3 1 23 00
50 Column 1 1 24 00
51 Column 0 1 25 00
52 Column 0 2 26 00
53 Sequence 6 27 0 00
54 Move 7 28 0 00
55 MakeRecord 23 6 29 00
56 IdxInsert 6 29 0 00
57 IfZero 3 60 0 00
58 AddImm 3 -1 0 00
59 Goto 0 62 0 00
60 Last 6 0 0 00
61 Delete 6 0 0 00
62 Next 0 17 0 01
63 Close 0 0 0 00
64 Close 1 0 0 00
65 Close 4 0 0 00
66 Close 3 0 0 00
67 Close 5 0 0 00
68 Close 2 0 0 00
69 OpenPseudo 7 7 9 00
70 Sort 6 86 0 00
71 AddImm 2 -1 0 00
72 IfNeg 2 74 0 00
73 Goto 0 85 0 00
74 Column 6 5 7 00
75 Column 7 0 14 20
76 Column 7 1 15 00
77 Column 7 2 16 00
78 Column 7 3 17 00
79 Column 7 4 18 00
80 Column 7 5 19 00
81 Column 7 6 20 00
82 Column 7 7 21 00
83 Column 7 8 22 00
84 ResultRow 14 9 0 00
85 Next 6 71 0 00
86 Close 7 0 0 00
87 Halt 0 0 0 00
88 Transactio 0 0 10 0 01
89 TableLock 0 13 0 street_add 00
90 TableLock 0 11 0 street_nam 00
91 TableLock 0 4 0 municipali 00
92 TableLock 0 7 0 cities 00
93 TableLock 0 2 0 counties 00
94 TableLock 0 9 0 postal_cod 00
95 Integer 13 5 0 00
96 String8 0 8 0 FORNEBUVEI 00
97 Goto 0 1 0 00
The documentation says that the LIKE optimization requires a case-insensitive index:
CREATE INDEX ci_name ON street_names(name COLLATE NOCASE);
EXPLAIN QUERY PLAN SELECT ... sn.name LIKE "FORNEBUVEIEN" ...;
0|0|1|SEARCH TABLE street_names AS sn USING COVERING INDEX ci_name (name>? AND name<?)
...
Alternatively, use GLOB to be able to use the case-sensitive index:
EXPLAIN QUERY PLAN SELECT ... sn.name GLOB "FORNEBUVEIEN" ...;
0|0|1|SEARCH TABLE street_names AS sn USING COVERING INDEX sqlite_autoindex_street_names_1 (name>? AND name<?)
...
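Applied to the query from the question, that just means swapping the operator; everything else stays the same (a sketch; the literal is in single quotes so it cannot be mistaken for an identifier):
SELECT sn.name, sa.house_number, sa.entrance, pc.postal_code, ci.name, mu.name,
       co.name, sa.latitude, sa.longitude
FROM
    street_addresses AS sa
    INNER JOIN street_names AS sn ON sa.street_name = sn.id
    INNER JOIN postal_codes AS pc ON sa.postal_code = pc.id
    INNER JOIN cities AS ci ON sa.city = ci.id
    INNER JOIN municipalities AS mu ON sa.municipality = mu.id
    INNER JOIN counties AS co ON mu.county = co.id
WHERE
    sn.name GLOB 'FORNEBUVEIEN' AND
    sa.house_number = 13
ORDER BY ci.name ASC, sn.name ASC, sa.house_number ASC, sa.entrance ASC
LIMIT 0, 100;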
I am not an SQLite expert, but in any SQL dialect, LIKE is never going to be as fast as =. But perhaps you can rearrange the query to optimise it:
SELECT sn.name, sa.house_number, sa.entrance, pc.postal_code, ci.name, mu.name,
co.name, sa.latitude, sa.longitude
FROM
street_addresses AS sa
INNER JOIN street_names AS sn ON sa.street_name = sn.id
AND sn.name LIKE "FORNEBUVEIEN"
INNER JOIN postal_codes AS pc ON sa.postal_code = pc.id
AND sa.house_number = 13
INNER JOIN cities AS ci ON sa.city = ci.id
INNER JOIN municipalities AS mu ON sa.municipality = mu.id
INNER JOIN counties AS co ON mu.county = co.id
ORDER BY ci.name ASC, sn.name ASC, sa.house_number ASC, sa.entrance ASC
LIMIT 0, 100;
My thinking is that by forcing the evaluation early, there is less data to process. Of course, if the optimiser is smart enough, it would already optimise the access path.
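If you want to force the narrowing to happen first regardless of what the planner chooses, another thing to try is pushing the LIKE into a subquery on street_names and joining against that. This is only a sketch based on the schema above, and SQLite's subquery flattening may merge it back into the outer query, so verify the effect with EXPLAIN QUERY PLAN:
SELECT sn.name, sa.house_number, sa.entrance, pc.postal_code, ci.name, mu.name,
       co.name, sa.latitude, sa.longitude
FROM
    (SELECT id, name FROM street_names WHERE name LIKE 'FORNEBUVEIEN') AS sn
    INNER JOIN street_addresses AS sa ON sa.street_name = sn.id
    INNER JOIN postal_codes AS pc ON sa.postal_code = pc.id
    INNER JOIN cities AS ci ON sa.city = ci.id
    INNER JOIN municipalities AS mu ON sa.municipality = mu.id
    INNER JOIN counties AS co ON mu.county = co.id
WHERE sa.house_number = 13
ORDER BY ci.name ASC, sn.name ASC, sa.house_number ASC, sa.entrance ASC
LIMIT 0, 100;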
I'm confused by the drastically different running times of the following two queries that produce identical output. The queries are running on Sqlite 3.7.9, on a table with about 4.5 million rows, and each produce ~50 rows of results.
Here are the queries:
% echo "SELECT DISTINCT acolumn FROM atable ORDER BY acolumn;" | time sqlite3 mydb
sqlite3 mydb 8.87s user 15.06s system 99% cpu 23.980 total
% echo "SELECT acolumn FROM (SELECT DISTINCT acolumn FROM atable) ORDER BY acolumn;" | time sqlite3 options
sqlite3 mydb 1.15s user 0.10s system 98% cpu 1.267 total
Shouldn't the performance of the two queries be closer? I understand that it may be the case that the query planner is performing the "sort" and "distinct" operations in different orders, but if so, does it need to? Or should it be able to figure out how to do it fastest?
Edit: as requested here is the output of the "EXPLAIN QUERY PLAN" command for each query.
For the first (monolithic) query:
0|0|0|SCAN TABLE atable (~1000000 rows)
0|0|0|USE TEMP B-TREE FOR DISTINCT
For the second (subquery) query:
1|0|0|SCAN TABLE atable (~1000000 rows)
1|0|0|USE TEMP B-TREE FOR DISTINCT
0|0|0|SCAN SUBQUERY 1 (~1000000 rows)
0|0|0|USE TEMP B-TREE FOR ORDER BY
Your first query orders the records first by inserting all of them into a sorted temporary table, and then implements the DISTINCT by going through them and returning only those that are not identical to the previous one.
(This can be seen in the EXPLAIN output shown below; the DISTINCT actually got converted to a GROUP BY, which behaves the same.)
Your second query is, in theory, identical to the first, but SQLite's query optimizer is rather simple and cannot prove that flattening the subquery into the outer query would be safe (as explained in the subquery flattening documentation).
Therefore, it is implemented by doing the DISTINCT first, by inserting only any non-duplicates into a temporary table, and then doing the ORDER BY with a second temporary table.
This second step is completely superfluous because the first temp table was already sorted, but this happens to be faster for your data anyway because you have so many duplicates that are never stored in either temp table.
In theory, your first query could be faster, because SQLite has already recognized that the DISTINCT and ORDER BY clauses can be implemented with the same sorted temporary table.
In practice, however, SQLite is not smart enough to remember that the DISTINCT implies that duplicates do not need to be stored in the temp table.
(This particular optimization might be added to SQLite if you ask nicely on the mailing list.)
$ sqlite3 mydb
sqlite> .explain
sqlite> explain SELECT DISTINCT acolumn FROM atable ORDER BY acolumn;
addr opcode p1 p2 p3 p4 p5 comment
---- ------------- ---- ---- ---- ------------- -- -------------
0 Trace 0 0 0 00
1 SorterOpen 1 2 0 keyinfo(1,BINARY) 00
2 Integer 0 3 0 00 clear abort flag
3 Integer 0 2 0 00 indicate accumulator empty
4 Null 0 6 6 00
5 Gosub 5 37 0 00
6 Goto 0 40 0 00
7 OpenRead 0 2 0 1 00 atable
8 Rewind 0 14 0 00
9 Column 0 0 8 00 atable.acolumn
10 Sequence 1 9 0 00
11 MakeRecord 8 2 10 00
12 SorterInsert 1 10 0 00
13 Next 0 9 0 01
14 Close 0 0 0 00
15 OpenPseudo 2 10 2 00
16 SorterSort 1 39 0 00 GROUP BY sort
17 SorterData 1 10 0 00
18 Column 2 0 7 20
19 Compare 6 7 1 keyinfo(1,BINARY) 00
20 Jump 21 25 21 00
21 Move 7 6 0 00
22 Gosub 4 32 0 00 output one row
23 IfPos 3 39 0 00 check abort flag
24 Gosub 5 37 0 00 reset accumulator
25 Column 2 0 1 00
26 Integer 1 2 0 00 indicate data in accumulator
27 SorterNext 1 17 0 00
28 Gosub 4 32 0 00 output final row
29 Goto 0 39 0 00
30 Integer 1 3 0 00 set abort flag
31 Return 4 0 0 00
32 IfPos 2 34 0 00 Groupby result generator entry point
33 Return 4 0 0 00
34 Copy 1 11 0 00
35 ResultRow 11 1 0 00
36 Return 4 0 0 00 end groupby result generator
37 Null 0 1 0 00
38 Return 5 0 0 00
39 Halt 0 0 0 00
40 Transaction 0 0 0 00
41 VerifyCookie 0 2 0 00
42 TableLock 0 2 0 atable 00
43 Goto 0 7 0 00
sqlite> explain SELECT acolumn FROM (SELECT DISTINCT acolumn FROM atable) ORDER BY acolumn;
addr opcode p1 p2 p3 p4 p5 comment
---- ------------- ---- ---- ---- ------------- -- -------------
0 Trace 0 0 0 00
1 Goto 0 39 0 00
2 Goto 0 17 0 00
3 OpenPseudo 0 3 1 01 coroutine for sqlite_subquery_DA7480_
4 Integer 0 2 0 01
5 OpenEphemeral 2 0 0 keyinfo(1,BINARY) 08
6 OpenRead 1 2 0 1 00 atable
7 Rewind 1 14 0 00
8 Column 1 0 3 00 atable.acolumn
9 Found 2 13 3 1 00
10 MakeRecord 3 1 4 00
11 IdxInsert 2 4 0 00
12 Yield 1 0 0 00
13 Next 1 8 0 01
14 Close 1 0 0 00
15 Integer 1 2 0 00
16 Yield 1 0 0 00 end sqlite_subquery_DA7480_
17 SorterOpen 3 3 0 keyinfo(1,BINARY) 00
18 Integer 2 1 0 00
19 Yield 1 0 0 00 next row of co-routine sqlite_subquery_DA7480_
20 If 2 29 0 00
21 Column 0 0 5 00 sqlite_subquery_DA7480_.acolumn
22 MakeRecord 5 1 6 00
23 Column 0 0 7 00 sqlite_subquery_DA7480_.acolumn
24 Sequence 3 8 0 00
25 Move 6 9 0 00
26 MakeRecord 7 3 10 00
27 SorterInsert 3 10 0 00
28 Goto 0 19 0 00
29 OpenPseudo 4 6 1 00
30 OpenPseudo 5 11 3 00
31 SorterSort 3 37 0 00
32 SorterData 3 11 0 00
33 Column 5 2 6 20
34 Column 4 0 5 20
35 ResultRow 5 1 0 00
36 SorterNext 3 32 0 00
37 Close 4 0 0 00
38 Halt 0 0 0 00
39 Transaction 0 0 0 00
40 VerifyCookie 0 2 0 00
41 TableLock 0 2 0 atable 00
42 Goto 0 2 0 00
Inside most DBMS, SQL statements are translated into relational algebra and then structured in an expression tree.
The dbms then uses heuristics to optimise queries. One of the main heuristics is "Perform selection early" (p.46). I suppose the sqlite query planner does this as well, hence the differences in execution time.
Since the result of the subquery is much smaller (~50 rows as opposed to 4.5 million), sorting at the end of the expression tree happens much faster. Plain selection isn't a very expensive operation; running operations on a huge number of intermediate results is.
I believe this must be because the order and distinct operations are implemented more efficiently when separated by the subselect, which is effectively a simpler way of saying what alexdeloy is saying.
This experiment is not complete. Please also run the following:
% echo "SELECT acolumn FROM (SELECT DISTINCT acolumn FROM atable ORDER BY acolumn) ;" | time sqlite3 mydb
Tell me if this takes longer than the other two on average. Thanks.
I have this query:
DELETE FROM f
WHERE ft != 'f'
OR fs NOT IN ( SELECT fs
FROM f
GROUP BY fs
HAVING COUNT(fs) >1)
It is doing its job nicely, except that it takes much more time than I expected: about 2.25 seconds to bring ~209,000 records down to ~187,000. I think it could be improved, and I would like to know how.
Query EXPLAIN:
addr opcode p1 p2 p3 p4 p5 comment
---- ------------- ---- ---- ---- ----------------- -- -------------
0 Trace 0 0 0 00
1 Goto 0 82 0 00
2 Null 0 1 0 00
3 String8 0 3 0 f 00
4 OpenRead 0 2 0 4 00
5 Rewind 0 69 0 00
6 Column 0 2 4 00
7 Ne 3 66 4 collseq(BINARY) 63
8 If 7 53 0 00
9 Integer 1 7 0 00
10 Null 0 6 0 00
11 OpenEphemeral 4 1 0 keyinfo(1,BINARY) 00
12 OpenEphemeral 5 2 0 keyinfo(1,BINARY) 00
13 Integer 0 11 0 00
14 Integer 0 10 0 00
15 Gosub 13 50 0 00
16 OpenRead 2 2 0 4 00
17 Rewind 2 23 0 00
18 Column 2 3 16 00
19 Sequence 5 17 0 00
20 MakeRecord 16 2 4 00
21 IdxInsert 5 4 0 00
22 Next 2 18 0 01
23 Close 2 0 0 00
24 Sort 5 53 0 00
25 Column 5 0 15 00
26 Compare 14 15 1 keyinfo(1,BINARY) 00
27 Jump 28 32 28 00
28 Move 15 14 1 00
29 Gosub 12 41 0 00
30 IfPos 11 53 0 00
31 Gosub 13 50 0 00
32 Column 5 0 16 00
33 AggStep 0 16 9 count(1) 01
34 Column 5 0 8 00
35 Integer 1 10 0 00
36 Next 5 25 0 00
37 Gosub 12 41 0 00
38 Goto 0 53 0 00
39 Integer 1 11 0 00
40 Return 12 0 0 00
41 IfPos 10 43 0 00
42 Return 12 0 0 00
43 AggFinal 9 1 0 count(1) 00
44 Integer 1 4 0 00
45 Le 4 42 9 6a
46 SCopy 8 18 0 00
47 MakeRecord 18 1 4 c 00
48 IdxInsert 4 4 0 00
49 Return 12 0 0 00
50 Null 0 8 0 00
51 Null 0 9 0 00
52 Return 13 0 0 00
53 Column 0 3 4 00
54 NotNull 4 57 0 00
55 Rewind 4 66 0 00
56 Goto 0 68 0 00
57 Affinity 4 1 0 c 00
58 Found 4 65 4 1 00
59 NotNull 6 63 0 00
60 Found 4 62 6 1 00
61 Integer -1 6 0 00
62 AddImm 6 1 0 00
63 If 6 68 0 00
64 Goto 0 66 0 00
65 Goto 0 68 0 00
66 Rowid 0 2 0 00
67 RowSetAdd 1 2 0 00
68 Next 0 6 0 01
69 Close 0 0 0 00
70 OpenWrite 0 2 0 4 00
71 OpenWrite 1 3 0 keyinfo(1,BINARY) 00
72 RowSetRead 1 79 2 00
73 NotExists 0 78 2 00
74 Rowid 0 20 0 00
75 Column 0 1 19 00
76 IdxDelete 1 19 2 00
77 Delete 0 1 0 f 00
78 Goto 0 72 0 00
79 Close 1 3 0 00
80 Close 0 0 0 00
81 Halt 0 0 0 00
82 Transaction 0 1 0 00
83 VerifyCookie 0 2 0 00
84 TableLock 0 2 1 f 00
85 Goto 0 2 0 00
Table definition (no indexes yet):
CREATE TABLE f (fi INTEGER PRIMARY KEY AUTOINCREMENT,
fn STRING,
ft STRING,
fs INTEGER)
I'm not sure that two seconds is a "killer" query (though this, of course, depends on your circumstances and needs) but one thing you could test is the effect of splitting the query into two.
That's because it will currently delete records matching either condition, so it's an easy transformation into two delete statements (within a transaction if you want to ensure it's atomic). You could try:
DELETE FROM f WHERE ft != 'f';
DELETE FROM f WHERE fs NOT IN (
SELECT fs FROM f
GROUP BY fs
HAVING COUNT(fs) >1);
and see if that improves things. It may or it may not, depending on your DBMS and the makeup of your data. It's likely to get rid of the crossover records (those satisfying both conditions) in the most-likely-faster first query.
But, as with all database optimisations, measure, don't guess!
Then you can either recombine and re-evaluate, or concentrate on speeding up the second simpler query if it's still a problem.
One thing to particularly make sure of: have indexes on both the ft and fs columns. That should speed up your queries quite a bit if you don't already have them.
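A minimal sketch of the split approach, wrapped in a transaction and with the suggested indexes created first (the index names are just illustrative):
CREATE INDEX IF NOT EXISTS idx_f_ft ON f (ft);
CREATE INDEX IF NOT EXISTS idx_f_fs ON f (fs);

BEGIN;
DELETE FROM f WHERE ft != 'f';
DELETE FROM f WHERE fs NOT IN (
    SELECT fs FROM f
    GROUP BY fs
    HAVING COUNT(fs) > 1);
COMMIT;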
Try this query (note that the primary key column in your schema is fi, not id):
DELETE FROM f
WHERE ft <> 'f'
   OR NOT EXISTS (SELECT *
                  FROM f f2
                  WHERE f2.fs = f.fs AND f2.fi <> f.fi)
Indexes:
on ft ; single-column
on (fs, fi) ; composite, in that order
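Spelled out as DDL (a sketch; the index names are just examples):
CREATE INDEX IF NOT EXISTS idx_f_ft ON f (ft);        -- single-column index on ft
CREATE INDEX IF NOT EXISTS idx_f_fs_fi ON f (fs, fi); -- composite index on (fs, fi), in that order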
Another option is to invert the subquery condition so the outer test becomes an IN instead of a NOT IN (SQLite does not support the DELETE ... FROM join syntax):
DELETE FROM f
WHERE f.ft != 'f'
   OR f.fs IN ( SELECT fs
                FROM f
                GROUP BY fs
                HAVING COUNT(fs) <= 1 );
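Before running the destructive statement, a quick sanity check is to preview the affected rows with the same predicate (a sketch assuming the rewrite above):
SELECT COUNT(*) AS rows_to_delete
FROM f
WHERE f.ft != 'f'
   OR f.fs IN ( SELECT fs
                FROM f
                GROUP BY fs
                HAVING COUNT(fs) <= 1 );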