I am trying to optimize the query I use for fetching paginated data from a database with large data sets.
My schema looks like this:
CREATE TABLE users (
user_id TEXT PRIMARY KEY,
name TEXT,
custom_fields TEXT
);
CREATE TABLE events (
event_id TEXT PRIMARY KEY,
organizer_id TEXT NOT NULL REFERENCES users(user_id) ON DELETE SET NULL ON UPDATE CASCADE,
name TEXT NOT NULL,
type TEXT NOT NULL,
start_time INTEGER,
duration INTEGER
-- more columns here, omitted for the sake of simplicity
);
CREATE INDEX events_organizer_id_start_time_idx ON events(organizer_id, start_time);
CREATE INDEX events_organizer_id_type_idx ON events(organizer_id, type);
CREATE INDEX events_organizer_id_type_start_time_idx ON events(organizer_id, type, start_time);
CREATE INDEX events_type_start_time_idx ON events(type, start_time);
CREATE INDEX events_start_time_desc_idx ON events(start_time DESC);
CREATE INDEX events_start_time_asc_idx ON events(IFNULL(start_time, 253402300800) ASC);
CREATE TABLE event_participants (
participant_id TEXT NOT NULL REFERENCES users(user_id) ON DELETE CASCADE ON UPDATE CASCADE,
event_id TEXT NOT NULL REFERENCES events(event_id) ON DELETE CASCADE ON UPDATE CASCADE,
role INTEGER NOT NULL DEFAULT 0,
UNIQUE (participant_id, event_id) ON CONFLICT REPLACE
);
CREATE INDEX event_participants_participant_id_event_id_idx ON event_participants(participant_id, event_id);
CREATE INDEX event_participants_event_id_idx ON event_participants(event_id);
CREATE TABLE event_tag_maps (
event_id TEXT NOT NULL REFERENCES events(event_id) ON DELETE CASCADE ON UPDATE CASCADE,
tag_id TEXT NOT NULL,
PRIMARY KEY (event_id, tag_id) ON CONFLICT IGNORE
);
CREATE INDEX event_tag_maps_event_id_tag_id_idx ON event_tag_maps(event_id, tag_id);
The events table has around 1,500,000 rows, and event_participants has around 2,000,000.
Now, a typical query would look something like:
SELECT
EVTS.event_id,
EVTS.type,
EVTS.name,
EVTS.time,
EVTS.duration
FROM events AS EVTS
WHERE
EVTS.organizer_id IN(
'f39c3bb1-3ee3-11e6-a0dc-005056c00008',
'4555e70f-3f1d-11e6-a0dc-005056c00008',
'6e7e33ae-3f1c-11e6-a0dc-005056c00008',
'4850a6a0-3ee4-11e6-a0dc-005056c00008',
'e06f784c-3eea-11e6-a0dc-005056c00008',
'bc6a0f73-3f1d-11e6-a0dc-005056c00008',
'68959fb5-3ef3-11e6-a0dc-005056c00008',
'c4c96cf2-3f1a-11e6-a0dc-005056c00008',
'727e49d1-3f1b-11e6-a0dc-005056c00008',
'930bcfb6-3f09-11e6-a0dc-005056c00008')
AND EVTS.type IN('Meeting', 'Conversation')
AND(
EXISTS (
SELECT 1 FROM event_tag_maps AS ETM WHERE ETM.event_id = EVTS.event_id AND
ETM.tag_id IN ('00000000-0000-0000-0000-000000000000', '6ae6870f-1aac-11e6-aeb9-005056c00008', '6ae6870c-1aac-11e6-aeb9-005056c00008', '1f6d3ccb-eaed-4068-a46b-ec2547fec1ff'))
OR NOT EXISTS (
SELECT 1 FROM event_tag_maps AS ETM WHERE ETM.event_id = EVTS.event_id)
)
AND EXISTS (
SELECT 1 FROM event_participants AS EPRTS
WHERE
EVTS.event_id = EPRTS.event_id
AND participant_id NOT IN('79869516-3ef2-11e6-a0dc-005056c00008', '79869515-3ef2-11e6-a0dc-005056c00008', '79869516-4e18-11e6-a0dc-005056c00008')
)
ORDER BY IFNULL(EVTS.start_time, 253402300800) ASC
LIMIT 100 OFFSET @Offset;
Also, for fetching the overall count of the query-matching items, I would use the above query with count(1) instead of the columns and without the ORDER BY and LIMIT/OFFSET clauses.
I experience two main problems here:
1) Performance decreases drastically as I increase the @Offset value. The difference is very significant: from almost immediate to several seconds.
2) The count query takes a long time (number of seconds) and produces the following execution plan:
0|0|0|SCAN TABLE events AS EVTS
0|0|0|EXECUTE LIST SUBQUERY 1
0|0|0|EXECUTE LIST SUBQUERY 1
0|0|0|EXECUTE CORRELATED SCALAR SUBQUERY 1
1|0|0|SEARCH TABLE event_tag_maps AS ETM USING COVERING INDEX event_tag_maps_event_id_tag_id_idx (event_id=? AND tag_id=?)
1|0|0|EXECUTE LIST SUBQUERY 2
0|0|0|EXECUTE CORRELATED SCALAR SUBQUERY 2
2|0|0|SEARCH TABLE event_tag_maps AS ETM USING COVERING INDEX event_tag_maps_event_id_tag_id_idx (event_id=?)
0|0|0|EXECUTE CORRELATED SCALAR SUBQUERY 3
3|0|0|SEARCH TABLE event_participants AS EPRTS USING INDEX event_participants_event_id_idx (event_id=?)
Here I don't understand why the full scan is performed instead of an index scan.
Additional info and SQLite settings used:
I use System.Data.SQLite provider (have to, because of custom functions support)
Page size = cluster size (4096 in my case)
Cache size = 100000
Journal mode = WAL
Temp store = 2 (memory)
No transaction is open for the query
Is there anything I could do to change the query/schema or settings in order to get as much performance improvement as possible?
Given the following:
-- This table will have roughly 14 million records
CREATE TABLE IdMappings
(
Id int IDENTITY(1,1) NOT NULL,
OldId int NOT NULL,
NewId int NOT NULL,
RecordType varchar(80) NOT NULL, -- 15 distinct values, will never increase
Processed bit NOT NULL DEFAULT 0,
CONSTRAINT pk_IdMappings
PRIMARY KEY CLUSTERED (Id ASC)
)
CREATE UNIQUE INDEX ux_IdMappings_OldId ON IdMappings (OldId);
CREATE UNIQUE INDEX ux_IdMappings_NewId ON IdMappings (NewId);
and this is the most common query run against the table:
WHILE @firstBatchId <= @maxBatchId
BEGIN
-- the result of this is used to insert into another table:
SELECT
NewId, -- and lots of non-indexed columns from SOME_TABLE
FROM
IdMappings map
INNER JOIN
SOME_TABLE foo ON foo.Id = map.OldId
WHERE
map.Id BETWEEN @firstBatchId AND @lastBatchId
AND map.RecordType = @someRecordType
AND map.Processed = 0
-- We only really need this in case the user kills the binary or SQL Server service:
UPDATE IdMappings
SET Processed = 1
WHERE Id BETWEEN @firstBatchId AND @lastBatchId
AND RecordType = @someRecordType
SET @firstBatchId += 4999
SET @lastBatchId += 4999
END
What are the best indices to add? I figure Processed isn't worth indexing since it only has 2 values. Is it worth indexing RecordType since there are only about 15 distinct values? How many distinct values will a column likely have before we consider indexing it?
Is there any advantage in a composite key if some of the fields are in the WHERE and some are in a JOIN's ON condition? For example:
CREATE INDEX ix_IdMappings_RecordType_OldId
ON IdMappings (RecordType, OldId)
... if I wanted both these fields indexed (I'm not saying I do), does this composite key gain any advantage since both columns don't appear together in the same WHERE or same ON?
Insert time into IdMappings isn't really an issue. After we insert all records into the table, we don't need to do so again for months if ever.
I have a SQL table with a column called [applied]. Only one row may be applied (have the value 1); all other rows should have the value 0.
Is there a check constraint that I can write to enforce this?
If you use null instead of 0, it will be much easier.
Have a CHECK constraint to make sure the (non-null) value = 1. Also have a UNIQUE constraint to only allow a single value 1.
create table testtable (
id int primary key,
applied int,
constraint applied_unique unique (applied),
constraint applied_eq_1 check (applied = 1)
);
Core ANSI SQL, i.e. expected to work with any database.
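A minimal sketch of how that table behaves, run here against SQLite through Python's sqlite3 module. The choice of SQLite is my assumption for the sake of a runnable example, and the helper name try_insert is hypothetical. It relies on UNIQUE allowing several NULLs and on a CHECK that evaluates to NULL being treated as passing, which is how SQLite behaves; other engines may differ on the NULL handling.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    create table testtable (
        id int primary key,
        applied int,
        constraint applied_unique unique (applied),
        constraint applied_eq_1 check (applied = 1)
    )
""")

# Any number of rows may be "not applied" (NULL); exactly one row holds 1.
conn.executemany(
    "insert into testtable (id, applied) values (?, ?)",
    [(1, None), (2, None), (3, 1)],
)


def try_insert(row_id, applied):
    """Return True if the insert is accepted, False if a constraint rejects it."""
    try:
        conn.execute(
            "insert into testtable (id, applied) values (?, ?)", (row_id, applied)
        )
        return True
    except sqlite3.IntegrityError:
        return False


second_applied_ok = try_insert(4, 1)    # rejected by applied_unique
zero_ok = try_insert(5, 0)              # rejected by applied_eq_1
another_null_ok = try_insert(6, None)   # accepted: NULL means "not applied"
print(second_applied_ok, zero_ok, another_null_ok)
```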
Many databases (for example PostgreSQL, SQL Server and SQLite) support filtered (partial) unique indexes:
create unique index unq_t_applied on t(applied) where applied = 1;
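As a rough illustration of the filtered index, here is a sketch using SQLite (my assumption, chosen because it ships with Python and supports partial indexes). The table t and its columns are hypothetical stand-ins for the asker's table. Rows with applied = 0 are unconstrained, while a second row with applied = 1 is rejected.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table t (id integer primary key, applied int not null default 0)"
)
# The index only contains rows where applied = 1, so uniqueness applies only to them.
conn.execute("create unique index unq_t_applied on t(applied) where applied = 1")

conn.executemany(
    "insert into t (id, applied) values (?, ?)", [(1, 0), (2, 0), (3, 1)]
)

try:
    conn.execute("insert into t (id, applied) values (4, 1)")
    second_applied_rejected = False
except sqlite3.IntegrityError:
    second_applied_rejected = True

# Moving the flag: clear the old row first, then set the new one, in one transaction.
with conn:
    conn.execute("update t set applied = 0 where applied = 1")
    conn.execute("update t set applied = 1 where id = 2")

applied_ids = [row[0] for row in conn.execute("select id from t where applied = 1")]
print(second_applied_rejected, applied_ids)
```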
To say exactly how to write a trigger for this, we would need to know which database you use.
You will need a trigger that uses this query as its check:
SELECT COUNT(APPLIED)
FROM TEST
WHERE APPLIED = 1
If the count is greater than 0, do not allow the insert; otherwise allow it.
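Since the database is not known, the following is only a sketch of what such a trigger could look like in SQLite syntax (an assumption on my part; trigger syntax differs between engines). The trigger name test_single_applied_ins is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table test (id integer primary key, applied int not null default 0)"
)
# Reject an insert of applied = 1 when some row already has applied = 1.
conn.execute("""
    create trigger test_single_applied_ins
    before insert on test
    when new.applied = 1
         and (select count(applied) from test where applied = 1) > 0
    begin
        select raise(abort, 'only one row may have applied = 1');
    end
""")

conn.execute("insert into test (id, applied) values (1, 0)")
conn.execute("insert into test (id, applied) values (2, 1)")

try:
    conn.execute("insert into test (id, applied) values (3, 1)")
    second_applied_rejected = False
except sqlite3.DatabaseError:
    second_applied_rejected = True

applied_count = conn.execute(
    "select count(applied) from test where applied = 1"
).fetchone()[0]
print(second_applied_rejected, applied_count)
```

This sketch only guards inserts, as described above; an update that sets applied = 1 on a second row would need an equivalent trigger on update.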
While this can be done with triggers and constraints, they probably require an index. Instead, consider a join table.
create table things_applied (
id smallint primary key default 1,
thing_id bigint references things(id) not null,
check(id = 1)
);
Because the primary key is unique, there can only ever be one row.
The first is activated with an insert.
insert into things_applied (thing_id) values (1);
Change it by updating the row.
update things_applied set thing_id = 2;
To deactivate completely, delete the row.
delete from things_applied;
To find the active row, join with the table.
select t.*
from things t
join things_applied ta on ta.thing_id = t.id
To check if it's active at all, count the rows.
select count(id) as active
from things_applied
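The whole life cycle above can be sketched end to end. This version uses SQLite via Python (my assumption; the statements above read like PostgreSQL), with a hypothetical two-column things table. I also added not null on id, because SQLite would otherwise accept a NULL primary key here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("pragma foreign_keys = on")  # SQLite only enforces FKs when enabled
conn.execute("create table things (id integer primary key, name text)")
conn.execute("""
    create table things_applied (
        id smallint not null primary key default 1,
        thing_id bigint not null references things(id),
        check (id = 1)
    )
""")
conn.executemany("insert into things (id, name) values (?, ?)", [(1, "a"), (2, "b")])

# Activate thing 1.
conn.execute("insert into things_applied (thing_id) values (1)")

# A second row cannot exist: id defaults to 1 again and collides with the primary key.
try:
    conn.execute("insert into things_applied (thing_id) values (2)")
    second_row_rejected = False
except sqlite3.IntegrityError:
    second_row_rejected = True

# Switch the active thing by updating the single row.
conn.execute("update things_applied set thing_id = 2")
active = conn.execute("""
    select t.id, t.name
    from things t
    join things_applied ta on ta.thing_id = t.id
""").fetchall()

# Deactivate completely.
conn.execute("delete from things_applied")
active_count = conn.execute("select count(id) from things_applied").fetchone()[0]
print(second_row_rejected, active, active_count)
```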
A table for storing items, in a particular order, associated with a container. Separate ak_* constraints involving item_id and seq ensure a container contains distinct items and the sequence of those items is distinct.
CREATE TABLE [container_items] (
[container_item_id] INT IDENTITY (1, 1) NOT NULL,
[container_id] INT NOT NULL,
[item_id] INT NOT NULL,
[seq] INT NOT NULL,
CONSTRAINT [pk_container_item] PRIMARY KEY CLUSTERED ([container_item_id] ASC),
CONSTRAINT [ak_container_item_seq] UNIQUE NONCLUSTERED ([container_id] ASC, [seq] ASC),
CONSTRAINT [ak_container_item_item] UNIQUE NONCLUSTERED ([container_id] ASC, [item_id] ASC),
CONSTRAINT [fk_container_item_item] FOREIGN KEY ([item_id]) REFERENCES [items] ([item_id]),
CONSTRAINT [fk_container_item_container] FOREIGN KEY ([container_id]) REFERENCES [containers] ([container_id])
);
Suppose for container_id=1 the original data is
container_item_id, container_id, item_id, seq
1,1,1,1
2,1,3,2
3,1,10,3
4,1,8,4
and some client app for reordering says the new sequence for the item_ids is
8,1
10,2
3,3
1,4
The ak_* constraints make it impossible to update the database table directly. For instance, trying an update in this manner:
update container_items
set item_id=8, seq=1
where container_item_id = 1
fails with:
Violation of UNIQUE KEY constraint ak_container_item_item. Cannot insert duplicate key in object 'container_items'. The duplicate key value is (1, 8).
The statement has been terminated.
Q: Is it worth the effort to find an algorithm that would reuse existing container_item_id records when the seq order is changed?
A non-reusing approach would be to delete the existing records for container_id=1 and then append the newly sequenced item_ids as new records.
You can encapsulate the whole operation in a simple atomic transaction. You also need at least one auxiliary value per row; this sample uses the negated values (seq * -1) as temporary placeholders:
begin transaction tx1;
set transaction isolation level serializable;
update container_items
set seq = -1 * seq -- <-- set aux values
where container_id = 1;
update container_items
set seq = 1
where container_id = 1 and item_id = 8;
update container_items
set seq = 2
where container_id = 1 and item_id = 10;
-- and so on
commit;
Note that you can run this at repeatable read with the same guarantees, because no phantom rows are created.
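The negate-then-renumber idea can be tried out with the sample data from the question. This is a sketch on SQLite via Python (an assumption; the question is about SQL Server), with a hypothetical helper named reorder. It assumes every existing seq is positive, so the negated values can never collide with a new one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    create table container_items (
        container_item_id integer primary key,
        container_id int not null,
        item_id int not null,
        seq int not null,
        unique (container_id, seq),
        unique (container_id, item_id)
    )
""")
conn.executemany(
    "insert into container_items values (?, ?, ?, ?)",
    [(1, 1, 1, 1), (2, 1, 3, 2), (3, 1, 10, 3), (4, 1, 8, 4)],
)
conn.commit()


def reorder(conn, container_id, new_order):
    """Apply new_order, a list of (item_id, seq) pairs, reusing the existing rows."""
    with conn:  # commits on success, rolls back if any statement fails
        # Step 1: park every row on a negative seq so no new value can collide.
        conn.execute(
            "update container_items set seq = -seq where container_id = ?",
            (container_id,),
        )
        # Step 2: assign the final sequence numbers item by item.
        conn.executemany(
            "update container_items set seq = ? "
            "where container_id = ? and item_id = ?",
            [(seq, container_id, item_id) for item_id, seq in new_order],
        )


reorder(conn, 1, [(8, 1), (10, 2), (3, 3), (1, 4)])
rows = conn.execute(
    "select container_item_id, item_id, seq from container_items "
    "where container_id = 1 order by seq"
).fetchall()
print(rows)
```

The container_item_id values are unchanged; only seq moves, which is the "reuse existing records" variant asked about.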
I have two tables:
CREATE TABLE routing
(
id integer NOT NULL,
link_geom geometry,
source integer,
target integer,
traveltime_min double precision,
CONSTRAINT routing_pkey PRIMARY KEY (id)
)
WITH (
OIDS=FALSE
);
CREATE INDEX routing_id_idx
ON routing
USING btree
(id);
CREATE INDEX routing_link_geom_gidx
ON routing
USING gist
(link_geom);
CREATE INDEX routing_source_idx
ON routing
USING btree
(source);
CREATE INDEX routing_target_idx
ON routing
USING btree
(target);
and
CREATE TABLE test
(
link_id character varying,
link_geom geometry,
id integer NOT NULL,
.. (some more attributes here)
traveltime_min double precision,
CONSTRAINT id PRIMARY KEY (id),
CONSTRAINT test_link_id_key UNIQUE (link_id)
)
WITH (
OIDS=FALSE
);
ALTER TABLE test
OWNER TO postgres;
and I am trying to apply the following query:
update routing
set traveltime_min = t2.traveltime_min
from test t2
where t2.id = routing.id
Both tables have nearly 10 million rows. The problem is that this query never seems to finish. Here is what EXPLAIN shows:
Update on routing (cost=601725.94..1804772.15 rows=9712264 width=208)
-> Hash Join (cost=601725.94..1804772.15 rows=9712264 width=208)
Hash Cond: (routing.id = t2.id)
-> Seq Scan on routing (cost=0.00..366200.23 rows=9798223 width=194)
-> Hash (cost=423414.64..423414.64 rows=9712264 width=18)
-> Seq Scan on test t2 (cost=0.00..423414.64 rows=9712264 width=18)
I cannot understand what might cause the problem of such a slow response.
Could the problem be caused by the server settings? I am using the default PostgreSQL 9.3 settings.
Drop all indexes on routing before you run the UPDATE and add them again afterwards. That will bring a huge improvement.
Set work_mem high in the session where you run the UPDATE. That will help with the hash.
Set shared_buffers to ¼ of the available memory, but not more than 1GB.
If not all the rows are actually changed by the UPDATE (because they would get the same value they already have), you should avoid these idempotent (no-op) updates.
If you expect the query to affect every row, the query plan is not important [except, maybe, for the case of overflowing hash tables ...].
-- these could be needed if the update would be more selective...
VACUUM analyze routing;
VACUUM analyze test;
UPDATE routing dst
SET traveltime_min = src.traveltime_min
FROM test src
WHERE dst.id = src.id
-- avoid useless updates and row-versions
AND dst.traveltime_min IS DISTINCT FROM src.traveltime_min
;
-- VACUUM analyze routing;
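To see what the extra condition buys, here is a small sketch that counts how many rows an update of this kind actually rewrites. It runs on SQLite via Python (my assumption, for a self-contained example), where the null-safe operator IS NOT plays the role of PostgreSQL's IS DISTINCT FROM, and a correlated subquery stands in for UPDATE ... FROM. The four sample rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table routing (id integer primary key, traveltime_min real)")
conn.execute("create table test (id integer primary key, traveltime_min real)")
conn.executemany(
    "insert into routing values (?, ?)",
    [(1, 1.0), (2, 2.0), (3, None), (4, None)],
)
conn.executemany(
    "insert into test values (?, ?)",
    [(1, 1.0), (2, 2.5), (3, 3.0), (4, None)],
)

# Only touch rows whose value would really change; "is not" treats
# NULL vs NULL as equal and NULL vs 3.0 as different.
cur = conn.execute("""
    update routing
    set traveltime_min = (
        select src.traveltime_min from test src where src.id = routing.id
    )
    where exists (
        select 1
        from test src
        where src.id = routing.id
          and src.traveltime_min is not routing.traveltime_min
    )
""")
rows_rewritten = cur.rowcount
final_rows = conn.execute(
    "select id, traveltime_min from routing order by id"
).fetchall()
print(rows_rewritten, final_rows)
```

Rows 1 and 4 already match and are skipped, so only two of the four rows get a new row version.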
I have this table:
CREATE TABLE `search_engine_rankings` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`keyword_id` int(11) DEFAULT NULL,
`search_engine_id` int(11) DEFAULT NULL,
`total_results` int(11) DEFAULT NULL,
`rank` int(11) DEFAULT NULL,
`url` varchar(255) DEFAULT NULL,
`created_at` datetime DEFAULT NULL,
`updated_at` datetime DEFAULT NULL,
`indexed_at` date DEFAULT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `unique_ranking` (`keyword_id`,`search_engine_id`,`rank`,`indexed_at`),
KEY `search_engine_rankings_search_engine_id_fk` (`search_engine_id`),
CONSTRAINT `search_engine_rankings_keyword_id_fk` FOREIGN KEY (`keyword_id`) REFERENCES `keywords` (`id`) ON DELETE CASCADE,
CONSTRAINT `search_engine_rankings_search_engine_id_fk` FOREIGN KEY (`search_engine_id`) REFERENCES `search_engines` (`id`) ON DELETE CASCADE
) ENGINE=InnoDB AUTO_INCREMENT=244454637 DEFAULT CHARSET=utf8
It has about 250M rows in production.
When I do:
select id,
rank
from search_engine_rankings
where keyword_id = 19
and search_engine_id = 11
and indexed_at = "2010-12-03";
...it runs very quickly.
When I add the url column (VARCHAR):
select id,
rank,
url
from search_engine_rankings
where keyword_id = 19
and search_engine_id = 11
and indexed_at = "2010-12-03";
...it runs very slowly.
Any ideas?
The first query can be satisfied by the index alone -- no need to read the base table to obtain the values in the Select clause. The second statement requires reads of the base table because the URL column is not part of the index.
UNIQUE KEY `unique_ranking` (`keyword_id`,`search_engine_id`,`rank`,`indexed_at`),
The rows in the base table are not in the same physical order as the rows in the index, and so the read of the base table can involve considerable disk-thrashing.
You can think of it as a kind of proof of optimization -- on the first query the disk-thrashing is avoided because the engine is smart enough to consult the index for the values requested in the select clause; it will already have read that index into RAM for the where clause, so it takes advantage of that fact.
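The same index-only effect can be observed in a query planner's output. The sketch below uses SQLite through Python as a stand-in (an assumption; the question is about MySQL/InnoDB, whose EXPLAIN output looks different), with a trimmed-down copy of the table. The helper name plan is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    create table search_engine_rankings (
        id integer primary key,
        keyword_id int,
        search_engine_id int,
        rank int,
        url text,
        indexed_at text
    )
""")
conn.execute("""
    create unique index unique_ranking
    on search_engine_rankings (keyword_id, search_engine_id, rank, indexed_at)
""")


def plan(sql):
    """Return the EXPLAIN QUERY PLAN detail strings joined into one line."""
    rows = conn.execute("explain query plan " + sql).fetchall()
    return " | ".join(row[-1] for row in rows)


where = (
    "where keyword_id = 19 and search_engine_id = 11 "
    "and indexed_at = '2010-12-03'"
)
# id and rank are both available from the index, so the table is never read.
plan_without_url = plan("select id, rank from search_engine_rankings " + where)
# url lives only in the table, so each index hit needs a table lookup as well.
plan_with_url = plan("select id, rank, url from search_engine_rankings " + where)
print(plan_without_url)
print(plan_with_url)
```

In SQLite the first plan is reported as using a COVERING INDEX and the second as using a plain INDEX; the extra table lookups in the second case are where the random reads come from.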
In addition to Tim's answer: an index in MySQL can only be used left to right. This means the WHERE clause can use the columns of the index only up to the first index column that the query does not constrain.
Currently, your UNIQUE index is keyword_id, search_engine_id, rank, indexed_at. It can be used to filter on keyword_id and search_engine_id, but the remaining rows still have to be scanned to filter on indexed_at.
If you change the column order to keyword_id, search_engine_id, indexed_at, rank, the index can be used to filter on keyword_id, search_engine_id and indexed_at.
I believe it will then be able to fully use that index to read the appropriate part of your table.
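The left-to-right rule can be made visible by comparing query plans for the two column orders. This is a sketch on SQLite via Python (an assumption; MySQL reports this differently, for example through key_len in EXPLAIN), and plan_for is a hypothetical helper. SQLite lists the index columns it actually seeks on in parentheses.

```python
import sqlite3


def plan_for(index_columns):
    """Build the table with the given index column order and return the query plan."""
    conn = sqlite3.connect(":memory:")
    conn.execute("""
        create table search_engine_rankings (
            id integer primary key,
            keyword_id int,
            search_engine_id int,
            rank int,
            url text,
            indexed_at text
        )
    """)
    conn.execute(
        "create unique index unique_ranking "
        f"on search_engine_rankings ({index_columns})"
    )
    rows = conn.execute("""
        explain query plan
        select id, rank, url
        from search_engine_rankings
        where keyword_id = 19
          and search_engine_id = 11
          and indexed_at = '2010-12-03'
    """).fetchall()
    conn.close()
    return " | ".join(row[-1] for row in rows)


original = plan_for("keyword_id, search_engine_id, rank, indexed_at")
reordered = plan_for("keyword_id, search_engine_id, indexed_at, rank")
print(original)   # seeks on keyword_id and search_engine_id only
print(reordered)  # seeks on keyword_id, search_engine_id and indexed_at
```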
I know it's an old post, but I was experiencing the same situation and didn't find an answer.
This really happens in MySQL: when you have varchar columns, processing takes a lot of time. My query took about 20 seconds to process 1.7M rows and now takes about 1.9 seconds.
OK, first of all, create a view from this query:
CREATE VIEW view_one AS
select id,rank
from search_engine_rankings
where keyword_id = 19000
and search_engine_id = 11
and indexed_at = "2010-12-03";
Second, same query but with an inner join:
select v.*, s.url
from view_one AS v
inner join search_engine_rankings s ON s.id=v.id;
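A small sanity check of the rewrite, sketched on SQLite via Python (an assumption, with made-up sample rows). It only confirms that the view-plus-join form returns the same rows as the direct query; it makes no claim about timing, which depends on the MySQL setup.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    create table search_engine_rankings (
        id integer primary key,
        keyword_id int,
        search_engine_id int,
        rank int,
        url text,
        indexed_at text
    )
""")
conn.executemany(
    "insert into search_engine_rankings values (?, ?, ?, ?, ?, ?)",
    [
        (1, 19000, 11, 1, "http://example.com/a", "2010-12-03"),
        (2, 19000, 11, 2, "http://example.com/b", "2010-12-03"),
        (3, 19000, 12, 1, "http://example.com/c", "2010-12-03"),
        (4, 19000, 11, 1, "http://example.com/d", "2010-12-04"),
    ],
)

# Step 1: the narrow query (index columns only) as a view.
conn.execute("""
    create view view_one as
    select id, rank
    from search_engine_rankings
    where keyword_id = 19000
      and search_engine_id = 11
      and indexed_at = '2010-12-03'
""")

# Step 2: join back on the primary key to pick up url.
via_view = conn.execute("""
    select v.*, s.url
    from view_one as v
    inner join search_engine_rankings s on s.id = v.id
    order by v.id
""").fetchall()

direct = conn.execute("""
    select id, rank, url
    from search_engine_rankings
    where keyword_id = 19000
      and search_engine_id = 11
      and indexed_at = '2010-12-03'
    order by id
""").fetchall()
print(via_view == direct, via_view)
```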
TLDR: I solved this by running optimize on the table.
I experienced the same just now. Even lookups on the primary key that selected just a few rows were slow. Testing a bit, I found it was not limited to the varchar column; selecting an int also took a considerable amount of time.
A query roughly looking like this took around 3s:
select someint from mytable where id in (1234, 12345, 123456).
While a query roughly looking like this took <10ms:
select count(*) from mytable where id in (1234, 12345, 123456).
The approved answer here is to just make an index that also spans someint, and it will be fast, as MySQL can fetch all the information it needs from the index and won't have to touch the table. That probably works in some settings, but I think it's a silly workaround: something is clearly wrong, and it should not take three seconds to fetch three rows from a table! Besides, most applications just do a "select * from mytable", and making changes on the application side is not always trivial.
After optimize table, both queries take <10ms.