I have two tables:
CREATE TABLE routing
(
id integer NOT NULL,
link_geom geometry,
source integer,
target integer,
traveltime_min double precision,
CONSTRAINT routing_pkey PRIMARY KEY (id)
)
WITH (
OIDS=FALSE
);
CREATE INDEX routing_id_idx
ON routing
USING btree
(id);
CREATE INDEX routing_link_geom_gidx
ON routing
USING gist
(link_geom);
CREATE INDEX routing_source_idx
ON routing
USING btree
(source);
CREATE INDEX routing_target_idx
ON routing
USING btree
(target);
and
CREATE TABLE test
(
link_id character varying,
link_geom geometry,
id integer NOT NULL,
.. (some more attributes here)
traveltime_min double precision,
CONSTRAINT id PRIMARY KEY (id),
CONSTRAINT test_link_id_key UNIQUE (link_id)
)
WITH (
OIDS=FALSE
);
ALTER TABLE test
OWNER TO postgres;
and I am trying to apply the following query:
update routing
set traveltime_min = t2.traveltime_min
from test t2
where t2.id = routing.id
Both tables have nearly 10 million rows. The problem is that this query never finishes. Here is what EXPLAIN shows:
Update on routing  (cost=601725.94..1804772.15 rows=9712264 width=208)
  ->  Hash Join  (cost=601725.94..1804772.15 rows=9712264 width=208)
        Hash Cond: (routing.id = t2.id)
        ->  Seq Scan on routing  (cost=0.00..366200.23 rows=9798223 width=194)
        ->  Hash  (cost=423414.64..423414.64 rows=9712264 width=18)
              ->  Seq Scan on test t2  (cost=0.00..423414.64 rows=9712264 width=18)
I cannot understand what might cause such a slow response.
Could the problem come from the server settings? The thing is that I use the default PostgreSQL 9.3 settings.
Drop all indexes on routing before you run the UPDATE and add them again afterwards. That will bring a huge improvement.
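For example, a minimal sketch using the index definitions from the question (the primary key stays in place):
DROP INDEX routing_id_idx;
DROP INDEX routing_link_geom_gidx;
DROP INDEX routing_source_idx;
DROP INDEX routing_target_idx;
-- ... run the UPDATE here ...
CREATE INDEX routing_id_idx ON routing USING btree (id);
CREATE INDEX routing_link_geom_gidx ON routing USING gist (link_geom);
CREATE INDEX routing_source_idx ON routing USING btree (source);
CREATE INDEX routing_target_idx ON routing USING btree (target);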
Set work_mem high in the session where you run the UPDATE. That will help with the hash.
Set shared_buffers to ¼ of the available memory, but not more than 1GB.
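A minimal sketch of both settings (the work_mem value is a placeholder to adjust to your RAM; on 9.3, shared_buffers can only be changed in postgresql.conf followed by a server restart):
-- in the session that runs the UPDATE:
SET work_mem = '512MB';
-- in postgresql.conf, then restart the server:
-- shared_buffers = 1GB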
If not all the rows are actually changed by the UPDATE (i.e. they get the same value they already had), you should avoid these idempotent updates.
If you expect the query to affect every row, the query plan is not important [except, maybe, for the case of overflowing hash tables...].
-- these could be needed if the update would be more selective...
VACUUM analyze routing;
VACUUM analyze test;
UPDATE routing dst
SET traveltime_min = src.traveltime_min
FROM test src
WHERE dst.id = src.id
-- avoid useless updates and row-versions
AND dst.traveltime_min IS DISTINCT FROM src.traveltime_min
;
-- VACUUM analyze routing;
Encountered this problem in production in the form of a deadlock. Figured out that if a transaction was inserting a row into my table, and I wanted to select a totally different row from that table, I would get the following error:
245: Could not position within a file via an index.
144: ISAM error: key value locked
Error in line 1
Near character position 70
My select statement was of the form select * from table where bar = 3 and foo = "CCCC";, where "foo" is a foreign key to a table with 18 rows, and "bar" is the first table's primary key. My insert statement was also inserting a row with foo = "CCCC". Curiously, the select query also returned the desired row before outputting the error.
I tried all this on Informix 12.10 with the isolation level set to repeatable read. I tried it in production, and in a fresh DB I set up with only the two tables mentioned. The lock mode of both tables is "row".
I investigated by modifying the select statement: select * from table where bar = 3; would not fail. Also, select * from table where bar = 3 and foo = "CCCC" order by ber; would not fail (ber being a random field from the table, ber is not indexed).
I would expect all the select statements I tried to return the desired row without error, OR all of them to fail. My solution in production was to order by a random field in the table, which fixed the deadlock issue.
Does anyone know why this issue could have happened? I suspect it is linked to the indexes on the table, which were all created automatically when adding the primary and foreign keys to the table. But I do not know enough about indexes to understand what happened. Could this be a bug?
Schema of the tables:
create table options (
foo char(4) not null,
fee int not null)
extent size 16 next size 16
lock mode row;
alter table options add constraint (
primary key (foo)
constraint cons1 );
create table decisions (
bar char(3) not null,
foo char(4) not null,
ber int not null)
extent size 131072 next size 65536
lock mode row;
alter table decisions add constraint (
primary key (bar)
constraint cons2 );
alter table decisions add constraint (
foreign key (foo) references options(foo)
constraint cons3 );
Data I inserted into the "options" table:
AAAA|0|
BBBB|0|
CCCC|1|
DDDD|4|
EEEE|1|
FFFF|8|
Data I inserted into the "decisions" table:
QWE|AAAA|0|
WER|AAAA|9|
ERT|CCCC|2|
RTY|AAAA|32|
TYU|CCCC|1234|
YUI|CCCC|42398|
UIO|AAAA|23178|
IOP|CCCC|1233|
OPA|CCCC|11|
PAS|AAAA|890|
ASD|AAAA|90|
SDF|CCCC|2|
DFG|AAAA|4|
FGH|CCCC|7|
Edit: I used set explain on; for the queries.
select * from decisions where foo = "CCCC" and bar = "QWE" order by foo; returned that the index used was on foo="CCCC". However, for select * from decisions where foo = "CCCC" and bar = "QWE" order by ber;, it's indexed on bar="QWE".
I have a simple but large table in Teradata:
CREATE SET TABLE TABLE1 ,FALLBACK ,
NO BEFORE JOURNAL,
NO AFTER JOURNAL,
CHECKSUM = DEFAULT,
DEFAULT MERGEBLOCKRATIO,
MAP = TD_MAP1
(
party INTEGER,
cd SMALLINT)
PRIMARY INDEX ( party );
I need to perform a delete:
DELETE FROM TABLE1 WHERE cd < 0 AND cd <> -212;
Tried adding a NUSI on cd:
CREATE INDEX (cd) ON TABLE1;
But that did not help.
Any advice on how to increase performance?
Thanks, R.
Given the following:
-- This table will have roughly 14 million records
CREATE TABLE IdMappings
(
Id int IDENTITY(1,1) NOT NULL,
OldId int NOT NULL,
NewId int NOT NULL,
RecordType varchar(80) NOT NULL, -- 15 distinct values, will never increase
Processed bit NOT NULL DEFAULT 0,
CONSTRAINT pk_IdMappings
PRIMARY KEY CLUSTERED (Id ASC)
)
CREATE UNIQUE INDEX ux_IdMappings_OldId ON IdMappings (OldId);
CREATE UNIQUE INDEX ux_IdMappings_NewId ON IdMappings (NewId);
and this is the most common query run against the table:
WHILE @firstBatchId <= @maxBatchId
BEGIN
-- the result of this is used to insert into another table:
SELECT
NewId, -- and lots of non-indexed columns from SOME_TABLE
FROM
IdMappings map
INNER JOIN
SOME_TABLE foo ON foo.Id = map.OldId
WHERE
map.Id BETWEEN @firstBatchId AND @lastBatchId
AND map.RecordType = @someRecordType
AND map.Processed = 0
-- We only really need this in case the user kills the binary or SQL Server service:
UPDATE IdMappings
SET Processed = 1
WHERE Id BETWEEN @firstBatchId AND @lastBatchId
AND RecordType = @someRecordType
SET @firstBatchId += 4999
SET @lastBatchId += 4999
END
What are the best indices to add? I figure Processed isn't worth indexing since it only has 2 values. Is it worth indexing RecordType since there are only about 15 distinct values? How many distinct values does a column need before it is worth indexing?
Is there any advantage in a composite key if some of the fields are in the WHERE and some are in a JOIN's ON condition? For example:
CREATE INDEX ix_IdMappings_RecordType_OldId
ON IdMappings (RecordType, OldId)
... if I wanted both these fields indexed (I'm not saying I do), does this composite key gain any advantage since both columns don't appear together in the same WHERE or same ON?
Insert time into IdMappings isn't really an issue. After we insert all records into the table, we don't need to do so again for months if ever.
I am trying to optimize the query I use for fetching paginated data from a database with large data sets.
My schema looks like this:
CREATE TABLE users (
user_id TEXT PRIMARY KEY,
name TEXT,
custom_fields TEXT
);
CREATE TABLE events (
event_id TEXT PRIMARY KEY,
organizer_id TEXT NOT NULL REFERENCES users(user_id) ON DELETE SET NULL ON UPDATE CASCADE,
name TEXT NOT NULL,
type TEXT NOT NULL,
start_time INTEGER,
duration INTEGER
-- more columns here, omitted for the sake of simplicity
);
CREATE INDEX events_organizer_id_start_time_idx ON events(organizer_id, start_time);
CREATE INDEX events_organizer_id_type_idx ON events(organizer_id, type);
CREATE INDEX events_organizer_id_type_start_time_idx ON events(organizer_id, type, start_time);
CREATE INDEX events_type_start_time_idx ON events(type, start_time);
CREATE INDEX events_start_time_desc_idx ON events(start_time DESC);
CREATE INDEX events_start_time_asc_idx ON events(IFNULL(start_time, 253402300800) ASC);
CREATE TABLE event_participants (
participant_id TEXT NOT NULL REFERENCES users(user_id) ON DELETE CASCADE ON UPDATE CASCADE,
event_id TEXT NOT NULL REFERENCES events(event_id) ON DELETE CASCADE ON UPDATE CASCADE,
role INTEGER NOT NULL DEFAULT 0,
UNIQUE (participant_id, event_id) ON CONFLICT REPLACE
);
CREATE INDEX event_participants_participant_id_event_id_idx ON event_participants(participant_id, event_id);
CREATE INDEX event_participants_event_id_idx ON event_participants(event_id);
CREATE TABLE event_tag_maps (
event_id TEXT NOT NULL REFERENCES events(event_id) ON DELETE CASCADE ON UPDATE CASCADE,
tag_id TEXT NOT NULL,
PRIMARY KEY (event_id, tag_id) ON CONFLICT IGNORE
);
CREATE INDEX event_tag_maps_event_id_tag_id_idx ON event_tag_maps(event_id, tag_id);
The events table has around 1,500,000 entries, and event_participants has around 2,000,000.
Now, a typical query would look something like:
SELECT
EVTS.event_id,
EVTS.type,
EVTS.name,
EVTS.start_time,
EVTS.duration
FROM events AS EVTS
WHERE
EVTS.organizer_id IN(
'f39c3bb1-3ee3-11e6-a0dc-005056c00008',
'4555e70f-3f1d-11e6-a0dc-005056c00008',
'6e7e33ae-3f1c-11e6-a0dc-005056c00008',
'4850a6a0-3ee4-11e6-a0dc-005056c00008',
'e06f784c-3eea-11e6-a0dc-005056c00008',
'bc6a0f73-3f1d-11e6-a0dc-005056c00008',
'68959fb5-3ef3-11e6-a0dc-005056c00008',
'c4c96cf2-3f1a-11e6-a0dc-005056c00008',
'727e49d1-3f1b-11e6-a0dc-005056c00008',
'930bcfb6-3f09-11e6-a0dc-005056c00008')
AND EVTS.type IN('Meeting', 'Conversation')
AND(
EXISTS (
SELECT 1 FROM event_tag_maps AS ETM WHERE ETM.event_id = EVTS.event_id AND
ETM.tag_id IN ('00000000-0000-0000-0000-000000000000', '6ae6870f-1aac-11e6-aeb9-005056c00008', '6ae6870c-1aac-11e6-aeb9-005056c00008', '1f6d3ccb-eaed-4068-a46b-ec2547fec1ff'))
OR NOT EXISTS (
SELECT 1 FROM event_tag_maps AS ETM WHERE ETM.event_id = EVTS.event_id)
)
AND EXISTS (
SELECT 1 FROM event_participants AS EPRTS
WHERE
EVTS.event_id = EPRTS.event_id
AND participant_id NOT IN('79869516-3ef2-11e6-a0dc-005056c00008', '79869515-3ef2-11e6-a0dc-005056c00008', '79869516-4e18-11e6-a0dc-005056c00008')
)
ORDER BY IFNULL(EVTS.start_time, 253402300800) ASC
LIMIT 100 OFFSET @Offset;
Also, for fetching the overall count of the query-matching items, I would use the above query with count(1) instead of the columns and without the ORDER BY and LIMIT/OFFSET clauses.
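For reference, a sketch of that count variant (the elided parts stand for the same predicates as in the query above):
SELECT count(1)
FROM events AS EVTS
WHERE
    EVTS.organizer_id IN(/* same ids as above */)
    AND EVTS.type IN('Meeting', 'Conversation')
    AND (/* same tag and participant EXISTS conditions as above */);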
I experience two main problems here:
1) The performance drastically decreases as I increase the @Offset value. The difference is very significant, from being almost immediate to a number of seconds.
2) The count query takes a long time (number of seconds) and produces the following execution plan:
0|0|0|SCAN TABLE events AS EVTS
0|0|0|EXECUTE LIST SUBQUERY 1
0|0|0|EXECUTE LIST SUBQUERY 1
0|0|0|EXECUTE CORRELATED SCALAR SUBQUERY 1
1|0|0|SEARCH TABLE event_tag_maps AS ETM USING COVERING INDEX event_tag_maps_event_id_tag_id_idx (event_id=? AND tag_id=?)
1|0|0|EXECUTE LIST SUBQUERY 2
0|0|0|EXECUTE CORRELATED SCALAR SUBQUERY 2
2|0|0|SEARCH TABLE event_tag_maps AS ETM USING COVERING INDEX event_tag_maps_event_id_tag_id_idx (event_id=?)
0|0|0|EXECUTE CORRELATED SCALAR SUBQUERY 3
3|0|0|SEARCH TABLE event_participants AS EPRTS USING INDEX event_participants_event_id_idx (event_id=?)
What I don't understand here is why a full table scan is performed instead of an index scan.
Additional info and SQLite settings used:
I use System.Data.SQLite provider (have to, because of custom functions support)
Page size = cluster size (4096 in my case)
Cache size = 100000
Journal mode = WAL
Temp store = 2 (memory)
No transaction is open for the query
Is there anything I could do to change the query/schema or settings in order to get as much performance improvement as possible?
I have a large MySQL table (about 5M rows) into which I frequently insert data.
This is the same table I have to read data from, and sometimes the entire database gets slow because of selects running while there are many pending inserts.
I put an index on each field I use in the WHERE clause, so I really don't know why the SELECT gets so slow.
Could anyone give me a hint to solve this problem?
Here are the table definition and the query:
CREATE TABLE `messages` (
`id` int(10) unsigned NOT NULL auto_increment,
`user_id` int(10) unsigned NOT NULL default '0',
`dest` varchar(20) character set latin1 default NULL,
`body` text character set latin1,
`sent_on` timestamp NOT NULL default CURRENT_TIMESTAMP,
`md5` varchar(32) character set latin1 NOT NULL default '',
`interface` enum('mobile','desktop') default NULL,
PRIMARY KEY (`id`),
KEY `user_id` (`user_id`),
KEY `md5` (`md5`),
FULLTEXT KEY `dest` (`dest`,`body`),
FULLTEXT KEY `body` (`body`)
) ENGINE=MyISAM AUTO_INCREMENT=7074256 DEFAULT CHARSET=utf8
and here the query:
EXPLAIN SELECT SQL_CALC_FOUND_ROWS id, sent_on, dest AS who, body,interface FROM messages WHERE user_id = 2 ORDER BY sent_on DESC LIMIT 0,50 \G;
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: messages
type: ref
possible_keys: user_id
key: user_id
key_len: 4
ref: const
rows: 13997
Extra: Using where; Using filesort
1 row in set (0.00 sec)
Note the following in your EXPLAIN output:
Extra: Using where; Using filesort
The Using filesort means that MySQL must do an extra sorting pass over the matching rows (writing them to a temporary file if they don't fit in memory), then read the sorted results back in to get the top 50 rows.
While I'm no expert, I think you could optimize this by providing an index that satisfies both the selection criteria and the sort order in one go; then the selection and ordering can be determined by an index scan alone, without having to sort the result set every time.
In this case, your WHERE is on user_id, and your ORDER BY is on sent_on. So, in theory, if you provide a single index on those two columns (in that order), then the engine will be able to use the first half of the index to filter the results, and because the second half of the index is on the sent_on column, the index results will already be in order according to that column, allowing MySQL to simply retrieve the first 50 results from that index. No additional sorting required.
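For example (the index name here is my own choice):
ALTER TABLE messages ADD INDEX user_id_sent_on (user_id, sent_on);
With that in place, the existing single-column user_id key becomes redundant and could be dropped.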
Disclaimer: I'm not a DBA. I may be completely wrong.
See Also: Mysql.com: Multiple Column Indexes
Maybe you have disabled Concurrent Inserts?
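A quick sketch for checking and, if needed, re-enabling them (1, the default, allows concurrent inserts into MyISAM tables with no free blocks in the middle):
SHOW VARIABLES LIKE 'concurrent_insert';
SET GLOBAL concurrent_insert = 1;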
Could the ORDER BY be slowing you down? I don't know if it's a good idea to index sent_on; it would depend on the SELECT vs. INSERT frequency.