I have around 5 million rows in a Postgres table. I'd like to know how many rows match start_time >= NOW(), but despite having an index on start_time the query is extremely slow (on the order of several hours).
EXPLAIN SELECT COUNT(*) FROM core_event WHERE start_time >= NOW();
Aggregate (cost=449217.81..449217.82 rows=1 width=0)
-> Index Scan using core_event_start_time on core_event (cost=0.00..447750.83 rows=586791 width=0)
Index Cond: (start_time >= now())
Here's the schema information for the table:
id | integer | not null default nextval('core_event_id_seq'::regclass)
source | character varying(100) | not null
external_id | character varying(100) |
title | character varying(250) | not null
location | geometry | not null
start_time | timestamp with time zone |
stop_time | timestamp with time zone |
thumb | character varying(300) |
image | character varying(100) |
image_thumb | character varying(100) |
address | character varying(300) |
description | text |
venue_name | character varying(100) |
website | character varying(300) |
city_id | integer |
category | character varying(100) |
phone | character varying(50) |
place_id | integer |
image_url | character varying(300) |
event_type | character varying(200) |
hidden | boolean | not null
views | integer | not null
added | timestamp with time zone |
I have indexes on the following fields:
city_id
external_id (unique)
location
location_id
place_id
start_time
Is there any easy way for me to speed up the query (e.g. a partial index), or am I going to have to resort to partitioning the data by date?
Try adding a partial index like the following:
CREATE INDEX core_event_start_time_recent_idx ON core_event (start_time)
WHERE start_time >= '2011-01-12 00:00'::timestamptz;
This will create a comparatively small index. Index creation will take some time, but queries like this one will be much faster thereafter.
SELECT count(*) FROM core_event WHERE start_time >= now();
The effectiveness of this index for queries against now() will degrade slowly over time, depending on how many new rows come in. Recreate (drop & create) the index with a more recent timestamp occasionally, during off hours.
You could automate this with a PL/pgSQL function that you call from a cron job or pgAgent.
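A minimal PL/pgSQL sketch of such a maintenance function (the function name and the day-granularity cutoff are my own choices, not part of the original answer):
CREATE OR REPLACE FUNCTION refresh_start_time_partial_idx()
  RETURNS void
  LANGUAGE plpgsql AS
$func$
BEGIN
   DROP INDEX IF EXISTS core_event_start_time_recent_idx;
   -- the cutoff has to be baked into the index predicate as a constant, hence dynamic SQL
   EXECUTE 'CREATE INDEX core_event_start_time_recent_idx
               ON core_event (start_time)
               WHERE start_time >= '
           || quote_literal(date_trunc('day', now())) || '::timestamptz';
END
$func$;
Called from cron it would look something like: psql -c "SELECT refresh_start_time_partial_idx();"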
You might also try running CLUSTER on the table to see if it improves things (if it doesn't conflict with other requirements in your db):
CLUSTER core_event USING core_event_start_time;
Yes, cluster on the full index, not the partial one. This will take a while and needs an exclusive lock, because it effectively rewrites the table. It also effectively vacuums the table fully. Read about it in the manual.
You may also want to increase the statistics target for core_event.start_time:
ALTER TABLE core_event ALTER COLUMN start_time SET STATISTICS 1000; -- example value
The default is just 100. Then:
ANALYZE core_event;
Of course, all the usual performance advice applies, too.
Do most of these columns get populated for each row? If so, the amount of disk PostgreSQL has to read to check row visibility, even after consulting the index, will be fairly large. Try, for example, creating a separate table that holds only id and start_time:
create table core_event_start_time as select id, start_time from core_event;
alter table core_event_start_time add primary key(id);
alter table core_event_start_time add foreign key(id) references core_event(id);
create index on core_event_start_time(start_time);
Now see how long it takes to count IDs in core_event_start_time only. Of course, this approach will take up more buffer cache at the expense of space for your actual core_event table...
If it helps, you can add a trigger onto core_event to keep the auxiliary table updated.
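A rough sketch of such a trigger (the function and trigger names are illustrative, not from the answer):
CREATE OR REPLACE FUNCTION core_event_start_time_sync()
  RETURNS trigger
  LANGUAGE plpgsql AS
$func$
BEGIN
   IF TG_OP = 'INSERT' THEN
      INSERT INTO core_event_start_time (id, start_time) VALUES (NEW.id, NEW.start_time);
   ELSIF TG_OP = 'UPDATE' THEN
      UPDATE core_event_start_time SET start_time = NEW.start_time WHERE id = NEW.id;
   ELSIF TG_OP = 'DELETE' THEN
      DELETE FROM core_event_start_time WHERE id = OLD.id;
      RETURN OLD;
   END IF;
   RETURN NEW;
END
$func$;

CREATE TRIGGER core_event_start_time_sync
AFTER INSERT OR UPDATE OF start_time OR DELETE ON core_event
FOR EACH ROW EXECUTE PROCEDURE core_event_start_time_sync();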
(postgresql 9.2 will introduce "index only scans" which may help with this sort of situation, but that's for the future)
I'm trying to find a faster way to create a new column in a table which is a copy of the table's primary key column. So if I have the following columns in a table named students:
student_id Integer Auto-Increment -- Primary key
name varchar
Then I would like to create a new column named old_student_id which has all the same values as student_id.
To do this I create the column and then execute the following update statement:
update students set old_student_id = student_id
This works, but on my biggest table it takes over an hour, and I feel like there should be some alternative approach that gets that down to a few minutes; I just don't know what.
So what I want at the end of the day is something that looks like this:
+------------+-----+---------------+
| student_id | name| old_student_id|
+------------+-----+---------------+
| 1 | bob | 1 |
+------------+-----+---------------+
| 2 | tod | 2 |
+------------+-----+---------------+
| 3 | joe | 3 |
+------------+-----+---------------+
| 4 | tim | 4 |
+------------+-----+---------------+
To speed things up a bit before I do the update, I drop all the FKs and indexes on the table, then reapply them when it finishes. Also, I'm on AWS RDS, so I have set up a parameter group with synchronous_commit = off, turned off backups, and increased working memory a bit for the duration of this update.
For context, this is happening to every table in the database, across three databases. The old ids are used as references by several external systems, so I need to keep track of them in order to update those systems as well. I have an 8 hour downtime window; currently merging the databases takes ~3 hours, and a whole hour of that is spent creating these ids.
If you will not need to update the old_student_id column in the future, you can use generated columns in PostgreSQL (available since version 12).
CREATE TABLE table2 (
id serial4 NOT NULL,
val1 int4 NULL,
val2 int4 NULL,
total int4 NULL GENERATED ALWAYS AS (id) STORED
);
During insert, the total column will be set to the same value as the id column. But you cannot update this column afterwards, because it is a generated column.
An alternative method is to use triggers. In that case you can still update the field. See this example:
First, we need to create a trigger function that will be called before each insert.
CREATE OR REPLACE FUNCTION table2_insert()
RETURNS trigger
LANGUAGE plpgsql
AS $function$
begin
new.total = new.val1 * new.val2;
return new;
END;
$function$
;
Then:
CREATE TABLE table2 (
id serial4 NOT NULL,
val1 int4 NULL,
val2 int4 NULL,
total int4 NULL
);
create trigger my_trigger
before insert on table2
for each row execute function table2_insert();
With either method, new rows get the value automatically, so you don't have to update many records after the fact.
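Applied to the question's students table, a minimal sketch of the trigger approach might look like this (names are illustrative; existing rows would still need one initial backfill):
CREATE OR REPLACE FUNCTION students_copy_id()
RETURNS trigger
LANGUAGE plpgsql
AS $function$
BEGIN
   -- the serial default for student_id is already applied when a BEFORE trigger fires
   NEW.old_student_id := NEW.student_id;
   RETURN NEW;
END;
$function$;

create trigger students_copy_id
before insert on students
for each row execute function students_copy_id();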
I am trying to create a table with an auto-increment column as below. Since Redshift doesn't support SERIAL, I had to use the IDENTITY clause:
IDENTITY(seed, step)
Clause that specifies that the column is an IDENTITY column. An IDENTITY column contains unique auto-generated values. These values start with the value specified as seed and increment by the number specified as step. The data type for an IDENTITY column must be either INT or BIGINT.
My create table statement looks like this:
CREATE TABLE my_table(
id INT IDENTITY(1,1),
name CHARACTER VARYING(255) NOT NULL,
PRIMARY KEY( id )
);
However, when I tried to insert data into my_table, the generated ids only come out as even numbers, like below:
id | name |
----+------+
2 | anna |
4 | tom |
6 | adam |
8 | bob |
10 | rob |
My insert statements look like below:
INSERT INTO my_table ( name )
VALUES ( 'anna' ), ('tom') , ('adam') , ('bob') , ('rob' );
I am also having trouble getting the id column to start back at 1. There are solutions for the SERIAL data type, but I haven't seen any documentation for IDENTITY.
Any suggestions would be much appreciated!
You have to set your identity as follows:
id INT IDENTITY(0,1)
Source: http://docs.aws.amazon.com/redshift/latest/dg/r_CREATE_TABLE_examples.html
And you can't reset the id to 0; you will have to drop the table and recreate it.
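A rough sketch of that rebuild (table and column names follow the question; note that with parallel loading Redshift does not guarantee the new identity values come out dense or in insert order):
CREATE TABLE my_table_new (
  id INT IDENTITY(1,1),
  name CHARACTER VARYING(255) NOT NULL,
  PRIMARY KEY( id )
);

INSERT INTO my_table_new ( name )
SELECT name FROM my_table;   -- identity values are reassigned on load

DROP TABLE my_table;
ALTER TABLE my_table_new RENAME TO my_table;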
Set your seed value to 1 and your step value to 1.
Create table
CREATE table my_table(
id bigint identity(1, 1),
name varchar(100),
primary key(id));
Insert rows
INSERT INTO my_table ( name )
VALUES ('anna'), ('tom') , ('adam'), ('bob'), ('rob');
Results
id | name |
----+------+
1 | anna |
2 | tom |
3 | adam |
4 | bob |
5 | rob |
For some reason, if you set your seed value to 0 and your step value to 1 then the integer will increase in steps of 2.
Create table
CREATE table my_table(
id bigint identity(0, 1),
name varchar(100),
primary key(id));
Insert rows
INSERT INTO my_table ( name )
VALUES ('anna'), ('tom') , ('adam'), ('bob'), ('rob');
Results
id | name |
----+------+
0 | anna |
2 | tom |
4 | adam |
6 | bob |
8 | rob |
This issue is discussed at length in the AWS forum:
https://forums.aws.amazon.com/message.jspa?messageID=623201
The answer from AWS:
Short answer to your question is seed and step are only honored if you
disable both parallelism and the COMPUPDATE option in your COPY.
Parallelism is disabled if and only if you're loading your data from a
single file, which is what we normally do not recommend, and hence
will be an unlikely scenario for most users.
Parallelism impacts things because in order to ensure that there is no
single point of contention in assigning identity values to rows, there
end up being gaps in the value assignment. When parallelism is
disabled, the load is happening serially, and therefore, there is no
issue with assigning different id values in parallel.
The reason COMPUPDATE impacts things is when it's enabled, the COPY is
actually making 2 passes over your data. During the first pass, it
internally increments the identity values, and as a result, your
initial value starts with a larger value than you'd expect.
We'll update the doc to reflect this.
Multiple nodes also seem to cause this effect with an IDENTITY column. In essence, it only guarantees that the generated IDs are unique, not that they are consecutive.
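For reference, a hedged sketch of a COPY that meets the conditions described in the AWS answer above: a single input file (so parallelism is disabled) and COMPUPDATE turned off. The bucket path and IAM role are placeholders:
COPY my_table (name)
FROM 's3://my-bucket/single-file.csv'
IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role'
CSV
COMPUPDATE OFF;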
Is it possible to optimize this query?
SELECT count(locId) AS antal , locId
FROM `geolitecity_block`
WHERE (1835880985>= startIpNum AND 1835880985 <= endIpNum)
OR (1836875969>= startIpNum AND 1836875969 <= endIpNum)
OR (1836878754>= startIpNum AND 1836878754 <= endIpNum)
...
...
OR (1843488110>= startIpNum AND 1843488110 <= endIpNum)
GROUP BY locId ORDER BY antal DESC LIMIT 100
The table looks like this
CREATE TABLE IF NOT EXISTS `geolitecity_block` (
`startIpNum` int(11) unsigned NOT NULL,
`endIpNum` int(11) unsigned NOT NULL,
`locId` int(11) unsigned NOT NULL,
PRIMARY KEY (`startIpNum`),
KEY `locId` (`locId`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
UPDATE
and the EXPLAIN output looks like this:
+----+-------------+-------------------+-------+---------------+-------+---------+------+------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------------------+-------+---------------+-------+---------+------+------+----------------------------------------------+
| 1 | SIMPLE | geolitecity_block | index | PRIMARY | locId | 4 | NULL | 108 | Using where; Using temporary; Using filesort |
+----+-------------+-------------------+-------+---------------+-------+---------+------+------+----------------------------------------------+
To improve performance, create indexes on startIpNum and endIpNum:
CREATE INDEX index_startIpNum ON geolitecity_block (startIpNum);
CREATE INDEX index_endIpNum ON geolitecity_block (endIpNum);
Indexing the columns that are being grouped or sorted on will almost always improve performance. I would suggest plugging this query into the DTA (Database Tuning Advisor) to see if it can make any suggestions; these might include the creation of one or more indexes in addition to statistics.
If it is possible in your use case, create a temporary table TMP_RESULT (without the ORDER BY) and then submit a second query that orders the results by antal. Filesort is extremely slow and, in your case, you cannot avoid it, because you are not sorting by any key/index. To perform the count you have to scan the complete table anyway, so a temporary table is a much faster solution.
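A minimal sketch of that two-step approach (reusing the question's query; the elided OR ranges are kept as a comment):
CREATE TEMPORARY TABLE TMP_RESULT AS
SELECT count(locId) AS antal, locId
FROM `geolitecity_block`
WHERE (1835880985 >= startIpNum AND 1835880985 <= endIpNum)
   -- ... remaining OR ranges from the original query ...
   OR (1843488110 >= startIpNum AND 1843488110 <= endIpNum)
GROUP BY locId;

SELECT antal, locId FROM TMP_RESULT ORDER BY antal DESC LIMIT 100;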
P.S. Adding an index on (startIpNum, endIpNum) will definitely help you get better performance, but if you have a lot of rows it will not be a huge improvement.
I have a table with ~30 million rows (and growing!) and currently I have some problems with a simple range select.
The query looks like this:
SELECT SUM( CEIL( dlvSize / 100 ) ) as numItems
FROM log
WHERE timeLogged BETWEEN 1000000 AND 2000000
AND user = 'example'
It takes minutes to finish, and I think the solution lies in the indexes I'm using. Here is the result of EXPLAIN:
+----+-------------+-------+-------+---------------------------------+---------+---------+------+----------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+-------+---------------------------------+---------+---------+------+----------+-------------+
| 1 | SIMPLE | log | range | PRIMARY,timeLogged | PRIMARY | 4 | NULL | 11839754 | Using where |
+----+-------------+-------+-------+---------------------------------+---------+---------+------+----------+-------------+
My table structure is the following (reduced to the columns relevant to the problem):
CREATE TABLE IF NOT EXISTS `log` (
`origDomain` varchar(64) NOT NULL default '0',
`timeLogged` int(11) NOT NULL default '0',
`orig` varchar(128) NOT NULL default '',
`rcpt` varchar(128) NOT NULL default '',
`dlvSize` varchar(255) default NULL,
`user` varchar(255) default NULL,
PRIMARY KEY (`timeLogged`,`orig`,`rcpt`),
KEY `timeLogged` (`timeLogged`),
KEY `orig` (`orig`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
Any ideas on what I can do to optimize this query or the indexes on my table?
You may want to try adding a composite index on (user, timeLogged):
CREATE TABLE IF NOT EXISTS `log` (
...
KEY `user_timeLogged` (user, timeLogged),
...
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
Related Stack Overflow post:
Database: When should I use a composite index?
In addition to the suggestions made by the other answers, I note that you have a column user in the table which is a varchar(255). If this refers to a column in a table of users, then 1) it would most likely be far more efficient to add an integer ID column to that table, and use that as the primary key and as a referencing column in other tables; 2) you are using InnoDB, so why not take advantage of the foreign key capabilities it offers?
Consider that if you index by a varchar(n) column, it is treated like a char(n) in the index, so each row of your current primary key takes up 4 + 128 + 128 = 260 bytes in the index.
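A rough sketch of that normalization (the users table and all names here are illustrative, not taken from the question's schema):
CREATE TABLE users (
  id   INT UNSIGNED NOT NULL AUTO_INCREMENT,
  name VARCHAR(255) NOT NULL,
  PRIMARY KEY (id),
  UNIQUE KEY (name)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

ALTER TABLE log
  ADD COLUMN user_id INT UNSIGNED NULL,
  ADD KEY user_id_timeLogged (user_id, timeLogged),
  ADD CONSTRAINT fk_log_user FOREIGN KEY (user_id) REFERENCES users (id);
The composite key (user_id, timeLogged) then serves both the foreign key and the range query from the question.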
Add an index on user.
I had all my tables in MyISAM, but the table-level locking was starting to kill me when I had long-running update jobs. I converted my primary tables over to InnoDB, and now many of my queries take over a minute to complete where they were nearly instantaneous on MyISAM. They are usually stuck in the Sorting result step. Did I do something wrong?
For example :
SELECT * FROM `metaward_achiever`
INNER JOIN `metaward_alias` ON (`metaward_achiever`.`alias_id` = `metaward_alias`.`id`)
WHERE `metaward_achiever`.`award_id` = 1507
ORDER BY `metaward_achiever`.`modified` DESC
LIMIT 100
It takes about 90 seconds now. Here is the EXPLAIN output:
+----+-------------+-------------------+--------+-------------------------------------------------------+----------------------------+---------+---------------------------------+-------+-----------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------------------+--------+-------------------------------------------------------+----------------------------+---------+---------------------------------+-------+-----------------------------+
| 1 | SIMPLE | metaward_achiever | ref | metaward_achiever_award_id,metaward_achiever_alias_id | metaward_achiever_award_id | 4 | const | 66424 | Using where; Using filesort |
| 1 | SIMPLE | metaward_alias | eq_ref | PRIMARY | PRIMARY | 4 | paul.metaward_achiever.alias_id | 1 | |
+----+-------------+-------------------+--------+-------------------------------------------------------+----------------------------+---------+---------------------------------+-------+-----------------------------+
It seems that tons of my queries now get stuck in the "Sorting result" step:
mysql> show processlist;
+--------+------+-----------+------+---------+------+----------------+------------------------------------------------------------------------------------------------------+
| Id | User | Host | db | Command | Time | State | Info |
+--------+------+-----------+------+---------+------+----------------+------------------------------------------------------------------------------------------------------+
| 460568 | paul | localhost | paul | Query | 0 | NULL | show processlist |
| 460638 | paul | localhost | paul | Query | 0 | Sorting result | SELECT `metaward_achiever`.`id`, `metaward_achiever`.`modified`, `metaward_achiever`.`created`, `met |
| 460710 | paul | localhost | paul | Query | 79 | Sending data | SELECT `metaward_achiever`.`id`, `metaward_achiever`.`modified`, `metaward_achiever`.`created`, `met |
| 460722 | paul | localhost | paul | Query | 49 | Updating | UPDATE `metaward_alias` SET `modified` = '2009-09-15 12:43:50', `created` = '2009-08-24 11:55:24', ` |
| 460732 | paul | localhost | paul | Query | 25 | Sorting result | SELECT `metaward_achiever`.`id`, `metaward_achiever`.`modified`, `metaward_achiever`.`created`, `met |
+--------+------+-----------+------+---------+------+----------------+------------------------------------------------------------------------------------------------------+
5 rows in set (0.00 sec)
And why is that simple UPDATE stuck for 49 seconds?
If it helps, here are the schemas :
| metaward_alias | CREATE TABLE `metaward_alias` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`modified` datetime NOT NULL,
`created` datetime NOT NULL,
`string_id` varchar(255) DEFAULT NULL,
`shortname` varchar(100) NOT NULL,
`remote_image` varchar(500) DEFAULT NULL,
`image` varchar(100) NOT NULL,
`user_id` int(11) DEFAULT NULL,
`type_id` int(11) NOT NULL,
`md5` varchar(32) NOT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `string_id` (`string_id`),
KEY `metaward_alias_user_id` (`user_id`),
KEY `metaward_alias_type_id` (`type_id`)
) ENGINE=InnoDB AUTO_INCREMENT=858381 DEFAULT CHARSET=utf8 |
| metaward_award | CREATE TABLE `metaward_award` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`modified` datetime NOT NULL,
`created` datetime NOT NULL,
`string_id` varchar(20) NOT NULL,
`owner_id` int(11) NOT NULL,
`name` varchar(100) NOT NULL,
`description` longtext NOT NULL,
`owner_points` int(11) NOT NULL,
`url` varchar(500) NOT NULL,
`remote_image` varchar(500) DEFAULT NULL,
`image` varchar(100) NOT NULL,
`parent_award_id` int(11) DEFAULT NULL,
`slug` varchar(110) NOT NULL,
`true_points` double DEFAULT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `string_id` (`string_id`),
KEY `metaward_award_owner_id` (`owner_id`),
KEY `metaward_award_parent_award_id` (`parent_award_id`),
KEY `metaward_award_slug` (`slug`),
KEY `metaward_award_name` (`name`)
) ENGINE=InnoDB AUTO_INCREMENT=122176 DEFAULT CHARSET=utf8 |
| metaward_achiever | CREATE TABLE `metaward_achiever` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`modified` datetime NOT NULL,
`created` datetime NOT NULL,
`award_id` int(11) NOT NULL,
`alias_id` int(11) NOT NULL,
`count` int(11) NOT NULL,
PRIMARY KEY (`id`),
KEY `metaward_achiever_award_id` (`award_id`),
KEY `metaward_achiever_alias_id` (`alias_id`)
) ENGINE=InnoDB AUTO_INCREMENT=77175366 DEFAULT CHARSET=utf8 |
And these in my my.cnf
innodb_file_per_table
innodb_buffer_pool_size = 2048M
innodb_additional_mem_pool_size = 16M
innodb_flush_method=O_DIRECT
As written in Should you move from MyISAM to Innodb? (which is pretty recent):
Innodb Needs Tuning As a final note about MyISAM to Innodb migration I should mention about Innodb tuning. Innodb needs tuning. Really. MyISAM for many applications can work well with defaults. I’ve seen hundreds of GB databases ran with MyISAM with default settings and it worked reasonably. Innodb needs resources and it will not work well with defaults a lot. Tuning MyISAM from defaults rarely gives more than 2-3 times gain while it can be as much as 10-50 times for Innodb tables in particular for write intensive workloads. Check here for details.
So, about MySQL Innodb Settings, the author wrote in Innodb Performance Optimization Basics:
The most important ones are:
innodb_buffer_pool_size – 70-80% of memory is a safe bet. I set it to 12G on a 16GB box.
UPDATE: If you're looking for more details, check out the detailed guide on tuning the innodb buffer pool.
innodb_log_file_size – This depends on your recovery speed needs, but 256M seems to be a good balance between reasonable recovery time and good performance.
innodb_log_buffer_size=4M – 4M is good for most cases, unless you're piping large blobs to Innodb, in which case increase it a bit.
innodb_flush_log_at_trx_commit=2 – If you're not concerned about ACID and can lose the transactions of the last second or two in case of a full OS crash, set this value. It can have a dramatic effect, especially on a lot of short write transactions.
innodb_thread_concurrency=8 – Even with the current Innodb scalability fixes, having limited concurrency helps. The actual number may be higher or lower depending on your application, but the default of 8 is a decent start.
innodb_flush_method=O_DIRECT – Avoids double buffering and reduces swap pressure; in most cases this setting improves performance. Be careful, though, if you do not have a battery-backed-up RAID cache, as write IO may suffer.
innodb_file_per_table – If you do not have too many tables, use this option so you will not have uncontrolled growth of the main Innodb tablespace, which you can't reclaim. This option was added in MySQL 4.1 and is now stable enough to use.
Also check if your application can run in the READ-COMMITTED isolation mode – if it does, set it as the default with transaction-isolation=READ-COMMITTED. This option has some performance benefits, especially for locking in 5.0, with even more to come with MySQL 5.1 and row-level replication.
Just for the record, the people behind mysqlperformanceblog.com ran a benchmark comparing Falcon, MyISAM and InnoDB. The benchmark was really supposed to be highlighting Falcon, except it was InnoDB that won the day, topping both Falcon and MyISAM in queries per second for almost every test: InnoDB vs MyISAM vs Falcon benchmarks – part 1.
That is a large result set (66,424 rows) that MySQL must manually sort. Try adding an index to metaward_achiever.modified.
There is a limitation with MySQL 4.x that only allows MySQL to use one index per table. Since it is using the index on metaward_achiever.award_id column for the WHERE selection, it cannot also use the index on metaward_achiever.modified for the sort. I hope you're using MySQL 5.x, which may have improved this.
You can see this by doing explain on this simplified query:
SELECT * FROM `metaward_achiever`
WHERE `metaward_achiever`.`award_id` = 1507
ORDER BY `metaward_achiever`.`modified` DESC
LIMIT 100
If you can get this using the indexes for both the WHERE selection and sorting, then you're set.
You could also create a compound index on both metaward_achiever.award_id and metaward_achiever.modified. If MySQL doesn't use it, you can hint at it or remove the index on just award_id.
Alternatively, if you can get rid of metaward_achiever.id and make metaward_achiever.award_id your primary key and add a key on metaward_achiever.modified, or better yet make metaward_achiever.award_id combined with metaward_achiever.modified your primary key, then you'll be in really good shape.
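A sketch of the compound index suggested above (the index name is mine):
ALTER TABLE metaward_achiever
  ADD KEY metaward_achiever_award_modified (award_id, modified);
With such an index, MySQL can read the rows for award_id = 1507 already ordered by modified and stop after 100 of them, avoiding the filesort entirely.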
You can try to optimize the file sorting by modifying settings. Unfortunately, I'm not experienced with this, as our DBA handles the configuration, but you might want to check out this great blog:
http://www.mysqlperformanceblog.com/
Here's an article about filesort in particular:
http://s.petrunia.net/blog/?p=24
My guess is that you probably haven't configured your InnoDB settings beyond the defaults. You should do a quick google for setting up your InnoDB options.
The one that caused me the most noticeable performance issues out of the box was innodb_buffer_pool_size. This should be set to 50-80% of your machine's memory. By default it's often only a few MB. Crank it way up, and you should see a noticeable performance increase.
Also take a look at innodb_additional_mem_pool_size.
Start here, but also google around for "innodb performance tuning".
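A quick way to sanity-check the current configuration before changing anything (a sketch; these are standard MySQL variables and InnoDB status counters):
SHOW VARIABLES LIKE 'innodb_buffer_pool_size';
SHOW VARIABLES LIKE 'innodb_additional_mem_pool_size';
SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_reads';          -- reads that missed the buffer pool and hit disk
SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_read_requests';  -- logical read requests
If Innodb_buffer_pool_reads is a large fraction of Innodb_buffer_pool_read_requests, the buffer pool is too small.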
From what I remember, MySQL's query optimizer is not great. Try a subselect instead of a straight join.
SELECT * FROM (SELECT * FROM `metaward_achiever`
WHERE `metaward_achiever`.`award_id` = 1507) a
INNER JOIN `metaward_alias` ON (a.`alias_id` = `metaward_alias`.`id`)
ORDER BY a.`modified` DESC
LIMIT 100
Or something like that (untested syntax above).
Sorting is something done by the database server, not the storage engine, in MySQL.
If, in either case, the engine is not able to provide the results in already-sorted form (this depends on the index used), then the server needs to sort them.
The only reason MyISAM / InnoDB might differ is that the order the rows come back in could affect how sorted the data already is - MyISAM could give the data back in "more sorted" order in some cases (and vice versa).
Still, sorting 60k rows is not going to take long as it's a very small data set. Are you sure you've got your sort buffer set big enough?
Using an on-disk filesort instead of an in-memory one is much slower. The engine, however, should not make any difference to this: filesort is not an engine function but a MySQL core function. filesort does, in fact, suck in quite a lot of ways, but it's not normally that slow.
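To check whether sorts are spilling to disk and to experiment with a bigger sort buffer, something like this should work (the 8MB value is just an example):
SHOW GLOBAL STATUS LIKE 'Sort_merge_passes';   -- non-zero and growing means on-disk filesorts
SHOW VARIABLES LIKE 'sort_buffer_size';
SET SESSION sort_buffer_size = 8 * 1024 * 1024;  -- raise it for this session only, then re-run the query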
Try adding a key on the fields metaward_achiever.alias_id, metaward_achiever.award_id, and metaward_achiever.modified; this will help a lot. And do not add keys on varchar fields; they increase the time for inserts and updates. Also, it seems you have 77M records in the achiever table, so you may want to look into InnoDB optimizations. There are lots of good tutorials on how to set memory limits for it.