In PostgreSQL 9.2, I have a table of items that are being rated by users:
id | userid | itemid | rating | timestamp | !update_time
--------+--------+--------+---------------+---------------------+------------------------
522241 | 3991 | 6887 | 0.1111111111 | 2005-06-20 03:13:56 | 2013-10-11 17:50:24.545
522242 | 3991 | 6934 | 0.1111111111 | 2005-04-05 02:25:21 | 2013-10-11 17:50:24.545
522243 | 3991 | 6936 | -0.1111111111 | 2005-03-31 03:17:25 | 2013-10-11 17:50:24.545
522244 | 3991 | 6942 | -0.3333333333 | 2005-03-24 04:38:02 | 2013-10-11 17:50:24.545
522245 | 3991 | 6951 | -0.5555555556 | 2005-06-20 03:15:35 | 2013-10-11 17:50:24.545
... | ... | ... | ... | ... | ...
I want to perform a very simple query: for each user, select the total number of ratings in the database.
I'm using the following straightforward approach:
SELECT userid, COUNT(*) AS rcount
FROM ratings
GROUP BY userid
The table contains 10M records. The query takes about 2 or 3 minutes. Honestly, I'm not satisfied with that; I don't believe 10M rows is so many that the query should take this long. (Or is it?)
So I asked PostgreSQL to show me the execution plan:
EXPLAIN SELECT userid, COUNT(*) AS rcount
FROM ratings
GROUP BY userid
This results in:
GroupAggregate (cost=1756177.54..1831423.30 rows=24535 width=5)
-> Sort (cost=1756177.54..1781177.68 rows=10000054 width=5)
Sort Key: userid
-> Seq Scan on ratings (cost=0.00..183334.54 rows=10000054 width=5)
I read this as follows: first, the whole table is read from disk (seq scan). Second, it is sorted by userid in n*log(n) time (sort). Finally, the sorted table is read row by row and aggregated in linear time. That is not exactly the optimal algorithm, I think; if I were to implement it myself, I would use a hash table and build the result in the first pass. Never mind.
It seems that the sorting by userid is what takes so long, so I added an index:
CREATE INDEX ratings_userid_index ON ratings (userid)
Unfortunately, this didn't help and the performance remained the same. I definitely do not consider myself an advanced user and I believe I'm doing something fundamentally wrong. However, this is where I got stuck. I would appreciate any ideas how to make the query execute in reasonable time. One more note: PostgreSQL worker process utilizes 100 % of one of my CPU cores during the execution, suggesting that disk access is not the main bottleneck.
EDIT
As requested by @a_horse_with_no_name. Wow, quite advanced for me:
EXPLAIN (analyze on, buffers on, verbose on)
SELECT userid,COUNT(userid) AS rcount
FROM movielens_10m.ratings
GROUP BY userId
Outputs:
GroupAggregate (cost=1756177.54..1831423.30 rows=24535 width=5) (actual time=110666.899..127168.304 rows=69878 loops=1)
Output: userid, count(userid)
Buffers: shared hit=906 read=82433, temp read=19358 written=19358
-> Sort (cost=1756177.54..1781177.68 rows=10000054 width=5) (actual time=110666.838..125180.683 rows=10000054 loops=1)
Output: userid
Sort Key: ratings.userid
Sort Method: external merge Disk: 154840kB
Buffers: shared hit=906 read=82433, temp read=19358 written=19358
-> Seq Scan on movielens_10m.ratings (cost=0.00..183334.54 rows=10000054 width=5) (actual time=0.019..2889.583 rows=10000054 loops=1)
Output: userid
Buffers: shared hit=901 read=82433
Total runtime: 127193.524 ms
EDIT 2
@a_horse_with_no_name's comment solved the problem. I feel happy to share my findings:
SET work_mem = '1MB';
EXPLAIN SELECT userid,COUNT(userid) AS rcount
FROM movielens_10m.ratings
GROUP BY userId
produces the same as above:
GroupAggregate (cost=1756177.54..1831423.30 rows=24535 width=5)
-> Sort (cost=1756177.54..1781177.68 rows=10000054 width=5)
Sort Key: userid
-> Seq Scan on ratings (cost=0.00..183334.54 rows=10000054 width=5)
However,
SET work_mem = '10MB';
EXPLAIN SELECT userid,COUNT(userid) AS rcount
FROM movielens_10m.ratings
GROUP BY userId
gives
HashAggregate (cost=233334.81..233580.16 rows=24535 width=5)
-> Seq Scan on ratings (cost=0.00..183334.54 rows=10000054 width=5)
The query now only takes about 3.5 seconds to complete.
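A minimal sketch of how the setting could be scoped to a single transaction instead of the whole session, assuming the same movielens_10m.ratings table. The 64MB value is only an illustrative guess and should be sized to the RAM actually available:

```sql
BEGIN;
-- SET LOCAL reverts automatically when the transaction ends
SET LOCAL work_mem = '64MB';

-- ANALYZE runs the query for real; BUFFERS shows shared and temp block usage
EXPLAIN (ANALYZE, BUFFERS)
SELECT userid, COUNT(*) AS rcount
FROM movielens_10m.ratings
GROUP BY userid;
COMMIT;
```

If the output no longer contains a "Sort Method: external merge" line or any "temp read/written" buffers, the aggregation most likely stayed in memory.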
Consider how your query could possibly return a result. You could build a hash table keyed by userid and create or increment its values; or you could sort all rows by userid and count. The planner picks whichever option it estimates to be cheaper, and with the memory it was allowed here it estimated the latter to be cheaper. That is what Postgres does.
Then consider how to sort the data, taking disk I/O into account. One option is to open disk pages A, B, C, D, etc., and then sort rows by userid in memory. In other words, a seq scan followed by a sort. The other option, called an index scan, would be to pull rows in order by using an index: visit page B, then D, then A, then B again, A again, C, ad nauseam.
An index scan is efficient when pulling a handful of rows in order; not so much to fetch many rows in order — let alone all rows in order. As such, the plan you're getting is the optimal one:
Plough through all rows (seq scan)
Sort rows to group by criteria
Count rows by criteria
Trouble is, you're sorting roughly 10 million rows in order to count them by userid. Nothing will make things faster short of investing in more RAM and super fast SSDs.
You can, however, avoid this query altogether. Options:
Count ratings for the handful of users that you actually need, using a WHERE clause, instead of pulling the entire set; or
Add a ratings_count field to your users table and use triggers on ratings to maintain the count; or
Use a materialized view, if the precise count is less relevant than having a vague idea of it.
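The trigger option could look roughly like the sketch below. It assumes a users table whose primary key is id and matches ratings.userid; the function and trigger names are made up for illustration, and an UPDATE that changes ratings.userid is deliberately not handled:

```sql
-- Assumed schema: users(id) exists and ratings.userid refers to it
ALTER TABLE users ADD COLUMN ratings_count integer NOT NULL DEFAULT 0;

-- One-off backfill from the existing ratings
UPDATE users u
SET ratings_count = r.cnt
FROM (SELECT userid, COUNT(*) AS cnt FROM ratings GROUP BY userid) r
WHERE u.id = r.userid;

-- Hypothetical trigger function keeping the counter in step with ratings
CREATE OR REPLACE FUNCTION ratings_count_sync() RETURNS trigger AS $$
BEGIN
    IF TG_OP = 'INSERT' THEN
        UPDATE users SET ratings_count = ratings_count + 1 WHERE id = NEW.userid;
    ELSIF TG_OP = 'DELETE' THEN
        UPDATE users SET ratings_count = ratings_count - 1 WHERE id = OLD.userid;
    END IF;
    RETURN NULL;  -- the return value of an AFTER row trigger is ignored
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER ratings_count_sync_trg
AFTER INSERT OR DELETE ON ratings
FOR EACH ROW EXECUTE PROCEDURE ratings_count_sync();
```

The per-user counts then become a plain read of users.ratings_count, at the price of an extra row update on every insert or delete in ratings.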
Try the query below, because COUNT(*) and COUNT(userid) can make a lot of difference.
SELECT userid, COUNT(userid) AS rcount
FROM ratings
GROUP BY userid
You can try running 'VACUUM ANALYZE ratings' to update the table statistics, so the optimizer can choose a better plan to execute the SQL.
Related
I have the following PostgreSQL table with about 67 million rows, which stores the EOD prices for all the US stocks starting in 1985:
Table "public.eods"
Column | Type | Collation | Nullable | Default
--------+-----------------------+-----------+----------+---------
stk | character varying(16) | | not null |
dt | date | | not null |
o | integer | | not null |
hi | integer | | not null |
lo | integer | | not null |
c | integer | | not null |
v | integer | | |
Indexes:
"eods_pkey" PRIMARY KEY, btree (stk, dt)
"eods_dt_idx" btree (dt)
I would like to query efficiently the table above based on either the stock name or the date. The primary key of the table is stock name and date. I have also defined an index on the date column, hoping to improve performance for queries that retrieve all the records for a specific date.
Unfortunately, I see a big difference in performance for the queries below. While getting all the records for a specific stock takes a decent amount of time to complete (2 seconds), getting all the records for a specific date takes much longer (about 56 seconds). I have tried to analyze these queries using explain analyze, and I have got the results below:
explain analyze select * from eods where stk='MSFT';
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on eods (cost=169.53..17899.61 rows=4770 width=36) (actual time=207.218..2142.215 rows=8364 loops=1)
Recheck Cond: ((stk)::text = 'MSFT'::text)
Heap Blocks: exact=367
-> Bitmap Index Scan on eods_pkey (cost=0.00..168.34 rows=4770 width=0) (actual time=187.844..187.844 rows=8364 loops=1)
Index Cond: ((stk)::text = 'MSFT'::text)
Planning Time: 577.906 ms
Execution Time: 2143.101 ms
(7 rows)
explain analyze select * from eods where dt='2010-02-22';
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
Index Scan using eods_dt_idx on eods (cost=0.56..25886.45 rows=7556 width=36) (actual time=40.047..56963.769 rows=8143 loops=1)
Index Cond: (dt = '2010-02-22'::date)
Planning Time: 67.876 ms
Execution Time: 56970.499 ms
(4 rows)
I really cannot understand why the second query runs 28 times slower than the first query. They retrieve a similar number of records, they both seem to be using an index. So could somebody please explain to me why this difference in performance, and can I do something to improve the performance of the queries that retrieve all the records for a specific date?
I would guess that this has to do with the data layout. I am guessing that you are loading the data by stk, so the rows for a given stk are on a handful of pages that pretty much only contain that stk.
So, the execution engine is only reading about 25 pages.
On the other hand, no single page contains two records for the same date. When you read by date, you have to read about 7,556 pages. That is, about 300 times the number of pages.
The scaling must also take into account the work for loading and reading the index. This should be about the same for the two queries, so the ratio is less than a factor of 300.
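One way to check the layout guess is to look at the physical-order correlation the planner has recorded for each column. This is only a sketch against the standard pg_stats view, and the numbers are meaningful only after a recent ANALYZE:

```sql
-- correlation near 1 or -1: rows are stored roughly in column order
-- correlation near 0: rows with the same value are scattered across pages
SELECT attname, n_distinct, correlation
FROM pg_stats
WHERE schemaname = 'public'
  AND tablename = 'eods'
  AND attname IN ('stk', 'dt');
```

A high correlation for stk together with a low one for dt would be consistent with the explanation above.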
There can be several issues, so it is hard to say where the problem is. An index scan should usually be faster than a bitmap heap scan; if it is not, one of the following may be the cause:
unhealthy index - try running REINDEX INDEX indexname
bad statistics - try running ANALYZE tablename
suboptimal state of the table - try running VACUUM tablename
effective_cache_size set too low or too high
issues with I/O - some systems have a problem with heavy random I/O; try increasing random_page_cost
Investigating which issue applies is a bit of alchemy, but it is possible, because there is only a small, closed set of likely causes. A good start is:
VACUUM ANALYZE tablename
benchmark your I/O if possible (for example with bonnie++)
To find the difference, you'll probably have to run EXPLAIN (ANALYZE, BUFFERS) on the query so that you see how many blocks are touched and where they come from.
I can think of two reasons:
Bad statistics that make PostgreSQL believe that dt has a high correlation while it has not. If the correlation is low, a bitmap index scan is often more efficient.
To see if that is the problem, run
ANALYZE eods;
and see if that changes the execution plans chosen.
Caching effects: perhaps the first query finds all required blocks already cached, while the second doesn't.
At any rate, it might be worth experimenting to see if a bitmap index scan would be cheaper for the second query:
SET enable_indexscan = off;
Then repeat the query.
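Put together, the experiment could be run as the following sketch. The setting only affects the current session, and running the statement twice helps separate caching effects from the choice of plan:

```sql
-- Discourage plain index scans for this session only
SET enable_indexscan = off;

-- BUFFERS reports how many blocks came from cache (hit) versus disk (read)
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM eods WHERE dt = '2010-02-22';

-- Restore the default so later queries are planned normally
RESET enable_indexscan;
```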
Take the following two tables:
Table "public.contacts"
Column | Type | Modifiers | Storage | Stats target | Description
--------------------+-----------------------------+-------------------------------------------------------+----------+--------------+-------------
id | integer | not null default nextval('contacts_id_seq'::regclass) | plain | |
created_at | timestamp without time zone | not null | plain | |
updated_at | timestamp without time zone | not null | plain | |
external_id | integer | | plain | |
email_address | character varying | | extended | |
first_name | character varying | | extended | |
last_name | character varying | | extended | |
company | character varying | | extended | |
industry | character varying | | extended | |
country | character varying | | extended | |
region | character varying | | extended | |
ext_instance_id | integer | | plain | |
title | character varying | | extended | |
Indexes:
"contacts_pkey" PRIMARY KEY, btree (id)
"index_contacts_on_ext_instance_id_and_external_id" UNIQUE, btree (ext_instance_id, external_id)
and
Table "public.members"
Column | Type | Modifiers | Storage | Stats target | Description
-----------------------+-----------------------------+--------------------------------------------------------------------+----------+--------------+-------------
id | integer | not null default nextval('members_id_seq'::regclass) | plain | |
step_id | integer | | plain | |
contact_id | integer | | plain | |
rule_id | integer | | plain | |
request_id | integer | | plain | |
sync_id | integer | | plain | |
status | integer | not null default 0 | plain | |
matched_targeted_rule | boolean | default false | plain | |
external_fields | jsonb | | extended | |
imported_at | timestamp without time zone | | plain | |
campaign_id | integer | | plain | |
ext_instance_id | integer | | plain | |
created_at | timestamp without time zone | | plain | |
Indexes:
"members_pkey" PRIMARY KEY, btree (id)
"index_members_on_contact_id_and_step_id" UNIQUE, btree (contact_id, step_id)
"index_members_on_campaign_id" btree (campaign_id)
"index_members_on_step_id" btree (step_id)
"index_members_on_sync_id" btree (sync_id)
"index_members_on_request_id" btree (request_id)
"index_members_on_status" btree (status)
Indices exist for both primary keys and members.contact_id.
I need to delete any contact which has no related members. There are roughly 3MM contact and 25MM member records.
I'm attempting the following two queries:
Query 1:
DELETE FROM "contacts"
WHERE "contacts"."id" IN (SELECT "contacts"."id"
FROM "contacts"
LEFT OUTER JOIN members
ON
members.contact_id = contacts.id
WHERE members.id IS NULL);
DELETE 0
Time: 173033.801 ms
-----------------------------------------------------------------------------------------------------------------------------------------------------------------
Delete on contacts (cost=2654306.79..2654307.86 rows=1 width=18) (actual time=188717.354..188717.354 rows=0 loops=1)
-> Nested Loop (cost=2654306.79..2654307.86 rows=1 width=18) (actual time=188717.351..188717.351 rows=0 loops=1)
-> HashAggregate (cost=2654306.36..2654306.37 rows=1 width=16) (actual time=188717.349..188717.349 rows=0 loops=1)
Group Key: contacts_1.id
-> Hash Right Join (cost=161177.46..2654306.36 rows=1 width=16) (actual time=188717.345..188717.345 rows=0 loops=1)
Hash Cond: (members.contact_id = contacts_1.id)
Filter: (members.id IS NULL)
Rows Removed by Filter: 26725870
-> Seq Scan on members (cost=0.00..1818698.96 rows=25322396 width=14) (actual time=0.043..160226.686 rows=26725870 loops=1)
-> Hash (cost=105460.65..105460.65 rows=3205265 width=10) (actual time=1962.612..1962.612 rows=3196180 loops=1)
Buckets: 262144 Batches: 4 Memory Usage: 34361kB
-> Seq Scan on contacts contacts_1 (cost=0.00..105460.65 rows=3205265 width=10) (actual time=0.011..950.657 rows=3196180 loops=1)
-> Index Scan using contacts_pkey on contacts (cost=0.43..1.48 rows=1 width=10) (never executed)
Index Cond: (id = contacts_1.id)
Planning time: 0.488 ms
Execution time: 188718.862 ms
Query 2:
DELETE FROM contacts
WHERE NOT EXISTS (SELECT 1
FROM members c
WHERE c.contact_id = contacts.id);
DELETE 0
Time: 170871.219 ms
-------------------------------------------------------------------------------------------------------------------------------------------------------------
Delete on contacts (cost=2258873.91..2954594.50 rows=1895601 width=12) (actual time=177523.034..177523.034 rows=0 loops=1)
-> Hash Anti Join (cost=2258873.91..2954594.50 rows=1895601 width=12) (actual time=177523.029..177523.029 rows=0 loops=1)
Hash Cond: (contacts.id = c.contact_id)
-> Seq Scan on contacts (cost=0.00..105460.65 rows=3205265 width=10) (actual time=0.018..1068.357 rows=3196180 loops=1)
-> Hash (cost=1818698.96..1818698.96 rows=25322396 width=10) (actual time=169587.802..169587.802 rows=26725870 loops=1)
Buckets: 262144 Batches: 32 Memory Usage: 36228kB
-> Seq Scan on members c (cost=0.00..1818698.96 rows=25322396 width=10) (actual time=0.052..160081.880 rows=26725870 loops=1)
Planning time: 0.901 ms
Execution time: 177524.526 ms
As you can see, even without deleting any records, both queries show similar performance, taking ~3 minutes.
The server disk I/O spikes to 100% so I'm assuming that data is being spilled out to the disk because a sequential scan is done on both contacts and members.
The server is an EC2 r3.large (15GB RAM).
Any ideas on what I can do to optimize this query?
Update #1:
After running vacuum analyze for both tables and ensuring enable_mergejoin is set to on there is no difference in the query time:
DELETE FROM contacts
WHERE NOT EXISTS (SELECT 1
FROM members c
WHERE c.contact_id = contacts.id);
-------------------------------------------------------------------------------------------------------------------------------------------------------------
Delete on contacts (cost=2246088.17..2966677.08 rows=1875003 width=12) (actual time=209406.342..209406.342 rows=0 loops=1)
-> Hash Anti Join (cost=2246088.17..2966677.08 rows=1875003 width=12) (actual time=209406.338..209406.338 rows=0 loops=1)
Hash Cond: (contacts.id = c.contact_id)
-> Seq Scan on contacts (cost=0.00..105683.28 rows=3227528 width=10) (actual time=0.008..1010.643 rows=3227462 loops=1)
-> Hash (cost=1814029.74..1814029.74 rows=24855474 width=10) (actual time=198054.302..198054.302 rows=27307060 loops=1)
Buckets: 262144 Batches: 32 Memory Usage: 37006kB
-> Seq Scan on members c (cost=0.00..1814029.74 rows=24855474 width=10) (actual time=1.132..188654.555 rows=27307060 loops=1)
Planning time: 0.328 ms
Execution time: 209408.040 ms
Update 2:
PG Version:
PostgreSQL 9.4.4 on x86_64-pc-linux-gnu, compiled by x86_64-pc-linux-gnu-gcc (Gentoo Hardened 4.5.4 p1.0, pie-0.4.7) 4.5.4, 64-bit
Relation size:
Table | Size | External Size
-----------------------+---------+---------------
members | 23 GB | 11 GB
contacts | 944 MB | 371 MB
Settings:
work_mem
----------
64MB
random_page_cost
------------------
4
Update 3:
Experimenting with doing this in batches doesn't seem to help out on the I/O usage (still spikes to 100%) and doesn't seem to improve on time despite using index-based plans.
DO $do$
BEGIN
FOR i IN 57..668
LOOP
DELETE
FROM contacts
WHERE contacts.id IN
(
SELECT contacts.id
FROM contacts
left outer join members
ON members.contact_id = contacts.id
WHERE members.id IS NULL
AND contacts.id >= (i * 10000)
AND contacts.id < ((i+1) * 10000));
END LOOP;
END $do$;
I had to kill the query after Time: 1203492.326 ms and disk I/O stayed at 100% the entire time the query ran. I also experimented with 1,000 and 5,000 chunks but did not see any increase in performance.
Note: The 57..668 range was used because I know those are existing contact IDs. (E.g. min(id) and max(id))
One approach to problems like this can be to do it in smaller chunks.
DELETE FROM "contacts"
WHERE "contacts"."id" IN (
SELECT contacts.id
FROM contacts
LEFT OUTER JOIN members ON members.contact_id = contacts.id
WHERE members.id IS NULL
AND contacts.id >= 1 AND contacts.id < 1000
);
DELETE FROM "contacts"
WHERE "contacts"."id" IN (
SELECT contacts.id
FROM contacts
LEFT OUTER JOIN members ON members.contact_id = contacts.id
WHERE members.id IS NULL
AND contacts.id >= 1000 AND contacts.id < 2000
);
Rinse, repeat. Experiment with different chunk sizes to find an optimal one for your data set, which uses the fewest queries, while keeping them all in memory.
Naturally, you would want to script this, possibly in plpgsql, or in whatever scripting language you prefer.
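A rough plpgsql sketch of such a script follows. The batch size of 10000 is an arbitrary assumption, and whether batching helps at all depends on the planner using an index on members.contact_id for each chunk. Note that the whole DO block still runs as a single transaction:

```sql
DO $do$
DECLARE
    batch_size CONSTANT integer := 10000;  -- assumed chunk size, tune as needed
    lo integer;
    hi integer;
BEGIN
    -- Walk the id range of contacts in fixed-size, half-open windows
    SELECT min(id), max(id) INTO lo, hi FROM contacts;

    WHILE lo <= hi LOOP
        DELETE FROM contacts
        WHERE contacts.id >= lo
          AND contacts.id < lo + batch_size
          AND NOT EXISTS (SELECT 1
                          FROM members m
                          WHERE m.contact_id = contacts.id);
        lo := lo + batch_size;
    END LOOP;
END
$do$;
```

Using half-open ranges (>= lo, < lo + batch_size) means no id is skipped or processed twice at the chunk boundaries.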
Any ideas on what I can do to optimize this query?
Your queries are perfect. I would use the NOT EXISTS variant.
Your index index_members_on_contact_id_and_step_id is also good for it:
Is a composite index also good for queries on the first field?
But see below about BRIN indexes.
You can tune your server, table and index configuration.
Since you do not actually update or delete many rows (hardly any at all, according to your comment?), you need to optimize read performance.
1. Upgrade your Postgres version
You provided:
The server is an EC2 r3.large (15GB RAM).
And:
PostgreSQL 9.4.4
Your version is seriously outdated. At least upgrade to the latest minor version. Better yet, upgrade to the current major version. Postgres 9.5 and 9.6 brought major improvements for big data - which is what you need exactly.
Consider the versioning policy of the project.
Amazon allows you to upgrade!
2. Improve table statistics
There is an unexpected 10% mismatch between expected and actual row count in the basic sequential scan:
Seq Scan on members c (cost=0.00..1814029.74 rows=24855474 width=10) (actual time=1.132..188654.555 rows=27307060 loops=1)
Not dramatic at all, but still should not occur in this query. Indicates that you might have to tune your autovacuum settings - possibly per table for the very big ones.
More problematic:
Hash Anti Join (cost=2246088.17..2966677.08 rows=1875003 width=12) (actual time=209406.338..209406.338 rows=0 loops=1)
Postgres expects to find 1875003 rows to delete, while actually 0 rows are found. That's unexpected. Maybe substantially increasing the statistics target on members.contact_id and contacts.id can help to decrease the gap, which might allow better query plans. See:
Keep PostgreSQL from sometimes choosing a bad query plan
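Raising the statistics target for the two join columns could be sketched as follows. The value 1000 is only an example (the default is 100), and a fresh ANALYZE is needed before the planner sees any difference:

```sql
-- Collect more detailed statistics for the columns used in the anti join
ALTER TABLE members  ALTER COLUMN contact_id SET STATISTICS 1000;
ALTER TABLE contacts ALTER COLUMN id         SET STATISTICS 1000;

-- Rebuild the statistics with the new targets
ANALYZE members;
ANALYZE contacts;
```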
3. Avoid table and index bloat
Your ~ 25MM rows in members occupy 23 GB - that's almost 1kb per row, which seems excessive for the table definition you presented (even if the total size you provided should include indexes):
4 bytes item identifier
24 tuple header
8 null bitmap
36 9x integer
16 2x ts
1 1x bool
?? 1x jsonb
See:
Making sense of Postgres row sizes
That's 89 bytes per row - or less with some NULL values - and hardly any alignment padding, so 96 bytes max, plus your jsonb column.
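The estimate can be compared with what is actually stored. The sketch below uses only built-in size functions; the 10000-row sample is an arbitrary, non-random slice, so treat the jsonb average as a rough indication only:

```sql
-- The fixed-width part of the estimate above
SELECT 4 + 24 + 8 + 9 * 4 + 2 * 8 + 1 AS bytes_before_jsonb;  -- 89

-- Heap, indexes, and total size (total includes TOAST) for members
SELECT pg_size_pretty(pg_relation_size('members'))       AS heap_only,
       pg_size_pretty(pg_indexes_size('members'))        AS all_indexes,
       pg_size_pretty(pg_total_relation_size('members')) AS total_size;

-- Average stored size of the jsonb column over a small sample
SELECT round(avg(pg_column_size(external_fields))) AS avg_jsonb_bytes
FROM (SELECT external_fields FROM members LIMIT 10000) AS sample;
```

If the average jsonb size is small while the heap is still many times larger than rows times about 100 bytes, bloat is the more likely explanation.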
Either that jsonb column is very big which would make me suggest to normalize the data into separate columns or a separate table. Consider:
How to perform update operations on columns of type JSONB in Postgres 9.4
Or your table is bloated, which can be solved with VACUUM FULL ANALYZE or, while being at it:
CLUSTER members USING index_members_on_contact_id_and_step_id;
VACUUM members;
But either takes an exclusive lock on the table, which you say you cannot afford. pg_repack can do it without exclusive lock. See:
VACUUM returning disk space to operating system
Even if we factor in index sizes, your table seems too big: you have 7 small indexes, each 36 - 44 bytes per row without bloat, less with NULL values, so < 300 bytes altogether.
Either way, consider more aggressive autovacuum settings for your table members. Related:
Aggressive Autovacuum on PostgreSQL
What fillfactor for caching table?
And / or stop bloating the table to begin with. Are you updating rows a lot? Any particular column you update a lot? That jsonb column maybe? You might move that to a separate (1:1) table just to stop bloating the main table with dead tuples - and keeping autovacuum from doing its job.
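A sketch of how dead tuples could be checked and autovacuum tightened for this one table. The scale factors are illustrative values, not tuned recommendations:

```sql
-- How many dead tuples are waiting, and when autovacuum last ran
SELECT relname, n_live_tup, n_dead_tup, last_autovacuum, last_autoanalyze
FROM pg_stat_user_tables
WHERE relname = 'members';

-- Per-table overrides: vacuum and analyze after a smaller fraction of changes
ALTER TABLE members SET (
    autovacuum_vacuum_scale_factor  = 0.02,
    autovacuum_analyze_scale_factor = 0.01
);
```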
4. Try a BRIN index
Block range indexes require Postgres 9.5 or later and dramatically reduce index size. I was too optimistic in my first draft. A BRIN index is perfect for your use case if you have many rows in members for each contact.id - after physically clustering your table at least once (see section 3 for the fitting CLUSTER command). In that case Postgres can rule out whole data pages quickly. But your numbers indicate only around 8 rows per contact.id, so data pages would often contain multiple values, which voids much of the effect. Depends on actual details of your data distribution ...
On the other hand, as it stands, your tuple size is around 1 kb, so only ~ 8 rows per data page (typically 8kb). If that isn't mostly bloat, a BRIN index might help after all.
But you need to upgrade your server version first. See section 1.
CREATE INDEX members_contact_id_brin_idx ON members USING BRIN (contact_id);
Update statistics used by the planner and set enable_mergejoin to on:
vacuum analyse members;
vacuum analyse contacts;
set enable_mergejoin to on;
You should get a query plan similar to this one:
explain analyse
delete from contacts
where not exists (
select 1
from members c
where c.contact_id = contacts.id);
QUERY PLAN
----------------------------------------------------------------------
Delete on contacts
-> Merge Anti Join
Merge Cond: (contacts.id = c.contact_id)
-> Index Scan using contacts_pkey on contacts
-> Index Scan using members_contact_id_idx on members c
Here is another variant to try:
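DELETE FROM contacts
USING contacts c
LEFT JOIN members m
ON c.id = m.contact_id
-- tie the target row to the aliased copy; without this every contact would match
WHERE contacts.id = c.id
AND m.contact_id IS NULL;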
It uses a technique for deleting from a joined query described here.
I can't vouch for whether this would definitely be faster but it might be because of the avoidance of a subquery. Would be interested in the results...
Using a subquery in the WHERE clause can take a lot of time.
You could use WITH together with USING instead, which may be considerably faster:
with
c_not_member as (
-- here extract the ids of contacts that are not in members
SELECT
c.id
FROM contacts c LEFT JOIN members m on c.id = m.contact_id
WHERE
-- to get the contacts that don't exist in members, just
-- test a members column that cannot be null;
-- in this case that is id
m.id is null
-- the only case where m.id is null is when no m.contact_id matches c.id,
-- in other words c.id doesn't exist in m.contact_id
)
DELETE FROM contacts all_c using c_not_member WHERE all_c.id = c_not_member.id ;
Hi, I'm curious why the index is not used once the table grows, even to just 100 rows.
Here's the select with 10 rows:
mydb> explain select * from data where user_id=1;
+-----------------------------------------------------------------------------------+
| QUERY PLAN |
|-----------------------------------------------------------------------------------|
| Index Scan using ix_data_user_id on data (cost=0.14..8.15 rows=1 width=2043) |
| Index Cond: (user_id = 1) |
+-----------------------------------------------------------------------------------+
EXPLAIN
Here's the select with 100 rows:
mydb> explain select * from data where user_id=1;
+------------------------------------------------------------+
| QUERY PLAN |
|------------------------------------------------------------|
| Seq Scan on data (cost=0.00..44.67 rows=1414 width=945) |
| Filter: (user_id = 1) |
+------------------------------------------------------------+
EXPLAIN
How can I get the index to be used when there are 100 rows?
100 is not a large amount of data. Think 10,000 or 100,000 rows for a respectable amount.
To put it simply, records in a table are stored on data pages. A data page typically has about 8k bytes (it depends on the database and on settings). A major purpose of indexes is to reduce the number of data pages that need to be read.
If all the records in a table fit on one page, there is no need to reduce the number of pages being read. The one page will be read. Hence, the index may not be particularly useful.
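To watch the planner change its mind as the table grows, one could build a throwaway table along these lines. The table data_demo, its columns, and the index name are hypothetical stand-ins, not the asker's schema:

```sql
-- Hypothetical test table with 100,000 rows spread over 1,000 user ids
CREATE TABLE data_demo (
    id      serial PRIMARY KEY,
    user_id integer NOT NULL,
    payload text
);

INSERT INTO data_demo (user_id, payload)
SELECT g % 1000, repeat('x', 200)
FROM generate_series(1, 100000) AS g;

CREATE INDEX ix_data_demo_user_id ON data_demo (user_id);
ANALYZE data_demo;  -- refresh statistics so the estimates reflect the new rows

-- With many pages and a selective predicate, an index-based plan becomes attractive
EXPLAIN SELECT * FROM data_demo WHERE user_id = 1;
```

For a quick check on the small table itself, SET enable_seqscan = off in a test session shows whether the index is usable at all; it is a diagnostic switch, not something to leave on.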
I have a table that holds historical records. Whenever a count gets updated, a record is added specifying that a new value was fetched at that time. The table schema looks like this:
Column | Type | Modifiers
---------------+--------------------------+--------------------------------------------------------------------
id | integer | not null default nextval('project_accountrecord_id_seq'::regclass)
user_id | integer | not null
created | timestamp with time zone | not null
service | character varying(200) | not null
metric | character varying(200) | not null
value | integer | not null
Now I'd like to get the total number of records updated each day, for the last seven days. Here's what I came up with:
SELECT
created::timestamp::date as created_date,
count(created)
FROM
project_accountrecord
GROUP BY
created::timestamp::date
ORDER BY
created_date DESC
LIMIT 7;
This runs slowly (11406.347ms). EXPLAIN ANALYZE gives:
Limit (cost=440939.66..440939.70 rows=7 width=8) (actual time=24184.547..24370.715 rows=7 loops=1)
-> GroupAggregate (cost=440939.66..477990.56 rows=6711746 width=8) (actual time=24184.544..24370.699 rows=7 loops=1)
-> Sort (cost=440939.66..444340.97 rows=6802607 width=8) (actual time=24161.120..24276.205 rows=92413 loops=1)
Sort Key: (((created)::timestamp without time zone)::date)
Sort Method: external merge Disk: 146328kB
-> Seq Scan on project_accountrecord (cost=0.00..153671.43 rows=6802607 width=8) (actual time=0.017..10132.970 rows=6802607 loops=1)
Total runtime: 24420.988 ms
There are a little over 6.8 million rows in this table. What can I do to increase performance of this query? Ideally I'd like it to run in under a second so I can cache it and update it in the background a couple of times a day.
As written, your query must scan the whole table, calculate the result, and then limit it to the 7 most recent days.
You can speed up the query by scanning only the last 7 days (or more if you don't update records every day):
where created > now()::date - '7 days'::interval
Another approach is to cache historical results in an extra table and count only the current day.
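Applied to the full query, the range filter might look like the sketch below. The index name is made up, and the filter has to reference the created column itself because a SELECT alias such as created_date is not visible in WHERE:

```sql
-- Hypothetical index so the range predicate does not need a full table scan
CREATE INDEX project_accountrecord_created_idx
    ON project_accountrecord (created);

SELECT created::timestamp::date AS created_date,
       count(created)
FROM project_accountrecord
WHERE created > now()::date - '7 days'::interval  -- only recent rows are read
GROUP BY created::timestamp::date
ORDER BY created_date DESC
LIMIT 7;
```

Filtering on the raw column rather than on the cast expression is what lets a plain index on created be considered.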
I have two tables
junk=# select * from t;
name | intval
----------+--------
bar2 | 2
bar3 | 3
bar4 | 4
(3 rows)
and
junk=# select * from temp;
id | name | intval
----+------------+--------
1 | foo | 0
2 | foo2 | 2
3 | foo3 | 3
4 | foo4 | 4
5 | foo5 | 5
(5 rows)
Now, I want to use the values from table t to update the values in table temp. Basically, I want to replace the name column in second, third and fourth values in temp by bar2, bar3 and bar4.
I created the table t using the COPY statement. I am doing batch updates and I am trying to optimize that.
So, I get this error. I think this is pretty basic one.
junk=# UPDATE temp FROM t SET name=t.name FROM t WHERE intval=t.intval;
ERROR: syntax error at or near "FROM"
LINE 1: UPDATE temp FROM t SET name=t.name FROM t WHERE intval=t.int...
^
junk=#
For now, this works.
UPDATE temp SET name=t.name FROM t WHERE temp.intval=t.intval
Get rid of your first FROM t clause.
FROM must come after SET, not before. Alternatively, SET can be done with a subquery.
The completed code is:
UPDATE temp SET name=(SELECT t.name FROM t WHERE temp.intval = t.intval);
PostgreSQL has some ways to optimize this so it's not like you are just doing a huge nested loop join (and looking up one row over and over from the heap based on the join criteria).
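For comparison, here is a sketch of both forms against the temp and t tables from the question. One difference worth noting: a bare subquery in SET assigns NULL to rows that have no match in t, so the second statement adds an EXISTS guard to leave those rows alone:

```sql
-- Join form: only rows of temp with a matching t.intval are touched
UPDATE temp
SET name = t.name
FROM t
WHERE temp.intval = t.intval;

-- Subquery form, restricted to rows that actually have a match in t
UPDATE temp
SET name = (SELECT t.name FROM t WHERE t.intval = temp.intval)
WHERE EXISTS (SELECT 1 FROM t WHERE t.intval = temp.intval);
```

Both assume intval is unique in t; with duplicates the subquery form raises an error, while the join form picks one of the matching rows.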
Edit: Adding plan to show we are not, in fact, running through a sequential scan of the second table for each row on the first one.
Here is an example that updates 172 rows in one table using a group-by from another:
mtech_test=# explain analyze
update ap
set amount = (select sum(amount) from acc_trans ac where ac.trans_id = ap.id) + 1;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------
Update on ap (cost=0.00..3857.06 rows=229 width=231) (actual time=39.074..39.074 rows=0 loops=1)
-> Seq Scan on ap (cost=0.00..3857.06 rows=229 width=231) (actual time=0.050..28.444 rows=172 loops=1)
SubPlan 1
-> Aggregate (cost=16.80..16.81 rows=1 width=5) (actual time=0.109..0.110 rows=1 loops=172)
-> Bitmap Heap Scan on acc_trans ac (cost=4.28..16.79 rows=4 width=5) (actual time=0.075..0.102 rows=4 loops=172)
Recheck Cond: (trans_id = ap.id)
-> Bitmap Index Scan on acc_trans_trans_id_key (cost=0.00..4.28 rows=4 width=0) (actual time=0.006..0.006 rows=4 loops=172)
Index Cond: (trans_id = ap.id)
Trigger for constraint ap_entity_id_fkey: time=69.532 calls=172
Trigger ap_audit_trail: time=391.722 calls=172
Trigger ap_track_global_sequence: time=1.954 calls=172
Trigger check_department: time=111.301 calls=172
Total runtime: 612.001 ms
(13 rows)