How to speed up SUM query in postgres on large table - sql

The problem
I'm trying to run the following query on a SQL view in a postgres database:
SELECT sum(value) FROM invoices_view;
The invoices_view has approximately 45 million rows, the data size of the entire database is 40.5 GB, and the database server has 61 GB of RAM.
Currently this query is taking 4.5 seconds, and I'd like it to be ideally under 1 second.
Things I've tried
I cannot add indexes directly to the SQL view, of course, but I do have an index on the underlying table:
CREATE INDEX invoices_on_value_idx ON invoices (value);
I have also run a VACUUM ANALYZE on the invoices table.
EXPLAIN ANALYZE
The output of EXPLAIN ANALYZE is as follows:
EXPLAIN (ANALYZE, BUFFERS) SELECT sum(value) FROM invoices_view;
Finalize Aggregate  (cost=1514195.47..1514195.47 rows=1 width=32) (actual time=5102.805..5102.806 rows=1 loops=1)
  Buffers: shared hit=14996 read=1446679
  I/O Timings: read=3235.147
  ->  Gather  (cost=1514195.16..1514195.47 rows=3 width=32) (actual time=5102.716..5109.229 rows=4 loops=1)
        Workers Planned: 3
        Workers Launched: 3
        Buffers: shared hit=14996 read=1446679
        I/O Timings: read=3235.147
        ->  Partial Aggregate  (cost=1513195.16..1513195.17 rows=1 width=32) (actual time=5097.626..5097.626 rows=1 loops=4)
              Buffers: shared hit=14996 read=1446679
              I/O Timings: read=3235.147
              ->  Parallel Seq Scan on invoices  (cost=0.00..1505835.14 rows=14720046 width=6) (actual time=0.049..3734.495 rows=11408036 loops=4)
                    Buffers: shared hit=14996 read=1446679
                    I/O Timings: read=3235.147
Planning Time: 2.503 ms
Execution Time: 5109.327 ms
Does anyone have any thoughts on how I might be able to speed this up? Or should I be looking at alternatives to postgres at this point?
More detail
This is the simplest version of the queries I'll need to run over the dataset.
For example, I need to be able to SUM based on user inputs i.e. additional WHERE clauses and GROUP BYs.
Keeping a running total would only solve this simplest case.

You should consider using a trigger to keep track of a rolling sum:
CREATE OR REPLACE FUNCTION func_sum_invoice()
RETURNS trigger AS
$BODY$
BEGIN
    UPDATE invoices_sum
    SET total = total + NEW.value;
    RETURN NEW;
END;
$BODY$
LANGUAGE plpgsql;
Then create the trigger using this function:
CREATE TRIGGER sum_invoice
AFTER INSERT ON invoices
FOR EACH ROW
EXECUTE PROCEDURE func_sum_invoice();
Now each insert into the invoices table will fire a trigger that tallies the rolling sum. To obtain that sum, you need only a single select, which should be very fast:
SELECT total
FROM invoices_sum;
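For this to work, the invoices_sum table has to exist and be seeded with the current total before the trigger takes over; the answer does not show that step, so the following is only a sketch:
CREATE TABLE invoices_sum (total numeric NOT NULL);
INSERT INTO invoices_sum (total)
SELECT coalesce(sum(value), 0) FROM invoices;
Note that the trigger above covers INSERT only; UPDATE and DELETE on invoices would need similar handling for the total to stay accurate.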

Related

IN clause is slow when sorting

I have simple SQL with IN clause and ordering:
SELECT * FROM my_table
WHERE fieldA IN ('value1', 'value2')
ORDER BY fieldB
I've also created the following indexes:
CREATE INDEX idx_my_table_fieldA ON my_table (fieldA)
CREATE INDEX idx_my_table_fieldA_fieldB ON my_table (fieldA, fieldB)
There are millions of rows in this table.
I have very slow queries when I run it with multiple values in the IN clause:
Sort  (cost=800.35..802.21 rows=744 width=601) (actual time=36.409..36.423 rows=5 loops=1)
  Sort Key: "fieldB"
  Sort Method: quicksort  Memory: 30kB
  Buffers: shared hit=18 read=13
  I/O Timings: read=29.326
  ->  Index Scan using idx_my_table_fieldA on my_table  (cost=0.57..764.86 rows=744 width=601) (actual time=5.597..35.819 rows=5 loops=1)
        Index Cond: (fieldA = ANY ('{value1,value2}'::text[]))
        Buffers: shared hit=15 read=13
        I/O Timings: read=29.326
Planning:
  Buffers: shared hit=238 read=54
  I/O Timings: read=81.777
Planning Time: 94.622 ms
Execution Time: 36.528 ms
(14 rows)
The idx_my_table_fieldA index is used.
But it's very fast if I run it with a single value in the IN clause:
fast SQL example:
SELECT * FROM my_table
WHERE fieldA IN ('value1')
ORDER BY fieldB
This is the query plan:
Index Scan using idx_my_table_fieldA_fieldB on my_table  (cost=0.57..153.17 rows=149 width=601) (actual time=1.435..1.440 rows=1 loops=1)
  Index Cond: (fieldA = 'value1'::text)
  Buffers: shared hit=3 read=2
  I/O Timings: read=1.313
Planning Time: 0.194 ms
Execution Time: 1.472 ms
(6 rows)
The multicolumn index is used in this case.
Could you recommend how to improve this query? I use an ORM (Hibernate + Spring Data), so it's better not to use native SQL. It would be great to solve this problem with other appropriate indexes, if that's possible.
This query is slow because the planner (called the optimizer in other RDBMSs) is quite limited here.
In such cases (a list of values in an IN operator), Oracle or SQL Server transforms the query into several single-value queries and then concatenates the results, so that each one can benefit from the indexes.
But it seems that PostgreSQL does not rewrite your query this way.
You can manually test whether such a rewrite produces a better plan...
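For example, a hand-written equivalent of the two-value query, using the table and columns from the question, might look like this (whether it is actually cheaper is something EXPLAIN (ANALYZE, BUFFERS) will tell you):
SELECT *
FROM (
    SELECT * FROM my_table WHERE fieldA = 'value1'
    UNION ALL
    SELECT * FROM my_table WHERE fieldA = 'value2'
) AS t
ORDER BY fieldB;
Each branch can then use the (fieldA, fieldB) index on its own, and only the combined result is sorted.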

PostgreSQL. Improve indexes

I have the following structure:
create table bitmex
(
    timestamp timestamp with time zone not null,
    symbol    varchar(255) not null,
    side      varchar(255) not null,
    tid       varchar(255) not null,
    size      numeric not null,
    price     numeric not null,
    constraint bitmex_tid_symbol_pk
        primary key (tid, symbol)
);
create index bitmex_timestamp_symbol_index on bitmex (timestamp, symbol);
create index bitmex_symbol_index on bitmex (symbol);
I need to know the exact count every time, so reltuples is not usable.
The table has more than 45,000,000 rows.
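(For context: reltuples is the planner's row-count estimate stored in the pg_class catalog, e.g. SELECT reltuples::bigint FROM pg_class WHERE relname = 'bitmex'; — being an estimate, it cannot give the exact count asked for here.)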
Running
explain analyze select count(*) from bitmex where symbol = 'XBTUSD';
gives
Finalize Aggregate  (cost=1038428.56..1038428.57 rows=1 width=8)
  ->  Gather  (cost=1038428.35..1038428.56 rows=2 width=8)
        Workers Planned: 2
        ->  Partial Aggregate  (cost=1037428.35..1037428.36 rows=1 width=8)
              ->  Parallel Seq Scan on bitmex  (cost=0.00..996439.12 rows=16395690 width=0)
                    Filter: ((symbol)::text = 'XBTUSD'::text)
Running
explain analyze select count(*) from bitmex;
gives
Finalize Aggregate  (cost=997439.34..997439.35 rows=1 width=8) (actual time=6105.463..6105.463 rows=1 loops=1)
  ->  Gather  (cost=997439.12..997439.33 rows=2 width=8) (actual time=6105.444..6105.457 rows=3 loops=1)
        Workers Planned: 2
        Workers Launched: 2
        ->  Partial Aggregate  (cost=996439.12..996439.14 rows=1 width=8) (actual time=6085.960..6085.960 rows=1 loops=3)
              ->  Parallel Seq Scan on bitmex  (cost=0.00..954473.50 rows=16786250 width=0) (actual time=0.364..4342.460 rows=13819096 loops=3)
Planning time: 0.080 ms
Execution time: 6108.277 ms
Why did it not use the indexes?
Thanks
If all rows have to be visited, an index scan is only cheaper if the table does not have to be consulted for most of the values found in the index.
Due to the way PostgreSQL is organized, the table has to be visited to determine if the entry found in the index is visible or not. This step can be skipped if the whole page is marked as “visible” in the visibility map of the table.
To update the visibility map, run VACUUM on the table. Maybe then an index only scan will be used.
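With the table from the question, that is simply:
VACUUM bitmex;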
But counting the number of rows in a table is never cheap, even with an index scan. If you need to do that often, it may be a good idea to have a separate table that only contains a counter for the number of rows. Then you can write triggers that update the counter whenever rows are inserted or deleted.
That will slow down the performance during INSERT and DELETE, but you can count the rows with lightning speed.
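A minimal sketch of that approach, keeping one counter per symbol (all the names below are made up for illustration, and ON CONFLICT requires PostgreSQL 9.5 or later):
CREATE TABLE bitmex_row_count (
    symbol varchar(255) PRIMARY KEY,
    cnt    bigint NOT NULL
);
CREATE OR REPLACE FUNCTION bitmex_count_trg()
RETURNS trigger AS
$$
BEGIN
    IF TG_OP = 'INSERT' THEN
        -- create the counter row the first time a symbol appears, otherwise bump it
        INSERT INTO bitmex_row_count (symbol, cnt) VALUES (NEW.symbol, 1)
        ON CONFLICT (symbol) DO UPDATE SET cnt = bitmex_row_count.cnt + 1;
        RETURN NEW;
    ELSE
        -- DELETE: decrement the counter for the removed row's symbol
        UPDATE bitmex_row_count SET cnt = cnt - 1 WHERE symbol = OLD.symbol;
        RETURN OLD;
    END IF;
END;
$$
LANGUAGE plpgsql;
CREATE TRIGGER bitmex_count
AFTER INSERT OR DELETE ON bitmex
FOR EACH ROW
EXECUTE PROCEDURE bitmex_count_trg();
After seeding bitmex_row_count from a one-off count(*), SELECT cnt FROM bitmex_row_count WHERE symbol = 'XBTUSD'; replaces the slow aggregate.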

Postgres inconsistent use of Index vs Seq Scan

I'm having difficulty understanding what I perceive as an inconsistency in how postgres chooses to use indices. We have a query based on NOT IN against an indexed column that postgres executes sequentially, but when we perform the same query as IN, it uses the index.
I've created a simplistic example that I believe demonstrates the issue; notice that this first query does a sequential scan.
CREATE TABLE node
(
    id  SERIAL PRIMARY KEY,
    vid INTEGER
);
CREATE INDEX x ON node(vid);
INSERT INTO node(vid) VALUES (1),(2);
EXPLAIN ANALYZE
SELECT *
FROM node
WHERE NOT vid IN (1);
Seq Scan on node  (cost=0.00..36.75 rows=2129 width=8) (actual time=0.009..0.010 rows=1 loops=1)
  Filter: (vid <> 1)
  Rows Removed by Filter: 1
Total runtime: 0.025 ms
But if we invert the query to IN, you'll notice that it now decides to use the index:
EXPLAIN ANALYZE
SELECT *
FROM node
WHERE vid IN (2);
Bitmap Heap Scan on node  (cost=4.34..15.01 rows=11 width=8) (actual time=0.017..0.017 rows=1 loops=1)
  Recheck Cond: (vid = 1)
  ->  Bitmap Index Scan on x  (cost=0.00..4.33 rows=11 width=0) (actual time=0.012..0.012 rows=1 loops=1)
        Index Cond: (vid = 1)
Total runtime: 0.039 ms
Can anyone shed any light on this? Specifically, is there a way to re-write our NOT IN to work with the index (when obviously the real result set is not as simplistic as just 1 or 2)?
We are using Postgres 9.2 on CentOS 6.6
PostgreSQL is going to use an index when it makes sense. It is likely that the statistics indicate that your NOT IN would return too many tuples for an index to be effective.
You can test this by doing the following:
set enable_seqscan to false;
explain analyze .... NOT IN
set enable_seqscan to true;
explain analyze .... NOT IN
The results will tell you whether PostgreSQL is making the correct decision. If it isn't, you can adjust the column's statistics and/or the costs (random_page_cost) to get the desired behavior.
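Using the node example from the question, that test would look like this:
set enable_seqscan to false;
EXPLAIN ANALYZE SELECT * FROM node WHERE NOT vid IN (1);
set enable_seqscan to true;
EXPLAIN ANALYZE SELECT * FROM node WHERE NOT vid IN (1);
Comparing the two plans (and their actual times) shows whether forcing the index would actually help.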

How to manually update the statistics data of tables in PostgreSQL

The ANALYZE statement can be used in PostgreSQL to collect statistics about tables. However, I do not want to actually insert the data into the tables; I just need to evaluate the cost of some queries. Is there any way to manually specify a table's statistics in PostgreSQL without actually putting data into it?
I think you are muddling ANALYZE with EXPLAIN ANALYZE. They are different things.
If you want query costs and timing without applying the changes, the only real option you have is to begin a transaction, execute the query under EXPLAIN ANALYZE, and then ROLLBACK.
This still executes the query, meaning that:
CPU time and I/O are consumed
Locks are still taken and held for the duration
New rows are actually written to the tables and indexes, but are never marked visible. They are cleaned up in the next VACUUM.
You can already EXPLAIN ANALYSE a query even with no inserted data; it will help you get a feel for the execution plan.
But there's nothing like real data :)
What you can do, as a workaround, is BEGIN a transaction, INSERT some data, EXPLAIN ANALYSE your query, then ROLLBACK the transaction.
Example :
mydatabase=# BEGIN;
BEGIN
mydatabase=# INSERT INTO auth_message (user_id, message) VALUES (1, 'foobar');
INSERT 0 1
mydatabase=# EXPLAIN ANALYSE SELECT count(*) FROM auth_message;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=24.50..24.51 rows=1 width=0) (actual time=0.011..0.011 rows=1 loops=1)
   ->  Seq Scan on auth_message  (cost=0.00..21.60 rows=1160 width=0) (actual time=0.007..0.008 rows=1 loops=1)
 Total runtime: 0.042 ms
(3 rows)
mydatabase=# ROLLBACK;
ROLLBACK
mydatabase=# EXPLAIN ANALYSE SELECT count(*) FROM auth_message;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=24.50..24.51 rows=1 width=0) (actual time=0.011..0.011 rows=1 loops=1)
   ->  Seq Scan on auth_message  (cost=0.00..21.60 rows=1160 width=0) (actual time=0.009..0.009 rows=0 loops=1)
 Total runtime: 0.043 ms
(3 rows)
The first EXPLAIN ANALYSE shows the "temporary" data (rows=1 in the Seq Scan); after the ROLLBACK, the same scan sees rows=0.
This is not strictly a "mock", but at least PostgreSQL's plan (and whatever optimizations it can apply) should, IMHO, be better than with no data at all (disclaimer: purely intuitive).

Why does this SUM() function take so long in PostgreSQL?

This is my query:
SELECT SUM(amount) FROM bill WHERE name = 'peter'
There are 800K+ rows in the table. EXPLAIN ANALYZE says:
Aggregate  (cost=288570.06..288570.07 rows=1 width=4) (actual time=537213.327..537213.328 rows=1 loops=1)
  ->  Seq Scan on bill  (cost=0.00..288320.94 rows=498251 width=4) (actual time=48385.201..535941.041 rows=800947 loops=1)
        Filter: ((name)::text = 'peter'::text)
        Rows Removed by Filter: 8
Total runtime: 537213.381 ms
All rows are affected, and this is correct. But why so long? A similar query without WHERE runs way faster:
EXPLAIN ANALYZE SELECT SUM(amount) FROM bill
Aggregate  (cost=137523.31..137523.31 rows=1 width=4) (actual time=2198.663..2198.664 rows=1 loops=1)
  ->  Index Only Scan using idx_amount on bill  (cost=0.00..137274.17 rows=498268 width=4) (actual time=0.032..1223.512 rows=800955 loops=1)
        Heap Fetches: 533399
Total runtime: 2198.717 ms
I have an index on amount and an index on name. Have I missed any indexes?
ps. I managed to solve the problem just by adding a new index ON bill(name, amount). I didn't get why it helped, so let's leave the question open for some time...
Since you are searching for a specific name, you should have an index that has name as the first column, e.g. CREATE INDEX IX_bill_name ON bill( name ).
But Postgres can still opt to do a full table scan if it estimates your index to not be specific enough, i.e. if it thinks it is faster to just scan all rows and pick the matching ones instead of consulting an index and start jumping around in the table to gather the matching rows. Postgres uses a cost-based estimation technique that weights random disk reads to be more expensive than sequential reads.
For an index to actually be used in your situation, no more than roughly 10% of the rows should match what you are searching for. Since most of your rows have name = 'peter', it is actually faster to do a full table scan.
As to why the SUM without filtering runs faster: it comes down to the overall width of the table. With a WHERE clause, Postgres has to sequentially read all rows in the table so it can disregard those that do not match the filter. Without a WHERE clause, Postgres can instead read all the amounts from the index. Because the index on amount contains the amounts and pointers to the corresponding rows, but no other data from the table, it is simply less data to wade through. Based on the big difference in performance, I'd guess you have quite a lot of other fields in your table.
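That is also why the composite index mentioned in the question's postscript helps: an index on (name, amount) covers both the filter column and the summed column, so the query can be answered with an index-only scan instead of reading the wide table rows. Its definition would look something like this (the index name is just an example):
CREATE INDEX idx_bill_name_amount ON bill (name, amount);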