I have the following table and index created:
CREATE TABLE cdc_auth_user
(
cdc_auth_user_id bigint NOT NULL DEFAULT nextval('cdc_auth_user_id_seq'::regclass),
cdc_timestamp timestamp without time zone DEFAULT ('now'::text)::timestamp without time zone,
cdc_operation text,
id integer,
username character varying(30)
);
CREATE INDEX idx_cdc_auth_user_cdc_timestamp
ON cdc_auth_user
USING btree (cdc_timestamp);
However, when I perform a select using the timestamp field, the index is being ignored and my query takes almost 10 seconds to return:
EXPLAIN SELECT *
FROM cdc_auth_user
WHERE cdc_timestamp BETWEEN '1900/02/24 12:12:34.818'
AND '2012/02/24 12:17:45.963';
Seq Scan on cdc_auth_user (cost=0.00..1089.05 rows=30003 width=126)
Filter: ((cdc_timestamp >= '1900-02-24 12:12:34.818'::timestamp without time zone) AND (cdc_timestamp <= '2012-02-24 12:17:45.963'::timestamp without time zone))
If a large share of the table falls inside the range, using the btree can be slower than just doing a table scan. The index itself is sorted, but the table rows are not stored in index order, so every match found in the index potentially requires a separate disk seek to fetch the actual row from the heap. With a range spanning 1900 to 2012, the planner expects nearly every row to qualify, so it chooses the sequential scan.
Clustering solves this problem by ordering the actual table rows according to what's in the btree, so it genuinely helps ranged queries like this. Consider clustering the table on that index and see how it works.
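PostgreSQL has no persistent clustered index as such; the closest equivalent is a one-time rewrite with the CLUSTER command (the ordering is not maintained for rows inserted later, so it has to be repeated periodically). A minimal sketch, using the table and index from the question:
CLUSTER cdc_auth_user USING idx_cdc_auth_user_cdc_timestamp;  -- rewrites the table in cdc_timestamp order; takes an exclusive lock
ANALYZE cdc_auth_user;                                        -- refresh planner statistics afterwards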
I am trying to speed up a delete query that appears to be very slow when compared to an identical select query:
Slow delete query:
https://explain.depesz.com/s/kkWJ
delete from processed.token_utxo
where token_utxo.output_tx_time >= (select '2022-03-01T00:00:00+00:00'::timestamp with time zone)
and token_utxo.output_tx_time < (select '2022-03-02T00:00:00+00:00'::timestamp with time zone)
and not exists (
select 1
from public.ma_tx_out
where ma_tx_out.id = token_utxo.id
)
Fast select query: https://explain.depesz.com/s/Bp8q
select * from processed.token_utxo
where token_utxo.output_tx_time >= (select '2022-03-01T00:00:00+00:00'::timestamp with time zone)
and token_utxo.output_tx_time < (select '2022-03-02T00:00:00+00:00'::timestamp with time zone)
and not exists (
select 1
from public.ma_tx_out
where ma_tx_out.id = token_utxo.id
)
Table reference:
create table processed.token_utxo (
id bigint,
tx_out_id bigint,
token_id bigint,
output_tx_id bigint,
output_tx_index int,
output_tx_time timestamp,
input_tx_id bigint,
input_tx_time timestamp,
address varchar,
address_has_script boolean,
payment_cred bytea,
redeemer_id bigint,
stake_address_id bigint,
quantity numeric,
primary key (id)
);
create index token_utxo_output_tx_id on processed.token_utxo using btree (output_tx_id);
create index token_utxo_input_tx_id on processed.token_utxo using btree (input_tx_id);
create index token_utxo_output_tx_time on processed.token_utxo using btree (output_tx_time);
create index token_utxo_input_tx_time on processed.token_utxo using btree (input_tx_time);
create index token_utxo_address on processed.token_utxo using btree (address);
create index token_utxo_token_id on processed.token_utxo using btree (token_id);
Version: PostgreSQL 13.6 on x86_64-pc-linux-gnu, compiled by Debian clang version 12.0.1, 64-bit
Postgres chooses different query plans which results in drastically different performance. I'm not familiar enough with Postgres to understand why it makes this decision. Hoping there is a simple way to guide it towards a better plan here.
Why it comes up with different plans is relatively easy to explain. First, the DELETE cannot use parallel queries, so the plan which is believed to be more parallel-friendly is favored for the SELECT but not for the DELETE. Maybe that restriction will be eased in some future version. Second, the DELETE cannot use an index-only scan on ma_tx_out_pkey the way the pure SELECT can; it would use a plain index scan instead. This too makes the faster plan look less attractive for the DELETE than it does for the SELECT. These two factors combined are apparently enough to get it to switch plans. We have already seen evidence of the first factor. You can probably verify the second factor by setting enable_seqscan to off and seeing what plan the DELETE chooses then, and, if it is the nested loop, verifying that the last index scan is not index-only.
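One way to run that check safely (a sketch using the DELETE from the question; note that EXPLAIN ANALYZE actually executes the statement, so wrap it in a transaction and roll back):
BEGIN;
SET LOCAL enable_seqscan = off;  -- discourage the seq scan for this transaction only
EXPLAIN (ANALYZE, BUFFERS)
delete from processed.token_utxo
where token_utxo.output_tx_time >= '2022-03-01T00:00:00+00:00'::timestamp with time zone
  and token_utxo.output_tx_time < '2022-03-02T00:00:00+00:00'::timestamp with time zone
  and not exists (select 1 from public.ma_tx_out where ma_tx_out.id = token_utxo.id);
ROLLBACK;  -- nothing is actually deleted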
But of course the only reason those factors can make the decision between plans differ is because the plan estimates were so close together in the first place, despite being so different in actual performance. So what explains that false closeness? That is harder to determine with the info we have (it would be better if you had done EXPLAIN (ANALYZE, BUFFERS) with track_io_timing turned on).
One possibility is that the difference in actual performance is illusory. Maybe the nested loop is so fast only because all the data it needs is in memory, and the only reason for that is that you executed the same query repeatedly with the same parameters as part of your testing. Is it still so fast if you change the timestamps params, or clear both the PostgreSQL buffers and the file cache between runs?
Another possibility is that your system is just poorly tuned. For example, if your data is on SSD, then the default setting of random_page_cost is probably much too high. 1.1 might be a more reasonable setting than 4.
Finally, your setting of work_mem is probably way too low. That results in the hash using an extravagant number of batches: 8192. How much this affects performance is hard to predict, as it depends on your hardware, your kernel, your filesystem, etc. (which is maybe why the planner does not try to take it into account). It is pretty easy to test: you can increase the setting of work_mem locally (in your session) and see if it changes the speed.
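Both settings can be tried out locally in a session before changing the server configuration; the values here are only illustrative starting points:
SET random_page_cost = 1.1;  -- plausible for SSD storage; the default of 4 assumes spinning disks
SET work_mem = '256MB';      -- reduces the number of hash batches; adjust to your available RAM
-- then re-run EXPLAIN (ANALYZE, BUFFERS) on the DELETE as shown above
RESET random_page_cost;
RESET work_mem;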
Much of this analysis is possible only based on the fact that your delete doesn't actually find any rows to delete. If it were deleting rows, that would make the situation far more complex.
I am not exactly sure what triggers the switch of query plan between SELECT and DELETE, but I do know this: the subqueries returning a constant value are actively unhelpful. Use instead:
SELECT *
FROM processed.token_utxo t
WHERE t.output_tx_time >= '2022-03-01T00:00:00+00:00'::timestamptz -- no subquery
AND t.output_tx_time < '2022-03-02T00:00:00+00:00'::timestamptz -- no subquery
AND NOT EXISTS (SELECT FROM public.ma_tx_out m WHERE m.id = t.id)
DELETE FROM processed.token_utxo t
WHERE t.output_tx_time >= '2022-03-01T00:00:00+00:00'::timestamptz
AND t.output_tx_time < '2022-03-02T00:00:00+00:00'::timestamptz
AND NOT EXISTS (SELECT FROM public.ma_tx_out m WHERE m.id = t.id)
As you can see in the query plan, Postgres comes up with a generic plan for yet unknown timestamps:
Index Cond: ((output_tx_time >= $0) AND (output_tx_time < $1))
My fixed query allows Postgres to devise a plan for the actual given constant values. If your column statistics are up to date, this allows for more optimization according to the number of rows expected to qualify for that time interval. The query plan will change to:
Index Cond: ((output_tx_time >= '2022-03-01T00:00:00+00:00'::timestamp with time zone) AND (output_tx_time < '2022-03-02T00:00:00+00:00'::timestamp with time zone))
And you will see different row estimates, that may result in a different query plan.
Of course, DELETE cannot have the exact same plan. Besides the obvious difference that DELETE takes row locks and writes to the rows it removes, it also cannot (currently, up to at least Postgres 15) use parallelism, and it cannot use index-only scans. See:
Delete using another table with index
So you'll see an index scan where SELECT might use an index-only scan.
I'm currently working on a project collecting a very large amount of data from a network of wireless modems out in the field. We have a table 'readings' that looks like this:
CREATE TABLE public.readings (
id INTEGER PRIMARY KEY NOT NULL DEFAULT nextval('readings_id_seq'::regclass),
created TIMESTAMP WITHOUT TIME ZONE NOT NULL DEFAULT now(),
timestamp TIMESTAMP WITHOUT TIME ZONE NOT NULL,
modem_serial CHARACTER VARYING(255) NOT NULL,
channel1 INTEGER NOT NULL,
channel2 INTEGER NOT NULL,
signal_strength INTEGER,
battery INTEGER,
excluded BOOLEAN NOT NULL DEFAULT false
);
CREATE UNIQUE INDEX _timestamp_modemserial_uc ON readings USING BTREE (timestamp, modem_serial);
CREATE INDEX ix_readings_timestamp ON readings USING BTREE (timestamp);
CREATE INDEX ix_readings_modem_serial ON readings USING BTREE (modem_serial);
It's important for the integrity of the system that we never have two readings from the same modem with the same timestamp, hence the unique index.
Our challenge at the moment is to find a performant way of inserting readings. We often have to insert millions of rows as we bring in historical data, and when adding to an existing base of 100 million plus readings, this can get kind of slow.
Our current approach is to import batches of 10,000 readings into a temporary_readings table, which is essentially an unindexed copy of readings. We then run the following SQL to merge it into the main table and remove duplicates:
INSERT INTO readings (created, timestamp, modem_serial, channel1, channel2, signal_strength, battery)
SELECT DISTINCT ON (timestamp, modem_serial) created, timestamp, modem_serial, channel1, channel2, signal_strength, battery
FROM temporary_readings
WHERE NOT EXISTS(
SELECT * FROM readings
WHERE timestamp=temporary_readings.timestamp
AND modem_serial=temporary_readings.modem_serial
)
ORDER BY timestamp, modem_serial ASC;
This works well, but takes ~20 seconds per 10,000 row block to insert. My question is twofold:
Is this the best way to approach the problem? I'm relatively new to projects with these sorts of performance demands, so I'm curious to know if there are better solutions.
What steps can I take to speed up the insert process?
Thanks in advance!
Your query idea is okay. I would try timing it for 100,000 rows in the batch, to start to get an idea of an optimal batch size.
However, the distinct on is slowing things down. Here are two ideas.
The first is to assume that duplicates in batches are quite rare. If this is true, try inserting the data without the distinct on. If that fails, then run the code again with the distinct on. This complicates the insertion logic, but it might make the average insertion much shorter.
The second is to build an index on temporary_readings(timestamp, modem_serial) (not a unique index). Postgres will take advantage of this index for the insertion logic -- and sometimes building an index and using it is faster than alternative execution plans. If this does work, you might try larger batch sizes.
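For the second idea, that could look roughly like this (the index name is just illustrative), run after each batch is loaded into the staging table:
CREATE INDEX temporary_readings_ts_modem_idx
    ON temporary_readings (timestamp, modem_serial);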
There is a third solution, which is to use ON CONFLICT. That allows the insertion itself to ignore duplicate values. This is only available in Postgres 9.5 and later, though.
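On 9.5+, that could look like the sketch below. ON CONFLICT ... DO NOTHING skips rows that would violate the unique index, both against rows already in readings and against duplicates earlier in the same batch, so the DISTINCT ON and NOT EXISTS parts are no longer needed:
INSERT INTO readings (created, timestamp, modem_serial, channel1, channel2, signal_strength, battery)
SELECT created, timestamp, modem_serial, channel1, channel2, signal_strength, battery
FROM temporary_readings
ON CONFLICT (timestamp, modem_serial) DO NOTHING;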
Adding to a table that already contains 100 million indexed records will be slow no matter what! You can probably speed things up somewhat by taking a fresh look at your indexes.
CREATE UNIQUE INDEX _timestamp_modemserial_uc ON readings USING BTREE (timestamp, modem_serial);
CREATE INDEX ix_readings_timestamp ON readings USING BTREE (timestamp);
CREATE INDEX ix_readings_modem_serial ON readings USING BTREE (modem_serial);
At the moment you have three indexes but they are on the same combination of columns. Can't you manage with just the unique index?
I don't know what your other queries are like but your WHERE NOT EXISTS query can make use of this unique index.
If you have queries whose WHERE clause filters only on the modem_serial field, your unique index is unlikely to be used. However, if you flip the columns in that index, it will be!
CREATE UNIQUE INDEX _modemserial_timestamp_uc ON readings USING BTREE (modem_serial, timestamp);
To quote from the manual:
A multicolumn B-tree index can be used with query conditions that involve any subset of the index's columns, but the index is most efficient when there are constraints on the leading (leftmost) columns.
The order of the columns in the index matters.
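Put together, and assuming the flipped index suggested above, the cleanup might look like this sketch (create the new unique index before dropping the old one so uniqueness stays enforced, and keep ix_readings_timestamp if other queries filter on timestamp alone):
CREATE UNIQUE INDEX _modemserial_timestamp_uc
    ON readings USING BTREE (modem_serial, timestamp);
DROP INDEX _timestamp_modemserial_uc;
DROP INDEX ix_readings_modem_serial;  -- now redundant: modem_serial leads the new unique index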
Here is the table:
CREATE TABLE material
(
mid bigserial NOT NULL,
...
active_from timestamp without time zone,
....
CONSTRAINT material_pkey PRIMARY KEY (mid)
)
CREATE INDEX i_test_t_year
ON material
USING btree
(date_part('year'::text, active_from));
If I sort by the mid field:
select mid from material order by mid desc
"Index Only Scan Backward using material_pkey on material (cost=0.29..3573.20 rows=100927 width=8)"
But if I use active_from for sorting:
select * from material order by active_from desc
"Sort (cost=12067.29..12319.61 rows=100927 width=16)"
" Sort Key: active_from"
" -> Seq Scan on material (cost=0.00..1953.27 rows=100927 width=16)"
Is the index for active_from wrong? How do I make the right one to get a lower cost?
The index on date_part('year'::text, active_from) can't be used to sort by active_from; you know that sorting by that function and then by active_from gives the same order as simply sorting by active_from but postgresql doesn't. If you create the following index:
CREATE INDEX i_test_t_year ON material (active_from);
then postgresql will be able to use it to answer the query:
Index Scan Backward using i_test_t_year on material (cost=0.15..74.70 rows=1770 width=16)
However, remember that postgresql will only use the index if it thinks it will be faster than doing a sequential scan then sorting, so creating the correct index doesn't guarantee that it will be used for this query.
You need to understand how an index works, and I believe this will answer your question.
An index is a separate lookup structure stored alongside the table. Postgres looks up values in the index, and each index entry points to the table row to fetch.
If you look at your index, what it stores is the 'year' part extracted from the "active_from" column by date_part, not the column itself.
So if you were to look at the index, it would be a bunch of entries saying:
2015
2015
2014
2014
2013
etc.
Only the year is stored, not the full timestamp (the ::text in the index definition casts the string literal 'year', which is just the field-name argument to date_part; the indexed value itself is the numeric result).
In your query you are ordering by the full timestamp value, descending.
So the ORDER BY simply doesn't match what the index stores.
If you wrote the ORDER BY in your query as order by date_part('year'::text, active_from), the planner could use the index you put there, but it would only order by year.
So I suggest you just add the index on "active_from" without extracting the year at all.
I've got a table pings with about 15 million rows in it. I'm on postgres 9.2.4. The relevant columns it has are a foreign key monitor_id, a created_at timestamp, and a response_time that's an integer that represents milliseconds. Here is the exact structure:
      Column      |            Type             |                     Modifiers
------------------+-----------------------------+-----------------------------------------------------
 id               | integer                     | not null default nextval('pings_id_seq'::regclass)
 url              | character varying(255)      |
 monitor_id       | integer                     |
 response_status  | integer                     |
 response_time    | integer                     |
 created_at       | timestamp without time zone |
 updated_at       | timestamp without time zone |
 response_body    | text                        |
Indexes:
"pings_pkey" PRIMARY KEY, btree (id)
"index_pings_on_created_at_and_monitor_id" btree (created_at DESC, monitor_id)
"index_pings_on_monitor_id" btree (monitor_id)
I want to query for all the response times that are not NULL (90% won't be NULL, about 10% will be NULL), that have a specific monitor_id, and that were created in the last month. I'm doing the query with ActiveRecord, but the end result looks something like this:
SELECT "pings"."response_time"
FROM "pings"
WHERE "pings"."monitor_id" = 3
AND (created_at > '2014-03-03 20:23:07.254281'
AND response_time IS NOT NULL)
It's a pretty basic query, but it takes about 2000ms to run, which seems rather slow. I'm assuming an index would make it faster, but all the indexes I've tried aren't working, which I'm assuming means I'm not indexing properly.
When I run EXPLAIN ANALYZE, this is what I get:
Bitmap Heap Scan on pings (cost=6643.25..183652.31 rows=83343 width=4) (actual time=58.997..1736.179 rows=42063 loops=1)
Recheck Cond: (monitor_id = 3)
Rows Removed by Index Recheck: 11643313
Filter: ((response_time IS NOT NULL) AND (created_at > '2014-03-03 20:23:07.254281'::timestamp without time zone))
Rows Removed by Filter: 324834
-> Bitmap Index Scan on index_pings_on_monitor_id (cost=0.00..6622.41 rows=358471 width=0) (actual time=57.935..57.935 rows=366897 loops=1)
Index Cond: (monitor_id = 3)
So there is an index on monitor_id that is being used towards the end, but nothing else. I've tried various permutations and orders of compound indexes using monitor_id, created_at, and response_time. I've tried ordering the index by created_at in descending order. I've tried a partial index with response_time IS NOT NULL.
Nothing I've tried makes the query any faster. How would you optimize and/or index it?
Sequence of columns
Create a partial multicolumn index with the right sequence of columns. You have one:
"index_pings_on_created_at_and_monitor_id" btree (created_at DESC, monitor_id)
But the sequence of columns is not serving you well. Reverse it:
CREATE INDEX idx_pings_monitor_created ON pings (monitor_id, created_at DESC)
WHERE response_time IS NOT NULL;
The rule of thumb here is: equality first, ranges later. More about that:
Multicolumn index and performance
As discussed, the condition WHERE response_time IS NOT NULL does not buy you much. If you have other queries that could utilize this index including NULL values in response_time, drop it. Else, keep it.
You can probably also drop both other existing indexes. More about the sequence of columns in btree indexes:
Working of indexes in PostgreSQL
Covering index
If all you need from the table is response_time, this can be much faster yet - if you don't have lots of write operations on the rows of your table. Include the column in the index at the last position to allow index-only scans (making it a "covering index"):
CREATE INDEX idx_pings_monitor_created
ON pings (monitor_id, created_at DESC, response_time)
WHERE response_time IS NOT NULL; -- maybe
Or you could even try this:
More radical partial index
Create a tiny helper function. Effectively a "global constant" in your db:
CREATE OR REPLACE FUNCTION f_ping_event_horizon()
RETURNS timestamp LANGUAGE sql IMMUTABLE COST 1 AS
$$SELECT '2014-03-03 0:0'::timestamp$$; -- One month in the past
Use it as condition in your index:
CREATE INDEX idx_pings_monitor_created_response_time
ON pings (monitor_id, created_at DESC, response_time)
WHERE response_time IS NOT NULL -- maybe
AND created_at > f_ping_event_horizon();
And your query looks like this now:
SELECT response_time
FROM pings
WHERE monitor_id = 3
AND response_time IS NOT NULL
AND created_at > '2014-03-03 20:23:07.254281'
AND created_at > f_ping_event_horizon();
Aside: I trimmed some noise.
The last condition seems logically redundant. Only include it if Postgres does not understand that it can use the index without it; it might be necessary. The actual timestamp in the condition must be later than the one in the function, but that's obviously the case according to your comments.
This way we cut all the irrelevant rows and make the index much smaller. The effect degrades slowly over time. Refit the event horizon and recreate indexes from time to time to get rid of added weight. You could do with a weekly cron job, for example.
When updating (recreating) the function, you need to recreate all indexes that use the function in any way. Best in the same transaction. Because the IMMUTABLE declaration for the helper function is a bit of a false promise. But Postgres only accepts immutable functions in index definitions. So we have to lie about it. More about that:
Does PostgreSQL support "accent insensitive" collations?
Why the function at all? This way, all the queries using the index can remain unchanged.
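A sketch of that periodic refit, using the function and index names from above (the new horizon timestamp is only an example; doing both steps in one transaction keeps the index consistent with the function):
BEGIN;
CREATE OR REPLACE FUNCTION f_ping_event_horizon()
  RETURNS timestamp LANGUAGE sql IMMUTABLE COST 1 AS
$$SELECT '2014-04-03 0:0'::timestamp$$;  -- moved forward: one month before "now"

DROP INDEX idx_pings_monitor_created_response_time;
CREATE INDEX idx_pings_monitor_created_response_time
    ON pings (monitor_id, created_at DESC, response_time)
    WHERE response_time IS NOT NULL
    AND created_at > f_ping_event_horizon();
COMMIT;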
With all of these changes the query should be faster by orders of magnitude now. A single, continuous index-only scan is all that's needed. Can you confirm that?
I have a big report table, and the Bitmap Heap Scan step takes more than 5 seconds.
Is there something I can do? I have added columns to the table; would reindexing the index it uses help?
I do union and sum on the data, so I don't return 500K records to the client.
I use postgres 9.1.
Here is the explain:
Bitmap Heap Scan on foo_table (cost=24747.45..1339408.81 rows=473986 width=116) (actual time=422.210..5918.037 rows=495747 loops=1)
Recheck Cond: ((foo_id = 72) AND (date >= '2013-04-04 00:00:00'::timestamp without time zone) AND (date <= '2013-05-05 00:00:00'::timestamp without time zone))
Filter: ((foo)::text = 'foooooo'::text)
-> Bitmap Index Scan on foo_table_idx (cost=0.00..24628.96 rows=573023 width=0) (actual time=341.269..341.269 rows=723918 loops=1)
Query:
explain analyze
SELECT CAST(date as date) AS date, foo_id, ....
from foo_table
where foo_id = 72
and date >= '2013-04-04'
and date <= '2013-05-05'
and foo = 'foooooo'
Index def:
Index "public.foo_table_idx"
Column | Type
-------------+-----------------------------
foo_id | bigint
date | timestamp without time zone
btree, for table "public.external_channel_report"
Table:
foo is a text field with 4 different values.
foo_id is a bigint with currently 10K distinct values.
Create a composite index on (foo_id, foo, date) (in this order).
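For example (a sketch; the index name is only illustrative):
CREATE INDEX foo_table_foo_id_foo_date_idx
    ON foo_table (foo_id, foo, date);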
Note that if you select 500k records (and return them all to the client), this may take a long time.
Are you sure you need all 500k records on the client (rather than some kind of an aggregate or a LIMIT)?
Answer to comment
Do I need the WHERE columns in the same order as the index?
The order of expressions in the WHERE clause is completely irrelevant; SQL is not a procedural language.
Fix mistakes
The timestamp column should not be named "date", for several reasons. Obviously, it's a timestamp, not a date. More importantly, date is a reserved word in standard SQL and a type and function name in Postgres, and shouldn't be used as an identifier.
You should provide proper information with your question, including a complete table definition and conclusive information about existing indexes. It might be a good idea to start by reading the chapter about indexes in the manual.
The WHERE conditions on the timestamp are most probably incorrect:
and date >= '2013-04-04'
and date <= '2013-05-05'
The upper border for a timestamp column should probably be excluded:
and date >= '2013-04-04'
and date < '2013-05-05'
Index
With the multicolumn index @Quassnoi provided, your query will be much faster, since all qualifying rows can be read from one continuous data block of the index. No row is read in vain (and later disqualified), like you have it now.
But 500k rows will still take some time. Normally you have to verify visibility and fetch additional columns from the table. An index-only scan might be an option in Postgres 9.2+.
The order of columns is best this way, because the rule of thumb is: columns for equality first — then for ranges. More explanation and links in this related answer on dba.SE.
CLUSTER / pg_repack
You could further speed things up by streamlining the physical order of rows in the table according to this index, so that a minimum of blocks has to be read from the table - if you don't have other requirements that stand against it. If you can afford to lock the table exclusively for a few seconds (at off hours, for instance) to rewrite it and order the rows according to the index:
CLUSTER foo_table USING idx_myindex_idx;
If concurrent use is a problem, consider pg_repack, which can do the same without exclusive lock.
The effect: fewer blocks need to be read from the table and everything is pre-sorted. It's a one-time effect deteriorating over time, if you have writes on the table. So you would rerun it from time to time.
I copied and adapted the last chapter from this related answer on dba.SE.