Right index for a timestamp field in PostgreSQL

Here is the table:
CREATE TABLE material
(
mid bigserial NOT NULL,
...
active_from timestamp without time zone,
....
CONSTRAINT material_pkey PRIMARY KEY (mid)
);
CREATE INDEX i_test_t_year
ON material
USING btree
(date_part('year'::text, active_from));
If I sort by the mid field:
select mid from material order by mid desc
"Index Only Scan Backward using material_pkey on material (cost=0.29..3573.20 rows=100927 width=8)"
but if I sort by active_from:
select * from material order by active_from desc
"Sort (cost=12067.29..12319.61 rows=100927 width=16)"
" Sort Key: active_from"
" -> Seq Scan on material (cost=0.00..1953.27 rows=100927 width=16)"
Is the index for active_from wrong? How do I create the right one to get a lower cost?

The index on date_part('year'::text, active_from) can't be used to sort by active_from. You know that sorting by that function and then by active_from gives the same order as simply sorting by active_from, but PostgreSQL doesn't. If you create the following index:
CREATE INDEX i_test_t_year ON material (active_from);
then PostgreSQL will be able to use it to answer the query:
Index Scan Backward using i_test_t_year on material (cost=0.15..74.70 rows=1770 width=16)
However, remember that PostgreSQL will only use the index if it thinks it will be faster than doing a sequential scan and then sorting, so creating the correct index doesn't guarantee that it will be used for this query.
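If you want to check that the planner can use the new index at all, one quick sanity test (session-local, for diagnostics only) is to discourage sequential scans and look at the plan again:
SET enable_seqscan = off;   -- planner hint for this session only, just for testing
EXPLAIN SELECT * FROM material ORDER BY active_from DESC;
RESET enable_seqscan;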

You need to fully understand what an index is, and I believe this will answer your question.
An index is essentially a lookup structure stored on its own, next to the table. You look at the index and it points to the rows to fetch.
If you look at your index, what you are storing is the year extracted from the "active_from" column (the 'year'::text is just the field name passed to date_part), not the timestamp itself.
So if you were to look at the index, it will look like a bunch of entries saying:
2015
2015
2014
2014
2013
etc.
They are stored as the extracted year values, not as the timestamp.
In your query you are ordering by the full timestamp value DESC.
So it just doesn't match the index as you have stored it.
If you wrote the ORDER BY in your query as "ORDER BY date_part('year'::text, active_from)" then the planner could use the index you created.
So I suggest you just add an index on "active_from" without extracting the year at all.
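To illustrate the difference, here is a small sketch based on the table above (the index name is illustrative, and the plain index duplicates the one suggested in the previous answer):
-- A query whose ORDER BY matches the existing expression index
-- (ordered by year only; rows within the same year come back in no particular order):
SELECT mid, active_from
FROM material
ORDER BY date_part('year'::text, active_from) DESC;

-- The plain index on the raw timestamp, which supports ORDER BY active_from directly:
CREATE INDEX i_material_active_from ON material (active_from);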

Related

Increase speed of query on PostgreSQL JSONB column containing billions of rows

I have a single JSONB column in a table, which looks like {key_x: value_x}. The table contains billions of rows.
I am querying for a value in it using:
SELECT data->> some_key FROM tableName WHERE data ? some_key;
I have created a GIN index on the column with the query:
CREATE INDEX data_index ON tableName USING GIN (data);
I have to use a lot of these queries, and at present, it is taking too much time.
EXPLAIN (ANALYZE, BUFFERS) SELECT data->> 'somekey' FROM tableName WHERE data ? 'some_key';
returns:
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------
Seq Scan on homeshubhgoethereumagethchaindata (cost=0.00..1885.42 rows=39 width=32) (actual time=1.911..15.488 rows=545 loops=1)
Filter: (data ? 'c2VjdXJlLWtleS3GJ+NCu6KAcCJRTC1SLiK6ZvkRZT0avMdL0KeGitPLNg=='::text)
Rows Removed by Filter: 37748
Buffers: shared hit=1397
Planning time: 3.574 ms
Execution time: 121.253 ms
The number of rows is supposed to increase in future. Is there some way to increase the speed of query?
From your question it looks like you have a single key-value pair in the jsonb column, not an array. If so, did you consider replacing the jsonb with two regular columns and a B-tree index? This will work much faster than a GIN index on the whole json data.
Or, if the jsonb column is required, you can keep it and just add a regular column for the key field and use it for searching. Sure, it means data duplication, but on the other hand you will get a speed gain.
Update: you can convert the json to columns with the following queries:
ALTER TABLE tableName
ADD COLUMN "key" VARCHAR,
ADD COLUMN "value" VARCHAR;
UPDATE tableName SET
  key   = (SELECT jsonb_object_keys(data)),           -- assumes exactly one key per row
  value = data ->> (SELECT jsonb_object_keys(data));
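A plausible follow-up, not spelled out above (the index name is illustrative): index the new key column so lookups by key can use a plain B-tree index.
CREATE INDEX idx_tablename_key ON tableName ("key");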
You should use a specific functional index on the jsonb column (not GIN):
Try this:
CREATE INDEX ON tableName((data->>'some_key'));
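Hedged note: an expression index like this is only matched by queries that filter on the same expression, for example (the value here is hypothetical); the original data ? 'some_key' existence test would still rely on the GIN index.
SELECT data->>'some_key'
FROM tableName
WHERE data->>'some_key' = 'some_value';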

What is the correct way to optimize and/or index this query?

I've got a table pings with about 15 million rows in it. I'm on postgres 9.2.4. The relevant columns it has are a foreign key monitor_id, a created_at timestamp, and a response_time that's an integer that represents milliseconds. Here is the exact structure:
     Column      |            Type             |                     Modifiers
-----------------+-----------------------------+----------------------------------------------------
 id              | integer                     | not null default nextval('pings_id_seq'::regclass)
 url             | character varying(255)     |
 monitor_id      | integer                     |
 response_status | integer                     |
 response_time   | integer                     |
 created_at      | timestamp without time zone |
 updated_at      | timestamp without time zone |
 response_body   | text                        |
Indexes:
    "pings_pkey" PRIMARY KEY, btree (id)
    "index_pings_on_created_at_and_monitor_id" btree (created_at DESC, monitor_id)
    "index_pings_on_monitor_id" btree (monitor_id)
I want to query for all the response times that are not NULL (90% won't be NULL, about 10% will be NULL), that have a specific monitor_id, and that were created in the last month. I'm doing the query with ActiveRecord, but the end result looks something like this:
SELECT "pings"."response_time"
FROM "pings"
WHERE "pings"."monitor_id" = 3
AND (created_at > '2014-03-03 20:23:07.254281'
AND response_time IS NOT NULL)
It's a pretty basic query, but it takes about 2000ms to run, which seems rather slow. I'm assuming an index would make it faster, but all the indexes I've tried aren't working, which I'm assuming means I'm not indexing properly.
When I run EXPLAIN ANALYZE, this is what I get:
Bitmap Heap Scan on pings (cost=6643.25..183652.31 rows=83343 width=4) (actual time=58.997..1736.179 rows=42063 loops=1)
Recheck Cond: (monitor_id = 3)
Rows Removed by Index Recheck: 11643313
Filter: ((response_time IS NOT NULL) AND (created_at > '2014-03-03 20:23:07.254281'::timestamp without time zone))
Rows Removed by Filter: 324834
-> Bitmap Index Scan on index_pings_on_monitor_id (cost=0.00..6622.41 rows=358471 width=0) (actual time=57.935..57.935 rows=366897 loops=1)
Index Cond: (monitor_id = 3)
So there is an index on monitor_id that is being used towards the end, but nothing else. I've tried various permutations and orders of compound indexes using monitor_id, created_at, and response_time. I've tried ordering the index by created_at in descending order. I've tried a partial index with response_time IS NOT NULL.
Nothing I've tried makes the query any faster. How would you optimize and/or index it?
Sequence of columns
Create a partial multicolumn index with the right sequence of columns. You have one:
"index_pings_on_created_at_and_monitor_id" btree (created_at DESC, monitor_id)
But the sequence of columns is not serving you well. Reverse it:
CREATE INDEX idx_pings_monitor_created ON pings (monitor_id, created_at DESC)
WHERE response_time IS NOT NULL;
The rule of thumb here is: equality first, ranges later. More about that:
Multicolumn index and performance
As discussed, the condition WHERE response_time IS NOT NULL does not buy you much. If you have other queries that could utilize this index including NULL values in response_time, drop it. Else, keep it.
You can probably also drop both other existing indexes. More about the sequence of columns in btree indexes:
Working of indexes in PostgreSQL
Covering index
If all you need from the table is response_time, this can be much faster yet - if you don't have lots of write operations on the rows of your table. Include the column in the index at the last position to allow index-only scans (making it a "covering index"):
CREATE INDEX idx_pings_monitor_created
ON pings (monitor_id, created_at DESC, response_time)
WHERE response_time IS NOT NULL; -- maybe
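Aside, not available on the Postgres 9.2 in the question: from PostgreSQL 11 on, a payload column like this could also be carried via the INCLUDE clause instead of as a key column, for example (index name illustrative):
CREATE INDEX idx_pings_monitor_created_incl
ON pings (monitor_id, created_at DESC) INCLUDE (response_time)
WHERE response_time IS NOT NULL;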
Or you could even try this ..
More radical partial index
Create a tiny helper function. Effectively a "global constant" in your db:
CREATE OR REPLACE FUNCTION f_ping_event_horizon()
RETURNS timestamp LANGUAGE sql IMMUTABLE COST 1 AS
$$SELECT '2014-03-03 0:0'::timestamp$$; -- One month in the past
Use it as condition in your index:
CREATE INDEX idx_pings_monitor_created_response_time
ON pings (monitor_id, created_at DESC, response_time)
WHERE response_time IS NOT NULL -- maybe
AND created_at > f_ping_event_horizon();
And your query looks like this now:
SELECT response_time
FROM pings
WHERE monitor_id = 3
AND response_time IS NOT NULL
AND created_at > '2014-03-03 20:23:07.254281'
AND created_at > f_ping_event_horizon();
Aside: I trimmed some noise.
The last condition seems logically redundant. Only include it if Postgres does not understand it can use the index without it; it might be necessary. The actual timestamp in the query condition must be greater than the one in the function, but that's obviously the case according to your comments.
This way we cut all the irrelevant rows and make the index much smaller. The effect degrades slowly over time. Refit the event horizon and recreate the indexes from time to time to get rid of the added weight. You could do that with a weekly cron job, for example.
When updating (recreating) the function, you need to recreate all indexes that use the function in any way. Best in the same transaction. Because the IMMUTABLE declaration for the helper function is a bit of a false promise. But Postgres only accepts immutable functions in index definitions. So we have to lie about it. More about that:
Does PostgreSQL support "accent insensitive" collations?
Why the function at all? This way, all the queries using the index can remain unchanged.
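A rough sketch of that maintenance step, using the function and index names from above (the new horizon timestamp is just an example):
BEGIN;

CREATE OR REPLACE FUNCTION f_ping_event_horizon()
  RETURNS timestamp LANGUAGE sql IMMUTABLE COST 1 AS
$$SELECT '2014-04-01 0:0'::timestamp$$;  -- horizon moved forward by roughly a month

DROP INDEX IF EXISTS idx_pings_monitor_created_response_time;
CREATE INDEX idx_pings_monitor_created_response_time
ON pings (monitor_id, created_at DESC, response_time)
WHERE response_time IS NOT NULL
AND   created_at > f_ping_event_horizon();

COMMIT;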
With all of these changes the query should be faster by orders of magnitude now. A single, continuous index-only scan is all that's needed. Can you confirm that?

Bitmap Heap Scan performance

I have a big report table and the Bitmap Heap Scan step takes more than 5 seconds.
Is there something that I can do? I added columns to the table; will reindexing the index it uses help?
I do a union and sum on the data, so I don't return 500K records to the client.
I use Postgres 9.1.
Here the explain:
Bitmap Heap Scan on foo_table (cost=24747.45..1339408.81 rows=473986 width=116) (actual time=422.210..5918.037 rows=495747 loops=1)
Recheck Cond: ((foo_id = 72) AND (date >= '2013-04-04 00:00:00'::timestamp without time zone) AND (date <= '2013-05-05 00:00:00'::timestamp without time zone))
Filter: ((foo)::text = 'foooooo'::text)
-> Bitmap Index Scan on foo_table_idx (cost=0.00..24628.96 rows=573023 width=0) (actual time=341.269..341.269 rows=723918 loops=1)
Query:
explain analyze
SELECT CAST(date as date) AS date, foo_id, ....
from foo_table
where foo_id = 72
and date >= '2013-04-04'
and date <= '2013-05-05'
and foo = 'foooooo'
Index def:
Index "public.foo_table_idx"
Column | Type
-------------+-----------------------------
foo_id | bigint
date | timestamp without time zone
btree, for table "public.external_channel_report"
Table:
foo is a text field with 4 distinct values.
foo_id is a bigint with currently 10K distinct values.
Create a composite index on (foo_id, foo, date) (in this order).
Note that if you select 500k records (and return them all to the client), this may take a long time.
Are you sure you need all 500k records on the client (rather than some kind of an aggregate or a LIMIT)?
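Spelled out as DDL (the index name is illustrative):
CREATE INDEX foo_table_foo_id_foo_date_idx
  ON foo_table (foo_id, foo, date);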
Answer to comment
Do I need the WHERE columns in the same order as the index?
The order of expressions in the WHERE clause is completely irrelevant, SQL is not a procedural language.
Fix mistakes
The timestamp column should not be named "date" for several reasons. Obviously, it's a timestamp, not a date. But more importantly, date is a reserved word in all SQL standards and a type and function name in Postgres, and it shouldn't be used as an identifier.
You should provide proper information with your question, including a complete table definition and conclusive information about existing indexes. It might be a good idea to start by reading the chapter about indexes in the manual.
The WHERE conditions on the timestamp are most probably incorrect:
and date >= '2013-04-04'
and date <= '2013-05-05'
The upper border for a timestamp column should probably be excluded:
and date >= '2013-04-04'
and date < '2013-05-05'
Index
With the multicolumn index @Quassnoi provided, your query will be much faster, since all qualifying rows can be read from one continuous data block of the index. No row is read in vain (and later disqualified), like you have it now.
But 500k rows will still take some time. Normally you have to verify visibility and fetch additional columns from the table. An index-only scan might be an option in Postgres 9.2+.
The order of columns is best this way, because the rule of thumb is: columns for equality first — then for ranges. More explanation and links in this related answer on dba.SE.
CLUSTER / pg_repack
You could further speed things up by streamlining the physical order of rows in the table according to this index, so that a minimum of blocks has to be read from the table - if you don't have other requirements that stand against it!
If you can afford to lock your table exclusively for a few seconds (at off hours, for instance) to rewrite the table and order the rows according to the index:
CLUSTER foo_table USING idx_myindex_idx;
If concurrent use is a problem, consider pg_repack, which can do the same without exclusive lock.
The effect: fewer blocks need to be read from the table and everything is pre-sorted. It's a one-time effect deteriorating over time, if you have writes on the table. So you would rerun it from time to time.
I copied and adapted the last chapter from this related answer on dba.SE.

Postgresql planner uses wrong index

Recently I upgraded PostgreSQL from version 9.1 to 9.2. The new planner uses the wrong index and the query takes too long.
Query:
explain SELECT mentions.* FROM mentions WHERE (searches_id = 7646553) ORDER BY id ASC LIMIT 1000
EXPLAIN in version 9.1:
Limit (cost=5762.99..5765.49 rows=1000 width=184)
-> Sort (cost=5762.99..5842.38 rows=31755 width=184)
Sort Key: id
-> Index Scan using mentions_searches_id_idx on mentions (cost=0.00..4021.90 rows=31755 width=184)
Index Cond: (searches_id = 7646553)
EXPLAIN in version 9.2:
Limit (cost=0.00..450245.54 rows=1000 width=244)
-> Index Scan using mentions_pk on mentions (cost=0.00..110469543.02 rows=245354 width=244)
Index Cond: (id > 0)
Filter: (searches_id = 7646553)
The correct approach is the one in 9.1, where the planner uses the index on searches_id. In 9.2 the planner does not use that index and instead filters the rows by searches_id.
When I execute the query on 9.2 without ORDER BY id, the planner uses the index on searches_id, but I need to order by id.
I also tried selecting the rows in a subquery and ordering them in an outer query, but EXPLAIN shows that the planner does the same as with the normal query.
select * from (
SELECT mentions.* FROM mentions WHERE (searches_id = 7646553))
AS q1
order by id asc
What would you recommend?
If the rows with searches_id = 7646553 are more than a few percent of the table, then the index on that column will not be used, as a table scan would be faster. Do a
select count(*) from mentions where searches_id = 7646553
and compare it to the total number of rows.
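For example, one convenient way to see both numbers at once:
SELECT
  (SELECT count(*) FROM mentions WHERE searches_id = 7646553) AS matching,
  (SELECT count(*) FROM mentions)                             AS total;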
If they are less than a few percent of the table then try
with m as (
SELECT *
FROM mentions
WHERE searches_id = 7646553
)
select *
from m
order by id asc
(From PostgreSQL v12 on, you have to use WITH ... AS MATERIALIZED.)
Or create a composite index:
create index index_name on mentions (searches_id, id)
If searches_id has low cardinality then create the same index in the opposite order
create index index_name on mentions (id, searches_id)
Run
ANALYZE mentions;
after creating an index.
For me, I had indexes, but they were all based on 3 columns, and I wasn't referencing one of those indexed columns in my query, so it was doing a seq scan across the entire thing. Possible fix: more indexes that use fewer columns (and/or switch the column order).
Another problem we saw was that we had the right index, but apparently it was an "invalid" index (possibly from a failed CONCURRENT creation). Dropping and recreating it (or reindexing it) made the planner start using it; see the sketch after the link below.
What are the available options to identify and remove the invalid objects in Postgres (ex: corrupted indexes)
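One way to look for such invalid indexes in the system catalogs (pg_index marks them with indisvalid = false, e.g. after a failed CREATE INDEX CONCURRENTLY):
SELECT c.relname AS index_name
FROM pg_class c
JOIN pg_index i ON i.indexrelid = c.oid
WHERE NOT i.indisvalid;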
See also http://www.postgresql.org/docs/8.4/static/indexes-multicolumn.html

PostgreSQL ignoring index on timestamp column

I have the following table and index created:
CREATE TABLE cdc_auth_user
(
cdc_auth_user_id bigint NOT NULL DEFAULT nextval('cdc_auth_user_id_seq'::regclass),
cdc_timestamp timestamp without time zone DEFAULT ('now'::text)::timestamp without time zone,
cdc_operation text,
id integer,
username character varying(30)
);
CREATE INDEX idx_cdc_auth_user_cdc_timestamp
ON cdc_auth_user
USING btree (cdc_timestamp);
However, when I perform a select using the timestamp field, the index is being ignored and my query takes almost 10 seconds to return:
EXPLAIN SELECT *
FROM cdc_auth_user
WHERE cdc_timestamp BETWEEN '1900/02/24 12:12:34.818'
AND '2012/02/24 12:17:45.963';
Seq Scan on cdc_auth_user (cost=0.00..1089.05 rows=30003 width=126)
Filter: ((cdc_timestamp >= '1900-02-24 12:12:34.818'::timestamp without time zone) AND (cdc_timestamp <= '2012-02-24 12:17:45.963'::timestamp without time zone))
If there are a lot of results, using the btree can be slower than just doing a table scan. The index entries are sorted, but the table rows they point to are stored in no particular order, so every match found in the btree potentially requires a separate disk seek to fetch the row. Sure, the btree itself can easily be read in order, but the results still need to be pulled from the disk. Your range spans more than a century, so it likely matches most of the table, and a sequential scan is genuinely the cheaper plan.
Clustered indexes solve this problem by ordering the actual database records according to what's in the btree, so they really are helpful for ranged queries like this. PostgreSQL doesn't maintain clustered indexes automatically, but the CLUSTER command gives a one-time equivalent; consider using it and see how it works.
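A minimal sketch of that one-time reordering with the table and index from the question (CLUSTER takes an exclusive lock, and the ordering degrades as new rows are written, so it may need to be repeated):
CLUSTER cdc_auth_user USING idx_cdc_auth_user_cdc_timestamp;
ANALYZE cdc_auth_user;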