PostgreSQL create index on JSONB[]

Consider a table defined as follows:
CREATE TABLE test (
    id int4 NOT NULL,
    tag_counts jsonb[] NOT NULL DEFAULT ARRAY[]::jsonb[]
);
INSERT INTO test (id, tag_counts)
VALUES (1, ARRAY['{"type":1, "count":4}', '{"type":2, "count":10}']::jsonb[]);
How can I create an index on the json key "type", and how can I query on it?
Edit: Previously, there were no indexes on json keys and select queries used an unnest operation as shown below:
select * from (SELECT unnest(tag_counts) as tc
FROM public.test) as t
where tc->'type' = '2';
The problem is, if the table has a large number of rows, the above query will not only require a full table scan, but will also have to filter through each jsonb array.

There is a way to index this; I am not sure how fast it will be.
If that were a "regular" jsonb column, you could use a condition like where tag_counts @> '[{"type": 2}]', which can use a GIN index on the column.
You can use that operator if you convert the array to a "plain" jsonb value:
select *
from test
where to_jsonb(tag_counts) @> '[{"type": 2}]'
Unfortunately, to_jsonb() is not marked as immutable (I guess because of potential timestamp conversion in there), which is a requirement if you want to use an expression in an index.
But for your data, this is indeed immutable, so we can create a little wrapper function:
create function as_jsonb(p_input jsonb[])
returns jsonb
as
$$
select to_jsonb(p_input);
$$
language sql
immutable;
And with that function we can create an index:
create index on test using gin ( as_jsonb(tag_counts) jsonb_path_ops);
You will need to use that function in your query:
select *
from test
where as_jsonb(tag_counts) @> '[{"type": 2}]'
On a table with a million rows, I get the following execution plan:
Bitmap Heap Scan on stuff.test (cost=1102.62..67028.01 rows=118531 width=252) (actual time=15.145..684.062 rows=147293 loops=1)
Output: id, tag_counts
Recheck Cond: (as_jsonb(test.tag_counts) @> '[{"type": 2}]'::jsonb)
Heap Blocks: exact=25455
Buffers: shared hit=25486
-> Bitmap Index Scan on ix_test (cost=0.00..1072.99 rows=118531 width=0) (actual time=12.347..12.356 rows=147293 loops=1)
Index Cond: (as_jsonb(test.tag_counts) @> '[{"type": 2}]'::jsonb)
Buffers: shared hit=31
Planning:
Buffers: shared hit=23
Planning Time: 0.444 ms
Execution Time: 690.160 ms
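If you still need the individual matching elements (as in the original unnest query), the indexed row filter can be combined with unnest; a rough sketch, reusing the as_jsonb() function from above:
-- let the GIN index narrow down the rows first, then unnest only those rows
select t.id, tc
from test t
cross join lateral unnest(t.tag_counts) as tc
where as_jsonb(t.tag_counts) @> '[{"type": 2}]'
  and tc->'type' = '2';
The index only helps the as_jsonb(...) @> ... condition; the tc->'type' check then picks the matching element(s) inside each qualifying row.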

Related

PostgreSQL not using any index in regex search

I have the following SQL statement to filter data with a regex search:
select * from others.table
where vintage ~* '(17|18|19|20)[0-9]{2,}'
After some research, I found that I need to create a GIN/GiST index for better performance:
create index idx_vintage_gist on others.table using gist (vintage gist_trgm_ops);
create index idx_vintage_gin on others.table using gin (vintage gin_trgm_ops);
create index idx_vintage_varchar on others.table using btree (vintage varchar_pattern_ops);
Looking at the explain plan, it is not using any index but doing a seq scan:
Seq Scan on table t (cost=0.00..45412.25 rows=1070800 width=91) (actual time=0.038..8518.830 rows=1075980 loops=1)
Filter: (vintage ~* '(17|18|19|20)[0-9]{2,}'::text)
Rows Removed by Filter: 25400
Planning Time: 0.481 ms
Execution Time: 8767.998 ms
There are 1101380 rows in total in the table.
My question is why is it not using any index for the regex search?
(Answer was in comments; posting as community wiki.)
From the execution plan, 1070800 rows were expected to be returned, which is 1070800/1101380 ≈ 97.2% of the table. With so much of the table being in the results, using an index wouldn't be advantageous, so a sequential scan is performed.
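You can verify that by measuring the selectivity of the pattern directly; a quick sketch (table and column names taken from the question, with the table name quoted because it is anonymized to a reserved word):
select count(*) filter (where vintage ~* '(17|18|19|20)[0-9]{2,}') as matching,
       count(*) as total
from others."table";
With almost every row matching, reading the whole table sequentially is cheaper than going through any of the indexes.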

PostgreSQL index for jsonb @> search

I have the following query:
SELECT "survey_results".* FROM "survey_results" WHERE (raw @> '{"client":{"token":"test_token"}}');
EXPLAIN ANALYZE returns following results:
Seq Scan on survey_results (cost=0.00..352.68 rows=2 width=2039) (actual time=132.942..132.943 rows=1 loops=1)
Filter: (raw @> '{"client": {"token": "test_token"}}'::jsonb)
Rows Removed by Filter: 2133
Planning time: 0.157 ms
Execution time: 132.991 ms
(5 rows)
I want to add an index on the client key inside the raw field so the search will be faster, but I don't know how to do it. When I add an index on the whole raw column like this:
CREATE INDEX test_index on survey_results USING GIN (raw);
then everything works as expected. I don't want to index the whole raw column, because I have a lot of records in the database and do not want to increase its size.
If you are using JSON objects as in the example, then you can create an index on just the client key, like this:
CREATE INDEX test_client_index ON survey_results USING GIN ((raw->'client'));
But since you are using the @> operator in your query, it might make more sense in your case to create an index only for that operator, like this:
CREATE INDEX test_client_index ON survey_results USING GIN (raw jsonb_path_ops);
See the PostgreSQL documentation on jsonb indexing for more details.
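Note that the expression index from the first option is only used when the query filters on that same expression; a minimal sketch of a matching query, assuming that index exists:
-- containment check on the client sub-object, supported by the GIN index on (raw->'client')
SELECT survey_results.*
FROM survey_results
WHERE raw->'client' @> '{"token": "test_token"}';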

Creating an index for a JSON attribute stored in a jsonb column which contains an array of JSON in Postgres

I have been using Postgres for the past few days and came up with a requirement for which I am trying to find a solution.
I have a table like this:
CREATE TABLE location (id integer, details jsonb);
INSERT INTO location (id,details)
VALUES (1,'[{"Slno" : 1, "value" : "Bangalore"},
{"Slno" : 2, "value" : "Delhi"}]');
INSERT INTO location (id,details)
VALUES (2,'[{"Slno" : 5, "value" : "Chennai"}]');
From the above queries you can see that there is a jsonb column named details which stores an array of JSON objects as its value.
The data is stored like this because of a requirement.
I want to create an index on the Slno property present in the jsonb column values.
Can someone help me find a solution for this?
It is not an index on a single JSON property but on the whole attribute; still, you can use it to perform the search you want:
CREATE INDEX locind ON location USING GIN (details jsonb_path_ops);
If you need the index for operations other than @> (contains), omit jsonb_path_ops; note that this will make the index larger and slower.
Now you can search for the property using the index:
EXPLAIN (VERBOSE) SELECT id FROM location WHERE details @> '[{"Slno" : 1}]';
QUERY PLAN
-------------------------------------------------------------------------
Bitmap Heap Scan on laurenz.location (cost=8.00..12.01 rows=1 width=4)
Output: id
Recheck Cond: (location.details @> '[{"Slno": 1}]'::jsonb)
-> Bitmap Index Scan on locind (cost=0.00..8.00 rows=1 width=0)
Index Cond: (location.details @> '[{"Slno": 1}]'::jsonb)
(5 rows)

Why is PostgreSQL not using *just* the covering index in this query depending on the contents of its IN() clause?

I have a table with a covering index that should respond to a query using just the index, without checking the table at all. Postgres does, in fact, do that, if the IN() clause has 1 or a few elements in it. However, if the IN clause has lots of elements, it seems like it's doing the search on the index, and then going to the table and re-checking the conditions...
I can't figure out why Postgres would do that. It can either serve the query straight from the index or it can't; why would it go to the table if it (in theory) doesn't have anything else to add?
The table:
CREATE TABLE phone_numbers
(
id serial NOT NULL,
phone_number character varying,
hashed_phone_number character varying,
user_id integer,
created_at timestamp without time zone,
updated_at timestamp without time zone,
ghost boolean DEFAULT false,
CONSTRAINT phone_numbers_pkey PRIMARY KEY (id)
)
WITH (
OIDS=FALSE
);
CREATE INDEX index_phone_numbers_covering_hashed_ghost_and_user
ON phone_numbers
USING btree
(hashed_phone_number COLLATE pg_catalog."default", ghost, user_id);
The query I'm running is:
SELECT "phone_numbers"."user_id"
FROM "phone_numbers"
WHERE "phone_numbers"."hashed_phone_number" IN (*several numbers*)
AND "phone_numbers"."ghost" = 'f'
As you can see, the index has all the fields it needs to reply to that query.
And if I have only one or a few numbers in the IN clause, it does:
1 number:
Index Scan using index_phone_numbers_on_hashed_phone_number on phone_numbers (cost=0.41..8.43 rows=1 width=4)
Index Cond: ((hashed_phone_number)::text = 'bebd43a6eb29b2fda3bcb63dcc7ffaf5433e78660ccd1a495c1180a3eaaf6b6a'::text)
Filter: (NOT ghost)
3 numbers:
Index Only Scan using index_phone_numbers_covering_hashed_ghost_and_user on phone_numbers (cost=0.42..17.29 rows=1 width=4)
Index Cond: ((hashed_phone_number = ANY ('{8228a8116f1fdb12e243102cb85ecd859ebf7873d9332dce5f1343a481ec72e8,43ddeebdca2ea829d468d5debc84d475c8322cf4bf6edca286c918b04216387e,1578bf773eb6eb8a9b57a130922a28c9c91f1bda67202ef5936b39630ca4cfe4}'::text[])) AND (...)
Filter: (NOT ghost)
However, when I have a lot of numbers in the IN clause, Postgres is using the Index, but then hitting the table, and I don't know why:
Bitmap Heap Scan on phone_numbers (cost=926.59..1255.81 rows=106 width=4)
Recheck Cond: ((hashed_phone_number)::text = ANY ('{b6459ce58f21d99c462b132cce7adc9ea947fa522a3849321e9fb65893006a5e,8228a8116f1fdb12e243102cb85ecd859ebf7873d9332dce5f1343a481ec72e8,ab3554acc1f287bb2e22ff20bb855e19a4177ef552676689d217dbb2a1a6177b,7ec9f58 (...)
Filter: (NOT ghost)
-> Bitmap Index Scan on index_phone_numbers_covering_hashed_ghost_and_user (cost=0.00..926.56 rows=106 width=0)
Index Cond: (((hashed_phone_number)::text = ANY ('{b6459ce58f21d99c462b132cce7adc9ea947fa522a3849321e9fb65893006a5e,8228a8116f1fdb12e243102cb85ecd859ebf7873d9332dce5f1343a481ec72e8,ab3554acc1f287bb2e22ff20bb855e19a4177ef552676689d217dbb2a1a6177b,7e (...)
This is currently making this query, which is looking for 250 records in a table with 50k total rows, about twice as slow as a similar query on another table, which looks for 250 records in a table with 5 million rows, which doesn't make much sense.
Any ideas what could be happening, and whether I can do anything to improve this?
UPDATE: Changing the order of the columns in the covering index to have ghost first and then hashed_phone_number also doesn't solve it:
Bitmap Heap Scan on phone_numbers (cost=926.59..1255.81 rows=106 width=4)
Recheck Cond: ((hashed_phone_number)::text = ANY ('{b6459ce58f21d99c462b132cce7adc9ea947fa522a3849321e9fb65893006a5e,8228a8116f1fdb12e243102cb85ecd859ebf7873d9332dce5f1343a481ec72e8,ab3554acc1f287bb2e22ff20bb855e19a4177ef552676689d217dbb2a1a6177b,7ec9f58 (...)
Filter: (NOT ghost)
-> Bitmap Index Scan on index_phone_numbers_covering_ghost_hashed_and_user (cost=0.00..926.56 rows=106 width=0)
Index Cond: ((ghost = false) AND ((hashed_phone_number)::text = ANY ('{b6459ce58f21d99c462b132cce7adc9ea947fa522a3849321e9fb65893006a5e,8228a8116f1fdb12e243102cb85ecd859ebf7873d9332dce5f1343a481ec72e8,ab3554acc1f287bb2e22ff20bb855e19a4177ef55267668 (...)
The choice of indexes is based on what the optimizer says is the best solution for the query. Postgres is trying really hard with your index, but it is not the best index for the query.
The best index has ghost first:
CREATE INDEX index_phone_numbers_covering_hashed_ghost_and_user
ON phone_numbers
USING btree
(ghost, hashed_phone_number COLLATE pg_catalog."default", user_id);
I happen to think that MySQL documentation does a good job of explaining how composite indexes are used.
Essentially, what is happening is that Postgres needs to do an index seek for every element of the IN list. This may be compounded by the use of strings, because collations/encodings affect the comparisons. Eventually, Postgres decides that other approaches are more efficient. If you put ghost first, then it will just jump to the right part of the index and find the rows it needs there.
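A quick way to check whether the reordered index is actually picked up is to run the query through EXPLAIN again; a sketch, with the two hash values taken from the plans above:
explain (analyze, buffers)
select user_id
from phone_numbers
where ghost = false
  and hashed_phone_number in (
      'b6459ce58f21d99c462b132cce7adc9ea947fa522a3849321e9fb65893006a5e',
      '8228a8116f1fdb12e243102cb85ecd859ebf7873d9332dce5f1343a481ec72e8'
  );
Keep in mind that an Index Only Scan also depends on the visibility map being reasonably up to date, so a VACUUM ANALYZE on the table before testing avoids unnecessary heap fetches.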

Postgres: huge table with (delayed) read and write access

I have a huge table (currently ~3mil rows, expected to increase by a factor of 1000) with lots of inserts every second. The table is never updated.
Now I have to run queries on that table, which are pretty slow (as expected). These queries do not have to be 100% accurate; it is ok if the result is a day old (but not older).
There are currently two indexes on two single integer columns, and I would have to add two more indexes (on an integer and a timestamp column) to speed up my queries.
The ideas I had so far:
Add the two missing indexes to the table
No indexes on the huge table at all; instead, copy the content (just the important rows) to a second table as a daily task, then create the indexes on the second table and run the queries on that table
Partitioning the huge table
Master/Slave setup (writing to the master and reading from the slaves).
What option is the best in terms of performance? Do you have any other suggestions?
EDIT:
Here is the table (I have marked the foreign keys and prettified the query a bit):
CREATE TABLE client_log
(
id serial NOT NULL,
logid integer NOT NULL,
client_id integer NOT NULL, (FOREIGN KEY)
client_version varchar(16),
sessionid varchar(100) NOT NULL,
created timestamptz NOT NULL,
filename varchar(256),
funcname varchar(256),
linenum integer,
comment text,
domain varchar(128),
code integer,
latitude float8,
longitude float8,
created_on_server timestamptz NOT NULL,
message_id integer, (FOREIGN KEY)
app_id integer NOT NULL, (FOREIGN KEY)
result integer
);
CREATE INDEX client_log_code_idx ON client_log USING btree (code);
CREATE INDEX client_log_created_idx ON client_log USING btree (created);
CREATE INDEX clients_clientlog_app_id ON client_log USING btree (app_id);
CREATE INDEX clients_clientlog_client_id ON client_log USING btree (client_id);
CREATE UNIQUE INDEX clients_clientlog_logid_client_id_key ON client_log USING btree (logid, client_id);
CREATE INDEX clients_clientlog_message_id ON client_log USING btree (message_id);
And an example query:
SELECT
client_log.comment,
COUNT(client_log.comment) AS count
FROM
client_log
WHERE
client_log.app_id = 33 AND
client_log.code = 3 AND
client_log.client_id IN (SELECT client.id FROM client WHERE
client.app_id = 33 AND
client."replaced_id" IS NULL)
GROUP BY client_log.comment ORDER BY count DESC;
client_log_code_idx is the index needed for the query above. There are other queries that need the client_log_created_idx index.
And the query plan:
Sort (cost=2844.72..2844.75 rows=11 width=242) (actual time=4684.113..4684.180 rows=70 loops=1)
Sort Key: (count(client_log.comment))
Sort Method: quicksort Memory: 32kB
-> HashAggregate (cost=2844.42..2844.53 rows=11 width=242) (actual time=4683.830..4683.907 rows=70 loops=1)
-> Hash Semi Join (cost=1358.52..2844.32 rows=20 width=242) (actual time=303.515..4681.211 rows=1202 loops=1)
Hash Cond: (client_log.client_id = client.id)
-> Bitmap Heap Scan on client_log (cost=1108.02..2592.57 rows=387 width=246) (actual time=113.599..4607.568 rows=6962 loops=1)
Recheck Cond: ((app_id = 33) AND (code = 3))
-> BitmapAnd (cost=1108.02..1108.02 rows=387 width=0) (actual time=104.955..104.955 rows=0 loops=1)
-> Bitmap Index Scan on clients_clientlog_app_id (cost=0.00..469.96 rows=25271 width=0) (actual time=58.315..58.315 rows=40662 loops=1)
Index Cond: (app_id = 33)
-> Bitmap Index Scan on client_log_code_idx (cost=0.00..637.61 rows=34291 width=0) (actual time=45.093..45.093 rows=36310 loops=1)
Index Cond: (code = 3)
-> Hash (cost=248.06..248.06 rows=196 width=4) (actual time=61.069..61.069 rows=105 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 4kB
-> Bitmap Heap Scan on client (cost=10.95..248.06 rows=196 width=4) (actual time=27.843..60.867 rows=105 loops=1)
Recheck Cond: (app_id = 33)
Filter: (replaced_id IS NULL)
Rows Removed by Filter: 271
-> Bitmap Index Scan on clients_client_app_id (cost=0.00..10.90 rows=349 width=0) (actual time=15.144..15.144 rows=380 loops=1)
Index Cond: (app_id = 33)
Total runtime: 4684.843 ms
In general, in a system where time related data is constantly being inserted into the database, I'd recommend partitioning according to time.
This is not just because it might improve query times, but because otherwise it makes managing the data difficult. However big your hardware is, it will have a limit to its capacity, so you will eventually have to start removing rows that are older than a certain date. The rate at which you remove the rows will have to be equal to the rate they are going in.
If you just have one big table, and you remove old rows using DELETE, you will leave a lot of dead tuples that need to be vacuumed out. The autovacuum will be running constantly, using up valuable disk IO.
On the other hand, if you partition according to time, then removing out of date data is as easy as dropping the relevant child table.
In terms of indexes: the indexes are not inherited, so you can delay creating them until after the partition is loaded. You could have a partition size of 1 day in your use case. This means the indexes do not need to be constantly updated as data is being inserted. It will be more practical to have additional indexes as needed to make your queries perform well.
Your sample query does not filter on the 'created' time field, but you say other queries do. If you partition by time, and are careful about how you construct your queries, constraint exclusion will kick in and it will only include the specific partitions that are relevant to the query.
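To make that concrete, here is a rough sketch of daily partitions using declarative partitioning (available since PostgreSQL 10; the answer above assumes the older inheritance-based partitioning, but the operational idea of dropping whole child tables is the same). The column list is abbreviated and the partition name is made up:
-- parent table, partitioned by the created timestamp
create table client_log (
    id        serial      not null,
    client_id integer     not null,
    app_id    integer     not null,
    code      integer,
    comment   text,
    created   timestamptz not null
    -- remaining columns as in the original definition
) partition by range (created);
-- one partition per day
create table client_log_2015_06_01 partition of client_log
    for values from ('2015-06-01') to ('2015-06-02');
-- per-partition index, created once the day's data has been loaded
create index on client_log_2015_06_01 (app_id, code);
-- retiring a day of data is just dropping its partition
drop table client_log_2015_06_01;
Queries that filter on created will then only touch the partitions covering the requested time range.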
Apart from partitioning, I would consider splitting the table into many tables, aka sharding.
I don't have the full picture of your domain but these are some suggestions:
Each client gets their own table in their own schema (or a set of clients shares a schema, depending on how many clients you have and how many new clients you expect to get).
create table client1.log(id, logid,.., code, app_id);
create table client2.log(id, logid,.., code, app_id);
Splitting the table like this should also reduce the contention on inserts.
The table can be split even more. Within each client schema you can also split the table per "code" or "app_id" or something else that makes sense for you. This might be overdoing it, but it is easy to implement if the number of "code" and/or "app_id" values does not change often.
Do keep the code/app_id columns even in the new, smaller tables, but put a constraint on the column so that no other type of log record can be inserted. The constraint will also help the optimiser when searching; see this example:
create schema client1;
set search_path = 'client1';
create table error_log(id serial, code text check(code ='error'));
create table warning_log(id serial, code text check(code ='warning'));
create table message_log(id serial, code text check(code ='message'));
To get the full picture (all rows) of a client you can use a view on top of all tables:
create view client_log as
select * from error_log
union all
select * from warning_log
union all
select * from message_log;
The check constraints should allow the optimiser to only search the table where the "code" can exist.
explain
select * from client_log where code = 'error';
-- Output
Append (cost=0.00..25.38 rows=6 width=36)
-> Seq Scan on error_log (cost=0.00..25.38 rows=6 width=36)
Filter: (code = 'error'::text)