I have a single JSONB column in a table, which looks like {key_x: value_x}. The table contains billions of rows.
I am querying for a value in it using:
SELECT data->> some_key FROM tableName WHERE data ? some_key;
I have a GIN index on the column, created with this query:
CREATE INDEX data_index ON tableName USING GIN (data);
I have to run a lot of these queries, and at present they are taking too much time.
EXPLAIN (ANALYZE, BUFFERS) SELECT data ->> 'some_key' FROM tableName WHERE data ? 'some_key';
returns:
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------
Seq Scan on homeshubhgoethereumagethchaindata (cost=0.00..1885.42 rows=39 width=32) (actual time=1.911..15.488 rows=545 loops=1)
Filter: (data ? 'c2VjdXJlLWtleS3GJ+NCu6KAcCJRTC1SLiK6ZvkRZT0avMdL0KeGitPLNg=='::text)
Rows Removed by Filter: 37748
Buffers: shared hit=1397
Planning time: 3.574 ms
Execution time: 121.253 ms
The number of rows is expected to increase in the future. Is there some way to increase the speed of the query?
From your question it looks like you have a single key-value pair in the jsonb column, not an array. If so, did you consider replacing this jsonb with two regular columns with a B-tree index? This will work much faster than a GIN index on the whole json data.
Or, if the jsonb is required, you can keep it and just add a regular column for the key field and use it for searching. Sure, it means data duplication, but on the other hand you will get a speed gain.
UPD. You can convert the json to columns with the following queries:
ALTER TABLE tableName
ADD COLUMN "key" VARCHAR,
ADD COLUMN "value" VARCHAR;
UPDATE tableName SET
key = (SELECT jsonb_object_keys(data)),
value = data ->> (SELECT jsonb_object_keys(data));
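With the key extracted into its own column, a plain B-tree index supports the lookup. A minimal sketch (the index name is my own choice):
-- B-tree index on the extracted key column:
CREATE INDEX tablename_key_idx ON tableName (key);

-- The lookup then becomes a simple B-tree search:
SELECT value FROM tableName WHERE key = 'some_key';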
You should use a specific functional (expression) index on the jsonb column (not GIN). Try this:
CREATE INDEX ON tableName((data->>'some_key'));
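Note that an expression index is only considered when the query's WHERE clause uses the indexed expression itself, so the lookup would have to be rewritten along these lines (a sketch):
-- The WHERE clause must match the indexed expression exactly:
SELECT data ->> 'some_key'
FROM tableName
WHERE data ->> 'some_key' IS NOT NULL;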
Related
I have a simple table storing words for different ids.
CREATE TABLE words (
    id INTEGER,
    word TEXT
);
CREATE INDEX ON words USING hash (id);
CREATE INDEX ON words USING hash (word);
Now I simply want to count the number of times a given word appears. My actual query is a bit different and involves other filters.
SELECT COUNT(1) FROM "words"
WHERE word = 'car'
My table has a billion rows but the answer for this specific query is about 45k.
I hoped that the index on the words would make the query super fast, but it still takes 1 min 20 s to execute, which looks unreasonable. As a comparison, SELECT COUNT(1) FROM "words" takes 1 min 57 s.
Here is the output of EXPLAIN:
Aggregate (cost=48667.00..48667.01 rows=1 width=8)
-> Bitmap Heap Scan on words (cost=398.12..48634.05 rows=13177 width=0)
Recheck Cond: (word = 'car'::text)
-> Bitmap Index Scan on words_word_idx (cost=0.00..394.83 rows=13177 width=0)
Index Cond: (word = 'car'::text)
I don't understand why there is a need to recheck the condition and why this query is not efficient.
Hash indexes don't store the indexed value in the index, just its 32-bit hash and the ctid (pointer to the table row). That means they can't resolve hash collisions on their own, so the executor has to go to the table to obtain the value and then recheck it. This can involve a lot of extra IO compared to a B-tree index, which does store the value and can support index-only scans.
You can use a B-tree index whenever an indexed column is involved in a comparison using one of these operators:
< <= = >= >
I assume you are using = for counting how many words there are, so a B-tree index satisfies your requirements.
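A minimal sketch, assuming you replace the hash index with a B-tree one (the index name is my own):
-- B-tree index on word; once the visibility map is fresh, the count
-- can be answered by an index-only scan without visiting the heap.
CREATE INDEX words_word_btree_idx ON words (word);
VACUUM ANALYZE words;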
Reference: https://www.postgresql.org/docs/current/indexes-types.html#INDEXES-TYPES-BTREE
I have a table in PostgreSQL 9.2 that has a text column. Let's call this text_col. The values in this column are fairly unique (may contain 5-6 duplicates at the most). The table has ~5 million rows. About half these rows contain a null value for text_col. When I execute the following query I expect 1-5 rows. In most cases (>80%) I only expect 1 row.
Query
explain analyze SELECT col1,col2.. colN
FROM table
WHERE text_col = 'my_value';
A btree index exists on text_col. This index is never used by the query planner and I am not sure why. This is the output of the query.
Planner
Seq Scan on two (cost=0.000..459573.080 rows=93 width=339) (actual time=1392.864..3196.283 rows=2 loops=1)
Filter: (victor = 'foxtrot'::text)
Rows Removed by Filter: 4077384
I added another partial index to try to exclude the null values, but that did not help (with or without text_pattern_ops; I do not need text_pattern_ops since no LIKE conditions appear in my queries, but it also matches equality).
CREATE INDEX name_idx
ON table
USING btree
(text_col COLLATE pg_catalog."default" text_pattern_ops)
WHERE text_col IS NOT NULL;
Disabling sequential scans using set enable_seqscan = off; still makes the planner pick the seq scan over an index scan. In summary...
The number of rows returned by this query is small.
Given that the non-null rows are fairly unique, an index scan over the text should be faster.
Vacuuming and analyzing the table did not help the optimizer pick the index.
My questions
Why does the database pick the sequence scan over the index scan?
When a table has a text column whose equality condition should be checked, are there any best practices I can adhere to?
How do I reduce the time taken for this query?
[Edit - More information]
The index scan is picked up on my local database that houses about 10% of the data that is available in production.
A partial index is a good idea to exclude half the rows of the table which you obviously do not need. Simpler:
CREATE INDEX name_idx ON table (text_col)
WHERE text_col IS NOT NULL;
Be sure to run ANALYZE table after creating the index. (Autovacuum does that automatically after some time if you don't do it manually, but if you test right after creation, your test will fail.)
Then, to convince the query planner that a particular partial index can be used, repeat the WHERE condition in the query - even if it seems completely redundant:
SELECT col1,col2, .. colN
FROM table
WHERE text_col = 'my_value'
AND text_col IS NOT NULL; -- repeat condition
Voilà.
Per documentation:
However, keep in mind that the predicate must match the conditions
used in the queries that are supposed to benefit from the index. To be
precise, a partial index can be used in a query only if the system can
recognize that the WHERE condition of the query mathematically implies
the predicate of the index. PostgreSQL does not have a sophisticated
theorem prover that can recognize mathematically equivalent
expressions that are written in different forms. (Not only is such a
general theorem prover extremely difficult to create, it would
probably be too slow to be of any real use.) The system can recognize
simple inequality implications, for example "x < 1" implies "x < 2";
otherwise the predicate condition must exactly match part of the
query's WHERE condition or the index will not be recognized as usable.
Matching takes place at query planning time, not at run time. As a
result, parameterized query clauses do not work with a partial index.
As for parameterized queries: again, add the (redundant) predicate of the partial index as an additional, constant WHERE condition, and it works just fine.
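For example, with a prepared statement (a sketch; tbl stands in for the question's placeholder table name, col1/col2/text_col are taken from the question):
-- The constant predicate repeats the partial index predicate, so the
-- planner can match the index even though $1 is unknown at plan time.
PREPARE lookup(text) AS
SELECT col1, col2
FROM tbl
WHERE text_col = $1
AND text_col IS NOT NULL;  -- constant predicate of the partial index

EXECUTE lookup('my_value');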
An important update in Postgres 9.6 largely improves chances for index-only scans (which can make queries cheaper, and the query planner will more readily choose such query plans). Related:
PostgreSQL not using index during count(*)
A partial index is only used if the WHERE conditions match. Thus an index with WHERE text_col IS NOT NULL can only be used if you use the same condition in your SELECT. Collation mismatch could also cause harm.
Try the following:
1. Make the simplest possible B-tree index: CREATE INDEX foo ON table (text_col)
2. ANALYZE table
3. Run your query again
I figured it out. Upon taking a closer look at the pg_stats view that ANALYZE helps build, I came across this excerpt in the documentation.
Correlation
Statistical correlation between physical row ordering and logical
ordering of the column values. This ranges from -1 to +1. When the
value is near -1 or +1, an index scan on the column will be estimated
to be cheaper than when it is near zero, due to reduction of random
access to the disk. (This column is null if the column data type does
not have a < operator.)
On my local box the correlation number is 0.97 and on production it was 0.05. Thus the planner is estimating that it is easier to go through all those rows sequentially instead of looking up the index each time and diving into a random access on the disk block. This is the query I used to peek at the correlation number.
select * from pg_stats where tablename = 'table_name' and attname = 'text_col';
This table also has a few updates performed on its rows. The avg_width of the rows is estimated to be 20 bytes. If an update has a large value for a text column, it can exceed the average and also result in a slower update. My guess was that the physical and logical ordering were slowly moving apart with each update. To fix that I executed the following queries.
ALTER TABLE table_name SET (FILLFACTOR = 80);
VACUUM FULL table_name;
REINDEX TABLE table_name;
ANALYZE table_name;
The idea is that I could give each disk block a 20% buffer and VACUUM FULL the table to reclaim lost space and maintain physical and logical order. After I did this, the query picks up the index.
Query
explain analyze SELECT col1,col2... colN
FROM table_name
WHERE text_col is not null
AND
text_col = 'my_value';
Partial index scan - 1.5ms
Index Scan using tango on two (cost=0.000..165.290 rows=40 width=339) (actual time=0.083..0.086 rows=1 loops=1)
Index Cond: ((victor IS NOT NULL) AND (victor = 'delta'::text))
Excluding the NULL condition picks up the other index with a bitmap heap scan.
Full index - 0.08ms
Bitmap Heap Scan on two (cost=5.380..392.150 rows=98 width=339) (actual time=0.038..0.039 rows=1 loops=1)
Recheck Cond: (victor = 'delta'::text)
-> Bitmap Index Scan on tango (cost=0.000..5.360 rows=98 width=0) (actual time=0.029..0.029 rows=1 loops=1)
Index Cond: (victor = 'delta'::text)
[EDIT]
While it initially looked like correlation plays a major role in choosing the index scan, @Mike has observed that a correlation value close to 0 on his database still resulted in an index scan. Changing the fill factor and vacuuming fully helped, but I'm unsure why.
I have a table with an hstore column and roughly 22 million records (the ways table from a partial OSM database).
Despite having a GIN index on the hstore column, queries for a specific tag result in a sequential table scan that takes > 60 sec to return a single column.
What I have done so far:
1. Created the GIN index using pgAdmin III.
2. Executed VACUUM ANALYZE.
3. Executed a query of the kind: select id from table where tags -> 'name' = 'foo'
4. Deleted the index and started from 1 again...
[Edit] As suggested by the user a_horse_with_no_name, I updated the table statistics by executing ANALYZE on the table, but that had no effect.
You can see the query plan here. For some reason the explain analyze takes only ~20 sec to complete.
How can I properly index a hstore column on a large table like this, to reduce query execution cost significantly?
Thank you for your help!
I see two possible solutions:
If you always query that key for equality, you can use a B-tree index on the expression (tags -> 'name'):
create index idx_name on ways ( (tags -> 'name') );
A quick test has shown that Postgres does use the index to find if a key value is present in the hstore column, but apparently not for finding the associated value.
So you could try to add a condition to test for that key value as well:
select id
from ways
where tags ? 'name'
and tags -> 'name' = 'Wiehbergpark';
If all rows contain that key, it might not help though.
Given a string column with a value similar to /123/12/34/56/5/, what is the optimal way of querying for all the records that include the given number (12 for example)?
The solution from top of my head is:
SELECT id FROM things WHERE things.path LIKE '%/12/%'
But AFAIK this query can't use indexes on the column due to the leading %.
There must be something better. What is it?
Using PostgreSQL, but would prefer the solution that would work across other DBs too.
If you're happy turning that column into an array of integers, like:
'/123/12/34/56/5/' becomes ARRAY[123,12,34,56,5]
so that path_arr is a column of type INTEGER[], then you can create a GIN index on that column:
CREATE INDEX ON things USING gin(path_arr);
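One way to populate such a column, assuming every value strictly follows the '/n/n/.../' shape (a sketch, not part of the original answer):
-- Strip the outer slashes, split on '/', and cast the pieces to integers:
ALTER TABLE things ADD COLUMN path_arr INTEGER[];

UPDATE things
SET path_arr = string_to_array(trim(both '/' from path), '/')::int[];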
A query for all items containing 12 then becomes:
SELECT * FROM things WHERE ARRAY[12] <# path_arr;
Which will use the index. In my test (with a million rows), I get plans like:
EXPLAIN SELECT * FROM things WHERE ARRAY[12] <# path_arr;
QUERY PLAN
----------------------------------------------------------------------------------------
Bitmap Heap Scan on things (cost=5915.75..9216.99 rows=1000 width=92)
Recheck Cond: (path_arr <# '{12}'::integer[])
-> Bitmap Index Scan on things_path_arr_idx (cost=0.00..5915.50 rows=1000 width=0)
Index Cond: ('{12}'::integer[] <# path_arr)
(4 rows)
In PostgreSQL 9.1 you could utilize the pg_trgm module and build a GIN index with it.
CREATE EXTENSION pg_trgm; -- once per database
CREATE INDEX things_path_trgm_gin_idx ON things USING gin (path gin_trgm_ops);
Your LIKE expression can use this index even if it is not left-anchored.
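For the query from the question, that would be something like this (table and column names taken from the question):
-- With gin_trgm_ops, even a pattern with a leading % can use the index:
SELECT id FROM things WHERE path LIKE '%/12/%';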
See a detailed demo by depesz here.
Normalize it if you can, though.
This is my simple query; by searching for selectnothing I'm sure I'll have no hits.
SELECT nome_t FROM myTable WHERE nome_t ILIKE '%selectnothing%';
This is the EXPLAIN ANALYZE VERBOSE
Seq Scan on myTable (cost=0.00..15259.04 rows=37 width=29) (actual time=2153.061..2153.061 rows=0 loops=1)
Output: nome_t
Filter: (nome_t ~~* '%selectnothing%'::text)
Total runtime: 2153.116 ms
myTable has around 350k rows and the table definition is something like:
CREATE TABLE myTable (
    nome_t text NOT NULL
);
I have an index on nome_t as stated below:
CREATE INDEX idx_m_nome_t ON myTable
USING btree (nome_t);
Although this is clearly a good candidate for Fulltext search I would like to rule that option out for now.
This query is meant to be run from a web application and currently it's taking around 2 seconds, which is obviously too much.
Is there anything I can do, like using other index methods, to improve the speed of this query?
No, ILIKE '%selectnothing%' always needs a full table scan; every index is useless here. You need full-text search, and it's not that hard to implement.
Edit: You could use Wildspeed, I forgot about this option. The indexes will be huge, but your performance will also be much better.
Wildspeed extension provides GIN index
support for wildcard search for LIKE
operator.
http://www.sai.msu.su/~megera/wiki/wildspeed
Another thing you can do is break the nome_t column in myTable out into its own table. Searching one column of a table is slow (if there are fifty other wide columns) because the other data effectively slows down the scan against that column (there are fewer records per page/extent).
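A minimal sketch of that split, assuming an id column exists to join back on (the question only shows nome_t, so the names here are my own):
-- Narrow side table holding only the searched column, so a scan
-- touches far fewer pages than scanning the wide original table.
CREATE TABLE myTable_names (
    id     integer PRIMARY KEY,
    nome_t text NOT NULL
);

INSERT INTO myTable_names (id, nome_t)
SELECT id, nome_t FROM myTable;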