Postgres / PostGIS query optimization

I have a query in Postgres / PostGIS that finds the points nearest to a given point, filtered by some other columns in the table.
The table consists of a bit over 10 million rows and the query looks like this:
SELECT t.id FROM my_table t
WHERE round(t.col1) IN (1,2,3)
ORDER BY t.geom <-> st_transform(st_setsrid('POINT(lon lat)'::geometry, 4326), 3857)
LIMIT 1000;
The geom column is indexed using GIST and col1 is also indexed.
When the WHERE clause finds many rows that are also near the point this is blazing fast using the geom index:
Limit (cost=0.42..10575.49 rows=1000 width=12) (actual time=0.150..6.742 rows=1000 loops=1)
-> Index Scan using my_table_geom_idx on my_table t (cost=0.42..2148612.35 rows=203177 width=12) (actual time=0.149..6.663 rows=1000 loops=1)
Order By: (geom <-> '.....'::geometry)
Filter: (round(t.col1) = ANY ('{1,2,3}'::double precision[]))
Rows Removed by Filter: 3348
Planning Time: 0.288 ms
Execution Time: 6.817 ms
The problem occurs when the WHERE clause does not find many rows that are close in distance to the given point. Example:
SELECT t.id FROM my_table t
WHERE round(t.col1) IN (1) -- 1 is very rare near the given point
ORDER BY t.geom <-> st_transform(st_setsrid('POINT(lon lat)'::geometry, 4326), 3857)
LIMIT 1000;
This query runs much much slower:
Limit (cost=0.42..14487.97 rows=1000 width=12) (actual time=8443.514..10629.745 rows=1000 loops=1)
-> Index Scan using my_table_geom_idx on my_table t (cost=0.42..1962368.41 rows=135452 width=12) (actual time=8443.513..10629.553 rows=1000 loops=1)
Order By: (t.geom <-> '.....'::geometry)
Filter: (round(t.col1) = ANY ('{1}'::double precision[]))
Rows Removed by Filter: 5866030
Planning Time: 0.292 ms
Execution Time: 10629.906 ms
I created an index on round(col1) to speed up searches on col1, but Postgres uses only the geom index. That works great when many nearby rows fit the criteria, but not when only a few rows match.
If I remove the LIMIT clause Postgres uses the index on col1, which works great when there are few resulting rows but is very slow when the result contains many rows, so I would like to keep the LIMIT clause.
Any suggestions on how I could optimize this query or create an index that handles this?
EDIT:
Thank you for all the suggestions and feedback!
I tried the tip from @JGH and restricted my query using st_dwithin so as not to order the entire table before limiting.
...where st_dwithin(geom, searchpoint, 10000)
This greatly reduced the time of the slow query, down to a few milliseconds. Restricting the search to a constant distance works well in my application, so I will use this as the solution.
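A minimal sketch of what the restricted query might look like in full, assuming the 10000 radius from the edit above and keeping 'POINT(lon lat)' as the same placeholder used in the question. Since geom is stored in SRID 3857, the radius is expressed in that projection's units (nominally meters, stretched with latitude):

```sql
SELECT t.id
FROM   my_table t
WHERE  round(t.col1) IN (1)
-- Cut the candidate set to a fixed radius first; ST_DWithin can use the GiST index on geom.
AND    ST_DWithin(t.geom,
                  ST_Transform(ST_SetSRID('POINT(lon lat)'::geometry, 4326), 3857),
                  10000)
-- Only the rows inside the radius have to be ordered by distance.
ORDER  BY t.geom <-> ST_Transform(ST_SetSRID('POINT(lon lat)'::geometry, 4326), 3857)
LIMIT  1000;
```

The trade-off is that the query can return fewer than 1000 rows when fewer matching rows lie within the radius.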

Related

Why would LIMIT 2 queries work but LIMIT 1 always times out?

I'm using this public Postgres DB of NEAR protocol: https://github.com/near/near-indexer-for-explorer#shared-public-access
postgres://public_readonly:nearprotocol@mainnet.db.explorer.indexer.near.dev/mainnet_explorer
SELECT "public"."receipts"."receipt_id",
"public"."receipts"."included_in_block_hash",
"public"."receipts"."included_in_chunk_hash",
"public"."receipts"."index_in_chunk",
"public"."receipts"."included_in_block_timestamp",
"public"."receipts"."predecessor_account_id",
"public"."receipts"."receiver_account_id",
"public"."receipts"."receipt_kind",
"public"."receipts"."originated_from_transaction_hash"
FROM "public"."receipts"
WHERE ("public"."receipts"."receipt_id") IN
(SELECT "t0"."receipt_id"
FROM "public"."receipts" AS "t0"
INNER JOIN "public"."action_receipts" AS "j0" ON ("j0"."receipt_id") = ("t0"."receipt_id")
WHERE ("j0"."signer_account_id" = 'ryancwalsh.near'
AND "t0"."receipt_id" IS NOT NULL))
ORDER BY "public"."receipts"."included_in_block_timestamp" DESC
LIMIT 1
OFFSET 0
always results in:
ERROR: canceling statement due to statement timeout
SQL state: 57014
But if I change it to LIMIT 2, the query runs in less than 1 second.
How would that ever be the case? Does that mean the database isn't set up well? Or am I doing something wrong?
P.S. The query here was generated via Prisma. findFirst always times out, so I think I might need to change it to findMany as a workaround.
Your query can be simplified / optimized:
SELECT r.receipt_id
, r.included_in_block_hash
, r.included_in_chunk_hash
, r.index_in_chunk
, r.included_in_block_timestamp
, r.predecessor_account_id
, r.receiver_account_id
, r.receipt_kind
, r.originated_from_transaction_hash
FROM public.receipts r
WHERE EXISTS (
SELECT FROM public.action_receipts j
WHERE j.receipt_id = r.receipt_id
AND j.signer_account_id = 'ryancwalsh.near'
)
ORDER BY r.included_in_block_timestamp DESC
LIMIT 1;
However, that only scratches the surface of your underlying problem.
Like Kirk already commented, Postgres chooses a different query plan for LIMIT 1, obviously ignorant of the fact that there are only 90 rows in table action_receipts with signer_account_id = 'ryancwalsh.near', while both involved tables have more than 220 million rows, obviously growing steadily.
Changing to LIMIT 2 makes a different query plan seem more favorable, hence the observed difference in performance. (So the query planner has the general idea that the filter is very selective, just not close enough for the corner case of LIMIT 1.)
You should have mentioned cardinalities to set us on the right track.
Knowing that our filter is so selective, we can force a more favorable query plan with a different query:
WITH j AS (
SELECT receipt_id -- is PK!
FROM public.action_receipts
WHERE signer_account_id = 'ryancwalsh.near'
)
SELECT r.receipt_id
, r.included_in_block_hash
, r.included_in_chunk_hash
, r.index_in_chunk
, r.included_in_block_timestamp
, r.predecessor_account_id
, r.receiver_account_id
, r.receipt_kind
, r.originated_from_transaction_hash
FROM j
JOIN public.receipts r USING (receipt_id)
ORDER BY r.included_in_block_timestamp DESC
LIMIT 1;
This uses the same query plan for LIMIT 1 and LIMIT 2, and either finishes in under 2 ms in my test:
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=134904.89..134904.89 rows=1 width=223) (actual time=1.750..1.754 rows=1 loops=1)
CTE j
-> Bitmap Heap Scan on action_receipts (cost=319.46..41564.59 rows=10696 width=44) (actual time=0.058..0.179 rows=90 loops=1)
Recheck Cond: (signer_account_id = 'ryancwalsh.near'::text)
Heap Blocks: exact=73
-> Bitmap Index Scan on action_receipt_signer_account_id_idx (cost=0.00..316.79 rows=10696 width=0) (actual time=0.043..0.043 rows=90 loops=1)
Index Cond: (signer_account_id = 'ryancwalsh.near'::text)
-> Sort (cost=93340.30..93367.04 rows=10696 width=223) (actual time=1.749..1.750 rows=1 loops=1)
Sort Key: r.included_in_block_timestamp DESC
Sort Method: top-N heapsort Memory: 25kB
-> Nested Loop (cost=0.70..93286.82 rows=10696 width=223) (actual time=0.089..1.705 rows=90 loops=1)
-> CTE Scan on j (cost=0.00..213.92 rows=10696 width=32) (actual time=0.060..0.221 rows=90 loops=1)
-> Index Scan using receipts_pkey on receipts r (cost=0.70..8.70 rows=1 width=223) (actual time=0.016..0.016 rows=1 loops=90)
Index Cond: (receipt_id = j.receipt_id)
Planning Time: 0.281 ms
Execution Time: 1.801 ms
The point is to execute the hugely selective query in the CTE first. Then Postgres does not attempt to walk the index on (included_in_block_timestamp) under the wrong assumption that it would find a matching row soon enough. (It does not.)
The DB at hand runs Postgres 11, where CTEs are always optimization barriers. In Postgres 12 or later add AS MATERIALIZED to the CTE to guarantee the same effect.
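For Postgres 12 or later, the same CTE with the AS MATERIALIZED keyword might look like the sketch below. It reuses the tables and columns from the query above; only the keyword is new:

```sql
WITH j AS MATERIALIZED (  -- Postgres 12+: keep the CTE as an optimization barrier
   SELECT receipt_id
   FROM   public.action_receipts
   WHERE  signer_account_id = 'ryancwalsh.near'
   )
SELECT r.receipt_id
     , r.included_in_block_hash
     , r.included_in_chunk_hash
     , r.index_in_chunk
     , r.included_in_block_timestamp
     , r.predecessor_account_id
     , r.receiver_account_id
     , r.receipt_kind
     , r.originated_from_transaction_hash
FROM   j
JOIN   public.receipts r USING (receipt_id)
ORDER  BY r.included_in_block_timestamp DESC
LIMIT  1;
```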
Or you could use the "OFFSET 0 hack" in any version like this:
SELECT r.receipt_id
, r.included_in_block_hash
, r.included_in_chunk_hash
, r.index_in_chunk
, r.included_in_block_timestamp
, r.predecessor_account_id
, r.receiver_account_id
, r.receipt_kind
, r.originated_from_transaction_hash
FROM (
SELECT receipt_id -- is PK!
FROM public.action_receipts
WHERE signer_account_id = 'ryancwalsh.near'
OFFSET 0 -- !
) j
JOIN public.receipts r USING (receipt_id)
ORDER BY r.included_in_block_timestamp DESC
LIMIT 1;
This prevents "inlining" of the subquery, to the same effect. Finishes in < 2 ms.
See:
How to prevent PostgreSQL from rewriting OUTER JOIN queries?
"Fix" the database?
The proper fix depends on the complete picture. The underlying problem is that Postgres overestimates the number of qualifying rows in table action_receipts. The MCV list (most common values) cannot keep up with 220 million rows (and growing). It's most probably not just ANALYZE lagging behind. (Though it could be: autovacuum not properly configured? Rookie mistake?) Depending on the actual cardinalities (data distribution) in action_receipts.signer_account_id and access patterns you could do various things to "fix" it. Two options:
1. Increase n_distinct and STATISTICS
If most values in action_receipts.signer_account_id are equally rare (high cardinality), consider setting a very large n_distinct value for the column. And combine that with a moderately increased STATISTICS target for the same column to counter errors in the other direction (underestimating the number of qualifying rows for common values). Read both answers here:
Postgres sometimes uses inferior index for WHERE a IN (...) ORDER BY b LIMIT N
And:
Very bad query plan in PostgreSQL 9.6
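As a rough sketch of option 1, the two settings could be applied as follows. The concrete numbers (5000000 distinct values, a statistics target of 2000) are placeholders chosen for illustration, not measured values for this database:

```sql
-- Override the planner's estimate of distinct values (placeholder number).
-- A positive value is an absolute count; a negative value is a fraction of the row count.
ALTER TABLE public.action_receipts
   ALTER COLUMN signer_account_id SET (n_distinct = 5000000);

-- Collect a larger sample and a longer MCV list for this column (placeholder target).
ALTER TABLE public.action_receipts
   ALTER COLUMN signer_account_id SET STATISTICS 2000;

-- Both settings only take effect at the next ANALYZE.
ANALYZE public.action_receipts;
```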
2. Local fix with partial index
If action_receipts.signer_account_id = 'ryancwalsh.near' is special in that it's queried more regularly than others, consider a small partial index for it, to fix just that case. Like:
CREATE INDEX ON action_receipts (receipt_id)
WHERE signer_account_id = 'ryancwalsh.near';

Improve PostgreSQL query performance

I'm running this query in our database:
select
(
select least(2147483647, sum(pb.nr_size))
from tb_pr_dc pd
inner join tb_pr_dc_bn pb on 1=1
and pb.id_pr_dc_bn = pd.id_pr_dc_bn
where 1=1
and pd.id_pr = pt.id_pr -- outer query column
)
from
(
select regexp_split_to_table('[list of 500 ids]', ',')::integer id_pr
) pt
;
Which outputs 500 rows having a single result column and takes around 1 min and 43 secs to run. The explain (analyze, verbose, buffers) outputs the following plan:
Subquery Scan on pt (cost=0.00..805828.19 rows=1000 width=8) (actual time=96.791..103205.872 rows=500 loops=1)
Output: (SubPlan 1)
Buffers: shared hit=373771 read=153484
-> Result (cost=0.00..22.52 rows=1000 width=4) (actual time=0.434..3.729 rows=500 loops=1)
Output: ((regexp_split_to_table('[list of 500 ids]', ',')::integer id_pr)
-> ProjectSet (cost=0.00..5.02 rows=1000 width=32) (actual time=0.429..2.288 rows=500 loops=1)
Output: (regexp_split_to_table('[list of 500 ids]', ',')::integer id_pr
-> Result (cost=0.00..0.01 rows=1 width=0) (actual time=0.001..0.001 rows=1 loops=1)
SubPlan 1
-> Aggregate (cost=805.78..805.80 rows=1 width=8) (actual time=206.399..206.400 rows=1 loops=500)
Output: LEAST('2147483647'::bigint, sum((pb.nr_size)::integer))
Buffers: shared hit=373771 read=153484
-> Nested Loop (cost=0.87..805.58 rows=83 width=4) (actual time=1.468..206.247 rows=219 loops=500)
Output: pb.nr_size
Inner Unique: true
Buffers: shared hit=373771 read=153484
-> Index Scan using tb_pr_dc_in05 on db.tb_pr_dc pd (cost=0.43..104.02 rows=83 width=4) (actual time=0.233..49.289 rows=219 loops=500)
Output: pd.id_pr_dc, pd.ds_pr_dc, pd.id_pr, pd.id_user_in, pd.id_user_ex, pd.dt_in, pd.dt_ex, pd.ds_mt_ex, pd.in_at, pd.id_tp_pr_dc, pd.id_pr_xz (...)
Index Cond: ((pd.id_pr)::integer = pt.id_pr)
Buffers: shared hit=24859 read=64222
-> Index Scan using tb_pr_dc_bn_pk on db.tb_pr_dc_bn pb (cost=0.43..8.45 rows=1 width=8) (actual time=0.715..0.715 rows=1 loops=109468)
Output: pb.id_pr_dc_bn, pb.ds_ex, pb.ds_md_dc, pb.ds_m5_dc, pb.nm_aq, pb.id_user, pb.dt_in, pb.ob_pr_dc, pb.nr_size, pb.ds_sg, pb.ds_cr_ch, pb.id_user_ (...)
Index Cond: ((pb.id_pr_dc_bn)::integer = (pd.id_pr_dc_bn)::integer)
Buffers: shared hit=348912 read=89262
Planning Time: 1.151 ms
Execution Time: 103206.243 ms
The logic is: for each id_pr chosen (in the list of 500 ids) calculate the sum of the integer column pb.nr_size associated with them, returning the lesser value between this amount and the number 2,147,483,647. The result must contain 500 rows, one for each id, and we already know that they'll match at least one row in the subquery, so will not produce null values.
The index tb_pr_dc_in05 is a b-tree on id_pr only, which is of integer type. The index tb_pr_dc_bn_pk is a b-tree on the primary key id_pr_dc_bn only, which is of integer type also. Table tb_pr_dc has many rows for each id_pr. Actually, we have 209,217 unique id_prs in tb_pr_dc for a total of 13,910,855 rows. Table tb_pr_dc_bn has the same amount of rows.
As can be seen, we defined 500 ids to query tb_pr_dc, finding 109,468 rows (less than 1% of the table size) and then finding the same amount looking in tb_pr_dc_bn. Imo, the indexes look fine and the amount of rows to evaluate is minimal, so I can't understand why it's taking so much time to run this query. A lot of other queries reading a lot more of data on other tables and doing more calculations are running fine. The DBA just ran a reindex and vacuum analyze, but still it's running the same slow way. We are running PostgreSQL 11 on Linux. I'm running this query in a replica without concurrent access.
What could I be missing that could improve this query performance?
Thanks for your attention.
The time is spent jumping all over the table to find 109468 randomly scattered rows, issuing random IO requests to do so. You can verify that by turning track_io_timing on and redoing the plans (probably just leave it turned on globally and by default, the overhead is low and the value it produces is high), but I'm sure enough that I don't need to see that output before reaching this conclusion. The other queries that are faster are probably accessing fewer disk pages because they access data that is more tightly packed, or is organized so that it can be read more sequentially. In fact, I would say your query is quite fast given how many pages it had to read.
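A minimal sketch of that check for a single session, assuming a role that is allowed to change track_io_timing (superuser by default) and using a hypothetical id_pr value of 12345 in place of the 500-id list:

```sql
-- Enable I/O timing for this session only.
SET track_io_timing = on;

-- Re-run one iteration of the subquery; plan nodes that read from disk
-- then report an additional "I/O Timings" line next to "Buffers".
EXPLAIN (ANALYZE, BUFFERS)
SELECT least(2147483647, sum(pb.nr_size))
FROM   tb_pr_dc pd
JOIN   tb_pr_dc_bn pb ON pb.id_pr_dc_bn = pd.id_pr_dc_bn
WHERE  pd.id_pr = 12345;  -- hypothetical id, replace with a real one
```

If most of the node's actual time shows up as read time, the random I/O explanation holds.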
You ask about why so many columns are output in the internal nodes of the plan. The reason for that is that PostgreSQL often just passes around pointers to where the tuple lives in the shared_buffers, and the tuple being pointed to has the columns that the table itself has. It could allocate memory in which to store a reformatted version of the tuple with the unnecessary columns stripped out, but that would generally be more work, not less. If it has a reason to copy and re-form the tuple anyway, it will remove the extraneous columns while it does so. But it won't do it without a reason.
One way to speed this up is to create indexes which will enable index-only scans. Those would be on tb_pr_dc (id_pr, id_pr_dc_bn) and on tb_pr_dc_bn (id_pr_dc_bn, nr_size).
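Those two indexes could be created roughly as follows. The index names are made up for this sketch, and since the question mentions a replica, the statements would have to run on the primary:

```sql
-- Covers the lookup by id_pr and carries the join key, so the heap need not be visited.
CREATE INDEX tb_pr_dc_id_pr_id_pr_dc_bn_idx
   ON tb_pr_dc (id_pr, id_pr_dc_bn);

-- Covers the join key and the summed column.
CREATE INDEX tb_pr_dc_bn_id_pr_dc_bn_nr_size_idx
   ON tb_pr_dc_bn (id_pr_dc_bn, nr_size);

-- Index-only scans depend on an up-to-date visibility map, which VACUUM maintains.
VACUUM (ANALYZE) tb_pr_dc;
VACUUM (ANALYZE) tb_pr_dc_bn;
```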
If this isn't enough, there might be other ways to improve this too; but I can't think through them if I keep getting distracted by the long strings of unmemorable unpronounceable gibberish you have for table and column names.

Slow LEFT JOIN on CTE with time intervals

I am trying to debug a query in PostgreSQL that I've built to bucket market data in time buckets in arbitrary time intervals. Here is my table definition:
CREATE TABLE historical_ohlcv (
exchange_symbol TEXT NOT NULL,
symbol_id TEXT NOT NULL,
kafka_key TEXT NOT NULL,
open NUMERIC,
high NUMERIC,
low NUMERIC,
close NUMERIC,
volume NUMERIC,
time_open TIMESTAMP WITH TIME ZONE NOT NULL,
time_close TIMESTAMP WITH TIME ZONE,
CONSTRAINT historical_ohlcv_pkey
PRIMARY KEY (exchange_symbol, symbol_id, time_open)
);
CREATE INDEX symbol_id_idx
ON historical_ohlcv (symbol_id);
CREATE INDEX open_close_symbol_id
ON historical_ohlcv (time_open, time_close, exchange_symbol, symbol_id);
CREATE INDEX time_open_idx
ON historical_ohlcv (time_open);
CREATE INDEX time_close_idx
ON historical_ohlcv (time_close);
The table currently has ~25m rows. My query, shown here for a 1-hour interval (but it could be 5 minutes, 10 minutes, 2 days, etc.):
EXPLAIN ANALYZE WITH vals AS (
SELECT
NOW() - '5 months' :: INTERVAL AS frame_start,
NOW() AS frame_end,
INTERVAL '1 hour' AS t_interval
)
, grid AS (
SELECT
start_time,
lead(start_time, 1)
OVER (
ORDER BY start_time ) AS end_time
FROM (
SELECT
generate_series(frame_start, frame_end,
t_interval) AS start_time,
frame_end
FROM vals
) AS x
)
SELECT max(high)
FROM grid g
LEFT JOIN historical_ohlcv ohlcv ON ohlcv.time_open >= g.start_time
AND ohlcv.time_close < g.end_time
WHERE exchange_symbol = 'BINANCE'
AND symbol_id = 'ETHBTC'
GROUP BY start_time;
The WHERE clause could be any valid value in the table.
This technique was inspired by:
Best way to count records by arbitrary time intervals in Rails+Postgres.
The idea is to make a common table and left join your data with that to indicate which bucket stuff is in. This query is really slow! It's currently taking 15s. Based on the query planner, we have a really expensive nested loop:
QUERY PLAN
HashAggregate (cost=2758432.05..2758434.05 rows=200 width=40) (actual time=16023.713..16023.817 rows=542 loops=1)
Group Key: g.start_time
CTE vals
-> Result (cost=0.00..0.02 rows=1 width=32) (actual time=0.005..0.005 rows=1 loops=1)
CTE grid
-> WindowAgg (cost=64.86..82.36 rows=1000 width=16) (actual time=2.986..9.594 rows=3625 loops=1)
-> Sort (cost=64.86..67.36 rows=1000 width=8) (actual time=2.981..4.014 rows=3625 loops=1)
Sort Key: x.start_time
Sort Method: quicksort Memory: 266kB
-> Subquery Scan on x (cost=0.00..15.03 rows=1000 width=8) (actual time=0.014..1.991 rows=3625 loops=1)
-> ProjectSet (cost=0.00..5.03 rows=1000 width=16) (actual time=0.013..1.048 rows=3625 loops=1)
-> CTE Scan on vals (cost=0.00..0.02 rows=1 width=32) (actual time=0.008..0.009 rows=1 loops=1)
-> Nested Loop (cost=0.56..2694021.34 rows=12865667 width=14) (actual time=7051.730..16015.873 rows=31978 loops=1)
-> CTE Scan on grid g (cost=0.00..20.00 rows=1000 width=16) (actual time=2.988..11.635 rows=3625 loops=1)
-> Index Scan using historical_ohlcv_pkey on historical_ohlcv ohlcv (cost=0.56..2565.34 rows=12866 width=22) (actual time=3.712..4.413 rows=9 loops=3625)
Index Cond: ((exchange_symbol = 'BINANCE'::text) AND (symbol_id = 'ETHBTC'::text) AND (time_open >= g.start_time))
Filter: (time_close < g.end_time)
Rows Removed by Filter: 15502
Planning time: 0.568 ms
Execution time: 16023.979 ms
My guess is this line is doing a lot:
LEFT JOIN historical_ohlcv ohlcv ON ohlcv.time_open >= g.start_time
AND ohlcv.time_close < g.end_time
But I'm not sure how to accomplish this in another way.
P.S. apologies if this belongs to dba.SE. I read the FAQ and this seemed too basic for that site, so I posted here.
Edits as requested:
SELECT avg(pg_column_size(t)) FROM historical_ohlcv t TABLESAMPLE SYSTEM (0.1); returns 107.632
For exchange_symbol, there are 3 unique values, for symbol_id there are ~400
PostgreSQL version: PostgreSQL 10.3 (Ubuntu 10.3-1.pgdg16.04+1) on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 5.4.0-6ubuntu1~16.04.9) 5.4.0 20160609, 64-bit.
The table will be growing about ~1m records a day, so not exactly read-only. All this stuff is done locally and I will try to move to RDS or to help manage hardware issues.
Related: if I wanted to add other aggregates, specifically 'first in the bucket', 'last in the bucket', min, sum, would my indexing strategy change?
Correctness first: I suspect a bug in your query:
LEFT JOIN historical_ohlcv ohlcv ON ohlcv.time_open >= g.start_time
AND ohlcv.time_close < g.end_time
Unlike my referenced answer, you join on a time interval: (time_open, time_close]. The way you do it excludes rows in the table where the interval crosses bucket borders. Only intervals fully contained in a single bucket count. I don't think that's intended?
A simple fix would be to decide bucket membership based on time_open (or time_close) alone. If you want to keep working with both, you have to define exactly how to deal with intervals overlapping with multiple buckets.
Also, you are looking for max(high) per bucket, which is different in nature from count(*) in my referenced answer.
And your buckets are simple intervals per hour?
Then we can radically simplify. Working with just time_open:
SELECT date_trunc('hour', time_open) AS hour, max(high) AS max_high
FROM historical_ohlcv
WHERE exchange_symbol = 'BINANCE'
AND symbol_id = 'ETHBTC'
AND time_open >= now() - interval '5 months' -- frame_start
AND time_open < now() -- frame_end
GROUP BY 1
ORDER BY 1;
Related:
Resample on time series data
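For bucket widths that date_trunc() cannot express (the question mentions 5 minutes, 10 minutes or 2 days), a grid-based variant that decides bucket membership on time_open alone might look like this sketch. The alias names are illustrative, and the filters on exchange_symbol and symbol_id are moved into the join condition on the assumption that empty buckets should still be returned:

```sql
WITH grid AS (
   SELECT start_time
        , lead(start_time, 1, 'infinity'::timestamptz)
             OVER (ORDER BY start_time) AS end_time  -- last bucket is open-ended
   FROM   generate_series(now() - interval '5 months'
                        , now()
                        , interval '1 hour') AS g(start_time)
   )
SELECT g.start_time, max(o.high) AS max_high
FROM   grid g
LEFT   JOIN historical_ohlcv o
       ON  o.exchange_symbol = 'BINANCE'
       AND o.symbol_id = 'ETHBTC'
       AND o.time_open >= g.start_time  -- lower bound included
       AND o.time_open <  g.end_time    -- upper bound excluded
GROUP  BY g.start_time
ORDER  BY g.start_time;
```

Each row lands in exactly one bucket, so intervals that cross a bucket border are no longer dropped.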
It's hard to talk about further performance optimization while basics are unclear. And we'd need more information.
Are WHERE conditions variable?
How many distinct values in exchange_symbol and symbol_id?
Avg. row size? What do you get for:
SELECT avg(pg_column_size(t)) FROM historical_ohlcv t TABLESAMPLE SYSTEM (0.1);
Is the table read-only?
Assuming that you always filter on exchange_symbol and symbol_id, that the values are variable, and that your table is read-only or autovacuum can keep up with the write load (so we can hope for index-only scans), you would best have a multicolumn index on (exchange_symbol, symbol_id, time_open, high DESC) to support this query. Index columns in this order. Related:
Multicolumn index and performance
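Spelled out, the suggested index could be created like this (the index name is an arbitrary choice for the sketch):

```sql
-- Equality columns first, then the range column, then the aggregated column.
CREATE INDEX historical_ohlcv_symbol_time_high_idx
   ON historical_ohlcv (exchange_symbol, symbol_id, time_open, high DESC);
```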
Depending on data distribution and other details a LEFT JOIN LATERAL solution might be another option. Related:
How to find an average of values for time intervals in postgres
Optimize GROUP BY query to retrieve latest record per user
Aside from all that, your EXPLAIN plan exhibits some very bad estimates:
https://explain.depesz.com/s/E5yI
Are you using a current version of Postgres? You may have to work on your server configuration - or at least set higher statistics targets on relevant columns and more aggressive autovacuum settings for the big table. Related:
Keep PostgreSQL from sometimes choosing a bad query plan
Aggressive Autovacuum on PostgreSQL
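As a sketch of what such per-table tuning could look like: the statistics target of 1000 and the scale factors below are illustrative starting points, not values derived from this workload.

```sql
-- Larger statistics samples for the filter columns (illustrative target).
ALTER TABLE historical_ohlcv ALTER COLUMN symbol_id SET STATISTICS 1000;
ALTER TABLE historical_ohlcv ALTER COLUMN time_open SET STATISTICS 1000;

-- Trigger autovacuum and autoanalyze after a smaller share of rows has changed
-- than the global defaults would require (illustrative factors).
ALTER TABLE historical_ohlcv SET (
   autovacuum_vacuum_scale_factor  = 0.01,
   autovacuum_analyze_scale_factor = 0.005
);

-- Refresh statistics right away so the new targets take effect.
ANALYZE historical_ohlcv;
```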

How to optimize query by index PostgreSQL

I want to fetch users that have 1 or more processed bets. I do this with the following SQL:
SELECT user_id FROM bets
WHERE bets.state in ('guessed', 'losed')
GROUP BY user_id
HAVING count(*) > 0;
But running EXPLAIN ANALYZE, I noticed that no index is used and query execution time is very high. I tried adding a partial index like:
CREATE INDEX processed_bets_index ON bets(state) WHERE state in ('guessed', 'losed');
But the EXPLAIN ANALYZE output did not change:
HashAggregate (cost=34116.36..34233.54 rows=9375 width=4) (actual time=235.195..237.623 rows=13310 loops=1)
Filter: (count(*) > 0)
-> Seq Scan on bets (cost=0.00..30980.44 rows=627184 width=4) (actual time=0.020..150.346 rows=626674 loops=1)
Filter: ((state)::text = ANY ('{guessed,losed}'::text[]))
Rows Removed by Filter: 20951
Total runtime: 238.115 ms
(6 rows)
There are only a few records with states other than (guessed, losed).
How do I create a proper index?
I'm using PostgreSQL 9.3.4.
I assume that the state mostly consists of 'guessed' and 'losed', with maybe a few other states as well in there. So most probably the optimizer does not see the need to use the index, since it would still fetch most of the rows.
What you do need is an index on the user_id, so perhaps something like this would work:
CREATE INDEX idx_bets_user_id_in_guessed_losed ON bets(user_id) WHERE state in ('guessed', 'losed');
Or, by not using a partial index:
CREATE INDEX idx_bets_state_user_id ON bets(state, user_id);
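To see whether either index is picked up, a check along these lines could be run after creating it. The sketch also drops the HAVING count(*) > 0 clause, on the reasoning that every group produced by GROUP BY already contains at least one row; whether the planner prefers the index over a sequential scan still depends on the statistics and the Postgres version:

```sql
-- Refresh statistics so the planner knows about the current data distribution.
ANALYZE bets;

-- Same result as the original query, without the always-true HAVING filter.
EXPLAIN ANALYZE
SELECT user_id
FROM   bets
WHERE  state IN ('guessed', 'losed')
GROUP  BY user_id;
```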

Help to choose NoSQL database for project

There is a table:
doc_id(integer)-value(integer)
Approximately 100.000 doc_ids and 27.000.000 rows.
The most common query on this table searches for documents similar to the current document:
select 10 documents with maximum of
(count of values in common with the current document)/(count of values in document).
Currently we use PostgreSQL. Table size (with index) is ~1,5 GB. Average query time is ~0.5s, which is too high. And, in my opinion, this time will grow exponentially as the database grows.
Should I move all this to a NoSQL database and, if so, which one?
QUERY:
EXPLAIN ANALYZE
SELECT D.doc_id as doc_id,
(count(D.doc_crc32) *1.0 / testing.get_count_by_doc_id(D.doc_id))::real as avg_doc
FROM testing.text_attachment D
WHERE D.doc_id !=29758 -- 29758 - is random id
AND D.doc_crc32 IN (select testing.get_crc32_rows_by_doc_id(29758)) -- get_crc32... is IMMUTABLE
GROUP BY D.doc_id
ORDER BY avg_doc DESC
LIMIT 10
Limit (cost=95.23..95.26 rows=10 width=8) (actual time=1849.601..1849.641 rows=10 loops=1)
-> Sort (cost=95.23..95.28 rows=20 width=8) (actual time=1849.597..1849.609 rows=10 loops=1)
Sort Key: (((((count(d.doc_crc32))::numeric * 1.0) / (testing.get_count_by_doc_id(d.doc_id))::numeric))::real)
Sort Method: top-N heapsort Memory: 25kB
-> HashAggregate (cost=89.30..94.80 rows=20 width=8) (actual time=1211.835..1847.578 rows=876 loops=1)
-> Nested Loop (cost=0.27..89.20 rows=20 width=8) (actual time=7.826..928.234 rows=167771 loops=1)
-> HashAggregate (cost=0.27..0.28 rows=1 width=4) (actual time=7.789..11.141 rows=1863 loops=1)
-> Result (cost=0.00..0.26 rows=1 width=0) (actual time=0.130..4.502 rows=1869 loops=1)
-> Index Scan using crc32_idx on text_attachment d (cost=0.00..88.67 rows=20 width=8) (actual time=0.022..0.236 rows=90 loops=1863)
Index Cond: (d.doc_crc32 = (testing.get_crc32_rows_by_doc_id(29758)))
Filter: (d.doc_id <> 29758)
Total runtime: 1849.753 ms
(12 rows)
1.5 GByte is nothing. Serve from RAM. Build a data structure that helps you search.
I don't think your main problem here is the kind of database you're using but the fact that you don't in fact have an "index" for what you're searching: similarity between documents.
My proposal is to determine once which are the 10 documents similar to each of the 100.000 doc_ids and cache the result in a new table like this:
doc_id(integer)-similar_doc(integer)-score(integer)
where you'll insert 10 rows per document, each of them representing one of its 10 best matches. You'll get 1.000.000 rows (10 for each of the 100.000 documents), which you can access directly by index; that should bring search time down to something like O(log n) (depending on index implementation).
Then, on each insertion or removal of a document (or one of its values) you iterate through the documents and update the new table accordingly.
e.g. when a new document is inserted:
for each of the documents already in the table
you calculate its match score and
if the score is higher than the lowest score of the similar documents cached in the new table you swap in the similar_doc and score of the newly inserted document
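In PostgreSQL itself, the cache table and its initial fill for one document could be sketched as below. The table name testing.similar_docs is invented for the example, score is stored as real rather than integer because the ratio is fractional, and the self-join assumes that each (doc_id, doc_crc32) pair occurs only once in testing.text_attachment:

```sql
-- Hypothetical cache table: the 10 best matches per document.
CREATE TABLE testing.similar_docs (
   doc_id      integer NOT NULL,
   similar_doc integer NOT NULL,
   score       real    NOT NULL,
   PRIMARY KEY (doc_id, similar_doc)
);

-- Fill the cache for one document (29758, the id used in the question).
INSERT INTO testing.similar_docs (doc_id, similar_doc, score)
SELECT 29758
     , d.doc_id
     , (count(*) * 1.0 / testing.get_count_by_doc_id(d.doc_id))::real AS score
FROM   testing.text_attachment cur
JOIN   testing.text_attachment d ON  d.doc_crc32 = cur.doc_crc32  -- shared values
                                 AND d.doc_id   <> cur.doc_id
WHERE  cur.doc_id = 29758
GROUP  BY d.doc_id
ORDER  BY score DESC
LIMIT  10;

-- The search then becomes a primary-key range lookup.
SELECT similar_doc, score
FROM   testing.similar_docs
WHERE  doc_id = 29758
ORDER  BY score DESC;
```

The expensive join then runs once per document when the cache is built or refreshed, instead of on every search.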
If you're getting that bad performance out of PostgreSQL, a good start would be to tune PostgreSQL, your query and possibly your datamodel. A query like that should serve a lot faster on such a small table.
First, is 0.5s a problem or not? And did you already optimize your queries, datamodel and configuration settings? If not, you can still get better performance. Performance is a choice.
Besides speed, there is also functionality; that's what you will lose.
===
What about pushing the function to a JOIN:
EXPLAIN ANALYZE
SELECT
D.doc_id as doc_id,
(count(D.doc_crc32) *1.0 / testing.get_count_by_doc_id(D.doc_id))::real as avg_doc
FROM
testing.text_attachment D
JOIN (SELECT testing.get_crc32_rows_by_doc_id(29758) AS r) AS crc ON D.doc_crc32 = r
WHERE
D.doc_id <> 29758
GROUP BY D.doc_id
ORDER BY avg_doc DESC
LIMIT 10