Why does the following join increase the query time significantly? - sql

I have a star schema here and I am querying the fact table and would like to join one very small dimension table. I can't really explain the following:
EXPLAIN ANALYZE SELECT
COUNT(impression_id), imp.os_id
FROM bi.impressions imp
GROUP BY imp.os_id;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
HashAggregate (cost=868719.08..868719.24 rows=16 width=10) (actual time=12559.462..12559.466 rows=26 loops=1)
-> Seq Scan on impressions imp (cost=0.00..690306.72 rows=35682472 width=10) (actual time=0.009..3030.093 rows=35682474 loops=1)
Total runtime: 12559.523 ms
(3 rows)
This takes ~12,600 ms, but of course there is no joined data, so I can't "resolve" imp.os_id to anything meaningful. So I add a join:
EXPLAIN ANALYZE SELECT
COUNT(impression_id), imp.os_id, os.os_desc
FROM bi.impressions imp, bi.os_desc os
WHERE imp.os_id=os.os_id
GROUP BY imp.os_id, os.os_desc;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------
HashAggregate (cost=1448560.83..1448564.99 rows=416 width=22) (actual time=25565.124..25565.127 rows=26 loops=1)
-> Hash Join (cost=1.58..1180942.29 rows=35682472 width=22) (actual time=0.046..15157.684 rows=35682474 loops=1)
Hash Cond: (imp.os_id = os.os_id)
-> Seq Scan on impressions imp (cost=0.00..690306.72 rows=35682472 width=10) (actual time=0.007..3705.647 rows=35682474 loops=1)
-> Hash (cost=1.26..1.26 rows=26 width=14) (actual time=0.028..0.028 rows=26 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 2kB
-> Seq Scan on os_desc os (cost=0.00..1.26 rows=26 width=14) (actual time=0.003..0.010 rows=26 loops=1)
Total runtime: 25565.199 ms
(8 rows)
This effectively doubles the execution time of my query. My question is: what did I leave out of the picture? I wouldn't expect such a small lookup to cause a huge difference in query execution time.

Rewritten with (recommended) explicit ANSI JOIN syntax:
SELECT COUNT(impression_id), imp.os_id, os.os_desc
FROM bi.impressions imp
JOIN bi.os_desc os ON os.os_id = imp.os_id
GROUP BY imp.os_id, os.os_desc;
First of all, your second query might be wrong if more or fewer than exactly one match is found in os_desc for every row in impressions.
This can be ruled out if you have a foreign key constraint on os_id in place that guarantees referential integrity, plus a NOT NULL constraint on bi.impressions.os_id. If so, as a first step, simplify to:
SELECT COUNT(*) AS ct, imp.os_id, os.os_desc
FROM bi.impressions imp
JOIN bi.os_desc os USING (os_id)
GROUP BY imp.os_id, os.os_desc;
count(*) is faster than count(column) and equivalent here if the column is NOT NULL. And add a column alias for the count.
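(If those constraints are not already in place, they might look roughly like this; the constraint name is made up and bi.os_desc.os_id is assumed to be the primary key.)
ALTER TABLE bi.impressions ALTER COLUMN os_id SET NOT NULL;

ALTER TABLE bi.impressions
    ADD CONSTRAINT impressions_os_id_fkey
    FOREIGN KEY (os_id) REFERENCES bi.os_desc (os_id);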
Faster yet:
SELECT os_id, os.os_desc, sub.ct
FROM (
   SELECT os_id, COUNT(*) AS ct
   FROM bi.impressions
   GROUP BY 1
   ) sub
JOIN bi.os_desc os USING (os_id);
Aggregate first, join later. More here:
Aggregate a single column in query with many columns
PostgreSQL - order by an array

HashAggregate (cost=868719.08..868719.24 rows=16 width=10)
HashAggregate (cost=1448560.83..1448564.99 rows=416 width=22)
Hmm, width from 10 to 22 is a doubling. Perhaps you should join after grouping instead of before?
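Something along these lines (a sketch of that idea: aggregate on the narrow fact table first, then join the small dimension):
SELECT sub.os_id, os.os_desc, sub.ct
FROM (
    SELECT os_id, COUNT(impression_id) AS ct
    FROM bi.impressions
    GROUP BY os_id
) sub
JOIN bi.os_desc os ON os.os_id = sub.os_id;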

The following query solves the problem without increasing the query execution time. The question still stands as to why the execution time increases significantly when adding a very simple join, but it might be a Postgres-specific question and somebody with extensive experience in the area might answer it eventually.
WITH OSES AS (SELECT os_id, os_desc FROM bi.os_desc)
SELECT
    COUNT(impression_id) AS imp_count,
    os_desc
FROM bi.impressions imp,
     OSES os
WHERE os.os_id = imp.os_id
GROUP BY os_desc
ORDER BY imp_count;

Related

Query plan difference inner join/right join "greatest-n-per-group", self joined, aggregated query

For a small Postgres 10 data warehouse I was checking for improvements in our analytics queries and discovered a rather slow query where the possible improvement basically boiled down to this subquery (classic greatest-n-per-group problem):
SELECT s_postings.*
FROM dwh.s_postings
JOIN (SELECT s_postings.id,
max(s_postings.load_dts) AS load_dts
FROM dwh.s_postings
GROUP BY s_postings.id) AS current_postings
ON s_postings.id = current_postings.id AND s_postings.load_dts = current_postings.load_dts
With the following execution plan:
"Gather (cost=23808.51..38602.59 rows=66 width=376) (actual time=1385.927..1810.844 rows=170847 loops=1)"
" Workers Planned: 2"
" Workers Launched: 2"
" -> Hash Join (cost=22808.51..37595.99 rows=28 width=376) (actual time=1199.647..1490.652 rows=56949 loops=3)"
" Hash Cond: (((s_postings.id)::text = (s_postings_1.id)::text) AND (s_postings.load_dts = (max(s_postings_1.load_dts))))"
" -> Parallel Seq Scan on s_postings (cost=0.00..14113.25 rows=128425 width=376) (actual time=0.016..73.604 rows=102723 loops=3)"
" -> Hash (cost=20513.00..20513.00 rows=153034 width=75) (actual time=1195.616..1195.616 rows=170847 loops=3)"
" Buckets: 262144 Batches: 1 Memory Usage: 20735kB"
" -> HashAggregate (cost=17452.32..18982.66 rows=153034 width=75) (actual time=836.694..1015.499 rows=170847 loops=3)"
" Group Key: s_postings_1.id"
" -> Seq Scan on s_postings s_postings_1 (cost=0.00..15911.21 rows=308221 width=75) (actual time=0.032..251.122 rows=308168 loops=3)"
"Planning time: 1.184 ms"
"Execution time: 1912.865 ms"
The row estimate is absolutely wrong! What is weird to me is what happens if I change the join to a RIGHT JOIN:
SELECT s_postings.*
FROM dwh.s_postings
RIGHT JOIN (SELECT s_postings.id,
max(s_postings.load_dts) AS load_dts
FROM dwh.s_postings
GROUP BY s_postings.id) AS current_postings
ON s_postings.id = current_postings.id AND s_postings.load_dts = current_postings.load_dts
With the execution plan:
"Hash Right Join (cost=22829.85..40375.62 rows=153177 width=376) (actual time=814.097..1399.673 rows=170848 loops=1)"
" Hash Cond: (((s_postings.id)::text = (s_postings_1.id)::text) AND (s_postings.load_dts = (max(s_postings_1.load_dts))))"
" -> Seq Scan on s_postings (cost=0.00..15926.10 rows=308510 width=376) (actual time=0.011..144.584 rows=308419 loops=1)"
" -> Hash (cost=20532.19..20532.19 rows=153177 width=75) (actual time=812.587..812.587 rows=170848 loops=1)"
" Buckets: 262144 Batches: 1 Memory Usage: 20735kB"
" -> HashAggregate (cost=17468.65..19000.42 rows=153177 width=75) (actual time=553.633..683.850 rows=170848 loops=1)"
" Group Key: s_postings_1.id"
" -> Seq Scan on s_postings s_postings_1 (cost=0.00..15926.10 rows=308510 width=75) (actual time=0.011..157.000 rows=308419 loops=1)"
"Planning time: 0.402 ms"
"Execution time: 1469.808 ms"
The row estimate is way better!
I am aware that, for example, parallel sequential scans can in some conditions decrease performance, but they should not change the row estimate!?
If I remember correctly, aggregate functions also block the proper use of indexes anyway, and I also don't see any potential gains from additional multivariate statistics, e.g. for the tuple (id, load_dts). The database is VACUUM ANALYZEd.
For me the queries are logically the same.
Is there a way to support the query planner to make better assumptions about the estimates or improve the query? Maybe somebody knows a reason why this difference exists?
Edit: Previously the join condition was ON s_postings.id::text = current_postings.id::text
I changed that to ON s_postings.id = current_postings.id to not confuse anybody. Removing this conversion does not change the query plan.
Edit2: As suggested below there is a different solution to the greatest-n-per-group problem:
SELECT p.*
FROM (SELECT p.*,
RANK() OVER (PARTITION BY p.id ORDER BY p.load_dts DESC) as seqnum
FROM dwh.s_postings p
) p
WHERE seqnum = 1;
A really nice solution but sadly the query planner also underestimates the row count:
"Subquery Scan on p (cost=44151.67..54199.31 rows=1546 width=384) (actual time=1742.902..2594.359 rows=171269 loops=1)"
" Filter: (p.seqnum = 1)"
" Rows Removed by Filter: 137803"
" -> WindowAgg (cost=44151.67..50334.83 rows=309158 width=384) (actual time=1742.899..2408.240 rows=309072 loops=1)"
" -> Sort (cost=44151.67..44924.57 rows=309158 width=376) (actual time=1742.887..1927.325 rows=309072 loops=1)"
" Sort Key: p_1.id, p_1.load_dts DESC"
" Sort Method: quicksort Memory: 172275kB"
" -> Seq Scan on s_postings p_1 (cost=0.00..15959.58 rows=309158 width=376) (actual time=0.007..221.240 rows=309072 loops=1)"
"Planning time: 0.149 ms"
"Execution time: 2666.645 ms"
The difference in timing is not very large. It could easily just be caching effects. If you alternate between them back-to-back repeatedly, do you still get the difference? If you disable parallel execution by setting max_parallel_workers_per_gather = 0, does that equalize them?
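For example, something like this in a single session (just a sketch; run both variants a few times each):
SET max_parallel_workers_per_gather = 0;
-- EXPLAIN ANALYZE <join variant>;
-- EXPLAIN ANALYZE <right join variant>;
RESET max_parallel_workers_per_gather;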
The row estimate is absolutely wrong!
While this is obviously true, I don't think the misestimation is causing anything particularly bad to happen.
I am aware that for example parallel sequential scans can in some conditions decrease performance but they should not change the row estimate!?
Correct. It is the change in the JOIN type that causes the estimation change, and that in turn causes the change in parallelization. Thinking it has to push more tuples up to the leader (rather than disqualifying them down in the workers) discourages parallel plans, due to parallel_tuple_cost.
If I remember correctly aggregate functions also block the proper use of indexes
No, an index on (id, load_dts) or even just (id) should be usable for doing the aggregation, but since you need to read the entire table, it will probably be slower to read the entire index and entire table, than it is to just read the entire table into a HashAgg. You can test if PostgreSQL thinks it is capable of using such an index by setting enable_seqscan=off. If it does the seq scan anyway, then it doesn't think the index is usable. Otherwise, it just thinks using the index is counterproductive.
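A sketch of that test, assuming an index like the one below exists (the index name is made up):
CREATE INDEX s_postings_id_load_dts_idx ON dwh.s_postings (id, load_dts);

SET enable_seqscan = off;
EXPLAIN SELECT id, max(load_dts) FROM dwh.s_postings GROUP BY id;
RESET enable_seqscan;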
Is there a way to support the query planner to make better assumptions about the estimates or improve the query? Maybe somebody knows a reason why this difference exists?
The planner lacks the insight to know that every id, max(load_dts) from the derived table must have come from at least one row in the original table. Instead it applies the two conditions in the ON as independent variables, and doesn't even know what the most common values/histograms for your derived table will be, so it can't predict the degree of overlap. But with the RIGHT JOIN, it knows that every row in the derived table gets returned, whether a match is found in the "other" table or not. If you create a temp table from your derived subquery and ANALYZE it, then use that table in the join, you should get better estimates, because it at least knows how much the distributions in each column overlap. But those better estimates are not likely to lead to hugely better plans, so I wouldn't bother with that complexity.
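A sketch of that temp-table variant, in case you want to see how the estimates change anyway:
CREATE TEMP TABLE current_postings AS
SELECT id, max(load_dts) AS load_dts
FROM dwh.s_postings
GROUP BY id;

ANALYZE current_postings;

EXPLAIN ANALYZE
SELECT s.*
FROM dwh.s_postings s
JOIN current_postings c
  ON s.id = c.id AND s.load_dts = c.load_dts;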
You can probably get some marginal speed up by rewriting it into a DISTINCT ON query, but it won't be magically better. Also note that these are not equivalent: the join will return all rows which are tied for first place within a given id, while DISTINCT ON will return an arbitrary one of them (unless you add columns to the ORDER BY to break the ties).
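A minimal sketch of that DISTINCT ON rewrite (the extra tie-breaking column in the comment is hypothetical):
SELECT DISTINCT ON (id) *
FROM dwh.s_postings
ORDER BY id, load_dts DESC;
-- for a deterministic pick among ties: ORDER BY id, load_dts DESC, some_tiebreaker_column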
Use window functions:
SELECT p.*
FROM (SELECT p.*,
RANK() OVER (PARTITION BY p.id ORDER BY p.load_dts DESC) as seqnum
FROM dwh.s_postings p
) p
WHERE seqnum = 1;
Or, better yet, if you want one row per id use DISTINCT ON:
SELECT DISTINCT ON (p.id) p.*
FROM dwh.s_postings p
ORDER BY p.id, p.load_dts DESC;
If I had to speculate, the conversion of the id -- which is utterly unnecessary -- throws off the optimizer. With the right join it is clear that all rows are kept from one of the tables, and that might help the statistics calculation.

Postgresql LIKE ANY versus LIKE

I've tried to be thorough in this question, so if you're impatient, just jump to the end to see what the actual question is...
I'm working on adjusting how some search features in one of our databases are implemented. To this end, I'm adding some wildcard capabilities to our application's API that interfaces back to Postgresql.
The issue I've found is that the EXPLAIN ANALYZE times do not make sense to me, and I'm trying to figure out where I could be going wrong; it doesn't seem likely that 15 queries would be better than just one optimized query!
The table, Words, has two relevant columns for this question: id and text. The text column has an index on it that was built with the text_pattern_ops option. Here's what I'm seeing:
First, using a LIKE ANY with a VALUES clause, which some references seem to indicate would be ideal in my case (finding multiple words):
events_prod=# explain analyze select distinct id from words where words.text LIKE ANY (values('test%'));
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
HashAggregate (cost=6716668.40..6727372.85 rows=1070445 width=4) (actual time=103088.381..103091.468 rows=256 loops=1)
Group Key: words.id
-> Nested Loop Semi Join (cost=0.00..6713992.29 rows=1070445 width=4) (actual time=0.670..103087.904 rows=256 loops=1)
Join Filter: ((words.text)::text ~~ "*VALUES*".column1)
Rows Removed by Join Filter: 214089311
-> Seq Scan on words (cost=0.00..3502655.91 rows=214089091 width=21) (actual time=0.017..25232.135 rows=214089567 loops=1)
-> Materialize (cost=0.00..0.02 rows=1 width=32) (actual time=0.000..0.000 rows=1 loops=214089567)
-> Values Scan on "*VALUES*" (cost=0.00..0.01 rows=1 width=32) (actual time=0.006..0.006 rows=1 loops=1)
Planning time: 0.226 ms
Execution time: 103106.296 ms
(10 rows)
As you can see, the execution time is horrendous.
A second attempt, using LIKE ANY(ARRAY[... yields:
events_prod=# explain analyze select distinct id from words where words.text LIKE ANY(ARRAY['test%']);
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------
HashAggregate (cost=3770401.08..3770615.17 rows=21409 width=4) (actual time=37399.573..37399.704 rows=256 loops=1)
Group Key: id
-> Seq Scan on words (cost=0.00..3770347.56 rows=21409 width=4) (actual time=0.224..37399.054 rows=256 loops=1)
Filter: ((text)::text ~~ ANY ('{test%}'::text[]))
Rows Removed by Filter: 214093922
Planning time: 0.611 ms
Execution time: 37399.895 ms
(7 rows)
As you can see, performance is dramatically improved, but still far from ideal... 37 seconds with one word in the list. Moving up to three words that return a total of 256 rows pushes the execution time to well over 100 seconds.
The last try, doing a LIKE for a single word:
events_prod=# explain analyze select distinct id from words where words.text LIKE 'test%';
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------
HashAggregate (cost=60.14..274.23 rows=21409 width=4) (actual time=1.437..1.576 rows=256 loops=1)
Group Key: id
-> Index Scan using words_special_idx on words (cost=0.57..6.62 rows=21409 width=4) (actual time=0.048..1.258 rows=256 loops=1)
Index Cond: (((text)::text ~>=~ 'test'::text) AND ((text)::text ~<~ 'tesu'::text))
Filter: ((text)::text ~~ 'test%'::text)
Planning time: 0.826 ms
Execution time: 1.858 ms
(7 rows)
As expected, this is the fastest, but the 1.85ms makes me wonder if there is something else I'm missing with the VALUES and ARRAY approach.
The Question
Is there some more efficient way to do something like this in Postgresql that I've missed in my research?
select distinct id
from words
where words.text LIKE ANY(ARRAY['word1%', 'another%', 'third%']);
This is a bit speculative. I think the key is your pattern:
where words.text LIKE 'test%'
Note that the like pattern starts with a constant string. This means that Postgres can do a range scan on the index for the words that start with 'test'.
When you then introduce multiple comparisons, the optimizer gets confused and no longer considers multiple range scans. Instead, it decides that it needs to process all the rows.
This may be a case where this re-write gives you the performance that you want:
select id
from words
where words.text LIKE 'word1%'
union
select id
from words
where words.text LIKE 'another%'
union
select id
from words
where words.text LIKE 'third%';
Notes:
The distinct is not needed because of the union.
If the pattern starts with a wildcard, then a full scan is needed anyway.
You might want to consider an n-gram or full-text index on the table.
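On the last two notes: a trigram index via the pg_trgm extension, for example, can serve LIKE patterns even with a leading wildcard (a sketch; the index name is made up):
CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE INDEX words_text_trgm_idx ON words USING gin (text gin_trgm_ops);

-- can use the index even though the pattern starts with a wildcard
SELECT DISTINCT id FROM words WHERE words.text LIKE '%test%';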

GroupAggregate for Subquery in Redshift/PostgreSQL

I've noticed some strange behavior in the query optimizer for Redshift, and I'm wondering if anyone can explain it or point out a workaround.
For large group by queries, it's pretty essential to get the optimizer to plan a GroupAggregate rather than a HashAggregate, so it doesn't try to fit the temporary results in memory. This works fine for me in general. But when I try to use that group by as a subquery, it switches to HashAggregate.
For example, consider the following query.
select install_app_version, user_id, max(platform) as plat
from dailies
group by install_app_version, user_id;
The table dailies has sortkeys (install_app_version, user_id) and distkey (user_id). Hence a GroupAggregate is possible, and the query plan looks like this, as it should.
XN GroupAggregate (cost=0.00..184375.32 rows=1038735 width=51)
-> XN Seq Scan on daily_players (cost=0.00..103873.42 rows=10387342 width=51)
In contrast, if I use the above in a subquery of any other query, I get a HashAggregate. For example, even something as simple as
select count(1) from
( select install_app_version, user_id, max(platform) as plat
from daily_players
group by install_app_version, user_id
);
has the query plan
XN Aggregate (cost=168794.32..168794.32 rows=1 width=0)
-> XN Subquery Scan derived_table1 (cost=155810.13..166197.48 rows=1038735 width=0)
-> XN HashAggregate (cost=155810.13..155810.13 rows=1038735 width=39)
-> XN Seq Scan on daily_players (cost=0.00..103873.42 rows=10387342 width=39)
The same pattern persists no matter what I do in the outer query. I can group by install_app_version and user_id, I can take aggregates, I can do no grouping at all externally. Even sorting the inner query does nothing.
In the cases I've shown it's not such a big deal, but I'm joining several subqueries with their own group by, doing aggregates over that - it quickly gets out of hand and very slow without GroupAggregate.
If anyone has wisdom about the query optimizer and can answer this, it'd be much appreciated! Thanks!
I don't know if your question is still open, but I put this here because I think others could be interested.
Redshift seems to perform GROUP BY aggregation with HashAggregate by default (even when the conditions for GroupAggregate are right), and switches to GroupAggregate only when there is at least one aggregate computation THAT NEEDS TO BE RESOLVED FOR THE QUERY TO RETURN. What I mean is that, in your example, the "max(platform) as plat" is of no use for the final "COUNT(1)" result of the query. I believe that, in such a case, the aggregate computation of the MAX() function is not performed at all.
The workaround I use is to add a useless HAVING clause that does nothing but still needs to be computed (for example "HAVING COUNT(1)"). This always returns true (because each group has COUNT(1) of at least 1), but it enables the query plan to use GroupAggregate.
Example:
EXPLAIN SELECT COUNT(*) FROM (SELECT mycol FROM mytable GROUP BY 1);
XN Aggregate (cost=143754365.00..143754365.00 rows=1 width=0)
-> XN Subquery Scan derived_table1 (cost=141398732.80..143283238.56 rows=188450576 width=0)
-> XN HashAggregate (cost=141398732.80..141398732.80 rows=188450576 width=40)
-> XN Seq Scan on mytable (cost=0.00..113118986.24 rows=11311898624 width=40)
EXPLAIN SELECT COUNT(*) FROM (SELECT mycol FROM mytable GROUP BY 1 HAVING COUNT(1));
XN Aggregate (cost=171091871.18..171091871.18 rows=1 width=0)
-> XN Subquery Scan derived_table1 (cost=0.00..171091868.68 rows=1000 width=0)
-> XN GroupAggregate (cost=0.00..171091858.68 rows=1000 width=40)
Filter: ((count(1))::boolean = true)
-> XN Seq Scan on mytable (cost=0.00..113118986.24 rows=11311898624 width=40)
This works because 'mycol' is both the distkey and the sortkey of 'mytable'.
As you can see, the query plan estimates that the query with GroupAggregate is more costly than the one with HashAggregate (which must be what makes the planner choose HashAggregate). Do not trust that: in my example the second query runs up to 7 times faster than the first one! The nice thing is that the GroupAggregate does not need much memory to be computed, and so will almost never perform a 'Disk Based Aggregate'.
In fact, I realised it's an even better option to perform COUNT(DISTINCT x) with a subquery GroupAggregate than with the standard COUNT(DISTINCT x) (in my example, 'mycol' is a NOT NULL column):
EXPLAIN SELECT COUNT(DISTINCT mycol) FROM mytable ;
XN Aggregate (cost=143754365.00..143754365.00 rows=1 width=72)
-> XN Subquery Scan volt_dt_0 (cost=141398732.80..143283238.56 rows=188450576 width=72)
-> XN HashAggregate (cost=141398732.80..141398732.80 rows=188450576 width=40)
-> XN Seq Scan on mytable (cost=0.00..113118986.24 rows=11311898624 width=40)
3 minutes 46 s
EXPLAIN SELECT COUNT(*) FROM (SELECT mycol FROM mytable GROUP BY 1 HAVING COUNT(1));
XN Aggregate (cost=171091871.18..171091871.18 rows=1 width=0)
-> XN Subquery Scan derived_table1 (cost=0.00..171091868.68 rows=1000 width=0)
-> XN GroupAggregate (cost=0.00..171091858.68 rows=1000 width=40)
Filter: ((count(1))::boolean = true)
-> XN Seq Scan on mytable (cost=0.00..113118986.24 rows=11311898624 width=40)
40 seconds
Hope that helps!

Incorrect rows estimate for joins

I have a simple query (Postgres 9.4):
EXPLAIN ANALYZE
SELECT
COUNT(*)
FROM
bo_labels L
LEFT JOIN bo_party party ON (party.id = L.bo_party_fkey)
LEFT JOIN bo_document_base D ON (D.id = L.bo_doc_base_fkey)
LEFT JOIN bo_contract_hardwood_deal C ON (C.bo_document_fkey = D.id)
WHERE
party.inn = '?'
Explain looks like:
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=2385.30..2385.30 rows=1 width=0) (actual time=31762.367..31762.367 rows=1 loops=1)
-> Nested Loop Left Join (cost=1.28..2385.30 rows=1 width=0) (actual time=7.621..31760.776 rows=1694 loops=1)
Join Filter: ((c.bo_document_fkey)::text = (d.id)::text)
Rows Removed by Join Filter: 101658634
-> Nested Loop Left Join (cost=1.28..106.33 rows=1 width=10) (actual time=0.110..54.635 rows=1694 loops=1)
-> Nested Loop (cost=0.85..105.69 rows=1 width=9) (actual time=0.081..4.404 rows=1694 loops=1)
-> Index Scan using bo_party_inn_idx on bo_party party (cost=0.43..12.43 rows=3 width=10) (actual time=0.031..0.037 rows=3 loops=1)
Index Cond: (inn = '2534005760'::text)
-> Index Only Scan using bo_labels__party_fkey__docbase_fkey__tnved_fkey__idx on bo_labels l (cost=0.42..29.80 rows=1289 width=17) (actual time=0.013..1.041 rows=565 loops=3)
Index Cond: (bo_party_fkey = (party.id)::text)
Heap Fetches: 0
-> Index Only Scan using bo_document_pkey on bo_document_base d (cost=0.43..0.64 rows=1 width=10) (actual time=0.022..0.025 rows=1 loops=1694)
Index Cond: (id = (l.bo_doc_base_fkey)::text)
Heap Fetches: 1134
-> Seq Scan on bo_contract_hardwood_deal c (cost=0.00..2069.77 rows=59770 width=9) (actual time=0.003..11.829 rows=60012 loops=1694)
Planning time: 13.484 ms
Execution time: 31762.885 ms
http://explain.depesz.com/s/V2wn
What is very annoying is the incorrect row estimate:
Nested Loop (cost=0.85..105.69 rows=1 width=9) (actual time=0.081..4.404 rows=1694 loops=1)
Because of that, Postgres chooses nested loops and the query runs for about 30 seconds.
With SET LOCAL enable_nestloop = OFF; it completes in just a second.
What is also interesting: I have default_statistics_target = 10000 (the maximum value) and ran VACUUM VERBOSE ANALYZE on all 4 tables just beforehand.
As Postgres does not gather cross-table statistics, such a case is very likely to happen for other joins too.
Without the external extension pg_hint_plan it is not possible to change enable_nestloop for just that query.
Is there some other way I could try to force a faster plan for that query?
Update based on comments
I can't eliminate the join in any ordinary way. My main question: is there any possibility to change the statistics (for example) to include the desired values that break the normal statistical picture? Or maybe some other way to force Postgres to change the weighting of nested loops so that it uses them less often?
Could someone also explain, or point to documentation on, how the Postgres planner, for a nested loop over two inputs of 3 rows (exactly correct) and 1289 rows (really 565, but that estimation error is a different question), assumed that the result would be only 1 row? I'm talking about this part of the plan:
-> Nested Loop (cost=0.85..105.69 rows=1 width=9) (actual time=0.081..4.404 rows=1694 loops=1)
-> Index Scan using bo_party_inn_idx on bo_party party (cost=0.43..12.43 rows=3 width=10) (actual time=0.031..0.037 rows=3 loops=1)
Index Cond: (inn = '2534005760'::text)
-> Index Only Scan using bo_labels__party_fkey__docbase_fkey__tnved_fkey__idx on bo_labels l (cost=0.42..29.80 rows=1289 width=17) (actual time=0.013..1.041 rows=565 loops=3)
Index Cond: (bo_party_fkey = (party.id)::text)
At first glance it looks wrong. What statistics are used there, and how?
Does Postgres also maintain some statistics for indexes?
Actually, I don't have good sample data to test my answer, but I think it might help.
Based on your join columns I'm assuming the following relationship cardinalities:
1) bo_party (id 1:N bo_party_fkey) bo_labels
2) bo_labels (bo_doc_base_fkey N:1 id) bo_document_base
3) bo_document_base (id 1:N bo_document_fkey) bo_contract_hardwood_deal
You want to count how many rows were selected. So, based on the cardinalities in 1) and 2), the table "bo_labels" sits in a many-to-many relationship. This means that joining it with "bo_party" and "bo_document_base" will produce no more rows than already exist in the table.
But after joining "bo_document_base", another join is done to "bo_contract_hardwood_deal", whose cardinality, described in 3), is one-to-many, perhaps generating more rows in the final result.
This way, to find the right count of rows you can simplify the join structure to "bo_labels" and "bo_contract_hardwood_deal" through:
4) bo_labels (bo_doc_base_fkey 1:N bo_document_fkey) bo_contract_hardwood_deal
A sample query could be one of the following:
SELECT COUNT(*)
FROM bo_labels L
LEFT JOIN bo_contract_hardwood_deal C ON (C.bo_document_fkey = L.bo_doc_base_fkey)
WHERE 1=1
and exists
(
select 1
from bo_party party
where 1=1
and party.id = L.bo_party_fkey
and party.inn = '?'
)
;
or
SELECT sum((select COUNT(*) from bo_contract_hardwood_deal C where C.bo_document_fkey = L.bo_doc_base_fkey))
FROM bo_labels L
WHERE 1=1
and exists
(
select 1
from bo_party party
where 1=1
and party.id = L.bo_party_fkey
and party.inn = '?'
)
;
I could not test with large tables, so I don't know exactly if it will improve performance against your original query, but I think it might help.

Help to choose NoSQL database for project

There is a table:
doc_id(integer)-value(integer)
Approximately 100,000 doc_ids and 27,000,000 rows.
The majority of queries on this table search for documents similar to the current document:
select the 10 documents with the maximum of (count of values shared with the current document) / (count of values in the document).
Nowadays we use PostgreSQL. Table size (with index) is ~1.5 GB. Average query time is ~0.5 s, which is too high. And, in my opinion, this time will grow exponentially as the database grows.
Should I move all this to a NoSQL database, and if so, which one?
QUERY:
EXPLAIN ANALYZE
SELECT D.doc_id as doc_id,
(count(D.doc_crc32) *1.0 / testing.get_count_by_doc_id(D.doc_id))::real as avg_doc
FROM testing.text_attachment D
WHERE D.doc_id !=29758 -- 29758 - is random id
AND D.doc_crc32 IN (select testing.get_crc32_rows_by_doc_id(29758)) -- get_crc32... is IMMUTABLE
GROUP BY D.doc_id
ORDER BY avg_doc DESC
LIMIT 10
Limit (cost=95.23..95.26 rows=10 width=8) (actual time=1849.601..1849.641 rows=10 loops=1)
-> Sort (cost=95.23..95.28 rows=20 width=8) (actual time=1849.597..1849.609 rows=10 loops=1)
Sort Key: (((((count(d.doc_crc32))::numeric * 1.0) / (testing.get_count_by_doc_id(d.doc_id))::numeric))::real)
Sort Method: top-N heapsort Memory: 25kB
-> HashAggregate (cost=89.30..94.80 rows=20 width=8) (actual time=1211.835..1847.578 rows=876 loops=1)
-> Nested Loop (cost=0.27..89.20 rows=20 width=8) (actual time=7.826..928.234 rows=167771 loops=1)
-> HashAggregate (cost=0.27..0.28 rows=1 width=4) (actual time=7.789..11.141 rows=1863 loops=1)
-> Result (cost=0.00..0.26 rows=1 width=0) (actual time=0.130..4.502 rows=1869 loops=1)
-> Index Scan using crc32_idx on text_attachment d (cost=0.00..88.67 rows=20 width=8) (actual time=0.022..0.236 rows=90 loops=1863)
Index Cond: (d.doc_crc32 = (testing.get_crc32_rows_by_doc_id(29758)))
Filter: (d.doc_id <> 29758)
Total runtime: 1849.753 ms
(12 rows)
1.5 GB is nothing. Serve it from RAM. Build a data structure that helps you search.
I don't think your main problem here is the kind of database you're using, but the fact that you don't actually have an "index" for what you're searching: similarity between documents.
My proposal is to determine once, for each of the 100,000 doc_ids, which 10 documents are most similar to it, and cache the result in a new table like this:
doc_id(integer)-similar_doc(integer)-score(integer)
where you'll insert 10 rows per document, each representing one of its 10 best matches. You'll get about 1,000,000 rows, which you can access directly by index, which should take search time down to something like O(log n) (depending on the index implementation).
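A minimal sketch of such a cache table and its lookup (all names are made up):
CREATE TABLE doc_similarity (
    doc_id      integer NOT NULL,
    similar_doc integer NOT NULL,
    score       integer NOT NULL,
    PRIMARY KEY (doc_id, similar_doc)
);
CREATE INDEX doc_similarity_best_idx ON doc_similarity (doc_id, score DESC);

-- the per-document lookup becomes a cheap index scan
SELECT similar_doc, score
FROM doc_similarity
WHERE doc_id = 29758
ORDER BY score DESC
LIMIT 10;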
Then, on each insertion or removal of a document (or one of its values) you iterate through the documents and update the new table accordingly.
e.g. when a new document is inserted:
for each of the documents already in the table
you calculate its match score against the new document, and
if that score is higher than the lowest score of the similar documents cached in the new table, you swap in the similar_doc and score of the newly inserted document
If you're getting that bad performance out of PostgreSQL, a good start would be to tune PostgreSQL, your query and possibly your data model. A query like that should run a lot faster on such a small table.
First, is 0.5s a problem or not? And did you already optimize your queries, datamodel and configuration settings? If not, you can still get better performance. Performance is a choice.
Besides speed, there is also functionality; that's what you will lose.
===
What about pushing the function to a JOIN:
EXPLAIN ANALYZE
SELECT
D.doc_id as doc_id,
(count(D.doc_crc32) *1.0 / testing.get_count_by_doc_id(D.doc_id))::real as avg_doc
FROM
testing.text_attachment D
JOIN (SELECT testing.get_crc32_rows_by_doc_id(29758) AS r) AS crc ON D.doc_crc32 = r
WHERE
D.doc_id <> 29758
GROUP BY D.doc_id
ORDER BY avg_doc DESC
LIMIT 10