Postgres: which index to add

I have a table that is mainly used by this query (only 3 columns are in use here: meter, timeStampUtc and createdOnUtc, but there are others in the table), which is starting to take too long:
select
rank() over (order by mr.meter, mr."timeStampUtc") as row_name
, max(mr."createdOnUtc") over (partition by mr.meter, mr."timeStampUtc") as "createdOnUtc"
from
"MeterReading" mr
where
"createdOnUtc" >= '2021-01-01'
order by row_name
;
(this is the minimal query to show my issue. It might not make too much sense on its own, or could be rewritten)
I am wondering which index (or other technique) to use to optimise this particular query.
A basic index on createdOnUtc helps already.
I am mostly wondering about those two window functions. They are very similar, so I factored them out into a named window (with identical partition by and order by); it had no effect. Adding an index on (meter, "timeStampUtc") had no effect either (the query plan was unchanged).
Is there no way to use an index on those 2 columns inside a window function?
Edit - explain analyze output: using the createdOnUtc index
Sort (cost=8.51..8.51 rows=1 width=40) (actual time=61.045..62.222 rows=26954 loops=1)
Sort Key: (rank() OVER (?))
Sort Method: quicksort Memory: 2874kB
-> WindowAgg (cost=8.46..8.50 rows=1 width=40) (actual time=18.373..57.892 rows=26954 loops=1)
-> WindowAgg (cost=8.46..8.48 rows=1 width=40) (actual time=18.363..32.444 rows=26954 loops=1)
-> Sort (cost=8.46..8.46 rows=1 width=32) (actual time=18.353..19.663 rows=26954 loops=1)
Sort Key: meter, "timeStampUtc"
Sort Method: quicksort Memory: 2874kB
-> Index Scan using "MeterReading_createdOnUtc_idx" on "MeterReading" mr (cost=0.43..8.45 rows=1 width=32) (actual time=0.068..8.059 rows=26954 loops=1)
Index Cond: ("createdOnUtc" >= '2021-01-01 00:00:00'::timestamp without time zone)
Planning Time: 0.082 ms
Execution Time: 63.698 ms

Is there no way to use an index on those 2 columns inside a window function?
That is correct; a window function cannot use an index, as it works only on what would otherwise be the final result: all data selection has already finished. From the documentation:
The rows considered by a window function are those of the “virtual
table” produced by the query's FROM clause as filtered by its WHERE,
GROUP BY, and HAVING clauses if any. For example, a row removed
because it does not meet the WHERE condition is not seen by any window
function. A query can contain multiple window functions that slice up
the data in different ways using different OVER clauses, but they all
act on the same collection of rows defined by this virtual table.
The purpose of an index is to speed up the creation of that "virtual table". Applying an index at the window-function stage would just slow things down: the data is already in memory, and scanning it there is orders of magnitude faster than any index.
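If the goal is to further speed up the step that builds that virtual table, one option worth trying (a sketch, not something the answer above prescribes; it assumes Postgres 11+ for INCLUDE and a reasonably fresh visibility map) is a covering index, so the filter can be satisfied by an index-only scan:

-- hypothetical covering index; the name is illustrative
create index "MeterReading_createdOnUtc_covering_idx"
    on "MeterReading" ("createdOnUtc")
    include (meter, "timeStampUtc");

The sort on (meter, "timeStampUtc") that feeds the window functions still has to happen afterwards, as explained above.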

Related

Simple aggregate query running slow

I am trying to determine why a fairly simple aggregate query is taking so long to perform on a single table. The table is called plots, and its columns are [id, device_id, time, ...]. There are two indexes: UNIQUE(id) and UNIQUE(device_id, time).
The query is simply:
SELECT device_id, MIN(time)
FROM plots
GROUP BY device_id
To me, this should be very fast, but it is taking 3+ minutes. The table has ~45 million rows, divided roughly equally among 1200 or so device_id's.
EXPLAIN for the query:
Finalize GroupAggregate (cost=1502955.41..1503055.97 rows=906 width=12)
Group Key: device_id
-> Gather Merge (cost=1502955.41..1503052.35 rows=906 width=12)
Workers Planned: 1
-> Sort (cost=1501955.41..1501955.86 rows=906 width=12)
Sort Key: device_id
-> Partial HashAggregate (cost=1501943.79..1501946.51 rows=906 width=12)
Group Key: device_id
-> Parallel Seq Scan on plots (cost=0.00..1476417.34 rows=25526447 width=12)
EXPLAIN for query with a where device_id = xxx:
GroupAggregate (cost=398.86..78038.77 rows=906 width=12)
Group Key: device_id
-> Bitmap Heap Scan on plots (cost=398.86..77992.99 rows=43065 width=12)
Recheck Cond: (device_id = 6780)
-> Bitmap Index Scan on index_plots_on_device_id_and_time (cost=0.00..396.71 rows=43065 width=0)
Index Cond: (device_id = 6780)
I have done VACUUM (FULL, ANALYZE) as well as REINDEX DATABASE.
I have also tried doing partition queries to accomplish the same.
Any pointers on making this faster? Or am I simply stuck because of the table size? It seems like it should be fine with the index, though. Maybe I am missing something...
EDIT / UPDATE:
The problem seems to be resolved at this point, though I am not sure why. I have dropped and rebuilt the index many times, and suddenly the query is only taking ~7 seconds, which is acceptable. Of note, this morning I dropped the index and created a new one with the reverse column order (time, device_id) and I was surprised to see good results. I then reverted to the previous index, and the results were improved further. I will refork the production database and try to retrace my steps and post an update. Should I be worried about the query planner wonking out in the future?
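For reference, the index changes described above looked roughly like this (a sketch; the index names are the ones that appear in the plans):

-- reverse column order, as tried this morning
DROP INDEX index_plots_on_device_id_and_time;
CREATE INDEX index_plots_time_and_device_id ON plots (time, device_id);

-- and later, reverting to the original definition
CREATE UNIQUE INDEX index_plots_on_device_id_and_time ON plots (device_id, time);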
Current EXPLAIN with analysis (as requested):
Finalize GroupAggregate (cost=1000.12..480787.58 rows=905 width=12) (actual time=36.299..7530.403 rows=916 loops=1)
Group Key: device_id
Buffers: shared hit=135087 read=40325
I/O Timings: read=138.419
-> Gather Merge (cost=1000.12..480783.96 rows=905 width=12) (actual time=36.226..7552.052 rows=1829 loops=1)
Workers Planned: 1
Workers Launched: 1
Buffers: shared hit=509502 read=160807
I/O Timings: read=639.797
-> Partial GroupAggregate (cost=0.11..479687.58 rows=905 width=12) (actual time=15.779..5026.094 rows=914 loops=2)
Group Key: device_id
Buffers: shared hit=509502 read=160807
I/O Timings: read=639.797
-> Parallel Index Only Scan using index_plots_time_and_device_id on plots (cost=0.11..454158.41 rows=25526447 width=12) (actual time=0.033..2999.764 rows=21697480 loops=2)
Heap Fetches: 0
Buffers: shared hit=509502 read=160807
I/O Timings: read=639.797
Planning Time: 0.092 ms
Execution Time: 7554.100 ms
(19 rows)
Approach 1:
You can try replacing your UNIQUE constraint with a plain index. CREATE UNIQUE INDEX and CREATE INDEX have different behaviors; I believe you can get benefits from a plain CREATE INDEX.
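A sketch of what that could look like (the constraint and index names here are hypothetical):

-- if the uniqueness comes from a constraint, drop it...
ALTER TABLE plots DROP CONSTRAINT plots_device_id_time_key;
-- ...and replace it with a plain (non-unique) index
CREATE INDEX plots_device_id_time_idx ON plots (device_id, time);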
Approach 2:
You can create a materialized view. If you can get some delay on your information, you can do the following:
CREATE MATERIALIZED VIEW myreport AS
SELECT device_id,
MIN(time) AS mintime
FROM plots
GROUP BY device_id;
-- REFRESH ... CONCURRENTLY (below) requires a unique index on the view
CREATE UNIQUE INDEX myreport_device_id ON myreport(device_id);
Also, you need to remember to regularly do:
REFRESH MATERIALIZED VIEW CONCURRENTLY myreport;
And less regularly do:
VACUUM ANALYZE myreport;

Improve PostgreSQL aggregation query performance

I am aggregating data from a Postgres table. The query takes approximately 2 seconds, and I want to reduce that to less than 1 second.
Please find below the execution details:
Query
select
a.search_keyword,
hll_cardinality( hll_union_agg(a.users) ):: int as user_count,
hll_cardinality( hll_union_agg(a.sessions) ):: int as session_count,
sum(a.total) as keyword_count
from
rollup_day a
where
a.created_date between '2018-09-01' and '2019-09-30'
and a.tenant_id = '62850a62-19ac-477d-9cd7-837f3d716885'
group by
a.search_keyword
order by
session_count desc
limit 100;
Table metadata
Total number of rows: 506,527
Composite index on columns: tenant_id and created_date
Query plan
Custom Scan (cost=0.00..0.00 rows=0 width=0) (actual time=1722.685..1722.694 rows=100 loops=1)
Task Count: 1
Tasks Shown: All
-> Task
Node: host=localhost port=5454 dbname=postgres
-> Limit (cost=64250.24..64250.49 rows=100 width=42) (actual time=1783.087..1783.106 rows=100 loops=1)
-> Sort (cost=64250.24..64558.81 rows=123430 width=42) (actual time=1783.085..1783.093 rows=100 loops=1)
Sort Key: ((hll_cardinality(hll_union_agg(sessions)))::integer) DESC
Sort Method: top-N heapsort Memory: 33kB
-> GroupAggregate (cost=52933.89..59532.83 rows=123430 width=42) (actual time=905.502..1724.363 rows=212633 loops=1)
Group Key: search_keyword
-> Sort (cost=52933.89..53636.53 rows=281055 width=54) (actual time=905.483..1351.212 rows=280981 loops=1)
Sort Key: search_keyword
Sort Method: external merge Disk: 18496kB
-> Seq Scan on rollup_day a (cost=0.00..17890.22 rows=281055 width=54) (actual time=29.720..112.161 rows=280981 loops=1)
Filter: ((created_date >= '2018-09-01'::date) AND (created_date <= '2019-09-30'::date) AND (tenant_id = '62850a62-19ac-477d-9cd7-837f3d716885'::uuid))
Rows Removed by Filter: 225546
Planning Time: 0.129 ms
Execution Time: 1786.222 ms
Planning Time: 0.103 ms
Execution Time: 1722.718 ms
What I've tried
I've tried indexes on tenant_id and created_date, but as the data is huge it always does a sequential scan rather than an index scan for the filters. I've read that the Postgres query planner switches to a sequential scan if the query returns more than about 5-10% of the total rows.
I've increased the work_mem to 100MB but it only improved the performance a little bit.
Any help would be really appreciated.
Update
Query plan after setting work_mem to 100MB
Custom Scan (cost=0.00..0.00 rows=0 width=0) (actual time=1375.926..1375.935 rows=100 loops=1)
Task Count: 1
Tasks Shown: All
-> Task
Node: host=localhost port=5454 dbname=postgres
-> Limit (cost=48348.85..48349.10 rows=100 width=42) (actual time=1307.072..1307.093 rows=100 loops=1)
-> Sort (cost=48348.85..48633.55 rows=113880 width=42) (actual time=1307.071..1307.080 rows=100 loops=1)
Sort Key: (sum(total)) DESC
Sort Method: top-N heapsort Memory: 35kB
-> GroupAggregate (cost=38285.79..43996.44 rows=113880 width=42) (actual time=941.504..1261.177 rows=172945 loops=1)
Group Key: search_keyword
-> Sort (cost=38285.79..38858.52 rows=229092 width=54) (actual time=941.484..963.061 rows=227261 loops=1)
Sort Key: search_keyword
Sort Method: quicksort Memory: 32982kB
-> Seq Scan on rollup_day_104290 a (cost=0.00..17890.22 rows=229092 width=54) (actual time=38.803..104.350 rows=227261 loops=1)
Filter: ((created_date >= '2019-01-01'::date) AND (created_date <= '2019-12-30'::date) AND (tenant_id = '62850a62-19ac-477d-9cd7-837f3d716885'::uuid))
Rows Removed by Filter: 279266
Planning Time: 0.131 ms
Execution Time: 1308.814 ms
Planning Time: 0.112 ms
Execution Time: 1375.961 ms
Update 2
After creating an index on created_date and increasing work_mem to 120MB:
create index date_idx on rollup_day(created_date);
The total number of rows is: 12,124,608
Query Plan is:
Custom Scan (cost=0.00..0.00 rows=0 width=0) (actual time=2635.530..2635.540 rows=100 loops=1)
Task Count: 1
Tasks Shown: All
-> Task
Node: host=localhost port=9702 dbname=postgres
-> Limit (cost=73545.19..73545.44 rows=100 width=51) (actual time=2755.849..2755.873 rows=100 loops=1)
-> Sort (cost=73545.19..73911.25 rows=146424 width=51) (actual time=2755.847..2755.858 rows=100 loops=1)
Sort Key: (sum(total)) DESC
Sort Method: top-N heapsort Memory: 35kB
-> GroupAggregate (cost=59173.97..67948.97 rows=146424 width=51) (actual time=2014.260..2670.732 rows=296537 loops=1)
Group Key: search_keyword
-> Sort (cost=59173.97..60196.85 rows=409152 width=55) (actual time=2013.885..2064.775 rows=410618 loops=1)
Sort Key: search_keyword
Sort Method: quicksort Memory: 61381kB
-> Index Scan using date_idx_102913 on rollup_day_102913 a (cost=0.42..21036.35 rows=409152 width=55) (actual time=0.026..183.370 rows=410618 loops=1)
Index Cond: ((created_date >= '2018-01-01'::date) AND (created_date <= '2018-12-31'::date))
Filter: (tenant_id = '12850a62-19ac-477d-9cd7-837f3d716885'::uuid)
Planning Time: 0.135 ms
Execution Time: 2760.667 ms
Planning Time: 0.090 ms
Execution Time: 2635.568 ms
You should experiment with higher settings of work_mem until you get an in-memory sort. Of course you can only be generous with memory if your machine has enough of it.
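For example, per session (the exact value, and the role name below, are only placeholders to be sized against your RAM and concurrency):

set work_mem = '200MB';
-- or persist it for a reporting role instead of changing it globally:
alter role reporting_user set work_mem = '200MB';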
What would make your query way faster is if you store pre-aggregated data, either using a materialized view or a second table and a trigger on your original table that keeps the sums in the other table updated. I don't know if that is possible with your data, as I don't know what hll_cardinality and hll_union_agg are.
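Assuming these are the aggregates from the postgresql-hll extension (whose sketches can be unioned again later), a rough illustration of the pre-aggregation idea, rolling the data up by month, might look like this. It only helps if your reports can tolerate some staleness and your date ranges align with whole months; all names are illustrative:

create materialized view rollup_month as
select tenant_id,
       search_keyword,
       date_trunc('month', created_date)::date as month,
       hll_union_agg(users)    as users,
       hll_union_agg(sessions) as sessions,
       sum(total)              as total
from rollup_day
group by tenant_id, search_keyword, date_trunc('month', created_date);

create index on rollup_month (tenant_id, month);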
Have you tried a covering index, so the optimizer will use the index and not do a sequential scan?
create index covering on rollup_day(tenant_id, created_date, search_keyword, users, sessions, total);
Or, if you are on Postgres 11:
create index covering on rollup_day(tenant_id, created_date) INCLUDE (search_keyword, users, sessions, total);
But since you also sort/group by search_keyword, maybe:
create index covering on rollup_day(tenant_id, created_date, search_keyword);
create index covering on rollup_day(tenant_id, search_keyword, created_date);
Or :
create index covering on rollup_day(tenant_id, created_date, search_keyword) INCLUDE (users, sessions, total);
create index covering on rollup_day(tenant_id, search_keyword, created_date) INCLUDE (users, sessions, total);
One of these indexes should make the query faster. You should only add one of these indexes.
Even if it makes this query faster, big indexes will or might make your write operations slower (in particular, HOT updates are not available when indexed columns are modified), and you will use more storage.
The idea came from here; there is also a hint there about sizing work_mem.
Here is another example where the index was not used.
Use table partitioning and create a composite index; it will bring down the total cost because:
it will save you a huge cost on scans;
partitions will segregate the data and will be very helpful for future purge operations as well.
I have personally tried and tested table partitioning in such cases, and the throughput is amazing with the combination of partitions and composite indexes.
Partitioning can be done on a range of the created date, with composite indexes on date and tenant.
Remember that you can always have a composite index with a condition in it (a partial index) if there is a very specific requirement for that condition in your query. This way the data will already be sorted in the index, which saves a huge cost on sort operations as well.
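A sketch of that layout with declarative partitioning (Postgres 10+; the hll column type comes from the postgresql-hll extension, and all names and ranges are illustrative):

create table rollup_day_part (
    tenant_id      uuid,
    created_date   date,
    search_keyword text,
    users          hll,
    sessions       hll,
    total          bigint
) partition by range (created_date);

create table rollup_day_2018 partition of rollup_day_part
    for values from ('2018-01-01') to ('2019-01-01');
create table rollup_day_2019 partition of rollup_day_part
    for values from ('2019-01-01') to ('2020-01-01');

-- composite index on tenant and date (a partitioned index, Postgres 11+)
create index on rollup_day_part (tenant_id, created_date);
-- and, if queries only ever care about rows with total > 0, a partial variant:
-- create index on rollup_day_part (tenant_id, created_date) where total > 0;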
Hope this helps.
PS: Also, is it possible to share any test sample data for the same?
My suggestion would be to break up the select.
In combination with this, I would also set up two indexes on the table: one on the dates, the other on the ID. One of the problems with UUID-style IDs is that they take time to compare and may effectively be compared as strings in the background. That is why I break the query up: to prefilter the data before the BETWEEN condition is evaluated. A BETWEEN condition can make a select slow, so I would split it into two selects and an inner join (I know the memory consumption is a problem).
Here is an example of what I mean. I hope the optimizer is smart enough to restructure your query:
SELECT
a.search_keyword,
hll_cardinality( hll_union_agg(a.users) ):: int as user_count,
hll_cardinality( hll_union_agg(a.sessions) ):: int as session_count,
sum(a.total) as keyword_count
FROM
(SELECT
*
FROM
rollup_day
WHERE
tenant_id = '62850a62-19ac-477d-9cd7-837f3d716885') a
WHERE
a.created_date between '2018-09-01' and '2019-09-30'
group by
a.search_keyword
order by
session_count desc
If this does not work, then you need more specific optimizations. For example: can total be equal to 0? Then you need a filtered (partial) index on the rows where total > 0. Are there any other criteria that make it easy to exclude rows from the select?
The next consideration would be to add a short ID column (e.g. 62850 instead of 62850a62-19ac-477d-9cd7-837f3d716885) that can be a plain number; that would make preselection very easy and reduce memory consumption.
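A sketch of that last idea (every name below is made up):

create table tenant_map (
    tenant_sid serial primary key,
    tenant_id  uuid unique not null
);

alter table rollup_day add column tenant_sid integer;

update rollup_day r
set    tenant_sid = m.tenant_sid
from   tenant_map m
where  m.tenant_id = r.tenant_id;

create index on rollup_day (tenant_sid, created_date);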

Postgresql LIKE ANY versus LIKE

I've tried to be thorough in this question, so if you're impatient, just jump to the end to see what the actual question is...
I'm working on adjusting how some search features in one of our databases is implemented. To this end, I'm adding some wildcard capabilities to our application's API that interfaces back to Postgresql.
The issue that I've found is that the EXPLAIN ANALYZE times do not make sense to me and I'm trying to figure out where I could be going wrong; it doesn't seem likely that 15 queries are better than just one optimized query!
The table, Words, has two relevant columns for this question: id and text. The text column has an index on it that was built with the text_pattern_ops option. Here's what I'm seeing:
First, using a LIKE ANY with a VALUES clause, which some references seem to indicate would be ideal in my case (finding multiple words):
events_prod=# explain analyze select distinct id from words where words.text LIKE ANY (values('test%'));
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
HashAggregate (cost=6716668.40..6727372.85 rows=1070445 width=4) (actual time=103088.381..103091.468 rows=256 loops=1)
Group Key: words.id
-> Nested Loop Semi Join (cost=0.00..6713992.29 rows=1070445 width=4) (actual time=0.670..103087.904 rows=256 loops=1)
Join Filter: ((words.text)::text ~~ "*VALUES*".column1)
Rows Removed by Join Filter: 214089311
-> Seq Scan on words (cost=0.00..3502655.91 rows=214089091 width=21) (actual time=0.017..25232.135 rows=214089567 loops=1)
-> Materialize (cost=0.00..0.02 rows=1 width=32) (actual time=0.000..0.000 rows=1 loops=214089567)
-> Values Scan on "*VALUES*" (cost=0.00..0.01 rows=1 width=32) (actual time=0.006..0.006 rows=1 loops=1)
Planning time: 0.226 ms
Execution time: 103106.296 ms
(10 rows)
As you can see, the execution time is horrendous.
A second attempt, using LIKE ANY(ARRAY[...]), yields:
events_prod=# explain analyze select distinct id from words where words.text LIKE ANY(ARRAY['test%']);
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------
HashAggregate (cost=3770401.08..3770615.17 rows=21409 width=4) (actual time=37399.573..37399.704 rows=256 loops=1)
Group Key: id
-> Seq Scan on words (cost=0.00..3770347.56 rows=21409 width=4) (actual time=0.224..37399.054 rows=256 loops=1)
Filter: ((text)::text ~~ ANY ('{test%}'::text[]))
Rows Removed by Filter: 214093922
Planning time: 0.611 ms
Execution time: 37399.895 ms
(7 rows)
As you can see, performance is dramatically improved, but still far from ideal: 37 seconds with one word in the list. Moving up to three words that return a total of 256 rows pushes the execution time to well over 100 seconds.
The last try, doing a LIKE for a single word:
events_prod=# explain analyze select distinct id from words where words.text LIKE 'test%';
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------
HashAggregate (cost=60.14..274.23 rows=21409 width=4) (actual time=1.437..1.576 rows=256 loops=1)
Group Key: id
-> Index Scan using words_special_idx on words (cost=0.57..6.62 rows=21409 width=4) (actual time=0.048..1.258 rows=256 loops=1)
Index Cond: (((text)::text ~>=~ 'test'::text) AND ((text)::text ~<~ 'tesu'::text))
Filter: ((text)::text ~~ 'test%'::text)
Planning time: 0.826 ms
Execution time: 1.858 ms
(7 rows)
As expected, this is the fastest, but the 1.858 ms result makes me wonder if there is something else I'm missing with the VALUES and ARRAY approaches.
The Question
Is there some more efficient way to do something like this in Postgresql that I've missed in my research?
select distinct id
from words
where words.text LIKE ANY(ARRAY['word1%', 'another%', 'third%']);
This is a bit speculative. I think the key is your pattern:
where words.text LIKE 'test%'
Note that the LIKE pattern starts with a constant string. That means Postgres can do a range scan on the index for the words that start with 'test'.
When you then introduce multiple comparisons, the optimizer gets confused and no longer considers multiple range scans. Instead, it decides that it needs to process all the rows.
This may be a case where this re-write gives you the performance that you want:
select id
from words
where words.text LIKE 'word1%'
union
select id
from words
where words.text LIKE 'another%'
union
select id
from words
where words.text LIKE 'third%';
Notes:
The distinct is not needed because of the union.
If the pattern starts with a wildcard, then a full scan is needed anyway.
You might want to consider an n-gram or full-text index on the table.
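On that last note, a trigram index is probably the most direct thing to try (a sketch; pg_trgm ships as a contrib extension):

create extension if not exists pg_trgm;
create index words_text_trgm_idx on words using gin (text gin_trgm_ops);
-- this index can serve individual LIKE 'test%' (and even '%test%') patterns,
-- which can then be combined with OR or UNION as shown above.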

Postgres: STABLE function called multiple times on constant

I have a PostgreSQL (version 9.4) performance puzzle. I have a function (prevd) declared as STABLE (see below). When I use this function on a constant in the WHERE clause, it is called multiple times instead of once.
If I understand postgres documentation correctly, the query should be optimized to call prevd only once.
A STABLE function cannot modify the database and is guaranteed to return the same results given the same arguments for all rows within a single statement
Why doesn't it optimize the calls to prevd in this case?
I'm not expecting prevd to be called once for all subsequent queries using prevd on the same argument (as if it were IMMUTABLE). I'm expecting Postgres to create a plan for my query with just one call to prevd('2015-12-12').
Please find the code below:
Schema
create table somedata(d date, number double precision);
create table dates(d date);
insert into dates
select generate_series::date
from generate_series('2015-01-01'::date, '2015-12-31'::date, '1 day');
insert into somedata
select '2015-01-01'::date + (random() * 365 + 1)::integer, random()
from generate_series(1, 100000);
create or replace function prevd(date_ date)
returns date
language sql
stable
as $$
select max(d) from dates where d < date_;
$$;
Slow Query
select avg(number) from somedata where d=prevd('2015-12-12');
Poor query plan of the query above
Aggregate (cost=28092.74..28092.75 rows=1 width=8) (actual time=3532.638..3532.638 rows=1 loops=1)
Output: avg(number)
-> Seq Scan on public.somedata (cost=0.00..28091.43 rows=525 width=8) (actual time=10.210..3532.576 rows=282 loops=1)
Output: d, number
Filter: (somedata.d = prevd('2015-12-12'::date))
Rows Removed by Filter: 99718
Planning time: 1.144 ms
Execution time: 3532.688 ms
(8 rows)
Performance
The query above runs in around 3.5 s on my machine. After changing prevd to IMMUTABLE, it drops to 0.035 s.
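For reference, that change is just a re-declaration of the function (with the caveat that IMMUTABLE is technically a lie for a function that reads a table, and is only safe if dates never changes while such queries run):

alter function prevd(date) immutable;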
I started writing this as a comment, but it got a bit long, so I'm expanding it into an answer.
As discussed in this previous answer, Postgres does not promise to always optimise based on STABLE or IMMUTABLE annotations, only that it can sometimes do so. It does this by planning the query differently by taking advantage of certain assumptions. This part of the previous answer is directly analogous to your case:
This particular sort of rewriting depends upon immutability or stability. With where test_multi_calls1(30) != num query re-writing will happen for immutable but not for merely stable functions.
If you change the function to IMMUTABLE and look at the query plan, you will see that the rewriting it does is really rather radical:
Seq Scan on public.somedata (cost=0.00..1791.00 rows=272 width=12) (actual time=0.036..14.549 rows=270 loops=1)
Output: d, number
Filter: (somedata.d = '2015-12-11'::date)
Buffers: shared read=541 written=14
Total runtime: 14.589 ms
It actually runs the function while planning the query, and substitutes the value before the query is even executed. With a STABLE function, this optimisation would clearly not be appropriate - the data might change between planning and executing the query.
In a comment, it was mentioned that this query results in an optimised plan:
select avg(number) from somedata where d=(select prevd(date '2015-12-12'));
This is fast, but note that the plan doesn't look anything like what the IMMUTABLE version did:
Aggregate (cost=1791.69..1791.70 rows=1 width=8) (actual time=14.670..14.670 rows=1 loops=1)
Output: avg(number)
Buffers: shared read=541 written=21
InitPlan 1 (returns $0)
-> Result (cost=0.00..0.01 rows=1 width=0) (actual time=0.001..0.001 rows=1 loops=1)
Output: '2015-12-11'::date
-> Seq Scan on public.somedata (cost=0.00..1791.00 rows=273 width=8) (actual time=0.026..14.589 rows=270 loops=1)
Output: d, number
Filter: (somedata.d = $0)
Buffers: shared read=541 written=21
Total runtime: 14.707 ms
By putting it into a sub-query, you are moving the function call from the WHERE clause to the SELECT clause. More importantly, the sub-query can always be executed once and used by the rest of the query; so the function is run once in a separate node of the plan.
To confirm this, we can take the SQL out of a function altogether:
select avg(number) from somedata where d=(select max(d) from dates where d < '2015-12-12');
This gives a rather longer plan with very similar performance:
Aggregate (cost=1799.12..1799.13 rows=1 width=8) (actual time=14.174..14.174 rows=1 loops=1)
Output: avg(somedata.number)
Buffers: shared read=543 written=19
InitPlan 1 (returns $0)
-> Aggregate (cost=7.43..7.44 rows=1 width=4) (actual time=0.150..0.150 rows=1 loops=1)
Output: max(dates.d)
Buffers: shared read=2
-> Seq Scan on public.dates (cost=0.00..6.56 rows=347 width=4) (actual time=0.015..0.103 rows=345 loops=1)
Output: dates.d
Filter: (dates.d < '2015-12-12'::date)
Buffers: shared read=2
-> Seq Scan on public.somedata (cost=0.00..1791.00 rows=273 width=8) (actual time=0.190..14.098 rows=270 loops=1)
Output: somedata.d, somedata.number
Filter: (somedata.d = $0)
Buffers: shared read=543 written=19
Total runtime: 14.232 ms
The important thing to note is that the inner Aggregate (the max(d)) is executed once, on a separate node from the main Seq Scan (which is checking the where clause). In this position, even a VOLATILE function can be optimised in the same way.
In short, while you know that the query you've produced can be optimised by executing the function only once, it doesn't match any of the patterns that Postgres's query planner knows how to rewrite, so it uses a naive plan which runs the function multiple times.
[Note: all tests performed on Postgres 9.1, because it's what I happened to have to hand.]

Help to choose NoSQL database for project

There is a table:
doc_id(integer)-value(integer)
Approximately 100,000 doc_ids and 27,000,000 rows.
The main query on this table searches for documents similar to the current document:
select the 10 documents with the maximum of
(count of values in common with the current document) / (count of values in the document).
We currently use PostgreSQL. The table size (with index) is ~1.5 GB and the average query time is ~0.5 s, which is too high; in my opinion this time will grow exponentially as the database grows.
Should I move all this to a NoSQL database, and if so, which one?
QUERY:
EXPLAIN ANALYZE
SELECT D.doc_id as doc_id,
(count(D.doc_crc32) *1.0 / testing.get_count_by_doc_id(D.doc_id))::real as avg_doc
FROM testing.text_attachment D
WHERE D.doc_id !=29758 -- 29758 - is random id
AND D.doc_crc32 IN (select testing.get_crc32_rows_by_doc_id(29758)) -- get_crc32... is IMMUTABLE
GROUP BY D.doc_id
ORDER BY avg_doc DESC
LIMIT 10
Limit (cost=95.23..95.26 rows=10 width=8) (actual time=1849.601..1849.641 rows=10 loops=1)
-> Sort (cost=95.23..95.28 rows=20 width=8) (actual time=1849.597..1849.609 rows=10 loops=1)
Sort Key: (((((count(d.doc_crc32))::numeric * 1.0) / (testing.get_count_by_doc_id(d.doc_id))::numeric))::real)
Sort Method: top-N heapsort Memory: 25kB
-> HashAggregate (cost=89.30..94.80 rows=20 width=8) (actual time=1211.835..1847.578 rows=876 loops=1)
-> Nested Loop (cost=0.27..89.20 rows=20 width=8) (actual time=7.826..928.234 rows=167771 loops=1)
-> HashAggregate (cost=0.27..0.28 rows=1 width=4) (actual time=7.789..11.141 rows=1863 loops=1)
-> Result (cost=0.00..0.26 rows=1 width=0) (actual time=0.130..4.502 rows=1869 loops=1)
-> Index Scan using crc32_idx on text_attachment d (cost=0.00..88.67 rows=20 width=8) (actual time=0.022..0.236 rows=90 loops=1863)
Index Cond: (d.doc_crc32 = (testing.get_crc32_rows_by_doc_id(29758)))
Filter: (d.doc_id <> 29758)
Total runtime: 1849.753 ms
(12 rows)
1.5 GB is nothing. Serve it from RAM. Build a data structure that helps you search.
I don't think your main problem here is the kind of database you're using but the fact that you don't in fact have an "index" for what you're searching: similarity between documents.
My proposal is to determine once which are the 10 documents most similar to each of the 100,000 doc_ids and cache the result in a new table like this:
doc_id(integer)-similar_doc(integer)-score(integer)
where you'll insert 10 rows per document, each representing one of the 10 best matches for it. You'll get about 1,000,000 rows which you can access directly by index, which should take search time down to something like O(log n) (depending on the index implementation).
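A sketch of that cache table and its lookup (names are illustrative):

CREATE TABLE similar_docs (
    doc_id      integer NOT NULL,
    similar_doc integer NOT NULL,
    score       integer NOT NULL,
    PRIMARY KEY (doc_id, similar_doc)
);

-- fetching the 10 best matches for one document is then a single index lookup:
SELECT similar_doc, score
FROM similar_docs
WHERE doc_id = 29758
ORDER BY score DESC
LIMIT 10;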
Then, on each insertion or removal of a document (or one of its values) you iterate through the documents and update the new table accordingly.
For example, when a new document is inserted, for each of the documents already in the table:
you calculate its match score, and
if the score is higher than the lowest score of the similar documents cached in the new table, you swap in the similar_doc and score of the newly inserted document.
If you're getting that bad performance out of PostgreSQL, a good start would be to tune PostgreSQL, your query and possibly your data model. A query like that should run a lot faster on such a small table.
First, is 0.5 s a problem or not? And did you already optimize your queries, data model and configuration settings? If not, you can still get better performance. Performance is a choice.
Besides speed, there is also functionality; that's what you would lose.
===
What about pushing the function to a JOIN:
EXPLAIN ANALYZE
SELECT
D.doc_id as doc_id,
(count(D.doc_crc32) *1.0 / testing.get_count_by_doc_id(D.doc_id))::real as avg_doc
FROM
testing.text_attachment D
JOIN (SELECT testing.get_crc32_rows_by_doc_id(29758) AS r) AS crc ON D.doc_crc32 = r
WHERE
D.doc_id <> 29758
GROUP BY D.doc_id
ORDER BY avg_doc DESC
LIMIT 10