Sql query based on historical event and its performance

Sql query based on historical event and its performance - sql

Below are tables and sample data. I want to get the data up to a provided event_id - 1 or to the latest if no event_id is provided for an ngram and only return the data if that max(event_id) having some data.
Please look at examples and scenarios below. It's easier to explain that way.
CREATE TABLE NGRAM_CONTENT
(
event_id BIGINT PRIMARY KEY,
ref TEXT NOT NULL,
data TEXT
);
CREATE TABLE NGRAM
(
event_id BIGINT PRIMARY KEY,
ngram TEXT NOT NULL ,
ref TEXT not null,
name_length INT not null,
) ;
insert into NGRAM_CONTENT(event_id, ref, data) values (1, 'p1', 'a data 1');
insert into NGRAM_CONTENT(event_id, ref, data) values (2, 'p1', 'a data 2');
insert into NGRAM_CONTENT(event_id, ref, data) values (3, 'p1', null);
insert into NGRAM_CONTENT(event_id, ref, data) values (4, 'p2', 'b data 1');
insert into NGRAM_CONTENT(event_id, ref, data) values (5, 'p2', 'b data 2');
insert into NGRAM_CONTENT(event_id, ref, data) values (6, 'p2', 'c data 1');
insert into NGRAM_CONTENT(event_id, ref, data) values (7, 'p2', 'c data 2');
insert into NGRAM(ngram, ref, event_id, name_length) values ('a', 'p1', 1, 10);
insert into NGRAM(ngram, ref, event_id, name_length) values ('a', 'p1', 2, 12);
insert into NGRAM(ngram, ref, event_id, name_length) values ('b', 'p1', 2, 13);
insert into NGRAM(ngram, ref, event_id, name_length) values ('b', 'p2', 4, 8);
insert into NGRAM(ngram, ref, event_id, name_length) values ('b', 'p2', , 10);
insert into NGRAM(ngram, ref, event_id, name_length) values ('c', 'p2', 6, 20);
insert into NGRAM(ngram, ref, event_id, name_length) values ('c', 'p2', 7, 50);
Here are the tables of example input and desired output
| ngram | event_id | output |
| 'a','b'| < 2 | 'a data 1' |
| 'a','b'| < 3 | 'a data 2' |
| 'a','b'| < 4 | null |
| 'a' | - | null |
| 'a' | - | null |
| 'b' | - | null |
| 'a','b'| - | null |
| 'b' | < 5 | b data 1 |
| 'c' | < 7 | c data 1 |
| 'c' | - | c data 2 |
I've got the following query that worked. (I didn't mention name_length above, as it's just complicated example). Need to replace that 100000000 with an event_id above and also the values for searching of ngram
with max_matched_event_id_and_ref_from_index as
(
-- getting max event id and ref for all the potential matches
select max(event_id) as max_matched_event_id, ref
from ngram
where name_length between 14 and 18 and
ngram in ('a', 'b')
and event_id < 1000000000
group by ref
having count(*) >= 2
),
max_current_event_id as
(
select max(event_id) as max_current_event_id
from ngram_content w
inner join max_matched_event_id_and_ref_from_index n on w.ref = n.ref
where w.event_id >= n.max_matched_event_id and event_id < 1000000000
group by n.ref
)
select nc.data
from ngram_content nc
inner join max_current_event_id m on nc.event_id = m.max_current_event_id
inner join max_matched_event_id_and_ref_from_index mi on nc.event_id = mi.max_matched_event_id;
I have about 450 million from ngram table and about 55 million rows from ngram_content table.
The current query takes over a minute to return which is too slow for our usage.
I've got indexes as followings:
CREATE INDEX combined_index ON NGRAM (ngram, name_length, ref, event_id);
CREATE INDEX idx_ref_ngram_content ON ngram_content (ref);
CREATE INDEX idx_ngram_content_event_id ON ngram_content (event_id);
And here are the detailed explanation from query plan execution:
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Hash Join (cost=4818702.89..4820783.06 rows=1 width=365) (actual time=29201.537..29227.214 rows=15081 loops=1)
Hash Cond: (mi.max_matched_event_id = nc.event_id)
Buffers: shared hit=381223 read=342422, temp read=2204 written=2213
CTE max_matched_event_id_and_ref_from_index
-> Finalize GroupAggregate (cost=3720947.79..3795574.47 rows=87586 width=16) (actual time=19163.811..19978.720 rows=43427 loops=1)
Group Key: ngram.ref
Filter: (count(*) >= 2)
Rows Removed by Filter: 999474
Buffers: shared hit=35270 read=225113, temp read=1402 written=1410
-> Gather Merge (cost=3720947.79..3788348.60 rows=525518 width=24) (actual time=19163.620..19649.679 rows=1048271 loops=1)
Workers Planned: 2
Workers Launched: 2
Buffers: shared hit=104576 read=669841 written=1, temp read=4183 written=4207
-> Partial GroupAggregate (cost=3719947.77..3726690.76 rows=262759 width=24) (actual time=19143.782..19356.718 rows=349424 loops=3)
Group Key: ngram.ref
Buffers: shared hit=104576 read=669841 written=1, temp read=4183 written=4207
-> Sort (cost=3719947.77..3720976.62 rows=411540 width=16) (actual time=19143.770..19231.539 rows=362406 loops=3)
Sort Key: ngram.ref
Sort Method: external merge Disk: 11216kB
Worker 0: Sort Method: external merge Disk: 11160kB
Worker 1: Sort Method: external merge Disk: 11088kB
Buffers: shared hit=104576 read=669841 written=1, temp read=4183 written=4207
-> Parallel Index Only Scan using combined_index on ngram (cost=0.57..3674535.28 rows=411540 width=16) (actual time=1.122..18715.404 rows=362406 loops=3)
Index Cond: ((ngram = ANY ('{ORA,AN,MG}'::text[])) AND (name_length >= 14) AND (name_length <= 18) AND (event_id < 1000000000))
Heap Fetches: 1087219
Buffers: shared hit=104560 read=669841 written=1
CTE max_current_event_id
-> GroupAggregate (cost=1020964.39..1021403.41 rows=200 width=40) (actual time=7631.312..7674.228 rows=43427 loops=1)
Group Key: n.ref
Buffers: shared hit=174985 read=70887, temp read=1179 written=273
-> Sort (cost=1020964.39..1021110.06 rows=58270 width=40) (actual time=7631.304..7644.203 rows=71773 loops=1)
Sort Key: n.ref
Sort Method: external merge Disk: 2176kB
Buffers: shared hit=174985 read=70887, temp read=1179 written=273
-> Nested Loop (cost=0.56..1016352.18 rows=58270 width=40) (actual time=1.093..7574.448 rows=71773 loops=1)
Buffers: shared hit=174985 read=70887, temp read=907
-> CTE Scan on max_matched_event_id_and_ref_from_index n (cost=0.00..1751.72 rows=87586 width=40) (actual time=0.000..838.522 rows=43427 loops=1)
Buffers: temp read=907
-> Index Scan using idx_ref_ngram_content on ngram_content w (cost=0.56..11.57 rows=1 width=16) (actual time=0.104..0.154 rows=2 loops=43427)
Index Cond: (ref = n.ref)
Filter: ((event_id < 1000000000) AND (event_id >= n.max_matched_event_id))
Rows Removed by Filter: 0
Buffers: shared hit=174985 read=70887
-> CTE Scan on max_matched_event_id_and_ref_from_index mi (cost=0.00..1751.72 rows=87586 width=8) (actual time=19163.813..19168.081 rows=43427 loops=1)
Buffers: shared hit=35270 read=225113, temp read=495 written=1410
-> Hash (cost=1722.50..1722.50 rows=200 width=381) (actual time=10035.797..10035.797 rows=43427 loops=1)
Buckets: 32768 (originally 1024) Batches: 2 (originally 1) Memory Usage: 3915kB
Buffers: shared hit=345953 read=117309, temp read=1179 written=704
-> Nested Loop (cost=0.56..1722.50 rows=200 width=381) (actual time=7632.365..9994.328 rows=43427 loops=1)
Buffers: shared hit=345953 read=117309, temp read=1179 written=273
-> CTE Scan on max_current_event_id m (cost=0.00..4.00 rows=200 width=8) (actual time=7631.315..7695.869 rows=43427 loops=1)
Buffers: shared hit=174985 read=70887, temp read=1179 written=273
-> Index Scan using idx_ngram_content_event_id on ngram_content nc (cost=0.56..8.58 rows=1 width=373) (actual time=0.052..0.052 rows=1 loops=43427)
Index Cond: (event_id = m.max_current_event_id)
Buffers: shared hit=170968 read=46422
Planning Time: 7.872 ms
Execution Time: 29231.222 ms
(57 rows)
Any suggestions on how to optimise the query or indexes so the query can run faster please?

Related

Query Value by Max Date in Postgresql

I already asked this question here but there contained less information about my question. So, I create a new question with more information.
Here is the sample table that I have. Each row contains the filled data by the user at every time. So that the timestamp column will not be null through the whole table. There may be unrecorded value under item, if the user didn't fill. The id is the auto-generated column for each record.
CREATE TABLE tbl (id int, customer_id text, item text, value text, timestamp timestamp);
INSERT INTO tbl VALUES
(1, '001', 'price', '1000', '2021-11-01 01:00:00'),
(2, '001', 'price', '1500', '2021-11-02 01:00:00'),
(3, '001', 'price', '1400', '2021-11-03 01:00:00'),
(4, '001', 'condition', 'good', '2021-11-01 01:00:00'),
(5, '001', 'condition', 'good', '2021-11-02 01:00:00'),
(6, '001', 'condition', 'ok', '2021-11-03 01:00:00'),
(7, '001', 'feeling', 'sad', '2021-11-01 01:00:00'),
(8, '001', 'feeling', 'angry', '2021-11-02 01:00:00'),
(9, '001', 'feeling', 'fine', '2021-11-03 01:00:00'),
(10, '002', 'price', '1200', '2021-11-01 01:00:00'),
(11, '002', 'price', '1600', '2021-11-02 01:00:00'),
(12, '002', 'price', '2000', '2021-11-03 01:00:00'),
(13, '002', 'weather', 'sunny', '2021-11-01 01:00:00'),
(14, '002', 'weather', 'rain', '2021-11-02 01:00:00'),
(15, '002', 'price', '1900', '2021-11-04 01:00:00'),
(16, '002', 'feeling', 'sad', '2021-11-01 01:00:00'),
(17, '002', 'feeling', 'angry', '2021-11-02 01:00:00'),
(18, '002', 'feeling', 'fine', '2021-11-03 01:00:00'),
(19, '003', 'price', '1000', '2021-11-01 01:00:00'),
(20, '003', 'price', '1500', '2021-11-02 01:00:00'),
(21, '003', 'price', '2000', '2021-11-03 01:00:00'),
(22, '003', 'condition', 'ok', '2021-11-01 01:00:00'),
(23, '003', 'weather', 'rain', '2021-11-02 01:00:00'),
(24, '003', 'condition', 'bad', '2021-11-03 01:00:00'),
(25, '003', 'feeling', 'fine', '2021-11-01 01:00:00'),
(26, '003', 'weather', 'sunny', '2021-11-03 01:00:00'),
(27, '003', 'feeling', 'sad', '2021-11-03 01:00:00')
;
To see clearly, I order the above table by id and timestamp. It doesn't matter.
We are using Postgresql Version: PostgreSQL 9.5.19
The actual table contains over 4 million rows
The item column contains over 500 distinct items, but don't worry. I will use 10 items at most for a query. In the above table, I used only 4 items.
We also have another table called Customer_table with a unique Customer_id containing customers' general information.
From the above table, I want to query the data to create a table with the latest date updated data as below. I will use 10 items at most for a query so that there may be 10 columns.
customer_id price condition feeling weather .......(there may be other columns from item column)
002 1900 null fine rain
001 1400 ok fine null
003 2000 bad sad sunny
This is the query that I get from previous questions, but I asked only for two item.
SELECT customer_id, p.value AS price, c.value AS condition
FROM (
SELECT DISTINCT ON (customer_id)
customer_id, value
FROM tbl
WHERE item = 'condition'
ORDER BY customer_id, timestamp DESC
) c
FULL JOIN (
SELECT DISTINCT ON (customer_id)
customer_id, value
FROM tbl
WHERE item = 'price'
ORDER BY customer_id, timestamp DESC
) p USING (customer_id)
So, if there is any better solution please help me.
Thank you.

You may try other approaches using row_number to generate a value to filter your data on the most recent data. You may then aggregate on customer id with the max value for a case expression filtering your records based on the desired row number rn=1 (we will order by descending) and item name.
These approaches are less verbose and based on the results online seem to be more performant. Let me know how replicating this in your environment works in the comments.
You may use EXPLAIN ANALYZE to compare this approach to the current one. The results in the online environment provided:
Current Approach
| Planning time: 0.129 ms
| Execution time: 0.056 ms
Suggested Approach 1
| Planning time: 0.061 ms
| Execution time: 0.070 ms
Suggested Approach 2
| Planning time: 0.047 ms
| Execution time: 0.056 ms
NB. You may use EXPLAIN ANALYZE to compare these approaches in your environment which we cannot replicate online. The results may also vary on each run. Indexes and early filters on the item column are also recommended to improve performance.
Schema (PostgreSQL v9.5)
Suggested Approach 1
SELECT
t1.customer_id,
MAX(CASE WHEN t1.item='condition' THEN t1.value END) as conditio,
MAX(CASE WHEN t1.item='price' THEN t1.value END) as price,
MAX(CASE WHEN t1.item='feeling' THEN t1.value END) as feeling,
MAX(CASE WHEN t1.item='weather' THEN t1.value END) as weather
FROM (
SELECT
* ,
ROW_NUMBER() OVER (
PARTITION BY customer_id,item
ORDER BY tbl.timestamp DESC
) as rn
FROM
tbl
-- ensure that you filter based on your desired items
-- indexes on item column are recommended to improve performance
) t1
WHERE rn=1
GROUP BY
1;
customer_id
conditio
price
feeling
weather
001
ok
1400
fine
002
1900
fine
rain
003
bad
2000
sad
sunny
Suggested Approach 2
SELECT
t1.customer_id,
MAX(t1.value) FILTER (WHERE t1.item='condition') as conditio,
MAX(t1.value) FILTER (WHERE t1.item='price') as price,
MAX(t1.value) FILTER (WHERE t1.item='feeling') as feeling,
MAX(t1.value) FILTER (WHERE t1.item='weather') as weather
FROM (
SELECT
* ,
ROW_NUMBER() OVER (
PARTITION BY customer_id,item
ORDER BY tbl.timestamp DESC
) as rn
FROM
tbl
-- ensure that you filter based on your desired items
-- indexes on item column are recommended to improve performance
) t1
WHERE rn=1
GROUP BY
1;
customer_id
conditio
price
feeling
weather
001
ok
1400
fine
002
1900
fine
rain
003
bad
2000
sad
sunny
Current Approach with EXPLAIN ANALYZE
EXPLAIN(ANALYZE,BUFFERS)
SELECT customer_id, p.value AS price, c.value AS condition
FROM (
SELECT DISTINCT ON (customer_id)
customer_id, value
FROM tbl
WHERE item = 'condition'
ORDER BY customer_id, timestamp DESC
) c
FULL JOIN (
SELECT DISTINCT ON (customer_id)
customer_id, value
FROM tbl
WHERE item = 'price'
ORDER BY customer_id, timestamp DESC
) p USING (customer_id);
QUERY PLAN
Merge Full Join (cost=35.05..35.12 rows=1 width=128) (actual time=0.025..0.030 rows=3 loops=1)
Merge Cond: (tbl.customer_id = tbl_1.customer_id)
Buffers: shared hit=2
-> Unique (cost=17.52..17.54 rows=1 width=72) (actual time=0.013..0.014 rows=2 loops=1)
Buffers: shared hit=1
-> Sort (cost=17.52..17.53 rows=3 width=72) (actual time=0.013..0.013 rows=5 loops=1)
Sort Key: tbl.customer_id, tbl."timestamp" DESC
Sort Method: quicksort Memory: 25kB
Buffers: shared hit=1
-> Seq Scan on tbl (cost=0.00..17.50 rows=3 width=72) (actual time=0.004..0.006 rows=5 loops=1)
Filter: (item = 'condition'::text)
Rows Removed by Filter: 22
Buffers: shared hit=1
-> Materialize (cost=17.52..17.55 rows=1 width=64) (actual time=0.010..0.013 rows=3 loops=1)
Buffers: shared hit=1
-> Unique (cost=17.52..17.54 rows=1 width=72) (actual time=0.010..0.012 rows=3 loops=1)
Buffers: shared hit=1
-> Sort (cost=17.52..17.53 rows=3 width=72) (actual time=0.010..0.010 rows=10 loops=1)
Sort Key: tbl_1.customer_id, tbl_1."timestamp" DESC
Sort Method: quicksort Memory: 25kB
Buffers: shared hit=1
-> Seq Scan on tbl tbl_1 (cost=0.00..17.50 rows=3 width=72) (actual time=0.001..0.003 rows=10 loops=1)
Filter: (item = 'price'::text)
Rows Removed by Filter: 17
Buffers: shared hit=1
Planning time: 0.129 ms
Execution time: 0.056 ms
Suggested Approach 1 with EXPLAIN ANALYZE
EXPLAIN(ANALYZE,BUFFERS)
SELECT
t1.customer_id,
MAX(CASE WHEN t1.item='price' THEN t1.value END) as price,
MAX(CASE WHEN t1.item='condition' THEN t1.value END) as conditio
FROM (
SELECT
* ,
ROW_NUMBER() OVER (
PARTITION BY customer_id,item
ORDER BY tbl.timestamp DESC
) as rn
FROM
tbl
where item IN ('price','condition')
) t1
WHERE rn=1
GROUP BY
1;
QUERY PLAN
GroupAggregate (cost=17.58..17.81 rows=1 width=96) (actual time=0.039..0.047 rows=3 loops=1)
Group Key: t1.customer_id
Buffers: shared hit=1
-> Subquery Scan on t1 (cost=17.58..17.79 rows=1 width=96) (actual time=0.030..0.040 rows=5 loops=1)
Filter: (t1.rn = 1)
Rows Removed by Filter: 10
Buffers: shared hit=1
-> WindowAgg (cost=17.58..17.71 rows=6 width=104) (actual time=0.029..0.038 rows=15 loops=1)
Buffers: shared hit=1
-> Sort (cost=17.58..17.59 rows=6 width=104) (actual time=0.028..0.030 rows=15 loops=1)
Sort Key: tbl.customer_id, tbl.item, tbl."timestamp" DESC
Sort Method: quicksort Memory: 26kB
Buffers: shared hit=1
-> Seq Scan on tbl (cost=0.00..17.50 rows=6 width=104) (actual time=0.003..0.008 rows=15 loops=1)
Filter: (item = ANY ('{price,condition}'::text[]))
Rows Removed by Filter: 12
Buffers: shared hit=1
Planning time: 0.061 ms
Execution time: 0.070 ms
Suggested Approach 2 with EXPLAIN ANALYZE
EXPLAIN(ANALYZE,BUFFERS)
SELECT
t1.customer_id,
MAX(t1.value) FILTER (WHERE t1.item='price') as price,
MAX(t1.value) FILTER (WHERE t1.item='condition') as conditio
FROM (
SELECT
* ,
ROW_NUMBER() OVER (
PARTITION BY customer_id,item
ORDER BY tbl.timestamp DESC
) as rn
FROM
tbl
where item IN ('price','condition')
) t1
WHERE rn=1
GROUP BY
1;
QUERY PLAN
GroupAggregate (cost=17.58..17.81 rows=1 width=96) (actual time=0.029..0.037 rows=3 loops=1)
Group Key: t1.customer_id
Buffers: shared hit=1
-> Subquery Scan on t1 (cost=17.58..17.79 rows=1 width=96) (actual time=0.021..0.032 rows=5 loops=1)
Filter: (t1.rn = 1)
Rows Removed by Filter: 10
Buffers: shared hit=1
-> WindowAgg (cost=17.58..17.71 rows=6 width=104) (actual time=0.021..0.030 rows=15 loops=1)
Buffers: shared hit=1
-> Sort (cost=17.58..17.59 rows=6 width=104) (actual time=0.019..0.021 rows=15 loops=1)
Sort Key: tbl.customer_id, tbl.item, tbl."timestamp" DESC
Sort Method: quicksort Memory: 26kB
Buffers: shared hit=1
-> Seq Scan on tbl (cost=0.00..17.50 rows=6 width=104) (actual time=0.003..0.008 rows=15 loops=1)
Filter: (item = ANY ('{price,condition}'::text[]))
Rows Removed by Filter: 12
Buffers: shared hit=1
Planning time: 0.047 ms
Execution time: 0.056 ms
View working demo on DB Fiddle

You operate on a big table. You mentioned 4 million rows, obviously growing. While querying for ...
all customers
all items
with few rows per (customer_id, item)
with narrow rows (small row size)
... ggordon's solutions with row_number() are great. And short, too.
The whole table has to be processed in a sequential scan. Indices won't be used.
But prefer "Approach 2" with the modern aggregate FILTER syntax. It's clearer and faster. See performance tests here:
For absolute performance, is SUM faster or COUNT?
Approach 3: Pivot with crosstab()
crosstab() is typically faster, especially for more than a few items. See:
PostgreSQL Crosstab Query
SELECT *
FROM crosstab(
$$
SELECT customer_id, item, value
FROM (
SELECT customer_id, item, value
, row_number() OVER (PARTITION BY customer_id, item ORDER BY t.timestamp DESC) AS rn
FROM tbl t
WHERE item = ANY ('{condition,price,feeling,weather}') -- your items here ...
) t1
WHERE rn = 1
ORDER BY customer_id, item
$$
, $$SELECT unnest('{condition,price,feeling,weather}'::text[])$$ -- ... here ...
) AS ct (customer_id text, condition text, price text, feeling text, weather text); -- ... and here ...
Approach 4: LATERAL Subqueries
If one or more of the criteria listed at the top do not apply, the above queries fall off quickly in performance.
For starters, only max 10 of "500 distinct items" are involved. That's max ~ 2 % of the big table. That alone should make one of the following queries (much) faster in comparison:
SELECT *
FROM (SELECT customer_id FROM customer) c
LEFT JOIN LATERAL (
SELECT value AS condition
FROM tbl t
WHERE t.customer_id = c.customer_id
AND t.item = 'condition'
ORDER BY t.timestamp DESC
LIMIT 1
) AS t1 ON true
LEFT JOIN LATERAL (
SELECT value AS price
FROM tbl t
WHERE t.customer_id = c.customer_id
AND t.item = 'price'
ORDER BY t.timestamp DESC
LIMIT 1
) AS t2 ON true
LEFT JOIN LATERAL (
SELECT value AS feeling
FROM tbl t
WHERE t.customer_id = c.customer_id
AND t.item = 'feeling'
ORDER BY t.timestamp DESC
LIMIT 1
) AS t3 ON true
-- ... more?
About LEFT JOIN LATERAL:
What is the difference between LATERAL JOIN and a subquery in PostgreSQL?
The point is to get a query plan with relatively few index(-only) scans to replace the expensive sequential scan on the big table.
Requires an applicable index, obviously:
CREATE INDEX ON tbl (customer_id, item);
Or better (in Postgres 9.5):
CREATE INDEX ON tbl (customer_id, item, timestamp DESC, value);
In Postgres 11 or later, this would be better, yet:
CREATE INDEX ON tbl (customer_id, item, timestamp DESC) INCLUDE (value);
See here or here or here.
If only few items are of interest, partial indices on those items would be even better.
Approach 5: Correlated Subqueries
SELECT c.customer_id
, (SELECT value FROM tbl t WHERE t.customer_id = c.customer_id AND t.item = 'condition' ORDER BY t.timestamp DESC LIMIT 1) AS condition
, (SELECT value FROM tbl t WHERE t.customer_id = c.customer_id AND t.item = 'price' ORDER BY t.timestamp DESC LIMIT 1) AS price
, (SELECT value FROM tbl t WHERE t.customer_id = c.customer_id AND t.item = 'feeling' ORDER BY t.timestamp DESC LIMIT 1) AS feeling
, (SELECT value FROM tbl t WHERE t.customer_id = c.customer_id AND t.item = 'weather' ORDER BY t.timestamp DESC LIMIT 1) AS weather
FROM customer c;
Not as versatile as LATERAL, but good enough for the purpose. Same index requirements as approach 4.
Approach 5 will be king of performance in most cases.
db<>fiddle here
Improving your relational design and/or upgrading to a current version of Postgres would go a long way, too.

Primary key scanning in partitioned table

I have really big table, which I need to be partitioned by date (via trigger in my case).
The problem I've encountered is I can get data by timestamp filter pretty fast, but can't get good performance while extracting data for certain row by primary key.
The main table is:
CREATE TABLE parent_table (
guid uuid NOT NULL DEFAULT uuid_generate_v4(), -- This is gonna be the primary key
tm timestamptz NOT NULL, -- Timestamp, on which paritions are based
value int4 not null default -1, -- Just a value
CONSTRAINT z_detections_pk PRIMARY KEY (guid)
);
CREATE INDEX parent_table_tm_idx ON dev.dev_z_detections USING btree (tm DESC);
Then I create simple trigger for creation new parition if there are new date
CREATE OR REPLACE FUNCTION parent_table_insert_fn()
RETURNS trigger
LANGUAGE plpgsql
AS $function$
DECLARE
schema_name varchar(255) := 'public';
table_master varchar(255) := 'parent_table';
table_part varchar(255) := '';
table_date_underscore varchar(255) := '';
constraint_tm_start timestamp with time zone;
constraint_tm_end timestamp with time zone;
BEGIN
table_part := table_master || '_' || to_char(timezone('utc', new.tm), 'YYYY_MM_DD');
table_date_underscore := '' || to_char(timezone('utc', new.tm), 'YYYY_MM_DD');
PERFORM
1
from
information_schema.tables
WHERE
table_schema = schema_name
AND table_name = table_part
limit 1;
IF NOT FOUND
then
constraint_tm_start := to_char(timezone('utc', new.tm), 'YYYY-MM-DD')::timestamp at time zone 'utc';
constraint_tm_end := constraint_tm_start + interval '1 day';
execute '
CREATE TABLE ' || schema_name || '.' || table_part || ' (
CONSTRAINT parent_table_' || table_date_underscore || '_pk PRIMARY KEY (guid),
CONSTRAINT parent_table_' || table_date_underscore || '_ck CHECK ( tm >= ' || QUOTE_LITERAL(constraint_tm_start) || ' and tm < ' || QUOTE_LITERAL(constraint_tm_end) || ' )
) INHERITS (' || schema_name || '.' || table_master || ');
CREATE INDEX parent_table_' || table_date_underscore || '_tidx ON ' || schema_name || '.' || table_part || ' USING btree (tm desc);
';
END IF;
execute '
INSERT INTO ' || schema_name || '.' || table_part || '
SELECT ( (' || QUOTE_LITERAL(NEW) || ')::' || schema_name || '.' || TG_RELNAME || ' ).*;';
RETURN NULL;
END;
$function$
;
Enable trigger on parent table:
create trigger parent_table_insert_fn_trigger before insert
on parent_table for each row execute function parent_table_insert_fn();
And insert some data in it:
insert into parent_table(guid, tm, value)
values
('1f4835c0-2b22-4cfc-ab3c-940af679ace6', '2021-04-06 14:00:00+03:00', 1),
('5ca37d57-e79e-4e1f-ace7-91eb671f3a82', '2021-04-07 15:30:00+03:00', 2),
('b57bfbf6-7ed0-4dde-a40b-9fa2e6f24808', '2021-04-07 17:10:00+03:00', 3),
('ad69cd35-5b20-466f-9d5c-61fa5d41bc5f', '2021-04-08 16:50:00+03:00', 66),
('bb0ec87a-72bb-438e-8f4c-2cdc3ae7d525', '2021-03-21 19:00:00+03:00', -10);
After those manipulations I've got 4 tables:
parent_table
parent_table_2021_03_21
parent_table_2021_04_06
parent_table_2021_04_07
parent_table_2021_04_08
Checking if indexing works good for timestamps filter:
explain analyze
select * from parent_table where tm > '2021-04-07 10:00:00+03:00' and tm <= '2021-04-07 16:30:00+03:00';
> > >
Append (cost=0.00..14.43 rows=8 width=28) (actual time=0.017..0.020 rows=1 loops=1)
-> Seq Scan on parent_table parent_table_1 (cost=0.00..0.00 rows=1 width=28) (actual time=0.002..0.002 rows=0 loops=1)
Filter: ((tm > '2021-04-07 10:00:00+03'::timestamp with time zone) AND (tm <= '2021-04-07 16:30:00+03'::timestamp with time zone))
-> Bitmap Heap Scan on parent_table_2021_04_07 parent_table_2 (cost=4.22..14.39 rows=7 width=28) (actual time=0.013..0.015 rows=1 loops=1)
Recheck Cond: ((tm > '2021-04-07 10:00:00+03'::timestamp with time zone) AND (tm <= '2021-04-07 16:30:00+03'::timestamp with time zone))
Heap Blocks: exact=1
-> Bitmap Index Scan on parent_table_2021_04_07_tidx (cost=0.00..4.22 rows=7 width=0) (actual time=0.008..0.008 rows=1 loops=1)
Index Cond: ((tm > '2021-04-07 10:00:00+03'::timestamp with time zone) AND (tm <= '2021-04-07 16:30:00+03'::timestamp with time zone))
Planning Time: 0.381 ms
Execution Time: 0.053 ms
This is fine and works as I expected.
But selecting by certain primary key gives me next analyze's output:
explain analyze
select * from parent_table where guid = 'b57bfbf6-7ed0-4dde-a40b-9fa2e6f24808';
> > >
Append (cost=0.00..32.70 rows=5 width=28) (actual time=0.021..0.035 rows=1 loops=1)
-> Seq Scan on parent_table parent_table_1 (cost=0.00..0.00 rows=1 width=28) (actual time=0.003..0.004 rows=0 loops=1)
Filter: (guid = 'b57bfbf6-7ed0-4dde-a40b-9fa2e6f24808'::uuid)
-> Index Scan using parent_table_2021_04_06_pk on parent_table_2021_04_06 parent_table_2 (cost=0.15..8.17 rows=1 width=28) (actual time=0.008..0.008 rows=0 loops=1)
Index Cond: (guid = 'b57bfbf6-7ed0-4dde-a40b-9fa2e6f24808'::uuid)
-> Index Scan using parent_table_2021_04_07_pk on parent_table_2021_04_07 parent_table_3 (cost=0.15..8.17 rows=1 width=28) (actual time=0.008..0.009 rows=1 loops=1)
Index Cond: (guid = 'b57bfbf6-7ed0-4dde-a40b-9fa2e6f24808'::uuid)
-> Index Scan using parent_table_2021_04_08_pk on parent_table_2021_04_08 parent_table_4 (cost=0.15..8.17 rows=1 width=28) (actual time=0.004..0.004 rows=0 loops=1)
Index Cond: (guid = 'b57bfbf6-7ed0-4dde-a40b-9fa2e6f24808'::uuid)
-> Index Scan using parent_table_2021_03_21_pk on parent_table_2021_03_21 parent_table_5 (cost=0.15..8.17 rows=1 width=28) (actual time=0.006..0.006 rows=0 loops=1)
Index Cond: (guid = 'b57bfbf6-7ed0-4dde-a40b-9fa2e6f24808'::uuid)
Planning Time: 0.345 ms
Execution Time: 0.076 ms
And this query gives me bad perfomance (I guess?) especially on really big paritioned tables like 10M+ rows for each partition.
So my question is: what should I do to evade partitions scans for simple primary key lookup?
Note: I'm using PostgreSQL 13.1
UPDATE 2021-04-07 15:22+03:00:
So, in semi-production table I have such results:
Timestamp filter
Append (cost=0.00..809.35 rows=16616 width=32) (actual time=0.037..5.612 rows=16865 loops=1)
-> Seq Scan on wifi_logs t_1 (cost=0.00..0.00 rows=1 width=32) (actual time=0.010..0.011 rows=0 loops=1)
Filter: ((tm >= '2020-04-07 14:00:00+03'::timestamp with time zone) AND (tm <= '2020-04-07 17:00:00+03'::timestamp with time zone))
-> Index Scan using wifi_logs_tm_idx_2020_04_07 on wifi_logs_2020_04_07 t_2 (cost=0.29..726.27 rows=16615 width=32) (actual time=0.026..4.655 rows=16865 loops=1)
Index Cond: ((tm >= '2020-04-07 14:00:00+03'::timestamp with time zone) AND (tm <= '2020-04-07 17:00:00+03'::timestamp with time zone))
Planning Time: 14.869 ms
Execution Time: 6.151 ms
GUID (primary key filter)
-> Seq Scan on wifi_logs t_1 (cost=0.00..0.00 rows=1 width=32) (actual time=0.015..0.016 rows=0 loops=1)
Filter: (guid = '78bc5537-4f2f-4e83-8abd-4241ac3f9f27'::uuid)
-> Seq Scan on wifi_logs_2014_12_04 t_4 (cost=0.00..1.01 rows=1 width=32) (actual time=0.006..0.006 rows=0 loops=1)
Filter: (guid = '78bc5537-4f2f-4e83-8abd-4241ac3f9f27'::uuid)
Rows Removed by Filter: 1
--
-- TONS OF PARTITION TABLE SCANS
---
-> Index Scan using wifi_logs_2021_03_18_pk on wifi_logs_2021_03_18 t_387 (cost=0.42..8.44 rows=1 width=32) (actual time=0.011..0.011 rows=0 loops=1)
Index Cond: (guid = '78bc5537-4f2f-4e83-8abd-4241ac3f9f27'::uuid)
-> Seq Scan on wifi_logs_1970_01_01 t_388 (cost=0.00..3.60 rows=1 width=32) (actual time=0.020..0.020 rows=0 loops=1)
Filter: (guid = '78bc5537-4f2f-4e83-8abd-4241ac3f9f27'::uuid)
Rows Removed by Filter: 119
-> Index Scan using wifi_logs_2021_03_19_pk on wifi_logs_2021_03_19 t_389 (cost=0.42..8.44 rows=1 width=32) (actual time=0.012..0.012 rows=0 loops=1)
Index Cond: (guid = '78bc5537-4f2f-4e83-8abd-4241ac3f9f27'::uuid)
--
-- ANOTHER TONS OF PARTITION TABLE SCANS
---
-> Index Scan using wifi_logs_2021_04_07_pk on wifi_logs_2021_04_07 t_408 (cost=0.42..8.44 rows=1 width=32) (actual time=0.010..0.010 rows=0 loops=1)
Index Cond: (guid = '78bc5537-4f2f-4e83-8abd-4241ac3f9f27'::uuid)
Planning Time: 97.662 ms
Execution Time: 36.756 ms

This is normal, and there is no way to avoid it except
create fewer partitions, so that you have to scan fewer partitions
add a condition on tm to the query to avoid scanning them all
You will notice that the planning time greatly exceeds the query execution time. To help with that, you can
create fewer partitions, so that the optimizer has less work to do
use prepared statements to avoid the planing effort

Separate PostgreSQL partitions join

I'm using PostgreSQL 10.6. I have several tables partitioned by day. Each day has its own data. I want to select rows from this tables within a day.
drop table IF EXISTS request;
drop table IF EXISTS request_identity;
CREATE TABLE IF NOT EXISTS request (
id bigint not null,
record_date date not null,
payload text not null
) PARTITION BY LIST (record_date);
CREATE TABLE IF NOT EXISTS request_p1 PARTITION OF request FOR VALUES IN ('2001-01-01');
CREATE TABLE IF NOT EXISTS request_p2 PARTITION OF request FOR VALUES IN ('2001-01-02');
CREATE INDEX IF NOT EXISTS i_request_p1_id ON request_p1 (id);
CREATE INDEX IF NOT EXISTS i_request_p2_id ON request_p2 (id);
do $$
begin
for i in 1..100000 loop
INSERT INTO request (id,record_date,payload) values (i, '2001-01-01', 'abc');
end loop;
for i in 100001..200000 loop
INSERT INTO request (id,record_date,payload) values (i, '2001-01-02', 'abc');
end loop;
end;
$$;
CREATE TABLE IF NOT EXISTS request_identity (
record_date date not null,
parent_id bigint NOT NULL,
identity_name varchar(32),
identity_value varchar(32)
) PARTITION BY LIST (record_date);
CREATE TABLE IF NOT EXISTS request_identity_p1 PARTITION OF request_identity FOR VALUES IN ('2001-01-01');
CREATE TABLE IF NOT EXISTS request_identity_p2 PARTITION OF request_identity FOR VALUES IN ('2001-01-02');
CREATE INDEX IF NOT EXISTS i_request_identity_p1_payload ON request_identity_p1 (identity_name, identity_value);
CREATE INDEX IF NOT EXISTS i_request_identity_p2_payload ON request_identity_p2 (identity_name, identity_value);
do $$
begin
for i in 1..100000 loop
INSERT INTO request_identity (parent_id,record_date,identity_name,identity_value) values (i, '2001-01-01', 'NAME', 'somename'||i);
end loop;
for i in 100001..200000 loop
INSERT INTO request_identity (parent_id,record_date,identity_name,identity_value) values (i, '2001-01-02', 'NAME', 'somename'||i);
end loop;
end;
$$;
analyze request;
analyze request_identity;
I make select inside 1 day and see a good request plan:
explain analyze select *
from request
where record_date between '2001-01-01' and '2001-01-01'
and exists (select * from request_identity where parent_id = id and identity_name = 'NAME' and identity_value = 'somename555' and record_date between '2001-01-01' and '2001-01-01')
limit 100;
Limit (cost=8.74..16.78 rows=1 width=16)
-> Nested Loop (cost=8.74..16.78 rows=1 width=16)
-> HashAggregate (cost=8.45..8.46 rows=1 width=8)
Group Key: request_identity_p1.parent_id
-> Append (cost=0.42..8.44 rows=1 width=8)
-> Index Scan using i_request_identity_p1_payload on request_identity_p1 (cost=0.42..8.44 rows=1 width=8)
Index Cond: (((identity_name)::text = 'NAME'::text) AND ((identity_value)::text = 'somename555'::text))
Filter: ((record_date >= '2001-01-01'::date) AND (record_date <= '2001-01-01'::date))
-> Append (cost=0.29..8.32 rows=1 width=16)
-> Index Scan using i_request_p1_id on request_p1 (cost=0.29..8.32 rows=1 width=16)
Index Cond: (id = request_identity_p1.parent_id)
Filter: ((record_date >= '2001-01-01'::date) AND (record_date <= '2001-01-01'::date))
But if I make a select for 2 days or more, then PostgreSQL first appends rows of all partitions of request_identity and all partitions of request, and then joins them.
So this is the SQL that is not working as i want:
explain analyze select *
from request
where record_date between '2001-01-01' and '2001-01-02'
and exists (select * from request_identity where parent_id = id and identity_name = 'NAME' and identity_value = 'somename1777' and record_date between '2001-01-01' and '2001-01-02')
limit 100;
Limit (cost=17.19..50.21 rows=2 width=16)
-> Nested Loop (cost=17.19..50.21 rows=2 width=16)
-> Unique (cost=16.90..16.91 rows=2 width=8)
-> Sort (cost=16.90..16.90 rows=2 width=8)
Sort Key: request_identity_p1.parent_id
-> Append (cost=0.42..16.89 rows=2 width=8)
-> Index Scan using i_request_identity_p1_payload on request_identity_p1 (cost=0.42..8.44 rows=1 width=8)
Index Cond: (((identity_name)::text = 'NAME'::text) AND ((identity_value)::text = 'somename1777'::text))
Filter: ((record_date >= '2001-01-01'::date) AND (record_date <= '2001-01-02'::date))
-> Index Scan using i_request_identity_p2_payload on request_identity_p2 (cost=0.42..8.44 rows=1 width=8)
Index Cond: (((identity_name)::text = 'NAME'::text) AND ((identity_value)::text = 'somename1777'::text))
Filter: ((record_date >= '2001-01-01'::date) AND (record_date <= '2001-01-02'::date))
-> Append (cost=0.29..16.63 rows=2 width=16)
-> Index Scan using i_request_p1_id on request_p1 (cost=0.29..8.32 rows=1 width=16)
Index Cond: (id = request_identity_p1.parent_id)
Filter: ((record_date >= '2001-01-01'::date) AND (record_date <= '2001-01-02'::date))
-> Index Scan using i_request_p2_id on request_p2 (cost=0.29..8.32 rows=1 width=16)
Index Cond: (id = request_identity_p1.parent_id)
Filter: ((record_date >= '2001-01-01'::date) AND (record_date <= '2001-01-02'::date))
In my case it doesn't make sense to join (with nested loops) of these appends since the consistent rows are only within 1 day partitions group.
The desired result for me is that PostgreSQL makes joins between request_p1 to request_identity_p1, and request_p2 to request_identity_p2 first and only after that is makes appends of results.
The question is:
Is there a way to perform joins between partitions separately within 1 day partitions group?
Thanks.

Missing table access in PostgreSQL query plan

I have two same tables one having 1k rows and the second 1M rows. I use the following script to populate them.
CREATE TABLE Table1 (
id int NOT NULL primary key,
groupby int NOT NULL,
orderby int NOT NULL,
local_search int NOT NULL,
global_search int NOT NULL,
padding varchar(100) NOT NULL
);
CREATE TABLE Table2 (
id int NOT NULL primary key,
groupby int NOT NULL,
orderby int NOT NULL,
local_search int NOT NULL,
global_search int NOT NULL,
padding varchar(100) NOT NULL
);
INSERT
INTO Table1
WITH t1 AS
(
SELECT id
FROM generate_series(1, 10000) id
), t2 AS
(
SELECT id,
id % 100 groupby
FROM t1
), t3 AS
(
SELECT b.id, b.groupby, row_number() over (partition by groupby order by id) orderby
FROM t2 b
)
SELECT id,
groupby,
orderby,
orderby % 50 local_search,
id % 1000 global_search,
RPAD('Value ' || id || ' ' , 100, '*') as padding
FROM t3;
INSERT
INTO Table2
WITH t1 AS
(
SELECT id
FROM generate_series(1, 1000000) id
), t2 AS
(
SELECT id,
id % 100 groupby
FROM t1
), t3 AS
(
SELECT b.id, b.groupby, row_number() over (partition by groupby order by id) orderby
FROM t2 b
)
SELECT id,
groupby,
orderby,
orderby % 50 local_search,
id % 1000 global_search,
RPAD('Value ' || id || ' ' , 100, '*') as padding
FROM t3;
I created also secondary index on table2
CREATE INDEX ix_Table2_groupby_orderby ON Table2 (groupby, orderby);
Now, I have the following query
select b.id, b.groupby, b.orderby, b.local_search, b.global_search, b.padding
from Table2 b
join Table1 a on b.orderby = a.id
where a.global_search = 1 and b.groupby < 10;
which leads to the following query plan using explain(analyze)
"Nested Loop (cost=0.42..17787.05 rows=100 width=121) (actual time=0.056..34.722 rows=100 loops=1)"
" -> Seq Scan on table1 a (cost=0.00..318.00 rows=10 width=4) (actual time=0.033..1.313 rows=10 loops=1)"
" Filter: (global_search = 1)"
" Rows Removed by Filter: 9990"
" -> Index Scan using ix_table2_groupby_orderby on table2 b (cost=0.42..1746.81 rows=10 width=121) (actual time=0.159..3.337 rows=10 loops=10)"
" Index Cond: ((groupby < 10) AND (orderby = a.id))"
"Planning time: 0.296 ms"
"Execution time: 34.775 ms"
and my question is: how it comes that he does not access the table2 in the query plan? He uses just ix_table2_groupby_orderby, but it contains just groupby, orderby and maybe id columns. How he gets the remaining columns of Table2 and why it is not in the query plan?
** EDIT **
I have tried explain(verbose) As suggested #laurenzalbe. This is the result
"Nested Loop (cost=0.42..17787.05 rows=100 width=121) (actual time=0.070..35.678 rows=100 loops=1)"
" Output: b.id, b.groupby, b.orderby, b.local_search, b.global_search, b.padding"
" -> Seq Scan on public.table1 a (cost=0.00..318.00 rows=10 width=4) (actual time=0.031..1.642 rows=10 loops=1)"
" Output: a.id, a.groupby, a.orderby, a.local_search, a.global_search, a.padding"
" Filter: (a.global_search = 1)"
" Rows Removed by Filter: 9990"
" -> Index Scan using ix_table2_groupby_orderby on public.table2 b (cost=0.42..1746.81 rows=10 width=121) (actual time=0.159..3.398 rows=10 loops=10)"
" Output: b.id, b.groupby, b.orderby, b.local_search, b.global_search, b.padding"
" Index Cond: ((b.groupby < 10) AND (b.orderby = a.id))"
"Planning time: 16.201 ms"
"Execution time: 35.754 ms"
Actually, I do not fully understand why the access to the heap of table2 is not there, but I accept it as an answer.

An index scan in PostgreSQL accesses not only the index, but also the table. This is not explicitly shown in the execution plan and is necessary to find out if a row is visible to the transaction or not.
Try EXPLAIN (VERBOSE) to see what columns are returned.
See the documentation for details:
All indexes in PostgreSQL are secondary indexes, meaning that each index is stored separately from the table's main data area (which is called the table's heap in PostgreSQL terminology). This means that in an ordinary index scan, each row retrieval requires fetching data from both the index and the heap.

Postgresql partition into a fixed set of files by ID

Apache Spark has the option to split into multiple files with the bucketBy command. For example if I have 100 million user IDs, I can split the table into 32 different files, where some type of hashing algorithm is used to distribute and lookup the data between files.
Can Postgres split tables into a fixed number of partitions somehow? If it's not a native feature can it still be accomplished, for example generate a hash; turn hash into a number; take modulo % 32 as parititon range.

example with modulo:
a short partitions setup:
db=# create table p(i int);
CREATE TABLE
db=# create table p1 ( check (mod(i,3)=0) ) inherits (p);
CREATE TABLE
db=# create table p2 ( check (mod(i,3)=1) ) inherits (p);
CREATE TABLE
db=# create table p3 ( check (mod(i,3)=2) ) inherits (p);
CREATE TABLE
db=# create rule pir3 AS ON insert to p where mod(i,3) = 2 do instead insert into p3 values (new.*);
CREATE RULE
db=# create rule pir2 AS ON insert to p where mod(i,3) = 1 do instead insert into p2 values (new.*);
CREATE RULE
db=# create rule pir1 AS ON insert to p where mod(i,3) = 0 do instead insert into p1 values (new.*);
CREATE RULE
checking:
db=# insert into p values (1),(2),(3),(4),(5);
INSERT 0 0
db=# select * from p;
i
---
3
1
4
2
5
(5 rows)
db=# select * from p1;
i
---
3
(1 row)
db=# select * from p2;
i
---
1
4
(2 rows)
db=# select * from p3;
i
---
2
5
(2 rows)
https://www.postgresql.org/docs/current/static/tutorial-inheritance.html
https://www.postgresql.org/docs/current/static/ddl-partitioning.html
and demo of partitions working:
db=# explain analyze select * from p where mod(i,3) = 2;
QUERY PLAN
----------------------------------------------------------------------------------------------------
Append (cost=0.00..48.25 rows=14 width=4) (actual time=0.013..0.015 rows=2 loops=1)
-> Seq Scan on p (cost=0.00..0.00 rows=1 width=4) (actual time=0.004..0.004 rows=0 loops=1)
Filter: (mod(i, 3) = 2)
-> Seq Scan on p3 (cost=0.00..48.25 rows=13 width=4) (actual time=0.009..0.011 rows=2 loops=1)
Filter: (mod(i, 3) = 2)
Planning time: 0.203 ms
Execution time: 0.052 ms
(7 rows)

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

Sql query based on historical event and its performance - sql

Related

Query Value by Max Date in Postgresql

Primary key scanning in partitioned table

Separate PostgreSQL partitions join

Missing table access in PostgreSQL query plan

Postgresql partition into a fixed set of files by ID

Categories

Resources