I already asked this question here, but it contained too little information, so I am creating a new question with more detail.
Here is a sample of the table I have. Each row contains data filled in by a user at a given time, so the timestamp column is never null anywhere in the table. A value may be missing under item if the user didn't fill it in. The id column is auto-generated for each record.
CREATE TABLE tbl (id int, customer_id text, item text, value text, timestamp timestamp);
INSERT INTO tbl VALUES
(1, '001', 'price', '1000', '2021-11-01 01:00:00'),
(2, '001', 'price', '1500', '2021-11-02 01:00:00'),
(3, '001', 'price', '1400', '2021-11-03 01:00:00'),
(4, '001', 'condition', 'good', '2021-11-01 01:00:00'),
(5, '001', 'condition', 'good', '2021-11-02 01:00:00'),
(6, '001', 'condition', 'ok', '2021-11-03 01:00:00'),
(7, '001', 'feeling', 'sad', '2021-11-01 01:00:00'),
(8, '001', 'feeling', 'angry', '2021-11-02 01:00:00'),
(9, '001', 'feeling', 'fine', '2021-11-03 01:00:00'),
(10, '002', 'price', '1200', '2021-11-01 01:00:00'),
(11, '002', 'price', '1600', '2021-11-02 01:00:00'),
(12, '002', 'price', '2000', '2021-11-03 01:00:00'),
(13, '002', 'weather', 'sunny', '2021-11-01 01:00:00'),
(14, '002', 'weather', 'rain', '2021-11-02 01:00:00'),
(15, '002', 'price', '1900', '2021-11-04 01:00:00'),
(16, '002', 'feeling', 'sad', '2021-11-01 01:00:00'),
(17, '002', 'feeling', 'angry', '2021-11-02 01:00:00'),
(18, '002', 'feeling', 'fine', '2021-11-03 01:00:00'),
(19, '003', 'price', '1000', '2021-11-01 01:00:00'),
(20, '003', 'price', '1500', '2021-11-02 01:00:00'),
(21, '003', 'price', '2000', '2021-11-03 01:00:00'),
(22, '003', 'condition', 'ok', '2021-11-01 01:00:00'),
(23, '003', 'weather', 'rain', '2021-11-02 01:00:00'),
(24, '003', 'condition', 'bad', '2021-11-03 01:00:00'),
(25, '003', 'feeling', 'fine', '2021-11-01 01:00:00'),
(26, '003', 'weather', 'sunny', '2021-11-03 01:00:00'),
(27, '003', 'feeling', 'sad', '2021-11-03 01:00:00')
;
To make it easier to read, I ordered the table above by id and timestamp; the order itself doesn't matter.
We are using PostgreSQL 9.5.19.
The actual table contains over 4 million rows.
The item column contains over 500 distinct items, but I will use at most 10 items in any one query. The table above uses only 4 items.
We also have another table, Customer_table, with a unique Customer_id and the customers' general information.
From the above table, I want to query the latest updated value of each item for each customer, as below. Since I will use at most 10 items per query, there may be up to 10 item columns.
customer_id | price | condition | feeling | weather | ...
002         | 1900  | null      | fine    | rain
001         | 1400  | ok        | fine    | null
003         | 2000  | bad       | sad     | sunny
(there may be other columns, one per item)
This is the query that I got from a previous question, but there I asked about only two items.
SELECT customer_id, p.value AS price, c.value AS condition
FROM  (
    SELECT DISTINCT ON (customer_id)
           customer_id, value
    FROM   tbl
    WHERE  item = 'condition'
    ORDER  BY customer_id, timestamp DESC
    ) c
FULL  JOIN (
    SELECT DISTINCT ON (customer_id)
           customer_id, value
    FROM   tbl
    WHERE  item = 'price'
    ORDER  BY customer_id, timestamp DESC
    ) p USING (customer_id);
So, if there is any better solution, please help me.
Thank you.
You may try other approaches that use row_number() to identify the most recent record per customer and item. You may then aggregate by customer_id, taking the MAX of a CASE expression (or using a FILTER clause) that picks out the value for each item name from the rows with the desired row number rn = 1 (ordering by timestamp descending).
These approaches are less verbose and, based on the results in the online environment, seem to be more performant. Let me know in the comments how they work out when replicated in your environment.
You may use EXPLAIN ANALYZE to compare these approaches to your current one. The online environment gave the following results:
Current Approach
| Planning time: 0.129 ms
| Execution time: 0.056 ms
Suggested Approach 1
| Planning time: 0.061 ms
| Execution time: 0.070 ms
Suggested Approach 2
| Planning time: 0.047 ms
| Execution time: 0.056 ms
NB: Use EXPLAIN ANALYZE to compare these approaches in your own environment, which we cannot replicate online; the results may also vary from run to run. Indexes and early filters on the item column are also recommended to improve performance.
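For example, an index that matches the item filter and the window ordering could look like this (a sketch only; the index name is illustrative, and the column order assumes you always filter on item as below):
CREATE INDEX tbl_item_customer_ts_idx ON tbl (item, customer_id, timestamp DESC);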
Suggested Approach 1
SELECT
t1.customer_id,
MAX(CASE WHEN t1.item='condition' THEN t1.value END) as condition,
MAX(CASE WHEN t1.item='price' THEN t1.value END) as price,
MAX(CASE WHEN t1.item='feeling' THEN t1.value END) as feeling,
MAX(CASE WHEN t1.item='weather' THEN t1.value END) as weather
FROM (
SELECT
* ,
ROW_NUMBER() OVER (
PARTITION BY customer_id,item
ORDER BY tbl.timestamp DESC
) as rn
FROM
tbl
-- ensure that you filter based on your desired items
-- indexes on item column are recommended to improve performance
) t1
WHERE rn=1
GROUP BY
1;
customer_id | condition | price | feeling | weather
001         | ok        | 1400  | fine    | null
002         | null      | 1900  | fine    | rain
003         | bad       | 2000  | sad     | sunny
Suggested Approach 2
SELECT
t1.customer_id,
MAX(t1.value) FILTER (WHERE t1.item='condition') as condition,
MAX(t1.value) FILTER (WHERE t1.item='price') as price,
MAX(t1.value) FILTER (WHERE t1.item='feeling') as feeling,
MAX(t1.value) FILTER (WHERE t1.item='weather') as weather
FROM (
SELECT
* ,
ROW_NUMBER() OVER (
PARTITION BY customer_id,item
ORDER BY tbl.timestamp DESC
) as rn
FROM
tbl
-- ensure that you filter based on your desired items
-- indexes on item column are recommended to improve performance
) t1
WHERE rn=1
GROUP BY
1;
customer_id | condition | price | feeling | weather
001         | ok        | 1400  | fine    | null
002         | null      | 1900  | fine    | rain
003         | bad       | 2000  | sad     | sunny
Current Approach with EXPLAIN ANALYZE
EXPLAIN (ANALYZE, BUFFERS)
SELECT customer_id, p.value AS price, c.value AS condition
FROM  (
    SELECT DISTINCT ON (customer_id)
           customer_id, value
    FROM   tbl
    WHERE  item = 'condition'
    ORDER  BY customer_id, timestamp DESC
    ) c
FULL  JOIN (
    SELECT DISTINCT ON (customer_id)
           customer_id, value
    FROM   tbl
    WHERE  item = 'price'
    ORDER  BY customer_id, timestamp DESC
    ) p USING (customer_id);
QUERY PLAN
Merge Full Join  (cost=35.05..35.12 rows=1 width=128) (actual time=0.025..0.030 rows=3 loops=1)
  Merge Cond: (tbl.customer_id = tbl_1.customer_id)
  Buffers: shared hit=2
  ->  Unique  (cost=17.52..17.54 rows=1 width=72) (actual time=0.013..0.014 rows=2 loops=1)
        Buffers: shared hit=1
        ->  Sort  (cost=17.52..17.53 rows=3 width=72) (actual time=0.013..0.013 rows=5 loops=1)
              Sort Key: tbl.customer_id, tbl."timestamp" DESC
              Sort Method: quicksort  Memory: 25kB
              Buffers: shared hit=1
              ->  Seq Scan on tbl  (cost=0.00..17.50 rows=3 width=72) (actual time=0.004..0.006 rows=5 loops=1)
                    Filter: (item = 'condition'::text)
                    Rows Removed by Filter: 22
                    Buffers: shared hit=1
  ->  Materialize  (cost=17.52..17.55 rows=1 width=64) (actual time=0.010..0.013 rows=3 loops=1)
        Buffers: shared hit=1
        ->  Unique  (cost=17.52..17.54 rows=1 width=72) (actual time=0.010..0.012 rows=3 loops=1)
              Buffers: shared hit=1
              ->  Sort  (cost=17.52..17.53 rows=3 width=72) (actual time=0.010..0.010 rows=10 loops=1)
                    Sort Key: tbl_1.customer_id, tbl_1."timestamp" DESC
                    Sort Method: quicksort  Memory: 25kB
                    Buffers: shared hit=1
                    ->  Seq Scan on tbl tbl_1  (cost=0.00..17.50 rows=3 width=72) (actual time=0.001..0.003 rows=10 loops=1)
                          Filter: (item = 'price'::text)
                          Rows Removed by Filter: 17
                          Buffers: shared hit=1
Planning time: 0.129 ms
Execution time: 0.056 ms
Suggested Approach 1 with EXPLAIN ANALYZE
EXPLAIN (ANALYZE, BUFFERS)
SELECT
t1.customer_id,
MAX(CASE WHEN t1.item='price' THEN t1.value END) as price,
MAX(CASE WHEN t1.item='condition' THEN t1.value END) as condition
FROM (
SELECT
* ,
ROW_NUMBER() OVER (
PARTITION BY customer_id,item
ORDER BY tbl.timestamp DESC
) as rn
FROM
tbl
where item IN ('price','condition')
) t1
WHERE rn=1
GROUP BY
1;
QUERY PLAN
GroupAggregate  (cost=17.58..17.81 rows=1 width=96) (actual time=0.039..0.047 rows=3 loops=1)
  Group Key: t1.customer_id
  Buffers: shared hit=1
  ->  Subquery Scan on t1  (cost=17.58..17.79 rows=1 width=96) (actual time=0.030..0.040 rows=5 loops=1)
        Filter: (t1.rn = 1)
        Rows Removed by Filter: 10
        Buffers: shared hit=1
        ->  WindowAgg  (cost=17.58..17.71 rows=6 width=104) (actual time=0.029..0.038 rows=15 loops=1)
              Buffers: shared hit=1
              ->  Sort  (cost=17.58..17.59 rows=6 width=104) (actual time=0.028..0.030 rows=15 loops=1)
                    Sort Key: tbl.customer_id, tbl.item, tbl."timestamp" DESC
                    Sort Method: quicksort  Memory: 26kB
                    Buffers: shared hit=1
                    ->  Seq Scan on tbl  (cost=0.00..17.50 rows=6 width=104) (actual time=0.003..0.008 rows=15 loops=1)
                          Filter: (item = ANY ('{price,condition}'::text[]))
                          Rows Removed by Filter: 12
                          Buffers: shared hit=1
Planning time: 0.061 ms
Execution time: 0.070 ms
Suggested Approach 2 with EXPLAIN ANALYZE
EXPLAIN (ANALYZE, BUFFERS)
SELECT
t1.customer_id,
MAX(t1.value) FILTER (WHERE t1.item='price') as price,
MAX(t1.value) FILTER (WHERE t1.item='condition') as condition
FROM (
SELECT
* ,
ROW_NUMBER() OVER (
PARTITION BY customer_id,item
ORDER BY tbl.timestamp DESC
) as rn
FROM
tbl
where item IN ('price','condition')
) t1
WHERE rn=1
GROUP BY
1;
QUERY PLAN
GroupAggregate  (cost=17.58..17.81 rows=1 width=96) (actual time=0.029..0.037 rows=3 loops=1)
  Group Key: t1.customer_id
  Buffers: shared hit=1
  ->  Subquery Scan on t1  (cost=17.58..17.79 rows=1 width=96) (actual time=0.021..0.032 rows=5 loops=1)
        Filter: (t1.rn = 1)
        Rows Removed by Filter: 10
        Buffers: shared hit=1
        ->  WindowAgg  (cost=17.58..17.71 rows=6 width=104) (actual time=0.021..0.030 rows=15 loops=1)
              Buffers: shared hit=1
              ->  Sort  (cost=17.58..17.59 rows=6 width=104) (actual time=0.019..0.021 rows=15 loops=1)
                    Sort Key: tbl.customer_id, tbl.item, tbl."timestamp" DESC
                    Sort Method: quicksort  Memory: 26kB
                    Buffers: shared hit=1
                    ->  Seq Scan on tbl  (cost=0.00..17.50 rows=6 width=104) (actual time=0.003..0.008 rows=15 loops=1)
                          Filter: (item = ANY ('{price,condition}'::text[]))
                          Rows Removed by Filter: 12
                          Buffers: shared hit=1
Planning time: 0.047 ms
Execution time: 0.056 ms
View working demo on DB Fiddle
You operate on a big table. You mentioned 4 million rows, obviously growing. While querying for ...
all customers
all items
with few rows per (customer_id, item)
with narrow rows (small row size)
... ggordon's solutions with row_number() are great. And short, too.
The whole table has to be processed in a sequential scan. Indices won't be used.
But prefer "Approach 2" with the modern aggregate FILTER syntax. It's clearer and faster. See performance tests here:
For absolute performance, is SUM faster or COUNT?
Approach 3: Pivot with crosstab()
crosstab() is typically faster, especially for more than a few items. See:
PostgreSQL Crosstab Query
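Note that crosstab() is not built into core PostgreSQL; it ships with the additional module tablefunc, which has to be installed once per database before the query below will run:
CREATE EXTENSION IF NOT EXISTS tablefunc;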
SELECT *
FROM crosstab(
$$
SELECT customer_id, item, value
FROM (
SELECT customer_id, item, value
, row_number() OVER (PARTITION BY customer_id, item ORDER BY t.timestamp DESC) AS rn
FROM tbl t
WHERE item = ANY ('{condition,price,feeling,weather}') -- your items here ...
) t1
WHERE rn = 1
ORDER BY customer_id, item
$$
, $$SELECT unnest('{condition,price,feeling,weather}'::text[])$$ -- ... here ...
) AS ct (customer_id text, condition text, price text, feeling text, weather text); -- ... and here ...
Approach 4: LATERAL Subqueries
If one or more of the criteria listed at the top do not apply, the above queries quickly fall behind in performance.
For starters, only max 10 of "500 distinct items" are involved. That's max ~ 2 % of the big table. That alone should make one of the following queries (much) faster in comparison:
SELECT *
FROM (SELECT customer_id FROM customer) c
LEFT JOIN LATERAL (
SELECT value AS condition
FROM tbl t
WHERE t.customer_id = c.customer_id
AND t.item = 'condition'
ORDER BY t.timestamp DESC
LIMIT 1
) AS t1 ON true
LEFT JOIN LATERAL (
SELECT value AS price
FROM tbl t
WHERE t.customer_id = c.customer_id
AND t.item = 'price'
ORDER BY t.timestamp DESC
LIMIT 1
) AS t2 ON true
LEFT JOIN LATERAL (
SELECT value AS feeling
FROM tbl t
WHERE t.customer_id = c.customer_id
AND t.item = 'feeling'
ORDER BY t.timestamp DESC
LIMIT 1
) AS t3 ON true
-- ... more?
About LEFT JOIN LATERAL:
What is the difference between LATERAL JOIN and a subquery in PostgreSQL?
The point is to get a query plan with relatively few index(-only) scans to replace the expensive sequential scan on the big table.
Requires an applicable index, obviously:
CREATE INDEX ON tbl (customer_id, item);
Or better (in Postgres 9.5):
CREATE INDEX ON tbl (customer_id, item, timestamp DESC, value);
In Postgres 11 or later, this would be better, yet:
CREATE INDEX ON tbl (customer_id, item, timestamp DESC) INCLUDE (value);
If only few items are of interest, partial indices on those items would be even better.
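Such a partial index could look like this (a sketch; create one per item of interest, index name illustrative):
CREATE INDEX tbl_price_part_idx ON tbl (customer_id, timestamp DESC)
WHERE item = 'price';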
Approach 5: Correlated Subqueries
SELECT c.customer_id
, (SELECT value FROM tbl t WHERE t.customer_id = c.customer_id AND t.item = 'condition' ORDER BY t.timestamp DESC LIMIT 1) AS condition
, (SELECT value FROM tbl t WHERE t.customer_id = c.customer_id AND t.item = 'price' ORDER BY t.timestamp DESC LIMIT 1) AS price
, (SELECT value FROM tbl t WHERE t.customer_id = c.customer_id AND t.item = 'feeling' ORDER BY t.timestamp DESC LIMIT 1) AS feeling
, (SELECT value FROM tbl t WHERE t.customer_id = c.customer_id AND t.item = 'weather' ORDER BY t.timestamp DESC LIMIT 1) AS weather
FROM customer c;
Not as versatile as LATERAL, but good enough for the purpose. Same index requirements as approach 4.
Approach 5 will be king of performance in most cases.
db<>fiddle here
Improving your relational design and/or upgrading to a current version of Postgres would go a long way, too.
Related
I'm using PostgreSQL 10.6. I have several tables partitioned by day; each day has its own data. I want to select rows from these tables within a day.
drop table IF EXISTS request;
drop table IF EXISTS request_identity;
CREATE TABLE IF NOT EXISTS request (
id bigint not null,
record_date date not null,
payload text not null
) PARTITION BY LIST (record_date);
CREATE TABLE IF NOT EXISTS request_p1 PARTITION OF request FOR VALUES IN ('2001-01-01');
CREATE TABLE IF NOT EXISTS request_p2 PARTITION OF request FOR VALUES IN ('2001-01-02');
CREATE INDEX IF NOT EXISTS i_request_p1_id ON request_p1 (id);
CREATE INDEX IF NOT EXISTS i_request_p2_id ON request_p2 (id);
do $$
begin
for i in 1..100000 loop
INSERT INTO request (id,record_date,payload) values (i, '2001-01-01', 'abc');
end loop;
for i in 100001..200000 loop
INSERT INTO request (id,record_date,payload) values (i, '2001-01-02', 'abc');
end loop;
end;
$$;
CREATE TABLE IF NOT EXISTS request_identity (
record_date date not null,
parent_id bigint NOT NULL,
identity_name varchar(32),
identity_value varchar(32)
) PARTITION BY LIST (record_date);
CREATE TABLE IF NOT EXISTS request_identity_p1 PARTITION OF request_identity FOR VALUES IN ('2001-01-01');
CREATE TABLE IF NOT EXISTS request_identity_p2 PARTITION OF request_identity FOR VALUES IN ('2001-01-02');
CREATE INDEX IF NOT EXISTS i_request_identity_p1_payload ON request_identity_p1 (identity_name, identity_value);
CREATE INDEX IF NOT EXISTS i_request_identity_p2_payload ON request_identity_p2 (identity_name, identity_value);
do $$
begin
for i in 1..100000 loop
INSERT INTO request_identity (parent_id,record_date,identity_name,identity_value) values (i, '2001-01-01', 'NAME', 'somename'||i);
end loop;
for i in 100001..200000 loop
INSERT INTO request_identity (parent_id,record_date,identity_name,identity_value) values (i, '2001-01-02', 'NAME', 'somename'||i);
end loop;
end;
$$;
analyze request;
analyze request_identity;
When I select within one day, I see a good query plan:
explain analyze select *
from request
where record_date between '2001-01-01' and '2001-01-01'
and exists (select * from request_identity where parent_id = id and identity_name = 'NAME' and identity_value = 'somename555' and record_date between '2001-01-01' and '2001-01-01')
limit 100;
Limit  (cost=8.74..16.78 rows=1 width=16)
  ->  Nested Loop  (cost=8.74..16.78 rows=1 width=16)
        ->  HashAggregate  (cost=8.45..8.46 rows=1 width=8)
              Group Key: request_identity_p1.parent_id
              ->  Append  (cost=0.42..8.44 rows=1 width=8)
                    ->  Index Scan using i_request_identity_p1_payload on request_identity_p1  (cost=0.42..8.44 rows=1 width=8)
                          Index Cond: (((identity_name)::text = 'NAME'::text) AND ((identity_value)::text = 'somename555'::text))
                          Filter: ((record_date >= '2001-01-01'::date) AND (record_date <= '2001-01-01'::date))
        ->  Append  (cost=0.29..8.32 rows=1 width=16)
              ->  Index Scan using i_request_p1_id on request_p1  (cost=0.29..8.32 rows=1 width=16)
                    Index Cond: (id = request_identity_p1.parent_id)
                    Filter: ((record_date >= '2001-01-01'::date) AND (record_date <= '2001-01-01'::date))
But if I make a select for two days or more, PostgreSQL first appends the rows of all partitions of request_identity and all partitions of request, and then joins them.
So this is the SQL that is not working as I want:
explain analyze select *
from request
where record_date between '2001-01-01' and '2001-01-02'
and exists (select * from request_identity where parent_id = id and identity_name = 'NAME' and identity_value = 'somename1777' and record_date between '2001-01-01' and '2001-01-02')
limit 100;
Limit  (cost=17.19..50.21 rows=2 width=16)
  ->  Nested Loop  (cost=17.19..50.21 rows=2 width=16)
        ->  Unique  (cost=16.90..16.91 rows=2 width=8)
              ->  Sort  (cost=16.90..16.90 rows=2 width=8)
                    Sort Key: request_identity_p1.parent_id
                    ->  Append  (cost=0.42..16.89 rows=2 width=8)
                          ->  Index Scan using i_request_identity_p1_payload on request_identity_p1  (cost=0.42..8.44 rows=1 width=8)
                                Index Cond: (((identity_name)::text = 'NAME'::text) AND ((identity_value)::text = 'somename1777'::text))
                                Filter: ((record_date >= '2001-01-01'::date) AND (record_date <= '2001-01-02'::date))
                          ->  Index Scan using i_request_identity_p2_payload on request_identity_p2  (cost=0.42..8.44 rows=1 width=8)
                                Index Cond: (((identity_name)::text = 'NAME'::text) AND ((identity_value)::text = 'somename1777'::text))
                                Filter: ((record_date >= '2001-01-01'::date) AND (record_date <= '2001-01-02'::date))
        ->  Append  (cost=0.29..16.63 rows=2 width=16)
              ->  Index Scan using i_request_p1_id on request_p1  (cost=0.29..8.32 rows=1 width=16)
                    Index Cond: (id = request_identity_p1.parent_id)
                    Filter: ((record_date >= '2001-01-01'::date) AND (record_date <= '2001-01-02'::date))
              ->  Index Scan using i_request_p2_id on request_p2  (cost=0.29..8.32 rows=1 width=16)
                    Index Cond: (id = request_identity_p1.parent_id)
                    Filter: ((record_date >= '2001-01-01'::date) AND (record_date <= '2001-01-02'::date))
In my case it doesn't make sense to join (with nested loops) these appends, since matching rows only ever exist within the same one-day partition group.
The desired result for me is that PostgreSQL first joins request_p1 to request_identity_p1 and request_p2 to request_identity_p2, and only after that appends the results.
The question is:
Is there a way to perform the joins between partitions separately, within each one-day partition group?
Thanks.
Below are tables and sample data. For a given ngram, I want to get the data up to a provided event_id - 1 (or up to the latest if no event_id is provided), and only return data if the row at that max(event_id) has some data.
Please look at the examples and scenarios below; it's easier to explain that way.
CREATE TABLE NGRAM_CONTENT
(
event_id BIGINT PRIMARY KEY,
ref TEXT NOT NULL,
data TEXT
);
CREATE TABLE NGRAM
(
    event_id    BIGINT PRIMARY KEY,
    ngram       TEXT NOT NULL,
    ref         TEXT NOT NULL,
    name_length INT NOT NULL
);
insert into NGRAM_CONTENT(event_id, ref, data) values (1, 'p1', 'a data 1');
insert into NGRAM_CONTENT(event_id, ref, data) values (2, 'p1', 'a data 2');
insert into NGRAM_CONTENT(event_id, ref, data) values (3, 'p1', null);
insert into NGRAM_CONTENT(event_id, ref, data) values (4, 'p2', 'b data 1');
insert into NGRAM_CONTENT(event_id, ref, data) values (5, 'p2', 'b data 2');
insert into NGRAM_CONTENT(event_id, ref, data) values (6, 'p2', 'c data 1');
insert into NGRAM_CONTENT(event_id, ref, data) values (7, 'p2', 'c data 2');
insert into NGRAM(ngram, ref, event_id, name_length) values ('a', 'p1', 1, 10);
insert into NGRAM(ngram, ref, event_id, name_length) values ('a', 'p1', 2, 12);
insert into NGRAM(ngram, ref, event_id, name_length) values ('b', 'p1', 2, 13);
insert into NGRAM(ngram, ref, event_id, name_length) values ('b', 'p2', 4, 8);
insert into NGRAM(ngram, ref, event_id, name_length) values ('b', 'p2', 5, 10); -- event_id 5 assumed; it is the only id from NGRAM_CONTENT not otherwise used
insert into NGRAM(ngram, ref, event_id, name_length) values ('c', 'p2', 6, 20);
insert into NGRAM(ngram, ref, event_id, name_length) values ('c', 'p2', 7, 50);
Here are the tables of example input and desired output
| ngram   | event_id | output     |
| 'a','b' | < 2      | 'a data 1' |
| 'a','b' | < 3      | 'a data 2' |
| 'a','b' | < 4      | null       |
| 'a'     | -        | null       |
| 'a'     | -        | null       |
| 'b'     | -        | null       |
| 'a','b' | -        | null       |
| 'b'     | < 5      | 'b data 1' |
| 'c'     | < 7      | 'c data 1' |
| 'c'     | -        | 'c data 2' |
I've got the following query that works. (I haven't used name_length in the examples above, as it would just complicate them.) The 1000000000 needs to be replaced with the provided event_id, and likewise the values being searched for in ngram.
with max_matched_event_id_and_ref_from_index as
(
-- getting max event id and ref for all the potential matches
select max(event_id) as max_matched_event_id, ref
from ngram
where name_length between 14 and 18 and
ngram in ('a', 'b')
and event_id < 1000000000
group by ref
having count(*) >= 2
),
max_current_event_id as
(
select max(event_id) as max_current_event_id
from ngram_content w
inner join max_matched_event_id_and_ref_from_index n on w.ref = n.ref
where w.event_id >= n.max_matched_event_id and event_id < 1000000000
group by n.ref
)
select nc.data
from ngram_content nc
inner join max_current_event_id m on nc.event_id = m.max_current_event_id
inner join max_matched_event_id_and_ref_from_index mi on nc.event_id = mi.max_matched_event_id;
I have about 450 million rows in the ngram table and about 55 million rows in the ngram_content table.
The current query takes over a minute to return, which is too slow for our usage.
I've got the following indexes:
CREATE INDEX combined_index ON NGRAM (ngram, name_length, ref, event_id);
CREATE INDEX idx_ref_ngram_content ON ngram_content (ref);
CREATE INDEX idx_ngram_content_event_id ON ngram_content (event_id);
And here is the detailed output from the query plan execution:
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
Hash Join  (cost=4818702.89..4820783.06 rows=1 width=365) (actual time=29201.537..29227.214 rows=15081 loops=1)
  Hash Cond: (mi.max_matched_event_id = nc.event_id)
  Buffers: shared hit=381223 read=342422, temp read=2204 written=2213
  CTE max_matched_event_id_and_ref_from_index
    ->  Finalize GroupAggregate  (cost=3720947.79..3795574.47 rows=87586 width=16) (actual time=19163.811..19978.720 rows=43427 loops=1)
          Group Key: ngram.ref
          Filter: (count(*) >= 2)
          Rows Removed by Filter: 999474
          Buffers: shared hit=35270 read=225113, temp read=1402 written=1410
          ->  Gather Merge  (cost=3720947.79..3788348.60 rows=525518 width=24) (actual time=19163.620..19649.679 rows=1048271 loops=1)
                Workers Planned: 2
                Workers Launched: 2
                Buffers: shared hit=104576 read=669841 written=1, temp read=4183 written=4207
                ->  Partial GroupAggregate  (cost=3719947.77..3726690.76 rows=262759 width=24) (actual time=19143.782..19356.718 rows=349424 loops=3)
                      Group Key: ngram.ref
                      Buffers: shared hit=104576 read=669841 written=1, temp read=4183 written=4207
                      ->  Sort  (cost=3719947.77..3720976.62 rows=411540 width=16) (actual time=19143.770..19231.539 rows=362406 loops=3)
                            Sort Key: ngram.ref
                            Sort Method: external merge  Disk: 11216kB
                            Worker 0:  Sort Method: external merge  Disk: 11160kB
                            Worker 1:  Sort Method: external merge  Disk: 11088kB
                            Buffers: shared hit=104576 read=669841 written=1, temp read=4183 written=4207
                            ->  Parallel Index Only Scan using combined_index on ngram  (cost=0.57..3674535.28 rows=411540 width=16) (actual time=1.122..18715.404 rows=362406 loops=3)
                                  Index Cond: ((ngram = ANY ('{ORA,AN,MG}'::text[])) AND (name_length >= 14) AND (name_length <= 18) AND (event_id < 1000000000))
                                  Heap Fetches: 1087219
                                  Buffers: shared hit=104560 read=669841 written=1
  CTE max_current_event_id
    ->  GroupAggregate  (cost=1020964.39..1021403.41 rows=200 width=40) (actual time=7631.312..7674.228 rows=43427 loops=1)
          Group Key: n.ref
          Buffers: shared hit=174985 read=70887, temp read=1179 written=273
          ->  Sort  (cost=1020964.39..1021110.06 rows=58270 width=40) (actual time=7631.304..7644.203 rows=71773 loops=1)
                Sort Key: n.ref
                Sort Method: external merge  Disk: 2176kB
                Buffers: shared hit=174985 read=70887, temp read=1179 written=273
                ->  Nested Loop  (cost=0.56..1016352.18 rows=58270 width=40) (actual time=1.093..7574.448 rows=71773 loops=1)
                      Buffers: shared hit=174985 read=70887, temp read=907
                      ->  CTE Scan on max_matched_event_id_and_ref_from_index n  (cost=0.00..1751.72 rows=87586 width=40) (actual time=0.000..838.522 rows=43427 loops=1)
                            Buffers: temp read=907
                      ->  Index Scan using idx_ref_ngram_content on ngram_content w  (cost=0.56..11.57 rows=1 width=16) (actual time=0.104..0.154 rows=2 loops=43427)
                            Index Cond: (ref = n.ref)
                            Filter: ((event_id < 1000000000) AND (event_id >= n.max_matched_event_id))
                            Rows Removed by Filter: 0
                            Buffers: shared hit=174985 read=70887
  ->  CTE Scan on max_matched_event_id_and_ref_from_index mi  (cost=0.00..1751.72 rows=87586 width=8) (actual time=19163.813..19168.081 rows=43427 loops=1)
        Buffers: shared hit=35270 read=225113, temp read=495 written=1410
  ->  Hash  (cost=1722.50..1722.50 rows=200 width=381) (actual time=10035.797..10035.797 rows=43427 loops=1)
        Buckets: 32768 (originally 1024)  Batches: 2 (originally 1)  Memory Usage: 3915kB
        Buffers: shared hit=345953 read=117309, temp read=1179 written=704
        ->  Nested Loop  (cost=0.56..1722.50 rows=200 width=381) (actual time=7632.365..9994.328 rows=43427 loops=1)
              Buffers: shared hit=345953 read=117309, temp read=1179 written=273
              ->  CTE Scan on max_current_event_id m  (cost=0.00..4.00 rows=200 width=8) (actual time=7631.315..7695.869 rows=43427 loops=1)
                    Buffers: shared hit=174985 read=70887, temp read=1179 written=273
              ->  Index Scan using idx_ngram_content_event_id on ngram_content nc  (cost=0.56..8.58 rows=1 width=373) (actual time=0.052..0.052 rows=1 loops=43427)
                    Index Cond: (event_id = m.max_current_event_id)
                    Buffers: shared hit=170968 read=46422
Planning Time: 7.872 ms
Execution Time: 29231.222 ms
(57 rows)
Any suggestions on how to optimise the query or indexes so the query can run faster please?
I have two tables with the same structure, one with 10k rows and the second with 1M rows. I use the following script to populate them.
CREATE TABLE Table1 (
id int NOT NULL primary key,
groupby int NOT NULL,
orderby int NOT NULL,
local_search int NOT NULL,
global_search int NOT NULL,
padding varchar(100) NOT NULL
);
CREATE TABLE Table2 (
id int NOT NULL primary key,
groupby int NOT NULL,
orderby int NOT NULL,
local_search int NOT NULL,
global_search int NOT NULL,
padding varchar(100) NOT NULL
);
INSERT
INTO Table1
WITH t1 AS
(
SELECT id
FROM generate_series(1, 10000) id
), t2 AS
(
SELECT id,
id % 100 groupby
FROM t1
), t3 AS
(
SELECT b.id, b.groupby, row_number() over (partition by groupby order by id) orderby
FROM t2 b
)
SELECT id,
groupby,
orderby,
orderby % 50 local_search,
id % 1000 global_search,
RPAD('Value ' || id || ' ' , 100, '*') as padding
FROM t3;
INSERT
INTO Table2
WITH t1 AS
(
SELECT id
FROM generate_series(1, 1000000) id
), t2 AS
(
SELECT id,
id % 100 groupby
FROM t1
), t3 AS
(
SELECT b.id, b.groupby, row_number() over (partition by groupby order by id) orderby
FROM t2 b
)
SELECT id,
groupby,
orderby,
orderby % 50 local_search,
id % 1000 global_search,
RPAD('Value ' || id || ' ' , 100, '*') as padding
FROM t3;
I also created a secondary index on Table2:
CREATE INDEX ix_Table2_groupby_orderby ON Table2 (groupby, orderby);
Now, I have the following query:
select b.id, b.groupby, b.orderby, b.local_search, b.global_search, b.padding
from Table2 b
join Table1 a on b.orderby = a.id
where a.global_search = 1 and b.groupby < 10;
which leads to the following query plan using EXPLAIN (ANALYZE):
"Nested Loop (cost=0.42..17787.05 rows=100 width=121) (actual time=0.056..34.722 rows=100 loops=1)"
" -> Seq Scan on table1 a (cost=0.00..318.00 rows=10 width=4) (actual time=0.033..1.313 rows=10 loops=1)"
" Filter: (global_search = 1)"
" Rows Removed by Filter: 9990"
" -> Index Scan using ix_table2_groupby_orderby on table2 b (cost=0.42..1746.81 rows=10 width=121) (actual time=0.159..3.337 rows=10 loops=10)"
" Index Cond: ((groupby < 10) AND (orderby = a.id))"
"Planning time: 0.296 ms"
"Execution time: 34.775 ms"
and my question is: how come the plan does not show any access to table2? It uses just ix_table2_groupby_orderby, which contains only the groupby and orderby (and maybe id) columns. How does it get the remaining columns of Table2, and why does that access not appear in the query plan?
** EDIT **
I have tried EXPLAIN (VERBOSE) as suggested by @laurenzalbe. This is the result:
"Nested Loop (cost=0.42..17787.05 rows=100 width=121) (actual time=0.070..35.678 rows=100 loops=1)"
" Output: b.id, b.groupby, b.orderby, b.local_search, b.global_search, b.padding"
" -> Seq Scan on public.table1 a (cost=0.00..318.00 rows=10 width=4) (actual time=0.031..1.642 rows=10 loops=1)"
" Output: a.id, a.groupby, a.orderby, a.local_search, a.global_search, a.padding"
" Filter: (a.global_search = 1)"
" Rows Removed by Filter: 9990"
" -> Index Scan using ix_table2_groupby_orderby on public.table2 b (cost=0.42..1746.81 rows=10 width=121) (actual time=0.159..3.398 rows=10 loops=10)"
" Output: b.id, b.groupby, b.orderby, b.local_search, b.global_search, b.padding"
" Index Cond: ((b.groupby < 10) AND (b.orderby = a.id))"
"Planning time: 16.201 ms"
"Execution time: 35.754 ms"
Actually, I do not fully understand why the heap access to table2 does not show up, but I accept this as an answer.
An index scan in PostgreSQL accesses not only the index, but also the table. This is not explicitly shown in the execution plan and is necessary to find out if a row is visible to the transaction or not.
Try EXPLAIN (VERBOSE) to see what columns are returned.
See the documentation for details:
All indexes in PostgreSQL are secondary indexes, meaning that each index is stored separately from the table's main data area (which is called the table's heap in PostgreSQL terminology). This means that in an ordinary index scan, each row retrieval requires fetching data from both the index and the heap.
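As an aside, if you want Postgres to be able to skip the heap entirely (an index-only scan), the index must contain every column the query reads, and the table's visibility map must be largely up to date. A sketch for the query above, assuming PostgreSQL 11 or later for the INCLUDE clause (index name illustrative):
CREATE INDEX ix_table2_covering ON Table2 (groupby, orderby)
    INCLUDE (id, local_search, global_search, padding);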
Given this partial index:
CREATE INDEX orders_id_created_at_index
ON orders(id) WHERE created_at < '2013-12-31';
Would this query use the index?
SELECT *
FROM orders
WHERE id = 123 AND created_at = '2013-10-12';
As per the documentation, "a partial index can be used in a query only if the system can recognize that the WHERE condition of the query mathematically implies the predicate of the index".
Does that mean that the index will or will not be used?
You can check, and yes, it would be used. I've created an SQL Fiddle to verify it with a script like this:
create table orders(id int, created_at date);
CREATE INDEX orders_id_created_at_index ON orders(id) WHERE created_at < '2013-12-31';
insert into orders
select
(random()*500)::int, '2013-01-01'::date + ((random() * 200)::int || ' day')::interval
from generate_series(1, 10000) as g;
SELECT * FROM orders WHERE id = 123 AND created_at = '2013-10-12';
SELECT * FROM orders WHERE id = 123 AND created_at = '2014-10-12';
sql fiddle demo
If you check the execution plans for these queries, you'll see for the first query:
Bitmap Heap Scan on orders  (cost=4.39..40.06 rows=1 width=8)
  Recheck Cond: ((id = 123) AND (created_at < '2013-12-31'::date))
  Filter: (created_at = '2013-10-12'::date)
  ->  Bitmap Index Scan on orders_id_created_at_index  (cost=0.00..4.39 rows=19 width=0)
        Index Cond: (id = 123)
and for second query:
Seq Scan on orders  (cost=0.00..195.00 rows=1 width=8)
  Filter: ((id = 123) AND (created_at = '2014-10-12'::date))
The planner can use the partial index only when the query's WHERE clause provably implies the index predicate: created_at = '2013-10-12' implies created_at < '2013-12-31', but created_at = '2014-10-12' does not, so the second query falls back to a sequential scan.
I need to quickly select a value (baz) from the "earliest" (MIN(save_date)) rows, grouped by their foo_id. The following query returns the correct rows (well, almost; it can return multiple rows per foo_id when there are duplicate save_dates).
The foos table contains about 55k rows and the samples table contains about 25 million rows.
CREATE TABLE foos (
    foo_id int,
    val varchar(40),
    -- ref_id is a FK, constraint omitted for brevity
    ref_id int
);

CREATE TABLE samples (
    sample_id int,
    save_date date,
    baz smallint,
    -- foo_id is a FK, constraint omitted for brevity
    foo_id int
);
WITH foo ( foo_id, val ) AS (
SELECT foo_id, val FROM foos
WHERE foos.ref_id = 1
ORDER BY foos.val ASC
LIMIT 25 OFFSET 0
)
SELECT foo.val, firsts.baz
FROM foo
LEFT JOIN (
SELECT A.baz, A.foo_id
FROM samples A
INNER JOIN (
SELECT foo_id, MIN( save_date ) AS save_date
FROM samples
GROUP BY foo_id
) B
USING ( foo_id, save_date )
) firsts USING ( foo_id )
This query currently takes over 100 seconds; I'd like to see this return in ~1 second (or less!).
How can I write this query to be optimal?
Updated; adding explains:
Obviously the actual query I'm using isn't using tables foo, baz, etc.
The "dumbed down" example query's (from above) explain:
Hash Right Join  (cost=337.69..635.47 rows=3 width=100)
  Hash Cond: (a.foo_id = foo.foo_id)
  CTE foo
    ->  Limit  (cost=71.52..71.53 rows=3 width=102)
          ->  Sort  (cost=71.52..71.53 rows=3 width=102)
                Sort Key: foos.val
                ->  Seq Scan on foos  (cost=0.00..71.50 rows=3 width=102)
                      Filter: (ref_id = 1)
  ->  Hash Join  (cost=265.25..562.90 rows=9 width=6)
        Hash Cond: ((a.foo_id = samples.foo_id) AND (a.save_date = (min(samples.save_date))))
        ->  Seq Scan on samples a  (cost=0.00..195.00 rows=1850 width=10)
        ->  Hash  (cost=244.25..244.25 rows=200 width=8)
              ->  HashAggregate  (cost=204.25..224.25 rows=200 width=8)
                    ->  Seq Scan on samples  (cost=0.00..195.00 rows=1850 width=8)
  ->  Hash  (cost=0.60..0.60 rows=3 width=102)
        ->  CTE Scan on foo  (cost=0.00..0.60 rows=3 width=102)
If I understand the question, you want windowing.
WITH find_first AS (
SELECT foo_id, baz,
row_number()
OVER (PARTITION BY foo_id ORDER BY foo_id, save_date) AS rnum
FROM samples
)
SELECT foo_id, baz FROM find_first WHERE rnum = 1;
Using row_number instead of rank eliminates duplicates and guarantees only one baz per foo. If you also need foos that have no bazzes, just LEFT JOIN the foos table to this query, as sketched below.
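A sketch of that variant, reusing the window query from above:
WITH find_first AS (
    SELECT foo_id, baz,
           row_number() OVER (PARTITION BY foo_id ORDER BY save_date) AS rnum
    FROM   samples
)
SELECT f.foo_id, ff.baz
FROM   foos f
LEFT   JOIN find_first ff ON ff.foo_id = f.foo_id AND ff.rnum = 1;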
With an index on (foo_id, save_date), the optimizer should be smart enough to do the grouping keeping only one baz and skipping merrily along.
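That index could be created like this (name illustrative):
CREATE INDEX samples_foo_id_save_date_idx ON samples (foo_id, save_date);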
row_number() is a beautiful beast, but DISTINCT ON is simpler here.
WITH foo AS (
SELECT foo_id
FROM foos
WHERE ref_id = 1
ORDER BY val
LIMIT 25 OFFSET 0
)
SELECT DISTINCT ON (1) f.foo_id, s.baz
FROM foo f
LEFT JOIN samples s USING (foo_id)
ORDER BY f.foo_id, s.save_date, s.baz;
This is assuming you want exactly 1 row per foo_id. If there are multiple rows in sample sharing the same earliest save_date, baz serves as tie-breaker.
The case is very similar to this question from yesterday.
More advice:
Don't select val in the CTE, you only need it in ORDER BY.
To avoid expensive sequential scans on foos:
If you are always after rows from foos with ref_id = 1, create a partial multi-column index:
CREATE INDEX foos_val_part_idx ON foos (val)
WHERE ref_id = 1;
If ref_id is variable:
CREATE INDEX foos_ref_id_val_idx ON foos (ref_id, val);
The other index that would help best on samples:
CREATE INDEX samples_foo_id_save_date_baz_idx
ON samples (foo_id, save_date, baz);
These indexes become even more effective with the new "index-only scans" in version 9.2. Details and links here.