Should I use a single or a composite index? - sql

What is the right way to set up an index for the following query?
SELECT t1.purchaseNumber, t1.parsing_status, t1.docPublishDate
FROM xml_files t1
LEFT JOIN xml_files t2
ON t1.purchaseNumber = t2.purchaseNumber
AND t1.docPublishDate < t2.docPublishDate
WHERE t1.parsing_status IS NULL
AND t2.parsing_status IS NULL
AND t2.docPublishDate IS NULL
AND t1.section_name='contracts' AND t1.parsing_status IS NULL AND t1.random IN (1,2,3,4)
Should I create a composite index, or is it better to create a separate index for each column that is used in the query?
Also, since I am comparing the timestamp docPublishDate, how should I create the index? Should I use the DESC keyword?
purchaseNumber - varchar(50)
parsing_status - varchar(10)
random - integer
section_name - varchar(10)
Output of EXPLAIN (ANALYZE, BUFFERS):
Gather (cost=1000.86..137158.61 rows=43091 width=35) (actual time=22366.063..72674.678 rows=46518 loops=1)
Workers Planned: 2
Workers Launched: 2
Buffers: shared hit=99244069 read=144071
-> Nested Loop Anti Join (cost=0.86..131849.51 rows=17955 width=35) (actual time=22309.989..72440.514 rows=15506 loops=3)
Buffers: shared hit=99244069 read=144071
-> Parallel Index Scan using index_for_xml_files_parsing_status on xml_files t1 (cost=0.43..42606.31 rows=26932 width=35) (actual time=0.086..193.982 rows=40725 loops=3)
Index Cond: ((parsing_status IS NULL) AND (parsing_status IS NULL))
Filter: (((section_name)::text = 'contracts'::text) AND (random = ANY ('{1,2,3,4}'::integer[])))
Rows Removed by Filter: 383974
Buffers: shared hit=15724 read=42304
-> Index Scan using "index_for_xml_files_purchaseNumber" on xml_files t2 (cost=0.43..4.72 rows=3 width=27) (actual time=1.773..1.773 rows=1 loops=122174)
Index Cond: (("purchaseNumber")::text = (t1."purchaseNumber")::text)
Filter: (t1."docPublishDate" < "docPublishDate")
Rows Removed by Filter: 6499
Buffers: shared hit=99228345 read=101767
Planning Time: 0.396 ms
Execution Time: 72681.868 ms
How can I improve the speed of this query?

You should explain what you want the query to do. I would write the query more clearly as:
SELECT t1.purchaseNumber, t1.parsing_status, t1.docPublishDate
FROM xml_files t1
WHERE t1.section_name = 'contracts' AND
t1.parsing_status IS NULL AND
t1.random IN (1, 2, 3, 4) AND
NOT EXISTS (SELECT 1
FROM xml_files t2
WHERE t1.purchaseNumber = t2.purchaseNumber AND
t1.docPublishDate < t2.docPublishDate
);
For this query, I would suggest the following indexes:
create index idx_xml_files_3 on xml_files(section_name, random)
where parsing_status is null;
create index idx_xml_files_2 on xml_files(purchaseNumber, docPublishDate);
There is probably an even better way to write the query, using window functions for instance. However, it is not clear what your data looks like nor what the query is intended to do.
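To illustrate that remark, here is a minimal window-function sketch, assuming docPublishDate is never NULL (so "no row with a later docPublishDate" means the same as "has the latest docPublishDate for its purchaseNumber"); whether it is actually faster depends on the data:
-- keep only the newest row per purchaseNumber, then apply the t1 filters
SELECT purchaseNumber, parsing_status, docPublishDate
FROM (SELECT purchaseNumber, parsing_status, docPublishDate,
             section_name, random,
             MAX(docPublishDate) OVER (PARTITION BY purchaseNumber) AS latest_date
      FROM xml_files
     ) t
WHERE docPublishDate = latest_date
  AND section_name = 'contracts'
  AND parsing_status IS NULL
  AND random IN (1, 2, 3, 4);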

The index scan on the inner side of the nested loop join is inefficient: on average, 6499 of the 6500 rows found are discarded.
Create a better index:
CREATE INDEX ON xml_files ("purchaseNumber", "docPublishDate");

Related

Difference between ANY(ARRAY[..]) vs ANY(VALUES (), () ..) in PostgreSQL

I am trying to work out query optimisation on id, and I am not sure which way I should use. Below are the query plans using EXPLAIN; cost-wise they look similar.
1. explain (analyze, buffers) SELECT * FROM table1 WHERE id = ANY (ARRAY['00e289b0-1ac8-451f-957f-e00bc289148e'::uuid,...]);
QUERY PLAN:
Index Scan using table1_pkey on table1 (cost=0.42..641.44 rows=76 width=835) (actual time=0.258..2.603 rows=76 loops=1)
Index Cond: (id = ANY ('{00e289b0-1ac8-451f-957f-e00bc289148e,...}'::uuid[]))
Buffers: shared hit=231 read=73
Planning Time: 0.487 ms
Execution Time: 2.715 ms
2. explain (analyze, buffers) SELECT * FROM table1 WHERE id = ANY (VALUES ('00e289b0-1ac8-451f-957f-e00bc289148e'::uuid),...);
QUERY PLAN:
Nested Loop (cost=1.56..644.10 rows=76 width=835) (actual time=0.058..0.297 rows=76 loops=1)
Buffers: shared hit=304
-> HashAggregate (cost=1.14..1.90 rows=76 width=16) (actual time=0.049..0.060 rows=76 loops=1)
Group Key: "*VALUES*".column1
-> Values Scan on "*VALUES*" (cost=0.00..0.95 rows=76 width=16) (actual time=0.006..0.022 rows=76 loops=1)
-> Index Scan using table1_pkey on table1 (cost=0.42..8.44 rows=1 width=835) (actual time=0.002..0.003 rows=1 loops=76)
Index Cond: (id = "*VALUES*".column1)
Buffers: shared hit=304
Planning Time: 0.437 ms
Execution Time: 0.389 ms
It looks like VALUES () does some hashing and a join to improve performance, but I'm not sure.
NOTE: In my practical use case, id is uuid_generate_v4(), e.g. d31cddc0-1771-4de8-ad41-e6c568b39a5d, but the column may not be indexed as such.
Also, the table has 5-10 million records.
Which way is for the better query performance?
Both options seem reasonable. I would, however, suggest avoiding a cast on the column you filter on. Instead, cast the literal values to uuid:
SELECT *
FROM table1
WHERE id = ANY (ARRAY['00e289b0-1ac8-451f-957f-e00bc289148e'::uuid, ...]);
This should allow the database to take advantage of an index on column id.
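If the id column is not already indexed (a primary key would normally provide an index automatically, as table1_pkey does in the plans above), an index would have to be created for the index scan to be possible; a sketch with an illustrative name:
CREATE INDEX table1_id_idx ON table1 (id);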

Postgres query slows down dramatically when performing addition including column with only null values

We've got a query in our DB that's performing a lot slower than we'd like and we've narrowed it down to two columns in a table that, when added to a select calculation, massively increase the query time (30-40s --> 5m30s - 7m). The only interesting feature about these columns is that they currently only contain NULLs. In another environment where these columns do contain values, we do not have this issue.
The query structure is:
SELECT (some columns, calculations, etc)
FROM table_a
LEFT JOIN (
SELECT more columns, calculations, a + ONLY NULLS COLUMN + OTHER ONLY NULLS COLUMN + b
FROM view_a) alias_a
ON table_a.id = alias_a.id
GROUP BY alias_a.id.
The subselect runs fine, and the outer query runs fine as long as we remove the grouping. Not certain how to fix this besides inserting dummy data.
Explain plan (anonymized as best as possible):
GroupAggregate (cost=10432.09..11510.14 rows=4683 width=484)
Group Key: mc.id
-> Merge Left Join (cost=10432.09..10456.47 rows=4683 width=251)
Merge Cond: (mc.id = mct.macguffin_child_id)
-> Sort (cost=439.33..451.04 rows=4683 width=4)
Sort Key: mc.id
-> Seq Scan on macguffin_child mc (cost=0.00..153.83 rows=4683 width=4)
-> Sort (cost=9992.76..9992.92 rows=64 width=251)
Sort Key: mct.macguffin_child_id
-> Nested Loop Left Join (cost=5221.87..9990.84 rows=64 width=251)
-> Merge Join (cost=5221.44..9947.83 rows=64 width=211)
Merge Cond: (mp.id = m.phase_id)
Join Filter: (mc_1.macguffin_id = m.id)
-> Nested Loop (cost=4748.32..10792.40 rows=64 width=223)
-> Nested Loop (cost=4748.05..10788.94 rows=1 width=223)
-> Index Only Scan using macguffin_phase_pkey on macguffin_phase mp (cost=0.13..12.18 rows=3 width=4)
-> Materialize (cost=4747.92..10776.73 rows=1 width=219)
-> Gather (cost=4747.92..10776.72 rows=1 width=219)
Workers Planned: 1
-> Nested Loop (cost=3747.92..9776.62 rows=1 width=219)
-> Nested Loop (cost=3747.49..9625.65 rows=21 width=131)
-> Hash Join (cost=3747.07..9610.78 rows=21 width=121)
Hash Cond: ((mt.macguffin_id = mc_1.macguffin_id) AND (mct.macguffin_child_id = mc_1.id))
-> Parallel Hash Join (cost=3522.99..8897.48 rows=93186 width=100)
Hash Cond: (mct.macguffin_thing_id = mt.id)
-> Parallel Seq Scan on macguffin_child_thing mct (cost=0.00..2141.86 rows=93186 width=88)
-> Parallel Hash (cost=1966.11..1966.11 rows=89511 width=16)
-> Parallel Seq Scan on macguffin_thing mt (cost=0.00..1966.11 rows=89511 width=16)
-> Hash (cost=153.83..153.83 rows=4683 width=25)
-> Seq Scan on macguffin_child mc_1 (cost=0.00..153.83 rows=4683 width=25)
-> Index Scan using thing_pkey on thing t (cost=0.42..0.71 rows=1 width=26)
Index Cond: (id = mt.thing_id)
-> Index Scan using macguffin_thing_calculation_by_date_request_id_composite_key on macguffin_thing_calculation_by_date c (cost=0.43..7.18 rows=1 width=80)
Index Cond: ((request_id = mc_1.request_id) AND (sku = (t.sku)::text))
-> Index Only Scan using window_start_date_end_date_idx on window w (cost=0.28..2.83 rows=64 width=8)
Index Cond: ((start_date <= c.date) AND (end_date >= c.date))
-> Sort (cost=473.12..484.53 rows=4565 width=8)
Sort Key: m.phase_id
-> Seq Scan on macguffin m (cost=0.00..195.65 rows=4565 width=8)
-> Index Scan using macguffin_thing_calculation_override_thing_id on macguffin_thing_calculation_override mtco (cost=0.42..0.63 rows=1 width=21)
Index Cond: ((mct.macguffin_thing_id = macguffin_thing_id) AND (c.date = date))
The table containing the offending fields
-- auto-generated definition
CREATE TABLE macguffin_child_thing
(
id serial NOT NULL
CONSTRAINT macguffin_child_thing_pkey
PRIMARY KEY,
macguffin_child_id integer NOT NULL
CONSTRAINT macguffin_child_thing_macguffin_child_id_fkey
REFERENCES macguffin_child,
macguffin_thing_id integer NOT NULL
CONSTRAINT macguffin_child_thing_macguffin_thing_id_fkey
REFERENCES macguffin_thing,
thing_field_1 numeric
CONSTRAINT macguffin_child_thing_thing_field_check
CHECK (thing_field_1 >= (0)::numeric),
some_value numeric,
thing_field_2 numeric -- ================================= OFFENDING FIELD
CONSTRAINT macguffin_child_thing_thing_field_2_check
CHECK (thing_field_2 >= (0)::numeric),
thing_field_3 numeric -- ================================= OFFENDING FIELD
CONSTRAINT macguffin_child_thing_thing_field_3_check
CHECK (thing_field_3 >= (0)::numeric),
thing_field_4 numeric
CONSTRAINT macguffin_child_thing_thing_field_4_check
CHECK (thing_field_4 >= (0)::numeric)
);
ALTER TABLE macguffin_child_thing
OWNER TO db_owner;
CREATE UNIQUE INDEX macguffin_child_thing__unique_idx
ON macguffin_child_thing (macguffin_child_id, macguffin_thing_id);

Query with join taking too long to run - Postgresql

I have a query that's taking too long to run.
I'm using PostgreSQL 10.3.
Each of the tables involved in this query has about 3.5 million records.
The query is:
SELECT thf.attr1, thf.attr2, thf.attr3, thf.attr4
FROM tb_one AS thf
INNER JOIN tb_two AS ths
ON ths.tb_hit_hitid = thf.tb_hit_hitid
WHERE ths.source IN ('source1', 'source2')
In these tables, I have index:
CREATE INDEX tb_two_idx_1 on tb_two (Source ASC, attr5 ASC);
CREATE INDEX tb_one_idx_1 on tb_one USING btree (attr1 ASC,attr2 ASC,attr3 ASC,attr4 ASC);
CREATE INDEX tb_one_idx_2 on tb_one (tb_hit_HitId ASC);
CREATE INDEX tb_two_idx_2 on tb_two (tb_hit_HitId ASC);
This is the QUERY PLAN (explain (analyse, buffers)):
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------------
Merge Join (cost=3.85..287880.35 rows=1771004 width=44) (actual time=0.091..3894.024 rows=1726970 loops=1)
Merge Cond: (thf.tb_hit_hitid = ths.tb_hit_hitid)
Buffers: shared hit=354821
-> Index Scan using tb_one_idx_2 on tb_one thf (cost=0.43..124322.43 rows=3230800 width=52) (actual time=0.014..655.036 rows=1726946 loops=1)
Buffers: shared hit=27201
-> Index Scan using tb_two_idx_2 on tb_two ths (cost=0.43..139531.97 rows=1771004 width=8) (actual time=0.069..1604.789 rows=1726973 loops=1)
Filter: ((source)::text = ANY ('{source1,source2}'::text[]))
Rows Removed by Filter: 1651946
Buffers: shared hit=327620
Planning time: 2.737 ms
Execution time: 4117.573 ms
(11 rows)
For this query:
SELECT thf.attr1, thf.attr2, thf.attr3, thf.attr4
FROM tb_one thf INNER JOIN
tb_two ths
ON ths.tb_hit_hitid = thf.tb_hit_hitid
WHERE ths.source IN ('source1', 'source2');
You want indexes on tb_two(source, tb_hit_hitid) and tb_one(tb_hit_hitid). Those are probably the best indexes.
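Spelled out as DDL, with illustrative index names (note that tb_one_idx_2 above already covers tb_one(tb_hit_hitid)):
CREATE INDEX tb_two_source_hitid_idx ON tb_two (source, tb_hit_hitid);
CREATE INDEX tb_one_hitid_idx ON tb_one (tb_hit_hitid);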
In case the query returns duplicates (due to the join), I might suggest writing this as:
SELECT thf.attr1, thf.attr2, thf.attr3, thf.attr4
FROM tb_one thf
WHERE EXISTS (SELECT 1
FROM tb_two ths
WHERE ths.tb_hit_hitid = thf.tb_hit_hitid AND
ths.source IN ('source1', 'source2')
);
For this version, you want the index to be tb_two(tb_hit_hitid, source).
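Again as a sketch with an illustrative name:
CREATE INDEX tb_two_hitid_source_idx ON tb_two (tb_hit_hitid, source);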

Optimizing a one to many query that's using a subselect to paginate

I was hoping to get some expert eyes over my query and see why I'm receiving different performance.
The problem I'm trying to solve is I need orders that can have one to many items. These orders need to be paginated.
To do this I've taken the following approach. I'm using a sub query to filter orders by the required item attributes. I'm then rejoining to the items to get their required fields. This means that when paginating I will not incorrectly filter order rows when there are orders with 2 or more items.
I'm seeing intermittently slow queries. The second time they are run they're much quicker; I presume this is because Postgres is loading indexes and such into memory?
I don't fully understand what is happening from the EXPLAIN output. It looks like it needs to scan every order to see whether it has an item that fits the subquery? I'm a bit confused by the following line: it says it needs to scan 286853 rows, but also only 165?
Index Scan Backward using orders_created_at_idx on orders (cost=0.42..2708393.65 rows=286853 width=301) (actual time=64.598..2114.676 rows=165 loops=1)
Is there a way to get Postgres to filter by the items first or am I reading this incorrectly and it is doing that already?
Query:
SELECT
"orders"."id_orders" as "orders.id_orders",
"items"."id_items" as "items"."id_items",
...,
orders.created_at, orders.updated_at
FROM (
SELECT
orders.id_orders,
orders.created_at,
orders.updated_at
FROM orders
WHERE orders.status in ('completed','pending') AND
(
SELECT fk_vendor_id FROM items
WHERE (
items.fk_order_id = orders.id_orders AND
items.fk_vendor_id = '0012800001YVccUAAT' AND
items.fk_offer = '0060I00000RAKFYQA5' AND
items.status IN ('completed','cancelled')
) LIMIT 1
) IS NOT NULL ORDER BY orders.created_at DESC LIMIT 50 OFFSET 150
) as orders INNER JOIN items ON items.fk_order_id = orders.id_orders;
1st explain:
Nested Loop (cost=1417.11..2311.77 rows=67 width=1705) (actual time=2785.221..17025.325 rows=17 loops=1)
-> Limit (cost=1416.68..1888.77 rows=50 width=301) (actual time=2785.216..17024.918 rows=15 loops=1)
-> Index Scan Backward using orders_created_at_idx on orders (cost=0.42..2708393.65 rows=286853 width=301) (actual time=1214.013..17024.897 rows=165 loops=1)
Filter: ((status = ANY ('{completed,pending}'::orders_status_enum[])) AND ((SubPlan 1) IS NOT NULL))
Rows Removed by Filter: 313631
SubPlan 1
-> Limit (cost=0.42..8.45 rows=1 width=19) (actual time=0.047..0.047 rows=0 loops=287719)
-> Index Scan using items_fk_order_id_index on items items_1 (cost=0.42..8.45 rows=1 width=19) (actual time=0.047..0.047 rows=0 loops=287719)
Index Cond: (fk_order_id = orders.id_orders)
Filter: ((status = ANY ('{completed,cancelled}'::items_status_enum[])) AND (fk_vendor_id = '0012800001YVccUAAT'::text) AND (fk_offer = '0060I00000RAKFYQA5'::text))
Rows Removed by Filter: 1
-> Index Scan using items_fk_order_id_index on items (cost=0.42..8.44 rows=1 width=1404) (actual time=0.002..0.026 rows=1 loops=15)
Index Cond: (fk_order_id = orders.id_orders)
Planning time: 1.791 ms
Execution time: 17025.624 ms
(15 rows)
2nd explain:
Nested Loop (cost=1417.11..2311.77 rows=67 width=1705) (actual time=115.659..2114.739 rows=17 loops=1)
-> Limit (cost=1416.68..1888.77 rows=50 width=301) (actual time=115.654..2114.691 rows=15 loops=1)
-> Index Scan Backward using orders_created_at_idx on orders (cost=0.42..2708393.65 rows=286853 width=301) (actual time=64.598..2114.676 rows=165 loops=1)
Filter: ((status = ANY ('{completed,pending}'::orders_status_enum[])) AND ((SubPlan 1) IS NOT NULL))
Rows Removed by Filter: 313631
SubPlan 1
-> Limit (cost=0.42..8.45 rows=1 width=19) (actual time=0.006..0.006 rows=0 loops=287719)
-> Index Scan using items_fk_order_id_index on items items_1 (cost=0.42..8.45 rows=1 width=19) (actual time=0.006..0.006 rows=0 loops=287719)
Index Cond: (fk_order_id = orders.id_orders)
Filter: ((status = ANY ('{completed,cancelled}'::items_status_enum[])) AND (fk_vendor_id = '0012800001YVccUAAT'::text) AND (fk_offer = '0060I00000RAKFYQA5'::text))
Rows Removed by Filter: 1
-> Index Scan using items_fk_order_id_index on items (cost=0.42..8.44 rows=1 width=1404) (actual time=0.002..0.002 rows=1 loops=15)
Index Cond: (fk_order_id = orders.id_orders)
Planning time: 2.011 ms
Execution time: 2115.052 ms
(15 rows)
Order indexes:
"cart_pkey" PRIMARY KEY, btree (id_orders)
"orders_legacy_id_uindex" UNIQUE, btree (legacy_id_orders)
"orders_transaction_key_uindex" UNIQUE, btree (transaction_key)
"orders_created_at_idx" btree (created_at)
"orders_customer_email_idx" gin (customer_email gin_trgm_ops)
"orders_customer_full_name_idx" gin (customer_full_name gin_trgm_ops)
Referenced by:
TABLE "items" CONSTRAINT "items_fk_order_id_fkey" FOREIGN KEY (fk_order_id) REFERENCES orders(id_orders) ON DELETE RESTRICT
TABLE "items_log" CONSTRAINT "items_log_fk_order_id_fkey" FOREIGN KEY (fk_order_id) REFERENCES orders(id_orders)
Items indexes:
"items_pkey" PRIMARY KEY, btree (id_items)
"items_fk_vendor_id_booking_number_unique" UNIQUE, btree (fk_vendor_id, booking_number) WHERE legacy_id_items IS NULL
"items_legacy_id_uindex" UNIQUE, btree (legacy_id_items)
"items_transaction_key_uindex" UNIQUE, btree (transaction_key)
"items_booking_number_index" btree (booking_number)
"items_fk_order_id_index" btree (fk_order_id)
"items_fk_vendor_id_index" btree (fk_vendor_id)
"items_status_index" btree (status)
Foreign-key constraints:
"items_fk_order_id_fkey" FOREIGN KEY (fk_order_id) REFERENCES orders(id_orders) ON DELETE RESTRICT
The difference in execution times is probably really the effect of caching. You could use EXPLAIN (ANALYZE, BUFFERS) to see how many pages are found in the database cache.
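For example, simply prefix the statement; the "Buffers: shared hit=… read=…" lines in the output then show how many pages came from the buffer cache versus how many had to be read in. The predicate below just reuses a value from the question to keep the example short:
EXPLAIN (ANALYZE, BUFFERS)
SELECT id_items
FROM items
WHERE fk_vendor_id = '0012800001YVccUAAT';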
To make your query more readable, you should rewrite
WHERE (
SELECT fk_vendor_id FROM items
WHERE (
items.fk_order_id = orders.id_orders AND
items.fk_vendor_id = '0012800001YVccUAAT' AND
items.fk_offer = '0060I00000RAKFYQA5' AND
items.status IN ('completed','cancelled')
) LIMIT 1
) IS NOT NULL
to
WHERE EXISTS
(SELECT 1 FROM items
WHERE items.fk_order_id = orders.id_orders
AND items.fk_vendor_id = '0012800001YVccUAAT'
AND items.fk_offer = '0060I00000RAKFYQA5'
AND items.status IN ('completed','cancelled')
)
The best thing you could do to speed up the query is to create an index:
CREATE INDEX ON items(fk_order_id, fk_vendor_id, fk_offer);

Query records linked through key value pairs to records that actually match criteria

We have a simple, generic table structure, implemented in PostgreSQL (8.3; 9.1 is on our horizon). It seems a very straightforward and common implementation. It boils down to this:
CREATE TABLE events_event_types
(
    -- this table holds some 50 rows
    id bigserial PRIMARY KEY,
    "name" character varying(255)
);
CREATE TABLE events_events
(
    -- this table holds some 15M rows
    id bigserial PRIMARY KEY,
    datetime timestamp with time zone,
    eventtype_id bigint -- FK to events_event_types.id
);
CREATE TABLE events_eventdetails
(
    -- this table holds some 65M rows
    id bigserial PRIMARY KEY,
    keyname character varying(255),
    "value" text,
    event_id bigint -- FK to events_events.id
);
Some of the rows in events_events and events_eventdetails tables would be like this:
events_events | events_eventdetails
id datetime eventtype_id | id keyname value event_id
----------------------------|-------------------------------------------
100 ... 10 | 1000 transactionId 9774ae16-... 100
| 1001 someKey some value 100
200 ... 20 | 2000 transactionId 9774ae16-... 200
| 2001 reductionId 123 200
| 2002 reductionId 456 200
300 ... 30 | 3000 transactionId 9774ae16-... 300
| 2001 customerId 234 300
| 2001 companyId 345 300
We are in desperate need of a "solution" that returns events_events rows 100 and 200 and 300 together in a single result set and FAST! when asked for reductionId=123 or when asked for customerId=234 or when asked for companyId=345. (Possibly interested in an AND combination of these criteria, but that's not essentially the goal.)
Not sure if it matters at this point, but the result set should be filterable on datetime range and eventtype_id (IN list) and be given a LIMIT.
I ask for a "solution", since this could be either:
A single query
Two smaller queries (as long as their intermediate result is always small enough. I followed this approach and got stuck for companies (companyId) with large amounts (~20k) of associated transactions (transactionId))
A subtle redesign (e.g. denormalization)
This is not a fresh question, as we have tried all three approaches over many months (I won't bother you with those queries), but they all fail on performance. The solution should return in <<<1s. Previous attempts took approx. 10s at best.
I'd really appreciate some help -- I'm at a loss now...
The two smaller queries approach looks much like this:
Query 1:
SELECT Substring(details2_transvalue.VALUE, 0, 32)
FROM events_eventdetails details2_transvalue
JOIN events_eventdetails compdetails ON details2_transvalue.event_id = compdetails.event_id
AND compdetails.keyname = 'companyId'
AND Substring(compdetails.VALUE, 0, 32) = '4'
AND details2_transvalue.keyname = 'transactionId'
Query 2:
SELECT events1.*
FROM events_events events1
JOIN events_eventdetails compDetails ON events1.id = compDetails.event_id
AND compDetails.keyname='companyId'
AND substring(compDetails.value,0,32)='4'
WHERE events1.eventtype_id IN (...)
UNION
SELECT events2.*
FROM events_events events2
JOIN events_eventdetails details2_transKey ON events2.id = details2_transKey.event_id
AND details2_transKey.keyname='transactionId'
AND substring(details2_transKey.value,0,32) IN ( -- result of query 1 goes here -- )
WHERE events2.eventtype_id IN (...)
ORDER BY dateTime DESC LIMIT 50
Performance of this gets poor due to the large set returned by query 1.
As you can see, values in the events_eventdetails table are always expressed as length 32 substrings, which we have indexed as such. Further indices on keyname, event_id, event_id + keyname, keyname + length 32 substring.
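A sketch of what those indexes might look like (the question does not show the exact definitions; names are illustrative):
CREATE INDEX eventdetails_keyname_idx    ON events_eventdetails (keyname);
CREATE INDEX eventdetails_event_idx      ON events_eventdetails (event_id);
CREATE INDEX eventdetails_event_key_idx  ON events_eventdetails (event_id, keyname);
CREATE INDEX eventdetails_key_substr_idx ON events_eventdetails (keyname, (substring(value, 0, 32)));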
Here is a PostgreSQL 9.1 approach -- even though I don't officially have that platform at my disposal:
WITH companyevents AS (
SELECT events1.*
FROM events_events events1
JOIN events_eventdetails compDetails
ON events1.id = compDetails.event_id
AND compDetails.keyname='companyId'
AND substring(compDetails.value,0,32)=' -- my desired companyId -- '
WHERE events1.eventtype_id in (...)
ORDER BY dateTime DESC
LIMIT 50
)
SELECT * from events_events
WHERE transaction_id IN (SELECT transaction_id FROM companyevents)
OR id IN (SELECT id FROM companyevents)
AND eventtype_id IN (...)
ORDER BY dateTime DESC
LIMIT 250;
The query plan is as follows for companyId with 28228 transactionIds:
Limit (cost=7545.99..7664.33 rows=250 width=130) (actual time=210.100..3026.267 rows=50 loops=1)
CTE companyevents
-> Limit (cost=7543.62..7543.74 rows=50 width=130) (actual time=206.994..207.020 rows=50 loops=1)
-> Sort (cost=7543.62..7544.69 rows=429 width=130) (actual time=206.993..207.005 rows=50 loops=1)
Sort Key: events1.datetime
Sort Method: top-N heapsort Memory: 23kB
-> Nested Loop (cost=10.02..7529.37 rows=429 width=130) (actual time=0.093..178.719 rows=28228 loops=1)
-> Append (cost=10.02..1140.62 rows=657 width=8) (actual time=0.082..27.594 rows=28228 loops=1)
-> Bitmap Heap Scan on events_eventdetails compdetails (cost=10.02..394.47 rows=97 width=8) (actual time=0.021..0.021 rows=0 loops=1)
Recheck Cond: (((keyname)::text = 'companyId'::text) AND ("substring"(value, 0, 32) = '4'::text))
-> Bitmap Index Scan on events_eventdetails_substring_ind (cost=0.00..10.00 rows=97 width=0) (actual time=0.019..0.019 rows=0 loops=1)
Index Cond: (((keyname)::text = 'companyId'::text) AND ("substring"(value, 0, 32) = '4'::text))
-> Index Scan using events_eventdetails_companyid_substring_ind on events_eventdetails_companyid compdetails (cost=0.00..746.15 rows=560 width=8) (actual time=0.061..18.655 rows=28228 loops=1)
Index Cond: (((keyname)::text = 'companyId'::text) AND ("substring"(value, 0, 32) = '4'::text))
-> Index Scan using events_events_pkey on events_events events1 (cost=0.00..9.71 rows=1 width=130) (actual time=0.004..0.004 rows=1 loops=28228)
Index Cond: (id = compdetails.event_id)
Filter: (eventtype_id = ANY ('{103,106,107,110,45,34,14,87,58,78,7,76,42,11,25,57,98,37,30,35,33,49,52,29,74,28,85,59,51,65,66,18,13,86,75,6,44,38,43,94,56,95,96,71,50,81,90,89,16,17,4,88,79,77,68,97,92,67,72,53,2,10,31,32,80,111,104,93,26,8,61,5,73,70,63,20,60,40,41,23,22,48,36,108,99,64,62,55,69,19,46,47,15,54,100,101,27,21,12,102,105,109,112,113,114,115,116,119,120,121,122,123,124,9,127,24,130,132,129,125,131,118,117,133,134}'::bigint[]))
-> Index Scan Backward using events_events_datetime_ind on events_events (cost=2.25..1337132.75 rows=2824764 width=130) (actual time=210.100..3026.255 rows=50 loops=1)
Filter: ((hashed SubPlan 2) OR ((hashed SubPlan 3) AND (eventtype_id = ANY ('{103,106,107,110,45,34,14,87,58,78,7,76,42,11,25,57,98,37,30,35,33,49,52,29,74,28,85,59,51,65,66,18,13,86,75,6,44,38,43,94,56,95,96,71,50,81,90,89,16,17,4,88,79,77,68,97,92,67,72,53,2,10,31,32,80,111,104,93,26,8,61,5,73,70,63,20,60,40,41,23,22,48,36,108,99,64,62,55,69,19,46,47,15,54,100,101,27,21,12,102,105,109,112,113,114,115,116,119,120,121,122,123,124,9,127,24,130,132,129,125,131,118,117,133,134}'::bigint[]))))
SubPlan 2
-> CTE Scan on companyevents (cost=0.00..1.00 rows=50 width=90) (actual time=206.998..207.071 rows=50 loops=1)
SubPlan 3
-> CTE Scan on companyevents (cost=0.00..1.00 rows=50 width=8) (actual time=0.001..0.026 rows=50 loops=1)
Total runtime: 3026.410 ms
The query plan is as follows for companyId with 288 transactionIds:
Limit (cost=7545.99..7664.33 rows=250 width=130) (actual time=30.976..3790.362 rows=54 loops=1)
CTE companyevents
-> Limit (cost=7543.62..7543.74 rows=50 width=130) (actual time=9.263..9.290 rows=50 loops=1)
-> Sort (cost=7543.62..7544.69 rows=429 width=130) (actual time=9.263..9.272 rows=50 loops=1)
Sort Key: events1.datetime
Sort Method: top-N heapsort Memory: 24kB
-> Nested Loop (cost=10.02..7529.37 rows=429 width=130) (actual time=0.071..8.195 rows=1025 loops=1)
-> Append (cost=10.02..1140.62 rows=657 width=8) (actual time=0.060..1.348 rows=1025 loops=1)
-> Bitmap Heap Scan on events_eventdetails compdetails (cost=10.02..394.47 rows=97 width=8) (actual time=0.021..0.021 rows=0 loops=1)
Recheck Cond: (((keyname)::text = 'companyId'::text) AND ("substring"(value, 0, 32) = '5'::text))
-> Bitmap Index Scan on events_eventdetails_substring_ind (cost=0.00..10.00 rows=97 width=0) (actual time=0.019..0.019 rows=0 loops=1)
Index Cond: (((keyname)::text = 'companyId'::text) AND ("substring"(value, 0, 32) = '5'::text))
-> Index Scan using events_eventdetails_companyid_substring_ind on events_eventdetails_companyid compdetails (cost=0.00..746.15 rows=560 width=8) (actual time=0.039..1.006 rows=1025 loops=1)
Index Cond: (((keyname)::text = 'companyId'::text) AND ("substring"(value, 0, 32) = '5'::text))
-> Index Scan using events_events_pkey on events_events events1 (cost=0.00..9.71 rows=1 width=130) (actual time=0.005..0.006 rows=1 loops=1025)
Index Cond: (id = compdetails.event_id)
Filter: (eventtype_id = ANY ('{103,106,107,110,45,34,14,87,58,78,7,76,42,11,25,57,98,37,30,35,33,49,52,29,74,28,85,59,51,65,66,18,13,86,75,6,44,38,43,94,56,95,96,71,50,81,90,89,16,17,4,88,79,77,68,97,92,67,72,53,2,10,31,32,80,111,104,93,26,8,61,5,73,70,63,20,60,40,41,23,22,48,36,108,99,64,62,55,69,19,46,47,15,54,100,101,27,21,12,102,105,109,112,113,114,115,116,119,120,121,122,123,124,9,127,24,130,132,129,125,131,118,117,133,134}'::bigint[]))
-> Index Scan Backward using events_events_datetime_ind on events_events (cost=2.25..1337132.75 rows=2824764 width=130) (actual time=30.975..3790.332 rows=54 loops=1)
Filter: ((hashed SubPlan 2) OR ((hashed SubPlan 3) AND (eventtype_id = ANY ('{103,106,107,110,45,34,14,87,58,78,7,76,42,11,25,57,98,37,30,35,33,49,52,29,74,28,85,59,51,65,66,18,13,86,75,6,44,38,43,94,56,95,96,71,50,81,90,89,16,17,4,88,79,77,68,97,92,67,72,53,2,10,31,32,80,111,104,93,26,8,61,5,73,70,63,20,60,40,41,23,22,48,36,108,99,64,62,55,69,19,46,47,15,54,100,101,27,21,12,102,105,109,112,113,114,115,116,119,120,121,122,123,124,9,127,24,130,132,129,125,131,118,117,133,134}'::bigint[]))))
SubPlan 2
-> CTE Scan on companyevents (cost=0.00..1.00 rows=50 width=90) (actual time=9.266..9.327 rows=50 loops=1)
SubPlan 3
-> CTE Scan on companyevents (cost=0.00..1.00 rows=50 width=8) (actual time=0.001..0.019 rows=50 loops=1)
Total runtime: 3796.736 ms
With 3s/4s this is not bad at all, but still a factor 100+ too slow. Also, this wasn't on relevant hardware. Nonetheless it should show where the pain is.
Here is something that could possibly grow into a solution:
Added a table:
CREATE TABLE events_transaction_helper
(
    event_id bigint NOT NULL,
    transactionid character varying(36) NOT NULL,
    keyname character varying(255) NOT NULL,
    value bigint NOT NULL
    -- index on (keyname, value)
);
I "manually" filled this table now, but a materialized view implementation would do the trick. It would much follow the below query:
SELECT tr.event_id, tr.value AS transactionid, det.keyname, det.value AS value
FROM events_eventdetails tr
JOIN events_eventdetails det ON det.event_id = tr.event_id
WHERE tr.keyname = 'transactionId'
AND det.keyname
IN ('companyId', 'reduction_id', 'customer_id');
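On a PostgreSQL version with native materialized views (9.3 and later; the question is on 8.3/9.1, so treat this purely as a sketch of the idea, replacing the manually filled table above):
CREATE MATERIALIZED VIEW events_transaction_helper AS
SELECT tr.event_id,
       tr.value  AS transactionid,
       det.keyname,
       det.value AS value   -- the table above declares this bigint, so a cast may be needed
FROM events_eventdetails tr
JOIN events_eventdetails det ON det.event_id = tr.event_id
WHERE tr.keyname = 'transactionId'
  AND det.keyname IN ('companyId', 'reduction_id', 'customer_id');

CREATE INDEX ON events_transaction_helper (keyname, value);

-- repopulate after new events arrive
REFRESH MATERIALIZED VIEW events_transaction_helper;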
Added a column to the events_events table:
transaction_id character varying(36) null
This new column is filled as follows:
update events_events
set transaction_id =
(select value from events_eventdetails
where keyname='transactionId'
and event_id=events_events.id);
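The DDL these two steps imply is not shown in the post; roughly, and using the index name that appears in the plan below (the plan output spells the column without the underscore), it would be:
ALTER TABLE events_events ADD COLUMN transaction_id character varying(36);

CREATE INDEX testtransactionid ON events_events (transaction_id);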
Now, the following query returns in <15ms consistently:
explain analyze select * from events_events
where transactionId in
(select distinct transactionid
from events_transaction_helper
WHERE keyname='companyId' and value=5)
and eventtype_id in (...)
order by datetime desc limit 250;
Limit (cost=5075.23..5075.85 rows=250 width=130) (actual time=8.901..9.028 rows=250 loops=1)
-> Sort (cost=5075.23..5077.19 rows=785 width=130) (actual time=8.900..8.953 rows=250 loops=1)
Sort Key: events_events.datetime
Sort Method: top-N heapsort Memory: 81kB
-> Nested Loop (cost=57.95..5040.04 rows=785 width=130) (actual time=0.928..8.268 rows=524 loops=1)
-> HashAggregate (cost=52.30..52.42 rows=12 width=37) (actual time=0.895..0.991 rows=276 loops=1)
-> Subquery Scan on "ANY_subquery" (cost=52.03..52.27 rows=12 width=37) (actual time=0.558..0.757 rows=276 loops=1)
-> HashAggregate (cost=52.03..52.15 rows=12 width=37) (actual time=0.556..0.638 rows=276 loops=1)
-> Index Scan using testmaterializedviewkeynamevalue on events_transaction_helper (cost=0.00..51.98 rows=22 width=37) (actual time=0.068..0.404 rows=288 loops=1)
Index Cond: (((keyname)::text = 'companyId'::text) AND (value = 5))
-> Bitmap Heap Scan on events_events (cost=5.65..414.38 rows=100 width=130) (actual time=0.023..0.024 rows=2 loops=276)
Recheck Cond: ((transactionid)::text = ("ANY_subquery".transactionid)::text)
Filter: (eventtype_id = ANY ('{103,106,107,110,45,34,14,87,58,78,7,76,42,11,25,57,98,37,30,35,33,49,52,29,74,28,85,59,51,65,66,18,13,86,75,6,44,38,43,94,56,95,96,71,50,81,90,89,16,17,4,88,79,77,68,97,92,67,72,53,2,10,31,32,80,111,104,93,26,8,61,5,73,70,63,20,60,40,41,23,22,48,36,108,99,64,62,55,69,19,46,47,15,54,100,101,27,21,12,102,105,109,112,113,114,115,116,119,120,121,122,123,124,9,127,24,130,132,129,125,131,118,117,133,134}'::bigint[]))
-> Bitmap Index Scan on testtransactionid (cost=0.00..5.63 rows=100 width=0) (actual time=0.020..0.020 rows=2 loops=276)
Index Cond: ((transactionid)::text = ("ANY_subquery".transactionid)::text)
Total runtime: 9.122 ms
I'll check back later to let you know if this turned out a feasible solution for real :)
The idea is not to denormalise, but to normalise. The events_eventdetails table can be replaced by two tables: one with the event detail types, and one with the actual values (referring to {event_id, detail_type_id}).
This will make execution of the query easier, since only the numerical ids of the detail types have to be extracted and selected. The gain is in the reduced number of pages that the DBMS has to fetch, since each key name need only be stored+retrieved+compared once.
NOTE: I changed the naming a bit. For reasons of sanity and safety, mostly.
SET search_path='cav';
/**** ***/
DROP SCHEMA cav CASCADE;
CREATE SCHEMA cav;
SET search_path='cav';
CREATE TABLE event_types
(
-- this table holds some 50 rows
id bigserial PRIMARY KEY
, zname varchar(255)
);
INSERT INTO event_types(zname)
SELECT 'event_'::text || gs::text
FROM generate_series (1,100) gs
;
CREATE TABLE events
(
-- this table holds some 15M rows
id bigserial PRIMARY KEY
, zdatetime timestamp with time zone
, eventtype_id bigint REFERENCES event_types(id)
);
INSERT INTO events(zdatetime,eventtype_id)
SELECT gs, et.id
FROM generate_series ('2012-04-11 00:00:00'::timestamp
, '2012-04-12 12:00:00'::timestamp ,' 1 hour'::interval ) gs
, event_types et
;
-- SELECT * FROM event_types;
-- SELECT * FROM events;
CREATE TABLE event_details
(
-- this table holds some 65M rows
id bigserial PRIMARY KEY
, event_id bigint REFERENCES events(id)
, keyname varchar(255)
, zvalue text
);
INSERT INTO event_details(event_id, keyname)
SELECT ev.id,im.*
FROM events ev
, (VALUES ('transactionId'::text),('someKey'::text)
,('reductionId'::text),('customerId'::text),('companyId'::text)
) im
;
UPDATE event_details
SET zvalue = 'Some_value'::text || (random() * 1000)::int::text
;
--
-- Domain table with all valid detail_types
--
CREATE TABLE detail_types(
id bigserial PRIMARY KEY
, keyname varchar(255)
);
INSERT INTO detail_types(keyname)
SELECT DISTINCT keyname
FROM event_details
;
--
-- Context-attribute-value table, referencing {event_id, type_id}
--
CREATE TABLE event_detail_values
( event_id BIGINT
, detail_type_id BIGINT
, zvalue text
, PRIMARY KEY(event_id , detail_type_id)
, FOREIGN KEY(event_id ) REFERENCES events(id)
, FOREIGN KEY(detail_type_id)REFERENCES detail_types(id)
);
--
-- For the sake of joining we create some natural keys
--
CREATE INDEX events_details_keyname ON event_details (keyname) ;
CREATE INDEX detail_types_keyname ON detail_types(keyname) ;
INSERT INTO event_detail_values (event_id,detail_type_id, zvalue)
SELECT ed.event_id, dt.id
, ed.zvalue
FROM event_details ed
, detail_types dt
WHERE ed.keyname = dt.keyname
;
--
-- Now we can drop the original table, and use the view instead
--
DROP TABLE event_details;
CREATE VIEW event_details AS (
SELECT dv.event_id AS event_id
, dt.keyname AS keyname
, dv.zvalue AS zvalue
FROM event_detail_values dv
JOIN detail_types dt ON dt.id = dv.detail_type_id
);
EXPLAIN ANALYZE
SELECT ev.id AS event_id
, ev.zdatetime AS zdatetime
, ed.keyname AS keyname
, ed.zvalue AS zevalue
FROM events ev
JOIN event_details ed ON ed.event_id = ev.id
WHERE ed.keyname IN ('transactionId','customerId','companyId')
ORDER BY event_id,keyname
;
Resulting query plan:
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------
Sort (cost=1178.79..1197.29 rows=7400 width=40) (actual time=159.902..177.379 rows=11100 loops=1)
Sort Key: ev.id, dt.keyname
Sort Method: external sort Disk: 560kB
-> Hash Join (cost=108.34..703.22 rows=7400 width=40) (actual time=12.225..122.231 rows=11100 loops=1)
Hash Cond: (dv.event_id = ev.id)
-> Hash Join (cost=1.09..466.47 rows=7400 width=32) (actual time=0.047..74.183 rows=11100 loops=1)
Hash Cond: (dv.detail_type_id = dt.id)
-> Seq Scan on event_detail_values dv (cost=0.00..322.00 rows=18500 width=29) (actual time=0.006..26.543 rows=18500 loops=1)
-> Hash (cost=1.07..1.07 rows=2 width=19) (actual time=0.025..0.025 rows=3 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 1kB
-> Seq Scan on detail_types dt (cost=0.00..1.07 rows=2 width=19) (actual time=0.009..0.014 rows=3 loops=1)
Filter: ((keyname)::text = ANY ('{transactionId,customerId,companyId}'::text[]))
-> Hash (cost=61.00..61.00 rows=3700 width=16) (actual time=12.161..12.161 rows=3700 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 131kB
-> Seq Scan on events ev (cost=0.00..61.00 rows=3700 width=16) (actual time=0.004..5.926 rows=3700 loops=1)
Total runtime: 192.724 ms
(16 rows)
As you can see, the "deepest" part of the query is the retrieval of the detail_type_ids, given the list of strings. This is put into a hash table, which is then combined with a corresponding hashset for the detail_values. (NB: this is pg-9.1)
YMMV.
If you must use a design along these lines, you should eliminate the id column from events_eventdetails and declare the primary key to be (event_id, keyname). That would give you a very useful index without also maintaining a useless index for the synthetic key.
A step better would be to eliminate the events_eventdetails table entirely and use an hstore column for that data, with a GIN index. That would probably get you to your performance goals without needing to pre-define what event details are stored.
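A sketch of that hstore variant (hstore is a contrib extension; the details column name, the index name, and the literal company id are illustrative):
CREATE EXTENSION IF NOT EXISTS hstore;

ALTER TABLE events_events ADD COLUMN details hstore;

CREATE INDEX events_events_details_gin ON events_events USING gin (details);

-- "events whose details contain companyId = 345"
SELECT *
FROM events_events
WHERE details @> 'companyId=>345'::hstore;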
Even better, if you can predict or specify what event details are possible, would be to not try to implement a database within a database. Make each "keyname" value into a column in events_eventdetails with a data type appropriate to the nature of that data. This will probably allow much faster access at the cost of needing to issue ALTER TABLE statements as the nature of the detail changes.
See, if your key (reductionId in this case) is present in more than 7-10% of all the rows in the events_eventdetails table, then PostgreSQL will prefer a SeqScan. There's nothing you can do; it is the fastest way.
I have had a similar case working with ISO8583 packets. Each packet consists of 128 fields (by design), so first database design followed your approach with 2 tables:
field_id and description in one table (events_events in your case),
field_id + field_value in another (events_eventdetails).
Although such a layout follows 3NF, we hit the same issues straight away:
bad performance,
highly complicated queries.
In your case you should go for a re-design. One option (the easier one) is to make events_eventdetails.keyname a smallint, which will make comparison operations faster. Not a big win, though.
Another option is to reduce 2 tables into a single one, something like:
CREATE TABLE events_events (
id bigserial,
datetime timestamp with time zone,
eventtype_id bigint,
transactionId text, -- value for transactionId
reductionId text, -- -"- reductionId
companyId text, -- etc.
customerId text,
anyotherId text,
...
);
This will break the 3NF, but on the other hand:
you have more freedom to index your data;
your queries will be shorter and easier to maintain;
performance will be way too better.
Possible drawbacks:
you will waste a bit more space for the unused fields: unused fields / 8 bytes per row
you might still need an extra table for the events that are too rare to justify a separate column.
EDIT:
I don't quite understand what you mean by materialize here.
In your question you mentioned you want:
"solution" that returns events_events rows 100 and 200 and 300 together in a single result set and FAST! when asked for reductionId=123 or when asked for customerId=234 or when asked for companyId=345.
The suggested redesign creates a crosstab or pivot table from your events_eventdetails.
And to get all events_events rows that satisfies your conditions you can use:
SELECT *
FROM events_events
WHERE id IN (100, 200, 300)
AND reductionId = 123
-- AND customerId = 234
-- AND companyId = 345;