How can I optimise this complex PostgreSQL query?

I have been given a complex PostgreSQL query and asked to optimise it. I've made some progress in establishing where the problem might be, but I'm running out of ideas for how to approach the optimisation, so I'm hoping for some pointers/suggestions.
This query is part of an "AI Chatbot" system. It returns the unique conversations had with the chatbot (plus a total count), subject to a few filters. The data returned looks like:
conversationId: Id of the conversation
userId: Id of the user
userName: Name of the user
channel: Channel the conversation took place on
eventCount: Number of events that occurred during the conversation
operator: If true, the conversation was with an "admin" or in-house person
firstMessageDate: Date of the first message in the conversation
lastMessageDate: Date of the last message in the conversation
totalUniqueUsers: Total number of unique user conversations
The problem with the query seems to be related to a specific inner join onto a large table (10,540,420 rows).
Here is the query with that inner join commented out; in this form it runs in around 2 seconds.
select
"conversation_table"."conversationid" as "conversationId",
"du"."userid" as "userId",
"du"."username" as "userName",
"du"."channel" as "channel",
COALESCE(user_events.count, 0)::integer as "eventCount",
BOOL_OR(dc.operator_messages > 0) as "operator",
DATE_PART('epoch', dc.first_message_timestamp) * 1000 as "firstMessageDate",
DATE_PART('epoch', dc.last_message_timestamp) * 1000 as "lastMessageDate",
COUNT(*) OVER()::integer as "totalUniqueUsers"
from
"fact_conversation_data" as "conversation_table"
inner join "dim_user" as "du" on "conversation_table"."dim_user_user_id" = "du"."user_id"
inner join "dim_time" as "dt" on "conversation_table"."dim_time_time_id" = "dt"."time_id"
inner join "dim_conversation" as "dc" on "conversation_table"."conversationid" = "dc"."conversationid"
-- inner join (
-- select
-- "conversationid"
-- from
-- "fact_milestone_event" as fme
-- where
-- fme.dim_segment_segment_id in ('20736b82-4515-411f-9bc8-cf4d84ad69ac')
-- group by "conversationid"
-- ) as "fme" on "fme"."conversationid" = "conversation_table"."conversationid"
left join (
select
"fme"."conversationid",
count("fme"."event_id")
from
"fact_milestone_event" as "fme"
where
"fme"."timestamp" >= '2022-12-18 11:00:00.000'
and "fme"."timestamp" <= '2023-01-18 10:59:59.999'
and "fme"."dim_tenant_tenant_id" = '4621ed8f-d8a4-46e2-a8de-5710751b16b9'
group by
"fme"."conversationid"
) as "user_events" on "conversation_table"."conversationid" = "user_events"."conversationid"
where
"conversation_table"."timestamp" >= '2022-12-18 11:00:00.000'
and "conversation_table"."timestamp" <= '2023-01-18 10:59:59.999'
and "dt"."bot_zone" = 'Pacific/Auckland'
and "du"."is_platform_user" <> true
and "conversation_table"."conversationid" in (
select
"notEmptyConservation_table"."conversationid"
from
(
select
sum("user_messages") as "total_user_message_count",
"conversationid"
from
"fact_conversation_data"
where
"fact_conversation_data"."dim_tenant_tenant_id" = '4621ed8f-d8a4-46e2-a8de-5710751b16b9'
and "fact_conversation_data"."timestamp" >= '2022-12-18 11:00:00.000'
and "fact_conversation_data"."timestamp" <= '2023-01-18 10:59:59.999'
group by
"conversationid"
) as "notEmptyConservation_table"
where
"notEmptyConservation_table"."total_user_message_count" > 1
)
and "conversation_table"."dim_tenant_tenant_id" = '4621ed8f-d8a4-46e2-a8de-5710751b16b9'
and (
"conversation_table"."user_messages" > 0
)
group by
"user_events"."count",
"conversation_table"."conversationid",
"du"."userid",
"du"."username",
"du"."channel",
"dc"."first_message_timestamp",
"dc"."last_message_timestamp"
order by
"lastMessageDate" asc
limit
20
However, if I uncomment the inner join, I get a runtime of about 9 minutes.
Here's a link to the SLOW query and its execution plan on explain.depesz.com.
Here's a link to the faster version of the query, without the problematic inner join.
Here are the indexes for the two main tables involved, fact_milestone_event and fact_conversation_data.
fact_conversation_data:
CREATE INDEX fact_conversation_data_conversationid_idx ON public.fact_conversation_data USING btree (conversationid)
CREATE INDEX fact_conversation_data_tenant_id_idx ON public.fact_conversation_data USING btree (dim_tenant_tenant_id)
CREATE INDEX fact_conversation_data_timestamp_idx ON public.fact_conversation_data USING btree (timestamp)
CREATE UNIQUE INDEX conversationid_timeid_unique_idx ON public.fact_conversation_data USING btree (conversationid, dim_time_time_id)
CREATE UNIQUE INDEX fact_conversation_data_pk ON public.fact_conversation_data USING btree (conversation_data_id)

fact_milestone_event:
CREATE INDEX fact_milestone_event_dim_segment_segment_id_idx ON public.fact_milestone_event USING btree (dim_segment_segment_id)
CREATE INDEX fact_milestone_event_time_id_idx ON public.fact_milestone_event USING btree (dim_time_time_id)
CREATE INDEX fact_milestone_event_dim_milestone_milestone_id_idx ON public.fact_milestone_event USING btree (dim_milestone_milestone_id)
CREATE INDEX fact_milestone_event_conversationid_idx ON public.fact_milestone_event USING btree (conversationid)
CREATE INDEX fact_milestone_event_timestamp_idx ON public.fact_milestone_event USING btree (timestamp)
CREATE INDEX fact_milestone_event_dim_tenant_tenant_id_idx ON public.fact_milestone_event USING btree (dim_tenant_tenant_id)
CREATE INDEX fact_milestone_event_dim_user_user_id_idx ON public.fact_milestone_event USING btree (dim_user_user_id)
CREATE UNIQUE INDEX fact_milestone_event_pk ON public.fact_milestone_event USING btree (event_id)
I've tried replacing the inner join with a WHERE clause containing a nested select. I've tried running EXPLAIN against the query, but had trouble understanding the output. I've also tried adding more conditions to the inner join (as below):
inner join (
select "conversationid"
from "fact_milestone_event" as fme
where
fme.dim_segment_segment_id in ('20736b82-4515-411f-9bc8-cf4d84ad69ac') AND
fme.timestamp >= '2022-12-18 11:00:00.000' AND
fme.timestamp <= '2023-01-18 10:59:59.999' AND
fme.dim_tenant_tenant_id = '4621ed8f-d8a4-46e2-a8de-5710751b16b9'
group by "conversationid"
) as "fme" on "fme"."conversationid" = "conversation_table"."conversationid"

I think the sub-selects can be eliminated from the joins and their logic moved into the larger query. Rather than repeating the literal constraints in each sub-select, join on equality of those columns:
inner join fact_milestone_event as fme on
    fme.conversationid = conversation_table.conversationid
    and fme.dim_tenant_tenant_id = conversation_table.dim_tenant_tenant_id
    and fme.timestamp = conversation_table.timestamp
where
    conversation_table.timestamp >= '2022-12-18 11:00:00.000'
    and conversation_table.timestamp <= '2023-01-18 10:59:59.999'
    and conversation_table.dim_tenant_tenant_id = '4621ed8f-d8a4-46e2-a8de-5710751b16b9'
    and fme.dim_segment_segment_id = '20736b82-4515-411f-9bc8-cf4d84ad69ac'
The logic in notEmptyConservation_table can be simplified in the same way, and could reuse the existing fme join instead of scanning fact_conversation_data a second time.
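For example, the nested select inside the IN (...) collapses to a single level with HAVING (a sketch, not tested against this schema):
and conversation_table.conversationid in (
    select fcd.conversationid
    from fact_conversation_data as fcd
    where fcd.dim_tenant_tenant_id = '4621ed8f-d8a4-46e2-a8de-5710751b16b9'
        and fcd.timestamp >= '2022-12-18 11:00:00.000'
        and fcd.timestamp <= '2023-01-18 10:59:59.999'
    group by fcd.conversationid
    having sum(fcd.user_messages) > 1
)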

Related

Indexed view in SQL Server not using indexes

My schema:
I need to get the count of comments for each tag.
I created a view:
create view dbo.UserTagCommentCount
with schemabinding
as
select
c.UserPK, t.PK TagPK, count_big(*) Count
from
dbo.Comments c
join
dbo.Posts p on p.PK = c.PostPK
join
dbo.PostTags pt on pt.PostPK = p.PK
join
dbo.Tags t on t.PK = pt.TagPK
group by
t.PK, c.UserPK
go
and I created a unique clustered index on this view:
create unique clustered index PK_UserTagCommentCount
on dbo.UserTagCommentCount(UserPK, TagPK)
go
But when I select rows by UserPK, this clustered index is not being used:
select *
from UserTagCommentCount
where UserPK = 19146
order by Count desc
OK. So I created a simple index:
create index IX_UserTagCommentCount_UserPK
on UserTagCommentCount(UserPK)
go
and ran a select with an index hint:
select *
from UserTagCommentCount with(index(IX_UserTagCommentCount_UserPK))
where UserPK = 19146
order by Count desc
but I see the same plan.
Any ideas? Why are the indexes not used when selecting from this view?
SQL Server 2019 development
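One thing worth checking (a hedged suggestion, since behaviour depends on edition and optimizer choices): without Enterprise-level optimizer features the view is always expanded to its base tables unless you specify NOEXPAND, and even where automatic matching is available, NOEXPAND forces the view's own index to be used:
select *
from dbo.UserTagCommentCount with (noexpand)
where UserPK = 19146
order by Count desc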

Oracle poor nested join performance

I have a generic query builder that adds an arbitrary number of filters. I am getting poor performance on one of those filters (filter b) that requires going through two tables.
SELECT *
FROM (SELECT "TABLE_1".*
      FROM "TABLE_1"
           -- filter a: 1 table deep (fast)
           inner join (SELECT "SHARED_ID"
                       FROM "TABLE_4"
                       WHERE "TABLE_4"."COLUMN_A" LIKE '%123%') "TABLE_4"
               ON "TABLE_1"."SHARED_ID" = "TABLE_4"."SHARED_ID"
           -- filter b: 2 tables deep (slow)
           inner join (SELECT "SHARED_ID"
                       FROM "TABLE_2"
                            inner join (SELECT "ID"
                                        FROM "TABLE_3"
                                        WHERE NAME LIKE '%Abc%') "TABLE_3"
                                ON "TABLE_2"."TABLE_3_ID" = "TABLE_3"."ID") "TABLE_2"
               ON "TABLE_1"."SHARED_ID" = "TABLE_2"."SHARED_ID")
WHERE ROWNUM <= 20
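One alternative shape worth trying (a sketch, untested; note that EXISTS is a semi-join and deduplicates, so results can differ if SHARED_ID repeats in the filter tables):
SELECT *
FROM (SELECT "TABLE_1".*
      FROM "TABLE_1"
      WHERE EXISTS (SELECT 1
                    FROM "TABLE_4"
                    WHERE "TABLE_4"."SHARED_ID" = "TABLE_1"."SHARED_ID"
                      AND "TABLE_4"."COLUMN_A" LIKE '%123%')
        AND EXISTS (SELECT 1
                    FROM "TABLE_2"
                         INNER JOIN "TABLE_3"
                             ON "TABLE_2"."TABLE_3_ID" = "TABLE_3"."ID"
                    WHERE "TABLE_2"."SHARED_ID" = "TABLE_1"."SHARED_ID"
                      AND "TABLE_3"."NAME" LIKE '%Abc%'))
WHERE ROWNUM <= 20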

Getting very slow execution time for a postgresql query

I ran EXPLAIN ANALYZE for this query; it showed about 30 ms, but with more data I get "execution expired". Using PostgreSQL 10.
For normal execution: https://explain.depesz.com/s/gSPP
For slow execution: https://explain.depesz.com/s/bQN2
SELECT inventory_histories.*, order_items.order_id as order_id
FROM "inventory_histories"
LEFT JOIN order_items
    ON order_items.id = inventory_histories.reference_id
    AND inventory_histories.reference_type = 4
WHERE "inventory_histories"."inventory_id" = 1313
    AND inventory_histories.location_id = 15
ORDER BY inventory_histories.id DESC
LIMIT 10 OFFSET 0;
Indexes:
"inventory_histories_pkey" PRIMARY KEY, btree (id)
"inventory_histories_created_at_index" btree (created_at)
"inventory_histories_inventory_id_index" btree (inventory_id)
"inventory_histories_location_id_index" btree (location_id)
For this query:
SELECT ih.*, oi.order_id as order_id
FROM "inventory_histories" ih LEFT JOIN
order_items oi
ON oi.id = ih.reference_id AND
ih.reference_type = 4
WHERE ih."inventory_id" = 1313 AND
ih.location_id = 15
ORDER BY ih.id DESC
LIMIT 10 OFFSET 0;
For this query, you want composite indexes on inventory_histories(inventory_id, location_id, id, reference_id) and on order_items(id, order_id); reference_type is a column of inventory_histories, so it doesn't belong in an order_items index.
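Spelled out as DDL (the index names are illustrative):
CREATE INDEX inventory_histories_inv_loc_id_ref_idx
    ON inventory_histories (inventory_id, location_id, id, reference_id);

CREATE INDEX order_items_id_order_id_idx
    ON order_items (id, order_id);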

UPDATE not working as expected in Netezza. This is actually a general topic

I don't know why the update query is not behaving properly, or whether I am missing something very trivial here.
Here is the sequence of very simple steps I am running.
Step 1: Creating table
CREATE table SNAPDATE_YOS as SELECT SNAPSHOTDATE, PREFERENCE_ID, CVALIDEMAIL, (CVALIDEMAIL * 1.0125) AS new_CVALIDEMAIL
FROM RPT_EMAIL_CATEGORY_PREFERENCE
WHERE SNAPSHOTDATE = '2014-07-07 00:00:00'
AND PREFERENCE_ID = 'Yosemite';
1 rows affected
Select * from SNAPDATE_YOS;
SNAPSHOTDATE || PREFERENCE_ID || CVALIDEMAIL || NEW_CVALIDEMAIL
2014-07-07 00:00:00|| Yosemite || 97676 || 98896.9500
Step 2: Updating table RPT_EMAIL_CATEGORY_PREFERENCE using a join condition with the table created in step 1.
UPDATE RPT_EMAIL_CATEGORY_PREFERENCE
SET CVALIDEMAIL = ROUND(S.new_CVALIDEMAIL,0)
FROM SNAPDATE_YOS S
JOIN RPT_EMAIL_CATEGORY_PREFERENCE P ON P.PREFERENCE_ID = S.PREFERENCE_ID
WHERE P.SNAPSHOTDATE = '2014-11-21 00:00:00'
AND P.PREFERENCE_ID = 'Yosemite';
34 rows updated.
In my opinion only one row should be updated, as the join condition returns only one row.
Here are the supporting statements.
Supporting statement 1: selecting rows matching the condition.
Select ROUND(S.new_CVALIDEMAIL,0) as CVALIDEMAIL
FROM SNAPDATE_YOS S
JOIN RPT_EMAIL_CATEGORY_PREFERENCE P ON P.PREFERENCE_ID = S.PREFERENCE_ID
WHERE P.SNAPSHOTDATE = '2014-11-21 00:00:00'
AND P.PREFERENCE_ID = 'Yosemite';
Output:
CVALIDEMAIL
98897
Supporting Statement 2: Selecting all columns
Select *
FROM SNAPDATE_YOS S
JOIN RPT_EMAIL_CATEGORY_PREFERENCE P ON P.PREFERENCE_ID = S.PREFERENCE_ID
WHERE P.SNAPSHOTDATE = '2014-11-21 00:00:00'
AND P.PREFERENCE_ID = 'Yosemite';
1 row selected
Supporting Statement 3: Selecting data from the table which needs to be updated.
Select * from RPT_EMAIL_CATEGORY_PREFERENCE;
34 rows selected.
In my opinion, only the 1 row from the RPT_EMAIL_CATEGORY_PREFERENCE table that satisfies the update condition should be updated. Am I missing something very trivial here?
Query Plan:
QUERY PLANTEXT:
Nested Loop (cost=1.6..1.7 rows=34 width=113 conf=51)
l: Sequential Scan table "RPT_EMAIL_CATEGORY_PREFERENCE" (cost=0.0..0.0 rows=34 width=105 conf=100)
r: Materialize (cost=0.0..0.0 rows=1 width=16 conf=0)
l: Hash Join (cost=0.0..0.0 rows=1 width=16 conf=51)
l: Sequential Scan table "S" (cost=0.0..0.0 rows=1 width=266 conf=80)
r: Hash (cost=0.0..0.0 rows=1 width=15 conf=0)
l: Sequential Scan table "P" (cost=0.0..0.0 rows=1 width=15 conf=64)
NZ Version
[nz#usga-qts-tfam-01 ~]$ nzrev
Release 7.1.0.2-P2 [Build 39804]
Thanks in advance.
Vivek
The issue here is that you are joining to RPT_EMAIL_CATEGORY_PREFERENCE twice. You may not realize it because the join with the table specified to update is implicit.
UPDATE RPT_EMAIL_CATEGORY_PREFERENCE
-- ^ First reference to RPT_EMAIL_CATEGORY_PREFERENCE (with no alias)
SET CVALIDEMAIL = ROUND(S.new_CVALIDEMAIL,0)
FROM SNAPDATE_YOS S
-- which is then joined to SNAPDATE_YOS S with NO join criteria, making it a cross join
-- producing 1 x 34 rows
JOIN RPT_EMAIL_CATEGORY_PREFERENCE P ON P.PREFERENCE_ID = S.PREFERENCE_ID
-- The explicit join then adds 1 row from RPT_EMAIL_CATEGORY_PREFERENCE (as P), with no join
-- criteria other than the WHERE clause, which makes the output 1 x 34 x 1 rows.
-- This is because RPT_EMAIL_CATEGORY_PREFERENCE, when referenced with a different alias,
-- is treated as a separate table.
WHERE P.SNAPSHOTDATE = '2014-11-21 00:00:00'
AND P.PREFERENCE_ID = 'Yosemite';
The UPDATE I think you want is:
UPDATE RPT_EMAIL_CATEGORY_PREFERENCE
SET CVALIDEMAIL = ROUND(SNAPDATE_YOS.new_CVALIDEMAIL,0)
FROM SNAPDATE_YOS
WHERE
RPT_EMAIL_CATEGORY_PREFERENCE.PREFERENCE_ID = SNAPDATE_YOS.PREFERENCE_ID
AND RPT_EMAIL_CATEGORY_PREFERENCE.SNAPSHOTDATE = '2014-11-21 00:00:00'
AND RPT_EMAIL_CATEGORY_PREFERENCE.PREFERENCE_ID = 'Yosemite';
I removed the aliases for clarity (opinions may vary as to whether that's helpful or not in this case). You should only reference the table being UPDATEd once. For Netezza, inner joins in an UPDATE are implicitly specified by the FROM and WHERE clauses.

sqlite performance issue: one index per table is somewhat painful

So here's my schema (give or take):
cmds.Add(#"CREATE TABLE [Services] ([Id] INTEGER PRIMARY KEY, [AssetId] INTEGER NULL, [Name] TEXT NOT NULL)");
cmds.Add(#"CREATE INDEX [IX_Services_AssetId] ON [Services] ([AssetId])");
cmds.Add(#"CREATE INDEX [IX_Services_Name] ON [Services] ([Name])");
cmds.Add(#"CREATE TABLE [Telemetry] ([Id] INTEGER PRIMARY KEY, [ServiceId] INTEGER NULL, [Name] TEXT NOT NULL)");
cmds.Add(#"CREATE INDEX [IX_Telemetry_ServiceId] ON [Telemetry] ([ServiceId])");
cmds.Add(#"CREATE INDEX [IX_Telemetry_Name] ON [Telemetry] ([Name])");
cmds.Add(#"CREATE TABLE [Events] ([Id] INTEGER PRIMARY KEY, [TelemetryId] INTEGER NOT NULL, [TimestampTicks] INTEGER NOT NULL, [Value] TEXT NOT NULL)");
cmds.Add(#"CREATE INDEX [IX_Events_TelemetryId] ON [Events] ([TelemetryId])");
cmds.Add(#"CREATE INDEX [IX_Events_TimestampTicks] ON [Events] ([TimestampTicks])");
And here are my queries with their strange timer results:
sqlite> SELECT MIN(e.TimestampTicks) FROM Events e INNER JOIN Telemetry ss ON ss.ID = e.TelemetryID INNER JOIN Services s ON s.ID = ss.ServiceID WHERE s.AssetID = 1;
634678974004420000
CPU Time: user 0.296402 sys 0.374402
sqlite> SELECT MIN(e.TimestampTicks) FROM Events e INNER JOIN Telemetry ss ON ss.ID = e.TelemetryID INNER JOIN Services s ON s.ID = ss.ServiceID WHERE s.AssetID = 2;
634691940264680000
CPU Time: user 0.062400 sys 0.124801
sqlite> SELECT MIN(e.TimestampTicks) FROM Events e INNER JOIN Telemetry ss ON ss.ID = +e.TelemetryID INNER JOIN Services s ON s.ID = ss.ServiceID WHERE s.AssetID = 1;
634678974004420000
CPU Time: user 0.000000 sys 0.000000
sqlite> SELECT MIN(e.TimestampTicks) FROM Events e INNER JOIN Telemetry ss ON ss.ID = +e.TelemetryID INNER JOIN Services s ON s.ID = ss.ServiceID WHERE s.AssetID = 2;
634691940264680000
CPU Time: user 0.265202 sys 0.078001
Now I can understand why adding the '+' might change the time, but why is it so inconsistent with the AssetId change? Is there some other index I should create for these MIN queries? There are 900000 rows in the Events table.
Query Plans (first with '+'):
0|0|0|SEARCH TABLE Events AS e USING INDEX IX_Events_TimestampTicks (~1 rows)
0|1|1|SEARCH TABLE Telemetry AS ss USING INTEGER PRIMARY KEY (rowid=?) (~1 rows)
0|2|2|SEARCH TABLE Services AS s USING INTEGER PRIMARY KEY (rowid=?) (~1 rows)
0|0|2|SEARCH TABLE Services AS s USING COVERING INDEX IX_Services_AssetId (AssetId=?) (~1 rows)
0|1|1|SEARCH TABLE Telemetry AS ss USING COVERING INDEX IX_Telemetry_ServiceId (ServiceId=?) (~1 rows)
0|2|0|SEARCH TABLE Events AS e USING INDEX IX_Events_TelemetryId (TelemetryId=?) (~1 rows)
EDIT: In summary, given the tables above, what indexes would you create if these were the only queries ever executed?
SELECT MIN/MAX(e.TimestampTicks) FROM Events e INNER JOIN Telemetry t ON t.ID = e.TelemetryID INNER JOIN Services s ON s.ID = t.ServiceID WHERE s.AssetID = @AssetId;
SELECT e1.* FROM Events e1 INNER JOIN Telemetry t1 ON t1.Id = e1.TelemetryId INNER JOIN Services s1 ON s1.Id = t1.ServiceId WHERE t1.Name = @TelemetryName AND s1.Name = @ServiceName;
SELECT * FROM Events e INNER JOIN Telemetry t ON t.Id = e.TelemetryId INNER JOIN Services s ON s.Id = t.ServiceId WHERE s.AssetId = @AssetId AND e.TimestampTicks >= @StartTimeTicks ORDER BY e.TimestampTicks LIMIT 1000;
SELECT e.Id, e.TelemetryId, e.TimestampTicks, e.Value FROM (
    SELECT e2.Id AS [Id], MAX(e2.TimestampTicks) as [TimestampTicks]
    FROM Events e2 INNER JOIN Telemetry t ON t.Id = e2.TelemetryId INNER JOIN Services s ON s.Id = t.ServiceId
    WHERE s.AssetId = @AssetId AND e2.TimestampTicks <= @StartTimeTicks
    GROUP BY e2.TelemetryId) AS grp
INNER JOIN Events e ON grp.Id = e.Id;
Brannon,
Regarding time differences with change of AssetID:
Perhaps you've already tried this, but have you run each query several times in succession? The memory caching of BOTH your operating system and sqlite will often make a second query much faster than the first run within a session. I would run a given query four times in a row, and see if the 2nd-4th runs are more consistent in timing.
Regarding use of the "+"
(For those who may not know, within a SELECT preceding a field with "+" gives sqlite a hint NOT to use that field's index in the query. May cause your query to miss results if sqlite has optimized the storage to keep the data ONLY in that index. Suspect this is deprecated.)
Have you run the ANALYZE command? It helps the sqlite optimizer quite a bit when making decisions.
http://sqlite.org/lang_analyze.html
Once your schema is stable and your tables are populated, you may only need to run it once -- no need to run it every day.
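Concretely, from the sqlite shell (ANALYZE populates the sqlite_stat1 table that the planner consults):
ANALYZE;
-- then re-run the problem query to see whether the plan changed:
EXPLAIN QUERY PLAN
SELECT MIN(e.TimestampTicks)
FROM Events e
INNER JOIN Telemetry ss ON ss.ID = e.TelemetryID
INNER JOIN Services s ON s.ID = ss.ServiceID
WHERE s.AssetID = 1;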
INDEXED BY
INDEXED BY is a feature the author discourages for typical use, but you might find it helpful in your evaluations.
http://www.sqlite.org/lang_indexedby.html
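For example (illustrative only; if the named index cannot be used, sqlite refuses to prepare the statement rather than silently picking another plan):
SELECT MIN(e.TimestampTicks)
FROM Events e INDEXED BY IX_Events_TelemetryId
INNER JOIN Telemetry ss ON ss.ID = e.TelemetryID
INNER JOIN Services s ON s.ID = ss.ServiceID
WHERE s.AssetID = 1;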
I'd be interested to know what you discover,
Donald Griggs, Columbia SC USA