Optimizing a large "distinct" select in Postgres

I have a rather large dataset (millions of rows). I'm having trouble introducing a "distinct" concept to a certain query. (I'm putting distinct in quotes because this could be provided by the Postgres keyword DISTINCT or a "group by" form.)
A non-distinct search takes 1ms - 2ms ; all attempts to introduce a "distinct" concept have grown this to the 50,000ms - 90,000ms range.
My goal is to show the latest resources based on their most recent appearance in an event stream.
My non-distinct query is essentially this:
SELECT
resource.id AS resource_id,
stream_event.event_timestamp AS event_timestamp
FROM
resource
JOIN
resource_2_stream_event ON (resource.id = resource_2_stream_event.resource_id)
JOIN
stream_event ON (resource_2_stream_event.stream_event_id = stream_event.id)
WHERE
stream_event.viewer = 47
ORDER BY event_timestamp DESC
LIMIT 25
;
I've tried many different forms of queries (and subqueries) using DISTINCT, GROUP BY and MAX(event_timestamp). The issue isn't getting a query that works, it's getting one that works in a reasonable execution time. Looking at the EXPLAIN ANALYZE output for each one, everything is running off of indexes. The problem seems to be that with any attempt to deduplicate my results, Postgres must assemble the entire result set onto disk; since each table has millions of rows, this becomes a bottleneck.
--
update
here's a working group-by query:
EXPLAIN ANALYZE
SELECT
resource.id AS resource_id,
max(stream_event.event_timestamp) AS stream_event_event_timestamp
FROM
resource
JOIN resource_2_stream_event ON (resource_2_stream_event.resource_id = resource.id)
JOIN stream_event ON stream_event.id = resource_2_stream_event.stream_event_id
WHERE (
(stream_event.viewer_id = 57) AND
(resource.condition_1 IS NOT True) AND
(resource.condition_2 IS NOT True) AND
(resource.condition_3 IS NOT True) AND
(resource.condition_4 IS NOT True) AND
(
(resource.condition_5 IS NULL) OR (resource.condition_6 IS NULL)
)
)
GROUP BY (resource.id)
ORDER BY stream_event_event_timestamp DESC LIMIT 25;
Looking at the query planner (via EXPLAIN ANALYZE), it seems that adding in the max + group by clause (or a distinct) forces a sequential scan. That is taking about half the time to compute. There already is an index that contains every "condition", and I tried creating a set of indexes (one for each element). None work.
in any event, the difference is between 2ms and 72,000ms

Often, distinct on is the most efficient way to get one row per something. I would suggest trying:
SELECT DISTINCT ON (r.id) r.id AS resource_id, se.event_timestamp
FROM resource r JOIN
resource_2_stream_event r2se
ON r.id = r2se.resource_id JOIN
stream_event se
ON r2se.stream_event_id = se.id
WHERE se.viewer = 47
ORDER BY r.id, se.event_timestamp DESC
LIMIT 25;
An index on stream_event(viewer, event_timestamp) might help performance.
EDIT:
You might try using a CTE to get what you want:
WITH CTE as (
SELECT r.id AS resource_id,
se.event_timestamp AS stream_event_event_timestamp
FROM resource r JOIN
resource_2_stream_event r2se
ON r2se.resource_id = r.id JOIN
stream_event se
ON se.id = r2se.stream_event_id
WHERE ((se.viewer_id = 57) AND
(r.condition_1 IS NOT True) AND
(r.condition_2 IS NOT True) AND
(r.condition_3 IS NOT True) AND
(r.condition_4 IS NOT True) AND
( (r.condition_5 IS NULL) OR (r.condition_6 IS NULL)
)
)
)
SELECT resource_id, max(stream_event_event_timestamp) as stream_event_event_timestamp
FROM CTE
GROUP BY resource_id
ORDER BY stream_event_event_timestamp DESC
LIMIT 25;
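Before version 12, Postgres always materializes the CTE. So, if there are not that many matches, this may speed the query by using indexes for the CTE. From version 12 on, a CTE like this one is inlined by default; write WITH CTE AS MATERIALIZED (...) to force materialization.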
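To check that a rewrite still returns what the question asks for (the latest resources by most recent event), it can help to run the query shape on a tiny dataset first. The following is a minimal sketch using Python's sqlite3 module as a stand-in for Postgres; the sample rows and the latest_ts alias are made up for illustration, and it says nothing about Postgres performance.

```python
import sqlite3

# In-memory stand-in for the three tables in the question (sample data is invented).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE resource (id INTEGER PRIMARY KEY);
CREATE TABLE stream_event (
    id INTEGER PRIMARY KEY,
    viewer INTEGER,
    event_timestamp INTEGER
);
CREATE TABLE resource_2_stream_event (resource_id INTEGER, stream_event_id INTEGER);

INSERT INTO resource (id) VALUES (1), (2), (3);
INSERT INTO stream_event (id, viewer, event_timestamp) VALUES
    (10, 47, 100), (11, 47, 200), (12, 47, 300), (13, 99, 400);
INSERT INTO resource_2_stream_event (resource_id, stream_event_id) VALUES
    (1, 10), (1, 12), (2, 11), (3, 13);
""")

# One row per resource, carrying its most recent event for viewer 47,
# newest first. Resource 3 only appears in another viewer's stream.
latest = conn.execute("""
SELECT r.id AS resource_id,
       MAX(se.event_timestamp) AS latest_ts
FROM resource r
JOIN resource_2_stream_event r2se ON r2se.resource_id = r.id
JOIN stream_event se ON se.id = r2se.stream_event_id
WHERE se.viewer = 47
GROUP BY r.id
ORDER BY latest_ts DESC
LIMIT 25
""").fetchall()

print(latest)
```

Resource 1 shows up once, with its newer timestamp, which is the deduplication the non-distinct query lacks.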

Related

Postgres query optimization on joins and where in clauses

So I am trying to make a backend to send users notification from time to time. Now in order to do that I need to procure some data from different postgres tables. I wrote this query but it is taking 12-14 seconds to get the data.
When run without where in clause I get the data in almost 700ms.
SELECT DISTINCT ON (t."playerId") t."gzpId", t."pubCode", t."playerId" as token, t."provider",
COALESCE(p."preferenceValue",'en') as lang,
s."segmentId"
FROM "userPlayerIdMap" t LEFT JOIN
"userPreferences" p
ON t."gzpId" = p."gzpId" LEFT JOIN
"segment" s
ON t."gzpId" = s."gzpId"
WHERE t."pubCode" IN ('hyrmas','ayqioa','rj49as99') and
t."provider" IN ('FCM','ONE_SIGNAL') and
s."segmentId" IN (0,1,2,3,4,5,6) and
p."preferenceValue" IN ('en','hi')
ORDER BY t."playerId" desc;
Rows in "userPlayerIdMap" = 650000
Rows in "userPreferences" = 1456466
Rows in "segment" = 5674186
I have already added indexes on the required columns.
Would really appreciate some help.
Use subqueries:
SELECT t."gzpId", t."pubCode", t."playerId" as token, t."provider",
COALESCE((SELECT p."preferenceValue"
FROM "userPreferences" p
WHERE t."gzpId" = p."gzpId" AND
p."preferenceValue" IN ('en', 'hi')
LIMIT 1
), 'en'
) as lang,
(SELECT s."segmentId"
FROM "segment" s
WHERE t."gzpId" = s."gzpId" AND
s."segmentId" IN (0, 1, 2, 3, 4, 5, 6)
LIMIT 1
) as segmentId
FROM "userPlayerIdMap" t
WHERE t."pubCode" IN ('hyrmas', 'ayqioa', 'rj49as99') and
t."provider" IN ('FCM', 'ONE_SIGNAL')
-- ORDER BY t."playerId" desc;
I'm not sure the ORDER BY is necessary. If it was only being used for the DISTINCT ON, then it is not necessary in this version of the logic.
At the very least (with the ORDER BY) this will reduce the number of rows that need to be sorted. If you don't need the ORDER BY, then there is no sort -- a significant performance gain.
Then, you want indexes on:
userPreferences(gzpId, preferenceValue)
segment(gzpId, segmentId)
The index on userPlayerIdMap is trickier. I don't think that Postgres can use the index for both IN lists without a scan. You want the more selective column first, so one of:
userPlayerIdMap(provider, pubCode, gzpId)
userPlayerIdMap(pubCode, provider, gzpId)
I threw in gzpId so Postgres can use the index to look up the values in the subquery.
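The scalar-subquery shape can be tried out on a small scale. Below is a rough sketch using Python's sqlite3 module rather than Postgres; the snake_case table and column names, the index names, and the sample rows are all simplified stand-ins, and the ORDER BY inside the segment subquery is an addition to make the LIMIT 1 pick deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Simplified stand-ins for "userPlayerIdMap", "userPreferences" and "segment".
CREATE TABLE user_player_id_map (gzp_id INTEGER, pub_code TEXT, player_id TEXT, provider TEXT);
CREATE TABLE user_preferences (gzp_id INTEGER, preference_value TEXT);
CREATE TABLE segment (gzp_id INTEGER, segment_id INTEGER);

-- Composite indexes matching the lookups done by the subqueries.
CREATE INDEX idx_pref ON user_preferences (gzp_id, preference_value);
CREATE INDEX idx_seg ON segment (gzp_id, segment_id);

INSERT INTO user_player_id_map VALUES
    (1, 'hyrmas', 'tok-a', 'FCM'),
    (2, 'ayqioa', 'tok-b', 'ONE_SIGNAL'),
    (3, 'other',  'tok-c', 'FCM');
INSERT INTO user_preferences VALUES (1, 'hi');
INSERT INTO segment VALUES (1, 2), (1, 3), (2, 5);
""")

# Each outer row yields exactly one output row, so no DISTINCT is needed
# even though user 1 has two matching segments.
rows = conn.execute("""
SELECT t.player_id AS token,
       COALESCE((SELECT p.preference_value
                 FROM user_preferences p
                 WHERE p.gzp_id = t.gzp_id
                   AND p.preference_value IN ('en', 'hi')
                 LIMIT 1), 'en') AS lang,
       (SELECT s.segment_id
        FROM segment s
        WHERE s.gzp_id = t.gzp_id
          AND s.segment_id IN (0, 1, 2, 3, 4, 5, 6)
        ORDER BY s.segment_id
        LIMIT 1) AS segment_id
FROM user_player_id_map t
WHERE t.pub_code IN ('hyrmas', 'ayqioa', 'rj49as99')
  AND t.provider IN ('FCM', 'ONE_SIGNAL')
ORDER BY t.player_id
""").fetchall()

print(rows)
```

Note that a user with no matching preference or segment still appears here (with the 'en' default or a NULL segment), which differs from the original query, where the WHERE conditions on p and s drop such rows.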

Position of ON and WHERE clauses and the efficiency performance

I have two tables, one called Health_User and the other called Diary. They have users' demographic information, and their recorded values respectively. What I want to do is retrieving the recorded values, but:
Excluding testers (not real users) with the "is_tester" column (boolean values) in Health_User, and
Excluding unreasonable values with too high or too low measurements in Diary.
So I have several queries which should get the same results:
# Query 1
SELECT d.user_id, d.id AS diary_id, d.glucose_value, d.unit
FROM Diary AS d
JOIN (
SELECT id
FROM Health_User
WHERE is_tester = false
) AS u
ON d.user_id = u.id
WHERE ((d.glucose_value >= 20 AND d.glucose_value <= 600 AND d.unit = 'mg/dL')
OR (d.glucose_value >= 20/18.02 AND d.glucose_value <= 600/18.02 AND d.unit = 'mmol/L'));
# Query 2
SELECT d.user_id, d.id AS diary_id, d.glucose_value, d.unit
FROM Diary AS d
JOIN Health_User AS u
ON d.user_id = u.id
WHERE u.is_tester = false
AND ((d.glucose_value >= 20 AND d.glucose_value <= 600 AND d.unit = 'mg/dL')
OR (d.glucose_value >= 20/18.02 AND d.glucose_value <= 600/18.02 AND d.unit = 'mmol/L'));
# Query 3
SELECT d.user_id, d.id AS diary_id, d.glucose_value, d.unit
FROM Health_User AS u
JOIN (
SELECT id, user_id, glucose_value, unit
FROM Diary
WHERE ((glucose_value >= 20 AND glucose_value <= 600 AND unit = 'mg/dL')
OR (glucose_value >= 20/18.02 AND glucose_value <= 600/18.02 AND unit = 'mmol/L'))
) AS d
ON d.user_id = u.id
WHERE u.is_tester = false;
Here I have three questions:
Question 1: I would speculate that Query 1 would have better performance than Query 2, because a) it joins only one column instead of the whole table of Health_User and b) it filters out testers before joining the tables. Am I correct?
Question 2: The conditional limitation is more complex for Diary (See the last WHERE clause in Query 1). Is it better to switch Diary inside the JOIN and make Health_User outside like Query 3, or it makes no difference?
Question 3: Is there any even better solution in terms of performance?
There would be a difference if the database executed the queries in the order your queries suggest (first filter, then join or vice versa).
As it is, PostgreSQL has a query optimizer that rearranges the query to find the most efficient execution order, and all your queries will end up with the same execution plan, which you can verify using the SQL statement EXPLAIN.
For inner joins, it does not influence the result if you filter before or after the join; you could also write all the conditions into the join condition without changing the result. The optimizer knows that.
You can speed up execution by creating appropriate indexes. Whether a certain index is useful depends on the distribution of the data. The rule of thumb is that indexes on selective conditions (ones that filter out many rows) are more useful. Work with EXPLAIN to find the best indexes.
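The claim that filter placement does not change the result of an inner join can be checked on a toy dataset. Here is a small sketch using Python's sqlite3 module instead of PostgreSQL; the sample rows are invented, is_tester is stored as 0/1, and the ORDER BY is added only so the three result lists compare equal. It demonstrates equal results, not equal execution plans.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Health_User (id INTEGER PRIMARY KEY, is_tester INTEGER);
CREATE TABLE Diary (id INTEGER PRIMARY KEY, user_id INTEGER, glucose_value REAL, unit TEXT);

INSERT INTO Health_User VALUES (1, 0), (2, 1), (3, 0);
INSERT INTO Diary VALUES
    (1, 1, 110, 'mg/dL'),
    (2, 1, 900, 'mg/dL'),    -- too high
    (3, 2, 100, 'mg/dL'),    -- belongs to a tester
    (4, 3, 5.5, 'mmol/L'),
    (5, 3, 0.5, 'mmol/L');   -- too low
""")

# The plausibility filter shared by all three query shapes.
RANGE = """((d.glucose_value >= 20 AND d.glucose_value <= 600 AND d.unit = 'mg/dL')
         OR (d.glucose_value >= 20/18.02 AND d.glucose_value <= 600/18.02 AND d.unit = 'mmol/L'))"""
COLS = "d.user_id, d.id AS diary_id, d.glucose_value, d.unit"

# Query 1: testers filtered in a subquery before the join.
q1 = f"""SELECT {COLS} FROM Diary AS d
         JOIN (SELECT id FROM Health_User WHERE is_tester = 0) AS u ON d.user_id = u.id
         WHERE {RANGE} ORDER BY d.id"""
# Query 2: plain join, all filters in WHERE.
q2 = f"""SELECT {COLS} FROM Diary AS d
         JOIN Health_User AS u ON d.user_id = u.id
         WHERE u.is_tester = 0 AND {RANGE} ORDER BY d.id"""
# Query 3: Diary filtered in a subquery before the join.
q3 = f"""SELECT {COLS} FROM Health_User AS u
         JOIN (SELECT d.id, d.user_id, d.glucose_value, d.unit FROM Diary AS d WHERE {RANGE}) AS d
           ON d.user_id = u.id
         WHERE u.is_tester = 0 ORDER BY d.id"""

r1, r2, r3 = (conn.execute(q).fetchall() for q in (q1, q2, q3))
print(r1 == r2 == r3, r1)
```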

Query including subquery and group by slower than expected

The whole query below runs incredibly slowly.
The subquery [alias Stage_1] takes only 1.37 minutes, returning 9514 records; however, the whole query takes over 20 minutes, returning 2606 records.
I could use a #temp table to hold the subquery to improve the performance however I would prefer not to.
An overview of the query is that table WeeklySpace inner joins to Spaceblock_Name_to_PG table on SpaceblockName_SID, this cuts down the results in WeeklySpace and includes PG_Code with the results in WeeklySpace. WeeklySpace is then Full Outer Joined to Sales_PG_Wk across 3 fields. The where clause focuses the results, and may be changed. The results from the subquery are then sum'd. You cannot do the final sum'ing in the subquery due to the group by and sum over used.
I believe the issue is due to the subquery being recalculated repeatedly during the group by in the final sum'ing. The field SpaceblockName_SID also appears to be involved in causing the issue, as without it the run time with a group by in the subquery isn't affected.
I have read through loads of suggestions, trying them all to resolve the issue.
These include;
Adding TOP 2147483647 with Order by to force intermediate
materialization, both in the subquery and using a CTE.
Adding a join after stage_1.
Cast'ing SpaceblockName_SID from an int to a varchar and back again
The execution plan (cut in two parts, shown below the code) for both the subquery and the whole query appear similar. The cost is around the Full Outer Join (Hash Match), which I expected.
The query is running on T-SQL 2005.
Any help greatly appreciated!
select
Cost_centre
, Fin_week
, SpaceblockName_SID
, sum(Propor_rep_SRV) as Total_SpaceblockName_SID_SRV
from
(
select
coalesce(space_side.fin_week , sales_side.fin_week) as Fin_week
,coalesce(space_side.cost_centre , sales_side.cost_Centre) as Cost_centre
,space_side.SpaceblockName_SID
,case
when space_side.SpaceblockName_SID is null
then sales_side.SalesExVAT
else sum(space_side.TLM)
/nullif(sum (sum(space_side.TLM) ) over (partition by coalesce(space_side.fin_week , sales_side.fin_week)
, coalesce(space_side.cost_centre , sales_side.cost_Centre)
, coalesce( Spaceblock_Name_to_PG.PG_Code, sales_side.PG_Code)) ,0)*sales_side.SalesExVAT
end as Propor_rep_SRV
from
WeeklySpace as space_side
INNER JOIN
Spaceblock_Name_to_PG
ON space_side.SpaceblockName_SID = Spaceblock_Name_to_PG.SpaceblockName_SID
and Spaceblock_Name_to_PG.PG_Code < 10000
full outer join
sales_pg_wk as sales_side
on space_side.fin_week = sales_side.fin_week
and space_side.Cost_Centre = sales_side.Cost_Centre
and Spaceblock_Name_to_PG.PG_code = sales_side.pg_code
where
coalesce(space_side.fin_week, sales_side.fin_week) between 201538 and 201550
and
coalesce(space_side.cost_centre, sales_side.cost_Centre) in (3, 2800)
group by
coalesce(space_side.fin_week, sales_side.fin_week)
,coalesce(space_side.cost_centre, sales_side.cost_Centre)
,coalesce( Spaceblock_Name_to_PG.PG_Code, sales_side.PG_Code)
,sales_side.SalesExVAT
,space_side.SpaceblockName_SID
) as stage_1
group by
Cost_centre
, Fin_week
, SpaceblockName_SID
Execution plan left hand side
Execution plan right hand side
You didn't mention whether indexes exist on the columns used in your query. If not, create them and check the performance of the query again.
Looking at your logic, I think you could split this in two with a UNION:
One with Spaceblock_Name_to_PG.PG_Code < 10000 and the other with Spaceblock_Name_to_PG.PG_Code >= 10000
And consider this change.
It may be doing a bunch of joins that you are going to throw out anyway.
full outer join sales_pg_wk as sales_side
on space_side.fin_week = sales_side.fin_week
and space_side.Cost_Centre = sales_side.Cost_Centre
and Spaceblock_Name_to_PG.PG_code = sales_side.pg_code
and space_side.fin_week between 201538 and 201550
and sales_side.fin_week between 201538 and 201550
and space_side.cost_centre in (3, 2800)
and sales_side.cost_Centre in (3, 2800)
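One thing to check when moving range filters from WHERE into the ON clause of an outer join: the ON clause only decides which rows match, so rows from the preserved side that fail the filter still come back, padded with NULLs. The sketch below illustrates this with Python's sqlite3 module and a LEFT JOIN standing in for the FULL OUTER JOIN (older SQLite versions lack FULL OUTER JOIN); the two-column tables and values are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Cut-down stand-ins for WeeklySpace and sales_pg_wk.
CREATE TABLE space_side (fin_week INTEGER, tlm INTEGER);
CREATE TABLE sales_side (fin_week INTEGER, sales INTEGER);
INSERT INTO space_side VALUES (201537, 1), (201540, 2);
INSERT INTO sales_side VALUES (201537, 10), (201540, 20);
""")

# Filter in WHERE: week 201537 is removed from the result.
in_where = conn.execute("""
SELECT sp.fin_week, sa.sales
FROM space_side sp
LEFT JOIN sales_side sa ON sp.fin_week = sa.fin_week
WHERE sp.fin_week BETWEEN 201538 AND 201550
ORDER BY sp.fin_week
""").fetchall()

# Filter in ON: week 201537 no longer matches, but the outer join
# still returns it with a NULL sales value.
in_on = conn.execute("""
SELECT sp.fin_week, sa.sales
FROM space_side sp
LEFT JOIN sales_side sa
  ON sp.fin_week = sa.fin_week
 AND sp.fin_week BETWEEN 201538 AND 201550
ORDER BY sp.fin_week
""").fetchall()

print(in_where)
print(in_on)
```

So if the filters are added to the ON clause, the WHERE clause of the original query probably still needs to stay in place to keep the same result set.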

How to optimize this low-performance MySQL query?

I’m currently using the following query for jsPerf. In the likely case you don’t know jsPerf — there are two tables: pages containing the test cases / revisions, and tests containing the code snippets for the tests inside the test cases.
There are currently 937 records in pages and 3817 records in tests.
As you can see, it takes quite a while to load the “Browse jsPerf” page where this query is used.
The query takes about 7 seconds to execute:
SELECT
id AS pID,
slug AS url,
revision,
title,
published,
updated,
(
SELECT COUNT(*)
FROM pages
WHERE slug = url
AND visible = "y"
) AS revisionCount,
(
SELECT COUNT(*)
FROM tests
WHERE pageID = pID
) AS testCount
FROM pages
WHERE updated IN (
SELECT MAX(updated)
FROM pages
WHERE visible = "y"
GROUP BY slug
)
AND visible = "y"
ORDER BY updated DESC
I’ve added indexes on all fields that appear in WHERE clauses. Should I add more?
How can this query be optimized?
P.S. I know I could implement a caching system in PHP — I probably will, so please don’t tell me :) I’d just really like to find out how this query could be improved, too.
Use:
SELECT x.id AS pID,
x.slug AS url,
x.revision,
x.title,
x.published,
x.updated,
y.revisionCount,
COALESCE(z.testCount, 0) AS testCount
FROM pages x
JOIN (SELECT p.slug,
MAX(p.updated) AS max_updated,
COUNT(*) AS revisionCount
FROM pages p
WHERE p.visible = 'y'
GROUP BY p.slug) y ON y.slug = x.slug
AND y.max_updated = x.updated
LEFT JOIN (SELECT t.pageid,
COUNT(*) AS testCount
FROM tests t
GROUP BY t.pageid) z ON z.pageid = x.id
WHERE x.visible = 'y'
ORDER BY x.updated DESC
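A quick way to gain confidence in a rewrite like this is to run both shapes against a few rows and compare. The following sketch uses Python's sqlite3 module instead of MySQL, with invented sample data and a reduced column list; the joined version here also keeps a visible = 'y' filter on the outer table, as the original query does.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE pages (id INTEGER PRIMARY KEY, slug TEXT, revision INTEGER,
                    title TEXT, updated INTEGER, visible TEXT);
CREATE TABLE tests (id INTEGER PRIMARY KEY, pageID INTEGER);

INSERT INTO pages VALUES
    (1, 'a', 1, 'A v1', 100, 'y'),
    (2, 'a', 2, 'A v2', 200, 'y'),
    (3, 'b', 1, 'B v1', 150, 'y'),
    (4, 'b', 2, 'B v2', 300, 'n');   -- hidden revision
INSERT INTO tests (pageID) VALUES (1), (2), (2);
""")

# Shape 1: correlated subqueries, evaluated per output row.
correlated = conn.execute("""
SELECT p.id, p.slug,
       (SELECT COUNT(*) FROM pages r WHERE r.slug = p.slug AND r.visible = 'y') AS revisionCount,
       (SELECT COUNT(*) FROM tests t WHERE t.pageID = p.id) AS testCount
FROM pages p
WHERE p.updated IN (SELECT MAX(updated) FROM pages WHERE visible = 'y' GROUP BY slug)
  AND p.visible = 'y'
ORDER BY p.updated DESC
""").fetchall()

# Shape 2: each aggregate computed once in a derived table, then joined.
joined = conn.execute("""
SELECT x.id, x.slug, y.revisionCount, COALESCE(z.testCount, 0) AS testCount
FROM pages x
JOIN (SELECT slug, MAX(updated) AS max_updated, COUNT(*) AS revisionCount
      FROM pages WHERE visible = 'y' GROUP BY slug) y
  ON y.slug = x.slug AND y.max_updated = x.updated
LEFT JOIN (SELECT pageID, COUNT(*) AS testCount FROM tests GROUP BY pageID) z
  ON z.pageID = x.id
WHERE x.visible = 'y'
ORDER BY x.updated DESC
""").fetchall()

print(correlated == joined, joined)
```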
You want to learn how to use EXPLAIN. This shows the execution plan for the SQL statement: which indexes are being used, and what row scans are being performed. The goal is to reduce the number of row scans (i.e., the database searching row by row for values).
You may want to try the subqueries one at a time to see which one is slowest.
This query:
SELECT MAX(updated)
FROM pages
WHERE visible = "y"
GROUP BY slug
Makes it sort the result by slug. This is probably slow.

Bad performance of SQL query due to ORDER BY clause

I have a query joining 4 tables with a lot of conditions in the WHERE clause. The query also includes an ORDER BY clause on a numeric column. It takes 6 seconds to return, which is too long, and I need to speed it up. Surprisingly, I found that if I remove the ORDER BY clause it takes 2 seconds. Why does the ORDER BY make such a massive difference, and how can I optimize it? I am using SQL Server 2005. Many thanks.
I cannot confirm that the ORDER BY makes a big difference, since I am clearing the execution plan cache. However, can you shed light on how to speed this up a little bit? The query is as follows (for simplicity there is "SELECT *", but I am only selecting the columns I need).
SELECT *
FROM View_Product_Joined j
INNER JOIN [dbo].[OPR_PriceLookup] pl on pl.siteID = NodeSiteID and pl.skuid = j.skuid
LEFT JOIN [dbo].[OPR_InventoryRules] irp on irp.ID = pl.SkuID and irp.InventoryRulesType = 'Product'
LEFT JOIN [dbo].[OPR_InventoryRules] irs on irs.ID = pl.siteID and irs.InventoryRulesType = 'Store'
WHERE (((((SiteName = N'EcommerceSite') AND (Published = 1)) AND (DocumentCulture = N'en-GB')) AND (NodeAliasPath LIKE N'/Products/Cats/Computers/Computer-servers/%')) AND ((NodeSKUID IS NOT NULL) AND (SKUEnabled = 1) AND pl.PriceLookupID in (select TOP 1 PriceLookupID from OPR_PriceLookup pl2 where pl.skuid = pl2.skuid and (pl2.RoleID = -1 or pl2.RoleId = 13) order by pl2.RoleID desc)))
ORDER BY NodeOrder ASC
Why does the ORDER BY make such a massive difference, and how can I optimize it?
The ORDER BY needs to sort the resultset which may take long if it's big.
To optimize it, you may need to index the tables properly.
The index access path, however, has its drawbacks so it can even take longer.
If you have something other than equijoins in your query, or the ranged predicates (like <, > or BETWEEN, or GROUP BY clause), then the index used for ORDER BY may prevent the other indexes from being used.
If you post the query, I'll probably be able to tell you how to optimize it.
Update:
Rewrite the query:
SELECT *
FROM View_Product_Joined j
LEFT JOIN
[dbo].[OPR_InventoryRules] irp
ON irp.ID = j.skuid
AND irp.InventoryRulesType = 'Product'
LEFT JOIN
[dbo].[OPR_InventoryRules] irs
ON irs.ID = j.NodeSiteID
AND irs.InventoryRulesType = 'Store'
CROSS APPLY
(
SELECT TOP 1 *
FROM OPR_PriceLookup pl
WHERE pl.siteID = j.NodeSiteID
AND pl.skuid = j.skuid
AND pl.RoleID IN (-1, 13)
ORDER BY
pl.RoleID desc
) pl
WHERE SiteName = N'EcommerceSite'
AND Published = 1
AND DocumentCulture = N'en-GB'
AND NodeAliasPath LIKE N'/Products/Cats/Computers/Computer-servers/%'
AND NodeSKUID IS NOT NULL
AND SKUEnabled = 1
ORDER BY
NodeOrder ASC
The relation View_Product_Joined, as the name suggests, is probably a view.
Could you please post its definition?
If it is indexable, you may benefit from creating an index on View_Product_Joined (SiteName, Published, DocumentCulture, SKUEnabled, NodeOrder).
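The "pick one price row per product, preferring RoleID 13 over -1" logic of the rewritten query can be tried on a toy dataset. SQLite has no CROSS APPLY, so this sketch (Python's sqlite3 module) swaps in a join against a correlated ORDER BY ... LIMIT 1 subquery, which plays the same role; the product and price_lookup tables, their columns, and the rows are simplified, hypothetical stand-ins for View_Product_Joined and OPR_PriceLookup.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product (skuid INTEGER, site_id INTEGER, node_order INTEGER);
CREATE TABLE price_lookup (price_lookup_id INTEGER PRIMARY KEY, skuid INTEGER,
                           site_id INTEGER, role_id INTEGER, price REAL);

INSERT INTO product VALUES (1, 1, 20), (2, 1, 10), (3, 1, 30);
INSERT INTO price_lookup VALUES
    (100, 1, 1, -1, 9.99),
    (101, 1, 1, 13, 7.99),   -- role 13 should win over -1 for sku 1
    (102, 2, 1, -1, 4.50),
    (103, 3, 1,  5, 1.00);   -- role not in (-1, 13): sku 3 drops out
""")

# For each product, join only the single best price row: highest role_id
# among the allowed roles. Products without such a row are excluded,
# as they would be with CROSS APPLY.
rows = conn.execute("""
SELECT j.skuid, pl.role_id, pl.price
FROM product j
JOIN price_lookup pl
  ON pl.price_lookup_id = (
        SELECT pl2.price_lookup_id
        FROM price_lookup pl2
        WHERE pl2.skuid = j.skuid
          AND pl2.site_id = j.site_id
          AND pl2.role_id IN (-1, 13)
        ORDER BY pl2.role_id DESC
        LIMIT 1)
ORDER BY j.node_order ASC
""").fetchall()

print(rows)
```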