PostgreSQL range-based join is too slow - sql

Suppose I have 2 tables, Events and Callbacks, with the following structure:
Event:
id
timestamp (BIGINT, btree index)
type (VARCHAR, btree index)
(pair index (timestamp, type))
Callback:
id
timestamp (BIGINT, btree index)
event_type
The Event table contains about (M =) 300000 rows, Callbacks about (N =) 25000.
I am trying to do something like:
SELECT * FROM Callback
JOIN Event
ON ABS(Callback.timestamp - Event.Timestamp) < 300000 AND
Callback.event_type = Event.type;
As planned, it should run in O(N log(M) + R) (where R is the result size; R is about 1000000, on average 50 events for each order), but in practice it takes about 40 minutes on a powerful CPU.
UPD: Sorry, forgot to say, I also tried:
SELECT * FROM Callback
JOIN Event
ON Event.Timestamp < Callback.timestamp + 300000 AND
Event.Timestamp > Callback.timestamp - 300000 AND
Callback.event_type = Event.type;
But nothing changes.
Can anyone tell me what I am doing wrong?
Thank you.

Perhaps the following will work with an index on event(type, timestamp):
SELECT *
FROM Callback c JOIN
Event e
ON c.event_type = e.type AND e.Timestamp > c.timestamp - 300000;
The idea is to leave one of the timestamp columns unmodified; wrapping a column in an expression can prevent the use of the index.
I do wonder if you also want a condition on c.timestamp >= e.TimeStamp. Your performance problem might simply be the volume of data you are returning.

Re-arrange your joins so that one column is expressed as a function of the other, so something like:
SELECT * FROM Callback
JOIN Event
ON (Event.Timestamp > (Callback.timestamp - 300000) AND
Callback.event_type = Event.type);
... or ...
SELECT * FROM Callback
JOIN Event
ON (Callback.timestamp < (Event.Timestamp + 300000) AND
Callback.event_type = Event.type);
(I think I got the >'s and <'s the right way round).
This allows indexes on the columns to be used, but I wouldn't rule out the possibility that full scans will be needed on both tables. It depends on the data distribution of the values.
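Putting both ideas together, here is a sketch that keeps the equality and both range bounds sargable (assuming an index on Event(type, timestamp), as suggested above; the index name is illustrative). The key difference from the (timestamp, type) index in the question is that the equality column comes first, so the range on timestamp narrows the scan within a single type:
CREATE INDEX event_type_timestamp_idx ON Event (type, timestamp);

SELECT *
FROM Callback
JOIN Event
ON Callback.event_type = Event.type
AND Event.timestamp > Callback.timestamp - 300000
AND Event.timestamp < Callback.timestamp + 300000;
With that index, the planner should be able to do one index range scan on Event per Callback row instead of comparing every pair, but verify with EXPLAIN ANALYZE on your data.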

Related

Can't find a way to improve my PostgreSQL query

In my PostgreSQL database I have 6 tables named storeAPrices, storeBprices etc., holding the same columns and indexes as follows:
item_code (string, primary_key)
item_name (string, btree index)
is_whigthed (number : 0|1, btree index)
item_price (number)
My desire is to join each storePrices table to the others by item_code or by item_name similarity, where "OR" should short-circuit as in a programming language (check the right side only if the left side is false).
Currently, my query has low performance.
select
*
FROM "storeAprices" sap
left JOIN LATERAL (
SELECT * FROM "storeBPrices" sbp
WHERE
similarity(sap.item_name,sbp.item_name) >= 0.45
ORDER BY similarity(sap.item_name,sbp.item_name) DESC
limit 1
) bp ON case when sap.item_code = bp.item_code then true else sap.item_name % bp.item_name end
left JOIN LATERAL (
select * FROM "storeCPrices" scp
WHERE similarity(sap.item_name,scp.item_name) >= 0.45
ORDER BY similarity(sap.item_name,scp.item_name) desc
limit 1
) rp ON case when sap.item_code = rp.item_code then true else sap.item_name % rp.item_name end
This is part of my query and it takes too much time to return. My data is not that large (15k items per table).
Also I have another indexed column, "is_whigthed", that I'm not sure how to use. (I don't want to set it as a variable because I want to get results for all "is_whigthed" values.)
Any suggestions?
OR should be faster than using CASE:
bp ON sap.item_code = bp.item_code OR sap.item_name % bp.item_name
Also, you can create a trigram index on the item_name columns as described in the pg_trgm module docs, since you are using its % operator for similarity.
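For example, a sketch of those trigram indexes (assuming the pg_trgm extension can be installed; the index names are illustrative):
CREATE EXTENSION IF NOT EXISTS pg_trgm;

CREATE INDEX storebprices_item_name_trgm_idx
ON "storeBPrices" USING gin (item_name gin_trgm_ops);

CREATE INDEX storecprices_item_name_trgm_idx
ON "storeCPrices" USING gin (item_name gin_trgm_ops);
A GIN index with gin_trgm_ops accelerates the % operator; if you also want index-assisted nearest-match ordering, a GiST index with gist_trgm_ops and the <-> distance operator is the usual alternative.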

Query takes too long to run, how to optimize it?

The query structure: a helper SELECT in a WITH clause selects the most recent entry using TOP 1 transaction_date, then the query does many joins. It takes too much time to run - what am I doing wrong?
CREATE VIEW [IRWSMCMaterialization].[FactInventoryItemOnHandDailyView] AS
WITH TempTBLFactIvnItmDaily AS (
SELECT TOP 20
ITEM_NUMBER AS [InventoryItemNumber]
,CAST(FORMAT(TRANSACTION_DATE, 'yyyyMMdd') AS INT) AS [DateKey]
,BRANCH_PLANT_FHK AS [BranchPlantKey]
,BRANCH_PLANT_CODE AS [BranchPlantCode]
,CAST(QUANTITY_ON_HAND AS BIGINT) AS [QuantityOnHand]
,TRANSACTION_DATE AS [Date]
,WAREHOUSE_LOCATION_FHK AS [WarehouseLocationKey]
,WAREHOUSE_LOCATION_CODE AS [WarehouseLocationCode]
,WAREHOUSE_LOT_NUMBER_CODE AS [WarehouseLotNumber]
,WAREHOUSE_LOT_NUMBER_FHK AS [WarehouseLotNumberKey]
,UNIT_OF_MEASURE AS [UnitOfMeasureName]
,UNIT_OF_MEASURE_PHK AS [UnitOfMeasureKey]
FROM dbo.RS_INV_ITEM_ON_HAND
-- below is where clause, choose only most recent entry
WHERE TRANSACTION_DATE = (SELECT TOP 1 TRANSACTION_DATE FROM dbo.RS_INV_ITEM_ON_HAND ORDER BY TRANSACTION_DATE DESC)
)
SELECT [InventoryItemNumber],
[DateKey],
[Date],
[BranchPlantCode] AS [BP],
[WarehouseLocationCode] AS [Location],
[QuantityOnHand],
[UnitOfMeasureName] AS [UoM],
CASE [WarehouseLotNumber]
WHEN 'Not Assigned' THEN NULL
ELSE [WarehouseLotNumber]
END
AS [Lot]
FROM TempTBLFactIvnItmDaily iioh
JOIN DWH.DimBranchPlant bp ON iioh.BranchPlantKey = bp.BRANCH_PLANT_PHK
JOIN DWH.DimWarehouseLocation wloc ON iioh.WarehouseLocationKey = wloc.WAREHOUSE_LOCATION_PHK
JOIN DWH.DimWarehouseLotNumber wlot ON iioh.WarehouseLotNumberKey = wlot.WarehouseLotNumber_PHK
JOIN DWH.DimUnitOfMeasure uom ON CAST(iioh.UnitOfMeasureKey AS VARCHAR(100)) = uom.UNIT_OF_MEASURE_PHK
where bp.BRANCH_PLANT_CODE = '96100'
AND iioh.QuantityOnHand > 0
AND (wloc.WAREHOUSE_LOCATION_CODE like '6000W01%' OR wloc.WAREHOUSE_LOCATION_CODE like 'BL%')
GO
There are a lot of things that do not seem good. First of all, your base query can be a lot simpler. Something like this:
SELECT iioh.ITEM_NUMBER AS [InventoryItemNumber],
CAST(FORMAT(iioh.TRANSACTION_DATE, 'yyyyMMdd') AS INT) AS [DateKey],
iioh.TRANSACTION_DATE AS [Date],
iioh.BRANCH_PLANT_CODE AS [BP],
iioh.WAREHOUSE_LOCATION_CODE AS [Location],
CAST(iioh.QUANTITY_ON_HAND AS BIGINT) AS [QuantityOnHand],
iioh.UNIT_OF_MEASURE AS [UoM],
NULLIF(iioh.WAREHOUSE_LOT_NUMBER_CODE, 'Not Assigned') AS [Lot]
FROM dbo.RS_INV_ITEM_ON_HAND iioh
JOIN DWH.DimBranchPlant bp
ON iioh.BRANCH_PLANT_FHK = bp.BRANCH_PLANT_PHK
JOIN DWH.DimWarehouseLocation wloc
ON iioh.WAREHOUSE_LOCATION_FHK = wloc.WAREHOUSE_LOCATION_PHK
JOIN DWH.DimUnitOfMeasure uom
ON CAST(iioh.UNIT_OF_MEASURE_PHK AS VARCHAR(100)) = uom.UNIT_OF_MEASURE_PHK
where bp.BRANCH_PLANT_CODE = '96100'
AND iioh.QUANTITY_ON_HAND > 0
AND (wloc.WAREHOUSE_LOCATION_CODE like '6000W01%' OR wloc.WAREHOUSE_LOCATION_CODE like 'BL%')
AND iioh.TRANSACTION_DATE = @TRANSACTION_DATE
For example, you are joining DWH.DimWarehouseLotNumber but you are not extracting any columns from it - do you really need it? Also, there are other columns which are not returned by the view - why query them?
Also, note that in your view you first take the TOP 20 rows filtered only by date and then filter by the other fields, so some of those first 20 records may be filtered out by the subsequent conditions - is this the behavior you want?
Also, do you really want this cast?
ON CAST(iioh.UnitOfMeasureKey AS VARCHAR(100)) = uom.UNIT_OF_MEASURE_PHK
It's better to use CONVERT rather than FORMAT from a performance standpoint. Also, why not save/materialize the TRANSACTION_DATE as an INT (for example using a persisted computed column, or just on CRUD) instead of calculating this value on each read?
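As a minimal sketch of both ideas (the DateKey column name is just illustrative):
-- CONVERT with style 112 produces 'yyyyMMdd', which can then be cast to INT
CAST(CONVERT(CHAR(8), iioh.TRANSACTION_DATE, 112) AS INT) AS [DateKey]

-- or materialize it once as a persisted computed column on the base table
ALTER TABLE dbo.RS_INV_ITEM_ON_HAND
ADD DateKey AS CAST(CONVERT(CHAR(8), TRANSACTION_DATE, 112) AS INT) PERSISTED;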
Filtering by location code using a LIKE clause can hurt performance, too. Why not add a new column WareHouseLocationCodeType and set the same value for all locations satisfying this condition:
(wloc.WAREHOUSE_LOCATION_CODE like '6000W01%' OR wloc.WAREHOUSE_LOCATION_CODE like 'BL%')
Then you can filter by this column in the view, since this is very important for you. Also, you can create a filtered index on this column to increase performance further.
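One possible sketch of that idea, using a persisted computed column (the column and index names are hypothetical; note that a filtered index predicate cannot reference a computed column, so this sketch indexes the column directly - if you instead maintain the flag as a regular column on write, a filtered index becomes an option):
ALTER TABLE DWH.DimWarehouseLocation
ADD WareHouseLocationCodeType AS
CASE WHEN WAREHOUSE_LOCATION_CODE LIKE '6000W01%'
OR WAREHOUSE_LOCATION_CODE LIKE 'BL%'
THEN 1 ELSE 0 END PERSISTED;

CREATE NONCLUSTERED INDEX IX_DimWarehouseLocation_CodeType
ON DWH.DimWarehouseLocation (WareHouseLocationCodeType)
INCLUDE (WAREHOUSE_LOCATION_CODE);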
Also, you may want to create an inline table-valued function instead of a view and pass the date as a parameter:
CREATE OR ALTER FUNCTION [IRWSMCMaterialization].[FactInventoryItemOnHandDailyView]
(
@TRANSACTION_DATE datetime
)
RETURNS TABLE
AS
RETURN
(
SELECT iioh.ITEM_NUMBER AS [InventoryItemNumber],
CAST(FORMAT(iioh.TRANSACTION_DATE, 'yyyyMMdd') AS INT) AS [DateKey],
iioh.TRANSACTION_DATE AS [Date],
iioh.BRANCH_PLANT_CODE AS [BP],
iioh.WAREHOUSE_LOCATION_CODE AS [Location],
CAST(iioh.QUANTITY_ON_HAND AS BIGINT) AS [QuantityOnHand],
iioh.UNIT_OF_MEASURE AS [UoM],
NULLIF(iioh.WAREHOUSE_LOT_NUMBER_CODE, 'Not Assigned') AS [Lot]
,iioh.TRANSACTION_DATE
FROM dbo.RS_INV_ITEM_ON_HAND iioh
JOIN DWH.DimBranchPlant bp
ON iioh.BRANCH_PLANT_FHK = bp.BRANCH_PLANT_PHK
JOIN DWH.DimWarehouseLocation wloc
ON iioh.WAREHOUSE_LOCATION_FHK = wloc.WAREHOUSE_LOCATION_PHK
JOIN DWH.DimUnitOfMeasure uom
ON CAST(iioh.UNIT_OF_MEASURE_PHK AS VARCHAR(100)) = uom.UNIT_OF_MEASURE_PHK
where bp.BRANCH_PLANT_CODE = '96100'
AND iioh.QUANTITY_ON_HAND > 0
AND (wloc.WAREHOUSE_LOCATION_CODE like '6000W01%' OR wloc.WAREHOUSE_LOCATION_CODE like 'BL%')
AND iioh.TRANSACTION_DATE = @TRANSACTION_DATE
)
Then call it like this:
SELECT TOP 20 *
FROM [IRWSMCMaterialization].[FactInventoryItemOnHandDailyView] ('2020-12-04')
ORDER BY TRANSACTION_DATE DESC
Query optimization is a science in itself. If you want to find the bottlenecks in your query, you can follow some of these steps:
As the first step, enable statistics with these commands:
SET STATISTICS TIME ON;
SET STATISTICS IO ON;
Once you execute these commands in a query window, run your query in the same window. When the query finishes, switch to the Messages tab and you will see a lot of useful information, like execution TIME, parse and compile time, and, maybe most interesting, the I/O reads.
As the second step, try to understand which table has a lot of reads. For example, if you are expecting 10 rows from the query but some table shows 10k or 100k logical reads, something is wrong: to return those 10 rows, the execution reads 10k pages from that table. You are obviously missing an index on this table; try to find out which index you need.
If you have static values in the WHERE clause, like the following, then think about a filtered index:
bp.BRANCH_PLANT_CODE = '96100' AND iioh.QuantityOnHand > 0
Not always, but in some cases a conversion can break index usage: if you cast a column or apply some other function to it in the WHERE clause, like the following, then even if you have an index on this column the query optimizer will not use it during query execution:
CAST(iioh.UnitOfMeasureKey AS VARCHAR(100))
The last one: if you have an OR logical operator in your query, try executing each part of the OR separately to see its performance (as sketched below). This logical operator can really kill your query, and this is one example:
AND (wloc.WAREHOUSE_LOCATION_CODE like '6000W01%' OR wloc.WAREHOUSE_LOCATION_CODE like 'BL%')
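For instance, you could compare the two branches in isolation with something like this (illustrative only):
SELECT COUNT(*)
FROM DWH.DimWarehouseLocation wloc
WHERE wloc.WAREHOUSE_LOCATION_CODE LIKE '6000W01%';

SELECT COUNT(*)
FROM DWH.DimWarehouseLocation wloc
WHERE wloc.WAREHOUSE_LOCATION_CODE LIKE 'BL%';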
Once you determine that you don't have any issues here, you can dig further.

Filter by clustering fields using a sub-select query

With Google BigQuery, I am querying a clustered table by applying a filter on the clustering field projectId, like so:
WITH userProjects AS (
SELECT
projectsArray
FROM
projectsPerUser
WHERE
userId = "eben#somewhere.com"
)
SELECT
userProperty
FROM
`mydata.mydataset.mytable`
WHERE
--projectId IN UNNEST((SELECT projectsArray FROM userProjects))
projectId IN ("mydata", "anotherproject")
AND _PARTITIONTIME >= "2019-03-20"
Clustering is applied correctly in the code snippet above, but when I use the commented-out line --projectId IN UNNEST((SELECT projectsArray FROM userProjects)), clustering doesn't apply.
I've tried wrapping it in a UDF like this as well, which also doesn't work:
CREATE TEMP FUNCTION storedValue(item ARRAY<STRING>) AS (
item
);
...
WHERE projectId IN UNNEST(storedValue((SELECT projectsListArray FROM projectsList)))
As I understand from this, the execution path for sub-select queries is different to merely filtering on a scalar or array directly.
I expect a solution to exist where I can programmatically supply an array to filter on that will still allow me the cost benefit a clustered table provides.
In summary:
WHERE projectId IN ("mydata", "anotherproject") [OK]
WHERE projectId IN UNNEST((SELECT projectsArray FROM userProjects)) [Not OK]
WHERE projectId IN UNNEST(storedValue((SELECT projectsListArray FROM projectsList))) [Not OK]
Any ideas?
My suggestion is to rewrite your query so that your nested SELECT is a temporary table (which you've already done) and then perform the filtering you require by using an INNER JOIN rather than a set membership test, so your query would become something like this:
WITH userProjects AS (
SELECT
projectsArray
FROM
projectsPerUser
WHERE
userId = "eben#somewhere.com"
)
SELECT
userProperty
FROM
`mydata.mydataset.mytable` as a
JOIN
userProjects as b
ON a.projectId IN UNNEST(b.projectsArray)
WHERE _PARTITIONTIME >= "2019-03-20"
I believe this will result in a query which does not scan the full partition if that field is clustered.
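If projectsArray is an ARRAY column, an equivalent formulation is to unnest it inside the CTE so the join compares plain strings (a sketch; I haven't verified how it interacts with clustering on your tables):
WITH userProjects AS (
SELECT projectId
FROM projectsPerUser, UNNEST(projectsArray) AS projectId
WHERE userId = "eben@somewhere.com"
)
SELECT userProperty
FROM `mydata.mydataset.mytable` AS a
JOIN userProjects AS b
ON a.projectId = b.projectId
WHERE _PARTITIONTIME >= "2019-03-20"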
FWIW, clustering works well for me with dynamic filters:
SELECT title, SUM(views) views
FROM `fh-bigquery.wikipedia_v3.pageviews_2019`
WHERE DATE(TIMESTAMP_TRUNC(datehour, DAY)) = '2019-01-01'
AND wiki='en'
AND title IN ('Dogfight_(disambiguation)','Dogfight','Dogfight_(film)')
GROUP BY 1
1.8 sec elapsed, 364 MB processed
if instead I do
AND title IN (
SELECT DISTINCT prev
FROM `fh-bigquery.wikipedia_vt.clickstream_materialized`
WHERE date='2019-01-01' AND prev LIKE 'Dogfight%'
ORDER BY 1 LIMIT 3)
2.9 sec elapsed, 513.8 MB processed
If I go to v2 (not clustered), instead of v3:
FROM `fh-bigquery.wikipedia_v2.pageviews_2019`
2.6 sec elapsed, 9.6 GB processed
I'm not sure what's happening in your tables - but it might be interesting to revisit.

SQL Query with non exists optimize

I have the following query which I am executing directly from my code and putting into a DataTable. The problem is that it takes more than 10 minutes to execute this query. The main part which is taking the time is the NOT EXISTS.
SELECT
[t0].[PayrollEmployeeId],
[t0].[InOutDate],
[t0].[InOutFlag],
[t0].[InOutTime]
FROM [dbo].[MachineLog] AS [t0]
WHERE
([t0].[CompanyId] = 1)
AND ([t0].[InOutDate] >= '2016-12-13')
AND ([t0].[InOutDate] <= '2016-12-14')
AND
( NOT (EXISTS(
SELECT NULL AS [EMPTY]
FROM [dbo].[TO_Entry] AS [t1]
WHERE
([t1].[EmployeeId] = [t0].[PayrollEmployeeId])
AND ([t1].[CompanyId] = 1)
AND ([t0].[PayrollEmployeeId] = [t1].[EmployeeId])
AND (([t0].[InOutDate]) = [t1].[Entry_Date])
AND ([t1].[Entry_Method] = 'M')
))
)
ORDER BY
[t0].[PayrollEmployeeId], [t0].[InOutDate]
Is there any way I can optimize this query? What is the workaround for this? It is taking too much time.
It seems that you can convert the NOT EXISTS into a LEFT JOIN query with the second table returning NULL values.
Please check the following SELECT and modify it if required to fulfill your requirements:
SELECT
[t0].[PayrollEmployeeId], [t0].[InOutDate], [t0].[InOutFlag], [t0].[InOutTime]
FROM [dbo].[MachineLog] AS [t0]
LEFT JOIN [dbo].[TO_Entry] AS [t1]
ON [t1].[EmployeeId] = [t0].[PayrollEmployeeId]
AND [t0].[PayrollEmployeeId] = [t1].[EmployeeId]
AND [t0].[InOutDate] = [t1].[Entry_Date]
AND [t1].[CompanyId] = 1
AND [t1].[Entry_Method] = 'M'
WHERE
([t0].[CompanyId] = 1)
AND ([t0].[InOutDate] >= '2016-12-13')
AND ([t0].[InOutDate] <= '2016-12-14')
AND [t1].[EmployeeId] IS NULL
ORDER BY
[t0].[PayrollEmployeeId], [t0].[InOutDate]
You will also notice an informative message on the execution plan for your query.
It says that there is a missing clustered index with an estimated impact of 30% on the execution time.
It seems that the transaction data accumulates based on some date field, like the entry time.
Date fields, especially in your case, are strong candidates for clustered indexes. You can create an index on the Entry_Date column.
I guess you already have some index on InOutDate;
you can try indexing that field as well.
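A sketch of what such supporting indexes might look like (shown as ordinary nonclustered indexes; the names and column order are illustrative, not taken from your schema):
CREATE INDEX IX_TO_Entry_EntryDate
ON dbo.TO_Entry (Entry_Date, EmployeeId)
INCLUDE (CompanyId, Entry_Method);

CREATE INDEX IX_MachineLog_InOutDate
ON dbo.MachineLog (CompanyId, InOutDate)
INCLUDE (PayrollEmployeeId, InOutFlag, InOutTime);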

Postgres consistently favoring nested loop join over merge join

I have a complex query:
SELECT DISTINCT ON (delivery.id)
delivery.id, dl_processing.pid
FROM mailer.mailer_message_recipient_rel AS delivery
JOIN mailer.mailer_message AS message ON delivery.message_id = message.id
JOIN mailer.mailer_message_recipient_rel_log AS dl_processing ON dl_processing.rel_id = delivery.id AND dl_processing.status = 1000
-- LEFT JOIN mailer.mailer_recipient AS r ON delivery.email = r.email
JOIN mailer.mailer_mailing AS mailing ON message.mailing_id = mailing.id
WHERE
NOT EXISTS (SELECT dl_finished.id FROM mailer.mailer_message_recipient_rel_log AS dl_finished WHERE dl_finished.rel_id = delivery.id AND dl_finished.status <> 1000) AND
dl_processing.date <= NOW() - (36000 * INTERVAL '1 second') AND
NOT EXISTS (SELECT ml.id FROM mailer.mailer_message_log AS ml WHERE ml.message_id = message.id) AND
-- (r.times_bounced < 5 OR r.times_bounced IS NULL) AND
NOT EXISTS (SELECT ur.id FROM mailer.mailer_unsubscribed_recipient AS ur WHERE ur.email = delivery.email AND ur.list_id = mailing.list_id)
ORDER BY delivery.id, dl_processing.id DESC
LIMIT 1000;
It is running very slowly, and the reason seems to be that Postgres is consistently avoiding using merge joins in its query plan despite me having all the indices that I would need for this. It looks really depressing:
http://explain.depesz.com/s/tVY
http://i.stack.imgur.com/Myw4R.png
Why would this happen? How do I troubleshoot such an issue?
UPD: with @wildplasser's help I have reworked the query to fix performance (while changing its semantics somewhat):
SELECT delivery.id, dl_processing.pid
FROM mailer.mailer_message_recipient_rel AS delivery
JOIN mailer.mailer_message AS message ON delivery.message_id = message.id
JOIN mailer.mailer_message_recipient_rel_log AS dl_processing ON dl_processing.rel_id = delivery.id AND dl_processing.status in (1000, 2, 5) AND dl_processing.date <= NOW() - (36000 * INTERVAL '1 second')
LEFT JOIN mailer.mailer_recipient AS r ON delivery.email = r.email
WHERE
(r.times_bounced < 5 OR r.times_bounced IS NULL) AND
NOT EXISTS (SELECT dl_other.id FROM mailer.mailer_message_recipient_rel_log AS dl_other WHERE dl_other.rel_id = delivery.id AND dl_other.id > dl_processing.id) AND
NOT EXISTS (SELECT ml.id FROM mailer.mailer_message_log AS ml WHERE ml.message_id = message.id) AND
NOT EXISTS (SELECT ur.id FROM mailer.mailer_unsubscribed_recipient AS ur JOIN mailer.mailer_mailing AS mailing ON message.mailing_id = mailing.id WHERE ur.email = delivery.email AND ur.list_id = mailing.list_id)
ORDER BY delivery.id
LIMIT 1000
It now runs well, but the query plan still sports these horrible nested loop joins <_<:
http://explain.depesz.com/s/MTo3
I would still like to know why that is.
The reason is that Postgres is actually doing the right thing, and I suck at math. Suppose table A has N rows, and table B has M rows, and they are being joined via a column that they both have a B-tree index for. Then the following is true:
Nested loop join's time complexity is not O(MN), as I naively thought, but O(M log N) or O(N log M), depending on which table is scanned linearly. If both are scanned by index, we get O(M log M log N) or O(N log M log N), respectively. But since an index scan is only required if a specific order of the rows is needed for yet another join or for the ORDER BY clause, as we'll see it's not a bad deal at all.
Merge join's time complexity is O(M log M + N log N), which means that it loses to the nested loop join, provided that the asymptotic proportionality coefficients are the same, and AFAIK they should both be close to 1 in most implementations. Since both tables must be iterated over by the same index in the same direction, if a different order is required an additional sort is needed, which easily makes the complexity worse than in the case of the nested loop join.
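To put rough, made-up numbers on it: if the outer side produces M = 1,000 rows (say, after the LIMIT) and the inner table has N = 1,000,000 rows, an index nested loop costs on the order of M log N ≈ 1,000 × 20 = 20,000 index probes, while a merge join that has to sort both inputs pays roughly M log M + N log N ≈ 10,000 + 20,000,000 comparison steps before it can emit rows in the required order.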
So basically despite being associated with the merge sort, which we all love, merge join almost always sucks.
The reason my first query was so slow was that it had to perform a sort before applying the LIMIT, and it was also bad in many other ways. After applying @wildplasser's suggestions, I managed to reduce the number of (still expensive) nested loops and also allow the LIMIT to be applied without a sort, thus ensuring that Postgres most likely won't need to run the outer scan to completion, which is where the bulk of the performance gain comes from.