Can't find a way to improve my PostgreSQL query

In my PostgreSQL database I have 6 tables named storeAPrices, storeBprices, etc., all holding the same columns and indexes, as follows:
item_code (string, primary key)
item_name (string, btree index)
is_whigthed (number: 0|1, btree index)
item_price (number)
I want to join each store's prices table to the others by item_code or by item_name similarity, where OR should short-circuit as in a programming language (evaluate the right side only if the left side is false).
Currently, my query performs poorly.
SELECT *
FROM "storeAprices" sap
LEFT JOIN LATERAL (
    SELECT * FROM "storeBPrices" sbp
    WHERE similarity(sap.item_name, sbp.item_name) >= 0.45
    ORDER BY similarity(sap.item_name, sbp.item_name) DESC
    LIMIT 1
) bp ON CASE WHEN sap.item_code = bp.item_code THEN true ELSE sap.item_name % bp.item_name END
LEFT JOIN LATERAL (
    SELECT * FROM "storeCPrices" scp
    WHERE similarity(sap.item_name, scp.item_name) >= 0.45
    ORDER BY similarity(sap.item_name, scp.item_name) DESC
    LIMIT 1
) rp ON CASE WHEN sap.item_code = rp.item_code THEN true ELSE sap.item_name % rp.item_name END
This is part of my query, and it takes too long to respond. My data is not that large (15k items per table).
I also have the "is_whigthed" index, which I'm not sure how to use. (I don't want to filter on it as a parameter, because I want results for all "is_whigthed" values.)
Any suggestions?

A plain OR should be faster than using CASE:
bp ON sap.item_code = bp.item_code OR sap.item_name % bp.item_name
You can also create a trigram index on the item_name columns, as described in the pg_trgm module docs, since you are using its % operator for similarity.
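A minimal sketch of what that would look like (the index name is illustrative, and the same statement would be repeated for each of the storeXPrices tables being joined):
-- enable the extension once per database
CREATE EXTENSION IF NOT EXISTS pg_trgm;
-- a GIN trigram index supports the % operator used in the join conditions
CREATE INDEX storebprices_item_name_trgm_idx
    ON "storeBPrices" USING gin (item_name gin_trgm_ops);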

Related

Determining what index to create given a query?

Given a SQL query:
SELECT *
FROM Database..Pizza pizza
JOIN Database..Toppings toppings ON pizza.ToppingId = toppings.Id
WHERE toppings.Name LIKE '%Mushroom%' AND
toppings.GlutenFree = 0 AND
toppings.ExtraFee = 1.25 AND
pizza.Location = 'Minneapolis, MN'
How do you determine what index to create to improve the performance of the query? (Assume every value to the right of the equals sign is calculated at runtime.)
Is there a built-in SQL command to suggest the proper index?
To me, it gets confusing when there are multiple JOINs that use fields from both tables.
For this query:
SELECT *
FROM Database..Pizza p JOIN
Database..Toppings t
ON p.ToppingId = t.Id
WHERE t.Name LIKE '%Mushroom%' AND
t.GlutenFree = 0 AND
t.ExtraFee = 1.25 AND
p.Location = 'Minneapolis, MN';
You basically have two options for indexes:
Pizza(location, ToppingId) and Toppings(id)
or:
Toppings(GlutenFree, ExtraFee, Name, id) and Pizza(ToppingId, location)
Which works better depends on how selective the different conditions are in the WHERE clause.
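Expressed as DDL, the two options would be something like this (index names are illustrative):
-- Option 1: filter pizzas by location first, then join to toppings
CREATE INDEX IX_Pizza_Location_ToppingId ON Pizza (Location, ToppingId);
CREATE INDEX IX_Toppings_Id ON Toppings (Id);
-- Option 2: filter toppings first, then join back to pizzas
CREATE INDEX IX_Toppings_Filter ON Toppings (GlutenFree, ExtraFee, Name, Id);
CREATE INDEX IX_Pizza_ToppingId_Location ON Pizza (ToppingId, Location);
Note that LIKE '%Mushroom%' has a leading wildcard, so no btree index can seek on Name; including Name in the index only lets that filter run against the index rather than the table.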

Filter by clustering fields using a sub-select query

With Google Bigquery, I am querying a clustered table by applying a filter on the clustering field projectId, like so:
WITH userProjects AS (
  SELECT projectsArray
  FROM projectsPerUser
  WHERE userId = "eben#somewhere.com"
)
SELECT userProperty
FROM `mydata.mydataset.mytable`
WHERE
  --projectId IN UNNEST((SELECT projectsArray FROM userProjects))
  projectId IN ("mydata", "anotherproject")
  AND _PARTITIONTIME >= "2019-03-20"
Clustering is applied correctly in the code snippet above, but when I use the commented-out line --projectId IN UNNEST((SELECT projectsArray FROM userProjects)), clustering doesn't apply.
I've tried wrapping it in a UDF like this as well, which also doesn't work:
CREATE TEMP FUNCTION storedValue(item ARRAY<STRING>) AS (
item
);
...
WHERE projectId IN UNNEST(storedValue((SELECT projectsListArray FROM projectsList)))
As I understand from this, the execution path for sub-select queries differs from filtering directly on a scalar or an array. I expect a solution to exist where I can programmatically supply an array to filter on while still getting the cost benefit that a clustered table provides.
In summary:
WHERE projectId IN ("mydata", "anotherproject") [OK]
WHERE projectId IN UNNEST((SELECT projectsArray FROM userProjects)) [Not OK]
WHERE projectId IN UNNEST(storedValue((SELECT projectsListArray FROM projectsList))) [Not OK]
Any ideas?
My suggestion is to keep your nested SELECT as a temporary table (which you've already done) and then perform the filtering you require with an INNER JOIN rather than a set-membership test, so your query would become something like this:
WITH userProjects AS (
  SELECT projectsArray
  FROM projectsPerUser
  WHERE userId = "eben#somewhere.com"
)
SELECT userProperty
FROM `mydata.mydataset.mytable` AS a
JOIN userProjects AS b
  ON a.projectId = b.projectsArray
WHERE _PARTITIONTIME >= "2019-03-20"
I believe this will result in a query which does not scan the full partition if that field is clustered.
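One caveat, based on my assumption from the column name: if projectsArray is genuinely an ARRAY&lt;STRING&gt; column, an equality join won't match its elements, and the array would first need to be flattened with UNNEST, along these lines:
WITH userProjects AS (
  SELECT projectsArray
  FROM projectsPerUser
  WHERE userId = "eben#somewhere.com"
)
SELECT userProperty
FROM `mydata.mydataset.mytable` AS a
JOIN (SELECT p FROM userProjects, UNNEST(projectsArray) AS p) AS b
  ON a.projectId = b.p
WHERE _PARTITIONTIME >= "2019-03-20"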
FWIW, clustering works well for me with dynamic filters:
SELECT title, SUM(views) views
FROM `fh-bigquery.wikipedia_v3.pageviews_2019`
WHERE DATE(TIMESTAMP_TRUNC(datehour, DAY)) = '2019-01-01'
AND wiki='en'
AND title IN ('Dogfight_(disambiguation)','Dogfight','Dogfight_(film)')
GROUP BY 1
1.8 sec elapsed, 364 MB processed
if instead I do
AND title IN (
SELECT DISTINCT prev
FROM `fh-bigquery.wikipedia_vt.clickstream_materialized`
WHERE date='2019-01-01' AND prev LIKE 'Dogfight%'
ORDER BY 1 LIMIT 3)
2.9 sec elapsed, 513.8 MB processed
If I go to v2 (not clustered), instead of v3:
FROM `fh-bigquery.wikipedia_v2.pageviews_2019`
2.6 sec elapsed, 9.6 GB processed
I'm not sure what's happening in your tables - but it might be interesting to revisit.

SQL Filtering duplicate rows due to bad ETL

The database is Postgres but any SQL logic should help.
I am retrieving the set of sales quotations that contain a given product within the bill of materials. I'm doing that in two steps: step 1, retrieve all DISTINCT quote numbers which contain a given product (by product number).
The second step: retrieve the full quote, with all products listed, for each unique quote number.
So far, so good. Now the tough bit. Some rows are duplicates, some are not. Those that are duplicates (quote number & quote version & line number) might or might not have maintenance on them. I want to pick the row that has maintenance greater than 0. The duplicate rows I want to exclude are those that have 0 maintenance. The problem is that some rows which have no duplicates also have 0 maintenance, so I can't just filter on maintenance.
To make this exciting, the database holds quotes over 20+ years. And the data science guys have just admitted that the ETL process may have some bugs...
--- step 0
--- cleanup the workspace
SET CLIENT_ENCODING TO 'UTF8';
DROP TABLE IF EXISTS product_quotes;
--- step 1
--- get list of Product Quotes
CREATE TEMPORARY TABLE product_quotes AS (
    SELECT DISTINCT master_quote_number
    FROM w_quote_line_d
    WHERE item_number IN ( << model numbers >> )
);
--- step 2
--- Now join on that list
SELECT
d.quote_line_number,
d.item_number,
d.item_description,
d.item_quantity,
d.unit_of_measure,
f.ref_list_price_amount,
f.quote_amount_entered,
f.negtd_discount,
--- need to calculate discount rate based on list price and negtd discount (%)
CASE
    WHEN ref_list_price_amount > 0
        THEN 100 - (ref_list_price_amount + negtd_discount) / ref_list_price_amount * 100
    ELSE 0
END AS discount_percent,
f.warranty_months,
f.master_quote_number,
f.quote_version_number,
f.maintenance_months,
f.territory_wid,
f.district_wid,
f.sales_rep_wid,
f.sales_organization_wid,
f.install_at_customer_wid,
f.ship_to_customer_wid,
f.bill_to_customer_wid,
f.sold_to_customer_wid,
d.net_value,
d.deal_score,
f.transaction_date,
f.reporting_date
FROM w_quote_line_d d
INNER JOIN product_quotes pq ON (pq.master_quote_number = d.master_quote_number)
INNER JOIN w_quote_f f ON
(f.quote_line_number = d.quote_line_number
AND f.master_quote_number = d.master_quote_number
AND f.quote_version_number = d.quote_version_number)
WHERE d.net_value >= 0 AND item_quantity > 0
ORDER BY f.master_quote_number, f.quote_version_number, d.quote_line_number
The logic to filter the duplicate rows is like this:
For each master_quote_number / version_number pair, check to see if there are duplicate line numbers. If so, pick the one with maintenance > 0.
Even in a CASE statement, I'm not sure how to write that.
Thoughts?
I think you will want to use Window Functions. They are, in a word, awesome.
Here is a query that would "dedupe" based on your criteria:
select *
from (
    select
        d.* -- simplifying here to show the important parts
        ,f.maintenance_months
        ,row_number() over (
            partition by d.master_quote_number, f.quote_version_number, d.quote_line_number
            order by f.maintenance_months desc) as seqnum
    from w_quote_line_d d
    inner join product_quotes pq
        on (pq.master_quote_number = d.master_quote_number)
    inner join w_quote_f f
        on (f.quote_line_number = d.quote_line_number
            and f.master_quote_number = d.master_quote_number
            and f.quote_version_number = d.quote_version_number)
) x
where seqnum = 1
The use of row_number() with those partition by and order by criteria guarantees that only ONE row for each combination of quote number / version number / line number gets the value 1, and it will be the one with the highest maintenance value (if your colleagues are right, there would only be one with a value > 0 anyway).
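Since the database is Postgres, DISTINCT ON would express the same pick-one-row-per-group rule more compactly (a sketch over the same joins as above):
select distinct on (d.master_quote_number, f.quote_version_number, d.quote_line_number)
    d.*, f.maintenance_months
from w_quote_line_d d
inner join product_quotes pq
    on pq.master_quote_number = d.master_quote_number
inner join w_quote_f f
    on f.quote_line_number = d.quote_line_number
    and f.master_quote_number = d.master_quote_number
    and f.quote_version_number = d.quote_version_number
order by d.master_quote_number, f.quote_version_number, d.quote_line_number,
    f.maintenance_months desc
Postgres keeps the first row of each group as determined by the trailing order by column, here the highest maintenance_months.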
Can you do something like...
select
*
from
w_quote_line_d d
inner join
(
select
...
,max(maintenance)
from
w_quote_line_d
group by
...
) d1
on
d1.id = d.id
and d1.maintenance = d.maintenance;
Am I understanding your problem correctly?
Edit: Forgot the group by!
I'm not sure, but maybe you could Group By all other columns and use MAX(Maintenance) to get only the greatest.
What do you think?
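Concretely, that idea might look like this (a sketch; I'm assuming maintenance_months is the column to maximize and that quote number / version / line identify a row):
select d.master_quote_number,
       f.quote_version_number,
       d.quote_line_number,
       max(f.maintenance_months) as max_maintenance
from w_quote_line_d d
inner join w_quote_f f
    on f.master_quote_number = d.master_quote_number
    and f.quote_version_number = d.quote_version_number
    and f.quote_line_number = d.quote_line_number
group by d.master_quote_number, f.quote_version_number, d.quote_line_number
Joining this back to the detail rows on all four columns (the three keys plus the maintenance value) keeps only the max-maintenance row per line, as in the skeleton above.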

SQL - How to eliminate duplicates from the below query in POSTGRES

I have been working on the query below. There are two tables, Realtime_Input and Realtime_Output. I join the two tables, take the necessary columns, and made the result a view; when I query against the view I get duplicates.
What am I doing wrong? When I tested using the DISTINCT keyword I got 60 unique rows, but intermittently I get duplicates. My db is on Cloud Foundry (Postgres). Is it because of that? Please help!
select i2.key_ts_long,
case
when i2.revenue_activepower = 'NA'
then (-1 * CAST(io.min5_forecast as real))
else (CAST(i2.revenue_activepower AS real) - CAST(io.min5_forecast as real))
end as diff
from realtime_analytic_input i2,
(select i.farm_id,
i.key_ts_long,
o.min5_forecast,
o.min5_timestamp_seconds
from realtime_analytic_input i,
realtime_analytic_output o
where i.farm_id = o.farm_id
and i.key_ts_long = o.key_ts_long
and o.farm_id = 'MW1'
) io
where i2.key_ts_long = CAST(io.min5_timestamp_seconds AS bigint)
and i2.farm_id = io.farm_id
and i2.farm_id = 'MW1'
and io.key_ts_long between 1464738953169 and 1466457841
order by io.key_ts_long desc

Joining multiple tables in Access and limiting to Top 1 result

I have three tables that need to be joined in order to get monthly inventory data in return.
Table 1: TargetInventory
Table 2: TargetValue
Table 3: TargetWeight
[TargetInventory] does not change after being added the first time.
[TargetValue] is just a small table that includes prices of various types of metal.
[TargetWeight] is updated monthly as part of our inventory process. We INSERT new data, we never UPDATE old data.
Below is the relationship between these tables. (Sorry, I don't have the reputation points to post an image. Brand new here, so hopefully this makes sense.)
(* = UniqueKey)
--TargetValue--      --TargetInventory--      --TargetWeight--
*MaterialID <===|    *TargetID <=====|        *ID
Material        |===> MaterialID     |===>    TargetID
PricePerOunce        Length                   RecordDate
Density              Width                    Weight
                     Thickness
                     DateInInventory
The TargetWeight table contains multiple records for TargetID (since a new one is added every month at inventory). That's good for me to track historical usage, but for the current inventory value, I only need the most recent TargetWeight.Weight to be returned.
I don't know how to do a CROSS APPLY from within another INNER JOIN, so I'm at a loss for how to do this (without switching to MySQL and just doing a LIMIT 1...).
I think it needs to look something like what's below, but I'm not sure how to finish the query.
SELECT
TargetInventory.TargetID AS TargetInventory_TargetID,
TargetInventory.MaterialID AS TargetInventory_MaterialID,
TargetInventory.Length,
TargetInventory.Width,
TargetInventory.Thickness,
TargetValue.MaterialID AS TargetValue_MaterialID,
TargetValue.PricePerOunce,
TargetValue.Density,
TargetWeight.ID,
TargetWeight.TargetID AS TargetWeight_TargetID,
TargetWeight.RecordDate,
TargetWeight.Weight
FROM
(TargetValue
INNER JOIN TargetInventory
ON TargetValue.[MaterialID] = TargetInventory.[MaterialID]
)
CROSS APPLY (
SELECT TOP 1 *
FROM .....
)
The following query works for me in Access 2010. It uses an INNER JOIN on a subquery to take the place of the CROSS APPLY (which Access SQL doesn't support). It assumes that there will be no more than one [TargetWeight] record for a given (TargetID, RecordDate):
SELECT
TargetInventory.TargetID AS TargetInventory_TargetID,
TargetInventory.MaterialID AS TargetInventory_MaterialID,
TargetInventory.Length,
TargetInventory.Width,
TargetInventory.Thickness,
TargetValue.MaterialID AS TargetValue_MaterialID,
TargetValue.PricePerOunce,
TargetValue.Density,
LatestWeight.ID,
LatestWeight.TargetID AS TargetWeight_TargetID,
LatestWeight.RecordDate,
LatestWeight.Weight
FROM
(
TargetValue
INNER JOIN
TargetInventory
ON TargetValue.[MaterialID] = TargetInventory.[MaterialID]
)
INNER JOIN
(
SELECT tw.*
FROM
TargetWeight AS tw
INNER JOIN
(
SELECT TargetID, MAX(RecordDate) AS LatestDate
FROM TargetWeight
GROUP BY TargetID
) AS latest
ON latest.TargetID=tw.TargetID
AND latest.LatestDate=tw.RecordDate
) AS LatestWeight
ON LatestWeight.TargetID = TargetInventory.TargetID
Alternative approach specifically for Access 2010 or later
If the above query bogs down with a large number of rows in [TargetWeight], then another possible solution for Access 2010+ would be to add a Yes/No field named [Current] to the [TargetWeight] table and use an After Insert data macro to ensure that only the latest record for each [TargetID] is flagged as [Current].
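If you prefer plain SQL over a data macro, the same bookkeeping could be done with two update queries run after each monthly insert (a sketch; DMax is Access's domain-aggregate function, and each statement must be run separately):
UPDATE TargetWeight SET [Current] = False;

UPDATE TargetWeight SET [Current] = True
WHERE RecordDate = DMax("RecordDate", "TargetWeight", "TargetID = " & [TargetID]);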
Once that is done, the query would simply be:
SELECT
TargetInventory.TargetID AS TargetInventory_TargetID,
TargetInventory.MaterialID AS TargetInventory_MaterialID,
TargetInventory.Length,
TargetInventory.Width,
TargetInventory.Thickness,
TargetValue.MaterialID AS TargetValue_MaterialID,
TargetValue.PricePerOunce,
TargetValue.Density,
TargetWeight.ID,
TargetWeight.TargetID AS TargetWeight_TargetID,
TargetWeight.RecordDate,
TargetWeight.Weight
FROM
(
TargetValue
INNER JOIN
TargetInventory
ON TargetValue.[MaterialID] = TargetInventory.[MaterialID]
)
INNER JOIN
TargetWeight
ON TargetInventory.TargetID = TargetWeight.TargetID
WHERE TargetWeight.Current = True;
To maximize performance, the [TargetWeight].[TargetID] and [TargetWeight].[Current] fields should be indexed.
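For instance (index names are illustrative; Access runs one DDL statement at a time):
CREATE INDEX idxTargetWeightTargetID ON TargetWeight (TargetID);
CREATE INDEX idxTargetWeightCurrent ON TargetWeight ([Current]);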
If your backend supports CROSS APPLY (SQL Server does; Access SQL does not), the original idea can be written directly:
SELECT TargetInventory.TargetID AS TargetInventory_TargetID,
TargetInventory.MaterialID AS TargetInventory_MaterialID,
TargetInventory.Length,
TargetInventory.Width,
TargetInventory.Thickness,
TargetValue.MaterialID AS TargetValue_MaterialID,
TargetValue.PricePerOunce,
TargetValue.Density, Weight.ID,
Weight.TargetID AS TargetWeight_TargetID,
Weight.RecordDate,
Weight.Weight
FROM TargetInventory
INNER JOIN TargetValue ON TargetValue.[MaterialID] = TargetInventory.[MaterialID]
CROSS APPLY (
SELECT TOP 1 *
FROM TargetWeight
WHERE TargetID = TargetInventory.TargetID
ORDER BY RecordDate DESC
) AS Weight