Execute Subquery refactoring first before any other SQL

I have a very complex view of the below form:
create or replace view loan_vw as
select * from (
  with loan_info as (
    select loan_table.*, commission_table.*
    from loan_table, commission_table
    where contract_id = commission_id
  )
  select /*complex transformations */ from loan_info
  where type <> 'PRINCIPAL'
  union all
  select /*complex transformations */ from loan_info
  where type = 'PRINCIPAL'
)
Now, if I run the below select, the query hangs:
select * from loan_vw where contract_id='HA001234TY56';
But if I hardcode the contract_id inside the subquery refactoring, or use a package-level variable in the same session, the query returns in a second:
create or replace view loan_vw as
select * from (
  with loan_info as (
    select loan_table.*, commission_table.*
    from loan_table, commission_table
    where contract_id = commission_id
    and contract_id = 'HA001234TY56'
  )
  select /*complex transformations */ from loan_info
  where type <> 'PRINCIPAL'
  union all
  select /*complex transformations */ from loan_info
  where type = 'PRINCIPAL'
)
Since I use Business Objects, I cannot use a package-level variable.
So my question is: is there a hint in Oracle to tell the optimizer to apply the contract_id filter on loan_vw first, inside the subquery refactoring?
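For illustration, something like the below is what I mean; this is only a sketch, since I don't know whether PUSH_PRED (or the undocumented INLINE hint on the CTE, which asks the optimizer not to materialize it) actually applies here:
select /*+ push_pred(v) */ *
from loan_vw v
where v.contract_id = 'HA001234TY56';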
As requested, the analytic function used is the below:
select value_date, item, credit_entry, item_paid
from (
select value_date, item, credit_entry, debit_entry,
greatest(0, least(credit_entry, nvl(sum(debit_entry) over (), 0)
- nvl(sum(credit_entry) over (order by value_date
rows between unbounded preceding and 1 preceding), 0))) as item_paid
from your_table
)
where item is not null;
After following the advice given by Boneist and MarcinJ, I removed the subquery refactoring (CTE) and wrote one long query like the one below, which improved the performance from 3 minutes to 0.156 seconds:
create or replace view loan_vw as
select /*complex transformations */
from loan_table, commission_table
where contract_id = commission_id
and loan_table.type <> 'PRINCIPAL'
union all
select /*complex transformations */
from loan_table, commission_table
where contract_id = commission_id
and loan_table.type = 'PRINCIPAL'

Are these transformations really so complex that you have to use UNION ALL? It's really hard to optimize something you can't see, but have you maybe tried getting rid of the CTE and implementing your calculations inline?
CREATE OR REPLACE VIEW loan_vw AS
SELECT loan.contract_id
, CASE commission.type -- or wherever this comes from
WHEN 'PRINCIPAL'
THEN SUM(whatever) OVER (PARTITION BY loan.contract_id, loan.type) -- total_whatever
ELSE SUM(something_else) OVER (PARTITION BY loan.contract_id, loan.type) -- total_something_else
END AS whatever_something
FROM loan_table loan
INNER
JOIN commission_table commission
ON loan.contract_id = commission.commission_id
Note that if your analytic functions don't have PARTITION BY contract_id, you won't be able to use an index on that contract_id column at all.
Take a look at this db fiddle (you'll have to click on ... on the last result table to expand the results). Here, the loan table has an indexed (PK) contract_id column, but also a some_other_id column that is also unique, but not indexed, and the predicate on the outer query is still on contract_id. If you compare the plans for partition by contract and partition by other id, you'll see that the index is not used at all in the partition by other id plan: there's a TABLE ACCESS with FULL options on the loan table, as compared to an INDEX - UNIQUE SCAN in partition by contract. That's obviously because the optimizer cannot resolve the relation between contract_id and some_other_id on its own, and so it needs to run SUM or AVG over the entire window instead of limiting window row counts through index usage.
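For reference, a minimal sketch of the fiddle's setup; the table and column names here (loan, amount) are assumed, not the OP's:
create table loan (
  contract_id   number primary key,  -- indexed via the PK
  some_other_id number not null,     -- unique in the data, but not indexed
  amount        number
);

-- predicate on the window's own partitioning column: it can be pushed into
-- the inline view, so the plan shows an INDEX UNIQUE SCAN on the PK
select *
from (select contract_id,
             sum(amount) over (partition by contract_id) as total_amount
      from loan)
where contract_id = 1;

-- window partitioned by the unindexed column: the outer predicate cannot be
-- pushed inside, so the plan shows TABLE ACCESS FULL on loan
select *
from (select contract_id, some_other_id,
             sum(amount) over (partition by some_other_id) as total_amount
      from loan)
where contract_id = 1;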
What you can also try - if you have a dimension table with those contracts - is to join it to your results and expose the contract_id from the dimension table instead of the most likely huge loan fact table. Sometimes this can lead to an improvement in cardinality estimates through the usage of a unique index on the dimension table.
Again, it's really hard to optimize a black box, without a query or even a plan, so we don't know what's going on. A CTE or a subquery can get materialized unnecessarily, for example.

Thanks for the update to include an example of the column list.
Given your updated query, I would suggest changing your view (or possibly creating a second view for querying single contract_ids, if your original view could be used to query for multiple contract_ids - unless, of course, the results of the original view only make sense for individual contract_ids!) to something like:
CREATE OR REPLACE VIEW loan_vw AS
WITH loan_info AS (SELECT l.*, c.* -- for future-proofing, you should list the column names explicitly; if this statement is rerun and there's a column with the same name in both tables, it'll fail.
FROM loan_table l
INNER JOIN commission_table c ON l.contract_id = c.commission_id -- you should always alias the join condition columns for ease of maintenance.
)
SELECT value_date,
item,
credit_entry,
debit_entry,
GREATEST(0,
LEAST(credit_entry,
NVL(SUM(debit_entry) OVER (PARTITION BY contract_id), 0)
- NVL(SUM(credit_entry) OVER (PARTITION BY contract_id ORDER BY value_date ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING), 0))) AS item_paid
FROM loan_info
WHERE TYPE <> 'PRINCIPAL'
UNION ALL
SELECT ...
FROM loan_info
WHERE TYPE = 'PRINCIPAL';
Note that I've converted your join into ANSI syntax, because it's easier to understand than the old-style joins (easier to separate join conditions from predicates, for a start!).

Related

Query Optimization with ROW_NUMBER

I have this query:
SELECT
PE1.PRODUCT_EQUIPMENT_KEY, -- primary key
PE1.Customer_Ban,
PE1.Subscriber_No,
PE1.Prod_Equip_Cd,
PE1.Prod_Equip_Txt,
PE1.Prod_Equip_Category_Txt--,
-- PE2.ep_rnk ------------------ UNCOMMENT THIS LINE
FROM
INT_ADM.Product_Equipment_Dim PE1
INNER JOIN
(
SELECT
PRODUCT_EQUIPMENT_KEY,
ROW_NUMBER() OVER (PARTITION BY Customer_Ban, Subscriber_No ORDER BY Start_Dt ASC) AS ep_rnk
FROM INT_ADM.Product_Equipment_Dim PE2
) PE2
ON PE2.PRODUCT_EQUIPMENT_KEY = PE1.PRODUCT_EQUIPMENT_KEY
WHERE
Line_Of_Business_Cd = 'M'
AND /*v_Date_Start*/ TO_DATE( '2022/01/12', 'yyyy/mm/dd' ) BETWEEN Start_Dt AND End_Dt
AND Current_Ind = 'Y'
If I run it as you see it, then it runs in under a second.
If I run it with the PE2.ep_rnk line uncommented, then the query takes up to 5 minutes to complete.
I know it's something to do with ROW_NUMBER(), but after looking all over online I can't find a good explanation or solution. Does anyone know why uncommenting that line makes the query so slow, and what I can do about it so it runs fast?
I'd much appreciate your help.
The root cause is that even if the predicate in the WHERE clause allows efficient access to the rows of the table (though I suspect your sub-second response is the time to get the first page of the result), the subquery has to access all rows of the table, window-sort them, and finally join them to the first row source.
So if you comment out ep_rnk, Oracle is smart enough not to evaluate the subquery at all, because the subquery is on the same table and the join is on the primary key, so no row can be lost or duplicated in the join.
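In other words, with ep_rnk commented out, the optimizer can treat your statement as if you had written only this (a sketch, with the column list taken from your query):
SELECT
    PE1.PRODUCT_EQUIPMENT_KEY,
    PE1.Customer_Ban,
    PE1.Subscriber_No,
    PE1.Prod_Equip_Cd,
    PE1.Prod_Equip_Txt,
    PE1.Prod_Equip_Category_Txt
FROM
    INT_ADM.Product_Equipment_Dim PE1
WHERE
    Line_Of_Business_Cd = 'M'
    AND TO_DATE( '2022/01/12', 'yyyy/mm/dd' ) BETWEEN Start_Dt AND End_Dt
    AND Current_Ind = 'Y'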
What can you improve?
Not much. If the WHERE condition filters the table very restrictively (and you end up with only a small number of PRODUCT_EQUIPMENT_KEYs), apply the same filter in the subquery:
(
SELECT
PRODUCT_EQUIPMENT_KEY,
ROW_NUMBER() OVER (PARTITION BY Customer_Ban, Subscriber_No ORDER BY Start_Dt ASC) AS ep_rnk
FROM INT_ADM.Product_Equipment_Dim PE2
-- filter added
WHERE PRODUCT_EQUIPMENT_KEY in (
SELECT PRODUCT_EQUIPMENT_KEY
FROM INT_ADM.Product_Equipment_Dim
WHERE ... same predicate as in the main query ...
)
) PE2
If the predicate returns all (or most) of the PRODUCT_EQUIPMENT_KEYs, the only (often used) way is to pre-calculate the rank, e.g. in a materialized view.
The materialized view is defined as follows:
SELECT
PE1.PRODUCT_EQUIPMENT_KEY, -- primary key
PE1.Customer_Ban,
PE1.Subscriber_No,
PE1.Prod_Equip_Cd,
PE1.Prod_Equip_Txt,
PE1.Prod_Equip_Category_Txt,
ROW_NUMBER() OVER (PARTITION BY Customer_Ban, Subscriber_No ORDER BY Start_Dt ASC) AS ep_rnk
FROM
INT_ADM.Product_Equipment_Dim PE1
and you simply query from it, without a join.
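For completeness, the full DDL might look like the below; the BUILD and REFRESH options are assumptions, so choose whatever fits your load schedule (a materialized view containing an analytic function generally supports only complete refresh):
CREATE MATERIALIZED VIEW Product_Equipment_Rnk_MV
  BUILD IMMEDIATE
  REFRESH COMPLETE ON DEMAND
AS
SELECT
    PE1.PRODUCT_EQUIPMENT_KEY, -- primary key
    PE1.Customer_Ban,
    PE1.Subscriber_No,
    PE1.Prod_Equip_Cd,
    PE1.Prod_Equip_Txt,
    PE1.Prod_Equip_Category_Txt,
    ROW_NUMBER() OVER (PARTITION BY Customer_Ban, Subscriber_No ORDER BY Start_Dt ASC) AS ep_rnk
FROM
    INT_ADM.Product_Equipment_Dim PE1;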

Generating a stable uuid ID across with clauses in GBQ [duplicate]

This question already has an answer here:
BigQuery GENERATE_UUID() and CTE's
(1 answer)
Closed 1 year ago.
I'm trying to generate a UUID within a with clause on GBQ and then use it in a subsequent with clause for a join.
with q1 as (
select generate_uuid() as uuid
), q2 as (
select uuid from q1
)
select * from q1
union all
select * from q2
This returns two distinct UUIDs.
How would I go about generating an ID that stays the same across with clauses?
I'll start with the reason behind this discrepancy.
TL;DR: as of today, there is no option to force BigQuery to materialize CTE results. It would be useful when a CTE is referenced more than once in a statement.
See the below query:
with cte_1 as (
select count(1) as row_count
from `bigquery-public-data.austin_311.311_service_requests` as sr
)
, cte_2 as (
select row_count
from cte_1
)
select row_count
from cte_1
union all
select row_count
from cte_2;
When the execution plan is examined, you'll see two Input stages for the referenced sr table.
It'd be great if we had an option to materialize CTE results. As I remember, Oracle has this implicitly if a CTE is used more than once, or explicitly via hints.
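For comparison, the Oracle hint I have in mind is, as far as I remember, the undocumented MATERIALIZE hint; a sketch only (table name assumed), and not available in BigQuery:
with cte_1 as (
  select /*+ materialize */ count(1) as row_count
  from some_big_table  -- assumed table name
)
select row_count from cte_1;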
Materializing q1 explicitly to a table and then using it twice might be a workaround. I'd prefer a temporary table.
The drawback is that the cost may increase if your project uses on-demand pricing (rather than flat-rate).
A subquery executes each time it is called, so with your query, generate_uuid will be called twice, once each for the q1 and q2 tables.
I suggest you save the generated UUID to a table, then query from that table to make sure the UUID is the same in both places.
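A minimal sketch of that approach as a BigQuery script, using a temp table (name assumed) so that generate_uuid() runs exactly once:
-- generate the UUID once, materialized into a temp table
create temp table ids as
select generate_uuid() as uuid;

-- both branches now read the same stored value
select uuid from ids
union all
select uuid from ids;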

Calculating SQL Server ROW_NUMBER() OVER() for a derived table

In some other databases (e.g. DB2, or Oracle with ROWNUM), I can omit the ORDER BY clause in a ranking function's OVER() clause. For instance:
ROW_NUMBER() OVER()
This is particularly useful when used with ordered derived tables, such as:
SELECT t.*, ROW_NUMBER() OVER()
FROM (
SELECT ...
ORDER BY
) t
How can this be emulated in SQL Server? I've found people using this trick, but that's wrong, as it will behave non-deterministically with respect to the order from the derived table:
-- This order here ---------------------vvvvvvvv
SELECT t.*, ROW_NUMBER() OVER(ORDER BY (SELECT 1))
FROM (
SELECT TOP 100 PERCENT ...
-- vvvvv ----redefines this order here
ORDER BY
) t
A concrete example (as can be seen on SQLFiddle):
SELECT v, ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) RN
FROM (
SELECT TOP 100 PERCENT 1 UNION ALL
SELECT TOP 100 PERCENT 2 UNION ALL
SELECT TOP 100 PERCENT 3 UNION ALL
SELECT TOP 100 PERCENT 4
-- This descending order is not maintained in the outer query
ORDER BY 1 DESC
) t(v)
Also, I cannot reuse any expression from the derived table to reproduce the ORDER BY clause in my case, as the derived table might not be available as it may be provided by some external logic.
So how can I do it? Can I do it at all?
The Row_Number() OVER (ORDER BY (SELECT 1)) trick should NOT be seen as a way to avoid changing the order of underlying data. It is only a means to avoid causing the server to perform an additional and unneeded sort (it may still perform the sort but it's going to cost the minimum amount possible when compared to sorting by a column).
All queries in SQL server ABSOLUTELY MUST have an ORDER BY clause in the outermost query for the results to be reliably ordered in a guaranteed way.
The concept of "retaining original order" does not exist in relational databases. Tables and queries must always be considered unordered until and unless an ORDER BY clause is specified in the outermost query.
You could try the same unordered query 100,000 times and always receive it with the same ordering, and thus come to believe you can rely on said ordering. But that would be a mistake, because one day, something will change and it will not have the order you expect. One example is when a database is upgraded to a new version of SQL Server--this has caused many a query to change its ordering. But it doesn't have to be that big a change. Something as little as adding or removing an index can cause differences. And more: Installing a service pack. Partitioning a table. Creating an indexed view that includes the table in question. Reaching some tipping point where a scan is chosen instead of a seek. And so on.
Do not rely on results to be ordered unless you have said "Server, ORDER BY".
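Applied to the concrete example above, the fix is to compute the ordering inside and repeat it in the outermost query; a sketch, which admittedly assumes the ordering expression is available (the asker's constraint notwithstanding):
SELECT v, ROW_NUMBER() OVER (ORDER BY v DESC) AS rn
FROM (
  SELECT 1 UNION ALL
  SELECT 2 UNION ALL
  SELECT 3 UNION ALL
  SELECT 4
) t(v)
ORDER BY rn;  -- ordering guaranteed only because the outermost query says so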

Relational division with events in a certain timeframe

I have my table (CTE) definitions and result set here.
The CTE may look strange, but it has been tested and returns the correct results in the most efficient manner I've found yet. The below query will find the number of person IDs (patid) who are taking two or more drugs at the same time. Currently, the query works insofar as it returns the patIDs of the people taking both drugs, but not both drugs at the same time. Taking both drugs at the same time is indicated by a fillDate of one drug falling before a scriptEndDate of the other drug.
You can see in this partial result set that on line 18 the scriptFillDate is 2009-07-19, which is between the fillDate and scriptEndDate of the same patID from row 2. What constraint do I need to add so I can filter out these unneeded results?
--PatientDrugList is a CTE because eventually parameters might be passed to it
--to alter the selection population
;with PatientDrugList(patid, filldate, scriptEndDate,drugName,strength)
as
(
select rx.patid,rx.fillDate,rx.scriptEndDate,rx.drugName,rx.strength
from rx
),
--the row constructor here will eventually be parameters for a stored procedure
DrugList (drugName)
as
(
select x.drugName
from (values ('concerta'),('fentanyl'))
as x(drugName)
where x.drugName is not null
)
--the row number here is so that I can find the largest date range
--(the largest datediff means the person was on a given drug for a larger
--amount of time). obviously not an optimal solution
--celko inspired relational division!
select distinct row_number() over(partition by pd.patid, drugname order by datediff(day,pd.fillDate,pd.scriptEndDate)desc) as rn
,pd.patid
,pd.drugname
,pd.fillDate
,pd.scriptEndDate
from PatientDrugList as pd
where not exists
(select * from DrugList
where not exists
(select * from PatientDrugList as pd2
where(pd.patid=pd2.patid)
and (pd2.drugName = DrugList.drugName)))
and exists
(select *
from DrugList
where DrugList.drugName=pd.drugName
)
group by pd.patid, pd.drugName,pd.filldate,pd.scriptEndDate
Wrap your original query in a CTE or, better yet (for performance and stability of the query plan and results), store it into a temp table.
The query below (assuming CTE option) will give you the overlapping times when both drugs are being taken.
;with tmp as (
.. your query producing the columns shown ..
)
select *
from tmp a
join tmp b on a.patid = b.patid and a.drugname <> b.drugname
where a.filldate < b.scriptenddate
and b.filldate < a.scriptenddate;
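And the temp table variant, if you go that route instead (SQL Server syntax, with the same placeholder as above):
-- materialize the original query once
select patid, drugname, filldate, scriptenddate
into #tmp
from ( .. your query producing the columns shown .. ) q;

-- standard interval-overlap test: each interval starts before the other ends
select *
from #tmp a
join #tmp b on a.patid = b.patid and a.drugname <> b.drugname
where a.filldate < b.scriptenddate
  and b.filldate < a.scriptenddate;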

Transform arbitrary SQL SELECT TOP(x) to a SELECT COUNT(*)?

I want to be able to take any arbitrary SELECT TOP(X) query that would normally return a large number of rows (without the X limit) and transform it into a query that counts how many rows would have been returned without the TOP(X) (i.e. SELECT COUNT(*)). Remember, I am asking about an arbitrary query with any number of joins, where clauses, group bys, etc.
Is there a way to do this?
edited to show syntax with Shannon's solution:
i.e.
SELECT TOP(X) [colnames] FROM [tables with joins]
WHERE [constraints] GROUP BY [cols] ORDER BY [cols]
becomes
SELECT COUNT(*) FROM
(SELECT [colnames] FROM [tables with joins]
WHERE [constraints] GROUP BY [cols]) t
Inline view:
select count(*)
from (...slightly transformed query...) t
... slightly transformed query ... is:
If the select clause contains any columns without names, such as select ... avg(x) ..., then do one of: 1) alias the column, such as avg(x) as AvgX; 2) remove the column, but make sure at least one column is left; or 3) my favorite: just make the select clause select 1 as C.
Remove TOP from select clause.
Remove order by clause.
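For instance, taking option 3) for the select clause, the earlier template becomes:
SELECT COUNT(*) FROM
(SELECT 1 AS C FROM [tables with joins]
WHERE [constraints] GROUP BY [cols]) t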
EDIT 1: Fixed by adding aliases for the inline view and dealing with unnamed columns in the select clause.
EDIT 2: But what about the performance? Doesn't this require the DB to run the big query that I wanted to avoid in the first place with TOP(X)?
Not necessarily. It may be the case for some queries that this count will do more work than the TOP(X) would. And it may be the case that for a particular query, you could make the equivalent count faster by making additional changes to remove work that is not needed for the final count. But those simplifications cannot be included in a general method to take any arbitrary SELECT TOP(X) query that would normally return a large number of rows (without the X limit) and transform it into a query that counts how many rows would have been returned without the TOP(X).
And in some cases, the query optimizer may optimize things away so that the DB does not run the big query.
For example, test table & data, using SQL Server 2005:
create table t (PK int identity(1, 1) primary key,
u int not null unique,
string VARCHAR(2000))
insert into t (u, string)
select top 100000 row_number() over (order by s1.id) , replace(space(2000), ' ', 'x')
from sysobjects s1,
sysobjects s2,
sysobjects s3,
sysobjects s4,
sysobjects s5,
sysobjects s6,
sysobjects s7
The non-clustered index on column u will be much smaller than the clustered index on column PK.
Then set up SSMS to show the actual execution plan for:
select PK, U, String from t
select count(*) from t
The first select does a clustered index scan, because it needs to return data out of the leaves. The second query does an index scan on the smaller non-clustered index created for the unique constraint on U.
Applying the transform of the first query we get:
select count(*)
from (select PK, U, String from t) t
Running that and looking at the plan, the index on U is used again, with the exact same plan as select count(*) from t. The leaves are not visited to find the values of String for every row.