SQL Server does not take most restrictive condition for execution plan

We have a query with multiple joins where SQL Server 2016 does not take the optimal path, and we cannot convince it to without hints (which we prefer not to use).
Simplified, the problem is as follows:
Table A (12 million rows)
Table B (type table, 5 rows)
Table C (12 million rows)
query (simplified to clarify)
SELECT
[A].[ID]
,[A].[DATE_CREATED]
,[A].[DATE_LAST_MODIFIED]
,[A].[CODE]
,[B].[CODE]
,[B].[DESCRIPTION]
,[C].[EVENT_ID]
,[C].[SOURCE_REFERENCE]
,[C].[EVTY_ID]
,[C].[BUSINESS_KEY]
,[C].[DATA]
,[C].[EVENT_DATE]
FROM A
JOIN B ON [B].[ID] = [A].[PSTY_ID] AND [B].[ACTIVE] = 1
JOIN C ON [C].[ID] = [B].[EVEN_ID] AND [C].[ACTIVE] = 1
WHERE [B].[CODE] = 'nopr' OR [B].[CODE] = 'inpr'
The selected codes from B correspond to ID values 1 and 2.
Table A contains at most 10 rows with PSTY_ID 1 or 2; the rest have 3, 4 or 5.
There is a foreign key from A.PSTY_ID to B.ID.
There is a filtered index on table A for PSTY_ID values 1 and 2, with all selected columns as included columns.
The optimizer does not seem to recognize that we are selecting values 1 and 2: it neither uses the index nor starts with table B. Forcing the issue with subqueries or by changing the table order does not help. Only the hint OPTION (FORCE ORDER) convinces the optimizer, but we do not want to use it.
Only when we hard-code the B.ID or A.PSTY_ID values 1 and 2 in the WHERE clause does the optimizer take the correct path, starting with table B.
If we do not do this, it first joins table A with table C, and only then with table B, leading to vastly more processing time (approx. 50x).
We also tried declaring the values and using them as variables, but still no luck.
Would anyone know if this is a known issue, or if it can be worked around?

Your filtered index will not be used in this case unless you include the values 1 and 2 in the WHERE clause. You cannot change this, even by joining with a table that has ONLY 1 and 2 in its rows.
A filtered index will never be used based on "assumptions" about what values some table (physical, or derived like a CTE or subquery) contains, which is why your subquery did not help.
So if you want the index to be used, add a WHERE condition equivalent to the filter of the filtered index to your query.
Since you don't want to add this condition but still want the join order to start with table B, you can use a temporary table or table variable like this:
select [ID]
,[CODE]
,[DESCRIPTION]
,[EVEN_ID]
into #tmp
from B
where ([CODE] = 'nopr' OR [CODE] = 'inpr') and [ACTIVE] = 1
And now use this #tmp instead of B in your query.
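As a rough sketch of what the rewritten query looks like, the snippet below uses Python's built-in sqlite3 module as a stand-in for SQL Server. That is an assumption made purely so the example can be run: SQLite's optimizer and temp-table syntax differ from T-SQL, the handful of rows is invented, and table C is left out for brevity. It shows only the shape of the query once the temp table replaces B, not the resulting execution plan.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical miniature versions of tables B (types) and A (facts).
    CREATE TABLE B (ID INTEGER PRIMARY KEY, CODE TEXT, DESCRIPTION TEXT, ACTIVE INTEGER);
    CREATE TABLE A (ID INTEGER PRIMARY KEY, CODE TEXT, PSTY_ID INTEGER REFERENCES B(ID));
    INSERT INTO B VALUES (1, 'nopr', 'not processed', 1),
                         (2, 'inpr', 'in progress', 1),
                         (3, 'done', 'done', 1);
    INSERT INTO A VALUES (10, 'a10', 1), (11, 'a11', 2), (12, 'a12', 3), (13, 'a13', 3);

    -- SQLite equivalent of SELECT ... INTO #tmp: materialize the filtered type rows first.
    CREATE TEMP TABLE tmp AS
        SELECT ID, CODE, DESCRIPTION
        FROM B
        WHERE CODE IN ('nopr', 'inpr') AND ACTIVE = 1;
""")

# The main query now joins against the small temp table instead of B.
temp_join_rows = conn.execute("""
    SELECT A.ID, A.CODE, tmp.CODE, tmp.DESCRIPTION
    FROM tmp
    JOIN A ON A.PSTY_ID = tmp.ID
    ORDER BY A.ID
""").fetchall()

print(temp_join_rows)
```

Only the rows of A whose PSTY_ID matches one of the two materialized types come back; in SQL Server the temp table additionally gets its own statistics, which is what gives the optimizer a better starting point.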

Related

T-SQL Match records 1 to 1 without join condition

I have a group of entities which need to have another record associated with them from another table.
When I try to output an Id for the table to be matched on it doesn't work because you can only output from inserted, updated etc.
DECLARE @SignatureGlobalIdsTbl table (ID int,
CompanyBankAccountId int);
INSERT INTO GlobalIds (TypeId)
-- I cannot output cba.Id into the table since it's not from inserted
OUTPUT Inserted.Id,
cba.Id
INTO @SignatureGlobalIdsTbl (ID,
CompanyBankAccountId)
SELECT (@DocumentsGlobalTypeKey)
FROM CompanyBankAccounts cba
INNER JOIN Companies c ON c.CompanyId = cba.CompanyId
WHERE SignatureDocumentId IS NULL
AND (SignatureFile IS NOT NULL
AND SignatureFile != '');
INSERT INTO Documents (DocumentPath,
DocumentType,
DocumentIsExternal,
OwnerGlobalId,
OwnerGlobalTypeID,
DocumentName,
Extension,
GlobalId)
SELECT SignatureFile,
@SignatureDocumentTypeKey,
1,
CompanyGlobalId,
@OwnerGlobalTypeKey,
[dbo].[fnGetFileNameWithoutExtension](SignatureFile),
[dbo].[fnGetFileExtension](SignatureFile),
documentGlobalId
FROM (SELECT c.GlobalId AS CompanyGlobalId,
cba.*,
s.ID AS documentGlobalId
FROM CompanyBankAccounts cba
INNER JOIN Companies c ON c.CompanyId = cba.CompanyId
CROSS JOIN @SignatureGlobalIdsTbl s) info
WHERE SignatureDocumentId IS NULL
AND (SignatureFile IS NOT NULL
AND SignatureFile != '');
I tried to use a cross join, hoping to avoid a Cartesian product, but that did not work. I also tried to output the row number over some value, but I could not get that stored in the table either.
If I have two separate queries which return the same number of records, how can I pair the records together without creating a Cartesian product?
'When I try to output an Id for the table ... it doesn't work.'
This seems to be because one of the columns you want to OUTPUT is not actually part of the insert. It's an annoying problem and I wish SQL Server would allow us to do it.
Someone may have a much better answer for this than I do, but the way I usually approach this is
1. Create a temporary table/etc of the data I want to insert, with a column for ID (starts blank)
2. Do an insert of the correct amount of rows, and get the IDs out into another temporary table
3. Assign the IDs as appropriate within the original temporary table
4. Go back and update the inserted rows with any additional data needed (though that's probably not needed here given you're just inserting a constant)
What this does is to flag/get the IDs ready for you to use, then you allocate them to your data as needed, then fill in the table with the data. It's relatively simple although it does do 2 table hits rather than 1.
Also consider doing it all within a transaction to keep the data consistent (though also probably not needed here).
How can I pair the records together?
A cross join unfortunately multiplies the rows (number of rows on left times the number of rows on the right). It is useful in some instances, but possibly not here.
I suggest when you do your inserts above, you get an identifier (e.g., companyID) in your temp table and join on that.
If you don't have a matching record and just want to assign them in order, you can use an answer similar to my answer in another recent question How to update multiple rows in a temp table with multiple values from another table using only one ID common between them?
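One way to "assign them in order" is to number the rows of both sets and join on that number. The sketch below is only an illustration under stated assumptions: it runs against SQLite through Python's sqlite3 module (window functions need SQLite 3.25 or later), and the tables accounts and new_ids with their contents are hypothetical. ROW_NUMBER() OVER (ORDER BY ...) is also available in T-SQL, so the same join shape should carry over.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical stand-ins: rows needing an ID, and the IDs that were just generated.
    CREATE TABLE accounts (account_id INTEGER PRIMARY KEY, signature_file TEXT);
    CREATE TABLE new_ids (id INTEGER PRIMARY KEY);
    INSERT INTO accounts VALUES (7, 'a.png'), (9, 'b.png'), (12, 'c.png');
    INSERT INTO new_ids VALUES (101), (102), (103);
""")

# Number each set independently, then join on the row number: a 1-to-1 pairing
# with no shared key and no Cartesian product.
paired_rows = conn.execute("""
    WITH a AS (SELECT account_id, ROW_NUMBER() OVER (ORDER BY account_id) AS rn FROM accounts),
         g AS (SELECT id,         ROW_NUMBER() OVER (ORDER BY id)         AS rn FROM new_ids)
    SELECT a.account_id, g.id
    FROM a
    JOIN g ON g.rn = a.rn
    ORDER BY a.rn
""").fetchall()

print(paired_rows)
```

Each account is matched with exactly one generated ID, three rows in total rather than the nine a cross join would produce.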
Further notes
I suggest avoiding table variables (e.g., DECLARE @yourtable TABLE) and using temporary tables (CREATE TABLE #yourtable) instead, for performance reasons. If it's only a small number of rows it's OK, but it gets worse as the table gets larger, because SQL Server assumes that table variables only have 1 row.
In your bottom statement, why is there the SELECT statement in the FROM clause? Couldn't you just get rid of that select statement and have the FROM clause list the tables you want?
I figured out a way to have access to the output, by using a merge statement.
DECLARE @LogoGlobalIdsTbl TABLE (ID INT, companyBankAccountID INT)
MERGE GlobalIds
USING
(
SELECT (cba.CompanyBankAccountId)
FROM CompanyBankAccounts cba
INNER JOIN Companies c on c.CompanyId = cba.CompanyId
WHERE cba.LogoDocumentId IS NULL AND (cba.LogoFile IS NOT NULL AND cba.LogoFile != '')
) src ON (1=0)
WHEN NOT MATCHED
THEN INSERT ( TypeId )
VALUES (@DocumentsGlobalTypeKey)
OUTPUT [INSERTED].[Id], src.CompanyBankAccountId
INTO @LogoGlobalIdsTbl;

SQL SELECT query where the IDs were already found

I have 2 tables:
Table A has 3 columns (for example) with opportunity sales header data:
OPP_ID, CLOSE_DTTM, STAGE
Table B has 3 columns with the individual line items for the Opportunities:
OPP_LINE_ID, OPP_ID, AMOUNT_USD
I have a select statement that correctly parses through Table A and returns a list of Opportunities. What I would like to do is, without joining the data, to have a SELECT statement that will get data from Table B but only for the OPP_IDs that were found in my first query.
The result should be 2 views/resultset (one for each select query) and not just 1 combined view where Table B is joined to Table A.
The reason why I want to keep them separate is because I will have to perform a few manipulations to the result from table B and i don't want the result from table A affected.
A subquery is all you need:
SELECT OPP_ID, CLOSE_DTTM, STAGE
From table a
where a.opp_id IN (Select opp_id from table b)
Presuming you're using this in some client side data access library that represents B's data in some 2 dimensional collection and you want to manipulate it without affecting/ having A's data present in that collection:
Identify the records in A:
SELECT * FROM a WHERE somecolumn = 'somevalue'
Identify the records in B that relate to A, but don't return A's data:
SELECT b.* FROM a JOIN b ON a.opp_id = b.opp_id WHERE a.somecolumn = 'somevalue'
Just because JOIN is used doesn't mean your end-consuming program has to know about A's data. You could also use IN, like the other answer does, but internally the database will rewrite them to be the same thing anyway
I tend to use exists for this type of query:
select b.*
from b
where exists (select 1 from a where a.opp_id = b.opp_id);
If you want two results sets, you need to run two queries. It is unclear what the second query is, perhaps the first query on A.
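To make the "two queries, two result sets" idea concrete, here is a small sketch using Python's sqlite3 module. The sample rows and the stage = 'Won' filter are assumptions standing in for whatever the first query on A really does. The second query repeats that filter inside EXISTS, so it returns only B's columns for the opportunities the first query found.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (opp_id INTEGER PRIMARY KEY, close_dttm TEXT, stage TEXT);
    CREATE TABLE b (opp_line_id INTEGER PRIMARY KEY, opp_id INTEGER, amount_usd REAL);
    -- Invented sample data for illustration only.
    INSERT INTO a VALUES (1, '2020-01-05', 'Won'), (2, '2020-02-10', 'Lost'), (3, '2020-03-15', 'Won');
    INSERT INTO b VALUES (10, 1, 100.0), (11, 1, 50.0), (12, 2, 75.0), (13, 3, 20.0);
""")

# Result set 1: the opportunity headers (hypothetical filter).
opp_headers = conn.execute(
    "SELECT opp_id, close_dttm, stage FROM a WHERE stage = 'Won' ORDER BY opp_id"
).fetchall()

# Result set 2: line items for those same opportunities, without any of A's columns.
opp_lines = conn.execute("""
    SELECT b.opp_line_id, b.opp_id, b.amount_usd
    FROM b
    WHERE EXISTS (SELECT 1 FROM a WHERE a.opp_id = b.opp_id AND a.stage = 'Won')
    ORDER BY b.opp_line_id
""").fetchall()

print(opp_headers)
print(opp_lines)
```

The two lists are independent, so the line items can be manipulated without touching the header result.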

Faster way to query Access table

I have two access tables, A and B:
Table A
Identifier BenefitBase PlanNav
1 131368.46 131368.46
2 201768.8 201768.79
3 54057.46 54057.46
4 7397.51 7397.51
5 9931.4 9931.4
6 178200 178200
Table B
p ValidityDate LockInAmount
1 2016-4 3.82
2 2016-4 19.97
3 2016-4 26.85
4 2016-6 34.95
I just want to create a query which extracts records from B where the "p" ID is not found in table A.
My current code is:
SELECT B.p, B.ValidityDate, B.LockInAmount
FROM B
WHERE (((B.p) Not In (select Identifier from A)));
Now to me, this code should work fine. However, the tables are large: B consists of 486,000 rows (the "p" values repeat in this table for different dates), whereas A consists of circa 19,000. Whenever I run the query, Access fills the query progress bar but freezes when it is nearly full.
Is there another way to do this?
Thanks
You could also use a left join to do the same thing Gustav does. It's easier for me to read, and I believe that it will operate with the same execution plan.
select B.p, B.ValidityDate, B.LockInAmount
from B
left join A on B.P = A.Identifier
where A.Identifier is null
And add to that the indexes recommended by Erik up above. (That said, if P and Identifier are primary keys on your tables then they are already indexed and you don't need to add the indexes)
Since you don't know if the fields are indexed:
Create indexes for both fields (see this page by Microsoft for information on indexes):
Execute these queries to create the indexes (or use the GUI)
CREATE INDEX TblAIdentifier ON A(Identifier)
CREATE INDEX TblBP ON B(p)
As long as you at least create the first index, Access won't even need to open up table A. It can just look in the index which fields are taken.
You can use this answer together with the one provided by @Gustav.
You could "reverse" the seek - first find those that have a match, then exclude these from Table B:
Select B.*
From B
Where B.ID Not In
(Select A.Id
From A, B
Where A.ID = B.ID)
SELECT B.p, B.ValidityDate, B.LockInAmount
FROM
B
Left join
A
ON B.p=A.Identifier
WHERE A.Identifier Is Null;
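The anti-join variants suggested in this thread can be compared side by side in a quick sketch. It runs on SQLite via Python's sqlite3 module rather than Access (an assumption for the sake of a runnable example; Access's query engine may plan these differently), and it uses a trimmed copy of the sample rows, with A cut down to three identifiers so that one row of B has no match.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE A (Identifier INTEGER, BenefitBase REAL, PlanNav REAL);
    CREATE TABLE B (p INTEGER, ValidityDate TEXT, LockInAmount REAL);
    -- Indexes on the compared columns, as recommended in the answers.
    CREATE INDEX TblAIdentifier ON A(Identifier);
    CREATE INDEX TblBP ON B(p);
    INSERT INTO A VALUES (1, 131368.46, 131368.46), (2, 201768.8, 201768.79), (3, 54057.46, 54057.46);
    INSERT INTO B VALUES (1, '2016-4', 3.82), (2, '2016-4', 19.97),
                         (3, '2016-4', 26.85), (4, '2016-6', 34.95);
""")

# Variant 1: LEFT JOIN and keep only the rows where no partner was found.
left_join_rows = conn.execute("""
    SELECT B.p, B.ValidityDate, B.LockInAmount
    FROM B
    LEFT JOIN A ON B.p = A.Identifier
    WHERE A.Identifier IS NULL
""").fetchall()

# Variant 2: NOT EXISTS with a correlated subquery.
not_exists_rows = conn.execute("""
    SELECT B.p, B.ValidityDate, B.LockInAmount
    FROM B
    WHERE NOT EXISTS (SELECT 1 FROM A WHERE A.Identifier = B.p)
""").fetchall()

print(left_join_rows)
print(not_exists_rows)
```

Both variants return the single unmatched row of B, so the choice between them comes down to readability and to how the engine at hand executes each one.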

SQL Query returns more

I'm having a bit of a problem with a SQL Query that returns too many results. I'm fairly new to SQL so please bear with me.
Please see the following:
Table Structures
The Query that I use looks like:
SELECT TABLE_B.*
FROM
TABLE_A
JOIN
TABLE_B
ON
TABLE_A.COMMON_ID=TABLE_B.COMMON_ID
AND TABLE_A.SEQ_3C=TABLE_B.SEQ_3C
JOIN
TABLE_C
ON
TABLE_A.COMMON_ID=TABLE_C.EMPLID
WHERE
TABLE_B.ITEM_STATUS<>'C'
and TABLE_A.CHECKLIST_STATUS='I'
and TABLE_A.ADMIN_FUNCTION='ADMA'
and TABLE_A.CHECKLIST_CD='APPL'
and TABLE_A.COMMON_ID = '123456789'
and TABLE_C.ADMIT_TERM='2171'
and TABLE_C.INSTITUTION='SOMEWHERE'
I just want the results from Table_B and not what it's giving me.
Please explain this to me as I have spent 3 days on it non-stop.
What am I missing?
You want data from TABLE_B? Then select from it only and have the conditions on the other tables in your where clause.
The inner joins on the other tables serve as existence tests, I assume? Don't do that. You'd only multiply your records, just as you are doing now, only to have to dismiss duplicates later. That can cause bad performance on large tables and errors in more complicated queries. Use EXISTS or IN instead.
select *
from table_b
where item_status <> 'C'
and (common_id, seq_3c) in
(
select common_id, seq_3c
from table_a
where checklist_status = 'I'
and admin_function = 'ADMA'
and checklist_cd = 'APPL'
)
and common_id in
(
select EMPLID
from table_c
where admit_term = '2171'
and institution = 'SOMEWHERE'
);
SELECT DISTINCT TABLE_B.*
FROM
TABLE_A
JOIN
TABLE_B
ON
TABLE_A.COMMON_ID=TABLE_B.COMMON_ID
AND TABLE_A.SEQ_3C=TABLE_B.SEQ_3C
JOIN
TABLE_C
ON
TABLE_A.COMMON_ID=TABLE_C.EMPLID
WHERE
TABLE_B.ITEM_STATUS<>'C'
and TABLE_A.CHECKLIST_STATUS='I'
and TABLE_A.ADMIN_FUNCTION='ADMA'
and TABLE_A.CHECKLIST_CD='APPL'
and TABLE_A.COMMON_ID = '123456789'
and TABLE_C.ADMIT_TERM='2171'
and TABLE_C.INSTITUTION='SOMEWHERE'
This should be easy to understand without looking at all your tables and output.
Suppose you join two tables, A and B, on a column id. You only want the columns from table B, and in table B the id column is a unique identifier.
Even so, if in table A an id (the same id) appears five times, the join will have five rows for that id. Then you just select the columns from table B, so it will look like you got the same row five different times.
Perhaps you don't really need a join? What is your underlying problem you are trying to solve?
It's hard to answer this question without more information about why you're executing these joins. I can explain why you're getting the results you're getting, and hopefully that will allow you to solve the problem yourself.
You start, in your FROM clause, with table A. You join this table with table B on matching COMMON_ID, which, based on the tables you provide, returns three matches for the one record you have in table A. This increases your result set size to three records. Next, you join these three records with table C, on matching ID. Because all IDs are, in fact, identical, this returns nine matches for every record in your current result set: you now have 9 x 3 = 27 records in your result set.
Finally, the WHERE clause comes into effect. This clause excludes 6 out of 9 records in table C, so you have 3 of those records left. Your final result set is therefore 1 (table A) x 3 (table B) x 3 (table C) = 9 records.
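The row arithmetic above can be reproduced with a toy data set. The sketch below is an assumption-laden miniature: it uses SQLite through Python's sqlite3 module, the tables hold only the join columns plus one filter column, and the rows are invented to match the 1 x 3 x 3 pattern described. It contrasts the three-way join with a query that selects from TABLE_B alone and uses EXISTS for the other two tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table_a (common_id TEXT, seq_3c INTEGER);
    CREATE TABLE table_b (common_id TEXT, seq_3c INTEGER, item TEXT);
    CREATE TABLE table_c (emplid TEXT, admit_term TEXT);
    -- 1 row in A, 3 matching rows in B, 9 rows in C of which 3 pass the filter.
    INSERT INTO table_a VALUES ('123456789', 1);
    INSERT INTO table_b VALUES ('123456789', 1, 'x'), ('123456789', 1, 'y'), ('123456789', 1, 'z');
    INSERT INTO table_c VALUES
        ('123456789', '2161'), ('123456789', '2161'), ('123456789', '2161'),
        ('123456789', '2171'), ('123456789', '2171'), ('123456789', '2171'),
        ('123456789', '2181'), ('123456789', '2181'), ('123456789', '2181');
""")

# Joining multiplies rows: every B row is repeated once per surviving C row.
joined_rows = conn.execute("""
    SELECT b.*
    FROM table_a a
    JOIN table_b b ON a.common_id = b.common_id AND a.seq_3c = b.seq_3c
    JOIN table_c c ON a.common_id = c.emplid
    WHERE c.admit_term = '2171'
""").fetchall()

# Using the other tables only as existence tests leaves B's rows unduplicated.
exists_rows = conn.execute("""
    SELECT b.*
    FROM table_b b
    WHERE EXISTS (SELECT 1 FROM table_a a
                  WHERE a.common_id = b.common_id AND a.seq_3c = b.seq_3c)
      AND EXISTS (SELECT 1 FROM table_c c
                  WHERE c.emplid = b.common_id AND c.admit_term = '2171')
""").fetchall()

print(len(joined_rows), len(exists_rows))
```

The join returns nine rows (each B row three times), while the EXISTS form returns the three distinct rows of TABLE_B.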

How does a WHERE clause work when tables are joined to themselves?

I am re-writing a query that was generated by Business Objects and uses "old-fashioned" implicit join syntax. The code joins a table to itself in one part and also has a "global" where clause. For example:
select a.col1
, b.col2
from MYDB.TABLE a, MYDB.TABLE b
where a.something=b.something_else
and MYDB.TABLE.source='A'
Above is a made up illustration of the question. The actual query is very long, joining about ten tables.
My question: as written above, does the "extra" where condition apply to both instances of the same table? I think yes, but I've never seen code like this before. I'm using Teradata but I think this is a general SQL question.
UPDATE: Perhaps my attempt to narrow the question wasn't accurate. Here is the complete query I'm trying to modify:
SELECT
abs_contrct_prof.contrct_nbr_txt,
gbs_org_LVL1.bus_pln_sgmnt_cd,
abs_gnrc_lst_of_val_LVL1.wirls_val_1_txt
FROM
EDWABSUSERVIEWS.abs_contrct abs_contrct_prof,
EDWABSUSERVIEWS.gbs_org gbs_org_LVL1,
EDWABSUSERVIEWS.abs_gnrc_lst_of_val abs_gnrc_lst_of_val_LVL1,
EDWABSUSERVIEWS.abs_contrct,
EDWABSUSERVIEWS.gbs_sls_actv_blng_org_rltd,
EDWABSUSERVIEWS.gbs_sls_actv_org_rltd
WHERE
( abs_contrct_prof.type_cd = 'PROFILE' )
AND ( EDWABSUSERVIEWS.abs_contrct.type_cd = 'AGREEMENT' )
AND ( abs_contrct_prof.prnt_contrct_id=EDWABSUSERVIEWS.abs_contrct.contrct_id )
AND ( abs_contrct_prof.org_id=EDWABSUSERVIEWS.gbs_sls_actv_org_rltd.org_id )
AND ( EDWABSUSERVIEWS.gbs_sls_actv_org_rltd.gbs_lvl_3_org_id=EDWABSUSERVIEWS.gbs_sls_actv_blng_org_rltd.gbs_lvl_3_org_id )
AND ( EDWABSUSERVIEWS.gbs_sls_actv_blng_org_rltd.gbs_lvl_1_org_id = gbs_org_LVL1.org_id )
AND ( gbs_org_LVL1.bus_pln_sgmnt_cd=abs_gnrc_lst_of_val_LVL1.nm_txt and abs_gnrc_lst_of_val_LVL1.actv_ind = 'Y' and abs_gnrc_lst_of_val_LVL1.type_cd ='ABS_MOBILITY_SEGMENT' )
AND
abs_contrct_prof.contrct_nbr_txt IN @variable('FAN')
AND ( EDWABSUSERVIEWS.abs_contrct.sts_cd = 'Active' )
AND ( EDWABSUSERVIEWS.abs_contrct.type_cd='AGREEMENT' )
Note the table EDWABSUSERVIEWS.abs_contrct is referenced twice in the FROM clause, once using an alias and once without. And yes, this query works as written, but I want to re-write it to use explicit join syntax (as someone commented).
I ran an EXPLAIN on the query and it appears that the "extra" where conditions (for 'Active' and 'AGREEMENT') are in fact applied to both instances of the table separately.
Does the "extra" where condition apply to both instances of the same table?
No, it doesn't. It only applies to one instance of the table.
Plus, the query is not correctly written like that; it should (and probably will) give an error in most SQL implementations. You have to name either a or b:
select a.col1
, b.col2
from MYDB.TABLE a, MYDB.TABLE b
where a.something=b.something_else
and a.source = 'A' --or:-- and b.source = 'A'
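A small experiment shows that a predicate written against one alias filters only that instance of the table. This sketch uses SQLite via Python's sqlite3 module, not Teradata, and the table t and its rows are hypothetical, so it illustrates the general SQL behaviour of aliased self-joins rather than anything Teradata-specific.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical table joined to itself below.
    CREATE TABLE t (id INTEGER, something INTEGER, something_else INTEGER, source TEXT);
    INSERT INTO t VALUES (1, 10, 20, 'A'), (2, 20, 10, 'B'), (3, 10, 10, 'B');
""")

# The filter names alias a, so only the a side is restricted to source = 'A'.
self_join_rows = conn.execute("""
    SELECT a.id, a.source, b.id, b.source
    FROM t a, t b
    WHERE a.something = b.something_else
      AND a.source = 'A'
    ORDER BY a.id, b.id
""").fetchall()

print(self_join_rows)
```

Every returned row has source 'A' on the a side, while the b side still contains rows with source 'B', which is what "it only applies to one instance" means in practice.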
If your TRANSACTION MODE is TERADATA the optimizer will product join the fully qualified table reference in the WHERE clause against the spool file containing the INNER JOIN results. I have not been able to test this yet with TRANSACTION MODE set to ANSI.
Example
CREATE VOLATILE TABLE Test1
(
Col1 SMALLINT NOT NULL,
col2 VARCHAR(10) NOT NULL
)
PRIMARY INDEX (Col1)
ON COMMIT PRESERVE ROWS;
CREATE VOLATILE TABLE Test2
(
Col1 SMALLINT NOT NULL,
col2 VARCHAR(10) NOT NULL
)
PRIMARY INDEX (Col1)
ON COMMIT PRESERVE ROWS;
SELECT A.Col1
, B.Col2
, Test1.Col1
FROM Test1 A
, Test2 B
WHERE A.Col1 = B.Col1
AND Test1.Col1 = 1;
Explain - Teradata Mode
1) First, we do an all-AMPs JOIN step from USER.B by way of a RowHash
match scan with no residual conditions, which is joined to USER.A
by way of a RowHash match scan with no residual conditions.
USER.B and USER.A are joined using a merge join, with a join
condition of ("USER.A.Col1 = USER.B.Col1"). The result goes into
Spool 2 (one-amp), which is redistributed by the hash code of (9)
to all AMPs. The size of Spool 2 is estimated with low confidence
to be 1 row (15 bytes). The estimated time for this step is 0.02
seconds.
2) Next, we do a single-AMP JOIN step from Spool 2 (Last Use) by way
of an all-rows scan, which is joined to USER.Test1 by way of the
primary index "USER.Test1.Col1 = 1" with no residual conditions.
Spool 2 and USER.Test1 are joined using a product join, with a
join condition of ("(1=1)"). The result goes into Spool 1
(all_amps), which is built locally on that AMP. The size of Spool
1 is estimated with low confidence to be 1 row (22 bytes). The
estimated time for this step is 0.01 seconds.
3) Finally, we send out an END TRANSACTION step to all AMPs involved
in processing the request.
-> The contents of Spool 1 are sent back to the user as the result of
statement 1. The total estimated time is 0.03 seconds.
UPDATE
INSERT INTO Test1 VALUES (1,'C');
INSERT INTO Test1 VALUES (2,'D');
INSERT INTO Test2 VALUES (1, 'C1');
INSERT INTO Test2 Values (2, 'D2');
Results
Col1 Col2 Test1.Col1
----++----++----------
1 C1 1
2 D2 1
The filter condition was not applied to the tables in the FROM clause.
One of the first things a dbms does (acts like it does) in resolving a query is to build a working table from all the table constructors (FROM clauses, JOINs, etc).
Immediately after that, the dbms goes (acts like it goes) to the WHERE clause, and removes from the working table all the rows that don't test as TRUE.
So a valid WHERE clause applies to all the rows in the working table. The irascible Joe Celko, a member of the early SQL standards committee, has written about the order of processing often online. (Search the thread in that link for Effectively materialize.)