I've got a table, ItemValue, full of data on a SQL Server 2005 instance running in 2000 compatibility mode. It's a user-defined values table that looks something like this:
ID ItemNumber FieldID Value
-- ---------- ------- ------
1 abc123 1 D
2 abc123 2 287.23
4 xyz789 1 A
5 xyz789 2 3782.23
6 xyz789 3 23
7 mno456 1 W
9 mno456 3 45
... and so on.
FieldID comes from the ItemField table:
ID FieldNumber DataFormatID Description ...
-- ----------- ------------ -----------
1 1 1 Weight class
2 2 4 Cost
3 3 3 Another made up description
.  .           .            ...
x  91          x            ... (we have 91 user-defined fields)
Because I can't PIVOT in 2000 mode, we're stuck building an ugly query using CASEs and GROUP BY to get the data to look how it should for some legacy apps, which is:
ItemNumber Field1 Field2  Field3 ... Field51
---------- ------ ------- ------     -------
abc123 D 287.23 NULL
xyz789 A 3782.23 23
mno456 W NULL 45
You can see we only need this table to show values up to the 51st UDF. Here's the query:
SELECT
iv.ItemNumber
,MAX(CASE WHEN f.FieldNumber = 1 THEN iv.[Value] ELSE NULL END) [Field1]
,MAX(CASE WHEN f.FieldNumber = 2 THEN iv.[Value] ELSE NULL END) [Field2]
,MAX(CASE WHEN f.FieldNumber = 3 THEN iv.[Value] ELSE NULL END) [Field3]
...
,MAX(CASE WHEN f.FieldNumber = 51 THEN iv.[Value] ELSE NULL END) [Field51]
FROM ItemField f
LEFT JOIN ItemValue iv ON f.ID = iv.FieldID
WHERE f.FieldNumber <= 51
GROUP BY iv.ItemNumber
When the FieldNumber constraint is <= 51, the execution plan goes something like:
SELECT <== Compute Scalar <== Stream Aggregate <== Sort (Cost: 70%) <== Hash Match <== (Clustered Index Seek && Table Scan)
and it's fast! I can pull back 100,000+ records in about a second, which suits our needs.
However, if we had more UDFs and I changed the constraint to anything above 66 (yes, I tested them one by one), or if I remove it completely, I lose the Sort in the execution plan; it gets replaced with a whole bunch of Parallelism blocks that gather, repartition, and distribute streams, and the entire thing is slow (30 seconds for even just one record).
FieldNumber has a clustered, unique index, and is part of a composite primary key with the ID column (non-clustered index) in the ItemField table. The ItemValue table's ID and ItemNumber columns make up the PK, and there is an extra non-clustered index on the ItemNumber column.
What is the reasoning behind this? Why does changing my simple integer constraint change the entire execution plan?
And if you're up to it... what would you do differently? There's a SQL upgrade planned for a couple of months from now, but I need to get this problem fixed before that.
SQL Server is smart enough to take CHECK constraints into account when optimizing the queries.
Your f.FieldNumber <= 51 is optimized away, and the optimizer sees that the two tables should be joined in their entirety (which is best done with a HASH JOIN).
If you don't have the constraint, the engine needs to check the condition for every row, and most probably uses index traversal to do this. This may be slower.
Could you please post the whole plans for the queries? Just run SET SHOWPLAN_TEXT ON and then run the queries.
Update:
"What is the reasoning behind this? Why does changing my simple integer constraint change the entire execution plan?"
If by a constraint you mean the WHERE condition, then the explanation is probably this.
Set operations (which are what SQL performs) have no single most efficient algorithm: the efficiency of each algorithm depends heavily on the data distribution in the sets.
Say, for taking a subset (which is what the WHERE clause does), you can either find the range of records in the index and use the index record pointers to locate the data rows in the table, or just scan all records in the table and filter them with the WHERE condition.
The efficiency of the former operation is m × const, and that of the latter is n, where m is the number of records satisfying the condition, n is the total number of records in the table, and const > 1.
This means that for larger values of m the full scan is more efficient; for instance, if const = 3, the full scan wins as soon as more than a third of the records satisfy the condition.
SQL Server is aware of this and changes execution plans according to the constants that affect the data distribution in the set operations.
To do this, SQL Server maintains statistics: aggregated histograms of the data distribution in each indexed column. It uses them to build the query plans.
So changing the integer in the WHERE condition in fact affects the size and the data distribution of the underlying sets, and makes SQL Server reconsider which algorithms are the best fit for sets of that size and layout.
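If you want to see the histograms the optimizer works from, you can inspect them directly. A minimal sketch, assuming an index named IX_ItemField_FieldNumber covers FieldNumber (the index name is an assumption; substitute your actual one):
-- Show the statistics histogram kept for the index.
DBCC SHOW_STATISTICS ('dbo.ItemField', 'IX_ItemField_FieldNumber');
-- Rebuild the statistics if they look stale.
UPDATE STATISTICS dbo.ItemField WITH FULLSCAN;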
"it gets replaced with a whole bunch of Parallelism blocks"
Try this:
SELECT
iv.ItemNumber
,MAX(CASE WHEN f.FieldNumber = 1 THEN iv.[Value] ELSE NULL END) [Field1]
,MAX(CASE WHEN f.FieldNumber = 2 THEN iv.[Value] ELSE NULL END) [Field2]
,MAX(CASE WHEN f.FieldNumber = 3 THEN iv.[Value] ELSE NULL END) [Field3]
...
,MAX(CASE WHEN f.FieldNumber = 51 THEN iv.[Value] ELSE NULL END) [Field51]
FROM ItemField f
LEFT JOIN ItemValue iv ON f.ID = iv.FieldID
WHERE f.FieldNumber <= 51
GROUP BY iv.ItemNumber
OPTION (MAXDOP 1)
Using OPTION (MAXDOP 1) should prevent the parallelism in the execution plan.
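As an aside, the point at which the optimizer starts considering parallel plans at all is governed by the server-wide 'cost threshold for parallelism' setting, which you can inspect like this (this is general SQL Server behavior, not something specific to the question):
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
-- Show the current cost threshold above which parallel plans are considered.
EXEC sp_configure 'cost threshold for parallelism';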
At 66 you are hitting some internal cost-estimate threshold that decides it is better to use one plan vs. the other. What that threshold is and why it is hit is not really important. Note that your queries differ with each FieldNumber value: you are not only changing the WHERE clause, you are also changing the pseudo-'pivot' projected fields.
Now I don't know all the details of your table, your queries, and your insert/update/delete pattern, but for the particular query you posted, the proper clustered index structure for the ItemValue table is this:
CREATE CLUSTERED INDEX [cdxItemValue] ON ItemValue (FieldID, ItemNumber);
This structure eliminates the need for an intermediate sort of the results in this 'pivot' query.
Related
I have a rugby database with a player table. In the player table I have a performance column, and I want to represent the performance as
0 = low
1 = medium
2 = high
I don't know what datatype the column should be, or what formula or function to use for the mapping.
Please help.
You can define your column like this:
performance tinyint not null check (performance in (0, 1, 2))
tinyint takes only 1 byte for a value and values can range from 0 to 255.
If you store the values as 1 - Low, 2 - Medium, 3 - High and are using SQL Server 2012+, then you can simply use the CHOOSE function to convert the value to text in the SELECT, like this:
select choose(performance,'Low','Medium','High')
. . .
If you really want to store them as 0, 1, 2, use:
select choose(performance+1,'Low','Medium','High')
. . .
If you are using an earlier version of SQL Server, you can use CASE like this:
case performance
when 0 then 'Low'
when 1 then 'Medium'
when 2 then 'High'
end
1. The column datatype should be int.
2. Where you send the data, map the performance to the number first, e.g.:
   if (performance = low)
       perVar = 0
   and then send perVar to the database.
There are a number of ways you can handle this. One way would be to represent the performance using an int column, which would take on values 0, 1, 2, .... To get the labels for those performances, you could create a separate table which would map those numbers to descriptive strings, e.g.
id | text
0 | low
1 | medium
2 | high
You would then join to this table whenever you needed the full text description. Note that this is probably the only option which will scale as the number of performance types starts to get large.
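A minimal sketch of that arrangement (the table and column names are assumptions, and the multi-row INSERT needs SQL Server 2008+):
-- Lookup table mapping numeric codes to labels.
CREATE TABLE performance_level (
    id    tinyint     NOT NULL PRIMARY KEY,
    label varchar(20) NOT NULL
);
INSERT INTO performance_level (id, label)
VALUES (0, 'low'), (1, 'medium'), (2, 'high');
-- Join to the lookup table whenever the label is needed.
SELECT p.name, pl.label AS performance
FROM player p
INNER JOIN performance_level pl ON pl.id = p.performance;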
If you don't want a separate table, you could also use a CASE expression to generate labels when querying, e.g.
CASE WHEN id = 0 THEN 'low'
WHEN id = 1 THEN 'medium'
WHEN id = 2 THEN 'high'
END
I would use a TINYINT datatype in the performance table to conserve space, then use a FOREIGN KEY CONSTRAINT to a second table which holds the descriptions. The constraint would force the entry of 0, 1, 2 in the performance table while providing a normalized solution that could grow to include additional performance levels.
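Building on the lookup-table sketch above, the constraint would look something like this (the names are again assumptions):
-- Assumes the performance_level lookup table sketched earlier exists.
ALTER TABLE player
    ADD CONSTRAINT fk_player_performance
    FOREIGN KEY (performance) REFERENCES performance_level (id);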
I have a table with 3 columns: PID, LOCID, ISMGR. In the existing scenario, a person is set as ISMGR = 1 based on the location ID.
But as per the new requirement, we have to set ISMGR = 1 for every location of any person who has at least one ISMGR = 1 (i.e., if he is a manager for any one location, he should be a manager for all his locations).
Table Data before running the script:
PID|LOCID|ISMGR
1 1 1
1 2 0
1 3 0
2 1 0
2 2 1
Table Data after running the script:
PID|LOCID|ISMGR
1 1 1
1 2 1
1 3 1
2 1 1
2 2 1
Any help will be highly appreciated.
Thanks in advance.
I would be inclined to write this using exists:
update t
set ismgr = 1
where ismgr = 0 and
exists (select 1 from t t2 where t2.pid = t.pid and t2.ismgr = 1);
exists should be more efficient than doing a subquery with an aggregation.
This will work best with indexes on t(pid, ismgr) and t(ismgr).
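For reference, those indexes would be created along these lines (the index names are illustrative):
create index t_pid_ismgr_idx on t(pid, ismgr);
create index t_ismgr_idx on t(ismgr);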
This is not an answer but a test of the two solutions offered so far - I will call them the "EXISTS" and the "AGGREGATE" solutions or approaches.
Details of the tests are below, but here are two overall conclusions:
Both approaches have comparable execution times; on average the AGGREGATE approach worked a little faster than the EXISTS approach, but by a very small margin (smaller than the differences between running times from one trial to the next). Without indexes on any columns, the run times were (the first number is for the EXISTS approach and the second for AGGREGATE):
Trial 1: 8.19s 8.08s
Trial 2: 8.98s 8.22s
Trial 3: 9.46s 9.55s
Note - Estimated optimizer costs should be used only to compare different execution plans for the same statement, not for different solutions using different approaches. Even so, someone will inevitably ask; so - for the EXISTS approach the lowest cost the optimizer found was 4766; for AGGREGATE, 2665. Again, though, this is completely meaningless.
If a lot of rows need to be updated, indexes will hurt performance much more than they help it. Indeed, when rows are updated, the indexes must be updated as well. If only a small number of rows must be updated, then the indexes will help, because most of the time is spent finding the rows that must be updated and only little time is spent in the updates themselves. In my example almost 25% of rows had to be updated... so the AGGREGATE solution took 51.2 seconds and the EXISTS solution took 59.3 seconds!
RECOMMENDATION: If you expect that a large number of rows may need to be updated, and you already have indexes on the table, you may be better off DROPPING them and re-creating them after the updates! Or, perhaps there are other solutions to this problem; I am not an expert (keep that in mind!)
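For example, a sketch of that drop/re-create cycle, using the index names from the test setup further below:
drop index pid_ismgr_idx;
drop index ismgr_ids;
-- ... run the UPDATE here ...
create index pid_ismgr_idx on tbl(pid, ismgr);
create index ismgr_ids on tbl(ismgr);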
To test properly, after I created the test table and committed, I ran each solution by itself, then I rolled back and, logged in as SYS (in a different session), I ran alter system flush buffer_cache to make sure performance is not randomly helped by cache hits or hurt by misses. In all cases everything is done from disk storage.
I created a table with id's from 1 to 1.2 million and a random integer between 1 and 3, with probabilities 40%, 40% and 20% respectively (see the use of dbms_random below). Then from this prep data I created the test table: each pid was included one, two or three times based on this random integer; and a random 0 or 1 was added as ismgr (with 50-50 probability) in each row. I also added a random integer between 1 and 4 as locid just to simulate the actual data; I didn't worry about duplicate locid since that column plays no role in the problem.
Of the 1.2 million pids, approximately 480,000 (40%) appear just once in the test table, another ~480,000 appear twice and ~240,000 three times. Total rows should be about 2,160,000. That's the cardinality of the base table (in reality it ended up being 2,160,546). Then: none of the ~480,000 rows with unique pid need to be changed; half of the 480,000 pids with a count of 2 will have the same ismgr (so no change) and the other half will be split, so we will need to change 240,000 rows from these; and a simple combinatorial argument shows that 3/8, or 270,000 rows, of the 720,000 rows for pids that appear three times in the table must be changed. So we should expect that 510,000 rows should be changed. In fact the update statements resulted in 510,132 rows updated (same for both solutions). These sanity checks show that the test was probably set up correctly. Below I show also a small sample from the base table, also as a sanity check.
CREATE TABLE statement:
create table tbl as
with prep ( pid, dup ) as (
    select level,
           round( dbms_random.value(0.5, 3) ) as dup
    from   dual
    connect by level <= 1200000
)
select pid,
       round( dbms_random.value(0.5, 4.5) ) as locid,
       round( dbms_random.value(0, 1) )     as ismgr
from   prep
connect by level <= dup
       and prior pid = pid
       and prior sys_guid() is not null
;
commit;
Sanity checks:
select count(*) from tbl;
COUNT(*)
----------
2160546
select * from tbl where pid between 324720 and 324730;
PID LOCID ISMGR
---------- ---------- ----------
324720 4 1
324721 1 0
324721 4 1
324722 3 0
324723 1 0
324723 3 0
324723 3 1
324724 3 1
324724 2 0
324725 4 1
324725 2 0
324726 2 0
324726 1 0
324727 3 0
324728 4 1
324729 1 0
324730 3 1
324730 3 1
324730 2 0
19 rows selected
UPDATE statements:
update tbl t
set ismgr = 1
where ismgr = 0 and
exists (select 1 from tbl t2 where t2.pid = t.pid and t2.ismgr = 1);
rollback;
update tbl
set ismgr = 1
where ismgr = 0
and pid in ( select pid
from tbl
group by pid
having max(ismgr) = 1);
rollback;
-- statements to create indexes, used in separate testing:
create index pid_ismgr_idx on tbl(pid, ismgr);
create index ismgr_ids on tbl(ismgr);
Why PL/SQL? All you need is a plain SQL statement. For example:
update your_table t -- enter your actual table name here
set ismgr = 1
where ismgr = 0
and pid in ( select pid
from your_table
group by pid
having max(ismgr) = 1)
;
The existing solutions are perfectly fine, but I prefer to use merge any time I'm updating rows from a correlated sub-query. I find it to be more readable and the performance is typically commensurate with the exists method.
MERGE INTO t
USING (SELECT DISTINCT pid
FROM t
WHERE ismgr = 1) src
ON (t.pid = src.pid)
WHEN MATCHED THEN
UPDATE SET ismgr = 1
WHERE ismgr = 0;
As @mathguy pointed out, in this case using GROUP BY and HAVING is more efficient than DISTINCT. To use that with MERGE is just a matter of changing the sub-query:
MERGE INTO t
USING (SELECT pid
FROM t
GROUP BY pid
HAVING MAX(ismgr) = 1) src
ON (t.pid = src.pid)
WHEN MATCHED THEN
UPDATE SET ismgr = 1
WHERE ismgr = 0;
I have created a view in my SQL Server database which gives me a number of columns.
One of the column headings is Priority, and the values in this column are Low, Medium, High and Immediate.
When I execute this view, the result is returned perfectly, like below. I want to map the values for these priorities: instead of Low I should get 4, instead of Medium I should get 3, for High it should be 2, and for Immediate it should be 1.
What should I do to achieve this?
Ticket# Priority
123 Low
1254 Low
5478 Medium
4585 High
etc., etc.,
Use CASE:
Instead of Low I should get 4, instead of Medium I should get 3, for
High it should be 2 and for Immediate it should be 1
SELECT
[Ticket#],
[Priority] = CASE Priority
WHEN 'Low' THEN 4
WHEN 'Medium' THEN 3
WHEN 'High' THEN 2
WHEN 'Immediate' THEN 1
ELSE NULL
END
FROM table_name;
EDIT:
If you use a dictionary table like in George Botros's solution, you need to remember about:
1) Maintaining and storing the dictionary table
2) Adding a UNIQUE index to Priority.Name to avoid duplicates like:
Priority table
--------------------
Id | Name | Value
--------------------
1 | Low | 4
2 | Low | 4
...
3) Using a LEFT JOIN instead of an INNER JOIN, defensively, so you still get all results even if there is no corresponding value in the dictionary table.
I have an alternative solution for your problem: create a new Priority table (Id, Name, Value).
By joining to this table you will be able to select the Value column:
SELECT Ticket.*, Priority.Value
FROM Ticket INNER JOIN Priority
ON Priority.Name = Ticket.Priority
Note: although using the CASE keyword is the most straightforward solution for this problem, this solution may be useful if you need this priority value in many places in your system.
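A minimal sketch of that dictionary table (the DDL is assumed, not from the answer; the multi-row INSERT needs SQL Server 2008+):
CREATE TABLE Priority (
    Id    int         NOT NULL PRIMARY KEY,
    Name  varchar(20) NOT NULL UNIQUE, -- UNIQUE guards against the duplicate names shown above
    Value int         NOT NULL
);
INSERT INTO Priority (Id, Name, Value)
VALUES (1, 'Low', 4), (2, 'Medium', 3), (3, 'High', 2), (4, 'Immediate', 1);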
I have a historical transitive closure table that represents a tree.
create table TRANSITIVE_CLOSURE
(
CHILD_NODE_ID number not null enable,
ANCESTOR_NODE_ID number not null enable,
DISTANCE number not null enable,
FROM_DATE date not null enable,
TO_DATE date not null enable,
constraint TRANSITIVE_CLOSURE_PK unique (CHILD_NODE_ID, ANCESTOR_NODE_ID, DISTANCE, FROM_DATE, TO_DATE)
);
Here's some sample data:
CHILD_NODE_ID | ANCESTOR_NODE_ID | DISTANCE
--------------------------------------------
1 | 1 | 0
2 | 1 | 1
2 | 2 | 0
3 | 1 | 2
3 | 2 | 1
3 | 3 | 0
Unfortunately, my current query for finding the root node causes a full table scan:
select *
from transitive_closure tc
where
distance = 0
and not exists (
select null
from transitive_closure tci
where tc.child_node_id = tci.child_node_id
and tci.distance <> 0
);
On the surface, it doesn't look too expensive, but as I approach 1 million rows, this particular query is starting to get nasty... especially when it's part of a view that grabs the adjacency tree for legacy support.
Is there a better way to find the root node of a transitive closure? I would like to rewrite all of our old legacy code, but I can't... so I need to build the adjacency list somehow. Getting everything except the root node is easy, so is there a better way? Am I thinking about this problem the wrong way?
Query plan on a table with 800k rows.
OPERATION OBJECT_NAME OPTIONS COST
SELECT STATEMENT 2301
HASH JOIN RIGHT ANTI 2301
Access Predicates
TC.CHILD_NODE_ID=TCI.CHILD_NODE_ID
TABLE ACCESS TRANSITIVE_CLOSURE FULL 961
Filter Predicates
TCI.DISTANCE = 1
TABLE ACCESS TRANSITIVE_CLOSURE FULL 962
Filter Predicates
DISTANCE=0
How long does the query take to execute, and how long do you want it to take? (You usually do not want to use the cost for tuning. Very few people know what the explain plan cost really means.)
On my slow desktop the query took only 1.5 seconds for 800K rows, and then 0.5 seconds after the data was in memory. Are you getting something significantly worse, or will this query be run very frequently?
I don't know what your data looks like, but I'd guess that a full table scan will always be best for this query. Assuming that your hierarchical data is relatively shallow, i.e. there are many distances of 0 and 1 but very few distances of 100, the most important column will not be very distinct. This means that any of the index entries for distance will point to a large number of blocks. It will be much cheaper to read the whole table at once using multi-block reads than to read a large amount of it one block at a time.
Also, what do you mean by historical? Can you store the results of this query in a materialized view?
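If a materialized view is an option, here is a minimal sketch (the view name and the complete-refresh-on-demand strategy are assumptions, not from the question):
-- Materialize the root-node query; refresh it when the closure table changes.
create materialized view root_nodes_mv
build immediate
refresh complete on demand
as
select *
from   transitive_closure tc
where  distance = 0
and    not exists (
           select null
           from   transitive_closure tci
           where  tci.child_node_id = tc.child_node_id
           and    tci.distance <> 0
       );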
Another possible idea is to use analytic functions. This replaces the second table scan with a sort. This approach is usually faster, but for me this query actually takes longer: 5.5 seconds instead of 1.5. But maybe it will do better in your environment.
select * from
(
select
max(case when distance <> 0 then 1 else 0 end)
over (partition by child_node_id) has_non_zero_distance
,transitive_closure.*
from transitive_closure
)
where distance = 0
and has_non_zero_distance = 0;
Can you try adding an index on distance and child_node_id, or changing the order of these columns in the existing unique index? I think it should then be possible for the outer query to access the table via the index on distance, while the inner query needs only access to the index.
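Something along these lines (the index name is illustrative):
create index tc_dist_child_idx on transitive_closure (distance, child_node_id);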
Add ONE root node from which all your current root nodes are descended. Then you would simply query the children of your one root. Problem solved.
In short I have 2 tables:
USERS:
------------------------
UserID | Name
------------------------
0 a
1 b
2 c
CALLS:
------------------------
ToUser | Result
------------------------
0 ANSWERED
1 ENGAGED
1 ANSWERED
0 ANSWERED
Etc., etc. (I use a numerical reference for Result in reality.)
I have over 2 million records, each detailing a call to a specific client. Currently I'm using CASE statements to count each recurrence of a particular result, AFTER I have already done the quick total count:
COUNT(DISTINCT l_call_log.line_id),
COALESCE (SUM(CASE WHEN l_call_log.line_result = 1 THEN 1 ELSE NULL END), 0) AS [Answered],
COALESCE (SUM(CASE WHEN l_call_log.line_result = 2 THEN 1 ELSE NULL END), 0) AS [Engaged],
COALESCE (SUM(CASE WHEN l_call_log.line_result = 4 THEN 1 ELSE NULL END), 0) AS [Unanswered]
Am I doing 3 scans of the data after my initial total count? If so, is there a way I can do one sweep and count the calls per result in one go?
Thanks.
This would take one full table scan.
EDIT: There's not enough information to answer. Because of the duplicate removal (DISTINCT) that I missed earlier, we can't tell what strategy would be used... especially without knowing the database engine.
In just about every major query engine, each aggregate function is evaluated once per row, and the engine may reuse a cached result (such as COUNT(*), for example).
Is line_result indexed? If so, you could write a better query (GROUP BY + COUNT(*)) to take advantage of index statistics, though I'm not sure whether that's worthwhile given the other tables in your query.
There is the GROUP BY construct in SQL. Try:
SELECT l_call_log.line_result, COUNT(DISTINCT l_call_log.line_id)
FROM l_call_log
GROUP BY l_call_log.line_result
I would guess it's a single table scan, since you don't have any dependent subqueries. Run EXPLAIN on the query to be sure.