T-SQL query running indefinitely | Not if it's batched

I hit quite a strange issue here working with a SQL Server table.
With following query, I'm checking if an entry exists that was created between 2023-02-01T04:10:18 and 2023-02-05T04:55:44 (4 days).
This query runs forever:
SELECT TOP 1 1
FROM tablexyz t1 (nolock)
WHERE t1.col1 = 1
  AND t1.col2 <= '2023-01-31'
  AND t1.knowledge_begin_date >= '2023-02-01T04:10:18'
  AND t1.knowledge_begin_date <= '2023-02-05T04:10:18'
OPTION (RECOMPILE)
Whereas if I split the check into two 2-day periods, both queries execute in under 200 ms:
-- Executes in 200 ms
SELECT TOP 1 1
FROM tablexyz t1 (nolock)
WHERE t1.col1 = 1
  AND t1.col2 <= '2023-01-31'
  AND t1.knowledge_begin_date >= '2023-02-01T04:10:18'
  AND t1.knowledge_begin_date <= '2023-02-03T04:10:18'
OPTION (RECOMPILE)
and
-- Executes in 200 ms
SELECT TOP 1 1
FROM tablexyz t1 (nolock)
WHERE t1.col1 = 1
  AND t1.col2 <= '2023-01-31'
  AND t1.knowledge_begin_date >= '2023-02-03T04:10:18'
  AND t1.knowledge_begin_date <= '2023-02-05T04:10:18'
OPTION (RECOMPILE)
Any idea what could be the reason here? Note that this view (over 3 tables) has over 3 billion rows.
Indexes on the tables
Non-clustered index col1_col2_IX on (col1, col2)
Non-clustered index kdb_IX on (knowledge_begin_date)
Execution plans:
I'm not able to get the actual execution plan of the long-running query, as it never completes. Is there any way to access it?
Looking at the query plans of the batched queries, they do an index lookup on kdb_IX for all 3 tables the view is built over.
It seems reasonable to believe that the query optimizer should take care of this, but strangely that is not the case.

You can try to combine the two indexes in one query using CROSS APPLY and see if it helps... something like:
SELECT TOP 1 1
FROM tablexyz t1 (nolock)
CROSS APPLY (
    SELECT TOP 1 1 AS x
    FROM tablexyz t2 (nolock)
    WHERE t2.col1 = 1
      AND t2.col2 <= '2023-01-31'
      AND t1.ID = t2.ID
) ca
WHERE t1.knowledge_begin_date >= '2023-02-01T04:10:18'
  AND t1.knowledge_begin_date <= '2023-02-05T04:10:18'
OPTION (RECOMPILE)
This is just a stretch; it depends a lot on the data distribution and on how the optimizer builds the execution plan. The idea is that the CROSS APPLY is applied for each row (as a set), and in this case it can use the second index.

Related

How can I turn my SQL code below into a loop so I do not need to write multiple lines of code

I have these two tables:
table1
id
amount
table2
id
col1
col2
col3
col4
col5
And this SQL:
select
t2.col1/t1.amount as col1,
(t2.col1 + t2.col2)/t1.amount as col2,
(t2.col1 + t2.col2 + t2.col3)/t1.amount as col3,
(t2.col1 + t2.col2 + t2.col3 + t2.col4)/t1.amount as col4,
(t2.col1 + t2.col2 + t2.col3 + t2.col4 + t2.col5)/t1.amount as col5
from table1 t1
inner join table2 t2 on t2.id = t1.id
I want to create a loop for the above query so I do not need to write out the select statement for 90 months. How can I do this?
Example:
Current tables
Table 1     Table 2
Amount      1    2    3    4    5
100         10   10   10   10   10
200         20   20   20   20   20
Expected output
1      2      3      4      5
10%    20%    30%    40%    50%
10%    20%    30%    40%    50%
If you're really stuck with this broken table layout for 90 columns, your only option is manually writing out the long version of this expression:
coalesce(col1, 0) + coalesce(col2, 0) + ... + coalesce(col90, 0)
Using bulk columns like this is not good database design, and hence the database doesn't support it with any easy syntax that would let you write anything shorter.
The one thing you maybe can do is create a view to mimic a better table design, but the code for that is still going to be pretty ugly and slow (though an indexed/materialized view might speed it up).
To accomplish this, you have to start with a numbers table containing the values 1 through 90.
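A minimal sketch for building that numbers table (assuming SQL Server; the names numbers/number are just what the view below expects):
-- Hypothetical helper: a persisted numbers table holding 1..90.
CREATE TABLE numbers (number INT PRIMARY KEY);
INSERT INTO numbers (number)
SELECT TOP (90) ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
FROM sys.all_objects;
With that in place, the view looks like this: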
CREATE VIEW BetterTable2
AS
SELECT t2.id, n.number AS ColNumber,
    CASE n.number
        WHEN 1 THEN t2.col1
        WHEN 2 THEN t2.col2
        WHEN 3 THEN t2.col3
        -- ...
        WHEN 90 THEN t2.col90
        ELSE NULL
    END AS ColValue
FROM numbers n
INNER JOIN table2 t2 ON n.number >= 1 AND n.number <= 90
You can see we still need to list out all the columns by hand, but if you really need a running total, this at least makes it possible, and you only have to write out each of those columns once.
Now you can further write a query like this:
SELECT t2_0.id, t2_0.ColNumber,
    SUM(t2_1.ColValue)/t1.amount AS [Percent]
FROM Table1 t1
INNER JOIN BetterTable2 t2_0 ON t2_0.id = t1.id
INNER JOIN BetterTable2 t2_1 ON t2_1.id = t2_0.id AND t2_1.ColNumber <= t2_0.ColNumber
GROUP BY t2_0.id, t2_0.ColNumber, t1.amount
And from here you can do a PIVOT if you really need it, though pivoting is usually best handled in your client code or reporting tool.
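If you do want the pivot in SQL, a hedged sketch (SQL Server PIVOT syntax; RunningTotals is a hypothetical name for the query above wrapped in a view or CTE, and the column list is abbreviated; you would spell out [1] through [90]):
SELECT id, [1], [2], [3], [4], [5]
FROM (
    SELECT id, ColNumber, [Percent]
    FROM RunningTotals  -- the query above, wrapped in a view or CTE
) src
PIVOT (MAX([Percent]) FOR ColNumber IN ([1], [2], [3], [4], [5])) p;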

Subselect in the ON clause

Please don't bash me if there are already answers for this question, but I found none.
Basically, I want to make a subselect in the ON clause of a LEFT JOIN to get the newest entry in a timeframe.
(starttime and endtime are timestamps, hardcoded, in local variables or host variables in a COBOL program.) To simplify, I used integers in this question.
Select *
from table1 as t1
left join table2 as t2
    on t1.primary = t2.secondary
    and t2.timestamp = (
        select max(t2a.timestamp)
        from table2 as t2a
        where t2.primary = t2a.primary
          and t2a.timestamp > starttime
          and t2a.timestamp < endtime
    )
Now this does not work; I get the following error:
AN ON CLAUSE IS INVALID. SQLCODE=-338
Because (see the docs)
The ON clause cannot contain a subquery.
Now, what we can do to work around that is, instead of joining table2 directly, to join an already delimited subquery. But that confuses the query optimizer, which kills the performance:
Select *
from table1 as t1
left join (
    select t2a.primary, t2a.secondary, t2a.timestamp
    from table2 as t2a
    where t2a.timestamp = (
        select max(t2b.timestamp)
        from table2 as t2b
        where t2a.primary = t2b.primary
          and t2b.timestamp > starttime
          and t2b.timestamp < endtime
    )
) as t2
on t1.primary = t2.secondary
Any idea how to solve this?
Example data, table1:
t1.primary
1
2
3
Example data, table2:
t2.primary  t2.secondary  t2.timestamp
1           1             4
2           1             5
3           1             10
4           2             4
5           2             5
Variables:
starttime = 3
endtime = 6
Expected result:
t1.primary  t2.primary  t2.secondary  t2.timestamp
1           2           1             5     -- left joined the newest entry in range
2           5           2             5
3           NULL        NULL          NULL
This should work:
select *
from table1 t1
left join (
select t2.primary, t2.secondary, t2.timestamp,
row_number() over (partition by t2.secondary order by t2.timestamp desc) rn
from table2 t2
where t2.timestamp between starttime and endtime
) t on t1.primary = t.secondary and t.rn = 1
If you have an index on table2(timestamp, secondary, primary), or at least table2(timestamp, secondary), then it should run really fast. Even without the indexes it still performs quite well, since it leads to one sequential scan of the table.
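For reference, a sketch of the suggested index (generic CREATE INDEX syntax; the index name is made up, and some platforms may require quoting column names like timestamp or primary):
CREATE INDEX ix_table2_ts_sec ON table2 (timestamp, secondary, primary);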
Something like this. Just typed it in before lunch, so don't bash me if it doesn't work.
select *
from table1 a
left join (
    select t2b.secondary, max(t2b.timestamp) mxts
    from table2 t2b
    where t2b.timestamp between mystartts and myendts
    group by t2b.secondary
) as b on a.primary = b.secondary
left join table2 on table2.secondary = b.secondary
    and table2.timestamp = b.mxts
Note: don't assume timestamps are unique and can be used to extract the last entry from a table, because that will undoubtedly be fragile.

Teradata optimizer wrongly estimates row count when accessing a table through a view with UNION

Let's say I have three tables: t1 (a fact table with about 1 billion rows), t2 (an empty table, 0 rows) and t0 (a dimension table); all of them have properly collected statistics. In addition, there is a view v0:
REPLACE VIEW v0
AS SELECT * from t1
union
SELECT * from t2;
Let's look at these three queries:
1) Select * from t1 inner join t0 on t1.id = t0.id; -- Optimizer correctly estimates 1 bln rows
2) Select * from t2 inner join t0 on t2.id = t0.id; -- Optimizer correctly estimates 0 rows
3) Select * from v0 inner join t0 on v0.id = t0.id; -- Optimizer locks t1 and t2 for read; it correctly estimated that it will get 1 bln rows from t1, but for no clear reason estimated the same 1 bln from table t2.
What is going on here? Is it a bug or a feature?
PS. The original query, which is too big to show here, didn't finish in 35 minutes. After leaving just t1, it successfully finished in 15 minutes.
TD Release: 15.10.03.07
TD Version: 15.10.03.09
It's not the same number for the 2nd SELECT; it's the overall number of rows in the spool after the 2nd SELECT, which is 1 billion plus 0.
And your query was running slowly because you used UNION, which defaults to DISTINCT; running that on a billion rows is really expensive.
Better to switch to UNION ALL instead.
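A sketch of the view with that change applied (the same definition from the question, with UNION ALL swapped in):
REPLACE VIEW v0
AS SELECT * from t1
UNION ALL
SELECT * from t2;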

How to improve performance of SQL query with parameters?

I'm using SQL Server 2005.
I have a problem executing SQL statements like this:
DECLARE @Param1 BIT
SET @Param1 = 1

SELECT
    t1.Col1,
    t1.Col2
FROM Table1 t1
WHERE @Param1 = 0
   OR (t1.Col2 IN
        (SELECT t2.Col4
         FROM Table2 t2
         WHERE t2.Col1 = t1.Col1
           AND t2.Col2 = 'AAA'
           AND t2.t3 <> 0))
This query takes a very long time to execute.
But if I replace @Param1 with the literal 1, the query executes in ~2 seconds.
Any information on how to resolve the problem would be greatly appreciated.
Well, the explanation seems simple enough. With your current condition, since @Param1 = 0 is false (you set the parameter to 1 previously), it needs to evaluate your second condition, which has a subquery and might take a long time. If you change your first filter to @Param1 = 1, then you are saying that it is true and there is no need to evaluate your second filter, hence making your query faster.
You seem to be confusing the optimiser with your OR condition. If you remove it, you should find it generates two different execution plans for the SELECT statement: one with a filter, and the other without:
DECLARE @Param1 BIT
SET @Param1 = 1

if @Param1 = 0
begin
    SELECT
        t1.Col1,
        t1.Col2
    FROM Table1 t1
end
else
begin
    SELECT
        t1.Col1,
        t1.Col2
    FROM Table1 t1
    WHERE t1.Col2 IN
        (SELECT t2.Col4
         FROM Table2 t2
         WHERE t2.Col1 = t1.Col1
           AND t2.Col2 = 'AAA'
           AND t2.t3 <> 0)
end
This is commonly referred to as the N+1 problem: you're doing a select on table 1, and for each record you find, you go look for something in table 2.
By setting your @Param1 to a value that satisfies the first condition, the SQL engine will skip the subquery.
To avoid this behavior, you could use a JOIN statement to join both tables together and afterwards filter the results with a WHERE clause. The join will be a bit slower than a single subquery because you are matching two tables to each other, but because you only need to do the join once (vs. N times) you'll gain a serious performance boost.
Example code:
DECLARE @Param1 BIT
SET @Param1 = 1

SELECT t1.Col1, t1.Col2
FROM Table1 t1
INNER JOIN Table2 t2 ON t1.Col1 = t2.Col1
WHERE @Param1 = 0
   OR (t2.Col2 = 'AAA' AND t2.t3 <> 0)

SQL nested query

I have a table like below
id name dependency
-----------------------
1 xxxx 0
2 yyyy 1
3 zzzz 2
4 aaaaaa 0
5 bbbbbb 4
6 cccccc 5
The list goes on. I want to select a group of rows from this table by giving the name of the 0-dependency row in the WHERE clause, continuing until there is no more dependency. (For example, rows 1, 2, 3 form one group, and rows 4, 5, 6 another.) Please help.
Since you did not specify a product, I'll go with features available in the SQL specification. In this case, I'm using a common table expression, which is supported by many database products, including SQL Server 2005+ and Oracle (but not MySQL):
With MyDependents As
(
    Select id, name, 0 As level
    From MyTable
    Where dependency = 0
      And name = 'some value'
    Union All
    Select T.id, T.name, D.level + 1
    From MyDependents As D
    Join MyTable As T
        On T.dependency = D.id
)
Select id, name, level
From MyDependents
Another solution, which does not rely on common table expressions but does assume a maximum level of depth (in this case two levels below level 0), would be something like:
Select T1.id, T1.name, 0 As level
From MyTable As T1
Where T1.name = 'some value'
Union All
Select T2.id, T2.name, 1
From MyTable As T1
Join MyTable As T2
    On T2.Dependency = T1.Id
Where T1.name = 'some value'
Union All
Select T3.id, T3.name, 2
From MyTable As T1
Join MyTable As T2
    On T2.Dependency = T1.Id
Join MyTable As T3
    On T3.Dependency = T2.Id
Where T1.name = 'some value'
Sounds like you want to recursively query your table, for which you will need a Common Table Expression (CTE).
This MSDN article explains CTEs very well. They are confusing at first but surprisingly easy to implement.
BTW, this is obviously only for SQL Server; I'm not sure how you'd achieve that in MySQL.
This is the first thing that came to mind. It can probably be done more directly/succinctly; I'll try to dwell on it a little.
SELECT *
FROM table T1
WHERE T1.id >=
    (SELECT T2.id FROM table T2 WHERE T2.name = '---NAME HERE---')
  AND T1.id <
    (SELECT MIN(T3.id)
     FROM table T3
     WHERE T3.dependency = 0
       AND T3.id > (SELECT T4.id FROM table T4 WHERE T4.name = '---NAME HERE---'))
If you can estimate a max depth, this works out to something like:
SELECT
COALESCE(t4.field1, t3.field1, t2.field1, t1.field1, t.field1),
COALESCE(t4.field2, t3.field2, t2.field2, t1.field2, t.field2),
COALESCE(t4.field3, t3.field3, t2.field3, t1.field3, t.field3),
....
FROM table AS t
LEFT JOIN table AS t1 ON t.dependency = t1.id
LEFT JOIN table AS t2 ON t1.dependency = t2.id
LEFT JOIN table AS t3 ON t2.dependency = t3.id
LEFT JOIN table AS t4 ON t3.dependency = t4.id
....
This is a wild guess, just to be different, but I think it's kind of pretty anyway. And it's at least as portable as any of the others. But I don't want to look too closely; I'd want to use sensible data, start testing, and check for sensible results.
A hierarchical query (Oracle syntax) will do:
SELECT *
FROM your_table
START WITH id = :id_of_group_header_row
CONNECT BY dependency = PRIOR id
The query works like this:
1. Select all rows satisfying the START WITH condition (these rows are now the roots).
2. Select all rows satisfying the CONNECT BY condition; the keyword PRIOR means that column's value is taken from the root row.
3. Consider the rows selected in step 2 to be the new roots.
4. Go to step 2 until there are no more rows.
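As a worked example against the sample data from the question (assuming the group header is the row with id = 1):
SELECT *
FROM your_table
START WITH id = 1
CONNECT BY dependency = PRIOR id
-- Returns rows 1 (xxxx), 2 (yyyy) and 3 (zzzz),
-- since row 2 depends on row 1 and row 3 depends on row 2.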