Is there a way to make this query more efficient performance wise? - sql

This query takes a long time to run on an MS SQL Server 2008 database with 70 GB of data.
If I run the two WHERE clauses separately, it takes a lot less time.
EDIT - I need to change the 'select *' to 'delete' afterwards, so please keep that in mind when answering. Thanks :)
select *
From computers
Where Name in
(
select T2.Name
from
(
select Name
from computers
group by Name
having COUNT(*) > 1
) T3
join computers T2 on T3.Name = T2.Name
left join policyassociations PA on T2.PK = PA.EntityId
where (T2.EncryptionStatus = 0 or T2.EncryptionStatus is NULL) and
(PA.EntityType <> 1 or PA.EntityType is NULL)
)
OR
ClientId in
(
select substring(ClientID,11,100)
from computers
)

Swapping IN for EXISTS will help.
Also, as per Gordon's answer: UNION can out-perform OR.
SELECT computers.*
FROM computers
LEFT
JOIN policyassociations
ON policyassociations.entityid = computers.pk
WHERE (
computers.encryptionstatus = 0
OR computers.encryptionstatus IS NULL
)
AND (
policyassociations.entitytype <> 1
OR policyassociations.entitytype IS NULL
)
AND EXISTS (
SELECT name
FROM (
SELECT name
FROM computers
GROUP
BY name
HAVING Count(*) > 1
) As duplicate_computers
WHERE name = computers.name
)
UNION
SELECT *
FROM computers As c
WHERE EXISTS (
SELECT SubString(clientid, 11, 100)
FROM computers
WHERE SubString(clientid, 11, 100) = c.clientid
)
You've now updated your question asking to make this a delete.
Well the good news is that instead of the "OR" you just make two DELETE statements:
-- T-SQL requires the target table to be named before FROM when the DELETE uses a JOIN
DELETE computers
FROM computers
LEFT
JOIN policyassociations
ON policyassociations.entityid = computers.pk
WHERE (
computers.encryptionstatus = 0
OR computers.encryptionstatus IS NULL
)
AND (
policyassociations.entitytype <> 1
OR policyassociations.entitytype IS NULL
)
AND EXISTS (
SELECT name
FROM (
SELECT name
FROM computers
GROUP
BY name
HAVING Count(*) > 1
) As duplicate_computers
WHERE name = computers.name
)
;
-- T-SQL requires the alias to be named before FROM when the target table is aliased
DELETE c
FROM computers As c
WHERE EXISTS (
SELECT SubString(clientid, 11, 100)
FROM computers
WHERE SubString(clientid, 11, 100) = c.clientid
)
;
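As a rough check of the OR-to-UNION rewrite, here is a minimal sketch that uses Python's built-in sqlite3 module rather than SQL Server. The six sample rows are invented for illustration, the policyassociations join is left out to keep it short, and SQLite's substr stands in for SUBSTRING.

```python
import sqlite3

# Hypothetical miniature of the computers table; the rows are made up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE computers (PK INTEGER PRIMARY KEY, Name TEXT, ClientId TEXT);
    INSERT INTO computers (PK, Name, ClientId) VALUES
        (1, 'alpha', 'zzz'),
        (2, 'alpha', 'PREFIX0000abc'),  -- characters 11+ are 'abc'
        (3, 'beta',  'abc'),
        (4, 'gamma', 'gamma-id'),
        (5, 'gamma', 'other'),
        (6, 'delta', 'solo');
""")

# Original shape: two IN predicates combined with OR.
rows_or = conn.execute("""
    SELECT PK FROM computers
    WHERE Name IN (SELECT Name FROM computers GROUP BY Name HAVING COUNT(*) > 1)
       OR ClientId IN (SELECT substr(ClientId, 11, 100) FROM computers)
    ORDER BY PK
""").fetchall()

# Rewritten shape: one query per predicate, EXISTS instead of IN, combined with UNION.
rows_union = conn.execute("""
    SELECT c.PK FROM computers AS c
    WHERE EXISTS (
        SELECT 1
        FROM (SELECT Name FROM computers GROUP BY Name HAVING COUNT(*) > 1) AS dup
        WHERE dup.Name = c.Name
    )
    UNION
    SELECT c.PK FROM computers AS c
    WHERE EXISTS (
        SELECT 1 FROM computers AS s
        WHERE substr(s.ClientId, 11, 100) = c.ClientId
    )
    ORDER BY PK
""").fetchall()

print(rows_or)
print(rows_union)
```

Both forms select the same rows here. For the DELETE variant, keep in mind that the second statement runs against the table as the first statement left it, so two sequential DELETEs are not guaranteed to remove exactly the rows that the single OR query selects.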

Some things I would look at:
1. Are indexes in place?
2. 'IN' can slow your query; try replacing it with joins.
3. Consider using the column name ('Name' in this case) instead of * in COUNT(*).
4. Select only the data you need by listing particular columns.
Hope this helps!

OR can be poorly optimized sometimes. In this case, you can just split the query into two subqueries, and combine them using UNION:
select *
From computers
Where Name in
(
select T2.Name
from
(
select Name
from computers
group by Name
having COUNT(*) > 1
) T3
join computers T2 on T3.Name = T2.Name
left join policyassociations PA on T2.PK = PA.EntityId
where (T2.EncryptionStatus = 0 or T2.EncryptionStatus is NULL) and
(PA.EntityType <> 1 or PA.EntityType is NULL)
)
UNION
select *
From computers
WHERE ClientId in
(
select substring(ClientID,11,100)
from computers
);
You might also be able to improve performance by replacing the subqueries with explicit joins. However, this seems like the shortest route to better performance.
EDIT:
I think the version with joins is:
select c.*
From computers c left outer join
(select c.Name
from (select c.*, count(*) over (partition by Name) as cnt
from computers c
) c left join
policyassociations PA
on c.PK = PA.EntityId and PA.EntityType <> 1
where (c.EncryptionStatus = 0 or c.EncryptionStatus is NULL) and
c.cnt > 1
) cpa
on c.Name = cpa.Name left outer join
(select substring(ClientID, 11, 100) as ClientId
from computers
) csub
on c.ClientId = csub.ClientId
Where cpa.Name is not null or csub.ClientId is not null;

Related

SELECT NOT IN with multiple columns in subquery

Regarding the statement below, sltrxid can exist as both ardoccrid and ardocdbid. I'm wanting to know how to include both in the NOT IN subquery.
SELECT *
FROM glsltransaction A
INNER JOIN cocustomer B ON A.acctid = B.customerid
WHERE sltrxstate = 4
AND araccttype = 1
AND sltrxid NOT IN(
SELECT ardoccrid,ardocdbid
FROM arapplyitem)
I would recommend not exists:
SELECT *
FROM glsltransaction t
INNER JOIN cocustomer c ON c.customerid = t.acctid
WHERE
??.sltrxstate = 4
AND ??.araccttype = 1
AND NOT EXISTS (
SELECT 1
FROM arapplyitem a
WHERE ??.sltrxid IN (a.ardoccrid, a.ardocdbid)
)
Note that I changed the table aliases to things that are more meaningful. I would strongly recommend prefixing the column names with the table they belong to, so the query is unambiguous - in absence of any indication, I represented this as ?? in the query.
IN sometimes optimizes poorly. There are situations where two subqueries are more efficient:
SELECT *
FROM glsltransaction t
INNER JOIN cocustomer c ON c.customerid = t.acctid
WHERE
??.sltrxstate = 4
AND ??.araccttype = 1
AND NOT EXISTS (
SELECT 1
FROM arapplyitem a
WHERE ??.sltrxid = a.ardoccrid
)
AND NOT EXISTS (
SELECT 1
FROM arapplyitem a
WHERE ??.sltrxid = a.ardocdbid
)
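To compare the two NOT EXISTS forms side by side, here is a small sketch using Python's sqlite3 module. The tables are cut down to the columns involved and the sample rows are invented. It also shows one reason NOT EXISTS is the safer choice: a single NULL in the subquery column makes a plain NOT IN return no rows at all.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical, reduced versions of the tables from the question.
    CREATE TABLE glsltransaction (sltrxid INTEGER PRIMARY KEY);
    CREATE TABLE arapplyitem (ardoccrid INTEGER, ardocdbid INTEGER);
    INSERT INTO glsltransaction (sltrxid) VALUES (1), (2), (3), (4);
    INSERT INTO arapplyitem (ardoccrid, ardocdbid) VALUES (1, 10), (20, 2), (NULL, 30);
""")

# One NOT EXISTS that checks both columns with IN.
single = conn.execute("""
    SELECT t.sltrxid FROM glsltransaction t
    WHERE NOT EXISTS (
        SELECT 1 FROM arapplyitem a
        WHERE t.sltrxid IN (a.ardoccrid, a.ardocdbid)
    )
    ORDER BY t.sltrxid
""").fetchall()

# Two NOT EXISTS subqueries, one per column.
split = conn.execute("""
    SELECT t.sltrxid FROM glsltransaction t
    WHERE NOT EXISTS (SELECT 1 FROM arapplyitem a WHERE t.sltrxid = a.ardoccrid)
      AND NOT EXISTS (SELECT 1 FROM arapplyitem a WHERE t.sltrxid = a.ardocdbid)
    ORDER BY t.sltrxid
""").fetchall()

# Plain NOT IN against a column that contains a NULL: the comparison is never
# true for any row, so nothing comes back.
not_in = conn.execute("""
    SELECT sltrxid FROM glsltransaction
    WHERE sltrxid NOT IN (SELECT ardoccrid FROM arapplyitem)
    ORDER BY sltrxid
""").fetchall()

print(single, split, not_in)
```

Transactions 1 and 2 are filtered out because each appears in one of the two columns; 3 and 4 survive in both NOT EXISTS forms.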

Is it possible to replace a cross apply with a join?

I am reverse engineering some legacy SQL algorithms to move to Apache Spark.
I have encountered a CROSS APPLY, which I understand is T-SQL specific and has no direct equivalent in ANSI or Spark SQL.
The sanitized algorithm is:
SELECT
Id_P ,
Monthindex ,
(
SELECT
100 * (STDEV(ResEligible.num_valid) / AVG(ResEligible.num_valid)) AS Pre_Coef_Var
FROM
tbl_p a CROSS APPLY
(
SELECT
e.Monthindex ,
e.num AS num_valid
FROM
dbo.tbl_p e
WHERE
e.Monthindex = a.MonthIndex
AND e.Id_P = a.Id_P
UNION ALL
SELECT DISTINCT
B1.[MonthIndex ] ,
Tr.num AS num_valid
FROM
#tbl_pr B1
INNER JOIN
#tbl_pr B2
ON
B1.[Id_P] = B2.[Id_P]
AND B2.Rang - B1.Rang BETWEEN 0 AND 2
INNER JOIN
dbo.tbl_p Tr
ON
Tr.Id_P = B1.Id_P
AND Tr.Monthindex = B1.Monthindex
WHERE
a.Id_P = B1.[Id_P]
AND B2.[MonthIndex] =
(
SELECT
MAX([MonthIndex])
FROM
#tbl_pr
WHERE
[MonthIndex] < a.MonthIndex
AND [Id_P] = a.Id_P) ) AS ResEligible
WHERE
a.Id_P = result.Id_P
AND a.MonthIndex = result.MonthIndex) AS Coeff
FROM
tbl_p AS result
WHERE
1 = 1
AND MonthIndex = @CurrentMonth
GROUP BY
Id_P ,
Monthindex) AS CC
So for every row in alias a we cross apply to the inner queries.
Is it possible to re-write the cross apply in terms of join operations (or otherwise) so I can re-implement in spark sql?
Cheers
Terry
Seems like you could rewrite your query as the below:
SELECT T1.col1,
T1.col2,
sq.col3Sum
FROM tbl1 T1
CROSS JOIN (SELECT SUM(T1sq.Col3) AS col3Sum
FROM tbl1 T1sq
JOIN tbl2 T2 ON T1sq.Col1 = T2.Col2
JOIN tbl3 T3 ON T2.col1 = T3.Col1) sq;
Seems odd, however, that there was no JOIN criteria between the 2 references to tbl1.
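The general idea of trading a per-row correlated computation for a join plus GROUP BY can be sketched as follows. This uses Python's sqlite3 module, and since SQLite has no CROSS APPLY, a correlated scalar subquery stands in for it. The running-average calculation and the sample rows are assumptions chosen for illustration, not the algorithm from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical sample data; (Id_P, MonthIndex) is assumed to be unique.
    CREATE TABLE tbl_p (Id_P INTEGER, MonthIndex INTEGER, num REAL,
                        PRIMARY KEY (Id_P, MonthIndex));
    INSERT INTO tbl_p VALUES (1, 1, 10), (1, 2, 20), (1, 3, 30), (2, 1, 5);
""")

# Correlated form: the inner query is evaluated once per outer row,
# much as an APPLY would be.
correlated = conn.execute("""
    SELECT r.Id_P, r.MonthIndex,
           (SELECT AVG(e.num) FROM tbl_p e
             WHERE e.Id_P = r.Id_P AND e.MonthIndex <= r.MonthIndex) AS avg_num
    FROM tbl_p r
    ORDER BY r.Id_P, r.MonthIndex
""").fetchall()

# Join form: the correlation predicates move into the ON clause and the
# outer row's key becomes the GROUP BY key.
joined = conn.execute("""
    SELECT r.Id_P, r.MonthIndex, AVG(e.num) AS avg_num
    FROM tbl_p r
    JOIN tbl_p e ON e.Id_P = r.Id_P AND e.MonthIndex <= r.MonthIndex
    GROUP BY r.Id_P, r.MonthIndex
    ORDER BY r.Id_P, r.MonthIndex
""").fetchall()

print(correlated)
print(joined)
```

The two forms agree here because every outer row matches at least one inner row and the grouping key is unique. If an outer row could have no match, the inner join would drop it, and a LEFT JOIN would be needed to keep it.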

How to improve sql script performance

The following script is very slow when it is run.
I have no idea how to improve its performance.
Even as a view, it takes many minutes.
If you have any ideas, please share them with me.
SELECT DISTINCT
( id )
FROM ( SELECT DISTINCT
ct.id AS id
FROM [Customer].[dbo].[Contact] ct
LEFT JOIN [Customer].[dbo].[Customer_ids] hnci ON ct.id = hnci.contact_id
WHERE hnci.customer_id IN (
SELECT DISTINCT
( [Customer_ID] )
FROM [Transactions].[dbo].[Transaction_Header]
WHERE actual_transaction_date > '20120218' )
UNION
SELECT DISTINCT
contact_id AS id
FROM [Customer].[dbo].[Restaurant_Attendance]
WHERE ( created > '2012-02-18 00:00:00.000'
OR modified > '2012-02-18 00:00:00.000'
)
AND ( [Fifth_Floor_London] = 1
OR [Fourth_Floor_Leeds] = 1
OR [Second_Floor_Bristol] = 1
)
UNION
SELECT DISTINCT
( ct.id )
FROM [Customer].[dbo].[Contact] ct
INNER JOIN [Customer].[dbo].[Wifinity_Devices] wfd ON ct.wifinity_uniqueID = wfd.[CustomerUniqueID]
AND startconnection > '2012-02-17'
UNION
SELECT DISTINCT
comdt.id AS id
FROM [Customer].[dbo].[Complete_dataset] comdt
LEFT JOIN [Customer].[dbo].[Aggregate_Spend_Counts] agsc ON comdt.id = agsc.contact_id
WHERE agsc.contact_id IS NULL
AND ( opt_out_Mail <> 1
OR opt_out_email <> 1
OR opt_out_SMS <> 1
OR opt_out_Mail IS NULL
OR opt_out_email IS NULL
OR opt_out_SMS IS NULL
)
AND ( address_1 IS NOT NULL
OR email IS NOT NULL
OR mobile IS NOT NULL
)
UNION
SELECT DISTINCT
( contact_id ) AS id
FROM [Customer].[dbo].[VIP_Card_Holders]
WHERE VIP_Card_number IS NOT NULL
) AS tbl
Wow, where to start...
--this distinct does nothing. Union is already distinct
--SELECT DISTINCT
-- ( id )
--FROM (
SELECT DISTINCT [Customer_ID] as ID
FROM [Transactions].[dbo].[Transaction_Header]
where actual_transaction_date > '20120218'
UNION
SELECT
contact_id AS id
FROM [Customer].[dbo].[Restaurant_Attendance]
-- not sure that you are getting the date range you want. Should these be >=
-- if you want everything that occurred on the 18th or after you want >= '2012-02-18 00:00:00.000'
-- if you want everything that occurred on the 19th or after you want >= '2012-02-19 00:00:00.000'
-- the way you have it now, you will get everything on the 18th unless it happened exactly at midnight
WHERE ( created > '2012-02-18 00:00:00.000'
OR modified > '2012-02-18 00:00:00.000'
)
AND ( [Fifth_Floor_London] = 1
OR [Fourth_Floor_Leeds] = 1
OR [Second_Floor_Bristol] = 1
)
-- all of this does nothing because we already have every id in the contact table from the first query
-- UNION
-- SELECT
-- ( ct.id )
-- FROM [Customer].[dbo].[Contact] ct
-- INNER JOIN [Customer].[dbo].[Wifinity_Devices] wfd ON ct.wifinity_uniqueID = wfd.[CustomerUniqueID]
-- AND startconnection > '2012-02-17'
UNION
-- cleaned this up with isnull function and coalesce
SELECT
comdt.id AS id
FROM [Customer].[dbo].[Complete_dataset] comdt
LEFT JOIN [Customer].[dbo].[Aggregate_Spend_Counts] agsc ON comdt.id = agsc.contact_id
WHERE agsc.contact_id IS NULL
AND ( isnull(opt_out_Mail,0) <> 1
OR isnull(opt_out_email,0) <> 1
OR isnull(opt_out_SMS,0) <> 1
)
AND coalesce(address_1 , email, mobile) IS NOT NULL
UNION
SELECT
( contact_id ) AS id
FROM [Customer].[dbo].[VIP_Card_Holders]
WHERE VIP_Card_number IS NOT NULL
-- ) AS tbl
WHERE EXISTS is generally faster than IN as well.
OR conditions are generally slower too; use more UNION statements instead.
And learn to use left joins correctly. If you have a WHERE condition (other than WHERE id IS NULL) on the table on the right side of a left join, it effectively converts the join to an inner join. If this is not what you want, then your code is currently giving you an incorrect result set.
See http://wiki.lessthandot.com/index.php/WHERE_conditions_on_a_LEFT_JOIN for an explanation of how to fix.
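The left join point can be seen on a toy example. This is a minimal sketch using Python's sqlite3 module; the contact and customer_ids tables are reduced, hypothetical versions of the ones in the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE contact (id INTEGER PRIMARY KEY);
    CREATE TABLE customer_ids (contact_id INTEGER, customer_id INTEGER);
    INSERT INTO contact (id) VALUES (1), (2), (3);
    INSERT INTO customer_ids (contact_id, customer_id) VALUES (1, 100), (2, 200);
""")

def run(sql):
    """Run a query and return all rows as a list of tuples."""
    return conn.execute(sql).fetchall()

# Filter on the right-hand table in WHERE: the NULL-extended rows are
# discarded, so the result matches an inner join.
left_where = run("""
    SELECT ct.id, h.customer_id
    FROM contact ct
    LEFT JOIN customer_ids h ON ct.id = h.contact_id
    WHERE h.customer_id = 100
    ORDER BY ct.id
""")

inner = run("""
    SELECT ct.id, h.customer_id
    FROM contact ct
    JOIN customer_ids h ON ct.id = h.contact_id
    WHERE h.customer_id = 100
    ORDER BY ct.id
""")

# Same filter moved into the ON clause: every contact is kept, with NULL
# where no matching customer_ids row exists.
left_on = run("""
    SELECT ct.id, h.customer_id
    FROM contact ct
    LEFT JOIN customer_ids h ON ct.id = h.contact_id AND h.customer_id = 100
    ORDER BY ct.id
""")

print(left_where, inner, left_on)
```

Which of the two results is the right one depends on what the query is meant to return; the sketch only shows that the placement of the condition changes the answer.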
As stated in a comment, optimize one query at a time. See which one takes the longest and focus on that one.
UNION removes duplicates, so you don't need DISTINCT on the individual queries.
For your first query I would try this:
The LEFT JOIN is negated by the WHERE hnci.customer_id IN (...) condition, so you might as well use an inner join.
The sub-query is inefficient because the IN may not be able to use an index.
The query optimizer does not know in advance what IN (SELECT ...) will return, so it may not be able to optimize index use.
SELECT ct.id AS id
FROM [Customer].[dbo].[Contact] ct
JOIN [Customer].[dbo].[Customer_ids] hnci
ON ct.id = hnci.contact_id
JOIN [Transactions].[dbo].[Transaction_Header] th
on hnci.customer_id = th.[Customer_ID]
and th.actual_transaction_date > '20120218'
On that second join, the query optimizer can choose which condition to apply first. Let's say [Customer].[dbo].[Customer_ids].[customer_id] and [Transactions].[dbo].[Transaction_Header].[Customer_ID] each have indexes. The query optimizer has the option to apply that join before filtering on [Transactions].[dbo].[Transaction_Header].[actual_transaction_date].
If [actual_transaction_date] is not indexed, then it would most likely do the ID join first.
With your IN (SELECT ...), the query optimizer has little choice but to apply actual_transaction_date > '20120218' first. OK, sometimes the query optimizer is smart enough to use an index across the IN boundary, but why make it hard for the query optimizer? I have found that the query optimizer makes better decisions if you make the decisions easier.
A join on a sub-query has the same problem: you take options away from the query optimizer. Give the query optimizer room to breathe.
Try this; a temp table should help you:
IF OBJECT_ID('Tempdb..#Temp1') IS NOT NULL
DROP TABLE #Temp1
--Low performance because of using "WHERE hnci.customer_id IN ( .... ) " - a loop join must be used,
--and this WHERE condition is applied to both tables after the left join,
--so the result will be the same as with two inner joins but with bad performance
--SELECT DISTINCT
-- ct.id AS id
--INTO #temp1
--FROM [Customer].[dbo].[Contact] ct
-- LEFT JOIN [Customer].[dbo].[Customer_ids] hnci ON ct.id = hnci.contact_id
--WHERE hnci.customer_id IN (
-- SELECT DISTINCT
-- ( [Customer_ID] )
-- FROM [Transactions].[dbo].[Transaction_Header]
-- WHERE actual_transaction_date > '20120218' )
--------------------------------------------------------------------------------
--this will give the same result but with better performance than the previous one
--------------------------------------------------------------------------------
SELECT DISTINCT
ct.id AS id
INTO #temp1
FROM [Customer].[dbo].[Contact] ct
JOIN [Customer].[dbo].[Customer_ids] hnci ON ct.id = hnci.contact_id
JOIN ( SELECT DISTINCT
( [Customer_ID] )
FROM [Transactions].[dbo].[Transaction_Header]
WHERE actual_transaction_date > '20120218'
) T ON hnci.customer_id = T.[Customer_ID]
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
INSERT INTO #temp1
( id
)
SELECT DISTINCT
contact_id AS id
FROM [Customer].[dbo].[Restaurant_Attendance]
WHERE ( created > '2012-02-18 00:00:00.000'
OR modified > '2012-02-18 00:00:00.000'
)
AND ( [Fifth_Floor_London] = 1
OR [Fourth_Floor_Leeds] = 1
OR [Second_Floor_Bristol] = 1
)
INSERT INTO #temp1
( id
)
SELECT DISTINCT
( ct.id )
FROM [Customer].[dbo].[Contact] ct
INNER JOIN [Customer].[dbo].[Wifinity_Devices] wfd ON ct.wifinity_uniqueID = wfd.[CustomerUniqueID]
AND startconnection > '2012-02-17'
INSERT INTO #temp1
( id
)
SELECT DISTINCT
comdt.id AS id
FROM [Customer].[dbo].[Complete_dataset] comdt
LEFT JOIN [Customer].[dbo].[Aggregate_Spend_Counts] agsc ON comdt.id = agsc.contact_id
WHERE agsc.contact_id IS NULL
AND ( opt_out_Mail <> 1
OR opt_out_email <> 1
OR opt_out_SMS <> 1
OR opt_out_Mail IS NULL
OR opt_out_email IS NULL
OR opt_out_SMS IS NULL
)
AND ( address_1 IS NOT NULL
OR email IS NOT NULL
OR mobile IS NOT NULL
)
INSERT INTO #temp1
( id
)
SELECT DISTINCT
( contact_id ) AS id
FROM [Customer].[dbo].[VIP_Card_Holders]
WHERE VIP_Card_number IS NOT NULL
SELECT DISTINCT
id
FROM #temp1 AS T

Replace no result

I have a query like this:
SELECT TV.Descrizione as TipoVers,
sum(ImportoVersamento) as ImpTot,
count(*) as N,
month(DataAllibramento) as Mese
FROM PROC_Versamento V
left outer join dbo.PROC_TipoVersamento TV
on V.IDTipoVersamento = TV.IDTipoVersamento
inner join dbo.PROC_PraticaRiscossione PR
on V.IDPraticaRiscossioneAssociata = PR.IDPratica
inner join dbo.DA_Avviso A
on PR.IDDatiAvviso = A.IDAvviso
where DataAllibramento between '2012-09-08' and '2012-09-17' and A.IDFornitura = 4
group by V.IDTipoVersamento,month(DataAllibramento),TV.Descrizione
order by V.IDTipoVersamento,month(DataAllibramento)
This query must always return something. If no result is produced, a
0 0 0 0
row must be returned. How can I do this? Using ISNULL on every selected field doesn't help.
Use a derived table with one row and do an outer apply to your other table / query.
Here is a sample with a table variable @T in place of your real table.
declare @T table
(
ID int,
Grp int
)
select isnull(Q.MaxID, 0) as MaxID,
isnull(Q.C, 0) as C
from (select 1) as T(X)
outer apply (
-- Your query goes here
select max(ID) as MaxID,
count(*) as C
from @T
group by Grp
) as Q
order by Q.C -- order by goes to the outer query
That will make sure you have always at least one row in the output.
Something like this using your query.
select isnull(Q.TipoVers, '0') as TipoVers,
isnull(Q.ImpTot, 0) as ImpTot,
isnull(Q.N, 0) as N,
isnull(Q.Mese, 0) as Mese
from (select 1) as T(X)
outer apply (
SELECT TV.Descrizione as TipoVers,
sum(ImportoVersamento) as ImpTot,
count(*) as N,
month(DataAllibramento) as Mese,
V.IDTipoVersamento
FROM PROC_Versamento V
left outer join dbo.PROC_TipoVersamento TV
on V.IDTipoVersamento = TV.IDTipoVersamento
inner join dbo.PROC_PraticaRiscossione PR
on V.IDPraticaRiscossioneAssociata = PR.IDPratica
inner join dbo.DA_Avviso A
on PR.IDDatiAvviso = A.IDAvviso
where DataAllibramento between '2012-09-08' and '2012-09-17' and A.IDFornitura = 4
group by V.IDTipoVersamento,month(DataAllibramento),TV.Descrizione
) as Q
order by Q.IDTipoVersamento, Q.Mese
Use COALESCE. It returns the first non-null value. E.g.
SELECT COALESCE(TV.Desc, 0)...
Will return 0 if TV.DESC is NULL.
You can try:
with dat as (select TV.[Desc] as TipyDesc, sum(Import) as ToImp, count(*) as N, month(Date) as Mounth
from /*DATA SOURCE HERE*/ as TV
group by [Desc], month(Date))
select [TipyDesc], ToImp, N, Mounth from dat
union all
select '0', 0, 0, 0 where (select count (*) from dat)=0
That should do what you want...
If it's ok to include the "0 0 0 0" row in a result set that has data, you can use a union:
SELECT TV.Desc as TipyDesc,
sum(Import) as TotImp,
count(*) as N,
month(Date) as Mounth
...
UNION
SELECT
0,0,0,0
Depending on the database, you may need a FROM for the second SELECT. In Oracle, this would be "FROM DUAL". For MySQL, no FROM is necessary
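The "default row only when the result is empty" pattern can be sketched with Python's sqlite3 module as below. The payments table, its columns, and the min_amount parameter are all hypothetical stand-ins for the real query; the structure follows the CTE plus UNION ALL approach, with NOT EXISTS in place of the count check.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical stand-in for the real tables.
    CREATE TABLE payments (kind TEXT, amount INTEGER);
    INSERT INTO payments (kind, amount) VALUES ('a', 10), ('a', 5), ('b', 7);
""")

def summary(min_amount):
    """Return one row per kind, or a single all-zero row if nothing matches."""
    sql = """
        WITH dat AS (
            SELECT kind, SUM(amount) AS total, COUNT(*) AS n
            FROM payments
            WHERE amount >= ?
            GROUP BY kind
        )
        SELECT kind, total, n FROM dat
        UNION ALL
        -- The placeholder row is emitted only when dat is empty.
        SELECT '0', 0, 0 WHERE NOT EXISTS (SELECT 1 FROM dat)
        ORDER BY kind
    """
    return conn.execute(sql, (min_amount,)).fetchall()

print(summary(0))     # groups exist, no placeholder row
print(summary(1000))  # nothing matches, placeholder row only
```

Unlike a plain UNION with a constant row, this keeps the zero row out of result sets that already contain data.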

MySQL/SQL - When are the results of a sub-query available?

Suppose I have this query
SELECT * FROM (
SELECT * FROM table_a
WHERE id > 10 )
AS a_results LEFT JOIN
(SELECT * from table_b
WHERE id IN
(SElECT id FROM a_results)
ON (a_results.id = b_results.id)
I would get the error "a_results is not a table". Is there any way I could re-use the results of the subquery?
Edit: It has been noted that this query doesn't make sense...it doesn't, yes. This is just to illustrate the question which I am asking; the 'real' query actually looks something like this:
SELECT SQL_CALC_FOUND_ROWS * FROM
( SELECT wp_pod_tbl_hotel . *
FROM wp_pod_tbl_hotel, wp_pod_rel, wp_pod
WHERE wp_pod_rel.field_id =12
AND wp_pod_rel.tbl_row_id =1
AND wp_pod.tbl_row_id = wp_pod_tbl_hotel.id
AND wp_pod_rel.pod_id = wp_pod.id
) as
found_hotel LEFT JOIN (
SELECT COUNT(*) as review_count, avg( (
location_rating + staff_performance_rating + condition_rating + room_comfort_rating + food_rating + value_rating
) /6 ) AS average_score, hotelid
FROM (
SELECT r. * , wp_pod_rel.tbl_row_id AS hotelid
FROM wp_pod_tbl_review r, wp_pod_rel, wp_pod
WHERE wp_pod_rel.field_id =11
AND wp_pod_rel.pod_id = wp_pod.id
AND r.id = wp_pod.tbl_row_id
AND wp_pod_rel.tbl_row_id
IN (
SELECT wp_pod_tbl_hotel .id
FROM wp_pod_tbl_hotel, wp_pod_rel, wp_pod
WHERE wp_pod_rel.field_id =12
AND wp_pod_rel.tbl_row_id =1
AND wp_pod.tbl_row_id = wp_pod_tbl_hotel.id
AND wp_pod_rel.pod_id = wp_pod.id
)
) AS hotel_reviews
GROUP BY hotel_reviews.hotelid
ORDER BY average_score DESC
AS sorted_hotel ON (id = sorted_hotel.hotelid)
As you can see, the sub-query that makes up the found_hotel table is repeated further down as another sub-query, so I was hoping to re-use the results.
You can not use a sub-query like this.
I'm not sure I understand your query, but wouldn't that be sufficient?
SELECT * FROM table_a a
LEFT JOIN table_b b ON ( b.id = a.id )
WHERE a.id > 10
It would return all rows from table_a where id > 10 and LEFT JOIN rows from table_b where id matches.
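If the goal really is to name a sub-query once and refer to it in several places, a common table expression (WITH) is one option. The sketch below uses Python's sqlite3 module with invented sample rows; MySQL supports WITH from version 8.0 onward, and on older versions a temporary table would be needed instead.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical sample data for the two tables in the question.
    CREATE TABLE table_a (id INTEGER PRIMARY KEY, label TEXT);
    CREATE TABLE table_b (id INTEGER PRIMARY KEY, note TEXT);
    INSERT INTO table_a (id, label) VALUES (5, 'low'), (11, 'x'), (12, 'y');
    INSERT INTO table_b (id, note) VALUES (5, 'b5'), (11, 'b11'), (99, 'b99');
""")

# a_results is defined once and referenced twice: as the left side of the
# join and inside the IN filter on table_b.
rows = conn.execute("""
    WITH a_results AS (
        SELECT id, label FROM table_a WHERE id > 10
    )
    SELECT a.id, a.label, b.note
    FROM a_results AS a
    LEFT JOIN (
        SELECT id, note FROM table_b
        WHERE id IN (SELECT id FROM a_results)
    ) AS b ON a.id = b.id
    ORDER BY a.id
""").fetchall()

print(rows)
```

Whether the database computes the CTE once or inlines it at each reference is up to the engine, so this is about avoiding repetition in the query text rather than a guaranteed performance gain.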