This query takes a lot of time to execute, how can I optimize it? - sql

I have used REPLACE to clean up the fields, but it's taking hours to execute.
select c.*
from
(
    select distinct a.*, b.*
    from
    (
        --Table 1
        select replace(replace(replace(replace(AGENCY_NAME,' ',''),'-',''),'/',''),'\','')
            as agency_name,
            LEN(replace(replace(replace(replace(AGENCY_NAME,' ',''),'-',''),'/',''),'\',''))
            as agency_len
        from dbo.tbl_stars_agency
        where replace(replace(replace(replace(AGENCY_NAME,' ',''),'-',''),'/',''),'\','')
            not in ('')
    ) a
    inner join
    (
        --Table 2
        select replace(replace(replace(replace(RESPONDENT_NAME_PER,' ',''),'-',''),'/',''),'\','')
            as respondent_name,
            LEN(replace(replace(replace(replace(RESPONDENT_NAME_PER,' ',''),'-',''),'/',''),'\',''))
            as respondent_len
        from dbo.TBL_cacs_ecb
        where replace(replace(replace(replace(RESPONDENT_NAME_PER,' ',''),'-',''),'/',''),'\','')
            not in ('')
    ) b
        on substring(a.agency_name,1,15) = substring(b.respondent_name,1,15)
) c
inner join
(
    --Table 3
    select replace(replace(replace(replace(NM_ENTITY,' ',''),'-',''),'/',''),'\','')
        as nm_entity,
        LEN(replace(replace(replace(replace(NM_ENTITY,' ',''),'-',''),'/',''),'\',''))
        as nm_entity_len
    from dbo.RMFS010_TF1NAME
    where replace(replace(replace(replace(NM_ENTITY,' ',''),'-',''),'/',''),'\','')
        not in ('')
) d
    on substring(c.agency_name,1,5) = substring(d.nm_entity,1,5)
    or substring(c.respondent_name,1,5) = substring(d.nm_entity,1,5)
I want to compare the three tables based on the name field in each. I have calculated the lengths and used the SUBSTRING function to match up to 15 places.

One approach would be to add calculated columns and index them:
ALTER TABLE dbo.tbl_stars_agency
ADD agency_len AS LEN(replace(replace(replace(
replace(AGENCY_NAME,' ',''),'-',''),'/',''),'\',''))
PERSISTED;
Now add an index:
CREATE INDEX MyIndex ON dbo.tbl_stars_agency (agency_len);
Now any search on this should be much faster.
You can change this line:
where replace(replace(replace(
replace(AGENCY_NAME,' ',''),'-',''),'/',''),'\','')
not in ('')
to this:
where agency_len > 0
(The original filter keeps only rows whose cleaned-up name is non-empty, which is the same as requiring a non-zero length.)
Try that, and if it helps we can look at doing the same for the other columns.
PS: You need to be certain your expression is watertight and won't fail on bad data (invalid lengths etc.), otherwise it will stop you from inserting and updating records.
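Building on that, here is a hedged sketch of the next step (the column and index names are illustrative, and it assumes the name columns are not varchar(max), which cannot be indexed). Since the first join compares 15-character cleaned prefixes, you can persist exactly that prefix on both tables and index it, turning the join into a plain indexed equality:
ALTER TABLE dbo.tbl_stars_agency
ADD agency_prefix15 AS
    substring(replace(replace(replace(replace(AGENCY_NAME,' ',''),'-',''),'/',''),'\',''), 1, 15)
PERSISTED;

ALTER TABLE dbo.TBL_cacs_ecb
ADD respondent_prefix15 AS
    substring(replace(replace(replace(replace(RESPONDENT_NAME_PER,' ',''),'-',''),'/',''),'\',''), 1, 15)
PERSISTED;

CREATE INDEX IX_stars_agency_prefix15 ON dbo.tbl_stars_agency (agency_prefix15);
CREATE INDEX IX_cacs_ecb_prefix15 ON dbo.TBL_cacs_ecb (respondent_prefix15);

-- The first join can then seek instead of recomputing four REPLACEs per row:
-- ... on a.agency_prefix15 = b.respondent_prefix15
The same pattern applies to the 5-character prefixes used in the join against dbo.RMFS010_TF1NAME.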

Related

Redshift - Extract value matching a condition in Array

I have a Redshift table with the following column
How can I extract the value starting with cat_ from this column, please (there is only one per row, and it appears at a different position in the array)?
I want to get those results:
cat_incident
cat_feature_missing
cat_duplicated_request
Thanks!
There is no easy way to extract multiple values from within one column in SQL (or at least not in the SQL used by Redshift).
You could write a User-Defined Function (UDF) that returns a string containing those values, separated by newlines. Whether this is acceptable depends on what you wish to do with the output (eg JOIN against it).
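As an illustration only (the function name and exact behavior are assumptions; since the question says there is exactly one cat_ value per row, this sketch returns the first match rather than a newline-separated list):
-- Hypothetical scalar Python UDF that pulls the first tag starting with 'cat_'
CREATE OR REPLACE FUNCTION f_extract_cat_tag(tags varchar)
RETURNS varchar
IMMUTABLE
AS $$
    if tags is None:
        return None
    for part in tags.split(','):
        if part.strip().startswith('cat_'):
            return part.strip()
    return None
$$ LANGUAGE plpythonu;

-- usage: SELECT f_extract_cat_tag(tags) FROM my_table;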
Another option is to pre-process the data before it is loaded into Redshift, to put this information in a separate one-to-many table, with each value in its own row. It would then be trivial to return this information.
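For example, the pre-processed layout might look like this (table and column names are purely illustrative):
-- one row per (source row, tag); filtering becomes trivial after loading
CREATE TABLE row_tags
(
    row_id bigint,
    tag    varchar(100)
);

SELECT row_id, tag
FROM row_tags
WHERE tag LIKE 'cat%';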
You can do this using a tally table (a table of numbers). Check this link for information on how to create one: http://www.sqlservercentral.com/articles/T-SQL/62867/
Here is an example of how you would use it. In real life you should replace the temporary #tally table with a permanent one.
--create sample table with data
create table #a (tags varchar(500));
insert into #a
select 'blah,cat_incident,mcr_close_ticket'
union
select 'blah-blah,cat_feature_missing,cat_duplicated_request';
--create tally table
create table #tally(n int);
insert into #tally
select 1
union select 2
union select 3
union select 4
union select 5
;
--get tags
select * from
(
select TRIM(SPLIT_PART(a.tags, ',', t.n)) AS single_tag
from #tally t
inner join #a a ON t.n <= REGEXP_COUNT(a.tags, ',') + 1 and t.n < 1000
) sub -- Redshift requires an alias on a derived table
where single_tag like 'cat%'
;
Thanks!
In the end I managed to do it with the following query:
SELECT SUBSTRING(
           SUBSTRING(tags, charindex('cat_', tags), len(tags)),
           0,
           charindex(',', SUBSTRING(tags, charindex('cat_', tags), len(tags)))
       ) AS tags
FROM table

How to use JOIN with the LIKE operator and then cast columns

I have 2 tables with these columns:
CREATE TABLE #temp
(
Phone_number varchar(100) -- example data: "2022033456"
)
CREATE TABLE orders
(
Addons ntext -- example data: "Enter phone:2022033456<br>Thephoneisvalid"
)
I have to join these two tables using LIKE, as the phone numbers are not in the same format. A little background: I am joining the #temp table on its phone number to the orders table on its Addons value, then in the WHERE condition I try to match them again to narrow the results. Here is my code, but the results I am getting are not accurate: it returns no data, and I don't know what I am doing wrong. I am using SQL Server.
select
*
from
order_no as n
join
orders as o on n.order_no = o.order_no
join
#temp as t on t.phone_number like '%'+ cast(o.Addons as varchar(max))+'%'
where
t.phone_number = '%' + cast(o.Addons as varchar(max)) + '%'
A LIKE match buried in the JOIN condition like this won't work reliably. Please provide more information on your tables. You have to convert the format of one of the phone fields to match the other phone field's format in order to join.
I think your join condition is in the wrong order. Because your question explicitly mentions two tables, let's stick with those:
select *
from orders o JOIN
#temp t
on cast(o.Addons as varchar(max)) like '%' + t.phone_number + '%';
It has been so long since I dealt with the text data type (in SQL Server) that I don't remember whether the cast() is necessary or not.
Instead of trying to do everything in a single top-level query, you should apply a transformation projection to your orders table and use that as a subquery; this makes the query easier to understand.
Using the CHARINDEX function will make this a lot easier. However, CHARINDEX does not support ntext, so you will need to change your schema to use nvarchar(max) instead, which you should do anyway since ntext is deprecated. In the meantime you can use CONVERT( nvarchar(max), someNTextValue ), though performance will suffer because no indexes on the ntext values can be used; this query will run slowly regardless.
SELECT
orders2.*,
CASE WHEN orders2.PhoneStart > 0 AND orders2.PhoneEnd > 0 THEN
-- skip the 12-character 'Enter phone:' marker so only the digits are returned
SUBSTRING( orders2.Addons, orders2.PhoneStart + 12, orders2.PhoneEnd - ( orders2.PhoneStart + 12 ) )
ELSE
NULL
END AS ExtractedPhoneNumber
FROM
(
SELECT
orders.*, -- never use `*` in production, so replace this with the actual columns in your orders table
CHARINDEX('Enter phone:', Addons) AS PhoneStart,
CHARINDEX('<br>Thephoneisvalid', Addons, CHARINDEX('Enter phone:', Addons) ) AS PhoneEnd
FROM
orders
) AS orders2
I suggest converting the above into a VIEW or CTE so you can directly query it in your JOIN expression:
CREATE VIEW ordersWithPhoneNumbers AS
-- copy and paste the above query here, then execute the batch to create the view, you only need to do this once.
Then you can use it like so:
SELECT
* -- again, avoid the use of the star selector in production use
FROM
ordersWithPhoneNumbers AS o2 -- this is the above query as a VIEW
INNER JOIN order_no ON o2.order_no = order_no.order_no
INNER JOIN #temp AS t ON o2.ExtractedPhoneNumber = t.phone_number
Actually, I take back my previous remark about performance: if you index the ExtractedPhoneNumber value (for example by persisting it on the base table, since a plain view cannot be indexed directly) then you'll get good performance.
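A hedged sketch of that indexing idea, assuming Addons has been migrated to nvarchar(max) as recommended above (the column and index names are illustrative):
-- Sketch only: persist the extracted number so it can be indexed.
-- 12 = LEN('Enter phone:'), matching the query above.
ALTER TABLE orders ADD ExtractedPhoneNumber AS
    CASE WHEN CHARINDEX('Enter phone:', Addons) > 0
          AND CHARINDEX('<br>Thephoneisvalid', Addons, CHARINDEX('Enter phone:', Addons)) > 0
         THEN CONVERT(varchar(20), SUBSTRING(Addons,
                  CHARINDEX('Enter phone:', Addons) + 12,
                  CHARINDEX('<br>Thephoneisvalid', Addons, CHARINDEX('Enter phone:', Addons))
                      - (CHARINDEX('Enter phone:', Addons) + 12)))
    END PERSISTED;

CREATE INDEX IX_orders_ExtractedPhoneNumber ON orders (ExtractedPhoneNumber);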

Performance penalty when using "join with temp table" in contrast to "IN clause with constant values"

I have a temp table with two records like this:
select * into #Tbl from (select 1 id union select 2) tbl
and also the related index:
Create nonclustered index IX_1 on #Tbl(id)
The following query takes 4000ms to run:
SELECT AncestorId
FROM myView
WHERE AncestorId = ANY (select id from #Tbl)
But the equivalent filter written with IN and literal values takes only 3ms to run!:
SELECT ProjectStructureId
FROM myView
WHERE AncestorId in (1,2)
Why this huge difference and how can I change the first query to be as fast as the second one?
P.S.
SQL Server 2014 SP2
myView is a recursive CTE
Rewriting the first query with an INNER JOIN or EXISTS didn't help
Changing the IX_1 index to a clustered index didn't help
Using FORCESEEK didn't help
P.S.2
The execution plans of both can be downloaded here : https://www.dropbox.com/s/pas1ovyamqojhba/Query-With-In.sqlplan?dl=0
Execution plans in Paste the Plan
P.S. 3
The view definition is :
ALTER VIEW [dbo].[myView]
AS
WITH parents AS
(
    SELECT main.Id,
           main.NodeTypeCode,
           main.ParentProjectStructureId AS DirectParentId,
           parentInfo.Id AS AncestorId,
           parentInfo.ParentProjectStructureId AS AncestorParentId,
           CASE WHEN main.NodeTypeCode <> ISNULL(parentInfo.NodeTypeCode, 0)
                THEN 1 ELSE 0 END AS AncestorTypeDiffLevel
    FROM dbo.ProjectStructures AS main
    LEFT OUTER JOIN dbo.ProjectStructures AS parentInfo
        ON main.ParentProjectStructureId = parentInfo.Id
    UNION ALL
    SELECT m.Id,
           m.NodeTypeCode,
           m.ParentProjectStructureId,
           parents.AncestorId,
           parents.AncestorParentId,
           CASE WHEN m.NodeTypeCode <> parents.NodeTypeCode
                THEN AncestorTypeDiffLevel + 1
                ELSE AncestorTypeDiffLevel END AS AncestorTypeDiffLevel
    FROM dbo.ProjectStructures AS m
    INNER JOIN parents ON m.ParentProjectStructureId = parents.Id
)
SELECT ISNULL(Id, -1) AS ProjectStructureId,
       ISNULL(NodeTypeCode, -1) AS NodeTypeCode,
       DirectParentId,
       ISNULL(AncestorId, -1) AS AncestorId,
       AncestorParentId,
       AncestorTypeDiffLevel
FROM parents
WHERE AncestorId IS NOT NULL
In your good plan the optimizer is able to push the literal values right into the index seek of the anchor part of the recursive CTE.
It refuses to do that when they come from a table.
You could create a table type
CREATE TYPE IntegerSet AS TABLE
(
Integer int PRIMARY KEY WITH (IGNORE_DUP_KEY = ON)
);
And then pass that to an inline TVF written to use that in the anchor part directly.
Then just call it like
DECLARE @AncestorIds IntegerSet;

INSERT INTO @AncestorIds
VALUES (1),
       (2);

SELECT *
FROM [dbo].[myFn](@AncestorIds);
The inline TVF would be much the same as the view but with
WHERE parentInfo.Id IN (SELECT Integer FROM @AncestorIds)
in the anchor part of the recursive CTE.
CREATE FUNCTION [dbo].[myFn]
(
    @AncestorIds IntegerSet READONLY
)
RETURNS TABLE
AS
RETURN
    WITH parents
         AS (SELECT /* omitted for clarity */
             WHERE parentInfo.Id IN (SELECT Integer FROM @AncestorIds)
             UNION ALL
             SELECT /* rest omitted for clarity */
Also you might as well change that LEFT JOIN to an INNER JOIN though the optimiser does that for you.
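For completeness, here is a hedged sketch of the assembled function, based on the view definition quoted in the question; the anchor filter is the only new piece, and as noted above the LEFT JOIN could become an INNER JOIN:
CREATE FUNCTION [dbo].[myFn]
(
    @AncestorIds IntegerSet READONLY
)
RETURNS TABLE
AS
RETURN
WITH parents AS
(
    SELECT main.Id, main.NodeTypeCode, main.ParentProjectStructureId AS DirectParentId,
           parentInfo.Id AS AncestorId, parentInfo.ParentProjectStructureId AS AncestorParentId,
           CASE WHEN main.NodeTypeCode <> ISNULL(parentInfo.NodeTypeCode, 0)
                THEN 1 ELSE 0 END AS AncestorTypeDiffLevel
    FROM dbo.ProjectStructures AS main
    LEFT OUTER JOIN dbo.ProjectStructures AS parentInfo
        ON main.ParentProjectStructureId = parentInfo.Id
    -- the one new piece: filter the anchor so the index seek happens here
    WHERE parentInfo.Id IN (SELECT Integer FROM @AncestorIds)
    UNION ALL
    SELECT m.Id, m.NodeTypeCode, m.ParentProjectStructureId,
           parents.AncestorId, parents.AncestorParentId,
           CASE WHEN m.NodeTypeCode <> parents.NodeTypeCode
                THEN AncestorTypeDiffLevel + 1
                ELSE AncestorTypeDiffLevel END AS AncestorTypeDiffLevel
    FROM dbo.ProjectStructures AS m
    INNER JOIN parents ON m.ParentProjectStructureId = parents.Id
)
SELECT ISNULL(Id, -1) AS ProjectStructureId,
       ISNULL(NodeTypeCode, -1) AS NodeTypeCode,
       DirectParentId,
       ISNULL(AncestorId, -1) AS AncestorId,
       AncestorParentId,
       AncestorTypeDiffLevel
FROM parents
WHERE AncestorId IS NOT NULL;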
I just want to say that I would write the query as:
SELECT AncestorId
FROM myView
WHERE AncestorId IN (select id from #Tbl);
I doubt this would help.
The issue is that SQL Server can optimize literal values better than values inside a table. The result is that the execution plan changes.
If neither IN nor JOIN fix the problem, then you probably have to fiddle with the definition of the view to improve performance.
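Given that explanation, one more hedged workaround is to build the IN list dynamically so the optimizer sees literals. This is a sketch only, and it assumes #Tbl stays small and non-empty:
DECLARE @ids nvarchar(max), @sql nvarchar(max);

-- build a comma-separated literal list from the temp table
SELECT @ids = STUFF((SELECT ',' + CAST(id AS varchar(11))
                     FROM #Tbl
                     FOR XML PATH('')), 1, 1, '');

SET @sql = N'SELECT AncestorId FROM myView WHERE AncestorId IN (' + @ids + N');';
EXEC sys.sp_executesql @sql;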

Ensuring two columns only contain valid results from same subquery

I have the following table:
id symbol_01 symbol_02
1 abc xyz
2 kjh okd
3 que qid
I need a query that ensures symbol_01 and symbol_02 are both contained in a list of valid symbols. In other words, I need something like this:
select *
from mytable
where symbol_01 in (
select valid_symbols
from somewhere)
and symbol_02 in (
select valid_symbols
from somewhere)
The above example would work correctly, but the subquery used to determine the list of valid symbols is identical both times and is quite large. It would be very inefficient to run it twice like in the example.
Is there a way to do this without duplicating two identical subqueries?
Another approach:
select *
from mytable t1
where 2 = (select count(distinct symbol)
from valid_symbols vs
where vs.symbol in (t1.symbol_01, t1.symbol_02));
This assumes that the valid symbols are stored in a table valid_symbols that has a column named symbol. The query would also benefit from an index on valid_symbols.symbol
You could try using a CTE, like:
WITH ValidSymbols AS (
SELECT DISTINCT valid_symbol
FROM somewhere
)
SELECT mt.*
FROM MyTable mt
INNER JOIN ValidSymbols v1
ON mt.symbol_01 = v1.valid_symbol
INNER JOIN ValidSymbols v2
ON mt.symbol_02 = v2.valid_symbol
From a performance perspective, your query is the right way to do this. I would write it as:
select *
from mytable t
where exists (select 1
from valid_symbols vs
where t.symbol_01 = vs.valid_symbol
) and
exists (select 1
from valid_symbols vs
where t.symbol_02 = vs.valid_symbol
) ;
The important component is that you need an index on valid_symbols(valid_symbol). With this index, the lookup should be pretty fast. Appropriate indexes can even work if valid_symbols is a view, although the effect depends on the complexity of the view.
You seem to have a situation where you have two foreign key relationships. If you explicitly declare these relationships, then the database will enforce that the columns in your table match the valid symbols.
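A hedged sketch of those declarations, assuming valid_symbols is a base table whose valid_symbol column can be made a primary key (the constraint names are illustrative):
-- Each foreign key guarantees the column holds only values present in valid_symbols.
ALTER TABLE valid_symbols
    ADD CONSTRAINT PK_valid_symbols PRIMARY KEY (valid_symbol);

ALTER TABLE mytable
    ADD CONSTRAINT FK_mytable_symbol_01
        FOREIGN KEY (symbol_01) REFERENCES valid_symbols (valid_symbol);

ALTER TABLE mytable
    ADD CONSTRAINT FK_mytable_symbol_02
        FOREIGN KEY (symbol_02) REFERENCES valid_symbols (valid_symbol);
As a side benefit, the primary key also provides the index that the EXISTS version above relies on.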

Performance issues with UNION of large tables

I have seven large tables that can store between 100 and 1 million rows at any time. I'll call them LargeTable1, LargeTable2, LargeTable3, LargeTable4...LargeTable7. These tables are mostly static: there are no updates or new inserts. They change only once every two weeks or once a month, when they are truncated and a new batch of records is inserted into each.
All these tables have three fields in common: Headquarter, Country and File. Headquarter and Country are numbers in the format '000', though in two of these tables they are parsed as int due to some other system necessities.
I have another, much smaller table called Headquarters with the information of each headquarter. This table has very few entries. At most 1000, actually.
Now, I need to create a stored procedure that returns all those headquarters that appear in the large tables but are either absent from the Headquarters table or have been deleted (deletion in this table is logical: a DeletionDate field marks it).
This is the query I've tried:
CREATE PROCEDURE deletedHeadquarters
AS
BEGIN
DECLARE #headquartersFiles TABLE
(
hq int,
countryFile varchar(MAX)
);
SET NOCOUNT ON
INSERT INTO #headquartersFiles
SELECT headquarter, CONCAT(country, ' (', [file], ')')
FROM
(
    SELECT DISTINCT CONVERT(int, headquarter) as headquarter,
                    CONVERT(int, country) as country,
                    [file]
    FROM LargeTable1
    UNION
    SELECT DISTINCT headquarter, country, [file]
    FROM LargeTable2
    UNION
    SELECT DISTINCT headquarter, country, [file]
    FROM LargeTable3
    UNION
    SELECT DISTINCT headquarter, country, [file]
    FROM LargeTable4
    UNION
    SELECT DISTINCT headquarter, country, [file]
    FROM LargeTable5
    UNION
    SELECT DISTINCT headquarter, country, [file]
    FROM LargeTable6
    UNION
    SELECT DISTINCT headquarter, country, [file]
    FROM LargeTable7
) TC
SELECT RIGHT('000' + CAST(st.headquarter AS VARCHAR(3)), 3) as headquarter,
MAX(s.deletionDate) as deletionDate,
STUFF
(
(SELECT DISTINCT ', ' + st2.countryFile
FROM #headquartersFiles st2
WHERE st2.headquarter = st.headquarter
FOR XML PATH('')),
1,
1,
''
) countryFile
FROM #headquartersFiles as st
LEFT JOIN headquarters s ON CONVERT(int, s.headquarter) = st.headquarter
WHERE s.headquarter IS NULL
OR s.deletionDate IS NOT NULL
GROUP BY st.headquarter
END
This sp's performance isn't good enough for our application. It currently takes around 50 seconds to complete, with the following total rows for each table (just to give you an idea about the sizes):
LargeTable1: 1516666 rows
LargeTable2: 645740 rows
LargeTable3: 1950121 rows
LargeTable4: 779336 rows
LargeTable5: 1100999 rows
LargeTable6: 16499 rows
LargeTable7: 24454 rows
What can I do to improve performance? I've tried the following, with not much difference:
Inserting into the local table by batches, excluding those headquarters I've already inserted and then updating the countryFile field for those that are repeated
Creating a view for that UNION query
Creating indexes for the LargeTables for the headquarter field
I've also thought about inserting these missing headquarters into a permanent table after the LargeTables change, but the Headquarters table can change more often, and I would rather not have to modify its module to keep these things tidy and updated. But if it's the best possible alternative, I'd go for it.
Thanks
Take this filter
LEFT JOIN headquarters s ON CONVERT(int, s.headquarter) = st.headquarter
WHERE s.headquarter IS NULL
OR s.deletionDate IS NOT NULL
And add it to each individual query in the union before inserting into #headquartersFiles.
It might seem like this creates a lot more filters, but it will actually speed things up because you are filtering before you start processing the union.
Also, take out all your DISTINCTs. It probably won't speed things up, but they are redundant because you are doing a UNION and not a UNION ALL.
Do the filtering at each step. But first, modify the headquarters table so it has the right type for what you need . . . along with an index:
alter table headquarters add headquarter_int as (cast(headquarter as int));
create index idx_headquarters_int on headquarters(headquarter_int);
SELECT DISTINCT headquarter, country, [file]
FROM LargeTable5 lt5
WHERE NOT EXISTS (SELECT 1
                  FROM headquarters s
                  WHERE s.headquarter_int = lt5.headquarter and s.deletionDate is null
                 );
(Note: NOT EXISTS of a non-deleted row covers both cases at once: the headquarter is either missing entirely or present only with a deletion date.)
Then, you want an index on LargeTable5(headquarter, country, [file]).
This should take less than 5 seconds to run. If so, then construct the full query, being sure that the types in the correlated subquery match and that you have the right index on the full table. Use UNION to remove duplicates between the tables.
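A sketch of how those per-table branches might assemble (hedged: per the question, some tables store headquarter as varchar '000' and two store it as int, so each branch must compare against the matching column; here LargeTable2 is assumed varchar and LargeTable5 int):
SELECT DISTINCT CONVERT(int, headquarter) AS headquarter,
                CONVERT(int, country) AS country,
                [file]
FROM LargeTable2 lt2
WHERE NOT EXISTS (SELECT 1
                  FROM headquarters s
                  WHERE s.headquarter = lt2.headquarter
                    AND s.deletionDate IS NULL)
UNION
SELECT DISTINCT headquarter, country, [file]
FROM LargeTable5 lt5
WHERE NOT EXISTS (SELECT 1
                  FROM headquarters s
                  WHERE s.headquarter_int = lt5.headquarter
                    AND s.deletionDate IS NULL)
-- UNION ... repeat for the remaining tables, matching each one's storage type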
I'd try doing the filtering against each individual table first. You just need to account for the fact that a headquarter might appear in one table but not another. You can do this like so:
SELECT
    headquarter
FROM
(
    SELECT DISTINCT
        headquarter,
        'table1' AS large_table
    FROM
        LargeTable1 LT
        LEFT OUTER JOIN Headquarters HQ ON HQ.headquarter = LT.headquarter
    WHERE
        HQ.headquarter IS NULL OR
        HQ.deletionDate IS NOT NULL
    UNION ALL
    SELECT DISTINCT
        headquarter,
        'table2' AS large_table
    FROM
        LargeTable2 LT
        LEFT OUTER JOIN Headquarters HQ ON HQ.headquarter = LT.headquarter
    WHERE
        HQ.headquarter IS NULL OR
        HQ.deletionDate IS NOT NULL
    UNION ALL
    ...
) SQ
GROUP BY headquarter
HAVING COUNT(*) = 7
That would make sure that it's missing (or deleted) from the perspective of all seven tables.
Table variables have horrible performance because SQL Server does not generate statistics for them. Instead of a table variable, try using a temp table, and if hq + countryFile is unique in the temp table, add a unique constraint (which will create a clustered index) in the temp table definition. You can set indexes on a temp table after creating it, but for various reasons SQL Server may ignore them.
Edit: as it turns out, you can in fact create indexes on table variables, even non-unique ones, in 2014+.
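A hedged sketch of that temp-table variant (it narrows countryFile to an indexable size, since varchar(MAX) cannot be an index key; adjust the length to your data):
-- Temp table instead of a table variable: it gets statistics, and the unique
-- constraint doubles as a clustered index for the later grouping and joins.
CREATE TABLE #headquartersFiles
(
    hq int NOT NULL,
    countryFile varchar(400) NOT NULL,
    CONSTRAINT UQ_headquartersFiles UNIQUE CLUSTERED (hq, countryFile)
);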
Secondly, try not to use functions in your joins or WHERE clauses; doing so often causes performance problems.
The real answer is to create separate INSERT statements for each table, with the caveat that the data to be inserted must not already exist in the destination table.