Performance issues with UNION of large tables - sql

I have seven large tables, each of which can hold between 100 and 1 million rows at any time. I'll call them LargeTable1, LargeTable2, LargeTable3, LargeTable4...LargeTable7. These tables are mostly static: there are no updates and no new inserts. They change only once every two weeks or once a month, when they are truncated and a new batch of records is inserted into each.
All these tables have three fields in common: Headquarter, Country, and File. Headquarter and Country are numbers in the format '000', though in two of these tables they are stored as int because of other system requirements.
I have another, much smaller table called Headquarters with the information of each headquarter. This table has very few entries. At most 1000, actually.
Now, I need to create a stored procedure that returns all the headquarters that appear in the large tables but are either absent from the Headquarters table or have been deleted (deletion in this table is logical: it has a DeletionDate field to check for this).
This is the query I've tried:
CREATE PROCEDURE deletedHeadquarters
AS
BEGIN
    DECLARE @headquartersFiles TABLE
    (
        headquarter int,
        countryFile varchar(MAX)
    );

    SET NOCOUNT ON;

    INSERT INTO @headquartersFiles
    SELECT headquarter, CONCAT(country, ' (', [file], ')')
    FROM
    (
        SELECT DISTINCT CONVERT(int, headquarter) AS headquarter,
                        CONVERT(int, country) AS country,
                        [file]
        FROM LargeTable1
        UNION
        SELECT DISTINCT headquarter, country, [file]
        FROM LargeTable2
        UNION
        SELECT DISTINCT headquarter, country, [file]
        FROM LargeTable3
        UNION
        SELECT DISTINCT headquarter, country, [file]
        FROM LargeTable4
        UNION
        SELECT DISTINCT headquarter, country, [file]
        FROM LargeTable5
        UNION
        SELECT DISTINCT headquarter, country, [file]
        FROM LargeTable6
        UNION
        SELECT DISTINCT headquarter, country, [file]
        FROM LargeTable7
    ) TC;

    SELECT RIGHT('000' + CAST(st.headquarter AS varchar(3)), 3) AS headquarter,
           MAX(s.deletionDate) AS deletionDate,
           STUFF(
               (SELECT DISTINCT ', ' + st2.countryFile
                FROM @headquartersFiles st2
                WHERE st2.headquarter = st.headquarter
                FOR XML PATH('')),
               1, 1, ''
           ) AS countryFile
    FROM @headquartersFiles AS st
    LEFT JOIN Headquarters s ON CONVERT(int, s.headquarter) = st.headquarter
    WHERE s.headquarter IS NULL
       OR s.deletionDate IS NOT NULL
    GROUP BY st.headquarter
END
This SP's performance isn't good enough for our application. It currently takes around 50 seconds to complete, with the following row counts for each table (just to give you an idea of the sizes):
LargeTable1: 1516666 rows
LargeTable2: 645740 rows
LargeTable3: 1950121 rows
LargeTable4: 779336 rows
LargeTable5: 1100999 rows
LargeTable6: 16499 rows
LargeTable7: 24454 rows
What can I do to improve performance? I've tried the following, without much difference:
Inserting into the local table in batches, excluding headquarters I've already inserted, and then updating the countryFile field for the repeated ones
Creating a view for that UNION query
Creating indexes on the LargeTables' headquarter field
I've also thought about inserting these missing headquarters into a permanent table after the LargeTables change, but the Headquarters table can change more often, and I'd rather not have to modify its module to keep these things tidy and up to date. But if that's the best possible alternative, I'd go for it.
Thanks

Take this filter:
LEFT JOIN Headquarters s ON CONVERT(int, s.headquarter) = st.headquarter
WHERE s.headquarter IS NULL
   OR s.deletionDate IS NOT NULL
and add it to each individual query in the UNION before inserting into @headquartersFiles.
It might seem like this adds a lot more filters, but it will actually speed things up, because you are filtering the rows before the UNION starts processing them.
Also, take out all your DISTINCTs. It probably won't speed things up, but they're redundant anyway, because you are doing a UNION and not a UNION ALL.
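For instance, a sketch of one possible rewrite of the insert, using the question's tables and table variable; each of the seven branches gets the same LEFT JOIN filter, and the per-branch DISTINCTs are dropped since the UNION already deduplicates:
INSERT INTO @headquartersFiles
SELECT headquarter, CONCAT(country, ' (', [file], ')')
FROM
(
    SELECT CONVERT(int, lt.headquarter) AS headquarter,
           CONVERT(int, lt.country) AS country,
           lt.[file]
    FROM LargeTable1 lt
    LEFT JOIN Headquarters s ON CONVERT(int, s.headquarter) = CONVERT(int, lt.headquarter)
    WHERE s.headquarter IS NULL
       OR s.deletionDate IS NOT NULL
    UNION
    SELECT lt.headquarter, lt.country, lt.[file]
    FROM LargeTable2 lt
    LEFT JOIN Headquarters s ON CONVERT(int, s.headquarter) = lt.headquarter
    WHERE s.headquarter IS NULL
       OR s.deletionDate IS NOT NULL
    -- ... the same pattern for LargeTable3 through LargeTable7
) TC;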

Do the filtering at each step. But first, modify the headquarters table so it has the right type for what you need, along with an index:
alter table headquarters add headquarter_int as (cast(headquarter as int));
create index idx_headquarters_int on headquarters(headquarter_int);
SELECT DISTINCT headquarter, country, [file]
FROM LargeTable5 lt5
WHERE NOT EXISTS (SELECT 1
                  FROM headquarters s
                  WHERE s.headquarter_int = lt5.headquarter
                    AND s.deletiondate IS NULL
                 );
Then, you want an index on LargeTable5(headquarter, country, file).
This should take less than 5 seconds to run. If so, then construct the full query, making sure the types in the correlated subquery match and that you have the right index on each table. Use UNION to remove duplicates between the tables.
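As a sketch (assuming the computed column and indexes above), two of the seven branches combined might look like this; note the explicit CONVERT for a table that stores headquarter as '000'-formatted text:
SELECT headquarter, country, [file]
FROM LargeTable1 lt1
WHERE NOT EXISTS (SELECT 1
                  FROM headquarters s
                  WHERE s.headquarter_int = CONVERT(int, lt1.headquarter)
                    AND s.deletiondate IS NULL)
UNION -- not UNION ALL, so duplicates between tables are removed
SELECT headquarter, country, [file]
FROM LargeTable5 lt5
WHERE NOT EXISTS (SELECT 1
                  FROM headquarters s
                  WHERE s.headquarter_int = lt5.headquarter
                    AND s.deletiondate IS NULL);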

I'd try doing the filtering against each individual table first. You just need to account for the fact that a headquarter might appear in one table but not another. You can do this like so:
SELECT headquarter
FROM
(
    SELECT DISTINCT LT.headquarter,
           'table1' AS large_table
    FROM LargeTable1 LT
    LEFT OUTER JOIN Headquarters HQ ON HQ.headquarter = LT.headquarter
    WHERE HQ.headquarter IS NULL
       OR HQ.deletionDate IS NOT NULL
    UNION ALL
    SELECT DISTINCT LT.headquarter,
           'table2' AS large_table
    FROM LargeTable2 LT
    LEFT OUTER JOIN Headquarters HQ ON HQ.headquarter = LT.headquarter
    WHERE HQ.headquarter IS NULL
       OR HQ.deletionDate IS NOT NULL
    UNION ALL
    ...
) SQ
GROUP BY headquarter
HAVING COUNT(*) = 7
That would make sure that it's missing or deleted across all seven tables.

Table variables have horrible performance because SQL Server does not generate statistics for them. Instead of a table variable, try using a temp table, and if headquarter + country + file is unique in the temp table, add a unique constraint (which will create a clustered index) in the temp table definition. You can create indexes on a temp table after creating it, but for various reasons SQL Server may ignore them.
Edit: as it turns out, you can in fact create indexes on table variables, even non-unique ones, in SQL Server 2014+.
Secondly, try not to use functions in your joins or WHERE clauses; doing so often causes performance problems.
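A sketch of that temp-table definition, assuming headquarter plus countryFile turns out to be unique; varchar(MAX) can't be part of an index key, so the bounded length here is an assumption:
CREATE TABLE #headquartersFiles
(
    headquarter int NOT NULL,
    countryFile varchar(450) NOT NULL, -- bounded so it can be in the index key
    CONSTRAINT UQ_headquartersFiles UNIQUE CLUSTERED (headquarter, countryFile)
);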

The real answer is to create separate INSERT statements for each table, with the caveat that the data to be inserted must not already exist in the destination table.
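A sketch of that approach, assuming a temp table like the one sketched above; each large table gets its own INSERT that skips rows already captured from earlier tables:
INSERT INTO #headquartersFiles (headquarter, countryFile)
SELECT DISTINCT CONVERT(int, lt.headquarter),
       CONCAT(lt.country, ' (', lt.[file], ')')
FROM LargeTable1 lt
WHERE NOT EXISTS (SELECT 1
                  FROM #headquartersFiles hf
                  WHERE hf.headquarter = CONVERT(int, lt.headquarter)
                    AND hf.countryFile = CONCAT(lt.country, ' (', lt.[file], ')'));
-- ... one such INSERT per table, LargeTable2 through LargeTable7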


Comparing multiple tables with one another

I do not necessarily spend my time creating SQL queries; I maintain and search for mistakes in databases. I constantly have to compare two types of tables, and if the database is small, I don't mind just writing a small query. But at times some databases are huge and the number of tables is overwhelming.
I have one table type that holds compressed data and another that holds aggregates computed from the compressed data. At times the AggregateTables are missing some IDs; some data was not calculated. If it is just one AggregateTable, I just compare it to its corresponding compressed table and I can immediately see what needs to be recalculated (code for that is shown below).
select distinct taguid
from TLG.TagValueCompressed_0_100000
where not exists
      (select *
       from tlg.AggregateValue_0_100000
       where AggregateValue_0_100000.TagUID = TagValueCompressed_0_100000.TagUID)
I would like to have a query that compares all these tables with one another and spits out a table with all non-existing tags. My SQL knowledge is in its infancy, and my job does not require me to be a SQL freak, but a query that solves said problem would help me a lot. Does anyone have any suggestions for a solution?
Relevant columns: TagUIDs, that's it.
Optimal table:
Existing Tags    Missing Tags
1
2
3
4
...
The suffix numbers ("_0_100000", "_0_100001", ...) indicate which table set is meant.
So let's assume this query produces the output you want, i.e. the list of all possible tags in a given table set (0_100000, etc.) plus columns denoting whether the given tag exists in AggregateValue and in TagValueCompressed:
SELECT '0_100000' AS TableSet
, ISNULL(AggValue.TagUID, TagValue.TagUID) AS TagUID
, IIF(TagValue.TagUID IS NOT NULL, 1, 0) AS ExistsInTag
, IIF(AggValue.TagUID IS NOT NULL, 1, 0) AS ExistsInAgg
FROM (SELECT DISTINCT TagUID FROM tlg.AggregateValue_0_100000) AggValue
FULL OUTER
JOIN (SELECT DISTINCT TagUID FROM TLG.TagValueCompressed_0_100000) TagValue
ON AggValue.TagUID = TagValue.TagUID
So in order to execute it for multiple tables, we can make this query a template:
DECLARE @QueryTemplate NVARCHAR(MAX) = '
SELECT ''$SUFFIX$'' AS TableSet
, ISNULL(AggValue.TagUID, TagValue.TagUID) AS TagUID
, IIF(TagValue.TagUID IS NOT NULL, 1, 0) AS ExistsInTag
, IIF(AggValue.TagUID IS NOT NULL, 1, 0) AS ExistsInAgg
FROM (SELECT DISTINCT TagUID FROM tlg.AggregateValue_$SUFFIX$) AggValue
FULL OUTER
JOIN (SELECT DISTINCT TagUID FROM TLG.TagValueCompressed_$SUFFIX$) TagValue
ON AggValue.TagUID = TagValue.TagUID';
Here $SUFFIX$ denotes the 0_100000 etc. We can now execute this query dynamically for all tables matching a certain pattern, like say you have 500 of these tables.
DECLARE @query NVARCHAR(MAX) = ''
-- Produce a query for a given suffix out of the template;
-- the suffix is the last 8 characters of the table's name.
-- Combine all the queries, for all tables, using UNION ALL.
SELECT @query += CONCAT(REPLACE(@QueryTemplate, '$SUFFIX$', RIGHT(name, 8)), ' UNION ALL ')
FROM sys.tables
WHERE name LIKE 'TagValueCompressed%';
-- Get rid of the trailing UNION ALL
SET @query = LEFT(@query, LEN(@query) - LEN('UNION ALL '));
EXECUTE sp_executesql @query
This will yield combined results for all matching tables.
Here is a working example on dbfiddle.

This query takes a lot of time to execute, how do I optimize it?

I have used REPLACE to clean up the fields, but the query is taking hours to execute.
select c.*
from
(
    select distinct a.*, b.*
    from
    (
        -- Table 1
        select replace(replace(replace(replace(AGENCY_NAME,' ',''),'-',''),'/',''),'\','')
                   as agency_name,
               LEN(replace(replace(replace(replace(AGENCY_NAME,' ',''),'-',''),'/',''),'\',''))
                   as agency_len
        from dbo.tbl_stars_agency
        where replace(replace(replace(replace(AGENCY_NAME,' ',''),'-',''),'/',''),'\','')
              not in ('')
    ) a
    inner join
    (
        -- Table 2
        select replace(replace(replace(replace(RESPONDENT_NAME_PER,' ',''),'-',''),'/',''),'\','')
                   as respondent_name,
               len(replace(replace(replace(replace(RESPONDENT_NAME_PER,' ',''),'-',''),'/',''),'\',''))
                   as respondent_len
        from dbo.TBL_cacs_ecb
        where replace(replace(replace(replace(RESPONDENT_NAME_PER,' ',''),'-',''),'/',''),'\','')
              not in ('')
    ) b
        on substring(a.agency_name,1,15) = SUBSTRING(b.respondent_name,1,15)
) c
inner join
(
    -- Table 3
    select replace(replace(replace(replace(NM_ENTITY,' ',''),'-',''),'/',''),'\','')
               as nm_entity,
           LEN(replace(replace(replace(replace(NM_ENTITY,' ',''),'-',''),'/',''),'\',''))
               as nm_entity_len
    from dbo.RMFS010_TF1NAME
    where replace(replace(replace(replace(NM_ENTITY,' ',''),'-',''),'/',''),'\','')
          not in ('')
) d
    on substring(c.agency_name,1,5) = substring(d.nm_entity,1,5)
    or substring(c.respondent_name,1,5) = substring(d.nm_entity,1,5)
I want to compare the three tables based on the name field in each. I have calculated the lengths and used the SUBSTRING function to match up to 15 characters.
One approach would be to add calculated columns and index them:
ALTER TABLE dbo.tbl_stars_agency
ADD agency_len AS LEN(replace(replace(replace(
replace(AGENCY_NAME,' ',''),'-',''),'/',''),'\',''))
PERSISTED;
Now add an index:
CREATE INDEX MyIndex ON dbo.tbl_stars_agency
    (agency_len);
Now any search in this should be much faster.
You can change this line:
where replace(replace(replace(
replace(AGENCY_NAME,' ',''),'-',''),'/',''),'\','')
not in ('')
to this:
where agency_len = 0
Try that, and if it helps, we can look at doing the same for other columns.
PS: You need to be certain your expression is watertight and won't crash due to invalid lengths etc.; otherwise it will stop you from inserting and updating records.
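If it does help, the same pattern can be applied to the cleaned-up name itself, so the join no longer recomputes the REPLACE chain per row; a sketch, with the column and index names being my own invention:
ALTER TABLE dbo.tbl_stars_agency
ADD agency_name_clean AS replace(replace(replace(
    replace(AGENCY_NAME,' ',''),'-',''),'/',''),'\','')
PERSISTED;

CREATE INDEX IX_agency_name_clean ON dbo.tbl_stars_agency
    (agency_name_clean);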

Ensuring two columns only contain valid results from same subquery

I have the following table:
id  symbol_01  symbol_02
1   abc        xyz
2   kjh        okd
3   que        qid
I need a query that ensures symbol_01 and symbol_02 are both contained in a list of valid symbols. In other words, I would need something like this:
select *
from mytable
where symbol_01 in (
select valid_symbols
from somewhere)
and symbol_02 in (
select valid_symbols
from somewhere)
The above example would work correctly, but the subquery used to determine the list of valid symbols is identical both times and is quite large. It would be very inefficient to run it twice as in the example.
Is there a way to do this without duplicating two identical subqueries?
Another approach:
select *
from mytable t1
where 2 = (select count(distinct symbol)
from valid_symbols vs
where vs.symbol in (t1.symbol_01, t1.symbol_02));
This assumes that the valid symbols are stored in a table valid_symbols that has a column named symbol. The query would also benefit from an index on valid_symbols.symbol.
You could try using a CTE, like this:
WITH ValidSymbols AS (
SELECT DISTINCT valid_symbol
FROM somewhere
)
SELECT mt.*
FROM MyTable mt
INNER JOIN ValidSymbols v1
ON mt.symbol_01 = v1.valid_symbol
INNER JOIN ValidSymbols v2
ON mt.symbol_02 = v2.valid_symbol
From a performance perspective, your query is the right way to do this. I would write it as:
select *
from mytable t
where exists (select 1
from valid_symbols vs
where t.symbol_01 = vs.valid_symbol
) and
exists (select 1
from valid_symbols vs
where t.symbol_02 = vs.valid_symbol
) ;
The important component is that you need an index on valid_symbols(valid_symbol). With this index, the lookup should be pretty fast. Appropriate indexes can even work if valid_symbols is a view, although the effect depends on the complexity of the view.
You seem to have a situation where you have two foreign key relationships. If you explicitly declare these relationships, then the database will enforce that the columns in your table match the valid symbols.
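For instance, a sketch of those declarations, assuming valid_symbols is a real table whose valid_symbol column is non-nullable and can carry a primary key:
-- The primary key also provides the index that the EXISTS lookups need.
ALTER TABLE valid_symbols
    ADD CONSTRAINT PK_valid_symbols PRIMARY KEY (valid_symbol);

ALTER TABLE mytable
    ADD CONSTRAINT FK_mytable_symbol_01
    FOREIGN KEY (symbol_01) REFERENCES valid_symbols (valid_symbol);

ALTER TABLE mytable
    ADD CONSTRAINT FK_mytable_symbol_02
    FOREIGN KEY (symbol_02) REFERENCES valid_symbols (valid_symbol);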

SQL Server query with extremely large IN clause results in numerous queries in activity monitor

SQL Server 2014 database. Table with 200 million rows.
Very large query with HUGE IN clause.
I originally wrote this query for them, but they have grown the IN clause to over 700 entries. The CTE looks unnecessary because I have omitted all the select columns and their substring() transformations for simplicity.
The focus is on the IN clause. 700+ pairs of these.
WITH cte AS (
    SELECT *
    FROM [AODS-DB1B]
    WHERE Source + '-' + Target IN
    (
        'ACY-DTW',
        'ACY-ATL',
        'ACY-ORD',
        -- ... 700+ of these pairs ...
        'HTS-PGD',
        'PIE-BMI',
        'PGD-HTS'
    )
)
SELECT *
FROM cte
ORDER BY Source, Target, YEAR, QUARTER
When running, this query shoots CPU to 100% for hours - not unexpectedly.
There are indexes on all columns involved.
Question 1: Is there a better or more efficient way to accomplish this query other than the huge IN clause? Would 700 UNION ALLs be better?
Question 2: When this query runs, it creates a Session_ID that contains 49 "threads" (49 processes that all have the same Session_ID), every one of them an instance of this query with its "Command" being this query text.
21 of them SUSPENDED,
14 of them RUNNING, and
14 of them RUNNABLE.
This changes rapidly as the task is running.
WHAT the heck is going on there? Is this SQL Server breaking the query up into pieces to work on it?
I recommend you store your 700+ strings in a permanent table, as it is generally perceived as bad practice to store that much metadata in a script. You can create the table like this:
CREATE TABLE dbo.Lookup (Source varchar(250), Target varchar(250))
CREATE INDEX IX_Lookup_Source_Target ON dbo.Lookup (Source, Target)

INSERT INTO dbo.Lookup (Source, Target)
SELECT 'ACY','DTW'
UNION
SELECT 'ACY','ATL'
.......
and then you can simply join on this table:
SELECT * FROM [AODS-DB1B] a
INNER JOIN dbo.Lookup lt ON lt.Source = a.Source
AND lt.Target=a.Target
ORDER BY Source, Target, YEAR, QUARTER
However, even better would be to normalise the AODS-DB1B table and store SourceId and TargetId INT values instead, with the VARCHAR values stored in Source and Target tables. You can then write a query that only performs integer comparisons rather than string comparisons and this should be much faster.
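A sketch of that normalization; the lookup-table names and column sizes here are assumptions, not the poster's actual schema:
CREATE TABLE dbo.Source
(
    SourceId int IDENTITY PRIMARY KEY,
    Code varchar(3) NOT NULL UNIQUE
);

CREATE TABLE dbo.Target
(
    TargetId int IDENTITY PRIMARY KEY,
    Code varchar(3) NOT NULL UNIQUE
);

-- After adding SourceId and TargetId int columns to [AODS-DB1B] and dbo.Lookup
-- (backfilled from the varchar codes), the join compares only integers:
SELECT a.*
FROM [AODS-DB1B] a
INNER JOIN dbo.Lookup lt ON lt.SourceId = a.SourceId
                        AND lt.TargetId = a.TargetId
ORDER BY a.SourceId, a.TargetId, a.YEAR, a.QUARTER;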
Put all of your codes into a temporary table (or a permanent one, if suitable)...
SELECT *
FROM [AODS-DB1B]
INNER JOIN NEW_TABLE ON Source + '-' + Target = NEW_TABLE.Code
WHERE
...
You can create a temp table with all those values and then JOIN to that table; it would make the process a lot faster.
I like the answer from Jaco. Have an index on (source, target).
It may be worth giving this a try:
where ( source = 'ACY' and target in ('DTW', 'ATL', 'ORD') )
or ( source = 'HTS' and target in ('PGD') )

SQL Server : compare two tables with UNION and Select * plus additional label column

I've been playing around with the sample on Jeff's Server blog to compare two tables to find the differences.
In my case the tables are a backup and the current data. I can get what I want with this SQL statement (simplified by removing most of the columns). I can then see the rows from each table that don't have an exact match, and I can see which table they come from.
SELECT
MIN(TableName) as TableName
,[strCustomer]
,[strAddress1]
,[strCity]
,[strPostalCode]
FROM
(SELECT
'Old' as TableName
,[JAS001].[dbo].[AR_CustomerAddresses].[strCustomer]
,[JAS001].[dbo].[AR_CustomerAddresses].[strAddress1]
,[JAS001].[dbo].[AR_CustomerAddresses].[strCity]
,[JAS001].[dbo].[AR_CustomerAddresses].[strPostalCode]
FROM
[JAS001].[dbo].[AR_CustomerAddresses]
UNION ALL
SELECT
'New' as TableName
,[JAS001new].[dbo].[AR_CustomerAddresses].[strCustomer]
,[JAS001new].[dbo].[AR_CustomerAddresses].[strAddress1]
,[JAS001new].[dbo].[AR_CustomerAddresses].[strCity]
,[JAS001new].[dbo].[AR_CustomerAddresses].[strPostalCode]
FROM
[JAS001new].[dbo].[AR_CustomerAddresses]) tmp
GROUP BY
[strCustomer]
,[strAddress1]
,[strCity]
,[strPostalCode]
HAVING
COUNT(*) = 1
This Stack Overflow answer gives me a much cleaner SQL query, but it does not tell me which table the rows come from.
SELECT * FROM [JAS001new].[dbo].[AR_CustomerAddresses]
UNION
SELECT * FROM [JAS001].[dbo].[AR_CustomerAddresses]
EXCEPT
SELECT * FROM [JAS001new].[dbo].[AR_CustomerAddresses]
INTERSECT
SELECT * FROM [JAS001].[dbo].[AR_CustomerAddresses]
I could use the first version, but I have many tables that I need to compare, and I think there has to be an easy way to add the source-table column to the second query. I've tried several things and googled to no avail. I suspect maybe I'm just not searching for the correct thing, since I'm sure it's been answered before.
Maybe I'm going down the wrong trail and there is a better way to compare the databases?
Could you use the following setup to accomplish your goal?
SELECT 'New not in Old' Descriptor, *
FROM
(
SELECT * FROM [JAS001new].[dbo].[AR_CustomerAddresses]
EXCEPT
SELECT * FROM [JAS001].[dbo].[AR_CustomerAddresses]
) a
UNION
SELECT 'Old not in New' Descriptor, *
FROM
(
SELECT * FROM [JAS001].[dbo].[AR_CustomerAddresses]
EXCEPT
SELECT * FROM [JAS001new].[dbo].[AR_CustomerAddresses]
) b
You can't add the table name there, because UNION, EXCEPT, and INTERSECT all compare all columns, which means you can't differentiate the rows by adding the table name to the query. A GROUP BY gives you control over which columns are considered when finding duplicates, so you can exclude the table name.
To help with the large number of tables you need to compare, you could write a SQL query against the metadata tables that hold the table names and columns, and generate the SQL commands dynamically from those values.
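For example, a sketch that builds one comparison statement per table from the catalog views; it just prints the generated SQL for review rather than executing it, and assumes both databases contain the same table names with identical column lists:
SELECT 'SELECT ''Old not in New'' AS Descriptor, * FROM '
     + '(SELECT * FROM [JAS001].[dbo].' + QUOTENAME(t.name)
     + ' EXCEPT SELECT * FROM [JAS001new].[dbo].' + QUOTENAME(t.name) + ') d;' AS GeneratedSql
FROM [JAS001].sys.tables t
ORDER BY t.name;
-- Swap the database names in the template for the "New not in Old" direction.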
Derive one column using the table names, like below:
SELECT MIN(TableName) as TableName
,[strCustomer]
,[strAddress1]
,[strCity]
,[strPostalCode]
,table_name_came
FROM
(SELECT 'Old' as TableName
,[JAS001].[dbo].[AR_CustomerAddresses].[strCustomer]
,[JAS001].[dbo].[AR_CustomerAddresses].[strAddress1]
,[JAS001].[dbo].[AR_CustomerAddresses].[strCity]
,[JAS001].[dbo].[AR_CustomerAddresses].[strPostalCode]
,'[JAS001].[dbo].[AR_CustomerAddresses]' as table_name_came
FROM [JAS001].[dbo].[AR_CustomerAddresses]
UNION ALL
SELECT 'New' as TableName
,[JAS001new].[dbo].[AR_CustomerAddresses].[strCustomer]
,[JAS001new].[dbo].[AR_CustomerAddresses].[strAddress1]
,[JAS001new].[dbo].[AR_CustomerAddresses].[strCity]
,[JAS001new].[dbo].[AR_CustomerAddresses].[strPostalCode]
,'[JAS001new].[dbo].[AR_CustomerAddresses]' as table_name_came
FROM [JAS001new].[dbo].[AR_CustomerAddresses]
) tmp
GROUP BY [strCustomer]
,[strAddress1]
,[strCity]
,[strPostalCode]
,table_name_came
HAVING COUNT(*) = 1