Optimization - SQL: How to show all data that exists in multiple tables

I have two tables. I want to find all the rows in table One that exist in table Two, and vice versa. I have an answer, but I want it to be faster.
Example:
Create table One (ID INT, Value INT, location VARCHAR(10))
Create table Two (ID INT, Value INT, location VARCHAR(10))
INSERT INTO One VALUES(1,2,'Hanoi')
INSERT INTO One VALUES(2,1,'Hanoi')
INSERT INTO One VALUES(1,4,'Hanoi')
INSERT INTO One VALUES(3,5,'Hanoi')
INSERT INTO Two VALUES(1,5,'Saigon')
INSERT INTO Two VALUES(4,6,'Saigon')
INSERT INTO Two VALUES(5,7,'Saigon')
INSERT INTO Two VALUES(2,8,'Saigon')
INSERT INTO Two VALUES(2,8,'Saigon')
And answers:
SELECT * FROM One WHERE ID IN (SELECT ID FROM Two)
UNION ALL
SELECT * FROM Two WHERE ID IN (SELECT ID FROM One)
With this query, the system scans the tables four times:
[execution plan screenshot showing four table scans]
I want the system to scan each table only once (table One once, table Two once).
Am I crazy?

You can try something like:
-- CREATE TABLES
IF OBJECT_ID ( 'tempdb..#One' ) IS NOT NULL
DROP TABLE #One;
IF OBJECT_ID ( 'tempdb..#Two' ) IS NOT NULL
DROP TABLE #Two;
CREATE TABLE #One (ID INT, Value INT, location VARCHAR(10))
CREATE TABLE #Two (ID INT, Value INT, location VARCHAR(10))
-- INSERT DATA
INSERT INTO #One VALUES(1,2,'Hanoi')
INSERT INTO #One VALUES(2,1,'Hanoi')
INSERT INTO #One VALUES(1,4,'Hanoi')
INSERT INTO #One VALUES(3,5,'Hanoi')
INSERT INTO #Two VALUES(1,5,'Saigon')
INSERT INTO #Two VALUES(4,6,'Saigon')
INSERT INTO #Two VALUES(5,7,'Saigon')
INSERT INTO #Two VALUES(2,8,'Saigon')
INSERT INTO #Two VALUES(2,8,'Saigon')
-- CREATE INDEX
CREATE NONCLUSTERED INDEX IX_One ON #One (ID) INCLUDE (Value, location)
CREATE NONCLUSTERED INDEX IX_Two ON #Two (ID) INCLUDE (Value, location)
-- SELECT DATA
SELECT o.ID
,o.Value
,o.location
FROM #One o
WHERE EXISTS (SELECT 1 FROM #Two t WHERE o.ID = t.ID)
UNION ALL
SELECT t.ID
,t.Value
,t.location
FROM #Two t
WHERE EXISTS (SELECT 1 FROM #One o WHERE t.ID = o.ID)
but it depends on how "big" your data is. If the data is really big (millions of rows) and you are running the Enterprise edition of SQL Server, you may consider using columnstore indexes.
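For example, a minimal sketch of what that could look like (the index names are made up, and it assumes permanent tables rather than the #temp tables above, since columnstore support on temp tables varies by version):
CREATE NONCLUSTERED COLUMNSTORE INDEX IX_One_CS ON dbo.One (ID, Value, location);
CREATE NONCLUSTERED COLUMNSTORE INDEX IX_Two_CS ON dbo.Two (ID, Value, location);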

The reason you're scanning the tables twice is that you are reading from table X and looking up the corresponding value in table Y. Once that's finished, you do the same but starting from table Y and looking for matches in table X. After that, both results are combined and returned to the caller.
In a way, that's not a bad thing, although if the tables are 'wide' and contain a lot of columns you don't need, you are doing a lot of IO for no good reason. Additionally, in your example, the search for a matching ID in the other table requires scanning the whole table, because there is no 'logic' to the ID field; it simply is a list of values. To speed things up you should add an index on the ID field, which helps the system find a particular ID value MUCH MUCH faster. This also limits the amount of data that needs to be read during the lookup phase: the server will only read from the index, which contains just the ID values (**), and not all the other, unneeded fields.
To be honest, I find your requirement a bit strange, but I'm guessing that's mostly due to simplification to make it understandable here on SO. My first reaction was to suggest a JOIN between both tables, but since the ID fields are non-unique, that results in duplicates! To work around that I added a DISTINCT, but then things slowed down severely. In the end, the plain WHERE ID IN (...) turned out to be the most efficient approach.
Adding indexes on the ID field made it faster although the effect wasn't as big as I expected, probably because there are few other fields and the gain in IO is negligible (read: it all fits in memory even though I tried this on 5 million rows).
FYI: Personally I prefer the construction WHERE EXISTS() over WHERE IN (...) but they're both equivalent and actually produced the exact same query plan.
(**: aside from the indexed fields, every index also contains the clustered index -- which usually is the Primary Key of the table -- fields in its leaf data. For more information Kimberly L. Tripp has some interesting articles about indexes and how they work.)

Related

Improve insert performance when checking existing rows

I have this simple query that inserts rows from one table (sn_users_main) into another (sn_users_history).
To make sure sn_users_history only has unique rows it checks if the column query_time already exists and if it does then don't insert. query_time is kind of a session identifier that is the same for every row in sn_users_main.
This works fine, but since sn_users_history is reaching 50k rows, running this query takes more than 2 minutes, which is too much. Is there anything I can do to improve performance and get the same result?
INSERT INTO sn_users_history(query_time,user_id,sn_name,sn_email,sn_manager,sn_active,sn_updated_on,sn_last_Login_time,sn_is_vip,sn_created_on,sn_is_team_lead,sn_company,sn_department,sn_division,sn_role,sn_employee_profile,sn_location,sn_employee_type,sn_workstation) --- Columns of history table
SELECT snm.query_time,
snm.user_id,
snm.sn_name,
snm.sn_email,
snm.sn_manager,
snm.sn_active,
snm.sn_updated_on,
snm.sn_last_Login_time,
snm.sn_is_vip,
snm.sn_created_on,
snm.sn_is_team_lead,
snm.sn_company,
snm.sn_department,
snm.sn_division,
snm.sn_role,
snm.sn_employee_profile,
snm.sn_location,
snm.sn_employee_type,
snm.sn_workstation
---Columns of main table
FROM sn_users_main snm
WHERE NOT EXISTS(SELECT snh.query_time
FROM sn_users_history snh
WHERE snh.query_time = snm.query_time) --Dont insert items into history table if they already exist
I think you are missing an extra condition on user_id when you are inserting into the history table; you have to check the combination of user_id and query_time.
For your question, I think you are trying to reinvent the wheel. SQL Server already has temporal tables to support this kind of historical data. Read about SQL Server Temporal Tables.
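As a rough sketch of what a system-versioned table could look like (SQL Server 2016+; the column list is abbreviated, and the ValidFrom/ValidTo and history-table names are illustrative, not from the question):
CREATE TABLE sn_users_main_v2 (
user_id INT NOT NULL PRIMARY KEY CLUSTERED,
sn_name VARCHAR(200),
-- ... the remaining sn_* columns from the question ...
ValidFrom DATETIME2 GENERATED ALWAYS AS ROW START NOT NULL,
ValidTo DATETIME2 GENERATED ALWAYS AS ROW END NOT NULL,
PERIOD FOR SYSTEM_TIME (ValidFrom, ValidTo)
)
WITH (SYSTEM_VERSIONING = ON (HISTORY_TABLE = dbo.sn_users_main_versions));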
If you still want to continue with this approach, I would suggest you do it in batches:
Create a configuration Table to hold the last processed querytime
CREATE TABLE HistoryConfig(HistoryConfigId int, HistoryTableName SYSNAME,
lastProcessedQueryTime DATETIME)
Then you can do incremental historical inserts:
DECLARE @lastProcessedQueryTime DATETIME = (SELECT MAX(lastProcessedQueryTime) FROM HistoryConfig)
INSERT INTO sn_users_history(query_time,user_id,sn_name,sn_email,sn_manager,sn_active,sn_updated_on,sn_last_Login_time,sn_is_vip,sn_created_on,sn_is_team_lead,sn_company,sn_department,sn_division,sn_role,sn_employee_profile,sn_location,sn_employee_type,sn_workstation) --- Columns of history table
SELECT snm.query_time,
snm.user_id,
snm.sn_name,
snm.sn_email,
snm.sn_manager,
snm.sn_active,
snm.sn_updated_on,
snm.sn_last_Login_time,
snm.sn_is_vip,
snm.sn_created_on,
snm.sn_is_team_lead,
snm.sn_company,
snm.sn_department,
snm.sn_division,
snm.sn_role,
snm.sn_employee_profile,
snm.sn_location,
snm.sn_employee_type,
snm.sn_workstation
---Columns of main table
FROM sn_users_main snm
WHERE query_time > @lastProcessedQueryTime
Now, you can update the configuration again
UPDATE HistoryConfig
SET lastProcessedQueryTime = (SELECT MAX(query_time) FROM sn_users_history)
WHERE HistoryTableName = 'sn_users_history'
I would suggest you create a clustered index on (user_id, query_time) if possible (otherwise a non-clustered index), which will improve the performance.
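A sketch of that index (choose clustered only if the history table does not already have a clustered index):
CREATE CLUSTERED INDEX IX_sn_users_history_user_query
ON sn_users_history (user_id, query_time);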
Other approaches you can think of:
Create a clustered index on (user_id, query_time) in the historical table, have the same (user_id, query_time) clustered index on the main table, and perform a MERGE operation, as sketched below.
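A hedged sketch of that MERGE, keying on the (user_id, query_time) combination; only a couple of the sn_* columns are written out:
MERGE sn_users_history AS tgt
USING sn_users_main AS src
ON tgt.user_id = src.user_id AND tgt.query_time = src.query_time
WHEN NOT MATCHED BY TARGET THEN
INSERT (query_time, user_id, sn_name, sn_email) -- list all history columns in practice
VALUES (src.query_time, src.user_id, src.sn_name, src.sn_email);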

SQL Server Query Join Optimization

I have looked for answers online but can't find a definitive one. For example, you have 2 join clauses:
1.
JOIN T2 ON T1.[ID] = T2.[ID]
2.
JOIN T2 ON T1.[ID] = REPLACE(T2.[ID],'A', '')
Now the 2nd one performs worse due to the function on the join clause. What is the exact reason for this?
And for example, if this code was in a stored procedure what would be best to optimise it? To remove the replace function and add it to the table level so all of this is completed before any joins?
Any advice or links to further information would be great. Thanks
In your second example, you are attempting to find a record in T2 - but instead of the value being the T1.ID value, you are applying a function to T2.ID - REPLACE(T2.[ID],'A', '')
If you had an index on T2.ID - at best it would scan the index and not seek it - thus causing a performance difference.
This is where it gets harder to explain: the index is stored as a B+tree of the values of T2.ID on the table. The index understands that field and can search and sort by it, but it doesn't understand any logic applied to it.
It does not know whether REPLACE('A123','A','') = 123 without executing the function on the value in the index and checking the resulting equality.
'AAA123' would also match, as would '1A23', '12A3', '123A', etc.; there is a never-ending number of combinations that would in fact match, and the only way it can figure out whether a single index entry matches is to run that value through the function and then check the equality.
Since it can only figure that out by running the index value through the function, it can only answer the query correctly by doing that for every entry in the index - i.e. an index scan, with every entry passed into the function and the output checked.
As Jeroen mentions the term is SARGable or SARGability, Search ARGumentABLE, although I personally prefer to explain it as Seek ARGumentABLE since that is a closer match to the query plan operator.
It should be noted that this concept has nothing to do with it being a join, any predicate within SQL has this restriction - a single table query with a where predicate can have the same issue.
Can this problem be avoided? It can but only in some instances, where you can reverse the operation.
Consider a table with an ID column, I could construct a predicate such as this :
WHERE ID * 2 = @paramValue
The only way SQL Server would know if an ID entry multiplied by 2 is the passed in value is to process every entry, double it and check. So that is the index scan scenario again.
In this instance we can re-write it:
WHERE ID = @paramValue / 2.0
Now SQL Server performs the arithmetic once on the passed-in value and can then check the result against the index in a seekable manner. The rewritten SQL looks like a trivially different way of stating the problem, but it makes a very large difference to how the database can resolve the predicate.
SQL Server has four basic methods for handling joins (as do other databases):
Nested loop without an index. This is like two nested for loops and is usually the slowest method.
Index looped (nested loop with an index). This a scan of one table with a lookup in the second.
Merge join. This assumes that the two tables are ordered and loops through the two tables at the same time (this can also be accomplished using indexes).
Hash join. The keys for the two tables are hashed and hash-tables are used for matching.
In general, the first of these is the slowest and the second -- using an index -- is often the fastest (there are exceptions).
When you use an equality comparison between two columns in the table, SQL Server has a lot of information for deciding on the best join algorithm to use:
It has information on indexes.
It has statistics on the column.
Without this information, SQL Server often defaults to the nested-loop join. I find that it does this even when it could use the expression for a merge- or hash-based join.
As a note, you can work around this by using a computed column:
alter table t2 add id_no_a as (replace(id, 'A', '')) persisted;
create index idx_t2_id_no_a on t2(id_no_a);
Then phrase the join as:
on T1.[ID] = t2.id_no_a
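Putting it together, the query might then read like this sketch, which reuses the computed column defined above:
SELECT t1.*
FROM T1 t1
INNER JOIN T2 t2 ON t1.[ID] = t2.id_no_a; -- seekable via idx_t2_id_no_a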
An example of using UNION ALL to avoid searches that cannot use an index:
DECLARE #T1 TABLE (ID VARCHAR(16), CODE INT)
DECLARE #T2 TABLE (ID VARCHAR(16), CODE INT)
INSERT INTO #T1 VALUES ('ASD',1)
INSERT INTO #T1 VALUES ('DFG',2)
INSERT INTO #T1 VALUES ('RTY',3)
INSERT INTO #T1 VALUES ('AZX',4)
INSERT INTO #T1 VALUES ('GTY',5)
INSERT INTO #T1 VALUES ('KKO',6)
INSERT INTO #T2 VALUES ('ASD',1)
INSERT INTO #T2 VALUES ('SD',2)
INSERT INTO #T2 VALUES ('DFG',3)
INSERT INTO #T2 VALUES ('RTY',4)
INSERT INTO #T2 VALUES ('AZX',5)
INSERT INTO #T2 VALUES ('ZX',6)
INSERT INTO #T2 VALUES ('GTY',7)
INSERT INTO #T2 VALUES ('GTYA',8)
INSERT INTO #T2 VALUES ('KKO',9)
INSERT INTO #T2 VALUES ('KKOA',10)
INSERT INTO #T2 VALUES ('AKKOA',11)
SELECT * FROM #T1 T1 INNER JOIN (SELECT ID FROM #T2 WHERE ID NOT LIKE '%A%') T2 ON T2.ID = T1.ID
UNION ALL
SELECT * FROM #T1 T1 INNER JOIN (SELECT REPLACE(ID, 'A', '') AS ID FROM #T2 WHERE ID LIKE '%A%') T2 ON T2.ID = T1.ID
This is what you can do without schema changes.
With schema changes, you need to create a computed, indexed column on T2 and join on it. This is much faster, and most of the cost moves to the inserts/updates that maintain the extra column and the index on it.

Fastest options for merging two tables in SQL Server

Consider two very large tables: Table A with 20 million rows, and Table B, which has a large overlap with Table A, with 10 million rows. Both have an identifier column and a bunch of other data. I need to move all items from Table B into Table A, updating those that already exist.
Both table structures
- Identifier int
- Date DateTime,
- Identifier A
- Identifier B
- General decimal data.. (maybe 10 columns)
I can get the items in Table B that are new, and the items in Table B that need to be updated in Table A, very quickly, but I can't get an update or a delete-and-insert to work quickly. What options are available to merge the contents of Table B into Table A (i.e. updating existing records instead of inserting) in the shortest time?
I've tried pulling out existing records in TableB and running a large update on table A to update just those rows (i.e. an update statement per row), and performance is pretty bad, even with a good index on it.
I've also tried doing a one shot delete of the different values out of TableA that exist in TableB and performance of the delete is also poor, even with the indexes dropped.
I appreciate that this may be difficult to perform quickly, but I'm looking for other options that are available to achieve this.
Since you are dealing with two large tables, in-place updates/inserts/merges can be time-consuming operations. I would recommend a minimally logged technique: load the desired content into a new table and then perform a table swap.
Example using SELECT INTO:
SELECT *
INTO NewTableA
FROM (
SELECT * FROM dbo.TableB b WHERE NOT EXISTS (SELECT * FROM dbo.TableA a WHERE a.id = b.id)
UNION ALL
SELECT * FROM dbo.TableA a
) d
exec sp_rename 'TableA', 'BackupTableA'
exec sp_rename 'NewTableA', 'TableA'
The Simple or at least Bulk-Logged recovery model is highly recommended for such an approach. Also, I assume it has to be done outside business hours, since plenty of objects have to be recreated on the new table: indexes, default constraints, the primary key, etc.
A MERGE is probably your best bet if you want both inserts and updates.
MERGE #TableA AS Tgt
USING (SELECT * FROM #TableB) Src
ON (Tgt.Identifier = Src.Identifier)
WHEN MATCHED THEN
UPDATE SET Date = Src.Date, ...
WHEN NOT MATCHED THEN
INSERT (Identifier, Date, ...)
VALUES (Src.Identifier, Src.Date, ...);
Note that the merge statement must be terminated with a ;

Is it possible to add an index to a temp table? And what's the difference between create #t and declare @t?

I need to do a very complex query.
At one point, this query must have a join to a view that cannot be indexed unfortunately.
This view is also a complex view joining big tables.
View's output can be simplified as this:
PID (int), Kind (int), Date (date), D1,D2..DN
where the PID, Date and Kind fields are not unique (there may be more than one row having the same combination of PID, Kind and Date), but they are the fields that will be used in joins like this:
left join ComplexView mkcs on mkcs.PID=q4.PersonID and mkcs.Date=q4.date and mkcs.Kind=1
left join ComplexView mkcl on mkcl.PID=q4.PersonID and mkcl.Date=q4.date and mkcl.Kind=2
left join ComplexView mkco on mkco.PID=q4.PersonID and mkco.Date=q4.date and mkco.Kind=3
Now, if I just do it like this, execution of the query takes significant time, because the complex view is run three times, I assume, and out of its huge number of rows only some are actually used (say, out of 40,000 only 2,000 are used).
What I did is declare @temptable, and insert into @temptable select * from ComplexView where Date... - once per query I select only the rows I am going to use from my ComplexView, and then I join this @temptable.
This reduced execution time significantly.
However, I noticed that if I create a real table in my database with a clustered index on PID, Kind, Date (non-unique clustered) and take data from that table, then doing a delete from this table plus an insert into it from the complex view takes a few seconds (3 or 4), and using this table in my query (left joining it three times) brings the query time down by half, from 1 minute to 30 seconds!
So, my question is, first of all: is it possible to create indexes on declared @temptables?
And then: I've seen people talk about the "create #temptable" syntax. Maybe this is what I need? Where can I read about the difference between declare @temptable and create #temptable? Which should I use for a query like mine? (This query is for an MS Reporting Services report, if it matters.)
#tablename is a physical table, stored in tempdb, that the server drops automatically when the connection that created it is closed; @tablename is a table variable that lives for the lifetime of the batch/procedure that created it, just like a local variable.
You can only add a (non PK) index to a #temp table.
create table #blah (fld int)
create nonclustered index idx on #blah (fld)
It's not a complete answer, but #table creates a temporary table that you need to drop, or it will persist until your connection closes. @table is a table variable that will not persist longer than your script.
Also, I think this post will answer the other part of your question.
Creating an index on a table variable
Yes, you can create indexes on temp tables or table variables. http://sqlserverplanet.com/sql/create-index-on-table-variable/
The @tableName syntax is a table variable. They are rather limited. The syntax is described in the documentation for DECLARE @local_variable. You can kind of have indexes on table variables, but only indirectly, by specifying PRIMARY KEY and UNIQUE constraints on columns. So, if the data in the columns that you need an index on happens to be unique, you can do this. See this answer. This may be "enough" for many use cases, but only for small numbers of rows. If you don't have indexes on your table variable, the optimizer will generally treat it as if it contains one row (regardless of how many rows there actually are), which can result in terrible query plans if you have hundreds or thousands of rows in it instead.
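For illustration, a sketch of such indirect "indexes" on a table variable via inline constraints (the names and columns are made up):
DECLARE @t TABLE (
ID INT PRIMARY KEY, -- backed by a clustered index
Kind INT NOT NULL,
UNIQUE (Kind, ID) -- a second index, via a UNIQUE constraint
);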
The #tableName syntax is a locally-scoped temporary table. You can create these using either SELECT…INTO #tableName or CREATE TABLE #tableName syntax. The scope of these tables is a little more complex than that of variables. If you have CREATE TABLE #tableName in a stored procedure, all references to #tableName in that stored procedure will refer to that table. If you simply reference #tableName in the stored procedure (without creating it), it will look into the caller's scope. So you can create #tableName in one procedure, call another procedure, and in that other procedure read/update #tableName. However, once the procedure that created #tableName runs to completion, that table will be automatically cleaned up by SQL Server. So, there is no reason to manually clean up these tables unless you have a procedure which is meant to loop/run indefinitely or for long periods of time.
You can define complex indexes on temporary tables, just as if they are permanent tables, for the most part. So if you need to index columns but have duplicate values which prevents you from using UNIQUE, this is the way to go. You do not even have to worry about name collisions on indexes. If you run something like CREATE INDEX my_index ON #tableName(MyColumn) in multiple sessions which have each created their own table called #tableName, SQL Server will do some magic so that the reuse of the global-looking identifier my_index does not explode.
Additionally, temporary tables automatically build statistics, etc., like normal tables. The query optimizer will recognize that temporary tables can have more than just 1 row in them, which can in itself result in great performance gains over table variables. Of course, this also adds a tiny amount of overhead, though it is likely worth it and not noticeable if your query's runtime is longer than one second.
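Applied to the question's scenario, a sketch of the temp-table variant with a non-unique clustered index (the D-column types and the date filter are assumptions):
DECLARE @fromDate DATE = '2020-01-01'; -- hypothetical filter value

CREATE TABLE #ComplexViewRows (PID INT, Kind INT, [Date] DATE, D1 INT, D2 INT);
CREATE CLUSTERED INDEX IX_ComplexViewRows ON #ComplexViewRows (PID, Kind, [Date]);

INSERT INTO #ComplexViewRows (PID, Kind, [Date], D1, D2)
SELECT PID, Kind, [Date], D1, D2
FROM ComplexView
WHERE [Date] >= @fromDate; -- keep only the rows the joins will actually use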
To extend Alex K.'s answer, you can create the PRIMARY KEY on a temp table
IF OBJECT_ID('tempdb..#tempTable') IS NOT NULL
DROP TABLE #tempTable
CREATE TABLE #tempTable
(
Id INT PRIMARY KEY
,Value NVARCHAR(128)
)
INSERT INTO #tempTable
VALUES
(1, 'first value')
,(3, 'second value')
-- will cause Violation of PRIMARY KEY constraint 'PK__#tempTab__3214EC071AE8C88D'. Cannot insert duplicate key in object 'dbo.#tempTable'. The duplicate key value is (1).
--,(1, 'first value one more time')
SELECT * FROM #tempTable

Equivalent of a composite index across multiple tables?

I have a table structure similar the following:
create table MAIL (
ID int,
FROM varchar,
SENT_DATE date
);
create table MAIL_TO (
ID int,
MAIL_ID int,
NAME varchar
);
and I need to run the following query:
select m.ID
from MAIL m
inner join MAIL_TO t on t.MAIL_ID = m.ID
where m.SENT_DATE between '07/01/2010' and '07/30/2010'
and t.NAME = 'someone@example.com'
Is there any way to design indexes such that both of the conditions can use an index? If I put an index on MAIL.SENT_DATE and an index on MAIL_TO.NAME, the database will choose to use either one of the indexes or the other, not both. After filtering by the first condition the database always has to do a full scan of the results for the second condition.
Oracle can use both indices. You just don't have the right two indices.
Consider: if the query plan uses your index on mail.sent_date first, what does it get from mail? It gets all the mail.ids where mail.sent_date is within the range you gave in your where clause, yes?
So it goes to mail_to with a list of mail.ids and the name you gave in your where clause. At this point, Oracle decides that it's better to scan the table for matching mail_to.mail_ids than to use the index on mail_to.name.
Indices on varchars are always problematic, and Oracle really prefers full table scans. But if we give Oracle an index containing the columns it really wants to use, then, depending on total table rows and statistics, we can get it to use it. This is the index:
create index mail_to_pid_name on mail_to( mail_id, name ) ;
This works where an index just on name doesn't, because Oracle's not looking just for a name, but for a mail_id and a name.
Conversely, if the cost-based analyzer determines it's cheaper to go to table mail_to first and uses your index on mail_to.name, what does it get? A bunch of mail_to.mail_ids to look up in mail. It needs to find rows with those ids and certain sent_dates, so:
create index mail_id_sentdate on mail( sent_date, id ) ;
Note that in this case I've put sent_date first in the index, and id second. (This is more an intuitive thing.)
Again, the take home point is this: in creating indices, you have to consider not just the columns in your where clause, but also the columns in your join conditions.
Update
jthg: yes, it always depends on how the data is distributed, and on how many rows are in the table: if very many, Oracle will do a table scan and hash join; if very few, it will do a table scan as well. You might reverse the column order of either of the two indices. By putting sent_date first in the second index, we eliminate most needs for an index solely on sent_date.
A materialized view would allow you to index the values, assuming the stringent materialized view criteria are met.
Which criterion is more selective: the date range or the addressee? I would guess the addressee. And if that is highly selective, do not bother with the date index; just let the database do the search based on the found mail ids. But index table MAIL on id if it is not already.
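In Oracle syntax that would be a one-line sketch (the index name is made up; skip it if ID is already the primary key):
create index mail_id_idx on mail (id);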
On the other hand, some modern optimizers would even make use of both indexes, scanning both tables and then building a hash value of the join columns to merge the results of both. I am not absolutely sure if and when Oracle would choose this strategy; I have just noticed that SQL Server tends to use hash joins rather often, compared to other engines.
In situations where the requirements aren't met for a materialized view, there are these two options:
1) You can create a cross-reference table and keep it updated with triggers.
The concepts would be the same in Oracle, but I only have SQL Server installed at the moment to run the test; see this setup:
create table MAIL (
ID INT IDENTITY(1,1),
[FROM] VARCHAR(200),
SENT_DATE DATE,
CONSTRAINT PK_MAIL PRIMARY KEY (ID)
);
create table MAIL_TO (
ID INT IDENTITY(1,1),
MAIL_ID INT,
[NAME] VARCHAR (200),
CONSTRAINT PK_MAIL_TO PRIMARY KEY (ID)
);
ALTER TABLE [dbo].[MAIL_TO] WITH CHECK ADD CONSTRAINT [FK_MAILTO_MAIL] FOREIGN KEY([MAIL_ID])
REFERENCES [dbo].[MAIL] ([ID])
GO
ALTER TABLE [dbo].[MAIL_TO] CHECK CONSTRAINT [FK_MAILTO_MAIL]
GO
CREATE TABLE CompositeIndex_MailSentDate_MailToName (
[MAIL_ID] INT,
[MAILTO_ID] INT,
SENT_DATE DATE,
MAILTO_NAME VARCHAR(200),
CONSTRAINT PK_CompositeIndex_MailSentDate_MailToName PRIMARY KEY (MAILTO_ID,MAIL_ID)
)
GO
CREATE NONCLUSTERED INDEX IX_MailSent_MailTo ON dbo.CompositeIndex_MailSentDate_MailToName (SENT_DATE,MAILTO_NAME)
CREATE NONCLUSTERED INDEX IX_MailTo_MailSent ON dbo.CompositeIndex_MailSentDate_MailToName (MAILTO_NAME,SENT_DATE)
GO
CREATE TRIGGER dbo.trg_MAILTO_Insert
ON dbo.MAIL_TO
AFTER INSERT AS
BEGIN
INSERT INTO dbo.CompositeIndex_MailSentDate_MailToName ( MAIL_ID, MAILTO_ID, SENT_DATE, MAILTO_NAME )
SELECT mailTo.MAIL_ID,mailTo.ID,m.SENT_DATE,mailTo.NAME
FROM
inserted mailTo
INNER JOIN dbo.MAIL m ON m.ID = mailTo.MAIL_ID
END
GO
CREATE TRIGGER dbo.trg_MAILTO_Delete
ON dbo.MAIL_TO
AFTER DELETE AS
BEGIN
-- remove the matching rows from the cross-reference table
DELETE compositeIndex
FROM
dbo.CompositeIndex_MailSentDate_MailToName compositeIndex
INNER JOIN deleted ON compositeIndex.MAILTO_ID = deleted.ID
END
GO
CREATE TRIGGER dbo.trg_MAILTO_Update
ON dbo.MAIL_TO
AFTER UPDATE AS
BEGIN
UPDATE compositeIndex
SET
compositeIndex.MAILTO_NAME = updates.NAME
FROM
dbo.CompositeIndex_MailSentDate_MailToName compositeIndex
INNER JOIN inserted updates ON updates.ID = compositeIndex.MAILTO_ID
END
GO
CREATE TRIGGER dbo.trg_MAIL_Update
ON dbo.MAIL
AFTER UPDATE AS
BEGIN
UPDATE compositeIndex
SET
compositeIndex.SENT_DATE = updates.SENT_DATE
FROM
dbo.CompositeIndex_MailSentDate_MailToName compositeIndex
INNER JOIN inserted updates ON updates.ID = compositeIndex.MAIL_ID
END
GO
INSERT INTO dbo.MAIL ( [FROM], SENT_DATE )
SELECT 'SenderA','2018-10-01'
UNION ALL SELECT 'SenderA','2018-10-02'
INSERT INTO dbo.MAIL_TO ( MAIL_ID, NAME )
SELECT 1,'CustomerA'
UNION ALL SELECT 1,'CustomerB'
UNION ALL SELECT 2,'CustomerC'
UNION ALL SELECT 2,'CustomerD'
UNION ALL SELECT 2,'CustomerE'
SELECT * FROM dbo.MAIL
SELECT * FROM dbo.MAIL_TO
SELECT * FROM dbo.CompositeIndex_MailSentDate_MailToName
You can then use the dbo.CompositeIndex_MailSentDate_MailToName table to JOIN to the rest of your data. This is useful in environments where your rate of inserts and updates is low but your query needs are high, so the relative overhead of implementing the triggers is small.
This has the advantage of being updated transactionally, in real time.
2) If you don't want the performance/management overhead of a trigger, and you only need this for next day reporting, you can create a view, and a nightly process which truncates the table and selects the entire view into a materialized table.
I've used this successfully to index flattened relational data requiring joins across a dozen or so tables, reducing report times from hours to seconds. While it's an expensive query, you can set the job to run off-hours if you have periods of reduced usage.
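A sketch of such a nightly refresh; the flattened table and the view feeding it are hypothetical names:
TRUNCATE TABLE dbo.MailReportFlat;

INSERT INTO dbo.MailReportFlat (MAIL_ID, MAILTO_ID, SENT_DATE, MAILTO_NAME)
SELECT MAIL_ID, MAILTO_ID, SENT_DATE, MAILTO_NAME
FROM dbo.vw_MailReportFlat; -- the view doing the expensive joins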
If your queries are generally for a particular month, then you could partition the data by month.
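In SQL Server, that could look like the following sketch; the boundary dates and object names are illustrative:
CREATE PARTITION FUNCTION pf_MailByMonth (DATE)
AS RANGE RIGHT FOR VALUES ('2018-01-01', '2018-02-01', '2018-03-01');

CREATE PARTITION SCHEME ps_MailByMonth
AS PARTITION pf_MailByMonth ALL TO ([PRIMARY]);

-- rebuilding the clustered index on the scheme moves the table onto the partitions
CREATE CLUSTERED INDEX IX_MAIL_SentDate ON dbo.MAIL (SENT_DATE) ON ps_MailByMonth (SENT_DATE);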