Joining 100 tables - sql

Assume that I have a main table which has 100 columns referencing (as foreign keys) to some 100 tables (containing primary keys).
Retrieving the whole pack of information requires joining all those 100 tables, and joining that many tables is definitely a performance issue. Fortunately, we can expect that any given user will request a bunch of data containing values from no more than some 5-7 tables (out of those 100), in queries that put conditions (in the WHERE part of the query) on fields from some 3-4 tables (out of those 100). Different queries use different combinations of tables to produce the SELECT part of the query and to put conditions in the WHERE part. But, again, every SELECT would require some 5-7 tables and every WHERE some 3-4 tables (and, of course, the list of tables used in the SELECT may overlap with the list of tables used in the WHERE).
I can write a VIEW whose underlying code joins all those 100 tables, and then write the above-mentioned SQL queries against this VIEW. But then the big issue for me is how to instruct SQL Server that (despite the explicit instruction in the code to join all 100 tables) only some 11 tables should actually be joined (11 tables being enough to produce the SELECT outcome and to take the WHERE conditions into account).
Another approach may be to create a "feature" that converts the following "fake" code
SELECT field1, field2, field3 FROM TheFakeTable WHERE field1=12 and field4=5
into the following "real" code:
SELECT T1.field1, T2.field2, T3.field3 FROM TheRealMainTable
join T1 on ....
join T2 on ....
join T3 on ....
join T4 on ....
WHERE T1.field1=12 and T4.field4=5
From a grammatical point of view, it is not a problem even to allow arbitrary mixing of this "TheFakeTable" mechanism with real tables and constructions. The real problem is how to realize this "feature" technically. I could create a function which takes the "fake" code as input and produces the "real" code. But that is not convenient, because it requires using dynamic SQL tools everywhere this "TheFakeTable" mechanism appears. A fantasy-land solution would be to extend the grammar of the SQL language in my Management Studio, so that such fake code could be written and then automatically converted into the real code before being sent to the server.
My questions are:
Can SQL Server be instructed somehow (or be smart enough on its own) to join only 11 tables instead of 100 in the VIEW described above?
If I decide to create this "TheFakeTable" mechanism, what would be the best form for its technical realization?
Thanks to everyone for every comment!
PS
The structure with 100 tables arises from the following question that I asked here:
Normalizing an extremely big table

The SQL Server optimizer does contain logic to remove redundant joins, but there are restrictions, and the joins have to be provably redundant. To summarize, a join can have four effects:
It can add extra columns (from the joined table)
It can add extra rows (the joined table may match a source row more than once)
It can remove rows (the joined table may not have a match)
It can introduce NULLs (for a RIGHT or FULL JOIN)
To successfully remove a redundant join, the query (or view) must account for all four possibilities. When this is done correctly, the effect can be astonishing. For example:
USE AdventureWorks2012;
GO
CREATE VIEW dbo.ComplexView
AS
SELECT
pc.ProductCategoryID, pc.Name AS CatName,
ps.ProductSubcategoryID, ps.Name AS SubCatName,
p.ProductID, p.Name AS ProductName,
p.Color, p.ListPrice, p.ReorderPoint,
pm.Name AS ModelName, pm.ModifiedDate
FROM Production.ProductCategory AS pc
FULL JOIN Production.ProductSubcategory AS ps ON
ps.ProductCategoryID = pc.ProductCategoryID
FULL JOIN Production.Product AS p ON
p.ProductSubcategoryID = ps.ProductSubcategoryID
FULL JOIN Production.ProductModel AS pm ON
pm.ProductModelID = p.ProductModelID
The optimizer can successfully simplify the following query:
SELECT
c.ProductID,
c.ProductName
FROM dbo.ComplexView AS c
WHERE
c.ProductName LIKE N'G%';
To a plan that accesses only the Product table (execution plan image not reproduced here).
Rob Farley wrote about these ideas in depth in the original MVP Deep Dives book, and there is a recording of him presenting on the topic at SQLBits.
The main restrictions are that foreign key relationships must be based on a single key to contribute to the simplification process, and compilation time for queries against such a view may become quite long, particularly as the number of joins increases. It could be quite a challenge to write a 100-table view that gets all the semantics exactly correct. I would be inclined to find an alternative solution, perhaps using dynamic SQL.
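For illustration only, a minimal dynamic SQL sketch of that idea, written against the example schema in the Example section below (the join list here is hard-coded; a real implementation would build the SELECT list and joins from the requested column names):
DECLARE @sql nvarchar(max) = N'
SELECT r01.item AS item01, r08.item AS item08
FROM dbo.Normalized AS n
JOIN dbo.Ref01 AS r01 ON r01.col01 = n.col01
JOIN dbo.Ref08 AS r08 ON r08.col08 = n.col08
WHERE r08.item = @item08;';

-- Only the reference tables actually requested are joined
EXEC sys.sp_executesql
    @sql,
    N'@item08 varchar(50)',
    @item08 = 'Banana';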
That said, the particular qualities of your denormalized table may mean the view is quite simple to assemble, requiring only enforced FOREIGN KEYs, non-NULLable referencing columns, and appropriate UNIQUE constraints to make this solution work as you would hope, without the overhead of 100 physical join operators in the plan.
Example
Using ten tables rather than a hundred:
-- Referenced tables
CREATE TABLE dbo.Ref01 (col01 tinyint PRIMARY KEY, item varchar(50) NOT NULL UNIQUE);
CREATE TABLE dbo.Ref02 (col02 tinyint PRIMARY KEY, item varchar(50) NOT NULL UNIQUE);
CREATE TABLE dbo.Ref03 (col03 tinyint PRIMARY KEY, item varchar(50) NOT NULL UNIQUE);
CREATE TABLE dbo.Ref04 (col04 tinyint PRIMARY KEY, item varchar(50) NOT NULL UNIQUE);
CREATE TABLE dbo.Ref05 (col05 tinyint PRIMARY KEY, item varchar(50) NOT NULL UNIQUE);
CREATE TABLE dbo.Ref06 (col06 tinyint PRIMARY KEY, item varchar(50) NOT NULL UNIQUE);
CREATE TABLE dbo.Ref07 (col07 tinyint PRIMARY KEY, item varchar(50) NOT NULL UNIQUE);
CREATE TABLE dbo.Ref08 (col08 tinyint PRIMARY KEY, item varchar(50) NOT NULL UNIQUE);
CREATE TABLE dbo.Ref09 (col09 tinyint PRIMARY KEY, item varchar(50) NOT NULL UNIQUE);
CREATE TABLE dbo.Ref10 (col10 tinyint PRIMARY KEY, item varchar(50) NOT NULL UNIQUE);
The parent table definition (with page-compression):
CREATE TABLE dbo.Normalized
(
pk integer IDENTITY NOT NULL,
col01 tinyint NOT NULL REFERENCES dbo.Ref01,
col02 tinyint NOT NULL REFERENCES dbo.Ref02,
col03 tinyint NOT NULL REFERENCES dbo.Ref03,
col04 tinyint NOT NULL REFERENCES dbo.Ref04,
col05 tinyint NOT NULL REFERENCES dbo.Ref05,
col06 tinyint NOT NULL REFERENCES dbo.Ref06,
col07 tinyint NOT NULL REFERENCES dbo.Ref07,
col08 tinyint NOT NULL REFERENCES dbo.Ref08,
col09 tinyint NOT NULL REFERENCES dbo.Ref09,
col10 tinyint NOT NULL REFERENCES dbo.Ref10,
CONSTRAINT PK_Normalized
PRIMARY KEY CLUSTERED (pk)
WITH (DATA_COMPRESSION = PAGE)
);
The view:
CREATE VIEW dbo.Denormalized
WITH SCHEMABINDING AS
SELECT
item01 = r01.item,
item02 = r02.item,
item03 = r03.item,
item04 = r04.item,
item05 = r05.item,
item06 = r06.item,
item07 = r07.item,
item08 = r08.item,
item09 = r09.item,
item10 = r10.item
FROM dbo.Normalized AS n
JOIN dbo.Ref01 AS r01 ON r01.col01 = n.col01
JOIN dbo.Ref02 AS r02 ON r02.col02 = n.col02
JOIN dbo.Ref03 AS r03 ON r03.col03 = n.col03
JOIN dbo.Ref04 AS r04 ON r04.col04 = n.col04
JOIN dbo.Ref05 AS r05 ON r05.col05 = n.col05
JOIN dbo.Ref06 AS r06 ON r06.col06 = n.col06
JOIN dbo.Ref07 AS r07 ON r07.col07 = n.col07
JOIN dbo.Ref08 AS r08 ON r08.col08 = n.col08
JOIN dbo.Ref09 AS r09 ON r09.col09 = n.col09
JOIN dbo.Ref10 AS r10 ON r10.col10 = n.col10;
Hack the statistics to make the optimizer think the table is very large:
UPDATE STATISTICS dbo.Normalized WITH ROWCOUNT = 100000000, PAGECOUNT = 5000000;
Example user query:
SELECT
d.item06,
d.item07
FROM dbo.Denormalized AS d
WHERE
d.item08 = 'Banana'
AND d.item01 = 'Green';
This gives us an execution plan (image not reproduced here) that joins only the reference tables the query actually needs.
The scan of the Normalized table looks bad, but both Bloom-filter bitmaps are applied during the scan by the storage engine (so rows that cannot match do not even surface as far as the query processor). This may be enough to give acceptable performance in your case, and certainly better than scanning the original table with its overflowing columns.
If you are able to upgrade to SQL Server 2012 Enterprise at some stage, you have another option: creating a column-store index on the Normalized table:
CREATE NONCLUSTERED COLUMNSTORE INDEX cs
ON dbo.Normalized (col01,col02,col03,col04,col05,col06,col07,col08,col09,col10);
The new execution plan (image not reproduced here) is built around the column store index.
That probably looks worse to you, but column storage provides exceptional compression, and the whole execution plan runs in Batch Mode with filters for all the contributing columns. If the server has adequate threads and memory available, this alternative could really fly.
Ultimately, I'm not sure this normalization is the correct approach considering the number of tables and the chances of getting a poor execution plan or requiring excessive compilation time. I would probably correct the schema of the denormalized table first (proper data types and so on), possibly apply data compression...the usual things.
If the data truly belongs in a star-schema, it probably needs more design work than just splitting off repeating data elements into separate tables.

Why do you think joining 100 tables would be a performance issue?
If all the keys are primary keys, then all the joins will use indexes. The only question, then, is whether the indexes fit into memory. If they fit in memory, performance is probably not an issue at all.
You should try the query with the 100 joins before making such a statement.
Furthermore, based on the original question, the reference tables have just a few values in them. Each table would fit on a single page, plus another page for its index. That is 200 pages in total, which would occupy at most a few megabytes of your page cache. Don't worry about the optimizations, create the view, and if you have performance problems then think about the next steps. Don't presuppose performance problems.
ELABORATION:
This has received a lot of comments. Let me explain why this idea may not be as crazy as it sounds.
First, I am assuming that all the joins are done through primary key indexes, and that the indexes fit into memory.
The 100 keys on the page occupy 400 bytes. Let's say that the original strings are, on average 40 bytes each. These would have occupied 4,000 bytes on the page, so we have a savings. In fact, about 2 records would fit on a page in the previous scheme. About 20 fit on a page with the keys.
So, to read the records with the keys is about 10 times faster in terms of I/O than reading the original records. With the assumptions about the small number of values, the indexes and original data fit into memory.
How long does it take to read 20 records? The old way required reading 10 pages. With the keys, there is one page read and 100*20 index lookups (with perhaps an additional lookup to get the value). Depending on the system, the 2,000 index lookups may be faster -- even much faster -- than the additional 9 page I/Os. The point I want to make is that this is a reasonable situation. It may or may not happen on a particular system, but it is not way crazy.
This is a bit oversimplified. SQL Server doesn't actually read pages one-at-a-time. I think they are read in groups of 4 (and there might be look-ahead reads when doing a full-table scan). On the flip side, though, in most cases, a table-scan query is going to be more I/O bound than processor bound, so there are spare processor cycles for looking up values in reference tables.
In fact, using the keys could result in faster reading of the table than not using them, because spare processing cycles would be used for the lookups ("spare" in the sense that processing power is available when reading). In fact, the table with the keys might be small enough to fit into available cache, greatly improving performance of more complex queries.
The actual performance depends on lots of factors, such as the length of the strings, the original table (is it larger than available cache?), the ability of the underlying hardware to do I/O reads and processing at the same time, and the dependence on the query optimizer to do the joins correctly.
My original point was that assuming a priori that the 100 joins are a bad thing is not correct. The assumption needs to be tested, and using the keys might even give a boost to performance.

If your data doesn't change much, you may benefit from creating an Indexed View, which basically materializes the view.
If the data changes often, it may not be a good option, as the server has to maintain the indexed view for each change in the underlying tables of the view.
Here's a good blog post that describes it a bit better.
From the blog:
CREATE VIEW dbo.vw_SalesByProduct_Indexed
WITH SCHEMABINDING
AS
SELECT
Product,
COUNT_BIG(*) AS ProductCount,
SUM(ISNULL(SalePrice,0)) AS TotalSales
FROM dbo.SalesHistory
GROUP BY Product
GO
The script below creates the index on our view:
CREATE UNIQUE CLUSTERED INDEX idx_SalesView ON vw_SalesByProduct_Indexed(Product)
To show that an index has been created on the view and that it does
take up space in the database, run the following script to find out
how many rows are in the clustered index and how much space the view
takes up.
EXECUTE sp_spaceused 'vw_SalesByProduct_Indexed'
The SELECT statement below is the same statement as before, except
this time it performs a clustered index seek, which is typically very
fast.
SELECT
Product, TotalSales, ProductCount
FROM vw_SalesByProduct_Indexed
WHERE Product = 'Computer'
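Note that only Enterprise (and Developer) edition will match an indexed view automatically; on other editions you generally need the NOEXPAND hint for the view's index to be used, as in this variation of the query above:
SELECT
Product, TotalSales, ProductCount
FROM vw_SalesByProduct_Indexed WITH (NOEXPAND)
WHERE Product = 'Computer'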

Related

SQL Dynamic Optimization Tables?

I am a very experienced programmer, but extremely new to SQL, which has a more limited view of things than what is available in code. I think it's possible I'm looking at this wrong in the context of SQL in general, so I'm looking for direction. I do not believe the specific SQL implementation is really important at this point. I think this is just a general SQL conceptual issue, that I'm having.
Here's what I'm thinking:
Say I am going to track the results of a very large number of sporting events (10s of millions or more), with the teams that played in them and the final scores:
CREATE TABLE teams (
TeamID INT NOT NULL PRIMARY KEY,
TeamName VARCHAR(255) NOT NULL
)
CREATE TABLE games (
GameID INT NOT NULL PRIMARY KEY,
TeamA INT NOT NULL,
TeamB INT NOT NULL,
TeamAScore INT,
TeamBScore INT,
FOREIGN KEY (TeamA)
REFERENCES teams (TeamID),
FOREIGN KEY (TeamB)
REFERENCES teams (TeamID)
)
Since the "games" table will be extremely large, when a query is made for the results of a particular team, it seems to me that searching both "TeamA" and "TeamB" columns for matches could be a very time-consuming operation. That would in turn make immediate presentation on a UI a problem.
However, if there were lists of games played by each team, the query could be made much faster (at the expense of more storage):
CREATE TABLE team_TeamID_games (
GameID INT NOT NULL,
FOREIGN KEY (GameID) REFERENCES games (GameID)
)
Then displaying the list of results for a team just involves using the "team_TeamID_games" table and pulling out the results of the "games" table directly, rather than searching it.
The questionable part here starts with the idea of introducing a new table for each team. The "TeamID" portion of the "team_TeamID_games" above would be replaced with the team ID, so there might be tables called "team_1_games", "team_2_games", etc.
That alone seems to break with what I've seen in researching SQL use.
Additionally, from what I've learned of SQL so far, there isn't really a standard way to actually link the "team_TeamID_games" table to the "TeamID" row of the "teams" table, since foreign keys reference a row, not an entire table. And that means the database doesn't really know about the connection.
Alternatively, a VARCHAR() string with the name of the other table could be stored in the "teams" table, but I don't believe that actually means anything to the database either.
Is the concept of a link between tables done above and outside the database itself an extremely bad thing?
Is the creation of such "dynamic" tables (not statically created up front, but created as teams are registered, and populated as the game results are entered) for each team a bad idea?
Is there another way to accomplish this optimization?
Not sure what you consider "extremely" large. With e.g. 2500 teams, the resulting games table would be about 6 million rows. That is not even considered "large" nowadays. With 5000 teams, the games table would have 25 million rows. Still not "extremely" large nowadays.
The query "find all games of a specific team" can be answered using the following query:
select *
from games
where teama = 42
or teamb = 42;
This can (usually) be improved by creating an index on each column:
create index idx_team_a on games (teama);
create index idx_team_b on games (teamb);
Postgres (and probably other DBMS products as well) would be able to use both indexes for that query. On my laptop (with 2500 teams and 6.2 million games) that query takes about 3 milliseconds.
Another option would be to create an index on an expression that covers both team IDs
create index on games ( (least(teama, teamb)) );
That expression can then be used to find all games for one team:
select *
from games
where least(teama, teamb) = 1234;
As only a single index is involved, this is a bit faster: about 2 milliseconds on my laptop.
With 25 million rows (5000 teams), the difference between the two approaches is a bit bigger. The OR query takes around 15-20 milliseconds, the expression based query takes around 5-10 milliseconds.
Even 20 milliseconds doesn't seem something that would be a problem in the UI.
So with careful indexing I don't see why you would need any additional table.

Which SQL Update is faster/ more efficient

I need to update a table every time a certain action is taken.
MemberTable
Name varchar 60
Phone varchar 20
Title varchar 20
Credits int <-- the one that needs constant updates
etc with all the relevant member columns 10 - 15 total
Should I update this table with:
UPDATE Members
SET Credits = Credits - 1
WHERE Id = 1
or should I create another table called account with only two columns like:
Account table
Id int
MemberId int <-- foreign key to members table
Credits int
and update it with:
UPDATE Accounts
SET Credits = Credits - 1
WHERE MemberId = 1
Which one would be faster and more efficient?
I have read that SQL Server must read the whole row in order to update it. I'm not sure if that's true. Any help would be greatly appreciated
I know that this doesn't directly answer the question but I'm going to throw this out there as an alternative solution.
Are you bothered about historic transactions? Not everyone will be, but in case you or other future readers are, here's how I would approach the problem:
CREATE TABLE credit_transactions (
member_id int NOT NULL
, transaction_date datetime NOT NULL
CONSTRAINT df_credit_transactions_date DEFAULT Current_Timestamp
, credit_amount int NOT NULL
, CONSTRAINT pk_credit_transactions PRIMARY KEY (member_id, transaction_date)
, CONSTRAINT fk_credit_transactions_member_id FOREIGN KEY (member_id)
REFERENCES member (id)
, CONSTRAINT ck_credit_transaction_amount_not_zero CHECK (credit_amount <> 0)
);
In terms of write performance...
INSERT INTO credit_transactions (member_id, credit_amount)
VALUES (937, -1)
;
Pretty simple, eh! No row locks required.
The downside to this method is that to work out a member's "balance", you have to perform a bit of a calculation.
CREATE VIEW member_credit
AS
SELECT member_id
, Sum(credit_amount) As credit_balance
, Max(transaction_date) As latest_transaction
FROM credit_transactions
GROUP BY member_id
;
However using a view makes things nice and simple and can be optimized appropriately.
Heck, you might want to throw in a NOLOCK (read up about this before making your decision) on that view to reduce locking impact.
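As a usage sketch (reusing member 937 from the INSERT example above), reading a balance is then just:
SELECT credit_balance, latest_transaction
FROM member_credit
WHERE member_id = 937;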
TL;DR:
Pros: quick write speed, transaction history available
Cons: slower read speed
Actually the latter way would be faster.
If your number of transactions is very large, to the point where every millisecond matters, it's better to do it this way.
Also, if some members will never have credits, you might save some space as well.
However, if that's not the case, it's good to keep your table structure normalized. If every account will always have a credit, it's better to include it as a column in the Member table.
Try not to have an unnecessary intermediate table that will consume more space (with all those foreign keys and additional IDs). It also makes your schema a little bit more complex.
In the end, it depends on your requirement.
As the ID is the primary key, all the dbms has to do is look up the key in the index, get the record and update. There should not be much of a performance problem.
Using an account table leads to exactly the same access method. But you are right; as there is less data per record, you might more often have the record in the memory cache already and thus save a physical read. However, I wouldn't expect that to happen too often. And you probably work more with your member table than with the account table. That makes it more likely to have a member record already in cache, so it's just the other way around and it's your account table access that is slower.
Cache access vs. physical reads is the only difference, because with the primary key you will walk the same way through the ID index and then access one particular record directly.
I don't recommend using the account table. It somewhat blurs the data structure with a 1:1 relation between the two tables that may not be immediately recognized by other users. And it is not likely you will gain much from it. (As mentioned, you might even lose performance.)

Join to SELECT vs. Join to Tableset

For the DB gurus out there, I was wondering if there is any functional/performance difference between joining to the results of a SELECT statement and joining to a previously filled table variable. I'm working in SQL Server 2008 R2.
Example (TSQL):
-- Create a test table
DROP TABLE [dbo].[TestTable]
CREATE TABLE [dbo].[TestTable](
[id] [int] NOT NULL,
[value] [varchar](max) NULL
) ON [PRIMARY]
-- Populate the test table with a few rows
INSERT INTO [dbo].[TestTable]
SELECT 1123, 'test1'
INSERT INTO [dbo].[TestTable]
SELECT 2234, 'test2'
INSERT INTO [dbo].[TestTable]
SELECT 3345, 'test3'
-- Create a reference table
DROP TABLE [dbo].[TestRefTable]
CREATE TABLE [dbo].[TestRefTable](
[id] [int] NOT NULL,
[refvalue] [varchar](max) NULL
) ON [PRIMARY]
-- Populate the reference table with a few rows
INSERT INTO [dbo].[TestRefTable]
SELECT 1123, 'ref1'
INSERT INTO [dbo].[TestRefTable]
SELECT 2234, 'ref2'
-- Scenario 1: Insert matching results into its own table variable, then Join
-- Create a table variable
DECLARE #subset TABLE ([id] INT NOT NULL, [refvalue] VARCHAR(MAX))
INSERT INTO #subset
SELECT * FROM [dbo].[TestRefTable]
WHERE [dbo].[TestRefTable].[id] = 1123
SELECT t.*, s.*
FROM [dbo].[TestTable] t
JOIN #subset s
ON t.id = s.id
-- Scenario 2: Join directly to SELECT results
SELECT t.*, s.*
FROM [dbo].TestTable t
JOIN (SELECT * FROM [dbo].[TestRefTable] WHERE id = 1123) s
ON t.id = s.id
In the "real" world, the tables and table variable are pre-defined. What I'm looking at is being able to have the matched reference rows available for further operations, but I'm concerned that the extra steps will slow the query down. Are there technical reasons as to why one would be faster than the other? What sort of performance difference may be seen between the two approaches? I realize it is difficult (if not impossible) to give a definitive answer, just looking for some advice for this scenario.
The database engine has an optimizer to figure out the best way to execute a query. There is more under the hood than you probably imagine. For instance, when SQL Server is doing a join, it has a choice of at least four join algorithms:
Nested Loop
Index Lookup
Merge Join
Hash Join
(not to mention the multi-threaded versions of these.)
It is not important that you understand how each of these works. You just need to understand two things: different algorithms are best under different circumstances and SQL Server does its best to choose the best algorithm.
The choice of join algorithm is only one thing the optimizer does. It also has to figure out the ordering of the joins, the best way to aggregate results, whether a sort is needed for an order by, how to access the data (via indexes or directly), and much more.
When you break the query apart, you are making an assumption about optimization. In your case, you are making the assumption that the first best thing is to do a select on a particular table. You might be right. If so, your result with multiple queries should be about as fast as using a single query. Well, maybe not. When in a single query, SQL Server does not have to buffer all the results at once; it can stream results from one place to another. It may also be able to take advantage of parallelism in a way that splitting the query prevents.
In general, the SQL Server optimizer is pretty good, so you are best letting the optimizer do the query all in one go. There are definitely exceptions, where the optimizer may not choose the best execution path. Sometimes fixing this is as easy as being sure that statistics are up-to-date on tables. Other times, you can add optimizer hints. And other times you can restructure the query, as you have done.
For instance, one place where loading data into a local table is useful is when the table comes from a different server. The optimizer may not have full information about the size of the table to make the best decisions.
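A hedged sketch of that case (the linked server, database, and table names below are placeholders):
-- Pull only the remote rows you need into a local temp table, so the local
-- optimizer has real rows and statistics to work with.
SELECT id, refvalue
INTO #remote_subset
FROM RemoteServer.RemoteDb.dbo.TestRefTable
WHERE id = 1123;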
In other words, keep the query as one statement. If you need to improve it, then focus on optimization after it works. You generally won't have to spend much time on optimization, because the engine is pretty good at it.
This would give the same result?
SELECT t.*, s.*
FROM dbo.TestTable AS t
JOIN dbo.TestRefTable AS s ON t.id = s.id AND s.id = 1123
Basically, this is a cross join of all records from TestTable and TestRefTable with id = 1123.
Joining to table variables will also result in bad cardinality estimates by the optimizer. Table variables are always assumed by the optimizer to contain only a single row. The more rows one actually has, the worse that estimate becomes. This gives the optimizer the wrong number of rows not just for the table variable itself; for operators elsewhere in the plan that join to that result, it can also lead to wrong estimates of the number of executions of those operators.
Personally I think Table parameters should be used for getting data into and out of the server conveniently using client apps (C# .Net apps make good use of them), or for passing data between Stored Procs, but should not be used too much within the proc itself. The importance of getting rid of them within the Proc code itself increases with the expected number of rows to be carried by the parameter.
Sub-selects will perform better, or immediately copying into a temp table will work well. There is overhead in copying into the temp table, but again, the more rows you have, the more worthwhile that overhead becomes, because the optimizer's estimates get worse and worse.
In general a derived table in the query is probably going to be faster than joining to a table variable, because it can make use of indexes and those are not available on table variables. However, temp tables can also have indexes created, and that might remove the potential performance difference.
Also, if the number of table variable records is expected to be small, then indexes won't make a great deal of difference anyway, and so there would be little or no difference.
As always, you need to test on your own system, as the number of records, table design, and index design have a great deal to do with what works best.
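For example, a hedged sketch of the temp table alternative mentioned above, using the tables from the question (the primary key stands in for an explicit index):
CREATE TABLE #subset ([id] INT NOT NULL PRIMARY KEY, [refvalue] VARCHAR(MAX));

INSERT INTO #subset ([id], [refvalue])
SELECT [id], [refvalue]
FROM [dbo].[TestRefTable]
WHERE [id] = 1123;

-- Unlike the table variable, #subset has real statistics and an index to use.
SELECT t.*, s.*
FROM [dbo].[TestTable] AS t
JOIN #subset AS s
ON t.id = s.id;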
I'd expect the direct Table join to be faster than the Table to TableVariable, and use less resources.

Indexed View looking for null references without INNER JOIN or Subquery

So I have a legacy database with table structure like this (simplified)
Create Table Transaction
(
TransactionId INT NOT NULL IDENTITY(1,1) PRIMARY KEY,
ReplacesTransactionId INT
..
..
)
So I want to create an indexed view such that the following example would return only the second row (because it replaces the first one)
Insert Into Transaction (TransactionId, ReplacesTransactionId, ..) Values (1,0 ..)
Insert Into Transaction (TransactionId, ReplacesTransactionId, ..) Values (2,1 ..)
There are a number of ways of creating this query but I would like to create an indexed view, which means I cannot use subqueries, LEFT JOINs or EXCEPTs. An example query (using LEFT JOIN) could be:
SELECT trans1.* FROM Transaction trans1
LEFT JOIN Transaction trans2 on trans1.TransactionId = trans2.ReplacesTransactionId
Where trans2.TransactionId IS NULL
Clearly I'm stuck with the structure of the database and am looking to improve performance of the application using the data.
Any suggestions?
What you have here is essentially a hierarchical dataset in which you want to pre-traverse the hierarchy and store the result in an indexed view, but AFAIK, indexed views do not support that.
On the other hand, this may not be the only angle of attack to your larger goal of improving performance. First, the most obvious question: can we assume that TransactionId is clustered and ReplacesTransactionId is indexed? If not, those would be my first two changes. If the indexing is already good, then the next step would be to look at the query plan of your left join and see if anything leaps out.
In general terms (not having seen the query plan): one possible approach could be to try and convert your SELECT statement to a "covered query" (see https://www.simple-talk.com/sql/learn-sql-server/using-covering-indexes-to-improve-query-performance/). This would most likely entail some combination of:
Reducing the number of columns in the SELECT statement (replacing SELECT *)
Adding a few "included" columns to the index on ReplacesTransactionId (either in SSMS or using the INCLUDES clause of CREATE INDEX).
Good luck!

TSQL join performance

My problem is that this query takes forever to run:
Select
tableA.CUSTOMER_NAME,
tableB.CUSTOMER_NUMBER,
TableB.RuleID
FROM tableA
INNER JOIN tableB on tableA.CUST_PO_NUMBER like tableB.CustomerMask
Here is the structure of the tables:
CREATE TABLE [dbo].[TableA](
[CUSTOMER_NAME] [varchar](100) NULL,
[CUSTOMER_NUMBER] [varchar](50) NULL,
[CUST_PO_NUMBER] [varchar](50) NOT NULL,
[ORDER_NUMBER] [varchar](30) NOT NULL,
[ORDER_TYPE] [varchar](30) NULL)
CREATE TABLE [dbo].[TableB](
[RuleID] [varchar](50) NULL,
[CustomerMask] [varchar](500) NULL)
TableA has 14 million rows and TableB has 1000 rows. Data in the CustomerMask column can be anything like '%', 'ttt%', '%ttt%', etc.
How can I tune it to make it faster?
Thanks!
The short answer is don't use the LIKE operator to join two tables containing millions of rows. It's not going to be fast, no matter how you tune it. You might be able to improve it incrementally, but it will just be putting lipstick on a pig.
You need to have a distinct value on which to join the tables. Right now it has to do a complete scan of tableA, and do an item-by-item wildcard comparison between CUST_PO_NUMBER and CustomerMask. You're looking at 14 billion comparisons, all using the slow LIKE operator.
The only suggestion I can give is to re-think the architecture of associating rules with Customers.
While you can't change what's already there, you can create a new table like this:
CREATE TABLE [dbo].[TableC](
[CustomerMask] [varchar](500) NULL,
[CUST_PO_NUMBER] [varchar](50) NOT NULL)
Then have a trigger on both TableA and TableB that inserts / updates / deletes records in TableC so that it only contains pairs matching the condition CUST_PO_NUMBER LIKE CustomerMask (for the trigger on TableB you only need to update TableC if the CustomerMask field has changed).
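For illustration, a hedged sketch of the INSERT trigger on TableA (the trigger name is made up; the UPDATE/DELETE triggers and the TableB side would follow the same pattern):
CREATE TRIGGER trg_TableA_Insert ON dbo.TableA
AFTER INSERT
AS
BEGIN
    SET NOCOUNT ON;

    -- Pay the cost of the wildcard match once, at write time
    INSERT INTO dbo.TableC (CustomerMask, CUST_PO_NUMBER)
    SELECT b.CustomerMask, i.CUST_PO_NUMBER
    FROM inserted AS i
    JOIN dbo.TableB AS b
        ON i.CUST_PO_NUMBER LIKE b.CustomerMask;
END;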
Then your query will just become:
SELECT
tableA.CUSTOMER_NAME,
tableB.CUSTOMER_NUMBER,
TableB.RuleID
FROM tableA
INNER JOIN tableC on tableA.CUST_PO_NUMBER = tableC.CUST_PO_NUMBER
INNER JOIN tableB on tableC.CustomerMask = tableB.CustomerMask
This will greatly improve your query performance and it shouldn't greatly affect your write performance. You will basically only be performing the like query once for each record (unless they change).
Just change the join order, then it's faster. Enjoy! Use this query:
Select tableA.CUSTOMER_NAME, tableB.CUSTOMER_NUMBER, TableB.RuleID
FROM tableB
INNER JOIN tableA
on tableB.CustomerMask like tableA.CUST_PO_NUMBER
Am I missing something? What about the following:
Select
tableA.CUSTOMER_NAME,
tableA.CUSTOMER_NUMBER,
tableB.RuleID
FROM tableA, tableB
WHERE tableA.CUST_PO_NUMBER = tableB.CustomerMask
EDIT2: Thinking about it, how many of those masks start and end with wildcards? You might gain some performance by first (a sketch of these steps follows the list):
Indexing CUST_PO_NUMBER
Creating a persisted computed column CUST_PO_NUMBER_REV that's the reverse of CUST_PO_NUMBER
Indexing the persisted column
Putting statistics on these columns
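A hedged sketch of those steps (the index names are illustrative):
ALTER TABLE dbo.TableA
    ADD CUST_PO_NUMBER_REV AS REVERSE(CUST_PO_NUMBER) PERSISTED;

CREATE INDEX IX_TableA_CUST_PO_NUMBER ON dbo.TableA (CUST_PO_NUMBER);
CREATE INDEX IX_TableA_CUST_PO_NUMBER_REV ON dbo.TableA (CUST_PO_NUMBER_REV);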
Then you might build three queries, and UNION ALL the results together:
SELECT ...
FROM ...
ON CUST_PO_NUMBER LIKE CustomerMask
WHERE /* First character of CustomerMask is not a wildcard but last one is */
UNION ALL
SELECT ...
FROM ...
ON CUST_PO_NUMBER_REV LIKE REVERSE(CustomerMask)
WHERE /* Last character of CustomerMask is not a wildcard but first one is */
UNION ALL
SELECT ...
FROM ...
ON CUST_PO_NUMBER LIKE CustomerMask
WHERE /* Everything else */
That's just a quick example, you'll need to take some care that the WHERE clauses give you mutually exclusive results (or use UNION, but aim for mutually exclusive results first).
If you can do that, you should have two queries using index seeks and one query using index scans.
EDIT: You can implement a sharding system to spread out the customers and customer masks tables across multiple servers and then have each server evaluate 1/n% of the results. You don't need to partition the data -- simple replication of the entire contents of each table will do. Link the servers to your main server and you can do something to the effect of:
SELECT ... FROM OPENQUERY(LinkedServer1, 'SELECT ... LIKE ... WHERE ID BETWEEN 0 AND 99')
UNION ALL
SELECT ... FROM OPENQUERY(LinkedServer2, 'SELECT ... LIKE ... WHERE ID BETWEEN 100 AND 199')
Note: the OPENQUERY may be extraneous, SQL Server might be smart enough to evaluate queries on remote servers and stream the results back. I know it doesn't do that for JET linked servers, but it might handle its own kind better.
That, or throw more hardware at the problem.
You can create an Indexed View of your query to improve performance.
From Designing Indexed Views:
For a standard view, the overhead of dynamically building the result set for each query that references a view can be significant for views that involve complex processing of large numbers of rows, such as aggregating lots of data, or joining many rows. If such views are frequently referenced in queries, you can improve performance by creating a unique clustered index on the view. When a unique clustered index is created on a view, the result set is stored in the database just like a table with a clustered index is stored.
Another benefit of creating an index on a view is that the optimizer starts using the view index in queries that do not directly name the view in the FROM clause. Existing queries can benefit from the improved efficiency of retrieving data from the indexed view without having to be recoded.
This should improve the performance of this particular query, but note that inserts, updates and deleted into the tables it uses may be slowed.
You can't use LIKE if you care about performance.
If you are trying to do approximate string matching (e.g. Test and est and best, etc.) and you don't want to use SQL full-text search, take a look at this article.
At least you can shortlist approximate matches and then run your wildcard test on them.
--EDIT 2--
Your problem is interesting in the context of your limitation. Thinking about it again, I am pretty sure that using 3-grams would boost the performance (going back to my initial suggestion).
Let's say you set up your 3-gram data; you'll then have the following tables:
Customer : 14M
Customer3Grams : Maximum 700M //Considering the field is varchar(50)
3Grams : 78
Pattern : 1000
Pattern3Grams : 50K
To join patterns to customers you then need the following join:
Pattern x Pattern3Grams x Customer3Grams x Customer
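A hedged sketch of that join (only the table names come from the list above; every column name here is an assumption): the trigram join produces a candidate shortlist, and the final LIKE verifies each match.
SELECT DISTINCT c.CUSTOMER_NAME, p.RuleID
FROM Pattern AS p
JOIN Pattern3Grams AS pg ON pg.PatternID = p.PatternID
JOIN Customer3Grams AS cg ON cg.Gram = pg.Gram
JOIN Customer AS c ON c.CustomerID = cg.CustomerID
WHERE c.CUST_PO_NUMBER LIKE p.CustomerMask;   -- wildcard test only on the shortlist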
With appropriate indexing (which is easy) each look-up can happen in O(LOG(50K) + LOG(700M) + LOG(14M)), which is about 47.6 (using natural logarithms).
Considering appropriate indexes are present the whole join can be calculated with less than 50,000 look-ups and of course the cost of scanning after look ups. I expect it to be very efficient (matter of seconds).
The cost of creating 3grams for each new customer is also minimal because it would be maximum of 50x75 possible three grams that should be appended to the customer3Grams table.
--EDIT--
Depending on your data, I can also suggest hash-based clustering. I assume customer numbers are numbers with some character patterns in them (e.g. 123231ttt3x4). If this is the case, you can create a simple hash function that calculates the result of a bit-wise OR over every letter (not the digits) and add it as an indexed column to your table. You can filter on the result of the hash before applying LIKE.
Depending on your data, this may cluster your data effectively and improve your search by a factor of the number of clusters (the number of distinct hashes). You can test it by applying the hash and counting the number of distinct generated hashes.
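A hedged sketch of such a hash function (the name and details are made up; it ORs together the letter codes and ignores digits, and could then be stored as an indexed computed column and compared before applying LIKE):
CREATE FUNCTION dbo.LetterOrHash (@s varchar(50))
RETURNS int
WITH SCHEMABINDING
AS
BEGIN
    DECLARE @h int = 0, @i int = 1, @c int;
    WHILE @i <= LEN(@s)
    BEGIN
        SET @c = ASCII(UPPER(SUBSTRING(@s, @i, 1)));
        IF @c BETWEEN 65 AND 90      -- letters only, digits are ignored
            SET @h = @h | @c;        -- bit-wise OR accumulates the hash
        SET @i += 1;
    END;
    RETURN @h;
END;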