SQL Dynamic Optimization Tables?

I am a very experienced programmer, but extremely new to SQL, which has a more limited view of things than what is available in code. I think it's possible I'm looking at this wrong in the context of SQL in general, so I'm looking for direction. I do not believe the specific SQL implementation is really important at this point; I think this is just a general SQL conceptual issue that I'm having.
Here's what I'm thinking:
Say I am going to track the results of a very large number of sporting events (10s of millions or more), with the teams that played in them and the final scores:
CREATE TABLE teams (
    TeamID INT NOT NULL PRIMARY KEY,
    TeamName VARCHAR(255) NOT NULL
);
CREATE TABLE games (
    GameID INT NOT NULL PRIMARY KEY,
    TeamA INT NOT NULL,
    TeamB INT NOT NULL,
    TeamAScore INT,
    TeamBScore INT,
    FOREIGN KEY (TeamA) REFERENCES teams (TeamID),
    FOREIGN KEY (TeamB) REFERENCES teams (TeamID)
);
Since the "games" table will be extremely large, when a query is made for the results of a particular team, it seems to me that searching both "TeamA" and "TeamB" columns for matches could be a very time-consuming operation. That would in turn make immediate presentation on a UI a problem.
However, if there were lists of games played by each team, the query could be made much faster (at the expense of more storage):
CREATE TABLE team_TeamID_games (
    GameID INT NOT NULL,
    FOREIGN KEY (GameID) REFERENCES games (GameID)
);
Then displaying the list of results for a team just involves using the "team_TeamID_games" table and pulling out the results of the "games" table directly, rather than searching it.
The questionable part here starts with the idea of introducing a new table for each team. The "TeamID" portion of the "team_TeamID_games" above would be replaced with the team ID, so there might be tables called "team_1_games", "team_2_games", etc.
That alone seems to break with what I've seen in researching SQL use.
Additionally, from what I've learned of SQL so far, there isn't really a standard way to actually link the "team_TeamID_games" table to the "TeamID" row of the "teams" table, since foreign keys reference a row, not an entire table. And that means the database doesn't really know about the connection.
Alternatively, a VARCHAR() string with the name of the other table could be stored in the "teams" table, but I don't believe that actually means anything to the database either.
Is the concept of a link between tables done above and outside the database itself an extremely bad thing?
Is the creation of such "dynamic" tables (not statically created up front, but created as teams are registered, and populated as the game results are entered) for each team a bad idea?
Is there another way to accomplish this optimization?

Not sure what you consider "extremely" large. With e.g. 2500 teams, the resulting games table would be about 6 million rows. That is not even considered "large" nowadays. With 5000 teams, the games table would have 25 million rows. Still not "extremely" large nowadays.
The query "find all games of a specific team" can be answered using the following query:
select *
from games
where teama = 42
or teamb = 42;
This can (usually) be improved by creating an index on each column:
create index idx_team_a on games (teama);
create index idx_team_b on games (teamb);
Postgres (and probably other DBMS products as well) would be able to use both indexes for that query. On my laptop (with 2500 teams and 6.2 million games) that query takes about 3 milliseconds.
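To double-check what the planner does, you can look at the execution plan; on Postgres the OR query typically shows a BitmapOr combining bitmap index scans on the two indexes (an illustrative check only, as actual plans depend on your data and configuration):
explain (analyze, buffers)
select *
from games
where teama = 42
   or teamb = 42;
-- expect a BitmapOr of bitmap index scans on idx_team_a and idx_team_b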
Another option would be to create an index on an expression that covers both team IDs
create index on games ( (least(teama, teamb)) );
That expression can then be used to find all games for one team:
select *
from games
where least(teama, teamb) = 1234;
As only a single index is involved, this is a bit faster: about 2 milliseconds on my laptop.
With 25 million rows (5000 teams), the difference between the two approaches is a bit bigger. The OR query takes around 15-20 milliseconds, the expression based query takes around 5-10 milliseconds.
Even 20 milliseconds doesn't seem like something that would be a problem in the UI.
So with careful indexing I don't see why you would need any additional table.


Why does SQL query not use primary key for SELECT when it is most suitable?

Scenario (tried to come up with a 1-1 mapping to my production scenario): Fetch list of all people who flew with Virgin airlines or Emirates from New York.
Table tbl_Flyer has a few columns containing all the details about the people who flew at any point in time. The primary key is (CountryId, CityId, AirlineId, PersonId).
Now, a simple SQL query looks like this:
SELECT passenger.PersonId
FROM tbl_Flyer passenger
WHERE passenger.CountryId = @countryId
  AND passenger.CityId = @cityId
  AND passenger.AirlineId IN (SELECT values FROM @allAirlineIds)
@countryId, @cityId, and @allAirlineIds are properly sent to the SQL stored procedure. My assumption was that this query would use the primary key, as all 4 columns used in the query are present in the PK, but for some reason it does not. It uses a non-clustered index that was added to be able to query passengers on the basis of personal details like age and sex (it looks like (CountryId, CityId, Age, Sex)).
I am adding a FORCESEEK hint to the query, but I want to understand whether there is an anti-pattern I might be using here. Any idea why SQL Server would defy logic and not use the PK for a seek?
The choice your database engine makes between one index and another is automatic, based on heuristics, and those heuristics are not always accurate (99% of the time they are, but sometimes the human brain finds a better way).
These heuristics are derived from general rules and statistics, and sometimes they don't match the reality of your data (a string column where every value starts with the same letter, a column with a lot of NULLs, and so on).
The IN (SELECT ...) has to be evaluated against each row of your table and is considered quite expensive by most database engines, so your database may prefer another access path (the non-clustered index, in your case).
Rewriting it with EXISTS is considered far less expensive and makes your database engine more likely to choose the index you expect; see the sketch below.
Use FORCESEEK if that's not enough.
You can also have the same issue if the types of CountryId, CityId, AirlineId, and PersonId are not the same as those of @CountryId, @CityId, @AirlineId, and @PersonId (the implicit type conversion is expensive and can prevent an index seek).
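Putting those two suggestions together (EXISTS instead of IN, and parameters declared with the same types as the columns), a rough sketch; the @allAirlineIds table parameter and its AirlineId column are assumptions:
SELECT passenger.PersonId
FROM tbl_Flyer passenger
WHERE passenger.CountryId = @countryId   -- @countryId declared with the same type as CountryId
  AND passenger.CityId = @cityId         -- likewise for @cityId
  AND EXISTS (SELECT 1
              FROM @allAirlineIds a      -- assumed table-valued parameter with an AirlineId column
              WHERE a.AirlineId = passenger.AirlineId);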

Which SQL Update is faster/ more efficient

I need to update a table every time a certain action is taken.
MemberTable
  Name     varchar(60)
  Phone    varchar(20)
  Title    varchar(20)
  Credits  int        <-- the one that needs constant updates
  ... plus the other relevant member columns (10-15 total)
Should I update this table with:
UPDATE Members
SET Credits = Credits - 1
WHERE Id = 1
or should I create another table called account with only two columns like:
Account table
Id int
MemberId int <-- foreign key to members table
Credits int
and update it with:
UPDATE Accounts
SET Credits = Credits - 1
WHERE MemberId = 1
Which one would be faster and more efficient?
I have read that SQL Server must read the whole row in order to update it. I'm not sure if that's true. Any help would be greatly appreciated.
I know that this doesn't directly answer the question but I'm going to throw this out there as an alternative solution.
Are you bothered about historic transactions? Not everyone will be, but in case you or other future readers are, here's how I would approach the problem:
CREATE TABLE credit_transactions (
member_id int NOT NULL
, transaction_date datetime NOT NULL
CONSTRAINT df_credit_transactions_date DEFAULT Current_Timestamp
, credit_amount int NOT NULL
, CONSTRAINT pk_credit_transactions PRIMARY KEY (member_id, transaction_date)
, CONSTRAINT fk_credit_transactions_member_id FOREIGN KEY (member_id)
REFERENCES member (id)
, CONSTRAINT ck_credit_transaction_amount_not_zero CHECK (credit_amount <> 0)
);
In terms of write performance...
INSERT INTO credit_transactions (member_id, credit_amount)
VALUES (937, -1)
;
Pretty simple, eh! No row locks required.
The downside to this method is that to work out a member's "balance", you have to perform a bit of a calculation.
CREATE VIEW member_credit
AS
SELECT member_id
, Sum(credit_amount) As credit_balance
, Max(transaction_date) As latest_transaction
FROM credit_transactions
GROUP
BY member_id
;
However using a view makes things nice and simple and can be optimized appropriately.
Heck, you might want to throw in a NOLOCK (read up about this before making your decision) on that view to reduce locking impact.
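Reading a single member's balance is then just a query against the view, for example (using the member id from the insert above):
SELECT credit_balance, latest_transaction
FROM member_credit
WHERE member_id = 937;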
TL;DR:
Pros: quick write speed, transaction history available
Cons: slower read speed
Actually, the latter way would be faster.
If your transaction volume is very high, to the point where millisecond precision matters, it's better to do it that way.
Also, if some members will never have credits, you might save some space as well.
However, if that's not the case, it's good to keep your table structure normalized. If every member will always have a credits value, it's better to include it as a column in the Members table.
Try not to have an unnecessary intermediate table that consumes more space (with all those foreign keys and additional IDs); it also makes your schema a little more complex.
In the end, it depends on your requirements.
As the ID is the primary key, all the dbms has to do is look up the key in the index, get the record and update. There should not be much of a performance problem.
Using an account table leads to exactly the same access method. But you are right; as there is less data per record, you are more likely to already have the record in the memory cache and thus save a physical read. However, I wouldn't expect that to happen too often. And you probably work more with your members table than with the account table, which makes it more likely that a member record is already in cache; so it's just the other way around, and it's the account-table access that is slower.
Cache access vs. physical reads is the only difference, because with the primary key you will walk the same way through the ID index and then access one particular record directly.
I don't recommend using the account table. It somewhat blurs the data structure with a 1:1 relation between the two tables that may not be immediately recognized by other users. And it is not likely you will gain much from it. (As mentioned, you might even lose performance.)

Joining 100 tables

Assume that I have a main table which has 100 columns referencing (as foreign keys) to some 100 tables (containing primary keys).
The whole pack of information requires joining those 100 tables, and it is definitely a performance issue to join such a number of tables. Hopefully, we can expect that any user would like to request a bunch of data containing values from not more than some 5-7 tables (out of those 100), in queries that put conditions (in the WHERE part of the query) on the fields from some 3-4 tables (out of those 100). Different queries have different combinations of the tables used to produce the SELECT part of the query and to put conditions in the WHERE. But, again, every SELECT would require some 5-7 tables and every WHERE would require some 3-4 tables (and the list of tables used to produce the SELECT may overlap with the list of tables used to put conditions in the WHERE).
I can write a VIEW whose underlying code joins all those 100 tables. Then I can write the above-mentioned SQL queries against this VIEW. But in this case it is a big issue for me how to instruct SQL Server that (despite the explicit instructions in the code to join all those 100 tables) only some 11 tables actually need to be joined (11 tables are enough to produce the SELECT output and take the WHERE conditions into account).
Another approach may be to create a "feature" that converts the following "fake" code
SELECT field1, field2, field3 FROM TheFakeTable WHERE field1=12 and field4=5
into the following "real" code:
SELECT T1.field1, T2.field2, T3.field3 FROM TheRealMainTable
join T1 on ....
join T2 on ....
join T3 on ....
join T4 on ....
WHERE T1.field1=12 and T4.field4=5
From a grammatical point of view, it is not a problem even to allow any mixed combinations of this "TheFakeTable" mechanism with real tables and constructions. The real problem here is how to realize this "feature" technically. I can create a function which takes the "fake" code as input and produces the "real" code. But it is not convenient, because it requires using dynamic SQL tools everywhere this "TheFakeTable" mechanism appears. A fantasy-land solution would be to extend the grammar of the SQL language in my Management Studio to allow writing such fake code and then automatically convert it into the real code before sending it to the server.
My questions are:
Can SQL Server be instructed somehow (or be smart enough) to join only 11 tables instead of 100 in the VIEW described above?
If I decide to create this "TheFakeTable" mechanism, what would be the best way to realize it technically?
Thanks to everyone for every comment!
PS
The structure with 100 tables arises from the following question that I asked here:
Normalizing an extremely big table
The SQL Server optimizer does contain logic to remove redundant joins, but there are restrictions, and the joins have to be provably redundant. To summarize, a join can have four effects:
It can add extra columns (from the joined table)
It can add extra rows (the joined table may match a source row more than once)
It can remove rows (the joined table may not have a match)
It can introduce NULLs (for a RIGHT or FULL JOIN)
To successfully remove a redundant join, the query (or view) must account for all four possibilities. When this is done correctly, the effect can be astonishing. For example:
USE AdventureWorks2012;
GO
CREATE VIEW dbo.ComplexView
AS
SELECT
pc.ProductCategoryID, pc.Name AS CatName,
ps.ProductSubcategoryID, ps.Name AS SubCatName,
p.ProductID, p.Name AS ProductName,
p.Color, p.ListPrice, p.ReorderPoint,
pm.Name AS ModelName, pm.ModifiedDate
FROM Production.ProductCategory AS pc
FULL JOIN Production.ProductSubcategory AS ps ON
ps.ProductCategoryID = pc.ProductCategoryID
FULL JOIN Production.Product AS p ON
p.ProductSubcategoryID = ps.ProductSubcategoryID
FULL JOIN Production.ProductModel AS pm ON
pm.ProductModelID = p.ProductModelID
The optimizer can successfully simplify the following query:
SELECT
c.ProductID,
c.ProductName
FROM dbo.ComplexView AS c
WHERE
c.ProductName LIKE N'G%';
To a plan that touches only the Production.Product table (the execution plan image is not reproduced here).
Rob Farley wrote about these ideas in depth in the original MVP Deep Dives book, and there is a recording of him presenting on the topic at SQLBits.
The main restrictions are that foreign key relationships must be based on a single key to contribute to the simplification process, and compilation time for the queries against such a view may become quite long particularly as the number of joins increases. It could be quite a challenge to write a 100-table view that gets all the semantics exactly correct. I would be inclined to find an alternative solution, perhaps using dynamic SQL.
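As a rough illustration of the dynamic SQL route, a sketch that builds the statement with only the joins a particular request needs, using the example tables defined below (dbo.Normalized, dbo.Ref01, dbo.Ref04); the "needs" flags are hypothetical and would really be derived from the requested columns and predicates:
-- Build only the joins this request actually needs (sketch, not production code).
DECLARE @sql nvarchar(max) = N'SELECT n.pk';
DECLARE @joins nvarchar(max) = N'';
DECLARE @needsRef01 bit = 1, @needsRef04 bit = 1;   -- hypothetical flags

IF @needsRef01 = 1
BEGIN
    SET @sql += N', r01.item AS item01';
    SET @joins += N' JOIN dbo.Ref01 AS r01 ON r01.col01 = n.col01';
END;

IF @needsRef04 = 1
BEGIN
    SET @sql += N', r04.item AS item04';
    SET @joins += N' JOIN dbo.Ref04 AS r04 ON r04.col04 = n.col04';
END;

-- The predicate list would be assembled the same way; this line assumes Ref01 was joined.
SET @sql += N' FROM dbo.Normalized AS n' + @joins + N' WHERE r01.item = @item01';

EXEC sys.sp_executesql @sql, N'@item01 varchar(50)', @item01 = 'Green';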
That said, the particular qualities of your denormalized table may mean the view is quite simple to assemble, requiring only enforced FOREIGN KEYs, non-NULLable referenced columns, and appropriate UNIQUE constraints to make this solution work as you would hope, without the overhead of 100 physical join operators in the plan.
Example
Using ten tables rather than a hundred:
-- Referenced tables
CREATE TABLE dbo.Ref01 (col01 tinyint PRIMARY KEY, item varchar(50) NOT NULL UNIQUE);
CREATE TABLE dbo.Ref02 (col02 tinyint PRIMARY KEY, item varchar(50) NOT NULL UNIQUE);
CREATE TABLE dbo.Ref03 (col03 tinyint PRIMARY KEY, item varchar(50) NOT NULL UNIQUE);
CREATE TABLE dbo.Ref04 (col04 tinyint PRIMARY KEY, item varchar(50) NOT NULL UNIQUE);
CREATE TABLE dbo.Ref05 (col05 tinyint PRIMARY KEY, item varchar(50) NOT NULL UNIQUE);
CREATE TABLE dbo.Ref06 (col06 tinyint PRIMARY KEY, item varchar(50) NOT NULL UNIQUE);
CREATE TABLE dbo.Ref07 (col07 tinyint PRIMARY KEY, item varchar(50) NOT NULL UNIQUE);
CREATE TABLE dbo.Ref08 (col08 tinyint PRIMARY KEY, item varchar(50) NOT NULL UNIQUE);
CREATE TABLE dbo.Ref09 (col09 tinyint PRIMARY KEY, item varchar(50) NOT NULL UNIQUE);
CREATE TABLE dbo.Ref10 (col10 tinyint PRIMARY KEY, item varchar(50) NOT NULL UNIQUE);
The parent table definition (with page-compression):
CREATE TABLE dbo.Normalized
(
pk integer IDENTITY NOT NULL,
col01 tinyint NOT NULL REFERENCES dbo.Ref01,
col02 tinyint NOT NULL REFERENCES dbo.Ref02,
col03 tinyint NOT NULL REFERENCES dbo.Ref03,
col04 tinyint NOT NULL REFERENCES dbo.Ref04,
col05 tinyint NOT NULL REFERENCES dbo.Ref05,
col06 tinyint NOT NULL REFERENCES dbo.Ref06,
col07 tinyint NOT NULL REFERENCES dbo.Ref07,
col08 tinyint NOT NULL REFERENCES dbo.Ref08,
col09 tinyint NOT NULL REFERENCES dbo.Ref09,
col10 tinyint NOT NULL REFERENCES dbo.Ref10,
CONSTRAINT PK_Normalized
PRIMARY KEY CLUSTERED (pk)
WITH (DATA_COMPRESSION = PAGE)
);
The view:
CREATE VIEW dbo.Denormalized
WITH SCHEMABINDING AS
SELECT
item01 = r01.item,
item02 = r02.item,
item03 = r03.item,
item04 = r04.item,
item05 = r05.item,
item06 = r06.item,
item07 = r07.item,
item08 = r08.item,
item09 = r09.item,
item10 = r10.item
FROM dbo.Normalized AS n
JOIN dbo.Ref01 AS r01 ON r01.col01 = n.col01
JOIN dbo.Ref02 AS r02 ON r02.col02 = n.col02
JOIN dbo.Ref03 AS r03 ON r03.col03 = n.col03
JOIN dbo.Ref04 AS r04 ON r04.col04 = n.col04
JOIN dbo.Ref05 AS r05 ON r05.col05 = n.col05
JOIN dbo.Ref06 AS r06 ON r06.col06 = n.col06
JOIN dbo.Ref07 AS r07 ON r07.col07 = n.col07
JOIN dbo.Ref08 AS r08 ON r08.col08 = n.col08
JOIN dbo.Ref09 AS r09 ON r09.col09 = n.col09
JOIN dbo.Ref10 AS r10 ON r10.col10 = n.col10;
Hack the statistics to make the optimizer think the table is very large:
UPDATE STATISTICS dbo.Normalized WITH ROWCOUNT = 100000000, PAGECOUNT = 5000000;
Example user query:
SELECT
d.item06,
d.item07
FROM dbo.Denormalized AS d
WHERE
d.item08 = 'Banana'
AND d.item01 = 'Green';
Gives us an execution plan (not reproduced here) in which the scan of the Normalized table looks bad, but both Bloom-filter bitmaps are applied during the scan by the storage engine (so rows that cannot match do not even surface as far as the query processor). This may be enough to give acceptable performance in your case, and certainly better than scanning the original table with its overflowing columns.
If you are able to upgrade to SQL Server 2012 Enterprise at some stage, you have another option: creating a column-store index on the Normalized table:
CREATE NONCLUSTERED COLUMNSTORE INDEX cs
ON dbo.Normalized (col01,col02,col03,col04,col05,col06,col07,col08,col09,col10);
The resulting execution plan (also not reproduced here) probably looks worse to you, but column storage provides exceptional compression, and the whole execution plan runs in Batch Mode with filters for all the contributing columns. If the server has adequate threads and memory available, this alternative could really fly.
Ultimately, I'm not sure this normalization is the correct approach considering the number of tables and the chances of getting a poor execution plan or requiring excessive compilation time. I would probably correct the schema of the denormalized table first (proper data types and so on), possibly apply data compression...the usual things.
If the data truly belongs in a star-schema, it probably needs more design work than just splitting off repeating data elements into separate tables.
Why do you think joining 100 tables would be a performance issue?
If all the keys are primary keys, then all the joins will use indexes. The only question, then, is whether the indexes fit into memory. If they fit in memory, performance is probably not an issue at all.
You should try the query with the 100 joins before making such a statement.
Furthermore, based on the original question, the reference tables have just a few values in them. The tables themselves fit on a single page, plus another page for the index. This is 200 pages, which would occupy at most a few megabytes of your page cache. Don't worry about the optimizations, create the view, and if you have performance problems then think about the next steps. Don't presuppose performance problems.
ELABORATION:
This has received a lot of comments. Let me explain why this idea may not be as crazy as it sounds.
First, I am assuming that all the joins are done through primary key indexes, and that the indexes fit into memory.
The 100 keys on the page occupy 400 bytes. Let's say that the original strings are, on average 40 bytes each. These would have occupied 4,000 bytes on the page, so we have a savings. In fact, about 2 records would fit on a page in the previous scheme. About 20 fit on a page with the keys.
So, to read the records with the keys is about 10 times faster in terms of I/O than reading the original records. With the assumptions about the small number of values, the indexes and original data fit into memory.
How long does it take to read 20 records? The old way required reading 10 pages. With the keys, there is one page read and 100*20 index lookups (with perhaps an additional lookup to get the value). Depending on the system, the 2,000 index lookups may be faster -- even much faster -- than the additional 9 page I/Os. The point I want to make is that this is a reasonable situation. It may or may not happen on a particular system, but it is not way crazy.
This is a bit oversimplified. SQL Server doesn't actually read pages one-at-a-time. I think they are read in groups of 4 (and there might be look-ahead reads when doing a full-table scan). On the flip side, though, in most cases, a table-scan query is going to be more I/O bound than processor bound, so there are spare processor cycles for looking up values in reference tables.
In fact, using the keys could result in faster reading of the table than not using them, because spare processing cycles would be used for the lookups ("spare" in the sense that processing power is available when reading). In fact, the table with the keys might be small enough to fit into available cache, greatly improving performance of more complex queries.
The actual performance depends on lots of factors, such as the length of the strings, the original table (is it larger than available cache?), the ability of the underlying hardware to do I/O reads and processing at the same time, and the dependence on the query optimizer to do the joins correctly.
My original point was that assuming a priori that the 100 joins are a bad thing is not correct. The assumption needs to be tested, and using the keys might even give a boost to performance.
If your data doesn't change much, you may benefit from creating an Indexed View, which basically materializes the view.
If the data changes often, it may not be a good option, as the server has to maintain the indexed view for each change in the underlying tables of the view.
Here's a good blog post that describes it a bit better.
From the blog:
CREATE VIEW dbo.vw_SalesByProduct_Indexed
WITH SCHEMABINDING
AS
SELECT
Product,
COUNT_BIG(*) AS ProductCount,
SUM(ISNULL(SalePrice,0)) AS TotalSales
FROM dbo.SalesHistory
GROUP BY Product
GO
The script below creates the index on our view:
CREATE UNIQUE CLUSTERED INDEX idx_SalesView ON vw_SalesByProduct_Indexed(Product)
To show that an index has been created on the view and that it does
take up space in the database, run the following script to find out
how many rows are in the clustered index and how much space the view
takes up.
EXECUTE sp_spaceused 'vw_SalesByProduct_Indexed'
The SELECT statement below is the same statement as before, except
this time it performs a clustered index seek, which is typically very
fast.
SELECT
Product, TotalSales, ProductCount
FROM vw_SalesByProduct_Indexed
WHERE Product = 'Computer'

Should I use a unique ID for a row in a junction table?

I am using SQL Server 2008.
A while back, I asked the question "should I use RecordID in a junction table". The tables would look like this:
// Images
ImageID// PK
// Persons
PersonID // pk
// Images_Persons
RecordID // pk
ImageID // fk
PersonID // fk
I was strongly advised NOT to use RecordID because it's useless in a table where the two IDs create a unique combination, meaning there will be no duplicate records.
Now, I am trying to find a random record in the junction table to create a quiz. I want to pull the first id and see if someone can match the second id. Specifically, I grab a random image and display it with three possible choices of persons.
The following query works, but I've seen quite a bit of negativity suggesting that it's very slow. My database might have 10,000 records, so I don't think that matters much. I've also read that the values generated aren't truly random.
SELECT TOP 1 * FROM Images_Persons ORDER BY newid();
Should I add the RecordID column or not? Is there a better way to find a random record in this case?
Previous questions for reference
Should I use "RecordID" as a column name?
SQL - What is the best table design to store people as musicians and artists?
NEWID is random enough and probably best
10k rows is peanuts
You don't need a surrogate key for a junction (link, many-many) table
Edit: in case you want to prematurely optimise...
You could ignore this and read these from @Mitch Wheat. But with just 10k rows your development time will be longer than any saved execution time.
Efficiently select random rows from large resultset with LINQ (ala TABLESAMPLE)
Efficiently randomize (shuffle) data in Sql Server table
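For reference, the TABLESAMPLE idea from those links looks roughly like this (a sketch only; TABLESAMPLE picks whole pages at random, so it is approximate and can return an empty sample on small tables):
SELECT TOP (1) *
FROM Images_Persons TABLESAMPLE (1 PERCENT)
ORDER BY NEWID();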
Personally, I don't think that having the RecordID column should be advised AGAINST. Rather I'd advise that often it is UNNECESSARY.
There are cases where having a single value to identify a row makes for simpler code. But they're at the cost of additional storage, often additional indexes, etc. The overheads realistically are small, but so are the benefits.
In terms of the selection of random records, the existence of a single unique identifier can make the task easier if the identifiers are both sequential and consecutive.
The reason I say this is because your proposed solution requires the assignment of NEWID() to every record, and the sorting of all records to find the first one. As the table size grows this operation grows, and can become relatively expensive. Whether it's expensive enough to be worth optimising depends on whatever else is happening, how often, etc.
Where there are sequential consecutive unique identifiers, however, one can then choose a random value between MIN(id) and MAX(id), and then SEEK that value out. The requirement that all values are consecutive, however, is often a constraint too far; you're never allowed to delete a value mid-table, for example...
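For completeness, a minimal sketch of that MIN/MAX-and-seek idea, assuming a RecordID column that is sequential, indexed, and free of gaps:
DECLARE @min_id int, @max_id int, @target int;

SELECT @min_id = MIN(RecordID), @max_id = MAX(RecordID)
FROM Images_Persons;

SET @target = @min_id + CAST(RAND() * (@max_id - @min_id + 1) AS int);

SELECT *
FROM Images_Persons
WHERE RecordID = @target;   -- an index seek, as long as there are no gaps in RecordID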
To overcome this, and depending on indexes, you may find the following approach useful.
DECLARE
    @max_id INT

SELECT
    @max_id = COUNT(*)
FROM
    Images_Persons

SELECT
    *
FROM
(
    SELECT
        *,
        ROW_NUMBER() OVER (ORDER BY ImageID, PersonID) AS id
    FROM
        Images_Persons
)
AS data
WHERE
    data.id = CAST(@max_id * RAND() + 1 AS INT)
-- Assuming that (ImageID, PersonID) is the clustered index.
A downside here is that RAND() is notoriously poor at being truly random. Yet it is normally perfectly suitable if executed at a random time relative to any other call to RAND().
Consider what you've got.
SELECT TOP 1 * FROM Images_Persons ORDER BY newid();
Not truly random? Excluding the 'truly random is impossible' bit, you're probably right - I believe that there are patterns in generated uniqueidentifiers. But you should test this yourself. It'd be simple; just create a table with 1 to 100 in it, order by newid() a lot of times, and look at the results. If it's random 'enough' for you (which it probably will be, for a quiz) then it's good enough.
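A quick sketch of that test in T-SQL (the temp table is hypothetical; run the final SELECT many times and tally which values come out on top):
CREATE TABLE #nums (n int NOT NULL PRIMARY KEY);

INSERT INTO #nums (n)
SELECT TOP (100) ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
FROM sys.all_objects;

-- Repeat this and record the output; a heavily skewed tally would indicate a pattern.
SELECT TOP (1) n
FROM #nums
ORDER BY NEWID();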
Very slow? I wouldn't worry about that. I'd be very surprised if the newid() is slower than reading the record from the table. But again, test and benchmark.
I'd be happy with the solution you have, pending tests if you're concerned about it.
I've always used order by newid().

Using more than one index per table is dangerous?

In a former company I worked at, the rule of thumb was that a table should have no more than one index (allowing the odd exception, and certain parent tables that hold references to nearly all other tables and are therefore updated very frequently).
The idea being that indexes often cost as much as or more to maintain than they gain. Note that this question is different to indexed-view-vs-indexes-on-table, as the motivation is not only reporting.
Is this true? Is this index-purism worth it?
In your career do you generally avoid using indexes?
What are the general large-scale recommendations regarding indexes?
Currently and at the last company we use SQL Server, so any product specific guidelines are welcome too.
You need to create exactly as many indexes as you need to create. No more, no less. It is as simple as that.
Everybody "knows" that an index will slow down DML statements on a table. But for some reason very few people actually bother to test just how "slow" it becomes in their context. Sometimes I get the impression that people think that adding another index will add several seconds to each inserted row, making it a game changing business tradeoff that some fictive hotshot user should decide in a board room.
I'd like to share an example that I just created on my 2 year old pc, using a standard MySQL installation. I know you tagged the question SQL Server, but the example should be easily converted. I insert 1,000,000 rows into three tables. One table without indexes, one table with one index and one table with nine indexes.
drop table numbers;
drop table one_million_rows;
drop table one_million_one_index;
drop table one_million_nine_index;
/*
|| Create a dummy table to assist in generating rows
*/
create table numbers(n int);
insert into numbers(n) values(0),(1),(2),(3),(4),(5),(6),(7),(8),(9);
/*
|| Create a table consisting of 1,000,000 consecutive integers
*/
create table one_million_rows as
select d1.n + (d2.n * 10)
+ (d3.n * 100)
+ (d4.n * 1000)
+ (d5.n * 10000)
+ (d6.n * 100000) as n
from numbers d1
,numbers d2
,numbers d3
,numbers d4
,numbers d5
,numbers d6;
/*
|| Create an empty table with 9 integer columns.
|| One column will be indexed
*/
create table one_million_one_index(
c1 int, c2 int, c3 int
,c4 int, c5 int, c6 int
,c7 int, c8 int, c9 int
,index(c1)
);
/*
|| Create an empty table with 9 integer columns.
|| All nine columns will be indexed
*/
create table one_million_nine_index(
c1 int, c2 int, c3 int
,c4 int, c5 int, c6 int
,c7 int, c8 int, c9 int
,index(c1), index(c2), index(c3)
,index(c4), index(c5), index(c6)
,index(c7), index(c8), index(c9)
);
/*
|| Insert 1,000,000 rows in the table with one index
*/
insert into one_million_one_index(c1,c2,c3,c4,c5,c6,c7,c8,c9)
select n, n, n, n, n, n, n, n, n
from one_million_rows;
/*
|| Insert 1,000,000 rows in the table with nine indexes
*/
insert into one_million_nine_index(c1,c2,c3,c4,c5,c6,c7,c8,c9)
select n, n, n, n, n, n, n, n, n
from one_million_rows;
My timings are:
1m rows into the table without indexes: 0.45 seconds
1m rows into the table with 1 index: 1.5 seconds
1m rows into the table with 9 indexes: 6.98 seconds
I'm better with SQL than statistics and math, but I'd like to think that:
Adding 8 indexes to my table added (6.98 - 1.5) 5.48 seconds in total. Each index would then have contributed 0.685 seconds (5.48 / 8) for all 1,000,000 rows. That would mean that the added overhead per row per index would have been 0.000000685 seconds. SOMEBODY CALL THE BOARD OF DIRECTORS!
In conclusion, I'd like to say that the above test case doesn't prove a thing. It just shows that tonight, I was able to insert 1,000,000 consecutive integers into a table in a single-user environment. Your results will be different.
That is utterly ridiculous. First, you need multiple indexes in order to perform well. For instance, if you have a primary key, you automatically have an index. That means you couldn't index anything else under the rule you described. So if you don't index foreign keys, joins will be slow, and if you don't index fields used in the WHERE clause, queries will still be slow. Yes, you can have too many indexes, as they do take extra time when you insert, update, and delete records, but more than one is not dangerous; it is a requirement for a system that performs well. And I have found that users tolerate a longer time to insert better than they tolerate a longer time to query.
Now the exception might be a system that takes thousands of readings per second from some automated equipment. Such a database generally doesn't have indexes, to speed up inserts. But usually these types of databases are also not used for reading; the data is instead transferred daily to a reporting database, which is indexed.
Yes, definitely - too many indexes on a table can be worse than no indexes at all. However, I don't think there's any good in having the "at most one index per table" rule.
For SQL Server, my rule is:
index any foreign key fields - this helps JOINs and is beneficial to other queries, too
index any other fields when it makes sense, e.g. when lots of intensive queries can benefit from it
Finding the right mix of indices - weighing the pros of speeding up queries vs. the cons of additional overhead on INSERT, UPDATE, DELETE - is not an exact science - it's more about know-how, experience, measuring, measuring, and measuring again.
Any fixed rule is bound to be more counterproductive than anything else.
The best content on indexing comes from Kimberly Tripp - the Queen of Indexing - see her blog posts here.
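As an illustration of the first rule, indexing a foreign key column (hypothetical table and column names) is as simple as:
CREATE NONCLUSTERED INDEX IX_Orders_CustomerId
ON dbo.Orders (CustomerId);   -- helps joins to dbo.Customers and lookups of one customer's orders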
Unless you like very slow reads, you should have indexes. Don't go overboard, but don't be afraid of being liberal about them either. EVERY FK should be indexed. You're going to do a lookup on each of these columns when inserting into other tables, to make sure the references are set. The index helps. Indexed columns are also used often in joins and selects.
We have some tables that are inserted into rarely, with millions of records. Some of these tables are also quite wide. It's not uncommon for these tables to have 15+ indexes. For other tables with heavy inserting and low reads we might only have a handful of indexes - but one index per table is crazy.
An index is updated once per insert (per index), while the speed gain applies to every select. So if you update infrequently and read often, the extra work may be well worth it.
If you do different selects (meaning the columns you filter on are different), then maintaining an index for each type of query is very useful. Provided you have a limited set of columns that you query often.
But the usual advice holds: if you want to know which is fastest: profile!
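In SQL Server, a minimal way to profile is to compare IO and time statistics for the same query with and without the candidate index (the table and predicate below are placeholders):
SET STATISTICS IO ON;
SET STATISTICS TIME ON;

SELECT col1, col2
FROM dbo.MyTable        -- placeholder: your table
WHERE col3 = 42;        -- placeholder: your filter

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;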
You should of course be careful not to create too many indexes per table, but only ever using a single index per table is not a useful level.
How many indexes to use depends on how the table is used. A table that is updated often will generally have fewer indexes than one that is read much more often than it's updated.
We have some tables that are updated regularly by a job every two minutes, but they are read often by queries that vary a lot, so they have several indexes. One table, for example, has 24 indexes.
So much depends on your schema and the queries that you normally run. For example: if you normally need to select more than 60% of the rows of your table, indexes won't help you, and it will be cheaper to table scan than to index scan and then look up rows. Focused queries that select a small number of rows in different parts of the table, or which are used for joins in queries, will probably benefit from indexes. The right index in the right place can make or break a feature.
Indexes take space, so making too many indexes on a table can be counterproductive for the same reasons listed above. Scanning 5 indexes and then performing row lookups may be much more expensive than simply table scanning.
Good design is the synthesis of knowing when to normalise and when not to.
If you frequently join on a particular column, check the IO plan with the index and without. As a general rule I avoid tables with more than 20 columns. This is often a sign that the data should be normalised. More than about 5 indexes on a table and you may be using more space for the indexes than the main table, be sure that is worth it. These rules are only the lightest of guidance and so much depends on how the data will be used in queries and what your data update profile looks like.
Experiment with your query plans to see how your solution improves or degrades with an index.
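On the space point above, one way to see how much room each index takes compared to the table itself is a sketch like this (SQL Server; dbo.MyTable is a placeholder):
SELECT i.name AS index_name,
       SUM(ps.used_page_count) * 8 AS used_kb
FROM sys.dm_db_partition_stats AS ps
JOIN sys.indexes AS i
  ON i.object_id = ps.object_id
 AND i.index_id = ps.index_id
WHERE ps.object_id = OBJECT_ID('dbo.MyTable')   -- placeholder table name
GROUP BY i.name
ORDER BY used_kb DESC;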
Every table must have a PK, which is indexed of course (generally a clustered one), and then every FK should be indexed as well.
Finally, you may want to index fields you often sort on, if their data is well differentiated: for a field with only 5 possible values in a table with 1 million records, an index will not be of great benefit.
I tend to be minimalistic with indexes until the db starts being well filled, and ...slower. It is easy to identify the bottlenecks and add just the right indexes at that point.
Optimizing retrieval with indexes must be carefully designed to reflect actual query patterns. Certainly, for a table with a primary key, you will have at least one clustered index (that's how the data is actually stored); any additional indexes then take advantage of that layout (the clustered index).
After analyzing the queries that execute against the table, you want to design an index (or indexes) that covers them. That may mean building one or more indexes, but that heavily depends on the queries themselves. That decision cannot be made by looking at column statistics alone.
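For example, a covering index designed for one specific query pattern might look like this (hypothetical names):
CREATE NONCLUSTERED INDEX IX_Orders_Customer_Date
ON dbo.Orders (CustomerId, OrderDate)
INCLUDE (TotalAmount);   -- covers a query that filters on CustomerId/OrderDate and returns TotalAmount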
For tables that are mostly inserted into, e.g. ETL or staging tables, you might not create primary keys at all, or you might drop the indexes and re-create them (or drop and re-create the tables entirely) if the data changes too quickly.
I personally would be scared to step into an environment that has a hard-coded rule of indexes per table ratio.