Why does SQL query not use primary key for SELECT when it is most suitable? - sql

Scenario (tried to come up with a 1-1 mapping to my production scenario): Fetch list of all people who flew with Virgin airlines or Emirates from New York.
Table: tbl_Flyer has a few columns containing all details about the people who flew at any point of time. The Primary key is CountryId, CityId, AirlineId, PersonId
Now, a simple SQL query looks like this:
SELECT passenger.PersonId
FROM tbl_Flyer passenger
WHERE passenger.CountryId = #countryId
AND passenger.CityId = #cityId
AND passenger.AirlineId IN (SELECT values FROM #allAirlineIds)
#countryId, #cityId and #allAirlineIds are properly passed to the SQL stored procedure. My assumption was that this query would use the primary key, since all four columns used in the query are present in the PK, but for some reason it does not. It uses a non-clustered index that was added to support querying passengers by personal details such as age and sex (it looks like (CountryId, CityId, Age, Sex)).
I am adding a ForceSeek hint to the query but I want to understand if there is an anti-pattern that I might be using here? Any idea why SQL would defy logic and not use the PK for a seek?
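(For reference, the hint is applied as a table hint in the FROM clause; a minimal sketch, reusing the placeholders above:)
SELECT passenger.PersonId
FROM tbl_Flyer passenger WITH (FORCESEEK)
WHERE passenger.CountryId = #countryId
AND passenger.CityId = #cityId
AND passenger.AirlineId IN (SELECT values FROM #allAirlineIds)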

The choice your database engine makes between one index and another is automatic, based on heuristics... which are not always the most accurate. (99% of the time they are, but sometimes the human brain finds a better way.)
These heuristics are built from general rules, and sometimes they don't match the reality of the content of your database (a string column where every value starts with the same letter, a column with a lot of NULLs, ...).
The "IN (SELECT ...)" operation may have to be evaluated for every row of your table and materialized, and most database engines consider it extremely expensive, so your database can prefer another access path (the non-clustered index, in your case).
Rewriting it with EXISTS is considered far less expensive, by the way, and will make your database engine more likely to choose the primary key index.
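A rough sketch of the EXISTS form, assuming #allAirlineIds exposes the ids in a column called AirlineId (adjust to your real column name):
SELECT passenger.PersonId
FROM tbl_Flyer passenger
WHERE passenger.CountryId = #countryId
AND passenger.CityId = #cityId
AND EXISTS (SELECT 1
            FROM #allAirlineIds a
            WHERE a.AirlineId = passenger.AirlineId)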
Use FORCESEEK if that is not enough.
You can also run into the same issue if the types of CountryId, CityId, AirlineId, PersonId are not the same as the types of #CountryId, #CityId, #AirlineId, #PersonId (implicit type conversion is expensive and can prevent an index seek).

SQL Dynamic Optimization Tables?

I am a very experienced programmer, but extremely new to SQL, which has a more limited view of things than what is available in code. I think it's possible I'm looking at this wrong in the context of SQL in general, so I'm looking for direction. I do not believe the specific SQL implementation is really important at this point. I think this is just a general SQL conceptual issue, that I'm having.
Here's what I'm thinking:
Say I am going to track the results of a very large number of sporting events (10s of millions or more), with the teams that played in them and the final scores:
CREATE TABLE teams (
TeamID INT NOT NULL PRIMARY KEY,
TeamName VARCHAR(255) NOT NULL
)
CREATE TABLE games (
GameID INT NOT NULL PRIMARY KEY,
TeamA INT NOT NULL,
TeamB INT NOT NULL,
TeamAScore INT,
TeamBScore INT,
FOREIGN KEY (TeamA)
REFERENCES teams (TeamID),
FOREIGN KEY (TeamB)
REFERENCES teams (TeamID)
)
Since the "games" table will be extremely large, when a query is made for the results of a particular team, it seems to me that searching both "TeamA" and "TeamB" columns for matches could be a very time-consuming operation. That would in turn make immediate presentation on a UI a problem.
However, if there were lists of games played by each team, the query could be made much faster (at the expense of more storage):
CREATE TABLE team_TeamID_games (
GameID INT NOT NULL,
FOREIGN KEY (GameID) REFERENCES games (GameID)
)
Then displaying the list of results for a team just involves using the "team_TeamID_games" table and pulling out the results of the "games" table directly, rather than searching it.
The questionable part here starts with the idea of introducing a new table for each team. The "TeamID" portion of the "team_TeamID_games" above would be replaced with the team ID, so there might be tables called "team_1_games", "team_2_games", etc.
That alone seems to break with what I've seen in researching SQL use.
Additionally, from what I've learned of SQL so far, there isn't really a standard way to actually link the "team_TeamID_games" table to the "TeamID" row of the "teams" table, since foreign keys reference a row, not an entire table. And that means the database doesn't really know about the connection.
Alternatively, a VARCHAR() string with the name of the other table could be stored in the "teams" table, but I don't believe that actually means anything to the database either.
Is the concept of a link between tables done above and outside the database itself an extremely bad thing?
Is the creation of such "dynamic" tables (not statically created up front, but created as teams are registered, and populated as the game results are entered) for each team a bad idea?
Is there another way to accomplish this optimization?
Not sure what you consider "extremely" large. With e.g. 2500 teams, the resulting games table would have about 6 million rows. That is not even considered "large" nowadays. With 5000 teams, the games table would have 25 million rows. Still not "extremely" large nowadays.
The query "find all games of a specific team" can be answered using the following query:
select *
from games
where teama = 42
or teamb = 42;
This can (usually) be improved by creating an index on each column:
create index idx_team_a on games (teama);
create index idx_team_b on games (teamb);
Postgres (and probably other DBMS products as well) would be able to use both indexes for that query. On my laptop (with 2500 teams and 6.2 million games) that query takes about 3 milliseconds.
Another option would be to create indexes on expressions covering the smaller and the larger of the two team IDs:
create index on games ( (least(teama, teamb)) );
create index on games ( (greatest(teama, teamb)) );
Those expressions can then be used to find all games for one team:
select *
from games
where least(teama, teamb) = 1234
   or greatest(teama, teamb) = 1234;
On my laptop this is a bit faster: about 2 milliseconds.
With 25 million rows (5000 teams), the difference between the two approaches is a bit bigger. The OR query takes around 15-20 milliseconds, the expression based query takes around 5-10 milliseconds.
Even 20 milliseconds doesn't seem like something that would be a problem in the UI.
So with careful indexing I don't see why you would need any additional table.

Which SQL Update is faster / more efficient

I need to update a table every time a certain action is taken.
MemberTable
Name varchar 60
Phone varchar 20
Title varchar 20
Credits int <-- the one that needs constant updates
etc with all the relevant member columns 10 - 15 total
Should I update this table with:
UPDATE Members
SET Credits = Credits - 1
WHERE Id = 1
or should I create another table called account with only two columns like:
Account table
Id int
MemberId int <-- foreign key to members table
Credits int
and update it with:
UPDATE Accounts
SET Credits = Credits - 1
WHERE MemberId = 1
Which one would be faster and more efficient?
I have read that SQL Server must read the whole row in order to update it. I'm not sure if that's true. Any help would be greatly appreciated
I know that this doesn't directly answer the question but I'm going to throw this out there as an alternative solution.
Are you bothered about historic transactions? Not everyone will be, but in case you or other future readers are, here's how I would approach the problem:
CREATE TABLE credit_transactions (
member_id int NOT NULL
, transaction_date datetime NOT NULL
CONSTRAINT df_credit_transactions_date DEFAULT Current_Timestamp
, credit_amount int NOT NULL
, CONSTRAINT pk_credit_transactions PRIMARY KEY (member_id, transaction_date)
, CONSTRAINT fk_credit_transactions_member_id FOREIGN KEY (member_id)
REFERENCES member (id)
, CONSTRAINT ck_credit_transaction_amount_not_zero CHECK (credit_amount <> 0)
);
In terms of write performance...
INSERT INTO credit_transactions (member_id, credit_amount)
VALUES (937, -1)
;
Pretty simple, eh! No contention on an existing row, since every action is just a new insert.
The downside to this method is that to work out a member's "balance", you have to perform a bit of a calculation.
CREATE VIEW member_credit
AS
SELECT member_id
, Sum(credit_amount) As credit_balance
, Max(transaction_date) As latest_transaction
FROM credit_transactions
GROUP
BY member_id
;
However using a view makes things nice and simple and can be optimized appropriately.
Heck, you might want to throw in a NOLOCK (read up about this before making your decision) on that view to reduce locking impact.
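For example, reading the balance for the member from the earlier insert might look like this; the NOLOCK hint is optional and carries the usual dirty-read caveats:
SELECT credit_balance, latest_transaction
FROM member_credit WITH (NOLOCK) -- the hint propagates down to the underlying credit_transactions table
WHERE member_id = 937;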
TL;DR:
Pros: quick write speed, transaction history available
Cons: slower read speed
Actually, the latter approach would be faster.
If your transaction volume is very high, to the point where every millisecond matters, it's better to do it that way.
Also, if some members will never have credits, you might save some space.
However, if that's not the case, it's better to keep your table structure simple and normalized: if every member will always have a credit balance, it's better to keep it as a column in the Members table.
Try not to add an unnecessary intermediate table, which consumes more space (with all those foreign keys and additional IDs) and makes your schema a little more complex.
In the end, it depends on your requirements.
As the Id is the primary key, all the DBMS has to do is look up the key in the index, fetch the record and update it. There should not be much of a performance problem.
Using an account table leads to exactly the same access method. But you are right: as there is less data per record, you might more often have the record in the memory cache already and thus save a physical read. However, I wouldn't expect that to happen too often. And you probably work more with your member table than with the account table, which makes it more likely that a member record is already in cache; so it's just the other way around, and your account table access is the slower one.
Cache access vs. physical reads is the only difference, because with the primary key you will walk the same way through the ID index and then access one particular record directly.
I don't recommend using the account table. It somewhat blurs the data structure with a 1:1 relation between the two tables that may not be immediately recognized by other users. And it is not likely you will gain much from it. (As mentioned, you might even lose performance.)

Store results of SQL Server query for pagination

In my database I have a table with a rather large data set that users can perform searches on. So for the following table structure for the Person table that contains about 250,000 records:
firstName | lastName | age
----------|----------|----
John      | Doe      | 25
John      | Sams     | 15
the users would be able to perform a query that can return about 500 or so results. What I would like to do is let the user see his search results 50 at a time using pagination. I've figured out the client-side pagination stuff, but I need somewhere to store the query results so that the pagination uses the results from his unique query and not from a SELECT * statement.
Can anyone provide some guidance on the best way to achieve this? Thanks.
Side note: I've been trying to use temp tables for this with SELECT INTO statements, but I think that might cause problems if, say, User A performs a search and his results are stored in the temp table, and then User B performs a search shortly after and User A's results are overwritten.
In SQL Server the ROW_NUMBER() function is great for pagination, and may be helpful depending on what parameters change between searches, for example if searches were just for different firstName values you could use:
;WITH search AS (SELECT *,ROW_NUMBER() OVER (PARTITION BY firstName ORDER BY lastName) AS RN_firstName
FROM YourTable)
SELECT *
FROM search
WHERE RN_firstName BETWEEN 51 AND 100
AND firstName = 'John'
You could add additional ROW_NUMBER() lines, altering the PARTITION BY clause based on which fields are being searched.
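For instance, a second ROW_NUMBER() partitioned by lastName would support paging searches on that column as well (a sketch, reusing the hypothetical YourTable from above):
;WITH search AS (SELECT *,
    ROW_NUMBER() OVER (PARTITION BY firstName ORDER BY lastName) AS RN_firstName,
    ROW_NUMBER() OVER (PARTITION BY lastName ORDER BY firstName) AS RN_lastName
FROM YourTable)
SELECT *
FROM search
WHERE RN_lastName BETWEEN 51 AND 100
AND lastName = 'Doe'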
Historically, for us, the best way to manage this is to create a complete new table, with a unique name. Then, when you're done, you can schedule the table for deletion.
The table, if practical, simply contains an index id (a simple sequence: 1, 2, 3, 4, 5) and the primary key to the table(s) that are part of the query. Not the entire result set.
Your pagination logic then does something like:
SELECT p.* FROM temp_1234 t, primary_table p
WHERE t.pkey = p.primary_key
AND t.serial_id between 51 and 100
The serial id is your paging index.
So, you end up with something like (note, I'm not a SQL Server guy, so pardon):
CREATE TABLE temp_1234 (
serial_id serial,
pkey number
);
INSERT INTO temp_1234
SELECT 0, primary_key FROM primary_table WHERE <criteria> ORDER BY <sort>;
CREATE INDEX i_temp_1234 ON temp_1234(serial_id); -- I think sql already does this for you
If you can delay the index, it's faster than creating it first, but it's a marginal improvement most likely.
Also, create a tracking table where you insert the table name and the creation date. You can use this with a reaper process later (late at night) to DROP the day's tables (those more than, say, X hours old).
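A sketch of what that tracking table and the reaper's lookup could look like (all names here are made up):
CREATE TABLE temp_table_registry (
table_name sysname NOT NULL PRIMARY KEY,
created_at datetime NOT NULL DEFAULT CURRENT_TIMESTAMP
);

-- The reaper finds tables older than, say, 12 hours...
SELECT table_name
FROM temp_table_registry
WHERE created_at < DATEADD(HOUR, -12, CURRENT_TIMESTAMP);
-- ...then issues a DROP TABLE for each one and removes its registry row.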
Full-table operations (creating and later dropping a whole table) are much cheaper than inserting and then deleting rows in a single shared table:
INSERT INTO page_table SELECT 'temp_1234', <sequence>, primary_key...
DELETE FROM page_table WHERE page_id = 'temp_1234';
That's just awful.
First of all, make sure you really need to do this. You're adding significant complexity, so go and measure whether the queries and pagination really hurt, or whether you just "feel like you should". The pagination can be handled with ROW_NUMBER() quite easily.
Assuming you go ahead, once you've got your query, clearly you need to build a cache so first you need to identify what the key is. It will be the SQL statement or operation identifier (name of stored procedure perhaps) and the criteria used. If you don't want to share between users then the user name or some kind of session ID too.
Now when you do a query, you first look up in this table with all the key data then either
a) You can't find it, so you run the query and add it to the cache, storing the criteria/keys and either the data or the PKs of the data, depending on whether you want a snapshot or "real time". Bear in mind that it isn't really real time either way, because other users could be changing the data under you.
b) You find it, so you fetch the stored results (or join the stored PKs back to the underlying tables) and return them.
Of course now you need a background process to go and clean up the cache when it's been hanging around too long.
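As a rough illustration only, such a cache table might be shaped like this (every name below is hypothetical):
CREATE TABLE search_cache (
session_id varchar(100) NOT NULL,  -- user or session identity
criteria varchar(500) NOT NULL,    -- normalized search parameters / proc name
row_position int NOT NULL,         -- 1, 2, 3, ... used for paging
person_pk int NOT NULL,            -- PK back into the Person table
created_at datetime NOT NULL DEFAULT CURRENT_TIMESTAMP,
CONSTRAINT pk_search_cache PRIMARY KEY (session_id, criteria, row_position)
);
Paging then becomes a range query on row_position joined back to Person, and the cleanup job deletes rows whose created_at is too old.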
Like I said - you should really make sure you need to do this before you embark on it. In the example you give I don't think it's worth it.

Should I use a unique ID for a row in a junction table?

I am using SQL Server 2008.
A while back, I asked the question "should I use RecordID in a junction table". The tables would look like this:
-- Images
ImageID  -- PK
-- Persons
PersonID -- PK
-- Images_Persons
RecordID -- PK
ImageID  -- FK
PersonID -- FK
I was strongly advised NOT to use RecordID because it's useless in a table where the two IDs create a unique combination, meaning there will be no duplicate records.
Now, I am trying to find a random record in the junction table to create a quiz. I want to pull the first id and see if someone can match the second id. Specifically, I grab a random image and display it with three possible choices of persons.
The following query works, but I've read quite a bit of negativity suggesting that it's very slow. My database might have 10,000 records, so I don't think that matters much. I've also read that the values generated aren't truly random.
SELECT TOP 1 * FROM Images_Persons ORDER BY newid();
Should I add the RecordID column or not? Is there a better way to find a random record in this case?
Previous questions for reference
Should I use "RecordID" as a column name?
SQL - What is the best table design to store people as musicians and artists?
NEWID is random enough and probably best
10k rows is peanuts
You don't need a surrogate key for a junction (link, many-many) table
Edit: in case you want to prematurely optimise...
You could ignore this and read these answers from Mitch Wheat instead. But with just 10k rows, your development time will be longer than any saved execution time.
Efficiently select random rows from large resultset with LINQ (ala TABLESAMPLE)
Efficiently randomize (shuffle) data in Sql Server table
Personally, I don't think that having the RecordID column should be advised AGAINST. Rather I'd advise that often it is UNNECESSARY.
There are cases where having a single value to identify a row makes for simpler code, but it comes at the cost of additional storage, often additional indexes, etc. The overheads are realistically small, but so are the benefits.
In terms of the selection of random records, the existence of a single unique identifier can make the task easier if the identifiers are both sequential and consecutive.
The reason I say this is because your proposed solution requires the assignment of NEWID() to every record, and the sorting of all records to find the first one. As the table size grows this operation grows, and can become relatively expensive. Whether it's expensive enough to be worth optimising depends on whatever else is happening, how often, etc.
Where there are sequential, consecutive unique identifiers, however, one can choose a random value between MIN(id) and MAX(id) and then SEEK that value out. The requirement that all values be consecutive, however, is often a constraint too far; you're never allowed to delete a value mid-table, for example...
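A minimal sketch of that idea, assuming the junction table had a sequential, gap-free surrogate key (hypothetically named RecordID):
DECLARE @min_id INT, @max_id INT;

SELECT @min_id = MIN(RecordID), @max_id = MAX(RecordID)
FROM Images_Persons;

-- Seek a single random row directly by key.
SELECT *
FROM Images_Persons
WHERE RecordID = @min_id + CAST(RAND() * (@max_id - @min_id + 1) AS INT);
The seek itself is cheap; the hard part is keeping the identifiers gap-free.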
To overcome this, and depending on indexes, you may find the following approach useful.
DECLARE @max_id INT

SELECT @max_id = COUNT(*)
FROM Images_Persons

SELECT *
FROM (
    SELECT *,
           ROW_NUMBER() OVER (ORDER BY ImageID, PersonID) AS id
    FROM Images_Persons
) AS data
WHERE data.id = CAST(@max_id * RAND() + 1 AS INT)
-- Assuming that `ImageID, PersonID` is the clustered index.
A downside here is that RAND() is notoriously poor at being truly random. Yet it is normally perfectly suitable if executed at a random time relative to any other call to RAND().
Consider what you've got.
SELECT TOP 1 * FROM Images_Persons ORDER BY newid();
Not truly random? Excluding the 'truly random is impossible' bit, you're probably right - I believe that there are patterns in generated uniqueidentifiers. But you should test this yourself. It'd be simple; just create a table with 1 to 100 in it, order by newid() a lot of times, and look at the results. If it's random 'enough' for you (which it probably will be, for a quiz) then it's good enough.
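A quick-and-dirty version of that test might be (purely illustrative):
CREATE TABLE #numbers (n INT NOT NULL);

-- Fill with 1..100.
DECLARE @i INT = 1;
WHILE @i <= 100
BEGIN
    INSERT INTO #numbers (n) VALUES (@i);
    SET @i = @i + 1;
END;

-- Run this a few dozen times and eyeball which values float to the top.
SELECT TOP 10 n FROM #numbers ORDER BY NEWID();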
Very slow? I wouldn't worry about that. I'd be very surprised if the newid() is slower than reading the record from the table. But again, test and benchmark.
I'd be happy with the solution you have, pending tests if you're concerned about it.
I've always used order by newid().

SQL Server index included columns

I need help understanding how to create indexes. I have a table that looks like this
Id
Name
Age
Location
Education
PhoneNumber
My query looks like this:
SELECT *
FROM table1
WHERE name = 'sam'
What's the correct way to create an index for this with included columns?
What if the query has a order by statement?
SELECT *
FROM table1
WHERE name = 'sam'
ORDER BY id DESC
What if I have 2 parameters in my where statement?
SELECT *
FROM table1
WHERE name = 'sam'
AND age > 12
The correct way to create an index with included columns? Either via Management Studio/Toad/etc, or SQL (documentation):
CREATE INDEX idx_table1_name ON dbo.table1 (name) INCLUDE (id)
What if the Query has an ORDER BY
The ORDER BY can use indexes, if the optimizer sees fit to (determined by table statistics & query). It's up to you to test if a composite index or an index with INCLUDE columns works best by reviewing the query cost.
If id is the clustered key (not always the primary key though), I probably wouldn't INCLUDE the column...
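For example (index names are made up), the two shapes you might compare for the ORDER BY query are:
CREATE INDEX ix_table1_name_id ON table1 (name, id DESC);       -- composite: id is part of the key, sorted descending within each name
CREATE INDEX ix_table1_name_inc ON table1 (name) INCLUDE (id);  -- id carried only at the leaf level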
What if I have 2 parameters in my where statement?
Same as above - you need to test what works best for your query. Might be composite, or include, or separate indexes.
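For instance, possible candidates to benchmark for WHERE name = 'sam' AND age > 12 (again, names are made up):
CREATE INDEX ix_table1_name_age ON table1 (name, age);            -- composite: seek on name, range on age
CREATE INDEX ix_table1_name_i_age ON table1 (name) INCLUDE (age); -- seek on name, age filtered at the leaf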
But keep in mind that:
tweaking for one query won't necessarily benefit every other query
indexes do slow down INSERT/UPDATE/DELETE statements, and require maintenance
You can use the Database Tuning Advisor (DTA) for index recommendations, including when some are redundant
Recommended reading
I highly recommend reading Kimberly Tripp's "The Tipping Point" for a better understanding of index decisions and impacts.
Since I do not know exactly which tasks your DB is going to perform or how many records are in it, I would suggest that you take a look at the Index Basics MSDN article. It will help you decide for yourself which indexes to create.
If ID is your primary and/or clustered index key, just create an index on Name, Age. This will cover all three queries.
Included fields are best used to retrieve row-level values for columns that are not in the filter list, or to retrieve aggregate values where the sorted field is in the GROUP BY clause.
If inserts are rare, create as many indexes as you want.
For the first query, create an index on the name column.
The Id column, I assume, is already the primary key...
For the other queries, create a second index on name and age. You could also keep just the one index on (name, age); it will not be much slower for the first query.