Choosing indexes and primary keys for performance

I am new to database design and I am having a lot of trouble designing a PostgreSQL database for a combat game.
In this game, players will fight each other, gaining resources to buy better weapons and armor. Combats will be recorded for future review, and the number of combats is expected to grow rapidly: for example, 1k players fighting 1k rounds will produce 500k records.
Game interactivity is limited to spending points to upgrade the fighter's equipment and abilities. Combats are resolved by the machine.
Details:
A specific type of weapon or armor can only be possessed once by each fighter.
Fighters will almost exclusively be searched by id.
I will often need to search which pieces of equipment (weapons and/or armor) are possessed by a specific fighter, but I do not expect to search which fighters possess a specific type of weapon.
Combats will often be searched by winner or loser.
Two given fighters can fight multiple times on different dates, so the tuple winner-loser is not unique in the combats table.
The fighters table contains a lot of columns that will often be retrieved all at the same time (I create two objects of class "Fighter" with all the related information any time a combat begins).
This is my current design:
CREATE TABLE IF NOT EXISTS weapons (
    id serial PRIMARY KEY,
    *** Game stuff ***
);
CREATE TABLE IF NOT EXISTS armors (
    id serial PRIMARY KEY,
    *** Game stuff ***
);
CREATE TABLE IF NOT EXISTS fighters (
    id serial PRIMARY KEY,
    preferred_weapon INT references weapons(id),
    preferred_armor INT references armors(id),
    *** Game stuff ***
);
CREATE TABLE IF NOT EXISTS combats (
    id serial PRIMARY KEY,
    winner INT references fighters(id),
    loser INT references fighters(id),
    *** Game stuff ***
);
CREATE TABLE IF NOT EXISTS fighters_weapons (
    fighter INT NOT NULL references fighters(id),
    weapon INT NOT NULL references weapons(id),
    PRIMARY KEY(fighter, weapon)
);
CREATE TABLE IF NOT EXISTS fighters_armors (
    fighter INT NOT NULL references fighters(id),
    armor INT NOT NULL references armors(id),
    PRIMARY KEY(fighter, armor)
);
My questions are:
Do you think my design is well suited?
I have seen a lot of example databases containing an id column as primary key on every table. Is there any reason for that? Should I do that instead of the multiple column primary keys I am using on fighters_weapons and fighters_armors?
PostgreSQL creates indexes automatically for each primary key, but there are several tables which I do not expect to search by it (e.g. combats). Should I remove the index for performance? PostgreSQL complains about an existing constraint.
As I will search fighters_weapons and fighters_armors by fighter, as well as combats by winner and loser, do you think I should create indexes for all of these columns on these tables?
Any performance improvement advice? The most used operations will be: insert and query fighters, query equipment for a given fighter and insert combats.
Thanks a lot :)

To address your explicit questions:
2) It can be preferable to use a "natural" value as a primary key, i.e. not a serial id, if one exists. In cases where you are unlikely to use a serial id as an identifier, I would say it's slightly better not to add it.
3) Unless you intend to insert many rows very quickly into the combats table, it probably won't hurt you too much to have the index on the id column.
4) Creating an index on {fighter} is not necessary if the index {fighter, weapon} exists, and similarly creating an index on {fighter} is not necessary if the index {fighter, armor} exists. In general, you don't benefit from creating an index that is the prefix of another multi-column index. Separately, creating {winner} and {loser} indexes on combats seems like a good idea given the access pattern you've described.
5) Beyond table design, there are a few database tuning parameters that you might want to set if you've installed the database yourself. If an experienced database administrator has set up the database, he/she has probably already done this for you.
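To make point 4 concrete, here is a minimal sketch. It uses SQLite through Python's built-in sqlite3 module purely as a stand-in so it runs without a server; the index names combats_winner_idx and combats_loser_idx are made up for the example, and in PostgreSQL you would inspect plans with EXPLAIN, whose output looks different. It shows the composite primary key already serving lookups by fighter, and the two extra indexes serving lookups on combats:

```python
import sqlite3

# Assumption: SQLite stands in for PostgreSQL so the sketch is self-contained.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE fighters (id INTEGER PRIMARY KEY);
CREATE TABLE combats (
    id     INTEGER PRIMARY KEY,
    winner INTEGER REFERENCES fighters(id),
    loser  INTEGER REFERENCES fighters(id)
);
CREATE TABLE fighters_weapons (
    fighter INTEGER NOT NULL REFERENCES fighters(id),
    weapon  INTEGER NOT NULL,
    PRIMARY KEY (fighter, weapon)
);
-- Hypothetical index names: one index per column that combats is searched by.
CREATE INDEX combats_winner_idx ON combats (winner);
CREATE INDEX combats_loser_idx  ON combats (loser);
""")


def plan(sql):
    """Return the planner's description of how it would run the query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    # The human-readable detail is the fourth column of each plan row.
    return " | ".join(row[3] for row in rows)


# No separate index on fighter exists; the (fighter, weapon) key is used.
weapons_plan = plan("SELECT weapon FROM fighters_weapons WHERE fighter = 1")
winner_plan = plan("SELECT id FROM combats WHERE winner = 1")
loser_plan = plan("SELECT id FROM combats WHERE loser = 1")

print(weapons_plan)
print(winner_plan)
print(loser_plan)
```

Note that the reverse lookup (which fighters own a given weapon) would not be served by the (fighter, weapon) key, which matches the access pattern described in the question.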

Related

Should I index a foreign key that is going to be updated often

I am trying to create a library relational database with two tables: users and books. The relationship is one-to-many: a user has many books, and each book is owned by only one user. I was thinking that the books table should have a foreign key column that references the user id.
However, I run into a problem when I want to get all of the books of a given user.
The only option is to query the books whose user id equals the given user id, using a join.
But if there are many books, it will take a lot of time.
So one may suggest indexing the foreign key column with a nonclustered index. However, a book-user combination will be updated often: you don't keep a book more than one day in this library. But I read that updating an indexed column often is not best practice.
So what should I do? What is the best solution for this case?

For the best performance when querying in both directions, use a middle table to store the relationships. Both the users table and the books table should have a unique index on their ids.
The middle table, borrowing_table, has the columns user_id and book_id.
It stores the ids of both users and books, with an index on each, so you can query it by user_id to find which books a given user has borrowed, and by book_id to quickly find the user who has a given book.

You should have an index on book_id.
Your concern about "frequent" updates just doesn't apply in a library setting. Libraries work on the time frames of days and weeks. Databases work on the timeframes of milliseconds, seconds, and minutes. What might seem frequent in a library is rather rare from the perspective of a database.
That said, I would suggest an intermediate table, not because you have a 1-n relationship at any given point in time. Instead, you have a time-tiled relationship. So:
create table UserBooks (
UserBookId int, -- serial, auto_increment, identity, generated always
UserId int references Users(UserId),
BookId int references Books(BookId),
FromDate datetime,
ToDate datetime,
DueDate datetime,
OverdueFees numeric(20, 4)
. . .
);
In other words, "borrowing" deserves to be an entity itself, because there is more information than just the book and the user.
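As a rough sketch of how such a table might be used (SQLite via Python's sqlite3 stands in for your actual database, the index names are hypothetical, and treating a NULL ToDate as "still on loan" is my assumption, not part of the table above):

```python
import sqlite3

# Assumption: SQLite is only a stand-in so the sketch is runnable as-is.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE UserBooks (
    UserBookId INTEGER PRIMARY KEY,
    UserId     INTEGER NOT NULL,
    BookId     INTEGER NOT NULL,
    FromDate   TEXT NOT NULL,
    ToDate     TEXT              -- assumption: NULL while the book is still out
);
-- Hypothetical index names, one per lookup direction.
CREATE INDEX UserBooks_UserId_idx ON UserBooks (UserId);
CREATE INDEX UserBooks_BookId_idx ON UserBooks (BookId);
""")


def borrow(user_id, book_id, day):
    """Record a new loan as a new row; no existing row is touched."""
    with conn:
        conn.execute(
            "INSERT INTO UserBooks (UserId, BookId, FromDate) VALUES (?, ?, ?)",
            (user_id, book_id, day),
        )


def give_back(book_id, day):
    """Close the open loan for a book by filling in ToDate."""
    with conn:
        conn.execute(
            "UPDATE UserBooks SET ToDate = ? WHERE BookId = ? AND ToDate IS NULL",
            (day, book_id),
        )


def books_held_by(user_id):
    """Return the books the user has out right now."""
    rows = conn.execute(
        "SELECT BookId FROM UserBooks "
        "WHERE UserId = ? AND ToDate IS NULL ORDER BY BookId",
        (user_id,),
    )
    return [book_id for (book_id,) in rows]


borrow(1, 10, "2024-01-01")
borrow(1, 11, "2024-01-01")
give_back(10, "2024-01-02")
borrow(2, 10, "2024-01-02")
print(books_held_by(1), books_held_by(2))
```

With this shape the indexed UserId and BookId values are written once per loan and never changed afterwards; only ToDate is updated when a book comes back.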

Problems on having a field that will be null very often on a table in SQL Server

I have a column that sometimes will be null. This column is also a foreign key, so I want to know if I'll have problems with performance or with data consistency if this column will have weight
I know it's a foolish question but I want to be sure.

There is not necessarily a problem with this, other than that it is likely an indication that you might have a poorly normalized design. There might be performance implications due to the way indexes are structured and the sparseness of the column with nulls, but without knowing your structure or intended querying scenarios any conclusions one might draw would be pure speculation.
A better solution might be a shared primary key, where table A has a primary key and there are zero or one records in B with the same primary key.
If table A can have one or zero B, but more than one A can refer to B, then what you have is a one to many relationship. This can be represented as Pieter laid out in his answer. This allows multiple A records to refer to the same B, and in turn each B may optionally refer to an A.
So you see there are two optional structures to address this problem, and choosing between them is not guesswork. There is a distinct rationale for choosing one or the other, but it depends on the nature of the relationships you are modelling.

Instead of this design:
create table Master (
ID int identity not null primary key,
DetailID int null references Detail(ID)
)
go
create table Detail (
ID int identity not null primary key
)
go
consider this instead
create table Master (
ID int identity not null primary key
)
go
create table Detail (
ID int identity not null primary key,
MasterID int not null references Master(ID)
)
go
Now the foreign key is never null; rather, the existence (or not) of the Detail record indicates whether the relationship exists.
If a Detail can exist for multiple records, create a mapping table to manage the relationship.
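For illustration, a minimal sketch of the second design in use. It runs on SQLite through Python's sqlite3 module as a stand-in for SQL Server (so IDENTITY becomes INTEGER PRIMARY KEY), and the sample ids are made up. A LEFT JOIN shows which Master rows have a Detail without any nullable foreign key column:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite does not enforce foreign keys unless asked to.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE Master (ID INTEGER PRIMARY KEY);
CREATE TABLE Detail (
    ID       INTEGER PRIMARY KEY,
    MasterID INTEGER NOT NULL REFERENCES Master(ID)
);
-- Sample data: Master 1 has a Detail, Master 2 has none.
INSERT INTO Master (ID) VALUES (1), (2);
INSERT INTO Detail (ID, MasterID) VALUES (100, 1);
""")

# The "missing" side shows up as NULL in the query result, not in the table.
rows = conn.execute("""
    SELECT m.ID, d.ID
    FROM Master AS m
    LEFT JOIN Detail AS d ON d.MasterID = m.ID
    ORDER BY m.ID
""").fetchall()
print(rows)


def add_detail(detail_id, master_id):
    """Try to insert a Detail; return False if a constraint rejects it."""
    try:
        with conn:
            conn.execute(
                "INSERT INTO Detail (ID, MasterID) VALUES (?, ?)",
                (detail_id, master_id),
            )
        return True
    except sqlite3.IntegrityError:
        return False
```

Calling add_detail with a MasterID that does not exist is rejected by the foreign key, so a Detail can never point at nothing.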

Large Numbers Of Columns In Database

I have been doing some research into this issue and still have not been able to make a satisfactory decision.
This question came closest but still does not really help my situation.
Large Number of Columns in MySQL Database
I am basically creating a "who would win in a fight" site to settle the long-standing Batman vs Superman style arguments, where users can vote on who they think would win.
Users will have the option to submit a "fighter" to the website, who will then be randomly matched against every other fighter for future users to vote on.
Obviously, I want to keep statistics on all of the matchups to display to the users.
I will have a table named, let's say, FIGHTERS. This will store info like primary key, name and description, but not fight results.
As for storing the fight results, I can see two options.
Option A: Create a table for each fighter to count the number of winning votes they have vs every other fighter's primary key.
Option B: Create one large votes table with an equal number of columns and rows, indexed by the primary keys of the fighters. Then, for example, to get the stats for fighter1 vs fighter4, I would query row 1 (fighter1, PK1), column 4 (fighter4, PK4) to get the number of fighter1 wins vs fighter4, and then query row 4 (fighter4, PK4), column 1 to get the number of fighter4 wins vs fighter1. This table would obviously get very large when hundreds (thousands?) of fighters are added.
(Hope that was not too confusing!)
So I guess my question is: is it better to have hundreds of small tables (which will all need to have columns and rows added when a new fighter is added), or to have one large table?
I'm totally 50/50 on this, so any advice or other ways I could achieve this would be most appreciated.
Thanks in advance.
EDIT: Sorry for leaving this out. The voting I had in mind would work basically as a count of overall votes for each fighter in favour of winning the fight vs each other fighter.

Following clarification I would consider
CREATE TABLE FightResults
(
Fighter1Id INT REFERENCES FIGHTERS(FighterId),
Fighter2Id INT REFERENCES FIGHTERS(FighterId),
Fighter1Votes INT,
Fighter2Votes INT,
CHECK (Fighter1Id < Fighter2Id ),
PRIMARY KEY (Fighter1Id,Fighter2Id)
)
You have a row for each matchup. Gorilla vs Shark, Lion vs Tiger etc. The check and PK constraints ensure the same matchup isn't represented more than once.
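One consequence of the CHECK constraint is that the application has to put the smaller id first before touching the table. A rough sketch of that, using SQLite via Python's sqlite3 as a stand-in; the add_vote helper and the DEFAULT 0 on the vote columns are my own additions for the example, and the foreign keys to FIGHTERS are left out to keep it short:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE FightResults (
    Fighter1Id    INTEGER NOT NULL,
    Fighter2Id    INTEGER NOT NULL,
    Fighter1Votes INTEGER NOT NULL DEFAULT 0,
    Fighter2Votes INTEGER NOT NULL DEFAULT 0,
    CHECK (Fighter1Id < Fighter2Id),
    PRIMARY KEY (Fighter1Id, Fighter2Id)
);
""")


def add_vote(voted_for, opponent):
    """Hypothetical helper: count one vote for voted_for in this matchup."""
    if voted_for == opponent:
        raise ValueError("a fighter cannot be matched against itself")
    # Normalise the pair so the smaller id is always Fighter1Id.
    low, high = sorted((voted_for, opponent))
    column = "Fighter1Votes" if voted_for == low else "Fighter2Votes"
    with conn:
        # Create the matchup row on first use, then bump the right counter.
        conn.execute(
            "INSERT OR IGNORE INTO FightResults (Fighter1Id, Fighter2Id) "
            "VALUES (?, ?)",
            (low, high),
        )
        conn.execute(
            f"UPDATE FightResults SET {column} = {column} + 1 "
            "WHERE Fighter1Id = ? AND Fighter2Id = ?",
            (low, high),
        )


add_vote(4, 1)
add_vote(1, 4)
add_vote(4, 1)
results = conn.execute("SELECT * FROM FightResults").fetchall()
print(results)
```

Whichever order the two ids arrive in, all votes for the pair land in the same single row.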
This does assume that the fights will have a fixed number of participants at two. If this isn't the case then a more flexible schema is
CREATE TABLE Fight
(
FightId INT PRIMARY KEY,
/*Other columns with fight metadata*/
)
CREATE TABLE FightResult
(
FightId INT REFERENCES Fight(FightId),
FighterId INT REFERENCES FIGHTERS(FighterId),
Votes INT,
PRIMARY KEY (FightId,FighterId)
)
But this does add quite possibly unnecessary complexity to your queries.
You may also want to prevent multiple votes on the same contest by the same user. In that case you might use something like (assuming two fighters per contest again)
CREATE TABLE Fights
(
FightId INT PRIMARY KEY,
Fighter1Id INT REFERENCES FIGHTERS(FighterId),
Fighter2Id INT REFERENCES FIGHTERS(FighterId),
CHECK (Fighter1Id < Fighter2Id )
)
CREATE TABLE Votes
(
FightId INT REFERENCES Fights(FightId),
UserId INT REFERENCES Users(UserId),
Vote INT CHECK (Vote IN (1,2)),
PRIMARY KEY (FightId,UserId)
)
You might still keep denormalised vote totals around for performance reasons, though.
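A small sketch of the per-user variant, again on SQLite via Python's sqlite3 as a stand-in. The cast_vote and tally helpers and the sample ids are hypothetical, and the references to FIGHTERS and Users are omitted for brevity. The composite primary key is what rejects a second vote by the same user, and the totals are computed on demand:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # off by default in SQLite
conn.executescript("""
CREATE TABLE Fights (
    FightId    INTEGER PRIMARY KEY,
    Fighter1Id INTEGER NOT NULL,
    Fighter2Id INTEGER NOT NULL,
    CHECK (Fighter1Id < Fighter2Id)
);
CREATE TABLE Votes (
    FightId INTEGER NOT NULL REFERENCES Fights(FightId),
    UserId  INTEGER NOT NULL,
    Vote    INTEGER NOT NULL CHECK (Vote IN (1, 2)),
    PRIMARY KEY (FightId, UserId)
);
-- Sample fight between fighters 1 and 4.
INSERT INTO Fights (FightId, Fighter1Id, Fighter2Id) VALUES (1, 1, 4);
""")


def cast_vote(fight_id, user_id, vote):
    """Record a vote; return False if a constraint rejects it."""
    try:
        with conn:
            conn.execute(
                "INSERT INTO Votes (FightId, UserId, Vote) VALUES (?, ?, ?)",
                (fight_id, user_id, vote),
            )
        return True
    except sqlite3.IntegrityError:
        return False


def tally(fight_id):
    """Return (votes for fighter 1, votes for fighter 2) for one fight."""
    row = conn.execute(
        "SELECT COALESCE(SUM(Vote = 1), 0), COALESCE(SUM(Vote = 2), 0) "
        "FROM Votes WHERE FightId = ?",
        (fight_id,),
    ).fetchone()
    return tuple(row)


first = cast_vote(1, 7, 1)
repeat = cast_vote(1, 7, 2)   # same user, same fight
cast_vote(1, 8, 2)
cast_vote(1, 9, 1)
print(first, repeat, tally(1))
```

If counting on every page view turns out to be too slow, that is the point where a denormalised total per fight would come in.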

The solution is to create 2 tables:
Fighters with FighterId (primary key) and all the other data.
FightResult: FightResultId (primary key), FighterId1, FighterId2, FightResult. The two columns FighterIdX are foreign keys to Fighter.
This will make it easy to query and add votes and will keep it simple and easy to understand.
You can also add info like which user voted for a fight (foreign keys to users) to the second table if you like.

Identity column separate from composite primary key

I have a table representing soccer matches:
Date
Opponent
I feel {Date,Opponent} is the primary key because in this table there can never be more than one opponent per date. The problem is that when I create foreign key constraints in other tables, I have to include both Date and Opponent columns in the other tables:
Soccer game statistics table:
Date
Opponent
Event (Goal scored, yellow card etc)
Ideally I would like to have:
Soccer matches table:
ID
Date
Opponent
Soccer match statistics table:
SoccerMatchID
Event (Goal scored, yellow card etc)
where SoccerMatch.ID is a unique ID (but not the primary key) and {Date,Opponent} is still the primary key.
The problem is SQL Server doesn't seem to let me define ID as being a unique identity whilst {Date,Opponent} is the primary key. When I go to the properties for ID, the part signalling unique identifying is grayed-out with "No".
(I assume everyone agrees I should try to achieve the above as it's a better design?)

I think most people don't use the graphical designer to do this, as it's the graphical designer that's preventing it, not SQL Server. Try running DDL in a query window:
ALTER TABLE dbo.YourTable ADD ID INT IDENTITY(1,1);
GO
CREATE UNIQUE INDEX yt_id ON dbo.YourTable(ID);
GO
Now you can reference this column in other tables no problem:
CREATE TABLE dbo.SomeOtherTable
(
MatchID INT FOREIGN KEY REFERENCES dbo.YourTable(ID)
);
That said, I find the column name ID completely useless. If it's a MatchID, why not call it MatchID everywhere it appears in the schema? Yes it's redundant in the PK table but IMHO consistency throughout the model is more important.
For that matter, why is your table called SoccerMatch? Do you have other kinds of matches? I would think it would be Matches with a unique ID = MatchID. That way if you later have different types of matches you don't have to create a new table for each sport - just add a type column of some sort. If you only ever have soccer, then SoccerMatch is kind of redundant, no?
Also I would suggest that the key and unique index be the other way around. If you're not planning to use the multi-column key for external reference then it is more intuitive, at least to me, to make the PK the thing you do reference in other tables. So I would say:
CREATE TABLE dbo.Matches
(
MatchID INT IDENTITY(1,1),
EventDate DATE, -- Date is also a terrible name and it's reserved
Opponent <? data type ?> / FK reference?
);
ALTER TABLE dbo.Matches ADD CONSTRAINT PK_Matches
PRIMARY KEY (MatchID);
ALTER TABLE dbo.Matches ADD CONSTRAINT UQ_Date_Opponent
UNIQUE (EventDate, Opponent);
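To show the effect of this layout, here is a minimal sketch on SQLite via Python's sqlite3 (a stand-in for SQL Server, so IDENTITY becomes INTEGER PRIMARY KEY). The MatchEvents table name, the TEXT type for Opponent and the sample values are assumptions made for the example:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # off by default in SQLite
conn.executescript("""
CREATE TABLE Matches (
    MatchID   INTEGER PRIMARY KEY,      -- surrogate key, referenced elsewhere
    EventDate TEXT NOT NULL,
    Opponent  TEXT NOT NULL,            -- assumed type for the sketch
    UNIQUE (EventDate, Opponent)        -- the natural key is still enforced
);
-- Hypothetical statistics table: it carries only the single-column key.
CREATE TABLE MatchEvents (
    MatchEventID INTEGER PRIMARY KEY,
    MatchID      INTEGER NOT NULL REFERENCES Matches(MatchID),
    Event        TEXT NOT NULL
);
""")


def add_match(event_date, opponent):
    """Insert a match and return its generated MatchID."""
    with conn:
        cur = conn.execute(
            "INSERT INTO Matches (EventDate, Opponent) VALUES (?, ?)",
            (event_date, opponent),
        )
    return cur.lastrowid


def add_event(match_id, event):
    """Attach one event (goal, card, ...) to a match."""
    with conn:
        conn.execute(
            "INSERT INTO MatchEvents (MatchID, Event) VALUES (?, ?)",
            (match_id, event),
        )


match_id = add_match("2024-05-01", "Rovers")   # made-up sample data
add_event(match_id, "Goal scored")
add_event(match_id, "Yellow card")

try:
    add_match("2024-05-01", "Rovers")          # same date and opponent again
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

events = [
    event
    for (event,) in conn.execute(
        "SELECT Event FROM MatchEvents WHERE MatchID = ? ORDER BY MatchEventID",
        (match_id,),
    )
]
print(duplicate_rejected, events)
```

The child table never needs the Date and Opponent columns, yet two matches against the same opponent on the same date are still impossible.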

Generic Database table design

Just trying to figure out the best way to design my table for the following scenario:
I have several areas in my system (documents, projects, groups and clients) and each of these can have comments logged against them.
My question is should I have one table like this:
CommentID
DocumentID
ProjectID
GroupID
ClientID
etc
Where only one of the ids will have data and the rest will be NULL or should I have a separate CommentType table and have my comments table like this:
CommentID
CommentTypeID
ResourceID (this being the id of the project/doc/client)
etc
My thoughts are that option 2 would be more efficient from an indexing point of view. Is this correct?

Option 2 is not a good solution for a relational database. It's called polymorphic associations (as mentioned by @Daniel Vassallo) and it breaks the fundamental definition of a relation.
For example, suppose you have a ResourceId of 1234 on two different rows. Do these represent the same resource? It depends on whether the CommentTypeId is the same on these two rows. This violates the concept of a type in a relation. See SQL and Relational Theory by C. J. Date for more details.
Another clue that it's a broken design is that you can't declare a foreign key constraint for ResourceId, because it could point to any of several tables. If you try to enforce referential integrity using triggers or something, you find yourself rewriting the trigger every time you add a new type of commentable resource.
I would solve this with the solution that @mdma briefly mentions (but then ignores):
CREATE TABLE Commentable (
ResourceId INT NOT NULL IDENTITY,
ResourceType INT NOT NULL,
PRIMARY KEY (ResourceId, ResourceType)
);
CREATE TABLE Documents (
ResourceId INT NOT NULL,
ResourceType INT NOT NULL CHECK (ResourceType = 1),
FOREIGN KEY (ResourceId, ResourceType) REFERENCES Commentable
);
CREATE TABLE Projects (
ResourceId INT NOT NULL,
ResourceType INT NOT NULL CHECK (ResourceType = 2),
FOREIGN KEY (ResourceId, ResourceType) REFERENCES Commentable
);
Now each resource type has its own table, but the serial primary key is allocated uniquely by Commentable. A given primary key value can be used only by one resource type.
CREATE TABLE Comments (
CommentId INT IDENTITY PRIMARY KEY,
ResourceId INT NOT NULL,
ResourceType INT NOT NULL,
FOREIGN KEY (ResourceId, ResourceType) REFERENCES Commentable
);
Now Comments reference Commentable resources, with referential integrity enforced. A given comment can reference only one resource type. There's no possibility of anomalies or conflicting resource ids.
I cover more about polymorphic associations in my presentation Practical Object-Oriented Models in SQL and my book SQL Antipatterns.
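Here is a rough, runnable sketch of the same idea on SQLite via Python's sqlite3, as a stand-in for SQL Server. SQLite only auto-allocates ids for a single-column INTEGER PRIMARY KEY, so in this adaptation the pair (ResourceId, ResourceType) is declared UNIQUE instead of being the primary key; the helper functions are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # off by default in SQLite
conn.executescript("""
CREATE TABLE Commentable (
    ResourceId   INTEGER PRIMARY KEY,        -- allocated here, like IDENTITY
    ResourceType INTEGER NOT NULL,
    UNIQUE (ResourceId, ResourceType)        -- target of the composite FKs
);
CREATE TABLE Documents (
    ResourceId   INTEGER NOT NULL,
    ResourceType INTEGER NOT NULL CHECK (ResourceType = 1),
    FOREIGN KEY (ResourceId, ResourceType)
        REFERENCES Commentable (ResourceId, ResourceType)
);
CREATE TABLE Projects (
    ResourceId   INTEGER NOT NULL,
    ResourceType INTEGER NOT NULL CHECK (ResourceType = 2),
    FOREIGN KEY (ResourceId, ResourceType)
        REFERENCES Commentable (ResourceId, ResourceType)
);
CREATE TABLE Comments (
    CommentId    INTEGER PRIMARY KEY,
    ResourceId   INTEGER NOT NULL,
    ResourceType INTEGER NOT NULL,
    FOREIGN KEY (ResourceId, ResourceType)
        REFERENCES Commentable (ResourceId, ResourceType)
);
""")


def new_document():
    """Allocate a ResourceId in Commentable, then create the document row."""
    with conn:
        resource_id = conn.execute(
            "INSERT INTO Commentable (ResourceType) VALUES (1)"
        ).lastrowid
        conn.execute(
            "INSERT INTO Documents (ResourceId, ResourceType) VALUES (?, 1)",
            (resource_id,),
        )
    return resource_id


def try_insert(sql, params):
    """Run an insert; return False if a constraint rejects it."""
    try:
        with conn:
            conn.execute(sql, params)
        return True
    except sqlite3.IntegrityError:
        return False


doc_id = new_document()
comment_ok = try_insert(
    "INSERT INTO Comments (ResourceId, ResourceType) VALUES (?, 1)", (doc_id,)
)
# A project trying to reuse the document's id fails: Commentable holds
# (doc_id, 1), not (doc_id, 2).
project_reuse_ok = try_insert(
    "INSERT INTO Projects (ResourceId, ResourceType) VALUES (?, 2)", (doc_id,)
)
print(comment_ok, project_reuse_ok)
```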

Read up on database normalization.
Nulls in the way you describe would be a big indication that the database isn't designed properly.
You need to split up your tables so that the data held in them is fully normalized. This is guaranteed to save you a lot of time further down the line, and it's a good habit to get into.

From a foreign key perspective, the first example is better because you can have multiple foreign key constraints on a column but the data has to exist in all those references. It's also more flexible if the business rules change.

To continue from @OMG Ponies' answer, what you describe in the second example is called a Polymorphic Association, where the foreign key ResourceID may reference rows in more than one table. However in SQL databases, a foreign key constraint can only reference exactly one table. The database cannot enforce the foreign key according to the value in CommentTypeID.
You may be interested in checking out the following Stack Overflow post for one solution to tackle this problem:
MySQL - Conditional Foreign Key Constraints

The first approach is not great, since it is quite denormalized. Each time you add a new entity type, you need to update the table. You may be better off making this an attribute of document, i.e. store the comment inline in the document table.
For the ResourceID approach to work with referential integrity, you will need to have a Resource table, and a ResourceID foreign key in all of your Document, Project etc. entities (or use a mapping table). Making "ResourceID" a jack-of-all-trades that can be a documentID, projectID etc. is not a good solution, since it cannot be used for sensible indexing or foreign key constraints.
To normalize, you need to split the comment table into one table per resource type.
Comment
-------
CommentID
CommentText
...etc
DocumentComment
---------------
DocumentID
CommentID
ProjectComment
--------------
ProjectID
CommentID
If only one comment is allowed, then you add a unique constraint on the foreign key for the entity (DocumentID, ProjectID etc.) This ensures that there can only be one row for the given item and so only one comment. You can also ensure that comments are not shared by using a unique constraint on CommentID.
EDIT: Interestingly, this is almost parallel to the normalized implementation of ResourceID - replace "Comment" in the table name, with "Resource" and change "CommentID" to "ResourceID" and you have the structure needed to associate a ResourceID with each resource. You can then use a single table "ResourceComment".
If there are going to be other entities that are associated with any type of resource (e.g. audit details, access rights, etc.), then using the resource mapping tables is the way to go, since it will allow you to add normalized comments and any other resource related entities.
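As a sketch of the mapping-table layout with the "comments are not shared" rule, here is a version on SQLite via Python's sqlite3 (a stand-in for whatever database you use). The Document table, the attach helper and the sample rows are assumptions for the example; only the document side is shown, and ProjectComment would look the same:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # off by default in SQLite
conn.executescript("""
CREATE TABLE Comment (
    CommentID   INTEGER PRIMARY KEY,
    CommentText TEXT NOT NULL
);
CREATE TABLE Document (DocumentID INTEGER PRIMARY KEY);
CREATE TABLE DocumentComment (
    DocumentID INTEGER NOT NULL REFERENCES Document(DocumentID),
    -- UNIQUE here means a comment belongs to at most one document.
    CommentID  INTEGER NOT NULL UNIQUE REFERENCES Comment(CommentID),
    PRIMARY KEY (DocumentID, CommentID)
);
-- Made-up sample rows.
INSERT INTO Document (DocumentID) VALUES (1), (2);
INSERT INTO Comment (CommentID, CommentText) VALUES (10, 'first'), (11, 'second');
""")


def attach(document_id, comment_id):
    """Link a comment to a document; return False if a constraint rejects it."""
    try:
        with conn:
            conn.execute(
                "INSERT INTO DocumentComment (DocumentID, CommentID) VALUES (?, ?)",
                (document_id, comment_id),
            )
        return True
    except sqlite3.IntegrityError:
        return False


first = attach(1, 10)
second = attach(1, 11)    # a document may have several comments
shared = attach(2, 10)    # but comment 10 is already taken
print(first, second, shared)
```

Moving the UNIQUE constraint from CommentID to DocumentID would instead give the "only one comment per document" rule.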

I wouldn't go with either of those solutions. Depending on some of the specifics of your requirements you could go with a super-type table:
CREATE TABLE Commentable_Items (
commentable_item_id INT NOT NULL,
CONSTRAINT PK_Commentable_Items PRIMARY KEY CLUSTERED (commentable_item_id))
GO
CREATE TABLE Projects (
commentable_item_id INT NOT NULL,
... (other project columns)
CONSTRAINT PK_Projects PRIMARY KEY CLUSTERED (commentable_item_id))
GO
CREATE TABLE Documents (
commentable_item_id INT NOT NULL,
... (other document columns)
CONSTRAINT PK_Documents PRIMARY KEY CLUSTERED (commentable_item_id))
GO
If each item can only have one comment and comments are not shared (i.e. a comment can only belong to one entity) then you could just put the comments in the Commentable_Items table. Otherwise you could link the comments off of that table with a foreign key.
I don't like this approach very much in your specific case though, because "having comments" isn't enough to put items together like that in my mind.
I would probably go with separate Comments tables (assuming that you can have multiple comments per item - otherwise just put them in your base tables). If a comment can be shared between multiple entity types (i.e., a document and a project can share the same comment) then have a central Comments table and multiple entity-comment relationship tables:
CREATE TABLE Comments (
comment_id INT NOT NULL,
comment_text NVARCHAR(MAX) NOT NULL,
CONSTRAINT PK_Comments PRIMARY KEY CLUSTERED (comment_id))
GO
CREATE TABLE Document_Comments (
document_id INT NOT NULL,
comment_id INT NOT NULL,
CONSTRAINT PK_Document_Comments PRIMARY KEY CLUSTERED (document_id, comment_id))
GO
CREATE TABLE Project_Comments (
project_id INT NOT NULL,
comment_id INT NOT NULL,
CONSTRAINT PK_Project_Comments PRIMARY KEY CLUSTERED (project_id, comment_id))
GO
If you want to constrain comments to a single document (for example) then you could add a unique index (or change the primary key) on the comment_id within that linking table.
It's all of these "little" decisions that will affect the specific PKs and FKs. I like this approach because each table is clear on what it is. In databases that's usually better than having "generic" tables/solutions.

Of the options you give, I would go for number 2.

Option 2 is a good way to go. The issue that I see with that is you are putting the resource key on that table. Each of the IDs from the different resources could be duplicated. When you join resources to the comments you will more than likely come up with comments that do not belong to that particular resource. This would be considered a many to many join. I would think a better option would be to have your resource tables, the comments table, and then tables that cross reference the resource type and the comments table.

If you carry the same sort of data about all comments regardless of what they are comments about, I'd vote against creating multiple comment tables. Maybe a comment is just "thing it's about" and text, but if you don't have other data now, it's likely you will: date the comment was entered, user id of person who made it, etc. With multiple tables, you have to repeat all these column definitions for each table.
As noted, using a single reference field means that you could not put a foreign key constraint on it. This is too bad, but it doesn't break anything; it just means you have to do the validation with a trigger or in code. More seriously, joins get difficult. You can't just say "from comment join document using (documentid)". You need a complex join based on the value of the type field.
So while the multiple pointer fields are ugly, I tend to think that's the right way to go. I know some db people say there should never be a null field in a table, that you should always break it off into another table to prevent that from happening, but I fail to see any real advantage to following this rule.
Personally I'd be open to hearing further discussion on pros and cons.
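For what it's worth, here is a minimal sketch of the multiple-pointer layout with one guard added: a CHECK constraint requiring exactly one of the pointer columns to be filled in. The CHECK is my own addition for illustration, not something from the question. It runs on SQLite via Python's sqlite3 as a stand-in, where a comparison evaluates to 0 or 1 so the terms can simply be added; other databases would need CASE expressions. Only two of the four resource types are shown:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # off by default in SQLite
conn.executescript("""
CREATE TABLE Documents (DocumentID INTEGER PRIMARY KEY);
CREATE TABLE Projects  (ProjectID  INTEGER PRIMARY KEY);
CREATE TABLE Comments (
    CommentID   INTEGER PRIMARY KEY,
    DocumentID  INTEGER REFERENCES Documents(DocumentID),
    ProjectID   INTEGER REFERENCES Projects(ProjectID),
    CommentText TEXT NOT NULL,
    -- Assumed extra rule: exactly one pointer column is non-NULL.
    CHECK ((DocumentID IS NOT NULL) + (ProjectID IS NOT NULL) = 1)
);
-- Made-up sample rows.
INSERT INTO Documents (DocumentID) VALUES (1);
INSERT INTO Projects  (ProjectID)  VALUES (1);
""")


def add_comment(text, document_id=None, project_id=None):
    """Insert a comment; return False if a constraint rejects it."""
    try:
        with conn:
            conn.execute(
                "INSERT INTO Comments (DocumentID, ProjectID, CommentText) "
                "VALUES (?, ?, ?)",
                (document_id, project_id, text),
            )
        return True
    except sqlite3.IntegrityError:
        return False


on_doc = add_comment("on doc", document_id=1)
on_project = add_comment("on project", project_id=1)
orphan = add_comment("orphan")
both = add_comment("both", document_id=1, project_id=1)

# The simple join mentioned above works, with real foreign keys behind it.
doc_comments = [
    text
    for (text,) in conn.execute(
        "SELECT CommentText FROM Comments JOIN Documents USING (DocumentID)"
    )
]
print(on_doc, on_project, orphan, both, doc_comments)
```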

Pawnshop Application:
I have separate tables for Loan, Purchase, Inventory & Sales transactions.
Each table's rows are joined to their respective customer rows by:
customer.pk [serial] = loan.fk [integer];
= purchase.fk [integer];
= inventory.fk [integer];
= sale.fk [integer];
I have consolidated the four tables into one table called "transaction", where a column:
transaction.trx_type char(1) {L=Loan, P=Purchase, I=Inventory, S=Sale}
Scenario:
A customer initially pawns merchandise, makes a couple of interest payments, then decides he wants to sell the merchandise to the pawnshop, who then places merchandise in Inventory and eventually sells it to another customer.
I designed a generic transaction table where for example:
transaction.main_amount DECIMAL(7,2)
in a loan transaction holds the pawn amount,
in a purchase holds the purchase price,
in inventory and sale holds sale price.
This is clearly a denormalized design, but it has made programming a lot easier and improved performance. Any type of transaction can now be performed from within one screen, without the need to change to different tables.
