database performance around storing and querying bi-directional relationships

I'm looking to determine whether it is better from a performance and coding perspective to store two associated database records as a single row (and search both columns for a specific record since the value could be in either place) or create a second row for that association and only search one column.
An example will help hopefully:
UserTable
userID INTEGER,
firstName VARCHAR2(20),
lastName VARCHAR2(20)
2 rows:
1, John, Smith
2, Terry, Jenkins
Second table (to track relationship between the two)
RelationshipTable
relationshipID INTEGER,
userID1 INTEGER,
userID2 INTEGER
Now, to store a relationship between John and Terry, I could do:
Option1 (1 row):
relationshipID, userID1, userID2
1, 1, 2
Then, to look for any relationship that Terry is a part of, I would have to do something like:
SELECT *
FROM RelationshipTable
WHERE userID1 = [terrysID] OR userID2 = [terrysID]
Or I could go with 2 rows, inserting each ID in the association into a specific column.
Option2 (2 rows):
relationshipID, userID1, userID2
1, 1, 2
2, 2, 1
and find any relationships that Terry is a part of with:
SELECT *
FROM RelationshipTable
WHERE userID1 = [terrysID]
I'm not sure which is better.
I could set up indexes on both columns, which would help with the first option. However, I would still have to post-process the results to determine which column in the result set holds the ID that is not Terry's. And I think the coding is a bit messier, since I'd have to repeat that logic in multiple places.
On the other hand, the second approach effectively doubles the amount of data and, even scarier, duplicates data without adding any real "business value". So if that relationship ever ended, I would have to ensure I deleted both records (or soft-deleted them, or whatever we chose to do).
I never know whether I will be searching for John's relationships or Terry's relationships, so I cannot intelligently insert either ID into a specific column at the time of relationship creation.
Thoughts? Might there be a third option that I haven't thought of that is better? Something like creating a view on the table that produces the two rows for querying without actually duplicating the data? Obviously that would create additional overhead on the system.
Edit:
This looks like a similar question, but I am not sure any answer accurately satisfies what I am looking for.
Two way relationships in SQL queries
Thanks!

In terms of clarity and ease of use, I'd go with option 1. Its drawback is that a bug could allow 1 to relate to 2 and also 2 to relate to 1, which would be redundant. However, that would be up to the front end to stop (you can't do everything in the DB).
Your postprocessing can be totally avoided by not using the simple select you gave, but by using this:
SELECT relationshipID, userID1, userID2
FROM RelationshipTable
WHERE userID1 = [terrysID]
UNION ALL
SELECT relationshipID, userID2, userID1
FROM RelationshipTable
WHERE userID2 = [terrysID]
This means that [terrysID] will always be the first of the pair. If you have indexes on both columns, it should be pretty efficient too.
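As a rough, runnable sketch of that query, here it is driven from Python's built-in sqlite3 module rather than the asker's actual database. The index names and the third user are assumptions made up for the example; only the table and column names come from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE RelationshipTable (
    relationshipID INTEGER PRIMARY KEY,
    userID1 INTEGER NOT NULL,
    userID2 INTEGER NOT NULL
);
-- Hypothetical index names; one index per searched column.
CREATE INDEX ix_rel_user1 ON RelationshipTable (userID1);
CREATE INDEX ix_rel_user2 ON RelationshipTable (userID2);
-- Hypothetical sample data: user 3 is invented for this sketch.
INSERT INTO RelationshipTable VALUES (1, 1, 2), (2, 2, 3), (3, 3, 1);
""")

def relationships_for(user_id):
    """Return (relationshipID, user_id, other_user_id) rows for one user."""
    sql = """
        SELECT relationshipID, userID1 AS me, userID2 AS other
        FROM RelationshipTable WHERE userID1 = ?
        UNION ALL
        SELECT relationshipID, userID2, userID1
        FROM RelationshipTable WHERE userID2 = ?
        ORDER BY relationshipID
    """
    return conn.execute(sql, (user_id, user_id)).fetchall()

terry_rows = relationships_for(2)  # Terry is userID 2 in the question
print(terry_rows)
```

Whichever column Terry's ID was stored in, it comes back in the second position of every row, so the caller never has to work out which side is "the other user".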

Related

How to force ID column to remain sequential even if a record has been deleted, in SQL Server?

I don't know the best way to word this question, but I have a table with two columns: ID and NAME.
When I delete a record from the table, its ID is deleted with it and the sequence is broken.
Take this example:
If I delete row number 2, the sequence of the ID column will be: 1, 3, 4
How do I make it: 1, 2, 3?

IDs are meant to be unique for a reason. Consider this scenario:
**Customers**
id value
1 John
2 Jackie
**Accounts**
id customer_id balance
1 1 $500
2 2 $1000
In the case of a relational database, say you were to delete "John" from the database. If the IDs were then renumbered to stay sequential, Jackie would take on the customer_id of 1. When Jackie goes in to check her balance, it will now show $500, which is $500 short.
Granted, you could go through and update all of her other records, but A) this would be a massive pain in the ass, and B) it would be very easy to make mistakes, especially in a large database.
IDs (primary keys in this case) are meant to be the rock that holds your relational database together, and you should always be able to rely on that value regardless of the table.
As JohnFx pointed out, should you want a value that shows the order of the user, consider using a built in function when querying.

In SQL Server identity columns are not guaranteed to be sequential. You can use the ROW_NUMBER function to generate a sequential list of ids when you query the data from the database:
SELECT
ROW_NUMBER() OVER (ORDER BY Id) AS SequentialId,
Id As UniqueId,
Name
FROM dbo.Details
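To see the effect without a SQL Server instance, here is a small sketch of the same query run through Python's sqlite3 module. It assumes a SQLite build with window-function support (3.25 or later), and the sample rows are made up; the stored Id keeps its gap while the displayed number stays sequential.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Details (Id INTEGER PRIMARY KEY, Name TEXT NOT NULL);
-- Hypothetical sample rows.
INSERT INTO Details (Id, Name) VALUES (1, 'a'), (2, 'b'), (3, 'c'), (4, 'd');
DELETE FROM Details WHERE Id = 2;  -- leaves a gap: 1, 3, 4
""")

rows = conn.execute("""
    SELECT ROW_NUMBER() OVER (ORDER BY Id) AS SequentialId,
           Id AS UniqueId,
           Name
    FROM Details
    ORDER BY Id
""").fetchall()
print(rows)
```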

If you want sequential numbers don't store them in the database. That is just a maintenance nightmare, and I really can't think of a very good reason you'd even want to bother.
Just generate them dynamically using T-SQL's ROW_NUMBER function when you query the data.
The whole point of an Identity column is creating a reliable identifier that you can count on pointing to that row in the DB. If you shift them around you undermine the main reason you WANT an ID.
In a real world example, how would you feel if the IRS wanted to change your SSN every week so they could keep the Social Security Numbers sequential after people died off?

One lookup table or many lookup tables? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
I need to save basic member data with additional attributes such as gender, education, profession, marital_status, height, residency_status etc.
I have around 15-18 lookup tables, all having (id, name, value); all attributes have string values.
Shall I create a members table, tbl_members, and 15-18 separate lookup tables, one for each of the above attributes:
tbl_members:
mem_Id
mem_email
mem_password
Gender_Id
education_Id
profession_id
marital_status_Id
height_Id
residency_status_Id
or shall I create only one lookup table tbl_Attributes and tbl_Attribute_Types?
tbl_Attributes:
att_Id
att_Value
att_Type_Id
Example data:
001 - Male - 001
002 - Female - 001
003 - Graduate - 002
004 - Masters - 002
005 - Engineer - 003
006 - Designer - 003
tbl_Attribute_Types:
att_type_Id
att_type_name
Example data:
001 - Gender
002 - Education
003 - Profession
To fill look-up drop-downs I can select something like:
SELECT A.att_id, A.att_value, AT.att_type_name
FROM tbl_Attributes A
INNER JOIN tbl_Attribute_Types AT ON AT.att_type_Id = A.att_type_Id
WHERE A.att_Type_Id = @att_Type_Id
and an additional table tbl_mem_att_value to save member's attributes and values
tbl_mem_att_value:
mem_id
att_id
Example data for member_id 001, who is Male, Masters, Engineer:
001 - 001
001 - 004
001 - 005
So my question is shall I go for one lookup table or many lookup tables?
Thanks

Never use one lookup table for everything. It will make it more difficult to find things, and it will need to be joined in every query, probably multiple times, which means it may cause locking and blocking problems. Further, in one table you can't use good design to make sure the data type for the descriptor is correct. For instance, suppose you wanted a lookup of the state abbreviations, which are two characters. If you use a one-size-fits-all table, then it has to be wide enough for the largest possible value of any lookup, and you lose the possibility of it rejecting an incorrect entry because it is too long. This is a guarantee of later data integrity issues.
Further, you can't properly use foreign keys to make sure data entry is limited only to the correct values. This will also cause data integrity issues.
There is NO BENEFIT whatsoever to using one table except a few minutes of dev time (possibly the least important concern in designing a database). There are plenty of negatives.

The primary reason for using multiple lookup tables is that you can then enforce foreign key constraints. This is quite important for maintaining relational integrity.
The primary reason for using a single lookup table is so you have all the string values in one place. This can be very useful for internationalization of the software.
In general, I would go with separate reference tables, because relational integrity is generally a more important concern than internationalization.
There are secondary considerations. Many different reference tables are going to occupy more space than a single reference table -- with most of the pages being empty (how much space do you really need to store the gender lookup information?). However, with a relatively small number of reference tables, this is actually a pretty minor concern.
Another consideration in using a single table is that all the reference keys will have different values. This is useful because it can prevent unlikely joins. However, I prevent this problem by naming join keys the same, both for the primary key and the foreign key. So, GenderId would be the primary key in Gender as well as the foreign key column.
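A minimal sketch of the foreign-key point, using Python's sqlite3 module as a stand-in for whatever engine the asker uses. The Gender and Education tables and their columns are assumptions following the naming convention just described; tbl_members is trimmed to a few columns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK checks off by default
conn.executescript("""
-- Hypothetical lookup tables, one per attribute.
CREATE TABLE Gender (GenderId INTEGER PRIMARY KEY, Name TEXT NOT NULL UNIQUE);
CREATE TABLE Education (EducationId INTEGER PRIMARY KEY, Name TEXT NOT NULL UNIQUE);
CREATE TABLE tbl_members (
    mem_Id INTEGER PRIMARY KEY,
    mem_email TEXT NOT NULL,
    GenderId INTEGER NOT NULL REFERENCES Gender (GenderId),
    EducationId INTEGER NOT NULL REFERENCES Education (EducationId)
);
INSERT INTO Gender VALUES (1, 'Male'), (2, 'Female');
INSERT INTO Education VALUES (1, 'Graduate'), (2, 'Masters');
""")

# A row that points at existing lookup values is accepted.
conn.execute("INSERT INTO tbl_members VALUES (1, 'a@example.com', 1, 2)")

try:
    # GenderId 9 does not exist in Gender, so the constraint rejects the row.
    conn.execute("INSERT INTO tbl_members VALUES (2, 'b@example.com', 9, 1)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

member_count = conn.execute("SELECT COUNT(*) FROM tbl_members").fetchone()[0]
print(rejected, member_count)
```

With a single shared lookup table, a plain foreign key on GenderId could only say "this is some lookup value", not "this is a gender".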

I've struggled with the same question myself. If the only thing in the lookup table is some sort of code or id and a text value, then it certainly works to just add "attribute id" and throw it all in one table. The obvious advantage is that you then have only one table to create and manage. Searches might possibly be slower because there are more records to search, but presumably you create an index on attribute id + value id. At that point, whether performance is better with one big table or ten small tables probably depends on all sorts of details about how the database engine works and the pattern of access. That's a case where I'd say: unless in practice it proves to be a problem, don't worry about it.
Two caveats:
One: If you do create a single table, I'd create a code for the attribute name, and then another table to list the codes. Like:
lookup_attribute(attribute_id, attribute_name)
lookup_value(attribute_id, value_id, value_text)
Then the first table has records like
1, 'Gender'
2, 'Marital Status'
3, 'Education'
etc
And the second is
1, 1, 'Male'
1, 2, 'Female'
1, 3, 'Undecided'
2, 1, 'Single'
2, 2, 'Married'
2, 3, 'Divorced'
2, 4, 'Widowed'
3, 1, 'High School'
3, 2, 'Associates'
3, 3, 'Bachelors'
3, 4, 'Masters'
3, 5, 'Doctorate'
3, 6, 'Other'
etc.
(The value id could be unique for all attribute ids or it might only be unique within the attribute id, whatever works for you. It shouldn't matter.)
Two: If there is other data you need to store for some attribute besides just the text of a value, then break that out into a separate table. Like if you had an attribute for, say, "Membership Level", and then the user says that there are different dues for each level and you need to record this, then you have an extra field that applies only to this one attribute. At that point it should become its own table. I've seen systems where they have a couple of pieces of extra data for each of several attributes, and they create a field called "extra data" or some such, and for "membership level" it holds annual dues and for "store name" it holds the city where the store is and for "item number" it holds the number of units on hand of that item, etc, and the system quickly becomes a nightmare to manage.
Update
To retrieve values, let's suppose we have just gender and marital status as lookups. The principle is the same for any others.
So we have the monster lookup table as described above. Then we have the member table with, say
member (member_id, name, member_number, whatever, gender_id, marital_status_id)
To retrieve you just write
select m.member_id, m.name, m.member_number, m.whatever,
g.value_text as gender, ms.value_text as marital_status
from member m
join lookup_value g on g.attribute_id=1 and g.value_id=m.gender_id
join lookup_value ms on ms.attribute_id=2 and ms.value_id=m.marital_status_id
where m.member_id=@member_id
You could, alternatively, have:
member (member_id, name, member_number, whatever)
member_attribute (member_id, attribute_id, value_id)
Then you can get all the attributes with
select a.attribute_name, v.value_text
from member_attribute ma
join lookup_attribute a on a.attribute_id=ma.attribute_id
join lookup_value v on v.attribute_id=a.attribute_id and v.value_id=ma.value_id
where ma.member_id=@member_id
It occurs to me as I try to write the queries that there's a distinct advantage to making the value id globally unique: Not only does that eliminate having to specify the attribute id in the join, but it also means that if you do have a field for, say, gender_id, you can still have a foreign key clause on it.
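For anyone who wants to try the first retrieval query, here is a sketch using Python's sqlite3 module. It assumes value ids are unique only within an attribute (so the join must name the attribute id), and the member row is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE lookup_attribute (
    attribute_id INTEGER PRIMARY KEY,
    attribute_name TEXT NOT NULL
);
CREATE TABLE lookup_value (
    attribute_id INTEGER NOT NULL REFERENCES lookup_attribute (attribute_id),
    value_id INTEGER NOT NULL,
    value_text TEXT NOT NULL,
    PRIMARY KEY (attribute_id, value_id)
);
CREATE TABLE member (
    member_id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    gender_id INTEGER NOT NULL,
    marital_status_id INTEGER NOT NULL
);
INSERT INTO lookup_attribute VALUES (1, 'Gender'), (2, 'Marital Status');
INSERT INTO lookup_value VALUES
    (1, 1, 'Male'), (1, 2, 'Female'),
    (2, 1, 'Single'), (2, 2, 'Married');
INSERT INTO member VALUES (1, 'Pat Example', 2, 1);  -- hypothetical member
""")

# The shared lookup table is joined once per attribute, under a different alias.
row = conn.execute("""
    SELECT m.member_id, m.name,
           g.value_text AS gender, ms.value_text AS marital_status
    FROM member m
    JOIN lookup_value g  ON g.attribute_id = 1  AND g.value_id = m.gender_id
    JOIN lookup_value ms ON ms.attribute_id = 2 AND ms.value_id = m.marital_status_id
    WHERE m.member_id = ?
""", (1,)).fetchone()
print(row)
```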

Putting all the lookup values into a single table is usually referred to as Common Lookup Tables, or Massively Unified Code-Key (MUCK), and is generally considered a design error.
A good argument for why it's not a good idea can be found in the article below.
https://www.red-gate.com/simple-talk/sql/database-administration/five-simple-database-design-errors-you-should-avoid/

Two way relationships in SQL queries

I have a small database that is used to track parts. For the sake of this example, the table looks like this:
PartID (PK), int
PartNumber, Varchar(50), Unique
Description, Varchar(255)
I have a requirement to define that certain parts are classified as similar to each other.
To do this I have set up a second table that looks like this:
PartID, (PK), int
SecondPartID, (PK), int
ReasonForSimilarity, Varchar(255)
Then a many-to-many relationship has been set up between the two tables.
The problem comes when I need to report on the parts that are considered similar, because the relationship is two-way, i.e. if part XYZ123 is similar to ABC678 then ABC678 is considered to be similar to XYZ123. So if I wanted to list all parts that are similar to a given part, I either need to ensure the relationship is set up in both directions (which is bad because data is duplicated) or need two queries that look at the table in both directions. Neither of these solutions feels right to me.
So, how should this problem be approached? Can this be solved with SQL alone or does my design need to change to accommodate the business requirement?
Consider the following parts: XYZ123, ABC123, ABC234, ABC345, ABC456 & EFG456, which have been entered into the existing structure described above. You could end up with data that looks like this (omitting the reason field, which is irrelevant at this point):
PartID, SecondPartID
XYZ123, ABC123
XYZ123, ABC234
XYZ123, ABC345
XYZ123, ABC456
EFG456, XYZ123
My user wants to know "Which parts are similar to XYZ123". This could be done using a query like so:
SELECT SecondPartID
FROM tblRelatedParts
WHERE PartID = 'XYZ123'
The problem with this, though, is that it will not pick out part EFG456, which is related to XYZ123 even though the parts were entered the other way round. It is feasible that this could happen depending on which part the user is currently working with, and the relationship between the parts will always be two-way.
The problem I have with this though is that I now need to check that when a user sets up a relationship between two parts it does not already exist in the other direction.
@Goran
I have done some initial tests using your suggestion, and this is how I plan to approach the problem.
The data listed above is entered into the new table. (Note that I have changed the PartID to a part number to make the example clearer; the semantics of my problem haven't changed, though.)
The table would look like this:
RelationshipID, PartNumber
1, XYZ123
1, ABC123
2, XYZ123
2, ABC234
3, XYZ123
3, ABC345
4, XYZ123
4, ABC456
5, EFG456
5, XYZ123
I can then retrieve a list of similar parts using a query like this:
SELECT PartNumber
FROM tblPartRelationships
WHERE RelationshipID = ANY (SELECT RelationshipID
FROM tblPartRelationships
WHERE PartNumber = 'XYZ123')
I'll carry out some more tests, and if this works I'll report back and accept the answer.

I've dealt with this issue by setting up a relationship table.
Part table:
PartID (PK), int
PartNumber, Varchar(50), Unique
Description, Varchar(255)
PartRelationship table:
RelationshipId (FK), int
PartID (FK), int
Relationship table:
RelationshipId (PK), int
Now similar parts simply get added to the PartRelationship table:
RelationshipId, PartId
1,1
1,2
Whenever you add another part with relationshipId = 1 it is considered similar to any part with relationshipId = 1.
Possible API solutions for adding relationships:
Create new relationship for each list of similar parts. Let client load, change and update the entire list whenever needed.
Retrieve relationship(s) for a similar object. Filter the list by some criteria so that only one remains or let client choose from existing relationships. Create, remove PartRelationship record as needed.
Retrieve list of relationships from Relationship table. Let client specify parts and relationships. Create, remove PartRelationship records as needed.
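Here is a rough sketch of how the three-table layout above could be queried, run through Python's sqlite3 module. The composite primary key on PartRelationship, the sample parts, and the similar_parts helper are assumptions added for the example, not part of the answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Part (PartID INTEGER PRIMARY KEY, PartNumber TEXT NOT NULL UNIQUE);
CREATE TABLE Relationship (RelationshipId INTEGER PRIMARY KEY);
CREATE TABLE PartRelationship (
    RelationshipId INTEGER NOT NULL REFERENCES Relationship (RelationshipId),
    PartID INTEGER NOT NULL REFERENCES Part (PartID),
    PRIMARY KEY (RelationshipId, PartID)
);
INSERT INTO Part VALUES (1, 'XYZ123'), (2, 'ABC123'), (3, 'EFG456'), (4, 'QRS999');
INSERT INTO Relationship VALUES (1);
-- XYZ123, ABC123 and EFG456 share group 1; QRS999 (hypothetical) is in no group.
INSERT INTO PartRelationship VALUES (1, 1), (1, 2), (1, 3);
""")

def similar_parts(part_number):
    """Part numbers sharing at least one relationship group with part_number."""
    sql = """
        SELECT DISTINCT other.PartNumber
        FROM Part p
        JOIN PartRelationship mine   ON mine.PartID = p.PartID
        JOIN PartRelationship theirs ON theirs.RelationshipId = mine.RelationshipId
                                    AND theirs.PartID <> mine.PartID
        JOIN Part other ON other.PartID = theirs.PartID
        WHERE p.PartNumber = ?
        ORDER BY other.PartNumber
    """
    return [r[0] for r in conn.execute(sql, (part_number,))]

print(similar_parts("XYZ123"))
print(similar_parts("EFG456"))
```

Note that grouping makes similarity transitive: every part in a group is similar to every other part in it, which may or may not match the business rule.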

Add a CHECK constraint e.g.
CHECK (PartID < SecondPartID);
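A small sketch of what that constraint buys you, using Python's sqlite3 module. It assumes integer part IDs and a composite primary key as in the question; the relate helper, which stores each pair in (smaller, larger) order, is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tblRelatedParts (
    PartID INTEGER NOT NULL,
    SecondPartID INTEGER NOT NULL,
    PRIMARY KEY (PartID, SecondPartID),
    CHECK (PartID < SecondPartID)
);
""")

def relate(a, b):
    """Store the pair in canonical (smaller, larger) order."""
    low, high = min(a, b), max(a, b)
    conn.execute("INSERT INTO tblRelatedParts VALUES (?, ?)", (low, high))

relate(7, 3)  # stored as (3, 7)

try:
    relate(3, 7)  # same pair, other direction: the primary key rejects it
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

try:
    # Bypassing relate() with the wrong order violates the CHECK constraint.
    conn.execute("INSERT INTO tblRelatedParts VALUES (7, 3)")
    wrong_order_rejected = False
except sqlite3.IntegrityError:
    wrong_order_rejected = True

stored = conn.execute("SELECT PartID, SecondPartID FROM tblRelatedParts").fetchall()
print(stored, duplicate_rejected, wrong_order_rejected)
```

The constraint only prevents the same pair being stored twice; reading "everything similar to X" still has to look in both columns.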

I know this is old, but why not just do this query with your original schema? Fewer tables and rows.
SELECT SecondPartID
FROM tblRelatedParts
WHERE PartID = 'XYZ123'
UNION
SELECT PartID
FROM tblRelatedParts
WHERE SecondPartID = 'XYZ123'
I am dealing with a similar issue and looking at the two approaches and wondering why you thought the schema with the relationship table was better. It seems like the original issue still exists in the sense that you still need to manage the relationships between them from both directions.

How about having two rows for each similarity? For example, if objects A and B are similar, your relation table will contain:
A B
B A
I know you will double your relation data, but they are integers, so it won't overload your database. In return, you get some gains:
You won't use UNION. UNION is overkill in any DBMS, especially when you have ORDER BY or GROUP BY.
You can implement a more specific relation: A is in relation with B, but B is not in relation with A. For example, John can replace Dave, but Dave cannot replace John.

Basic question: how to properly redesign this schema

I am hopping on a project that sits on top of a SQL Server 2008 DB with what seems to me like an inefficient schema. However, I'm not an expert at anything SQL, so I am seeking guidance.
In general, the schema has tables like this:
ID | A | B
ID is a unique identifier
A contains text, such as animal names. There's very little variety; maybe 3-4 different values in thousands of rows. This could vary with time, but still a small set.
B is one of two options, but stored as text. The set is finite.
My questions are as follows:
Should I create another table for names contained in A, with an ID and a value, and set the ID as the primary key? Or should I just put an index on that column in my table? Right now, to get a list of A's, it does "select distinct(a) from table" which seems inefficient to me.
The table has a multitude of columns for properties of A. It could be like: Color, Age, Weight, etc. I would think that this is better suited in a separate table with: ID, AnimalID, Property, Value. Each property is unique to the animal, so I'm not sure how this schema could enforce this (the current schema implies this as it's a column, so you can only have one value for each property).
Right now the DB is easily readable by a human, but its size is growing fast and I feel like the design is inefficient. There are currently no indexes anywhere. As I said, I'm not a pro, but I will read more on the subject. The goal is to have a fast system. Thanks for your advice!

This sounds like a database that might represent a veterinary clinic.
If the table you describe represents the various patients (animals) that come to the clinic, then properties specific to them are probably best kept on the primary table. But since, as you say, column "A" contains a species name, it might be worthwhile to link that to a secondary table to save on the redundancy of storing those names:
For example:
Patients
--------
ID  Name  SpeciesID  Color        DOB         Weight
1   Spot  1          Black/White  2008-01-01  20
Species
-------
ID  Species
1   Cocker Spaniel
If your main table should be instead grouped by customer or owner, then you may want to add an Animals table and link it:
Customers
---------
ID  Name
1   John Q. Sample
Animals
-------
ID  CustomerID  SpeciesID  Name  Color        DOB         Weight
1   1           1          Spot  Black/White  2008-01-01  20
...
As for your original column B, consider converting it to a boolean (BIT) if you only need to store two states. Barring that, consider CHAR to store a fixed number of characters.

Like most things, it depends.
By having the animal names directly in the table, it makes your reporting queries more efficient by removing the need for many joins.
Going with something like 3rd normal form (having an ID/Name table for the animals) makes your database smaller, but requires more joins for reporting.
Either way, make sure to add some indexes.

Generate unique ID to share with multiple tables SQL 2008

I have a couple of tables in a SQL 2008 server that I need to generate unique IDs for. I have looked at the "identity" column, but the IDs really need to be unique and shared between all the tables.
So if I have, say, five (5) tables of the flavour "asset infrastructure" and I want to run with a unique ID between them as a combined group, I need some sort of generator that looks at all five tables and issues the next ID that is not duplicated in any of those five tables.
I know this could be done with some sort of stored procedure but I'm not sure how to go about it. Any ideas?

The simplest solution is to set your identity seeds and increment on each table so they never overlap.
Table 1: Seed 1, Increment 5
Table 2: Seed 2, Increment 5
Table 3: Seed 3, Increment 5
Table 4: Seed 4, Increment 5
Table 5: Seed 5, Increment 5
The identity column mod 5 will tell you which table the record is in. You will use up your identity space five times faster so make sure the datatype is big enough.
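The arithmetic behind this can be checked with a few lines of plain Python. This only simulates the values an IDENTITY(seed, 5) column would hand out; it is not SQL Server code, and the function names are made up for the sketch. One detail worth noticing is that IDs from table 5 have a remainder of 0, not 5.

```python
from itertools import count, islice

TABLE_COUNT = 5

def identity_values(seed, increment=TABLE_COUNT):
    """Yield the values an IDENTITY(seed, increment) column would hand out."""
    return count(seed, increment)

def table_for(identity_value):
    """Map an ID back to its table number (1..5); a remainder of 0 means table 5."""
    remainder = identity_value % TABLE_COUNT
    return remainder if remainder != 0 else TABLE_COUNT

# First four IDs issued by each of the five tables.
first_ids = {t: list(islice(identity_values(t), 4)) for t in range(1, TABLE_COUNT + 1)}
print(first_ids)
```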

Why not use a GUID?

You could let them each have an identity that seeds from numbers far enough apart never to collide.
GUIDs would work but they're butt-ugly, and non-sequential if that's significant.
Another common technique is to have a single-column table with an identity that dispenses the next value each time you insert a record. If you need them pulling from a common sequence, it may well be useful to have a second column indicating which table it was dispensed to.
You realize there are logical design issues with this, right?
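A minimal sketch of the dispenser-table technique mentioned above, using Python's sqlite3 module with AUTOINCREMENT standing in for a SQL Server identity. The names IdDispenser, AssetA, AssetB and insert_asset are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical names: IdDispenser, AssetA and AssetB are invented for this sketch.
CREATE TABLE IdDispenser (
    Id INTEGER PRIMARY KEY AUTOINCREMENT,
    IssuedTo TEXT NOT NULL            -- which table received this ID
);
CREATE TABLE AssetA (Id INTEGER PRIMARY KEY, Label TEXT NOT NULL);
CREATE TABLE AssetB (Id INTEGER PRIMARY KEY, Label TEXT NOT NULL);
""")

def insert_asset(table, label):
    """Draw the next shared ID, then insert the row, in one transaction."""
    if table not in ("AssetA", "AssetB"):  # table names cannot be bound as parameters
        raise ValueError(table)
    with conn:  # commits on success, rolls back on error
        new_id = conn.execute(
            "INSERT INTO IdDispenser (IssuedTo) VALUES (?)", (table,)
        ).lastrowid
        conn.execute(f"INSERT INTO {table} (Id, Label) VALUES (?, ?)", (new_id, label))
    return new_id

ids = [
    insert_asset("AssetA", "pump"),
    insert_asset("AssetB", "valve"),
    insert_asset("AssetA", "pipe"),
]
print(ids)
```

The IssuedTo column is the "second column" idea: given any ID, one lookup tells you which table holds the row.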

Reading into the design a bit, it sounds like what you really need is a single table called "Asset" with an identity column, and then either:
a) 5 additional tables for the subtypes of assets, each with a foreign key to the primary key on Asset; or
b) 5 views on Asset that each select a subset of the rows and then appear (to users) like the 5 original tables you have now.
If the columns on the tables are all the same, (b) is the better choice; if they're all different, (a) is the better choice. This is a classic DB spin on the supertype / subtype relationship.
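Option (a) might look roughly like the following, sketched with Python's sqlite3 module. The Pump and Valve subtypes, their columns, and the add_asset helper are assumptions invented for illustration; the point is that every subtype row takes its key from the one Asset table, so IDs cannot collide across subtypes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK checks off by default
conn.executescript("""
-- Hypothetical subtype tables and columns, invented for this sketch.
CREATE TABLE Asset (
    AssetId INTEGER PRIMARY KEY AUTOINCREMENT,
    AssetType TEXT NOT NULL
);
CREATE TABLE Pump (
    AssetId INTEGER PRIMARY KEY REFERENCES Asset (AssetId),
    FlowRate REAL NOT NULL
);
CREATE TABLE Valve (
    AssetId INTEGER PRIMARY KEY REFERENCES Asset (AssetId),
    Diameter REAL NOT NULL
);
""")

# Whitelist of subtype table -> its one extra column.
SUBTYPES = {"Pump": "FlowRate", "Valve": "Diameter"}

def add_asset(subtype, value):
    """Insert the supertype row first, then the subtype row with the same key."""
    column = SUBTYPES[subtype]  # KeyError for unknown subtypes
    with conn:
        asset_id = conn.execute(
            "INSERT INTO Asset (AssetType) VALUES (?)", (subtype,)
        ).lastrowid
        conn.execute(
            f"INSERT INTO {subtype} (AssetId, {column}) VALUES (?, ?)",
            (asset_id, value),
        )
    return asset_id

pump_id = add_asset("Pump", 12.5)
valve_id = add_asset("Valve", 0.3)
print(pump_id, valve_id)
```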
Alternately, you could do what you're talking about and recreate the IDENTITY functionality yourself with a stored proc that wraps INSERT access on all 5 tables. Note that you'll have to put a TRANSACTION around it if you want guarantees of uniqueness, and if this is a popular table, that might make it a performance bottleneck. If that's not a concern, a proc like that might take the form:
CREATE PROCEDURE InsertAsset_Table1
AS
BEGIN
BEGIN TRANSACTION
-- SELECT MIN INTEGER NOT ALREADY USED IN ANY OF THE FIVE TABLES
-- INSERT INTO Table1 WITH THAT ID
COMMIT TRANSACTION -- or roll back on error, etc.
END
Again, SQL is highly optimized for helping you out if you choose the patterns I mention above, and NOT optimized for this kind of thing (there's overhead with creating the transaction AND you'll be issuing shared locks on all 5 tables while this process is going on). Compare that with using the PK / FK method above, where SQL Server knows exactly how to do it without locks, or the view method, where you're only inserting into 1 table.

I found this when searching on Google. I am facing a similar problem for the first time. I had the idea of a dedicated ID table specifically to generate the IDs, but I was unsure whether it was considered OK design. So I just wanted to say THANKS for the confirmation. It looks like it is an adequate solution, although not ideal.

I have a very simple solution. It should be good for cases when the number of tables is small:
create table T1(ID int primary key identity(1,2), rownum varchar(64))
create table T2(ID int primary key identity(2,2), rownum varchar(64))
insert into T1(rownum) values('row 1')
insert into T1(rownum) values('row 2')
insert into T1(rownum) values('row 3')
insert into T2(rownum) values('row 1')
insert into T2(rownum) values('row 2')
insert into T2(rownum) values('row 3')
select * from T1
select * from T2
drop table T1
drop table T2

This is a common problem, for example when using a table of people (called PERSON, singular please) where each person is categorized, for example as Doctor, Patient, Employee, Nurse, etc.
It makes a lot of sense to create a table for each of these categories that contains their category-specific information, like an employee's start date and salary or a nurse's qualifications and number.
A Patient, for example, may have many nurses and doctors who work on him, so a many-to-many table that links Patient to other people in the PERSON table facilitates this nicely. In this table there should be some description of the relationship between these people, which leads us back to the categories for people.
Since a Doctor and a Patient could end up with the same primary key ID in their own tables, it becomes very useful to have a globally unique ID, or Object ID (OID).
A good way to do this, as suggested, is to have a table designated to auto-increment the primary key. Perform an INSERT on that table first to obtain the OID, then use it for the new PERSON.
I like to go a step further. When things get ugly (some new developer gets his hands on the database or, even worse, a really old developer), it's very useful to add more meaning to the OID.
Usually this is done programmatically, not with the database engine, but if you use a BIGINT for all the primary key IDs then you have lots of room to prefix a number with a visually identifiable sequence. For example, all Doctor IDs could begin with 100, all Patient IDs with 110, and all Nurse IDs with 120.
To that I would append, say, a Julian date or a Unix date+time, and finally append the auto-increment ID.
This would result in numbers like:
110,2455892,00000001
120,2455892,00000002
100,2455892,00000003
Since the Julian date 100 years from now is only 2492087, you can see that 7 digits will adequately store this value.
A BIGINT is a 64-bit (8-byte) signed integer with a range of -9.22x10^18 to 9.22x10^18 (-2^63 to 2^63 - 1). Notice the exponent is 18. That's 18 digits you have to work with.
Using this design, you are limited to 100 million OIDs, 999 categories of people and dates up to... well past the shelf life of your database, but I suspect that's good enough for most solutions.
The operations required to create an OID like this are all multiplication and division, which avoids all the gear grinding of text manipulation.
The disadvantage is that INSERTs require more than a simple T-SQL statement, but the advantage is that when you are tracking down errant data or even being clever in your queries, your OID is visually telling you a lot more than a random number or, worse, an eyesore like a GUID.
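The packing described above can be sketched in a few lines of Python, assuming the 3 + 7 + 8 digit layout from the example numbers. The function and constant names are made up for illustration; in practice the sequence part would come from the auto-increment table.

```python
CATEGORY_DIGITS = 3   # e.g. 100 = Doctor, 110 = Patient, 120 = Nurse
DATE_DIGITS = 7       # Julian day number
SEQUENCE_DIGITS = 8   # auto-increment part

DATE_SHIFT = 10 ** SEQUENCE_DIGITS                      # 10^8
CATEGORY_SHIFT = 10 ** (DATE_DIGITS + SEQUENCE_DIGITS)  # 10^15
BIGINT_MAX = 2 ** 63 - 1

def make_oid(category, julian_day, sequence):
    """Pack category, Julian day number and sequence into one integer."""
    if not 0 < category < 10 ** CATEGORY_DIGITS:
        raise ValueError("category must fit in 3 digits")
    if not 0 <= julian_day < 10 ** DATE_DIGITS:
        raise ValueError("julian_day must fit in 7 digits")
    if not 0 <= sequence < 10 ** SEQUENCE_DIGITS:
        raise ValueError("sequence must fit in 8 digits")
    return category * CATEGORY_SHIFT + julian_day * DATE_SHIFT + sequence

def split_oid(oid):
    """Recover (category, julian_day, sequence) using only integer division."""
    category, rest = divmod(oid, CATEGORY_SHIFT)
    julian_day, sequence = divmod(rest, DATE_SHIFT)
    return category, julian_day, sequence

oid = make_oid(110, 2455892, 1)  # the first example row above
print(oid, split_oid(oid))
```

Even the largest value this layout can produce (999, 9999999, 99999999) stays below the BIGINT maximum.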
