Simple DB design question -- sets of measurements - sql

I have multiple sets of measurements. Each set has multiple values in it. The data is currently in a spreadsheet, and I'd like to move it to a real database.
In the spreadsheet, each set of measurements is in its own column, i.e.,
1 | 2 | 3 | ...
0.5 | 0.7 | 0.2 | ...
0.3 | 0.6 | 0.4 | ...
and so on. If I have, say, 10 data sets and each has 8 measurements, I wind up with an 8x10 table with 80 values.
To put the database in normal form, I know that it should be designed to add rows, not columns, for each new set. I presume that means the tables should be arranged something like (MySQL syntax):
create table Measurement(
id int not null auto_increment,
primary key(id)
);
create table Measurement_Data(
id int not null auto_increment,
value double not null,
measurement int,
primary key(id),
constraint fk_data foreign key(measurement) references Measurement(id)
);
Is that correct? The result is a table with only one autogenerated column (though of course I could add more to it later) and another table that has each data set stored "lengthwise", meaning that table will have 80 rows given the values above. Is that the right way to handle this problem? It also means that to add data, I need to insert a row into the Measurement table, get its assigned pk, and then add all the new rows to the Measurement_Data table, right?
I appreciate any advice,
Ken

What about a simple one-table solution for this?
CREATE TABLE measurements
(
id INT AUTO_INCREMENT PRIMARY KEY,
dataSetId INT,
measurementNumber INT,
measurementValue DOUBLE,
CONSTRAINT UNIQUE (dataSetId, measurementNumber)
);
Unless your problem is more complicated than you describe, this should be adequate.
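For example, loading and reading back one data set might look like this (a hedged sketch; the dataSetId values are assumed to be assigned by your application):
-- Load data set 1:
INSERT INTO measurements (dataSetId, measurementNumber, measurementValue)
VALUES (1, 1, 0.5), (1, 2, 0.3);
-- Read it back in order:
SELECT measurementValue
FROM measurements
WHERE dataSetId = 1
ORDER BY measurementNumber;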

In general, your plan sounds good as far as it goes. I'd tweak it a bit differently from the other posts:
create table Measurement(
MeasurementId int not null auto_increment,
primary key(MeasurementId)
);
create table MeasurementData(
MeasurementId int not null,
MeasurementDataEntry int not null,
value double not null,
primary key(MeasurementId, MeasurementDataEntry),
constraint fk_data foreign key(MeasurementId) references Measurement(MeasurementId)
);
As per LukLed, MeasurementDataEntry would be an ordinal value, starting at 1 and incrementing by one for each "data entry" for that MeasurementId. The compound primary key should be sufficient; with the problem as stated, I see no real need for an extra surrogate key on MeasurementData.
Be sure that your solution is sufficient for your problem set. Are other columns required in either table (type of measurement, when it occurred, who took it)? Are there always the same number of data readings for each Measurement, or can it vary? If it varies, do you need the number of data readings stored in the parent table (for quicker retrieval, at the cost of redundancy)? Will you ever need to update data entries, and will that mess up the order? It just seems a bit sparse, and there might be other business rules you'll need to factor in down the road.
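For what it's worth, the two-step insert flow the OP describes might look like this in MySQL (a sketch against the schema above):
-- Create the parent row and capture its generated key:
INSERT INTO Measurement () VALUES ();
SET @mid = LAST_INSERT_ID();
-- Then add the data rows for that measurement set:
INSERT INTO MeasurementData (MeasurementId, MeasurementDataEntry, value)
VALUES (@mid, 1, 0.5), (@mid, 2, 0.3);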

First I would change the 'measurement' column name to 'measurement_id', because the original name is not intuitive. You could also add an ordinal number to the Measurement_Data table to record the order in which measurements were taken, and add a unique constraint on (measurement_id, ordinal_number). With an ordinal number you can also limit the number of measurements in each set with a check constraint.
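A sketch of the ordinal-number idea against the OP's Measurement_Data table (keeping the original column name; note MySQL only enforces CHECK constraints from version 8.0.16, and the constraint names here are illustrative):
ALTER TABLE Measurement_Data
ADD COLUMN ordinal_number INT NOT NULL,
ADD CONSTRAINT uq_meas_ord UNIQUE (measurement, ordinal_number),
ADD CONSTRAINT ck_ord_range CHECK (ordinal_number BETWEEN 1 AND 8);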


Can I use identity for primary key in more than one table in the same ER model

As it says in the title, my question is: can I use int identity(1,1) for the primary key in more than one table in the same ER model? I found on the Internet that a primary key needs to have a unique value per row, so for example if I set int identity(1,1) for this table:
CREATE TABLE dbo.Persons
(
Personid int IDENTITY(1,1) PRIMARY KEY,
LastName varchar(255) NOT NULL,
FirstName varchar(255),
Age int
);
GO
and the other table
CREATE TABLE dbo.Job
(
jobID int IDENTITY(1,1) NOT NULL PRIMARY KEY,
nameJob NVARCHAR(25) NOT NULL,
Personid int FOREIGN KEY REFERENCES dbo.Persons(Personid)
);
Wouldn't Personid and jobID have the same value and because of that cause an error?
Constraints in general are defined and have a scope of one table (object) in the database. The only exception is the FOREIGN KEY which usually has a REFERENCE to another table.
The PRIMARY KEY (or any UNIQUE key) sets a constraint only on the table it is defined on and is not affecting or is not affected by other constraints on other tables.
The PRIMARY KEY defines a column or a set of columns which can be used to uniquely identify one record in one table (and none of the columns can hold NULL, UNIQUE on the other hand allows NULLs and how it is treated might differ in different database engines).
So yes, you might have the same value for PersonID and JobID, but their meaning is different. (And to select the one unique record, you will need to tell SQL Server in which table and in which column of that table you are looking for it, this is the table list and the WHERE or JOIN conditions in the query).
The query SELECT * FROM dbo.Job WHERE JobID = 1; and SELECT * FROM dbo.Person WHERE PersonID = 1; have a different meaning even when the value you are searching for is the same.
You will define the IDENTITY on the table (the table can have only one IDENTITY column). You don't need to have an IDENTITY definition on a column to have the value 1 in it, the IDENTITY just gives you an easy way to generate unique values per table.
You can share sequences across tables by using a SEQUENCE, but that will not prevent you to manually insert the same values into multiple tables.
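For illustration, a minimal sketch of the SEQUENCE approach (SQL Server syntax; the "2" suffixes on the table names are invented to avoid clashing with the tables above):
CREATE SEQUENCE dbo.GlobalId START WITH 1 INCREMENT BY 1;
GO
CREATE TABLE dbo.Persons2
(
Personid int NOT NULL PRIMARY KEY DEFAULT (NEXT VALUE FOR dbo.GlobalId),
LastName varchar(255) NOT NULL
);
CREATE TABLE dbo.Job2
(
jobID int NOT NULL PRIMARY KEY DEFAULT (NEXT VALUE FOR dbo.GlobalId),
nameJob nvarchar(25) NOT NULL
);
-- Values generated this way never collide across the two tables, but an
-- explicit INSERT can still reuse a number, as noted above.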
In short, the value stored in the column is just a value, the table name, the column name and the business rules and roles will give it a meaning.
To the notion "every table needs to have a PRIMARY KEY and IDENTITY", I would like to add that in most cases there are multiple (independent) keys in a table. Usually every entity has something you can call a business key, which is, in loose terms, the key the business (humans) uses to identify something. This key has very similar, but usually not quite the same, characteristics as a PRIMARY KEY with IDENTITY.
This can be a product's barcode, or the employee's ID card number, or something what is generated in another system (say HR) or a code which is assigned to a customer or partner.
These business keys are useful for humans, though not always for computers, but they could serve as a PRIMARY KEY.
In databases we (the developers, architects) like simplicity, and a business key can be very complex (in computer terms): it can consist of multiple columns, it can cause performance issues (comparing strings is not the same as comparing numbers, and comparing multiple columns is less efficient than comparing one column), but worst of all, it might change over time. To resolve this, we tend to create our own technical key which computers can use more easily and which we have more control over, so we use things like IDENTITYs and GUIDs and whatnot.

Problems with having a field that will very often be null on a table in SQL Server

I have a column that will sometimes be null. This column is also a foreign key, so I want to know if I'll have problems with performance or with data consistency given how often this column will be null.
I know it's a foolish question but I want to be sure.
There is no problem necessarily with this, other than that it is likely an indication of a poorly normalized design. There might be performance implications due to the way indexes are structured and the sparseness of the column with nulls, but without knowing your structure or intended querying scenarios, any conclusions one might draw would be pure speculation.
A better solution might be a shared primary key, where table A has a primary key and there is zero or one record in B with the same primary key.
If table A can have one or zero B, but more than one A can refer to B, then what you have is a one-to-many relationship. This can be represented as Pieter laid out in his answer. This allows multiple A records to refer to the same B, and in turn each B may optionally refer to an A.
So you see there are two optional structures to address this problem, and choosing between them is not guesswork. There is a distinct rationale for why you would choose one or the other, depending on the nature of the relationships you are modelling.
Instead of this design:
create table Detail (
ID int identity not null primary key
)
go
create table Master (
ID int identity not null primary key,
DetailID int null references Detail(ID)
)
go
consider this instead
create table Master (
ID int identity not null primary key
)
go
create table Detail (
ID int identity not null primary key,
MasterID int not null references Master(ID)
)
go
Now the foreign key is never null; instead, the existence (or not) of the Detail record indicates whether the relationship exists.
If a Detail can exist for multiple records, create a mapping table to manage the relationship.
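A sketch of that mapping table, building on the tables above (the composite primary key keeps each pairing unique):
create table MasterDetail (
MasterID int not null references Master(ID),
DetailID int not null references Detail(ID),
primary key (MasterID, DetailID)
)
go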

Performance Typed Column x Distinct Table

Are there differences between distinct tables and type columns in terms of performance or query optimization?
For example:
Create Table AllInOne(
Id Integer Identity Primary Key,
Descr varchar(20) Not Null,
OneType Integer Not Null
)
Where OneType only takes the integer values 1, 2, or 3.
Versus the following architecture:
Create Table One(
Id Integer Identity Primary Key,
Descr varchar(20) Not Null
)
Create Table Two(
Id Integer Identity Primary Key,
Descr varchar(20) Not Null
)
Create Table Three(
Id Integer Identity Primary Key,
Descr varchar(20) Not Null
)
Another possible architecture:
Create Table Root(
Id Integer Identity Primary Key,
Descr varchar(20) Not Null
)
Create Table One(
Id Integer Primary Key references Root
)
Create Table Two(
Id Integer Primary Key references Root
)
Create Table Three(
Id Integer Primary Key references Root
)
In the third approach, all data is kept in the Root table, with the One, Two, and Three tables related to it.
I asked my teacher some time ago and he couldn't say whether there is any difference.
Let's suppose I have to choose between these three approaches.
Assume that commonly used queries filter on the type, and that there are no child tables that reference these.
To make it easier to understand, let's think about a payroll system:
One = Incomings
Two = Discounts
Three = Base for calculation.
Having separate tables, like in (2), will mean that someone who needs to access data for a particular OneType can ignore data for other types, thereby doing less I/O for a table scan. Also, indexes on the table in (2) would be smaller and potentially of less height, meaning less I/Os for index accesses.
Given the low selectivity of OneType (only three distinct values), indexes would not help filtering in (1). However, table partitioning could be used to get all the benefits mentioned above.
There would also be an additional benefit. When querying (2), you need to know which OneType you need in order to know which table to query. In a partitioned version of (1), elimination of unneeded partitions happens through values supplied in a WHERE clause predicate, making the process much easier.
Other benefits include easier database management (when you add a column to a partitioned table, it gets added to all partitions) and easier scaling (adding partitions for new OneType values is easy). Also, as mentioned, the table can be targeted by foreign keys.
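For illustration, a partitioned version of (1) might be set up like this (SQL Server syntax; a sketch with invented partition function/scheme names):
CREATE PARTITION FUNCTION pfOneType (int)
AS RANGE LEFT FOR VALUES (1, 2);  -- three partitions: 1, 2, and 3+
CREATE PARTITION SCHEME psOneType
AS PARTITION pfOneType ALL TO ([PRIMARY]);
CREATE TABLE AllInOne(
Id Integer Identity Not Null,
Descr varchar(20) Not Null,
OneType Integer Not Null,
Constraint PK_AllInOne Primary Key (Id, OneType)
) ON psOneType (OneType);
-- A query such as WHERE OneType = 2 then touches only one partition.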

Performance - Int vs Char(3)

I have a table and am debating between 2 different ways to store information. It has a structure like so
int id
int FK_id
varchar(50) info1
varchar(50) info2
varchar(50) info3
int forTable or char(3) forTable
The FK_id can be a foreign key to one of 6 tables so I need another field to determine which table it's for.
I see two solutions:
An integer that is a FK to a settings table which has its actual value.
A char(3) field with an abbreviated version of the table name.
I am wondering if anyone knows whether one will be more beneficial speed-wise than the other, or if there will be any major problems with using the char(3).
Note: I will be creating an indexed view on each of the 6 different values for this field. This table will contain ~30k rows and will need to be joined with much larger tables
In this case, it probably doesn't matter except for the collation overhead (A vs a vs ä vs à).
I'd use char(3), say for currency codes like CHF, GBP, etc. But if my natural key was "Swiss Franc", "British Pound", etc., I'd take the numeric.
3 bytes + collation vs 4 bytes numeric? You'd need a zillion rows or be running a medium-sized country before it mattered...
Have you considered using a tinyint? It only takes one byte to store its value, with a range of 0 to 255.
Is the reason you need a single table that you want to ensure that, when the six parent tables reference a given child row, it is guaranteed to be the same instance? This is the classic "multi-parent" problem. An example of where you might run into this is with addresses or phone numbers used by multiple person/contact tables.
I can think of a couple of options:
Choice 1: A link table for each parent table. This would be the Hoyle architecture. So, something like:
Create Table MyTable(
id int not null Primary Key Clustered
, info1 varchar(50) null
, info2 varchar(50) null
, info3 varchar(50) null
)
Create Table LinkTable1(
MyTableId int not null
, ParentTable1Id int not null
, Constraint PK_LinkTable1 Primary Key Clustered( MyTableId, ParentTable1Id )
, Constraint FK_LinkTable1_MyTable
Foreign Key ( MyTableId )
References MyTable ( Id )
, Constraint FK_LinkTable1_ParentTable1
Foreign Key ( ParentTable1Id )
References ParentTable1 ( Id )
)
...
Create Table LinkTable2...LinkTable3
Choice 2. If you knew that you would never have more than say six tables and were willing to accept some denormalization and a fugly design, you could add six foreign keys to your main table. That avoids the problem of populating a bunch of link tables and ensures proper referential integrity. However, that design can quickly get out of hand if the number of parents grows.
If you are content with your existing design, then with respect to the field size, I would use the full table name. Frankly, the difference in performance between a char(3) and a varchar(50) or even varchar(128) will be negligible for the amount of data you are likely to put in the table. If you really thought you were going to have millions of rows, then I would strongly consider the option of linking tables.
If you wanted to stay with your design and wanted the maximum performance, then I would use a tinyint with a foreign key to a table that contained the list of the six tables with a tinyint primary key. That prevents the number from being "magic" and ensures that you narrow down the list of parent tables. Of course, it still does not prevent orphaned records. In this design, you have to use triggers to do that.
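A sketch of that tinyint lookup arrangement (the names are invented for the example; as noted, preventing orphans would still require triggers):
CREATE TABLE ParentTableType(
TypeId tinyint NOT NULL PRIMARY KEY,
TableName varchar(128) NOT NULL UNIQUE  -- the six parent table names
);
CREATE TABLE MyTable(
id int NOT NULL PRIMARY KEY,
FK_id int NOT NULL,  -- points into one of the six parent tables
forTable tinyint NOT NULL REFERENCES ParentTableType(TypeId),
info1 varchar(50) NULL
);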
Because your FK cannot be enforced (since it is a variant depending upon type) by database constraint, I would strongly consider re-evaluating your design to use link tables, where each link table includes two FK columns, one to the PK of the entity and one to the PK of one of the 6 tables.
While this might seem to be overkill, it makes a lot of things simpler and adding new link tables is no more complex than accommodating new FK-types. In addition, it is more easily expandable to the case where an entity needs more than a 1-1 relationship to a single table, or needs multiple 1-1 relationships to the 6 other entities.
In a varying-FK scenario, you can lose database consistency, you can join to the wrong entity by neglecting to filter on type code, etc.
I should add that another huge benefit of link tables is that you can link to tables which have keys of varying data types (ints, natural keys, etc.) without having to add surrogate keys or store the key in a varchar or similar workarounds which are prone to problems.
I think a small integer (tinyint) is called for here. An "abbreviated version" looks too much like a magic number.
I also think that, performance-wise, the integer should beat the char(3).
First off, a 50 character Id that is not globally unique sounds a little scary. Do the IDs have some meaning? If not, you can easily get a GUID in less space. Personally, I am a big fan of making things human readable whenever possible. I would, and have, put the full name in graphs until I needed to do otherwise. My preference would be to have linking tables for each possible related table though.
Unless you are talking about really large scale, you are much better off decreasing the size of the IDs and taking a few more characters for the name of the table. For really large scale, I would decrease the size of the IDs and use an integer.
Jacob

Design question: Filterable attributes, SQL

I have two tables in my database, Operation and Equipment. An operation requires zero or more pieces of equipment. However, there's some logic in how the equipment is assigned:
Operation Foo requires equipment A and B
Operation Bar requires no equipment
Operation Baz requires equipment B and either C or D
Operation Quux requires equipment (A or B) and (C or D)
What's the best way to represent this in SQL?
I'm sure people have done this before, but I have no idea where to start.
(FWIW, my application is built with Python and Django.)
Update 1: There will be around a thousand Operation rows and about thirty Equipment rows. The information is coming in CSV form similar to the description above: Quux, (A & B) | (C & D)
Update 2: The level of conjunctions & disjunctions shouldn't be too deep. The Quux example is probably the most complicated, though there appears to be an A | (D & E & F) case.
Think about how you'd model the operations in OO design: the operations would be subclasses of a common superclass Operation. Each subclass would have mandatory object members for the respective equipment required by that operation.
The way to model this with SQL is Class Table Inheritance. Create a common super-table:
CREATE TABLE Operation (
operation_id SERIAL PRIMARY KEY,
operation_type CHAR(1) NOT NULL,
UNIQUE KEY (operation_id, operation_type),
FOREIGN KEY (operation_type) REFERENCES OperationTypes(operation_type)
);
Then for each operation type, define a sub-table with a column for each required equipment type. For example, OperationFoo has a column for each of equipA and equipB. Since they are both required, the columns are NOT NULL. Constrain them to the correct types by creating a Class Table Inheritance super-table for equipment too.
CREATE TABLE OperationFoo (
operation_id INT PRIMARY KEY,
operation_type CHAR(1) NOT NULL CHECK (operation_type = 'F'),
equipA INT NOT NULL,
equipB INT NOT NULL,
FOREIGN KEY (operation_id, operation_type)
REFERENCES Operation(operation_id, operation_type),
FOREIGN KEY (equipA) REFERENCES EquipmentA(equip_id),
FOREIGN KEY (equipB) REFERENCES EquipmentB(equip_id)
);
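The equipment side can mirror the same pattern; a sketch of what those tables might look like (the tables below are assumptions, as the answer doesn't spell them out):
CREATE TABLE Equipment (
equip_id SERIAL PRIMARY KEY,
equip_type CHAR(1) NOT NULL,
UNIQUE KEY (equip_id, equip_type)
);
CREATE TABLE EquipmentA (
equip_id INT PRIMARY KEY,
equip_type CHAR(1) NOT NULL CHECK (equip_type = 'A'),
FOREIGN KEY (equip_id, equip_type) REFERENCES Equipment(equip_id, equip_type)
);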
Table OperationBar requires no equipment, so it has no equip columns:
CREATE TABLE OperationBar (
operation_id INT PRIMARY KEY,
operation_type CHAR(1) NOT NULL CHECK (operation_type = 'B'),
FOREIGN KEY (operation_id, operation_type)
REFERENCES Operation(operation_id, operation_type)
);
Table OperationBaz has one required equipment equipA, and then at least one of equipB and equipC must be NOT NULL. Use a CHECK constraint for this:
CREATE TABLE OperationBaz (
operation_id INT PRIMARY KEY,
operation_type CHAR(1) NOT NULL CHECK (operation_type = 'Z'),
equipA INT NOT NULL,
equipB INT,
equipC INT,
FOREIGN KEY (operation_id, operation_type)
REFERENCES Operation(operation_id, operation_type),
FOREIGN KEY (equipA) REFERENCES EquipmentA(equip_id),
FOREIGN KEY (equipB) REFERENCES EquipmentB(equip_id),
FOREIGN KEY (equipC) REFERENCES EquipmentC(equip_id),
CHECK (COALESCE(equipB, equipC) IS NOT NULL)
);
Likewise in table OperationQuux you can use a CHECK constraint to make sure at least one equipment resource of each pair is non-null:
CREATE TABLE OperationQuux (
operation_id INT PRIMARY KEY,
operation_type CHAR(1) NOT NULL CHECK (operation_type = 'Q'),
equipA INT,
equipB INT,
equipC INT,
equipD INT,
FOREIGN KEY (operation_id, operation_type)
REFERENCES Operation(operation_id, operation_type),
FOREIGN KEY (equipA) REFERENCES EquipmentA(equip_id),
FOREIGN KEY (equipB) REFERENCES EquipmentB(equip_id),
FOREIGN KEY (equipC) REFERENCES EquipmentC(equip_id),
FOREIGN KEY (equipD) REFERENCES EquipmentD(equip_id),
CHECK (COALESCE(equipA, equipB) IS NOT NULL AND COALESCE(equipC, equipD) IS NOT NULL)
);
This may seem like a lot of work. But you asked how to do it in SQL. The best way to do it in SQL is to use declarative constraints to model your business rules. Obviously, this requires that you create a new sub-table every time you create a new operation type. This is best when the operations and business rules never (or hardly ever) change. But this may not fit your project requirements. Most people say, "but I need a solution that doesn't require schema alterations."
Most developers probably don't do Class Table Inheritance. More commonly, they just use a one-to-many table structure like other people have mentioned, and implement the business rules solely in application code. That is, your application contains the code to insert only the equipment appropriate for each operation type.
The problem with relying on the app logic is that it can contain bugs and might insert data that doesn't satisfy the business rules. The advantage of Class Table Inheritance is that with well-designed constraints, the RDBMS enforces data integrity consistently. You have assurance that the database literally can't store incorrect data.
But this can also be limiting, for instance if your business rules change and you need to adjust the data. The common solution in this case is to write a script to dump all the data out, change your schema, and then reload the data in the form that is now allowed (Extract, Transform, and Load = ETL).
So you have to decide: do you want to code this in the app layer, or the database schema layer? There are legitimate reasons to use either strategy, but it's going to be complex either way.
Re your comment: You seem to be talking about storing expressions as strings in data fields. I recommend against doing that. The database is for storing data, not code. You can do some limited logic in constraints or triggers, but code belongs in your application.
If you have too many operations to model in separate tables, then model it in application code. Storing expressions in data columns and expecting SQL to use them for evaluating queries would be like designing an application around heavy use of eval().
I think you should have either a one-to-many or many-to-many relationship between Operation and Equipment, depending on whether there is one Equipment entry per piece of equipment, or per equipment type.
I would advise against putting business logic into your database schema, as business logic is subject to change and you'd rather not have to change your schema in response.
Looks like you'll need to be able to group certain equipment together as either a conjunction or a disjunction and then combine these groups together...
OperationEquipmentGroup
id int
operation_id int
is_conjunction bit
OperationEquipment
id int
operation_equipment_group_id int
equipment_id int
You can add ordering columns if that is important, and maybe another column on the group table to specify how groups are combined (this only makes sense if they're ordered). But, by your examples, it looks like groups are only ANDed together.
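As a hedged example, "Baz requires B and either C or D" might be stored like this (all ids invented for illustration):
INSERT INTO OperationEquipmentGroup (id, operation_id, is_conjunction) VALUES
(1, 3, 1),  -- AND group: { B }
(2, 3, 0);  -- OR group:  { C, D }
INSERT INTO OperationEquipment (operation_equipment_group_id, equipment_id) VALUES
(1, 2),  -- B
(2, 3),  -- C
(2, 4);  -- D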
Since operations can have one or more pieces of equipment, you should use a linking table. Your schema would be like this:
Operation
ID
othercolumn
Equipment
ID
othercolumn
Operation_Equipment_Link
OperationID
EquipmentID
The two fields in the third table can be set up as a composite primary key, so you don't need a third field and can more easily keep duplicates out of the table.
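In SQL, a sketch of that linking table (names follow the listing above):
CREATE TABLE Operation_Equipment_Link(
OperationID INT NOT NULL REFERENCES Operation(ID),
EquipmentID INT NOT NULL REFERENCES Equipment(ID),
PRIMARY KEY (OperationID, EquipmentID)
);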
In addition to Nicholai's suggestion I solved a similar problem as following:
Table Operation has an additional field "OperationType"
Table Equipment has an additional field "EquipmentType"
I have an additional table "DefaultOperationEquipmentType" specifying which EquipmentType needs to be include with each OperationType, e.g.
OperationType   EquipmentType
=============   =============
Foo_Type        A_Type
Foo_Type        B_Type
Baz_Type        B_Type
Baz_Type        C_Type
My application doesn't need complex conditions like (A or B), because in my business logic both alternative pieces of equipment belong to the same type of equipment; e.g., in a PC environment I could have a Mouse (A) or a Trackball (B), but they both belong to EquipmentType "PointingDevice_Type".
Hope that helps
Be aware that I have not tested this in the wild. That being said, the best* way I can see to do a mapping is with a denormalized table for the grouping.
*(aside from Bill's way, which is hard to set up, but masterful when done correctly)
Operations:
--------------------
Op_ID int not null pk
Op_Name varchar 500
Equipment:
--------------------
Eq_ID int not null pk
Eq_Name varchar 500
Total_Available int
Group:
--------------------
Group_ID int not null pk
-- Here you have a choice. You can either:
-- Not recommended
Equip varchar(500) --Stores a list of EQ_ID's {1, 3, 15}
-- Recommended
Eq_ID_1 bit
Eq_1_Total_Required
Eq_ID_2 bit
Eq_2_Total_Required
Eq_ID_3 bit
Eq_3_Total_Required
-- ... etc.
Operations_to_Group_Mapping:
--------------------
Group_ID int not null frk
Op_ID int not null frk
Thus, in case X: A | (D & E & F)
Operations:
--------------------
Op_ID Op_Name
1 X
Equipment:
--------------------
Eq_ID Eq_Name Total_Available
1 A 5
-- ... snip ...
22 D 15
23 E 0
24 F 2
Group:
--------------------
Group_ID Eq_ID_1 Eq_1_Total_Required -- ... etc. ...
1 TRUE 3
-- ... snip ...
2 FALSE 0
Operations_to_Group_Mapping:
--------------------
Group_ID Op_ID
1 1
2 1
As loath as I am to put recursive (tree) structures in SQL, it sounds like this is really what you're looking for. I would use something modeled like this:
Operation
----------------
OperationID PK
RootEquipmentGroupID FK -> EquipmentGroup.EquipmentGroupID
...
Equipment
----------------
EquipmentID PK
...
EquipmentGroup
----------------
EquipmentGroupID PK
LogicalOperator
EquipmentGroupEquipment
----------------
EquipmentGroupID | (also FK -> EquipmentGroup.EquipmentGroupID)
EntityType | PK (all 3 columns)
EntityID | (not FK, but references either Equipment.EquipmentID
or EquipmentGroup.EquipmentGroupID)
Now that I've put forth an arguably ugly schema, allow me to explain a bit...
Every equipment group can either be an and group or an or group (as designated by the LogicalOperator column). The members of each group are defined in the EquipmentGroupEquipment table, with EntityID referencing either Equipment.EquipmentID or another EquipmentGroup.EquipmentGroupID, the target being determined by the value in EntityType. This will allow you to compose a group that consists of equipment or other groups.
This will allow you to represent something as simple as "requires equipment A", which would look like this:
EquipmentGroupID LogicalOperator
--------------------------------------------
1 'AND'
EquipmentGroupID EntityType EntityID
--------------------------------------------
1 1 'A'
...all the way to your "A | (D & E & F)", which would look like this:
EquipmentGroupID LogicalOperator
--------------------------------------------
1 'OR'
2 'AND'
EquipmentGroupID EntityType EntityID
--------------------------------------------
1 1 'A'
1 2 2 -- group ID 2
2 1 'D'
2 1 'E'
2 1 'F'
(I realize that I've mixed data types in the EntityID column; this is just to make it clearer. Obviously you wouldn't do this in an actual implementation)
This would also allow you to represent structures of arbitrary complexity. While I realize that you (correctly) don't wish to overarchitect the solution, I don't think you can really get away with less without breaking 1NF (by combining multiple equipment into a single column).
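If you go this route, expanding an operation's full requirement tree can be done with a recursive CTE; a sketch (SQL Server syntax, assuming EntityType 1 = equipment and 2 = group, with @OperationID standing in for the operation being expanded):
WITH Tree AS (
-- anchor: the members of the operation's root group
SELECT ege.EquipmentGroupID, ege.EntityType, ege.EntityID
FROM Operation o
JOIN EquipmentGroupEquipment ege ON ege.EquipmentGroupID = o.RootEquipmentGroupID
WHERE o.OperationID = @OperationID
UNION ALL
-- recurse into members that are themselves groups
SELECT child.EquipmentGroupID, child.EntityType, child.EntityID
FROM Tree t
JOIN EquipmentGroupEquipment child
ON t.EntityType = 2 AND child.EquipmentGroupID = t.EntityID
)
SELECT * FROM Tree;  -- equipment rows are those with EntityType = 1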
From what I understood, you want to store the equipment in relation to the operations in a way that will allow you to apply your business logic to it later. In that case you'll need 3 tables:
Operations:
ID
name
Equipment:
ID
name
Operations_Equipment:
equipment_id
operation_id
symbol
Where symbol is A, B, C, etc...
If you have a condition like (A & B) | (C & D), you can easily know which equipment is which.
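For instance, retrieving the symbols recorded for one operation is then a simple join (a sketch using the tables above):
SELECT e.name, oe.symbol
FROM Operations_Equipment oe
JOIN Equipment e ON e.ID = oe.equipment_id
WHERE oe.operation_id = 1;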