Database wide table or referenced table with record type - sql-server-2012

I want to represent a vehicle (think car or truck) in a database. I have up to 62 pieces of information I'd like to store for each. Examples: year, make, model, drive type, brake system, Mfr. body code, steering type, wheel base, etc. The values are IDs which reference a 3rd-party database that provides the label for each ID. The provider has 1 table to list all makes, 1 table to list all "steering types", etc.
All vehicles will populate the year, make, and model columns. Almost no record (if any) will populate more than 10 columns. But if I looked at all vehicles, then every column would be used by at least one record.
One approach would be to have a single table that has 62 columns. Again most records will have NULL values in most columns.
Alternatively I can do something like this (ignoring indices and primary key for sake of example):
create table vehicles (
    id int identity(1,1),
    year int,
    make int,
    model int
)

create table constraints (
    id int identity(1,1),
    vehicleId int, -- foreign key to vehicles.id
    constraintTypeId int, -- foreign key to constraintTypes.id
    value int
)

create table constraintTypes (
    id int identity(1,1),
    name nvarchar(200) -- Example: "wheel base", "brake system", etc.
)
With this second method if a vehicle only stores 2 pieces of information (aside from year, make, model), then it would have 2 records in table constraints.
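For example (the make/model IDs and constraint-type IDs below are invented for illustration):

insert into vehicles (year, make, model) values (1999, 10, 25)
-- suppose the new vehicle id is 1, and constraintTypes row 7 = "wheel base", 9 = "brake system"
insert into constraints (vehicleId, constraintTypeId, value) values (1, 7, 3050)
insert into constraints (vehicleId, constraintTypeId, value) values (1, 9, 4)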
Users wish to have a page to view all applications. If I have a table with 62 columns I'd need 62 joins in the query to get the labels. I could store the labels on the vehicle to make retrieval faster, but then when labels change in the source data it might be slow to update my vehicles table.
At current there are over 12 million vehicle records, and the source data changes monthly (additions, deletes, and a few label changes).
Is it a better design to have more columns, even if most are always NULL? Or is the second approach better? How does one even calculate the best approach? Even if I had 62 columns they are all valid for every vehicle, but for cataloging purposes most are left empty. For example, if a record should match any "1999 Dodge Viper" (regardless of steering type, body style, etc.) the user doesn't want to have to populate all 62 columns; they want to just see one record for "1999 Dodge Viper".

Your question is a specific case of the general issue of data anomalies and normalisation: https://en.wikipedia.org/wiki/Database_normalization
There is no 'right' answer, although experience suggests there are 'better' and 'worse' answers. So, a question to help you with your planning:
Will the requirements ever change? E.g. will someone one day want to record the brake shoe type, or the driver's seat type? If yes, what are the implications of your 62-column table becoming a 63 (or 99) column table? (In my mind this leads me towards your second method.)
Also remember that, thanks to views, the presentation of the data, even in the DB, does not have to match its storage. E.g. you can have well-normalised tables and a view that shows users 62 (or 63 or 99) columns.
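For example, a view over the second design could pivot the constraints rows back into columns. This is only a rough sketch assuming the three tables above, and the constraintTypeId values are made up:

create view vehicleWide as
select v.id, v.year, v.make, v.model,
    wb.value as wheelBase,
    bs.value as brakeSystem -- ...and so on for the remaining types
from vehicles v
left join constraints wb on wb.vehicleId = v.id and wb.constraintTypeId = 7 -- assumed id of "wheel base"
left join constraints bs on bs.vehicleId = v.id and bs.constraintTypeId = 9 -- assumed id of "brake system"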

Related

Store 3-dimensional table in database where 1 dimension increases over time

I have a data set with three dimensions that I would like to store for use with a website:
A list of companies (about 1000)
Information about the company (about 15 things)
Time (monthly)
Essentially, I want to track this information over time and keep it up to date.
When I start, the data will be 1000x15x1, after a year it will be 1000x15x12, and after 10 years it will be 1000x15x120.
The main queries I would make are:
Get all information for one company over all times
Get all information for one particular time
What would be a good database configuration for doing this? I'm open to either SQL or noSQL solutions.
In case it matters, the website is on Google App Engine.
From the relational database schema design perspective:
If the goal is analytics / ad-hoc querying / OLAP in general only, then you can use a star schema, which is well suited for this type of analytics. But beware: OLAP databases are de-normalized and not suitable for operational transaction storage / OLTP, if you are planning to do both on this database.
The beauty of the Star schema:
The fact tables are usually all numeric, making the tables very small even when there are many records. A small table is very fast to read (I/O).
All joins from the fact table to dimension tables are based on foreign keys (single column, numeric, indexable foreign keys)
All dimension tables have a surrogate key, which is a single-column primary key. A single-column primary key is easier to JOIN on than a multi-column primary key, and also easier to index.
There are no NULLs in the foreign keys of fact tables. This makes JOIN operations straightforward, i.e. you always JOIN the fact table to all of its dimension tables. If you need a NULL case, you add it as a special row in your dimension table. For example: if a company is not listed on the stock market, and one of the things you track is stock price, then you enter 0 or NULL for the stock price fact (depending on how you want SUM(), AVG() etc. to behave later), add a special row such as 'Private company' to your StockSymbols dimension table, and use the key of that special row as the foreign key in the fact table.
Almost all filtering is done through the dimension tables, which are much, much smaller than the fact tables. This requires having a Date dimension to be able to do date-based queries.
If you can stay in a pure star schema, then all your JOINs are single hop (i.e. no join between two tables through another table).
All of this makes JOIN operations very fast, simple and straightforward. That's why the star schema is at the heart of data-warehousing designs.
https://en.wikipedia.org/wiki/Star_schema
https://en.wikipedia.org/wiki/Data_warehouse
One level up from this is OLAP (SSAS, SQL Server Analysis Services, for example), which pre-processes the data to make it fast to query, but it involves more learning than a pure star schema and it's overkill in your case.
For your example, in a star schema:
Companies will be a dimension table
You will need a Month dimension table; it's a simplified version of a Date dimension, just for month info. An example of a Date dimension is here:
https://www.codeproject.com/Articles/647950/Create-and-Populate-Date-Dimension-for-Data-Wareho
The information about the company (the 15 things you mention) will be fact tables. The facts must be numeric (because ideally all non-numeric values are saved in dimension tables). This means moving the non-numeric part of a fact into a dimension table. For example: if you are keeping revenue and would like to keep the currency type too, then you will need a Currency dimension, and you save only the amount in the fact table plus a foreign key to the Currency dimension table.
If you have any non-numeric facts, you need to store the distinct list in a dimension table and add a foreign key to that dimension table inside your fact table (this is called a factless fact table). The only exception is when the cardinality of the dimension and the fact table are very similar; then you can just store the non-numeric fact value inside the fact table directly, as there is no benefit to having a dimension table (in fact, a disadvantage).
Also, facts can be grouped by their granularity. For example you could have a company_monthly_summary fact table and keep more than one fact in that table (all joining to the Company dimension and the Month dimension). How you group fact tables is up to you, but if their granularities are not the same, they should not be grouped, as that will cause sparse fact tables that are harder to query.
You will use foreign keys in Fact tables to join to your Dimension tables
Add indexes for your dimension tables' most-used columns
Add a numeric surrogate key to your dimensions. It is usually an auto-increment number, but that's up to you. One exception people prefer for the surrogate key of a Date dimension is the format YYYYMMDD (as an integer). This makes the WHERE clause easier: instead of filtering on the Date column (a DATETIME value), which requires a lookup to find the surrogate keys, you provide the surrogate keys directly because you know the format. Depending on your business domain, you may have other similarly useful surrogate key patterns to consider. Just know that if the business domain changes, you will have to update all fact records; a simple auto-increment surrogate key does not have that problem. In your case, the surrogate key for the month can be the actual month number (1 for Jan)
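Putting those points together, a minimal sketch of the star schema for this example might look like the following (the table and column names are assumptions, not a prescribed design):

create table DimCompany
(
    CompanyKey int identity(1,1) primary key
    ,CompanyName nvarchar(200) not null
)
create table DimMonth
(
    MonthKey int primary key -- e.g. 201901 (YYYYMM), or simply 1..12
    ,[Year] int not null
    ,[Month] int not null
)
create table FactCompanyMonthly
(
    CompanyKey int not null references DimCompany(CompanyKey)
    ,MonthKey int not null references DimMonth(MonthKey)
    ,Revenue numeric(18,2) not null -- two illustrative facts of the 15
    ,EmployeeCount int not null
    ,primary key (CompanyKey, MonthKey)
)
-- "All information for one company over all times" is then single-hop joins:
select m.[Year], m.[Month], f.Revenue, f.EmployeeCount
from FactCompanyMonthly f
join DimCompany c on c.CompanyKey = f.CompanyKey
join DimMonth m on m.MonthKey = f.MonthKey
where c.CompanyName = 'Acme Corp'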
That being said, 1 million rows in 5 years is easy to query even without a star-schema design (given proper indexing and database maintenance). But if this is part of a larger analytics system, then go with the star schema.
The simplest way.
Create a table: company name + the info you need to store + a column for year-month.
Ex:
CREATE TABLE tablename (
    id int(11) NOT NULL AUTO_INCREMENT,
    companyname varchar(255),
    info1 int(11) NOT NULL,
    info2 datetime,
    info3 varchar(255),
    info4 bool,
    yearmonth datetime,
    PRIMARY KEY (id)
);

# queries
select * from tablename where companyname = 'nameofthecompany';
select * from tablename where yearmonth = 'year-month'; # can use BETWEEN here

SQL a table for each metadata of other tables

Hi, I have various time series, each with a unique timeseries ID. Given an ID, the series looks something like this (obviously with different dates and data respectively):
datetime    data
1/1/1980    11.6985
1/2/1980    43.6431
1/3/1980    54.9089
1/4/1980    63.1225
1/5/1980    72.4399
1/6/1980    79.1363
1/7/1980    82.2778
1/8/1980    86.0785
These time series have different "types". For instance, suppose that some time series are of "WindData" type, some of "SolarData" type and some of "GasData" type. Given a timeseries ID, it will belong to some type. For instance:
IDs 1, 2, 3 could belong to SolarData
IDs 4, 5 could belong to WindData
ID 6 could belong to GasData.
Time series of the same type (for instance 1, 2, 3) share the same fields of metadata (but not the same values!). For instance WindData could have the fields:
WindTurbineNumber, WindFarmName, Country
while the SolarData could have fields:
SiteName, SolarPanelType
and the GasData could have:
PipelineNumber, CountryOfOrigin, CountryOfDestination
Now, the issue is that as time goes on I could have many, many more types. Therefore, I want a way of generalizing this data-metadata structure. How? My idea would be to have:
A table that, given a timeseries ID, tells me the type of that series (e.g. given 1, it returns SolarData)
A table that, given the type, gives me the column names (and optionally their types)
A table that, given the ID, returns the data.
What database structure would I need?
I cannot figure out how I would create a table (or multiple tables) that could tell me, given a series ID, which metadata fields it needs.
I believe you're not going to find a relational database structure that will really suit your needs here.
Relational databases are designed with a "schema on write" philosophy. We decide what the data we will be getting in the future will look like, then we design a storage structure with that data schema, and then insert data into that schema. Under the right circumstances, this works well, as evidenced by fifty or so years of Boyce-Codd-esque database structures.
It sounds, though, like you want to store your data as you receive it, whatever that shape may be, and then apply a "schema on read" philosophy, extracting the useful bits later, in the form the query requires. That's going to require a NoSQL or NewSQL solution. You could consider any number of appliances to accomplish that, from Hadoop and its related structures like HBase (but not Hive) to CouchDB or Apache Cassandra.
The general idea goes as below. You need a series-kind table, a "father" series table, and some child series tables.
create table dbo.SerieKind
(
    Id int not null primary key
    ,Description varchar(50) not null
    ,ListOfColumns varchar(500) not null
)
create table dbo.Series
(
    Id int not null identity primary key
    ,TimeStamp datetime not null
    ,SerieKindId int not null
)
create table dbo.SolarData
(
    Id int not null primary key identity
    ,SerieId int not null
    ,SiteName varchar(100) -- column types here and below are assumed for illustration
    ,SolarPanelType varchar(50)
)
create table dbo.WindData
(
    Id int not null primary key identity
    ,SerieId int not null
    ,WindTurbineNumber int
    ,WindFarmName varchar(100)
    ,Country varchar(50)
)
create table dbo.GasData
(
    Id int not null primary key identity
    ,SerieId int not null
    ,PipelineNumber int
    ,CountryOfOrigin varchar(50)
    ,CountryOfDestination varchar(50)
)
One "disadvantage": you do need a new table for any new kind of data. The FKs are trivial.
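Fetching a series together with its metadata is then a join through the matching child table; a sketch, assuming kind 1 is SolarData in dbo.SerieKind:

select s.Id, s.TimeStamp, sd.SiteName, sd.SolarPanelType
from dbo.Series s
join dbo.SolarData sd on sd.SerieId = s.Id
where s.SerieKindId = 1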
Edit
As Eric explained, a SQL structure is not that flexible. It's awesome at describing data relations and is really efficient at storing and fetching large chunks of data, not to mention its capabilities in some kinds of processing.
A better solution is maybe a hybrid one: storing the data in a flexible format like JSON inside a Series table, or even using a NoSQL solution or a hybrid of SQL x NoSQL.
The main thing here is how many series kinds you need and how often a new one can come in. A dozen: SQL; a thousand: NoSQL.
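A minimal sketch of that hybrid, assuming SQL Server 2016 or later (which added the built-in JSON functions); the table and field names are illustrative:

create table dbo.SeriesMeta
(
    SerieId int not null primary key
    ,SerieKindId int not null
    ,Metadata nvarchar(max) not null check (isjson(Metadata) = 1)
)
insert into dbo.SeriesMeta values
 (1, 1, N'{"SiteName":"Desert One","SolarPanelType":"Mono"}')
,(4, 2, N'{"WindTurbineNumber":12,"WindFarmName":"North Ridge","Country":"DK"}')
-- Extract a metadata field without needing a dedicated column for it:
select SerieId, json_value(Metadata, '$.SiteName') as SiteName
from dbo.SeriesMeta
where SerieKindId = 1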

Table design about sets of data collection elements

Let me know if you need additional information as this is my first post to the forums.
The design is for clinical studies. The easiest way to explain would be to give an example scenario which applies to all studies/protocols in some shape or form. Say I have:
Study1, Study2, Study3
Study1 has Protocol1, Protocol2, Protocol3
Each protocol has "data collection" (set of forms, questions and sample collections, which can overlap across studies and/or protocols)
All these data collections are scheduled to be completed in clinic visits.
I can build all the relationships between Study, Protocol and Questions similar to a questionnaire/survey design structure. However, this is where it gets tricky with the protocol definitions and how to link protocols back to the data collection items. Some examples:
Protocol1 has a form that needs to be filled in every 3 months after enrollment up to 24 months, then every 6 months.
Protocol1 has a sample collection at 6 months, 15 months, 27 months, and then annually.
Protocol1 has another sample collection which needs to happen at the ages of 4, 5 and 6.
Some data collection items are at enrollment, some are at every visit, etc.
What I want is a "to-do list" for a clinic visit for a specific patient, based on all the relationships between study, protocol and data collection, but I am not sure how to define these conditional criteria for protocols at the back end so that they can be queried. Or am I trying to do something unrealistic?
I am using SQL Server, by the way.
Having set up similar schemas myself, I would recommend you take the approach of generating all future schedule dates and storing them in a table linked to the patient, rather than trying to calculate them "on the fly". I think this will save you a lot of headaches and difficult queries. For example, you could have a table defined like this:
CREATE TABLE PatientSchedule
(
PatientId INT, /* foreign key into Patient table */
ProtocolId INT, /* foreign key into Protocol table */
StudyId INT, /* foreign key into Study table */
DataCollectionId INT, /* foreign key into DataCollection table */
SampleCollectionId INT, /* Foreign key into sample table */
ScheduleDate DATE
)
(you'll obviously need to adapt this based on your particular relationships but hopefully you get the idea).
This table can then be pre-populated for a particular patient when they enroll in a particular Study/Protocol, inserting all scheduled dates up to whatever date is likely to be the realistic maximum in the future.
For a particular clinic visit the "To Do List" query should then be as simple as something like:
SELECT * FROM PatientSchedule
WHERE PatientId = ??
AND ScheduleDate = <clinic visit date>
If schedule dates are changed for any reason in the future it shouldn't be too difficult to update the PatientSchedule table.
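As an illustration of the pre-population step, a recursive CTE can expand a rule like "every 3 months after enrollment up to 24 months, then every 6 months" into concrete rows. This is only a sketch; the IDs and the five-year horizon are hypothetical:

declare @enrollment date = '2024-01-15';

with Visits as
(
    select dateadd(month, 3, @enrollment) as ScheduleDate
    union all
    select dateadd(month,
                   case when datediff(month, @enrollment, ScheduleDate) < 24
                        then 3 else 6 end,
                   ScheduleDate)
    from Visits
    where ScheduleDate < dateadd(year, 5, @enrollment)
)
insert into PatientSchedule (PatientId, ProtocolId, StudyId, DataCollectionId, SampleCollectionId, ScheduleDate)
select 42, 1, 1, 7, null, ScheduleDate -- hypothetical keys
from Visits
option (maxrecursion 200);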

Need advice on SQL philosophy

Before I go asking more questions about the coding, I'd like to first figure out the best method to follow for making my database. I'm running into a problem with how I should go about structuring it to keep everything minimized, and due to its nature I have lots of recurring data to represent.
I design custom shirts and have a variety of different types of shirts for people to choose from that are available in both adult and child sizes of both genders. For example, I have crewneck shirts, raglan sleeves, ringer sleeves and hoodies which are available for men, women, boys, girls and toddlers. The prices are the same for each shirt from the toddler sizes up to 1x in the adult sizes, then 2x, 3x, 4x and 5x are each different prices. Then there's the color options for each kind of shirt which varies, some may have 4 color options, some have 32.
So lets take just the crewneck shirts for an example. Men s-1x, Women s-1x, Boys xs-1x, girls xs-1x and toddlers NB-18months is a total of 22 rows that will be represented in the table and are all the same price. 2X and up only apply to men and women so that's 8 more rows which makes 30 rows total for just the crewneck shirts. When it gets into the color options, there's 32 different colors available for them. If I were to do each and every size for all of them that would be 960 total rows just for the crewneck shirts alone with mainly HIGHLY repeated data for just one minor change.
I thought about it and figured it's best to treat these items in the table as actual items in a stock room, because THEY'RE REALLY THERE in the stock room... you don't have just one box of shirts with a button on the side that turns it into any size or color; you have to deal with the actual shirts and the tedious task of putting them somewhere. So I decided against trying to get outrageous with a bunch of foreign keys and indexes; besides, that gets just as tedious and you wind up having to represent just as much, but with a lot more tables, when you could've just put the data they link to in the first table.
If we take just the other 3 kinds of shirts and apply that same logic with all the colors and sizes, then for those 4 shirts alone there will be 3,840 rows; counting the shirts I've left out, you could say I'm looking at roughly 10,000 rows of data all in one table. This data will grow over time, and I'm wondering what it might turn into while trying to keep it all organized. So I figured maybe the best logic would be to break it down like they do in an actual retail store, which is to separate the departments into men, women, boys, girls and babies. That way I have 5 separate tables that are only queried when the user decides to "go to that department", so if a man wants the men's shirts he doesn't have 7,000+ rows of extra data present that don't even apply to what he's looking for.
Would this be a better way of setting it up? Or would it be better to keep it all as one gigantic table, and just query the "men" shirts from the table in the PHP for the men's section, and the same for women and kids?
My next issue is all the color options that may be available. As I said before, some shirts will have as few as 4 and some as many as 32, so some of those are enough data to form a table all on their own; I could really have a separate table for every kind of shirt. I'll be using a query in PHP to populate my items from the tables so I don't have to code so much in the HTML and JavaScript. That way I can do something like SELECT * FROM table WHERE type='men' and it will take all the men's shirts and auto-populate the markup for each one, so as I add and remove things from the tables it'll automatically be updated. I already have an idea for HOW I'm going to do that, but I can only think so far into it because I haven't decided on a good way to set the tables up, which is what I'd have to structure the call around.
For example, having all the color options of each shirt on the same table versus breaking it down with foreign keys linking to other tables to represent them would be two totally different ways of having to call it forth, so I'm stuck on this and don't really know where to go with it. Any suggestions?
Typically retail organization is by SKU (stock keeping unit). Department and color are attributes of a garment, not the way you identify the garment for the purpose of accounting or stocking.
CREATE TABLE Skus (
    sku BIGINT UNSIGNED PRIMARY KEY,
    description TEXT,
    department VARCHAR(10) NOT NULL,
    color VARCHAR(10) NOT NULL,
    qty_in_stock INT UNSIGNED NOT NULL DEFAULT 0,
    unit_price NUMERIC(9,2) NOT NULL,
    FOREIGN KEY (department) REFERENCES Departments(department),
    FOREIGN KEY (color) REFERENCES Colors(color)
);
This is better than splitting into five tables, because:
You can quickly get a sum of the total value of all your stock.
You can switch the department of a given SKU easily.
When someone buys a few garments, their order line items reference a single table instead of five different tables (a foreign key referencing five alternative tables would be invalid in a relational database).
There are lots of other examples of tasks that are easier if similar entities are stored in one table.
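For example, the first two benefits above are one-liners against the Skus table as defined:

-- Total value of all stock on hand
SELECT SUM(qty_in_stock * unit_price) AS stock_value FROM Skus;

-- Switch the department of a given SKU (the sku value is hypothetical)
UPDATE Skus SET department = 'women' WHERE sku = 1001;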
I know you don't want to break it out into separate tables, but I think going the multiple-table route would be best. However, I don't think it is as bad as you think. My suggestion would be the following (obviously, you'd want to change the names of the fields, but this is a quick representation):
Shirts
- id (primary key)
- description
- men (Y/N)
- women (Y/N)
- boy (Y/N)
- girl (Y/N)
- toddlers (Y/N)
Sizes
- id (primary key)
- shirt_id (foreign key)
- Size
Colors
- id (primary key)
- shirt_id (foreign key)
- Color
Price
- id (primary key)
- shirt_id (foreign key)
- size_id (foreign key)
- price
Having these four tables makes it so that you won't have to store all 10,000 rows in one single table and maintain it, but the data is still all there. Keeping your data separated into its proper places keeps you from replicating needless information.
Want to pull all men's shirts?
SELECT * FROM shirts WHERE men = '1'
To be honest, you should really have at least 5 or 6 tables: one or two containing the labels for sizes and colors (either one table containing all, or one for each), and the other 4 containing the actual data. This will keep your data uniform across everything (example: "Blue" vs "blue"). You know what they say: there is more than one way to skin a cat.
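With that layout, the department page becomes a join rather than a scan of one wide table; a sketch, assuming the field names above:

SELECT s.description, z.Size, c.Color, p.price
FROM Shirts s
JOIN Sizes z ON z.shirt_id = s.id
JOIN Colors c ON c.shirt_id = s.id
JOIN Price p ON p.shirt_id = s.id AND p.size_id = z.id
WHERE s.men = '1';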
You need to think about a database term called 'normalization'. Normalization means that everything has its place in the database and should not be listed twice, but reused as needed. The most common mistake people make is not asking or thinking about what will happen down the road: they put up a database that has next to no normalization, has massive memory consumption due to large datatypes, has no seeding done, and is completely inflexible, which comes at a great cost to change later because it was made without thinking of the future.
There are many levels of normalization, but the most helpful thing is a simple example demonstrating concepts that can be applied to larger things later. This assumes you have access to SQL Server Management Studio (SSMS); HOWEVER, if you are using MySQL or Oracle the principles are still very similar, and the comment sections show what I am getting at. You can run this example yourself if you have SSMS: just paste it in and hit F5. If you don't, just read the comments, although these concepts are better seen in action than envisioned.
Declare @Everything table (PersonID int, OrderID int, PersonName varchar(8), OrderName varchar(8));
insert into @Everything values (1, 1, 'Brett', 'Hat'),(1, 2, 'Brett', 'Shirt'),(1, 3, 'Brett', 'Shoes'),(2,1,'John','Shirt'),(2,2,'John','Shoes');
-- Very basic normalization level, in that I did not even ATTEMPT to separate entities into different tables for reuse.
-- I just insert EVERYTHING as I get it, in one place. This is great for just getting off the ground or testing things,
-- but in the future you won't be able to change this easily, as everything is here, and if there is a lot of data it is hard
-- to move it. When you insert, if you keep adding more and more and more columns it will get slower, as it requires memory
-- for the rows and the columns.
Select Top 10 * from @Everything

declare @Person table ( PersonID int identity, PersonName varchar(8));
insert into @Person values ('Brett'),('John');
declare @Orders table ( OrderID int identity, PersonID int, OrderName varchar(8));
insert into @Orders values (1, 'Hat'),(1,'Shirt'),(1, 'Shoes'),(2,'Shirt'),(2, 'Shoes');
-- I now have two logical things stored in two logical places. If I want to relate them I can use the TSQL language
-- to do so. I am now using less memory for storage of the individual tables, and if one or another becomes too large I can
-- deal with it in isolation. I also have a seeding record (an ever increasing number) that I could use as a primary key to
-- relate row position and for faster indexing.
Select *
from @Person p
join @Orders o on p.PersonID = o.PersonID

declare @TypeOfOrder table ( OrderTypeID int identity, OrderType varchar(8));
insert into @TypeOfOrder values ('Hat'),('Shirt'),('Shoes')
declare @OrderBridge table ( OrderID int identity, PersonID int, OrderType int)
insert into @OrderBridge values (1, 1),(1,2),(1,3),(2,2),(2,3);
-- Wow, I have a lot more columns, but my ability to expand is now pretty flexible. I could add even MORE products to the bridge table,
-- or other tables I have not even thought of yet. Now that I have a bridge table I have to list a product type ONLY once ever, and
-- then when someone orders it again I just add a bridge row to relate a person to an order, hence the name bridge, as on its own
-- it serves nothing but relating two different things to each other. This method takes more time to set up, but in the end you need
-- fewer rows in your database overall, as you are REUSING data efficiently and effectively.
Select Top 10 *
from @Person p
join @OrderBridge o on p.PersonID = o.PersonID
join @TypeOfOrder t on o.OrderType = t.OrderTypeID

Architecture of SQL tables

I am wondering whether it is more useful and practical (in terms of DB size) to create multiple tables in SQL with two columns each (one column containing a foreign key and one column containing the data), or to merge it all and create one table containing multiple columns. I am asking because in my scenario one product holding a primary key could have applicable data for only one column, while the other columns would be empty.
example a. one table
productID  productname  weight  no_of_pages
1          book         130     500
2          watch        50      null
3          ring         null    null
example b. three tables
productID  productname
1          book
2          watch
3          ring

productID  weight
1          130
2          50

productID  no_of_pages
1          500
The multi-table approach is more "normal" (in database terms) because it avoids columns that commonly store NULLs. It's also something of a pain in programming terms because you have to JOIN a bunch of tables to get your original entity back.
I suggest adopting a middle way. Weight seems to be a property of most products, if not all (indeed, a ring has a weight, even if small, and you'll probably want to know it for shipping purposes), so I'd leave that in the Products table. But number of pages applies only to a book, as do a slew of other unmentioned properties (author, ISBN, etc.). In this example, I'd use a Products table and a Books table. The Books table would extend the Products table in a fashion similar to class inheritance in object-oriented programming.
All book-specific properties go into the Books table, and you join only Products and Books to get a complete description of a book.
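A minimal sketch of that middle way (the names are assumed for illustration):

create table Products (
    productID int primary key,
    productname nvarchar(100) not null,
    weight int null -- a property of (nearly) every product
)
create table Books (
    productID int primary key references Products(productID),
    no_of_pages int not null,
    author nvarchar(100) null,
    isbn varchar(20) null
)
-- A complete description of a book is one join away:
select p.productname, p.weight, b.no_of_pages, b.author
from Products p
join Books b on b.productID = p.productID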
I think this all depends on how the tables will be used. Maybe your examples are oversimplifying things too much but it seems to me that the first option should be good enough.
You'd really use the second example if you're going to be doing extremely CPU intensive stuff with the first table and will only need the second and third tables when more information about a product is needed.
If you're going to need the information in the second and third tables most times you query the table, then there's no reason to join over every time and you should just keep it in one table.
I would suggest example a if there is a defined set of attributes for a product, and example c if you need a variable number of attributes (new attributes keep coming every now and then):
example c
productID  productName
1          book
2          watch
3          ring

attrID  productID  attrType     attrValue
1       1          weight       130
2       1          no_of_pages  500
3       2          weight       50
The table structure you have shown in example b is not normalized: separate ID columns would be required in the second and third tables, since productID there would be an FK and not a PK.
It depends on how many rows you are expecting in your PRODUCTS table. I would say it would not make sense to normalize your tables to 3NF in this case, because product name, weight, and no_of_pages each describe the product. If you had repeating data such as manufacturers, it would make more sense to normalize your tables at that point.
Without knowing the background (data model), there is no way to tell which variant is more "correct". Both are fine in certain scenarios.
You want three tables, full stop. That's best because there's no chance of watches winding up with pages (no pun intended) and some books without. If you normalize, the server works for you. If you don't, you do the work instead, just not as well. Up to you.
I am asking this because in my scenario one product holding primary key could have sufficient/applicable data for only one column while other columns would be empty.
That's always true of nullable columns. Here's the rule: a nullable column has an optional relationship to the key. A nullable column can always be, and usually should be, in a separate table where it can be non-null.
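Applying that rule to example b, the wide shape of example a is recovered with outer joins; a sketch (the table names are made up here, since example b leaves them unnamed):

select p.productID, p.productname, w.weight, n.no_of_pages
from products p
left join product_weights w on w.productID = p.productID
left join product_pages n on n.productID = p.productID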