SQL: a table for the metadata of other tables

Hi, I have various time series, each with a unique time series ID. Given an ID, the series looks something like this (obviously with different dates and data, respectively):
datetime data
1/1/1980 11.6985
1/2/1980 43.6431
1/3/1980 54.9089
1/4/1980 63.1225
1/5/1980 72.4399
1/6/1980 79.1363
1/7/1980 82.2778
1/8/1980 86.0785
These time series have different "types". For instance, suppose that some time series are of "WindData" type, some of "SolarData" type and some of "GasData" type. Given a time series ID, it will belong to some type. For instance:
IDs 1, 2, 3 could belong to SolarData
IDs 4, 5 could belong to WindData
ID 6 could belong to GasData.
Time series of the same type (for instance 1, 2, 3) share the same fields of metadata (but not the same values!). For instance, WindData could have the fields:
WindTurbineNumber, WindFarmName, Country
while the SolarData could have fields:
SiteName, SolarPanelType
and the GasData could have:
PipelineNumber, CountryOfOrigin, CountryOfDestination
Now, the issue is that as time goes on I could have many more types. Therefore, I want a way of generalizing this data-metadata structure. How? My idea would be to have:
A table that, given a time series ID, tells me the type of that series (i.e. given 1, it returns SolarData)
A table that, given the type, gives me the metadata column names (and optionally their types)
A table that, given the ID, returns the data.
What database structure would I need?
I cannot figure out how I would create a table (or multiple tables) that could tell me, given a series ID, which metadata fields it needs.

I believe you're not going to find a relational database structure that will really suit your needs here.
Relational databases are designed with a "schema on write" philosophy. We decide what the data we will be getting in the future will look like, then we design a storage structure with that data schema, and then insert data into that schema. Under the right circumstances, this works well, as evidenced by fifty or so years of Boyce-Codd-esque database structures.
It sounds, though, like you want to store your data as you receive it, whatever that shape may be, and then apply a "schema on read" philosophy, extracting the useful bits later, in the form the query requires. That's going to require a NoSQL or NewSQL solution. You could consider any number of appliances to accomplish that, from Hadoop and its related structures like HBase (but not Hive) to CouchDB or Apache Cassandra.

The general idea goes as below. You need a series-kind table, a "parent" series table, and some child tables, one per kind of series.
create table dbo.Seriekind
(
Id int not null primary key
,Description varchar(50) not null
,ListOfColumns varchar(500) not null
)
create table dbo.Series
(
Id int not null identity primary key
,TimeStamp datetime not null
,SerieKindId int not null
)
create table dbo.SolarData
(
Id int not null primary key identity
,SerieId int not null
,SiteName varchar(100) not null -- column types here are illustrative
,SolarPanelType varchar(100) not null
)
create table dbo.WindData
(
Id int not null primary key identity
,SerieId int not null
,WindTurbineNumber int not null -- column types here are illustrative
,WindFarmName varchar(100) not null
,Country varchar(100) not null
)
create table dbo.GasData
(
Id int not null primary key identity
,SerieId int not null
,PipelineNumber int not null -- column types here are illustrative
,CountryOfOrigin varchar(100) not null
,CountryOfDestination varchar(100) not null
)
One "disvantage" you do needs a new table for any new kind of data. FK are trivial.
Edit
As Eric explained, SQL structure is not that flexible. It's great at describing data relations and really efficient at storing and fetching large chunks of data, not to mention its capabilities in some kinds of processing.
A better solution may be a hybrid one: storing the metadata in a flexible format like JSON inside the Series table, or using a NoSQL solution, or a mix of SQL and NoSQL.
The main thing here is how many kinds of series you need and how often a new one can come in. A dozen: SQL. A thousand: NoSQL.
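A sketch of that hybrid idea against the Series table above, assuming SQL Server 2016+ for the JSON functions (the metadata keys are just examples):

-- keep the schema fixed and put per-kind metadata in a JSON column
alter table dbo.Series add Metadata nvarchar(max) null;
-- e.g. '{"SiteName": "Alpha", "SolarPanelType": "mono"}'

-- pull a metadata field out at query time
select Id, json_value(Metadata, '$.SiteName') as SiteName
from dbo.Series
where SerieKindId = 1;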

Related

Store 3-dimensional table in database where 1 dimension increases over time

I have a data set with three dimensions that I would like to store for use with a website:
A list of companies (about 1000)
Information about the company (about 15 things)
Time (monthly)
Essentially, I want to track this information over time and keep it up to date.
When I start, the data will be 1000x15x1, after a year it will be 1000x15x12, and after 10 years it will be 1000x15x120.
The main queries I would make are:
Get all information for one company over all times
Get all information for one particular time
What would be a good database configuration for doing this? I'm open to either SQL or noSQL solutions.
In case it matters, the website is on Google App Engine.
From the relational database schema design perspective:
If the goal is analytics / ad-hoc querying / OLAP only, then you can use a star schema, which is well suited for this type of analytics. But beware: OLAP databases are de-normalized and in general not suitable for operational transaction storage / OLTP, if you are planning to do both on this database.
The beauty of the Star schema:
The fact tables are usually all numeric, which keeps them very small even when there are many records. A small table is very fast to read (I/O).
All joins from the fact table to dimension tables are based on foreign keys (single column, numeric, indexable foreign keys)
All dimension tables have surrogate key, which is a single column primary key. Single column primary key is easier to JOIN than a multi-column primary key and also easier to index.
There are no NULLs in the foreign keys of fact tables. This makes JOIN operations straightforward, i.e. you always JOIN the fact table to all of its dimension tables. If you need a NULL case, you add it as a special case in your dimension table. For example: if a company is not listed on the stock market, and one of the things you track is stock price, then you enter 0 or NULL for the stock price in the fact table (depending on how you want SUM(), AVG(), etc. to behave later), add a special case called 'Private company' to your StockSymbols dimension table, and store the foreign key of that special case in the fact table.
Almost all filtering is done through the dimension tables that are much much smaller than the fact tables. This requires having a Date dimension to be able to do date-based queries.
If you can stay in a pure star schema, then all your JOINs are single-hop (i.e. no join between two tables through another table).
All these makes JOIN operations very fast, simple and straightforward. That's why the Star schema is at the heart of data-warehousing designs.
https://en.wikipedia.org/wiki/Star_schema
https://en.wikipedia.org/wiki/Data_warehouse
One level up from this is OLAP (SSAS, SQL Server Analysis Services, for example), which pre-processes the data to make it fast to query, but it involves more learning than a pure star schema and is overkill in your case.
For your example
In Star schema,
Companies will be a dimension table
You will need a Month dimension table. It's a simplified version of a Date dimension, just for month info. An example of a Date dimension is here:
https://www.codeproject.com/Articles/647950/Create-and-Populate-Date-Dimension-for-Data-Wareho
The information about the company (the 15 things you mention) will go into fact tables. The facts must be numeric (because ideally all non-numeric values are stored in dimension tables). This means moving the non-numeric part of a fact into a dimension table. For example: if you are keeping revenue and would like to keep the currency type too, then you will need a Currency dimension and save only the amount in the fact table, plus a foreign key to the Currency dimension table.
If you have any non-numeric facts, you need to store the distinct list in a dimension table and add a foreign key to that dimension table inside your fact table (this is called a factless fact table). The only exception is when the cardinality of the dimension and the fact table are very similar; then you can store the non-numeric fact value inside the fact table directly, as there is no benefit in having a dimension table (in fact, a disadvantage).
Facts can also be grouped by their granularity. For example, you could have a company_monthly_summary fact table and keep more than one fact in that table (all joining to the Company dimension and the Month dimension). How you group fact tables is up to you, but if their granularities are not the same, they should not be grouped, as that will cause sparse fact tables that are harder to query.
You will use foreign keys in Fact tables to join to your Dimension tables
Add index for your Dimension tables' most used columns
Add a numeric surrogate key to your dimensions. It is usually an auto-increment number, but that's up to you. One exception people prefer for the surrogate key of the Date dimension is the format YYYYMMDD (as an integer). This makes the WHERE clause easier: instead of filtering on the Date column (a DATETIME value), which requires a lookup to find the surrogate keys, you can provide the surrogate keys directly because you know the format. Depending on your business domain, you may have other similarly useful surrogate key patterns to consider. Just know that, in case of a business domain change, you will have to update all fact records; a simple auto-increment surrogate key does not have that problem. In your case, the surrogate key for the month can be the actual month number (1 for Jan).
That being said, 1 million rows in 5 years is easy to query even without a star-schema design (with proper indexing and database maintenance). But if this is part of a larger analytics system, then go with the star schema.
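To make that concrete, a minimal sketch of what such a star schema could look like for this case (table, column and measure names are placeholders, not a prescribed design):

-- Dimension: one row per company.
create table dim_company (
    company_key  int primary key,        -- surrogate key
    company_name varchar(255) not null
);

-- Dimension: one row per month; surrogate key in YYYYMM form.
create table dim_month (
    month_key    int primary key,        -- e.g. 201601
    year_number  int not null,
    month_number int not null
);

-- Fact: one row per company per month, numeric measures only.
create table fact_company_monthly (
    company_key    int not null references dim_company (company_key),
    month_key      int not null references dim_month (month_key),
    revenue        decimal(18,2) not null,
    employee_count int not null,
    -- remaining numeric measures go here
    primary key (company_key, month_key)
);

-- "All information for one company over all times":
select m.year_number, m.month_number, f.revenue, f.employee_count
from fact_company_monthly f
join dim_company c on c.company_key = f.company_key
join dim_month   m on m.month_key   = f.month_key
where c.company_name = 'Acme';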
The simplest way.
Create a table with companyname + the info you need to store + a column for year-month.
Ex:
CREATE TABLE tablename (
    id int(11) NOT NULL AUTO_INCREMENT,
    companyname varchar(255),
    info1 int(11) NOT NULL,
    info2 datetime,
    info3 varchar(255),
    info4 bool,
    yearmonth datetime,
    PRIMARY KEY (id)
);
#queries
select * from tablename where companyname="nameofthecompany";
select * from tablename where yearmonth="year-month"; #can use between here

Efficient storage pattern for millions of values of different types

I am about to build an SQL database that will contain the results of statistics calculations for hundreds of thousands of objects. It is planned to use Postgres, but the question equally applies to MySQL.
For example, hypothetically, let's assume I have half a million records of phone calls. Each PhoneCall will now, through a background job system, have statistics calculated. For example, a PhoneCall has the following statistics:
call_duration: in seconds (float)
setup_time: in seconds (float)
dropouts: periods in which audio dropout was detected (array), e.g. [5.23, 40.92]
hung_up_unexpectedly: true or false (boolean)
These are just simple examples; in reality, the statistics are more complex. Each statistic has a version number associated with it.
I am unsure as to which storage pattern for these type of calculated data will be the most efficient. I'm not looking into fully normalizing everything in the database though. So far, I have come up with the following options:
Option 1 – long format in one column
I store the statistic name and its value in one column each, with a reference to the main transaction object. The value column is a text field; the value will be serialized (e.g. as JSON or YAML) so that different types (strings, arrays, ...) can be stored. The database layout for the statistics table would be:
statistic_id (PK)
phone_call_id (FK)
statistic_name (string)
statistic_value (text, serialized)
statistic_version (integer)
created_at (datetime)
I have worked with this pattern for a while, and what's good about it is that I can easily filter statistics according to phone call and the statistic name. I can also add new types of statistics easily and filter by version and creation time.
But it seems to me that the (de)serialization of values makes it quite inefficient for handling lots of data. Also, I cannot perform calculations at the SQL level; I always have to load and deserialize the data. Or is the JSON support in Postgres good enough that I could still pick this pattern?
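Written out as DDL, option 1 would look roughly like this (Postgres syntax; the exact types are just what I have in mind):

create table statistics (
    statistic_id      bigserial primary key,
    phone_call_id     bigint not null,    -- FK to the phone calls table
    statistic_name    varchar(255) not null,
    statistic_value   text,               -- serialized JSON/YAML
    statistic_version integer not null,
    created_at        timestamp not null default now()
);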
Option 2 – statistics as attributes of main object
I could also think about collecting all types of statistic names and adding them as new columns to the phone call object, e.g.:
id (PK)
call_duration
setup_time
dropouts
hung_up_unexpectedly
...
This would be very efficient, and each column would have its own type, but I can no longer store different versions of statistics, or filter them according to when they were created. The whole business logic of statistics disappears. Adding new statistics is also not easy, since the names are baked into the schema.
Option 3 – statistics as different columns
This would probably be the most complex. I am storing only a reference to the statistic type, and the column will be looked up according to that:
statistic_id (PK)
phone_call_id (FK)
statistic_name (string)
statistic_value_bool (boolean)
statistic_value_string (string)
statistic_value_float (float)
statistic_value_complex (serialized or complex data type)
statistic_value_type (string that indicates bool, string etc.)
statistic_version (integer)
created_at (datetime)
This would mean that the table is going to be very sparse, as only one of the statistic_value_ columns would be populated. Could that lead to performance issues?
Option 4 – normalized form
Trying to normalize option 3, I would create two tables:
statistics
id (PK)
version
created_at
statistic_mapping
phone_call_id (FK)
statistic_id (FK)
statistic_type_mapping
statistic_id (FK)
type (string, indicates bool, string etc.)
statistic_values_boolean
statistic_id (FK)
value (bool)
…
But this isn't going anywhere since I can't dynamically join to another table name, can I? Or should I anyway then just join to all statistic_values_* tables based on the statistic ID? My application would have to make sure that no duplicate entries exist then.
To summarize, given this use case, what would be the most efficient approach for storing millions of statistic values in a relational DB (e.g. Postgres), when the requirement is that statistic types may be added or changed, and that several versions exist at the same time, and that querying of the values should be somewhat efficient?
IMO you can use the following simple database structure to solve your problem.
Statistics type dictionary
A very simple table - just name and description of the stat. type:
create table stat_types (
type text not null constraint stat_types_pkey primary key,
description text
);
(You can replace it with enum if you have a finite number of elements)
Stat table for every type of objects in the project
It contains FK to the object, FK to the stat. type (or just enum) and, this is important, the jsonb field with an arbitrary stat. data related to its type. For example, such a table for phone calls:
create table phone_calls_statistics (
phone_call_id uuid not null references phone_calls,
stat_type text not null references stat_types,
data jsonb,
constraint phone_calls_statistics_pkey primary key (phone_call_id, stat_type)
);
I assume here that table phone_calls has uuid type of its PK:
create table phone_calls (
id uuid not null constraint phone_calls_pkey primary key
-- ...
);
The data field has a different structure which depends on its stat. type. Example for call duration:
{
"call_duration": 120.0
}
or for dropouts:
{
"dropouts": [5.23, 40.92]
}
Let's play with data:
insert into phone_calls_statistics values
('9fc1f6c3-a9d3-4828-93ee-cf5045e93c4c', 'CALL_DURATION', '{"call_duration": 100.0}'),
('86d1a2a6-f477-4ed6-a031-b82584b1bc7e', 'CALL_DURATION', '{"call_duration": 110.0}'),
('cfd4b301-bdb9-4cfd-95db-3844e4c0625c', 'CALL_DURATION', '{"call_duration": 120.0}'),
('39465c2f-2321-499e-a156-c56a3363206a', 'CALL_DURATION', '{"call_duration": 130.0}'),
('9fc1f6c3-a9d3-4828-93ee-cf5045e93c4c', 'UNEXPECTED_HANGUP', '{"unexpected_hungup": true}'),
('86d1a2a6-f477-4ed6-a031-b82584b1bc7e', 'UNEXPECTED_HANGUP', '{"unexpected_hungup": true}'),
('cfd4b301-bdb9-4cfd-95db-3844e4c0625c', 'UNEXPECTED_HANGUP', '{"unexpected_hungup": false}'),
('39465c2f-2321-499e-a156-c56a3363206a', 'UNEXPECTED_HANGUP', '{"unexpected_hungup": false}');
Get the average, min and max call duration:
select
avg((pcs.data ->> 'call_duration')::float) as avg,
min((pcs.data ->> 'call_duration')::float) as min,
max((pcs.data ->> 'call_duration')::float) as max
from
phone_calls_statistics pcs
where
pcs.stat_type = 'CALL_DURATION';
Get the number of unexpected hung ups:
select
sum(case when (pcs.data ->> 'unexpected_hungup')::boolean is true then 1 else 0 end) as hungups
from
phone_calls_statistics pcs
where
pcs.stat_type = 'UNEXPECTED_HANGUP';
I believe that this solution is very simple and flexible, has good performance potential and scales well. The main table has a simple (primary key) index and all queries run against it. You can always extend the number of stat. types and their calculations.
Live example: https://www.db-fiddle.com/f/auATgkRKrAuN3jHjeYzfux/0
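If filtering ever becomes a bottleneck, additional indexes are an option; a hedged sketch, assuming PostgreSQL (not part of the original fiddle):

-- B-tree index so the stat_type filters don't have to scan the whole table:
create index phone_calls_statistics_stat_type_idx
    on phone_calls_statistics (stat_type);

-- GIN index on the jsonb column, useful if you later query by containment,
-- e.g. WHERE data @> '{"unexpected_hungup": true}':
create index phone_calls_statistics_data_idx
    on phone_calls_statistics using gin (data);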

Strategies to store extra information about models without too many column names (alternatives to DB normalization and model subclassing)

Say you had a Model called Forest. Each object represents a forest on your continent. There is a set of data that is common to all these forests, like forest type, area etc., and these can be easily represented by columns on the SQL table, forest.
However, imagine that these forests have additional data that might not apply to all of them. For example, the 20 coniferous forests have a pine-fir split ratio, whereas the deciduous forests have an autumn-duration number. One way would be to store all these columns on the main table itself, but then there would be too many columns on each row, with many columns remaining unfilled by definition.
The most obvious way around this is to make subclasses of the Forest model and have a separate table for each subclass. I feel that's a heavy-handed approach that I would rather not follow. If I need some data about the generic forest, I'll have to consult another table.
Is there a pattern to solve this problem? What solution do you usually prefer?
NOTE: I have seen the other questions about this. The solutions proposed were:
Subtyping, same as I proposed above.
Have all the columns on the same table.
Have separate tables for each kind of forest, with common data like area and rainfall... duplicated.
Is there an inventive solution that I don't know of?
UPDATE: I have run into the EAV model, and also a modified version where the unpredictable fields are stored out in a NoSQL/JSON store, and the id for that is held in the RDB. I like both, but welcome suggestions in this direction.
On the database side, the best approach is often to store attributes common to all forests in one table, and to store unique attributes in other tables. Build updatable views for clients to use.
create table forests (
forest_id integer primary key,
-- Assumes forest names are not unique on a continent.
forest_name varchar(45) not null,
forest_type char(1) not null
check (forest_type in ('c', 'd')),
area_sq_km integer not null
check (area_sq_km > 0),
-- Other columns common to all forests go here.
--
-- This constraint lets foreign keys target the pair
-- of columns, guaranteeing that a row in each subtype
-- table references a row here having the same subtype.
unique (forest_id, forest_type)
);
create table coniferous_forests_subtype (
forest_id integer primary key,
forest_type char(1) not null
default 'c'
check (forest_type = 'c'),
pine_fir_ratio float not null
check (pine_fir_ratio >= 0),
foreign key (forest_id, forest_type)
references forests (forest_id, forest_type)
);
create table deciduous_forests_subtype (
forest_id integer primary key,
forest_type char(1) not null
default 'd'
check (forest_type = 'd'),
autumn_duration_days integer not null
check (autumn_duration_days between 20 and 100),
foreign key (forest_id, forest_type)
references forests (forest_id, forest_type)
);
Clients usually use updatable views, one for each subtype, instead of using the base tables. (You can revoke privileges on the base subtype tables to guarantee this.) You might want to omit the "forest_type" column.
create view coniferous_forests as
select t1.forest_id, t1.forest_type, t1.area_sq_km,
t2.pine_fir_ratio
from forests t1
inner join coniferous_forests_subtype t2
on t1.forest_id = t2.forest_id;
create view deciduous_forests as
select t1.forest_id, t1.forest_type, t1.area_sq_km,
t2.autumn_duration_days
from forests t1
inner join deciduous_forests_subtype t2
on t1.forest_id = t2.forest_id;
What you have to do to make these views updatable varies a little with the dbms, but expect to write some triggers (not shown). You'll need triggers to handle all the DML actions--insert, update, and delete.
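As a rough illustration, an INSTEAD OF INSERT trigger for the coniferous view might look like the sketch below (assuming PostgreSQL 11+, and assuming the view is extended to also expose forest_name, since forests.forest_name is NOT NULL); the UPDATE and DELETE triggers follow the same pattern:

-- Sketch only: assumes coniferous_forests also selects t1.forest_name.
create or replace function coniferous_forests_ins_fn() returns trigger as $$
begin
    insert into forests (forest_id, forest_name, forest_type, area_sq_km)
    values (new.forest_id, new.forest_name, 'c', new.area_sq_km);

    insert into coniferous_forests_subtype (forest_id, forest_type, pine_fir_ratio)
    values (new.forest_id, 'c', new.pine_fir_ratio);

    return new;
end;
$$ language plpgsql;

create trigger coniferous_forests_insert
    instead of insert on coniferous_forests
    for each row
    execute function coniferous_forests_ins_fn();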
If you need to report only on columns that appear in "forests", then just query the table "forests".
Well, the easiest way is putting all the columns into one table and then having a "type" field to decide which columns to use. This works for smaller tables, but for more complicated cases it can lead to a big messy table and issues with database constraints (such as NULLs).
My preferred method would be something like this:
A generic "Forests" table with: id, type, [generic_columns, ...]
"Coniferous_Forests" table with: id, forest_id (FK to Forests), ...
So, in order to get all the data for a Coniferous Forest with id of 1, you'd have a query like so:
SELECT * FROM Coniferous_Forests INNER JOIN Forests
ON Coniferous_Forests.forest_id = Forests.id
AND Coniferous_Forests.id = 1
As for inventive solutions, there is such a thing as an OODBMS (Object Oriented Database Management System).
The most popular alternative to Relational SQL databases are Document-Oriented NoSQL databases like MongoDB. This is comparable to using JSON objects to store your data, and allows you to be more flexible with your database fields.

What is the preferred way of saving dynamic lists in database?

In our application, users can create different lists (like SharePoint); for example, a user can create a list of cars (name, model, brand) and a list of students (name, dob, address, nationality), etc.
Our application should be able to query on different columns of the list, so we can't just serialize each row and save it in a single field.
Should I create a new table at runtime for each newly created list? If this were the best solution, then Microsoft SharePoint would probably have done it as well, I suppose?
Should I use the following schema
Lists (Id, Name)
ListColumns (Id, ListId, Name)
ListRows (Id, ListId)
ListData(RowId, ColumnId, Value)
A single row will create as many rows in the ListData table as there are columns in the list, though, and this just doesn't feel right.
Have you dealt with this situation? How did you handle it in database?
What you describe is called EAV (the Entity-Attribute-Value model).
For a list with 3 columns and 1000 entries:
1 record in Lists
3 records in ListColumns
and 3000 Entries in ListData
This is fine. I'm not a fan of creating tables on-the-fly because it could mess up your database and you would have to "generate" your SQL queries dynamically. I would get a strange feeling when users could CREATE/DROP/ALTER Tables in my database!
Another nice feature of the EAV model is that you could merge two lists easily without dropping or altering a table.
Edit:
I think you need another table called ListRows that tells you which ListData records belong together in a row!
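For reference, the four tables from the question written out as DDL might look roughly like this (types and lengths are illustrative):

create table Lists (
    Id   int primary key,
    Name varchar(100) not null
);

create table ListColumns (
    Id     int primary key,
    ListId int not null references Lists (Id),
    Name   varchar(100) not null
);

create table ListRows (
    Id     int primary key,
    ListId int not null references Lists (Id)
);

create table ListData (
    RowId    int not null references ListRows (Id),
    ColumnId int not null references ListColumns (Id),
    Value    varchar(255),
    primary key (RowId, ColumnId)
);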
Well, I've experienced something like this before - I don't want to share the actual table schema, so let's do some thought exercises using some of the suggested table structures:
Lets have a lists table containing a list of all my lists
Lets also have a columns table containing the metadata (column names)
Now we need a values table which contains the column values
We also need a rows table which contains a list of all the rows, otherwise it gets very difficult to work out how many rows there actually are
To keep things simple, let's just make everything a string (VARCHAR) and have a go at coming up with some queries:
Counting all the rows in a table
SELECT COUNT(*) FROM [rows]
JOIN [lists]
ON [rows].list_id = [Lists].id
WHERE [Lists].name = 'Cars'
Hmm, not too bad, compared to:
SELECT * FROM [Cars]
Inserting a row into a table
BEGIN TRANSACTION
DECLARE @row_id INT
DECLARE @list_id INT
SELECT @list_id = id FROM [lists] WHERE name = 'Cars'
INSERT INTO [rows] (list_id) VALUES (@list_id)
SELECT @row_id = @@IDENTITY
DECLARE @column_id INT
-- === Need one of these for each column ===
SELECT @column_id = id FROM [columns]
WHERE name = 'Make'
AND list_id = @list_id
INSERT INTO [values] (column_id, row_id, value)
VALUES (@column_id, @row_id, 'Rover')
-- === Need one of these for each column ===
SELECT @column_id = id FROM [columns]
WHERE name = 'Model'
AND list_id = @list_id
INSERT INTO [values] (column_id, row_id, value)
VALUES (@column_id, @row_id, 'Metro')
COMMIT TRANSACTION
Um, starting to get a little bit hairy compared to:
INSERT INTO [Cars] ([Make], [Model]) VALUES ('Rover', 'Metro')
Simple queries
I'm now getting bored of constructing tediously complex SQL statements, so maybe you can have a go at coming up with equivalent queries for the following statements:
SELECT [Model] FROM [Cars] WHERE [Make] = 'Rover'
SELECT [Cars].[Make], [Cars].[Model], [Owners].[Name] FROM [Cars]
JOIN [Owners] ON [Owners].id = [Cars].owner_id
WHERE [Owners].Age > 50
SELECT [Cars].[Make], [Cars].[Model], [Owners].[Name] FROM [Cars]
JOIN [Owners] ON [Owners].id = [Cars].owner_id
JOIN [Addresses] ON [Addresses].id = [Owners].address_id
WHERE [Addresses].City = 'London'
I hope you are beginning to get the idea...
In short - I've experienced this before and I can assure you that creating a database inside a database in this way is definitely a Bad Thing.
If you need to do anything but the most basic querying on these lists (and literally I mean "Can I have all the items in this list please?"), you should try and find an alternative.
As long as each user pretty much has their own database I'll definitely recommend the CREATE TABLE approach. Even if they don't I'd still recommend that you at least consider it.
Perhaps a potential solution would be for the creation of lists to involve CREATE TABLE statements for those entities/lists?
It sounds like the db structure or schema can change at runtime, or at the user's command, so perhaps something like this might help?
User wants to create a new list of an entity never seen before. Call it Computer.
User defines the attributes (screensize, CpuSpeed, AmountRAM, NumberOfCores)
The system allows the user to create it in the UI.
The system generally treats all the attributes as strings, unless it can tell that all supplied values are in fact dates or numbers.
Build the CREATE scripts and execute them against the DB (see the sketch after these steps).
Insert the data that the user defined into that new table.
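A hypothetical example of what the generated DDL could look like for the "Computer" list above (the table name, types and nullability are all illustrative):

CREATE TABLE user_list_Computer (
    Id            INT IDENTITY(1, 1) PRIMARY KEY,
    screensize    VARCHAR(255) NULL,   -- defaulted to string
    CpuSpeed      VARCHAR(255) NULL,   -- defaulted to string
    AmountRAM     INT NULL,            -- all supplied values were numbers
    NumberOfCores INT NULL             -- all supplied values were numbers
);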
Properly coded, we're working with the requirements given: let users create new entities. There was no mention of scale here. Of course, this requires all input to be sanitized, queries parameterized, actions logged, etc.
The negative comment below doesn't actually give any good reasons, but creates a bit of FUD. I'd be interested in addressing any concerns with this potential solution. We haven't heard about scale, security, performance, or usage (internal LAN vs. internet).
You should absolutely not dynamically create tables when your users create lists. That isn't how databases are meant to work.
Your schema is correct, and the pluralization is, in my opinion, also correct, though I would remove the camel case and call them lists, list_columns, list_rows and list_data.
I would further improve upon your schema by skipping rows and columns tables, they serve no purpose. Simply have a row/column number attached to each cell, and keep things sparse: Don't bother holding empty cells in the database. You retain the ability to query/sort based on row/column, your queries will be (potentially very much) faster because the number of list_cells will be reduced, and you won't have to do any crazy joining to link your data back to its table.
Here is the complete schema:
create table lists (
id int primary key,
name varchar(25) not null
);
create table list_cells (
id int primary key,
list_id int not null references lists(id)
on delete cascade on update cascade,
row int not null,
col int not null,
data varchar(25) not null
);
It sounds like you might have Sharepoint already deployed in your environment.
Consider integrating your application with Sharepoint, and have it be your datastore. No need to recreate all the things you like about Sharepoint, when you could leverage it.
It'd take a bit of configuring, but you could call SP web services to CRUD your list data for you.
inserting list data into Sharepoint via web services
reading SP lists via web services
Sharepoint 2010 can also expose lists via OData, which would be simple to consume from any application.

Generate unique ID to share with multiple tables SQL 2008

I have a couple of tables in a SQL 2008 server that I need to generate unique ID's for. I have looked at the "identity" column but the ID's really need to be unique and shared between all the tables.
So if I have, say, (5) five tables of the flavour "asset infrastructure" and I want to run with a unique ID between them as a combined group, I need some sort of generator that looks at all (5) five tables and issues the next ID which is not duplicated in any of those (5) five tables.
I know this could be done with some sort of stored procedure but I'm not sure how to go about it. Any ideas?
The simplest solution is to set your identity seeds and increment on each table so they never overlap.
Table 1: Seed 1, Increment 5
Table 2: Seed 2, Increment 5
Table 3: Seed 3, Increment 5
Table 4: Seed 4, Increment 5
Table 5: Seed 5, Increment 5
The identity column mod 5 will tell you which table the record is in. You will use up your identity space five times faster so make sure the datatype is big enough.
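For example, the staggered identities could be declared like this (SQL Server syntax; table and column names are placeholders):

CREATE TABLE Asset1 (Id BIGINT IDENTITY(1, 5) PRIMARY KEY, Name VARCHAR(100) NOT NULL);
CREATE TABLE Asset2 (Id BIGINT IDENTITY(2, 5) PRIMARY KEY, Name VARCHAR(100) NOT NULL);
CREATE TABLE Asset3 (Id BIGINT IDENTITY(3, 5) PRIMARY KEY, Name VARCHAR(100) NOT NULL);
CREATE TABLE Asset4 (Id BIGINT IDENTITY(4, 5) PRIMARY KEY, Name VARCHAR(100) NOT NULL);
CREATE TABLE Asset5 (Id BIGINT IDENTITY(5, 5) PRIMARY KEY, Name VARCHAR(100) NOT NULL);
-- Id % 5 identifies the table: 1 => Asset1, 2 => Asset2, ..., 0 => Asset5.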
Why not use a GUID?
You could let them each have an identity that seeds from numbers far enough apart never to collide.
GUIDs would work but they're butt-ugly, and non-sequential if that's significant.
Another common technique is to have a single-column table with an identity that dispenses the next value each time you insert a record. If you need them pulling from a common sequence, it may also be useful to have a second column indicating which table the value was dispensed to.
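A sketch of that dispenser idea (SQL Server syntax; names are placeholders):

CREATE TABLE IdDispenser (
    Id          BIGINT IDENTITY(1, 1) PRIMARY KEY,
    TargetTable VARCHAR(50) NULL   -- optional: which table the id was issued to
);

-- Get the next shared id:
INSERT INTO IdDispenser (TargetTable) VALUES ('Table1');
SELECT SCOPE_IDENTITY() AS NextId;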
You realize there are logical design issues with this, right?
Reading into the design a bit, it sounds like what you really need is a single table called "Asset" with an identity column, and then either:
a) 5 additional tables for the subtypes of assets, each with a foreign key to the primary key on Asset; or
b) 5 views on Asset that each select a subset of the rows and then appear (to users) like the 5 original tables you have now.
If the columns on the tables are all the same, (b) is the better choice; if they're all different, (a) is the better choice. This is a classic DB spin on the supertype / subtype relationship.
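A minimal sketch of option (a), with placeholder names for the supertype and one subtype:

CREATE TABLE Asset (
    AssetId   INT IDENTITY(1, 1) PRIMARY KEY,
    AssetName VARCHAR(100) NOT NULL
    -- columns common to every asset type go here
);

CREATE TABLE NetworkAsset (          -- one of the five subtype tables
    AssetId INT NOT NULL PRIMARY KEY
        REFERENCES Asset (AssetId)
    -- columns specific to this subtype go here
);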
Alternately, you could do what you're talking about and recreate the IDENTITY functionality yourself with a stored proc that wraps INSERT access on all 5 tables. Note that you'll have to put a TRANSACTION around it if you want guarantees of uniqueness, and if this is a popular table, that might make it a performance bottleneck. If that's not a concern, a proc like that might take the form:
CREATE PROCEDURE InsertAsset_Table1
AS
BEGIN
BEGIN TRANSACTION
-- SELECT MIN INTEGER NOT ALREADY USED IN ANY OF THE FIVE TABLES
-- INSERT INTO Table1 WITH THAT ID
COMMIT TRANSACTION -- or roll back on error, etc.
END
Again, SQL is highly optimized for helping you out if you choose the patterns I mention above, and NOT optimized for this kind of thing (there's overhead with creating the transaction AND you'll be issuing shared locks on all 5 tables while this process is going on). Compare that with using the PK / FK method above, where SQL Server knows exactly how to do it without locks, or the view method, where you're only inserting into 1 table.
I found this when searching on Google. I am facing a similar problem for the first time. I had the idea to have a dedicated ID table specifically to generate the IDs, but I was unsure whether that was considered OK design. So I just wanted to say THANKS for the confirmation. It looks like it is an adequate solution, although not ideal.
I have a very simple solution. It should be good for cases when the number of tables is small:
create table T1(ID int primary key identity(1,2), rownum varchar(64))
create table T2(ID int primary key identity(2,2), rownum varchar(64))
insert into T1(rownum) values('row 1')
insert into T1(rownum) values('row 2')
insert into T1(rownum) values('row 3')
insert into T2(rownum) values('row 1')
insert into T2(rownum) values('row 2')
insert into T2(rownum) values('row 3')
select * from T1
select * from T2
drop table T1
drop table T2
This is a common problem, for example when using a table of people (called PERSON, singular, please) where each person is categorized, for example Doctors, Patients, Employees, Nurses, etc.
It makes a lot of sense to create a table for each of these categories that contains their specific information, like an employee's start date and salary, or a nurse's qualifications and number.
A Patient, for example, may have many nurses and doctors that work on him, so a many-to-many table that links a Patient to other people in the PERSON table facilitates this nicely. In this table there should be some description of the relationship between these people, which leads us back to the categories for people.
Since a Doctor and a Patient could create the same Primary Key ID in their own tables, it becomes very useful to have a Globally unique ID or Object ID.
A good way to do this, as suggested, is to have a table designated to auto-increment the primary key. Perform an INSERT on that table first to obtain the OID, then use it for the new PERSON.
I like to go a step further. When things get ugly (some new developer gets his hands on the database, or even worse, a really old developer), it's very useful to add more meaning to the OID.
Usually this is done programmatically, not with the database engine, but if you use a BIGINT for all the primary key IDs then you have lots of room to prefix the number with a visually identifiable sequence. For example, all Doctor IDs could begin with 100, all Patients with 110, all Nurses with 120.
To that I would append, say, a Julian date or a Unix date+time, and finally append the auto-increment ID.
This would result in numbers like:
110,2455892,00000001
120,2455892,00000002
100,2455892,00000003
Since the Julian date 100 years from now is only 2492087, you can see that 7 digits will adequately store this value.
A BIGINT is a 64-bit (8-byte) signed integer with a range of -9.22x10^18 to 9.22x10^18 (-2^63 to 2^63 - 1). Notice the exponent is 18; that's 18 digits you have to work with.
Using this design, you are limited to 100 million OIDs, 999 categories of people and dates up to... well past the shelf life of your database, but I suspect that's good enough for most solutions.
The operations required to create an OID like this are all multiplication and division, which avoids all the gear-grinding of text manipulation.
The disadvantage is that INSERTs require more than a simple T-SQL statement, but the advantage is that when you are tracking down errant data or even being clever in your queries, your OID visually tells you a lot more than a random number or, worse, an eyesore like a GUID.
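To make the arithmetic concrete, a sketch in T-SQL using the example numbers above (category 110, Julian date 2455892, sequence 1):

DECLARE @category BIGINT = 110,
        @julian   BIGINT = 2455892,
        @seq      BIGINT = 1;

SELECT @category * 1000000000000000   -- shift past 7 date digits + 8 sequence digits
     + @julian   * 100000000          -- shift past the 8 sequence digits
     + @seq      AS Oid;              -- 110245589200000001

-- Decomposing is division and modulo:
--   category = Oid / 1000000000000000
--   julian   = (Oid / 100000000) % 10000000
--   seq      = Oid % 100000000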