DateTime as part of PK in FACT Table for warehouses - sql

I know generally its not good idea to have a DateTime column as your PK, however for my situation I believe it makes more sense than having a surrogate key in the Fact Table.
The reasons are...
The data inserted into the fact table is always sequential. i.e. I would never insert date time value that is older than the last value already in Fact table.
The date time field is not the only column of the PK (composite PK), The PK is of course itself and the dimension FK's surrogate key.
The way i be querying the data is nearly always based on time.
A surrogate key on the Fact table would tell me nothing about the row. Each row is already unique and to find that particular fact I would always filter on the Date time first and the values in the dimensions.
There is no separate datetime dimension table. No requirement now or in the foreseeable future to have named points in time etc.
Side notes - time will be in UTC and using SQL 2008 R2.
What I'm asking is are giving the situation - what are the disadvantages to doing this? Will I come up against unforeseen issues?
Is this actaully a good thing to be doing when querying that data back later?
Would like to know peoples view points on a DateTime field as the first column of a composite PK.

It's almost an essential feature of any data warehouse that date/time is a component of a key in most tables. There's nothing "wrong" with that.
A surrogate key generally shouldn't be the only key of a table, so perhaps your question is really "Should I create a surrogate key on my table as well?". My suggestion is that if you don't have a reason to create a surrogate key then don't. The time to create a surrogate is when you find that you need it.

Most fact tables have composite keys and date-time or often DateKey, TimeKey are part of it. Actually, quite common.
The dimDate and dimTime are simply used to avoid having "funny" date-time functions in the WHERE clause of a query. For example
-- sales on
-- weekends for previous 28 weeks
--
select sum(f.SaleAmount)
from factSale as f
join dimDate as d on d.DateKey = f.DateKey
where d.IsWeekend = 'yes'
and d.WeeksAgo between 1 and 28 ;
So here I can have indexes on IsWeekend and WeeksAgo (DateKey too). If these were replaced by date-time functions, this would cause row-by-row processing.

Related

Store 3-dimensional table in database where 1 dimension increases over time

I have a data set with three dimensions that I would like to store for use with a website:
A list of companies (about 1000)
Information about the company (about 15 things)
Time (monthly)
Essentially, I want to track this information over time and keep it up to date.
When I start, the data will be 1000x15x1, after a year it will be 1000x15x12, and after 10 years if will be 1000x15x120.
The main queries I would make are:
Get all information for one company over all times
Get all information for one particular time
What would be a good database configuration for doing this? I'm open to either SQL or noSQL solutions.
In case it matters, the website is on Google App Engine.
From the relational database schema design perspective:
If the goal is analytics / ad-hoc querying / OLAP in general only, then you can use star-schema which is well suited for these type of analytics. But beware, OLAP databases are de-normalized and not suitable for operational transaction storage / OLTP in general, if you are planning to do both on this database.
The beauty of the Star schema:
The fact tables are usually all numeric, making the tables very small even though there are too many records. Small table means it is very fast to read (I/O).
All joins from the fact table to dimension tables are based on foreign keys (single column, numeric, indexable foreign keys)
All dimension tables have surrogate key, which is a single column primary key. Single column primary key is easier to JOIN than a multi-column primary key and also easier to index.
There is no NULL in foreign keys in fact tables. This makes JOIN operations straightforward, i.e. always JOIN fact table to all of its dimension tables. If you need NULL case, you need to add that as a special case in your dimension table. For example: if a company is not listed on stock market, and one of the thing you track is stock price, then you enter 0 or NULL into the fact for the stock price table depending on (how you want to do SUM(), AVG() etc later) and then add a special case into your StockSymbols dimension table called 'Private company' and add the foreign key of this special case into the fact table as your foreign key.
Almost all filtering is done through the dimension tables that are much much smaller than the fact tables. This requires having a Date dimension to be able to do date-based queries.
If you can stay in pure Star schema, then all yours JOINs are single hop (i.e. no join between two tables through another table).
All these makes JOIN operations very fast, simple and straightforward. That's why the Star schema is at the heart of data-warehousing designs.
https://en.wikipedia.org/wiki/Star_schema
https://en.wikipedia.org/wiki/Data_warehouse
One level up from this is OLAP (SSAS SQL Server Analyses Services for example) which does pre-processing of the data to make it fast to query but it involves more learning than pure start-schema and it's an overkill in your case
For your example
In Star schema,
Companies will be a dimension table
You will need Month dimension table. It's simplified version of Date dimension, just for month info. An example of Date dimension is here.
https://www.codeproject.com/Articles/647950/Create-and-Populate-Date-Dimension-for-Data-Wareho
The information about the company (15 things you say) will be fact tables. The facts must be numeric (b/c ideally all non-numeric values is saved in dimension tables). This means taking the non-numeric part of a fact to a dimension table. For example: if you are keeping revenue and would like to keep the currency type too, then the you will need a Currency dimension and save only the amount in the fact table and a foreign key to the Currency dimension table.
If you have any non-numeric facts, you need to store the distinct list in a dimension table and add foreign key to that dimension table inside your Fact table (this is called factless fact table). The only exception to that is if the cardinality of the dimension and the fact table is very similar, then you can just store the non-numeric fact value inside the fact table directly as there is no benefit in having a dimension table (in fact a disadvantage).
Also the facts can be grouped by their granularity. For example you could have company_monthly_summary fact table and keep more than one fact in that table (which are all joining to Company dimension and Month dimension). This is all up-to-you how you would like to group facts table. But if their granularity are not the same, they should not be grouped as that will cause sparse fact tables and harder to query.
You will use foreign keys in Fact tables to join to your Dimension tables
Add index for your Dimension tables' most used columns
Add a numeric surrogate key to your dimension. It is usually an auto-increment number but that's up-to you. One exception people prefers for the surrogate key of Date dimension is using the format YYYYMMDD (as integer). This makes is easier on WHERE clause: i.e instead of filtering for the Date column (a DATETIME value), which will do search to find the surrogate keys, you just provide the surrogate keys directly b/c you know the format. Depending on your business domain, you may have other similar useful surrogate key patterns that you may want to consider and use. But just know, in case of a business domain change, you will have have to update all fact records. Simple auto-increment surrogate key does not have that problem. In your case, the surrogate key for the month can be actual month number (1 for Jan)
That being said, 1 million rows in 5 years is easy to query even without a Star-schema design (with proper indexing, database maintenance). But if this is part of a larger analytics system, then go with Star schema
The simplest way.
Create a table, companyname + info you needto store + column for year-month.
Ex:
CREATE TABLE tablename (
id int(11) NOT NULL AUTO_INCREMENT,
companyname varchar(255) ,
info1 int(11) NOT NULL,
info2 datetime ,
info3 varchar(255) ,
info4 bool ,
yearmonth datetime,
PRIMARY KEY (id) );
#queries
select * from tablename where companyname="nameofthecompany";
select * from tablename where yearmonth="year-month"; #can use between here

I need help counting char occurencies in a row with sql (using firebird server)

I have a table where I have these fields:
id(primary key, auto increment)
car registration number
car model
garage id
and 31 fields for each day of the mont for each row.
In these fields I have char of 1 or 2 characters representing car status on that date. I need to make a query to get number of each possibility for that day, field of any day could have values: D, I, R, TA, RZ, BV and LR.
I need to count in each row, amount of each value in that row.
Like how many I , how many D and so on. And this for every row in table.
What best approach would be here? Also maybe there is better way then having field in database table for each day because it makes over 30 fields obviously.
There is a better way. You should structure the data so you have another table, with rows such as:
CarId
Date
Status
Then your query would simply be:
select status, count(*)
from CarStatuses
where date >= #month_start and date < month_end
group by status;
For your data model, this is much harder to deal with. You can do something like this:
select status, count(*)
from ((select status_01 as status
from t
) union all
(select status_02
from t
) union all
. . .
(select status_31
from t
)
) s
group by status;
You seem to have to start with most basic tutorials about relational databases and SQL design. Some classic works like "Martin Gruber - Understanding SQL" may help. Or others. ATM you miss the basics.
Few hints.
Documents that you print for user or receive from user do not represent your internal data structures. They are created/parsed for that very purpose machine-to-human interface. Inside your program should structure the data for easy of storing/processing.
You have to add a "dictionary table" for the statuses.
ID / abbreviation / human-readable description
You may have a "business rule" that from "R" status you can transition to either "D" status or to "BV" status, but not to any other. In other words you better draft the possible status transitions "directed graph". You would keep it in extra columns of that dictionary table or in one more specialized helper table. Dictionary of transitions for the dictionary of possible statuses.
Your paper blank combines in the same row both totals and per-day detailisation. That is easy for human to look upon, but for computer that in a sense violates single responsibility principle. Row should either be responsible for primary record or for derived total calculation. You better have two tables - one for primary day by day records and another for per-month total summing up.
Bonus point would be that when you would change values in the primary data table you may ask server to automatically recalculate the corresponding month totals. Read about SQL triggers.
Also your triggers may check if the new state properly transits from the previous day state, as described in the "business rules". They would also maybe have to check there is not gaps between day. If there is a record for "march 03" and there is inserted a new the record for "march 05" then a record for "march 04" should exists, or the server would prohibit adding such a row. Well, maybe not, that is dependent upon you business processes. The general idea is that server should reject storing any data that is not valid and server can know it.
you per-date and per-month tables should have proper UNIQUE CONSTRAINTs prohibiting entering duplicate rows. It also means the former should have DATE-type column and the latter should either have month and year INTEGER-type columns or have a DATE-type column with the day part in it always being "1" - you would want a CHECK CONSTRAINT for it.
If your company has some registry of cars (and probably it does, it is not looking like those car were driven in by random one-time customers driving by) you have to introduce a dictionary table of cars. Integer ID (PK), registration plate, engine factory number, vagon factory number, colour and whatever else.
The per-month totals table would not have many columns per every status. It would instead have a special row for every status! The structure would probably be like that: Month / Year / ID of car in the registry / ID of status in the dictionary / count. All columns would be integer type (some may be SmallInt or BigInt, but that is minor nuancing). All the columns together (without count column) should constitute a UNIQUE CONSTRAINT or even better a "compound" Primary Key. Adding a special dedicated PK column here in the totaling table seems redundant to me.
Consequently, your per-day and per-month tables would not have literal (textual and immediate) data for status and car id. Instead they would have integer IDs referencing proper records in the corresponding cars dictionary and status dictionary tables. That you would code as FOREIGN KEY.
Remember the rule of thumb: it is easy to add/delete a row to any table but quite hard to add/delete a column.
With design like yours, column-oriented, what would happen if next year the boss would introduce some more statuses? you would have to redesign the table, the program in many points and so on.
With the rows-oriented design you would just have to add one row in the statuses dictionary and maybe few rows to transition rules dictionary, and the rest works without any change.
That way you would not

Date ranges unique constraint in Database

I have a table "holidays" which represents people's holidays. It contains a FK to a person table, a from date column and a to date column. I want to add a constraint so that no person can have an over lapping holiday with themselves. So if Billy has a skiing holiday from 15th Jan - 20thJan, he can't have another vacation on the 18th Jan? But it's fine for him to do it on the 21st Jan?
Is this possible to do at database level via a constraint?
DB2 or Oracle can suffice?
Thanks
In DB2 you could use Temporal Tables and Time Travel Queries - check out the doumentation
Using Business Time with Business Period Temporal Tables will allow to define an index which enforces that periods do not overlap
CREATE UNIQUE INDEX I_vacation ON vacation (person, BUSINESS_TIME WITHOUT OVERLAPS)
Not directly. Constraints (at least in Oracle, I can't speak for other databases) work on one row at a time, they don't look at other rows - EXCEPT the UNIQUE constraint which looks across rows.
So - two solutions. One is, instead of storing ranges, to store one row per holiday DAY. (By the way, I believe what you call "holiday" is called "vacation", at least in America; "holiday" is reserved for common holidays, the same for all people, such as New Year or Christmas, etc.) In this arrangement, add a UNIQUE constraint on (person_id, vacation_day). Then re-work your input and reporting apps to translate from ranges to individual days, and respectively from individual days back to ranges.
The other solution, if you must store ranges, is to create a materialized view with refresh on commit (preferably fast refresh if the conditions permit), which shows person_id and vacation_day, one row per day - and put a UNIQUE constraint on the materialized view.
You can create a stored procedure wich take datestart and dateend of current row and use them parameter of this procedure. This procedure return 1 if exist in table a bad range and otherwise 0. Then you create your constraint check when this result procedure =0

How to create a custom primary key using strings and date

I have an order table in sql server and I need for the order number primary key to be like this
OR\20160202\01
OR is just a string
20160202 is the Date
01 is sequence number for that day
for second Order record the same day it would be
OR\20160202\02 and so on..
backlashes should also be included...
Whats the way to go about creating such a field in sql server (using version 2016)
EDIT: to add more context to what sequence number is, its just a way for this field composite or not to be unique. without a sequence number i would get duplicate records in DB because i could have many records the same day so date would remain the same thus it would be something like
OR\20160202 for all rows for that particular day so it would be duplicate. Adding a "sequence" number helps solve this.
The best way is to not create such a column in SQL. You're effectively combining multiple pieces of data into the same column, which shouldn't happen in a relational database for many reasons. A column should hold one piece of data.
Instead, create a composite primary key across all of the necessary columns.
composite pk
order varchar(20)
orDate DateTime
select *
, row_number() over (partition by cast(orDate as Date) order by orDate) as seq
from table
Will leave it to you on how to concatenate the data
That is presentation thing - don't make it a problem for the PK
About "sequence number for that day" (department, year, country, ...).
Almost every time I discussed such a requirement with end users it turned out to be just misunderstanding of how shared database works, a vague attempt to repeat old (separate databases, EXCEL files or even paper work) tricks on shared database.
So i second Tom H and others, first try not to do it.
If nevertheless you must do it, for legal or other unnegotiatable reasons then i hope you are on 2012+. Create SEQUENCE for every day.
Formatted PK is not a good idea.Composite key is a better approach.The combination of day as a date column and order number as a bigint column should be used.This helps in improving the query performance too.
You might want to explore 'Date Dimension' table. Date Dimension is commonly used table in data warehousing. It stores all the days of the calendar(based on your choice of years) and numeric generated keys for these days. Check this post on date dimension. It talks about creating one in SQL SERVER.
https://www.mssqltips.com/sqlservertip/4054/creating-a-date-dimension-or-calendar-table-in-sql-server/

Calculate date and time key in fact table using existing date time field

I have date time field in a fact table in the format MM/DD/YY HH:MM:SS (e.g 2/24/2009 11:18:47 AM) and I have seperate date and time dimension
tables. What I would like ask is that how I can create date key and time key in the fact table using the date time field so that I can join the
date and time dimension.
There are alot of reference for creating seperate date and time dimension and their benefit but I could not find how to create date and time keys
in the fact table using existing date time field.
I have also heard that having date time field in fact table has certain benefit. If so, what would you recommend, should I have all three (date key, time key
and date time field) in the fact table. Date key and time key are must to have for me and I am concerned about fact table size if I have date time field
also in the fact table.
Thank you all for any help you can give.
What you need to do (if I understand correctly) is to create two fields in your Fact table:kTime, kDate.
We would always suggest using the primary keys for DimTime and DimDate as having meaning (this being a special case normally Dim tables' promary keys dont have any meaning). So e.g. in DimDate, we would have kDate as primary key with values formatted as YYYYMMDD so that you can order by kDate and it puts them in date order. Then have DimTime table having kTime primary key in the form HHMM or HHMMSS (depending on the resolution you need.
It is best to keep the actual date time field on the fact table as well, as it allows SQL to use its inbuilt date/time functions to do subsetting, but if you extend your Dim tables with useful extra columns : DimDate (add DayOfWeek, IsHoliday, DayNumber, MonthNumber, YearNumber, etc) and DimTime (HourNumber, MinuteNumber, IsWorkingTime), then you can perform very interesting queries very simply.
So to answer your question, "how to create date and time keys in the fact table using existing date time field?" ... as you are loading the data into the fact table, use the inbuilt date/time functions to create separate Date fields and Time fields.
It depends very much on how many rows you expect in your Fact table wether this approach will produce a lot of data, but it is the easiest to work with from a data warehouse point of view.
best of luck!