Data modeling with levels of detail, some of which are absent - SQL

I'm doing a data model for a roller derby league to track their matches. I track things like lap times, penalties per lap, penalties per period, and penalties per match.
The problem is that in some cases, I will only have the overall data; I might have "penalties per match" for one match and "penalties per period" for another. So at the lowest level, for some matches, I'll have the very detailed data (penalties per lap), and at the highest level I'll have only penalties per match.
I'm not sure how to model/use this for reporting when I don't have the high-detail data for some records. I thought about something like this:
PenaltiesPerMatch
    MatchID
    PenaltyCount

PenaltiesPerPeriod
    MatchID
    PeriodID
    PenaltyCount

PenaltiesPerLap
    MatchID
    PeriodID
    LapID
    PenaltyCount
But my concern is that the higher-level information can be derived from the lower level. Do I duplicate records (e.g. fill in a record for penalties per period with data that is also in penalties per lap, summed by period), or keep unique records (don't store penalties per period for data that I already have in penalties per lap; calculate it by summing on period)?

What I would do is record the information that you have. For some matches, record it in high detail, for others in low detail.
When you report on the matches:
Calculate the sums per match for the high-detail matches.
Use the stored per-match sums for the low-detail matches.
In short: store data at the lowest level of detail that you have, and calculate the higher levels from it.
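A sketch of that per-match report, using the three tables from the question and assuming each match's penalties are stored at exactly one level (so nothing is counted twice):

SELECT MatchID, SUM(PenaltyCount) AS TotalPenalties
FROM (
    SELECT MatchID, PenaltyCount FROM PenaltiesPerMatch
    UNION ALL
    SELECT MatchID, PenaltyCount FROM PenaltiesPerPeriod
    UNION ALL
    SELECT MatchID, PenaltyCount FROM PenaltiesPerLap
) AS AllLevels
GROUP BY MatchID;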

You could save the information in one table, with NULL values indicating that you don't have the data down to that level. You wouldn't be able to put a primary key over those columns (primary key columns can't be NULL), so you would need a surrogate key, but you should be able to use a unique constraint.
For example:
CREATE TABLE PenaltyCounts
(
    penalty_count_id INT NOT NULL,
    match_id INT NOT NULL,
    period TINYINT NULL CHECK (period BETWEEN 1 AND 3), -- NULL = count not broken down by period
    lap SMALLINT NULL,                                   -- NULL = count not broken down by lap
    penalty_count SMALLINT NOT NULL,
    CONSTRAINT PK_PenaltyCounts PRIMARY KEY NONCLUSTERED (penalty_count_id),
    CONSTRAINT UI_PenaltyCounts UNIQUE CLUSTERED (match_id, period, lap),
    CONSTRAINT CK_lap_needs_period CHECK (lap IS NULL OR period IS NOT NULL)
)
One problem with this, for which I don't see an easy solution yet, is how to enforce that penalties can only be entered at one level per match. For example, they could still do this:
INSERT INTO PenaltyCounts (penalty_count_id, match_id, period, lap, penalty_count)
VALUES (1, 1, NULL, NULL, 5)
INSERT INTO PenaltyCounts (penalty_count_id, match_id, period, lap, penalty_count)
VALUES (2, 1, 1, NULL, 3)
INSERT INTO PenaltyCounts (penalty_count_id, match_id, period, lap, penalty_count)
VALUES (3, 1, 2, NULL, 2)
The advantage of this single-table solution is that your statistics can all be found by querying one table and the GROUP BYs will roll everything up nicely.
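For example, per-match and per-period reports could look like this (per-period figures only exist for matches that were entered at period or lap level):

-- Per-match totals, whatever level the data was entered at
SELECT match_id, SUM(penalty_count) AS penalties
FROM PenaltyCounts
GROUP BY match_id;

-- Per-period totals, for matches stored at period or lap level
SELECT match_id, period, SUM(penalty_count) AS penalties
FROM PenaltyCounts
WHERE period IS NOT NULL
GROUP BY match_id, period;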
You could also use the separate-table method and put views over the tables to pull everything together. This still allows the problem above, though, of entering numbers at multiple levels.

I think it depends on what information is valuable to the customer. If they would like to have the information by period, then you should include that as separate records; penalties by period and by match must be kept separate.
If you always had the per-period penalty information, then you could derive the per-match figures with a query that sums the data.
If the number of periods is always fixed, then you could probably just add a column per period to the match table instead of a new table to hold the period information.
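A minimal sketch of that fixed-column alternative (column names are illustrative, and two periods are assumed just for the example):

CREATE TABLE MatchPenalties
(
    match_id INT NOT NULL PRIMARY KEY,
    period1_penalties SMALLINT NULL, -- NULL when only the match total is known
    period2_penalties SMALLINT NULL,
    match_penalties SMALLINT NOT NULL
);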

Related

Store 3-dimensional table in database where 1 dimension increases over time

I have a data set with three dimensions that I would like to store for use with a website:
A list of companies (about 1000)
Information about the company (about 15 things)
Time (monthly)
Essentially, I want to track this information over time and keep it up to date.
When I start, the data will be 1000x15x1, after a year it will be 1000x15x12, and after 10 years it will be 1000x15x120.
The main queries I would make are:
Get all information for one company over all times
Get all information for one particular time
What would be a good database configuration for doing this? I'm open to either SQL or noSQL solutions.
In case it matters, the website is on Google App Engine.
From the relational database schema design perspective:
If the goal is only analytics / ad-hoc querying / OLAP, then you can use a star schema, which is well suited for this type of analytics. But beware: OLAP databases are de-normalized and not suitable for operational transaction storage / OLTP, in case you are planning to do both on this database.
The beauty of the Star schema:
The fact tables are usually all numeric, which keeps them small even when they hold many records. Small tables are very fast to read (little I/O).
All joins from the fact table to dimension tables are based on foreign keys (single column, numeric, indexable foreign keys)
All dimension tables have a surrogate key, which is a single-column primary key. A single-column primary key is easier to JOIN on than a multi-column primary key and also easier to index.
There are no NULLs in the foreign keys of fact tables. This makes JOIN operations straightforward, i.e. you always JOIN a fact table to all of its dimension tables. If you need a NULL case, you add it as a special row in your dimension table. For example: if a company is not listed on the stock market, and one of the things you track is stock price, then you enter 0 or NULL for the stock price in the fact table (depending on how you want SUM(), AVG() etc. to behave later), add a special row called 'Private company' to your StockSymbols dimension table, and reference that row from the fact table via the foreign key.
Almost all filtering is done through the dimension tables, which are much smaller than the fact tables. This is why you need a Date dimension to be able to do date-based queries.
If you can stay with a pure star schema, then all your JOINs are single-hop (i.e. no join between two tables through another table).
All of this makes JOIN operations very fast, simple and straightforward. That's why the star schema is at the heart of data-warehousing designs.
https://en.wikipedia.org/wiki/Star_schema
https://en.wikipedia.org/wiki/Data_warehouse
One level up from this is an OLAP engine (SQL Server Analysis Services (SSAS), for example), which pre-processes the data to make it fast to query, but it involves more learning than a pure star schema and it is overkill in your case.
For your example
In Star schema,
Companies will be a dimension table
You will need a Month dimension table. It's a simplified version of a Date dimension, with just the month info. An example of a Date dimension is here:
https://www.codeproject.com/Articles/647950/Create-and-Populate-Date-Dimension-for-Data-Wareho
The information about the company (the 15 things you mention) will go into fact tables. The facts must be numeric (because ideally all non-numeric values are saved in dimension tables). This means moving the non-numeric part of a fact into a dimension table. For example: if you are keeping revenue and would like to keep the currency type too, then you will need a Currency dimension, and you save only the amount in the fact table plus a foreign key to the Currency dimension table.
If you have any non-numeric facts, you need to store the distinct list of values in a dimension table and add a foreign key to that dimension table in your fact table (this is called a factless fact table). The only exception is when the cardinality of the dimension and the fact table is very similar; then you can just store the non-numeric value inside the fact table directly, as there is no benefit in having a dimension table (in fact it is a disadvantage).
Also, facts can be grouped by their granularity. For example, you could have a company_monthly_summary fact table and keep more than one fact in that table (all joining to the Company dimension and the Month dimension). How you group facts into tables is up to you, but facts with different granularity should not be grouped together, as that causes sparse fact tables that are harder to query.
You will use foreign keys in fact tables to join to your dimension tables.
Add indexes on your dimension tables' most used columns.
Add a numeric surrogate key to each dimension. It is usually an auto-increment number, but that's up to you. One exception people prefer for the surrogate key of a Date dimension is the format YYYYMMDD (as an integer). This makes WHERE clauses easier: instead of filtering on the Date column (a DATETIME value), which requires a lookup to find the surrogate keys, you provide the surrogate keys directly because you know the format. Depending on your business domain, you may have other similarly useful surrogate key patterns to consider. Just know that if the business domain changes, you will have to update all fact records; a simple auto-increment surrogate key does not have that problem. In your case, the surrogate key for the month can be the actual month number (1 for January).
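To make this concrete, here is a minimal sketch of what the schema could look like for this example. All names are illustrative, and the two facts (revenue, employee_count) just stand in for the 15 pieces of information you track:

-- Dimension tables
CREATE TABLE dim_company (
    company_key INT PRIMARY KEY, -- surrogate key
    company_name VARCHAR(255) NOT NULL
);

CREATE TABLE dim_month (
    month_key INT PRIMARY KEY,   -- e.g. 1, 2, 3, ... or YYYYMM
    year_number INT NOT NULL,
    month_number INT NOT NULL
);

-- Fact table at company/month granularity
CREATE TABLE fact_company_monthly (
    company_key INT NOT NULL REFERENCES dim_company (company_key),
    month_key INT NOT NULL REFERENCES dim_month (month_key),
    revenue DECIMAL(18,2) NOT NULL,
    employee_count INT NOT NULL,
    -- ...one numeric column per fact, or several narrower fact tables
    PRIMARY KEY (company_key, month_key)
);

-- "All information for one company over all times"
SELECT m.year_number, m.month_number, f.revenue, f.employee_count
FROM fact_company_monthly f
JOIN dim_company c ON c.company_key = f.company_key
JOIN dim_month m ON m.month_key = f.month_key
WHERE c.company_name = 'Acme Corp'
ORDER BY m.year_number, m.month_number;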
That being said, 1 million rows in 5 years is easy to query even without a star-schema design (with proper indexing and database maintenance). But if this is part of a larger analytics system, then go with the star schema.
The simplest way.
Create a table with the company name, the info you need to store, and a column for the year-month.
Ex:
CREATE TABLE tablename (
    id int(11) NOT NULL AUTO_INCREMENT,
    companyname varchar(255),
    info1 int(11) NOT NULL,
    info2 datetime,
    info3 varchar(255),
    info4 bool,
    yearmonth datetime,
    PRIMARY KEY (id)
);
#queries
select * from tablename where companyname="nameofthecompany";
select * from tablename where yearmonth="year-month"; #can use between here

I need help counting char occurrences in a row with SQL (using Firebird server)

I have a table where I have these fields:
id(primary key, auto increment)
car registration number
car model
garage id
and 31 fields, one for each day of the month, in each row.
In these fields I have a code of 1 or 2 characters representing the car's status on that date. A day field can have the values D, I, R, TA, RZ, BV or LR, and I need a query that counts how many times each value occurs.
I need to count, in each row, the number of occurrences of each value in that row.
Like how many I, how many D and so on - and this for every row in the table.
What would be the best approach here? Also, maybe there is a better way than having a field in the table for each day, because that obviously means over 30 fields.
There is a better way. You should structure the data so you have another table, with rows such as:
CarId
Date
Status
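A minimal Firebird sketch of such a table (names are illustrative; the Date column is shown as status_date because DATE is a reserved word):

CREATE TABLE CarStatuses (
    car_id INTEGER NOT NULL,
    status_date DATE NOT NULL,
    status VARCHAR(2) NOT NULL, -- one of: D, I, R, TA, RZ, BV, LR
    CONSTRAINT pk_car_statuses PRIMARY KEY (car_id, status_date)
);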
Then your query would simply be:
select status, count(*)
from CarStatuses
where date >= #month_start and date < #month_end
group by status;
For your data model, this is much harder to deal with. You can do something like this:
select status, count(*)
from ((select status_01 as status
from t
) union all
(select status_02
from t
) union all
. . .
(select status_31
from t
)
) s
group by status;
You seem to need to start with the most basic tutorials about relational databases and SQL design. Some classic works like "Martin Gruber - Understanding SQL" may help, or others. At the moment you are missing the basics.
A few hints.
Documents that you print for the user or receive from the user do not represent your internal data structures. They are created/parsed purely as a machine-to-human interface. Internally, your program should structure the data for ease of storing and processing.
You have to add a "dictionary table" for the statuses.
ID / abbreviation / human-readable description
You may have a "business rule" that from "R" status you can transition to either "D" status or to "BV" status, but not to any other. In other words you better draft the possible status transitions "directed graph". You would keep it in extra columns of that dictionary table or in one more specialized helper table. Dictionary of transitions for the dictionary of possible statuses.
Your paper form combines both totals and per-day detail in the same row. That is easy for a human to look at, but for a computer it in a sense violates the single responsibility principle: a row should be responsible either for a primary record or for a derived total. You had better have two tables - one for the primary day-by-day records and another for the per-month totals.
A bonus would be that when you change values in the primary data table, you can ask the server to automatically recalculate the corresponding month totals. Read about SQL triggers.
Your triggers may also check whether the new state is a proper transition from the previous day's state, as described in the "business rules". They might also have to check that there are no gaps between days: if there is a record for "March 03" and a new record is inserted for "March 05", then a record for "March 04" should exist, or the server would refuse to add the row. Well, maybe not - that depends on your business processes. The general idea is that the server should reject storing any data that it can know is not valid.
Your per-day and per-month tables should have proper UNIQUE CONSTRAINTs prohibiting duplicate rows. That also means the former should have a DATE-type column, and the latter should either have INTEGER-type month and year columns or a DATE-type column whose day part is always "1" - you would want a CHECK CONSTRAINT for that.
If your company has some registry of cars (and it probably does; it does not look like those cars were driven in by random one-time customers passing by), you have to introduce a dictionary table of cars: integer ID (PK), registration plate, engine factory number, wagon factory number, colour and whatever else.
The per-month totals table would not have one column per status. It would instead have a separate row for every status! The structure would probably be: Month / Year / ID of car in the registry / ID of status in the dictionary / count. All columns would be of integer type (some may be SmallInt or BigInt, but that is a minor nuance). All the columns together (except the count column) should constitute a UNIQUE CONSTRAINT, or even better a "compound" primary key. Adding a special dedicated PK column to this totals table seems redundant to me.
Consequently, your per-day and per-month tables would not hold literal (textual, immediate) data for the status and the car ID. Instead they would have integer IDs referencing the proper records in the corresponding car and status dictionary tables, which you would code as FOREIGN KEYs.
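A rough Firebird sketch of this layout - all names are illustrative, and the trigger is only one possible way to keep the totals current:

CREATE TABLE statuses (
    status_id INTEGER NOT NULL PRIMARY KEY,
    abbrev VARCHAR(2) NOT NULL UNIQUE,     -- D, I, R, TA, RZ, BV, LR
    description VARCHAR(100) NOT NULL
);

CREATE TABLE cars (
    car_id INTEGER NOT NULL PRIMARY KEY,
    reg_plate VARCHAR(20) NOT NULL UNIQUE
    -- engine number, wagon number, colour, ...
);

-- Primary day-by-day records
CREATE TABLE car_day_status (
    car_id INTEGER NOT NULL REFERENCES cars (car_id),
    status_date DATE NOT NULL,
    status_id INTEGER NOT NULL REFERENCES statuses (status_id),
    PRIMARY KEY (car_id, status_date)
);

-- Derived per-month totals: one row per car, month and status
CREATE TABLE car_month_totals (
    car_id INTEGER NOT NULL REFERENCES cars (car_id),
    year_num INTEGER NOT NULL,
    month_num INTEGER NOT NULL CHECK (month_num BETWEEN 1 AND 12),
    status_id INTEGER NOT NULL REFERENCES statuses (status_id),
    day_count INTEGER NOT NULL,
    PRIMARY KEY (car_id, year_num, month_num, status_id)
);

-- One possible trigger to keep the totals up to date on insert
SET TERM ^ ;
CREATE TRIGGER trg_car_day_status_ai FOR car_day_status
ACTIVE AFTER INSERT POSITION 0
AS
DECLARE VARIABLE y INTEGER;
DECLARE VARIABLE m INTEGER;
BEGIN
  y = EXTRACT(YEAR FROM NEW.status_date);
  m = EXTRACT(MONTH FROM NEW.status_date);
  UPDATE car_month_totals
     SET day_count = day_count + 1
   WHERE car_id = NEW.car_id AND year_num = :y
     AND month_num = :m AND status_id = NEW.status_id;
  IF (ROW_COUNT = 0) THEN
    INSERT INTO car_month_totals (car_id, year_num, month_num, status_id, day_count)
    VALUES (NEW.car_id, :y, :m, NEW.status_id, 1);
END^
SET TERM ; ^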
Remember the rule of thumb: it is easy to add/delete a row to any table but quite hard to add/delete a column.
With a column-oriented design like yours, what would happen if next year the boss introduced some more statuses? You would have to redesign the table, change the program in many places, and so on.
With the row-oriented design you would just have to add one row to the statuses dictionary and maybe a few rows to the transition rules dictionary, and the rest works without any change.
That way you would not have to touch the table structure or the application code at all.

PostgreSQL sequence number depending on rows?

I'm taking a course in databases and I am unsure of how to create this view.
I have this table (PostgreSQL):
CREATE TABLE InQueue (
id INT REFERENCES Student(id),
course VARCHAR(10) REFERENCES RestrictedCourse(course_code),
since TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
PRIMARY KEY (id,course),
UNIQUE (course,since)
);
I am supposed to create a view that lists course, id, number, where number is calculated from since: the lowest since gives queue number 1, the second lowest gives 2, and so forth. (course, number) is unique, but number is not unique by itself, since there are many different courses.
What I think needs to be done is to first order the table by (course, since) and then just add sequence numbers, but when the course changes the sequence numbers need to start over from 1 again.
Could someone point me in the right direction? :)
Use:
select row_number() over (partition by course order by since asc) as yournumber,
       id, course, since
from InQueue;
You can read about window functions here: http://www.postgresql.org/docs/9.4/static/tutorial-window.html
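To get the view the assignment asks for, that query can simply be wrapped in a view (the view and column names here are just examples):

create view course_queue as
select course,
       id,
       row_number() over (partition by course order by since) as number
from InQueue;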

Data warehouse type 2 scd Employee dimension and HR Facts (Kimball's)

I'm following the "Data Warehouse Toolkit" book by Kimball and I'm getting confused with an example of an employee dimension and a HR snapshot fact table.
Here is a screenshot of the example given in the book:
I'm getting confused with the 'Employee count', 'New Hire Count', 'Transfer Count' and 'Promotion Count' fields. As you can see there is a relationship between the HR fact table and the Employee dimension table, but which key would be assigned in the fact table in the case of these count values? I understand that there could be a 'New Hire Count' at the end of the month and we would have a Month Dimension FK in the fact table pointing to that month, but what about the employee dimension key?
I hope I'm making myself clear here, and sorry if somehow this is a dumb question.
Thanks.
stigma,
I believe the surrogate key of the most recent row from Employee Transaction Dimension is what one would include in the fact table. There may be multiple rows in Employee Transaction Dimension for the month, but only the most recent one would be referenced in the fact table row for a given month.
Hope this helps.
Best Regards,
Jesse Dyson
In an SCD (type 2 or type 3), you want to think in terms of two types of key: natural keys and pseudo keys. The natural key is the identifier which the "real world" would understand; in the example of an Employee dimension, this would probably be some kind of Employee ID. Each time you add an entry to this table, you get a new pseudo key, and I like to think of this as the "as-was" key. It represents the state of that dimensional member "as it was" when the record was added.
Over time in an SCD, you will have many, many records per natural key, each with its own "as-was" key. For the most recent entry, its "as-was" key is also the "as-is" key, as it represents the current state.
In a fact table, you should ALWAYS expect to find the "as-was" key. If you assume the fact table will always hold the "as-is" key, or the most recent key, then you are assuming you will go back and update historical records in your fact table simply because an attribute of the dimension changed. This is a waste of resources for a start, and is actually counter-productive, as one of the major benefits of an SCD is the ability to do "as-was vs as-is" analysis, and to do that you need to preserve the "as-was" state.
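As a rough illustration of the difference (the table and column names below are made up, not from the book): joining the fact table on the surrogate key gives the "as-was" attributes, while a second hop through the natural key to the current dimension row gives the "as-is" attributes.

-- "As-was": the department the employee was in when the fact row was recorded
SELECT f.month_key, f.new_hire_count, d.department
FROM hr_snapshot_fact f
JOIN employee_dim d ON d.employee_key = f.employee_key;    -- surrogate key

-- "As-is": the same facts restated against each employee's current attributes
SELECT f.month_key, f.new_hire_count, cur.department
FROM hr_snapshot_fact f
JOIN employee_dim was ON was.employee_key = f.employee_key -- as-was row
JOIN employee_dim cur ON cur.employee_id = was.employee_id -- natural key
                     AND cur.is_current_row = TRUE;        -- current version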

Data modelling draft/quote/order/invoice

I'm currently working on a small project in which I need to model the following scenario:
Scenario
Customer calls; he wants a quote on a new car.
Sales rep registers customer information.
Sales rep creates a quote in the system and adds an item to the quote (the car).
Sales rep sends the quote to the customer by email.
Customer accepts the quote, and the quote is now no longer a quote but an order.
Sales rep checks the order; everything is OK and he invoices the order. The order is now no longer an order but an invoice.
Thoughts
I need a bit of help finding out the ideal way to model this, but I have some thoughts.
I'm thinking that a draft, a quote and an invoice are all basically orders.
Drafts/quotes/invoices need separate unique numbers (IDs), so therefore I'm thinking of separate tables for all of them.
Model
This is my data model v.1.0, please let me know what you think.
Concerns
I do however have some concerns regarding this model:
A draft/quote/invoice might have different items and prices on its order lines. In this model every draft/quote/invoice is connected to the same order and the same order lines, making it impossible to have separate quote lines/draft lines/invoice lines. Maybe I should make new tables for this, but then basically the same information would be stored in multiple tables, and that is not good either.
Sometimes two or more quotes become one invoice; how would this model take care of that?
If you have any tips on how to model this better, please let me know!
EDIT: Data model v.1.4
It looks like you've modeled every one of these things--quote, order, draft, invoice--as structurally identical to all the others. If that's the case, then you can "push" all the similar attributes up into a single table.
create table statement (
    stmt_id integer primary key,
    stmt_type char(1) not null check (stmt_type in ('d', 'q', 'o', 'i')),
    stmt_date date not null default current_date,
    customer_id integer not null -- references customer (customer_id)
);

create table statement_line_items (
    stmt_id integer not null references statement (stmt_id),
    line_item_number integer not null,
    -- other columns for line items
    primary key (stmt_id, line_item_number)
);
I think that will work for the model you've described, but I think you'll be better served in the long run by modeling these as a supertype/subtype. Columns common to all subtypes get pushed "up" into the supertype; each subtype has a separate table for the attributes unique to that subtype.
This SO question and its accepted answer (and comments) illustrate a supertype/subtype design for blog comments. Another question relates to individuals and organizations. Yet another relates to staffing and phone numbers.
Later . . .
This isn't complete, but I'm out of time. I know it doesn't include line items. Might have missed something else.
-- "Supertype". Comments appear above the column they apply to.
create table statement (
-- Autoincrement or serial is ok here.
stmt_id integer primary key,
stmt_type char(1) unique check (stmt_type in ('d','q','o','i')),
-- Guarantees that only the order_st table can reference rows having
-- stmt_type = 'o', only the invoice_st table can reference rows having
-- stmt_type = 'i', etc.
unique (stmt_id, stmt_type),
stmt_date date not null default current_date,
cust_id integer not null -- references customers (cust_id)
);
-- order "subtype"
create table order_st (
stmt_id integer primary key,
stmt_type char(1) not null default 'o' check (stmt_type = 'o'),
-- Guarantees that this row references a row having stmt_type = 'o'
-- in the table "statement".
unique (stmt_id, stmt_type),
-- Don't cascade deletes. Don't even allow deletes. Every order given
-- an order number must be maintained for accountability, if not for
-- accounting.
foreign key (stmt_id, stmt_type) references statement (stmt_id, stmt_type)
on delete restrict,
-- Autoincrement or serial is *not* ok here, because they can have gaps.
-- Database must account for each order number.
order_num integer not null,
is_canceled boolean not null
default FALSE
);
-- Write triggers, rules, whatever to make this view updatable.
-- You build one view per subtype, joining the supertype and the subtype.
-- Application code uses the updatable views, not the base tables.
create view orders as
select t1.stmt_id, t1.stmt_type, t1.stmt_date, t1.cust_id,
t2.order_num, t2.is_canceled
from statement t1
inner join order_st t2 on (t1.stmt_id = t2.stmt_id);
There should be a table "quotelines", which would be similar to "orderlines". Similarly, you should have an 'invoicelines' table. All these tables should have a 'price' field (which nominally will be the part's default price) along with a 'discount' field. You could also add a 'discount' field to the 'quotes', 'orders' and 'invoices' tables, to handle things like cash discounts or special offers. Despite what you write, it is good to have separate tables, as the amount and price in the quote may not match what the customer actually orders, and again it may not be the same amount that you actually supply.
I'm not sure what the 'draft' table is - you could probably combine the 'draft' and 'invoices' tables as they hold the same information, with one field containing the status of the invoice - draft or final. It is important to separate your invoice data from order data, as presumably you will be paying taxes according to your income (invoices).
'Quotes', 'Orders' and 'Invoices' should all have a field (a foreign key) identifying the sales rep; this field would point to a 'SalesRep' table, which is currently missing from your model. You could also add a 'salesrep' field in the 'customers' table, pointing to the default rep for that customer. This value would be copied into the 'quotes' table, although it could be changed if a rep other than the default gave the quote. Similarly, this field should be copied when an order is made from a quote, and when an invoice is made from an order.
I could probably add much more, but it all depends on how complex and detailed a system you want to make. You might need to add some form of 'bill of materials' if the cars are configured according to their options and priced accordingly.
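A minimal sketch of the 'quotelines' table suggested above ('orderlines' and 'invoicelines' would mirror it; the names, types and the referenced 'quotes' and 'parts' tables are illustrative):

CREATE TABLE quotelines (
    quote_id integer NOT NULL REFERENCES quotes (quote_id),
    line_number integer NOT NULL,
    part_id integer NOT NULL REFERENCES parts (part_id),
    quantity integer NOT NULL,
    price numeric(12,2) NOT NULL,             -- nominally the part's default price at quote time
    discount numeric(5,2) NOT NULL DEFAULT 0, -- per-line discount, if any
    PRIMARY KEY (quote_id, line_number)
);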
Add a new column to line_items (e.g. Status as smallint).
When a quote line becomes an order line, set the bit you choose (from 0 to 3) to 1.
But when the quantity changes, add a new line with the new quantity and keep the last line unchanged.
Kad.