Slowly Changing Fact Table? - sql

Background)
I've gone through the process of building a fact table for our inventory data that will, in theory, act as a nightly snapshot of our warehouse. What is recorded is information such as quantity, weights, locations, statuses, etc. The data is very granular and in many cases not specifically related to a single entity: our source database records inventory data with three primary key columns (license plate a.k.a. pallet, product, and packaging type), so it essentially has three business keys and no surrogate key.
The goal is to be able to have a 100% accurate recreation of our warehouse management system's data, viewable for any one day in history, so I can look up and see how many pallets of product XYZ were in location 1234 on the 4th of August.
Question 1)
Now, I have built this fact table to structurally look like a Slowly Changing Dimension, Type 2. Is this wrong? I've been reading up a little on accumulating snapshot fact tables, and I'm beginning to question my design. What is the best practice in this situation?
Question 2)
If my design is ok, how do I configure Analysis services so that it recognizes my DateStart and DateEnd columns in the FACT table? I have found some information on how to configure this for dimensions but it does not seem to work/apply to fact tables.
For reference - My fact table's structure (with added notes about columns):
CREATE TABLE [dbo].[FactInventory](
[id] [int] IDENTITY(1,1) NOT NULL, (fact table only surrogate key)
[DateStart] [datetime] NULL, (record begin date)
[DateEnd] [datetime] NULL, (record end date)
[CreateDate] [datetime] NULL, (create date of the inventory record in src db)
[CreateDateId] [int] NULL, (create date dimension key)
[CreateTimeId] [int] NULL, (create time dimension key)
[LicensePlateId] [int] NULL, (pallet id dimension key)
[SerialNumberId] [int] NULL, (serial number id dimension key)
[PackagedId] [int] NULL, (packaging type id dimension key)
[LotId] [int] NULL, (inventory lot id dimension key)
[MaterialId] [int] NULL, (product id dimension key)
[ProjectId] [int] NULL, (customer project id dimension key)
[OwnerId] [int] NULL, (customer id dimension key)
[WarehouseId] [int] NULL, (warehouse id dimension key)
[LocationId] [int] NULL, (location id dimension key)
[LPStatusId] [int] NULL, (licenseplate status id dimension key)
[LPTypeId] [int] NULL, (licenseplate type id dimension key)
[LPLookupCode] [nvarchar](128) NULL, (licenseplate non-system name)
[PackagedAmount] [money] NULL, (inventory amount - measure)
[netWeight] [money] NULL, (inventory netWeight - measure)
[grossWeight] [money] NULL, (inventory grossWeight - measure)
[Archived] [bit] NULL, (inventory archived yes/no - dimension)
[SCDChangeReason] [nvarchar](128) NULL (auditing data for changes)
)

Typically, in a snapshot fact table you do not have changes.
You usually have a date/time dimension which is used for the granularity of the measurements and not a DateStart/DateEnd. Similarly you do not have any SCD information. The fact snapshot is taken and the Date and Time dimensions are attached to those facts. If those facts repeat identically each month, so be it.
Dealing with determining which facts are valid at a given time is more processing than you really want your DW or your ETL to handle - that kind of design (effective dates, etc) is more effectively used in a live OLTP-type system where complete history is kept in the live system. The point of the DW is to optimize for reporting, not for space, and thus there is a direct snapshot date/time dimension which allows you to easily index and potentially partition the data without a lot of date arithmetic or comparisons.
As far as your dimensional model, be careful that you aren't succumbing to the too-many dimensions problem. Remember that dimensions do not have to correspond to entities in the real world. The choice of how dimensional attributes are grouped into dimension tables should be informed by 1) query needs, 2) data affinity and change behavior, 3) business organization. You might want to look into using one or more junk dimensions.
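To make the junk-dimension suggestion concrete, here is a hedged sketch (the table and column names are illustrative, not from the original schema): the low-cardinality attributes such as LP status, LP type, and the Archived flag could be collapsed into one small dimension, replacing three fact-table foreign keys with a single one.

```sql
-- Hypothetical junk dimension: low-cardinality flags and statuses
-- combined into one small table. One row exists per combination that
-- actually occurs in the data.
CREATE TABLE [dbo].[DimInventoryJunk]
(
    [InventoryJunkId] [int] IDENTITY(1,1) NOT NULL PRIMARY KEY,
    [LPStatus]  [nvarchar](32) NOT NULL,  -- licenseplate status
    [LPType]    [nvarchar](32) NOT NULL,  -- licenseplate type
    [Archived]  [bit] NOT NULL            -- archived yes/no
);
-- The fact table would then carry only [InventoryJunkId] instead of
-- LPStatusId, LPTypeId, and Archived as separate columns.
```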

Before going any further, is inventory really a slowly changing fact?
Edit: Then why not just snapshot every product each day, since that's what you want?
The problem is that fact tables get large, and you're throwing EVERYTHING into the fact table unnecessarily. Ideally, the fact table will contain nothing more than foreign keys to dimensions and data pertaining only to the fact at hand, but some of the columns you've outlined look like they belong in one of the dimension tables instead.
For instance, the license plate information: status, type, and lookup code. Likewise with netWeight/grossWeight, which should be derivable from the product dimension and PackagedAmount.
CREATE TABLE [dbo].[FactInventory](
[id] [int] IDENTITY(1,1) NOT NULL, (fact table only surrogate key)
[day] [int] NULL, (day dimension key, grain of a day)
[CreateDateId] [int] NULL, (create date dimension key)
/* I take it these are needed?
* [CreateTimeId] [int] NULL, (create time dimension key)
* [CreateDate] [datetime] NULL, (create date of the inventory record in src db)
*/
[LicensePlateId] [int] NULL, (pallet id dimension key)
/* Now THESE dimension columns...possibly slowly changing dimensions?
[LPStatusId] [int] NULL, (licenseplate status id dimension key)
[LPTypeId] [int] NULL, (licenseplate type id dimension key)
[LPLookupCode] [nvarchar](128) NULL, (licenseplate non-system name)
*/
[SerialNumberId] [int] NULL, (serial number id dimension key)
[PackagedId] [int] NULL, (packaging type id dimension key)
[LotId] [int] NULL, (inventory lot id dimension key)
[MaterialId] [int] NULL, (product id dimension key)
[ProjectId] [int] NULL, (customer project id dimension key)
[OwnerId] [int] NULL, (customer id dimension key)
[WarehouseId] [int] NULL, (warehouse id dimension key)
[LocationId] [int] NULL, (location id dimension key)
[PackagedAmount] [money] NULL, (inventory amount - measure)
[netWeight] [money] NULL, (inventory netWeight - measure)
[grossWeight] [money] NULL, (inventory grossWeight - measure)
[Archived] [bit] NULL, (inventory archived yes/no - dimension)
[SCDChangeReason] [nvarchar](128) NULL (auditing data for changes)
)
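With a day-grain snapshot like this, the original requirement ("how many pallets of product XYZ were in location 1234 on the 4th of August") becomes a simple equality filter on the day key; the key values below are illustrative:

```sql
-- Count distinct pallets of a given product in a given location on
-- one snapshot day. 20120804 stands in for the day-dimension key;
-- 42 stands in for product XYZ's dimension key.
SELECT COUNT(DISTINCT f.LicensePlateId) AS PalletCount
FROM   dbo.FactInventory AS f
WHERE  f.[day]      = 20120804
  AND  f.MaterialId = 42
  AND  f.LocationId = 1234;
```

No date arithmetic or DateStart/DateEnd range comparisons are needed, which is the point of the snapshot design.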

Related

MS SQL stored procedure to insert or update total YTD & Previous year invoice totals

I would like to total both the current financial YTD invoiced sales and the previous financial year's invoiced sales, on a daily basis, for each customer in an MS SQL database, and populate a separate table with this information. This information will be available within an application with specific functionality.
There are three tables that contain pertinent information:
Customers (list of all customers in system)
Invoice (all invoice data)
CustomData (custom fields that can be created and populated in an application that interacts with the database)
I am looking for advice on how best to implement a solution that achieves the following:
Checks the current date to see if it is the 1st day of financial year (1st April)
If it is, loop through every record in the customer table that satisfies an argument (live customer) and total the net value (invoice value minus credit notes) of both the invoices raised in the previous financial year (1st April to 31st March) and the current financial YTD (from the 1st April), where the customer ID in the invoice table matches the ID in the customer table.
Then check the CustomData table to see if records for that customer ID exist. If they do, update the record with the new value (NB: null values should = 0); if they don't, insert the data into the CustomData table.
If the current date is not 1st April, then only total the current financial YTD sales and update/insert into CustomData as required
The tables are constructed as follows (columns that are not relevant to this process have been excluded):
CREATE TABLE [dbo].[Customers]
(
[ID] [int] IDENTITY(1,1) NOT NULL,
[Code] [varchar](32) NULL,
[CustomerStatus] [tinyint] NOT NULL
CONSTRAINT [DF_Customers_CustomerStatus] DEFAULT ((0)),
CONSTRAINT [PK__Customers]
PRIMARY KEY CLUSTERED ([ID] ASC)
) ON [PRIMARY]
CREATE TABLE [dbo].[Invoice]
(
[ID] [int] IDENTITY(1,1) NOT NULL,
[Datedb] [datetime] NULL,
[AccountNo] [varchar](32) NULL,
[Nett] [float] NOT NULL
CONSTRAINT [DF_InvoiceHead_Nett] DEFAULT ((0)),
[IsCreditNote] [tinyint] NOT NULL
CONSTRAINT [DF_InvoiceHead_IsCreditNote] DEFAULT ((0)),
CONSTRAINT [PK__InvoiceHead]
PRIMARY KEY CLUSTERED ([ID] ASC)
) ON [PRIMARY]
CREATE TABLE [dbo].[CustomData]
(
[ID] [int] IDENTITY(1,1) NOT NULL,
[CustomFieldsContentID] [int] NOT NULL
CONSTRAINT [DF_CustomFieldsData_CustomFieldsContentID] DEFAULT ((0)),
[ModuleID] [int] NOT NULL
CONSTRAINT [DF_CustomFieldsData_ModuleID] DEFAULT ((0)),
[ModuleType] [tinyint] NOT NULL
CONSTRAINT [DF_CustomFieldsData_ModuleType] DEFAULT ((0)),
[ValueNumber] [float] NULL,
[dbTimeStamp] [timestamp] NOT NULL,
CONSTRAINT [PK__CustomFieldsData]
PRIMARY KEY CLUSTERED ([ID] ASC)
) ON [PRIMARY]
The customer relationship is Customers.Code = Invoice.AccountNo = CustomData.ModuleID
Customers are live if Customers.CustomerStatus = 2
The CustomData.CustomFieldsContentID = 32 for previous financial year invoiced sales and 33 for YTD invoiced sales.
CustomData.ModuleID is the customer code (Customers.Code & Invoice.AccountNo)
CustomData.ModuleType will be a static value = 18
CustomData.ValueNumber will be the sum of the nett invoice value.
Invoice.IsCreditNote = 1 requires the nett value of that record to be a negative figure.
I'm not familiar with using cursors to loop through the records, and I'm also not sure of the most efficient way to do this. I don't really know where to start!
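No cursor is needed here; the whole thing can be done set-based. The following is a sketch for the YTD half only, under stated assumptions: the financial year starts on 1st April, CustomFieldsContentID = 33 marks YTD sales, and ModuleType = 18, as described above. Note one wrinkle in the given schema: CustomData.ModuleID is declared int while the text says it holds the customer code (a varchar), so a cast may be needed in practice.

```sql
-- @FYStart = the most recent 1st of April (shifting back 3 months
-- maps Jan-Mar onto the previous calendar year).
DECLARE @FYStart date =
    DATEFROMPARTS(YEAR(DATEADD(MONTH, -3, GETDATE())), 4, 1);

WITH Totals AS (
    SELECT c.Code,
           -- credit notes count as negative values
           SUM(CASE WHEN i.IsCreditNote = 1 THEN -i.Nett ELSE i.Nett END)
               AS YtdNett
    FROM   dbo.Customers AS c
    JOIN   dbo.Invoice   AS i ON i.AccountNo = c.Code
    WHERE  c.CustomerStatus = 2          -- live customers only
      AND  i.Datedb >= @FYStart
    GROUP BY c.Code
)
MERGE dbo.CustomData AS tgt
USING Totals AS src
   ON tgt.ModuleID = src.Code            -- may need an explicit cast
  AND tgt.CustomFieldsContentID = 33     -- 33 = YTD invoiced sales
  AND tgt.ModuleType = 18
WHEN MATCHED THEN
    UPDATE SET tgt.ValueNumber = ISNULL(src.YtdNett, 0)
WHEN NOT MATCHED THEN
    INSERT (CustomFieldsContentID, ModuleID, ModuleType, ValueNumber)
    VALUES (33, src.Code, 18, ISNULL(src.YtdNett, 0));
```

The previous-year total (CustomFieldsContentID = 32) would be a second, near-identical MERGE with the date predicate bounded between the previous two 1st-of-April dates, run only when the current date is the 1st of April.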

Save records in rows or columns in SQL Server

I have to save three document types in a table. The number of document types is fixed and will not change. There are more than 1 million records, and in the future there could be more than 100 million, so performance is very important in my program. I don't know which design will give better database performance: row-based or column-based?
Row-Based:
CREATE TABLE [Person].[Document]
(
[Id] [uniqueidentifier] NOT NULL,
[PersonId] [uniqueidentifier] NOT NULL,
[Document] [varbinary](max) NULL,
[DocType] [int] NOT NULL,
)
Column-based:
CREATE TABLE [Person].[Document]
(
[Id] [uniqueidentifier] NOT NULL,
[PersonId] [uniqueidentifier] NOT NULL,
[Document_Page1] [varbinary](max) NULL,
[Document_Page2] [varbinary](max) NULL,
[Document_Page3] [varbinary](max) NULL,
)
The normalized (or, as you called it, row-based) solution is more flexible.
It allows you to change the number of documents saved for each person without changing the database structure, and usually is the preferred solution.
A million rows is a small table for SQL server.
I've seen database tables with 50 million rows that perform very well.
It's a question of correct indexing.
I do suggest that if you want better performance, you use an int identity column for your primary key instead of a uniqueidentifier: it's very lightweight and much easier for the database to index because it's not randomly ordered to begin with.
I would go with the normalized solution.
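Putting the two suggestions together, a sketch of the normalized table with an int identity clustered key might look like this (the index name is illustrative):

```sql
-- Normalized variant with a narrow, ever-increasing clustered key.
-- A sequential int avoids the page splits that a random
-- uniqueidentifier causes when used as the clustered index key.
CREATE TABLE [Person].[Document]
(
    [Id]       [int] IDENTITY(1,1) NOT NULL PRIMARY KEY CLUSTERED,
    [PersonId] [uniqueidentifier] NOT NULL,
    [Document] [varbinary](max) NULL,
    [DocType]  [int] NOT NULL
);

-- Supporting index for the common lookup "all documents for a person".
CREATE NONCLUSTERED INDEX IX_Document_PersonId
    ON [Person].[Document] ([PersonId], [DocType]);
```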

Nested SQL Queries 3 levels

I am using MS Access and having trouble writing a query (or queries) to get my end result. Maybe someone can lend a hand.
I have Projects, Tasks and SubTasks tables. Each table has a related table for "assignees", meaning a project could be assigned to one employee while its child tasks are assigned to a different employee, and the subtasks in turn to still other employees.
Now, when displaying this on screen, I query for an employee that has been assigned to any project/task/subtask. I need that data to display, but nothing else. So for instance, if the employee I query for has only been assigned to a task, then that project and task should display, but no additional projects/tasks and no subtasks. Likewise, if I query for an employee that has only been assigned to a subtask, then I only want to see the associated project and task. I think I can accomplish this with a series of queries... I think... but is there a slicker method I can use to get this data?
A simple select query with a series of joins almost works, but it fails when an employee has been assigned only to a subtask and not to a project or task.
Thanks for any assistance!
Updated with additional information:
Table Structures:
CREATE TABLE [dbo].[Projects](
[ProjectID] [int] IDENTITY(1,1) NOT NULL,
[ProjectName] [varchar](100) NULL,
[ClientID] [int] NULL,
CREATE TABLE [dbo].[PM_ProjectAssignee](
[AssigneeID] [int] IDENTITY(1,1) NOT NULL,
[ProjectID] [int] NULL,
[EmployeeID] [int] NULL,
CREATE TABLE [dbo].[PM_ProjectTasks](
[ProjectTaskID] [int] IDENTITY(1,1) NOT NULL,
[ProjectID] [int] NULL,
[TaskID] [smallint] NULL,
CREATE TABLE [dbo].[PM_TaskAssignee](
[AssigneeID] [int] IDENTITY(1,1) NOT NULL,
[ProjectTaskID] [int] NULL,
[EmployeeID] [int] NULL,
CREATE TABLE [dbo].[PM_ProjectSubTasks](
[ProjectSubTaskID] [int] IDENTITY(1,1) NOT NULL,
[ProjectTaskID] [int] NULL,
[SubTaskDesc] [varchar](255) NULL,
CREATE TABLE [dbo].[PM_SubTaskAssignee](
[AssigneeID] [int] IDENTITY(1,1) NOT NULL,
[ProjectSubTaskID] [int] NULL,
[EmployeeID] [int] NULL,
With regard to queries, I have tried... a lot. I was implementing a scenario where I ended up with about a half dozen different queries all culminating into one (some of the queries were built with code to allow filtering). However, the last one I tried was:
SELECT ProjectID, ProjectName, EmployeeID, ProjectTaskID, EmployeeID, Association, ProjectSubTaskID, EmployeeID
FROM (qrTest3_Project LEFT JOIN qrTest2_Task ON qrTest3_Project.ProjectID = qrTest2_Task.ProjectID) LEFT JOIN qrtest1_SubTask ON qrTest2_Task.ProjectTaskID = qrtest1_SubTask.Association
WHERE (((qrTest3_Project.EmployeeID)=8)) OR (((qrTest2_Task.EmployeeID)=8)) OR (((qrtest1_SubTask.EmployeeID)=8));
The above query included other queries that simply joined each project/task/subtask to its respective assignee table. I can post those as well if needed.
I hope that provides the additional information you need? If not, happy to provide more.
Thanks!
I think I may have figured it out. As I sort of suspected, I was making it a bit more difficult than it needed to be. Simple joins with criteria give me the data I need and can work with.
SELECT PM_ProjectAssignee.ProjectID, PM_ProjectTasks.ProjectTaskID, PM_ProjectSubTasks.ProjectSubTaskID
FROM (((PM_ProjectAssignee
LEFT JOIN PM_ProjectTasks
ON PM_ProjectAssignee.ProjectID = PM_ProjectTasks.ProjectID)
LEFT JOIN PM_ProjectSubTasks
ON PM_ProjectTasks.ProjectTaskID = PM_ProjectSubTasks.ProjectTaskID)
LEFT JOIN PM_TaskAssignee
ON PM_ProjectTasks.ProjectTaskID = PM_TaskAssignee.ProjectTaskID)
LEFT JOIN PM_SubTaskAssignee
ON PM_ProjectSubTasks.ProjectSubTaskID = PM_SubTaskAssignee.ProjectSubTaskID
WHERE (((PM_ProjectAssignee.EmployeeID)=14))
OR (((PM_TaskAssignee.EmployeeID)=14))
OR (((PM_SubTaskAssignee.EmployeeID)=14))
GROUP BY PM_ProjectAssignee.ProjectID, PM_ProjectTasks.ProjectTaskID, PM_ProjectSubTasks.ProjectSubTaskID;

Implementing custom fields in a database for large numbers of records

I'm developing an app which requires user-defined custom fields on a contacts table. This contacts table can contain many millions of contacts.
We're looking at using a secondary metadata table which stores information about the fields, along with a tertiary value table which stores the actual data.
Here's the rough schema:
CREATE TABLE [dbo].[Contact](
[ID] [int] IDENTITY(1,1) NOT NULL,
[FirstName] [nvarchar](max) NULL,
[MiddleName] [nvarchar](max) NULL,
[LastName] [nvarchar](max) NULL,
[Email] [nvarchar](max) NULL
)
CREATE TABLE [dbo].[CustomField](
[ID] [int] IDENTITY(1,1) NOT NULL,
[FieldName] [nvarchar](50) NULL,
[Type] [varchar](50) NULL
)
CREATE TABLE [dbo].[ContactAndCustomField](
[ID] [int] IDENTITY(1,1) NOT NULL,
[ContactID] [int] NULL,
[FieldID] [int] NULL,
[FieldValue] [nvarchar](max) NULL
)
However, this approach introduces a lot of complexity, particularly with regard to importing CSV files with multiple custom fields. At the moment this requires an update/join statement and a separate insert statement for every individual custom field. Joins would also be required to return custom field data for multiple rows at once.
I've argued for this structure instead:
CREATE TABLE [dbo].[Contact](
[ID] [int] IDENTITY(1,1) NOT NULL,
[FirstName] [nvarchar](max) NULL,
[MiddleName] [nvarchar](max) NULL,
[LastName] [nvarchar](max) NULL,
[Email] [nvarchar](max) NULL,
[CustomField1] [nvarchar](max) NULL,
[CustomField2] [nvarchar](max) NULL,
[CustomField3] [nvarchar](max) NULL /* etc, adding lots of empty fields */
)
CREATE TABLE [dbo].[ContactCustomField](
[ID] [int] IDENTITY(1,1) NOT NULL,
[FieldIndex] [int] NULL,
[FieldName] [nvarchar](50) NULL,
[Type] [varchar](50) NULL
)
The downside of this second approach is that there is a finite number of custom fields that must be specified when the contacts table is created. I don't think that's a major hurdle given the performance benefits it will surely have when importing large CSV files, and returning result sets.
What approach is the most efficient for large numbers of rows? Are there any downsides to the second technique that I'm not seeing?
Microsoft introduced sparse columns for exactly this type of problem. The point is that in a "classic" design you end up with a large number of columns, most of them NULL for any particular row. It's the same here with sparse columns, except that NULLs don't require any storage. Moreover, you can create column sets and modify the set of columns with XML.
Performance- and storage-wise, sparse columns are the winner.
http://technet.microsoft.com/en-us/library/cc280604.aspx
Query performance for any "property bag table" approach is comically slow. But if you need flexibility, you can either have a dynamic table that is changed via an editor, or you have a property bag table. So when you need it, you need it.
But expect the performance to be slow.
The best approach would likely be a ContactCustomFields table whose fields are determined by an editor.
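For reference, a minimal sketch of what the sparse-column variant of the contacts table could look like (column names and sizes are illustrative, not prescribed by the question):

```sql
-- NULLs in SPARSE columns take no storage, and the XML column set
-- exposes all non-NULL sparse values of a row at once.
CREATE TABLE [dbo].[Contact]
(
    [ID]           [int] IDENTITY(1,1) NOT NULL PRIMARY KEY,
    [FirstName]    [nvarchar](100) NULL,
    [Email]        [nvarchar](256) NULL,
    [CustomField1] [nvarchar](max) SPARSE NULL,
    [CustomField2] [nvarchar](max) SPARSE NULL,
    [CustomField3] [nvarchar](max) SPARSE NULL,
    [AllCustomFields] XML COLUMN_SET FOR ALL_SPARSE_COLUMNS
);
```

One caveat: once a column set is defined, SELECT * returns the XML column instead of the individual sparse columns, so queries that need a specific field must name it explicitly.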

Create table with sum of several conditions from other table

I have another problem, as I need to have a sum in one table based on several rows from another table (Cost). The solution above brings in data on one condition; what I need is a sum on several conditions from the other table.
So I have Cost table, which looks:
CREATE TABLE [dbo].[Cost](
[ID_Cost] [int] NOT NULL,
[Name] [varchar](50) NULL,
[ID_CostCategory] [int] NULL,
[ID_Department] [int] NULL,
[ID_Project] [int] NULL,
[Value] [money] NULL
)
GO
Then I have a Department table, with columns: ID_Department, Name, Plan.
What I want to do is add a Realization column to the Department table, holding the sum of values from the Cost table based on several conditions (ID_CostCategory and ID_Department, and if possible ID_Project).
So as a result, the Department table gets a Realization column with the sum of costs from the Cost table per ID_CostCategory and ID_Department.
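One set-based way to populate such a Realization column, assuming it has already been added to Department; @CategoryId and @ProjectId are illustrative parameters that would normally come from a stored procedure:

```sql
DECLARE @CategoryId int = 1;
DECLARE @ProjectId  int = NULL;  -- NULL = don't filter by project

-- Sum matching costs per department, then write the totals into
-- Department.Realization (departments with no costs get 0).
UPDATE d
SET    d.Realization = ISNULL(s.TotalValue, 0)
FROM   dbo.Department AS d
LEFT JOIN
(
    SELECT c.ID_Department, SUM(c.Value) AS TotalValue
    FROM   dbo.Cost AS c
    WHERE  c.ID_CostCategory = @CategoryId
      AND (@ProjectId IS NULL OR c.ID_Project = @ProjectId)
    GROUP BY c.ID_Department
) AS s
    ON s.ID_Department = d.ID_Department;
```

A computed value like this goes stale as Cost changes, so an alternative worth considering is an indexed view or a plain view that computes the sum at query time instead of storing it in Department.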