Facts and dimensions: dynamic dimensions [closed] - sql

(I appreciate this post is perhaps too high-level or philosophical for SO; I'm in the schema planning phase and seeking some guidance.)
After some difficulty working with a clone of our production database for analytics, I am attempting to define an events fact table along with some dimension tables in order to make analytics work simpler.
The block I've hit in my planning is this. We have different categories of event, with different dimensions needed to describe them. E.g. suppose we have an 'Account Settings' event category as well as 'Gallery' events.
In a fact table I might have fields eventCategory and eventName, with example values from above such as:
'EventCategory': 'Account Settings'
'EventName': 'Update Card Billing Details'
Or:
'EventCategory': 'Gallery'
'EventName': 'Create New Gallery'
In each case I want to use a different collection of dimensions to describe them. E.g. for Gallery events we want to know 'template', 'count of images', 'gallery category e.g. fruits'. We have no need for these details with account settings events, which have their own distinct set of dimensions to describe them.
Going by the textbook examples I find online, I would have a dimension table for Gallery events and a dimension table for Account Settings events.
The mental block I have is that these dimensions are dynamic, not static. I want to record in the fact table the value of these dimensions at the time of the event, not 'now'. For example, a user can either be in trial or a paid user. If I had a dimension table 'user', their status might currently be 'paid', but at the time of some previous gallery event they may have been in trial.
What is the 'right' way to handle this:
1. Multiple fact tables, one for Gallery events and one for Account Settings events?
2. Use JSON in a new field in the main fact table, e.g. 'EventDetail', which contains what would otherwise go in a dimension table; by using JSON we capture the values of the dimensions at the time of the event, as opposed to whatever those values are now?
3. A sparse fact table: include fields for each dimension across all categories and leave them null where not applicable?
Given that the dimensions I use to describe an event are dynamic, what is the 'right' way to construct a fact table for analytics? The way I see it just now, the dimension tables would have to be facts themselves to capture the changing values of these attributes over time.

Adding a dimension to any SQL table is always done the same way: by adding a column.
In any kind of history, there is no "now". Every status has a time period: a beginning and ending. I usually name those columns AsOf and Until, because begin/end show up a lot as SQL keywords, making the column names harder to scan for. Usually, only AsOf is needed, because you can self-join the table to find succeeding periods, and use NULL to represent 'now' (where "now" means, as of the time the query is executed).
"If I had a dimension table 'user' their status might currently be 'paid' but at the time of some previous gallery event they may have been in trial."
Right, so the user's status isn't just paid/trial. It's paid or trial starting AsOf some date, until a later AsOf date for the same user.
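As a rough sketch (the table and column names here are my own, not from the question): a status history keyed by AsOf, and a query that recovers the status in force at the time of each event.
CREATE TABLE user_status_history
( user_id integer   NOT NULL
, status  varchar   NOT NULL   -- 'trial' or 'paid'
, AsOf    timestamp NOT NULL   -- when this status began
, PRIMARY KEY (user_id, AsOf)
) ;
-- status in force when an event happened: the latest AsOf not later than the event's timestamp
SELECT e.event_id, s.status
FROM events e
JOIN user_status_history s
  ON s.user_id = e.user_id
 AND s.AsOf = (SELECT MAX(h.AsOf)
               FROM user_status_history h
               WHERE h.user_id = e.user_id
               AND h.AsOf <= e.event_time) ;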
It's hard to be more helpful. There's a bit of jargon in your question, and it's couched in domain-specific terms. I hope by attaching a date/time to every status, you can see your way out of the forest.

(A) Managing temporal data in Postgres
Temporal data is a fairly common need in many kinds of business applications, but it is not a "built-in" feature in Postgres, nor in many other RDBMSs.
As stated by @James K. Lowden, you can use AsOf and Until columns of type timestamp with or without time zone, or you can instead use a single column of type tsrange or tstzrange, i.e. a range of timestamps, which offers some nice built-in functions; see the manual.
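For instance, a few of those built-in range operations (with made-up values):
SELECT tstzrange('2023-01-01', '2023-06-01') @> now() ;                          -- does the range contain this instant?
SELECT tstzrange('2023-01-01', NULL) && tstzrange('2023-05-01', '2023-07-01') ;  -- do the two ranges overlap?
SELECT lower(tstzrange('2023-01-01', '2023-06-01')) ;                            -- lower bound of the range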
In order to avoid overlaps between the timestamp ranges associated with different events for the same data, you can implement the business logic with trigger functions.
For instance, for the same user, you can implement the following trigger function so that the range r1 associated with the status 'in trial' and the range r2 associated with the status 'paid' are automatically set up when the corresponding rows are inserted into the user table, and the ranges of the existing rows for the same user are updated accordingly:
CREATE OR REPLACE FUNCTION before_insert_user ()
RETURNS trigger LANGUAGE plpgsql AS
$$
BEGIN
-- close the valid_range of the existing rows (ie statuses) for the same user_id that are still valid as of now
-- ("user" is double-quoted because user is a reserved word in Postgres)
UPDATE "user"
SET valid_range = tstzrange(lower(valid_range), Now())
WHERE user_id = NEW.user_id
AND valid_range @> Now() ;
-- set up the valid_range for the new row (ie the new status): starts now, open-ended
NEW.valid_range := tstzrange(Now(), NULL) ;
RETURN NEW ;
END ;
$$ ;
CREATE OR REPLACE TRIGGER before_insert_user BEFORE INSERT ON "user"
FOR EACH ROW EXECUTE FUNCTION before_insert_user () ;
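For example, assuming a "user" table with user_id, status and valid_range columns (its definition is not shown here), two successive inserts for the same user end up with non-overlapping ranges:
INSERT INTO "user" (user_id, status) VALUES (1, 'in trial') ;
-- some time later, the user converts
INSERT INTO "user" (user_id, status) VALUES (1, 'paid') ;
-- the 'in trial' row's valid_range is now closed at the time of the second insert,
-- while the 'paid' row's range is open-ended (upper bound NULL, ie 'now')
SELECT user_id, status, valid_range FROM "user" ORDER BY lower(valid_range) ;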
(B) Managing different dimensions for different categories
As already discussed, JSON can be a solution for storing the various dimensions in the same column.
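A minimal sketch of that approach, with hypothetical names (event_fact, EventDetail): the category-specific dimensions are frozen in a jsonb column at the time of the event.
CREATE TABLE event_fact
( event_id      bigserial primary key
, EventCategory varchar
, EventName     varchar
, event_time    timestamptz
, EventDetail   jsonb          -- dimension values as they were at event time
) ;
INSERT INTO event_fact (EventCategory, EventName, event_time, EventDetail)
VALUES ('Gallery', 'Create New Gallery', now(),
        '{"template": "minimal", "image_count": 12, "gallery_category": "fruits"}') ;
-- query a category-specific dimension
SELECT event_time, EventDetail ->> 'template' AS template
FROM event_fact
WHERE EventCategory = 'Gallery' ;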
Another solution could be table inheritance, which comes with some interesting functionality:
CREATE TABLE Event
( EventCategory varchar
, EventName varchar
, ValidityRange tstzrange
, primary key (EventCategory, EventName, ValidityRange)
) ;
CREATE TABLE "user"
( status varchar
) INHERITS (Event) ;
CREATE TABLE Gallery
( template varchar
, "count of images" integer
, "gallery category e.g. fruits" varchar
) INHERITS (Event) ;
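With that in place, rows inserted into a child table are also visible through the parent, which gives a single view over all event categories (note that constraints such as the primary key are not inherited automatically; the values below are just an illustration):
INSERT INTO Gallery (EventCategory, EventName, ValidityRange, template, "count of images", "gallery category e.g. fruits")
VALUES ('Gallery', 'Create New Gallery', tstzrange(now(), NULL), 'minimal', 12, 'fruits') ;
-- querying the parent returns the common columns of the rows from every child table
SELECT EventCategory, EventName, ValidityRange FROM Event ;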

A fact table needs to have its grain defined; if facts don't match that grain they can't be stored in that fact table. So if you have facts with different sets of dimensions, you need to use different fact tables.
Regarding the values in a dimension changing, you need to read up on Slowly Changing Dimensions.
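Roughly, a Type 2 slowly changing dimension looks like this (all names hypothetical): every change creates a new dimension row with its own validity period, and the fact row stores the surrogate key that was current at event time, so the event is always described by the attribute values as they were when it happened.
CREATE TABLE dim_user
( user_key   bigint primary key    -- surrogate key, one per version of the user
, user_id    integer               -- business key, stable across versions
, status     varchar               -- e.g. 'trial' or 'paid'
, valid_from timestamptz
, valid_to   timestamptz           -- NULL = current version
) ;
CREATE TABLE fact_event
( event_id   bigint primary key
, user_key   bigint references dim_user (user_key)   -- the version current at event time
, event_time timestamptz
) ;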

Related

Qlikview - Target missing where no actual value

I have a fact table of Delay by Date by Category (and many other Fields). I have another (target) table of DelayTarget by Month and Category.
I am currently associating the target table with the fact table on Month & Category, but when there is no Delay for a given Category in a given Month, the DelayTarget value does not display in my dashboard.
How do I associate the DelayTarget to all Months in my main dataset - even when there is no Delay to report? I think I want to create a Zero value for Delay when it is null but I don't know how to do this or if this is the best method.
You need to create a Master Calendar to fill the gaps in dates.
I can give you a more detailed answer, but the best would be to share your data model (Ctrl+T) and some example data from the tables (or, even better, just the .qvw).

Calculated Attribute - Min and Max Valid Date

We have some data inside a table (Dimension) with historical values.
Like this (Small example)
ProductId is our Primary Key (and is therefore unique)
Code is our Business Key
Color and Type are our historical values
In Analysis Services (Tabular mode), our users want to build a report on those values.
Client usage could be:
(1) If they only want to see the code ('CAR' in our example) the result would be:
(2) If they want to see the code and the Color:
Same for all the attributes that we can have and all the combinations.
Do you know how to solve this?
Can we add some logic in a calculated attribute?
Thank you,
Arnaud
In essence, you want to aggregate by date? So, for any set of attributes you put in your pivot table, you want to show the earliest ValidFrom date and the latest ValidTo date that applies?
To accomplish this in SSAS Tabular, import the table and hide the columns ValidFrom & ValidTo. (To hide a column, right click it in Visual Studio and select Hide from Client Tools.)
Then, create 2 measures. For example:
Valid From := MIN([ValidFrom])
Valid To := MAX([ValidTo])
Note the extra space in the names to distinguish them from the column names. You could also call them something completely different. (E.g. Earliest Valid From Date)
When people connect to your cube, they will use these 2 measures rather than the columns from the original table. (They won't even see the columns because you've hidden them.)
If their pivot table includes all the attributes above (Product ID, Code, Color, Type), then the table will look exactly like your original table. If they only show Code, then your table will look like your (1). If they only show Code & Color, then your table will look like (2).

How to fetch the changed data in a database? [closed]

I need to fetch the local names from table A that have been changed in the past 30 days, using SQL.
Do I need to create a backup of a table or is there any other method?
And if creating a backup is the only method, how do we compare and find out the locally overridden names?
Table Details:
TREE_ID (NUMBER)
TREE_NM (VARCHAR2)
TREE_LEVEL (VARCHAR2)
UPLEVEL_ID (NUMBER)
HRCHY_TYPE (VARCHAR2)
CATG_ID (NUMBER)
SUBCATG_ID (NUMBER)
STATUS (VARCHAR2)
USER_ID (NUMBER)
CREATE_DATE (DATE)
EFFCT_START_DATE (DATE)
EFFCT_END_DATE (DATE)
UPDATED_DATE (DATE)
TOP_LEVEL_ID (NUMBER)
I need to generate a feed at the end of every month to fetch the changed TREE_NM.
I think there is no default mechanism in Oracle to do that. A possible workaround could be to add a new column to your table A to store the modification date, then define a BEFORE INSERT OR UPDATE trigger that simply writes the current date to every row that is inserted or updated.
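A minimal sketch of that workaround (the column name MODIFICATION_DATE and the trigger name are assumptions, and TABLE_A stands in for your table A):
ALTER TABLE table_a ADD (modification_date DATE);
CREATE OR REPLACE TRIGGER trg_table_a_moddate
BEFORE INSERT OR UPDATE ON table_a
FOR EACH ROW
BEGIN
  :NEW.modification_date := SYSDATE;
END;
/
-- monthly feed: names changed in the past 30 days
SELECT tree_nm
FROM table_a
WHERE modification_date >= SYSDATE - 30;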
Hope this helps.
If you can't modify the table, this can't be done unless you can modify the apps that modify the table. If you can do the latter, make a second table with:
TreeID NUMBER (foreign key)
LastModifiedDate datetime
And write to this table every time the first table is modified. Then you can join the two tables together:
SELECT a.*
FROM TableA a
JOIN Table2 t ON a.TreeID = t.TreeID
WHERE t.LastModifiedDate >= DATEADD(d, -30, GETDATE())
And that will return all records that were modified in the last 30 days.
If you can't modify the database OR the apps, then this is impossible with your current structure, so hopefully you have the ability to make some changes.
EDIT:
If historical changes are something that you will need to track for other purposes in the future, you should look into implementing a data warehouse (specifically, look into slowly changing dimensions).
Second Edit:
I would seriously question why you're not allowed to add a field to this table. In SQL Server, you can add fields to tables without impacting the data or applications that access it. If I were you, I would push pretty hard to add the field to the table instead of creating a more complex and obfuscated database/application structure for no apparent reason.

Qlikview line chart with multiple expressions over time period dimension

I am new to QlikView and after several failed attempts I have to ask for some guidance regarding charts in QlikView. I want to create a line chart which will have:
One dimension – a time period of one month, broken down by the days in it
One expression – Number of created tasks per day
Second expression – Number of closed tasks per day
Third expression – Number of open tasks per day
This is a very basic example and I couldn't find a solution for it; to be honest, I think I don't understand how I should set up my time period dimension and expressions. Each time I try to introduce more than one expression, things go south. Maybe it's because I have multiple dates or my dimension is wrong.
Here is my simple data:
http://pastebin.com/Lv0CFQPm
I have been reading about helper tables like Master Calendar or “Date Island” but I couldn't grasp it. I have tried to follow the guide from here: https://community.qlik.com/docs/DOC-8642 but that only worked for one date (for me at least).
How should I set up the dimension and expressions on my chart, so I can count the ID field if the Created Date matches one from the dimension and the Status is appropriate?
I have the Personal Edition so I am unable to open .qvw files from other authors.
Thank you in advance, kind regards!
My solution to this would be to change from a single line per Call with associated dates to a concatenated list of Call Events with a single date each. i.e. each Call will have a creation event and a resolution event. This is how I achieve that. (I turned your data into a spreadsheet but the concept is the same for any data source.)
Calls:
// creation events: one row per call, tagged with the dummy status 'New'
LOAD Type,
     Id,
     Priority,
     'New' as Status,
     date(floor(Created)) as [Date],
     time(Created) as [Time]
FROM
[Calls.xlsx]
(ooxml, embedded labels, table is Sheet1) where Created>0;
// resolution events: auto-concatenated onto Calls because the field names match exactly
LOAD Type,
     Id,
     Priority,
     Status,
     date(floor(Resolved)) as [Date],
     time(Resolved) as [Time]
FROM
[Calls.xlsx]
(ooxml, embedded labels, table is Sheet1) where Resolved>0;
Key concepts here: the first is allowing QlikView's auto-concatenate to do its job by making the field names of both load statements exactly the same, including capitalisation. The second is splitting the timestamp into a date and a time. This allows you to have a dimension of Date only and group the events for the day. (In big data sets the resource saving is also significant.) The third is creating the dummy 'New' status for each event on the day of its creation.
With just this data and these expressions
Created = count(if(Status='New',Id))
Resolved = count(if(Status='Resolved',Id))
and then
Created-Resolved
all with Full Accumulation ticked for the Open expression (to give you a running total rather than a daily total, which might go negative and look odd), you could draw this graph.
For extra completeness you could add this to the code section to fill up your dates and create the Master Calendar you spoke of. There are many other ways of achieving this.
MINMAX:
load floor(num(min([Date]))) as MINTRANS,
floor(num(max([Date]))) as MAXTRANS
Resident Calls;
let zDateMin=FieldValue('MINTRANS',1);
let zDateMax=FieldValue('MAXTRANS',1);
//complete calendar
Dates:
LOAD
Date($(zDateMin) + IterNo() - 1, '$(DateFormat)') as [Date]
AUTOGENERATE 1
WHILE $(zDateMin)+IterNo()-1<= $(zDateMax);
Then you could draw this chart. Don't forget to turn off Suppress Zero Values on the Presentation tab.
But my suggestion would be to use a combo chart rather than a line chart, so that the calls per day are shown as discrete buckets (bars) while the running total of open calls is a line.

SQL Server/Table Design, table for data snapshots where hundreds of columns possible [closed]

We have a business process that requires taking a "snapshot" of portions of a client's data at a point in time, and being able to regurgitate it later. The data set has some oddities though that make the problem interesting:
The data is pulled from several databases, some of which are not ours.
The list of fields that could possibly be pulled is somewhere between 150 and 200.
The list of fields that are typically pulled is somewhere between 10 and 20.
Each client can pull a custom set of fields for storage; this set is pre-determined ahead of time.
For example (and I have vastly oversimplified these):
Client A decides on Fridays to take a snapshot of customer addresses (1 record per customer address).
Client B decides on alternate Tuesdays to take a snapshot of summary invoice information (1 record per type of invoice).
Client C monthly summarizes hours worked by each department (1 record per department).
When each of these periods happen, a process goes out and fetches the appropriate information for each of these clients... and does something with them.
Sounds like an historical reporting system, right? It kind of is. The data is later parsed up and regurgitated in a variety of formats (xml, csv, excel, text files, etc.) depending on the client's needs.
I get to rewrite this.
Since we don't own all of the databases, I can't just keep references to the data around. Some of that data is overwritten periodically anyway. I actually need to find the appropriate data and set it aside.
I'm hoping someone has a clever way of approaching the table design for such a beast. The methods that come to mind, all with their own drawbacks:
1. A dataset table (data set id, date captured, etc...) plus a data table (data set id, row number, "data as a blob of crap").
2. A dataset table (data set id, date captured, etc...) plus a data table (data set id, row number, possible field 1, possible field 2, possible field 3, ...., possible field x (where x > 150)).
3. A dataset table (data set id, date captured, etc...); a field table (1 row per possible field type); a selected field table (1 row for each field the client has selected); one table for each primitive data type possible (varchar, decimal, integer), keyed on selected field, data set id, row and position, where the data is the single field value.
The first is the easiest to implement, but the "blob of crap" would have to be engineered to be parseable to break it down into reportable fields. It's not very database friendly either, not reportable, etc. Doesn't feel right.
The second is a horror show of columns. shudder
The third sounds right, but kind of doesn't. It's 3NF (yes, I'm old) so it feels right that way. However, reporting on the table screams of "rows that should have been columns" problems -- fairly useless to try to select on outside of a program.
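Roughly, I imagine the third option looking something like this (the names are placeholders), with reporting requiring rows to be pivoted back into columns:
CREATE TABLE DataSet       (DataSetId int PRIMARY KEY, DateCaptured datetime);
CREATE TABLE Field         (FieldId int PRIMARY KEY, FieldName varchar(100), DataType varchar(20));
CREATE TABLE SelectedField (SelectedFieldId int PRIMARY KEY, ClientId int, FieldId int REFERENCES Field(FieldId));
CREATE TABLE ValueVarchar  (SelectedFieldId int, DataSetId int, RowNumber int, Value varchar(500));
-- reporting means pivoting rows back into columns, e.g.:
SELECT v.RowNumber,
       MAX(CASE WHEN f.FieldName = 'Address1' THEN v.Value END) AS Address1,
       MAX(CASE WHEN f.FieldName = 'City'     THEN v.Value END) AS City
FROM ValueVarchar v
JOIN SelectedField sf ON sf.SelectedFieldId = v.SelectedFieldId
JOIN Field f          ON f.FieldId = sf.FieldId
WHERE v.DataSetId = 42
GROUP BY v.RowNumber;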
What are your thoughts?
RE: "where hundreds of columns possible"
The limitations are 1000 columns per table
http://msdn.microsoft.com/en-us/library/ms143432.aspx