Should I combine the columns of a fact table to make it narrower, or should I keep it more user-friendly with a lot of columns? - sql

I have a Fact table that shows the results of KPIs. There are several KPIs, and some of these have a similar output.
My current columns are something like this:
KPI_ID, DOCUMENT_ID, TRUE_FALSE_FLAG1, TRUE_FALSE_FLAG2, DURATION_3, DURATION_4
So, for KPI number 1 (true/false output), the last three columns will be NULL values. Should I combine TRUE_FALSE_FLAG1 and TRUE_FALSE_FLAG2? What is BEST PRACTICE?
In total, there are 18 columns, where 12 of them are either true/false flags or durations in the form of a number of days (integer).
picture of the two alternatives
EDIT:
KPI 3 could be "duration of problem", and you'd have a bunch of problems, each with a DOCUMENT_ID, represented as a row. DURATION_3 would be something like 5 days, 3 days, 10 days, etc. KPI 4 would be "delay of fix after repair was ordered", and the answer would still be an integer in days, but completely unrelated to KPI 3.
Reporting could be "average delay of fix", so roughly a SELECT AVG() FROM the table WHERE KPI_ID = 4 GROUP BY KPI_ID.
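Concretely, something like this rough sketch against my current wide layout (FACT_KPI is just a placeholder name for my fact table):

SELECT KPI_ID, AVG(DURATION_4) AS avg_delay_days   -- average "delay of fix" in days
FROM FACT_KPI                                      -- placeholder table name
WHERE KPI_ID = 4
GROUP BY KPI_ID;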

Based on your latest comment, you are best served by Alternative 2. Specifically, as long as every KPI is only True/False and has only one duration to store, you are better off with Alternative 2.
EDIT: with Alternative 2, each KPI can store one True/False value AND one duration value
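For illustration, a minimal sketch of what Alternative 2 could look like (table and column names here are placeholders; adjust the types to your platform):

CREATE TABLE FACT_KPI_RESULT (      -- placeholder name for the narrow fact table
    KPI_ID        INT NOT NULL,
    DOCUMENT_ID   INT NOT NULL,
    FLAG_VALUE    BIT NULL,         -- used by the true/false KPIs
    DURATION_DAYS INT NULL          -- used by the duration KPIs (number of days)
);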

Related

SQL new variable using multiple conditions (count of occurrences in 6 month look-back period using timestamp for each unique ID)

I am trying to achieve the following:
Attached is what my data looks like.
I want to create 2 new variables which count the number of times 'Target' (variable 1) and 'Competitor' (variable 2) appear within the last 6 months of a given date_of_prescription. This would be done for every unique D_PRESCRIBER_ID.
So for example:
For ID 1003000902, prescribing the COMPETITOR drug on 2020-03-18: when you look at the rows before that, you can see that within the 6 months prior to 2020-03-18 there are 2 Target drugs prescribed and 0 Competitor drugs prescribed. So my variable values will be 2 (variable 1) and 0 (variable 2).
My data is much larger than what the screenshot shows. It has more variables and thousands of unique D_PRESCRIBER_IDs. Each row is not a unique ID; there are duplicates in the data across various date_of_prescription timestamps. These variables need to be created in my select statements in order to keep the rest of the data the same.
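Roughly, I imagine something along these lines (table and column names are placeholders from my description; the date arithmetic is shown in SQL Server style, adjust for your database):

SELECT t.*,
       (SELECT COUNT(*)
        FROM prescriptions p                              -- placeholder table name
        WHERE p.D_PRESCRIBER_ID = t.D_PRESCRIBER_ID
          AND p.drug_type = 'Target'                      -- placeholder column name
          AND p.date_of_prescription >= DATEADD(month, -6, t.date_of_prescription)
          AND p.date_of_prescription <  t.date_of_prescription) AS target_count_6m,
       (SELECT COUNT(*)
        FROM prescriptions p
        WHERE p.D_PRESCRIBER_ID = t.D_PRESCRIBER_ID
          AND p.drug_type = 'Competitor'
          AND p.date_of_prescription >= DATEADD(month, -6, t.date_of_prescription)
          AND p.date_of_prescription <  t.date_of_prescription) AS competitor_count_6m
FROM prescriptions t;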
Any help here would be awesome. Thanks!

How to calculate dynamic % of grand total as a measure on Power BI?

I have the below table connected into Power BI and I am looking for ways to create a formula calculating % of grand total of the Rating column and further subtracting with targets for each rating. For example, the % of grand total for Rating 1 is 3 divided by 7 (42.86%). The most important part of the formula is the denominator which has to remain at a total level and dynamic for any filters applied to either Grade or BU columns. For example, denominator at a total level would be 7 and when filtered down to Academy BU should be 3.
Sample Data Table:
Rating Target Table:
I want the end result to look like this,
I have used the following formula to achieve this,
Measure created: % of total calc = DIVIDE(COUNT('Table'[Rating]),CALCULATE(SUM('Table'[Count]),'Table'[Rating]))
To make the above formula work I had to add an extra column and include ones in it (see below)
I want to know if there are other ways of achieving this outcome?
ALLEXCEPT will produce such a result: it removes the filters coming from the dimensions used in the visual while keeping the mandatory filters such as date, with one condition: rating, date, and any other dimension must be in the same table.
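A sketch of such a measure, assuming the fact table is 'Table' with columns Rating, Grade and BU as in the question (no helper column of ones needed):

% of total calc =
DIVIDE (
    COUNT ( 'Table'[Rating] ),
    CALCULATE (
        COUNT ( 'Table'[Rating] ),                          -- same count, but...
        ALLEXCEPT ( 'Table', 'Table'[Grade], 'Table'[BU] )  -- ...ignore the Rating filter, keep Grade/BU filters
    )
)

With no filters the denominator stays at 7; filtered down to the Academy BU it becomes 3.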

Access SQL - Add Row Number to Query Result for a Multi-table Join

What I am trying to do is fairly simple. I just want to add a row number to a query. Since this is in Access it is a bit more difficult than in other SQL dialects, but under normal circumstances it is still doable using solutions such as DCount or SELECT COUNT(*); examples here: How to show row number in Access query like ROW_NUMBER in SQL or Access SQL how to make an increment in SELECT query
My Issue
My issue is I'm trying to add this counter to a multi-join query that orders by fields from numerous tables.
Troubleshooting
My code is a bit ridiculous (19 fields, seven of which are long expressions, from 9 different joined tables, and ordered by fields from 5 of those tables). To keep things simple, I have a simplified example query below:
Example Query
SELECT DCount("*","Requests_T","[Requests_T].[RequestID]<=" & [Requests_T].[RequestID]) AS counter, Requests_T.RequestHardDeadline AS Deadline, Requests_T.RequestOverridePriority AS Priority, Requests_T.RequestUserGroup AS [User Group], Requests_T.RequestNbrUsers AS [Nbr of Users], Requests_T.RequestSubmissionDate AS [Submitted on], Requests_T.RequestID
FROM ((Requests_T
INNER JOIN ENUM_UserGroups_T ON ENUM_UserGroups_T.UserGroups = Requests_T.RequestUserGroup)
INNER JOIN ENUM_RequestNbrUsers_T ON ENUM_RequestNbrUsers_T.NbrUsers = Requests_T.RequestNbrUsers)
INNER JOIN ENUM_RequestPriority_T ON ENUM_RequestPriority_T.Priority = Requests_T.RequestOverridePriority
ORDER BY Requests_T.RequestHardDeadline, ENUM_RequestPriority_T.DisplayOrder DESC , ENUM_UserGroups_T.DisplayOrder, ENUM_RequestNbrUsers_T.DisplayOrder DESC , Requests_T.RequestSubmissionDate;
If the code above is trying to select a field from a table not included, I apologize - just trust that the field comes from somewhere (lol, i.e. one of the other joins I excluded to simplify the query). A great example of this is the .DisplayOrder fields used in the ORDER BY expression. These are fields from a table that simply determines the "priority" of an enum. Example: Requests_T.RequestOverridePriority displays to the user as a combobox option of "Low", "Med", "High". So in a table I assign numerical priorities of "1", "2", and "3" to these options, respectively. Thus when ENUM_RequestPriority_T.DisplayOrder DESC is used in the ORDER BY, all "High" priority requests will display above "Med" and "Low". The same holds true for ENUM_UserGroups_T.DisplayOrder and ENUM_RequestNbrUsers_T.DisplayOrder.
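For illustration, the lookup table behind this might simply contain (values taken from the description above):

ENUM_RequestPriority_T
Priority   DisplayOrder
Low        1
Med        2
High       3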
I'd also prefer to NOT use DCOUNT due to efficiency, and rather do something like:
(select count(*) from Requests_T where Requests_T.RequestID >= RequestID) as counter
Due to the "Order By" expression however, my 'counter' doesn't actually count my resulting rows sequentially since both of my examples are tied to the RequestID.
Example Results
Based on my actual query results, I've made an example result of the query above.
Counter Deadline Priority User_Group Nbr_of_Users Submitted_on RequestID
5 12/01/2016 High IT 2-4 01/01/2016 5
7 01/01/2017 Low IT 2-4 05/06/2016 8
10 Med IT 2-4 07/13/2016 11
15 Low IT 10+ 01/01/2016 16
8 Low IT 2-4 01/01/2016 9
2 Low IT 2-4 05/05/2016 2
The query is displaying my results in the proper order (those with the nearest deadline at the top, then those with the highest priority, then user group, then # of users, and finally, if all else is equal, sorted by submission date). However, my "Counter" values are completely wrong! The counter field should simply increment by 1 for each new row. Thus if displaying a single request on a form for a user, I could say
"You are number: Counter [associated to RequestID] in the
development queue."
Meanwhile my results:
Aren't sequential (notice the first four counters at least increase down the list, but the final two rows break the pattern)! Even though the final two rows are lower in priority than the records above them, they ended up with a lower Counter value simply because they had a lower RequestID.
They don't start at "1" and increment +1 for each new record.
Ideal Results
Thus my ideal result from above would be:
Counter Deadline Priority User_Group Nbr_of_Users Submitted_on RequestID
1 12/01/2016 High IT 2-4 01/01/2016 5
2 01/01/2017 Low IT 2-4 05/06/2016 8
3 Med IT 2-4 07/13/2016 11
4 Low IT 10+ 01/01/2016 16
5 Low IT 2-4 01/01/2016 9
6 Low IT 2-4 05/05/2016 2
I'm spoiled by PLSQL and other software where this would be automatic lol. This is driving me crazy! Any help would be greatly appreciated.
FYI - I'd prefer an SQL option over VBA if possible. VBA is very much welcomed and will definitely get an up vote and my huge thanks if it works, but I'd like to mark an SQL option as the answer.
Unfortunately, MS Access doesn't have the very useful ROW_NUMBER() function like other database engines do. So we are left to improvise.
Because your query is so complicated and MS Access does not support common table expressions, I recommend you follow a two step process. First, name that query you already wrote IntermediateQuery. Then, write a second query called FinalQuery that does the following:
SELECT i1.field_primarykey, i1.field2, ... , i1.field_x,
(SELECT COUNT(*) FROM IntermediateQuery i2
WHERE i2.field_primarykey <= i1.field_primarykey) AS Counter
FROM IntermediateQuery i1
ORDER BY Counter;
The unfortunate side effect of this is that the more data your query returns, the longer the inline subquery will take to calculate. However, this is the only way you'll get your row numbers. It does depend on having a primary key in the table. In this particular case, it doesn't have to be an explicitly defined primary key; it just needs to be a field or combination of fields that is completely unique for each record.

No sum() with null values

I'm looking for a solution to create sums of roughly 10 scores and targets of a product over 6 different dimensions. There are some more I won't bother you with. For every dimension I need a total. For example:
SalesPeriod. Product: Bikes. Dimensions: bmx, size, colours, with bars etc. Targets: 1,2,3,4,5. Scores:1,2,3,4,5.
So 10 totals for bmx bikes with size x, colour red and bars, and 10 totals for bmx bikes, size x, colour red etc etc.
However, every score needs to be calculated only when none of the underlying values is null. For example, if score 1 contains a null then it is not calculated, but if score 2 does not contain a null then it should be calculated.
At this point the calculation is done via a case statement which basically checks the values within each column/score and only calculates the total when the count of scores is equal to the expected number of rows.
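Simplified, the current approach looks roughly like this (table and column names are placeholders):

SELECT product,
       CASE WHEN COUNT(score1)  = COUNT(*) THEN SUM(score1)  END AS total_score1,   -- only when no null
       CASE WHEN COUNT(target1) = COUNT(*) THEN SUM(target1) END AS total_target1,
       CASE WHEN COUNT(score2)  = COUNT(*) THEN SUM(score2)  END AS total_score2
FROM sales                                                  -- placeholder table name
GROUP BY product;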
The calculation requires a lot of CPU, and with a larger dataset this is very inefficient; it simply takes too long.
I'm looking for a solution that will be much more efficient. What could be my best option to try?
You can filter (or first group) the products with non-null values only by using your same count method. I don't think there is any other method.
SELECT columnid, SUM(column1)
FROM table
GROUP BY columnid
HAVING COUNT(column1)=COUNT(*);
Then you can join it on columnid with another similar query on another columnN as well.
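For example, a sketch of joining two such per-column aggregates (names are placeholders; use a full outer join, where supported, if a group may qualify for one column but not the other):

SELECT s1.columnid, s1.total1, s2.total2
FROM (SELECT columnid, SUM(column1) AS total1
      FROM scores                              -- placeholder for your table
      GROUP BY columnid
      HAVING COUNT(column1) = COUNT(*)) AS s1
LEFT JOIN (SELECT columnid, SUM(column2) AS total2
           FROM scores
           GROUP BY columnid
           HAVING COUNT(column2) = COUNT(*)) AS s2
  ON s1.columnid = s2.columnid;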
(I'm not sure if I understood your problem completely, but you basically want an efficient query with SUM(scores) and SUM(targets) only when they are not null? Or only when they are both not null? Or only scores? Or only targets?)

Row blocks in Hive (how to group rows by certain criteria and count these groups)

Here is a sample of the data I have:
Date_key UserID
20140401 a
20140402 a
20140406 a
20140407 a
20140408 a
20140409 a
20140404 b
20140408 b
20140409 b
20140414 b
20140415 b
... ...
Each row has a Date, User ID couple which indicates that that user was active on that day. A user can appear on multiple dates and a date will have multiple users -- just like in the example.
I want to get the number of consecutive day groups (i.e. blocks of activity). For example, this value for 'User a' would be 2 because they were active on 20140401 and 20140402 which is the first group of consecutive days. After 20140402, they waited for a while before becoming active again (i.e. they were not active the following day). On 20140406, their second block of activity started and continued without any break up until 20140409. For 'User b', this value would be 3 because they have been active during three consecutive day periods: 1)20140404 2) 20140408, 20140409 3) 20140414, 20140415
I use Hive. I am not sure if this is possible in Hive, but if the data needs to be carried over to an RDBMS to perform this task, I can do that too. Your recommendations are greatly appreciated. Thank you!
Cheers
When you use the distribute by clause, i.e. ... DISTRIBUTE BY user_id SORT BY user_id, date_key DESC ..., all the records for a particular user go to a particular reducer, where they are then sorted by date_key descending. From there, why not write a UDF that iterates through the records, increments a continuity counter by 1 whenever there is a break in the continuity, and returns the result along with the user_id.
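If your Hive version supports windowing functions (0.11+), the same idea can also be expressed without a custom UDF using the classic gaps-and-islands approach; a sketch assuming a table activity(date_key INT in yyyyMMdd form, userid STRING):

SELECT userid,
       COUNT(*) AS consecutive_day_blocks
FROM (
    SELECT userid,
           date_key,
           LAG(date_key) OVER (PARTITION BY userid ORDER BY date_key) AS prev_key
    FROM activity
) t
-- a row starts a new block when there is no previous active day
-- or the previous active day is more than one day earlier
WHERE prev_key IS NULL
   OR datediff(
          from_unixtime(unix_timestamp(CAST(date_key AS STRING), 'yyyyMMdd')),
          from_unixtime(unix_timestamp(CAST(prev_key AS STRING), 'yyyyMMdd'))
      ) > 1
GROUP BY userid;

For 'User a' this counts 2 block starts (20140401 and 20140406), and for 'User b' it counts 3 (20140404, 20140408, 20140414), matching the expected values.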