Modeling an OLAP cube for Football analysis


I would like to do a basic analysis on NFL Football data by season. For example, the Fact that I would like to have would be:
[NFL Fact]
Season (Dimension)
Team (Dimension)
Coach (Dimension)
Roster (Dimension, one-to-many, can be the same 'Person' dimension as for Coach)
Wins (Int)
Losses (Int)
Ties (Int)
Made_Playoffs (Bool)
Won_Conference (Bool)
Won_Superbowl (Bool)
An example of the data would look like this in human-readable form:
In the 2019 season
the Kansas City Chiefs team
coached by Andy Reid
who had on their roster Patrick Mahomes, Travis Kelce, ...
had 12 wins
and 4 losses
and 0 ties in the regular season
made the playoffs
and won their conference
and won the Super Bowl.
If the above is our fact table, how could the dimensions be modeled? Here is my first stab at it:
Season
Year (key)
AFC Champion (string or link to dimension? Example: "DAL")
NFC Champion (string or link to dimension? Example: "DAL")
NFL Champion (string or link to dimension? Example: "DAL" or should it be "2018 DAL" ?)
Question: should the AFC/NFC/NFL Champion be a string? Or should this reference a Team dimension? Why would one choice be made over the other?
Team
Code (String, example: "DAL")
Season (Int, example: 2018, key = code+year, example: "2018 DAL")
Conference (String, example: "AFC")
Division (String, example: "South")
Name (String, example: "Dallas Cowboys")
URL (String, example: "https://nfl.com/dallas-cowboys")
Question: do I need to make the key be Code+Year, or can I just use the Code since the season is 'derived' from the Fact table it is linked to? Should the Season be an Integer or link to the Season dimension?
Person (includes both players, coaches, etc.)
Name
Season
Team (?)
Position
Age
College
Question: do I need to include the Team here, or is that not needed because the team is already inferred from the fact table? Should the Season be linked here as well?
Any suggestions on the above format would be most helpful, thank you!

I don't know much about NFL football, but I can give you some basic ideas on how to organize your entities and relate them.
First of all, think of the fact table as the hub that connects all the dimensions.
The [Team] dimension is a good candidate for a Type 2 slowly changing dimension (SCD-2), because you may want to track a team moving between divisions or conferences, or even changing its name. To get seasonal information for a team you only need to join [NFL Fact] and [Team], so there is no need for a Code+Year key or a Season attribute on [Team].
[Season] in your example looks less like a dimension and more like a data mart over [NFL Fact], because every row of [Season] can be obtained by querying [NFL Fact]. The AFC/NFC/NFL champions can be derived from the fact table's flags plus the Conference attribute of the [Team] dimension. I suggest creating a [Date] dimension holding the list of dates (years, YYYYMMDD, or whatever grain you need) instead of [Season].
[Person] cannot be attached to [NFL Fact], because the lowest granularity of [NFL Fact] is a team. To manage players and coaches you need a separate fact table (a player can move to another team next year, and so on). This [Persons] fact table would hold, for each person and year, the surrogate key of the team they belonged to. It also means that Coach and Roster belong in this new [Persons] fact table instead of [NFL Fact], because in your example [NFL Fact] only collects team stats.
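To make the two-fact-table idea concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table and column names (dim_team, dim_person, fact_team_season, fact_roster, role) are made up for illustration, a plain season integer stands in for a [Date] dimension key, and SCD-2 validity columns are left out to keep it short. The rows are just sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_team (
    team_key INTEGER PRIMARY KEY, code TEXT, name TEXT,
    conference TEXT, division TEXT);
CREATE TABLE dim_person (person_key INTEGER PRIMARY KEY, name TEXT);

-- Grain: one row per team per season.
CREATE TABLE fact_team_season (
    season INTEGER, team_key INTEGER REFERENCES dim_team(team_key),
    wins INTEGER, losses INTEGER, ties INTEGER,
    made_playoffs INTEGER, won_conference INTEGER, won_superbowl INTEGER,
    PRIMARY KEY (season, team_key));

-- Grain: one row per person per team per season (coaches and players alike).
CREATE TABLE fact_roster (
    season INTEGER, team_key INTEGER REFERENCES dim_team(team_key),
    person_key INTEGER REFERENCES dim_person(person_key), role TEXT);

INSERT INTO dim_team VALUES
    (1, 'KC', 'Kansas City Chiefs', 'AFC', 'West'),
    (2, 'SF', 'San Francisco 49ers', 'NFC', 'West');
INSERT INTO dim_person VALUES
    (1, 'Andy Reid'), (2, 'Patrick Mahomes'), (3, 'Travis Kelce');
INSERT INTO fact_team_season VALUES
    (2019, 1, 12, 4, 0, 1, 1, 1),
    (2019, 2, 13, 3, 0, 1, 1, 0);
INSERT INTO fact_roster VALUES
    (2019, 1, 1, 'coach'), (2019, 1, 2, 'player'), (2019, 1, 3, 'player');
""")

# Conference champions are derived from the fact table, not stored on a Season dimension.
champions = conn.execute("""
    SELECT f.season, t.conference, t.code
    FROM fact_team_season f
    JOIN dim_team t ON t.team_key = f.team_key
    WHERE f.won_conference = 1
    ORDER BY f.season, t.conference
""").fetchall()

# Coach and roster come from the person-level fact table.
kc_2019 = conn.execute("""
    SELECT p.name, r.role
    FROM fact_roster r
    JOIN dim_person p ON p.person_key = r.person_key
    JOIN dim_team t ON t.team_key = r.team_key
    WHERE r.season = 2019 AND t.code = 'KC'
    ORDER BY p.person_key
""").fetchall()
```

With this layout the team key does not need the year baked in: the season lives on the fact rows, and the same person can appear under a different team in another season.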

Related

Subtotal by unique store with multiple sales staff over period

I am trying to gather various KPIs for salespersons in multiple stores. The goal is to break store performance down to the salesperson level.
I am facing an issue when trying to add a hit rate, as this is normally calculated only by store: number of quotes given / visitors.
Even though it may not be 100% accurate, I still wish to have the KPI by salesperson. I am able to do this at the salesperson level, but my subtotal for the store is incorrect, because it sums the visitors across salespersons.
The monthly period has to be taken into account, as salespersons come and go throughout the period. Example of the subtotal I want for the measure "Vis": 370 for Store X over months 1, 2 and 3, and 395 for Store Y.
Vis measure = Visitor (the calculation I have tried, which gives the wrong store total for the period).
I have tried various Calculate, Sum, max functions, but nothing seems to provide the result I need.
I hope that someone might be able to help me get along with this.
The example data tables were attached as an image.
Thanks in advance.

This sounds like a case where the HASONEVALUE function would be useful.
The idea is to use the result of that function in an IF to determine whether you are calculating at the salesperson level or at the store level, which can contain multiple salespersons. You would then have two different calculations: one for the salesperson and store combination, and one for the store level.
An example would be something like the following, assuming you have a Salesperson table:
Measure:= IF( HASONEVALUE( Salesperson[Sales Person] ),
[Vis],
[Measure for subtotal]
)
[Measure for subtotal] would just be the calculation that you want for your store total.
Of course, if you filter to just a single salesperson, the totals for the store will simply match that salesperson.
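To see why the plain sum goes wrong and what the two branches are supposed to return, here is a small Python sketch of the same logic. The data is invented (only the Store X total of 370 comes from the question), and the names visitors, quote_rows, vis and hit_rate are hypothetical. It assumes visitors are recorded once per store and month.

```python
# Hypothetical data: visitors are counted per store and month, not per salesperson.
visitors = {("X", 1): 100, ("X", 2): 120, ("X", 3): 150}  # 370 in total for Store X

# (store, month, salesperson, quotes given); Bob leaves before month 3.
quote_rows = [
    ("X", 1, "Ann", 10), ("X", 1, "Bob", 15),
    ("X", 2, "Ann", 12), ("X", 2, "Bob", 18),
    ("X", 3, "Ann", 20),
]

def vis(store, salesperson=None):
    """Visitors in the months a salesperson was active, or for the whole store.

    salesperson=None plays the role of the store subtotal row, i.e. the
    branch taken when HASONEVALUE is false.
    """
    months = {m for s, m, p, _ in quote_rows
              if s == store and (salesperson is None or p == salesperson)}
    # Each (store, month) is counted once, however many salespersons worked it.
    return sum(visitors[(store, m)] for m in months)

def hit_rate(store, salesperson=None):
    """Quotes given divided by visitors, at salesperson or store level."""
    quotes = sum(q for s, _, p, q in quote_rows
                 if s == store and (salesperson is None or p == salesperson))
    return quotes / vis(store, salesperson)

# Adding up the salesperson rows double counts the months both of them worked.
naive_total = vis("X", "Ann") + vis("X", "Bob")
store_total = vis("X")
```

Here naive_total comes out at 590 while store_total is the expected 370, which is the same gap the question describes between the summed rows and the real store subtotal.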

SSAS: How to model KPI goals for different levels on hierarchy

I have a simple SSAS model from a fact table FactPerformance:
- DateKey
- WorkplaceKey
- Measure1
- Measure2
The dimension DimWorkplace consists of a hierarchy:
Plant
Department
Area
Workplace
I have to create KPIs where the goals are given by Area and Department. The goals for a Workplace have to be derived from the Area it is assigned to. The goals are not aggregatable and have to be configurable like this:
Plant = "Plant 1", Department = "Milling": Goal1 = 0.82, Goal3 = 0.85
Plant = "Plant 1", Department = "Milling", Area = "Area A": Goal1 = 0.9, Goal2 = 0.92
Furthermore, the goals might change over time, so the values have to be historized (SCD?).
My first idea was to turn the dimension DimWorkplace into an SCD and add attributes for the goals. For various reasons I would prefer independent storage for the goals.
I had trouble finding sample implementations. Are there any best practices? How are these challenges usually solved? Do you have any hints for me? Thanks in advance!
Andreas

One way is to add another fact table, "FactPerformanceGoal", with the same dimensions and measures.
Another (preferred) way is to add a dimension called "Version" to your fact table. The "Version" dimension has two members, "actual" and "goal". Link all your current fact rows to "actual", then add more rows linked to "goal" for your goals.
SSAS supports a default member for dimensions. Make "actual" the default member and set the dimension to be non-aggregatable (IsAggregatable = False on the attribute), so that actuals and goals are never summed together.
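The "Version" approach can be sketched outside SSAS with a plain table, here using Python's sqlite3. The table name fact_performance, the version column and the sample values are assumptions for illustration (only the 0.9 goal echoes the question), and the default argument merely imitates a default member.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE fact_performance (
    date_key INTEGER, workplace_key INTEGER,
    version TEXT CHECK (version IN ('actual', 'goal')),
    measure1 REAL);
INSERT INTO fact_performance VALUES
    (20240101, 1, 'actual', 0.78),
    (20240101, 2, 'actual', 0.88),
    (20240101, 1, 'goal',   0.9),
    (20240101, 2, 'goal',   0.9);
""")

def measure1(workplace_key, version="actual"):
    """Read Measure1 for exactly one version; 'actual' acts like the default member."""
    row = conn.execute(
        "SELECT measure1 FROM fact_performance"
        " WHERE workplace_key = ? AND version = ?",
        (workplace_key, version),
    ).fetchone()
    return row[0]

# Actual and goal side by side; the two versions are compared, never added up.
comparison = conn.execute("""
    SELECT workplace_key,
           MAX(CASE WHEN version = 'actual' THEN measure1 END) AS actual,
           MAX(CASE WHEN version = 'goal'   THEN measure1 END) AS goal
    FROM fact_performance
    GROUP BY workplace_key
    ORDER BY workplace_key
""").fetchall()
```

Because every query picks a single version, there is no total across "actual" and "goal", which is the effect the non-aggregatable setting gives you in the cube.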

Data Model for analytical CRM as a service

We're developing an analytical CRM as a service and I have a question about data model.
A CRM user might upload a batch of clients/customers (about 1 million rows a week).
We already have 200 million rows in this database.
CRM users want us to provide a feature for estimating the number of people in a segment defined by various constraints (sex, age, location, etc.; ~50 mandatory filters), as in any advertising platform (Facebook Ads, Google AdWords). Of course, they expect to see the count in real time.
For simplicity, imagine you want a count of male smokers aged 18 or 22, so you apply filters successively:
sex (select count (id) where sex = m)
age (select count (id) where sex = m and age in (18,22))
smokers (select count (id) where sex = m and age in (18,22) and smoker = 1)
Afterwards you might extract a list of emails from these segments (not real time).
The other part is the operational database, which only receives writes (operational writes from the HTTPS API).
Q. How should the data model be designed for this case? Which of the following should we choose:
type of DB and how many nodes
table sets, normalization and relations
Thanks in advance!

Help with SQL aggregate functions

I've been learning SQL for about a day now and I've run into a road bump. Please help me with the following questions:
STUDENT (**StudentNumber**, StudentName, TutorialNumber)
TUTORIAL (**TutorialNumber**, Day, Time, Room, TutorInCharge)
ASSESSMENT (**AssessmentNumber**, AssessmentTitle, MarkOutOf)
MARK (**AssessmentNumber**, **StudentNumber**, RawMark)
PK and FK are identified within "**". I need to generate queries that:
1) List of assessment tasks results showing: Assessment Number, Assessment Title, and average Raw Mark. I know how to use the avg function for a single column, but to display something for multiple columns... a little unsure here.
My attempt:
SELECT RawMark, AssessmentNumber, AsessmentTitle
FROM MARK, ASSESSMENT
WHERE RawMark = (SELECT (RawMark) FROM MARK)
AND MARK.AssessmentNumber = ASSESSMENT.AssessmentNumber;
2) Report on tutorial enrollment showing: Tutorial Number, Day, Room, Tutor in Charge and number of students enrolled. Same as the avg function, now for the count function. Would this require 2 queries?
3) List each student's Raw Mark in each of the assessment tasks showing: Assessment Number, Assessment Title, Student Number, Student Name, Raw Mark, Tutor in Charge and Time. Sort on Tutor in Charge, Day and Time.

Here is an example for the first one; take the logic and see if you can expand it to the other questions. I find that these things can be hard to learn if you can't find any solid examples, but once you get the hang of it you'll sort it out pretty quickly.
1)
SELECT a.AssessmentNumber, a.AssessmentTitle, AVG(RawMark)
FROM ASSESSMENT a LEFT JOIN MARK m ON a.AssessmentNumber = m.AssessmentNumber
GROUP BY a.AssessmentNumber, a.AssessmentTitle
Or, without a LEFT JOIN or table aliases (note that this inner join omits assessments that have no marks yet):
SELECT ASSESSMENT.AssessmentNumber, ASSESSMENT.AssessmentTitle, AVG(RawMark)
FROM ASSESSMENT,MARK
WHERE ASSESSMENT.AssessmentNumber = MARK.AssessmentNumber
GROUP BY ASSESSMENT.AssessmentNumber, ASSESSMENT.AssessmentTitle
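If you want to try both variants without a database server, here is a small runnable sketch using Python's sqlite3 module. The sample rows are invented, and the ORDER BY is added only to make the output order predictable. It shows the one practical difference between the two queries: the LEFT JOIN keeps an assessment that has no marks (its average is NULL, which Python shows as None), while the comma-style inner join drops it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE ASSESSMENT (AssessmentNumber INTEGER PRIMARY KEY,
                         AssessmentTitle TEXT, MarkOutOf INTEGER);
CREATE TABLE MARK (AssessmentNumber INTEGER, StudentNumber INTEGER, RawMark REAL,
                   PRIMARY KEY (AssessmentNumber, StudentNumber));
-- Invented sample rows: the Essay has not been marked yet.
INSERT INTO ASSESSMENT VALUES (1, 'Quiz', 10), (2, 'Essay', 20);
INSERT INTO MARK VALUES (1, 100, 6), (1, 101, 8);
""")

# LEFT JOIN: every assessment appears, marked or not.
left_join_rows = conn.execute("""
    SELECT a.AssessmentNumber, a.AssessmentTitle, AVG(m.RawMark)
    FROM ASSESSMENT a
    LEFT JOIN MARK m ON a.AssessmentNumber = m.AssessmentNumber
    GROUP BY a.AssessmentNumber, a.AssessmentTitle
    ORDER BY a.AssessmentNumber
""").fetchall()

# Inner join written in the comma/WHERE style: unmarked assessments vanish.
inner_join_rows = conn.execute("""
    SELECT ASSESSMENT.AssessmentNumber, ASSESSMENT.AssessmentTitle, AVG(RawMark)
    FROM ASSESSMENT, MARK
    WHERE ASSESSMENT.AssessmentNumber = MARK.AssessmentNumber
    GROUP BY ASSESSMENT.AssessmentNumber, ASSESSMENT.AssessmentTitle
    ORDER BY ASSESSMENT.AssessmentNumber
""").fetchall()
```

The same pattern (join, then GROUP BY every non-aggregated column) carries over to the COUNT question, so a single query is enough there too.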

A simple MDX question

I am new to MDX and I know that this must be a simple question but I haven't been able to find an answer.
I am modeling a questionnaire that has questions and answers. What I am trying to achieve is to find out the number of people who gave specific answers to questions, e.g. the number of males aged between 20 and 25.
When I run the query below for the questions individually, the correct result is returned:
SELECT
[Measures].[Fact Demographics Count] ON Columns
FROM
[Dsv All]
WHERE
[Answer].[Dim Answer].&[1]
[Measures].[Fact Demographics Count] is a count of the primary key column
[Answer].[Dim Answer].&[1] is the key for the Male answer
Result for number of people who are male = 150
Result for number of people who are between 20-25 = 12
But when I run the next query below, rather than getting the number of people who are male and aged between 20 and 25, I get the sum of the number of people who are male and the number of people who are between 20 and 25.
SELECT
[Measures].[Fact Demographics Count] ON Columns
FROM
[Dsv All]
WHERE
{[Answer].[Dim Answer].&[1],[Answer].[Dim Answer].&[9]}
result = 162
The structure of the fact table is
FactDemographicsKey,
RespodentKey,
QuestionKey,
AnswerKey
Any help would be greatly appreciated
Thanks

Take a look at the MDX function FILTER; it may give you what you need. A combination of FILTER and member properties to filter against the IDs might do it.
You're having a problem because what you're trying to do goes somewhat against the grain of how an OLAP cube is structured (in my experience). Age and Gender are both members of the same dimension (Answers), so each gets its own cells for aggregation, but, unlike if Age and Gender were each in their own dimension, they are not aggregated with respect to one another; they just get added together. In an OLAP cube, each combination of a member from one dimension with a member of every other dimension gets a "cell" holding the value of each measure for that combination. That is what you want, but members of the same dimension (such as Answers) aren't cross-calculated in that way.
If possible, I would recommend breaking the individual answers out into individual dimensions, i.e. Age and Gender each get their own dimension with their own members; then what you want will flow naturally out of your cube. Otherwise, I'm afraid you will have a lot of MDX fiddling to do. (I am not an MDX expert, though, so I could be completely off base on this one, but that is my understanding.)
Also, definitely read the book previously mentioned, MDX Solutions, unless this is the only MDX query you think you'll need to write. MDX and Multidimensional analysis are nothing like SQL, and a solid understanding of the structure of an OLAP database and MDX in general is absolutely essential, and that book does a very, very nice job of getting you where you need to be in that department.
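The "added together" behaviour is easy to reproduce outside a cube. Below is a tiny Python sketch with invented respondents, where answer key 1 stands for Male and 9 for the 20-25 band as in the question. Slicing on a set of two members of the same dimension counts fact rows matching either answer, whereas the question wants respondents who gave both.

```python
# Invented fact rows: (respondent key, answer key). 1 = Male, 9 = aged 20-25.
fact_rows = [
    ("r1", 1), ("r1", 9),  # male, 20-25
    ("r2", 1),             # male, other age band
    ("r3", 9),             # not male, 20-25
    ("r4", 1),             # male, other age band
]

# Slicing on the set {1, 9}: every fact row matching either answer is counted.
rows_matching_either = sum(1 for _, answer in fact_rows if answer in {1, 9})

# What was intended: respondents who gave answer 1 AND answer 9.
males = {resp for resp, answer in fact_rows if answer == 1}
aged_20_25 = {resp for resp, answer in fact_rows if answer == 9}
respondents_matching_both = len(males & aged_20_25)
```

With this data the first number is 5 (3 males plus 2 people aged 20-25) and the second is 1, the same pattern as the 150 + 12 = 162 result in the question.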

When trying to figure out problems with WHERE criteria or slices, I find it helpful to break the items you're slicing on out into separate dimensions, rather than measures.
select
[Measures].[Fact Demographics Count] on Columns
from [Dsv All]
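where
(
[Answer].[Dim Answer].&[1],
[Dim Age Band].[20-25]
)
Here the slicer is a tuple (parentheses), which intersects members from different dimensions; a set (braces) may only contain members of the same dimension. That said, this doesn't really use the power of MDX - you get just one value.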
select
[Dim Answer].Members on Columns,
[Dim Age Band].Members on Rows
from [Dsv All]
where ( [Measures].[Fact Demographics Count] )
This will give you a pivot table (or crosstab) breaking down gender (on columns) by age band (on rows).
BTW - if you're learning MDX, the book MDX Solutions is far and away the best starting point that I've found.

Firstly, thanks to everyone for their replies. This was an interesting one to solve, and for anyone new to MDX and coming from SQL it's an easy trap to fall into.
So, for those interested, here is a brief overview of the solution.
I have 3 tables
factDemographics: holds respondents and their answers (who answered what)
dimAnswer: the answers
dimRespondent: the respondents
In the data source view for the cube I duplicated factDemographics 5 times using named queries and named the copies fact1, fact2, ..., fact5 (which creates 5 measure groups).
Using the Visual Studio Cube Wizard, I set up the following:
fact1, fact2, ... as fact tables
dimRespondent as a fact table as well. I use this table to get the number of respondents.
Removed the original factDemographics table.
Once the cube was created I duplicated the dimAnswer dimension 5 times, naming them filter1, filter2, ...
Finally in the Cube Structure's Dimension Usage tab I linked these together as follows
filter1 many to many dimRespondent
filter2 many to many dimRespondent
filter3 many to many dimRespondent
filter4 many to many dimRespondent
filter5 many to many dimRespondent
filter1 regular relationship fact1
filter2 regular relationship fact2
filter3 regular relationship fact3
filter4 regular relationship fact4
filter5 regular relationship fact5
This now enables me to rewrite the query I used in my original post as
SELECT
[Measures].[Dim Respondent Count] On 0
FROM
[DemographicsCube]
WHERE
(
[Filter1].[Answer].&[Male],
[Filter2].[Answer].&[20-25]
)
My query can now be filtered by up to 5 questions.
Although this works I'm sure that there is a more elegant solution. If anyone knows what that is I'd love to hear it.
Thanks
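For readers coming from SQL, the duplicated fact tables behave roughly like joining the fact table to itself once per filter and counting distinct respondents. Here is a sketch of that analogy using Python's sqlite3; it is not the SSAS implementation. The helper count_respondents and the sample rows are made up, the column is spelled RespondentKey here, and answer keys 1 (Male) and 9 (20-25) follow the original question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE factDemographics (RespondentKey INTEGER, QuestionKey INTEGER, AnswerKey INTEGER);
-- Invented rows: question 1 = gender, question 2 = age band.
INSERT INTO factDemographics VALUES
    (1, 1, 1), (1, 2, 9),
    (2, 1, 1),
    (3, 2, 9),
    (4, 1, 1);
""")

def count_respondents(*answer_keys):
    """Count respondents who gave ALL of the given answers (at least one key).

    Each extra answer adds one more alias of the fact table, much like each
    filterN dimension gets its own factN copy in the cube.
    """
    joins = "".join(
        f" JOIN factDemographics f{i}"
        f" ON f{i}.RespondentKey = f0.RespondentKey AND f{i}.AnswerKey = ?"
        for i in range(1, len(answer_keys))
    )
    sql = ("SELECT COUNT(DISTINCT f0.RespondentKey)"
           f" FROM factDemographics f0{joins} WHERE f0.AnswerKey = ?")
    # The JOIN placeholders come first in the SQL text, the WHERE placeholder last.
    params = list(answer_keys[1:]) + [answer_keys[0]]
    return conn.execute(sql, params).fetchone()[0]

males = count_respondents(1)
males_20_25 = count_respondents(1, 9)
```

Unlike the fixed five copies in the cube, the number of self-joins here simply follows the number of answers passed in.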

If you are using MSSQL, you can add "WITH ROLLUP" (or "WITH CUBE") to a GROUP BY to get some extra subtotal rows which would have the information you want. Also, you are not using a "GROUP BY", which you will need.
Use the GROUP BY to break up the set into groups and then use aggregate functions to get your counts and other stats.
Example:
select AGE, GENDER, count(1)
from MY_TABLE
group by AGE, GENDER
with cube
This would give you the number of people of each gender in each age group. The extra rows, where NULL marks a column that has been aggregated away, give the total number of people in your table, the numbers in each age group regardless of gender, and the numbers of each gender regardless of age. "WITH ROLLUP" would return only the per-age subtotals and the grand total, not the per-gender rows. Something like (row order may differ):
AGE   GENDER  COUNT
----  ------  -----
20    M       1245
21    M       1012
20    F        942
21    F        838
NULL  M       2257
NULL  F       1780
20    NULL    2187
21    NULL    1850
NULL  NULL    4037
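To check which subtotal rows each option produces, here is a plain Python sketch of the grouping sets involved (SQLite has no ROLLUP or CUBE, so this emulates them). For GROUP BY AGE, GENDER, a rollup covers (AGE, GENDER), (AGE) and the grand total, while a cube also adds (GENDER). The rows and the helper grouped_counts are invented for illustration.

```python
from collections import Counter

# Invented rows of (AGE, GENDER); a real table would have one row per person.
rows = [(20, "M"), (20, "M"), (21, "M"), (20, "F"), (21, "F"), (21, "F"), (21, "F")]

def grouped_counts(rows, grouping_sets):
    """Count rows per grouping set; None stands in for SQL's NULL subtotal marker."""
    counts = Counter()
    for age, gender in rows:
        for keep_age, keep_gender in grouping_sets:
            key = (age if keep_age else None, gender if keep_gender else None)
            counts[key] += 1
    return counts

# Each pair says which of (AGE, GENDER) is kept in that grouping set.
ROLLUP = [(True, True), (True, False), (False, False)]
CUBE = ROLLUP + [(False, True)]

rollup = grouped_counts(rows, ROLLUP)
cube = grouped_counts(rows, CUBE)
```

In this sketch rollup has no (None, "M") or (None, "F") keys, while cube does, so the per-gender totals only show up with the cube grouping sets.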
