Date scaffolding with multiple different date measures? - sql

I want to show the number of people in each contract status historically. I have a list of every contract's start date, suspension dates, expiration date, and termination date. As a brief example, this is what my table looks like:
Client  Location  StartDate  ExpDate   SuspensionStart  SuspensionEnd  TerminatedDate
------  --------  ---------  --------  ---------------  -------------  --------------
Jane    NJ        1/1/22     1/1/23    3/1/22           5/1/22         NULL
John    NY        11/15/22   11/15/23  NULL             NULL           3/8/22
Alice   NY        3/12/21    3/12/22   6/1/21           8/1/21         NULL
Jack    NJ        6/20/21    6/20/22   NULL             NULL           NULL
My goal is to get my table to look like this for the month of March:
Active  Suspended  Expired  Terminated
------  ---------  -------  ----------
1       1          1        1
Then be able to drill down by location too.
I have two variables that I want to count by a single date (count if expdate = month/year and count if terminateddate = month/year) and two variables that are defined by date ranges (active and suspended).
One more piece of context: this data is pulled with a SQL query from a shared Snowflake database. There is no calendar table and I cannot create one except as a view, for which I used:
select
    dateadd(day, seq, dt::date) dat
    ,year(dat) as "YEAR"
    ,quarter(dat) as "QUARTER OF YEAR"
    ,month(dat) as "MONTH"
    ,day(dat) as "DAY"
    ,dayofmonth(dat) as "DAY OF MONTH"
    ,dayofweek(dat) as "DAY OF WEEK"
    ,dayname(dat) as dayName
    ,dayofyear(dat) as "DAY OF YEAR"
from (
    select seq4() as seq, dateadd(month, 1, '2015-01-01'::date) dt
    from table(generator(rowcount => 16000))
)
I haven't used scaffolding before and am unsure which date to build the relationship (join) on.

Scaffolding is best done by Tableau Prep. There are multiple steps involved and Prep can step through them while it is very challenging with Tableau Desktop. See https://www.tableau.com/about/blog/2018/12/scaffold-data-tableau-prep-fill-gaps-your-data-set-99389 for one example of how to scaffold the data.
You can apply the techniques in the blog article and create the four metrics that you want to show.
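If the scaffolding is done in SQL rather than in Tableau Prep, the same idea can be sketched as a cross join between a month calendar and the contracts, with one conditional count per status. The sketch below uses Python's sqlite3 module only so that it can be run anywhere; it is not Snowflake syntax. The `contracts` table name is hypothetical, the dates are rewritten in ISO format, and the status rules (expired or terminated when that date falls in the month, suspended when the suspension range overlaps the month, active otherwise while the contract is running) are assumptions chosen to reproduce the March example from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE contracts (Client TEXT, Location TEXT, StartDate TEXT, ExpDate TEXT,"
    " SuspensionStart TEXT, SuspensionEnd TEXT, TerminatedDate TEXT)"
)
conn.executemany("INSERT INTO contracts VALUES (?, ?, ?, ?, ?, ?, ?)", [
    ("Jane", "NJ", "2022-01-01", "2023-01-01", "2022-03-01", "2022-05-01", None),
    ("John", "NY", "2022-11-15", "2023-11-15", None, None, "2022-03-08"),
    ("Alice", "NY", "2021-03-12", "2022-03-12", "2021-06-01", "2021-08-01", None),
    ("Jack", "NJ", "2021-06-20", "2022-06-20", None, None, None),
])

# One row per month (the scaffold), cross joined to every contract.
# Each month covers the half-open range [m_start, m_end).
STATUS_BY_MONTH = """
WITH RECURSIVE months(m_start, m_end) AS (
    SELECT '2022-01-01', date('2022-01-01', '+1 month')
    UNION ALL
    SELECT m_end, date(m_end, '+1 month') FROM months WHERE m_end < '2022-07-01'
)
SELECT m.m_start,
       SUM(c.StartDate < m.m_end
           AND c.ExpDate >= m.m_end
           AND COALESCE(c.TerminatedDate >= m.m_end, 1)
           AND NOT COALESCE(c.SuspensionStart < m.m_end
                            AND c.SuspensionEnd >= m.m_start, 0)) AS active,
       SUM(COALESCE(c.SuspensionStart < m.m_end
                    AND c.SuspensionEnd >= m.m_start, 0)) AS suspended,
       SUM(COALESCE(c.ExpDate >= m.m_start AND c.ExpDate < m.m_end, 0)) AS expired,
       SUM(COALESCE(c.TerminatedDate >= m.m_start
                    AND c.TerminatedDate < m.m_end, 0)) AS terminated
FROM months m
CROSS JOIN contracts c
GROUP BY m.m_start
ORDER BY m.m_start
"""

# Map each month start to its (active, suspended, expired, terminated) counts.
status_by_month = {row[0]: row[1:] for row in conn.execute(STATUS_BY_MONTH)}
print(status_by_month["2022-03-01"])
```

Adding `c.Location` to the SELECT list and the GROUP BY would give the drill-down by location. Because the join is on a date range rather than on a single date column, no one date has to be picked for the relationship.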

Related

SQLite - Output count of all records per day including days with 0 records

I have a sqlite3 database maintained on an AWS exchange that is regularly updated by a Python script. One of the things it tracks is when any team generates a new post for a given topic. The entries look something like this:
id   client           team      date        industry      city
---  ---------------  --------  ----------  ------------  -----------
895  acme industries  blueteam  2022-06-30  construction  springfield
I'm trying to create a table that shows me how many entries for construction occur each day. Right now, the entries with data populate, but they exclude dates with no entries. For example, if I search for just
SELECT date, count(id) as num_records
from mytable
WHERE industry = "construction"
group by date
order by date asc
I'll get results that look like this:
date        num_records
----------  -----------
2022-04-01  3
2022-04-04  1
How can I make sqlite output like this:
date        num_records
----------  -----------
2022-04-01  3
2022-04-02  0
2022-04-03  0
2022-04-04  1
I'm trying to generate some graphs from this data and need to be able to include all dates for the target timeframe.
EDIT/UPDATE:
The table does not already include every date; it only includes dates relevant to an entry. If no team posts work on a day, the date column will jump from day 1 (e.g. 2022-04-01) to day 3 (2022-04-03).
Assuming that "mytable" already contains every date you need, you can first select the distinct dates, then LEFT JOIN your own query to them, and map the resulting NULL values of "num_records" to 0 with the COALESCE function.
WITH cte AS (
    SELECT date,
           COUNT(id) AS num_records
    FROM mytable
    WHERE industry = 'construction'
    GROUP BY date
)
SELECT dates.date,
       COALESCE(cte.num_records, 0) AS num_records
FROM (SELECT DISTINCT date FROM mytable) dates
LEFT JOIN cte
       ON dates.date = cte.date
ORDER BY dates.date
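The question's update says the table does not contain every date, so the assumption above does not hold for it. One way around that, sketched here with Python's sqlite3 module, is to generate the calendar with a recursive CTE and LEFT JOIN the table to it. The sample rows and the date range are made up for illustration; only the table and column names come from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (id INTEGER PRIMARY KEY, industry TEXT, date TEXT)")
conn.executemany("INSERT INTO mytable (industry, date) VALUES (?, ?)", [
    ("construction", "2022-04-01"),
    ("construction", "2022-04-01"),
    ("construction", "2022-04-01"),
    ("retail", "2022-04-02"),        # other industry, must not be counted
    ("construction", "2022-04-04"),
])

# The recursive CTE emits one row per day between the two bound parameters.
# COUNT(mytable.id) ignores the NULLs produced by the LEFT JOIN, so days
# without a matching row come out as 0 without needing COALESCE.
DAILY_COUNTS = """
WITH RECURSIVE calendar(day) AS (
    SELECT ?
    UNION ALL
    SELECT date(day, '+1 day') FROM calendar WHERE day < ?
)
SELECT calendar.day AS date,
       COUNT(mytable.id) AS num_records
FROM calendar
LEFT JOIN mytable
       ON mytable.date = calendar.day
      AND mytable.industry = 'construction'
GROUP BY calendar.day
ORDER BY calendar.day
"""

daily_counts = conn.execute(DAILY_COUNTS, ("2022-04-01", "2022-04-04")).fetchall()
for row in daily_counts:
    print(row)
```

The industry filter sits in the join condition rather than in a WHERE clause; a WHERE on `mytable.industry` would discard the calendar rows that have no match and bring the gaps back.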

Converting EAVT table into SCD type 2

After a lot of research and head scratching, I'm still unable to find a good/clean solution to convert an entity-attribute-value-timestamp table to an SCD type 2 dimension.
Here's the issue:
I have a CRM source that stores all history in an EAVT model (Entity/Attribute/Value of the attribute/valid_from/valid_to).
So for every object (Company, product...etc) I have a table with the current state that is in a relational model, and another history table that contains all value changes to all attributes with a valid_from/valid_to column for validity of the values themselves.
I want to be able to merge these two tables into an SCD table with a Valid_To/Valid_From and a column per attribute.
To give an example:
Company has two tables:
Current state of the Companies:
company_id  name       number_of_employees  city
----------  ---------  -------------------  -----
1           Company 1  500                  Paris
2           Company 2  500                  Paris
History Table:
company_id  attribute            value     valid_from  valid_to
----------  -------------------  --------  ----------  ----------
1           city                 New York  01/01/2020  01/05/2022
1           city                 Paris     01/05/2022  12/31/9999
1           number_of_employees  50        01/01/2021  01/01/2022
1           number_of_employees  100       01/01/2022  12/31/9999
What I want to have as a result is the following:
company_id  name       city      number_of_employees  valid_from  valid_to    is_active
----------  ---------  --------  -------------------  ----------  ----------  ---------
1           Company 1  New York  null                 01/01/2020  01/01/2021  false
1           Company 1  New York  50                   01/01/2021  01/01/2022  false
1           Company 1  New York  100                  01/01/2022  01/05/2022  false
1           Company 1  Paris     100                  01/05/2022  12/31/9999  true
So based on this example, we have a company that started on 01/01/2020 with New York as city and number of employees wasn't populated at that time.
We then modified our company to add 50 as the number of employees, this happened on 01/01/2021.
We modified our company again on 01/01/2022 to change the number of employees to 100, only to change the city of the company from New York to Paris on 01/05/2022.
This gives us 4 states for the company, so our SCD should contain a row per state or 4 rows.
The dates should be calculated so that the periods follow each other without gaps: each row's valid_from should be set to the valid_from of the attribute that changed in the "history" table, and its valid_to should be set to the valid_from of the next attribute change in the "history" table.
To add more complexity to the task, imagine we have about 120 attributes, and also that a company that was never changed (just created and still in the same state as at creation) won't exist in the "history" table. So in our example, Company 2 will not exist in the history table at all and will have to be read from the first table into the SCD (union between current table and history result table). Fun right! :)
To give you a sense of the technical environment, the CRM is hubspot, data is replicated from hubspot to BigQuery and the reporting tool is Power BI.
I have tried to use pivoting in both Power BI and BigQuery, which is the standard solution when it comes to EAV model tables, but I'm stuck at the calculation of the valid/from valid/to in the result SCD. ( example of using the pivoting here: https://dba.stackexchange.com/questions/20275/solutions-for-reporting-off-of-an-eav-structured-database )
I need one process that can be applied to multiple tables (because this example is only for company, but I have also other objects that I need to convert into SCD).
So what is the best way to convert this EAVT data into an SCD without falling into a labyrinth of hard code and performance issues? And how can the valid_from/valid_to be calculated dynamically?
Whether it's BigQuery or Power Query or just theoretical, any solutions, tips, ideas or just plain opinion is highly appreciated as this is the last step into the adoption of a whole data culture in the company I work for, and if I cannot make this, well... my credibility will be hit! so please help a fellow lost IT professional! :D
Too broad question - but anyway, below is just to give you an idea. Obviously it does not cover all cases - but hope you can work it further out
select company_id, city, number_of_employees, min(day) valid_from, max(day) valid_to
from (
    select * from (
        select company_id, attribute, value, day
        from history,
        unnest(generate_date_array(date(valid_from), if(valid_to = '9999-12-31', date('2222-12-31'), date(valid_to)))) day
    )
    pivot (any_value(value) for attribute in ('city', 'number_of_employees'))
)
group by company_id, city, number_of_employees
if applied to sample data as in your question
with history as (
select 1 company_id, 'city' attribute, 'New York' value, '2020-01-01' valid_from, '2022-01-05' valid_to union all
select 1, 'city', 'Paris', '2022-01-05', '2222-12-31' union all
select 1, 'number_of_employees', '50', '2021-01-01', '2022-01-01' union all
select 1, 'number_of_employees', '100', '2022-01-01', '2222-12-31'
)
output is
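A different approach from the day-by-day expansion above, offered only as an illustration: treat every valid_from in the history as a change point, build one span per change point, and look up the value of each attribute that is valid at the start of the span. The sketch runs on Python's sqlite3 module rather than BigQuery, uses ISO dates, and hard-codes the two attributes from the example. It assumes that every attribute value is followed directly by its successor, so that valid_from dates are the only change points. The name column from the current-state table and the is_active flag are left out.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE history (company_id INTEGER, attribute TEXT, value TEXT,"
    " valid_from TEXT, valid_to TEXT)"
)
conn.executemany("INSERT INTO history VALUES (?, ?, ?, ?, ?)", [
    (1, "city", "New York", "2020-01-01", "2022-01-05"),
    (1, "city", "Paris", "2022-01-05", "9999-12-31"),
    (1, "number_of_employees", "50", "2021-01-01", "2022-01-01"),
    (1, "number_of_employees", "100", "2022-01-01", "9999-12-31"),
])

SCD_QUERY = """
WITH points AS (            -- every date on which some attribute changed
    SELECT DISTINCT company_id, valid_from AS p FROM history
),
spans AS (                  -- each change point runs until the next one
    SELECT p1.company_id,
           p1.p AS valid_from,
           COALESCE((SELECT MIN(p2.p) FROM points p2
                     WHERE p2.company_id = p1.company_id AND p2.p > p1.p),
                    '9999-12-31') AS valid_to
    FROM points p1
)
SELECT s.company_id,
       MAX(CASE WHEN h.attribute = 'city' THEN h.value END) AS city,
       MAX(CASE WHEN h.attribute = 'number_of_employees' THEN h.value END)
           AS number_of_employees,
       s.valid_from,
       s.valid_to
FROM spans s
LEFT JOIN history h         -- attribute values valid at the start of the span
       ON h.company_id = s.company_id
      AND h.valid_from <= s.valid_from
      AND s.valid_from < h.valid_to
GROUP BY s.company_id, s.valid_from, s.valid_to
ORDER BY s.company_id, s.valid_from
"""

scd_rows = conn.execute(SCD_QUERY).fetchall()
for row in scd_rows:
    print(row)
```

With about 120 attributes, the list of CASE expressions would have to be generated (for example from the distinct attribute names) rather than typed by hand. Companies that never appear in the history table would still need to be unioned in from the current-state table.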

count occurrences for each week using db2

I am looking for some general advice rather than a solution. My problem is that I have a list of dates per person where due to administrative procedures, a person may have multiple records stored for this one instance, yet the date recorded is when the data was entered in as this person is passed through the paper trail. I understand this is quite difficult to explain so I'll give an example:
Person Date Audit
------ ---- -----
1 2000-01-01 A
1 2000-01-01 B
1 2000-01-02 C
1 2003-04-01 A
1 2003-04-03 A
where I want to know how many valid records a person has by removing annoying audits that have recorded the date as the day the data was entered, rather than the date the person first arrives in the dataset. So for the above person I am only interested in:
Person Date Audit
------ ---- -----
1 2000-01-01 A
1 2003-04-01 A
What makes this problem difficult is that I do not have the luxury of an audit column (the audit column here is just to show how the data is collected). I merely have dates. So one way I could crudely count real events (and remove repeat audit data) is to look at individual weeks within a person's history and, if one or more records exist for a given week, add 1 to my counter. This way, even though there are multiple records split over a few days, I am only counting the succession of dates as one record (which, after all, I am counting by date).
So does anyone know of any db2 functions that could help me solve this problem?
If you can live with standard weeks it's pretty simple:
select
person, year(dt), week(dt), min(dt), min(audit)
from
blah
group by
person, year(dt), week(dt)
If you need seven-day ranges starting with the first date you'd need to generate your own week numbers, a calendar of sorts, e.g. like so:
with minmax(mindt, maxdt) as ( -- date range of the "calendar"
select min(dt), max(dt)
from blah
),
cal(dt,i) as ( -- fill the range with every date, count days
select mindt, 0
from minmax
union all
select dt+1 day , i+1
from cal
where dt < (select maxdt from minmax) and i < 100000
)
select
person, year(blah.dt), wk, min(blah.dt), min(audit)
from
(select dt, int(i/7)+1 as wk from cal) t -- generate week numbers
inner join
blah
on t.dt = blah.dt
group by person, year(blah.dt), wk
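The seven-day bucketing can also be done with plain date arithmetic instead of a generated calendar. The sketch below is not DB2; it uses Python's sqlite3 module so it can be run directly. Unlike the query above, it anchors the seven-day windows at each person's own first date, which is an assumption about what "starting with the first date" should mean. The table holds the dates from the question without the audit column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE blah (person INTEGER, dt TEXT)")
conn.executemany("INSERT INTO blah VALUES (?, ?)", [
    (1, "2000-01-01"),
    (1, "2000-01-01"),
    (1, "2000-01-02"),
    (1, "2003-04-01"),
    (1, "2003-04-03"),
])

# Days since the person's first record, integer-divided by 7, give a window
# number; counting the distinct window numbers counts the "real" events.
EVENTS_PER_PERSON = """
SELECT b.person,
       COUNT(DISTINCT CAST((julianday(b.dt) - julianday(f.first_dt)) / 7 AS INTEGER))
           AS events
FROM blah b
JOIN (SELECT person, MIN(dt) AS first_dt FROM blah GROUP BY person) f
  ON f.person = b.person
GROUP BY b.person
"""

events = dict(conn.execute(EVENTS_PER_PERSON).fetchall())
print(events)
```

Fixed windows have a known weakness: a run of audit records that happens to straddle a window boundary is counted twice. Comparing each date with the previous one for the same person and starting a new event only when the gap exceeds some threshold would avoid that, at the cost of a slightly more involved query.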

Join to Calendar Table - 5 Business Days

So this is somewhat of a common question on here, but I haven't found an answer that really suits my specific needs. I have 2 tables. One has a list of ProjectClosedDates. The other is a calendar table that runs through about 2025 and has one column flagging whether the row date is a weekend day and another flagging whether it is a holiday.
My end goal is to find out based on the ProjectClosedDate, what date is 5 business days post that date. My idea was that I was going to use the Calendar table and join it to itself so I could then insert a column into the calendar table that was 5 Business days away from the row-date. Then I was going to join the Project table to that table based on ProjectClosedDate = RowDate.
If I was just going to check the actual business-date table for one record, I could use this:
SELECT actual_date from
(
    -- no ORDER BY here: SQL Server rejects ORDER BY in a derived table without TOP
    SELECT actual_date, ROW_NUMBER() OVER(ORDER BY actual_date) AS Row
    FROM DateTable
    WHERE is_holiday= 0 and actual_date > '2013-12-01'
) X
WHERE row = 65
from here:
sql working days holidays
However, this is just one date and I need a column of dates based off of each row. Any thoughts of what the best way to do this would be? I'm using SQL-Server Management Studio.
Completely untested and not thought through:
If the concept of "business days" is common and important in your system, you could add a column "Business Day Sequence" to your table. The column would be a simple unique sequence, incremented by one for every business day and null for every day not counting as a business day.
The data would look something like this:
Date BDAY_SEQ
========== ========
2014-03-03 1
2014-03-04 2
2014-03-05 3
2014-03-06 4
2014-03-07 5
2014-03-08
2014-03-09
2014-03-10 6
Now it's a simple task to find the Nth business day from any date.
You simply do a self join with the calendar table, adding the offset in the join condition.
select a.actual_date
      ,b.actual_date as nth_business_day
from DateTable a
join DateTable b on(
    b.bday_seq = a.bday_seq + 5
);
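A small runnable check of the self join, using Python's sqlite3 module instead of SQL Server and the sample rows from the table above, extended by two more business days (those two rows are made up for illustration).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE DateTable (actual_date TEXT PRIMARY KEY, bday_seq INTEGER)")
conn.executemany("INSERT INTO DateTable VALUES (?, ?)", [
    ("2014-03-03", 1),
    ("2014-03-04", 2),
    ("2014-03-05", 3),
    ("2014-03-06", 4),
    ("2014-03-07", 5),
    ("2014-03-08", None),   # Saturday: no sequence number
    ("2014-03-09", None),   # Sunday: no sequence number
    ("2014-03-10", 6),
    ("2014-03-11", 7),
    ("2014-03-12", 8),
])

# NULL + 5 is NULL, so non-business days never match the join condition.
NTH_BUSINESS_DAY = """
SELECT a.actual_date, b.actual_date AS nth_business_day
FROM DateTable a
JOIN DateTable b ON b.bday_seq = a.bday_seq + ?
ORDER BY a.actual_date
"""

five_days_out = conn.execute(NTH_BUSINESS_DAY, (5,)).fetchall()
for row in five_days_out:
    print(row)
```

Rows whose target date lies beyond the end of the calendar drop out of the inner join, and so do start dates that are not business days themselves. If a project can close on a weekend or holiday, the start date would first have to be mapped to a neighbouring business day.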

Getting repeated rows for where with or condition

I am trying to find employees that worked during a specific time period and the hours they worked during that period. My query has to join the employee table, which has employee id as its pk and uses effective_date and expiration_date to bound the employee's position, to the timekeeping table, which has a pay period id number as its pk and also uses effective and expiration dates.
The problem with the expiration date in the employee table is that if the employee is currently employed, the date is '12/31/9999'. I am looking for employees that worked in a certain year, including current employees, as well as the hours they worked, separated by pay period.
When I account for this condition in the WHERE clause with an OR, I get duplicates: employees that worked in the time period I am looking for and beyond, as well as duplicate records for the '12/31/9999' rows and the valid employee in that time period.
This is the query I am using:
SELECT
J.EMPL_ID
,J.DEPT
,J.UNIT
,J.LAST_NM
,J.FIRST_NM
,J.TITLE
,J.EFF_DT
,J.EXP_DT
,TM1.PPRD_ID
,TM1.EMPL_ID
,TM1.EXP_DT
,TM1.EFF_DT
--PULLING IN THE DAILY HRS WORKED
,(SELECT NVL(SUM(((to_number(SUBSTR(TI.DAY_1, 1
,INSTR(TI.DAY_1, ':', 1, 1)-1),99))*60)+
(TO_NUMBER(SUBSTR(TI.DAY_1
,INSTR(TI.DAY_1,':', -1, 1)+1),99))),0)
FROM PPRD_LINE TI
WHERE
TI.PPRD_ID=TM1.PPRD_ID
) "DAY1"
---AND THE REST OF THE DAYS FOR THE WORK PERIOD
FROM PPRD_LINE TM1
JOIN EMPL J ON TM1.EMPL_ID=J.EMPL_ID
WHERE
J.EMPL_ID='some id number' --for test purposes, will need to break down to depts-
AND
J.EFF_DT >=TO_DATE('1/1/2012','MM/DD/YYYY')
AND
(
J.EXP_DT<=TO_DATE('12/31/2012','MM/DD/YYYY')
OR
J.EXP_DT=TO_DATE('12/31/9999','MM/DD/YYYY') --I think the problem might be here???
)
GROUP BY
J.EMPL_ID
,J.DEPT
,J.UNIT
,J.LAST_NM
,J.FIRST_NM
,J.TITLE
,J.EFF_DT
,J.EXP_DT
,TM1.PPRD_ID
,TM1.EMPL_ID
,TM1.DOC_ID
,TM1.EXP_DT
,TM1.EFF_DT
ORDER BY
J.EFF_DT
,TM1.EFF_DT
,TM1.EXP_DT
I'm pretty sure I'm missing something simple but at this point I can't see the forest for the trees. Can anyone out there point me in the right direction?
An example of the duplicate records, for employee 1 for the year 2012:
Empl_ID  Dept  Unit  Last     First    Title    Eff Date  Exp Date    PPRD ID  Empl_ID  Exp Date_1  Eff Date_1
-------  ----  ----  -------  -------  -------  --------  ----------  -------  -------  ----------  ----------
00001    04    012   Babbage  Charles  Somejob  4/1/2012  10/15/2012  0407123  00001    4/15/2012   4/1/2012
This record repeats 3 times and goes past the pay periods in 2012 to the current pay period in 2013.
I use the subquery to convert the time so that I can add hours and minutes together to compare down the line.
I'm going to take a wild guess and see if this is what you want. Remember that I could not test it, so there may be typos.
Whether or not this is what you want, you should read the FAQ on how to ask good questions. If this is what you were after, your question should have been answered within about 10 minutes; because it was not clear what you were asking, no one could answer it.
You should include inputs, outputs, and EXPECTED output in your question. The data you gave was not the output of the select statement (it did not have the DAY1 column).
SELECT
J.EMPL_ID
,J.DEPT
,J.UNIT
,J.LAST_NM
,J.FIRST_NM
,J.TITLE
,J.EFF_DT
,J.EXP_DT
,TM1.PPRD_ID
,TM1.EMPL_ID
-- ,TM1.EXP_DT Can't have these if you are summing across multiple records.
-- ,TM1.EFF_DT
--PULLING IN THE DAILY HRS WORKED
,NVL(SUM(((to_number(SUBSTR(TM1.DAY_1, 1,INSTR(TM1.DAY_1, ':', 1, 1)-1),99))*60)+
(TO_NUMBER(SUBSTR(TM1.DAY_1,INSTR(TM1.DAY_1,':', -1, 1)+1),99))),0)
"DAY1"
---AND THE REST OF THE DAYS FOR THE WORK PERIOD
FROM PPRD_LINE TM1
JOIN EMPL J ON TM1.EMPL_ID=J.EMPL_ID
WHERE
J.EMPL_ID='some id number' --for test purposes, will need to break down to depts-
AND J.EFF_DT >=TO_DATE('1/1/2012','MM/DD/YYYY')
AND(J.EXP_DT<=TO_DATE('12/31/2012','MM/DD/YYYY') OR J.EXP_DT=TO_DATE('12/31/9999','MM/DD/YYYY'))
GROUP BY
J.EMPL_ID
,J.DEPT
,J.UNIT
,J.LAST_NM
,J.FIRST_NM
,J.TITLE
,J.EFF_DT -- still in the SELECT list, so it must be grouped
,J.EXP_DT
,TM1.PPRD_ID
,TM1.EMPL_ID
,TM1.DOC_ID
ORDER BY
MIN(J.EFF_DT)
,MAX(TM1.EFF_DT)
,MAX(TM1.EXP_DT)