Database Table Design for Group Values with Changing Status over Time - sql

I have the following groups that have a particular designation depending on the date:
Group 1: 3/30/2017 to present: status 'on'
Group 2: 3/30/2017 to present: status 'on'
Group 3: 3/30/2017 to present: status 'on'
Group 4: 3/30/2017 to 6/1/2017: status 'off'; 6/2/2017 to present: status: 'on'
Group 5: 3/30/2017 to present: status 'off'
Group 6: 3/30/2017 to 7/10/2017: status 'off'; 7/11/2017 to present: status 'on'
I'm trying to translate this information into an effective database table so I can designate a change in status on a particular date.
I have a process that runs daily in near real time that checks the status of each group and undertakes various processes based on the status.
I have come up with the following though I think it is not sufficient:
Group    Effective Date  Termination Date  Status
Group 1  '2017-03-30'    NULL              On
Group 2  '2017-03-30'    NULL              On
Group 3  '2017-03-30'    NULL              On
Group 4  '2017-03-30'    '2017-06-01'      Off
Group 4  '2017-06-02'    NULL              On
Group 5  '2017-03-30'    NULL              Off
Group 6  '2017-03-30'    '2017-07-10'      Off
Group 6  '2017-07-11'    NULL              On
So if I run my daily process historically, I want it to be able to consult the table and determine the status of the group. If I am running my process in real time, I want to be able to consult the table and determine the status. If I want to change the status at a particular point in time, I enter a termination date for the Group and status and start a new line.
I can't imagine this is a good way to do this.
Looking for insights.
Thanks in advance.

Here is one method using what I call Version Normal Form (vnf). It works for entities that have a smooth, unbroken chain of state changes. That is, there are no gaps (one state ends only upon another state taking effect) or overlaps (only one state is in effect at any time).
Design the Group table with all the group info except for status.
create table Group(                     -- note: Group is a reserved word in most dialects and may need quoting or renaming
    ID int auto generated primary key,  -- use your DBMS's identity/serial mechanism here
    ... ...                             -- all other Group data
);
Now create a Status table with a State field and one date field -- the date the status takes effect.
create table GroupStatus(
    ID int references Group( ID ),
    EffDate date not null default Now(),
    State char( 1 ) check( State in ('Y', 'N') ),
    constraint PK_GroupStatus primary key( ID, EffDate )
);
There are two important points about the GroupStatus table to consider:
the PK definition means no two entries for the same Group can be defined for the same time. Thus, it is not possible to have overlapping status values.
there is no "end" date. A status takes effect at the designated date and time and continues in effect until replaced by another state change. Thus, it is not possible to have gaps between the status changes of any Group.
I used a single character 'Y' (for On) and 'N' (for Off) but you can define the status state any way you want. This is for illustration only.
The EffDate field may have to be Date, Datetime or Timestamp type, depending on your specific DBMS. Now() just means "current date and time" using any method available in your DBMS.
The GroupStatus data would look like this:
ID  EffDate       State
1   '2017-03-30'  Y
2   '2017-03-30'  Y
3   '2017-03-30'  Y
4   '2017-03-30'  N
4   '2017-06-02'  Y
5   '2017-03-30'  N
6   '2017-03-30'  N
6   '2017-07-11'  Y
For the level of data integrity enforced, the design is very simple. The queries will be a little more complicated.
To see the current status of Group 1:
select g.ID as 'Group', s.EffDate as 'Effective Date',
       case s.State when 'Y' then 'On' else 'Off' end as Status
from Group g
join GroupStatus s
  on s.ID = g.ID
 and s.EffDate = (
        select Max( s1.EffDate )
        from GroupStatus s1
        where s1.ID = g.ID
          and s1.EffDate <= Now()
     )
where g.ID = 1;
To see the current status of all groups, just omit the where clause. To see the status of group 1 that was in effect on a certain date, just change the Now() in the subquery to a variable loaded with the date and time of interest.
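For instance, a sketch of that as-of variant (@AsOf is an illustrative variable name; declaration and parameter syntax vary by DBMS):
select g.ID as 'Group', s.EffDate as 'Effective Date',
       case s.State when 'Y' then 'On' else 'Off' end as Status
from Group g
join GroupStatus s
  on s.ID = g.ID
 and s.EffDate = (
        select Max( s1.EffDate )
        from GroupStatus s1
        where s1.ID = g.ID
          and s1.EffDate <= @AsOf   -- the date and time of interest, instead of Now()
     )
where g.ID = 1;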
In fact, set the query for current status of all groups as a view. Then your daily process can simply query:
select ID, Status from CurrentGroupStatus;
Since you know there can be no gaps or overlaps, you know there will be one and only one row for each group.
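A minimal sketch of that view, using plain column names so it is easy to query and to hang triggers on (same query as above, minus the group filter):
create view CurrentGroupStatus as
select g.ID, s.EffDate,
       case s.State when 'Y' then 'On' else 'Off' end as Status
from Group g
join GroupStatus s
  on s.ID = g.ID
 and s.EffDate = (
        select Max( s1.EffDate )
        from GroupStatus s1
        where s1.ID = g.ID
          and s1.EffDate <= Now()
     );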
Suppose upon inserting the group 6 entry on March 30, you already know the date it will be turned on. You can go ahead and insert the GroupStatus entry with the future date (July 11) and the "current" queries and view will continue to show the correct status (Off) until that date arrives, at which point the ON status will start appearing.
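For example (multi-row insert syntax is an assumption; adjust for your DBMS):
insert into GroupStatus( ID, EffDate, State )
values( 6, '2017-03-30', 'N' ),   -- off as of March 30
      ( 6, '2017-07-11', 'Y' );   -- already scheduled to turn on July 11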
Create "instead of" triggers on the view(s) to correctly work with the underlying tables and your apps don't even have to know the details of how the data is stored.
This gives you rock solid data integrity and a lot of flexibility in how you view and manipulate the data.

Related

Finding a gap between timestamps (for multiple entries)

I have a table with a list of vehicles and timestamps of adding and deleting them from the database. A single vehicle can be added by more than one user, so there can be more than one row when a specific license plate is 'active' on some point (added to the database but not deleted). Periods can overlap.
license_plate  create_timestamp     delete_timestamp
AA-BBB-CC      2019-10-26 0:04:57   2021-04-07 14:18:44
AA-BBB-CC      2021-04-07 16:00:43  \N
How do I check if there was a gap for a specific license plate in the entire period I'm checking?
There are multiple rows with start_date and end_date (or with no end date if it wasn't deleted); I need to find out if there is any gap between the first start_date and the last end_date for a specific car.
I need to group vehicles into 3 groups: Active (never completely deleted from the database), Deleted and Resurrected (deleted but re-entered - for this group I need to determine the gap).
I tried to create tables with one row per car based on a row number
ROW_NUMBER() OVER (PARTITION BY license_plate ORDER BY create_timestamp, delete_timestamp) AS rank
and then, having multiple tables, compare entries row by row:
CASE WHEN r1.delete_timestamp IS NULL THEN 'Active'
     WHEN r1.delete_timestamp < r2.create_timestamp AND r2.delete_timestamp IS NULL THEN 'Active'
     WHEN r1.delete_timestamp > r2.create_timestamp AND r2.delete_timestamp IS NULL THEN 'Resurrected'
     WHEN r1.delete_timestamp > r2.create_timestamp AND r2.delete_timestamp IS NOT NULL
          AND r3.create_timestamp IS NULL THEN 'Deleted'
     -- and so on for the next rows :)
END AS Vehicle_status
But it doesn't scale, as there are currently vehicles with 20 rows (entries), and there can be more in the future.
Hope someone has any idea how to solve it :)
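One hedged way to attack this with window functions (the table name vehicles and the far-future sentinel are assumptions; the column names come from the question): for each row, compute the latest delete_timestamp of all earlier rows for the same plate, treating still-open rows as covered forever, then flag a gap whenever a row starts after that coverage ends.
with x as (
    select license_plate,
           create_timestamp,
           delete_timestamp,
           max( coalesce(delete_timestamp, '9999-12-31') )  -- far-future sentinel; literal/cast syntax varies by DBMS
               over (partition by license_plate
                     order by create_timestamp
                     rows between unbounded preceding and 1 preceding) as covered_until
    from vehicles
)
select license_plate,
       case when max(case when create_timestamp > covered_until then 1 else 0 end) = 1
                 then 'Resurrected'   -- at least one gap found
            when max(case when delete_timestamp is null then 1 else 0 end) = 1
                 then 'Active'        -- no gap, and still present in the database
            else 'Deleted'
       end as vehicle_status
from x
group by license_plate;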

Difference between two dates based on ID and condition

I have to calculate the date difference between two rows for each unique ID (shipment-number) where the first time is categorized as (pickup) and the last time is categorized as (delivery). In essence, is there a way to create a calculated field or equivalent in Quicksight (or SQL) which does the following (where shipment number, event-type, and event-time are column names):
for each unique shipment-number find the date difference (event-time) between event-type == delivered and event-type == pickup
Assuming there are four rows for a specific shipment number:
shipment-number  event-type           event-time
001              pickup               11.01.2021
001              in-transit           12.01.2021
001              arrived-destination  13.01.2021
001              delivered            15.01.2021
then the expected result would be the difference between 11.01.2021 (pickup) and 15.01.2021 (delivered), which is 4 days.
I have tried using datediff; however, this requires two date fields. Creating two fields (firstEvent) and (lastEvent) by extracting the time of the event by event-type for each unique shipment-number could be a possibility? However, I am unsure how to do so.
Any help/advice would be of great help.
You can use conditional aggregation as follows:
select shipment_number,
       datediff(day,
                min(case when event_type = 'pickup' then event_date end),
                max(case when event_type = 'delivered' then event_date end)
       ) as diff
from your_table
group by shipment_number

Get timestamp from one table (displayed on different rows) and display it in my query

I am trying to write SQL against a table which has different timestamps for when the order status changes from one status to another. I need to capture those timestamps and display them in my SQL output.
If I use a join, it gives me 6 different rows where all the information is the same but with a different timestamp for each order status change. Is there a way to capture these different timestamps and display them in my SQL output?
As a simple example, an order changed from DC Allocated to Packed on 12/15/2019 10:00 AM and then from Packed to Shipped on 12/15/2019 12:00 PM.
I need to show this as one row per order, with the Packed and Shipped timestamps in separate columns.
Currently I am getting 2 rows with the same data but different timestamps for the Packed and Shipped statuses. There are multiple tables in play here, but the Order table gives the order number and the status table gives the transaction status and time whenever the order was updated by the processor.
The basic pattern for this is to aggregate over a conditional expression, e.g.
select OrderNumber,
       max(case when Status = 'Packed'  then StatusDate else null end) as Packed,
       max(case when Status = 'Shipped' then StatusDate else null end) as Shipped
from OrderStatus
group by OrderNumber

How many customers upgraded from Product A to Product B?

I have a "daily changes" table that records when a customer "upgrades" or "downgrades" their membership level. In the table, let's say field 1 is customer ID, field 2 is membership type and field 3 is the date of change. Customers 123 and ABC each have two rows in the table. Values in field 1 (ID) are the same, but values in field 2 (TYPE) and 3 (DATE) are different. I'd like to write a SQL query to tell me how many customers "upgraded" from membership type 1 to membership type 2 how many customers "downgraded" from membership type 2 to membership type 1 in any given time frame.
The table also shows other types of changes. To identify the records with changes in the membership type field, I've created the following code:
SELECT *
FROM member_detail_daily_changes_new
WHERE customer IN (
SELECT customer
FROM member_detail_daily_changes_new
GROUP BY customer
HAVING COUNT(distinct member_type_cd) > 1)
I'd like to see an end report which tells me:
For Fiscal 2018,
X,XXX customers moved from Member Type 1 to Member Type 2 and
X,XXX customers moved from Member Type 2 to Member type 1
Sounds like a good time to use a LEAD() analytical function to look ahead to a given customer's next member_Type_Code, compare it to the current record, evaluate whether that's an upgrade or a downgrade, and then sum the results.
WITH CTE AS (
    SELECT case when lead(Member_Type_Code) over (partition by Customer order by date asc) > Member_Type_Code
                then 1 else 0 end as Upgrade,
           case when lead(Member_Type_Code) over (partition by Customer order by date asc) < Member_Type_Code
                then 1 else 0 end as DownGrade
    FROM member_detail_daily_changes_new
    WHERE Date between '20190101' and '20190201'
)
SELECT sum(Upgrade) as upgrades, sum(DownGrade) as downgrades
FROM CTE
Giving us (using my sample data):
+----+----------+------------+
| | upgrades | downgrades |
+----+----------+------------+
| 1 | 3 | 2 |
+----+----------+------------+
I'm not sure whether SQL Server Express on Rextester just doesn't support sum() directly over the analytic (which is why I had to add the CTE) or whether that's a rule in non-Express versions too.
Some other notes:
I let the system implicitly cast the dates in the where clause
I assume the member_Type_Code itself tells me if it's an upgrade or downgrade, which long term probably isn't right. Say we add membership type 3 and it sorts between 1 and 2... now what? So maybe we need a separate ranking value outside of the Member_Type_Code so we can handle future memberships and whether a change is an upgrade, a downgrade, or a lateral move...
I assumed all upgrades/downgrades are counted and a user can be counted multiple times if membership changed that often in time period desired.
I assume an upgrade/downgrade can't occur on the same date/time. Otherwise the sorting for lead may not work right. (but if it's a timestamp field we shouldn't have an issue)
So how does this work?
We use a Common table expression (CTE) to generate the desired evaluations of downgrade/upgrade per customer. This could be done in a derived table as well in-line but I find CTE's easier to read; and then we sum it up.
Lead(Member_Type_Code) over (partition by customer order by date asc) does the following:
It organizes the data by customer and then sorts it by date in ascending order.
So we end up with each customer's records in subsequent rows ordered by date. Lead(field) then starts on record 1 and looks ahead to record 2 for the same customer, returning record 2's Member_Type_Code on record 1. We can then compare those type codes to determine whether an upgrade or downgrade occurred, and sum the results of the comparison to provide the desired totals.
And now we have a long winded explanation for a very small query :P
You want to use lag() for this, but you need to be careful about the date filtering. So, I think you want:
SELECT prev_membership_type, membership_type,
COUNT(*) as num_changes,
COUNT(DISTINCT member) as num_members
FROM (SELECT mddc.*,
LAG(mddc.membership_type) OVER (PARTITION BY mddc.customer_id ORDER BY mddc.date) as prev_membership_type
FROM member_detail_daily_changes_new mddc
) mddc
WHERE prev_membership_type <> membership_type AND
date >= '2018-01-01' AND
date < '2019-01-01'
GROUP BY membership_type, prev_membership_type;
Notes:
The filtering on date needs to occur after the calculation of lag().
This takes into account that members may have a certain type in 2017 and then change to a new type in 2018.
The date filtering is compatible with indexes.
Two values are calculated. One is the overall number of changes. The other counts each member only once for each type of change.
With conditional aggregation after self joining the table:
select 2018 as fiscal,
       sum(case when m.member_type_cd > t.member_type_cd then 1 else 0 end) as upgrades,
       sum(case when m.member_type_cd < t.member_type_cd then 1 else 0 end) as downgrades
from member_detail_daily_changes_new m
inner join member_detail_daily_changes_new t
    on t.customer = m.customer
   and t.changedate = (
           select max(changedate)
           from member_detail_daily_changes_new
           where customer = m.customer
             and changedate < m.changedate
       )
where year(m.changedate) = 2018
This will work even if there are more than 2 types of membership level.

How do I analyse time periods between records in SQL data without cursors?

The root problem: I have an application which has been running for several months now. Users have been reporting that it's been slowing down over time (so in May it was quicker than it is now). I need to get some evidence to support or refute this claim. I'm not interested in precise numbers (so I don't need to know that a login took 10 seconds), I'm interested in trends - that something which used to take x seconds now takes of the order of y seconds.
The data I have is an audit table which stores a single row each time the user carries out any activity - it includes a primary key, the user id, a date time stamp and an activity code:
create table AuditData (
    AuditRecordID int identity(1,1) not null,
    DateTimeStamp datetime not null,
    DateOnly datetime null,
    UserID nvarchar(10) not null,
    ActivityCode int not null
)
(Notes: DateOnly (datetime) is the DateTimeStamp with the time stripped off to make group by for daily analysis easier - it's effectively duplicate data to make querying faster).
(Also, for the sake of ease, you can assume that the ID is assigned in date time order; that is, 1 will always be before 2, which will always be before 3 - if this isn't true I can make it so.)
ActivityCode is an integer identifying the activity which took place, for instance 1 might be user logged in, 2 might be user data returned, 3 might be search results returned and so on.
Sample data for those who like that sort of thing...:
1, 01/01/2009 12:39, 01/01/2009, P123, 1
2, 01/01/2009 12:40, 01/01/2009, P123, 2
3, 01/01/2009 12:47, 01/01/2009, P123, 3
4, 01/01/2009 13:01, 01/01/2009, P123, 3
User data is returned (Activity Code 2) immediately after login (Activity Code 1), so this can be used as a rough benchmark of how long the login takes (as I said, I'm interested in trends, so as long as I'm measuring the same thing for May as for July it doesn't matter so much if this isn't the whole login process - it takes in enough of it to give a rough idea).
(Note: User data can also be returned under other circumstances so it's not a one to one mapping).
So what I'm looking to do is select the average time between login (say ActivityID 1) and the first instance after that for that user on that day of user data being returned (say ActivityID 2).
I can do this by going through the table with a cursor, getting each login instance and then for that doing a select to say get the minimum user data return following it for that user on that day but that's obviously not optimal and is slow as hell.
My question is (finally) - is there a "proper" SQL way of doing this using self joins or similar without using cursors or some similar procedural approach? I can create views and whatever to my hearts content, it doesn't have to be a single select.
I can hack something together but I'd like to make the analysis I'm doing a standard product function so would like it to be right.
SELECT TheDay, AVG(TimeTaken) AS AvgTimeTaken
FROM (
    SELECT CONVERT(DATE, logins.DateTimeStamp) AS TheDay,
           DATEDIFF(SS, logins.DateTimeStamp,
               (SELECT TOP 1 DateTimeStamp
                FROM AuditData userinfo
                WHERE userinfo.UserID = logins.UserID
                  AND userinfo.ActivityCode = 2
                  AND userinfo.DateTimeStamp > logins.DateTimeStamp
                ORDER BY userinfo.DateTimeStamp)  -- without this, TOP 1 is not guaranteed to be the *first* later row
           ) AS TimeTaken
    FROM AuditData logins
    WHERE logins.ActivityCode = 1
) LogInTimes
GROUP BY TheDay
This might be dead slow in real world though.
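If this is SQL Server 2005 or later, a hedged alternative is OUTER APPLY (a different construct than the correlated subquery above, not the original answer's code), which expresses the "first matching row after this one" lookup directly and can often use an index on (UserID, ActivityCode, DateTimeStamp):
SELECT CONVERT(DATE, logins.DateTimeStamp) AS TheDay,
       AVG(DATEDIFF(SS, logins.DateTimeStamp, nxt.DateTimeStamp)) AS AvgTimeTaken
FROM AuditData logins
OUTER APPLY (
    SELECT TOP 1 userinfo.DateTimeStamp
    FROM AuditData userinfo
    WHERE userinfo.UserID = logins.UserID
      AND userinfo.ActivityCode = 2
      AND userinfo.DateTimeStamp > logins.DateTimeStamp
    ORDER BY userinfo.DateTimeStamp
) nxt
WHERE logins.ActivityCode = 1
GROUP BY CONVERT(DATE, logins.DateTimeStamp);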
In Oracle this would be a cinch, because of analytic functions. In this case, LAG() makes it easy to find the matching pairs of activity codes 1 and 2 and also to calculate the trend. As you can see, things got worse on 2nd JAN and improved quite a bit on the 3rd (I'm working in seconds rather than minutes).
select DateOnly
     , elapsed_time
     , elapsed_time - lag(elapsed_time) over (order by DateOnly) as trend
from (
      select DateOnly
           , avg(databack_time - prior_login_time) as elapsed_time
      from (
            select DateOnly
                 , databack_time
                 , ActivityCode
                 , lag(login_time) over (order by DateOnly, UserID, AuditRecordID, ActivityCode) as prior_login_time
            from (
                  select a1.AuditRecordID
                       , a1.DateOnly
                       , a1.UserID
                       , a1.ActivityCode
                       , to_number(to_char(a1.DateTimeStamp, 'SSSSS')) as login_time
                       , 0 as databack_time
                  from AuditData a1
                  where a1.ActivityCode = 1
                  union all
                  select a2.AuditRecordID
                       , a2.DateOnly
                       , a2.UserID
                       , a2.ActivityCode
                       , 0 as login_time
                       , to_number(to_char(a2.DateTimeStamp, 'SSSSS')) as databack_time
                  from AuditData a2
                  where a2.ActivityCode = 2
                 )
           )
      where ActivityCode = 2
      group by DateOnly
     );

DATEONLY   ELAPSED_TIME      TREND
---------  ------------  ---------
01-JAN-09           120
02-JAN-09           600        480
03-JAN-09           150       -450
Like I said in my comment, I guess you're working in MSSQL. I don't know whether that product has any equivalent of LAG().
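For what it's worth, SQL Server 2012 and later do have LAG() and LEAD(). A rough T-SQL sketch of the same pairing idea (a translation, not the answerer's code; column names are from the AuditData table above):
SELECT DateOnly,
       AVG(DATEDIFF(SS, prev_time, DateTimeStamp)) AS elapsed_seconds
FROM (
    SELECT DateOnly, ActivityCode, DateTimeStamp,
           LAG(DateTimeStamp) OVER (PARTITION BY UserID, DateOnly ORDER BY DateTimeStamp) AS prev_time,
           LAG(ActivityCode)  OVER (PARTITION BY UserID, DateOnly ORDER BY DateTimeStamp) AS prev_code
    FROM AuditData
    WHERE ActivityCode IN (1, 2)          -- logins and user-data returns only
) t
WHERE ActivityCode = 2 AND prev_code = 1  -- keep only data returned immediately after a login
GROUP BY DateOnly;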
If the assumptions are that:
Users will perform various tasks in no mandated order, and
That the difference between any two activities reflects the time it takes for the first of those two activities to execute,
Then why not create a table with two timestamps, the first column containing the activity start time, the second column containing the next activity start time. Thus the difference between these two will always be total time of the first activity. So for the logout activity, you would just have NULL for the second column.
So it would be kind of weird and interesting: for each activity (other than logging in and logging out), the timestamp would be recorded in two different rows - once for the last activity (as the time "completed") and again in a new row (as the time started). You would end up with a Jacob's ladder of sorts, but finding the data you are after would be much simpler.
In fact, to get really wacky, you could have each row have the time that the user started activity A and the activity code, and the time started activity B and the time stamp (which, as mentioned above, gets put down again for the following row). This way each row will tell you the exact difference in time for any two activities.
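A rough sketch of that shape (table and column names here are illustrative, not from the question):
create table ActivityLog (
    ActivityLogID int identity(1,1) not null,
    UserID        nvarchar(10) not null,
    ActivityCode  int not null,
    StartedAt     datetime not null,
    NextStartedAt datetime null   -- written back when the next activity row is recorded; stays NULL for logout
)
-- the duration of any activity is then just a plain column difference:
--   DATEDIFF(SS, StartedAt, NextStartedAt)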
Otherwise, you're stuck with a query that says something like
SELECT TIME_IN_SEC(row2-timestamp) - TIME_IN_SEC(row1-timestamp)
which would be pretty slow, as you have already suggested. By swallowing the redundancy, you end up just querying the difference between the two columns. You probably would have less need of knowing the user info as well, since you'd know that any row shows both activity codes, thus you can just query the average for all users on any given day and compare it to the next day (unless you are trying to find out which users are having the problem as well).
This is a faster way to find out: after this runs, each row holds both the current and the previous row's datetime value, and you can then use DATEDIFF(datepart, startdate, enddate). I use #DammyVariable and DammyField because, as I recall, there is a problem if #variable = Field does not come first in the update statement.
SELECT *, Cast(NULL AS DateTime) LastRowDateTime, Cast(NULL As INT) DammyField INTO #T FROM AuditData
GO
CREATE CLUSTERED INDEX IX_T ON #T (AuditRecordID)
GO
DECLARE #LastRowDateTime DateTime
DECLARE #DammyVariable INT
SET #LastRowDateTime = NULL
SET #DammyVariable = 1
UPDATE #T SET
    #DammyVariable = DammyField = #DammyVariable   -- anchor assignment; keeps the quirky update processing row by row
    , LastRowDateTime = #LastRowDateTime           -- the previous row's DateTimeStamp lands in this row
    , #LastRowDateTime = DateTimeStamp             -- remember this row's value for the next row
option (maxdop 1)                                  -- single-threaded, so rows are visited in clustered index order
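Once #T is populated, the per-row elapsed time falls out of a plain DATEDIFF. A small usage sketch (note that LastRowDateTime here is the previous row overall by AuditRecordID, not per user; filtering to the activity pairs of interest is left as in the other answers):
SELECT AuditRecordID, UserID, ActivityCode,
       DATEDIFF(SS, LastRowDateTime, DateTimeStamp) AS SecondsSincePrevRow
FROM #T
ORDER BY AuditRecordID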