get number of occurrences taking 1 for every 3 days group - sql

I've got a table of employees who arrive late to work. I need to send a report to Human Resources showing every user who was late, with the rule that a user gets at most one warning per 3-day period, no matter how many times they were late within that period.
The first figure I need is the total number of warnings to be sent, so the HR manager can evaluate global "lateness".
Users who were late on just one day will receive one warning. If they were late twice or more, the number of warnings depends on whether they already received a warning within the 3-day period counting from day one.
Let's see with an example:
Joe monday 9th
Mark monday 9th
Tim monday 9th
Joe tuesday 10th
Joe wednesday 11th
Joe Thursday 12th
Tim Friday 13th
Taking the data from table above as an example.
Joe will receive 2 warnings: first for monday, and second for Thursday. Tuesday and Wednesday will be discarded because they belonged to the same 3 day period.
Mark will receive just one warning for monday.
Tim will receive 2 warnings. First for monday and second for Friday.
Maybe this number can't be obtained with a standard SQL query and a cursor is needed.
Thanks in advance

Ok... some information is missing here, as correctly stated in both comments (Quassnoi and Mr. Llama).
RDBMS in use can impact on the solution as this has to do with date functions and date algebra and not all RDBMS share the same extended function set. I have presumed MySQL 5.5, which is quite common and can be tested on SQLFiddle.
Your 3 day period is also a bit vague. Is it the same for all employees, or on what does it depend? Is it 3 days, or half a week (mon-wed, thu-sat)? What happens after that? Do we use sun-tue, then wed-fri and so on? I have presumed fixed half-weeks, the same for all employees. Note that DAYOFWEEK returns 1 for Sunday through 7 for Saturday, so FLOOR(DAYOFWEEK(DateLate)/4) puts Sunday to Tuesday in half-week 0 and Wednesday to Saturday in half-week 1. So a period is identified by year, week-of-year, half-week.
The last point you should clear up is the expected result set. Do you want a list of warning days or a count or what? I have presumed a list of warnings with the date of each (I've taken the first date for each combination of employee & period).
Creating your sample dataset with these statements:
CREATE TABLE LateEntrances (
Employee VARCHAR(20),
DateLate DATE
);
INSERT INTO LateEntrances VALUES('Joe' ,'2014.06.09');
INSERT INTO LateEntrances VALUES('Mark','2014.06.09');
INSERT INTO LateEntrances VALUES('Tim' ,'2014.06.09');
INSERT INTO LateEntrances VALUES('Joe' ,'2014.06.10');
INSERT INTO LateEntrances VALUES('Joe' ,'2014.06.11');
INSERT INTO LateEntrances VALUES('Joe' ,'2014.06.12');
INSERT INTO LateEntrances VALUES('Tim' ,'2014.06.13');
The following query solves your problem:
SELECT i.Employee, i.YearLate, i.WeekLate, i.PeriodLate, MIN(i.DateLate)
FROM (
SELECT Employee, DateLate,
YEAR(DateLate) AS YearLate,
WEEKOFYEAR(DateLate) AS WeekLate,
FLOOR(DAYOFWEEK(DateLate)/4) AS PeriodLate
FROM LateEntrances
) i
GROUP BY i.Employee, i.YearLate, i.WeekLate, i.PeriodLate;
The three columns YearLate, WeekLate and PeriodLate identify the warning period. You could concatenate them in a single period identification column:
SELECT i.Employee, i.PeriodLate, MIN(i.DateLate)
FROM (
SELECT Employee, DateLate,
CONCAT_WS('*',
YEAR(DateLate) ,
WEEKOFYEAR(DateLate) ,
FLOOR(DAYOFWEEK(DateLate)/4)
) AS PeriodLate
FROM LateEntrances
) i
GROUP BY i.Employee, i.PeriodLate;
... or you could hide them altogether (in the SELECT), even though you must still use them to GROUP BY:
SELECT i.Employee, MIN(i.DateLate)
FROM (
SELECT Employee, DateLate,
CONCAT_WS('*',
YEAR(DateLate) ,
WEEKOFYEAR(DateLate) ,
FLOOR(DAYOFWEEK(DateLate)/4)
) AS PeriodLate
FROM LateEntrances
) i
GROUP BY i.Employee, i.PeriodLate;
You can also easily change the period calculation logic to something else, like strict 3 days of year, or 3 days of month periods. There are many possibilities.
...as per the assumptions I made. Clear up the open points and I'll try to make the answer better. But in the meantime this should be enough to get you started.
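For comparison, here is a minimal sketch of the other reading of the question: a rolling 3-day window per employee that starts at each warning, rather than the fixed calendar periods used in the queries above. This is an assumption about what the asker wants, based on the Joe/Mark/Tim example. It is written in Python rather than SQL because the logic is sequential (each warning depends on the previous one); the function name count_warnings and the window_days parameter are made up for illustration.

```python
from collections import defaultdict
from datetime import date, timedelta


def count_warnings(late_entries, window_days=3):
    """Return {employee: [warning dates]} using a rolling window per employee.

    Assumption: a warning covers `window_days` calendar days starting on the
    day it is issued; later late days inside that window are discarded.
    """
    by_employee = defaultdict(list)
    for employee, day in late_entries:
        by_employee[employee].append(day)

    warnings = {}
    for employee, days in by_employee.items():
        issued = []
        for day in sorted(set(days)):
            # Issue a new warning only once the previous window has elapsed.
            if not issued or day >= issued[-1] + timedelta(days=window_days):
                issued.append(day)
        warnings[employee] = issued
    return warnings


# Sample data from the question (June 2014).
late_entries = [
    ("Joe", date(2014, 6, 9)),
    ("Mark", date(2014, 6, 9)),
    ("Tim", date(2014, 6, 9)),
    ("Joe", date(2014, 6, 10)),
    ("Joe", date(2014, 6, 11)),
    ("Joe", date(2014, 6, 12)),
    ("Tim", date(2014, 6, 13)),
]

warnings = count_warnings(late_entries)
total_warnings = sum(len(dates) for dates in warnings.values())
```

With the sample data this gives Joe warnings on the 9th and 12th, Mark on the 9th, and Tim on the 9th and 13th, which matches the expected result in the question.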


TSQL query to find latest (current) record from period column when there are past present and future records

edited as requested:
My apologies. I've been dealing with this a bit and it's well and truly in my head, but not for the reader.
We have multiple records in table A which have multiple entries in the Period column. Say it's like a football schedule. Teams will have multiple dates/times in the Period column.
When we run query:
We want records selected for the most recent games only.
We don't want the earlier games.
We don't want the games "scheduled" and not yet played.
"Last game played" i.e. Period for teams are often on different days.
Table like:
Team Period
Reds 2021020508:00
Reds 2021011107:00
City 2021030507:00
Reds 2021032607:00
City 2021041607:00
Reds 2021050707:00
When I run query, I want to see the records for last game played regardless of date. So if I run the query on 27 Mar 2021, I want:
City 2021030507:00
Reds 2021032607:00
Keep in mind I used the above as an easily understandable example. In my case I have 1000s of "Teams" each of which may have 100+ different date entries in the Period column and I would like the solution to be applicable regardless of number of records, dates, or when the query is run.
What can I do?
Thanks!
So this gives you your desired output using the sample data, does it fulfil your requirement?
create table x (Team varchar(10), period varchar(20))
insert into x values
('Reds','2021020508:00'),
('Reds','2021011107:00'),
('City','2021030507:00'),
('Reds','2021032607:00'),
('City','2021041607:00'),
('Reds','2021050707:00')
select Team, Max(period) LastPeriod
from x
where period <= Format(GetDate(), 'yyyyMMddHH:mm') -- HH = 24-hour clock; hh would give 12-hour values
group by Team
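To check the idea without a SQL Server instance, here is a small sketch of the same query run against SQLite from Python. SQLite is only a stand-in here, and the as_of parameter (a hypothetical name) replaces GetDate() so the result is reproducible. It relies on the Period strings being fixed-width, so that text comparison matches chronological order.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE x (Team TEXT, period TEXT)")
conn.executemany(
    "INSERT INTO x VALUES (?, ?)",
    [
        ("Reds", "2021020508:00"),
        ("Reds", "2021011107:00"),
        ("City", "2021030507:00"),
        ("Reds", "2021032607:00"),
        ("City", "2021041607:00"),
        ("Reds", "2021050707:00"),
    ],
)


def last_played(conn, as_of):
    """Latest period per team that is not after as_of ('yyyyMMddHH:mm')."""
    cursor = conn.execute(
        "SELECT Team, MAX(period) AS LastPeriod "
        "FROM x WHERE period <= ? "
        "GROUP BY Team ORDER BY Team",
        (as_of,),
    )
    return cursor.fetchall()


# Simulate running the query at noon on 27 Mar 2021.
result = last_played(conn, "2021032712:00")
```

For the sample data this returns City 2021030507:00 and Reds 2021032607:00, the two rows asked for in the question.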
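Because your dates are stored as fixed-width strings, they sort chronologically as text, so I think this would work (it takes the two most recent periods that are not in the future):
SELECT TOP 2 *
FROM tableA
WHERE period <= FORMAT( GETDATE(), 'yyyyMMddHH:mm' )
ORDER BY period DESC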
Perhaps you want:
where period = (select max(t2.period) from t t2)
This returns all rows with the last period in the table.

Is there a simple line (or two) of code that will pull records before a minimum date in another table?

I want to pull emergency room visits before a member's first treatment date. Everyone has a different first treatment date and none occur before Jan 01 2012.
So if a member has a first treatment date of Feb 24 2013, I want to know how many times they visited the ER one year prior to that date.
These min dates are located in another table and I can not use the Min date in my DATEADD function. Thoughts?
One possible solution is to use a CTE to capture the visits between the dates you're interested in and then join to that with your select.
Here is an example:
Rextester
Edit:
I just completely updated my answer. Sorry for the confusion.
So you have at least two tables:
Emergency room visits
Treatment information
Let's call these two tables [ERVisits] and [Treatments].
I suppose both tables have some id-field for the patient/member. Let's call it [MemberId].
How about this conceptual query:
WITH [FirstTreatments] AS
(
SELECT [MemberId], MIN([TreatmentDate]) AS [FirstTreatmentDate]
FROM [Treatments]
GROUP BY [MemberId]
)
SELECT V.[MemberId], T.[FirstTreatmentDate], COUNT(*) AS [ERVisitCount]
FROM [ERVisits] AS V INNER JOIN [FirstTreatments] AS T ON T.[MemberId] = V.[MemberId]
WHERE DATEDIFF(DAY, V.[VisitDate], T.[FirstTreatmentDate]) BETWEEN 0 AND 365
GROUP BY V.[MemberId], T.[FirstTreatmentDate]
This query should show the number of times a patient/member has visited the ER in the year before his/her first treatment date.
Here is a tester: https://rextester.com/UXIE4263
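In case the tester link goes stale, here is a self-contained sketch of the same conceptual query run against SQLite from Python. The rows are made up for illustration, and since SQLite has no DATEDIFF, the day difference is computed with julianday() instead, which is an adaptation rather than the T-SQL above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Treatments (MemberId INTEGER, TreatmentDate TEXT);
CREATE TABLE ERVisits (MemberId INTEGER, VisitDate TEXT);

-- Hypothetical sample rows, ISO dates stored as text.
INSERT INTO Treatments VALUES
  (1, '2013-02-24'), (1, '2013-06-01'), (2, '2012-09-10');
INSERT INTO ERVisits VALUES
  (1, '2011-12-31'),  -- more than a year before first treatment: excluded
  (1, '2012-03-01'),  -- within the year before: counted
  (1, '2012-12-15'),  -- within the year before: counted
  (1, '2013-03-05'),  -- after first treatment: excluded
  (2, '2012-09-01');  -- within the year before: counted
""")

query = """
WITH FirstTreatments AS (
  SELECT MemberId, MIN(TreatmentDate) AS FirstTreatmentDate
  FROM Treatments
  GROUP BY MemberId
)
SELECT V.MemberId, T.FirstTreatmentDate, COUNT(*) AS ERVisitCount
FROM ERVisits AS V
JOIN FirstTreatments AS T ON T.MemberId = V.MemberId
WHERE julianday(T.FirstTreatmentDate) - julianday(V.VisitDate) BETWEEN 0 AND 365
GROUP BY V.MemberId, T.FirstTreatmentDate
ORDER BY V.MemberId
"""
rows = conn.execute(query).fetchall()
```

Note that the inner join drops members who have no ER visits in the window; if those should appear with a count of 0, the join would need to start from the first-treatment set and use an outer join.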

Should I use Effective Date or Start Date and End Date for historical recording?

I am a Business Analyst and have prepared tables/erd for a system we are implementing.
The context is essentially an employee management system, an employee can join the company, change positions, get promoted, demoted, terminated etc. All of this is required to be tracked for filtering and reporting purposes. Therefore we require historical tracking of records.
My recommendation and original design of the tables included a field called "Effective Date", so essentially effectively from a date onwards a particular "Action" is valid.
Say for example, John joined an organisation as a consultant on the 1st Jan 2017 thus the action was he was hired, therefore the effective date is 1st Jan 2017 and he was a consultant for a certain period of time until he became a senior consultant on the 6th September 2017, thus the effective date is 6th September 2017 with an action of promoted for that record.
By the way we will also be performing calculations on the salary of the employee based on their position and other parameters so there will be derived fields and fields being referenced from other tables etc.
Now my boss and the Solutions Architect have advised not to use the "Effective Date," my boss says there will be "problems" with the calculation but doesn't elaborate, and the Solutions Architect says it would be easier to use a Start Date and an End Date instead of effective date. His rationale is if there's no end date that action/event is active, but is inactive once an end date is provided.
My problem with this is that we'll have to maintain an additional column that I feel is totally unnecessary.
What do the brains trust of StackOverflow advise??
Thanks :)
Your instincts serve you well. Don't use the end date. This adds a complication and source of possible anomalous data. Take the following sequential entries:
ID <attr> StartDate EndDate
1 ... Jan 1 Jan 20
1 ... Jan 20 Jan 22
1 ... Feb 1 Jul 30
There was a state change recorded on Jan 1 which was in effect until the next state change on Jan 20. Now we have a problem. According to the EndDate of that version, there was another state change on Jan 22, but the next version started on Feb 1.
This forms a gap in the time stream and we have no indication of where the problem lies. Is the EndDate of Jan 22 wrong? Is the StartDate of Feb 1 wrong? Or is there a missing version that connects the two ends of the gap? There is no way to tell.
ID <attr> StartDate EndDate
1 ... Jan 1 Jan 20
1 ... Jan 20 Feb 20
1 ... Feb 1 Jul 30
Now there is an overlap of states. The second state supposedly lasted until Feb 20 but the third state says it started on Feb 1. But the start of one state logically means the end of the previous state. Again, we have no idea (just by looking at the data) which date is wrong.
Knowing that the start of one state also indicates the end of the previous state, look what happens when we simply remove the EndDate column.
ID <attr> EffDate
1 ... Jan 1
1 ... Jan 20
1 ... Feb 1
Now gaps and overlaps are impossible. Each state begins at the effective date and ends when the next state begins. As the EffDate field is part of the PK, no entry can have the same EffDate value for a given ID value.
This design is not used with the main entity table. It is implemented as a special form of second normal form, what I call version normal form (vnf).
Your Employee table will have fields that don't change over the course of time and some that do. You might also have fields that change but you don't wish to track those changes.
create table Employees(
ID int auto_generated primary key,
Hired date not null,
FName varchar not null,
LName varchar not null,
Sex enum, -- M or F
BDay date,
Position enum not null,
PayRate currency,
DeptID int references Depts( ID )
);
If we wish to track changes to the data, we could add an effective date field. Consider, however, that data such as the hire date and birth date will not change from one version to another. Thus they are dependent only on the ID field. The data that does change (Position, PayRate, DeptID) are dependent on the ID and the effective date field. The table is no longer in 2nf.
So we normalize:
create table Employees(
ID int auto_generated primary key,
Hired date not null,
FName varchar not null,
Sex enum, -- M or F
BDay date
);
create table Employees_V(
ID int not null references Employees( ID ),
EffDate date not null,
LName varchar not null,
Position enum not null,
PayRate currency,
DeptID int references Depts( ID ),
constraint PK_Employees_V primary key( ID, EffDate )
);
The last name can be expected to change now and then, especially among the female employees.
One of the main advantages of this method is that foreign keys cannot reference versions. Now all FKs can reference the main entity table as normal.
The query to obtain the "current" data is relatively simple:
select e.ID, e.Hired, e.FName, v.Lname, e.Sex, e.BDay, v.Position, v.PayRate, v.DeptID
from Employees e
join Employees_V v
on v.ID = e.ID
and v.EffDate =(
select Max( EffDate )
from Employees_V
where ID = v.ID
and EffDate <= GetDate())
where e.ID = 123;
Compare to querying a table with start/end dates.
select ID, Hired, FName, Lname, Sex, BDay, Position, PayRate, DeptID
from Employees
where ID = 123
and StartDate <= GetDate()
and EndDate > GetDate();
This assumes the EndDate value for the current version is a magic value such as 12/31/9999.
This second query looks a lot simpler than the first. Even if the data is normalized as shown above, there is a join but no subquery. It also looks like it will execute much faster.
I have used this technique for about 8 years now and I've never had to alter it because of performance issues. The vnf query runs at worst less than 10% slower than the start/end version. So a one minute query will take about one minute 5 seconds. However, under some conditions, the vnf query will execute faster.
Take entities that have many, many changes (many thousands of versions). The start/end query performs an index scan. It starts at the earliest version and must examine each version in sequence until it finds the one with the EndDate later than the target date. Normally, this is the last version. In the vnf query, the subquery makes it possible to perform an index seek.
So don't reject this design because you think it is slow. It is not slow. Especially when you consider that inserting a new version requires only the one INSERT statement. When working with start/end dates, the insert of a new version requires an UPDATE and then an INSERT. It's two UPDATEs and an INSERT when inserting a new version between two existing versions. To remove a start/end version requires one or two UPDATE and one DELETE statements. To delete a vnf version, just delete the version.
And if the start and end dates between versions ever get out of synch, you have a gap or overlap and good luck finding the right values.
So I'll take the small performance hit to ensure that the data can never get out of synch and turn anomalous on me. This (vnf), as it turns out, is really the simpler design.
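As a rough illustration of the vnf lookup, here is a minimal sketch run against SQLite from Python, using John's history from the question. The tables are trimmed to a few columns, SQLite stands in for the real database, and the position_as_of helper and its as_of parameter are hypothetical names used in place of GetDate().

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Employees (
  ID INTEGER PRIMARY KEY,
  FName TEXT NOT NULL
);
CREATE TABLE Employees_V (
  ID INTEGER NOT NULL REFERENCES Employees(ID),
  EffDate TEXT NOT NULL,          -- ISO date, so text order is date order
  Position TEXT NOT NULL,
  PRIMARY KEY (ID, EffDate)       -- one version per entity per date
);
INSERT INTO Employees VALUES (1, 'John');
INSERT INTO Employees_V VALUES
  (1, '2017-01-01', 'Consultant'),
  (1, '2017-09-06', 'Senior Consultant');
""")


def position_as_of(conn, emp_id, as_of):
    """Return the position in effect on as_of, or None if not yet hired."""
    row = conn.execute(
        """
        SELECT v.Position
        FROM Employees e
        JOIN Employees_V v
          ON v.ID = e.ID
         AND v.EffDate = (SELECT MAX(EffDate)
                          FROM Employees_V
                          WHERE ID = v.ID AND EffDate <= ?)
        WHERE e.ID = ?
        """,
        (as_of, emp_id),
    ).fetchone()
    return row[0] if row else None


before_promotion = position_as_of(conn, 1, "2017-06-30")
after_promotion = position_as_of(conn, 1, "2017-09-06")
before_hire = position_as_of(conn, 1, "2016-12-31")
```

The same query answers "what was true on date X" for any date, which is why no EndDate column is needed for the lookup itself.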
Definitely implement the end date. It is a tiny bit more work when writing but you only write it once, but you will report on it many many times and you'll find that it makes everything so much easier (and faster) when the end date is already there on the record.
All over stackoverflow you will find questions about writing queries to find the end date of a given record when it is defined on the 'next' record rather than the 'current' record. These queries are ugly and slow.
If you look at the back end of enterprise systems like SAP you'll find that records have start and end dates defined.
With regard to your colleagues' comments about not using effective date: You don't provide much info so I'll guess. I'm guessing that there is a true 'effective date' when the thing happened but there is also another set of start and end dates which are the payroll effective dates that the change applies to. So if someone starts on the 1st, the payroll effective date might actually be the 15th. This might also be used for FTE calculations. Payroll and pay periods are really a big deal and quite complex so you shouldn't underestimate the complexity there. If you're including pay calculations in this system then at the very least you need to understand what effective payroll dates are.
You should not be afraid of storing four date columns instead of one. Databases are there to make things easy for you not harder.
Using startDate and endDate makes updates messy, but it makes fetching effective-dated records much easier and faster.
Updating the same record asynchronously may cause the dates to overlap, as we need to fetch all the records within the update range and update them individually.
On the other hand, using only effectiveDate speeds up the update process and eliminates the problem of overlapping dates. But fetching becomes much more complicated this way.
For example:
ID Data EffDate
1 ... Jan 1 2020
1 ... Jan 30 2020
1 ... Feb 1 2020
In the above example, if we want to fetch the record effective on Feb 1, we would have to compare all 3 records to find the highest date not after it (which is impractical when fetching a list). On top of that, joining with other effective-dated tables gets messy.
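For what it's worth, the implied end date of each version can be derived when reading a list. Here is a minimal sketch using the three rows above, run against SQLite from Python; the table name Versions and the Data values are made up, and a correlated subquery is used so it does not depend on window-function support.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Versions (
  ID INTEGER NOT NULL,
  Data TEXT,
  EffDate TEXT NOT NULL,
  PRIMARY KEY (ID, EffDate)
);
INSERT INTO Versions VALUES
  (1, 'a', '2020-01-01'),
  (1, 'b', '2020-01-30'),
  (1, 'c', '2020-02-01');
""")

# Each version ends where the next version of the same ID begins;
# the latest version has no successor, so its EndDate is NULL.
rows = conn.execute("""
SELECT v.ID, v.Data, v.EffDate,
       (SELECT MIN(n.EffDate)
        FROM Versions n
        WHERE n.ID = v.ID AND n.EffDate > v.EffDate) AS EndDate
FROM Versions v
ORDER BY v.ID, v.EffDate
""").fetchall()
```

Whether this is fast enough for large lists depends on the database and indexing; it is shown only to make the trade-off concrete.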

Detecting Invalid Dates in Oracle 11g database (ORA-01847 )

I am querying an Oracle 11.2 instance to build a small data mart that includes extracting the date of birth and date of death of people.
Unfortunately the INSERT query (which takes its data from a SELECT) fails due to ORA-01847 (day of month must be between 1 and last day of month).
To find my bad dates I first did:
SELECT extract(day FROM SOME_DT_TM),
extract(month FROM SOME_DT_TM),
COUNT(*)
FROM PERSON
GROUP BY extract(day FROM SOME_DT_TM), extract(month FROM SOME_DT_TM)
ORDER BY COUNT(*) DESC;
It gave me 367 rows, one for each day of the year including NULL and February-29th (leap year). True for the other date column as well, so it looks like the data is fine from a SELECT perspective.
However if I set logging up on my insert
create table registry_new_dates
(some_dob date, some_death_date date);
exec dbms_errlog.create_error_log('SOME_NEW_DATES');
And then run my long insert query:
SELECT some_dob,some_death_date,ora_err_mesg$ FROM ERR$_SOME_NEW_DATES;
I get the following weird results (first 3 rows shown) which makes me think that zip codes have been somehow inserted instead of dates for the 2nd column.
31-DEC-25 35244 "ORA-01847: day of month must be between 1 and last day of month"
13-DEC-33 35244-3402 "ORA-01847: day of month must be between 1 and last day of month"
23-JUN-58 35235 "ORA-01847: day of month must be between 1 and last day of month"
My question is - how do I detect these bad rows (there are 11 apparently) with an SQL statement so I can fix or remove them. Fixing them in the originating table is not an option (no write privileges). I tried using queries like this:
SELECT DECEASED_DT_TM
FROM WH_CLN_PERSON
WHERE DECEASED_DT_TM LIKE '35%'
AND rownum<3;
But it did not find the offending rows.
Not sure if you are still actively researching this (or if you got an answer already).
To find the rows with the bad data, can't you instead select the DOB and the date of death, and express the WHERE clause in terms of DOB - like so:
...WHERE some_dob = to_date('31-DEC-25')
? After you find those rows, you may want to do another query on just one or two of those rows, including a calculated column: dump(date of death). Then post that. We can learn a lot from the dump - the internal representation of the so-called "date" (which may very well be a ZIP code instead). With that in hand we may be able to figure out what's stored, and how to hunt for it.

GROUP BY with date range

I have a table with 4 columns, id, Stream which is text, Duration (int), and Timestamp (datetime). There is a row inserted for every time someone plays a specific audio stream on my website. Stream is the name, and Duration is the time in seconds that they are listening. I am currently using the following query to figure up total listen hours for each week in a year:
SELECT YEARWEEK(`Timestamp`), (SUM(`Duration`)/60/60) FROM logs_main
WHERE `Stream`="asdf" GROUP BY YEARWEEK(`Timestamp`);
This does what I expect... presenting a total of listen time for each week in the year that there is data.
However, I would like to build a query where I have a result row for weeks that there may not be any data. For example, if the 26th week of 2006 has no rows that fall within that week, then I would like the SUM result to be 0.
Is it possible to do this? Maybe via a JOIN over a date range somehow?
The tried and true old school solution is to set up another table with a bunch of date ranges that you can outer join with for the grouping (as in the other table would have all of the weeks in it with a begin / end date).
In this case, you could just get by with a table full of the values from YEARWEEK:
201100
201101
201102
201103
201104
And here is a sketch of a sql statement:
SELECT year_weeks.yearweek, COALESCE(SUM(logs_main.`Duration`), 0)/60/60
FROM year_weeks LEFT OUTER JOIN logs_main
ON year_weeks.yearweek = YEARWEEK(logs_main.`Timestamp`)
AND logs_main.`Stream` = 'asdf' -- filter in ON, not WHERE, so empty weeks survive the outer join
GROUP BY year_weeks.yearweek;
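To show the outer join producing zero rows for empty weeks, here is a small sketch run against SQLite from Python. The log rows are made up, and SQLite's strftime('%Y%W', ...) is used as a stand-in for MySQL's YEARWEEK(), so the week numbering is not guaranteed to match MySQL's. The two points it illustrates are keeping the Stream filter in the ON clause and wrapping the SUM in COALESCE.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE logs_main (
  id INTEGER PRIMARY KEY,
  Stream TEXT,
  Duration INTEGER,   -- seconds listened
  Timestamp TEXT
);
CREATE TABLE year_weeks (yearweek TEXT PRIMARY KEY);

INSERT INTO year_weeks VALUES ('201101'), ('201102'), ('201103');

-- Hypothetical log rows; week 201102 has no 'asdf' plays.
INSERT INTO logs_main (Stream, Duration, Timestamp) VALUES
  ('asdf',  3600, '2011-01-04 10:00:00'),
  ('asdf',  1800, '2011-01-05 11:00:00'),
  ('other', 7200, '2011-01-11 09:00:00'),
  ('asdf',  7200, '2011-01-18 09:00:00');
""")

rows = conn.execute("""
SELECT w.yearweek,
       COALESCE(SUM(l.Duration), 0) / 3600.0 AS hours
FROM year_weeks w
LEFT JOIN logs_main l
  ON strftime('%Y%W', l.Timestamp) = w.yearweek
 AND l.Stream = 'asdf'          -- filter here, not in WHERE
GROUP BY w.yearweek
ORDER BY w.yearweek
""").fetchall()
```

If the Stream condition were moved to a WHERE clause, the week with no matching rows would be filtered out after the join and the zero row would disappear.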
Here is a suggestion. might not be exactly what you are looking for.
But say you had a simple table with one column [year_week] that contained the YEARWEEK values you want to report on (200601, 200602, 200603... 200652)
You could then theoretically:
SELECT
A.year_week,
(SELECT COALESCE(SUM(`Duration`), 0)/60/60 FROM logs_main WHERE
`Stream` = 'asdf' AND YEARWEEK(`Timestamp`) = A.year_week)
FROM
tblYearWeeks A
This obviously needs some tweaking... I've done several similar queries in other projects and this works well enough depending on the situation.
If you're looking for a one table/sql based solution then that is definitely something I would be interested in as well!