Identify full vs half yearly datasets in SQL - sql

I have a table with two fields of interest for this particular exercise: a CHAR(3) ID and a DATETIME. The ID identifies the submitter of the data - several thousand rows. The DATETIME is not necessarily unique, either. (The primary keys are other fields of the table.)
Data for this table is submitted every six months. In December, we receive July-December data from each submitter, and in June we receive July-June data. My task is to write a script that identifies people who have only submitted half their data, or only submitted January-June data in June.
...Does anyone have a solution?

For interest, this is what I wound up using. It was based off Stephen's answer, but with a few adaptations.
It's part of a larger script that's run every six months, but we're only checking this every twelve months - hence the "If FullYear = 1". I'm sure there's a more stylish way to identify the boundary dates, but this seems to work.
IF @FullYear = 1
BEGIN
    DECLARE @FirstDate AS DATETIME
    DECLARE @LastDayFirstYear AS DATETIME
    DECLARE @SecondYear AS INT
    DECLARE @NewYearsDay AS DATETIME
    DECLARE @LastDate AS DATETIME
    SELECT @FirstDate = MIN(dscdate), @LastDate = MAX(dscdate)
    FROM TheTable
    SELECT @SecondYear = DATEPART(yyyy, @FirstDate) + 1
    SELECT @NewYearsDay = CAST(CAST(@SecondYear AS VARCHAR)
        + '-01-01' AS DATETIME)
    INSERT INTO #AuditResults
    SELECT DISTINCT
        'Submitter missing Jan-Jun data', t.id
    FROM TheTable t
    WHERE
        EXISTS (
            SELECT 1
            FROM TheTable t1
            WHERE t.id = t1.id
              AND t1.date >= @FirstDate
              AND t1.date < @NewYearsDay )
        AND NOT EXISTS (
            SELECT 1
            FROM TheTable t2
            WHERE t2.date >= @NewYearsDay
              AND t2.date <= @LastDate
              AND t2.id = t.id
            GROUP BY t2.id )
    GROUP BY t.id
END

From your description, I wouldn't worry about the efficiency of the query since apparently it only needs to run twice a year!
There are a few ways to do this; which one is 'best' depends on the data that you have. The datediff (on max/min date values) you suggested should work. Another option is to just count records for each submitter within each date range, e.g.
select * from (
    select distinct T.submitterId,
        -- no GROUP BY inside the subqueries, so count(*) yields 0 (not NULL) when nothing matches
        (select count(*)
         from TABLE T1
         where T1.datefield between [july] and [december]
           and T1.submitterId = T.submitterId) as JDCount,
        (select count(*)
         from TABLE T2
         where T2.datefield between [january] and [june]
           and T2.submitterId = T.submitterId) as JJCount
    from TABLE T) X
where X.JDCount = 0 OR X.JJCount = 0
Caveat: untested query off the top of my head; your mileage may vary.
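As a rough, self-contained illustration of the counting idea above, here is a sketch in Python using the standard-library sqlite3 module rather than SQL Server. The table name `submissions`, the helper `submitters_missing_half`, the sample rows, and the hard-coded period boundaries are all assumptions made for the example; they are not part of the original schema.

```python
import sqlite3

def submitters_missing_half(conn, first_half, second_half):
    """Return ids with no rows in at least one of the two half-open date ranges.

    Each range is a (start, end) pair of ISO date strings; start is inclusive,
    end is exclusive.
    """
    sql = """
        SELECT id
        FROM submissions
        GROUP BY id
        HAVING SUM(CASE WHEN dscdate >= ? AND dscdate < ? THEN 1 ELSE 0 END) = 0
            OR SUM(CASE WHEN dscdate >= ? AND dscdate < ? THEN 1 ELSE 0 END) = 0
        ORDER BY id
    """
    return [row[0] for row in conn.execute(sql, (*first_half, *second_half))]

# Hypothetical sample data: three submitters over one July-June reporting year.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE submissions (id TEXT, dscdate TEXT)")
conn.executemany("INSERT INTO submissions VALUES (?, ?)", [
    ("AAA", "2008-08-15"), ("AAA", "2009-02-10"),   # both halves present
    ("BBB", "2008-09-01"),                          # Jul-Dec only
    ("CCC", "2009-03-20"),                          # Jan-Jun only
])
missing = submitters_missing_half(
    conn, ("2008-07-01", "2009-01-01"), ("2009-01-01", "2009-07-01"))
print(missing)
```

Using one grouped pass with two conditional sums avoids the pair of correlated subqueries, and the half-open ranges sidestep the question of what time of day the last date of a period carries.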

I later realised that I was supposed to check to make sure that there was data for both July to December and January to June. So this is what I wound up with in v2:
SELECT @avgmonths = AVG(x.[count])
FROM ( SELECT CAST(COUNT(DISTINCT DATEPART(month,
                  DATEADD(month,
                      DATEDIFF(month, 0, dscdate),
                      0))) AS FLOAT) AS [count]
       FROM HospDscDate
       GROUP BY hosp
     ) x
IF @avgmonths > 7
    SET @months = 12
ELSE
    SET @months = 6
SELECT 'Submitter missing data for some months' AS [WarningType],
       t.id
FROM TheTable t
WHERE EXISTS ( SELECT 1
               FROM TheTable t1
               WHERE t.id = t1.id
               HAVING COUNT(DISTINCT DATEPART(month,
                   DATEADD(month, DATEDIFF(month, 0, t1.Date), 0))) < @months )
GROUP BY t.id

Related

How To Select Records in a Status Between Timestamps? T-SQL

I have a T-SQL Quotes table and need to be able to count how many quotes were in an open status during past months.
The dates I have to work with are an 'Add_Date' timestamp and an 'Update_Date' timestamp. Once a quote is put into a 'Closed_Status' of '1' it can no longer be updated. Therefore, the 'Update_Date' effectively becomes the Closed_Status timestamp.
I'm stuck because I can't figure out how to select all open quotes that were open in a particular month.
Here are a few example records:
Quote_No  Add_Date    Update_Date  Open_Status  Closed_Status
001       01-01-2016  NULL         1            0
002       01-01-2016  3-1-2016     0            1
003       01-01-2016  4-1-2016     0            1
The desired result would be:
Year  Month  Open_Quote_Count
2016  01     3
2016  02     3
2016  03     2
2016  04     1
I've hit a mental wall on this one, I've tried to do some case when filtering but I just can't seem to figure this puzzle out. Ideally I wouldn't be hard-coding in dates because this spans years and I don't want to maintain this once written.
Thank you in advance for your help.
You are doing this by month. So, three options come to mind:
A list of all months using left join.
A recursive CTE.
A number table.
Let me show the last:
with n as (
select row_number() over (order by (select null)) - 1 as n
from master..spt_values
)
select format(dateadd(month, n.n, q.add_date), 'yyyy-MM') as yyyymm,
count(*) as Open_Quote_Count
from quotes q join
n
on (closed_status = 1 and dateadd(month, n.n, q.add_date) <= q.update_date) or
(closed_status = 0 and dateadd(month, n.n, q.add_date) <= getdate())
group by format(dateadd(month, n.n, q.add_date), 'yyyy-MM')
order by yyyymm;
This does assume that each month has at least one open record. That seems reasonable for this purpose.
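To make the month-expansion idea concrete outside the database, here is a small Python sketch. It is not the query above: it assumes, as the desired result in the question implies, that a closed quote no longer counts as open in the month of its Update_Date, and it takes an explicit `as_of` date instead of calling getdate(). The function names `month_index` and `open_quotes_by_month` are made up for the example.

```python
from datetime import date

def month_index(d):
    """Map a date to a running month number so months can be iterated."""
    return d.year * 12 + (d.month - 1)

def open_quotes_by_month(quotes, as_of):
    """quotes: list of (add_date, update_date or None, closed_status).

    Returns a sorted list of ((year, month), open_count).
    Assumption: a closed quote is open from its add month up to, but not
    including, its close month; an open quote is open through as_of's month.
    """
    counts = {}
    for add_date, update_date, closed in quotes:
        if closed:
            last_exclusive = month_index(update_date)
        else:
            last_exclusive = month_index(as_of) + 1
        for m in range(month_index(add_date), last_exclusive):
            key = (m // 12, m % 12 + 1)
            counts[key] = counts.get(key, 0) + 1
    return sorted(counts.items())

# The three example records from the question.
quotes = [
    (date(2016, 1, 1), None, 0),
    (date(2016, 1, 1), date(2016, 3, 1), 1),
    (date(2016, 1, 1), date(2016, 4, 1), 1),
]
result = open_quotes_by_month(quotes, as_of=date(2016, 4, 30))
for (year, month), n in result:
    print(year, f"{month:02d}", n)
```

With `as_of` set to the end of April 2016 this reproduces the four rows of the desired result; whether the close month should count as open is a business rule, so the boundary may need flipping.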
You can use datepart to extract parts of a date, so something like:
select datepart(year, add_date) as 'year',
       datepart(month, add_date) as 'month',
       count(1)
from theTable
where open_status = 1
group by datepart(year, add_date), datepart(month, add_date)
Note: this only counts quotes under the month they were added; it is primarily meant to show the use of datepart.
Updated, as I misunderstood the initial request.
Consider the following test data:
DECLARE @test TABLE
(
    Quote_No VARCHAR(3),
    Add_Date DATE,
    Update_Date DATE,
    Open_Status INT,
    Closed_Status INT
)
INSERT INTO @test (Quote_No, Add_Date, Update_Date, Open_Status, Closed_Status)
VALUES ('001', '20160101', NULL, 1, 0)
     , ('002', '20160101', '20160301', 0, 1)
     , ('003', '20160101', '20160401', 0, 1)
Here is a recursive solution that doesn't rely on system tables, but also performs worse. As we are talking about year/month combinations, the number of recursions will not get out of hand.
;WITH YearMonths AS
(
    SELECT YEAR(MIN(Add_Date)) AS [Year]
         , MONTH(MIN(Add_Date)) AS [Month]
         , MIN(Add_Date) AS YMDate
    FROM @test
    UNION ALL
    SELECT YEAR(DATEADD(MONTH,1,YMDate))
         , MONTH(DATEADD(MONTH,1,YMDate))
         , DATEADD(MONTH,1,YMDate)
    FROM YearMonths
    WHERE YMDate <= SYSDATETIME()
)
SELECT [Year]
     , [Month]
     , COUNT(*) AS Open_Quote_Count
FROM YearMonths ym
INNER JOIN @test t
    ON (
        [Year] * 100 + [Month] <= CAST(FORMAT(t.Update_Date, 'yyyyMM') AS INT)
        AND t.Closed_Status = 1
    )
    OR (
        [Year] * 100 + [Month] <= CAST(FORMAT(SYSDATETIME(), 'yyyyMM') AS INT)
        AND t.Closed_Status = 0
    )
GROUP BY [Year], [Month]
ORDER BY [Year], [Month]
The statement is longer, but also more readable, and it lists all year/month combinations to date.
Take a look at Date and Time Data Types and Functions for SQL-Server 2008+
and Recursive Queries Using Common Table Expressions

SQL Query to get oldest date

I'm new to SQL and have a large database that contains IDs and Service Dates and I need to write a query to give me the first date each ID had a service.
I tried:
SELECT dbo.table.ID, dbo.otherTable.ServiceDate AS EasliestDate
FROM dbo.table INNER JOIN dbo.table.ID = dbo.otherTable.ID
But the output is every service for every ID, which has too many results to sort through. I want the output to only show the ID and the oldest service date. Any advice is appreciated.
EDIT: To be more precise, the output I am looking for is the ID and service date if the oldest service date is during the year that I specify. I.E. if ID = 1 has a service in 2015 and 2016 and I am searching for IDs in 2016 then ID = 1 should not appear in the results because there was an earlier service in 2015.
EDIT: Thanks to everyone who helped with this! The answer I accepted did exactly what I asked. Major kudos to Patty, though, who elaborated on how to further filter the outcome by year.
Use GROUP BY and MIN to get the first date for each ID:
SELECT dbo.table.ID,
MIN(dbo.otherTable.ServiceDate) AS EasliestDate
FROM dbo.table
INNER JOIN otherTable
ON dbo.table.ID = dbo.otherTable.ID
GROUP BY dbo.table.ID;
ADDENDUM
In reference to a question in the comments:
how would I also restrict it to show only those who had a service in a specific year?
It would depend on your exact requirements, consider the following set:
ID ServiceDate
--------------------
1 2014-05-01
1 2015-08-01
1 2016-07-07
2 2015-08-19
You would only want to include ID = 1 if the year you specified was 2016, but assuming you still wanted to return the first date of 2014-05-01 then you would need to add a having clause with a case statement to get this.
DECLARE @Year INT = 2016;
DECLARE @YearStart DATE = DATEADD(YEAR, @Year - 1900, '19000101'),
        @YearEnd DATE = DATEADD(YEAR, @Year - 1900 + 1, '19000101');
SELECT @YearStart, @YearEnd
SELECT t.ID,
       MIN(o.ServiceDate) AS EasliestDate
FROM dbo.table AS t
INNER JOIN otherTable AS o
    ON o.ID = t.ID
GROUP BY t.ID
HAVING COUNT(CASE WHEN o.ServiceDate >= @YearStart
                   AND o.ServiceDate < @YearEnd THEN 1 END) > 0;
If you only want the earliest date in 2016, then a where clause would suffice:
DECLARE @Year INT = 2016;
DECLARE @YearStart DATE = DATEADD(YEAR, @Year - 1900, '19000101'),
        @YearEnd DATE = DATEADD(YEAR, @Year - 1900 + 1, '19000101');
SELECT @YearStart, @YearEnd
SELECT t.ID,
       MIN(o.ServiceDate) AS EasliestDate
FROM dbo.table AS t
INNER JOIN otherTable AS o
    ON o.ID = t.ID
WHERE o.ServiceDate >= @YearStart
  AND o.ServiceDate < @YearEnd
GROUP BY t.ID;
It is worth noting there is a very good reason I have chosen to calculate the start of the year and the start of the next year, and used
WHERE o.ServiceDate >= @YearStart
AND o.ServiceDate < @YearEnd
Instead of just
WHERE DATEPART(YEAR, o.ServiceDate) = 2016;
In the former, an index on ServiceDate can be used, whereas in the latter the DATEPART calculation must be done on every record, which can cause significant performance issues.
ADDENDUM 2
To do the following:
The exact thing I want then would be IDs who's earliest service is in the year I specify.
Then you would need a having clause, just a different one to the one I posted before:
DECLARE @Year INT = 2016;
DECLARE @YearStart DATE = DATEADD(YEAR, @Year - 1900, '19000101'),
        @YearEnd DATE = DATEADD(YEAR, @Year - 1900 + 1, '19000101');
SELECT @YearStart, @YearEnd
SELECT t.ID,
       MIN(o.ServiceDate) AS EasliestDate
FROM dbo.table AS t
INNER JOIN otherTable AS o
    ON o.ID = t.ID
GROUP BY t.ID
HAVING MIN(o.ServiceDate) >= @YearStart
   AND MIN(o.ServiceDate) < @YearEnd;
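For anyone who wants to try the "earliest service falls in the given year" filter without a SQL Server instance, here is a minimal sketch using Python's built-in sqlite3 module. The table name `services`, the column names, and the helper `ids_first_served_in` are assumptions for the example; dates are stored as ISO strings so plain string comparison orders them correctly. The rows are the sample set from this answer.

```python
import sqlite3

def ids_first_served_in(conn, year):
    """Return (id, earliest_date) for ids whose first service is in `year`."""
    start = f"{year:04d}-01-01"        # inclusive lower bound
    end = f"{year + 1:04d}-01-01"      # exclusive upper bound
    sql = """
        SELECT id, MIN(service_date) AS earliest
        FROM services
        GROUP BY id
        HAVING MIN(service_date) >= ? AND MIN(service_date) < ?
        ORDER BY id
    """
    return conn.execute(sql, (start, end)).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE services (id INTEGER, service_date TEXT)")
conn.executemany("INSERT INTO services VALUES (?, ?)", [
    (1, "2014-05-01"),
    (1, "2015-08-01"),
    (1, "2016-07-07"),
    (2, "2015-08-19"),
])

print(ids_first_served_in(conn, 2015))
print(ids_first_served_in(conn, 2016))
```

ID 1 has services in 2015 and 2016 but is excluded for both years because its first service was in 2014, which is the behaviour asked for in the question's first EDIT.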
ADDENDUM 3
CREATE VIEW dbo.YourView
AS
SELECT dbo.table.ID,
MIN(dbo.otherTable.ServiceDate) AS EasliestDate
FROM dbo.table
INNER JOIN otherTable
ON dbo.table.ID = dbo.otherTable.ID
GROUP BY dbo.table.ID;
Then you can apply your criteria to the view:
SELECT *
FROM dbo.YourView
WHERE EasliestDate >= '2015-01-01'
AND EasliestDate < '2016-01-01';
You have to include a WHERE in your current query:
SELECT dbo.table.ID, dbo.otherTable.ServiceDate AS EasliestDate
FROM dbo.table INNER JOIN dbo.otherTable ON dbo.table.ID = dbo.otherTable.ID
WHERE Month(dbo.otherTable.ServiceDate) = 1
Or you can search with Year(dbo.otherTable.ServiceDate) = 2016
Or you can use Day(dbo.otherTable.ServiceDate) = 1
Or a specific date.
Use GROUP BY and MIN to get the records. You can also refer to http://www.xaprb.com/blog/2006/12/07/how-to-select-the-firstleastmax-row-per-group-in-sql/ for a better understanding.
You need to use a "Group by" statement. Try this:
SELECT dbo.table.ID, Max(dbo.otherTable.ServiceDate) AS LatestDate, Min(dbo.otherTable.ServiceDate) AS EarliestDate
FROM dbo.table INNER JOIN dbo.otherTable ON dbo.table.ID = dbo.otherTable.ID
group by dbo.table.ID
Use a nested statement to get the min date, and then just match based on ID.
select t1.ID, t2.MinServiceDate from table1 t1 INNER JOIN
(
    SELECT ID, MIN(servicedate) MinServiceDate
    FROM table2
    GROUP BY ID
) t2 ON t1.ID = t2.ID

Pad out an SQL table with data for Graphing Purposes

SQL Server 2005
I have an SQL Function (ftn_GetExampleTable) which returns a table with multiple result rows
EXAMPLE
ID  MemberID  MemberGroupID  Result1    Result2  Result3  Year  Week
1   1         1              High Risk  2        xx       2011  22
2   11        4              Low Risk   1        yy       2011  21
3   12        5              Med Risk   3        zz       2011  25
etc.
Now I do a count and group by on a table above this for Result 2 for instance so I get
SELECT MemberGroupID, Result2, Count(*) AS ExampleCount, Year, Week
FROM ftn_GetExampleTable
GROUP BY MemberGroupID, Result2, Year, Week
MemberGroupID  Result2  ExampleCount  Year  Week
1              2        4             2011  22
4              1        2             2011  21
5              3        1             2011  25
Now imagine I graph this new table between weeks 20 and 23 of 2011. It won't plot weeks 20 or 23, certain groups, or even certain results, because they are not in the data. So I need "false data" inserted into this table covering all the possibilities, so that they at least show on the graph even if the count is 0. Does this make sense?
I am wondering what the easiest and most dynamic way to do this is, as I might want to graph on Result1 or Result3 instead (different column types).
Thanks in advance
It looks like your dimensions are: MemberGroupID,Result2, and week (Year,Week).
One approach to solving this is to generate a list of all values you want for all the dimensions, and produce a cartesian product of them. As an example,
SELECT m.MemberGroupID, n.Result2, w.Year, w.Week
FROM (SELECT MemberGroupID FROM ftn_GetExampleTable GROUP BY MemberGroupID) m
CROSS
JOIN (SELECT Result2 FROM ftn_GetExampleTable GROUP BY Result2 ) n
CROSS
JOIN (SELECT Year, Week FROM myCalendar WHERE ... ) w
You don't necessarily need a table named myCalendar. (That approach does seem to be the popular one.) You just need a row source from which you can derive a list of (Year, Week) tuples. (There are answers to the question elsewhere in Stackoverflow, how to generate a list of dates.)
And the list of MemberGroupID and Result2 values doesn't have to come from the ftn_GetExampleTable rowsource, you could substitute another query.
With a cartesian product of those dimensions, you've got a complete "grid". Now you can LEFT JOIN your original result set to that.
Any place you don't have a matching row from the "gappy" result query, you'll get a NULL returned. You can leave the NULL, or replace it with a 0, which is probably what you want if it's a "count" you are returning.
SELECT d.MemberGroupID
, d.Result2
, d.Year
, d.Week
, COALESCE(r.ExampleCount,0) as ExampleCount
FROM ( <dimension query from above> ) d
LEFT
JOIN ( <original ExampleCount query> ) r
ON r.MemberGroupID = d.MemberGroupID
AND r.Result2 = d.Result2
AND r.Year = d.Year
AND r.Week = d.Week
That query can be refactored to make use of Common Table Expressions, which makes the query a little easier to read, especially if you are including multiple measures.
; WITH d AS ( /* <dimension query with no gaps (example above)> */
)
, r AS ( /* <original query with gaps> */
SELECT MemberGroupID, Result2, Count(*) AS ExampleCount, Year, Week
FROM ftn_GetExampleTable
GROUP BY MemberGroupID, Result2, Year, Week
)
SELECT d.*
, COALESCE(r.ExampleCount,0)
FROM d
LEFT
JOIN r
ON r.Year=d.Year AND r.Week=d.Week AND r.MemberGroupID = d.MemberGroupID
AND r.Result2 = d.Result2
This isn't a complete working solution to your problem, but it outlines an approach you can use.
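The outline above (build a gap-free grid of dimensions, then LEFT JOIN the gappy counts onto it) can be tried end to end with Python's sqlite3 module. This is only a sketch under assumptions: the tables `results` and `calendar`, their column names, and the sample rows are invented for the example, and only the member group and week dimensions are included to keep it short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE results (member_group_id INTEGER, yr INTEGER, wk INTEGER);
    INSERT INTO results VALUES (1, 2011, 22);
    INSERT INTO results VALUES (1, 2011, 22);
    INSERT INTO results VALUES (4, 2011, 21);
    CREATE TABLE calendar (yr INTEGER, wk INTEGER);
    INSERT INTO calendar VALUES (2011, 20);
    INSERT INTO calendar VALUES (2011, 21);
    INSERT INTO calendar VALUES (2011, 22);
    INSERT INTO calendar VALUES (2011, 23);
""")

sql = """
    SELECT d.member_group_id, d.yr, d.wk,
           COALESCE(r.example_count, 0) AS example_count
    FROM (  -- the gap-free grid: every group crossed with every week
            SELECT g.member_group_id, w.yr, w.wk
            FROM (SELECT DISTINCT member_group_id FROM results) AS g
            CROSS JOIN calendar AS w
         ) AS d
    LEFT JOIN (  -- the original, gappy aggregate
            SELECT member_group_id, yr, wk, COUNT(*) AS example_count
            FROM results
            GROUP BY member_group_id, yr, wk
         ) AS r
      ON r.member_group_id = d.member_group_id
     AND r.yr = d.yr
     AND r.wk = d.wk
    ORDER BY d.member_group_id, d.yr, d.wk
"""
grid = conn.execute(sql).fetchall()
for row in grid:
    print(row)
```

Every (group, week) pair comes back, with 0 where the aggregate had no row, which is the shape a charting tool needs.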
Whenever I need to generate a sequence within SQL-Server I use the sys.all_objects table along with the ROW_NUMBER function, then manipulate it as required:
SELECT ROW_NUMBER() OVER(ORDER BY Object_ID) AS Sequence
FROM Sys.All_Objects
So for the list of year and week numbers I would use:
DECLARE @StartDate DATETIME,
        @EndDate DATETIME
SET @StartDate = '20110101'
SET @EndDate = '20120601'
SELECT DATEPART(YEAR, Date) AS YEAR,
       DATEPART(WEEK, Date) AS WeekNum
FROM ( SELECT DATEADD(WEEK, ROW_NUMBER() OVER(ORDER BY Object_ID) - 1, @StartDate) AS Date
       FROM Sys.All_Objects
     ) Dates
WHERE Date < @EndDate
Where the dates subquery provides a list of dates at one week intervals between your start and end dates.
So in your example the end result would be something like:
DECLARE @StartDate DATETIME,
        @EndDate DATETIME
SET @StartDate = '20110101'
SET @EndDate = '20120601'
;WITH Data AS
( SELECT MemberGroupID,
         Result2,
         Count(*) AS ExampleCount,
         Year,
         Week
  FROM ftn_GetExampleTable
  GROUP BY MemberGroupID, Result2, Year, Week
), Dates AS
( SELECT DATEPART(YEAR, Date) AS YearNum,
         DATEPART(WEEK, Date) AS WeekNum
  FROM ( SELECT DATEADD(WEEK, ROW_NUMBER() OVER(ORDER BY Object_ID) - 1, @StartDate) AS Date
         FROM Sys.All_Objects
       ) Dates
  WHERE Date < @EndDate
)
SELECT YearNum,
       WeekNum,
       MemberGroupID,
       Result2,
       COALESCE(ExampleCount, 0) AS ExampleCount
FROM Dates
LEFT JOIN Data
    ON YearNum = Data.Year
    AND WeekNum = Data.Week

Sql to select row from each day of a month

I have a table which stores records for all the dates of a month, and I want to retrieve some data from it. The table is so large that I should select only a few of them. If the records have a column "ric_date" which is a date, how can I select records from each of the dates in a month, while selecting only a few from each date?
The table is so large that a single date can have 100000 records.
WITH T AS (
SELECT ric_date
FROM yourTable
WHERE ric_date BETWEEN @start_date AND @end_date -- thanks Aaron Bertrand
GROUP BY ric_date
)
SELECT CA.*
FROM T
CROSS APPLY (
SELECT TOP 500 * -- 'a fews'
FROM yourTable AS YT
WHERE YT.ric_date = T.ric_date
ORDER BY someAttribute -- not required, but useful
) AS CA
Rough idea. This will get the first three rows per day for the current month (or as many that exist for any given day - there may be days with no rows represented).
DECLARE
    @manys INT = 3,
    @month DATE = DATEADD(DAY, 1-DAY(GETDATE()), DATEDIFF(DAY, 0, GETDATE()));
;WITH x AS
(
    SELECT some_column, ric_date, rn = ROW_NUMBER() OVER
        (PARTITION BY ric_date ORDER BY ric_date)
    FROM dbo.data
    WHERE ric_date >= @month
      AND ric_date < DATEADD(MONTH, 1, @month)
)
SELECT some_column, ric_date FROM x
WHERE rn <= @manys;
If you don't have supporting indexes (most importantly on ric_date), this won't necessarily scale well at the high end.
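The "number the rows within each date, keep the first few" logic behind both answers can be shown in a few lines of plain Python. This is an illustration of the idea only, not a replacement for doing it in the database; the function name `sample_per_day` and the sample rows are made up.

```python
from itertools import groupby, islice
from operator import itemgetter

def sample_per_day(rows, per_day):
    """rows: iterable of (ric_date, payload) tuples.

    Returns at most `per_day` rows for each distinct ric_date, in date order.
    Within a date, rows keep their original relative order because sorted()
    is stable.
    """
    ordered = sorted(rows, key=itemgetter(0))
    picked = []
    for _, same_day in groupby(ordered, key=itemgetter(0)):
        # islice plays the role of "rn <= @manys" / "TOP n" per partition.
        picked.extend(islice(same_day, per_day))
    return picked

rows = [
    ("2012-01-02", "d"),
    ("2012-01-01", "a"),
    ("2012-01-01", "b"),
    ("2012-01-01", "c"),
    ("2012-01-02", "e"),
]
sample = sample_per_day(rows, per_day=2)
print(sample)
```

As in the SQL versions, a day with fewer rows than the limit simply contributes what it has, and a day with no rows does not appear at all.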

SQL to determine minimum sequential days of access?

The following User History table contains one record for every day a given user has accessed a website (in a 24 hour UTC period). It has many thousands of records, but only one record per day per user. If the user has not accessed the website for that day, no record will be generated.
Id UserId CreationDate
------ ------ ------------
750997 12 2009-07-07 18:42:20.723
750998 15 2009-07-07 18:42:20.927
751000 19 2009-07-07 18:42:22.283
What I'm looking for is a SQL query on this table with good performance, that tells me which userids have accessed the website for (n) continuous days without missing a day.
In other words, how many users have (n) records in this table with sequential (day-before, or day-after) dates? If any day is missing from the sequence, the sequence is broken and should restart again at 1; we're looking for users who have achieved a continuous number of days here with no gaps.
Any resemblance between this query and a particular Stack Overflow badge is purely coincidental, of course.. :)
How about (and please make sure the previous statement ended with a semi-colon):
WITH numberedrows
AS (SELECT ROW_NUMBER() OVER (PARTITION BY UserID
ORDER BY CreationDate)
- DATEDIFF(day,'19000101',CreationDate) AS TheOffset,
CreationDate,
UserID
FROM tablename)
SELECT MIN(CreationDate),
MAX(CreationDate),
COUNT(*) AS NumConsecutiveDays,
UserID
FROM numberedrows
GROUP BY UserID,
TheOffset
The idea being that if we have list of the days (as a number), and a row_number, then missed days make the offset between these two lists slightly bigger. So we're looking for a range that has a consistent offset.
You could use "ORDER BY NumConsecutiveDays DESC" at the end of this, or say "HAVING count(*) > 14" for a threshold...
I haven't tested this though - just writing it off the top of my head. Hopefully works in SQL2005 and on.
...and would be very much helped by an index on tablename(UserID, CreationDate)
Edited: Turns out Offset is a reserved word, so I used TheOffset instead.
Edited: The suggestion to use COUNT(*) is very valid - I should've done that in the first place but wasn't really thinking. Previously it was using datediff(day, min(CreationDate), max(CreationDate)) instead.
Rob
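The offset trick in the answer above is easy to check by hand in plain Python: number each user's dates in order, subtract the day number from the row number, and rows that share the same difference belong to the same unbroken run. The sketch below assumes, as the question states, at most one record per user per day; the function name `consecutive_runs` and the sample visits are invented for the example.

```python
from collections import defaultdict
from datetime import date

def consecutive_runs(visits):
    """visits: iterable of (user_id, date), at most one per user per day.

    Returns a list of (user_id, run_start, run_end, run_length).
    """
    by_user = defaultdict(list)
    for user_id, day in visits:
        by_user[user_id].append(day)

    runs = []
    for user_id, days in sorted(by_user.items()):
        groups = defaultdict(list)
        for row_number, day in enumerate(sorted(days), start=1):
            # Same role as ROW_NUMBER() - DATEDIFF(day, '19000101', CreationDate):
            # the difference stays constant while the days are consecutive.
            groups[row_number - day.toordinal()].append(day)
        for run in groups.values():
            runs.append((user_id, run[0], run[-1], len(run)))
    return runs

visits = [
    (12, date(2009, 7, 7)),
    (12, date(2009, 7, 8)),
    (12, date(2009, 7, 9)),
    (12, date(2009, 7, 11)),   # gap on the 10th starts a new run
    (15, date(2009, 7, 7)),
]
runs = consecutive_runs(visits)
for run in runs:
    print(run)
```

Filtering the result on `run_length >= n` corresponds to the HAVING threshold suggested in the answer.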
The answer is obviously:
SELECT DISTINCT UserId
FROM UserHistory uh1
WHERE (
SELECT COUNT(*)
FROM UserHistory uh2
WHERE uh2.CreationDate
BETWEEN uh1.CreationDate AND DATEADD(d, @days, uh1.CreationDate)
) = @days OR UserId = 52551
EDIT:
Okay here's my serious answer:
DECLARE @days int
DECLARE @seconds bigint
SET @days = 30
SET @seconds = (@days * 24 * 60 * 60) - 1
SELECT DISTINCT UserId
FROM (
    SELECT uh1.UserId, Count(uh1.Id) as Conseq
    FROM UserHistory uh1
    INNER JOIN UserHistory uh2 ON uh2.CreationDate
        BETWEEN uh1.CreationDate AND
            DATEADD(s, @seconds, DATEADD(dd, DATEDIFF(dd, 0, uh1.CreationDate), 0))
        AND uh1.UserId = uh2.UserId
    GROUP BY uh1.Id, uh1.UserId
) as Tbl
WHERE Conseq >= @days
EDIT:
[Jeff Atwood] This is a great fast solution and deserves to be accepted, but Rob Farley's solution is also excellent and arguably even faster (!). Please check it out too!
If you can change the table schema, I'd suggest adding a column LongestStreak to the table which you'd set to the number of sequential days ending to the CreationDate. It's easy to update the table at login time (similar to what you are doing already, if no rows exist of the current day, you'll check if any row exists for the previous day. If true, you'll increment the LongestStreak in the new row, otherwise, you'll set it to 1.)
The query will be obvious after adding this column:
if exists(select * from table
where LongestStreak >= 30 and UserId = @UserId)
-- award the Woot badge.
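The login-time bookkeeping described in this answer amounts to a few lines of logic. Here is a hedged Python sketch of it; instead of a LongestStreak column on every row it keeps only the latest (date, streak) pair per user in a dictionary, and the function name `record_login` is hypothetical.

```python
from datetime import date, timedelta

def record_login(streaks, user_id, today):
    """Update and return the user's current streak of consecutive days.

    streaks maps user_id -> (last_login_date, streak_length) and is
    modified in place.
    """
    last, streak = streaks.get(user_id, (None, 0))
    if last == today:
        # Already counted today; a second visit changes nothing.
        return streak
    if last == today - timedelta(days=1):
        streak += 1        # visited yesterday: extend the streak
    else:
        streak = 1         # first visit ever, or a gap: restart at 1
    streaks[user_id] = (today, streak)
    return streak

streaks = {}
history = [
    record_login(streaks, 12, date(2009, 7, 7)),
    record_login(streaks, 12, date(2009, 7, 8)),
    record_login(streaks, 12, date(2009, 7, 8)),   # same day again
    record_login(streaks, 12, date(2009, 7, 10)),  # missed the 9th
]
print(history)
```

The cost is paid once per login, so the badge check itself becomes a simple comparison against the stored streak.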
Some nicely expressive SQL along the lines of:
select
userId,
dbo.MaxConsecutiveDates(CreationDate) as blah
from
dbo.Logins
group by
userId
Assuming you have a user defined aggregate function something along the lines of (beware this is buggy):
using System;
using System.Data.SqlTypes;
using Microsoft.SqlServer.Server;
using System.Runtime.InteropServices;
namespace SqlServerProject1
{
[StructLayout(LayoutKind.Sequential)]
[Serializable]
internal struct MaxConsecutiveState
{
public int CurrentSequentialDays;
public int MaxSequentialDays;
public SqlDateTime LastDate;
}
[Serializable]
[SqlUserDefinedAggregate(
Format.Native,
IsInvariantToNulls = true, //optimizer property
IsInvariantToDuplicates = false, //optimizer property
IsInvariantToOrder = false) //optimizer property
]
[StructLayout(LayoutKind.Sequential)]
public class MaxConsecutiveDates
{
/// <summary>
/// The variable that holds the intermediate result of the concatenation
/// </summary>
private MaxConsecutiveState _intermediateResult;
/// <summary>
/// Initialize the internal data structures
/// </summary>
public void Init()
{
_intermediateResult = new MaxConsecutiveState { LastDate = SqlDateTime.MinValue, CurrentSequentialDays = 0, MaxSequentialDays = 0 };
}
/// <summary>
/// Accumulate the next value, not if the value is null
/// </summary>
/// <param name="value"></param>
public void Accumulate(SqlDateTime value)
{
if (value.IsNull)
{
return;
}
int sequentialDays = _intermediateResult.CurrentSequentialDays;
int maxSequentialDays = _intermediateResult.MaxSequentialDays;
DateTime currentDate = value.Value.Date;
if (currentDate.AddDays(-1).Equals(new DateTime(_intermediateResult.LastDate.TimeTicks)))
sequentialDays++;
else
{
maxSequentialDays = Math.Max(sequentialDays, maxSequentialDays);
sequentialDays = 1;
}
_intermediateResult = new MaxConsecutiveState
{
CurrentSequentialDays = sequentialDays,
LastDate = currentDate,
MaxSequentialDays = maxSequentialDays
};
}
/// <summary>
/// Merge the partially computed aggregate with this aggregate.
/// </summary>
/// <param name="other"></param>
public void Merge(MaxConsecutiveDates other)
{
// add stuff for two separate calculations
}
/// <summary>
/// Called at the end of aggregation, to return the results of the aggregation.
/// </summary>
/// <returns></returns>
public SqlInt32 Terminate()
{
int max = Math.Max((int) ((sbyte) _intermediateResult.CurrentSequentialDays), (sbyte) _intermediateResult.MaxSequentialDays);
return new SqlInt32(max);
}
}
}
Seems like you could take advantage of the fact that to be continuous over n days would require there to be n rows.
So something like:
SELECT users.UserId, count(1) as cnt
FROM users
WHERE users.CreationDate > now() - INTERVAL 30 DAY
GROUP BY UserId
HAVING cnt = 30
Doing this with a single SQL query seems overly complicated to me. Let me break this answer down in two parts.
What you should have done until now and should start doing now:
Run a daily cron job that checks for every user whether he has logged in today and then increments a counter if he has or sets it to 0 if he hasn't.
What you should do now:
- Export this table to a server that doesn't run your website and won't be needed for a while. ;)
- Sort it by user, then date.
- go through it sequentially, keep a counter...
You could use a recursive CTE (SQL Server 2005+):
WITH recur_date AS (
    SELECT t.userid,
           t.creationDate,
           DATEADD(day, 1, t.creationDate) 'nextDay',
           1 'level'
    FROM TABLE t
    UNION ALL
    SELECT t.userid,
           t.creationDate,
           DATEADD(day, 1, t.creationDate) 'nextDay',
           rd.level + 1 'level'
    FROM TABLE t
    JOIN recur_date rd on t.creationDate = rd.nextDay AND t.userid = rd.userid)
SELECT t.*
FROM recur_date t
WHERE t.level = @numDays
ORDER BY t.userid
Joe Celko has a complete chapter on this in SQL for Smarties (calling it Runs and Sequences). I don't have that book at home, so when I get to work... I'll actually answer this. (assuming history table is called dbo.UserHistory and the number of days is @Days)
Another lead is from SQL Team's blog on runs
The other idea I've had, but don't have a SQL server handy to work on here is to use a CTE with a partitioned ROW_NUMBER like this:
WITH Runs
AS
(SELECT UserID
, CreationDate
, ROW_NUMBER() OVER(PARTITION BY UserId
ORDER BY CreationDate)
- ROW_NUMBER() OVER(PARTITION BY UserId, NoBreak
ORDER BY CreationDate) AS RunNumber
FROM
(SELECT UH.UserID
, UH.CreationDate
, ISNULL((SELECT TOP 1 1
FROM dbo.UserHistory AS Prior
WHERE Prior.UserId = UH.UserId
AND Prior.CreationDate
BETWEEN DATEADD(dd, DATEDIFF(dd, 0, UH.CreationDate), -1)
AND DATEADD(dd, DATEDIFF(dd, 0, UH.CreationDate), 0)), 0) AS NoBreak
FROM dbo.UserHistory AS UH) AS Consecutive
)
SELECT UserID, MIN(CreationDate) AS RunStart, MAX(CreationDate) AS RunEnd
FROM Runs
GROUP BY UserID, RunNumber
HAVING DATEDIFF(dd, MIN(CreationDate), MAX(CreationDate)) >= @Days
The above is likely WAY HARDER than it has to be, but it is left as a brain teaser for when you have some other definition of "a run" than just dates.
A couple of SQL Server 2012 options (assuming N=100 below).
;WITH T(UserID, NRowsPrevious)
AS (SELECT UserID,
DATEDIFF(DAY,
LAG(CreationDate, 100)
OVER
(PARTITION BY UserID
ORDER BY CreationDate),
CreationDate)
FROM UserHistory)
SELECT DISTINCT UserID
FROM T
WHERE NRowsPrevious = 100
Though with my sample data the following worked out more efficient
;WITH U
AS (SELECT DISTINCT UserId
FROM UserHistory) /*Ideally replace with Users table*/
SELECT UserId
FROM U
CROSS APPLY (SELECT TOP 1 *
FROM (SELECT
DATEDIFF(DAY,
LAG(CreationDate, 100)
OVER
(ORDER BY CreationDate),
CreationDate)
FROM UserHistory UH
WHERE U.UserId = UH.UserID) T(NRowsPrevious)
WHERE NRowsPrevious = 100) O
Both rely on the constraint stated in the question that there is at most one record per day per user.
If this is so important to you, source this event and drive a table to give you this info. No need to kill the machine with all those crazy queries.
Something like this?
select distinct userid
from table t1, table t2
where t1.UserId = t2.UserId
AND trunc(t1.CreationDate) = trunc(t2.CreationDate) + n
AND (
select count(*)
from table t3
where t1.UserId = t3.UserId
and CreationDate between trunc(t1.CreationDate) and trunc(t1.CreationDate)+n
) = n
I used a simple math property to identify who consecutively accessed the site: for a user with one record per day and no gaps, the number of days between the first and last access, plus one, equals the number of records in the access log.
Here is a SQL script for Oracle DB (the same idea should carry over to other DBs):
-- show basic understanding of the math property
select trunc(max (creation_date)) - trunc(min (creation_date)) + 1
max_min_days_span,
count ( * ) real_day_count
from user_access_log
group by user_id;
-- select all users that have consecutively accessed the site
select user_id
from user_access_log
group by user_id
having trunc(max (creation_date)) - trunc(min (creation_date)) + 1
= count ( * );
-- get the count of all users that have consecutively accessed the site
select count ( * ) user_count
from (select user_id
from user_access_log
group by user_id
having trunc(max (creation_date)) - trunc(min (creation_date)) + 1
= count ( * ));
Table prep script:
-- create table
create table user_access_log (id number, user_id number, creation_date date);
-- insert seed data
insert into user_access_log (id, user_id, creation_date)
values (1, 12, sysdate);
insert into user_access_log (id, user_id, creation_date)
values (2, 12, sysdate + 1);
insert into user_access_log (id, user_id, creation_date)
values (3, 12, sysdate + 2);
insert into user_access_log (id, user_id, creation_date)
values (4, 16, sysdate);
insert into user_access_log (id, user_id, creation_date)
values (5, 16, sysdate + 1);
insert into user_access_log (id, user_id, creation_date)
values (6, 16, sysdate + 5);
declare @startdate as datetime, @days as int
set @startdate = cast('11 Jan 2009' as datetime) -- The startdate
set @days = 5 -- The number of consecutive days
SELECT userid
      ,count(1) as [Number of Consecutive Days]
FROM UserHistory
WHERE creationdate >= @startdate
  AND creationdate < dateadd(dd, @days, cast(convert(char(11), @startdate, 113) as datetime))
GROUP BY userid
HAVING count(1) >= @days
The statement cast(convert(char(11), @startdate, 113) as datetime) removes the time part of the date so we start at midnight.
I would assume also that the creationdate and userid columns are indexed.
I just realized that this won't tell you all the users and their total consecutive days. But will tell you which users will have been visiting a set number of days from a date of your choosing.
Revised solution:
declare @days as int
set @days = 30
select t1.userid
from UserHistory t1
where (select count(1)
       from UserHistory t3
       where t3.userid = t1.userid
         and t3.creationdate >= DATEADD(dd, DATEDIFF(dd, 0, t1.creationdate), 0)
         and t3.creationdate < DATEADD(dd, DATEDIFF(dd, 0, t1.creationdate) + @days, 0)
       group by t3.userid
      ) >= @days
group by t1.userid
I've checked this and it will query for all users and all dates. It is based on Spencer's 1st (joke?) solution, but mine works.
Update: improved the date handling in the second solution.
This should do what you want but I don't have enough data to test efficiency. The convoluted CONVERT/FLOOR stuff is to strip the time portion off the datetime field. If you're using SQL Server 2008 then you could use CAST(x.CreationDate AS DATE).
DECLARE @Range as INT
SET @Range = 10
SELECT DISTINCT UserId, CONVERT(DATETIME, FLOOR(CONVERT(FLOAT, a.CreationDate)))
FROM tblUserLogin a
WHERE EXISTS
    (SELECT 1
     FROM tblUserLogin b
     WHERE a.userId = b.userId
       AND (SELECT COUNT(DISTINCT(CONVERT(DATETIME, FLOOR(CONVERT(FLOAT, CreationDate)))))
            FROM tblUserLogin c
            WHERE c.userid = b.userid
              AND CONVERT(DATETIME, FLOOR(CONVERT(FLOAT, c.CreationDate))) BETWEEN CONVERT(DATETIME, FLOOR(CONVERT(FLOAT, a.CreationDate))) and CONVERT(DATETIME, FLOOR(CONVERT(FLOAT, a.CreationDate)))+@Range-1) = @Range)
Creation script
CREATE TABLE [dbo].[tblUserLogin](
[Id] [int] IDENTITY(1,1) NOT NULL,
[UserId] [int] NULL,
[CreationDate] [datetime] NULL
) ON [PRIMARY]
Spencer almost did it, but this should be the working code:
SELECT DISTINCT UserId
FROM History h1
WHERE (
SELECT COUNT(*)
FROM History
WHERE UserId = h1.UserId AND CreationDate BETWEEN h1.CreationDate AND DATEADD(d, @n-1, h1.CreationDate)
) >= @n
Off the top of my head, MySQLish:
SELECT start.UserId
FROM UserHistory AS start
LEFT OUTER JOIN UserHistory AS pre_start ON pre_start.UserId=start.UserId
AND DATE(pre_start.CreationDate)=DATE_SUB(DATE(start.CreationDate), INTERVAL 1 DAY)
LEFT OUTER JOIN UserHistory AS subsequent ON subsequent.UserId=start.UserId
AND DATE(subsequent.CreationDate)<=DATE_ADD(DATE(start.CreationDate), INTERVAL 30 DAY)
WHERE pre_start.Id IS NULL
GROUP BY start.Id
HAVING COUNT(subsequent.Id)=30
Untested, and almost certainly needs some conversion for MSSQL, but I think that give some ideas.
How about one using Tally tables? It follows a more algorithmic approach, and execution plan is a breeze. Populate the tallyTable with numbers from 1 to 'MaxDaysBehind' that you want to scan the table (ie. 90 will look for 3 months behind,etc).
declare @ContinousDays int
set @ContinousDays = 30 -- select those that have 30 consecutive days
create table #tallyTable (Tally int)
insert into #tallyTable values (1)
...
insert into #tallyTable values (90) -- insert numbers for as many days behind as you want to scan
select [UserId],count(*),t.Tally from HistoryTable
join #tallyTable as t on t.Tally>0
where [CreationDate]> getdate()-@ContinousDays-t.Tally and
      [CreationDate]<getdate()-t.Tally
group by [UserId],t.Tally
having count(*)>=@ContinousDays
delete #tallyTable
Tweaking Bill's query a bit. You might have to truncate the date before grouping to count only one login per day...
SELECT UserId from History
WHERE CreationDate > ( now() - n )
GROUP BY UserId
HAVING COUNT(DISTINCT DATEADD(dd, DATEDIFF(dd, 0, CreationDate), 0)) >= n
EDITED to use DATEADD(dd, DATEDIFF(dd, 0, CreationDate), 0) instead of convert( char(10) , CreationDate, 101 ).
@IDisposable
I was looking to use datepart earlier, but I was too lazy to look up the syntax, so I figured I'd use convert instead. I didn't know it had a significant impact. Thanks! Now I know.
assuming a schema that goes like:
create table dba.visits
(
id integer not null,
user_id integer not null,
creation_date date not null
);
this will extract contiguous ranges from a date sequence with gaps.
select l.creation_date as start_d, -- Get first date in contiguous range
(
select min(a.creation_date ) as creation_date
from "DBA"."visits" a
left outer join "DBA"."visits" b on
a.creation_date = dateadd(day, -1, b.creation_date ) and
a.user_id = b.user_id
where b.creation_date is null and
a.creation_date >= l.creation_date and
a.user_id = l.user_id
) as end_d -- Get last date in contiguous range
from "DBA"."visits" l
left outer join "DBA"."visits" r on
r.creation_date = dateadd(day, -1, l.creation_date ) and
r.user_id = l.user_id
where r.creation_date is null