SQL to determine minimum sequential days of access?

The following User History table contains one record for every day a given user has accessed a website (in a 24 hour UTC period). It has many thousands of records, but only one record per day per user. If the user has not accessed the website for that day, no record will be generated.
Id UserId CreationDate
------ ------ ------------
750997 12 2009-07-07 18:42:20.723
750998 15 2009-07-07 18:42:20.927
751000 19 2009-07-07 18:42:22.283
What I'm looking for is a SQL query on this table with good performance, that tells me which userids have accessed the website for (n) continuous days without missing a day.
In other words, how many users have (n) records in this table with sequential (day-before, or day-after) dates? If any day is missing from the sequence, the sequence is broken and should restart again at 1; we're looking for users who have achieved a continuous number of days here with no gaps.
Any resemblance between this query and a particular Stack Overflow badge is purely coincidental, of course.. :)

How about (and please make sure the previous statement ended with a semi-colon):
WITH numberedrows
AS (SELECT ROW_NUMBER() OVER (PARTITION BY UserID
ORDER BY CreationDate)
- DATEDIFF(day,'19000101',CreationDate) AS TheOffset,
CreationDate,
UserID
FROM tablename)
SELECT MIN(CreationDate),
MAX(CreationDate),
COUNT(*) AS NumConsecutiveDays,
UserID
FROM numberedrows
GROUP BY UserID,
TheOffset
The idea is that if we have a list of the days (as numbers) and a row_number, then each missed day makes the offset between these two lists slightly bigger. So we're looking for a range that has a consistent offset.
You could use "ORDER BY NumConsecutiveDays DESC" at the end of this, or say "HAVING count(*) > 14" for a threshold...
I haven't tested this though - just writing it off the top of my head. Hopefully works in SQL2005 and on.
...and would be very much helped by an index on tablename(UserID, CreationDate)
Edited: Turns out Offset is a reserved word, so I used TheOffset instead.
Edited: The suggestion to use COUNT(*) is very valid - I should've done that in the first place but wasn't really thinking. Previously it was using datediff(day, min(CreationDate), max(CreationDate)) instead.
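To make the offset trick concrete outside the database, here is a minimal Python sketch of the same idea for a single user. The function name and the sample dates are made up for illustration; it only assumes one visit date per day, as the question states.

```python
from datetime import date
from itertools import groupby


def consecutive_runs(days):
    """Return (first_day, last_day, length) for each run of consecutive days."""
    ordered = sorted(set(days))
    # Row number minus day number stays constant inside a run
    # and changes whenever a day is skipped.
    keyed = [(i - d.toordinal(), d) for i, d in enumerate(ordered, start=1)]
    runs = []
    for _, group in groupby(keyed, key=lambda pair: pair[0]):
        members = [d for _, d in group]
        runs.append((members[0], members[-1], len(members)))
    return runs


# Hypothetical visits: three days in a row, a gap, then two days in a row.
visits = [date(2009, 7, 7), date(2009, 7, 8), date(2009, 7, 9),
          date(2009, 7, 11), date(2009, 7, 12)]
runs = consecutive_runs(visits)
print(runs)
```

Each tuple corresponds to one row of the GROUP BY UserID, TheOffset result for that user.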
Rob

The answer is obviously:
SELECT DISTINCT UserId
FROM UserHistory uh1
WHERE (
SELECT COUNT(*)
FROM UserHistory uh2
WHERE uh2.CreationDate
BETWEEN uh1.CreationDate AND DATEADD(d, @days, uh1.CreationDate)
) = @days OR UserId = 52551
EDIT:
Okay here's my serious answer:
DECLARE @days int
DECLARE @seconds bigint
SET @days = 30
SET @seconds = (@days * 24 * 60 * 60) - 1
SELECT DISTINCT UserId
FROM (
SELECT uh1.UserId, Count(uh1.Id) as Conseq
FROM UserHistory uh1
INNER JOIN UserHistory uh2 ON uh2.CreationDate
BETWEEN uh1.CreationDate AND
DATEADD(s, @seconds, DATEADD(dd, DATEDIFF(dd, 0, uh1.CreationDate), 0))
AND uh1.UserId = uh2.UserId
GROUP BY uh1.Id, uh1.UserId
) as Tbl
WHERE Conseq >= @days
EDIT:
[Jeff Atwood] This is a great fast solution and deserves to be accepted, but Rob Farley's solution is also excellent and arguably even faster (!). Please check it out too!

If you can change the table schema, I'd suggest adding a column LongestStreak to the table which you'd set to the number of sequential days ending to the CreationDate. It's easy to update the table at login time (similar to what you are doing already, if no rows exist of the current day, you'll check if any row exists for the previous day. If true, you'll increment the LongestStreak in the new row, otherwise, you'll set it to 1.)
The query will be obvious after adding this column:
if exists(select * from table
where LongestStreak >= 30 and UserId = @UserId)
-- award the Woot badge.
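The login-time update described above could look roughly like this. It is only a sketch in Python rather than SQL: the dictionary stands in for the table, and `record_login` is a hypothetical name, not something from the original schema.

```python
from datetime import date, timedelta


def record_login(history, user_id, today):
    """history maps (user_id, day) -> length of the streak ending on that day."""
    key = (user_id, today)
    if key in history:
        # A row already exists for today, so nothing changes.
        return history[key]
    # Continue yesterday's streak if there is one, otherwise start again at 1.
    previous = history.get((user_id, today - timedelta(days=1)), 0)
    history[key] = previous + 1
    return history[key]


history = {}
start = date(2009, 7, 7)
streaks = [record_login(history, 12, start + timedelta(days=i)) for i in range(3)]
after_gap = record_login(history, 12, start + timedelta(days=5))
print(streaks, after_gap)
```

The badge check then only needs to look for any stored value at or above the threshold.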

Some nicely expressive SQL along the lines of:
select
userId,
dbo.MaxConsecutiveDates(CreationDate) as blah
from
dbo.Logins
group by
userId
Assuming you have a user defined aggregate function something along the lines of (beware this is buggy):
using System;
using System.Data.SqlTypes;
using Microsoft.SqlServer.Server;
using System.Runtime.InteropServices;
namespace SqlServerProject1
{
[StructLayout(LayoutKind.Sequential)]
[Serializable]
internal struct MaxConsecutiveState
{
public int CurrentSequentialDays;
public int MaxSequentialDays;
public SqlDateTime LastDate;
}
[Serializable]
[SqlUserDefinedAggregate(
Format.Native,
IsInvariantToNulls = true, //optimizer property
IsInvariantToDuplicates = false, //optimizer property
IsInvariantToOrder = false) //optimizer property
]
[StructLayout(LayoutKind.Sequential)]
public class MaxConsecutiveDates
{
/// <summary>
/// The variable that holds the intermediate result of the concatenation
/// </summary>
private MaxConsecutiveState _intermediateResult;
/// <summary>
/// Initialize the internal data structures
/// </summary>
public void Init()
{
_intermediateResult = new MaxConsecutiveState { LastDate = SqlDateTime.MinValue, CurrentSequentialDays = 0, MaxSequentialDays = 0 };
}
/// <summary>
/// Accumulate the next value, not if the value is null
/// </summary>
/// <param name="value"></param>
public void Accumulate(SqlDateTime value)
{
if (value.IsNull)
{
return;
}
int sequentialDays = _intermediateResult.CurrentSequentialDays;
int maxSequentialDays = _intermediateResult.MaxSequentialDays;
DateTime currentDate = value.Value.Date;
if (currentDate.AddDays(-1).Equals(new DateTime(_intermediateResult.LastDate.TimeTicks)))
sequentialDays++;
else
{
maxSequentialDays = Math.Max(sequentialDays, maxSequentialDays);
sequentialDays = 1;
}
_intermediateResult = new MaxConsecutiveState
{
CurrentSequentialDays = sequentialDays,
LastDate = currentDate,
MaxSequentialDays = maxSequentialDays
};
}
/// <summary>
/// Merge the partially computed aggregate with this aggregate.
/// </summary>
/// <param name="other"></param>
public void Merge(MaxConsecutiveDates other)
{
// add stuff for two separate calculations
}
/// <summary>
/// Called at the end of aggregation, to return the results of the aggregation.
/// </summary>
/// <returns></returns>
public SqlInt32 Terminate()
{
int max = Math.Max((int) ((sbyte) _intermediateResult.CurrentSequentialDays), (sbyte) _intermediateResult.MaxSequentialDays);
return new SqlInt32(max);
}
}
}

Seems like you could take advantage of the fact that to be continuous over n days would require there to be n rows.
So something like:
SELECT users.UserId, count(1) as cnt
FROM users
WHERE users.CreationDate > now() - INTERVAL 30 DAY
GROUP BY UserId
HAVING cnt = 30

Doing this with a single SQL query seems overly complicated to me. Let me break this answer down in two parts.
What you should have done until now and should start doing now:
Run a daily cron job that checks, for every user, whether they have logged in today, and then increments a counter if they have or resets it to 0 if they haven't.
What you should do now:
- Export this table to a server that doesn't run your website and won't be needed for a while. ;)
- Sort it by user, then date.
- go through it sequentially, keep a counter...

You could use a recursive CTE (SQL Server 2005+):
WITH recur_date AS (
SELECT t.userid,
t.creationDate,
DATEADD(day, 1, t.creationDate) 'nextDay',
1 'level'
FROM TABLE t
UNION ALL
SELECT t.userid,
t.creationDate,
DATEADD(day, 1, t.creationDate) 'nextDay',
rd.level + 1 'level'
FROM TABLE t
JOIN recur_date rd on t.creationDate = rd.nextDay AND t.userid = rd.userid)
SELECT t.*
FROM recur_date t
WHERE t.level = @numDays
ORDER BY t.userid

Joe Celko has a complete chapter on this in SQL for Smarties (calling it Runs and Sequences). I don't have that book at home, so when I get to work... I'll actually answer this. (assuming history table is called dbo.UserHistory and the number of days is @Days)
Another lead is from SQL Team's blog on runs
The other idea I've had, but don't have a SQL server handy to work on here is to use a CTE with a partitioned ROW_NUMBER like this:
WITH Runs
AS
(SELECT UserID
, CreationDate
, ROW_NUMBER() OVER(PARTITION BY UserId
ORDER BY CreationDate)
- ROW_NUMBER() OVER(PARTITION BY UserId, NoBreak
ORDER BY CreationDate) AS RunNumber
FROM
(SELECT UH.UserID
, UH.CreationDate
, ISNULL((SELECT TOP 1 1
FROM dbo.UserHistory AS Prior
WHERE Prior.UserId = UH.UserId
AND Prior.CreationDate
BETWEEN DATEADD(dd, DATEDIFF(dd, 0, UH.CreationDate), -1)
AND DATEADD(dd, DATEDIFF(dd, 0, UH.CreationDate), 0)), 0) AS NoBreak
FROM dbo.UserHistory AS UH) AS Consecutive
)
SELECT UserID, MIN(CreationDate) AS RunStart, MAX(CreationDate) AS RunEnd
FROM Runs
GROUP BY UserID, RunNumber
HAVING DATEDIFF(dd, MIN(CreationDate), MAX(CreationDate)) >= @Days
The above is likely WAY HARDER than it has to be, but is left as a brain tickle for when you have some other definition of "a run" than just dates.

A couple of SQL Server 2012 options (assuming N=100 below).
;WITH T(UserID, NRowsPrevious)
AS (SELECT UserID,
DATEDIFF(DAY,
LAG(CreationDate, 100)
OVER
(PARTITION BY UserID
ORDER BY CreationDate),
CreationDate)
FROM UserHistory)
SELECT DISTINCT UserID
FROM T
WHERE NRowsPrevious = 100
Though with my sample data the following worked out more efficient
;WITH U
AS (SELECT DISTINCT UserId
FROM UserHistory) /*Ideally replace with Users table*/
SELECT UserId
FROM U
CROSS APPLY (SELECT TOP 1 *
FROM (SELECT
DATEDIFF(DAY,
LAG(CreationDate, 100)
OVER
(ORDER BY CreationDate),
CreationDate)
FROM UserHistory UH
WHERE U.UserId = UH.UserID) T(NRowsPrevious)
WHERE NRowsPrevious = 100) O
Both rely on the constraint stated in the question that there is at most one record per day per user.
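The reasoning behind the LAG approach can be sketched in Python: with at most one row per day, a row that sits exactly k days after the row k positions earlier must be the end of an unbroken run. The function name and the choice of offset are my assumptions. Lagging by n - 1 rows is my reading of "n consecutive days", whereas lagging by 100 rows and requiring a difference of 100 days, as in the queries above, spans 101 calendar days.

```python
from datetime import date, timedelta


def has_streak(days, n):
    """True if the sorted, one-per-day dates contain n consecutive days."""
    ordered = sorted(days)
    offset = n - 1
    # Equivalent of comparing each row with LAG(CreationDate, offset).
    return any((ordered[i] - ordered[i - offset]).days == offset
               for i in range(offset, len(ordered)))


first = date(2012, 1, 1)
# Hypothetical history: five days in a row, then one isolated visit.
days = [first + timedelta(days=i) for i in range(5)] + [first + timedelta(days=10)]
print(has_streak(days, 5), has_streak(days, 6))
```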

If this is so important to you, source this event and drive a table to give you this info. No need to kill the machine with all those crazy queries.

Something like this?
select distinct userid
from table t1, table t2
where t1.UserId = t2.UserId
AND trunc(t1.CreationDate) = trunc(t2.CreationDate) + n
AND (
select count(*)
from table t3
where t1.UserId = t3.UserId
and CreationDate between trunc(t1.CreationDate) and trunc(t1.CreationDate)+n
) = n

I used a simple math property to identify who consecutively accessed the site: the difference in days between the first access and the last access should equal the number of records in the access log table.
Here is the SQL script that I tested in Oracle DB (it should work in other DBs as well):
-- show basic understand of the math properties
select ceil(max (creation_date) - min (creation_date))
max_min_days_diff,
count ( * ) real_day_count
from user_access_log
group by user_id;
-- select all users that have consecutively accessed the site
select user_id
from user_access_log
group by user_id
having ceil(max (creation_date) - min (creation_date))
/ count ( * ) = 1;
-- get the count of all users that have consecutively accessed the site
select count(user_id) user_count
from user_access_log
group by user_id
having ceil(max (creation_date) - min (creation_date))
/ count ( * ) = 1;
Table prep script:
-- create table
create table user_access_log (id number, user_id number, creation_date date);
-- insert seed data
insert into user_access_log (id, user_id, creation_date)
values (1, 12, sysdate);
insert into user_access_log (id, user_id, creation_date)
values (2, 12, sysdate + 1);
insert into user_access_log (id, user_id, creation_date)
values (3, 12, sysdate + 2);
insert into user_access_log (id, user_id, creation_date)
values (4, 16, sysdate);
insert into user_access_log (id, user_id, creation_date)
values (5, 16, sysdate + 1);
insert into user_access_log (id, user_id, creation_date)
values (6, 16, sysdate + 5);

declare @startdate as datetime, @days as int
set @startdate = cast('11 Jan 2009' as datetime) -- The startdate
set @days = 5 -- The number of consecutive days
SELECT userid
,count(1) as [Number of Consecutive Days]
FROM UserHistory
WHERE creationdate >= @startdate
AND creationdate < dateadd(dd, @days, cast(convert(char(11), @startdate, 113) as datetime))
GROUP BY userid
HAVING count(1) >= @days
The statement cast(convert(char(11), @startdate, 113) as datetime) removes the time part of the date so we start at midnight.
I would assume also that the creationdate and userid columns are indexed.
I just realized that this won't tell you all the users and their total consecutive days. But will tell you which users will have been visiting a set number of days from a date of your choosing.
Revised solution:
declare @days as int
set @days = 30
select t1.userid
from UserHistory t1
where (select count(1)
from UserHistory t3
where t3.userid = t1.userid
and t3.creationdate >= DATEADD(dd, DATEDIFF(dd, 0, t1.creationdate), 0)
and t3.creationdate < DATEADD(dd, DATEDIFF(dd, 0, t1.creationdate) + @days, 0)
group by t3.userid
) >= @days
group by t1.userid
I've checked this and it will query for all users and all dates. It is based on Spencer's 1st (joke?) solution, but mine works.
Update: improved the date handling in the second solution.

This should do what you want but I don't have enough data to test efficiency. The convoluted CONVERT/FLOOR stuff is to strip the time portion off the datetime field. If you're using SQL Server 2008 then you could use CAST(x.CreationDate AS DATE).
DECLARE @Range as INT
SET @Range = 10
SELECT DISTINCT UserId, CONVERT(DATETIME, FLOOR(CONVERT(FLOAT, a.CreationDate)))
FROM tblUserLogin a
WHERE EXISTS
(SELECT 1
FROM tblUserLogin b
WHERE a.userId = b.userId
AND (SELECT COUNT(DISTINCT(CONVERT(DATETIME, FLOOR(CONVERT(FLOAT, CreationDate)))))
FROM tblUserLogin c
WHERE c.userid = b.userid
AND CONVERT(DATETIME, FLOOR(CONVERT(FLOAT, c.CreationDate))) BETWEEN CONVERT(DATETIME, FLOOR(CONVERT(FLOAT, a.CreationDate))) and CONVERT(DATETIME, FLOOR(CONVERT(FLOAT, a.CreationDate)))+@Range-1) = @Range)
Creation script
CREATE TABLE [dbo].[tblUserLogin](
[Id] [int] IDENTITY(1,1) NOT NULL,
[UserId] [int] NULL,
[CreationDate] [datetime] NULL
) ON [PRIMARY]

Spencer almost did it, but this should be the working code:
SELECT DISTINCT UserId
FROM History h1
WHERE (
SELECT COUNT(*)
FROM History
WHERE UserId = h1.UserId AND CreationDate BETWEEN h1.CreationDate AND DATEADD(d, @n-1, h1.CreationDate)
) >= @n

Off the top of my head, MySQLish:
SELECT start.UserId
FROM UserHistory AS start
LEFT OUTER JOIN UserHistory AS pre_start ON pre_start.UserId=start.UserId
AND DATE(pre_start.CreationDate)=DATE_SUB(DATE(start.CreationDate), INTERVAL 1 DAY)
LEFT OUTER JOIN UserHistory AS subsequent ON subsequent.UserId=start.UserId
AND DATE(subsequent.CreationDate)<=DATE_ADD(DATE(start.CreationDate), INTERVAL 30 DAY)
WHERE pre_start.Id IS NULL
GROUP BY start.Id
HAVING COUNT(subsequent.Id)=30
Untested, and almost certainly needs some conversion for MSSQL, but I think that give some ideas.

How about one using Tally tables? It follows a more algorithmic approach, and execution plan is a breeze. Populate the tallyTable with numbers from 1 to 'MaxDaysBehind' that you want to scan the table (ie. 90 will look for 3 months behind,etc).
declare @ContinousDays int
set @ContinousDays = 30 -- select those that have 30 consecutive days
create table #tallyTable (Tally int)
insert into #tallyTable values (1)
...
insert into #tallyTable values (90) -- insert numbers for as many days behind as you want to scan
select [UserId],count(*),t.Tally from HistoryTable
join #tallyTable as t on t.Tally>0
where [CreationDate]> getdate()-@ContinousDays-t.Tally and
[CreationDate]<getdate()-t.Tally
group by [UserId],t.Tally
having count(*)>=@ContinousDays
delete #tallyTable

Tweaking Bill's query a bit. You might have to truncate the date before grouping to count only one login per day...
SELECT UserId from History
WHERE CreationDate > ( now() - n )
GROUP BY UserId,
DATEADD(dd, DATEDIFF(dd, 0, CreationDate), 0) AS TruncatedCreationDate
HAVING COUNT(TruncatedCreationDate) >= n
EDITED to use DATEADD(dd, DATEDIFF(dd, 0, CreationDate), 0) instead of convert( char(10) , CreationDate, 101 ).
@IDisposable
I was looking to use datepart earlier, but I was too lazy to look up the syntax, so I figured I'd use convert instead. I didn't know it had a significant impact. Thanks! Now I know.

assuming a schema that goes like:
create table dba.visits
(
id integer not null,
user_id integer not null,
creation_date date not null
);
this will extract contiguous ranges from a date sequence with gaps.
select l.creation_date as start_d, -- Get first date in contiguous range
(
select min(a.creation_date ) as creation_date
from "DBA"."visits" a
left outer join "DBA"."visits" b on
a.creation_date = dateadd(day, -1, b.creation_date ) and
a.user_id = b.user_id
where b.creation_date is null and
a.creation_date >= l.creation_date and
a.user_id = l.user_id
) as end_d -- Get last date in contiguous range
from "DBA"."visits" l
left outer join "DBA"."visits" r on
r.creation_date = dateadd(day, -1, l.creation_date ) and
r.user_id = l.user_id
where r.creation_date is null

Related

Find records created in a period of 24hours (not especially the last 24hrs)

I'm using SQLServer 2012 and I have a table timestamped like that :
Id INT NOT NULL, --PK
IdUser INT NOT NULL --FK from table USER
-- Some other fields
CreationDate DATETIME NOT NULL
This table records some type of action made by the user.
I'm trying to find IF a user did this type of action more than 20 times (ie there is 20 records with the same IdUser in that table) in a period of 24 hours.
The problem is I'm not trying to retrieve the records in the last 24 hours, but the records in a period of 24 hours (from the 1st record to today)
This is what I wrote :
SELECT IdUser
FROM MyTable
WHERE IdUser = 1
AND CreationDate BETWEEN DATEADD(day, -1, GETDATE()) AND GETDATE() -- <= WRONG
But the WHERE clause doesn't fit my needs as I have no idea how to translate "seek 20 records from the user id=1 in a period of 24 hours, not especially the last 24 hours" in SQL
SAMPLE
Let's say our user Id=1 did 154 times this action. So I have 154 records in my table, with the datetime of the record.
IdUser = 1 ; CreationDate = 2016-07-29 12:24:54.590
IdUser = 1 ; CreationDate = 2016-07-29 16:51:55.856
IdUser = 1 ; CreationDate = 2016-07-27 14:12:36.125
(151 omitted rows)
What I'm seeking is whether I can find 20 records within a period of 24 hours for a particular user. In this sample, only the first 2 fall within a 24-hour period. In my case, I'm checking whether there are 20 or more records in such a period.
Could someone help me?
Thank you

A rather painful way to do this (performance-wise) is with a join/group by or apply operator.
The apply should be better from a performance perspective:
select t.*
from t cross apply
(select count(*) as cnt
from t t2
where t2.iduser = t.iduser and
t2.creationdate >= t.creationdate and
t2.creationdate < dateadd(day, 1, t.creationdate)
) t24
where t24.cnt >= 20;
An index on (iduser, creationdate) is a win for this query.
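For intuition, the same "any 24-hour window" check can be done in one pass over a user's sorted timestamps. This Python sketch is not a translation of the query above, just the equivalent sliding-window logic with the same half-open window (start inclusive, start + 24 hours exclusive). The function name is hypothetical and the timestamps are taken from the question's sample, with fractional seconds dropped.

```python
from datetime import datetime, timedelta


def max_in_window(timestamps, window=timedelta(days=1)):
    """Largest number of timestamps that fit in any half-open window of the given size."""
    ordered = sorted(timestamps)
    best = 0
    left = 0
    for right, current in enumerate(ordered):
        # Drop rows from the left until the window starting there covers current.
        while current - ordered[left] >= window:
            left += 1
        best = max(best, right - left + 1)
    return best


sample = [datetime(2016, 7, 29, 12, 24, 54),
          datetime(2016, 7, 29, 16, 51, 55),
          datetime(2016, 7, 27, 14, 12, 36)]
busiest = max_in_window(sample)
print(busiest)
```

A user qualifies when the result is 20 or more.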

Have you tried declaring a DateTime variable and setting the date you want to query?
Also, if you want only 'n' records, you can use the TOP 'n' clause. You can even use TOP n PERCENT to get a percentage. (By the way, TOP has been available since well before SQL Server 2008.)
DECLARE @DATE DATETIME = '20160601' --YYYYMMDD
SELECT TOP 20 IdUser
FROM MyTable
WHERE IdUser = 1
AND CreationDate BETWEEN DATEADD(day, -1, @DATE) AND @DATE

You can pass the value for @starttime as needed
DECLARE @starttime DATETIME
SELECT IdUser
FROM MyTable
WHERE CreationDate BETWEEN @starttime and DATEADD(HOUR, 24, @starttime)
GROUP BY IdUser
HAVING COUNT(CreationDate) > 20

--This will count all records occurring 24 hours after the first instance of a record
SELECT t1.IdUser, T1.CreationDate, COUNT(0) AS InstancesIn24Hrs
FROM MyTable t1
JOIN MyTable t2 ON T1.IdUser = T2.IdUser AND t2.CreationDate
BETWEEN t1.CreationDate
AND
DATEADD(day, 1, t1.CreationDate)
WHERE t1.IdUser = 1
GROUP BY t1.IdUser, T1.CreationDate
ORDER BY T1.CreationDate

Summing up columns from two different tables

I have two different tables FirewallLog and ProxyLog. There is no relation between these two tables. They have four common fields :
LogTime ClientIP BytesSent BytesRec
I need to Calculate the total usage of a particular ClientIP for each day over a period of time (like last month) and display it like below:
Date TotalUsage
2/12 125
2/13 145
2/14 0
. .
. .
3/11 150
3/12 125
TotalUsage is SUM(FirewallLog.BytesSent + FirewallLog.BytesRec) + SUM(ProxyLog.BytesSent + ProxyLog.BytesRec) for that IP. I have to show Zero if there is no usage (no record) for that day.
I need to find the fastest solution to this problem. Any Ideas?

First, create a Calendar table: one that has, at the very least, an id column and a calendar_date column. Fill it with dates covering every day of every year you could ever be interested in. (You'll find that you'll add flags for weekends, bank holidays and all sorts of other useful metadata about dates.)
Then you can LEFT JOIN on to that table, after combining your two tables with a UNION.
SELECT
CALENDAR.calendar_date,
JOINT_LOG.ClientIP,
ISNULL(SUM(JOINT_LOG.BytesSent + JOINT_LOG.BytesRec), 0) AS TotalBytes
FROM
CALENDAR
LEFT JOIN
(
SELECT LogTime, ClientIP, BytesSent, BytesRec FROM FirewallLog
UNION ALL
SELECT LogTime, ClientIP, BytesSent, BytesRec FROM ProxyLog
)
AS JOINT_LOG
ON JOINT_LOG.LogTime >= CALENDAR.calendar_date
AND JOINT_LOG.LogTime < CALENDAR.calendar_date+1
WHERE
CALENDAR.calendar_date >= @start_date
AND CALENDAR.calendar_date < @cease_date
GROUP BY
CALENDAR.calendar_date,
JOINT_LOG.ClientIP
SQL Server is very good at optimising this type of UNION ALL query, assuming that you have appropriate indexes.
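If you want to try the calendar-plus-UNION ALL pattern without a SQL Server instance, here is a small sketch using Python's built-in sqlite3 module. The SQL is in SQLite's dialect, not T-SQL, the calendar is generated with a recursive CTE instead of a permanent table, and the sample rows are invented so that the totals match the figures in the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE FirewallLog (LogTime TEXT, ClientIP TEXT, BytesSent INTEGER, BytesRec INTEGER);
CREATE TABLE ProxyLog    (LogTime TEXT, ClientIP TEXT, BytesSent INTEGER, BytesRec INTEGER);
INSERT INTO FirewallLog VALUES ('2013-02-12 08:00:00', '10.0.0.1', 100, 25);
INSERT INTO ProxyLog    VALUES ('2013-02-13 09:30:00', '10.0.0.1', 100, 20);
INSERT INTO ProxyLog    VALUES ('2013-02-13 17:45:00', '10.0.0.1', 20, 5);
""")

query = """
WITH RECURSIVE calendar(calendar_date) AS (
    SELECT :start
    UNION ALL
    SELECT date(calendar_date, '+1 day') FROM calendar
    WHERE calendar_date < :last
),
joint_log AS (
    SELECT LogTime, ClientIP, BytesSent, BytesRec FROM FirewallLog
    UNION ALL
    SELECT LogTime, ClientIP, BytesSent, BytesRec FROM ProxyLog
)
SELECT c.calendar_date,
       COALESCE(SUM(j.BytesSent + j.BytesRec), 0) AS TotalUsage
FROM calendar AS c
LEFT JOIN joint_log AS j
       ON date(j.LogTime) = c.calendar_date
      AND j.ClientIP = :ip          -- filter on the log side of the join
GROUP BY c.calendar_date
ORDER BY c.calendar_date
"""
params = {"start": "2013-02-12", "last": "2013-02-14", "ip": "10.0.0.1"}
rows = conn.execute(query, params).fetchall()
print(rows)
```

The day with no log rows still appears, with a total of 0, because the calendar drives the join.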

If you don't have a calendar table, you can create one using a recursive CTE:
declare @startdate date = '2013-02-01';
declare @enddate date = '2013-03-01';
with dates as (
select @startdate as thedate
union all
select dateadd(day, 1, thedate)
from dates
where thedate < @enddate
)
select driver.thedate, driver.ClientIP,
coalesce(fwl.FWBytes, 0) + coalesce(pl.PLBytes, 0) as TotalBytes
from (select d.thedate, fwl.ClientIP
from dates d cross join
(select distinct ClientIP from FirewallLog) fwl
) driver left outer join
(select cast(fwl.logtime as date) as thedate, fwl.ClientIP, -- ClientIP is needed for the join below
SUM(fwl.BytesSent + fwl.BytesRec) as FWBytes
from FirewallLog fwl
group by cast(fwl.logtime as date), fwl.ClientIP
) fwl
on driver.thedate = fwl.thedate and driver.clientIP = fwl.ClientIP left outer join
(select cast(pl.logtime as date) as thedate, pl.ClientIP,
SUM(pl.BytesSent + pl.BytesRec) as PLBytes
from ProxyLog pl
group by cast(pl.logtime as date), pl.ClientIP
) pl
on driver.thedate = pl.thedate and driver.ClientIP = pl.ClientIP
This uses a driver table that generates all the combinations of IP and date, which it then uses for joining to the summarized table. This formulation assumes that the "FirewallLog" contains all the "ClientIp"s of interest.
This also breaks out the two values, in case you also want to include them (to see which is contributing more bytes to the total, for instance).

I would recommend creating a Dates Lookup table if that is an option. Create the table once and then you can use it as often as needed. If not, you'll need to look into creating a Recursive CTE to act as the Dates table (easy enough -- look on stackoverflow for examples).
Select d.date,
results.ClientIp,
Sum(results.bytes)
From YourDateLookupTable d
Left Join (
Select ClientIp, logtime, BytesSent + BytesRec bytes From FirewallLog
Union All
Select ClientIp, logtime, BytesSent + BytesRec bytes From ProxyLog
) results On d.date = results.logtime
Group By d.date,
results.ClientIp
This assumes the logtime and date data types are the same. If logtime is a date time, you'll need to convert it to a date.

SQL Count to include zero values

I have created the following stored procedure that is used to count the number of records per day between a specific range for a selected location:
[dbo].[getRecordsCount]
@LOCATION as INT,
@BEGIN as datetime,
@END as datetime
SELECT
ISNULL(COUNT(*), 0) AS counted_leads,
CONVERT(VARCHAR, DATEADD(dd, 0, DATEDIFF(dd, 0, Time_Stamp)), 3) as TIME_STAMP
FROM HL_Logs
WHERE Time_Stamp between @BEGIN and @END and ID_Location = @LOCATION
GROUP BY DATEADD(dd, 0, DATEDIFF(dd, 0, Time_Stamp))
but the problem is that the result does not show the days where there are zero records. I'm pretty sure that it has something to do with my WHERE statement not allowing the zero values to be shown, but I do not know how to overcome this issue.
Thanks in advance
Neil

Not so much the WHERE clause, but the GROUP BY. The query will only return data for rows that exist. That means when you're grouping by the date of the timestamp, only days for which there are rows will be returned. SQL Server can't know from context that you want to "fill in the blanks", and it wouldn't know what with.
The normal answer is a CTE that produces all the days you want to see, thus filling in the blanks. This one's a little tricky because it requires a recursive SQL statement, but it's a well-known trick:
WITH CTE_Dates AS
(
SELECT @BEGIN AS cte_date
UNION ALL
SELECT DATEADD(DAY, 1, cte_date)
FROM CTE_Dates
WHERE DATEADD(DAY, 1, cte_date) <= @END
)
SELECT
cte_date as TIME_STAMP,
ISNULL(COUNT(HL_Logs.Time_Stamp), 0) AS counted_leads
FROM CTE_Dates
LEFT JOIN HL_Logs ON DATEADD(dd, 0, DATEDIFF(dd, 0, Time_Stamp)) = cte_date
AND ID_Location = @LOCATION -- filter in the ON clause so days without rows survive the left join
GROUP BY cte_date
Breaking it down, the CTE uses a union that references itself to recursively add one day at a time to the previous date and remember that date as part of the table. If you ran a simple statement that used the CTE and just selected * from it, you'd see a list of dates between start and end. Then, the statement joins this list of dates to the log table based on the log timestamp date, while preserving dates that have no log entries using the left join (takes all rows from the "left" side whether they have matching rows on the "right" side or not). Finally, we group by date and count instead and we should get the answer you want.
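One detail worth watching with this pattern: a condition on the log table placed in the WHERE clause removes the rows where the left join found no match, so the zero days disappear again. The sketch below shows the difference using Python's sqlite3 module. The `days` table and the sample rows are made up, and the SQL is in SQLite's dialect, not T-SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE days (d TEXT);
INSERT INTO days VALUES ('2020-01-01'), ('2020-01-02'), ('2020-01-03');
CREATE TABLE HL_Logs (Time_Stamp TEXT, ID_Location INTEGER);
INSERT INTO HL_Logs VALUES ('2020-01-01 10:00:00', 1),
                           ('2020-01-01 11:00:00', 1),
                           ('2020-01-03 09:00:00', 2);
""")

# Location filter in WHERE: unmatched days have NULL ID_Location and are dropped.
filter_in_where = conn.execute("""
    SELECT d, COUNT(l.Time_Stamp)
    FROM days LEFT JOIN HL_Logs AS l ON date(l.Time_Stamp) = d
    WHERE l.ID_Location = 1
    GROUP BY d ORDER BY d
""").fetchall()

# Location filter in ON: every day is kept and COUNT ignores the NULLs.
filter_in_on = conn.execute("""
    SELECT d, COUNT(l.Time_Stamp)
    FROM days LEFT JOIN HL_Logs AS l
         ON date(l.Time_Stamp) = d AND l.ID_Location = 1
    GROUP BY d ORDER BY d
""").fetchall()

print(filter_in_where)
print(filter_in_on)
```

Counting a column from the log table, not COUNT(*), is what turns an unmatched day into 0 instead of 1.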

When there is no data to count, there is no row to return.
If you want to include empty days as a 0, you need to create a table (or temporary table, or subquery) to store the days, and left join to your query from that.
eg: something like
SELECT
COUNT(HL_Logs.Time_Stamp) AS counted_leads, -- counts 0 for days with no matching rows
CONVERT(VARCHAR, TableOfDays.Date, 3) as TIME_STAMP
FROM
TableOfDays
left join
HL_Logs
on TableOfDays.Date = convert(date,HL_Logs.Time_Stamp)
and ID_Location = @LOCATION
WHERE TableOfDays.Date between @BEGIN and @END
GROUP BY TableOfDays.Date

Use a left outer join. Such as
select count(stuff_ID), extra_NAME
from dbo.EXTRAS
left outer join dbo.STUFF on stuff_EXTRA = extra_STUFF
group by extra_NAME

I recently had a similar task and used this as a backdrop to my work. However, as explained by robwilliams, I too couldn't get KeithS's solution to work. My task was slightly different (I was doing it by hours rather than days), but I think the solution to neilrudds' question would be
DECLARE @Start as DATETIME
,@End as DATETIME
,@LOCATION AS INT;
WITH CTE_Dates AS
(
SELECT @Start AS cte_date, 0 as 'counted_leads'
UNION ALL
SELECT DATEADD(DAY, 1, cte_date) as cte_date, 0 AS 'counted_leads'
FROM CTE_Dates
WHERE DATEADD(DAY, 1, cte_date) <= @End
)
SELECT cte_date AS 'TIME_STAMP'
,COUNT(HL.ID_Location) AS 'counted_leads'
FROM CTE_Dates
LEFT JOIN HL_Logs AS HL ON CAST(HL.Time_Stamp as date) = CAST(cte_date as date)
AND DATEPART(day, HL.Time_Stamp) = DATEPART(day,cte_date)
AND HL.ID_Location = @LOCATION
group by cte_date
OPTION (MAXRECURSION 0)

group data by any range of 30 days (not by range of dates) in SQL Server

I got a table with a list of transactions.
for the example, lets say it has 4 fields:
ID, UserID, DateAdded, Amount
I would like to run a query that checks whether there was any 30-day period in which a user made transactions totalling 100 or more.
I saw lots of samples of grouping by month or by day, but the problem is that if, for example,
a user made a $50 transaction on 20/4 and another $50 transaction on 5/5, the query should show it (it's $100 or more within a period of 30 days).

I think that this should work (I'm assuming that transactions have a date component, and that a user can have multiple transactions on a single day):
;with DailyTransactions as (
select UserID,DATEADD(day,DATEDIFF(day,0,DateAdded),0) as DateOnly,SUM(Amount) as Amount
from Transactions group by UserID,DATEADD(day,DATEDIFF(day,0,DateAdded),0)
), Numbers as (
select ROW_NUMBER() OVER (ORDER BY object_id) as n from sys.objects
), DayRange as (
select n from Numbers where n between 1 and 29
)
select
dt.UserID,dt.DateOnly as StartDate,MAX(ot.DateOnly) as EndDate, dt.Amount + COALESCE(SUM(ot.Amount),0) as TotalSpend
from
DailyTransactions dt
cross join
DayRange dr
left join
DailyTransactions ot
on
dt.UserID = ot.UserID and
DATEADD(day,dr.n,dt.DateOnly) = ot.DateOnly
group by dt.UserID,dt.DateOnly,dt.Amount
having dt.Amount + COALESCE(SUM(ot.Amount),0) >= 100.00
Okay, I'm using 3 common table expressions. The first (DailyTransactions) is reducing the transactions table to a single transaction per user per day (this isn't necessary if the DateAdded is a date only, and each user has a single transaction per day). The second and third (Numbers and DayRange) are a bit of a cheat - I wanted to have the numbers 1-29 available to me (for use in a DATEADD). There are a variety of ways of creating either a permanent or (as in this case) temporary Numbers table. I just picked one, and then in DayRange, I filter it down to the numbers I need.
Now that we have those available to us, we write the main query. We're querying for rows from the DailyTransactions table, but we want to find later rows in the same table that are within 30 days. That's what the left join to DailyTransactions is doing. It's finding those later rows, of which there may be 0, 1 or more. If it's more than one, we want to add all of those values together, so that's why we need to do a further bit of grouping at this stage. Finally, we can write our having clause, to filter down only to those results where the Amount from a particular day (dt.Amount) + the sum of amounts from later days (SUM(ot.Amount)) meets the criteria you set out.
I based this on a table defined thus:
create table Transactions (
UserID int not null,
DateAdded datetime not null,
Amount decimal (38,2)
)
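As a cross-check on the logic, and not a replacement for the query, here is the same rolling 30-day total for a single user as a short Python sketch. The function name is hypothetical. It treats a window as a start day plus the following 29 days, matching the 1 to 29 range used in DayRange, and it assumes amounts are non-negative.

```python
from datetime import date


def best_window_total(transactions, days=30):
    """transactions: (day, amount) pairs for one user; returns the largest
    total found in any window of `days` consecutive calendar days."""
    ordered = sorted(transactions)
    best = total = 0
    left = 0
    for day, amount in ordered:
        total += amount
        # Evict transactions that fall outside the window ending on `day`.
        while (day - ordered[left][0]).days >= days:
            total -= ordered[left][1]
            left += 1
        best = max(best, total)
    return best


# The example from the question: 50 on 20/4 and 50 on 5/5 (year chosen arbitrarily).
close_together = best_window_total([(date(2012, 4, 20), 50), (date(2012, 5, 5), 50)])
# Exactly 30 days apart, so the two never share a 30-day window.
too_far_apart = best_window_total([(date(2012, 4, 1), 50), (date(2012, 5, 1), 50)])
print(close_together, too_far_apart)
```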

If I understand you correctly, you need a calendar table and then check the sum between date and date+30. So if you want to check a period of 1 year you need to check something like 365 periods.
Here is one way of doing that. The recursive CTE creates the calendar and the cross apply calculates the sum for each CalDate between CalDate and CalDate+30.
declare @T table(ID int, UserID int, DateAdded datetime, Amount money)
insert into @T values(1, 1, getdate(), 50)
insert into @T values(2, 1, getdate()-29, 60)
insert into @T values(4, 2, getdate(), 40)
insert into @T values(5, 2, getdate()-29, 50)
insert into @T values(7, 3, getdate(), 70)
insert into @T values(8, 3, getdate()-30, 80)
insert into @T values(9, 4, getdate()+50, 50)
insert into @T values(10,4, getdate()+51, 50)
declare @FromDate datetime
declare @ToDate datetime
select
@FromDate = min(dateadd(d, datediff(d, 0, DateAdded), 0)),
@ToDate = max(dateadd(d, datediff(d, 0, DateAdded), 0))
from @T
;with cal as
(
select @FromDate as CalDate
union all
select CalDate + 1
from cal
where CalDate < @ToDate
)
select S.UserID
from cal as C
cross apply
(select
T.UserID,
sum(Amount) as Amount
from @T as T
where T.DateAdded between CalDate and CalDate + 30
group by T.UserID) as S
where S.Amount >= 100
group by S.UserID
option (maxrecursion 0)

Identify full vs half yearly datasets in SQL

I have a table with two fields of interest for this particular exercise: a CHAR(3) ID and a DATETIME. The ID identifies the submitter of the data - several thousand rows. The DATETIME is not necessarily unique, either. (The primary keys are other fields of the table.)
Data for this table is submitted every six months. In December, we receive July-December data from each submitter, and in June we receive July-June data. My task is to write a script that identifies people who have only submitted half their data, or only submitted January-June data in June.
...Does anyone have a solution?

For interest, this is what I wound up using. It was based off Stephen's answer, but with a few adaptations.
It's part of a larger script that runs every six months, but we only perform this check every twelve months, hence the "IF @FullYear = 1". I'm sure there's a more stylish way to identify the boundary dates, but this seems to work.
IF @FullYear = 1
BEGIN
    DECLARE @FirstDate AS DATETIME
    DECLARE @LastDayFirstYear AS DATETIME
    DECLARE @SecondYear AS INT
    DECLARE @NewYearsDay AS DATETIME
    DECLARE @LastDate AS DATETIME
    SELECT @FirstDate = MIN(dscdate), @LastDate = MAX(dscdate)
    FROM TheTable
    SELECT @SecondYear = DATEPART(yyyy, @FirstDate) + 1
    SELECT @NewYearsDay = CAST(CAST(@SecondYear AS VARCHAR)
        + '-01-01' AS DATETIME)
    INSERT INTO #AuditResults
    SELECT DISTINCT
        'Submitter missing Jan-Jun data', t.id
    FROM TheTable t
    WHERE
        EXISTS (
            SELECT 1
            FROM TheTable t1
            WHERE t.id = t1.id
            AND t1.date >= @FirstDate
            AND t1.date < @NewYearsDay )
        AND NOT EXISTS (
            SELECT 1
            FROM TheTable t2
            WHERE t2.date >= @NewYearsDay
            AND t2.date <= @LastDate
            AND t2.id = t.id
            GROUP BY t2.id )
    GROUP BY t.id
END

From your description, I wouldn't worry about the efficiency of the query since apparently it only needs to run twice a year!
There are a few ways to do this; which one is 'best' depends on the data that you have. The datediff (on max/min date values) you suggested should work. Another option is to just count the records for each submitter within each date range, e.g.
select * from (
  select T.submitterId,
    -- No GROUP BY inside the subqueries: a bare count(*) returns 0 when nothing
    -- matches, while a grouped count returns no row (NULL), which the filter misses
    (select count(*)
     from TABLE T1
     where T1.datefield between [july] and [december]
       and T1.submitterId = T.submitterId) as JDCount,
    (select count(*)
     from TABLE T2
     where T2.datefield between [december] and [june]
       and T2.submitterId = T.submitterId) as DJCount
  from TABLE T) X
where X.JDCount <= 0 OR X.DJCount <= 0
Caveat: untested query off the top of my head; your mileage may vary.
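For what it's worth, the same per-range counting idea can be sketched with conditional aggregation in a single pass. The example below uses Python's standard-library sqlite3 module purely so that it is runnable; the Submissions table, the boundary dates and the function name are assumptions made up for illustration, and dates are stored as ISO strings so plain comparison works. Note that a submitter with no rows at all in the table will not show up in the result.

```python
import sqlite3

# Hypothetical sample data: (submitterId, datefield)
sample_rows = [
    ("AAA", "2023-08-15"), ("AAA", "2024-03-02"),   # both halves present
    ("BBB", "2023-09-01"), ("BBB", "2023-11-20"),   # July-December only
    ("CCC", "2024-02-10"),                          # January-June only
]

def missing_half(rows, h1_start, h2_start, period_end):
    """Return (submitterId, JDCount, JJCount) for submitters with no rows
    in one of the two half-year ranges [h1_start, h2_start) and
    [h2_start, period_end)."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE Submissions ("
        "submitterId TEXT NOT NULL, datefield TEXT NOT NULL)"
    )
    con.executemany("INSERT INTO Submissions VALUES (?, ?)", rows)
    query = """
        SELECT submitterId, JDCount, JJCount
        FROM (SELECT submitterId,
                     SUM(CASE WHEN datefield >= :h1 AND datefield < :h2
                              THEN 1 ELSE 0 END) AS JDCount,
                     SUM(CASE WHEN datefield >= :h2 AND datefield < :stop
                              THEN 1 ELSE 0 END) AS JJCount
              FROM Submissions
              GROUP BY submitterId) AS X
        WHERE JDCount = 0 OR JJCount = 0
        ORDER BY submitterId
    """
    params = {"h1": h1_start, "h2": h2_start, "stop": period_end}
    result = con.execute(query, params).fetchall()
    con.close()
    return result

print(missing_half(sample_rows, "2023-07-01", "2024-01-01", "2024-07-01"))
```

Half-open ranges are used so that a row dated exactly on a boundary is counted in only one half.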

I later realised that I was supposed to check that there was data for both July to December and January to June. So this is what I wound up with in v2:
SELECT @avgmonths = AVG(x.[count])
FROM ( SELECT CAST(COUNT(DISTINCT DATEPART(month,
                         DATEADD(month,
                                 DATEDIFF(month, 0, dscdate),
                                 0))) AS FLOAT) AS [count]
       FROM HospDscDate
       GROUP BY hosp
     ) x
IF @avgmonths > 7
    SET @months = 12
ELSE
    SET @months = 6
SELECT 'Submitter missing data for some months' AS [WarningType],
       t.id
FROM TheTable t
WHERE EXISTS ( SELECT 1
               FROM TheTable t1
               WHERE t.id = t1.id
               HAVING COUNT(DISTINCT DATEPART(month,
                            DATEADD(month, DATEDIFF(month, 0, t1.Date), 0))) < @months )
GROUP BY t.id
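A minimal sketch of the distinct-months check on its own, again in Python with the standard-library sqlite3 module so it can be run as-is. It counts distinct year-month pairs rather than month numbers, which is my own variation: it keeps, say, July 2023 and July 2024 from collapsing into one month if the table ever spans more than a year. The sample rows, the function name and the use of SQLite's strftime() are assumptions for illustration.

```python
import sqlite3

# Hypothetical sample data: (id, dscdate)
sample_rows = [
    ("AAA", "2023-07-03"), ("AAA", "2023-08-14"), ("AAA", "2023-09-09"),
    ("AAA", "2023-10-21"), ("AAA", "2023-11-02"), ("AAA", "2023-12-30"),
    ("BBB", "2023-07-05"), ("BBB", "2023-08-01"), ("BBB", "2023-08-19"),
    ("BBB", "2023-10-10"),          # BBB covers only 3 distinct months
]

def submitters_short_of(rows, expected_months):
    """Return (id, months) for submitters whose rows cover fewer than
    expected_months distinct calendar months."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE TheTable (id TEXT NOT NULL, dscdate TEXT NOT NULL)"
    )
    con.executemany("INSERT INTO TheTable VALUES (?, ?)", rows)
    query = """
        SELECT id,
               COUNT(DISTINCT strftime('%Y-%m', dscdate)) AS months
        FROM TheTable
        GROUP BY id
        HAVING COUNT(DISTINCT strftime('%Y-%m', dscdate)) < ?
        ORDER BY id
    """
    result = con.execute(query, (expected_months,)).fetchall()
    con.close()
    return result

print(submitters_short_of(sample_rows, 6))
```

Passing 6 or 12 as expected_months plays the role of @months in the script above.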


SQL to determine minimum sequential days of access? - sql

Seems like you could take advantage of the fact that being continuous over n days requires there to be n rows. So something like:
SELECT users.UserId, count(1) as cnt
FROM users
WHERE users.CreationDate > now() - INTERVAL 30 DAY
GROUP BY UserId
HAVING cnt = 30

If this is so important to you, source this event and drive a table to give you this info. No need to kill the machine with all those crazy queries.

Something like this?
select distinct userid
from table t1, table t2
where t1.UserId = t2.UserId
  AND trunc(t1.CreationDate) = trunc(t2.CreationDate) + n
  AND (
    select count(*)
    from table t3
    where t1.UserId = t3.UserId
      and CreationDate between trunc(t1.CreationDate) and trunc(t1.CreationDate)+n
  ) = n

Spencer almost did it, but this should be the working code:
SELECT DISTINCT UserId
FROM History h1
WHERE (
    SELECT COUNT(*)
    FROM History
    WHERE UserId = h1.UserId
      AND CreationDate BETWEEN h1.CreationDate AND DATEADD(d, @n-1, h1.CreationDate)
) >= @n
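To make the counting idea concrete, here is a rough runnable sketch in Python with the standard-library sqlite3 module. It relies on the question's guarantee of at most one record per user per day, so n rows inside an n-day span means no day was missed. Unlike the query above, it truncates timestamps to calendar dates before comparing, since access times differ from day to day; that truncation, the sample rows and the function name are my assumptions, and SQLite's date() stands in for the T-SQL date functions.

```python
import sqlite3

# Hypothetical sample data: (UserId, CreationDate)
sample_rows = [
    (12, "2009-07-07 18:42:20.723"),
    (12, "2009-07-08 09:15:00.000"),
    (12, "2009-07-09 23:01:10.500"),   # user 12: three days in a row
    (15, "2009-07-07 18:42:20.927"),
    (15, "2009-07-09 08:00:00.000"),
    (15, "2009-07-10 08:30:00.000"),   # user 15: gap on 2009-07-08
]

def users_with_streak(rows, n):
    """Return UserIds that have a record on each of n consecutive days,
    assuming at most one record per user per day."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE History ("
        "UserId INTEGER NOT NULL, CreationDate TEXT NOT NULL)"
    )
    con.executemany("INSERT INTO History VALUES (?, ?)", rows)
    query = """
        SELECT DISTINCT h1.UserId
        FROM History AS h1
        WHERE (SELECT COUNT(*)
               FROM History AS h2
               WHERE h2.UserId = h1.UserId
                 AND date(h2.CreationDate)
                     BETWEEN date(h1.CreationDate)
                         AND date(h1.CreationDate, ?)
              ) >= ?
        ORDER BY h1.UserId
    """
    # The window runs from the anchor day through n-1 days later, inclusive.
    result = [r[0] for r in con.execute(query, (f"+{n - 1} days", n))]
    con.close()
    return result

print(users_with_streak(sample_rows, 3))
```

With many thousands of rows the correlated subquery runs once per record, so an index on (UserId, CreationDate) would likely matter as much here as for the ROW_NUMBER approach.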
