Complex Query Involving Search for Contiguous Dates (by Month) - sql

I have a table that contains a list of accounts by month along with a field that indicates activity. I want to search through to find when an account has "died", based on the following criteria:
the account had consistent activity for a contiguous period of months
the account had a spike of activity on a final month (spike = 200% or more of average of all previous contiguous months of activity)
the month immediately following the spike of activity and the next 12 months all had 0 activity
So the table might look something like this:
ID | Date | Activity
1 | 1/1/2010 | 2
2 | 1/1/2010 | 3.2
1 | 2/3/2010 | 3
2 | 2/3/2010 | 2.7
1 | 3/2/2010 | 8
2 | 3/2/2010 | 9
1 | 4/6/2010 | 0
2 | 4/6/2010 | 0
1 | 5/2/2010 | 0
2 | 5/2/2010 | 2
So in this case both accounts 1 and 2 have activity in months Jan - Mar. Both accounts exhibit a spike of activity in March. Both accounts have 0 activity in April. Account 2 has activity again in May, but account 1 does not. Therefore, my query should return Account 1, but not Account 2. I would want to see this as my query result:
ID | Last Date
1 | 3/2/2010
I realize this is a complicated question and I'm not expecting anyone to write the whole query for me. The current best approach I can think of is to create a series of sub-queries and join them, but I don't even know what the subqueries would look like. For example: how do I look for a contiguous series of rows for a single ID where activity is all 0 (or all non-zero?).
My fall-back if the SQL is simply too involved is to use a brute-force search using Java where I would first find all unique IDs, and then for each unique ID iterate across the months to determine if and when the ID "died".
Once again: any help to move in the right direction is very much appreciated.

Processing in Java, or partially processing in SQL, and finishing the processing in Java is a good approach.
I'm not going to tackle how to define a spike.
I will suggest that you start with condition 3. It's easy to find the last non-zero value. That is the row you then want to test for a spike, and for consistent data before the spike.
SELECT out.*
FROM monthly_activity out
LEFT OUTER JOIN monthly_activity comp
ON out.ID = comp.ID AND out.Date < comp.Date AND comp.Activity <> 0
WHERE comp.Date IS NULL
Not bad, but you don't want a row that qualifies only because it is the last record for the account, so instead,
SELECT out.*
FROM monthly_activity out
INNER JOIN monthly_activity comp
ON out.ID = comp.ID AND out.Date < comp.Date AND comp.Activity = 0
GROUP BY out.ID
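The application-side approach mentioned at the top of this answer could look roughly like the Python sketch below (the same loop translates directly to Java). It is an illustration under stated assumptions, not a definitive implementation: the names `find_dead_accounts` and `spike_factor` are made up here, months with no row are treated as zero activity, and only the last non-zero month of each account is tested, so an account that died and was later revived is not reported, and the 12-month window is not measured against today's date.

```python
from collections import defaultdict
from datetime import date


def find_dead_accounts(rows, spike_factor=2.0):
    """Map account id -> date of the final spike, for accounts that "died".

    rows is an iterable of (account_id, activity_date, activity) tuples.
    """
    history = defaultdict(list)
    for account_id, activity_date, activity in rows:
        history[account_id].append((activity_date, activity))

    dead = {}
    for account_id, months in history.items():
        months.sort()
        # Condition 3: the candidate is the last month with non-zero activity,
        # so every later row for this account is zero by construction.
        active = [i for i, (_, activity) in enumerate(months) if activity > 0]
        if not active:
            continue
        last = active[-1]
        # Condition 1: walk back over the contiguous non-zero months before it.
        start = last
        while start > 0 and months[start - 1][1] > 0:
            start -= 1
        earlier = [activity for _, activity in months[start:last]]
        if not earlier:
            continue  # nothing to compare the final month against
        # Condition 2: final month >= spike_factor * average of earlier months.
        if months[last][1] >= spike_factor * sum(earlier) / len(earlier):
            dead[account_id] = months[last][0]
    return dead


# Sample data from the question.
sample = [
    (1, date(2010, 1, 1), 2), (2, date(2010, 1, 1), 3.2),
    (1, date(2010, 2, 3), 3), (2, date(2010, 2, 3), 2.7),
    (1, date(2010, 3, 2), 8), (2, date(2010, 3, 2), 9),
    (1, date(2010, 4, 6), 0), (2, date(2010, 4, 6), 0),
    (1, date(2010, 5, 2), 0), (2, date(2010, 5, 2), 2),
]

print(find_dead_accounts(sample))  # {1: datetime.date(2010, 3, 2)}
```

Account 2 is skipped because its last non-zero month (May) has no non-zero month directly before it.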

Probably not the world's most efficient code, but I think this does what you're after:
declare @t table (AccountId int, ActivityDate date, Activity float)
insert @t
select 1, '2010-01-01', 2
union select 2, '2010-01-01', 3.2
union select 1, '2010-02-03', 3
union select 2, '2010-02-03', 2.7
union select 1, '2010-03-02', 8
union select 2, '2010-03-02', 9
union select 1, '2010-04-06', 0
union select 2, '2010-04-06', 0
union select 1, '2010-05-02', 0
union select 2, '2010-05-02', 2
select AccountId, ActivityDate LastActivityDate --, Activity
from @t a
where
    --Part 2 --select only where the activity is a peak
    Activity >= isnull
    (
        (
            select 2 * avg(c.Activity)
            from @t c
            where c.AccountId = a.AccountId --compare against the same account, not a hard-coded id
            and c.ActivityDate >= isnull
            (
                (
                    select max(d.ActivityDate)
                    from @t d
                    where d.AccountId = c.AccountId
                    and d.ActivityDate < c.ActivityDate
                    and d.Activity = 0
                )
                ,
                (
                    select min(e.ActivityDate)
                    from @t e
                    where e.AccountId = c.AccountId
                )
            )
            and c.ActivityDate < a.ActivityDate
        )
        , Activity + 1 --Part 1 (i.e. if no activity before today don't include the result)
    )
    --Part 3
    and not exists --select only dates which have had no activity for the following 12 months on the same account (assumption: count no record as no activity / also ignore current date in this assumption)
    (
        select 1
        from @t b
        where a.AccountId = b.AccountId
        and b.Activity > 0
        and b.ActivityDate between dateadd(DAY, 1, a.ActivityDate) and dateadd(YEAR, 1, a.ActivityDate)
    )

SQL Group By + Count with multiple tables

I'm studying for an interview next week which has a small data analysis component. The recruiter gave me the following sample SQL question, and I'm having trouble wrapping my mind around a solution. I hope I'm not biting off more than I can chew ;)
SAMPLE QUESTION:
You are given two tables:
AdClick Table (columns: ClickID, AdvertiserID, UserID, and other
fields) and AdConversion Table (columns: ClickID, UserID and other
fields).
You have to find the total conversion rate (# of conversions/# of
clicks) for users with 1 click, 2 click etc.
I've been playing with this for about an hour and keep hitting roadblocks. I understand COUNT and GROUP BY but suspect I'm missing a simple SQL feature that I'm unaware of. Not knowing the magic keywords to search on also makes it difficult to find pointers or solutions via Google.
Example Input
dbo.AdConversion
----------------
ClickID UserID
1 1
2 1
4 1
5 3
6 2
7 2
12 1
9 4
10 4
dbo.AdClick
-----------
ClickID AdvertiserID UserID
1 1 1
2 2 1
3 1 2
4 1 1
5 1 3
6 2 2
7 3 2
8 1 1
9 4 4
10 2 4
11 3 4
12 2 1
Expected Result:
----------------
UserClickCount ConversionRate
4 80.00%
2 66.67%
1 100.00%
Explanation/Clarification:
Users with 4 AdConversion.ClickIDs (aka Conversions) have an 80% conversion rate.
Here there's just one user, UserID 1, which has 5 AdClicks with 4 AdConversions.
Users with 2 Conversions have a combined 6 AdClicks with 4 conversions for a conversion rate of 66.67%. Here, that'd be UserID 2 and 4.
Users with 1 Conversion, here only UserID 3, have 1 conversion against 1 AdClick for a 100% conversion rate.
Here's one possible solution I've come up with after some direction from Zack's comment. I doubt it's the ideal solution, and I don't know whether it has bugs in it:
DECLARE @Conversions TABLE
(
    UserID int NOT NULL,
    AdConversions int
)
INSERT INTO @Conversions (UserID, AdConversions)
SELECT adc.UserID, COUNT(adc.UserID)
FROM dbo.AdConversion adc
GROUP BY adc.UserID;
DECLARE @Clicks TABLE
(
    UserID int NOT NULL,
    AdClicks int
)
INSERT INTO @Clicks (UserID, AdClicks)
SELECT UserID, COUNT(ClickID)
FROM dbo.AdClick
GROUP BY UserID;
SELECT co.AdConversions, CONVERT(decimal(6,3), (CAST(SUM(co.AdConversions) AS float) / SUM(cl.AdClicks))) * 100
FROM @Conversions co
INNER JOIN @Clicks cl
ON co.UserID = cl.UserID
GROUP BY co.AdConversions;
Any advice would be greatly appreciated!
Thanks,
Michael
Your logic seems good. Here is a version with common table expressions and a little update with the numeric conversion:
WITH tConversions as
(SELECT UserID, COUNT(ClickID) as AdConversions
FROM AdConversion
GROUP BY UserID),
tClicks as
(SELECT UserID, COUNT(ClickID) as AdClicks
FROM AdClick
GROUP BY UserID)
SELECT co.AdConversions, CONVERT(decimal(10,2),CAST(SUM(co.AdConversions) as float) / SUM(cl.AdClicks) * 100) as ConversionRate
FROM tConversions co
INNER JOIN tClicks cl
ON co.UserID = cl.UserID
GROUP BY co.AdConversions
You can also use subqueries directly:
SELECT co.AdConversions, CONVERT(decimal(10,2),CAST(SUM(co.AdConversions) as float) / SUM(cl.AdClicks) * 100) as ConversionRate
FROM
(SELECT UserID, COUNT(ClickID) as AdConversions
FROM AdConversion
GROUP BY UserID)
as co
INNER JOIN
(SELECT UserID, COUNT(ClickID) as AdClicks
FROM AdClick
GROUP BY UserID)
as cl
ON co.UserID = cl.UserID
GROUP BY co.AdConversions
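To check the subquery version against the sample data, here is a small sketch that runs it through Python's built-in sqlite3 module. SQLite is used only so the example is self-contained; the `ROUND(100.0 * ...)` expression stands in for the T-SQL `CONVERT`/`CAST` pair, and the `ORDER BY` is added here just to make the output order predictable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE AdClick (ClickID INTEGER, AdvertiserID INTEGER, UserID INTEGER)")
conn.execute("CREATE TABLE AdConversion (ClickID INTEGER, UserID INTEGER)")

# Sample rows from the question.
conn.executemany(
    "INSERT INTO AdClick VALUES (?, ?, ?)",
    [(1, 1, 1), (2, 2, 1), (3, 1, 2), (4, 1, 1), (5, 1, 3), (6, 2, 2),
     (7, 3, 2), (8, 1, 1), (9, 4, 4), (10, 2, 4), (11, 3, 4), (12, 2, 1)],
)
conn.executemany(
    "INSERT INTO AdConversion VALUES (?, ?)",
    [(1, 1), (2, 1), (4, 1), (5, 3), (6, 2), (7, 2), (12, 1), (9, 4), (10, 4)],
)

query = """
SELECT co.AdConversions,
       ROUND(100.0 * SUM(co.AdConversions) / SUM(cl.AdClicks), 2) AS ConversionRate
FROM (SELECT UserID, COUNT(ClickID) AS AdConversions
      FROM AdConversion
      GROUP BY UserID) AS co
INNER JOIN (SELECT UserID, COUNT(ClickID) AS AdClicks
            FROM AdClick
            GROUP BY UserID) AS cl
        ON co.UserID = cl.UserID
GROUP BY co.AdConversions
ORDER BY co.AdConversions DESC
"""

rates = conn.execute(query).fetchall()
for conversions, rate in rates:
    print(conversions, rate)
```

One thing to keep in mind: because of the INNER JOIN, users who clicked but never converted do not appear in `co` at all, so a "0 conversions" bucket will never show up in the result.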

Query to find Cumulative while subtracting other counts

Here is my table structure
Id INT
RecId INT
Dated DATETIME
Status INT
and here is my data.
Status table (contains different statuses)
Id Status
1 Created
2 Assigned
Log table (contains logs for the different statuses that a record went through (RecId))
Id RecId Dated Status
1 1 2013-12-09 14:16:31.930 1
2 7 2013-12-09 14:27:26.620 1
3 1 2013-12-09 14:27:26.620 2
3 8 2013-12-10 11:14:13.747 1
3 9 2013-12-10 11:14:13.747 1
3 8 2013-12-10 11:14:13.747 2
What I need is to generate a report from this data in the following format.
Dated Created Assigned
2013-12-09 2 1
2013-12-10 3 1
Here the rows are calculated per date. Created is calculated as (previous date's Created count - previous date's Assigned count) + today's Created count.
For example, on 2013-12-10 three entries were made to the log table, two with the status Created and one with the status Assigned. So in the view that I want to build for the report, the row for 2013-12-10 will return Created as 2 + 1 = 3, where 2 is the number of newly inserted Created records in the log table and 1 is the previous day's remaining record count (Created - Assigned = 2 - 1).
I hope the scenario is clear. Please ask me if further information is required.
Please help me with the sql to construct the above view.
This matches the expected result for the provided sample, but may require more testing.
with CTE as (
select
*
, row_number() over(order by dt ASC) as rn
from (
select
cast(created.dated as date) as dt
, count(created.status) as Created
, count(Assigned.status) as Assigned
, count(created.status)
- count(Assigned.status) as Delta
from LogTable created
left join LogTable assigned
on created.RecId = assigned.RecId
and created.status = 1
and assigned.Status = 2
and created.Dated <= assigned.Dated
where created.status = 1
group by
cast(created.dated as date)
) x
)
select
dt.dt
, dt.created + coalesce(nxt.delta,0) as created
, dt.assigned
from CTE dt
left join CTE nxt on dt.rn = nxt.rn+1
;
Result:
| DT | CREATED | ASSIGNED |
|------------|---------|----------|
| 2013-12-09 | 2 | 1 |
| 2013-12-10 | 3 | 1 |
See this SQLFiddle demo
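The rule in the question is recursive: each day's Created figure carries forward whatever was left over (Created minus Assigned) from the previous reported day. A minimal Python sketch of that reading is below. The function name `daily_report` is hypothetical, and it counts an Assigned row on the day it was logged, so it may not agree with the query above when a record is assigned on a later day than it was created.

```python
from collections import Counter

CREATED, ASSIGNED = 1, 2  # status ids from the Status table


def daily_report(log_rows):
    """log_rows: iterable of (rec_id, dated, status); dated is 'YYYY-MM-DD hh:mm:ss'."""
    created, assigned = Counter(), Counter()
    for _rec_id, dated, status in log_rows:
        day = dated[:10]  # keep only the date part
        if status == CREATED:
            created[day] += 1
        elif status == ASSIGNED:
            assigned[day] += 1

    report, carried = [], 0
    for day in sorted(set(created) | set(assigned)):
        open_created = carried + created[day]       # leftovers + today's new records
        report.append((day, open_created, assigned[day]))
        carried = open_created - assigned[day]      # what remains for the next day
    return report


# Log rows from the question (the Id column is omitted).
log = [
    (1, "2013-12-09 14:16:31.930", 1),
    (7, "2013-12-09 14:27:26.620", 1),
    (1, "2013-12-09 14:27:26.620", 2),
    (8, "2013-12-10 11:14:13.747", 1),
    (9, "2013-12-10 11:14:13.747", 1),
    (8, "2013-12-10 11:14:13.747", 2),
]

for row in daily_report(log):
    print(row)
# ('2013-12-09', 2, 1)
# ('2013-12-10', 3, 1)
```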

Count number of occurrences for each unique value [duplicate]

Basically I have a table similar to this:
time.....activities.....length
13:00........3.............1
13:15........2.............2
13:00........3.............2
13:30........1.............1
13:45........2.............3
13:15........5.............1
13:45........1.............3
13:15........3.............1
13:45........3.............2
13:45........1.............1
13:15........3.............3
A couple of notes:
Activities can be between 1 and 5
Length can be between 1 and 3
The query should return:
time........count
13:00.........2
13:15.........2
13:30.........0
13:45.........1
Basically for each unique time I want a count of the number of rows where the activities value is 3.
So then I can say:
At 13:00 there were X amount of activity 3s.
At 13:45 there were Y amount of activity 3s.
Then I want a count for activity 1s,2s,4s and 5s. so I can plot the distribution for each unique time.
Yes, you can use GROUP BY:
SELECT time,
activities,
COUNT(*)
FROM table
GROUP BY time, activities;
select time, coalesce(count(case when activities = 3 then 1 end), 0) as count
from MyTable
group by time
SQL Fiddle Example
Output:
| TIME | COUNT |
-----------------
| 13:00 | 2 |
| 13:15 | 2 |
| 13:30 | 0 |
| 13:45 | 1 |
If you want to count all the activities in one query, you can do:
select time,
coalesce(count(case when activities = 1 then 1 end), 0) as count1,
coalesce(count(case when activities = 2 then 1 end), 0) as count2,
coalesce(count(case when activities = 3 then 1 end), 0) as count3,
coalesce(count(case when activities = 4 then 1 end), 0) as count4,
coalesce(count(case when activities = 5 then 1 end), 0) as count5
from MyTable
group by time
The advantage of this over grouping by activities is that it will return a count of 0 even if there are no activities of that type for that time segment.
Of course, this will not return rows for time segments with no activities of any type. If you need that, you'll need to use a left join with a table that lists all the possible time segments.
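The left join mentioned above can be sketched as follows. It uses Python's built-in sqlite3 module only so the example is runnable as-is; the `TimeSlots` table and the extra 14:00 slot are assumptions added for illustration and are not part of the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE MyTable (time TEXT, activities INTEGER, length INTEGER)")
conn.executemany(
    "INSERT INTO MyTable VALUES (?, ?, ?)",
    [("13:00", 3, 1), ("13:15", 2, 2), ("13:00", 3, 2), ("13:30", 1, 1),
     ("13:45", 2, 3), ("13:15", 5, 1), ("13:45", 1, 3), ("13:15", 3, 1),
     ("13:45", 3, 2), ("13:45", 1, 1), ("13:15", 3, 3)],
)

# Hypothetical lookup table listing every time segment that should be reported,
# including 14:00, which has no rows in MyTable at all.
conn.execute("CREATE TABLE TimeSlots (time TEXT PRIMARY KEY)")
conn.executemany(
    "INSERT INTO TimeSlots VALUES (?)",
    [("13:00",), ("13:15",), ("13:30",), ("13:45",), ("14:00",)],
)

counts = conn.execute("""
    SELECT s.time,
           COUNT(CASE WHEN m.activities = 3 THEN 1 END) AS count3
    FROM TimeSlots s
    LEFT JOIN MyTable m ON m.time = s.time
    GROUP BY s.time
    ORDER BY s.time
""").fetchall()

for slot, count3 in counts:
    print(slot, count3)
# 13:00 2
# 13:15 2
# 13:30 0
# 13:45 1
# 14:00 0
```

COUNT ignores the NULL produced by the CASE expression for unmatched or non-matching rows, which is why the empty slot reports 0 rather than disappearing.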
If I am understanding your question correctly, would this work? (You will have to substitute your actual column and table names.)
SELECT time_col, COUNT(time_col) As Count
FROM time_table
GROUP BY time_col
WHERE activity_col = 3
WHERE has to come before GROUP BY, so you should change the query to:
SELECT time_col, COUNT(time_col) As Count
FROM time_table
WHERE activity_col = 3
GROUP BY time_col
This will work correctly.

How to find range of a number where the ranges come dynamically from another table?

If I had two tables:
PersonID | Count
-----------------
1 | 45
2 | 5
3 | 120
4 | 87
5 | 60
6 | 200
7 | 31
SizeName | LowerLimit
-----------------
Small | 0
Medium | 50
Large | 100
I'm trying to figure out how to do a query to get a result similar to:
PersonID | SizeName
-----------------
1 | Small
2 | Small
3 | Large
4 | Medium
5 | Medium
6 | Large
7 | Small
Basically, one table specifies an unknown number of "range names" and their integer ranges associated. So a count range of 0 to 49 from the person table gets a 'small' designation. 50-99 gets 'medium' etc. But I need it to be dynamic because I do not know the range names or integer values. Can I do this in a single query or would I have to write a separate function to loop through the possibilities?
Try this out:
SELECT PersonID, SizeName
FROM
(
SELECT
PersonID,
(SELECT MAX([LowerLimit]) FROM dbo.[Size] WHERE [LowerLimit] <= [COUNT]) As LowerLimit
FROM dbo.Person
) A
INNER JOIN dbo.[SIZE] B ON A.LowerLimit = B.LowerLimit
With Ranges As
(
Select 'Small' As Name, 0 As LowerLimit
Union All Select 'Medium', 50
Union All Select 'Large', 100
)
, Person As
(
Select 1 As PersonId, 45 As [Count]
Union All Select 2, 5
Union All Select 3, 120
Union All Select 4, 87
Union All Select 5, 60
Union All Select 6, 200
Union All Select 7, 31
)
, RangeStartEnd As
(
Select R1.Name
, Min(R1.LowerLimit) As StartValue
, Coalesce(MIN(R2.LowerLimit), 2147483647) As EndValue
From Ranges As R1
Left Join Ranges As R2
On R2.LowerLimit > R1.LowerLimit
Group By R1.Name
)
-- StartValue is inclusive and EndValue is exclusive, so a count of 50 maps to Medium
Select P.PersonId, P.[Count], RSE.Name
From Person As P
Join RangeStartEnd As RSE
On P.[Count] >= RSE.StartValue
And P.[Count] < RSE.EndValue
Although I'm using common-table expressions (cte for short) which only exist in SQL Server 2005+, this can be done with multiple queries where you create a temp table to store the equivalent of the RangeStartEnd cte. The trick is to create a view that has a start column and end column.
SELECT p.PersonID, Ranges.SizeName
FROM People P
JOIN
(
SELECT SizeName, LowerLimit, MIN(COALESCE(upperlimit, 2000000)) AS upperlimit
FROM (
SELECT rl.SizeName, rl.LowerLimit, ru.LowerLimit AS UpperLimit
FROM Ranges rl
LEFT OUTER JOIN Ranges ru ON rl.LowerLimit < ru.LowerLimit
) r
WHERE r.LowerLimit < COALESCE(r.UpperLimit, 2000000)
GROUP BY SizeName, LowerLimit
) Ranges ON p.Count >= Ranges.LowerLimit AND p.Count < Ranges.upperlimit
ORDER BY PersonID
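If the lookup ends up being done in application code instead, the same "largest lower limit not above the count" idea can be sketched with a binary search. This is only an illustration; the function name `size_for` is made up here, and the data is copied from the question.

```python
from bisect import bisect_right

# (SizeName, LowerLimit) rows; they can arrive in any order.
sizes = [("Small", 0), ("Medium", 50), ("Large", 100)]
people = [(1, 45), (2, 5), (3, 120), (4, 87), (5, 60), (6, 200), (7, 31)]


def size_for(count, size_rows):
    """Return the name of the range with the largest LowerLimit <= count."""
    ordered = sorted(size_rows, key=lambda row: row[1])
    limits = [lower for _name, lower in ordered]
    index = bisect_right(limits, count) - 1
    if index < 0:
        return None  # count is below every lower limit
    return ordered[index][0]


for person_id, count in people:
    print(person_id, size_for(count, sizes))
```

Because the limits are read from the data, adding a fourth range only requires a new row, not a code change.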

Query to calculate average time between successive events

My question is about how to write an SQL query to calculate the average time between successive events.
I have a small table:
event Name | Time
stage 1 | 10:01
stage 2 | 10:03
stage 3 | 10:06
stage 1 | 10:10
stage 2 | 10:15
stage 3 | 10:21
stage 1 | 10:22
stage 2 | 10:23
stage 3 | 10:29
I want to build a query that get as an answer the average of the times between stage(i) and stage(i+1).
For example,
the average time between stage 2 and stage 3 is 5:
(3+6+6)/3 = 5
Aaaaand with a sprinkle of black magic:
select a.eventName, b.eventName, AVG(DATEDIFF(MINUTE, a.[Time], b.[Time])) as Average from
(select *, row_number() over (order by [time]) rn from events) a
join (select *, row_number() over (order by [time]) rn from events) b on (a.rn=b.rn-1)
group by
a.eventName, b.eventName
This will give you rows like:
stage3 stage1 2
stage1 stage2 2
stage2 stage3 5
The first column is the starting event, the second column is the ending event. If there is Event 3 right after Event 1, that will be listed as well. Otherwise you should provide some criteria as to which stage follows which stage, so the times are calculated only between those.
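Added: This should work OK on both Transact-SQL (MSSQL, Sybase) and PL/SQL (Oracle, PostgreSQL), although DATEDIFF may need a dialect-specific replacement on the latter two. However I haven't tested it and there could still be syntax errors. This will NOT work on MySQL versions before 8.0, which have no ROW_NUMBER().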
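For readers who want to see the pairing logic outside SQL, here is a rough Python equivalent of the row-number self-join: sort by time, pair each row with the next one, and average the gaps per (from, to) pair. The names `average_gaps` and `at` are hypothetical, and the averages are kept as floats rather than truncated to integers.

```python
from collections import defaultdict
from datetime import datetime


def at(hhmm):
    """Parse an 'HH:MM' string from the question's table."""
    return datetime.strptime(hhmm, "%H:%M")


events = [
    ("stage 1", at("10:01")), ("stage 2", at("10:03")), ("stage 3", at("10:06")),
    ("stage 1", at("10:10")), ("stage 2", at("10:15")), ("stage 3", at("10:21")),
    ("stage 1", at("10:22")), ("stage 2", at("10:23")), ("stage 3", at("10:29")),
]


def average_gaps(rows):
    """Average minutes between each event and the one that follows it in time."""
    ordered = sorted(rows, key=lambda row: row[1])
    gaps = defaultdict(list)
    # zip(ordered, ordered[1:]) plays the role of the join on a.rn = b.rn - 1.
    for (name_a, time_a), (name_b, time_b) in zip(ordered, ordered[1:]):
        gaps[(name_a, name_b)].append((time_b - time_a).total_seconds() / 60)
    return {pair: sum(values) / len(values) for pair, values in gaps.items()}


for pair, minutes in average_gaps(events).items():
    print(pair, round(minutes, 2))
```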
Select Avg(differ) from (
Select s2.time - s1.time as differ
-- rownum is assigned before ORDER BY is applied, so sort in an inner query first
From (Select rownum as r, inn.time from (Select time from table order by time) inn) s1
Join (Select rownum as r, inn.time from (Select time from table order by time) inn) s2
On mod(s2.r, 3) = 2 and s2.r = s1.r + 1
Where mod(s1.r, 3) = 1
);
The parameters can be changed as the number of stages changes. This is currently set up to find the average between stages 1 and 2 from a 3 stage process.
EDIT a couple typos
Your table design is flawed. How can you tell which stage 1 goes with which stage 2? Without a way to do this, I do not think your query is possible.
The easiest way would be to order by time and use a cursor (tsql) for iterating over the data. Since cursors are evil it is advisable to fetch the data ordered by time into your application code and iterate there. There are probably other ways to do this in SQL but they will be very complicated and rely on non-standard language extensions.
You don't say which flavour of SQL you want the answer for. This probably means you want the code in SQL Server (as [sql] commonly = [sql-server] in SO tag usage).
But just in case you (or some future seeker) are using Oracle, this kind of query is quite straightforward with analytic functions, in this case LAG(). Check it out:
select stage_range
     , avg(time_diff)/60 as average_time_diff_in_min
from
(
    select event_name
         , case when event_name = 'stage 2' then 'stage 1 to 2'
                when event_name = 'stage 3' then 'stage 2 to 3'
                else '!!!' end as stage_range
         , stage_secs - lag(stage_secs)
               over (order by ts, event_name) as time_diff
    from
    ( select event_name
           , ts
           , to_number(to_char(ts, 'sssss')) as stage_secs
      from timings )
)
where event_name in ('stage 2','stage 3')
group by stage_range;
Result:
STAGE_RANGE  AVERAGE_TIME_DIFF_IN_MIN
------------ ------------------------
stage 1 to 2               2.66666667
stage 2 to 3                        5
The change of format in the inner query is necessary because I have stored the TIME column as a DATE datatype, so I convert it into seconds to make the mathematics clearer. An alternate solution would be to work with Day to Second Interval datatype instead. But this solution is really all about LAG().
edit
In my take on this query I have explicitly not calculated the difference between a prior Stage 3 and a subsequent Stage 1. This is a matter of requirement.
WITH q AS
(
SELECT 'stage 1' AS eventname, CAST('2009-01-01 10:01:00' AS DATETIME) AS eventtime
UNION ALL
SELECT 'stage 2' AS eventname, CAST('2009-01-01 10:03:00' AS DATETIME) AS eventtime
UNION ALL
SELECT 'stage 3' AS eventname, CAST('2009-01-01 10:06:00' AS DATETIME) AS eventtime
UNION ALL
SELECT 'stage 1' AS eventname, CAST('2009-01-01 10:10:00' AS DATETIME) AS eventtime
UNION ALL
SELECT 'stage 2' AS eventname, CAST('2009-01-01 10:15:00' AS DATETIME) AS eventtime
UNION ALL
SELECT 'stage 3' AS eventname, CAST('2009-01-01 10:21:00' AS DATETIME) AS eventtime
UNION ALL
SELECT 'stage 1' AS eventname, CAST('2009-01-01 10:22:00' AS DATETIME) AS eventtime
UNION ALL
SELECT 'stage 2' AS eventname, CAST('2009-01-01 10:23:00' AS DATETIME) AS eventtime
UNION ALL
SELECT 'stage 3' AS eventname, CAST('2009-01-01 10:29:00' AS DATETIME) AS eventtime
)
SELECT (
SELECT AVG(DATEDIFF(minute, '2009-01-01', eventtime))
FROM q
WHERE eventname = 'stage 3'
) -
(
SELECT AVG(DATEDIFF(minute, '2009-01-01', eventtime))
FROM q
WHERE eventname = 'stage 2'
)
This relies on the fact that you always have complete groups of the stages and they always go in the same order (that is, stage 1 then stage 2 then stage 3)
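The reason this works is that, when every stage 2 has exactly one matching stage 3, the difference of the two averages equals the average of the pairwise differences. A quick numeric check in Python, using the times from the question converted to minutes since midnight (the variable names are made up for this sketch):

```python
from statistics import mean

# Minutes since midnight for each occurrence, taken from the question's table.
stage2 = [10 * 60 + 3, 10 * 60 + 15, 10 * 60 + 23]
stage3 = [10 * 60 + 6, 10 * 60 + 21, 10 * 60 + 29]

difference_of_averages = mean(stage3) - mean(stage2)
average_of_differences = mean(end - start for start, end in zip(stage2, stage3))

# The two agree up to floating-point rounding.
print(abs(difference_of_averages - average_of_differences) < 1e-9)  # True
```

One caveat worth checking on SQL Server: AVG over an integer expression such as DATEDIFF returns an integer, so each average is truncated before the subtraction. Casting to a float or decimal first avoids that.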
I can't comment, but I have to agree with HLGEM. While you can tell with the provided data set, the OP should be made aware that relying on only a single set of stages existing at one time may be too optimistic.
event Name | Time
stage 1 | 10:01
stage 2 | 10:03
stage 3 | 10:06
stage 1 | 10:10
stage 2 | 10:15
stage 3 | 10:21
stage 1 | 10:22
stage 2 | 10:23
stage 1 | 10:25 --- new stage 1
stage 2 | 10:28 --- new stage 2
stage 3 | 10:29
stage 3 | 10:34 --- new stage 3
We don't know the environment or what is creating the data. It is up to the OP to decide if the table is built correctly.
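Before trusting any of the order-based queries, it may be worth checking whether the data really follows a strict stage 1, stage 2, stage 3 cycle. A minimal sketch of such a check, with the hypothetical helper name `follows_strict_cycle`, run against both the original and the interleaved sequence above:

```python
def follows_strict_cycle(stage_names, cycle=("stage 1", "stage 2", "stage 3")):
    """True if the names, already sorted by time, repeat the cycle exactly."""
    return all(
        name == cycle[position % len(cycle)]
        for position, name in enumerate(stage_names)
    )


original = ["stage 1", "stage 2", "stage 3"] * 3

# The interleaved example: a second run starts before the third one finishes.
interleaved = [
    "stage 1", "stage 2", "stage 3",
    "stage 1", "stage 2", "stage 3",
    "stage 1", "stage 2", "stage 1", "stage 2", "stage 3", "stage 3",
]

print(follows_strict_cycle(original))     # True
print(follows_strict_cycle(interleaved))  # False
```

If the check fails, the rows need some kind of run identifier before stage 1 and stage 2 records can be paired reliably.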
Oracle would handle this with analytic functions, as in Vilx's answer.
try this
Select Avg(e.Time - s.Time)
From Table s
Join Table e
On e.Time =
(Select Min(Time)
From Table
Where eventname = s.eventname
And time > s.Time)
And Not Exists
(Select * From Table
Where eventname = s.eventname
And time < s.Time)
For each record representing the start of a stage, this SQL joins it to the record which represents the end, takes the difference between the end time and the start time, and averages those differences. The Not Exists ensures that the intermediate resultset of start records joined to end records only includes the start records as s, and the first join condition ensures that only the one end record (the one with the same name and the next time value after the start time) is joined to it.
To see the intermediate resultset after the join, but before the average is taken, run the following:
Select s.EventName,
s.Time StartTime, e.Time EndTime,
(e.Time - s.Time) Elapsed
From Table s
Join Table e
On e.Time =
(Select Min(Time)
From Table
Where eventname = s.eventname
And time > s.Time)
And Not Exists
(Select * From Table
Where eventname = s.eventname
And time < s.Time)