SQL - unique users who are visiting for the first time

Given the following table visitorLog, write a SQL query to find the following by date:
Total_Visitors
VisitorGain - compared to the previous day
VisitorLoss - compared to the previous day
Total_New_Visitors - unique users who are visiting for the first time
visitorLog:
*----------------------*
| Date         Visitor |
*----------------------*
| 01-Jan-2011  V1      |
| 01-Jan-2011  V2      |
| 01-Jan-2011  V3      |
| 02-Jan-2011  V2      |
| 03-Jan-2011  V2      |
| 03-Jan-2011  V4      |
| 03-Jan-2011  V5      |
*----------------------*
Expected output:
*------------------------------------------------------------------------------*
| Date         Total_Visitors   VisitorGain   VisitorLoss   Total_New_Visitors |
*------------------------------------------------------------------------------*
| 01-Jan-2011  3                3             0             3                  |
| 02-Jan-2011  1                0             2             0                  |
| 03-Jan-2011  3                2             0             2                  |
*------------------------------------------------------------------------------*
Here is my SQL and SQL fiddle.
with cte as
(
    select
        date,
        total_visitors,
        lag(total_visitors) over (order by date) as prev_visitors,
        row_number() over (order by date) as rnk
    from
    (
        select
            *,
            count(visitor) over (partition by date) as total_visitors
        from visitorLog
    ) val
    group by
        date,
        total_visitors
),
cte2 as
(
    select
        date,
        sum(case when rnk = 1 then 1 else 0 end) as total_new_visitors
    from
    (
        select
            date,
            visitor,
            row_number() over (partition by visitor order by date) as rnk
        from visitorLog
    ) t
    group by
        date
)
select
    c.date,
    sum(total_visitors) as total_visitors,
    sum(
        case
            when rnk = 1 then total_visitors
            when (rnk > 1 and prev_visitors < total_visitors) then (total_visitors - prev_visitors)
            else 0
        end
    ) as visitorGain,
    sum(
        case
            when rnk = 1 then 0
            when prev_visitors > total_visitors then (prev_visitors - total_visitors)
            else 0
        end
    ) as visitorLoss,
    sum(total_new_visitors) as total_new_visitors
from cte c
join cte2 c2
    on c.date = c2.date
group by
    c.date
order by
    c.date
My solution is working as expected, but I am wondering if I am missing any edge cases here which may break my logic. Any help would be great.

This logic does what you want:
select date, count(*) as num_visitor,
       greatest(count(*) - lag(count(*)::int, 1, 0) over (order by date), 0) as visitor_gain,
       greatest(lag(count(*)::int, 1, 0) over (order by date) - count(*), 0) as visitor_loss,
       count(*) filter (where seqnum = 1) as num_new_visitors
from (select vl.*,
             row_number() over (partition by visitor order by date) as seqnum
      from visitorLog vl
     ) vl
group by date
order by date;
Here is a db<>fiddle.
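Note that count(*) filter (where ...) is PostgreSQL (and standard SQL) syntax; not every engine supports it (SQL Server, for example, does not). If you need to port the query, a conditional SUM counts the first-time visitors in the same way. A minimal sketch of just that piece, assuming the same subquery and table name:
select date,
       count(*) as num_visitor,
       sum(case when seqnum = 1 then 1 else 0 end) as num_new_visitors
from (select vl.*,
             row_number() over (partition by visitor order by date) as seqnum
      from visitorLog vl
     ) vl
group by date
order by date;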

I would use window functions and aggregation:
select
    date,
    count(*) no_visitor,
    count(*) - lag(count(*), 1, 0) over(order by date) no_visitor_diff,
    count(*) filter(where rn = 1) no_new_visitors
from (
    select t.*, row_number() over(partition by visitor order by date) rn
    from visitorLog t
) t
group by date
order by date
The subquery ranks the visits of each customer using row_number() (the first visit of each customer gets row number 1). Then, the outer query aggregates by date, and uses lag() to get the visitor count of the "previous" day.
I don't really see the point of having two distinct columns for the difference in visitors compared to the previous day, so this gives you a single column whose value is either positive or negative depending on whether visitors were gained or lost.
If you really want two columns, then:
greatest(count(*) - lag(count(*), 1, 0) over(order by date), 0) visitor_gain,
- least(count(*) - lag(count(*), 1, 0) over(order by date), 0) visitor_loss
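For completeness, here is a minimal sketch that folds those two expressions back into the full query (assuming PostgreSQL; the lag() default is cast to bigint so it matches the type of count(*)):
select
    date,
    count(*) as no_visitor,
    greatest(count(*) - lag(count(*), 1, 0::bigint) over(order by date), 0) as visitor_gain,
    -least(count(*) - lag(count(*), 1, 0::bigint) over(order by date), 0) as visitor_loss,
    count(*) filter(where rn = 1) as no_new_visitors
from (
    select t.*, row_number() over(partition by visitor order by date) rn
    from visitorLog t
) t
group by date
order by date;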

Related

Minimum and maximum dates within continuous date range grouped by name

I have date ranges with a start and end date per person, and I want to get the continuous date ranges only, per person:
Input:
NAME | STARTDATE  | END DATE
--------------------------------------
MIKE | 2019-05-15 | 2019-05-16
MIKE | 2019-05-17 | 2019-05-18
MIKE | 2020-05-18 | 2020-05-19
Expected output like:
MIKE | 2019-05-15 | 2019-05-18
MIKE | 2020-05-18 | 2020-05-19
So basically the output is the MIN and MAX for each continuous period per person.
Appreciate any help.
I have tried the below query:
With N AS (
    SELECT Name, StartDate, EndDate,
           LastStop = MAX(EndDate) OVER (PARTITION BY Name ORDER BY StartDate, EndDate
                                         ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING)
    FROM Table
), B AS (
    SELECT Name, StartDate, EndDate,
           Block = SUM(CASE WHEN LastStop Is Null Then 1
                            WHEN LastStop < StartDate Then 1
                            ELSE 0
                       END) OVER (PARTITION BY Name ORDER BY StartDate, LastStop)
    FROM N
)
SELECT Name,
       MIN(StartDate) DateFrom,
       MAX(EndDate) DateTo
FROM B
GROUP BY Name, Block
ORDER BY Name, Block
But it's not considering the continuous period; it's returning the same rows as the input.
This is a type of gap-and-islands problem. There is no need to expand the data out by day! That seems very inefficient.
Instead, determine the "islands". This is where there is no overlap -- in your case lag() is sufficient. Then a cumulative sum and aggregation:
select name, min(startdate), max(enddate)
from (select t.*,
             sum(case when prev_enddate >= dateadd(day, -1, startdate) then 0 else 1 end) over
                 (partition by name order by startdate) as grp
      from (select t.*,
                   lag(enddate) over (partition by name order by startdate) as prev_enddate
            from t
           ) t
     ) t
group by name, grp;
Here is a db<>fiddle.
Here is an example using an ad-hoc tally table
Example or dbFiddle
;with cte as (
Select A.[Name]
,B.D
,Grp = datediff(day,'1900-01-01',D) - dense_rank() over (partition by [Name] Order by D)
From YourTable A
Cross Apply (
Select Top (DateDiff(DAY,StartDate,EndDate)+1) D=DateAdd(DAY,-1+Row_Number() Over (Order By (Select Null)),StartDate)
From master..spt_values n1,master..spt_values n2
) B
)
Select [Name]
,StartDate= min(D)
,EndDate = max(D)
From cte
Group By [Name],Grp
Returns
Name StartDate EndDate
MIKE 2019-05-15 2019-05-18
MIKE 2020-05-18 2020-05-19
Just to help with the visualization: the CTE expands each range into one row per day, and Grp stays constant within each continuous run of days (it changes whenever a day is skipped), so grouping by Name and Grp yields the continuous periods.
This will give you the same result
SELECT subquery.name, min(subquery.startdate), max(subquery.enddate1)
FROM (SELECT NAME, startdate,
             CASE WHEN EXISTS(SELECT yt1.startdate
                              FROM t yt1
                              WHERE yt1.startdate = DATEADD(day, 1, yt2.enddate)
                             ) THEN null ELSE yt2.enddate END AS enddate1
      FROM t yt2) AS subquery
GROUP BY NAME, CAST(MONTH(subquery.startdate) AS VARCHAR(2)) + '-' + CAST(YEAR(subquery.startdate) AS VARCHAR(4))
For the CASE WHEN EXISTS I referred to SQL CASE
For the group by month and year you can see this GROUP BY MONTH AND YEAR
DB_FIDDLE

logic for subtracting and getting variance in oracle

How do I write the logic to get the below output?
with b2 as
(
    select COUNT(Sale) As cnt,
           TO_CHAR(date,'YYYY-MON') As Period
    from Order
    where date between date '2020-02-01' and date '2020-02-28'
    group by TO_CHAR(BATCH_DTE_CYMD,'YYYY-MON')
    union all
    select COUNT(Sale) As cnt,
           TO_CHAR(Date,'YYYY-MON') As Period
    from Order
    where date between date '2020-01-01' and date '2020-01-31'
    group by TO_CHAR(Date,'YYYY-MON')
)
select cnt, Period,
       100*(cnt - lag(cnt,1,cnt) over (order by period))
          / lag(cnt,1,cnt) over (order by period)
       as "variance(%)"
from b2
order by period
I am getting this output:
Cnt   | period   | variance(%)
11917 | 2020-FEB | 0
11707 | 2020-JAN | -1.76218847025258034740286984979441134514
but I want this output:
Cnt   | period   | variance(%)                                             | sign
11917 | 2020-FEB | JAN-FEB (variance we get in Feb, in %, with no decimal) | Increase/Decrease
11707 | 2020-JAN | 0                                                       | 0
The issue with your code is that you are using LAG on the CHAR column (PERIOD), which is not correct: '2020-FEB' sorts lower than '2020-JAN' when the comparison is done on strings.
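To see why the string order goes wrong: 'F' sorts before 'J', so '2020-FEB' compares lower than '2020-JAN'. A quick hypothetical check (assuming Oracle's dual and the default binary sort):
SELECT CASE WHEN '2020-FEB' < '2020-JAN' THEN 'FEB sorts first' ELSE 'JAN sorts first' END AS string_order
FROM dual;
-- returns 'FEB sorts first', which is the opposite of the chronological order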
You should order by the actual dates in the LAG function, as follows:
WITH B2 AS (
SELECT COUNT(SALE) AS CNT,
TRUNC(DATE, 'MON') AS PERIOD
FROM ORDER
WHERE DATE BETWEEN DATE '2020-01-01' AND DATE '2020-02-28'
GROUP BY TRUNC(DATE, 'MON')
)
SELECT
CNT,
TO_CHAR(PERIOD,'YYYY-MON') AS PERIOD,
100 * ( CNT - LAG(CNT, 1, CNT) OVER( ORDER BY PERIOD ) )
/ LAG(CNT, 1, CNT) OVER(ORDER BY PERIOD)
AS "variance(%)"
FROM B2
ORDER BY PERIOD
Cheers!!
Try below query:
select cnt, period,
       case when variance <> 0 then
                'JAN-FEB(Variance we get in feb in '||to_char(round(variance,0))||')'
            else to_char(variance)
       end as variance,
       case when sign(variance) < 0 then 'Decrease'
            when sign(variance) = 0 then 'No Change'
            else 'Increase'
       end as sign
from (select cnt, Period,
             100*(cnt - lag(cnt,1,cnt) over (order by period desc))
                / lag(cnt,1,cnt) over (order by period desc)
             as variance
      from test
     ) t
order by period
I have used the ROUND function, which will convert -1.7... to -2; you can also use CEIL.
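Note that the rounding functions treat negative values differently; a quick sketch with hypothetical values (assuming Oracle's dual):
SELECT ROUND(-1.76) AS rounded,   -- -2 (nearest integer)
       CEIL(-1.76)  AS ceiled,    -- -1 (smallest integer >= -1.76)
       TRUNC(-1.76) AS truncated  -- -1 (decimal part dropped)
FROM dual;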

SQL partition by on date range

Assume this is my table:
ID    NUMBER    DATE
---------------------------
1     45        2018-01-01
2     45        2018-01-02
3     45        2018-01-27
I need to separate using partition by and row_number where the difference between one date and another is greater than 5 days. Something like this would be the result of the above example:
ROWNUMBER    ID    NUMBER    DATE
--------------------------------------
1            1     45        2018-01-01
2            2     45        2018-01-02
1            3     45        2018-01-27
My actual query is something like this:
SELECT ROW_NUMBER() OVER(PARTITION BY NUMBER ORDER BY ID DESC) AS ROWNUMBER, ...
But as you can notice, it doesn't work for the dates. How can I achieve that?
You can use the lag function:
select *, row_number() over (partition by number, grp order by id) as [ROWNUMBER]
from (select *, (case when datediff(day, lag(date,1,date) over (partition by number order by id), date) <= 1
                      then 1 else 2
                 end) as grp
      from table
     ) t;
By using the lag and datediff functions:
select * from
(
select t.*,
datediff(day,
lag(DATE) over (partition by NUMBER order by id),
DATE
) as diff
from t
) as TT where diff>5
http://sqlfiddle.com/#!18/130ae/11
I think you want to identify the groups, using lag() and datediff() and a cumulative sum. Then use row_number():
select t.*,
       row_number() over (partition by number, grp order by date) as rownumber
from (select t.*,
             sum(grp_start) over (partition by number order by date) as grp
      from (select t.*,
                   (case when lag(date) over (partition by number order by date) < dateadd(day, -5, date)
                         then 1 else 0
                    end) as grp_start
            from t
           ) t
     ) t;

SQL Server - find absence date occurrences [duplicate]

This question already has an answer here:
SQL: Gaps and Islands, Grouped dates
(1 answer)
Closed 5 years ago.
I have the following dataset:
Here is the script for this data:
;with dataset AS (
select 'EMP01' AS EMP_ID,CAST('2018-01-01' AS DATE) AS PERIOD_START,CAST('2018-01-31' AS DATE) AS PERIOD_END,CAST('2018-01-07' AS DATE) AS CUT_DATE
UNION
select 'EMP01' AS EMP_ID,CAST('2018-01-01' AS DATE) AS PERIOD_START,CAST('2018-01-31' AS DATE) AS PERIOD_END,CAST('2018-01-15' AS DATE) AS CUT_DATE
UNION
select 'EMP02' AS EMP_ID,CAST('2018-01-01' AS DATE) AS PERIOD_START,CAST('2018-01-31' AS DATE) AS PERIOD_END,CAST('2018-01-09' AS DATE) AS CUT_DATE
)
select *
from dataset
I need to divide these periods (PERIOD_START and PERIOD_END) by CUT_DATE (excluding the cut dates from those periods). The number of cut dates could be any (3, 5, 8, etc.).
The expected result for the dataset above is each period split at its cut dates.
If your version of SQL Server supports LAG, you can use this.
SELECT EMPLOYEE_ID,
ITEM_TYPE,
MIN(APPLY_DATE) AS STARTDATE,
MAX(APPLY_DATE) AS ENDDATE
FROM
(SELECT T.*,
SUM(CASE WHEN PREV_TYPE=ITEM_TYPE THEN 0 ELSE 1 END)
OVER(PARTITION BY EMPLOYEE_ID ORDER BY APPLY_DATE) AS GRP
FROM (SELECT D.*,
LAG(ITEM_TYPE) OVER(PARTITION BY EMPLOYEE_ID ORDER BY APPLY_DATE) AS PREV_TYPE
FROM DATA D
) T
) T
WHERE ITEM_TYPE IN ('Sickness','Vacation')
GROUP BY EMPLOYEE_ID,ITEM_TYPE,GRP
The logic is to get the previous row's item_type (based on ascending order of apply_date) and compare it with the current row's value. If they are equal, they belong to the same group; otherwise, you start a new group. This is done in the SUM window function. After groups are assigned, you just need to get the min and max date for each employee_id and item_type.
Sample Demo
You would use the LAG function.
If you order by something, the LAG function gives the previous value;
a full description can be found at: http://www.sqlservercentral.com/articles/T-SQL/106783/
Take a look at vkp's answer for a full query
This is another way, if lag is supported.
Rextester Sample
with tbl as
(select d.*
,case when (item_type = lag(item_type) over (partition by employee_id order by apply_date))
then 0
else 1
end grp_tmp
from DATA2 d
where
item_type <> 'Worked'
)
,tbl2 as
(select t.*
,sum(grp_tmp) over (order by employee_id,apply_date
rows between unbounded preceding and current row
)
as grp
from tbl t
)
select
EMPLOYEE_ID
,ITEM_TYPE
,(CONVERT(VARCHAR(24),min(apply_date),103)
+' - '
+CONVERT(VARCHAR(24),max(apply_date),103)
) as range
from tbl2
group by EMPLOYEE_ID,
ITEM_TYPE
,grp
order by
employee_id
,min(apply_date);
Output
+-------------+-----------+-------------------------+
| EMPLOYEE_ID | ITEM_TYPE | range |
+-------------+-----------+-------------------------+
| 1 | Sickness | 23/05/2017 - 24/05/2017 |
| 1 | Vacation | 26/05/2017 - 29/05/2017 |
| 1 | Sickness | 01/06/2017 - 01/06/2017 |
| 2 | Sickness | 25/05/2017 - 30/05/2017 |
+-------------+-----------+-------------------------+

Number of unique dates

There is a table:
CREATE TABLE my_table
(gr_id NUMBER,
start_date DATE,
end_date DATE);
All dates always have a zero time portion. I need to know the fastest way to compute the number of unique dates inside each gr_id.
For example, if there are rows (dd.mm.rrrr):
1 | 01.01.2000 | 07.01.2000
1 | 01.01.2000 | 07.01.2000
2 | 01.01.2000 | 03.01.2000
2 | 05.01.2000 | 07.01.2000
3 | 01.01.2000 | 04.01.2000
3 | 03.01.2000 | 05.01.2000
then the right answer will be
1 | 7
2 | 6
3 | 5
Right now I use an additional table
CREATE TABLE mfr_date_list
(MFR_DATE DATE);
with every date between 01.01.2000 and 31.12.2020, and a query like this:
SELECT COUNT(DISTINCT mfr_date_list.mfr_date) cnt,
dt.gr_id
FROM dwh_mfr.mfr_date_list,
(SELECT gr_id,
start_date AS sd,
end_date AS ed
FROM my_table
) dt
WHERE mfr_date_list.mfr_date BETWEEN dt.sd AND dt.ed
AND dt.ed IS NOT NULL
GROUP BY dt.gr_id
This query returns the correct result set, but I don't think it's the fastest way. I think there is some way to build the query without the mfr_date_list table at all.
Oracle 11.2 64-bit.
I would expect what you're doing to be the fastest way (as always, test). Your query can be simplified, though this only aids understanding, not necessarily speed:
select t.gr_id, count(distinct dl.mfr_date) as cnt
from my_table t
join mfr_date_list dl
  on dl.mfr_date between t.start_date and t.end_date
where t.end_date is not null
group by t.gr_id
Whatever you do, you have to generate the dates between the two dates somehow, as you need to remove the overlap. One way would be to use CAST(MULTISET()), as Lalit Kumar explains:
select gr_id, count(distinct end_date - column_value + 1)
from my_table m
cross join table(cast(multiset(select level
                               from dual
                               connect by level <= m.end_date - m.start_date + 1
                              ) as sys.odcinumberlist))
group by gr_id;
GR_ID COUNT(DISTINCTEND_DATE-COLUMN_VALUE+1)
---------- --------------------------------------
1 7
2 6
3 5
This is very Oracle specific but should perform substantially better than most other row-generators as you're only accessing the table once and you're generating the minimal number of rows required due to the condition linking MY_TABLE and your generated rows.
What you really need to do is combine the ranges and then count the lengths. This can be quite challenging because of duplicate dates. The following is one way to approach this.
First, enumerate the dates and determine whether the date is "in" or "out". When the cumulative sum is 0 then it is "out":
select t.gr_id, dt,
       sum(inc) over (partition by t.gr_id order by dt) as cume_inc
from (select t.gr_id, t.start_date as dt, 1 as inc
      from my_table t
      union all
      select t.gr_id, t.end_date + 1, -1 as inc
      from my_table t
     ) t
Then, use lead() to determine how long the period is:
with inc as (
      select t.gr_id, dt,
             sum(inc) over (partition by t.gr_id order by dt) as cume_inc
      from (select t.gr_id, t.start_date as dt, 1 as inc
            from my_table t
            union all
            select t.gr_id, t.end_date + 1, -1 as inc
            from my_table t
           ) t
     )
select t.gr_id,
       sum(nextdt - dt) as daysInUse
from (select inc.*, lead(dt) over (partition by gr_id order by dt) as nextdt
      from inc
     ) t
group by t.gr_id;
This is close to what you want. The following are two challenges: (1) putting in the limits and (2) handling ties. The following should work (although there might be off-by-one and boundary issues):
with inc as (
      select t.gr_id, dt, priority,
             sum(inc) over (partition by t.gr_id order by dt) as cume_inc
      from ((select t.gr_id, t.start_date as dt, count(*) as inc, 1 as priority
             from my_table t
             group by t.gr_id, t.start_date
            )
            union all
            (select t.gr_id, t.end_date + 1, - count(*) as inc, -1
             from my_table t
             group by t.gr_id, t.end_date
            )
           ) t
     )
select t.gr_id,
       sum(least(nextdt, date '2020-12-31') - greatest(dt, date '2000-01-01')) as daysInUse
from (select inc.*, lead(dt) over (partition by gr_id order by dt, priority) as nextdt
      from inc
     ) t
group by t.gr_id;