Using T-SQL, how do I generate a result that shows a range of dates

Using SQL Server, how do I generate a result set that shows a range of dates, like so:
StartDate EndDate
01/01/2014 01/04/2014
01/08/2014 01/11/2014
01/14/2014 01/15/2014
The original data had the dates in this format:
ColumnA DateColumn
blah 01/01/2014
blah 01/02/2014
blah 01/03/2014
blah 01/04/2014
blah 01/08/2014
blah 01/09/2014
blah 01/10/2014
blah 01/11/2014
blah 01/14/2014
blah 01/15/2014
Currently, I have a bunch of queries that do this, but I'm wondering if I can do it in less code:
SELECT ROW_NUMBER() OVER(ORDER BY DateColumn) AS rownum,
DateColumn
INTO #main
FROM MyTable
SELECT m1.DateColumn AS TBegin,
m2.DateColumn AS TEnd,
COALESCE(DATEDIFF(day, m2.DateColumn, m1.DateColumn), 0) AS Gap
INTO #Gap
FROM #main m1
LEFT OUTER JOIN #main m2
ON m1.rownum = m2.rownum + 1
ORDER BY m1.DateColumn
SELECT ROW_NUMBER() OVER(ORDER BY i_id, TBegin) AS rownum,
TBegin
INTO #Begin
FROM #Gap
WHERE Gap <> 1
ORDER BY TBegin
SELECT ROW_NUMBER() OVER(ORDER BY i_id, TEnd) AS rownum,
TEnd
INTO #End
FROM (
SELECT TEnd
FROM #Gap
WHERE Gap > 1
UNION
SELECT MAX(TBegin)
FROM #Gap
) as t
ORDER BY TEnd
SELECT b.TBegin,
e.TEnd
FROM #Begin b
INNER JOIN #End e
ON b.i_id = e.i_id
AND b.rownum = e.rownum
ORDER BY b.TBegin
Any ideas on how to simplify or approach this in an entirely different way?

My approach to these is to identify the first date that has no date preceding it. This is the beginning of a group. Then I take the cumulative sum of that as a group identifier, and do the aggregation.
SQL Server 2008 doesn't have lag or cumulative sums, so I use correlated subqueries for this:
with mt as (
select t.*,
(case when (select top 1 t2.dateColumn
from MyTable t2
where t2.ColumnA = t.ColumnA and
t2.dateColumn < t.dateColumn
order by t2.dateColumn desc
) = dateadd(day, -1, t.datecolumn)
then 0
else 1
end) as IsStart
from MyTable t
),
mtcum as (
select mt.*,
(select sum(mt2.IsStart)
from mt mt2
where mt2.ColumnA = mt.ColumnA and
mt2.dateColumn <= mt.DateColumn
) as grpId
from mt
)
select ColumnA, min(dateColumn) as StartDate, max(dateColumn) as EndDate
from mtcum
group by ColumnA, grpId;
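To make the two steps concrete, here is a minimal Python sketch of the same idea run on the sample dates from the question. It is only an illustration of the logic, not a replacement for the query: the function name is made up, it assumes distinct dates, and it ignores the ColumnA partition.

```python
from datetime import date, timedelta
from itertools import accumulate

def islands_by_start_flag(dates):
    """Group dates into consecutive runs using start flags and a running sum."""
    ordered = sorted(dates)
    one_day = timedelta(days=1)
    # IsStart: 1 when the previous date is not exactly one day earlier.
    is_start = [
        1 if i == 0 or ordered[i - 1] != d - one_day else 0
        for i, d in enumerate(ordered)
    ]
    # grpId: cumulative sum of IsStart, constant within each run.
    grp_ids = list(accumulate(is_start))
    ranges = {}
    for d, g in zip(ordered, grp_ids):
        start, _ = ranges.get(g, (d, d))
        ranges[g] = (start, d)  # min stays fixed, max moves forward
    return [ranges[g] for g in sorted(ranges)]

sample = [date(2014, 1, n) for n in (1, 2, 3, 4, 8, 9, 10, 11, 14, 15)]
result = islands_by_start_flag(sample)
for start, end in result:
    print(start, end)
```

Each printed pair corresponds to one StartDate/EndDate row of the desired result.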
EDIT:
An easier way to approach this is with the observation that, within a run of consecutive dates, the difference between each date and its row number is constant.
select columnA, min(dateColumn) as StartDate, max(dateColumn) as EndDate
from (select mt.*, row_number() over (partition by ColumnA order by datecolumn) as seqnum
from mytable mt
) t
group by columnA, dateadd(day, - seqnum, datecolumn);
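If it helps to see why the grouping key stays constant, here is a small Python sketch of the "date minus row number" trick on the sample data. The function name is hypothetical, the sketch assumes a single ColumnA value, and it drops duplicate dates up front.

```python
from datetime import date, timedelta
from itertools import groupby

def islands_by_row_number(dates):
    """Group dates into consecutive runs by the key (date - row number)."""
    ordered = sorted(set(dates))
    # Within a run of consecutive dates, both the date and the row number
    # grow by one per row, so their difference does not change.
    keyed = [
        (d - timedelta(days=seqnum), d)
        for seqnum, d in enumerate(ordered, start=1)
    ]
    ranges = []
    for _, grp in groupby(keyed, key=lambda pair: pair[0]):
        run = [d for _, d in grp]
        ranges.append((run[0], run[-1]))
    return ranges

sample = [date(2014, 1, n) for n in (1, 2, 3, 4, 8, 9, 10, 11, 14, 15)]
ranges = islands_by_row_number(sample)
for start, end in ranges:
    print(start, end)
```

A gap in the dates makes the date jump by more than the row number does, which is what shifts the key and starts a new group.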

This will work for you. It's still fairly complex though. It uses inner queries to find the first date that is after a gap for each date. This way all days belonging to the same group of dates can be grouped together.
select MIN(DateColumn) StartDate, MAX(DateColumn) EndDate from
(select X.DateColumn, MIN(Y.DateColumn) MinOverGap from
(select DateColumn, ROW_NUMBER() OVER (ORDER BY DateColumn) RowNumber
from MyTable) X
left join
(select DateColumn, ROW_NUMBER() OVER (ORDER BY DateColumn) RowNumber
from MyTable) Y
on DATEADD(d, Y.RowNumber - 1, X.DateColumn) <> DATEADD(d, X.RowNumber -1, Y.DateColumn) AND X.DateColumn < Y.DateColumn
group by x.DateColumn) grouped
group by MinOverGap
order by 1

Related

Hive Query, any good ways to optimize these unions?

ALL, I'm new to HIVE and general query optimization.
I have 3 unions that are more or less the exact same query. The only reason these unions exist is that my source table does NOT have weekend or holiday dates, and I need to persist a few basic values from the preceding calendar day that exists in the source table for the holiday/weekend date that doesn't exist. The date_add offset (1, 2, or 3 days) is really the only difference between the 3 unions.
Is there any way to combine these 3 queries into one, or perhaps just do this in a more performant way?
I'm a bit stuck but I've already got this down from a 45 minute overall process to 4 1/2 minutes. Just not sure how to optimize these unions. Please help :/
UNION ALL
--ADDING 1 DAYS TO FRIDAYS--
select * from
(
SELECT a.portfolio_name, cast(date_add(performance_end_date,1) as timestamp) as performance_end_date, cast(0.0000000 as string) as car_return, a.nav, a.nav_id
,row_number() over (partition by a.portfolio_code,a.performance_end_date order by a.nav_id desc) as row_no
FROM carsales a
where
a.portfolio_code IN ('1994',1998,2523)
and a.year=2020 and a.month=09
and DAYOFWEEK(performance_end_date) = 6
) a
where row_no= 1
UNION ALL
--ADDING 2 DAYS TO FRIDAYS--
select * from
(
SELECT a.portfolio_name, cast(date_add(performance_end_date,2) as timestamp) as performance_end_date, cast(0.0000000 as string) as car_return, a.nav, a.nav_id
,row_number() over (partition by a.portfolio_code,a.performance_end_date order by a.nav_id desc) as row_no
FROM carsales a
where
a.portfolio_code IN ('1994',1998,2523)
and a.year=2020 and a.month=09
and DAYOFWEEK(performance_end_date) = 6
) a
where row_no= 1
UNION ALL
--ADDING 3 DAYS To Holidays
select * from
(
SELECT a.portfolio_name, cast(date_add(performance_end_date,3) as timestamp) as performance_end_date, cast(0.0000000 as string) as car_return, a.nav, a.nav_id
,row_number() over (partition by a.portfolio_code,a.performance_end_date order by a.nav_id desc) as row_no
FROM carsales a
where
a.portfolio_code IN ('1994',1998,2523)
and a.year=2020 and a.month=09
and performance_end_date in ('2020-09-04 00:00:00.000','2020-10-09 00:00:00.000')
) a
where row_no= 1
If it's exactly as you wrote, and the only difference is the date_add parameter, you could take the SQL from one of the unions and cross join it with a union of the constants 1, 2, and 3. The cross join may work better than the union, though that also depends on the row counts in the source. You could also filter on the row number before doing the cross join, so that fewer rows are joined. In the example posted below I didn't filter on the row number.
The query will look like this:
SELECT a.portfolio_name,
Cast(Date_add(a.performance_end_date, crs.crs) AS TIMESTAMP) AS
performance_end_date,
a.car_return,
a.nav,
a.nav_id,
a.performance_end_date,
a.row_no
FROM (SELECT a.portfolio_name,
-- Cast(Date_add(performance_end_date, 1) AS TIMESTAMP) AS performance_end_date,
Cast(0.0000000 AS STRING) AS car_return,
a.nav,
a.nav_id,
a.performance_end_date,
Row_number()
OVER (
partition BY a.portfolio_code, a.performance_end_date
ORDER BY a.nav_id DESC) AS row_no
FROM carsales a
WHERE a.portfolio_code IN ( '1994', 1998, 2523 )
AND a.year = 2020
AND a.month = 09
AND Dayofweek(performance_end_date) = 6) a
CROSS JOIN (SELECT 1 crs
UNION ALL
SELECT 2
UNION ALL
SELECT 3) crs
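To picture what the cross join does, here is a rough Python sketch: every row from the inner query is paired with every offset, and the date is shifted by that offset. The sample rows, their values, and the function name are all made up for illustration and are not taken from the carsales table.

```python
from datetime import date, timedelta
from itertools import product

# Hypothetical rows standing in for the inner query's output (already filtered to Fridays).
friday_rows = [
    {"portfolio_name": "P1", "performance_end_date": date(2020, 9, 4), "nav": 10.5},
    {"portfolio_name": "P1", "performance_end_date": date(2020, 9, 11), "nav": 10.7},
]
day_offsets = [1, 2, 3]  # plays the role of the crs derived table

def fan_out(rows, offsets):
    """Pair every row with every offset (a cross join) and shift its date by that offset."""
    return [
        {**row, "performance_end_date": row["performance_end_date"] + timedelta(days=n)}
        for row, n in product(rows, offsets)
    ]

filled = fan_out(friday_rows, day_offsets)
for row in filled:
    print(row["portfolio_name"], row["performance_end_date"], row["nav"])
```

The source rows are read once and multiplied by the number of offsets, instead of being scanned once per union branch.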
EDIT 1: regarding the comments about date1 or date2, you could do exactly what you wrote. In the where clause, put date_column = something or date_column = something.
SELECT a.portfolio_name,
Cast(Date_add(a.performance_end_date, crs.crs) AS TIMESTAMP) AS
performance_end_date,
a.car_return,
a.nav,
a.nav_id,
a.performance_end_date,
a.row_no
FROM (SELECT a.portfolio_name,
-- Cast(Date_add(performance_end_date, 1) AS TIMESTAMP) AS performance_end_date,
Cast(0.0000000 AS STRING) AS car_return,
a.nav,
a.nav_id,
a.performance_end_date,
Row_number()
OVER (
partition BY a.portfolio_code, a.performance_end_date
ORDER BY a.nav_id DESC) AS row_no
FROM carsales a
WHERE a.portfolio_code IN ( '1994', 1998, 2523 )
AND a.year = 2020
AND a.month = 09
AND (Dayofweek(performance_end_date) = 6 or performance_end_date in ('2020-09-04 00:00:00.000','2020-10-09 00:00:00.000'))
) a
CROSS JOIN (SELECT 1 crs
UNION ALL
SELECT 2
UNION ALL
SELECT 3) crs
In addition to @F.Lazarescu's answer, you can rewrite the CROSS JOIN subquery.
Instead of this:
CROSS JOIN (SELECT 1 crs
UNION ALL
SELECT 2
UNION ALL
SELECT 3) crs
Use the stack() UDTF instead; it should perform faster:
CROSS JOIN (SELECT stack(3, 1,2,3) as crs) crs

conditional running sum

I'm trying to return the number of unique users that converted over time.
So I have the following query:
WITH CTE
As
(
SELECT '2020-04-01' as date,'userA' as user,1 as goals Union all
SELECT '2020-04-01','userB',0 Union all
SELECT '2020-04-01','userC',0 Union all
SELECT '2020-04-03','userA',1 Union all
SELECT '2020-04-05','userC',1 Union all
SELECT '2020-04-06','userC',0 Union all
SELECT '2020-04-06','userB',0
)
select
date,
COUNT(DISTINCT
IF
(goals >= 1,
user,
NULL)) AS cad_converters
from CTE
group by date
I'm trying to count distinct users, but I need a way to make the distinct count run over all dates up to the current one. I probably need something like a cumulative sum...
expected result would be something like this
date, goals, total_unique_converted_users
'2020-04-01',1,1
'2020-04-01',0,1
'2020-04-01',0,1
'2020-04-03',1,2
'2020-04-05',1,2
'2020-04-06',0,2
'2020-04-06',0,2
Below is for BigQuery Standard SQL
#standardSQL
SELECT t.date, t.goals, total_unique_converted_users
FROM `project.dataset.table` t
LEFT JOIN (
SELECT a.date,
COUNT(DISTINCT IF(b.goals >= 1, b.user, NULL)) AS total_unique_converted_users
FROM `project.dataset.table` a
CROSS JOIN `project.dataset.table` b
WHERE a.date >= b.date
GROUP BY a.date
)
USING(date)
I would approach this by tagging the row where each user first scores a goal. Then simply do a cumulative sum:
select cte.* except (seqnum), countif(seqnum = 1) over (order by date)
from (select cte.*,
(case when goals = 1 then row_number() over (partition by user, goals order by date) end) as seqnum
from cte
) cte;
I realize this can be expressed without the case in the subquery:
select cte.* except (seqnum), countif(seqnum = 1 and goals = 1) over (order by date)
from (select cte.*,
row_number() over (partition by user, goals order by date) as seqnum
from cte
) cte;
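As a cross-check on the logic, here is a small Python sketch of a running count of unique converters over the CTE's sample rows. The function name is hypothetical. The total for a date includes every row of that date, which I believe matches how the default window frame treats rows that share an order by value.

```python
rows = [
    ("2020-04-01", "userA", 1),
    ("2020-04-01", "userB", 0),
    ("2020-04-01", "userC", 0),
    ("2020-04-03", "userA", 1),
    ("2020-04-05", "userC", 1),
    ("2020-04-06", "userC", 0),
    ("2020-04-06", "userB", 0),
]

def running_unique_converters(rows):
    """Return (date, goals, unique converters up to and including that date)."""
    ordered = sorted(rows, key=lambda r: r[0])  # ISO dates sort chronologically
    converted = set()
    total_by_date = {}
    for day, user, goals in ordered:
        if goals >= 1:
            converted.add(user)  # a repeat conversion does not grow the set
        total_by_date[day] = len(converted)  # last write per date wins
    return [(day, goals, total_by_date[day]) for day, _, goals in ordered]

totals = running_unique_converters(rows)
for line in totals:
    print(line)
```

With these rows the count stays at 1 on 2020-04-03, because userA had already converted on 2020-04-01, and moves to 2 on 2020-04-05 when userC converts.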

How to get the validity date range of a price from individual daily prices in SQL

I have some prices for the month of January.
Date,Price
1,100
2,100
3,115
4,120
5,120
6,100
7,100
8,120
9,120
10,120
Now, the output I need is a non-overlapping date range for each price.
price,from,To
100,1,2
115,3,3
120,4,5
100,6,7
120,8,10
I need to do this using SQL only.
For now, if I simply group by and take min and max dates, I get the below, which is an overlapping range:
price,from,to
100,1,7
115,3,3
120,4,10
This is a gaps-and-islands problem. The simplest solution is the difference of row numbers:
select price, min(date), max(date)
from (select t.*,
row_number() over (order by date) as seqnum,
row_number() over (partition by price order by date) as seqnum2
from t
) t
group by price, (seqnum - seqnum2)
order by min(date);
Why this works is a little hard to explain. But if you look at the results of the subquery, you will see how the adjacent rows are identified by the difference in the two values.
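For anyone who wants to see the two row numbers side by side, here is a minimal Python sketch that computes seqnum, seqnum2, and their difference for the sample prices, then groups on (price, difference). The function name is made up, and this is just a trace of the logic, not how the database executes it.

```python
from collections import defaultdict

prices = [(1, 100), (2, 100), (3, 115), (4, 120), (5, 120),
          (6, 100), (7, 100), (8, 120), (9, 120), (10, 120)]

def price_ranges(rows):
    """Return (price, from, to) islands using the difference of two row numbers."""
    seen_per_price = defaultdict(int)
    groups = {}
    for seqnum, (day, price) in enumerate(sorted(rows), start=1):
        seen_per_price[price] += 1
        seqnum2 = seen_per_price[price]  # row number within this price
        # While the price repeats on adjacent rows, both counters advance
        # together, so the difference only changes when another price intervenes.
        key = (price, seqnum - seqnum2)
        first, _ = groups.get(key, (day, day))
        groups[key] = (first, day)
        print(day, price, seqnum, seqnum2, seqnum - seqnum2)
    return sorted(
        ((price, lo, hi) for (price, _), (lo, hi) in groups.items()),
        key=lambda r: r[1],
    )

islands = price_ranges(prices)
print(islands)
```

Note that price 100 on days 6 and 7 and price 120 on days 4 and 5 share the same difference (3), which is why the price has to be part of the grouping key.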
SELECT Lag.price,Lag.[date] AS [From], MIN(Lead.[date]-Lag.[date])+Lag.[date] AS [to]
FROM
(
SELECT [date],[Price]
FROM
(
SELECT [date],[Price],LAG(Price) OVER (ORDER BY DATE,Price) AS LagID FROM #table1 A
)B
WHERE CASE WHEN Price <> ISNULL(LagID,1) THEN 1 ELSE 0 END = 1
)Lag
JOIN
(
SELECT [date],[Price]
FROM
(
SELECT [date],Price,LEAD(Price) OVER (ORDER BY DATE,Price) AS LeadID FROM [#table1] A
)B
WHERE CASE WHEN Price <> ISNULL(LeadID,1) THEN 1 ELSE 0 END = 1
)Lead
ON Lag.[Price] = Lead.[Price]
WHERE Lead.[date]-Lag.[date] >= 0
GROUP BY Lag.[date],Lag.[price]
ORDER BY Lag.[date]
Another method using ROWS UNBOUNDED PRECEDING
SELECT price, MIN([date]) AS [from], [end_date] AS [To]
FROM
(
SELECT *, MIN([abc]) OVER (ORDER BY DATE DESC ROWS UNBOUNDED PRECEDING ) end_date
FROM
(
SELECT *, CASE WHEN price = next_price THEN NULL ELSE DATE END AS abc
FROM
(
SELECT a.* , b.[date] AS next_date, b.price AS next_price
FROM #table1 a
LEFT JOIN #table1 b
ON a.[date] = b.[date]-1
)AA
)BB
)CC
GROUP BY price, end_date

Find nearest date to start and end of the month

The table contains daily snapshots of a specific parameter, but data can be missing for some days. The task is to calculate the amount per month, and for that we need the values at the start and end of the month. If data is missing, we need the pair of nearest dates around each boundary, i.e.:
[Time] Value
2015-04-28 00:00:00.000 76127
2015-05-03 00:00:00.000 76879
2015-05-22 00:00:00.000 79314
2015-06-07 00:00:00.000 81443
Currently I use the following code:
select
*
from(
select
[Time],
Value,
ROW_NUMBER() over (partition by CASE WHEN [Time] < '2015-05-01' THEN 1 ELSE 0 END order by abs(DATEDIFF(DAY, '2015-05-01', [Time]))) as rn2,
ROW_NUMBER() over (partition by CASE WHEN [Time] > '2015-05-01' THEN 1 ELSE 0 END order by abs(DATEDIFF(DAY, [Time], '2015-05-01'))) as rn3,
ROW_NUMBER() over (partition by CASE WHEN [Time] < '2015-05-31' THEN 1 ELSE 0 END order by abs(DATEDIFF(DAY, '2015-05-31', [Time]))) as rn4,
ROW_NUMBER() over (partition by CASE WHEN [Time] > '2015-05-31' THEN 1 ELSE 0 END order by abs(DATEDIFF(DAY, [Time], '2015-05-31'))) as rn5,
DATEDIFF(DAY, '2015-05-01', [Time]) as doff,
DATEDIFF(DAY, '2015-05-31', [Time]) as doff2
from
ValueTable
where
[Time] between '2015-04-01' and '2015-06-30'
) r
where
doff = 0 or doff2 = 0 or (doff != 0 and rn2 = 1 and rn3 = 1) or (doff2 != 0 and rn4 = 1 and rn5 = 1)
Is there any more efficient way to do it?
The following code is going to look more complicated because it is longer. However, it should be very fast, because it can make very good use of an index on ValueTable([Time]).
The idea is to look for exact matches. If there is no exact match for a boundary date, then find the nearest record before it and the nearest record after it. This requires union all on six subqueries, but each should make optimal use of an index:
with exact_first as (
      select t.*
      from ValueTable t
      where [Time] = '2015-05-01'
     ),
     exact_last as (
      select t.*
      from ValueTable t
      where [Time] = '2015-05-31'
     )
select ef.*
from exact_first ef
union all
-- nearest row before the start date, only when there is no exact match
select b.*
from (select top 1 t.*
      from ValueTable t
      where [Time] < '2015-05-01' and
            not exists (select 1 from exact_first)
      order by [Time] desc
     ) b
union all
-- nearest row after the start date, only when there is no exact match
select a.*
from (select top 1 t.*
      from ValueTable t
      where [Time] > '2015-05-01' and
            not exists (select 1 from exact_first)
      order by [Time]
     ) a
union all
select el.*
from exact_last el
union all
-- nearest row before the end date, only when there is no exact match
select b.*
from (select top 1 t.*
      from ValueTable t
      where [Time] < '2015-05-31' and
            not exists (select 1 from exact_last)
      order by [Time] desc
     ) b
union all
-- nearest row after the end date, only when there is no exact match
select a.*
from (select top 1 t.*
      from ValueTable t
      where [Time] > '2015-05-31' and
            not exists (select 1 from exact_last)
      order by [Time]
     ) a;
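The lookup logic for one boundary date can be sketched in Python with a binary search over the sorted timestamps, which is roughly what an index seek on [Time] gives you. The function name is hypothetical, and the rows are the four sample snapshots from the question.

```python
from bisect import bisect_left
from datetime import date

snapshots = [
    (date(2015, 4, 28), 76127),
    (date(2015, 5, 3), 76879),
    (date(2015, 5, 22), 79314),
    (date(2015, 6, 7), 81443),
]

def boundary_rows(rows, boundary):
    """Return the exact match if present, else the nearest rows before and after.

    Assumes rows are sorted by date, as an index on [Time] would deliver them.
    """
    times = [t for t, _ in rows]
    i = bisect_left(times, boundary)  # first position with time >= boundary
    if i < len(rows) and times[i] == boundary:
        return [rows[i]]
    nearest = []
    if i > 0:
        nearest.append(rows[i - 1])  # latest row before the boundary
    if i < len(rows):
        nearest.append(rows[i])      # earliest row after the boundary
    return nearest

month_start = boundary_rows(snapshots, date(2015, 5, 1))
month_end = boundary_rows(snapshots, date(2015, 5, 31))
print(month_start)
print(month_end)
```

For May 2015 this picks 04-28 and 05-03 around the start of the month, and 05-22 and 06-07 around the end.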

SQL query to return data corresponding to all values of a column except for the min value of that column

I have a table with the following columns:
userid, datetime, type
Sample data:
userid datetime type
1 2013-08-01 08:10:00 I
1 2013-08-01 08:12:00 I
1 2013-08-01 08:12:56 I
I need to fetch only the two rows other than the row with min(datetime).
My query to fetch data for min(datetime) is:
SELECT
USERID, MIN(CHECKTIME) as ChkTime, CHECKTYPE, COUNT(*) AS CountRows
FROM
T1
WHERE
MONTH(CONVERT(DATETIME, CHECKTIME)) = MONTH(DATEADD(MONTH, -1,
CONVERT(DATE, GETDATE())))
AND YEAR(CONVERT(DATETIME, CHECKTIME)) = YEAR(GETDATE()) AND USERID=35
AND CHECKTYPE='I'
GROUP BY
CONVERT(DATE, CHECKTIME), USERID, CHECKTYPE
HAVING
COUNT(*) > 1
A little help would be much appreciated, thanks.
Maybe something like this will help you:
WITH CTE AS
(
SELECT *, ROW_NUMBER() OVER (PARTITION BY userid ORDER BY checktime) RN
FROM dbo.T1
WHERE CHECKTYPE = 'I'
--add your conditions here
)
SELECT * FROM CTE
WHERE RN > 1
Using CTE and ROW_NUMBER() function this will select all rows except min(date) for each user.
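In case the ROW_NUMBER() filter is unclear, here is a tiny Python sketch of the same idea on the sample rows: number each user's rows by time and keep everything after the first. The function name is made up, and the timestamps are compared as ISO-formatted strings, which sort chronologically.

```python
from itertools import groupby
from operator import itemgetter

rows = [
    (1, "2013-08-01 08:10:00", "I"),
    (1, "2013-08-01 08:12:00", "I"),
    (1, "2013-08-01 08:12:56", "I"),
]

def drop_earliest_per_user(rows):
    """Keep every row except the earliest one for each userid (RN > 1)."""
    ordered = sorted(rows, key=itemgetter(0, 1))  # by userid, then datetime
    kept = []
    for _, user_rows in groupby(ordered, key=itemgetter(0)):
        for rn, row in enumerate(user_rows, start=1):
            if rn > 1:  # rn == 1 is the row with the minimum datetime
                kept.append(row)
    return kept

later_rows = drop_earliest_per_user(rows)
for row in later_rows:
    print(row)
```

If two rows tie for the minimum datetime, this keeps one of them, just as ROW_NUMBER() would.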
SELECT * FROM YOURTABLE A
INNER JOIN
(SELECT USERID,TYPE,MIN(datetime) datetime FROM YOURTABLE GROUP BY USERID,TYPE )B
ON
A.USERID=B.USERID AND
A.TYPE=B.TYPE
WHERE A.DATETIME<>B.DATETIME