Hive Query, any good ways to optimize these unions? - sql

All, I'm new to Hive and to query optimization in general.
I have 3 unions that are more or less the exact same query. These unions exist only because my source table does NOT have weekend or holiday dates, and for each missing holiday/weekend date I need to persist a few basic values from the preceding calendar day that does exist in the source table. The date_add offset (1, 2, or 3 days) is really the only difference between the 3 unions.
Is there any way to combine these 3 queries into one, or perhaps just do this in a more performant way?
I'm a bit stuck but I've already got this down from a 45 minute overall process to 4 1/2 minutes. Just not sure how to optimize these unions. Please help :/
UNION ALL
--ADDING 1 DAYS TO FRIDAYS--
select * from
(
SELECT a.portfolio_name, cast(date_add(performance_end_date,1) as timestamp) as performance_end_date, cast(0.0000000 as string) as car_return, a.nav, a.nav_id
,row_number() over (partition by a.portfolio_code,a.performance_end_date order by a.nav_id desc) as row_no
FROM carsales a
where
a.portfolio_code IN ('1994',1998,2523)
and a.year=2020 and a.month=09
and DAYOFWEEK(performance_end_date) = 6
) a
where row_no= 1
UNION ALL
--ADDING 2 DAYS TO FRIDAYS--
select * from
(
SELECT a.portfolio_name, cast(date_add(performance_end_date,2) as timestamp) as performance_end_date, cast(0.0000000 as string) as car_return, a.nav, a.nav_id
,row_number() over (partition by a.portfolio_code,a.performance_end_date order by a.nav_id desc) as row_no
FROM carsales a
where
a.portfolio_code IN ('1994',1998,2523)
and a.year=2020 and a.month=09
and DAYOFWEEK(performance_end_date) = 6
) a
where row_no= 1
UNION ALL
--ADDING 3 DAYS To Holidays
select * from
(
SELECT a.portfolio_name, cast(date_add(performance_end_date,3) as timestamp) as performance_end_date, cast(0.0000000 as string) as car_return, a.nav, a.nav_id
,row_number() over (partition by a.portfolio_code,a.performance_end_date order by a.nav_id desc) as row_no
FROM carsales a
where
a.portfolio_code IN ('1994',1998,2523)
and a.year=2020 and a.month=09
and performance_end_date in ('2020-09-04 00:00:00.000','2020-10-09 00:00:00.000')
) a
where row_no= 1

If it's exactly as you wrote, and the only difference is the date_add parameter, you could take the SQL from one of the unions and cross join it with a union of the constants 1, 2 and 3. The cross join may perform better than the union, though that also depends on the row counts in the source. You could also filter on the row number before doing the cross join, so that fewer rows are joined. In the example below I didn't filter on the row number.
The query will look like this:
SELECT a.portfolio_name,
Cast(Date_add(a.performance_end_date, crs.crs) AS TIMESTAMP) AS
performance_end_date,
a.car_return,
a.nav,
a.nav_id,
a.performance_end_date,
a.row_no
FROM (SELECT a.portfolio_name,
-- Cast(Date_add(performance_end_date, 1) AS TIMESTAMP) AS performance_end_date,
Cast(0.0000000 AS STRING) AS car_return,
a.nav,
a.nav_id,
a.performance_end_date,
Row_number()
OVER (
partition BY a.portfolio_code, a.performance_end_date
ORDER BY a.nav_id DESC) AS row_no
FROM carsales a
WHERE a.portfolio_code IN ( '1994', 1998, 2523 )
AND a.year = 2020
AND a.month = 09
AND Dayofweek(performance_end_date) = 6) a
CROSS JOIN (SELECT 1 crs
UNION ALL
SELECT 2
UNION ALL
SELECT 3) crs
EDIT 1: regarding the comments about date1 or date2, you could do exactly as you wrote: in the where clause, put date_column = something or date_column = something.
SELECT a.portfolio_name,
Cast(Date_add(a.performance_end_date, crs.crs) AS TIMESTAMP) AS
performance_end_date,
a.car_return,
a.nav,
a.nav_id,
a.performance_end_date,
a.row_no
FROM (SELECT a.portfolio_name,
-- Cast(Date_add(performance_end_date, 1) AS TIMESTAMP) AS performance_end_date,
Cast(0.0000000 AS STRING) AS car_return,
a.nav,
a.nav_id,
a.performance_end_date,
Row_number()
OVER (
partition BY a.portfolio_code, a.performance_end_date
ORDER BY a.nav_id DESC) AS row_no
FROM carsales a
WHERE a.portfolio_code IN ( '1994', 1998, 2523 )
AND a.year = 2020
AND a.month = 09
AND (Dayofweek(performance_end_date) = 6 or performance_end_date in ('2020-09-04 00:00:00.000','2020-10-09 00:00:00.000'))
) a
CROSS JOIN (SELECT 1 crs
UNION ALL
SELECT 2
UNION ALL
SELECT 3) crs
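The effect of the cross join can be sketched outside Hive. The snippet below is only an illustration in Python, not part of the original answer: the sample rows and the names `fridays`, `offsets` and `shifted` are made up, and only two columns are kept.

```python
from datetime import date, timedelta
from itertools import product

# Hypothetical sample rows: (portfolio_name, performance_end_date), both Fridays.
fridays = [
    ("fund_a", date(2020, 9, 4)),
    ("fund_a", date(2020, 9, 11)),
]

# Plays the role of the "crs" derived table: one row per day offset.
offsets = [1, 2, 3]

# CROSS JOIN: every source row is paired with every offset,
# and date_add shifts the date by that offset.
shifted = [
    (name, end_date + timedelta(days=n))
    for (name, end_date), n in product(fridays, offsets)
]

for row in shifted:
    print(row)
```

As in the query, every source row receives all three offsets. If the 3-day offset should apply only to the holiday dates, as in the original unions, the pairs would need an additional filter.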

In addition to @F.Lazarescu's answer, you can rewrite the CROSS JOIN subquery.
Instead of this:
CROSS JOIN (SELECT 1 crs
UNION ALL
SELECT 2
UNION ALL
SELECT 3) crs
Use stack() UDTF, it will perform faster:
CROSS JOIN (SELECT stack(3, 1,2,3) as crs) crs

Related

Get range of dates from dates record in MS SQL

I have dates record
with DateTable (dateItem) as
(
select '2022-07-03' union all
select '2022-07-05' union all
select '2022-07-04' union all
select '2022-07-09' union all
select '2022-07-12' union all
select '2022-07-13' union all
select '2022-07-18'
)
select dateItem
from DateTable
order by 1 asc
I want to get ranges of dates between this record like this
with DateTableRange (dateItemStart, dateItemend) as
(
select '2022-07-03','2022-07-05' union all
select '2022-07-09','2022-07-09' union all
select '2022-07-12','2022-07-13' union all
select '2022-07-18','2022-07-18'
)
select dateItemStart, dateItemend
from DateTableRange
I can do this in SQL by looping: either with a while loop, or by taking the first date, checking whether each following date is exactly one day later, and extending the end date for as long as that holds.
But I don't know what the best or most optimized way is, as this involves a lot of looping and temp tables.
Edited:
In the data we have the 3rd, 4th and 5th, while the 6th, 7th and 8th are missing, so the range is 3-5.
The 9th exists and the 10th is missing, so the range is 9-9.
So the ranges depend purely on the consecutive dates in DateTable.
Any suggestions will be appreciated.
With some additional clarity this requires a gaps-and-islands approach to first identify adjacent rows as groups, from which you can then use a window to identify the first and last value of each group.
I'm sure this could be refined further but should give your desired results:
with DateTable (dateItem) as
(
select '2022-07-03' union all
select '2022-07-05' union all
select '2022-07-04' union all
select '2022-07-09' union all
select '2022-07-12' union all
select '2022-07-13' union all
select '2022-07-18'
), valid as (
select *,
case when exists (
select * from DateTable d2 where Abs(DateDiff(day, d.dateitem, d2.dateitem)) = 1
) then 1 else 0 end v
from DateTable d
), grp as (
select *,
Row_Number() over(order by dateitem) - Row_Number()
over (partition by v order by dateitem) g
from Valid v
)
select distinct
Iif(v = 0, dateitem, First_Value(dateitem) over(partition by g order by dateitem)) DateItemStart,
Iif(v = 0, dateitem, First_Value(dateitem) over(partition by g order by dateitem desc)) DateItemEnd
from grp
order by dateItemStart;
See Demo Fiddle
After clarification, this is definitely a 'gaps and islands' problem.
The solution can be like this
WITH DateTable(dateItem) AS
(
SELECT * FROM (
VALUES
('2022-07-03'),
('2022-07-05'),
('2022-07-04'),
('2022-07-09'),
('2022-07-12'),
('2022-07-13'),
('2022-07-18')
) t(v)
)
SELECT
MIN(dateItem) AS range_from,
MAX(dateItem) AS range_to
FROM (
SELECT
*,
SUM(CASE WHEN DATEADD(day, 1, prev_dateItem) >= dateItem THEN 0 ELSE 1 END) OVER (ORDER BY rn) AS range_id
FROM (
SELECT
ROW_NUMBER() OVER (ORDER BY dateItem) AS rn,
CAST(dateItem AS date) AS dateItem,
CAST(LAG(dateItem) OVER (ORDER BY dateItem) AS date) AS prev_dateItem
FROM DateTable
) groups
) islands
GROUP BY range_id
You can check a working demo
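The LAG plus running-sum logic of the query above can be traced step by step. This is a rough Python sketch using the question's sample dates; the variable names are illustrative and do not come from the answer.

```python
from datetime import date, timedelta
from itertools import accumulate

# Sample dates from the question, deliberately unsorted.
dates = [date(2022, 7, d) for d in (3, 5, 4, 9, 12, 13, 18)]

ordered = sorted(dates)

# LAG: pair each date with its predecessor (None for the first row).
previous = [None] + ordered[:-1]

# 1 marks the start of a new island, 0 a continuation of the current one.
# A missing predecessor counts as a start, like the NULL comparison in SQL.
starts = [
    0 if prev is not None and prev + timedelta(days=1) >= cur else 1
    for prev, cur in zip(previous, ordered)
]

# Running SUM of the start flags gives each island its own id.
range_ids = list(accumulate(starts))

# GROUP BY range_id with MIN and MAX.
ranges = {}
for rid, cur in zip(range_ids, ordered):
    lo, hi = ranges.get(rid, (cur, cur))
    ranges[rid] = (min(lo, cur), max(hi, cur))

result = [ranges[rid] for rid in sorted(ranges)]
for range_from, range_to in result:
    print(range_from, range_to)
```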

conditional running sum

I'm trying to return the number of unique users that converted over time.
So I have the following query:
WITH CTE
As
(
SELECT '2020-04-01' as date,'userA' as user,1 as goals Union all
SELECT '2020-04-01','userB',0 Union all
SELECT '2020-04-01','userC',0 Union all
SELECT '2020-04-03','userA',1 Union all
SELECT '2020-04-05','userC',1 Union all
SELECT '2020-04-06','userC',0 Union all
SELECT '2020-04-06','userB',0
)
select
date,
COUNT(DISTINCT
IF
(goals >= 1,
user,
NULL)) AS cad_converters
from CTE
group by date
I'm trying to count distinct users, but I need to find a way to apply the distinct count across the whole date range. I probably need to do something like a cumulative sum...
expected result would be something like this
date, goals, total_unique_converted_users
'2020-04-01',1,1
'2020-04-01',0,1
'2020-04-01',0,1
'2020-04-03',1,2
'2020-04-05',1,2
'2020-04-06',0,2
'2020-04-06',0,2
Below is for BigQuery Standard SQL
#standardSQL
SELECT t.date, t.goals, total_unique_converted_users
FROM `project.dataset.table` t
LEFT JOIN (
SELECT a.date,
COUNT(DISTINCT IF(b.goals >= 1, b.user, NULL)) AS total_unique_converted_users
FROM `project.dataset.table` a
CROSS JOIN `project.dataset.table` b
WHERE a.date >= b.date
GROUP BY a.date
)
USING(date)
I would approach this by tagging when the first goal is scored for each user. Then simply do a cumulative sum:
select cte.* except (seqnum), countif(seqnum = 1) over (order by date)
from (select cte.*,
(case when goals = 1 then row_number() over (partition by user, goals order by date) end) as seqnum
from cte
) cte;
I realize this can be expressed without the case in the subquery:
select cte.* except (seqnum), countif(seqnum = 1 and goals = 1) over (order by date)
from (select cte.*,
row_number() over (partition by user, goals order by date) as seqnum
from cte
) cte;
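To check what the answers compute, here is a small Python sketch of a running count of distinct converted users, using the rows from the question's CTE. The names `converted` and `total_by_date` are made up for the illustration.

```python
# Rows from the question's CTE: (date, user, goals).
rows = [
    ("2020-04-01", "userA", 1),
    ("2020-04-01", "userB", 0),
    ("2020-04-01", "userC", 0),
    ("2020-04-03", "userA", 1),
    ("2020-04-05", "userC", 1),
    ("2020-04-06", "userC", 0),
    ("2020-04-06", "userB", 0),
]

# First pass: for each date, count the users converted on or before it.
converted = set()
total_by_date = {}
for day in sorted({d for d, _, _ in rows}):
    converted |= {u for d, u, g in rows if d == day and g >= 1}
    total_by_date[day] = len(converted)

# Second pass: attach the running total to every original row.
result = [(d, g, total_by_date[d]) for d, _, g in rows]
for row in result:
    print(row)
```

With this sample data the count stays at 1 on 2020-04-03, because userA had already converted on 2020-04-01, and reaches 2 on 2020-04-05 when userC converts.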

How to get the validity date range of a price from individual daily prices in SQL

I have some prices for the month of January.
Date,Price
1,100
2,100
3,115
4,120
5,120
6,100
7,100
8,120
9,120
10,120
Now, the output I need is a non-overlapping date range for each price.
price,from,To
100,1,2
115,3,3
120,4,5
100,6,7
120,8,10
I need to do this using SQL only.
For now, if I simply group by and take min and max dates, I get the below, which is an overlapping range:
price,from,to
100,1,7
115,3,3
120,4,10
This is a gaps-and-islands problem. The simplest solution is the difference of row numbers:
select price, min(date), max(date)
from (select t.*,
row_number() over (order by date) as seqnum,
row_number() over (partition by price order by date) as seqnum2
from t
) t
group by price, (seqnum - seqnum2)
order by min(date);
Why this works is a little hard to explain. But if you look at the results of the subquery, you will see how the adjacent rows are identified by the difference in the two values.
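To make the subquery's intermediate values visible, here is a Python sketch that computes both row numbers and their difference for the question's sample data. It only mirrors the logic of the query; the helper names are illustrative.

```python
from collections import defaultdict

# Sample data from the question: (date, price).
rows = [(1, 100), (2, 100), (3, 115), (4, 120), (5, 120),
        (6, 100), (7, 100), (8, 120), (9, 120), (10, 120)]

rows.sort()  # ORDER BY date

per_price_counter = defaultdict(int)  # row_number() partitioned by price
islands = {}
for seqnum, (day, price) in enumerate(rows, start=1):
    per_price_counter[price] += 1
    seqnum2 = per_price_counter[price]
    # Within one run of equal prices both counters advance together,
    # so their difference stays constant for the whole run.
    key = (price, seqnum - seqnum2)
    first, last = islands.get(key, (day, day))
    islands[key] = (min(first, day), max(last, day))
    print(day, price, seqnum, seqnum2, seqnum - seqnum2)

# GROUP BY price, (seqnum - seqnum2) ORDER BY min(date)
result = sorted(
    ((price, lo, hi) for (price, _), (lo, hi) in islands.items()),
    key=lambda r: r[1],
)
print(result)
```

The difference alone is not unique (days 4-5 at price 120 and days 6-7 at price 100 both give 3), which is why the grouping key also includes the price.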
SELECT Lag.price,Lag.[date] AS [From], MIN(Lead.[date]-Lag.[date])+Lag.[date] AS [to]
FROM
(
SELECT [date],[Price]
FROM
(
SELECT [date],[Price],LAG(Price) OVER (ORDER BY DATE,Price) AS LagID FROM #table1 A
)B
WHERE CASE WHEN Price <> ISNULL(LagID,1) THEN 1 ELSE 0 END = 1
)Lag
JOIN
(
SELECT [date],[Price]
FROM
(
SELECT [date],Price,LEAD(Price) OVER (ORDER BY DATE,Price) AS LeadID FROM [#table1] A
)B
WHERE CASE WHEN Price <> ISNULL(LeadID,1) THEN 1 ELSE 0 END = 1
)Lead
ON Lag.[Price] = Lead.[Price]
WHERE Lead.[date]-Lag.[date] >= 0
GROUP BY Lag.[date],Lag.[price]
ORDER BY Lag.[date]
Another method using ROWS UNBOUNDED PRECEDING
SELECT price, MIN([date]) AS [from], [end_date] AS [To]
FROM
(
SELECT *, MIN([abc]) OVER (ORDER BY DATE DESC ROWS UNBOUNDED PRECEDING ) end_date
FROM
(
SELECT *, CASE WHEN price = next_price THEN NULL ELSE DATE END AS abc
FROM
(
SELECT a.* , b.[date] AS next_date, b.price AS next_price
FROM #table1 a
LEFT JOIN #table1 b
ON a.[date] = b.[date]-1
)AA
)BB
)CC
GROUP BY price, end_date

SQL - values from two rows into new two rows

I have a query that gives the summed quantity of items on working days. On weekends and holidays the quantity and item values are empty.
I would like the empty days to show the last known quantity and item.
My query is like this:
select a.dt,b.zaliha as quantity,b.artikal as item
from
(select to_date('01-01-2017', 'DD-MM-YYYY') + rownum -1 dt
from dual
connect by level <= to_date(sysdate) - to_date('01-01-2017', 'DD-MM-YYYY') + 1
order by 1)a
LEFT OUTER JOIN
(select kolicina,sum(kolicina)over(partition by artikal order by datum_do) as zaliha,datum_do,artikal
from
(select sum(vv.kolicinaulaz-vv.kolicinaizlaz)kolicina,vz.datum as datum_do,vv.artikal
from vlpzaglavlja vz, vlpvarijante vv
where vz.id=vv.vlpzaglavlje
and vz.orgjed='01006'
and vv.skladiste='01006'
and vv.artikal in (3069,6402)
group by vz.datum,vv.artikal
order by vv.artikal,vz.datum asc)
order by artikal,datum_do asc)b
on a.dt=b.datum_do
where a.dt between to_date('12102017','ddmmyyyy') and to_date('16102017','ddmmyyyy')
order by a.dt
and my output is like this:
and I want this:
In short, if quantity is null use lag(... ignore nulls) and coalesce or nvl:
select dt, item,
nvl(quantity, lag(quantity ignore nulls) over (partition by item order by dt))
from t
order by dt, item
Here is the full query, I cannot test it, but it is something like:
with t as (
select a.dt, b.zaliha as quantity, b.artikal as item
from (
select date '2017-10-10' + rownum - 1 dt
from dual
connect by date '2017-10-10' + rownum - 1 <= date '2017-10-16' ) a
left join (
select kolicina, datum_do, artikal,
sum(kolicina) over(partition by artikal order by datum_do) as zaliha
from (
select sum(vv.kolicinaulaz-vv.kolicinaizlaz) kolicina,
vz.datum as datum_do, vv.artikal
from vlpzaglavlja vz
join vlpvarijante vv on vz.id = vv.vlpzaglavlje
where vz.orgjed = '01006' and vv.skladiste='01006'
and vv.artikal in (3069,6402)
group by vz.datum, vv.artikal)) b
on a.dt = b.datum_do)
select *
from (
select dt, item,
nvl(quantity, lag(quantity ignore nulls)
over (partition by item order by dt)) qty
from t)
where dt >= date '2017-10-12'
order by dt, item
There are several issues in your query, major and minor:
in the date generator (subquery a) you select dates over a long period, January to September, then join them with the main tables and sum the data, and only then select a small part. Why not filter the dates first?
to_date(sysdate): sysdate is already a date,
use ANSI joins,
do not use order by in subqueries, it has no impact; only the final ordering matters,
use date literals when defining dates, they are more readable.
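The lag(... ignore nulls) idea from this answer amounts to a forward fill: each empty day takes the last non-null quantity before it. A minimal Python sketch, assuming made-up daily rows for a single item (the dates and quantities below are not from the question):

```python
# Hypothetical daily rows for one item: (day, quantity).
# None marks the weekend/holiday gaps left by the outer join.
rows = [
    ("2017-10-12", 40),
    ("2017-10-13", 35),
    ("2017-10-14", None),
    ("2017-10-15", None),
    ("2017-10-16", 50),
]

def forward_fill(rows):
    """Replace each missing quantity with the last known one, in day order."""
    last_known = None
    filled = []
    for day, qty in sorted(rows):
        if qty is not None:
            last_known = qty
        # Stays None if no earlier value exists, like NULL in the query.
        filled.append((day, last_known))
    return filled

for row in forward_fill(rows):
    print(row)
```

With several items, the same fill would be applied separately per item, which is what partition by item does in the query.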

Using T-SQL, how do I generate a result that shows a range of dates

Using SQL Server, how do I generate a result set that shows a range of dates, like so:
StartDate EndDate
01/01/2014 01/04/2014
01/08/2014 01/11/2014
01/14/2014 01/15/2014
The original data had the dates in this format:
ColumnA DateColumn
blah 01/01/2014
blah 01/02/2014
blah 01/03/2014
blah 01/04/2014
blah 01/08/2014
blah 01/09/2014
blah 01/10/2014
blah 01/11/2014
blah 01/14/2014
blah 01/15/2014
Currently, I have a bunch of queries that do this, but I'm wondering if I can do it in less code:
SELECT ROW_NUMBER() OVER(ORDER BY DateColumn) AS rownum,
DateColumn
INTO #main
FROM MyTable
SELECT m1.DateColumn AS TBegin,
m2.DateColumn AS TEnd,
COALESCE(DATEDIFF(day, m2.TimePk, m1.TimePk), 0) AS Gap
INTO #Gap
FROM #main m1
LEFT OUTER JOIN #main m2
ON m1.rownum = m2.rownum + 1
ORDER BY m1.DateColumn
SELECT ROW_NUMBER() OVER(ORDER BY i_id, TBegin) AS rownum,
TBegin
INTO #Begin
FROM #Gap
WHERE Gap <> 1
ORDER BY TBegin
SELECT ROW_NUMBER() OVER(ORDER BY i_id, TEnd) AS rownum,
TEnd
INTO #End
FROM (
SELECT TEnd
FROM #Gap
WHERE Gap > 1
UNION
SELECT MAX(TBegin)
FROM #Gap
) as t
ORDER BY TEnd
SELECT b.TBegin,
e.TEnd
FROM #Begin b
INNER JOIN #End e
ON b.i_id = e.i_id
AND b.rownum = e.rownum
ORDER BY b.TBegin
Any ideas on how to simplify or approach this in an entirely different way?
My approach to these is to identify the first date that has no date preceding it. This is the beginning of a group. Then I take the cumulative sum of that as a group identifier, and do the aggregation.
SQL Server 2008 doesn't have lag or cumulative sums, so I use correlated subqueries for this:
with mt as (
select t.*,
(case when (select top 1 t2.dateColumn
from MyTable t2
where t2.ColumnA = t.ColumnA and
t2.dateColumn < t.dateColumn
order by t2.dateColumn desc
) = dateadd(day, -1, t.datecolumn)
then 0
else 1
end) as IsStart
from MyTable t
),
mtcum as (
select mt.*,
(select sum(mt2.IsStart)
from mt mt2
where mt2.ColumnA = mt.ColumnA and
mt2.dateColumn <= mt.DateColumn
) as grpId
from mt
)
select ColumnA, min(dateColumn) as StartDate, max(dateColumn) as EndDate
from mtcum
group by ColumnA, grpId;
EDIT:
An easier way to approach this is with the observation that the difference between a sequence of consecutive dates and a sequence of numbers is constant.
select columnA, min(dateColumn) as StartDate, max(dateColumn) as EndDate
from (select mt.*, row_number() over (partition by ColumnA order by datecolumn) as seqnum
from mytable mt
) t
group by columnA, dateadd(day, - seqnum, datecolumn);
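The observation behind this query can be checked on the question's dates. The Python sketch below subtracts the row number from each date and groups on the result; it assumes a single ColumnA value, so the partitioning is left out, and the name `anchor` is made up.

```python
from datetime import date, timedelta

# Dates from the question (all rows share ColumnA = 'blah').
dates = [date(2014, 1, d) for d in (1, 2, 3, 4, 8, 9, 10, 11, 14, 15)]

groups = {}
for seqnum, d in enumerate(sorted(dates), start=1):
    # dateadd(day, -seqnum, datecolumn): identical for consecutive dates.
    anchor = d - timedelta(days=seqnum)
    start, end = groups.get(anchor, (d, d))
    groups[anchor] = (min(start, d), max(end, d))

result = sorted(groups.values())
for start_date, end_date in result:
    print(start_date, end_date)
```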
This will work for you. It's still fairly complex though. It uses inner queries to find the first date that is after a gap for each date. This way all days belonging to the same group of dates can be grouped together.
select MIN(DateColumn) StartDate, MAX(DateColumn) EndDate from
(select X.DateColumn, MIN(Y.DateColumn) MinOverGap from
(select DateColumn, ROW_NUMBER() OVER (ORDER BY DateColumn) RowNumber
from MyTable) X
left join
(select DateColumn, ROW_NUMBER() OVER (ORDER BY DateColumn) RowNumber
from MyTable) Y
on DATEADD(d, Y.RowNumber - 1, X.DateColumn) <> DATEADD(d, X.RowNumber -1, Y.DateColumn) AND X.DateColumn < Y.DateColumn
group by x.DateColumn) grouped
group by MinOverGap
order by 1