Hive windowing in sub-query and outer query on DATE - hive

CREATE TABLE big_hive_table(
`partner` string,
start_date date,
end_date date,
`category` string,
`category2` string);
insert into big_hive_table values ('S1','2018-01-01','2018-03-31','c1','M');
insert into big_hive_table values ('S1','2017-12-01','2018-01-31','c1','M');
insert into big_hive_table values ('S1','2017-01-01','2017-11-30','c1','M');
insert into big_hive_table values ('S1','2018-02-01','2018-04-30','c1','M');
insert into big_hive_table values ('S1','2018-02-01','2018-04-30','c1','L');
insert into big_hive_table values ('S2','2018-02-01','2018-04-30','c1','S');
insert into big_hive_table values ('S3','2018-02-01','2018-04-30','c2','S');
insert into big_hive_table values ('S3','2018-01-01','2018-03-31','c2','S');
insert into big_hive_table values ('S3','2017-12-01','2018-01-31','c2','S');
Problem: get oldest start_date and latest end_date for the group (partner, category, category2) if there is an overlapping period
expected result:
S1 01/12/2017 30/04/2018 c1 M
S1 01/01/2017 30/11/2017 c1 M
S1 01/02/2018 30/04/2018 c1 L
S2 01/02/2018 30/04/2018 c1 S
S3 01/12/2017 30/04/2018 c2 S
My query
SELECT DISTINCT partner,
category,
category2,
First_value(start_date) OVER (partition BY partner, category, category2 ORDER BY start_date) period_start,
last_value(end_date) OVER (partition BY partner, category, category2 ORDER BY start_date rows BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED following) period_end
from (select pps.*, sum(start_new_period) over (partition BY partner, category, category2)
FROM ( select partner,
start_date,
end_date,
category,
category2,
lag(end_date) over (partition by partner, category, category2 order by start_date) previous_period_end
, case
when start_date > lag(end_date) over (partition by partner, category, category2 order by start_date)
then 1
else 0
end start_new_period
from big_hive_table
where start_date is not null and end_date is not null) pps
)
Currently I'm getting the following error, when I run 2 inner queries (from select pps.*) or the whole query:
Caused by: java.lang.RuntimeException: org.apache.hadoop.hive.ql.parse.SemanticException:Failed to breakup Windowing invocations into Groups. At least 1 group must only depend on input columns. Also check for circular dependencies.
Underlying error: Primitve type DATE not supported in Value Boundary expression
Can anyone suggest what I'm missing. Thanks for your help.

Just add rows BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED following in your first_value window function and try running again.
Change your query
from
First_value(start_date) OVER (partition BY partner, category, category2
ORDER BY start_date) period_start
To
First_value(start_date) OVER (partition BY partner, category, category2 ORDER BY
start_date rows BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED following) period_start
There is Jira regards to primitive type support and got fixed in Hive.2.1.0

Related

How to calculate total balance in sql using sum?

updated question --
I have a table that contains the following columns:
DROP TABLE TABLE_1;
CREATE TABLE TABLE_1(
TRANSACTION_ID number, USER_KEY number,AMOUNT number,CREATED_DATE DATE, UPDATE_DATE DATE
);
insert into TABLE_1
values ('001','1001',75,'2022-12-02','2022-12-03'),
('001','1001',-74.98,'2022-12-02','2022-12-03'),
('001','1001',74.98,'2022-12-03','2022-12-04'),
('001','1001',-75,'2022-12-03','2022-12-04')
I need to calculate the balance based on the update date. In some cases there can be the same update_date for two different records. When I have this, I want to grab the lower value of the balance.
This is the query I have so far:
select * from (
select TRANSACTION_ID,USER_KEY,AMOUNT,CREATED_DATE,UPDATE_DATE,
sum(AMOUNT) over(partition by USER_KEY order by UPDATE_DATE rows BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as TOTAL_BALANCE_AMOUNT
from TABLE_1
) qualify row_number() over (partition by USER_KEY order by UPDATE_DATE DESC, UPDATE_DATE DESC) = 1
In the query above, it's is grabbing the 75, rather than the 0 after I try to only grab the LAST balance.
Is there a way to include in the qualify query to grab the last balance but if the dates are the same, to grab the lowest balance?
why is the second query, showing 4 different record balances?
That is the point of "running total". If the goal is to have a single value per entire window then order by should be skipped:
select USER_KEY,
sum(AMOUNT) over(partition by USER_KEY) as TOTAL_BALANCE_AMOUNT
from TABLE1;
The partition by clause could be futher expanded with date to produce output per user_key/date:
select USER_KEY,
sum(AMOUNT) over(partition by USER_KEY,date) as TOTAL_BALANCE_AMOUNT
from TABLE1;
I think you're looking for something like this, aggregate by USER_ID, DATE, and then calculate a running sum. If this is not what you're looking for nor is Lukasz Szozda's answer, please edit the question to show the intended output.
create or replace table T1(USER_KEY int, AMOUNT number(38,2), "DATE" date);
insert into T1(USER_KEY, AMOUNT, "DATE") values
(1001, 75, '2022-12-02'),
(1001, -75, '2022-12-02'),
(1001, 75, '2022-12-03'),
(1001, -75, '2022-12-03');
-- Option 1, aggregate after window
select USER_KEY, "DATE", min(TOTAL_BALANCE_AMOUNT) as MINIMUM_BALANCE from
(
select USER_KEY, "DATE", sum(AMOUNT)
over(partition by USER_KEY order by DATE, AMOUNT desc rows BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as TOTAL_BALANCE_AMOUNT from
T1
)
group by USER_KEY, "DATE"
;
--Option 2, qualify by partitioning by user and day, reversing the order of transactions
select USER_KEY, "DATE", sum(AMOUNT)
over(partition by USER_KEY order by DATE, AMOUNT desc rows BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as TOTAL_BALANCE_AMOUNT
from
T1
qualify row_number() over (partition by USER_KEY, DATE order by DATE, AMOUNT asc) = 1
;
USER_KEY
DATE
TOTAL_BALANCE_AMOUNT
1001
2022-12-02 00:00:00
0
1001
2022-12-03 00:00:00
0

How can i group rows on sql base on condition

I am using redshift sql and would like to group users who has overlapping voucher period into a single row instead (showing the minimum start date and max end date)
For E.g if i have these records,
I would like to achieve this result using redshift
Explanation is tat since row 1 and row 2 has overlapping dates, I would like to just combine them together and get the min(Start_date) and max(End_Date)
I do not really know where to start. Tried using row_number to partition them but does not seem to work well. This is what I tried.
select
id,
start_date,
end_date,
lag(end_date, 1) over (partition by id order by start_date) as prev_end_date,
row_number() over (partition by id, (case when prev_end_date >= start_date then 1 else 0) order by start_date) as rn
from users
Are there any suggestions out there? Thank you kind sirs.
This is a type of gaps-and-islands problem. Because the dates are arbitrary, let me suggest the following approach:
Use a cumulative max to get the maximum end_date before the current date.
Use logic to determine when there is no overall (i.e. a new period starts).
A cumulative sum of the starts provides an identifier for the group.
Then aggregate.
As SQL:
select id, min(start_date), max(end_date)
from (select u.*,
sum(case when prev_end_date >= start_date then 0 else 1
end) over (partition by id
order by start_date, voucher_code
rows between unbounded preceding and current row
) as grp
from (select u.*,
max(end_date) over (partition by id
order by start_date, voucher_code
rows between unbounded preceding and 1 preceding
) as prev_end_date
from users u
) u
) u
group by id, grp;
Another approach would be using recursive CTE:
Divide all rows into numbered partitions grouped by id and ordered by start_date and end_date
Iterate over them calculating group_start_date for each row (rows which have to be merged in final result would have the same group_start_date)
Finally you need to group the CTE by id and group_start_date taking max end_date from each group.
Here is corresponding sqlfiddle: http://sqlfiddle.com/#!18/7059b/2
And the SQL, just in case:
WITH cteSequencing AS (
-- Get Values Order
SELECT *, start_date AS group_start_date,
ROW_NUMBER() OVER (PARTITION BY id ORDER BY start_date, end_date) AS iSequence
FROM users),
Recursion AS (
-- Anchor - the first value in groups
SELECT *
FROM cteSequencing
WHERE iSequence = 1
UNION ALL
-- Remaining items
SELECT b.id, b.start_date, b.end_date,
CASE WHEN a.end_date > b.start_date THEN a.group_start_date
ELSE b.start_date
END
AS groupStartDate,
b.iSequence
FROM Recursion AS a
INNER JOIN cteSequencing AS b ON a.iSequence + 1 = b.iSequence AND a.id = b.id)
SELECT id, group_start_date as start_date, MAX(end_date) as end_date FROM Recursion group by id, group_start_date ORDER BY id, group_start_date

Number of sales relative to historical date in previous year

I have a database containing sales transactions. These are in the following (simplified) format:
sales_id | customer_id | sales_date | number_of_units | total_price
The goal for my query is for each of these transactions, to get the number of sales that this specific customer_id made before the current record, during the whole history of this database, but also during the 365 days before the current record.
Lifetime sales works right now, but the last 365 days part has me stuck. My query right now can identify IF a record had at least one sale in the previous 365 days, and I do it like so:
SELECT sales_id ,customer_id,sales_date,number_of_units,total_price,
ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY sales_date ASC) as 'LifeTimeSales' ,
CASE WHEN DATEDIFF(DAY,sales_date,LAG(sales_date, 1) OVER (PARTITION BY customer_id ORDER BY sales_date ASC)) > -365
THEN 1 ELSE 0 END as 'Last365Sales'
FROM sales_db
+ some non-important WHERE clauses. After which I aggregate the result of this query in some other ways.
But this does not tell me if this purchase is for example the 4th sale in the previous 365 days of a customer.
Note:
This is a query that runs daily on the full database with 6 million records and growing. I drop and recreate this table right now, which is obviously not efficient. Updating the table when new sales come in would be ideal, but right now this is not possible to create. Any ideas?
Some test data:
sales_id,customer_id,sales_date,number_of_units,total_price
1001,2001,2016-01-01,1,86
1002,2001,2016-08-01,3,98
1003,2001,2017-06-01,2,87
1004,2002,2017-06-01,2,15
+ expected result:
sales_id,customer_id,sales_date,number_of_units,total_price,LifeTimeSales,Last365Sales
1001,2001,2016-01-01,1,86,0,0
1002,2001,2016-08-01,3,98,1,1
1003,2001,2017-06-01,2,87,2,1
1004,2002,2017-06-01,2,15,0,0
For the count of sales before a sale you could use correlated subqueries.
SELECT s1.sales_id,
s1.customer_id,
s1.sales_date,
s1.number_of_units,
s1.total_price,
(SELECT count(*)
FROM sales_db s2
WHERE s2.customer_id = s1.customer_id
AND s2.sales_date <= s1.sales_date) - 1 lifetimesales,
(SELECT count(*)
FROM sales_db s2
WHERE s2.customer_id = s1.customer_id
AND s2.sales_date <= s1.sales_date
AND s2.sales_date >= dateadd(day, s1.sales_date, -356)) - 1 last365sales
FROM sales_db s1;
(I used s2.sales_date <= s1.sales_date and then subtracted 1 from the reuslt, so that multiple sales on the same day, if such data exists, are also counted. But as this also counts the sale of the current row, it has to be decremented by 1.)
I create report view where all required fields are available.
Select all that you need:
with all_history_statistics as
(select customer_id, sales_id, sales_date, number_of_units, total_price,
max(sales_date) over (partition by customer_id order by (select null)) as last_sale_date,
count(sales_id) over (partition by customer_id order by (select null)) total_number_of_sales,
count(sales_id) over (partition by customer_id order by sales_date asc ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) number_of_sales_for_current_date,
sum(number_of_units) over (partition by customer_id order by (select null)) total_number_saled_units,
sum(number_of_units) over (partition by customer_id order by sales_date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) number_saled_units_for_current_date,
sum(total_price) over (partition by customer_id order by (select null)) as total_earned,
sum(total_price) over (partition by customer_id order by sales_date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) earned_for_current_date)
from sales_db),
with last_year_statistics as
(select customer_id, sales_id, sales_date, number_of_units, total_price,
max(sales_date) over (partition by customer_id order by (select null)) as last_sale_date,
count(sales_id) over (partition by customer_id order by (select null)) total_number_of_sales,
count(sales_id) over (partition by customer_id order by sales_date asc ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) number_of_sales_for_current_date,
sum(number_of_units) over (partition by customer_id order by (select null)) total_number_saled_units,
sum(number_of_units) over (partition by customer_id order by sales_date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) number_saled_units_for_current_date,
sum(total_price) over (partition by customer_id order by (select null)) as total_earned,
sum(total_price) over (partition by customer_id order by sales_date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) earned_for_current_date)
from sales_db)
select <specify list of fields which you need>
from all_history_statistics t1 inner join last_year_statistics
on t1.customer_id = t2.cutomer_id
;

Merge the records for overlapping dates

I have data as below and want to merge the records for overlapping dates. MIN and MAX of start and end dates for overlapping records should be the Start and end date of merged record.
Before merge:
Item Code Start_date End_date
============== =========== ===========
111 15-May-2004 20-Jun-2004
111 22-May-2004 07-Jun-2004
111 20-Jun-2004 13-Aug-2004
111 27-May-2004 30-Aug-2004
111 02-Sep-2004 23-Dec-2004
222 21-May-2004 19-Aug-2004
Required output:
Item Code Start_date End_date
============== =========== ===========
111 15-May-2004 30-Aug-2004
111 02-Sep-2004 23-Dec-2004
222 21-May-2004 19-Aug-2004
you can create sample data using
create table item(item_code number, start_date date, end_date date);
insert into item values (111,to_date('15-May-2004','DD-Mon-YYYY'),to_date('20-Jun-2004','DD-Mon-YYYY'));
insert into item values (111,to_date('22-May-2004','DD-Mon-YYYY'),to_date('07-Jun-2004','DD-Mon-YYYY'));
insert into item values (111,to_date('20-Jun-2004','DD-Mon-YYYY'),to_date('13-Aug-2004','DD-Mon-YYYY'));
insert into item values (111,to_date('27-May-2004','DD-Mon-YYYY'),to_date('30-Aug-2004','DD-Mon-YYYY'));
insert into item values (111,to_date('02-Sep-2004','DD-Mon-YYYY'),to_date('23-Dec-2004','DD-Mon-YYYY'));
insert into item values (222,to_date('21-May-2004','DD-Mon-YYYY'),to_date('19-Aug-2004','DD-Mon-YYYY'));
commit;
The code for this type of problem is rather tricky. Here is one approach that works pretty well:
with item (item_code, start_date, end_date) as (
select 111,to_date('15-05-2004','DD-MM-YYYY'),to_date('20-06-2004','DD-MM-YYYY') from dual union all
select 111,to_date('22-05-2004','DD-MM-YYYY'),to_date('07-06-2004','DD-MM-YYYY') from dual union all
select 111,to_date('20-06-2004','DD-MM-YYYY'),to_date('13-08-2004','DD-MM-YYYY') from dual union all
select 111,to_date('27-05-2004','DD-MM-YYYY'),to_date('30-08-2004','DD-MM-YYYY') from dual union all
select 111,to_date('02-09-2004','DD-MM-YYYY'),to_date('23-12-2004','DD-MM-YYYY') from dual union all
select 222,to_date('21-05-2004','DD-MM-YYYY'),to_date('19-08-2004','DD-MM-YYYY') from dual
),
id as (
select item_code, start_date as dte, count(*) as inc
from item
group by item_code, start_date
union all
select item_code, end_date, - count(*) as inc
from item
group by item_code, end_date
),
id2 as (
select id.*, sum(inc) over (partition by item_code order by dte) as running_inc
from id
),
id3 as (
select id2.*, sum(case when running_inc = 0 then 1 else 0 end) over (partition by item_code order by dte desc) as grp
from id2
)
select item_code, min(dte) as start_date, max(dte) as end_date
from id3
group by item_code, grp;
And a rextester to validate it.
What is this doing? Good question. The idea in these problems is to define the adjacent groups. This method does so by counting the number of "starts" and "ends" up to a given date. When the value is 0, a group ends.
The specific steps are as follows:
(1) Break out all the dates onto separate rows along with an indicator of whether the date is a start date or end date. This indicator is key to defining the ranges -- +1 to "enter" and "-1" to exit.
(2) Calculate the running total of the indicators. The 0s in this total are the ends of overlapping ranges.
(3) Do a reverse cumulative sum of the 0s to identify the groups.
(4) Aggregate to get the final results.
You can look at each of the CTEs to see what is happening in the data.
It's a variation of a gaps&islands problem. First calculate the maximum previous end date for each row. Then filter the rows where the current row's start date is greater than that max date, this is the start of a new group and the group's end date is found in the next row.
WITH max_dates AS
(
SELECT
item_code
,start_date
,Max(end_date) -- get the maximum prevous end_date
Over (PARTITION BY item_code
ORDER BY start_date
ROWS BETWEEN Unbounded Preceding AND 1 Preceding) AS max_prev_date
,Max(end_date) -- get the maximum overall date (only needed for the last group)
Over (PARTITION BY item_code) AS max_date
FROM item
)
SELECT
item_code
,start_date
,Coalesce(Lead(max_prev_date) -- next row got the end date for the current row
Over (PARTITION BY item_code
ORDER BY start_date)
,max_date ) AS end_date -- no next row for the last row --> overall maximum end_date
FROM max_dates
WHERE max_prev_date < start_date -- maximum previous end date is less than current start date --> start of a new group
OR max_prev_date IS NULL -- first row
In SQL Server you can try this. It will give your desired output but as performance point of view the query might slow down, When there is a large number of data to be checked.
DECLARE #item Table(item_code int, start_date date, end_date date);
insert into #item values (111,'15-May-2004','20-Jun-2004');
insert into #item values (111,'22-May-2004','07-Jun-2004');
insert into #item values (111,'20-Jun-2004','13-Aug-2004');
insert into #item values (111,'27-May-2004','30-Aug-2004');
insert into #item values (111,'02-Sep-2004','23-Dec-2004');
insert into #item values (222,'21-May-2004','19-Aug-2004');
SELECT * FROM #item WHERE item_code IN (SELECT item_code FROM #item GROUP BY item_code) AND
(start_date IN (SELECT max(start_date) FROM #item GROUP BY item_code) or start_date In (SELECT min(start_date) FROM #item GROUP BY item_code))
with help of above answers i am able to simplify this as below
WITH max_dates AS
(
SELECT
item_code
,start_date
,end_date
,Max(end_date)
Over (PARTITION BY item_code
ORDER BY start_date
) AS max_date
FROM item
) ,
max_dates1 as
(
select max_dates.* , lag(max_date) over(partition by item_code order by 1) as MPD from max_dates
)
select ITEM_CODE,start_date,end_date from max_dates1
WHERE MPD < start_date
OR MPD IS NULL

transform large price table to table with startdate and enddate

I have a table with prices per article per date with a lot of redundancy: even if the price does not change, I still have a line for each date. What I would like to do is transform this table to a table where for every different price, there will be a new line with a startdate and enddate.
Source example:
article_ID date price
1 01/01/15 2.99
1 02/01/15 2.99
1 03/01/15 2.49
2 01/01/15 12.29
2 02/01/15 12.29
2 03/01/15 12.29
I am looking for an SQL query to create the following result:
article_ID startdate enddate price
1 01/01/15 02/01/15 2.99
1 03/01/15 03/01/15 2.49
2 01/01/15 03/01/15 12.49
I work with SQL Server and Oracle SQL Developer.
You need to identify rows of consecutive dates with the same price, and then group on the resulting identifier. A simpler way to get the group is to subtract an increasing sequence, generated by row_number():
select article_id, min(date) as startdate, max(date) as enddate, price
from (select s.*,
dateadd(day,
- row_number() over (partition by article_id, price
order by date
)
date) as grp
from source s
) s
group by grp, article_id, price;
If you have the possibility of missed dates, then a difference of row numbers works:
select article_id, min(date) as startdate, max(date) as enddate, price
from (select s.*,
(row_number() over (partition by article_id order by date) -
row_number() over (partition by article_id, price order by date)
) as grp
from source s
) s
group by grp, article_id, price;
You could try this:
INSERT INTO destinationtable (article_ID,startdate,enddate.price)
SELECT article_ID, MIN(date) AS startdate, MAX(date) AS enddate, price
FROM sourcetable
GROUP BY article_ID, price
This will not work properly if a price changes back to a previous value. If that is a chance you will have to run a procedural code that loops while price stays constant and tracks start and end date.