Calculating consecutive range of dates with a value in Hive - hive

I want to know if it is possible to calculate the consecutive ranges of a specific value for a group of Id's and return the calculated value(s) of each one.
Given the following data:
+----+----------+--------+
| ID | DATE_KEY | CREDIT |
+----+----------+--------+
| 1 | 8091 | 0.9 |
| 1 | 8092 | 20 |
| 1 | 8095 | 0.22 |
| 1 | 8096 | 0.23 |
| 1 | 8098 | 0.23 |
| 2 | 8095 | 12 |
| 2 | 8096 | 18 |
| 2 | 8097 | 3 |
| 2 | 8098 | 0.25 |
+----+----------+--------+
I want the following output:
+----+-------------------------------+
| ID | RANGE_DAYS_CREDIT_LESS_THAN_1 |
+----+-------------------------------+
| 1 | 1 |
| 1 | 2 |
| 1 | 1 |
| 2 | 1 |
+----+-------------------------------+
In this case, the ranges are the consecutive days with credit less than 1. If there is a gap between date_key column, then the range won't have to take the next value, like in ID 1 between 8096 and 8098 date key.
Is it possible to do this with windowing functions in Hive?
Thanks in advance!

You can do this with a running sum classifying rows into groups, incrementing by 1 every time a credit<1 row is found(in the date_key order). Thereafter it is just a group by.
select id,count(*) as range_days_credit_lt_1
from (select t.*
,sum(case when credit<1 then 0 else 1 end) over(partition by id order by date_key) as grp
from tbl t
) t
where credit<1
group by id

The key is to collapse all the consecutive sequence and compute their length, I struggled to achieve this in a relatively clumsy way:
with t_test as
(
select num,row_number()over(order by num) as rn
from
(
select explode(array(1,3,4,5,6,9,10,15)) as num
)
)
select length(sign)+1 from
(
select explode(continue_sign) as sign
from
(
select split(concat_ws('',collect_list(if(d>1,'v',d))), 'v') as continue_sign
from
(
select t0.num-t1.num as d from t_test t0
join t_test t1 on t0.rn=t1.rn+1
)
)
)
Get the previous number b in the seq for each original a;
Check if a-b == 1, which shows if there is a "gap", marked as 'v';
Merge all a-b to a string, and then split using 'v', and compute length.
To get the ID column out, another string which encode id should be considered.

Related

Get some values from the table by selecting

I have a table:
| id | Number |Address
| -----| ------------|-----------
| 1 | 0 | NULL
| 1 | 1 | NULL
| 1 | 2 | 50
| 1 | 3 | NULL
| 2 | 0 | 10
| 3 | 1 | 30
| 3 | 2 | 20
| 3 | 3 | 20
| 4 | 0 | 75
| 4 | 1 | 22
| 4 | 2 | 30
| 5 | 0 | NULL
I need to get: the NUMBER of the last ADDRESS change for each ID.
I wrote this select:
select dh.id, dh.number from table dh where dh =
(select max(min(t.history)) from table t where t.id = dh.id group by t.address)
But this select not correctly handling the case when the address first changed, and then changed to the previous value. For example id=1: group by return:
| Number |
| -------- |
| NULL |
| 50 |
I have been thinking about this select for several days, and I will be happy to receive any help.
You can do this using row_number() -- twice:
select t.id, min(number)
from (select t.*,
row_number() over (partition by id order by number desc) as seqnum1,
row_number() over (partition by id, address order by number desc) as seqnum2
from t
) t
where seqnum1 = seqnum2
group by id;
What this does is enumerate the rows by number in descending order:
Once per id.
Once per id and address.
These values are the same only when the value is 1, which is the most recent address in the data. Then aggregation pulls back the earliest row in this group.
I answered my question myself, if anyone needs it, my solution:
select * from table dh1 where dh1.number = (
select max(x.number)
from (
select
dh2.id, dh2.number, dh2.address, lag(dh2.address) over(order by dh2.number asc) as prev
from table dh2 where dh1.id=dh2.id
) x
where NVL(x.address, 0) <> NVL(x.prev, 0)
);

Partition & consecutive in SQL

fellow stackers
I have a data set like so:
+---------+------+--------+
| user_id | date | metric |
+---------+------+--------+
| 1 | 1 | 1 |
| 1 | 2 | 1 |
| 1 | 3 | 1 |
| 2 | 1 | 1 |
| 2 | 2 | 1 |
| 2 | 3 | 0 |
| 2 | 4 | 1 |
+---------+------+--------+
I am looking to flag those customers who has 3 consecutive "1"s in the metric column. I have a solution as below.
select distinct user_id
from (
select user_id
,metric +
ifnull( lag(metric, 1) OVER (PARTITION BY user_id ORDER BY date), 0 ) +
ifnull( lag(metric, 2) OVER (PARTITION BY user_id ORDER BY date), 0 )
as consecutive_3
from df
) b
where consecutive_3 = 3
While it works it is not scalable. As one can imagine what the above query would look like if I were looking for a consecutive 50.
May I ask if there is a scalable solution? Any cloud SQL will do. Thank you.
If you only want such users, you can use a sum(). Assuming that metric is only 0 or 1:
select user_id,
(case when max(metric_3) = 3 then 1 else 0 end) as flag_3
from (select df.*,
sum(metric) over (partition by user_id
order by date
rows between 2 preceding and current row
) as metric_3
from df
) df
group by user_id;
By using a windowing clause, you can easily expand to as many adjacent 1s as you like.

How do I do an Oracle SQL update from select in a specific order?

I have a table with old values (some null) and new values for various attributes, all inserted at different add times throughout the months. I'm trying to update a second table with records with business month end dates. Right now, these records only contain the most recent new values for all month end dates. The goal is to create historical data by updating the previous month end values with the old values from the first table. I am a beginner and was able to come up with a query to update on one object where there was one entry from the first table. Now I am trying to expand the query to include multiple objects, with possible, multiple old values within the same month. I tried to use "order by" (since I need to make updates for a month in ascending order so it gets the latest value) but read that doesn't work with update statements without a subquery. So I tried my hand at making a more complicated query, without success. I am getting the following error: single-row subquery returns more than one row. Thanks!
TableA:
| ID | TYPE | OLD_VALUE | NEW_VALUE | ADD_TIME|
-----------------------------------------------
| 1 | A | 2 | 3 | 1/11/2019 8:00:00am |
| 1 | B | 3 | 4 | 12/10/2018 8:00:00am|
| 1 | B | 4 | 5 | 12/11/2018 8:00:00am|
| 2 | A | 5 | 1 | 12/5/2018 08:00:00am|
| 2 | A | 1 | 2 | 12/5/2019 09:00:00am|
| 2 | A | 2 | 3 | 12/5/2019 10:00:00am|
| 2 | B | 1 | 2 | 12/5/2019 10:00:00am|
TableB
| ID | MONTH_END | TYPE_A | TYPE_B |
-----------------------------------
| 1 | 1/31/19 | 3 | 5 |
| 1 | 12/31/18 | 3 | 5 |
| 1 | 11/30/18 | 3 | 5 |
| 2 | 12/31/18 | 3 | 2 |
| 2 | 11/30/18 | 3 | 2 |
Desired Output for TableB
| ID | MONTH_END | TYPE_A | TYPE_B |
-----------------------------------
| 1 | 1/31/19 | 3 | 5 |
| 1 | 12/31/18 | 2 | 5 |
| 1 | 11/30/18 | 2 | 3 |
| 2 | 12/31/18 | 3 | 2 |
| 2 | 11/30/18 | 5 | 2 |
My Query for Type A (Which I plan to adapt for Type B and execute as well for the desired output)
update TableB B
set b.type_a =
(
with aa as
(
select id, nvl(old_value, new_value) typea, add_time
from TableA
where type = 'A'
order by id, add_time ascending
)
select typea
from aa
where aa.id = b.id
and b.month_end <= aa.add_tm
)
where exists
(
with aa as
(
select id, nvl(old_value, new_value) typea, add_time
from TableA
where type = 'A'
order by id, add_time ascending
)
select typea
from aa
where aa.id = b.id
and b.month_end <= aa.add_tm
)
Kudo's for giving example input data and desired output. I found your question a bit confusing so let me rephrase to "Provide the last type a value from table a that is in the same month as the month end.
By matching on type and date of entry, we can get your answer. The "ROWNUM=1" is to limit result set to a single entry in case there is more than one row with the same add_time. This SQL is still a mess, maybe someone else can come up with a better one.
UPDATE tableb b
SET b.typea =
(SELECT old_value
FROM tablea a
WHERE LAST_DAY( TRUNC( a.add_time ) ) = b.month_end
AND TYPE = 'A'
AND add_time =
(SELECT MAX( add_time )
FROM tablea
WHERE TYPE = 'A' AND LAST_DAY( TRUNC( a.add_time ) ) = b.month_end)
AND ROWNUM = 1)
WHERE EXISTS
(SELECT old_value
FROM tablea a
WHERE LAST_DAY( TRUNC( a.add_time ) ) = b.month_end AND TYPE = 'A');

SQL: Select single item per name with multiple criteria

I'm trying to select a single item per value in a "Name" column according to several criteria.
The criteria I want to use look like this:
Only include results where IsEnabled = 1
Return the single result with the lowest priority (we're using 1 to mean "top priority")
In case of a tie, return the result with the newest Timestamp
I've seen several other questions that ask about returning the newest timestamp for a given value, and I've been able to adapt that to return the minimum value of Priority - but I can't figure out how to filter off of both Priority and Timestamp.
Here is the question that's been most helpful in getting me this far.
Sample data:
+------+------------+-----------+----------+
| Name | Timestamp | IsEnabled | Priority |
+------+------------+-----------+----------+
| A | 2018-01-01 | 1 | 1 |
| A | 2018-03-01 | 1 | 5 |
| B | 2018-01-01 | 1 | 1 |
| B | 2018-03-01 | 0 | 1 |
| C | 2018-01-01 | 1 | 1 |
| C | 2018-03-01 | 1 | 1 |
| C | 2018-05-01 | 0 | 1 |
| C | 2018-06-01 | 1 | 5 |
+------+------------+-----------+----------+
Desired output:
+------+------------+-----------+----------+
| Name | Timestamp | IsEnabled | Priority |
+------+------------+-----------+----------+
| A | 2018-01-01 | 1 | 1 |
| B | 2018-01-01 | 1 | 1 |
| C | 2018-03-01 | 1 | 1 |
+------+------------+-----------+----------+
What I've tried so far (this gets me only enabled items with lowest priority, but does not filter for the newest item in case of a tie):
SELECT DATA.Name, DATA.Timestamp, DATA.IsEnabled, DATA.Priority
From MyData AS DATA
INNER JOIN (
SELECT MIN(Priority) Priority, Name
FROM MyData
GROUP BY Name
) AS Temp ON DATA.Name = Temp.Name AND DATA.Priority = TEMP.Priority
WHERE IsEnabled=1
Here is a SQL fiddle as well.
How can I enhance this query to only return the newest result in addition to the existing filters?
Use row_number():
select d.*
from (select d.*,
row_number() over (partition by name order by priority, timestamp) as seqnum
from mydata d
where isenabled = 1
) d
where seqnum = 1;
The most effective way that I've found for these problems is using CTEs and ROW_NUMBER()
WITH CTE AS(
SELECT *, ROW_NUMBER() OVER( PARTITION BY Name ORDER BY Priority, TimeStamp DESC) rn
FROM MyData
WHERE IsEnabled = 1
)
SELECT Name, Timestamp, IsEnabled, Priority
From CTE
WHERE rn = 1;

Efficient ROW_NUMBER increment when column matches value

I'm trying to find an efficient way to derive the column Expected below from only Id and State. What I want is for the number Expected to increase each time State is 0 (ordered by Id).
+----+-------+----------+
| Id | State | Expected |
+----+-------+----------+
| 1 | 0 | 1 |
| 2 | 1 | 1 |
| 3 | 0 | 2 |
| 4 | 1 | 2 |
| 5 | 4 | 2 |
| 6 | 2 | 2 |
| 7 | 3 | 2 |
| 8 | 0 | 3 |
| 9 | 5 | 3 |
| 10 | 3 | 3 |
| 11 | 1 | 3 |
+----+-------+----------+
I have managed to accomplish this with the following SQL, but the execution time is very poor when the data set is large:
WITH Groups AS
(
SELECT Id, ROW_NUMBER() OVER (ORDER BY Id) AS GroupId FROM tblState WHERE State=0
)
SELECT S.Id, S.[State], S.Expected, G.GroupId FROM tblState S
OUTER APPLY (SELECT TOP 1 GroupId FROM Groups WHERE Groups.Id <= S.Id ORDER BY Id DESC) G
Is there a simpler and more efficient way to produce this result? (In SQL Server 2012 or later)
Just use a cumulative sum:
select s.*,
sum(case when state = 0 then 1 else 0 end) over (order by id) as expected
from tblState s;
Other method uses subquery :
select *,
(select count(*)
from table t1
where t1.id < t.id and state = 0
) as expected
from table t;