COUNT() OVER possible using DISTINCT and WINDOWING IN HIVE

COUNT() OVER possible using DISTINCT and WINDOWING IN HIVE - sql

I want to calculate the number of distinct port numbers that exist between the current row and the X previous rows (sliding window), where x can be any integer number.
For instance,
If the input is:
ID PORT
1 21
2 22
3 23
4 25
5 25
6 21
The output should be:
ID PORT COUNT
1 21 1
2 22 2
3 23 3
4 25 4
5 25 4
6 21 4
I am using Hive, over RapidMiner and I have tried the following:
select id, port,
count (*) over (partition by srcport order by id rows between 5 preceding and current row)
This must work for big data and when X is big integer number.
Any feedback would be appreciated.

I don't think there is an easy way. One method uses lag():
select ( (case when port_5 is not null then 1 else 0 end) +
(case when port_4 is not null and port_4 not in (port_5) then 1 else 0 end) +
(case when port_3 is not null and port_3 not in (port_5, port_4) then 1 else 0 end) +
(case when port_2 is not null and port_2 not in (port_5, port_4, port_3) then 1 else 0 end) +
(case when port_1 is not null and port_1 not in (port_5, port_4, port_3, port_2) then 1 else 0 end) +
(case when port is not null and port not in (port_5, port_4, port_3, port_2, port_2) then 1 else 0 end)
) as cumulative_distinct_count
from (select t.*,
lag(port, 5) over (partition by srcport order by id rows) as port_5,
lag(port, 4) over (partition by srcport order by id rows) as port_4,
lag(port, 3) over (partition by srcport order by id rows) as port_3,
lag(port, 2) over (partition by srcport order by id rows) as port_2,
lag(port, 1) over (partition by srcport order by id rows) as port_1
from t
) t
This is a complicated query, but the performance should be ok.
Note: port and srcport I assume are the same thing, but this borrows from your query.

One way to do it is with a self join as distinct isn't supported in window functions.
select t1.id,count(distinct t2.port) as cnt
from tbl t1
join tbl t2 on t1.id-t2.id>=0 and t1.id-t2.id<=5 --change this number per requirements
group by t1.id
order by t1.id
This assumes id's are in sequential order.
If not, first get the row numbers and use the logic from above. It would be like
with rownums as (select id,port,row_number() over(order by id) as rnum
from tbl)
select r1.id,count(distinct r2.port)
from rownums r1
join rownums r2 on r1.rnum-r2.rnum>=0 and r1.rnum-r2.rnum<=5
group by r1.id

Related

(SQL) Per ID, starting from the first row, return all successive rows with a value N greater than the prior returned row

I have the following example dataset:
ID
Value
Row index (for reference purposes only, does not need to exist in final output)
a
4
1
a
7
2
a
12
3
a
12
4
a
13
5
b
1
6
b
2
7
b
3
8
b
4
9
b
5
10
I would like to write a SQL script that returns the next row which has a Value of N or more than the previously returned row starting from the first row per ID and ordered ascending by [Value]. An example of the final table for N = 3 should look like the following:
ID
Value
Row index
a
4
1
a
7
2
a
12
3
b
1
6
b
4
9
Can this script be written in a vectorised manner? Or must a loop be utilised? Any advice would be greatly appreciated. Thanks!

SQL tables represent unordered sets. There is no definition of "previous" value, unless you have a column that specifies the ordering. With such a column, you can use lag():
select t.*
from (select t.*,
lag(value) over (partition by id order by <ordering column>) as prev_value
from t
) t
where prev_value is null or prev_value <= value - 3;
EDIT:
I think I misunderstood what you want to do. You seem to want to start with the first row for each id. Then get the next row that is 3 or higher in value. Then hold onto that value and get the next that is 3 or higher than that. And so on.
You can do this in SQL using a recursive CTE:
with ts as (
select distinct t.id, t.value, dense_rank() over (partition by id order by value) as seqnum
from t
),
cte as (
select id, value, value as grp_value, 1 as within_seqnum, seqnum
from ts
where seqnum = 1
union all
select ts.id, ts.value,
(case when ts.value >= cte.grp_value + 3 then ts.value else cte.grp_value end),
(case when ts.value >= cte.grp_value + 3 then 1 else cte.within_seqnum + 1 end),
ts.seqnum
from cte join
ts
on ts.id = cte.id and ts.seqnum = cte.seqnum + 1
)
select *
from cte
where within_seqnum = 1
order by id, value;
Here is a db<>fiddle.

Find subsequent occurrence of a value in a table

I have a table which looks like shown below
ID SubmittedValue ApprovedValue
1 25.9 0
1 29 29
1 25.9 25.9
1 50 0
1 45 0
1 10 0
1 10 10
Expected result
ID SubsequentlyApproved(CNT) Total_Amt_sub_aprvd
1 2 35.9
We get the above result because 25.9+10 since it is repeated in the subsequent rows.
How to perform VLOOKUP like functionality for this scenario. I tried the subquery but it didn't work.
SELECT a.id,
SUM(CASE WHEN a.ApprovedValue=0 THEN 1 ELSE 0 END) AS SUB_COUNT
FROM myTable a
join (select id, sum( case when SubmittedValue=ApprovedValue then 1 end) as check_value from myTable) b
on b.id=a.id and SUB_COUNT=check_value
but this is not giving me the expected result.

You seem to want to count rows where the values are the same and the first value appears more than once. If so, you can use window functions and aggregation:
select id, count(*), sum(ApprovedValue)
from (select t.*, count(*) over (partition by id, SubmittedValue) as cnt
from t
) t
where cnt > 1 and SubmittedValue = ApprovedValue
group by id

Without window functions using a semi-join
select id, count(*), sum(submittedvalue)
from test t1
where submittedvalue=approvedvalue
and exists (select 1
from test t2
where t1.id=t2.id and t1.submittedvalue=t2.submittedvalue
group by id, submittedvalue
having count(*)>1)
group by id;

Count consecutive duplicate values in SQL

I have a table like so
ID OrdID Value
1 1 0
2 2 0
3 1 1
4 2 1
5 1 1
6 2 0
7 1 0
8 2 0
9 2 1
10 1 0
11 2 0
I want to get the count of consecutive value where the value is 0. Using the example above the result will be 3 (Rows 6, 7 and 8). I am using sql server 2008 r2.

I am going to presume that id is unique and increasing. You can get counts of consecutive values by using the different of row numbers. The following counts all sequences:
select grp, value, min(id), max(id), count(*) as cnt
from (select t.*,
(row_number() over (order by id) - row_number() over (partition by value order by id)
) as grp
from table t
) t
group by grp, value;
If you want the longest sequence of 0s:
select top 1 grp, value, min(id), max(id), count(*) as cnt
from (select t.*,
(row_number() over (order by id) - row_number() over (partition by value order by id)
) as grp
from table t
) t
group by grp, value
having value = 0
order by count(*) desc

A query using not exists to find consecutive 0s
select top 1 min(t2.id), max(t2.id), count(*)
from mytable t
join mytable t2 on t2.id <= t.id
where not exists (
select 1 from mytable t3
where t3.id between t2.id and t.id
and t3.value <> 0
)
group by t.id
order by count(*) desc
http://sqlfiddle.com/#!3/52989/3

Aggregate within a group of unchanged values

I have sample data:
RowId TypeId Value
1 1 34
2 1 53
3 1 34
4 2 43
5 2 65
6 16 54
7 16 34
8 1 45
9 6 43
10 6 34
11 16 64
12 16 63
I want to count row for each type (The Value does not matter to me), but only for... neighbor TypeId
TypeId Count
1 3
2 2
16 2
1 1
6 2
16 2
How to achieve this result?

This should give you COUNT of rows within a group of unchanged values:
SELECT TypeId, grp, COUNT(*) FROM (
SELECT RowId, TypeId , Value, gap, SUM(gap) over (ORDER BY RowId ) grp
FROM (SELECT RowId, TypeId , Value,
CASE WHEN TypeId = lag(TypeId) over (ORDER BY RowId )
THEN 0
ELSE 1
END gap
FROM dummy
) t
) tt
GROUP BY TypeId, grp;
If you prefer WITH over endless sub-query inclusions:
WITH dummy_with_groups AS (
SELECT RowId, TypeId , Value, SUM(gap) OVER (ORDER BY RowId) grp
FROM (SELECT RowId, TypeId , Value,
CASE WHEN TypeId = lag(TypeId) OVER (ORDER BY RowId)
THEN 0 ELSE 1 END gap
FROM dummy) t
)
SELECT TypeId, COUNT(*) as Result
FROM dummy_with_groups
GROUP BY TypeId, grp;
http://www.sqlfiddle.com/#!6/f16e9/34

Check this fiddle demo. I have renamed your columns a little.
WITH myCTE AS
(SELECT row_id,
type_id,
ROW_NUMBER () OVER (PARTITION BY type_id ORDER BY row_id)
AS cnt,
CASE LEAD (type_id) OVER (ORDER BY row_id)
WHEN type_id THEN 0
ELSE 1
END
AS show
FROM dummy),
innerQuery AS
(SELECT row_id, type_id, cnt
FROM myCTE
WHERE show = 1)
SELECT iq1.type_id, iq1.cnt - ISNULL (iq2.cnt, 0) CNT
FROM innerQuery iq1
LEFT OUTER JOIN innerQuery iq2
ON iq1.type_id = iq2.type_id
AND EXISTS
(SELECT 1
FROM innerQuery iq3
WHERE iq3.type_id = iq1.type_id
AND iq3.row_id < iq1.row_id
HAVING MAX (iq3.row_id) = iq2.row_id)
The output is exactly as expected.

How to get average of the 'middle' values in a group?

I have a table that has values and group ids (simplified example). I need to get the average for each group of the middle 3 values. So, if there are 1, 2, or 3 values it's just the average. But if there are 4 values, it would exclude the highest, 5 values the highest and lowest, etc. I was thinking some sort of window function, but I'm not sure if it's possible.
http://www.sqlfiddle.com/#!11/af5e0/1
For this data:
TEST_ID TEST_VALUE GROUP_ID
1 5 1
2 10 1
3 15 1
4 25 2
5 35 2
6 5 2
7 15 2
8 25 3
9 45 3
10 55 3
11 15 3
12 5 3
13 25 3
14 45 4
I'd like
GROUP_ID AVG
1 10
2 15
3 21.6
4 45

Another option using analytic functions;
SELECT group_id,
avg( test_value )
FROM (
select t.*,
row_number() over (partition by group_id order by test_value ) rn,
count(*) over (partition by group_id ) cnt
from test t
) alias
where
cnt <= 3
or
rn between floor( cnt / 2 )-1 and ceil( cnt/ 2 ) +1
group by group_id
;
Demo --> http://www.sqlfiddle.com/#!11/af5e0/59

I'm not familiar with the Postgres syntax on windowed functions, but I was able to solve your problem in SQL Server with this SQL Fiddle. Maybe you'll be able to easily migrate this into Postgres-compatible code. Hope it helps!
A quick primer on how I worked it.
Order the test scores for each group
Get a count of items in each group
Use that as a subquery and select only the middle 3 items (that's the where clause in the outer query)
Get the average for each group
--
select
group_id,
avg(test_value)
from (
select
t.group_id,
convert(decimal,t.test_value) as test_value,
row_number() over (
partition by t.group_id
order by t.test_value
) as ord,
g.gc
from
test t
inner join (
select group_id, count(*) as gc
from test
group by group_id
) g
on t.group_id = g.group_id
) a
where
ord >= case when gc <= 3 then 1 when gc % 2 = 1 then gc / 2 else (gc - 1) / 2 end
and ord <= case when gc <= 3 then 3 when gc % 2 = 1 then (gc / 2) + 2 else ((gc - 1) / 2) + 2 end
group by
group_id

with cte as (
select
*,
row_number() over(partition by group_id order by test_value) as rn,
count(*) over(partition by group_id) as cnt
from test
)
select
group_id, avg(test_value)
from cte
where
cnt <= 3 or
(rn >= cnt / 2 - 1 and rn <= cnt / 2 + 1)
group by group_id
order by group_id
sql fiddle demo
in the cte, we need to get count of elements over each group_id by window function + calculate row_number inside each group_id. Then, if this count > 3 then we need to get middle of the group by dividing count by 2 and then get +1 and -1 element. If count <= 3, then we should just take all elements.

This works:
SELECT A.group_id, avg(A.test_value) AS avg_mid3 FROM
(SELECT group_id,
test_value,
row_number() OVER (PARTITION BY group_id ORDER BY test_value) AS position
FROM test) A
JOIN
(SELECT group_id,
CASE
WHEN count(*) < 4 THEN 1
WHEN count(*) % 2 = 0 THEN (count(*)/2 - 1)
ELSE (count(*) / 2)
END AS position_start,
CASE
WHEN count(*) < 4 THEN count(*)
WHEN count(*) % 2 = 0 THEN (count(*)/2 + 1)
ELSE (count(*) / 2 + 2)
END AS position_end
FROM test GROUP BY group_id) B
ON A.group_id=B.group_id
AND A.position >= B.position_start
AND A.position <= B.position_end
GROUP BY A.group_id
Fiddle link: http://www.sqlfiddle.com/#!11/af5e0/56

If you need to calculate the average values for groups then you can do this:
SELECT CASE WHEN NUMBER_FIRST_GROUP <> 0
THEN SUM_FIRST_GROUP / NUMBER_FIRST_GROUP
ELSE NULL
END AS AVG_FIRST_GROUP,
CASE WHEN NUMBER_SECOND_GROUP <> 0
THEN SUM_SECOND_GROUP / NUMBER_SECOND_GROUP
ELSE NULL
END AS AVG_SECOND_GROUP,
CASE WHEN NUMBER_THIRD_GROUP <> 0
THEN SUM_THIRD_GROUP / NUMBER_THIRD_GROUP
ELSE NULL
END AS AVG_THIRD_GROUP,
CASE WHEN NUMBER_FOURTH_GROUP <> 0
THEN SUM_FOURTH_GROUP / NUMBER_FOURTH_GROUP
ELSE NULL
END AS AVG_FOURTH_GROUP
FROM (
SELECT
SUM(CASE WHEN GROUP_ID = 1 THEN 1 ELSE 0 END) AS NUMBER_FIRST_GROUP,
SUM(CASE WHEN GROUP_ID = 1 THEN TEST_VALUE ELSE 0 END) AS SUM_FIRST_GROUP,
SUM(CASE WHEN GROUP_ID = 2 THEN 1 ELSE 0 END) AS NUMBER_SECOND_GROUP,
SUM(CASE WHEN GROUP_ID = 2 THEN TEST_VALUE ELSE 0 END) AS SUM_SECOND_GROUP,
SUM(CASE WHEN GROUP_ID = 3 THEN 1 ELSE 0 END) AS NUMBER_THIRD_GROUP,
SUM(CASE WHEN GROUP_ID = 3 THEN TEST_VALUE ELSE 0 END) AS SUM_THIRD_GROUP,
SUM(CASE WHEN GROUP_ID = 4 THEN 1 ELSE 0 END) AS NUMBER_FOURTH_GROUP,
SUM(CASE WHEN GROUP_ID = 4 THEN TEST_VALUE ELSE 0 END) AS SUM_FOURTH_GROUP
FROM TEST
) AS FOO

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

COUNT() OVER possible using DISTINCT and WINDOWING IN HIVE - sql

Related

(SQL) Per ID, starting from the first row, return all successive rows with a value N greater than the prior returned row

Find subsequent occurrence of a value in a table

Count consecutive duplicate values in SQL

Aggregate within a group of unchanged values

How to get average of the 'middle' values in a group?

Categories

Resources