HIVE: Finding running total excluding duplicates

Hi, I have a peculiar problem for which I am unable to find a solution. I have a table UserViews with the following columns:
Progdate(String)
UserName(String)
Dummy data in the table:
Progdate UserName
20161119 A
20161119 B
20161119 C
20161119 B
20161120 D
20161120 E
20161120 A
20161121 B
20161121 A
20161121 B
20161121 F
20161121 G
Each time a User views a program there is an entry in the table. For example on 19th Nov, User A watched the program once so there is one entry. User B watched the program twice so there are two entries for this user on 19th Nov and so on.
Select Progdate, count(distinct UserName) UniqueUsersByDate
from UserViews
group by Progdate;
The above query gives me the date-wise count of unique users who have watched the program:
Progdate UniqueUsersByDate
20161119 3
20161120 3
20161121 4
Below query:
Select Progdate, UniqueUsersByDate, Sum(UniqueUsersByDate) over(Order By Progdate) RunningTotalNewUsers
from
(
Select Progdate, count(distinct UserName) UniqueUsersByDate
from
UserViews
group by Progdate SORT BY Progdate
) UV;
It gives me the following result:
Progdate UniqueUsersByDate RunningTotalNewUsers
20161119 3 3
20161120 3 6
20161121 4 10
But what I want is the running total of users counted only the first time they watch the program. That is, if User A watched the program on 20161119 and then again on 20161120, this user should not be counted again in the running total for 20161120. The result I want from the above table is:
Progdate UniqueUsersByDate RunningTotalNewUsers
20161119 3 3
20161120 3 5
20161121 4 7
I am looking for the solution only in HIVE HQL. Any input toward the problem is greatly appreciated.
Thanks.

select Progdate
,UniqueUsersByDate
,sum(Users1stOcc) over
(
order by Progdate
) as RunningTotalNewUsers
from (select Progdate
,count (distinct UserName) as UniqueUsersByDate
,count (case when rn = 1 then 1 end) as Users1stOcc
from (select Progdate
,UserName
,row_number() over
(
partition by UserName
order by Progdate
) as rn
from UserViews
) uv
group by Progdate
) uv
;
+-------------+--------------------+-----------------------+
| progdate | uniqueusersbydate | runningtotalnewusers |
+-------------+--------------------+-----------------------+
| 2016-11-19 | 3 | 3 |
| 2016-11-20 | 3 | 5 |
| 2016-11-21 | 4 | 7 |
+-------------+--------------------+-----------------------+
P.s.
Theoretically, the aggregation and the use of the SUM analytical function do not require an additional sub-query, but there seems to be an issue (bug/feature) with the parser.
Please note that an additional sub-query does not necessarily indicate an additional execution stage, e.g. select * from (select * from (select * from (select * from (select * from t)t)t)t)t; and select * from t will have the same execution plan.
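The query above can be checked against the question's sample rows with a small sketch. This is not Hive: it assumes Python's built-in sqlite3 module (SQLite 3.25 or later, for window functions) as a stand-in engine, and the derived-table aliases numbered and per_day are made up for illustration.

```python
import sqlite3

# Sample rows from the question: (Progdate, UserName)
views = [
    ("20161119", "A"), ("20161119", "B"), ("20161119", "C"), ("20161119", "B"),
    ("20161120", "D"), ("20161120", "E"), ("20161120", "A"),
    ("20161121", "B"), ("20161121", "A"), ("20161121", "B"),
    ("20161121", "F"), ("20161121", "G"),
]

conn = sqlite3.connect(":memory:")  # SQLite stands in for Hive here (assumption)
conn.execute("CREATE TABLE UserViews (Progdate TEXT, UserName TEXT)")
conn.executemany("INSERT INTO UserViews VALUES (?, ?)", views)

query = """
SELECT Progdate,
       UniqueUsersByDate,
       SUM(Users1stOcc) OVER (ORDER BY Progdate) AS RunningTotalNewUsers
FROM (
    SELECT Progdate,
           COUNT(DISTINCT UserName) AS UniqueUsersByDate,
           -- rn = 1 marks the first row ever seen for a user
           COUNT(CASE WHEN rn = 1 THEN 1 END) AS Users1stOcc
    FROM (
        SELECT Progdate, UserName,
               ROW_NUMBER() OVER (PARTITION BY UserName ORDER BY Progdate) AS rn
        FROM UserViews
    ) AS numbered
    GROUP BY Progdate
) AS per_day
ORDER BY Progdate
"""

running_totals = conn.execute(query).fetchall()
for row in running_totals:
    print(row)
```

Each user contributes exactly one row with rn = 1, so summing those per-day counts cumulatively counts every user once, on the first date they appear.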

Related

How to count distinct a field cumulatively using recursive cte or other method in SQL?

Using example below, Day 1 will have 1,3,3 distinct name(s) for A,B,C respectively.
When calculating distinct name(s) for each house on Day 2, data up to Day 2 is used.
When calculating distinct name(s) for each house on Day 3, data up to Day 3 is used.
Can recursive cte be used?
Data:
Day | House | Name
----+-------+-------
1 | A | Jack
1 | B | Pop
1 | C | Anna
1 | C | Dew
1 | C | Franco
2 | A | Jon
2 | B | May
2 | C | Anna
3 | A | Jon
3 | B | Ken
3 | C | Dew
3 | C | Dew
Result:
Day | House | Distinct names
----+-------+---------------
1 | A | 1
1 | B | 1
1 | C | 3
2 | A | 2 (jack and jon)
2 | B | 2
2 | C | 3
3 | A | 2 (jack and jon)
3 | B | 3
3 | C | 3

Without knowing the need and size of data it'll be hard to give an ideal/optimal solution. Assuming a small dataset needing a quick and dirty way to calculate, just use sub query like this...
SELECT p.[Day]
, p.House
, (SELECT COUNT(DISTINCT([Name]))
FROM #Bing
WHERE [Day]<= p.[Day] AND House = p.House) DistinctNames
FROM #Bing p
GROUP BY [Day], House
ORDER BY 1

There is no need for a recursive CTE. Just mark the first time a name is seen in a house and use a cumulative sum:
select day, house,
sum(sum(case when seqnum = 1 then 1 else 0 end)) over (partition by house order by day) as num_unique_names
from (select t.*,
row_number() over (partition by house, name order by day) as seqnum
from t
) t
group by day, house
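As a rough check, both approaches can be run against the question's data. This sketch assumes Python's sqlite3 module (SQLite 3.25 or later) and a hypothetical table name residents. It performs the per-day aggregation in a derived table instead of nesting sum(sum(...)) inside the window function, which is the same technique written in two steps.

```python
import sqlite3

# Rows from the question: (day, house, name). The table name is hypothetical.
residents = [
    (1, "A", "Jack"), (1, "B", "Pop"), (1, "C", "Anna"), (1, "C", "Dew"),
    (1, "C", "Franco"), (2, "A", "Jon"), (2, "B", "May"), (2, "C", "Anna"),
    (3, "A", "Jon"), (3, "B", "Ken"), (3, "C", "Dew"), (3, "C", "Dew"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE residents (day INTEGER, house TEXT, name TEXT)")
conn.executemany("INSERT INTO residents VALUES (?, ?, ?)", residents)

# Window approach: flag the first row per (house, name), count the flags per
# (day, house), then accumulate those counts per house in day order.
window_sql = """
SELECT day, house,
       SUM(new_names) OVER (PARTITION BY house ORDER BY day) AS num_unique_names
FROM (
    SELECT day, house,
           SUM(CASE WHEN seqnum = 1 THEN 1 ELSE 0 END) AS new_names
    FROM (
        SELECT day, house, name,
               ROW_NUMBER() OVER (PARTITION BY house, name ORDER BY day) AS seqnum
        FROM residents
    ) AS numbered
    GROUP BY day, house
) AS per_day
ORDER BY day, house
"""

# Correlated-subquery approach, for comparison.
subquery_sql = """
SELECT p.day, p.house,
       (SELECT COUNT(DISTINCT r.name)
        FROM residents r
        WHERE r.day <= p.day AND r.house = p.house) AS distinct_names
FROM residents p
GROUP BY p.day, p.house
ORDER BY p.day, p.house
"""

window_rows = conn.execute(window_sql).fetchall()
subquery_rows = conn.execute(subquery_sql).fetchall()
for row in window_rows:
    print(row)
```

On this data the two queries return the same rows. The correlated subquery rescans the table for every output row, while the window version numbers the rows once and accumulates.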

SQL GROUP BY where either column has same value

I have the following table
User A | User B | Value
-------+--------+------
1 | 2 | 60
3 | 1 | 10
4 | 5 | 50
3 | 5 | 50
5 | 1 | 80
2 | 3 | 10
I want to group together records where either user a = x or user b = x, in order to find averages.
e.g. User 1 appears in the table 3 times, once as 'User A' and twice as 'User B'. So I would want to carry out my AVG() function using those three rows.
I need the highest and lowest average values. Such a query would break down the above table into the following groups:
User | Avg Value
-----+-----
1 | 50
2 | 35
3 | 23.33
4 | 50
5 | 60
and then return
Highest Avg | Lowest Avg
------------+-----------
60 | 23.33
I know that GROUP BY collects together records where a column has the same value. I want to collect together records where either one of two columns has the same value. I have searched through many solutions but can't seem to find one that meets my problem.

A portable option uses union all:
select usr, avg(value) avg_value
from (
select usera usr, value from mytable
union all select userb, value from mytable
) t
group by usr
This gives you the first result set. Then, you can add another level of aggregation to get the maximum and minimum average:
select min(avg_value) min_avg_value, max(avg_value) max_avg_value
from (
select usr, avg(value) avg_value
from (
select usera usr, value from mytable
union all select userb, value from mytable
) t
group by usr
) t
In databases that support lateral joins and values(), this is most conveniently (and efficiently) expressed as follows:
select min(avg_value) min_avg_value, max(avg_value) max_avg_value
from (
select usr, avg(value) avg_value
from mytable t
cross join lateral (values (usera, value), (userb, value)) as x(usr, value)
group by usr
) t
This would work in Postgres for example. In SQL Server, you would just replace cross join lateral with cross apply.
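The union all version can be tried on the question's six rows. The sketch below assumes Python's sqlite3 module as the engine and reuses the mytable, usera and userb names from the queries above. It does not cover the lateral join variant.

```python
import sqlite3

# Rows from the question: (usera, userb, value)
pairs = [(1, 2, 60), (3, 1, 10), (4, 5, 50), (3, 5, 50), (5, 1, 80), (2, 3, 10)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (usera INTEGER, userb INTEGER, value INTEGER)")
conn.executemany("INSERT INTO mytable VALUES (?, ?, ?)", pairs)

# Unpivot: every row is emitted twice, once per user column, so a plain
# GROUP BY on the single usr column covers both roles.
per_user_sql = """
SELECT usr, AVG(value) AS avg_value
FROM (
    SELECT usera AS usr, value FROM mytable
    UNION ALL
    SELECT userb, value FROM mytable
) AS t
GROUP BY usr
ORDER BY usr
"""
per_user = conn.execute(per_user_sql).fetchall()

# Second aggregation level: extremes of the per-user averages.
extremes_sql = """
SELECT MIN(avg_value), MAX(avg_value)
FROM (
    SELECT usr, AVG(value) AS avg_value
    FROM (
        SELECT usera AS usr, value FROM mytable
        UNION ALL
        SELECT userb, value FROM mytable
    ) AS t
    GROUP BY usr
) AS averages
"""
min_avg, max_avg = conn.execute(extremes_sql).fetchone()

print(per_user)
print(round(min_avg, 2), round(max_avg, 2))
```

union all (not union) matters here: union would drop duplicate (user, value) pairs and skew the averages.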

You can unpivot using union all and then aggregation:
select user, avg(value)
from ((select usera as user, value from t) union all
(select userb as user, value from t)
) u
group by user;
You can get the extremes with another level of aggregation:
select min(avg_value), max(avg_value)
from (select user, avg(value) as avg_value
from ((select usera as user, value from t) union all
(select userb as user, value from t)
) u
group by user
) ua

How to select IDs that have at least two specific instances in a given column

I'm working with a medical claim table in pyspark and I want to return only userid's that have at least 2 claim_ids. My table looks something like this:
claim_id | userid | diagnosis_type | claim_type
__________________________________________________
1 1 C100 M
2 1 C100a M
3 2 D50 F
5 3 G200 M
6 3 C100 M
7 4 C100a M
8 4 D50 F
9 4 A25 F
From this example, I would want to return userid's 1, 3, and 4 only. Currently I'm building a temp table to count all of the distinct instances of the claim_ids
create table temp.claim_count as
select distinct userid, count(distinct claim_id) as claims
from medical_claims
group by userid
and then pulling from this table when the number of claim_id >1
select distinct userid
from medical_claims
where userid in (
select distinct userid
from temp.claim_count
where claims>1)
Is there a better / more efficient way of doing this?

If you want only the ids, then use group by:
select userid, count(*) as claims
from medical_claims
group by userid
having count(*) > 1;
If you want the original rows, then use window functions:
select mc.*
from (select mc.*, count(*) over (partition by userid) as num_claims
from medical_claims mc
) mc
where num_claims > 1;
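Both queries can be exercised on the question's sample table. The question runs on pyspark; this sketch instead assumes Python's sqlite3 module (SQLite 3.25 or later) as a stand-in engine, so it only illustrates the SQL, not Spark behaviour.

```python
import sqlite3

# Rows from the question: (claim_id, userid, diagnosis_type, claim_type)
claims = [
    (1, 1, "C100", "M"), (2, 1, "C100a", "M"), (3, 2, "D50", "F"),
    (5, 3, "G200", "M"), (6, 3, "C100", "M"), (7, 4, "C100a", "M"),
    (8, 4, "D50", "F"), (9, 4, "A25", "F"),
]

conn = sqlite3.connect(":memory:")  # SQLite, not Spark (assumption)
conn.execute(
    "CREATE TABLE medical_claims "
    "(claim_id INTEGER, userid INTEGER, diagnosis_type TEXT, claim_type TEXT)"
)
conn.executemany("INSERT INTO medical_claims VALUES (?, ?, ?, ?)", claims)

# Only the ids: filter groups after aggregation with HAVING.
ids_sql = """
SELECT userid, COUNT(*) AS claims
FROM medical_claims
GROUP BY userid
HAVING COUNT(*) > 1
ORDER BY userid
"""
users_with_claims = conn.execute(ids_sql).fetchall()

# The original rows: attach the per-user count to every row, then filter.
rows_sql = """
SELECT claim_id, userid, diagnosis_type, claim_type
FROM (
    SELECT mc.*, COUNT(*) OVER (PARTITION BY userid) AS num_claims
    FROM medical_claims mc
) AS counted
WHERE num_claims > 1
ORDER BY claim_id
"""
kept_claim_ids = [row[0] for row in conn.execute(rows_sql)]

print(users_with_claims)
print(kept_claim_ids)
```

Neither query needs the intermediate temp table. If claim_id could repeat within a user, count(distinct claim_id) in the HAVING clause would match the question's original counting more closely than count(*).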

SQL: SELECT value for all rows based on a value in one of the rows and a condition

I have a list of total store visits for a customer for a month. The customer has a home store but can visit other stores. Like the table below:
MemberId | HomeStoreId | VisitedStoreId | Month | Visits
1 5 5 1 5
1 5 3 1 2
1 5 2 1 1
1 5 4 1 7
I want my select statement to give the number of visits to the home store against each store for that member for that month. Like the below:
MemberId | HomeStoreId | VisitedStoreId | Month | Visits | HomeStoreVisits
1 5 5 1 5 5
1 5 3 1 2 5
1 5 2 1 1 5
1 5 4 1 7 5
I've looked at a SUM with CASE statements inside and OVER with PARTITION but I can't seem to work it out.
Thanks

I would use window functions:
select t.*,
sum(case when homestoreid = visitedstoreid then visits end) over
(partition by memberid, month) as homestorevisits
from t;
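A quick way to see what the conditional window sum produces is to run it on the question's four rows. The sketch assumes Python's sqlite3 module (SQLite 3.25 or later) and a hypothetical table name store_visits, since the question does not name its table.

```python
import sqlite3

# Rows from the question:
# (MemberId, HomeStoreId, VisitedStoreId, Month, Visits)
visits = [(1, 5, 5, 1, 5), (1, 5, 3, 1, 2), (1, 5, 2, 1, 1), (1, 5, 4, 1, 7)]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE store_visits "  # hypothetical table name
    "(memberid INTEGER, homestoreid INTEGER, visitedstoreid INTEGER, "
    "month INTEGER, visits INTEGER)"
)
conn.executemany("INSERT INTO store_visits VALUES (?, ?, ?, ?, ?)", visits)

# The CASE yields visits only on the home-store row and NULL elsewhere;
# SUM ignores NULLs, so every row in the (member, month) partition
# receives the home-store total.
sql = """
SELECT memberid, homestoreid, visitedstoreid, month, visits,
       SUM(CASE WHEN homestoreid = visitedstoreid THEN visits END)
           OVER (PARTITION BY memberid, month) AS homestorevisits
FROM store_visits
ORDER BY visitedstoreid
"""
with_home_visits = conn.execute(sql).fetchall()
for row in with_home_visits:
    print(row)
```

If a member has no home-store row in a given month, the window sum is NULL for that partition rather than 0.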

SELECT MemberID,HomestoreID,visitedstoreid,Month,visits, homestorevisits
FROM Table LEFT OUTER JOIN
(SELECT MemberID, Visits homestorevisits
FROM TABLE WHERE homestoreID =VisitedStoreId
)T ON T.MemberID = Table.MemberID

You can achieve this using a simple subquery.
SELECT MemberId, HomeStoreID, VisitedStoreID, Month, Visits,
(SELECT Visits FROM table t2
WHERE t2.MemberId = t1.MemberId
AND t2.HomeStoreId = t1.HomeStoreId
AND t2.Month = t1.Month
AND t2.VisitedStoreId = t2.HomeStoreId) AS HomeStoreVisits
FROM table t1

logic in HAVING clause to get multiple values of a group by result

Imagine I have a table with data as below:
ROLE_ID | USER_ID | CODE
---------------------------------
14 | USER A | 001
15 | USER A | 002
11 | USER B | 004
13 | USER D | 005
13 | USER A | 001
15 | USER B | 009
15 | USER D | 005
12 | USER C | 004
15 | USER C | 008
13 | USER D | 007
15 | USER D | 007
I want to get the User ids and codes that only have 13 and 15 role_ids. So based on the data above I would like back the following
USER D | 005
USER D | 007
I have the query below, however, it only brings back one, not both.
SELECT a.user_id, a.code
FROM my_table a
WHERE a.ROLE_ID in (13,15,11,14)
group by a.USER_ID, a.code
having sum( case when a.role_id in (13,15) then 1 else 0 end) = 2
and sum( case when a.role_id in (11,14) then 1 else 0 end) = 0
ORDER BY USER_ID
The above query only brings
USER D | 005
rather than
USER D | 005
USER D | 007

Sometimes just listening to your own words in English translates into the easiest to read SQL:
SELECT DISTINCT a.user_id, a.code
FROM my_table a
WHERE a.user_id in
(SELECT b.user_id
FROM my_table b
WHERE b.ROLE_ID = 13)
AND a.user_id in
(SELECT b.user_id
FROM my_table b
WHERE b.ROLE_ID = 15)
AND a.user_id NOT IN
(SELECT b.user_id
FROM my_table b
WHERE b.ROLE_ID NOT IN (13,15))

I would use:
SELECT a.user_id, a.code
FROM my_table a
GROUP BY a.user_id, a.code
HAVING sum(case when a.role_id in (13, 15) then 1 else 3 end) = 2
:)

As proven by EthanB, your query is working exactly as you desire. There must be something in your project data that is not represented in your question's fabricated data.
I do endorse a pivot as you have executed in your question, but I would write it as a single SUM expression to reduce the number of iterations over the aggregate data. I certainly do not endorse multiple subqueries on each row of the table (1, 2, 3) ...regardless of whether the optimizer is converting the subqueries to multiple JOINs.
Your pivot conditions:
having sum( case when a.role_id in (13,15) then 1 else 0 end) = 2
and sum( case when a.role_id in (11,14) then 1 else 0 end) = 0
My recommendation:
As the aggregate data is being iterated, you can keep a tally (+1) of qualifying rows and jump to a disqualifying outcome (+3) after each evaluation. This way, there is only one pass over the aggregate instead of two.
SELECT USER_ID, CODE
FROM my_table
WHERE ROLE_ID IN (13,15,11,14)
GROUP BY USER_ID, CODE
HAVING SUM(CASE WHEN ROLE_ID IN (13,15) THEN 1
WHEN ROLE_ID IN (11,14) THEN 3 END) = 2
Another way of expressing what these HAVING clauses are doing is:
Require that the first CASE is satisfied twice and that the second CASE is never satisfied.
Alternatively, the above HAVING clause could be less elegantly written as:
HAVING SUM(CASE ROLE_ID
WHEN 13 THEN 1
WHEN 15 THEN 1
WHEN 11 THEN 3
WHEN 14 THEN 3
END) = 2
Disclaimer #1: I don't swim in the [oracle] tag pool, I've not investigated how to execute this with PIVOT.
Disclaimer #2: My above advice assumes that ROLE_IDs are unique in the grouped USER_ID+CODE aggregate data. Fringe cases: (a demo)
If a given group contains ROLE_ID = 13, ROLE_ID = 13, and ROLE_ID = 15, then the SUM will be at least 3 and the group will be disqualified.
If a given group contains only ROLE_ID = 15 and ROLE_ID = 15, then the SUM will be 2 and the group will be unintentionally qualified.
To combat scenarios like these, make three separate MAX conditions.
HAVING MAX(CASE WHEN ROLE_ID = 13 THEN 1 END) = 1
AND MAX(CASE WHEN ROLE_ID = 15 THEN 1 END) = 1
AND MAX(CASE WHEN ROLE_ID IN (11,14) THEN 1 END) IS NULL
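The difference between the single SUM and the three MAX conditions shows up once duplicate roles exist. The sketch below assumes Python's sqlite3 module and adds two made-up groups to the question's data, USER E with role 15 twice and USER F with roles 13, 13 and 15, purely to trigger the fringe cases described above.

```python
import sqlite3

# Rows from the question: (role_id, user_id, code)
rows = [
    (14, "USER A", "001"), (15, "USER A", "002"), (11, "USER B", "004"),
    (13, "USER D", "005"), (13, "USER A", "001"), (15, "USER B", "009"),
    (15, "USER D", "005"), (12, "USER C", "004"), (15, "USER C", "008"),
    (13, "USER D", "007"), (15, "USER D", "007"),
    # Hypothetical fringe rows, not in the question:
    (15, "USER E", "010"), (15, "USER E", "010"),
    (13, "USER F", "011"), (13, "USER F", "011"), (15, "USER F", "011"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_table (role_id INTEGER, user_id TEXT, code TEXT)")
conn.executemany("INSERT INTO my_table VALUES (?, ?, ?)", rows)

# Single SUM tally: +1 for a wanted role, +3 for an unwanted one.
sum_sql = """
SELECT user_id, code
FROM my_table
WHERE role_id IN (13, 15, 11, 14)
GROUP BY user_id, code
HAVING SUM(CASE WHEN role_id IN (13, 15) THEN 1
                WHEN role_id IN (11, 14) THEN 3 END) = 2
ORDER BY user_id, code
"""

# Three MAX conditions: presence checks that ignore duplicates.
max_sql = """
SELECT user_id, code
FROM my_table
WHERE role_id IN (13, 15, 11, 14)
GROUP BY user_id, code
HAVING MAX(CASE WHEN role_id = 13 THEN 1 END) = 1
   AND MAX(CASE WHEN role_id = 15 THEN 1 END) = 1
   AND MAX(CASE WHEN role_id IN (11, 14) THEN 1 END) IS NULL
ORDER BY user_id, code
"""

sum_result = conn.execute(sum_sql).fetchall()
max_result = conn.execute(max_sql).fetchall()
print(sum_result)
print(max_result)
```

On the question's own rows both queries return USER D with codes 005 and 007. The added rows are where they diverge: the SUM version lets USER E through and rejects USER F, while the MAX version does the opposite.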

SELECT user_id, code FROM my_table
WHERE role_id = 13
INTERSECT
SELECT user_id, code FROM my_table
WHERE role_id = 15
