Hive - over (partition by ...) with a column not in group by

Is it possible to do something like:
select
avg(count(distinct user_id))
over (partition by some_date) as average_users_per_day
from user_activity
group by user_type
(notably, the partition by column, some_date, is not in the group by columns)
The idea I'm going for is something like: the average users per day by user type.
I know how to do it using subqueries (see below), but I'd like to know if there is a nice way using only over (partition by ...) and group by.
Notes:
From reading this answer, my understanding (correct me if I'm wrong) is that the following query:
select
avg(count(distinct a)) over (partition by b)
from foo
group by b
can be expanded equivalently to:
select
    avg(count_distinct_a)
from (
    select
        b,
        count(distinct a) as count_distinct_a
    from foo
    group by b
) t
group by b
And from that, I can tweak it a bit to achieve what I want:
select
    user_type,
    avg(count_distinct_user_id) as average_users_per_day
from (
    select
        user_type,
        some_date,
        count(distinct user_id) as count_distinct_user_id
    from user_activity
    group by user_type, some_date
) t
group by user_type
(notably, the inner group by user_type, some_date differs from the outer group by user_type)
I'd like to be able to tell the partition by-group by interaction to use a "sub-group-by" for the windowing part. Please let me know if my understanding of partition by/group by is completely off.
EDIT: Some sample data and desired output.
Source table:
+---------+-----------+-----------+
| user_id | user_type | some_date |
+---------+-----------+-----------+
| 1 | a | 1 |
| 1 | a | 2 |
| 2 | a | 1 |
| 3 | a | 2 |
| 3 | a | 2 |
| 4 | b | 2 |
| 5 | b | 1 |
| 5 | b | 3 |
| 5 | b | 3 |
| 6 | c | 1 |
| 7 | c | 1 |
| 8 | c | 4 |
| 9 | c | 2 |
| 9 | c | 3 |
| 9 | c | 4 |
+---------+-----------+-----------+
Sample intermediate table (for reasoning with):
+-----------+-----------+---------------------+
| user_type | some_date | distinct_user_count |
+-----------+-----------+---------------------+
| a | 1 | 2 |
| a | 2 | 2 |
| b | 1 | 1 |
| b | 2 | 1 |
| b | 3 | 1 |
| c | 1 | 2 |
| c | 2 | 1 |
| c | 3 | 1 |
| c | 4 | 2 |
+-----------+-----------+---------------------+
SQL is: select user_type, some_date, count(distinct user_id) from user_activity group by user_type, some_date.
Desired result:
+-----------+---------------------+
| user_type | average_daily_users |
+-----------+---------------------+
| a | 2 |
| b | 1 |
| c | 1.5 |
+-----------+---------------------+
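
As a check on the subquery approach described above, here is a minimal sketch that loads the sample rows into an in-memory SQLite database and runs the two-level group by. SQLite stands in for Hive here, which is an assumption; the subquery alias t and the helper name average_daily_users are made up for illustration.

```python
import sqlite3

# Sample rows from the question: (user_id, user_type, some_date).
ROWS = [
    (1, "a", 1), (1, "a", 2), (2, "a", 1), (3, "a", 2), (3, "a", 2),
    (4, "b", 2), (5, "b", 1), (5, "b", 3), (5, "b", 3),
    (6, "c", 1), (7, "c", 1), (8, "c", 4), (9, "c", 2), (9, "c", 3), (9, "c", 4),
]

# Inner query: distinct users per (user_type, some_date).
# Outer query: average of those daily counts per user_type.
QUERY = """
select
    user_type,
    avg(count_distinct_user_id) as average_daily_users
from (
    select
        user_type,
        some_date,
        count(distinct user_id) as count_distinct_user_id
    from user_activity
    group by user_type, some_date
) t
group by user_type
order by user_type
"""


def average_daily_users(rows):
    """Run the two-level aggregation on an in-memory SQLite table (hypothetical helper)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "create table user_activity (user_id integer, user_type text, some_date integer)"
    )
    conn.executemany("insert into user_activity values (?, ?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


result = average_daily_users(ROWS)
for user_type, avg_users in result:
    print(user_type, avg_users)
```

On the sample data this reproduces the desired result table: 2 for a, 1 for b, and 1.5 for c.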

Related

SQL return only rows where value exists multiple times and other value is present

I have a table like this in MS SQL SERVER
+------+------+
| ID | Cust |
+------+------+
| 1 | A |
| 1 | A |
| 1 | B |
| 1 | B |
| 2 | A |
| 2 | A |
| 2 | A |
| 2 | B |
| 3 | A |
| 3 | B |
| 3 | B |
| 3 | C |
| 3 | C |
+------+------+
I don't know the values in column "Cust" and I want to return all rows where the value of "Cust" appears multiple times and where at least one of the "ID" values is "1".
Like this:
+------+------+
| ID | Cust |
+------+------+
| 1 | A |
| 1 | A |
| 1 | B |
| 1 | B |
| 2 | A |
| 2 | A |
| 2 | A |
| 2 | B |
| 3 | A |
| 3 | B |
| 3 | B |
+------+------+
Any ideas? I can't find it.

You can use the COUNT window function as follows:
SELECT ID, Cust
FROM
(
SELECT ID, Cust,
COUNT(*) OVER (PARTITION BY Cust) cn,
COUNT(CASE WHEN ID=1 THEN 1 END) OVER (PARTITION BY Cust) cn2
FROM table_name
) T
WHERE cn>1 AND cn2>0
ORDER BY ID, Cust
COUNT(*) OVER (PARTITION BY Cust) counts the rows for each "Cust" value, so cn>1 keeps only values that appear multiple times.
COUNT(CASE WHEN ID=1 THEN 1 END) OVER (PARTITION BY Cust) counts the rows with ID 1 for each "Cust" value, so cn2>0 keeps only values that occur with ID "1" at least once.
See a demo.
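
The query above can also be tried locally. The following is a rough sketch that runs it against the sample table in an in-memory SQLite database; it assumes the SQLite bundled with Python is version 3.25 or newer (needed for window functions), and SQLite is a stand-in for SQL Server here. The helper name rows_with_repeated_cust is hypothetical.

```python
import sqlite3

# Sample rows from the question: (ID, Cust).
ROWS = [
    (1, "A"), (1, "A"), (1, "B"), (1, "B"),
    (2, "A"), (2, "A"), (2, "A"), (2, "B"),
    (3, "A"), (3, "B"), (3, "B"), (3, "C"), (3, "C"),
]

# cn  = number of rows sharing this Cust value
# cn2 = number of rows with ID = 1 sharing this Cust value
QUERY = """
SELECT ID, Cust
FROM (
    SELECT ID, Cust,
           COUNT(*) OVER (PARTITION BY Cust) AS cn,
           COUNT(CASE WHEN ID = 1 THEN 1 END) OVER (PARTITION BY Cust) AS cn2
    FROM table_name
) T
WHERE cn > 1 AND cn2 > 0
ORDER BY ID, Cust
"""


def rows_with_repeated_cust(rows):
    """Return rows whose Cust repeats and occurs with ID 1 (hypothetical helper)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE table_name (ID INTEGER, Cust TEXT)")
    conn.executemany("INSERT INTO table_name VALUES (?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


result = rows_with_repeated_cust(ROWS)
for row in result:
    print(row)
```

Cust C is dropped because it never occurs with ID 1, even though it appears twice.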

SQL Select random rows partitioned by a column

I have a dataset that looks like this:
| Country | id |
-------------------
| a | 5 |
| a | 1 |
| a | 2 |
| b | 1 |
| b | 5 |
| b | 4 |
| b | 7 |
| c | 5 |
| c | 1 |
| c | 2 |
I need a query that returns two random rows for each country in ('a', 'c'):
| Country | id |
------------------
| a | 2 | -- Two random rows from Country = 'a'
| a | 1 |
| c | 1 |
| c | 5 | --Two random rows from Country = 'c'

This should work:
select Country, id from
(select Country,
id,
row_number() over(partition by Country order by rand()) as rn
from table_name
) t
where Country in ('a', 'c') and rn <= 2
Replace rand() with random() if you're using Postgres or newid() in SQL Server.
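
For a quick local test, here is a sketch of the same query run through Python's sqlite3 module. SQLite also spells the function random(), and window functions need SQLite 3.25 or newer; both the choice of SQLite and the helper name two_random_per_country are assumptions made for illustration.

```python
import sqlite3

# Sample rows from the question: (Country, id).
ROWS = [
    ("a", 5), ("a", 1), ("a", 2),
    ("b", 1), ("b", 5), ("b", 4), ("b", 7),
    ("c", 5), ("c", 1), ("c", 2),
]

# Number the rows of each country in random order, then keep the first two.
QUERY = """
select Country, id
from (
    select Country,
           id,
           row_number() over (partition by Country order by random()) as rn
    from table_name
) t
where Country in ('a', 'c') and rn <= 2
"""


def two_random_per_country(rows):
    """Pick up to two random rows for countries 'a' and 'c' (hypothetical helper)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("create table table_name (Country text, id integer)")
    conn.executemany("insert into table_name values (?, ?)", rows)
    picked = conn.execute(QUERY).fetchall()
    conn.close()
    return picked


picked = two_random_per_country(ROWS)
for row in picked:
    print(row)
```

The ids differ from run to run, but each requested country contributes exactly two rows.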

SQL Query/ Assigning Rank

I am interested in a SQL query, not PL/SQL code.
We need to assign a rank based on the date and id values.
The input table looks like this:
+------------+----+
| date | id |
+------------+----+
| 01-01-2018 | A |
| 02-01-2018 | A |
| 03-01-2018 | C |
| 04-01-2018 | B |
| 05-01-2018 | A |
| 06-01-2018 | C |
| 07-01-2018 | C |
| 08-01-2018 | B |
| 09-01-2018 | B |
| 10-01-2018 | B |
+------------+----+
The output table should look like this:
+------------+----+------+
| date | id | rank |
+------------+----+------+
| 01-01-2018 | A | 1 |
| 02-01-2018 | A | 2 |
| 03-01-2018 | C | 1 |
| 04-01-2018 | B | 1 |
| 05-01-2018 | A | 1 |
| 06-01-2018 | C | 1 |
| 07-01-2018 | C | 2 |
| 08-01-2018 | B | 1 |
| 09-01-2018 | B | 2 |
| 10-01-2018 | B | 3 |
+------------+----+------+

This is a type of gaps-and-islands problem. In this case, the simplest solution is probably the difference of row numbers:
select t.*,
row_number() over (partition by id, (seqnum - seqnum_i)
order by date
) as ranking
from (select t.*,
row_number() over (order by date) as seqnum,
row_number() over (partition by id order by date) as seqnum_i
from t
) t;
Why this works is a little tricky to explain. The difference of the two row numbers is constant across adjacent rows that share the same id, so each run of consecutive rows gets its own value. If you stare at the results of the subquery, you will see how this works.
Then the outer query just uses row_number() to assign the sequential number you want.
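
To make the subquery results easier to stare at, here is a small sketch that runs the query on the sample data with Python's sqlite3 module and also exposes the difference seqnum - seqnum_i as a column. It assumes SQLite 3.25 or newer, and it assumes the dates can be ordered as plain strings, which happens to hold for this sample. The column name grp and the helper name rank_islands are made up for illustration.

```python
import sqlite3

# Sample rows from the question: (date, id). Dates are kept as text.
ROWS = [
    ("01-01-2018", "A"), ("02-01-2018", "A"), ("03-01-2018", "C"),
    ("04-01-2018", "B"), ("05-01-2018", "A"), ("06-01-2018", "C"),
    ("07-01-2018", "C"), ("08-01-2018", "B"), ("09-01-2018", "B"),
    ("10-01-2018", "B"),
]

# Inner query: a global row number and a per-id row number.
INNER = """
select t.*,
       row_number() over (order by "date") as seqnum,
       row_number() over (partition by id order by "date") as seqnum_i
from t
"""

# Outer query: the difference (grp) is constant within each run of the same id,
# so numbering rows per (id, grp) restarts the ranking at every new run.
OUTER = f"""
select s."date", s.id, s.seqnum - s.seqnum_i as grp,
       row_number() over (partition by s.id, (s.seqnum - s.seqnum_i)
                          order by s."date") as ranking
from ({INNER}) s
order by s."date"
"""


def rank_islands(rows):
    """Return (date, id, grp, ranking) for each row (hypothetical helper)."""
    conn = sqlite3.connect(":memory:")
    conn.execute('create table t ("date" text, id text)')
    conn.executemany("insert into t values (?, ?)", rows)
    result = conn.execute(OUTER).fetchall()
    conn.close()
    return result


result = rank_islands(ROWS)
for row in result:
    print(row)
```

On the sample data the grp column comes out as 0, 0, 2, 3, 2, 4, 4, 6, 6, 6, and the ranking column matches the expected output in the question.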

group by top two results based on order

I have been trying to get this to work with some row_number, group by, top, sort of things, but I am missing some fundamental concept. I have a table like so:
+-------+-------+-------+
| name | ord | f_id |
+-------+-------+-------+
| a | 1 | 2 |
| b | 5 | 2 |
| c | 6 | 2 |
| d | 2 | 1 |
| e | 4 | 1 |
| a | 2 | 3 |
| c | 50 | 4 |
+-------+-------+-------+
And my desired output would be:
+-------+---------+--------+-------+
| f_id | ord_n | ord | name |
+-------+---------+--------+-------+
| 2 | 1 | 1 | a |
| 2 | 2 | 5 | b |
| 1 | 1 | 2 | d |
| 1 | 2 | 4 | e |
| 3 | 1 | 2 | a |
| 4 | 1 | 50 | c |
+-------+---------+--------+-------+
The data is ordered by the ord value, with at most two results per f_id. Should I be working on a stored procedure for this, or can I do it with plain SQL? I have experimented with some select TOP subqueries, but nothing has come close.
Here are some statements to create the test table:
create table help(name varchar(255),ord tinyint,f_id tinyint);
insert into help values
('a',1,2),
('b',5,2),
('c',6,2),
('d',2,1),
('e',4,1),
('a',2,3),
('c',50,4);

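You can use the RANK or DENSE_RANK functions. Note that RANK can return more than two rows per f_id when ord values tie; filter on ord_n (ROW_NUMBER) instead if you need a strict cap of two.
select A.name, A.ord_n, A.ord, A.f_id from
(
select
RANK() OVER (partition by f_id ORDER BY ord asc) AS rnk,
ROW_NUMBER() OVER (partition by f_id ORDER BY ord asc) AS ord_n,
help.*
from help
) A where A.rnk <= 2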
Sqlfiddle demo
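
Here is a sketch of the same idea run against the test table from the question, using Python's sqlite3 module. It filters on ROW_NUMBER rather than RANK, an assumption made so that tied ord values could never yield more than two rows per f_id. It also assumes SQLite 3.25 or newer; the helper name top_two_per_f_id is hypothetical.

```python
import sqlite3

# Test rows from the question: (name, ord, f_id).
ROWS = [
    ("a", 1, 2), ("b", 5, 2), ("c", 6, 2),
    ("d", 2, 1), ("e", 4, 1),
    ("a", 2, 3),
    ("c", 50, 4),
]

# Number the rows of each f_id by ascending ord and keep the first two.
QUERY = """
select A.f_id, A.ord_n, A.ord, A.name
from (
    select row_number() over (partition by f_id order by ord asc) as ord_n,
           help.*
    from help
) A
where A.ord_n <= 2
order by A.f_id, A.ord_n
"""


def top_two_per_f_id(rows):
    """Return up to two lowest-ord rows per f_id (hypothetical helper)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("create table help (name varchar(255), ord tinyint, f_id tinyint)")
    conn.executemany("insert into help values (?, ?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


result = top_two_per_f_id(ROWS)
for row in result:
    print(row)
```

The rows match the desired output, except that they are sorted by f_id here so the result is deterministic.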

Finding number of types of accounts from each customer

I am having a lot of trouble with trying to construct a query that will give me the name of each customer and the number of different types of accounts each has. The three types are Checkings, Savings, and CD.
customers:
+--------+--------+
| cid | name |
+--------+--------+
| 1 | a |
| 2 | b |
| 3 | c |
+--------+--------+
accounts:
+-----------+-----------+
| aid | type |
+-----------+-----------+
| 1 | Checkings |
| 2 | Savings |
| 3 | Checkings |
| 4 | CD |
| 5 | CD |
| 6 | Checkings |
+-----------+-----------+
transactions:
+--------+--------+--------+
| tid | cid | aid |
+--------+--------+--------+
| 1 | 1 | 1 |
| 2 | 1 | 2 |
| 3 | 2 | 3 |
| 4 | 3 | 4 |
| 5 | 1 | 5 |
| 6 | 3 | 4 |
| 7 | 1 | 6 |
+--------+--------+--------+
The expected answer would be:
a, 3
b, 1
c, 1
Getting the names is simple enough, but how can I keep count of each individual's account as well as compare the accounts to make sure that it is not the same type?

Just add DISTINCT inside the COUNT:
SELECT a.cid, a.name, COUNT(DISTINCT c.type) totalCount
FROM customers a
INNER JOIN transactions b
ON a.cid = b.cid
INNER JOIN accounts c
ON b.aid = c.aid
GROUP BY a.cid, a.name

Query:
SQLFiddleExample
SELECT
a."name",
COUNT(DISTINCT c."type") totalCount
FROM customers a
INNER JOIN transactions b
ON a."cid" = b."cid"
INNER JOIN accounts c
ON b."aid" = c."aid"
GROUP BY a."cid", a."name"
ORDER BY totalCount DESC
Result:
| NAME | TOTALCOUNT |
---------------------
| a | 3 |
| b | 1 |
| c | 1 |
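
To see what DISTINCT changes in the answers above, here is a small sketch that computes both COUNT(c.type) and COUNT(DISTINCT c.type) on the sample tables using Python's sqlite3 module. SQLite is a stand-in for whichever database the question targets; the all_rows column and the helper name account_type_counts are made up for illustration.

```python
import sqlite3

# Sample tables from the question.
CUSTOMERS = [(1, "a"), (2, "b"), (3, "c")]
ACCOUNTS = [
    (1, "Checkings"), (2, "Savings"), (3, "Checkings"),
    (4, "CD"), (5, "CD"), (6, "Checkings"),
]
TRANSACTIONS = [
    (1, 1, 1), (2, 1, 2), (3, 2, 3), (4, 3, 4),
    (5, 1, 5), (6, 3, 4), (7, 1, 6),
]

# all_rows counts every joined transaction; totalCount counts each type once.
QUERY = """
SELECT a.name,
       COUNT(c.type) AS all_rows,
       COUNT(DISTINCT c.type) AS totalCount
FROM customers a
INNER JOIN transactions b ON a.cid = b.cid
INNER JOIN accounts c ON b.aid = c.aid
GROUP BY a.cid, a.name
ORDER BY totalCount DESC, a.name
"""


def account_type_counts():
    """Return (name, all_rows, totalCount) per customer (hypothetical helper)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (cid INTEGER, name TEXT)")
    conn.execute("CREATE TABLE accounts (aid INTEGER, type TEXT)")
    conn.execute("CREATE TABLE transactions (tid INTEGER, cid INTEGER, aid INTEGER)")
    conn.executemany("INSERT INTO customers VALUES (?, ?)", CUSTOMERS)
    conn.executemany("INSERT INTO accounts VALUES (?, ?)", ACCOUNTS)
    conn.executemany("INSERT INTO transactions VALUES (?, ?, ?)", TRANSACTIONS)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


result = account_type_counts()
for row in result:
    print(row)
```

Customer a has four transactions but only three distinct account types, and customer c has two transactions on the same CD account, which is why the plain COUNT overstates the answer.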
