SQL GROUP BY where either column has same value

I have the following table
User A | User B | Value
-------+--------+------
1 | 2 | 60
3 | 1 | 10
4 | 5 | 50
3 | 5 | 50
5 | 1 | 80
2 | 3 | 10
I want to group together records where either user a = x or user b = x, in order to find averages.
e.g. User 1 appears in the table 3 times, once as 'User A' and twice as 'User B'. So I would want to carry out my AVG() function using those three rows.
I need the highest and lowest average values. Such a query would break down the above table into the following groups:
User | Avg Value
-----+-----
1 | 50
2 | 35
3 | 23.33
4 | 50
5 | 60
and then return
Highest Avg | Lowest Avg
------------+-----------
60 | 23.33
I know that GROUP BY collects together records where a column has the same value. I want to collect together records where either one of two columns has the same value. I have searched through many solutions but can't seem to find one that meets my problem.

A portable option uses union all:
select usr, avg(value) avg_value
from (
select usera usr, value from mytable
union all select userb, value from mytable
) t
group by usr
This gives you the first result set. Then, you can add another level of aggregation to get the maximum and minimum average:
select min(avg_value) min_avg_value, max(avg_value) max_avg_value
from (
select usr, avg(value) avg_value
from (
select usera usr, value from mytable
union all select userb, value from mytable
) t
group by usr
) t
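As a quick check of the union all approach on the sample data from the question, here is a minimal sketch using Python's built-in sqlite3 module. The table name mytable and the column names usera, userb and value follow the answer above; the in-memory database and the helper name user_averages are assumptions made for illustration only.

```python
import sqlite3

# Sample rows from the question: (usera, userb, value).
ROWS = [(1, 2, 60), (3, 1, 10), (4, 5, 50), (3, 5, 50), (5, 1, 80), (2, 3, 10)]

# Unpivot both user columns into one column, then average per user.
UNPIVOT_SQL = """
    select usr, avg(value) as avg_value
    from (
        select usera as usr, value from mytable
        union all
        select userb, value from mytable
    ) t
    group by usr
"""


def user_averages(rows):
    """Return ({user: average}, (lowest_avg, highest_avg)) for the given rows."""
    conn = sqlite3.connect(":memory:")  # throwaway database (illustrative)
    conn.execute("create table mytable (usera integer, userb integer, value integer)")
    conn.executemany("insert into mytable values (?, ?, ?)", rows)
    per_user = dict(conn.execute(UNPIVOT_SQL).fetchall())
    # Second level of aggregation over the per-user averages.
    extremes = conn.execute(
        f"select min(avg_value), max(avg_value) from ({UNPIVOT_SQL}) t"
    ).fetchone()
    conn.close()
    return per_user, extremes


per_user, (lowest, highest) = user_averages(ROWS)
print(per_user)
print(lowest, highest)
```

Each source row contributes twice to the unpivoted set, once per user column, which is why user 1 is averaged over three rows.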
In databases that support lateral joins and values(), this is most conveniently (and efficiently) expressed as follows:
select min(avg_value) min_avg_value, max(avg_value) max_avg_value
from (
select usr, avg(value) avg_value
from mytable t
cross join lateral (values (usera, value), (userb, value)) as x(usr, value)
group by usr
) t
This would work in Postgres for example. In SQL Server, you would just replace cross join lateral with cross apply.

You can unpivot using union all and then aggregation:
select user, avg(value)
from ((select usera as user, value from mytable) union all
(select userb as user, value from mytable)
) u
group by user;
You can get the extremes with another level of aggregation:
select min(avg_value), max(avg_value)
from (select user, avg(value) as avg_value
from ((select usera as user, value from mytable) union all
(select userb as user, value from mytable)
) u
group by user
) ua

Related

Postgresql query to filter latest data based on 2 columns

Table Structure First
users table
id
--
1
2
3
sites table
id
--
1
2
site_memberships table
site_id | user_id | created_on
--------+---------+-----------
1 | 1 | 1
1 | 1 | 2
1 | 1 | 3
2 | 1 | 1
2 | 1 | 2
1 | 2 | 2
1 | 2 | 3
Assume that the higher the created_on number, the more recent the record.
Expected Output
site_id | user_id | created_on
--------+---------+-----------
1 | 1 | 3
2 | 1 | 2
1 | 2 | 3
Expected output: I need latest record for each user for each site membership.
Tried the following query, but this does not seem to work.
select * from users inner join
(
SELECT ROW_NUMBER () OVER (
PARTITION BY sm.user_id,
sm.created_on
), sm.*
from site_memberships sm
inner join sites s on sm.site_id=s.id
) site_memberships
ON site_memberships.user_id = users.user_id where row_number=1
I think you have overcomplicated the problem you want to solve.
You seem to want aggregation:
select site_id, user_id, max(created_on)
from site_memberships sm
group by site_id, user_id;
If you had additional columns that you wanted, you could use distinct on instead:
select distinct on (site_id, user_id) sm.*
from site_memberships sm
order by site_id, user_id, created_on desc;
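To see the aggregation query against the sample data, here is a minimal sketch using Python's built-in sqlite3 module. It only covers the group by form, because distinct on is Postgres-specific and not available in SQLite. The in-memory database and the helper name latest_memberships are assumptions for illustration.

```python
import sqlite3

# Sample rows from the question: (site_id, user_id, created_on).
MEMBERSHIPS = [
    (1, 1, 1), (1, 1, 2), (1, 1, 3),
    (2, 1, 1), (2, 1, 2),
    (1, 2, 2), (1, 2, 3),
]


def latest_memberships(rows):
    """Return the latest created_on per (site_id, user_id) pair."""
    conn = sqlite3.connect(":memory:")  # throwaway database (illustrative)
    conn.execute(
        "create table site_memberships "
        "(site_id integer, user_id integer, created_on integer)"
    )
    conn.executemany("insert into site_memberships values (?, ?, ?)", rows)
    result = conn.execute(
        """
        select site_id, user_id, max(created_on) as created_on
        from site_memberships
        group by site_id, user_id
        order by site_id, user_id
        """
    ).fetchall()
    conn.close()
    return result


latest = latest_memberships(MEMBERSHIPS)
print(latest)
```

The explicit order by is only there to make the output deterministic; the three rows match the expected output in the question.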

How to select IDs that have at least two specific instances in a given column

I'm working with a medical claim table in pyspark and I want to return only userid's that have at least 2 claim_ids. My table looks something like this:
claim_id | userid | diagnosis_type | claim_type
__________________________________________________
1 1 C100 M
2 1 C100a M
3 2 D50 F
5 3 G200 M
6 3 C100 M
7 4 C100a M
8 4 D50 F
9 4 A25 F
From this example, I would want to return userid's 1, 3, and 4 only. Currently I'm building a temp table to count all of the distinct instances of the claim_ids
create table temp.claim_count as
select distinct userid, count(distinct claim_id) as claims
from medical_claims
group by userid
and then pulling from this table when the number of claim_id >1
select distinct userid
from medical_claims
where userid in (
select distinct userid
from temp.claim_count
where claims>1)
Is there a better / more efficient way of doing this?
If you want only the ids, then use group by:
select userid, count(*) as claims
from medical_claims
group by userid
having count(*) > 1;
If you want the original rows, then use window functions:
select mc.*
from (select mc.*, count(*) over (partition by userid) as num_claims
from medical_claims mc
) mc
where num_claims > 1;
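The group by / having form can be tried on the sample claims with a minimal sketch in Python's built-in sqlite3 module. The question targets pyspark, so the SQLite in-memory table here is only a stand-in, and the helper name users_with_multiple_claims is hypothetical.

```python
import sqlite3

# Sample rows from the question: (claim_id, userid, diagnosis_type, claim_type).
CLAIMS = [
    (1, 1, "C100", "M"), (2, 1, "C100a", "M"),
    (3, 2, "D50", "F"),
    (5, 3, "G200", "M"), (6, 3, "C100", "M"),
    (7, 4, "C100a", "M"), (8, 4, "D50", "F"), (9, 4, "A25", "F"),
]


def users_with_multiple_claims(rows):
    """Return (userid, claim count) for users with more than one claim."""
    conn = sqlite3.connect(":memory:")  # stand-in for the real claims table
    conn.execute(
        "create table medical_claims "
        "(claim_id integer, userid integer, diagnosis_type text, claim_type text)"
    )
    conn.executemany("insert into medical_claims values (?, ?, ?, ?)", rows)
    result = conn.execute(
        """
        select userid, count(*) as claims
        from medical_claims
        group by userid
        having count(*) > 1
        order by userid
        """
    ).fetchall()
    conn.close()
    return result


multi_claim_users = users_with_multiple_claims(CLAIMS)
print(multi_claim_users)
```

This assumes claim_id is unique per row, as in the sample; if it is not, count(distinct claim_id) would be the safer aggregate. No temporary table is needed because having filters on the aggregate directly.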

Multiple columns returned by subquery are not yet supported

Given the following table:
transaction_id user_id product_id
1 10 AA
2 10 CC
3 10 AA
4 10 CC
5 20 AA
6 20 BB
7 20 BB
8 30 BB
9 30 BB
10 30 BB
11 40 CC
12 40 AA
13 40 CC
14 40 BB
15 40 BB
16 50 EE
17 60 EE
Using the following query:
select
product_id,
count(distinct user_id) as count_repeat_users
from
product_usage_log
where
(product_id, user_id) in (
select
product_id,
user_id
from (
select
product_id,
user_id,
count (distinct transaction_id) as transactions
from
product_usage_log
group by
product_id,
user_id
) t
where transactions >= 2
)
group by product_id
Returns the following result:
product_id count_repeat_users
AA 1
BB 3
CC 2
(note that 'EE' doesn't appear, as expected)
The purpose of the query above is to return, for every product, the count of users having made at least two transactions with this product. The above query satisfies this, however it is using a multiple-column subquery with an IN predicate. This capability is not available (yet, although it's been talked about for the past two years with no success) in Presto.
How to replicate the above result without the possibility to use where (product_id, user_id) in (...)?
Note: I've tried to flatten the where condition into two successive ones, the problem being that now the condition on ALL columns being matched for EVERY row turns into a condition on ALL columns being matched for ANY row. In other words, now it will match a user-product couple as soon as the product is in the subtable, and the user is in the subtable, but not necessarily in the same row.
Another way to phrase the question is therefore: in Presto, how to make a condition based on a couple of values being present on the SAME row in a subquery?
Regarding "How to replicate the above result without the possibility to use where (product_id, user_id) in (...)?": this is directly available in Presto.
You just need to wrap values produced by subquery in anonymous ROWs, so that they are, in fact, single-column.
Testing with Presto 318:
presto:default> SELECT
-> x, y
-> FROM (VALUES (1,2), (3,4), (5,6)) t(x, y)
-> WHERE (x, y) IN (
-> SELECT (z, w)
-> FROM (VALUES (1,1), (3,4), (5,5)) u(z, w)
-> );
x | y
---+---
3 | 4
(1 row)
Another example with tpch.tiny schema:
presto:tiny> SELECT orderkey
-> FROM orders
-> JOIN customer ON orders.custkey = customer.custkey
-> WHERE (orderkey, nationkey) IN (
-> SELECT (suppkey, nationkey) FROM supplier
-> );
orderkey
----------
3
(1 row)
Note: I'm not entirely sure this works correctly with respect to NULLs. I guess this is not a problem in your case, as long as your subquery does not produce NULLs for product_id, user_id.
You can use window functions. I think this will work:
select product_id, count(distinct user_id)
from (select pul.*,
count(*) over (partition by product_id, user_id) as cnt
from product_usage_log pul
) pul
where cnt >= 2
group by product_id;
Based on your sample data, I am guessing that transaction_id is unique. If not, then use count(distinct transaction_id) in the subquery.
I don't see the reason (at least from your sample data) why you use that WHERE ... IN. You can get what you need without it:
select t.product_id, count(*) count_repeat_users
from (
select user_id, product_id
from product_usage_log
group by user_id, product_id
having count(transaction_id) > 1
) as t
group by product_id
See the demo (for SQL Server but since the code is standard SQL it should work for Presto too).
Results:
product_id | count_repeat_users
AA | 1
BB | 3
CC | 2
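The group by / having rewrite can be checked against the sample data with a minimal sketch in Python's built-in sqlite3 module. SQLite stands in for Presto here, which is an assumption, although the query itself uses only standard SQL. The helper name repeat_users_per_product is hypothetical.

```python
import sqlite3

# Sample rows from the question: (transaction_id, user_id, product_id).
LOG = [
    (1, 10, "AA"), (2, 10, "CC"), (3, 10, "AA"), (4, 10, "CC"),
    (5, 20, "AA"), (6, 20, "BB"), (7, 20, "BB"),
    (8, 30, "BB"), (9, 30, "BB"), (10, 30, "BB"),
    (11, 40, "CC"), (12, 40, "AA"), (13, 40, "CC"), (14, 40, "BB"), (15, 40, "BB"),
    (16, 50, "EE"), (17, 60, "EE"),
]


def repeat_users_per_product(rows):
    """Count, per product, the users with at least two transactions on it."""
    conn = sqlite3.connect(":memory:")  # stand-in for the Presto table
    conn.execute(
        "create table product_usage_log "
        "(transaction_id integer, user_id integer, product_id text)"
    )
    conn.executemany("insert into product_usage_log values (?, ?, ?)", rows)
    result = conn.execute(
        """
        select t.product_id, count(*) as count_repeat_users
        from (
            select user_id, product_id
            from product_usage_log
            group by user_id, product_id
            having count(transaction_id) > 1
        ) as t
        group by t.product_id
        order by t.product_id
        """
    ).fetchall()
    conn.close()
    return result


repeat_users = repeat_users_per_product(LOG)
print(repeat_users)
```

The inner query keeps one row per qualifying (user_id, product_id) pair, so the outer count(*) already counts distinct users and no multi-column IN is required.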

How to get MAX Hike in Min month?

Below is the table:
Name | Hike% | Month
------------------------
A 7 1
A 6 2
A 8 3
b 4 1
b 7 2
b 7 3
Result should be:
Name | Hike% | Month
------------------------
A 8 3
b 7 2
Here is one way of doing this:
SELECT Name, [Hike%], Month
FROM
(
SELECT *, ROW_NUMBER() OVER (PARTITION BY Name ORDER BY [Hike%] DESC, Month) rn
FROM yourTable
) t
WHERE rn = 1
ORDER BY Name;
If you instead want to return multiple records per name, in the case where two or more records might be tied for having the greatest hike%, then replace ROW_NUMBER with RANK.
Use a correlated subquery:
select Name,min(Hike) as Hike,min(Month) as Month
from
(
select * from tablename a
where Hike in (select max(Hike) from tablename b where a.name=b.name)
)A group by Name
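The correlated subquery idea can be tried on the sample rows with a minimal sketch in Python's built-in sqlite3 module. The table name hikes and the column name hike (in place of Hike%) are assumptions chosen to avoid quoting, and the query is a slightly condensed variant of the one above rather than a verbatim copy.

```python
import sqlite3

# Sample rows from the question: (name, hike percentage, month).
HIKES = [
    ("A", 7, 1), ("A", 6, 2), ("A", 8, 3),
    ("b", 4, 1), ("b", 7, 2), ("b", 7, 3),
]


def max_hike_earliest_month(rows):
    """Return, per name, the largest hike and the earliest month it occurred."""
    conn = sqlite3.connect(":memory:")  # throwaway database (illustrative)
    conn.execute("create table hikes (name text, hike integer, month integer)")
    conn.executemany("insert into hikes values (?, ?, ?)", rows)
    result = conn.execute(
        """
        select a.name, a.hike, min(a.month) as month
        from hikes a
        -- keep only the rows that carry the maximum hike for their name
        where a.hike = (select max(b.hike) from hikes b where b.name = a.name)
        group by a.name, a.hike
        order by a.name
        """
    ).fetchall()
    conn.close()
    return result


best_hikes = max_hike_earliest_month(HIKES)
print(best_hikes)
```

For name b the maximum hike of 7 occurs in months 2 and 3, and min(month) resolves the tie to month 2, matching the expected result.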
You can use something similar to the below:
SELECT Name, MAX(Hike), Month
FROM table
GROUP BY Name, Month
Hope this helps :)

Grouping results in sql query by a field in the result

I have a table with the following format:
User | Entity | ID
123 AB 1
123 AB 2
543 BC 3
098 CB 4
543 BC 5
543 ZG 6
etc...
I want to get a result set that only returns the User/Entity pairs and their ID for the greatest ID, so this result for example:
User | Entity | ID
123 AB 2
098 CB 4
543 BC 5
543 ZG 6
Is there any way to do this in SQL?
Try to use group by with max function
select user, Entity, max(id) as id
from table
group by user, Entity
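A minimal sketch of the group by / max approach on the sample data, using Python's built-in sqlite3 module. The table name item and the column name users are borrowed from the other answer in this thread to avoid the reserved words table and user; storing the user as text (to keep the leading zero in 098) and the helper name greatest_id_per_pair are assumptions.

```python
import sqlite3

# Sample rows from the question: (user, entity, id).
ITEMS = [
    ("123", "AB", 1), ("123", "AB", 2),
    ("543", "BC", 3), ("098", "CB", 4),
    ("543", "BC", 5), ("543", "ZG", 6),
]


def greatest_id_per_pair(rows):
    """Return one row per (users, entity) pair with its greatest id."""
    conn = sqlite3.connect(":memory:")  # throwaway database (illustrative)
    conn.execute("create table item (users text, entity text, id integer)")
    conn.executemany("insert into item values (?, ?, ?)", rows)
    result = conn.execute(
        """
        select users, entity, max(id) as max_id
        from item
        group by users, entity
        order by max_id
        """
    ).fetchall()
    conn.close()
    return result


greatest = greatest_id_per_pair(ITEMS)
print(greatest)
```

Grouping by both columns is what makes the result one row per User/Entity pair, so user 543 keeps separate rows for BC and ZG.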
You can also use a CTE with ROW_NUMBER() and PARTITION BY, like this:
;WITH CTE as
(
SELECT
Users,Entity,
ROW_NUMBER() OVER(PARTITION BY Users, Entity ORDER BY ID DESC) AS Row,
Id
FROM Item
)
SELECT Users, Entity, Id From CTE Where Row = 1
Note that we used Order By ID DESC as we need highest ID. You can delete DESC if you want the smallest ID.
SQLFiddle: http://sqlfiddle.com/#!3/1dcb9/4