Get most commonly occurring value for each user id - sql

I have a table with userIds and product categories prod. I want to get a table of unique userIds and the associated most frequently occurring product category prod. In other words, I want to know which item categories each customer is buying the most. How can I achieve this in PL/SQL or Oracle SQL?
|userId|prod|
|------|----|
|123544|cars|
|123544|cars|
|123544|dogs|
|123544|cats|
|987689|bats|
|987689|cats|
I have already seen SO questions for getting the most common value of a column, but how do I get the most common value for each unique userId?

You should use just SQL to solve this. If you really need it in PL/SQL, just embed this query within the PL/SQL.
(setup)
drop table yourtable;
create table yourtable (
userID number,
prod varchar2(10)
)
/
insert into yourtable values ( 123544, 'cars' );
insert into yourtable values ( 123544, 'cars' );
insert into yourtable values ( 123544, 'dogs' );
insert into yourtable values ( 123544, 'cats' );
insert into yourtable values ( 987689, 'bats' );
insert into yourtable values ( 987689, 'cats' );
commit;
-- if there is a tie, this logic returns all of the tied rows
with w_grp as (
select userID, prod, count(*) over ( partition by userID, prod ) rgrp
from yourtable
),
w_rnk as (
select userID, prod, rgrp,
rank() over (partition by userID order by rgrp desc) rnk
from w_grp
)
select distinct userID, prod
from w_rnk
where rnk = 1
/
USERID PROD
---------- ----------
987689 bats
987689 cats
123544 cars
-- assuming you just want one row per user: this will return one arbitrary row if there is a tie (i.e. this time it pulled 987689 bats, next time it might pull 987689 cats). It will always return 123544 cars, however, since there is no tie for that one.
with w_grp as (
select userID, prod, count(*) over ( partition by userID, prod ) rgrp
from yourtable
),
w_rnk as (
select userID, prod, rgrp,
row_number() over (partition by userID order by rgrp desc) rnum
from w_grp
)
select userID, prod, rnum
from w_rnk
where rnum = 1
/
USERID PROD RNUM
---------- ---------- ----------
123544 cars 1
987689 bats 1
[edit] Cleaned up unused rank/row_number from functions to avoid confusion [/edit]
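If you would rather break a tie deterministically than arbitrarily, you can add a tiebreaker to the ORDER BY of ROW_NUMBER() (a sketch based on the second query above; here the alphabetically first prod wins a tie):
with w_grp as (
select userID, prod, count(*) over ( partition by userID, prod ) rgrp
from yourtable
),
w_rnk as (
select userID, prod, rgrp,
-- prod is the tiebreaker, so ties resolve the same way on every run
row_number() over (partition by userID order by rgrp desc, prod) rnum
from w_grp
)
select userID, prod
from w_rnk
where rnum = 1
/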

SELECT user_id, prod, prod_cnt FROM (
    SELECT user_id, prod, prod_cnt,
           RANK() OVER ( PARTITION BY user_id ORDER BY prod_cnt DESC ) AS rn
    FROM (
        SELECT user_id, prod, COUNT(*) AS prod_cnt
        FROM mytable
        GROUP BY user_id, prod
    )
) WHERE rn = 1;
In the innermost subquery I am getting the COUNT of each product by user. Then I rank them using the analytic (window) function RANK(). Then I simply select all of those where the RANK is equal to 1. Using RANK() instead of ROW_NUMBER() ensures that ties will be returned.
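If a single (arbitrary) winner per user is enough, Oracle's STATS_MODE aggregate is a shorter alternative, though it gives you no control over how ties are resolved (a sketch against the yourtable setup from the first answer):
-- returns the most frequently occurring prod per userID; ties are resolved arbitrarily
SELECT userID, STATS_MODE(prod) AS most_common_prod
FROM yourtable
GROUP BY userID;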

Related

Group by two columns, take sum, then max

There are three columns: Id (char), Name (char), and Score (int).
First, we group by Id and Name and add Score for each group. Let us call the added score total_score.
Then, we group by Name and take only the maximum of total_score and its corresponding Id and Name. I've got everything else but I'm having a hard time figuring out how to get the Id. The error I get is
Column 'Id' is invalid in the select list because
it is not contained in either an aggregate function or the GROUP BY
clause.
WITH Tmp AS
(SELECT Id,
Name,
SUM(Score) AS total_score
FROM Mytable
GROUP BY Id,
Name)
SELECT Name, -- Id,
MAX(total_score) AS max_score
FROM Tmp
GROUP BY Name
ORDER BY max_score DESC
Just add row_number(), partitioned by Name, to your query and take the first row (ordered by total_score descending):
select *
from
(
-- your existing `total_score` query
SELECT Id, Name,
SUM(Score) AS total_score,
r = row_number() over (partition by Name order by SUM(Score) desc)
FROM Mytable
GROUP BY Id, Name
) d
where r = 1
WITH Tmp AS
(SELECT Id,
Name,
SUM(Score) AS total_score
FROM Mytable
GROUP BY Id,
Name)
SELECT Name, Id,
MAX(total_score) AS max_score
FROM Tmp
GROUP BY Name,id
ORDER BY max_score DESC
Try this. Hope this will help.
WITH Tmp AS
(
SELECT Id,
Name,
SUM(Score) AS total_score
FROM Mytable
GROUP BY Id,
NAME
)
SELECT Name, Id,
MAX(total_score) AS max_score
FROM Tmp
GROUP BY Name,id
ORDER BY max_score DESC
Note: when an aggregate function is used, every other column in the SELECT list must appear in the GROUP BY clause.
In your case you are aggregating with SUM(Score), so the other selected columns (Name and Id) have to be added to GROUP BY.
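For illustration, a minimal pair showing that rule (assuming the Mytable from the question):
-- Fails with the "not contained in either an aggregate function or the GROUP BY clause" error,
-- because Id is selected but neither aggregated nor grouped:
SELECT Id, Name, SUM(Score) AS total_score
FROM Mytable
GROUP BY Name;

-- Valid: every non-aggregated column in the SELECT list appears in GROUP BY
SELECT Id, Name, SUM(Score) AS total_score
FROM Mytable
GROUP BY Id, Name;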
I am not sure about the performance of the query below, but we can use a window function to take the maximum value over each partition.
SELECT
Id,
Name,
SUM(Score) AS total_score,
MAX(SUM(Score)) OVER(Partition by Name) AS max_score
FROM Mytable
GROUP BY Id, Name;
Tested -
declare #Mytable table (id int, name varchar(10), score int);
insert into #Mytable values
(1,'abc', 100),
(2,'abc', 200),
(3,'def', 300),
(3,'def', 400),
(4,'pqr', 500);
Output -
Id Name total_score max_score
1 abc 100 200
2 abc 200 200
3 def 700 700
4 pqr 500 500
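To return only the rows that actually reach the maximum, the query above can be wrapped and filtered on max_score (a sketch; ties within a Name are all kept):
SELECT Id, Name, total_score
FROM (
SELECT Id, Name,
SUM(Score) AS total_score,
MAX(SUM(Score)) OVER (PARTITION BY Name) AS max_score
FROM Mytable
GROUP BY Id, Name
) t
WHERE total_score = max_score;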
You can select DENSE_RANK() over the total_score column and then select the records with Rank = 1. This also works when several rows within the same Name share the same total_score (all of them are returned).
WITH Tmp AS
(SELECT Id,
Name,
SUM(Score) AS total_score
FROM Mytable
GROUP BY Id, Name)
SELECT Id,
Name,
total_score AS max_score
FROM (SELECT Id,
Name,
total_score,
DENSE_RANK() OVER (PARTITION BY Name ORDER BY total_score DESC) AS Rank
FROM Tmp) AS Tmp2
WHERE Rank = 1
You can try this as well:
select id,name,max(total_score) over (partition by name) max_score from (
select id,name,sum(score) as total_score from YOURTABLE
group by id,name
) t

Select MAX Value for Each ROW - Oracle Sql

I have a question.
I need to find the latest occurrence for a specific list of Customers; let's say, to simplify, I need it for 3 Customers out of 100.
I need to check when each of them last got a bonus.
The table would be:
EVENT_TBL
Fields: Account ID, EVENT_DATE, BONUS ID, ....
Can you suggest a way to grab the latest (MAX) EVENT_DATE (that means one row each)?
I'm using SELECT ... IN to specify the Account IDs, but I'm not sure how to use MAX, GROUP BY, etc. (if they are even needed).
Use the ROW_NUMBER() analytic function:
SELECT *
FROM (
SELECT t.*,
ROW_NUMBER() OVER ( PARTITION BY Account_id ORDER BY event_date DESC ) AS rn
FROM EVENT_TBL t
WHERE Account_ID IN ( 123, 456, 789 )
)
WHERE rn = 1
You can try
with AccountID_Max_EVENT_DATE as (
select AccountID, max(EVENT_DATE) MAX_D
from EVENT_TBL
group by AccountID
)
SELECT E.*
FROM EVENT_TBL E
INNER JOIN AccountID_Max_EVENT_DATE M
ON (E.AccountID = M.AccountID AND M.MAX_D = E.EVENT_DATE)

SQL oracle Select condition in group

Let's suppose I have a table with 3 columns:
ID | GroupId | DateCreatedOn |
I want to select the data grouped by GroupId, so:
select GroupId from tableName group by GroupId;
But what if I want to execute another select on each group? Let's suppose now that I want the last created row (DateCreatedOn) of each group.
Also, I would like to retrieve ALL the columns and not only the GroupId.
I'm kind of lost because I only have GroupId available.
Please provide some explanation and not only the correct query.
You can use ROW_NUMBER for this:
SELECT ID, GroupId, DateCreatedOn
FROM (
SELECT ID, GroupId, DateCreatedOn,
ROW_NUMBER() OVER (PARTITION BY GroupId
ORDER BY DateCreatedOn DESC) AS rn
FROM mytable) t
WHERE t.rn = 1
The rn field is equal to 1 for the record with the most recent DateCreatedOn value within each GroupId partition.
You can get the values with the maximum of another column using KEEP ( DENSE_RANK [FIRST|LAST] ORDER BY ... ) in the aggregation:
SELECT GroupID,
MAX( ID ) KEEP ( DENSE_RANK LAST ORDER BY DateCreatedOn ) AS id,
MAX( DateCreatedOn ) AS DateCreatedOn
FROM table_name
GROUP BY GroupId
You can also do it using ROW_NUMBER():
SELECT ID,
GroupID,
DateCreatedOn
FROM (
SELECT t.*,
ROW_NUMBER() OVER ( PARTITION BY GroupID
ORDER BY DateCreatedOn DESC, ID DESC ) AS RN
FROM table_name t
)
WHERE RN = 1
(ID DESC is added to the ORDER BY to get the maximum ID for the latest DateCreatedOn to give the same result as the first query; if you don't have a deterministic order then you are likely to get whichever row the database produces first and the result can be non-deterministic)

SQL Server: INSERT INTO SELECT MAX

I want to insert multiple rows from another table. The problem is that I want to get the MAX + 1 before I insert. Note that I know I should use Identity etc... However, I have this complex scenario of offline database synchronization across nodes...
INSERT INTO Purchase_Deliveries_Items
(ID,Item_ID)
SELECT
(SELECT
MAX(ID)+1 -- same MAX ID for all (the problem)
FROM
Purchase_Deliveries_Items),
Item_ID
FROM
Purchase_Orders_Items
WHERE
PurchaseOrder_ID = 1
You can get a new ID based on the max existing ID with the help of ROW_NUMBER:
INSERT INTO Purchase_Deliveries_Items (
ID,
Item_ID
)
SELECT
ROW_NUMBER() OVER (
ORDER BY
Item_ID
) + (SELECT MAX(ID)
FROM Purchase_Deliveries_Items) newID,
Item_ID
FROM Purchase_Orders_Items
WHERE
PurchaseOrder_ID = 1
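One caveat (an assumption on my part, not part of the original question): if Purchase_Deliveries_Items can be empty, MAX(ID) returns NULL and the inserted IDs would all be NULL, so it is worth wrapping it in ISNULL/COALESCE:
INSERT INTO Purchase_Deliveries_Items (
ID,
Item_ID
)
SELECT
ROW_NUMBER() OVER (
ORDER BY
Item_ID
) + (SELECT ISNULL(MAX(ID), 0) -- fall back to 0 when the target table is empty
FROM Purchase_Deliveries_Items) newID,
Item_ID
FROM Purchase_Orders_Items
WHERE
PurchaseOrder_ID = 1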
Seems like you want to do a kind of sequence; use ROW_NUMBER:
INSERT INTO Purchase_Deliveries_Items
(ID,Item_ID)
SELECT (
SELECT MAX(ID) -- same MAX ID for all
FROM Purchase_Deliveries_Items
) + ROW_NUMBER() OVER (ORDER BY any_column),
Item_ID
FROM Purchase_Orders_Items
WHERE PurchaseOrder_ID = 1

How to find duplicate records in PostgreSQL

I have a PostgreSQL database table called "user_links" which currently allows the following duplicate fields:
year, user_id, sid, cid
The unique constraint is currently the first field, called "id". However, I am now looking to add a constraint to make sure the combination of year, user_id, sid and cid is unique, but I cannot apply the constraint because duplicate values which violate it already exist.
Is there a way to find all duplicates?
The basic idea is to use a nested query with a count aggregate:
select * from yourTable ou
where (select count(*) from yourTable inr
where inr.sid = ou.sid) > 1
You can adjust the where clause in the inner query to narrow the search.
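For the four columns from the question, that adjustment might look like this (a sketch assuming the user_links table; note that plain equality will not match rows where these columns are NULL):
select * from user_links ou
where (select count(*) from user_links inr
where inr.year = ou.year
and inr.user_id = ou.user_id
and inr.sid = ou.sid
and inr.cid = ou.cid) > 1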
There is another good solution for this mentioned in the comments (but not everyone reads them):
select Column1, Column2, count(*)
from yourTable
group by Column1, Column2
HAVING count(*) > 1
Or shorter:
SELECT (yourTable.*)::text, count(*)
FROM yourTable
GROUP BY yourTable.*
HAVING count(*) > 1
From "Find duplicate rows with PostgreSQL" here's smart solution:
select * from (
SELECT id,
ROW_NUMBER() OVER(PARTITION BY column1, column2 ORDER BY id asc) AS Row
FROM tbl
) dups
where
dups.Row > 1
To make it easier, I assume that you wish to apply a unique constraint only to the column year and that the primary key is a column named id.
In order to find duplicate values, you should run:
SELECT year, COUNT(id)
FROM YOUR_TABLE
GROUP BY year
HAVING COUNT(id) > 1
ORDER BY COUNT(id);
Using the SQL statement above, you get a table which contains all the duplicate years in your table. In order to delete all the duplicates except the latest duplicate entry, you should use the SQL statement below.
DELETE
FROM YOUR_TABLE A USING YOUR_TABLE B
WHERE A.year=B.year AND A.id<B.id;
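Once the duplicates are gone, the unique constraint the question asks for can be added (a sketch using the table and column names from the question; the constraint name is made up):
ALTER TABLE user_links
ADD CONSTRAINT user_links_year_user_sid_cid_key UNIQUE (year, user_id, sid, cid);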
You can join the table to itself on the fields that would be duplicated and then anti-join on the id field. Select the id field from the first table alias (tn1), then use the array_agg function on the id field of the second table alias. Finally, for the array_agg function to work properly, group the results by the tn1.id field. This produces a result set that contains the id of a record and an array of all the ids that fit the join conditions.
select tn1.id,
array_agg(tn2.id) as duplicate_entries
from table_name tn1 join table_name tn2 on
tn1.year = tn2.year
and tn1.sid = tn2.sid
and tn1.user_id = tn2.user_id
and tn1.cid = tn2.cid
and tn1.id <> tn2.id
group by tn1.id;
Obviously, ids that appear in the duplicate_entries array for one id will also have their own entries in the result set. You will have to use this result set to decide which id you want to become the source of 'truth', i.e. the one record that shouldn't get deleted. Maybe you could do something like this:
with dupe_set as (
select tn1.id,
array_agg(tn2.id) as duplicate_entries
from table_name tn1 join table_name tn2 on
tn1.year = tn2.year
and tn1.sid = tn2.sid
and tn1.user_id = tn2.user_id
and tn1.cid = tn2.cid
and tn1.id <> tn2.id
group by tn1.id
order by tn1.id asc)
select ds.id from dupe_set ds where not exists
(select de from unnest(ds.duplicate_entries) as de where de < ds.id)
This selects the lowest-numbered ids that have duplicates (assuming the id is an increasing int PK). These are the ids you would keep around.
Inspired by Sandro Wiggers, I did something similar to:
WITH ordered AS (
SELECT id,year, user_id, sid, cid,
rank() OVER (PARTITION BY year, user_id, sid, cid ORDER BY id) AS rnk
FROM user_links
),
to_delete AS (
SELECT id
FROM ordered
WHERE rnk > 1
)
DELETE
FROM user_links
USING to_delete
WHERE user_links.id = to_delete.id;
If you want to test it, change it slightly:
WITH ordered AS (
SELECT id,year, user_id, sid, cid,
rank() OVER (PARTITION BY year, user_id, sid, cid ORDER BY id) AS rnk
FROM user_links
),
to_delete AS (
SELECT id,year,user_id,sid, cid
FROM ordered
WHERE rnk > 1
)
SELECT * FROM to_delete;
This gives an overview of what is going to be deleted (there is no problem keeping year, user_id, sid and cid in the to_delete query when running the deletion, but then they are not needed).
In your case, because of the constraint you need to delete the duplicated records.
Find the duplicated rows
Organize them by created_at date - in this case I'm keeping the oldest
Delete the records with USING to filter the right rows
WITH duplicated AS (
SELECT id,
count(*)
FROM products
GROUP BY id
HAVING count(*) > 1),
ordered AS (
SELECT p.id,
created_at,
rank() OVER (partition BY p.id ORDER BY p.created_at) AS rnk
FROM products p
JOIN duplicated d ON d.id = p.id ),
products_to_delete AS (
SELECT id,
created_at
FROM ordered
WHERE rnk > 1
)
DELETE
FROM products
USING products_to_delete
WHERE products.id = products_to_delete.id
AND products.created_at = products_to_delete.created_at;
The following SQL provides better performance when checking for duplicate rows.
SELECT id, count(id)
FROM table1
GROUP BY id
HAVING count(id) > 1
begin;
create table user_links(id serial,year bigint, user_id bigint, sid bigint, cid bigint);
insert into user_links(year, user_id, sid, cid) values (null,null,null,null),
(null,null,null,null), (null,null,null,null),
(1,2,3,4), (1,2,3,4),
(1,2,3,4),(1,1,3,8),
(1,1,3,9),
(1,null,null,null),(1,null,null,null);
commit;
Set operations with DISTINCT ON and EXCEPT also work:
(select id, year, user_id, sid, cid from user_links order by 1)
except
select distinct on (year, user_id, sid, cid) id, year, user_id, sid, cid
from user_links order by 1;
EXCEPT ALL also works, since the serial id makes every row unique.
(select id, year, user_id, sid, cid from user_links order by 1)
except all
select distinct on (year, user_id, sid, cid)
id, year, user_id, sid, cid from user_links order by 1;
So far this works for both NULL and non-NULL values.
To delete:
with a as(
(select id, year, user_id, sid, cid from user_links order by 1)
except all
select distinct on (year, user_id, sid, cid)
id, year, user_id, sid, cid from user_links order by 1)
delete from user_links using a where user_links.id = a.id returning *;
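After the delete you can re-check that no duplicates remain before adding the unique constraint (a sketch; it should return no rows):
select year, user_id, sid, cid, count(*)
from user_links
group by year, user_id, sid, cid
having count(*) > 1;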