Oracle: SQL query for deleting duplicate rows based on a group

I need a SQL query to delete duplicates from a table. Let's start with my tables.
rc_document (there are more entries, this is just an example):
+----------------+-------------+----------------------+
| rc_document_id | document_id | rc_document_group_id |
+----------------+-------------+----------------------+
| 1 | 1 | 1 |
| 2 | 2 | 1 |
| 3 | 3 | 1 |
| 4 | 4 | 1 |
| 5 | 1 | 2 |
| 6 | 3 | 2 |
+----------------+-------------+----------------------+
(A document_id can exist in multiple rc_document_groups.)
rc_document_group:
+----------------------+----------+
| rc_document_group_id | priority |
+----------------------+----------+
| 1 | 1 |
| 2 | 2 |
+----------------------+----------+
Each rc_document row can be joined with rc_document_group, which holds the priority for each rc_document.
I want to delete the rc_document rows whose document_id does not have the highest priority in rc_document_group. Because a document_id can exist in multiple rc_document_groups, I just want to keep the one with the highest priority.
Here is my expected rc_document table after deleting the duplicate document_ids:
+----------------+-------------+----------------------+
| rc_document_id | document_id | rc_document_group_id |
+----------------+-------------+----------------------+
| 2 | 2 | 1 |
| 4 | 4 | 1 |
| 5 | 1 | 2 |
| 6 | 3 | 2 |
+----------------+-------------+----------------------+
The rc_document rows with rc_document_id 1 and 3 must be deleted, because their document_ids 1 and 3 are also in another rc_document_group with a higher priority.
I'm new to SQL and have no idea how to write this query. Thanks for your help!

First, you could join the two tables in order to get the corresponding priority on each row. After that, you could use the analytic function MAX() to get, for each row, the max priority within each group of document_id. At this point, you filter out the rows where the priority is not equal to the max priority in the group.
Try this query, which returns the rows to keep:
SELECT t.rc_document_id,
       t.document_id,
       t.rc_document_group_id
  FROM (SELECT d.*,
               g.priority,
               MAX(g.priority) OVER (PARTITION BY d.document_id) max_priority
          FROM rc_document d
         INNER JOIN rc_document_group g
            ON d.rc_document_group_id = g.rc_document_group_id) t
 WHERE t.priority = t.max_priority
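Since the question asks for a DELETE rather than a SELECT, the same idea can be turned around: delete the rows whose priority is below the maximum for their document_id. The following is only a sketch. It runs the statement through Python's sqlite3 module (SQLite 3.25 or later, for window functions) against the sample data, so it can be tried without an Oracle instance, and the function names are made up for the demo. The DELETE itself uses only a join, an analytic MAX() and IN, which Oracle should accept as well, but it has not been run against Oracle here.

```python
import sqlite3

# Hypothetical helper: the statement mirrors the SELECT above, but picks the
# rows to delete (priority below the per-document maximum) instead of the
# rows to keep.
DELETE_SQL = """
DELETE FROM rc_document
 WHERE rc_document_id IN (
       SELECT t.rc_document_id
         FROM (SELECT d.rc_document_id,
                      g.priority,
                      MAX(g.priority) OVER (PARTITION BY d.document_id) AS max_priority
                 FROM rc_document d
                INNER JOIN rc_document_group g
                   ON d.rc_document_group_id = g.rc_document_group_id) t
        WHERE t.priority < t.max_priority)
"""


def build_sample_db():
    """Create an in-memory database holding the sample rows from the question."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE rc_document_group (
            rc_document_group_id INTEGER PRIMARY KEY,
            priority             INTEGER);
        CREATE TABLE rc_document (
            rc_document_id       INTEGER PRIMARY KEY,
            document_id          INTEGER,
            rc_document_group_id INTEGER);
        INSERT INTO rc_document_group VALUES (1, 1), (2, 2);
        INSERT INTO rc_document VALUES
            (1, 1, 1), (2, 2, 1), (3, 3, 1), (4, 4, 1), (5, 1, 2), (6, 3, 2);
    """)
    return conn


def delete_outranked_documents(conn):
    """Run the DELETE and return the rows that remain, ordered by id."""
    conn.execute(DELETE_SQL)
    return conn.execute(
        "SELECT rc_document_id, document_id, rc_document_group_id "
        "FROM rc_document ORDER BY rc_document_id").fetchall()


remaining = delete_outranked_documents(build_sample_db())
print(remaining)
```

Note that if two rows for the same document_id tie for the highest priority, both are kept, because neither is strictly below the maximum.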

Related

Merging multiple "state-change" time series

Given a number of tables like the following, representing state-changes at time t of an entity identified by id:
| A | | B |
| t | id | a | | t | id | b |
| - | -- | - | | - | -- | - |
| 0 | 1 | 1 | | 0 | 1 | 3 |
| 1 | 1 | 2 | | 2 | 1 | 2 |
| 5 | 1 | 3 | | 3 | 1 | 1 |
where t is in reality a DateTime field with millisecond precision (making discretisation infeasible), how would I go about creating the following output?
| output |
| t | id | a | b |
| - | -- | - | - |
| 0 | 1 | 1 | 3 |
| 1 | 1 | 2 | 3 |
| 2 | 1 | 2 | 2 |
| 3 | 1 | 2 | 1 |
| 5 | 1 | 3 | 1 |
The idea is that for any given input timestamp, the entire state of a selected entity can be extracted by selecting one row from the resulting table. So the latest state of each variable corresponding to any time needs to be present in each row.
I've tried various JOIN statements, but I seem to be getting nowhere.
Note that in my use case:
rows also need to be joined by entity id
there may be more than two source tables to be merged
I'm running PostgreSQL, but I will eventually translate the query to SQLAlchemy, so a pure SQLAlchemy solution would be even better
I've created a db<>fiddle with the example data.
I think you want a full join and some other manipulations. The ideal would be:
select t, id,
       last_value(a.a ignore nulls) over (partition by id order by t) as a,
       last_value(b.b ignore nulls) over (partition by id order by t) as b
from a full join
     b
     using (t, id);
But . . . Postgres doesn't support ignore nulls. So an alternative method is:
select t, id,
       max(a) over (partition by id, grp_a) as a,
       max(b) over (partition by id, grp_b) as b
from (select *,
             count(a.a) over (partition by id order by t) as grp_a,
             count(b.b) over (partition by id order by t) as grp_b
      from a full join
           b
           using (t, id)
     ) ab;
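Why the second query works: a running COUNT(a.a) only increases on rows where a is not null, so every null row shares its group number with the last non-null row before it, and MAX() over that group picks up that single non-null value. Below is a small sketch of that trick on the example data, run through Python's sqlite3 module (SQLite 3.25 or later). One deliberate change, which is an assumption of this demo and not part of the answer: the FULL JOIN is replaced by a UNION of the (t, id) keys plus two LEFT JOINs, because older SQLite builds lack FULL JOIN. In PostgreSQL the FULL JOIN ... USING (t, id) from the query above can stay as it is.

```python
import sqlite3

# Forward-fill via "count of non-nulls so far" as a group number.
# The `keys` CTE stands in for the FULL JOIN of the original query.
MERGE_SQL = """
WITH keys AS (
    SELECT t, id FROM a
    UNION
    SELECT t, id FROM b
),
ab AS (
    SELECT k.t, k.id,
           a.a AS a,
           b.b AS b,
           COUNT(a.a) OVER (PARTITION BY k.id ORDER BY k.t) AS grp_a,
           COUNT(b.b) OVER (PARTITION BY k.id ORDER BY k.t) AS grp_b
      FROM keys k
      LEFT JOIN a ON a.t = k.t AND a.id = k.id
      LEFT JOIN b ON b.t = k.t AND b.id = k.id
)
SELECT t, id,
       MAX(a) OVER (PARTITION BY id, grp_a) AS a,
       MAX(b) OVER (PARTITION BY id, grp_b) AS b
  FROM ab
 ORDER BY id, t
"""


def merge_states():
    """Load the two example tables and return the merged state rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE a (t INTEGER, id INTEGER, a INTEGER);
        CREATE TABLE b (t INTEGER, id INTEGER, b INTEGER);
        INSERT INTO a VALUES (0, 1, 1), (1, 1, 2), (5, 1, 3);
        INSERT INTO b VALUES (0, 1, 3), (2, 1, 2), (3, 1, 1);
    """)
    return conn.execute(MERGE_SQL).fetchall()


merged = merge_states()
for row in merged:
    print(row)
```

Further source tables follow the same pattern: one more SELECT in the key union, one more LEFT JOIN, and one more COUNT/MAX pair. A state that is genuinely set back to NULL cannot be told apart from "no change" with this approach.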

Grouping the rows on the basis of specific condition in SQL Server

I want to group the rows on the basis of a specific condition.
The table structure is something like this
EmpID | EmpName | TaskId | A_Shift_Status | B_Shift_Status | C_Shift_Status | D_Shift_Status
1 | John | 1 | 1 | null | 2 | 1
1 | John | 2 | 1 | null | 1 | 1
2 | Mike | 3 | 1 | 1 | 2 | 1
2 | Mike | 4 | null | 1 | null | 1
3 | Steve | 5 | null | 1 | 2 | 1
3 | Steve | 6 | 1 | null | 2 | 1
The criteria will be
Done 1
Pending 2
NA 3
The expected output is to group the employees by task and the status will be on the following condition
if ALL tasks are done by any employee then the status will be done (i.e. 1)
if ANY of the tasks is incomplete then the status will be incomplete/pending (i.e. 2)
So the desired output will be
EmpID | EmpName | A_Shift_Status | B_Shift_Status | C_Shift_Status | D_Shift_Status
1 | John | 1 | null | 2 | 1
2 | Mike | 1 | 1 | 2 | 1
3 | Steve | 1 | 1 | 2 | 1
So in other terms summary/grouping should only show complete/done (i.e. 1) when all the rows of a particular shift column of an employee have status as complete/done (i.e. 1)
Based on your data (where the criteria are 1, 2 and NULL for n/a), a simple 'group by' the employee, and MAX of the columns, should work e.g.,
SELECT
    yt.EmpID,
    yt.EmpName,
    MAX(yt.A_Shift_Status) AS A_Shift_Status,
    MAX(yt.B_Shift_Status) AS B_Shift_Status,
    MAX(yt.C_Shift_Status) AS C_Shift_Status,
    MAX(yt.D_Shift_Status) AS D_Shift_Status
FROM
    yourtable yt
GROUP BY
    yt.EmpID,
    yt.EmpName;
For the shift statuses:
If any of them are 2, it returns 2
otherwise if any of them are 1, it returns 1
otherwise it returns NULL
Notes re 1/2/3 (which was specified as criteria) vs 1/2/NULL (which is in the data)
It gets a little trickier if the inputs are supposed to use 1/2/3 instead of 1/2/NULL. Let us know if you are changing the inputs to reflect that.
If the input is fine as NULLs, but you need the output to have '3' for n/a (nulls), you can put an ISNULL or COALESCE around the MAX statements e.g., ISNULL(MAX(yt.A_Shift_Status), 3) AS A_Shift_Status
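To see both variants side by side, here is a minimal sketch that loads the sample rows into an in-memory SQLite database through Python's sqlite3 module and runs the GROUP BY with COALESCE around each MAX. It relies on aggregate MAX() skipping NULLs, which holds in SQLite as well as SQL Server. The function name summarize and the na_value parameter are made up for the demo, and COALESCE is used rather than ISNULL because ISNULL with two arguments is SQL Server specific.

```python
import sqlite3

SHIFT_COLUMNS = ["A_Shift_Status", "B_Shift_Status", "C_Shift_Status", "D_Shift_Status"]

# Sample rows from the question; None stands for NULL (n/a).
ROWS = [
    (1, "John", 1, 1, None, 2, 1),
    (1, "John", 2, 1, None, 1, 1),
    (2, "Mike", 3, 1, 1, 2, 1),
    (2, "Mike", 4, None, 1, None, 1),
    (3, "Steve", 5, None, 1, 2, 1),
    (3, "Steve", 6, 1, None, 2, 1),
]


def summarize(na_value=None):
    """Group by employee; MAX() skips NULLs, COALESCE fills an all-NULL group.

    With na_value=None the COALESCE is a no-op and all-NULL groups stay NULL.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE yourtable (EmpID INTEGER, EmpName TEXT, TaskId INTEGER, "
        + ", ".join(f"{col} INTEGER" for col in SHIFT_COLUMNS) + ")")
    conn.executemany("INSERT INTO yourtable VALUES (?, ?, ?, ?, ?, ?, ?)", ROWS)
    select_list = ", ".join(
        f"COALESCE(MAX(yt.{col}), :na) AS {col}" for col in SHIFT_COLUMNS)
    sql = (f"SELECT yt.EmpID, yt.EmpName, {select_list} "
           "FROM yourtable yt GROUP BY yt.EmpID, yt.EmpName ORDER BY yt.EmpID")
    return conn.execute(sql, {"na": na_value}).fetchall()


print(summarize())            # NULL kept for n/a
print(summarize(na_value=3))  # 3 reported for n/a
```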

First two rows per combination of two columns

Given a table like this in PostgreSQL:
Messages
message_id | creating_user_id | receiving_user_id | created_utc
-----------+------------------+-------------------+-------------
1 | 1 | 2 | 1424816011
2 | 3 | 2 | 1424816012
3 | 3 | 2 | 1424816013
4 | 1 | 3 | 1424816014
5 | 1 | 3 | 1424816015
6 | 2 | 1 | 1424816016
7 | 2 | 1 | 1424816017
8 | 1 | 2 | 1424816018
I want to get the newest two rows per creating_user_id/receiving_user_id where the other user_id is 1. So the result of the query should look like:
message_id | creating_user_id | receiving_user_id | created_utc
-----------+------------------+-------------------+-------------
1 | 1 | 2 | 1424816011
4 | 1 | 3 | 1424816014
5 | 1 | 3 | 1424816015
6 | 2 | 1 | 1424816016
Using a window function with row_number() I can get the first 2 messages for each creating_user_id or the first 2 messages for each receiving_user_id, but I'm not sure how to get the first two messages per creating_user_id/receiving_user_id.
Since you only keep rows where one of the two columns is 1 (which makes that column irrelevant for grouping), and 1 happens to be the smallest number of all, you can simply use GREATEST(creating_user_id, receiving_user_id) to distill the relevant number to PARTITION BY. (Otherwise you could employ CASE.)
The rest is standard procedure: calculate a row number in a subquery and select the first two in the outer query:
SELECT message_id, creating_user_id, receiving_user_id, created_utc
FROM  (
   SELECT *
        , row_number() OVER (PARTITION BY GREATEST(creating_user_id, receiving_user_id)
                             ORDER BY created_utc) AS rn
   FROM   messages
   WHERE  1 IN (creating_user_id, receiving_user_id)
   ) sub
WHERE  rn < 3
ORDER  BY created_utc;
Exactly your result.
SQL Fiddle.
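The GREATEST shortcut only works because the fixed user happens to have the smallest id. For any other user, the CASE variant mentioned above is the general form: partition by "the other user" of each message. Here is a sketch of that variant, run through Python's sqlite3 module (SQLite 3.25 or later) on the sample data. The function name and the :me parameter are made up for the demo; the SQL itself should carry over to PostgreSQL with a bound parameter in place of :me.

```python
import sqlite3

# Partition by the *other* participant, whichever column holds it.
FIRST_TWO_SQL = """
SELECT message_id, creating_user_id, receiving_user_id, created_utc
FROM  (
   SELECT *
        , row_number() OVER (
             PARTITION BY CASE WHEN creating_user_id = :me
                               THEN receiving_user_id
                               ELSE creating_user_id END
             ORDER BY created_utc) AS rn
   FROM   messages
   WHERE  :me IN (creating_user_id, receiving_user_id)
   ) sub
WHERE  rn < 3
ORDER  BY created_utc
"""

MESSAGES = [
    (1, 1, 2, 1424816011), (2, 3, 2, 1424816012),
    (3, 3, 2, 1424816013), (4, 1, 3, 1424816014),
    (5, 1, 3, 1424816015), (6, 2, 1, 1424816016),
    (7, 2, 1, 1424816017), (8, 1, 2, 1424816018),
]


def first_two_per_partner(me):
    """Return the first two messages (by created_utc) per conversation partner of `me`."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE messages (message_id INTEGER PRIMARY KEY, "
        "creating_user_id INTEGER, receiving_user_id INTEGER, created_utc INTEGER)")
    conn.executemany("INSERT INTO messages VALUES (?, ?, ?, ?)", MESSAGES)
    return conn.execute(FIRST_TWO_SQL, {"me": me}).fetchall()


print([row[0] for row in first_two_per_partner(1)])
```

To get the newest two instead of the first two, ORDER BY created_utc DESC inside the window is the only change needed.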

CREATE VIEW with multiple tables - must show 0 values

I have three tables:
// priorities
+----+--------+
| ID | NAME   |
+----+--------+
| 1  | Normal |
| 2  | High   |
| 3  | Urgent |
+----+--------+
// statuses
+----+-------------+
| ID | STATUS NAME |
+----+-------------+
| 1  | Pending     |
| 2  | In Progress |
+----+-------------+
// projects
+----+------+--------+----------+
| ID | NAME | STATUS | PRIORITY |
+----+------+--------+----------+
| 1  | a    | 1      | 3        |
| 2  | b    | 1      | 1        |
| 3  | c    | 2      | 1        |
| 4  | d    | 1      | 2        |
+----+------+--------+----------+
I need to create a view that shows how many projects hold a status of 1 and a priority of 1, how many hold a status of 1 and a priority of 2, how many hold a status of 1 and a priority of 3, and so on.
This should go through each status, then each priority, then count the projects that apply to the criteria.
The view should hold values something like this:
// VIEW (stats)
+--------+----------+-------+
| STATUS | PRIORITY | COUNT |
+--------+----------+-------+
| 1 | 1 | 1 |
+--------+----------+-------+
| 1 | 2 | 1 |
+--------+----------+-------+
| 1 | 3 | 1 |
+--------+----------+-------+
| 2 | 1 | 1 |
+--------+----------+-------+
| 2 | 2 | 0 |
+--------+----------+-------+
| 2 | 3 | 0 |
+--------+----------+-------+
This view is so that I can call, for example, how many projects have a status of 1 and a priority of 3, the answer given the data above should be 1.
Using the below select statement I've been able to produce a similar result but it does not explicitly show that 0 projects have a status of 2 and a priority of 3. I need this 0 value to be accessible the same way as any of the others with a COUNT >= 1.
// my current select statement
CREATE VIEW stats
AS
SELECT P.STATUS, P.PRIORITY, COUNT(*) AS hits
FROM projects P
GROUP BY P.STATUS, P.PRIORITY
// does not show rows where COUNT = 0
How could I create a VIEW that holds all of the priorities' ids, all of the statuses' ids, and 0 values for COUNT?
You need to generate all the rows and then get the count for each one. Here is a query that should work:
SELECT s.status, p.priority, COUNT(pr.status) AS hits
FROM (SELECT DISTINCT status FROM projects) s CROSS JOIN
     (SELECT DISTINCT priority FROM projects) p LEFT JOIN
     projects pr
     ON pr.status = s.status AND pr.priority = p.priority
GROUP BY s.status, p.priority;
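The DISTINCT subqueries only produce statuses and priorities that already occur in projects. Since the question asks for all ids from the lookup tables, a variant that cross joins statuses and priorities directly may be closer to what is wanted. Here is a sketch of that variant wrapped in CREATE VIEW, run through Python's sqlite3 module on the sample data. The lowercase column names (id, name, status_name) are assumptions, since the question only shows display headers.

```python
import sqlite3

# Every (status, priority) pair comes from the lookup tables; the LEFT JOIN
# plus COUNT(pr.id) yields 0 where no project matches the pair.
CREATE_VIEW_SQL = """
CREATE VIEW stats AS
SELECT s.id AS status, p.id AS priority, COUNT(pr.id) AS hits
  FROM statuses s
 CROSS JOIN priorities p
  LEFT JOIN projects pr
    ON pr.status = s.id AND pr.priority = p.id
 GROUP BY s.id, p.id
"""


def build_stats_db():
    """Create the three sample tables and the stats view in memory."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE priorities (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE statuses (id INTEGER PRIMARY KEY, status_name TEXT);
        CREATE TABLE projects (id INTEGER PRIMARY KEY, name TEXT,
                               status INTEGER, priority INTEGER);
        INSERT INTO priorities VALUES (1, 'Normal'), (2, 'High'), (3, 'Urgent');
        INSERT INTO statuses VALUES (1, 'Pending'), (2, 'In Progress');
        INSERT INTO projects VALUES
            (1, 'a', 1, 3), (2, 'b', 1, 1), (3, 'c', 2, 1), (4, 'd', 1, 2);
    """)
    conn.execute(CREATE_VIEW_SQL)
    return conn


conn = build_stats_db()
stats = conn.execute(
    "SELECT status, priority, hits FROM stats ORDER BY status, priority").fetchall()
for row in stats:
    print(row)
```

A single combination can then be read like any other row, for example SELECT hits FROM stats WHERE status = 2 AND priority = 3.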

Getting Sum of MasterTable's amount which joins to DetailTable

I have two tables:
1. Master
| ID | Name | Amount |
|-----|--------|--------|
| 1 | a | 5000 |
| 2 | b | 10000 |
| 3 | c | 5000 |
| 4 | d | 8000 |
2. Detail
| ID |MasterID| PID | Qty |
|-----|--------|-------|------|
| 1 | 1 | 1 | 10 |
| 2 | 1 | 2 | 20 |
| 3 | 2 | 2 | 60 |
| 4 | 2 | 3 | 10 |
| 5 | 3 | 4 | 100 |
| 6 | 4 | 1 | 20 |
| 7 | 4 | 3 | 40 |
I want to select sum(Amount) from Master, joined to Detail, where Detail.PID in (1,2,3)
So I execute the following query:
SELECT SUM(Amount) FROM Master M INNER JOIN Detail D ON M.ID = D.MasterID WHERE D.PID IN (1,2,3)
Result should be 20000. But I am getting 40000
See this fiddle. Any suggestion?
You are getting exactly double the amount because the join repeats each Master row once per matching Detail row, and here every matching Master row has two Detail rows whose PID is in the WHERE clause list.
See demo
Use
SELECT SUM(Amount)
FROM Master M
WHERE M.ID IN (
    SELECT DISTINCT MasterID
    FROM Detail
    WHERE PID IN (1,2,3) )
Why join the Master table with Detail at all when all the columns you need are in the Master table?
Also, isn't there an FK relationship defined on these tables? Looking at your data, it seems to me that there should be an FK on the Detail table for MasterID. If that is the case, then you do not need to join the tables at all.
And in case there is no FK relationship and you want to make sure that the records you are summing have rows in the Detail table, you could try EXISTS instead of a join.
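A sketch of the EXISTS variant next to the original join, run through Python's sqlite3 module on the rows as printed in the question. The function names are made up for the demo. EXISTS tests each Master row once, so its Amount is added at most once no matter how many Detail rows match.

```python
import sqlite3

# Original query: each Amount is added once per matching Detail row.
JOIN_SQL = """
SELECT SUM(M.Amount)
  FROM Master M
 INNER JOIN Detail D ON M.ID = D.MasterID
 WHERE D.PID IN (1, 2, 3)
"""

# EXISTS variant: each Master row is counted at most once.
EXISTS_SQL = """
SELECT SUM(M.Amount)
  FROM Master M
 WHERE EXISTS (SELECT 1
                 FROM Detail D
                WHERE D.MasterID = M.ID
                  AND D.PID IN (1, 2, 3))
"""


def build_master_detail_db():
    """Create the Master and Detail tables with the rows shown in the question."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE Master (ID INTEGER PRIMARY KEY, Name TEXT, Amount INTEGER);
        CREATE TABLE Detail (ID INTEGER PRIMARY KEY, MasterID INTEGER,
                             PID INTEGER, Qty INTEGER);
        INSERT INTO Master VALUES
            (1, 'a', 5000), (2, 'b', 10000), (3, 'c', 5000), (4, 'd', 8000);
        INSERT INTO Detail VALUES
            (1, 1, 1, 10), (2, 1, 2, 20), (3, 2, 2, 60), (4, 2, 3, 10),
            (5, 3, 4, 100), (6, 4, 1, 20), (7, 4, 3, 40);
    """)
    return conn


def sum_with_join(conn):
    return conn.execute(JOIN_SQL).fetchone()[0]


def sum_with_exists(conn):
    return conn.execute(EXISTS_SQL).fetchone()[0]


conn = build_master_detail_db()
print(sum_with_join(conn), sum_with_exists(conn))
```

With the rows exactly as printed above, the join returns 46000 and the EXISTS version 23000 (Master rows 1, 2 and 4). The 40000 and 20000 quoted in the question do not follow from these rows, so they presumably come from different data in the fiddle; the doubling effect is the same either way.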