Postgres select distinct row per group without repeating - sql

I have a query that generated matched data like this. For each parent I need to select a child, but without repeating the same combination of parent or child. In the picture below, the black border shows the groups and the blue highlighted rows are the rows I want returned.
I also have the case where there are 6 parents and only 3 children. In this case I only want 3 rows max, the child and parent ids can't repeat. I just want the first matched children to parents.

So I did some research and found options that were close but didn't do the trick. I finally got exactly what I wanted.
WITH firstmatched AS (
SELECT parent,
child,
ROW_NUMBER() OVER(PARTITION BY child
ORDER BY parent DESC) AS rowkey1,
ROW_NUMBER() OVER(PARTITION BY parent
ORDER BY child DESC) AS rowkey2
FROM mytable)
SELECT *
FROM firstmatched where rowkey1 = rowkey2
Take the case where there are 6 parents and only 3 children: this query without the where clause (where rowkey1 = rowkey2) looks like this.
Then with the where clause added it reduces to this.
Hopefully this helps someone with a similar issue.
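To see what the rowkey1 = rowkey2 filter keeps, here is a minimal sketch that runs the same query against an in-memory SQLite database from Python. SQLite stands in for Postgres here (window functions need SQLite 3.25 or later), and the sample data is an assumption: every parent is matched to every child. On less regular data the filter may behave differently.

```python
import sqlite3

# Assumed sample data: every parent matched to every child (3 x 3 grid).
GRID_3X3 = [(p, c) for p in (450759, 450763, 450765)
            for c in (450768, 450771, 450773)]

QUERY = """
WITH firstmatched AS (
    SELECT parent, child,
           ROW_NUMBER() OVER (PARTITION BY child  ORDER BY parent DESC) AS rowkey1,
           ROW_NUMBER() OVER (PARTITION BY parent ORDER BY child  DESC) AS rowkey2
    FROM mytable
)
SELECT parent, child
FROM firstmatched
WHERE rowkey1 = rowkey2
ORDER BY parent
"""


def first_matched(pairs):
    """Load (parent, child) pairs into a scratch table and run the query."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE mytable (parent INTEGER, child INTEGER)")
        conn.executemany("INSERT INTO mytable VALUES (?, ?)", pairs)
        return conn.execute(QUERY).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for row in first_matched(GRID_3X3):
        print(row)
```

With 6 parents and 3 children in a full grid, the same function returns 3 rows, because rowkey2 never exceeds the number of children.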

Rock&Roll-method: just make all assignments and skip the errors:
\i tmp.sql
-- The data [in non-graphical form]
CREATE TABLE tableau(
parent integer NOT NULL
, child integer NOT NULL
, PRIMARY KEY (parent,child)
) ;
INSERT INTO tableau(parent,child) VALUES
( 450759,450768) , ( 450759,450771) , ( 450759,450773)
, ( 450763,450768) , ( 450763,450771) , ( 450763,450773)
, ( 450765,450768) , ( 450765,450771) , ( 450765,450773)
;
-- Receptor table for the results:
CREATE TEMP TABLE assignment(
parent integer NOT NULL UNIQUE
, child integer NOT NULL UNIQUE
);
-- Just do it!
INSERT INTO assignment(parent,child)
SELECT parent,child
FROM tableau
ON CONFLICT DO NOTHING --<< MAGIC!
;
SELECT * FROM tableau;
SELECT * FROM assignment;
Results:
CREATE SCHEMA
SET
CREATE TABLE
INSERT 0 9
CREATE TABLE
INSERT 0 3
parent | child
--------+--------
450759 | 450768
450759 | 450771
450759 | 450773
450763 | 450768
450763 | 450771
450763 | 450773
450765 | 450768
450765 | 450771
450765 | 450773
(9 rows)
parent | child
--------+--------
450759 | 450768
450763 | 450771
450765 | 450773
(3 rows)
Note: this solution is greedy; it can fail to find an optimal solution on some types of tableau data.
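The greedy behaviour can be sketched outside the database. The function below is a hypothetical stand-in for the INSERT ... ON CONFLICT DO NOTHING step: it walks the pairs in the order given and keeps a pair only if neither its parent nor its child has been used yet. The second data set is an assumed example of input where greedy finds fewer matches than are possible.

```python
def greedy_assign(pairs):
    """Keep each (parent, child) pair whose parent and child are both unused."""
    used_parents, used_children, kept = set(), set(), []
    for parent, child in pairs:
        if parent in used_parents or child in used_children:
            continue  # corresponds to a row skipped by a UNIQUE conflict
        used_parents.add(parent)
        used_children.add(child)
        kept.append((parent, child))
    return kept


# The 3 x 3 data from the answer above.
FULL_GRID = [(p, c) for p in (450759, 450763, 450765)
             for c in (450768, 450771, 450773)]

# Assumed counter-example: parent 1 could take child 20 and leave child 10
# to parent 2, but greedy gives child 10 to parent 1 first.
SPARSE = [(1, 10), (1, 20), (2, 10)]

if __name__ == "__main__":
    print(greedy_assign(FULL_GRID))
    print(greedy_assign(SPARSE))
```

Note that the result depends on the order in which rows arrive. In the SQL version, the SELECT feeding the INSERT has no ORDER BY, so that order is not guaranteed.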

Related

Remove duplicate rows based on specific columns

I have a table that contains these columns:
ID (varchar)
SETUP_ID (varchar)
MENU (varchar)
LABEL (varchar)
The thing I want to achieve is to remove all duplicates from the table based on two columns (SETUP_ID, MENU).
Table I have:
id | setup_id | menu | label |
-------------------------------------
1 | 10 | main | txt |
2 | 10 | main | txt |
3 | 11 | second | txt |
4 | 11 | second | txt |
5 | 12 | third | txt |
Table I want:
id | setup_id | menu | label |
-------------------------------------
1 | 10 | main | txt |
3 | 11 | second | txt |
5 | 12 | third | txt |
You can achieve this with a common table expression (cte)
with cte as (
select id, setup_id, menu,
row_number () over (partition by setup_id, menu order by id) rownum
from atable )
delete from atable a
where id in (select id from cte where rownum >= 2)
This will give you your desired output.
Common Table Expression docs
Assuming a table named tbl where both setup_id and menu are defined NOT NULL and id is the PRIMARY KEY.
EXISTS will do nicely:
DELETE FROM tbl t0
WHERE EXISTS (
SELECT FROM tbl t1
WHERE t1.setup_id = t0.setup_id
AND t1.menu = t0.menu
AND t1.id < t0.id
);
This deletes every row where a dupe with lower id is found, effectively only keeping the row with the smallest id from each set of dupes. An index on (setup_id, menu) or even (setup_id, menu, id) will help performance with big tables a lot.
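A quick way to check the "keep the smallest id" behaviour is to run an equivalent statement on the sample rows. This sketch uses an in-memory SQLite database from Python rather than Postgres, so two details are adapted and should be read as assumptions: the subquery uses SELECT 1 (Postgres allows an empty select list, SQLite does not), and the outer table is referenced by name instead of the t0 alias. The id column is also an integer here for simplicity.

```python
import sqlite3

SAMPLE_ROWS = [
    (1, "10", "main", "txt"),
    (2, "10", "main", "txt"),
    (3, "11", "second", "txt"),
    (4, "11", "second", "txt"),
    (5, "12", "third", "txt"),
]

DELETE_DUPES = """
DELETE FROM tbl
WHERE EXISTS (
    SELECT 1 FROM tbl AS t1
    WHERE t1.setup_id = tbl.setup_id
      AND t1.menu     = tbl.menu
      AND t1.id       < tbl.id      -- a dupe with a lower id exists
)
"""


def remaining_ids(rows):
    """Insert rows, delete duplicates on (setup_id, menu), return surviving ids."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE tbl (id INTEGER PRIMARY KEY, setup_id TEXT NOT NULL,"
            " menu TEXT NOT NULL, label TEXT)"
        )
        conn.executemany("INSERT INTO tbl VALUES (?, ?, ?, ?)", rows)
        conn.execute(DELETE_DUPES)
        return [r[0] for r in conn.execute("SELECT id FROM tbl ORDER BY id")]
    finally:
        conn.close()


if __name__ == "__main__":
    print(remaining_ids(SAMPLE_ROWS))
```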
If there is no PK and no reliable UNIQUE (combination of) column(s), you can fall back to using the ctid. If NULL values can be involved, you need to specify how to deal with those.
Consider:
Delete duplicate rows from small table
How to delete duplicate rows without unique identifier
How do I (or can I) SELECT DISTINCT on multiple columns?
After cleaning up duplicates, add a UNIQUE constraint to prevent new dupes:
ALTER TABLE tbl ADD CONSTRAINT tbl_setup_id_menu_uni UNIQUE (setup_id, menu);
If you had an index on (setup_id, menu), drop that now. It's superseded by the UNIQUE constraint.
I have found a solution that fits me the best.
Here it is if anyone needs it:
DELETE FROM table_name
WHERE id IN
(SELECT id
FROM
(SELECT id,
ROW_NUMBER() OVER( PARTITION BY setup_id,
menu
ORDER BY id ) AS row_num
FROM table_name ) t
WHERE t.row_num > 1 );
link: https://www.postgresql.org/docs/current/queries-union.html
https://www.postgresql.org/docs/current/sql-select.html#SQL-DISTINCT
Let's say the table name is a:
select distinct on (setup_id,menu ) a.* from a;
Key point: The DISTINCT ON expression(s) must match the leftmost ORDER BY expression(s). The ORDER BY clause will normally contain additional expression(s) that determine the desired precedence of rows within each DISTINCT ON group.
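Which means the ORDER BY in this DISTINCT ON query has to start with setup_id, menu; any further expressions (for example id) then decide which row of each group is kept. Without an ORDER BY, the row kept for each group is unpredictable.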
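The way DISTINCT ON interacts with ORDER BY can be modelled in a few lines of Python: sort by the DISTINCT ON columns followed by the tie-breaking columns, then keep the first row of every group. The function name and the dict-based row format are assumptions made for this sketch; it illustrates the semantics, not how Postgres implements them.

```python
from itertools import groupby
from operator import itemgetter


def distinct_on(rows, key_cols, order_cols):
    """Keep the first row per key_cols group after sorting by key_cols + order_cols."""
    ordered = sorted(rows, key=itemgetter(*key_cols, *order_cols))
    group_key = itemgetter(*key_cols)
    return [next(group) for _, group in groupby(ordered, key=group_key)]


# Sample rows from the question, deliberately shuffled.
SHUFFLED = [
    {"id": 4, "setup_id": "11", "menu": "second"},
    {"id": 2, "setup_id": "10", "menu": "main"},
    {"id": 1, "setup_id": "10", "menu": "main"},
    {"id": 5, "setup_id": "12", "menu": "third"},
    {"id": 3, "setup_id": "11", "menu": "second"},
]

if __name__ == "__main__":
    for row in distinct_on(SHUFFLED, ("setup_id", "menu"), ("id",)):
        print(row)
```

Sorting on id after the group columns plays the role of ORDER BY setup_id, menu, id and makes the lowest id the one that survives.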
Want the opposite:
EXCEPT returns all rows that are in the result of query1 but not in the result of query2. (This is sometimes called the difference between two queries.) Again, duplicates are eliminated unless EXCEPT ALL is used.
SELECT * FROM a
EXCEPT
select distinct on (setup_id,menu ) a.* from a;
You can try something along these lines to delete all but the first row in case of duplicates (please note that this is not tested in any way!):
DELETE FROM your_table WHERE id IN (
SELECT unnest(duplicate_ids[2:]) FROM (
SELECT array_agg(id ORDER BY id) AS duplicate_ids FROM your_table
GROUP BY SETUP_ID, MENU
HAVING COUNT(*) > 1
) AS dups
)
;
The above collects the ids of the duplicate rows (COUNT(*) > 1) in an array (array_agg), then takes all but the first element in that array ([2:]) and "explodes" the id values into rows (unnest).
The outer query just deletes every id that ends up in that result.
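The collect, slice and explode steps map directly onto ordinary list operations. The sketch below mirrors them in Python on assumed sample tuples of (id, setup_id, menu). One detail worth noting: Postgres arrays are 1-based, so the SQL slice [2:] corresponds to the Python slice [1:].

```python
from collections import defaultdict

SAMPLE = [
    (1, "10", "main"),
    (2, "10", "main"),
    (3, "11", "second"),
    (4, "11", "second"),
    (5, "12", "third"),
]


def ids_to_delete(rows):
    """Return every id except the smallest one in each (setup_id, menu) group."""
    groups = defaultdict(list)              # array_agg(id) per GROUP BY key
    for row_id, setup_id, menu in sorted(rows):
        groups[(setup_id, menu)].append(row_id)
    doomed = []
    for ids in groups.values():
        if len(ids) > 1:                    # HAVING COUNT(*) > 1
            doomed.extend(ids[1:])          # duplicate_ids[2:] followed by unnest
    return sorted(doomed)


if __name__ == "__main__":
    print(ids_to_delete(SAMPLE))
```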
For MySQL, a similar question is already answered here: Find and remove duplicate rows by two columns
See whether any of the approaches there helps.
I like the below one for MySql:
ALTER IGNORE TABLE your_table ADD UNIQUE (SETUP_ID, MENU);
DELETE t1
FROM table_name t1
join table_name t2 on
(t2.setup_id = t1.setup_id and t2.menu = t1.menu) and t2.id < t1.id
There are many ways to find and delete duplicate rows based on conditions. I like the inner join method, which works fast even on a large amount of data. See the following:
DELETE T1 FROM <TableName> T1
INNER JOIN <TableName> T2
WHERE
T1.id > T2.id AND
T1.<ColumnName1> = T2.<ColumnName1> AND T1.<ColumnName2> = T2.<ColumnName2>;
In your case you can write it as follows:
DELETE T1 FROM <TableName> T1
INNER JOIN <TableName> T2
WHERE
T1.id > T2.id AND
T1.setup_id = T2.setup_id AND T1.menu = T2.menu;
Let me know if you face any issue or need more help.

How can I write a SQL query to calculate the quantity of components sold with their parent assemblies? (Postgres 11/recursive CTE?)

My goal
To calculate the sum of components sold as part of their parent assemblies.
I'm sure this must be a common use case, but I haven't yet found documentation that leads to the result I'm looking for.
Background
I'm running Postgres 11 on CentOS 7.
I have some tables as follows:
CREATE TABLE the_schema.names_categories (
id INTEGER NOT NULL PRIMARY KEY GENERATED ALWAYS AS IDENTITY,
created_at TIMESTAMPTZ DEFAULT now(),
thing_name TEXT NOT NULL,
thing_category TEXT NOT NULL
);
CREATE TABLE the_schema.relator (
id INTEGER NOT NULL PRIMARY KEY GENERATED ALWAYS AS IDENTITY,
created_at TIMESTAMPTZ DEFAULT now(),
parent_name TEXT NOT NULL,
child_name TEXT NOT NULL,
child_quantity INTEGER NOT NULL
);
CREATE TABLE the_schema.sales (
id INTEGER NOT NULL PRIMARY KEY GENERATED ALWAYS AS IDENTITY,
created_at TIMESTAMPTZ DEFAULT now(),
sold_name TEXT NOT NULL,
sold_quantity INTEGER NOT NULL
);
And a view like so, which is mainly to associate the category key with relator.child_name for filtering:
CREATE VIEW the_schema.relationships_with_child_catetgory AS (
SELECT
r.parent_name,
r.child_name,
r.child_quantity,
n.thing_category AS child_category
FROM
the_schema.relator r
INNER JOIN
the_schema.names_categories n
ON r.child_name = n.thing_name
);
And these tables contain some data like this:
INSERT INTO the_schema.names_categories (thing_name, thing_category)
VALUES ('parent1', 'bundle'), ('child1', 'assembly'), ('subChild1', 'component'), ('subChild2', 'component');
INSERT INTO the_schema.relator (parent_name, child_name, child_quantity)
VALUES ('parent1', 'child1', 1),('child1', 'subChild1', 10), ('child1', 'subChild2', 2);
INSERT INTO the_schema.sales (sold_name, sold_quantity)
VALUES ('parent1', 1), ('parent1', 2);
I need to construct a query that, given these data, will return something like the following:
child_name | sum_sold
------------+----------
subChild1 | 30
subChild2 | 6
(2 rows)
The problem is that I haven't the first idea how to go about this and in fact it's getting scarier as I type. I'm having a really hard time visualizing the connections that need to be made, so it's difficult to get started in a logical way.
Usually, Molinaro's SQL Cookbook has something to get started on, and it does have a section on hierarchical queries, but near as I can tell, none of them serve this particular purpose.
Based on my research on this site, it seems like I probably need to use a recursive CTE (Common Table Expression), as demonstrated in this question/answer, but I'm having considerable difficulty understanding this method and how to use it for my case.
Aping the example from E. Brandstetter's answer linked above, I arrive at:
WITH RECURSIVE cte AS (
SELECT
s.sold_name,
r.child_name,
s.sold_quantity AS total
FROM
the_schema.sales s
INNER JOIN
the_schema.relationships_with_child_catetgory r
ON s.sold_name = r.parent_name
UNION ALL
SELECT
c.sold_name,
r.child_name,
(c.total * r.child_quantity)
FROM
cte c
INNER JOIN
the_schema.relationships_with_child_catetgory r
ON r.parent_name = c.child_name
) SELECT * FROM cte
which gets part of the way there:
sold_name | child_name | total
-----------+------------+-------
parent1 | child1 | 1
parent1 | child1 | 2
parent1 | subChild1 | 10
parent1 | subChild1 | 20
parent1 | subChild2 | 2
parent1 | subChild2 | 4
(6 rows)
However, these results include undesired rows (the first two), and when I try to filter the CTE by adding where r.child_category = 'component' to both parts, the query returns no rows:
sold_name | child_name | total
-----------+------------+-------
(0 rows)
and when I try to group/aggregate, it gives the following error:
ERROR: aggregate functions are not allowed in a recursive query's recursive term
I'm stuck on how to get the undesired rows filtered out and the aggregation happening; clearly I'm failing to comprehend how this recursive CTE works. All guidance is appreciated!
Basically you have the solution. If you stored the quantities and categories in your CTE as well, you can simply add a WHERE filter and a SUM aggregation afterwards:
SELECT
child_name,
SUM(sold_quantity * child_quantity)
FROM cte
WHERE category = 'component'
GROUP BY child_name
My entire query looks like this (which only differs in the details I mentioned above from yours):
demo:db<>fiddle
WITH RECURSIVE cte AS (
SELECT
s.sold_name,
s.sold_quantity,
r.child_name,
r.child_quantity,
nc.thing_category as category
FROM
sales s
JOIN relator r
ON s.sold_name = r.parent_name
JOIN names_categories nc
ON r.child_name = nc.thing_name
UNION ALL
SELECT
cte.sold_name,
cte.sold_quantity,
r.child_name,
r.child_quantity,
nc.thing_category
FROM cte
JOIN relator r ON cte.child_name = r.parent_name
JOIN names_categories nc
ON r.child_name = nc.thing_name
)
SELECT
child_name,
SUM(sold_quantity * child_quantity)
FROM cte
WHERE category = 'component'
GROUP BY child_name
Note: I didn't use your view, because I found it handier to fetch the data directly from the tables instead of joining data I already have. But that's just the way I personally like it :)
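For anyone who wants to experiment, here is a minimal sketch of the recursive walk on the question's sample data, run against an in-memory SQLite database from Python (an assumption; the thread is about Postgres 11, and the schema prefix and identity columns are dropped). It differs from the queries in this thread in one respect that is my own choice: the running total is multiplied by child_quantity at every level, including the first, so that an intermediate quantity other than 1 would also be carried down to the components.

```python
import sqlite3

SETUP = """
CREATE TABLE names_categories (thing_name TEXT, thing_category TEXT);
CREATE TABLE relator (parent_name TEXT, child_name TEXT, child_quantity INTEGER);
CREATE TABLE sales (sold_name TEXT, sold_quantity INTEGER);
INSERT INTO names_categories VALUES
    ('parent1', 'bundle'), ('child1', 'assembly'),
    ('subChild1', 'component'), ('subChild2', 'component');
INSERT INTO relator VALUES
    ('parent1', 'child1', 1), ('child1', 'subChild1', 10), ('child1', 'subChild2', 2);
INSERT INTO sales VALUES ('parent1', 1), ('parent1', 2);
"""

QUERY = """
WITH RECURSIVE cte(child_name, total) AS (
    -- anchor: direct children of everything that was sold
    SELECT r.child_name, s.sold_quantity * r.child_quantity
    FROM sales s
    JOIN relator r ON s.sold_name = r.parent_name
    UNION ALL
    -- recursive step: descend one level, carrying the running quantity
    SELECT r.child_name, cte.total * r.child_quantity
    FROM cte
    JOIN relator r ON cte.child_name = r.parent_name
)
SELECT cte.child_name, SUM(cte.total) AS sum_sold
FROM cte
JOIN names_categories nc ON nc.thing_name = cte.child_name
WHERE nc.thing_category = 'component'
GROUP BY cte.child_name
ORDER BY cte.child_name
"""


def components_sold():
    """Build the sample tables and return (child_name, sum_sold) rows."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(SETUP)
        return conn.execute(QUERY).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for row in components_sold():
        print(row)
```

The filter and the aggregate sit outside the recursive part, which is why the "aggregate functions are not allowed in a recursive query's recursive term" error does not come up.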
Well, I figured out that the CTE can be used as a subquery, which permits the filtering and aggregation that I needed :
SELECT
cte.child_name,
sum(cte.total)
FROM
(
WITH RECURSIVE cte AS (
SELECT
s.sold_name,
r.child_name,
s.sold_quantity AS total
FROM
the_schema.sales s
INNER JOIN
the_schema.relationships_with_child_catetgory r
ON s.sold_name = r.parent_name
UNION ALL
SELECT
c.sold_name,
r.child_name,
(c.total * r.child_quantity)
FROM
cte c
INNER JOIN
the_schema.relationships_with_child_catetgory r
ON r.parent_name = c.child_name
) SELECT * FROM cte ) AS cte
INNER JOIN
the_schema.relationships_with_child_catetgory r1
ON cte.child_name = r1.child_name
WHERE r1.child_category = 'component'
GROUP BY cte.child_name
;
which gives the desired rows:
child_name | sum
------------+-----
subChild2 | 6
subChild1 | 30
(2 rows)
Which is good and probably enough for the actual case at hand, but I suspect there's a cleaner way to go about this, so I'll be eager to read all other offered answers.

How to find parent row for same table

I have table something like this:
childId | parentId
1 | null
2 | 1
3 | null
4 | 2
Column childId is the primary key of this table, and parentId is a foreign key to the same table that references column childId.
I need to call a function, pass it a parameter (childId), and have the function find the topmost parent row of this child.
Example:
If I pass childId = 4, the output result need to be 1.
Is there any solution for this problem?
EDIT:
I need something like hierarchy top level row.
I have tried with a recursive CTE but I couldn't get it done.
It looks like a recursive CTE (common-table expression) is a good fit for this type of query.
Sample data
DECLARE @T TABLE (childId int, parentId int);
INSERT INTO @T VALUES
( 1 , null),
( 2 , 1 ),
( 3 , null),
( 4 , 2 );
Query
Replace constant 4 with a parameter. I'm including AnchorChildID and AnchorParentID to make it easier to understand the result and what is going on.
Run this query without the final filter WHERE ParentID IS NULL to see how it works.
WITH
CTE
AS
(
SELECT
childId AS AnchorChildID
,parentId AS AnchorParentID
,childId AS ChildID
,parentId AS ParentID
FROM @T AS T
WHERE childId = 4
UNION ALL
SELECT
CTE.AnchorChildID
,CTE.AnchorParentID
,T.ChildID
,T.ParentID
FROM
CTE
INNER JOIN @T AS T ON T.ChildID = CTE.ParentID
)
SELECT ChildID
FROM CTE
WHERE ParentID IS NULL
OPTION(MAXRECURSION 0)
;
Result
ChildID
1
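The walk the recursive CTE performs can be sketched as a plain loop, which may help when reasoning about what each recursion step does. The dictionary below is an assumed in-memory copy of the sample table (childId mapped to parentId), and the cycle check is an addition of mine; the SQL above has no such guard.

```python
# Assumed in-memory copy of the sample table: childId -> parentId.
PARENT_OF = {1: None, 2: 1, 3: None, 4: 2}


def find_root(parent_of, child_id):
    """Follow parent links from child_id until a row with no parent is reached."""
    seen = set()
    current = child_id
    while parent_of[current] is not None:
        if current in seen:
            raise ValueError("cycle detected in parent links")
        seen.add(current)
        current = parent_of[current]    # one recursion step of the CTE
    return current


if __name__ == "__main__":
    print(find_root(PARENT_OF, 4))
```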

Given a parent / child key table, how can we recursively insert a copy of the structure into another table?

I have a recursive CTE which gives me a listing of a set of parent child keys as follows; let's say it's in a temp table called [#relationtree]:
Parent | Child
--------------
1 | 3
3 | 5
5 | 6
5 | 9
I want to create a copy of these relationships in a table with, let's say, the following structure:
CREATE TABLE [dbo].[Relations]
(
[Id] int identity(1,1),
[ParentId] int
)
How can I insert the above records but recursively obtain the previously inserted identity value to be able to insert that value as the ParentId column for each copy of a child I insert?
I would expect to have at the end of this in [dbo].[Relations] (given our current seed value is, say 50)
Id | ParentId
-------------
... other rows present before this query ...
50 | NULL
51 | 50
52 | 51
53 | 51
I'm not sure that scope_identity can work in this situation, or that creating a new temp table with a list of new IDs and inserting identity columns manually is the correct approach?
I could write a cursor / loop to do this, but there must be a nice way of doing some recursive select magic!
Since you're trying to put the tree into a segment of the table it looks like you're going to need to use SET IDENTITY_INSERT ON for the table anyway. You're going to need to make sure that there is room for the new tree. In this case, I'll assume that 49 is the current maximum id in your table so that we don't need to be concerned with overrunning a tree that's later in the table.
You'll need to be able to map the IDs from the old tree to the new tree. Unless there's some rule around the ids, the exact mapping should be irrelevant as long as it's accurate, so in that case, I'd just do something like this:
SET IDENTITY_INSERT dbo.Relations ON
;WITH CTE_MappedIDs AS
(
SELECT
old_id,
ROW_NUMBER() OVER(ORDER BY old_id) + 49 AS new_id
FROM
(
SELECT DISTINCT parent AS old_id FROM #relationtree
UNION
SELECT DISTINCT child AS old_id FROM #relationtree
) SQ
)
INSERT INTO dbo.Relations (Id, ParentId)
SELECT
CID.new_id,
PID.new_id
FROM
#relationtree RT
INNER JOIN CTE_MappedIDs PID ON PID.old_id = RT.parent
INNER JOIN CTE_MappedIDs CID ON CID.old_id = RT.child
-- We need to also add the root node
UNION ALL
SELECT
NID.new_id,
NULL
FROM
#relationtree RT2
INNER JOIN CTE_MappedIDs NID ON NID.old_id = RT2.parent
WHERE
RT2.parent NOT IN (SELECT DISTINCT child FROM #relationtree)
SET IDENTITY_INSERT dbo.Relations OFF
I haven't tested that, but if it doesn't work as expected then hopefully it will point you in the right direction.
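The mapping idea can be checked on the sample tree with a short sketch: number every distinct old id starting at the first free id, then rewrite each edge through that mapping and add a NULL-parent row for the root. The function name and the choice of 50 as the first free id are assumptions for illustration. The exact new ids depend on the numbering order, so they need not match the listing in the question.

```python
def copy_tree(edges, first_new_id):
    """Map old ids to consecutive new ids and return (Id, ParentId) rows."""
    old_ids = sorted({node for edge in edges for node in edge})
    new_id = {old: first_new_id + i for i, old in enumerate(old_ids)}

    children = {child for _, child in edges}
    roots = sorted({parent for parent, _ in edges} - children)

    rows = [(new_id[root], None) for root in roots]            # root rows
    rows += [(new_id[child], new_id[parent]) for parent, child in edges]
    return sorted(rows, key=lambda row: row[0])


# The (Parent, Child) pairs from #relationtree in the question.
RELATION_TREE = [(1, 3), (3, 5), (5, 6), (5, 9)]

if __name__ == "__main__":
    for row in copy_tree(RELATION_TREE, 50):
        print(row)
```

Each child row looks up the new id of its own old id and of its parent's old id, which is why the two joins to the mapping have to use different columns of the edge.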
I know you already have a working answer, but I think you can accomplish the same thing a little more simply (not that there is anything at all wrong with Tom H's answer) using the LAG function to inspect the previous row, assuming you have SQL Server 2012 or later.
Setup:
CREATE TABLE #relationtree (
Parent INT,
Child INT
)
CREATE TABLE #relations (
Id INT IDENTITY(1,1),
ParentId INT
)
INSERT INTO #relationtree (Parent, Child) VALUES(1,3), (3,5), (5,6), (5,9)
INSERT INTO #relations (ParentId) values(1), (3), (5)
Solution:
DECLARE @offset INT = IDENT_CURRENT('#relations')
;WITH relationtreeids AS (
SELECT *,
ROW_NUMBER() OVER(ORDER BY Parent, Child) - 2 AS UnmodifiedParentId -- Simulate an identity field
FROM #relationtree
)
INSERT INTO #relations
-- The LAG window function allows you to inspect the previous row
SELECT CASE WHEN LAG(Parent) OVER(ORDER BY Parent) IS NULL
THEN NULL
WHEN LAG(Parent) OVER(ORDER BY Parent) = Parent
THEN UnmodifiedParentId + @offset ELSE UnmodifiedParentId + @offset + 1
END AS ParentId
FROM relationtreeids
Output:
Id ParentId
1 1
2 3
3 5
4 NULL
5 4
6 5
7 5

How to select parent ids

I have table with such structure.
ElementId | ParentId
-------------------
1 | NULL
2 | 1
3 | 2
4 | 3
Let's say the current element has Id 4. I want to select all parent ids.
Result should be: 3, 2, 1
How can I do it? The DB is MSSQL.
You can use recursive queries for this: http://msdn.microsoft.com/en-us/library/aa175801(SQL.80).aspx
You can use it like this:
with Hierachy(ElementID, ParentID, Level) as (
select ElementID, ParentID, 0 as Level
from myTable t
where t.ElementID = X -- insert parameter here
union all
select t.ElementID, t.ParentID, th.Level + 1
from myTable t
inner join Hierachy th
on t.ElementID = th.ParentID -- step up from each row to its parent
)
select ElementID, ParentID
from Hierachy
where Level > 0
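Collecting ancestors means joining each row found so far to the row its ParentId points at, one level per recursion step. The sketch below runs that idea on the sample data in an in-memory SQLite database from Python. SQLite, the table name elements and the lower-case CTE column names are assumptions for this example; the thread itself is about MSSQL.

```python
import sqlite3

SAMPLE = [(1, None), (2, 1), (3, 2), (4, 3)]

ANCESTORS = """
WITH RECURSIVE hierarchy(element_id, parent_id, level) AS (
    SELECT ElementId, ParentId, 0
    FROM elements
    WHERE ElementId = ?
    UNION ALL
    SELECT t.ElementId, t.ParentId, h.level + 1
    FROM elements t
    JOIN hierarchy h ON t.ElementId = h.parent_id   -- step up to the parent
)
SELECT element_id
FROM hierarchy
WHERE level > 0          -- drop the starting element itself
ORDER BY level
"""


def ancestors(rows, element_id):
    """Return the ids of all ancestors of element_id, nearest first."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE elements (ElementId INTEGER, ParentId INTEGER)")
        conn.executemany("INSERT INTO elements VALUES (?, ?)", rows)
        return [r[0] for r in conn.execute(ANCESTORS, (element_id,))]
    finally:
        conn.close()


if __name__ == "__main__":
    print(ancestors(SAMPLE, 4))
```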
I think it might be easiest to do the following:
while parent != NULL
get parent of current element
I can't think of any way of doing this in plain SQL that wouldn't cause issues on larger databases.
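If you want pure SQL, try:
select ParentId from myTable order by ParentId desc
That would work in MySQL, but only when the table holds nothing but this one chain, since it sorts the existing parent ids in descending order instead of following the parent links.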