Deduplicating similar data based on multiple fields - SQL

I'd like to create a table of possible duplicate records from an original table. However, the match is based on two different attributes, and rows must only be joined within the same group_id. Also, in some cases the data doesn't look exactly the same (but the similarities are there). Here is what the original table would look like:
group_id| House_num | Apt | code
----------------------------------
45 | 1000 | 1 | M
45 | 1 | | D
45 | 1000 | 2 | M
45 | 2 | | D
87 | 2300 | 310 | M
87 | 2310 | | D
87 | 2400 | 470 | M
87 | 2470 | | D
What I'd like returned is a result where these matching numbers all appear on the same row. So something like:
new_id |group_id|a.house_num|a.apt|a.code|b.house_num|b.apt| b.code
-------------------------------------------------------------------------
1 | 45 | 1000 | 1 | M | 1 | | D
2 | 45 | 1000 | 2 | M | 2 | | D
3 | 87 | 2300 | 310 | M | 2310 | | D
4 | 87 | 2400 | 470 | M | 2470 | | D
I'm not sure what kind of join to use here; also, I'm not sure how to handle the cases where a.house_num is the base number, a.apt is the suffixed number, and b.house_num is the combination of the two. Any help would be greatly appreciated, thank you.

create table t (group_id, House_num , Apt , code) as
select 45 , 1000 , 1 , 'M' from dual union all
select 45 , 1 , null , 'D' from dual union all
select 45 , 1000 , 2 , 'M' from dual union all
select 45 , 2 , null , 'D' from dual union all
select 87 , 2300 , 310 , 'M' from dual union all
select 87 , 2310 , null , 'D' from dual union all
select 87 , 2400 , 470 , 'M' from dual union all
select 87 , 2470 , null , 'D' from dual;
select rownum new_id,
       a.GROUP_ID, a.HOUSE_NUM ahn, a.APT aapt, a.CODE acode,
       b.HOUSE_NUM bhn, b.APT bapt, b.CODE bcode
from t a
join t b
  on  a.group_id = b.group_id
  and a.code = 'M'
  and b.code = 'D'
  and (b.house_num = a.apt
       or b.house_num like '%'||a.apt
      );
NEW_ID GROUP_ID AHN AAPT ACODE BHN BAPT BCODE
1 45 1000 1 M 1 D
2 45 1000 2 M 2 D
3 87 2300 310 M 2310 D
4 87 2400 470 M 2470 D
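If you want to sanity-check the join logic outside Oracle, the same self-join runs in SQLite via Python's sqlite3. A minimal sketch, assuming the table and column names from the question; ROWNUM is emulated with enumerate:

```python
import sqlite3

# Port of the Oracle self-join to SQLite for quick testing.
# Table/column names follow the question; ROWNUM is emulated in Python.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t (group_id INT, house_num INT, apt INT, code TEXT);
INSERT INTO t VALUES
  (45, 1000, 1,   'M'), (45, 1,    NULL, 'D'),
  (45, 1000, 2,   'M'), (45, 2,    NULL, 'D'),
  (87, 2300, 310, 'M'), (87, 2310, NULL, 'D'),
  (87, 2400, 470, 'M'), (87, 2470, NULL, 'D');
""")
rows = conn.execute("""
    SELECT a.group_id, a.house_num, a.apt, a.code,
           b.house_num, b.apt, b.code
    FROM t a
    JOIN t b
      ON  a.group_id = b.group_id
      AND a.code = 'M'
      AND b.code = 'D'
      AND (b.house_num = a.apt               -- plain suffix match (group 45)
           OR b.house_num LIKE '%' || a.apt) -- base+suffix combined (group 87)
    ORDER BY a.group_id, a.house_num, a.apt
""").fetchall()
for new_id, row in enumerate(rows, 1):
    print(new_id, *row)   # 4 matched pairs, one per line
```

Note that `LIKE '%' || a.apt` would also pair unrelated rows whose house number merely ends with the apartment digits, so on real data you may want a tighter rule.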

You can use the query below to achieve your answer. The db<>fiddle is here.
WITH data
AS (SELECT * FROM YOUR_TABLE),
data2
AS (SELECT row_number()over
(order by thing1,thing2) rw,
d1.group_id,
d1.thing1 a_thing1,
d1.thing2 a_thing2,
d1.thing3 a_thing3
FROM data d1
WHERE d1.thing3 = 'M'),
data3
AS (SELECT row_number()over
(order by thing1,thing2) rw,
d1.group_id,
d1.thing1 b_thing1,
d1.thing2 b_thing2,
d1.thing3 b_thing3
FROM data d1
WHERE d1.thing3 = 'D')
SELECT d2.rw New_id,
d2.group_id,
d2.a_thing1,
d2.a_thing2,
d2.a_thing3,
d3.b_thing1,
d3.b_thing2,
d3.b_thing3
FROM data2 d2,
data3 d3
WHERE d2.group_id = d3.group_id
AND d2.rw = d3.rw


Postgres: Find missing items in a version table

I have a Table in Postgres 12 which tracks which Items i are used in which Versions v:
CREATE TABLE compare_test(v BIGINT, i BIGINT);
With example data:
INSERT INTO compare_test VALUES
(1,21),
(1,22),
(1,23),
(2,21),
(2,22),
(2,23),
(3,21),
(3,22);
I'm trying to create a View that returns:
| source_v | target_v | source_i | target_i |
|----------|----------|----------|----------|
| 1        | 3        | 23       | null     |
| 2        | 3        | 23       | null     |
The queries typically used to compare missing values between two tables, like:
SELECT l.v as source_v, l.i as source_i,
r.v as target_v, r.i as target_i
FROM compare_test l
LEFT JOIN
compare_test r ON r.i = l.i
WHERE r.i IS NULL;
and
SELECT l.v as source_v, l.i as source_i
FROM compare_test l
WHERE NOT EXISTS
(
SELECT i as target_i
FROM compare_test r
WHERE r.i = l.i
)
do not seem to work if the joined Table is the same Table or if more than 2 Versions are in the Table.
I don't have the option to change the Database Structure but I can use plugins.
The solution below gives those results.
It makes re-use of a CTE.
(but somehow I got a feeling that there should exist a more efficient way)
with cte1 as (
SELECT i
, count(*) cnt
, min(v) min_v
, max(v) max_v
FROM compare_test
GROUP BY i
)
, cte2 as
(
select *
from cte1 as c1
where not exists (
select 1
from cte1 c2
where c2.min_v = c1.min_v
and c2.max_v < c1.max_v
)
)
select distinct
t.v as source_v
, c1.max_v as target_v
, c2.i as source_i
, null as target_i
from cte2 c2
left join compare_test t
on t.i = c2.i
left join cte1 c1
on t.v between c1.min_v and c1.max_v
and c1.i != t.i
order by source_v
But if it's not really required to follow the relations, then it becomes really simple.
Then it's just a left join of the existing to all possible combinations.
select distinct
src.v as source_v
, missing.v as target_v
, src.i as source_i
, missing.i as target_i
from
(
select ver.v, itm.i
from (select distinct v from compare_test) as ver
cross join (select distinct i from compare_test) as itm
left join compare_test t
on t.v = ver.v and t.i = itm.i
where t.v is null
) as missing
left join compare_test as src
on src.i = missing.i and src.v != missing.v
order by target_i, target_v, source_v
source_v | target_v | source_i | target_i
-------: | -------: | -------: | -------:
1 | 5 | 21 | 21
2 | 5 | 21 | 21
3 | 5 | 21 | 21
1 | 5 | 22 | 22
2 | 5 | 22 | 22
3 | 5 | 22 | 22
1 | 3 | 23 | 23
2 | 3 | 23 | 23
1 | 5 | 23 | 23
2 | 5 | 23 | 23
5 | 1 | 44 | 44
5 | 2 | 44 | 44
5 | 3 | 44 | 44
db<>fiddle here
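The simpler "all possible combinations" approach is plain standard SQL, so it can be checked quickly in SQLite through Python's sqlite3. A sketch using the question's sample data (with only three versions present, the only missing pair is (3, 23)):

```python
import sqlite3

# Sketch: cross join all (version, item) combinations, keep the pairs
# that don't exist, then attach the versions that do have that item.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE compare_test (v BIGINT, i BIGINT);
INSERT INTO compare_test VALUES
  (1,21),(1,22),(1,23),
  (2,21),(2,22),(2,23),
  (3,21),(3,22);
""")
rows = conn.execute("""
    SELECT DISTINCT
           src.v AS source_v,
           missing.v AS target_v,
           src.i AS source_i,
           missing.i AS target_i
    FROM (
        SELECT ver.v, itm.i
        FROM (SELECT DISTINCT v FROM compare_test) AS ver
        CROSS JOIN (SELECT DISTINCT i FROM compare_test) AS itm
        LEFT JOIN compare_test t ON t.v = ver.v AND t.i = itm.i
        WHERE t.v IS NULL                  -- the (v, i) pair does not exist
    ) AS missing
    LEFT JOIN compare_test AS src
      ON src.i = missing.i AND src.v != missing.v
    ORDER BY target_i, target_v, source_v
""").fetchall()
print(rows)
```

As in the answer's second query, target_i comes back as the item number rather than null; project `null AS target_i` instead if you need the exact output shape from the question.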

Defining a new variable in SELECT clause in SQL developer

I am new to SQL and the extraction of data from databases, so please bear with me. I only have experience with coding in statistical programs, including Stata, SAS, and R.
I currently have a SELECT clause that extracts a table from an Oracle database.
To simplify the question, I make use of an illustrative example.
I am interested in CREATING a new variable, one that is not included in the database and must be defined based on the other variables, containing the weight of each person's mother. Since I am new to SQL, I do not know whether this is possible in the SELECT clause or whether more efficient options exist.
Note that,
Number and Mother_number refer to the "same numbers", meaning that both mothers and daughters are represented in the table.
AA (number 1) and CC (number 3) have the same mother (BB) (number 2)
I need to do some conversion of the date, e.g. to_char(a.from_date, 'dd-mm-yyyy') as fromdate, since SQL confuses the year with the day-of-the-month
The SQL code:
select to_char(a.from_date, 'dd-mm-yyyy') as fromdate, a.Name, a.Weight, a.Number, a.Mother_number
from table1 a, table2 b
where 1=1
and a.family_ref=b.family_ref
and .. (other conditions)
What I currently obtain:
| fromdate | Name | Weight | Number | Mother_number |
|------------|------|--------|--------|---------------|
| 06-07-2021 | AA | 100 | 1 | 2 |
| 06-07-2021 | BB | 200 | 2 | 3 |
| 06-07-2021 | CC | 300 | 3 | 2 |
| 06-07-2021 | DD | 400 | 4 | 5 |
| 06-07-2021 | EE | 500 | 5 | 6 |
| ... | ... | ... | ... | ... |
What I wish to obtain:
| fromdate | Name | Weight | Number | Mother_number | Mother_weight |
|------------|------|--------|--------|---------------|---------------|
| 06-07-2021 | AA | 100 | 1 | 2 | 200 |
| 06-07-2021 | BB | 200 | 2 | 3 | 300 |
| 06-07-2021 | CC | 300 | 3 | 2 | 200 |
| 06-07-2021 | DD | 400 | 4 | 5 | 500 |
| 06-07-2021 | EE | 500 | 5 | 6 | … |
| | … | … | … | … | …
Assuming the MOTHER_NUMBER value is referencing the same value as the NUMBER variable just join the table with itself.
select a.fromdate
, a.name
, a.weight
, a.number
, a.mother_number
, b.weight as mother_weight
from HAVE a
left join HAVE b
on a.mother_number = b.number
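The self-join above is standard SQL, so it can be verified in SQLite via Python's sqlite3. A minimal sketch (HAVE is the hypothetical table name from the answer; the NUMBER column is renamed num here to avoid quoting a reserved word):

```python
import sqlite3

# Self-join: look up each row's mother by matching mother_number to num.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE have (name TEXT, weight INT, num INT, mother_number INT);
INSERT INTO have VALUES
  ('AA', 100, 1, 2),
  ('BB', 200, 2, 3),
  ('CC', 300, 3, 2),
  ('DD', 400, 4, 5),
  ('EE', 500, 5, 6);
""")
rows = conn.execute("""
    SELECT a.name, a.weight, a.num, a.mother_number,
           b.weight AS mother_weight       -- NULL when no mother row exists
    FROM have a
    LEFT JOIN have b ON a.mother_number = b.num
    ORDER BY a.num
""").fetchall()
```

The LEFT JOIN matters: EE's mother_number (6) has no matching row, so an INNER JOIN would silently drop EE.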
Although I'm not sure I'm following the "mother" logic, the way to implement the last column in your SELECT statement is to add b.weight as Mother_Weight at the end of the first line, before the from keyword.
Since the b table references "Mothers", you can add the column just by taking the weight of the person in table b.
If instead you wish to add the data of a person's mother's weight, you can do that by adding a column to the relevant table and then updating each row in your table by executing the statements below:
ALTER TABLE table1 ADD Mother_weight FLOAT;
UPDATE table1 SET Mother_weight=(SELECT Weight FROM table2 WHERE table1.family_ref=table2.family_ref);
Then you add the a.Mother_weight clause in your SELECT statement.
Use a hierarchical query:
SELECT to_char(a.fromdate, 'dd-mm-yyyy') as fromdate,
a.Name,
a.Weight,
a."NUMBER",
a.Mother_number,
PRIOR weight AS mother_weight
FROM table1 a
INNER JOIN table2 b
ON (a.family_ref=b.family_ref)
WHERE LEVEL = 2
OR ( LEVEL = 1
AND NOT EXISTS(
SELECT 1
FROM table1 x
WHERE a.mother_number = x."NUMBER"
)
)
CONNECT BY NOCYCLE
PRIOR "NUMBER" = mother_number
AND PRIOR a.family_ref = a.family_ref
ORDER BY a."NUMBER"
Or, a sub-query factoring clause and a self-join:
WITH data (fromdate, name, weight, "NUMBER", mother_number) AS (
SELECT to_char(a.fromdate, 'dd-mm-yyyy'),
a.Name,
a.Weight,
a."NUMBER",
a.Mother_number
FROM table1 a
INNER JOIN table2 b
ON (a.family_ref=b.family_ref)
)
SELECT d.*,
m.weight AS mother_weight
FROM data d
LEFT OUTER JOIN data m
ON (d.mother_number = m."NUMBER")
ORDER BY d."NUMBER"
Which, for the sample data:
CREATE TABLE table1 (family_ref, fromdate, Name, Weight, "NUMBER", Mother_number) AS
SELECT 1, DATE '2021-07-06', 'AA', 100, 1, 2 FROM DUAL UNION ALL
SELECT 1, DATE '2021-07-06', 'BB', 200, 2, 3 FROM DUAL UNION ALL
SELECT 1, DATE '2021-07-06', 'CC', 300, 3, 2 FROM DUAL UNION ALL
SELECT 1, DATE '2021-07-06', 'DD', 400, 4, 5 FROM DUAL UNION ALL
SELECT 1, DATE '2021-07-06', 'EE', 500, 5, 6 FROM DUAL;
CREATE TABLE table2 (family_ref) AS
SELECT 1 FROM DUAL;
Both output:
| FROMDATE   | NAME | WEIGHT | NUMBER | MOTHER_NUMBER | MOTHER_WEIGHT |
|------------|------|--------|--------|---------------|---------------|
| 06-07-2021 | AA   | 100    | 1      | 2             | 200           |
| 06-07-2021 | BB   | 200    | 2      | 3             | 300           |
| 06-07-2021 | CC   | 300    | 3      | 2             | 200           |
| 06-07-2021 | DD   | 400    | 4      | 5             | 500           |
| 06-07-2021 | EE   | 500    | 5      | 6             |               |
db<>fiddle here

Updating records by adding a sum(column) value

I have table1 like this:
+-----+-----+------+
| cat | val | type |
+-----+-----+------+
| A | 100 | c1 |
| H | 25 | c2 |
| H | 50 | c3 |
| H | 30 | c2 |
| A | 15 | c3 |
| H | 10 | c1 |
| H | 15 | c1 |
| B | 10 | c4 |
| H | 20 | c4 |
| H | 15 | c3 |
+-----+-----+------+
I need to add the sum(val), grouped by type, to only one H row belonging to each type.
So after grouping by type we have, say, table2:
+------+-----+
| type | val |
+------+-----+
| c1   | 125 |
| c2   | 55  |
| c3   | 80  |
| c4   | 30  |
+------+-----+
I need 125 added to any one H value with type c1, 55 added to any one H value with type c2, and so on. If there is no H with c1, then it should create that record.
So finally we get:
+-----+-----+------+
| cat | val | type |
+-----+-----+------+
| A | 100 | c1 |
| H | 25 | c2 |
| H | 130 | c3 |
| H | 85 | c2 |
| A | 15 | c3 |
| H | 135 | c1 |
| H | 15 | c1 |
| B | 10 | c4 |
| H | 50 | c4 |
| H | 15 | c3 |
+-----+-----+------+
How do I do it without doing table1 union table2 (with 'H' as cat) group by type? Also, I don't have update privileges and cannot use stored procedures. I also have to keep in mind that table1 is the result of a query involving multiple inner joins, which I don't want to repeat over and over in select statements.
See whether this makes sense. I've added the ID column just to display the final result in the same order as the input (for easier reading).
SQL> -- T1 is what you currently have; it can/could be your current query
SQL> with t1 (id, cat, val, type) as
2 (select 1, 'A', 100, 'C1' from dual union all
3 select 2, 'H', 25 , 'C2' from dual union all
4 select 3, 'H', 50 , 'C3' from dual union all
5 select 4, 'H', 30 , 'C2' from dual union all
6 select 5, 'A', 15 , 'C3' from dual union all
7 select 6, 'H', 10 , 'C1' from dual union all
8 select 7, 'H', 15 , 'C1' from dual union all
9 select 8, 'B', 10 , 'C4' from dual union all
10 select 9, 'H', 20 , 'C4' from dual union all
11 select 10,'H', 15 , 'C3' from dual
12 ),
13 -- sum VAL per type
14 t1_sum as
15 (select type, sum(val) sum_val
16 from t1
17 group by type
18 ),
19 -- find row number; let any H be number 1
20 t1_rn as
21 (select id, cat, val, type,
22 row_number() over (partition by type
23 order by case when cat = 'H' then 1 end) rn
24 from t1
25 )
26 -- the final result; add SUM_VAL to the first H row per type
27 select r.cat, r.val + case when r.rn = 1 then s.sum_val else 0 end val,
28 r.type
29 From t1_rn r join t1_sum s on s.type = r.type
30 order by r.id;
CAT VAL TYPE
--- ---------- ----
A 100 C1
H 80 C2
H 130 C3
H 30 C2
A 15 C3
H 135 C1
H 15 C1
B 10 C4
H 50 C4
H 15 C3
10 rows selected.
SQL>
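The same ROW_NUMBER-plus-SUM idea works in any engine with window functions; here is a sketch in SQLite via Python's sqlite3. One detail differs from Oracle: SQLite sorts NULLs first, so the "H first" ordering is written with an explicit ELSE, and id is added as a tiebreaker so that which H row receives the sum is deterministic:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t1 (id INT, cat TEXT, val INT, type TEXT);
INSERT INTO t1 VALUES
  (1,'A',100,'C1'),(2,'H',25,'C2'),(3,'H',50,'C3'),(4,'H',30,'C2'),
  (5,'A',15,'C3'),(6,'H',10,'C1'),(7,'H',15,'C1'),(8,'B',10,'C4'),
  (9,'H',20,'C4'),(10,'H',15,'C3');
""")
rows = conn.execute("""
    WITH t1_sum AS (                 -- sum VAL per type
        SELECT type, SUM(val) AS sum_val FROM t1 GROUP BY type
    ),
    t1_rn AS (                       -- make the first H row number 1 per type
        SELECT id, cat, val, type,
               ROW_NUMBER() OVER (
                   PARTITION BY type
                   ORDER BY CASE WHEN cat = 'H' THEN 0 ELSE 1 END, id
               ) AS rn
        FROM t1
    )
    SELECT r.cat,
           r.val + CASE WHEN r.rn = 1 THEN s.sum_val ELSE 0 END AS val,
           r.type
    FROM t1_rn r JOIN t1_sum s ON s.type = r.type
    ORDER BY r.id
""").fetchall()
```

With the id tiebreaker, the output matches the SQL*Plus result shown above row for row.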
[EDIT: trying to clarify how to use your large query]
Suppose that this is that very large and complex query of yours:
select a.cat,
case when a.cat = 'A' then b.val
when a.cat = 'Z' then c.val
else 'H'
end val,
c.type
from a join b on a.id = b.id and a.x = b.y
join c on c.id = b.idx
where a.date_column < sysdate
and c.type = 'X';
As I've said, create a view based on it as
create or replace view v_view as
select a.cat,
case when a.cat = 'A' then b.val
when a.cat = 'Z' then c.val
else 'H'
end val,
c.type
from a join b on a.id = b.id and a.x = b.y
join c on c.id = b.idx
where a.date_column < sysdate
and c.type = 'X';
and use it as a source for "my" query (from line 14 onwards):
with t1_sum as
(select type, sum(val) sum_val
from v_view --> here's the view
group by type
), etc.
Or, use the "huge" query itself as the initial CTE:
with t1 as
-- this is your "huge" query
(select a.cat,
case when a.cat = 'A' then b.val
when a.cat = 'Z' then c.val
else 'H'
end val,
c.type
from a join b on a.id = b.id and a.x = b.y
join c on c.id = b.idx
where a.date_column < sysdate
and c.type = 'X'
),
-- sum VAL per type
t1_sum as
(select type, sum(val) sum_val
from t1
group by type
), etc.

BigQuery - Find the closest region

I have two tables, and for each region in A, I want to find the closest regions in B.
A:
------------------------
ID | Start | End | Color
------------------------
1 | 400 | 500 | White
------------------------
1 | 10 | 20 | Red
------------------------
2 | 2 | 10 | Blue
------------------------
4 | 88 | 90 | Color
------------------------
B:
------------------------
ID | Start | End | Name
------------------------
1 | 1 | 2 | XYZ1
------------------------
1 | 50 | 60 | XYZ4
------------------------
2 | 150 | 160 | ABC1
------------------------
2 | 50 | 60 | ABC2
------------------------
4 | 100 | 120 | EFG
------------------------
RS:
---------------------------------------
ID | Start | End | Color | Closest Name
---------------------------------------
1 | 400 | 500 | White | XYZ4
---------------------------------------
1 | 10 | 20 | Red | XYZ1
---------------------------------------
2 | 2 | 10 | Blue | ABC2
---------------------------------------
4 | 88 | 90 | Color | EFG
---------------------------------------
Currently, I first find min distance by joining two tables:
MinDist Table:
SELECT A.ID, A.Start, A.End,
MIN(CASE
WHEN (ABS(A.End-B.Start)>=ABS(A.Start - B.End))
THEN ABS(A.Start-B.End)
ELSE ABS(A.End - B.Start)
END) AS distance
FROM ( Select A ... )
Join B On A.ID=B.ID)
Group By A.ID, A.Start, A.End
Then recompute the distance by joining tables A and B again,
GlobDist Table (Note, the query retrieves B.Name in this case):
SELECT A.ID, A.Start, A.End,
CASE
WHEN (ABS(A.End-B.Start)>=ABS(A.Start - B.End))
THEN ABS(A.Start-B.End)
ELSE ABS(A.End - B.Start)
END AS distance,
B.Name
FROM ( Select A ... )
Join B On A.ID=B.ID)
Finally join these two tables MinDist and GlobDist Tables on
GlobDist.ID= MinDist.ID,
GlobDist.Start=MinDist.Start,
GlobDist.End= MinDist.End,
GlobDist.distance= MinDist.distance.
I tested ROW_NUMBER() and PARTITION BY over (ID, Start, End), but it took much longer. So, what's the fastest and most efficient way of solving this problem? How can I reduce duplicate computation?
Thanks!
The solution below is for BigQuery Standard SQL, and is as simple and short as this:
#standardSQL
SELECT a_id, a_start, a_end, color,
ARRAY_AGG(name ORDER BY POW(ABS(a_start - b_start), 2) + POW(ABS(a_end - b_end), 2) LIMIT 1)[SAFE_OFFSET(0)] name
FROM A JOIN B ON a_id = b_id
GROUP BY a_id, a_start, a_end, color
-- ORDER BY a_id
You can test / play with the above using the dummy data from your question:
#standardSQL
WITH A AS (
SELECT 1 a_id, 400 a_start, 500 a_end, 'White' color UNION ALL
SELECT 1, 10, 20 , 'Red' UNION ALL
SELECT 2, 2, 10, 'Blue' UNION ALL
SELECT 4, 88, 90, 'Color'
), B AS (
SELECT 1 b_id, 1 b_start, 2 b_end, 'XYZ1' name UNION ALL
SELECT 1, 50, 60, 'XYZ4' UNION ALL
SELECT 2, 150, 160,'ABC1' UNION ALL
SELECT 2, 50, 60, 'ABC2' UNION ALL
SELECT 4, 100, 120,'EFG'
)
SELECT a_id, a_start, a_end, color,
ARRAY_AGG(name ORDER BY POW(ABS(a_start - b_start), 2) + POW(ABS(a_end - b_end), 2) LIMIT 1)[SAFE_OFFSET(0)] name
FROM A JOIN B ON a_id = b_id
GROUP BY a_id, a_start, a_end, color
ORDER BY a_id
with result as below
Row a_id a_start a_end color name
1 1 400 500 White XYZ4
2 1 10 20 Red XYZ1
3 2 2 10 Blue ABC2
4 4 88 90 Color EFG
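ARRAY_AGG(... ORDER BY ... LIMIT 1) is BigQuery-specific, but the pick-the-nearest logic can be sanity-checked elsewhere with a plain window function. A sketch in SQLite via Python's sqlite3, using the answer's squared-endpoint-distance metric (note this metric differs slightly from the CASE-based distance in the question):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE A (a_id INT, a_start INT, a_end INT, color TEXT);
INSERT INTO A VALUES (1,400,500,'White'),(1,10,20,'Red'),
                     (2,2,10,'Blue'),(4,88,90,'Color');
CREATE TABLE B (b_id INT, b_start INT, b_end INT, name TEXT);
INSERT INTO B VALUES (1,1,2,'XYZ1'),(1,50,60,'XYZ4'),
                     (2,150,160,'ABC1'),(2,50,60,'ABC2'),(4,100,120,'EFG');
""")
rows = conn.execute("""
    SELECT a_id, a_start, a_end, color, name
    FROM (
        SELECT a.a_id, a.a_start, a.a_end, a.color, b.name,
               ROW_NUMBER() OVER (
                   PARTITION BY a.a_id, a.a_start, a.a_end
                   -- squared distance between the two regions' endpoints
                   ORDER BY (a.a_start - b.b_start) * (a.a_start - b.b_start)
                          + (a.a_end   - b.b_end)   * (a.a_end   - b.b_end)
               ) AS rn
        FROM A a JOIN B b ON a.a_id = b.b_id
    ) AS ranked
    WHERE rn = 1                 -- keep only the nearest B region per A row
    ORDER BY a_id, a_start
""").fetchall()
```

The ARRAY_AGG trick in the answer avoids materializing the ROW_NUMBER for every pair, which is why it tends to be cheaper in BigQuery.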

Multiply quantities for all parent child relationships

I have a table like this:
======================================
ID | Description|Quantity| Parentid|
=====================================
1 | Main | NULL | NULL |
2 | Sub | 20 | 1 |
3 | Sub2 | 21 | 1 |
4 | A1 | 32 | 2 |
5 | B1 | 51 | 3 |
6 | B2 | 43 | 3 |
7 | C1 | 34 | 4 |
9 | D1 | 22 | 5 |
10 | D2 | 90 | 5 |
11 | E1 | 21 | 7 |
12 | F1 | 2 | 11 |
13 | F2 | 42 | 11 |
14 | G1 | 12 | 13 |
-------------------------------------
I want the total quantity of G1. The parent of G1 is F2, the parent of F2 is E1, the parent of E1 is C1, the parent of C1 is A1, the parent of A1 is Sub, and the parent of Sub is Main. So the total quantity of G1 is 12*42*21*34*32*20 = 230307840.
How can I get that answer with a SQL query?
WITH TotalQuantity AS
(
SELECT Quantity, ParentID
FROM MyTable
WHERE Description = 'G1'
UNION ALL
SELECT TQ.Quantity * COALESCE(T.Quantity,1), T.ParentID
FROM TotalQuantity TQ
INNER JOIN MyTable T ON T.ID = TQ.ParentID
)
SELECT * FROM TotalQuantity
WHERE ParentID IS NULL
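The same recursive CTE runs in SQLite (which requires the RECURSIVE keyword) and can be checked from Python's sqlite3; the chain multiplies 12*42*21*34*32*20 as it walks up from G1 to the root:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE MyTable (id INT, description TEXT, quantity INT, parentid INT);
INSERT INTO MyTable VALUES
  (1,'Main',NULL,NULL),(2,'Sub',20,1),(3,'Sub2',21,1),(4,'A1',32,2),
  (5,'B1',51,3),(6,'B2',43,3),(7,'C1',34,4),(9,'D1',22,5),
  (10,'D2',90,5),(11,'E1',21,7),(12,'F1',2,11),(13,'F2',42,11),
  (14,'G1',12,13);
""")
rows = conn.execute("""
    WITH RECURSIVE TotalQuantity AS (
        SELECT quantity, parentid
        FROM MyTable WHERE description = 'G1'   -- start at the leaf
        UNION ALL
        SELECT tq.quantity * COALESCE(t.quantity, 1), t.parentid
        FROM TotalQuantity tq
        JOIN MyTable t ON t.id = tq.parentid    -- multiply up to each parent
    )
    SELECT quantity FROM TotalQuantity
    WHERE parentid IS NULL                      -- the root ends the chain
""").fetchall()
print(rows[0][0])   # 230307840
```

COALESCE handles Main's NULL quantity so the root contributes a factor of 1 instead of nulling out the product.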
This will give the increasing totals for each generation.
WITH Hierarchy(ChildId, Description, Quantity, Generation, ParentId)
AS
(
SELECT Id, Description, Quantity, 0 as Generation, ParentId
FROM Table1 AS FirstGeneration
WHERE ParentId IS NULL
UNION ALL
SELECT NextGeneration.Id, NextGeneration.Description,
ISNULL(NextGeneration.Quantity, 1) * ISNULL(Parent.Quantity, 1),
Parent.Generation + 1, Parent.ChildId
FROM Table1 AS NextGeneration
INNER JOIN Hierarchy AS Parent ON NextGeneration.ParentId = Parent.ChildId
)
SELECT *
FROM Hierarchy
For G1 simply
select quantity from Hierarchy where description = 'G1' -- result = 230307840
SQL Fiddle