Fastest way to find distinct matching records

I have two tables, A and B, with the same structure, and I need to find the matching records between them. Here are the scripts:
CREATE TABLE HRS.A
(
F_1 NUMBER(5,0),
F_2 NUMBER(5,0),
F_3 NUMBER(5,0)
);
CREATE TABLE HRS.B
(
F_1 NUMBER(5,0),
F_2 NUMBER(5,0),
F_3 NUMBER(5,0)
);
INSERT INTO hrs.a VALUES (1,1000,2000);
INSERT INTO hrs.a VALUES (2,1100,8000);
INSERT INTO hrs.a VALUES (3,4000,3000);
INSERT INTO hrs.a VALUES (4,2000,5000);
INSERT INTO hrs.a VALUES (5,5000,3000);
INSERT INTO hrs.a VALUES (6,6000,6000);
INSERT INTO hrs.a VALUES (7,3000,7000);
INSERT INTO hrs.a VALUES (8,1100,9000);
INSERT INTO hrs.b VALUES (1,4000,2000);
INSERT INTO hrs.b VALUES (2,6000,8000);
INSERT INTO hrs.b VALUES (3,1000,3000);
INSERT INTO hrs.b VALUES (4,2000,5000);
INSERT INTO hrs.b VALUES (5,8000,3000);
INSERT INTO hrs.b VALUES (6,1100,6000);
INSERT INTO hrs.b VALUES (7,5000,7000);
INSERT INTO hrs.b VALUES (8,1000,9000);
To find the matching records:
SELECT a.F_1 A_F1, b.F_1 B_F1 FROM HRS.A, HRS.B WHERE A.F_2 = B.F_2;
Results:
A_F1 B_F1
3 1
6 2
1 3
4 4
8 6
2 6
5 7
1 8
Now I want to remove duplicate entries from each column separately. For example, 1 repeats in A_F1 (regardless of B_F1), so rows 3 (1-3) and 8 (1-8) are removed. Likewise, 6 repeats in B_F1 (regardless of A_F1), so rows 5 (8-6) and 6 (2-6) are removed. The final result should be:
A_F1 B_F1
3 1
6 2
4 4
5 7
Now the most important part: these two tables contain 500,000 records each. I was first inserting the matching records into a temp table, then removing duplicates from the first column, then from the second column, and finally selecting everything from the temp table. This is far too slow. How can I do this as fast as possible?
Edit # 1
I executed the following statements multiple times to generate 4096 records in each table:
INSERT INTO hrs.a SELECT F_1 + 1, F_2 + 1, 0 FROM hrs.a;
INSERT INTO hrs.b SELECT F_1 + 1, F_2 + 1, 0 FROM hrs.b;
I then ran all the answers and measured the following:
Answer      Time        Result
Rachcha     9.11 secs   OK
techdo      1.14 secs   OK
Gentlezerg  577 msecs   WRONG RESULTS
Justin      218 msecs   OK
Even @Justin's query took 37.69 secs with 65,536 records in each table (total = 131,072).
I am waiting for more optimized answers, as the actual number of records is 1,000,000 :)
Here is the execution plan of the query based on Justin's answer

Please try:
select A_F1, B_F1 From(
SELECT a.F_1 A_F1, b.F_1 B_F1,
count(*) over (partition by a.F_1 order by a.F_1) C1,
count(*) over (partition by b.F_1 order by b.F_1) C2
FROM HRS.A A, HRS.B B WHERE A.F_2 = B.F_2
)x
where C1=1 and C2=1;
How about an INNER JOIN instead? Please check with this query.
select A_F1, B_F1 From(
SELECT a.F_1 A_F1, b.F_1 B_F1,
count(*) over (partition by a.F_1 order by a.F_1) C1,
count(*) over (partition by b.F_1 order by b.F_1) C2
FROM HRS.A A INNER JOIN HRS.B B ON A.F_2 = B.F_2
)x
where C1=1 and C2=1;

Query:
SELECT a.f_1 AS a_f_1,
b.f_1 AS b_f_1
FROM a JOIN b ON a.f_2 = b.f_2
WHERE 1 = (SELECT COUNT(*)
FROM a aa JOIN b bb ON aa.f_2 = bb.f_2
WHERE aa.f_1 = a.f_1 )
AND 1 = (SELECT COUNT(*)
FROM a aa JOIN b bb ON aa.f_2 = bb.f_2
WHERE bb.f_1 = b.f_1 )
Result:
| A_F_1 | B_F_1 |
-----------------
| 3 | 1 |
| 6 | 2 |
| 4 | 4 |
| 5 | 7 |

Building on @techdo's answer, I think this can be better:
select A_F1, B_F1 From(
SELECT a.F_1 A_F1, b.F_1 B_F1,a.F_2,
count(*) OVER(PARTITION BY A.F_2) C
FROM HRS.A A, HRS.B B WHERE A.F_2 = B.F_2
)x
where C=1 ;
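The duplicate rows arise because several rows share the same F_2. This query has only one COUNT(*) OVER, so with the large amount of data you mention I think it would be a little faster. Note that counting per F_2 is equivalent to the two per-F_1 counts only if F_1 is unique within each table.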

I have an answer. I used the following code:
WITH x AS (SELECT a.f_1 AS a_f_1, b.f_1 AS b_f_1
FROM a JOIN b ON a.f_2 = b.f_2)
SELECT *
FROM x x1
WHERE NOT EXISTS (SELECT 1
FROM x x2
WHERE (x2.a_f_1 = x1.a_f_1
AND x2.b_f_1 != x1.b_f_1)
OR (x2.a_f_1 != x1.a_f_1
AND x2.b_f_1 = x1.b_f_1)
)
;
EDIT
I then used the following code, which runs within 14 ms on SQL Fiddle. I removed the common table expression and observed that the query performance improved.
SELECT a1.f_1 AS a_f1, b1.f_1 AS b_f1
FROM a a1 JOIN b b1 ON a1.f_2 = b1.f_2
WHERE NOT EXISTS (SELECT 1
FROM a a2 JOIN b b2 ON a2.f_2 = b2.f_2
WHERE (a2.f_1 = a1.f_1
AND b2.f_1 != b1.f_1)
OR (a2.f_1 != a1.f_1
AND b2.f_1 = b1.f_1))
;
Output:
A_F_1 B_F_1
3 1
6 2
4 4
5 7

Every one of these solutions takes too long: the best one (Justin's) ran for almost 45 minutes without returning on 2 million records. I ended up inserting the matching records into a temp table and then removing the duplicates, and I found that much faster than these solutions on this data set.
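The temp-table idea can be sketched as follows. This is a minimal illustration, not the asker's actual script: it assumes SQLite driven from Python's sqlite3 module instead of Oracle, and the names unique_matches, m, a_f2 and b_f2 are hypothetical. It counts occurrences over all matches before filtering (as the windowed COUNT answers do) instead of deleting from one column and then the other.

```python
import sqlite3

# Sample rows from the question: (F_1, F_2, F_3).
A_ROWS = [(1, 1000, 2000), (2, 1100, 8000), (3, 4000, 3000), (4, 2000, 5000),
          (5, 5000, 3000), (6, 6000, 6000), (7, 3000, 7000), (8, 1100, 9000)]
B_ROWS = [(1, 4000, 2000), (2, 6000, 8000), (3, 1000, 3000), (4, 2000, 5000),
          (5, 8000, 3000), (6, 1100, 6000), (7, 5000, 7000), (8, 1000, 9000)]


def unique_matches(a_rows, b_rows):
    """Return (a_f1, b_f1) pairs whose a_f1 and b_f1 each occur in exactly one match."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE a (f_1 INTEGER, f_2 INTEGER, f_3 INTEGER);
        CREATE TABLE b (f_1 INTEGER, f_2 INTEGER, f_3 INTEGER);
        -- Assumption: indexes on the join column help at scale (not measured here).
        CREATE INDEX a_f2 ON a (f_2);
        CREATE INDEX b_f2 ON b (f_2);
    """)
    conn.executemany("INSERT INTO a VALUES (?, ?, ?)", a_rows)
    conn.executemany("INSERT INTO b VALUES (?, ?, ?)", b_rows)
    # Step 1: materialise the matches once in a temp table (hypothetical name "m").
    conn.execute("""
        CREATE TEMP TABLE m AS
        SELECT a.f_1 AS a_f1, b.f_1 AS b_f1
        FROM a JOIN b ON a.f_2 = b.f_2
    """)
    # Step 2: keep only rows whose value occurs exactly once in each column.
    rows = conn.execute("""
        SELECT a_f1, b_f1
        FROM m
        WHERE a_f1 IN (SELECT a_f1 FROM m GROUP BY a_f1 HAVING COUNT(*) = 1)
          AND b_f1 IN (SELECT b_f1 FROM m GROUP BY b_f1 HAVING COUNT(*) = 1)
        ORDER BY a_f1
    """).fetchall()
    conn.close()
    return rows


print(unique_matches(A_ROWS, B_ROWS))  # [(3, 1), (4, 4), (5, 7), (6, 2)]
```

On the sample data this returns the four rows the question expects. Whether it is fast enough for millions of rows depends on the engine and indexes and would need to be measured.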

Related

How do I find which range a value falls within in another table

I want to take the value in the Number field of table A, find which range it falls into using the Low range and High range fields of table B, and show the matching Type as in the result table. If the value falls into more than one range, the range that comes first (the smaller B_Id) should be used.
A table
A_Id  Number
1     10
2     50
3     60
4     52
B table
B_Id  Low range  High range  Type
1     5          30          ACARD
2     35         55          BCARD
3     50         110         CCARD
For example, for Number = 10: Low range <= 10 and High range >= 10 holds for B_Id = 1, so the result is ACARD.
Result table
Id  Number  Type
1   10      ACARD
2   50      BCARD
3   60      CCARD
4   52      BCARD

You can try the following SQL.
BETWEEN finds the matching ranges, and ROW_NUMBER keeps only the first match for each row of table A.
create table tableA(a_id int, Number int)
insert into tableA values(1,10)
insert into tableA values(2,50)
insert into tableA values(3,60)
insert into tableA values(4,52)
create table tableB(b_id int, lowRange int, highRange int, Rtype varchar(10))
insert into tableB values (1,5,30,'ACARD')
insert into tableB values (2,35,55,'BCARD')
insert into tableB values (3,50,110,'CCARD')
;with CTE AS(
select a.a_id Id, a.Number, b.Rtype, ROW_NUMBER() OVER(PARTITION BY a.a_id ORDER BY b.b_id) RN -- smallest b_id first, as the question asks
from tableA A
inner join tableB B on A.Number between B.lowRange and B.highRange
)
select Id, Number, Rtype
from CTE
where RN = 1
drop table tableA
drop table tableB
Output:
Id Number Rtype
----------- ----------- ----------
1 10 ACARD
2 50 BCARD
3 60 CCARD
4 52 BCARD
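For engines without ROW_NUMBER, the same "smallest B_Id wins" rule can be written with a correlated MIN subquery. The sketch below assumes SQLite via Python's sqlite3 module; the table and column names (table_a, table_b, low_range, high_range) are renamed for illustration and are not from the original answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table_a (a_id INTEGER, number INTEGER);
    CREATE TABLE table_b (b_id INTEGER, low_range INTEGER,
                          high_range INTEGER, rtype TEXT);
    INSERT INTO table_a VALUES (1, 10), (2, 50), (3, 60), (4, 52);
    INSERT INTO table_b VALUES (1, 5, 30, 'ACARD'),
                               (2, 35, 55, 'BCARD'),
                               (3, 50, 110, 'CCARD');
""")

# For each row of table_a, join to the matching range with the smallest b_id.
# A number that falls in no range keeps its row with a NULL type (LEFT JOIN).
range_rows = conn.execute("""
    SELECT a.a_id, a.number, b.rtype
    FROM table_a AS a
    LEFT JOIN table_b AS b
      ON b.b_id = (SELECT MIN(b2.b_id)
                   FROM table_b AS b2
                   WHERE a.number BETWEEN b2.low_range AND b2.high_range)
    ORDER BY a.a_id
""").fetchall()

for row in range_rows:
    print(row)
```

Both 50 and 52 fall into ranges 2 and 3, and the MIN picks range 2 (BCARD) in each case.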

Find the percentage of a group by count row in sql

I have a table like this:
Person| Count
A | 10
B | 20
C | 30
I use the code below to get the above table:
select new_table.person, count(new_table.person)
from (person_table_1
inner join person_table_2
on person_table_1.user_name = person_table_2.user_all_name) new_table
group by new_table.person
However, I would like the percentage for each row, based on the overall sum of the counts.
Expected:
Person| Count | Percentage
A | 10 | 0.167
B | 20 | 0.333
C | 30 | 0.500
I would like it rounded to 3 decimal places. Can anyone please help me? Thank you.

Just do an inner query in SELECT clause
select p1.person, count(p1.person), count(p1.person) / (SELECT COUNT(p2.person) FROM person_table p2)
from person_table p1
group by p1.person
Edit: if you want only up to 3 decimal places:
select
p1.person,
count(p1.person),
ROUND(count(p1.person) / (SELECT COUNT(p2.person) FROM person_table p2), 3)
from person_table p1
group by p1.person
Edit 2: the OP edited the table, so the same join is used in both places:
select
new_table.person,
count(new_table.person),
ROUND(
count(new_table.person) /
(SELECT COUNT(new_table_COUNTER.person) FROM (
person_table_1
inner join person_table_2
on person_table_1.user_name = person_table_2.user_all_name
) new_table_COUNTER), 3)
from
(
person_table_1
inner join person_table_2
on person_table_1.user_name = person_table_2.user_all_name
) new_table
group by new_table.person

Try the query below:
declare @tbl table ([person] varchar(5));
insert into @tbl values
('a'),('a'),('a'),('b'),('b'),('c');
-- max(rowsCnt) is used here, but any value would do, because every row carries the same value in that column
select person, count(*) * 1.0 / max(rowsCnt) [percentage] from (
select person,
count(*) over (partition by (select null)) rowsCnt
from @tbl
) a group by person
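A point worth checking in any of these queries is integer division: in several engines COUNT / COUNT truncates to 0, which is why the query above multiplies by 1.0. The sketch below assumes SQLite via Python's sqlite3 module and a hypothetical single-column table named joined that stands in for the joined rows from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE joined (person TEXT)")
# 10 rows for A, 20 for B and 30 for C, matching the counts in the question.
conn.executemany("INSERT INTO joined VALUES (?)",
                 [("A",)] * 10 + [("B",)] * 20 + [("C",)] * 30)

shares = conn.execute("""
    SELECT person,
           COUNT(*) AS cnt,
           -- "* 1.0" forces floating-point division before rounding
           ROUND(COUNT(*) * 1.0 / (SELECT COUNT(*) FROM joined), 3) AS share
    FROM joined
    GROUP BY person
    ORDER BY person
""").fetchall()

for person, cnt, share in shares:
    print(person, cnt, share)
```

Without the multiplication by 1.0, SQLite would perform integer division and every share would come out as 0.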

Query SQL with "childs"

I'm a newbie with SQL and I was wondering if something like what I'm going to show will be possible to do.
I have a table like this :
A   B   C    D
-------------------------
1            ONE
    1   P
    1   PF
2            TWO
    2   PF
3            THREE
    3   P
    3   P
    3   P
    3   P
4            FOUR
    4   PF
    4   PF
5            FIVE
    5   P
I would like a query that extracts every row of column "A", together with its child rows, for which no row with the same number has "PF" in column "C". I.e.:
A   B   C    D
-------------------------
3            THREE
    3   P
    3   P
    3   P
    3   P
5            FIVE
    5   P
I'm using Python 2.7 and SQLite 3

Assuming that the empty values are NULL, you can use coalesce() to get the parent ID in each row:
SELECT *
FROM MyTable
WHERE COALESCE(A, B) NOT IN (SELECT B
FROM MyTable
WHERE C = 'PF');
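Since the question mentions Python and SQLite 3, here is a small sketch of running that COALESCE query through Python's sqlite3 module. It assumes, as the answer does, that empty cells are stored as NULL; the table name my_table and the ORDER BY rowid are choices made for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_table (a INTEGER, b INTEGER, c TEXT, d TEXT)")
# Parent rows fill A and D; child rows fill B (the parent's number) and C.
sample = (
    [(1, None, None, "ONE"), (None, 1, "P", None), (None, 1, "PF", None),
     (2, None, None, "TWO"), (None, 2, "PF", None),
     (3, None, None, "THREE")] + [(None, 3, "P", None)] * 4 +
    [(4, None, None, "FOUR")] + [(None, 4, "PF", None)] * 2 +
    [(5, None, None, "FIVE"), (None, 5, "P", None)]
)
conn.executemany("INSERT INTO my_table VALUES (?, ?, ?, ?)", sample)

# COALESCE(a, b) gives the group number for parent and child rows alike.
kept = conn.execute("""
    SELECT a, b, c, d
    FROM my_table
    WHERE COALESCE(a, b) NOT IN (SELECT b FROM my_table WHERE c = 'PF')
    ORDER BY rowid
""").fetchall()

for row in kept:
    print(row)
```

Only groups 3 and 5 remain. One caveat: if a 'PF' row ever had a NULL in B, NOT IN would match nothing, so the subquery would then need a "b IS NOT NULL" filter.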

I created the table below and inserted the records:
create table abcd ( a integer, b integer, c varchar(10), d varchar(100) );
insert into abcd values (1,null,null,'ONE');
insert into abcd values (null,1,'P',null);
insert into abcd values (null, 1,'PF',null);
insert into abcd values (2,null,null,'TWO');
insert into abcd values (null,2,'PF',null);
insert into abcd values (3,null,null,'THREE');
insert into abcd values (null,3,'P',null);
insert into abcd values (null,3,'P',null);
insert into abcd values (null,3,'P',null);
insert into abcd values (null,3,'P',null);
insert into abcd values (4,null,null,'FOUR');
insert into abcd values (null,4,'PF',null);
insert into abcd values (null,4,'PF',null);
insert into abcd values (5,null,null,'FIVE');
insert into abcd values (null,5,'P',null);
select * from abcd;
The query below gives the result set (note that ISNULL and TOP are SQL Server syntax, not SQLite):
with temp as
(
SELECT ISNULL(a, (SELECT TOP 1 a FROM abcd
WHERE a = t.b
AND a IS NOT NULL
ORDER BY d DESC)) as a , b,c,d
FROM abcd t
),
flag as
(
select a,b,c,d,case when c='PF' then 1 else 0 end as flag
from temp
where c is not null
)
select * from temp
where a in (select distinct a from flag where c <> 'PF'
and a not in ( select a from flag where flag=1))

Find next row with specific value in a given row

The table I have now looks something like this. Each row has a time value (on which the table is sorted in ascending order), and two values which can be replicated across rows:
Key TimeCall R_ID S_ID
-------------------------------------------
1 100 40 A
2 101 50 B
3 102 40 C
4 103 50 D
5 104 60 A
6 105 40 B
I would like to return something like this, where each row also shows the S_ID and TimeCall of the next row that shares its R_ID (or NULL if the row is the last instance of that R_ID). Example:
Key TimeCall R_ID S_ID NextTimeCall NextS_ID
----------------------------------------------------------------------
1 100 40 A 102 C
2 101 50 B 103 D
3 102 40 C 105 B
4 103 50 D NULL NULL
5 104 60 A NULL NULL
6 105 40 B NULL NULL
Any advice on how to do this would be much appreciated. Right now I'm joining the table on itself and staggering the key on which I'm joining, but I know this won't work for the instance that I've outlined above:
SELECT TOP 10 Table.*, Table2.TimeCall AS NextTimeCall, Table2.S_ID AS NextS_ID
FROM tempdb..#Table AS Table
INNER JOIN tempdb..#Table AS Table2
ON Table.TimeCall + 1 = Table2.TimeCall
So if anyone could show me how to do this such that it can call rows that aren't just consecutive, much obliged!

Use the LEAD() function:
SELECT *
, LEAD(TimeCall) OVER (PARTITION BY R_ID ORDER BY [Key]) AS NextTimeCall
, LEAD(S_ID) OVER (PARTITION BY R_ID ORDER BY [Key]) AS NextS_ID
FROM Table2
ORDER BY [Key]

This is only a test example I had close by, but I think it could help you out. Just adapt it to your case. It uses LAG and LEAD, and it is for SQL Server:
if object_id('tempdb..#Test') IS NOT NULL drop table #Test
create table #Test (id int, value int)
insert into #Test (id, value)
values
(1, 1),
(1, 2),
(1, 3)
select id,
value,
lag(value, 1, 0) over (order by id) as [PreviusValue],
lead(Value, 1, 0) over (order by id) as [NextValue]
from #Test
Results are
id value PreviusValue NextValue
1 1 0 2
1 2 1 3
1 3 2 0

Use an OUTER APPLY to select the top 1 row that has the same R_ID as the outer row and a higher Key value.
Just change TableName to the actual name of your table in both parts of the query:
SELECT a.*, b.TimeCall AS NextTimeCall, b.S_ID AS NextS_ID FROM
(
SELECT * FROM TableName
) AS a
OUTER APPLY
(
SELECT TOP 1 * FROM TableName AS b
WHERE a.R_ID = b.R_ID
AND b.[Key] > a.[Key]
ORDER BY b.[Key] ASC
) AS b
Hope this helps! :)

For older versions (without LEAD), here is one trick using OUTER APPLY:
SELECT a.*,
nexttimecall,
nexts_id
FROM table1 a
OUTER apply (SELECT TOP 1 timecall,s_id
FROM table1 b
WHERE a.r_id = b.r_id
AND a.[key] < b.[key]
ORDER BY [key] ASC) oa (nexttimecall, nexts_id)
Note: it is better to avoid reserved keywords (such as Key) as column or table names.
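Where neither LEAD nor OUTER APPLY is available, the same "next row with the same R_ID" lookup can be expressed as a self LEFT JOIN on a correlated MIN. The sketch below assumes SQLite via Python's sqlite3 module; the table name calls and the column name call_key (used to avoid the reserved word Key) are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE calls (call_key INTEGER, time_call INTEGER,
                        r_id INTEGER, s_id TEXT);
    INSERT INTO calls VALUES
        (1, 100, 40, 'A'), (2, 101, 50, 'B'), (3, 102, 40, 'C'),
        (4, 103, 50, 'D'), (5, 104, 60, 'A'), (6, 105, 40, 'B');
""")

# For each row, find the smallest later key with the same r_id and join to it.
# When no later row exists the subquery yields NULL, so the LEFT JOIN finds
# no partner and the next_* columns come back as NULL.
next_rows = conn.execute("""
    SELECT cur.call_key, cur.time_call, cur.r_id, cur.s_id,
           nxt.time_call AS next_time_call,
           nxt.s_id      AS next_s_id
    FROM calls AS cur
    LEFT JOIN calls AS nxt
      ON nxt.call_key = (SELECT MIN(c.call_key)
                         FROM calls AS c
                         WHERE c.r_id = cur.r_id
                           AND c.call_key > cur.call_key)
    ORDER BY cur.call_key
""").fetchall()

for row in next_rows:
    print(row)
```

This reproduces the table requested in the question, with None standing in for NULL on the last row of each R_ID.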

Is it possible to left join two tables and have the right table supply each row no more than once?

Given this table structure:
Table A
ID AGE EDUCATION
1 23 3
2 25 6
3 22 5
Table B
ID AGE EDUCATION
1 26 4
2 24 6
3 21 3
I want to find all matches between the two tables where the age is within 2 and the education is within 2. However, I do not want to select any row from TableB more than once. Each row in B should be selected 0 or 1 times and each row in A should be selected one or more times (standard left join).
SELECT *
FROM TableA as A LEFT JOIN TableB as B ON
abs(A.age - B.age) <= 2 AND
abs(A.education - B.education) <= 2
A.ID A.AGE A.EDUCATION B.ID B.AGE B.EDUCATION
1 23 3 3 21 3
2 25 6 1 26 4
2 25 6 2 24 6
3 22 5 2 24 6
3 22 5 3 21 3
As you can see, the last two rows in the output have duplicated B.ID of 2 and 3 when compared to the entire result set. I'd like those rows to return as a single null match with A.ID = 3 since they were both matched to previous A values.
Desired output:
(note that for A.ID = 3, there is no match in B because all rows in B have already been joined to rows in A.)
A.ID A.AGE A.EDUCATION B.ID B.AGE B.EDUCATION
1 23 3 3 21 3
2 25 6 1 26 4
2 25 6 2 24 6
3 22 5 null null null
I can do this with a short program, but I'd like to solve the problem using a SQL query because it is not for me and I will not have the luxury of ever seeing the data or manipulating the environment.
Any ideas? Thanks

As @Joel Coehoorn said earlier, there has to be a mechanism that decides which of the (a, b) pairs sharing the same b are filtered out of the output. SQL is not great at letting you select ONE row when multiple rows match, so you need a pivot query that filters out the records you don't want. In this case, the filtering can be done by reducing all the matching B IDs to the smallest one (or the largest, it doesn't really matter). Any function that returns one value from a set would do; min() and max() are simply the most convenient. Once you have reduced the result to the (a, b) pairs you care about, you join against that result to pull out the rest of the table data.
select a.id a_id, a.age a_age, a.education a_e,
b.id b_id, b.age b_age, b.education b_e
from a left join
(
SELECT
a.id a_id, min(b.id) b_id from a,b where
abs(A.age - B.age) <= 2 AND
abs(A.education - B.education) <= 2
group by a.id
) g on a.id = g.a_id
left join b on b.id = g.b_id;

To my knowledge, something like this is not possible with a simple SELECT statement and joins, because you need to know what has already been selected in order to eliminate duplicates.
You can, however, try something more like this:
DECLARE @JoinResults TABLE
(A_ID INT, A_Age INT, A_Education INT, B_ID INT, B_Age INT, B_Education INT)
INSERT INTO @JoinResults (A_ID, A_Age, A_Education)
SELECT ID, AGE, EDUCATION
FROM TableA
DECLARE @i INT
SET @i = 1
--Assume that A_ID is incremental and no values missed
WHILE (@i <= (SELECT MAX(A_ID) FROM @JoinResults))
BEGIN
UPDATE @JoinResults
SET B_ID = SQ.ID,
B_Age = SQ.AGE,
B_Education = SQ.EDUCATION
FROM (
SELECT ID, AGE, EDUCATION
FROM TableB b
WHERE (
abs((SELECT A_Age FROM @JoinResults WHERE A_ID = @i) - AGE) <= 2
AND abs((SELECT A_Education FROM @JoinResults WHERE A_ID = @i) - EDUCATION) <= 2
) AND (SELECT B_ID FROM @JoinResults WHERE B_ID = b.ID) IS NULL
) AS SQ
WHERE A_ID = @i -- only fill in the row for the current A_ID
SET @i = @i + 1
END
SELECT * FROM @JoinResults
NOTE: I do not currently have access to a database, so this is untested, and I am wary of two potential issues with it:
I am not sure how the update will react if there are no results.
I am unsure whether the IS NULL check is correct to eliminate the duplicates.
If these issues do arise, let me know and I can help troubleshoot.

In SQL-Server, you can use the CROSS APPLY syntax:
SELECT
a.id, a.age, a.education,
b.id AS b_id, b.age AS b_age, b.education AS b_education
FROM tableB AS b
CROSS APPLY
( SELECT TOP (1) a.*
FROM tableA AS a
WHERE ABS(a.age - b.age) <= 2
AND ABS(a.education - b.education) <= 2
ORDER BY a.id -- your choice here
) AS a ;
Depending on the order you choose in the subquery, different rows from tableA will be selected.
Edit (after your update): the above query will not show rows from A that have no matching rows in B, or even some that do match but were not selected.
It could also be done with window functions, but Access does not have them. Here is a query that I think will work in Access:
SELECT
a.id, a.age, a.education,
s.id AS b_id, s.age AS b_age, s.education AS b_education
FROM tableA AS a
LEFT JOIN
( SELECT
b.id, b.age, b.education, MIN(a.id) AS a_id
FROM tableB AS b
JOIN tableA AS a
ON ABS(a.age - b.age) <= 2
AND ABS(a.education - b.education) <= 2
GROUP BY b.id, b.age, b.education
) AS s
ON a.id = s.a_id ;
I'm not sure if Access allows such a subquery but if it doesn't, you can define it as a "Query" and then use it in another.

Use SELECT DISTINCT
SELECT DISTINCT A.id, A.age, A.education, B.age, B.education
FROM TableA as A LEFT JOIN TableB as B ON
abs(A.age - B.age) <= 2 AND
abs(A.education - B.education) <= 2
