SQL - Get per column count of differences when comparing two tables


I have 2 similar tables as shown below with minor difference between some cells
Table A
Roll_ID | FirstName | LastName | Age
1       | AAA       | XXX      | 31
2       | BBB       | YYY      | 32
3       | CCC       | ZZZ      | 33
Table B
Roll_ID | FirstName | LastName | Age
1       | AAA       | XXX      | 35
2       | PPP       | YYY      | 36
3       | QQQ       | WWW      | 37
I would like to get an output that shows the count of different records on a per-column level.
For example the output of the query for the above scenario should be
Output
Roll_ID | FirstName | LastName | Age
0       | 2         | 1        | 3
For this question we can assume that there will always be one column with non-null unique values (possibly the primary key). In the above example, Roll_ID is such a column.
My question is: what would be the most efficient way to get such an output? Is there anything to keep in mind, from an efficiency point of view, when running such a query on tables that may have millions of records?

First you have to join the tables
SELECT *
FROM table1
JOIN table2 on table1.ROLL_ID = table2.ROLL_ID
Now just add the counts
SELECT
SUM(CASE WHEN table1.FirstName <> table2.FirstName THEN 1 ELSE 0 END) as FirstNameDiff,
SUM(CASE WHEN table1.LastName <> table2.LastName THEN 1 ELSE 0 END) as LastNameDiff,
SUM(CASE WHEN table1.Age <> table2.Age THEN 1 ELSE 0 END) as AgeDiff
FROM table1
JOIN table2 on table1.ROLL_ID = table2.ROLL_ID
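As a quick check of the query above, here is a minimal sketch that loads the question's sample data into an in-memory SQLite database through Python's sqlite3 module. SQLite and the aliases t1 and t2 are assumptions made only for this illustration; the question does not name a database.

```python
import sqlite3

# In-memory database holding the sample rows from the question.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (Roll_ID INTEGER PRIMARY KEY, FirstName TEXT, LastName TEXT, Age INTEGER);
    CREATE TABLE table2 (Roll_ID INTEGER PRIMARY KEY, FirstName TEXT, LastName TEXT, Age INTEGER);
    INSERT INTO table1 VALUES (1, 'AAA', 'XXX', 31), (2, 'BBB', 'YYY', 32), (3, 'CCC', 'ZZZ', 33);
    INSERT INTO table2 VALUES (1, 'AAA', 'XXX', 35), (2, 'PPP', 'YYY', 36), (3, 'QQQ', 'WWW', 37);
""")

# One pass over the join; each SUM counts the rows where that column differs.
diff_counts = conn.execute("""
    SELECT
        SUM(CASE WHEN t1.FirstName <> t2.FirstName THEN 1 ELSE 0 END) AS FirstNameDiff,
        SUM(CASE WHEN t1.LastName  <> t2.LastName  THEN 1 ELSE 0 END) AS LastNameDiff,
        SUM(CASE WHEN t1.Age       <> t2.Age       THEN 1 ELSE 0 END) AS AgeDiff
    FROM table1 t1
    JOIN table2 t2 ON t1.Roll_ID = t2.Roll_ID
""").fetchone()

print(diff_counts)  # (2, 1, 3)
```

Note that <> yields NULL when either side is NULL, so with this form a NULL on one side is not counted as a difference.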
If an id not existing in both tables is considered "different" then you would need something like this
SELECT
SUM(CASE WHEN COALESCE(table1.FirstName,'x') <> COALESCE(table2.FirstName,'y') THEN 1 ELSE 0 END) as FirstNameDiff,
SUM(CASE WHEN COALESCE(table1.LastName,'x') <> COALESCE(table2.LastName,'y') THEN 1 ELSE 0 END) as LastNameDiff,
SUM(CASE WHEN COALESCE(table1.Age,-1) <> COALESCE(table2.Age,-2) THEN 1 ELSE 0 END) as AgeDiff
FROM ( SELECT table1.Roll_id FROM table1
UNION
SELECT table2.Roll_id FROM table2
) base
LEFT JOIN table1 on table1.ROLL_ID = base.ROLL_ID
LEFT JOIN table2 on table2.ROLL_ID = base.ROLL_ID
Here we get all the roll_ids and then left join back to the tables. This is much better than a cross join if the roll_id column is indexed.
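To see how the UNION plus LEFT JOIN version treats ids that exist in only one table, here is a small sketch run against SQLite from Python. The rows with Roll_ID 4 and 5 are invented for this illustration and are not part of the question's data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (Roll_ID INTEGER PRIMARY KEY, FirstName TEXT, LastName TEXT, Age INTEGER);
    CREATE TABLE table2 (Roll_ID INTEGER PRIMARY KEY, FirstName TEXT, LastName TEXT, Age INTEGER);
    -- Roll_ID 4 exists only in table1 and Roll_ID 5 only in table2 (hypothetical rows).
    INSERT INTO table1 VALUES (1, 'AAA', 'XXX', 31), (2, 'BBB', 'YYY', 32), (4, 'DDD', 'VVV', 40);
    INSERT INTO table2 VALUES (1, 'AAA', 'XXX', 35), (2, 'PPP', 'YYY', 36), (5, 'EEE', 'UUU', 50);
""")

# The two COALESCE defaults differ on purpose, so a missing row never compares equal.
missing_aware_counts = conn.execute("""
    SELECT
        SUM(CASE WHEN COALESCE(t1.FirstName, 'x') <> COALESCE(t2.FirstName, 'y') THEN 1 ELSE 0 END),
        SUM(CASE WHEN COALESCE(t1.LastName, 'x')  <> COALESCE(t2.LastName, 'y')  THEN 1 ELSE 0 END),
        SUM(CASE WHEN COALESCE(t1.Age, -1)        <> COALESCE(t2.Age, -2)        THEN 1 ELSE 0 END)
    FROM (SELECT Roll_ID FROM table1
          UNION
          SELECT Roll_ID FROM table2) base
    LEFT JOIN table1 t1 ON t1.Roll_ID = base.Roll_ID
    LEFT JOIN table2 t2 ON t2.Roll_ID = base.Roll_ID
""").fetchone()

print(missing_aware_counts)  # (3, 2, 4)
```

A side effect of the mismatched defaults is that a column that is NULL in both tables for the same id is also counted as different, and the sentinel values ('x', 'y', -1, -2) must never occur as real data.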

SELECT SUM(IIF(ISNULL(A.FirstName, '') <> ISNULL(B.FirstName, ''), 1, 0)) AS FirstNameRecordDiff,
SUM(IIF(ISNULL(A.LastName, '') <> ISNULL(B.LastName, ''), 1, 0)) AS LastNameRecordDiff,
SUM(IIF(ISNULL(A.Age, 0) <> ISNULL(B.Age, 0), 1, 0)) AS AgeRecordDiff
FROM A
FULL OUTER JOIN B
ON B.Roll_ID = A.Roll_ID;
This query intentionally allows nulls to equal, assuming that a lack of data would mean the same thing to the end user.
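As written, it only works on SQL Server, because IIF and the two-argument ISNULL are T-SQL functions. Other databases need changes: Oracle would use CASE and NVL, and MySQL has no FULL OUTER JOIN, so the join itself must be rewritten.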

Related

Compare two columns in SQL

I'm new to SQL and have very basic queries in GCP.
Let's consider this table below:
Name | B           | C
Arun | 1234-5678   | 1234
Tara | 6789 - 7654 | 6789
Arun | 4567        | 4324
I want to compare columns B and C and create two new columns: same (1 if they match, else 0) and different (1 if they do not match, else 0).
Here's the catch:
if column B has 1234-5678 and column C has 1234, the row should count as a match, because only the number before the "-" in B is compared.
The output should be :
Name | B           | C    | same | different
Arun | 1234-5678   | 1234 | 1    | 0
Tara | 6789 - 7654 | 6789 | 1    | 0
Arun | 4567        | 4324 | 0    | 1
Also, I want to count the values of 1 for each values in Name for same and different columns.
So far I've tried this:
SELECT
name,
b,
c ,
if(b = c, 1, 0) as same,
if (b!=c,1,0) as different,
count(same),
count(different)
From Table

Here is a possible solution using MySQL. SUBSTRING_INDEX is MySQL-specific, so SQL Server would need a different expression for the part before the "-".
Step 1) Setup table
CREATE TABLE Users (
Name varchar(50),
B varchar(50),
C varchar(50)
);
INSERT INTO Users
VALUES
('Arun', '1234-5678', '1234'),
('Tara', '6789-7654', '6789'),
('Arun', '4567', '4324');
Step 2) same & different columns
SELECT
Name, B, C,
CASE WHEN SUBSTRING_INDEX(B, "-", 1) = C THEN 1 ELSE 0 END as same,
CASE WHEN SUBSTRING_INDEX(B, "-", 1) <> C THEN 1 ELSE 0 END as different
FROM
Users
Step 3) Aggregate per name to get total_same & total_different for each user
SELECT
Name,
SUM(CASE WHEN SUBSTRING_INDEX(B, "-", 1) = C THEN 1 ELSE 0 END) as total_same,
SUM(CASE WHEN SUBSTRING_INDEX(B, "-", 1) <> C THEN 1 ELSE 0 END) as total_different
FROM
Users
GROUP BY Name
Reference: SQL Fiddle
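The same idea can be tried without a MySQL server. The sketch below runs on SQLite through Python, which is an assumption made for this illustration. SQLite has no SUBSTRING_INDEX, so the prefix before the "-" is taken with INSTR and SUBSTR instead, and TRIM is added so that the question's original value '6789 - 7654' (with spaces around the dash) still matches. The CTE name flagged and the column b_prefix are invented here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Users (Name TEXT, B TEXT, C TEXT);
    INSERT INTO Users VALUES
        ('Arun', '1234-5678', '1234'),
        ('Tara', '6789 - 7654', '6789'),
        ('Arun', '4567', '4324');
""")

# b_prefix is the trimmed part of B before the first '-', or all of B if there is no '-'.
totals = conn.execute("""
    WITH flagged AS (
        SELECT Name,
               TRIM(CASE WHEN INSTR(B, '-') > 0
                         THEN SUBSTR(B, 1, INSTR(B, '-') - 1)
                         ELSE B END) AS b_prefix,
               C
        FROM Users
    )
    SELECT Name,
           SUM(CASE WHEN b_prefix = C  THEN 1 ELSE 0 END) AS total_same,
           SUM(CASE WHEN b_prefix <> C THEN 1 ELSE 0 END) AS total_different
    FROM flagged
    GROUP BY Name
    ORDER BY Name
""").fetchall()

print(totals)  # [('Arun', 1, 1), ('Tara', 1, 0)]
```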

For the first step, you will need to SUBSTR the column b.
We start at position 1 and we want 4 characters (only works if there's only 4 characters before the '-').
With table2 as (
select name, b, c, same, different
from (select name, b, c,
case when SUBSTR(b,1,4) = c then 1 else 0 end as same,
case when SUBSTR(b,1,4) != c then 1 else 0 end as different
from Table1
group by name, b, c))
The WITH clause is useful when you have a complex query and want to define a temporary named result that you can use afterwards.
Table2 holds one row per (name, b, c) combination, with its same and different flags.
After the WITH clause comes the second step, the count of same / different per name:
Select table1.name,count(table2.same+table2.different) as total from table1
join table2 on (table2.name = table1.name and table2.b = table1.b)
group by table1.name;
The output gives you the total per name. Because the rows are grouped by name, your example produces only 2 rows: one for Arun with a total of 2 (same + different) and one for Tara with a total of 1.
So here's the entire code :
with table2 as (
select name, b, c, same, different
from (select name, b, c,
case when SUBSTR(b,1,4) = c then 1 else 0 end as same,
case when SUBSTR(b,1,4) != c then 1 else 0 end as different
from Table1
group by name, b, c))
select table1.name, count(table2.same+table2.different) as total from table1
join table2 on (table2.name = table1.name and table2.b = table1.b)
group by table1.name;

Multiple Selects from Subquery

I have multiple queries that look like this:
select count(*) from (
SELECT * FROM TABLE1 t
JOIN TABLE2 e
USING (EVENT_ID)
) s1
WHERE
s1.SOURCE_ID = 1;
where the only difference is the s1.SOURCE_ID = (some other number). I would like to turn these into a single query that just selects from the subquery using a different SOURCE_ID for each column in the result, like this:
+----------------+----------------+----------------+
| source_1_count | source_2_count | source_3_count | ... so on
+----------------+----------------+----------------+
I am trying to avoid using the multiple queries as the join is on a very large table and takes some time, so I would rather do it once and query the result multiple times.
This is on a Snowflake data warehouse which I think uses something similar to PostgreSQL (also I'm fairly new to SQL so feel free to suggest a completely different solution as well).

Use conditional aggregation
SELECT sum(case when SOURCE_ID=1 then 1 else 0 end) source_1_count, sum(case when SOURCE_ID=2 then 1 else 0 end) source_2_count...
FROM TABLE1 t
JOIN TABLE2 e
USING (EVENT_ID)

You would put the results in separate rows, using group by:
SELECT SOURCE_ID, COUNT(*)
FROM TABLE1 t JOIN
TABLE2 e
USING (EVENT_ID)
GROUP BY SOURCE_ID;
Putting the separate sources in columns is troublesome, unless you know the exact list of sources that you want in the result set.
EDIT:
If you know the exact list of sources, you can use conditional aggregation or pivot:
SELECT SUM(CASE WHEN SOURCE_ID = 1 THEN 1 ELSE 0 END) as source_id_1,
SUM(CASE WHEN SOURCE_ID = 2 THEN 1 ELSE 0 END) as source_id_2,
SUM(CASE WHEN SOURCE_ID = 3 THEN 1 ELSE 0 END) as source_id_3
FROM TABLE1 t JOIN
TABLE2 e
USING (EVENT_ID);
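Here is a minimal sketch of the conditional aggregation above, run on SQLite through Python rather than Snowflake. The column layout is an assumption: the question does not say which table holds SOURCE_ID, so it is placed in TABLE1 here, and the payload column and all rows are invented. The WHERE clause is optional and only narrows the scan to the sources being counted.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE TABLE1 (EVENT_ID INTEGER, SOURCE_ID INTEGER);
    CREATE TABLE TABLE2 (EVENT_ID INTEGER, payload TEXT);
    INSERT INTO TABLE1 VALUES (1, 1), (2, 1), (3, 2), (4, 3), (5, 4);
    -- Event 3 appears twice in TABLE2, so it contributes two joined rows.
    INSERT INTO TABLE2 VALUES (1, 'a'), (2, 'b'), (3, 'c'), (3, 'd'), (4, 'e'), (5, 'f');
""")

# The join runs once; each SUM picks out the rows for one source.
source_counts = conn.execute("""
    SELECT SUM(CASE WHEN SOURCE_ID = 1 THEN 1 ELSE 0 END) AS source_1_count,
           SUM(CASE WHEN SOURCE_ID = 2 THEN 1 ELSE 0 END) AS source_2_count,
           SUM(CASE WHEN SOURCE_ID = 3 THEN 1 ELSE 0 END) AS source_3_count
    FROM TABLE1 t
    JOIN TABLE2 e USING (EVENT_ID)
    WHERE SOURCE_ID IN (1, 2, 3)
""").fetchone()

print(source_counts)  # (2, 2, 1)
```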

All the comments so far ignore the fact that you won't have the possible benefits of pruning the data during the scan, as there are no WHERE predicates. Join can also be slower than it needs to be because of that.
This is a possible improvement:
SELECT SUM(CASE WHEN SOURCE_ID = 1 THEN 1 ELSE 0 END) as source_id_1,
SUM(CASE WHEN SOURCE_ID = 2 THEN 1 ELSE 0 END) as source_id_2,
SUM(CASE WHEN SOURCE_ID = 3 THEN 1 ELSE 0 END) as source_id_3
FROM TABLE1 t JOIN
TABLE2 e
USING (EVENT_ID)
WHERE SOURCE_ID IN (1, 2, 3);

Returning only id's of records that meet criteria

I need to return the distinct IDs of records that meet both of the following conditions at the same time:
they must have records with field reason_of_creation = 1
and they must NOT have records with field reason_of_creation = 0 or null.
Although I was able to do it, I keep wondering whether there is a more elegant (or even recommended) way of doing it.
Here is an anonymized version of what I have:
select distinct st.some_id from (
select st.some_id, wanted.wanted_count as wanted, unwanted.unwanted_count as unwanted
from some_table st
left join (
select st.some_id, count(st.reason_of_creation) as wanted_count
from some_table st
where st.reason_of_creation=1
group by st.some_id
) wanted on wanted.some_id = st.some_id
left join (
select st.some_id, count(st.reason_of_creation) as unwanted_count
from some_table st
where st.reason_of_creation=0
group by st.some_id
) unwanted on unwanted.some_id = st.some_id
where wanted.wanted_count >0 and (unwanted.unwanted_count = 0 or unwanted.unwanted_count is null)
) st;
Sample data :
some_id reason_of_creation
1 1
1 0
2 1
3 null
4 0
4 1
5 1
The desired result would be the list of records with some_id = 2, 5

It seems to me your query is overkill; all you need is some post-aggregation filtering:
SELECT some_id FROM t
GROUP BY some_id
HAVING SUM(CASE WHEN reason_of_creation = 1 THEN 1 ELSE 0 END)>0
AND SUM(CASE WHEN reason_of_creation = 0 OR reason_of_creation IS NULL THEN 1 ELSE 0 END)=0
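The HAVING approach can be checked against the question's sample data. This is a sketch on SQLite through Python, which is an assumption for illustration only; the table is named some_table as in the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE some_table (some_id INTEGER, reason_of_creation INTEGER)")
# Sample data from the question; None is stored as SQL NULL.
conn.executemany(
    "INSERT INTO some_table VALUES (?, ?)",
    [(1, 1), (1, 0), (2, 1), (3, None), (4, 0), (4, 1), (5, 1)],
)

# Keep ids with at least one row equal to 1 and no row that is 0 or NULL.
wanted_ids = [row[0] for row in conn.execute("""
    SELECT some_id
    FROM some_table
    GROUP BY some_id
    HAVING SUM(CASE WHEN reason_of_creation = 1 THEN 1 ELSE 0 END) > 0
       AND SUM(CASE WHEN reason_of_creation = 0 OR reason_of_creation IS NULL THEN 1 ELSE 0 END) = 0
    ORDER BY some_id
""")]

print(wanted_ids)  # [2, 5]
```

GROUP BY already returns one row per some_id, so no DISTINCT is needed.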

I think a more elegant query exists. It assumes that reason_of_creation is an integer field, so the smallest value greater than 0 that it can take is 1.
This version handles possible negative values of reason_of_creation:
select some_id from some_table
where reason_of_creation is null or reason_of_creation != -1
group by some_id
having(min(nvl(abs(reason_of_creation), 0)) = 1)
or
select some_id from some_table
group by some_id
having(min(nvl(abs(case when reason_of_creation = -1 then -2 else reason_of_creation end), 0)) = 1)
And this one applies if reason_of_creation is a non-negative integer:
select some_id from some_table
group by some_id
having(min(nvl(reason_of_creation, 0)) = 1)

SQL (TSQL) - Select values in a column where another column is not null?

I will keep this simple- I would like to know if there is a good way to select all the values in a column when it never has a null in another column. For example.
A B
----- -----
1 7
2 7
NULL 7
4 9
1 9
2 9
From the above set I would just want 9 from B and not 7 because 7 has a NULL in A. Obviously I could wrap this as a subquery and USE the IN clause etc. but this is already part of a pretty unique set and am looking to keep this efficient.
I should note that for my purposes this would only be a one-way comparison... I would only be returning values in B and examining A.
I imagine there is an easy way to do this that I am missing, but being in the thick of things I don't see it right now.

You can do something like this:
select *
from t
where t.b not in (select b from t where a is null);
If you want only distinct b values, then you can do:
select b
from t
group by b
having sum(case when a is null then 1 else 0 end) = 0;
And, finally, you could use window functions:
select a, b
from (select t.*,
sum(case when a is null then 1 else 0 end) over (partition by b) as NullCnt
from t
) t
where NullCnt = 0;

The query below outputs only column B. The records are grouped by column B, and the SUM counts the rows in each group where A is null: the count goes up by 1 for every null. The HAVING clause keeps only the groups whose count is 0.
SELECT B
FROM TableName
GROUP BY B
HAVING SUM(CASE WHEN A IS NULL THEN 1 ELSE 0 END) = 0
If you want to get all the rows for those values, you can join back to the table.
SELECT a.*
FROM TableName a
INNER JOIN
(
SELECT B
FROM TableName
GROUP BY B
HAVING SUM(CASE WHEN A IS NULL THEN 1 ELSE 0 END) = 0
) b ON a.b = b.b
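Both queries can be tried on the question's sample rows. The sketch below uses SQLite through Python as a stand-in for SQL Server, which is an assumption for illustration; the aliases t and keep are invented to avoid clashing with the column names A and B.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE TableName (A INTEGER, B INTEGER)")
conn.executemany(
    "INSERT INTO TableName VALUES (?, ?)",
    [(1, 7), (2, 7), (None, 7), (4, 9), (1, 9), (2, 9)],
)

# B values whose group contains no NULL in A.
clean_b = [row[0] for row in conn.execute("""
    SELECT B
    FROM TableName
    GROUP BY B
    HAVING SUM(CASE WHEN A IS NULL THEN 1 ELSE 0 END) = 0
""")]

# Full rows for those B values, obtained by joining back to the table.
clean_rows = conn.execute("""
    SELECT t.A, t.B
    FROM TableName t
    INNER JOIN (
        SELECT B
        FROM TableName
        GROUP BY B
        HAVING SUM(CASE WHEN A IS NULL THEN 1 ELSE 0 END) = 0
    ) keep ON t.B = keep.B
    ORDER BY t.A
""").fetchall()

print(clean_b)     # [9]
print(clean_rows)  # [(1, 9), (2, 9), (4, 9)]
```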

How do I modify this query without increasing the number of rows returned?

I've got a sub-select in a query that looks something like this:
left outer join
(select distinct ID from OTHER_TABLE) as MYJOIN
on BASE_OBJECT.ID = MYJOIN.ID
It's pretty straightforward. Checks to see if a certain relation exists between the main object being queried for and the object represented by OTHER_TABLE by whether or not MYJOIN.ID is null on the row in question.
But now the requirements have changed a little. There's another column in OTHER_TABLE that can have a value of 1 or 0, and the query needs to know whether a relation exists between the primary object and a 1-value row, and also whether one exists for a 0-value row. The obvious solution is to put:
left outer join
(select distinct ID, TYPE_VALUE from OTHER_TABLE) as MYJOIN
on BASE_OBJECT.ID = MYJOIN.ID
But that would be wrong because if 0-type and 1-type objects both exist for the same ID, it will increase the number of rows returned by the query, which isn't acceptable. So what I need is some sort of subselect that will return 1 row for each distinct ID, with a "1-type exists" column and a "0-type exists" column. And I have no idea how to code that in SQL.
For example, for the following table,
ID | TYPE_VALUE
_________________
1 | 1
3 | 0
3 | 1
4 | 0
I'd like to see a result set like this:
ID | HAS_TYPE_0 | HAS_TYPE_1
______________________________
1 | 0 | 1
3 | 1 | 1
4 | 1 | 0
Anyone know how I could set up a query to do this? Hopefully with a minimum of ugly hacks?

In the general case, you would use EXISTS:
SELECT DISTINCT ID,
CASE WHEN EXISTS (
SELECT * FROM Table1 y
WHERE y.TYPE_VALUE = 0 AND ID = x.ID)
THEN 1
ELSE 0 END AS HAS_TYPE_0,
CASE WHEN EXISTS (
SELECT * FROM Table1 y
WHERE y.TYPE_VALUE = 1 AND ID = x.ID)
THEN 1
ELSE 0 END AS HAS_TYPE_1
FROM Table1 x;
If you have a very large number of elements in the table, this won't perform so great - those nested subselects are often a kiss of death when it comes to performance.
For your specific case, you could also use GROUP BY and MAX() and MIN() to speed things up:
SELECT
ID,
CASE WHEN MIN(TYPE_VALUE) = 0 THEN 1 ELSE 0 END AS HAS_TYPE_0,
CASE WHEN MAX(TYPE_VALUE) = 1 THEN 1 ELSE 0 END AS HAS_TYPE_1
FROM Table1
GROUP BY ID;

Instead of select distinct ID, TYPE_VALUE from OTHER_TABLE
use
select ID,
MAX(CASE WHEN TYPE_VALUE = 0 THEN 1 ELSE 0 END) as has_type_0,
MAX(CASE WHEN TYPE_VALUE = 1 THEN 1 ELSE 0 END) as has_type_1
from OTHER_TABLE
GROUP BY ID;
You can do the same using the PIVOT operator...
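To check that the grouped subselect keeps one row per base object, here is a sketch on SQLite through Python. The BASE_OBJECT rows are invented for this illustration (the question does not show them), and the COALESCE calls are an addition that turns the NULLs produced by the left join for unmatched ids into 0.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE BASE_OBJECT (ID INTEGER PRIMARY KEY);
    CREATE TABLE OTHER_TABLE (ID INTEGER, TYPE_VALUE INTEGER);
    -- ID 2 has no rows in OTHER_TABLE (hypothetical base rows).
    INSERT INTO BASE_OBJECT VALUES (1), (2), (3), (4);
    INSERT INTO OTHER_TABLE VALUES (1, 1), (3, 0), (3, 1), (4, 0);
""")

# The subselect yields exactly one row per ID, so the join cannot multiply rows.
flags = conn.execute("""
    SELECT bo.ID,
           COALESCE(mj.has_type_0, 0) AS HAS_TYPE_0,
           COALESCE(mj.has_type_1, 0) AS HAS_TYPE_1
    FROM BASE_OBJECT bo
    LEFT OUTER JOIN (
        SELECT ID,
               MAX(CASE WHEN TYPE_VALUE = 0 THEN 1 ELSE 0 END) AS has_type_0,
               MAX(CASE WHEN TYPE_VALUE = 1 THEN 1 ELSE 0 END) AS has_type_1
        FROM OTHER_TABLE
        GROUP BY ID
    ) mj ON bo.ID = mj.ID
    ORDER BY bo.ID
""").fetchall()

print(flags)  # [(1, 0, 1), (2, 0, 0), (3, 1, 1), (4, 1, 0)]
```

ID 3 has both a 0-type and a 1-type row in OTHER_TABLE, yet it still appears only once in the result.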
