INTERSECT and UNION giving different counts of duplicate rows - sql

I have two tables, A and B, with the same column names, and I have to combine them into table C.
When I run the following query, the row count does not match:
select * into C
from
(
select * from A
union
select * from B
)X
The record count of C does not match the combined count of A and B; there is a difference of 89 rows. So I figured there must be duplicates.
I used the following query to find them:
select * from A
INTERSECT
select * from B
-- 80 rows returned
Can anybody tell me why INTERSECT returns 80 duplicates whereas the count difference when using UNION is 89?

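There are probably duplicates inside of A and/or B as well. UNION, INTERSECT and EXCEPT (without ALL) perform an implicit DISTINCT on the result (logically, not necessarily physically), so UNION also removes rows that are repeated within a single table, while INTERSECT reports each row common to both tables only once.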
Duplicate rows are usually a data-quality issue or an outright bug. I usually mitigate this risk by adding unique indexes on all columns and column sets that are supposed to be unique. I especially make sure that every table has a primary key if that is at all possible.
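To make the arithmetic concrete, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for the asker's database. The table contents are invented for illustration; the point is only that the number of rows "lost" by UNION can be larger than the number of rows INTERSECT returns when a table contains its own duplicates.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE A (id INTEGER, name TEXT);
    CREATE TABLE B (id INTEGER, name TEXT);
    INSERT INTO A VALUES (1, 'x'), (1, 'x'), (2, 'y'), (2, 'y'), (3, 'z');
    INSERT INTO B VALUES (1, 'x'), (4, 'w');
""")


def scalar(sql):
    """Run a query that returns a single value."""
    return conn.execute(sql).fetchone()[0]


total_rows = scalar("SELECT (SELECT COUNT(*) FROM A) + (SELECT COUNT(*) FROM B)")
union_rows = scalar("SELECT COUNT(*) FROM (SELECT * FROM A UNION SELECT * FROM B)")
intersect_rows = scalar("SELECT COUNT(*) FROM (SELECT * FROM A INTERSECT SELECT * FROM B)")

# Rows repeated inside A alone; INTERSECT says nothing about these.
dups_in_a = conn.execute(
    "SELECT id, name, COUNT(*) FROM A GROUP BY id, name HAVING COUNT(*) > 1 ORDER BY id"
).fetchall()

print("rows lost by UNION:", total_rows - union_rows)
print("rows returned by INTERSECT:", intersect_rows)
print("duplicates inside A:", dups_in_a)
```

Running the same GROUP BY ... HAVING COUNT(*) > 1 check against each real table should show where the extra missing rows come from.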

Related

How to get results of a joined query (2 tables) sorted in the order shown in the main table? [duplicate]

I have a table MY_TABLE with a primary key MY_PK. Then, I have a list of ordered primary keys, for example (17,13,35,2,9).
Now I want to retrieve all rows with these primary keys, and keep the order of the rows in the same way as the given list of keys.
What I was initially doing was:
SELECT * FROM MY_TABLE WHERE MY_PK IN (:my_list)
But then the order of the returned rows is random and does not correspond to the order of the given keys anymore. Is there a way to achieve that?
The only thing I thought of is making many SELECT statements and concatenate them with UNION, but my list of primary keys can be very long and contain hundreds or even thousands of keys. The alternative I thought of was to reorder the rows afterwards in the application, but I would prefer a solution where this is not necessary.
First, doing this with a union would not necessarily help. The ordering of rows in the result set is not guaranteed unless you have an order by clause.
Here is one solution, although it is inelegant:
with keys as (
select 1 as ordering, 17 as pk from dual union all
select 2 as ordering, 13 as pk from dual union all
select 3 as ordering, 35 as pk from dual union all
select 4 as ordering, 2 as pk from dual union all
select 5 as ordering, 9 as pk from dual
)
select mt.*
from My_Table mt join
keys
on mt.my_pk = keys.pk
order by keys.ordering
Oracle only guarantees the order of a result set if it is sorted with an explicit ORDER BY clause. So your actual question is, "how can I guarantee to sort my results into an arbitrary order?"
Well the simple answer is you can't. The more complicated answer is that you need to associate your arbitrary order with an index which can be sorted.
I will presume you're getting your list of IDs as a string. (If you get them as an array or something similarly table-like life is easier.) So first of all you need to tokenize your string. In my example I use the splitter function from this other SO thread. I'm going to use that in a common table expression to get some rows, and use the rownum pseudo-column to synthesize an index. Then we join the CTE output to your table.
with cte as
( select s.column_value as id
, rownum as sort_order
from table(select splitter('17,13,35,2,9') from dual) s
)
select yt.*
from your_table yt
join cte on yt.id = cte.id
order by cte.sort_order
Caveat: this is untested code but the principle is sound. If you do get compilation or syntax errors which you cannot resolve please include sufficient detail in the comments.
The way to guarantee an order on your result set is to use ORDER BY, so what I would do is insert your primary keys into a temporary table with two columns: the primary key and a sequential ID, which you would use later in the ORDER BY. Your temporary table would be:
PrimaryKey ID
-------------------
17 1
13 2
35 3
2 4
9 5
After that, just join your table to the temporary table on the PrimaryKey column and order by the ID column of the temporary table.
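The answers above share one idea: give each key an explicit position, join on the key, and order by the position. A minimal sketch of that idea, using Python's sqlite3 as a stand-in for Oracle; the helper fetch_in_order, the table key_order and the column label are hypothetical names introduced only for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_table (my_pk INTEGER PRIMARY KEY, label TEXT)")
conn.executemany(
    "INSERT INTO my_table VALUES (?, ?)",
    [(pk, f"row {pk}") for pk in (2, 9, 13, 17, 35, 40)],
)


def fetch_in_order(conn, keys):
    """Return the rows for `keys`, in the same order as `keys`."""
    # key_order is a hypothetical temporary table: one row per key plus its position.
    conn.execute("CREATE TEMP TABLE IF NOT EXISTS key_order (pk INTEGER, ordering INTEGER)")
    conn.execute("DELETE FROM key_order")
    conn.executemany(
        "INSERT INTO key_order VALUES (?, ?)",
        [(pk, pos) for pos, pk in enumerate(keys, start=1)],
    )
    return conn.execute(
        """
        SELECT mt.my_pk, mt.label
        FROM my_table mt
        JOIN key_order k ON mt.my_pk = k.pk
        ORDER BY k.ordering
        """
    ).fetchall()


rows = fetch_in_order(conn, [17, 13, 35, 2, 9])
print(rows)
```

Because the keys are bound as parameters, the list can hold thousands of entries without building a long UNION or IN list by hand.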

How to Identify matching records in two tables?

I have two tables with same column names. There are a total 40 columns in each table. Both the tables have same unique IDs. If I perform an inner join on the ID columns I get a match on 80% of the data. However, I would like to see if this match has exactly same data in each of the columns.
If there were a few rows like say 50-100 I could have performed a simple union operation ordered by ID and manually checked for the data. But both the tables contain more than 5000 records.
Is a join on each of the columns a valid solution for this or do I need to perform concatenation?
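Suppose you have N columns. Combine both tables with UNION ALL inside a derived table, then GROUP BY COL1, COL2, ..., COLN over the combined rows. (A GROUP BY written directly after the second SELECT would group only table2.) This assumes neither table contains exact duplicate rows of its own.
select COL1, COL2, ... , COLN
from (
select * from table1
union all
select * from table2
) t
group by COL1, COL2, ... , COLN
having count(*) > 1;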
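Here is a small runnable sketch of the UNION ALL plus GROUP BY idea, with the grouping applied to the combined rows through a derived table. It uses Python's sqlite3 and three invented columns instead of 40, and it assumes that neither table holds exact duplicate rows of its own (otherwise a row repeated inside one table would also have a count above 1).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (id INTEGER, c1 TEXT, c2 INTEGER);
    CREATE TABLE table2 (id INTEGER, c1 TEXT, c2 INTEGER);
    INSERT INTO table1 VALUES (1, 'a', 10), (2, 'b', 20), (3, 'c', 30);
    INSERT INTO table2 VALUES (1, 'a', 10), (2, 'b', 99), (4, 'd', 40);
""")

# Rows that are identical in every column appear twice in the combined set.
identical_rows = conn.execute(
    """
    SELECT id, c1, c2
    FROM (SELECT * FROM table1
          UNION ALL
          SELECT * FROM table2) AS combined
    GROUP BY id, c1, c2
    HAVING COUNT(*) > 1
    ORDER BY id
    """
).fetchall()

print(identical_rows)
```

Id 2 exists in both tables but differs in c2, so it is not reported as a full match.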

Database Table Content Comparison

We Use SAP HANA as database.
How can I compare if two tables have the same content?
I already did a comparison of the primary key using SQL:
select COUNT (*) from Schema.table1;
select COUNT (*) from Schema.table2;
select COUNT (*)
from Schema.table1 p
join schema.table2 r
on p.keyPart1 = r.keyPart1
and p.keyPart2 = r.keyPart2
and p.keyPart3 = r.keypart3;
So I compared the rows of both tables and of the join. All row counts are the same.
But I still don't know whether the content of all rows is exactly the same. One or more cells in a non-key column could deviate.
I thought about putting all columns in the join Statement. But that did not feel right.
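You might want to use EXCEPT in both directions; if both queries return no rows, the two tables contain the same set of distinct rows: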
SELECT * FROM A
EXCEPT
SELECT * FROM B;
SELECT * FROM B
EXCEPT
SELECT * FROM A;
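A minimal sketch of the two-way EXCEPT check, run here against Python's sqlite3 rather than SAP HANA (the SQL itself is the same shape). The table names and data are invented; rows returned by either direction are rows that have no exact counterpart in the other table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (id INTEGER, c1 TEXT, c2 INTEGER);
    CREATE TABLE table2 (id INTEGER, c1 TEXT, c2 INTEGER);
    INSERT INTO table1 VALUES (1, 'a', 10), (2, 'b', 20), (3, 'c', 30);
    INSERT INTO table2 VALUES (1, 'a', 10), (2, 'b', 99), (4, 'd', 40);
""")

# Rows present in table1 with no identical row in table2, and vice versa.
only_in_1 = sorted(conn.execute("SELECT * FROM table1 EXCEPT SELECT * FROM table2"))
only_in_2 = sorted(conn.execute("SELECT * FROM table2 EXCEPT SELECT * FROM table1"))

same_content = not only_in_1 and not only_in_2

print(only_in_1)
print(only_in_2)
print(same_content)
```

Note that EXCEPT works on distinct rows, so it is worth keeping the row-count comparison alongside it.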

distinct values from multiple fields within one table ORACLE SQL

How can I get distinct values from multiple fields within one table with just one request.
Option 1
SELECT WM_CONCAT(DISTINCT(FIELD1)) FIELD1S,WM_CONCAT(DISTINCT(FIELD2)) FIELD2S,..FIELD10S
FROM TABLE;
WM_CONCAT is LIMITED
Option 2
select DISTINCT(FIELD1) FIELDVALUE, 'FIELD1' FIELDNAME
FROM TABLE
UNION
select DISTINCT(FIELD2) FIELDVALUE, 'FIELD2' FIELDNAME
FROM TABLE
... FIELD 10
is just too slow
If you are scanning only a small range of the data (not full-scanning the whole table), you could use WITH to optimise your query
e.g:
WITH a AS
(SELECT field1,field2,field3..... FROM TABLE WHERE condition)
SELECT field1 FROM a
UNION
SELECT field2 FROM a
UNION
SELECT field3 FROM a
.....etc
For my problem, I had
WL1 ... WL2 ... correlation
A B 0.8
B A 0.8
A C 0.9
C A 0.9
how to eliminate the symmetry from this table?
select WL1, WL2,correlation from
table
where least(WL1,WL2)||greatest(WL1,WL2) = WL1||WL2
order by WL1
this gives
WL1 ... WL2 ... correlation
A B 0.8
A C 0.9
:)
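A sketch of the same symmetry elimination using a plain comparison instead of string concatenation, run through Python's sqlite3. The table name corr is invented. Comparing WL1 <= WL2 directly keeps one row per unordered pair and avoids the case where two different pairs concatenate to the same string (for example 'A' || 'AA' and 'AA' || 'A').

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE corr (wl1 TEXT, wl2 TEXT, correlation REAL)")
conn.executemany(
    "INSERT INTO corr VALUES (?, ?, ?)",
    [("A", "B", 0.8), ("B", "A", 0.8), ("A", "C", 0.9), ("C", "A", 0.9)],
)

# Keep only the row whose first label sorts first; its mirror image is dropped.
unique_pairs = conn.execute(
    "SELECT wl1, wl2, correlation FROM corr WHERE wl1 <= wl2 ORDER BY wl1, wl2"
).fetchall()

print(unique_pairs)
```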
The best option in the SQL is the UNION, though you may be able to save some performance by taking out the distinct keywords:
select FIELD1 FROM TABLE
UNION
select FIELD2 FROM TABLE
UNION provides the unique set from two tables, so distinct is redundant in this case. There simply isn't any way to write this query differently to make it perform faster. There's no magic formula that makes searching 200,000+ rows faster. It's got to search every row of the table twice and sort for uniqueness, which is exactly what UNION will do.
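To illustrate that UNION already removes duplicates, here is a small sketch with Python's sqlite3 and invented data. It also shows what happens when a constant field name is added to each branch: the same value can then appear once per field, because the constant makes the rows differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (field1 TEXT, field2 TEXT)")
conn.executemany("INSERT INTO t VALUES (?, ?)", [("a", "b"), ("b", "c"), ("a", "a")])

# No DISTINCT needed: UNION deduplicates across both branches.
plain = sorted(
    row[0] for row in conn.execute("SELECT field1 FROM t UNION SELECT field2 FROM t")
)

# With a field-name constant, 'a' and 'b' are reported once per field.
tagged = sorted(
    conn.execute(
        """
        SELECT field1 AS fieldvalue, 'FIELD1' AS fieldname FROM t
        UNION
        SELECT field2, 'FIELD2' FROM t
        """
    )
)

print(plain)
print(tagged)
```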
The only way you can make it faster is to create separate indexes on the two fields (maybe) or pare down the set of data that you're searching across.
Alternatively, if you're doing this a lot and adding new fields rarely, you could use a materialized view to store the result and only refresh it periodically.
Incidentally, your second query doesn't appear to do what you want it to. Distinct always applies to all of the columns in the select section, so your constants with the field names will cause the query to always return separate rows for the two columns.
I've come up with another method that, experimentally, seems to be a little faster. In effect, this allows us to trade one full-table scan for a Cartesian join. In most cases, I would still opt to use the union as it's much more obvious what the query is doing.
SELECT DISTINCT CASE lvl WHEN 1 THEN field1 ELSE field2 END
FROM table
CROSS JOIN (SELECT LEVEL lvl
FROM DUAL
CONNECT BY LEVEL <= 2);
It's also worthwhile to add that I tested both queries on a table without useful indexes containing 800,000 rows and it took roughly 45 seconds (returning 145,000 rows). However, most of that time was spent actually fetching the records, not running the query (the query took 3-7 seconds). If you're getting a sizable number of rows back, it may simply be the number of rows that is causing the performance issue you're seeing.
When you get distinct values from multiple columns, the result won't fit into a single data table. Consider the following data:
Column A Column B
10 50
30 50
10 50
When you take the distinct values, you get 2 rows from the first column and 1 row from the second column. It simply won't work.
And something like this?
SELECT 'FIELD1',FIELD1, 'FIELD2',FIELD2,...
FROM TABLE
GROUP BY FIELD1,FIELD2,...

Returning more than one value from a sql statement

I was looking at sql inner queries (bit like the sql equivalent of a C# anon method), and was wondering, can I return more than one value from a query?
For example, return the number of rows in a table as one output value, and also, as another output value, return the distinct number of rows?
Also, how does distinct work? Is this based on whether one field may be the same as another (thus classified as "distinct")?
I am using Sql Server 2005. Would there be a performance penalty if I return one value from one query, rather than two from one query?
Thanks
You could do your first question by doing this:
SELECT
COUNT(field1),
COUNT(DISTINCT field2)
FROM table
(For the first count you could use COUNT(*) if rows with NULL in field1 should be counted too.)
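The difference between the three counts is easiest to see on a tiny table with some NULLs. A sketch using Python's sqlite3 in place of SQL Server 2005, with invented data; COUNT(*) counts every row, COUNT(column) skips NULLs, and COUNT(DISTINCT column) counts each non-NULL value once.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (field1 INTEGER, field2 TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [(1, "a"), (2, "a"), (None, "b"), (4, None)],
)

# One query, one result row, three values.
all_rows, non_null_field1, distinct_field2 = conn.execute(
    "SELECT COUNT(*), COUNT(field1), COUNT(DISTINCT field2) FROM t"
).fetchone()

print(all_rows, non_null_field1, distinct_field2)
```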
Distinct means the definition of the word. It eliminates duplicate returned rows.
Returning 2 values instead of 1 would depend on what the values were, if they were indexed or not and other undetermined possible variables.
If you are meaning subqueries within the select statement, no you can only return 1 value. If you want more than 1 value you will have to use the subquery as a join.
If the inner query is inline in the SELECT, you may struggle to select multiple values. However, it is often possible to JOIN to a sub-query instead; that way, the sub-query can be named and you can get multiple results
SELECT a.Foo, a.Bar, x.[Count], x.[Avg]
FROM a
INNER JOIN (SELECT Something, COUNT(1) AS [Count], AVG(OtherValue) AS [Avg]
FROM b -- b and OtherValue are placeholder names
GROUP BY Something) x
ON x.Something = a.Something
Which might help.
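A runnable sketch of joining to an aggregated sub-query so that several computed values come back per row. It uses Python's sqlite3 rather than SQL Server; table b, its Amount column and the aliases cnt and avg_amount are hypothetical names chosen for this example. The sub-query needs its own FROM and a GROUP BY on the join column so that there is something to join on.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (Foo TEXT, Bar TEXT, Something INTEGER);
    CREATE TABLE b (Something INTEGER, Amount REAL);
    INSERT INTO a VALUES ('f1', 'b1', 1), ('f2', 'b2', 2);
    INSERT INTO b VALUES (1, 10.0), (1, 20.0), (2, 5.0);
""")

# The derived table x yields two aggregate values per Something.
joined = conn.execute(
    """
    SELECT a.Foo, a.Bar, x.cnt, x.avg_amount
    FROM a
    INNER JOIN (SELECT Something,
                       COUNT(*) AS cnt,
                       AVG(Amount) AS avg_amount
                FROM b
                GROUP BY Something) AS x
        ON x.Something = a.Something
    ORDER BY a.Foo
    """
).fetchall()

print(joined)
```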
DISTINCT does what it says. IIRC, you can SELECT COUNT(DISTINCT Foo) etc to query distinct data.
You can return multiple results in 3 ways (off the top of my head):
By having a select with multiple values, e.g. select col1, col2, col3
With multiple queries, e.g. select 1; select '2'; select colA. You would get to them in a data reader by calling .NextResult()
Using output parameters: declare the parameters before executing the query, then get the values from them afterwards, e.g. set @param1 = '2' in the SQL, then string myparam2 = sqlcommand.Parameters["@param1"].Value.ToString() in the C# code
Distinct, filters resulting rows to be unique.
Inner queries in the form:
SELECT * FROM tbl WHERE fld in (SELECT fld2 FROM tbl2 WHERE tbl.fld = tbl2.fld2)
cannot return multiple columns. When you need multiple columns from a secondary query, you usually need to do an inner join on the other query.
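For the row counts, SQL Server does not accept count(distinct *), so count the rows of a SELECT DISTINCT derived table instead:
SELECT (SELECT count(*) FROM table) AS total_rows, (SELECT count(*) FROM (SELECT DISTINCT * FROM table) d) AS distinct_rows
will return a dataset with one row containing two columns. Column 1 is the total number of rows in the table. Column 2 counts only distinct rows.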
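As far as I know, most engines (SQL Server included) reject COUNT(DISTINCT *), so a portable way to get both numbers in one result row is to count the rows of a SELECT DISTINCT derived table. A minimal sketch with Python's sqlite3 and invented data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a INTEGER, b TEXT)")
conn.executemany("INSERT INTO t VALUES (?, ?)", [(1, "x"), (1, "x"), (2, "y")])

# Two scalar sub-queries give one row with two columns.
total_rows, distinct_rows = conn.execute(
    """
    SELECT (SELECT COUNT(*) FROM t),
           (SELECT COUNT(*) FROM (SELECT DISTINCT * FROM t) AS d)
    """
).fetchone()

print(total_rows, distinct_rows)
```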
Distinct means the returned dataset will not have any duplicate rows. DISTINCT can appear only once in a select list, directly after SELECT (apart from its use inside aggregates such as COUNT(DISTINCT ...)). Thus a query such as:
SELECT distinct a, b, c FROM table
might have this result:
a1 b1 c1
a1 b1 c2
a1 b2 c2
a1 b3 c2
Note that values are duplicated across the whole result set but each row is unique.
I'm not sure what your last question means. You should return from a query all the data relevant to the query. As for faster, only benchmarking can tell you which approach is faster.