How to verify if two queries contain exact same data - sql

I have a table that maintains a "Gold Standard" set of data that another table should match if the table was processed correctly.
Both of these tables have almost 1,000,000 records of data.
For example, I have a table (table1) that has PrimaryKey1, ColumnA, ColumnB, ColumnC, ColumnD, and ColumnE.
I have another table (table2) with ForeignKey1, ColumnF, ColumnG, ColumnH, ColumnI, ColumnJ.
I need to check that all the data in these two tables is exactly the same, except for a few columns.
What I mean by that is that ColumnA from table1 has to contain the same values as ColumnF in table2, and ColumnC from table1 has to match up with ColumnI from table2 FOR THE SAME RECORD (let's call this PrimaryKey1). The other columns in the tables do not matter.
Also, if there is a mismatch between the datasets, I need to know where the mismatch is.

I think your best bet is EXCEPT (MINUS in Oracle): SELECT x, y, z FROM A EXCEPT SELECT x, y, z FROM B. If it returns nothing, you're good to go.
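For the tables in the question, a minimal sketch of that idea, run in both directions so rows that exist only in table2 are caught as well (EXCEPT is standard SQL and SQL Server; Oracle spells it MINUS):
-- Gold-standard rows with no exact match in table2
SELECT PrimaryKey1, ColumnA, ColumnC FROM table1
EXCEPT
SELECT ForeignKey1, ColumnF, ColumnI FROM table2;
-- table2 rows with no exact match in the gold standard
SELECT ForeignKey1, ColumnF, ColumnI FROM table2
EXCEPT
SELECT PrimaryKey1, ColumnA, ColumnC FROM table1;
Any rows returned are mismatches, and the key column tells you which records they belong to.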
Hope this helps!

A quick trick that I use is just comparing row counts. This will at least show you whether you have a problem (it won't show you where the problem is).
A UNION query combines two result sets and removes duplicates, so rows common to both queries appear as one row. If the first query returns exactly 1 million rows, the UNION of both queries should also return exactly 1 million rows. If it doesn't, there is a problem.
select ColumnA 'Col1'
, ColumnC 'Col2'
from Table1
UNION
select ColumnF 'Col1'
, ColumnI 'Col2'
from Table2
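To get the number to compare, a sketch of wrapping that UNION in a count (this assumes Table1 itself has no duplicate (Col1, Col2) pairs, since UNION removes those too):
-- Should match SELECT COUNT(*) FROM Table1 if the data lines up
SELECT COUNT(*) AS combined_rows
FROM (
select ColumnA 'Col1'
, ColumnC 'Col2'
from Table1
UNION
select ColumnF 'Col1'
, ColumnI 'Col2'
from Table2
) AS u;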

Something like
select
*
from
gold_copy a
join my_copy b on a.primary_key = b.primary_key
where
a.field1 <> b.field1
or a.field_a <> b.field_f
or a.field_c <> b.field_i
or a.field_x <> b.field_y
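Note that a plain <> comparison skips rows where either side is NULL. If the columns are nullable and your database supports it (PostgreSQL does; treat this as an assumption about your platform), IS DISTINCT FROM is the NULL-safe form:
select a.primary_key, a.field_a, b.field_f, a.field_c, b.field_i
from gold_copy a
join my_copy b on a.primary_key = b.primary_key
where a.field_a is distinct from b.field_f
or a.field_c is distinct from b.field_i;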

I think the following will help you to get the unmatched records; the subquery has to be correlated with the outer row:
select * from table1 t1
where not exists (select 1 from table2 t2 where t2.ForeignKey1 = t1.PrimaryKey1 and t2.ColumnF = t1.ColumnA and t2.ColumnI = t1.ColumnC);
Instead of all the columns, you only compare the ones you need from the two tables; the column names do not have to be the same as long as you pair them up in the correlation.
Thank you.

You could use a symmetric difference for this:
(select 'only in table1' as source, col
from table1
EXCEPT
select 'only in table1', col
from table2)
UNION ALL
(select 'only in table2', col
from table2
EXCEPT
select 'only in table2', col
from table1)
This query returns only those rows that exist in just one of the two tables, and the label column says in which table each row was found.

Query not reading the quoted string values stored in the table

I have stored some quoted values in a separate table, and I am trying to filter the rows in another table by using the values from this table in a subquery. But the subquery's values are not being picked up, and the query returns an empty result.
The value is in the column override and resolves to 'HCC11','HCC12'.
When I just copy the value from the column and paste it in place of the subquery, it fetches the data correctly. I am not able to understand the issue here. I have tried using the trim() function but it is still not working.
Note: I have attached a picture for reference:
select *
from table1
where column1 in (select override from table2)
Storing comma-separated values in a single column is a really poor database design to begin with; enclosing them in quotes makes things even worse. The proper solution to your problem is a better design.
However, if you are forced to work with that bad design, you can convert them to a proper list of values using:
select *
from table1
where column1 in (select trim(both '''' from w.word)
from table2 t2
cross join unnest(string_to_array(t2.override, ',')) as w(word))
This assumes that table1.column1 only contains a single value without any quotes, and that the override values never contain a comma inside a real value (e.g. the above would break on a value like 'A,B','C').
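To sanity-check what the subquery produces for the example value from the question, you can run it against a literal (PostgreSQL; expected output is two rows, HCC11 and HCC12):
select trim(both '''' from w.word) as code
from (values ('''HCC11'',''HCC12''')) as t2(override)
cross join unnest(string_to_array(t2.override, ',')) as w(word);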
You have the override column value as 'HCC11','HCC12', which cannot match the single value 'HCC11'. You would be better off using the LIKE operator as follows:
select * from table1 t1
where exists
(select 1 from table2 t2
where t2.override like concat('%''', t1.column1, '''%'));
According to your image, the value of table1.column1 would have to be 'HCC11','HCC12' (one string) to get a match from the subquery.
If table1 has 2 rows with the values HCC11 and HCC12, then you might use the exists keyword in your subquery.
Something like
select *
from table1 t1
where exists
(select 1
from table2 t2
where instr( t2.override, concat("'",t1.column1,"'") ) >=1
);
You can do this like -
1.
select * from table1
where column1 in
(select regexp_replace(unnest(string_to_array(override, ',')),'''', '', 'g') from table2)
Or
2.
select * from table1
where '''' || column1 || '''' in
(select unnest(string_to_array(override, ',')) from table2)
That said, I would just recommend not storing your data like this, since you want to query against it.
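For reference, a normalized layout keeps one code per row, which makes the original IN query work directly. The table and column names below are made up for illustration:
-- One override code per row instead of a quoted, comma-separated string
create table table2_override (
table2_id integer not null,
override text not null -- e.g. 'HCC11'
);
select *
from table1
where column1 in (select override from table2_override);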

Combine three columns from different tables into one row

I am new to SQL and am trying to combine a column value from three different tables into one row in DB2 Warehouse on Cloud. Each table consists of only one row and a unique column name. So what I want is just to join these three into one row, keeping their original column names.
Each table is built from a statement that looks like this:
SELECT SUM(FUEL_TEMP.FUEL_MLAD_VALUE) AS FUEL
FROM
(SELECT ML_ANOMALY_DETECTION.MLAD_METRIC AS MLAD_METRIC, ML_ANOMALY_DETECTION.MLAD_VALUE AS FUEL_MLAD_VALUE, ML_ANOMALY_DETECTION.TAG_NAME AS TAG_NAME, ML_ANOMALY_DETECTION.DATETIME AS DATETIME, DATA_CONFIG.SYSTEM_NAME AS SYSTEM_NAME
FROM ML_ANOMALY_DETECTION
INNER JOIN DATA_CONFIG ON
(ML_ANOMALY_DETECTION.TAG_NAME =DATA_CONFIG.TAG_NAME AND
DATA_CONFIG.SYSTEM_NAME = 'FUEL')
WHERE ML_ANOMALY_DETECTION.MLAD_METRIC = 'IFOREST_SCORE'
AND ML_ANOMALY_DETECTION.DATETIME >= (CURRENT DATE - 9 DAYS)
ORDER BY DATETIME DESC)
AS FUEL_TEMP
I have tried JOIN, INNER JOIN, UNION/UNION ALL, but can't get it to work as it should. How can I do this?
Use a cross-join like this:
create table table1 (field1 char(10));
create table table2 (field2 char(10));
create table table3 (field3 char(10));
insert into table1 values('value1');
insert into table2 values('value2');
insert into table3 values('value3');
select *
from table1
cross join table2
cross join table3;
Result:
field1 field2 field3
---------- ---------- ----------
value1 value2 value3
A cross join joins all the rows on the left with all the rows on the right. You will end up with a product of rows (table1 rows x table2 rows x table3 rows). Since each table only has one row, you will get (1 x 1 x 1) = 1 row.
Using UNION should solve your problem. Something like this:
SELECT
WarehouseDB1.WarehouseID AS TheID,
'A' AS TheSystem,
WarehouseDB1.TheValue AS TheValue
FROM WarehouseDB1
UNION
SELECT
WarehouseDB2.WarehouseID AS TheID,
'B' AS TheSystem,
WarehouseDB2.TheValue AS TheValue
FROM WarehouseDB2
UNION
SELECT
WarehouseDB3.WarehouseID AS TheID,
'C' AS TheSystem,
WarehouseDB3.TheValue AS TheValue
FROM WarehouseDB3
I'll adapt the code to your table names and rows if you tell me what they are. This kind of query would return something like the following:
TheID TheSystem TheValue
1 A 10
2 A 20
3 B 30
4 C 40
5 C 50
As long as your column names match in each query, you should get the desired results.

I am trying to return a certain values in each row which depend on whether different values in that row are already in a different table

I'm still a n00b at SQL and am running into a snag. What I have is an initial selection of certain IDs into a temp table based upon certain conditions:
SELECT DISTINCT ID
INTO #TEMPTABLE
FROM ICC
WHERE ICC_Code = 1 AND ICC_State = 'CA'
Later in the query I SELECT a different and much longer listing of IDs along with other data from other tables. That SELECT is about 20 columns wide and is my result set. What I would like to be able to do is add an extra column to that result set with each value of that column either TRUE or FALSE. If the ID in the row is in #TEMPTABLE the value of the additional column should read TRUE. If not, FALSE. This way the added column will read TRUE or FALSE on each row, depending on whether the ID in each row is in #TEMPTABLE.
The second SELECT would be something like:
SELECT ID,
ColumnA,
ColumnB,
...
NEWCOLUMN
FROM ...
NEWCOLUMN's value for each row would depend on whether the ID returned in that row is in #TEMPTABLE.
Does anyone have any advice here?
Thank you,
Matt
If you left join to the #TEMPTABLE you'll get a NULL where the IDs don't exist:
SELECT X.ID,
ColumnA,
ColumnB,
...
CASE WHEN T.ID IS NOT NULL THEN 1 ELSE 0 END AS NEWCOLUMN -- gives 1 or 0 (T-SQL cannot SELECT a bare predicate)
FROM ... X
LEFT JOIN #TEMPTABLE T
ON T.ID = X.ID -- define how the two rows are related uniquely
You need to LEFT JOIN your results query to #TEMPTABLE ON ID. This will give you the ID if there is one and NULL if there isn't; if you want 1 or 0, the test to wrap in a CASE expression (for SQL Server) is ISNULL(#TEMPTABLE.ID, 0) <> 0.
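A sketch of that approach in T-SQL; ResultsQuery is a made-up stand-in for your 20-column query, and the ISNULL test assumes the IDs are non-zero integers:
SELECT r.ID,
r.ColumnA,
CASE WHEN ISNULL(t.ID, 0) <> 0 THEN 'TRUE' ELSE 'FALSE' END AS NewColumn -- assumes real IDs are never 0
FROM ResultsQuery r
LEFT JOIN #TEMPTABLE t ON t.ID = r.ID;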
A few notes on coding for performance:
By definition an ID column is unique, so the DISTINCT is redundant and causes unnecessary processing (unless it is an ID from another table)
Why would you store this to a temporary table rather than just using it in the query directly?
You could use a union and a subquery.
Select . . . . , 'TRUE'
From . . .
Where ID in
(Select id FROM #temptable)
UNION
SELECT . . . , 'FALSE'
FROM . . .
WHERE ID NOT in
(Select id FROM #temptable)
So the top part, SELECT ... FROM ... WHERE ID IN (Subquery), does a SELECT if the ID is in your temptable.
The bottom part does a SELECT if the ID is not in the temptable.
The UNION operator joins the two results nicely, since both SELECT statements will return the same number of columns.
To expand on what someone else was saying with Union, just do something like so
SELECT id, TRUE AS myColumn FROM `table1`
UNION
SELECT id, FALSE AS myColumn FROM `table2`

Display multiple queries with different row types as one result

In PostgreSQL 8.3 on Ubuntu, I have 3 tables, say T1, T2, T3, with different schemas.
Each of them contains (a few) records related to the object whose ID I know.
Using psql, I frequently do these 3 operations:
SELECT field-set1 FROM T1 WHERE ID='abc';
SELECT field-set2 FROM T2 WHERE ID='abc';
SELECT field-set3 FROM T3 WHERE ID='abc';
and just watch the results; for me it is enough to see them.
Is it possible to have a procedure/function/macro etc., with one parameter 'id',
that just runs the three SELECTs one after another,
displaying the results on the screen?
field-set1, field-set2 and field-set3 are completely different.
There is no reasonable way to JOIN the tables T1, T2, T3; these are unrelated data.
I do not want JOIN.
I want to see the three resulting sets on the screen.
Any hint?
Quick and dirty method
If the row types (data types of all columns in sequence) don't match, UNION will fail.
However, in PostgreSQL you can cast a whole row to its text representation:
SELECT t1::text AS whole_row_in_text_representation FROM t1 WHERE id = 'abc'
UNION ALL
SELECT t2::text FROM t2 WHERE id = 'abc'
UNION ALL
SELECT t3::text FROM t3 WHERE id = 'abc';
Only one ; at the end, and even that one is optional for a single statement.
A more refined alternative
But it also needs a lot more code. Pick the table with the most columns first, cast every individual column to text and give it a generic name. Add NULL values for the other tables with fewer columns. You can even insert headers between the tables:
SELECT '-t1-'::text AS c1, '---'::text AS c2, '---'::text AS c3 -- table t1
UNION ALL
SELECT '-col1-'::text, '-col2-'::text, '-col3-'::text -- 3 columns
UNION ALL
SELECT col1::text, col2::text, col3::text FROM t1 WHERE id = 'abc'
UNION ALL
SELECT '-t2-'::text, '---'::text, '---'::text -- table t2
UNION ALL
SELECT '-col_a-'::text, '-col_b-'::text, NULL::text -- 2 columns, 1 NULL
UNION ALL
SELECT col_a::text, col_b::text, NULL::text FROM t2 WHERE id = 'abc'
...
Put a UNION ALL in between and give all the columns the same alias:
SELECT field-set1 as fieldset FROM T1 WHERE ID='abc'
union all
SELECT field-set2 as fieldset FROM T2 WHERE ID='abc'
union all
SELECT field-set3 as fieldset FROM T3 WHERE ID='abc';
and execute it all at once.

Updating changed rows

I have a requirement to update a couple of thousand rows in a table based on whether any changes have happened to any of the values. At the moment I'm just updating all the values regardless, but I was wondering which is more efficient. Should I check all the columns to see if there are any changes and then update, or should I just update regardless? e.g.
update someTable Set
column1 = somevalue,
column2 = somevalue,
column3 = somevalue,
etc....
from someTable inner join sometable2 on
someTable.id = sometable2.id
where
someTable.column1 != sometable2.column1 or
someTable.column2 != sometable2.column2 or
someTable.column3 != sometable2.column3 or
etc etc......
What's faster and what's best practice?
See two articles on Paul White's Blog.
The Impact of Non-Updating Updates for discussion of the main issue.
Undocumented Query Plans: Equality Comparisons for a less tedious way of doing the inequality comparisons, particularly if your columns are nullable (WHERE NOT EXISTS (SELECT someTable.* INTERSECT SELECT someTable2.*)).
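Applied to the update in the question, that pattern would look something like this (a sketch in T-SQL, using only the column names shown above):
UPDATE someTable
SET column1 = sometable2.column1,
column2 = sometable2.column2,
column3 = sometable2.column3
FROM someTable
INNER JOIN sometable2 ON someTable.id = sometable2.id
WHERE NOT EXISTS (SELECT someTable.column1, someTable.column2, someTable.column3
INTERSECT
SELECT sometable2.column1, sometable2.column2, sometable2.column3); -- rows identical in all three columns are skipped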
I believe this is the best way.
Tables and data:
declare @someTable1 table(id int, column1 int, column2 varchar(2))
declare @someTable2 table(id int, column1 int, column2 varchar(2))
insert @someTable1
select 1, 10, 'a3'
union all select 2, 20, 'a3'
union all select 3, null, 'a4'
insert @someTable2
select 1, 10, 'a3'
union all select 2, 19, 'a3'
union all select 3, null, 'a5'
Update:
UPDATE t1
set t1.column1 = t2.column1,
t1.column2 = t2.column2
from @someTable1 t1
JOIN
(select * from @someTable2
EXCEPT
select * from @someTable1) t2
on t2.id = t1.id
Result:
select * from @someTable1
id column1 column2
----------- ----------- -------
1 10 a3
2 19 a3
3 NULL a5
I've found that explicitly including the where clause that excludes no-op updates performs faster when working against large tables, but this is a very YMMV type of question.
If possible, compare the two approaches side by side, against a realistic set of data. E.g. if your tables contain millions of rows, and the updates affect only 10, make sure your sample data affects just a few rows. Or likewise, if it's likely that most rows will change, make your sample data reflect that.
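For a side-by-side test in SQL Server, one simple sketch is to run each candidate UPDATE with the statistics switches on and compare the elapsed times and reads (run each version against its own restored copy of the data, so the first run doesn't change what the second one sees):
SET STATISTICS IO ON;
SET STATISTICS TIME ON;
-- run one candidate UPDATE here (e.g. the version with or without the change-detection WHERE clause)
SET STATISTICS TIME OFF;
SET STATISTICS IO OFF;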