SQL Query Design for data validation issue

I have a fact table that contains some finance data. There is a column (VERS_NM) that defines whether the value is "Actual" or "Current Outlook". The values for these two should always be the same, but we noticed in some reports they seem incorrect. So I want to write a query to find where the actual value does not match the current outlook.
I cannot wrap my head around a way to do this.
Here is what the table looks like:
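(Hypothetical sample data to illustrate the shape; the value column name FINC_ACCT_METRIC_VAL is taken from the answers below, the other columns are made up:)

VERS_NM         | FINC_ACCT | FISC_PRD | FINC_ACCT_METRIC_VAL
----------------|-----------|----------|---------------------
Current Outlook | Revenue   | 2015-03  | 1000.00
Actual          | Revenue   | 2015-03  | 1000.00   <-- matches, fine
Current Outlook | Expense   | 2015-03  | 250.00
Actual          | Expense   | 2015-03  | 310.00    <-- mismatch to find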
So to recap: there should be an identical row to row 1, except the VERS_NM column will say "Actual". At least, there is supposed to be; I want to find any instances where the Actual and Current Outlook values don't match. Any help or ideas are much appreciated. Just a push in the right direction or some kind of plan to tackle this would be great.
Thanks!

You could just self join the data, replacing the fields a, b, c, d with the fields that indicate that the rows are equivalent.
SELECT
*
FROM
yourTable AS actual
INNER JOIN
yourTable AS outlook
ON actual.a = outlook.a
AND actual.b = outlook.b
AND actual.c = outlook.c
AND actual.d = outlook.d
WHERE
actual.VERS_NM = 'Actual'
AND outlook.VERS_NM = 'Current Outlook'
AND actual.FINC_ACCT_METRIC_VAL <> outlook.FINC_ACCT_METRIC_VAL
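The INNER JOIN above only catches pairs whose values differ. If a counterpart row can also be missing entirely, a variation of the same idea (not part of the original answer; a sketch assuming FULL OUTER JOIN support and non-NULL key columns a, b, c, d) is to join the two versions as separate row sets and keep the unmatched rows too:
SELECT
*
FROM
(SELECT * FROM yourTable WHERE VERS_NM = 'Actual') AS actual
FULL OUTER JOIN
(SELECT * FROM yourTable WHERE VERS_NM = 'Current Outlook') AS outlook
ON actual.a = outlook.a
AND actual.b = outlook.b
AND actual.c = outlook.c
AND actual.d = outlook.d
WHERE
actual.a IS NULL -- Current Outlook row with no Actual counterpart
OR outlook.a IS NULL -- Actual row with no Current Outlook counterpart
OR actual.FINC_ACCT_METRIC_VAL <> outlook.FINC_ACCT_METRIC_VAL -- values differ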

An alternative method is to use UNION. It's more of a way to detect differences in many table fields but can work in your situation as well.
The technique is described in the article The shortest, fastest, and easiest way to compare two tables in SQL Server: UNION! for comparing two separate tables, but you can apply it to a single table as well.
Replace the fields COL1, COL2 etc. with the columns you want to compare. For your comparison I've added a WHERE clause to the inner SELECT to effectively view your data as two separate tables.
SELECT MIN(TableName) as TableName, ID, COL1, COL2, COL3 ...
FROM
(
SELECT 'Actual' as TableName, A.ID, A.COL1, A.COL2, A.COL3, ...
FROM Finance_Data A
WHERE VERS_NM = 'Actual'
UNION ALL
SELECT 'Outlook' as TableName, B.ID, B.COL1, B.COL2, B.COL3, ...
FROM Finance_Data B
WHERE VERS_NM = 'Current Outlook'
) T
GROUP BY ID, COL1, COL2, COL3 ...
HAVING COUNT(*) = 1
ORDER BY ID

You could group by all other columns, and then use a having clause to demand that each group has an "Actual" and a "Current Outlook" variant:
select
col1
, col2
, ... all other columns ...
from YourTable
group by
col1
, col2
, ... all other columns ...
having sum(case when VERS_NM = 'Actual' then 1 else 0 end) <> 1
or sum(case when VERS_NM = 'Current Outlook' then 1 else 0 end) <> 1
or count(*) <> 2

Related

Joining two datasets with subqueries

I am attempting to join two large datasets using BigQuery. They have a common field; however, the common field has a different name in each dataset.
I want to count number of rows and sum the results of my case logic for both table1 and table2.
I believe I have errors resulting from the subqueries (subselects?) as well as syntax errors. I have tried to apply precedent from similar posts but I still seem to be missing something. Any assistance in getting this sorted is greatly appreciated.
SELECT
table1.field1,
table1.field2,
(
SELECT COUNT (*)
FROM table1) AS table1_total,
sum(case when table1.mutually_exclusive_metric1 = "Y" then 1 else 0 end) AS t1_pass_1,
sum(case when table1.mutually_exclusive_metric1 = "Y" AND table1.mutually_exclusive_metric2 IS null OR table1.mutually_exclusive_metric3 = 'Y' then 1 else 0 end) AS t1_pass_2,
sum(case when table1.mutually_exclusive_metric3 ="Y" AND table1.mutually_exclusive_metric2 ="Y" AND table1.mutually_exclusive_metric3 ="Y" then 1 else 0 end) AS t1_pass_3,
(
SELECT COUNT (*)
FROM table2) AS table2_total,
sum(case when table2.metric1 IS true then 1 else 0 end) AS t2_pass_1,
sum(case when table2.metric2 IS true then 1 else 0 end) AS t2_pass_2,
(
SELECT COUNT (*)
FROM dataset1.table1 JOIN EACH dataset2.table2 ON common_field_table1 = common_field_table2) AS overlap
FROM
dataset1.table1,
dataset2.table2
WHERE
XYZ
Thanks in advance!
Sho. Let's take this one step at a time:
1) Using * is not explicit, and being explicit is good. Additionally, mixing explicit selects with * will give you duplicated columns under auto-renamed names: table1.field will come back as table1_field. Unless you are just playing around, don't use *.
2) You never joined. A query with a join looks like this (note order of WHERE and GROUP statements, note naming of each):
SELECT
t1.field1 AS field1,
t2.field2 AS field2
FROM dataset1.table1 AS t1
JOIN dataset2.table2 AS t2
ON t1.field1 = t2.field1
WHERE t1.field1 = "some value"
GROUP BY field1, field2
Here t1.field1 and t2.field1 contain corresponding values; you wouldn't repeat both of them in the select.
3) Use whitespace to make your code easier to read. It helps everyone involved, including you.
4) Your subselects are pretty useless. A subselect is used instead of creating a new table. For instance, you would use a subselect to group or filter data from an existing table:
SELECT
subselect.field1 AS ssf1,
subselect.max_f1 AS ss_max_f1
FROM (
SELECT
t1.field1 AS field1,
MAX(t1.field1) AS max_f1
FROM dataset1.table1 AS t1
GROUP BY field1
) AS subselect
The subselect is practically a new table that you select from. Treat it logically like it happens first: you take the results from that and use them in your main select. (A sketch applying this to the question follows after point 5.)
5) This was a terrible question. It didn't even look like you tried to figure things out one step at a time.
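Putting the points above together, here is a rough sketch of how the per-table totals and the overlap could be computed without the accidental cross join (this is not code from the answer; a hypothetical standard-SQL rewrite reusing the question's placeholder names, showing only one of the pass metrics):
SELECT
t1.table1_total,
t1.t1_pass_1,
t2.table2_total,
o.overlap
FROM (
SELECT
COUNT(*) AS table1_total,
SUM(CASE WHEN mutually_exclusive_metric1 = 'Y' THEN 1 ELSE 0 END) AS t1_pass_1
FROM dataset1.table1
) AS t1
CROSS JOIN (
SELECT COUNT(*) AS table2_total
FROM dataset2.table2
) AS t2
CROSS JOIN (
SELECT COUNT(*) AS overlap
FROM dataset1.table1 AS a
JOIN dataset2.table2 AS b
ON a.common_field_table1 = b.common_field_table2
) AS o
Each aggregate lives in its own one-row subselect, so no outer WHERE or GROUP BY is needed; the three subselects are simply cross joined into a single result row.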

Error handling or Exception handling in Netezza

I'm running a Netezza SQL process as part of a shell script, and in one of the SQL scripts I want it to raise an ERROR or exception if the number of rows from 2 different tables don't match.
SQL Code:
/* The following 2 tables should return the same number of rows to make sure the process is correct */
select count(*)
from (
select distinct col1, col2,col3
from table_a
where week > 0 and rec >= 1
) as x ;
select count(*)
from (
select distinct col1, col2, col3
from table_b
) as y ;
How do I compare the 2 row counts and raise an exception/ERROR in the Netezza SQL process, so that it exits the process if the 2 row counts aren't equal?
I agree a script is the best option. However, you could still do the check in your SQL itself by using a cross join:
select a.*
from Next_Step_table a cross join
(select case when y.y_cnt is null then 'No Match' else 'Match' end as match
from (select count(*) as x_cnt
from (select distinct col1, col2, col3
from table_a
where week > 0 and rec >= 1
) as xa) x left outer join
(select count(*) as y_cnt
from (select distinct col1, col2, col3
from table_b
) as yb) y on x.x_cnt = y.y_cnt) match_tbl
where match_tbl.match = 'Match'
I'm guessing the best solution here is to do it in the script.
I.e. store the result of count(*) in variables, then compare them. nzsql has command-line options to return only the result data of a single query.
If it must be done in plain SQL, a horrible, horrible kludge that will work is to use divide-by-zero. It's ugly but I've used it before when testing stuff. Off the top of my head:
with
subq_x as (select count(*) c1 from ....),
subq_y as (select count(*) c2 from ...)
select (case when subq_x.c1 != subq_y.c2 then 1/0 else 1 end) counts_match
from subq_x cross join subq_y;
Did I mention this is ugly ?

SQL using CASE in SELECT with GROUP BY. Need CASE-value but get row-value

So basically there is 1 question and 1 problem:
1. Question: when I have around 100 columns in a table (and no key or unique index is set) and I want to join or subselect that table with itself, do I really have to write out every column name?
2. Problem: the example below illustrates the first question and my actual SQL statement problem.
Example:
SELECT
A.FIELD1,
(SELECT CASE WHEN B.FIELD2 = 1 THEN B.FIELD3 ELSE null END FROM TABLE B WHERE A.* = B.*) AS CASEFIELD1,
(SELECT CASE WHEN B.FIELD2 = 2 THEN B.FIELD4 ELSE null END FROM TABLE B WHERE A.* = B.*) AS CASEFIELD2
FROM TABLE A
GROUP BY A.FIELD1
The story is: if I don't put the CASE into its own select statement, I have to put the actual column name into the GROUP BY, and then the GROUP BY doesn't group on the NULL value from the CASE but on the actual value from the row. Because of that I would have to either join or subselect on all columns, since there is no key and no unique index, or somehow find another solution.
DBServer is DB2.
So now to describe it just with words and no SQL:
I have "order items" which can be divided into "ZD" and "EK" (1 = ZD, 2 = EK) and can be grouped by "distributor". Even though an "order item" has one of two possible "departements" (ZD, EK), the ZD and EK fields are always both filled. I need the grouping to consider the "departement": only when the designated "departement" value (ZD or EK) changes should a new group be created.
SELECT
(CASE WHEN TABLE.DEPARTEMENT = 1 THEN TABLE.ZD ELSE null END) AS ZD,
(CASE WHEN TABLE.DEPARTEMENT = 2 THEN TABLE.EK ELSE null END) AS EK,
TABLE.DISTRIBUTOR,
sum(TABLE.SOMETHING) AS SOMETHING
FROM TABLE
GROUP BY
ZD,
EK,
TABLE.DISTRIBUTOR,
TABLE.DEPARTEMENT
This worked with the CASE expressions in the SELECT and ZD, EK in the GROUP BY. The only problem was that even if EK was not the designated DEPARTEMENT, it still opened a new group whenever it changed, because the GROUP BY was using the real EK value and not the NULL from the CASE, as I explained up top.
And here, ladies and gentlemen, is the solution to the problem:
SELECT
(CASE WHEN TABLE.DEPARTEMENT = 1 THEN TABLE.ZD ELSE null END) AS ZD,
(CASE WHEN TABLE.DEPARTEMENT = 2 THEN TABLE.EK ELSE null END) AS EK,
TABLE.DISTRIBUTOR,
sum(TABLE.SOMETHING) AS SOMETHING
FROM TABLE
GROUP BY
(CASE WHEN TABLE.DEPARTEMENT = 1 THEN TABLE.ZD ELSE null END),
(CASE WHEN TABLE.DEPARTEMENT = 2 THEN TABLE.EK ELSE null END),
TABLE.DISTRIBUTOR,
TABLE.DEPARTEMENT
#t-clausen.dk: Thank you!
#others: ...
Actually there is a wildcard equality test.
I am not sure why you would group by field1; that would seem impossible in your example. I tried to fit it into your question:
SELECT FIELD1,
CASE WHEN FIELD2 = 1 THEN FIELD3 END AS CASEFIELD1,
CASE WHEN FIELD2 = 2 THEN FIELD4 END AS CASEFIELD2
FROM
(
SELECT * FROM A
INTERSECT
SELECT * FROM B
) C
UNION -- results in a distinct
SELECT
A.FIELD1,
null,
null
FROM
(
SELECT * FROM A
EXCEPT
SELECT * FROM B
) C
This will fail for data types that are not comparable.
No, there's no wildcard equality test. You'd have to list every field you want tested individually. If you don't want to test each individual field, you could use a hack such as concatenating all the fields, e.g.
WHERE (a.foo + a.bar + a.baz) = (b.foo + b.bar + b.baz)
but either way, you're listing all of the fields.
I might tend to solve it something like this
WITH q as
(SELECT
DEPARTEMENT
, (CASE WHEN DEPARTEMENT = 1 THEN ZD
WHEN DEPARTEMENT = 2 THEN EK
ELSE null
END) AS GRP
, DISTRIBUTOR
, SOMETHING
FROM mytable
)
SELECT
DEPARTEMENT
, Grp
, Distributor
, sum(SOMETHING) AS SumTHING
FROM q
GROUP BY
DEPARTEMENT
, GRP
, DISTRIBUTOR
If you need to find all rows in TableA that match in TableB, how about INTERSECT or INTERSECT DISTINCT?
select * from A
INTERSECT DISTINCT
select * from B
However, if you only want rows from A where the entire row matches the values in a row from B, then why does your sample code take some values from A and others from B? If the row matches on all columns, then that would seem pointless. (Perhaps your question could be explained a bit more fully?)

What is the fastest/easiest way to tell if 2 records in the same SQL table are different?

I want to be able to compare 2 records in the same SQL table and tell if they are different. I do not need to tell what is different, just that they are different.
Also, I only need to compare 7 of the 10 columns in the records, i.e. each record has 10 columns but I only care about 7 of them.
Can this be done through SQL or should I get the records in C# and hash them to see if they are different values?
You can write a group by query like this:
SELECT field1, field2, field3, .... field7, COUNT(*)
FROM table
[WHERE primary_key = key1 OR primary_key = key2]
GROUP BY field1, field2, field3, .... field7
HAVING COUNT(*) > 1
That way you get all records with the same values for fields 1 to 7, along with the number of occurrences.
Add the part between brackets to limit your search for duplicates, either with OR, or with IN (...).
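For the original two-record case, a usage sketch (hypothetical key values 42 and 43; table and column names are the same placeholders as in the query above) could be:
SELECT field1, field2, field3, field4, field5, field6, field7, COUNT(*)
FROM table
WHERE primary_key IN (42, 43)
GROUP BY field1, field2, field3, field4, field5, field6, field7
HAVING COUNT(*) > 1
A row comes back only if the two records agree on all seven columns; an empty result means they differ in at least one of them.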
IF EXISTS (SELECT Col1, Col2, ColEtc...
from MyTable
where condition1
EXCEPT SELECT Col1, Col2, ColEtc...
from MyTable
where condition2)
BEGIN
-- Query returns all rows from first set that are not column for column
-- also in the second (EXCEPT) set. So if there are any, there will be
-- rows returned, which meets the EXISTS criteria. Since you're only
-- checking EXISTS, SQL doesn't actually need to return columns.
END
No hash is necessary. Normal equality comparison is enough:
select isEqual = case when t1.a <> t2.a or t1.b <> t2.b then 0 else 1 end
SELECT
CASE WHEN (a.column1, a.column2, ..., a.column7)
= (b.column1, b.column2, ..., b.column7)
THEN 'all 7 columns same'
ELSE 'one or more of the 7 columns differ'
END AS result
FROM tableX AS a
JOIN tableX AS b
ON a.PK = #PK_of_row_one
AND b.PK = #PK_of_row_two
Can't you just use the DISTINCT keyword? Duplicate rows will not be returned, so each row you receive is unique (and different from the others).
http://www.mysqlfaqs.net/mysql-faqs/SQL-Statements/Select-Statement/How-does-DISTINCT-work-in-MySQL
So you could make this query:
SELECT DISTINCT x,y,z FROM RandomTable WHERE x = something
Which will only return one row for each unique x,y,z combination.

SQL Summing Multiple Joins

I shortened the code quite a bit, but hopefully someone will get the idea of what I am trying to do. I need to sum totals from two different selects, so I tried putting each of them in LEFT OUTER JOINs (tried INNER JOINs too). If I run with either LEFT OUTER JOIN commented out, I get the correct data, but when I run them together, I get really screwed-up counts. So I know joins are probably not the correct approach to summing data from the same table; I can't simply do it in a WHERE clause because there is another table involved in the code I commented out.
I guess I am trying to sum together 2 different queries.
SELECT eeoc.EEOCode AS 'Test1',
SUM(eeosum.Col_One) AS 'Col_One',
FROM EEO1Analysis eeo
LEFT OUTER JOIN (
SELECT eeor.AnalysisID, eeor.Test1,
SUM(CASE eeor.ZZZ WHEN 1 THEN (CASE eeor.AAAA WHEN 1 THEN 1 ELSE 0 END) ELSE 0 END) AS 'Col_One',
FROM EEO1Roster eeor
..........
WHERE eeor.AnalysisID = 7
GROUP BY eeor.AnalysisID, eeor.EEOCode
) AS eeosum2 ON eeosum2.AnalysisID = eeo.AnalysisID
LEFT OUTER JOIN (
SELECT eeor.AnalysisID, eeor.Test1,
SUM(CASE eeor.ZZZ WHEN 1 THEN (CASE eeor.AAAA WHEN 1 THEN 1 ELSE 0 END) ELSE 0 END) AS 'Col_One',
FROM EEO1Roster eeor
........
) AS eeosum ON eeosum.AnalysisID = eeo.AnalysisID
WHERE eeo.AnalysisID = 7
GROUP BY eeoc.Test1
You could UNION ALL the 2 queries and then do a SUM + GROUP BY i.e.
SELECT Col1, Col2, SUM(Col_One) FROM
(SELECT Col1, Col2, SUM(Col_One) AS Col_One
FROM Table1
WHERE <Conditionset1>
GROUP BY Col1, Col2
UNION ALL
SELECT Col1, Col2, SUM(Col_One) AS Col_One
FROM Table1
WHERE <Conditionset2>
GROUP BY Col1, Col2) AS T
GROUP BY
Col1, Col2
Of course, if any row is returned by both condition sets, it will be double counted.
What about
SELECT ... FROM EEO1Analysis eeo,
(SELECT ... LEFT OUTER JOIN ... GROUP BY ... ) AS data
...
?
And, if you can, I'd recommend preparing the data into separate tables, then operating on them with different analysis IDs. That could save some execution time, at least.
Need to sum totals from two different selects
If you expect a one-row, single-column result, this is enough:
SELECT
((SELECT SUM(...) FROM ... GROUP BY...) +
(SELECT SUM(...) FROM ... GROUP BY...)) as TheSumOfTwoSums