How can I stop joins from adding rows in my match query? - sql

I'm having difficulty translating what I want into functional programming, since I think imperatively. Basically, I have a table of forms, and a table of expectations. In the Expectation view, I want it to look through the forms table and tell me if each one found a match. However, when I try to use joins to accomplish this, the joins are adding rows to the Expectation table when two or more forms match. I do not want this.
In an imperative fashion, I want the equivalent of this:
ForEach (row in Expectation table)
{
if (any form in the Form table matches the criteria)
{
MatchID = form.ID;
SignDate = form.SignDate;
...
}
}
What I have in SQL is this:
SELECT
e.*, match.ID, match.SignDate, ...
FROM
POFDExpectation e LEFT OUTER JOIN
(SELECT MIN(ID) as MatchID, MIN(SignDate) as MatchSignDate,
COUNT(*) as MatchCount, ...
FROM Form f
GROUP BY (matching criteria columns)
) match
ON (form.[match criteria] = expectation.[match criteria])
Which works okay, but very slowly, and every time there are TWO matches, a row is added to the Expectation results. Mathematically I understand that a join is a cross multiply and this is expected, but I'm unsure how to do this without them. Subquery perhaps?
I'm not able to give too many further details about the implementation, but I'll be happy to try any suggestion and respond with the results. I have 880 Expectation rows, and 942 results being returned. If I only allow results that match one form, I get 831 results. Neither is desirable, so if yours gets me to exactly 880, yours is the accepted answer.
Edit: I am using SQL Server 2008 R2, though a generic solution would be best.
Sample code:
--DROP VIEW ExpectationView; DROP TABLE Forms; DROP TABLE Expectations;
--Create Tables and View
CREATE TABLE Forms (ID int IDENTITY(1,1) PRIMARY KEY, ReportYear int, Name varchar(100), Complete bit, SignDate datetime)
GO
CREATE TABLE Expectations (ID int IDENTITY(1,1) PRIMARY KEY, ReportYear int, Name varchar(100))
GO
CREATE VIEW ExpectationView AS select e.*, filed.MatchID, filed.SignDate, ISNULL(filed.FiledCount, 0) as FiledCount, ISNULL(name.NameCount, 0) as NameCount from Expectations e LEFT OUTER JOIN
(select MIN(ID) as MatchID, ReportYear, Name, Complete, Min(SignDate) as SignDate, COUNT(*) as FiledCount from Forms f GROUP BY ReportYear, Name, Complete) filed
on filed.ReportYear = e.ReportYear AND filed.Name like '%'+e.Name+'%' AND filed.Complete = 1 LEFT OUTER JOIN
(select MIN(ID) as MatchID, ReportYear, Name, COUNT(*) as NameCount from Forms f GROUP BY ReportYear, Name) name
on name.ReportYear = e.ReportYear AND name.Name like '%'+e.Name+'%'
GO
--Insert Test Data
INSERT INTO Forms (ReportYear, Name, Complete, SignDate)
SELECT 2011, 'Bob Smith', 1, '2012-03-01' UNION ALL
SELECT 2011, 'Bob Jones', 1, '2012-10-04' UNION ALL
SELECT 2011, 'Bob', 1, '2012-07-20'
GO
INSERT INTO Expectations (ReportYear, Name)
SELECT 2011, 'Bob'
GO
SELECT * FROM ExpectationView --Should only return 1 result, returns 9
The 'filed' subquery shows that they have completed a form; 'name' shows that they may have started one but not finished it. My view has four different 'match criteria', each a little more strict, and counts each: 'Name Only Matches', 'Loose Matches', 'Matches' (default), and 'Tight Matches' (used if there is more than one default match).

This is how I do it when I want to keep to a JOIN-type query format:
SELECT
e.*,
match.MatchID,
match.MatchSignDate,
...
FROM POFDExpectation e
OUTER APPLY (
SELECT TOP 1
MIN(ID) as MatchID,
MIN(SignDate) as MatchSignDate,
COUNT(*) as MatchCount,
...
FROM Form f
WHERE f.[match criteria] = e.[match criteria]
GROUP BY ID (matching criteria columns)
-- Add ORDER BY here to control which row is TOP 1
) match
It usually performs better as well.
Semantically, {CROSS|OUTER} APPLY (table-expression) specifies a table-expression that is called once for each row in the preceding table expressions of the FROM clause and then joined to them. Pragmatically, however, the compiler treats it almost identically to a JOIN.
The practical difference is that unlike a JOIN table-expression, the APPLY table-expression is dynamically re-evaluated for each row. So instead of an ON clause, it relies on its own logic and WHERE clauses to limit/match its rows to the preceding table-expressions. This also allows it to make reference to the column-values of the preceding table-expressions, inside its own internal subquery expression. (This is not possible in a JOIN)
The reason that we want this here, instead of a JOIN, is that we need a TOP 1 in the sub-query to limit its returned rows. That, however, means that we need to move the ON clause conditions to the internal WHERE clause so that they are applied before the TOP 1 is evaluated. And that means that we need an APPLY here, instead of the more usual JOIN.
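The "one match per outer row" idea can be tried out with the question's sample data. The following is only a sketch: it runs on SQLite through Python's sqlite3 module, and SQLite has no OUTER APPLY, so correlated scalar subqueries with LIMIT 1 stand in for OUTER APPLY ... TOP 1. The table and column names come from the question's sample script; picking the earliest SignDate as the winning match is an assumption made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Forms (ID INTEGER PRIMARY KEY, ReportYear INT, Name TEXT,
                    Complete INT, SignDate TEXT);
CREATE TABLE Expectations (ID INTEGER PRIMARY KEY, ReportYear INT, Name TEXT);
INSERT INTO Forms (ReportYear, Name, Complete, SignDate) VALUES
    (2011, 'Bob Smith', 1, '2012-03-01'),
    (2011, 'Bob Jones', 1, '2012-10-04'),
    (2011, 'Bob',       1, '2012-07-20');
INSERT INTO Expectations (ReportYear, Name) VALUES (2011, 'Bob');
""")

# Joining to a grouped derived table: one output row per matching group,
# so a single expectation fans out into several rows.
joined = conn.execute("""
SELECT e.ID, filed.MatchID
FROM Expectations e
LEFT JOIN (SELECT MIN(ID) AS MatchID, ReportYear, Name
           FROM Forms
           WHERE Complete = 1
           GROUP BY ReportYear, Name) filed
  ON filed.ReportYear = e.ReportYear
 AND filed.Name LIKE '%' || e.Name || '%'
""").fetchall()

# Correlated subqueries: the match conditions live inside the subquery,
# and each subquery yields at most one value per Expectations row.
per_row = conn.execute("""
SELECT e.ID,
       (SELECT f.ID FROM Forms f
         WHERE f.ReportYear = e.ReportYear
           AND f.Name LIKE '%' || e.Name || '%'
           AND f.Complete = 1
         ORDER BY f.SignDate          -- assumed rule for "which match wins"
         LIMIT 1) AS MatchID,
       (SELECT COUNT(*) FROM Forms f
         WHERE f.ReportYear = e.ReportYear
           AND f.Name LIKE '%' || e.Name || '%'
           AND f.Complete = 1) AS FiledCount
FROM Expectations e
""").fetchall()

print(len(joined), per_row)
```

With this data the join returns three rows for the single expectation, while the correlated form returns exactly one row that still carries the match count.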

@RBarryYoung answered the question as I asked it, but there was a second question that I didn't make very clear. What I really wanted was a combination of his answer and this question, so for the record here's what I used:
SELECT
e.*,
...
match.MatchID,
match.MatchSignDate,
match.MatchCount
FROM
POFDExpectation e
OUTER APPLY (
SELECT TOP 1
ID as MatchID,
ReportYear,
...
SignDate as MatchSignDate,
COUNT(*) OVER () as MatchCount
FROM
Form f
WHERE
f.[match criteria] = e.[match criteria]
-- Add ORDER BY here to control which row is TOP 1
) match

Related

SQL Server ISNULL multiple columns

I have the following query which works great but how do I add multiple columns in its select statement? Following is the query:
SELECT ISNULL(
(SELECT DISTINCT a.DatasourceID
FROM [Table1] a
WHERE a.DatasourceID = 5 AND a.AgencyID = 4 AND a.AccountingMonth = 201907), NULL) TEST
So currently I only get one column (TEST) but would like to add other columns such as DataSourceID, AgencyID and AccountingMonth.
If you want to output a row when the condition (the requested values) is met and also a row when it is not,
you can build a pseudo-table of the requested values in the FROM clause and left outer join it to your Table1.
SELECT ISNULL(Table1.DatasourceId, 999999),
Table1.AgencyId,
Table1.AccountingMonth,
COUNT(*) as count
FROM ( VALUES (5, 4, 201907 ),
(6, 4, 201907 ))
AS requested(DatasourceId, AgencyId, AccountingMonth)
LEFT OUTER JOIN Table1 ON requested.agencyid=Table1.AgencyId
AND requested.datasourceid = Table1.DatasourceId
AND requested.AccountingMonth = Table1.AccountingMonth
GROUP BY Table1.DatasourceId, Table1.AgencyId, Table1.AccountingMonth
Note that:
I have put an ISNULL on the first column, like you did, to output a particular value (999999) when no value is found.
I did not put the ISNULL(...,NULL) like your query in the other columns since IMHO it is not necessary: if there is no value, a null will be output anyway.
I added a COUNT(*) column to illustrate an aggregate, you could use another (SUM, MIN, MAX) or none if you do not need it.
The set of requested values is provided as a constant table values (see https://learn.microsoft.com/en-us/sql/t-sql/queries/table-value-constructor-transact-sql?view=sql-server-2017)
I have added multiple rows for requested conditions : you can request for multiple datasources, agencies or months in one query with one line for each in the output.
If you want only one row, put only one row in "requested" pseudo table values.
There must be a GROUP BY even if you do not want to use an aggregate (COUNT, SUM or other): it gives the same behavior as your DISTINCT clause and restricts the output to a single line for each requested value.
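A runnable variation of this idea, offered as a sketch rather than as the answer's exact query: it uses SQLite through Python's sqlite3 module, declares the requested values in a CTE (SQLite does not accept a column list on a derived-table alias), groups by the requested columns so that every requested combination keeps its own output row, and counts only matched Table1 rows. The sample rows in Table1 and the column name MatchCount are made up for the demonstration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Table1 (DatasourceID INT, AgencyID INT, AccountingMonth INT);
-- Hypothetical sample rows: two matches for datasource 5, none for 6.
INSERT INTO Table1 VALUES (5, 4, 201907), (5, 4, 201907), (7, 4, 201907);
""")

rows = conn.execute("""
WITH requested(DatasourceID, AgencyID, AccountingMonth) AS (
    VALUES (5, 4, 201907),
           (6, 4, 201907)
)
SELECT r.DatasourceID, r.AgencyID, r.AccountingMonth,
       COUNT(t.DatasourceID) AS MatchCount   -- counts matched rows only
FROM requested r
LEFT JOIN Table1 t
  ON t.DatasourceID    = r.DatasourceID
 AND t.AgencyID        = r.AgencyID
 AND t.AccountingMonth = r.AccountingMonth
GROUP BY r.DatasourceID, r.AgencyID, r.AccountingMonth
ORDER BY r.DatasourceID
""").fetchall()

print(rows)
```

Counting a Table1 column instead of COUNT(*) makes an unmatched request report 0 rather than 1, because the left join still produces one all-NULL row for it.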
To me it seems that you want to see whether data exists. I guess that your AgencyID is a foreign key to an Agency table, DataSourceID likewise to a Datasource table, and that you have an AccountingMonth table that holds all accounting periods:
SELECT ds.ID as DataSourceID , ag.ID as AgencyID , am.ID as AccountingMonth ,
ISNULL(COUNT(a.DataSourceID),0) as Count
FROM [Table1] a
RIGHT JOIN [Datasource] ds ON ds.ID = a.DataSourceID
RIGHT JOIN [Agency] ag ON ag.ID = a.AgencyID
RIGHT JOIN [AccountingMonth] am on am.ID = a.AccountingMonth
GROUP BY ds.ID, ag.ID, am.ID
This way you can see the count of records per GROUP BY criterion. Notice the RIGHT JOIN: you must use RIGHT JOIN if you want to include all records from the "right" table.
In your query, DISTINCT a.DatasourceID combined with WHERE a.DatasourceID = 5 returns 5 if the table contains rows that match your WHERE criteria, and NULL if it does not. If you removed a.DatasourceID = 5 from the WHERE clause, the query could fail with an error, because a scalar subquery may not return more than one row.
The way you are doing it only allows for one column and one record, and gives it the name TEST. It does not look like you really need to test for NULL, because you are returning NULL anyway, so that does nothing to help you. Remove all the NULL testing and return a full record set; DISTINCT would also limit your returns to one record. When working with a single table you don't need an alias, and if there are no spaces or keywords, bracketed identifiers are not required. If you need to see whether you have an empty record set, test for it in the calling program.
SELECT DatasourceID, AgencyID,AccountingMonth
FROM Table1
WHERE DatasourceID = 5 AND AgencyID = 4 AND AccountingMonth = 201907

Ensuring two columns only contain valid results from same subquery

I have the following table:
id symbol_01 symbol_02
1 abc xyz
2 kjh okd
3 que qid
I need a query that ensures symbol_01 and symbol_02 are both contained in a list of valid symbols. In other words, I would need something like this:
select *
from mytable
where symbol_01 in (
select valid_symbols
from somewhere)
and symbol_02 in (
select valid_symbols
from somewhere)
The above example would work correctly, but the subquery used to determine the list of valid symbols is identical both times and is quite large. It would be very inefficient to run it twice like in the example.
Is there a way to do this without duplicating two identical sub queries?
Another approach:
select *
from mytable t1
where 2 = (select count(distinct symbol)
from valid_symbols vs
where vs.symbol in (t1.symbol_01, t1.symbol_02));
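This assumes that the valid symbols are stored in a table valid_symbols that has a column named symbol. The query would also benefit from an index on valid_symbols.symbol. Note that a row whose symbol_01 and symbol_02 hold the same value is never returned, because the distinct count is then at most 1.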
You could try using a CTE like this:
WITH ValidSymbols AS (
SELECT DISTINCT valid_symbol
FROM somewhere
)
SELECT mt.*
FROM MyTable mt
INNER JOIN ValidSymbols v1
ON mt.symbol_01 = v1.valid_symbol
INNER JOIN ValidSymbols v2
ON mt.symbol_02 = v2.valid_symbol
From a performance perspective, your query is the right way to do this. I would write it as:
select *
from mytable t
where exists (select 1
from valid_symbols vs
where t.symbol_01 = vs.valid_symbol
) and
exists (select 1
from valid_symbols vs
where t.symbol_02 = vs.valid_symbol
) ;
The important component is that you need an index on valid_symbols(valid_symbol). With this index, the lookup should be pretty fast. Appropriate indexes can even work if valid_symbols is a view, although the effect depends on the complexity of the view.
You seem to have a situation where you have two foreign key relationships. If you explicitly declare these relationships, then the database will enforce that the columns in your table match the valid symbols.
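The EXISTS form with an index on the lookup column can be tried out as follows. This is a minimal sketch on SQLite via Python's sqlite3 module; the table valid_symbols(valid_symbol) follows the answer above, while the index name and the choice of which symbols count as valid are assumptions made for the demonstration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE valid_symbols (valid_symbol TEXT);
CREATE INDEX ix_valid_symbols ON valid_symbols (valid_symbol);
CREATE TABLE mytable (id INT, symbol_01 TEXT, symbol_02 TEXT);

-- Assumed list of valid symbols: 'okd' is deliberately left out.
INSERT INTO valid_symbols VALUES ('abc'), ('xyz'), ('kjh'), ('que'), ('qid');
INSERT INTO mytable VALUES (1, 'abc', 'xyz'), (2, 'kjh', 'okd'), (3, 'que', 'qid');
""")

# Each EXISTS is an index lookup on valid_symbols(valid_symbol).
valid_ids = [row[0] for row in conn.execute("""
SELECT t.id
FROM mytable t
WHERE EXISTS (SELECT 1 FROM valid_symbols vs
              WHERE vs.valid_symbol = t.symbol_01)
  AND EXISTS (SELECT 1 FROM valid_symbols vs
              WHERE vs.valid_symbol = t.symbol_02)
ORDER BY t.id
""")]

print(valid_ids)
```

Row 2 drops out because only one of its two symbols is in the list.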

SQL query to find rows with the most matching keywords

I'm really bad at SQL and I would like to know what SQL I can run to solve the problem below, which I suspect to be an NP-complete problem, but I'm OK with the query taking a long time to run over large datasets as this will be done as a background task. A standard SQL statement is preferred, but if a stored procedure is required then so be it. The SQL is required to run on Postgres 9.3.
Problem: Given a set of articles that contain a set of keywords, find the top n articles for each article that contains the most number of matching keywords.
A trimmed down version of the article table looks like this:
CREATE TABLE article (
id character varying(36) NOT NULL, -- primary key of article
keywords character varying, -- comma separated set of keywords
CONSTRAINT pk_article PRIMARY KEY (id)
);
-- Test Data
INSERT INTO article(id, keywords) VALUES(0, 'red,green,blue');
INSERT INTO article(id, keywords) VALUES(1, 'red,green,yellow');
INSERT INTO article(id, keywords) VALUES(2, 'purple,orange,blue');
INSERT INTO article(id, keywords) VALUES(3, 'lime,violet,ruby,teal');
INSERT INTO article(id, keywords) VALUES(4, 'red,green,blue,yellow');
INSERT INTO article(id, keywords) VALUES(5, 'yellow,brown,black');
INSERT INTO article(id, keywords) VALUES(6, 'black,white,blue');
Which would result in this for a SELECT * FROM article; query:
Table: article
------------------------
id keywords
------------------------
0 red,green,blue
1 red,green,yellow
2 purple,orange,blue
3 lime,violet,ruby,teal
4 red,green,blue,yellow
5 yellow,brown,black
6 black,white,blue
Assuming I want to find the top 3 articles for each article that contains the most number of matching keywords then the output should be this:
------------------------
id related
------------------------
0 4,1,6
1 4,0,5
2 0,4,6
3 null
4 0,1,6
5 1,6
6 5,0,4
Like @a_horse commented: This would be simpler with a normalized design (besides making other tasks simpler/cleaner), but still not trivial.
Also, a PK column of data type character varying(36) is highly suspicious (and inefficient) and should most probably be an integer type or at least a uuid instead.
Here is one possible solution based on your design as is:
WITH cte AS (
SELECT id, string_to_array(a.keywords, ',') AS keys
FROM article a
)
SELECT id, string_agg(b_id, ',') AS best_matches
FROM (
SELECT a.id, b.id AS b_id
, row_number() OVER (PARTITION BY a.id ORDER BY ct.ct DESC, b.id) AS rn
FROM cte a
LEFT JOIN cte b ON a.id <> b.id AND a.keys && b.keys
LEFT JOIN LATERAL (
SELECT count(*) AS ct
FROM (
SELECT * FROM unnest(a.keys)
INTERSECT ALL
SELECT * FROM unnest(b.keys)
) i
) ct ON TRUE
ORDER BY a.id, ct.ct DESC, b.id -- b.id as tiebreaker
) sub
WHERE rn < 4
GROUP BY 1;
sqlfiddle (using an integer id instead).
The CTE cte converts the string into an array. You could even have a functional GIN index like that ...
If multiple rows tie for the top 3 picks, you need to define a tiebreaker. In my example, rows with smaller id come first.
Detailed explanation in this recent related answer:
Query and order by number of matches in JSON array
The comparison is between a JSON array and an SQL array, but it's basically the same problem and boils down to the same solution(s). That answer also compares a couple of similar alternatives.
To make this fast, you should at least have a GIN index on the array column (instead of the comma-separated string) and the query wouldn't need the CTE step. A completely normalized design has other advantages, but won't necessarily be faster than an array with GIN index.
You can store lists in comma-separated strings. No problem, as long as this is just a string for you and you are not interested in its separate values. As soon as you are interested in the separate values, as in your example, store them separately.
This said, correct your database design and only then think about the query.
The following query assumes a normalized table keywords with one row per article id and keyword. It selects all ID pairs first and counts common keywords. It then ranks the pairs, giving the other ID with the most keywords in common rank #1, and so on. Then you keep only the three best matching IDs. STRING_AGG lists the best matching IDs in a string ordered by the number of keywords in common.
select
this_article as id,
string_agg(other_article, ',' order by rn) as related
from
(
select
this_article,
other_article,
row_number() over (partition by this_article order by cnt_common desc) as rn
from
(
select
this.id as this_article,
other.id as other_article,
count(other.id) as cnt_common
from keywords this
left join keywords other on other.keyword = this.keyword and other.id <> this.id
group by this.id, other.id
) pairs
) ranked
where rn <= 3
group by this_article
order by this_article;
Here is the SQL fiddle: http://sqlfiddle.com/#!15/1d20c/9.
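For readers without access to the fiddle, the normalized approach can be reproduced locally. The following is a sketch under stated assumptions: it runs on SQLite (3.25 or later, for window functions) through Python's sqlite3 module rather than on Postgres, it uses integer ids, it adds the other article's id as a tiebreaker, and it assembles the related lists in Python instead of using STRING_AGG. The keywords(id, keyword) layout is inferred from the query above.

```python
import sqlite3
from collections import defaultdict

# Test data from the question, keyed by article id.
articles = {
    0: "red,green,blue",
    1: "red,green,yellow",
    2: "purple,orange,blue",
    3: "lime,violet,ruby,teal",
    4: "red,green,blue,yellow",
    5: "yellow,brown,black",
    6: "black,white,blue",
}

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE keywords (id INTEGER, keyword TEXT, PRIMARY KEY (id, keyword))"
)
# Normalize: one row per (article, keyword).
conn.executemany(
    "INSERT INTO keywords VALUES (?, ?)",
    [(aid, kw) for aid, kws in articles.items() for kw in kws.split(",")],
)

ranked = conn.execute("""
SELECT this_article, other_article
FROM (
    SELECT this_article, other_article,
           ROW_NUMBER() OVER (PARTITION BY this_article
                              ORDER BY cnt_common DESC, other_article) AS rn
    FROM (
        SELECT a.id AS this_article, b.id AS other_article,
               COUNT(*) AS cnt_common
        FROM keywords a
        JOIN keywords b ON b.keyword = a.keyword AND b.id <> a.id
        GROUP BY a.id, b.id
    ) AS pairs
) AS ranked
WHERE rn <= 3
ORDER BY this_article, rn
""").fetchall()

# Collect the top three related ids per article, best match first.
related = defaultdict(list)
for this_article, other_article in ranked:
    related[this_article].append(other_article)

print(dict(related))
```

Because this sketch uses an inner join, an article with no keyword in common with any other (id 3 here) simply does not appear. With the smaller-id tiebreaker, article 0 gets 4, 1, 2 rather than the 4, 1, 6 shown in the question, since articles 2 and 6 tie on one shared keyword.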

SQL IN query produces strange result

Please see the table structure below:
CREATE TABLE Person (id int not null, PID INT NOT NULL, Name VARCHAR(50))
CREATE TABLE [Order] (OID INT NOT NULL, PID INT NOT NULL)
INSERT INTO Person VALUES (1,1,'Ian')
INSERT INTO Person VALUES (2,2,'Maria')
INSERT INTO [Order] values (1,1)
Why does the following query return two results:
select * from Person WHERE id IN (SELECT ID FROM [Order])
ID does not exist in Order. Why does the query above produce results? I would expect it to error because ID does not exist in Order.
This behavior, while unintuitive, is very well defined in Microsoft's Knowledge Base:
KB #298674 : PRB: Subquery Resolves Names of Column to Outer Tables
From that article:
To illustrate the behavior, use the following two table structures and query:
CREATE TABLE X1 (ColA INT, ColB INT)
CREATE TABLE X2 (ColC INT, ColD INT)
SELECT ColA FROM X1 WHERE ColA IN (Select ColB FROM X2)
The query returns a result where the column ColB is considered from table X1.
By qualifying the column name, the error message occurs as illustrated by the following query:
SELECT ColA FROM X1 WHERE ColA in (Select X2.ColB FROM X2)
Server: Msg 207, Level 16, State 3, Line 1
Invalid column name 'ColB'.
Folks have been complaining about this issue for years, but Microsoft isn't going to fix it. It is, after all, complying with the standard, which essentially states:
If you don't find column x in the current scope, traverse to the next outer scope, and so on, until you find a reference.
More information in the following Connect "bugs" along with multiple official confirmations that this behavior is by design and is not going to change (so you'll have to change yours - i.e. always use aliases):
Connect #338468 : CTE Column Name resolution in Sub Query is not validated
Connect #735178 : T-SQL subquery not working in some cases when IN operator used
Connect #302281 : Non-existent column causes subquery to be ignored
Connect #772612 : Alias error not being reported when within an IN operator
Connect #265772 : Bug using sub select
In your case, this "error" will probably be much less likely to occur if you use more meaningful names than ID, OID and PID. Does Order.PID point to Person.id or Person.PID? Design your tables so that people can figure out the relationships without having to ask you. A PersonID should always be a PersonID, no matter where in the schema it is; same with an OrderID. Saving a few characters of typing is not a good price to pay for a completely ambiguous schema.
You could write an EXISTS clause instead:
... FROM dbo.Person AS p WHERE EXISTS
(
SELECT 1 FROM dbo.[Order] AS o
WHERE o.PID = p.id -- or is it PID? See why it pays to be explicit?
);
The problem here is that you're not using Table.Column notation in your subquery, table Order doesn't have column ID and ID in subquery really means Person.ID, not [Order].ID. That's why I always insist on using aliases for tables in production code. Compare these two queries:
select * from Person WHERE id IN (SELECT ID FROM [Order]);
select * from Person as p WHERE p.id IN (SELECT o.ID FROM [Order] as o)
The first one will execute but will return incorrect results, and the second one will raise an error. It's because the outer query's columns may be referenced in a subquery, so in this case you can use Person columns inside the subquery.
Perhaps you wanted to use the query like this:
select * from Person WHERE pid IN (SELECT PID FROM [Order])
But you never know when the schema of the [Order] table changes, and if somebody drops the column PID from [Order] then your query will return all rows from the table Person. Therefore, use aliases:
select * from Person as P WHERE P.pid IN (SELECT O.PID FROM [Order] as O)
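The name-resolution behaviour is easy to reproduce. The sketch below uses SQLite through Python's sqlite3 module (an assumption of convenience; the question is about SQL Server) with the question's two tables. The unqualified query runs and returns every person, while the alias-qualified one is rejected when it is prepared.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Person (id INT NOT NULL, PID INT NOT NULL, Name TEXT);
CREATE TABLE "Order" (OID INT NOT NULL, PID INT NOT NULL);
INSERT INTO Person VALUES (1, 1, 'Ian'), (2, 2, 'Maria');
INSERT INTO "Order" VALUES (1, 1);
""")

# "Order" has no ID column, so ID silently resolves to the outer Person.id
# and the predicate becomes "Person.id IN (Person.id)" for every row.
unqualified = conn.execute(
    'SELECT Name FROM Person WHERE id IN (SELECT ID FROM "Order") ORDER BY Name'
).fetchall()

# Qualifying the column with the subquery's alias turns the mistake
# into an error instead of a wrong result.
try:
    conn.execute(
        'SELECT p.Name FROM Person AS p '
        'WHERE p.id IN (SELECT o.ID FROM "Order" AS o)'
    )
    qualified_error = None
except sqlite3.OperationalError as exc:
    qualified_error = str(exc)

print(unqualified, qualified_error)
```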
Just quick note - this is not SQL Server specific behaviour, it's standard SQL:
SQL Server demo
PostgreSQL demo
MySQL demo
Oracle demo
The Order table doesn't have an id column.
Try these instead:
select * from Person WHERE id IN (SELECT OID FROM [Order])
OR
select * from Person WHERE pid IN (SELECT PID FROM [Order])

Alternative to NOT IN()

I have a table with 14,028 rows from November 2012. I also have a table with 13,959 rows from March 2013. I am using a simple NOT IN() clause to see who has left:
select * from nov_2012 where id not in(select id from mar_2013)
This returned 396 rows and I never thought anything of it, until I went to analyze who left. When I pulled all the ids for the lost members and put them in a temp table (##lost), 32 of them were actually still in the mar_2013 table. I can pull them up when I search for their ids using the following:
select * from mar_2013 where id in(select id from ##lost)
I can't figure out what is going on. I will mention that the id field I created is an IDENTITY column. Could that have any effect on the matching using NOT IN? Is there a better way to check for missing rows between tables? I have also tried:
select a.* from nov_2012 a left join mar_2013 b on b.id = a.id where b.id is NULL
And received the same results.
This is how I created the identity field;
create table id_lookup( dateofcusttable date ,sin int ,sex varchar(12) ,scid int identity(777000,1))
insert into id_lookup (sin, sex) select distinct sin, sex from [Client Raw].dbo.cust20130331 where sin <> 0 order by sin, sex
This is how I added the scid into the march table:
select scid, rowno as custrowno
into scid_20130331
from [Client Raw].dbo.cust20130331 cust
left join id_lookup scid
on scid.sin = cust.sin
and scid.sex = cust.sex
update scid_20130331
set scid = custrowno where scid is NULL --for members who don't have more than one id or sin information is not available
drop table Account_Part2_Current
select a.*, scid
into Account_Part2_Current
from Account_Part1_Current a
left join scid_20130331 b
on b.custrowno = a.rowno_custdmd_cust
I then group all the information by the scid
I would prefer this form (and here's why):
SELECT a.id --, other columns
FROM dbo.nov_2012 AS a
WHERE NOT EXISTS (SELECT 1 FROM dbo.mar_2013 WHERE id = a.id);
However this should still give the same results as what you've tried, so I suspect there is something about the data model that you're not telling us - for example, is mar_2013.id nullable?
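One well-known difference between the two forms, and a likely reason for asking about nullability, is how they treat a NULL on the subquery side. The sketch below is an illustration on SQLite via Python's sqlite3 module with made-up rows, not a diagnosis of the data in the question: a single NULL in mar_2013.id makes NOT IN return nothing, while NOT EXISTS still returns the missing ids.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE nov_2012 (id INT);
CREATE TABLE mar_2013 (id INT);
-- Made-up rows: ids 2 and 3 have left, and mar_2013 contains one NULL id.
INSERT INTO nov_2012 VALUES (1), (2), (3);
INSERT INTO mar_2013 VALUES (1), (NULL);
""")

# "id NOT IN (1, NULL)" is never TRUE: it is FALSE for 1 and UNKNOWN otherwise.
not_in = conn.execute("""
SELECT id FROM nov_2012
WHERE id NOT IN (SELECT id FROM mar_2013)
ORDER BY id
""").fetchall()

# NOT EXISTS only asks whether a matching row exists, so the NULL is harmless.
not_exists = conn.execute("""
SELECT a.id FROM nov_2012 AS a
WHERE NOT EXISTS (SELECT 1 FROM mar_2013 AS b WHERE b.id = a.id)
ORDER BY a.id
""").fetchall()

print(not_in, not_exists)
```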
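This is logically equivalent to NOT IN (as long as no NULLs are involved) and can be faster than NOT IN. MINUS is Oracle syntax; SQL Server uses EXCEPT for the same operation.
where yourfield in
(select afield
from somewhere
except
select thesamefield
from somewhere
where you want to exclude the record
)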
It probably isn't as fast as using where not exists, as per Aaron's answer so you should only use it if not exists does not provide the results you want.