Related
I'm new to SQL so be gentle! I'm working for a charity in London so I'm having a stab at SQL and I've gotten stuck with something you can probably all do with your eyes shut.
I've got a table with rows that I want to update:
It's a table showing up potential duplicate records in our database. There's a stored procedure that creates a criterion field from data on a customer account and if criterion fields from 2 records match, they are flagged as potential duplicates.
We have some known duplicates that we had to create a while ago that have Placeholder as the last name.
The criterion field matches other 'real' records with real last names that we want to keep.
What I want to do is:
update the status of the Placeholder ones to Delete; and the 'real' ones to Keep (even I can totally blitz that!)
update the keep ID field of the Placeholder ones so it's the customer_no field of the row with the matching criterion field
Once that's done, another stored procedure will take it from there.
Help!
Hi Welcome to Stack Overflow.
If I have understood you correctly you need something like this:
declare #customers table(
customer_no int NOT NULL,
matching_criteria varchar(50) NOT NULL,
lname varchar(50) NOT NULL,
keepID int NULL,
[status] varchar(50) NULL);
INSERT INTO #customers VALUES(1, 'a', 'Smith', null, null);
INSERT INTO #customers VALUES(2, 'a', 'Placeholder', null, null);
INSERT INTO #customers VALUES(3, 'b', 'Jones', null, null);
INSERT INTO #customers VALUES(4, 'b', 'Placeholder', null, null);
INSERT INTO #customers VALUES(5, 'c', 'Brown', null, null);
INSERT INTO #customers VALUES(6, 'c', 'Placeholder', null, null);
INSERT INTO #customers VALUES(7, 'd', 'Williams', null, null);
INSERT INTO #customers VALUES(8, 'd', 'Placeholder', null, null);
INSERT INTO #customers VALUES(9, 'e', 'Taylor', null, null);
INSERT INTO #customers VALUES(10, 'e', 'Placeholder', null, null);
SELECT * FROM #customers;
UPDATE #customers set [status] = case lname WHEN 'Placeholder' THEN 'DELETE' ELSE 'Keep' END;
SELECT * FROM #customers;
UPDATE k
SET keepID = d.customer_no
FROM
#customers k
INNER JOIN #customers d
on k.matching_criteria = d.matching_criteria
WHERE d.[status] = 'Delete' AND k.[status] = 'Keep';
SELECT * FROM #customers;
If you are using SQL Server you can paste this in a Query window and observe the results.
Please for future reference tell us which DB you are using, as the answers do vary from db to db. The above is specifically SQL Server. Whilst most dbs will behave similarly, some do not. The above updates on a join. Oracle, for example, does not allow this, so the syntax when using Oracle is different - Oracle can of course achieve the same thing, it just uses different SQL to do so.
I have been struggling to get my head around how SQL Server full text search ranks my results.
Consider the following FREETEXTTABLE search:
DECLARE #SearchTerm varchar(55) = 'Peter Alex'
SELECT ftt.[RANK], v.*
FROM FREETEXTTABLE (vMembersFTS, (Surname, FirstName, MiddleName, MemberRef, Passport), #SearchTerm) ftt
INNER JOIN vMembersFTS v ON v.ID = ftt.[KEY]
ORDER BY ftt.[RANK] DESC;
This returns the following results and rankings:
RANK ID MemberRef Passport FirstName MiddleName Surname Salutation
----- ---- ---------- ----------- ----------- ------------ ---------- ------------
18 2 AB-002 Pete Peters
18 9 AB-006 George Alex Mr Alex
18 13 AB-009 Peter David Alex Mr Alex
14 3 AB-003 Peter Alex Jones
As you may be able to tell from the results posted above, the last row, although having, what I consider, a good match on both 'Peter' and 'Alex', appears with a rank of only 14 where the result in the first row has only a single match on 'Peter' (admittedly the surname is 'Peters').
This is a contrived example, but goes some way to illustrate my frustrations and lack of knowledge.
I have spent quite a bit of time researching, but I am feeling a bit out of my depth now. I'm sure that I'm doing something stupid such as searching across multiple columns.
I welcome your help and support. Thanks in advance.
Thanks,
Kaine
(BTW I am using SQL Server 2012)
Here is the SQL you can use to repeat the test yourself:
-- Create the Contacts table.
CREATE TABLE dbo.Contacts
(
ID int NOT NULL PRIMARY KEY,
FirstName varchar(55) NULL,
MiddleName varchar(55) NULL,
Surname varchar(55) NOT NULL,
Salutation varchar(55) NULL,
Passport varchar(55) NULL
);
GO
-- Create the Members table.
CREATE TABLE dbo.Members
(
ContactsID int NOT NULL PRIMARY KEY,
MemberRef varchar(55) NOT NULL
);
GO
-- Create the FTS view.
CREATE VIEW dbo.vMembersFTS WITH SCHEMABINDING AS
SELECT c.ID,
m.MemberRef,
ISNULL(c.Passport, '') AS Passport,
ISNULL(c.FirstName, '') AS FirstName,
ISNULL(c.MiddleName, '') AS MiddleName,
c.Surname,
ISNULL(c.Salutation, '') AS Salutation
FROM dbo.Contacts c
INNER JOIN dbo.Members AS m ON m.ContactsID = c.ID
GO
-- Create the view index for FTS.
CREATE UNIQUE CLUSTERED INDEX IX_vMembersFTS_ID ON dbo.vMembersFTS (ID);
GO
-- Create the FTS catalogue and stop-list.
CREATE FULLTEXT CATALOG ContactsFTSCatalog WITH ACCENT_SENSITIVITY = OFF;
CREATE FULLTEXT STOPLIST ContactsSL FROM SYSTEM STOPLIST;
GO
-- Create the member full-text index.
CREATE FULLTEXT INDEX ON dbo.vMembersFTS
(Surname, Firstname, MiddleName, Salutation, MemberRef, Passport)
KEY INDEX IX_vMembersFTS_ID
ON ContactsFTSCatalog
WITH STOPLIST = ContactsSL;
GO
-- Insert some data.
INSERT INTO Contacts VALUES (1, 'John', NULL, 'Smith', NULL, NULL);
INSERT INTO Contacts VALUES (2, 'Pete', NULL, 'Peters', NULL, NULL);
INSERT INTO Contacts VALUES (3, 'Peter', 'Alex', 'Jones', NULL, NULL);
INSERT INTO Contacts VALUES (4, 'Philip', NULL, 'Smith', NULL, NULL);
INSERT INTO Contacts VALUES (5, 'Harry', NULL, 'Dukes', NULL, NULL);
INSERT INTO Contacts VALUES (6, 'Joe', NULL, 'Jones', NULL, NULL);
INSERT INTO Contacts VALUES (7, 'Alex', NULL, 'Phillips', 'Mr Phillips', NULL);
INSERT INTO Contacts VALUES (8, 'Alexander', NULL, 'Paul', 'Alex', NULL);
INSERT INTO Contacts VALUES (9, 'George', NULL, 'Alex', 'Mr Alex', NULL);
INSERT INTO Contacts VALUES (10, 'James', NULL, 'Castle', NULL, NULL);
INSERT INTO Contacts VALUES (11, 'John', NULL, 'Alexander', NULL, NULL);
INSERT INTO Contacts VALUES (12, 'Robert', NULL, 'James', 'Mr James', NULL);
INSERT INTO Contacts VALUES (13, 'Peter', 'David', 'Alex', 'Mr Alex', NULL);
INSERT INTO Members VALUES (1, 'AB-001');
INSERT INTO Members VALUES (2, 'AB-002');
INSERT INTO Members VALUES (3, 'AB-003');
INSERT INTO Members VALUES (5, 'AB-004');
INSERT INTO Members VALUES (8, 'AB-005');
INSERT INTO Members VALUES (9, 'AB-006');
INSERT INTO Members VALUES (11, 'AB-007');
INSERT INTO Members VALUES (12, 'AB-008');
INSERT INTO Members VALUES (13, 'AB-009');
-- Run the FTS query.
DECLARE #SearchTerm varchar(55) = 'Peter Alex'
SELECT ftt.[RANK], v.*
FROM FREETEXTTABLE (vMembersFTS, (Surname, FirstName, MiddleName, MemberRef, Passport), #SearchTerm) ftt
INNER JOIN vMembersFTS v ON v.ID = ftt.[KEY]
ORDER BY ftt.[RANK] DESC;
The rank is assigning based on the order in your query:
DECLARE #SearchTerm varchar(55) = 'Peter Alex'
SELECT ftt.[RANK], v.*
FROM FREETEXTTABLE (vMembersFTS, (Surname, FirstName, MiddleName, MemberRef, Passport), #SearchTerm) ftt
INNER JOIN vMembersFTS v ON v.ID = ftt.[KEY]
ORDER BY ftt.[RANK] DESC;
So in your case, a match on SurName trumps FirstName, and both trump MiddleName.
Your top 3 results have a rank of 18 as all three match on Surname. The last record has a rank of 14 for matching on FirstName and MiddleName but not SurName.
You can find details on the rank calculations here: https://technet.microsoft.com/en-us/library/ms142524(v=sql.105).aspx
If you want to allocate equal weight to these you can, but you'd have to use CONTAINSTABLE and not FREETEXTTABLE.
Info can be found here: https://technet.microsoft.com/en-us/library/ms189760(v=sql.105).aspx
This might not be a straightforward Firebird question, but I'm hoping there's a feature I'm unaware of that can help me beyond plain-vanilla SQL.
I have two tables. The first is a list of names of "critical parameters," and the second relates certain object IDs, critical parameter names, and critical parameter values:
CREATE TABLE CRITICALPARAMS
(
PARAM Varchar(32) NOT NULL,
INDX INTEGER NOT NULL,
CONSTRAINT PK_CRITICALPARAMS_1 PRIMARY KEY (PARAM),
CONSTRAINT UNQ_CRITICALPARAMS_1 UNIQUE (INDX)
);
CREATE TABLE CRITICALPARAMVALS
(
ID INTEGER NOT NULL,
PARAM Varchar(32) NOT NULL,
VAL Float NOT NULL,
CONSTRAINT PK_CRITICALPARAMVALS_1 PRIMARY KEY (ID,PARAM)
);
Let's suppose we have a four-dimensional space:
insert into CRITICALPARAMS values ('a', 1);
insert into CRITICALPARAMS values ('b', 2);
insert into CRITICALPARAMS values ('c', 3);
insert into CRITICALPARAMS values ('foo', 4);
...and a handful of objects in that space:
insert into CRITICALPARAMVALS values (1, 'a', 0.0);
insert into CRITICALPARAMVALS values (1, 'b', 0.0);
insert into CRITICALPARAMVALS values (1, 'c', 2.0);
insert into CRITICALPARAMVALS values (1, 'foo', 99.0);
insert into CRITICALPARAMVALS values (2, 'a', 0.0);
insert into CRITICALPARAMVALS values (2, 'b', 0.0);
insert into CRITICALPARAMVALS values (2, 'c', 2.0);
insert into CRITICALPARAMVALS values (2, 'foo', 99.0);
insert into CRITICALPARAMVALS values (3, 'a', 0.0);
insert into CRITICALPARAMVALS values (3, 'b', 0.0);
insert into CRITICALPARAMVALS values (3, 'c', 1.0);
insert into CRITICALPARAMVALS values (3, 'foo', 98.0);
insert into CRITICALPARAMVALS values (4, 'a', 0.0);
insert into CRITICALPARAMVALS values (4, 'b', 0.0);
insert into CRITICALPARAMVALS values (4, 'c', 1.0);
insert into CRITICALPARAMVALS values (4, 'foo', 98.0);
insert into CRITICALPARAMVALS values (5, 'a', 0.0);
insert into CRITICALPARAMVALS values (5, 'b', 0.0);
insert into CRITICALPARAMVALS values (5, 'c', 2.0);
insert into CRITICALPARAMVALS values (5, 'foo', 98.0);
The problem is to partition the critical parameter space, grouping together all object IDs that have the same parameter values. We can think of using a "seed" object ID, and asking what other IDs belong to the same partition as the seed object.
In our example, objects 1 and 2 form a partition, 3 and 4 form another, and 5 forms a third. All five objects are equal in the critical parameters a and b, but differ in parameters c and foo.
Is there any way to solve this using plain-vanilla SQL? How about a recursive CTE?
I've solved the problem crudely, using EXECUTE STATEMENT in a stored procedure, looping through the seed's critical parameter values and manually constructing a big SQL statement with as many WHERE clauses as critical parameters, but that solution doesn't scale as I go up to around 500-1000 critical parameters (or more!).
My current attempt has petered out at the following point -- I first define a view that can give me a partition along a single critical parameter (TEST_FLOAT_EQ is a selectable stored proc that compares two floats for 'good enough!' equality):
CREATE VIEW VGROUPIDBYPARAM (SEEDID, GROUPMEMBERID, CRITPARAMINDX)
AS
select a.id as seedid, b.id as groupmemberid, c.INDX as critparamindx
from CRITICALPARAMVALS a
join CRITICALPARAMVALS b on a.PARAM=b.param and (exists (select isequal from TEST_FLOAT_EQ(a.val, b.val, 1e-5) where ISEQUAL=1))
join CRITICALPARAMS c on b.param=c.PARAM;
...and then I want to use the VGROUPIDBYPARAM view inductively, in something like the following partially-complete select:
SELECT a1.SEEDID, a6.GROUPMEMBERID
FROM VGROUPIDBYPARAM a1
join VGROUPIDBYPARAM a2 on a1.SEEDID=a2.SEEDID and a1.GROUPMEMBERID=a2.GROUPMEMBERID
join VGROUPIDBYPARAM a3 on a1.SEEDID=a3.SEEDID and a2.GROUPMEMBERID=a3.GROUPMEMBERID
join VGROUPIDBYPARAM a4 on a1.SEEDID=a4.SEEDID and a3.GROUPMEMBERID=a4.GROUPMEMBERID
join VGROUPIDBYPARAM a5 on a1.SEEDID=a5.SEEDID and a4.GROUPMEMBERID=a5.GROUPMEMBERID
join VGROUPIDBYPARAM a6 on a1.SEEDID=a6.SEEDID and a5.GROUPMEMBERID=a6.GROUPMEMBERID
...
where a1.CRITPARAMINDX=1
and a2.CRITPARAMINDX=2
and a3.CRITPARAMINDX=3
and a4.CRITPARAMINDX=4
and a5.CRITPARAMINDX=5
and a6.CRITPARAMINDX=6
...
At the end of this inductive process (which I'm hoping a recursive CTE can imitate), the only surviving records that made it through the pile of JOINS have group member ID's belong to the same partition as the seed ID.
Many thanks to anyone that can help me solve this efficiently!
To solve the problem, I would start with this simple query (count matching dimensions in other objects):
SELECT
CPV1.ID AS ID1,
CPV2.ID AS ID2,
COUNT(*)
FROM
CRITICALPARAMVALS CPV1
INNER JOIN CRITICALPARAMVALS CPV2 ON CPV1.ID <> CPV2.ID
AND CPV1.PARAM = CPV2.PARAM
AND CPV1.VAL = CPV2.VAL
GROUP BY
CPV1.ID, CPV2.ID
With the following output:
As you can see, the interesting rows are marked with yellow background.
To filter only those rows we should add this condition:
HAVING COUNT(*) = (SELECT COUNT(*) FROM CRITICALPARAMS)
We can think of using a "seed" object ID, and asking what other IDs
belong to the same partition as the seed object.
The final query to answer the above question, with :SEED parameter, looks like this:
SELECT
CPV2.ID
FROM
CRITICALPARAMVALS CPV1
INNER JOIN CRITICALPARAMVALS CPV2 ON CPV1.ID <> CPV2.ID
AND CPV1.PARAM = CPV2.PARAM
AND CPV1.VAL = CPV2.VAL
WHERE CPV1.ID = :SEED
GROUP BY
CPV1.ID, CPV2.ID
HAVING COUNT(*) = (SELECT COUNT(*) FROM CRITICALPARAMS)
It should perform well even for big data sets.
I am trying to work out a bug we've found during our last iteration of testing. It involves a query which uses a common table expression. The main theme of the query is that it simulates a 'first' aggregate operation (get the first row for this grouping).
The problem is that the query seems to choose rows completely arbitrarily in some circumstances - multiple rows from the same group get returned, some groups simply get eliminated altogether. However, it always picks the correct number of rows.
I have created a minimal example to post here. There are clients and addresses, and a table which defines the relationships between them. This is a much simplified version of the actual query I'm looking at, but I believe it should have the same characteristics, and it is a good example to use to explain what I think is going wrong.
CREATE TABLE [Client] (ClientID int, Name varchar(20))
CREATE TABLE [Address] (AddressID int, Street varchar(20))
CREATE TABLE [ClientAddress] (ClientID int, AddressID int)
INSERT [Client] VALUES (1, 'Adam')
INSERT [Client] VALUES (2, 'Brian')
INSERT [Client] VALUES (3, 'Charles')
INSERT [Client] VALUES (4, 'Dean')
INSERT [Client] VALUES (5, 'Edward')
INSERT [Client] VALUES (6, 'Frank')
INSERT [Client] VALUES (7, 'Gene')
INSERT [Client] VALUES (8, 'Harry')
INSERT [Address] VALUES (1, 'Acorn Street')
INSERT [Address] VALUES (2, 'Birch Road')
INSERT [Address] VALUES (3, 'Cork Avenue')
INSERT [Address] VALUES (4, 'Derby Grove')
INSERT [Address] VALUES (5, 'Evergreen Drive')
INSERT [Address] VALUES (6, 'Fern Close')
INSERT [ClientAddress] VALUES (1, 1)
INSERT [ClientAddress] VALUES (1, 3)
INSERT [ClientAddress] VALUES (2, 2)
INSERT [ClientAddress] VALUES (2, 4)
INSERT [ClientAddress] VALUES (2, 6)
INSERT [ClientAddress] VALUES (3, 3)
INSERT [ClientAddress] VALUES (3, 5)
INSERT [ClientAddress] VALUES (3, 1)
INSERT [ClientAddress] VALUES (4, 4)
INSERT [ClientAddress] VALUES (4, 6)
INSERT [ClientAddress] VALUES (5, 1)
INSERT [ClientAddress] VALUES (6, 3)
INSERT [ClientAddress] VALUES (7, 2)
INSERT [ClientAddress] VALUES (8, 4)
INSERT [ClientAddress] VALUES (5, 6)
INSERT [ClientAddress] VALUES (6, 3)
INSERT [ClientAddress] VALUES (7, 5)
INSERT [ClientAddress] VALUES (8, 1)
INSERT [ClientAddress] VALUES (5, 4)
INSERT [ClientAddress] VALUES (6, 6)
;WITH [Stuff] ([ClientID], [Name], [Street], [RowNo]) AS
(
SELECT
[C].[ClientID],
[C].[Name],
[A].[Street],
ROW_NUMBER() OVER (ORDER BY [A].[AddressID]) AS [RowNo]
FROM
[Client] [C] INNER JOIN
[ClientAddress] [CA] ON
[C].[ClientID] = [CA].[ClientID] INNER JOIN
[Address] [A] ON
[CA].[AddressID] = [A].[AddressID]
)
SELECT
[CTE].[ClientID],
[CTE].[Name],
[CTE].[Street],
[CTE].[RowNo]
FROM
[Stuff] [CTE]
WHERE
[CTE].[RowNo] IN (SELECT MIN([CTE2].[RowNo]) FROM [Stuff] [CTE2] GROUP BY [CTE2].[ClientID])
ORDER BY
[CTE].[Name] ASC,
[CTE].[Street] ASC
DROP TABLE [ClientAddress]
DROP TABLE [Address]
DROP TABLE [Client]
The query is designed to get all clients, and their first address (the address with the lowest ID). This appears to me that it should work.
I have a theory about why it sometimes will not work. The statement that follows the CTE refers to the CTE in two places. If the CTE is non-deterministic, and it gets run more than once, the result of the CTE may be different in the two places it's referenced.
In my example, the CTE's RowNo column uses ROW_NUMBER() with an order by clause that will potentially result in different orderings when run multiple times (we're ordering by address, the clients can be in any order depending on how the query is executed).
Because of this is it possible that CTE and CTE2 can contain different results? Or is the CTE only executed once and do I need to look elsewhere for the problem?
It is not guaranteed in any way.
SQL Server is free to evaluate CTE each time it's accessed or cache the results, depending on the plan.
You may want to read this article:
Generating XML in subqueries
If your CTE is not deterministic, you will have to store its result in a temporary table or a table variable and use it instead of the CTE.
PostgreSQL, on the other hand, always evaluates CTEs only once, caching their results.
I have been trying to move away from using DECODE to pivot rows in Oracle 11g, where there is a handy PIVOT function. But I may have found a limitation:
I'm trying to return 2 columns for each value in the base table. Something like:
SELECT somethingId, splitId1, splitName1, splitId2, splitName2
FROM (SELECT somethingId, splitId
FROM SOMETHING JOIN SPLIT ON ... )
PIVOT ( MAX(splitId) FOR displayOrder IN (1 AS splitId1, 2 AS splitId2),
MAX(splitName) FOR displayOrder IN (1 AS splitName1, 2 as splitName2)
)
I can do this with DECODE, but I can't wrestle the syntax to let me do it with PIVOT. Is this even possible? Seems like it wouldn't be too hard for the function to handle.
Edit: is StackOverflow maybe not the right Overflow for SQL questions?
Edit: anyone out there?
From oracle-developer.net it would appear that it can be done like this:
SELECT somethingId, splitId1, splitName1, splitId2, splitName2
FROM (SELECT somethingId, splitId
FROM SOMETHING JOIN SPLIT ON ... )
PIVOT ( MAX(splitId) ,
MAX(splitName)
FOR displayOrder IN (1 AS splitName1, 2 as splitName2)
)
I'm not sure from what you provided what the data looks or what exactly you would like. Perhaps if you posted the decode version of the query that returns the data you are looking for and/or the definition for the source data, we could better answer your question. Something like this would be helpful:
create table something (somethingId Number(3), displayOrder Number(3)
, splitID Number(3));
insert into something values (1, 1, 10);
insert into something values (2, 1, 11);
insert into something values (3, 1, 12);
insert into something values (4, 1, 13);
insert into something values (5, 2, 14);
insert into something values (6, 2, 15);
insert into something values (7, 2, 16);
create table split (SplitID Number(3), SplitName Varchar2(30));
insert into split values (10, 'Bob');
insert into split values (11, 'Carrie');
insert into split values (12, 'Alice');
insert into split values (13, 'Timothy');
insert into split values (14, 'Sue');
insert into split values (15, 'Peter');
insert into split values (16, 'Adam');
SELECT *
FROM (
SELECT somethingID, displayOrder, so.SplitID, sp.splitname
FROM SOMETHING so JOIN SPLIT sp ON so.splitID = sp.SplitID
)
PIVOT ( MAX(splitId) id, MAX(splitName) name
FOR (displayOrder, displayOrder) IN ((1, 1) AS split, (2, 2) as splitname)
);