SQL Server: efficiently search for many values on many to many columns?

I am creating a website using SQL Server. In the admin interface, I have two fields:
Subject: Math, English, History, ...
Grade: 1, 2, 3, 4, ...
Multiple values of a field can be assigned to a record.
Now in the frontend search, I would like a visitor to be able to select more than one value of a field for search. For example, someone may search for Subject being Math OR History and Grade being 1 OR 3.
What table design and SQL statement (or MS-proprietary statement) should I use to have efficient search?
Thanks and regards.
UPDATE
Thanks for all input!
I feel compelled to explain. I am technical and familiar with SQL. One thing I learned over my MANY years of programming experience is to be practical. For this question, I already have an initial design, but I am not sure how other folks would handle it for EFFICIENT SEARCH (there are always smarter folks out there). Here is my table design for storing a record:
Subject
type: varchar. record example: ,1,3, (each is the id of corresponding value)
Grade (this means applicable grade)
type: varchar. record example: ,1,2, (each is the id of corresponding values. this means a record is applicable to grade 1, 2)
Search example
where (subject LIKE '%,1,%' OR subject like '%,3,%') AND (grade like '%,1,%')
This design should lead to efficient search, but it has the drawback of increasing the complexity of data management in the backend.
Another reason for my design is: the Subject and Grade each have a list of values that never/rarely change, and once a record is created, it never or rarely updates.
I am trying to strike a balance among simplicity, understandability, design, management, etc.

create table Subject (
SubjectId int identity(1, 1),
SubjectName nvarchar(255),
other fields.... )
create table GradingScale (
GradeId int identity(1, 1),
Grade int,
Description varchar(25),
other fields... )
create table Students (
StudentId int identity(1, 1),
StudentName nvarchar(255))
create table StudentGrades (
StudentId int,
SubjectId int,
GradeId int,
SemesterId int )
create FUNCTION [dbo].[fnArray] ( @Str varchar(max), @Delim varchar(1) = ' ', @RemoveDups bit = 0 )
returns @tmpTable table ( arrValue varchar(max))
as
begin
    declare @pos integer
    declare @lastpos integer
    declare @arrdata varchar(max) -- varchar(max) so long inputs are not truncated
    declare @data varchar(max)
    -- normalize the delimiter to '|' and terminate the list with one
    set @arrdata = replace(@Str,@Delim,'|')
    set @arrdata = @arrdata + '|'
    set @lastpos = 1
    set @pos = 0
    set @pos = charindex('|', @arrdata)
    while @pos <= len(@arrdata) and @pos <> 0
    begin
        set @data = substring(@arrdata, @lastpos, (@pos - @lastpos))
        -- skip empty items
        if rtrim(ltrim(@data)) > ''
        begin
            if @RemoveDups = 0
            begin
                insert into @tmpTable ( arrValue ) values ( @data )
            end
            else
            begin
                if not exists( select top 1 arrValue from @tmpTable where arrValue = @data )
                begin
                    insert into @tmpTable ( arrValue ) values ( @data )
                end
            end
        end
        set @lastpos = @pos + 1
        set @pos = charindex('|', @arrdata, @lastpos)
    end
    return
end
select *
from Students st
inner join StudentGrades sg on sg.StudentId = st.StudentId
inner join Subject s on sg.SubjectId = s.SubjectId
inner join GradingScale gs on sg.GradeId = gs.GradeId
-- a user-defined function needs the DEFAULT keyword for an omitted parameter
inner join dbo.fnArray(@subjects, ',', default) sArr on s.SubjectId = convert(int, sArr.arrValue)
inner join dbo.fnArray(@grades, ',', default) gArr on gs.GradeId = convert(int, gArr.arrValue)
Obviously @subjects and @grades could be passed in via some drop-down selectors or however your UI is composed.
Edited to use dbo.fnArray, a little tool that can parse delimited strings into a list.
Now of course this would mean that if you had 2 subjects and 2 grades, like "show me all students that took (Math and Science) and scored (2 or 3)", this would work. However, if you wanted students who took Math and scored 2 or 3, or students who took Science and scored a 3, you would have to rework the query.
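For the search the question actually asks about (records tagged with several subjects and several grades), one junction table per field is the usual normalized layout. The following is only a minimal sketch under assumptions: it uses Python's standard sqlite3 module so it can be run without a SQL Server instance, and the Record, RecordSubject and RecordGrade tables and their columns are hypothetical names, not part of the original posts.

```python
import sqlite3

# Hypothetical schema: one junction table per multi-valued field.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Record        (RecordId INTEGER PRIMARY KEY, Title TEXT);
CREATE TABLE RecordSubject (RecordId INTEGER, SubjectId INTEGER,
                            PRIMARY KEY (SubjectId, RecordId));
CREATE TABLE RecordGrade   (RecordId INTEGER, GradeId INTEGER,
                            PRIMARY KEY (GradeId, RecordId));
INSERT INTO Record VALUES (1, 'Fractions'), (2, 'Civil War'), (3, 'Grammar');
-- assumed subject ids: 1 = Math, 2 = English, 3 = History
INSERT INTO RecordSubject VALUES (1, 1), (2, 3), (3, 2);
INSERT INTO RecordGrade   VALUES (1, 1), (1, 2), (2, 3), (2, 4), (3, 1);
""")

def search(subject_ids, grade_ids):
    """Return records matching ANY chosen subject AND ANY chosen grade."""
    s_marks = ",".join("?" * len(subject_ids))  # e.g. "?,?"
    g_marks = ",".join("?" * len(grade_ids))
    sql = f"""
        SELECT r.RecordId, r.Title
        FROM Record r
        WHERE EXISTS (SELECT 1 FROM RecordSubject rs
                      WHERE rs.RecordId = r.RecordId
                        AND rs.SubjectId IN ({s_marks}))
          AND EXISTS (SELECT 1 FROM RecordGrade rg
                      WHERE rg.RecordId = r.RecordId
                        AND rg.GradeId IN ({g_marks}))
        ORDER BY r.RecordId
    """
    return conn.execute(sql, [*subject_ids, *grade_ids]).fetchall()

# (Math OR History) AND (grade 1 OR grade 3)
results = search([1, 3], [1, 3])
print(results)
```

Because the junction tables are keyed on the id being searched, the lookups can use an index, which the LIKE '%,1,%' pattern cannot. EXISTS is used instead of joins so that a record matching two chosen subjects is still returned once.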

Related

What is the best way to join two tables that have comma-separated columns?

Table1
ID Name Tags
----------------------------------
1 Customer1 Tag1,Tag5,Tag4
2 Customer2 Tag2,Tag6,Tag4,Tag11
3 Customer5 Tag6,Tag5,Tag10
and Table2
ID Name Tags
----------------------------------
1 Product1 Tag1,Tag10,Tag6
2 Product2 Tag2,Tag1,Tag5
3 Product5 Tag1,Tag2,Tag3
What is the best way to join Table1 and Table2 on the Tags column?
For each comma-separated tag in Table1's Tags column, it should look for that tag in the comma-separated Tags column of Table2.
Note: Tables are not full-text indexed.

The best way is not to have comma separated values in a column. Just use normalized data and you won't have trouble with querying like this - each column is supposed to only have one value.
Without this, there's no way to use any indices, really. Even a full-text index behaves quite differently from what you might think, and they are inherently clunky to use: they're designed for searching text, not meaningful data. In the end, you will not get much better than something like
where (Col = 'txt' or Col like 'txt,%' or Col like '%,txt' or Col like '%,txt,%')
Using an XML column might be another alternative, though it's still a bit silly. It would at least allow you to treat the values as a collection, though.
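To make the "normalized data" advice concrete, here is a minimal sketch, assuming the comma-separated strings are split once (here in application code) into one row per tag. It uses Python's standard sqlite3 module purely so it runs anywhere; the CustomerTag and ProductTag tables are hypothetical names, and the data is the sample from the question.

```python
import sqlite3

customers = {
    "Customer1": "Tag1,Tag5,Tag4",
    "Customer2": "Tag2,Tag6,Tag4,Tag11",
    "Customer5": "Tag6,Tag5,Tag10",
}
products = {
    "Product1": "Tag1,Tag10,Tag6",
    "Product2": "Tag2,Tag1,Tag5",
    "Product5": "Tag1,Tag2,Tag3",
}

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE CustomerTag (Customer TEXT, Tag TEXT, PRIMARY KEY (Tag, Customer));
CREATE TABLE ProductTag  (Product  TEXT, Tag TEXT, PRIMARY KEY (Tag, Product));
""")

# One row per (owner, tag) instead of one delimited string per owner.
conn.executemany(
    "INSERT INTO CustomerTag VALUES (?, ?)",
    [(name, tag) for name, tags in customers.items() for tag in tags.split(",")],
)
conn.executemany(
    "INSERT INTO ProductTag VALUES (?, ?)",
    [(name, tag) for name, tags in products.items() for tag in tags.split(",")],
)

# A plain equality join on Tag; no LIKE patterns or string splitting at query time.
pairs = conn.execute("""
    SELECT c.Customer, p.Product, COUNT(*) AS SharedTags
    FROM CustomerTag c
    JOIN ProductTag p ON p.Tag = c.Tag
    GROUP BY c.Customer, p.Product
    ORDER BY c.Customer, p.Product
""").fetchall()

for row in pairs:
    print(row)
```

With the sample data, every customer/product pair shares at least one tag except Customer5 and Product5.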

I don't think there will ever be an easy and efficient solution to this. As Luaan pointed out, it is a very bad idea to store data like this : you lose most of the power of SQL when you squeeze what should be individual units of data into a single cell.
But you can manage this at the slight cost of creating two user-defined functions. First, use this brilliant recursive technique to split the strings into individual rows based on your delimiter :
CREATE FUNCTION dbo.TestSplit (@sep char(1), @s varchar(512))
RETURNS table
AS
RETURN (
    WITH Pieces(pn, start, stop) AS (
        SELECT 1, 1, CHARINDEX(@sep, @s)
        UNION ALL
        SELECT pn + 1, stop + 1, CHARINDEX(@sep, @s, stop + 1)
        FROM Pieces
        WHERE stop > 0
    )
    SELECT pn AS SplitIndex,
        SUBSTRING(@s, start, CASE WHEN stop > 0 THEN stop-start ELSE 512 END) AS SplitPart
    FROM Pieces
)
Then, make a function that takes two strings and counts the matches :
CREATE FUNCTION dbo.MatchTags (@a varchar(512), @b varchar(512))
RETURNS INT
AS
BEGIN
    RETURN
    (SELECT COUNT(*)
    FROM dbo.TestSplit(',', @a) a
    INNER JOIN dbo.TestSplit(',', @b) b
    ON a.SplitPart = b.SplitPart)
END
And that's it, here is a test roll with table variables :
DECLARE @A TABLE (Name VARCHAR(20), Tags VARCHAR(100))
DECLARE @B TABLE (Name VARCHAR(20), Tags VARCHAR(100))
INSERT INTO @A ( Name, Tags )
VALUES
( 'Customer1','Tag1,Tag5,Tag4'),
( 'Customer2','Tag2,Tag6,Tag4,Tag11'),
( 'Customer5','Tag6,Tag5,Tag10')
INSERT INTO @B ( Name, Tags )
VALUES
( 'Product1','Tag1,Tag10,Tag6'),
( 'Product2','Tag2,Tag1,Tag5'),
( 'Product5','Tag1,Tag2,Tag3')
SELECT * FROM @A a
INNER JOIN @B b ON dbo.MatchTags(a.Tags, b.Tags) > 0

I developed a solution as follows:
CREATE TABLE [dbo].[Table1](
Id int not null,
Name nvarchar(250) not null,
Tag nvarchar(250) null,
) ON [PRIMARY]
GO
CREATE TABLE [dbo].[Table2](
Id int not null,
Name nvarchar(250) not null,
Tag nvarchar(250) null,
) ON [PRIMARY]
GO
Get sample data for Table1; this will insert 28000 records:
INSERT INTO Table1
SELECT CustomerID,CompanyName, (FirstName + ',' + LastName)
FROM AdventureWorks.SalesLT.Customer
GO 3
Sample data for Table2; I need the same tags in Table2:
declare @tag1 nvarchar(50) = 'Donna,Carreras'
declare @tag2 nvarchar(50) = 'Johnny,Caprio'
Get sample data for Table2; this will insert 9735 records:
INSERT INTO Table2
SELECT ProductID,Name, (case when(right(ProductID,1)>=5) then @tag1 else @tag2 end)
FROM AdventureWorks.SalesLT.Product
GO 3
My Solution
create TABLE #dt (
Id int IDENTITY(1,1) PRIMARY KEY,
Tag nvarchar(250) NOT NULL
);
I've created a temp table and will fill it with the distinct Tag values from Table1:
insert into #dt(Tag)
SELECT distinct Tag
FROM Table1
Now I need a vertical table for the tags (one tag per row):
create TABLE #Tags ( Tag nvarchar(250) NOT NULL );
Now I fill the #Tags table with a WHILE loop; you could use a cursor, but WHILE is faster:
declare @Rows int = 1
declare @Tag nvarchar(1024)
declare @Id int = 0
WHILE @Rows>0
BEGIN
    -- order by Id so each pass picks the next row deterministically
    Select Top 1 @Tag=Tag,@Id=Id from #dt where Id>@Id order by Id
    set @Rows = @@RowCount
    if @Rows>0
    begin
        insert into #Tags(Tag) SELECT Data FROM dbo.StringToTable(@Tag, ',')
    end
END
Last step: join Table2 with #Tags:
select distinct t.*
from Table2 t
inner join #Tags on (',' + t.Tag + ',') like ('%,' + #Tags.Tag + ',%')
With Table1 at 28000 rows and Table2 at 9735 rows, the select takes less than 2 seconds.

I use this kind of solution with paths of trees. First put a comma at the very beginning and at the very end of the string. Then you can call
Where col1 like '%,' || col2 || ',%'
Some databases also index the column for LIKE (Postgres does it partially), so this can also be efficient. I don't know about SQL Server.

Complex SQL selection query

I have a stored procedure which needs a different IF condition to work properly.
The procedure has 2 parameters, namely @CategoryID and @ClassID, which basically come from a UI tree view. @CategoryID corresponds to the parent nodes, while @ClassID corresponds to the child nodes.
Based upon the above parameters I need to make a selection (column Code) from a table which has CategoryID and ClassID as columns.
Now there are 2 scenarios:
Scenario 1
@CategoryID: A
@ClassID: B (which is a child node of CategoryID A)
Result needed: Codes corresponding to only ClassID B, which is basically the intersection
Scenario 2
@CategoryID: A
@ClassID: C (which is not a child node of CategoryID A)
Result needed: Codes corresponding to CategoryID A, as well as ClassID C, basically a union
The procedure which I wrote gives me the correct answer for the second scenario, but it fails for the first scenario. Below is my procedure:
ALTER PROCEDURE [dbo].[uspGetCodes]
@CategoryID varchar(50),
@ClassID varchar(50)
AS
BEGIN
BEGIN TRY
    DECLARE @SQLQuery NVARCHAR(MAX)
    SET @SQLQuery=N'SELECT Code FROM dbo.ClassToCategoryMapping WHERE '
    IF (@CategoryID IS NULL OR @CategoryID='')
    BEGIN
        SET @SQLQuery=@SQLQuery + 'ClassId IN ('+@ClassID+')'
        PRINT(@SQLQuery)
        EXEC(@SQLQuery)
    END
    ELSE IF (@ClassID IS NULL OR @ClassID='')
    BEGIN
        SET @SQLQuery=@SQLQuery+'CategoryID IN ('+@CategoryID+')'
        PRINT(@SQLQuery)
        EXEC(@SQLQuery)
    END
    ELSE
    BEGIN
        SET @SQLQuery=@SQLQuery+'(CategoryID IN ('+@CategoryID+') OR ClassId IN ('+@ClassID+') )'
        PRINT(@SQLQuery)
        EXEC(@SQLQuery)
    END
END TRY
BEGIN CATCH
    SELECT ERROR_NUMBER() AS 'ErrorNumber', ERROR_MESSAGE() AS 'ErrorMessage', ERROR_SEVERITY() AS 'ErrorSeverity', ERROR_STATE() AS 'ErrorState', ERROR_LINE() AS 'ErrorLine'
    RETURN ERROR_NUMBER()
END CATCH
END
The last ELSE part actually does an 'OR', which gives me the union of the Codes for the CategoryIDs and ClassIDs, irrespective of whether the given ClassID is a child of the given CategoryID or not.
My question is how to write the condition so that both scenarios work.
Latest Sample Data:
Scenario 1
@CategoryID=2,5, @ClassID=10 (Here 10 is the child while 2 is the parent; CategoryID 2 corresponds to ClassIDs 10, 11, 12)
Expected Result: 10, 26, 27 (26 and 27 correspond to CategoryID 5)
Scenario 2
@CategoryID=2, @ClassID=13,15 (13 and 15 are children of a different parent; CategoryID 2 corresponds to ClassIDs 10, 11, 12)
Expected Result: 10, 11, 12, 13, 15
Data in Table dbo.ClasstoCategoryMapping will be somewhat as below:
CategoryID ClassID Code
2 10 200
2 11 201
2 12 202
5 26 501
5 27 502
6 15 601
6 16 602
6 17 603
7 20 701
7 21 702
7 22 703
I hope I have made my question clear; if not, please ask and I will be happy to edit it. Any pointers will be much appreciated.
Regards
Anurag

If I understand the question correctly, what you require in your result set is:
(all supplied classid) + (all classid for supplied categoryid with no matching supplied classid)
That would translate to the following:
CREATE PROCEDURE [dbo].[uspGetCodes]
(
    @CategoryID varchar(50),
    @ClassID varchar(50)
)
AS
BEGIN
    SELECT COALESCE(CM.CategoryID, CM2.CategoryID) AS CategoryID,
        COALESCE(CM.ClassID, CM2.ClassID) AS ClassID,
        COALESCE(CM.Code, CM2.Code) AS Code
    --Matched classIDs:
    FROM dbo.udfSplitCommaSeparatedIntList(@ClassID) CLAS
    JOIN dbo.ClassToCategoryMapping CM
        ON CM.ClassId = CLAS.Value
    --Unmatched CategoryIDs:
    FULL OUTER JOIN dbo.udfSplitCommaSeparatedIntList(@CategoryID) CAT
        ON CM.CategoryID = CAT.Value
    LEFT JOIN dbo.ClassToCategoryMapping CM2
        ON CM.CategoryID IS NULL
        AND CM2.CategoryID = CAT.Value
END
I have included Category, Class and Code in the result since it's easier to see what's going on; however, I guess you only really need Code.
This makes use of the following function to split the supplied comma separated strings:
CREATE FUNCTION [dbo].[udfSplitCommaSeparatedIntList]
(
    @Values varchar(50)
)
RETURNS @Result TABLE
(
    Value int
)
AS
BEGIN
    DECLARE @LengthValues int
    SELECT @LengthValues = COALESCE(LEN(@Values), 0)
    -- null or empty input yields an empty table
    IF (@LengthValues = 0)
        RETURN
    DECLARE @StartIndex int
    SELECT @StartIndex = 1
    DECLARE @CommaIndex int
    SELECT @CommaIndex = CHARINDEX(',', @Values, @StartIndex)
    DECLARE @Value varchar(50);
    WHILE (@CommaIndex > 0)
    BEGIN
        SELECT @Value = SUBSTRING(@Values, @StartIndex, @CommaIndex - @StartIndex)
        INSERT @Result VALUES (@Value)
        SELECT @StartIndex = @CommaIndex + 1
        SELECT @CommaIndex = CHARINDEX(',', @Values, @StartIndex)
    END
    -- last item after the final comma
    SELECT @Value = SUBSTRING(@Values, @StartIndex, LEN(@Values) - @StartIndex + 1)
    INSERT @Result VALUES (@Value)
    RETURN
END
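The rule stated at the top of this answer (every supplied ClassID, plus every class of a supplied CategoryID that has none of its classes supplied) can be checked against the question's sample rows outside the database. This is a minimal Python sketch of that rule only, not a port of the procedure; the names MAPPING, parse_ids and get_codes are made up for illustration.

```python
# Sample rows from the question: (CategoryID, ClassID, Code)
MAPPING = [
    (2, 10, 200), (2, 11, 201), (2, 12, 202),
    (5, 26, 501), (5, 27, 502),
    (6, 15, 601), (6, 16, 602), (6, 17, 603),
    (7, 20, 701), (7, 21, 702), (7, 22, 703),
]

def parse_ids(csv):
    """Turn '2,5' into {2, 5}; None or '' gives an empty set."""
    return {int(part) for part in (csv or "").split(",") if part.strip()}

def get_codes(category_csv, class_csv):
    categories = parse_ids(category_csv)
    classes = parse_ids(class_csv)
    # Categories narrowed down by an explicitly supplied child class.
    narrowed = {cat for cat, cls, _ in MAPPING
                if cat in categories and cls in classes}
    return sorted(
        (cat, cls, code)
        for cat, cls, code in MAPPING
        if cls in classes or (cat in categories and cat not in narrowed)
    )

# Scenario 1 from the question: category 2 is narrowed to class 10,
# category 5 contributes all of its classes.
scenario1 = get_codes("2,5", "10")
# A scenario-2 style call using class 15, which exists in the sample rows.
scenario2 = get_codes("2", "15")
print(scenario1)
print(scenario2)
```

Scenario 1 yields classes 10, 26 and 27, matching the expected result in the question. The second call uses class 15 only, because class 13 does not appear in the sample table.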

This is a sample query that can achieve your goal. Is this what you want?
DECLARE @SAMPLE TABLE
(
    ID INT IDENTITY(1,1),
    CategoryId INT,
    ClassID INT
)
INSERT INTO @SAMPLE
VALUES(2,10)
INSERT INTO @SAMPLE
VALUES(2,11)
INSERT INTO @SAMPLE
VALUES(2,12)
INSERT INTO @SAMPLE
VALUES(3,13)
DECLARE @CategoryID INT
DECLARE @ClassID Int
--Play around with your parameter(s) here
SET @CategoryID = 2
SET @ClassID = 13
--Scenario 1
--@CategoryID=2, @ClassID=10 (Here 10 is the child while 2 is the parent, CategoryID 2 corresponds to ClassIDs 10, 11, 12)
--Expected Result: 10
IF EXISTS(SELECT * FROM @SAMPLE WHERE CategoryId = @CategoryID AND ClassID = @ClassID)
    SELECT ClassID FROM @SAMPLE WHERE CategoryId = @CategoryID AND ClassID = @ClassID
--Scenario 2
--@CategoryID=2, @ClassID=13 (13 is the child of a different parent, CategoryID 2 corresponds to ClassIDs 10, 11, 12)
--Expected Result: 10, 11, 12, 13
ELSE
    SELECT ClassID FROM @SAMPLE WHERE ClassID = @ClassID OR CategoryId = @CategoryID

Try this
select * from yourtable
where CategoryId = @CategoryID and ClassID = @ClassID
union
select * from
(
    select * from yourtable where ClassID = @ClassID
    union
    select * from yourtable where CategoryId = @CategoryID
) v
where not exists (select * from yourtable where CategoryId = @CategoryID and ClassID = @ClassID)

UPDATE FOR DELIMITED STRING
If you have a comma-delimited string, then it is best to use a CLR function to create the table, but you could use a SQL function. Examples of how to do this are easy to find with a Google search, but for reference here is one good article on the subject: http://www.sqlperformance.com/2012/07/t-sql-queries/split-strings I expect at some point there will be native support on most platforms.
Assume you have a function that returns a table of one column (named ID) of type int, or an empty table on a null input. Note: you may have to have the null input return a table with one row containing an invalid value (a value that will never join), say -1.
The code is as simple as this:
SELECT Code
FROM ClassToCategoryMapping
LEFT JOIN MakeTableFromCVS(@CategoryID) AS CatTable
    ON CatTable.ID = CategoryID
LEFT JOIN MakeTableFromCVS(@ClassID) AS ClassTable
    ON ClassTable.ID = ClassID
WHERE
    CASE
        WHEN @CategoryID IS NULL THEN -1 ELSE CatTable.ID
    END = ISNULL(CatTable.ID,-1)
    AND
    CASE
        WHEN @ClassID IS NULL THEN -1 ELSE ClassTable.ID
    END = ISNULL(ClassTable.ID,-1)
    AND
    COALESCE(CatTable.ID,ClassTable.ID,-1) != -1
The logic is the same as below. Because the join will vary the values when the input is not null, we have to use a different trick. Here we use a marker value (in this case -1) to signal the null value. Any value that won't appear in the comma-separated list will work as this marker value; remember it must be of the same type.
You don't need dynamic SQL here, and you will find SQL Server is better at optimizing if you don't use dynamic SQL. In fact, you don't even need an IF statement. If you can be sure that a missing input is always null (never an empty string), you can do this:
SELECT Code
FROM ClassToCategoryMapping
WHERE CategoryID = ISNULL(@CategoryID,CategoryID)
AND ClassID = ISNULL(@ClassID,ClassID)
How this works
This query checks that both the CategoryID and ClassID columns match the incoming parameters, but "ignores" a parameter when it is null by comparing the column against itself. This is a handy SQL trick.
Note: if you do need to check for empty strings, then this will be almost as fast:
DECLARE @myCatID varchar(max)
DECLARE @myClassID varchar(max)
SET @myCatID = @CategoryID
SET @myClassID = @ClassID
-- treat blank input as null (T-SQL IF takes no THEN keyword)
IF LTRIM(RTRIM(@CategoryID)) = '' SET @myCatID = NULL
IF LTRIM(RTRIM(@ClassID)) = '' SET @myClassID = NULL
SELECT Code
FROM ClassToCategoryMapping
WHERE CategoryID = ISNULL(@myCatID,CategoryID)
AND ClassID = ISNULL(@myClassID,ClassID)
You can replace ISNULL() with COALESCE() if you want; they do the same thing in this case.

Checking existence of all words of a column of table 1 in another column of table 2

I have a table which contains a product_name field, and another table with models.
===products
product_id, product_name
===models
model_id, model_name
I am looking for a way to do the following.
Model names can have words separated by hyphens, e.g. JVC-600-BLACK.
For each model I need to check whether every word of the model name exists in the product name.
I need the result in a form like the one below.
== results
model_id, product_id
If someone can point me in the right direction, that would be a great help.
Notes
These are huge tables with millions of records, and the number of words in model_name is not fixed.
The words of a model may appear in the product name in any order, and between other words.

Here's a function that splits the first string into parts using - as a delimiter and looks up each part in the second string, returning 1 if all parts were found and 0 otherwise.
CREATE FUNCTION dbo.func(@str1 varchar(max), @str2 varchar(max))
RETURNS BIT
AS
BEGIN
    DECLARE @pos INT, @newPos INT,
        @delimiter NCHAR(1)
    SET @delimiter = '-'
    SET @pos = 1
    SET @newPos = 0
    WHILE (@newPos < LEN(@str1))
    BEGIN
        SET @newPos = CHARINDEX(@delimiter, @str1, @pos)
        -- no more delimiters: the last part runs to the end of the string
        IF @newPos = 0
            SET @newPos = LEN(@str1)+1
        DECLARE @data2 NVARCHAR(MAX)
        SET @data2 = SUBSTRING(@str1, @pos, @newPos-@pos)
        -- a single missing part is enough to fail
        IF CHARINDEX(@data2, @str2) = 0
            RETURN 0
        SET @pos = @newPos + 1
        IF @newPos = 0
            BREAK
    END
    RETURN 1
END
You can use the above function for your problem as follows:
SELECT model_id, product_id
FROM models
JOIN products
ON dbo.func(models.model_name, products.product_name) = 1
It's not going to be fast, but I don't think a fast solution exists, since your problem doesn't allow for indexing. It may be possible to change the database structure to allow for this, but how exactly this can be done largely depends on what your data looks like.
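For trying the matching rule on a handful of rows before running it against millions, the same idea (split on the hyphen, require every part to occur in the product name) can be sketched in Python. This is an illustration under assumptions, not a line-for-line port of dbo.func: the helper name all_parts_found is made up, empty parts are skipped, and Python's substring test is case-sensitive whereas CHARINDEX follows the column collation. The sample names are the ones used elsewhere on this page.

```python
def all_parts_found(model_name, product_name, delimiter="-"):
    """True when every delimiter-separated part of model_name occurs in product_name."""
    parts = [part for part in model_name.split(delimiter) if part]
    return all(part in product_name for part in parts)

products = {
    1: "sdfsd def1 abc1klm1 sdljkfd",
    2: "sdfsd def2 abc2klm2 sdljkfd",
    3: "sdfsd def3 abc3klm3 sdljkfd",
}
models = {
    1: "abc1-def1-klm1",
    2: "abc2-def2-klm2",
    3: "abc3-def3-klm3",
}

# Every model is compared with every product, which is exactly why this
# approach is slow on large tables.
matches = [
    (model_id, product_id)
    for model_id, model_name in models.items()
    for product_id, product_name in products.items()
    if all_parts_found(model_name, product_name)
]
print(matches)
```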

I don't know whether this solution is faster; check it yourself if that matters to you:
--=======================
-- sample data
-- ======================
declare @Products table
(
    product_id int,
    product_name nvarchar(max)
)
insert into @Products select 1, 'sdfsd def1 abc1klm1 sdljkfd'
insert into @Products select 2, 'sdfsd def2 abc2klm2 sdljkfd'
insert into @Products select 3, 'sdfsd def3 abc3klm3 sdljkfd'
declare @Models table
(
    model_id int,
    model_name nvarchar(max)
)
insert into @Models select 1, 'abc1-def1-klm1'
insert into @Models select 2, 'abc2-def2-klm2'
insert into @Models select 3, 'abc3-def3-klm3'
--=======================
-- solution
-- ======================
select t1.product_id, t2.model_id from @Products t1
cross join (
    select
        t1.model_id, Word = t2.r.value('.', 'nvarchar(max)')
    from (select model_id, x = cast('<e>' + replace(model_name, '-', '</e><e>') + '</e>' as xml) from @Models ) t1
    cross apply x.nodes('e') as t2 (r)
) t2
group by product_id, model_id
having min(charindex(word, product_name)) != 0

You may want to consider using the Full Text Search feature of SQL Server. In a nutshell, it catalogs all of the words in the tables and columns you specify when setting up the Full Text Catalog (ignoring noise words like "and", "or", "a" and "the" by default, although this list of noise words is configurable) and offers a handful of functions that use that catalog to find rows quickly.

Finding Uppercase Character then Adding Space

I bought a SQL World City/State database. In the state table, the state names are run together, for example "NorthCarolina" or "SouthCarolina".
Is there a way in SQL to loop through a name, find the uppercase characters and add a space,
so that "NorthCarolina" becomes "North Carolina"?

Create this function
if object_id('dbo.SpaceBeforeCaps') is not null
    drop function dbo.SpaceBeforeCaps
GO
create function dbo.SpaceBeforeCaps(@s varchar(100)) returns varchar(100)
as
begin
    declare @return varchar(100);
    -- the first character never gets a leading space
    set @return = left(@s,1);
    declare @i int;
    set @i = 2;
    while @i <= len(@s)
    begin
        if ASCII(substring(@s,@i,1)) between ASCII('A') and ASCII('Z')
            set @return = @return + ' ' + substring(@s,@i,1)
        else
            set @return = @return + substring(@s,@i,1)
        set @i = @i + 1;
    end;
    return @return;
end;
GO
Then you can use it to update your database
update tbl set statename = dbo.SpaceBeforeCaps(statename);

There are a couple of ways to approach this:
Construct a function using a pattern and the PATINDEX feature.
Chain minimal REPLACE statements for each case (e.g. REPLACE(state_name, 'hC', 'h C') for your example case). This is kind of a hack, but it might actually give you the best performance, since you have such a small set of replacements.
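To preview what a pattern-based fix would produce before updating the table, the "lowercase letter followed by an uppercase letter" rule can be tried in a few lines of Python. This is only a sketch of the pattern idea with a made-up helper name; unlike a function that spaces every capital letter, it leaves runs of capitals untouched.

```python
import re

def space_before_caps(name):
    # Insert a space wherever a lowercase letter is directly followed
    # by an uppercase letter, e.g. "hC" becomes "h C".
    return re.sub(r"([a-z])([A-Z])", r"\1 \2", name)

fixed = [space_before_caps(s)
         for s in ["NorthCarolina", "SouthCarolina", "NewSouthWales", "Texas"]]
print(fixed)
```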

If you absolutely cannot create functions and need this as a one-off, you can use a recursive CTE to break the string up (and add the space at the same time where required), then recombine the characters using FOR XML. Elaborate example below:
-- some sample data
create table #tmp (id int identity primary key, statename varchar(100));
insert #tmp select 'NorthCarolina';
insert #tmp select 'SouthCarolina';
insert #tmp select 'NewSouthWales';
-- the complex query updating the "statename" column in the "#tmp" table
;with cte(id,seq,char,rest) as (
select id,1,cast(left(statename,1) as varchar(2)), stuff(statename,1,1,'')
from #tmp
union all
select id,seq+1,case when ascii(left(rest,1)) between ascii('A') and ascii('Z')
then ' ' else '' end + left(rest,1)
, stuff(rest,1,1,'')
from cte
where rest > ''
), recombined as (
select a.id, (select b.char+''
from cte b
where a.id = b.id
order by b.seq
for xml path, type).value('/','varchar(100)') fixed
from cte a
group by a.id
)
update t
set statename = c.fixed
from #tmp t
join recombined c on c.id = t.id
where statename != c.fixed;
-- check the result
select * from #tmp
----------- -----------
id statename
----------- -----------
1 North Carolina
2 South Carolina
3 New South Wales

Remove a sentence from a paragraph that has a specific pattern with T-SQL

I have a large number of descriptions that can be anywhere from 5 to 20 sentences each. I am trying to put a script together that will locate and remove a sentence that contains a word with numbers before or after it.
before example: Hello world. Todays department has 345 employees. Have a good day.
after example: Hello world. Have a good day.
My main problem right now is identifying the violation.
Here "345 employees" is what causes the sentence to be removed. However, each description will have a different number and possibly a different variation of the word employee.
I would like to avoid having to create a table of all the different variations of employee.
JTB

This would make a good SQL Puzzle.
Disclaimer: there are probably TONS of edge cases that would blow this up
This would take a string, split it out into a table with a row for each sentence, then remove the rows that matched a condition, and then finally join them all back into a string.
CREATE FUNCTION dbo.fn_SplitRemoveJoin(@Val VARCHAR(2000), @FilterCond VARCHAR(100))
RETURNS VARCHAR(2000)
AS
BEGIN
    DECLARE @tbl TABLE (rid INT IDENTITY(1,1), val VARCHAR(2000))
    DECLARE @t VARCHAR(2000)
    -- Split into table @tbl
    WHILE CHARINDEX('.',@Val) > 0
    BEGIN
        SET @t = LEFT(@Val, CHARINDEX('.', @Val))
        INSERT @tbl (val) VALUES (@t)
        SET @Val = RIGHT(@Val, LEN(@Val) - LEN(@t))
    END
    IF (LEN(@Val) > 0)
        INSERT @tbl VALUES (@Val)
    -- Filter out condition
    DELETE FROM @tbl WHERE val LIKE @FilterCond
    -- Join back into 1 string
    DECLARE @i INT, @rv VARCHAR(2000)
    SET @i = 1
    WHILE @i <= (SELECT MAX(rid) FROM @tbl)
    BEGIN
        SELECT @rv = IsNull(@rv,'') + IsNull(val,'') FROM @tbl WHERE rid = @i
        SET @i = @i + 1
    END
    RETURN @rv
END
go
CREATE TABLE #TMP (rid INT IDENTITY(1,1), sentence VARCHAR(2000))
INSERT #tmp (sentence) VALUES ('Hello world. Todays department has 345 employees. Have a good day.')
INSERT #tmp (sentence) VALUES ('Hello world. Todays department has 15 emps. Have a good day. Oh and by the way there are 12 employees somewhere else')
SELECT
rid, sentence, dbo.fn_SplitRemoveJoin(sentence, '%[0-9] Emp%')
FROM #tmp t
returns
rid | sentence | |
1 | Hello world. Todays department has 345 employees. Have a good day. | Hello world. Have a good day.|
2 | Hello world. Todays department has 15 emps. Have a good day. Oh and by the way there are 12 employees somewhere else | Hello world. Have a good day. |
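The split, filter and rejoin steps described above can be prototyped outside the database to tune the filter pattern before committing to a UDF. The following Python sketch makes its own assumptions: the helper name and the default pattern (digits, optional whitespace, then "emp") are illustrative, and the naive sentence split will also break on decimals such as 3.5.

```python
import re

def remove_matching_sentences(text, pattern=r"\d+\s*emp"):
    # Split into sentences, keeping each terminator (. ! ?) with its sentence;
    # a trailing fragment without a terminator is kept as its own piece.
    sentences = re.findall(r"[^.!?]+[.!?]?", text)
    # Drop every sentence in which the pattern occurs (case-insensitive).
    kept = [s for s in sentences if not re.search(pattern, s, re.IGNORECASE)]
    # Sentences keep their leading spaces, so a plain join restores spacing.
    return "".join(kept).strip()

first = remove_matching_sentences(
    "Hello world. Todays department has 345 employees. Have a good day."
)
second = remove_matching_sentences(
    "Hello world. Todays department has 15 emps. Have a good day. "
    "Oh and by the way there are 12 employees somewhere else"
)
print(first)
print(second)
```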

I've used the split/remove/join technique as well.
The main points are:
This uses a pair of recursive CTEs, rather than a UDF.
This will work with all English sentence endings: . or ! or ?
This removes whitespace to make the comparison for "digit then employee" so you don't have to worry about multiple spaces and such.
Here's the SqlFiddle demo, and the code:
-- Split descriptions into sentences (could use period, exclamation point, or question mark)
-- Delete any sentences that, without whitespace, are like '%[0-9]employ%'
-- Join sentences back into descriptions
;with Splitter as (
select ID
, ltrim(rtrim(Data)) as Data
, cast(null as varchar(max)) as Sentence
, 0 as SentenceNumber
from Descriptions -- Your table here
union all
select ID
, case when Data like '%[.!?]%' then right(Data, len(Data) - patindex('%[.!?]%', Data)) else null end
, case when Data like '%[.!?]%' then left(Data, patindex('%[.!?]%', Data)) else Data end
, SentenceNumber + 1
from Splitter
where Data is not null
), Joiner as (
select ID
, cast('' as varchar(max)) as Data
, 0 as SentenceNumber
from Splitter
group by ID
union all
select j.ID
, j.Data +
-- Don't want "digit+employ" sentences, remove whitespace to search
case when replace(replace(replace(replace(s.Sentence, char(9), ''), char(10), ''), char(13), ''), char(32), '') like '%[0-9]employ%' then '' else s.Sentence end
, s.SentenceNumber
from Joiner j
join Splitter s on j.ID = s.ID and s.SentenceNumber = j.SentenceNumber + 1
)
-- Final Select
select a.ID, a.Data
from Joiner a
join (
-- Only get max SentenceNumber
select ID, max(SentenceNumber) as SentenceNumber
from Joiner
group by ID
) b on a.ID = b.ID and a.SentenceNumber = b.SentenceNumber
order by a.ID, a.SentenceNumber

One way to do this. Please note that it only works if you have one number in all sentences.
declare @d VARCHAR(1000) = 'Hello world. Todays department has 345 employees. Have a good day.'
declare @dr VARCHAR(1000)
set @dr = REVERSE(@d)
SELECT REVERSE(RIGHT(@dr,LEN(@dr) - CHARINDEX('.',@dr,PATINDEX('%[0-9]%',@dr))))
+ RIGHT(@d,LEN(@d) - CHARINDEX('.',@d,PATINDEX('%[0-9]%',@d)) + 1)
