Remove a sentence from a paragraph that has a specific pattern with T-SQL - sql

I have a large number of descriptions that can be anywhere from 5 to 20 sentences each. I am trying to put a script together that will locate and remove a sentence that contains a word with numbers before or after it.
before example: Hello world. Todays department has 345 employees. Have a good day.
after example: Hello world. Have a good day.
My main problem right now is identifying the violation.
Here "345 employees" is what causes the sentence to be removed. However, each description will have a different number and possibly a different variation of the word employee.
I would like to avoid having to create a table of all the different variations of employee.
JTB

This would make a good SQL Puzzle.
Disclaimer: there are probably TONS of edge cases that would blow this up
This would take a string, split it out into a table with a row for each sentence, then remove the rows that matched a condition, and then finally join them all back into a string.
CREATE FUNCTION dbo.fn_SplitRemoveJoin(@Val VARCHAR(2000), @FilterCond VARCHAR(100))
RETURNS VARCHAR(2000)
AS
BEGIN
    DECLARE @tbl TABLE (rid INT IDENTITY(1,1), val VARCHAR(2000))
    DECLARE @t VARCHAR(2000)
    -- Split into table @tbl
    WHILE CHARINDEX('.',@Val) > 0
    BEGIN
        SET @t = LEFT(@Val, CHARINDEX('.', @Val))
        INSERT @tbl (val) VALUES (@t)
        SET @Val = RIGHT(@Val, LEN(@Val) - LEN(@t))
    END
    IF (LEN(@Val) > 0)
        INSERT @tbl VALUES (@Val)
    -- Filter out condition
    DELETE FROM @tbl WHERE val LIKE @FilterCond
    -- Join back into 1 string
    DECLARE @i INT, @rv VARCHAR(2000)
    SET @i = 1
    WHILE @i <= (SELECT MAX(rid) FROM @tbl)
    BEGIN
        SELECT @rv = IsNull(@rv,'') + IsNull(val,'') FROM @tbl WHERE rid = @i
        SET @i = @i + 1
    END
    RETURN @rv
END
go
CREATE TABLE #TMP (rid INT IDENTITY(1,1), sentence VARCHAR(2000))
INSERT #tmp (sentence) VALUES ('Hello world. Todays department has 345 employees. Have a good day.')
INSERT #tmp (sentence) VALUES ('Hello world. Todays department has 15 emps. Have a good day. Oh and by the way there are 12 employees somewhere else')
SELECT
rid, sentence, dbo.fn_SplitRemoveJoin(sentence, '%[0-9] Emp%')
FROM #tmp t
returns
rid | sentence | |
1 | Hello world. Todays department has 345 employees. Have a good day. | Hello world. Have a good day.|
2 | Hello world. Todays department has 15 emps. Have a good day. Oh and by the way there are 12 employees somewhere else | Hello world. Have a good day. |

I've used the split/remove/join technique as well.
The main points are:
This uses a pair of recursive CTEs, rather than a UDF.
This will work with all English sentence endings: . or ! or ?
This removes whitespace to make the comparison for "digit then employee" so you don't have to worry about multiple spaces and such.
Here's the SqlFiddle demo, and the code:
-- Split descriptions into sentences (could use period, exclamation point, or question mark)
-- Delete any sentences that, without whitespace, are like '%[0-9]employ%'
-- Join sentences back into descriptions
;with Splitter as (
select ID
, ltrim(rtrim(Data)) as Data
, cast(null as varchar(max)) as Sentence
, 0 as SentenceNumber
from Descriptions -- Your table here
union all
select ID
, case when Data like '%[.!?]%' then right(Data, len(Data) - patindex('%[.!?]%', Data)) else null end
, case when Data like '%[.!?]%' then left(Data, patindex('%[.!?]%', Data)) else Data end
, SentenceNumber + 1
from Splitter
where Data is not null
), Joiner as (
select ID
, cast('' as varchar(max)) as Data
, 0 as SentenceNumber
from Splitter
group by ID
union all
select j.ID
, j.Data +
-- Don't want "digit+employ" sentences, remove whitespace to search
case when replace(replace(replace(replace(s.Sentence, char(9), ''), char(10), ''), char(13), ''), char(32), '') like '%[0-9]employ%' then '' else s.Sentence end
, s.SentenceNumber
from Joiner j
join Splitter s on j.ID = s.ID and s.SentenceNumber = j.SentenceNumber + 1
)
-- Final Select
select a.ID, a.Data
from Joiner a
join (
-- Only get max SentenceNumber
select ID, max(SentenceNumber) as SentenceNumber
from Joiner
group by ID
) b on a.ID = b.ID and a.SentenceNumber = b.SentenceNumber
order by a.ID, a.SentenceNumber
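To try the query above, some sample data helps. The following is only a sketch: the table name Descriptions and the columns ID and Data come from the query itself, but the column types and the sample rows are assumptions. Data is declared as varchar(max) here so that it lines up with the cast(... as varchar(max)) columns in the CTE. The second statement runs the whitespace-stripping filter on its own, so you can see which descriptions contain a "digit then employ" sentence before running the full split/join.

```sql
-- Hypothetical sample table: names come from the query above, types and rows are assumptions.
CREATE TABLE Descriptions (ID int PRIMARY KEY, Data varchar(max));

INSERT INTO Descriptions (ID, Data) VALUES
    (1, 'Hello world. Todays department has 345 employees. Have a good day.'),
    (2, 'Is anyone there? We have 12   employees! Goodbye.');

-- The filter in isolation: strip tab, line feed, carriage return and space,
-- then look for a digit immediately followed by "employ".
SELECT ID,
       CASE WHEN REPLACE(REPLACE(REPLACE(REPLACE(Data, CHAR(9), ''), CHAR(10), ''), CHAR(13), ''), CHAR(32), '')
                 LIKE '%[0-9]employ%'
            THEN 'contains a match'
            ELSE 'no match'
       END AS Verdict
FROM Descriptions;
```

Because the spaces are removed before the comparison, "12   employees" (several spaces) is caught by the same pattern as "345 employees".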

One way to do this. Please note that it only works if you have one number in all sentences.
declare @d VARCHAR(1000) = 'Hello world. Todays department has 345 employees. Have a good day.'
declare @dr VARCHAR(1000)
set @dr = REVERSE(@d)
SELECT REVERSE(RIGHT(@dr,LEN(@dr) - CHARINDEX('.',@dr,PATINDEX('%[0-9]%',@dr))))
    + RIGHT(@d,LEN(@d) - CHARINDEX('.',@d,PATINDEX('%[0-9]%',@d)) + 1)

Related

SQL Server: efficiently search for many values on many to many columns?

I am creating a website using SQL Server. In the admin interface, I have two fields:
Subject: Math, English, History, ...
Grade: 1, 2, 3, 4, ...
Multiple values of a field can be assigned to a record.
Now in the frontend search, I would like a visitor to be able to select more than one value of a field for search. For example, someone may search for Subject being Math OR History and Grade being 1 OR 3.
What table design and SQL statement (or MS-proprietary statement) should I use to have efficient search?
Thanks and regards.
UPDATE
Thanks for all input!
I feel compelled to explain. I am technical and familiar with SQL. One thing I learned over my MANY years of programming experience is to be practical. For this question, I already have an initial design, but I am not sure how other folks would handle it for EFFICIENT SEARCH (there are always smarter folks out there). Here is my table design for storing a record:
Subject
type: varchar. record example: ,1,3, (each is the id of corresponding value)
Grade (this means applicable grade)
type: varchar. record example: ,1,2, (each is the id of corresponding values. this means a record is applicable to grade 1, 2)
Search example
where (subject LIKE '%,1,%' OR subject like '%,3,%') AND (grade like '%,1,%')
This design should lead to efficient search, but it has the drawback of increasing the complexity of data management in the backend.
Another reason for my design is: the Subject and Grade each have a list of values that never/rarely change, and once a record is created, it never or rarely updates.
I am trying to strike a balance among simplicity, understandability, design, management, etc.
create table Subject (
SubjectId int identity(1, 1),
SubjectName nvarchar(255),
other fields.... )
create table GradingScale (
GradeId int identity(1, 1),
Grade int,
Description varchar(25),
other fields... )
create table Students (
StudentId int identity(1, 1),
StudentName nvarchar(255))
create table StudentGrades (
StudentId int,
SubjectId int,
GradeId int,
SemesterId int )
create FUNCTION [dbo].[fnArray] ( @Str varchar(max), @Delim varchar(1) = ' ', @RemoveDups bit = 0 )
returns @tmpTable table ( arrValue varchar(max))
as
begin
    declare @pos integer
    declare @lastpos integer
    declare @arrdata varchar(8000)
    declare @data varchar(max)
    set @arrdata = replace(@Str,@Delim,'|')
    set @arrdata = @arrdata + '|'
    set @lastpos = 1
    set @pos = 0
    set @pos = charindex('|', @arrdata)
    while @pos <= len(@arrdata) and @pos <> 0
    begin
        set @data = substring(@arrdata, @lastpos, (@pos - @lastpos))
        if rtrim(ltrim(@data)) > ''
        begin
            if @RemoveDups = 0
            begin
                insert into @tmpTable ( arrValue ) values ( @data )
            end
            else
            begin
                if not exists( select top 1 arrValue from @tmpTable where arrValue = @data )
                begin
                    insert into @tmpTable ( arrValue ) values ( @data )
                end
            end
        end
        set @lastpos = @pos + 1
        set @pos = charindex('|', @arrdata, @lastpos)
    end
    return
end
select *
from Students st
inner join StudentGrades sg on sg.StudentId = st.StudentId
inner join Subject s on sg.SubjectId = s.SubjectId
inner join GradingScale gs on sg.GradeId = gs.GradeId
inner join dbo.fnArray(@subjects, ',') sArr on s.SubjectId = convert(int, sArr.arrValue)
inner join dbo.fnArray(@grades, ',') gArr on gs.GradeId = convert(int, gArr.arrValue)
Obviously @subjects and @grades could be passed in via some drop-down selectors or however your UI is composed.
Edited to use dbo.fnArray, a little tool that can parse delimited strings into a list.
Now of course this would mean that if you had 2 subjects and 2 grades...like Show me all students that took ( Math and Science ) and scored ( 2 or 3 ) this would work. However if you wanted students who took Math and scored 2 or 3 or Students who took Science and scored a 3 you would have to rework the query.
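One possible way to rework the query for that last case is to pass the allowed combinations as explicit (subject, grade) pairs instead of two independent lists. This is only a sketch: the @Pairs table variable and the ID values in it are hypothetical, and it assumes the Students and StudentGrades tables defined above.

```sql
-- Hypothetical list of allowed (SubjectId, GradeId) combinations.
-- Example meaning: subject 1 with grade 2 or 3, or subject 2 with grade 3.
DECLARE @Pairs TABLE (SubjectId int NOT NULL, GradeId int NOT NULL);

INSERT INTO @Pairs (SubjectId, GradeId)
VALUES (1, 2), (1, 3), (2, 3);

-- Joining on both columns keeps only rows that match one complete pair.
SELECT st.StudentId, st.StudentName, sg.SubjectId, sg.GradeId
FROM Students st
INNER JOIN StudentGrades sg ON sg.StudentId = st.StudentId
INNER JOIN @Pairs p ON p.SubjectId = sg.SubjectId
                   AND p.GradeId = sg.GradeId;
```

Joining on both columns at once is what distinguishes this from the two independent fnArray joins, which accept any listed subject combined with any listed grade.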

What is the best way to join two tables that have comma-separated columns

Table1
ID Name Tags
----------------------------------
1 Customer1 Tag1,Tag5,Tag4
2 Customer2 Tag2,Tag6,Tag4,Tag11
3 Customer5 Tag6,Tag5,Tag10
and Table2
ID Name Tags
----------------------------------
1 Product1 Tag1,Tag10,Tag6
2 Product2 Tag2,Tag1,Tag5
3 Product5 Tag1,Tag2,Tag3
What is the best way to join Table1 and Table2 on the Tags column?
For each comma-separated tag in the Tags column of Table1, it should look for that tag in the comma-separated Tags column of Table2.
Note: Tables are not full-text indexed.
The best way is not to have comma separated values in a column. Just use normalized data and you won't have trouble with querying like this - each column is supposed to only have one value.
Without this, there's no way to use any indices, really. Even a full-text index behaves quite differently from what you might think, and they are inherently clunky to use - they're designed for searching for text, not meaningful data. In the end, you will not get much better than something like
where (Col like 'txt,%' or Col like '%,txt' or Col like '%,txt,%')
Using a xml column might be another alternative, though it's still quite a bit silly. It would allow you to treat the values as a collection at least, though.
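A minimal sketch of what the normalized layout could look like, with one row per (owner, tag) pair. The table and column names (@CustomerTags, @ProductTags, CustomerId, ProductId) are hypothetical, and table variables are used only to keep the example self-contained; the sample rows are taken from the question.

```sql
-- One row per tag instead of a comma-separated list (hypothetical names).
DECLARE @CustomerTags TABLE (CustomerId int NOT NULL, Tag varchar(50) NOT NULL,
                             PRIMARY KEY (CustomerId, Tag));
DECLARE @ProductTags  TABLE (ProductId int NOT NULL, Tag varchar(50) NOT NULL,
                             PRIMARY KEY (ProductId, Tag));

-- Customer1: Tag1,Tag5,Tag4
INSERT INTO @CustomerTags (CustomerId, Tag)
VALUES (1, 'Tag1'), (1, 'Tag5'), (1, 'Tag4');

-- Product1: Tag1,Tag10,Tag6 and Product2: Tag2,Tag1,Tag5
INSERT INTO @ProductTags (ProductId, Tag)
VALUES (1, 'Tag1'), (1, 'Tag10'), (1, 'Tag6'),
       (2, 'Tag2'), (2, 'Tag1'),  (2, 'Tag5');

-- A plain equality join; no string parsing and no LIKE.
SELECT c.CustomerId, p.ProductId, COUNT(*) AS SharedTags
FROM @CustomerTags c
INNER JOIN @ProductTags p ON p.Tag = c.Tag
GROUP BY c.CustomerId, p.ProductId;
```

With this data, customer 1 shares one tag with product 1 (Tag1) and two tags with product 2 (Tag1 and Tag5). In real tables, the equality join on Tag is something an ordinary index can support.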
I don't think there will ever be an easy and efficient solution to this. As Luaan pointed out, it is a very bad idea to store data like this : you lose most of the power of SQL when you squeeze what should be individual units of data into a single cell.
But you can manage this at the slight cost of creating two user-defined functions. First, use this brilliant recursive technique to split the strings into individual rows based on your delimiter :
CREATE FUNCTION dbo.TestSplit (@sep char(1), @s varchar(512))
RETURNS table
AS
RETURN (
    WITH Pieces(pn, start, stop) AS (
        SELECT 1, 1, CHARINDEX(@sep, @s)
        UNION ALL
        SELECT pn + 1, stop + 1, CHARINDEX(@sep, @s, stop + 1)
        FROM Pieces
        WHERE stop > 0
    )
    SELECT pn AS SplitIndex,
        SUBSTRING(@s, start, CASE WHEN stop > 0 THEN stop-start ELSE 512 END) AS SplitPart
    FROM Pieces
)
Then, make a function that takes two strings and counts the matches :
CREATE FUNCTION dbo.MatchTags (@a varchar(512), @b varchar(512))
RETURNS INT
AS
BEGIN
    RETURN
    (SELECT COUNT(*)
    FROM dbo.TestSplit(',', @a) a
    INNER JOIN dbo.TestSplit(',', @b) b
    ON a.SplitPart = b.SplitPart)
END
And that's it, here is a test roll with table variables :
DECLARE @A TABLE (Name VARCHAR(20), Tags VARCHAR(100))
DECLARE @B TABLE (Name VARCHAR(20), Tags VARCHAR(100))
INSERT INTO @A ( Name, Tags )
VALUES
( 'Customer1','Tag1,Tag5,Tag4'),
( 'Customer2','Tag2,Tag6,Tag4,Tag11'),
( 'Customer5','Tag6,Tag5,Tag10')
INSERT INTO @B ( Name, Tags )
VALUES
( 'Product1','Tag1,Tag10,Tag6'),
( 'Product2','Tag2,Tag1,Tag5'),
( 'Product5','Tag1,Tag2,Tag3')
SELECT * FROM @A a
INNER JOIN @B b ON dbo.MatchTags(a.Tags, b.Tags) > 0
I developed a solution as follows:
CREATE TABLE [dbo].[Table1](
Id int not null,
Name nvarchar(250) not null,
Tag nvarchar(250) null,
) ON [PRIMARY]
GO
CREATE TABLE [dbo].[Table2](
Id int not null,
Name nvarchar(250) not null,
Tag nvarchar(250) null,
) ON [PRIMARY]
GO
Get sample data for Table1 (this inserts 28000 records):
INSERT INTO Table1
SELECT CustomerID,CompanyName, (FirstName + ',' + LastName)
FROM AdventureWorks.SalesLT.Customer
GO 3
Sample data for Table2; I need the same tags in Table2:
declare @tag1 nvarchar(50) = 'Donna,Carreras'
declare @tag2 nvarchar(50) = 'Johnny,Caprio'
Get sample data for Table2 (this inserts 9735 records):
INSERT INTO Table2
SELECT ProductID,Name, (case when(right(ProductID,1)>=5) then @tag1 else @tag2 end)
FROM AdventureWorks.SalesLT.Product
GO 3
My Solution
create TABLE #dt (
Id int IDENTITY(1,1) PRIMARY KEY,
Tag nvarchar(250) NOT NULL
);
I've created a temp table, which I fill with the distinct tags from Table1:
insert into #dt(Tag)
SELECT distinct Tag
FROM Table1
Now I need a vertical table (one tag per row) for the tags:
create TABLE #Tags ( Tag nvarchar(250) NOT NULL );
Now I fill the #Tags table with a WHILE loop; you could use a cursor, but WHILE is faster:
declare @Rows int = 1
declare @Tag nvarchar(1024)
declare @Id int = 0
WHILE @Rows>0
BEGIN
    -- ORDER BY makes sure the next-lowest Id is picked each time, so no row is skipped
    Select Top 1 @Tag=Tag,@Id=Id from #dt where Id>@Id order by Id
    set @Rows = @@RowCount
    if @Rows>0
    begin
        insert into #Tags(Tag) SELECT Data FROM dbo.StringToTable(@Tag, ',')
    end
END
Last step: join Table2 with #Tags
select distinct t.*
from Table2 t
inner join #Tags on (',' + t.Tag + ',') like ('%,' + #Tags.Tag + ',%')
With Table1 at 28000 rows and Table2 at 9735 rows, the select takes less than 2 seconds.
I use this kind of solution with tree paths. First, put a comma at the very beginning and at the very end of the string. Then you can write
Where col1 like '%,' || col2 || ',%'
Some databases can also use an index for this kind of LIKE (Postgres does so partially), so it can also be efficient. I don't know about SQL Server.
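In SQL Server, string concatenation is usually written with + rather than ||, so the same idea can be sketched as below. The variable name and sample tags are made up for illustration; the point is that wrapping both the list and the search term in commas prevents 'Tag1' from matching inside 'Tag10'.

```sql
DECLARE @Tags varchar(100) = 'Tag10,Tag6';  -- sample list; note that it does not contain Tag1

SELECT
    -- Naive substring search: 'Tag1' is found inside 'Tag10', a false positive.
    CASE WHEN @Tags LIKE '%Tag1%' THEN 1 ELSE 0 END AS NaiveMatch,                            -- 1
    -- Delimited search: ',Tag1,' does not occur in ',Tag10,Tag6,'.
    CASE WHEN ',' + @Tags + ',' LIKE '%,' + 'Tag1' + ',%' THEN 1 ELSE 0 END AS DelimitedTag1, -- 0
    -- Delimited search for a tag that really is in the list.
    CASE WHEN ',' + @Tags + ',' LIKE '%,' + 'Tag6' + ',%' THEN 1 ELSE 0 END AS DelimitedTag6; -- 1
```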

Checking existence of all words of a column of table 1 in another column of table 2

I have a table which contains product_name field. Then another table with models.
===products
product_id, product_name
===models
model_id, model_name
I am looking for a way to do the following.
Model names can have words separated by hyphen i.e JVC-600-BLACK
For each model I need to check the existence of each words of model in product name.
I need the result in a form like the one below.
== results
model_id, product_id
If someone can point me in right direction, that would be a great help.
Notes
These are huge tables with millions of records, and the number of words in model_name is not fixed.
Words in the model may appear in any order, or in between other words, in the product name.
Here's a function that splits the first string into parts using - as a delimiter and looks up each part in the second string, returning 1 if all parts were found and 0 otherwise.
CREATE FUNCTION dbo.func(@str1 varchar(max), @str2 varchar(max))
RETURNS BIT
AS
BEGIN
    DECLARE @pos INT, @newPos INT,
        @delimiter NCHAR(1)
    SET @delimiter = '-'
    SET @pos = 1
    SET @newPos = 0
    WHILE (@newPos < LEN(@str1))
    BEGIN
        SET @newPos = CHARINDEX(@delimiter, @str1, @pos)
        IF @newPos = 0
            SET @newPos = LEN(@str1)+1
        DECLARE @data2 NVARCHAR(MAX)
        SET @data2 = SUBSTRING(@str1, @pos, @newPos-@pos)
        IF CHARINDEX(@data2, @str2) = 0
            RETURN 0
        SET @pos = @newPos + 1
        IF @newPos = 0
            BREAK
    END
    RETURN 1
END
You can use the above function for your problem as follows:
SELECT model_id, product_id
FROM models
JOIN products
ON dbo.func(models.model_name, products.product_name) = 1
It's not going to be fast, but I don't think a fast solution exists, since your problem doesn't allow for indexing. It may be possible to change the database structure to allow for this, but how exactly this can be done largely depends on what your data looks like.
I don't know if this solution is faster, for you to check if you care:
--=======================
-- sample data
-- ======================
declare @Products table
(
    product_id int,
    product_name nvarchar(max)
)
insert into @Products select 1, 'sdfsd def1 abc1klm1 sdljkfd'
insert into @Products select 2, 'sdfsd def2 abc2klm2 sdljkfd'
insert into @Products select 3, 'sdfsd def3 abc3klm3 sdljkfd'
declare @Models table
(
    model_id int,
    model_name nvarchar(max)
)
insert into @Models select 1, 'abc1-def1-klm1'
insert into @Models select 2, 'abc2-def2-klm2'
insert into @Models select 3, 'abc3-def3-klm3'
--=======================
-- solution
-- ======================
select t1.product_id, t2.model_id from @Products t1
cross join (
    select
        t1.model_id, Word = t2.r.value('.', 'nvarchar(max)')
    from (select model_id, x = cast('<e>' + replace(model_name, '-', '</e><e>') + '</e>' as xml) from @Models ) t1
    cross apply x.nodes('e') as t2 (r)
) t2
group by product_id, model_id
having min(charindex(word, product_name)) != 0
You may want to consider using the Full Text Search feature of SQL Server. In a nutshell, it catalogs all of the words in the tables and columns you specify when setting up the Full Text Catalog (ignoring noise words like "and", "or", "a" and "the" by default, though this list of noise words is configurable) and offers a handful of functions that let you use that catalog to quickly find rows.
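As a rough sketch of what that could look like for the products table: the catalog name ProductCatalog and the index name PK_products are hypothetical, the Full-Text Search feature has to be installed, and the key index must be a unique, single-column, non-nullable index on the table. The search string below is written by hand for the model 'JVC-600-BLACK' from the question.

```sql
-- Hypothetical names: ProductCatalog (catalog) and PK_products (unique key index on products).
CREATE FULLTEXT CATALOG ProductCatalog;

CREATE FULLTEXT INDEX ON products (product_name)
    KEY INDEX PK_products
    ON ProductCatalog;

-- Find products whose name contains every word of the model 'JVC-600-BLACK'.
SELECT product_id
FROM products
WHERE CONTAINS(product_name, '"JVC" AND "600" AND "BLACK"');
```

The full-text index is populated in the background, so results may be incomplete right after it is created. Matching every model against every product would still require building one such search condition per model, for example in a loop or with dynamic SQL.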

Finding Uppercase Character then Adding Space

I bought a SQL World City/State database. In the state database it has the state names pushed together. Example: "NorthCarolina", or "SouthCarolina"...
Is there a way in SQL to loop through the string, find the uppercase characters, and add a space before each one?
That way "NorthCarolina" becomes "North Carolina".
Create this function
if object_id('dbo.SpaceBeforeCaps') is not null
drop function dbo.SpaceBeforeCaps
GO
create function dbo.SpaceBeforeCaps(@s varchar(100)) returns varchar(100)
as
begin
    declare @return varchar(100);
    set @return = left(@s,1);
    declare @i int;
    set @i = 2;
    while @i <= len(@s)
    begin
        if ASCII(substring(@s,@i,1)) between ASCII('A') and ASCII('Z')
            set @return = @return + ' ' + substring(@s,@i,1)
        else
            set @return = @return + substring(@s,@i,1)
        set @i = @i + 1;
    end;
    return @return;
end;
GO
Then you can use it to update your database
update tbl set statename = dbo.SpaceBeforeCaps(statename);
There's a couple ways to approach this
Construct a function using a pattern and the PATINDEX feature.
Chain minimal REPLACE statements for each case (e.g. REPLACE(state_name, 'hC', 'h C') for your example case). This is kind of a hack, but might actually give you the best performance, since you have such a small set of replacements.
If you absolutely cannot create functions and need this as a one-off, you can use a recursive CTE to break the string up (and add the space at the same time where required), then recombine the characters using FOR XML. Elaborate example below:
-- some sample data
create table #tmp (id int identity primary key, statename varchar(100));
insert #tmp select 'NorthCarolina';
insert #tmp select 'SouthCarolina';
insert #tmp select 'NewSouthWales';
-- the complex query updating the "statename" column in the "#tmp" table
;with cte(id,seq,char,rest) as (
select id,1,cast(left(statename,1) as varchar(2)), stuff(statename,1,1,'')
from #tmp
union all
select id,seq+1,case when ascii(left(rest,1)) between ascii('A') and ascii('Z')
then ' ' else '' end + left(rest,1)
, stuff(rest,1,1,'')
from cte
where rest > ''
), recombined as (
select a.id, (select b.char+''
from cte b
where a.id = b.id
order by b.seq
for xml path, type).value('/','varchar(100)') fixed
from cte a
group by a.id
)
update t
set statename = c.fixed
from #tmp t
join recombined c on c.id = t.id
where statename != c.fixed;
-- check the result
select * from #tmp
----------- -----------
id statename
----------- -----------
1 North Carolina
2 South Carolina
3 New South Wales
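The PATINDEX approach from the list above is not shown in code, so here is a minimal sketch of one way it might look. The binary collation (Latin1_General_BIN) is an assumption added to make the uppercase test case-sensitive regardless of the database default, and the loop is written as a plain batch over a variable; it could be wrapped in a function in the same way as dbo.SpaceBeforeCaps.

```sql
DECLARE @s varchar(100) = 'NewSouthWales';
DECLARE @pos int;

-- Find a non-space character that is immediately followed by an uppercase letter.
-- The binary collation makes the character class match uppercase letters only.
SET @pos = PATINDEX('%[^ ][ABCDEFGHIJKLMNOPQRSTUVWXYZ]%', @s COLLATE Latin1_General_BIN);

WHILE @pos > 0
BEGIN
    -- Insert a space between the two characters (zero characters are deleted).
    SET @s = STUFF(@s, @pos + 1, 0, ' ');
    SET @pos = PATINDEX('%[^ ][ABCDEFGHIJKLMNOPQRSTUVWXYZ]%', @s COLLATE Latin1_General_BIN);
END

SELECT @s AS statename;  -- New South Wales
```

Requiring a non-space character before the capital means the first letter of the string is left alone and an already-spaced name is not spaced again.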

Finding strings with duplicate letters inside

Can somebody help me with this little task? What I need is a stored procedure that can find duplicate letters (in a row) in a string from a table "a" and after that make a new table "b" with just the id of the string that has a duplicate letter.
Something like this:
Table A
ID Name
1 Matt
2 Daave
3 Toom
4 Mike
5 Eddie
And from that table I can see that Daave, Toom, Eddie have duplicate letters in a row and I would like to make a new table and list their ID's only. Something like:
Table B
ID
2
3
5
Only 2,3,5 because that is the ID of the string that has duplicate letters in their names.
I hope this is understandable and would be very grateful for any help.
In your answer with stored procedure, you have 2 mistakes, one is missing space between column name and LIKE clause, second is missing single quotes around search parameter.
First, I create a user-defined scalar function that returns 1 if the string contains duplicate letters:
EDITED
CREATE FUNCTION FindDuplicateLetters
(
    @String NVARCHAR(50)
)
RETURNS BIT
AS
BEGIN
    DECLARE @Result BIT = 0
    DECLARE @Counter INT = 1
    WHILE (@Counter <= LEN(@String) - 1)
    BEGIN
        IF(ASCII((SELECT SUBSTRING(@String, @Counter, 1))) = ASCII((SELECT SUBSTRING(@String, @Counter + 1, 1))))
        BEGIN
            SET @Result = 1
            BREAK
        END
        SET @Counter = @Counter + 1
    END
    RETURN @Result
END
GO
Once the function has been created, just call it from a simple SELECT query like the following:
SELECT
*
FROM
(SELECT
*,
dbo.FindDuplicateLetters(ColumnName) AS Duplicates
FROM TableName) AS a
WHERE a.Duplicates = 1
With this combination, you will get just the rows that have duplicate letters.
In any version of SQL, you can do this with a brute force approach:
select *
from t
where t.name like '%aa%' or
t.name like '%bb%' or
. . .
t.name like '%zz%'
If you have a case sensitive collation, then use:
where lower(t.name) like '%aa%' or
. . .
Here's one way.
First create a table of numbers
CREATE TABLE dbo.Numbers
(
number INT PRIMARY KEY
);
INSERT INTO dbo.Numbers
SELECT number
FROM master..spt_values
WHERE type = 'P'
AND number > 0;
Then with that in place you can use
SELECT *
FROM TableA
WHERE EXISTS (SELECT *
FROM dbo.Numbers
WHERE number < LEN(Name)
AND SUBSTRING(Name, number, 1) = SUBSTRING(Name, number + 1, 1))
Though this is an old post, it's worth posting a solution that will be faster than a brute-force approach or one that uses a scalar UDF (which generally drags down performance). Using NGrams8K, this is rather simple.
--sample data
declare @table table (id int identity primary key, [name] varchar(20));
insert @table([name]) values ('Mattaa'),('Daave'),('Toom'),('Mike'),('Eddie');
-- solution #1
select id
from @table
cross apply dbo.NGrams8k([name],1)
where charindex(replicate(token,2), [name]) > 0
group by id;
-- solution #2 (SQL 2012+ solution using LAG)
select id
from
(
    select id, token, prevToken = lag(token,1) over (partition by id order by position)
    from @table
    cross apply dbo.NGrams8k([name],1)
) prep
where token = prevToken
group by id; -- optional, if you want to remove possible duplicates.
Another brute-force way (note that ~ is the PostgreSQL regular-expression operator, not T-SQL):
select *
from t
where t.name ~ '(.)\1';