Trouble extracting First and LastName from Full Name column - sql

I have a FullName column and I am extracting the first name and last name using the following query:
select SUBSTRING(FULL_NAME, 1, CHARINDEX(' ', FULL_NAME) - 1) AS FirstName,
SUBSTRING(FULL_NAME, CHARINDEX(' ', FULL_NAME) + 1, 500) AS LastName
from [dbo].[TABLE]
But in the Full Name column there are just First names, some 10 digit phone numbers, 4 digit extensions and some text like 'this is a special case'.
How should I modify my query to accommodate these exceptions? And also when there are only single words in the Full Name column I am getting this following error message:
"Invalid length parameter passed to the LEFT or SUBSTRING function."

Parsing good names from free form fields is not an easy task...
I would suggest a dual approach.
Identify common patterns, i.e. you might find phone number with something like this
Where IsNumeric(Replace(Field,'-',''))=1
and you might identify single names with
Where charindex(' ',trim(field))=0
etc.
Once you've identified them, then write code to attempt to split them...
So you might use the code you have above with the following WHERE clause
select SUBSTRING(FULL_NAME, 1, CHARINDEX(' ', FULL_NAME) - 1) AS FirstName,
SUBSTRING(FULL_NAME, CHARINDEX(' ', FULL_NAME) + 1, 500) AS LastName
from [dbo].[TABLE]
where charindex(' ',trim(FULL_NAME))>0 and IsNumeric(Replace(FULL_NAME,'-',''))=0
Use the WHERE clauses to (a) make sure you only get records you can parse and (b) help identify the oddball cases you'll likely need to do by hand...
Good luck
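The error itself comes from CHARINDEX returning 0 when there is no space, which makes the SUBSTRING length -1. As an alternative to filtering those rows out, a CASE guard can keep them in the result. This is a minimal sketch, assuming single-word values should land in FirstName with an empty LastName; the inline VALUES list is made-up sample data standing in for the real table:

```sql
-- Guard both expressions so CHARINDEX(...) - 1 is never used when it would be -1.
SELECT FULL_NAME,
       CASE WHEN CHARINDEX(' ', FULL_NAME) > 0
            THEN LEFT(FULL_NAME, CHARINDEX(' ', FULL_NAME) - 1)
            ELSE FULL_NAME                -- no space: treat the whole value as the first name
       END AS FirstName,
       CASE WHEN CHARINDEX(' ', FULL_NAME) > 0
            THEN SUBSTRING(FULL_NAME, CHARINDEX(' ', FULL_NAME) + 1, 500)
            ELSE ''                       -- no space: no last name
       END AS LastName
FROM (VALUES ('john smith'), ('evan'), ('5551234567')) AS t(FULL_NAME);  -- hypothetical rows
```

Phone numbers and free-text notes still come through as "names" here, so the pattern checks above remain useful for flagging them.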

You could go with a function; this lets you put any logic you need into the transform and keeps things a bit more readable:
create function dbo.namepart( @fullname varchar(50), @part varchar(5))
returns varchar(50)
as
begin
declare @first varchar(50)
declare @last varchar(50)
declare @sp int
if @fullname like '%special value%' return ''
if @fullname like '% %'
begin
set @sp = CHARINDEX(' ', @fullname)
set @first = left(@fullname, @sp - 1)
set @last = substring(@fullname,@sp + 1 ,50)
if isnumeric(@last) <> 0 set @last = ''
end
else
begin
set @first = @fullname
set @last = ''
end
if @part like 'f%'
return @first
else
return @last
return ''
end
Sample data
create table blah(
full_name varchar(50)
)
insert into blah values ( 'john smith' ), ('paul 12345'),('evan'),('special value')
And see if it works
select
dbo.namepart(full_name,'first') first,
dbo.namepart(full_name,'last') last,
full_name
from blah
http://sqlfiddle.com/#!6/eb28f/2

Related

MS SQL Server - replace names while avoiding words containing the names

This is my first time posting on Stack Overflow, so please let me know if I can do anything better or provide more information.
I have been working on this issue for a few days now. I have a table with comments from employees about the company. Some of them could refer to specific employees in the company. For HR reasons, we want to replace any occurrence of an employee name with the word 'employee'. We aren't accounting for typos or misspellings.
An example of my desired outcome would be:
Input: 'I dislike dijon mustard. My boss Jon sucks.'
Name to search for: 'Jon'
Output: 'I dislike dijon mustard. My boss employee sucks.'
Another example:
Input: 'Aggregating data is boring. Greg is the worst person ever.'
Name to search for: 'Greg'
Output: 'Aggregating data is boring. employee is the worst person ever.'
I want to search the comments for occurrences of the employee names, but only if they aren't followed by other letters or numbers on either end. Occurrences with spaces or punctuation on either end of the name should be replaced.
So far I have tried the suggestions in the following threads:
How to replace a specific word in a sentence without replacing in substring in SQL Server
This yielded the following
update c
set c.Comment = rtrim(ltrim(Replace(replace(' ' + c.Comment + ' ',' ' + en.FirstName + ' ', 'employee'), ' ' + en.FirstName + ' ', 'employee')))
from AnswerComment c
join #EmployeeNames en on en.SurveyId = c.SurveyId
and c.Comment like '%' + en.FirstName + '%'
However, I got results like this:
Input: 'I hate bob.'
Name to search for: 'Bob'
Output: 'I hate bob.'
Input: 'Jon sucks'
Name to search for: 'Jon'
Output: 'employeesucks'
A coworker looked at this thread Replace whole word using ms sql server "replace"
and gave me the following based off of it:
DECLARE @token VARCHAR(10) = 'bob';
DECLARE @replaceToken VARCHAR(10) = 'employee';
DECLARE @paddedToken VARCHAR(10) = ' ' + @token + ' ';
DECLARE @paddedReplaceToken VARCHAR(10) = ' ' + @replaceToken + ' ';
;WITH Step1 AS (
SELECT CommentorId
, QuestionId
, Comment
, REPLACE(Comment, @paddedToken, @paddedReplaceToken) AS [Value]
FROM AnswerComment
WHERE SurveyId = 90492
AND Comment LIKE '%' + @token + '%'
), Step2 AS (
SELECT CommentorId
, QuestionId
, Comment
, REPLACE([Value], @paddedToken, @paddedReplaceToken) AS [Value]
FROM Step1
), Step3 AS (
SELECT CommentorId
, QuestionId
, Comment
, IIF(CHARINDEX(LTRIM(@paddedToken), [Value]) = 1, STUFF([Value], 1, LEN(TRIM(@paddedToken)), TRIM(@paddedReplaceToken)), [Value]) AS [Value]
FROM Step2
)
SELECT CommentorId
, QuestionId
, Comment
, IIF(CHARINDEX(REVERSE(RTRIM(@paddedToken)), REVERSE([Value])) = 1,
REVERSE(STUFF(REVERSE([Value]), CHARINDEX(REVERSE(RTRIM(@paddedToken)), REVERSE([Value])), LEN(RTRIM(@paddedToken)), REVERSE(RTRIM(@paddedReplaceToken)))),
[Value])
FROM Step3;
But I have no idea how I would implement this.
Another thread I can't find anymore suggested using %[^a-z0-9A-Z]% for searching, like this:
update c
set c.Comment = REPLACE(c.Comment, en.FirstName, 'employee')
from AnswerComment c
join #EmployeeNames en on en.SurveyId = c.SurveyId
and c.Comment like '%' + en.FirstName + '%'
and c.Comment not like '%[^a-z0-9A-Z]%' + en.FirstName + '%[^a-z0-9A-Z]%'
select @@ROWCOUNT [first names replaced]
This doesn't work for me. It replaces occurrences of the employee names even if they're part of a larger word, like in this example:
Input: 'I dislike dijon mustard.'
Name to search for: 'Jon'
Output: 'I dislike diemployee mustard.'
At this point it seems to me that it's impossible to accomplish this. Is there anything wrong with how I've implemented these, or anything obvious that I'm missing?
Here is a method that uses a combination of STUFF and PATINDEX.
It'll only replace the first occurrence of the name in the comment.
So it might have to be executed more than once till nothing gets updated by it.
UPDATE c
SET c.Comment = STUFF(c.Comment, PATINDEX('%[^a-z0-9]'+en.FirstName+'[^a-z0-9]%', '/'+c.Comment+'/'), len(en.FirstName), 'employee')
FROM AnswerComment c
JOIN #EmployeeNames en ON en.SurveyId = c.SurveyId
WHERE '/'+c.Comment+'/' LIKE '%[^a-z0-9]'+en.FirstName+'[^a-z0-9]%';
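Since each run replaces only one occurrence per comment, the statement can be wrapped in a loop that stops once no row is updated. This is a minimal sketch, assuming a case-insensitive collation; the table variables @Comments and @Names and their rows are made up to stand in for AnswerComment and #EmployeeNames:

```sql
-- Hypothetical sample data.
DECLARE @Comments TABLE (SurveyId int, Comment varchar(200));
DECLARE @Names TABLE (SurveyId int, FirstName varchar(50));
INSERT INTO @Comments VALUES (1, 'Jon likes dijon. Ask Jon, not jonathan.');
INSERT INTO @Names VALUES (1, 'Jon');

DECLARE @rows int = 1;
WHILE @rows > 0
BEGIN
    -- Same STUFF/PATINDEX update as above: replaces one whole-word match per row.
    UPDATE c
    SET c.Comment = STUFF(c.Comment,
                          PATINDEX('%[^a-z0-9]' + en.FirstName + '[^a-z0-9]%', '/' + c.Comment + '/'),
                          LEN(en.FirstName),
                          'employee')
    FROM @Comments c
    JOIN @Names en ON en.SurveyId = c.SurveyId
    WHERE '/' + c.Comment + '/' LIKE '%[^a-z0-9]' + en.FirstName + '[^a-z0-9]%';

    SET @rows = @@ROWCOUNT;   -- 0 once no comment contains a standalone name
END

SELECT Comment FROM @Comments;
```

The loop would never end if a name matched the replacement word itself (for example a first name of 'employee'), so such rows would need to be excluded.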
Something like this seems to work.
declare @charsTable table (notallowed char(1))
insert into @charsTable (notallowed) values (',')
insert into @charsTable (notallowed) values ('.')
insert into @charsTable (notallowed) values (' ')
declare @input nvarchar(max) = 'Aggregating data is boring. Greg is the worst person ever.'
declare @name nvarchar(50) = 'Greg'
--declare @input nvarchar(max) = 'I dislike dijon mustard. You know who sucks? My boss Jon.'
--declare @name nvarchar(50) = 'Jon'
select case when @name + notallowed = value or notallowed + @name = value or notallowed + @name + notallowed = value then replace(value, @name, 'employee') else value end 'data()' from string_split(@input, ' ')
left join @charsTable on @name + notallowed = value or notallowed + @name = value or notallowed + @name + notallowed = value
for xml path('')
Results:
Aggregating data is boring. employee is the worst person ever.
I dislike dijon mustard. You know who sucks? My boss employee.

How to Specify Trim Chars in SQL TRIM

I have a table Employee in which some values start with ", ". I need to remove the comma and white space at the beginning of the name in a SELECT query using LTRIM() (SQL Server).
My Table : Employee
CREATE TABLE Employee
(
PersonID int,
ContactName varchar(255),
Address varchar(255),
City varchar(255)
);
INSERT INTO Employee(PersonID, ContactName, Address, City)
VALUES ('1001',', B. Bala','21, Car Street','Bangalore');
SELECT PersonID, ContactName, Address, City FROM Employee
Here the ContactName Column has a value ", B. Bala". I need to remove the comma and white-space at the beginning of the name.
Alas, SQL Server does not support the ANSI standard functionality of specifying the characters for LTRIM().
In this case, you can use:
(case when ContactName like ', %' then stuff(ContactName, 1, 2, '')
else ContactName
end)
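If you are on SQL Server 2017 or later, the TRIM function (unlike LTRIM there) does accept a list of characters to strip. A minimal sketch, assuming that version and using a literal in place of the ContactName column; note that TRIM strips the listed characters from both ends, so a trailing comma or space would be removed as well:

```sql
-- TRIM ( [ characters FROM ] string ) is available from SQL Server 2017.
-- The literal stands in for Employee.ContactName.
SELECT TRIM(', ' FROM ', B. Bala') AS ContactName;
```

If only the leading characters may be removed, the CASE/STUFF expression above is the safer choice.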
You could potentially use PATINDEX() in order to get this done.
DECLARE @Text VARCHAR(50) = ', Well Crap';
SELECT STUFF(@Text, 1, PATINDEX('%[A-z]%', @Text) - 1, '');
This would output Well Crap. PATINDEX() will find first letter in your word and cut everything before it.
It works fine even if there's no leading rubbish:
DECLARE @Text VARCHAR(50) = 'Mister Roboto';
SELECT STUFF(@Text, 1, PATINDEX('%[A-z]%', @Text) - 1, '');
This outputs Mister Roboto
If there are no valid characters, let's say ContactName is , 9132124, :::, this would output NULL, if you'd like to get blank result, you can use COALESCE():
DECLARE @Text VARCHAR(50) = ', 9132124, :::';
SELECT COALESCE(STUFF(@Text, 1, PATINDEX('%[A-z]%', @Text) - 1, ''), '');
This will output an empty string.
You could also use REPLACE.....
eg.
REPLACE( ' ,Your String with space comma', ' ,', '')
UPDATE dbo.Employee
SET
dbo.Employee.ContactName = replace(LEFT(ContactName, 2),', ','')
+ SUBSTRING (ContactName, 3, len(contactname))
where LEFT(ContactName, 2)=', '
This will only update rows where the first two characters are ', '.

sybase ISNULL - are the conditions too long?

I am trying to unmangle a very old variable. It was a full name entered all in one field and is found in two separate tables. In most newer places the name is in three logical columns, first, middle, last. In these it is all together and I am trying to strip it out to first initial, last name. I found the following here and modified it:
http://dbaspot.com/sqlserver-programming/365656-find-last-word-string.html
DECLARE @full_name VARCHAR(20)
DECLARE @fullname VARCHAR(20)
SELECT @full_name = REVERSE('John A Test')
SELECT @fullname = REVERSE('Joe Q Public')
SELECT @final = ISNULL(
(right(@full_name,1) + '. ' + (REVERSE(SUBSTRING(@full_name, 1,
CHARINDEX(' ',@full_name) - 1)))),
(right(@fullname,1) + '. ' + (REVERSE(SUBSTRING(@fullname, 1,
CHARINDEX(' ',@fullname) - 1 ))))
)
When I leave full_name there it works fine...returns J. Test. When I null it the default should be
J. Public but instead ends up as .. When I test each line separately they work. I tried COALESCE as well with the same results. Is this too many brackets for ISNULL or ????
You have a problem with right(@full_name,1) + '. ', for example:
select null+'.'
gives you ..
Try to change your code using case as below:
DECLARE @full_name VARCHAR(20)
DECLARE @fullname VARCHAR(20)
DECLARE @final VARCHAR(20)
SELECT @full_name = null--REVERSE('John A Test')
SELECT @fullname = REVERSE('Joe Q Public')
SELECT @final = case
when @full_name is not null
then (right(@full_name,1) + '. ' + (REVERSE(SUBSTRING(@full_name, 1,
CHARINDEX(' ',@full_name) - 1))))
else (right(@fullname,1) + '. ' + (REVERSE(SUBSTRING(@fullname, 1,
CHARINDEX(' ',@fullname) - 1 ))))
end
select @final

Using PATINDEX to find varying length patterns in T-SQL

I'm looking to pull floats out of some varchars, using PATINDEX() to spot them. I know in each varchar string, I'm only interested in the first float that exists, but they might have different lengths.
e.g.
'some text 456.09 other text'
'even more text 98273.453 la la la'
I would normally match these with a regex
"[0-9]+[.][0-9]+"
However, I can't find an equivalent for the + operator, which PATINDEX accepts. So they would need to be matched (respectively) with:
'[0-9][0-9][0-9].[0-9][0-9]' and '[0-9][0-9][0-9][0-9][0-9].[0-9][0-9][0-9]'
Is there any way to match both of these example varchars with one single valid PATINDEX pattern?
I blogged about this a while ago.
Extracting numbers with SQL server
Declare @Temp Table(Data VarChar(100))
Insert Into @Temp Values('some text 456.09 other text')
Insert Into @Temp Values('even more text 98273.453 la la la')
Insert Into @Temp Values('There are no numbers in this one')
Select Left(
SubString(Data, PatIndex('%[0-9.-]%', Data), 8000),
PatIndex('%[^0-9.-]%', SubString(Data, PatIndex('%[0-9.-]%', Data), 8000) + 'X')-1)
From @Temp
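To see how the nested expression works, it can help to compute the pieces separately. This is a sketch using one of the sample strings; the variable names @Data, @start, @tail and @len are made up for illustration:

```sql
DECLARE @Data varchar(100) = 'some text 456.09 other text';

-- 1. Position of the first character that could belong to a number (digit, dot or minus).
DECLARE @start int = PATINDEX('%[0-9.-]%', @Data);

-- 2. Everything from that position to the end of the string.
DECLARE @tail varchar(100) = SUBSTRING(@Data, @start, 8000);

-- 3. Position of the first character in the tail that cannot belong to a number.
--    The appended 'X' guarantees a match when the number runs to the end of the string.
DECLARE @len int = PATINDEX('%[^0-9.-]%', @tail + 'X') - 1;

SELECT @start AS StartPos, @tail AS Tail, @len AS NumberLength, LEFT(@tail, @len) AS Number;
```

Because the character class also accepts '.' and '-', a stray full stop or hyphen that appears before the first digit would be picked up instead of the number, so the input may need cleaning first.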
Wildcards.
SELECT PATINDEX('%[0-9]%[0-9].[0-9]%[0-9]%','some text 456.09 other text')
SELECT PATINDEX('%[0-9]%[0-9].[0-9]%[0-9]%','even more text 98273.453 la la la')
Yes you need to link to the clr to get regex support. But if PATINDEX does not do what you need then regex was designed exactly for that.
http://msdn.microsoft.com/en-us/magazine/cc163473.aspx
Should be checked for robustness (what if you only have an int, for example), but this is just to put you on a track:
if exists (select routine_name from information_schema.routines where routine_name = 'GetFirstFloat')
drop function GetFirstFloat
go
create function GetFirstFloat (@string varchar(max))
returns float
as
begin
declare @float varchar(max)
declare @pos int
select @pos = patindex('%[0-9]%', @string)
select @float = ''
while isnumeric(substring(@string, @pos, 1)) = 1
begin
select @float = @float + substring(@string, @pos, 1)
select @pos = @pos + 1
end
return cast(@float as float)
end
go
select dbo.GetFirstFloat('this is a string containing pi 3.14159216 and another non float 3 followed by a new fload 5.41 and that''s it')
select dbo.GetFirstFloat('this is a string with no float')
select dbo.GetFirstFloat('this is another string with an int 3')
Given that the pattern is going to vary in length, you're going to have a rough time getting this to work with PATINDEX. There is another post that I wrote, which I've modified to accomplish what you're trying to do here. Will this work for you?
CREATE TABLE #nums (n INT)
DECLARE @i INT
SET @i = 1
WHILE @i < 8000
BEGIN
INSERT #nums VALUES(@i)
SET @i = @i + 1
END
CREATE TABLE #tmp (
id INT IDENTITY(1,1) not null,
words VARCHAR(MAX) null
)
INSERT INTO #tmp
VALUES('I''m looking for a number, regardless of length, even 23.258 long'),('Maybe even pi which roughly 3.14159265358,'),('or possibly something else that isn''t a number')
UPDATE #tmp SET words = REPLACE(words, ',',' ')
;WITH CTE AS (SELECT ROW_NUMBER() OVER (ORDER BY ID) AS rownum, ID, NULLIF(SUBSTRING(' ' + words + ' ' , n , CHARINDEX(' ' , ' ' + words + ' ' , n) - n) , '') AS word
FROM #nums, #tmp
WHERE ID <= LEN(' ' + words + ' ') AND SUBSTRING(' ' + words + ' ' , n - 1, 1) = ' '
AND CHARINDEX(' ' , ' ' + words + ' ' , n) - n > 0),
ids AS (SELECT ID, MIN(rownum) AS rownum FROM CTE WHERE ISNUMERIC(word) = 1 GROUP BY id)
SELECT CTE.rownum, cte.id, cte.word
FROM CTE, ids WHERE cte.id = ids.id AND cte.rownum = ids.rownum
The explanation and origin of the code are covered in more detail in the original post.
PATINDEX is not powerful enough to do that. You should use regular expressions.
SQL Server has supported regular expressions through CLR integration since SQL Server 2005.

SQL: problem word count with len()

I am trying to count the words of text stored in a table column. Therefore I am using the following query:
SELECT LEN(ExtractedText) -
LEN(REPLACE(ExtractedText, ' ', '')) + 1 from EDDSDBO.Document where ID='100'
I receive a wrong result that is much too high.
On the other hand, if I copy the text directly into the statement then it works, i.e.
SELECT LEN('blablabla text') - LEN(REPLACE('blablabla text', ' ', '')) + 1
Now the datatype is nvarchar(max) since the text is very long. I have already tried to convert the column into text or ntext and to apply datalength() instead of len(). Nevertheless I get the same result: it works on a string literal but not on the table column.
You're counting spaces not words. That will typically yield an approximate answer.
e.g.
' this string will give an incorrect result '
Try this approach: http://www.sql-server-helper.com/functions/count-words.aspx
CREATE FUNCTION [dbo].[WordCount] ( @InputString VARCHAR(4000) )
RETURNS INT
AS
BEGIN
DECLARE @Index INT
DECLARE @Char CHAR(1)
DECLARE @PrevChar CHAR(1)
DECLARE @WordCount INT
SET @Index = 1
SET @WordCount = 0
WHILE @Index <= LEN(@InputString)
BEGIN
SET @Char = SUBSTRING(@InputString, @Index, 1)
SET @PrevChar = CASE WHEN @Index = 1 THEN ' '
ELSE SUBSTRING(@InputString, @Index - 1, 1)
END
IF @PrevChar = ' ' AND @Char != ' '
SET @WordCount = @WordCount + 1
SET @Index = @Index + 1
END
RETURN @WordCount
END
GO
usage
DECLARE @String VARCHAR(4000)
SET @String = 'Health Insurance is an insurance against expenses incurred through illness of the insured.'
SELECT [dbo].[WordCount] ( @String )
Leading spaces, trailing spaces, two or more spaces between the neighbouring words – these are the likely causes of the wrong results you are getting.
The functions LTRIM() and RTRIM() can help you eliminate the first two issues. As for the third one, you can use REPLACE(ExtractedText, '  ', ' ') to replace double spaces with single ones, but I'm not sure whether you also have triple ones (in which case you'd need to repeat the replacement).
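One way to avoid repeating the replacement is a nested REPLACE that collapses a run of spaces of any length in a single pass. This is a minimal sketch, assuming the characters '<' and '>' never occur in the text; the variable names and the sample string are made up for illustration:

```sql
DECLARE @s nvarchar(max) = N'  two   spaces  here ';

-- Turn every space into '<>', delete each '><' that forms between neighbours,
-- then turn the single '<>' left per run back into one space, and trim the ends.
DECLARE @clean nvarchar(max) =
    LTRIM(RTRIM(REPLACE(REPLACE(REPLACE(@s, N' ', N'<>'), N'><', N''), N'<>', N' ')));

-- With single spaces only, counting spaces + 1 gives the number of words.
SELECT CASE WHEN @clean = N'' THEN 0
            ELSE LEN(@clean) - LEN(REPLACE(@clean, N' ', N'')) + 1
       END AS WordCount;
```

Tabs and line breaks are not treated as separators here, so they would have to be replaced with spaces first if the column contains them.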
UPDATE
Here's a UDF that uses CTEs and ranking to eliminate extra spaces and then counts the remaining ones to return the quantity as the number of words:
CREATE FUNCTION fnCountWords (@Str varchar(max))
RETURNS int
AS BEGIN
DECLARE @xml xml, @res int;
SET @Str = RTRIM(LTRIM(@Str));
WITH split AS (
SELECT
idx = number,
chr = SUBSTRING(@Str, number, 1)
FROM master..spt_values
WHERE type = 'P'
AND number BETWEEN 1 AND LEN(@Str)
),
ranked AS (
SELECT
idx,
chr,
rnk = idx - ROW_NUMBER() OVER (PARTITION BY chr ORDER BY idx)
FROM split
)
SELECT @res = COUNT(DISTINCT rnk) + 1
FROM ranked
WHERE chr = ' ';
RETURN @res;
END
With this function your query will be simply like this:
SELECT dbo.fnCountWords(ExtractedText)
FROM EDDSDBO.Document
WHERE ID='100'
UPDATE 2
The function uses one of the system tables, master..spt_values, as a tally table. The particular subset used contains only values from 0 to 2047. This means the function will not work correctly for inputs longer than 2047 characters (after trimming both leading and trailing spaces), as @t-clausen.dk has correctly noted in his comment. Therefore, a custom tally table should be used if longer input strings are possible.
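One way to get past the 2047 limit without creating a permanent table is to generate the numbers on the fly. This is a minimal sketch, assuming 10,000 values are enough; the CTE names e1, e2, e4 and tally are made up for illustration:

```sql
WITH e1(n) AS (SELECT 1 FROM (VALUES (1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) AS v(n)), -- 10 rows
     e2(n) AS (SELECT 1 FROM e1 a CROSS JOIN e1 b),                                     -- 100 rows
     e4(n) AS (SELECT 1 FROM e2 a CROSS JOIN e2 b),                                     -- 10,000 rows
     tally(number) AS (SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM e4)
-- Sanity check: the tally runs from 1 to 10,000.
SELECT COUNT(*) AS Cnt, MIN(number) AS MinN, MAX(number) AS MaxN
FROM tally;
```

Inside the function, such a tally CTE could stand in for master..spt_values; adding another cross join raises the ceiling further.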
Replace each space with something that never occurs in your text, like ' $!' (or pick another value).
Then replace all '$! ' and '$!' with nothing; this way you never have more than one space after a word. Then use your current script. I have defined a word as a space followed by a non-space.
This is an example
DECLARE @T TABLE(COL1 NVARCHAR(2000), ID INT)
INSERT @T VALUES('A B C D', 100)
SELECT LEN(C) - LEN(REPLACE(C,' ', '')) COUNT FROM (
SELECT REPLACE(REPLACE(REPLACE(' ' + COL1, ' ', ' $!'), '$! ',''), '$!', '') C
FROM @T ) A
Here is a recursive solution
DECLARE @T TABLE(COL1 NVARCHAR(2000), ID INT)
INSERT @T VALUES('A B C D', 100)
INSERT @T VALUES('have a nice day with 7 words', 100)
;WITH CTE AS
(
SELECT 1 words, col1 c, col1 FROM @t WHERE id = 100
UNION ALL
SELECT words +1, right(c, len(c) - patindex('% [^ ]%', c)), col1 FROM cte
WHERE patindex('% [^ ]%', c) > 0
)
SELECT words, col1 FROM cte WHERE patindex('% [^ ]%', c) = 0
You should declare the column using the varchar data type, like:
create table emp(ename varchar(22));
insert into emp values('amit');
select ename,len(ename) from emp;
output : 4