Removing Punctuation from varchar

I looked at the other threads related to this topic and tried to incorporate the information into my query, without success. I want to compare two data fields without taking punctuation into consideration. I tried PATINDEX, and I tried STUFF with PATINDEX, which removes some, but not all, of the punctuation. Here is my masked query. Any feedback is appreciated. My SQL skills are limited to whatever I can read or learn by asking others. TIA
SELECT TABLE_1.FIELD_1,
TABLE_1.FIELD_2,
STUFF(TABLE_1.FIELD_3, PATINDEX('%[.,&!<>;:$- ]%',TABLE_1.FIELD_3),1, '') AS FIELD_3,
STUFF(TABLE_2.FIELD_4, PATINDEX('%[.,&!<>;:$- ]%',TABLE_2.FIELD_4),1,'') AS FIELD_4,
TABLE_1.FIELD_6,
TABLE_1.FIELD_7,
TABLE_1.FIELD_8,
TABLE_1.FIELD_9,
TABLE_1.FIELD_10,
TABLE_1.FIELD_11
INTO #BOOK1
FROM TABLE_1 LEFT OUTER JOIN TABLE_2 ON
TABLE_1.FIELD_2 = TABLE_2.FIELD_5
SELECT * FROM #BOOK1
WHERE LEFT(#BOOK1.FIELD_3,10) <>
LEFT(#BOOK1.FIELD_4,10) AND
#BOOK1.CLE_PROFILE_STATE IN ('VALUE_1','VALUE_2')
ORDER BY FIELD_3

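SQL Server has a REPLACE function that lets you strip a character or sequence out of a string. Your STUFF/PATINDEX expression only removes the first match that PATINDEX finds, which is why some punctuation survives. REPLACE removes every occurrence of one character, so it's a bit ugly, but you can nest one call per punctuation character or wrap them in a function.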

Related

How to check whether a string matches a custom template in SQL Server

I have a column like this :
Codes
--------------------------------------------------
3/1151---------366-500-2570533-1
9/6809---------------------368-510-1872009-1
1-260752-305-154----------------154-200-260752-1--------154-800-13557-1
2397/35425---------------------------------377-500-3224575-1
17059----------------377-500-3263429-1
126/42906---------------------377-500-3264375-1
2269/2340-------------------------377-500-3065828-1
2267/767---------377-500-1452908-4
2395/118593---------377-500-3284699-1
2395/136547---------377-500-3303413-1
92/10260---------------------------377-500-1636038-1
2345-2064---------377-500-3318493-1
365-2290--------377-500-3278261-12
365-7212--------377-500-2587120-1
How can I extract codes with this format:
3digit-3digit-5to7digit-1to2digit
xxx-xxx-xxxxxx-xx
The result must be :
Codes
--------------------------------------------------
366-500-2570533-1
368-510-1872009-1
154-200-260752-1 , 154-800-13557-1 -- this row has 2 matching codes
377-500-3224575-1
377-500-3263429-1
377-500-3264375-1
377-500-3065828-1
377-500-1452908-4
377-500-3284699-1
377-500-3303413-1
377-500-1636038-1
377-500-3318493-1
377-500-3278261-12
377-500-2587120-1
------------------------------------
I am completely worn out by this problem.
Thanks for reading about my problem
This is really ugly, really really ugly. I don't for one second suggest doing this in your RDBMS; I suggest you fix your data instead. You should not be storing "delimited" data (I use that word loosely to describe your data) in your tables; you should be storing it in separate columns and rows. In this case, the first "code" should be in one column, with a one-to-many relationship to another table that holds the codes you're trying to extract.
As you haven't tagged or mentioned your version of SQL Server, I've used the latest SQL Server syntax. STRING_SPLIT is available in SQL Server 2016+ and STRING_AGG in 2017+. If you aren't using those versions, you will need to replace those functions with a suitable alternative (I suggest DelimitedSplit8K(_LEAD) and FOR XML PATH respectively).
Anyway, here is what this does. First, we need to turn the data into something more usable, so I change each double hyphen (--) to a pipe (|), as that doesn't seem to appear in your data. I then use that pipe to split your data into parts (individual codes).
Because your delimiter is inconsistent (it isn't a consistent width), this leaves some codes with a leading hyphen, which I then have to remove. Then I use my answer from your other question to split the code further into its components, and reverse the WHERE; previously the answer was looking for "bad" rows, whereas now we want "good" rows.
Then after all of that, it's as "simple" as using STRING_AGG to delimit the "good" rows:
SELECT STRING_AGG(ca.Code,',') AS Codes
FROM (VALUES('3/1151---------366-500-2570533-1'),
('9/6809---------------------368-510-1872009-1'),
('1-260752-305-154----------------154-200-260752-1--------154-800-13557-1'),
('2397/35425---------------------------------377-500-3224575-1'),
('17059----------------377-500-3263429-1'),
('126/42906---------------------377-500-3264375-1'),
('2269/2340-------------------------377-500-3065828-1'),
('2267/767---------377-500-1452908-4'),
('2395/118593---------377-500-3284699-1'),
('2395/136547---------377-500-3303413-1'),
('92/10260---------------------------377-500-1636038-1'),
('2345-2064---------377-500-3318493-1'),
('365-2290--------377-500-3278261-12'),
('365-7212--------377-500-2587120-1')) V(Codes)
CROSS APPLY (VALUES(REPLACE(V.Codes,'--','|'))) D(DelimitedCodes)
CROSS APPLY STRING_SPLIT(D.DelimitedCodes,'|') SS
CROSS APPLY (VALUES(CASE LEFT(SS.[value],1) WHEN '-' THEN STUFF(SS.[value],1,1,'') ELSE SS.[value] END)) ca(Code)
CROSS APPLY (VALUES(PARSENAME(REPLACE(ca.Code,'-','.'),4),
PARSENAME(REPLACE(ca.Code,'-','.'),3),
PARSENAME(REPLACE(ca.Code,'-','.'),2),
PARSENAME(REPLACE(ca.Code,'-','.'),1))) PN(P1, P2, P3, P4)
WHERE LEN(PN.P1) = 3
AND LEN(PN.P2) = 3
AND LEN(PN.P3) BETWEEN 5 AND 7
AND LEN(PN.P4) BETWEEN 1 AND 2
AND ca.Code NOT LIKE '%[^0-9\-]%' ESCAPE '\'
GROUP BY V.Codes;
You have several problems here:
1. Splitting your longer strings into the codes you want.
2. Dealing with the fact that your separator for the longer strings is the same as your separator for the shorter ones.
3. Finding the patterns that you want.
The last is perhaps the simplest, because you can use brute force to solve that.
Here is a solution that extracts the values you want:
with t as (
select v.*
from (values ('3/1151---------366-500-2570533-1'),
('9/6809---------------------368-510-1872009-1'),
('1-260752-305-154----------------154-200-260752-1--------154-800-13557-1'),
('2397/35425---------------------------------377-500-3224575-1')
) v(str)
)
select t.*, ss.value
from t cross apply
(values (replace(replace(replace(replace(replace(t.str, '--', '><'), '<>', ''), '><', '|'), '|-', '|'), '-|', '|'))
) v(str_sep) cross apply
string_split(v.str_sep, '|') ss
where ss.value like '%-%-%-%' and
ss.value not like '%-%-%-%-%' and
(ss.value like '[0-9][0-9][0-9]-[0-9][0-9][0-9]-[0-9][0-9][0-9][0-9][0-9]-[0-9]' or
ss.value like '[0-9][0-9][0-9]-[0-9][0-9][0-9]-[0-9][0-9][0-9][0-9][0-9]-[0-9][0-9]' or
ss.value like '[0-9][0-9][0-9]-[0-9][0-9][0-9]-[0-9][0-9][0-9][0-9][0-9][0-9]-[0-9]' or
ss.value like '[0-9][0-9][0-9]-[0-9][0-9][0-9]-[0-9][0-9][0-9][0-9][0-9][0-9]-[0-9][0-9]' or
ss.value like '[0-9][0-9][0-9]-[0-9][0-9][0-9]-[0-9][0-9][0-9][0-9][0-9][0-9][0-9]-[0-9]' or
ss.value like '[0-9][0-9][0-9]-[0-9][0-9][0-9]-[0-9][0-9][0-9][0-9][0-9][0-9][0-9]-[0-9][0-9]'
);
I would strongly encourage you to find some way of doing this string parsing anywhere other than SQL.
The key to this working is getting the long string of hyphens down to a single delimiter. SQL Server does not offer regular expressions for the hyphens (as some other databases do and as is available in other programming languages). In Python, for instance, this would be much simpler.
The strange values statement with a zillion replaces is handling the repeated delimiters, replacing them with a single pipe delimiter.
Note: This uses string_split() as a convenience. It was introduced in SQL Server 2016. For earlier versions, there are plenty of examples of string splitting functions on the web.
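To illustrate the remark about Python, here is a minimal sketch of the same extraction done with a regular expression outside the database. It is not part of either SQL answer. The function name extract_codes is hypothetical, and the boundary rule (a code must touch the start or end of the string, a non-digit and non-hyphen character, or a run of at least two hyphens) is an assumption about the data that mirrors the "split on repeated hyphens, then validate" idea above.

```python
import re
from typing import List

# Assumed template: 3 digits - 3 digits - 5 to 7 digits - 1 to 2 digits.
# The lookarounds keep a code from being cut out of a longer hyphenated value:
# on each side there must be the string boundary, a character that is neither
# a digit nor a hyphen, or at least two hyphens (the delimiter run).
CODE_RE = re.compile(
    r"(?:(?<=--)|(?<![\d-]))"        # left boundary
    r"\d{3}-\d{3}-\d{5,7}-\d{1,2}"   # the code itself
    r"(?=--|[^\d-]|$)"               # right boundary
)


def extract_codes(raw: str) -> List[str]:
    """Return every substring of raw that matches the assumed code template."""
    return CODE_RE.findall(raw)


# Sample rows copied from the question.
rows = [
    "3/1151---------366-500-2570533-1",
    "1-260752-305-154----------------154-200-260752-1--------154-800-13557-1",
    "365-2290--------377-500-3278261-12",
]

for row in rows:
    print(", ".join(extract_codes(row)))
```

If the values have to stay in SQL Server, the result of a script like this could be loaded into a separate child table, which also addresses the data-model concern raised in the first answer.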

SQL Server 2016: How to read different substrings from a text with special characters

I have a string with the value 'Initiator;abcd@gmail.com'.
I would like to read these two fields in two separate queries.
';' would be the delimiter always.
I used the below query and it works for fetching the email. However it looks overcomplicated to me.
select SUBSTRING(NOTES (CHARINDEX(';',NOTES,1)+1),LEN(NOTES))
from DOCUMENT
where DOC_ID = '12345'
Can someone please help in simplifying it and reading both the values.
Thanks in advance.
First, the approach in your query is fine for the email, but its syntax is incorrect. Second, just use left() with charindex() to get the first part:
select left(notes, charindex(';', notes)-1),
substring(notes, charindex(';', notes)+1, len(notes))
from document
where doc_id = 12345;

SQL - Split string with multiple delimiters into multiple rows and columns

I am trying to split a string in SQL with the following format:
'John, Mark, Peter|23, 32, 45'.
The idea is to have all the names in the first columns and the ages in the second column.
The query should be "dynamic", the string can have several records depending on user entries.
Does anyone know how to do this, and if possible without SQL functions? I have tried the cross apply approach but I wasn't able to make it work.
Any ideas?
This solution uses Jeff Moden's DelimitedSplit8K. Why? Because his solution provides the ordinal position of each item. Ordinal position is something that many other functions, including Microsoft's own STRING_SPLIT (before SQL Server 2022 added its optional enable_ordinal argument), do not provide. It's going to be vitally important for getting this to work correctly.
Once you have that, the solution becomes fairly simple:
DECLARE @NameAges varchar(8000) = 'John, Mark, Peter|23, 32, 45';
WITH Splits AS (
SELECT S1.ItemNumber AS PipeNumber,
S2.ItemNumber AS CommaNumber,
S2.Item
FROM dbo.DelimitedSplit8K (REPLACE(@NameAges,' ',''), '|') S1 --As you have spaces between the delimiters I've removed these. Be CAREFUL with that
CROSS APPLY dbo.DelimitedSplit8K (S1.Item, ',') S2)
SELECT S1.Item AS [Name],
S2.Item AS Age
FROM Splits S1
JOIN Splits S2 ON S1.CommaNumber = S2.CommaNumber
AND S2.PipeNumber = 2
WHERE S1.PipeNumber = 1;
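The query above pairs items by their ordinal position: the first name goes with the first age, the second with the second, and so on. As a rough illustration of that logic only (in Python, not T-SQL, and not part of the original answer), the same pairing can be sketched as follows. The function name split_names_ages is hypothetical, and the sketch assumes exactly one pipe and equal item counts on both sides.

```python
def split_names_ages(raw):
    """Pair each name with the age at the same ordinal position."""
    # Split once on the pipe: names on the left, ages on the right.
    names_part, ages_part = raw.split("|", 1)

    # Split each side on commas and trim the surrounding spaces.
    names = [name.strip() for name in names_part.split(",")]
    ages = [int(age.strip()) for age in ages_part.split(",")]

    # Assumption: a mismatch in counts is treated as bad input.
    if len(names) != len(ages):
        raise ValueError("names and ages must have the same number of items")

    # zip pairs items by position, like the join on CommaNumber.
    return list(zip(names, ages))


for name, age in split_names_ages("John, Mark, Peter|23, 32, 45"):
    print(name, age)
```

Unlike the REPLACE(...,' ','') in the SQL version, trimming each item keeps any spaces that occur inside a name.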

Remove # characters from arrays in PostgreSQL table?

I have a field (of type character varying) called 'directedlink_href' in a table which contains arrays that have values that all start with a '#' character.
How am I able to remove the '#' character from any entries in these arrays in this field?
For instance...
{#osgb4000000030451486,#osgb4000000030451491}
to
{osgb4000000030451486,osgb4000000030451491}
The clean solution is to unnest, replace and then re-aggregate the values:
select id,
(select array_agg(substr(x.val,2) order by x.idx) from unnest(t1.directedlink_href) with ordinality as x(val,idx)) as data
from the_table t1;
If you want to actually change the data in the table:
update the_table t1
set directedlink_href = (select array_agg(substr(x.val,2) order by x.idx) from unnest(t1.directedlink_href) with ordinality as x(val,idx));
This simply strips off the first character. If you might have other characters at the start of the value, you need to use regexp_replace(x.val,'^#', '') instead of substr(x.val,2).
@a_horse_with_no_name got my upvote for a cleaner and more "Postgres-ish" solution.
I was about to delete this answer, but after some tests, it seems that performance-wise this solution has an advantage.
Therefore, I will leave this solution here, but I do recommend choosing @a_horse_with_no_name's solution as the accepted answer.
I'm using chr(1) as a character that most likely does not appear in the array's elements.
select string_to_array(substr(replace(array_to_string(directedlink_href,chr(1)),chr(1)||'#',chr(1)),2),chr(1))
from t
;
I think this is a simpler, more generic solution, so I thought I'd share:
SELECT regexp_split_to_array(regexp_replace(array_to_string(ARRAY['#osgb4000000030451486','#osgb4000000030451491'], '__DELIMITER__'), '#', '', 'g'), '__DELIMITER__');

Finding Potential Duplicates With hyphens or dashes in SQL server 2008

I am trying to find potential duplicates in my database.
Some people might have a duplicate because they added a "-" to their first or last name (for whichever reason).
My query currently does not pull people who might have a duplicate of someone with a "-".
What might be the best way to do this?
This is my query so far
SELECT t1.FirstName, t1.LastName, t1.ID, t2.dupeCount
FROM Contact t1
INNER JOIN (
SELECT FirstName, REPLACE(LastName, '-', ' ') as LastName, COUNT(*) AS dupeCount
FROM Contact
GROUP BY FirstName, LastName
HAVING COUNT(*) > 1
) t2 ON ((SOUNDEX(t1.LastName) = SOUNDEX(t2.LastName)
OR SOUNDEX(REPLACE(t1.LastName, '-', ' ')) like '%' + SOUNDEX(t2.LastName) + '%'
OR SOUNDEX(REPLACE(t2.LastName, '-', ' ')) like '%' + SOUNDEX(t1.LastName) + '%' )
AND SOUNDEX(t1.FirstName) = SOUNDEX(t2.FirstName))
ORDER BY t1.LastName, t1.ID
This is a lot more involved than something you can fix in one Select statement. When I run into this, I create a stored procedure and trim off leading and trailing spaces, remove punctuation that shouldn't be there (such as in middle names that are abbreviated some of the time and not other times), and check to see if phone numbers, address/zip code combinations, and/or email addresses point to the same person. Soundex helps, but it isn't enough.
Something like a Levenshtein distance algorithm would be useful. It measures the number of single-character edits you would need to make to one string to turn it into another. In Oracle there is a built-in function called edit_distance in the utl_match package, but I don't know of a built-in version in SQL Server.
I did a quick Google search for "Levenshtein distance" and "edit distance SQL Server" and found the following Stack Overflow thread, among other results, that may be helpful:
Levenshtein distance in T-SQL
If you are able to create a function that you can call to get the Levenshtein distance then you can just filter the query on whether the distance < x, setting the threshold as you see fit.
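To make the idea concrete, here is a minimal Python sketch of the standard dynamic-programming Levenshtein distance, followed by the kind of threshold check described above. It only illustrates what such a function computes; it is not the T-SQL implementation from the linked thread. The helper name is_potential_duplicate, the case-insensitive comparison, and the default threshold of 2 are assumptions chosen for the example.

```python
def levenshtein(a: str, b: str) -> int:
    """Number of single-character insertions, deletions or substitutions
    needed to turn a into b."""
    # Keep the shorter string in the inner loop so each row stays small.
    if len(a) < len(b):
        a, b = b, a

    # previous[j] = distance between the first i-1 chars of a and first j of b.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


def is_potential_duplicate(name_a: str, name_b: str, threshold: int = 2) -> bool:
    """Flag two names as possible duplicates when distance < threshold.

    The threshold value is an assumption; tune it against real data.
    """
    return levenshtein(name_a.lower(), name_b.lower()) < threshold


print(levenshtein("Smith-Jones", "Smith Jones"))
print(is_potential_duplicate("Smith-Jones", "Smith Jones"))
```

A hyphen replaced by a space counts as one substitution, so names that differ only in that way fall under even a small threshold.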