Replace a recurring word and the character before it - sql

I am using SQL Server trying to replace each recurring "[BACKSPACE]" in a string and the character that came before the word [BACKSPACE] to mimic what a backspace would do.
Here is my current string:
"This is a string that I would like to d[BACKSPACE]correct and see if I could make it %[BACKSPACE] cleaner by removing the word and $[BACKSPACE] character before the backspace."
Here is what I want it to say:
"This is a string that I would like to correct and see if I could make it cleaner by removing the word and character before the backspace."
Let me make this clearer. In the above example string, the $ and % signs were just used as examples of characters that would need to be removed since they are before the [BACKSPACE] word that I want to replace.
Here is another before example:
The dog likq[BACKSPACE]es it's owner
I want to edit it to read:
The dog likes it's owner
One last before example is:
I am frequesn[BACKSPACE][BACKSPACE]nlt[BACKSPACE][BACKSPACE]tly surprised
I want to edit it to read:
I am frequently surprised

Without a CLR function that provides Regex replacement, the only way to do this in T-SQL is with iteration. Note, however, that the solution below implements the logic you describe but does not produce exactly the output you show. You state that you want to remove the token and the single character before it, but for 2 of the occurrences in your first example that isn't what you did: for the last 2 you removed ' %[BACKSPACE]' and ' $[BACKSPACE]' respectively (notice the leading whitespace), which is two characters plus the token.
This solution leaves that leading whitespace in place. I am not going to fix that, as the real solution is not to use T-SQL for this, but something that supports Regex.
I also assume this string is coming from a column in a table, and said table has multiple rows (with a distinct value for the string on each).
Anyway, the solution:
WITH rCTE AS(
    SELECT V.YourColumn,
           STUFF(V.YourColumn,CHARINDEX('[BACKSPACE]',V.YourColumn)-1,LEN('[BACKSPACE]')+1,'') AS ReplacedColumn,
           1 AS Iteration
    FROM (VALUES('"This is a string that I would like to d[BACKSPACE]correct and see if I could make it %[BACKSPACE] cleaner by removing the word and $[BACKSPACE] character before the backspace."'))V(YourColumn)
    UNION ALL
    SELECT r.YourColumn,
           STUFF(r.ReplacedColumn,CHARINDEX('[BACKSPACE]',r.ReplacedColumn)-1,LEN('[BACKSPACE]')+1,''),
           r.Iteration + 1
    FROM rCTE r
    WHERE CHARINDEX('[BACKSPACE]',r.ReplacedColumn) > 0)
SELECT TOP (1) WITH TIES
       r.YourColumn,
       r.ReplacedColumn
FROM rCTE r
ORDER BY ROW_NUMBER() OVER (PARTITION BY r.YourColumn ORDER BY r.Iteration DESC);
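For a single value held in a variable, the same "first occurrence first" logic can also be sketched as a plain WHILE loop. This is only an illustration and not part of the answer above: the names @s, @token and @pos are made up for the example, and it adds one assumption the recursive version does not make, namely that a token at the very start of the string has no preceding character to delete.

```sql
-- Hypothetical variables, for illustration only
DECLARE @s     varchar(8000) = 'I am frequesn[BACKSPACE][BACKSPACE]nlt[BACKSPACE][BACKSPACE]tly surprised';
DECLARE @token varchar(20)   = '[BACKSPACE]';
DECLARE @pos   int           = CHARINDEX(@token, @s);

WHILE @pos > 0
BEGIN
    SET @s = CASE
                 -- Token at position 1: nothing precedes it, remove only the token
                 WHEN @pos = 1 THEN STUFF(@s, 1, LEN(@token), '')
                 -- Otherwise remove the preceding character together with the token
                 ELSE STUFF(@s, @pos - 1, LEN(@token) + 1, '')
             END;
    SET @pos = CHARINDEX(@token, @s);
END;

SELECT @s AS Result;  -- I am frequently surprised
```

Working from the leftmost token each time is what makes consecutive tokens behave like repeated backspaces.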

I've had a crack at getting this to work using the traditional tally-table method, without any recursion.
I think I have something that works. The recursive CTE version is definitely cleaner and probably performs better, so I'm offering this only as an alternative, non-recursive approach.
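/* tally table for use below */
select top 1000 N = Identity(int, 1, 1)
into dbo.Digits
from master.dbo.syscolumns a cross join master.dbo.syscolumns b;

/* assumed declarations: the original post did not show them */
declare @string    varchar(8000) = 'The dog likq[BACKSPACE]es it''s owner',
        @delimiter varchar(20)   = '[BACKSPACE]';

/* split on the first character of the delimiter, then strip the rest of the delimiter from each part */
with w as (
    select seq  = Row_Number() over (order by t.N),
           part = Replace(Substring(@string, t.N, CharIndex(Left(@delimiter,1), @string + @delimiter, t.N) - t.N), Stuff(@delimiter,1,1,''), '')
    from dbo.Digits t
    where t.N <= DataLength(@string) + 1
      and Substring(Left(@delimiter,1) + @string, t.N, 1) = Left(@delimiter,1)
),
/* drop the last character of every part that is followed by a delimiter */
p as (
    select seq,
           part = Iif(Iif(Lead(part) over (order by seq) = '' and Lag(part) over (order by seq) = '', 1, 0) = 1, '',
                      Iif(seq < Max(seq) over () and part != '', Left(part, Len(part) - 1), part))
    from w
)
select result = (
    select '' + part
    from p
    where part != ''
    order by seq
    for xml path('')
);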

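Here's a simple RegEx pattern that should work. Apply it repeatedly until nothing matches, because a single global pass leaves the second of two consecutive [BACKSPACE] tokens behind:
/.\[BACKSPACE\]/g
EDIT
I have no way to test this right now on my Chromebook, but something like this should work in a T-SQL LIKE clause. Note that LIKE can only find rows that contain the pattern, it cannot replace anything, and it needs % on both sides to match anywhere in the string:
LIKE '%_\[BACKSPACE]%' ESCAPE '\'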
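As a rough illustration of what such a LIKE filter can and cannot do, the sketch below runs it against three made-up sample values (the derived table V and its column s are hypothetical). It only selects rows that contain some character followed by the token; the actual removal still has to be done with STUFF or REPLACE.

```sql
-- Hypothetical sample values, for illustration only
SELECT V.s
FROM (VALUES ('The dog likq[BACKSPACE]es it''s owner'),
             ('No token here'),
             ('[BACKSPACE]token with nothing before it')) AS V(s)
-- "_" matches exactly one character; "\[" is a literal "[" because of ESCAPE '\'
WHERE V.s LIKE '%_\[BACKSPACE]%' ESCAPE '\';
```

Only the first value qualifies: the second has no token, and in the third the token has no character in front of it.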

Related

Replace strings between two characters using T-SQL

I need to update a string to amend any aliases - which can be 'H1.', 'H2.', 'H3.'... etc - to all be 'S.' and am struggling to work out the logic.
For example I have this:
'H1.HUB_CUST_ID, H2.HUB_SALE_ID, H3.HUB_LOC_ID'
But I want this:
'S.HUB_CUST_ID, S.HUB_SALE_ID, S.HUB_LOC_ID'
If you could use wildcards in REPLACE, I'd do something like this: REPLACE(@string, 'H%.H', 'S.H').
Theoretically, there is no limit to how many H# aliases there could be. In practice there will almost definitely be less than 10.
Is there a better way than a nested replace of H1 - H10 separately, which both looks messy in a script and carries a small risk if more tables are joined in future?
SQL Server doesn't support pattern replacement. You are better off using a different language, that does support pattern/REGEX replacement or implementing a CLR function.
That said, however, considering you said that the value would always be below 10 you could brute force it, but it's not "pretty".
SELECT REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(YourString,'H1.','S.'),'H2.','S.'),'H3.','S.'),'H4.','S.'),'H5.','S.'),'H6.','S.'),'H7.','S.'),'H8.','S.'),'H9.','S.')
FROM YourTable ...
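If the nesting is the main objection, the same brute-force idea can be sketched as a loop over the alias numbers for a value held in a variable. The names @s, @i and @maxAlias are hypothetical, and the upper bound of 20 is an arbitrary assumption; it is still plain REPLACE, so it would also rewrite any other text that happens to contain 'H<number>.'.

```sql
-- Hypothetical variables, for illustration only
DECLARE @s        varchar(8000) = 'H1.HUB_CUST_ID, H2.HUB_SALE_ID, H3.HUB_LOC_ID';
DECLARE @i        int = 1;
DECLARE @maxAlias int = 20;   -- assumed upper bound on the alias number

WHILE @i <= @maxAlias
BEGIN
    -- Builds 'H1.', 'H2.', ... and swaps each for 'S.'
    SET @s = REPLACE(@s, CONCAT('H', @i, '.'), 'S.');
    SET @i += 1;
END;

SELECT @s AS Result;  -- S.HUB_CUST_ID, S.HUB_SALE_ID, S.HUB_LOC_ID
```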
You can convert your string to XML and then convert it into simple table:
DECLARE @txt nvarchar(max) = N'H1.HUB_CUST_ID, H2.HUB_SALE_ID, H3.HUB_LOC_ID',
        @x xml
SELECT @x = '<a al="' + REPLACE(REPLACE(@txt,', ','</a><a al="'),'.','">')+ '</a>'
SELECT t.c.value('@al', 'nvarchar(max)') as alias_name,
       t.c.value('.','nvarchar(max)') as col_name
FROM @x.nodes('/a') t(c)
Output:
alias_name  col_name
H1          HUB_CUST_ID
H2          HUB_SALE_ID
H3          HUB_LOC_ID
You can put results into temp table, amend them using LIKE 'some basic pattern' and then build new string.
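One possible way to do that last step is sketched below. It skips the temp table and rebuilds the string directly from the XML nodes with the FOR XML PATH concatenation idiom; the column name new_string is made up, and the sketch assumes the nodes come back in document order, which is usual in practice but is not guaranteed without an explicit ORDER BY.

```sql
DECLARE @txt nvarchar(max) = N'H1.HUB_CUST_ID, H2.HUB_SALE_ID, H3.HUB_LOC_ID',
        @x   xml;

SELECT @x = '<a al="' + REPLACE(REPLACE(@txt, ', ', '</a><a al="'), '.', '">') + '</a>';

SELECT STUFF((
           SELECT ', '
                  -- Amend the alias with a basic LIKE pattern: H followed by a digit becomes S
                  + CASE WHEN t.c.value('@al', 'nvarchar(max)') LIKE N'H[0-9]%' THEN N'S'
                         ELSE t.c.value('@al', 'nvarchar(max)') END
                  + N'.' + t.c.value('.', 'nvarchar(max)')
           FROM @x.nodes('/a') AS t(c)
           FOR XML PATH(''), TYPE
       ).value('.', 'nvarchar(max)'), 1, 2, N'') AS new_string;  -- STUFF strips the leading ', '
```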
If you don't care about the result order, you can unaggregate and reaggregate (this needs STRING_SPLIT and STRING_AGG, so SQL Server 2017 or later):
select t.*, v.new_val
from t cross apply
     (select string_agg(concat('S', stuff(ltrim(s.value), 1, charindex('.', ltrim(s.value)) - 1, '')), ', ') as new_val
      from string_split(t.col, ',') s
     ) v;
Note: This assumes that all values start with the prefix you want to replace -- as your sample data suggests. A case expression can be used if there are exceptions.
You can actually get the original ordering -- assuming no duplicates -- using charindex():
select t.*, v.new_val
from t cross apply
     (select string_agg(concat('S', stuff(ltrim(s.value), 1, charindex('.', ltrim(s.value)) - 1, '')), ', ')
                 within group (order by charindex(s.value, t.col)) as new_val
      from string_split(t.col, ',') s
     ) v;

TSQL - Remove list of parameters from Text

I have a field in a SQL table which has text looks something like this:
'The Employee <PARAM1> was replaced with <PARAM2> and was given the new IP address <PARAM3> with limited access <PARAM4>. <PARAM2> loves the new role'
I want to remove all the <PARAMs> from the text and just show as below using TSQL
'The Employee was replaced with and was given the new IP address with limited access. loves the new role'
What is the best way to do that?
One way to solve this is to generate a common table expression that will hold the start position and length of each <PARAMn> in the string. To do that, you can use a single common table expression but I've done it with three so that the process is easy to understand.
Please note that I'm assuming the string only contains < and > as params separators - so there is no < or > chars in the content. If that's not the case, it's still solvable but the solution would need some changes.
You start with a numbers (tally) cte that starts with 1 and ends with the length of your string.
Then another cte to get the start position of each <PARAMn>.
A third cte is used to get the length of each <PARAMn> (since I'm assuming you are not limited to 10 parameters in a string, so you can have <PARAM12> or even <PARAM105>).
Then, create a query that will update the original string and remove the <PARAMn> one by one.
-- Test data:
DECLARE @Str nvarchar(1000) = 'The Employee <PARAM1> was replaced with <PARAM2> and was given the new IP address <PARAM3> with limited access <PARAM4>. <PARAM2> loves the new role';
-- The numbers (Tally) cte:
WITH Tally AS
(
    SELECT TOP(LEN(@Str)) ROW_NUMBER() OVER(ORDER BY @@SPID) As N
    FROM sys.objects A CROSS JOIN sys.objects B
), -- The StartPosition cte contains the position of each < char in the string
StartPosition AS
(
    SELECT DISTINCT
           CHARINDEX('<', @Str, N) As Start
    FROM Tally
    WHERE CHARINDEX('<', @Str, N) > 0
), -- the Length cte contains both the start position and the length of each <PARAMn> in the string
Length As
(
    SELECT Start,
           CHARINDEX('>', @Str, Start) - Start + 1 As Length
    FROM StartPosition
)
-- Use STUFF to remove `<PARAMn>`. Note the order by is critical to remove from end to start.
SELECT @Str = STUFF(@Str, Start, Length, '')
FROM Length
ORDER BY Start DESC
-- verify the results:
SELECT @Str
Result:
The Employee was replaced with and was given the new IP address with limited access . loves the new role
You might note that the results have places with double spaces where the <PARAMn> used to be - that can be solved by using the technique Gordon Linoff shows in this answer.
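The linked answer is not reproduced here, so the following is an assumption about the kind of technique meant: a widely used nested-REPLACE trick for collapsing runs of spaces. It relies on '<' and '>' not occurring in the text, which matches the assumption stated above once the parameters have been removed. The variable @s is hypothetical.

```sql
-- Hypothetical sample with doubled and tripled spaces
DECLARE @s varchar(200) = 'The Employee  was replaced with   and was given';

SELECT REPLACE(                       -- 3. turn each remaining '<>' back into one space
           REPLACE(                   -- 2. drop every '><', which only exists between adjacent spaces
               REPLACE(@s, ' ', '<>'),-- 1. mark every space as '<>'
               '><', ''),
           '<>', ' ') AS Collapsed;   -- The Employee was replaced with and was given
```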

mssql select all nvarchar with wrong encoding

I am working with an old database where someone didn't encode the data correctly before inserting it, which results in text like
"Wrong t�xt" (in my case the '�' is a ø).
I am looking for a way to find all rows where the column contains data like this, so I can correct it.
So far I have tried using a pattern like
SELECT * FROM table WHERE ([colm] not like '[a-zA-Z\s]%')
but no matter what I do, I can't find a way to select only the rows containing the '�'.
A search like
SELECT * FROM table WHERE ([colm] like '%�%')
won't return anything either (I tried it, just in case).
I have been searching for this on Google and here on Stack Overflow, but either no one else has this problem or I am searching for the wrong thing.
So if someone would be so kind as to help me with this, I would be really happy.
Thanks for your time.
Assuming the character in the string really is U+FFFD REPLACEMENT CHARACTER (�), and it's not displayed as a replacement character because there are actually other bytes in there that can't be decoded properly, you can find it with
SELECT * FROM table WHERE [colm] LIKE N'%�%' COLLATE Latin1_General_BIN2
Or (to avoid any further issues with encoding mangling characters)
SELECT * FROM table WHERE [colm] LIKE N'%' + NCHAR(0xfffd) + N'%' COLLATE Latin1_General_BIN2
Unicode is required because � does not exist in any single-byte collation, and a binary collation is required because the regular collations treat � as if it did not occur in strings at all.
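Since the goal is to correct the data afterwards, the same binary-collation idea can be sketched for the repair step as well. This assumes, as in the asker's case, that every replacement character stands for an ø; the derived table V and the column name fixed_colm are made up for the example.

```sql
-- Hypothetical sample rows; the first contains U+FFFD
SELECT V.colm,
       -- REPLACE also needs the binary collation, otherwise U+FFFD is ignored
       REPLACE(V.colm COLLATE Latin1_General_BIN2, NCHAR(0xFFFD), N'ø') AS fixed_colm
FROM (VALUES (N'Wrong t' + NCHAR(0xFFFD) + N'xt'),
             (N'This string is ok')) AS V(colm)
WHERE V.colm LIKE N'%' + NCHAR(0xFFFD) + N'%' COLLATE Latin1_General_BIN2;
```

Only the first row is returned, with the replacement character swapped for ø in fixed_colm.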
Try this:
WHERE [colm] not like N'%[a-zA-Z]%'
Of course, this should return values with numbers, spaces, and punctuation.
As Jeroen mentioned, using a binary collation (or a binary comparison) seems to be the way to go. Personally I would suggest using NGrams4k here, but I built a quick tally table instead that does the job:
WITH N AS(
SELECT N
FROM (VALUES(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL)) N(N)),
Tally AS(
SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS I
FROM N N1, N N2, N N3, N N4)
SELECT V.Colm
FROM (VALUES(N'Wrong t�xt" (in my case the ''�'' is a ø)'),
(N'This string is ok'))V(colm)
JOIN Tally T ON LEN(V.Colm) >= T.I
CROSS APPLY (VALUES(SUBSTRING(V.Colm,T.I,1))) SS(C)
GROUP BY V.colm
HAVING COUNT(CASE CONVERT(binary(2),SS.C) WHEN 0xFDFF THEN 1 END) > 0;
You could replace occurrences of the U+FFFD REPLACEMENT CHARACTER (�) and compare the result with the original value:
SELECT *
, CASE WHEN CONVERT(VARBINARY(MAX), t.colm) = CAST(REPLACE(CONVERT(VARBINARY(MAX), t.colm), 0xFDFF, 0x) AS VARBINARY(MAX)) THEN 1 ELSE 0 END AS EncodingCorrect
FROM (
SELECT N'Wrong t�xt" (in my case the ''�'' is a ø)' AS colm
UNION ALL
SELECT 'Correct text'
UNION ALL
SELECT 'Wrong t?xt" (in my case the ''?'' is a ø)'
) t
@Jeroen Mostert's suggestion WHERE colm LIKE N'%�%' COLLATE Latin1_General_BIN2 seems like the better and more readable solution.

What's the equivalent of Excel's `left(find(), -1)` in BigQuery?

I have names in my dataset and they include parentheses. But, I am trying to clean up the names to exclude those parentheses.
Example: ABC Company (Somewhere, WY)
What I want to turn it into is: ABC Company
I'm using standard SQL with google big query.
I've done some research and I know big query has left(), but I do not know the equivalent of find(). My plan was to do something that finds the ( and then gives me everything to the left of -1 characters from the (.
"My plan was to do something that finds the ( and then gives me everything to the left of -1 characters from the (."
Good plan! In BigQuery Standard SQL, the equivalent of LEFT is SUBSTR(value, position[, length]) and the equivalent of FIND is STRPOS(value1, value2).
With this in mind, your query can look like this (exactly as you planned):
#standardSQL
WITH names AS (
SELECT 'ABC Company (Somewhere, WY)' AS name
)
SELECT SUBSTR(name, 1, STRPOS(name, '(') - 1) AS clean_name
FROM names
Usually, string functions are less expensive than regular expression functions, so if your data follows the pattern in your example, you should go with the version above.
In more generic cases, where the pattern to clean is more dynamic, as in Graham's answer, you should go with the solution in Graham's answer.
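One caveat with the SUBSTR/STRPOS version: when a name has no parenthesis, STRPOS returns 0 and the length passed to SUBSTR becomes -1, which BigQuery rejects, and the result otherwise keeps the space before the parenthesis. A hedged sketch of a guarded variant follows; the second sample row is made up for illustration.

```sql
#standardSQL
WITH names AS (
  SELECT 'ABC Company (Somewhere, WY)' AS name
  UNION ALL
  SELECT 'No Parens Inc' AS name   -- hypothetical row without parentheses
)
SELECT
  name,
  -- Only cut when a "(" exists; RTRIM drops the space left in front of it
  IF(STRPOS(name, '(') > 0,
     RTRIM(SUBSTR(name, 1, STRPOS(name, '(') - 1)),
     name) AS clean_name
FROM names
```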
Just use REGEXP_REPLACE + TRIM. This will work with all variants (just not nested parentheses):
#standardSQL
WITH
names AS (
SELECT
'ABC Company (Somewhere, WY)' AS name
UNION ALL
SELECT
'(Somewhere, WY) ABC Company' AS name
UNION ALL
SELECT
'ABC (Somewhere, WY) Company' AS name)
SELECT
TRIM(REGEXP_REPLACE(name,r'\(.*?\)',''), ' ') AS cleaned
FROM
names
Use REGEXP_EXTRACT:
SELECT
RTRIM(REGEXP_EXTRACT(names, r'([^(]*)')) AS new_name
FROM yourTable
The regex used here will greedily consume and match everything up until hitting an opening parenthesis. I used RTRIM to remove any unwanted whitespace picked up by the regex.
Note that this approach is robust with respect to the edge case of an address record not having any term with parentheses. In this case, the above query would just return the entire original value.
I can't test this solution at the moment, but you can combine SUBSTR and INSTR (subtracting 1 so that the parenthesis itself is not included). Like this:
SELECT CASE WHEN INSTR(name, '(') > 0 THEN SUBSTR( name, 1, INSTR(name, '(') - 1 ) ELSE name END as name FROM table;

Remove ASCII Extended Characters 128 onwards (SQL)

Is there a simple way to remove extended ASCII characters from a varchar(max)? I want to remove all characters with codes from 128 onwards, e.g. ù, ç, Ä.
I have tried this solution and it's not working. I think it's because they are still valid ASCII characters?
How do I remove extended ASCII characters from a string in T-SQL?
Thanks
The linked solution uses a loop, which is something you should avoid if possible.
My solution is completely inlineable, so it's easy to create a UDF (or maybe even better: an inline TVF) from it.
The idea: Create a set of running numbers (here it's limited by the count of objects in sys.objects, but there are tons of examples of how to create a numbers tally on the fly). In the second CTE the strings are split into single characters. The final select returns the cleaned string.
DECLARE @tbl TABLE(ID INT IDENTITY, EvilString NVARCHAR(100));
INSERT INTO @tbl(EvilString) VALUES('ËËËËeeeeËËËË'),('ËaËËbËeeeeËËËcË');
WITH RunningNumbers AS
(
    SELECT ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) AS Nmbr
    FROM sys.objects
)
,SingleChars AS
(
    SELECT tbl.ID,rn.Nmbr,SUBSTRING(tbl.EvilString,rn.Nmbr,1) AS Chr
    FROM @tbl AS tbl
    CROSS APPLY (SELECT TOP(LEN(tbl.EvilString)) Nmbr FROM RunningNumbers) AS rn
)
SELECT ID,EvilString
      ,(
        SELECT '' + Chr
        FROM SingleChars AS sc
        WHERE sc.ID=tbl.ID AND ASCII(Chr)<128
        ORDER BY sc.Nmbr
        FOR XML PATH('')
       ) AS GoodString
FROM @tbl As tbl
The result
ID  EvilString        GoodString
1   ËËËËeeeeËËËË      eeee
2   ËaËËbËeeeeËËËcË   abeeeec
Here is another answer from me where this approach is used to replace all special characters with secure characters to get plain latin
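The answer mentions that this is easy to turn into an inline TVF. A minimal sketch of what that could look like is below; the function name dbo.KeepAscii7 and its parameter are hypothetical and not from the answer. Two deliberate deviations are assumptions of this sketch: it uses UNICODE() instead of ASCII(), because ASCII() on an NVARCHAR character that the code page cannot represent yields the code of '?', and it cross joins sys.all_objects so the number set is not capped by the object count. Control characters that are invalid in XML would still need extra handling.

```sql
-- Hypothetical inline table-valued function, for illustration only
CREATE FUNCTION dbo.KeepAscii7 (@input nvarchar(4000))
RETURNS TABLE
AS
RETURN
    WITH RunningNumbers AS
    (
        -- One number per character position of the input
        SELECT TOP (ISNULL(LEN(@input), 0))
               ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS Nmbr
        FROM sys.all_objects AS a CROSS JOIN sys.all_objects AS b
    )
    SELECT (
               -- Re-concatenate only the characters with a code point below 128
               SELECT '' + SUBSTRING(@input, rn.Nmbr, 1)
               FROM RunningNumbers AS rn
               WHERE UNICODE(SUBSTRING(@input, rn.Nmbr, 1)) < 128
               ORDER BY rn.Nmbr
               FOR XML PATH(''), TYPE
           ).value('.', 'nvarchar(max)') AS GoodString;
GO  -- batch separator (SSMS / sqlcmd); CREATE FUNCTION must be alone in its batch

DECLARE @tbl TABLE (ID int IDENTITY, EvilString nvarchar(100));
INSERT INTO @tbl (EvilString) VALUES (N'ËËËËeeeeËËËË'), (N'ËaËËbËeeeeËËËcË');

SELECT t.ID, t.EvilString, f.GoodString
FROM @tbl AS t
CROSS APPLY dbo.KeepAscii7(t.EvilString) AS f;
```

Used with CROSS APPLY like this, the function stays inlineable, which is the point the answer makes about avoiding a loop.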