How to strip all HTML tags and special characters from a string using SQL Server

I work as an ASP.NET developer using C#, and I receive text like this from the client:
> <p><a
> href="http://www.vogue.co.uk/person/kate-winslet">KATE
> WINSLET</a>&nbsp;has given birth to a 9lb baby boy. The
> Oscar-winning actress welcomed the baby with her husband Ned Rocknroll
> at a hospital in Sussex.</p>
>
> <p>&quot;Kate had &#39;Baby Boy Winslet&#39; on
> Saturday at an NHS Hospital,&quot; Winslet&#39;s spokeswoman
> said, adding that the family were &quot;thrilled to
> bits&quot;.</p>
>
> <p>The announcement suggests that the child might bear his
> mother&#39;s surname, rather than his father&#39;s slightly
> more unusual moniker.</p>
>
> <p>The baby is Winslet&#39;s third - she is already mother
> to Mia, 13, and Joe, eight, &nbsp;from previous relationships -
> and her husband&#39;s first. They met on Necker Island, owned by
> Rocknroll&#39;s uncle, Richard Branson, and<a
> href="http://www.vogue.co.uk/news/2013/kate-winslet-married-to-ned-rocknroller---wedding-details">married almost a year ago</a>&nbsp;in New York.</p>
I need a way to extract the real text, without tags and special characters, using SQL Server 2008 or above.

The best I can suggest is to use a .NET HTML parser (or similar) wrapped in a SQL CLR function. Alternatively, wrap a regex in SQL CLR if you prefer.
Note the limitations of regex: http://www.codinghorror.com/blog/2008/06/regular-expressions-now-you-have-two-problems.html
Raw SQL won't do it: it is not a string (or HTML) processing language.
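To make the parser-based suggestion concrete, here is a minimal sketch in Python rather than .NET (a swapped-in language, chosen only because it is easy to run). It uses the standard library's html.parser to drop tags and decode entities. The names TextExtractor, strip_html and BLOCK_TAGS are hypothetical, and the choice of which tags get a separating space is an assumption.

```python
from html.parser import HTMLParser

# Assumption: a space is emitted around these tags so adjacent blocks do not run together.
BLOCK_TAGS = {"p", "br", "div", "li"}


class TextExtractor(HTMLParser):
    """Collects text content only; entities are decoded by the parser itself."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.parts.append(" ")

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.parts.append(" ")

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(markup: str) -> str:
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    # str.split() treats the non-breaking space as whitespace, so this also
    # collapses decoded &nbsp; characters into single spaces.
    return " ".join("".join(parser.parts).split())
```

The same idea would apply inside a SQL CLR function: let a real parser walk the markup and keep only the text nodes.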

I recently had the same requirement (to remove HTML tags and entities) so developed this function in SQL Server.
CREATE FUNCTION CTU_FN_StripHTML (@dirtyText NVARCHAR(MAX))
RETURNS NVARCHAR(MAX)
AS
BEGIN
    -- Cleaned text
    DECLARE @cleanText NVARCHAR(MAX) = RTRIM(LTRIM(@dirtyText));
    -- HTML tags (INT rather than SMALLINT, so positions beyond 32,767 do not overflow)
    DECLARE @tagStart INT = PATINDEX('%<%>%', @cleanText);
    DECLARE @tagEnd INT;
    DECLARE @tagLength INT;
    -- HTML entities
    DECLARE @entityStart INT = PATINDEX('%&%;%', @cleanText);
    DECLARE @entityEnd INT;
    DECLARE @entityLength INT;
    WHILE @tagStart > 0
       OR @entityStart > 0
    BEGIN
        -- Remove the first remaining HTML tag
        SET @tagStart = PATINDEX('%<%>%', @cleanText);
        IF @tagStart > 0
        BEGIN
            SET @tagEnd = CHARINDEX('>', @cleanText, @tagStart);
            SET @tagLength = (@tagEnd - @tagStart) + 1;
            SET @cleanText = STUFF(@cleanText, @tagStart, @tagLength, '');
        END;
        -- Remove the first remaining HTML entity
        SET @entityStart = PATINDEX('%&%;%', @cleanText);
        IF @entityStart > 0
        BEGIN
            SET @entityEnd = CHARINDEX(';', @cleanText, @entityStart);
            SET @entityLength = (@entityEnd - @entityStart) + 1;
            SET @cleanText = STUFF(@cleanText, @entityStart, @entityLength, '');
        END;
    END;
    SET @cleanText = RTRIM(LTRIM(@cleanText));
    RETURN @cleanText;
END;

HTML is so complex that it's a very bad idea to do this without an HTML parser.
You might be interested in this question.
The accepted answer there is to just use Lynx via the command line and dump the output to a file.
If you can do it outside the user's page load, it might be the best option.

Related

Replace the multiple values between 2 characters in azure sql

In Azure SQL, I'm attempting to delete any text that appears between the < and > characters in a column of my table.
Sample text:
The best part is that. < br >Note:< br >< u> reading
:< /u> < span style="font-family: calibri,sans-serif; font-size: 11pt;"> moral stories from an early age
< b>not only helps your child.< /b>< br>< u>in
learning important: < /u>< /span>< span style="font-family: calibri;
">life lessons but it also helps, in language development.< /span>< ./span>
Output:
The best part is that. reading: moral stories from an early age not only helps your child in learning important: life lessons but it also helps in language development.
I tried the query below, but it works only for short comment text:
SELECT [Comments],REPLACE([Comments], SUBSTRING([Comments], CHARINDEX('<', [Comments]), CHARINDEX('>', [Comments]) - CHARINDEX('<', [Comments]) + 1),'') AS result
FROM table
I created an input table named check_1 and inserted the sample data into it.
This query removes only the first occurrence of the pattern:
SELECT [Comments],REPLACE([Comments], SUBSTRING([Comments], CHARINDEX('<', [Comments]), CHARINDEX('>', [Comments]) - CHARINDEX('<', [Comments]) + 1),'') AS result
FROM check_1
In order to remove all string patterns beginning with '<' and ending with '>' in the text, a user defined function with a while loop is created.
CREATE FUNCTION [dbo].[udf_removetags] (@input_text VARCHAR(MAX)) RETURNS VARCHAR(MAX)
AS
BEGIN
    DECLARE @pos_1 INT
    DECLARE @pos_n INT
    DECLARE @Length INT
    -- Locate the first '<' and the first '>' that follows it
    SET @pos_1 = CHARINDEX('<',@input_text)
    SET @pos_n = CHARINDEX('>',@input_text,CHARINDEX('<',@input_text))
    SET @Length = (@pos_n - @pos_1) + 1
    WHILE @pos_1 > 0 AND @pos_n > 0 AND @Length > 0
    BEGIN
        -- Remove the tag, then look for the next one
        SET @input_text = replace(@input_text,substring(@input_text,@pos_1,@Length),'')
        SET @pos_1 = CHARINDEX('<',@input_text)
        SET @pos_n = CHARINDEX('>',@input_text,CHARINDEX('<',@input_text))
        SET @Length = (@pos_n - @pos_1) + 1
    END
    RETURN @input_text
END
select [dbo].[udf_removetags](comments) as result from check_1
Output String:
The best part is that. Note: reading : moral stories from an early age not only helps your child.in learning important: life lessons but it also helps, in language development.
You can also use STUFF (see the Microsoft documentation on STUFF) in place of the REPLACE + SUBSTRING combination.
Replace this line in the user-defined function:
SET @input_text = replace(@input_text,substring(@input_text,@pos_1,@Length),'')
with this line:
SET @input_text = STUFF(@input_text,@pos_1,@Length,'')
The result will be the same.
According to https://learn.microsoft.com/../azure/../regexp_replace Azure supports REGEXP_REPLACE.
This means it should be possible to replace all '<...>' by '' via
select regexp_replace(comments, '<[^>]*>', '') from mytable;
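To show what the pattern '<[^>]*>' does to the sample text, here is a small Python sketch using the standard re module. Python stands in for the database here, and the function name remove_tags is hypothetical. It illustrates the regular expression only and makes no claim about how any SQL engine implements REGEXP_REPLACE.

```python
import re

# '<', then any run of characters that are not '>', then '>'
TAG_PATTERN = re.compile(r"<[^>]*>")


def remove_tags(text: str) -> str:
    """Delete every '<...>' span; surrounding whitespace is left untouched."""
    return TAG_PATTERN.sub("", text)


sample = "The best part is that. < br >Note:< br >< u> reading"
cleaned = remove_tags(sample)
```

Because only the tags are deleted, any spaces that surrounded them remain, which is why the output of the loop-based function above also contains a few doubled or missing spaces.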

UDF on DB2 11.0

I was asked to build a user-defined function in our mainframe environment that checks for a search string within a longer string. The only catch is that the longer string is read in two-character pairs: if we search, for example, for 'AA' in 'ABCAADAA', the only valid match is the last AA, because the first AA is actually split across the pairs CA and AD.
CREATE FUNCTION F#CRE#WK (WK CHAR(02), WKATTR CHAR(10))
RETURNS INTEGER
LANGUAGE SQL
READS SQL DATA
BEGIN
DECLARE INDEX INTEGER DEFAULT 1;
WHILE (INDEX < 9) DO
SET INDEX = LOCATE_IN_STRING(WKATTR, WK, INDEX);
IF (MOD(INDEX, 2) <> 0) THEN
RETURN 1;
END IF;
END WHILE;
RETURN 0;
END;
It works fine when I create it using Data Studio, but if I put it onto the host directly (we're using Quick32770) I get a bunch of errors that don't make sense at all. I couldn't find any helpful resources (I searched the whole IBM site and Google, of course).
First error I'm getting is:
SQLCODE = -104, ERROR: ILLEGAL SYMBOL "<END-OF-STATEMENT>". SOME
SYMBOLS THAT MIGHT BE LEGAL ARE: ;
This refers to the line where I declare my index variable. If I remove the semicolon, it tells me that the SET is illegal there because it is expecting a semicolon.
I cannot think of anything else I could try (I messed around with the code a lot, but the errors just kept getting weirder). I started working in this field a couple of weeks ago while still in college, and nobody here has actual knowledge about this, so I was hoping to find some help here.
If there's anything else you need, just let me know!
Thanks in advance.
This might help you:
https://bytes.com/topic/db2/answers/754686-db2-udf-need-eliminate-if-statement
It says that the IF statement is not allowed in a UDF on the mainframe, so that user worked around it with a CASE expression.
To fix this, you need to go into the SPUFI settings and change the TERMINATOR option to something other than a semicolon. If I change it to &, my code must look like this:
CREATE FUNCTION F#CRE#WK (WK CHAR(02), WKATTR CHAR(10))
RETURNS INTEGER
LANGUAGE SQL
READS SQL DATA
BEGIN
DECLARE INDEX INTEGER DEFAULT 1;
WHILE (INDEX < 9) DO
SET INDEX = LOCATE_IN_STRING(WKATTR, WK, INDEX);
IF (MOD(INDEX, 2) <> 0) THEN
RETURN 1;
END IF;
END WHILE;
RETURN 0;
END&
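The matching rule from the question (only occurrences that start on a pair boundary count) can be sketched outside DB2 as follows. This is an illustration in Python, not a translation of the UDF above; the function name find_aligned is hypothetical, and it returns a 1-based position or 0 when nothing matches.

```python
def find_aligned(needle: str, haystack: str) -> int:
    """Return the 1-based position of the first occurrence of `needle` that
    starts on a boundary of len(needle)-sized chunks, or 0 if there is none."""
    width = len(needle)
    if width == 0:
        return 0
    # Step through the haystack one whole chunk at a time, so a match that
    # straddles two chunks is never considered.
    for offset in range(0, len(haystack) - width + 1, width):
        if haystack[offset:offset + width] == needle:
            return offset + 1
    return 0
```

For 'AA' in 'ABCAADAA' the chunks are AB, CA, AD and AA, so only the final chunk matches.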

How to split a column by the number of white spaces in it with SQL?

I've got a single column that contains a set of names in it. I didn't design the database so that it contains multiple values in one column, but as it is I've got to extract that information now.
The problem is that in one field I've got multiple values like in this example:
"Jack Tom Larry Stan Kenny"
So the first three should be one group, and the other ones on the far right are another group. (Basically the only thing that separates them in the column is a specific number of whitespace between them, let's say 50 characters.)
How can I split them in pure SQL, so that I can get two columns like this:
column1 "Jack Tom Larry"
column2 "Stan Kenny"
A fairly simplistic answer would be to use a combination of left(), right() and locate(). Something like this (note I've substituted 50 spaces with "XXX" for readability):
declare global temporary table session.x(a varchar(100))
on commit preserve rows with norecovery;
insert into session.x values('Jack Tom LarryXXXStan Kenny');
select left(a,locate(a,'XXX')-1),right(a,length(a)+1-(locate(a,'XXX')+length('XXX'))) from session.x;
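For comparison, the same left/right split around a fixed separator can be sketched in Python. The helper name split_two_columns is hypothetical, and the separator is assumed to be exactly 50 spaces, as described in the question.

```python
SEPARATOR = " " * 50  # assumption: the groups are separated by exactly 50 spaces


def split_two_columns(value: str, separator: str = SEPARATOR):
    """Split `value` at the first occurrence of `separator`.

    Returns (left, right); if the separator is absent, right is ''.
    """
    left, _found, right = value.partition(separator)
    return left, right


column1, column2 = split_two_columns("Jack Tom Larry" + SEPARATOR + "Stan Kenny")
```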
If you need a more general method of extracting the nth field from a string with a given separator, a bit like the split_part() function in PostgreSQL, in Ingres your options would be:
Write a user defined function using the Object Management Extension (OME). This isn't entirely straightforward but there is an excellent example in the wiki pages of Actian's community site to get you started:
http://community.actian.com/wiki/OME:_User_Defined_Functions
Create a row-producing procedure. A bit more clunky to use than an OME function, but much easier to implement. Here's my attempt at such a procedure, not terribly well tested but it should serve as an example. You may need to adjust the widths of the input and output strings:
create procedure split
(
inval = varchar(200) not null,
sep = varchar(50) not null,
n = integer not null
)
result row r(x varchar(200)) =
declare tno = integer not null;
srch = integer not null;
ptr = integer not null;
resval = varchar(50);
begin
tno = 1;
srch = 1;
ptr = 1;
while (:srch <= length(:inval))
do
while (substr(:inval, :srch, length(:sep)) != :sep
and :srch <= length(:inval))
do
srch = :srch + 1;
endwhile;
if (:tno = :n)
then
resval=substr(:inval, :ptr, :srch - :ptr);
return row(:resval);
return;
endif;
srch = :srch + length(:sep);
ptr = :srch;
tno = :tno + 1;
endwhile;
return row('');
end;
select s.x from session.x t, split(t.a,'XXX',2) s;
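The behaviour the procedure aims for (return the nth field for a given separator, or an empty string if there is no such field) can be sketched in a few lines of Python. The name split_part is borrowed from the PostgreSQL function mentioned above; this is an illustration of the intended result, not of how Ingres executes the procedure.

```python
def split_part(value: str, separator: str, n: int) -> str:
    """Return the nth (1-based) field of `value` split on `separator`,
    or '' when there is no such field."""
    if not separator or n < 1:
        return ""
    fields = value.split(separator)
    return fields[n - 1] if n <= len(fields) else ""
```

With the sample row, split_part('Jack Tom LarryXXXStan Kenny', 'XXX', 2) gives 'Stan Kenny'.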

Update (replace partial value) XML column in SQL

I have an XML column in my table and I want to replace a particular piece of text with new text wherever it appears in that column. Here is the XML structure:
<Story>
<StoryNonText>
<NonText>
<ImageID>1</ImageID>
<Src>http://staging.xyz.com/FolderName/1.png</Src>
</NonText>
<NonText>
<ImageID>2</ImageID>
<Src>http://staging.xyz.com/FolderName/2.png</Src>
</NonText>
</StoryNonText>
</Story>
In the above XML, I want to replace http://staging.xyz.com/ with http://production.xyz.com/ in all the <Src> values. Please guide me on how I can do this!
You can use Replace() function as below:
Update TableName
SET
ColumnName=replace(CAST(ColumnName AS VARCHAR(8000)),'<Src>http://staging.xyz.com/','<Src>http://production.xyz.com/')
With a little help from a couple of XML functions you can do this in a loop.
The loop is necessary since replace value of can only replace one value at a time. This code assumes the URL is located first in the node and not embedded in text anywhere.
declare @T table(X xml);
insert into @T(X) values('<Story>
<StoryNonText>
<NonText>
<ImageID>1</ImageID>
<Src>http://staging.xyz.com/FolderName/1.png</Src>
</NonText>
<NonText>
<ImageID>2</ImageID>
<Src>http://staging.xyz.com/FolderName/2.png</Src>
</NonText>
</StoryNonText>
</Story> ');
declare @FromURL nvarchar(100);
declare @ToURL nvarchar(100);
set @FromURL = 'http://staging.xyz.com/';
set @ToURL = 'http://production.xyz.com/';
-- Each pass rewrites one matching text node; stop when no row was updated
while 1 = 1
begin
    update @T
    set X.modify('replace value of (//*/text()[contains(., sql:variable("@FromURL"))])[1]
    with concat(sql:variable("@ToURL"), substring((//*/text()[contains(., sql:variable("@FromURL"))])[1], string-length(sql:variable("@FromURL"))+1))')
    where X.exist('//*/text()[contains(., sql:variable("@FromURL"))]') = 1;
    if @@rowcount = 0
        break;
end;
select *
from @T
replace value of (XML DML)
concat Function (XQuery)
contains Function (XQuery)
string-length Function (XQuery)
sql:variable() Function (XQuery)
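If the XML can be processed outside the database, the same prefix replacement is short to express with a real XML parser. The sketch below uses Python's standard xml.etree.ElementTree; the function name replace_src_prefix is hypothetical, and, like the T-SQL loop above, it assumes the URL sits at the start of the <Src> text.

```python
import xml.etree.ElementTree as ET


def replace_src_prefix(xml_text: str, old_prefix: str, new_prefix: str) -> str:
    """Rewrite every <Src> value that starts with old_prefix."""
    root = ET.fromstring(xml_text)
    for src in root.iter("Src"):
        if src.text and src.text.startswith(old_prefix):
            src.text = new_prefix + src.text[len(old_prefix):]
    return ET.tostring(root, encoding="unicode")


story = (
    "<Story><StoryNonText>"
    "<NonText><ImageID>1</ImageID><Src>http://staging.xyz.com/FolderName/1.png</Src></NonText>"
    "<NonText><ImageID>2</ImageID><Src>http://staging.xyz.com/FolderName/2.png</Src></NonText>"
    "</StoryNonText></Story>"
)
updated = replace_src_prefix(story, "http://staging.xyz.com/", "http://production.xyz.com/")
```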
There are many ways to do that.
The first way is to use a WHILE loop. Inside the loop, you search (CHARINDEX) for the positions of the first opening tag and the first closing tag. Then, knowing the start and end positions, you replace the value. On the next iteration you search again, but move the starting position in the CHARINDEX() function forward.
The second way is to use SELECT ... FROM OPENXML together with EXEC sp_xml_preparedocument.

Not able to remove injected script from database rows

I've been handed a MS SQL 2000 database which has been injected with malware.
The malware script is as follows:
<script src=http://www.someAddress.ru/aScript.js></script>
Now I want to remove this piece of code from the table rows.
As a test, I entered <h1>Test</h1> in a row, and successfully ran the following query:
UPDATE myTable
SET description = REPLACE (description, '<h1>','')
WHERE id = 2;
This removed the h1 tag.
But trying the same with the script tag does not work:
UPDATE myTable
set description = REPLACE (description, '<script src=http://www.someAddress.ru/aScript.js></script>','')
WHERE id = 2
Why does this not work?
UPDATE 2
WOHO! I found the solution!
I'm using the following code, which I found here: http://www.tek-tips.com/viewthread.cfm?qid=1563568&page=3
-- Look for open and close HTML tags making sure a letter or / follows < ensuring its an opening
-- HTML tag or closing HTML tag and not an unencoded < symbol
CREATE FUNCTION [dbo].[udf_StripHTML]
(@HTMLText VARCHAR(8000))
RETURNS VARCHAR(8000)
AS
BEGIN
    DECLARE @Start INT
    DECLARE @End INT
    DECLARE @Length INT
    SET @Start = CHARINDEX('<',@HTMLText)
    SET @End = CHARINDEX('>',@HTMLText,CHARINDEX('<',@HTMLText))
    SET @Length = (@End - @Start) + 1
    WHILE @Start > 0
      AND @End > 0
      AND @Length > 0
    BEGIN
        SET @HTMLText = STUFF(@HTMLText,@Start,@Length,'')
        SET @Start = CHARINDEX('<',@HTMLText)
        SET @End = CHARINDEX('>',@HTMLText,CHARINDEX('<',@HTMLText))
        SET @Length = (@End - @Start) + 1
    END
    RETURN Replace(LTRIM(RTRIM(@HTMLText)),' ',' ')
END
GO
To remove the HTML tags / scripts, I run the following query:
UPDATE mytable
SET description = [dbo].[udf_StripHTML](description)
-- WHERE id = 35;
This works perfectly. Note that this script removes ALL HTML. So if I only want to remove <script>, I just replace '<' with '<script'.
Have you tried looking for just aScript.js, the entry could be url_encoded, or something similar, so it gives something like
%3Cscript+src%3Dhttp%3A%2F%2Fwww.someAddress.ru%2FaScript.js%3E%3C%2Fscript%3E
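To check whether a stored value is simply the URL-encoded form of the script tag, you can encode the tag yourself and compare. A small sketch using Python's standard urllib.parse (the variable names are illustrative):

```python
from urllib.parse import quote_plus, unquote_plus

script_tag = "<script src=http://www.someAddress.ru/aScript.js></script>"

# Form-style URL encoding: spaces become '+', reserved characters become %XX
encoded = quote_plus(script_tag)

# Decoding a suspect value from the table shows what it really contains
decoded = unquote_plus(encoded)
```

If the column holds the encoded form, a REPLACE that searches for the literal tag will never match it.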
Reread Question
Do you mean that even when you have the script tag in a column with id=2 it doesn't work? Because if it's not working, are you sure that it exists in the row with id=2? :p
It should work, unless there are other hidden characters in there that you can't see, or there is some form of encoding going on. Can you SELECT a suspect row to look at it more closely?
I would tend to completely DELETE FROM myTable WHERE description LIKE '%someAddress.ru%' where possible.
However, fixing the database isn't a real solution; the application must be fixed. It shouldn't ever be echoing text out of the database unencoded. If someone enters some data including the string <script> it should simply appear on the page as the literal string <script>, or in the source &lt;script&gt;.
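As an illustration of output encoding, here is what escaping looks like with Python's standard html module. The original application is ASP.NET, so this is only a sketch of the principle, and the variable names are illustrative.

```python
import html

stored_value = "<script src=http://www.someAddress.ru/aScript.js></script>"

# Escape on output: the browser then shows the text instead of running it
safe_for_page = html.escape(stored_value)
```

The escaped value contains &lt; and &gt; in place of the angle brackets, so the injected tag is displayed as text rather than executed.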
Wouldn't the src attribute value be surrounded by quotes? If so, you would have to escape them to get a proper match on the replace.
Why not try:
UPDATE myTable
set description = REPLACE (description, 'www.someAddress.ru','localhost')
WHERE id = 2
That would eliminate the immediate hijacking problem, and would likely avoid line break / funky characters problems.
You could try the following to strip the code out of your field (I'm assuming you have information in the same field that you want to keep):
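update myTable
set description = case when PATINDEX('%<script%', description) > 0
    -- keep the text before '<script' and the text after the closing 'script>'
    then SUBSTRING(description, 1, PATINDEX('%<script%', description)-1) + SUBSTRING(description, PATINDEX('%script>%', description) + 7, LEN(description))
    else description
end
where id=2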
You could first run a select to see if the value returned by the CASE statement is correct before running the update. It should not affect fields without a script tag in them, though.
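To preview the effect of cutting out the script block before touching the table, the same idea can be tried on a copy of the text. This Python sketch swaps in a regular expression rather than PATINDEX arithmetic; the name remove_script_blocks is hypothetical, and it assumes each injected block has a matching closing tag.

```python
import re

# '<script ...>' up to the nearest '</script>', case-insensitive, across line breaks
SCRIPT_BLOCK = re.compile(r"<script\b[^>]*>.*?</script\s*>", re.IGNORECASE | re.DOTALL)


def remove_script_blocks(text: str) -> str:
    """Remove every <script>...</script> block and keep the rest of the text."""
    return SCRIPT_BLOCK.sub("", text)


infected = "Nice product<script src=http://www.someAddress.ru/aScript.js></script> description"
cleaned = remove_script_blocks(infected)
```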
Hold on...
Is the database related to a financial system? Is the application under Sarbanes-Oxley? Has any fraud been committed?
Any of those things preclude you from making changes that would, "destroy evidence." Those little guys running around with "FBI" on their jackets don't take kindly to that. It would be a good thing to back it up now, and the logs (SQL and Web), and put that backup on a few DVDs. It would be better to remove the disk and put in another one (but that may not be an option).
Moving on to cleansing:
bobince's direction is the correct one. Don't look for the whole SCRIPT tag, or try to find variations. Instead, look for something in the script tag that isn't part of the normal dataset. That's what you key off. If it SELECTs okay, then turn it into a DELETE and save that query, because you will need it while you turn to fixing the application (guaranteed your database will get corrupted again).