How to get all PDF links from HTML content using T-SQL

I am trying to retrieve all PDF links from a string column which contains HTML.
Example text of one column is:
<p>text here link
some other text home
link 2</p>
I need all links with .pdf extension.
I already tried a function like this:
ALTER function [dbo].[GetLinks] (@t nvarchar(max))
returns @Links table (link nvarchar(max))
as
begin
declare @strtpos int
set @strtpos=100
declare @endpos int
declare @lnk nvarchar(max)
while @strtpos > 6
begin
set @strtpos = PATINDEX('%href="%', @t)+6
if @strtpos>6 begin
--set @endpos = CHARINDEX ('"',@t,@strtpos+1)
set @endpos = PATINDEX('%.pdf"%',@t)+4
if @endpos>0 begin
set @lnk = substring(@t ,@strtpos, @endpos - @strtpos)
set @strtpos = PATINDEX('%href="%', @lnk)+6
set @t= RIGHT (@t, len(@t) - @endpos)
insert @Links values(@lnk)
end
end
end
return
end
And calling this function from SQL Server like this:
select top 1 * from dbo.GetLinks(' <p>text here link
some other text home
link 2</p>')
This returns only the first link when I match a single character, but when I match the string ".pdf" it returns a long string. Please let me know if I am doing something wrong or need to change my approach.

If your HTML column can be converted to XML, as your example suggests, you can parse the href values in T-SQL using XML data type methods:
CREATE FUNCTION dbo.GetLinks (@t xml)
RETURNS @Links TABLE (link nvarchar(max))
AS
BEGIN
INSERT @Links
SELECT
AnchorTag.value('@href', 'nvarchar(MAX)') AS link
FROM @t.nodes('//a') AS AnchorTags(AnchorTag);
RETURN;
END;
GO
The same approach can be used with an inline TVF:
CREATE FUNCTION dbo.GetLinks (@t xml)
RETURNS TABLE
AS
RETURN (
SELECT
AnchorTag.value('@href', 'nvarchar(MAX)') AS link
FROM @t.nodes('//a') AS AnchorTags(AnchorTag)
);
GO
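To apply either version of dbo.GetLinks to a whole column and keep only the PDF links, something like the following sketch may work. The table variable @Pages, its columns, and the sample href values are assumptions made up for illustration. TRY_CAST requires SQL Server 2012 or later and yields NULL, and therefore no rows, for HTML that is not well-formed XML.

```sql
-- Hypothetical table and sample data, for illustration only.
DECLARE @Pages TABLE (PageId int, Body nvarchar(max));

INSERT INTO @Pages (PageId, Body)
VALUES (1, N'<p>text <a href="example.com/abc.pdf">link</a> and <a href="example.com/home">home</a></p>');

-- Call the function once per row and keep only hrefs ending in .pdf.
SELECT p.PageId, l.link
FROM @Pages AS p
CROSS APPLY dbo.GetLinks(TRY_CAST(p.Body AS xml)) AS l
WHERE l.link LIKE N'%.pdf';
```

Whether the LIKE comparison treats ".PDF" and ".pdf" as equal depends on the collation in effect.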

An XQuery expression can do it simply:
DECLARE @html xml = '<p>text here link<b v="3">ok</b>some other text home<a title="er">kj</a>link 2</p>'
select [pdfLink] = a.value('@href','varchar(max)')
from @html.nodes('//a[@href[contains(., ".pdf")]]') c(a)

If you are on SQL Server 2016+, you can use STRING_SPLIT.
DECLARE @string VARCHAR(8000) = '
<p>text here link
some other text home
link 2</p>';
SELECT TheUrl = split.value
FROM STRING_SPLIT(@string,'"') AS split
WHERE split.value LIKE '%.pdf';
Returns:
TheUrl
---------------------------
example.com/abc.pdf
www.example.com/abc123.pdf

If you can't convert your HTML into XML for whatever reason, you can still do this with regular string manipulation, though it is not pretty.
This solution (ironically) utilises an xml based string splitter to allow for multi-character delimiters, the output of which is then further filtered to only return the .pdf links:
create or alter function [dbo].[fn_StringSplitXML]
(
@str varchar(max) = '' -- String to split.
,@Delimiter varchar(10) = ',' -- Delimiting value to split on.
,@num int = null -- Which value to return.
)
returns table
as
return
select rn
,item
from(select rn = row_number() over(order by(select null))
,item = ltrim(rtrim(n.i.value('(./text())[1]','varchar(max)')))
from(select x = cast('<x>'+replace(@str,@Delimiter,'</x><x>')+'</x>' as xml).query('.')) as s
cross apply s.x.nodes('x') as n(i)
) as a
where rn = @num
or @num is null
;
go
declare @html varchar(1000) =
'<p>text here link
some other text home
link 2</p>
<input type="text" name="self closed tag" />
<b>some more text</b>
';
select left(s.item
,patindex('%.pdf%',s.item)+3
) as link
from dbo.fn_StringSplitXML(replace(replace(@html
,'>'
,''
)
,'<'
,''
)
,'href="'
,null
) as s
where patindex('%.pdf%',s.item) > 0;
Output
link
example.com/abc.pdf
www.example.com/abc123.pdf

Related

SQL Server : remove character "

I want to remove from my value / string this character " but I can not remove it using my UPDATE statement here:
UPDATE [dbo].[Tablename]
SET [columnname] = REPLACE([columnname], '"', '')
Do you have any idea how to remove this character?
Thank you for opinions

You could simply use the ASCII character codes to replace the wildcard characters. There might be other good solution as well. It worked for me!
UPDATE [dbo].[Tablename] SET [columnname] = REPLACE([columnname], char(47), '')
Char(47) is the ASCII code for the forward slash. You can find the full list here.
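For the double quote in the question, the straight quote is character code 34 rather than 47. If the stored character is actually a typographic ("curly") quote, its Unicode code points are 8220 and 8221. A minimal sketch, using a variable @value as a stand-in for the column (the variable and its sample text are assumptions for illustration):

```sql
-- @value stands in for [columnname]; the sample text is made up.
DECLARE @value nvarchar(100) =
    N'say "hello" and ' + NCHAR(8220) + N'goodbye' + NCHAR(8221);

-- Remove straight quotes (34) and left/right curly quotes (8220, 8221).
SELECT REPLACE(
         REPLACE(
           REPLACE(@value, CHAR(34), ''),
           NCHAR(8220), ''),
         NCHAR(8221), '') AS cleaned;
```

The same nested REPLACE expression can be used on the right-hand side of the UPDATE once you know which code the column really contains.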

So first make 100% sure of the character you are trying to eliminate.
Just because it 'looks' like a double quote, doesn't mean that it is.
Alter the commented line in the code below, to select a single record from your dataset.
It will spit out the charcode for each character in the source string.
(obv change table and field names also !)
declare
@sample nvarchar(max),
@char nvarchar(1),
@i_idx int,
@temp int
create table #RtnValue(
Id int identity(1,1),
A nvarchar(max),
[uni] int,
[ascii] int
)
-- set @sample to a single value from your dataset here
select @sample = (select top (1) [sample] from test.test)
While len(@sample) > 0
Begin
Set @char = left(@sample,1)
Insert Into #RtnValue (A, uni, [ascii])
Select A = @char, UNI = UNICODE(@char), [ASCII] = ASCII(@char)
Set @sample = RIGHT(@sample,len(@sample)-1)
End
Select * from #RtnValue

SQL Server REPLACE function

I have a file location in my table like
FileLocation :- "\Saurabh\Rahul\Saurabh\ABC.text"
I need to replace "Saurabh" with "Ramesh", but only the first occurrence of "Saurabh", not every occurrence in the string.
I tried
select REPLACE(FILELOCATION,'saurabh','Ramesh')
How can I achieve this?

Try this: you can use the STUFF and CHARINDEX functions.
DECLARE @str varchar(100) = '\Saurabh\Rahul\Saurabh\ABC.text'
SELECT STUFF(@str, CHARINDEX('saurabh', @str), LEN('saurabh'), '')
OUTPUT:
\\Rahul\Saurabh\ABC.text

To replace only the first occurrence of a word in a string, you can write:
DECLARE @FileLocation VARCHAR(MAX)
DECLARE @ReplaceSubString VARCHAR(MAX),@NewSubString VARCHAR(MAX)
SET @FileLocation = '\\Saurabh\Rahul\Saurabh\ABC.text'
SET @ReplaceSubString = 'Saurabh'
SET @NewSubString = 'Ramesh'
SELECT STUFF ( @FileLocation ,
CHARINDEX(@ReplaceSubString, @FileLocation) ,
Len(@ReplaceSubString) ,
@NewSubString
)
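One caveat with the STUFF/CHARINDEX pattern: when the search string does not occur at all, CHARINDEX returns 0 and STUFF returns NULL for a start position of 0. A hedged sketch of a guard follows. The search value 'Suresh' is a made-up name chosen only because it is absent from the sample path.

```sql
DECLARE @FileLocation varchar(max) = '\Saurabh\Rahul\Saurabh\ABC.text';
DECLARE @ReplaceSubString varchar(max) = 'Suresh';  -- hypothetical value, not present in the string
DECLARE @NewSubString varchar(max) = 'Ramesh';

SELECT CASE
           WHEN CHARINDEX(@ReplaceSubString, @FileLocation) > 0
           THEN STUFF(@FileLocation,
                      CHARINDEX(@ReplaceSubString, @FileLocation),
                      LEN(@ReplaceSubString),
                      @NewSubString)
           ELSE @FileLocation  -- no match: return the input unchanged instead of NULL
       END AS FileLocation;
```

This matters mostly in an UPDATE over many rows, where an unguarded STUFF would set non-matching rows to NULL.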

Split/explode comma delimited string with Sybase SQL Anywhere

UPDATE:
Someone marked this question as duplicate of
How do I split a string so I can access item x.
But it's different, my question is about Sybase SQL Anywhere, the other is about MS SQL Server. These are two different SQL engines, even if they have the same origin, they have different syntax. So it's not duplicate. I wrote in the first place in description and tags that it's all about Sybase SQL Anywhere.
I have field id_list='1234,23,56,576,1231,567,122,87876,57553,1216'
and I want to use it to search IN this field:
SELECT *
FROM table1
WHERE id IN (id_list)
id is integer
id_list is varchar/text
But in this way this doesn't work, so I need in some way to split id_list into select query.
What solution should I use here? I'm using the T-SQL Sybase ASA 9 database (SQL Anywhere).
The way I see it, I would create my own function that loops over the list,
extracts each element by searching for the delimiter position,
and inserts the elements into a temp table that the function returns as its result.

This can be done without using dynamic SQL, but you will need to create a couple of supporting objects. The first object is a table-valued function that parses your string and returns a table of integers. The second object is a stored procedure with a parameter for the string (id_list); it parses the string into a table and then joins that table to your query.
First, create the function to parse the string:
CREATE FUNCTION [dbo].[String_To_Int_Table]
(
@list NVARCHAR(1024)
, @delimiter NCHAR(1) = ',' --Defaults to CSV
)
RETURNS
@tableList TABLE(
value INT
)
AS
BEGIN
DECLARE @value NVARCHAR(11)
DECLARE @position INT
SET @list = LTRIM(RTRIM(@list))+ ','
SET @position = CHARINDEX(@delimiter, @list, 1)
IF REPLACE(@list, @delimiter, '') <> ''
BEGIN
WHILE @position > 0
BEGIN
SET @value = LTRIM(RTRIM(LEFT(@list, @position - 1)));
INSERT INTO @tableList (value)
VALUES (cast(@value as int));
SET @list = RIGHT(@list, LEN(@list) - @position);
SET @position = CHARINDEX(@delimiter, @list, 1);
END
END
RETURN
END
Now create your stored procedure:
CREATE PROCEDURE ParseListExample
@id_list as nvarchar(1024)
AS
BEGIN
SET NOCOUNT ON;
--create a temp table to hold the list of ids
CREATE TABLE #idTable (ID INT);
-- use the table valued function to parse the ids into a table.
INSERT INTO #idTable(ID)
SELECT Value FROM dbo.String_to_int_table(@id_list, ',');
-- join the temp table of ids to the table you want to query...
SELECT T1.*
FROM table1 T1
JOIN #idTable T2
on T1.ID = T2.ID
END
Execution Example:
exec ParseListExample @id_list='1234,23,56,576,1231,567,122,87876,57553,1216'
I hope this helps...

Like Mikael Eriksson said, there is an answer at dba.stackexchange.com with two very good solutions: the first uses the sa_split_list system procedure, and the second, slower one uses a CAST statement.
The sa_split_list system procedure does not exist in Sybase SQL Anywhere 9, so I wrote a replacement for it (I used parts of the code from bsivel's answer):
CREATE PROCEDURE str_split_list
(in str long varchar, in delim char(10) default ',')
RESULT(
line_num integer,
row_value long varchar)
BEGIN
DECLARE str2 long varchar;
DECLARE position integer;
CREATE TABLE #str_split_list (
line_num integer DEFAULT AUTOINCREMENT,
row_value long varchar null,
primary key(line_num));
SET str = TRIM(str) || delim;
SET position = CHARINDEX(delim, str);
separaterows:
WHILE position > 0 loop
SET str2 = TRIM(LEFT(str, position - 1));
INSERT INTO #str_split_list (row_value)
VALUES (str2);
SET str = RIGHT(str, LENGTH(str) - position);
SET position = CHARINDEX(delim, str);
end loop separaterows;
select * from #str_split_list order by line_num asc;
END
Execute the same way as sa_split_list with default delimiter ,:
select * from str_split_list('1234,23,56,576,1231,567,122,87876,57553,1216')
or with specified delimiter which can be changed:
select * from str_split_list('1234,23,56,576,1231,567,122,87876,57553,1216', ',')

You use text in your query and this is not going to work.
Use dynamic query.

Good contribution in bsivel's answer, but to generalise it (for separators other than a comma), the line
SET @list = LTRIM(RTRIM(@list))+ ','
must become
SET @list = LTRIM(RTRIM(@list))+ @delimiter
The first version will only work for comma-separated lists.

The dynamic query approach would look like this:
create procedure ShowData @IdList VarChar(255)
as
exec ('use yourDatabase; select * from MyTable where Id in ('+@IdList+')')

Storing special character(e.g. &) in XML datatype

If I do
Declare @t table(Email xml)
Declare @email varchar(100) = 'xxx&xx@monop.com'
Insert into @t
select '<Emails> <Email>' + @email +'</Email></Emails>'
select * From @t
I get the expected error
Msg 9411, Level 16, State 1, Line 8
XML parsing: line 1, character 27, semicolon expected
One solution which I found almost everywhere (including SO) is to replace '&' with '&amp;', and it works
Insert into @t
select CAST('<Emails><Email>' + REPLACE(@email, '&', '&amp;') + '</Email></Emails>' AS XML)
Output
<Emails><Email>xxx&amp;xx@monop.com</Email></Emails>
However, I was trying with CData approach (just another way to approach the problem)
Declare @t table(Email xml)
Declare @email varchar(100) = 'xxx&xx@monop.com'
Insert into @t
Select CAST('<![CDATA[Emails> <Email>' + @email + '</Email> </Emails]]>' AS XML)
select * From @t
With this I got the output below
Emails&gt; &lt;Email&gt;xxx&amp;xx@monop.com&lt;/Email&gt; &lt;/Emails
What I am trying to achieve is to store the data as it is, i.e. the desired output should be
<Emails><Email>xxx&xx@monop.com</Email></Emails>
Is it at all possible?
I know that the replace approach will fail if any other special character that XML does not understand is passed as input, e.g. '<', in which case we again need to replace it...
Thanks

Tags are PCDATA, not CDATA, so don't put them in the CDATA section.
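In other words, only the value would go inside the CDATA section, with the element tags left outside as ordinary markup. A minimal sketch, assuming the value never contains the sequence `]]>`:

```sql
DECLARE @email varchar(100) = 'xxx&xx@monop.com';

-- Only the value is wrapped in CDATA; the tags stay outside as normal markup.
SELECT CAST('<Emails><Email><![CDATA[' + @email + ']]></Email></Emails>' AS xml) AS Email;
```

As far as I know, the xml data type does not preserve CDATA sections: the value is stored as ordinary text, so when the column is displayed the ampersand is still serialized as an entity. Reading the element back with the value() method returns the original, unescaped string.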

When you work with XML you should use XML-related features of SQL Server.
For example:
/* Create xml and add a variable to it */
DECLARE
@xml xml = '<Emails />',
@email varchar(100) = 'xxx&xx@monop.com';
SET @xml.modify ('insert (
element Email {sql:variable("@email")}
) into (/Emails)[1]');
SELECT @xml;
/* Output:
<Emails>
<Email>xxx&amp;xx@monop.com</Email>
</Emails>
*/
/* Extract value from xml */
DECLARE @email_out varchar(200);
SET @email_out = @xml.value ('(/Emails/Email)[1]', 'varchar (200)');
SELECT @email_out; /* Returns xxx&xx@monop.com */
Good luck
Roman

SQL Server: How do you remove punctuation from a field?

Does anyone know a good way to remove punctuation from a field in SQL Server?
I'm thinking
UPDATE tblMyTable SET FieldName = REPLACE(REPLACE(REPLACE(FieldName,',',''),'.',''),'''' ,'')
but it seems a bit tedious when I intend to remove a large number of different characters, for example: !@#$%^&*()<>:"
Thanks in advance

Ideally, you would do this in an application language such as C# + LINQ as mentioned above.
If you wanted to do it purely in T-SQL though, one way to make things neater would be to first create a table that holds all the punctuation you want to remove.
CREATE TABLE Punctuation
(
Symbol VARCHAR(1) NOT NULL
)
INSERT INTO Punctuation (Symbol) VALUES('''')
INSERT INTO Punctuation (Symbol) VALUES('-')
INSERT INTO Punctuation (Symbol) VALUES('.')
Next, you could create a function in SQL to remove all the punctuation symbols from an input string.
CREATE FUNCTION dbo.fn_RemovePunctuation
(
@InputString VARCHAR(500)
)
RETURNS VARCHAR(500)
AS
BEGIN
SELECT
@InputString = REPLACE(@InputString, P.Symbol, '')
FROM
Punctuation P
RETURN @InputString
END
GO
Then you can just call the function in your UPDATE statement
UPDATE tblMyTable SET FieldName = dbo.fn_RemovePunctuation(FieldName)

I wanted to avoid creating a table and wanted to remove everything except letters and digits.
-- @InStr is the input string, e.g. a function parameter
DECLARE @p int
DECLARE @Result Varchar(250)
DECLARE @BadChars Varchar(12)
SELECT @BadChars = '%[^a-z0-9]%'
-- to leave spaces - SELECT @BadChars = '%[^a-z0-9 ]%'
SET @Result = @InStr
SET @P =PatIndex(@BadChars,@Result)
WHILE @p > 0 BEGIN
SELECT @Result = Left(@Result,@p-1) + Substring(@Result,@p+1,250)
SET @P =PatIndex(@BadChars,@Result)
END

I am proposing 2 solutions
Solution 1: Make a noise table and replace the noises with blank spaces
e.g.
DECLARE @String VARCHAR(MAX)
DECLARE @Noise TABLE(Noise VARCHAR(100),ReplaceChars VARCHAR(10))
SET @String = 'hello! how * > are % u (: . I am ok :). Oh nice!'
INSERT INTO @Noise(Noise,ReplaceChars)
SELECT '!',SPACE(1) UNION ALL SELECT '@',SPACE(1) UNION ALL
SELECT '#',SPACE(1) UNION ALL SELECT '$',SPACE(1) UNION ALL
SELECT '%',SPACE(1) UNION ALL SELECT '^',SPACE(1) UNION ALL
SELECT '&',SPACE(1) UNION ALL SELECT '*',SPACE(1) UNION ALL
SELECT '(',SPACE(1) UNION ALL SELECT ')',SPACE(1) UNION ALL
SELECT '{',SPACE(1) UNION ALL SELECT '}',SPACE(1) UNION ALL
SELECT '<',SPACE(1) UNION ALL SELECT '>',SPACE(1) UNION ALL
SELECT ':',SPACE(1)
SELECT @String = REPLACE(@String, Noise, ReplaceChars) FROM @Noise
SELECT @String Data
Solution 2: With a number table
DECLARE @String VARCHAR(MAX)
SET @String = 'hello! & how * > are % u (: . I am ok :). Oh nice!'
;with numbercte as
(
select 1 as rn
union all
select rn+1 from numbercte where rn<LEN(@String)
)
select REPLACE(FilteredData,' ',SPACE(1)) Data from
(select SUBSTRING(@String,rn,1)
from numbercte
where SUBSTRING(@String,rn,1) not in('!','*','>','<','%','(',')',':','!','&','@','#','$')
for xml path(''))X(FilteredData)
Output(Both the cases)
Data
hello how are u . I am ok . Oh nice
Note: I have listed only some of the noise characters. You may need to add the ones that you need.
Hope this helps

You can use regular expressions in SQL Server - here is an article based on SQL 2005:
http://msdn.microsoft.com/en-us/magazine/cc163473.aspx

I'd wrap it in a simple scalar UDF so all string cleaning is in one place if it's needed again.
Then you can use it on INSERT too...

I took Ken MC's solution and made it into a function which can replace all punctuation with a given string:
----------------------------------------------------------------------------------------------------------------
-- This function replaces all punctuation in the given string with the "replaceWith" string
----------------------------------------------------------------------------------------------------------------
IF object_id('[dbo].[fnReplacePunctuation]') IS NOT NULL
BEGIN
DROP FUNCTION [dbo].[fnReplacePunctuation];
END;
GO
CREATE FUNCTION [dbo].[fnReplacePunctuation] (@string NVARCHAR(MAX), @replaceWith NVARCHAR(max))
RETURNS NVARCHAR(MAX)
BEGIN
DECLARE @Result Varchar(max) = @string;
DECLARE @BadChars Varchar(12) = '%[^a-z0-9]%'; -- to leave spaces - SELECT @BadChars = '%[^a-z0-9 ]%'
DECLARE @p int = PatIndex(@BadChars,@Result);
DECLARE @searchFrom INT;
DECLARE @indexOfPunct INT = @p;
WHILE @indexOfPunct > 0 BEGIN
SET @searchFrom = LEN(@Result) - @p;
SET @Result = Left(@Result, @p-1) + @replaceWith + Substring(@Result, @p+1,LEN(@Result));
SET @IndexOfPunct = PatIndex(@BadChars, substring(@Result, (LEN(@Result) - @SearchFrom)+1, LEN(@Result)));
SET @p = (LEN(@Result) - @searchFrom) + @indexOfPunct;
END
RETURN @Result;
END;
GO
-- example:
SELECT dbo.fnReplacePunctuation('This is, only, a tést-really..', '');
Output:
Thisisonlyatéstreally

If it's a one-off thing, I would use a C# + LINQ snippet in LINQPad to do the job with regular expressions.
It is quick and easy and you don't have to go through the process of setting up a CLR stored procedure and then cleaning up after yourself.

Can't you use PATINDEX to only include NUMBERS and LETTERS instead of trying to guess what punctuation might be in the field? (Not trying to be snarky, if I had the code ready, I'd share it...but this is what I'm looking for).
Seems like you need to create a custom function in order to avoid a giant list of replace functions in your queries - here's a good example:
http://www.codeproject.com/KB/database/SQLPhoneNumbersPart_2.aspx?display=Print
