SQL new column with pattern match - sql

I have a column with a URL string that looks like this:
http://www.somedomain.edu/rootsite1/something/something/
or
http://www.somedomain.edu/sites/rootsite2/something/something
Basically I want to return ONLY the string up to the root site (in another column). The root site can be anything (except /sites), but it will always follow either /sites/ or .edu/,
so the above two strings would return:
http://www.somedomain.edu/rootsite1
http://www.somedomain.edu/sites/rootsite2
I can't compile the view with CLR, so I don't think Regex is an option.
Thanks for any help.

I think you'll do better by splitting up the URL on the client side and saving it as two pieces in the table (one containing the "root" site, the other containing the site-specific path), then putting them back together again on the client side after retrieval.
If you choose to store them in the table as you describe above, you can use CHARINDEX to determine where the .edu or /sites/ occurs in the string, then use SUBSTRING to break it up based on that index.
If you really need to do this, here's an example:
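-- table variable holding the two sample URLs
declare @sites table (URL varchar(500))
insert into @sites
values
('http://www.somedomain.edu/rootsite1/something/something/'),
('http://www.somedomain.edu/sites/rootsite2/something/something')
select
URL,
-- cut at the first '/' after the root site; appending '/' to URL
-- guards against URLs that end at the root site with no trailing slash
SUBSTRING(URL, 1, case when charindex('/sites/', URL) > 0 then
charindex('/', URL + '/', charindex('/sites/', URL) + 7) else
charindex('/', URL + '/', charindex('.edu/', URL) + 5) end - 1) as RootSite
from @sites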
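For the client-side route suggested at the top of this answer, here is a minimal Python sketch, assuming the only rule is the one from the question: the root site is the first path segment, or the segment after /sites/. The function name root_site is made up for illustration.

```python
from urllib.parse import urlsplit


def root_site(url: str) -> str:
    """Return the URL truncated after the root site segment."""
    parts = urlsplit(url)
    # Non-empty path segments, e.g. ["sites", "rootsite2", "something"]
    segments = [s for s in parts.path.split("/") if s]
    # Keep two segments when the path starts with "sites", otherwise one
    keep = 2 if segments and segments[0] == "sites" else 1
    root_path = "/".join(segments[:keep])
    base = f"{parts.scheme}://{parts.netloc}"
    return f"{base}/{root_path}" if root_path else base
```

You could then store the result in its own column and keep the rest of the path separately, as described above.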

You could use CHARINDEX, LEN and SUBSTRING to do this, although I'm not sure SQL is the best place to do it:
DECLARE @testStr VARCHAR(255)
SET @testStr = 'http://www.somedomain.edu/rootsite1/something/something/'
-- prints everything before '.edu'
PRINT SUBSTRING(@testStr, 0, CHARINDEX('.edu', @testStr))
Not a full solution, but it should give you a start.

Related

Does Pervasive Have a SQL Function for URL Encoding?

I've written a query that builds some URLs to an intranet site, but some of the URLs don't work because they contain special characters that need to be URL-encoded. I'm trying to avoid writing a script (outside of SQL) to do the URL encoding; I'd like the database to do it instead, so that I can export the data (as is) directly into a CSV file.
For example, I can encode just one character quite easily. Here, I encode & to %26:
select
customer_id
,customer_name
,'https://intranet.local/customer/?id=' + replace(customer_id,'&','%26') as url
from customer
However, this method becomes quite verbose when encoding multiple characters.
Is there a function in Pervasive 13 that will do URL-encoding?
Based on the answer given here, you can create a function using:
CREATE FUNCTION urlencode(:description char(200))
RETURNS char(200)
AS
BEGIN
SELECT
Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(
Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(
Replace(Replace(Replace(Replace(Replace(Replace(
Replace(RTRIM(:description)
,'%','%25') ,'&','%26') ,'$','%24') ,'+','%2B')
,',','%2C') ,':','%3A') ,';','%3B') ,'=','%3D')
,'?','%3F') ,'@','%40') ,'#','%23')
,'<','%3C') ,'>','%3E') ,'[','%5B') ,']','%5D')
,'{','%7B') ,'}','%7D') ,'|','%7C') ,'^','%5E')
,' ','%20') ,'~','%7E') ,'`','%60') ,'*','%2A')
,'(','%28') ,')','%29') ,'/','%2F') ,'\\','%5C')
INTO :description;
RETURN :description;
END;
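The question wants to stay inside the database, so treat the following only as a cross-check for the mapping table above, or as a fallback if the export ever moves to a script. It is a minimal Python sketch using the standard library's urllib.parse.quote; the function name customer_url is made up, and the base URL is the one from the question.

```python
from urllib.parse import quote

BASE_URL = "https://intranet.local/customer/?id="  # base URL from the question


def customer_url(customer_id: str) -> str:
    """Build the intranet URL with a percent-encoded customer id."""
    # rstrip(" ") mirrors the RTRIM in the SQL function;
    # safe="" makes quote() escape "/" as well.
    return BASE_URL + quote(customer_id.rstrip(" "), safe="")
```

Note that quote() leaves unreserved characters such as letters, digits, "-", "_", "." and "~" alone, so its output will not match the REPLACE chain for every character.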

URL parsing in SQL

I have inconsistent URLs in one of the tables.
The sample looks like
https://blue.decibal.com.au/Transact?pi=9024&pai=2&ct=0&gi=1950&byo=true&ai=49&pa=289&ppt=0
or
https://www.google.com/Transact?pi=9024&pai=2&ct=0&gi=1950&byo=true&ai=49&pa=289&ppt=0
or
https3A%google.com/Transact?pi=9024&pai=2&ct=0&gi=1950&byo=true&ai=49&pa=289&ppt=0
For the first URL the expected result is "blue", even though the host contains two candidate names, blue and decibal.
For the second one it is google.
For the third it is again google.
My requirement is to parse the URL and match it against a lookup table of domain names, which contains blue, google, bing, etc.
However, the inconsistency of the URLs stored in the DB is a challenge. I need to write SQL that can identify the match and, if there are two domains, just pick the first one. The URL may be malformed and is not expected to follow a standard format.
I'd appreciate some help.
Are you looking for something like this? If not, I do believe that using the SPLIT as part of your parsing will help, since it then creates an array that you can manipulate. This is an example for Snowflake SQL, not SQL Server. They are both tagged in the OP, so not sure which you are looking for.
WITH x AS (
SELECT REPLACE(url,'3A%','//') as url
FROM (VALUES
('https://blue.decibal.com.au/Transact?pi=9024&pai=2&ct=0&gi=1950&byo=true&ai=49&pa=289&ppt=0'),
('https://www.google.com/Transact?pi=9024&pai=2&ct=0&gi=1950&byo=true&ai=49&pa=289&ppt=0'),
('https3A%google.com/Transact?pi=9024&pai=2&ct=0&gi=1950&byo=true&ai=49&pa=289&ppt=0')) as x (url)
)
SELECT split(split_part(split_part(url,'//',2),'/',1),'.') as url_array,
array_construct('google') as google_array,
array_construct('decibal') as decibal_array,
array_construct('bing') as bing_array,
CASE WHEN arrays_overlap(url_array,google_array) THEN 'GOOGLE'
WHEN arrays_overlap(url_array,decibal_array) THEN 'DECIBAL'
WHEN arrays_overlap(url_array,bing_array) THEN 'BING' END as domain_match
FROM x;
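To make the parsing steps easier to follow outside Snowflake, here is a minimal Python sketch of the same idea: repair the '3A%' separator, isolate the host, split it on dots, and return the first label found in the lookup list. The function name match_domain and the KNOWN_DOMAINS set are assumptions for illustration; the set stands in for the lookup table from the question. Unlike the CASE expression above, it picks the first matching label by position in the host, which is what the question asks for.

```python
from typing import Optional, Set

# Stand-in for the lookup table mentioned in the question
KNOWN_DOMAINS = {"blue", "google", "bing"}


def match_domain(url: str, known: Set[str] = KNOWN_DOMAINS) -> Optional[str]:
    """Return the first host label that appears in the lookup set."""
    # Same repair as REPLACE(url, '3A%', '//') in the query above
    cleaned = url.replace("3A%", "//")
    # Text after the first '//' and before the next '/' is the host
    host = cleaned.split("//", 1)[-1].split("/", 1)[0]
    for label in host.lower().split("."):
        if label in known:
            return label
    return None
```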

SQL search for string within string, excluding another string

I have an SQL field containing a large chunk of HTML. I'd like to identify any records where there is the string "http://" but it is not part of a string that is "http://www.example.com." Many of the records include "http://www.example.com" -- I am not looking to exclude those. Rather, to return them if there is an additional "http://" link that is not of the same format.
As an example, I would want to return these records:
http://www.foo.com is a great site but http://www.example.com is not
http://www.foo.com is a great site
but not this one:
http://www.example.com is a great site
How's this?
SELECT *
FROM table
WHERE REPLACE(field, 'http://www.example.com', '') LIKE '%http://%'
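The trick is to delete every allowed link first and then check whether any "http://" is left. A minimal Python sketch of the same logic, run against the three sample records from the question (the function name has_other_link is made up):

```python
ALLOWED = "http://www.example.com"


def has_other_link(text: str) -> bool:
    """True if text contains an http:// link other than the allowed one."""
    # Remove the allowed link, then look for any remaining "http://"
    return "http://" in text.replace(ALLOWED, "")


samples = [
    "http://www.foo.com is a great site but http://www.example.com is not",
    "http://www.foo.com is a great site",
    "http://www.example.com is a great site",
]
matches = [s for s in samples if has_other_link(s)]
```

Here matches holds the first two records only. As with the SQL version, a link such as http://www.example.com.evil.org would also be stripped, because the match is a plain prefix.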

Modify a column, to get rid of html surrounding an ID

I have a table and one of the columns contains html for an iFrame & within it an external video, specifically it's like
<iframe src="http://host.com/videos/ID" otherattributes...></iframe>.
I need to update the current column or create a new one (it doesn't matter which) so that what I have is just the ID of that video. I know I could use a regex for it, but I'm really weak with regexes.
Perhaps it could find the content between the literal text videos/ and the next " character, which comes right after the ID, but I'm unsure how.
You can use CHARINDEX() function:
update T SET
VideoID=SUBSTRING(descr,
charindex('/videos/',descr)+LEN('/videos/'),
charindex('"',descr,charindex('/videos/',descr)+LEN('/videos/'))
-(charindex('/videos/',descr)+LEN('/videos/')))
SQLFiddle demo
This should work, assuming the text videos/ doesn't appear anywhere else in the html.
update htmltable
set id = SUBSTRING(SUBSTRING(html,
CHARINDEX('videos/', html) + 7,
LEN(html)
),
0,
CHARINDEX('"', SUBSTRING(html,
CHARINDEX('videos/', html) + 7,
LEN(html)
)
)
)
This sets the id field in table htmltable to the video ID found in the URL. It's pretty ugly code, but SQL Server has limited string functions.
If you have any control over the table structure, I would suggest you make some changes. The video ID should be stored in its own column, separate from the rest of the url. Then when you need to retrieve the url, you would concatenate the two parts to get the whole url. That would be much more maintainable.
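If the extraction can happen before the data reaches the table, the regex the question mentions is short. A minimal Python sketch, assuming the ID is everything between /videos/ and the next double quote (the function name extract_video_id is made up):

```python
import re
from typing import Optional

# Capture everything after "/videos/" up to, but not including, the next quote
VIDEO_ID = re.compile(r'/videos/([^"]+)"')


def extract_video_id(html: str) -> Optional[str]:
    """Return the video ID from an iframe snippet, or None if absent."""
    match = VIDEO_ID.search(html)
    return match.group(1) if match else None
```

Like the SQL above, this assumes /videos/ does not appear anywhere else in the HTML before the iframe's src attribute.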

SQL Server : remove duplicated text within a string

I have a SQL Server 2008 table with a column containing lengthy HTML text. Near the top there is a link provided for an associated MP3 file which is unique to each record. The links are all formatted as follows:
<div class="MediaSaveAs">Download Audio </div>
Unfortunately many records contain two or three sequential and identical instances of this link where there should be only one. Is there a relatively simple script I can run to find and eliminate the redundant links?
I'm not entirely sure - because your explanation wasn't very clear - but this appears to do what you want, although whether or not you consider this to be a "simple script", I don't know.
-- placeholder href: the real link is unique per record and was not shown in the question
declare @Link nvarchar(200) = N'<div class="MediaSaveAs"><a href="http://www.example.com/audio.mp3">Download Audio </a></div>'
declare @BadData nvarchar(max) = N'cbjahcgfhjasgfzhjaucv' + replicate(@Link, 3) + N'cabhjcsghagj',
@StartPattern nvarchar(34) = N'<div class="MediaSaveAs"><a href="',
@EndPattern nvarchar(27) = N'">Download Audio </a></div>'
select @BadData
-- replace the whole run (first start pattern through last end pattern)
-- with just its first link (first start pattern through first end pattern)
select replace (
@BadData,
substring(@BadData, charindex(@StartPattern, @BadData), len(@BadData)-charindex(reverse(@EndPattern), reverse(@BadData))-charindex(@StartPattern, @BadData) + 2),
substring(@BadData, charindex(@StartPattern, @BadData), charindex(@EndPattern, @BadData) + len(@EndPattern) - charindex(@StartPattern, @BadData))
)
Personally I would not like to have to maintain this code; I would far rather use a script in another language that can actually parse HTML. You said this is "just a repeated text issue", but that doesn't mean it's an easy problem and especially not in a language like TSQL that has such limited support for string operations.
For future reference, please put all relevant information into the question - you can edit it if you need to - instead of leaving them in the comments where they are difficult to read and may be overlooked. And please post sample data and results instead of describing things in words.
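For the "script in another language" route mentioned above, here is a minimal Python sketch. It assumes the duplicates are identical MediaSaveAs div blocks sitting next to each other, separated by whitespace at most; the function name collapse_duplicate_links is made up, and a regex is used here instead of a real HTML parser because the block is so regular.

```python
import re

# One MediaSaveAs block, followed by one or more exact copies of itself
DUPLICATE_LINKS = re.compile(
    r'(<div class="MediaSaveAs">.*?</div>)(?:\s*\1)+',
    re.DOTALL,
)


def collapse_duplicate_links(html: str) -> str:
    """Keep only the first of several identical, adjacent link blocks."""
    return DUPLICATE_LINKS.sub(r"\1", html)
```

Records that already contain a single link come back unchanged, so it should be safe to run over every row and write back only the ones that differ.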
First we need to identify the file names, which we can do with PATINDEX:
select
substring(html, PATINDEX('%filename%.mp3%', html), PATINDEX('%.mp3%', html)-PATINDEX('%filename%.mp3%', html)+4)
from files
Second, identify and delete the duplicates:
delete
from files
where id not in (
select max(id)
from files
group by substring(html, PATINDEX('%filename%.mp3%', html), PATINDEX('%.mp3%', html)-PATINDEX('%filename%.mp3%', html)+4)
)
http://www.sqlfiddle.com/#!3/887a3/5