SQL search for string within string, excluding another string - sql

I have an SQL field containing a large chunk of HTML. I'd like to identify any records where there is the string "http://" but it is not part of a string that is "http://www.example.com." Many of the records include "http://www.example.com" -- I am not looking to exclude those. Rather, to return them if there is an additional "http://" link that is not of the same format.
As an example, I would want to return these records:
http://www.foo.com is a great site but http://www.example.com is not
http://www.foo.com is a great site
but not this one:
http://www.example.com is a great site

How's this?
SELECT *
FROM table
WHERE REPLACE(field, 'http://www.example.com', '') LIKE '%http://%'

Related

URL parsing in SQL

I have an inconsistent url in of the tables.
The sample looks like
https://blue.decibal.com.au/Transact?pi=9024&pai=2&ct=0&gi=1950&byo=true&ai=49&pa=289&ppt=0
or
https://www.google.com/Transact?pi=9024&pai=2&ct=0&gi=1950&byo=true&ai=49&pa=289&ppt=0
or
https3A%google.com/Transact?pi=9024&pai=2&ct=0&gi=1950&byo=true&ai=49&pa=289&ppt=0
For the first URL "blue" is the result but it comes with two domains blue and decibal.
Second one is google.
Third is again google.
My requirement is to parse the url and match it with a look table with domain name which contains blue, google, bing etc.
However, the inconstancy in the URL that's stored in DB is a challenge. Need to write a sql which can identify the match and if there are two domain just pick the first one. The URL can be a sit and not expected to be a standard one.
Appreciate some help.
Are you looking for something like this? If not, I do believe that using the SPLIT as part of your parsing will help, since it then creates an array that you can manipulate. This is an example for Snowflake SQL, not SQL Server. They are both tagged in the OP, so not sure which you are looking for.
WITH x AS (
SELECT REPLACE(url,'3A%','//') as url
FROM (VALUES
('https://blue.decibal.com.au/Transact?pi=9024&pai=2&ct=0&gi=1950&byo=true&ai=49&pa=289&ppt=0'),
('https://www.google.com/Transact?pi=9024&pai=2&ct=0&gi=1950&byo=true&ai=49&pa=289&ppt=0'),
('https3A%google.com/Transact?pi=9024&pai=2&ct=0&gi=1950&byo=true&ai=49&pa=289&ppt=0')) as x (url)
)
SELECT split(split_part(split_part(url,'//',2),'/',1),'.') as url_array,
array_construct('google') as google_array,
array_construct('decibal') as decibal_array,
array_construct('bing') as bing_array,
CASE WHEN arrays_overlap(url_array,google_array) THEN 'GOOGLE'
WHEN arrays_overlap(url_array,decibal_array) THEN 'DECIBAL'
WHEN arrays_overlap(url_array,bing_array) THEN 'BING' END as domain_match
FROM x;

Modify a column, to get rid of html surrounding an ID

I have a table and one of the columns contains html for an iFrame & within it an external video, specifically it's like
<iframe src="http://host.com/videos/ID" otherattributes...></iframe>.
I need to update the current column or create a new one (doesn't matter) so what I have is just the ID of that video, I know I could use a regex for it but I'm really weak with it.
perhaps so it find the content that is within literal characters: [videos/] and the upcoming ["] which comes right after the ID but I'm unsure how.
You can use CHARINDEX() function:
update T SET
VideoID=SUBSTRING(descr,
charindex('/videos/',descr)+LEN('/videos/'),
charindex('"',descr,charindex('/videos/',descr)+LEN('/videos/'))
-(charindex('/videos/',descr)+LEN('/videos/')))
SQLFiddle demo
This should work, assuming the text videos/ doesn't appear anywhere else in the html.
update htmltable
set id = SUBSTRING(SUBSTRING(html,
CHARINDEX('videos/', html) + 7,
LEN(html)
),
0,
CHARINDEX('"', SUBSTRING(html,
CHARINDEX('videos/', html) + 7,
LEN(html)
)
)
)
This updates a field named otherfield in table htmltable where the id in the url is '123'. It's pretty ugly code, but SQL Server has limited string functions.
If you have any control over the table structure, I would suggest you make some changes. The video ID should be stored in its own column, separate from the rest of the url. Then when you need to retrieve the url, you would concatenate the two parts to get the whole url. That would be much more maintainable.

SQL query to bring all results regardless of punctuation with JSF

So I have a database with articles in them and the user should be able to search for a keyword they input and the search should find any articles with that word in it.
So for example if someone were to search for the word Alzheimer's I would want it to return articles with the word spell in any way regardless of the apostrophe so;
Alzheimer's
Alzheimers
results should all be returned. At the minute it is search for the exact way the word is spell and wont bring results back if it has punctuation.
So what I have at the minute for the query is:
private static final String QUERY_FIND_BY_SEARCH_TEXT = "SELECT o FROM EmailArticle o where UPPER(o.headline) LIKE :headline OR UPPER(o.implication) LIKE :implication OR UPPER(o.summary) LIKE :summary";
And the user's input is called 'searchText' which comes from the input box.
public static List<EmailArticle> findAllEmailArticlesByHeadlineOrSummaryOrImplication(String searchText) {
Query query = entityManager().createQuery(QUERY_FIND_BY_SEARCH_TEXT, EmailArticle.class);
String searchTextUpperCase = "%" + searchText.toUpperCase() + "%";
query.setParameter("headline", searchTextUpperCase);
query.setParameter("implication", searchTextUpperCase);
query.setParameter("summary", searchTextUpperCase);
List<EmailArticle> emailArticles = query.getResultList();
return emailArticles;
}
So I would like to bring back all results for alzheimer's regardless of weather their is an apostrophe or not. I think I have given enough information but if you need more just say. Not really sure where to go with it or how to do it, is it possible to just replace/remove all punctuation or just apostrophes from a user search?
In my point of view, you should change your query,
you should add alter your table and add a FULLTEXT index to your columns (headline, implication, summary).
You should also use MATCH-AGAINST rather than using LIKE query and most important, read about SOUNDEX() syntax, very beautiful syntax.
All I can give you is a native query example:
SELECT o.* FROM email_article o WHERE MATCH(o.headline, o.implication, o.summary) AGAINST('your-text') OR SOUNDEX(o.headline) LIKE SOUNDEX('your-text') OR SOUNDEX(o.implication) LIKE SOUNDEX('your-text') OR SOUNDEX(o.summary) LIKE SOUNDEX('your-text') ;
Though it won't give you results like Google search but it works to some extent. Let me know what you think.

filter result by taking out a matching regex in pig latin

I have some data that contains a url string, which all have some variety substring embeded.
my goal to to get a set of results which have the substring removed from the string:
e.g.
rawdata: {
id Long,
url String
}
here's some sample rawdata:
1,/213112341_v1.html
2,43524254243_v2.html
5,/000000_v3.html
5,/000000_v4.html
the result I want is:
1,/213112341.html
2,43524254243.html
5,/000000.html
so basically remove teh subversion number( _v1|_v2|v3|_v4) from the url and create unique results.
How do I do that in pig?
Thanks,
Your best bet would be to do something like the following:
FOREACH data GENERATE id, CONCAT(REGEX_EXTRACT(url, '(/?[0-9]*)_,',1),'.html');
EDIT:
How about trying the following if the data is more complicated
FOREACH data GENERATE id, CONCAT(STRSPLIT(url, '_v[0-9]',1),'.html')
That should get everything before the version #, with the concat adding the .html back in. If both the before verson number and after verison number sections are more comlicated you could do something like:
FOREACH data GENERATE id, CONCAT(FLATTEN(STRSPLIT(url, '_v[0-9]',2)))

SQL new column with pattern match

I have a column with a url sting that looks like this
http://www.somedomain.edu/rootsite1/something/something/
or
http://www.somedomain.edu/sites/rootsite2/something/something
Basically I want to ONLY return the string up to root site (in another column).. root site can be anyting (but /sites), but it will either follow /sites/ or .edu/
so the above two strings would return:
http://www.somedomain.edu/rootsite1
http://www.somedomain.edu/sites/rootsite2
I can't compile the view with CLR, so I don't think Regex is an option.
Thanks for any help.
I think you'll do better by splitting up the URL on the client side and saving it as two pieces in the table (one containing the "root" site, the other containing the site-specific path), then putting them back together again on the client side after retrieval.
If you choose to store them in the table as you describe above, you can use CHARINDEX to determine where the .edu or /sites/ occurs in the string, then use SUBSTRING to break it up based on that index.
If you really need to do this, here's an example:
declare #sites table (URL varchar(500))
insert into #sites
values
('http://www.somedomain.edu/rootsite1/something/something/'),
('http://www.somedomain.edu/sites/rootsite2/something/something')
select
URL,
SUBSTRING(URL, 1, case when charindex('/sites/', URL) > 0 then
charindex('/', URL, charindex('/sites/', URL) + 7) else
charindex('/', URL, charindex('.edu/', URL) + 5) end - 1)
from #sites
you could use CHARINDEX, LEN and SUBSTRING to do this although im not sure sql is the best place to do it
DECLARE #testStr VARCHAR(255)
SET #testStr = 'http://www.somedomain.edu/rootsite1/something/something/'
PRINT SUBSTRING(#testStr, 0, CHARINDEX('.edu', #testStr))
Not a full solution but should give you a start