How to make LIKE in SQL look for specific string instead of just a wildcard - sql

My SQL Query:
SELECT
[content_id] AS [LinkID]
, dbo.usp_ClearHTMLTags(CONVERT(nvarchar(600), CAST([content_html] AS XML).query('root/Physicians/name'))) AS [Physician Name]
FROM
[DB].[dbo].[table1]
WHERE
[id] = '188'
AND
(content LIKE '%Urology%')
AND
(contentS = 'A')
ORDER BY
--[content_title]
dbo.usp_ClearHTMLTags(CONVERT(nvarchar(600), CAST([content_html] AS XML).query('root/Physicians/name')))
The issue I am having is that rows whose content is Neurology also appear when I search for Urology, because '%Urology%' also matches inside Neurology.
Is there any way to make the query return only Urology results when I search for Urology, and only Neurology results when I search for Neurology?
The value can be Urology, Neurology, Internal Medicine, and so on; the two above are simply the ones that collide.
The content is an ntext column with XML inside, for example:
<root><Location><location>Office</location>
<office>Office</office>
<Address><image><img src="Rd.jpg?n=7513" /></image>
<Address1>1 Road</Address1>
<Address2></Address2>
<City>Qns</City>
<State>NY</State>
<zip>14404</zip>
<phone>324-324-2342</phone>
<fax></fax>
<general></general>
<from_north></from_north>
<from_south></from_south>
<from_west></from_west>
<from_east></from_east>
<from_connecticut></from_connecticut>
<public_trans></public_trans>
</Address>
</Location>
</root>
With the update this content column has the following XML:
<?xml version="1.0" encoding="UTF-8"?>
<root>
<Physicians>
<name>Doctor #1</name>
<picture>
<img src="phys_lab coat_gradation2.jpg?n=7529" />
</picture>
<gender>M</gender>
<langF1>
English
</langF1>
<specialty>
<a title="Neurology" href="neu.aspx">Neurology</a>
</specialty>
</Physicians>
</root>
If I search for Lab the result appears because there is the text lab in the column.

This is what I would do if you don't want to write a CLR proc to use regexes (SQL Server doesn't have regex capabilities natively):
SELECT
[...]
WHERE
(content LIKE @strService OR
content LIKE '%[^a-z]' + @strService + '[^a-z]%' OR
content LIKE @strService + '[^a-z]%' OR
content LIKE '%[^a-z]' + @strService)
This checks four cases: content is exactly @strService; the word occurs somewhere within content with a non-letter on both sides; the word is at the very beginning of content, followed by a non-letter; or the word is at the very end of content, preceded by a non-letter.
[^...] means "a character that is none of these". If there are other characters you don't want to accept before or after the search term, add them inside each of the four bracket expressions (after the ^!). For instance [^a-zA-Z_].
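To see what the four LIKE branches accept and reject, here is a rough Python sketch of the same rule. It is not T-SQL; the function name matches_whole_word is made up for illustration, and the case-insensitive flag is an assumption standing in for a case-insensitive collation.

```python
import re

def matches_whole_word(content: str, term: str) -> bool:
    """Return True if term occurs in content with no ASCII letter directly before or after it."""
    # The lookarounds play the role of the [^a-z] guards in the LIKE patterns.
    # They also succeed at the very start or end of the string, so one pattern
    # covers all four LIKE cases.
    pattern = r"(?<![a-z])" + re.escape(term) + r"(?![a-z])"
    return re.search(pattern, content, flags=re.IGNORECASE) is not None

print(matches_whole_word('<a title="Urology">Urology</a>', "Urology"))      # True
print(matches_whole_word('<a title="Neurology">Neurology</a>', "Urology"))  # False
print(matches_whole_word("phys_lab coat.jpg", "Lab"))                       # True: "_" is not a letter
```

The last call shows why you may want to widen the bracket expression to something like [^a-zA-Z_]: an underscore counts as a word boundary under the plain [^a-z] rule.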

As I see it, your options are to either:
Create a function that processes a string and finds a whole match inside it
Create a CLR extension that allows you to call .NET code and leverage the REGEX capabilities of .NET
Aaron's suggestion is a good one IF you can know up front all the terms that could be used for searching. The problem I could see is if someone searches for a specific word combination.

Databases are notoriously bad at semantics (i.e. they don't understand the concept of neurology or urology - everything is just a string of characters).
The best solution would be to create a table which defines the terms (two columns, PK and the name of the term).
The query is then a join:
JOIN terms ON table1.term_id = terms.term_id AND terms.term = 'Urology'
That way, you can avoid the LIKE and search for specific results.
If you can't do this, then SQL is probably the wrong tool. Use LIKE to get a set of results which match and then, in an imperative programming language, clean those results from unwanted ones.
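The lookup-table idea can be tried out quickly with an in-memory SQLite database from Python. This is only a sketch: the column names term_id and physician, the row 'Doctor #2', and the helper physicians_for are assumptions for illustration, not the asker's real schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE terms (term_id INTEGER PRIMARY KEY, term TEXT NOT NULL UNIQUE);
    CREATE TABLE table1 (
        content_id INTEGER PRIMARY KEY,
        physician  TEXT,
        term_id    INTEGER REFERENCES terms(term_id)
    );
    INSERT INTO terms VALUES (1, 'Urology'), (2, 'Neurology');
    -- Hypothetical sample rows
    INSERT INTO table1 VALUES (10, 'Doctor #1', 2), (11, 'Doctor #2', 1);
""")

def physicians_for(term: str) -> list[str]:
    """Exact match on the term name through the join, no LIKE involved."""
    rows = conn.execute(
        "SELECT t1.physician FROM table1 AS t1 "
        "JOIN terms AS t ON t1.term_id = t.term_id "
        "WHERE t.term = ? ORDER BY t1.physician",
        (term,),
    ).fetchall()
    return [name for (name,) in rows]

print(physicians_for("Urology"))
```

Because the comparison is an equality test on a whole term, searching for Urology can no longer pick up Neurology rows.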

Judging from your content, can you not leverage the fact that there are quotes in the string you're searching for?
SELECT
[...]
WHERE
(content LIKE '%"Urology"%')

Related

Open Refine: Exporting nested XML with templating

I have a question regarding the templating option for XML in Open Refine. Is it possible to export data from two columns in a nested XML-structure, if both columns contain multiple values, that need to be split first?
Here's an example to illustrate better what I mean. My columns look like this:
Column1 | Column2
https://d-nb.info/gnd/119119110;https://d-nb.info/gnd/118529889 | Grützner, Eduard von;Elisabeth II., Großbritannien, Königin
https://d-nb.info/gnd/1037554086;https://d-nb.info/gnd/1245873660 | Müller, Jakob;Meier, Anina
Each value separated by semicolon in Column1 has a corresponding value in Column2 in the right order and my desired output would look like this:
<rootElement>
<recordRootElement>
...
<edm:Agent rdf:about="https://d-nb.info/gnd/119119110">
<skos:prefLabel xml:lang="zxx">Grützner, Eduard von</skos:prefLabel>
</edm:Agent>
<edm:Agent rdf:about="https://d-nb.info/gnd/118529889">
<skos:prefLabel xml:lang="zxx">Elisabeth II., Großbritannien, Königin</skos:prefLabel>
</edm:Agent>
...
</recordRootElement>
<recordRootElement>
...
<edm:Agent rdf:about="https://d-nb.info/gnd/1037554086">
<skos:prefLabel xml:lang="zxx">Müller, Jakob</skos:prefLabel>
</edm:Agent>
<edm:Agent rdf:about="https://d-nb.info/gnd/1245873660">
<skos:prefLabel xml:lang="zxx">Meier, Anina</skos:prefLabel>
</edm:Agent>
...
</recordRootElement>
</rootElement>
(note: in my initial posting, the position of the root element was not indicated and it looked like this:
<edm:Agent rdf:about="https://d-nb.info/gnd/119119110">
<skos:prefLabel xml:lang="zxx">Grützner, Eduard von</skos:prefLabel>
</edm:Agent>
<edm:Agent rdf:about="https://d-nb.info/gnd/118529889">
<skos:prefLabel xml:lang="zxx">Elisabeth II., Großbritannien, Königin</skos:prefLabel>
</edm:Agent>
)
I managed to split the values separated by ";" for both columns like this
{{forEach(cells["Column1"].value.split(";"),v,"<edm:Agent rdf:about=\""+v+"\">"+"\n"+"</edm:Agent>")}}
{{forEach(cells["Column2"].value.split(";"),v,"<skos:prefLabel xml:lang=\"zxx\">"+v+"</skos:prefLabel>")}}
but I can't find out how to nest the split skos:prefLabel into the edm:Agent element. Is that even possible? If not, I would work with separate columns or another workaround, but I wanted to make sure there isn't a more direct way first.
Thank you!
Kristina
I am going to expand the answer from RolfBly using the Templating Exporter from OpenRefine.
I do have the following assumptions:
There is some other column left of Column1 acting as record identifying column (see first screenshot).
The columns actually have some proper names
The columns URI and Name are the only columns with multiple values. Otherwise we might produce empty XML elements with the following recipe.
We will use the information about records available via GREL to determine whether to write a <recordRootElement> or not.
Recipe:
Split first Name and then URI on the separator ";" via "Edit cells" => "Split multi-valued cells".
Go to "Export" => "Templating..."
In the prefix field use the value
<?xml version="1.0" encoding="utf-8"?>
<rootElement>
Please note that I skipped the namespace imports for edm, skos, rdf and xml.
In the row template field use the value:
{{if(row.index - row.record.fromRowIndex == 0, '<recordRootElement>', '')}}
<edm:Agent rdf:about="{{escape(cells['URI'].value, 'xml')}}">
<skos:prefLabel xml:lang="zxx">{{escape(cells['Name'].value, 'xml')}}</skos:prefLabel>
</edm:Agent>
{{if(row.index - row.record.fromRowIndex == row.record.rowCount - 1, '</recordRootElement>', '')}}
The row separator field should just contain a linebreak.
In the suffix field use the value:
</rootElement>
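If it helps to see the intended pairing outside OpenRefine, here is a small Python sketch of what the recipe produces for one record: the n-th URI is matched with the n-th name and each pair becomes one nested edm:Agent element. The function name agents_xml is hypothetical, and the sketch assumes both cells always hold the same number of values.

```python
from xml.sax.saxutils import escape, quoteattr

def agents_xml(uris: str, names: str, sep: str = ";") -> str:
    """Build one <recordRootElement> from two multi-valued cells."""
    uri_list = uris.split(sep)
    name_list = names.split(sep)
    if len(uri_list) != len(name_list):
        raise ValueError("columns must hold the same number of values")
    parts = ["<recordRootElement>"]
    # zip pairs the values by position, like the rows of a split record
    for uri, name in zip(uri_list, name_list):
        parts.append(f"<edm:Agent rdf:about={quoteattr(uri)}>")
        parts.append(f'<skos:prefLabel xml:lang="zxx">{escape(name)}</skos:prefLabel>')
        parts.append("</edm:Agent>")
    parts.append("</recordRootElement>")
    return "\n".join(parts)

print(agents_xml(
    "https://d-nb.info/gnd/1037554086;https://d-nb.info/gnd/1245873660",
    "Müller, Jakob;Meier, Anina",
))
```

The escaping calls do the same job as escape(..., 'xml') in the row template.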
Disclaimer: If you're keen on using only OpenRefine, this won't be the answer you were hoping for. There may be ways in OR that I don't know of. That said, here's how I would do it.
Edit: The trick is to keep URL and literal side by side on one line. b2m's answer does just that: go from right to left splitting, not from left to right. You can then skip steps 2 and 3, to get the result in the image.
1. Split each column into 2 columns by the separator ;. You'll get 4 columns; 1 and 3 belong together, and 2 and 4 belong together. I'm assuming this will be the case consistently in your data.
2. Export 1 and 3 to a file, and export 2 and 4 to another file, in any convenient format, using the custom tabular exporter.
3. Concatenate those two files into one single file using an editor (I use Notepad++) or any other method you may prefer. There are several roads to Rome here. The result in OR would be something like this.
You then have all sorts of options to put text strings in front, between and after your two columns.
In OR, you could use transform on column URL to build your XML using the below code
(note the \n for newline, that's probably just a line feed, you may want to use \r\n for carriage return + line feed if you're using Windows).
'<edm:Agent rdf:about="' + value + '">\n<skos:prefLabel xml:lang="zxx">' + cells.Name.value + '</skos:prefLabel>\n</edm:Agent>'
to get your XML in one column, like so
which you can then export using the custom tabular exporter again. Or instead you could use Add column based on this column in a similar manner, if you want to retain your URL column.
You could even do this in the editor without re-importing the file back into OR, but that's beyond the scope of this answer.

Coldfusion - Need to get string data between opening and closing tags

I am trying to get specific data between two strings which are an opening and closing tag. Normally I would just parse it using XmlParse, but the problem is that the dataset contains a lot of other junk.
Here is an example of the large string:
test of data need to parse:<?xml version="1.0" encoding="UTF-8"?><alert xmlns="urn:oasis:names:tc::cap:1.2"><identifier>_2020-12-16T17:32:5620201116173256</identifier><sender>683</sender><sent>2020-12-16T17:32:56-05:00</sent><status>Test</status><msgType>Alert</msgType><source>test of data need to parse</source><scope>Public</scope><addresses/><code>Test1.0</code><note>WENS IPAWS</note><info><language>en-US</language></info>
<capsig:Signature xmlns:capsig="http://www.w3.org/2000/09/xmldsig">
<capsig:Info>
<capsig:CanonicalizationMethod Algorithm="http://www.w3.org/2001/10/xml-exc-c14n"/>
<capsig:SignatureMethod Algorithm="http://www.w3.org/2001/04/xmldsig-morersa-sha256"/>
<capsig:Referrer URI="">
<capsig:Trans>
<capsig:Trans Algorithm="http://www.w3.org/2000/09/xmldsigenveloped-signature"/>
</capsig:Trans>
<capsig:DMethod Algorithm="http://www.w3.org/2001/04/xmlencsha256"/>
<capsig:DigestValue>wjL4tqltJY7m/4=</capsig:DigestValue>
</capsig:Referrer>
</capsig:Info>
test of data need to parse:<?xml version="1.0" encoding="UTF-8"?><alert xmlns="urn:oasis:names:tc::cap:1.2"><identifier>_2020-12-16T17:32:5620201116173256</identifier><sender>683</sender><sent>2020-12-16T17:32:56-05:00</sent><status>Test</status><msgType>Alert</msgType><source>test of data need to parse</source><scope>Public</scope><addresses/><code>Test1.0</code><note>WENS IPAWS</note><info><language>en-US</language></info>
So what I need to do is just extract the following:
<capsig:Info>
<capsig:CanonicalizationMethod Algorithm="http://www.w3.org/2001/10/xml-exc-c14n"/>
<capsig:SignatureMethod Algorithm="http://www.w3.org/2001/04/xmldsig-morersa-sha256"/>
<capsig:Referrer URI="">
<capsig:Trans>
<capsig:Trans Algorithm="http://www.w3.org/2000/09/xmldsigenveloped-signature"/>
</capsig:Trans>
<capsig:DMethod Algorithm="http://www.w3.org/2001/04/xmlencsha256"/>
<capsig:DigestValue>wjL4tqltJY7m/4=</capsig:DigestValue>
</capsig:Referrer>
</capsig:Info>
I have searched everywhere and found approaches based on character positions and counts, but none of them really worked. I tried doing it with SQL, but the string changes constantly, which causes issues. So my plan is to get everything between "<capsig:Info>" and "</capsig:Info>" and then insert it into a table.
Is there a way to do this with Coldfusion?
Any suggestions would be appreciated.
Thanks!
Yes, you can use a regular expression match to extract the substring containing the text between the <capsig:Info> ... </capsig:Info> tags by using the ColdFusion function reMatch() which will return an array of all substrings that match the specified pattern. This can be done using the line of code below.
<!--- Use reMatch to extract all pattern matches into an array --->
<cfset parsedXml = reMatch("<capsig:Info>(.*?)</capsig:Info>", xmlToParse)>
<!--- parsedXml is an array of strings. The result will be found in the first array element as such --->
<cfdump var="#parsedXml[1]#" label="parsedXml">
You can see this using the demo here.
https://trycf.com/gist/00be732d93ef49b2427768e18e371527/lucee5?theme=monokai
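For comparison, the same non-greedy pattern can be sketched in Python. This is not ColdFusion; extract_blocks and the shortened sample string are made up for illustration. One difference worth noting: in Python the dot only crosses line breaks when re.DOTALL is set.

```python
import re

def extract_blocks(text: str, tag: str = "capsig:Info") -> list[str]:
    """Return every <tag>...</tag> substring found in text."""
    # Non-greedy .*? stops at the first closing tag; re.DOTALL lets "." match newlines.
    pattern = re.compile(rf"<{re.escape(tag)}>.*?</{re.escape(tag)}>", re.DOTALL)
    return pattern.findall(text)

# Shortened, hypothetical sample with junk on both sides
sample = (
    "junk before<capsig:Info>\n"
    "<capsig:DigestValue>wjL4tqltJY7m/4=</capsig:DigestValue>\n"
    "</capsig:Info>junk after"
)
blocks = extract_blocks(sample)
print(blocks[0])
```

Like reMatch(), this returns a list of all matches, so the first block is at index 0 here rather than index 1.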

T-SQL XML node value

I'm trying to extract the values from the following xml document
<response>
<entry>
<title>the tales</title>
<subject-area code="1" abbrev="XX1">Test1</subject-area>
<subject-area code="2" abbrev="XX2">Test2</subject-area>
</entry>
</response>
but I'm having problem getting the subject-area text values i.e. "Test1"
I'm using the T-SQL below to extract the rest of the values. I'm using a CROSS APPLY on the nodes because I need to loop over all values, so I can't use [1] etc. to extract them, as I'm not sure how many subject areas there will be.
Any ideas?
SELECT
a.APIXMLResponse.value('(response[1]/entry[1]/title[1])','VARCHAR(250)') AS Title
,sa.value('(./@code)','varchar(10)') AS SubjectAreaCode
,sa.value('(./@abbrev)','varchar(10)') AS SubjectAreaAbbrev
FROM [dbo].[APIXML] a
CROSS APPLY APIXMLResponse.nodes('response/entry/subject-area') AS SubjectArea(sa)
Although there is a working solution in a comment already, I'd like to point out some things:
Using just '.' as path can lead to very annoying effects, if there are nested elements.
Looking for performance it is recommended to use text()[1] to read the needed value at its actual place (Here are some details with examples).
As the internal values are NVARCHAR(x), it is slightly faster to use NVARCHAR as target type (if you don't have a reason to do otherwise).
That's my query:
SELECT
a.APIXMLResponse.value('(response/entry/title)[1]','NVARCHAR(250)') AS Title
,sa.value('@code','nvarchar(10)') AS SubjectAreaCode
,sa.value('@abbrev','nvarchar(10)') AS SubjectAreaAbbrev
,sa.value('text()[1]','nvarchar(10)') AS SubjectAreaContent
FROM #mockup a
CROSS APPLY APIXMLResponse.nodes('response/entry/subject-area') AS SubjectArea(sa)
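As a cross-check of what the query should return for the sample document, here is a small Python sketch that walks the same nodes with the standard library. It only illustrates the expected rows (title, code, abbrev, text); it is not a replacement for the T-SQL.

```python
import xml.etree.ElementTree as ET

doc = """<response>
  <entry>
    <title>the tales</title>
    <subject-area code="1" abbrev="XX1">Test1</subject-area>
    <subject-area code="2" abbrev="XX2">Test2</subject-area>
  </entry>
</response>"""

root = ET.fromstring(doc)
rows = []
for entry in root.findall("entry"):
    title = entry.findtext("title")
    # One output row per subject-area, like CROSS APPLY ... nodes()
    for sa in entry.findall("subject-area"):
        rows.append((title, sa.get("code"), sa.get("abbrev"), sa.text))

for row in rows:
    print(row)
```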

Modify a column, to get rid of html surrounding an ID

I have a table and one of the columns contains html for an iFrame & within it an external video, specifically it's like
<iframe src="http://host.com/videos/ID" otherattributes...></iframe>.
I need to update the current column or create a new one (doesn't matter) so what I have is just the ID of that video, I know I could use a regex for it but I'm really weak with it.
Perhaps it could find the content between the literal text videos/ and the next " that comes right after the ID, but I'm unsure how.
You can use CHARINDEX() function:
update T SET
VideoID=SUBSTRING(descr,
charindex('/videos/',descr)+LEN('/videos/'),
charindex('"',descr,charindex('/videos/',descr)+LEN('/videos/'))
-(charindex('/videos/',descr)+LEN('/videos/')))
SQLFiddle demo
This should work, assuming the text videos/ doesn't appear anywhere else in the html.
update htmltable
set id = SUBSTRING(SUBSTRING(html,
CHARINDEX('videos/', html) + 7,
LEN(html)
),
0,
CHARINDEX('"', SUBSTRING(html,
CHARINDEX('videos/', html) + 7,
LEN(html)
)
)
)
This sets the id column in table htmltable to the video ID extracted from the html column. It's pretty ugly code, but SQL Server has limited string functions.
If you have any control over the table structure, I would suggest you make some changes. The video ID should be stored in its own column, separate from the rest of the url. Then when you need to retrieve the url, you would concatenate the two parts to get the whole url. That would be much more maintainable.
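The position arithmetic in both UPDATE statements is easier to follow when written step by step. Here is a rough Python sketch of the same logic (find the marker, skip past it, cut at the next double quote). The function name extract_video_id and the width attribute in the example are made up for illustration.

```python
from typing import Optional

def extract_video_id(html: str, marker: str = "/videos/") -> Optional[str]:
    """Return the text between marker and the next double quote, or None."""
    start = html.find(marker)          # like CHARINDEX('/videos/', descr)
    if start == -1:
        return None
    start += len(marker)               # skip past the marker itself
    end = html.find('"', start)        # like CHARINDEX('"', descr, start)
    if end == -1:
        return None
    return html[start:end]

print(extract_video_id('<iframe src="http://host.com/videos/ID" width="1"></iframe>'))
```

Unlike the SQL versions, the sketch returns None when the marker or the closing quote is missing, which is a case worth deciding on before running an UPDATE over real data.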

SQL Server : remove duplicated text within a string

I have a SQL Server 2008 table with a column containing lengthy HTML text. Near the top there is a link provided for an associated MP3 file which is unique to each record. The links are all formatted as follows:
<div class="MediaSaveAs">Download Audio </div>
Unfortunately many records contain two or three sequential and identical instances of this link where there should be only one. Is there a relatively simple script I can run to find and eliminate the redundant links?
I'm not entirely sure - because your explanation wasn't very clear - but this appears to do what you want, although whether or not you consider this to be a "simple script", I don't know.
declare @Link nvarchar(200) = N'<div class="MediaSaveAs">Download Audio </div>'
declare @BadData nvarchar(max) = N'cbjahcgfhjasgfzhjaucv' + replicate(@Link, 3) + N'cabhjcsghagj',
@StartPattern nvarchar(34) = N'<div class="MediaSaveAs"><a href="',
@EndPattern nvarchar(27) = N'">Download Audio </a></div>'
select @BadData
select replace (
@BadData,
substring(@BadData, charindex(@StartPattern, @BadData), len(@BadData)-charindex(reverse(@EndPattern), reverse(@BadData))-charindex(@StartPattern, @BadData) + 2),
substring(@BadData, charindex(@StartPattern, @BadData), charindex(@EndPattern, @BadData) + len(@EndPattern) - charindex(@StartPattern, @BadData))
)
)
Personally I would not like to have to maintain this code; I would far rather use a script in another language that can actually parse HTML. You said this is "just a repeated text issue", but that doesn't mean it's an easy problem and especially not in a language like TSQL that has such limited support for string operations.
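As a rough idea of what such a script could look like, here is a minimal Python sketch that collapses back-to-back identical links into one. The link shape is inferred from the start and end patterns above, the href values in the example are hypothetical, and allowing whitespace between repeats is an assumption.

```python
import re

# Group 1 captures one complete link; (?:\s*\1)+ then matches one or more
# immediately following copies of exactly the same text.
LINK_RE = re.compile(
    r'(<div class="MediaSaveAs"><a href="[^"]*">Download Audio </a></div>)(?:\s*\1)+'
)

def collapse_repeats(html: str) -> str:
    """Replace each run of identical consecutive links with a single link."""
    return LINK_RE.sub(r"\1", html)

link = '<div class="MediaSaveAs"><a href="a.mp3">Download Audio </a></div>'
print(collapse_repeats("before" + link * 3 + "after"))
```

Links with different href values are left alone, because the backreference only matches an exact repeat of the captured text.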
For future reference, please put all relevant information into the question - you can edit it if you need to - instead of leaving them in the comments where they are difficult to read and may be overlooked. And please post sample data and results instead of describing things in words.
First we need to identify the file names, which we can do with PATINDEX:
select
substring(html, PATINDEX('%filename%.mp3%', html), PATINDEX('%.mp3%', html)-PATINDEX('%filename%.mp3%', html)+4)
from files
And then, secondly, identify and delete the duplicates:
delete
from files
where id not in (
select max(id)
from files
group by substring(html, PATINDEX('%filename%.mp3%', html), PATINDEX('%.mp3%', html)-PATINDEX('%filename%.mp3%', html)+4)
)
http://www.sqlfiddle.com/#!3/887a3/5