MSSQL read tags from XHTML stored as XML column - sql

I have stored a ~500 KB, ~10k-line XHTML document in a column of the XML data type. Now I want to extract data from a single table in this document. There seem to be many ways to deal with XML-structured data; the most promising solution (I tried others) seemed to be:
how to get data from html in SQL Server column
When I run this test code:
DECLARE @HtmlTbl TABLE (ID INT IDENTITY, Html XML)
INSERT INTO @HtmlTbl(Html) VALUES('
<html>
<head>
<title>asdf</title>
</head>
<body>
</body>
</html>
')
SELECT html.query('//title')
FROM @HtmlTbl
WHERE ID = 1
it works pretty well. But if my XHTML looks like this:
INSERT INTO @HtmlTbl(Html) VALUES('
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>asdf</title>
</head>
<body>
</body>
</html>
')
it already fails, and I get empty strings as the result. Of course, my original XHTML file is much larger, and I don't want comments and other content in it to cause my search to fail.
I'm new to handling XML in MSSQL; maybe someone can tell me a better way to extract the table I'm looking for. The server runs SQL Server 2014 Express. Thanks for any help.
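The empty result is consistent with the xmlns attribute on the html element: once a default namespace is declared, every element belongs to it, and an unqualified path such as //title no longer matches. A minimal sketch of two ways to deal with that, assuming the document uses the standard XHTML namespace URI (the URI in the query has to match the xmlns value in the stored document exactly); the column alias Title is only illustrative:

```sql
-- Sketch only: same table variable as in the question, with a namespaced document.
DECLARE @HtmlTbl TABLE (ID INT IDENTITY, Html XML);

INSERT INTO @HtmlTbl (Html) VALUES ('
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>asdf</title>
</head>
<body>
</body>
</html>
');

-- Option 1: declare the document's namespace as the default for the query.
-- The statement before WITH must be terminated with a semicolon.
WITH XMLNAMESPACES (DEFAULT 'http://www.w3.org/1999/xhtml')
SELECT Html.value('(//title)[1]', 'nvarchar(200)') AS Title
FROM @HtmlTbl
WHERE ID = 1;

-- Option 2: match the element in any namespace with the *: wildcard.
SELECT Html.value('(//*:title)[1]', 'nvarchar(200)') AS Title
FROM @HtmlTbl
WHERE ID = 1;
```

Option 1 is stricter; option 2 is more forgiving if the namespace differs between documents. The same prefixing applies to every step of a longer path, for example one that walks down to a specific table element.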

Related

Where do I find the "the HTMLQuestion schema URL" and CDATA (newbie ... run out of options)

I want to conduct a simple turk survey.
I've made the form, uploaded the images and set the details but I'm not quite sure what's next.
Here is the framework of my form with what I think are the AWS elements needed but:
1) How do I find the "the HTMLQuestion schema URL"?
2) Do I generate the assignmentId or does it get inserted on the POST?
3) Is there something I need to add for CDATA if it is a placeholder for an array?
(Please forgive my ignorance, but I may even be asking the wrong questions. I'm just not clear what to do next, especially how to test it myself (sandbox). I tried posting in the Turk forum but got no replies in two days. I don't expect the AWS manual to be written for novices.)
<pre>
<HTMLQuestion xmlns="[the HTMLQuestion schema URL]">
<HTMLContent><![CDATA[
<!DOCTYPE html>
<html>
<head>
<meta http-equiv='Content-Type' content='text/html; charset=UTF-8'/>
<script type='text/javascript' src='https://s3.amazonaws.com/mturk-public/externalHIT_v1.js'></script>
</head>
<body>
<form name='mturk_form' method='post' id='mturk_form' action='https://www.mturk.com/mturk/externalSubmit'>
<input type='hidden' value='' name='assignmentId' id='assignmentId'/>
... my handwritten form elements ...
</form>
<script language='Javascript'>turkSetAssignmentID();</script>
</body>
</html>
]]>
</HTMLContent>
<FrameHeight>0</FrameHeight>
</HTMLQuestion>
</pre>
Here is the latest link I found so far:
<HTMLQuestion xmlns="http://mechanicalturk.amazonaws.com/AWSMechanicalTurkDataSchemas/2011-11-11/HTMLQuestion.xsd">

MSBuild to update html tag

Here is my html file:
<!DOCTYPE html>
<html>
<head>
</head>
<body>
<script id="ScriptId" src=""></script>
</body>
</html>
I want to replace the empty src with script.js.
I tried XmlPoke, but I think my XPath query doesn't work, or maybe it can't be done this way:
<XmlPoke XmlInputPath="test.html"
Query="/html/body/script[id='ScriptId']/src"
Value="script.js"/>
Thanks in advance for helping me update this src value.
Attributes in XPath are prefixed with @.
/html/body/script[@id='ScriptId']/@src
You probably shouldn't use a tool designed for XML on HTML, as the two are not the same. At best, if the HTML is well-formed, it will strip out non-XML content such as the DOCTYPE; at worst, it will blow up.

libxml2 get inner (X)HTML

I have some sample XHTML data, like this:
<html>
<head>
<style type="text/css">
..snip
</style>
<script type="text/javascript" src="http://code.jquery.com/mobile/1.0a4.1/jquery.mobile-1.0a4.1.js"></script>
</head>
<body>
<div id="contentA">
This is sample content <b> that is bolded as well </b>
</div>
</body>
</html>
Now, what I need to do is, given an xmlNode *, get the inner HTML of the div contentA. I have the xmlNode * for it, but how can I get its inner XML? I looked at content, but that only returns This is sample content and not the XML in the bold tags. I looked into jQuery for this, but due to limitations on Apple and JavaScript, I cannot use jQuery to get the inner XML of that node.
On another note, is there another library I should be using to get the inner XML? I looked into TBXML, but that had the same problem.
The content of the div node is not a single text string. It probably consists of:
A text node containing This is sample content (with the preceding new line).
An element node with a tag name of b.
A text node containing the trailing new line and the indentation up to the div's closing tag.
The element node for the <b>...</b> will have the text content that is bolded as well.
To get all the text in the div as one string, you'll need to recursively descend through the entire tree of child nodes looking for text content.

Split table data in SQL and replace with results

I need to remove a bunch of unneeded data from each row based on split parameters.
My SQL table stores a bunch of HTML for caching. The data is already in SQL and is growing quite large, so now I want to strip out the data I don't use from each row based on a string and update the table with the new results.
cacheHTML table is holding data like this
<html>
... (a bunch of data I don't need)
<first div>
... (the data I do want to save)
</div>
... (data I don't care about also)
</html>
I only want what's inside the first div, and I want to remove all the HTML up to that point.
Is there an easy method for this? I need to do this for 5k rows of cached data...
I need a function or method that gives me everything between string1 and string2 and then replaces the table contents with the results. Any help would be appreciated. Thanks!
You could do something like this. It will only work if you always need the text inside the first div in the HTML string. I'm assuming SQL Server as the database system, but it could probably be translated to others pretty easily.
Sample html string:
<html>
<head>
<title>Stuff i dont need</title>
</head>
<body>
<h1>Stuff i dont need</h1>
<p>I dont need any of this data</p>
<div>This is the data i need to save!</div>
<h3>Dont need this</h3>
<div>Wont need this either!</div>
<h3>Bye</h3>
</body>
</html>
SQL to do the update:
UPDATE cacheHTML
SET htmlText = REPLACE(SUBSTRING(htmlText, CHARINDEX('<div>', htmlText, 0), CHARINDEX('</div>', htmlText, 0) - CHARINDEX('<div>', htmlText, 0)), '<div>', '')
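Before running an UPDATE like this on 5k rows, it may be worth previewing the result with a SELECT. The sketch below is one cautious variant, not the answer's exact method: it uses a hypothetical table variable @cacheHTML with made-up sample rows in place of the real cacheHTML table, assumes the opening tag is a bare <div> with no attributes, and returns NULL instead of raising an error when a row has no matching pair of tags.

```sql
-- Hypothetical sample data standing in for the real cacheHTML table.
DECLARE @cacheHTML TABLE (id INT IDENTITY, htmlText NVARCHAR(MAX));

INSERT INTO @cacheHTML (htmlText) VALUES
    (N'<html><h1>skip</h1><div>keep this</div><div>skip too</div></html>'),
    (N'<html><p>no div here</p></html>');

-- Preview: text between the first <div> and the first </div> that follows it.
SELECT c.id,
       CASE WHEN a.startPos > 0 AND b.endPos > a.startPos
            THEN SUBSTRING(c.htmlText,
                           a.startPos + LEN('<div>'),                -- skip the opening tag
                           b.endPos - a.startPos - LEN('<div>'))     -- length up to the closing tag
       END AS firstDiv                                               -- NULL when no pair is found
FROM @cacheHTML AS c
CROSS APPLY (SELECT CHARINDEX('<div>', c.htmlText) AS startPos) AS a
CROSS APPLY (SELECT CHARINDEX('</div>', c.htmlText, a.startPos) AS endPos) AS b;
```

Once the preview looks right, the same CASE expression can be used in the SET clause, with a WHERE condition that skips rows where it would be NULL. Nested divs are not handled: the first closing tag ends the match.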

Extracting XML (as xml type) using xpath query in SQL

I'm trying to extract a chunk of XML (i.e. the whole XML of a node, not just its content) using an XPath query in SQL. I can extract a single content field but am not sure how to do the above.
Say the xml is as follows
<head>
<instance>
<tag1>
<tag2>data</tag2>
<tag3>data</tag3>
</tag1>
</instance>
</head>
I would like to extract all of the xml inside tag1, and was hoping something like this would work in a SQL query:
Table.value('(/head/instance/tag1)[1]', 'varchar(max)') as "col"
Any help would be great.
Thanks
This should work:
Select Cast(Table.xmlcolumnname.query('/head/instance/tag1') as varchar(max)) 'col';
(Not tested; it may contain typos.)
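To illustrate the difference between the two methods, here is a small sketch using a variable @x (a name chosen only for this example) that holds the sample XML from the question: value() returns only the text content as a scalar, while query() returns the selected nodes with their markup.

```sql
-- Sample XML from the question, held in an illustrative variable.
DECLARE @x XML = N'<head><instance><tag1><tag2>data</tag2><tag3>data</tag3></tag1></instance></head>';

-- value(): scalar result, text content only (markup is dropped).
SELECT @x.value('(/head/instance/tag1)[1]', 'varchar(max)') AS TextOnly;

-- query(): XML result, the tag1 element including its own tags.
SELECT @x.query('/head/instance/tag1') AS NodeWithTag;

-- query() on the children: only what is inside tag1.
SELECT @x.query('/head/instance/tag1/*') AS InnerXml;

-- Cast to a string type if the caller needs text rather than the xml type.
SELECT CAST(@x.query('/head/instance/tag1/*') AS varchar(max)) AS InnerXmlText;
```

Against a table, the same expressions are written as column method calls, in the same way as the xmlcolumnname.query(...) call in the answer.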