libxml2 get inner (X)HTML - objective-c

I have some sample XHTML data, like this:
<html>
<head>
<style type="text/css">
..snip
</style>
<script type="text/javascript" src="http://code.jquery.com/mobile/1.0a4.1/jquery.mobile-1.0a4.1.js"></script>
</head>
<body>
<div id="contentA">
This is sample content <b> that is bolded as well </b>
</div>
</body>
</html>
Now, given an xmlNode *, I need to get the inner HTML of the div contentA. I already have the xmlNode * for it, but how can I get its inner XML? I looked at content, but that only returns This is sample content and not the markup in the bold tags. I also looked into jQuery, but due to limitations with JavaScript on Apple platforms, I cannot use jQuery to get the inner XML of that node.
On another note, is there another library I should be using to get the inner XML? I looked into TBXML, but it had the same problem.

The content of the div node is not a single text string. It probably consists of:
A text node containing This is sample content (with the preceding new line).
An element node with a tag name of b.
A text node containing the trailing new line and the indentation up to the div's closing tag.
The element node for the <b>...</b> will have the text content that is bolded as well.
To get all the text in the div as one string, you'll need to recursively descend through the entire tree of child nodes looking for text content.
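The node layout described above can be illustrated with a minimal sketch. It uses Python's standard xml.etree.ElementTree rather than libxml2, so the function names inner_text and inner_xml are hypothetical helpers and the libxml2 C calls will differ; only the shape of the tree walk carries over. The sample document is a trimmed copy of the one in the question.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the question's document (head omitted for brevity).
XHTML = """<html>
<body>
<div id="contentA">
This is sample content <b> that is bolded as well </b>
</div>
</body>
</html>"""


def inner_text(node):
    """Recursively collect the text of node and all of its descendants."""
    parts = [node.text or ""]          # text before the first child
    for child in node:
        parts.append(inner_text(child))
        parts.append(child.tail or "")  # text that follows the child
    return "".join(parts)


def inner_xml(node):
    """Serialize the node's children, markup included, but not the node itself."""
    # ET.tostring() on a child also emits the text that follows it (its tail).
    return (node.text or "") + "".join(
        ET.tostring(child, encoding="unicode") for child in node
    )


root = ET.fromstring(XHTML)
div = root.find(".//div[@id='contentA']")
print(repr(inner_text(div)))
print(repr(inner_xml(div)))
```

The first helper flattens everything to one string; the second keeps the b element's tags, which is closer to what the question calls inner XML.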

Related

How can I search a Beautiful Soup tree to get the tag path to a text match?

I would like to search a Beautiful Soup element for a text match and return the sequence of tags that lead to the element containing that text.
For example, if at soup.html.head.meta there is text “Hello everybody”, I would like to search on “soup.head” for “Hello everybody” and return the result “soup.html.head.meta”.
Is there a good way to do this and if there is not a simple way, is there a good workaround for quickly finding out where certain known text is located?
Example:
I retrieved the HTML source code from this URL with wget: https://www.gitpod.io/docs/context-urls
I created a Beautiful Soup object from this document like so:
soup = bs4.BeautifulSoup(doc, 'html.parser')
The method soup.html.head.get_text() returns
'\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nGitpod
Contexts\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n'
I know that somewhere in the head element is the text "Gitpod Contexts". I would like to find the nearest enclosing tag so I can delete everything except that element. I am trying to prune the Beautiful Soup object myself so that it contains only elements with text in them, rather than calling get_text() on the entire object and pulling the text out automatically.
Example 2
A simpler demonstration would be this:
<html>
<body>
<p>
Hello!
</p>
<p>
Goodbye!
</p>
</body>
</html>
The function:
html.returnLocationOf("Hello!")
returns:
html.body.p
I don't know enough about Beautiful Soup to know how it would specify "the second p" for "Goodbye!" but I imagine it could be incorporated as a method somehow.
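One possible workaround for the lookup described in Example 2 is sketched below. It deliberately uses the standard-library html.parser instead of Beautiful Soup, and the names TagPathFinder and locations_of are hypothetical. It tracks the stack of open tags and records the dotted path whenever a text node contains the search string. In Beautiful Soup itself, walking .parents from a matched string would presumably yield the same information.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not stay on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TagPathFinder(HTMLParser):
    """Record the dotted tag path of every text node containing `needle`."""

    def __init__(self, needle):
        super().__init__()
        self.needle = needle
        self.stack = []   # currently open tags, outermost first
        self.paths = []   # one dotted path per matching text node

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; tolerates unclosed tags in between.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        if self.needle in data:
            self.paths.append(".".join(self.stack))


def locations_of(html, needle):
    """Return the tag paths of all text nodes in `html` that contain `needle`."""
    finder = TagPathFinder(needle)
    finder.feed(html)
    finder.close()
    return finder.paths


DOC = """<html>
<body>
<p>
Hello!
</p>
<p>
Goodbye!
</p>
</body>
</html>"""

print(locations_of(DOC, "Hello!"))
```

Both paragraphs report the same path here, so telling the first p from the second would need an extra sibling index on each stack entry.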

How do I validate that a particular script is present in the page source using Selenium?

In the page source I have script tags as below.
How can I validate in Selenium that particular scripts are present?
<script src="/core/assets/vendor/domready/ready.min.js?v=1.0.8"></script>
<script src="/core/misc/drupalSettingsLoader.js?v=8.4.8"></script>
<script src="/core/misc/drupal.js?v=8.4.8"></script>
<script src="/core/misc/drupal.init.js?v=8.4.8"></script>
You can search for the src attribute of a script element, i.e. find the element by attribute:
driver.findElement(By.xpath("//script[@src='/core/assets/vendor/domready/ready.min.js?v=1.0.8']"))
OR
driver.findElement(By.xpath("//script[contains(@src,'/core/assets/vendor/domready/ready.min.js?v=1.0.8')]"))
OR
driver.findElement(By.cssSelector("script[src='/core/assets/vendor/domready/ready.min.js?v=1.0.8']"))

MSBuild to update html tag

Here is my html file:
<!DOCTYPE html>
<html>
<head>
</head>
<body>
<script id="ScriptId" src=""></script>
</body>
</html>
I want to replace the empty src with script.js.
I tried XmlPoke, but I think my XPath query doesn't work, or maybe it can't be done this way:
<XmlPoke XmlInputPath="test.html"
Query="/html/body/script[id='ScriptId']/src"
Value="script.js"/>
Thanks in advance for helping me update this src value.
Attributes in XPath are prefixed with @.
/html/body/script[@id='ScriptId']/@src
You probably shouldn't be using something designed for XML with HTML, as the two are not the same. At best, if the HTML is well-formed, it'll strip out non-XML stuff like the DOCTYPE; at worst, it'll blow up.
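The role of the @ prefix in an attribute test can be seen in a small sketch. It is not MSBuild: it uses Python's standard xml.etree.ElementTree, which supports only a limited XPath subset, and it assumes the page is well-formed XML (the DOCTYPE from the question is left out on purpose).

```python
import xml.etree.ElementTree as ET

# The question's page without the DOCTYPE, so it parses as plain XML.
PAGE = """<html>
<head>
</head>
<body>
<script id="ScriptId" src=""></script>
</body>
</html>"""

root = ET.fromstring(PAGE)

# [@id='ScriptId'] filters on the id attribute. Without the @, the predicate
# looks for a child element named <id> instead and finds nothing.
script = root.find("./body/script[@id='ScriptId']")
script.set("src", "script.js")

print(ET.tostring(script, encoding="unicode").strip())
```

The same query without the @, as in the original XmlPoke attempt, matches no node here, which is consistent with the poke silently doing nothing.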

VBA insertAdjacentHTML, stripping tags

I am processing some HTML in VBA and want to inject a <base> element into the <head> tag.
oElement.insertAdjacentHTML "beforeEnd", "<base>HELLO</base>"
If I inspect oElement.OuterHTML, all that is added is HELLO:
...<LINK rel=stylesheet type=text/css href="css/default.css">HELLO</HEAD>...
If I try adding li tags, it works as expected.
oElement.insertAdjacentHTML "beforeEnd", "<li>HELLO</li>"
Result
....<LINK rel=stylesheet type=text/css href="css/default.css">HELLO <LI>HELLO</LI> </HEAD>...
I've tried using just <base /> or <base href="blah blah, but nothing gets added. Am I missing some key piece of knowledge about insertAdjacentHTML?
Any ideas?
You need to use the IHTMLDOMNode interface for the head object (I don't know why, but it works). Create a "BASE" element, set its href attribute, and finally add it to the head using appendChild.

Is it valid to put h2 tag in span tag?

Is it valid to put an h2 tag inside a span tag, given that the span is displayed as a block?
Would it make a difference for search engines (SEO) if I used a div instead?
Sample input:
<!DOCTYPE HTML>
<html>
<head><title></title></head>
<body>
<span style="display: block">
<h2>A</h2>
</span>
</body>
</html>
And results from W3C validator:
Element h2 not allowed as child of element span in this context.
No, you can't. According to the HTML 4.01/XHTML 1.0 DTD, you can include only inline elements in a span tag. They are the following:
a, object, applet, img, map, iframe, br, span, bdo, tt, i, b, u, s, strike, big, small, font, basefont, sub, sup, em, strong, dfn, code, q, samp, kbd, var, cite, abbr, acronym, input, select, textarea, label, button, ins, del, script.
I can't quickly check HTML 5, but I don't think it's different here.
In HTML4, it is not valid to put any block element inside of any inline element.
This changes in HTML5, where it is valid to put block-level elements inside of anchor tags.