How to ignore HTML in an XML element when validating with RELAX NG compact syntax

How can I write a pattern that ignores HTML within an element, rather than having the validator try to validate it? For example:
<stuff>
<data>
this is some text <b>with the odd</b> bit of html<p>and unclosed tags
</data>
</stuff>
This isn't valid, but I tried things like:
datatypes xs = "http://www.w3.org/2001/XMLSchema-datatypes"
start = stuff
stuff = element stuff
{
element data { * }
}

You can't allow arbitrary unmodified HTML within XML. Either escape the individual special characters (What are the official XML reserved characters?) or encapsulate the HTML within a CDATA container (Is it possible to insert HTML content in XML document?).
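As a rough illustration of the two options above, here is a small sketch using Python's standard library (the choice of Python is mine; the question does not name a language). It embeds the question's HTML fragment once with escaped special characters and once inside a CDATA section, then parses both to show that the original text comes back unchanged. Note that the CDATA variant assumes the fragment never contains the sequence "]]>".

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# The non-well-formed HTML fragment from the question
html_fragment = "this is some text <b>with the odd</b> bit of html<p>and unclosed tags"

# Option 1: escape the special characters (&, <, >)
escaped_doc = "<stuff><data>" + escape(html_fragment) + "</data></stuff>"

# Option 2: wrap the fragment in a CDATA section
# (assumption: the fragment does not contain "]]>")
cdata_doc = "<stuff><data><![CDATA[" + html_fragment + "]]></data></stuff>"

# Both documents are well-formed XML, and the parser hands back the raw HTML text
escaped_text = ET.fromstring(escaped_doc).find("data").text
cdata_text = ET.fromstring(cdata_doc).find("data").text
```

Either way, the data element then contains only text as far as the schema is concerned, so a plain `element data { text }` pattern would accept it.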

You won't be able to validate an XML document with non-well-formed HTML in it, since on account of the non-wellformedness such documents are not XML documents. But if in fact the input you're getting is XML, then you can certainly define data to allow any well-formed HTML elements, or any well-formed XML.
Allowing any well-formed XML is the simplest. We define a pattern that means "any well-formed XML here": any elements encountered are validated using the same pattern, recursively:
wellformed-xml = (text
| element * { wellformed-xml }
)*
Now define the data element to use that pattern:
stuff = element stuff {
element data { wellformed-xml }
}
If you really want to ensure that it's just HTML, you'll want a nameclass more restrictive than "*". I've populated it with b, i, p, span, and div, and leave it as an exercise to you to add the other elements you want.
start = stuff
stuff =
element stuff {
element data { wellformed-html }
}
wellformed-html =
(text
| element b | div | i | p | span { wellformed-html }
)*
If you want to be able to support XHTML input as well, you'll want to use a namespace reference; again, an exercise for the reader.
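To make the recursion in the wellformed-html pattern a bit more concrete, here is a sketch in Python that walks a parsed document the same way. It is not a RELAX NG validator, only a hand-written approximation: the function names are made up for illustration, and it ignores attributes and stray text inside stuff, which a real validator would also check.

```python
import xml.etree.ElementTree as ET

# Same element names as the name class in the grammar above
ALLOWED = {"b", "div", "i", "p", "span"}

def matches_wellformed_html(element):
    """True if every child is an allowed element whose own children
    match the same rule, mirroring the recursive pattern."""
    return all(
        child.tag in ALLOWED and matches_wellformed_html(child)
        for child in element
    )

def data_is_valid(xml_source):
    """Approximate check for: element stuff { element data { wellformed-html } }."""
    root = ET.fromstring(xml_source)
    if root.tag != "stuff":
        return False
    children = list(root)
    if len(children) != 1 or children[0].tag != "data":
        return False
    return matches_wellformed_html(children[0])

good = "<stuff><data>text <b>bold <i>nested</i></b><p>para</p></data></stuff>"
bad = "<stuff><data>text <script>alert(1)</script></data></stuff>"
```

Here data_is_valid(good) is True and data_is_valid(bad) is False, because script is not in the allowed set.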

How to use 'find_elements_by_xpath' inside a for loop

I'm somewhat (or very) confused about the following:
from selenium.webdriver import Chrome
driver = Chrome()
html_content = """
<html>
<head></head>
<body>
<div class='first'>
Text 1
</div>
<div class="second">
Text 2
<span class='third'> Text 3
</span>
</div>
<div class='first'>
Text 4
</div>
<my_tag class="second">
Text 5
<span class='third'> Text 6
</span>
</my_tag>
</body>
</html>
"""
driver.get("data:text/html;charset=utf-8,{html_content}".format(html_content=html_content))
What I'm trying to do, is find each span element using xpath, print out its text and then print out the text of the parent of that element. The final output should be something like:
Text 3
Text 2
Text 6
Text 5
I can get the text of span like this:
el = driver.find_elements_by_xpath("*//span")
for i in el:
    print(i.text)
With the output being:
Text 3
Text 6
But when I try to get the parent's (and only the parent's) text by using:
elp = driver.find_elements_by_xpath("*//span/..")
for i in elp:
    print(i.text)
The output is:
Text 2 Text 3
Text 5 Text 6
The XPath expressions *//span/.. and //span/../text() usually (but not always, depending on which XPath test site is being used) evaluate to:
Text 2
Text 5
which is what I need for my for loop.
Hence the confusion. So I guess what I'm looking for is a for loop which, in pseudo code, looks like:
el = driver.find_elements_by_xpath("*//span")
for i in el:
    print(i.text)
    print(i.parent.text)  # trying this in real life raises an error....
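There are probably a few ways to do this. Here's one way:
elp = driver.find_elements_by_css_selector("span.third")
for i in elp:
    print(i.text)
    # go up one level to the parent and take its raw inner HTML
    s = i.find_element_by_xpath("./..").get_attribute("innerHTML")
    # keep only the text that precedes the first child tag
    print(s.split('<')[0].strip())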
I used a simple CSS selector to find the child elements ("text 3" and "text 6"). I loop through those elements and print their .text as well as navigate up one level to find the parent and print its text also. As OP noted, printing the parent text also prints the child. To get around this, we need to get the innerHTML, split it and strip out the spaces.
To explain the XPath in more detail
./..
^ start at an existing node, the 'i' in 'i.find_element_*'. If you skip/remove this '.', you will start at the top of the DOM instead of at the child element you've already located.
^ go up one level, to find the parent
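The split-and-strip step can be tried without a browser. The sketch below applies it to two hand-written innerHTML strings (both are assumptions modelled on the question's markup; the helper name is made up). It also shows a limitation worth knowing about: only the text before the first child tag is kept, so any parent text that follows a child element is dropped.

```python
def text_before_first_tag(inner_html):
    """Return the text that precedes the first '<' in an innerHTML string."""
    return inner_html.split("<")[0].strip()

# Shaped like the innerHTML of <div class="second"> in the question
parent_html = "\nText 2\n<span class='third'> Text 3\n</span>\n"

# Hypothetical variant where the parent also has text after the child
trailing_html = "Text 2 <span class='third'>Text 3</span> and more"

first = text_before_first_tag(parent_html)
second = text_before_first_tag(trailing_html)
```

Both calls give "Text 2"; in the second case the trailing " and more" is lost, which is fine for the markup in the question but not in general.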
I know I already accepted @JeffC's answer, but in the course of working on this question something occurred to me. It's very likely overkill, but it's an interesting approach and, for the sake of future generations, I figured I might as well post it here.
The idea involves using BeautifulSoup. The reason is that BS has a couple of methods for erasing nodes from the tree. One of them that can be useful here (and for which, to my knowledge, Selenium doesn't have an equivalent) is decompose(). We can use decompose() to suppress the second part of the parent's text, which is contained inside a span tag, by eliminating the tag and its content. So we import BS and start with @JeffC's answer:
from bs4 import BeautifulSoup
elp = driver.find_elements_by_css_selector("span.third")
for i in elp:
    print(i.text)
    s = i.find_element_by_xpath("./..").get_attribute("innerHTML")
    # and here switch to bs4
    content = BeautifulSoup(s, 'html.parser')
    content.find('span').decompose()
    print(content.text)
And the output, without string manipulation, regex, or whatnot is...:
Text 3
Text 2
Text 6
Text 5
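If pulling in BeautifulSoup is not an option, a similar "parent text only" result can be approximated with Python's built-in html.parser. The sketch below is my own variation rather than part of the approach above: instead of deleting the span, it keeps only the text found at nesting depth zero. The class and function names are made up, and it assumes every start tag in the fragment has a matching end tag (an unclosed tag such as a bare br would throw the depth count off).

```python
from html.parser import HTMLParser

class DirectTextCollector(HTMLParser):
    """Collect text that is not nested inside any child element."""

    def __init__(self):
        super().__init__()
        self.depth = 0   # how many child elements we are currently inside
        self.parts = []  # text chunks seen at depth 0

    def handle_starttag(self, tag, attrs):
        self.depth += 1

    def handle_endtag(self, tag):
        self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0:
            self.parts.append(data)

def direct_text(inner_html):
    parser = DirectTextCollector()
    parser.feed(inner_html)
    parser.close()  # flush any buffered trailing text
    return "".join(parser.parts).strip()

# Shaped like the innerHTML of <div class="second"> in the question
parent_html = "\nText 2\n<span class='third'> Text 3\n</span>\n"
result = direct_text(parent_html)
```

For this input, result is "Text 2"; the string passed in would come from get_attribute("innerHTML") as in the loop above.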
i.parent.text will not work. In Java I used to write something like:
ele.get(i).findElement(By.xpath("parent::div")).getText(); // the path to the parent may be parent::div
Here is a Python method that retrieves the text from only the parent node.
def get_text_exclude_children(element):
    # Concatenate only the element's direct text nodes, skipping child elements
    return driver.execute_script(
        """
        var parent = arguments[0];
        var child = parent.firstChild;
        var textValue = "";
        while(child) {
            if (child.nodeType === Node.TEXT_NODE)
                textValue += child.textContent;
            child = child.nextSibling;
        }
        return textValue;""",
        element).strip()
This is how to use the method in your case:
elements = driver.find_elements_by_css_selector("span.third")
for eleNum in range(len(elements)):
    # XPath positions are 1-based, hence eleNum + 1
    print(driver.find_element_by_xpath("(//span[@class='third'])[" + str(eleNum+1) + "]").text)
    print(get_text_exclude_children(driver.find_element_by_xpath("(//span[@class='third'])[" + str(eleNum+1) + "]/parent::*")))
Here is the output:
Text 3
Text 2
Text 6
Text 5

Get the nested nodes in a string with Xpath or HtmlAgilityPack

On the server, I am getting back an HTML snippet as a string via AJAX from the client JS. The contents are a nested div with ul and li items. HTML div snippet:
<div> //please see link above
<ul class="tree" id="ulID" name="input">
<li><span class="vertical..."></span>
<div></span>1</div>
<ul>..
</div>
I am using C# HtmlAgilityPack, but I am not able to get the nested contents to extract the data, and add data back.
Below is some of the code.
HtmlAgilityPack.HtmlDocument htmlDoc = new HtmlAgilityPack.HtmlDocument();
// nested
htmlDoc.OptionFixNestedTags=true;
bool failed = false;
// Use: htmlDoc.LoadHtml(htmlString);
// ParseErrors is an ArrayList
if (htmlDoc.ParseErrors != null && htmlDoc.ParseErrors.Count() > 0)
{
// Handle any parse errors as required
// check if string was JSON formatted
if (htmlDoc.LoadHtml(JSONdeserialize(htmlString)).ParseErrors.Count() > 0) failed = true;
}
else
{
if (htmlDoc.DocumentNode != null)
{
HtmlAgilityPack.HtmlNode bodyNode = htmlDoc.DocumentNode.SelectSingleNode("//ulID");
if (bodyNode != null)
{
// **how can I get the contents of the node here??****
// what is the xpath to get all the structured contents so I can walk the tree
// If option walk tree
// How can I build foreach(HTMLnode node in nodes) nested array
}
}
}
What is the XPath to select all content in a DOM string when I don't have a body, just a simple div-enclosed string?
How can I extract all the nodes, and their contents, at their nested levels?
Any recommendations on how to save this structure so I can easily recover it?
I am not sure the XPath you have now is correct: "//ulID" looks for an element named ulID, not for an element whose id is ulID.
I am also unsure where the first ul tag ends. If it ends just before the div closes, then you can just use this XPath:
"//ul[@id='ulID']"
That gets you the first ul HtmlNode, and you can then iterate through its children.
I highly recommend you take a look at some XPath examples.
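For what it's worth, here is the same idea sketched in Python with the standard library rather than HtmlAgilityPack: select the ul by its id attribute, walk its children recursively, and serialize the result so it can be recovered later. The snippet is a made-up, well-formed stand-in for the real markup (the original is only partially shown), and to_nested is a hypothetical helper name.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the div/ul/li snippet
snippet = """
<div>
  <ul class="tree" id="ulID">
    <li><span>1</span>
      <ul>
        <li><span>1.1</span></li>
      </ul>
    </li>
    <li><span>2</span></li>
  </ul>
</div>
"""

def to_nested(node):
    """Turn an element and all of its descendants into nested dicts."""
    return {
        "tag": node.tag,
        "text": (node.text or "").strip(),
        "children": [to_nested(child) for child in node],
    }

root = ET.fromstring(snippet)
ul = root.find(".//ul[@id='ulID']")  # attribute test uses @id
tree = to_nested(ul)

# One way to save the structure and get it back later
saved = json.dumps(tree)
restored = json.loads(saved)
```

In C# the shape would be similar: SelectSingleNode with the @id predicate, then a recursive method over ChildNodes.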

Alternative to innerHTML for IE

I create HTML documents that include proprietary tags for links that get transformed into standard HTML when they go through our publishing system. For example:
<LinkTag contents="Link Text" answer_id="ID" title="Tooltip"></LinkTag>
When I'm authoring and reviewing these documents, I need to be able to test these links in a browser, before they get published. I wrote the following JavaScript to read the attributes and write them into an <a> tag:
var LinkCount = document.getElementsByTagName('LinkTag').length;
for (i=0; i<LinkCount; i++) {
var LinkText = document.getElementsByTagName('LinkTag')[i].getAttribute('contents');
var articleID = document.getElementsByTagName('LinkTag')[i].getAttribute('answer_id');
var articleTitle = document.getElementsByTagName('LinkTag')[i].getAttribute('title');
document.getElementsByTagName('LinkTag')[i].innerHTML = '' + LinkText + '';
}
This works great in Firefox, but not in IE. I've read about the innerHTML issue with IE and imagine that's the problem here, but I haven't been able to figure a way around it. I thought perhaps jQuery might be the way to go, but I'm not that well versed in it.
This would be a huge productivity boost if I could get this working in IE. Any ideas would be greatly appreciated.
innerHTML only affects what is INSIDE an element's opening and closing tags. So, for instance, if your LinkTag[i] were an <a> element, setting innerHTML="<a ...> </a>" would put that markup literally between the existing <a> and </a> tags.
You would need to put it in a DIV instead. Perhaps use your code to read the attributes from the links, then place the corresponding HTML into a div.
var LinkCount = document.getElementsByTagName('LinkTag').length;
for (i=0; i<LinkCount; i++) {
var LinkText = document.getElementsByTagName('LinkTag')[i].getAttribute('contents');
var articleID = document.getElementsByTagName('LinkTag')[i].getAttribute('answer_id');
var articleTitle = document.getElementsByTagName('LinkTag')[i].getAttribute('title');
document.getElementById('MyDisplayDiv').innerHTML = '' + LinkText + '';
}
This should produce your HTML results within a div. You could also simply append the other LinkTag elements to a single DIV to produce a sort of "Preview of Website" within the div.
document.getElementById('MyDisplayDiv').innerHTML += '' + LinkText + '';
Note the +=. It appends your generated HTML to the DIV with ID "MyDisplayDiv", so your list of LinkTag elements is converted into a list of HTML elements within the div.
DOM functions might be considered more "clean" (and faster) than simply replacing the innerHTML in cases like this. Something like this may work better for you:
// where el is the element you want to wrap in an <a> link
var newel = document.createElement('a');
el.parentNode.insertBefore(newel, el);   // put the new <a> just before el
el = el.parentNode.removeChild(el);      // detach el from its old position
newel.appendChild(el);                   // and re-attach it inside the <a>
newel.setAttribute('href', 'urlhere');
Similar code worked fine for me in Firebug, it should wrap the element with <a> ... </a>.
I wrote a script, which you can find on my blog here: http://blog.funelr.com/?p=61. Take a look: it automatically fixes IE innerHTML errors in IE8 and IE9 by hijacking IE's innerHTML property, which is "read-only" for table elements, and replacing it with my own.
I have this xmlObject:
<c:value i:type="b:AccessRights">ReadAccess WriteAccess AppendAccess AppendToAccess CreateAccess DeleteAccess ShareAccess AssignAccess</c:value>
I read the value like this: responseXML.getElementsByTagNameNS("http://schemas.datacontract.org/2004/07/System.Collections.Generic",'value')[0].firstChild.data
This one is the best: elm.insertAdjacentHTML( 'beforeend', str );
In case your element's content is very small: elm.innerHTML += str;

Cannot get FusionCharts setDataXML method to work in Ruby on Rails

This may be a sad implementation to start with but...
I have an XML string that I want to pass to a FusionChart but it fails (without errors...).
Controller code: trying to pull XML from a page
data_url = "[removed URL]/run/custom.xml?start=#{@start_week}&end=#{@end_week}&x=daychart=line&application=#{@application}".html_safe
data_uri = URI::escape(data_url)
@response = RestClient.get(data_uri)
Rails.logger.debug("Data: " + @response.inspect)
View Code: instantiating the Chart and passing the XML
<script>
var chart = new FusionCharts("/charts/MSLine.swf", "quality_center_stats", "700", "400", "1", "0");
chart.setDataXML("<%= @response %>");
chart.render("quality_center_stats");
</script>
Let me know if there is more I can add
It could be any of the following:
The XML contains special characters which need to be encoded for the DataXML method, e.g. % should be passed as %25, & as %26, and " as %26quot; or &quot;
Make sure you do not have any newline characters in that XML string. They would break the string in JavaScript (and you would get an "unterminated string" error in the JavaScript console)
Make sure the XML data being passed conforms to the FusionCharts Multi-series data format
Make sure you have placed the SWF and JS files in proper paths and are accessed correctly. (can check through network tab of Firebug or Developer Console of Chrome)
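The first two points can be sketched as a small helper. This is written in Python only to keep it short and testable (in the Rails app the equivalent would be a few gsub calls on the response before it reaches the view); the function name is made up, and the exact set of characters FusionCharts needs encoded is taken from the list above rather than verified against its documentation.

```python
def encode_for_data_xml(xml_string):
    """Strip newlines and encode %, " and & as described in the list above."""
    # Newlines would terminate the JavaScript string literal early
    one_line = xml_string.replace("\r", "").replace("\n", "")
    # Encode % first so the %26 sequences added below are not re-encoded
    encoded = one_line.replace("%", "%25")
    # Turn " into &quot; and then encode every & (including that one) as %26
    encoded = encoded.replace('"', "&quot;")
    encoded = encoded.replace("&", "%26")
    return encoded

sample = '<set value="50%"/>\n<set label="a&amp;b"/>'
encoded_sample = encode_for_data_xml(sample)
```

For this sample the result is a single line with no double quotes left in it, so it can sit inside the double-quoted argument of setDataXML without ending the string early.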

Dojo disable all input fields in div container

Is there any way to disable all input fields in a div container with dojo?
Something like:
dijit.byId('main').disable -> Input
That's how I do it:
dojo.query("input, button, textarea, select", container).attr("disabled", true);
This one-liner disables all form elements in the given container.
Sure there is. Open up this form test page for example, launch FireBug and execute in the console:
var container = dojo.query('div')[13];
console.log(container);
dojo.query('input', container).forEach(
function(inputElem){
console.log(inputElem);
inputElem.disabled = 'disabled';
}
)
Notes:
On that test page form elements are actually dijit form widgets, but in this sample I'm treating them as if they were normal input tags
The second dojo.query selects all input elements within the container element. If the container had some unique id, you could simplify the sample by having only one dojo.query: dojo.query('#containerId input').forEach( ...
forEach loops through all found input elements and applies the given function on them.
Update: There's also a shortcut for setting an attribute value using NodeList's attr function instead of forEach. attr takes first the attribute name and then the value or an object with name/value pairs:
var container = dojo.query('div')[13];
dojo.query('input', container).attr('disabled', 'disabled');
Something else to keep in mind is the difference between a Dijit and a regular DomNode. If you want all Dijits within a DomNode, you can convert them from nodes to Dijit references with query, no problem:
// find all widget-dom-nodes in a div, convert them to dijit reference:
var widgets = dojo.query("[widgetId]", someDiv).map(dijit.byNode);
// now iterate over that array making each disabled in dijit-land:
dojo.forEach(widgets, function(w){ w.attr("disabled", "disabled"); });
It really just depends on if your inputs are regular Dom input tags or have been converted into the rich Dijit templates (which all do have a regular input within them, just controlled by the widget reference instead)
I would do it like this:
require(["dijit/registry", "dojo/dom", "dojo/_base/array"], function(registry, dom, array){
    // collect every widget underneath the DOM node with the given id
    var widgets = registry.findWidgets(dom.byId(domId));
    array.forEach(widgets, function(widget){
        widget.set("disabled", true);
    });
});
The findWidgets method is what gets all the widgets underneath a specific DOM node. Loading all three modules in a single require call ensures widgets is populated before the loop runs.