XML File:
<apps>
<app name="google">
<branches>
<branch location="us">
<datacenters>
<datacenter city="Mayes County" />
<datacenter city="Douglas County" />
</datacenters>
</branch>
</branches>
</app>
<app name="facebook">
</app>
</apps>
Java Code:
XPath xpath = XPathFactory.newInstance().newXPath();
Document document = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(file);
String expression = "/apps/app[@name='google']";
// Select the app element once and keep it as the context node for later queries.
Element app = (Element) xpath.evaluate(expression, document.getDocumentElement(), XPathConstants.NODE);
I am able to store the app element and then execute an XPath expression relative to it, like this:
xpath.evaluate("branches/branch[@location='us']", app, XPathConstants.STRING);
How can I do this in VTD XML?
You can learn the API from VTD-XML's SourceForge web site. Below is roughly how I think it would look:
VTDGen vg = new VTDGen();
String filename ="this.xml";
if (vg.parseFile(filename, true)){
VTDNav vn = vg.getNav();
AutoPilot ap = new AutoPilot(vn);
ap.selectXPath("/apps/app[@name='google']");
int i=-1;
while((i=ap.evalXPath())!=-1){
//do something with i, it is the node index
}
}
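For comparison, here is the two-step pattern from the question (select an element once, then evaluate a relative expression against it) as a small self-contained program. This is only a sketch using the standard JAXP API, not VTD-XML; the class name ScopedXPathDemo and the helper firstCity are made up for illustration, and the attribute values passed in are assumed not to contain quote characters.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

public class ScopedXPathDemo {
    // The XML file from the question, inlined so the example is self-contained.
    static final String XML =
        "<apps><app name=\"google\"><branches><branch location=\"us\"><datacenters>"
      + "<datacenter city=\"Mayes County\"/><datacenter city=\"Douglas County\"/>"
      + "</datacenters></branch></branches></app><app name=\"facebook\"/></apps>";

    // Hypothetical helper: city of the first datacenter of the given app and
    // branch location, or "" when nothing matches.
    static String firstCity(String xml, String appName, String location) {
        try {
            Document document = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();

            // Step 1: absolute expression, evaluated once; keep the result.
            Element app = (Element) xpath.evaluate(
                "/apps/app[@name='" + appName + "']", document, XPathConstants.NODE);
            if (app == null) {
                return "";
            }

            // Step 2: relative expression, evaluated with the stored element
            // as the context node. An empty node-set converts to "".
            return xpath.evaluate(
                "branches/branch[@location='" + location + "']"
                    + "/datacenters/datacenter[1]/@city",
                app);
        } catch (Exception e) {
            throw new IllegalStateException("XML or XPath processing failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(firstCity(XML, "google", "us")); // prints "Mayes County"
    }
}
```

Note that the predicates use @name and @location; without the @ the expression would look for child elements called name and location instead of attributes.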
I've implemented code inside a ckeditor that replaces the src attribute of iframe elements with data-src and adds the data-cookieconsent attribute. In addition, a placeholder div is added after the iframe element. Regex is used to match the iframe elements in the string.
var value = CKEDITOR.instances.editor1.getData();
const regex = new RegExp('(?:<iframe[^>]*)(?:(?:\/>)|(?:>.*?<\/iframe>))');
const matches = value.match(regex);
if (typeof matches !== "undefined" && matches != null) {
matches.forEach(element => {
if (!element.includes("data-cookieconsent")) {
value = value.replace(element, element.replace("src=", "data-add-placeholder data-cookieconsent=\"marketing\" data-src="))
value = value.replace("</iframe>", "</iframe><div class=\"row justify-content-center\">" +
"<div class=\"cookieconsent-optout-marketing blocked-media-placeholder\">" +
"<div class=\"col-xs-6 col-xs-offset-3\">" +
"<h3>For at se denne video skal vi bruge dit samtykke til at anvende cookies.Venligst tryk på dette link og vfremvist denne video.</h3>" +
"</div></div></div>")
}
});
}
However, I now need to replace the currently existing iframe elements and add the placeholder div in the database using sql. I'm aware that I can use update with the replace function to update parts of a text column, but this seems to require Regex to work, which as I understand, is quite limited in T-SQL.
What would be the best approach to this problem? How can I ensure only the iframe elements in the column are altered?
I am a bit confused about how to do this. All of the docs and examples show how to read and edit XML documents, but there doesn't seem to be any clear way of creating an XML document from scratch, and I would rather not ship my program with a dummy XML file just to have something to edit. Any ideas? Thanks.
What you could do instead would be to just hard-code an empty document like this:
byte[] emptyDoc = "<?xml version='1.0' encoding='UTF-8'?><root></root>".getBytes("UTF-8");
And then use that to create your VTDGen and XMLModifier, and start adding elements:
VTDGen vg = new VTDGen();
vg.setDoc(emptyDoc);
vg.parse(true);
VTDNav vn = vg.getNav();
XMLModifier xm = new XMLModifier(vn);
// Cursor is at root, update Root Element Name
xm.updateElementName("employee");
xm.insertAttribute(" id='6'");
xm.insertAfterHead("<name>Bob Smith</name>");
vn = xm.outputAndReparse();
// etc...
I need to scrape a p tag that is followed by an h3 tag but has no closing p tag. It looks like this:
<script ad>asdasdasd</script>
<p>Translation companies are
-----------------------
-----------------------
<h3 class="this_class">mind blown site</h3>
There is no </p> tag, so I cannot parse it completely. Now I have a few questions:
1) Can this be parsed using HtmlAgilityPack XPath?
2) I have a function that finds the text between two strings (getbetween), but I have a doubt: if I use "asdasdasd" and " as the delimiters, is it guaranteed that VB.NET will use the script tag just above the h3? There are 2-3 identical "asdasdasd" lines.
3) Is there any other method you are aware of?
(had to write in code so html does not mess up)
It might be a good idea to post some more "real" html to really help you, at least the tags between the h3 and the p.
Anyway, this should get you the p-Tag from the h3-Tag.
HtmlDocument doc = new HtmlDocument();
doc.Load(... //Load the Html...
//Either of these lines will do
HtmlNode pNode = doc.DocumentNode.SelectSingleNode("//h3[@class='this_class']/preceding-sibling::p");
//HtmlNode pNode = doc.DocumentNode.SelectSingleNode("//h3[contains(text(),'mind blown site')]/preceding-sibling::p");
string pInnerHtml = pNode.NextSibling.InnerHtml; //Has the text "Translation companies are...."
So in general, to get all the nodes from the opening p tag to the start of a tag you don't want, you could do this:
var p = doc.DocumentNode.SelectSingleNode("//p");
var h3 = p.SelectSingleNode("following-sibling::h3[@class='this_class']");
var following = new List<string>();
for (var current = p.NextSibling; current != h3; current = current.NextSibling)
{
following.Add(current.InnerText);
}
var innerText = String.Concat(following);
I saw this question asked already, but I didn't see an answer.
So I get this error:
The ':' character, hexadecimal value 0x3A, cannot be included in a name.
On this code:
XDocument XMLFeed = XDocument.Load("http://feeds.foxnews.com/foxnews/most-popular?format=xml");
XNamespace content = "http://purl.org/rss/1.0/modules/content/";
var feeds = from feed in XMLFeed.Descendants("item")
select new
{
Title = feed.Element("title").Value,
Link = feed.Element("link").Value,
pubDate = feed.Element("pubDate").Value,
Description = feed.Element("description").Value,
MediaContent = feed.Element(content + "encoded")
};
foreach (var f in feeds.Reverse())
{
....
}
An item looks like that:
<rss>
<channel>
....items....
<item>
<title>Pentagon confirms plan to create new spy agency</title>
<link>http://feeds.foxnews.com/~r/foxnews/most-popular/~3/lVUZwCdjVsc/</link>
<category>politics</category>
<dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/" />
<pubDate>Tue, 24 Apr 2012 12:44:51 PDT</pubDate>
<guid isPermaLink="false">http://www.foxnews.com/politics/2012/04/24/pentagon-confirms-plan-to-create-new-spy-agency/</guid>
<content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[|http://global.fncstatic.com/static/managed/img/Politics/panetta_hearing_030712.jpg<img src="http://feeds.feedburner.com/~r/foxnews/most-popular/~4/lVUZwCdjVsc" height="1" width="1"/>]]></content:encoded>
<description>The Pentagon confirmed Tuesday that it is carving out a brand new spy agency expected to include several hundred officers focused on intelligence gathering around the world.&#160;</description>
<dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2012-04-4T19:44:51Z</dc:date>
<feedburner:origLink>http://www.foxnews.com/politics/2012/04/24/pentagon-confirms-plan-to-create-new-spy-agency/</feedburner:origLink>
</item>
....items....
</channel>
</rss>
All I want is to get the "http://global.fncstatic.com/static/managed/img/Politics/panetta_hearing_030712.jpg", and before that check if content:encoded exists..
Thanks.
EDIT:
I've found a sample that I can show and edit the code that tries to handle it..
EDIT2:
I've done it in the ugly way:
text.Replace("content:encoded", "contentt").Replace("xmlns:content=\"http://purl.org/rss/1.0/modules/content/\"","");
and then get the element in the normal way:
MediaContent = feed.Element("contentt").Value
The following code
static void Main(string[] args)
{
var XMLFeed = XDocument.Parse(
@"<rss>
<channel>
....items....
<item>
<title>Pentagon confirms plan to create new spy agency</title>
<link>http://feeds.foxnews.com/~r/foxnews/most-popular/~3/lVUZwCdjVsc/</link>
<category>politics</category>
<dc:creator xmlns:dc='http://purl.org/dc/elements/1.1/' />
<pubDate>Tue, 24 Apr 2012 12:44:51 PDT</pubDate>
<guid isPermaLink='false'>http://www.foxnews.com/politics/2012/04/24/pentagon-confirms-plan-to-create-new-spy-agency/</guid>
<content:encoded xmlns:content='http://purl.org/rss/1.0/modules/content/'><![CDATA[|http://global.fncstatic.com/static/managed/img/Politics/panetta_hearing_030712.jpg<img src='http://feeds.feedburner.com/~r/foxnews/most-popular/~4/lVUZwCdjVsc' height='1' width='1'/>]]></content:encoded>
<description>The Pentagon confirmed Tuesday that it is carving out a brand new spy agency expected to include several hundred officers focused on intelligence gathering around the world.&#160;</description>
<dc:date xmlns:dc='http://purl.org/dc/elements/1.1/'>2012-04-4T19:44:51Z</dc:date>
<!-- <feedburner:origLink>http://www.foxnews.com/politics/2012/04/24/pentagon-confirms-plan-to-create-new-spy-agency/</feedburner:origLink> -->
</item>
....items....
</channel>
</rss>");
XNamespace contentNs = "http://purl.org/rss/1.0/modules/content/";
var feeds = from feed in XMLFeed.Descendants("item")
select new
{
Title = (string)feed.Element("title"),
Link = (string)feed.Element("link"),
pubDate = (string)feed.Element("pubDate"),
Description = (string)feed.Element("description"),
MediaContent = GetMediaContent((string)feed.Element(contentNs + "encoded"))
};
foreach(var item in feeds)
{
Console.WriteLine(item);
}
}
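private static string GetMediaContent(string content)
{
// The (string) conversion yields null when content:encoded is missing.
if (string.IsNullOrEmpty(content))
{
return string.Empty;
}
// The URL is the text before the embedded <img> tag.
int imgStartPos = content.IndexOf("<img");
if(imgStartPos > 0)
{
// Skip the leading '|' separator if present.
int startPos = content[0] == '|' ? 1 : 0;
return content.Substring(startPos, imgStartPos - startPos);
}
return string.Empty;
}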
results in:
{ Title = Pentagon confirms plan to create new spy agency, Link = http://feeds.f
oxnews.com/~r/foxnews/most-popular/~3/lVUZwCdjVsc/, pubDate = Tue, 24 Apr 2012 1
2:44:51 PDT, Description = The Pentagon confirmed Tuesday that it is carving out
a brand new spy agency expected to include several hundred officers focused on
intelligence gathering around the world. , MediaContent = http://global
.fncstatic.com/static/managed/img/Politics/panetta_hearing_030712.jpg }
Press any key to continue . . .
A few points:
You never want to treat Xml as text. In your case you removed the namespace declaration, but if the namespace were declared inline (i.e. as a default namespace, without binding it to a prefix) or bound to a different prefix, your code would not work, even though semantically all of these documents are equivalent.
Unless you know what's inside CDATA and how to treat it, you always want to treat it as text. If you know it's something else, you can treat it differently after parsing; see my elaboration on CDATA below for more details.
To avoid NullReferenceExceptions if the element is missing, I used the explicit conversion operator (string) instead of invoking .Value.
The Xml you posted was not valid: the namespace Uri for the feedburner prefix was missing.
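The first point (prefixes are interchangeable, only the namespace Uri matters) can be checked in any namespace-aware parser. Below is a minimal sketch in Java with the standard JAXP DOM API rather than LINQ to XML; the class name NamespaceLookupDemo and the helper encodedText are made up for illustration, and the one-element documents are trimmed-down stand-ins for a feed item.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class NamespaceLookupDemo {
    static final String CONTENT_NS = "http://purl.org/rss/1.0/modules/content/";

    // Hypothetical helper: text of the first {CONTENT_NS}encoded element,
    // or null when the document has no such element.
    static String encodedText(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // off by default in JAXP
            Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            // Look up by namespace Uri + local name; the prefix plays no role.
            NodeList hits = doc.getElementsByTagNameNS(CONTENT_NS, "encoded");
            return hits.getLength() == 0 ? null : hits.item(0).getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("XML parsing failed", e);
        }
    }

    public static void main(String[] args) {
        // Same element, three textual spellings.
        String prefixed = "<item><content:encoded xmlns:content='" + CONTENT_NS
            + "'>a</content:encoded></item>";
        String otherPrefix = "<item><c:encoded xmlns:c='" + CONTENT_NS
            + "'>a</c:encoded></item>";
        String inline = "<item><encoded xmlns='" + CONTENT_NS + "'>a</encoded></item>";

        System.out.println(encodedText(prefixed));    // prints "a"
        System.out.println(encodedText(otherPrefix)); // prints "a"
        System.out.println(encodedText(inline));      // prints "a"
        System.out.println(encodedText("<item><title>t</title></item>")); // prints "null"
    }
}
```

A text replacement keyed on the literal string content:encoded would only handle the first of the three spellings.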
This is no longer related to the problem but may be helpful for some folks so I am leaving it
As for the content of the encoded element: it is inside a CDATA section. What is inside a CDATA section is not Xml but plain text. CDATA is usually used to avoid having to escape the '<', '>' and '&' characters (without CDATA they would have to be written as &lt;, &gt; and &amp; so as not to break the Xml document itself); the Xml processor treats the characters in a CDATA section as if they had been escaped. CDATA is convenient if you want to embed html, because textually the embedded content looks like the original, yet it won't break your Xml if the html is not well-formed Xml.
Since the CDATA content is text rather than Xml, it cannot be queried as Xml directly. You will probably need to treat it as text and use, for instance, regular expressions. If you know it is valid Xml, you can load the content into an XElement and process that. In your case you have mixed content (text followed by an element), so this is not straightforward without a slightly dirty hack: wrap the content in a single top-level element so that it parses. Inside the foreach loop you can do something like this:
var mediaContentXml = XElement.Parse("<content>" + (string)item.MediaContent + "</content>");
Console.WriteLine((string)mediaContentXml.Element("img").Attribute("src"));
Again, it's not pretty and it is a hack, but it will work if the content of the encoded element is valid Xml. The more correct way of doing this is to use XmlReader with ConformanceLevel set to Fragment and handle each kind of node appropriately to create the corresponding Linq to Xml node.
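To make the CDATA behaviour concrete, here is a small sketch in Java with the standard JAXP DOM API (not LINQ to XML). It shows that the parser hands back the CDATA content as plain text, with no img element in the tree, and that the wrapper-element hack only works on a second parse. The class name CdataDemo, the helper names and the example.com URLs are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

public class CdataDemo {
    // Shortened stand-in for the content:encoded element from the feed.
    static final String XML =
        "<encoded><![CDATA[|http://example.com/a.jpg"
      + "<img src=\"http://example.com/pixel\" height=\"1\" width=\"1\"/>]]></encoded>";

    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("XML parsing failed", e);
        }
    }

    // Number of img elements the parser sees in the document.
    static int imgCount(String xml) {
        return parse(xml).getElementsByTagName("img").getLength();
    }

    // The CDATA content, exactly as text.
    static String cdataText(String xml) {
        return parse(xml).getDocumentElement().getTextContent();
    }

    // The hack: wrap the text in one top-level element and parse it again.
    // This throws if the text is not well-formed Xml (e.g. a bare '&').
    static String imgSrc(String text) {
        Document wrapped = parse("<content>" + text + "</content>");
        Element img = (Element) wrapped.getElementsByTagName("img").item(0);
        return img == null ? null : img.getAttribute("src");
    }

    public static void main(String[] args) {
        System.out.println(imgCount(XML));          // prints 0: markup in CDATA is not parsed
        String text = cdataText(XML);
        System.out.println(text.startsWith("|http://example.com/a.jpg<img")); // prints true
        System.out.println(imgSrc(text));           // prints "http://example.com/pixel"
    }
}
```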
You should use XNamespace:
XNamespace content = "...";
// later in your code ...
MediaContent = feed.Element(content + "encoded")
See more details here.
(Of course, the string assigned to content must be the same as the one in xmlns:content="...".)
I have a very simple WCF service running which returns the following (from a basic new project) xml:
<ArrayOfSampleItem xmlns="http://schemas.datacontract.org/2004/07/WcfRestService1" xmlns:i="http://www.w3.org/2001/XMLSchema-instance">
<SampleItem>
<Id>1</Id>
<StringValue>Hello</StringValue>
</SampleItem>
</ArrayOfSampleItem>
I am then consuming this in a Windows Phone 7 app. The result is coming back fine however I'm having problems parsing the xml. This is the code I am using on the callback after completion of the request:
XDocument xmlDoc = XDocument.Parse(e.Result);
itemsFetched.ItemsSource = from item in xmlDoc.Descendants("SampleItem")
select new Product()
{
Id = item.Element("Id").Value,
StringValue = item.Element("StringValue").Value
};
The collection is not populated with this. When I try adding the namespace:
XNamespace web = "http://schemas.datacontract.org/2004/07/WcfRestService1";
XDocument xmlDoc = XDocument.Parse(e.Result);
itemsFetched.ItemsSource = from item in xmlDoc.Descendants(web + "SampleItem")
The item is found, but I get a null reference exception when it attempts to get the Id value.
Any help would be much appreciated.
The xmlns="..." declaration puts the element and all of its descendants in that namespace, so you need to use your XNamespace object web everywhere you access elements, including the child elements Id and StringValue:
XDocument xmlDoc = XDocument.Parse(e.Result);
XNamespace web = "http://schemas.datacontract.org/2004/07/WcfRestService1";
itemsFetched.ItemsSource = from item in xmlDoc.Descendants(web + "SampleItem")
select new Product()
{
Id = item.Element(web + "Id").Value,
StringValue = item.Element(web + "StringValue").Value
};
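The inheritance of a default namespace by child elements is a property of XML itself, so it can be observed with any namespace-aware parser. A minimal sketch in Java with the standard JAXP DOM API (not LINQ to XML); the class name DefaultNamespaceDemo and the helper count are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class DefaultNamespaceDemo {
    static final String NS = "http://schemas.datacontract.org/2004/07/WcfRestService1";

    // The service response from the question, with the unused xmlns:i removed.
    static final String XML =
        "<ArrayOfSampleItem xmlns=\"" + NS + "\">"
      + "<SampleItem><Id>1</Id><StringValue>Hello</StringValue></SampleItem>"
      + "</ArrayOfSampleItem>";

    // Hypothetical helper: how many elements have this namespace Uri and
    // local name. Pass null as namespaceUri to ask for "no namespace".
    static int count(String namespaceUri, String localName) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(XML)));
            return doc.getElementsByTagNameNS(namespaceUri, localName).getLength();
        } catch (Exception e) {
            throw new IllegalStateException("XML parsing failed", e);
        }
    }

    public static void main(String[] args) {
        // Id carries no xmlns attribute of its own, yet it is in the namespace.
        System.out.println(count(NS, "Id"));            // prints 1
        System.out.println(count(NS, "StringValue"));   // prints 1
        // Looking for an Id element without a namespace finds nothing,
        // which is why item.Element("Id") returned null in the question.
        System.out.println(count(null, "Id"));          // prints 0
    }
}
```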