VTD-XML creating an XML document - vtd-xml

I am a bit confused on how to do this, all of the docs / examples show how to read and edit xml docs but there doesn't seem to be any clear way of creating an xml from scratch, I would rather not have to ship my program with a dummy xml file in order to edit one. any ideas? thanks.

What you could do instead would be to just hard-code an empty document like this:
byte[] emptyDoc = "<?xml version='1.0' encoding='UTF-8'?><root></root>".getBytes("UTF-8");
And then use that to create your VTDGen and XMLModifier, and start adding elements:
VTDGen vg = new VTDGen();
vg.setDoc(emptyDoc);
vg.parse(true);
VTDNav vn = vg.getNav();
XMLModifier xm = new XMLModifier(vn);
// Cursor is at root, update Root Element Name
xm.updateElementName("employee");
xm.insertAttribute(" id='6'");
xm.insertAfterHead("<name>Bob Smith</name>");
vn = xm.outputAndReparse();
// etc...

Related

How to move text in PDF?

Is there a Java or Nodejs library that can move existing text in a PDF file?
I'd like to extract all the text nodes, then move some of them to a new location based on some conditions.
I tried PdfClown, galkahana/HummusJS, Hopding/pdf-lib, but seems they don't have exactly what I need.
can anyone help? thanks
After inspecting the variables, I figured out how to move text, here is the code
PrimitiveComposer composer = new PrimitiveComposer(page);
ContentScanner scanner = composer.getScanner();
tranverse(scanner);
composer.flush();
...
while (level.moveNext()){
ContentObject content = level.getCurrent();
if (content instanceof Text){
...
List<ContentObject> objects = text.getBaseDataObject().getObjects();
for(ContentObject co: objects){
if(co instanceof SetTextMatrix){
List<PdfDirectObject> operands = ((SetTextMatrix)co).getOperands();
PdfInteger y = (PdfInteger)operands.get(5);
operands.set(5, new PdfInteger(y.getIntValue()-100));
}
}

How to read conf.properties file?

I want to read conf.properties file but but not by giving a hard coded path, so how can I fins it? I have use the below line but its written a null value but the path was correct.
InputStream inputStream = ReadPropertyFile.class.getResourceAsStream("configuration/conf.properties");
Assuming you have configuration folder in your project & that folder contains conf.properties
Use following in your function -
String projdir = System.getProperty("user.dir");
String propfilepath = projdir+"\\config\\"+"conf.properties";
Properties p = new Properties();
p.load(new FileInputStream(propfilepath ));
String value = p.getProperty("test");
System.out.println(value); // It is returning me a value corresponding to key "test"

parsing tcx document using xdocument selectsinglenode

I am trying to parse an TCX document similiar to the one used in this post: Import TCX into R using XML package
Only I'm trying to use XmlDocument.SelectNodes and SelectSingleNode instead of getNodeSet. The line with xmlns looks like this:
<TrainingCenterDatabase xmlns="http://www.garmin.com/xmlschemas/TrainingCenterDatabase/v2" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.garmin.com/xmlschemas/TrainingCenterDatabase/v2 http://www.garmin.com/xmlschemas/TrainingCenterDatabasev2.xsd">
If I remove the xmlns and just have , I can parse it without any problems.
My (vb.net) code:
Dim tcxXmlDocument As New System.Xml.XmlDocument()
tcxXmlDocument.Load(tcxFile)
Dim xmlnsManager = New System.Xml.XmlNamespaceManager(tcxXmlDocument.NameTable)
Dim trackpoints As New List(Of Trackpoint)
For Each tpXml As System.Xml.XmlNode In tcxXmlDocument.SelectNodes("//Trackpoint", xmlnsManager)
Dim newTrackpoint As New Trackpoint
With newTrackpoint
.Time = tpXml.SelectSingleNode("Time").InnerText
.LatitudeDegrees = tpXml.SelectSingleNode("Position/LatitudeDegrees").InnerText
.LongitudeDegrees = tpXml.SelectSingleNode("Position/LongitudeDegrees").InnerText
.HeartRateBpm = tpXml.SelectSingleNode("HeartRateBpm").InnerText
End With
trackpoints.Add(newTrackpoint)
Next
How can I configure the XmlNamespaceManager so that I can access the nodes in such a tcx document?
Thanks,
Jason
Use the XmlNamespaceManager.AddNamespace() method to associate a preffix (say "x") with the namespace "http://www.garmin.com/xmlschemas/TrainingCenterDatabase/v2".
Then in your XPath expression, pefix every element name with the "x" prefix.
Replace this:
//Trackpoint
with:
//x:Trackpoint
Replace this:
Time
with:
x:Time
Replace this:
Position/LatitudeDegrees
with:
x:Position/x:LatitudeDegrees
Replace this:
Position/LongitudeDegrees
with:
x:Position/x:LongitudeDegrees
Finally, replace this:
HeartRateBpm
with:
x:HeartRateBpm

VTD-XML Exception: Name space qualification Exception: prefixed attribute not qualified

I receive an XML via a web service and I am using legacy code (which uses dom4j) to perform some xml transformation. Loading/parsing the original XML into VTD-XML (VTDGen) works fine, no exceptions thrown. However, after loading the xml into dom4j, I noticed some of the element namespace declarations and attributes are re-arranged. Apparently, this re-arrangement causes VTD-XML to throw the following exception:
Exception:
Name space qualification Exception: prefixed attribute not qualified
Line Number: 101 Offset: 1827
Here is the element at this line number in the original XML:
<RR_PerformanceSite:PerformanceSite_1_4 RR_PerformanceSite:FormVersion="1.4" xmlns:NSF_ApplicationChecklist="http://apply.grants.gov/forms/NSF_ApplicationChecklist-V1.1" xmlns:NSF_CoverPage="http://apply.grants.gov/forms/NSF_CoverPage-V1.1" xmlns:NSF_DeviationAuthorization="http://apply.grants.gov/forms/NSF_DeviationAuthorization-V1.1" xmlns:NSF_Registration="http://apply.grants.gov/forms/NSF_Registration-V1.1" xmlns:NSF_SuggestedReviewers="http://apply.grants.gov/forms/NSF_SuggestedReviewers-V1.1" xmlns:PHS398_CareerDevelopmentAwardSup="http://apply.grants.gov/forms/PHS398_CareerDevelopmentAwardSup_1_1-V1.1" xmlns:PHS398_Checklist="http://apply.grants.gov/forms/PHS398_Checklist_1_3-V1.3" xmlns:PHS398_CoverPageSupplement="http://apply.grants.gov/forms/PHS398_CoverPageSupplement_1_4-V1.4" xmlns:PHS398_ModularBudget="http://apply.grants.gov/forms/PHS398_ModularBudget-V1.1" xmlns:PHS398_ResearchPlan="http://apply.grants.gov/forms/PHS398_ResearchPlan_1_3-V1.3" xmlns:PHS_CoverLetter="http://apply.grants.gov/forms/PHS_CoverLetter_1_2-V1.2" xmlns:RR_Budget="http://apply.grants.gov/forms/RR_Budget-V1.1" xmlns:RR_KeyPersonExpanded="http://apply.grants.gov/forms/RR_KeyPersonExpanded_1_2-V1.2" xmlns:RR_OtherProjectInfo="http://apply.grants.gov/forms/RR_OtherProjectInfo_1_2-V1.2" xmlns:RR_PerformanceSite="http://apply.grants.gov/forms/PerformanceSite_1_4-V1.4" xmlns:RR_PersonalData="http://apply.grants.gov/forms/RR_PersonalData-V1.1" xmlns:RR_SF424="http://apply.grants.gov/forms/RR_SF424_1_2-V1.2" xmlns:RR_SubawardBudget="http://apply.grants.gov/forms/RR_SubawardBudget-V1.2" xmlns:SF424C="http://apply.grants.gov/forms/SF424C-V1.0" xmlns:att="http://apply.grants.gov/system/Attachments-V1.0" xmlns:codes="http://apply.grants.gov/system/UniversalCodes-V2.0" xmlns:globlib="http://apply.grants.gov/system/GlobalLibrary-V2.0">
Here is the same element after loaded into dom4j:
<RR_PerformanceSite:PerformanceSite_1_4 xmlns:RR_PerformanceSite="http://apply.grants.gov/forms/PerformanceSite_1_4-V1.4" xmlns:NSF_ApplicationChecklist="http://apply.grants.gov/forms/NSF_ApplicationChecklist-V1.1" xmlns:NSF_CoverPage="http://apply.grants.gov/forms/NSF_CoverPage-V1.1" xmlns:NSF_DeviationAuthorization="http://apply.grants.gov/forms/NSF_DeviationAuthorization-V1.1" xmlns:NSF_Registration="http://apply.grants.gov/forms/NSF_Registration-V1.1" xmlns:NSF_SuggestedReviewers="http://apply.grants.gov/forms/NSF_SuggestedReviewers-V1.1" xmlns:PHS398_CareerDevelopmentAwardSup="http://apply.grants.gov/forms/PHS398_CareerDevelopmentAwardSup_1_1-V1.1" xmlns:PHS398_Checklist="http://apply.grants.gov/forms/PHS398_Checklist_1_3-V1.3" xmlns:PHS398_CoverPageSupplement="http://apply.grants.gov/forms/PHS398_CoverPageSupplement_1_4-V1.4" xmlns:PHS398_ModularBudget="http://apply.grants.gov/forms/PHS398_ModularBudget-V1.1" xmlns:PHS398_ResearchPlan="http://apply.grants.gov/forms/PHS398_ResearchPlan_1_3-V1.3" xmlns:PHS_CoverLetter="http://apply.grants.gov/forms/PHS_CoverLetter_1_2-V1.2" xmlns:RR_Budget="http://apply.grants.gov/forms/RR_Budget-V1.1" xmlns:RR_KeyPersonExpanded="http://apply.grants.gov/forms/RR_KeyPersonExpanded_1_2-V1.2" xmlns:RR_OtherProjectInfo="http://apply.grants.gov/forms/RR_OtherProjectInfo_1_2-V1.2" xmlns:RR_PersonalData="http://apply.grants.gov/forms/RR_PersonalData-V1.1" xmlns:RR_SF424="http://apply.grants.gov/forms/RR_SF424_1_2-V1.2" xmlns:RR_SubawardBudget="http://apply.grants.gov/forms/RR_SubawardBudget-V1.2" xmlns:SF424C="http://apply.grants.gov/forms/SF424C-V1.0" xmlns:att="http://apply.grants.gov/system/Attachments-V1.0" xmlns:codes="http://apply.grants.gov/system/UniversalCodes-V2.0" xmlns:globlib="http://apply.grants.gov/system/GlobalLibrary-V2.0" RR_PerformanceSite:FormVersion="1.4">
The problem is regarding the attribute (at offset 1827, at the end of the element) in the new XML element: RR_PerformanceSite:FormVersion="1.4"
Here is what removes the exception:
1. Adding the RR_PerformanceSite xmlns declaration for this element to the root element of the XML doc.
2. Replacing new element with original element. This SEEMS to lead me to believe that the order of the attributes/ns declarations affects VTD when parsing.
NOTE: I parse the xml doc setting ns aware to 'true' with both xml docs (original and post-dom4j xml). Also, new VTD objects are created for each xml, original and post-dom4j.
I tried to put 'RR_PerformanceSite:FormVersion="1.4"' at the beginning of the element like the original but that does not remove the exception. The offset in the error message is different due to the change of location of the attribute. Does the order of the xmlns declarations affect VTD?
I have looked at the VTDGen source code and cannot figure out why this exception is being thrown.
Why would dom4j parse the new doc and vtd is unable to? Can anyone can shed some light on this?
It appears to be a bug on VTD-XML, related with namespace declaration order.
Always reproducible using the following Java code
public class SchemaTester {
/**
* #param args
*/
public static void main(String[] args) throws Exception {
String bad = "C:/Temp/VTD_bad.xml"; // XML files to test
String good = "C:/Temp/VTD_good.xml";
StringBuilder sb = new StringBuilder();
char[] buf = new char[4*1024];
FileReader fr = new FileReader(bad);
int readed = 0;
while ((readed = fr.read(buf, 0, buf.length)) != -1) {
sb.append(buf, 0, readed);
}
fr.close();
String x = sb.toString();
//instantiate VTDGen
//and call parse
VTDGen vg = new VTDGen();
vg.setDoc(x.getBytes("UTF-8"));
vg.parse(true); // set namespace awareness to true
VTDNav vn = vg.getNav();
AutoPilot ap = new AutoPilot (vn);
ap.selectXPath("//*/#*");
int i= -1;
while((i=ap.evalXPath()) != -1) {
// i will be attr name, i+1 will be attribute value
System.out.println("\t\tAttribute ==> " + vn.toNormalizedString(i));
System.out.println("\t\tValue ==> " + vn.toNormalizedString(i+1));
}
}
}
The OP has uploaded the XML to https://gist.github.com/2696220

The ':' character, hexadecimal value 0x3A, cannot be included in a name

I saw this question already, but I didnt see an answer..
So I get this error:
The ':' character, hexadecimal value 0x3A, cannot be included in a name.
On this code:
XDocument XMLFeed = XDocument.Load("http://feeds.foxnews.com/foxnews/most-popular?format=xml");
XNamespace content = "http://purl.org/rss/1.0/modules/content/";
var feeds = from feed in XMLFeed.Descendants("item")
select new
{
Title = feed.Element("title").Value,
Link = feed.Element("link").Value,
pubDate = feed.Element("pubDate").Value,
Description = feed.Element("description").Value,
MediaContent = feed.Element(content + "encoded")
};
foreach (var f in feeds.Reverse())
{
....
}
An item looks like that:
<rss>
<channel>
....items....
<item>
<title>Pentagon confirms plan to create new spy agency</title>
<link>http://feeds.foxnews.com/~r/foxnews/most-popular/~3/lVUZwCdjVsc/</link>
<category>politics</category>
<dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/" />
<pubDate>Tue, 24 Apr 2012 12:44:51 PDT</pubDate>
<guid isPermaLink="false">http://www.foxnews.com/politics/2012/04/24/pentagon-confirms-plan-to-create-new-spy-agency/</guid>
<content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[|http://global.fncstatic.com/static/managed/img/Politics/panetta_hearing_030712.jpg<img src="http://feeds.feedburner.com/~r/foxnews/most-popular/~4/lVUZwCdjVsc" height="1" width="1"/>]]></content:encoded>
<description>The Pentagon confirmed Tuesday that it is carving out a brand new spy agency expected to include several hundred officers focused on intelligence gathering around the world.&amp;#160;</description>
<dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2012-04-4T19:44:51Z</dc:date>
<feedburner:origLink>http://www.foxnews.com/politics/2012/04/24/pentagon-confirms-plan-to-create-new-spy-agency/</feedburner:origLink>
</item>
....items....
</channel>
</rss>
All I want is to get the "http://global.fncstatic.com/static/managed/img/Politics/panetta_hearing_030712.jpg", and before that check if content:encoded exists..
Thanks.
EDIT:
I've found a sample that I can show and edit the code that tries to handle it..
EDIT2:
I've done it in the ugly way:
text.Replace("content:encoded", "contentt").Replace("xmlns:content=\"http://purl.org/rss/1.0/modules/content/\"","");
and then get the element in the normal way:
MediaContent = feed.Element("contentt").Value
The following code
static void Main(string[] args)
{
var XMLFeed = XDocument.Parse(
#"<rss>
<channel>
....items....
<item>
<title>Pentagon confirms plan to create new spy agency</title>
<link>http://feeds.foxnews.com/~r/foxnews/most-popular/~3/lVUZwCdjVsc/</link>
<category>politics</category>
<dc:creator xmlns:dc='http://purl.org/dc/elements/1.1/' />
<pubDate>Tue, 24 Apr 2012 12:44:51 PDT</pubDate>
<guid isPermaLink='false'>http://www.foxnews.com/politics/2012/04/24/pentagon-confirms-plan-to-create-new-spy-agency/</guid>
<content:encoded xmlns:content='http://purl.org/rss/1.0/modules/content/'><![CDATA[|http://global.fncstatic.com/static/managed/img/Politics/panetta_hearing_030712.jpg<img src='http://feeds.feedburner.com/~r/foxnews/most-popular/~4/lVUZwCdjVsc' height='1' width='1'/>]]></content:encoded>
<description>The Pentagon confirmed Tuesday that it is carving out a brand new spy agency expected to include several hundred officers focused on intelligence gathering around the world.&amp;#160;</description>
<dc:date xmlns:dc='http://purl.org/dc/elements/1.1/'>2012-04-4T19:44:51Z</dc:date>
<!-- <feedburner:origLink>http://www.foxnews.com/politics/2012/04/24/pentagon-confirms-plan-to-create-new-spy-agency/</feedburner:origLink> -->
</item>
....items....
</channel>
</rss>");
XNamespace contentNs = "http://purl.org/rss/1.0/modules/content/";
var feeds = from feed in XMLFeed.Descendants("item")
select new
{
Title = (string)feed.Element("title"),
Link = (string)feed.Element("link"),
pubDate = (string)feed.Element("pubDate"),
Description = (string)feed.Element("description"),
MediaContent = GetMediaContent((string)feed.Element(contentNs + "encoded"))
};
foreach(var item in feeds)
{
Console.WriteLine(item);
}
}
private static string GetMediaContent(string content)
{
int imgStartPos = content.IndexOf("<img");
if(imgStartPos > 0)
{
int startPos = content[0] == '|' ? 1 : 0;
return content.Substring(startPos, imgStartPos - startPos);
}
return string.Empty;
}
results in:
{ Title = Pentagon confirms plan to create new spy agency, Link = http://feeds.f
oxnews.com/~r/foxnews/most-popular/~3/lVUZwCdjVsc/, pubDate = Tue, 24 Apr 2012 1
2:44:51 PDT, Description = The Pentagon confirmed Tuesday that it is carving out
a brand new spy agency expected to include several hundred officers focused on
intelligence gathering around the world. , MediaContent = http://global
.fncstatic.com/static/managed/img/Politics/panetta_hearing_030712.jpg }
Press any key to continue . . .
A few points:
You never want to treat Xml as text - in your case you removed the namespace declaration but actually if the namespace was declared inline (i.e. without binding to the prefix) or a different prefix would be defined your code would not work even though semantically both documents would be equivalent
Unless you know what's inside CDATA and how to treat it you always want to treat is as text. If you know it's something else you can treat it differently after parsing - see my elaborate on CDATA below for more details
To avoid NullReferenceExceptions if the element is missing I used explicit conversion operator (string) instead of invoking .Value
the Xml you posted was not a valid xml - there was missing namespace Uri for feedburner prefix
This is no longer related to the problem but may be helpful for some folks so I am leaving it
As far as the contents of the encode element is considered it is inside CDATA section. What's inside CDATA section is not an Xml but plain text. CDATA is usually used to not have to encode '<', '>', '&' characters (without CDATA they would have to be encoded as < > and & to not break the Xml document itself) but the Xml processor treat characters in the CDATA as if they were encoded (or to be more correct in encodes them). The CDATA is convenient if you want to embed html because textually the embedded content looks like the original yet it won't break your xml if the html is not a well-formed Xml. Since the CDATA content is not an Xml but text it is not possible to treat it as Xml. You will probably need to treat is as text and use for instance regular expressions. If you know it is a valid Xml you can load the contents to an XElement again and process it. In your case you have got mixed content so it is not easy to do unless you use a little dirty hack. Everything would be easy if you have just one top level element instead of mixed content. The hack is to add the element to avoid all the hassle. Inside the foreach look you can do something like this:
var mediaContentXml = XElement.Parse("<content>" + (string)item.MediaContent + "</content>");
Console.WriteLine((string)mediaContentXml.Element("img").Attribute("src"));
Again it's not pretty and it is a hack but it will work if the content of the encoded element is valid Xml. The more correct way of doing this is to us XmlReader with ConformanceLevel set to Fragment and recognize all kinds of nodes appropriately to create a corresponding Linq to Xml node.
You should use XNamespace:
XNamespace content = "...";
// later in your code ...
MediaContent = feed.Element(content + "encoded")
See more details here.
(Of course, you the string to be assigned to content is the same as in xmlns:content="...").