MS Word / OpenXML: extracting grammar and spelling errors - vba

In OpenXML there is the w:proofErr element, which marks either a spelling error (attributes w:type="spellStart" and w:type="spellEnd") or a grammar error (attributes w:type="gramStart" and w:type="gramEnd"). I need to extract both types of errors from my document. When I created a small test document (just one sentence with two errors), the information was indeed present in the .docx XML, but when I saved the whole text I need to process (a 5 MB file), the information was not included in the .docx file (probably Word considers that in large documents this would add too much noise to the XML data).
How can I extract this information even in big files?
Is there some way to force MS Word to include the information in the .docx file?
If not, is there some VBA script that can mark spelling errors and grammar errors with, for example, a different color or some special character, so that the information becomes hardcoded into the file? (A rough sketch of this approach is included after the example below.)
Here is an example, for the sentence "The children plays in the guarden" (which has an agreement error and a spelling error):
<w:t>The children </w:t>
</w:r>
<w:proofErr w:type="gramStart"/>
<w:r w:rsidRPr="008E17B0">
<w:rPr><w:lang w:val="en-US"/></w:rPr>
<w:t>plays</w:t>
</w:r>
<w:proofErr w:type="gramEnd"/>
<w:r w:rsidRPr="008E17B0">
<w:rPr><w:lang w:val="en-US"/></w:rPr>
<w:t xml:space="preserve"> in the </w:t>
</w:r>
<w:proofErr w:type="spellStart"/>
<w:r w:rsidRPr="008E17B0">
<w:rPr><w:lang w:val="en-US"/></w:rPr>
<w:t>guarden</w:t>
</w:r>
<w:bookmarkStart w:id="0" w:name="_GoBack"/><w:bookmarkEnd w:id="0"/>
<w:proofErr w:type="spellEnd"/>
<w:r w:rsidRPr="008E17B0">
<w:rPr><w:lang w:val="en-US"/></w:rPr>
<w:t>.</w:t>
</w:r>
I would like to obtain, for example, "The children ▶plays◀ in the *guarden*"
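If hardcoding the markers into the document itself is acceptable, a rough VBA sketch along the following lines could work. The highlight colours are just an example; Word's SpellingErrors and GrammaticalErrors collections should correspond to the same ranges that produce the w:proofErr markup.
Sub MarkProofingErrors()
    Dim rng As Range
    ' Highlight every range Word flags as a spelling error
    For Each rng In ActiveDocument.Range.SpellingErrors
        rng.HighlightColorIndex = wdYellow
    Next rng
    ' Highlight every range Word flags as a grammatical error
    For Each rng In ActiveDocument.Range.GrammaticalErrors
        rng.HighlightColorIndex = wdBrightGreen
    Next rng
End Sub
The highlights survive saving and show up in document.xml as w:highlight elements in the run properties, which are easy to pick out afterwards. Alternatively, rng.InsertBefore and rng.InsertAfter could wrap each range in marker characters such as ▶◀ or *, though inserting text while iterating may shift the remaining ranges.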

Related

Merging xfdf into template pdf without losing some special characters (e.g. ő, Ű, č)

I have an xfdf file, which is UTF-8 and may contain non-ASCII characters. I would like to merge it with the pdf that contains the form. I tried with pdftk, and although the merging happens correctly - in the sense that all fields are populated - some characters do not appear in the flattened pdf.
Taking the xfdf:
<?xml version="1.0" encoding="utf-8"?>
<xfdf xmlns="http://ns.adobe.com/xfdf/" xml:space="preserve">
<fields>
<field name="some_data">
<value>Űző</value>
</field>
<field name="some_other_data">
<value>ùûüÿ€’“”«»àâæçéèêëïôœÙÛÜŸÀÂÆÇÉÈÊËÏÎÔ</value>
</field>
</fields>
</xfdf>
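For reference, the pdftk calls assumed here are along these lines (the file names are placeholders):
pdftk form.pdf fill_form data.xfdf output filled.pdf
pdftk filled.pdf output flattened.pdf flatten
pdftk filled.pdf output uncompressed.pdf uncompress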
The result pdf's fields have the following values (excluding the quotation marks):
some_data: " z "
some_other_data: "ùûüÿ€’“”«»àâæçéèêëïôœÙÛÜŸÀÂÆÇÉÈÊËÏÎÔ"
So all the characters in some_other_data are stored correctly, but ő and Ű are stored as 00.
I also realized that if I uncompress the pdf with pdftk, I can find the original characters stored in the pdf as
/DA (/Helv 8.64 Tf 0 g)
/Subtype /Widget
/V (ţ˙ Q z\r )
/T (some_data)
The fact that the correct characters are there is also clear if I open the unflattened form with Adobe Reader. After opening, the form field some_data initially contains only the letter z surrounded by spaces, BUT if I click on the form field, the special characters are revealed, and any change made to the field value results in the correct characters staying visible. On the other hand, if I unfocus the form field without any modification, they disappear again.
I also tried to use numeric entities in the xfdf, but it did not help either.
I have 2 questions:
Why won't these characters appear in the value of the field when the pdf clearly contains the correct character information, and is also capable of rendering them?
Most importantly, what can I do to have the correct characters appear in the pdf after flattening the form? I'd prefer a solution that does not require any postprocessing once the xfdf is merged into the pdf form, but any solutions or ideas are welcome.
Thank you all!

Oracle SQL XML Offending characters in XML file (UTF-8)

I'm trying to find a way of either replacing or deleting offending characters from the Oracle SQL XML files I'm creating. The structure of the XML file is correct, but the company I'm sending the files to can't load them because of the offending characters. I'm using an Oracle 11g Release 2 database.
What can I do and what are my options?
Both I and the company I'm sending the files to are using the Unicode UTF-8 encoding. Examples of these offending characters are listed after the XML below. An example of a tag that it does not like, for ZOË WANAMAKER:
<prodAssociatedParty>
<apType>ACTOR</apType>
<lastName>ZOË WANAMAKER</lastName>
</prodAssociatedParty>
The offending characters include Ë (0xCB), É (0xC9), Ï (0xCF), £ (0xA3), Ç (0xC7), Ò (0xD2), Ü (0xDC).
Thanks in Advance for any advice.
Thanks for all your replies. In the end I put the code below into my PL/SQL. As some of you said, even though you have <?xml version="1.0" encoding="UTF-8"?> in your SQL XML code, the file was being stored in a different encoding, so I needed to force it to write/store the XML file in UTF-8 format.
If you look up DBMS_XSLPROCESSOR.clob2file, you will see a number of parameters passed to this procedure, one of them being the character set to use for the output file, which in this case for UTF-8 is nls_charset_id('AL32UTF8').
DBMS_XSLPROCESSOR.clob2file(l_clob, l_directory, l_file_name||'.xml',nls_charset_id('AL32UTF8'));
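For completeness, here is a rough sketch of how that call can sit in an anonymous block; the directory object, file name and XML query below are made-up placeholders:
DECLARE
  l_clob      CLOB;
  l_directory VARCHAR2(30) := 'XML_OUT_DIR';   -- hypothetical Oracle directory object
  l_file_name VARCHAR2(50) := 'prod_parties';  -- hypothetical file name
BEGIN
  -- Build the XML as a CLOB (placeholder query)
  SELECT XMLELEMENT("prodAssociatedParty",
           XMLELEMENT("apType", 'ACTOR'),
           XMLELEMENT("lastName", 'ZOË WANAMAKER')).getClobVal()
    INTO l_clob
    FROM dual;

  -- Prepend the declaration and write the file as UTF-8 (AL32UTF8),
  -- so the stored encoding matches what the declaration claims
  l_clob := '<?xml version="1.0" encoding="UTF-8"?>' || CHR(10) || l_clob;
  DBMS_XSLPROCESSOR.clob2file(l_clob, l_directory, l_file_name || '.xml',
                              nls_charset_id('AL32UTF8'));
END;
/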
thanks Guys
Obviously your XML file has <?xml version="1.0" encoding="UTF-8"?> as its declaration, but in fact it is stored with a different encoding.
Do not declare your XML file with <?xml version="1.0" encoding="UTF-8"?> just because "everybody does it". If you declare UTF-8, then you also have to save it as UTF-8. Check the save options in your editor, or the settings in the application which creates the file.
I assume the XML file is saved in Windows-1252 encoding. Try <?xml version="1.0" encoding="ISO-8859-1"?> instead.
Windows-1252 is very similar to ISO 8859-1 (see ISO 8859-1 vs. ISO 8859-15 vs. Windows-1252 vs. Unicode), so it should work unless your XML contains any of € Š š Ž ž Œ œ Ÿ.
However, according to the XML specification only UTF-8 and UTF-16 are mandatory; the ISO 8859-x encodings are optional, so the target application may not be able to read the file. In that case you have to convert your XML file to UTF-8.

Extract sections of PDF

I am trying to extract sections of a PDF file, for use in text analysis. I tried using pdf-extract to accomplish this. However, a command such as
pdf-extract extract --regions --no-lines Bauer2010.pdf
only extracts the (x,y) coordinates of a region, as in the example below.
<region x="226.32" y="750.47" width="165.57" height="6.37"
line_height="6.37" font="BGBFHO+AdvP4DF60E">Patient Education and
Counseling 79 (2010) 315-319</region>
Can sections of a PDF be extracted?
Have a look at http://text-analyzer.com where you can upload your PDF file and it will convert it into a format suitable for Natural Language Processing. Once converted into a text file it can then process the file, breaking it down into sentences with sentiment analysis. It has over 40 different types of sentence views where you can tag sections. Those tagged sentences can be exported.

DITA OT printing '#' instead of Chinese characters in PDF

I am very new to DITA OT. I downloaded DITA-OT1.5.4_full_easy_install_bin and am playing around with it. I'm trying to print a few characters in Simplified Chinese (zh-CN) into a PDF. I see that the characters are printed correctly in XHTML, but in the PDF they are printed as "#".
On the command line I see this: Warning: Glyph "?" (0x611f) not available in font "Helvetica".
Here are the things I have tried so far:
In demo\fo\fop\conf\fop.xconf :
<fonts>
<font kerning="yes"
embed-url="file:///C:/Windows/Fonts/simsun.ttc"
embedding-mode="subset" encoding-mode="cid">
<font-triplet name="SimSun" style="normal" weight="normal"/>
</font>
<auto-detect/>
<directory recursive="true">C:\Windows\Fonts</directory>
</fonts>
In demo\fo\cfg\fo\attrs\custom.xsl :
<xsl:attribute-set name="__fo__root">
<xsl:attribute name="font-family">SimSun</xsl:attribute>
</xsl:attribute-set>
In demo\fo\cfg\fo\font-mapping.xml I added this block for the Sans, Serif & Monospaced logical fonts:
<physical-font char-set="Simplified Chinese">
<font-face>SimSun</font-face>
</physical-font>
In samples\concepts\garageconceptsoverview.xml :
<shortdesc xml:lang="zh_CN">職業道德感.</shortdesc>
And this is the command I am using to generate the PDF:
ant -Dargs.input=samples\hierarchy.ditamap -Dtranstype=pdf
Any help would be appreciated. Thanks.
[EDIT]
I see that the topic.fo file which gets generated in the temp folder does contain the Chinese characters correctly, like this:
<fo:block font-size="10pt" keep-with-next.within-page="5" start-indent="25pt">職業道德感.</fo:block>
But I do not see any font-related information anywhere in this document.
First of all you should set the "xml:lang='zh_CN'" attribute on the root elements for all DITA topics and maps. This will help the DITA OT publishing decide the language to use for static texts like "Table X" and also to decide on the charset to use for the font mappings.
Then you should run the publishing with the "clean.temp" parameter set to "no".
After the publishing you can look in the temporary files folder for a file called "topic.fo" and look inside it to see what font families are used.
Because even if you set a font on the root element, there are other places in the XSL-FO file where you have font families set explicitly.
So instead of setting a font on the XSL-FO root element you should edit the font mappings XML file and for each of the logical fonts "Sans" and "Serif" you should configure the actual font family to use for the Chinese charset, something like:
<logical-font name="Sans">
.........
<physical-font char-set="Simplified Chinese">
<font-face>SimSun</font-face>
</physical-font>
......
</logical-font>
More about how the font mappings work:
https://www.oxygenxml.com/doc/versions/17.0/ug-editor/#topics/DITA-map-set-font-Apache-FOP.html
Update:
If you insist on having that XSLT customization which sets the "SimSun" font as a font family on the root element, then in the font-mappings.xml you need to define a new mapping for your alias:
<aliases>
<alias name="SimSun">SimSun</alias>
</aliases>
and then map the logical font to a physical one in the same font-mappings.xml:
<logical-font name="SimSun">
<physical-font char-set="Simplified Chinese">
<font-face>SimSun</font-face>
</physical-font>
</logical-font>
0x611f is a Chinese character (感), and Helvetica is a European font, so this character does not exist in the "Helvetica" font. You can search for the place where the "Helvetica" font is set; at that position your content (ditamap/dita) should use a Chinese font, not a European one. You must find the attribute that includes font-family="Helvetica" and, in your own plugin, change it to something like font-family="SimSun, Helvetica".
Sorry, I cannot answer your question, but you should definitely try a newer DITA-OT from http://dita-ot.github.io/. Your DITA-OT is not supported anymore. Maybe your problem goes away with the latest release.

docx4j - Nodes Omitted From XmlUtils.marshalToString()

Using XmlUtils.marshalToString() from docx4j, I have the following content at identical locations in two docx files (extracted from the corresponding word/document.xml after unzipping the .docx). These are the only differences between the files:
<w:t xml:space="preserve">New line. First is </w:t>
and
<w:t xml:space="preserve">
<w:r>
<w:t xml:space="preserve">New line.</w:t>
</w:r>
<w:r>
<w:t xml:space="preserve"> First is </w:t>
</w:r>
</w:t>
In the first document, the <w:t> node is output as above.
However, in the second, an empty <w:t> node is printed as follows:
<w:t xml:space="preserve"></w:t>
I checked the w:t schema at http://www.schemacentral.com/sc/ooxml/e-w_p-1.html and w:r is a valid contained element.
Edit: the above link is the schema of the w:p element, not w:t. The proper link for w:t is: http://www.schemacentral.com/sc/ooxml/e-w_t-1.html. It clearly shows that the only acceptable content for w:t is a string (not a w:r or any other tag). Consequently (as suggested in Jason's answer below), the XML from document.xml was invalid and, as such, was not being unmarshalled by docx4j. As a result, the text was not available for output by XmlUtils.marshalToString().
What is keeping the second block from being output?
You can trust marshalToString.
If it is returning an empty w:t, that's because the underlying org.docx4j.wml.Text object has a null or empty value field.
You need to look at whatever code is supposed to be populating that.
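As a starting point, here is a minimal sketch that dumps the value of every org.docx4j.wml.Text object docx4j actually unmarshalled, so you can see which ones end up null or empty before marshalling; the file name is a placeholder, and the ClassFinder/TraversalUtil pattern is just one way to walk the content:
import java.io.File;

import org.docx4j.TraversalUtil;
import org.docx4j.finders.ClassFinder;
import org.docx4j.openpackaging.packages.WordprocessingMLPackage;
import org.docx4j.openpackaging.parts.WordprocessingML.MainDocumentPart;
import org.docx4j.wml.Text;

public class DumpTextValues {
    public static void main(String[] args) throws Exception {
        // "input.docx" is a placeholder for your document
        WordprocessingMLPackage pkg = WordprocessingMLPackage.load(new File("input.docx"));
        MainDocumentPart main = pkg.getMainDocumentPart();

        // Collect every w:t that was unmarshalled into a Text object
        ClassFinder finder = new ClassFinder(Text.class);
        new TraversalUtil(main.getContent(), finder);

        for (Object o : finder.results) {
            Text t = (Text) o;
            // A null or empty value here explains an empty <w:t/> in the marshalled output
            System.out.println("w:t value: [" + t.getValue() + "]");
        }
    }
}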