Tika - how to extract text from PDF text: underlined, highlighted, crossed out - pdf

I'm using Tika* to parse a PDF file.
There are no problems to retrieve the document's text, but I don't figure out how to extract text:
underlined
highlighted
crossed out
Adobe Writer gives you different text edit options, but I'm not able to see where they are "hidden".
Is there a solution to extract these metadata information? (underline, highligh ...)
Do you know if Tika is able to extract this data?
*http://tika.apache.org/

Wow. 4 years is a long time to wait for an answer, and I figure you have found a solution by now. Anyways, for the sake of those who would visit this link, the answer is Yes. Apache Tika can extract not just text in a document, but also the formatting as well (e.g. bold, italicized). This was my Scenario:
//inputStream is the document you wish to parse from.
AutoDetectParser parser = new AutoDetectParser();
ContentHandler handler = new BodyContentHandler(new ToXMLContentHandler());
Metadata metadata = new Metadata();
parser.parse(inputStream,handler,metadata);
System.out.println(handler.toString());
The print statement prints an XML of your document. With a little work of cleaning up the XML (really HTML tags), you would be left with tags like < b >text< /b> for bold text and < i >text < / i > for italicized text. Then you could find a way to render it. Good luck.

Related

Word to HTML fields in header and footer

I'm using docx4j to convert a Word template to several HTML files, one per chapter.
The Word template has several custom properties mapped by several fields (DOCPROPERTY ...) represented as both simple and complex fields. I populate those properties to obtain Freemarker code when the word document is converted to HTML (like ${...} or [#... /] directives).
In a later step I look for "heading 1" paragraphs to identify chapters and then split the document in several Word documents before conversion, then these documents are converted to HTML and written to temporary files.
Each document is successfully converted to HTML and fields are correctly replaced with my markers, but it behaves wrong when it writes header and footer parts: field codes are written before field values (eg. DOCPROPERTY "PROPERTY_NAME" \* MERGEFORMAT ${constants['PROPERTY_NAME']} ) instead of field values only (eg. ${constants['PROPERTY_NAME']} ).
If I write the updated document to a docx file instead, nothing seems wrong into the generated document.
If it's useful to solve the problem, this is what I do to split the document (per chapter):
clone the updated WordprocessingMLPackage (clone method)
delete every root element before the chapter's "heading 1" element
delete every root element from the "heading 1" element of the next chapter
convert the cloned and cleaned document
(actually I don't use the clone method every time, but I write the updated document to a ByteArrayOutputStream and then read it for every chapter, inspired by the source of the clone method).
I suspect it's for a docx4j bug, did anybody else try something similar?
Finally these are my platform details:
JDK 1.6
Docx4J v3.2.2
Thanks in advance for any help
EDIT
To produce freemarker markers in place of Word fields, I set document property values as follows:
traverse the document looking for simple or complex fields with new TraversalUtil(wordMLPackage.getMainDocumentPart().getContent(), visitor);, where visitor is my custom callback for looking for fields and set properties
traversing the document I look for
FldChar elements with type BEGIN and parse them using FieldsPreprocessor.canonicalise((P) ((R) fc.getParent()).getParent(), fields); (I don't use the return value of canonicalise) where fc is the found FldChar and fields is a empty ArrayList<FieldRef>; then I extract and parse field's instrText attribute
CTSimpleField elements and parse them using FldSimpleModel fldSimpleModel = new FldSimpleModel(); fldSimpleModel.build((CTSimpleField) o, null);; then I use fldSimpleModel.getFldArgument() to get the property name
I look for the freemarker code to show in place of the current field and set it as property value using wordMLPackage.getDocPropsCustomPart().setProperty(propertyName, finalValue);
finally I do the same from step 1 for headers and footers as follows:
List<Relationship> rels = wordMLPackage.getMainDocumentPart().getRelationshipsPart().getRelationships().getRelationship();
for (Relationship rel : rels) {
Part p = wordMLPackage.getMainDocumentPart().getRelationshipsPart().getPart(rel);
if (p == null) {
continue;
}
if (p instanceof ContentAccessor) {
new TraversalUtil(((ContentAccessor) p).getContent(), visitor);
}
}
Finally I update fields as follows
FieldUpdater updater = new FieldUpdater(wordMLPackage);
try {
updater.update(true);
} catch (Docx4JException ex) {
Logger.getLogger(WorkerDocx4J.class.getName()).log(Level.SEVERE, null, ex);
}
After filling all field properties, I clone the document as previously described and convert filtered cloned instances using
HTMLSettings settings = Docx4J.createHTMLSettings();
settings.setWmlPackage(wordDoc);
settings.setImageHandler(new InlineImageHandler(myDataModel));
Docx4jProperties.setProperty("docx4j.Convert.Out.HTML.OutputMethodXML", true);
ByteArrayOutputStream os = new ByteArrayOutputStream();
os.write("[#ftl]\r\n".getBytes("UTF-8"));
Docx4J.toHTML(settings, os, Docx4J.FLAG_EXPORT_PREFER_XSL);
String template = new String(os.toByteArray(), "UTF-8");
then I obtain in template variable the resulting freemarker template.
The following XML is the content of footer1.xml part of the document generated after updating the document properties as described: footer1.xml after field updates
The very strange thing (in my opinion) is that if some properties are not found, step 5 throws an Exception (ok), fields updating stops at the wrong field (ok) and all fields in header and footer are rendered right. In this case, this is the content for footer1.xml.
In the last case, fields are defined in a different way. I think the HTML converter handles well the last case and does something wrong in the first one.
Is there something I do wrong or I can do better?

Can I define a text property as rich text?

VS 2013, VB, EF6
I am creating an object that will keep user input in one of its properties. I would like that user input to be stored as rich text. What's involved to make that stored text be rich text format? So,
Public Property Text as <what?>
I thought I would post what was my answer for others who might ask the question the same way I did. I begin by stating that my question was poorly formed because I didn't understand I'm not really storing RTF, I'm storing WYSIWYG text with html tags. But I think the question as phrased is useful because that's how many people think until they are taught by others.
Ultimately this process opens a serious XSS vector, but first we have to at least collect the WYSIWYG text.
First step: using a script-based editor capture the text with html tags. I used CKEditor which is easy to download on NuGet. It comes in 3 flavors: basic, standard and full. Another popular one seems to be TinyMCE also available through NuGet.
CKEditor must be 'wired in' to replace the existing input element. I replaced #html.editorfor with a < textarea > directly as follows. Model.UserPost.Body is the property into which I want to place the WYSIWYG text. The Raw helper is required so the output is NOT encoded allowing us to see our WYSIWYG text.
<textarea name="model.UserPost.Body" id="model_UserPost_Body" class="form-control text-box multi-line">
#Html.Raw(Model.UserPost.Body)
</textarea>
CKEditor is 'wired in' using a script element to replace the < textarea > element.
#Section Scripts
<script src="~/scripts/ckeditor/ckeditor.js"></script>
<script>
CKEDITOR.replace('model.UserPost.Body');
</script>
End Section
The script above can be added to all pages via _layout.vbhtml, or just the target page via a #Section Scripts section as shown above, which is often recommended and what I did, but that may also require adding to the standard _Layout the following in the < head > section such as follows.
#RenderSection("Styles", False)
In the controller POST method for the view the following code is needed to capture the WYSIWYG text otherwise the default filter will raise an exception when it detects anything that looks like an html tag.
Dim rawBody = Request.Unvalidated.Form("model.UserPost.Body")
userPost.Body = rawBody
There are some possible gotcha's; The 'body' property has to be removed from the Include:= list of the < Bind > element in the method paramter list if < Bind > is being used. Also, although not directly related to this solution, you can't have a Data Annotation like < Required() > on this property in the model because background checking won't be able to confirm that condition so the ModelState.IsValid flag won't ever go true.
Second step: before saving the input it MUST be checked for XSS. Microsoft has a nice video explaining basic XSS that I recommend viewing; it's only 11 minutes.
Mikesdotnetting has a nice explaination for dealing with XSS and shows a whitelisting algorithm toward the bottom of this page. The following code is based on his work.
To create a white listing approach, the HTML Agility Pack is useful to catalogue the HTML nodes for review. This is easily loaded from Nu Get as well. This is the code I used in the POST method to invoke the white list methods (Yes, it could be more compact, but this is easier to read for us novices):
Dim tempDoc = New HtmlDocument()
tempDoc.LoadHtml(rawBody)
RemoveNodes(tempDoc.DocumentNode, allowedTags)
userPost.Body = tempDoc.DocumentNode.OuterHtml
The allowed tags are what you will allow, which means everything else is rejected, hence whitelisting. This is just a sample list:
Dim allowedTags As New List(Of String)() From {"p", "em", "s", "ol", "ul", "li", "h1", "h2", "h3", "h4", "h5", "h6", "strong"}
These are the methods based on Mikesdotnetting page:
Private Sub RemoveNodes(ByVal node As HtmlNode, allowedTags As List(Of String))
If (node.NodeType = HtmlNodeType.Element) Then
If Not allowedTags.Contains(node.Name) Then
node.ParentNode.RemoveChild(node)
Exit Sub
End If
End If
If (node.HasChildNodes) Then
RemoveChildren(node, allowedTags)
End If
End Sub
Private Sub RemoveChildren(ByVal parent As HtmlNode, allowedTags As List(Of String))
For i = parent.ChildNodes.Count() - 1 To 0 Step -1
RemoveNodes(parent.ChildNodes(i), allowedTags)
Next
End Sub
So basically, (1) CKEditor captures user input with html tags that looks nice, (2) the raw input is specially requested in the Controller POST method and then (3) cleaned using a white list. After that it can be output directly to the page using #Html.Raw() because it can be trusted.
That's it. I've not really posted solutions like this before, so if I've missed something let me know and I'll correct or add it.
Rich Text is stored in the Rich Text Format.
The Rich Text Format specifications can be found here:
http://www.microsoft.com/en-us/download/details.aspx?id=10725
It is just an ordinary string. You can extract the string from a RichTextBox using the SaveFile function:
Private Function GetRTF(ByRef Box As RichTextBox) As String
Using ms As New IO.MemoryStream
Box.SaveFile(ms, RichTextBoxStreamType.RichText)
Return System.Text.Encoding.ASCII.GetString(ms.ToArray)
End Using
End Function
You can load text in the Rich Text Format into a RichTextBox using the LoadFile method of the RichTextBox. The text needs to be in the correct format:
Dim rtf As String = "{\rtf1 {\colortbl;\red0\green0\blue255;\red255\green0\blue0;}Guten Tag!\line{\i Dies} ist ein\line formatierter {\b Text}.\line Das {\cf1 Ende}.}"
Using ms As New IO.MemoryStream(System.Text.Encoding.ASCII.GetBytes(rtf))
RichTextBox1.LoadFile(ms, RichTextBoxStreamType.RichText)
End Using
Ordinary controls usually will not interpret this format in their text property.

Docx4J: Vertical text frame not exported to PDF

I'm using Docx4J to make an invoice model.
In the left-side of the page, it's usual to show a legal sentence as: Registered company in ... Book ... Page ...
I have inserted this in my template with a Word text frame.
Well, my issue is: when exporting to .docx, this legal text is shown perfect, but when exporting to .pdf, it's shown as an horizontal table under the other data.
The code to export to PDF is:
FOSettings foSettings = Docx4J.createFOSettings();
foSettings.setFoDumpFile(foDumpFile);
foSettings.setWmlPackage(template);
fos = new FileOutputStream(new File("/C:/mypath/prueba_OUT.pdf"));
Docx4J.toFO(foSettings, fos, Docx4J.FLAG_EXPORT_PREFER_XSL);
Any help would be very appreciated.
Thanks.
You'd need to extend the PDF via FO code; see further How to correctly position a header image with docx4j?
Float left may or may not be easy; similarly the rotated text.
In general, the way to work on this is to take the FO generated by docx4j, then hand edit it to something which FOP can convert to a PDF you are happy with. If you can do that, then its a matter of modifying docx4j to generate that FO.

Docx4j v3 Docx to HTML with Images

I'm working to convert a docx to html using Docx4j version 3.
The document contains white space consisting of tabs, spaces and newlines. The resulting HTML either has unrecognized characters or does not preserve whitespace at all.
The java code I'm using is:
WordprocessingMLPackage wordMLPackage = Docx4J.load(is);
HTMLSettings htmlSettings = Docx4J.createHTMLSettings();
htmlSettings.setImageDirPath( System.getProperty("user.dir") + uploadedImagesDirectory );
htmlSettings.setWmlPackage(wordMLPackage);
Docx4J.toHTML(htmlSettings, out, Docx4J.FLAG_EXPORT_PREFER_XSL);
String result = ((ByteArrayOutputStream)out).toString();
How can I preserve the whitespace in the document. Also, is there a method to apply css to a particular node? Specifically, I have 3 images which should be evenly spaced horizontally on the page.
I've looked over the documentation and searched online with no success.
Thank you.
I resolved the issue and it was not related to Docx4j.
Docx4j parsed the document perfectly! The problem was related to sending the output in an email.
I set the Spring helper javamail mime encoding to resolve this issue:
MimeMessageHelper message = new MimeMessageHelper(mimeMessage, true, "utf-8");

How to find x,y location of a text in pdf

Is there any tool to find the X-Y location on a text content in a pdf file ?
Docotic.Pdf Library can do it. See C# sample below:
using (PdfDocument doc = new PdfDocument("your_pdf.pdf"))
{
foreach (PdfTextData textData in doc.Pages[0].Canvas.GetTextData())
Console.WriteLine(textData.Position + " " + textData.Text);
}
Try running "Preflight..." in Acrobat and choosing PDF Analysis -> List page objects, grouped by type of object.
If you locate the text objects within the results list, you will notice there is a position value (in points) within the Text Properties -> * Font section.
TET, the Text Extraction Toolkit from the pdflib family of products can do that. TET has a commandline interface, and it's the most powerful of all text extraction tools I'm aware of. (It can even handle ligatures...)
Geometry
TET provides precise metrics for the text, such as the position on the page, glyph widths, and text direction. Specific areas on the page can be excluded or included in the text extraction, e.g. to ignore headers and footers or margins.