Comment or highlight two-column PDF using PDF Clown

I have searched for a possible solution on Google, Stack Overflow, and the PDF Clown/PDFBox forums, and have posted the problem on SO.
Problem: I am trying to highlight text that spans multiple lines in a PDF document. The PDF can have one- or two-column pages.
Using PDF Clown I was able to highlight phrases ONLY if all the words appear on the same line. With PDFBox I can generate the XML for individual words, but I could not find a solution for phrases or lines.
Please suggest a solution for PDF Clown, if any, or any other Java-compatible tool that can highlight text spanning multiple lines in a PDF.
I could not understand the answer to a similar question that uses iText; can anyone help?
Multiline markup annotations with iText

It is possible to get the coordinates of each word in a PDF document using PDFBox; here is the code for it:
import java.io.File;
import java.io.IOException;
import java.util.List;

import org.apache.pdfbox.exceptions.InvalidPasswordException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.common.PDStream;
import org.apache.pdfbox.util.PDFTextStripper;
import org.apache.pdfbox.util.TextPosition;

public class PrintTextLocations extends PDFTextStripper {

    public PrintTextLocations() throws IOException {
        super.setSortByPosition(true);
    }

    public static void main(String[] args) throws Exception {
        PDDocument document = null;
        try {
            File input = new File("C:\\path\\to\\PDF.pdf");
            document = PDDocument.load(input);
            if (document.isEncrypted()) {
                try {
                    document.decrypt("");
                } catch (InvalidPasswordException e) {
                    System.err.println("Error: Document is encrypted with a password.");
                    System.exit(1);
                }
            }
            PrintTextLocations printer = new PrintTextLocations();
            List allPages = document.getDocumentCatalog().getAllPages();
            for (int i = 0; i < allPages.size(); i++) {
                PDPage page = (PDPage) allPages.get(i);
                System.out.println("Processing page: " + i);
                PDStream contents = page.getContents();
                if (contents != null) {
                    printer.processStream(page, page.findResources(), page.getContents().getStream());
                }
            }
        } finally {
            if (document != null) {
                document.close();
            }
        }
    }

    // Called for each character; prints its position and metrics.
    @Override
    protected void processTextPosition(TextPosition text) {
        System.out.println("String[" + text.getXDirAdj() + ","
                + text.getYDirAdj() + " fs=" + text.getFontSize() + " xscale="
                + text.getXScale() + " height=" + text.getHeightDir() + " space="
                + text.getWidthOfSpace() + " width="
                + text.getWidthDirAdj() + "]" + text.getCharacter());
    }
}
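Building on that output, here is a rough sketch of how such coordinates could be turned into a highlight annotation with PDFBox 1.8. Grouping the printed TextPosition values into per-line boxes, and flipping their top-based Y coordinates into PDF's bottom-left user space, is left out here, so treat this as an outline rather than the full solution; the helper class and method names are my own.

import java.io.IOException;
import java.util.List;

import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.common.PDRectangle;
import org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotationTextMarkup;

public class HighlightSketch {
    /**
     * Adds one highlight annotation covering the given per-line boxes
     * (boxes must already be in PDF user space, origin at the bottom left).
     */
    static void addHighlight(PDPage page, List<PDRectangle> lineBoxes) throws IOException {
        PDAnnotationTextMarkup highlight =
                new PDAnnotationTextMarkup(PDAnnotationTextMarkup.SUB_TYPE_HIGHLIGHT);
        float[] quads = new float[lineBoxes.size() * 8];
        float minX = Float.MAX_VALUE, minY = Float.MAX_VALUE;
        float maxX = -Float.MAX_VALUE, maxY = -Float.MAX_VALUE;
        int i = 0;
        for (PDRectangle box : lineBoxes) {
            // One quad per line: upper-left, upper-right, lower-left, lower-right.
            quads[i++] = box.getLowerLeftX();  quads[i++] = box.getUpperRightY();
            quads[i++] = box.getUpperRightX(); quads[i++] = box.getUpperRightY();
            quads[i++] = box.getLowerLeftX();  quads[i++] = box.getLowerLeftY();
            quads[i++] = box.getUpperRightX(); quads[i++] = box.getLowerLeftY();
            minX = Math.min(minX, box.getLowerLeftX());
            minY = Math.min(minY, box.getLowerLeftY());
            maxX = Math.max(maxX, box.getUpperRightX());
            maxY = Math.max(maxY, box.getUpperRightY());
        }
        highlight.setQuadPoints(quads);
        // The annotation rectangle must enclose all the quads.
        PDRectangle rect = new PDRectangle();
        rect.setLowerLeftX(minX);
        rect.setLowerLeftY(minY);
        rect.setUpperRightX(maxX);
        rect.setUpperRightY(maxY);
        highlight.setRectangle(rect);
        page.getAnnotations().add(highlight);
    }
}

The per-line boxes themselves come from grouping adjacent TextPositions that share a baseline, which is the part that makes phrases spanning multiple lines (and columns) work.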

Multi-column text is, at the moment (PDF Clown 0.1.2), not supported for extraction: the current algorithm gathers text lying on the same horizontal baseline without evaluating possible gaps between columns.
Automatic multi-column-layout detection would be possible yet somewhat tricky, as PDF is essentially (you know) an unstructured graphic format. Nonetheless, I'm considering experimenting with it, in order to handle at least the most common scenarios.
In the meantime, I can suggest an effective workaround (it assumes that your document's columns are placed in predictable areas): for each column do a separate text extraction, instructing the TextExtractor to look only into the corresponding page area, then put all those partial extraction results together and apply your filter.
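For illustration, here is a rough sketch of that workaround. The column rectangles are hypothetical, and the TextExtractor/ITextString calls are based on the 0.1.2 API as I recall it, so verify the exact signatures against your version:

import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

import org.pdfclown.documents.Page;
import org.pdfclown.documents.contents.ITextString;
import org.pdfclown.tools.TextExtractor;

public class ColumnwiseExtraction {
    // Extracts the text of one page column by column and concatenates the results.
    static String extractByColumns(Page page, List<Rectangle2D> columnAreas) {
        TextExtractor extractor = new TextExtractor();
        Map<Rectangle2D, List<ITextString>> textStrings = extractor.extract(page);
        StringBuilder out = new StringBuilder();
        for (Rectangle2D column : columnAreas) {
            // Keep only the strings whose box falls inside this column area.
            List<ITextString> columnStrings = new ArrayList<ITextString>();
            for (List<ITextString> strings : textStrings.values()) {
                for (ITextString str : strings) {
                    if (str.getBox() != null && column.contains(str.getBox())) {
                        columnStrings.add(str);
                    }
                }
            }
            // Depending on extraction order you may want to sort columnStrings
            // by the Y coordinate of their boxes before appending.
            for (ITextString str : columnStrings) {
                out.append(str.getValue()).append('\n');
            }
        }
        return out.toString();
    }
}

The column rectangles have to be defined per layout (for example the left and right halves of the page box); once each column's text is assembled in reading order, your phrase filter can run over it and the matches can be mapped back to their boxes for highlighting.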

Related

Text Extraction, Not Image Extraction

Please help me understand if my solution is correct.
I'm trying to extract text from a PDF file with a LocationTextExtractionStrategy parser, but I'm getting exceptions because the ParseContentMethod tries to parse inline images. The code is simple and looks similar to this:
RenderFilter[] filter = { new RegionTextRenderFilter(cropBox) };
ITextExtractionStrategy strategy = new FilteredTextRenderListener(new LocationTextExtractionStrategy(), filter);
PdfTextExtractor.GetTextFromPage(pdfReader, pageNumber, strategy);
I realize the images are in the content stream, but I have a PDF file that fails to extract text because of inline images. It returns an UnsupportedPdfException of "The filter /DCTDECODE is not supported" and then finally fails with an InlineImageParseException of "Could not find image data or EI", when all I really care about is the text. The BI/EI pair exists in my file, so I assume this failure is caused by the /DCTDECODE exception. But again, I don't care about images; I'm looking for text.
My current workaround is to add a filter handler in the InlineImageUtils class that assigns a Filter_DoNothing() handler to the DCTDECODE entry of the filter-handler dictionary. This way I don't get exceptions when I have inline images with DCTDECODE. Like this:
private static bool InlineImageStreamBytesAreComplete(byte[] samples, PdfDictionary imageDictionary) {
    try {
        IDictionary<PdfName, FilterHandlers.IFilterHandler> handlers =
            new Dictionary<PdfName, FilterHandlers.IFilterHandler>(FilterHandlers.GetDefaultFilterHandlers());
        handlers[PdfName.DCTDECODE] = new Filter_DoNothing();
        PdfReader.DecodeBytes(samples, imageDictionary, handlers);
        return true;
    } catch (IOException e) {
        return false;
    }
}

public class Filter_DoNothing : FilterHandlers.IFilterHandler
{
    public byte[] Decode(byte[] b, PdfName filterName, PdfObject decodeParams, PdfDictionary streamDictionary)
    {
        return b;
    }
}
My problem with this "fix" is that I had to change the iTextSharp library itself. I'd rather not do that, so that I can stay compatible with future versions.
Here's the PDF in question:
https://app.box.com/s/7eaewzu4mnby9ogpl2frzjswgqxn9rz5

Import annotations (XFDF) to PDF

I have created a sample program to try to import XFDF into PDF using the Aspose library. The program runs without exceptions, but the output PDF does not include any annotations. Any suggestions for solving this problem?
Update - 2014-12-12
I have also reported the issue to Aspose. They can reproduce the problem and have logged ticket PDFNEWJAVA-34609 in their issue tracking system.
Following is my sample program:
public static void main(String[] args) {
    final String ROOT = "C:\\PdfAnnotation\\";
    final String sourcePDF = "hackermonthly-issue.pdf";
    final String destPDF = "output.pdf";
    final String sourceXFDF = "XFDFTest.xfdf";
    try {
        // Specify the path of the license file
        License lic = new License();
        lic.setLicense(ROOT + "Aspose.Pdf.lic");
        // Create an object of the PdfAnnotationEditor class
        PdfAnnotationEditor editor = new PdfAnnotationEditor();
        // Bind the input PDF file
        editor.bindPdf(ROOT + sourcePDF);
        // Create a file stream for the input XFDF file to import annotations from
        FileInputStream fileStream = new FileInputStream(ROOT + sourceXFDF);
        // Create an enumeration of the annotation types you want to import
        //int[] annType = { AnnotationType.Ink };
        // Import annotations of the specified type(s) from the XFDF file
        //editor.importAnnotationFromXfdf(fileStream, annType);
        editor.importAnnotationFromXfdf(fileStream);
        // Save the output PDF file
        editor.save(ROOT + destPDF);
    } catch (Exception e) {
        System.out.println("exception: " + e.getMessage());
    }
}

Tess4J doOCR() for *First Page* of pdf / tif

Is there a way to tell Tess4J to only OCR a certain amount of pages / characters?
I will potentially be working with 200+ page PDFs, but I really only want to OCR the first page, if that!
As far as I understand, the common sample
package net.sourceforge.tess4j.example;

import java.io.File;
import net.sourceforge.tess4j.*;

public class TesseractExample {
    public static void main(String[] args) {
        File imageFile = new File("eurotext.tif");
        Tesseract instance = Tesseract.getInstance(); // JNA Interface Mapping
        // Tesseract1 instance = new Tesseract1(); // JNA Direct Mapping
        try {
            String result = instance.doOCR(imageFile);
            System.out.println(result);
        } catch (TesseractException e) {
            System.err.println(e.getMessage());
        }
    }
}
would attempt to OCR the entire 200+ page document into a single String.
For my particular case, that is far more than I need, and I'm worried it could take a very long time if I let it process all 200+ pages and then just substring the first 500 characters or so.
The library has a PdfUtilities class that can extract certain pages of a PDF.
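A minimal sketch of that approach is below. The PdfUtilities package location and the convertPdf2Png signature differ between Tess4J versions (older releases ship it as net.sourceforge.vietocr.PdfUtilities, newer ones under net.sourceforge.tess4j.util), so treat these names as assumptions and check your version's javadoc; PdfUtilities also offers split methods if you prefer to cut the PDF down to one page first.

import java.io.File;
import net.sourceforge.tess4j.Tesseract;
import net.sourceforge.tess4j.TesseractException;
import net.sourceforge.vietocr.PdfUtilities; // package/name varies across Tess4J versions

public class FirstPageOcr {
    public static void main(String[] args) {
        File pdfFile = new File("two-hundred-pages.pdf");
        // Render the PDF to one image per page, then OCR only the first image.
        // Rendering is much cheaper than OCR; for very large files you could
        // first split off page 1 with a PDF tool instead.
        File[] pageImages = PdfUtilities.convertPdf2Png(pdfFile);
        Tesseract instance = Tesseract.getInstance();
        try {
            String firstPageText = instance.doOCR(pageImages[0]);
            System.out.println(firstPageText);
        } catch (TesseractException e) {
            System.err.println(e.getMessage());
        }
    }
}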

Docx4j Test - No File is Output

I'm attempting to write my first class with docx4j (http://www.docx4java.org). Basically the idea is to find a string of text in the .docx file and replace it with another string of text; essentially a mail merge. While I'm not receiving any errors, the merged document is not being saved to the path I've specified. This makes me think it's a file-path problem, but I don't see anything wrong with it.
package efi.mailmerge.servlets;

import java.util.List;
import java.util.logging.Level;
import java.util.logging.Logger;

import javax.xml.bind.JAXBElement;

import org.docx4j.openpackaging.exceptions.Docx4JException;
import org.docx4j.openpackaging.packages.WordprocessingMLPackage;
import org.docx4j.wml.Text;

public class WordDocTest {

    /**
     * Open the word document /Users/Jeff/Development/ReServe-Unleashed/Dev/MailMerge/uploads/Sample.docx,
     * replace a piece of text and save the result to
     * /Users/Jeff/Development/ReServe-Unleashed/Dev/MailMerge/uploads/Sample-Out.docx.
     *
     * The text <<CUS_FNAME>> will be replaced with John.
     *
     * @param args
     */
    public static void main(String[] args) {
        // Text nodes begin with w:t in the word document
        final String XPATH_TO_SELECT_TEXT_NODES = "//w:t";
        try {
            // Open the input file
            WordprocessingMLPackage wordMLPackage = WordprocessingMLPackage.load(
                    new java.io.File("/Users/Jeff/Development/ReServe-Unleashed/Dev/MailMerge/uploads/Sample.docx"));
            // Build a list of "text" elements
            List texts = wordMLPackage.getMainDocumentPart().getJAXBNodesViaXPath(XPATH_TO_SELECT_TEXT_NODES, true);
            // Loop through all "text" elements
            for (Object obj : texts) {
                Text text = (Text) ((JAXBElement) obj).getValue();
                // Get the text value
                String textValueBefore = text.getValue();
                // Perform the replacement
                String textValueAfter = textValueBefore.replaceAll("<<CUS_FNAME>>", "John");
                // Show the element before and after the replacement
                System.out.println("textValueBefore = " + textValueBefore);
                System.out.println("textValueAfter = " + textValueAfter);
                // Update the text element now that we have performed the replacement
                text.setValue(textValueAfter);
            }
            wordMLPackage.save(new java.io.File("/Users/Jeff/Development/ReServe-Unleashed/Dev/MailMerge/uploads/Sample-Out.docx"));
        } catch (Docx4JException e) {
            Logger.getLogger(WordDocTest.class.getName()).log(Level.SEVERE, null, e);
            e.printStackTrace();
        } catch (Exception e) {
            Logger.getLogger(WordDocTest.class.getName()).log(Level.SEVERE, null, e);
            e.printStackTrace();
        }
    }
}
In the load(...) and save(...) calls you can see the input/output paths. I've confirmed that the Sample.docx input file does exist and that the uploads directory has write permissions. Can you see anything wrong with my file paths here? I could be completely on the wrong path, but this is all very new to me, so I'm learning as I go.
Any and all help is very much appreciated.
At first sight, I would suggest trying with your path written the following way:
wordMLPackage.save(new java.io.File("\\Users\\Jeff\\Development\\ReServe-Unleashed\\Dev\\MailMerge\\uploads\\Sample-Out.docx"));
If it still doesn't work, please provide the stack traces; they could help. (If no document is saved, there must have been an exception thrown.)

Lucene 4.1: How to split words that contain "dots" when indexing?

I'm trying to figure out what I should do to index my keywords that contain a ".".
e.g. this.name
I want to index the terms this and name in my index.
I use the StandardAnalyzer. I tried extending WhitespaceTokenizer or TokenFilter, but I'm not sure if I'm going in the right direction.
If I use the StandardAnalyzer, I obtain "this.name" as a single keyword, which is not what I want, but the analyzer does the rest correctly for me.
You can put a CharFilter in front of StandardTokenizer that converts periods and underscores to spaces. MappingCharFilter will work.
Here's MappingCharFilter added to a stripped-down StandardAnalyzer (see the original 4.1 version here):
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.charfilter.MappingCharFilter;
import org.apache.lucene.analysis.charfilter.NormalizeCharMap;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.core.StopAnalyzer;
import org.apache.lucene.analysis.core.StopFilter;
import org.apache.lucene.analysis.standard.StandardFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.util.StopwordAnalyzerBase;
import org.apache.lucene.util.Version;

import java.io.IOException;
import java.io.Reader;

public final class MyAnalyzer extends StopwordAnalyzerBase {
    private int maxTokenLength = 255;

    public MyAnalyzer() {
        super(Version.LUCENE_41, StopAnalyzer.ENGLISH_STOP_WORDS_SET);
    }

    @Override
    protected TokenStreamComponents createComponents(final String fieldName, final Reader reader) {
        final StandardTokenizer src = new StandardTokenizer(matchVersion, reader);
        src.setMaxTokenLength(maxTokenLength);
        TokenStream tok = new StandardFilter(matchVersion, src);
        tok = new LowerCaseFilter(matchVersion, tok);
        tok = new StopFilter(matchVersion, tok, stopwords);
        return new TokenStreamComponents(src, tok) {
            @Override
            protected void setReader(final Reader reader) throws IOException {
                src.setMaxTokenLength(MyAnalyzer.this.maxTokenLength);
                super.setReader(reader);
            }
        };
    }

    @Override
    protected Reader initReader(String fieldName, Reader reader) {
        // Map '.' and '_' to spaces before tokenization, so "this.name" splits into "this" and "name".
        NormalizeCharMap.Builder builder = new NormalizeCharMap.Builder();
        builder.add(".", " ");
        builder.add("_", " ");
        NormalizeCharMap normMap = builder.build();
        return new MappingCharFilter(normMap, reader);
    }
}
Here's a quick test to demonstrate it works:
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.BaseTokenStreamTestCase;

public class TestMyAnalyzer extends BaseTokenStreamTestCase {
    private Analyzer analyzer = new MyAnalyzer();

    public void testPeriods() throws Exception {
        BaseTokenStreamTestCase.assertAnalyzesTo(
                analyzer,
                "this.name; here.i.am; sentences ... end with periods.",
                new String[] { "name", "here", "i", "am", "sentences", "end", "periods" });
    }

    public void testUnderscores() throws Exception {
        BaseTokenStreamTestCase.assertAnalyzesTo(
                analyzer,
                "some_underscore_term _and____ stuff that is_not in it",
                new String[] { "some", "underscore", "term", "stuff" });
    }
}
If I understand you correctly, you need to use a tokenizer that removes dots -- that is, any name that contains a dot should be split at that point ("here.i.am" becomes "here" + "i" + "am").
You are getting caught by the behavior documented here:
However, a dot that's not followed by whitespace is considered part of a token.
StandardTokenizer introduces more complex parsing rules than you may be looking for. This one, in particular, is intended to prevent splitting of URLs, IPs, identifiers, etc. A simpler implementation, like LetterTokenizer, might suit your needs.
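For instance, here is a minimal sketch of an analyzer wired around LetterTokenizer; the class name LetterOnlyAnalyzer is just for illustration, the rest is standard Lucene 4.1 API:

import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.core.LetterTokenizer;
import org.apache.lucene.util.Version;

// LetterTokenizer splits on anything that is not a letter, so "this.name"
// becomes "this" and "name" (note: digits are dropped as well).
public final class LetterOnlyAnalyzer extends Analyzer {
    @Override
    protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
        return new TokenStreamComponents(new LetterTokenizer(Version.LUCENE_41, reader));
    }
}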
If that doesn't really suit your needs (and it might well turn out to be throwing the baby out with the bathwater), then you may need to modify StandardTokenizer yourself, which is explicitly encouraged by the Lucene docs:
Many applications have specific tokenizer needs. If this tokenizer does not suit your application, please consider copying this source code directory to your project and maintaining your own grammar-based tokenizer.
Sebastien Dionne: I didn't understand how to split a word; do I have to parse the document char by char?
Sebastien Dionne: I still want to know how to split a token into multiple parts and index them all.
You may have to write a custom analyzer.
Analyzer is a combination of Tokenizer and possibly a chain of TokenFilter instances.
Tokenizer: takes in the input text you pass, typically as a java.io.Reader. It JUST breaks the text down; it doesn't alter it.
TokenFilter: takes in the tokens emitted by the Tokenizer, adds / removes / alters tokens, and emits them one by one until all are finished.
If it replaces a token with multiple tokens based on requirements, it buffers them all and emits them one by one to the indexer, as in the sketch below.
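To make that concrete, here is a minimal sketch of such a filter. DotSplitFilter is a hypothetical name; the TokenFilter and attribute APIs used are standard Lucene 4.x, but offsets are not adjusted here.

import java.io.IOException;
import java.util.ArrayDeque;
import java.util.Deque;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

// Splits each incoming token on '.' and emits the parts as separate tokens
// ("this.name" -> "this", "name"). Offsets are left untouched in this sketch.
public final class DotSplitFilter extends TokenFilter {
    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
    private final PositionIncrementAttribute posIncAtt = addAttribute(PositionIncrementAttribute.class);
    private final Deque<String> pending = new ArrayDeque<String>();

    public DotSplitFilter(TokenStream input) {
        super(input);
    }

    @Override
    public boolean incrementToken() throws IOException {
        // Emit buffered parts first, one per call.
        if (!pending.isEmpty()) {
            termAtt.setEmpty().append(pending.poll());
            posIncAtt.setPositionIncrement(1);
            return true;
        }
        if (!input.incrementToken()) {
            return false;
        }
        String term = termAtt.toString();
        if (term.indexOf('.') < 0) {
            return true; // nothing to split, pass through unchanged
        }
        // Split, emit the first non-empty part now, buffer the rest.
        String[] parts = term.split("\\.");
        int i = 0;
        while (i < parts.length && parts[i].isEmpty()) {
            i++;
        }
        if (i == parts.length) {
            return incrementToken(); // token consisted only of dots; skip it
        }
        termAtt.setEmpty().append(parts[i++]);
        for (; i < parts.length; i++) {
            if (!parts[i].isEmpty()) {
                pending.add(parts[i]);
            }
        }
        return true;
    }

    @Override
    public void reset() throws IOException {
        super.reset();
        pending.clear();
    }
}

You would chain it after a tokenizer that keeps "this.name" intact (for example WhitespaceTokenizer) inside your custom Analyzer.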
You may check the following resource; unfortunately, you may have to sign up for a trial membership.
By writing a custom analyzer, you can break down the text the way you want to. You may even reuse existing components like LowerCaseFilter. Fortunately, with Lucene it is achievable to come up with an Analyzer that serves your purpose if you can't find one built in or on the web.
"Writing Custom Filters", Lucene in Action, 2nd Edition