How to fix the incorrect mapping of glyphs to Unicode characters seen in a PDF

When I read letters that are marked as Latin-1 Supplement in the PDF, they come out as other letters.
An example of the changed output is shown after the code.
What is causing this, and how can I fix it?
public static void main(String[] args) throws IOException {
    // try-with-resources closes the document even if extraction fails
    try (PDDocument doc = PDDocument.load(new File("myPDF.pdf"))) {
        PDFTextStripper tStripper = new PDFTextStripper();
        String cont = tStripper.getText(doc);
        System.out.println(cont);
    }
}
in PDF -> in console
'Ô' -> '«'
'Á' -> '¸'
'Ệ' -> 'Ö'
"CÔNG BÁO" -> "c«ng b¸o"

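The pairs listed in the question can be turned into a post-extraction substitution table. The sketch below is only an illustration under stated assumptions: that the PDF font uses a legacy single-byte Vietnamese encoding (the pairs resemble TCVN3, which is a guess), that the font draws capital glyphs for lowercase codes, and that the three reported pairs are representative. `LegacyGlyphFix` and `fix` are hypothetical names, not part of PDFBox.

```java
import java.util.Locale;
import java.util.Map;

class LegacyGlyphFix {
    // Only the three pairs reported in the question (extracted char -> intended char).
    // A real table would need an entry for every affected character.
    private static final Map<Character, Character> FIX = Map.of(
            '\u00AB', '\u00F4',   // left guillemet  -> o with circumflex
            '\u00B8', '\u00E1',   // cedilla         -> a with acute
            '\u00D6', '\u1EC7'    // O with diaeresis -> e with circumflex and dot below
    );

    // Replaces known wrong characters; optionally upper-cases the result,
    // on the assumption that the font shows capitals for lowercase codes.
    static String fix(String extracted, boolean upperCase) {
        StringBuilder sb = new StringBuilder(extracted.length());
        for (char ch : extracted.toCharArray()) {
            char mapped = FIX.getOrDefault(ch, ch);
            sb.append(mapped);
        }
        String result = sb.toString();
        return upperCase ? result.toUpperCase(Locale.ROOT) : result;
    }

    public static void main(String[] args) {
        // The extracted string from the question, written with escapes to keep the source ASCII-only
        System.out.println(fix("c\u00ABng b\u00B8o", true));
    }
}
```

A blanket substitution like this also rewrites characters that were extracted correctly (a genuine 'Ö', for example), so it should only be applied to text drawn with the affected font.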
Related

How to decode data from Content Stream

I created a PDF document using code like the following:
// The text parameter equals 'שדג' (Hebrew); its Unicode equivalent is '\u05E9\u05D3\u05D2'
private static void createSimplePdf(String filename, String text) throws Exception {
    final String path = RunItextApp.class.getResource("/Arial.ttf").getPath();
    final PdfFont font = PdfFontFactory.createFont(path, PdfEncodings.IDENTITY_H);
    Style hebrewStyle = new Style()
            .setBaseDirection(BaseDirection.RIGHT_TO_LEFT)
            .setFontSize(14)
            .setFont(font);
    final PdfWriter pdfWriter = new PdfWriter(filename);
    final PdfDocument pdfDocument = new PdfDocument(pdfWriter);
    final Document pdf = new Document(pdfDocument);
    pdf.add(
            new Paragraph(text)
                    .setFontScript(Character.UnicodeScript.HEBREW)
                    .addStyle(hebrewStyle)
    );
    pdf.close();
    System.out.println("The document '" + filename + "' has been created.");
}
After that, I opened the document with the PDFBox utility and inspected its data.
I got an unexpected result in the Contents stream section, especially in the Tj operator. I expected a string like 05E905D305D2, but I got 02b902a302a2. When I converted this hex string to a normal string I got ʹʣʢ, but I expected שדג.
What am I doing wrong? How can I convert the string 02b902a302a2 to get שדג?
This answer was written in a comment by @usr2564301. Thanks for the help!
The numbers you get are not Unicode characters but font indexes instead. (Check how the font is embedded!) Text in a PDF is not necessarily stored as Unicode; it may or may not be. Good PDF creators add a /ToUnicode table to help decoding, but it's optional.
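To make the difference concrete, here is a minimal sketch that reads the hex string from the question in two ways: naively as Unicode code points, and through a lookup table playing the role of /ToUnicode. The table is an assumption inferred by pairing the codes the asker got with the codes they expected, in order; it is not read from the actual font. `GlyphCodeDemo` and its methods are hypothetical names.

```java
import java.util.Map;

class GlyphCodeDemo {
    // Splits a hex string into 16-bit codes (two bytes per glyph, as with Identity-H).
    static int[] parseCodes(String hex) {
        int[] codes = new int[hex.length() / 4];
        for (int i = 0; i < codes.length; i++) {
            codes[i] = Integer.parseInt(hex.substring(4 * i, 4 * i + 4), 16);
        }
        return codes;
    }

    // Naive reading: pretend every code is a Unicode code point.
    static String asUnicode(int[] codes) {
        StringBuilder sb = new StringBuilder();
        for (int code : codes) {
            sb.appendCodePoint(code);
        }
        return sb.toString();
    }

    // Table-driven reading: look each code up, as a /ToUnicode table would allow.
    static String viaTable(int[] codes, Map<Integer, String> toUnicode) {
        StringBuilder sb = new StringBuilder();
        for (int code : codes) {
            sb.append(toUnicode.getOrDefault(code, "\uFFFD")); // replacement char if unknown
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        int[] codes = parseCodes("02b902a302a2");
        // ASSUMED mapping, inferred from the question (got -> expected); not taken from the font
        Map<Integer, String> toUnicode = Map.of(
                0x02B9, "\u05E9",
                0x02A3, "\u05D3",
                0x02A2, "\u05D2");
        System.out.println(asUnicode(codes));
        System.out.println(viaTable(codes, toUnicode));
    }
}
```

The naive reading reproduces the ʹʣʢ from the question; only the table lookup recovers the Hebrew letters. In practice a text extractor such as PDFTextStripper does this lookup itself when the font carries a /ToUnicode table.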

PDFBox is not giving the right output

I am parsing a PDF with PDFBox to extract all the text from it:
public static void main(String args[]) {
    PDFTextStripper pdfStripper = null;
    PDDocument pdDoc = null;
    COSDocument cosDoc = null;
    File file = new File("C:\\Users\\admin\\Downloads\\Airtel.pdf");
    try {
        PDFParser parser = new PDFParser(new FileInputStream(file));
        parser.parse();
        cosDoc = parser.getDocument();
        pdfStripper = new PDFTextStripper();
        pdDoc = new PDDocument(cosDoc);
        pdfStripper.setStartPage(1);
        pdfStripper.setEndPage(1);
        String parsedText = pdfStripper.getText(pdDoc);
        System.out.println(parsedText);
    } catch (IOException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }
}
But it does not give any text in the output.
Please help.
PDFBox extracts text similarly to how Adobe Reader copies and pastes text.
If you open your document in Adobe Reader and press <Ctrl-A> to mark all text (you'll see that hardly anything is marked) and copy and paste it into an editor, you'll find that Adobe Reader also extracts hardly anything.
The reason why neither PDFBox nor Adobe Reader (nor any other normal text extractor) extracts the text from your document is that there is virtually no text in it at all! The "text" you see is not drawn using text drawing operations; instead, it is painted by defining the outline of each "character" as a path and filling the area inside that path. Thus, there is no indication to a text extractor that there even is text.
There actually are two characters of real text in your document, the '-' sign between the "Previous balance" and the "Payments" boxes and the '-' sign between the "Payments" and the "Adjustments" boxes. And even those two characters are not extracted as desired because the font does not provide the information which Unicode codepoint those characters represent.
Virtually your only chance to extract the text content of the document, therefore, is to apply OCR to it.
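The distinction between text drawing operations and filled paths can be illustrated with a crude check on a content stream. This is only a sketch under assumptions: it expects an already decoded content stream as a plain string and splits it on whitespace, which a real tokenizer would not do (string operands may themselves contain these tokens). `ContentStreamSniffer` and its methods are hypothetical names; the operator names are the standard PDF text-showing and path-filling operators.

```java
import java.util.Set;

class ContentStreamSniffer {
    // Operators that show text.
    private static final Set<String> TEXT_OPS = Set.of("Tj", "TJ", "'", "\"");
    // Operators that fill a path (some also stroke it).
    private static final Set<String> FILL_OPS = Set.of("f", "F", "f*", "B", "B*", "b", "b*");

    // Counts whitespace-separated tokens that match one of the given operators.
    private static int count(String contentStream, Set<String> operators) {
        int n = 0;
        for (String token : contentStream.trim().split("\\s+")) {
            if (operators.contains(token)) {
                n++;
            }
        }
        return n;
    }

    static int countTextOps(String contentStream) {
        return count(contentStream, TEXT_OPS);
    }

    static int countFillOps(String contentStream) {
        return count(contentStream, FILL_OPS);
    }

    // True when the stream paints filled shapes but never shows text,
    // which is what a page of outlined "characters" looks like.
    static boolean looksLikeOutlinedText(String contentStream) {
        return countTextOps(contentStream) == 0 && countFillOps(contentStream) > 0;
    }

    public static void main(String[] args) {
        String realText = "BT /F1 12 Tf 72 700 Td (Hi) Tj ET";
        String outlined = "10 10 m 20 30 l 30 10 l h f";
        System.out.println(looksLikeOutlinedText(realText));
        System.out.println(looksLikeOutlinedText(outlined));
    }
}
```

A page that triggers this check may still contain only ordinary vector graphics, so it is a hint that OCR might be needed rather than proof.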

PDFBox Signature Field not well recognized

I'm having trouble producing a digital signature field in a PDF using PDFBox 2.0.0-RC3.
This is the piece of code I use:
public static void main(String[] args) throws IOException, URISyntaxException
{
    PDDocument document;
    document = new PDDocument();
    PDPage page = new PDPage(PDRectangle.A4);
    document.addPage(page);
    PDAcroForm acroForm = new PDAcroForm(document);
    document.getDocumentCatalog().setAcroForm(acroForm);
    PDSignatureField signatureBox = new PDSignatureField(acroForm);
    signatureBox.setPartialName("ENSGN-MY_SIGNATURE_FIELD-001");
    acroForm.getFields().add(signatureBox);
    PDAnnotationWidget widget = signatureBox.getWidgets().get(0);
    PDRectangle rect = new PDRectangle();
    rect.setLowerLeftX(50);
    rect.setLowerLeftY(750);
    rect.setUpperRightX(250);
    rect.setUpperRightY(800);
    widget.setRectangle(rect);
    page.getAnnotations().add(widget);
    try {
        document.save("/tmp/mySignatureFieldGEN_PDFBOX.pdf");
        document.close();
    } catch (Exception io) {
        System.out.println(io);
    }
}
The code generates a PDF document. I open it with Acrobat Reader and this is the result:
PDF BOX Generated
As you can see, the signature panel on the left is empty, but the signature field on the left is present and works.
I generated the same PDF with PDFTron. This is the result:
PDF Tron Generated
In this case the signature panel on the left correctly shows the presence of the signature field.
I would like to obtain this second (correct) result, but I don't understand why PDFBox doesn't produce it.
Many thanks
Add this:
widget.setPage(page);
This sets the /P entry of the widget annotation, which references the page the annotation is on.
Now the panel on the left appears. How did I get the idea? I got a document with such an empty signature field (from here), and compared it with yours with PDFDebugger.

Import annotations (XFDF) to PDF

I have created a sample program to try to import XFDF into a PDF using the Aspose library. The program runs without throwing an exception, but the output PDF does not include any annotations. Any suggestions to solve this problem?
Update - 2014-12-12
I have also sent the issue to Aspose. They can reproduce the problem and have logged it as ticket PDFNEWJAVA-34609 in their issue tracking system.
The following is my sample program:
public static void main(String[] args) {
    final String ROOT = "C:\\PdfAnnotation\\";
    final String sourcePDF = "hackermonthly-issue.pdf";
    final String destPDF = "output.pdf";
    final String sourceXFDF = "XFDFTest.xfdf";
    try
    {
        // Specify the path of the license file
        License lic = new License();
        lic.setLicense(ROOT + "Aspose.Pdf.lic");
        // Create an object of the PdfAnnotationEditor class
        PdfAnnotationEditor editor = new PdfAnnotationEditor();
        // Bind the input PDF file
        editor.bindPdf(ROOT + sourcePDF);
        // Create a file stream for the input XFDF file to import annotations
        FileInputStream fileStream = new FileInputStream(ROOT + sourceXFDF);
        // Create an enumeration of all the annotation types which you want to import
        //int[] annType = {AnnotationType.Ink };
        // Import annotations of the specified type(s) from the XFDF file
        //editor.importAnnotationFromXfdf(fileStream, annType);
        editor.importAnnotationFromXfdf(fileStream);
        // Save the output PDF file
        editor.save(ROOT + destPDF);
    } catch (Exception e) {
        System.out.println("exception: " + e.getMessage());
    }
}

What's the easiest way of converting an xhtml string to PDF using Flying Saucer?

I've been using Flying Saucer for a while now with awesome results.
I can set a document via uri like so
ITextRenderer renderer = new ITextRenderer();
renderer.setDocument(xhtmlUri);
Which is nice, as it will resolve all relative css resources etc relative to the given URI. However, I'm now generating the xhtml, and want to render it directly to a PDF (without saving a file). The appropriate methods in ITextRenderer seem to be:
private Document loadDocument(final String uri) {
    return _sharedContext.getUac().getXMLResource(uri).getDocument();
}
public void setDocument(String uri) {
    setDocument(loadDocument(uri), uri);
}
public void setDocument(Document doc, String url) {
    setDocument(doc, url, new XhtmlNamespaceHandler());
}
As you can see, my existing code just gives the uri and ITextRenderer does the work of creating the Document for me.
What's the shortest way of creating the Document from my formatted xhtml String? I'd prefer to use the existing Flying Saucer libs without having to import another XML parsing jar (just for the sake of consistent bugs and functionality).
The following works:
Document document = XMLResource.load(new ByteArrayInputStream(templateString.getBytes())).getDocument();
Previously, I had tried
final DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
dbf.setNamespaceAware(true);
dbf.setValidating(false);
final DocumentBuilder documentBuilder = dbf.newDocumentBuilder();
Document document = documentBuilder.parse(new ByteArrayInputStream(templateString.getBytes()));
but that fails as it attempts to download the HTML docType from http://www.w3.org (which returns 503's for the java libs).
I use the following without problem:
final DocumentBuilderFactory documentBuilderFactory = DocumentBuilderFactory.newInstance();
documentBuilderFactory.setValidating(false);
DocumentBuilder builder = documentBuilderFactory.newDocumentBuilder();
builder.setEntityResolver(FSEntityResolver.instance());
org.w3c.dom.Document document = builder.parse(new ByteArrayInputStream(doc.toString().getBytes()));
ITextRenderer renderer = new ITextRenderer();
renderer.setDocument(document, null);
renderer.layout();
renderer.createPDF(os);
The key differences here are passing in a null URI and providing the DocumentBuilder with an entity resolver.
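To show why the entity resolver matters, here is a JDK-only sketch that parses an XHTML string without contacting www.w3.org. It is not the Flying Saucer approach from the answer: instead of FSEntityResolver it uses a resolver that answers every external entity request with an empty document, so named entities defined in the XHTML DTD (such as &nbsp;) will not resolve. `OfflineXhtmlParser` and `parse` are hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class OfflineXhtmlParser {
    // Parses an XHTML string into a DOM without fetching the DTD named in its DOCTYPE.
    static Document parse(String xhtml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(false);
            DocumentBuilder builder = factory.newDocumentBuilder();
            // Answer every external entity lookup (including the DOCTYPE's DTD)
            // with an empty document instead of a network request.
            builder.setEntityResolver(
                    (publicId, systemId) -> new InputSource(new StringReader("")));
            // Encode explicitly rather than relying on the platform default charset
            byte[] bytes = xhtml.getBytes(StandardCharsets.UTF_8);
            return builder.parse(new ByteArrayInputStream(bytes));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse XHTML", e);
        }
    }

    public static void main(String[] args) {
        String xhtml = "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Strict//EN\" "
                + "\"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd\">"
                + "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body><p>Hello</p></body></html>";
        Document document = parse(xhtml);
        System.out.println(document.getDocumentElement().getNodeName());
    }
}
```

The resulting Document can then be handed to renderer.setDocument(document, null) as in the answer. When the XHTML uses named entities, the FSEntityResolver shown in the answer is the better choice.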