PDFBox printing: printed PDF contains junk characters for Arabic text from the PDF

I have a PDF file containing Arabic text and a watermark, and I am using PDFBox to print it from Java. The PDF prints with high quality, but every line containing Arabic characters comes out as junk characters. Could somebody help with this?
Code:
String pdfFile = "C:/AresEPOS_Home/Receipts/1391326264281.pdf";
PDDocument document = null;
try {
document = PDDocument.load(pdfFile);
//PDFont font = PDTrueTypeFont.loadTTF(document, "C:/Windows/Fonts/Arial.ttf");
PrinterJob printJob = PrinterJob.getPrinterJob();
printJob.setJobName(new File(pdfFile).getName());
PrintService[] printService = PrinterJob.lookupPrintServices();
boolean printerFound = false;
for (int i = 0; !printerFound && i < printService.length; i++) {
if (printService[i].getName().indexOf("EPSON") != -1) {
printJob.setPrintService(printService[i]);
printerFound = true;
}
}
document.silentPrint(printJob);
}
finally {
if (document != null) {
document.close();
}
}

In essence
Your PDF can properly be printed using PDFBox 2.0.0-SNAPSHOT but not using PDFBox 1.8.4. Thus, either the Arabic font in question requires a feature which is not yet supported in PDFBox up to version 1.8.4 or there was a bug in 1.8.4 which meanwhile has been fixed.
The details
Printing the OP's document using PDFBox 1.8.4 produced scrambled output, whereas printing it using the then-current PDFBox 2.0.0-SNAPSHOT produced proper output.
In 2.0.0-SNAPSHOT the PDDocument methods print and silentPrint have been removed, though, so the original
document.silentPrint(printJob);
has to be replaced by something like
printJob.setPageable(new PDPageable(document, printJob));
printJob.print();

Related

PDFBox getText not returning all of the visible text

I am using PDFBox to extract text from my PDF document. It retrieves text, but not all of it (specifically, the title/header and footer texts seem to be missing). The missing parts are not images, and they are extracted when using the text view in Foxit Reader.
I am using version 1.8.12 and made a test case with 2.0.2 just to see if it would return more of the content.
This is the code I used for 2.0.2:
public static void main(String[] args) {
    File file = new File("D:\\file.pdf");
    // try-with-resources closes the document even if extraction fails
    try (PDDocument doc = PDDocument.load(file)) {
        PDFTextStripper stripper = new PDFTextStripper();
        //stripper.setSuppressDuplicateOverlappingText(false);
        String text = stripper.getText(doc);
        System.out.println(text);
    } catch (Exception e) {
        System.out.println("Exception occurred: " + e.getMessage());
    }
}
Now I wonder: are there any settings I missed? Is PDFBox failing because the text is on top of some decorative elements (a rectangle under the text)?
Thanks
EDIT: link to file in question
As discussed in the comments, the text wasn't missing, but at the "wrong" position. By default, PDFBox text extraction extracts the characters as they come in the content stream, but they don't always come in a "natural" way. PDF files are created by software, not by humans.
An alternative is to use the sort option:
stripper.setSortByPosition(true);
However, as mkl pointed out, if the text is in two columns, you won't like the result either.
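The difference between content-stream order and position order, and the two-column caveat, can be illustrated with a small toy model. This is only a sketch in plain Java: the `Chunk` class and both helper methods are hypothetical names made up for the illustration, and the sorting rule (top to bottom, then left to right) is a simplification, not PDFBox's actual algorithm.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Toy model of positioned text; NOT how PDFBox is implemented internally. */
final class ReadingOrderSketch {

    /** Hypothetical text chunk: top-left position (y grows downward) plus its text. */
    static final class Chunk {
        final double x;
        final double y;
        final String text;

        Chunk(double x, double y, String text) {
            this.x = x;
            this.y = y;
            this.text = text;
        }
    }

    /** Joins the chunks in the order given, i.e. the content-stream order. */
    static String inStreamOrder(List<Chunk> chunks) {
        StringBuilder sb = new StringBuilder();
        for (Chunk c : chunks) {
            sb.append(c.text).append('\n');
        }
        return sb.toString();
    }

    /** Joins the chunks top to bottom, then left to right (simplified sorting rule). */
    static String sortedByPosition(List<Chunk> chunks) {
        List<Chunk> sorted = new ArrayList<>(chunks);
        sorted.sort(Comparator.<Chunk>comparingDouble(c -> c.y).thenComparingDouble(c -> c.x));
        return inStreamOrder(sorted);
    }

    public static void main(String[] args) {
        // The producing software wrote the body first and the header last.
        List<Chunk> page = Arrays.asList(
                new Chunk(50, 100, "Body line 1"),
                new Chunk(50, 120, "Body line 2"),
                new Chunk(50, 20, "Header"));
        System.out.println("Stream order:\n" + inStreamOrder(page));
        System.out.println("Sorted by position:\n" + sortedByPosition(page));

        // Two columns: sorting by position interleaves the left and right column.
        List<Chunk> twoColumns = Arrays.asList(
                new Chunk(50, 100, "L1"), new Chunk(50, 120, "L2"),
                new Chunk(300, 100, "R1"), new Chunk(300, 120, "R2"));
        System.out.println("Two columns, sorted:\n" + sortedByPosition(twoColumns));
    }
}
```

In the toy model the header moves to the top once the chunks are sorted, but the two-column page comes out as L1, R1, L2, R2, which is the kind of result you won't like for column layouts.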

How can I Extract Japanese Text from a PDF using iTextSharp?

This code returns lots of \0\0s and extracts only a few English phrases from the PDF. Any Japanese text is not returned.
I am using Unicode encoding, so I am not sure what is happening here.
StringBuilder text = new StringBuilder(2000);
string fullFileName = @"c:\my_japanaese_pdf.pdf";
PdfReader pdfReader = new iTextSharp.text.pdf.PdfReader(fullFileName);
for (int page = 1; page <= pdfReader.NumberOfPages; page++)
{
ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);
currentText = Encoding.Unicode.GetString(UnicodeEncoding.Convert(Encoding.Unicode, Encoding.Unicode, Encoding.Unicode.GetBytes(currentText)));
text.Append(currentText);
}
pdfReader.Close();
(Windows 7 x64, iTextSharp 5.0.2.0)
Thanks
Ryan
I had this same problem, and here's what I did (note this code is extremely similar to the code in the question, but doesn't use any encoding conversion stuff).
using (iTextSharp.text.pdf.PdfReader reader = new iTextSharp.text.pdf.PdfReader(inputPDF))
{
    for (int i = 1; i <= reader.NumberOfPages; i++)
    {
        // Use a fresh strategy per page; a reused strategy keeps the text of earlier pages
        ITextExtractionStrategy strategy = new LocationTextExtractionStrategy();
        string page = PdfTextExtractor.GetTextFromPage(reader, i, strategy);
        string[] lines = page.Split('\n');
        foreach (string line in lines)
        {
            // do anything you want here
        }
    }
}
Even with the above code, I was still not getting any Japanese characters out of the PDF, so I changed the font used in the PDF to Meiryo UI. That solved the problem for me: Meiryo UI is a font that iTextSharp recognizes (at least in version 5.5.13.2), so Japanese text set in that font can be extracted from the PDF successfully.
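As a side note on why dropping the encoding conversion loses nothing: the line in the question converts from Unicode to Unicode, which is a round trip that cannot change the string. A minimal sketch of that round trip, written in Java rather than C# and assuming that .NET's `Encoding.Unicode` corresponds to UTF-16LE (the class and method names are made up for the illustration):

```java
import java.nio.charset.StandardCharsets;

/** Shows that encoding a string and decoding it with the same charset changes nothing. */
final class EncodingRoundTripSketch {

    /** Encodes to UTF-16LE bytes and decodes them again with the same charset. */
    static String roundTripUtf16(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_16LE);
        return new String(bytes, StandardCharsets.UTF_16LE);
    }

    public static void main(String[] args) {
        String japanese = "\u65E5\u672C\u8A9E abc"; // three kanji followed by ASCII text
        String withNuls = "a\0\0b";                 // text that already contains NUL characters

        // Both comparisons hold: the round trip neither repairs nor damages the text.
        System.out.println(roundTripUtf16(japanese).equals(japanese));
        System.out.println(roundTripUtf16(withNuls).equals(withNuls));
    }
}
```

So if the extracted string already contains \0 characters, they were produced during extraction itself, and re-encoding afterwards cannot bring the Japanese text back.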

PDFBox pdf to image generates overlapping text

For a side project I started using PDFBox to convert a PDF file to an image. This is the PDF file I am converting: https://bitcoin.org/bitcoin.pdf.
This is the code I am using. It is very simple code that calls PDFToImage, but the output JPG image looks really bad, with a lot of commas inserted and some overlapping text.
String [] args_2 = new String[7];
String pdfPath = "C:\\bitcoin.pdf";
args_2[0] = "-startPage";
args_2[1] = "1";
args_2[2] = "-endPage";
args_2[3] = "1";
args_2[4] = "-outputPrefix";
args_2[5] = "my_image_2";
//args_2[6] = "-resolution";
//args_2[7] = "1000";
args_2[6] = pdfPath;
try {
PDFToImage.main(args_2);
} catch (Exception e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
If you look at the logging output (you may need to activate logging in your environment), you'll see many entries like these (generated using PDFBox 1.8.5):
Jun 16, 2014 8:40:43 AM org.apache.pdfbox.pdmodel.font.PDSimpleFont drawString
Warnung: Changing font on <t> from <Century Schoolbook Fett> to the default font
Jun 16, 2014 8:40:43 AM org.apache.pdfbox.pdmodel.font.PDSimpleFont drawString
Warnung: Changing font on <S> from <Times New Roman> to the default font
Jun 16, 2014 8:40:46 AM org.apache.pdfbox.pdmodel.font.PDSimpleFont drawString
Warnung: Changing font on <c> from <Arial> to the default font
Jun 16, 2014 8:40:52 AM org.apache.pdfbox.pdmodel.font.PDSimpleFont drawString
Warnung: Changing font on <i> from <Courier New> to the default font
So PDFBox renders the text with fonts other than the ones indicated by the PDF. This explains both the many inserted commas and the overlapping text:
Different fonts may have different encodings. It looks like your sample PDF uses an encoding which has a comma where the default font assumed by PDFBox has a space character.
Different fonts have different glyph widths. In your sample PDF the differing glyph widths cause overlapping text.
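Both effects can be reproduced in a toy model. The following is only a sketch under simplifying assumptions: glyph positions are advanced by the widths the PDF was laid out with, while the glyph shapes come from a substituted font with other widths, and an "encoding" is reduced to a plain code-to-character table. All names and numbers are made up for the illustration and are not PDFBox code.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of font substitution; NOT how PDFBox renders text. */
final class FontSubstitutionSketch {

    /** Decodes character codes through a code-to-character table (a toy "encoding"). */
    static String decode(int[] codes, Map<Integer, Character> table) {
        StringBuilder sb = new StringBuilder();
        for (int code : codes) {
            sb.append(table.get(code));
        }
        return sb.toString();
    }

    /** Counts glyphs whose painted shape extends past the start of the next glyph. */
    static int countOverlaps(String text, Map<Character, Double> layoutWidths,
                             Map<Character, Double> paintedWidths) {
        int overlaps = 0;
        double x = 0; // pen position, advanced by the widths used for layout
        for (int i = 0; i + 1 < text.length(); i++) {
            double rightEdge = x + paintedWidths.get(text.charAt(i));
            double nextStart = x + layoutWidths.get(text.charAt(i));
            if (rightEdge > nextStart) {
                overlaps++;
            }
            x = nextStart;
        }
        return overlaps;
    }

    public static void main(String[] args) {
        // Same codes, two tables: code 3 is a space in one and a comma in the other.
        Map<Integer, Character> intended = new HashMap<>();
        intended.put(1, 'a'); intended.put(2, 'b'); intended.put(3, ' ');
        Map<Integer, Character> substituted = new HashMap<>(intended);
        substituted.put(3, ',');
        int[] codes = {1, 3, 2};
        System.out.println(decode(codes, intended));
        System.out.println(decode(codes, substituted));

        // Layout assumes narrow glyphs; the substituted font paints a wider 'a'.
        Map<Character, Double> layout = new HashMap<>();
        layout.put('a', 5.0); layout.put('b', 5.0);
        Map<Character, Double> painted = new HashMap<>();
        painted.put('a', 7.0); painted.put('b', 4.0);
        System.out.println(countOverlaps("aab", layout, painted));
    }
}
```

With matching width tables the overlap count is zero; as soon as a substituted glyph is wider than the advance the layout reserved for it, it runs into its neighbour.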
The reason for all this is that PDFBox 1.8.x does not properly support all kinds of fonts for rendering. You might want to try PDFBox 2.0.0-SNAPSHOT, the new PDFBox currently under development, instead. Be aware, though, the classes for rendering have been changed.
Using PDFBox 2.0.0-SNAPSHOT
Using the current (mid-June 2014) state of PDFBox 2.0.0-SNAPSHOT you can render PDFs like this:
PDDocument document = PDDocument.loadNonSeq(resource, null);
PDDocumentCatalog catalog = document.getDocumentCatalog();
@SuppressWarnings("unchecked")
List<PDPage> pages = catalog.getAllPages();
PDFRenderer renderer = new PDFRenderer(document);
for (int i = 0; i < pages.size(); i++)
{
BufferedImage image = renderer.renderImage(i);
ImageIO.write(image, "png", new File("bitcoin-convertToImage-" + i + ".png"));
}
Other PDFRenderer.renderImage overloads allow you to explicitly set the desired resolution.
PS: As proposed by Tilman Hausherr you may want to replace the ImageIO.write call by
ImageIOUtil.writeImage(image, "bitcoin-convertToImage-" + i + ".png", 72);
ImageIOUtil is a PDFBox helper class which tries to optimize the selection of the ImageIO writer and to add a DPI attribute to the image file.
If you use a different PDFRenderer.renderImage overload to set a resolution, remember to change the final parameter 72 here accordingly.
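To see how a chosen resolution relates to the size of the resulting image, recall that PDF page dimensions are expressed in units of 1/72 inch by default, so 72 DPI corresponds to a scale of 1. The helper below is a plain-Java sketch with made-up names; how a particular renderer rounds the result is an assumption here and may differ.

```java
/** Converts PDF page dimensions (1/72 inch units) to pixels for a given resolution. */
final class RenderSizeSketch {

    /** Default PDF user space: 72 units per inch. */
    static final double POINTS_PER_INCH = 72.0;

    /** Pixel length of a dimension given in points, rendered at the given DPI. */
    static int pixelsFor(double points, double dpi) {
        return (int) Math.round(points * dpi / POINTS_PER_INCH);
    }

    public static void main(String[] args) {
        double widthPt = 612;  // US Letter width in points (8.5 inch)
        double heightPt = 792; // US Letter height in points (11 inch)
        for (double dpi : new double[] {72, 150, 300}) {
            System.out.println(dpi + " DPI: "
                    + pixelsFor(widthPt, dpi) + " x " + pixelsFor(heightPt, dpi) + " px");
        }
    }
}
```

The DPI value stored in the image file should be the same value used for rendering; otherwise viewers that honour the attribute will show the image at the wrong physical size.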

iTextSharp PDF Reading highlighted text (highlight annotations) using C#

I am developing a C# WinForms application that converts PDF contents to text. All the required content is extracted except the text that is highlighted in the PDF.
Please help me with a working sample that extracts the highlighted text from a PDF.
I am using iTextSharp.dll in the project.
Assuming that you're talking about comments (annotations), please try this:
for (int i = pageFrom; i <= pageTo; i++)
{
PdfDictionary page = reader.GetPageN(i);
PdfArray annots = page.GetAsArray(iTextSharp.text.pdf.PdfName.ANNOTS);
if (annots != null)
foreach (PdfObject annot in annots.ArrayList)
{
PdfDictionary annotation = (PdfDictionary)PdfReader.GetPdfObject(annot);
PdfString contents = annotation.GetAsString(PdfName.CONTENTS);
// now use the String value of contents
}
}
This is written from memory (I'm a Java developer, not a C# developer).

Hyperlink Detection from PDF

I have some PDFs containing hyperlinks, both URLs and mailto links. Is there any way or tool (maybe third-party) to extract the hyperlink meta information from the PDF, such as coordinates, link type, and destination address? Any help is highly appreciated.
I have already tried iText and PDFBox without much success, and even some third-party software does not give me the desired output.
I have tried the following code in Java using iText:
PdfReader myReader = new PdfReader("pdf File Path");
PdfDictionary pageDict = myReader.getPageN(1);
PdfArray annots = pageDict.getAsArray(PdfName.ANNOTS);
System.out.println(annots);
ArrayList<String> dests = new ArrayList<String>();
if(annots != null)
{
for(int i=0; i<annots.size(); ++i)
{
PdfDictionary annotDict = annots.getAsDict(i);
PdfName subType = annotDict.getAsName(PdfName.SUBTYPE);
if (subType != null && PdfName.LINK.equals(subType))
{
PdfDictionary action = annotDict.getAsDict(PdfName.A);
if(action != null && PdfName.URI.equals(action.getAsName(PdfName.S)))
{
dests.add(action.getAsString(PdfName.URI).toString());
} // else { it's an internal link }
}
}
}
System.out.println(dests);
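Once the destination strings are collected, the link type asked about above (URL versus mailto) can be derived from the URI scheme. A minimal sketch using only `java.net.URI` from the standard library; the `LinkType` enum and the `classify` method are hypothetical names introduced for the illustration:

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

/** Classifies link destinations by their URI scheme. */
final class LinkTypeSketch {

    enum LinkType { WEB, MAIL, OTHER, INVALID }

    /** Returns the link type for one destination string taken from a link action. */
    static LinkType classify(String destination) {
        try {
            String scheme = new URI(destination.trim()).getScheme();
            if (scheme == null) {
                return LinkType.OTHER; // relative reference without a scheme
            }
            switch (scheme.toLowerCase(Locale.ROOT)) {
                case "http":
                case "https":
                    return LinkType.WEB;
                case "mailto":
                    return LinkType.MAIL;
                default:
                    return LinkType.OTHER;
            }
        } catch (URISyntaxException e) {
            return LinkType.INVALID; // not parseable as a URI at all
        }
    }

    public static void main(String[] args) {
        // Example destinations; in practice these would be the collected strings.
        List<String> dests = Arrays.asList(
                "https://example.com/doc.pdf",
                "mailto:someone@example.com",
                "ftp://example.com/file");
        for (String dest : dests) {
            System.out.println(classify(dest) + " : " + dest);
        }
    }
}
```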
You can use Docotic.Pdf library for links extraction (disclaimer: I work for the company).
Below is code that opens the specified file, finds all hyperlinks, collects information about the position of each link, and draws a rectangle around each link.
After that the code creates new PDF (with links in rectangles) and a text file with collected information. In the end, both created files are opened in default viewers.
public static void ListAndHighlightLinks(string inputFile, string outputFile, string outputTxt)
{
using (PdfDocument doc = new PdfDocument(inputFile))
{
StringBuilder sb = new StringBuilder();
for (int i = 0; i < doc.Pages.Count; i++)
{
PdfPage page = doc.Pages[i];
foreach (PdfWidget widget in page.Widgets)
{
PdfActionArea actionArea = widget as PdfActionArea;
if (actionArea == null)
continue;
PdfUriAction linkAction = actionArea.Action as PdfUriAction;
if (linkAction == null)
continue;
Uri url = linkAction.Uri;
PdfRectangle rect = actionArea.BoundingBox;
// add information about found link into string buffer
sb.Append("Page ");
sb.Append(i.ToString());
sb.Append(" : ");
sb.Append(rect.ToString());
sb.Append(" ");
sb.AppendLine(url.ToString());
// draw rectangle around found link
page.Canvas.DrawRectangle(rect);
}
}
// save document with highlighted links and text information about links to files
doc.Save(outputFile);
System.IO.File.WriteAllText(outputTxt, sb.ToString());
// open created PDF and text file in default viewers
System.Diagnostics.Process.Start(outputTxt);
System.Diagnostics.Process.Start(outputFile);
}
}
You can use the sample code with a call like this:
ListAndHighlightLinks("input.pdf", "output.pdf", "links.txt");
If your PDFs are copy-protected, you need to start with step 1; if they're free to copy, you can start with step 2.
Step 1: convert your PDFs into Word .doc files using Adobe Acrobat Pro or an online PDF-to-Word converter:
http://www.pdfonline.com/pdf2word/index.asp
Step 2: copy and paste the whole document into the input window here (you can also download the lightweight HTML tool):
http://www.surf7.net/services/value-added-services/free-web-tools/email-extractor-lite/
Select 'url' as 'Type of address to extract', select your separator, hit extract, and that's it.
Hope it works. Cheers.
One possibility would be a custom JavaScript in Acrobat that enumerates the "words" on the page and then reads out their Quads. From those you get the coordinates to create a link (or to compare with the links on the page), as well as the actual text (that's the "word(s)").
If it is "only" about setting the border of the existing links, you can also use another Acrobat JavaScript that enumerates the links of the document and sets their border color property (you may need to set the width as well).
(If you prefer "buy" over "make", feel free to contact me in private; such things are part of my standard "repertoire".)