iText: why would adding an image cause text to appear fuzzy in PDF?

I'm using iText with Java to create a PDF file. I'm trying to place a paragraph on the left and float an image on the right (i.e. next to each other). The following code does insert the image, but it also makes the text fuzzy on the entire page (other pages are fine).
// add image
Image img = Image.getInstance(imgPath);
img.setAlignment(Image.RIGHT | Image.TEXTWRAP);
img.scaleToFit(1000, 72f); // 1" height
//img.setSpacingBefore(0f); // does not have any effect
document.add(img);
// add text
Paragraph par = new Paragraph("some text here", styleBody);
par.setSpacingBefore(20f);
document.add(par);
If I remove the image portion of the code, the text looks clean. This is my first attempt at adding an image next to text. Must be doing something obviously wrong. Any idea what could cause this?

I was able to solve this problem. The code above is perfectly fine. The problem was I was using a PNG image with transparency. When I removed the transparency (by re-exporting the image from Illustrator with transparency turned off), I was able to create PDFs with clear text.
I think the transparency forces the PDF page to be written in a CMYK color space rather than RGB, which perhaps causes this issue.
Hope this helps someone else. I searched everywhere but couldn't find any leads talking about fuzzy text in iText.
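If re-exporting the image from the drawing program is not an option, the alpha channel could probably also be dropped in Java before the file is handed to iText. Below is a minimal sketch using only the JDK (javax.imageio). The class name PngFlattener, the white background, and the command-line file arguments are my own assumptions for illustration, not something from the setup above.

```java
import java.awt.Color;
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import javax.imageio.ImageIO;

/** Sketch: flatten a transparent image onto white so the result has no alpha channel. */
class PngFlattener {

    /** Returns an opaque RGB copy of the source, composited over a white background. */
    static BufferedImage flatten(BufferedImage source) {
        BufferedImage opaque = new BufferedImage(
                source.getWidth(), source.getHeight(), BufferedImage.TYPE_INT_RGB);
        Graphics2D g = opaque.createGraphics();
        try {
            // Paint the background first, then draw the (possibly transparent) image on top.
            g.setColor(Color.WHITE);
            g.fillRect(0, 0, opaque.getWidth(), opaque.getHeight());
            g.drawImage(source, 0, 0, null);
        } finally {
            g.dispose();
        }
        return opaque;
    }

    public static void main(String[] args) throws IOException {
        // args[0] = input PNG with transparency, args[1] = output PNG (hypothetical paths).
        BufferedImage in = ImageIO.read(new File(args[0]));
        if (in == null) {
            throw new IOException("Not a readable image: " + args[0]);
        }
        ImageIO.write(flatten(in), "png", new File(args[1]));
    }
}
```

The flattened file can then be passed to Image.getInstance(...) exactly as in the code from the question.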

Related

How can I set the line height of a multi-line PDF form field, and save it so it doesn't get reset by filling it?

I am having the exact same issue described in this question: Multiline pdf text box
I have a PDF that has some dotted lines that I want to convert into a fillable multi-line field. I tried the solution in the linked question, but my setting is not staying when I try to fill in the field outside of Acrobat.
When I am preparing the form inside Acrobat, I set the line height to 30 and it is lining up fine:
But when I save this PDF and then try to fill in the field outside of Acrobat, the line height setting does not stay. It gets reset every time:
It's super frustrating and I have scoured the internet looking for an answer but I have nothing yet. If someone knows what to do to get the line height looking like the first screenshot, please save my sanity.
I'm using Adobe Acrobat Pro DC 2021.001.20135 on macOS 10.14.6.
Thank you
You can't. Those settings don't "stick" when the field is cleared and there's no way to set them programmatically. It's best to simply remove the lines from the PDF.
As @joelgeraci stated, the settings don't stick.
However, if the form has to be manually fillable, removing the writing lines may not be the best idea. In this situation, it would be better to change the field's background color. When the field has no content, its background color is transparent, otherwise white. And that will cover the writing lines.

qpdf - replace text in existing PDF file

This is the first time I'm working with PDFs on this level, so please be patient with
my noob question. I understand the logical and physical structure of a PDF file
on a basic level.
I have a PDF that contains a dummy ID that needs to be replaced. To check whether there
is a way to do this, I used qpdf to expand the PDF using
qpdf --qdf --object-streams=disable orig.pdf expanded.pdf
Using a hex editor I located the dummy ID in expanded.pdf and changed the value by
simply swapping two digits
<001800180017> Tj => <001700170018> Tj
and saved it. Opening expanded.pdf in Acrobat didn't show the modification. The original
ID 443 is still rendered, but searching for "443" doesn't find it. When searching for
"334", the modified content, I get the rendered original ID 443 highlighted.
The PDF consists of text and vector graphics. When I insert additional digits (which obviously
invalidates the offsets in the xref), I get an error message regarding a missing font and
all digits are shown as dots, but the vector graphics are still in place. This seems to indicate
that the ID is not part of the graphics.
What did I miss?
EDIT 1:
After mkl's comment, I did a deeper analysis of my PDF and found that, besides the obvious graphic content, all text was rendered by a series of m/l/c commands followed by a BT/ET section. Color for stroke and non-stroke was 0,0,0 for both in the BT/ET section.
Is this because of the used embedded non-standard font?
Are PDFs with embedded fonts usually done this way? A graphics part for the visual representation and a transparent (hidden) text part just to get searching and highlighting capabilities?
Looking back, I wonder what I did to get the dots when I first modified the
content. It seems impossible and I can't reproduce it either.
Thanks
Tom
First off, the following is merely guesswork as you could not share the pdf in question. Educated guesswork but guesswork nonetheless.
You report that you changed the value by simply swapping two digits in the text drawing instruction argument and now can successfully search for the value with swapped digits but that Acrobat didn't show the modification.
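To make the swap concrete, here is a small sketch that splits the hex string operand of a Tj instruction into character codes. It assumes the font uses 2-byte codes, which the four hex digits per ID digit suggest but which only the font dictionary in the actual file can confirm. The class and method names are made up for this illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: read a PDF hex string operand as a sequence of 2-byte character codes. */
class HexOperand {

    /** Splits e.g. "<001800180017>" into its 2-byte codes (assumption: 2 bytes per code). */
    static List<Integer> twoByteCodes(String operand) {
        // Drop the angle brackets and any whitespace, which PDF allows inside hex strings.
        String hex = operand.replaceAll("[<>\\s]", "");
        if (hex.length() % 4 != 0) {
            throw new IllegalArgumentException("Expected a multiple of 4 hex digits: " + operand);
        }
        List<Integer> codes = new ArrayList<>();
        for (int i = 0; i < hex.length(); i += 4) {
            codes.add(Integer.parseInt(hex.substring(i, i + 4), 16));
        }
        return codes;
    }

    public static void main(String[] args) {
        System.out.println(twoByteCodes("<001800180017>")); // prints [24, 24, 23]
        System.out.println(twoByteCodes("<001700170018>")); // prints [23, 23, 24]
    }
}
```

So the edit replaced the code sequence 24, 24, 23 by 23, 23, 24, which matches the search result changing from "443" to "334" while the drawn appearance stayed the same.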
Furthermore you observed that all text was rendered by a series of m/l/c commands followed by a BT/ET section.
The main situation in which one observes text being rendered as arbitrary vector graphics (a series of m/l/c commands) is in pdfs in which the producer didn't want text extraction to be possible and replaced text drawing instructions by arbitrary vector graphics instructions.
This apparently is not the case in your pdf as the text drawing instructions are not replaced but merely supplemented by the vector graphics ones.
Supposing that this construct is used for a reason and not by accident, I can only assume that the pdf producer was not willing or allowed to embed the font in question but wanted the specific font appearance to be displayed without having to count on the font being installed on the computer the pdf is viewed on.
Thus, the text appearance is drawn using arbitrary vector graphics instructions and the following text drawing instructions actually draw nothing but merely make the text searchable and extractable. This way there is no need to embed the apparent font face as font program. (Text drawing instructions can be made to draw nothing either by using a font with all blank glyphs or by using the text rendering mode "invisible".)
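For illustration, a text object using the "invisible" rendering mode could look like what the following sketch generates. The operators (BT, Tf, Tr, Td, Tj, ET) are standard PDF content stream operators; the font resource name F1, the size, and the position are placeholders I made up, and I cannot tell whether your file uses this variant or the blank-glyph one.

```java
/** Sketch: build a text object that is searchable but paints nothing (text rendering mode 3). */
class InvisibleTextSketch {

    /**
     * Returns content stream instructions for one invisible text string.
     * All parameter values are supplied by the caller; nothing here is read from a real PDF.
     */
    static String invisibleText(String fontResource, int fontSize, int x, int y, String hexOperand) {
        return String.join("\n",
                "BT",                                      // begin text object
                "/" + fontResource + " " + fontSize + " Tf", // select font and size
                "3 Tr",                                    // rendering mode 3: neither fill nor stroke
                x + " " + y + " Td",                       // move to the text position
                hexOperand + " Tj",                        // "show" the string without marking the page
                "ET") + "\n";                              // end text object
    }

    public static void main(String[] args) {
        System.out.print(invisibleText("F1", 12, 100, 700, "<001800180017>"));
    }
}
```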
If this assumption turns out to be correct, your task to replace the dummy id requires not merely editing the arguments of the text drawing instructions but also replacing the arbitrary vector graphics instructions showing the dummy id appearance by other instructions showing the actual id.
If you happen to have the font in question and are willing and able to embed it, you can actually replace the arbitrary vector graphics instructions by text drawing instructions using the font. Otherwise be prepared to also draw the actual id as arbitrary vector graphics.

PDF with OCR text visible, how to hide it from existing PDF

I have several PDF files that have been OCR-processed (not by me). They contain both the scanned image and the OCR text. They seem to work fine in some viewers (iPhone/iPad), but not in others (Preview.app on macOS) which makes them somewhat awkward to read.
From googling around, it seems that the text & image may be layered incorrectly or there is a problem with the fonts used? I'm not even sure I'm using the correct vocabulary, as most hits I get are worthless.
Is it possible to use ghostscript or something to batch-fix these files?
Example of "bad" rendering:
It's impossible to say what's wrong with the PDF file (or viewer) without seeing the PDF file, which also makes it hard to propose solutions!
You could certainly run the file through Ghostscript to the pdfwrite device, and use the -dFILTERTEXT switch to not process the text. The resulting document would therefore not contain the offending text, but would still contain the image.
Of course, this would then not be possible to search or highlight.
You could instead use -dFILTERIMAGE which would remove the original image leaving the text behind. But then anything in the original document which was not text would now be missing.
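Since the question asks about batch processing, here is a rough sketch of how the text-filtering run could be driven from Java for several files. It assumes the Ghostscript executable is called gs and is on the PATH (on Windows it is typically gswin64c), and the "-notext.pdf" output suffix is an arbitrary choice of mine. It does not deal with how images get recompressed.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

/** Sketch: run Ghostscript's pdfwrite device with -dFILTERTEXT over a list of PDFs. */
class GhostscriptTextFilter {

    /** Builds the argument list for one Ghostscript run that drops all text. */
    static List<String> stripTextCommand(String gsExecutable, String input, String output) {
        return Arrays.asList(
                gsExecutable,
                "-dBATCH", "-dNOPAUSE",   // run non-interactively and exit when done
                "-sDEVICE=pdfwrite",      // write a new PDF
                "-dFILTERTEXT",           // skip text; -dFILTERIMAGE would drop images instead
                "-sOutputFile=" + output,
                input);
    }

    public static void main(String[] args) throws IOException, InterruptedException {
        // Each command-line argument is treated as one input PDF.
        for (String input : args) {
            String output = input.replaceAll("(?i)\\.pdf$", "") + "-notext.pdf";
            Process p = new ProcessBuilder(stripTextCommand("gs", input, output))
                    .inheritIO()
                    .start();
            int exit = p.waitFor();
            if (exit != 0) {
                System.err.println("Ghostscript failed for " + input + " (exit code " + exit + ")");
            }
        }
    }
}
```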
The usual 'best practice' is to have the text drawn in rendering mode 3, which makes no marks. This allows you to see the original image without the OCR'ed text interfering. It's possible that the viewer you are using is not honouring the text rendering mode, which would be a (fairly serious) bug in the viewer. The most recent versions of macOS seem to have some nasty bugs in the Quartz PDF rendering engine.
The other way to do this is to draw the text first, then put the original image on top of it, but that's hard to get wrong. I suspect it's more likely the text rendering mode.
EDIT
The PDF file first draws the text, then draws the image on top of the text. The underlying text should not appear. mkl is quite correct in his comment.
The correct way to fix this is to fix the consumer which is rendering it incorrectly. As I mentioned above the latest version of Quartz seems to have some fairly serious bugs, you might choose to raise this as a bug with Apple.
The only other solution would be to run this through something which will remove the text. Ghostscript can do this but there are implications; firstly it will no longer be possible to search/copy/paste text from the document. Secondly you would need to run quite a complex command line in order to prevent the decompressed JPX images being recompressed as JPEG, which would probably result in compromised quality. Finally the resulting file size would be larger.

move PDF content using PDFBox

I need to be able to specify a rectangular area on a PDF page and move the text and graphic content of that area to a new location on the same page using PDFBox. Any graphics (lines, pictures, etc) will each move as a whole unit if selected in the area.
The PDF documents being modified originate as text based PCL and are converted to PDF using a third party tool. I can answer technical questions about these documents if needed.
This Stack Overflow question is exactly what I am after but that question seems to have been abandoned before a working solution was found?
I would bounty this question if I had a few more reputation points.
If you can help with any aspect of this issue I would appreciate your assistance, thank you.
I'm not as familiar with PDFBox as I should be but any library should be able to do the following; I know the one I represent can.
Create a new blank page that's the same size as your original. Copy the content of the original to an XObject and apply that to the blank page. Add a white rectangle to the page to obscure the rectangle in question. Clip the content of the original page to the rectangle you want to "move". Create a second XObject from that. Apply it to the new page in the position you want.
If PDFBox is capable of it, Sanitize the new page to remove the hidden content under the white box.
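In terms of raw page content, the steps above boil down to roughly the instructions generated by the sketch below. This is library-neutral and only builds the operator text; the form XObject name (Orig in the example) is a hypothetical resource name, and creating that XObject from the original page and registering it in the new page's resources still has to be done with PDFBox or whatever library is used.

```java
import java.util.Locale;

/** Sketch: content stream that redraws a page and moves one rectangular region by (dx, dy). */
class MoveRegionSketch {

    /** Formats a number the way PDF expects it (dot as decimal separator). */
    private static String n(double v) {
        return String.format(Locale.ROOT, "%.2f", v);
    }

    /**
     * form: name of the form XObject holding the original page content (hypothetical).
     * x, y, w, h: source rectangle in page coordinates; dx, dy: offset to move it by.
     */
    static String contentStream(String form, double x, double y, double w, double h,
                                double dx, double dy) {
        String rect = n(x) + " " + n(y) + " " + n(w) + " " + n(h) + " re";
        StringBuilder sb = new StringBuilder();
        // 1. Draw the complete original page.
        sb.append("q /").append(form).append(" Do Q\n");
        // 2. Cover the source rectangle with white.
        sb.append("q 1 1 1 rg ").append(rect).append(" f Q\n");
        // 3. Translate, clip to the source rectangle (now shifted), and draw the page again.
        sb.append("q 1 0 0 1 ").append(n(dx)).append(' ').append(n(dy)).append(" cm ")
          .append(rect).append(" W n /").append(form).append(" Do Q\n");
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(contentStream("Orig", 10, 20, 30, 40, 5, -5));
    }
}
```

Because the clip path is set after the translation, the same rectangle coordinates select the source region of the form while the result appears at the shifted position. Anything that only partly lies inside the rectangle is cut at its border, so this does not by itself move graphics "as a whole unit".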

False dots around circles in pdf export of libreoffice draw

When I draw a small circle in LibreOffice Draw and export it to PDF, I get some extra dots around the circle, especially at the upper left and lower right outer corners of the circle.
See example PDF here: https://dl.dropbox.com/u/233922/example-dots-circle.pdf
or as a Screenshot here:
Do you have any idea how I can get rid of these?
It is an old bug that has not been fixed yet. I can reproduce it under Linux and Windows. My version: LibreOffice 4.1.0.
1. Create a new file in LO Impress or LO Draw.
2. Draw an ellipse (or rounded rectangle, or smiley, etc.).
3. Set the line width to e.g. 5 mm (for a better view).
4. Export as PDF.
I propose two workarounds:
1. Export to MS PowerPoint and export to PDF from there :/
2. Print to PDF (using e.g. cups-pdf).
ad 1) You must have MS PowerPoint, and your graphics may look bad.
ad 2) I use cups-pdf and the PDF looks very good, but:
Text is stored as bitmap graphics (small rectangles)! You cannot extract text without using OCR.
You must use a paper format from the list (A4, A0, Letter etc.). If you use a non-standard paper format, you must pick a bigger format and you get white bars in the PDF. However, you can use pdfcrop to remove the white bars.
The PDF is always oriented horizontally. If you print in portrait orientation, you can rotate the PDF using the pdf270 command-line tool.
In Adobe Reader (version 11 at least) -> Go to "Preferences" => "Page Display" => uncheck "Enhance thin lines"
LibreOffice seems to add dots of size 0 that are practically invisible. When "Enhance thin lines" is checked, Adobe Reader makes these dots visible.
Best wishes,
Patrick
Similar to dzwiedziu-nkg's answer (https://stackoverflow.com/users/1797782/dzwiedziu-nkg), I need a multi-step process to fix this issue.
Steps:
Open the file in a pdf viewer (Document Viewer for me in Ubuntu.)
Print the pdf to a file (also a pdf) from the viewer. I assume this also uses cups-pdf, as it modifies the image size. (I don't mind, because I use the next step to eliminate all margins anyways.)
Use pdfcrop to remove all the extra space around the actual content's bounding box. If you just give pdfcrop one argument, it doesn't overwrite the old file, so use the same argument twice:
$ pdfcrop monkey.pdf monkey.pdf
Another "workaround" that worked for me:
Go without outline. You can set the line style in Draw to "none" and just work with flat solid objects.
PS: I see these dots also in Draw, not just in the exported pdf.
A simple workaround is to "patch" the dot in LibreOffice Draw using a white object, say, a square with white area and white outline. Note that you cannot see the dot in Draw. So you first generate the PDF with the original drawing, see where the dot appears in the PDF, go back to Draw, and add a white patch where it is required.
Searching for a workaround myself, I've found this awk script called odg2epsfix that will fix the exported EPS to not contain those ghost dots anymore.
I stumbled upon it in this launchpad bug entry.
Fixed in LibreOffice pre-export.
Steps:
Right click on the circle in LibreOffice and select "Line"
On the "Line" page, set "Corner Style" to "-none-"
Save document and Export as PDF.
The dot is gone without removing line enhance. Mine still shows in preview but doesn't print.
The bug is still present in LO 6.0. But if you set "Cap style" to "flat" in the "Line" tab of the "Graphic Styles", the dots disappear from the screen and from the exported pdf.