Well, after spending quite some time playing with this, I think I may have found a solution soon after I posted the question. Bit embarrassing but hey, it seems to work now.
I don't want to delete the question yet, in case someone wants the details. Basically, I retried approach (c), but this time the XML I put back used an embedded image. So it seems I can change the XFA template master page using iText after all. More testing to follow ...
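For anyone who wants to try the same thing, here is a minimal sketch of the XML edit using only the JDK's DOM API. It appends a draw element holding a base64-embedded image to every pageArea (master page) of an XFA template. The class and method names, the position and size values, and the exact child structure are assumptions for illustration; the XML that LiveCycle writes may carry extra children and attributes, so compare against your own diff before relying on it.

```java
import java.io.StringReader;
import java.util.Base64;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of iText or LiveCycle.
final class XfaWatermarkSketch {

    /** Parses XML with namespace awareness, as the iText snippet in (c) does. */
    static Document parse(String xml) {
        try {
            DocumentBuilderFactory fact = DocumentBuilderFactory.newInstance();
            fact.setNamespaceAware(true);
            return fact.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    /** Adds an embedded-image draw to each pageArea; returns how many were touched. */
    static int addEmbeddedImage(Document xfa, byte[] imageBytes, String contentType) {
        NodeList pageAreas = xfa.getElementsByTagNameNS("*", "pageArea");
        String base64 = Base64.getEncoder().encodeToString(imageBytes);
        for (int i = 0; i < pageAreas.getLength(); i++) {
            Element pageArea = (Element) pageAreas.item(i);
            String ns = pageArea.getNamespaceURI(); // reuse the template namespace

            Element draw = xfa.createElementNS(ns, "draw");
            draw.setAttribute("name", "watermark"); // assumed name
            draw.setAttribute("x", "20mm");         // placeholder geometry
            draw.setAttribute("y", "20mm");
            draw.setAttribute("w", "100mm");
            draw.setAttribute("h", "50mm");

            Element value = xfa.createElementNS(ns, "value");
            Element image = xfa.createElementNS(ns, "image");
            image.setAttribute("contentType", contentType);
            image.setTextContent(base64);           // embedded, not linked via href

            value.appendChild(image);
            draw.appendChild(value);
            pageArea.appendChild(draw);
        }
        return pageAreas.getLength();
    }

    public static void main(String[] args) {
        Document doc = parse(
            "<template xmlns=\"http://www.xfa.org/schema/xfa-template/3.0/\">"
            + "<subform><pageSet><pageArea name=\"Page1\"/></pageSet></subform></template>");
        System.out.println("pageAreas updated: " + addEmbeddedImage(doc, new byte[] {1, 2, 3}, "image/png"));
    }
}
```

The resulting Document could then be handed to xfa.setDomDocument(doc) as in the code under (c).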
=========================
The problem: Apply watermark to a PDF XFA (dynamic or not). At the end the PDF should still be an XFA and should have all its security settings intact. I don't have control over the PDFs coming in.
The question: Can I use iText 5 to do that? If yes, is it via PdfStamper.getUnderContent(), via XfaForm.setXfa(), or something else? Neither of the first two has worked for me so far.
I wonder if it's some XFA detail that I am missing (i.e. when I try to replace the XML), not sure how the XFA is actually protected from changes. Do I need to generate some UUID, encrypt some things, something about signatures ...
Btw, if I take the PDF generated by iText after replacing the XML, then open and save it in LC, the watermark shows.
thanks,
Cristian
========================
Anyway, that's the short version of my question. If you think you can help and/or are interested in more details ...
I know this or similar questions have come up before, but what I have tried so far didn't work for me. I also admit I am not an iText, XFA or PDF standard expert. I browsed a number of forum posts, the iText book and the specs; no luck yet.
The PDF input has no regular usage rights, no security. When opened in Acrobat it will show the restrictions on Changing Document/Document Assembly/Adding templates but I have a feeling all XFAs have that. Nothing comes back from the following ...
System.out.println(" permissions " + reader.getPermissions());
System.out.println(" usage rights " + reader.hasUsageRights());
System.out.println(" viewer pref" + reader.getSimpleViewerPreferences());
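In case it helps to read the permissions integer, here is a small sketch that decodes it bit by bit. It assumes the value follows the /P bit layout of the PDF standard security handler; the class name and labels are made up for illustration, and for an unencrypted file the value may not carry any meaning.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: decodes a PDF /P permissions integer into labels.
final class PdfPermissionBits {

    // Bit masks per the PDF standard security handler (bits 3-6 and 9-12).
    private static final int[] MASKS = {4, 8, 16, 32, 256, 512, 1024, 2048};
    private static final String[] NAMES = {
        "print",
        "modify contents",
        "copy",
        "modify annotations",
        "fill in forms",
        "extract for accessibility",
        "assemble document",
        "print high quality"
    };

    /** Returns the label of every permission bit that is set in p. */
    static List<String> describe(int p) {
        List<String> granted = new ArrayList<>();
        for (int i = 0; i < MASKS.length; i++) {
            if ((p & MASKS[i]) != 0) {
                granted.add(NAMES[i]);
            }
        }
        return granted;
    }

    public static void main(String[] args) {
        // Example input only; pass reader.getPermissions() in real use.
        System.out.println(describe(4 | 1024));
    }
}
```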
Here are some of the things I have been through:
a) doing it in LC (https://forums.adobe.com/thread/496558)
This works. If I try to place an image on the master page, then it shows as watermark on all pages when saved from LC.
b) trying to write under using iText (https://sourceforge.net/p/itext/mailman/message/17225398/)
I found a post from '07 suggesting using pdfStamper.getUnderContent(). The thread does not say whether it eventually worked for the person asking.
For me, the code works for a simple PDF, but not for XFA
PdfContentByte under = pdfStamper.getUnderContent(1);
under.beginText();
BaseFont FONT = BaseFont.createFont("c:/windows/fonts/times.ttf", BaseFont.WINANSI, BaseFont.EMBEDDED);
under.setFontAndSize(FONT, 40);
under.showTextAligned(Element.ALIGN_CENTER, "TEST_TEXT", 200, 600, 45);
under.endText();
c) generate the XML and replace it using iText (Some pdf file watermark does not show using iText)
After reading the post above and section 8.6 of iText in Action, it seemed this was the correct path, so:
I create a simple XFA using LiveCycle
Save as xfa1.pdf and extract xml (using iText) to xfa1.xml
in LC add an image to the masterpage and save as xfa2.pdf
open xfa2.pdf and notice the watermark is present
extract XML from xfa2 to xfa2.xml and compare to xfa1.xml - notice the image element
either place the image element in original XML or use xfa2.xml and replace the XFA in original xfa1.pdf using iText, let's call it xfa3_itext;
Relevant code:
XfaForm xfa = new XfaForm(reader);
DocumentBuilderFactory fact = DocumentBuilderFactory.newInstance();
fact.setNamespaceAware(true);
DocumentBuilder db = fact.newDocumentBuilder();
Document doc = db.parse(new FileInputStream(xml));
xfa.setDomDocument(doc);
xfa.setChanged(true);
XfaForm.setXfa(xfa, stamper.getReader(), stamper.getWriter());
I have tried with a sample PDF that was provided by a customer and with a simple XFA form created in LiveCycle Designer, no luck. If I open xfa3_itext in LiveCycle it does show the watermark, and if I save it again from LiveCycle as a dynamic XFA PDF, the new PDF shows the watermark.
phew, that is a long post ... sorry.
thanks for reading and for any feedback
I am creating a separate question, stemming from this one. The used code is almost the same. The reason is that the original problem was about subsetting a font with pdfbox, which I kind of dealt with. I got faced though with another problem, which is : the annotations, and how the fonts used in them are interpreted by particularly Acrobat Reader DC.
I tried different combinations of fonts and embedding options and got rather desperate. I had a feeling that the way the programs that interpret PDF files handle these things is non-standard. I think I read somewhere that the PDF format deliberately leaves the display of annotations non-standardized, to give viewers freedom to handle them in their own way, since the main purpose of annotations is interaction with the user. TL;DR: I cannot understand why Acrobat Reader DC doesn't like the annotations I have created and saved with PDFBOX. I even opened a question on Adobe's friendly and helpful User Community forum. But as I expected, someone suggested that I take this question to the PDFBOX team instead.
Everything is possible, but rather than writing to the PDFBOX mailing list (I could never get used to mailing lists or figure out how to use them efficiently, btw), I want to open a question here because I hope that it could help others to understand the PDF format better.
I am basically rephrasing the above question from the Adobe forums here: Here is an example (Google Drive link) with FreeText annotations (but it seems to make no difference if I use Stamp annotations instead). It causes problems when opened by Adobe Acrobat Reader DC (file) version 21.001.20149.37945 (I think this corresponds to the April 16th '21 update). Specifically, the problem happens when the Comments pane is opened by the user, either manually or automatically.
While experimenting, I also tried to unset the "Use local fonts" option in Preferences -> Page Display. I had the impression that maybe Acrobat Reader will be more eager to show the error message once it is not allowed to substitute the erroneously embedded fonts with the possible local fonts. I am not sure if this is true.
The error that I get is the infamous "Cannot extract the embedded font XXXXXX+SomeFontName".
The same problems happen also if I use full font embed (subsetting option set to false when using PDType0Font.load). I also tried to embed OpenSans font instead of LiberationSans, also tried to manually convert LiberationSans to a TTF font with fewer glyphs using FontForge, even tried to use Windows ARIALN.TTF, thinking that maybe the font is the problem. All cause the same behavior in Acrobat Reader DC. I have also tried to run Acrobat Reader 2019 Pro Preflight on the document and in the profile that scans the document for the possible font inconsistencies, it reports no errors.
Of course, when I use e.g. PDType1Font.HELVETICA instead of custom TTF font, I do not get the above errors. But I cannot use it because it does not contain the glyphs for the Unicode characters that I use. Does anybody have a better idea?
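To illustrate the glyph limitation, here is a rough check that needs only the JDK. It assumes the standard 14 fonts are limited to an 8-bit encoding such as WinAnsiEncoding, and approximates that encoding with the windows-1252 charset, which is close but not guaranteed to be identical. The class and method names are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

// Hypothetical helper: tests whether text fits an 8-bit Western encoding.
final class WinAnsiCheck {

    /** True if every character of text can be encoded in windows-1252. */
    static boolean fitsWinAnsi(String text) {
        CharsetEncoder encoder = Charset.forName("windows-1252").newEncoder();
        return encoder.canEncode(text);
    }

    public static void main(String[] args) {
        // Plain Latin text versus four Greek letters (alpha to delta).
        System.out.println(fitsWinAnsi("Hello"));
        System.out.println(fitsWinAnsi("\u03b1\u03b2\u03b3\u03b4"));
    }
}
```

Text that fails this check is a hint that a Unicode-capable font (for example one loaded via PDType0Font.load) is needed instead of a standard 14 font.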
Thank you very much!
EDIT: to make myself clear, the error does not ALWAYS appear. It appears constantly on some machines (e.g. I am using Windows 7 64-bit with the latest Acrobat Reader DC installed to reproduce it fairly well), while on my Windows 10 64-bit with the same version of Acrobat Reader DC it sometimes appears and sometimes not. I haven't figured out why or in what cases. That made me suspect the font itself, but I checked that too: the font I am using opens fine on the machine where the problem is fairly constant.
UPDATE: at my wits' end again, I created a blank page with Apache OpenOffice, exported it to PDF, opened it with Acrobat Reader DC (latest version), added a FreeTextTypewriter annotation (View -> Tools -> Comment -> Open) with 4 Greek letters in the ArialNarrow font, saved it, reopened it with Acrobat Reader DC, and it gives me the same error (cannot extract the embedded font...). So this could be a Reader problem? But they made this so difficult to diagnose. Here is the file, but I do not expect it to show errors on other machines. It's one of those moments when you start to believe in magic and the power of prayer (and a good sleep).
UPDATE 30/04/2021
So, to sum things up, I haven't come up with a solution yet, but I came up with three files created with PDFBOX, OpenPDF (an iText 5 fork) and Acrobat Reader DC itself (it can append annotations and save; I just added a simple text box with Greek text through the Comment pane), and they all produce the above error message when opened by Acrobat Reader DC. I have posted details in the Acrobat Reader forum here (same link as in the comment).
I have added the code that I used to create the OpenPDF example file here, and the 3 example files are in the same repository here.
I am trying to print a section of an existing pdf to a new pdf. The original is searchable and selectable but the new pdf cannot do either. I am using "adobe acrobat reader DC" and print via "Microsoft Print to PDF". Unsure if there is any other relevant information.
After searching for a period of time I could not find an answer that allows for direct PDF to PDF print.
I did find a workaround however.
I downloaded a free program called PrimoPDF. Once installed, PrimoPDF becomes a printer option within Adobe Acrobat Reader. I then selected my desired pages and printed to PrimoPDF instead of Microsoft Print to PDF. This generated a .ps file. I then imported the .ps file into the PrimoPDF application and was able to generate a .pdf from that. The newly generated PDF was searchable and selectable and exactly what I needed.
Hopefully someone else finds this useful in the future.
Generally, refrying (printing to PostScript and then converting back to PDF) is a bad idea. Microsoft Print to PDF created a file that wasn't searchable because, when Adobe Reader detects that the target printer can't render the PDF correctly (for any number of reasons, such as missing fonts), it renders the PDF itself and sends an image to the printer. A simpler PDF probably would have worked just fine.
You are much better off getting a tool that will simply allow you to extract the pages you need to a new file rather than printing.
I have access to a scanner at my library which can create "searchable PDFs." These are PDFs that show the exact image of a scanned document, but there is a kind of hidden text in the PDF that can be selected when you try to select a portion of the image that contains text. In this way you can copy and paste text or search for text in the scanned document. This is VERY useful. It's an awesome improvement over raw scanned images. I also have several apps on my mac that can create this kind of searchable PDF from a scanned document or a raw image.
Now it's obvious to anyone who has ever used OCR that the process of converting images to text is not 100% accurate, so the text that you search or copy will not be correct in some places.
So I searched for quite some time to find an application that would load a searchable PDF and allow me to repair the hidden searchable text without reformatting or modifying the original scanned image.
Does anyone know of a tool (or library API) that would allow this?
It's worth saying here that I tried the latest version of Adobe Acrobat DC for Mac, and it doesn't seem to even allow me to view the hidden searchable text, much less edit it. It does allow me to replace the scanned image with the results of its own OCR process so that I could edit and save the document. But this would produce horrible results for any of the scanned documents that I am using. It seems designed for editing a "native PDF", not for editing a scanned document.
I have also tried ABBYY FineReader with no luck.
I'm using ABBYY FineReader 12 Professional (not open source).
Just open a scanned image or scanned PDF and press Verify Text (or Ctrl + F7), then go over all the spelling errors or low-confidence characters and fix them.
The program is very good, it shows you the exact place in image/pdf to correct and the OCR guessing side by side for convenience. It iterates all of them.
[By the way, I'm using the shortcuts to speed up things:
Alt+Enter to add the unrecognized word to dictionary.
Ctrl+Delete to skip word or confirm in case you fixed it.]
Then save the document as a PDF file (Menu: File > Save Document As > PDF File), and you can search it in any PDF reader. The saved file looks the same as the scanned one, but 'behind' it there is text.
It's weird that you tried ABBYY with no luck... it's working great for me. Maybe the version you tried was not the Professional one.
Hope it helps you.
The poster is not after creating a searchable PDF from images; he wants to start with an already searchable PDF and modify its text (e.g. because a searchable PDF was made initially, but an overlooked recognition error was found later and needs correction). I see no way and no tool that assists in doing this.
We have a requirement to convert XFA Form (Adobe LiveCycle Form) to PDF/A-1B.
We're attempting to use iText 5.4.0 to parse the PDF, populate the XFA fields and then save the modified PDF back out.
All the examples I can find with iText talk about populating XFA fields into PDF.
Can I convert an XFA form ( static / dynamic and generated using LiveCycle) to PDFA 1b directly?
We need PDF/A for sure and can't live with plain PDFs. So as a workaround we were thinking about converting the PDF to PDF/A. Is that the right approach, or are we missing something here?
You can also use Adobe LiveCycle Forms Server or PDF Generator for this purpose. It supports conversion of XFA-based forms (either static or dynamic) to either PDF/A-1b or PDF/A-1a.
Yes, you can convert XFA forms to PDF/A using iText in combination with XFA Worker. However, XFA Worker is a closed source product. So you need to be an iText customer if you want to achieve what you want.
Note that we've done exactly what you need in a project for the Ministry of Justice in Belgium. I've written the following blog post about this project: http://lowagie.com/xfabpm
Disclaimer: I'm the CEO of the iText Software Group. This answer isn't meant to promote the product. It's a genuine answer to this question.
I was also looking for the same problem and I reached an easy solution, you can try this out:
Drag and drop the XFA-format PDF into Chrome; it will open in the Chrome browser.
You will find three options at right corner:
Rotate clockwise
Download
Print
Click on "Print"
Change the destination to "Save as PDF" and save.
The saved PDF is a flat PDF (AcroForm) and can be edited easily.
In my iOS app, I would like to regenerate an existing pdf into another pdf after the users are done annotating on the existing pdf.
My regenerated pdf should be an exact replica of the existing pdf but should have embedded annotations and highlights etc which can be opened and viewed on desktops as well.
I have done some research on this including the solutions proposed on other SO posts. I have tried libharu etc.
But somehow I am not able to convert an existing pdf into a replica pdf. I am able to add annotations to a new pdf I create using libharu.
Now my problem is to copy the existing pdf as is to my regenerated pdf. Any pointers will be much helpful.
My understanding is that a library that can save back out a PDF with "true" annotations (those that can be hidden in Acrobat, for example) is not something that exists in a FOSS solution.
LibHaru, for example, only supports creating new PDFs, not editing or appending existing PDFs. From their homepage:
At this moment libHaru does not support reading and editing existing
PDF files and it's unlikely this support will ever appear.
You can render the PDF on a page by page basis, and then re-save it with some additional information. This S.O question has a reasonable looking piece of code. That will save any "annotations" more as an image in the PDF itself, though.
You might try a paid library like PDFNet.