We have an automated program that processes PDF documents. It fails on certain PDFs because they are malformed. Acrobat opens these same files, and it appears to take extra measures to repair them. So at the moment someone has to open and re-save each file manually to clean it. Is there a way to do this programmatically in Python or PowerShell? Has anyone done this?
Thanks!
You might try this link.
You can run a macro from PowerShell, and you can set up a scheduled task in Task Scheduler (TASKSCHD.MSC) to run your PowerShell script at pretty much any interval you like. This particular example has a msgbox for the path to the folder, then loops through all the PDF files in that folder, flattening and saving each one. Flattening might not be required, but it might help with a malformed PDF.
** This relies on Acrobat and uses the JavaScript API through the Excel ... I'm not sure if LibreOffice Draw has a JavaScript API like Acrobat. I'm not aware of any open source alternatives that have that sort of functionality. If anyone is, please let me know.
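For a route that does not depend on Acrobat, one thing that may be worth testing is re-writing each file through Ghostscript's pdfwrite device. This is a different tool from the Acrobat macro described above, and it will not rescue every malformed file. A minimal Python sketch, assuming Ghostscript is installed and on the PATH as `gs` (on Windows the console executable is usually `gswin64c`); the function names and folder layout are made up for illustration:

```python
import subprocess
from pathlib import Path


def build_rewrite_command(src, dst, gs_exe="gs"):
    """Return the Ghostscript command that re-writes src into dst."""
    # -o sets the output file (and implies -dBATCH -dNOPAUSE);
    # -sDEVICE=pdfwrite makes Ghostscript emit a freshly written PDF.
    return [gs_exe, "-o", str(dst), "-sDEVICE=pdfwrite", str(src)]


def rewrite_folder(in_dir, out_dir, gs_exe="gs"):
    """Re-write every *.pdf in in_dir into out_dir; return names that failed."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    failed = []
    for src in sorted(Path(in_dir).glob("*.pdf")):
        cmd = build_rewrite_command(src, out_dir / src.name, gs_exe)
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode != 0:
            # Keep a list so these files can still be handled manually.
            failed.append(src.name)
    return failed
```

Calling `rewrite_folder("incoming", "cleaned")` would leave the originals untouched and return the names of any files Ghostscript could not process.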
I am trying to make a dynamic PDF generator as a .NET Core API. I want to take an existing PDF or .docx file and edit it so that the current name (John Doe) becomes a placeholder such as #NAME_PLACEHOLDER.
I then want to transform #NAME_PLACEHOLDER -> John Doe (or whatever is in the KeyValuePair or Dictionary<string, string>).
I am running this in a Docker environment, so I can easily execute commands, and I am willing to do that as well.
So far I have tried a few things:
1) pdf2htmlEX
Executes as pdf2htmlEX file.pdf
Does the job pretty well
Can be converted back to PDF using Google Chrome headless or similar
Problem: Only the characters already used in the PDF are available for the replacement text. So if the document only uses A, B and C, a D will be rendered in Times New Roman (or the default font)
2) LibreOffice ODT to PDF
This was pretty nice, because I could simply unzip the .odt file, open content.xml, search and replace, then save it as an .odt file again
Could be converted into PDF rather easily using soffice --convert-to pdf
LibreOffice is quite nice
Problem 1: Microsoft Word -> Save as ODT tends to break the formatting, so we have to use LibreOffice to go and change it back again
Problem 2: We don't want to move away from Microsoft's Office suite
3) HTML to PDF using Chrome Headless
What you see is what you get
By far the best option, if we're all developers and have unlimited time
Problem 1: Only our developers can make changes, since our marketing department does not know HTML
Problem 2: Our existing PDFs would have to be rewritten in HTML
As you can see, I have tried a bunch of things. None of them, except Chrome Headless, has lived up to my expectations. What I really like about #3 is what you see is what you get. I can make the whole thing in HTML, press CTRL+P and see what it looks like as a finished PDF, basically.
I am looking for a better solution, though. It can be paid. It can be free. All I need is to change out words/phrases with other words dynamically, which apparently seems like a tough thing to do.
Thanks for clearly specifying what you've already found. It helps a lot in providing a succinct answer.
The conversion is always tricky - I'm sure you know Word has trouble displaying/editing some Word documents itself.
I have experience regarding point #2 "LibreOffice ODT to PDF" and can suggest a few things to test:
Don't use Microsoft Word to do the docx->odt conversion. As you know, it's not good at it. Use LibreOffice itself for this step. The rest of your process remains the same.
For some documents, LibreOffice handles doc->odt much better than docx->odt. So you can work with the DOC format instead and get a better result without any other changes.
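Both conversions in this pipeline can be run with LibreOffice in headless mode. A rough Python sketch, assuming `soffice` is on the PATH inside the container; the helper names are invented for illustration:

```python
import subprocess
from pathlib import Path


def build_convert_command(src, target_format, out_dir, soffice="soffice"):
    """Return the headless LibreOffice command converting src to target_format."""
    return [
        soffice,
        "--headless",
        "--convert-to", target_format,
        "--outdir", str(out_dir),
        str(src),
    ]


def convert(src, target_format, out_dir, soffice="soffice"):
    """Run the conversion and return the path LibreOffice is expected to write."""
    cmd = build_convert_command(src, target_format, out_dir, soffice)
    subprocess.run(cmd, check=True)
    # LibreOffice keeps the base name and swaps the extension.
    return Path(out_dir) / (Path(src).stem + "." + target_format)
```

For example, `convert("template.docx", "odt", "work")` followed by `convert("work/template.odt", "pdf", "out")` covers the docx->odt and odt->pdf steps.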
You won't be able to remove the devs from the process, but you can certainly reduce their role allowing your business/marketing teams to have more direct input simply by:
Get the starting-point document to the devs to run through the conversion process. The devs can "clean up" the document so that it converts nicely.
Make this version of the document the "official" starting point. The business or technical teams can load it, adjust it, and put it back into the process.
If possible, expose a test platform to the business teams so they can download, adjust, upload and render to PDF. This cycle means they will be able to achieve more and, if they're good, do impressive stuff without any dev input.
In short, the steps above mean: don't expect perfect conversion of arbitrary complex documents. Starting from a working baseline, even a complex one, is what makes this work.
Some of that might show you that your #2 is actually going to get the best overall results.
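The search-and-replace step of #2 (unzip the .odt, edit content.xml, zip it again) can be done without touching the disk layout by hand. A minimal Python sketch using only the standard library; the function name is hypothetical, and it assumes each placeholder appears as one unbroken string in content.xml, which may not hold if the editor split it across formatting spans:

```python
import zipfile
from xml.sax.saxutils import escape


def fill_odt_template(template_path, output_path, values):
    """Copy an .odt file, replacing placeholder keys inside content.xml."""
    with zipfile.ZipFile(template_path) as src, \
         zipfile.ZipFile(output_path, "w") as dst:
        for item in src.infolist():
            data = src.read(item.filename)
            if item.filename == "content.xml":
                text = data.decode("utf-8")
                for placeholder, value in values.items():
                    # Escape &, < and > so the result stays valid XML.
                    text = text.replace(placeholder, escape(value))
                data = text.encode("utf-8")
            # Passing the ZipInfo keeps each entry's name, order and compression.
            dst.writestr(item, data)
```

A call such as `fill_odt_template("template.odt", "filled.odt", {"#NAME_PLACEHOLDER": "John Doe"})` produces a file that can then go through `soffice --convert-to pdf` as before.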
I hope that helps.
To whom it may concern:
My boss and I are on GitLab, and we have trouble seeing the differences between versions of dynamic PDF files.
Normally, for code files like C# class .cs files, it's easy to double-click and have GitLab highlight the changes made between two different versions.
However, we also create dynamic XFA/PDF files in Adobe LiveCycle, and at times it's difficult to tell what has been changed, especially if the commit messages are too vague. We know people have suggested taking screenshots of the PDF for each version, but you can't diff text changes or format changes on image files.
We tried the program DiffPDF found here:
http://www.qtrac.eu/diffpdf.html
But we found out that it does not work with XFA/dynamic forms.
Does anyone have any suggestions on any possible programs that can diff the actual content on PDFs in GitLab?
Thank you for your time and future advice.
I agree that it is hard to see what has been changed in a PDF, but if you look at the XDP file instead, you will be able to see what code has been changed.
That is, if working with the XDP file is possible for you.
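Because an XDP file is XML text, two versions can be compared with any text diff tool, and an XDP file committed to the repository should diff in GitLab like any other text file. A small Python sketch using the standard `difflib` module; the function names and default file names are placeholders:

```python
import difflib
from pathlib import Path


def diff_xdp(old_text, new_text, old_name="old.xdp", new_name="new.xdp"):
    """Return a unified diff of two XDP documents given as strings."""
    lines = difflib.unified_diff(
        old_text.splitlines(),
        new_text.splitlines(),
        fromfile=old_name,
        tofile=new_name,
        lineterm="",
    )
    return "\n".join(lines)


def diff_xdp_files(old_path, new_path):
    """Read two XDP files from disk and return their unified diff."""
    old_text = Path(old_path).read_text(encoding="utf-8")
    new_text = Path(new_path).read_text(encoding="utf-8")
    return diff_xdp(old_text, new_text, str(old_path), str(new_path))
```

An empty result means the two versions are textually identical.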
I am attempting to use Automator to turn a folder of ArtPro (.ap) images into PDFs, but I can't find any existing or downloadable actions that do anything other than open a .ap file with Automator.
Does anyone know of an action I could download, or a different way to automate the conversion of .ap to .pdf? Is it possible to do it using AppleScript instead?
It is only possible with ArtPro itself (manually) or with an Automation Engine Action List. You can try recording your actions with "Watch Me Do" in Automator, but it's not a good idea. AppleScript will not help.
The problem is that Esko has its own file format which no other software can understand.
I could see some approaches:
a) open the document in ArtPro, then use the Print command and write out as PDF
b) (if Preview.app can read in .ap files) open the document in Preview.app and save as PDF
c) if there is no direct way (neither a) nor b) works), write out as TIFF and convert that intermediate file, for example in Acrobat or Preview
The ArtPro format is proprietary to Esko - you won't be able to open it in anything else.
Secondly, Esko favours selling its own automation solution (Automation Engine) - ArtPro will not allow you to automate it. It doesn't integrate with Automator and as far as I know it also doesn't publish AppleScript actions.
So basically I think your only option is using Automation Engine from Esko.
You need to use the task "Export ArtPro to Normalized PDF File" in Esko Automation Engine.
Problem:
I embedded a PDF into an Excel workbook.
This PDF is a feedback form which has a "send to mail recipient" button.
When the PDF is not embedded in Excel, sending works; when it is embedded, sending causes Excel to crash.
My idea is to open the OLEObject, save the embedded PDF to a temporary file, close the OLEObject, and open the saved PDF so that it runs in its own Acrobat instance.
Opening the OLEObject already works by using:
ThisWorkbook.Sheets(SheetName).Shapes(OLE_Name).Select
Selection.Verb Verb:=xlOpen
But I'm struggling with the following steps.
How to do this?
Possible other ways?
WIN7, Office 2010, Acrobat Reader
OK. OLEObjects are notoriously difficult to work with. I have done some searching because it seems hard to believe this is not supported. My first attempt was to use a Shell command to copy the OLE object and paste it to a known directory.
However, you can't copy/paste an embedded PDF object, which means you can't use a Shell command to copy/paste it to a temporary folder/file path. So that won't work.
The consensus seems to be that this is not really possible; all you can do with embedded PDFs is open them. See here and here; the latter, referencing this PDF, says:
See the "AxAcroPDFLib.AxAcroPDF" section under OLE Automation. Those are the only methods you have available to you from Reader
And also, here:
What you are doing relies on you knowing how to interact with the particular class which may be OK if you only have a couple of types and know what they are but in general, unless (a) the Class supports Automation and (b) you know the commands to issue this can't work.
The class does not really support the sort of automation you desire.
Possible Solution
Use Acrobat and WinAPI
If you have Acrobat installed, then I believe you may be able to do this by using WinAPI to get the hWnd of the Acrobat instance and controlling it that way. I have perused some documentation, but I do not have Acrobat, so I am not able to test this.
Could any of you help me with the following:
I have a large number of InDesign documents, and I need to be able to search through their text. I don't have the resources to open each file, make a PDF, and then search that. In short, I want to be able either to extract the textual content and index that, or to index the file itself directly.
In the end, I would present the content or the index to a Solr engine for further processing. This should all take place in a PHP/Apache/MySQL environment.
Your insights are highly appreciated.
In order to search the textual contents of an InDesign file, you will have to open the file in InDesign or InDesign Server. There is no legal way around this.
However, there is no need to do a time consuming pdf export. You can use the InDesign scripting API to search through the text content of the file and create an index either inside the document or in an external location.
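One variation on this, offered as an assumption rather than as the method described above: if the InDesign script exports each document to IDML (which still requires InDesign or InDesign Server), the text can then be pulled out and indexed outside InDesign, because an IDML package is a zip archive whose story files keep their text in `<Content>` elements. A minimal Python sketch; the function name is hypothetical and the export step itself is not shown:

```python
import zipfile
import xml.etree.ElementTree as ET


def extract_idml_text(idml_path):
    """Return the text of every story in an IDML package, one string per story."""
    stories = []
    with zipfile.ZipFile(idml_path) as package:
        for name in sorted(package.namelist()):
            # Story files live under Stories/ inside the package.
            if not (name.startswith("Stories/") and name.endswith(".xml")):
                continue
            root = ET.fromstring(package.read(name))
            parts = []
            for element in root.iter():
                if element.tag == "Content":
                    # Text runs are stored in <Content> elements.
                    parts.append(element.text or "")
                elif element.tag == "Br":
                    # <Br/> marks a paragraph break.
                    parts.append("\n")
            stories.append("".join(parts))
    return stories
```

The returned strings could then be sent to Solr as the searchable field for that document.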
I think you might be looking for an application that can read & allow you to edit text in InDesign without having to actually have InDesign?
If so, and I may be wrong, there is a product on the market called PageZephyr, from Markzware.
You should look into it; I believe there's a 30-day free demo as well. I used it a while ago and it worked great and saved me tons of time. I don't have many InDesign files nowadays, though.
Google them.