I'm converting Word documents to PDF programmatically using VB.NET and Ghostscript. The Word document I'm having problems with has hyperlinks to external URLs and also hyperlinks to bookmarks within the document. When the document is converted to PDF, the external URLs work but the links to the bookmarks do not.
I have searched for a solution to get these bookmarks to work on the output PDF but haven’t had any luck. Hopefully someone has done this and can share the solution.
Ghostscript only handles PDF or PostScript as input. There are sibling products that handle XPS and PCL as well, but none of them handle Word .doc files. So you must be converting the Word file into something else first.
I'll hazard a guess that you are using the Windows PostScript printer driver to convert to PostScript and passing that to GS (possibly via the RedMon Port Monitor) to convert into PDF.
Now PostScript doesn't support hyperlinks, bookmarks, or any of the other paraphernalia of a viewing application, since it's intended as a print language. To overcome this, Adobe introduced an extension, the pdfmark operator, which can be used to create this kind of information. NOTE: this is an extension which is only supported for conversion to PDF.
So, in order to get these inserted, you need to create pdfmarks in the PostScript. If you are printing from Word, this means that you have to insert PostScript into the file when printing. There is a 'pass through' mechanism for this purpose.
So what you need to do is create the appropriate Visual Basic script in Word which inserts the relevant pdfmarks when the document is printed. This is how the Adobe plug-in for Word (which used to be called PDFMaker a long time ago) works.
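To make the pdfmark idea a bit more concrete, here is a minimal sketch in Python that only builds the PostScript text for a named destination and for a link pointing at it. The function names, the example rectangle, and the assumption that the destination name is a valid PostScript name (no spaces or delimiters, as is the case for Word bookmark names) are mine, not something taken from your setup. How the strings get into the print stream (the pass-through mechanism mentioned above) is not shown.

```python
# Sketch only: builds pdfmark snippets as text. Function names are hypothetical.

# Common guard so the PostScript still runs on interpreters without pdfmark.
PDFMARK_GUARD = "/pdfmark where {pop} {userdict /pdfmark /cleartomark load put} ifelse"


def dest_pdfmark(name: str, page: int) -> str:
    """Define a named destination on the given page (1-based)."""
    return f"[ /Dest /{name} /Page {page} /View [/XYZ null null null] /DEST pdfmark"


def link_pdfmark(rect: tuple, dest: str) -> str:
    """Create a link annotation covering rect (llx, lly, urx, ury) in points,
    jumping to the named destination."""
    llx, lly, urx, ury = rect
    return (
        f"[ /Rect [{llx} {lly} {urx} {ury}] /Border [0 0 0] "
        f"/Dest /{dest} /Subtype /Link /ANN pdfmark"
    )


if __name__ == "__main__":
    print(PDFMARK_GUARD)
    print(dest_pdfmark("Intro", 2))
    print(link_pdfmark((70, 550, 210, 575), "Intro"))
```

The destination would be emitted on the page containing the Word bookmark, and the link on the page containing the hyperlink text. Working out the rectangle for each link is the hard part and is left open here.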
Have a look at this tool.
It does maintain bookmarks and hyperlinks.
http://www.transcom.de/transcom/en/2004_pdf-t-maker.htm
I am looking for a method of the Word OLE object that can open a PDF in Microsoft Word.
I want to copy all pages of the PDF into a doc/docx file and add footers there.
Could anybody give me a hint on how to import a PDF?
PS: any sample code for this problem would be great.
Thanks,
Lilya
You need an OCR (Optical Character Recognition) engine to convert a PDF to a document. PDF is a generic format and can contain text as images, so converting a PDF to a document is very hard. SAP doesn't have any OCR function for this. Maybe OpenText (if the customer is using it) has this functionality; I don't have detailed information about OpenText. You need third-party tools for this. If the PDF contains real text, you can easily use online services or command-line utilities to convert PDF files to text files. Otherwise you need a professional SDK (for example ABBYY FineReader).
I used Foxit PDF Reader to save the PDF file as a text file and wrote a macro to read the text file. Of course, by doing so you only get the text and nothing else.
I am trying to print a section of an existing PDF to a new PDF. The original is searchable and selectable, but the new PDF is neither. I am using "Adobe Acrobat Reader DC" and printing via "Microsoft Print to PDF". I'm unsure if there is any other relevant information.
After searching for some time, I could not find an answer that allows a direct PDF-to-PDF print.
I did find a workaround, however.
I downloaded a free program called PrimoPDF. Once installed, PrimoPDF becomes a printer option within Adobe Acrobat Reader. I then selected my desired pages and printed to PrimoPDF instead of Microsoft Print to PDF. This generated a .ps file. I then imported the .ps file into the PrimoPDF application and was able to generate a .pdf from it. The newly generated PDF was searchable and selectable and exactly what I needed.
Hopefully someone else finds this useful in the future.
Generally, refrying (printing to PostScript and then converting back to PDF) is a bad idea. Microsoft Print to PDF created a file that wasn't searchable because, when Adobe Reader detects that the target printer can't render the PDF correctly (for any number of reasons, such as not having the right fonts), it renders the PDF itself and sends an image to the printer. A simpler PDF probably would have worked just fine.
You are much better off getting a tool that will simply allow you to extract the pages you need to a new file rather than printing.
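As one illustration of extracting pages rather than printing, here is a small sketch that drives the qpdf command-line tool from Python. Choosing qpdf is my assumption, since the advice above doesn't name a tool. The file names and function names are placeholders, and qpdf has to be installed and on the PATH for the second function to work.

```python
import subprocess


def qpdf_extract_command(src: str, dst: str, first: int, last: int) -> list:
    """Build the qpdf argument list that copies pages first..last (1-based,
    inclusive) from src into a new file dst, without re-rendering them."""
    if first < 1 or last < first:
        raise ValueError("page range must satisfy 1 <= first <= last")
    return ["qpdf", "--empty", "--pages", src, f"{first}-{last}", "--", dst]


def extract_pages(src: str, dst: str, first: int, last: int) -> None:
    """Run qpdf; raises CalledProcessError if qpdf reports a failure."""
    subprocess.run(qpdf_extract_command(src, dst, first, last), check=True)


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    print(" ".join(qpdf_extract_command("original.pdf", "section.pdf", 3, 7)))
```

Because the pages are copied rather than printed, the text and fonts in the new file should stay as they were in the original, so it remains searchable and selectable.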
I'm creating a program which extracts a docx file and displays it in a JavaFX graphical interface, with buttons in place of flags put in the docx. When one clicks a button, the program modifies the input docx.
I'm using the docx4j API for extracting and modifying the document.
The problem is that the program fails if I give it a docx generated by Microsoft Word as input. I'm forced to use a workaround.
I take my docx made in Word, load it into Google Docs, and use the "Download in .docx format" option. If I put the docx from Word directly into my program, it fails.
I noticed that my Word file was half the size after being passed through Google Docs. Likewise, if I take a docx file downloaded from Google Docs, open it in Word, modify one letter and save it, it becomes twice as large. For the record, I use Word 2008.
That's it, so I'd like to know if someone knows what explains this difference.
Thanks
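One way to investigate the size difference: a .docx file is a ZIP package, so the parts inside the Word version and the Google Docs version can be listed and compared. The sketch below uses only the Python standard library; the function names and file names are hypothetical, and it only reports sizes, it does not explain why docx4j fails.

```python
import zipfile


def part_sizes(docx) -> dict:
    """Map each part name in the package to (compressed_size, uncompressed_size).

    `docx` may be a path or a binary file-like object.
    """
    with zipfile.ZipFile(docx) as package:
        return {
            info.filename: (info.compress_size, info.file_size)
            for info in package.infolist()
        }


def only_in_first(first: dict, second: dict) -> list:
    """Part names present in the first package but missing from the second."""
    return sorted(set(first) - set(second))


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    word = part_sizes("from_word.docx")
    google = part_sizes("from_google_docs.docx")
    for name in only_in_first(word, google):
        print(name, word[name])
```

Parts that exist only in the Word version, or that are much larger there, are natural candidates for both the extra size and whatever the program trips over.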
Example PDF page: https://db.tt/qRcF000k
This is a sample page from a document where copied text shows up as question marks in my favorite reader, SumatraPDF (mupdf), just as it does in Adobe Acrobat. My main problem is that, because of this, I can neither search the document nor index it.
OTOH, xpdf's pdftotext extracts correct text.
In Adobe Acrobat if I use "Copy as formatted text", correct text is written to clipboard, although I still can't search from Acrobat.
Also if I open the linked page in Firefox's built-in PDF reader I can correctly copy the text.
Can Ghostscript perhaps be instructed to correct this issue, which I can't describe other than as 'unreadable characters'?
The PDF file uses subset fonts with non-standard Encodings and no ToUnicode CMaps. So no, you can't have Ghostscript 'correct' this file.
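If you want a rough way to check other files for the same problem, the sketch below counts font dictionaries and /ToUnicode entries in the raw bytes of a PDF. This is a crude heuristic of my own, not something Ghostscript does: it only sees objects stored uncompressed in the file, so a file that keeps its objects in compressed object streams would be undercounted. The function name is hypothetical.

```python
import re

# "/Type /Font" but not "/Type /FontDescriptor" (the \b stops at "FontD...").
FONT_DICT = re.compile(rb"/Type\s*/Font\b")
TOUNICODE = re.compile(rb"/ToUnicode\b")


def count_font_markers(pdf_bytes: bytes) -> tuple:
    """Return (font_dictionaries, tounicode_entries) found in the raw bytes."""
    return len(FONT_DICT.findall(pdf_bytes)), len(TOUNICODE.findall(pdf_bytes))


if __name__ == "__main__":
    # Hypothetical file name, for illustration only.
    with open("sample.pdf", "rb") as handle:
        fonts, cmaps = count_font_markers(handle.read())
    print(f"{fonts} font dictionaries, {cmaps} /ToUnicode entries")
```

Several fonts and zero /ToUnicode entries would be consistent with the situation described above, though it doesn't prove it.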
In fact, I can't see how anything can possibly be extracting sensible text from this. Indeed, my versions of Acrobat (Pro X and Reader XI) can't copy meaningful text and don't appear to have a 'Copy as formatted text' menu item. Can you tell me where to find it?
However, I notice that the PDF file was actually created by Ghostscript (version 9.14), so possibly you mean: 'Starting with a different input file, which I haven't given you, could I have generated a PDF file where the text could be copied?' To that I can only say 'I don't know'; it depends on what was in the original input file.
We all know that we can highlight text in a PDF file using either Adobe Acrobat or Preview on Mac. I'm wondering how I can extract all these highlights from a PDF file and generate a summary (a kind of note).
The following post
PDF: standard format for highlights?
points out that there are multiple ways to do highlighting. Will it be a challenge to distinguish the original content of the file from the user-added highlights if shapes with transparency are used to achieve the highlights?
Details about this can be found in open source PDF parsing and rendering libraries; you just have to read the code, or the documentation if it is available.
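As a small illustration for the standard case: a highlight made with the viewer's highlight tool is normally stored as an annotation with /Subtype /Highlight, whose /QuadPoints array lists the corners of each highlighted region, eight numbers per quadrilateral. The sketch below assumes you have already read that array with some PDF library (that part is not shown) and turns it into bounding boxes that can be matched against text positions. The function name is hypothetical. Highlights drawn as transparent shapes in the page content would not show up as annotations at all, so this does not cover them.

```python
def quad_bounding_boxes(quad_points: list) -> list:
    """Split a /QuadPoints array into quadrilaterals (8 numbers each) and
    return one (min_x, min_y, max_x, max_y) box per quadrilateral."""
    if len(quad_points) % 8 != 0:
        raise ValueError("QuadPoints length must be a multiple of 8")
    boxes = []
    for start in range(0, len(quad_points), 8):
        quad = quad_points[start:start + 8]
        xs = quad[0::2]  # x coordinates of the four corners
        ys = quad[1::2]  # y coordinates of the four corners
        boxes.append((min(xs), min(ys), max(xs), max(ys)))
    return boxes


if __name__ == "__main__":
    # One highlighted line: corners given as x1 y1 x2 y2 x3 y3 x4 y4.
    print(quad_bounding_boxes([10, 20, 50, 20, 10, 10, 50, 10]))
```

Text whose position falls inside one of these boxes is the highlighted text; collecting it per annotation gives the raw material for a summary.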