Broken hyperlinks when converting .docx files to PDF via LibreOffice

I'm attempting to convert .docx files to .pdf via soffice, using the following command:
soffice --convert-to pdf input.docx --outdir <outdir>
The conversion succeeds without any error, but the hyperlinks do not work: no matter which PDF reader I use, they aren't clickable. The hyperlinks in the file are relative links to other PDF files in the same directory.
I tried the following to narrow down the issue:
1. Open the document in LibreOffice and “edit” the first hyperlink: open the dialog and close it without changing anything.
2. Save the .docx file in LibreOffice, then run the headless conversion via the soffice CLI.
3. Observe that only the first hyperlink works, the one we “edited” without changing anything.
I inspected the .docx internals and found that the _rels files were identical. The only difference was that the hyperlink I “edited” had this additional tag in word/document.xml:
<w:rStyle w:val="InternetLink" />
Some googling led me to this unanswered issue. It appears that LibreOffice expects the InternetLink style to be present on a hyperlink when exporting it to PDF.
My question is - how can I use soffice to convert the .docx to .pdf while retaining hyperlinks? Is there a code change I can make to the LibreOffice source code?
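A minimal sketch of a pre-processing workaround based on the observation above: add the rStyle tag to every run inside a w:hyperlink element before calling soffice. This assumes the missing tag really is the cause, which the question only suggests; it has not been confirmed against LibreOffice itself, and whether a matching style definition is also needed in word/styles.xml is not verified here. The function names are hypothetical, and the regex-based rewrite is a shortcut that assumes the usual w: prefix and no nested or self-closing runs.

```python
import re
import subprocess
import zipfile

# Tag copied from the question; assumed (not confirmed) to be what LibreOffice looks for.
STYLE = '<w:rStyle w:val="InternetLink" />'

HYPERLINK_RE = re.compile(r"<w:hyperlink\b[^>]*>.*?</w:hyperlink>", re.DOTALL)
# Matches <w:r> or <w:r attr="..."> but not <w:rPr>, up to the closing </w:r>.
RUN_RE = re.compile(r"<w:r(?:\s[^>]*)?>.*?</w:r>", re.DOTALL)


def _style_run(match):
    run = match.group(0)
    if "<w:rStyle" in run:
        return run  # already has a run style, leave it alone
    if "<w:rPr/>" in run:
        return run.replace("<w:rPr/>", "<w:rPr>" + STYLE + "</w:rPr>", 1)
    if "<w:rPr>" in run:
        # rStyle goes first inside the run properties
        return run.replace("<w:rPr>", "<w:rPr>" + STYLE, 1)
    end = run.index(">") + 1  # end of the opening <w:r ...> tag
    return run[:end] + "<w:rPr>" + STYLE + "</w:rPr>" + run[end:]


def add_internet_link_style(document_xml):
    """Return document.xml text with the style added to runs inside hyperlinks."""
    return HYPERLINK_RE.sub(
        lambda m: RUN_RE.sub(_style_run, m.group(0)), document_xml
    )


def patch_docx(src, dst):
    """Copy src to dst, rewriting only word/document.xml."""
    with zipfile.ZipFile(src) as zin, zipfile.ZipFile(dst, "w") as zout:
        for item in zin.infolist():
            data = zin.read(item.filename)
            if item.filename == "word/document.xml":
                data = add_internet_link_style(data.decode("utf-8")).encode("utf-8")
            zout.writestr(item, data)


def convert_to_pdf(docx_path, outdir):
    """Run the same conversion as in the question, with --headless added."""
    subprocess.run(
        ["soffice", "--headless", "--convert-to", "pdf", docx_path, "--outdir", outdir],
        check=True,
    )
```

Usage would be something like patch_docx("input.docx", "patched.docx") followed by convert_to_pdf("patched.docx", "out"). Runs outside hyperlinks are left untouched, and running the patch twice does not add the tag twice.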

Related

Add link to PDF within a PDF file using relative addressing

I want to add a link to some text within a PDF that will bring up another PDF that is located in the same folder. I wish to use relative addressing so that the PDF suite is transportable to other users and computers. I wish this to work on Linux and Macs.
LibreOffice Draw, despite promises, writes out the link address as a full path. Thus if taken to another computer with another user the link fails to work.
I tried manually editing the PDF files using vi and changed the link syntax as follows:
<</Type/Annot/Subtype/Link/Border[0 0 0]/Rect[940.9 480.3 1200.7 507.9]/A<</Type/Action/S/URI/URI(Content/Information.pdf)>>
where the target file, "Information.pdf", is in a subdirectory "Content".
On Linux using Document Viewer, it works! On a Mac, Preview (a PDF viewer) decides that the target file needs to be opened by some other application. Adobe Reader doesn't like this syntax either. I tried prefixing the filename with the keyword "file:", which works for a full path but not with relative addressing.
Does anyone know what syntax might work for me?

When editing a PDF in LibreOffice Draw, you can select text and add a hyperlink. The exported PDF file can then be edited with a text editor such as vi.
To find the line with the link, search for the filename of the target. One problem is that LibreOffice insists on using a fully qualified path name to locate the file, and this won't work after the file is moved, say to another computer. The unedited line should be similar to:
<</Type/Annot/Subtype/Link/Border[0 0 0]/Rect[940.9 480.3 1200.7 507.9]/A<</Type/Action/S/URI/URI(File:<fullpathname>/Content/Information.pdf)>>
where Content/Information.pdf is the path of the link target relative to the directory containing the linking PDF. This line should be changed to:
<</Type/Annot/Subtype/Link/Border[0 0 0]/Rect[940.9 480.3 1200.7 507.9]/A<</Type/Action/S /Launch/F(Content/Information.pdf)>>
This works on Unix and macOS.
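The manual edit above could be scripted roughly as follows. This is only a sketch of the text substitution on one annotation line: the function name and the absolute_prefix parameter are hypothetical, and the example path is made up. Note that rewriting bytes inside a PDF changes object lengths, so the offsets in the file's cross-reference table may no longer match; the answer above reports that the hand-edited file works, but not every viewer is guaranteed to tolerate that.

```python
import re

# A URI action as written by LibreOffice Draw: /S/URI/URI(<target>)
URI_ACTION = re.compile(r"/S\s*/URI\s*/URI\s*\(([^)]*)\)")


def uri_to_launch(annotation, absolute_prefix):
    """Turn URI actions that start with absolute_prefix into relative Launch actions.

    absolute_prefix is the part to drop, e.g. "File:/home/me/docs/".
    Actions whose target does not start with it (such as web links) are kept as-is.
    """
    def replace(match):
        target = match.group(1)
        if not target.startswith(absolute_prefix):
            return match.group(0)
        relative = target[len(absolute_prefix):]
        return "/S/Launch/F(" + relative + ")"

    return URI_ACTION.sub(replace, annotation)
```

Applied to the unedited line with the prefix set to the full path of the PDF's directory, this yields the /Launch/F(Content/Information.pdf) form shown above.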

cfdocument not converting Word Document to PDF correctly

cfdocument in ColdFusion 11 is not converting my Word Documents to PDF correctly. I have OpenOffice 4.1.3 installed and configured in CF Admin. I am able to open the source document in OpenOffice and Export to PDF without issue. However, when I run the following code, the resulting PDF is "gobbledigook":
<cfdocument
format="pdf"
srcfile="#_tempSourceFilePath#"
filename="#_destinationFilePath#" />
Here is an excerpt of the resulting PDF (the snip shows developer edition, but, the same thing happens with Standard installation):
I can't figure out why this is happening. Any ideas?

The problem is:
srcfile="#_tempSourceFilePath#"
This is apparently the path to a binary file that is not browser-writable. A necessary condition for the srcfile attribute is that the file be browser-writable. That is, without the need for a browser plugin.

Open a .pdf file

I am trying to open a .pdf file within Excel, like an iframe in HTML.
My requirements are:
1. Save the paths of multiple PDF files in Excel.
2. Excel should open each .pdf file within Excel itself (no need to open it in a separate PDF window).
3. It should work like an iframe in HTML: the user should be able to view the .pdf within Excel itself.
I know this is a little weird, but can anybody help me?

You could probably get the filenames via VBA.
Here is a question with answers that claim to work:
Loop through files in a folder using VBA?
As for opening a PDF in Excel, that's kind of pushing it.
Since your request is exotic, I can think of an exotic workaround:
If you can spare the interactivity, you can make copies of your PDFs, convert them to Word formats, and load them that way. I've seen people convert PDFs to JPGs just to load them into other documents, but that's rudimentary and really fringe.
Otherwise you are facing a lot of custom coding to make it possible.

docx4j word/googledocs compatibility

I'm creating a program that extracts a docx file, displays it in a JavaFX graphical interface with buttons in place of flags put in the docx, and, when a button is pressed, modifies the input docx.
I'm using the docx4j API to extract and modify the document.
The problem is that the program fails if I give it a docx generated by Microsoft Word, so I'm forced to use a workaround.
I take the docx made in Word, load it into Google Docs, and use the "Download in .docx format" option. If I feed the docx from Word directly to my program, it fails.
I noticed that my Word file was half the size after being passed through Google Docs. Likewise, if I take a docx file downloaded from Google Docs, open it in Word, modify one letter, and save it, it becomes twice as large. For the record, I use Word 2008.
So I'd like to know if someone knows what explains this difference.
Thanks
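One way to see where the size difference comes from is to list the parts inside both files, since a docx is a zip archive. The sketch below is only a diagnostic aid with hypothetical function names; it does not explain why docx4j fails on the Word file.

```python
import zipfile


def part_sizes(path):
    """Map each part name in a docx (a zip archive) to its compressed size in bytes."""
    with zipfile.ZipFile(path) as archive:
        return {info.filename: info.compress_size for info in archive.infolist()}


def compare_parts(path_a, path_b):
    """Return (part name, size in a, size in b) rows; None means the part is absent."""
    a, b = part_sizes(path_a), part_sizes(path_b)
    return [(name, a.get(name), b.get(name)) for name in sorted(set(a) | set(b))]
```

Calling compare_parts("from_word.docx", "from_google_docs.docx") (file names made up) and printing the rows shows which parts exist in only one of the files and which ones differ most in size.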

Extract text from a PowerPoint (.ppt or .pptx) file?

I'm currently using a combination of OpenOffice macros and a pdf2text program to extract text and would like to find an easier, more efficient way getting the text out of a PowerPoint file.
I've tried using the Apache POI library and have not had much luck: I encountered numerous exceptions within the library when trying to process the files I'm looking at, and I don't particularly want to sift through the library's source code.
Is there an easy way to do this without using the aforementioned library?

If you have MS Office and you save the PPT in the RTF (Rich Text Format), it contains just the text from the presentation. You could then open the file in any editor that understands RTF files and save it as a text (TXT) file.
I expect this to work from Open Office too.
Since you talk of API, this may not be the way to go for you but maybe it will give you newer ideas on getting there. Say, you use multiple macros to do the conversion in stages...
Edit: I got curious and did a short Google search.
This is what I found on one of the www.openoffice.org pages:
As people in this thread have pointed out, retrieving text from an OO
document isn't hard since it's just zipped xml that can be parsed with a
perl script. The problem is getting Microsoft Powerpoint documents into
a zipped XML format in the first place.
I've found that File -> Wizards -> Document Convertor does exactly that.
Just tell it you want to convert Powerpoint documents, not templates,
point it to your source directory and where you want it to spit out the
result and you're away.
I then find
unzip -p $file.sxi content.xml | perl -p -e "s/<[^>]*>/\n/g;s/ +//;s/\n\n/\n/g;" -w
works rather well for extracting the text.
Sorry, I don't have Open Office handy to try any of that out.

pptx files are relatively easy to deal with, because they are just zipped xml - you can just unzip them and then strip all the xml tags from the content of the files in the 'ppt/slides' subdirectory of the unzipped stuff, yielding most of the pertinent text.
ppt files are a whole other ballgame, and the process is rendered even more painful because the canonical tool, catppt from the catdoc package, is susceptible to a buffer overflow that makes it nearly useless (it segfaults on a large percentage of ppt files).
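The pptx approach described above can be sketched in Python. Instead of stripping tags with a regex, this version parses each slide with an XML parser and collects the text runs, which is a swapped-in technique rather than what the answer literally describes. The function name is hypothetical, slides are ordered by file number (which is assumed, not guaranteed, to match presentation order), and presenter notes are not included.

```python
import re
import zipfile
import xml.etree.ElementTree as ET

# DrawingML namespace used for text paragraphs (a:p) and text runs (a:t)
A_NS = "{http://schemas.openxmlformats.org/drawingml/2006/main}"
SLIDE_RE = re.compile(r"ppt/slides/slide(\d+)\.xml$")


def extract_pptx_text(path):
    """Return a list with one string per slide, paragraphs separated by newlines."""
    slides = []
    with zipfile.ZipFile(path) as archive:
        names = [n for n in archive.namelist() if SLIDE_RE.match(n)]
        # Sort numerically so slide10 comes after slide2
        names.sort(key=lambda n: int(SLIDE_RE.match(n).group(1)))
        for name in names:
            root = ET.fromstring(archive.read(name))
            paragraphs = []
            for para in root.iter(A_NS + "p"):
                text = "".join(t.text or "" for t in para.iter(A_NS + "t"))
                if text:
                    paragraphs.append(text)
            slides.append("\n".join(paragraphs))
    return slides
```

For example, print("\n\n".join(extract_pptx_text("talk.pptx"))) would dump the text of a file named talk.pptx (a made-up name). This does nothing for the older binary .ppt format.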

LibreOffice-5 File - Export - HTML includes both slide contents and presenter notes.
Then open the .html file in Firefox or another browser and use File - Save Page As - Text File (or a utility such as pandoc -o file.txt file.html).
