Selecting text and images from a PDF through any programming language

I'm trying to develop a tool or web application that imports a PDF file and lets the user select the text and images in it with a mouse click, then mark each selection as title, content, or image with a button click (three different buttons). The marked content and images would then be copied to the clipboard or pasted into a Word document, which is going to be another part of the project. In which programming language is this possible?

I'd probably try researching a pure browser-side solution using pdf.js and the Clipboard API.
Otherwise, you'd still need the Clipboard API in the browser, and the server side may be powered by any programming language which can be hooked into a web server and has a library to parse PDFs.
You said nothing at all about your prospective server platform, but to name a few: .NET has PdfSharp, which is able to read PDFs, and Python has a host of tools available for it. Besides, there are a bunch of command-line utilities for extracting data from PDFs, which can be called from any programming language able to run external processes.
Note that this only appears to be a simpler solution than using pdf.js. Unless your PDFs are really uniform (say, invoices created by some piece of software), so that your PDF parser can know which bits of data it has to extract and return, the parser will need to return all the data it extracted to the client, and you'll need to somehow render it all there. Maybe that's exactly what you need, but maybe not.
Since PDFs are really tailored for typesetting and not for presenting information in a structured manner, I'd try to piggyback on an existing heavy-duty PDF rendering solution which runs in the browser, so see above.
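As a rough illustration of the "call a command-line utility from any language" route, here is a minimal Python sketch. It assumes Poppler's pdftotext is installed and on the PATH; the helper names build_pdftotext_command and extract_text are made up for this example, and the choice of pdftotext is mine, not something the question specified.

```python
import subprocess
from typing import List, Optional


def build_pdftotext_command(pdf_path: str,
                            first_page: Optional[int] = None,
                            last_page: Optional[int] = None) -> List[str]:
    """Build the argument list for Poppler's pdftotext (assumed installed)."""
    # -layout asks pdftotext to keep the physical layout of the page text.
    command = ["pdftotext", "-layout"]
    if first_page is not None:
        command += ["-f", str(first_page)]  # first page to convert
    if last_page is not None:
        command += ["-l", str(last_page)]   # last page to convert
    # "-" as the output file makes pdftotext write the text to stdout.
    command += [pdf_path, "-"]
    return command


def extract_text(pdf_path: str,
                 first_page: Optional[int] = None,
                 last_page: Optional[int] = None) -> str:
    """Run pdftotext as an external process and return the extracted text."""
    result = subprocess.run(
        build_pdftotext_command(pdf_path, first_page, last_page),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if pdftotext fails
    )
    return result.stdout
```

This only returns plain text. It does not give you images or the per-item positions you would need for click-to-select, which is the reason a browser-side renderer looks more attractive for this use case.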

Related

SOLVED: Looking for a way to automate generation of internal PDF hyperlinks

I have a 300+ page PDF document which needs to have internal page links added to it to reference other pages in the document. The document is created in Visio, which does not support consistent hyperlink generation in PDF export, so the link generation needs to be done on the PDF itself, not up the chain. This is an annual need, and regularly takes over a week due to the amount of manual labor, time, and checking needed.
The text which is hyperlinked has the same format in every case (e.g., "See Section 8.18 - How to Hyperlink"), and I'm certain this can be automated, as there are commercial plugins which can do this, but they cost hundreds of dollars, and are not able to be used in this case due to restrictions imposed by my employer. Example: https://www.evermap.com/ABAddingHyperlinks.asp
I've been looking through the Acrobat Plugin SDK and it seems doable, but I know there is also a higher level scripting language available for Acrobat. Does anyone have experience working with PDFs or with the Acrobat scripting / SDK tools? Are there open source methods for doing this? I've looked everywhere! Willing to learn. I've looked at Ghostscript (Adding internal hyperlink to a pdf) but what I need is way more than just a Table of Contents, and links can appear in many places on the page with line breaks, so consistency is a challenge.
EDIT: I found a solution! Bluebeam software's Revu Extreme works pretty darn well, and can be used as a 30 day free trial of all features. Only limitation is that links which extend across a line break (multiple lines of text) do not properly work in Edge or Chrome's PDF viewer, as they don't properly support hyperlinks with multiple click regions. I've submitted a ticket requesting a feature be added to Revu that fixes this, but for now those links need to be manually fixed following the batch link. The process is described here: https://support.bluebeam.com/online-help/revu2018/Content/RevuHelp/Menus/Batch/Link/Batch-Link--T.htm

You can add hyperlinks to a document with Ghostscript, but you would need to know the location of the text to hyperlink and the destination in advance. You cannot automate it, or in fact write any reasonably simple code to automate the task, using Ghostscript. You'd need to modify chunks of the PDF interpreter, which is written in PostScript, and that is not a task for anyone who is not a PostScript expert.
You could probably do it with MuPDF, probably using MuJS to script it, but I don't know enough to be certain. It would still require some coding effort, but JavaScript would probably at least be easier to work with.
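Whatever tool ends up writing the links, the detection half of the job can be sketched independently. The following Python sketch assumes the page text has already been extracted into one string per page and that a section-to-page mapping is available; both inputs, and the names SECTION_REF, LinkTarget and find_section_links, are hypothetical. It matches the "See Section 8.18 - ..." format quoted in the question and tolerates a line break inside the phrase.

```python
import re
from typing import Dict, List, NamedTuple

# "See Section 8.18": \s+ also matches a line break inside the phrase.
SECTION_REF = re.compile(r"See\s+Section\s+(\d+(?:\.\d+)*)")


class LinkTarget(NamedTuple):
    section: str      # section number as written, e.g. "8.18"
    source_page: int  # 1-based page where the reference appears
    target_page: int  # 1-based page the link should jump to


def find_section_links(pages: List[str],
                       section_pages: Dict[str, int]) -> List[LinkTarget]:
    """List every 'See Section N' reference whose target page is known.

    pages: extracted text, one string per page (assumed input).
    section_pages: section number -> page number (assumed input).
    """
    links = []
    for page_number, text in enumerate(pages, start=1):
        for match in SECTION_REF.finditer(text):
            section = match.group(1)
            target = section_pages.get(section)
            if target is not None:  # skip references to unknown sections
                links.append(LinkTarget(section, page_number, target))
    return links
```

This only tells you which links to create. Placing the actual link annotations still needs the on-page rectangle of each match, which has to come from a PDF library that reports text positions.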

Annotate the pdf file on the location clicked by user

I am having trouble finding a solution to the problem described below.
Annotate a PDF file when the user clicks on a specific location in it, and then finally save the PDF so that in future it opens at the annotated location.
How to approach this?
What I have tried:
I have looked at various libraries irrespective of programming language (since the programming language is not a constraint) and found a few, such as minipdf in Python and pdfbox in Java, to mention the relevant ones. I finally selected pdfbox since it seemed mature enough to get close to a solution.
The main hurdle now is how to get the location clicked by the user. Once I have the location, I can perform various actions, like annotating at the clicked location and then saving the PDF at that location.
It seems I have to write a whole lot of PDF JavaScript to approach it, but again, how to do so?

I had a similar problem and solved it another way. In my case I am not opening the PDF in Adobe Reader but in the browser. So what I did was convert the PDF to HTML using Python libraries (let me know if you are interested, and I will share different library names with their pros and cons).
That HTML can then be edited easily. We can put hyperlinks, highlights, everything there, as the source code is with us.
This workaround may be applicable to you if your front end is web based.
PS: Wanted to post this workaround as a comment, but couldn't due to not having enough reputation as of now. Hope it won't be downvoted :)
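On the "how do I get the clicked location" part of the question: if the page is rendered in a browser, the click arrives in pixels measured from the top-left corner, while PDF's default user space is measured in points (1/72 inch) from the bottom-left corner. A minimal Python sketch of that conversion follows. The function name click_to_pdf_point is made up, and it assumes an unrotated page whose visible area starts at the PDF origin and a known render scale in pixels per point.

```python
from typing import Tuple


def click_to_pdf_point(click_x_px: float,
                       click_y_px: float,
                       scale: float,
                       page_height_pt: float) -> Tuple[float, float]:
    """Convert a click on a rendered page to PDF user-space coordinates.

    click_x_px, click_y_px: click position in pixels from the top-left
        corner of the rendered page.
    scale: pixels per PDF point used when rendering (assumed known).
    page_height_pt: page height in points, e.g. 792 for US Letter.

    Assumes no page rotation and a visible area starting at (0, 0).
    """
    x_pt = click_x_px / scale
    # The screen y axis grows downward; the PDF y axis grows upward.
    y_pt = page_height_pt - click_y_px / scale
    return (x_pt, y_pt)
```

For example, a click at (200, 400) pixels on a US Letter page rendered at 2 pixels per point maps to (100.0, 592.0) in PDF points, which is the kind of coordinate a server-side library such as pdfbox would expect for an annotation rectangle.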

workflow for managing help content written by external co-workers

We develop a WPF application that has something like a context-sensitive help. The content of the help pages is currently written as Word documents by external colleagues (say, biologists) and then translated to XAML code by developers. This process is tedious and error-prone because the biologists don't see the XAML code and the Word documents can't easily be diffed and tracked in a version control system.
So we'd like to improve this process and maintain the content in a single place, in a format that
is simple to edit (preferably with a WYSIWYG editor),
is stored in a simple ASCII format (for diffing / version control) and
can be included automatically as a resource in our C# application.
The solution could be a framework, an external tool or any other idea.
The format should support simple HTML-like rendering such as bold and italic, superscripts, etc., and images.

I suggest using Flow Documents:
It is a WPF technology, so you will be working with tools you already know.
Flow documents can be edited in the RichTextBox WPF control. You can access the edited flow document via the RichTextBox.Document property. Then you can save it into a XAML file with XamlWriter. Taking all these things into account, you can easily and quickly create a simple application for your external colleagues.
Finally, you can load the saved XAML files into a FlowDocumentReader control in order to display them. It is described here.
The only thing I'm not sure about is whether flow documents can be embedded in resources. If that is not possible, I think the help files can be distributed separately. It doesn't seem to be a big problem.
Alternatively, instead of flow documents you can use the RTF format. RichTextBox can also be used to edit this kind of document.

I need some insight on PDF Bookmarks

I haven't done any in-depth programming to handle PDFs, only PDF creation with PHP.
I've been asked to join a project whose requirement is to generate PDF bookmarks with titles created from selected text.
The scenario goes like this:
The user highlights some text in a given PDF file.
The user is prompted to enter the starting page number for the chapter (bookmark).
A bookmark is created with a title which points to the given page number.
Multi-level bookmarks to handle sub-chapters (like child nodes) should be supported.
Due to some constraints, the client would prefer this to be a web app if possible.
What platform/language/technology/library would you recommend?
Is it doable in a browser? Should this be a desktop app instead?
I am fluent in PHP/JavaScript and capable in Python, with tiny bits of experience in handling PDF files (nothing further than generating formatted PDFs), plus I'm willing to learn anything new.
I've got some time to dig around and conceptualise it, so I'm very open to suggestions.
Any insight would be appreciated.
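Independently of which PDF library ends up writing the outline, the multi-level requirement in the scenario above comes down to turning a flat list of (level, title, page) entries into a tree. A minimal Python sketch of that step, with the Bookmark class and build_outline function invented for illustration:

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Bookmark:
    title: str
    page: int  # 1-based page the bookmark points to
    children: List["Bookmark"] = field(default_factory=list)


def build_outline(entries: List[Tuple[int, str, int]]) -> List[Bookmark]:
    """Turn (level, title, page) entries in document order into a tree.

    Level 1 is a top-level chapter, level 2 a sub-chapter, and so on.
    A level may go at most one step deeper than the previous entry.
    """
    roots: List[Bookmark] = []
    stack: List[Bookmark] = []  # stack[i] is the open bookmark at level i + 1
    for level, title, page in entries:
        if level < 1 or level > len(stack) + 1:
            raise ValueError(f"bad level {level} for bookmark {title!r}")
        node = Bookmark(title, page)
        del stack[level - 1:]  # close bookmarks at this level or deeper
        if stack:
            stack[-1].children.append(node)  # attach to the open parent
        else:
            roots.append(node)
        stack.append(node)
    return roots
```

In a web app, the browser would collect the highlighted title, the level and the page number from the user, and the server would walk this tree to write the outline with whatever PDF library is chosen.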

Document -> Flash viewer, not hosted

I've got a content management solution where we present scanned images (TIFF), PDFs and Word docs for viewing. While we can simply embed a PDF, depending on user preferences it's sometimes a bit fiddly and not intuitive for users.
I'd like a solution like Scribd, embedit, etc., but not hosted. I want to run the application on our own servers and manage it that way (for legal reasons, and because our clients won't buy the service if it's hosted somewhere else).
SWFtools looks a little basic for my needs, plus it doesn't do doc, docx or ppt.
Any options? It doesn't have to be free, but free would be ideal.

As far as I understand, Scribd uses swftools. And it is not basic; it is amazingly flexible. Convert everything into PDF and use swftools to convert the PDFs into SWF, or into something like what Scribd uses (SCB, as they call it, a modified SWF).

webSupergoo has a .NET component that will do this.
Their ABCpdf component can import and export a wide range of graphic and document formats, including those you've mentioned.
The installation also contains an SWF demo project that can be freely adapted and used as the basis for a Scribd-like service.
http://www.websupergoo.com/products.htm

You can try this alternative solution: FreepapeR.
It can display PDF documents. The PDF is converted using swftools (pdf2swf), either with PHP on the server side or locally by hand, and the user interface is written in AS3.
Hope this helps...
