Scanning file as searchable PDF - What's the workflow? [closed] - pdf

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
This question does not appear to be about a specific programming problem, a software algorithm, or software tools primarily used by programmers. If you believe the question would be on-topic on another Stack Exchange site, you can leave a comment to explain where the question may be able to be answered.
Closed 8 years ago.
I recently bought an Epson scanner so I can start digitizing a mountain of documents I've accumulated over the years. I've already learned how to scan documents into PDFs. However, I want to make sure my PDFs have searchable text. I think the technical term is OCR, but I'm thoroughly confused.
I can scan files into PDFs using my scanner alone. But if I understand correctly, I can't make them OCR-searchable unless I make Adobe Acrobat and/or ABBYY FineReader part of the workflow. (I'm using a Mac running Mavericks, by the way.)
I guess the first thing I need to ask is this: What software do I need to create a PDF that's OCR-searchable? Like I said, I already have the Epson scanner software installed, but it looks like I also need Acrobat and/or ABBYY FineReader.
I guess a second question I should ask is how do I know if a PDF has searchable text? Could I simply search for a word or phrase on a PDF page with a standard program like Dreamweaver or Apple's Spotlight? Thanks.

The scanner produces an image and saves it either in an image format or as a PDF. You then open the result in OCR software, such as ABBYY FineReader. You can also open it in Acrobat, which has OCR components built in. If you use Acrobat, the result is a searchable document, unless Acrobat was unable to recognize any readable characters. Other OCR software may save a PDF or another file format.
Another product has been mentioned in another answer; I don't know it, but it might be worthwhile having a look at it.
For the second question:
a) There is an Acrobat JavaScript Doc object method, getPageNumWords(); if this method returns a number greater than 0, the page you passed as the argument has searchable text. You can find more information about this method in the Acrobat JavaScript documentation, which is part of the Acrobat SDK, downloadable from the Adobe website.
b) There is a preflight check that finds out whether the page or document has Text objects. If so, it has searchable text. You will need Acrobat Pro for this, however.
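If Acrobat is not available, a rough stand-in for the two checks above is to look for font resources in the file, since any text on a page (including the invisible text layer that OCR adds) needs a font. The sketch below is a heuristic and not what getPageNumWords() or preflight do internally; the function name `has_font_resources` is made up for illustration, and the check can be fooled (for example by a scanned page that only carries a stamped header).

```python
import re
import zlib

# Raw bytes between a "stream" keyword and the following "endstream".
_STREAM_RE = re.compile(rb"(?<!end)stream\r?\n(.*?)endstream", re.DOTALL)


def has_font_resources(pdf_bytes: bytes) -> bool:
    """Heuristic: True if the PDF references any font at all."""
    if b"/Font" in pdf_bytes:
        return True
    # Newer PDFs may keep dictionaries inside Flate-compressed object
    # streams, so inflate each stream where possible and look there too.
    for match in _STREAM_RE.finditer(pdf_bytes):
        try:
            inflated = zlib.decompressobj().decompress(match.group(1))
        except zlib.error:
            continue  # not Flate data (for example a JPEG image); skip it
        if b"/Font" in inflated:
            return True
    return False
```

Reading a file with `open(path, "rb").read()` and passing the bytes in gives a quick yes/no. A result of False strongly suggests an image-only scan; a result of True only says that fonts are present, not that the OCR text is accurate.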

You can scan to a multi-page TIFF image and let Tesseract 3.03 create a searchable PDF for you.
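As a minimal sketch of that Tesseract step, the snippet below builds and runs the command line from Python. It assumes the `tesseract` executable (3.03 or later) is on the PATH; the file names and the helper names `tesseract_pdf_command` and `make_searchable_pdf` are hypothetical.

```python
import subprocess
from pathlib import Path


def tesseract_pdf_command(tiff_path: str, output_base: str, lang: str = "eng") -> list[str]:
    """Build the command line; the trailing "pdf" selects Tesseract's PDF output."""
    return ["tesseract", tiff_path, output_base, "-l", lang, "pdf"]


def make_searchable_pdf(tiff_path: str, output_base: str, lang: str = "eng") -> Path:
    """Run Tesseract and return the path of the PDF it writes."""
    subprocess.run(tesseract_pdf_command(tiff_path, output_base, lang), check=True)
    # Tesseract appends ".pdf" to the output base name itself.
    return Path(output_base + ".pdf")
```

Calling `make_searchable_pdf("scan.tif", "scan")` would write `scan.pdf` next to the input, with the scanned image on each page and the recognized text as a searchable layer.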

Most solutions are to use the scanner to generate an image file (like a nonsearchable PDF), then to move your body from your scanner over to your computer, log in, run some unwieldy outrageously priced software called ABBSGDS or something, click a ton of menu buttons, respond to a ton of dialogue boxes, twiddle your thumbs as you watch the OCR progress bar, and voila--a searchable PDF.
Or, you can get a Canon scanner (e.g. DR-M160) and use their free CaptureOnTouch software. In that case, you put a document in the scanner, choose a number on the scanner, and press scan. A few seconds later (even on a slow computer) a fully OCR'd searchable PDF will be in the directory programmed to the number you selected. You never even have to touch your computer (although it must be on, of course).
Anything else is, in my opinion, utterly worthless for a busy office environment where you are scanning dozens of multi-page documents per day. I, for example, stand by my scanner dropping in document after document in rapid succession. I never go to my computer, and all of my documents are searchable PDFs just about as fast as I can drop them in.
If anyone knows of a software solution with that kind of workflow that works with general scanners, please let me know. I just made the mistake of buying a Lexmark multifunction that, since it came with ABBYYwhatever software, is effectively a unifunction.


How to make a PDF responsive [closed]

We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.
Closed 6 years ago.
I have a PDF file that I want to make responsive so that it can be viewed on desktops as well as on mobile devices. Responsive in the sense that it should not only fit the page based on the device size, but the content inside the PDF (text, images) should also adapt when viewed on a mobile device. Just like the image shown below, the PDF content should be aligned based on the device. Is there any API or library to achieve this?
Thanks in advance. Please help me to achieve this.
As indicated by other answers, PDF's primary function was to be a visual representation of content, and that visual representation should typically be identical across different platforms, readers, and devices. That was the goal of the file format, and it is diametrically opposed to file formats such as XML that are all about structure.
However, in recent years PDF has gained additional functionality that may help with this. PDF files now support tagging, and the purpose of tagging is to add structure to the file. A properly tagged PDF file knows where the paragraphs of text are, which elements are headers, which are lists, and so on. That information can, in theory, be used to support (limited) responsiveness.
For example, see the link here (https://helpx.adobe.com/acrobat/using/reading-pdfs-reflow-accessibility-features.html) where Adobe explains how the reflow view in Acrobat Pro works. It states that Acrobat can use the tagging structure inside a PDF file (or even automatically create some semblance of tagging on the fly for documents that are not tagged) to give you a view of the PDF file that adjusts itself to the available display size.
Whether or not this is going to work depends mostly on the reader technology you will be using on your mobile device and you should certainly not confuse the possibilities of this with full responsiveness where content is hidden, replaced, adjusted, repositioned etc... such as what you can accomplish with HTML and CSS on web sites.
But it is a start.
It cannot be done. PDF is a final layout. Unlike the web page, where you are never sure what you're getting, the whole purpose of a PDF is to look the same no matter what device, or even medium, you're accessing it from. It basically says, "there will be the phrase 'Hello, World' in this font, this point size, at these x and y coordinates". You might as well try to reflow a hardcover book to fit into your pocket better.
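The "this font, this point size, at these x and y coordinates" description maps quite directly onto PDF page content operators. As a small illustration (the helper name `text_at` and the font resource name `F1` are made up), this Python sketch builds the content-stream fragment that places a phrase at a fixed position:

```python
def text_at(font: str, size: int, x: int, y: int, text: str) -> str:
    """Return a PDF content-stream fragment that draws text at a fixed point."""
    # Backslashes and parentheses must be escaped inside a PDF string literal.
    escaped = text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")
    # BT/ET delimit a text object, Tf selects font and size,
    # Td moves to (x, y) in points, and Tj shows the string.
    return f"BT /{font} {size} Tf {x} {y} Td ({escaped}) Tj ET"


print(text_at("F1", 12, 72, 720, "Hello, World"))
# prints: BT /F1 12 Tf 72 720 Td (Hello, World) Tj ET
```

Nothing in that fragment says "paragraph" or "wrap to the available width", which is why reflow has to rely on tagging or on guesswork by the viewer.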

PDF with fillable, saveable form using open-source software [closed]

Closed 7 years ago.
My question is an extension of this one.
Is there any way to create a PDF that contains a fillable, saveable form using open-source software? Any development effort or library to this end gets points. Any software other than Acrobat gets points too.
Update as of February 2013
According to this answer Adobe Reader XI allows saving any kind of PDF forms. I tested it myself and it worked.
My old answer:
If you want to generate PDF forms that can be filled out and saved using Adobe Acrobat Reader, then you are out of luck. This kind of PDF file contains an encrypted digital certificate that only Adobe Acrobat can generate. Adobe Acrobat Reader verifies the presence of this certificate in a PDF form before enabling the option to save modifications.
Your choices are then to use Adobe Acrobat to generate the forms, or to use alternative ways of getting your PDF files with the filled data inside. One common approach is to include a submit button on your PDF file that posts the values of your fields to a web server, then you can fill out your PDF file there using a library of your choice.
Here is an example that uses this approach with the commercial library Amyuni PDF Creator.
Editing PDF Forms (AcroForms) within a Silverlight Application (Usual disclaimer applies)
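For the submit-button approach described above, the server first has to decode the posted field values before any PDF library fills them into a template. A minimal sketch of that decoding step, assuming the form's submit action is configured to send HTML-style (URL-encoded) data rather than FDF or XFDF, and using a hypothetical helper name and field names:

```python
from urllib.parse import parse_qsl


def parse_submitted_fields(body: bytes, encoding: str = "utf-8") -> dict[str, str]:
    """Decode a URL-encoded form post into a {field name: value} mapping."""
    # keep_blank_values=True preserves fields the user left empty.
    pairs = parse_qsl(body.decode(encoding), keep_blank_values=True)
    return dict(pairs)


fields = parse_submitted_fields(b"name=Ada+Lovelace&city=London&notes=")
print(fields)
# prints: {'name': 'Ada Lovelace', 'city': 'London', 'notes': ''}
```

The resulting dictionary can then be handed to whichever PDF library is used on the server to fill the matching AcroForm fields.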
You can create documents and Export to PDF, fillable or not, using OpenOffice. I've done it and it's pretty easy. The not so easy part is setting up the submission of the filled out data.
OpenOffice and LibreOffice since version 3.2 have the ability to create fillable PDF forms. The only thing I can't get working properly in them is calculations. But for everything else these free open source office suites are great, including combo boxes with choices!
You can even set up a submit-to-email button very simply, no coding required! Wow. OpenOffice and LibreOffice are fantastic for creating fillable PDF forms that work!
Give it a shot. You have nothing to lose and it won't cost you a cent.
I think this should work. Try PDF Form Designer; it's an open-source application. See here
My go-to open-source .NET PDF library is iTextSharp. I'm not sure whether it supports fillable forms, though, as I've never needed to do that. Worth a look anyway.
With the latest version of Adobe Reader, Adobe Reader XI, it seems that you can save the form.
From their webpage: "Type your responses right on the PDF form, or click through and fill in the form fields. Then save and submit."

Need to print a PDF from .net and select different trays for output

My company is moving to a new system that has a very poor printing system in place, but it does create PDFs on the file system.
My boss has asked me to create an application to print all the PDFs for a given job number.
I've gotten the filesystem search working, and I have used the Acrobat SDK to open each file and find certain strings to determine which pages go where.
The problem I'm dealing with is that the Acrobat SDK doesn't seem to support choosing printer settings.
My first thought was: no big deal, I'll just change the default Windows printer and change the tray, so that the invoice part and equipment listing go to white paper from tray 1 and the remittance goes to blue paper from tray 2.
It seems like the PrintDocument class in .NET can handle a lot of printer settings, but I'm not sure whether a PDF can be used with a PrintDocument.
Looking for any advice or assistance.
Thanks,
Joshua
I found the answer was to use Win32.
Here was the website that helped me get through some of the hurdles:
http://edinkapic.blogspot.com/2011/01/how-to-set-printer-default-paper-bin-in.html
The underlying problem is that PDFs are a combination of vector graphics for the text and bitmapped images. It all needs to be rendered into a format the printer understands before it can be printed.
Ghostscript does this very nicely, and if you need to do it from .NET, Ghostscript.NET provides an excellent VB.NET interface.
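To give a feel for the Ghostscript route, here is a hedged sketch that builds the command line for sending a page range of a PDF to a named Windows printer. It is written in Python for brevity, not with Ghostscript.NET. The printer queue names are hypothetical, and the sketch assumes one workaround for tray selection: two Windows printer queues that point at the same device, each configured with a different default tray. Ghostscript itself is not choosing the tray here.

```python
def gs_print_command(pdf_path: str, printer: str, first_page: int, last_page: int,
                     gs_exe: str = "gswin64c") -> list[str]:
    """Build a Ghostscript command that prints pages first_page..last_page."""
    return [
        gs_exe,
        "-dBATCH", "-dNOPAUSE", "-dQUIET",   # run unattended, then exit
        "-sDEVICE=mswinpr2",                  # Windows printer device
        f"-dFirstPage={first_page}",
        f"-dLastPage={last_page}",
        f"-sOutputFile=%printer%{printer}",   # target a named printer queue
        pdf_path,
    ]


# Hypothetical split: invoice pages on white paper, remittance page on blue.
jobs = [
    gs_print_command("job1234.pdf", "Office-Tray1-White", 1, 2),
    gs_print_command("job1234.pdf", "Office-Tray2-Blue", 3, 3),
]
```

Each entry in `jobs` can be passed to `subprocess.run(..., check=True)` on a machine where Ghostscript is installed.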
Regarding "The problem I'm dealing with is that the Acrobat SDK doesn't seem to support choosing printer settings":
You can't use the desktop version of Acrobat for this, since it's not designed for unattended operation and requires a user interface. Also, I believe it violates Adobe's license.

.Net Tool or Library to compare one PDF to another PDF [closed]

Closed 5 years ago.
I am working on a project that currently takes a .tiff and compares the document in question against a defined template document. We are moving away from the .tiff format for a variety of reasons, but mainly because the new files will arrive as PDFs.
I see two potential solutions to the issue. First, convert the PDF to a .tiff and use the existing code.
Or second, use a PDF library that will compare the template PDF to the PDF that is received.
Because the PDF that is received will come from an outside source, we won't know for sure whether it is text based or image based, so the library or tool will have to be able to compare both.
Any suggestions on tools/libraries you have found helpful would be great!
Thank you in advance!
dj
How about i-net PDFC - it does a full content comparison - text, images, lines, header/footer-detection and so on. You can use it either on command line or with a GUI (2.0, currently in public beta-phase) or via API (I think we have an internal version being a .NET library).
Disclaimer: Yep, I work for the company who made this - so feedback highly appreciated.
What we ended up doing was using the Aspose.Pdf library.
I ended up learning there are two types of PDFs: image based and text based.
I did not have any issues comparing the text-based PDFs. However, when an image-based PDF was received, we had to convert it to a .tiff so that we could use Microsoft's MODI to compare it against our specified template, and the .tiff would come out as a blank image rather than the actual content of the PDF. The Aspose.Pdf library did cost some money; in the end, however, it did exactly what we needed and allowed us to meet our client's needs.
I think your method of comparing tiffs is the way to go, using ImageMagick or other library?
Converting PDF to images can also be done via ImageMagick with the help of Ghostscript.
http://www.imagemagick.org/script/compare.php
I have a C# wrapper for Ghostscript that may help; send me a mail (address on my profile) and I can send it to you.
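For the PDF-to-image step, a sketch of the Ghostscript invocation (built in Python here; the helper name, output pattern, and resolution are illustrative choices, and `gs` is assumed to be on the PATH) could look like this:

```python
def gs_render_command(pdf_path: str, output_pattern: str = "page-%03d.tif",
                      dpi: int = 150, gs_exe: str = "gs") -> list[str]:
    """Build a Ghostscript command that renders each PDF page to a TIFF file."""
    return [
        gs_exe,
        "-q",                   # suppress startup messages
        f"-r{dpi}",             # rendering resolution in dots per inch
        "-sDEVICE=tiff24nc",    # 24-bit RGB TIFF output device
        "-o", output_pattern,   # one file per page; -o also implies batch mode
        pdf_path,
    ]
```

Rendering both the template and the received PDF with the same resolution and device keeps the resulting images directly comparable, for example with ImageMagick's compare tool.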
As far as I can see from your question, you want visual comparison of 2 PDFs, not structural comparison. (Because I can create you a thousand different PDF pages, which will have different internal structures and PDF source code, but will render identically on screen or on paper.)
In this case any comparison software will have to transform the 2 PDFs into raster images and compare those.
But since you have your own code already to do that for TIFFs, you can as well re-use it for PDFs (like you are considering already) which you convert to TIFFs.
Unless you find another, external tool that is better, faster, more precise, more funky, less resource-hungry... than your own solution! -- But that one will not be able to avoid converting the PDF pages to some sort of raster image before it can start the real visual comparison. (This may happen internally and unnoticeable for the user, but nevertheless it will have to take place...)
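Once both documents are raster images, the visual comparison itself can be very simple. The following is a minimal sketch, not the asker's existing TIFF code: it assumes each page has already been decoded into rows of grayscale values (0 to 255) of identical size, and the name `differing_fraction` and the tolerance parameter are illustrative.

```python
from typing import Sequence

Raster = Sequence[Sequence[int]]  # rows of grayscale pixel values, 0-255


def differing_fraction(a: Raster, b: Raster, tolerance: int = 0) -> float:
    """Return the fraction of pixels whose values differ by more than tolerance."""
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("rasters must have identical dimensions")
    total = sum(len(row) for row in a)
    if total == 0:
        return 0.0
    changed = sum(
        1
        for ra, rb in zip(a, b)
        for pa, pb in zip(ra, rb)
        if abs(pa - pb) > tolerance
    )
    return changed / total


template = [[255, 255], [0, 0]]
received = [[255, 250], [0, 255]]
print(differing_fraction(template, received))               # prints: 0.5
print(differing_fraction(template, received, tolerance=8))  # prints: 0.25
```

A small tolerance helps absorb scanner noise and anti-aliasing differences, so that only real content changes count toward the mismatch.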
Docotic.Pdf library can compare PDF documents for you.
Please have a look at Check that two PDF documents are equal sample.
We use this feature for regression testing of the library itself (yes, I am part of the library's dev team).

Component to view and annotate PDF documents [closed]

Closed 6 years ago.
Can anyone recommend a good Windows Forms component for displaying PDF documents and allowing users to add real annotations (by which I mean identical to those created by Adobe Reader)?
Update: I've tried the AxAcroPDF component that Adobe installs alongside Reader, but it doesn't support annotation. I basically want AxAcroPDF combined with Reader's "Comment & Markup Toolbar". It seems that the Foxit SDK ActiveX supports this, so I'm going to try that. I just thought that there would be some more alternatives to choose from.
There's also http://a.nnotate.com which you can use as a PDF / Word annotation component in web applications - just uses AJAX / JS / HTML and displays the pdfs properly in the browser without needing adobe reader. (see http://a.nnotate.com/embed-guide.html for a working demo)
For editing the documents, I have worked with Syncfusion's Essential PDF, and it worked quite well.
The free version of Foxit Reader does this. You can go to Tools->Commenting Tools->Note, then click anywhere on the page of the PDF to place a little note icon that has text inside. Then just save the PDF. Later, if someone views the PDF in Acrobat or Foxit, they can hover the mouse over or click on the little note icons on the page to view the comments.
If anyone's interested, it looks like we'll end up using jPDFNotes, from Qoppa Software.
To quote from the web site:
jPDFNotes is a Java™ bean that integrates into your application to display PDF documents and forms and allow your users to annotate the documents and fill the forms. After editing documents, the library can save them to a local file or the host application can override the save function to save the file to any location locally or on a network.
jPDFNotes is built on top of Qoppa's proprietary PDF technology so your users do not have to install Acrobat Reader or any other third party software or drivers. jPDFNotes is 100% Java so it is completely platform independent and so can run on Windows, Linux, Unix, Mac OSX and any other platform that supports the Java runtime environment.
It's not what we started looking for, but it seems to be exactly what we need. They seem a nice bunch of people too.