I'd like to recognize the text of all pdfs on my computer and save them without moving them from their locations. Is it possible? - pdf

I've tried using Adobe Acrobat X Pro to "recognize text in multiple files."
When I started this process and it asked for the directory, I chose C:, my main hard drive.
It took hours to load, and when it did, the list of files it generated included Word documents as well. Adobe said I couldn't proceed until I removed the problem files.
After I removed all the PDFs Adobe flagged as having errors (like password protection), the prompt remained, so I assumed it meant the Word documents in the list.
So I manually removed those too. But Adobe still said that I couldn't proceed until problem files were removed, even though none of the files remaining in the list were flagged as having issues.
My firm is trying to make sure all the PDFs we have are searchable. Currently, some are and some aren't. Our goal is to make them all searchable without removing them from their varied locations.

I think you can do this using a combination of:
regular Java: to list all files in a directory that match a given criterion (e.g. their name ends with '.pdf')
iText: to iterate over the PDF document and extract all images
Tess4J: a Java wrapper for Tesseract (Google's OCR engine), to turn the extracted images back into text
Unless I am much mistaken, Tesseract even offers a crude version of this workflow for you, but only for one PDF at a time. So you'd still need some Windows/Linux scripting to pipe in all the files of a given directory.
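The file-listing step is not tied to Java. As a minimal sketch in Python, assuming the OCR itself is supplied by you as a function, the walk over a drive could look like this. The names find_pdfs, ocr_all and ocr_in_place are hypothetical and only illustrate the idea; ocr_in_place stands for whatever engine you end up using and is expected to overwrite the file where it lives.

```python
import os
from pathlib import Path
from typing import Callable, Iterator, List


def find_pdfs(root: Path) -> Iterator[Path]:
    """Yield every file under root whose name ends in .pdf (any letter case)."""
    # os.walk skips directories it cannot read instead of raising,
    # which matters when the root is a whole drive such as C:.
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.lower().endswith(".pdf"):
                yield Path(dirpath) / name


def ocr_all(root: Path, ocr_in_place: Callable[[Path], None]) -> List[Path]:
    """Run ocr_in_place on each PDF and return the files that failed.

    ocr_in_place is a hypothetical callback; it must write its result
    back to the same path so that no file changes location.
    """
    failed = []
    for pdf in find_pdfs(root):
        try:
            ocr_in_place(pdf)
        except Exception:
            # Password-protected or damaged files end up here; keep going
            # rather than aborting the whole batch.
            failed.append(pdf)
    return failed
```

Collecting failures instead of stopping avoids the situation from the question, where a single problem file blocks the entire run.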

Related

Can QPDF utility be used to extract attachments from a PDF file?

I have a PDF file with other PDF files attached to it. Acrobat shows them in "Attachments" tab and allows to open them in turn.
The QPDF documentation says something about extracting attachments, but I failed to find any particular commands that do that.
Is it possible to extract these attachments and have them stored on the disk as separate PDF files?
UPDATE: Just a note to better explain what you see in the UI: the "Attachments" tab was present in older versions of Acrobat, as was a special page of the container document recommending that you download a newer version of Acrobat (this page seems to really exist, as it is shown in other viewers as well as in the preview image). The latest versions of Acrobat (Reader) skip this page and take you to the first attached document, with the list of all attachments shown on the left side of the screen.
I found an old GitHub issue that clarifies the possibilities for attachment extraction a little.
It is possible to extract attachments from PDF files using the qpdf
library by understanding the PDF file structure and pulling the
attachments out "manually" by knowing which objects to extract. There
is nothing in the public API at the moment nor in the command-line
tool that enables you to work with attachments as a first-class thing,
but there is an item in the TODO list, and there is some private code
used internally to detect cases where attachments are encrypted
differently from the rest of the file. The main reason, aside from
lack of time, that attachments are not more directly supported is
because there have been various ways that they are stored in the file,
and I don't know whether I have examples of all of them. I'm reluctant
to add a feature for attachments that may miss some attachments in
some older PDF files.
https://github.com/qpdf/qpdf/issues/24
So it seems to be possible, but you would have to examine the internal structure of the PDF file yourself.
Starting with qpdf 10.2, you can work with file attachments in PDF files from the command line. The available options are described in the manual:
http://qpdf.sourceforge.net/files/qpdf-manual.html#ref.attachments
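As a rough sketch of how those command-line options might be driven from a script, the following Python wraps qpdf's --list-attachments and --show-attachment=key options. It assumes qpdf 10.2 or later is on the PATH; the function names and the file names in the usage note are made up for illustration.

```python
import subprocess
from pathlib import Path
from typing import List


def list_attachments_cmd(pdf: str) -> List[str]:
    """Command that prints the attachment keys of a PDF."""
    return ["qpdf", "--list-attachments", pdf]


def show_attachment_cmd(pdf: str, key: str) -> List[str]:
    """Command that writes one attachment's bytes to standard output."""
    return ["qpdf", f"--show-attachment={key}", pdf]


def extract_attachment(pdf: str, key: str, dest: Path) -> None:
    """Save the attachment stored under key to dest.

    Raises subprocess.CalledProcessError if qpdf exits with an error.
    """
    result = subprocess.run(
        show_attachment_cmd(pdf, key), check=True, capture_output=True
    )
    dest.write_bytes(result.stdout)
```

For example, extract_attachment("container.pdf", "report.pdf", Path("report.pdf")) would store the attachment whose key is report.pdf as a separate file, where the key is one of those printed by the listing command.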

How do I make an offline front end for over 50 pdf documents?

I have over 50 training documents (PDFs) at work. I would like to create a 'front end' that a user can 'run', which would provide a convenient access portal to all the PDFs available.
It needs to be something I can drop onto my work colleagues' laptops (they don't have Office, but they do have Acrobat), and it needs to be editable so that more PDF training materials can be added as they are created.
I know that I could create a Word document that contained links to the PDFs, then convert that to a PDF itself. Or I could create an offline web page that linked to them, but I wondered if there was a better solution?
Like a way to compile an executable that would bring up a front-end and contain all the PDF files? I've seen similar things for car-repair manuals years ago, where you insert a CD, run an executable and get a nice front-end that essentially just allows you to browse PDF manuals.
Anyone know if this is possible and, if so, how to go about it?
Or does anyone know another viable solution to this?
Thanks
There are indeed various possibilities, depending on what the users have (Acrobat or Reader), and how you can control the distribution.
a) You create a front end PDF document which has links or buttons to open the subsequent documents residing in a subfolder or on the same level as the front end document.
b) You create a front end PDF document into which you embed the subsequent documents as Data Objects. You have buttons which export/open the embedded documents in a different window.
c) You create a front end PDF document into which you embed the subsequent documents as File Attachments (part of the Comments tools). You have buttons which open the embedded documents.
d) You create a PDF Portfolio in Acrobat, containing the subsequent documents, and maybe provide an overview page from which you can open the documents.
Of these four approaches, a) would run in the largest number of PDF viewers, including those on mobile devices. The downside is that the subsequent documents lie around loose, and your users may tamper with them.
The most elegant (and app-like) approach would be b). However, it requires a capable PDF viewer, and you would have to make sure that the user's viewer supports embedded documents.
Approach c) would be a compromise between integrity and portability. Approach d) would be quite nice for distribution, but it requires a PDF viewer by Adobe and will most likely not work in any mobile viewer.
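The question also mentions an offline web page as an option. If that route is taken, the page can be regenerated by a small script whenever documents are added, which covers the "edited/added to" requirement. This is only a sketch: the function names, the default title and the index.html file name are assumptions, and it expects all PDFs to sit in one folder next to the generated page.

```python
import html
from pathlib import Path
from urllib.parse import quote


def build_index(folder: Path, title: str = "Training documents") -> str:
    """Return an HTML page with one relative link per PDF in folder."""
    items = []
    for pdf in sorted(folder.glob("*.pdf")):
        href = quote(pdf.name)        # percent-encode spaces and the like
        label = html.escape(pdf.stem)  # file name without the extension
        items.append(f'<li><a href="{href}">{label}</a></li>')
    safe_title = html.escape(title)
    return (
        "<!DOCTYPE html>\n"
        f'<html><head><meta charset="utf-8"><title>{safe_title}</title></head>\n'
        f"<body><h1>{safe_title}</h1>\n<ul>\n"
        + "\n".join(items)
        + "\n</ul></body></html>\n"
    )


def write_index(folder: Path) -> Path:
    """Write index.html into folder and return its path."""
    out = folder / "index.html"
    out.write_text(build_index(folder), encoding="utf-8")
    return out
```

Because the links are relative, the folder can be copied to any laptop and opened in a browser without a server.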

How can I convert old XIF image files (Pagis, Xerox) to PDF (or another format)?

Years ago I began scanning my "important" documents using Pagis, software that came with my HP scanner. Eventually I began to scan to PDF (as the scanner software became able), but I still had many old XIF files. The Pagis software runs only on 32-bit Windows, which is becoming less and less common. In fact, I have a Win32 system I've kept alive just to retain access to the XIF files.
I can convert these files using Adobe Acrobat (or equivalent) "simply" by opening the XIF viewer, then printing the doc to the Adobe PDF "printer". Unfortunately I have enough files that this manual process would take many years.
So, what's the best way to convert a large number of XIF files to PDF?
I recently found SikuliX, a scripting tool intended mainly for GUI testing. It differs from most such tools I have seen (e.g., Selenium) in that it is purely image-based and does not care what the underlying technology is (HTML, XAML, etc.).
It took me about an hour to learn enough to write a script to open the XIF viewer, select the PDF "printer", click the button to print, fill in the desired output file name (XIF viewer truncated to short name if left alone), and then wait for the print to complete. The script then moved to the next XIF file. (I fed the script a file listing all of the XIF file paths on the drive.) I was using Nitro PDF rather than Adobe.
The script ran for a couple of days (I didn't say it was fast!), but converted all but a few of the files. From time to time it would stall and I'd have to modify the script a bit (increase wait time for UI to change, etc.)
There are probably not many folks facing this particular conversion problem, but I've been looking for a good solution literally for years. So, if you're in the same boat then this is a way to get to shore!

Get selected "PostScript" from PDF

I wasn't able to find anything on the internet and I get the feeling that what I want is not such a trivial thing. To make a long story short: I'd like to get my hands on the underlying code that describes the PDF document of a selected area from a .pdf file. I've been looking for libraries or open source readers but couldn't find anything useful yet.
Does there exist something that might be able to accomplish my needs here or anything that might be reused (like an open source reader) to get there a little faster and not having to write everything from scratch?
You can convert a whole PDF document to PostScript using pdftops, one of the utilities from the poppler PDF rendering library.
This utility enables you to convert individual pages, which is at least a start.
If you just want to extract bitmapped images, try pdfimages from the same package. This extraction can also be restricted to individual pages.
The poppler library was originally written for UNIX-like systems, but there are a couple of Windows builds available.
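To restrict either tool to a single page, both pdftops and pdfimages accept -f (first page) and -l (last page). A small Python sketch that builds and runs such commands, assuming the poppler utilities are on the PATH (the function names and file names are only illustrative):

```python
import subprocess
from typing import List


def pdftops_page_cmd(pdf: str, page: int, out_ps: str) -> List[str]:
    """Command converting one page of pdf to a PostScript file."""
    return ["pdftops", "-f", str(page), "-l", str(page), pdf, out_ps]


def pdfimages_page_cmd(pdf: str, page: int, prefix: str) -> List[str]:
    """Command extracting the bitmapped images of one page.

    pdfimages names its output files after prefix.
    """
    return ["pdfimages", "-f", str(page), "-l", str(page), pdf, prefix]


def run(cmd: List[str]) -> None:
    """Run a command and raise CalledProcessError if it fails."""
    subprocess.run(cmd, check=True)
```

For example, run(pdftops_page_cmd("manual.pdf", 3, "page3.ps")) would convert only page 3. This still gives you the whole page rather than a selected area.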
The open source tool from iText called iText RUPS does what you want: it shows you all the PDF commands for a particular PDF and allows you to visualize the structure and relationships.
http://sourceforge.net/projects/itextrups/

Does the string <!-FTCACHE-1-> in a PDF file mean anything?

My program downloads a PDF file from a source location every day. When I see the binary text of the PDF file in Notepad, I find that sometimes the PDF file has the string <!-FTCACHE-1-> at the end. Sometimes this word is missing from the PDF file.
My program downloads this PDF daily and compares it with the previous day's PDF file using the Windiff binary comparison.
99% of the time, Windiff reports differences in the PDF file just because one PDF contains the string <!-FTCACHE-1-> at the end.
Does anyone know what the reason behind this is?
Thanks,
Praveen
<!--FTCACHE-1--> is generated by FatWire Content Server, a web content management solution that is probably generating your URL. FTCACHE means FutureTenseCache, the name of the original product component. The text is a "footer" flag that indicates to the caching module whether or not the page was properly generated. If the page is supposed to be cached, a 1 indicates that the page was properly built, and so is cacheable. If 0 is returned, it indicates that the page was corrupted and should not be cached. The Satellite Server caching engine is supposed to strip this footer once it reads it.
In other words, the key that is there to ensure that the cache is not corrupted, is causing the corruption in your PDF.
This issue has been fixed in patches to FatWire ContentServer for quite some time now.
For your purposes, just ignore the string - strip it if you can.
Sorry about that. That was my bug. :-)
The application that generates the PDF file has a bug: the FTCACHE tag should not be there, as it is not a valid PDF construct. Its presence actually damages the PDF file by invalidating the FastWebView feature, as you have seen. It is safe to remove it before comparing the files.
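Removing the tag before the comparison could be done along these lines. This is a sketch, not a drop-in replacement for Windiff: the function names are made up, and the pattern accepts both spellings of the marker that appear in this thread (with one or two dashes), followed by optional trailing whitespace at the very end of the file.

```python
import re

# Matches the marker only when it is the last thing in the file.
_FTCACHE_TAIL = re.compile(rb"<!-{1,2}FTCACHE-\d-{1,2}>\s*\Z")


def strip_ftcache(data: bytes) -> bytes:
    """Return data without a trailing FTCACHE marker, if there is one."""
    return _FTCACHE_TAIL.sub(b"", data)


def same_pdf(path_a: str, path_b: str) -> bool:
    """Compare two files byte for byte, ignoring a trailing FTCACHE marker."""
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        return strip_ftcache(fa.read()) == strip_ftcache(fb.read())
```

Files that do not contain the marker pass through unchanged, so the function can be applied to every download.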
"FT" could be FreeType, the open source font engine. The comment probably comes from the software that generates the PDF. If you can somehow identify that, you could (assuming it is open source) perhaps take a look through it and see what causes it to emit the comment.
FreeType has a source folder dedicated to caching, the root source file there is called ftcache.c. It doesn't do a lot though, just #includes (!) the other source files.
Googling the string you see reveals several more or less random PDFs that seem to contain it.