I'm considering convincing my company to upgrade to Acrobat Pro so I can automate the processing of my scanned documents. Before I bring it up, I want to make sure the things I want to do are possible. I don't need anyone to give me the code, I just want to know if this is possible.
The documents I'm working with are landscape, 2-5 pages long, and have the filename and page numbers in the footer. I want to scan a big stack of them and have a script perform the following actions:
Use OCR to acquire the filename and page numbers for each page. I would like to restrict the OCR to only look at the footer to save time and RAM.
Using the filenames, I want it to detect when one document ends and the next one begins so they can be split into separate files.
Before saving the split files, check that the number of pages in the file matches the page total in the footer. (I work in a factory and the documents can get sticky, so my scanner frequently pulls two pages at once)
Instead of saving the files where the page total doesn't match, compile a list of the errors so I know which documents need to be rescanned.
Finally, save all correct documents with their filenames from the footer to a folder on my desktop.
This could save me hours a week, so I'm hopeful that it's all possible. Thanks
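To make the splitting and checking steps concrete, here is a minimal Python sketch of the logic only. It is not Acrobat code, and it assumes the OCR step has already produced one (filename, page number, page total) record per scanned page. The function name `split_and_check` and the record layout are made up for illustration.

```python
from itertools import groupby


def split_and_check(footers):
    """Group scanned pages into documents and verify their page counts.

    footers: list of (filename, page_no, page_total) tuples, one per
    scanned page, in scan order (hypothetical output of the OCR step).
    Returns (good, errors): good holds (filename, first_index, last_index)
    with 0-based indices into the scan; errors holds readable messages.
    """
    good, errors = [], []
    index = 0
    # A new document starts whenever the filename in the footer changes.
    for name, run in groupby(footers, key=lambda record: record[0]):
        run = list(run)
        first, last = index, index + len(run) - 1
        index += len(run)
        seen = [page_no for _, page_no, _ in run]
        total = run[0][2]
        if seen == list(range(1, total + 1)):
            good.append((name, first, last))
        else:
            errors.append(f"{name}: expected pages 1-{total}, got {seen}")
    return good, errors


scan = [("A.pdf", 1, 2), ("A.pdf", 2, 2), ("B.pdf", 1, 3), ("B.pdf", 3, 3)]
good, errors = split_and_check(scan)
print(good)
print(errors)
```

A document whose page was swallowed by the scanner ends up in the error list instead of being saved, which matches the rescan list described above.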
I'm using a Publisher document as a template to create fitting instructions for our products. Every time we launch a new product, an individual instruction is produced, which involves a lot of copying and pasting and then translating the master document into 4 different languages.
Although each instruction is specific to a product, there are only 5 sets of instructions, each with its own wording (which doesn't change) and pictures. The layout of the document is the same across all 5.
I was thinking of creating a user form to enter the product name, choose the required set, insert photos and save the new doc as .pub and .pdf (only in English for now; I want to get this running first).
I experimented with Access and mail merge, but it doesn't work the way I need it to. So I reverted to using VB in Publisher, where I've basically been able to return the text boxes. However, I can't see a way to display the code of the entire document with all text boxes and their formatting. Is this possible, or would I have to code the entire document from scratch?
Thanks for reading and your input.
I was recently given some really old but very useful handwritten notes and, in a bid to preserve them, I had them scanned into a PDF file. What I have is a 35-page PDF, but I want to add a contents page at the beginning so that I can use the first page to click my way to a specific topic.
More precisely,
I want a page which says
Topic 1
Topic 2
Topic 3
...
Each one should be linked to a page of my choosing.
I've explored a lot of standard tools to help me with this, like LibreOffice, pdftk etc., but the solution does not appear to come in the form of a simple application and a few clicks. My hunch is that this will require a program written in a suitable language. I'd want this program to work as follows:
ProgramName Input.pdf CustomTOC.txt
Here CustomTOC.txt could be a simple ASCII table containing two columns, the first being the title and the second being the page number. The output of this program would be another PDF file with one page prepended to Input.pdf, containing a table of contents with hyperlinks to the right pages.
I have managed to solve this problem, though I don't think this is the best way to do it. I have written a Python program that accepts two mandatory inputs: the input PDF file and a '|'-separated ASCII table containing titles and page numbers. An optional third argument is the name of the output PDF file. If this is not provided, the original input file is overwritten.
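As an illustration of the table format described above, here is a minimal sketch of parsing the '|'-separated TOC file. This is not the posted program; the function name `parse_toc` is hypothetical, and it assumes one "title | page" entry per line with the page number after the last '|'.

```python
def parse_toc(text):
    """Parse '|'-separated TOC text into a list of (title, page) tuples.

    Assumes one entry per line, with the page number after the last '|',
    so titles may themselves contain '|'. Blank lines are skipped.
    """
    entries = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue
        title, sep, page = line.rpartition("|")
        if not sep:
            raise ValueError(f"line {line_no}: missing '|' separator")
        entries.append((title.strip(), int(page)))
    return entries


print(parse_toc("Topic 1 | 3\nTopic 2 | 17\n"))
```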
How does the code work? It uses a system call to 'pdftk' to burst the PDF file into its constituent pages. It then writes a .tex file that contains a \listoffigures command for the first page, with the hyperref package ensuring that each entry links to its figure. The latter part of the .tex file contains figure insertion statements, one per page, each inserting the PDF file for that page; captions are given only to those pages that have an entry in the provided TOC table.
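The .tex generation step might look roughly like the sketch below. It is a reconstruction under assumptions, not the actual program: the function name `build_tex` is made up, the page files are assumed to carry pdftk's default burst names (pg_0001.pdf, pg_0002.pdf, ...), graphicx is used for \includegraphics, and titles are inserted without any LaTeX escaping. It does not address page margins or offsets.

```python
def build_tex(num_pages, toc):
    """Return LaTeX source that lists captioned pages and includes every page.

    num_pages: number of burst page files (assumed named pg_0001.pdf, ...).
    toc: dict mapping a 1-based page number to its title. Titles are
    inserted verbatim, so LaTeX special characters must be escaped
    beforehand.
    """
    lines = [
        r"\documentclass{article}",
        r"\usepackage{graphicx}",
        r"\usepackage{hyperref}",
        r"\begin{document}",
        r"\listoffigures",  # becomes the clickable contents page
        r"\clearpage",
    ]
    for page in range(1, num_pages + 1):
        lines.append(r"\begin{figure}[p]")
        lines.append(r"\centering")
        lines.append(r"\includegraphics[width=\textwidth]{pg_%04d.pdf}" % page)
        if page in toc:
            # Only captioned figures show up in the list of figures.
            lines.append(r"\caption{%s}" % toc[page])
        lines.append(r"\end{figure}")
        lines.append(r"\clearpage")
    lines.append(r"\end{document}")
    return "\n".join(lines) + "\n"


print(build_tex(2, {2: "Topic 1"}))
```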
Why is the code not ideal? It relies on too many dependencies: a system call to pdftk, plus a LaTeX installation with the graphics package. In the current version of the code, the PDFs on each page have some offset, which I am trying to fix using the geometry package with custom margin settings. I will try to post the code once this problem is solved.
A more ideal solution would not require LaTeX and would use a PDF library within Python to achieve the same effect. Comments and suggestions welcome!
I have a template for a Hipster PDA (you remember those, don't you?) that shows four copies of the same card on one page, then four copies of the next card on the next, and so forth. I would like to rearrange things so that each page only has one copy of each card, so I can print four distinct cards to a page without wasting a lot of paper. I did something vaguely similar to this years ago, but that involved hand-editing a lot of PostScript and took forever. I would like some sort of command-line solution that would cut a different quadrant from each page and then paste four of them onto a single new page.
You might try and get what you want in two steps:
Set up a CropBox for each of the pages so that only one copy of a card lies within the CropBox.
Use PDF imposition software to make new pages from 4 "old" ones.
For the latter you could try the Multivalent Impose tool.
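For the first step, the CropBox coordinates can be computed from the page size. The sketch below only does the arithmetic and does not modify any PDF; it assumes the four cards sit in a 2x2 grid of equal quadrants, and the function name `quadrant_box` is hypothetical. PDF coordinates have their origin at the bottom-left corner, so the top row has the larger y values.

```python
def quadrant_box(media_box, quadrant):
    """Return the (llx, lly, urx, ury) box of one quadrant of a page.

    media_box: (llx, lly, urx, ury) of the full page in PDF points.
    quadrant: 0 = top-left, 1 = top-right, 2 = bottom-left, 3 = bottom-right.
    Assumes the cards occupy four equal quadrants of the page.
    """
    llx, lly, urx, ury = media_box
    mid_x = (llx + urx) / 2
    mid_y = (lly + ury) / 2
    col, row = quadrant % 2, quadrant // 2
    x0, x1 = (llx, mid_x) if col == 0 else (mid_x, urx)
    # PDF y grows upward, so row 0 (top) is the upper half.
    y0, y1 = (mid_y, ury) if row == 0 else (lly, mid_y)
    return (x0, y0, x1, y1)


# Pick a different quadrant for each of four consecutive US Letter pages.
for page_index in range(4):
    print(page_index + 1, quadrant_box((0, 0, 612, 792), page_index % 4))
```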
Nowadays it is more practical to purchase an ebook than the dead-tree version. But the PDFs frequently contain the blank pages used by the print edition. I typically see between 10 and 30 blank pages (or pages with the text "This page intentionally left blank.") per ebook. Is it possible to programmatically remove these blank pages? Currently I manually identify the blank pages and then run the file through this:
pdftops orig.pdf - | psselect "$range_of_non_blank_pages" | ps2pdf - new.pdf
So the hard part is identifying the blank pages. pdftotext would work for the most part, except where the page has only images and no text.
Also, even after removing many pages and seeing the resulting file size is smaller, after shrinking both the original file and the new version (using various methods found on the internets), the original file is usually smaller by several hundred KB or more. So it appears the method I'm using to remove the blank pages doesn't create an optimal pdf. I've also tried various gui programs and see the same results in this respect.
Partial answer: you don't need to go via PostScript (this is probably the reason why you get a bigger file). One possibility is
pdftk orig.pdf cat "$range_of_non_blank_pages" output new.pdf
To identify blank pages, you'd need to use a tool that can go beyond selecting and reassembling pages. Try a library for a scripting language, for example CAM::PDF or PDF::API2 in Perl.
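As a rough illustration of the text-based approach mentioned in the question, here is a Python sketch that turns per-page text into page ranges for the pdftk command above. It assumes the text of each page has already been extracted (for example with pdftotext) into a list with one string per page; the function names are hypothetical. As noted in the question, a page that contains only images has no text and would be wrongly treated as blank.

```python
BLANK_NOTICE = "this page intentionally left blank"


def is_blank(page_text):
    """Treat a page as blank if it has no text or only the usual notice."""
    normalized = " ".join(page_text.split()).lower().rstrip(".")
    return normalized in ("", BLANK_NOTICE)


def non_blank_ranges(page_texts):
    """Return page ranges such as ['1', '3-4', '6'] for the non-blank pages.

    page_texts: one string per page, in order; pages are numbered from 1.
    """
    keep = [n for n, text in enumerate(page_texts, start=1) if not is_blank(text)]
    ranges = []
    for n in keep:
        if ranges and ranges[-1][1] == n - 1:
            ranges[-1][1] = n  # extend the current run of consecutive pages
        else:
            ranges.append([n, n])
    return [f"{a}-{b}" if a != b else str(a) for a, b in ranges]


pages = ["Title", "", "Chapter 1", "text", "This page intentionally left blank.", "more"]
print(" ".join(non_blank_ranges(pages)))
```

Each range should be passed to pdftk as a separate argument after `cat`.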
I don't know of an open source solution that can detect and remove blank pages. However, Apago's commercial PDF Enhancer can automatically remove blank pages -- both vector and scanned. For scanned pages, it can remove scan artifacts such as black edges, hole punches and noise before determining whether a page is blank.
I have a directory of PDFs. They are all different, but they all have 5 pages. I need to insert a bar code on each page for each PDF. After this process I need to combine and decollate every PDF. Essentially there would be 5 different PDFs created. The first would contain all page ones from every PDF, the second the second page, etc.
I need to find a tool, or a toolset, that would allow me to accomplish this. I'm willing to program my own solution but I'm not even sure what would be the most efficient language to attack it with.
What I ended up doing was using Perl with the PDF::Reuse and PDF::Reuse::BarCode libraries to get all the PDFs in the directory, pull the pages I wanted, add the barcode and save out to a new PDF.
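The decollation itself is a simple regrouping, sketched below in Python rather than Perl. This only builds the plan of which source page goes into which output and does not touch any PDF or barcode library. The function name `decollate` is hypothetical, and processing the files in sorted name order is an assumption.

```python
def decollate(pdf_names, pages_per_pdf=5):
    """Plan the decollated outputs.

    Returns a dict mapping each page number to the ordered list of
    (source_pdf, page_number) pairs that make up that output file.
    Sources are processed in sorted name order (an assumption).
    """
    return {
        page: [(name, page) for name in sorted(pdf_names)]
        for page in range(1, pages_per_pdf + 1)
    }


plan = decollate(["b.pdf", "a.pdf", "c.pdf"])
for page, sources in plan.items():
    print(page, sources)
```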