Power Automate: Is there an operation that can split PDFs based on shared text across pages?

Any advice on this would be appreciated! I'm a newbie to Power Automate and Flows, though I have watched a lot of tutorial content. I haven't seen a guide for exactly what I'm looking to do, so I was hoping an experienced user could provide some advice.
What I need to do is split a PDF into smaller PDFs grouped by the entity ID numbers that appear on each page. I can't just split on a fixed increment because some entities have more pages of data than others. Generally the PDF will be about 700 pages and will be split into about 300 PDFs grouped by entity. Currently this is a labor-intensive process, and automating it would be incredible.
I'm looking into doing it with an Encodian "split PDF by text" action, but that requires the text to be provided up front. What I need is a way to identify which pages have the same ID and group those pages into PDFs.
Does anyone have any experience doing something similar?
I have tried putting this together, but so far I have only found operations that split when they find a specific text string that must be provided to the operation. What I need is a way to find the entity ID on each page, group the pages for each entity together, and split each group into its own smaller PDF file.
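For reference, the grouping logic I'm after would look something like this outside Power Automate. This is only a rough sketch using iTextSharp 5.x; the "Entity ID" regex is a placeholder for whatever the real IDs look like, and it assumes each entity's pages are consecutive:

    using System.Collections.Generic;
    using System.IO;
    using System.Text.RegularExpressions;
    using iTextSharp.text;
    using iTextSharp.text.pdf;
    using iTextSharp.text.pdf.parser;

    class SplitByEntityId
    {
        static void Main()
        {
            // Placeholder pattern -- adjust to match the real entity IDs.
            var idPattern = new Regex(@"Entity ID:\s*(\d+)");
            var reader = new PdfReader("input.pdf");

            string currentId = null;
            var pages = new List<int>();

            for (int p = 1; p <= reader.NumberOfPages; p++)
            {
                string text = PdfTextExtractor.GetTextFromPage(reader, p);
                var m = idPattern.Match(text);
                // Pages where no ID is found are treated as continuations.
                string id = m.Success ? m.Groups[1].Value : currentId;

                if (id != currentId && pages.Count > 0)
                {
                    WriteGroup(reader, pages, currentId);
                    pages.Clear();
                }
                currentId = id;
                pages.Add(p);
            }
            if (pages.Count > 0) WriteGroup(reader, pages, currentId);
            reader.Close();
        }

        // Copy one entity's run of pages into its own output file.
        static void WriteGroup(PdfReader reader, List<int> pages, string id)
        {
            var doc = new Document();
            var copy = new PdfCopy(doc, new FileStream("entity_" + id + ".pdf", FileMode.Create));
            doc.Open();
            foreach (int p in pages) copy.AddPage(copy.GetImportedPage(reader, p));
            doc.Close();
        }
    }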

Related

Best duplex-enabled way to merge PDFs

I'm merging multiple multi-page source PDFs into one new result PDF for customers to print.
Now some source PDFs contain an even number of pages, and some contain an odd number (unpredictably).
Some customers print simplex, some print duplex. This is difficult because the simplex customers don't want empty pages between the documents, and the duplex customers don't want the end page of one document and the start page of the next on the same sheet.
What's the best way to offer a good experience for both types of customers?
Is there a PDF feature for marking document borders? I couldn't find anything...
[Edit]
To further clarify my problem: people upload PDF documents to my tool, merge them into one, and download the result. From a software point of view I am completely unaware of their printing configuration/habits/setup/devices. Thus I seem to need a PDF feature for storing "document borders" or "printing instructions" (document 1 goes from pages 1-3, document 2 from pages 4-11, ...), but this feature does not seem to exist; or else something that has the same effect and can be stored in the file, because that file is all the software produces.
[Edit 2]
An obvious solution to this problem would be asking the user whether he wants blank pages inserted after every merged document with an odd page count (except the last one), but this would ruin the digital reading experience of the PDF document.
There is no feature in the PDF specification for "sub-documents". A PDF document is an array of pages. If you are joining them together, then you are making one document of all the pages from the source documents.
It might be possible to use the Job Definition Format (JDF) to contain data describing the sub-document boundaries (as it's extensible XML). A JDF file can be stored within a PDF. However, your users would need software at their end that can parse the JDF file and act accordingly.
Alternatively, you could create two separate tools: one that adds blank pages to each source document with an odd number of pages, and one that doesn't. However, this would rely on your users exercising their judgment to select the correct one.
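For the padding tool, the core is only a few lines in iTextSharp 5.x (a sketch; the equivalent iText calls exist in Java). PdfCopy.AddPage(rect, rotation) inserts a blank page sized like the source:

    using System.IO;
    using iTextSharp.text;
    using iTextSharp.text.pdf;

    // Merge source PDFs, padding each odd-page document with a blank page
    // so that every document starts on a new sheet when printed duplex.
    static void MergeForDuplex(string[] sources, string target)
    {
        var doc = new Document();
        var copy = new PdfCopy(doc, new FileStream(target, FileMode.Create));
        doc.Open();
        foreach (string source in sources)
        {
            var reader = new PdfReader(source);
            for (int p = 1; p <= reader.NumberOfPages; p++)
                copy.AddPage(copy.GetImportedPage(reader, p));
            if (reader.NumberOfPages % 2 == 1)
                copy.AddPage(reader.GetPageSizeWithRotation(1), 0); // blank filler page
            reader.Close();
        }
        doc.Close();
    }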
Another course of action might be to advise those users with duplex printers that there's little merit in combining the PDFs, and that they should just send multiple PDF documents to their printer.

PDF file search then display that page only

I create a PDF file with 20,000 pages. Send it to a printer and individual pages are printed and mailed. These are tax bills to homeowners.
I would like to place the PDF file on my web server.
When a customer inputs a unique bill number on a search page, a search for that specific page is started.
When the page within the PDF file is located, only that page is displayed to the requester.
There are other issues (security, uniqueness of the bill number to search on) that can be worked out.
The main questions are: 1) Can this be done? 2) Is a third-party program required?
I am a novice programmer and would like to try and do this myself.
Thank you
It is possible, but I would strongly recommend a different route. Instead of one 20,000-page document, which might be great for printing, can you instead make 20,000 individual documents and just name them with something unique (bill number or whatever)? PDFs are document presentations and aren't suited for searching or even text-information storage. There are no "words" or "paragraphs", and there's not even a guarantee that text is written letter after letter. "Hello World" could be written "Wo", "He", "llo", "rld". Your customer's number might be "H1234567" but be written "1234567", "H". Text might be "in-page" but it also might be in form fields, which adds to the complexity. There are many PDF libraries out there that try to solve these problems, but if you can avoid them in the first place your life will be much easier.
If you can't re-make the main document then I would suggest a compromise. Take some time now and use a library like iText (Java) or iTextSharp (.Net) to split the giant document into smaller documents arbitrarily named. Then try to write your text extraction logic using the same libraries to find your uniqueifiers in the documents and rename each document accordingly. This is really the only way that you can prove that your logic worked on every possible scenario.
Also, be careful with your uniqueifiers. If you have accounts like "H1234" and "H12345" you need to make sure that your search algorithm is aware that one is a subset (and therefore a match) of the other.
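A minimal guard for that, assuming the extractor preserves whitespace around the number (C#):

    using System.Text.RegularExpressions;

    // "H1234" must not count as a hit inside "H12345": wrap the bill
    // number in word boundaries so only the exact token matches.
    static bool PageMatchesBill(string pageText, string billNumber)
    {
        return Regex.IsMatch(pageText, @"\b" + Regex.Escape(billNumber) + @"\b");
    }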
Finally, and this depends on how sensitive your client's data is, but if you're transporting very sensitive material I'd really suggest you spot-check every single document. Sucks, I know; I've had to do it. I'd get a copy of Ghostscript and convert all of the PDFs to images, and then just run them through a program that can show me the document and the file name all at once. Google Picasa works nicely for this. You could also write a Photoshop action that crops the document to a specific region and then just use Windows Explorer.

If identifying text structure in PDF documents is so difficult, how do PDF readers do it so well?

I have been trying to write a simple console application or PowerShell script to extract the text from a large number of PDF documents. There are several libraries and CLI tools that offer to do this, but it turns out that none are able to reliably identify document structure. In particular I am concerned with the recognition of text columns. Even the very expensive PDFLib TET tool frequently jumbles the content of two adjacent columns of text.
It is frequently noted that the PDF format does not have any concept of columns, or even words. Several answers to similar questions on SO mention this. The problem is so great that it even warrants academic research. This journal article notes:
All data objects in a PDF file are represented in a visually-oriented way, as a sequence of operators which...generally do not convey information about higher level text units such as tokens, lines, or columns—information about boundaries between such units is only available implicitly through whitespace
Hence, all extraction tools I have tried (iTextSharp, PDFLib TET, and Python PDFMiner) have failed to recognize text column boundaries. Of these tools, PDFLib TET performs best.
However, SumatraPDF, a very lightweight and open-source PDF reader, and many others like it can identify columns and text areas perfectly. If I open a document in one of these applications, select all the text on a page (or even the entire document with CTRL+A), and copy and paste it into a text file, the text is rendered in the correct order almost flawlessly. It occasionally mixes the footer and header text into one of the columns.
So my question is, how can these applications do what is seemingly so difficult (even for the expensive tools like PDFLib)?
EDIT 31 March 2014: For what it's worth, I have found that PDFBox is much better at text extraction than iTextSharp (even with a bespoke Strategy implementation), and PDFLib TET is slightly better than PDFBox, but it's quite expensive. Python PDFMiner is hopeless. The best results I have seen come from Google. One can upload PDFs (2GB at a time) to Google Drive and then download them as text. This is what I am doing. I have written a small utility that splits my PDFs into 10-page files (Google will only convert the first 10 pages) and then stitches them back together once downloaded.
EDIT 7 April 2014: Cancel my last. The best extraction is achieved by MS Word, and this can be automated in Acrobat Pro (Tools > Action Wizard > Create New Action). Word-to-text can be automated using the .NET OpenXml library. Here is a class that will do the extraction (docx to txt) very neatly. My initial testing finds that the MS Word conversion is considerably more accurate with regard to document structure, but this is not so important once converted to plain text.
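The linked class isn't reproduced here, but as a rough sketch the docx-to-txt step needs only a few lines of the DocumentFormat.OpenXml SDK:

    using System.Text;
    using DocumentFormat.OpenXml.Packaging;
    using DocumentFormat.OpenXml.Wordprocessing;

    // Extract plain text from a .docx produced by the Acrobat action.
    // Walking paragraphs (rather than taking Body.InnerText) preserves
    // the paragraph breaks; Descendants also picks up text inside tables.
    static string DocxToText(string path)
    {
        var sb = new StringBuilder();
        using (var doc = WordprocessingDocument.Open(path, false))
        {
            foreach (var para in doc.MainDocumentPart.Document.Body.Descendants<Paragraph>())
                sb.AppendLine(para.InnerText);
        }
        return sb.ToString();
    }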
I once wrote an algorithm that did exactly what you mentioned for a PDF editor product that is still the number one PDF editor used today. There are a couple of reasons for what you mention (I think) but the important one is focus.
You are correct that PDF (usually) doesn't contain any structure information. PDF is interested in the visual representation of a page, not necessarily in what the page "means". This means in its purest form it doesn't need information about lines, paragraphs, columns or anything like that. Actually, it doesn't even need information about the text itself and there are plenty of PDF files where you can't even copy and paste the text without ending up with gibberish.
So if you want to extract formatted text, you do indeed have to look at all of the pieces of text on the page, perhaps taking some of the line-art information into account as well, and you have to piece them back together. Usually that happens by writing an engine that looks at white-space and then decides first what the lines are, then what the paragraphs are, and so on. Tables, for example, are notoriously difficult because they are so diverse.
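To make that concrete, here is a deliberately naive sketch against iTextSharp's parser API: it collects every chunk with its baseline position, buckets chunks into lines by Y coordinate, and sorts each line by X. Grouping purely by Y like this is exactly what jumbles adjacent columns, which is why real engines segment columns first:

    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Text;
    using iTextSharp.text.pdf.parser;

    // Usage: PdfTextExtractor.GetTextFromPage(reader, pageNum, new PositionalStrategy())
    class PositionalStrategy : ITextExtractionStrategy
    {
        // (baseline Y, start X, text) for every chunk on the page
        private readonly List<Tuple<float, float, string>> chunks =
            new List<Tuple<float, float, string>>();

        public void RenderText(TextRenderInfo info)
        {
            Vector start = info.GetBaseline().GetStartPoint();
            chunks.Add(Tuple.Create(start[Vector.I2], start[Vector.I1], info.GetText()));
        }

        public string GetResultantText()
        {
            var sb = new StringBuilder();
            // Bucket into lines with a 2pt tolerance, top of page first
            // (PDF Y coordinates grow upward), then left to right.
            foreach (var line in chunks.GroupBy(c => (int)(c.Item1 / 2))
                                       .OrderByDescending(g => g.Key))
            {
                foreach (var c in line.OrderBy(c => c.Item2)) sb.Append(c.Item3);
                sb.AppendLine();
            }
            return sb.ToString();
        }

        public void BeginTextBlock() { }
        public void EndTextBlock() { }
        public void RenderImage(ImageRenderInfo info) { }
    }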
Alternative strategies could be to:
Look at some of the structure information that is available in some PDF files. Some PDF/A files and all PDF/UA files (PDF for archival and PDF for Universal Accessibility) must have structure information that can very well be used to retrieve structure. Other PDF files may have that information as well.
Look at the creator of the PDF document and have specific algorithms to handle those PDFs well. If you know you're only interested in Word or if you know that 99% of the PDFs you will ever handle will come out of Word 2011, it might be worth using that knowledge.
So why are some products better at this than others? Focus I guess. The PDF specification is very broad, and some tools focus more on lower-level PDF tasks, some more on higher-level PDF tasks. Some are oriented towards "office" use - some towards "graphic arts" use. Depending on your focus you may decide that a certain feature is worth a lot of attention or not.
Additionally, and this may seem like a lousy answer but I believe it's actually true: this is an algorithmically difficult problem, and it takes only one genius developer to implement an algorithm that is much better than the average product on the market. It's one of those areas where, if you are clever, have enough focus to put some of your attention on it, and especially have a good idea what target market you are writing for, you'll get it right while everybody else gets it mediocre.
(And no, I didn't get it right back then when I was writing that code; we never had enough focus to follow through and make something that was really good.)
To properly extract formatted text a library/utility should:
Retrieve correct information about properties of the fonts used in the PDF (glyph sizes, hinting information etc.)
Maintain graphics state (i.e. non-font parameters like text and page scaling etc.)
Implement some algorithm to decide which symbols on a page should be treated like words, lines or columns.
I am not really an expert in products you mentioned in your question, so the following conclusions should be taken with a grain of salt.
The tools that do not draw PDFs tend to have less expertise in the first two requirements. They have not had to deal with font details on a deeper level, and they might not be that well tested in maintaining graphics state.
Any decent tool that translates PDFs to images will probably become aware of its shortcomings in text positioning sooner or later. And fixing those will help to excel in text extraction.

itextsharp: solution on how to display a report

I have a report which looks like this; it will be in PDF format:
(report mock-up: http://img52.imageshack.us/img52/3324/fullscreencapture121420.png)
The user will input all the different foods, so every section (NONE, MODERATE, SEVERE) will be a different size, and I need to be able to expand the sections at run time. To do that I should probably slice up the image and add the different sections at run time, but I don't know the proper way to do it.
Please help me with a suggestion on how to go about fitting the text into the appropriate sections (keeping in mind that I have no control over how many foods are in each section; the user decides this at run time).
I would create an iTextSharp table for each of your results (None, Moderate, Severe) and write the tables out sequentially, in the order you want them to appear in your PDF. Each row in your tables would have four columns.
I found these articles useful for creating tables in iTextSharp:
iTextSharp - Introducing Tables
SourceForge Table Tutorial
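The core of that approach looks roughly like this (C#; foodsBySection is a hypothetical structure holding whatever the user entered at run time):

    using System.Collections.Generic;
    using System.IO;
    using iTextSharp.text;
    using iTextSharp.text.pdf;

    // One four-column table per severity section; each table simply grows
    // to fit however many foods the user entered, so the sections expand.
    static void WriteReport(Dictionary<string, List<string[]>> foodsBySection, string path)
    {
        var doc = new Document();
        PdfWriter.GetInstance(doc, new FileStream(path, FileMode.Create));
        doc.Open();
        foreach (var section in new[] { "NONE", "MODERATE", "SEVERE" })
        {
            var table = new PdfPTable(4);
            table.AddCell(new PdfPCell(new Phrase(section)) { Colspan = 4 });
            foreach (string[] row in foodsBySection[section])
                foreach (string cell in row)
                    table.AddCell(cell);
            doc.Add(table);
        }
        doc.Close();
    }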
Edit
Sorry, I didn't see the vb.net tag on your question. The pages I linked are in C# - I hope you can translate. I found that most of the iTextSharp samples you'll find are in C#.
It might be worth using a reporting tool rather than iTextSharp for formatted/tabular data?
We use Active Reports from http://www.datadynamics.com/ but I am sure there are others.
EDIT:
It looks like iTextSharp supports HTML-to-PDF conversion? Maybe that's easier to render?
Just did a search and found this: http://somewebguy.wordpress.com/2009/05/08/itextsharp-simplify-your-html-to-pdf-creation/
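The usual pattern from that era is the HTMLWorker class (a sketch; HTMLWorker only understands simple HTML and CSS and was later superseded by the XML Worker add-on):

    using System.IO;
    using iTextSharp.text;
    using iTextSharp.text.html.simpleparser;
    using iTextSharp.text.pdf;

    // Parse an HTML fragment into iTextSharp elements and add them to a PDF.
    static void HtmlToPdf(string html, string path)
    {
        var doc = new Document();
        PdfWriter.GetInstance(doc, new FileStream(path, FileMode.Create));
        doc.Open();
        foreach (IElement element in HTMLWorker.ParseToList(new StringReader(html), null))
            doc.Add(element);
        doc.Close();
    }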

optical character recognition of PDFs of parliamentary debates

For contract work, I need to digitize a lot of old, scanned, image-only plenary debate protocol PDFs from the Federal Parliament of Germany.
The problem is that most of these files have a two-column format:
(sample protocol page: http://sert.homedns.org/img/btp12001.png)
I would love to read your answers to the following questions:
How can I split the two columns before feeding them into OCR?
Which commercial or open-source OCR software or framework do you recommend, and why?
Please note that any tool, programming language, framework, etc. is fine. Don't hesitate to recommend esoteric products or libraries if you think they are cut out for the job ^__^!!
UPDATE: These documents have already been scanned by the parliament o_O: sample (same as the image above). There are lots of them, and I want to deliver on the contract ASAP, so I can't go fetch print copies of the same documents and cut and scan them myself. There are just too many of them.
Best Regards,
Cetin Sert
Cut the pages down the middle before you scan.
It depends what OCR software you are using. A few years ago I did some work with an OCR API; I can't quite remember the name, but I think there are lots of alternatives. Anyway, this API allowed me to define regions on the page to OCR. If you always know roughly where the columns are, you could use an SDK to map out parts of the page.
I use OmniPage 17 for such things. It has a batch mode too, where you put the documents into one folder from which they are grabbed, and the results go into another.
It recognizes the layout automatically, including columns, or you can set the default layout to columns.
You can set many options for how the output should look.
But try a demo first to see whether it works correctly. At the moment I have problems with ligatures in some of my documents, so words like "fliegen" come out as "fl iegen" and have to be corrected by spell-checking.
Take a look at http://www.wisetrend.com/wisetrend_ocr_cloud.shtml (an online, REST API for OCR). It is based on the powerful ABBYY OCR engine. You can get a free account and try it with a few of your images to see if it handles the 2-column format (it should be able to do it). Also, there are a bunch of settings you can play with (see API documentation) - you may have to tweak some of them before it will work with 2 columns. Finally, as a solution of last resort, if the 2-column split is always in the same place, you can first create a program that splits the input image into two images (shouldn't be very difficult to write this using some standard image processing library), and then feed the resulting images to the OCR process.
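If you end up writing that splitter, it is only a few lines, e.g. in C# with System.Drawing (a sketch assuming the gutter always sits at the horizontal midpoint of the scan):

    using System.Drawing;
    using System.Drawing.Imaging;

    // Split a scanned page into left and right halves for separate OCR passes.
    static void SplitColumns(string input)
    {
        using (var page = new Bitmap(input))
        {
            var left = new Rectangle(0, 0, page.Width / 2, page.Height);
            var right = new Rectangle(page.Width / 2, 0, page.Width - page.Width / 2, page.Height);
            using (var leftImg = page.Clone(left, page.PixelFormat))
            using (var rightImg = page.Clone(right, page.PixelFormat))
            {
                leftImg.Save(input + ".left.png", ImageFormat.Png);
                rightImg.Save(input + ".right.png", ImageFormat.Png);
            }
        }
    }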