Extracting text from a PDF using PDFLib TET

Extracting text from a PDF using PDFLib TET - pdf

Is anyone aware of a method to limit the text extracted from a collection of PDFs when using the PDFLib TET library? Specifically I wish to skip over any text contained in either the header & footer sections of a page from the PDFs. I know how I can specify exclusion rectangles by coordinates, but I am hoping to exclude the header & footer sections by logical name(s) across a collection of PDFs. By that I mean I don't want to specify a one size fits all approach, but instead to have the header and footer rectangles for a specific PDF to be excluded as defined by the outcome of the document's structure analysis phase. I have reviewed the TET manual and I have yet to find a simple way to dynamically identify the header & footer rectangles for subsequent exclusion from the text extraction process...
Thanks, for any suggestions as to how I might achieve my desired results.

Related

Batch check Adobe Acrobat .pdf's for files containing rotated text

Does anybody know if there is a way to check whether a list of Adobe Acrobat .pdf files contain rotated text (any text not at 0 degrees)?
I thought this would be simple, but I'm struggling to find an answer.
I am using ABBYY Recognition Server to OCR thousands of files and the results are quite poor where the text is rotated. I need to get a list of files that have rotated text to allow me to perform some pre-processing on them.
I usually use iTextSharp for .pdf automation and modification but don't seem to be able to find anything for checking text rotation.
Thanks

You could achieve your goal by extracting all words from these PDFs and checking if any of the words is rotated.
I would recommend you to use a PDF library higher level abilities for the task. Docotic.Pdf library is a good choice (of course, I am one of the developers of the library).
Here is an articles that shows how to extract words from PDFs with extra info about their position etc.
Each extracted word comes in PdfTextData object. The PdfTextData contains IsTransformed property to check if word is rotated, scaled, and / or flipped. You can also analyze PdfTextData.TransformationMatrix for more information about the transformation.

How do I duplicate a PDF with some text replacement and redaction

I am exploring couple of third party components to work with PDF through C#. These are Aspose.pdf.net and iTextSharp. Following are the details about what I am exploring them for:
I have some PDFs that contain sensitive information in form of text, like name of person, city, etc.
These PDFs need to be duplicated into another copy but while creating duplicated copy, sensitive text needs to be searched & replaced with some dummy text. The replacement is essential to avoid tracing original information, by any fraudulent means.
Also, the replaced text requires to be redacted.
Finding text is expected to support RegEx, as there could be variations of text that needs to be masked.
Could you please assist me how can this be done using iTextShart.
Thanks in advance.

iTextSharp is capable of complete redaction(both visual as well as the data stored in the pdf) using the PdfSweep module (http://itextpdf.com/itext7/pdfsweep).
In order to have the redaction happen after text search you'd have to:
Extract the text from the document (can be done using iText).
Search through the extracted text and obtain the positions of the text you want redacted. (needs an implementation from your side)
Use these positions to define where PdfSweep has to redact. (a couple of lines of code)
By default, PdfSweep visualy redacts by drawing coloured bars over the locations, and internally removes the text and any image.
While it is technically possible to use iText to fill the redacted positions with some dummy text, the implementation thereof has a number of pitfalls.
PdfSweep is closed source module for iText7, you can contact our sales team for more information on the licensing.

Is this possible to break the pdf file smaller than page wise breaking?

I found there is a lot of tools available for breaking the Big PDF files into smaller one by splitting the original PDF file PAGE WISE.for example, if i have a 10 page PDF Document,then we can able to break the original pdf file into 10 pieces in page wise splitting.
But i want similar kind of tool that breaks the PDF file smaller than the Page wise splitting.That means,i need to split the PDF page into different documents based on any parameter like paragraph,section,element...
for example,
If my PDF file having 2 pages with 10 paragraphs then i would like to split the pdf file into 10 separate Pdf file based on paragraph parameter...
Also, I strongly believe pdf does not contain any structure like Open XML.But i also Suspecting
How the tools can able to break the pdf files in to small pdf files by splitting page wise? What kind of mechanism they are using for page wise splitting PDF File?
So, Is there any way to do my work? Please give me your valuable suggestion on this?

PDF is a vector based document description language. It's page based so in a way every page is independent from the next one. Splitting page wise is therefore pretty easy. Contrary to a raster image where you can extract small subsets independently in a pdf you have to render the whole page to know how a small subset looks like.
Say you have a Page (black) which contains a complex shaped object (here it is a line but it could be any text, shape, image, etc.) and you want to extract a subset (red). You would have to first find all the objects that produce visible output in the region of interest. Then you would have to modify them so they are rendered correctly (in this case calculate the green points from the blue points while preserving the shape of the object).
An easier approach would be to include the whole page and clip the viewing area to the dimensions of the region.
You could do this with pdfjam. Check the --trim/--offset/--delta command in conjunction with a custom paper size (Example 6,7 on the pdfjam website). You would still have to somehow calculate the coordinates of the region of interest though.

PDF itext TOC generation

I have to merge multiple PDF documents into a single PDF document. Besides this, I have to generate TOC. The original documents will contain text with a specific style (say H1). This special text becomes part of TOC.
Have used iText for merging multiple PDF files. I am unable to find example/API on parsing the document to find all the contents having style H1.
Generating TOC is next challenge.

You don't. PDFs don't have styles. They have "current Graphic State", which includes:
current transformation matrix (CTM).
stroke & fill colors
clipping path
font & size
gobs of other text state stuff (char spacing, word spacing, leading, text render mode...)
Including a separate text transformation matrix which is combined with the CTM.
So first you have to track all this stuff (which iText can mostly do for you). Then you have to determine how big "H1" text is, and latch on to all the text that is in that size screen size, taking the CTM, text matrix, and font size into account (which iText will do for you again, IIRC).
And just to make life more exciting for folks like yourself, it's entirely possible that the text you're looking at isn't text at all. It could be paths, or a bitmap... at which point you need OCR, and I don't think you'll get much in the way of size info with OCR.
You'll need to write a TextRenderListener that determines the final size of a given piece of text (and whether or not its a part of the last piece) and filter out all the stuff that's too small. You'll then build your TOC based on the text you find.

PDF Colo(u)r Analysis (without Acrobat itself ?)

Is there a library/tool which would list all colours used in a PDF document ?
I'm sure Acrobat itself would do this but I would like an alternative (ideally something that could be scripted).
So the idea is if you have a very simple PDF document with four colours in it the output might say :
RGB(100,0,0)
RGB(105,0,0)
CMYK(0,0,0,1)
CMYK(1,1,1,1)

You could explore the insides with pdfbox, but you would have to write some code to find and catalog all those colors.

Most PDF tools have access to this information but no api to access it. You could take any tool and add it in

Apago PDFspy generates an XML file containing all kinds of metadata extracted from PDF files. It reports color usage including spot colors.

We recently added a function called GetPageColorSpaces(0) to the Quick PDF Library - www.quickpdflibrary.com to retrieve much of the ColorSpace info used in the document.
Here is some sample output.
Resource,\"QuickPDFCS2eb0f578\",Separation,\"HKS 52 E\",DeviceCMYK,0.95,0,0.55,0
Resource,\"QuickPDFCSb7b05308\",Separation,\"Black\",DeviceCMYK,0,0,0,1
Resource,\"QuickPDFCSd9f10810\",Separation,\"Pantone 117 C\",DeviceCMYK,0,0.18,1,0.15
Resource,\"QuickPDFCS9314518c\",Separation,\"All\",DeviceCMYK,0,1,0,0.5
Resource,\"QuickPDFCS333d463d\",Separation,\"noplate\",DeviceCMYK,1,0,0,0
Resource,\"QuickPDFCSb41cafc4\",Separation,\"noprint\",DeviceCMYK,0,1,0,0
Resource,\"Cs10\",DeviceN,Black,Colorant,-1,-1,-1,-1
Resource,\"Cs10\",DeviceN,P1495,Colorant,-1,-1,-1,-1
Resource,\"Cs10\",DeviceN,CalRGB,Colorant,-1,-1,-1,-1
Resource,\"Cs10\",Separation,\"P1495\",DeviceCMYK,0,0.31,0.69,0
XObject,\"R29\",Image,,DeviceRGB,-1,-1,-1,-1

Disclaimer: I work at Atalasoft.
Our product, DotImage with the PDF Reader add-on, can do this. The easiest way is to rasterize the page and then just use any of our image analysis tools to get the colors.
This example shows how to do it if you want to group similar colors -- the deployed example will only work for PNG and JPEG, but if you download the code, it's trivial to include the add-on and get PDF as well (let me know if you need help)
Source here:
http://www.atalasoft.com/cs/blogs/31appsin31days/archive/2008/05/30/color-scheme-generator.aspx
Run it here:
http://www.atalasoft.com/31apps/ColorSchemeGenerator

If you are working with specific and simple PDF documents from a constrained source then you may be able to find the colors by reading through the content stream. However this cannot be a generic solution.
For example PDF documents can contain gradients or transparency. If your document contains this type of construct then you are likely to end up with a wide range of colors rather than a specific set.
Similarly many PDF documents contain bitmapped images. Given that these will need to be interpolated to be displayed at different resolutions, the set of colors in a displayed PDF may be bigger or different to (though obviously broadly similar to) the embedded bitmap.
Similarly many PDF documents contain constructs in multiple color spaces that are rendered into different color spaces. For example a PDF might contain a DeviceRGB bitmap, a line in an ICC based CMYK color and a Lab based rectangle. The displayed version might be in sRGB for display or CMYK for print. Each of these will influence the precise set of colors.
So the only 100% valid answer is going to be related to a particular render of a PDF at a particular resolution to a particular color space. From the resultant bitmap you can determine the colors that have been used.
There are a variety of PDF libraries that will do this type of render including DotImage (referenced in another answer) and ABCpdf .NET (on which I work).

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

Extracting text from a PDF using PDFLib TET - pdf

Related

Batch check Adobe Acrobat .pdf's for files containing rotated text

How do I duplicate a PDF with some text replacement and redaction

Is this possible to break the pdf file smaller than page wise breaking?

PDF itext TOC generation

PDF Colo(u)r Analysis (without Acrobat itself ?)

Categories

Resources