Extract only the text from PDF files with CGPDFScanner - objective-c

There are a number of questions (some answered and others not) about extracting simple text from PDF files. Stack Overflow has been helpful in pointing out that Adobe's PDF documentation is quite clear about detecting objects during parsing: i.e. one should use the 'BT' and 'ET' operators from the PDF reference to construct the callbacks when using CGPDFScanner.
The Apple documentation shows a callback example:
static void op_BT (CGPDFScannerRef s, void *info) {
    const char *name;
    if (!CGPDFScannerPopName(s, &name))
        return;
    printf("BT /%s\n", name);
}
And, among other CGPDFScanner calls, the above callback is set up by first creating:
myTable = CGPDFOperatorTableCreate();
CGPDFOperatorTableSetCallback (myTable, "BT", &op_BT);
All good so far, but the Apple documentation doesn't appear to help low-to-intermediate programmers like me with the next step: beyond identifying the text block (presumably between the BT and ET callbacks?), what few steps/lines are needed during/in/outside the callback to capture the identified text block into an NSString?
Many thanks.

The first thing you should do is download the PDF reference. These days that's an ISO standard, but you can download the Acrobat SDK (http://www.adobe.com/devnet/acrobat.html) which contains an Adobe copy that will serve you just as well.
Read chapter 9. It'll teach you that on the one hand you need to understand text operators (Tj, ', ", TJ) and on the other hand you need to understand fonts and encodings.
The text operators are the ones you can intercept that add "strings" to the PDF document. While all text operators must appear between BT and ET, intercepting the BT and ET operators by themselves isn't going to do much for you, I think; instead register callbacks for Tj, ', " and TJ themselves and pop their operands inside the callback (CGPDFScannerPopString for the string operands, CGPDFScannerPopArray for TJ's array operand). CGPDFStringCopyTextString will turn a popped CGPDFStringRef into a string you can append to, say, an NSMutableString passed via the scanner's info pointer, but be aware that for the subsetted and ad-hoc encodings discussed below its best-effort conversion can produce garbage.
Fonts are important because they define how the bytes used by those operators correspond to actual (Unicode) characters. So if you want to derive the meaning of the bytes you get from the PDF file, you need to know how to use the fonts to do so.
Some additional points:
Don't assume BT and ET correspond to an actual text block or paragraph as you may know it from an application such as InDesign or Word. One text block may contain a whole page or a single character (or nothing).
There are also text state operators that determine how the text is going to be shown on the page. There are ways for example to draw invisible text; you may or may not wish to extract that type of text. If you don't, you'll need to support enough text state operators that you can tell the difference.
Not a small task :)
Update after looking at sample PDF
Because the question was refined in the comments to be about text extraction from a specific type of PDF file, let me add a little additional information.
1) Looking at the PDF file you reference, you won't be able to skip the font/encoding problem. The fonts in the sample PDF file are subsetted, which means that you don't have "cleartext" in the PDF page description, but instead indexes that have to be mapped through the encodings of the fonts used in order to get meaningful text.
2) Extracting the text is possible, as you can see from the following output from pdfToolbox (warning: I'm affiliated rather heavily with this tool):
<page id="33">
<words>
<word txt="Senator">
<parts>
<part tlh="28.3481" tlv="868.534" trh="55.4455" trv="868.534" blh="28.3481" blv="859.902" brh="55.4455" brv="859.902"></part>
</parts>
</word>
<word txt="House,">
<parts>
<part tlh="57.5305" tlv="868.534" trh="82.123" trv="868.534" blh="57.5305" blv="859.902" brh="82.123" brv="859.902"></part>
</parts>
</word>
<word txt="85">
<parts>
<part tlh="84.208" tlv="868.534" trh="92.548" trv="868.534" blh="84.208" blv="859.902" brh="92.548" brv="859.902"></part>
</parts>
</word>
There are undoubtedly other tools which can give a similar (or better) result, so extracting the text by itself should be doable.
The big problem is going to be finding the text you're interested in, in the right order. The extraction I used here gives the text of each "word" and its position (bounding box) on the page. Looking through the XML, when you get to the table the challenge is going to be deciding which text belongs to which table cell, where rows and columns end, etc.
In a way this problem is harder than simply detecting lines of text, because you're dealing with a pretty dense table; and where my problem was largely one-dimensional (gathering everything on the same line), this problem is two-dimensional (see the sketch below).
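To illustrate that two-dimensional step, here is a small Java sketch (purely illustrative, not the output of any tool; the Word class is mine, and the coordinates are the bottom-left bounding box corners from the XML above). It groups words into rows by baseline and sorts each row left to right:

import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.TreeMap;

public class WordGrouping {
    // A word plus the bottom-left corner of its bounding box, as reported
    // in the XML above (blh = horizontal position, blv = vertical position).
    static class Word {
        final String text;
        final double x, y;
        Word(String text, double x, double y) { this.text = text; this.x = x; this.y = y; }
    }

    public static void main(String[] args) {
        List<Word> words = new ArrayList<>();
        words.add(new Word("Senator", 28.3481, 859.902));
        words.add(new Word("House,", 57.5305, 859.902));
        words.add(new Word("85", 84.208, 859.902));

        // Group words into rows by baseline, top of page first
        // (PDF y coordinates grow upwards, hence the reverse order).
        TreeMap<Double, List<Word>> rows = new TreeMap<>(Collections.reverseOrder());
        for (Word w : words)
            rows.computeIfAbsent(w.y, k -> new ArrayList<>()).add(w);

        // Sort each row left to right and join it into a line of text.
        for (List<Word> row : rows.values()) {
            row.sort(Comparator.comparingDouble(w -> w.x));
            StringBuilder line = new StringBuilder();
            for (Word w : row) line.append(w.text).append(' ');
            System.out.println(line.toString().trim()); // prints: Senator House, 85
        }
    }
}

A real table extractor would additionally merge baselines that differ by less than some tolerance and split rows into columns at larger horizontal gaps; the sketch only shows the basic grouping idea.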


Tabulator - formatting print and PDF output

I am a relatively new user of Tabulator so please forgive me if I am asking anything that, perhaps, should be obvious.
I have a Tabulator report that I am able to print and create as a PDF, but the report's formatting (as shown on the screen) is not used in either output.
For printing I have used printAsHtml and printStyled=true, but this doesn't produce a printout that matches what is on the screen. I have formatted number fields (with comma separators) and these are showing correctly, but the number columns, which should be right-aligned, all appear as left-aligned.
I am also using Tree View where the tree rows are coloured differently to the main table, but when I print the report with a tree open it colours the whole table with the tree colours and not just the tree.
For the PDF none of the Tabulator formatting is being used. I've looked for anything similar to the printStyled option, but I can't see anything. I've also looked at the autoTable option, but I am struggling to find what to use.
I want to format the print and PDF outputs so that they look as close to the screen representation as possible.
Is there anywhere I could look that would provide examples of how to achieve the above? The Tabulator documentation is very good, but the provided examples don't appear to explain what I am trying to do.
Perhaps there are CSS classes that I am missing or even mis-using? I have tried including .tabulator-print-table in my CSS, but I am probably not using it correctly. I also couldn't find anything equivalent for producing PDFs. Some examples would help immensely.
Thank you in advance for any advice or assistance.
Formatting is deliberately not included in these outputs; below I will outline why.
Downloaders
Downloaded files do not contain formatted data, only the raw data. This is because a lot of the formatters create visual elements (progress bars, the star formatter, etc.) that cannot be replicated sensibly in downloaded files.
If you want to change the format of data in the download you will need to use an accessor; the accessorDownload option is the one you want in this case. Accessors transform the data as it leaves the table.
For instance we could create an accessor that prepended "Mr " to the front of every name in a column:
var mrAccessor = function(value, data, type, params, column, row){
    return "Mr " + value;
};
Assign it to a column's definition:
{title:"Name", field:"name", accessorDownload:mrAccessor}
Printing
Printing also does not include the formatters. This is because when you print a Tabulator table, the whole table is rebuilt as a standard HTML table, which allows the printer to work out how to lay everything out across multiple pages, with column headers etc. The downside of this is that the result is only loosely styled like a Tabulator, so formatted contents generated inside Tabulator cells would likely break when added to a normal td element.
For this reason there is also an accessorPrint option that works in the same way as the download accessor, but for printing.
If you want to use the same accessor for both occasions, you can assign the function once to the accessor option and it will be applied in both instances.
Check out the Accessor Documentation for full details.

Why is it so hard to convert PDF to plain text?

I needed to convert some PDFs back to text. I tried many programs and online tools and the result was always mediocre.
Why is it so difficult, technically speaking?
Let's not assume you are talking about PDFs which merely wrap some bitmap image, because it should be clear that in that case you can only resort to OCR, with all its restrictions.
Let's instead assume that text is drawn in the PDF at hand.
What is drawn on a PDF page is determined by a sequence of instructions in the content stream of that page. "Text is drawn" on a page means that among those instructions there are some that set the font for the instructions to come, some that set the text position and direction for the instructions to come, and some that actually draw text given by "string arguments".
Text extraction is the task of taking the sequence of instructions from a content stream and, instead of drawing the text as indicated by the font and position setting instructions, exporting it in a sensible order using a standard encoding, usually the encoding of the character type of the programming language / platform used.
The first problem is to understand the encoding of the string arguments of those text drawing instructions:
each font can have its own encoding; to extract the text one cannot simply ignore everything but the instructions drawing text and concatenate their string contents; you always have to take the current font into account (some extremely simple text extractors ignore this and, therefore, fail pretty often to return something sensible);
there are a large number of predefined encodings, some reminiscent of encodings you know, e.g. WinAnsiEncoding, and many you likely don't know, e.g. Add-RKSJ-H; these encodings may use a constant number of bytes per glyph or they may be mixed multi-byte; so a text extractor must support very many encodings to start with;
encodings may also be completely ad hoc and arbitrary; in particular in the case of embedded subset fonts one often sees ad-hoc encodings generated by dealing out character codes from some starting value whenever one is needed, i.e. the first glyph in a given font used on a page is given the starting value as its code, the next, different glyph the starting value plus one, the next different one the starting value plus two, etc.; "Hello World" and a starting value of 48 (the ASCII value of '0') would result in "01223453627" (see the sketch after this list); these fonts may contain a mapping to Unicode, but they are not required to.
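To make the code-dealing scheme concrete, here is a toy Java sketch (illustrative only, not taken from any PDF library) that reproduces the "Hello World" example above:

import java.util.LinkedHashMap;
import java.util.Map;

public class AdHocEncodingDemo {
    public static void main(String[] args) {
        // Deal out character codes the way a writer of a subset font might:
        // the first distinct glyph used gets the starting value as its code,
        // each subsequent new glyph gets the next free value.
        int nextCode = '0'; // starting value 48, as in the example above
        Map<Character, Character> codeOf = new LinkedHashMap<>();
        StringBuilder encoded = new StringBuilder();
        for (char glyph : "Hello World".toCharArray()) {
            Character code = codeOf.get(glyph);
            if (code == null) {
                code = (char) nextCode++;
                codeOf.put(glyph, code);
            }
            encoded.append(code);
        }
        System.out.println(encoded); // prints: 01223453627
    }
}

Without the font's (optional) mapping back to Unicode, codes like these are all a text extractor gets to see.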
The next problem is to make sense out of the order of the strings:
the string drawing instructions may occur in an arbitrary order, e.g. "Hello" might be drawn "lo" first, then, after moving back, "el", then, after again moving back, "H"; to extract the text one cannot ignore the text positioning instructions and simply concatenate the text strings; you always have to take the current position into account (some simple text extractors ignore this and, therefore, can fail to return something sensible);
multi-columnar text may present a difficulty: text may be drawn line by line, e.g. first the text of the top line of the first column, then the top line of the second column, then the second line of the first column, then the second line of the second column, etc.; there need not be any hints in the PDF that the text is multi-columnar.
Another problem is to recognize formatting or styling artifacts:
spaces between words need not be created by drawing a space glyph, it may also be done by text position changing instructions; text extractors not trying to recognize gaps created by text positioning instructions may return a result without spaces; on the other hand the same technique can be used to draw adjacent glyphs at an optimal distance, aka kerning; text extractors trying to recognize gaps created by text positioning instructions may falsely return spaces where there should be none;
sometimes selected words are printed s p a c e d o u t for extra emphasis; in the extracted text these gaps might be presented as space characters which automatic postprocessing of the text may see as word separators;
usually for bold text one uses a different, bold font program; if that is not at hand, people sometimes get creative and emulate bold by printing the same text twice with a minute offset; with a slightly larger offset (or a different transformation) and a different color a shadow effect can be emulated; if the text extractor does not try to recognize this, you end up having some duplicate characters in the output.
More problems arise due to incomplete or wrong extra information:
ToUnicode maps of fonts (optional maps from character code to Unicode) may be incomplete or contain errors; there are, e.g., many questions here on Stack Overflow dealing with incorrect ToUnicode maps for Indic scripts; the text extraction results reflect these errors;
there even are PDFs with contradictory information, e.g. with an error in the ToUnicode map but the correct information in an ActualText entry; this is used by some PDF creators to allow correct copy&paste from some programs (preferring an ActualText entry in such a situation) while injecting errors in the output of other programs (preferring ToUnicode information then).
Yet another problem arises if you expect the text extractor to extract only text eventually visible in the page:
text may be drawn outside the current clipping area or outside the visible page area; text extractors need to keep these in mind;
text may be drawn using the rendering mode "invisible"; text extractors have to keep an eye on the rendering mode;
text may be drawn using the same color as the background; to recognize this, a text extractor cannot look only at the current instruction and a few graphics state details; it has to take into account anything drawn beforehand in the location of the text;
text may be drawn as a clip path; to recognize whether this text is visible in the end, a text extractor must keep track of what is drawn in the text area as long as the clip path is active;
text may be covered by something else later; a text extractor must drop recognized text in such a case; but depending on blend modes and transparency settings these coverings might or might not allow the text to shine through; thus, for a correct result the text extractor must, for each glyph, keep track of the color it's drawn with, the color of the backdrop, and what all those spiffy effects do with those colors later on; and of course, both glyph color and backdrop color can be interesting, e.g. some shading colors; and the color spaces involved may differ, requiring one to convert back and forth between color spaces; and so on.
Furthermore, text may be drawn where text extractors usually don't look:
some tools hide text from text extraction by putting it into a pattern and filling the page area with that pattern;
similarly there are type 3 fonts; each character in a type 3 font is represented by its own content stream; thus, a tool can draw all text in the content stream of a single type 3 font glyph and then draw that glyph on the page.
...
You surely have meanwhile gotten an idea why text extraction results can be less than optimal. And be assured, the list above is not complete, there still are more complications for text extraction.

Special characters in PDF form fields and global and field-based /DR

I have a question regarding a weird form field behaviour.
Two PDF documents, both have text field(s) using Helvetica as the font.
Both are filled with values using the same iText logic (cf. below).
The field value (/V) is correct for both PDFs; however, the field appearance is not.
One PDF works fine; the other scrambles special characters like the euro symbol € or German characters like üöäß.
I tried to define a substitution font (as described in the book), however I never got € and ß to work.
The only difference I could find is that a /DR dictionary is defined at field level for the non-working PDF (in addition to the global one). But if I remove it, the € sign still doesn't work. Please note that I am not talking about Asian or exotic Unicode characters here; all are part of the standard Helvetica font (as the other PDF proves).
Question(s):
Any ideas how to get the non working PDF to correctly display the characters?
Or does the PDF violate the PDF spec somehow? (It was created using Acrobat, which makes that unlikely but not impossible.)
If you suggest replacing the form field font: how can I differentiate between working and non-working PDF files, since I don't want to do that for perfectly valid and working files?
Update: The code is not the problem (I am certain of that, since it's the same code for both); however, for the sake of completeness, here it is:
AcroFields acroFields = stamper.getAcroFields();
try {
    boolean successful = acroFields.setField("Mitarbeiter", "öäü߀#");
    if (!successful) {
        //throw some exception
    }
}
catch (DocumentException de) {
    //some exception handling
}
I didn't find any clues in the PDF reference about this, but the font that is used for the field doesn't define an encoding. However: an encoding is defined at the level of the resource dictionary (/DR). If you use that encoding, then the appearance of the field is created correctly. Note that the ISO specification doesn't say anything about the existence of an /Encoding entry at the level of the resource dictionary.
I've made a small update to iText. You can check the changes in revision 6693. This way, iText will now check if the /DR dictionary has encoding values in case no encoding is defined at the level of the font. With this fix, your form is filled out correctly.
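For completeness: the substitution-font approach the question mentions is normally set up as in the sketch below (reusing the stamper from the snippet above; the font path is an assumption and must point to a Unicode-capable font on your system). In this particular form, though, it was the /DR encoding fix above that made € and ß work:

// Sketch: register a substitution font for form filling (iText 5 API).
BaseFont arial = BaseFont.createFont(
        "c:/windows/fonts/arial.ttf",  // assumed font location
        BaseFont.IDENTITY_H, BaseFont.EMBEDDED);
AcroFields acroFields = stamper.getAcroFields();
acroFields.addSubstitutionFont(arial); // consulted when the field font lacks a glyph
acroFields.setField("Mitarbeiter", "öäü߀#");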

PDFBox- is the reading order guaranteed with PDFTextStripper's processTextPosition ?

I am using PDFTextStripper (PDFBox 1.8.2) to process every TextPosition in a PDF file. I have tested with a lot of files and noticed that it processes text in reading order. However, this does not hold if a PDF has footers (e.g. a .docx which I exported as PDF). The PDFTextStripper processes the footer first and then the body of the file.
Is this expected behavior? Is there a way I can specify the order? Or is there any way I can identify that it is a footer, so I can make the adjustment in my code?
PDFTextStripper has an attribute SortByPosition (getSortByPosition & setSortByPosition). It's false by default.
If this attribute is false, the PDFTextStripper essentially extracts the text in the order in which it appears in the PDF page content stream.
This order can be totally mangled (because in the content stream you use operators which can position the next printed text anywhere on the page) but often text sections belonging together are kept together (because the operations required for such sections often are inserted in that stream as a block).
Headers and footers, though, often are added at the same time and, therefore, appear together before or after the main body text.
If this attribute is true, though, the PDFTextStripper essentially extracts the text from top to bottom, left to right (unless the reading order is defined to be right to left). (OK, OK, it also respects article beads, but you can hardly count on them being used in general.)
This order is good in the case of one-column text, where headers come first and footers last; but unless proper article beads are used, multi-column pages get mangled.
BTW, you can switch off the use of article beads using the attribute ShouldSeparateByBeads (getSeparateByBeads & setShouldSeparateByBeads).
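Putting the above together, a minimal sketch for PDFBox 1.8.x (the file name is a placeholder):

import java.io.File;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.util.PDFTextStripper;

public class SortedExtraction {
    public static void main(String[] args) throws Exception {
        PDDocument document = PDDocument.load(new File("example.pdf")); // placeholder
        try {
            PDFTextStripper stripper = new PDFTextStripper();
            stripper.setSortByPosition(true);         // top-to-bottom, left-to-right order
            stripper.setShouldSeparateByBeads(false); // ignore article beads (see above)
            System.out.println(stripper.getText(document));
        } finally {
            document.close();
        }
    }
}

Note that with SortByPosition enabled, multi-column pages will have their columns interleaved line by line, as described above, so whether to enable it depends on the kind of documents you process.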

SQL Strip the Font Format (Colour or other)

I have a problem stripping out the formatting in a notes table.
Here is an example:
";\red31\green73\blue125;
\viewkind4\uc1\ltrpar\f0\fs20 USEFUL TEXT BODY \cf1\f3
\ltrpar\f0\fs17
"
How do I get rid of this stuff? I want to play it safe and not simply remove everything after a '\'.
Many thanks,
Rick
You're making it quite difficult for yourself by not replacing '\'.
If you look at http://other9.tripod.com/Refs/easy-rtf.html you will see that there are many different RTF codes, and there is no fixed length for them.
Additionally, RTF is not like HTML, where there must be a matching "closing" tag; this makes it even more difficult.
The only thing I can think of is to record all possible RTF codes (or use an RTF parser library, as sketched below) and hence be able to recognize whether a '\' is or is not the start of an RTF code.
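If you can process the notes outside the database, the parser-library route is straightforward. Here is a Java sketch using the JDK's built-in RTFEditorKit; it assumes each note is a complete RTF document starting with {\rtf1 ...}, which the fragment quoted above is not:

import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.swing.text.Document;
import javax.swing.text.rtf.RTFEditorKit;

public class RtfToPlainText {
    // Parses the RTF control words and returns only the text content.
    public static String strip(String rtf) throws Exception {
        RTFEditorKit kit = new RTFEditorKit();
        Document doc = kit.createDefaultDocument();
        kit.read(new ByteArrayInputStream(rtf.getBytes(StandardCharsets.UTF_8)), doc, 0);
        return doc.getText(0, doc.getLength());
    }

    public static void main(String[] args) throws Exception {
        System.out.println(strip("{\\rtf1\\ansi USEFUL TEXT BODY}")); // USEFUL TEXT BODY
    }
}

This sidesteps the problem of recognizing individual control words yourself; the parser already knows which '\' sequences are codes and which are escaped characters.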