Looking for software or API that will give me co-ordinates of text in a pdf


Simple question I hope - I have a pdf and want to detect the co-ordinates of specific word(s) or placeholder text. I then intend to use itextsharp to stamp a replacement bit of text on top at the co-ordinates found.
Can anyone recommend anything please?
Thanks

As answered in the comments, one could use iText to perform such a task. There may be better solutions, but I doubt it. The cause of the mentioned issue, i.e. "[itextsharp] sometimes give co-ords of the start of the sentence the search text is in", is that glyphs are sometimes so close together that their boxes overlap, so I don't see how it could be handled the way you want.
So you can do the following:
Extend the LocationTextExtractionStrategy class and override eventOccurred, for example, as follows:
    @Override
    public void eventOccurred(IEventData data, EventType type) {
        if (type.equals(EventType.RENDER_TEXT)) {
            TextRenderInfo renderInfo = (TextRenderInfo) data;
            // Obtain all the necessary information from renderInfo, for example
            LineSegment segment = renderInfo.getBaseline();
            // ...
        }
    }
Pass an instance of such an extended class to PdfTextExtractor.getTextFromPage as follows:
    PdfTextExtractor.getTextFromPage(pdfDocument.getPage(1), new ExtendedLocationTextExtractionStrategy());
Once text is found, the event will be triggered.
There are some difficulties in such a solution, of course, because the text you want to find and overwrite could be present in the PDF not as "Text", but as "T", "ex", "t", or even "t", "x", "e", "T". However, since you use iText, you may want to harness one of its products, pdfSweep. This product aims to completely remove unnecessary content from a PDF, with that content being specified either as locations (which is what you want to obtain, so that is not an option) or as regexes.
This is how to create such a regex strategy (to find all "Dolor" and "dolor" instances in the document and completely remove them from all the streams, so that they are neither visible in a PDF viewer nor found in the underlying PDF objects):
RegexBasedCleanupStrategy strategy = new RegexBasedCleanupStrategy("(D|d)olor").setRedactionColor(ColorConstants.GREEN);
This is how to use it:
PdfAutoSweep autoSweep = new PdfAutoSweep(strategy);
autoSweep.cleanUp(pdf); // a PdfDocument instance
And this is how to write some text on the location, at which the unnecessary text was present:
for (IPdfTextLocation location : strategy.getResultantLocations()) {
    Rectangle rect = location.getRectangle();
    // do something, for example, write some text
}
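The chunk-splitting difficulty mentioned above ("T", "ex", "t") can be illustrated without iText at all. The following is only a minimal sketch in plain Java: the `Chunk` class and the `locate` method are hypothetical names, not part of iText, and the sketch assumes the chunks have already been sorted into reading order. It joins the chunk texts, searches the joined string, and maps the match back to the chunks it started and ended in.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ChunkWordLocator {

    /** Hypothetical stand-in for one text chunk reported by a PDF parser. */
    static final class Chunk {
        final String text;
        final float startX; // left edge of the chunk on the baseline
        final float endX;   // right edge of the chunk on the baseline

        Chunk(String text, float startX, float endX) {
            this.text = text;
            this.startX = startX;
            this.endX = endX;
        }
    }

    /**
     * Returns {startX, endX} spanning the chunks that contain the first
     * occurrence of word, or null if the word does not occur.
     * Chunks are assumed to be in reading order.
     */
    static float[] locate(List<Chunk> chunks, String word) {
        if (word.isEmpty()) {
            return null;
        }
        StringBuilder joined = new StringBuilder();
        List<Integer> owner = new ArrayList<>(); // owner.get(i) = chunk index of character i
        for (int c = 0; c < chunks.size(); c++) {
            String text = chunks.get(c).text;
            joined.append(text);
            for (int k = 0; k < text.length(); k++) {
                owner.add(c);
            }
        }
        int at = joined.indexOf(word);
        if (at < 0) {
            return null;
        }
        Chunk first = chunks.get(owner.get(at));
        Chunk last = chunks.get(owner.get(at + word.length() - 1));
        return new float[] { first.startX, last.endX };
    }

    public static void main(String[] args) {
        List<Chunk> chunks = Arrays.asList(
                new Chunk("T", 10f, 16f),
                new Chunk("ex", 16f, 28f),
                new Chunk("t", 28f, 32f));
        float[] span = locate(chunks, "Text");
        System.out.println(span[0] + " .. " + span[1]);
    }
}
```

Note that the result is only as precise as the chunks themselves: if a whole sentence arrives as one chunk, the returned span is that of the sentence, which matches the "start of the sentence" symptom quoted above.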

Related

Tabulator - formatting print and PDF output

I am a relatively new user of Tabulator so please forgive me if I am asking anything that, perhaps, should be obvious.
I have a Tabulator report that I am able to print and create as a PDF, but the report's formatting (as shown on the screen) is not used in either output.
For printing I have used printAsHtml and printStyled=true, but this doesn't produce a printout that matches what is on the screen. I have formatted number fields (with comma separators) and these are showing correctly. However, the number columns should be right-aligned, yet all of the columns appear left-aligned.
I am also using Tree View where the tree rows are coloured differently to the main table, but when I print the report with a tree open it colours the whole table with the tree colours and not just the tree.
For the PDF none of the Tabulator formatting is being used. I've looked for anything similar to the printStyled option, but I can't see anything. I've also looked at the autoTable option, but I am struggling to find what to use.
I want to format the print and PDF outputs so that they look as close to the screen representation as possible.
Is there anywhere I could look that would provide examples of how to achieve the above? The Tabulator documentation is very good, but the provided examples don't appear to explain what I am trying to do.
Perhaps there are CSS classes that I am missing or even misusing? I have tried including .tabulator-print-table in my CSS, but I am probably not using it correctly. I also couldn't find anything equivalent for producing PDFs. Some examples would help immensely.
Thank you in advance for any advice or assistance.

Formatting is deliberately not included in these outputs. Below I will outline why.
Downloaders
Downloaded files do not contain formatted data, only the raw data. This is because a lot of the formatters create visual elements (progress bar, star formatter etc.) that cannot be replicated sensibly in downloaded files.
If you want to change the format of data in the download you will need to use an accessor; the accessorDownload option is the one you want to use in this case. The accessors transform the data as it is leaving the table.
For instance, we could create an accessor that prepends "Mr " to every name in a column:
var mrAccessor = function(value, data, type, params, column, row){
    return "Mr " + value;
};
Assign it to a column definition:
{title:"Name", field:"name", accessorDownload:mrAccessor}
Printing
Printing also does not include the formatters. This is because when you print a Tabulator table, the whole table is actually rebuilt as a standard HTML table, which allows the printer to work out how to lay out everything across multiple pages with column headers etc. The downside of this is that it is only loosely styled like a Tabulator, so formatted contents generated inside Tabulator cells will likely break when added to a normal td element.
For this reason there is also an accessorPrint option that works in the same way as the download accessor, but for printing.
If you want to use the same accessor for both occasions, you can assign the function once to the accessor option and it will be applied in both instances.
Check out the Accessor Documentation for full details.

Get annotation from a pdf to add to another document

I am using iTextSharp version 5.0.
For my project, I need to copy my pdf document into another pdf document using pdfWriter. I can't use pdfCopy nor pdfStamper.
So all the annotations get lost during this operation.
To begin, I tried to find out how to get the annotations of the "pencil comment drawing markup" as shown below in the Adobe Reader UI:
For my tests, I am using this pdf document with a drawing markup I added myself: https://easyupload.io/3c6i1g
I found how to get the annotation dictionary:
Dim pdfReader As New PdfReader(pdfPath)
Dim page As PdfDictionary = pdfReader.GetPageN(0)
Dim annots As PdfArray = page.GetAsArray(PdfName.ANNOTS)
If annots IsNot Nothing Then
For i = 0 To annots.Size - 1
Dim annotDict As PdfDictionary = annots.GetAsDict(i)
Dim annotContents As PdfString = annotDict.GetAsString(PdfName.CONTENT)
Dim annotSubtype As PdfString = annotDict.GetAsString(PdfName.SUBTYPE)
Dim annotName As PdfString = annotDict.GetAsString(PdfName.T)
Next
End If
When the loop is parsing my comment the annotName variable returns my name, so I am sure to parse the annotation I am looking for but the annotSubtype is equal Nothing, how is that possible? According to the pdf specification at section 12.5.2 table 1666 (https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/PDF32000_2008.pdf), the subtype parameter is required, so wouldn't it means this should not be at nothing?
Also, how can I get the image related to this annotation? I thought it would be stored in the content of the annotation dictionary but this is also returning nothing in the code above...
about why I can't use pdfStamper at the first place : one of the page of my pdf document must be resized (downscaled) in order to add some text at the bottom of the page, so I must use pdfWriter for that.
Question: How can I get the drawn line of a comment annotation with iTextSharp 5.0?

There are a lot of single questions in your post...
When the loop is parsing my comment the annotName variable returns my name, so I am sure to parse the annotation I am looking for but the annotSubtype is equal Nothing, how is that possible?
According to the pdf specification at section 12.5.2 table 1666 (https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/PDF32000_2008.pdf), the subtype parameter is required, so wouldn't it means this should not be at nothing?
According to table 164 in section 12.5.2 of ISO 32000-1, the Subtype entry is indeed required, but it is also specified to be a name, while you try to retrieve a string instead:
Dim annotSubtype As PdfString = annotDict.GetAsString(PdfName.SUBTYPE)
As the Subtype entry of that annotation in your PDF correctly is a name, GetAsString returns Nothing.
Thus, call GetAsName instead and expect a PdfName return type.
Also, how can I get the image related to this annotation? I thought it would be stored in the content of the annotation dictionary but this is also returning nothing in the code above...
The Contents entry is specified in the same table as above to be optional and (if present) to have a text string value containing "Text that shall be displayed for the annotation or, if this type of annotation does not display text, an alternate description of the annotation’s contents in human-readable form." As the annotation merely is a scribble, what should the annotation have as Contents value?
As your annotation actually is an Ink annotation, you can find the representation of the scribble in the required InkList and optional BS entries of the annotation, see table 182 of section 12.5.6.13 of ISO 32000-1.
The value of InkList is "An array of n arrays, each representing a stroked path. Each array shall be a series of alternating horizontal and vertical coordinates in default user space, specifying points along the path. When drawn, the points shall be connected by straight lines or curves in an implementation-dependent way."
The value of the BS (if present) is "A border style dictionary (see Table 166) specifying the line width and dash pattern that shall be used in drawing the paths."
Beware, though: The annotation dictionary’s AP entry, if present, takes precedence over the InkList and BS entries. And in your PDF the annotation has an appearance entry. So the actually displayed content is that of the Normal appearance stream which contains vector graphics instructions drawing your scribble.
about why I can't use pdfStamper at the first place : one of the page of my pdf document must be resized (downscaled) in order to add some text at the bottom of the page, so I must use pdfWriter for that.
First of all, this only means that you have to do something special to that special page, there is no need to damage all pages by copying them with a PdfWriter. You could manipulate that single page in a separate document, then use PdfCopy to copy the pages before that page from the original PDF, then that page from the separate PDF, and then all pages after that page from the original again.
Thus, you'd only have to fix the annotations of that special page, the annotations on the other pages could remain untouched.
Furthermore, you can even use the PdfStamper if you are ready to use low-level iText routines. In particular, before stamping you can apply the static PdfReader method GetPageContent to the page dictionary of the special page to retrieve the page content as a byte array. From it you build a new byte array in which you prepend an affine transformation that does the downscaling, and you set the new byte array as the content of the page in question using the SetPageContent method of the underlying PdfReader.
Even in this scenario, though, you'd have to adjust the annotation coordinates (both of their rectangles and of other coordinates like the InkList in your case)...
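The "prepend an affine transformation" step can be sketched in plain Java as follows. This is only an illustration of the byte manipulation, not iText code: `ContentScaler`, `prependScale`, and the parameter names are hypothetical. The one fixed piece of syntax is the PDF `cm` operator, which takes the six matrix values `a b c d e f`; a uniform scale with a translation is `s 0 0 s dx dy cm`.

```java
import java.math.BigDecimal;
import java.nio.charset.StandardCharsets;

class ContentScaler {

    /** Formats a number the way PDF content streams expect: plain decimal, no exponent. */
    private static String num(double value) {
        return new BigDecimal(Double.toString(value)).stripTrailingZeros().toPlainString();
    }

    /**
     * Returns a new content stream in which a scaling/translation matrix
     * is applied before the original page content.
     */
    static byte[] prependScale(byte[] content, double scale, double dx, double dy) {
        String prefix = num(scale) + " 0 0 " + num(scale) + " "
                + num(dx) + " " + num(dy) + " cm\n";
        byte[] head = prefix.getBytes(StandardCharsets.US_ASCII);
        byte[] out = new byte[head.length + content.length];
        System.arraycopy(head, 0, out, 0, head.length);
        System.arraycopy(content, 0, out, head.length, content.length);
        return out;
    }

    public static void main(String[] args) {
        byte[] original = "BT ET".getBytes(StandardCharsets.ISO_8859_1);
        byte[] scaled = prependScale(original, 0.8, 0, 50);
        System.out.println(new String(scaled, StandardCharsets.ISO_8859_1));
    }
}
```

Here a scale of 0.8 with a vertical offset of 50 would shrink the page content and lift it, leaving room at the bottom of the page for the extra text.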
Question: How can I get the drawn line of a comment annotation with iTextSharp 5.0?
See above, the annotation of the scribble is an Ink annotation and the drawn path is specified in the InkList and BS entries of its dictionary and additionally instantiated in its normal appearance stream.
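To make the InkList layout concrete, here is a small sketch in plain Java (the class and method names are hypothetical and it does not use iTextSharp). It assumes one inner InkList array has already been read into a `double[]` of alternating horizontal and vertical coordinates, splits it into points, and applies the same scale and offset that a downscaled page would need for its annotation coordinates.

```java
import java.util.ArrayList;
import java.util.List;

class InkListSketch {

    /** Splits one InkList path (x0 y0 x1 y1 ...) into {x, y} points. */
    static List<double[]> toPoints(double[] path) {
        if (path.length % 2 != 0) {
            throw new IllegalArgumentException("InkList path needs an even number of values");
        }
        List<double[]> points = new ArrayList<>();
        for (int i = 0; i < path.length; i += 2) {
            points.add(new double[] { path[i], path[i + 1] });
        }
        return points;
    }

    /** Applies a uniform scale plus offset to every coordinate of one path. */
    static double[] transform(double[] path, double scale, double dx, double dy) {
        double[] out = new double[path.length];
        for (int i = 0; i < path.length; i++) {
            // Even indices are horizontal coordinates, odd indices vertical ones.
            out[i] = path[i] * scale + (i % 2 == 0 ? dx : dy);
        }
        return out;
    }

    public static void main(String[] args) {
        double[] path = { 100, 200, 120, 220, 140, 210 };
        for (double[] p : toPoints(transform(path, 0.5, 0, 50))) {
            System.out.println(p[0] + ", " + p[1]);
        }
    }
}
```

The same transformation would also have to be applied to the annotation rectangle, and an existing appearance stream would still take precedence over the adjusted InkList.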

Apache POI: Partial Cell fonts

If I crack open MS Excel (I assume), or LibreOffice Calc (tested), I can type stuff into a cell, and change the font of parts of the text in a cell, such as doing, in one cell, :
This text is bold and this text is italicized
Again, let me reiterate, that this string could exist in the shown format in one cell.
Can this level of customization be achieved with Apache POI? Searching only seems to show how to apply a font to an entire cell.
Thanks
===UPDATE===
As suggested below, I ended up going with the HSSFRichTextString (as I'm working with HSSF). However, after applying fonts (I tried bold and underline), my text would remain unchanged. This is what I attempted. To put things in context, I am working on something sports-related, in which it is common to display a match up in the form "awayteam"#"hometeam", and depending on certain external conditions, I would like to make one or the other bold. My code looks something like this:
String away = "foo";
String home = "bar";
String bolden = "foo";
HSSFRichTextString val = new HSSFRichTextString(away+"#"+home);
if(bolden.equals(home)) {
    val.applyFont(val.getString().indexOf("#") + 1, val.length(), Font.U_SINGLE);
} else if(bolden.equals(away)) {
    val.applyFont(0, val.getString().indexOf("#"), Font.U_SINGLE);
}
gameHeaderRow.createCell(g + 1).setCellValue(val);
As you can see, this is a snippet of code from a more complicated function than is displayed, but the brunt of this is actual code. As you can see, I'm doing val.applyFont to part of a string, and then setting a cell value with the string. So I'm not entirely sure what I did wrong there. Any advice is appreciated.
Thanks
KFJ

POI does support it; the class you're looking for is RichTextString. If your cell is a text one, you can get a RichTextString for it, then apply fonts to different parts of it to make different parts of the text look different.

You will be out of luck if you are working with SXSSFWorkbook, as it does not support such formatting.
See this thread:
http://apache-poi.1045710.n5.nabble.com/RichTextString-isn-t-working-for-SXSSFWorkbook-td5711695.html

val.applyFont(0, val.getString().indexOf("#"), Font.U_SINGLE);
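You should not pass Font.U_SINGLE to applyFont: it is an underline constant, and applyFont treats a plain number as a font index. Instead, create a Font and call setUnderline(Font.U_SINGLE) on it. HSSFFont has no public constructor, so obtain the font from the workbook.
Example (workbook is assumed to be the HSSFWorkbook that owns the sheet):
    HSSFFont f1 = workbook.createFont();
    f1.setUnderline(Font.U_SINGLE);
    val.applyFont(0, val.getString().indexOf("#"), f1);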

Converting a PDF file to a nice table

I have this PDF file which is arranged in 5 columns.
I have looked and looked through Stack Overflow (and Googled crazily) and tried all the solutions (including the last resort of trying Adobe Acrobat itself).
However, for some reason I cannot get those 5 columns in csv/xls format - as I need them arranged. Usually when I export them, the format is horrible and all the entries are arranged line by line with some data loss.
http://www.2shared.com/document/PagE4A1T/ex1.html
Here is a link to an excerpt of the file above, but I am really getting frustrated and am running out of options.

iText (or iTextSharp) could do this, if you can give it the boundaries of those 5 columns and are willing to deal with some overhead (namely reparsing the page's text for each column):
    // buildColumnBoxes() stands for your own code that supplies the column boundaries
    Rectangle2D[] columnBoxArray = buildColumnBoxes();
    ArrayList<String> columnTexts = new ArrayList<String>(columnBoxArray.length);
    for (Rectangle2D columnBBox : columnBoxArray) {
        FilteredTextRenderListener textInRectStrategy =
            new FilteredTextRenderListener(new LocationTextExtractionStrategy(),
                new RegionTextRenderFilter(columnBBox));
        columnTexts.add(PdfTextExtractor.getTextFromPage(reader, pageNum, textInRectStrategy));
    }
Each line of text should be separated by \n, so it becomes a simple matter of string parsing.
If you wanted to not reparse the whole page for each column, you could probably come up with a custom implementation of FilteredTextRenderListener that would take multiple listener/filter pairs. You could then parse the whole thing once rather than once for each column.
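The single-pass idea can be sketched independently of iText. In the following plain Java sketch, `Chunk` and `bucket` are hypothetical names, and the chunks are assumed to have been collected in one parse of the page together with the start point of each chunk. Every chunk is then tested against each column rectangle once, which is what a listener holding multiple listener/filter pairs would do internally.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ColumnBucketer {

    /** Hypothetical stand-in for one text chunk and its start point on the page. */
    static final class Chunk {
        final String text;
        final double x;
        final double y;

        Chunk(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /**
     * Distributes chunks over the columns in a single pass.
     * Chunks that fall outside every column are dropped.
     */
    static List<List<String>> bucket(List<Chunk> chunks, Rectangle2D[] columns) {
        List<List<String>> result = new ArrayList<>();
        for (int i = 0; i < columns.length; i++) {
            result.add(new ArrayList<>());
        }
        for (Chunk chunk : chunks) {
            for (int i = 0; i < columns.length; i++) {
                if (columns[i].contains(chunk.x, chunk.y)) {
                    result.get(i).add(chunk.text);
                    break; // a chunk belongs to at most one column
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Rectangle2D[] columns = {
            new Rectangle2D.Double(0, 0, 100, 800),
            new Rectangle2D.Double(100, 0, 100, 800)
        };
        List<Chunk> chunks = Arrays.asList(
                new Chunk("a", 10, 700),
                new Chunk("b", 150, 700),
                new Chunk("c", 20, 650));
        System.out.println(bucket(chunks, columns));
    }
}
```

Assigning each chunk by its start point is a simplification; a chunk that straddles a column boundary would need to be split or tested by its full bounding box.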

How to obtain PDF table of contents (outline) data in iOS (iPad)?

I am building an iPad application that displays PDFs, and I'd like to be able to display the table of contents and the let user navigate to the relevant pages.
I have invested several hours in research at this point, and it appears that since PDFKit is [not supported in iOS], my only option is to parse the PDF meta data manually.
I have looked at several solutions, but all of them are silent on one point - how to associate a page in the "outline" metadata with the real page number of the item. I have examined my PDF document with [the Voyeur tool] and I can see the outline in the tree.
[This solution] helped me figure out how to navigate down the Outline/A/S/D tree to find the "Dest" object, but it performs some kind of object comparison using [self.pages indexOfObjectIdenticalTo:destPageDic] that I don't understand.
I have read the [official PDF spec from adobe], and section "12.3.2.3 Named Destinations" describes the way that an outline entry can point to a page:
"Instead of being defined directly with the explicit syntax shown in Table 151, a destination may be referred to indirectly by means of a name object (PDF 1.1) or a byte string (PDF 1.2)."
And continues with this line which is utterly incomprehensible to me:
"The value of this entry shall be a dictionary in which each key is a destination name and the corresponding value is either an array defining the destination, using the syntax shown in Table 151, or a dictionary with a D entry whose value is such an array."
This refers to page 366, "12.3.2.2 Explicit Destinations" where a table describes a page: "In each case, page is an indirect reference to a page object"
So is the result of CGPDFDocumentGetPage or CGPDFPageGetDictionary an "indirect reference to a page object"?
I found a [thread on lists.apple.com] that discusses this. [This comment] implies that you can take the address (in memory?) of a CGPDFPageGetDictionary object for a given page and compare it to the pages in the "Outline" tree of the PDF meta data.
However, when I look at the addresses of page objects in the Outline tree and compare them to these addresses, they are never the same. The line used in that thread, "TTDPRINT(@"%d => %p", k+1, dict);", prints "dict" as a pointer in memory. There's no reason to believe that an object returned there would be the same as one returned somewhere else; they'd be in different places in memory!
My last hope was to look at the source code from apple's command line "outline" tool [mentioned in this book] (as [suggested by this thread]), but I can't find it anywhere.
Bottom line - does anyone have some insight into how PDF outlines work, or know of some open source code (preferably objective-c) that reads PDF outlines?
ARGG: I had all kinds of links posted here, but apparently a new user can only post one link at a time

The page dictionary of a page returned by CGPDFDocumentGetPage (obtained with CGPDFPageGetDictionary) is the same as the indirect page reference you get when resolving a destination in an outline item. Both are dictionaries and you can compare them using ==. When you have a CGPDFDictionaryRef that you want to know the page number of, you can do something like this:
    CGPDFDocumentRef doc = ...;
    CGPDFDictionaryRef outlinePageRef = ...;
    size_t pageCount = CGPDFDocumentGetNumberOfPages(doc);
    for (size_t p = 1; p <= pageCount; p++) {
        CGPDFPageRef page = CGPDFDocumentGetPage(doc, p);
        // Compare the page's dictionary, not the CGPDFPageRef itself
        if (CGPDFPageGetDictionary(page) == outlinePageRef) {
            printf("found the page number: %zu", p);
            break;
        }
    }
An explicit destination however is not a page, but an array with the first element being the page. The other elements are the scroll position on the page etc.
