Pdfsharp - How to determine the page number of a field - pdf

I need to draw an image on a page that has a specific form field. Using pdfsharp, given a field name, how do I find the pdf page associated with that field?

Here is an improved, corrected version that also returns the page number:
PdfPage GetPageFromField(PdfDocument myDocument, string focusFieldName, out int pageNum)
{
    // 1-based page number of the match; stays 0 when no page is found
    pageNum = 0;
    // get the field we're looking for
    PdfAcroField currentField = (PdfAcroField)(myDocument.AcroForm.Fields[focusFieldName]);
    if (currentField != null)
    {
        // get the page element (/P is optional in PDF, so this may be null)
        var focusPageReference = currentField.Elements["/P"] as PdfReference;
        if (focusPageReference != null)
        {
            // loop through our pages to match the reference
            int currentPageNum = 0;
            foreach (var page in myDocument.Pages)
            {
                currentPageNum++;
                if (page.Reference == focusPageReference)
                {
                    pageNum = currentPageNum;
                    return page;
                }
            }
        }
    }
    // could not find a page for this field
    return null;
}

You can access the page reference for the field using the page element of the field object. Then use this reference to match the page in the document.
public PdfPage GetPageFromField(PdfDocument myDocument, string focusFieldName)
{
    // get the field we're looking for (PdfAcroField covers every field type)
    PdfAcroField currentField = (PdfAcroField)(myDocument.AcroForm.Fields[focusFieldName]);
    if (currentField != null)
    {
        // get the page element
        var focusPageReference = (PdfReference)currentField.Elements["/P"];
        // loop through our pages to match the reference
        foreach (var page in myDocument.Pages)
        {
            if (page.Reference == focusPageReference)
            {
                return page;
            }
        }
    }
    // could not find a page for this field
    return null;
}

Related

Using Lucene's highlighting, getting too much highlighted, is there a workaround for this?

I am using the highlighting feature of Lucene to isolate matching terms for my query, but some of the matched terms are excessive.
I have some simple test cases which are delivered in an Ant project (download details below).
Materials
You can download the test case here: mydemo_with_libs.zip
That archive includes the Lucene 8.6.3 libraries which my test uses; if you prefer a copy without the JAR files you can download that from here: mydemo_without_libs.zip
The necessary libraries are: core, analyzers, queries, queryparser, highlighter, and memory.
You can run the test case by unzipping the archive into an empty directory and running the Ant command ant synsearch
Input
I have provided a short synonym list which is used for indexing and analysing in the highlighting methods:
cope,manage
jobs,tasks
simultaneously,at once
and there is one document being indexed:
Queues are a useful way of grouping jobs together in order to manage a number of them at once. You can:
hold or release multiple jobs at the same time;
group multiple tasks (for the same event);
control the priority of jobs in the queue;
Eventually log all events that take place in a queue.
Use either job.queue or task.queue in specifications.
Process
When building the index I am storing the text field, and using a custom analyzer. This is because (in the real world) the content I am indexing is technical documentation, so stripping out punctuation is inappropriate because so much of it may be significant in technical expressions. My analyzer uses a TechTokenFilter which breaks the stream up into tokens consisting of strings of words or digits, or individual characters which don't match the previous pattern.
Here's the relevant code for the analyzer:
public class MyAnalyzer extends Analyzer {
public MyAnalyzer(String synlist) {
if (synlist != null && !synlist.isEmpty()) {
this.synlist = synlist;
this.useSynonyms = true;
}
}
public MyAnalyzer() {
this.useSynonyms = false;
}
@Override
protected TokenStreamComponents createComponents(String fieldName) {
WhitespaceTokenizer src = new WhitespaceTokenizer();
TokenStream result = new TechTokenFilter(new LowerCaseFilter(src));
if (useSynonyms) {
result = new SynonymGraphFilter(result, getSynonyms(synlist), Boolean.TRUE);
result = new FlattenGraphFilter(result);
}
return new TokenStreamComponents(src, result);
}
and here's my filter:
public class TechTokenFilter extends TokenFilter {
private final CharTermAttribute termAttr;
private final PositionIncrementAttribute posIncAttr;
private final ArrayList<String> termStack;
private AttributeSource.State current;
private final TypeAttribute typeAttr;
public TechTokenFilter(TokenStream tokenStream) {
super(tokenStream);
termStack = new ArrayList<>();
termAttr = addAttribute(CharTermAttribute.class);
posIncAttr = addAttribute(PositionIncrementAttribute.class);
typeAttr = addAttribute(TypeAttribute.class);
}
@Override
public boolean incrementToken() throws IOException {
if (this.termStack.isEmpty() && input.incrementToken()) {
final String currentTerm = termAttr.toString();
final int bufferLen = termAttr.length();
if (bufferLen > 0) {
if (termStack.isEmpty()) {
termStack.addAll(Arrays.asList(techTokens(currentTerm)));
current = captureState();
}
}
}
if (!this.termStack.isEmpty()) {
String part = termStack.remove(0);
restoreState(current);
termAttr.setEmpty().append(part);
posIncAttr.setPositionIncrement(1);
return true;
} else {
return false;
}
}
public static String[] techTokens(String t) {
List<String> tokenlist = new ArrayList<String>();
String[] tokens;
StringBuilder next = new StringBuilder();
String token;
char minus = '-';
char underscore = '_';
char c, prec, subc;
// Boolean inWord = false;
for (int i = 0; i < t.length(); i++) {
prec = i > 0 ? t.charAt(i - 1) : 0;
c = t.charAt(i);
subc = i < (t.length() - 1) ? t.charAt(i + 1) : 0;
if (Character.isLetterOrDigit(c) || c == underscore) {
next.append(c);
// inWord = true;
}
else if (c == minus && Character.isLetterOrDigit(prec) && Character.isLetterOrDigit(subc)) {
next.append(c);
} else {
if (next.length() > 0) {
token = next.toString();
tokenlist.add(token);
next.setLength(0);
}
if (Character.isWhitespace(c)) {
// shouldn't be possible because the input stream has been tokenized on
// whitespace
} else {
tokenlist.add(String.valueOf(c));
}
// inWord = false;
}
}
if (next.length() > 0) {
token = next.toString();
tokenlist.add(token);
// next.setLength(0);
}
tokens = tokenlist.toArray(new String[0]);
return tokens;
}
}
Examining the index I can see that the index contains the separate terms I expect, including the synonym values. For example the text at the end of the first line has produced the terms
of
them
at , simultaneously
once
.
You
can
:
and the text at the end of the third line has produced the terms
same
event
)
;
When the application performs a search it analyzes the query without using the synonym list (because the synonyms are already in the index), but I have discovered that I need to include the synonym list when analyzing the stored text to identify the matching fragments.
Searches match the correct documents, but the code I have added to identify the matching terms over-performs. I won't show all the search method here, but will focus on the code which lists matched terms:
public static void doSearch(IndexReader reader, IndexSearcher searcher,
Query query, int max, String synList) throws IOException {
SimpleHTMLFormatter htmlFormatter = new SimpleHTMLFormatter("\001", "\002");
Highlighter highlighter = new Highlighter(htmlFormatter, new QueryScorer(query));
Analyzer analyzer;
if (synList != null) {
analyzer = new MyAnalyzer(synList);
} else {
analyzer = new MyAnalyzer();
}
// Collect all the docs
TopDocs results = searcher.search(query, max);
ScoreDoc[] hits = results.scoreDocs;
int numTotalHits = Math.toIntExact(results.totalHits.value);
System.out.println("\nQuery: " + query.toString());
System.out.println("Matches: " + numTotalHits);
// Collect matching terms
HashSet<String> matchedWords = new HashSet<String>();
int start = 0;
int end = Math.min(numTotalHits, max);
for (int i = start; i < end; i++) {
int id = hits[i].doc;
float score = hits[i].score;
Document doc = searcher.doc(id);
String docpath = doc.get("path");
String doctext = doc.get("text");
try {
TokenStream tokens = TokenSources.getTokenStream("text", null, doctext, analyzer, -1);
TextFragment[] frag = highlighter.getBestTextFragments(tokens, doctext, false, 100);
for (int j = 0; j < frag.length; j++) {
if ((frag[j] != null) && (frag[j].getScore() > 0)) {
String match = frag[j].toString();
addMatchedWord(matchedWords, match);
}
}
} catch (InvalidTokenOffsetsException e) {
System.err.println(e.getMessage());
}
System.out.println("matched file: " + docpath);
}
if (matchedWords.size() > 0) {
System.out.println("matched terms:");
for (String word : matchedWords) {
System.out.println(word);
}
}
}
Problem
While the correct documents are selected by these queries, and the fragments chosen for highlighting do contain the query terms, the highlighted pieces in some of the selected fragments extend over too much of the input.
For example, if the query is
+text:event +text:manage
(the first example in the test case) then I would expect to see 'event' and 'manage' in the highlighted list. But what I actually see is
event);
manage
Despite the highlighting process using an analyzer which breaks terms apart and treats punctuation characters as single terms, the highlight code is "hungry" and breaks on whitespace alone.
Similarly if the query is
+text:queeu~1
(my final test case) I would expect to only see 'queue' in the list. But I get
queue.
job.queue
task.queue
queue;
It is so nearly there... but I don't understand why the highlighted pieces are inconsistent with the index, and I don't think I should have to parse the list of matches through yet another filter to produce the correct list of matches.
I would really appreciate any pointers to what I am doing wrong or how I could improve my code to deliver exactly what I need.
Thanks for reading this far!
I managed to get this working by replacing the WhitespaceTokenizer and TechTokenFilter in my analyzer with a PatternTokenizer; the regular expression took a bit of work but once I had it all the matching terms were extracted with pinpoint accuracy.
The replacement analyzer:
public class MyAnalyzer extends Analyzer {
public MyAnalyzer(String synlist) {
if (synlist != null && !synlist.isEmpty()) {
this.synlist = synlist;
this.useSynonyms = true;
}
}
public MyAnalyzer() {
this.useSynonyms = false;
}
private static final String tokenRegex = "(([\\w]+-)*[\\w]+)|[^\\w\\s]";
@Override
protected TokenStreamComponents createComponents(String fieldName) {
PatternTokenizer src = new PatternTokenizer(Pattern.compile(tokenRegex), 0);
TokenStream result = new LowerCaseFilter(src);
if (useSynonyms) {
result = new SynonymGraphFilter(result, getSynonyms(synlist), Boolean.TRUE);
result = new FlattenGraphFilter(result);
}
return new TokenStreamComponents(src, result);
}
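The effect of the replacement regular expression can be checked outside Lucene with plain java.util.regex. The sketch below is only an illustration: the class name RegexTokenDemo and its two helper methods are hypothetical, and it assumes (the post does not confirm this) that the highlighter marks text using each token's start and end offsets, so that a tokenizer reporting one offset pair per regex match produces tight highlights.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexTokenDemo {
    // Same expression as tokenRegex in the replacement analyzer.
    private static final Pattern TOKEN =
            Pattern.compile("(([\\w]+-)*[\\w]+)|[^\\w\\s]");

    /** Returns the text of every regex match, in order. */
    public static List<String> tokenize(String text) {
        List<String> out = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    /** Returns every match as "text[start,end)" so the offsets are visible. */
    public static List<String> tokensWithOffsets(String text) {
        List<String> out = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            out.add(m.group() + "[" + m.start() + "," + m.end() + ")");
        }
        return out;
    }

    public static void main(String[] args) {
        // Punctuation becomes its own token with its own offsets.
        System.out.println(tokensWithOffsets("event);"));
        // A dot splits a dotted name; a hyphen between word characters does not.
        System.out.println(tokenize("job.queue"));
        System.out.println(tokenize("at-once"));
    }
}
```

For the input event); this yields three matches, each with its own offsets, whereas splitting on whitespace alone would yield a single token spanning all seven characters.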

Changing zoom level of all links in a PDF document

I'd like to change zoom level in links in PDF files using iText 7. There are already solutions for legacy iText 5 (1, 2) but there is none for iText 7.
There are many ways links can work (it can be even JavaScript code) but now I'm interested in GoTo actions (i.e. /XYZ, /Fit and others described in ISO 32000-1, Table 151).
My current code locates GoTo actions but I don't know what the destination of a link is (such as page number, coordinates for XYZ zoom type). How do I get the destination of a link? Or am I missing something and that problem can be solved in a different way?
for (int i = 1; i <= numberOfPages; i++) {
PdfPage page = document.getPage(i);
List<PdfAnnotation> annots = page.getAnnotations();
if (annots.isEmpty()) {
continue;
}
for (int j = 0; j < annots.size(); j++) {
PdfAnnotation annot = annots.get(j);
if (annot == null) {
continue;
}
PdfDictionary action = annot.getPdfObject()
.getAsDictionary(PdfName.A);
if (action == null ||
!PdfName.GoTo.equals(action.get(PdfName.S))) {
continue;
}
annot.remove(PdfName.A);
// here I need the destination of action
int pageNumber = ???
Rectangle cropBox = document.getPage(pageNumber).getCropBox();
PdfAction goToAction = PdfAction.createGoTo(PdfExplicitDestination.createXYZ(
document.getPage(pageNumber), cropBox.getLeft(), cropBox.getTop(), 2F));
annot.put(PdfName.A, goToAction.getPdfObject());
}
}
UPDATE Working Code
I managed to deal with GoTo actions, both explicit ones and those of the Named Destination type. It works on the PDF files I tested the code against, but I'm not sure whether it covers all possible cases.
Note:
When dealing with a named destination, I create an explicit destination and use it instead of the named destination. However, the named destination still exists in the PDF. How do I get rid of it? In iText 5 there was a method consolidateNamedDestinations() which seems to do what I need:
Replaces all the local named links with
the actual destinations.
Please see my code:
try (document) {
Map<String, PdfObject> namedDests = null;
for (int i = 1; i <= document.getNumberOfPages(); i++) {
List<PdfAnnotation> annots = document.getPage(i).getAnnotations();
if (annots.isEmpty()) {
continue;
}
for (PdfAnnotation annot : annots) {
if (annot == null) {
continue;
}
PdfDictionary action = annot.getPdfObject()
.getAsDictionary(PdfName.A);
if (action == null) {
continue;
}
PdfName actionType = action.getAsName(PdfName.S);
if (PdfName.Link.equals(actionType)) {
throw new UnsupportedOperationException(
"Action of type " + action);
} else if (!PdfName.GoTo.equals(actionType)) {
continue;
}
PdfArray dest = action.getAsArray(PdfName.D);
// explicit GoTo case
if (dest != null) {
changeZoomLevel(document, annot, dest.getAsDictionary(0));
continue;
}
// Named Destination case
if (namedDests == null) {
namedDests = document.getCatalog()
.getNameTree(PdfName.Dests)
.getNames();
}
String namedDest = action.getAsString(PdfName.D)
.getValue();
PdfObject d = namedDests.get(namedDest);
PdfDictionary dict = null;
if (d instanceof PdfDictionary) {
PdfArray arr = (PdfArray) ((PdfDictionary) d).get(PdfName.D);
dict = arr.getAsDictionary(0);
} else if (d instanceof PdfArray) {
dict = ((PdfArray) d).getAsDictionary(0);
}
changeZoomLevel(document, annot, dict);
}
}
}
private static void changeZoomLevel(PdfDocument document,
PdfAnnotation annot, PdfDictionary dest) {
annot.remove(PdfName.A);
annot.put(PdfName.A, createPdfAction(document, dest));
}
private static PdfObject createPdfAction(PdfDocument document, PdfDictionary dest) {
PdfPage actionPage = document.getPage(dest);
Rectangle cropBox = actionPage.getCropBox();
return PdfAction.createGoTo(PdfExplicitDestination.createXYZ(
actionPage, cropBox.getLeft(), cropBox.getTop(), 1.5F)).getPdfObject();
}

How to force Mobile Vision for Android to read full lines of text

I have implemented Google's Mobile Vision for Android by following a tutorial. I am trying to build an app that will scan a receipt and find the numeric total. However, as I scan different receipts that are printed in different formats, the API will detect TextBlocks in what seems to be an arbitrary way. For example, in one receipt, if several words of text are separated by single spaces, then they are grouped into a single TextBlock. However, if two words of text are separated by lots of spaces, then they are separated as independent TextBlocks, even though they appear on the same "line". What I am trying to do is force the API to recognize each entire line of the receipt as a single entity. Is this possible?
public ArrayList<T> getAllGraphicsInRow(float rawY) {
synchronized (mLock) {
ArrayList<T> row = new ArrayList<>();
// Get the position of this View so the raw location can be offset relative to the view.
int[] location = new int[2];
this.getLocationOnScreen(location);
for (T graphic : mGraphics) {
float rawX = this.getWidth();
for (int i=0; i<rawX; i+=10){
if (graphic.contains(i - location[0], rawY - location[1])) {
if(!row.contains(graphic)) {
row.add(graphic);
}
}
}
}
return row;
}
}
This should be in the GraphicOverlay.java file and essentially fetches all the graphics in that row.
public static boolean almostEqual(double a, double b, double eps){
return Math.abs(a-b)<(eps);
}
public static boolean pointAlmostEqual(Point a, Point b){
return almostEqual(a.y,b.y,10);
}
public static boolean cornerPointAlmostEqual(Point[] rect1, Point[] rect2){
boolean almostEqual=true;
for (int i=0; i<rect1.length;i++){
if (!pointAlmostEqual(rect1[i],rect2[i])){
almostEqual=false;
}
}
return almostEqual;
}
private boolean onTap(float rawX, float rawY) {
String priceRegex = "(\\d+[,.]\\d\\d)";
ArrayList<OcrGraphic> graphics = mGraphicOverlay.getAllGraphicsInRow(rawY);
OcrGraphic currentGraphics = mGraphicOverlay.getGraphicAtLocation(rawX,rawY);
if (graphics !=null && currentGraphics!=null) {
List<? extends Text> currentComponents = currentGraphics.getTextBlock().getComponents();
final Pattern pattern = Pattern.compile(priceRegex);
final Pattern pattern1 = Pattern.compile(priceRegex);
TextBlock text = null;
Log.i("text results", "This many in the row: " + Integer.toString(graphics.size()));
ArrayList<Text> combinedComponents = new ArrayList<>();
for (OcrGraphic graphic : graphics) {
if (!graphic.equals(currentGraphics)) {
text = graphic.getTextBlock();
Log.i("text results", text.getValue());
combinedComponents.addAll(text.getComponents());
}
}
for (Text currentText : currentComponents) { // goes through components in the row
final Matcher matcher = pattern.matcher(currentText.getValue()); // looks for
Point[] currentPoint = currentText.getCornerPoints();
for (Text otherCurrentText : combinedComponents) {//Looks for other components that are in the same row
final Matcher otherMatcher = pattern1.matcher(otherCurrentText.getValue()); // looks for
Point[] innerCurrentPoint = otherCurrentText.getCornerPoints();
if (cornerPointAlmostEqual(currentPoint, innerCurrentPoint)) {
if (matcher.find()) { // if you click on the price
Log.i("oh yes", "Item: " + otherCurrentText.getValue());
Log.i("oh yes", "Value: " + matcher.group(1));
itemList.add(otherCurrentText.getValue());
priceList.add(Float.valueOf(matcher.group(1)));
}
if (otherMatcher.find()) { // if you click on the item
Log.i("oh yes", "Item: " + currentText.getValue());
Log.i("oh yes", "Value: " + otherMatcher.group(1));
itemList.add(currentText.getValue());
priceList.add(Float.valueOf(otherMatcher.group(1)));
}
Toast toast = Toast.makeText(this, " Text Captured!" , Toast.LENGTH_SHORT);
toast.show();
}
}
}
return true;
}
return false;
}
This should be in OcrCaptureActivity.java. It breaks the tapped TextBlock into lines, finds the blocks in the same row as each line, checks whether the components contain prices, and records the item and value accordingly.
The eps value in almostEqual is the vertical tolerance: two components count as being in the same row when the y-coordinates of their corresponding corner points differ by less than eps (10 in pointAlmostEqual).
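To see how the tolerance affects row detection, here is a minimal, Android-free sketch. It reuses the almostEqual idea from the answer; the class RowGroupingDemo, the groupRows method, and the sample y-coordinates are hypothetical and serve only to illustrate the comparison, not to reproduce the Mobile Vision API.

```java
import java.util.ArrayList;
import java.util.List;

public class RowGroupingDemo {
    /** True when a and b differ by less than eps (same test as in the answer). */
    static boolean almostEqual(double a, double b, double eps) {
        return Math.abs(a - b) < eps;
    }

    /**
     * Groups y-coordinates into rows. A value joins the first existing row
     * whose first member is within eps of it; otherwise it starts a new row.
     */
    public static List<List<Integer>> groupRows(int[] ys, double eps) {
        List<List<Integer>> rows = new ArrayList<>();
        for (int y : ys) {
            List<Integer> target = null;
            for (List<Integer> row : rows) {
                if (almostEqual(row.get(0), y, eps)) {
                    target = row;
                    break;
                }
            }
            if (target == null) {
                target = new ArrayList<>();
                rows.add(target);
            }
            target.add(y);
        }
        return rows;
    }

    public static void main(String[] args) {
        int[] ys = {100, 104, 131, 96, 135}; // made-up top edges of text lines
        System.out.println(groupRows(ys, 10)); // generous tolerance
        System.out.println(groupRows(ys, 2));  // strict tolerance
    }
}
```

A larger eps merges slightly skewed text into one row at the risk of joining adjacent lines; a smaller eps keeps lines apart but may split a tilted receipt line into several rows.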

How to retain page labels when concatenating an existing pdf with a pdf created from scratch?

I have code that creates a "cover page" and then merges it with an existing PDF. The page labels were lost after merging. How can I retain the page labels of the existing PDF and then add a page label to the page created from scratch (e.g. "Cover page")? The example in the book, I think, is about retrieving and replacing page labels. I don't know how to apply this when concatenating an existing PDF with a PDF created from scratch. I am using iText 5.3.0. Thanks in advance.
EDIT
as per comment of mkl
public ByteArrayOutputStream getConcatenatePDF()
{
if (bitstream == null)
return null;
if (item == null)
{
item = getItem();
if (item == null)
return null;
}
ByteArrayOutputStream byteout = null;
InputStream coverStream = null;
try
{
// Get Cover Page
coverStream = getCoverStream();
if (coverStream == null)
return null;
byteout = new ByteArrayOutputStream();
int pageOffset = 0;
ArrayList<HashMap<String, Object>> master = new ArrayList<HashMap<String, Object>>();
Document document = null;
PdfCopy writer = null;
PdfReader reader = null;
byte[] password = (ownerpass != null && !"".equals(ownerpass)) ? ownerpass.getBytes() : null;
// Get information about the original pdf
reader = new PdfReader(bitstream.retrieve(), password);
boolean isPortfolio = reader.getCatalog().contains(PdfName.COLLECTION);
char version = reader.getPdfVersion();
int permissions = reader.getPermissions();
// Get metadata
HashMap<String, String> info = reader.getInfo();
String title = (info.get("Title") == null || "".equals(info.get("Title")))
? getFieldValue("dc.title") : info.get("Title");
String author = (info.get("Author") == null || "".equals(info.get("Author")))
? getFieldValue("dc.contributor.author") : info.get("Author");
String subject = (info.get("Subject") == null || "".equals(info.get("Subject")))
? "" : info.get("Subject");
String keywords = (info.get("Keywords") == null || "".equals(info.get("Keywords")))
? getFieldValue("dc.subject") : info.get("Keywords");
reader.close();
// Merge cover page and the original pdf
InputStream[] is = new InputStream[2];
is[0] = coverStream;
is[1] = bitstream.retrieve();
for (int i = 0; i < is.length; i++)
{
// we create a reader for a certain document
reader = new PdfReader(is[i], password);
reader.consolidateNamedDestinations();
if (i == 0)
{
// step 1: creation of a document-object
document = new Document(reader.getPageSizeWithRotation(1));
// step 2: we create a writer that listens to the document
writer = new PdfCopy(document, byteout);
// Set metadata from the original pdf
// the position of these lines is important
document.addTitle(title);
document.addAuthor(author);
document.addSubject(subject);
document.addKeywords(keywords);
if (pdfa)
{
// Set the necessary information for PDF/A-1B
// the position of these lines is important
writer.setPdfVersion(PdfWriter.VERSION_1_4);
writer.setPDFXConformance(PdfWriter.PDFA1B);
writer.createXmpMetadata();
}
else if (version == '5')
writer.setPdfVersion(PdfWriter.VERSION_1_5);
else if (version == '6')
writer.setPdfVersion(PdfWriter.VERSION_1_6);
else if (version == '7')
writer.setPdfVersion(PdfWriter.VERSION_1_7);
else
; // no operation
// Set security parameters
if (!pdfa)
{
if (password != null)
{
if (security && permissions != 0)
{
writer.setEncryption(null, password, permissions, PdfWriter.STANDARD_ENCRYPTION_128);
}
else
{
writer.setEncryption(null, password, PdfWriter.ALLOW_PRINTING | PdfWriter.ALLOW_COPY | PdfWriter.ALLOW_SCREENREADERS, PdfWriter.STANDARD_ENCRYPTION_128);
}
}
}
// step 3: we open the document
document.open();
// if this pdf is portfolio, does not add cover page
if (isPortfolio)
{
reader.close();
byte[] coverByte = getCoverByte();
if (coverByte == null || coverByte.length == 0)
return null;
PdfCollection collection = new PdfCollection(PdfCollection.TILE);
writer.setCollection(collection);
PdfFileSpecification fs = PdfFileSpecification.fileEmbedded(writer, null, "cover.pdf", coverByte);
fs.addDescription("cover.pdf", false);
writer.addFileAttachment(fs);
continue;
}
}
int n = reader.getNumberOfPages();
// step 4: we add content
PdfImportedPage page;
PdfCopy.PageStamp stamp;
for (int j = 0; j < n; )
{
++j;
page = writer.getImportedPage(reader, j);
if (i == 1) {
stamp = writer.createPageStamp(page);
Rectangle mediabox = reader.getPageSize(j);
Rectangle crop = new Rectangle(mediabox);
writer.setCropBoxSize(crop);
// add overlay text
//<-- Code for adding overlay text -->
stamp.alterContents();
}
writer.addPage(page);
}
PRAcroForm form = reader.getAcroForm();
if (form != null && !pdfa)
{
writer.copyAcroForm(reader);
}
// we retrieve the bookmarks and shift them by the pages already added
List<HashMap<String, Object>> bookmarks = SimpleBookmark.getBookmark(reader);
//if (bookmarks != null && !pdfa)
if (bookmarks != null)
{
if (pageOffset != 0)
{
SimpleBookmark.shiftPageNumbers(bookmarks, pageOffset, null);
}
master.addAll(bookmarks);
}
pageOffset += n;
}
if (!master.isEmpty())
{
writer.setOutlines(master);
}
if (isPortfolio)
{
reader = new PdfReader(bitstream.retrieve(), password);
PdfDictionary catalog = reader.getCatalog();
PdfDictionary documentnames = catalog.getAsDict(PdfName.NAMES);
PdfDictionary embeddedfiles = documentnames.getAsDict(PdfName.EMBEDDEDFILES);
PdfArray filespecs = embeddedfiles.getAsArray(PdfName.NAMES);
PdfDictionary filespec;
PdfDictionary refs;
PRStream stream;
PdfFileSpecification fs;
String path;
// copy embedded files
for (int i = 0; i < filespecs.size(); )
{
filespecs.getAsString(i++); // remove description
filespec = filespecs.getAsDict(i++);
refs = filespec.getAsDict(PdfName.EF);
for (PdfName key : refs.getKeys())
{
stream = (PRStream) PdfReader.getPdfObject(refs.getAsIndirectObject(key));
path = filespec.getAsString(key).toString();
fs = PdfFileSpecification.fileEmbedded(writer, null, path, PdfReader.getStreamBytes(stream));
fs.addDescription(path, false);
writer.addFileAttachment(fs);
}
}
}
if (pdfa)
{
InputStream iccFile = this.getClass().getClassLoader().getResourceAsStream(PROFILE);
ICC_Profile icc = ICC_Profile.getInstance(iccFile);
writer.setOutputIntents("Custom", "", "http://www.color.org", "sRGB IEC61966-2.1", icc);
writer.setViewerPreferences(PdfWriter.PageModeUseOutlines);
}
// step 5: we close the document
document.close();
}
catch (Exception e)
{
log.info(LogManager.getHeader(context, "cover_page: getConcatenatePDF", "bitstream_id="+bitstream.getID()+", error="+e.getMessage()));
// e.printStackTrace();
return null;
}
return byteout;
}
UPDATE
Based on mkl's answer, I modified the code above to look like this:
public ByteArrayOutputStream getConcatenatePDF()
{
if (bitstream == null)
return null;
if (item == null)
{
item = getItem();
if (item == null)
return null;
}
ByteArrayOutputStream byteout = null;
try
{
// Get Cover Page
InputStream coverStream = getCoverStream();
if (coverStream == null)
return null;
byteout = new ByteArrayOutputStream();
InputStream documentStream = bitstream.retrieve();
PdfReader coverPageReader = new PdfReader(coverStream);
PdfReader reader = new PdfReader(documentStream);
PdfStamper stamper = new PdfStamper(reader, byteout);
PdfImportedPage page = stamper.getImportedPage(coverPageReader, 1);
stamper.insertPage(1, coverPageReader.getPageSize(1));
PdfContentByte content = stamper.getUnderContent(1);
int n = reader.getNumberOfPages();
for (int j = 2; j <= n; j++) {
//code for overlay text
ColumnText.showTextAligned(stamper.getOverContent(j), Element.ALIGN_CENTER, overlayText,
crop.getLeft(10), crop.getHeight() / 2 + crop.getBottom(), 90);
}
content.addTemplate(page, 0, 0);
stamper.close();
}
catch (Exception e)
{
log.info(LogManager.getHeader(context, "cover_page: getConcatenatePDF", "bitstream_id="+bitstream.getID()+", error="+e.getMessage()));
e.printStackTrace();
return null;
}
return byteout;
}
And then I set the page labels to the cover page. I omitted code not relevant to my question.
/**
*
* @return InputStream over the generated cover page
*/
private InputStream getCoverStream()
{
ByteArrayOutputStream byteout = getCover();
return new ByteArrayInputStream(byteout.toByteArray());
}
/**
*
* @return the generated cover page as a byte array
*/
private byte[] getCoverByte()
{
ByteArrayOutputStream byteout = getCover();
return byteout.toByteArray();
}
/**
*
* @return ByteArrayOutputStream holding the generated cover page, or null on error
*/
private ByteArrayOutputStream getCover()
{
ByteArrayOutputStream byteout;
Document doc = null;
try
{
byteout = new ByteArrayOutputStream();
doc = new Document(PageSize.LETTER, 24, 24, 20, 40);
PdfWriter pdfwriter = PdfWriter.getInstance(doc, byteout);
PdfPageLabels labels = new PdfPageLabels();
labels.addPageLabel(1, PdfPageLabels.EMPTY, "Cover page", 1);
pdfwriter.setPageLabels(labels);
pdfwriter.setPageEvent(new HeaderFooter());
doc.open();
//code omitted (contents of cover page)
doc.close();
return byteout;
}
catch (Exception e)
{
log.info(LogManager.getHeader(context, "cover_page", "bitstream_id="+bitstream.getID()+", error="+e.getMessage()));
return null;
}
}
The modified code retained the page labels of the existing PDF (see screenshot 1) (documentStream), but the labels in the resulting merged PDF (screenshots 2 and 3) are off by one page because a cover page was inserted. As suggested by mkl, I should assign a page label to the cover page, but it seems the page labels set on the imported page were lost. My concern now is: how do I set the page labels on the final document state, as also suggested by mkl? I suppose I should use PdfWriter, but I don't know where to put that in my modified code. Am I correct to assume that the document is in its final state after stamper.close()? Thanks again in advance.
Screenshot 1. Notice the actual page 1 labeled Front cover
Screenshot 2. Merged pdf, after the generated on-the-fly "cover page" was inserted. The page label "Front cover" was now assigned to the cover page even after I've set the pdf label of the inserted page using labels.addPageLabel(1, PdfPageLabels.EMPTY, "Cover page", 1)
Screenshot 3. Note that the page label 3 was assigned to page 2.
FINAL UPDATE
Kudos to @mkl
The screenshot below is the result after I applied the latest update of mkl's answer. The page labels are now assigned correctly to the pages. Also, using PdfStamper instead of PdfCopy (as used in my original code) did not break the PDF/A compliance of the existing PDF.
Adding the cover page
Usually PdfCopy is the right choice for merging PDFs: it creates a new document from the copied pages, copying as much of the page-level information as possible without preferring any single document.
Your case is somewhat special, though: You have one document whose structure and content you prefer and want to apply a small change to it by adding a single page, a title page. All the while all information including document-level information (e.g. metadata, embedded files, ...) from the main document shall still be present in the result.
In such a use case it is more appropriate to use a PdfStamper which you use to "stamp" changes onto an existing PDF.
You might want to start from something like this:
try ( InputStream documentStream = getClass().getResourceAsStream("template.pdf");
InputStream titleStream = getClass().getResourceAsStream("title.pdf");
OutputStream outputStream = new FileOutputStream(new File(RESULT_FOLDER, "test-with-title-page.pdf")) )
{
PdfReader titleReader = new PdfReader(titleStream);
PdfReader reader = new PdfReader(documentStream);
PdfStamper stamper = new PdfStamper(reader, outputStream);
PdfImportedPage page = stamper.getImportedPage(titleReader, 1);
stamper.insertPage(1, titleReader.getPageSize(1));
PdfContentByte content = stamper.getUnderContent(1);
content.addTemplate(page, 0, 0);
stamper.close();
}
PS: Concerning questions in comments:
In my code above, I should have an overlay text supposedly (before the stamp.alterContents() portion) but I omitted that part of code for testing purposes. Can you please give me an idea how to implement that?
Do you mean something like an overlayed watermark? The PdfStamper allows you to access an "over content" for each page onto which you can draw any content:
PdfContentByte overContent = stamper.getOverContent(pageNumber);
Keeping page labels
My other question is about page offset, because I inserted the cover page, the page numbering are off by 1 page. How can I resolve that?
Unfortunately, iText's PdfStamper does not automatically update the page label definition of the manipulated PDF. This is not surprising, because it is not clear how the inserted page is meant to be labeled. @Bruno: at least, though, iText could shift the page label sections that start after the insertion page number.
Using iText's low level API it is possible, though, to fix the original label positions and add a label for the inserted page. This can be implemented similarly to the iText in Action PageLabelExample example, more exactly its manipulatePageLabel part; simply add this before stamper.close():
PdfDictionary root = reader.getCatalog();
PdfDictionary labels = root.getAsDict(PdfName.PAGELABELS);
if (labels != null)
{
PdfArray newNums = new PdfArray();
newNums.add(new PdfNumber(0));
PdfDictionary coverDict = new PdfDictionary();
coverDict.put(PdfName.P, new PdfString("Cover Page"));
newNums.add(coverDict);
PdfArray nums = labels.getAsArray(PdfName.NUMS);
if (nums != null)
{
for (int i = 0; i < nums.size() - 1; )
{
int n = nums.getAsNumber(i++).intValue();
newNums.add(new PdfNumber(n+1));
newNums.add(nums.getPdfObject(i++));
}
}
labels.put(PdfName.NUMS, newNums);
stamper.markUsed(labels);
}
For a document with these labels:
It generates a document with these labels:
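The index shift applied to the /Nums array above can be illustrated without iText. The following is only a minimal sketch in plain Java: the LabelRange class and the insertCover method are hypothetical stand-ins for the PdfNumber/PdfDictionary pairs, and the sample labels are made up.

```java
import java.util.ArrayList;
import java.util.List;

final class PageLabelShift {
    /** Hypothetical stand-in for one /Nums pair: zero-based start page index plus a label description. */
    static final class LabelRange {
        final int startIndex;
        final String label;

        LabelRange(int startIndex, String label) {
            this.startIndex = startIndex;
            this.label = label;
        }
    }

    /** Returns a new list with a cover entry at index 0 and every original range moved back by one page. */
    static List<LabelRange> insertCover(List<LabelRange> original, String coverLabel) {
        List<LabelRange> shifted = new ArrayList<>();
        shifted.add(new LabelRange(0, coverLabel));
        for (LabelRange range : original) {
            shifted.add(new LabelRange(range.startIndex + 1, range.label));
        }
        return shifted;
    }

    public static void main(String[] args) {
        // Made-up sample data: two label ranges starting at pages 0 and 4
        List<LabelRange> original = new ArrayList<>();
        original.add(new LabelRange(0, "roman numerals"));
        original.add(new LabelRange(4, "decimal numbers"));
        for (LabelRange range : insertCover(original, "Cover Page")) {
            System.out.println(range.startIndex + " -> " + range.label);
        }
    }
}
```

Every original range keeps its label definition; only its start index moves by one, which mirrors what the loop over nums does.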
Keeping links
I just found out that the inserted page "Cover Page" lost its link annotations. I wonder if there's a workaround for this, since according to the book, the interactive features of the inserted page are lost when using PdfStamper.
Indeed, among the iText PDF generating classes only PdfCopy keeps interactive features like annotations. Unfortunately one has to decide whether one wants to
create a genuinely new PDF (PdfWriter) with no information from other PDFs beyond contents being embeddable;
manipulate a single existing PDF (PdfStamper) with all information from that one PDF being preserved but no information from other PDFs beyond contents being embeddable;
merge any number of existing PDFs (PdfCopy) with most page-level information from all those PDFs being preserved but no document-level information from any.
In your case I thought the new cover page had only static content, no dynamic features, and so assumed PdfStamper was the best choice. If you only have to deal with links, you may consider copying them manually, e.g. using this helper method:
/**
* <p>
* A primitive attempt at copying links from page <code>sourcePage</code>
* of <code>PdfReader reader</code> to page <code>targetPage</code> of
* <code>PdfStamper stamper</code>.
* </p>
* <p>
* This method is meant only for the use case at hand, i.e. copying a link
* to an external URI without expecting any advanced features.
* </p>
*/
void copyLinks(PdfStamper stamper, int targetPage, PdfReader reader, int sourcePage)
{
PdfDictionary sourcePageDict = reader.getPageNRelease(sourcePage);
PdfArray annotations = sourcePageDict.getAsArray(PdfName.ANNOTS);
if (annotations != null && annotations.size() > 0)
{
for (PdfObject annotationObject : annotations)
{
annotationObject = PdfReader.getPdfObject(annotationObject);
if (!annotationObject.isDictionary())
continue;
PdfDictionary annotation = (PdfDictionary) annotationObject;
if (!PdfName.LINK.equals(annotation.getAsName(PdfName.SUBTYPE)))
continue;
PdfArray rectArray = annotation.getAsArray(PdfName.RECT);
if (rectArray == null || rectArray.size() < 4)
continue;
Rectangle rectangle = PdfReader.getNormalizedRectangle(rectArray);
PdfName highlight = annotation.getAsName(PdfName.H);
if (highlight == null)
highlight = PdfAnnotation.HIGHLIGHT_INVERT;
PdfDictionary actionDict = annotation.getAsDict(PdfName.A);
if (actionDict == null || !PdfName.URI.equals(actionDict.getAsName(PdfName.S)))
continue;
PdfString urlPdfString = actionDict.getAsString(PdfName.URI);
if (urlPdfString == null)
continue;
PdfAction action = new PdfAction(urlPdfString.toString());
PdfAnnotation link = PdfAnnotation.createLink(stamper.getWriter(), rectangle, highlight, action);
stamper.addAnnotation(link, targetPage);
}
}
}
which you can call right after inserting the original page:
PdfImportedPage page = stamper.getImportedPage(titleReader, 1);
stamper.insertPage(1, titleReader.getPageSize(1));
PdfContentByte content = stamper.getUnderContent(1);
content.addTemplate(page, 0, 0);
copyLinks(stamper, 1, titleReader, 1);
Beware, this method is really simple. It only considers links with URI actions and creates a link on the target page using the same location, target, and highlight setting as the original one. If the original one uses more refined features (e.g. if it brings along its own appearance streams or even merely uses the border style attributes) and you want to keep these features, you have to improve the method to also copy the entries for these features to the new annotation.

How to Detect table start in itextSharp?

I am trying to convert a PDF to a CSV file. The PDF has data in tabular format with the first row as a header. I have got as far as extracting text from a cell, comparing the baselines of text in the table, and detecting newlines, but I need to compare table borders to detect the start of the table. I do not know how to detect and compare lines in a PDF. Can anyone help me?
Thanks!!!
As you've seen (hopefully), PDFs have no concept of tables, just text placed at specific locations and lines drawn around them. There is no internal relationship between the text and the lines. This is very important to understand.
Knowing this, if all of the cells have enough padding, you can look for gaps between characters that are large enough, such as the width of three or more spaces. If the cells don't have enough spacing, this will unfortunately probably break.
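As a rough illustration of the gap idea, here is a minimal sketch in plain Java (no iText involved). The Chunk class, the splitIntoCells method, the assumed space width of 2.5 points, and the sample coordinates are all hypothetical; in practice the positions would come from a text extraction strategy.

```java
import java.util.ArrayList;
import java.util.List;

final class GapSplitter {
    /** Hypothetical positioned text fragment: left/right x coordinates in points plus its text. */
    static final class Chunk {
        final float left;
        final float right;
        final String text;

        Chunk(float left, float right, String text) {
            this.left = left;
            this.right = right;
            this.text = text;
        }
    }

    /** Splits the chunks of one line (sorted left to right) into cells wherever the gap is at least minGap. */
    static List<String> splitIntoCells(List<Chunk> line, float minGap) {
        List<String> cells = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        Chunk previous = null;
        for (Chunk chunk : line) {
            if (previous != null && chunk.left - previous.right >= minGap) {
                // Gap is wide enough: close the current cell and start a new one
                cells.add(current.toString());
                current.setLength(0);
            }
            current.append(chunk.text);
            previous = chunk;
        }
        if (previous != null) {
            cells.add(current.toString());
        }
        return cells;
    }

    public static void main(String[] args) {
        float spaceWidth = 2.5f;        // assumed width of one space in this font
        float minGap = 3 * spaceWidth;  // "three or more spaces"
        List<Chunk> line = new ArrayList<>();
        line.add(new Chunk(10f, 20f, "Na"));
        line.add(new Chunk(20f, 30f, "me"));
        line.add(new Chunk(60f, 75f, "Age"));
        line.add(new Chunk(120f, 140f, "City"));
        System.out.println(splitIntoCells(line, minGap)); // prints [Name, Age, City]
    }
}
```

The threshold is the fragile part: with tightly packed cells the gap between columns can fall below it, which is exactly the failure case described above.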
You could also look at every line in the PDF and try to figure out what represents your "table-like" lines. See this answer for how to walk every token on a page to see what's being drawn.
I was also looking for an answer to a similar question, but unfortunately I didn't find one, so I did it on my own.
A PDF page like this
Will give the output as
Here is the GitHub link to the .NET console application I made:
https://github.com/Justabhi96/Detect_And_Extract_Table_From_Pdf
This application detects the table on a specific page of the PDF and prints it in table format on the console.
Here is the code that I used to make this application.
First of all, I extracted the text from the PDF along with its coordinates using a class that extends iTextSharp's iTextSharp.text.pdf.parser.LocationTextExtractionStrategy class. The code is as follows.
This is the class that stores each chunk's text together with its coordinates:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Web;
namespace itextPdfTextCoordinates
{
public class RectAndText
{
public iTextSharp.text.Rectangle Rect;
public String Text;
public RectAndText(iTextSharp.text.Rectangle rect, String text)
{
this.Rect = rect;
this.Text = text;
}
}
}
And this is the class that extends the LocationTextExtractionStrategy class.
using iTextSharp.text.pdf.parser;
using System;
using System.Collections.Generic;
using System.Linq;
using System.Web;
namespace itextPdfTextCoordinates
{
public class MyLocationTextExtractionStrategy : LocationTextExtractionStrategy
{
public List<RectAndText> myPoints = new List<RectAndText>();
//Automatically called for each chunk of text in the PDF
public override void RenderText(TextRenderInfo renderInfo)
{
base.RenderText(renderInfo);
//Get the bounding box for the chunk of text
var bottomLeft = renderInfo.GetDescentLine().GetStartPoint();
var topRight = renderInfo.GetAscentLine().GetEndPoint();
//Create a rectangle from it
var rect = new iTextSharp.text.Rectangle(
bottomLeft[Vector.I1],
bottomLeft[Vector.I2],
topRight[Vector.I1],
topRight[Vector.I2]
);
//Add this to our main collection
this.myPoints.Add(new RectAndText(rect, renderInfo.GetText()));
}
}
}
This class overrides the RenderText method of LocationTextExtractionStrategy, which is called for each chunk of text when you extract a PDF page using the PdfTextExtractor.GetTextFromPage() method with this strategy:
using itextPdfTextCoordinates;
using iTextSharp.text.pdf;
//Create an instance of our strategy
var t = new MyLocationTextExtractionStrategy();
var path = "F:\\sample-data.pdf";
//Parse every page of the document above
using (var r = new PdfReader(path))
{
for (var i = 1; i <= r.NumberOfPages; i++)
{
// Calling this function adds all the chunks with their coordinates to the
// 'myPoints' variable of 'MyLocationTextExtractionStrategy' Class
var ex = iTextSharp.text.pdf.parser.PdfTextExtractor.GetTextFromPage(r, i, t);
}
}
//Here you can loop over the chunks of the PDF
foreach (var i in t.myPoints)
{
Console.WriteLine("character {0} is at {1}*{2}", i.Text, i.Rect.Left, i.Rect.Top);
}
Now, to detect the start and end of the table, you can use the coordinates of the chunks extracted from the PDF.
If a line does not contain a table, there will be no jump between the Right coordinate of the current chunk and the Left coordinate of the next chunk. Lines that contain a table, however, will have such coordinate jumps of at least 3 points.
For example, a line that is part of a table will have chunk coordinates something like this:
right coord of current chunk -> 12.75pts
left coord of next chunk -> 20.30pts
You can use this logic to detect tables in the PDF.
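Before the full implementation, the core of the idea can be condensed into a small sketch. It is written in plain Java rather than C# purely for illustration and does not use iTextSharp; the Chunk class, the method names, and the sample coordinates are hypothetical, and the chunks are assumed to arrive in reading order.

```java
import java.util.ArrayList;
import java.util.List;

final class TableLineDetector {
    /** Hypothetical chunk: horizontal extent, bottom coordinate (all in points) and text. */
    static final class Chunk {
        final float left;
        final float right;
        final float bottom;
        final String text;

        Chunk(float left, float right, float bottom, String text) {
            this.left = left;
            this.right = right;
            this.bottom = bottom;
            this.text = text;
        }
    }

    /** Groups chunks into lines: a chunk joins the current line if its bottom is within tolerance of the line's first chunk. */
    static List<List<Chunk>> groupIntoLines(List<Chunk> chunks, float tolerance) {
        List<List<Chunk>> lines = new ArrayList<>();
        for (Chunk chunk : chunks) {
            List<Chunk> last = lines.isEmpty() ? null : lines.get(lines.size() - 1);
            if (last != null && Math.abs(last.get(0).bottom - chunk.bottom) <= tolerance) {
                last.add(chunk);
            } else {
                List<Chunk> fresh = new ArrayList<>();
                fresh.add(chunk);
                lines.add(fresh);
            }
        }
        return lines;
    }

    /** A line counts as a table row if any two neighbouring chunks are at least minJump points apart. */
    static boolean hasColumnJump(List<Chunk> line, float minJump) {
        for (int i = 0; i + 1 < line.size(); i++) {
            if (line.get(i + 1).left - line.get(i).right >= minJump) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<Chunk> chunks = new ArrayList<>();
        chunks.add(new Chunk(10f, 40f, 700f, "Plain "));
        chunks.add(new Chunk(40f, 60f, 700f, "text"));
        chunks.add(new Chunk(10f, 12.75f, 680f, "ID"));
        chunks.add(new Chunk(20.30f, 45f, 680f, "Name"));
        List<List<Chunk>> lines = groupIntoLines(chunks, 3f);
        for (int i = 0; i < lines.size(); i++) {
            System.out.println("line " + i + " table row: " + hasColumnJump(lines.get(i), 3f));
        }
    }
}
```

With these sample values the first line has no jump and the second line has a jump of 7.55 points between 12.75 and 20.30, so only the second one is reported as a table row.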
The code is as follows:
using itextPdfTextCoordinates;
using iTextSharp.text.pdf;
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
namespace ConsoleApp1
{
class LineUsingCoordinates
{
public static List<List<string>> getLineText(string path, int page, float[] coord)
{
//Create an instance of our strategy
var t = new MyLocationTextExtractionStrategy();
//Parse the requested page of the document
using (var r = new PdfReader(path))
{
// Calling this function adds all the chunks with their coordinates to the
// 'myPoints' variable of 'MyLocationTextExtractionStrategy' Class
var ex = iTextSharp.text.pdf.parser.PdfTextExtractor.GetTextFromPage(r, page, t);
}
// List of columns in one line
List<string> lineWord = new List<string>();
// temporary list for working around appending the <List<List<string>>
List<string> tempWord;
// List of rows. rows are list of string
List<List<string>> lineText = new List<List<string>>();
// List consisting list of chunks related to each line
List<List<RectAndText>> lineChunksList = new List<List<RectAndText>>();
//List consisting the chunks for whole page;
List<RectAndText> chunksList;
// List consisting the list of Bottom coord of the lines present in the page
List<float> bottomPointList = new List<float>();
//Getting List of Coordinates of Lines in the page no matter it's a table or not
foreach (var i in t.myPoints)
{
Console.WriteLine("character {0} is at {1}*{2}", i.Text, i.Rect.Left, i.Rect.Top);
// If the coords passed to the function is not null then process the part in the
// given coords of the page otherwise process the whole page
if (coord != null)
{
if (i.Rect.Left >= coord[0] &&
i.Rect.Bottom >= coord[1] &&
i.Rect.Right <= coord[2] &&
i.Rect.Top <= coord[3])
{
float bottom = i.Rect.Bottom;
if (bottomPointList.Count == 0)
{
bottomPointList.Add(bottom);
}
else if (Math.Abs(bottomPointList.Last() - bottom) > 3)
{
bottomPointList.Add(bottom);
}
}
}
// else process the whole page
else
{
float bottom = i.Rect.Bottom;
if (bottomPointList.Count == 0)
{
bottomPointList.Add(bottom);
}
else if (Math.Abs(bottomPointList.Last() - bottom) > 3)
{
bottomPointList.Add(bottom);
}
}
}
// Sometimes the above list contains elements that come from the same line but have
// slightly different coordinates because of characters like " ", ".", etc.
// These coordinates differ by less than 4 points in their bottom values.
// To remove those elements we build two lists of values to remove from the original list.
// This list holds the elements whose coordinates differ only slightly
List<float> removeList = new List<float>();
// This list is having the elements which are having the same coordinates
List<float> sameList = new List<float>();
// Here we are adding the elements in those two lists to remove the elements
// from the original list later
for (var i = 0; i < bottomPointList.Count; i++)
{
var basePoint = bottomPointList[i];
for (var j = i+1; j < bottomPointList.Count; j++)
{
var comparePoint = bottomPointList[j];
//here we are getting the elements with same coordinates
if (Math.Abs(comparePoint - basePoint) == 0)
{
sameList.Add(comparePoint);
}
// here we are getting the elements whose coordinates differ, but by
// less than 4 points
else if (Math.Abs(comparePoint - basePoint) < 4)
{
removeList.Add(comparePoint);
}
}
}
// Here we are removing the matching elements of remove list from the original list
bottomPointList = bottomPointList.Where(item => !removeList.Contains(item)).ToList();
//Here we are removing the first matching element of same list from the original list
foreach (var r in sameList)
{
bottomPointList.Remove(r);
}
// Here we are getting the characters of the same line in a List 'chunkList'.
foreach (var bottomPoint in bottomPointList)
{
chunksList = new List<RectAndText>();
for (int i = 0; i < t.myPoints.Count; i++)
{
// If the character is having same bottom coord then add it to chunkList
if (bottomPoint == t.myPoints[i].Rect.Bottom)
{
chunksList.Add(t.myPoints[i]);
}
// If character is having a difference of less than 3 in the bottom coord then also
// add it to chunkList because the coord of the next line will differ at least 10 points
// from the coord of current line
else if (Math.Abs(t.myPoints[i].Rect.Bottom - bottomPoint) < 3)
{
chunksList.Add(t.myPoints[i]);
}
}
// Here we are adding the chunkList related to each line
lineChunksList.Add(chunksList);
}
bool sameLine = false;
//Here we are looping through the lines consisting the chunks related to each line
foreach(var linechunk in lineChunksList)
{
var text = "";
// Here we loop through the chunks of the line and split the text wherever there is
// a coordinate jump between one chunk's right edge and the next chunk's left edge,
// because only lines that are part of a table have such jumps;
// lines of running text do not
for (var i = 0; i< linechunk.Count-1; i++)
{
// If the coord is having a jump of less than 3 points then it will be in the same
// column otherwise the next chunk belongs to different column
if (Math.Abs(linechunk[i].Rect.Right - linechunk[i + 1].Rect.Left) < 3)
{
if (i == linechunk.Count - 2)
{
text += linechunk[i].Text + linechunk[i+1].Text ;
}
else
{
text += linechunk[i].Text;
}
}
else
{
if (i == linechunk.Count - 2)
{
// add the text to the column and set the value of next column to ""
text += linechunk[i].Text;
// this is the list of columns in other word its the row
lineWord.Add(text);
text = "";
text += linechunk[i + 1].Text;
lineWord.Add(text);
text = "";
}
else
{
text += linechunk[i].Text;
lineWord.Add(text);
text = "";
}
}
}
if(text.Trim() != "")
{
lineWord.Add(text);
}
// creating a temporary list of strings for the List<List<string>> manipulation
tempWord = new List<string>();
tempWord.AddRange(lineWord);
// "lineText" is the type of List<List<string>>
// this is our list of rows. and rows are List of strings
// here we are adding the row to the list of rows
lineText.Add(tempWord);
lineWord.Clear();
}
return lineText;
}
}
}
You can call the getLineText() method of the above class and run the following loop to print the output as a table on the console.
var testFile = "F:\\sample-data.pdf";
float[] limitCoordinates = { 52, 671, 357, 728 };//{LowerLeftX,LowerLeftY,UpperRightX,UpperRightY}
// This line gives the list of rows, each consisting of one or more columns.
// If you pass null as the third parameter, it returns the content of the whole page,
// but if you pass coordinates, it returns only the content within those coordinates
var lineText = LineUsingCoordinates.getLineText(testFile, 1, null);
//var lineText = LineUsingCoordinates.getLineText(testFile, 1, limitCoordinates);
// To detect the table we use the fact that a 'lineText' item with fewer than
// two elements is surely not part of the table, while an item with two or more
// elements is treated as part of the table
foreach (var row in lineText)
{
if (row.Count > 1)
{
for (var col = 0; col < row.Count; col++)
{
string trimmedValue = row[col].Trim();
if (trimmedValue != "")
{
Console.Write("|" + trimmedValue + "|");
}
}
Console.WriteLine("");
}
}
Console.ReadLine();
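Since the original goal was a CSV file, the detected rows can be written out as CSV lines instead of being printed with "|" separators. The following is only a minimal sketch in plain Java (the same logic ports directly to C#); RowsToCsv, escape, and toCsvLines are hypothetical names, and the rows are assumed to have the same list-of-lists shape that getLineText() returns.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class RowsToCsv {
    /** Trims a cell and quotes it if it contains a comma, a double quote, or a line break. */
    static String escape(String field) {
        String trimmed = field.trim();
        if (trimmed.contains(",") || trimmed.contains("\"") || trimmed.contains("\n")) {
            return "\"" + trimmed.replace("\"", "\"\"") + "\"";
        }
        return trimmed;
    }

    /** Converts rows to CSV lines, skipping rows with fewer than two columns (treated as non-table text). */
    static List<String> toCsvLines(List<List<String>> rows) {
        List<String> lines = new ArrayList<>();
        for (List<String> row : rows) {
            if (row.size() < 2) {
                continue;
            }
            List<String> escaped = new ArrayList<>();
            for (String cell : row) {
                escaped.add(escape(cell));
            }
            lines.add(String.join(",", escaped));
        }
        return lines;
    }

    public static void main(String[] args) {
        // Made-up sample rows: one heading line followed by two table rows
        List<List<String>> rows = Arrays.asList(
                Arrays.asList("Report title"),
                Arrays.asList("Name", "Age"),
                Arrays.asList("Doe, John", " 42 "));
        for (String line : toCsvLines(rows)) {
            System.out.println(line);
        }
    }
}
```

Quoting fields that contain commas or quotes keeps the column count stable when the file is opened in a spreadsheet.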