Remove underlines from text in PDF file - pdf

I have a bunch of PDF files with broken links.
I need to remove those links and right now I can do the following:
Remove link actions
Change text color from blue to black
What I can't do is to remove blue underlines below text that was a link before.
I tried several PDF libraries for .NET (because this is my primary platform)
Aspose.PDF
PDFSharp
ceTe DynamicPDF
PDFBox
You are welcome to recommend a solution in any programming language, platform, and library. I just need to get this done.

In case of the sample document the underlines are drawn as blue (RGB 0,0,1) filled vector graphics rectangles (long, slim ones). As blue only is used for the links, we can use that criterion to find the rectangles in question.
Here is a sample implementation using PDFBox 1.8.10:
void removeBlueRectangles(PDDocument document) throws IOException
{
List<?> pages = document.getDocumentCatalog().getAllPages();
for (int i = 0; i < pages.size(); i++)
{
PDPage page = (PDPage) pages.get(i);
PDStream contents = page.getContents();
PDFStreamParser parser = new PDFStreamParser(contents.getStream());
parser.parse();
List<Object> tokens = parser.getTokens();
Stack<Boolean> blueState = new Stack<Boolean>();
blueState.push(false);
for (int j = 0; j < tokens.size(); j++)
{
Object next = tokens.get(j);
if (next instanceof PDFOperator)
{
PDFOperator op = (PDFOperator) next;
if (op.getOperation().equals("q"))
{
blueState.push(blueState.peek());
}
else if (op.getOperation().equals("Q"))
{
blueState.pop();
}
else if (op.getOperation().equals("rg"))
{
if (j > 2)
{
Object r = tokens.get(j-3);
Object g = tokens.get(j-2);
Object b = tokens.get(j-1);
if (r instanceof COSNumber && g instanceof COSNumber && b instanceof COSNumber)
{
blueState.pop();
blueState.push((
Math.abs(((COSNumber)r).floatValue() - 0) < 0.001 &&
Math.abs(((COSNumber)g).floatValue() - 0) < 0.001 &&
Math.abs(((COSNumber)b).floatValue() - 1) < 0.001));
}
}
}
else if (op.getOperation().equals("f"))
{
if (blueState.peek() && j > 0)
{
Object re = tokens.get(j-1);
if (re instanceof PDFOperator && ((PDFOperator)re).getOperation().equals("re"))
{
tokens.set(j, PDFOperator.getOperator("n"));
}
}
}
}
}
PDStream updatedStream = new PDStream(document);
OutputStream out = updatedStream.createOutputStream();
ContentStreamWriter tokenWriter = new ContentStreamWriter(out);
tokenWriter.writeTokens(tokens);
page.setContents(updatedStream);
}
}
(RemoveUnderlines.java)
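The state tracking used above can also be tried without PDFBox. The following is only a minimal sketch, assuming the content stream has already been split into plain string tokens; the class and method names are hypothetical and not part of any library. It keeps a stack of "is the fill color blue?" flags across q/Q and turns an f that directly follows re into the no-op n while the fill color is pure RGB blue:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical, library-free illustration of the q/Q color-state tracking. */
class BlueRectangleFilterSketch {

    /** Returns a copy of the tokens in which blue "re f" pairs become "re n". */
    static List<String> dropBlueFilledRectangles(List<String> tokens) {
        List<String> result = new ArrayList<>(tokens);
        Deque<Boolean> blueState = new ArrayDeque<>();
        blueState.push(false); // initial graphics state: fill color is not blue
        for (int j = 0; j < result.size(); j++) {
            switch (result.get(j)) {
                case "q": // save graphics state: duplicate the current flag
                    blueState.push(blueState.peek());
                    break;
                case "Q": // restore graphics state, never dropping the initial entry
                    if (blueState.size() > 1) {
                        blueState.pop();
                    }
                    break;
                case "rg": // set RGB fill color from the three preceding tokens
                    if (j > 2) {
                        blueState.pop();
                        // Assumption of this sketch: non-numeric operands count as "not blue".
                        blueState.push(isNear(result.get(j - 3), 0)
                                && isNear(result.get(j - 2), 0)
                                && isNear(result.get(j - 1), 1));
                    }
                    break;
                case "f": // fill: suppress it only for a rectangle drawn in blue
                    if (blueState.peek() && j > 0 && "re".equals(result.get(j - 1))) {
                        result.set(j, "n");
                    }
                    break;
                default:
                    break;
            }
        }
        return result;
    }

    private static boolean isNear(String token, float reference) {
        try {
            return Math.abs(Float.parseFloat(token) - reference) < 0.001f;
        } catch (NumberFormatException e) {
            return false;
        }
    }
}
```

Because the flag is restored on Q, a black rectangle drawn inside a q ... Q block is kept, while a blue rectangle drawn after the block is still removed.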
original.pdf
Applying this to your first sample file original.pdf
public void testOriginal() throws IOException, COSVisitorException
{
try ( InputStream resourceStream = getClass().getResourceAsStream("original.pdf") )
{
PDDocument document = PDDocument.loadNonSeq(resourceStream, null);
removeBlueRectangles(document);
document.save("original-noBlueRectangles.pdf");
document.close();
}
}
(RemoveUnderlines.java)
results in
1178.pdf
You commented
After testing this on many files I have to say this solution works incorrectly in some cases. For example in for this file (dropbox.com/s/23g54bvt781lb93/1178.pdf?dl=0) it removes the entire content of the page. Keep searching..
So I applied the code to your new sample file 1178.pdf:
public void test1178() throws IOException, COSVisitorException
{
try ( InputStream resourceStream = getClass().getResourceAsStream("1178.pdf") )
{
PDDocument document = PDDocument.loadNonSeq(resourceStream, null);
removeBlueRectangles(document);
document.save(new File(RESULT_FOLDER, "1178-noBlueRectangles.pdf"));
document.close();
}
}
(RemoveUnderlines.java)
which resulted in
So I cannot confirm your claim that the solution works incorrectly; in particular I see that it does not remove the entire content of the page.
As I cannot reproduce your observation, I assume there are additional issues in your setup you have not yet mentioned.

Related

Changing zoom level of all links in a PDF document

I'd like to change zoom level in links in PDF files using iText 7. There are already solutions for legacy iText 5 (1, 2) but there is none for iText 7.
There are many ways links can work (it can be even JavaScript code) but now I'm interested in GoTo actions (i.e. /XYZ, /Fit and others described in ISO 32000-1, Table 151).
My current code locates GoTo actions but I don't know what the destination of a link is (such as page number, coordinates for XYZ zoom type). How do I get the destination of a link? Or am I missing something and that problem can be solved in a different way?
for (int i = 1; i <= numberOfPages; i++) {
PdfPage page = document.getPage(i);
List<PdfAnnotation> annots = page.getAnnotations();
if (annots.isEmpty()) {
continue;
}
for (int j = 0; j < annots.size(); j++) {
PdfAnnotation annot = annots.get(j);
if (annot == null) {
continue;
}
PdfDictionary action = annot.getPdfObject()
.getAsDictionary(PdfName.A);
if (action == null ||
!PdfName.GoTo.equals(action.get(PdfName.S))) {
continue;
}
annot.remove(PdfName.A);
// here I need the destination of action
int pageNumber = ???
Rectangle cropBox = document.getPage(pageNumber).getCropBox();
PdfAction goToAction = PdfAction.createGoTo(PdfExplicitDestination.createXYZ(
document.getPage(pageNumber), cropBox.getLeft(), cropBox.getTop(), 2F));
annot.put(PdfName.A, goToAction.getPdfObject());
}
}
UPDATE Working Code
I managed to deal with GoTo actions, both explicit ones and those of type Named Destination. It works on the PDF files I tested the code against, but I'm not sure whether it covers all possible cases.
Note:
When dealing with named destination I create explicit destination and then use it instead of named destination. The thing is, however, that named destination still exists in pdf. How do I get rid of it? In iText 5 there was a method consolidateNamedDestinations() which seems to do what I need:
Replaces all the local named links with
the actual destinations.
Please see my code:
try (document) {
Map<String, PdfObject> namedDests = null;
for (int i = 1; i <= document.getNumberOfPages(); i++) {
List<PdfAnnotation> annots = document.getPage(i).getAnnotations();
if (annots.isEmpty()) {
continue;
}
for (PdfAnnotation annot : annots) {
if (annot == null) {
continue;
}
PdfDictionary action = annot.getPdfObject()
.getAsDictionary(PdfName.A);
if (action == null) {
continue;
}
PdfName actionType = action.getAsName(PdfName.S);
if (PdfName.Link.equals(actionType)) {
throw new UnsupportedOperationException(
"Action of type " + action);
} else if (!PdfName.GoTo.equals(actionType)) {
continue;
}
PdfArray dest = action.getAsArray(PdfName.D);
// explicit GoTO case
if (dest != null) {
changeZoomLevel(document, annot, dest.getAsDictionary(0));
continue;
}
// Named Destination case
if (namedDests == null) {
namedDests = document.getCatalog()
.getNameTree(PdfName.Dests)
.getNames();
}
String namedDest = action.getAsString(PdfName.D)
.getValue();
PdfObject d = namedDests.get(namedDest);
PdfDictionary dict = null;
if (d instanceof PdfDictionary) {
PdfArray arr = (PdfArray) ((PdfDictionary) d).get(PdfName.D);
dict = arr.getAsDictionary(0);
} else if (d instanceof PdfArray) {
dict = ((PdfArray) d).getAsDictionary(0);
}
changeZoomLevel(document, annot, dict);
}
}
}
private static void changeZoomLevel(PdfDocument document,
PdfAnnotation annot, PdfDictionary dest) {
annot.remove(PdfName.A);
annot.put(PdfName.A, createPdfAction(document, dest));
}
private static PdfObject createPdfAction(PdfDocument document, PdfDictionary dest) {
PdfPage actionPage = document.getPage(dest);
Rectangle cropBox = actionPage.getCropBox();
return PdfAction.createGoTo(PdfExplicitDestination.createXYZ(
actionPage, cropBox.getLeft(), cropBox.getTop(), 1.5F)).getPdfObject();
}
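For reference, an explicit destination is an array whose first element identifies the page and whose second element names the fit type, e.g. [page /XYZ left top zoom] or [page /Fit] (ISO 32000-1, Table 151). Below is a minimal, library-free sketch of the rewrite that createPdfAction performs, with plain Java lists standing in for PdfArray. The class and method names are hypothetical and not part of iText:

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model of an explicit destination array; not an iText class. */
class DestinationZoomSketch {

    /**
     * Builds [page /XYZ left top zoom] from any explicit destination,
     * keeping only the page entry (element 0) of the original array.
     */
    static List<Object> toXyz(List<Object> dest, float left, float top, float zoom) {
        if (dest == null || dest.size() < 2) {
            throw new IllegalArgumentException("Not an explicit destination: " + dest);
        }
        List<Object> result = new ArrayList<>();
        result.add(dest.get(0)); // the target page stays the same
        result.add("/XYZ");      // the fit type is replaced
        result.add(left);
        result.add(top);
        result.add(zoom);
        return result;
    }
}
```

The point of the sketch is only that the page can be read from the first array element regardless of the fit type, which is what dest.getAsDictionary(0) relies on in the code above.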

Traverse whole PDF and change blue color to black ( Change color of underlines as well) + iText

I am using the code below to remove blue colors from PDF text. It changes the text color correctly, but it does not change the color of the underlines.
original file part:
Manipulated File:
As you can see in the manipulated file above, the underline color didn't change.
I have been looking for a fix for two weeks; can anyone help with this? Below is my color-changing code:
public void testChangeBlackTextToGreenDocument(String source, String filename) throws IOException {
try (InputStream resource = getClass().getResourceAsStream(source);
PdfReader pdfReader = new PdfReader(source);
OutputStream result = new FileOutputStream(filename);
PdfWriter pdfWriter = new PdfWriter(result);
PdfDocument pdfDocument = new PdfDocument(pdfReader, pdfWriter);) {
PdfCanvasEditor editor = new PdfCanvasEditor() {
@Override
protected void write(PdfCanvasProcessor processor, PdfLiteral operator, List<PdfObject> operands) {
String operatorString = operator.toString();
if (TEXT_SHOWING_OPERATORS.contains(operatorString)) {
List<PdfObject> listobj = new ArrayList<>();
listobj.add(new PdfNumber(0));
listobj.add(new PdfNumber(0));
listobj.add(new PdfNumber(0));
listobj.add(new PdfLiteral("rg"));
if (currentlyReplacedBlack == null) {
Color currentFillColor =getGraphicsState().getFillColor();
if (ColorConstants.GREEN.equals(currentFillColor) || ColorConstants.CYAN.equals(currentFillColor) || ColorConstants.BLUE.equals(currentFillColor)) {
currentlyReplacedBlack = currentFillColor;
super.write(processor, new PdfLiteral("rg"), listobj);
}
}
} else if (currentlyReplacedBlack != null) {
if (currentlyReplacedBlack instanceof DeviceCmyk) {
List<PdfObject> listobj = new ArrayList<>();
listobj.add(new PdfNumber(0));
listobj.add(new PdfNumber(0));
listobj.add(new PdfNumber(0));
listobj.add(new PdfNumber(0));
listobj.add(new PdfLiteral("k"));
super.write(processor, new PdfLiteral("k"), listobj);
} else if (currentlyReplacedBlack instanceof DeviceGray) {
List<PdfObject> listobj = new ArrayList<>();
listobj.add(new PdfNumber(0));
listobj.add(new PdfLiteral("g"));
super.write(processor, new PdfLiteral("g"), listobj);
} else {
List<PdfObject> listobj = new ArrayList<>();
listobj.add(new PdfNumber(0));
listobj.add(new PdfNumber(0));
listobj.add(new PdfNumber(0));
listobj.add(new PdfLiteral("rg"));
super.write(processor, new PdfLiteral("rg"), listobj);
}
currentlyReplacedBlack = null;
}
super.write(processor, operator, operands);
}
Color currentlyReplacedBlack = null;
final List<String> TEXT_SHOWING_OPERATORS = Arrays.asList("Tj", "'", "\"", "TJ");
};
for (int i = 1; i <= pdfDocument.getNumberOfPages(); i++) {
editor.editPage(pdfDocument, i);
}
}
File file = new File(source);
file.delete();
}
Here is the original file.
https://raad-dev-test.s3.ap-south-1.amazonaws.com/36/2019-08-30/originalFile.pdf
Related Links:
Traverse whole PDF and change some attribute with some object in it using iText
Removing Watermark from PDF iTextSharp
Maven Dependency Details:
<dependency>
<groupId>com.itextpdf</groupId>
<artifactId>itext7-core</artifactId>
<version>7.1.5</version>
<type>pom</type>
</dependency>
<dependency>
<groupId>com.itextpdf</groupId>
<artifactId>itextpdf</artifactId>
<version>5.0.6</version>
</dependency>
Edited:
Accepted answer is not working for below files:
https://raad-dev-test.s3.ap-south-1.amazonaws.com/36/2019-08-30/021549Orig1s025_aprepitant_clinpharm_prea_Mac.pdf (Page 41)
https://raad-dev-test.s3.ap-south-1.amazonaws.com/36/2019-08-30/400_206494S5_avibactam_and_ceftazidine_unireview_prea_Mac.pdf (Page 60).
Please Help.
(The example code here uses iText 7 for Java. You mentioned neither the iText version nor your programming environment in tags or question text but your example code appears to indicate that this is your combination of choice.)
Replacing blue fill colors
The test you based your original code on attempts explicitly only to change text color. The "underline" in your document, though, is (as far as PDF drawing is concerned) not part of the text but instead drawn as a simple path. Thus, the underline explicitly is not touched by the original code and it has to be adapted for your task.
But actually your task, changing everything blue to black, is easier to implement than only changing the blue text, e.g.
try ( PdfReader pdfReader = new PdfReader(SOURCE_PDF);
PdfWriter pdfWriter = new PdfWriter(RESULT_PDF);
PdfDocument pdfDocument = new PdfDocument(pdfReader, pdfWriter) )
{
PdfCanvasEditor editor = new PdfCanvasEditor()
{
@Override
protected void write(PdfCanvasProcessor processor, PdfLiteral operator, List<PdfObject> operands)
{
String operatorString = operator.toString();
if (SET_FILL_RGB.equals(operatorString) && operands.size() == 4) {
if (isApproximatelyEqual(operands.get(0), 0) &&
isApproximatelyEqual(operands.get(1), 0) &&
isApproximatelyEqual(operands.get(2), 1)) {
super.write(processor, new PdfLiteral("g"), Arrays.asList(new PdfNumber(0), new PdfLiteral("g")));
return;
}
}
super.write(processor, operator, operands);
}
boolean isApproximatelyEqual(PdfObject number, float reference) {
return number instanceof PdfNumber && Math.abs(reference - ((PdfNumber)number).floatValue()) < 0.01f;
}
final String SET_FILL_RGB = "rg";
};
for (int i = 1; i <= pdfDocument.getNumberOfPages(); i++)
{
editor.editPage(pdfDocument, i);
}
}
(ChangeColor test testChangeFillRgbBlueToBlack)
Beware, this is merely a proof-of-concept, not a final and complete solution. In particular:
It merely looks at the fill (non-stroking) colors. In your case that suffices as both your text (as usual) and your underline use fill colors only - the underline actually is not drawn as a stroked line but instead as a slim, filled rectangle.
Only RGB blue (and only such blue set using the rg instruction, not set using sc or scn, let alone blues combined out of other colors using funky blend modes) is considered. This might be an issue particularly in case of documents explicitly designed for printing (likely using CMYK colors).
PdfCanvasEditor only inspects and edits the content stream of the page itself, not the content streams of displayed form XObjects or patterns; thus, some content may not be found. It can be generalized fairly easily.
The result:
Replacing blue fill and stroke colors
Testing the code above, you soon found documents in which the underlines were not changed. As it turned out, these underlines are actually drawn as stroked lines, not as filled rectangles as above.
To also properly edit such documents, therefore, you must not only edit the fill colors but also the stroke colors, e.g. like this:
try ( PdfReader pdfReader = new PdfReader(SOURCE_PDF);
PdfWriter pdfWriter = new PdfWriter(RESULT_PDF);
PdfDocument pdfDocument = new PdfDocument(pdfReader, pdfWriter) )
{
PdfCanvasEditor editor = new PdfCanvasEditor()
{
@Override
protected void write(PdfCanvasProcessor processor, PdfLiteral operator, List<PdfObject> operands)
{
String operatorString = operator.toString();
if (SET_FILL_RGB.equals(operatorString) && operands.size() == 4) {
if (isApproximatelyEqual(operands.get(0), 0) &&
isApproximatelyEqual(operands.get(1), 0) &&
isApproximatelyEqual(operands.get(2), 1)) {
super.write(processor, new PdfLiteral("g"), Arrays.asList(new PdfNumber(0), new PdfLiteral("g")));
return;
}
}
if (SET_STROKE_RGB.equals(operatorString) && operands.size() == 4) {
if (isApproximatelyEqual(operands.get(0), 0) &&
isApproximatelyEqual(operands.get(1), 0) &&
isApproximatelyEqual(operands.get(2), 1)) {
super.write(processor, new PdfLiteral("G"), Arrays.asList(new PdfNumber(0), new PdfLiteral("G")));
return;
}
}
super.write(processor, operator, operands);
}
boolean isApproximatelyEqual(PdfObject number, float reference) {
return number instanceof PdfNumber && Math.abs(reference - ((PdfNumber)number).floatValue()) < 0.01f;
}
final String SET_FILL_RGB = "rg";
final String SET_STROKE_RGB = "RG";
};
for (int i = 1; i <= pdfDocument.getNumberOfPages(); i++)
{
editor.editPage(pdfDocument, i);
}
}
(ChangeColor tests testChangeRgbBlueToBlackControlOfNitrosamineImpuritiesInSartansRev and testChangeRgbBlueToBlackEdqmReportsIssuesOfNonComplianceWithToothMac)
The results:
and
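The operator mapping used in both variants (rg becomes g for fill, RG becomes G for stroke) can be looked at in isolation. This is a minimal sketch, assuming one instruction is given as its operands followed by the operator, all as plain strings; the class and method names are hypothetical and independent of iText:

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical, library-free illustration of the rg/RG to g/G replacement. */
class RgbBlueToBlackSketch {

    /** Returns the instruction to write instead of the given operands plus operator. */
    static List<String> replaceInstruction(List<String> instruction) {
        if (instruction.isEmpty()) {
            return instruction;
        }
        String operator = instruction.get(instruction.size() - 1);
        boolean fill = "rg".equals(operator);   // non-stroking RGB color
        boolean stroke = "RG".equals(operator); // stroking RGB color
        if ((fill || stroke) && instruction.size() == 4
                && isApproximately(instruction.get(0), 0)
                && isApproximately(instruction.get(1), 0)
                && isApproximately(instruction.get(2), 1)) {
            // Black in DeviceGray: "0 g" for fill, "0 G" for stroke.
            return Arrays.asList("0", fill ? "g" : "G");
        }
        return instruction; // everything else is passed through unchanged
    }

    private static boolean isApproximately(String token, float reference) {
        try {
            return Math.abs(reference - Float.parseFloat(token)) < 0.01f;
        } catch (NumberFormatException e) {
            return false;
        }
    }
}
```

Keeping fill and stroke apart matters because g and G set different colors in the graphics state; writing g for a stroked underline would leave the line blue.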
Replacing different shades of blue from other RGB'ish color spaces
Testing the code above, you again found documents in which the blue colors were not changed. As it turned out, these blue colors were not from the standard DeviceRGB color space but instead from ICCBased color spaces, profiled RGB color spaces to be more exact. In particular, other color setting operators were used than before, sc / scn instead of rg. Furthermore, in one document not a pure blue 0 0 1 but instead a .17255 .3098 .63529 blue was used.
If we assume that sc and scn instructions with three numeric arguments set some flavor of RGB colors as here (in general this is an oversimplification, Lab and other color spaces can also come with three components, but your documents seem RGB oriented) and are less strict in recognizing the blue color, we can generalize the code above as follows:
class AllRgbBlueToBlackConverter extends PdfCanvasEditor {
@Override
protected void write(PdfCanvasProcessor processor, PdfLiteral operator, List<PdfObject> operands)
{
String operatorString = operator.toString();
if (RGB_SETTER_CANDIDATES.contains(operatorString) && operands.size() == 4) {
if (isBlue(operands.get(0), operands.get(1), operands.get(2))) {
PdfNumber number0 = new PdfNumber(0);
operands.set(0, number0);
operands.set(1, number0);
operands.set(2, number0);
}
}
super.write(processor, operator, operands);
}
boolean isBlue(PdfObject red, PdfObject green, PdfObject blue) {
if (red instanceof PdfNumber && green instanceof PdfNumber && blue instanceof PdfNumber) {
float r = ((PdfNumber)red).floatValue();
float g = ((PdfNumber)green).floatValue();
float b = ((PdfNumber)blue).floatValue();
return b > .5f && r < .9f*b && g < .9f*b;
}
return false;
}
final Set<String> RGB_SETTER_CANDIDATES = new HashSet<>(Arrays.asList("rg", "RG", "sc", "SC", "scn", "SCN"));
}
(ChangeColor helper class)
Used like this
try ( PdfReader pdfReader = new PdfReader(INPUT);
PdfWriter pdfWriter = new PdfWriter(OUTPUT);
PdfDocument pdfDocument = new PdfDocument(pdfReader, pdfWriter) ) {
PdfCanvasEditor editor = new AllRgbBlueToBlackConverter();
for (int i = 1; i <= pdfDocument.getNumberOfPages(); i++)
{
editor.editPage(pdfDocument, i);
}
}
we get
and
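To get a feeling for which colors the relaxed isBlue test accepts, the condition can be run on its own. This is a minimal sketch with the same thresholds as above, taking plain floats instead of PdfObject; the class name is hypothetical:

```java
/** Hypothetical stand-alone copy of the blue heuristic for experimenting. */
class BlueHeuristicSketch {

    /** Blue component above .5, and red and green each below 90% of blue. */
    static boolean isBlue(float r, float g, float b) {
        return b > .5f && r < .9f * b && g < .9f * b;
    }
}
```

With these thresholds both the pure blue 0 0 1 and the .17255 .3098 .63529 blue from your document are accepted, while black, white, and a near-gray like .6 .6 .63 are rejected. Other documents may need different thresholds.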

Bad:Converting pdf to images

Convert class:
public void getImage(String pdfFilename) throws Exception{
List<byte[]> listImg = new ArrayList<>();
try (final PDDocument document = PDDocument.load(new File(pdfFilename))){
PDFRenderer pdfRenderer = new PDFRenderer(document);
for (int page = 0; page < document.getNumberOfPages(); ++page)
{
File file = new File("C:\\path1\\"+page+".png");
BufferedImage bim = pdfRenderer.renderImage(page);
ByteArrayOutputStream baos = new ByteArrayOutputStream();
ImageIO.write(bim, "png",file);
System.out.println("!!!!");
// System.out.println(Arrays.toString(listImg.get(page)));
}
document.close();
} catch (IOException e){
System.err.println("Exception while trying to create pdf document - " + e);
}
}
Everything works well; all PDF files are converted. But if I use the following class (which is essential for my project):
PdfDocument srcDoc = new PdfDocument(new PdfReader(DEST1));
Rectangle rect = srcDoc.getFirstPage().getPageSize();
System.out.println(rect);
Rectangle pageSize = new Rectangle(rect.getWidth(), rect.getHeight());
PdfDocument pdfDoc = new PdfDocument(new PdfWriter(dest));
pdfDoc.setDefaultPageSize(new PageSize(pageSize));
System.out.println(srcDoc.getNumberOfPages());
PdfCanvas content = new PdfCanvas(pdfDoc.addNewPage());
int n = 0;
for (int i =1 ; i <= srcDoc.getNumberOfPages(); i++) {
PdfFormXObject page = srcDoc.getPage(i).copyAsFormXObject(pdfDoc);
content.clip();
content.newPath();
content.addXObject(page,MainPdf.right_Margin-MainPdf.left_Margin,0);
content = new PdfCanvas(pdfDoc.addNewPage());
for (double y = 4.251969f; y <= 595; y += 14.1732) {
content.moveTo(0, y);
content.lineTo(420, y);
}
for (double x = 0; x <= 420; x += 14.1732) {
content.moveTo(x, 0);
content.lineTo(x, 595);
}
content.closePathStroke();
}
srcDoc.close();
pdfDoc.close();
}
the images converted from the resulting PDF are empty (they contain nothing, just a white background), although the PDF itself is not empty.
pdf:https://dropmefiles.com/UXedd
images:
The cause was the call
content.clip();
in the iText segment. This clips with an empty path. Adobe Reader ignores this, but PDFBox doesn't, so the current clipping path becomes empty and nothing is visible.
Per one of the comments, removing that call solves the problem. (I suspect that content.newPath(); isn't needed either)
I have also tried other viewers: PDF.js and GhostScript don't display it, Chrome and Edge display it.

Change PDF Annotation font size using itext 7

My question is a bit similar to this one : Change PDF Annotation properties using iTextSharp C#
But I want to specifically change font size of pdf annotation using iText 7. I have searched a lot online but haven't been able to find any great examples or documentation regarding this. Following is the code I have used.
static void EditAnnot(string PDF)
{
string OutPDF = @"C:\Users\AP037X\Desktop\test.pdf";
iText.Kernel.Pdf.PdfDocument pdfDoc = new iText.Kernel.Pdf.PdfDocument(new iText.Kernel.Pdf.PdfReader(PDF), new iText.Kernel.Pdf.PdfWriter(OutPDF));
iText.Kernel.Pdf.PdfDictionary pageDict = pdfDoc.GetPage(1).GetPdfObject();
iText.Kernel.Pdf.PdfArray annots = pageDict.GetAsArray(iText.Kernel.Pdf.PdfName.Annots);
if (annots != null)
{
for (int i = 0; i < annots.Size(); i++)
{
Console.WriteLine("Scan..");
if (annots.GetAsDictionary(i) == null)
{
Console.WriteLine("1");
//return;
}
iText.Kernel.Pdf.PdfString t = annots.GetAsDictionary(i).GetAsString(iText.Kernel.Pdf.PdfName.Contents);
if (t == null)
{
Console.WriteLine("2");
//return;
}
Console.WriteLine(t);
if (Convert.ToString(t).Trim() == "Change")
{
Console.WriteLine("Found");
Console.WriteLine(annots.Size());
iText.Kernel.Geom.Rectangle rect = annots.GetAsDictionary(i).GetAsRectangle(iText.Kernel.Pdf.PdfName.Rect);
iText.Kernel.Pdf.PdfString cont = new iText.Kernel.Pdf.PdfString("New String");
iText.Kernel.Pdf.Annot.PdfFreeTextAnnotation NewAnnot = new iText.Kernel.Pdf.Annot.PdfFreeTextAnnotation(rect,cont);
float[] color = { 1f,1f,0f};
NewAnnot.SetColor(color);
NewAnnot.Put(iText.Kernel.Pdf.PdfName.Contents, new iText.Kernel.Pdf.PdfString("lion"));
NewAnnot.Put(iText.Kernel.Pdf.PdfName.Font, *What to type here?*);
annots.Remove(i);
annots.Add(i, NewAnnot.GetPdfObject());
}
}
}
pdfDoc.Close();
CompressPDF(OutPDF);
}

How to retain page labels when concatenating an existing pdf with a pdf created from scratch?

I have a code which is creating a "cover page" and then merging it with an existing pdf. The pdf labels were lost after merging. How can I retain the pdf labels of the existing pdf and then add a page label to the pdf page created from scratch (eg "Cover page")? The example of the book I think is about retrieving and replacing page labels. I don't know how to apply this when concatenating an existing pdf with a pdf created from scratch. I am using itext 5.3.0. Thanks in advance.
EDIT
as per comment of mkl
public ByteArrayOutputStream getConcatenatePDF()
{
if (bitstream == null)
return null;
if (item == null)
{
item = getItem();
if (item == null)
return null;
}
ByteArrayOutputStream byteout = null;
InputStream coverStream = null;
try
{
// Get Cover Page
coverStream = getCoverStream();
if (coverStream == null)
return null;
byteout = new ByteArrayOutputStream();
int pageOffset = 0;
ArrayList<HashMap<String, Object>> master = new ArrayList<HashMap<String, Object>>();
Document document = null;
PdfCopy writer = null;
PdfReader reader = null;
byte[] password = (ownerpass != null && !"".equals(ownerpass)) ? ownerpass.getBytes() : null;
// Get information of the original pdf
reader = new PdfReader(bitstream.retrieve(), password);
boolean isPortfolio = reader.getCatalog().contains(PdfName.COLLECTION);
char version = reader.getPdfVersion();
int permissions = reader.getPermissions();
// Get metadata
HashMap<String, String> info = reader.getInfo();
String title = (info.get("Title") == null || "".equals(info.get("Title")))
? getFieldValue("dc.title") : info.get("Title");
String author = (info.get("Author") == null || "".equals(info.get("Author")))
? getFieldValue("dc.contributor.author") : info.get("Author");
String subject = (info.get("Subject") == null || "".equals(info.get("Subject")))
? "" : info.get("Subject");
String keywords = (info.get("Keywords") == null || "".equals(info.get("Keywords")))
? getFieldValue("dc.subject") : info.get("Keywords");
reader.close();
// Merge cover page and the original pdf
InputStream[] is = new InputStream[2];
is[0] = coverStream;
is[1] = bitstream.retrieve();
for (int i = 0; i < is.length; i++)
{
// we create a reader for a certain document
reader = new PdfReader(is[i], password);
reader.consolidateNamedDestinations();
if (i == 0)
{
// step 1: creation of a document-object
document = new Document(reader.getPageSizeWithRotation(1));
// step 2: we create a writer that listens to the document
writer = new PdfCopy(document, byteout);
// Set metadata from the original pdf
// the position of these lines is important
document.addTitle(title);
document.addAuthor(author);
document.addSubject(subject);
document.addKeywords(keywords);
if (pdfa)
{
// Set the necessary information for PDF/A-1B
// the position of these lines is important
writer.setPdfVersion(PdfWriter.VERSION_1_4);
writer.setPDFXConformance(PdfWriter.PDFA1B);
writer.createXmpMetadata();
}
else if (version == '5')
writer.setPdfVersion(PdfWriter.VERSION_1_5);
else if (version == '6')
writer.setPdfVersion(PdfWriter.VERSION_1_6);
else if (version == '7')
writer.setPdfVersion(PdfWriter.VERSION_1_7);
else
; // no operation
// Set security parameters
if (!pdfa)
{
if (password != null)
{
if (security && permissions != 0)
{
writer.setEncryption(null, password, permissions, PdfWriter.STANDARD_ENCRYPTION_128);
}
else
{
writer.setEncryption(null, password, PdfWriter.ALLOW_PRINTING | PdfWriter.ALLOW_COPY | PdfWriter.ALLOW_SCREENREADERS, PdfWriter.STANDARD_ENCRYPTION_128);
}
}
}
// step 3: we open the document
document.open();
// if this pdf is portfolio, does not add cover page
if (isPortfolio)
{
reader.close();
byte[] coverByte = getCoverByte();
if (coverByte == null || coverByte.length == 0)
return null;
PdfCollection collection = new PdfCollection(PdfCollection.TILE);
writer.setCollection(collection);
PdfFileSpecification fs = PdfFileSpecification.fileEmbedded(writer, null, "cover.pdf", coverByte);
fs.addDescription("cover.pdf", false);
writer.addFileAttachment(fs);
continue;
}
}
int n = reader.getNumberOfPages();
// step 4: we add content
PdfImportedPage page;
PdfCopy.PageStamp stamp;
for (int j = 0; j < n; )
{
++j;
page = writer.getImportedPage(reader, j);
if (i == 1) {
stamp = writer.createPageStamp(page);
Rectangle mediabox = reader.getPageSize(j);
Rectangle crop = new Rectangle(mediabox);
writer.setCropBoxSize(crop);
// add overlay text
//<-- Code for adding overlay text -->
stamp.alterContents();
}
writer.addPage(page);
}
PRAcroForm form = reader.getAcroForm();
if (form != null && !pdfa)
{
writer.copyAcroForm(reader);
}
// we retrieve the total number of pages
List<HashMap<String, Object>> bookmarks = SimpleBookmark.getBookmark(reader);
//if (bookmarks != null && !pdfa)
if (bookmarks != null)
{
if (pageOffset != 0)
{
SimpleBookmark.shiftPageNumbers(bookmarks, pageOffset, null);
}
master.addAll(bookmarks);
}
pageOffset += n;
}
if (!master.isEmpty())
{
writer.setOutlines(master);
}
if (isPortfolio)
{
reader = new PdfReader(bitstream.retrieve(), password);
PdfDictionary catalog = reader.getCatalog();
PdfDictionary documentnames = catalog.getAsDict(PdfName.NAMES);
PdfDictionary embeddedfiles = documentnames.getAsDict(PdfName.EMBEDDEDFILES);
PdfArray filespecs = embeddedfiles.getAsArray(PdfName.NAMES);
PdfDictionary filespec;
PdfDictionary refs;
PRStream stream;
PdfFileSpecification fs;
String path;
// copy embedded files
for (int i = 0; i < filespecs.size(); )
{
filespecs.getAsString(i++); // remove description
filespec = filespecs.getAsDict(i++);
refs = filespec.getAsDict(PdfName.EF);
for (PdfName key : refs.getKeys())
{
stream = (PRStream) PdfReader.getPdfObject(refs.getAsIndirectObject(key));
path = filespec.getAsString(key).toString();
fs = PdfFileSpecification.fileEmbedded(writer, null, path, PdfReader.getStreamBytes(stream));
fs.addDescription(path, false);
writer.addFileAttachment(fs);
}
}
}
if (pdfa)
{
InputStream iccFile = this.getClass().getClassLoader().getResourceAsStream(PROFILE);
ICC_Profile icc = ICC_Profile.getInstance(iccFile);
writer.setOutputIntents("Custom", "", "http://www.color.org", "sRGB IEC61966-2.1", icc);
writer.setViewerPreferences(PdfWriter.PageModeUseOutlines);
}
// step 5: we close the document
document.close();
}
catch (Exception e)
{
log.info(LogManager.getHeader(context, "cover_page: getConcatenatePDF", "bitstream_id="+bitstream.getID()+", error="+e.getMessage()));
// e.printStackTrace();
return null;
}
return byteout;
}
UPDATE
Based on mkl's answer, I modified the code above to look like this:
public ByteArrayOutputStream getConcatenatePDF()
{
if (bitstream == null)
return null;
if (item == null)
{
item = getItem();
if (item == null)
return null;
}
ByteArrayOutputStream byteout = null;
try
{
// Get Cover Page
InputStream coverStream = getCoverStream();
if (coverStream == null)
return null;
byteout = new ByteArrayOutputStream();
InputStream documentStream = bitstream.retrieve();
PdfReader coverPageReader = new PdfReader(coverStream);
PdfReader reader = new PdfReader(documentStream);
PdfStamper stamper = new PdfStamper(reader, byteout);
PdfImportedPage page = stamper.getImportedPage(coverPageReader, 1);
stamper.insertPage(1, coverPageReader.getPageSize(1));
PdfContentByte content = stamper.getUnderContent(1);
int n = reader.getNumberOfPages();
for (int j = 2; j <= n; j++) {
//code for overlay text
ColumnText.showTextAligned(stamper.getOverContent(j), Element.ALIGN_CENTER, overlayText,
crop.getLeft(10), crop.getHeight() / 2 + crop.getBottom(), 90);
}
content.addTemplate(page, 0, 0);
stamper.close();
}
catch (Exception e)
{
log.info(LogManager.getHeader(context, "cover_page: getConcatenatePDF", "bitstream_id="+bitstream.getID()+", error="+e.getMessage()));
e.printStackTrace();
return null;
}
return byteout;
}
And then I set the page labels to the cover page. I omitted code not relevant to my question.
/**
*
* @return InputStream the resulting output stream
*/
private InputStream getCoverStream()
{
ByteArrayOutputStream byteout = getCover();
return new ByteArrayInputStream(byteout.toByteArray());
}
/**
*
* @return InputStream the resulting output stream
*/
private byte[] getCoverByte()
{
ByteArrayOutputStream byteout = getCover();
return byteout.toByteArray();
}
/**
*
* @return InputStream the resulting output stream
*/
private ByteArrayOutputStream getCover()
{
ByteArrayOutputStream byteout;
Document doc = null;
try
{
byteout = new ByteArrayOutputStream();
doc = new Document(PageSize.LETTER, 24, 24, 20, 40);
PdfWriter pdfwriter = PdfWriter.getInstance(doc, byteout);
PdfPageLabels labels = new PdfPageLabels();
labels.addPageLabel(1, PdfPageLabels.EMPTY, "Cover page", 1);
pdfwriter.setPageLabels(labels);
pdfwriter.setPageEvent(new HeaderFooter());
doc.open();
//code omitted (contents of cover page)
doc.close();
return byteout;
}
catch (Exception e)
{
log.info(LogManager.getHeader(context, "cover_page", "bitstream_id="+bitstream.getID()+", error="+e.getMessage()));
return null;
}
}
The modified code retained the page labels of the existing pdf (see screenshot 1) (documentStream), but the resulting merged pdf (screenshots 2 and 3) is off by 1 page since a cover page was inserted. As suggested by mkl, I should use page labels to the cover page, but it seems the pdf labels of the imported page was lost. My concern now is how do I set the page labels to the final document state as also suggested by mkl? I suppose I should use PdfWriter but I don't know where to put that in my modified code. Am I correct to assume that after the stamper.close() portion, that is the final state of my document? Thanks again in advance.
Screenshot 1. Notice the actual page 1 labeled Front cover
Screenshot 2. Merged PDF, after the "cover page" generated on the fly was inserted. The page label "Front cover" is now assigned to the cover page, even though I set the page label of the inserted page using labels.addPageLabel(1, PdfPageLabels.EMPTY, "Cover page", 1)
Screenshot 3. Note that the page label 3 was assigned to page 2.
FINAL UPDATE
Kudos to @mkl
The screenshot below is the result after I applied the latest update of mkl's answer. The page labels are now assigned correctly to the pages. Also, using PdfStamper instead of PdfCopy (as used in my original code) did not break the PDF/A compliance of the existing PDF.
Adding the cover page
Usually PdfCopy is the right choice for merging PDFs: it creates a new document from the copied pages, copying as much of the page-level information as possible without preferring any single document.
Your case is somewhat special, though: You have one document whose structure and content you prefer and want to apply a small change to it by adding a single page, a title page. All the while all information including document-level information (e.g. metadata, embedded files, ...) from the main document shall still be present in the result.
In such a use case it is more appropriate to use a PdfStamper which you use to "stamp" changes onto an existing PDF.
You might want to start from something like this:
try ( InputStream documentStream = getClass().getResourceAsStream("template.pdf");
InputStream titleStream = getClass().getResourceAsStream("title.pdf");
OutputStream outputStream = new FileOutputStream(new File(RESULT_FOLDER, "test-with-title-page.pdf")) )
{
PdfReader titleReader = new PdfReader(titleStream);
PdfReader reader = new PdfReader(documentStream);
PdfStamper stamper = new PdfStamper(reader, outputStream);
PdfImportedPage page = stamper.getImportedPage(titleReader, 1);
stamper.insertPage(1, titleReader.getPageSize(1));
PdfContentByte content = stamper.getUnderContent(1);
content.addTemplate(page, 0, 0);
stamper.close();
}
PS: Concerning questions in comments:
In my code above, I should have an overlay text supposedly (before the stamp.alterContents() portion) but I omitted that part of code for testing purposes. Can you please give me an idea how to implement that?
Do you mean something like an overlaid watermark? The PdfStamper allows you to access an "over content" for each page, onto which you can draw any content:
PdfContentByte overContent = stamper.getOverContent(pageNumber);
Keeping page labels
My other question is about the page offset: because I inserted the cover page, the page numbering is off by one page. How can I resolve that?
Unfortunately, iText's PdfStamper does not automatically update the page label definition of the manipulated PDF. This is no wonder, actually, because it is not clear how the inserted page is meant to be labeled. @Bruno: At least, though, iText could shift the page label sections that start after the insertion page number.
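To see why the labels end up one page off, it helps to recall that a PDF stores page labels as ranges keyed by 0-based page index, and a page takes its label from the closest range starting at or before it. The following is only a rough plain-Java sketch of that lookup, without iText; the class PageLabelLookup, its Range type, the restriction to prefix and decimal styles, and the sample ranges are all made up for illustration and are not taken from your document.

```java
import java.util.Map;
import java.util.TreeMap;

public class PageLabelLookup {

    /** Simplified label range: a prefix plus optional decimal numbering. */
    static final class Range {
        final String prefix;
        final boolean decimal;
        final int start;

        Range(String prefix, boolean decimal, int start) {
            this.prefix = prefix;
            this.decimal = decimal;
            this.start = start;
        }
    }

    /** Resolves the label of the page at the given 0-based index. */
    static String labelFor(TreeMap<Integer, Range> ranges, int pageIndex) {
        // The range with the greatest start index <= pageIndex applies.
        Map.Entry<Integer, Range> entry = ranges.floorEntry(pageIndex);
        if (entry == null) {
            // Assumption for this sketch: fall back to the physical page number.
            return String.valueOf(pageIndex + 1);
        }
        Range range = entry.getValue();
        int offset = pageIndex - entry.getKey();
        return range.prefix + (range.decimal ? String.valueOf(range.start + offset) : "");
    }

    public static void main(String[] args) {
        // Hypothetical ranges of the original document: "Front cover", "2", "3", ...
        TreeMap<Integer, Range> ranges = new TreeMap<>();
        ranges.put(0, new Range("Front cover", false, 1));
        ranges.put(1, new Range("", true, 2));

        // After a page is inserted at the front, the ranges still start at
        // indices 0 and 1, so every label lands one page too early.
        String[] pages = {"inserted cover", "original page 1", "original page 2"};
        for (int i = 0; i < pages.length; i++) {
            System.out.println(pages[i] + " -> " + labelFor(ranges, i));
        }
    }
}
```

With these assumed ranges, the inserted cover is reported as "Front cover" and the original second page as "3", which is the kind of shift visible in your screenshots.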
Using iText's low level API it is possible, though, to fix the original label positions and add a label for the inserted page. This can be implemented similarly to the iText in Action PageLabelExample example, more exactly its manipulatePageLabel part; simply add this before stamper.close():
PdfDictionary root = reader.getCatalog();
PdfDictionary labels = root.getAsDict(PdfName.PAGELABELS);
if (labels != null)
{
PdfArray newNums = new PdfArray();
newNums.add(new PdfNumber(0));
PdfDictionary coverDict = new PdfDictionary();
coverDict.put(PdfName.P, new PdfString("Cover Page"));
newNums.add(coverDict);
PdfArray nums = labels.getAsArray(PdfName.NUMS);
if (nums != null)
{
for (int i = 0; i < nums.size() - 1; )
{
int n = nums.getAsNumber(i++).intValue();
newNums.add(new PdfNumber(n+1));
newNums.add(nums.getPdfObject(i++));
}
}
labels.put(PdfName.NUMS, newNums);
stamper.markUsed(labels);
}
For a document with these labels:
It generates a document with these labels:
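The effect on the Nums array can also be traced without iText. This is a minimal sketch that mirrors the loop above on a plain list of alternating page indices and label descriptions; the class ShiftPageLabels, the method shiftForCover, and the string stand-ins for the label dictionaries are hypothetical and only serve the illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ShiftPageLabels {

    /**
     * Takes a list of alternating 0-based page indices and label descriptions
     * (a stand-in for the Nums array) and returns a new list in which a cover
     * entry is put at index 0 and every original index is shifted by one.
     */
    static List<Object> shiftForCover(List<Object> nums, String coverLabel) {
        List<Object> newNums = new ArrayList<>();
        newNums.add(0);
        newNums.add(coverLabel);
        for (int i = 0; i < nums.size() - 1; ) {
            int n = (Integer) nums.get(i++);
            newNums.add(n + 1);          // the range now starts one page later
            newNums.add(nums.get(i++));  // the label description itself is unchanged
        }
        return newNums;
    }

    public static void main(String[] args) {
        // Hypothetical original ranges: a prefixed first page, then decimal numbers.
        List<Object> nums = Arrays.asList(0, "prefix Front cover", 1, "decimal from 2");
        System.out.println(shiftForCover(nums, "prefix Cover Page"));
        // prints [0, prefix Cover Page, 1, prefix Front cover, 2, decimal from 2]
    }
}
```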
Keeping links
I just found out that the inserted page "Cover Page" lost its link annotations. I wonder if there's a workaround for this, since according to the book, the interactive features of the inserted page are lost when using PdfStamper.
Indeed, among the iText PDF generating classes only Pdf*Copy* keeps interactive features like annotations. Unfortunately one has to decide whether one wants to
create a genuinely new PDF (PdfWriter) with no information from other PDFs beyond contents being embeddable;
manipulate a single existing PDF (PdfStamper) with all information from that one PDF being preserved but no information from other PDFs beyond contents being embeddable;
merge any number of existing PDFs (PdfCopy) with most page-level information from all those PDFs being preserved but no document-level information from any.
In your case I thought the new cover page had only static content, no dynamic features, and so assumed that PdfStamper was the best choice. If you only have to deal with links, you may consider copying them manually, e.g. using this helper method:
/**
* <p>
* A primitive attempt at copying links from page <code>sourcePage</code>
* of <code>PdfReader reader</code> to page <code>targetPage</code> of
* <code>PdfStamper stamper</code>.
* </p>
* <p>
* This method is meant only for the use case at hand, i.e. copying a link
* to an external URI without expecting any advanced features.
* </p>
*/
void copyLinks(PdfStamper stamper, int targetPage, PdfReader reader, int sourcePage)
{
PdfDictionary sourcePageDict = reader.getPageNRelease(sourcePage);
PdfArray annotations = sourcePageDict.getAsArray(PdfName.ANNOTS);
if (annotations != null && annotations.size() > 0)
{
for (PdfObject annotationObject : annotations)
{
annotationObject = PdfReader.getPdfObject(annotationObject);
if (!annotationObject.isDictionary())
continue;
PdfDictionary annotation = (PdfDictionary) annotationObject;
if (!PdfName.LINK.equals(annotation.getAsName(PdfName.SUBTYPE)))
continue;
PdfArray rectArray = annotation.getAsArray(PdfName.RECT);
if (rectArray == null || rectArray.size() < 4)
continue;
Rectangle rectangle = PdfReader.getNormalizedRectangle(rectArray);
PdfName highlight = annotation.getAsName(PdfName.H);
if (highlight == null)
highlight = PdfAnnotation.HIGHLIGHT_INVERT;
PdfDictionary actionDict = annotation.getAsDict(PdfName.A);
if (actionDict == null || !PdfName.URI.equals(actionDict.getAsName(PdfName.S)))
continue;
PdfString urlPdfString = actionDict.getAsString(PdfName.URI);
if (urlPdfString == null)
continue;
PdfAction action = new PdfAction(urlPdfString.toString());
PdfAnnotation link = PdfAnnotation.createLink(stamper.getWriter(), rectangle, highlight, action);
stamper.addAnnotation(link, targetPage);
}
}
}
which you can call right after inserting the title page:
PdfImportedPage page = stamper.getImportedPage(titleReader, 1);
stamper.insertPage(1, titleReader.getPageSize(1));
PdfContentByte content = stamper.getUnderContent(1);
content.addTemplate(page, 0, 0);
copyLinks(stamper, 1, titleReader, 1);
Beware, this method is really simple. It only considers links with URI actions and creates a link on the target page using the same location, target, and highlight setting as the original one. If the original one uses more refined features (e.g. if it brings along its own appearance streams or even merely uses the border style attributes) and you want to keep these features, you have to improve the method to also copy the entries for these features to the new annotation.