PDFBox: Remove text behind image - pdfbox

I need to split a PDF into two files, one containing the images and one containing the text. Text that lies behind an image should not be removed; it should stay in the image PDF. I only want to extract the top-layer text of the PDF. Can anyone help with this?
I have already separated the images and the text into two PDFs by looping through the content stream operators. My problem is how to avoid removing the text that lies behind an image.

Code for removing text:
import java.io.File;
import java.io.OutputStream;
import java.util.ArrayList;
import java.util.List;
import org.apache.pdfbox.contentstream.operator.Operator;
import org.apache.pdfbox.pdfparser.PDFStreamParser;
import org.apache.pdfbox.pdfwriter.ContentStreamWriter;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDPageTree;
import org.apache.pdfbox.pdmodel.common.PDStream;

// inside a method that declares "throws IOException"
PDDocument document = null;
try {
    document = PDDocument.load(new File(inputfilePath), password);
    PDPageTree allPages = document.getDocumentCatalog().getPages();
    for (int i = 0; i < allPages.getCount(); i++) {
        PDPage page = allPages.get(i);
        PDFStreamParser parser = new PDFStreamParser(page);
        parser.parse();
        List<Object> tokens = parser.getTokens();
        List<Object> newTokens = new ArrayList<Object>();
        for (int j = 0; j < tokens.size(); j++) {
            Object token = tokens.get(j);
            if (token instanceof Operator) {
                Operator op = (Operator) token;
                // matches both Tj and TJ, each of which takes one operand
                if (op.getName().equalsIgnoreCase("tj")) {
                    // remove the one operand already copied for this operator
                    if (!newTokens.isEmpty()) {
                        newTokens.remove(newTokens.size() - 1);
                    }
                    continue;
                }
            }
            newTokens.add(token);
        }
        PDStream newContents = new PDStream(document);
        // close the output stream after writing, otherwise PDFBox
        // reports a "stream not closed" error
        try (OutputStream out = newContents.createOutputStream()) {
            new ContentStreamWriter(out).writeTokens(newTokens);
        }
        newContents.addCompression();
        page.setContents(newContents);
    }
    if (DEBUG)
        System.out.println("Background image pdf creation process");
    document.setAllSecurityToBeRemoved(true);
    document.save(outputFolder + "/img_" + fileName);
} finally {
    if (document != null) {
        document.close();
    }
}
And for removing the images and shades, I used the below code:
for (int j = 0; j < tokens.size(); j++) {
    Object token = tokens.get(j);
    if (token instanceof Operator) {
        Operator op = (Operator) token;
        String name = op.getName();
        // Text extraction: drop image, shading, graphics state,
        // marked-content and path operators
        if (name.equalsIgnoreCase("do") || name.equalsIgnoreCase("sh")
                || name.equalsIgnoreCase("gs") || name.equalsIgnoreCase("bi")
                || name.equalsIgnoreCase("id") || name.equalsIgnoreCase("ei")
                || name.equalsIgnoreCase("bmc") || name.equalsIgnoreCase("bdc")
                || name.equalsIgnoreCase("emc") || name.equalsIgnoreCase("m")
                || name.equalsIgnoreCase("w")
                || name.equalsIgnoreCase("re")) {
            // removes only the single token before the operator, although
            // operand counts differ (re takes four operands, m two, EMC none)
            newTokens.remove(newTokens.size() - 1);
            continue;
        }
    }
    newTokens.add(token);
}
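Both loops above drop exactly one preceding token per removed operator, but the number of operands varies by operator. Because operands always precede their operator in a content stream, one way around this is to discard everything collected since the previous operator. The following is only a minimal sketch of that idea using plain Java stand-ins: the `Op` class and `dropOperators` method are hypothetical and are not PDFBox types, and inline images (BI/ID/EI) are not handled.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class OperatorFilter {
    /** Hypothetical stand-in for a parsed operator; any other object counts as an operand. */
    public static final class Op {
        public final String name;
        public Op(String name) { this.name = name; }
        @Override public String toString() { return name; }
    }

    /** Copies the tokens, skipping each listed operator together with all of its operands. */
    public static List<Object> dropOperators(List<Object> tokens, Set<String> namesToDrop) {
        List<Object> kept = new ArrayList<Object>();
        int operandStart = 0; // index in 'kept' where the current operator's operands begin
        for (Object token : tokens) {
            if (token instanceof Op) {
                Op op = (Op) token;
                if (namesToDrop.contains(op.name)) {
                    // discard every operand collected since the previous operator
                    kept.subList(operandStart, kept.size()).clear();
                } else {
                    kept.add(op);
                }
                operandStart = kept.size();
            } else {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Object> tokens = Arrays.<Object>asList(
                10, 20, 100, 5, new Op("re"), new Op("f"),
                new Op("BT"), "Hello", new Op("Tj"), new Op("ET"));
        Set<String> drop = new HashSet<String>(Arrays.asList("re", "f"));
        System.out.println(dropOperators(tokens, drop)); // prints [BT, Hello, Tj, ET]
    }
}
```

Operator names are compared case-sensitively here, since names such as `m`/`M` and `w`/`W` denote different operators.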

Related

Update PDF using pdfBox

I would like to ask whether it is possible to update an existing PDF at a specific point using the PDFBox libraries.
I am trying a solution I found online, but the getTokens() method does not seem to split the words in a way that lets me find the part I want to modify.
This is the code (Groovy):
for( int i = 0; i < dataContext.getDataCount(); i++ ) {
InputStream is = dataContext.getStream(i);
Properties props = dataContext.getProperties(i);
String searchString= "Hours worked";
String replacement = "Hours worked: 2";
File file = new File("\\\\****\\UKDC\\GFS\\PRE\\PREPROD\\Alchemer\\Template\\***.pdf");
PDDocument doc = PDDocument.load(file);
for ( PDPage page : doc.getPages() )
{
PDFStreamParser parser = new PDFStreamParser(page);
parser.parse();
List tokens = parser.getTokens();
logger.info("in Page");
for (int j = 0; j < tokens.size(); j++)
{
logger.info("tokens:"+tokens[j]);
Object next = tokens.get(j);
//logger.info("in Object");
if (next instanceof Operator)
{
Operator op = (Operator) next;
String pstring = "";
int prej = 0;
//Tj and TJ are the two operators that display strings in a PDF
if (op.getName().equals("Tj"))
{
logger.info("in Tj");
// Tj takes one operand, the string to display, so let's update that operand
COSString previous = (COSString) tokens.get(j - 1);
String string = previous.getString();
logger.info("previousString:"+string);
string = string.replaceFirst(searchString, replacement);
previous.setValue(string.getBytes());
} else
if (op.getName().equals("TJ"))
{
logger.info("in TJ:"+ op.getName());
COSArray previous = (COSArray) tokens.get(j - 1);
logger.info("previous:"+previous);
for (int k = 0; k < previous.size(); k++)
{
Object arrElement = previous.getObject(k);
if (arrElement instanceof COSString)
{
COSString cosString = (COSString) arrElement;
String string = cosString.getString();
logger.info("string:"+string);
if (j == prej || string.equals(" ") || string.equals(":") || string.equals("-")) {
pstring += string;
} else {
prej = j;
pstring = string;
}
}
}
logger.info("pstring:"+pstring);
if (searchString.equals(pstring.trim()))
{
logger.info("in searchString");
COSString cosString2 = (COSString) previous.getObject(0);
cosString2.setValue(replacement.getBytes());
int total = previous.size()-1;
for (int k = total; k > 0; k--) {
previous.remove(k);
}
}
}
}
}
logger.info("in updatedStream");
// now that the tokens are updated we will replace the page content stream.
PDStream updatedStream = new PDStream(doc);
OutputStream out = updatedStream.createOutputStream(COSName.FLATE_DECODE);
ContentStreamWriter tokenWriter = new ContentStreamWriter(out);
tokenWriter.writeTokens(tokens);
logger.info("in tokenWriter");
out.close();
page.setContents(updatedStream);
doc.save("\\\\***\\UKDC\\GFS\\PRE\\PREPROD\\Alchemer\\***1.pdf");
}
}
Running the code, I am trying to find the string "Hours worked" and replace it with "Hours worked: 2".
I have two problems:
1. When I check the logs, the tokens are not split the way I expect: the parser returns two separate COSArrays, although in the PDF the text is on a single line. This is a problem when I have to search for a specific phrase.
2. When the search string is found, the replacement seems to work, but a strange character appears in the output.
So my two questions are:
How can I control the tokenizing (or the parser) so that I get an entire phrase in one token, up to some special character?
How do I format the new characters correctly in the new PDF?
I hope you can help me. Thanks for your support.
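Since a phrase may be spread over several string fragments, one possible approach is to search the concatenation of the fragments and map the match back to fragment indices. This is only a sketch with plain Java strings: `FragmentSearch` and `findSpan` are hypothetical names, not part of PDFBox, and it assumes the fragments have already been decoded to text in reading order.

```java
import java.util.Arrays;
import java.util.List;

public class FragmentSearch {
    /**
     * Returns {firstIndex, lastIndex} of the fragments that together contain
     * the first occurrence of phrase, or null if the phrase does not occur.
     */
    public static int[] findSpan(List<String> fragments, String phrase) {
        if (phrase.isEmpty()) {
            return null;
        }
        StringBuilder joined = new StringBuilder();
        int[] starts = new int[fragments.size()]; // offset of each fragment in the joined text
        for (int i = 0; i < fragments.size(); i++) {
            starts[i] = joined.length();
            joined.append(fragments.get(i));
        }
        int from = joined.indexOf(phrase);
        if (from < 0) {
            return null;
        }
        int to = from + phrase.length() - 1; // offset of the last matched character
        int first = 0;
        int last = 0;
        for (int i = 0; i < starts.length; i++) {
            if (starts[i] <= from) first = i;
            if (starts[i] <= to) last = i;
        }
        return new int[] {first, last};
    }

    public static void main(String[] args) {
        List<String> fragments = Arrays.asList("Hours", " ", "wor", "ked", ":");
        // prints [0, 3]
        System.out.println(Arrays.toString(findSpan(fragments, "Hours worked")));
    }
}
```

With the span known, the replacement could be written into the first fragment and the remaining fragments of the span emptied, similar to what the TJ branch above does for a whole array.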

Changing zoom level of all links in a PDF document

I'd like to change zoom level in links in PDF files using iText 7. There are already solutions for legacy iText 5 (1, 2) but there is none for iText 7.
There are many ways links can work (it can be even JavaScript code) but now I'm interested in GoTo actions (i.e. /XYZ, /Fit and others described in ISO 32000-1, Table 151).
My current code locates GoTo actions but I don't know what the destination of a link is (such as page number, coordinates for XYZ zoom type). How do I get the destination of a link? Or am I missing something and that problem can be solved in a different way?
for (int i = 1; i <= numberOfPages; i++) {
PdfPage page = document.getPage(i);
List<PdfAnnotation> annots = page.getAnnotations();
if (annots.isEmpty()) {
continue;
}
for (int j = 0; j < annots.size(); j++) {
PdfAnnotation annot = annots.get(j);
if (annot == null) {
continue;
}
PdfDictionary action = annot.getPdfObject()
.getAsDictionary(PdfName.A);
if (action == null ||
!PdfName.GoTo.equals(action.get(PdfName.S))) {
continue;
}
annot.remove(PdfName.A);
// here I need the destination of action
int pageNumber = ???
Rectangle cropBox = document.getPage(pageNumber).getCropBox();
PdfAction goToAction = PdfAction.createGoTo(PdfExplicitDestination.createXYZ(
document.getPage(pageNumber), cropBox.getLeft(), cropBox.getTop(), 2F));
annot.put(PdfName.A, goToAction.getPdfObject());
}
}
UPDATE Working Code
I managed to handle GoTo actions, both with explicit destinations and with named destinations. It works on the PDF files I tested it against, but I'm not sure whether it covers all possible cases.
Note:
When dealing with named destination I create explicit destination and then use it instead of named destination. The thing is, however, that named destination still exists in pdf. How do I get rid of it? In iText 5 there was a method consolidateNamedDestinations() which seems to do what I need:
Replaces all the local named links with
the actual destinations.
Please see my code:
try (document) {
Map<String, PdfObject> namedDests = null;
for (int i = 1; i <= document.getNumberOfPages(); i++) {
List<PdfAnnotation> annots = document.getPage(i).getAnnotations();
if (annots.isEmpty()) {
continue;
}
for (PdfAnnotation annot : annots) {
if (annot == null) {
continue;
}
PdfDictionary action = annot.getPdfObject()
.getAsDictionary(PdfName.A);
if (action == null) {
continue;
}
PdfName actionType = action.getAsName(PdfName.S);
if (PdfName.Link.equals(actionType)) {
throw new UnsupportedOperationException(
"Action of type " + action);
} else if (!PdfName.GoTo.equals(actionType)) {
continue;
}
PdfArray dest = action.getAsArray(PdfName.D);
// explicit GoTO case
if (dest != null) {
changeZoomLevel(document, annot, dest.getAsDictionary(0));
continue;
}
// Named Destination case
if (namedDests == null) {
namedDests = document.getCatalog()
.getNameTree(PdfName.Dests)
.getNames();
}
String namedDest = action.getAsString(PdfName.D)
.getValue();
PdfObject d = namedDests.get(namedDest);
PdfDictionary dict = null;
if (d instanceof PdfDictionary) {
PdfArray arr = (PdfArray) ((PdfDictionary) d).get(PdfName.D);
dict = arr.getAsDictionary(0);
} else if (d instanceof PdfArray) {
dict = ((PdfArray) d).getAsDictionary(0);
}
changeZoomLevel(document, annot, dict);
}
}
}
private static void changeZoomLevel(PdfDocument document,
PdfAnnotation annot, PdfDictionary dest) {
annot.remove(PdfName.A);
annot.put(PdfName.A, createPdfAction(document, dest));
}
private static PdfObject createPdfAction(PdfDocument document, PdfDictionary dest) {
PdfPage actionPage = document.getPage(dest);
Rectangle cropBox = actionPage.getCropBox();
return PdfAction.createGoTo(PdfExplicitDestination.createXYZ(
actionPage, cropBox.getLeft(), cropBox.getTop(), 1.5F)).getPdfObject();
}

Remove underlines from text in PDF file

I have a bunch of PDF files with broken links.
I need to remove those links, and so far I can do the following:
Remove link actions
Change text color from blue to black
What I can't do is remove the blue underlines below the text that used to be a link.
I tried several PDF libraries for .NET (because this is my primary platform):
Aspose.PDF
PDFSharp
ceTe DynamicPDF
PDFBox
You are welcome to recommend a solution in any programming language, platform, or library. I just need to get this done.
In case of the sample document the underlines are drawn as blue (RGB 0,0,1) filled vector graphics rectangles (long, slim ones). As blue only is used for the links, we can use that criterion to find the rectangles in question.
Here a sample implementation using PDFBox 1.8.10:
void removeBlueRectangles(PDDocument document) throws IOException
{
List<?> pages = document.getDocumentCatalog().getAllPages();
for (int i = 0; i < pages.size(); i++)
{
PDPage page = (PDPage) pages.get(i);
PDStream contents = page.getContents();
PDFStreamParser parser = new PDFStreamParser(contents.getStream());
parser.parse();
List<Object> tokens = parser.getTokens();
Stack<Boolean> blueState = new Stack<Boolean>();
blueState.push(false);
for (int j = 0; j < tokens.size(); j++)
{
Object next = tokens.get(j);
if (next instanceof PDFOperator)
{
PDFOperator op = (PDFOperator) next;
if (op.getOperation().equals("q"))
{
blueState.push(blueState.peek());
}
else if (op.getOperation().equals("Q"))
{
blueState.pop();
}
else if (op.getOperation().equals("rg"))
{
if (j > 2)
{
Object r = tokens.get(j-3);
Object g = tokens.get(j-2);
Object b = tokens.get(j-1);
if (r instanceof COSNumber && g instanceof COSNumber && b instanceof COSNumber)
{
blueState.pop();
blueState.push((
Math.abs(((COSNumber)r).floatValue() - 0) < 0.001 &&
Math.abs(((COSNumber)g).floatValue() - 0) < 0.001 &&
Math.abs(((COSNumber)b).floatValue() - 1) < 0.001));
}
}
}
else if (op.getOperation().equals("f"))
{
if (blueState.peek() && j > 0)
{
Object re = tokens.get(j-1);
if (re instanceof PDFOperator && ((PDFOperator)re).getOperation().equals("re"))
{
tokens.set(j, PDFOperator.getOperator("n"));
}
}
}
}
}
PDStream updatedStream = new PDStream(document);
OutputStream out = updatedStream.createOutputStream();
ContentStreamWriter tokenWriter = new ContentStreamWriter(out);
tokenWriter.writeTokens(tokens);
// close the stream so the new content is complete before it is attached to the page
out.close();
page.setContents(updatedStream);
}
}
(RemoveUnderlines.java)
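The role of the state stack in the code above may be easier to see in isolation: `q` saves the current "fill colour is blue" flag, `Q` restores it, and `rg` overwrites it. Below is a minimal stand-alone sketch of just that bookkeeping with plain Java stand-ins. `BlueFillTracker`, `Op`, and `blueFills` are hypothetical names, not PDFBox classes, and unlike the code above the sketch does not check that `f` directly follows `re`.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

public class BlueFillTracker {
    /** Hypothetical stand-in for a parsed operator; operands are plain Numbers. */
    public static final class Op {
        public final String name;
        public Op(String name) { this.name = name; }
    }

    private static boolean near(Object value, double expected) {
        return value instanceof Number
                && Math.abs(((Number) value).doubleValue() - expected) < 0.001;
    }

    /** Returns the indices of all f operators executed while the fill colour is pure blue. */
    public static List<Integer> blueFills(List<Object> tokens) {
        Deque<Boolean> blue = new ArrayDeque<Boolean>();
        blue.push(false); // initial graphics state: fill colour is not blue
        List<Integer> hits = new ArrayList<Integer>();
        for (int j = 0; j < tokens.size(); j++) {
            if (!(tokens.get(j) instanceof Op)) {
                continue;
            }
            String name = ((Op) tokens.get(j)).name;
            if (name.equals("q")) {
                blue.push(blue.peek());            // save the current state
            } else if (name.equals("Q")) {
                if (blue.size() > 1) blue.pop();   // restore the saved state
            } else if (name.equals("rg") && j >= 3) {
                blue.pop();                        // replace the current flag
                blue.push(near(tokens.get(j - 3), 0)
                        && near(tokens.get(j - 2), 0)
                        && near(tokens.get(j - 1), 1));
            } else if (name.equals("f") && blue.peek()) {
                hits.add(j);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<Object> tokens = Arrays.<Object>asList(
                new Op("q"), 0, 0, 1, new Op("rg"),
                10, 10, 100, 1, new Op("re"), new Op("f"), new Op("Q"),
                10, 20, 30, 40, new Op("re"), new Op("f"));
        System.out.println(blueFills(tokens)); // prints [10]
    }
}
```

The second fill at index 17 is not reported because `Q` has restored the non-blue state that was saved by `q`.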
original.pdf
Applying this to your first sample file original.pdf
public void testOriginal() throws IOException, COSVisitorException
{
try ( InputStream resourceStream = getClass().getResourceAsStream("original.pdf") )
{
PDDocument document = PDDocument.loadNonSeq(resourceStream, null);
removeBlueRectangles(document);
document.save("original-noBlueRectangles.pdf");
document.close();
}
}
(RemoveUnderlines.java)
results in
1178.pdf
You commented
After testing this on many files I have to say this solution works incorrectly in some cases. For example in for this file (dropbox.com/s/23g54bvt781lb93/1178.pdf?dl=0) it removes the entire content of the page. Keep searching..
So I applied the code to your new sample file 1178.pdf
public void test1178() throws IOException, COSVisitorException
{
try ( InputStream resourceStream = getClass().getResourceAsStream("1178.pdf") )
{
PDDocument document = PDDocument.loadNonSeq(resourceStream, null);
removeBlueRectangles(document);
document.save(new File(RESULT_FOLDER, "1178-noBlueRectangles.pdf"));
document.close();
}
}
(RemoveUnderlines.java)
which resulted in
So I cannot confirm your claim that the solution works incorrectly; in particular I see that it does not remove the entire content of the page.
As I cannot reproduce your observation, I assume there are additional issues in your setup you have not yet mentioned.

parser.getTokens() returns junk data and single characters (PDFBox 1.8.9)

I am new to PDFBox. I am using pdfbox-app-2.0.0-RC1 to fetch the entire text from a PDF using PDFTextStripperByArea. Is it possible to get each string separately?
For example, in the following text,
Nomination : Name&Address
Shipper : shipper name
I need "Nomination" as one string and "Name&Address" as another. Instead I am getting individual characters or fragments. I have tried different PDFs: for most of them I get the exact strings, but for a few I don't.
I am using the following code to get the separate strings.
for (PDPage page : doc.getPages()) {
PDFStreamParser parser = new PDFStreamParser(page);
parser.parse();
List<Object> tokens = parser.getTokens();
for (int j = 0; j < tokens.size(); j++) {
Object next = tokens.get(j);
if (next instanceof Operator) {
Operator op = (Operator) next;
if (op.getName().equals("Tj")) {
COSString previous = (COSString) tokens.get(j - 1);
String string = previous.getString();
System.out.println("string1===" + string);
if (string.contains("Plant")) {
int size = al.size();
al.add(string);
stop = false;
continue;
}
if (!string.contains("_") && !stop) {
if (string.contains("Nomination")) {
stop = true;
} else {
al.add(string);
}
}
} else if (op.getName().equals("TJ")) {
COSArray previous = (COSArray) tokens.get(j - 1);
for (int k = 0; k < previous.size(); k++) {
Object arrElement = previous.getObject(k);
if (arrElement instanceof COSString) {
COSString cosString = (COSString)arrElement;
String string = cosString.getString();
System.out.println("string2====>>"+string);
al.add(string);
}
}
}
}
}
}
I am getting the following output:
string2====>>Nom
string2====>>i
string2====>>na
string2====>>t
string2====>>i
string2====>>on
string1===
string2====>>(
string2====>>T
string2====>>o
string1===
string2====>>Loa
string2====>>di
string2====>>ng
string1===
string2====>>Fa
string2====>>c
string2====>>i
string2====>>l
string2====>>i
string2====>>t
string2====>>y
string2====>>)
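The fragments in this output are what a TJ operator typically carries: its array mixes strings with numeric position adjustments (in thousandths of a text space unit), so one word can arrive in several pieces. A rough way to rebuild words is to concatenate the strings and treat a strongly negative adjustment as a word gap. The sketch below uses plain Java objects instead of COSArray; the `TjJoiner` name, the numeric values in the example, and the -200 threshold are assumptions made for illustration, not values taken from the PDF in question.

```java
import java.util.Arrays;
import java.util.List;

public class TjJoiner {
    /**
     * Joins the string elements of a TJ-style array. A number below
     * gapThreshold moves the next glyph far to the right and is treated
     * as a word gap; all other numbers are ignored as kerning.
     */
    public static String join(List<Object> tjArray, double gapThreshold) {
        StringBuilder text = new StringBuilder();
        for (Object element : tjArray) {
            if (element instanceof String) {
                text.append((String) element);
            } else if (element instanceof Number
                    && ((Number) element).doubleValue() < gapThreshold) {
                text.append(' ');
            }
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // string fragments as in the output above; the numbers are made up
        List<Object> tjArray = Arrays.<Object>asList(
                "Nom", -10, "i", 8, "na", -5, "t", "i", 12, "on", -320, ":");
        System.out.println(join(tjArray, -200)); // prints Nomination :
    }
}
```

This only works within a single array and for fonts whose string bytes map directly to characters; PDFBox's own text extraction classes work from glyph positions instead of raw tokens.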

Itext: How to retrieve list of not embedded fonts of a pdf

I would like to check whether all fonts of a PDF are embedded. I followed the code mentioned in How to check that all used fonts are embedded in PDF with Java iText? but I am still not able to get a proper list of the fonts used.
See my example PDF: https://www.dropbox.com/s/anvm49vh87d8yqs/000024944.pdf?dl=0. The code returns no fonts at all, but the document properties in Acrobat list Helvetica + Verdana (Embedded Subset) + Verdana-Bold (Embedded Subset). For other PDFs I do get Verdana (Embedded Subset); only for this kind of PDF do I fail to get the font list.
As we have to deal with a huge number of PDFs from internal as well as external sources, we need to be able to embed fonts in order to print them. As it is almost impossible to embed all fonts, we just want to embed common fonts; for exotic fonts we would ignore the print request.
Can anyone help me solve this issue? Thanks.
Got it working after all by referring to BASEFONT instead of FONT:
/**
* Collects the names of the fonts that are not embedded in the PDF file.
* @param reader the reader for the PDF file
* @param set receives the names of the fonts that are not embedded
* @throws IOException
*/
public void listFonts(PdfReader reader, Set<String> set) throws IOException {
try {
int n = reader.getXrefSize();
PdfObject object;
PdfDictionary font;
for (int i = 0; i < n; i++) {
object = reader.getPdfObject(i);
if (object == null || !object.isDictionary()) {
continue;
}
font = (PdfDictionary)object;
if (font.get(PdfName.BASEFONT) != null) {
System.out.println("fontname " + font.getAsName(PdfName.BASEFONT).toString());
processFont(font,set);
}
}
} catch (Exception e) {
System.out.println("error " + e.getMessage());
}
}
/**
* Finds out if the font is an embedded subset font
* @param name the font name
* @return true if the name denotes an embedded subset font
*/
private boolean isEmbeddedSubset(String name) {
//name = String.format("%s subset (%s)", name.substring(8), name.substring(1, 7));
return name != null && name.length() > 8 && name.charAt(7) == '+';
}
private void processFont(PdfDictionary font, Set<String> set) {
String name = font.getAsName(PdfName.BASEFONT).toString();
if(isEmbeddedSubset(name)) {
return;
}
PdfDictionary desc = font.getAsDict(PdfName.FONTDESCRIPTOR);
//nofontdescriptor
if (desc == null) {
System.out.println("desc null " );
PdfArray descendant = font.getAsArray(PdfName.DESCENDANTFONTS);
if (descendant == null) {
System.out.println("descendant null " );
set.add(name.substring(1));
}
else {
System.out.println("descendant not null " );
for (int i = 0; i < descendant.size(); i++) {
PdfDictionary dic = descendant.getAsDict(i);
processFont(dic, set);
}
}
}
/**
* (Type 1) embedded
*/
else if (desc.get(PdfName.FONTFILE) != null) {
System.out.println("(Type 1) embedded ");
}
/**
* (TrueType) embedded
*/
else if (desc.get(PdfName.FONTFILE2) != null) {
System.out.println("(FONTFILE2) embedded ");
}
/**
* " (" + font.getAsName(PdfName.SUBTYPE).toString().substring(1) + ") embedded"
*/
else if (desc.get(PdfName.FONTFILE3) != null) {
System.out.println("(FONTFILE3) ");
}
else {
set.add(name.substring(1));
}
}
This gives me the same results as the list of fonts under Properties in Acrobat Reader.
I managed to get some results by combining code from How to check that all used fonts are embedded in PDF with Java iText? and http://itextpdf.com/examples/iia.php?id=288.
Initially it did not work, because font.getAsName(PdfName.BASEFONT).toString(); does not work in my case, but after a small change I got some results.
Below is my code:
/**
* Collects the names of the fonts that are not embedded in the PDF file.
* @param reader the reader for the PDF file
* @param set receives the names of the fonts that are not embedded
* @throws IOException
*/
public void listFonts(PdfReader reader, Set<String> set) throws IOException {
int n = reader.getXrefSize();
PdfObject object;
PdfDictionary font;
for (int i = 0; i < n; i++) {
object = reader.getPdfObject(i);
if (object == null || !object.isDictionary()) {
continue;
}
font = (PdfDictionary)object;
if (font.get(PdfName.FONTNAME) != null) {
System.out.println("fontname " + font.get(PdfName.FONTNAME));
processFont(font,set);
}
}
}
/**
* Finds out if the font is an embedded subset font
* @param name the font name
* @return true if the name denotes an embedded subset font
*/
private boolean isEmbeddedSubset(String name) {
//name = String.format("%s subset (%s)", name.substring(8), name.substring(1, 7));
return name != null && name.length() > 8 && name.charAt(7) == '+';
}
private void processFont(PdfDictionary font, Set<String> set) {
String name = font.get(PdfName.FONTNAME).toString();
if(isEmbeddedSubset(name)) {
return;
}
PdfDictionary desc = font.getAsDict(PdfName.FONTDESCRIPTOR);
//nofontdescriptor
if (desc == null) {
System.out.println("desc null " );
PdfArray descendant = font.getAsArray(PdfName.DESCENDANTFONTS);
if (descendant == null) {
System.out.println("descendant null " );
set.add(name.substring(1));
}
else {
System.out.println("descendant not null " );
for (int i = 0; i < descendant.size(); i++) {
PdfDictionary dic = descendant.getAsDict(i);
processFont(dic, set);
}
}
}
/**
* (Type 1) embedded
*/
else if (desc.get(PdfName.FONTFILE) != null) {
System.out.println("(Type 1) embedded ");
}
/**
* (TrueType) embedded
*/
else if (desc.get(PdfName.FONTFILE2) != null) {
System.out.println("(FONTFILE2) embedded ");
}
/**
* " (" + font.getAsName(PdfName.SUBTYPE).toString().substring(1) + ") embedded"
*/
else if (desc.get(PdfName.FONTFILE3) != null) {
System.out.println("(FONTFILE3) ");
}
else {
set.add(name.substring(1));
}
}
}
So instead of String name = font.getAsName(PdfName.BASEFONT).toString(); I now use String name = font.get(PdfName.FONTNAME).toString();
This definitely gives better results, as it reports different fonts. However, I get no results for the font descriptor and descendant fonts: either they are simply not present in my PDFs, or because of my change the code never reaches those branches.
If a subset prefix is found in the font name, can I assume that the font is embedded? And if the font name has no subset prefix, can I assume that the font is not embedded?
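On the naming side of that question: the PDF specification describes a subset font name as a tag of exactly six uppercase letters followed by a plus sign, whereas isEmbeddedSubset above only looks at the position of the plus sign. A stricter check could look like the sketch below (`SubsetTag` and `hasSubsetTag` are hypothetical names; the leading slash is optional because iText's toString() on a name includes it). A subset tag is only a naming convention, so the presence of FONTFILE, FONTFILE2, or FONTFILE3 in the font descriptor remains the more direct indication that a font program is embedded.

```java
import java.util.regex.Pattern;

public class SubsetTag {
    // optional leading slash, six uppercase letters, a plus sign, then the font name
    private static final Pattern SUBSET = Pattern.compile("/?[A-Z]{6}\\+.+");

    /** Returns true if the name starts with a subset tag such as ABCDEF+. */
    public static boolean hasSubsetTag(String baseFont) {
        return baseFont != null && SUBSET.matcher(baseFont).matches();
    }

    public static void main(String[] args) {
        System.out.println(hasSubsetTag("/ABCDEF+Verdana")); // prints true
        System.out.println(hasSubsetTag("/Helvetica"));      // prints false
        // has '+' at index 7, so the position-only check would accept it
        System.out.println(hasSubsetTag("/MyFont+Bold"));    // prints false
    }
}
```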