Java: edit existing PDF text with PDFBox

I am trying to edit text with the PDFBox library and don't know how. I do not know how to get the stream of individual text objects so that I can edit the text and/or its color.
Any ideas or examples?
Thanks

I found that editing text in a PDF is not reliable, so instead try covering the old text with a rectangle (filled with white or the background color) and writing the new text at the cleared position. Here is some sample code.
//to add a link in footer
//to replace a text
//to replace a link/url/href
public static void editTextorUrl(String inputFile, String outputFile)
throws IOException, COSVisitorException {
// the document
PDDocument doc = null;
try {
System.out.println(inputFile);
doc = PDDocument.load(inputFile);
List pages = doc.getDocumentCatalog().getAllPages();
for (int i = 0; i < pages.size(); i++) {
float inch = 72;
PDGamma colourRed = new PDGamma();
colourRed.setR(1);
PDGamma colourBlue = new PDGamma();
colourBlue.setB(1);
PDGamma white = new PDGamma();
white.setR(1);
white.setB(1);
white.setG(1);
PDBorderStyleDictionary borderThick = new PDBorderStyleDictionary();
borderThick.setWidth(inch / 12); // 12th inch
PDBorderStyleDictionary borderThin = new PDBorderStyleDictionary();
borderThin.setWidth(inch / 72); // 1 point
PDBorderStyleDictionary borderULine = new PDBorderStyleDictionary();
borderULine.setStyle(PDBorderStyleDictionary.STYLE_UNDERLINE);
borderULine.setWidth(inch / 72); // 1 point
PDPage page = (PDPage) pages.get(i);
PDFont font = PDType1Font.HELVETICA;
PDPageContentStream contentStream = new PDPageContentStream(
doc, page, true, false);
contentStream.setNonStrokingColor(Color.WHITE);
contentStream.fillRect(55, 27, 144, 17);
contentStream.setNonStrokingColor(Color.BLUE);
contentStream.beginText();
contentStream.setFont(font, 11);
contentStream.moveTextPositionByAmount(55, 37);
contentStream.drawString("www.loasoftwares.com"); //text to be replaced
contentStream.endText();
contentStream.setLineWidth(inch / 300);
contentStream.setStrokingColor(Color.BLUE);
contentStream.drawLine(55, 34, 188, 34);
contentStream.close();
PDAnnotationLink txtLink = new PDAnnotationLink();
PDRectangle position = new PDRectangle();
position.setLowerLeftX(55);
position.setLowerLeftY(27);
position.setUpperRightX(188);
position.setUpperRightY(50);
txtLink.setRectangle(position);
// add an action
PDActionURI action = new PDActionURI();
action.setURI("www.loasoftwares.com");
txtLink.setBorderStyle(borderULine);
txtLink.setAction(action);
txtLink.setColour(white);
page.getAnnotations().add(txtLink);
}
doc.save(outputFile);
} finally {
if (doc != null) {
doc.close();
}
}
}

Remark:
The following example works only under specific circumstances (ASCII-like font encodings and fairly long string arguments).
/**
* This is an example that will replace a string in a PDF with a new one.
*
* The example is taken from the pdf file format specification.
*
* @author Ben Litchfield
* @version $Revision: 1.3 $
*/
public class ReplaceString {
/**
* Constructor.
*/
public ReplaceString() {
super();
}
/**
* Locate a string in a PDF and replace it with a new string.
*
* @param inputFile The PDF to open.
* @param outputFile The PDF to write to.
* @param strToFind The string to find in the PDF document.
* @param message The message to write in the file.
*
* @throws IOException If there is an error writing the data.
* @throws COSVisitorException If there is an error writing the PDF.
*/
public void doIt( String inputFile, String outputFile, String strToFind, String message)
throws IOException, COSVisitorException {
// the document
PDDocument doc = null;
try
{
doc = PDDocument.load( inputFile );
List pages = doc.getDocumentCatalog().getAllPages();
for( int i=0; i<pages.size(); i++ )
{
PDPage page = (PDPage)pages.get( i );
PDStream contents = page.getContents();
PDFStreamParser parser = new PDFStreamParser(contents.getStream());
parser.parse();
List tokens = parser.getTokens();
for( int j=0; j<tokens.size(); j++ )
{
Object next = tokens.get( j );
if( next instanceof PDFOperator )
{
PDFOperator op = (PDFOperator)next;
//Tj and TJ are the two operators that display
//strings in a PDF
if( op.getOperation().equals( "Tj" ) )
{
//Tj takes one operand and that is the string
//to display so lets update that operand
COSString previous = (COSString)tokens.get( j-1 );
String string = previous.getString();
string = string.replaceFirst( strToFind, message );
previous.reset();
previous.append( string.getBytes("ISO-8859-1") );
}
else if( op.getOperation().equals( "TJ" ) )
{
COSArray previous = (COSArray)tokens.get( j-1 );
for( int k=0; k<previous.size(); k++ )
{
Object arrElement = previous.getObject( k );
if( arrElement instanceof COSString )
{
COSString cosString = (COSString)arrElement;
String string = cosString.getString();
string = string.replaceFirst( strToFind, message );
cosString.reset();
cosString.append( string.getBytes("ISO-8859-1") );
}
}
}
}
}
//now that the tokens are updated we will replace the
//page content stream.
PDStream updatedStream = new PDStream(doc);
OutputStream out = updatedStream.createOutputStream();
ContentStreamWriter tokenWriter = new ContentStreamWriter(out);
tokenWriter.writeTokens( tokens );
page.setContents( updatedStream );
}
doc.save( outputFile );
}
finally
{
if( doc != null )
{
doc.close();
}
}
}
/**
* This will open a PDF and replace a string if it finds it.
* <br />
* see usage() for commandline
*
* @param args Command line arguments.
*/
public static void main(String[] args)
{
ReplaceString app = new ReplaceString();
try
{
if( args.length != 4 )
{
app.usage();
}
else
{
app.doIt( args[0], args[1], args[2], args[3] );
}
}
catch (Exception e)
{
e.printStackTrace();
}
}
/**
* This will print out a message telling how to use this example.
*/
private void usage()
{
System.err.println( "usage: " + this.getClass().getName() +
" <input-file> <output-file> <search-string> <Message>" );
}
}
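One caveat with the example above: String.replaceFirst treats its first argument as a regular expression and its second as a replacement template, so a search string containing characters such as ( ) . or a replacement containing $ will not be handled literally. A minimal sketch of a literal replacement using only the JDK follows; the class and method names are illustrative and not part of PDFBox.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative helper (not part of PDFBox): literal, first-occurrence replacement. */
final class LiteralReplace {

    /** Replaces the first literal occurrence of {@code find} in {@code text}. */
    static String replaceFirstLiteral(String text, String find, String replacement) {
        // Pattern.quote makes regex metacharacters in the search string literal;
        // Matcher.quoteReplacement does the same for '$' and '\' in the replacement.
        return text.replaceFirst(Pattern.quote(find), Matcher.quoteReplacement(replacement));
    }

    public static void main(String[] args) {
        String shown = "Total (EUR): 10";
        // As a regex, "(EUR)" is a capturing group matching "EUR", so the parentheses stay behind.
        System.out.println(shown.replaceFirst("(EUR)", "USD"));
        // The literal variant replaces the parentheses as well.
        System.out.println(replaceFirstLiteral(shown, "(EUR)", "[$]"));
    }
}
```

In doIt, the two string.replaceFirst( strToFind, message ) calls could be swapped for such a helper if the search string is meant to be plain text.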

Related

Update PDF using pdfBox

I would like to ask whether it is possible, using the PDFBox library, to update an existing PDF at a specific point.
I am trying to use a solution already available online, but it seems the getTokens() method does not split the words in a way that lets me find the part I would like to modify.
This is the code (Groovy):
for( int i = 0; i < dataContext.getDataCount(); i++ ) {
InputStream is = dataContext.getStream(i);
Properties props = dataContext.getProperties(i);
String searchString= "Hours worked";
String replacement = "Hours worked: 2";
File file = new File("\\\\****\\UKDC\\GFS\\PRE\\PREPROD\\Alchemer\\Template\\***.pdf");
PDDocument doc = PDDocument.load(file);
for ( PDPage page : doc.getPages() )
{
PDFStreamParser parser = new PDFStreamParser(page);
parser.parse();
List tokens = parser.getTokens();
logger.info("in Page");
for (int j = 0; j < tokens.size(); j++)
{
logger.info("tokens:"+tokens[j]);
Object next = tokens.get(j);
//logger.info("in Object");
if (next instanceof Operator)
{
Operator op = (Operator) next;
String pstring = "";
int prej = 0;
//Tj and TJ are the two operators that display strings in a PDF
if (op.getName().equals("Tj"))
{
logger.info("in Tj");
// Tj takes one operand and that is the string to display so lets update that operand
COSString previous = (COSString) tokens.get(j - 1);
String string = previous.getString();
logger.info("previousString:"+string);
string = string.replaceFirst(searchString, replacement);
previous.setValue(string.getBytes());
} else
if (op.getName().equals("TJ"))
{
logger.info("in TJ:"+ op.getName());
COSArray previous = (COSArray) tokens.get(j - 1);
logger.info("previous:"+previous);
for (int k = 0; k < previous.size(); k++)
{
Object arrElement = previous.getObject(k);
if (arrElement instanceof COSString)
{
COSString cosString = (COSString) arrElement;
String string = cosString.getString();
logger.info("string:"+string);
if (j == prej || string.equals(" ") || string.equals(":") || string.equals("-")) {
pstring += string;
} else {
prej = j;
pstring = string;
}
}
}
logger.info("pstring:"+pstring);
if (searchString.equals(pstring.trim()))
{
logger.info("in searchString");
COSString cosString2 = (COSString) previous.getObject(0);
cosString2.setValue(replacement.getBytes());
int total = previous.size()-1;
for (int k = total; k > 0; k--) {
previous.remove(k);
}
}
}
}
}
logger.info("in updatedStream");
// now that the tokens are updated we will replace the page content stream.
PDStream updatedStream = new PDStream(doc);
OutputStream out = updatedStream.createOutputStream(COSName.FLATE_DECODE);
ContentStreamWriter tokenWriter = new ContentStreamWriter(out);
tokenWriter.writeTokens(tokens);
logger.info("in tokenWriter");
out.close();
page.setContents(updatedStream);
doc.save("\\\\***\\UKDC\\GFS\\PRE\\PREPROD\\Alchemer\\***1.pdf");
}
When I execute the code I try to search for the string "Hours worked" and replace it with "Hours worked: 2".
There are two problems:
1. When I check the logs I can see that the tokens are not created as I expect: two different COSArrays are created although the text is all on one line in the PDF. This is a problem if I have to search for a specific word.
2. When the word is found the replacement seems to work, but a strange character appears in the output.
So here are my two questions:
How can I make the tokens (or the parser) return an entire phrase in a single token, up to some special character?
How do I format the new characters in the new PDF?
I hope you can help me, thanks for your support.
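On the first question, some background may help. The operand of TJ is an array that mixes string fragments with numeric position adjustments (kerning), and the parser returns the tokens as they are stored in the content stream, so a phrase that looks continuous on the page is often spread over several array elements or even several operators. One possible approach is to join the string fragments of an array before searching. The sketch below uses plain Java lists as a stand-in for COSArray; the class and method names are hypothetical and it does not call PDFBox.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Stand-in model of a TJ operand: strings are text fragments, numbers are kerning adjustments. */
final class TjArraySketch {

    /** Joins all string fragments of the array, ignoring the numeric adjustments. */
    static String joinedText(List<Object> tjArray) {
        StringBuilder sb = new StringBuilder();
        for (Object element : tjArray) {
            if (element instanceof String) {
                sb.append((String) element);
            }
        }
        return sb.toString();
    }

    /**
     * If the joined text contains {@code find}, returns a single-fragment array holding the
     * replaced text (kerning is dropped); otherwise returns the array unchanged.
     */
    static List<Object> replaceInArray(List<Object> tjArray, String find, String replacement) {
        String joined = joinedText(tjArray);
        if (!joined.contains(find)) {
            return tjArray;
        }
        List<Object> result = new ArrayList<>();
        result.add(joined.replace(find, replacement));
        return result;
    }

    public static void main(String[] args) {
        // "Hours worked" split into fragments with kerning numbers in between.
        List<Object> tj = Arrays.<Object>asList("Hou", -12, "rs wor", 4.5, "ked");
        System.out.println(joinedText(tj));
        System.out.println(replaceInArray(tj, "Hours worked", "Hours worked: 2"));
    }
}
```

This does not cover phrases that span several TJ or Tj operators. As for the strange character, a plausible but unconfirmed cause is the font: if it uses a non-ASCII-like encoding or is an embedded subset, the bytes of the replacement may map to the wrong glyph or to none at all.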

Remove underlines from text in PDF file

I have a bunch of PDF files with broken links.
I need to remove those links and right now I can do the following:
Remove link actions
Change text color from blue to black
What I can't do is to remove blue underlines below text that was a link before.
I tried several PDF libraries for .NET (because this is my primary platform)
Aspose.PDF
PDFSharp
ceTe DynamicPDF
PDFBox
You are welcome to recommend a solution in any programming language, platform, or library. I just need to get this done.
In case of the sample document the underlines are drawn as blue (RGB 0,0,1) filled vector graphics rectangles (long, slim ones). As blue only is used for the links, we can use that criterion to find the rectangles in question.
Here is a sample implementation using PDFBox 1.8.10:
void removeBlueRectangles(PDDocument document) throws IOException
{
List<?> pages = document.getDocumentCatalog().getAllPages();
for (int i = 0; i < pages.size(); i++)
{
PDPage page = (PDPage) pages.get(i);
PDStream contents = page.getContents();
PDFStreamParser parser = new PDFStreamParser(contents.getStream());
parser.parse();
List<Object> tokens = parser.getTokens();
Stack<Boolean> blueState = new Stack<Boolean>();
blueState.push(false);
for (int j = 0; j < tokens.size(); j++)
{
Object next = tokens.get(j);
if (next instanceof PDFOperator)
{
PDFOperator op = (PDFOperator) next;
if (op.getOperation().equals("q"))
{
blueState.push(blueState.peek());
}
else if (op.getOperation().equals("Q"))
{
blueState.pop();
}
else if (op.getOperation().equals("rg"))
{
if (j > 2)
{
Object r = tokens.get(j-3);
Object g = tokens.get(j-2);
Object b = tokens.get(j-1);
if (r instanceof COSNumber && g instanceof COSNumber && b instanceof COSNumber)
{
blueState.pop();
blueState.push((
Math.abs(((COSNumber)r).floatValue() - 0) < 0.001 &&
Math.abs(((COSNumber)g).floatValue() - 0) < 0.001 &&
Math.abs(((COSNumber)b).floatValue() - 1) < 0.001));
}
}
}
else if (op.getOperation().equals("f"))
{
if (blueState.peek() && j > 0)
{
Object re = tokens.get(j-1);
if (re instanceof PDFOperator && ((PDFOperator)re).getOperation().equals("re"))
{
tokens.set(j, PDFOperator.getOperator("n"));
}
}
}
}
}
PDStream updatedStream = new PDStream(document);
OutputStream out = updatedStream.createOutputStream();
ContentStreamWriter tokenWriter = new ContentStreamWriter(out);
tokenWriter.writeTokens(tokens);
page.setContents(updatedStream);
}
}
(RemoveUnderlines.java)
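To make the state tracking in removeBlueRectangles easier to follow in isolation, here is a simplified sketch that runs the same idea over plain string tokens instead of PDFBox objects. The class and method names are made up for illustration. Unlike the method above, it ignores an unbalanced Q rather than popping an empty stack.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

/** Simplified stand-in: content-stream tokens as plain strings instead of PDFBox objects. */
final class BlueFillSketch {

    /** True if the three operands of an rg operator denote pure blue (0 0 1). */
    static boolean isBlue(String r, String g, String b) {
        try {
            return Math.abs(Float.parseFloat(r)) < 0.001
                && Math.abs(Float.parseFloat(g)) < 0.001
                && Math.abs(Float.parseFloat(b) - 1) < 0.001;
        } catch (NumberFormatException e) {
            return false; // operands were not plain numbers
        }
    }

    /** Returns a copy in which every "re f" issued while the fill colour is blue becomes "re n". */
    static List<String> dropBlueRectFills(List<String> tokens) {
        List<String> out = new ArrayList<>(tokens);
        Deque<Boolean> blueState = new ArrayDeque<>();
        blueState.push(false);
        for (int j = 0; j < out.size(); j++) {
            String t = out.get(j);
            if (t.equals("q")) {
                blueState.push(blueState.peek());   // save graphics state
            } else if (t.equals("Q") && blueState.size() > 1) {
                blueState.pop();                    // restore graphics state
            } else if (t.equals("rg") && j > 2) {
                blueState.pop();                    // replace the current fill-colour flag
                blueState.push(isBlue(out.get(j - 3), out.get(j - 2), out.get(j - 1)));
            } else if (t.equals("f") && blueState.peek() && j > 0 && out.get(j - 1).equals("re")) {
                out.set(j, "n");                    // end the path without painting it
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList(
            "q", "0", "0", "1", "rg", "10", "20", "100", "0.5", "re", "f", "Q",
            "10", "40", "100", "0.5", "re", "f");
        System.out.println(dropBlueRectFills(tokens));
    }
}
```

In the sample token list the first rectangle is filled inside a q ... Q block that sets blue, so its f becomes n; the second rectangle is filled after Q has restored the non-blue state and is left alone.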
original.pdf
Applying this to your first sample file original.pdf
public void testOriginal() throws IOException, COSVisitorException
{
try ( InputStream resourceStream = getClass().getResourceAsStream("original.pdf") )
{
PDDocument document = PDDocument.loadNonSeq(resourceStream, null);
removeBlueRectangles(document);
document.save("original-noBlueRectangles.pdf");
document.close();
}
}
(RemoveUnderlines.java)
results in
1178.pdf
You commented
After testing this on many files I have to say this solution works incorrectly in some cases. For example in for this file (dropbox.com/s/23g54bvt781lb93/1178.pdf?dl=0) it removes the entire content of the page. Keep searching..
So I applied the code to your new sample file 1178.pdf
public void test1178() throws IOException, COSVisitorException
{
try ( InputStream resourceStream = getClass().getResourceAsStream("1178.pdf") )
{
PDDocument document = PDDocument.loadNonSeq(resourceStream, null);
removeBlueRectangles(document);
document.save(new File(RESULT_FOLDER, "1178-noBlueRectangles.pdf"));
document.close();
}
}
(RemoveUnderlines.java)
which resulted in
So I cannot confirm your claim that the solution works incorrectly; in particular I see that it does not remove the entire content of the page.
As I cannot reproduce your observation, I assume there are additional issues in your setup you have not yet mentioned.

iText: How to retrieve the list of non-embedded fonts of a PDF

I would like to check whether all fonts in a PDF are embedded. I followed the code mentioned in How to check that all used fonts are embedded in PDF with Java iText? but I am still not able to get a proper list of the fonts used.
See my example PDF: https://www.dropbox.com/s/anvm49vh87d8yqs/000024944.pdf?dl=0. The code returns no fonts at all, but the document properties in Acrobat mention Helvetica + Verdana (Embedded Subset) + Verdana-Bold (Embedded Subset). For other PDFs I do get Verdana (Embedded Subset); only for this kind of PDF do I fail to get the font list.
As we have to deal with a huge number of PDFs from internal as well as external sources, we need to be able to embed fonts in order to print them. As it is almost impossible to embed all fonts, we just want to embed common fonts; for exotic fonts we would ignore the print request.
Can anyone help me to solve this issue? Thanks
Got it working after all by referring to BASEFONT instead of FONT:
/**
* Creates a Set containing information about the fonts in the src PDF file.
* @param src the path to a PDF file
* @throws IOException
*/
public void listFonts(PdfReader reader, Set<String> set) throws IOException {
try {
int n = reader.getXrefSize();
PdfObject object;
PdfDictionary font;
for (int i = 0; i < n; i++) {
object = reader.getPdfObject(i);
if (object == null || !object.isDictionary()) {
continue;
}
font = (PdfDictionary)object;
if (font.get(PdfName.BASEFONT) != null) {
System.out.println("fontname " + font.getAsName(PdfName.BASEFONT).toString());
processFont(font,set);
}
}
} catch (Exception e) {
System.out.println("error " + e.getMessage());
}
}
/**
* Finds out if the font is an embedded subset font
* @param font name
* @return true if the name denotes an embedded subset font
*/
private boolean isEmbeddedSubset(String name) {
//name = String.format("%s subset (%s)", name.substring(8), name.substring(1, 7));
return name != null && name.length() > 8 && name.charAt(7) == '+';
}
private void processFont(PdfDictionary font, Set<String> set) {
String name = font.getAsName(PdfName.BASEFONT).toString();
if(isEmbeddedSubset(name)) {
return;
}
PdfDictionary desc = font.getAsDict(PdfName.FONTDESCRIPTOR);
//nofontdescriptor
if (desc == null) {
System.out.println("desc null " );
PdfArray descendant = font.getAsArray(PdfName.DESCENDANTFONTS);
if (descendant == null) {
System.out.println("descendant null " );
set.add(name.substring(1));
}
else {
System.out.println("descendant not null " );
for (int i = 0; i < descendant.size(); i++) {
PdfDictionary dic = descendant.getAsDict(i);
processFont(dic, set);
}
}
}
/**
* (Type 1) embedded
*/
else if (desc.get(PdfName.FONTFILE) != null) {
System.out.println("(Type 1) embedded ");
}
/**
* (TrueType) embedded
*/
else if (desc.get(PdfName.FONTFILE2) != null) {
System.out.println("(FONTFILE2) embedded ");
}
/**
* " (" + font.getAsName(PdfName.SUBTYPE).toString().substring(1) + ") embedded"
*/
else if (desc.get(PdfName.FONTFILE3) != null) {
System.out.println("(FONTFILE3) ");
}
else {
set.add(name.substring(1));
}
}
This gives me the same results as the list of fonts under Properties in Acrobat Reader.
I managed to get some results by combining code from How to check that all used fonts are embedded in PDF with Java iText? and http://itextpdf.com/examples/iia.php?id=288.
Initially it did not work because font.getAsName(PdfName.BASEFONT).toString(); does not work in my case, but after a small change I get some results.
Below is my code:
/**
* Creates a Set containing information about the fonts in the src PDF file.
* @param src the path to a PDF file
* @throws IOException
*/
public void listFonts(PdfReader reader, Set<String> set) throws IOException {
int n = reader.getXrefSize();
PdfObject object;
PdfDictionary font;
for (int i = 0; i < n; i++) {
object = reader.getPdfObject(i);
if (object == null || !object.isDictionary()) {
continue;
}
font = (PdfDictionary)object;
if (font.get(PdfName.FONTNAME) != null) {
System.out.println("fontname " + font.get(PdfName.FONTNAME));
processFont(font,set);
}
}
}
/**
* Finds out if the font is an embedded subset font
* @param font name
* @return true if the name denotes an embedded subset font
*/
private boolean isEmbeddedSubset(String name) {
//name = String.format("%s subset (%s)", name.substring(8), name.substring(1, 7));
return name != null && name.length() > 8 && name.charAt(7) == '+';
}
private void processFont(PdfDictionary font, Set<String> set) {
String name = font.get(PdfName.FONTNAME).toString();
if(isEmbeddedSubset(name)) {
return;
}
PdfDictionary desc = font.getAsDict(PdfName.FONTDESCRIPTOR);
//nofontdescriptor
if (desc == null) {
System.out.println("desc null " );
PdfArray descendant = font.getAsArray(PdfName.DESCENDANTFONTS);
if (descendant == null) {
System.out.println("descendant null " );
set.add(name.substring(1));
}
else {
System.out.println("descendant not null " );
for (int i = 0; i < descendant.size(); i++) {
PdfDictionary dic = descendant.getAsDict(i);
processFont(dic, set);
}
}
}
/**
* (Type 1) embedded
*/
else if (desc.get(PdfName.FONTFILE) != null) {
System.out.println("(Type 1) embedded ");
}
/**
* (TrueType) embedded
*/
else if (desc.get(PdfName.FONTFILE2) != null) {
System.out.println("(FONTFILE2) embedded ");
}
/**
* " (" + font.getAsName(PdfName.SUBTYPE).toString().substring(1) + ") embedded"
*/
else if (desc.get(PdfName.FONTFILE3) != null) {
System.out.println("(FONTFILE3) ");
}
else {
set.add(name.substring(1));
}
}
}
So instead of using String name = font.getAsName(PdfName.BASEFONT).toString(); I changed it to String name = font.get(PdfName.FONTNAME).toString();
This definitely gets better results, as it gives me different fonts. However, I do not get results for the font descriptor and descendant fonts. Either they are simply not available in my PDFs, or because I changed the code I will never end up there.
Can I assume that if a subset prefix is found the font is embedded, and that if no subset prefix is present in the font name the font is not embedded?
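On that last question: in the PDF format, the name of a font subset begins with a tag of exactly six uppercase letters followed by a plus sign, so a tagged name does indicate an embedded subset. The reverse does not hold, because a font can be embedded in full without any tag; the FontFile, FontFile2 and FontFile3 entries of the font descriptor remain the more reliable indicator. The isEmbeddedSubset check above only tests the position of the plus sign. A slightly stricter test could look like the following sketch, which is plain JDK code with illustrative names and not an iText API.

```java
import java.util.regex.Pattern;

/** Illustrative check (not an iText API) for the subset tag in a PDF font name. */
final class SubsetTagSketch {

    // Optional leading '/', as printed by a PDF name object, then six uppercase letters and '+'.
    private static final Pattern SUBSET = Pattern.compile("/?[A-Z]{6}\\+.+");

    /** True if the font name starts with a six-letter uppercase subset tag. */
    static boolean hasSubsetTag(String fontName) {
        return fontName != null && SUBSET.matcher(fontName).matches();
    }

    public static void main(String[] args) {
        System.out.println(hasSubsetTag("/ABCDEF+Verdana")); // tagged subset name
        System.out.println(hasSubsetTag("/Helvetica"));      // no tag
        System.out.println(hasSubsetTag("/Symbol+Extra"));   // '+' at index 7, but not a valid tag
    }
}
```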

With Lucene 4.3.1, How to get all terms which occur in sub-range of all docs

Suppose a Lucene index with the fields date and content.
I want to get all term values and frequencies for the docs whose date is yesterday. The date field is a keyword field; the content field is analyzed and indexed.
Please help me with sample code.
My solution is as follows:
/**
*
*
* @param searcher
* @param fromDateTime
* - yyyymmddhhmmss
* @param toDateTime
* - yyyymmddhhmmss
* @return
*/
static public String top10(IndexSearcher searcher, String fromDateTime,
String toDateTime) {
String top10Query = "";
try {
Query query = new TermRangeQuery("tweetDate", new BytesRef(
fromDateTime), new BytesRef(toDateTime), true, false);
final BitSet bits = new BitSet(searcher.getIndexReader().maxDoc());
searcher.search(query, new Collector() {
private int docBase;
@Override
public void setScorer(Scorer scorer) throws IOException {
}
@Override
public void setNextReader(AtomicReaderContext context)
throws IOException {
this.docBase = context.docBase;
}
@Override
public void collect(int doc) throws IOException {
bits.set(doc + docBase);
}
@Override
public boolean acceptsDocsOutOfOrder() {
return false;
}
});
//
Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_43,
EnglishStopWords.getEnglishStopWords());
//
HashMap<String, Long> wordFrequency = new HashMap<>();
for (int wx = 0; wx < bits.length(); ++wx) {
if (bits.get(wx)) {
Document wd = searcher.doc(wx);
//
TokenStream tokenStream = analyzer.tokenStream("temp",
new StringReader(wd.get("content")));
// OffsetAttribute offsetAttribute = tokenStream
// .addAttribute(OffsetAttribute.class);
CharTermAttribute charTermAttribute = tokenStream
.addAttribute(CharTermAttribute.class);
tokenStream.reset();
while (tokenStream.incrementToken()) {
// int startOffset = offsetAttribute.startOffset();
// int endOffset = offsetAttribute.endOffset();
String term = charTermAttribute.toString();
if (term.length() < 2)
continue;
Long wl;
if ((wl = wordFrequency.get(term)) == null)
wordFrequency.put(term, 1L);
else {
wl += 1;
wordFrequency.put(term, wl);
}
}
tokenStream.end();
tokenStream.close();
}
}
analyzer.close();
// sort
List<String> occurterm = new ArrayList<String>();
for (String ws : wordFrequency.keySet()) {
occurterm.add(String.format("%06d\t%s", wordFrequency.get(ws),
ws));
}
Collections.sort(occurterm, Collections.reverseOrder());
// make query string by top 10 words
int topCount = 10;
for (String ws : occurterm) {
if (topCount-- == 0)
break;
String[] tks = ws.split("\\t");
top10Query += tks[1] + " ";
}
top10Query = top10Query.trim(); // String is immutable, so keep the trimmed result
} catch (IOException e) {
e.printStackTrace();
} finally {
}
// return top10 word string
return top10Query;
}
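The counting and ranking part of the solution above does not depend on Lucene. It sorts strings formatted with %06d, which orders correctly only while every count has at most six digits. A minimal alternative sketch using only the JDK (Java 8 or later assumed for Map.merge and lambdas; the class and method names are illustrative):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative helper (plain JDK, no Lucene types): count terms and pick the most frequent. */
final class TopTermsSketch {

    /** Counts terms, skipping those shorter than two characters as the code above does. */
    static Map<String, Long> countTerms(Iterable<String> terms) {
        Map<String, Long> freq = new HashMap<>();
        for (String term : terms) {
            if (term.length() < 2) {
                continue;
            }
            freq.merge(term, 1L, Long::sum);
        }
        return freq;
    }

    /** Returns up to n terms, highest count first; ties are broken alphabetically. */
    static List<String> topTerms(Map<String, Long> freq, int n) {
        List<Map.Entry<String, Long>> entries = new ArrayList<>(freq.entrySet());
        entries.sort((a, b) -> {
            int byCount = Long.compare(b.getValue(), a.getValue()); // descending count
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        List<String> top = new ArrayList<>();
        for (Map.Entry<String, Long> e : entries) {
            if (top.size() == n) {
                break;
            }
            top.add(e.getKey());
        }
        return top;
    }

    public static void main(String[] args) {
        Map<String, Long> freq = countTerms(
            Arrays.asList("pdf", "box", "pdf", "a", "text", "box", "pdf"));
        // Join the top terms into a query string, as top10 does.
        System.out.println(String.join(" ", topTerms(freq, 2)));
    }
}
```

In top10, the terms produced by the token stream loop would be fed to countTerms, and topTerms(freq, 10) would replace the formatted-string sort.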

Adding JPG to PDF extremely slow

I'm trying to write an image to a PDF using PDFBox. I'm using their sample (attached below). Everything works, but writing a 3.5 MB JPEG (3200×2500 px) takes roughly 2 seconds.
Is this normal? Is there any way to make it faster (at least 10x)?
public void createPDFFromImage( String inputFile, String image, String outputFile )
throws IOException, COSVisitorException
{
// the document
PDDocument doc = null;
try
{
doc = PDDocument.load( inputFile );
//we will add the image to the first page.
PDPage page = (PDPage)doc.getDocumentCatalog().getAllPages().get( 0 );
PDXObjectImage ximage = null;
if( image.toLowerCase().endsWith( ".jpg" ) )
{
ximage = new PDJpeg(doc, new FileInputStream( image ) );
}
else if (image.toLowerCase().endsWith(".tif") || image.toLowerCase().endsWith(".tiff"))
{
ximage = new PDCcitt(doc, new RandomAccessFile(new File(image),"r"));
}
else
{
//BufferedImage awtImage = ImageIO.read( new File( image ) );
//ximage = new PDPixelMap(doc, awtImage);
throw new IOException( "Image type not supported:" + image );
}
PDPageContentStream contentStream = new PDPageContentStream(doc, page, true, true);
contentStream.drawImage( ximage, 20, 20 );
contentStream.close();
doc.save( outputFile );
}
finally
{
if( doc != null )
{
doc.close();
}
}
}
If you are willing to use another product, iText can be really fast; take a look at http://tutorials.jenkov.com/java-itext/image.html. Personally, I ran this test with a JPEG image of more than 750 KB and it took 78 ms:
try {
PdfWriter.getInstance(document,
new FileOutputStream("Image2.pdf"));
document.open();
long start = System.currentTimeMillis();
String imageUrl = "c:/Users/dummy/notSoBigImage.jpg";
Image image = Image.getInstance((imageUrl));
image.setAbsolutePosition(500f, 650f);
document.add(image);
document.close();
long end = System.currentTimeMillis() - start;
System.out.println("time: " + end + " ms");
} catch(Exception e){
e.printStackTrace();
}