Update PDF using PDFBox

I would like to ask whether, given a PDF, it is possible to update it at a specific point using the PDFBox libraries.
I am trying to use a solution found online, but the getTokens() method does not seem to section the words properly, so I cannot find the part I would like to modify.
This is the code (Groovy):
for (int i = 0; i < dataContext.getDataCount(); i++) {
    InputStream is = dataContext.getStream(i);
    Properties props = dataContext.getProperties(i);
    String searchString = "Hours worked";
    String replacement = "Hours worked: 2";
    File file = new File("\\\\****\\UKDC\\GFS\\PRE\\PREPROD\\Alchemer\\Template\\***.pdf");
    PDDocument doc = PDDocument.load(file);
    for (PDPage page : doc.getPages()) {
        PDFStreamParser parser = new PDFStreamParser(page);
        parser.parse();
        List tokens = parser.getTokens();
        logger.info("in Page");
        for (int j = 0; j < tokens.size(); j++) {
            logger.info("tokens:" + tokens[j]);
            Object next = tokens.get(j);
            //logger.info("in Object");
            if (next instanceof Operator) {
                Operator op = (Operator) next;
                String pstring = "";
                int prej = 0;
                // Tj and TJ are the two operators that display strings in a PDF
                if (op.getName().equals("Tj")) {
                    logger.info("in Tj");
                    // Tj takes one operand, the string to display, so update that operand
                    COSString previous = (COSString) tokens.get(j - 1);
                    String string = previous.getString();
                    logger.info("previousString:" + string);
                    string = string.replaceFirst(searchString, replacement);
                    previous.setValue(string.getBytes());
                } else if (op.getName().equals("TJ")) {
                    logger.info("in TJ:" + op.getName());
                    COSArray previous = (COSArray) tokens.get(j - 1);
                    logger.info("previous:" + previous);
                    for (int k = 0; k < previous.size(); k++) {
                        Object arrElement = previous.getObject(k);
                        if (arrElement instanceof COSString) {
                            COSString cosString = (COSString) arrElement;
                            String string = cosString.getString();
                            logger.info("string:" + string);
                            if (j == prej || string.equals(" ") || string.equals(":") || string.equals("-")) {
                                pstring += string;
                            } else {
                                prej = j;
                                pstring = string;
                            }
                        }
                    }
                    logger.info("pstring:" + pstring);
                    if (searchString.equals(pstring.trim())) {
                        logger.info("in searchString");
                        COSString cosString2 = (COSString) previous.getObject(0);
                        cosString2.setValue(replacement.getBytes());
                        int total = previous.size() - 1;
                        for (int k = total; k > 0; k--) {
                            previous.remove(k);
                        }
                    }
                }
            }
        }
        logger.info("in updatedStream");
        // now that the tokens are updated we will replace the page content stream
        PDStream updatedStream = new PDStream(doc);
        OutputStream out = updatedStream.createOutputStream(COSName.FLATE_DECODE);
        ContentStreamWriter tokenWriter = new ContentStreamWriter(out);
        tokenWriter.writeTokens(tokens);
        logger.info("in tokenWriter");
        out.close();
        page.setContents(updatedStream);
        doc.save("\\\\***\\UKDC\\GFS\\PRE\\PREPROD\\Alchemer\\***1.pdf");
    }
Executing the code, I am trying to search for the string "Hours worked" and replace it with "Hours worked: 2".
There are two problems:
1. When I execute and check the logs, I can see the tokens are not created the way I expect: two different COSArrays are produced even though the text is all on one line, and this is a problem if I have to search for a specific word.
2. When the code does find the word, the replacement seems to work, but a strange character appears in the output PDF.
So, two questions:
How can I control the tokenizer (or the parser) so that an entire phrase ends up in the same token, up to a special character?
How do I encode the new text correctly in the new PDF?
Hope you can help me, thanks for your support.
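For what it's worth: a TJ operand is an array that usually mixes several COSString fragments with kerning numbers, so a phrase like "Hours worked" can be split across two or more fragments (or even across separate Tj/TJ operators), which is why the tokens look "wrong" in the logs. A small sketch of joining the fragments of one TJ array before matching (PDFBox 2.x assumed; the helper name is made up):

import org.apache.pdfbox.cos.COSArray;
import org.apache.pdfbox.cos.COSBase;
import org.apache.pdfbox.cos.COSString;

// Hedged sketch: join every COSString fragment of one TJ operand array so that
// the comparison sees the whole phrase instead of individual fragments.
static String joinTJFragments(COSArray operand) {
    StringBuilder phrase = new StringBuilder();
    for (int k = 0; k < operand.size(); k++) {
        COSBase element = operand.getObject(k);
        if (element instanceof COSString) {
            phrase.append(((COSString) element).getString());
        }
        // numeric elements are kerning adjustments and are skipped here
    }
    return phrase.toString();
}

In the TJ branch above this would be called as joinTJFragments(previous).contains(searchString); note it still cannot see phrases split across separate Tj/TJ operators. As for the strange character: previous.setValue(replacement.getBytes()) writes platform-default bytes, but the string operands of Tj/TJ are interpreted in the encoding of the font currently selected in the content stream, so the replacement only renders correctly when those encodings happen to match; with subsetted or CID-keyed fonts the new characters will usually display as wrong or missing glyphs.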

Related

Changing zoom level of all links in a PDF document

I'd like to change the zoom level of links in PDF files using iText 7. There are already solutions for legacy iText 5 (1, 2), but there are none for iText 7.
There are many ways links can work (they can even run JavaScript code), but for now I'm interested in GoTo actions (i.e. /XYZ, /Fit and the others described in ISO 32000-1, Table 151).
My current code locates GoTo actions but I don't know what the destination of a link is (such as page number, coordinates for XYZ zoom type). How do I get the destination of a link? Or am I missing something and that problem can be solved in a different way?
for (int i = 1; i <= numberOfPages; i++) {
    PdfPage page = document.getPage(i);
    List<PdfAnnotation> annots = page.getAnnotations();
    if (annots.isEmpty()) {
        continue;
    }
    for (int j = 0; j < annots.size(); j++) {
        PdfAnnotation annot = annots.get(j);
        if (annot == null) {
            continue;
        }
        PdfDictionary action = annot.getPdfObject()
                .getAsDictionary(PdfName.A);
        if (action == null ||
                !PdfName.GoTo.equals(action.get(PdfName.S))) {
            continue;
        }
        annot.remove(PdfName.A);
        // here I need the destination of action
        int pageNumber = ???
        Rectangle cropBox = document.getPage(pageNumber).getCropBox();
        PdfAction goToAction = PdfAction.createGoTo(PdfExplicitDestination.createXYZ(
                document.getPage(pageNumber), cropBox.getLeft(), cropBox.getTop(), 2F));
        annot.put(PdfName.A, goToAction.getPdfObject());
    }
}
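For the explicit /GoTo case, the first element of the action's /D array is the destination page object itself, so the page can be resolved directly from that dictionary. A minimal, hedged sketch against the iText 7 kernel API (the getPage(PdfDictionary) overload is the one the working code below already uses; getPageNumber is assumed to be available):

import com.itextpdf.kernel.pdf.PdfArray;
import com.itextpdf.kernel.pdf.PdfDictionary;
import com.itextpdf.kernel.pdf.PdfDocument;
import com.itextpdf.kernel.pdf.PdfName;
import com.itextpdf.kernel.pdf.PdfPage;

// Hedged sketch: resolve the page of an explicit /GoTo destination.
// The first element of the /D array is the destination page's dictionary.
private static int resolveDestinationPage(PdfDocument document, PdfDictionary action) {
    PdfArray dest = action.getAsArray(PdfName.D);       // explicit destinations only
    PdfDictionary pageDict = dest.getAsDictionary(0);
    PdfPage targetPage = document.getPage(pageDict);    // same overload the working code below relies on
    return document.getPageNumber(targetPage);          // assumed kernel API
}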
UPDATE Working Code
I managed to deal with GoTo actions, both explicit and of the Named Destination type. It works on the PDF files I tested the code against, but I'm not sure whether it covers all possible cases.
Note:
When dealing with a named destination I create an explicit destination and then use it instead of the named destination. The named destination, however, still exists in the PDF. How do I get rid of it? In iText 5 there was a method consolidateNamedDestinations() which seems to do what I need:
Replaces all the local named links with the actual destinations.
Please see my code:
try (document) {
    Map<String, PdfObject> namedDests = null;
    for (int i = 1; i <= document.getNumberOfPages(); i++) {
        List<PdfAnnotation> annots = document.getPage(i).getAnnotations();
        if (annots.isEmpty()) {
            continue;
        }
        for (PdfAnnotation annot : annots) {
            if (annot == null) {
                continue;
            }
            PdfDictionary action = annot.getPdfObject()
                    .getAsDictionary(PdfName.A);
            if (action == null) {
                continue;
            }
            PdfName actionType = action.getAsName(PdfName.S);
            if (PdfName.Link.equals(actionType)) {
                throw new UnsupportedOperationException(
                        "Action of type " + action);
            } else if (!PdfName.GoTo.equals(actionType)) {
                continue;
            }
            PdfArray dest = action.getAsArray(PdfName.D);
            // explicit GoTo case
            if (dest != null) {
                changeZoomLevel(document, annot, dest.getAsDictionary(0));
                continue;
            }
            // Named Destination case
            if (namedDests == null) {
                namedDests = document.getCatalog()
                        .getNameTree(PdfName.Dests)
                        .getNames();
            }
            String namedDest = action.getAsString(PdfName.D)
                    .getValue();
            PdfObject d = namedDests.get(namedDest);
            PdfDictionary dict = null;
            if (d instanceof PdfDictionary) {
                PdfArray arr = (PdfArray) ((PdfDictionary) d).get(PdfName.D);
                dict = arr.getAsDictionary(0);
            } else if (d instanceof PdfArray) {
                dict = ((PdfArray) d).getAsDictionary(0);
            }
            changeZoomLevel(document, annot, dict);
        }
    }
}

private static void changeZoomLevel(PdfDocument document,
        PdfAnnotation annot, PdfDictionary dest) {
    annot.remove(PdfName.A);
    annot.put(PdfName.A, createPdfAction(document, dest));
}

private static PdfObject createPdfAction(PdfDocument document, PdfDictionary dest) {
    PdfPage actionPage = document.getPage(dest);
    Rectangle cropBox = actionPage.getCropBox();
    return PdfAction.createGoTo(PdfExplicitDestination.createXYZ(
            actionPage, cropBox.getLeft(), cropBox.getTop(), 1.5F)).getPdfObject();
}

PDFBox: Remove text behind image

I have a requirement to split a PDF into two: one with the images and one with the text. I don't want to remove the text that sits behind an image; it should stay part of the image PDF. I only want to extract the top-layered text of the PDF. Can anyone help with this?
I have already extracted the images and text into two PDFs by looping through the PDF operators. I am having trouble deciding when not to remove the text that lies behind an image.
Code for removing text:
PDDocument document = null;
try {
    document = PDDocument.load(new File(inputfilePath), password);
    PDPageTree allPages = document.getDocumentCatalog().getPages();
    for (int i = 0; i < allPages.getCount(); i++) {
        PDPage page = (PDPage) allPages.get(i);
        PDFStreamParser parser = new PDFStreamParser(page);
        parser.parse();
        List tokens = parser.getTokens();
        List newTokens = new ArrayList();
        for (int j = 0; j < tokens.size(); j++) {
            Object token = tokens.get(j);
            if (token instanceof Operator) {
                Operator op = (Operator) token;
                if (op.getName().equalsIgnoreCase("tj")) {
                    try {
                        // remove the one argument to this operator
                        newTokens.remove(newTokens.size() - 1);
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                    continue;
                }
                // Header doesn't contain version info
            }
            newTokens.add(token);
        }
        PDStream newContents = new PDStream(document);
        ContentStreamWriter writer = new ContentStreamWriter(newContents.createOutputStream());
        writer.writeTokens(newTokens);
        // In the writeTokens method I closed the output stream; noting this for
        // future reference, or it will throw a stream-not-closed error
        newContents.addCompression();
        page.setContents(newContents);
        if (DEBUG)
            System.out.println("Background image pdf creation process");
        document.setAllSecurityToBeRemoved(true);
        document.save(outputFolder + "/img_" + fileName);
And for removing the images and shades, I used the below code:
for (int j = 0; j < tokens.size(); j++) {
    Object token = tokens.get(j);
    if (token instanceof Operator) {
        Operator op = (Operator) token;
        // Text Extraction removing image and shades
        if (op.getName().equalsIgnoreCase("do") || op.getName().equalsIgnoreCase("sh")
                || op.getName().equalsIgnoreCase("gs") || op.getName().equalsIgnoreCase("bi")
                || op.getName().equalsIgnoreCase("id") || op.getName().equalsIgnoreCase("ei")
                || op.getName().equalsIgnoreCase("bmc") || op.getName().equalsIgnoreCase("bdc")
                || op.getName().equalsIgnoreCase("emc") || op.getName().equalsIgnoreCase("m")
                || op.getName().equalsIgnoreCase("w")
                || op.getName().equalsIgnoreCase("re")) {
            // remove the one argument to this operator
            newTokens.remove(newTokens.size() - 1);
            continue;
        }
    }
    newTokens.add(token);
}
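Note that the operators filtered above take different numbers of operands: re takes four, m takes two, gs, sh, w and Do take one, and EMC takes none, so removing only the single preceding token can leave stray operands in the new stream. A more general sketch (the helper name is made up here) drops everything that has accumulated since the previous operator:

import java.util.List;
import org.apache.pdfbox.contentstream.operator.Operator;

// Hedged sketch: drop an operator together with all of its operands by walking
// newTokens backwards until the previously kept Operator is reached.
static void dropPendingOperands(List<Object> newTokens) {
    while (!newTokens.isEmpty() && !(newTokens.get(newTokens.size() - 1) instanceof Operator)) {
        newTokens.remove(newTokens.size() - 1);   // remove one pending operand
    }
}

Inside the loop it would be called in place of the single newTokens.remove(newTokens.size() - 1) line, still followed by continue so that the operator itself is not copied.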

parser.getTokens() gives out junk data and single characters (PDFBox 1.8.9)

I am new to PDFBox. I am using pdfbox-app-2.0.0-RC1 to fetch the entire text from the PDF using PDFTextStripperByArea. Is it possible for me to get each string separately?
For example, in the following text:
Nomination : Name&Address
Shipper : shipper name
I need "Nomination" as a separate string and "Name&Address" as a separate string; instead I am getting each character separately. I have tried with different PDFs: for most of them I get the exact string, but for a few I don't.
I am using the following code to get the separate strings.
for (PDPage page : doc.getPages()) {
    PDFStreamParser parser = new PDFStreamParser(page);
    parser.parse();
    List<Object> tokens = parser.getTokens();
    for (int j = 0; j < tokens.size(); j++) {
        Object next = tokens.get(j);
        if (next instanceof Operator) {
            Operator op = (Operator) next;
            if (op.getName().equals("Tj")) {
                COSString previous = (COSString) tokens.get(j - 1);
                String string = previous.getString();
                System.out.println("string1===" + string);
                if (string.contains("Plant")) {
                    int size = al.size();
                    al.add(string);
                    stop = false;
                    continue;
                }
                if (!string.contains("_") && !stop) {
                    if (string.contains("Nomination")) {
                        stop = true;
                    } else {
                        al.add(string);
                    }
                }
            } else if (op.getName().equals("TJ")) {
                COSArray previous = (COSArray) tokens.get(j - 1);
                for (int k = 0; k < previous.size(); k++) {
                    Object arrElement = previous.getObject(k);
                    if (arrElement instanceof COSString) {
                        COSString cosString = (COSString) arrElement;
                        String string = cosString.getString();
                        System.out.println("string2====>>" + string);
                        al.add(string);
                    }
                }
            }
        }
    }
}
I am getting the following output:
string2====>>Nom
string2====>>i
string2====>>na
string2====>>t
string2====>>i
string2====>>on
string1===
string2====>>(
string2====>>T
string2====>>o
string1===
string2====>>Loa
string2====>>di
string2====>>ng
string1===
string2====>>Fa
string2====>>c
string2====>>i
string2====>>l
string2====>>i
string2====>>t
string2====>>y
string2====>>)
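The single characters appear because this particular PDF writes each glyph run as its own element of the TJ array, so the raw content-stream tokens mirror the layout rather than whole words. If the goal is only to read the text, not to rewrite the content stream, PDFTextStripper reassembles the fragments into lines; a minimal sketch, assuming PDFBox 2.x and a placeholder file name:

import java.io.File;
import java.io.IOException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class ExtractText {
    public static void main(String[] args) throws IOException {
        // "input.pdf" is a placeholder path
        try (PDDocument doc = PDDocument.load(new File("input.pdf"))) {
            PDFTextStripper stripper = new PDFTextStripper();
            stripper.setSortByPosition(true);   // keep the reading order stable
            System.out.println(stripper.getText(doc));
        }
    }
}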

How to write a method that removes punctuation?

I am trying to write a method that receives a sentence (str) and returns it without punctuation (.,?!:").
Right now my method is removing some punctuation but not all.
public static String dePunct(String str)
{
    String noPunct = str;
    int length = str.length();
    for (int i = 0; i < length; i++)
    {
        if (str.charAt(i) == '.' || str.charAt(i) == ',' || str.charAt(i) == '!' || str.charAt(i) == '?' || str.charAt(i) == ':' || str.charAt(i) == '"')
        {
            StringBuffer strSB = new StringBuffer(str);
            StringBuffer newStrSB = strSB.replace(i, i + 1, "");
            noPunct = newStrSB.toString();
            length = noPunct.length();
        }
    }
    return noPunct;
}
Use
String a = sentence.replaceAll("[.,:!\\?]","");
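A quick usage sketch (this also includes the double quote from the original list; inside a character class the ? does not need escaping):

String sentence = "Hello, world: is this \"clean\" yet?!";
String cleaned = sentence.replaceAll("[.,:!?\"]", "");
System.out.println(cleaned);   // prints: Hello world is this clean yet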

StyledDocument adds an extra count to indexOf for each line of the file

I have a strange problem (at least it appears that way): when searching for a string in a JTextPane, the index returned grows by one for each line searched when I use the StyledDocument versus just getting the text from the text pane. I get the same text from the same pane; it's just that one copy comes from the plain text and the other from the styled document. Am I missing something here? I'll list the relevant differences between the two versions I am working with below.
The plain text version:
public int displayXMLFile(String path, int target){
InputStreamReader inputStream;
FileInputStream fileStream;
BufferedReader buffReader;
if(target == 1){
try{
File file = new File(path);
fileStream = new FileInputStream(file);
inputStream = new InputStreamReader(fileStream,"UTF-8");
buffReader = new BufferedReader(inputStream);
StringBuffer content = new StringBuffer("");
String line = "";
while((line = buffReader.readLine())!=null){
content.append(line+"\n");
}
buffReader.close();
xhw.txtDisplay_1.setText(content.toString());
}
catch(Exception e){
e.printStackTrace();
return -1;
}
}
}
versus the StyledDocument version (without the styles applied):
protected void openFile(String path, StyledDocument sDoc, int target)
        throws BadLocationException {
    FileInputStream fileStream;
    String file;
    if(target == 1){
        file = "Openning First File";
    } else {
        file = "Openning Second File";
    }
    try {
        fileStream = new FileInputStream(path);
        // Get the object of DataInputStream
        //DataInputStream in = new DataInputStream(fileStream);
        ProgressMonitorInputStream in = new ProgressMonitorInputStream(
                xw.getContentPane(), file, fileStream);
        BufferedReader br = new BufferedReader(new InputStreamReader(in));
        String strLine;
        //Read File Line By Line
        while ((strLine = br.readLine()) != null) {
            sDoc.insertString(sDoc.getLength(), strLine + "\n", sDoc.getStyle("regular"));
            xw.updateProgress(target);
        }
        //Close the input stream
        in.close();
    } catch (Exception e){ //Catch exception if any
        System.err.println("Error: " + e.getMessage());
    }
This is how I search:
public int searchText(int sPos, int target) throws BadLocationException{
    String search = xhw.textSearch.getText();
    String contents;
    JTextPane searchPane;
    if(target == 1){
        searchPane = xhw.txtDisplay_1;
    } else {
        searchPane = xhw.txtDisplay_2;
    }
    if(xhw.textSearch.getText().isEmpty()){
        xhw.displayDialog("Nothing to search for");
        highlight(searchPane, null, 0, 0);
    } else {
        contents = searchPane.getText();
        // Search for the desired string starting at cursor position
        int newPos = contents.indexOf( search, sPos );
        // cycle cursor to beginning of doc window
        if (newPos == -1 && sPos > 0){
            sPos = 0;
            newPos = contents.indexOf( search, sPos );
        }
        if ( newPos >= 0 ) {
            // Select occurrence if found
            highlight(searchPane, contents, newPos, target);
            sPos = newPos + search.length() + 1;
        } else {
            xhw.displayDialog("\"" + search + "\"" + " was not found in File " + target);
        }
    }
    return sPos;
}
The sample file:
<?xml version="1.0" encoding="UTF-8"?>
<AlternateDepartureRoutes>
<AlternateDepartureRoute>
<AdrName>BOIRR</AdrName>
<AdrRouteAlpha>..BROPH..</AdrRouteAlpha>
<TransitionFix>
<FixName>BROPH</FixName>
</TransitionFix>
</AlternateDepartureRoute>
<AlternateDepartureRoute>
</AlternateDepartureRoutes>
And my highlighter:
public void highlight(JTextPane tPane, String text, int position, int target) throws BadLocationException {
    Highlighter highlighter = new DefaultHighlighter();
    Highlighter.HighlightPainter painter = new DefaultHighlighter.DefaultHighlightPainter(Color.LIGHT_GRAY);
    tPane.setHighlighter(highlighter);
    String searchText = xhw.textSearch.getText();
    String document = tPane.getText();
    int startOfSString = document.indexOf(searchText, position);
    if(startOfSString >= 0){
        int endOfSString = startOfSString + searchText.length();
        highlighter.addHighlight(startOfSString, endOfSString, painter);
        tPane.setCaretPosition(endOfSString);
        int caretPos = tPane.getCaretPosition();
        javax.swing.text.Element root = tPane.getDocument().getDefaultRootElement();
        int lineNum = root.getElementIndex(caretPos) + 1;
        if (target == 1){
            xhw.txtLineNum1.setText(Integer.toString(lineNum));
        } else if (target == 2){
            xhw.txtLineNum2.setText(Integer.toString(lineNum));
        } else {
            xhw.txtLineNum1.setText(null);
            xhw.txtLineNum2.setText(null);
        }
    } else {
        highlighter.removeAllHighlights();
    }
}
When I search for "Alt" with indexOf(), I get 40 from the plain text (which is what it should return) and 41 when searching with the styled document. For each additional line that "Alt" appears on I get an extra index (so the indexOf() call returns 2 more than needed for the occurrence on line 3 of the file). This happens for every additional line where it is found. Am I missing something obvious? (If I need to reduce this to a smaller single class to make it easier to check, I can do that later when I have some more time.)
Thanks in advance...
If you are on Windows, then the TextComponent text (searchPane.getText()) can contain carriage-return+newline characters (\r\n), but the TextComponent's Styled Document (sSearchPane.getText(0, sSearchPane.getLength())) contains only newline characters (\n). That's why your newPos is always larger than newPosS by the number of newlines at that point. To fix this, in your search function you can change:
contents = searchPane.getText();
to:
contents = searchPane.getText().replaceAll("\r\n","\n");
That way the search occurs with the same indices that the Styled Document is using.
OK, I have found a solution (basically). I approached this from the angle that I am getting text from the same text component in two different ways...
String search = xw.textSearch.getText();
String contents;
String contentsS;
JTextPane searchPane;
StyledDocument sSearchPane;
searchPane = xw.txtDisplay_left;
sSearchPane = xw.txtDisplay_left.getStyledDocument();
contents = searchPane.getText();
contentsS = sSearchPane.getText(0, sSearchPane.getLength());
// Search for the desired string starting at cursor position
int newPos = contents.indexOf( search, sPos );
int newPosS = contentsS.indexOf(search, sPos);
Comparing the two variables newPos and newPosS: newPos returned more than newPosS by one for each line break before the match. Looking at the sample file and searching for "Alt", the first instance is found on line 2: newPos returns 41 and newPosS returns 40 (which then highlights the correct text). The next occurrence (found on line 3) gives newPos 71 and newPosS 69. As you can see, the difference grows by one for every line the occurrence moves down, so I suspect an extra character is being added for each new line in the text taken from the JTextPane that is not present in the StyledDocument.
I'm sure there is a reasonable explanation, but I don't have it at this time.
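A quick way to confirm that suspicion with the variables defined above (a small sketch; the length difference should equal the number of line breaks if the pane's text uses \r\n, as described in the answer above):

String fromPane = searchPane.getText();
String fromDoc = sSearchPane.getText(0, sSearchPane.getLength());
// On Windows the pane's text uses "\r\n" while the document's text uses "\n",
// so this prints the number of line breaks in the document.
System.out.println(fromPane.length() - fromDoc.length());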