Text displayed in blue although PDAnnotation removed - pdfbox

We have a requirement where we need to remove annotation on some matched conditional check. PDAnnotaion gets removed when I have executed allPageAnnotationsList.remove(annotationTobeRemoved) statement.
But corresponding text remained displayed in blue color only. How could I update the text color to normal(black)?

Originally I thought you asked for all non-black text on a page to be changed to black. This resulted in my original answer, now the first section 'Updating All Text to Black'. Then you clarified that you only wanted the text in the areas of the removed annotations to be made black. That's shown in the second section 'Updating Text in Areas to Black'.
Updating All Text to Black
First of all, as already described by Tilman in comments, removing link annotations usually merely removes the interactivity of that link but the text in the area of the link annotation remains as is. If you want to update the text color to normal(black), therefore, you have to add a second step and manipulate the colors in the static page contents.
The static page content is defined by a stream of instructions which change the graphics state or draw something. The color used for drawing is part of the graphics state and is set by explicit color setting instructions. Thus, one could think you could simply replace all color setting instructions by instructions selecting normal(black).
Unfortunately it's not that easy because colors may be changed to draw other things, too. E.g. in your document at the start the whole page is filled with white; if you replaced the color setting instruction before that fill instruction, your whole page would be black. Not exactly what you want.
To update the text color to normal(black) but not change other colors, therefore, you have to consider the context of instructions you want to change.
The PDFBox parsing framework can help you here, iterating over a content stream and keeping track of the graphics state.
Based upon that framework, furthermore, a generic content stream editor helper class has been created in this answer, the PdfContentStreamEditor. (For details and example uses see that answer.) Now you merely have to customize it for your use case, e.g. like this:
PDDocument document = ...;
for (PDPage page : document.getDocumentCatalog().getPages()) {
PdfContentStreamEditor editor = new PdfContentStreamEditor(document, page) {
#Override
protected void write(ContentStreamWriter contentStreamWriter, Operator operator, List<COSBase> operands) throws IOException {
String operatorString = operator.getName();
if (TEXT_SHOWING_OPERATORS.contains(operatorString)) {
if (currentlyReplacedColor == null)
{
PDColor currentFillColor = getGraphicsState().getNonStrokingColor();
if (!isBlack(currentFillColor))
{
currentlyReplacedColor = currentFillColor;
super.write(contentStreamWriter, SET_NON_STROKING_GRAY, GRAY_BLACK_VALUES);
}
}
} else if (currentlyReplacedColor != null) {
PDColorSpace replacedColorSpace = currentlyReplacedColor.getColorSpace();
List<COSBase> replacedColorValues = new ArrayList<>();
for (float f : currentlyReplacedColor.getComponents())
replacedColorValues.add(new COSFloat(f));
if (replacedColorSpace instanceof PDDeviceCMYK)
super.write(contentStreamWriter, SET_NON_STROKING_CMYK, replacedColorValues);
else if (replacedColorSpace instanceof PDDeviceGray)
super.write(contentStreamWriter, SET_NON_STROKING_GRAY, replacedColorValues);
else if (replacedColorSpace instanceof PDDeviceRGB)
super.write(contentStreamWriter, SET_NON_STROKING_RGB, replacedColorValues);
else {
//TODO
}
currentlyReplacedColor = null;
}
super.write(contentStreamWriter, operator, operands);
}
PDColor currentlyReplacedColor = null;
final List<String> TEXT_SHOWING_OPERATORS = Arrays.asList("Tj", "'", "\"", "TJ");
final Operator SET_NON_STROKING_CMYK = Operator.getOperator("k");
final Operator SET_NON_STROKING_RGB = Operator.getOperator("rg");
final Operator SET_NON_STROKING_GRAY = Operator.getOperator("g");
final List<COSBase> GRAY_BLACK_VALUES = Arrays.asList(COSInteger.ZERO);
};
editor.processPage(page);
}
document.save("withBlackText.pdf");
(ChangeTextColor test testMakeTextBlackTestAfterRemovingAnnotation)
Here we check whether the current instruction is a text drawing instruction. If it is and the current color is not already replaced, we check whether the current color is already black'ish. If it is not black, we store it and add an instruction to replace the current fill color by black.
Otherwise, i.e. if the current instruction is not a text drawing instruction, we check whether the current color has been replaced by black. If it has, we restore the original color.
To check whether a given color is black'ish, we use the following helper method.
static boolean isBlack(PDColor pdColor) {
PDColorSpace pdColorSpace = pdColor.getColorSpace();
float[] components = pdColor.getComponents();
if (pdColorSpace instanceof PDDeviceCMYK)
return (components[0] > .9f && components[1] > .9f && components[2] > .9f) || components[3] > .9f;
else if (pdColorSpace instanceof PDDeviceGray)
return components[0] < .1f;
else if (pdColorSpace instanceof PDDeviceRGB)
return components[0] < .1f && components[1] < .1f && components[2] < .1f;
else
return false;
}
(ChangeTextColor helper method)
Updating Text in Areas to Black
In comments you clarified that you only want the text in the areas of the removed annotations to become black.
For this you have to collect the rectangles of the annotations you remove and later check the position before switching colors whether it's inside one of those rectangles.
This can be done by extending the code above as follows. Here I remove every other annotation only and collect their rectangles to check against them later. Also I override the PDFStreamEngine method showText(byte[]) to store the position of the text shown in the current text drawing instruction.
PDDocument document = ...;
for (PDPage page : document.getDocumentCatalog().getPages()) {
List<PDRectangle> areas = new ArrayList<>();
// Remove every other annotation, collect their areas
List<PDAnnotation> annotations = new ArrayList<>();
boolean remove = true;
for (PDAnnotation annotation : page.getAnnotations()) {
if (remove)
areas.add(annotation.getRectangle());
else
annotations.add(annotation);
remove = !remove;
}
page.setAnnotations(annotations);
PdfContentStreamEditor editor = new PdfContentStreamEditor(document, page) {
#Override
protected void write(ContentStreamWriter contentStreamWriter, Operator operator, List<COSBase> operands) throws IOException {
String operatorString = operator.getName();
if (TEXT_SHOWING_OPERATORS.contains(operatorString) && isInAreas()) {
if (currentlyReplacedColor == null)
{
PDColor currentFillColor = getGraphicsState().getNonStrokingColor();
if (!isBlack(currentFillColor))
{
currentlyReplacedColor = currentFillColor;
super.write(contentStreamWriter, SET_NON_STROKING_GRAY, GRAY_BLACK_VALUES);
}
}
} else if (currentlyReplacedColor != null) {
PDColorSpace replacedColorSpace = currentlyReplacedColor.getColorSpace();
List<COSBase> replacedColorValues = new ArrayList<>();
for (float f : currentlyReplacedColor.getComponents())
replacedColorValues.add(new COSFloat(f));
if (replacedColorSpace instanceof PDDeviceCMYK)
super.write(contentStreamWriter, SET_NON_STROKING_CMYK, replacedColorValues);
else if (replacedColorSpace instanceof PDDeviceGray)
super.write(contentStreamWriter, SET_NON_STROKING_GRAY, replacedColorValues);
else if (replacedColorSpace instanceof PDDeviceRGB)
super.write(contentStreamWriter, SET_NON_STROKING_RGB, replacedColorValues);
else {
//TODO
}
currentlyReplacedColor = null;
}
super.write(contentStreamWriter, operator, operands);
before = null;
after = null;
}
PDColor currentlyReplacedColor = null;
final List<String> TEXT_SHOWING_OPERATORS = Arrays.asList("Tj", "'", "\"", "TJ");
final Operator SET_NON_STROKING_CMYK = Operator.getOperator("k");
final Operator SET_NON_STROKING_RGB = Operator.getOperator("rg");
final Operator SET_NON_STROKING_GRAY = Operator.getOperator("g");
final List<COSBase> GRAY_BLACK_VALUES = Arrays.asList(COSInteger.ZERO);
#Override
protected void showText(byte[] string) throws IOException {
Matrix ctm = getGraphicsState().getCurrentTransformationMatrix();
if (before == null)
before = getTextMatrix().multiply(ctm);
super.showText(string);
after = getTextMatrix().multiply(ctm);
}
Matrix before = null;
Matrix after = null;
boolean isInAreas() {
return isInAreas(before) || isInAreas(after);
}
boolean isInAreas(Matrix m) {
return m != null && areas.stream().anyMatch(rect -> rect.contains(m.getTranslateX(), m.getTranslateY()));
}
};
editor.processPage(page);
}
document.save("WithoutSomeAnnotation-withBlackTextThere.pdf");

Related

Using Lucene's highlighting, getting too much highlighted, is there a workaround for this?

I am using the highlighting feature of Lucene to isolate matching terms for my query, but some of the matched terms are excessive.
I have some simple test cases which are delivered in an Ant project (download details below).
Materials
You can download the test case here: mydemo_with_libs.zip
That archive includes the Lucene 8.6.3 libraries which my test uses; if you prefer a copy without the JAR files you can download that from here: mydemo_without_libs.zip
The necessary libraries are: core, analyzers, queries, queryparser, highlighter, and memory.
You can run the test case by unzipping the archive into an empty directory and running the Ant command ant synsearch
Input
I have provided a short synonym list which is used for indexing and analysing in the highlighting methods:
cope,manage
jobs,tasks
simultaneously,at once
and there is one document being indexed:
Queues are a useful way of grouping jobs together in order to manage a number of them at once. You can:
hold or release multiple jobs at the same time;
group multiple tasks (for the same event);
control the priority of jobs in the queue;
Eventually log all events that take place in a queue.
Use either job.queue or task.queue in specifications.
Process
When building the index I am storing the text field, and using a custom analyzer. This is because (in the real world) the content I am indexing is technical documentation, so stripping out punctuation is inappropriate because so much of it may be significant in technical expressions. My analyzer uses a TechTokenFilter which breaks the stream up into tokens consisting of strings of words or digits, or individual characters which don't match the previous pattern.
Here's the relevant code for the analyzer:
public class MyAnalyzer extends Analyzer {
public MyAnalyzer(String synlist) {
if (synlist != "") {
this.synlist = synlist;
this.useSynonyms = true;
}
}
public MyAnalyzer() {
this.useSynonyms = false;
}
#Override
protected TokenStreamComponents createComponents(String fieldName) {
WhitespaceTokenizer src = new WhitespaceTokenizer();
TokenStream result = new TechTokenFilter(new LowerCaseFilter(src));
if (useSynonyms) {
result = new SynonymGraphFilter(result, getSynonyms(synlist), Boolean.TRUE);
result = new FlattenGraphFilter(result);
}
return new TokenStreamComponents(src, result);
}
and here's my filter:
public class TechTokenFilter extends TokenFilter {
private final CharTermAttribute termAttr;
private final PositionIncrementAttribute posIncAttr;
private final ArrayList<String> termStack;
private AttributeSource.State current;
private final TypeAttribute typeAttr;
public TechTokenFilter(TokenStream tokenStream) {
super(tokenStream);
termStack = new ArrayList<>();
termAttr = addAttribute(CharTermAttribute.class);
posIncAttr = addAttribute(PositionIncrementAttribute.class);
typeAttr = addAttribute(TypeAttribute.class);
}
#Override
public boolean incrementToken() throws IOException {
if (this.termStack.isEmpty() && input.incrementToken()) {
final String currentTerm = termAttr.toString();
final int bufferLen = termAttr.length();
if (bufferLen > 0) {
if (termStack.isEmpty()) {
termStack.addAll(Arrays.asList(techTokens(currentTerm)));
current = captureState();
}
}
}
if (!this.termStack.isEmpty()) {
String part = termStack.remove(0);
restoreState(current);
termAttr.setEmpty().append(part);
posIncAttr.setPositionIncrement(1);
return true;
} else {
return false;
}
}
public static String[] techTokens(String t) {
List<String> tokenlist = new ArrayList<String>();
String[] tokens;
StringBuilder next = new StringBuilder();
String token;
char minus = '-';
char underscore = '_';
char c, prec, subc;
// Boolean inWord = false;
for (int i = 0; i < t.length(); i++) {
prec = i > 0 ? t.charAt(i - 1) : 0;
c = t.charAt(i);
subc = i < (t.length() - 1) ? t.charAt(i + 1) : 0;
if (Character.isLetterOrDigit(c) || c == underscore) {
next.append(c);
// inWord = true;
}
else if (c == minus && Character.isLetterOrDigit(prec) && Character.isLetterOrDigit(subc)) {
next.append(c);
} else {
if (next.length() > 0) {
token = next.toString();
tokenlist.add(token);
next.setLength(0);
}
if (Character.isWhitespace(c)) {
// shouldn't be possible because the input stream has been tokenized on
// whitespace
} else {
tokenlist.add(String.valueOf(c));
}
// inWord = false;
}
}
if (next.length() > 0) {
token = next.toString();
tokenlist.add(token);
// next.setLength(0);
}
tokens = tokenlist.toArray(new String[0]);
return tokens;
}
}
Examining the index I can see that the index contains the separate terms I expect, including the synonym values. For example the text at the end of the first line has produced the terms
of
them
at , simultaneously
once
.
You
can
:
and the text at the end of the third line has produced the terms
same
event
)
;
When the application performs a search it analyzes the query without using the synonym list (because the synonyms are already in the index), but I have discovered that I need to include the synonym list when analyzing the stored text to identify the matching fragments.
Searches match the correct documents, but the code I have added to identify the matching terms over-performs. I won't show all the search method here, but will focus on the code which lists matched terms:
public static void doSearch(IndexReader reader, IndexSearcher searcher,
Query query, int max, String synList) throws IOException {
SimpleHTMLFormatter htmlFormatter = new SimpleHTMLFormatter("\001", "\002");
Highlighter highlighter = new Highlighter(htmlFormatter, new QueryScorer(query));
Analyzer analyzer;
if (synList != null) {
analyzer = new MyAnalyzer(synList);
} else {
analyzer = new MyAnalyzer();
}
// Collect all the docs
TopDocs results = searcher.search(query, max);
ScoreDoc[] hits = results.scoreDocs;
int numTotalHits = Math.toIntExact(results.totalHits.value);
System.out.println("\nQuery: " + query.toString());
System.out.println("Matches: " + numTotalHits);
// Collect matching terms
HashSet<String> matchedWords = new HashSet<String>();
int start = 0;
int end = Math.min(numTotalHits, max);
for (int i = start; i < end; i++) {
int id = hits[i].doc;
float score = hits[i].score;
Document doc = searcher.doc(id);
String docpath = doc.get("path");
String doctext = doc.get("text");
try {
TokenStream tokens = TokenSources.getTokenStream("text", null, doctext, analyzer, -1);
TextFragment[] frag = highlighter.getBestTextFragments(tokens, doctext, false, 100);
for (int j = 0; j < frag.length; j++) {
if ((frag[j] != null) && (frag[j].getScore() > 0)) {
String match = frag[j].toString();
addMatchedWord(matchedWords, match);
}
}
} catch (InvalidTokenOffsetsException e) {
System.err.println(e.getMessage());
}
System.out.println("matched file: " + docpath);
}
if (matchedWords.size() > 0) {
System.out.println("matched terms:");
for (String word : matchedWords) {
System.out.println(word);
}
}
}
Problem
While the correct documents are selected by these queries, and the fragments chosen for highlighting do contain the query terms, the highlighted pieces in some of the selected fragments extend over too much of the input.
For example, if the query is
+text:event +text:manage
(the first example in the test case) then I would expect to see 'event' and 'manage' in the highlighted list. But what I actually see is
event);
manage
Despite the highlighting process using an analyzer which breaks terms apart and treats punctuation characters as single terms, the highlight code is "hungry" and breaks on whitespace alone.
Similarly if the query is
+text:queeu~1
(my final test case) I would expect to only see 'queue' in the list. But I get
queue.
job.queue
task.queue
queue;
It is so nearly there... but I don't understand why the highlighted pieces are inconsistent with the index, and I don't think I should have to parse the list of matches through yet another filter to produce the correct list of matches.
I would really appreciate any pointers to what I am doing wrong or how I could improve my code to deliver exactly what I need.
Thanks for reading this far!
I managed to get this working by replacing the WhitespaceTokenizer and TechTokenFilter in my analyzer with a PatternTokenizer; the regular expression took a bit of work but once I had it all the matching terms were extracted with pinpoint accuracy.
The replacement analyzer:
public class MyAnalyzer extends Analyzer {
public MyAnalyzer(String synlist) {
if (synlist != "") {
this.synlist = synlist;
this.useSynonyms = true;
}
}
public MyAnalyzer() {
this.useSynonyms = false;
}
private static final String tokenRegex = "(([\\w]+-)*[\\w]+)|[^\\w\\s]";
#Override
protected TokenStreamComponents createComponents(String fieldName) {
PatternTokenizer src = new PatternTokenizer(Pattern.compile(tokenRegex), 0);
TokenStream result = new LowerCaseFilter(src);
if (useSynonyms) {
result = new SynonymGraphFilter(result, getSynonyms(synlist), Boolean.TRUE);
result = new FlattenGraphFilter(result);
}
return new TokenStreamComponents(src, result);
}

Traverse whole PDF and Remove underlines of hyperlinks (annotations) only + iText

I have successfully changed the color of underlines using below link code. Can anyone help me how to remove underlines from PDF, the underlines i have find using below link code.
Traverse whole PDF and change blue color to black ( Change color of underlines as well) + iText
Below is my code that are finding hyperlinks and changing their colors to black. I have to modify this code to remove those underlines.
PdfCanvasEditor editor = new PdfCanvasEditor() {
#Override
protected void write(PdfCanvasProcessor processor, PdfLiteral operator, List<PdfObject> operands)
{
String operatorString = operator.toString();
if (SET_FILL_RGB.equals(operatorString) && operands.size() == 4) {
if (isApproximatelyEqual(operands.get(0), 0) &&
isApproximatelyEqual(operands.get(1), 0) &&
isApproximatelyEqual(operands.get(2), 1)) {
super.write(processor, new PdfLiteral("g"), Arrays.asList(new PdfNumber(0), new PdfLiteral("g")));
return;
}
}
if (SET_STROKE_RGB.equals(operatorString) && operands.size() == 4) {
if (isApproximatelyEqual(operands.get(0), 0) &&
isApproximatelyEqual(operands.get(1), 0) &&
isApproximatelyEqual(operands.get(2), 1)) {
super.write(processor, new PdfLiteral("G"), Arrays.asList(new PdfNumber(0), new PdfLiteral("G")));
return;
}
}
super.write(processor, operator, operands);
}
boolean isApproximatelyEqual(PdfObject number, float reference) {
return number instanceof PdfNumber && Math.abs(reference - ((PdfNumber)number).floatValue()) < 0.01f;
}
final String SET_FILL_RGB = "rg";
final String SET_STROKE_RGB = "RG";
};
for (int i = 1; i <= pdfDocument.getNumberOfPages(); i++) {
editor.editPage(pdfDocument, i);
}
Edited:
Accepted answer is not working for below files:
https://raad-dev-test.s3.ap-south-1.amazonaws.com/36/2019-08-30/021549Orig1s025_aprepitant_clinpharm_prea_Mac.pdf (Page 41)
https://raad-dev-test.s3.ap-south-1.amazonaws.com/36/2019-08-30/400_206494S5_avibactam_and_ceftazidine_unireview_prea_Mac.pdf (Page 60).
Please Help.
As described in a comment in the context of the referenced question
it is easy to make the editor class above remove vector graphics by replacing fill or stroke instructions by instructions dropping the current path without drawing it. If only doing so in case of the applicable current color being blue, that would likely do the job in case of your example PDFs. But beware, in documents with other graphics with blue elements (e.g. logos), these would be mutilated, too.
This is what the following content editor does:
class PdfGraphicsRemoverByColor extends PdfCanvasEditor {
public PdfGraphicsRemoverByColor(Color color) {
this.color = color;
}
#Override
protected void write(PdfCanvasProcessor processor, PdfLiteral operator, List<PdfObject> operands)
{
String operatorString = operator.toString();
if (color.equals(getGraphicsState().getFillColor())) {
switch (operatorString) {
case "f":
case "f*":
case "F":
operatorString = "n";
break;
case "b":
case "b*":
operatorString = "s";
break;
case "B":
case "B*":
operatorString = "S";
break;
}
}
if (color.equals(getGraphicsState().getStrokeColor())) {
switch (operatorString) {
case "s":
case "S":
operatorString = "n";
break;
case "b":
case "B":
operatorString = "f";
break;
case "b*":
case "B*":
operatorString = "f*";
break;
}
}
operator = new PdfLiteral(operatorString);
operands.set(operands.size() - 1, operator);
super.write(processor, operator, operands);
}
final Color color;
}
(RemoveGraphicsByColor helper class)
Applied like this:
try ( PdfReader pdfReader = new PdfReader(INPUT);
PdfWriter pdfWriter = new PdfWriter(OUTPUT);
PdfDocument pdfDocument = new PdfDocument(pdfReader, pdfWriter) )
{
PdfCanvasEditor editor = new PdfGraphicsRemoverByColor(ColorConstants.BLUE);
for (int i = 1; i <= pdfDocument.getNumberOfPages(); i++)
{
editor.editPage(pdfDocument, i);
}
}
(RemoveGraphicsByColor tests)
to the example files Control_of_nitrosamine_impurities_in_sartans__rev.pdf, EDQM_reports_issues_of_non-compliance_with_tooth__Mac.pdf, and originalFile.pdf from the referenced question, one gets:
and
and
Beware, this is merely a proof-of-concept, not a final and complete solution. In particular:
Only RGB blue is considered. This might be an issue particularly in case of documents explicitly designed for printing (likely using CMYK colors).
All path fills and strokes are dropped as long as they were blue. Depending on your documents this may have to be filtered.
PdfCanvasEditor only inspects and edits the content stream of the page itself, not the content streams of displayed form XObjects or patterns; thus, some content may not be found. It can be generalized fairly easily.
Different shades of blue from other RGB'ish color spaces
Testing the code above you found documents in which the blue lines were not removed. As it turned out, these blue colors were not from the DeviceRGB standard RGB but instead from ICCBased colorspaces, profiled RGB color spaces to be more exact. Furthermore, in one document not a pure blue 0 0 1 but instead a .17255 .3098 .63529 blue was used.
To also be able to deal with these documents, the approach above must be generalized; e.g. we can use a Predicate<Color> instead of a single, specific Color, e.g. like this:
class PdfGraphicsRemoverByColorPredicate extends PdfCanvasEditor {
public PdfGraphicsRemoverByColorPredicate(Predicate<Color> colorPredicate) {
this.colorPredicate = colorPredicate;
}
#Override
protected void write(PdfCanvasProcessor processor, PdfLiteral operator, List<PdfObject> operands)
{
String operatorString = operator.toString();
if (colorPredicate.test(getGraphicsState().getFillColor())) {
switch (operatorString) {
case "f":
case "f*":
case "F":
operatorString = "n";
break;
case "b":
case "b*":
operatorString = "s";
break;
case "B":
case "B*":
operatorString = "S";
break;
}
}
if (colorPredicate.test(getGraphicsState().getStrokeColor())) {
switch (operatorString) {
case "s":
case "S":
operatorString = "n";
break;
case "b":
case "B":
operatorString = "f";
break;
case "b*":
case "B*":
operatorString = "f*";
break;
}
}
operator = new PdfLiteral(operatorString);
operands.set(operands.size() - 1, operator);
super.write(processor, operator, operands);
}
final Predicate<Color> colorPredicate;
}
(RemoveGraphicsByColor helper class)
Applied like this:
try ( PdfReader pdfReader = new PdfReader(INPUT);
PdfWriter pdfWriter = new PdfWriter(OUTPUT);
PdfDocument pdfDocument = new PdfDocument(pdfReader, pdfWriter) )
{
PdfCanvasEditor editor = new PdfGraphicsRemoverByColorPredicate(RemoveGraphicsByColor::isRgbBlue);
for (int i = 1; i <= pdfDocument.getNumberOfPages(); i++)
{
editor.editPage(pdfDocument, i);
}
}
(RemoveGraphicsByColor testRemoveAllBlueLinesFrom* tests)
to the new example files using this predicate method
public static boolean isRgbBlue(Color color) {
if (color instanceof CalRgb || color instanceof DeviceRgb || (color instanceof IccBased && color.getNumberOfComponents() == 3)) {
float[] components = color.getColorValue();
float r = components[0];
float g = components[1];
float b = components[2];
return b > .5f && r < .9f*b && g < .9f*b;
}
return false;
}
(RemoveGraphicsByColor helper method)
one gets
and
Beware, the warnings from above still apply.

Traverse whole PDF and change blue color to black ( Change color of underlines as well) + iText

I am using below code to remove blue colors from pdf text. It is working fine. But it is not changing underlines color, but changing text color correctly.
original file part:
Manipulated File:
As you see in above manipulated file, underline color didn't change.
I am looking fix for this thing since two weeks, can anyone help on this. Below is my change color code:
public void testChangeBlackTextToGreenDocument(String source, String filename) throws IOException {
try (InputStream resource = getClass().getResourceAsStream(source);
PdfReader pdfReader = new PdfReader(source);
OutputStream result = new FileOutputStream(filename);
PdfWriter pdfWriter = new PdfWriter(result);
PdfDocument pdfDocument = new PdfDocument(pdfReader, pdfWriter);) {
PdfCanvasEditor editor = new PdfCanvasEditor() {
#Override
protected void write(PdfCanvasProcessor processor, PdfLiteral operator, List<PdfObject> operands) {
String operatorString = operator.toString();
if (TEXT_SHOWING_OPERATORS.contains(operatorString)) {
List<PdfObject> listobj = new ArrayList<>();
listobj.add(new PdfNumber(0));
listobj.add(new PdfNumber(0));
listobj.add(new PdfNumber(0));
listobj.add(new PdfLiteral("rg"));
if (currentlyReplacedBlack == null) {
Color currentFillColor =getGraphicsState().getFillColor();
if (ColorConstants.GREEN.equals(currentFillColor) || ColorConstants.CYAN.equals(currentFillColor) || ColorConstants.BLUE.equals(currentFillColor)) {
currentlyReplacedBlack = currentFillColor;
super.write(processor, new PdfLiteral("rg"), listobj);
}
}
} else if (currentlyReplacedBlack != null) {
if (currentlyReplacedBlack instanceof DeviceCmyk) {
List<PdfObject> listobj = new ArrayList<>();
listobj.add(new PdfNumber(0));
listobj.add(new PdfNumber(0));
listobj.add(new PdfNumber(0));
listobj.add(new PdfNumber(0));
listobj.add(new PdfLiteral("k"));
super.write(processor, new PdfLiteral("k"), listobj);
} else if (currentlyReplacedBlack instanceof DeviceGray) {
List<PdfObject> listobj = new ArrayList<>();
listobj.add(new PdfNumber(0));
listobj.add(new PdfLiteral("g"));
super.write(processor, new PdfLiteral("g"), listobj);
} else {
List<PdfObject> listobj = new ArrayList<>();
listobj.add(new PdfNumber(0));
listobj.add(new PdfNumber(0));
listobj.add(new PdfNumber(0));
listobj.add(new PdfLiteral("rg"));
super.write(processor, new PdfLiteral("rg"), listobj);
}
currentlyReplacedBlack = null;
}
super.write(processor, operator, operands);
}
Color currentlyReplacedBlack = null;
final List<String> TEXT_SHOWING_OPERATORS = Arrays.asList("Tj", "'", "\"", "TJ");
};
for (int i = 1; i <= pdfDocument.getNumberOfPages(); i++) {
editor.editPage(pdfDocument, i);
}
}
File file = new File(source);
file.delete();
}
Here is the original file.
https://raad-dev-test.s3.ap-south-1.amazonaws.com/36/2019-08-30/originalFile.pdf
Related Links:
Traverse whole PDF and change some attribute with some object in it using iText
Removing Watermark from PDF iTextSharp
Maven Dependcy Details:
<dependency>
<groupId>com.itextpdf</groupId>
<artifactId>itext7-core</artifactId>
<version>7.1.5</version>
<type>pom</type>
</dependency>
<dependency>
<groupId>com.itextpdf</groupId>
<artifactId>itextpdf</artifactId>
<version>5.0.6</version>
</dependency>
Edited:
Accepted answer is not working for below files:
https://raad-dev-test.s3.ap-south-1.amazonaws.com/36/2019-08-30/021549Orig1s025_aprepitant_clinpharm_prea_Mac.pdf (Page 41)
https://raad-dev-test.s3.ap-south-1.amazonaws.com/36/2019-08-30/400_206494S5_avibactam_and_ceftazidine_unireview_prea_Mac.pdf (Page 60).
Please Help.
(The example code here uses iText 7 for Java. You mentioned neither the iText version nor your programming environment in tags or question text but your example code appears to indicate that this is your combination of choice.)
Replacing blue fill colors
The test you based your original code on attempts explicitly only to change text color. The "underline" in your document, though, is (as far as PDF drawing is concerned) not part of the text but instead drawn as a simple path. Thus, the underline explicitly is not touched by the original code and it has to be adapted for your task.
But actually your task, changing everything blue to black, is easier to implement than only changing the blue text, e.g.
try ( PdfReader pdfReader = new PdfReader(SOURCE_PDF);
PdfWriter pdfWriter = new PdfWriter(RESULT_PDF);
PdfDocument pdfDocument = new PdfDocument(pdfReader, pdfWriter) )
{
PdfCanvasEditor editor = new PdfCanvasEditor()
{
#Override
protected void write(PdfCanvasProcessor processor, PdfLiteral operator, List<PdfObject> operands)
{
String operatorString = operator.toString();
if (SET_FILL_RGB.equals(operatorString) && operands.size() == 4) {
if (isApproximatelyEqual(operands.get(0), 0) &&
isApproximatelyEqual(operands.get(1), 0) &&
isApproximatelyEqual(operands.get(2), 1)) {
super.write(processor, new PdfLiteral("g"), Arrays.asList(new PdfNumber(0), new PdfLiteral("g")));
return;
}
}
super.write(processor, operator, operands);
}
boolean isApproximatelyEqual(PdfObject number, float reference) {
return number instanceof PdfNumber && Math.abs(reference - ((PdfNumber)number).floatValue()) < 0.01f;
}
final String SET_FILL_RGB = "rg";
};
for (int i = 1; i <= pdfDocument.getNumberOfPages(); i++)
{
editor.editPage(pdfDocument, i);
}
}
(ChangeColor test testChangeFillRgbBlueToBlack)
Beware, this is merely a proof-of-concept, not a final and complete solution. In particular:
It merely looks at the fill (non-stroking) colors. In your case that suffices as both your text (as usual) and your underline use fill colors only - the underline actually is not drawn as a stroked line but instead as a slim, filled rectangle.
Only RGB blue (and only such blue set using the rg instruction, not set using sc or scn, let alone blues combined out of other colors using funky blend modes) is considered. This might be an issue particularly in case of documents explicitly designed for printing (likely using CMYK colors).
PdfCanvasEditor only inspects and edits the content stream of the page itself, not the content streams of displayed form XObjects or patterns; thus, some content may not be found. It can be generalized fairly easily.
The result:
Replacing blue fill and stroke colors
Testing the code above you soon found documents in which the underlines were not changed. As it turned out, these underlines are actually drawn as stroked lines, not as filled rectangle as above.
To also properly edit such documents, therefore, you must not only edit the fill colors but also the stroke colors, e.g. like this:
try ( PdfReader pdfReader = new PdfReader(SOURCE_PDF);
PdfWriter pdfWriter = new PdfWriter(RESULT_PDF);
PdfDocument pdfDocument = new PdfDocument(pdfReader, pdfWriter) )
{
PdfCanvasEditor editor = new PdfCanvasEditor()
{
#Override
protected void write(PdfCanvasProcessor processor, PdfLiteral operator, List<PdfObject> operands)
{
String operatorString = operator.toString();
if (SET_FILL_RGB.equals(operatorString) && operands.size() == 4) {
if (isApproximatelyEqual(operands.get(0), 0) &&
isApproximatelyEqual(operands.get(1), 0) &&
isApproximatelyEqual(operands.get(2), 1)) {
super.write(processor, new PdfLiteral("g"), Arrays.asList(new PdfNumber(0), new PdfLiteral("g")));
return;
}
}
if (SET_STROKE_RGB.equals(operatorString) && operands.size() == 4) {
if (isApproximatelyEqual(operands.get(0), 0) &&
isApproximatelyEqual(operands.get(1), 0) &&
isApproximatelyEqual(operands.get(2), 1)) {
super.write(processor, new PdfLiteral("G"), Arrays.asList(new PdfNumber(0), new PdfLiteral("G")));
return;
}
}
super.write(processor, operator, operands);
}
boolean isApproximatelyEqual(PdfObject number, float reference) {
return number instanceof PdfNumber && Math.abs(reference - ((PdfNumber)number).floatValue()) < 0.01f;
}
final String SET_FILL_RGB = "rg";
final String SET_STROKE_RGB = "RG";
};
for (int i = 1; i <= pdfDocument.getNumberOfPages(); i++)
{
editor.editPage(pdfDocument, i);
}
}
(ChangeColor tests testChangeRgbBlueToBlackControlOfNitrosamineImpuritiesInSartansRev and testChangeRgbBlueToBlackEdqmReportsIssuesOfNonComplianceWithToothMac)
The results:
and
Replacing different shades of blue from other RGB'ish color spaces
Testing the code above you again found documents in which the blue colors were not changed. As it turned out, these blue colors were not from the DeviceRGB standard RGB but instead from ICCBased colorspaces, profiled RGB color spaces to be more exact. In particular other color setting operators were used than before, sc / scn instead of rg. Furthermore, in one document not a pure blue 0 0 1 but instead a .17255 .3098 .63529 blue was used
If we assume that sc and scn instructions with three numeric arguments set some flavor of RGB colors as here (in general this is an oversimplification, Lab and other color spaces can also come with 4 components, but your documents seem RGB oriented) and are less strict in recognizing the blue color, we can generalize the code above as follows:
class AllRgbBlueToBlackConverter extends PdfCanvasEditor {
#Override
protected void write(PdfCanvasProcessor processor, PdfLiteral operator, List<PdfObject> operands)
{
String operatorString = operator.toString();
if (RGB_SETTER_CANDIDATES.contains(operatorString) && operands.size() == 4) {
if (isBlue(operands.get(0), operands.get(1), operands.get(2))) {
PdfNumber number0 = new PdfNumber(0);
operands.set(0, number0);
operands.set(1, number0);
operands.set(2, number0);
}
}
super.write(processor, operator, operands);
}
boolean isBlue(PdfObject red, PdfObject green, PdfObject blue) {
if (red instanceof PdfNumber && green instanceof PdfNumber && blue instanceof PdfNumber) {
float r = ((PdfNumber)red).floatValue();
float g = ((PdfNumber)green).floatValue();
float b = ((PdfNumber)blue).floatValue();
return b > .5f && r < .9f*b && g < .9f*b;
}
return false;
}
final Set<String> RGB_SETTER_CANDIDATES = new HashSet<>(Arrays.asList("rg", "RG", "sc", "SC", "scn", "SCN"));
}
(ChangeColor helper class)
Used like this
try ( PdfReader pdfReader = new PdfReader(INPUT);
PdfWriter pdfWriter = new PdfWriter(OUTPUT);
PdfDocument pdfDocument = new PdfDocument(pdfReader, pdfWriter) ) {
PdfCanvasEditor editor = new AllRgbBlueToBlackConverter();
for (int i = 1; i <= pdfDocument.getNumberOfPages(); i++)
{
editor.editPage(pdfDocument, i);
}
}
we get
and

How to force Mobile Vision for Android to read full lines of text

I have implemented Google's Mobile Vision for Android by following a tutorial. I am trying to build an app that will scan a receipt and find the numeric total. However, as I scan different receipts that are printed in different formats, the API will detect TextBlocks in what seems to be an arbitrary way. For example, in one receipt, if several words of text are separated by single spaces, then they are grouped into a single TextBlock. However, if two words of text are separated by lots of spaces, then they are separated as independent TextBlocks, even though they appear on the same "line". What I am trying to do is force the API to recognize each entire line of the receipt as a single entity. Is this possible?
public ArrayList<T> getAllGraphicsInRow(float rawY) {
synchronized (mLock) {
ArrayList<T> row = new ArrayList<>();
// Get the position of this View so the raw location can be offset relative to the view.
int[] location = new int[2];
this.getLocationOnScreen(location);
for (T graphic : mGraphics) {
float rawX = this.getWidth();
for (int i=0; i<rawX; i+=10){
if (graphic.contains(i - location[0], rawY - location[1])) {
if(!row.contains(graphic)) {
row.add(graphic);
}
}
}
}
return row;
}
}
This should be in the GraphicOverlay.java file and essentially fetches all the graphics in that row.
public static boolean almostEqual(double a, double b, double eps){
return Math.abs(a-b)<(eps);
}
public static boolean pointAlmostEqual(Point a, Point b){
return almostEqual(a.y,b.y,10);
}
public static boolean cornerPointAlmostEqual(Point[] rect1, Point[] rect2){
boolean almostEqual=true;
for (int i=0; i<rect1.length;i++){
if (!pointAlmostEqual(rect1[i],rect2[i])){
almostEqual=false;
}
}
return almostEqual;
}
private boolean onTap(float rawX, float rawY) {
String priceRegex = "(\\d+[,.]\\d\\d)";
ArrayList<OcrGraphic> graphics = mGraphicOverlay.getAllGraphicsInRow(rawY);
OcrGraphic currentGraphics = mGraphicOverlay.getGraphicAtLocation(rawX,rawY);
if (graphics !=null && currentGraphics!=null) {
List<? extends Text> currentComponents = currentGraphics.getTextBlock().getComponents();
final Pattern pattern = Pattern.compile(priceRegex);
final Pattern pattern1 = Pattern.compile(priceRegex);
TextBlock text = null;
Log.i("text results", "This many in the row: " + Integer.toString(graphics.size()));
ArrayList<Text> combinedComponents = new ArrayList<>();
for (OcrGraphic graphic : graphics) {
if (!graphic.equals(currentGraphics)) {
text = graphic.getTextBlock();
Log.i("text results", text.getValue());
combinedComponents.addAll(text.getComponents());
}
}
for (Text currentText : currentComponents) { // goes through components in the row
final Matcher matcher = pattern.matcher(currentText.getValue()); // looks for
Point[] currentPoint = currentText.getCornerPoints();
for (Text otherCurrentText : combinedComponents) {//Looks for other components that are in the same row
final Matcher otherMatcher = pattern1.matcher(otherCurrentText.getValue()); // looks for
Point[] innerCurrentPoint = otherCurrentText.getCornerPoints();
if (cornerPointAlmostEqual(currentPoint, innerCurrentPoint)) {
if (matcher.find()) { // if you click on the price
Log.i("oh yes", "Item: " + otherCurrentText.getValue());
Log.i("oh yes", "Value: " + matcher.group(1));
itemList.add(otherCurrentText.getValue());
priceList.add(Float.valueOf(matcher.group(1)));
}
if (otherMatcher.find()) { // if you click on the item
Log.i("oh yes", "Item: " + currentText.getValue());
Log.i("oh yes", "Value: " + otherMatcher.group(1));
itemList.add(currentText.getValue());
priceList.add(Float.valueOf(otherMatcher.group(1)));
}
Toast toast = Toast.makeText(this, " Text Captured!" , Toast.LENGTH_SHORT);
toast.show();
}
}
}
return true;
}
return false;
}
This should be in OcrCaptureActivity.java and it breaks up the TextBlock into lines and finds the blocks in the same row as the line and checks if the components are all prices, and prints all value accordingly.
The eps value in almostEqual is the tolerance for how tall it checks for graphics in the row.

Remove underlines from text in PDF file

I have a bunch of PDF files with broken links.
I need to remove those links and right now I can do the following:
Remove link actions
Change text color from blue to black
What I can't do is to remove blue underlines below text that was a link before.
I tried several PDF libraries for .NET (because this is my primary platform)
Aspost.PDF
PDFSharp
ceTe DynamicPDF
PDFBox
You are welcone to recommend solution on any prograning language, platform and library. I just need to do this.
In case of the sample document the underlines are drawn as blue (RGB 0,0,1) filled vector graphics rectangles (long, slim ones). As blue only is used for the links, we can use that criterion to find the rectangles in question.
Here a sample implementation using PDFBox 1.8.10:
void removeBlueRectangles(PDDocument document) throws IOException
{
List<?> pages = document.getDocumentCatalog().getAllPages();
for (int i = 0; i < pages.size(); i++)
{
PDPage page = (PDPage) pages.get(i);
PDStream contents = page.getContents();
PDFStreamParser parser = new PDFStreamParser(contents.getStream());
parser.parse();
List<Object> tokens = parser.getTokens();
Stack<Boolean> blueState = new Stack<Boolean>();
blueState.push(false);
for (int j = 0; j < tokens.size(); j++)
{
Object next = tokens.get(j);
if (next instanceof PDFOperator)
{
PDFOperator op = (PDFOperator) next;
if (op.getOperation().equals("q"))
{
blueState.push(blueState.peek());
}
else if (op.getOperation().equals("Q"))
{
blueState.pop();
}
else if (op.getOperation().equals("rg"))
{
if (j > 2)
{
Object r = tokens.get(j-3);
Object g = tokens.get(j-2);
Object b = tokens.get(j-1);
if (r instanceof COSNumber && g instanceof COSNumber && b instanceof COSNumber)
{
blueState.pop();
blueState.push((
Math.abs(((COSNumber)r).floatValue() - 0) < 0.001 &&
Math.abs(((COSNumber)g).floatValue() - 0) < 0.001 &&
Math.abs(((COSNumber)b).floatValue() - 1) < 0.001));
}
}
}
else if (op.getOperation().equals("f"))
{
if (blueState.peek() && j > 0)
{
Object re = tokens.get(j-1);
if (re instanceof PDFOperator && ((PDFOperator)re).getOperation().equals("re"))
{
tokens.set(j, PDFOperator.getOperator("n"));
}
}
}
}
}
PDStream updatedStream = new PDStream(document);
OutputStream out = updatedStream.createOutputStream();
ContentStreamWriter tokenWriter = new ContentStreamWriter(out);
tokenWriter.writeTokens(tokens);
page.setContents(updatedStream);
}
}
(RemoveUnderlines.java)
original.pdf
Applying this to your first sample file original.pdf
public void testOriginal() throws IOException, COSVisitorException
{
try ( InputStream resourceStream = getClass().getResourceAsStream("original.pdf") )
{
PDDocument document = PDDocument.loadNonSeq(resourceStream, null);
removeBlueRectangles(document);
document.save("original-noBlueRectangles.pdf");
document.close();
}
}
(RemoveUnderlines.java)
results in
1178.pdf
You commented
After testing this on many files I have to say this solution works incorrectly in some cases. For example in for this file (dropbox.com/s/23g54bvt781lb93/1178.pdf?dl=0) it removes the entire content of the page. Keep searching..
So I applyed the code to your new sample file 1178.pdf
public void test1178() throws IOException, COSVisitorException
{
try ( InputStream resourceStream = getClass().getResourceAsStream("1178.pdf") )
{
PDDocument document = PDDocument.loadNonSeq(resourceStream, null);
removeBlueRectangles(document);
document.save(new File(RESULT_FOLDER, "1178-noBlueRectangles.pdf"));
document.close();
}
}
(RemoveUnderlines.java)
which resulted in
So I cannot confirm your claim that the solution works incorrectly; in particular I see that it does not remove the entire content of the page.
As I cannot reproduce your observation, I assume there are additional issues in your setup you have not yet mentioned.