Extracting highlighted content in pdf automatically as images

I have a PDF file in which some text and images are highlighted using the Highlight Text (U) tool. Is there a way to automatically extract all the highlighted content as separate images and save them to a folder? I don't need readable text; I just want all the highlighted content as images. Thanks.

You would need to use a PDF library to iterate through all the annotation objects on each page and check their properties to find the highlight annotations. For each highlight annotation you can then read its position and size (bounding box).
Once you have the list of bounding boxes, render the PDF page to an image format such as PNG, JPEG or TIFF, then clip the rendered image to each bounding box to get the highlighted content. You could use GDI+ or something like LibTIFF for the image handling.
There are various PDF libraries that could do this, including
http://www.quickpdflibrary.com (I consult for QuickPDF) or
http://www.itextpdf.com
Here is a C# function based on Quick PDF Library that does what you need.
private void ExtractAnnots_Click(object sender, EventArgs e)
{
    int dpi = 300;
    List<Rectangle> annotList = new List<Rectangle>();
    QP.LoadFromFile("samplefile.pdf", "");
    for (int p = 1; p <= QP.PageCount(); p++)
    {
        QP.SelectPage(p); // Select the current page.
        QP.SetOrigin(1);  // Set origin to top left.
        annotList.Clear();
        for (int i = 1; i <= QP.AnnotationCount(); i++)
        {
            if (QP.GetAnnotStrProperty(i, 101) == "Highlight")
            {
                // Convert the bounding box from points (1/72 inch) to pixels at the render resolution.
                Rectangle r = new Rectangle(
                    (int)(QP.GetAnnotDblProperty(i, 105) * dpi / 72.0),  // x
                    (int)(QP.GetAnnotDblProperty(i, 106) * dpi / 72.0),  // y
                    (int)(QP.GetAnnotDblProperty(i, 107) * dpi / 72.0),  // w
                    (int)(QP.GetAnnotDblProperty(i, 108) * dpi / 72.0)); // h
                annotList.Add(r); // Add the bounding box to the annotation list for this page.
                string s = String.Format("page={0}: x={1} y={2} w={3} h={4}\n", p, r.X, r.Y, r.Width, r.Height);
                OutputTxt.AppendText(s);
            }
        }
        // Now we have a list of annotations for the current page.
        // Delete the annotations from the PDF in memory so we don't render them.
        // Indices are 1-based (as in the loop above); delete from the end so the rest stay valid.
        for (int i = QP.AnnotationCount(); i >= 1; i--)
            QP.DeleteAnnotation(i);
        QP.RenderPageToFile(dpi, p, 0, "page.bmp"); // 300 dpi, 0=bmp
        using (Bitmap bmp = Image.FromFile("page.bmp") as Bitmap)
        {
            for (int i = 0; i < annotList.Count; i++)
            {
                // Crop the rendered page to the annotation's bounding box and save it.
                using (Bitmap cropped = bmp.Clone(annotList[i], bmp.PixelFormat))
                {
                    string filename = String.Format("annot_p{0}_{1}.bmp", p, i + 1);
                    cropped.Save(filename);
                }
            }
        } // Disposing the bitmap releases page.bmp before the next page is rendered.
    }
    QP.RemoveDocument(QP.SelectedDocument());
}
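The point-to-pixel conversion and the cropping step above do not depend on any particular PDF library, so they can be tried on their own. The following is a minimal Java sketch using only the JDK. The class and method names (HighlightCrop, toPixels, crop) are hypothetical. It assumes the bounding box is given in PDF points with the origin at the bottom-left of the page, which is the PDF default when the library does not move the origin as SetOrigin(1) does above, so the y axis has to be flipped. It also clamps the rectangle to the image, since a box that reaches past the page edge would otherwise make the crop fail.

```java
import java.awt.Rectangle;
import java.awt.image.BufferedImage;

public class HighlightCrop {
    /**
     * Converts a box given in PDF points (1/72 inch, origin bottom-left)
     * into pixel coordinates of a page image rendered at the given DPI
     * (origin top-left). pageHeightPt is the page height in points.
     */
    public static Rectangle toPixels(double llx, double lly, double urx, double ury,
                                     double pageHeightPt, int dpi) {
        int x = (int) Math.floor(llx * dpi / 72.0);
        int y = (int) Math.floor((pageHeightPt - ury) * dpi / 72.0); // flip the y axis
        int w = (int) Math.ceil((urx - llx) * dpi / 72.0);
        int h = (int) Math.ceil((ury - lly) * dpi / 72.0);
        return new Rectangle(x, y, w, h);
    }

    /** Crops the page image to r, clamped to the image bounds; returns null if r lies outside. */
    public static BufferedImage crop(BufferedImage page, Rectangle r) {
        Rectangle clipped = r.intersection(new Rectangle(0, 0, page.getWidth(), page.getHeight()));
        if (clipped.isEmpty()) {
            return null;
        }
        return page.getSubimage(clipped.x, clipped.y, clipped.width, clipped.height);
    }

    public static void main(String[] args) {
        // A US Letter page (612 x 792 pt) rendered at 300 dpi is 2550 x 3300 px.
        BufferedImage page = new BufferedImage(2550, 3300, BufferedImage.TYPE_INT_RGB);
        Rectangle r = toPixels(72, 720, 144, 756, 792, 300);
        BufferedImage cropped = crop(page, r);
        System.out.println(cropped.getWidth() + " x " + cropped.getHeight()); // prints "300 x 150"
    }
}
```

Rounding the origin down and the size up keeps the whole highlight inside the crop at the cost of at most one extra pixel per edge.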

Do you want each piece of text as a separate highlight or all the highlights on a separate pane?

Related

How do I extract viewport from a pdf and modify an annotation's bounding rectangle according to the viewport?

I have implemented functionality to add link annotations to any PDF using PDFBox. It works well for most PDFs, but for some PDFs it does not place the markups at the correct coordinates. When I opened such a PDF in a PDF editor, it warned me that the PDF contains an untitled viewport which might affect measurements for that PDF. So I suspect the viewport is causing the problem. Is there a way to adjust the markup coordinates according to the viewport so that they are placed at the correct location in the PDF? Here is a link to a PDF which contains the viewport.
Following Tilman's suggestion, I extracted the C entry from the viewport's measure dictionary and tried to modify the rectangle's coordinates, but the markups are still not added at the right location. Below is the code that I tried. Also, the viewport does not affect annotations, but it causes problems when I try to draw something into the PDF.
COSArray vps = (COSArray)page.getCOSObject().getDictionaryObject(COSName.getPDFName("VP"));
if (vps != null)
{
    for (int v = 0; v < vps.size(); ++v)
    {
        COSDictionary vp = (COSDictionary)vps.getObject(v);
        PDViewportDictionary viewportDict = new PDViewportDictionary(vp);
        PDRectangle vpRect = viewportDict.getBBox();
        PDMeasureDictionary measureDict = viewportDict.getMeasure();
        PDRectlinearMeasureDictionary rectilinearDict = new PDRectlinearMeasureDictionary(measureDict.getCOSObject());
        bool pointLieInVP = UtilityClass.RectangleContainsPoint(new PointF(leftX, bottomY), vpRect);
        if (pointLieInVP)
        {
            COSArray xArray = (COSArray)measureDict.getCOSObject().getDictionaryObject(COSName.getPDFName("X"));
            float xScale = 1;
            if (xArray != null)
            {
                xScale = ((COSFloat)(((COSDictionary)xArray.getObject(0)).getDictionaryObject(COSName.getPDFName("C")))).floatValue();
            }
            leftX /= xScale;
            rightX /= xScale;
            COSBase yObj = measureDict.getCOSObject().getDictionaryObject(COSName.getPDFName("Y"));
            if (yObj != null)
            {
                COSArray yArray = (COSArray)yObj;
                float yScale = ((COSFloat)(((COSDictionary)yArray.getObject(0)).getDictionaryObject(COSName.getPDFName("C")))).floatValue();
                bottomY /= yScale;
                topY /= yScale;
            }
            else
            {
                bottomY /= xScale;
                topY /= xScale;
            }
        }
    }
}
Here is the link to the PDF in which the markups are added without adjusting for viewports. The 5 red markups are added at the bottom right of the page, but they should have been placed over the link annotations in the PDF, which are at the correct positions. And here is the link to the PDF in which the markups are placed after modifying their coordinates using the above code. The markups do not appear at all.

This code (which does not avoid ClassCastExceptions) will show you the viewports in each page:
try (PDDocument doc = PDDocument.load(new File("S115-STRUCTURALHIGH ROOF FRAMING(WEST)ENLARGED PLANS.pdf")))
{
    for (int p = 0; p < doc.getNumberOfPages(); ++p)
    {
        PDPage page = doc.getPage(p);
        COSArray vps = (COSArray) page.getCOSObject().getDictionaryObject(COSName.getPDFName("VP"));
        if (vps != null)
        {
            for (int v = 0; v < vps.size(); ++v)
            {
                COSDictionary vp = (COSDictionary) vps.getObject(v);
                PDRectangle rect = new PDRectangle((COSArray) vp.getDictionaryObject(COSName.BBOX));
                System.out.println("Viewport " + vp.getString(COSName.NAME) + ": " + rect);
            }
        }
    }
}
How to adjust annotations is up to you... most likely, these should be inside the bbox. All you need to do is to adjust the rectangle of the annotations.

Removing pages from PDF using PDFBox produces bigger PDF than original

I need to extract page range from PDF files.
I use the following code for this (using PDFBox v2.0.4):
int startPage = 17;
int endPage = 18;
String fn = "original.pdf";
String resFn = "result.pdf";
PDDocument doc = PDDocument.load(new File(fn), MemoryUsageSetting.setupMixed(1024 * 1024));
int cnt = doc.getNumberOfPages();
for (int i = cnt - 1; i > endPage; i--) {
    doc.removePage(i);
}
for (int i = startPage - 1; i >= 0; i--) {
    doc.removePage(i);
}
doc.save(new FileOutputStream(resFn));
However, for relatively small original files it produces larger result files.
For example, an original.pdf file of 800 KB (22 pages) resulted in a result.pdf file of 1300 KB (just 2 pages).
Can anyone tell me how to make PDFBox create a smaller PDF (or at least one no larger than the original)?

Adobe Illustrator, Save to Swatches Panel with RGB Hex names instead of RGB 0-255 Values

Is there a setting in Adobe Illustrator that would make all saved swatches from the Color Guide save as RGB Hex instead of RGB 0-255 values?
I'm not even sure if this is possible...
It would save a lot of time, allowing me to just double-click the name of each swatch, copy the hex value, and paste it into whatever .css file I'm editing, rather than having to double-click the color, click inside the hex box, and copy it that way. For one-offs that's no big deal, but when dealing with tons of colors, every click adds up time-wise.
Thanks in advance for any suggestions.
Screenshot, showing specifically what I'd like.

/*
Run this script to rename RGB swatches to their corresponding hex values.
For example, 'R=108 G=125 B=87' becomes '#6c7d57'.
Note: the script works with RGB colors only.
Before running the script, select the swatches in Illustrator's Swatches panel.
*/
var myDoc = app.activeDocument;
var selSwatches = myDoc.swatches.getSelected();
for (var i = 0; i < selSwatches.length; i++)
{
    var swcolor = selSwatches[i].color;
    if (swcolor.typename == 'RGBColor')
    {
        selSwatches[i].name = rgbToHex(swcolor.red, swcolor.green, swcolor.blue);
    }
}
// Converts three 0-255 channel values to a '#rrggbb' string; returns null for invalid input.
function rgbToHex(r, g, b)
{
    var hex = '#';
    for (var i = 0; i < 3; ++i)
    {
        var n = typeof arguments[i] == 'number' ? arguments[i] : parseInt(arguments[i], 10);
        if (isNaN(n) || n < 0 || n > 255)
        {
            return null;
        }
        n = Math.round(n); // channel values may be fractional; hex digits need integers
        hex += (n < 16 ? '0' : '') + n.toString(16);
    }
    return hex;
}

Detect Bold, Italic and Strike Through text using PDFBox with VB.NET

Is there a way to preserve the text formatting when extracting a PDF with PDFBox?
I have a program that parses a PDF document for information. When a new version of the PDF is released, the authors use bold or italic text to indicate new information and strike-through or underlined text to indicate omitted text. Using the base stripper class in PDFBox returns all the text, but the formatting is removed, so I have no way of telling whether the text is new or omitted. I'm currently using the project example code below:
Dim doc As PDDocument = Nothing
Try
    doc = PDDocument.load(RFPFilePath)
    Dim stripper As New PDFTextStripper()
    stripper.setAddMoreFormatting(True)
    stripper.setSortByPosition(True)
    rtxt_DocumentViewer.Text = stripper.getText(doc)
Finally
    If doc IsNot Nothing Then
        doc.close()
    End If
End Try
My parsing code works well if I simply copy and paste the PDF text into a RichTextBox, which preserves the formatting. I was thinking of doing this programmatically by opening the PDF, selecting all, copying, closing the document, and then pasting into my RichTextBox, but that seems clunky.

The OP mentioned in a comment that a Java example would do, and I have so far only used PDFBox with Java, so this answer features a Java example. Furthermore, this example has been developed and tested with PDFBox version 1.8.11 only.
A customized text stripper
As already mentioned in a comment,
The bold and italic effects in the OP's sample document are generated by using a different font (containing bold or italic versions of the letters) to draw the text. The underline and strike-through effects in the sample document are generated by drawing a rectangle under / through the text line which has the width of the text line and a very small height. To extract this information, therefore, one has to extend the PDFTextStripper to somehow react to font changes and rectangles near the text.
This is an example class extending the PDFTextStripper in just that way:
// Imports for the PDFBox 1.8.x package layout
import java.awt.geom.Point2D;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import org.apache.pdfbox.cos.COSBase;
import org.apache.pdfbox.cos.COSNumber;
import org.apache.pdfbox.util.Matrix;
import org.apache.pdfbox.util.PDFOperator;
import org.apache.pdfbox.util.PDFTextStripper;
import org.apache.pdfbox.util.TextPosition;
import org.apache.pdfbox.util.operator.OperatorProcessor;

public class PDFStyledTextStripper extends PDFTextStripper
{
    public PDFStyledTextStripper() throws IOException
    {
        super();
        // Collect every rectangle drawn with the "re" operator
        registerOperatorProcessor("re", new AppendRectangleToPath());
    }

    @Override
    protected void writeString(String text, List<TextPosition> textPositions) throws IOException
    {
        for (TextPosition textPosition : textPositions)
        {
            Set<String> style = determineStyle(textPosition);
            // Emit a style marker only when the style changes
            if (!style.equals(currentStyle))
            {
                output.write(style.toString());
                currentStyle = style;
            }
            output.write(textPosition.getCharacter());
        }
    }

    Set<String> determineStyle(TextPosition textPosition)
    {
        Set<String> result = new HashSet<>();
        if (textPosition.getFont().getBaseFont().toLowerCase().contains("bold"))
            result.add("Bold");
        if (textPosition.getFont().getBaseFont().toLowerCase().contains("italic"))
            result.add("Italic");
        if (rectangles.stream().anyMatch(r -> r.underlines(textPosition)))
            result.add("Underline");
        if (rectangles.stream().anyMatch(r -> r.strikesThrough(textPosition)))
            result.add("StrikeThrough");
        return result;
    }

    class AppendRectangleToPath extends OperatorProcessor
    {
        public void process(PDFOperator operator, List<COSBase> arguments)
        {
            COSNumber x = (COSNumber) arguments.get(0);
            COSNumber y = (COSNumber) arguments.get(1);
            COSNumber w = (COSNumber) arguments.get(2);
            COSNumber h = (COSNumber) arguments.get(3);
            double x1 = x.doubleValue();
            double y1 = y.doubleValue();
            // create a pair of coordinates for the transformation
            double x2 = w.doubleValue() + x1;
            double y2 = h.doubleValue() + y1;
            Point2D p0 = transformedPoint(x1, y1);
            Point2D p1 = transformedPoint(x2, y1);
            Point2D p2 = transformedPoint(x2, y2);
            Point2D p3 = transformedPoint(x1, y2);
            rectangles.add(new TransformedRectangle(p0, p1, p2, p3));
        }

        Point2D.Double transformedPoint(double x, double y)
        {
            double[] position = {x, y};
            getGraphicsState().getCurrentTransformationMatrix().createAffineTransform().transform(
                    position, 0, position, 0, 1);
            return new Point2D.Double(position[0], position[1]);
        }
    }

    static class TransformedRectangle
    {
        public TransformedRectangle(Point2D p0, Point2D p1, Point2D p2, Point2D p3)
        {
            this.p0 = p0;
            this.p1 = p1;
            this.p2 = p2;
            this.p3 = p3;
        }

        boolean strikesThrough(TextPosition textPosition)
        {
            Matrix matrix = textPosition.getTextPos();
            // TODO: This is a very simplistic implementation only working for horizontal text without page rotation
            // and horizontal rectangular strikeThroughs with p0 at the left bottom and p2 at the right top
            // Check if rectangle horizontally matches (at least) the text
            if (p0.getX() > matrix.getXPosition() || p2.getX() < matrix.getXPosition() + textPosition.getWidth() - textPosition.getFontSizeInPt() / 10.0)
                return false;
            // Check whether rectangle vertically is at the right height to strike through
            double vertDiff = p0.getY() - matrix.getYPosition();
            if (vertDiff < 0 || vertDiff > textPosition.getFont().getFontDescriptor().getAscent() * textPosition.getFontSizeInPt() / 1000.0)
                return false;
            // Check whether rectangle is small enough to be a line
            return Math.abs(p2.getY() - p0.getY()) < 2;
        }

        boolean underlines(TextPosition textPosition)
        {
            Matrix matrix = textPosition.getTextPos();
            // TODO: This is a very simplistic implementation only working for horizontal text without page rotation
            // and horizontal rectangular underlines with p0 at the left bottom and p2 at the right top
            // Check if rectangle horizontally matches (at least) the text
            if (p0.getX() > matrix.getXPosition() || p2.getX() < matrix.getXPosition() + textPosition.getWidth() - textPosition.getFontSizeInPt() / 10.0)
                return false;
            // Check whether rectangle vertically is at the right height to underline
            double vertDiff = p0.getY() - matrix.getYPosition();
            if (vertDiff > 0 || vertDiff < textPosition.getFont().getFontDescriptor().getDescent() * textPosition.getFontSizeInPt() / 500.0)
                return false;
            // Check whether rectangle is small enough to be a line
            return Math.abs(p2.getY() - p0.getY()) < 2;
        }

        final Point2D p0, p1, p2, p3;
    }

    final List<TransformedRectangle> rectangles = new ArrayList<>();
    Set<String> currentStyle = Collections.singleton("Undefined");
}
(PDFStyledTextStripper.java)
In addition to what the PDFTextStripper does, this class also:
collects rectangles from the content (defined using the re instruction) using an instance of the AppendRectangleToPath operator processor inner class,
checks text for the style variants from the sample document in determineStyle, and
whenever the style changes, adds the new style to the result in writeString.
Beware: This merely is a proof of concept! In particular:
the implementations of the tests in TransformedRectangle.underlines(TextPosition) and TransformedRectangle.strikesThrough(TextPosition) are very simplistic and only work for horizontal text without page rotation and horizontal rectangular strikeThroughs and underlines with p0 at the left bottom and p2 at the right top;
all rectangles are collected, not checking whether they actually are filled with a visible color;
the tests for "bold" and "italic" merely inspect the name of the used font which may not suffice in general.
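To make the geometric test easier to experiment with, here is a minimal sketch of the same idea without any PDFBox types. It is not the code of the class above: the names (LineStyleCheck, classify) are hypothetical, the thresholds are simplified (no tolerance on the width, and the descent is used as-is rather than doubled), and it makes the same assumptions as the proof of concept, namely horizontal text, no page rotation, and y growing upwards.

```java
public class LineStyleCheck {
    public enum Style { UNDERLINE, STRIKE_THROUGH, NONE }

    /**
     * Classifies a rectangle relative to a run of text. All values share one
     * unit (for example points). ascent is positive and descent is negative,
     * both already scaled to the font size. rectX/rectY is the lower-left corner.
     */
    public static Style classify(double textX, double textWidth, double baselineY,
                                 double ascent, double descent,
                                 double rectX, double rectY, double rectW, double rectH) {
        // Only thin rectangles can be lines (same "< 2" limit as the proof of concept).
        boolean thin = Math.abs(rectH) < 2;
        // The rectangle must span at least the full width of the text.
        boolean coversText = rectX <= textX && rectX + rectW >= textX + textWidth;
        if (!thin || !coversText) {
            return Style.NONE;
        }
        double offset = rectY - baselineY; // lower edge relative to the baseline
        if (offset > 0 && offset <= ascent) {
            return Style.STRIKE_THROUGH;   // between baseline and ascent
        }
        if (offset <= 0 && offset >= descent) {
            return Style.UNDERLINE;        // between descent and baseline
        }
        return Style.NONE;
    }

    public static void main(String[] args) {
        // Text at x=100, 50 wide, baseline y=700, ascent 8, descent -3.
        System.out.println(classify(100, 50, 700, 8, -3, 100, 698.5, 50, 0.5)); // prints UNDERLINE
        System.out.println(classify(100, 50, 700, 8, -3, 100, 703.0, 50, 0.5)); // prints STRIKE_THROUGH
    }
}
```

Keeping the check free of library types makes it easy to try other thresholds against a handful of coordinates taken from a real document.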
A test output
Using the PDFStyledTextStripper like this
String extractStyled(PDDocument document) throws IOException
{
    PDFTextStripper stripper = new PDFStyledTextStripper();
    stripper.setSortByPosition(true);
    return stripper.getText(document);
}
(from ExtractText.java, called from the test method testExtractStyledFromExampleDocument)
one gets the result
[]This is an example of plain text
[Bold]This is an example of bold text
[]
[Underline]This is an example of underlined text[]
[Italic]This is an example of italic text
[]
[StrikeThrough]This is an example of strike through text[]
[Italic, Bold]This is an example of bold, italic text
for the OP's sample document
PS The code of the PDFStyledTextStripper meanwhile has been slightly changed to also work for a sample document shared in a github issue, in particular the code of its inner class TransformedRectangle, cf. here.

RSA encryption and decryption of an image using Java NetBeans

I'm working on an application that encrypts and decrypts a selected region of an image using the RSA algorithm. Everything works well except that some pixels behave strangely, and I can't understand why. I use the same parameters to encrypt, decrypt and save the image, and yet when I create the new image and read the pixels in the encrypted region, I don't get the pixel values that my program showed me before.
File img = new File(Path);
bf1 = ImageIO.read(img);
marchdanslImage(bf1, captureRect); // only the selected rectangle (captureRect) of the image is processed
// the function called above
private void marchdanslImage(BufferedImage image, Rectangle REC) throws IOException {
    bf2 = new BufferedImage(REC.width, REC.height, BufferedImage.TYPE_INT_RGB); // will hold the pixels after encryption
    for (int i = y; i < h; i++) {
        for (int j = x; j < w; j++) {
            int pixel = image.getRGB(j, i); // original values
            printPixelARGB(pixel, j, i); // encrypts or decrypts the pixel
            bf2.setRGB(j - x, i - y, rgb); // new values
        }
    }
}
the code of the function printPixelARGB:
public void printPixelARGB(int pixel, int i, int j) {
    r[i][j] = (pixel >> 16) & 0xff; // original values
    rr[i][j] = RSA_crypt_decrypt(r[i][j], appel); // values after treatment
    g[i][j] = (pixel >> 8) & 0xff;
    gg[i][j] = RSA_crypt_decrypt(g[i][j], appel);
    b[i][j] = (pixel) & 0xff;
    bb[i][j] = RSA_crypt_decrypt(b[i][j], appel);
    rgb = rr[i][j]; // new values packed into rgb, to be set in bf2
    rgb = (rgb << 8) + gg[i][j];
    rgb = (rgb << 8) + bb[i][j];
}
and finally to save my work:
public void save_image()
{
    Graphics2D g;
    g = (Graphics2D) bf1.getGraphics();
    g.drawImage(bf2, captureRect.x, captureRect.y, captureRect.width, captureRect.height, null);
    g.dispose();
    // the encrypted pixels are drawn onto the original image, which is then saved as a new image
    File outputFile = new File("C:/USERS/HP/DesKtop/output.jpg");
    try {
        ImageIO.write(bf1, "jpg", outputFile);
    } catch (IOException ex) {
        Logger.getLogger(MenuGenerale2.class.getName()).log(Level.SEVERE, null, ex);
    }
}
So far everything works, but when I open the image I created and try to decrypt it, the values I get after treatment are not the same!
Is it because of the saving part? When I try it on a white image it does not work correctly, but on another image it does not work at all! I have been stuck on this problem for 3 weeks... I really need help.
Here is the link to my application:
www.fichier-rar.fr/2016/04/23/cryptagersa11/

The problem is that you are saving the image with JPEG compression. JPEG is a lossy compression: it does not preserve pixel data exactly, so the encrypted values you read back differ from the ones you wrote.
If you used a lossless format such as BMP or PNG, the problem would not happen.
You might want to investigate steganography, although I suspect it is the opposite of what you want to achieve.
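The effect can be checked with a small round-trip test. The following is a minimal sketch using only the JDK; the class and method names (LossyRoundTrip, roundTrip, countChangedPixels, noise) are hypothetical, and random pixel values are used as a stand-in for encrypted data. It writes the same image to memory as PNG and as JPEG, reads it back, and counts how many pixels changed.

```java
import java.awt.image.BufferedImage;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Random;
import javax.imageio.ImageIO;

public class LossyRoundTrip {
    /** Writes the image to memory in the given format ("png", "jpg", ...) and reads it back. */
    public static BufferedImage roundTrip(BufferedImage img, String format) {
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            if (!ImageIO.write(img, format, out)) {
                throw new IOException("No image writer found for format " + format);
            }
            return ImageIO.read(new ByteArrayInputStream(out.toByteArray()));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Counts the pixels whose RGB value differs between two images of equal size. */
    public static int countChangedPixels(BufferedImage a, BufferedImage b) {
        int changed = 0;
        for (int y = 0; y < a.getHeight(); y++) {
            for (int x = 0; x < a.getWidth(); x++) {
                if ((a.getRGB(x, y) & 0xFFFFFF) != (b.getRGB(x, y) & 0xFFFFFF)) {
                    changed++;
                }
            }
        }
        return changed;
    }

    /** Builds an image of random pixels, standing in for encrypted pixel values. */
    public static BufferedImage noise(int width, int height, long seed) {
        Random random = new Random(seed);
        BufferedImage img = new BufferedImage(width, height, BufferedImage.TYPE_INT_RGB);
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                img.setRGB(x, y, random.nextInt(0x1000000));
            }
        }
        return img;
    }

    public static void main(String[] args) {
        BufferedImage original = noise(64, 64, 42L);
        System.out.println("PNG changed pixels: " + countChangedPixels(original, roundTrip(original, "png")));
        System.out.println("JPEG changed pixels: " + countChangedPixels(original, roundTrip(original, "jpg")));
    }
}
```

The PNG count is 0 because PNG is lossless, while the JPEG count is expected to be large for noise-like data. In the code from the question, passing "png" instead of "jpg" to ImageIO.write, with a matching file extension, would be the corresponding change.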
