Removing pages from PDF using PDFBox produces bigger PDF than original

I need to extract a page range from PDF files.
I use the following code for this (PDFBox v2.0.4):
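// As written, the loops keep the pages at zero-based indices startPage..endPage.
int startPage = 17;
int endPage = 18;
String fn = "original.pdf";
String resFn = "result.pdf";
PDDocument doc = PDDocument.load(new File(fn), MemoryUsageSetting.setupMixed(1024 * 1024));
int cnt = doc.getNumberOfPages();
// Remove the pages after the range, back to front so the indices stay valid.
for (int i = cnt - 1; i > endPage; i--) {
    doc.removePage(i);
}
// Remove the pages before the range.
for (int i = startPage - 1; i >= 0; i--) {
    doc.removePage(i);
}
doc.save(new FileOutputStream(resFn));
doc.close();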
However, even for relatively small original files it produces larger result files.
For example, an 800 KB original.pdf (22 pages) resulted in a 1300 KB result.pdf (just 2 pages).
Can anyone tell me how to make PDFBox create a smaller PDF (or at least one no larger than the original)?

Related

Migradoc: Avoid a page break in case of merged rows

When there isn't enough space left on a page, merged rows (cells) in a table are moved to a new page.
How can I prevent this and make sure the table fills the free space on the current page?
Section section = document.AddSection();
Table t5 = new Table();
t5.AddColumn(Unit.FromCentimeter(4));
t5.AddColumn(Unit.FromCentimeter(4));
Row first = t5.AddRow();
first.Cells[0].AddParagraph("Header 1");
first.Cells[1].AddParagraph("Header 2");
for (int j = 0; j < 4; j++)
{
    var rowpd = t5.AddRow();
    rowpd.VerticalAlignment = VerticalAlignment.Center;
    rowpd.Cells[0].MergeDown = 17;
    rowpd.Cells[0].AddParagraph("Merged 18 cells. ");
    for (int i = 0; i < 18; i++)
    {
        if (i == 0)
        {
            rowpd.Cells[1].AddParagraph($"value {i}");
        }
        else
        {
            var row = t5.AddRow();
            row.Cells[1].AddParagraph($"value {i}");
        }
    }
}
document.LastSection.Add(t5);

MigraDoc does not (yet) split cells; it only breaks pages between cells. With MergeDown you create one huge cell that will not split.
Option: avoid MergeDown and use many small cells in the left column without horizontal borders. This gives a similar visual effect, but with page breaks where you expect them. Depending on the text in the left column, this may or may not be an option.

RSA encryption and decryption of an image using Java (NetBeans)

I'm working on an application that encrypts and decrypts a selected region of an image using the RSA algorithm. Everything works well, except that some pixels behave strangely and I can't understand why. I use the same parameters to encrypt, decrypt and save the image, and yet, when I create the new image and read the pixels in the encrypted zone, I don't get the pixel values that my program showed me before.
File img = new File(Path);
bf1 = ImageIO.read(img);
marchdanslImage(bf1, captureRect); // only the selected rectangle (captureRect) of the image is processed

// The function called above
private void marchdanslImage(BufferedImage image, Rectangle REC) throws IOException {
    bf2 = new BufferedImage(REC.width, REC.height, BufferedImage.TYPE_INT_RGB); // will hold the pixels after encryption
    for (int i = y; i < h; i++) {
        for (int j = x; j < w; j++) {
            int pixel = image.getRGB(j, i); // original values
            printPixelARGB(pixel, j, i); // encrypts or decrypts the pixel
            bf2.setRGB(j - x, i - y, rgb); // new values
        }
    }
}
The code of the function printPixelARGB:
public void printPixelARGB(int pixel, int i, int j) {
    r[i][j] = (pixel >> 16) & 0xff; // original values
    rr[i][j] = RSA_crypt_decrypt(r[i][j], appel); // values after treatment
    g[i][j] = (pixel >> 8) & 0xff;
    gg[i][j] = RSA_crypt_decrypt(g[i][j], appel);
    b[i][j] = (pixel) & 0xff;
    bb[i][j] = RSA_crypt_decrypt(b[i][j], appel);
    rgb = rr[i][j]; // new values packed into rgb, to be set in bf2
    rgb = (rgb << 8) + gg[i][j];
    rgb = (rgb << 8) + bb[i][j];
}
And finally, to save my work:
public void save_image()
{
    Graphics2D g;
    g = (Graphics2D) bf1.getGraphics();
    g.drawImage(bf2, captureRect.x, captureRect.y, captureRect.width, captureRect.height, null);
    g.dispose();
    // I draw the encrypted pixels onto my original image and create a new image
    File outputFile = new File("C:/USERS/HP/DesKtop/output.jpg");
    try {
        ImageIO.write(bf1, "jpg", outputFile);
    } catch (IOException ex) {
        Logger.getLogger(MenuGenerale2.class.getName()).log(Level.SEVERE, null, ex);
    }
}
So far everything works, but when I open the image I created and try to decrypt it, the values I get after treatment are not the same!
Is it because of the saving part? When I try it on a white image it does not work correctly, and on other images it does not work at all! I have spent 3 weeks on this problem without solving it. I really need help.
Here is the link of my application:
www.fichier-rar.fr/2016/04/23/cryptagersa11/

The problem is that you are saving the image with JPEG compression. JPEG compression does not preserve data exactly: it is a lossy compression.
If you used a lossless format such as BMP or PNG, the problem would not happen.
You might want to investigate steganography, although I suspect it is the opposite of what you want to achieve.
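The difference between the two kinds of format can be checked with a small round trip. The following is a minimal sketch, not the asker's code: the class RoundTripDemo and its methods are hypothetical, and a pseudo-random image stands in for the encrypted region. It encodes the image in memory with ImageIO, decodes it again, and counts the pixels whose RGB value changed.

```java
import java.awt.image.BufferedImage;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Random;
import javax.imageio.ImageIO;

// Hypothetical demo class, not part of the asker's application.
class RoundTripDemo {

    // Builds an RGB image of pseudo-random pixels, standing in for encrypted data.
    static BufferedImage noiseImage(int width, int height, long seed) {
        Random random = new Random(seed);
        BufferedImage image = new BufferedImage(width, height, BufferedImage.TYPE_INT_RGB);
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                image.setRGB(x, y, random.nextInt(0x1000000));
            }
        }
        return image;
    }

    // Encodes the image in the given format, decodes it again,
    // and counts the pixels whose RGB value differs from the original.
    static int countChangedPixels(BufferedImage original, String format) {
        try {
            ImageIO.setUseCache(false); // keep the round trip in memory
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            if (!ImageIO.write(original, format, buffer)) {
                throw new IOException("No writer available for format " + format);
            }
            BufferedImage decoded = ImageIO.read(new ByteArrayInputStream(buffer.toByteArray()));
            int changed = 0;
            for (int y = 0; y < original.getHeight(); y++) {
                for (int x = 0; x < original.getWidth(); x++) {
                    int before = original.getRGB(x, y) & 0xFFFFFF;
                    int after = decoded.getRGB(x, y) & 0xFFFFFF;
                    if (before != after) {
                        changed++;
                    }
                }
            }
            return changed;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        BufferedImage image = noiseImage(64, 64, 42L);
        System.out.println("PNG changed pixels:  " + countChangedPixels(image, "png"));
        System.out.println("JPEG changed pixels: " + countChangedPixels(image, "jpg"));
    }
}
```

With PNG the count should be zero, while with JPEG many pixels come back altered. In the asker's save_image() this would suggest passing "png" (and a .png file name) to ImageIO.write.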

iTextSharp can't read numbers in this PDF

I'm reading PDFs with iTextSharp 5.5.7.0. PdfTextExtractor.GetTextFromPage() works well for most files, but not for this one: sample PDF
I can't read any digits from it. For example, it returns only 'ANEU' for 'A0NE8U', although the digits copy out fine in Adobe Reader. Here is the code:
public static string ExtractTextFromPdf(string path)
{
    using (PdfReader reader = new PdfReader(path))
    {
        StringBuilder text = new StringBuilder();
        for (int i = 1; i <= reader.NumberOfPages; i++)
        {
            text.Append(PdfTextExtractor.GetTextFromPage(reader, i));
        }
        return text.ToString();
    }
}

The font in question has a ToUnicode map which is used for text extraction. Unfortunately, though, iText(Sharp) reads it only partially, and the digits are mapped in the part that is not read.
In detail:
The cause of the issue is the implementation of AbstractCMap.addRange (I'm showing the iText Java code because iText also has this issue and I'm more familiar with the Java version):
void addRange(PdfString from, PdfString to, PdfObject code) {
    byte[] a1 = decodeStringToByte(from);
    byte[] a2 = decodeStringToByte(to);
    if (a1.length != a2.length || a1.length == 0)
        throw new IllegalArgumentException("Invalid map.");
    byte[] sout = null;
    if (code instanceof PdfString)
        sout = decodeStringToByte((PdfString)code);
    int start = a1[a1.length - 1] & 0xff;
    int end = a2[a2.length - 1] & 0xff;
    for (int k = start; k <= end; ++k) {
        a1[a1.length - 1] = (byte)k;
        PdfString s = new PdfString(a1);
        s.setHexWriting(true);
        if (code instanceof PdfArray) {
            addChar(s, ((PdfArray)code).getPdfObject(k - start));
        }
        else if (code instanceof PdfNumber) {
            int nn = ((PdfNumber)code).intValue() + k - start;
            addChar(s, new PdfNumber(nn));
        }
        else if (code instanceof PdfString) {
            PdfString s1 = new PdfString(sout);
            s1.setHexWriting(true);
            ++sout[sout.length - 1];
            addChar(s, s1);
        }
    }
}
The loop only considers the range in the least significant byte of from and to. Thus, for the range in question:
1 beginbfrange
<0000><01E1>[
<FFFD><FFFD><FFFD><0020><0041><0042><0043><0044>
<0045><0046><0047><0048><0049><004A><004B><004C>
...
<2248><003C><003E><2264><2265><00AC><0394><03A9>
<00B5><03C0><00B0><221E><2202><222B><221A><2211>
<220F><25CA>]
endbfrange
it only iterates from 0x00 to 0xE1, i.e. only the first 226 entries of the 482 mappings.
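The effect of looping over only the last byte can be reproduced outside iText. The following is a minimal sketch: the class BfRangeDemo and its methods are hypothetical and only mimic the loop bounds of addRange, not iText's real API.

```java
// Hypothetical demo class; it mimics only the loop bounds of addRange.
class BfRangeDemo {

    // Number of codes visited when only the last byte of each bound is used,
    // as in the addRange loop quoted above.
    static int lowByteCount(byte[] from, byte[] to) {
        int start = from[from.length - 1] & 0xff;
        int end = to[to.length - 1] & 0xff;
        return Math.max(0, end - start + 1);
    }

    // Reads all bytes of a code as one big-endian number.
    static int toInt(byte[] code) {
        int value = 0;
        for (byte b : code) {
            value = (value << 8) | (b & 0xff);
        }
        return value;
    }

    // Number of codes in the range when the full code values are compared.
    static int fullCount(byte[] from, byte[] to) {
        return toInt(to) - toInt(from) + 1;
    }

    public static void main(String[] args) {
        byte[] from = {0x00, 0x00};        // <0000>
        byte[] to = {0x01, (byte) 0xE1};   // <01E1>
        System.out.println(lowByteCount(from, to)); // prints 226
        System.out.println(fullCount(from, to));    // prints 482
    }
}
```

A loop that covers the whole range would have to iterate over the full code value, as fullCount does, rather than over the last byte alone.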
There actually are some peculiar restrictions in CMaps, e.g. there may only be up to 100 separate bfrange entries in the same section, and in the alternative bfrange entry syntax
n beginbfrange
srcCode1 srcCode2 dstString
endbfrange
which is handled by the same method addRange, there is the restriction
When defining ranges of this type, the value of the last byte in the string shall be less than or equal to 255 − (srcCode2 − srcCode1).
Probably a misunderstanding of this restriction made the developer believe that srcCode1 and srcCode2 would also differ only in the least significant byte.
But maybe there are even more restrictions that I simply did not find...
Meanwhile (as of iText 5.5.9, tested against a development SNAPSHOT) this issue seems to have been fixed.
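The restriction quoted above for the srcCode1 srcCode2 dstString form is easy to state in code. This is a minimal sketch with a hypothetical helper (BfRangeCheck.lastByteFits is not an iText method). It only checks that the last byte of dstString can be incremented across the whole source range without overflowing.

```java
// Hypothetical helper illustrating the quoted restriction; not part of iText.
class BfRangeCheck {

    // True if the last byte of dstString is <= 255 - (srcCode2 - srcCode1),
    // so incrementing it once per source code never overflows the byte.
    static boolean lastByteFits(int srcCode1, int srcCode2, byte[] dstString) {
        int lastByte = dstString[dstString.length - 1] & 0xff;
        return lastByte <= 255 - (srcCode2 - srcCode1);
    }

    public static void main(String[] args) {
        // <0000> <005E> <0020>: 0x20 <= 255 - 94, so the range is allowed.
        System.out.println(lastByteFits(0x0000, 0x005E, new byte[] {0x00, 0x20})); // prints true
        // <0000> <00FF> <0001>: 1 > 255 - 255, so the last byte would overflow.
        System.out.println(lastByteFits(0x0000, 0x00FF, new byte[] {0x00, 0x01})); // prints false
    }
}
```

Note that the restriction bounds the destination string, not the source codes, which fits the observation that source ranges such as <0000><01E1> may span more than one byte.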

How to set the line space between two chunks in itextsharp

I am creating a PDF using iTextSharp for a reporting tool. Everything works fine, except that the space between two chunks is slightly greater than what I want. I searched StackOverflow and found SetLeading(fixed, multiplied), but it is not available on Chunk.
The reason I need it on Chunk is that I have multiple chunks that I add to a paragraph, which I then add to the Document in a single shot.
public static void createPDF(Paragraph para)
{
    string imagepath = "12.pdf";
    Document doc = new Document();
    try
    {
        Paragraph p = para;
        Rectangle[] COLUMNS = {
            new Rectangle(36, 36, 290, 806),
            new Rectangle(305, 36, 559, 806)
        };
        // This is what I have tried
        // p.SetLeading(0.4f,0.8f);
        p.SpacingBefore = 0.0f;
        p.SpacingAfter = 0.1f;
        PdfReader inputPdf = new PdfReader(@"");
        PdfWriter writer2 = PdfWriter.GetInstance(doc, new FileStream(imagepath, FileMode.Create));
        doc.Open();
        PdfContentByte canvas = writer2.DirectContent;
        for (int ij = 1; ij <= 3; ij++)
        {
            doc.SetPageSize(inputPdf.GetPageSizeWithRotation(ij));
            doc.NewPage();
            PdfImportedPage page = writer2.GetImportedPage(inputPdf, ij);
            int rotation = inputPdf.GetPageRotation(ij);
            if (rotation == 90 || rotation == 270)
            {
                canvas.AddTemplate(page, 0, -1f, 1f, 0, 0, inputPdf.GetPageSizeWithRotation(ij).Height);
            }
            else
            {
                canvas.AddTemplate(page, 1f, 0, 0, 1f, 0, 0);
            }
        }
        doc.NewPage();
        ColumnText ct = new ColumnText(canvas);
        int side_of_the_page = 0;
        ct.SetSimpleColumn(COLUMNS[side_of_the_page]);
        int paragraphs = 0;
        int i = 0;
        while (paragraphs < p.Count-1)
        {
            string TEXT = p[i].ToString();
            ct.AddElement(p[i]);
            while (ColumnText.HasMoreText(ct.Go()))
            {
                if (side_of_the_page == 0)
                {
                    side_of_the_page = 1;
                    canvas.MoveTo(297.5f, 36);
                    canvas.LineTo(297.5f, 806);
                    canvas.Stroke();
                }
                else
                {
                    side_of_the_page = 0;
                    doc.NewPage();
                }
                ct.SetSimpleColumn(COLUMNS[side_of_the_page]);
            }
            i++;
            paragraphs++;
        }
        doc.Close();
    }
    catch {
    }
}

Please read chapter 2 of my book. The Chunk object is called the atomic building block among iText's high-level objects. By design, you cannot define a leading on the level of a Chunk.
I quote from page 23:
A Chunk isn't aware of the space that is needed between two lines.
The leading is defined at the level of a Phrase (and, of course, its subclasses, such as Paragraph). If you want to change the spacing between Chunk objects, you need to wrap the Chunks in Phrases or Paragraphs (as you already indicate) and define the leading for those phrases or paragraphs.
Note that the documentation also states:
In normal circumstances you'll use Chunk objects to compose other text objects, such as Phrases and Paragraphs. Typically, you won't add Chunk objects directly to a Document.
Which special circumstance do you have that requires making an exception to this rule?
Extra remarks
You are importing an existing PDF in a way that throws away all existing interactivity. This is suboptimal.
You first compose a paragraph p, you set the leading for p, then you decompose p throwing away the leading you've defined and then you complain that there's no leading.
This is what you are doing wrong:
while (paragraphs < p.Count-1)
{
    ct.AddElement(p[i]);
    ...
}
The object p knows its leading; the separate components of this object (p[0], p[1],...), don't know anything about the leading.
Hence you should do something like this:
ColumnText ct = new ColumnText(canvas);
int side_of_the_page = 0;
ct.SetSimpleColumn(COLUMNS[side_of_the_page]);
ct.AddElement(p);
while (ColumnText.HasMoreText(ct.Go()))
{
    if (side_of_the_page == 0)
    {
        side_of_the_page = 1;
        canvas.MoveTo(297.5f, 36);
        canvas.LineTo(297.5f, 806);
        canvas.Stroke();
    }
    else
    {
        side_of_the_page = 0;
        doc.NewPage();
    }
    ct.SetSimpleColumn(COLUMNS[side_of_the_page]);
}
As you have defined the leading at the level of the p object, you must add the p object as an element to the ColumnText.
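For a sense of what the two arguments of SetLeading(fixed, multiplied) mean: the iText documentation for ColumnText.setLeading describes the resulting leading as the fixed value plus the multiplied value times the largest font size in the line. The arithmetic can be sketched in plain Java as follows. The class LeadingDemo is hypothetical, and the 12 pt font size is an assumption, since the question does not state which font is used.

```java
// Hypothetical helper showing the leading arithmetic only; it does not call iText.
class LeadingDemo {

    // Leading in points: fixed part plus a multiple of the font size.
    static float totalLeading(float fixedLeading, float multipliedLeading, float fontSize) {
        return fixedLeading + multipliedLeading * fontSize;
    }

    public static void main(String[] args) {
        float fontSize = 12f; // assumed font size
        // The values from the commented-out call p.SetLeading(0.4f, 0.8f):
        System.out.println(totalLeading(0.4f, 0.8f, fontSize)); // about 10 points
        // A purely proportional leading of 1.5 times the font size:
        System.out.println(totalLeading(0f, 1.5f, fontSize));   // 18 points
    }
}
```

Under that assumption, the values 0.4f and 0.8f give lines roughly 10 points apart, which is less than the font size itself, so the text would look very tight.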
Regarding the wrong way you're copying the original document: the AddLongTable example shows how to do it correctly. You get a PdfReader object for the existing document. You create a PdfStamper to create a new document. You get the number of pages in the existing document, and then you use insertPage() as many times as needed to add extra content.

Extracting highlighted content in pdf automatically as images

I have a PDF file in which some text and images are highlighted using the Highlight Text (U) tool. Is there a way to automatically extract all the highlighted content as separate images and save them to a folder? I don't want readable text; I just want all the highlighted content as images. Thanks.

You would need to use a PDF library to iterate through all the annotation objects and their properties to see which ones are highlight annotations. Once you have found a highlight annotation, you can extract its position and size (bounding box).
Once you have a list of the annotation bounding boxes, you will need to render the PDF page to an image format such as PNG, JPEG or TIFF, so that you can clip the rendered image to the area of each annotation. You could use GDI+ or something like LibTIFF for that.
There are various PDF libraries that could do this including
http://www.quickpdflibrary.com (I consult for QuickPDF) or
http://www.itextpdf.com
Here is a C# function based on Quick PDF Library that does what you need.
private void ExtractAnnots_Click(object sender, EventArgs e)
{
    int dpi = 300;
    Rectangle r;
    List<Rectangle> annotList = new List<Rectangle>();
    QP.LoadFromFile("samplefile.pdf", "");
    for (int p = 1; p <= QP.PageCount(); p++)
    {
        QP.SelectPage(p); // Select the current page.
        QP.SetOrigin(1); // Set origin to top left.
        annotList.Clear();
        for (int i = 1; i <= QP.AnnotationCount(); i++)
        {
            if (QP.GetAnnotStrProperty(i, 101) == "Highlight")
            {
                // Convert the bounding box from points (1/72 inch) to pixels at the chosen dpi.
                r = new Rectangle((int)(QP.GetAnnotDblProperty(i, 105) * dpi / 72.0), // x
                                  (int)(QP.GetAnnotDblProperty(i, 106) * dpi / 72.0), // y
                                  (int)(QP.GetAnnotDblProperty(i, 107) * dpi / 72.0), // w
                                  (int)(QP.GetAnnotDblProperty(i, 108) * dpi / 72.0)); // h
                annotList.Add(r); // Add the bounding box to the annotation list for this page.
                string s = String.Format("page={0}: x={1} y={2} w={3} h={4}\n", p, r.X, r.Y, r.Width, r.Height);
                OutputTxt.AppendText(s);
            }
        }
        // Now we have a list of annotations for the current page.
        // Delete the annotations from the PDF in memory so we don't render them.
        // Annotation indices are 1-based (see the loop above), so stop at 1.
        for (int i = QP.AnnotationCount(); i >= 1; i--)
            QP.DeleteAnnotation(i);
        QP.RenderPageToFile(dpi, p, 0, "page.bmp"); // 300 dpi, 0=bmp
        Bitmap bmp = Image.FromFile("page.bmp") as Bitmap;
        for (int i = 0; i < annotList.Count; i++)
        {
            Bitmap cropped = bmp.Clone(annotList[i], bmp.PixelFormat);
            string filename = String.Format("annot_p{0}_{1}.bmp", p, i + 1);
            cropped.Save(filename);
        }
        bmp.Dispose();
    }
    QP.RemoveDocument(QP.SelectedDocument());
}
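The scaling and cropping step of the function above does not depend on any particular PDF library. Here is a minimal Java sketch of that step alone, using only java.awt. The class AnnotationCropDemo and its methods are hypothetical, and the sketch assumes the bounding box is already expressed in points with a top-left origin, as in the C# code.

```java
import java.awt.Rectangle;
import java.awt.image.BufferedImage;

// Hypothetical demo class; it covers only the unit conversion and the crop.
class AnnotationCropDemo {

    // PDF coordinates are in points (1/72 inch); a page rendered at `dpi`
    // has dpi pixels per inch, hence the factor dpi / 72.
    static int pointsToPixels(double points, int dpi) {
        return (int) (points * dpi / 72.0);
    }

    // Converts a bounding box given in points to pixel coordinates.
    static Rectangle toPixelRect(double x, double y, double w, double h, int dpi) {
        return new Rectangle(pointsToPixels(x, dpi), pointsToPixels(y, dpi),
                             pointsToPixels(w, dpi), pointsToPixels(h, dpi));
    }

    // Crops the rendered page to the box, clipped to the page bounds so that
    // rounding at the page edge cannot make getSubimage throw.
    static BufferedImage crop(BufferedImage page, Rectangle box) {
        Rectangle clipped = box.intersection(new Rectangle(0, 0, page.getWidth(), page.getHeight()));
        if (clipped.isEmpty()) {
            throw new IllegalArgumentException("Box lies outside the page: " + box);
        }
        return page.getSubimage(clipped.x, clipped.y, clipped.width, clipped.height);
    }

    public static void main(String[] args) {
        // A blank image stands in for a page rendered at 300 dpi.
        BufferedImage page = new BufferedImage(1000, 1000, BufferedImage.TYPE_INT_RGB);
        Rectangle box = toPixelRect(36, 72, 144, 18, 300);
        BufferedImage cropped = crop(page, box);
        System.out.println(box.x + "," + box.y + " " + cropped.getWidth() + "x" + cropped.getHeight());
        // prints 150,300 600x75
    }
}
```

The cropped images could then be written out with ImageIO.write, one file per annotation.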

Do you want each piece of text as a separate highlight or all the highlights on a separate pane?
