I have a huge PDF document (1600 pages), and I kept only part of it by printing selected pages in Acrobat Reader.
I am reading the text using iText with the following code:
public void decode(File file) throws IOException {
PdfReader reader = new PdfReader(file.toURI().toURL());
int numberOfPages = reader.getNumberOfPages();
// the listener's constructor expects the reader
ProcessorListener listener = new ProcessorListener(reader);
PdfContentStreamProcessor processor = new PdfContentStreamProcessor(listener);
for (int pageNumber = 1; pageNumber <= numberOfPages; pageNumber++) {
// look up the page dictionary to reach the page resources
PdfDictionary pageDic = reader.getPageN(pageNumber);
PdfDictionary resourcesDic = pageDic.getAsDict(PdfName.RESOURCES);
processor.processContent(ContentByteUtils.getContentBytesForPage(reader, pageNumber), resourcesDic);
}
}
and the processor:
public class ProcessorListener implements RenderListener {
private final PdfReader reader;
public ProcessorListener(PdfReader reader) {
this.reader = reader;
}
@Override
public void beginTextBlock() {
}
@Override
public void renderText(TextRenderInfo tri) {
String text = tri.getText();
double x = tri.getDescentLine().getBoundingRectange().getX();
double y = tri.getDescentLine().getBoundingRectange().getY();
System.out.println(text + " => x:" + x + " y:" + y);
}
@Override
public void endTextBlock() {
}
}
What I observe is that in my initial (huge) document the x and y coordinates change depending on the text position, but in my smaller saved document the coordinates of all text chunks are almost always the same (x is very small and y is big). In my case, for example, x is invariably smaller than 2 and y is almost always equal to 480.
What's the reason for this behavior? I only have this problem if I print my PDF document using Acrobat Reader, not when I am printing it using Edge for example. Also the resulting document is correctly shown in Acrobat Reader or Edge, which is the reason I think that the problem is in my code.
I am wondering how we can use iText 7 to extract image info associated with digital signatures. I know similar questions have been asked in the past, but they were mostly about iText 5, which is quite different from iText 7 after all the updates and modifications to the software.
You can extract the image from a signature appearance using the low-level API.
Complete Java code:
private void saveImageFromSignature(PdfDocument document, String fieldName) throws IOException {
PdfAcroForm acroForm = PdfAcroForm.getAcroForm(document, false);
PdfDictionary xObject = acroForm.getField(fieldName)
.getWidgets()
.get(0)
.getNormalAppearanceObject()
.getAsDictionary(PdfName.Resources)
.getAsDictionary(PdfName.XObject)
.getAsStream(new PdfName("FRM"))
.getAsDictionary(PdfName.Resources)
.getAsDictionary(PdfName.XObject);
PdfStream stream = xObject.getAsStream(new PdfName("Im1"));
PdfImageXObject image = new PdfImageXObject(stream);
BufferedImage result = createImageFromBytes(image.getImageBytes());
//pdf allows using masked image in the signature appearance
PdfStream maskStream = (PdfStream) stream.getAsStream(PdfName.SMask);
if (maskStream != null) {
PdfImageXObject maskImage = new PdfImageXObject(maskStream);
BufferedImage maskBimage = createImageFromBytes(maskImage.getImageBytes());
String fileMask = String.format(getOutputFolder() + "/file_mask_%d.%s",
image.getPdfObject().getIndirectReference().getObjNumber(),
image.identifyImageFileExtension());
ImageIO.write(maskBimage,
image.identifyImageFileExtension(),
new File(fileMask));
//the mask defines an alpha channel
Image transpImg = transformToTransperency(maskBimage);
result = applyTransperency(result, transpImg);
}
String filenameComp = String.format(getOutputFolder() + "/file_comp_%d.%s",
image.getPdfObject().getIndirectReference().getObjNumber(),
image.identifyImageFileExtension());
ImageIO.write(result,
image.identifyImageFileExtension(),
new File(filenameComp));
document.close();
}
private Image transformToTransperency(BufferedImage bi) {
ImageFilter filter = new RGBImageFilter() {
@Override
public int filterRGB(int x, int y, int rgb) {
return (rgb << 8) & 0xFF000000;
}
};
ImageProducer ip = new FilteredImageSource(bi.getSource(), filter);
return Toolkit.getDefaultToolkit().createImage(ip);
}
private BufferedImage applyTransperency(BufferedImage bi, Image mask) {
BufferedImage dest = new BufferedImage(
bi.getWidth(), bi.getHeight(),
BufferedImage.TYPE_INT_ARGB);
Graphics2D g2 = dest.createGraphics();
g2.drawImage(bi, 0, 0, null);
AlphaComposite ac = AlphaComposite.getInstance(AlphaComposite.DST_IN, 1.0F);
g2.setComposite(ac);
g2.drawImage(mask, 0, 0, null);
g2.dispose();
return dest;
}
Upd: This works only for a very limited number of cases. Thanks to @mkl.
First of all, thank you for the proposals which personally guided me.
After several tries, here is the code that worked for me:
public void extract(String inputFilename, String fieldName) throws IOException {
try (PdfDocument document = new PdfDocument(new PdfReader(inputFilename))){
PdfAcroForm acroForm = PdfAcroForm.getAcroForm(document, false);
final PdfFormField signatorySignature1 = acroForm.getField(fieldName);
final PdfDictionary appearanceDic = signatorySignature1.getPdfObject().getAsDictionary(PdfName.AP);
final PdfStream normalAppearance = appearanceDic.getAsStream(PdfName.N);
final PdfDictionary ressourceDic = normalAppearance.getAsDictionary(PdfName.Resources);
PdfResources resources = new PdfResources(ressourceDic);
final ImageRenderInfo imageRenderInfo = extractImageRenderInfo(normalAppearance.getBytes(), resources);
Files.write(
Path.of(inputFilename + "_" + fieldName + "_" + System.currentTimeMillis() + ".png"),
imageRenderInfo.getImage().getImageBytes());
} catch (Exception e) {
e.printStackTrace();
}
}
public ImageRenderInfo extractImageRenderInfo(byte[] contentBytes, PdfResources pdfResource) {
MyLocationExtractionStrategy strategy = new MyLocationExtractionStrategy();
PdfCanvasProcessor parser = new PdfCanvasProcessor(strategy, new HashMap<>());
parser.processContent(contentBytes, pdfResource);
return strategy.getImageRenderInfo();
}
class MyLocationExtractionStrategy implements ILocationExtractionStrategy {
private ImageRenderInfo imageRenderInfo;
@Override public Collection<IPdfTextLocation> getResultantLocations() {
return null;
}
@Override public void eventOccurred(IEventData iEventData, EventType eventType) {
if (eventType.equals(EventType.RENDER_IMAGE)) {
imageRenderInfo = (ImageRenderInfo)iEventData;
}
}
@Override public Set<EventType> getSupportedEvents() {
return null;
}
public ImageRenderInfo getImageRenderInfo() {
return this.imageRenderInfo;
}
}
The following is the code (using iText for .NET, version 7.0.4.0) that I am using to extract text from a PDF. In my testing it correctly extracts only the content within the rectangle for most PDFs, but for a few of them it returns the entire line.
I know that the filter returns every text snippet that intersects the rectangle (so part of the text may lie outside the rectangle, because iText doesn't cut text snippets into pieces).
But I want to understand which parameter in the PDF iText uses to split text.
var reader = new PdfReader( filePath );
PdfDocument pdfDoc = new PdfDocument( reader );
var addressRect = new Rectangle( 33, 190, 70, 42 );
var addressRegionFilter = new TextRegionEventFilter( addressRect );
var filterListener = new FilteredTextEventListener( new LocationTextExtractionStrategy(), addressRegionFilter );
var addressText = PdfTextExtractor.GetTextFromPage( pdfDoc.GetPage( 1 ), filterListener );
pdfDoc.Close();
This should do the trick.
class RectangleTextExtractionStrategy implements ITextExtractionStrategy
{
private ITextExtractionStrategy innerStrategy = null;
private Rectangle rectangle;
public RectangleTextExtractionStrategy(ITextExtractionStrategy strategy, Rectangle rectangle)
{
this.innerStrategy = strategy;
this.rectangle = rectangle;
}
@Override
public String getResultantText() {
return innerStrategy.getResultantText();
}
@Override
public void eventOccurred(IEventData iEventData, EventType eventType) {
if(eventType != EventType.RENDER_TEXT)
return;
TextRenderInfo tri = (TextRenderInfo) iEventData;
for(TextRenderInfo subTri : tri.getCharacterRenderInfos())
{
Rectangle r2 = new CharacterRenderInfo(subTri).getBoundingBox();
if(intersects(r2))
innerStrategy.eventOccurred(subTri, EventType.RENDER_TEXT);
}
}
private boolean intersects(Rectangle rectangle)
{
// # TODO
return true;
}
@Override
public Set<EventType> getSupportedEvents() {
return innerStrategy.getSupportedEvents();
}
}
The idea here is to split all incoming TextRenderInfo objects into the corresponding events for their characters. Then (if they are in the search region) we delegate the call to another ITextExtractionStrategy.
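The `intersects` method above is left as a TODO. As a minimal sketch of what it might look like, assuming axis-aligned rectangles described by a lower-left corner plus width and height (the way iText's `Rectangle` exposes them): the `Box` class below is a hypothetical stand-in so the example runs without iText, and whether touching edges should count as an intersection is a choice made here, not something the answer specifies.

```java
public class RectangleOverlap {
    /** Hypothetical stand-in for an axis-aligned rectangle: lower-left corner plus size. */
    static final class Box {
        final float x, y, width, height;

        Box(float x, float y, float width, float height) {
            this.x = x;
            this.y = y;
            this.width = width;
            this.height = height;
        }
    }

    /** Returns true when the two boxes share at least one point (touching edges count). */
    static boolean intersects(Box a, Box b) {
        // The horizontal extents overlap when each box starts before the other one ends.
        boolean overlapX = a.x <= b.x + b.width && b.x <= a.x + a.width;
        // Same test on the vertical extents.
        boolean overlapY = a.y <= b.y + b.height && b.y <= a.y + a.height;
        return overlapX && overlapY;
    }

    public static void main(String[] args) {
        Box region = new Box(33, 190, 70, 42);   // search region from the question
        Box inside = new Box(40, 200, 5, 8);     // character box within the region
        Box outside = new Box(200, 200, 5, 8);   // character box to the right of the region
        System.out.println(intersects(region, inside));
        System.out.println(intersects(region, outside));
    }
}
```

In the strategy above, the same comparison would be written against the stored `rectangle` field and the character's bounding box, using the rectangle's getters instead of the `Box` fields.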
I would like to parse a PDF and find the logo via known attributes and when I find a match, remove that image and then copy everything else.
I am using the code below to replace an image with a blank white image to remove a logo from PDFs that are to be printed on letterhead. It replaces the image with a white image of the same size. Is there a way to modify this to actually remove the image (and thus save some space, etc.)?
private static void Main(string[] args)
{
ManipulatePdf(@"C:\in.pdf", @"C:\out.pdf");
Console.WriteLine("Finished - press a key");
Console.ReadKey();
}
public static void ManipulatePdf(String src, String dest)
{
Console.WriteLine("Start");
PdfReader reader = new PdfReader(src);
// first read all references and find the one we wish to work on.
PdfDictionary page = reader.GetPageN(1); // all resources are available to every page (?)
PdfDictionary resources = page.GetAsDict(PdfName.RESOURCES);
PdfDictionary xobjects = resources.GetAsDict(PdfName.XOBJECT);
page = reader.GetPageN(1);
resources = page.GetAsDict(PdfName.RESOURCES);
xobjects = resources.GetAsDict(PdfName.XOBJECT);
foreach (PdfName pdfName in xobjects.Keys)
{
PRStream stream = (PRStream) xobjects.GetAsStream(pdfName);
if (stream.Length > 100000)
{
PdfImage image = new PdfImage(MakeBlankImg(), "", null);
Console.WriteLine("Calling replace stream");
ReplaceStream(stream, image);
}
}
PdfStamper stamper = new PdfStamper(reader, new FileStream(dest, FileMode.Create));
stamper.Close();
reader.Close();
}
public static iTextSharp.text.Image MakeBlankImg()
{
Console.WriteLine("Making small blank image");
byte[] array;
using (MemoryStream ms = new MemoryStream())
{
//var drawingImage = image.GetDrawingImage();
using (Bitmap newBi = new Bitmap(1, 1))
{
using (Graphics g = Graphics.FromImage(newBi))
{
g.Clear(Color.White);
g.Flush();
}
newBi.Save(ms, ImageFormat.Jpeg);
}
array = ms.ToArray();
}
Console.WriteLine("Image array is " + array.Length + " bytes.");
return iTextSharp.text.Image.GetInstance(array);
}
public static void ReplaceStream(PRStream orig, PdfStream stream)
{
orig.Clear();
MemoryStream ms = new MemoryStream();
stream.WriteContent(ms);
orig.SetData(ms.ToArray(), false);
Console.WriteLine("Iterating keys");
foreach (KeyValuePair<PdfName, PdfObject> keyValuePair in stream)
{
Console.WriteLine("Key: " + keyValuePair.Key.ToString());
orig.Put(keyValuePair.Key, stream.Get(keyValuePair.Key));
}
}
}
Can anyone help with this code? The PDF file does not load in the app; it just shows a blank white screen, and Logcat shows FileNotFoundException: /storage/sdcard/raw/ourpdf.pdf.
I am trying to make an app that shows information when I click buttons, with each button opening a specific PDF file for reading. Any specific help, please.
Thanks for the help.
part1
package com.code.androidpdf;
public class MainActivity extends Activity {
//Globals:
private WebView wv;
private int ViewSize = 0;
//OnCreate Method:
@Override
protected void onCreate(Bundle savedInstanceState)
{
super.onCreate(savedInstanceState);
setContentView(R.layout.activity_main);
//Settings
PDFImage.sShowImages = true; // show images
PDFPaint.s_doAntiAlias = true; // make text smooth
HardReference.sKeepCaches = true; // save images in cache
//Setup above
wv = (WebView)findViewById(R.id.webView1);
wv.getSettings().setBuiltInZoomControls(true);//show zoom buttons
wv.getSettings().setSupportZoom(true);//allow zoom
//get the width of the webview
wv.getViewTreeObserver().addOnGlobalLayoutListener(new ViewTreeObserver.OnGlobalLayoutListener()
{
@Override
public void onGlobalLayout()
{
ViewSize = wv.getWidth();
wv.getViewTreeObserver().removeGlobalOnLayoutListener(this);
}
});
pdfLoadImages();//load images
}
private void pdfLoadImages() {
try
{
// run async
new AsyncTask<Void, Void, Void>()
{
// create and show a progress dialog
ProgressDialog progressDialog = ProgressDialog.show(MainActivity.this, "", "Opening...");
@Override
protected void onPostExecute(Void result)
{
//after async close progress dialog
progressDialog.dismiss();
}
@Override
protected Void doInBackground(Void... params)
{
try
{
// select a document and get bytes
File file = new File(Environment.getExternalStorageDirectory().getPath()+"/randompdf.pdf");
RandomAccessFile raf = new RandomAccessFile(file, "r");
FileChannel channel = raf.getChannel();
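// map the file contents into the buffer (a null buffer would make PDFFile fail)
net.sf.andpdf.nio.ByteBuffer bb = net.sf.andpdf.nio.ByteBuffer.NEW(channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size()));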
raf.close();
// create a pdf doc
PDFFile pdf = new PDFFile(bb);
//Get the first page from the pdf doc
PDFPage PDFpage = pdf.getPage(1, true);
//create a scaling value according to the WebView Width
final float scale = ViewSize / PDFpage.getWidth() * 0.95f;
//convert the page into a bitmap with a scaling value
Bitmap page = PDFpage.getImage((int)(PDFpage.getWidth() * scale), (int)(PDFpage.getHeight() * scale), null, true, true);
//save the bitmap to a byte array
ByteArrayOutputStream stream = new ByteArrayOutputStream();
page.compress(Bitmap.CompressFormat.PNG, 100, stream);
stream.close();
byte[] byteArray = stream.toByteArray();
//convert the byte array to a base64 string
String base64 = Base64.encodeToString(byteArray, Base64.DEFAULT);
//create the html + add the first image to the html
String html = "<!DOCTYPE html><html><body bgcolor=\"#7f7f7f\"><img src=\"data:image/png;base64,"+base64+"\" hspace=10 vspace=10><br>";
//loop through the rest of the pages and repeat the above
for(int i = 2; i <= pdf.getNumPages(); i++)
{
PDFpage = pdf.getPage(i, true);
page = PDFpage.getImage((int)(PDFpage.getWidth() * scale), (int)(PDFpage.getHeight() * scale), null, true, true);
stream = new ByteArrayOutputStream();
page.compress(Bitmap.CompressFormat.PNG, 100, stream);
stream.close();
byteArray = stream.toByteArray();
base64 = Base64.encodeToString(byteArray, Base64.DEFAULT);
html += "<img src=\"data:image/png;base64,"+base64+"\" hspace=10 vspace=10><br>";
}
html += "</body></html>";
//load the html in the webview
wv.loadDataWithBaseURL("", html, "text/html","UTF-8", "");
}
catch (Exception e)
{
Log.d("CounterA", e.toString());
}
return null;
}
}.execute();
System.gc();// run GC
}
catch (Exception e)
{
Log.d("error", e.toString());
}
}
}
Sadly, Android has no built-in way to display a PDF that is stored locally on the device; Android L is the first release to introduce this feature. So, to display a PDF, you have two options:
Use a WebView. See this answer: How to open local pdf file in webview in android? (note that this requires an internet connection)
Use a third-party PDF viewer.
You can also send an intent so that another app handles your PDF.
You can get an InputStream for the file using
getResources().openRawResource(R.raw.ourpdf)
Docs: http://developer.android.com/reference/android/content/res/Resources.html#openRawResource(int)
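Because a raw resource is only available as a stream and not as a file path, its bytes have to be read into memory (or copied to a file) before a PDF library can work with them. The following is a rough sketch of that step in plain Java. `readFully` is a hypothetical helper name, and a `ByteArrayInputStream` stands in for the stream that `getResources().openRawResource(R.raw.ourpdf)` would return, so that the example runs outside Android.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;

public class RawResourceBytes {
    /** Reads the whole stream into a byte array using a fixed-size copy buffer. */
    static byte[] readFully(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        int n;
        // read() returns -1 once the end of the stream is reached
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Placeholder bytes; on Android the stream would come from openRawResource.
        byte[] placeholder = "%PDF-1.4 placeholder".getBytes("US-ASCII");
        try (InputStream in = new ByteArrayInputStream(placeholder)) {
            byte[] data = readFully(in);
            // Wrap the bytes in a standard NIO buffer for code that expects one.
            ByteBuffer wrapped = ByteBuffer.wrap(data);
            System.out.println(wrapped.remaining() + " bytes read");
        }
    }
}
```

How the resulting bytes are handed to a particular PDF library depends on that library and is not shown here.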
I want to print a PDF file generated by iText. I also use PDFBox for this. On this page
PDFBox: How to print pdf with specified printer?
everything is explained, but I am having a problem with the page size. The original size of the PDF file is 100x150 mm (3.93x5.9 inch). When I run the code, the file is printed, but not at its original size: it is printed in the north-west corner of the page, reduced in size by nearly 70%.
Here is my code for this
public class LabelPrinter {
public LabelPrinter() {
}
public static PrintService getNamedPrinter(String name, PrintRequestAttributeSet attrs) {
PrintService[] services = PrintServiceLookup.lookupPrintServices(null, attrs);
if (services.length > 0) {
if (name == null)
return services[0];
else {
for (int i = 0; i < services.length; i++) {
if (services[i].getName().equals(name))
return services[i];
}
}
}
return null;
}
public static void printLabel(String fileName,String printerName) throws FileNotFoundException, PrintException, IOException, PrinterException{
PrinterJob job = PrinterJob.getPrinterJob();
PrintService myService=getNamedPrinter(printerName, null);
if(myService!=null){
job.setPrintService(myService);
PDDocument doc = PDDocument.load(fileName);
doc.print(job);
}
else{
System.out.println("printer not found");
}
}
}
I hope I have explained this well enough. (Please accept my apologies for my bad English.)
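One direction that may be worth checking, offered as a sketch and not as a confirmed fix: if no page format is supplied, the print job may fall back to the printer's default paper, which could explain the scaling and the corner placement. The example below builds a `java.awt.print.PageFormat` for the label, assuming a size of 100x150 mm (which matches the quoted 3.93x5.9 inch) and no margins. `labelFormat` and `mmToPoints` are hypothetical helper names.

```java
import java.awt.print.PageFormat;
import java.awt.print.Paper;

public class LabelPaper {
    // Java's printing API measures in points: 72 points per inch, 25.4 mm per inch.
    static final double POINTS_PER_MM = 72.0 / 25.4;

    static double mmToPoints(double mm) {
        return mm * POINTS_PER_MM;
    }

    /** Builds a portrait page format whose paper matches the label and has no margins. */
    static PageFormat labelFormat(double widthMm, double heightMm) {
        double w = mmToPoints(widthMm);
        double h = mmToPoints(heightMm);
        Paper paper = new Paper();
        paper.setSize(w, h);
        // Make the whole sheet printable (assumption: the label printer needs no margins).
        paper.setImageableArea(0, 0, w, h);
        PageFormat format = new PageFormat();
        format.setPaper(paper);
        format.setOrientation(PageFormat.PORTRAIT);
        return format;
    }

    public static void main(String[] args) {
        PageFormat format = labelFormat(100, 150);
        System.out.println(format.getWidth() + " x " + format.getHeight() + " pt");
    }
}
```

How such a format is passed to the printing call depends on the PDFBox version in use, and the printer driver may still override it, so the result should be verified on the actual device.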