How to identify checkboxes in a flat PDF?

Team,
I have to validate a flattened PDF as part of a requirement. This PDF has checkboxes. I used the Apache PDFBox library to read the contents of this PDF, but it only reads the text and does not identify the checkboxes. I attached a screenshot of a similar PDF file (a flat PDF with checkboxes).
Can you please suggest an approach to identify and validate these checkboxes?
Code snippet used:
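// PDFBox 2.x; needs java.io.File, org.apache.pdfbox.pdmodel.PDDocument, org.apache.pdfbox.text.PDFTextStripper
// try-with-resources closes the document even if extraction fails
try (PDDocument document = PDDocument.load(new File("D:\\test.pdf"))) {
    PDFTextStripper stripper = new PDFTextStripper();
    stripper.setStartPage(1);
    stripper.setEndPage(1);
    stripper.setSortByPosition(true);
    String pdfTextContent = stripper.getText(document);
    System.out.println(pdfTextContent);
}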
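One possible approach, offered as a sketch rather than a definitive answer: in a flattened PDF the checkboxes are no longer form fields but ordinary drawn page content, so text extraction typically does not report them. If the position of each checkbox on the page is known, the page can be rendered to an image (PDFBox 2.x offers PDFRenderer.renderImageWithDPI for that) and the interior of each box tested for dark pixels. The class name CheckboxProbe, the luminance threshold of 128 and the 5% ratio are assumptions chosen for illustration; the code uses only the JDK, so the rendering step is replaced by a synthetic image.

```java
import java.awt.Rectangle;
import java.awt.image.BufferedImage;

// Hypothetical helper: decides whether a checkbox region of a rendered page looks ticked.
final class CheckboxProbe {

    /** Fraction of pixels in the region whose luminance (0..255) is below the threshold. */
    static double darkRatio(BufferedImage page, Rectangle region, int threshold) {
        // Clip the region to the image so out-of-range coordinates cannot throw.
        Rectangle r = region.intersection(new Rectangle(0, 0, page.getWidth(), page.getHeight()));
        if (r.isEmpty()) {
            return 0.0;
        }
        long dark = 0;
        for (int y = r.y; y < r.y + r.height; y++) {
            for (int x = r.x; x < r.x + r.width; x++) {
                int rgb = page.getRGB(x, y);
                int red = (rgb >> 16) & 0xFF;
                int green = (rgb >> 8) & 0xFF;
                int blue = rgb & 0xFF;
                int luminance = (299 * red + 587 * green + 114 * blue) / 1000;
                if (luminance < threshold) {
                    dark++;
                }
            }
        }
        return (double) dark / ((long) r.width * r.height);
    }

    /** The region should cover only the inside of the box, without its border. */
    static boolean looksChecked(BufferedImage page, Rectangle inner, double minDarkRatio) {
        return darkRatio(page, inner, 128) >= minDarkRatio;
    }

    public static void main(String[] args) {
        // Synthetic stand-in for a rendered page: white background with an "X" drawn in one box.
        BufferedImage page = new BufferedImage(100, 100, BufferedImage.TYPE_INT_RGB);
        for (int y = 0; y < 100; y++) {
            for (int x = 0; x < 100; x++) {
                page.setRGB(x, y, 0xFFFFFF);
            }
        }
        for (int i = 0; i < 20; i++) {
            page.setRGB(10 + i, 10 + i, 0x000000);
            page.setRGB(29 - i, 10 + i, 0x000000);
        }
        System.out.println(looksChecked(page, new Rectangle(10, 10, 20, 20), 0.05));
        System.out.println(looksChecked(page, new Rectangle(50, 50, 20, 20), 0.05));
    }
}
```

The threshold and ratio would need tuning against real renderings, and the box coordinates have to be converted from PDF points to pixels at the chosen DPI.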

Related

Loading custom fonts to PDF document not working

We are converting a PDF document to HTML. Most of the documents use the Courier Final Draft font. The converted HTML is not rendered properly because the text overlaps. It looks like PDFBox is not finding the font and is therefore miscalculating the spacing between the words. We are trying to load the font (Courier Final Draft) into the PDDocument, but it is still not identifying the font. We are getting the error "Error loading font 'Error loading font 'NOLUXZ+CourierFinalDraft". Can somebody guide me on this? Thanks
I tried to load the font as follows:
PDDocument doc = new PDDocument();
PDTrueTypeFont.load(doc, ResourceUtils.getFile(
        "classpath:fonts/Courier_Final_Draft_Regular.ttf"), MacRomanEncoding.INSTANCE);
PDType0Font.load(doc, ResourceUtils.getFile(
        "classpath:fonts/Courier_Final_Draft_Regular.ttf"));
PDType0Font.load(doc, new FileInputStream(ResourceUtils.getFile(
        "classpath:fonts/Courier_Final_Draft_Regular.ttf")), false);

Why does the ordinate of SetFixedPosition() in iText 7 .NET not work well in a web application?

I used iText 7 .NET in a demo console project to add text to an existing PDF. The code is below:
PdfDocument pdfTemple = new PdfDocument(new PdfReader(templateFile), new PdfWriter(templeFile));
Document documentTemple = new Document(pdfTemple, PageSize.A4);
Text text = new Text(string.Format(@"{0} / {1} ", month, year))
    .SetBackgroundColor(ColorConstants.WHITE)
    .SetBold()
    .SetFontSize(11);
documentTemple.Add(new Paragraph(text).SetFixedPosition(1, 424, 740, 60));
text = new Text(DateTime.Now.ToString("MM/dd/yyy"))
    .SetBackgroundColor(ColorConstants.WHITE)
    //.SetBold()
    .SetFontSize(10);
documentTemple.Add(new Paragraph(text).SetFixedPosition(1, 503, 710, 60));
documentTemple.Close();
It works well in the console demo project. But when I use the same code in a web application (MVC, .NET 4.7.2), it doesn't work: only text with an ordinate between 50 and 350 is displayed in the created PDF file.
I need to add text at positions in the PDF file with an ordinate between 40 and 740. How can I make this work in my web application (MVC, .NET 4.7.2)?
Thanks.
Additions:
I found that if I open the template PDF (the one I write text into) in Photoshop, part of the page is shown as a white area (screenshot).
If I write text into that white area, the text does not show in the PDF file, but if I write text into another area, it does.

Actually cropping a PDF with PDF Clown

My objective is to actually crop a PDF file with PdfClown.
There are a lot of tools and libraries that allow cropping a PDF by changing its cropBox. This hides the contents outside a rectangular area, but the content is still there: it can be accessed through a PDF parser, and the PDF size does not change.
What I need instead is to create a new page containing only the contents inside the rectangular area.
So far I've tried scanning the contents and selectively cloning them, but I haven't succeeded yet. Any suggestions on using PdfClown for that?
I've seen someone trying something similar with PDFBox ("Cropping a region from a PDF page with PDFBox"), without success so far.
A bit late, but maybe it helps someone.
I am successfully doing what you are asking for, but with other libraries.
Required libraries: iText 4 or 5 and Ghostscript
Step 1, with pseudo code:
Using iText, create a PdfWriter instance with a blank document. Open a PdfReader object on the original file you want to crop. Import the page, get a PdfTemplate object from the source, set its BoundingBox property to the desired crop box, wrap the template in an iText Image object and paste it onto the new page at an absolute position.
Dim reader As New PdfReader(sourcefile)
Dim doc As New Document()
Dim writer As PdfWriter = PdfWriter.GetInstance(doc, New System.IO.FileStream(outputfilename, System.IO.FileMode.Create))
' the document has to be opened before content can be added to it
doc.Open()
' get the source page as an imported page
Dim page As PdfImportedPage = writer.GetImportedPage(reader, indexOfPageToGet)
' create a PdfTemplate object at original size from the source - see the iText in Action book, page 91, for full details
Dim pdftemp As PdfTemplate = page.CreateTemplate(page.Width, page.Height)
' paste the original page onto the template object; see the iText documentation for what those parameters do (scaling, mirroring)
pdftemp.AddTemplate(page, 1, 0, 0, 1, 0, 0)
' now the critical part - set the BoundingBox property on the template. This makes all objects outside the rectangle invisible
pdftemp.BoundingBox = {iText Rectangle Structure with new Cropbox}
' template not needed anymore
writer.ReleaseTemplate(pdftemp)
' create an iText Image object as a wrapper around the template - with this img object, absolute positioning on the final page is much easier
Dim img As iTextSharp.text.Image = Image.GetInstance(pdftemp)
' set img position
img.SetAbsolutePosition(x, y)
' set optional rotation if needed
img.RotationDegrees = 0
' finally, this adds the actual content to the new document
doc.Add(img)
' cleanup
doc.Close()
reader.Close()
writer.Close()
The output file will look cropped, but the objects are still present in the PDF stream, so the file size will probably have changed very little at this point.
Step 2:
Using Ghostscript with the pdfwrite output device and the correct command-line parameters, you can re-process the PDF from Step 1. This will give you a much smaller PDF. See the Ghostscript documentation for the arguments: https://www.ghostscript.com/doc/9.52/Use.htm
This step actually gets rid of the objects that are outside the bounding box, which is the requirement you stated in your original post, at least for the files that I deal with.
Optional Step 3:
Using mutool clean with the -g option, you can clean up unused xref objects. Your original PDF probably had a lot of xref entries, which increase the file size. After cropping, some of them may not be needed anymore.
https://mupdf.com/docs/manual-mutool-clean.html
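Steps 2 and 3 could be scripted roughly as follows. This is a minimal sketch, assuming the gs and mutool executables are on the PATH under those names (on Windows the Ghostscript console binary is usually called gswin64c instead); the class and method names are made up for illustration. Ghostscript's -o option sets the output file and implies -dBATCH and -dNOPAUSE, and any further pdfwrite tuning flags are left out here.

```java
import java.io.IOException;
import java.util.List;

// Hypothetical helper that builds and runs the two post-processing commands.
final class CropCleanupCommands {

    /** Step 2: re-distill the visually cropped PDF through Ghostscript's pdfwrite device. */
    static List<String> ghostscriptRewrite(String input, String output) {
        return List.of("gs", "-sDEVICE=pdfwrite", "-o", output, input);
    }

    /** Step 3: let mutool garbage-collect objects that are no longer referenced. */
    static List<String> mutoolClean(String input, String output) {
        return List.of("mutool", "clean", "-g", input, output);
    }

    /** Runs a command, forwarding its console output, and returns its exit code. */
    static int run(List<String> command) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command).inheritIO().start();
        return process.waitFor();
    }

    public static void main(String[] args) {
        // Only prints the commands; call run(...) to execute them on a machine with both tools installed.
        System.out.println(String.join(" ", ghostscriptRewrite("step1.pdf", "step2.pdf")));
        System.out.println(String.join(" ", mutoolClean("step2.pdf", "step3.pdf")));
    }
}
```

The file names step1.pdf, step2.pdf and step3.pdf are placeholders for the output of each stage.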
The PDF format is a tricky thing. Normally I would agree with @Tilman Hausherr: my suggestion may not work for all files and covers the 'almost impossible' scenario, but it works for all the cases that I deal with.

Change font size in text box - apache poi word docx

I found the answer that explains how to insert a new text box into a docx document: "create text box in document .docx using apache poi".
The problem is that I cannot change the font size inside a newly created text box.
Does anyone know how to do that?
Reference: create text box in document .docx using apache poi
The ctTxbxContent.addNewP() call in my code creates a CTP object. XWPFParagraph has a constructor XWPFParagraph(org.openxmlformats.schemas.wordprocessingml.x2006.main.CTP prgrph, IBody part), so you can get an XWPFParagraph from the CTP object and then use the usual Apache POI methods from there.
...
CTTxbxContent ctTxbxContent = ctShape.addNewTextbox().addNewTxbxContent();
XWPFParagraph textboxparagraph = new XWPFParagraph(ctTxbxContent.addNewP(), (IBody)doc);
XWPFRun textboxrun = textboxparagraph.createRun();
textboxrun.setText("The TextBox text...");
textboxrun.setFontSize(24);
...

How to insert values into an existing PDF on the fly?

There is a PDF with some fields to accept values from the user (for example, a "bio data" form). How can I insert the user's input into the correct fields of the existing PDF and generate the filled PDF?
If I use iTextSharp, how can I choose the coordinates at which to print the values?
Are there any design tools for drawing rectangular fields that accept values? My PDF template has lots of fields to fill from user input.
Thanks in advance.
There are two possibilities:
Your original PDF is a form:
You can check this by checking if the PDF has any fields as explained here: convert pdf editable fields into text using java programming
You'll need to adapt the Java code to C# code or you can use RUPS as shown in my answer to the question How to get specific types from AcroFields? Like PushButtonField, RadioCheckField, etc
In this case, filling out the form is easy:
PdfStamper pdfStamper = new PdfStamper(new PdfReader(templateFile), new FileStream(fileName, FileMode.Create));
AcroFields acroFields = pdfStamper.AcroFields;
acroFields.SetField(key, value);
pdfStamper.FormFlattening = true;
pdfStamper.Close();
You can have as many lines with SetField() as you want. In these lines key is the field name as defined in the original form; value is the value you want to add at the position(s) of that field.
The line with the pdfStamper.FormFlattening is optional. If you set that value to true, all interactivity will be removed: the form will no longer be a form. If you remove the line or set that value to false, then the form will still be a form. You'll be able to change the content of the fields and extract the value of the fields.
Your original PDF is not a form:
A PDF may look like a form to the human eye, but if it doesn't have AcroForm fields (and no XFA either), then a machine won't consider it as being a form. In this case, you have to understand that all the content is fixed at fixed coordinates on the page. You can add content at absolute positions, but the original content won't move.
There are different ways to add content to an existing PDF and they all involve PdfStamper. Once you have obtained a PdfContentByte object from this PdfStamper, you can add text as explained in the documentation. Read the sections Manipulating existing PDFs and Absolute positioning of text or take a look at the content tagged with the keyword PdfStamper. The watermark examples should be interesting too.
I would advise against this second approach, as it is very hard to find the exact coordinates to use. If your PDF isn't a form, turn it into a form using Adobe Acrobat and use the first approach. The first approach is much more future-proof: if you ever have to change something in your form, you can change that form without having to change your code (provided that you preserve the original field names).
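If you do end up positioning text by coordinates, part of the difficulty is the coordinate system itself: in default PDF user space the unit is the point (1/72 inch) and the origin is the bottom-left corner of the page, whereas measurements taken from a printout or a viewer are usually in millimetres from the top-left. A minimal sketch of the conversion, written in Java although the arithmetic is identical in C#; the class and method names are hypothetical, and the code assumes an unrotated page whose page box starts at (0, 0):

```java
// Hypothetical helper for turning ruler measurements into PDF user-space coordinates.
final class PdfCoordinates {

    /** One millimetre expressed in PDF points (72 points per inch, 25.4 mm per inch). */
    static final double POINTS_PER_MM = 72.0 / 25.4;

    static double mmToPoints(double mm) {
        return mm * POINTS_PER_MM;
    }

    /** The x coordinate is measured from the left edge in both systems. */
    static double xFromLeft(double distanceFromLeftMm) {
        return mmToPoints(distanceFromLeftMm);
    }

    /** The y axis points upwards in PDF, so a distance from the top edge has to be flipped. */
    static double yFromTop(double pageHeightPoints, double distanceFromTopMm) {
        return pageHeightPoints - mmToPoints(distanceFromTopMm);
    }

    public static void main(String[] args) {
        double a4HeightPoints = 842; // A4 is 595 x 842 points
        // A spot 30 mm from the left edge and 40 mm below the top edge of an A4 page.
        System.out.println(xFromLeft(30) + ", " + yFromTop(a4HeightPoints, 40));
    }
}
```

The resulting values are what an absolute-positioning call expects; pages with a rotation entry or a shifted page box need an additional offset.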
iTextSharp lets you do the same using its PdfStamper class.
Just a sample for your reference:
// create a PdfReader instance and read the content of the existing PDF file into it by providing its path
PdfReader pdfReader = new PdfReader(FILE_PATH);
// create a stamper instance to edit the existing file
PdfStamper pdfStamper = new PdfStamper(pdfReader, Response.OutputStream);
// perform your edit operation here.....
.
.
.
// close the pdfStamper instance
pdfStamper.Close();