Loading custom fonts to PDF document not working - pdfbox

We are converting PDF documents to HTML. Most of the documents use the Courier Final Draft font. The converted HTML does not render properly because the text overlaps. It looks like PDFBox is not finding the font and is therefore miscalculating the spacing between words. We are trying to load the font (Courier Final Draft) into the PDDocument, but it is still not identified. We get the error "Error loading font 'Error loading font 'NOLUXZ+CourierFinalDraft". Can somebody guide me on this? Thanks
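A note on the font name in that error message: the "NOLUXZ+" prefix follows the PDF convention for subset fonts, where the name of an embedded font subset starts with six uppercase letters and a plus sign. The sketch below uses only the Java standard library to separate that tag from the base name, which can help when comparing the name in the PDF with the name of a font file. The class and method names are hypothetical and are not part of PDFBox.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SubsetFontName {
    // A subset tag is six uppercase ASCII letters followed by '+'.
    private static final Pattern SUBSET_TAG = Pattern.compile("^[A-Z]{6}\\+(.+)$");

    /** Returns true if the font name starts with a subset tag. */
    public static boolean isSubset(String fontName) {
        return SUBSET_TAG.matcher(fontName).matches();
    }

    /** Returns the name without a leading subset tag, or the name unchanged. */
    public static String baseFontName(String fontName) {
        Matcher m = SUBSET_TAG.matcher(fontName);
        return m.matches() ? m.group(1) : fontName;
    }

    public static void main(String[] args) {
        String name = "NOLUXZ+CourierFinalDraft"; // name taken from the error message
        System.out.println(isSubset(name));
        System.out.println(baseFontName(name));
    }
}
```

If the name carries a subset tag, the PDF embeds only part of the font, so the glyphs used for rendering come from the embedded data rather than from a separately loaded file.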
I tried to load the font as below -
PDDocument doc = new PDDocument();
PDTrueTypeFont.load(doc, ResourceUtils.getFile(
"classpath:fonts/Courier_Final_Draft_Regular.ttf"), MacRomanEncoding.INSTANCE);
PDType0Font.load(doc, ResourceUtils.getFile(
"classpath:fonts/Courier_Final_Draft_Regular.ttf"));
PDType0Font.load(doc, new FileInputStream(ResourceUtils.getFile(
"classpath:fonts/Courier_Final_Draft_Regular.ttf")), false);

Related

Why does the ordinate of SetFixedPosition() in iText7 .NET not work well in a web application?

I used iText7 .NET in a demo Console project to add text to an existing PDF. The code is below:
PdfDocument pdfTemple = new PdfDocument(new PdfReader(templateFile), new PdfWriter(templeFile));
Document documentTemple = new Document(pdfTemple, PageSize.A4);
Text text = new Text(string.Format(@"{0} / {1} ", month, year))
.SetBackgroundColor(ColorConstants.WHITE)
.SetBold()
.SetFontSize(11);
documentTemple.Add(new Paragraph(text).SetFixedPosition(1, 424, 740, 60));
text = new Text(DateTime.Now.ToString("MM/dd/yyy"))
.SetBackgroundColor(ColorConstants.WHITE)
//.SetBold()
.SetFontSize(10);
documentTemple.Add(new Paragraph(text).SetFixedPosition(1, 503, 710, 60));
documentTemple.Close();
It works well in the Console project demo. But when I use the same code in a web application (MVC, .NET 4.7.2), it doesn't work: only text with an ordinate between 50 and 350 is displayed in the created PDF file.
I need to add text at positions with an ordinate between 40 and 740. How can I make this work in my web application (MVC, .NET 4.7.2)?
Thanks.
Additions:
I found that if I open the template PDF (the one I write text into) in Photoshop, I get the following screen:
If I write text into the white area shown in the screenshot above, the text does not show in the PDF file. If I write text into another area, the text does show in the PDF file.

How to identify checkboxes in a flat pdf?

Team,
I have to validate a flattened PDF as part of a requirement. This PDF has checkboxes. I used the Apache PDFBox library to read the contents of this PDF, but it only reads the text and does not identify the checkboxes. Attached is a screenshot of a PDF file similar to the one I am using (Flat PDF with Checkbox).
Can you please suggest an approach to identify and validate these checkboxes?
Code snippet used:
PDFTextStripper stripper = new PDFTextStripper() ;
PDDocument document = PDDocument.load(new File("D:\\test.pdf"));
stripper.setStartPage(1);
stripper.setEndPage(1);
stripper.setSortByPosition(true);
pdfTextContent = stripper.getText(document);
System.out.println(pdfTextContent);

Lost some text when extracting pdf

I've tried to get all the text on the page using iText, but I have no idea why every coordinate text loses its last two characters.
PdfDocument pdfDoc = new PdfDocument(new PdfReader(@"E:\Coding\COOR.pdf"));
LocationTextExtractionStrategy strategy = new LocationTextExtractionStrategy();
PdfCanvasProcessor parser = new PdfCanvasProcessor(strategy);
parser.ProcessPageContent(pdfDoc.GetFirstPage());
Console.Write(strategy.GetResultantText());
pdfDoc.Close();
Console.WriteLine("Great!");
Console.ReadKey();
You can also download my code from
https://1drv.ms/u/s!Al1hUSZtR4OjwU3XVBRQGneVaZlS
In short
The reason for that "lost text" is that the missing "text" isn't there to start with!
In detail
The contents of your PDF file are constructed in a misleading manner.
On the one hand there are very many path definitions which then are stroked (drawn). These drawings create what you can see in a viewer, both text and table lines.
On the other hand there are a few text drawing instructions to draw text using text rendering mode 3 which is... invisible! These drawings create the text you can copy&paste in a viewer or extract using iText.
Unfortunately the text in the text drawing instructions and the text drawn using paths does not match completely. The text you retrieve via copy&paste or text extraction, therefore, differs from your expectations.
Also, the glyph sizes and positions are not exactly the same.
To illustrate this I made the text drawing instructions use the normal (fill) text rendering mode. The top left corner which originally looks like this:
with that change looks like this:
As you see the formerly invisible text is only approximately at the same position as the visible drawings, and it is somewhat broken: The symbol for degrees is weirdly represented as "¡ã", and the longitude fractional seconds and the following symbol for seconds are missing.
To correctly extract the originally visible data, you'll need to use OCR instead of text extraction.
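The invisible-text mechanism described above can be spotted in a decoded page content stream, where the operator sequence "3 Tr" sets text rendering mode 3. The following is a minimal sketch using only the Java standard library. It assumes the content stream has already been decompressed into a string, and it splits on whitespace without parsing string literals, so it is a rough check rather than a real content stream parser. The class name, method name, and sample streams are hypothetical.

```java
public class InvisibleTextCheck {

    /**
     * Returns true if the decoded content stream contains the operator
     * sequence "3 Tr", which selects the invisible text rendering mode.
     * Simplification: tokens are split on whitespace only, so text inside
     * string literals is not distinguished from operators.
     */
    public static boolean setsInvisibleMode(String contentStream) {
        String[] tokens = contentStream.trim().split("\\s+");
        for (int i = 1; i < tokens.length; i++) {
            if (tokens[i].equals("Tr") && tokens[i - 1].equals("3")) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Hypothetical sample streams, not taken from the PDF in the question.
        String invisible = "BT /F1 12 Tf 3 Tr 72 700 Td (N 31) Tj ET";
        String visible = "BT /F1 12 Tf 0 Tr 72 700 Td (N 31) Tj ET";
        System.out.println(setsInvisibleMode(invisible));
        System.out.println(setsInvisibleMode(visible));
    }
}
```

A page whose only text operators run in mode 3, while the visible glyphs are drawn as paths, matches the situation described in this answer.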

Actually cropping a PDF with PDF Clown

My objective is actually cropping a PDF file with PdfClown.
There are a lot of tools/library that allow cropping PDF, changing the PDF cropBox. This permits hiding contents outside a rectangular area, but content is still there, it might be accessed through a PDF parser and PDF size does not change.
On the contrary what I need is creating a new page containing only the contents inside the rectangular area.
So far I've tried scanning contents and selectively cloning them. But I didn't succeed yet. Any suggestions on using PdfClown for that?
I've seen that someone is trying something similar with PDFBox ("Cropping a region from a PDF page with PDFBox"), without success so far.
A bit late, but maybe it helps someone;
I am successfully doing what you are asking for, but with other libraries.
Required libraries : iText 4 or 5 and Ghostscript
Step 1 with pseudo code
Using iText, Create a PDFWRITER instance with a blank Doc. Open a PDFREADER object to the original file you want to crop. Import the Page, get a PDFTemplate Object from the source, set its .boundingBox property to the desired cropbox, wrap the template into an iText Image object and paste it onto the new page at an absolute position.
Dim reader As New PdfReader(sourcefile)
Dim doc As New Document()
Dim writer As PdfWriter = PdfWriter.GetInstance(doc, New System.IO.FileStream(outputfilename, System.IO.FileMode.Create))
' get the source page as an imported page
Dim page As PdfImportedPage = writer.GetImportedPage(reader, indexOfPageToGet)
' create a PdfTemplate object at original size from the source - see the iText in Action book, page 91, for full details
Dim pdftemp As PdfTemplate = page.CreateTemplate(page.Width, page.Height)
' paste the original page onto the template object; see the iText documentation for what those parameters do (scaling, mirroring)
pdftemp.AddTemplate(page, 1, 0, 0, 1, 0, 0)
' now the critical part - set the .boundingBox property on the template. This makes all objects outside the rectangle invisible
pdftemp.boundingBox = {iText Rectangle Structure with new Cropbox}
' template not needed anymore
writer.ReleaseTemplate(pdftemp)
' create an iText Image object as a wrapper for the template - with this img object, absolute positioning on the final page is much easier
Dim img As iTextSharp.Text.Image = Image.GetInstance(pdftemp)
' set img position
img.SetAbsolutePosition(x, y)
' set optional rotation if needed
img.RotationDegrees = 0
' finally, this adds the actual content to the new document
doc.Add(img)
' cleanup
doc.Close()
reader.Close()
writer.Close()
The output file will visually look cropped, but the objects are still present in the PDF stream. The file size will probably change very little at this point.
Step 2:
Using Ghostscript and output device pdfwrite, combined with the correct command line parameters you can re-process the PDF from Step 1. This will give you a much smaller PDF. See Ghostscript documentation for the arguments https://www.ghostscript.com/doc/9.52/Use.htm
This step actually gets rid of the objects that are outside the bounding box, which is the requirement in your original post, at least for the files that I deal with.
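The answer does not list the exact Ghostscript arguments, so the following is only a sketch of what a minimal Step 2 invocation could look like when launched from Java. The switches used (-dBATCH, -dNOPAUSE, -sDEVICE=pdfwrite, -sOutputFile=) are standard Ghostscript options, but choosing exactly this set is an assumption, as are the executable name "gs", the class and method names, and the file names. Whether these switches alone discard the clipped objects depends on the file.

```java
import java.util.List;

public class GhostscriptRewrite {

    /** Builds a command line that re-writes a PDF through the pdfwrite device. */
    public static List<String> buildCommand(String gsExecutable, String input, String output) {
        return List.of(
                gsExecutable,
                "-dBATCH",                 // exit after processing the input
                "-dNOPAUSE",               // do not prompt between pages
                "-sDEVICE=pdfwrite",       // write a new PDF
                "-sOutputFile=" + output,
                input);
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 2) {
            // Run Ghostscript on the given input and output paths.
            Process p = new ProcessBuilder(buildCommand("gs", args[0], args[1]))
                    .inheritIO()
                    .start();
            System.out.println("exit code: " + p.waitFor());
        } else {
            // Without arguments, only print a sample command with hypothetical file names.
            System.out.println(String.join(" ",
                    buildCommand("gs", "step1-cropped.pdf", "step2-rewritten.pdf")));
        }
    }
}
```

Running the class with two arguments starts Ghostscript on those paths. Without arguments it only prints the command, which is useful for checking the switches before running them.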
Optional Step 3:
Using mutool clean with the -g option you can clean up unused XREF objects. Your original PDF probably had a lot of xrefs, which increase the file size. After cropping, some of them may not be needed anymore.
https://mupdf.com/docs/manual-mutool-clean.html
The PDF format is a tricky thing. Normally I would agree with @Tilman Hausherr: my suggestion may not work for all files and covers the 'almost impossible' scenario, but it works for all cases that I deal with.

Change font size in text box - apache poi word docx

I found the answer that explains how to insert a new text box into a docx document:
create text box in document .docx using apache poi
The problem is that I cannot change the font size inside a newly created text box.
Does anyone know how to do that?
Reference : create text box in document .docx using apache poi
The ctTxbxContent.addNewP() in my code creates a CTP object. The XWPFParagraph has a constructor XWPFParagraph(org.openxmlformats.schemas.wordprocessingml.x2006.main.CTP prgrph, IBody part). So you can get a XWPFParagraph from the CTP object and then use the default apache-poi methods further.
...
CTTxbxContent ctTxbxContent = ctShape.addNewTextbox().addNewTxbxContent();
XWPFParagraph textboxparagraph = new XWPFParagraph(ctTxbxContent.addNewP(), (IBody)doc);
XWPFRun textboxrun = textboxparagraph.createRun();
textboxrun.setText("The TextBox text...");
textboxrun.setFontSize(24);
...