Webdings font characters not extracted using pdfbox - pdfbox

I am using PDFBox to get the names of all fonts that are used in a PDF.
So far it has worked well. However, I recently came across a PDF that uses the 'Webdings' font, and PDFBox was not able to identify it. Could anyone help, please?
This is the code I have used:
public static Set<String> extractFonts(String pdfPath) throws IOException
{
PDDocument doc = PDDocument.load(new File(pdfPath));
PDPageTree pages = doc.getDocumentCatalog().getPages();
Set<String> fontSet = new HashSet<String>();
try{
for(PDPage page:pages){
PDResources res = page.getResources();
for (COSName fontName : res.getFontNames())
{
PDFont font = res.getFont(fontName);
if(font != null){
String fontUsedName = font.getName();
if(fontUsedName.contains("+")) {
fontUsedName = fontUsedName.substring(fontUsedName.indexOf("+")+1, fontUsedName.length());
}
fontSet.add(fontUsedName);
}
}
}
} catch (Exception e) {
e.printStackTrace();
} finally {
// release the file handle even if font extraction fails
doc.close();
}
System.out.println(fontSet);
return fontSet;
}
I know that the 'Webdings' font is present because it is listed under File -> Properties -> Fonts in Adobe Reader.
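One detail in the code above that can be tightened is the subset-prefix handling. A subset font name carries a tag of six uppercase letters followed by `+` (for example `ABCDEF+Webdings`), and as far as I know `font.getName()` can return null when the font dictionary has no BaseFont entry, in which case `contains("+")` throws and the catch block ends the whole loop. The following is a minimal sketch in plain Java with no PDFBox dependency; `FontNames` and `stripSubsetPrefix` are hypothetical helper names, not part of PDFBox.

```java
import java.util.regex.Pattern;

final class FontNames {
    // A subset tag is six uppercase letters followed by '+', e.g. "ABCDEF+".
    private static final Pattern SUBSET_TAG = Pattern.compile("^[A-Z]{6}\\+");

    private FontNames() {
    }

    /** Returns the name without a leading subset tag, or null if the name is null. */
    static String stripSubsetPrefix(String fontName) {
        if (fontName == null) {
            return null; // caller decides whether to skip unnamed fonts
        }
        return SUBSET_TAG.matcher(fontName).replaceFirst("");
    }

    public static void main(String[] args) {
        System.out.println(stripSubsetPrefix("ABCDEF+Webdings")); // Webdings
        System.out.println(stripSubsetPrefix("Arial+Bold"));      // unchanged: no six-letter tag
        System.out.println(stripSubsetPrefix(null));              // null
    }
}
```

In the loop above, the result of `font.getName()` would be passed through this helper and only added to the set when it is not null.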

Related

How can I convert docx to PDF using Apache POI and iText 7 with the pdfCalligraph add-on in Java?

I want to convert docx to PDF using Apache POI and iText 7 (with the pdfCalligraph add-on).
I have tried other versions of iText, but they show problems with ligatures in Indic languages.
import org.apache.poi.xwpf.converter.pdf.PdfConverter;
import org.apache.poi.xwpf.converter.pdf.PdfOptions;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.springframework.util.FileCopyUtils;
import java.io.*;
public class Docx2PdfConverterUsingPOI implements Docx2PdfConverter{
public byte[] convert(byte[] docxData) {
byte[] output = null;
try {
InputStream isFromFirstData = new ByteArrayInputStream(docxData);
XWPFDocument document = new XWPFDocument(isFromFirstData);
PdfOptions pdfOptions = PdfOptions.create();
// pdfOptions.fontEncoding(BaseFont.IDENTITY_H);
//make new file in c:\temp\
ByteArrayOutputStream out = new ByteArrayOutputStream();
//Options options =
//Options.getTo(ConverterTypeTo.PDF).via(ConverterTypeVia.XWPF).
//subOptions(pdfOptions);
PdfConverter.getInstance().convert(document, out, pdfOptions);
document.close();
return out.toByteArray();
} catch (IOException e) {
e.printStackTrace();
}
return output;
}
public static void main(String args[]){
Docx2PdfConverterUsingPOI docx2PdfConverterUsingPOI =new
Docx2PdfConverterUsingPOI();
String inputFile = "D:\\WORKSPACE\\yogesh\\letters\\out.docx";
FileInputStream inputStream = null;
try {
inputStream = new FileInputStream(new File(inputFile));
byte[]output =
docx2PdfConverterUsingPOI.convert(FileCopyUtils.
copyToByteArray(inputStream));
FileCopyUtils.copy(output,new
File("D:\\WORKSPACE\\yogesh\\letters\\out1.pdf"));
} catch (FileNotFoundException e) {
e.printStackTrace();
} catch (IOException e) {
e.printStackTrace();
}
}
}
Can anyone help me use iText 7 with Apache POI for my docx to PDF conversion?
Also, can anyone explain how Apache POI uses iText to produce the conversion result (so that I can change the iText Maven dependency accordingly)?

Parsing PDF file using Apache PDFBox to get outlines

I can use PDFBox to extract the outlines from a PDF, but it works for some PDFs and not for others.
Every PDF has outlines, and when I open one in a PDF reader I can click an outline entry to jump to a certain page.
Here is my code:
public static void main(String[] args) {
try {
PDDocument document = PDDocument.load(new File(filePath));
PDDocumentOutline outline = document.getDocumentCatalog().getDocumentOutline();
getOutlines(document, outline, "");
document.close();
} catch (InvalidPasswordException e) {
// TODO Auto-generated catch block
e.printStackTrace();
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
public static void getOutlines(PDDocument document, PDOutlineNode bookmark, String indentation) throws IOException{
PDOutlineItem current = bookmark.getFirstChild();
while (current != null) {
PDPage currentPage = current.findDestinationPage(document);
Integer pageNumber = document.getDocumentCatalog().getPages().indexOf(currentPage) + 1;
System.out.println(current.getTitle() + "-------->" + pageNumber);
getOutlines(document, current, indentation);
current = current.getNextSibling();
}
}

Understanding loading of font in PDFBox 2.0

I have finally succeeded in making PDFBox print my Unicode characters.
But now I would like to understand the solution that I have come up with.
The code below works and prints a ≥ to the page.
Two changes do not work:
changing
PDType0Font.load(documentMock, systemResourceAsStream, true);
to
PDType0Font.load(documentMock, systemResourceAsStream, false);
changing
final PDFont robotoLight = loadFontAlternative("Roboto-Light.ttf");
to
final PDFont robotoLight = loadFont("Roboto-Light.ttf");
The first change prints two dots instead of the character.
What does embedSubset do, given that the output is wrong when it is set to false?
The documentation is too sparse for me to understand.
The second change gives the following exception: Exception in thread "main" java.lang.IllegalArgumentException: U+2265 is not available in this font's encoding: WinAnsiEncoding
This problem has been covered in many other questions that predate PDFBox 2.0, when there was a bug in Unicode handling.
So they do not answer the question directly.
That aside, the problem is clear: I should not set the encoding to WinAnsiEncoding but to something different.
But what should the encoding be, and why is there no UTF-8 encoding or similar available?
There is no documentation in COSName about the many options.
public class SimpleReportUnicode {
public static void main(String[] args) throws IOException {
PDDocument report = createReport();
final String fileLocation = "c:/SimpleFormUnicode.pdf";
report.save(fileLocation);
report.close();
}
private static PDDocument createReport() throws IOException {
PDDocument document = new PDDocument();
PDPage page = new PDPage();
document.addPage(page);
PDPageContentStream contentStream = new PDPageContentStream(document, page);
final PDFont robotoLight = loadFontAlternative("Roboto-Light.ttf");
writeText(contentStream, robotoLight, 100, 650);
contentStream.close();
return document;
}
private static void writeText(PDPageContentStream contentStream, PDFont font, double x, double y) {
try {
contentStream.beginText();
contentStream.setFont(font, 12);
contentStream.moveTextPositionByAmount((float) x, (float) y);
String unicode = "≥";
contentStream.showText(unicode);
contentStream.endText();
}
catch (IOException e) {
}
}
private static PDFont loadFont(String location) {
PDFont font;
try {
PDDocument documentMock = new PDDocument();
InputStream systemResourceAsStream = ClassLoader.getSystemResourceAsStream(location);
Encoding encoding = Encoding.getInstance(COSName.WIN_ANSI_ENCODING);
font = PDTrueTypeFont.load(documentMock, systemResourceAsStream, encoding);
}
catch (IOException e) {
throw new RuntimeException("IO exception");
}
return font;
}
private static PDFont loadFontAlternative(String location) {
PDDocument documentMock = new PDDocument();
InputStream systemResourceAsStream = ClassLoader.getSystemResourceAsStream(location);
PDFont font;
try {
font = PDType0Font.load(documentMock, systemResourceAsStream, true);
}
catch (IOException e) {
throw new RuntimeException("IO exception");
}
return font;
}
}
EDIT
If you want to use the same font as in the code, Roboto is available here:
https://fonts.google.com/specimen/Roboto
Add Roboto-Light.ttf to your classpath and the code should work out of the box.
As discussed in the comments:
The problem with embedSubsets went away by using version 2.0.7. (Btw 2.0.8 was released today);
The problem "U+2265 is not available in this font's encoding: WinAnsiEncoding" is explained in the FAQ and the solution is to use PDType0Font.load() which you already did in your working version;
There is no UTF-8 encoding for fonts because it isn't available in the PDF specification;
Using embedSubsets true produces a 4KB file; with false the file is 100KB because the full font is embedded, so true is usually best.
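The WinAnsiEncoding error can be illustrated without PDFBox. A simple font reaches its glyphs through a single-byte encoding, so only a fixed set of at most 256 characters is available, and ≥ (U+2265) is not one of them. The sketch below uses Java's `windows-1252` charset as a stand-in for WinAnsiEncoding; that is an assumption made for illustration (the two are closely related, but this is not how PDFBox performs its check), and the class name `WinAnsiCheck` is hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

final class WinAnsiCheck {
    // Assumption: windows-1252 approximates WinAnsiEncoding for this demonstration.
    private static final Charset CP1252 = Charset.forName("windows-1252");

    private WinAnsiCheck() {
    }

    /** Returns true if every character of the text fits the single-byte charset. */
    static boolean isEncodable(String text) {
        // Encoders are not thread-safe, so create one per call.
        CharsetEncoder encoder = CP1252.newEncoder();
        return encoder.canEncode(text);
    }

    public static void main(String[] args) {
        System.out.println(isEncodable("caf\u00e9")); // e with acute accent is in the code page
        System.out.println(isEncodable("\u2265"));    // greater-than-or-equal sign is not
    }
}
```

A check of this kind suggests when text needs a composite font loaded with PDType0Font.load() rather than a simple font with a fixed encoding.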

Pdf generation in arabic language is printing garbage values

I am using the ComponentOne library to generate a PDF document and save it in phone storage. Here is my code to print just one line.
public ViewStatementDetails()
{
this.InitializeComponent();
this.navigationHelper = new NavigationHelper(this);
this.navigationHelper.LoadState += this.NavigationHelper_LoadState;
this.navigationHelper.SaveState += this.NavigationHelper_SaveState;
pdf = new C1PdfDocument(PaperKind.Letter);
pdf.Clear();
}
private void Print_Click(object sender, RoutedEventArgs e)
{
LoadingProgress.Visibility = Windows.UI.Xaml.Visibility.Visible;
PDFTest_Loaded();
}
async void PDFTest_Loaded()
{
try
{
WriteableBitmap writeableBmp = await initializeImage();
pdf = new C1PdfDocument(PaperKind.Letter);
CreateDocumentText(pdf);
StorageFile Assets = await Windows.Storage.ApplicationData.Current.LocalFolder.CreateFileAsync("Salik Statement.pdf", CreationCollisionOption.GenerateUniqueName);
PdfUtils.Save(pdf, Assets);
LoadingProgress.Visibility = Visibility.Collapsed;
}
catch (Exception ex)
{
Debug.WriteLine(ex.ToString());
Debugger.Break();
LoadingProgress.Visibility = Visibility.Collapsed;
}
}
async void CreateDocumentText(C1PdfDocument pdf)
{
try
{
pdf.Landscape = false;
// measure and show some text
var text = App.GetResource("RoadAndSafetyheading");
var font = new Font("Segoe UI Light", 36, PdfFontStyle.Bold);
var fmt = new StringFormat();
fmt.Alignment = HorizontalAlignment.Center;
// measure it
var sz = pdf.MeasureString(text, font, 72 * 3, fmt);
var rc = new Rect(0, 0, pdf.PageRectangle.Width, sz.Height);
rc = PdfUtils.Offset(rc, 0, 0);
// draw the text
pdf.DrawString(text, font, Colors.Orange, rc, fmt);
}
catch (Exception e)
{
}
}
The above code works perfectly, but my application supports two languages, English and Arabic. When I am in Arabic mode and generate the same PDF, it prints garbage values in the PDF file. I am attaching an image of the printed characters.
Arabic text requires Unicode characters and a Unicode font embedded in the PDF (the PDF format does not support Unicode with its built-in fonts). If you are using ComponentOne, try setting .EmbedTrueTypeFonts = true (see details here)

How to display a pdf file using PDFBox in JPanel?

I have already created a JForm in NetBeans that can read a PDF file using PDFBox. The problem is that I have used the method PDPage.convertToImage(), which is very slow. Can anyone please help me display the PDF in the JPanel faster using PDFBox?
The code I have written is inside an ActionListener for a JButton.
File f = null;
ArrayList<JLabel> jl = new ArrayList<JLabel>();
BufferedImage bi = null;
JFileChooser fc = new JFileChooser();
int x=fc.showOpenDialog(null);
if(x==JFileChooser.APPROVE_OPTION)
{
f=fc.getSelectedFile();
}
PDDocument doc=null;
try {
doc = PDDocument.load(f);
} catch (IOException ex) {
JOptionPane.showMessageDialog(null, "not done\n"+ex);
}
List pages = doc.getDocumentCatalog().getAllPages();
Iterator itr = pages.iterator();
int q=0;
while(itr.hasNext())
{
PDPage page = (PDPage)itr.next();
try
{
bi = page.convertToImage();
q++;
jl.add(new JLabel(new ImageIcon(bi)));
}catch(Exception e)
{
JOptionPane.showMessageDialog(null, e);
}
}
itr = jl.iterator();
while(itr.hasNext())
{
viewPanel.setVisible(false);
viewPanel.add((JLabel)itr.next());
viewPanel.setVisible(true);
}
JOptionPane.showMessageDialog(null, "done");
NetBeans has several plugins to display PDFs
http://plugins.netbeans.org/plugin/5809/java-pdf-reader
http://plugins.netbeans.org/plugin/11676/netbeans-pdfviewer
http://plugins.netbeans.org/plugin/17/pdf-viewer-javafx-converter-and-bookmarking-application
Have you tried any of them?
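Independently of the plugins, part of the delay in the question's code probably comes from converting every page to an image before anything is shown. One possible approach is to render a page only when it is requested and keep a few recent pages in memory. The sketch below is plain Java and makes several assumptions: `PageImageCache` is a hypothetical class, and the `renderer` function is a placeholder for whatever actually produces the image (in the question's code, the call to `page.convertToImage()`).

```java
import java.awt.image.BufferedImage;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.IntFunction;

final class PageImageCache {
    private final IntFunction<BufferedImage> renderer;
    private final Map<Integer, BufferedImage> cache;

    PageImageCache(IntFunction<BufferedImage> renderer, final int maxPages) {
        this.renderer = renderer;
        // Access-ordered map that drops the least recently used page when full.
        this.cache = new LinkedHashMap<Integer, BufferedImage>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<Integer, BufferedImage> eldest) {
                return size() > maxPages;
            }
        };
    }

    /** Returns the image for a page, rendering it only if it is not cached. */
    synchronized BufferedImage get(int pageIndex) {
        BufferedImage image = cache.get(pageIndex);
        if (image == null) {
            image = renderer.apply(pageIndex);
            cache.put(pageIndex, image);
        }
        return image;
    }

    public static void main(String[] args) {
        final int[] renders = {0};
        // Placeholder renderer: counts calls and returns a blank image.
        PageImageCache pages = new PageImageCache(index -> {
            renders[0]++;
            return new BufferedImage(10, 10, BufferedImage.TYPE_INT_RGB);
        }, 2);
        pages.get(0);
        pages.get(0); // served from the cache, no second render
        pages.get(1);
        System.out.println("renders: " + renders[0]); // renders: 2
    }
}
```

In a Swing application the call to `get` would normally run off the Event Dispatch Thread, for example in a SwingWorker, so that the window stays responsive while a page is rendered.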