Parsing a PDF file using Apache PDFBox to get outlines

I can use PDFBox to extract the outlines from a PDF, but it works for some PDFs and not for others.
Every one of these PDFs has outlines: when I open one in a PDF reader, I can click an outline entry to jump to a certain page.
Here is my code:
public static void main(String[] args) {
    try {
        PDDocument document = PDDocument.load(new File(filePath));
        PDDocumentOutline outline = document.getDocumentCatalog().getDocumentOutline();
        getOutlines(document, outline, "");
        document.close();
    } catch (InvalidPasswordException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    } catch (IOException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }
}

public static void getOutlines(PDDocument document, PDOutlineNode bookmark, String indentation) throws IOException {
    PDOutlineItem current = bookmark.getFirstChild();
    while (current != null) {
        PDPage currentPage = current.findDestinationPage(document);
        Integer pageNumber = document.getDocumentCatalog().getPages().indexOf(currentPage) + 1;
        System.out.println(current.getTitle() + "-------->" + pageNumber);
        getOutlines(document, current, indentation);
        current = current.getNextSibling();
    }
}
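
Two details in the traversal above may be worth a look: the indentation argument is passed down unchanged, and the outline root is dereferenced without a null check (as far as I know, getDocumentOutline() returns null when a document has no outline). The following is only a sketch of the same first-child/next-sibling walk on a hypothetical stand-in Node class, not on the PDFBox types, so that it runs without the library. The names OutlineWalk, Node, walk and sample are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

final class OutlineWalk {

    /** Hypothetical stand-in for an outline entry: a title, a first child and a next sibling. */
    static final class Node {
        final String title;
        Node firstChild;
        Node nextSibling;

        Node(String title) {
            this.title = title;
        }
    }

    /** Collects one line per entry, indenting each nesting level by two more spaces. */
    static void walk(Node parent, String indentation, List<String> out) {
        if (parent == null) {
            return; // nothing to traverse, e.g. a document without an outline
        }
        for (Node current = parent.firstChild; current != null; current = current.nextSibling) {
            out.add(indentation + current.title);
            walk(current, indentation + "  ", out); // children get a deeper indentation
        }
    }

    /** Builds a tiny sample tree: two chapters, the first with one section. */
    static Node sample() {
        Node root = new Node("root");
        Node chapter1 = new Node("Chapter 1");
        Node section = new Node("Section 1.1");
        Node chapter2 = new Node("Chapter 2");
        root.firstChild = chapter1;
        chapter1.firstChild = section;
        chapter1.nextSibling = chapter2;
        return root;
    }

    public static void main(String[] args) {
        List<String> lines = new ArrayList<>();
        walk(sample(), "", lines);
        lines.forEach(System.out::println);
    }
}
```

With the real API the same shape would apply to PDOutlineNode and PDOutlineItem. The page returned by findDestinationPage would presumably need a null check of its own before it is looked up in the page tree.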

Related

How can I convert DOCX to PDF using Apache POI and iText 7 with pdfCalligraph in Java?

I want to convert DOCX to PDF using Apache POI and iText 7 (with pdfCalligraph).
I have tried other versions of iText, but they have problems rendering ligatures in Indic languages.
import org.apache.poi.xwpf.converter.pdf.PdfConverter;
import org.apache.poi.xwpf.converter.pdf.PdfOptions;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.springframework.util.FileCopyUtils;
import java.io.*;

public class Docx2PdfConverterUsingPOI implements Docx2PdfConverter {

    public byte[] convert(byte[] docxData) {
        byte[] output = null;
        try {
            InputStream isFromFirstData = new ByteArrayInputStream(docxData);
            XWPFDocument document = new XWPFDocument(isFromFirstData);
            PdfOptions pdfOptions = PdfOptions.create();
            // pdfOptions.fontEncoding(BaseFont.IDENTITY_H);
            //make new file in c:\temp\
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            //Options options = Options.getTo(ConverterTypeTo.PDF).via(ConverterTypeVia.XWPF).subOptions(pdfOptions);
            PdfConverter.getInstance().convert(document, out, pdfOptions);
            document.close();
            return out.toByteArray();
        } catch (IOException e) {
            e.printStackTrace();
        }
        return output;
    }

    public static void main(String args[]) {
        Docx2PdfConverterUsingPOI docx2PdfConverterUsingPOI = new Docx2PdfConverterUsingPOI();
        String inputFile = "D:\\WORKSPACE\\yogesh\\letters\\out.docx";
        FileInputStream inputStream = null;
        try {
            inputStream = new FileInputStream(new File(inputFile));
            byte[] output = docx2PdfConverterUsingPOI.convert(FileCopyUtils.copyToByteArray(inputStream));
            FileCopyUtils.copy(output, new File("D:\\WORKSPACE\\yogesh\\letters\\out1.pdf"));
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
Can anyone help me use iText 7 with Apache POI for my DOCX to PDF conversion?
Also, can anyone explain how Apache POI uses iText to produce the converted output (so that I can change the iText Maven dependency accordingly)?

Adding ColorSpace to resources causes the stream to close

I am trying to add a color space to resources in a few very simple steps using PDFBox version 2.0.7, but it is not working.
I have a PDF, "pdf1.pdf". I read the color spaces from this file and add them to a HashMap, then I create new resources and try to add the color spaces to them. But it is not working.
In the first step, I read the color spaces from the source PDF file and add them to a HashMap:
    seperationColors = new HashMap<COSName, PDColorSpace>();
    PDDocument sourcePdfFile = null;
    try {
        sourcePdfFile = PDDocument.load(new FileInputStream(new File(pdfPath)));
        PDPage page = sourcePdfFile.getPages().get(0);
        page.getContents();
        for (COSName name : page.getResources().getColorSpaceNames()) {
            PDColor color = page.getResources().getColorSpace(name).getInitialColor();
            if (color.getColorSpace() instanceof PDSeparation) {
                seperationColors.put(name, page.getResources().getColorSpace(name));
            }
        }
    } catch (FileNotFoundException e) {
        // e.printStackTrace();
    } catch (IOException e) {
        // e.printStackTrace();
    } finally {
        if (sourcePdfFile != null)
            try {
                sourcePdfFile.close();
            } catch (IOException e) {
                e.printStackTrace();
            } finally {
                sourcePdfFile = null;
            }
    }
}
Then, at a later stage in the code, I want to create a new PDF document and add the color spaces from the source PDF to it:
PDResources newResources = new PDResources();
PDColorSpace colorSpace = originalDocumentColorSpaces.values().iterator().next();
newResources.add(colorSpace);
After the add operation (the third line above), newResources shows the error: COSDictionary{COSStream has been closed and cannot be read. Perhaps its enclosing PDDocument has been closed?}
colorSpace is of type PDSeparation.
Any clue?

I need to add screenshots of all the steps performed to one Word file using Selenium WebDriver

Please help me out.
Using Selenium WebDriver with Java, I want to add the screenshots of all the steps of a particular test case to a single Word document, and that file should be saved under the name of that test case.
public static void main(String[] args) {
    try {
        XWPFDocument docx = new XWPFDocument();
        XWPFRun run = docx.createParagraph().createRun();
        FileOutputStream out = new FileOutputStream(System.getProperty("user.dir") + "\\Result\\Screenshot");
        for (int counter = 1; counter <= 5; counter++) {
            captureScreenShot(docx, run, out);
            TimeUnit.SECONDS.sleep(1);
        }
        docx.write(out);
        out.flush();
        out.close();
        docx.close();
    } catch (Exception e) {
        e.printStackTrace();
    }
}

public static void captureScreenShot(XWPFDocument docx, XWPFRun run, FileOutputStream out) throws Exception {
    String screenshot_name = System.currentTimeMillis() + ".png";
    BufferedImage image = new Robot().createScreenCapture(new Rectangle(Toolkit.getDefaultToolkit().getScreenSize()));
    File file = new File(System.getProperty("user.dir") + "\\Result\\Screenshot" + screenshot_name);
    ImageIO.write(image, "png", file);
    InputStream pic = new FileInputStream(System.getProperty("user.dir") + "\\Result\\Screenshot" + screenshot_name);
    run.addBreak();
    run.addPicture(pic, XWPFDocument.PICTURE_TYPE_PNG, screenshot_name, Units.toEMU(350), Units.toEMU(350));
    pic.close();
    file.delete();
}
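
In the snippet above, the output file is always named "Screenshot" with no extension, and each image path is built by string concatenation without a separator after "Screenshot". A rough sketch of deriving one .docx path per test case is shown below. The class ReportPaths, the method reportFor and the character whitelist are assumptions made for illustration, not part of Selenium or Apache POI.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

final class ReportPaths {

    /**
     * Returns baseDir/Result/<sanitized test case name>.docx.
     * Every character outside letters, digits, '.', '_' and '-' becomes '_'
     * so that the name is safe to use as a file name.
     */
    static Path reportFor(Path baseDir, String testCaseName) {
        String safeName = testCaseName.replaceAll("[^A-Za-z0-9._-]", "_");
        return baseDir.resolve("Result").resolve(safeName + ".docx");
    }

    public static void main(String[] args) {
        Path baseDir = Paths.get(System.getProperty("user.dir"));
        System.out.println(reportFor(baseDir, "Login: happy path"));
    }
}
```

The resulting path could then be handed to a FileOutputStream in place of the hard-coded "\\Result\\Screenshot" string, assuming the Result directory already exists.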

Text getting cut off while creating a PDF file using Apache PDFBox 2.0.6

I am creating a PDF file by reading a text file using Apache PDFBox 2.0.6. Some of the text that is read is not displayed and gets cut off.
Below is the sample program I am using:
public static void main(String[] args) {
    // TODO Auto-generated method stub
    PDDocument doc = null;
    TextToPDF text2pdf = new TextToPDF();
    try {
        doc = text2pdf.createPDFFromText(new FileReader("C:/sampleTextRead2.txt"));
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        OutputStreamWriter writer = new OutputStreamWriter(out);
        PDFTextStripper stripper = new PDFTextStripper();
        stripper.writeText(doc, writer);
        writer.close();
        doc.save("C:/SamplePDF.pdf");
    } catch (FileNotFoundException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    } catch (IOException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }
}

Webdings font characters not extracted using PDFBox

I am using PDFBox to get the names of all fonts that are used in a PDF.
So far it has worked well. However, I recently came across a PDF that uses the 'Webdings' font, and PDFBox was not able to identify it. Could anyone help, please?
This is the code I have used:
public static Set<String> extractFonts(String pdfPath) throws IOException
{
    PDDocument doc = PDDocument.load(new File(pdfPath));
    PDPageTree pages = doc.getDocumentCatalog().getPages();
    Set<String> fontSet = new HashSet<String>();
    try {
        for (PDPage page : pages) {
            PDResources res = page.getResources();
            for (COSName fontName : res.getFontNames())
            {
                PDFont font = res.getFont(fontName);
                if (font != null) {
                    String fontUsedName = font.getName();
                    if (fontUsedName.contains("+")) {
                        fontUsedName = fontUsedName.substring(fontUsedName.indexOf("+") + 1, fontUsedName.length());
                    }
                    fontSet.add(fontUsedName);
                }
            }
        }
    } catch (Exception e) {
        e.printStackTrace();
    }
    System.out.println(fontSet);
    return fontSet;
}
I know that the 'Webdings' font is present because it is listed under File -> Properties -> Fonts in Adobe Reader.
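
As a side note on the name handling in the code above: a subset font name in a PDF starts with a tag of exactly six uppercase letters followed by "+", so cutting at the first "+" can also truncate names that merely contain a plus sign. It would also throw if a font name were null, which is assumed here to be possible. The sketch below only covers that string handling and does not by itself explain why Webdings is missing. The class FontNames and the method stripSubsetTag are hypothetical names.

```java
import java.util.regex.Pattern;

final class FontNames {

    // A subset tag is exactly six uppercase letters followed by '+', e.g. "ABCDEF+".
    private static final Pattern SUBSET_TAG = Pattern.compile("^[A-Z]{6}\\+");

    /** Removes a leading subset tag if present; returns null for a null name. */
    static String stripSubsetTag(String rawName) {
        if (rawName == null) {
            return null;
        }
        return SUBSET_TAG.matcher(rawName).replaceFirst("");
    }

    public static void main(String[] args) {
        System.out.println(stripSubsetTag("ABCDEF+Webdings")); // tag removed
        System.out.println(stripSubsetTag("Arial+Bold"));      // no six-letter tag, left unchanged
    }
}
```

In extractFonts this could stand in for the contains("+") branch, with null results skipped before they are added to the set.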