Extract text data line by line from pdf using pdfbox API in java

I am extracting text from a PDF with the Apache PDFBox API, but the code below does not return the text sequentially (line by line).
Code:
try {
    RandomAccess scratchFile = null;
    pdDoc = PDDocument.loadNonSeq(new File(fileName), scratchFile);
    pdfStripper = new PDFTextStripper();
    parsedText = pdfStripper.getText(pdDoc);
    // System is a class name and must be capitalized
    System.out.println(parsedText);
} catch (IOException e) {
    System.err.println("Unable to open PDF Parser. " + e.getMessage());
    return null;
}
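One thing that may help with reading order is calling pdfStripper.setSortByPosition(true) before getText, which asks PDFBox to order text by its position on the page rather than by its order in the content stream. Once the text comes back as a single string, it can be walked line by line. The following is a minimal sketch of that second step only, using the JDK alone; the class name TextLines and the method splitLines are hypothetical and not part of PDFBox.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of PDFBox): splits already-extracted text into non-empty lines.
class TextLines {

    static List<String> splitLines(String parsedText) {
        List<String> lines = new ArrayList<>();
        // \R matches any line break (\n, \r\n, \r), so the platform separator does not matter
        for (String line : parsedText.split("\\R")) {
            if (!line.trim().isEmpty()) {
                lines.add(line);
            }
        }
        return lines;
    }
}
```

With the code from the question, this would be used as TextLines.splitLines(parsedText), followed by a loop over the returned list.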

Related

Parsing PDF file using Apache PDFBox to get outlines

I can use PDFBox to extract the outlines (bookmarks) from a PDF, but it works for some PDFs and not for others.
Every PDF I tested has outlines: when I open one in a PDF reader, I can click an outline entry to jump to the corresponding page.
Here is my code:
public static void main(String[] args) {
    // filePath is assumed to be defined elsewhere in the class
    try (PDDocument document = PDDocument.load(new File(filePath))) {
        PDDocumentOutline outline = document.getDocumentCatalog().getDocumentOutline();
        if (outline == null) {
            // getDocumentOutline() returns null when the catalog has no outline entry
            System.out.println("This document has no outline.");
            return;
        }
        getOutlines(document, outline, "");
    } catch (InvalidPasswordException e) {
        e.printStackTrace();
    } catch (IOException e) {
        e.printStackTrace();
    }
}
public static void getOutlines(PDDocument document, PDOutlineNode bookmark, String indentation) throws IOException {
    PDOutlineItem current = bookmark.getFirstChild();
    while (current != null) {
        // findDestinationPage returns null if the item has no resolvable page destination
        PDPage currentPage = current.findDestinationPage(document);
        int pageNumber = (currentPage == null) ? -1 : document.getDocumentCatalog().getPages().indexOf(currentPage) + 1;
        System.out.println(indentation + current.getTitle() + "-------->" + pageNumber);
        // indent child bookmarks one level deeper
        getOutlines(document, current, indentation + "    ");
        current = current.getNextSibling();
    }
}

Webdings font characters not extracted using pdfbox

I am using PDFBox to get the names of all fonts used in a PDF.
So far it has worked well. However, I recently came across a PDF that uses the 'Webdings' font, and PDFBox did not report it. Could anyone help, please?
This is the code I have used:
public static Set<String> extractFonts(String pdfPath) throws IOException {
    Set<String> fontSet = new HashSet<String>();
    // try-with-resources closes the document even if reading a page fails
    try (PDDocument doc = PDDocument.load(new File(pdfPath))) {
        for (PDPage page : doc.getDocumentCatalog().getPages()) {
            PDResources res = page.getResources();
            if (res == null) {
                continue; // page without a resource dictionary
            }
            for (COSName fontName : res.getFontNames()) {
                PDFont font = res.getFont(fontName);
                if (font == null || font.getName() == null) {
                    continue;
                }
                String fontUsedName = font.getName();
                // drop the subset tag, e.g. "ABCDEF+Webdings" becomes "Webdings"
                if (fontUsedName.contains("+")) {
                    fontUsedName = fontUsedName.substring(fontUsedName.indexOf("+") + 1);
                }
                fontSet.add(fontUsedName);
            }
        }
    }
    System.out.println(fontSet);
    return fontSet;
}
I confirmed that the 'Webdings' font is present using File -> Properties -> Fonts in Adobe Reader.
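The "+" handling above deals with subset fonts, whose names carry a tag of six uppercase letters followed by a plus sign (for example "ABCDEF+Webdings"). Cutting at the first "+" would also truncate a name that merely contains a plus sign, so a stricter match may be safer. A minimal sketch using only the JDK follows; the class FontNames and the method stripSubsetTag are hypothetical names, not PDFBox API. Note also that the loop above only looks at page-level resources, so a font referenced only from a nested resource dictionary (for instance inside a form XObject) would not show up; that is one possible reason for a missing font, not a confirmed diagnosis.

```java
import java.util.regex.Pattern;

// Hypothetical helper (not part of PDFBox): removes a PDF subset tag from a font name.
class FontNames {

    // A subset tag is exactly six uppercase letters followed by '+'
    private static final Pattern SUBSET_TAG = Pattern.compile("^[A-Z]{6}\\+");

    static String stripSubsetTag(String fontName) {
        // Names without a leading subset tag are returned unchanged
        return SUBSET_TAG.matcher(fontName).replaceFirst("");
    }
}
```

In the loop above this would replace the contains("+") branch with fontSet.add(FontNames.stripSubsetTag(font.getName())).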

How to display a pdf file using PDFBox in JPanel?

I have already created a form in NetBeans that reads a PDF file using PDFBox. The problem is that I use the method PDPage.convertToImage(), which is very slow. Can anyone help me display the PDF in the JPanel faster using PDFBox?
The code below is inside an ActionListener for a JButton.
JFileChooser fc = new JFileChooser();
if (fc.showOpenDialog(null) != JFileChooser.APPROVE_OPTION) {
    return; // the user cancelled, so there is no file to load
}
File f = fc.getSelectedFile();
List<JLabel> labels = new ArrayList<JLabel>();
PDDocument doc = null;
try {
    doc = PDDocument.load(f);
    List<?> pages = doc.getDocumentCatalog().getAllPages();
    for (Object o : pages) {
        try {
            // convertToImage() rasterizes the whole page; this is the slow step
            BufferedImage bi = ((PDPage) o).convertToImage();
            labels.add(new JLabel(new ImageIcon(bi)));
        } catch (Exception e) {
            JOptionPane.showMessageDialog(null, e);
        }
    }
} catch (IOException ex) {
    JOptionPane.showMessageDialog(null, "not done\n" + ex);
    return;
} finally {
    if (doc != null) {
        try {
            doc.close();
        } catch (IOException ignored) {
            // nothing useful can be done if closing fails
        }
    }
}
// add all labels first, then lay out and repaint once
for (JLabel label : labels) {
    viewPanel.add(label);
}
viewPanel.revalidate();
viewPanel.repaint();
JOptionPane.showMessageDialog(null, "done");

NetBeans has several plugins for displaying PDFs:
http://plugins.netbeans.org/plugin/5809/java-pdf-reader
http://plugins.netbeans.org/plugin/11676/netbeans-pdfviewer
http://plugins.netbeans.org/plugin/17/pdf-viewer-javafx-converter-and-bookmarking-application
Have you tried any of them?

Modify Printing attribute for Media Name Java Apache FOP API

I am using the Apache FOP API to print a document. This worked well for a while, but now the job tries to print on legal-size paper from tray 1. Can I change the paper size to Letter so that users do not have to press a button on the printer manually?
public void printDocument() {
    DocFlavor flavor = DocFlavor.INPUT_STREAM.AUTOSENSE;
    PrintRequestAttributeSet aset = new HashPrintRequestAttributeSet();
    PrintService prnSvc = null;
    /* locate a print service that can handle it */
    PrintService[] pservices = PrintServiceLookup.lookupPrintServices(null, null);
    for (PrintService service : pservices) {
        System.out.println("Named Printer found: " + service.getName());
        if (service.getName().endsWith("xyz")) {
            prnSvc = service;
            System.out.println("Named Printer selected: " + service.getName() + "*");
            break;
        }
    }
    if (prnSvc == null) {
        // no printer name matched; createPrintJob() below would throw a NullPointerException
        System.err.println("No matching printer found.");
        return;
    }
    /* create a print job for the chosen service */
    DocPrintJob pj = prnSvc.createPrintJob();
    try {
        File file = new File("test.pcl");
        FileInputStream fis = new FileInputStream(file);
        // Doc encapsulating the print data
        Doc doc = new SimpleDoc(fis, flavor, null);
        /* print the doc as specified */
        pj.print(doc, aset);
    } catch (IOException ie) {
        System.err.println(ie);
    } catch (PrintException e) {
        e.printStackTrace();
    }
}
I would highly appreciate any recommendations.

You'll need to specify the paper size by adding it to aset:
aset.add(javax.print.attribute.standard.MediaSizeName.<desired paper size>);
(Javadoc for MediaSizeName). For letter size, use
aset.add(javax.print.attribute.standard.MediaSizeName.NA_LETTER);
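As a slightly fuller sketch of the same idea, the attribute set can be built in one place and the chosen printer can be asked whether it accepts letter size before the job is submitted. The class LetterAttributes and its method names are hypothetical; the javax.print calls are the standard JDK ones. Whether a printer honors the attribute for pre-rendered data such as a PCL file sent with AUTOSENSE can depend on the driver and on the page size embedded in the data, so treat this as a starting point rather than a guaranteed fix.

```java
import javax.print.DocFlavor;
import javax.print.PrintService;
import javax.print.attribute.HashPrintRequestAttributeSet;
import javax.print.attribute.PrintRequestAttributeSet;
import javax.print.attribute.standard.MediaSizeName;

// Hypothetical helper: builds print request attributes that ask for US Letter paper.
class LetterAttributes {

    // Returns an attribute set containing only the letter-size media request
    static PrintRequestAttributeSet letterSize() {
        PrintRequestAttributeSet aset = new HashPrintRequestAttributeSet();
        aset.add(MediaSizeName.NA_LETTER);
        return aset;
    }

    // Asks the print service whether it reports support for letter size with this flavor
    static boolean supportsLetter(PrintService service, DocFlavor flavor) {
        return service.isAttributeValueSupported(MediaSizeName.NA_LETTER, flavor, null);
    }
}
```

In printDocument() this would replace the empty set: PrintRequestAttributeSet aset = LetterAttributes.letterSize();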

Extract xml data from gzip file using apache tika?

I am working on a project in which I need to extract XML (sitemap) data from a .gz file using Apache Tika (I am new to Tika).
The file name is something like sitemap01.xml.gz.
I can extract data from a plain text or HTML file, but I don't know how to extract the XML from the .gz file and then read the metadata and content from that XML.
I have searched Google for the past two days.
Do I need to use a delegate parser in Tika to extract data from the XML?
Please point me to a sample or some articles.
Here is my attempt:
public void parseXml() throws IOException{
Metadata metadata = new Metadata();
ContentHandler handler = new BodyContentHandler();
Parser parser = new AutoDetectParser();
ParseContext context = new ParseContext();
InputStream stream =this.getClass().getResourceAsStream("sitemap.xml.gz");
try {
parser.parse(stream,handler,metadata,context);
for(int i = 0; i <metadata.names().length; i++) {
String name = metadata.names()[i];
System.out.println(name + " : " + metadata.get(name));
}
System.out.println(handler.toString());
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
} catch (SAXException e) {
// TODO Auto-generated catch block
e.printStackTrace();
} catch (TikaException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}finally{
if(stream!=null) {
stream.close();
}
}
}

The thing you're missing is setting a recursing parser on your ParseContext. You probably want something like:
Parser parser = new AutoDetectParser();
ParseContext context = new ParseContext();
context.set(Parser.class, parser);
parser.parse(....)
By setting a Parser on the ParseContext, you tell Tika which parser to call when it encounters embedded documents (such as the XML inside your gzip file).

Here is how you can use the XML parser from Apache Tika for your case. XMLParser expects plain XML, so the stream is decompressed first with java.util.zip.GZIPInputStream:
BodyContentHandler handler = new BodyContentHandler(-1); // -1 disables the write limit
Metadata metadata = new Metadata();
File inFile = new File("sitemap.xml.gz");
System.out.println(inFile.isFile());
ParseContext pcontext = new ParseContext();
// decompress the .gz stream before handing it to the XML parser
try (InputStream inputstream = new GZIPInputStream(new FileInputStream(inFile))) {
    XMLParser xmlparser = new XMLParser();
    xmlparser.parse(inputstream, handler, metadata, pcontext);
}
System.out.println(pcontext.toString());
// the handler holds the text content of the XML with the tags removed
System.out.println("Contents of the document:" + handler.toString());
System.out.println("Metadata of the document:");
String[] metadataNames = metadata.names();
for (String name : metadataNames) {
    System.out.println(name + ": " + metadata.get(name));
}
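If Tika is not a hard requirement, a gzipped sitemap can also be read with the JDK alone. The sketch below swaps in java.util.zip.GZIPInputStream and the JDK DOM parser in place of Tika, and collects the text of every loc element. The class GzipSitemapReader and its methods are hypothetical names, and the code assumes unprefixed element names as in the usual sitemap layout.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

// Hypothetical helper (not part of Tika): reads <loc> entries from a gzipped sitemap.
class GzipSitemapReader {

    // Decompresses the stream, parses the XML, and returns the text of every <loc> element
    static List<String> readLocations(InputStream gzipped) {
        try (InputStream xml = new GZIPInputStream(gzipped)) {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(xml);
            // Assumption: element names carry no namespace prefix
            NodeList locs = doc.getElementsByTagName("loc");
            List<String> urls = new ArrayList<>();
            for (int i = 0; i < locs.getLength(); i++) {
                urls.add(locs.item(i).getTextContent().trim());
            }
            return urls;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ParserConfigurationException | SAXException e) {
            throw new IllegalStateException("Could not parse sitemap XML", e);
        }
    }

    // Gzips a string in memory; only needed to build sample input for trying the reader
    static byte[] gzip(String text) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(bytes)) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }
}
```

For a file on disk, the call would be GzipSitemapReader.readLocations(new FileInputStream("sitemap.xml.gz")).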
