Encoding error in reading non-latin characters from pdf file

Encoding error in reading non-latin characters from pdf file - pdf

I am trying to read Greek from a pdf file using PDFbox 1.8.7. The issue is that whatever encoding I use (Windows-1253, ISO-8859-7,UTF-8) I cannot read the greek characters. Could you provide me with an insight on what I need to set differently in the options of PDFbox in order to read the greek characters. (sample pdf file here https://www.wetransfer.com/downloads/861b3c85bb2b1eca9f10af210a8eff4820141111223938/83ff55)
My source code is the following:
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.util.logging.Level;
import java.util.logging.Logger;
import org.apache.pdfbox.cos.COSDocument;
import org.apache.pdfbox.pdfparser.PDFParser;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.font.PDFont;
import org.apache.pdfbox.pdmodel.font.PDTrueTypeFont;
import org.apache.pdfbox.util.PDFText2HTML;
import org.apache.pdfbox.util.PDFTextStripper;
/**
*
* #author Administrator
*/
public class Pdf {
/**
* #param args the command line arguments
*/
public static void main(String[] args) {
try {
PDFTextStripper pdfStripper = null;
PDDocument pdDoc = null;
COSDocument cosDoc = null;
File file = new File("C:\\themis\\Pdfs\\first.pdf");
PDFParser parser = new PDFParser(new FileInputStream(file));
parser.parse();
cosDoc = parser.getDocument();
pdDoc = new PDDocument(cosDoc);
String encoding="ISO-8859-7";
pdfStripper = new PDFTextStripper(encoding);
pdfStripper.setStartPage(1);
pdfStripper.setEndPage(2);
String parsedText = pdfStripper.getText(pdDoc);
System.out.println(parsedText);
} catch (FileNotFoundException ex) {
Logger.getLogger(PdfBMW.class.getName()).log(Level.SEVERE, null, ex);
} catch (IOException ex) {
Logger.getLogger(PdfBMW.class.getName()).log(Level.SEVERE, null, ex);
}
}
}

Related

docx to PDF conversion using docx4j

I am trying to convert docx file to pdf using docx4j. But I am getting an error
What I found that docx4j older versions were not able to support shapes inserted in the docs for pdf conversion.
Does any current version of docx4J, documents4j and docx4j-export-fo combination has these supported?
In the input file there were few lines drawn or inserted, as a place holder to populate values, these lines are erroring out while converting to PDF
```
import org.docx4j.Docx4J;
import org.docx4j.Docx4jProperties;
import org.docx4j.openpackaging.packages.WordprocessingMLPackage;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.InputStream;
public class Main {
private static final Logger LOGGER = LoggerFactory.getLogger(Main.class);
public static final int FLAG_EXPORT_PREFER_XSL = 0;
public static void main(String[] args) {
Docx4jProperties.setProperty(
"com.plutext.converter.URL",
"https://converter-eval.plutext.com:443/v1/00000000-0000-0000-0000-000000000000/convert");
try {
InputStream templateInputStream = new FileInputStream("C:\\Documents\\Output_docx\\Directors.docx");
WordprocessingMLPackage wordMLPackage = WordprocessingMLPackage.load(templateInputStream);
String outputfilepath = "C:\\Documents\\Output_docx\\Example_output_2.pdf";
FileOutputStream os = new FileOutputStream(outputfilepath);
Docx4J.toPDF(wordMLPackage,os);
os.flush();
os.close();
} catch (Throwable e) {
LOGGER.warn("Conversion Error!");
e.printStackTrace();
}
}
}
```

See https://www.docx4java.org/blog/2020/09/office-pptxxlsxdocx-to-pdf-to-in-docx4j-8-2-3/ which compares the 3 approaches for generating PDF output.
In summary to handle those objects, use the documents4j approach (which uses Word), or Microsoft Graph.

Query Expansion lucene

I am new to lucene and I am trying to do query expansion.
I have referred to these two posts (first , second) and I've managed to reuse the code in a way that suits version 6.0.0, as the one in the previous is deprecated.
The issue is, either I'm not getting a results or I didn't access the results (expanded queries) appropriately.
Here is my code:
import com.sun.corba.se.impl.util.Version;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.StringReader;
import java.io.UnsupportedEncodingException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.text.ParseException;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.standard.ClassicTokenizer;
import org.apache.lucene.analysis.standard.StandardFilter;
import org.apache.lucene.analysis.synonym.SynonymFilter;
import org.apache.lucene.analysis.synonym.SynonymMap;
import org.apache.lucene.analysis.synonym.WordnetSynonymParser;
import org.apache.lucene.analysis.util.CharArraySet;
import org.apache.lucene.util.*;
public class Graph extends Analyzer
{
protected static TokenStreamComponents createComponents(String fieldName, Reader reader) throws ParseException{
System.out.println("1");
// TODO Auto-generated method stub
Tokenizer source = new ClassicTokenizer();
source.setReader(reader);
TokenStream filter = new StandardFilter( source);
filter = new LowerCaseFilter(filter);
SynonymMap mySynonymMap = null;
try {
mySynonymMap = buildSynonym();
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
filter = new SynonymFilter(filter, mySynonymMap, false);
return new TokenStreamComponents(source, filter);
}
private static SynonymMap buildSynonym() throws IOException, ParseException
{ System.out.print("build");
File file = new File("wn\\wn_s.pl");
InputStream stream = new FileInputStream(file);
Reader rulesReader = new InputStreamReader(stream);
SynonymMap.Builder parser = null;
parser = new WordnetSynonymParser(true, true, new StandardAnalyzer(CharArraySet.EMPTY_SET));
System.out.print(parser.toString());
((WordnetSynonymParser) parser).parse(rulesReader);
SynonymMap synonymMap = parser.build();
return synonymMap;
}
public static void main (String[] args) throws UnsupportedEncodingException, IOException, ParseException
{
Reader reader = new FileReader("C:\\input.txt"); // here I have the queries that I want to expand
TokenStreamComponents TSC = createComponents( "" , new StringReader("some text goes here"));
**System.out.print(TSC); //How to get the result from TSC????**
}
#Override
protected TokenStreamComponents createComponents(String string)
{
throw new UnsupportedOperationException("Not supported yet."); //To change body of generated methods, choose Tools | Templates.
}
}
Please suggest ways to help me access the expanded queries!

So, are you just trying to figure out how to iterate through the terms from the TokenStreamComponents in your main method?
Something like this:
TokenStreamComponents TSC = createComponents( "" , new StringReader("some text goes here"));
TokenStream stream = TSC.getTokenStream();
CharTermAttribute termattr = stream.addAttribute(CharTermAttribute.class);
stream.reset();
while (stream.incrementToken()) {
System.out.println(termattr.toString());
}

Add a watermark on a pdf that contains images using pdfbox (1.7)

I have used the code suggested in:
PDFBox Overlay fails
to add a watermark to an existing pdf.
Unfortunately, the pdf produced is corrupted. The pdf reader complains when I open the document: "An error exists on this page. Acrobat may not display the page correctly. Please contact the person who created the PDF document to correct the problem".
The document is opened but it does not show the images.
It seems to happen with all the pdfs. It could be worth saying that it happens also with a different implementation that simply uses the Overlay class.
The following url points to a pdf that I used for my testing:
A pdf with an image
The code to test this transformation is:
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.apache.pdfbox.cos.COSDictionary;
import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.exceptions.COSVisitorException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.common.PDStream;
import org.apache.pdfbox.pdmodel.edit.PDPageContentStream;
import org.apache.pdfbox.pdmodel.graphics.PDExtendedGraphicsState;
import org.apache.pdfbox.pdmodel.graphics.xobject.PDXObjectForm;
import org.apache.pdfbox.util.MapUtil;
/**
* This test is about overlaying with special effect.
*
* #author mkl
*/
public class OverlayWithEffect
{
final static File RESULT_FOLDER = new File("target/test-outputs", "assembly");
public static void overlayWithDarkenBlendMode(PDDocument document, PDDocument overlay) throws IOException
{
PDXObjectForm xobject = importAsXObject(document, (PDPage) overlay.getDocumentCatalog().getAllPages().get(0));
PDExtendedGraphicsState darken = new PDExtendedGraphicsState();
darken.getCOSDictionary().setName("BM", "Darken");
List<PDPage> pages = document.getDocumentCatalog().getAllPages();
for (PDPage page: pages)
{
if (page.getResources() == null) {
page.setResources(page.findResources());
}
if (page.getResources() != null) {
Map<String, PDExtendedGraphicsState> states = page.getResources().getGraphicsStates();
if (states == null) {
states = new HashMap<String, PDExtendedGraphicsState>();
}
String darkenKey = MapUtil.getNextUniqueKey(states, "Dkn");
states.put(darkenKey, darken);
page.getResources().setGraphicsStates(states);
PDPageContentStream stream = new PDPageContentStream(document, page, true, false, true);
stream.appendRawCommands(String.format("/%s gs ", darkenKey));
stream.drawXObject(xobject, 0, 0, 1, 1);
stream.close();
}
}
}
public static PDXObjectForm importAsXObject(PDDocument target, PDPage page) throws IOException
{
final PDStream xobjectStream = new PDStream(target, page.getContents().createInputStream(), false);
final PDXObjectForm xobject = new PDXObjectForm(xobjectStream);
xobject.setResources(page.findResources());
xobject.setBBox(page.findCropBox());
COSDictionary group = new COSDictionary();
group.setName("S", "Transparency");
group.setBoolean(COSName.getPDFName("K"), true);
xobject.getCOSStream().setItem(COSName.getPDFName("Group"), group);
return xobject;
}
public static void main(String[] args) throws COSVisitorException, IOException
{
InputStream sourceStream = new FileInputStream("x:/pdf-test.pdf");
InputStream overlayStream = new FileInputStream("x:/draft.pdf");
try {
final PDDocument document = PDDocument.load(sourceStream);
final PDDocument overlay = PDDocument.load(overlayStream);
overlayWithDarkenBlendMode(document, overlay);
document.save("x:/da-draft-5.pdf");
document.close();
}
finally {
sourceStream.close();
overlayStream.close();
}
}
}
I am using version 1.7 of pdfbox.
Thanks

As suggested by mkl, it is probably an issue with the version of pdfbox that I am using.

Issues converting docx to pdf using docx4j

I am using docx4j 2.8.1 and I tried to convert several different docx file, but i have always the same issue.
maybe the issue is coming from the version of the library or some dependency missing.
Code:
package test;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.InputStream;
import java.io.OutputStream;
import org.docx4j.convert.out.pdf.PdfConversion;
import org.docx4j.convert.out.pdf.viaXSLFO.PdfSettings;
import org.docx4j.openpackaging.packages.WordprocessingMLPackage;
public class pdfConverter {
public static void main(String[] args) {
createPDF();
}
private static void createPDF() {
try {
// 1) Load DOCX into WordprocessingMLPackage
InputStream is = new FileInputStream(
new File("D:/TestDoc/Res.docx"));
WordprocessingMLPackage wordMLPackage = WordprocessingMLPackage
.load(is);
// 2) Prepare Pdf settings
PdfSettings pdfSettings = new PdfSettings();
// 3) Convert WordprocessingMLPackage to Pdf
OutputStream out = new FileOutputStream(new File(
"D:/TestDoc/Res.pdf"));
PdfConversion converter = new org.docx4j.convert.out.pdf.viaXSLFO.Conversion(
wordMLPackage);
converter.output(out, pdfSettings);
} catch (Throwable e) {
e.printStackTrace();
}
}
}
Error:
org.docx4j.openpackaging.exceptions.Docx4JException: FOP issues
at org.docx4j.convert.out.pdf.viaXSLFO.Conversion.output(Conversion.java:374)
at test.pdfConverter.createPDF(pdfConverter.java:42)
at test.pdfConverter.main(pdfConverter.java:21)
Caused by: java.lang.NullPointerException
at org.docx4j.XmlUtils.transform(XmlUtils.java:842)
at org.docx4j.XmlUtils.transform(XmlUtils.java:802)
at org.docx4j.convert.out.pdf.viaXSLFO.Conversion.output(Conversion.java:349)
... 2 more

Solved by changing to jars. i used 2.8.0 and its fine now.

Parsing content from Word document using docx4j

Thanks to a previous answer, I'm now able to read my password-protected Word 2010 documents. (I have to translate them one by one from .doc to .docx. They go back to 1994, but that's okay.)
I wrote a simple Java class to get started:
package model.docx4j;
import model.JournalEntry;
import model.JournalEntryFactory;
import org.docx4j.openpackaging.exceptions.Docx4JException;
import org.docx4j.openpackaging.packages.OpcPackage;
import org.docx4j.openpackaging.parts.Parts;
import java.io.IOException;
import java.io.InputStream;
import java.security.GeneralSecurityException;
import java.util.LinkedList;
import java.util.List;
/**
* JournalEntryFactoryImpl using docx4j
* #author Michael
* #link
* #since 9/8/12 12:44 PM
*/
public class JournalEntryFactoryImpl implements JournalEntryFactory {
#Override
public List<JournalEntry> getEntries(InputStream inputStream, String password) throws IOException, GeneralSecurityException {
List<JournalEntry> journalEntries = new LinkedList<JournalEntry>();
if (inputStream != null) {
try {
OpcPackage opcPackage = OpcPackage.load(inputStream, password);
Parts parts = opcPackage.getParts();
} catch (Docx4JException e) {
LOGGER.error("Could not load document into docx4j", e);
throw new IOException(e);
}
}
return journalEntries;
}
}
And a JUnit test to drive it:
package model.docx4j;
import model.JournalEntry;
import model.JournalEntryFactory;
import org.docx4j.openpackaging.exceptions.Docx4JException;
import org.docx4j.openpackaging.packages.OpcPackage;
import org.docx4j.openpackaging.parts.Parts;
import java.io.IOException;
import java.io.InputStream;
import java.security.GeneralSecurityException;
import java.util.LinkedList;
import java.util.List;
/**
* JournalEntryFactoryImpl using docx4j
* #author Michael
* #link
* #since 9/8/12 12:44 PM
*/
public class JournalEntryFactoryImpl implements JournalEntryFactory {
#Override
public List<JournalEntry> getEntries(InputStream inputStream, String password) throws IOException, GeneralSecurityException {
List<JournalEntry> journalEntries = new LinkedList<JournalEntry>();
if (inputStream != null) {
try {
OpcPackage opcPackage = OpcPackage.load(inputStream, password);
Parts parts = opcPackage.getParts();
} catch (Docx4JException e) {
LOGGER.error("Could not load document into docx4j", e);
throw new IOException(e);
}
}
return journalEntries;
}
}
I put a breakpoint into the test to see what docx4j was doing once it read my document. I see a list of 8 parts, but I walked through the tree without finding the content.
Each document consists of a page with a date and content, but I can't find pages. Where do they live?

The main document content lives in the "main document part", which is often named "/word/document.xml".
The usual way to get it with docx4j is:
WordprocessingMLPackage wordMLPackage = (WordprocessingMLPackage)opcPackage;
MainDocumentPart documentPart = wordMLPackage.getMainDocumentPart();
but you'd expect your approach to work as well.

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

Encoding error in reading non-latin characters from pdf file - pdf

Related

docx to PDF conversion using docx4j

Query Expansion lucene

Add a watermark on a pdf that contains images using pdfbox (1.7)

Issues converting docx to pdf using docx4j

Parsing content from Word document using docx4j

Categories

Resources