pdfBox Return Bad Encoding Charachter - pdfbox

i have a pdf http://www.persianacademy.ir/UserFiles/File/fe1394.pdfthat i want to extract words from it(contain persian words.).i use PDFBox library to get words.here is my code:
package ir.blog.stack;
import java.io.File;
import java.io.IOException;
import org.apache.pdfbox.cos.COSDocument;
import org.apache.pdfbox.io.RandomAccessFile;
import org.apache.pdfbox.pdfparser.PDFParser;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
public class PDFManager {
public static void main(String[] args) {
PDFManager pdfManager = new PDFManager();
pdfManager.setFilePath("/home/saeed/Documents/words.pdf");
try {
System.out.println(pdfManager.ToText());
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
private PDFParser parser;
private PDFTextStripper pdfStripper;
private PDDocument pdDoc ;
private COSDocument cosDoc ;
private String Text ;
private String filePath;
private File file;
public PDFManager() {
}
public String ToText() throws IOException
{
this.pdfStripper = null;
this.pdDoc = null;
this.cosDoc = null;
file = new File(filePath);
parser = new PDFParser(new RandomAccessFile(file,"r")); // update for PDFBox V 2.0
parser.parse();
cosDoc = parser.getDocument();
pdfStripper = new PDFTextStripper();
pdDoc = new PDDocument(cosDoc);
pdDoc.getNumberOfPages();
pdfStripper.setStartPage(1);
pdfStripper.setEndPage(5);
// reading text from page 1 to 10
// if you want to get text from full pdf file use this code
// pdfStripper.setEndPage(pdDoc.getNumberOfPages());
Text = pdfStripper.getText(pdDoc);
return Text;
}
public void setFilePath(String filePath) {
this.filePath = filePath;
}
}
and this is part of output:
° ǽA ° SwA ²j±ÇÇM/SwA ²joÇ Ak¼ÇQ ³Ç«AjA p°oÇ«A ³ÇM BÇU éÇ
BÇM ¤ Ø°A ·ª¦ °j ³ An <»wB®{Sv½p> ° <»wB®z¯BMp> ,<³¯BhQBa> ,<³¯BiRnB\U>
»¯BwC³ÇM ©½o¼¢Moǯnj kǯA²k{ ³TiBw <»wB®{> BM ¨°j ·ª¦ °j ° <³¯Bi> ·ª¦
k{BÇM ³TÇ{Aj j±]° o¯ ßB
UA ¬C nj ³ ºA²kîB RBª¦ ½A ߺÀ«A ³ ©¼MB½»«nj
/jnAk¯
° ²k{tBLTA »¼® Øßi pA j±i »Moî Øßi ° ²k{ ³To£ »Moî Øßi pA B« Øßi
shall i do extra actions to get right words?

The PDF in question simply does not contain the information required for text extraction. You will have to try with OCR.
In detail
For text extraction from a PDF to succeed, the PDF must contain some information on which Unicode character is represented by each used glyph.
The PDF specification describes the following text extraction process:
9.10.2 Mapping Character Codes to Unicode Values
A conforming reader can use these methods, in the priority given, to map a character code to a Unicode value. Tagged PDF documents, in particular, shall provide at least one of these methods (see 14.8.2.4.2, "Unicode Mapping in Tagged PDF"):
If the font dictionary contains a ToUnicode CMap (see 9.10.3, "ToUnicode CMaps"), use that CMap to convert the character code to Unicode.
If the font is a simple font that uses one of the predefined encodings MacRomanEncoding, MacExpertEncoding, or WinAnsiEncoding, or that has an encoding whose Differences array includes only character names taken from the Adobe standard Latin character set and the set of named characters in the Symbol font (see Annex D):
a) Map the character code to a character name according to Table D.1 and the font’s Differences array.
b) Look up the character name in the Adobe Glyph List (see the Bibliography) to obtain the corresponding Unicode value.
If the font is a composite font that uses one of the predefined CMaps listed in Table 118 (except Identity–H and Identity–V) or whose descendant CIDFont uses the Adobe-GB1, Adobe-CNS1, Adobe-Japan1, or Adobe-Korea1 character collection:
a) Map the character code to a character identifier (CID) according to the font’s CMap.
b) Obtain the registry and ordering of the character collection used by the font’s CMap (for example, Adobe and Japan1) from its CIDSystemInfo dictionary.
c) Construct a second CMap name by concatenating the registry and ordering obtained in step (b) in the format registry–ordering–UCS2 (for example, Adobe–Japan1–UCS2).
d) Obtain the CMap with the name constructed in step (c) (available from the ASN Web site; see the Bibliography).
e) Map the CID obtained in step (a) according to the CMap obtained in step (d), producing a Unicode value.
If these methods fail to produce a Unicode value, there is no way to determine what the character code represents in which case a conforming reader may choose a character code of their choosing.
In case of the sample PDF, the fonts in question
do not have ToUnicode maps;
are composite;
use Identity-H as Encoding;
have a CIDSystemInfo value of Adobe-Identity-0.
Thus, the process quoted above fails to produce a Unicode value.
The PDF specification alternatively allows the use of ActualText entries in structure element dictionaries or marked-content sequences to override the text some content shall represent.
In case of the sample PDF, no ActualText entries are used.
One can look deeper than the PDF specification describes, in particular one can dive into the embedded font programs to find font specific information on the Unicode characters some font glyph represents.
In case of the sample PDF, the embedded font programs
do not contain Unicode values for the glyphs;
use uninformative glyph names like "glyph89".
Thus, in case of the sample PDF, you most likely will have to resort to OCR.

Related

How to rotate a specific text in a pdf which is already present in the pdf using itext or pdfbox?

I know we can insert text into pdf with rotation using itext. But I want to rotate the text which is already present in the pdf.
Before.pdf
After.pdf
First of all, in your question you only talk about how to rotate a specific text but in your example you additionally rotate a red rectangle. This answer focuses on rotating text. The process of guessing which graphics might be related to the text and, therefore, probably should be rotated along, is a topic in its own right.
You also mention you are looking for a solution using itext or pdfbox and used the tags itext, pdfbox, and itext7. For this answer I chose iText 7.
You did not explain what kind of text pieces you want to rotate but offered a representative example PDF. In that example I saw that the text to rotate was drawn using a single text showing instruction which is the only such instruction in the encompassing text object in the page content stream. To keep the code in the answer simple, therefore, I can assume the text to rotate is drawn in a consecutive sequence of text showing instructions in a text object in the page content stream framed by instructions that are not text showing ones. This is a generalization of your case.
Furthermore, you did not mention the center of rotation. Based on your example files I assume it to be approximately the start of the base line of the text to rotate.
A Simple Implementation
When editing PDF content streams it is helpful to know the current graphics state at each instruction, e.g. to properly recognize the text drawn by a text showing operation one needs to know the current font to map the character codes to Unicode characters. The text extraction framework in iText already contains code to follow the graphics state. Thus, in this answer a base PdfCanvasEditor class has been developed on top of the text extraction framework.
We can base a solution for the task at hand on that class after a small extension; that class originally sets the text extraction event listener to a dummy implementation but here we'll need a custom one. So we need to add an additional constructor that accepts such a custom event listener as parameter:
public PdfCanvasEditor(IEventListener listener)
{
super(listener);
}
(Additional PdfCanvasEditor constructor)
Based on this extended PdfCanvasEditor we can implement the task by inspecting the existing page content stream instruction by instruction. For a sequence of consecutive text showing instructions we retrieve the text matrix before and after the sequence, and if the text drawn by the sequence turns out to be the text to rotate, we insert an instruction before that sequence setting the initial text matrix to a rotated version of itself and another one after that sequence setting the text matrix back to what it was there originally.
Our implementation LimitedTextRotater accepts a Matrix representing the desired rotation and a Predicate matching the string to rotate.
public class LimitedTextRotater extends PdfCanvasEditor {
public LimitedTextRotater(Matrix rotation, Predicate<String> textMatcher) {
super(new TextRetrievingListener());
((TextRetrievingListener)getEventListener()).limitedTextRotater = this;
this.rotation = rotation;
this.textMatcher = textMatcher;
}
#Override
protected void write(PdfCanvasProcessor processor, PdfLiteral operator, List<PdfObject> operands) {
String operatorString = operator.toString();
if (TEXT_SHOWING_OPERATORS.contains(operatorString)) {
recentTextOperations.add(new ArrayList<>(operands));
} else {
if (!recentTextOperations.isEmpty()) {
boolean rotate = textMatcher.test(text.toString());
if (rotate)
writeSetTextMatrix(processor, rotation.multiply(initialTextMatrix));
for (List<PdfObject> recentOperation : recentTextOperations) {
super.write(processor, (PdfLiteral) recentOperation.get(recentOperation.size() - 1), recentOperation);
}
if (rotate)
writeSetTextMatrix(processor, finalTextMatrix);
recentTextOperations.clear();
text.setLength(0);
initialTextMatrix = null;
}
super.write(processor, operator, operands);
}
}
void writeSetTextMatrix(PdfCanvasProcessor processor, Matrix textMatrix) {
PdfLiteral operator = new PdfLiteral("Tm\n");
List<PdfObject> operands = new ArrayList<>();
operands.add(new PdfNumber(textMatrix.get(Matrix.I11)));
operands.add(new PdfNumber(textMatrix.get(Matrix.I12)));
operands.add(new PdfNumber(textMatrix.get(Matrix.I21)));
operands.add(new PdfNumber(textMatrix.get(Matrix.I22)));
operands.add(new PdfNumber(textMatrix.get(Matrix.I31)));
operands.add(new PdfNumber(textMatrix.get(Matrix.I32)));
operands.add(operator);
super.write(processor, operator, operands);
}
void eventOccurred(TextRenderInfo textRenderInfo) {
Matrix textMatrix = textRenderInfo.getTextMatrix();
if (initialTextMatrix == null)
initialTextMatrix = textMatrix;
finalTextMatrix = new Matrix(textRenderInfo.getUnscaledWidth(), 0).multiply(textMatrix);
text.append(textRenderInfo.getText());
}
static class TextRetrievingListener implements IEventListener {
#Override
public void eventOccurred(IEventData data, EventType type) {
if (data instanceof TextRenderInfo) {
limitedTextRotater.eventOccurred((TextRenderInfo) data);
}
}
#Override
public Set<EventType> getSupportedEvents() {
return null;
}
LimitedTextRotater limitedTextRotater;
}
final static List<String> TEXT_SHOWING_OPERATORS = Arrays.asList("Tj", "'", "\"", "TJ");
final Matrix rotation;
final Predicate<String> textMatcher;
final List<List<PdfObject>> recentTextOperations = new ArrayList<>();
final StringBuilder text = new StringBuilder();
Matrix initialTextMatrix = null;
Matrix finalTextMatrix = null;
}
(LimitedTextRotater)
You can apply it to a document like this:
try ( PdfReader pdfReader = new PdfReader(...);
PdfWriter pdfWriter = new PdfWriter(...);
PdfDocument pdfDocument = new PdfDocument(pdfReader, pdfWriter) )
{
PdfCanvasEditor editor = new LimitedTextRotater(new Matrix(0, -1, 1, 0, 0, 0), text -> true);
for (int i = 1; i <= pdfDocument.getNumberOfPages(); i++){
editor.editPage(pdfDocument, i);
}
}
(RotateText test testBeforeAkhilNagaSai)
The Predicate used here is text -> true which matches any text. In case of your example PDF that is ok as the text to rotate is the only text. In general you might want a more specific check, e.g. text -> text.equals("The text to be rotated"). In general try not to be too specific, though, as the extracted text might slightly deviate from expectations, e.g. by extra spaces.
The result:
As you can see the text is rotated. In contrast to your After.pdf, though, the red rectangle is not rotated. The reason is - as already mentioned at the start - that that rectangle in no way is part of the text.
Some Ideas
First of all, there are ports of the PdfCanvasEditor to iText 5 (the PdfContentStreamEditor in this answer) and PDFBox (the PdfContentStreamEditor in this answer). Thus, if you eventually prefer to switch to either of these PDF libraries, you can create equivalent implementations.
Then, if the assumption that the text to rotate is drawn in a consecutive sequence of text showing instructions in a text object in the page content stream framed by instructions that are not text showing ones does not hold for you, you can generalize the implementation here somewhat. Have a look at the SimpleTextRemover in this answer for inspiration which is based on the PdfContentStreamEditor for iText 5. Here also texts that start somewhere in one text showing instruction and end somewhere in another one are processed which requires some more detailed data keeping and splitting of existing text drawing instructions.
Also, if you want to rotate graphics along with the text that a human viewer might consider associated with it (like the red rectangle in your example file), you can try and extend the example accordingly, e.g. by also extracting the coordinates of the rotated text and in a second run trying to guess which graphics around those coordinates are related and rotating the graphics along. This is not trivial, though.
Finally note that the Matrix provided in the constructor is not limited to rotations, it can represent an arbitrary affine transformation. So instead of rotating text you can also move it or scale it or skew it, ...

How to decode data from Content Stream

I created a pdf document using the code looks like the following:
// The text parameter equels 'שדג' it is Hebrew. unicode equivalent is '\u05E9\u05D3\u05D2'
private static void createSimplePdf(String filename, String text) throws Exception {
final String path = RunItextApp.class.getResource("/Arial.ttf").getPath();
final PdfFont font = PdfFontFactory.createFont(path, PdfEncodings.IDENTITY_H);
Style hebrewStyle = new Style()
.setBaseDirection(BaseDirection.RIGHT_TO_LEFT)
.setFontSize(14)
.setFont(font);
final PdfWriter pdfWriter = new PdfWriter(filename);
final PdfDocument pdfDocument = new PdfDocument(pdfWriter);
final Document pdf = new Document(pdfDocument);
pdf.add(
new Paragraph(text)
.setFontScript(Character.UnicodeScript.HEBREW)
.addStyle(hebrewStyle)
);
pdf.close();
System.out.println("The document '" + filename + "' has been created.");
}
and after that, I tried to open this document using pdfbox util and I got the following data:
but I got an unexpected result in the Contents:stream section especially Tj tag. I expected string like the following 05E905D305D2 but I got 02b902a302a2. I tried to convert this hex string to normal string and I got the following result: ʹʣʢ but I expected that string שדג.
What do I wrong? Hot to convert this 02b902a302a2 string and get שדג?
This answer writes in a comment #usr2564301. Thanks for the help!
The numbers you get are not Unicode characters but font indexes instead. (Check how the font is embedded!) The text in a PDF does not specifically care about Unicode – it may or may not be this. Good PDF creators add a /ToUnicode table to help decoding, but it's optional.

PDFBOX is not giving right output

I am parsing a pdf in PDFBox to extract all the text from it
public static void main(String args[]) {
PDFTextStripper pdfStripper = null;
PDDocument pdDoc = null;
COSDocument cosDoc = null;
File file = new File("C:\\Users\\admin\\Downloads\\Airtel.pdf");
try {
PDFParser parser = new PDFParser(new FileInputStream(file));
parser.parse();
cosDoc = parser.getDocument();
pdfStripper = new PDFTextStripper();
pdDoc = new PDDocument(cosDoc);
pdfStripper.setStartPage(1);
pdfStripper.setEndPage(1);
String parsedText = pdfStripper.getText(pdDoc);
System.out.println(parsedText);
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
BUT its not giving any text in output
help
PDFBox extracts text similar to how Adobe Reader copies&pastes text.
If you open your document in Adobe Reader and press <Ctrl-A> to mark all text (you'll see that hardly anything is marked) and copy&paste it to an editor, you'll find that Adobe Reader also hardly extracts anything.
The reason why neither PDFBox nor Adobe Reader (nor any other normal text extractor) extracts the text from your document is that there virtually is no text at all in it! The "text" you see is not drawn using text drawing operations but instead is painted using by defining the outlines of each "character" as path and filling the area in that path. Thus, there is no indication to a text extractor that there even is text.
There actually are two characters of real text in your document, the '-' sign between the "Previous balance" and the "Payments" boxes and the '-' sign between the "Payments" and the "Adjustments" boxes. And even those two characters are not extracted as desired because the font does not provide the information which Unicode codepoint those characters represent.
Your virtually only chance, therefore, to extract the text content of the document is to apply OCR to the document.

How can I Extract words with its coordinates from pdf using .net?

I'm working with pdf in hebrew language with diacritical marks. I want to extract all the words with its coordinates. I tried to use ITextSharp and pdfClown and they both didn't give me what I want.
In pdfClown there are missing letters\chars in ITextSharp I don't get the words coordinates.
Is there a way to do it? (I'm looking for a free framework\code)
EDIT:
PDFClown Code:
File file = new File(PDFFilePath);
TextExtractor te = new TextExtractor();
IDictionary<RectangleF?, IList<ITextString>> strs = te.Extract(file.Document.Pages[0].Contents);
List<string> correctText = new List<string>();
foreach (var key in strs.Keys)
{
foreach (var value in strs[key])
{
string reversedText = new string(value.Text.Reverse().ToArray());
string cleanText = RemoveDiacritics(reversedText);
correctText.Add(cleanText);
}
}
You aren't showing how you are trying to extract text using iText(Sharp). I am assuming that you are following the official documentation and that your code looks like this:
public string ExtractText(byte[] src) {
PdfReader reader = new PdfReader(src);
MyTextRenderListener listener = new MyTextRenderListener();
PdfContentStreamProcessor processor = new PdfContentStreamProcessor(listener);
PdfDictionary pageDic = reader.GetPageN(1);
PdfDictionary resourcesDic = pageDic.GetAsDict(PdfName.RESOURCES);
processor.ProcessContent(
ContentByteUtils.GetContentBytesForPage(reader, 1), resourcesDic);
return listener.Text.ToString();
}
If your code doesn't look like this, this explains already explains the first thing you're doing wrong.
In this method, there is one class that isn't part of iTextSharp: MyTextRenderListener. This is a class you should write and that looks for instance like this:
public class MyTextRenderListener : IRenderListener {
public StringBuilder Text { get; set; }
public MyTextRenderListener() {
Text = new StringBuilder();
}
public void BeginTextBlock() {
Text.Append("<");
}
public void EndTextBlock() {
Text.AppendLine(">");
}
public void RenderImage(ImageRenderInfo renderInfo) {
}
public void RenderText(TextRenderInfo renderInfo) {
Text.Append("<");
Text.Append(renderInfo.GetText());
LineSegment segment = renderInfo.GetBaseline();
Vector start = segment.GetStartPoint();
Text.Append("| x=");
Text.Append(start[Vector.I1]);
Text.Append("; y=");
Text.Append(start[Vector.I2]);
Text.Append(">");
}
}
When you run this code, and you look what's inside Text, you'll notice that a PDF document doesn't store words. Instead, it stores text blocks. In our special IRenderListener, we indicate the start and the end of text blocks using < and >. Inside these text blocks, you'll find text snippets. We'll mark text snippets like this: <text snippet| x=36.0000; y=806.0000> where the x and y value give you the coordinate of the start of the baseline (as opposed to the ascent and descent position). You can also get the end position of the baseline (and the ascent/descent).
Now how do you distill words out of all of this? The problem with the text snippets you get, is that they don't correspond with words. See for instance this file: hello_reverse.pdf
When you open it in Adobe Reader, you read "Hello World Hello People." You'd hope you'd find four words in the content stream, wouldn't you? In reality, this is what you'll find:
<>
<<ld><Wor><llo><He>>
<<Hello People>>
To distill the words, "World" and "Hello" from the first line, you need to do plenty of Math. Instead of getting the base line of the TextRenderInfo object returned in the RenderText() method of your render listener, you have to use the GetCharacterRenderInfos() method. This will return a list of TextRenderInfo objects that gives you more info about every character (including the position of those characters). You then need to compose the words from those different characters.
This is explained in mkl's answer to this question: Retrieve the respective coordinates of all words on the page with itextsharp
We've done similar projects. One of them is described here: https://www.youtube.com/watch?v=lZnbhnU4m3Y
You'll need to do quite some coding to get it right. One word about PdfClown: your text is probably stored as UNICODE in your PDF. To retrieve the correct characters, the parser needs to examine the mapping of the glyphs stored in the font and the corresponding UNICODE character. If PdfClown can't do this, this means that PdfClown doesn't do this task correctly. PdfClown is a one man project, so you'll have to ask that developer to fix this (if he has the time).
As you can tell from the video, iText could help you out, but iText is a company with subsidiaries in the US, Belgium and Singapore. It is a company with many employees and to keep that company running, we need to make money (that's how we pay our employees). Hence you shouldn't expect that we help you for free. Surely you can understand this as you wouldn't want to work for free either, would you?

How a font is detected to be bold/italic/plain that is used in PDF

While Extracting Content from PDF using the MuPDF library, i am getting the Font name only not its font-face.
Do i guess (eg.bold in font-name though not the right way) or there is any other way to detect that specific font is Bold/Italic/Plain.
I have used itextsharp to extract font-family ,font color etc
public void Extract_inputpdf() {
text_input_File = string.Empty;
StringBuilder sb_inputpdf = new StringBuilder();
PdfReader reader_inputPdf = new PdfReader(path); //read PDF
for (int i = 0; i <= reader_inputPdf.NumberOfPages; i++) {
TextWithFont_inputPdf inputpdf = new TextWithFont_inputPdf();
text_input_File = iTextSharp.text.pdf.parser.PdfTextExtractor.GetTextFromPage(reader_inputPdf, i, inputpdf);
sb_inputpdf.Append(text_input_File);
input_pdf = sb_inputpdf.ToString();
}
reader_inputPdf.Close();
clear();
}
public class TextWithFont_inputPdf: iTextSharp.text.pdf.parser.ITextExtractionStrategy {
public void RenderText(iTextSharp.text.pdf.parser.TextRenderInfo renderInfo) {
string curFont = renderInfo.GetFont().PostscriptFontName;
string divide = curFont;
string[] fontnames = null;
//split the words from postscript if u want separate. it will be in this
}
}
public string GetResultantText() {
return result.ToString();
}
The PDF spec contains entries which allow you to specify the style of a font. However unfortunately in the real world you will often find that these are absent.
If the font is referenced rather than embeded this generally means you are stuck with the PostScript name for the font. It requires some heuristics but normally the name provides sufficient clues as to the style. It sounds this is pretty much where you are.
If the font is embedded you can parse it and try and find style information from the embedded font program. If it is subsetted then in theory this information might be removed but in general I don't think it will be. However parsing TrueType/OpenType fonts is boring and you may not feel that it is worth it.
I work on the ABCpdf .NET software component so my replies may feature concepts based around ABCpdf. It's just what I know. :-)"