Extract text using iTextSharp with different PDF page labels

I am trying to extract text from a PDF using iTextSharp, but I get a null reference error when calling GetTextFromPage:
My guess would be that iTextSharp somehow interprets the page label incorrectly, as the label is indeed strange:
Or is it due to the Danish letters in the text?
However, I am able to extract text from PDFs in other languages.
Thank you in advance.
EDIT: The problem could also be the fonts used in the document and their custom encoding:
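For reference (the poster's own code is not shown), the failing call is usually made along these lines; a minimal sketch against iTextSharp 5.x, with a placeholder file path:

using System;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

// Minimal extraction loop (iTextSharp 5.x); "input.pdf" is a placeholder path.
var reader = new PdfReader("input.pdf");
for (int page = 1; page <= reader.NumberOfPages; page++)
{
    // GetTextFromPage can behave unexpectedly when a page's fonts
    // use unusual or custom encodings.
    string text = PdfTextExtractor.GetTextFromPage(
        reader, page, new SimpleTextExtractionStrategy());
    Console.WriteLine(text);
}
reader.Close();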

Related

How to export text document containing astral Unicode characters to PDF

I regularly create documents that need Unicode characters above U+FFFF. Unfortunately, OpenOffice and LibreOffice are both unable to correctly export these characters when creating a PDF. The actual data gets mangled by a completely asinine algorithm, while the display just consists of various overlapping question mark boxes.
This is not a font issue. I embed all used fonts in the PDF and all characters below U+FFFF work perfectly fine.
Until now I have been working around this issue by mapping the glyphs I need to a custom PUA font. This solves the display problems, but obviously makes the actual content of the text unsearchable and quite fragile. I haven’t been able to find any settings that might affect the handling of Unicode characters in PDF.
Therefore I have three questions:
Is there a way to make OpenOffice/LibreOffice handle astral characters correctly on PDF export?
If not, is there an external tool that can convert .odt files to PDF while preserving astral characters?
If not, is there another good rich-text editor using a different file format that can deal with astral characters in PDFs?

Hindi to English from PDF

I am not able to copy Hindi content from a PDF file. When I copy/paste the content, it changes to different Hindi characters.
Example:
Original: विधान सभा
After paste: नरधरन सभर
Can anybody help me get the exact Hindi characters?
What was used to create the PDF?
It was likely created with an embedded font subset and no ToUnicode mapping. Basically, the character codes used in the content of the PDF are mapped to glyphs embedded in the PDF, which are displayed, but there is no mapping from those codes to regular Unicode code points, so copying them produces gibberish. The only way to extract the original content would be with some form of OCR.
Another possibility is that the application you are pasting it into is not shaping the characters correctly.
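To confirm the missing-ToUnicode diagnosis, you can inspect the fonts in the file; a minimal sketch with iTextSharp 5.x ("input.pdf" is a placeholder path):

using System;
using iTextSharp.text.pdf;

// Sketch: report whether each page's fonts carry a /ToUnicode CMap.
var reader = new PdfReader("input.pdf");
for (int page = 1; page <= reader.NumberOfPages; page++)
{
    PdfDictionary resources = reader.GetPageN(page).GetAsDict(PdfName.RESOURCES);
    PdfDictionary fonts = resources == null ? null : resources.GetAsDict(PdfName.FONT);
    if (fonts == null) continue;
    foreach (PdfName name in fonts.Keys)
    {
        PdfDictionary font = fonts.GetAsDict(name);
        // Without /ToUnicode, extraction falls back to the font's own
        // encoding, which for embedded subsets is often meaningless.
        Console.WriteLine("page {0}, font {1}: ToUnicode = {2}",
            page, name, font.Get(PdfName.TOUNICODE) != null);
    }
}
reader.Close();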

iText - How to replace a character with a dynamic checkbox in a pdf

I'm using iText to work on PDF documents.
I have a PDF document with non-dynamic checkboxes (they are made with the character "□", Unicode \u25A1); I need to replace those checkboxes with the dynamic checkboxes of iText.
Is this possible?
I need it because the PDF document I receive is not made by me, and it contains that character.
Thanks for the help, and sorry for my imperfect English.
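One possible approach, sketched below for the iTextSharp flavor of the API (5.x): locate each □ glyph with a render listener, then stamp a real AcroForm checkbox over its bounding box. The class name BoxFinder, the field names, and the file paths are illustrative assumptions, and this is an untested sketch.

using System.Collections.Generic;
using System.IO;
using iTextSharp.text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

// Collects the bounding box of every "□" (U+25A1) glyph on a page.
class BoxFinder : IRenderListener
{
    public readonly List<Rectangle> Boxes = new List<Rectangle>();

    public void RenderText(TextRenderInfo info)
    {
        foreach (TextRenderInfo glyph in info.GetCharacterRenderInfos())
        {
            if (glyph.GetText() != "\u25A1") continue;
            Vector ll = glyph.GetDescentLine().GetStartPoint();
            Vector ur = glyph.GetAscentLine().GetEndPoint();
            Boxes.Add(new Rectangle(ll[Vector.I1], ll[Vector.I2],
                                    ur[Vector.I1], ur[Vector.I2]));
        }
    }

    public void BeginTextBlock() { }
    public void EndTextBlock() { }
    public void RenderImage(ImageRenderInfo info) { }
}

class ReplaceBoxes
{
    static void Main()
    {
        var reader = new PdfReader("in.pdf"); // placeholder paths
        var stamper = new PdfStamper(reader, new FileStream("out.pdf", FileMode.Create));
        var parser = new PdfReaderContentParser(reader);
        for (int page = 1; page <= reader.NumberOfPages; page++)
        {
            BoxFinder finder = parser.ProcessContent(page, new BoxFinder());
            int i = 0;
            foreach (Rectangle box in finder.Boxes)
            {
                // Stamp an interactive checkbox over the printed square.
                var cb = new RadioCheckField(stamper.Writer, box,
                    string.Format("cb_{0}_{1}", page, i++), "Yes");
                cb.CheckType = RadioCheckField.TYPE_SQUARE;
                stamper.AddAnnotation(cb.CheckField, page);
            }
        }
        stamper.Close();
        reader.Close();
    }
}

Note that the original □ glyph stays in the page content; the checkbox widget simply covers it. Actually deleting the character would mean editing the content stream, which is considerably more involved.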

Parse Body Text from PDF

I have just recently been experimenting with parsing the text data from a PDF document using iTextSharp in a VB2010 app. The document doesn't contain any images or other fancy elements, just text. I've read some articles and used some code snippets, and it looks promising. However, what I've been trying to do is parse out just the body of each page, minus any header or footer. I haven't found any guidance for that particular function.
Currently I am using the snippet found here: Reading PDF content with itextsharp dll in VB.NET or C#, but it parses all the text in a page. There's got to be a way to just get the body. Or at least I hope so.
PDFs generally do not contain information about the logical structure of the text they contain.
So there are no headers, footers, body, paragraphs, or anything like that in a PDF. There is only a bunch of operations like "draw this glyph here" and "move to this position and draw that group of glyphs there". I wrote glyph and not character because PDFs are not required to contain readable text; only the visual appearance is required to be specified.
One exception is Tagged PDF, but most PDFs in the wild are not tagged.
Given all of the above, you are probably left with the following approach:
Extract all text from each page
Analyze text and find similar parts at the beginning / end of each page
Remove similar parts
This is heuristic-based detection, so it won't always give excellent results. A minimal sketch of the idea follows.
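Something along these lines, as a sketch only (iTextSharp 5.x; the helper names and the majority-vote threshold are my own assumptions):

using System;
using System.Collections.Generic;
using System.Linq;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

static class BodyText
{
    // Extract each page as an array of lines.
    public static List<string[]> ReadPages(string path)
    {
        var pages = new List<string[]>();
        var reader = new PdfReader(path);
        for (int i = 1; i <= reader.NumberOfPages; i++)
        {
            string text = PdfTextExtractor.GetTextFromPage(
                reader, i, new LocationTextExtractionStrategy());
            pages.Add(text.Split('\n'));
        }
        reader.Close();
        return pages;
    }

    // Strip digits so "Page 1" and "Page 2" compare as equal.
    static string Key(string line)
    {
        return new string(line.Where(c => !char.IsDigit(c)).ToArray()).Trim();
    }

    // Drop the first/last line of every page if the same line
    // (modulo page numbers) occurs on more than half of the pages.
    public static IEnumerable<string> StripHeaderFooter(List<string[]> pages)
    {
        bool header = pages.GroupBy(p => Key(p.First())).Max(g => g.Count()) > pages.Count / 2;
        bool footer = pages.GroupBy(p => Key(p.Last())).Max(g => g.Count()) > pages.Count / 2;
        foreach (string[] page in pages)
        {
            int start = header ? 1 : 0;
            int end = footer ? page.Length - 1 : page.Length;
            for (int i = start; i < end; i++)
                yield return page[i];
        }
    }
}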

RDLC rendered to PDF ignores Strikethrough formatting

So, I have a local .rdlc file with some text formatted with strikethrough. My issue is quite simple to explain, but I do not know if it is just a limitation of PDF, or a bug in the .rdlc export to PDF.
When I write this code:
var localReport = new LocalReport();
...
// Render the report to PDF and write the bytes to disk.
byte[] pdf = localReport.Render("PDF");
System.IO.File.WriteAllBytes("MyReport.pdf", pdf);
None of the strikethrough-formatted text transfers over to the .pdf file properly.
If instead I export to Word using .Render("Word"), the strikethrough does work in the .doc format. So I know it isn't a problem with the .rdlc report itself.
Has anyone encountered this? Any solutions or workarounds?
I found this: http://social.msdn.microsoft.com/Forums/en-US/sqlreportingservices/thread/b35ca474-046d-4a38-a765-6c38c3d33105/
which suggests that missing strikethrough in PDFs was a known limitation. (But as mentioned in the comments on the question, I couldn't reproduce it with 2008 R2.)
The two workarounds given there look painful.
(A) finding a font which has the strikethrough built into each glyph/character. (B) trying to mimic a strikethrough using a line report item. Note that for (B), overlapping items are supported only in the PDF, Print & TIFF formats.
I suppose if it were mine, I would play around with option B if the amount of text is small. It may also be worth testing the HTML passthrough that is enabled when a placeholder is set to render as HTML; maybe using a strikethrough style there would work?
I faced this issue while exporting an RDLC report to Word. While fetching the data, I replaced the strike formatting with the HTML strike tag, and it worked.
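A minimal sketch of that workaround, assuming the placeholder's markup type is set to HTML in the report designer and a hypothetical dataset column named Description:

using System.Data;

// Wrap each value in an HTML strike tag before binding it to the report.
// The placeholder must have its MarkupType set to "HTML" in the .rdlc,
// and "Description" is a hypothetical column name.
static void ApplyStrikethrough(DataTable table)
{
    foreach (DataRow row in table.Rows)
    {
        row["Description"] = "<s>" + row["Description"] + "</s>";
    }
}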