Extract text from corrupt (?) PDF document

In a project I'm working on we scrape legal documents from various government sites and then make them searchable online.
Every now and then we encounter a PDF that seems to be corrupt. Here's an example of one.
If you open it in a PDF reader, it looks fine, but:
If you try to copy and paste it, you get corrupted text
If you run it through any tools like pdftotext, you corrupted text
If you do just about anything else to it -- you guessed it -- you get corrupted text
Yet, if you open it in a reader, it looks fine! So I know the text is there, but something is wrong, wrong wrong! The result is that on my site it looks really bad.
Is there anything I can do?
Update: I did more research today. Thanks to @Andrew Cash's observation that this is essentially a Caesar cipher, I realized I could search for the affected documents. This link will show you about 200 of these in my system. Looking through the larger sample set, it looks like these were all created by the same software: pdffactory v. 3.51! So I blame a bug, not deliberate obfuscation.
Update 2: The link above won't return any results anymore. These documents have been purged from my system using my solution below.

The PDF is using subsetted fonts where the characters are remapped to other characters, much like a simple World War II-era substitution cipher.
A = G,
B = 1,
C = #,
D = W,
...
... and so on. Every character is remapped.
The font is mapped this way, and in order to get the correct characters displayed in the PDF you need to send "G1#W" in for it to print out ABCD. Normally PDFs have a ToUnicode table to help with text extraction, but this table has been left out -- on purpose, I suspect.
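To make the idea concrete, here is a toy sketch of that remapping (it uses only the four example pairs listed above; the real font remaps every character):
# Toy illustration of the glyph remapping described above.
# Only the four example pairs from this answer are used.
encode_map = {'A': 'G', 'B': '1', 'C': '#', 'D': 'W'}
decode_map = {v: k for k, v in encode_map.items()}

stored_in_pdf = ''.join(encode_map[c] for c in 'ABCD')     # 'G1#W'
recovered = ''.join(decode_map[c] for c in stored_in_pdf)  # 'ABCD'
print(stored_in_pdf, recovered)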
I have seen a few of these documents myself where they are deliberately obfuscated to prevent text extraction. I have seen a document with about 5 different fonts and they were all mapped using a different sequence.
One sure way to tell if this is the problem is to load the PDF into Acrobat and copy/paste the text into a text editor. If Acrobat cannot decode the text back to English, then there is no way to extract it without remapping it manually -- and that only works if you can figure out the translation mapping.
The only way to extract text easily from these types of documents is to OCR the full document and discard the original text layer. OCR renders each page to an image (a TIFF, say) and recognizes the text from that image, so the garbled text in the original doesn't affect the result.
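For what it's worth, a minimal sketch of that OCR route, assuming the pdf2image and pytesseract Python packages (plus a Tesseract install) are available; the file name is just a placeholder:
# Rough OCR fallback: rasterize each page, then OCR the image,
# ignoring the garbled text layer entirely.
from pdf2image import convert_from_path
import pytesseract

pages = convert_from_path('corrupt.pdf', dpi=300)  # placeholder filename
text = '\n'.join(pytesseract.image_to_string(page) for page in pages)
print(text)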

Weary of this issue and not wanting to deal with OCR, I manually sorted out the cipher. Here she be, as a Python dict along with some rudimentary code that I was using to test it. I'm sure this could be improved, but it does work for all letters except uppercase Q and uppercase X, which I haven't yet been able to find.
It's also missing a fair bit of punctuation, at least for now (all of these are missing, for example: <>?{}\|!~`##$%^_=+).
# -*- coding: utf-8 -*-
import re
import sys
letter_map = {
u'¿':'a',
# 'b', 'c', 'd', 'h', 'l', 'q', and 'x' map to multi-character sequences
# and are handled by the re.sub() replacements after the main loop.
u'»':'e',
u'o':'f',
u'1':'g',
u'·':'i',
u'¶':'j',
u'μ':'k',
u'3':'m',
u'2':'n',
u'±':'o',
u'°':'p',
u'®':'r',
u'-':'s',
u'¬':'t',
u'«':'u',
u'a':'v',
u'©':'w',
u'§':'y',
u'¦':'z',
u'ß':'A',
u'Þ':'B',
u'Ý':'C',
u'Ü':'D',
u'Û':'E',
u'Ú':'F',
u'Ù':'G',
u'Ø':'H',
u'×':'I',
u'Ö':'J',
u'Õ':'K',
u'Ô':'L',
u'Ó':'M',
u'Ò':'N',
u'Ñ':'O',
u'Ð':'P',
# Uppercase 'Q' is missing; its cipher character hasn't been identified yet.
u'Î':'R',
u'Í':'S',
u'Ì':'T',
u'Ë':'U',
u'Ê':'V',
u'É':'W',
# Uppercase 'X' is missing; its cipher character hasn't been identified yet.
u'Ç':'Y',
u'Æ':'Z',
u'ð':'0',
u'ï':'1',
u'î':'2',
u'í':'3',
u'ì':'4',
u'ë':'5',
u'ê':'6',
u'é':'7',
u'è':'8',
u'ç':'9',
u'ò':'.',
u'ô':',',
u'æ':':',
u'å':';',
u'Ž':"'",
u'•':"'",
u'•':"'", # s/b double quote, but identical to single.
u'Œ':"'", # s/b double quote, but identical to single.
u'ó':'-', # dash
u'Š':'-', # n-dash
u'‰':'--', # em-dash
u'ú':'&',
u'ö':'*',
u'ñ':'/',
u'÷':')',
u'ø':'(',
u'Å':'[',
u'Ã':']',
u'‹':'•',
}
ciphertext = u'''YOUR STUFF HERE'''
plaintext = ''
for letter in ciphertext:
    try:
        plaintext += letter_map[letter]
    except KeyError:
        plaintext += letter
# These are multi-length replacements
plaintext = re.sub(u'm⁄4', 'b', plaintext)
plaintext = re.sub(u'g⁄n', 'c', plaintext)
plaintext = re.sub(u'g⁄4', 'd', plaintext)
plaintext = re.sub(u' ́', 'l', plaintext)
plaintext = re.sub(u' ̧', 'h', plaintext)
plaintext = re.sub(u' ̈', 'x', plaintext)
plaintext = re.sub(u' ̄u', 'qu', plaintext)
for letter in plaintext:
    try:
        sys.stdout.write(letter)
    except UnicodeEncodeError:
        continue
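In case it helps anyone wiring this up: a rough sketch (assuming Python 3 and poppler's pdftotext on the PATH; the file name is a placeholder) of how to capture pdftotext output and feed it through the mapping above as ciphertext:
# Rough pipeline sketch: grab the garbled pdftotext output, then run it
# through the letter_map loop and re.sub replacements shown above.
import subprocess

result = subprocess.run(['pdftotext', 'corrupt.pdf', '-'], capture_output=True)
ciphertext = result.stdout.decode('utf-8', 'replace')
# ...then apply the letter_map loop and the re.sub calls as above.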

Related

unable to render Č character in pdf [duplicate]

I have a problem when adding characters such as "Č" or "Ć" while generating a PDF. I'm mostly using paragraphs for inserting some static text into my PDF report. Here is some sample code I used:
var document = new Document();
document.Open();
Paragraph p1 = new Paragraph("Testing of letters Č,Ć,Š,Ž,Đ", new Font(Font.FontFamily.HELVETICA, 10));
document.Add(p1);
The output I get when the PDF file is generated looks like this: "Testing of letters ,,Š,Ž,Đ"
For some reason iTextSharp doesn't seem to recognize letters such as "Č" and "Ć".
THE PROBLEM:
First of all, you don't seem to be talking about Cyrillic characters, but about central and eastern European languages that use Latin script. Take a look at the difference between code page 1250 and code page 1251 to understand what I mean. [NOTE: I have updated the question so that it talks about Czech characters instead of Cyrillic.]
Second observation. You are writing code that contains special characters:
"Testing of letters Č,Ć,Š,Ž,Đ"
That is a bad practice. Code files are stored as plain text and can be saved using different encodings. An accidental change of encoding (for instance, by uploading the file to a versioning system that uses a different encoding) can seriously damage its content.
You should write code that doesn't contain special characters, but uses a different notation instead. For instance:
"Testing of letters \u010c,\u0106,\u0160,\u017d,\u0110"
This will also make sure that the content doesn't get altered when compiling the code using a compiler that expects a different encoding.
Your third mistake is that you assume that Helvetica is a font that knows how to draw these glyphs. That is a false assumption. You should use a font file such as Arial.ttf (or pick any other font that knows how to draw those glyphs).
Your fourth mistake is that you do not embed the font. Suppose that you use a font you have on your local machine and that is able to draw the special glyphs: you will be able to read the text on your own machine, but somebody who receives your file and doesn't have that font may not be able to read the document correctly.
Your fifth mistake is that you didn't define an encoding when using the font (this is related to your second mistake, but it's different).
THE SOLUTION:
I have written a small example called CzechExample that results in the following PDF: czech.pdf
I have added the same text twice, but using a different encoding:
public static final String FONT = "resources/fonts/FreeSans.ttf";

public void createPdf(String dest) throws IOException, DocumentException {
    Document document = new Document();
    PdfWriter.getInstance(document, new FileOutputStream(dest));
    document.open();
    Font f1 = FontFactory.getFont(FONT, "Cp1250", true);
    Paragraph p1 = new Paragraph("Testing of letters \u010c,\u0106,\u0160,\u017d,\u0110", f1);
    document.add(p1);
    Font f2 = FontFactory.getFont(FONT, BaseFont.IDENTITY_H, true);
    Paragraph p2 = new Paragraph("Testing of letters \u010c,\u0106,\u0160,\u017d,\u0110", f2);
    document.add(p2);
    document.close();
}
To avoid your third mistake, I used the font FreeSans.ttf instead of Helvetica. You can choose any other font as long as it supports the characters you want to use. To avoid your fourth mistake, I have set the embedded parameter to true.
As for your fifth mistake, I introduced two different approaches.
In the first case, I told iText to use code page 1250.
Font f1 = FontFactory.getFont(FONT, "Cp1250", true);
This will embed the font as a simple font into the PDF, meaning that each character in your String will be represented using a single byte. The advantage of this approach is simplicity; the disadvantage is that you shouldn't start mixing code pages. For instance: this won't work for Cyrillic glyphs.
In the second case, I told iText to use Unicode for horizontal writing:
Font f2 = FontFactory.getFont(FONT, BaseFont.IDENTITY_H, true);
This will embed the font as a composite font into the PDF, meaning that each character in your String will be represented using more than one byte. The advantage of this approach is that it is the recommended approach in the newer PDF standards (e.g. PDF/A, PDF/UA), and that you can mix Cyrillic with Latin, Chinese with Japanese, etc... The disadvantage is that you create more bytes, but that effect is limited by the fact that content streams are compressed anyway.
When I decompress the content stream for the text in the sample PDF, the PDF syntax shows exactly that: single bytes are used to store the text of the first line, and double bytes are used to store the text of the second line.
You may be surprised that these characters look OK on the outside (when viewing the text in Adobe Reader) but don't correspond with what is stored on the inside (in the content stream), but that's how it works.
CONCLUSION:
Many people think that creating PDF is trivial, and that tools for creating PDF should be a commodity. In reality, it's not always that simple ;-)
If you are using the FontProvider, I managed to solve the display of the special characters by setting the registerShippedFreeFonts parameter to true:
FontProvider dfp = new DefaultFontProvider(true, true, false);
See also: https://itextpdf.com/en/resources/books/itext-7-converting-html-pdf-pdfhtml/chapter-6-using-fonts-pdfhtml

making a list of traditional Chinese characters from a string

I am currently trying to estimate the number of times each character is used in a large sample of traditional Chinese characters. I am interested in characters, not words. The file also includes punctuation and western characters.
I am reading in an example file containing a large sample of traditional Chinese characters. Here is a small subset:
首映鼓掌10分鐘 評語指不及《花樣年華》
該片在柏林首映,完場後獲全場鼓掌10分鐘。王家衛特別為該片剪輯「柏林版本
增減20處 趙本山香港戲分被刪
在柏林影展放映的《一代宗師》版本
教李小龍武功 葉問決戰散打王
另一增加的戲分是開場時葉問(梁朝偉飾)
My strategy is to read each line, split each line into a list, and go through and check each character to see if it already exists in a list or a dictionary of characters. If the character does not yet exist in my list or dictionary I will add it to that list, if it does exist in my list or dictionary, I will increase the counter for that specific character. I will probably use two lists, a list of characters, and a parallel list containing the counts. This will be more processing, but should also be much easier to code.
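For reference, a minimal sketch of that counting strategy, using a single collections.Counter rather than two parallel lists (Python 3; the file name matches the example below and the code is an untested outline):
# Count how often each character appears; one dict key per character.
# Punctuation and western characters could be filtered out afterwards.
from collections import Counter
import codecs

counts = Counter()
with codecs.open('Chinese_example.txt', 'r', encoding='utf-8') as wordfile:
    for line in wordfile:
        counts.update(line.strip())

for char, n in counts.most_common(20):
    print(char, n)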
I have not gotten anywhere near this point yet.
I am able to read in the example file successfully. Then I am able to make a list for each line of my file. I am able to print out those individual lines into my output file and sort of reconstitute the original file, and the traditional Chinese comes out intact.
However, I run into trouble when I try to make a list of each character on a particular line.
I've read through the following article. I understood many of the comments, but unfortunately, was unable to understand enough of it to solve my problem.
How to do a Python split() on languages (like Chinese) that don't use whitespace as word separator?
My code looks like the following
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import codecs
wordfile = open('Chinese_example.txt', 'r')
output = open('Chinese_output_python.txt', 'w')
LINES = wordfile.readlines()
Through various tests I am sure the following line is not splitting the string LINES[0] into its component Chinese characters.
A_LINE = list(LINES[0])
output.write(A_LINE[0])
You want to use this, from answerer 'flow' at How to do a Python split() on languages (like Chinese) that don't use whitespace as word separator?:
from re import compile as _Re
_unicode_chr_splitter = _Re( '(?s)((?:[\ud800-\udbff][\udc00-\udfff])|.)' ).split
def split_unicode_chrs( text ):
    return [ chr for chr in _unicode_chr_splitter( text ) if chr ]
to successfully split a line of traditional Chinese characters... I just had to know the proper syntax to handle encoded characters. Pretty basic.
my_new_list = list(unicode(LINES[0].decode('utf8')))
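As an aside (not part of the original answer): under Python 3 this gets simpler, because reading the file with an explicit encoding already gives you a str, and list() then yields one character per element:
# Python 3: decode on read, then list() splits into individual characters.
with open('Chinese_example.txt', encoding='utf-8') as wordfile:
    first_line = wordfile.readline()
chars = list(first_line)
print(chars[:10])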

Encoding issue in I/O with Jena

I'm generating some RDF files with Jena. The whole application works with utf-8 text. The source code as well is stored in utf-8.
When I print a string contaning non-English characters on the console, I get the right format, e.g. Est un lieu généralement officielle assis....
Then, I use the RDF writer to output the file:
Model m = loadMyModelWithMultipleLanguages();
log.info( getSomeStringFromModel(m) );       // log4j, correct output
RDFWriter w = m.getWriter( "RDF/XML" );      // default enc: utf-8
w.setProperty("showXmlDeclaration", "true"); // optional
OutputStream out = new FileOutputStream(pathToFile);
w.write( m, out, "http://someurl.org/base/" );
// file contains garbled text
The RDF file starts with: <?xml version="1.0"?>. If I add encoding="utf-8" to the declaration, nothing changes.
By default the text should be encoded to utf-8.
The resulting RDF file validates ok, but when I open it with any editor/visualiser (vim, Firefox, etc.), non-English text is all messed up: Est un lieu gÃ©nÃ©ralement officielle assis ... or Est un lieu g\u221A\u00A9n\u221A\u00A9ralement officielle assis....
(Either way, this is obviously not acceptable from the user's viewpoint).
The same issue happens with any output format supported by Jena (RDF, NT, etc.).
I can't really find a logical explanation to this.
The official documentation doesn't seem to address this issue.
Any hint or tests I can run to figure it out?
My guess would be that your strings are messed up, and your printStringFromModel() method just happens to output them in a way that accidentally makes them display correctly, but it's rather hard to say without more information.
You're instructing Jena to include an XML declaration in the RDF/XML file, but don't say what encoding (if any) Jena declares in the XML declaration. This would be helpful to know.
You're also not showing how you're printing the strings in the printStringFromModel() method.
Also, in Firefox, go to the View menu and then to Character Encoding. What encoding is selected? If it's not UTF-8, then what happens when you select UTF-8? Do you get it to show things correctly when selecting some other encoding?
Edit: The snippet you show in your post looks fine and should work. My best guess is that the code that reads your source strings into a Jena model is broken, and reads the UTF-8 source as ISO-8859-1 or something similar. You should be able to confirm or rule that out by checking the length() of one of the offending strings: if each troublesome character like é is counted as two, then the error is on reading; if it's correctly counted as one, then it's on writing.
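A small sketch of that length() check, in Python for brevity (the same logic applies to Java's String.length()): UTF-8 bytes mis-read as ISO-8859-1 turn one é into two characters.
# 'é' is two bytes in UTF-8; decoding those bytes as ISO-8859-1
# produces two characters ('Ã©') instead of one, which is what a
# doubled length() on the Java side would reveal.
correct = 'é'
utf8_bytes = correct.encode('utf-8')       # b'\xc3\xa9'
misread = utf8_bytes.decode('iso-8859-1')  # 'Ã©'
print(len(correct), len(misread))          # 1 2
print(misread)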
My hint/answer would be to inspect the byte sequence in 3 places:
1. The data source. Using a hex editor, confirm that the é character in your source data is represented by the expected UTF-8 byte sequence 0xC3 0xA9.
2. In memory. Right after your call to printStringFromModel, put a breakpoint and inspect the bytes in the string (or convert them to hex and print them out).
3. The output file. Again, use a hex editor to check that the byte sequence is 0xC3 0xA9.
This will tell you exactly what is happening to the bytes as they travel along the path of your program, and where they deviate from the expected 0xC3 0xA9.
The best way to address this would be to package up the smallest unit of your code that you can that demonstrates the issue, and submit a complete, runnable test case as a ticket on the Jena Jira.

parsing pdf with pdf-box shows unknown characters

When I try to parse a PDF file with PDFBox in Java that was generated with cups-pdf, it shows junk characters, but it works perfectly with common PDFs. I checked the fonts: the cups-pdf file shows FreeMono_00.ttf (but I didn't see such a font anywhere), while a working PDF shows ArialMT.
Is there anything I should do differently when parsing PDFs generated using cups-pdf?
Below is the code I'm using for parsing:
PDFParser parser = new PDFParser(new FileInputStream(file));
parser.parse();
COSDocument cosDoc = parser.getDocument();
PDFTextStripper pdfStripper = new PDFTextStripper();
PDDocument pdDoc = new PDDocument(cosDoc);
String parsedText = pdfStripper.getText(pdDoc);
The output looks like this:
)LOH1DPHDVGW[W
6XEMHFWVXEMHFWVVDPSOH
0HVVDJHVHQGLQJGHWDLOVDORQJZLWKSULQWILOH
8VHU1DPH$EGXOUD]DN30
8VHU,'D#DFRP
Just copying and pasting from the PDF gives the same kind of output.
I'm only repeating what I've read... I'm inexperienced here. If there were more mavens answering PDF/PDFBox questions, I'd wait for one to answer.
I believe the font either doesn't contain Unicode tables at all, or has been embedded in the document without Unicode tables. If the text seems to be a simple substitution cipher for a single given document, that would tend to confirm this.
If the font is embedded, I think sometimes only a subset of the glyphs actually used is embedded. That's likely here, since the font is not installed on the system (you said), and the original FreeMono font is large - over 4000 glyphs. In this case, I fear that the correspondence between character code and glyph may be document-dependent - but I'm speculating.
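For what it's worth, and this is only a guess based on the sample output above rather than anything from the original answer, the garbled strings look like the real text with every character code shifted down by a constant 0x1D. A quick, purely illustrative check:
# Hypothesis check: add 0x1D to each character code of the sample
# strings from the question and see whether they become readable.
samples = [
    ')LOH1DPHDVGW[W',
    '6XEMHFWVXEMHFWVVDPSOH',
    '0HVVDJHVHQGLQJGHWDLOVDORQJZLWKSULQWILOH',
]
for s in samples:
    print(''.join(chr(ord(c) + 0x1D) for c in s))
# Prints roughly: FileNameasdtxt / Subjectsubjectssample /
# Messagesendingdetailsalongwithprintfile (spaces and dots are lost),
# which is consistent with a simple fixed-offset substitution.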