When i try to parse the pdf, i can't get the content of pdf but getting random symbols and characters. What is the reason behind it? This should give the proper text.
I have tried using PyPDF2 also still can not get the text.
filename = "test2.pdf"
with fitz.open(filename) as f:
for p in f:
print("\n\n")
print(p.get_text(sort=True))
Result :
enter image description here
This type of result i get.
I have an issue when trying to read a string from a .CSV file. When I execute the application and the text is shown in a textbox, certain characters such as "é" or "ó" are shown as a question mark symbol.
The idea is that this code reads the whole CSV file and then splits each line into variables depending on the first word of the line.
The code I'm using to read is:
Dim test() As String
test = IO.File.ReadAllLines("Libro1.csv")
Dim test_chart As String = Array.Find(vls1load, Function(x) (x.StartsWith("sample")))
Dim test_chart_div() As String = test_chart.Split(";")
variable1 = test_chart_div(1)
variable2 = test_chart_div(2)
...etc
I have also tried with:
Dim test() As String
test = IO.File.ReadAllLines("Libro1.csv", System.Text.Encoding.UTF8)
But none of them works. The .csv file is supposed to be UTF8. The "web options" that you can see when saving the file in excel show encoding UTF8. I also tried the trick of changing the file extension to HTML and opening it with the browser to see that the encoding is also correct.
Can someone advice anything else I can try?
Thanks in advance.
When an Excel file is exported using the CSV Comma Separated output format, the Encoding selected in Tools -> Web Option -> Encoding of Excel's Save As... dialog doesn't actually generate the expected result:
the Text file is saved using the Encoding relative to the current Language selected in the Excel Application, not the Unicode (UTF16-LE) or UTF-8 Encoding selected (which is ignored) nor the default Encoding determined by the current System Language.
To import the CSV file, you can use the Encoding.GetEncoding() method to specify the Name or CodePage of the Encoding used in the machine that generated the file: again, not the Encoding related to System Language, but the Encoding of the Language that the Excel Application is currently using.
CodePage 1252 (Windows-1252) and ISO-8859-1 are commonly used in Latin1 zone.
Based the symbols you're referring to, this is most probably the original encoding used.
In Windows, use the former. ISO-8859-1 is still used, mostly in old Web Pages (or Web Pages created without care for the Encoding used).
As a note, CodePage 1252 and ISO-8859-1 are not exactly the same Encoding, there are subtle differences.
If you find documentation that states the opposite, the documentation is wrong.
This is my first time working on a python project outside of school, so bear with me.
When I run the code below, I get the error
"(unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: truncated\uXXXXXXXX escape"
and the IDLE editor highlights the '(' before the argument of pd.read_csv.
I googled the error but got a lot of stuff that went way over my head.
The csv file in question is an excel file i saved as csv. should i save it some other way?
import pandas as pd
field = pd.read_csv("C:\Users\Glen\Documents\Feild.csv")
I just want to convert my excel data into a data frame and I don't understand why it was so easy in class, and it's now so difficult on my home pc.
The problem is with the path. There are two ways to mention the path while reading a csv file,
1- Use double backslashes,
pd.read_csv("C:\\Users\\Glen\\Documents\\Feild.csv")
2- Use single forwardslash,
pd.read_csv("C:/Users/Glen/Documents/Feild.csv")
If these do not work, try this one,
pd.read_csv("C:\\Users\\Glen\\Documents\\Feild.csv", encoding='utf-8')
OR
pd.read_csv("C:/Users/Glen/Documents/Feild.csv", encoding='utf-8')
In a project I'm working on we scrape legal documents from various government sites and then make them searchable online.
Every now and then we encounter a PDF that seems to be corrupt. Here's an example of one.
If you open it in a PDF reader, it looks fine, but:
If you try to copy and paste it, you get corrupted text
If you run it through any tools like pdftotext, you corrupted text
If you do just about anything else to it -- you guessed it -- you get corrupted text
Yet, if you open it in a reader, it looks fine! So I know the text is there, but something is wrong, wrong wrong! The result is that on my site it looks really bad.
Is there anything I can do?
Update: I did more research today. Thanks to #Andrew Cash's observation that this is essentially a Caesar cipher, I realized I could search for the documents. This link will show you about 200 of these in my system. Looking through the larger sample set, it looks like these are all created by the same software, pdffactory v. 3.51! So I blame a bug, not deliberate obfuscation.
Update 2: The link above won't provide any results anymore. These are purged from my system using my solution below.
Tha PDF is using subsetted fonts where the characters are remapped to other characters using the same as a simple World War II substitution cipher.
A = G,
B = 1,
C = #,
D = W,
...
... and so on. Every character is remapped.
The font is mapped this way and in order to get the correct characters displaying in the PDF you need to send "G1#W" in for it to print out ABCD. Normally PDF's will have a ToUnicode table to help you with text extraction but this table has been left out on purpose I suspect.
I have seen a few of these documents myself where they are deliberately obfuscated to prevent text extraction. I have seen a document with about 5 different fonts and they were all mapped using a different sequence.
One sure way to tell if this is the problem is to load the PDF into Acrobat and copy / paste the text into a text editor. If Acrobat cannot decode the text back to English then there is no way to extract the text without remapping it manually if you know the translation mappings.
The only way to extract text easily from these types of documents is to OCR the full document and remove the original text. The OCR would convert the page to a TIFF image and then OCR it so the original garbled text shouldn't affect the OCR.
Weary of this issue and not wanting to deal with OCR, I manually sorted out the cipher. Here she be, as a python dict along with some rudimentary code that I was using to test it. I'm sure this could be improved, but it does work for all letters except uppercase Q and uppercase X, which I haven't yet been able to find.
It's missing a fair bit of punctuation too at least for now (all of these are missing, for example: <>?{}\|!~`##$%^_=+).
# -*- coding: utf-8 -*-
import re
import sys
letter_map = {
u'¿':'a',
u'regex':'b',
u'regex':'c',
u'regex':'d',
u'»':'e',
u'o':'f',
u'1':'g',
u'regex':'h',
u'·':'i',
u'¶':'j',
u'μ':'k',
u'regex':'l',
u'3':'m',
u'2':'n',
u'±':'o',
u'°':'p',
u'regex':'q',
u'®':'r',
u'-':'s',
u'¬':'t',
u'«':'u',
u'a':'v',
u'©':'w',
u'regex':'x',
u'§':'y',
u'¦':'z',
u'ß':'A',
u'Þ':'B',
u'Ý':'C',
u'Ü':'D',
u'Û':'E',
u'Ú':'F',
u'Ù':'G',
u'Ø':'H',
u'×':'I',
u'Ö':'J',
u'Õ':'K',
u'Ô':'L',
u'Ó':'M',
u'Ò':'N',
u'Ñ':'O',
u'Ð':'P',
u'':'Q', # Missing
u'Î':'R',
u'Í':'S',
u'Ì':'T',
u'Ë':'U',
u'Ê':'V',
u'É':'W',
u'':'X', # Missing
u'Ç':'Y',
u'Æ':'Z',
u'ð':'0',
u'ï':'1',
u'î':'2',
u'í':'3',
u'ì':'4',
u'ë':'5',
u'ê':'6',
u'é':'7',
u'è':'8',
u'ç':'9',
u'ò':'.',
u'ô':',',
u'æ':':',
u'å':';',
u'Ž':"'",
u'•':"'",
u'•':"'", # s/b double quote, but identical to single.
u'Œ':"'", # s/b double quote, but identical to single.
u'ó':'-', # dash
u'Š':'-', # n-dash
u'‰':'--', # em-dash
u'ú':'&',
u'ö':'*',
u'ñ':'/',
u'÷':')',
u'ø':'(',
u'Å':'[',
u'Ã':']',
u'‹':'•',
}
ciphertext = u'''YOUR STUFF HERE'''
plaintext = ''
for letter in ciphertext:
try:
plaintext += letter_map[letter]
except KeyError:
plaintext += letter
# These are multi-length replacements
plaintext = re.sub(u'm⁄4', 'b', plaintext)
plaintext = re.sub(u'g⁄n', 'c', plaintext)
plaintext = re.sub(u'g⁄4', 'd', plaintext)
plaintext = re.sub(u' ́', 'l', plaintext)
plaintext = re.sub(u' ̧', 'h', plaintext)
plaintext = re.sub(u' ̈', 'x', plaintext)
plaintext = re.sub(u' ̄u', 'qu', plaintext)
for letter in plaintext:
try:
sys.stdout.write(letter)
except UnicodeEncodeError:
continue
i control the problem of the data what is uploaded by the POST method, in the web.
If the file is a text theres no problem but the trouble comes when it's an encoded file, as a Picture or other. What the when the system insert the data into the file.
Well it doesn 't encoded in the right way. I will put all the code, from the area whats take the environ['wsgi.input'] to the area thats save the file:
# Here the data from the environ['wsgi.input'],
# first i convert the byte into a string delete the first
# field that represent the b and after i strip the single quotes
tmpData = str(rawData)[1:].strip("' '")
dat = tmpData.split('\\r')#Then i split all the data in the '\\r'
s = open('/home/hidura/test.png', 'w')#I open the test.png file.
for cont in range(5,150):#Now beging in the 5th position to the 150th position
s.write(dat[cont])#Insert the piece of the data in the file.
s.close()#Then closed.
Where is the mistake?
Thankyou in advance.
Why do you convert the binary data to a string? A png file is binary data. Just write the binary data to the file. You need to open the file in binary mode as well.
s = open('/home/hidura/test.png', 'wb')
s.write(data)