Can't get the text from PDF

When I try to parse the PDF, I don't get its contents, just random symbols and characters. What is the reason behind this? The code below should give the proper text.
I have also tried PyPDF2 and still cannot get the text.
import fitz  # PyMuPDF

filename = "test2.pdf"
with fitz.open(filename) as f:
    for p in f:
        print("\n\n")
        print(p.get_text(sort=True))
Result: the output is random symbols instead of the document's text (screenshot omitted).
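Output like this usually points at fonts that lack a usable Unicode mapping (see the corrupt-PDF discussion further down). As a hedged diagnostic sketch (not part of the original question), you can list the fonts with PyMuPDF:

import fitz  # PyMuPDF

# Subsetted fonts (basefont names like "ABCDEF+SomeFont") with odd
# encodings often go hand in hand with garbled extraction.
with fitz.open("test2.pdf") as doc:
    for page in doc:
        for font in page.get_fonts():
            print(font)  # (xref, ext, type, basefont, name, encoding)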

Related

Barcode Human Readable Text ( Interpretation Line) masking in ZPL II

I am trying to make a ZPL label with Code 128. My issue is that I would like to alter the human-readable interpretation line under the barcode. Normally it prints exactly what is encoded; I want to add a space after every 5 characters (as pictured; screenshot omitted).
I was able to do it in ZebraDesigner, but we have to store the label as ZPL, and when I export the .prn file the barcode comes out as an encoded graphic, so I always get the same barcode printed, which in my case is not correct.
I tried turning off the barcode interpretation line and using ^FN to print my own interpretation line and format it, but after going through the internet and the ZPL II manuals I am unable to find anything.
Content of the .prn file I am getting:
CT~~CD,~CC^~CT~
^XA
~TA000
~JSN
^LT0
^MNW
^MTT
^PON
^PMN
^LH0,0
^JMA
^PR4,4
~SD15
^JUS
^LRN
^CI27
^PA0,1,1,0
^XZ
^XA
^MMT
^PW2457
^LL1200
^LS0
^FO652,258^GFA,2573,46768,148,:Z64:eJzszjEOgDAMA0D//9NGJYXuDEy3RHEiS9emSbtGetYnTp5wb9mF9Zjrx1r3LafwdpmYmJiYmJiYmJiYmJiYmJiYmJiYmJiYmJiYmJiYmJiYmJiYmJiYmJiYmJiYmJiYmJiYmJiYmJiYmJiYmJiYmJiYmJiYmJiYmJiYmJiYmJiYmJiYmJiYmJiYmJiYmJiYmJiYmJiYmJiYmJiYmJiYmJiYmJiYmJiYmJiYmJiYmP41XQAAAP//7dmxjqU2FAZgExdO5xeIRB4j1fq1UkQy0RRb7iuxSpFyHyGeastFShEXxM75/2MzMLszyUKkKBKW5gruAP6uMcfH5jbdptt0m27TbbpNt+k23ab/k+kud7nLXe5yl7vc5e/LOMvfYsxQjt/btW14bCC3cJn7g+wP2B/T8Ywwt40qG14uaeJ0zoQrRfH4OulVQuWW7FtJdhYTZddWqT/UDeuwX7NC5TBq5TMgPXLYDXJJW5fTpgGKgGpM3Uyhzmqq8rM9qon8nqaxgoKW9ZvJ1lVNXjYMLuLU+vUlzrjoLDXiBj6ZYk00yUcSA6uZcYJbAV6lRt7AzeTESFOQjUF/yfpKxa8UOdehWl4YpkjToBx+LL0atCT7V5R9z+PE5NQ0kqMfpv2S8lrNr5q83qIV7b8zZZpYY9Rqlm7SW4R2C3vTRBMu4PSXnEu7B7mwXC2LoaD9u8l2kwdWu652j3EFuE4j2y1uptBN8jd7fTq0B35tsS+YpKJVTWWkaeqm8IJJzpvV9EhTPm3yYgo5SJu8bQGhmYRC05hdGep7qQY4KdJstj5Im+SYGAC66VFN0idjGlf8kkd9Kr62jDAtXuq3FXsT2wS9xtGU1DTF2Wc8cdJGMEk3DEtYZC8br6Y6R5iKmnAwzjpjijAlEZShmya2n1AMote4WPwrdJODqUCQxmbSKAQKohcOln/Z0ybGkZhcM4Unk1BAnmCSTVZT2LASmoqENZrsziTHIK7TFJL+knTCpOFyFhN+H0O1mkSQWFEzjd3EcLnCJE3IUK0mOf0Nf0szDddN5rtnptQqetE0PDNpfBz/DZPccgcTx3rp3oPGuWYqexOrDX/IsT9hSEI6gYFwXPamtDc9Tx3+mamwG8JkOXwcTdJR0bXVlNT0kREBp4XEwfhgkiEdXZumctK0qimrqcAUJmNaq7sn0xISueFXNU1q+iSm35kWqClupsqzzpgWmtAlZByzK0wc1sLjqlYJnvlgcho5Jw6V4UPm4AsTskNYJXhOl0xpM2GUyDRx9PrIyMlMRHIDiavN9GYzcSB6R9PEXpBNT8DkcHvepGEN58aFJssEQRqgZbBIUfamH1o0R/Q0NAUmCNLGLUlGinLF9A07q1xCLp+EQBO78m8YYdFHpUYMF930PU0YiBDoq6dJDrN/4ryBg+YqaZU7bdIHyHA0lQQhaRJumHksaoq4n6l2k9ERprADDkVNCyNdVRNi1lj8VRNTfknCmmmiKesEAncGCcLB5Nke0onUlNU0MVlHO8m4fNGExA45SJj3ppUmq0noM9PIume7HkwzTUGT0IsmyydbBtYJqZtlTktTnE3PfY+mgOBRk8tikq62dpNjss4E8KIJT5B5kKQIpsI5Y1jHwmjjkZ2F5yZ5SM23knfRlDxN76UHIPgiAbSXTegdeLyhMj/quMH8SXaZNo3109HEKcy4eFEZNzmdXuDJzUyk5EktF00tqf25aFKApxABRqMN0gFfPxxM7IACGBOTAsdJLyIKAxoGnbpeNAWdWz80U/jM5Oq7g8nq9D0HNdnPTTFfNEWeGn+pLV7CBAFTT1LKeDC5toU5QWomCpB6csM/XjTpBPcLptJN5mhiB/yiKbUNc34Mpkl7x9HE3MzuTEzTukk74MHE9A+ndpNmgudNk8HjpqblBdOY7JNJehEeN5r8C6ahnMt927oKsxI1+RdNy5MJY3btpvqS6WQ+rqFaB9RmyjvT2kzj01xKQ3XCmN1N0840q2lYzs+laEIYlr/aTZr/yyVdM3HOuTdxfJx9N7UphphiM2HOecnE3iGmB5iwMbc5p94mW/dzc3zJDohvaJINzeYQlmYNVJfm5jTJeMtFAXaSwseQc3N2Z+A+M1XOzbP2w8THkGMLnxhM9q+akAO4+raZKoO0L76ZgKuS9vd1FbdqDhBzbKaF44DcxGYSnBycT6+rdFNLL7f8CetPHOXHIgnCfv2pmzS93PInrj8hkRiEki6tP8EU96ZBNzhV4BQLG3G3Tue4vLozjW32xYR04sLi7K6s03WTrtOpqahJVwJaTle29cxu0nU6NSU16WJDy+nS6fXMbtI1VsM1MZ0sicBrTrtw4aKv+3aTrrH63BZbderZ0mZdGzm97hs5A8DtaVPG2N4KVM3QBl1Vzvv1cd9uz0ST6y8eNAkkha8BTq+Pd9PuPYJmLlikJpOr1cv2HqGb9u8R+D3X9SfT3yOYS+8RaPJsBf46Nseog6C+ZnE6bTB7k2ErMNllcwx1URNfs8T1wvuWXrb3UsPTq55uOLyX2sqWiLQsaXsvxRT/ynupu9zlLne5y13ucpf/vvwFfhWYpQ==:AC0D
^PQ1,0,1,Y
^XZ
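One possible workaround (a sketch, not from the original thread): suppress the built-in interpretation line in ^BC and emit the spaced text yourself as an ordinary text field. A minimal Python snippet that generates such a label; the data, field positions, and font sizes are placeholder values:

# Insert a space after every 5 characters of the encoded data.
data = "1234567890ABCDE"
spaced = " ".join(data[i:i + 5] for i in range(0, len(data), 5))

zpl = (
    "^XA"
    "^FO50,50^BY3"
    "^BCN,150,N,N,N"        # third ^BC parameter N = no interpretation line
    "^FD" + data + "^FS"
    "^FO50,230^A0N,40,40"   # our own, formatted human-readable line
    "^FD" + spaced + "^FS"
    "^XZ"
)
print(zpl)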

How to handle emoji when scraping text in Python with bs4

I'm creating a scraper that scrapes all the comments on a URL's page and saves each text in a txt file (1 comment = 1 txt).
Now I'm having a problem when there are emoji in the text of a comment: the program stops and says "UnicodeEncodeError: 'charmap' codec can't encode the character". How can I get past this problem? (I'm using bs4.)
The structure of the code is like this:
import requests
from bs4 import BeautifulSoup

q = requests.get(url)  # url and CreatorName are defined elsewhere
soup = BeautifulSoup(q.content, "html.parser")
x = soup.find("a", {"class": "comments"})
y = x.find_all("div", {"class": "blabla"})
i = 0
for item in y:
    name = str(i)
    comment = item.find_all("p")
    out_file = open('%s.txt' % CreatorName, "w")
    out_file.write(str(comment))
    out_file.close()
    i = i + 1
Thanks to everyone.
My guess is that you are on Windows; your code works perfectly on Linux. So change the encoding on the file you open to utf-8, like this:
out_file=open('%s.txt'%CreatorName, "w", encoding='utf-8')
This should write to the file without error, although the emoji may not display properly in Notepad; you can always open the file in Firefox or another application if you want to see them. The rest of the comment text should be readable in Notepad, though.

Extract text from PDF in respect to formatting (font size, type etc)

Is it possible to extract text from a PDF file based on specific font/font size/font colour etc.? I prefer Perl, Python, or *nix command-line utilities. My goal is to extract all the headlines from a PDF file so I have a nice index of the articles it contains.
Text and font/font size/position (no colour, as far as I checked) can be obtained from Ghostscript's txtwrite device (try the -dTextFormat=0 | 1 options), as well as from mudraw (MuPDF) with the -tt option. Then parse the XML-like output with e.g. Perl.
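For instance, a minimal sketch driving txtwrite from Python (an illustration, not from the answer above; it assumes Ghostscript is installed as gs, and the file names are placeholders):

import subprocess

# -dTextFormat=0 emits XML-like output with font and position details;
# higher values emit progressively plainer text.
subprocess.run(
    ["gs", "-sDEVICE=txtwrite", "-dTextFormat=0", "-o", "output.xml", "input.pdf"],
    check=True,
)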
I have working code that extracts text from a PDF along with the font size. I achieved this with the help of pdfminer, and it has worked across many PDFs.
from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextContainer, LTChar
import os

path = r'path\whereyour pdffile'
os.chdir(path)

Extract_Data = []
for PDF_file in os.listdir():
    if PDF_file.endswith('.pdf'):
        for page_layout in extract_pages(PDF_file):
            for element in page_layout:
                if isinstance(element, LTTextContainer):
                    for text_line in element:
                        for character in text_line:
                            if isinstance(character, LTChar):
                                Font_size = character.size  # size of the last character seen
                        Extract_Data.append([Font_size, element.get_text()])
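Since this appends the element's whole text once per text line, the same text can appear several times. A small hedged cleanup (an addition, not part of the answer above) to deduplicate while preserving order:

# Collapse repeated [font_size, text] rows, keeping first occurrences.
seen = set()
deduped = []
for font_size, text in Extract_Data:
    if (font_size, text) not in seen:
        seen.add((font_size, text))
        deduped.append([font_size, text])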
I have used fitz to accomplish the required task, as it is much faster compared to pdfminer. You can find my duplicate answer to a similar question here.
An example code snippet is shown below.
import fitz  # PyMuPDF

def scrape(keyword, filePath):
    results = []  # list of tuples that store the information as (text, font size, font name)
    pdf = fitz.open(filePath)  # filePath is a string that contains the path to the pdf
    for page in pdf:
        page_dict = page.get_text("dict")  # avoid shadowing the built-in dict
        for block in page_dict["blocks"]:
            if "lines" in block:  # image blocks carry no "lines" key
                for line in block["lines"]:
                    for span in line["spans"]:
                        if keyword in span["text"].lower():  # only store font information of a specific keyword
                            # span["text"] -> string, span["size"] -> font size, span["font"] -> font name
                            results.append((span["text"], span["size"], span["font"]))
    pdf.close()
    return results
If you wish to find the font information of every line, you may omit the if condition that checks for a specific keyword.
You can extract the text information in any desired format by studying the structure of the dictionary that get_text("dict") returns, as described in the documentation.
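Tying this back to the original goal of indexing headlines, a hedged sketch (it assumes headlines are simply the spans set in the document's largest font size, which may not hold for every PDF; the file name is a placeholder):

# An empty keyword matches every span, so this collects the whole document.
all_spans = scrape("", "articles.pdf")
if all_spans:
    max_size = max(size for _, size, _ in all_spans)
    headlines = [text for text, size, _ in all_spans if size == max_size]
    print(headlines)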

Extract text from corrupt (?) pdf document

In a project I'm working on we scrape legal documents from various government sites and then make them searchable online.
Every now and then we encounter a PDF that seems to be corrupt. Here's an example of one.
If you open it in a PDF reader, it looks fine, but:
If you try to copy and paste it, you get corrupted text
If you run it through any tool like pdftotext, you get corrupted text
If you do just about anything else to it -- you guessed it -- you get corrupted text
Yet, if you open it in a reader, it looks fine! So I know the text is there, but something is wrong, wrong wrong! The result is that on my site it looks really bad.
Is there anything I can do?
Update: I did more research today. Thanks to @Andrew Cash's observation that this is essentially a Caesar cipher, I realized I could search for the documents. This link will show you about 200 of these in my system. Looking through the larger sample set, it looks like these are all created by the same software, pdfFactory v. 3.51! So I blame a bug, not deliberate obfuscation.
Update 2: The link above won't provide any results anymore. These are purged from my system using my solution below.
The PDF is using subsetted fonts in which the characters are remapped to other characters, much like a simple WWII-style substitution cipher.
A = G,
B = 1,
C = #,
D = W,
...
... and so on. Every character is remapped.
The font is mapped this way, so to get the correct characters to display in the PDF you need to send in "G1#W" for it to print out "ABCD". Normally PDFs have a ToUnicode table to help with text extraction, but this table has been left out, on purpose I suspect.
I have seen a few of these documents myself where they are deliberately obfuscated to prevent text extraction. I have seen a document with about 5 different fonts, and each was mapped using a different sequence.
One sure way to tell if this is the problem is to load the PDF into Acrobat and copy/paste the text into a text editor. If Acrobat cannot decode the text back to English, then there is no way to extract it without remapping it manually, and that requires knowing the translation mappings.
The only way to extract text easily from these types of documents is to OCR the full document and discard the original text layer. The OCR converts each page to a TIFF image and then OCRs it, so the original garbled text doesn't affect the result.
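A hedged sketch of that OCR route (an illustration, not the answerer's code; it assumes a recent PyMuPDF plus Pillow, pytesseract, and a Tesseract install, and the file name is a placeholder):

import io

import fitz  # PyMuPDF
import pytesseract
from PIL import Image

# Render each page to an image and OCR it, so the garbled text layer
# is never consulted.
with fitz.open("garbled.pdf") as doc:
    for page in doc:
        pix = page.get_pixmap(dpi=300)  # rasterize the page at 300 dpi
        img = Image.open(io.BytesIO(pix.tobytes("png")))
        print(pytesseract.image_to_string(img))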
Weary of this issue and not wanting to deal with OCR, I manually sorted out the cipher. Here she be, as a Python dict along with some rudimentary code that I was using to test it. I'm sure this could be improved, but it works for all letters except uppercase Q and uppercase X, which I haven't yet been able to find.
It's missing a fair bit of punctuation too, at least for now (all of these are missing, for example: <>?{}\|!~`##$%^_=+).
# -*- coding: utf-8 -*-
import re
import sys
letter_map = {
    # b, c, d, h, l, q and x map to multi-character sequences in the
    # ciphertext; they are handled by the re.sub calls further down.
    u'¿':'a',
    u'»':'e',
    u'o':'f',
    u'1':'g',
    u'·':'i',
    u'¶':'j',
    u'μ':'k',
    u'3':'m',
    u'2':'n',
    u'±':'o',
    u'°':'p',
    u'®':'r',
    u'-':'s',
    u'¬':'t',
    u'«':'u',
    u'a':'v',
    u'©':'w',
    u'§':'y',
    u'¦':'z',
    u'ß':'A',
    u'Þ':'B',
    u'Ý':'C',
    u'Ü':'D',
    u'Û':'E',
    u'Ú':'F',
    u'Ù':'G',
    u'Ø':'H',
    u'×':'I',
    u'Ö':'J',
    u'Õ':'K',
    u'Ô':'L',
    u'Ó':'M',
    u'Ò':'N',
    u'Ñ':'O',
    u'Ð':'P',
    # uppercase Q mapping still missing
    u'Î':'R',
    u'Í':'S',
    u'Ì':'T',
    u'Ë':'U',
    u'Ê':'V',
    u'É':'W',
    # uppercase X mapping still missing
    u'Ç':'Y',
    u'Æ':'Z',
    u'ð':'0',
    u'ï':'1',
    u'î':'2',
    u'í':'3',
    u'ì':'4',
    u'ë':'5',
    u'ê':'6',
    u'é':'7',
    u'è':'8',
    u'ç':'9',
    u'ò':'.',
    u'ô':',',
    u'æ':':',
    u'å':';',
    u'Ž':"'",
    u'•':"'",  # also stands in for a double quote; the glyphs are identical
    u'Œ':"'",  # s/b double quote, but identical to single
    u'ó':'-',  # dash
    u'Š':'-',  # n-dash
    u'‰':'--', # em-dash
    u'ú':'&',
    u'ö':'*',
    u'ñ':'/',
    u'÷':')',
    u'ø':'(',
    u'Å':'[',
    u'Ã':']',
    u'‹':'•',
}
ciphertext = u'''YOUR STUFF HERE'''
plaintext = ''
for letter in ciphertext:
    try:
        plaintext += letter_map[letter]
    except KeyError:
        plaintext += letter
# These are multi-length replacements
plaintext = re.sub(u'm⁄4', 'b', plaintext)
plaintext = re.sub(u'g⁄n', 'c', plaintext)
plaintext = re.sub(u'g⁄4', 'd', plaintext)
plaintext = re.sub(u' ́', 'l', plaintext)
plaintext = re.sub(u' ̧', 'h', plaintext)
plaintext = re.sub(u' ̈', 'x', plaintext)
plaintext = re.sub(u' ̄u', 'qu', plaintext)
for letter in plaintext:
    try:
        sys.stdout.write(letter)
    except UnicodeEncodeError:
        continue

Problem saving uploaded files in Python3

I'm having trouble controlling the data uploaded via the POST method on the web.
If the file is text there's no problem, but the trouble comes when it's a binary file, such as a picture: when the system inserts the data into the file, it doesn't come out encoded the right way. I will put all the code below, from where it takes environ['wsgi.input'] to where it saves the file:
# Here the data comes from environ['wsgi.input'];
# first I convert the bytes into a string, dropping the leading
# "b" prefix, and then I strip the single quotes.
tmpData = str(rawData)[1:].strip("' '")
dat = tmpData.split('\\r')  # then I split all the data on '\r'
s = open('/home/hidura/test.png', 'w')  # I open the test.png file
for cont in range(5, 150):  # begin at the 5th position, up to the 150th
    s.write(dat[cont])  # insert the piece of the data into the file
s.close()  # then close it
Where is the mistake?
Thank you in advance.
Why do you convert the binary data to a string? A png file is binary data. Just write the binary data to the file. You need to open the file in binary mode as well.
s = open('/home/hidura/test.png', 'wb')  # 'b' opens the file in binary mode
s.write(rawData)  # write the raw bytes without any str() conversion
s.close()
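A slightly fuller sketch of the same idea (a hypothetical handler, not from the answer above; it assumes the request body is the raw file bytes rather than a multipart form, which would need parsing first):

def application(environ, start_response):
    # Read exactly CONTENT_LENGTH bytes of the request body, as bytes.
    length = int(environ.get('CONTENT_LENGTH') or 0)
    rawData = environ['wsgi.input'].read(length)

    # Write the bytes untouched: no str() round-trip, no splitting.
    with open('/home/hidura/test.png', 'wb') as f:
        f.write(rawData)

    start_response('200 OK', [('Content-Type', 'text/plain')])
    return [b'saved']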