PDF to txt python3 - pdf

I'm trying to convert a PDF file to a text file.
`
import re
import PyPDF2
with open('123.pdf', 'rb') as pdfFileObj:
    pdfreader = PyPDF2.PdfFileReader(pdfFileObj)
    x = pdfreader.numPages
    pageObj = pdfreader.getPage(x + 1)
    text = pageObj.extractText()
    file1 = open(f"C:\\Users\\honorr\\Desktop\\ssssssss\{re.sub('pdf$','txt',pdfFileObj)}", "a")
    file1.writelines(text)
    file1.close()
Errors:
Traceback (most recent call last):
  File "C:\Users\honorr\Desktop\ssssssss\main.py", line 5, in <module>
    pageobj = pdfreader.getPage(x + 1)
  File "C:\Users\honorr\Desktop\ssssssss\venv\lib\site-packages\PyPDF2\_reader.py", line 477, in getPage
    return self._get_page(pageNumber)
  File "C:\Users\honorr\Desktop\ssssssss\venv\lib\site-packages\PyPDF2\_reader.py", line 492, in _get_page
    return self.flattened_pages[page_number]
IndexError: list index out of range
`
How can I fix it?
I don't understand why I get this error. Alternatively, can somebody suggest another way to convert a PDF to TXT?

You're setting x to the number of pages, but then trying to get page x + 1, which doesn't exist. The traceback shows that the page number is used directly as a list index (self.flattened_pages[page_number]), so pages are numbered from 0 and the last valid index is x - 1. Use pdfreader.getPage(x - 1) to get it to work. This will only get the last page in the document, though.
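To extract every page rather than only the last one, looping over the reader's pages is one option. The following is a minimal sketch, assuming a PyPDF2 release that provides `PdfReader`, `reader.pages`, and `extract_text()` (the newer names for the calls used in the question). `txt_path_for` and `pdf_to_txt` are hypothetical helper names, not part of PyPDF2.

```python
from pathlib import Path


def txt_path_for(pdf_path):
    """Return the path of the .txt file that sits next to the given PDF."""
    return Path(pdf_path).with_suffix(".txt")


def pdf_to_txt(pdf_path):
    """Write the text of every page of pdf_path to a sibling .txt file."""
    # Imported here so the path helper above works without PyPDF2 installed.
    # Assumption: a PyPDF2 version that exposes PdfReader.
    from PyPDF2 import PdfReader

    out_path = txt_path_for(pdf_path)
    with open(pdf_path, "rb") as pdf_file:
        reader = PdfReader(pdf_file)
        # Iterating over reader.pages avoids index arithmetic entirely;
        # with explicit indices the valid range is 0 .. len(reader.pages) - 1.
        pages = [page.extract_text() or "" for page in reader.pages]
    out_path.write_text("\n".join(pages), encoding="utf-8")
    return out_path
```

Calling `pdf_to_txt("123.pdf")` would write `123.txt` next to the PDF. Note that `re.sub` in the question is given the open file object rather than a file name, which would likely fail once the page lookup is fixed; deriving the output name from the path sidesteps that.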

using pandas.read_csv, how can one process all errors, receive all non-error data?

Data which, for me, generates an exception instead of invoking the 'on_bad_lines' handler is at:
https://opencalaccess.org/misc/NAMES_CD.TSV
I have this:
bad_lines = list()
def bad_line_finder(x):
    bad_lines.append(str(x))
    return None
for file in os.listdir(dir):
    bad_lines = list()
    try:
        for df in pd.read_csv(f"{dir}/{file}",
                              sep='\t',
                              on_bad_lines=bad_line_finder,
                              engine='python',
                              chunksize=1000):
            print(f"\n{target}")
            df.info()
            print(f"Bad Lines: {bad_lines}")
            bad_lines = list()
    except:
        print("EXCEPTION:")
        traceback.print_exc()
This works well: there are errors in the files, and the handler deals with them so that I can keep track of them. But why do I still see this?
EXCEPTION:
Traceback (most recent call last):
  File "/home/ray/Projects/opencalaccess-data/import.py", line 41, in <module>
    for df in pd.read_csv(f"{dir}/{file}",
  File "/home/ray/Projects/opencalaccess-data/.venv/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 1698, in __next__
    return self.get_chunk()
  File "/home/ray/Projects/opencalaccess-data/.venv/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 1810, in get_chunk
    return self.read(nrows=size)
  File "/home/ray/Projects/opencalaccess-data/.venv/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 1778, in read
    ) = self._engine.read( # type: ignore[attr-defined]
  File "/home/ray/Projects/opencalaccess-data/.venv/lib/python3.10/site-packages/pandas/io/parsers/python_parser.py", line 250, in read
    content = self._get_lines(rows)
  File "/home/ray/Projects/opencalaccess-data/.venv/lib/python3.10/site-packages/pandas/io/parsers/python_parser.py", line 1114, in _get_lines
    new_rows.append(next(self.data))
_csv.Error: ' ' expected after '"'
What is the "on_bad_lines" option doing if it does not handle all of the bad lines? Which of them will it handle, and which will it not?
This is a government data source. There are format errors in the data that cannot be corrected by the agency, because they constitute the official record. So I must fix them myself. But which errors throw exceptions and which do not?
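The pandas documentation describes a "bad line" as one with too many fields, so a quoting error raised by the underlying `csv` tokenizer may never reach the `on_bad_lines` handler. One way to see which lines the tokenizer itself rejects is to run each line through Python's `csv` module in strict mode before handing the file to pandas. The sketch below assumes one record per physical line (no quoted fields spanning lines) and that the `_csv.Error` in the traceback comes from strict quote handling, as its message suggests. `split_good_and_bad` is a hypothetical helper, not part of pandas, and the sample rows are made up.

```python
import csv


def split_good_and_bad(lines, delimiter="\t"):
    """Parse each line on its own; return (parsed_rows, rejected_lines)."""
    good, bad = [], []
    for number, line in enumerate(lines, start=1):
        try:
            # strict=True makes the csv module raise csv.Error on malformed
            # quoting instead of silently accepting it.
            row = next(csv.reader([line], delimiter=delimiter, strict=True))
        except csv.Error as exc:
            bad.append((number, line, str(exc)))
        else:
            good.append(row)
    return good, bad


# Made-up sample data, not taken from the file in the question.
sample = [
    "ID\tNAME",
    '1\t"ACME" INC',  # text after a closing quote: rejected in strict mode
    "2\tSMITH",
]
good, bad = split_good_and_bad(sample)
print(good)
print(bad)

# If the quote characters are literal data, QUOTE_NONE treats them as text.
literal = next(csv.reader([sample[1]], delimiter="\t", quoting=csv.QUOTE_NONE))
print(literal)
```

`pd.read_csv` also accepts a `quoting` argument, so passing `quoting=csv.QUOTE_NONE` may be worth trying if the quotes in the data are meant literally.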

I get this error when i try to use Wolfram Alpha in VS code python ValueError: dictionary update sequence element #0 has length 1; 2 is required

This is my code
import wolframalpha
app_id = '876P8Q-R2PY95YEXY'
client = wolframalpha.Client(app_id)
res = client.query(input('Question: '))
print(next(res.results).text)
The question I tried was 1 + 1.
When I run it, I get this error:
Traceback (most recent call last):
  File "c:/Users/akshi/Desktop/Xander/Untitled.py", line 9, in <module>
    print(next(res.results).text)
  File "C:\Users\akshi\AppData\Local\Programs\Python\Python38\lib\site-packages\wolframalpha\__init__.py", line 166, in text
    return next(iter(self.subpod)).plaintext
ValueError: dictionary update sequence element #0 has length 1; 2 is required
Please help me.

I was getting the same error when I tried to run the same code.
You can refer to the "Implementing Wolfram Alpha Search" section of this website for a better understanding of how the result is extracted from the returned dictionary:
https://medium.com/@salisuwy/build-an-ai-assistant-with-wolfram-alpha-and-wikipedia-in-python-d9bc8ac838fe
I also tried the following code, based on that website. I hope it helps you :)
import wolframalpha
client = wolframalpha.Client('<your app_id>')
query = str(input('Question: '))
res = client.query(query)
# XML attributes appear in the result with an '@' prefix
if res['@success'] == 'true':
    pod0 = res['pod'][0]['subpod']['plaintext']
    print(pod0)
    pod1 = res['pod'][1]
    if (('definition' in pod1['@title'].lower()) or ('result' in pod1['@title'].lower()) or (pod1.get('@primary', 'false') == 'true')):
        result = pod1['subpod']['plaintext']
        print(result)
else:
    print("No answer returned")
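The lookups above can be wrapped in a small helper that also copes with a pod whose `subpod` is a list rather than a single mapping. This is only a sketch, assuming the response behaves like the nested mapping used in the code above. `pod_plaintexts` is a hypothetical name, and the sample dictionary is made up for illustration rather than captured from the API.

```python
def pod_plaintexts(res):
    """Collect the plaintext of every subpod in a response-like mapping."""
    pods = res.get("pod", [])
    if isinstance(pods, dict):  # a lone pod may not be wrapped in a list
        pods = [pods]
    texts = []
    for pod in pods:
        subpods = pod.get("subpod", [])
        if isinstance(subpods, dict):  # same for a lone subpod
            subpods = [subpods]
        for subpod in subpods:
            text = subpod.get("plaintext")
            if text:  # skip subpods without any plaintext
                texts.append(text)
    return texts


# Made-up data in the same shape as the lookups above.
sample = {
    "pod": [
        {"subpod": {"plaintext": "1 + 1"}},
        {"subpod": [{"plaintext": "2"}, {"plaintext": None}]},
    ]
}
answers = pod_plaintexts(sample)
print(answers if answers else "No answer returned")
```

Returning a list instead of indexing `pod[0]` and `pod[1]` directly means a response with fewer pods than expected yields an empty or shorter list rather than an exception.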

Converting xls to xlsx using xlrd

I am using the exact script below (except for the file path) to try to convert xls to xlsx. The script runs without errors but produces no output. Am I missing something basic, like supplying some variable values or file names, or not saving the file properly as an xlsx?
import xlrd
import os
from openpyxl.workbook import Workbook
filenames = os.listdir("file_path")
for fname in filenames:
    if fname.endswith(".xls"):
        def cvt_xls_to_xlsx(fname):
            book_xls = xlrd.open_workbook(fname)
            book_xlsx = Workbook()
            sheet_names = book_xls.sheet_names()
            for sheet_index in range(0,len(sheet_names)):
                sheet_xls = book_xls.sheet_by_name(sheet_names[sheet_index])
                if sheet_index == 0:
                    sheet_xlsx = book_xlsx.active
                    sheet_xlsx.title = sheet_names[sheet_index]
                else:
                    sheet_xlsx = book_xlsx.create_sheet(title=sheet_names[sheet_index])
                for row in range(0, sheet_xls.nrows):
                    for col in range(0, sheet_xls.ncols):
                        sheet_xlsx.cell(row = row+1 , column = col+1).value = sheet_xls.cell_value(row, col)
        cvt_xls_to_xlsx(fname)
        book_xlsx.save(fname.xlsx)
I have updated the above script with the last two rows of commands, but I am now getting the following error message:
  File "C:\Users\local\Documents\Tasks\Python\Excelconvert.py", line 26, in <module>
    cvt_xls_to_xlsx(fname)
  File "C:\Users\local\Documents\Tasks\Python\Excelconvert.py", line 10, in cvt_xls_to_xlsx
    book_xls = xlrd.open_workbook(fname)
  File "C:\Users\local\Python34\lib\site-packages\xlrd\__init__.py", line 157, in open_workbook
    ragged_rows=ragged_rows,
  File "C:\Users\local\Python34\lib\site-packages\xlrd\book.py", line 92, in open_workbook_xls
    biff_version = bk.getbof(XL_WORKBOOK_GLOBALS)
  File "C:\Users\local\Python34\lib\site-packages\xlrd\book.py", line 1278, in getbof
    bof_error('Expected BOF record; found %r' % self.mem[savpos:savpos+8])
  File "C:\Users\local\Python34\lib\site-packages\xlrd\book.py", line 1272, in bof_error
    raise XLRDError('Unsupported format, or corrupt file: ' + msg)
xlrd.biffh.XLRDError: Unsupported format, or corrupt file: Expected BOF record; found b'# - Copy'
What does the '# - Copy' mean? I have seen others suggest that this kind of error relates to the file format, but usually in the context of an XML error, not the one above. Any solutions would be appreciated.
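Judging by the `bof_error` call in the traceback, the bytes in the message are the first eight bytes xlrd read from the file, so the file being opened apparently starts with the text `# - Copy` rather than a workbook header. A quick check of the leading bytes can separate real workbooks from files that merely carry an `.xls` name. The sketch below assumes the genuine files are Excel 97-2003 workbooks in the OLE2 container format, whose signature is `D0 CF 11 E0 A1 B1 1A E1`. `looks_like_ole2_xls` and `xls_candidates` are hypothetical helper names.

```python
import os

# Signature at the start of every OLE2 compound file (Excel 97-2003 .xls).
OLE2_SIGNATURE = bytes.fromhex("D0CF11E0A1B11AE1")


def looks_like_ole2_xls(path):
    """Return True if the file starts with the OLE2 compound-file signature."""
    with open(path, "rb") as handle:
        return handle.read(len(OLE2_SIGNATURE)) == OLE2_SIGNATURE


def xls_candidates(folder):
    """Yield full paths of .xls files in folder that pass the signature check."""
    for name in sorted(os.listdir(folder)):
        # os.listdir returns bare names, so join them with the folder.
        path = os.path.join(folder, name)
        if name.lower().endswith(".xls") and looks_like_ole2_xls(path):
            yield path
```

Files that fail the check could be listed separately and inspected by hand, for example in a text editor, to see what they actually contain.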

Spacy phrasematcher does not get matcher name

I am new to PhraseMatcher and want to extract some keywords from my emails.
Everything works well except that I can't get the name of the matcher I added.
This is my code:
def main():
    patterns_months = 'phraseMatcher/months.txt'
    text_loc = 'phraseMatcher/text.txt'
    nlp = spacy.blank('en')
    nlp.vocab.lex_attr_getters = {}
    phrases_months = read_gazetter(patterns_months)
    txts = read_text(text_loc, n=n)
    months = [nlp(text) for text in phrases_months]
    matcher = PhraseMatcher(nlp.vocab)
    matcher.add('MONTHS', None, *months)
    print(nlp.vocab.strings['MONTHS'])
    for txt in txts:
        doc = nlp(txt)
        matches = matcher(doc)
        for match_id, start, end in matches:
            span = doc[start: end]
            label = nlp.vocab.strings[match_id]
            print(label, span.text, start, end)
The result:
12298211501233906429 <--- this is from print(nlp.vocab.strings['MONTHS'])
Traceback (most recent call last):
  File "D:/workspace/phraseMatcher/venv/phraseMatcher.py", line 71, in <module>
    plac.call(main)
  File "D:\workspace\phraseMatcher\venv\lib\site-packages\plac_core.py", line 328, in call
    cmd, result = parser.consume(arglist)
  File "D:\workspace\phraseMatcher\venv\lib\site-packages\plac_core.py", line 207, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "D:/workspace/phraseMatcher/venv/phraseMatcher.py", line 47, in main
    label = nlp.vocab.strings[match_id]
  File "strings.pyx", line 117, in spacy.strings.StringStore.__getitem__
KeyError: "[E018] Can't retrieve string for hash '18446744072093410045'."
spaCy version: 2.0.12
Platform: Windows-7-6.1.7601-SP1
Python version: 3.7.0
I can't find what I did wrong. The code is simple, and I have already read this:
Using PhraseMatcher in SpaCy to find multiple match types
Thanks in advance for any help.

openpyxl+load_workbook+AttributeError: 'NoneType' object has no attribute 'date1904'

When I use openpyxl to load an Excel file (.xlsx), the following error is displayed (the link at the end points to the sample Excel file):
from openpyxl import *
wb = load_workbook("D:/develop/workspace/exman/test sample/510001653.xlsx")
Traceback (most recent call last):
  File "", line 1, in <module>
  File "C:\Python34\lib\site-packages\openpyxl-2.5.0-py3.4.egg\openpyxl\reader\excel.py", line 161, in load_workbook
    parser.parse()
  File "C:\Python34\lib\site-packages\openpyxl-2.5.0-py3.4.egg\openpyxl\packaging\workbook.py", line 42, in parse
    if package.properties.date1904:
AttributeError: 'NoneType' object has no attribute 'date1904'
sample excel file download
I debugged the Python file and found that workbookPr is None, which makes package.properties None as well (properties = Alias(workbookPr)). So I changed WorkbookParser.parse() as follows, and the error went away.
class WorkbookParser:
    def __init__(self, archive):
        self.archive = archive
        self.wb = Workbook()
        self.sheets = []
        self.rels = get_dependents(self.archive, ARC_WORKBOOK_RELS)
    def parse(self):
        src = self.archive.read(ARC_WORKBOOK)
        node = fromstring(src)
        package = WorkbookPackage.from_tree(node)
        if package.properties is not None: #add this line
            if package.properties.date1904:
                wb.excel_base_date = CALENDAR_MAC_1904
            self.wb.code_name = package.properties.codeName
        self.wb.active = package.active
        ..........
This bug was fixed in newer versions (I checked 2.4.8 and it's fixed; 2.4.0 still had it). Upgrading should resolve it:
pip install --upgrade openpyxl