How to open Russian-language PDFs for NLTK processing

I'm trying to extract text from a Russian-language PDF file and use it as data for tokenisation, lemmatisation, etc. with NLTK in a Jupyter Notebook. I'm using PyPDF2, but I keep running into problems.
I am creating a function and passing it the PDF as input:
from PyPDF2 import PdfFileReader
def getTextPDF(pdfFileName):
    pdf_file = open(pdfFileName, "rb")
    read_pdf = PdfFileReader(pdf_file)
    text = []
    for i in range(0, read_pdf.getNumPages()):
        text.append(read_pdf.getPage(i).extractText())
    return "\n".join(text)
Then I call the function:
pdfFile = "sample_russian.pdf"
print("PDF: \n", myreader_pdf.getTextPDF(pdfFile))
But I get a long pink list of the same warning:
PdfReadWarning: Superfluous whitespace found in object header b'1' b'0' [pdf.py:.....]
Any ideas would be very helpful! Thanks in advance!
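
A note on the message itself: it is a warning rather than an error, so extraction still runs. Below is a minimal sketch of silencing one warning category with the standard warnings module. The PdfReadWarning class and the noisy_extract function are stand-ins defined here only so the example runs without PyPDF2; with the real library you would import the warning class from your installed PyPDF2 version instead. PdfFileReader also accepts a strict argument, and whether strict=False suppresses this particular warning in your version is worth testing.

```python
import warnings


class PdfReadWarning(UserWarning):
    """Stand-in for PyPDF2's warning class (hypothetical, for illustration)."""


def noisy_extract():
    # Stand-in for a PDF read that warns but still returns text.
    warnings.warn("Superfluous whitespace found in object header", PdfReadWarning)
    return "пример текста"


def quiet_extract():
    # Ignore only this warning category, and only inside the with-block.
    with warnings.catch_warnings():
        warnings.simplefilter("ignore", category=PdfReadWarning)
        return noisy_extract()


text = quiet_extract()
print(text)
```

Filtering by category inside a with-block keeps other warnings visible, which is usually safer than turning all warnings off for the whole notebook.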

Related

Using Python 3.8, I would like to extract text from a random PDF file

I would like to import a PDF file and find the most common words.
import PyPDF2
# Open the PDF file and read the text
pdf_file = open("nita20.pdf", "rb")
pdf_reader = PyPDF2.PdfReader(pdf_file)
text = ""
for page in range(pdf_reader.pages):
    text += pdf_reader.getPage(page).extractText()
I get this error:
TypeError: '_VirtualList' object cannot be interpreted as an integer
How can I resolve this so that I can extract every word from the PDF file? Thanks.

I got some deprecation warnings on your code, but this works (tested on Python 3.11, PyPDF2 version: 3.0.1)
import PyPDF2
# Open the PDF file and read the text
# (raw string, so that "\t" in the path is not read as a tab character)
pdf_file = open(r"..\test.pdf", "rb")
pdf_reader = PyPDF2.PdfReader(pdf_file)
text = ""
print(len(pdf_reader.pages))
# pdf_reader.pages is a list-like object, so iterate over it directly
for page in pdf_reader.pages:
    text += page.extract_text()
pdf_file.close()
print(text)
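
Since the stated goal was to find the most common words, here is a minimal sketch of that step once the text is available as a string. It uses only the standard library; the most_common_words helper and the sample sentence are made up for illustration, and splitting on \w+ is an assumption about what counts as a word.

```python
import re
from collections import Counter


def most_common_words(text, n=3):
    # Lowercase so that "The" and "the" count as the same word.
    words = re.findall(r"\w+", text.lower())
    return Counter(words).most_common(n)


sample = "The cat and the dog and the bird"
print(most_common_words(sample, 2))  # [('the', 3), ('and', 2)]
```

In practice you would pass the extracted PDF text instead of the sample string.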

I want to read and print the data in the picture using the following code, but I run into trouble. How can I fix this code?

import re
file_path = 'D:/Speech/data/test2.txt'
useful_regex = re.compile(r'\[.+\]\n', re.IGNORECASE)
with open(file_path) as f:
    file_content = f.read()
info_lines = re.findall(useful_regex, file_content)
len(info_lines)
for l in info_lines[1:10]:
    print(l.strip().split('\t'))
As stated in the title, I want to read and print the data shown in the picture using the code above, but the program gives me trouble. How can I fix it?

Read pdf object from S3

I am trying to create a lambda function that will access a pdf form uploaded to s3 and strip out the data entered into the form and send it elsewhere.
I am able to do this when I can download the file locally. The script below works and lets me read the data from the PDF into my pandas DataFrame:
import boto3
import PyPDF2 as pypdf
import pandas as pd
s3 = boto3.resource('s3')
s3.meta.client.download_file(bucket_name, asset_key, './target.pdf')
pdfobject = open("./target.pdf", 'rb')
pdf = pypdf.PdfFileReader(pdfobject)
data = pdf.getFormTextFields()
pdf_df = pd.DataFrame(data, columns=get_cols(data), index=[0])
But with lambda I cannot save the file locally because I get a "read only filesystem" error.
I have tried using the s3.get_object() method like below:
s3_response_object = s3.get_object(
    Bucket='pdf-forms-bucket',
    Key='target.pdf',
)
pdf_bytes = s3_response_object['Body'].read()
But I have no idea how to convert the resulting bytes into an object that can be parsed with PyPDF2. The output that I need, and that PyPDF2 produces, looks like this:
{'form1[0].#subform[0].nameandmail[0]': 'Burt Lancaster',
'form1[0].#subform[0].mailaddress[0]': '675 Creighton Ave, Washington DC',
'form1[0].#subform[0].Principal[0]': 'David St. Hubbins',
'Principal[1]': None,
'form1[0].#subform[0].Principal[2]': 'Bart Simpson',
'Principal[3]': None}
So in summary, I need to be able to read a PDF with fillable forms into memory and parse it without downloading the file, because my Lambda function environment won't allow local temp files.

Solved:
This does the trick:
import boto3
from PyPDF2 import PdfFileReader
from io import BytesIO
bucket_name ="pdf-forms-bucket"
item_name = "form.pdf"
s3 = boto3.resource('s3')
obj = s3.Object(bucket_name, item_name)
fs = obj.get()['Body'].read()
pdf = PdfFileReader(BytesIO(fs))
data = pdf.getFormTextFields()
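
This works because BytesIO wraps raw bytes in an in-memory object with the same read/seek interface as a file opened in "rb" mode, which is what the PDF reader expects. A small standard-library sketch of that behaviour follows; the bytes are a made-up stand-in for the S3 body and are not a valid PDF.

```python
from io import BytesIO

# Made-up stand-in for obj.get()['Body'].read(); not a real PDF.
pdf_bytes = b"%PDF-1.4 fake body"

stream = BytesIO(pdf_bytes)

# The stream can be read and rewound like a file opened with "rb".
header = stream.read(8)
stream.seek(0)
everything = stream.read()

print(header)
print(everything == pdf_bytes)
```

Nothing is written to disk, which is why this approach sidesteps the read-only filesystem error.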

How to convert all type of images to text using python tesseract

I'm trying to convert all types of images in a folder to text using Python and Tesseract. Below is the code that I'm using; with it, only .png files are converted to .txt, and other types are not converted to text.
import os
import pytesseract
import cv2
import re
import glob
import concurrent.futures
import time
def ocr(img_path):
    out_dir = "Output//"
    img = cv2.imread(img_path)
    text = pytesseract.image_to_string(img,lang='eng',config='--psm 6')
    out_file = re.sub(".png",".txt",img_path.split("\\")[-1])
    out_path = out_dir + out_file
    fd = open(out_path,"w")
    fd.write("%s" %text)
    return out_file
os.environ['OMP_THREAD_LIMIT'] = '1'
def main():
    path = input("Enter the path : ")
    if os.path.isdir(path) == 1:
        out_dir = "ocr_results//"
        if not os.path.exists(out_dir):
            os.makedirs(out_dir)
        with concurrent.futures.ProcessPoolExecutor(max_workers=4) as executor:
            image_list = glob.glob(path+"\\*.*")
            for img_path,out_file in zip(image_list,executor.map(ocr,image_list)):
                print(img_path.split("\\")[-1],',',out_file,', processed')
if __name__ == '__main__':
    start = time.time()
    main()
    end = time.time()
    print(end-start)
How can I convert all types of image files to text? Please help me with the above code.

There is a bug in the ocr function.
First of all, the following line does convert all types of image files to text:
text = pytesseract.image_to_string(img,lang='eng',config='--psm 6')
However, the next chunk of code does the following:
Selects files with a .png extension using a regex
Creates a new path with the same filename and a .txt extension
Writes the OCR output to the newly created text file.
out_file = re.sub(".png",".txt",img_path.split("\\")[-1])
out_path = out_dir + out_file
fd = open(out_path,"w")
fd.write("%s" %text)
In other words, all types of image files are converted, but not all results are written out correctly. The regex only replaces .png with .txt before the result is used to build out_path. When there is no .png in the name (other image types), out_file keeps the original filename (e.g. sample.jpg), so the OCR text is written to a file in the output directory that still carries the image's name and extension instead of a .txt file.
One way to fix this is to add all the image formats you want to cover to the regex, escaping the dots and anchoring the pattern at the end of the name.
For example,
out_file = re.sub(r"\.(png|jpg|bmp|tiff)$",".txt",img_path.split("\\")[-1])
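
An alternative that avoids listing extensions at all is to swap whatever suffix the file has. The sketch below uses pathlib from the standard library; txt_name is a hypothetical helper, and PureWindowsPath is chosen on the assumption that the paths are Windows-style, as in the question.

```python
from pathlib import PureWindowsPath


def txt_name(img_path):
    # Swap whatever extension the image has for ".txt" and keep only the file name.
    return PureWindowsPath(img_path).with_suffix(".txt").name


for p in [r"C:\scans\page1.png", r"C:\scans\photo.JPG", r"C:\scans\chart.tiff"]:
    print(txt_name(p))
```

This also handles upper-case extensions such as .JPG, which a case-sensitive regex would miss.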

OpenCv_Python - Convert Frame Sequence To a Video

I am a newbie to OpenCV with Python, and I am currently working on a project that uses it. I have a video data set folder named "VideoDataSet/dynamicBackground/canoe/input" that stores a sequence of image frames, and I would like to convert the frames at that path into a video. However, I get an error when I run the program. I have tried various codecs but still get the same error. Can anyone shed some light on what might be wrong? Thank you.
This is my sample code:
import cv2
import numpy as np
import os
import glob as gb
filename = "VideoDataSet/dynamicBackground/canoe/input"
img_path = gb.glob(filename)
videoWriter = cv2.VideoWriter('test.avi', cv2.VideoWriter_fourcc(*'MJPG'),
                              25, (640,480))
for path in img_path:
    img = cv2.imread(path)
    img = cv2.resize(img,(640,480))
    videoWriter.write(img)
print ("you are success create.")
This is the error:
Error prompt out:cv2.error: OpenCV(3.4.1) D:\Build\OpenCV\opencv-3.4.1\modules\imgproc\src\resize.cpp:4044: error: (-215) ssize.width > 0 && ssize.height > 0 in function cv::resize
(Note: the problem occurs at the line img = cv2.resize(img,(640,480)))

It is returning this error because you are trying to re-size the directory entry! You need to put:
filename = "VideoDataSet/dynamicBackground/canoe/input/*"
So that it will match all the files in the folder when you glob it. The error actually suggested that the source image had either zero width or zero height; cv2.imread returns None when it cannot read an image, which is what happens when it is given a directory. Putting:
print( img_path )
after your glob call would have shown that it was only returning the directory entry itself.
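
The difference between globbing the folder and globbing its contents can be checked with the standard library alone. This is a minimal sketch: the temporary folder and the empty placeholder frame names are made up for illustration.

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as folder:
    # Create three empty placeholder "frames" (hypothetical names).
    for name in ("in000001.jpg", "in000002.jpg", "in000003.jpg"):
        open(os.path.join(folder, name), "w").close()

    # Without a wildcard, glob matches only the folder entry itself.
    without_wildcard = glob.glob(folder)
    # With a wildcard, it matches the files inside the folder.
    with_wildcard = sorted(glob.glob(os.path.join(folder, "*")))

    print(len(without_wildcard))
    print([os.path.basename(p) for p in with_wildcard])
```

Sorting the result is worth keeping in the real script, because glob does not guarantee any order and the frames need to be written in sequence.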
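You subsequently discovered that although it was now generating a file, it was corrupted. You can try spelling out the fourcc parameter as below, although *'MJPG' unpacks to the same four characters, so the two forms should be equivalent. Also make sure you call videoWriter.release() when you are done so that the file is finalized:
cv2.VideoWriter_fourcc('M','J','P','G')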

You can try this:
img_path = gb.glob(filename)
videoWriter = cv2.VideoWriter('frame2video.avi', cv2.VideoWriter_fourcc(*'MJPG'), 25, (640,480))
for path in img_path:
    img = cv2.imread(path)
    img = cv2.resize(img,(640,480))
    videoWriter.write(img)
# Release the writer so that the video file is finalized
videoWriter.release()
