pdf to jpg: can't recognize file - pdf

When I run:
from pdf2jpg import pdf2jpg

PATH = "C:/Users/yyyyyy/Desktop/test_ocr/data"
pdf2jpg.convert_pdf2jpg(PATH + "/" + 'xxxx.pdf', PATH + "/results", pages="ALL")
I get this error:
[WinError 2] Le fichier spécifié est introuvable ("The specified file could not be found")
Though I'm sure that this file exists. What should I do to make it work?
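A sketch of how to narrow this down, assuming this is the pdf2jpg package from PyPI: it reportedly drives a bundled Java pdfbox tool, so [WinError 2] can also mean that java.exe is not on PATH rather than that the PDF itself is missing (treat the Java dependency as an assumption to verify):

# Minimal diagnostic sketch; the paths are the placeholders from the question.
import os
import shutil

from pdf2jpg import pdf2jpg

PATH = "C:/Users/yyyyyy/Desktop/test_ocr/data"
src = os.path.join(PATH, "xxxx.pdf")
dst = os.path.join(PATH, "results")

print("PDF exists:", os.path.isfile(src))   # rules out a typo in the file name
print("java found:", shutil.which("java"))  # None here would also explain WinError 2

if os.path.isfile(src):
    result = pdf2jpg.convert_pdf2jpg(src, dst, pages="ALL")
    print(result)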

Related

Trying to print a PDF file but got an error. My code works correctly when printing an image or a txt file, but when I print a PDF file it errors.

import os

try:
    file = "C:\\Cheque_Software\\All_Data\\ISSUE_CHEQUE.pdf"
    os.startfile(file, "print")
except Exception as Ex:
    print(f"ERROR DUE TO : {str(Ex).title()}")
else:
    print("your printer is not connected with system".title())
ERROR IS
ERROR DUE TO : [Winerror 1155] No Application Is Associated With The Specified File For This Operation: 'C:\Cheque_Software\All_Data\Issue_Cheque.Pdf'
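WinError 1155 means Windows has no application registered for the "print" verb on .pdf files, so os.startfile(file, "print") has nothing to hand the job to. A sketch of a workaround, assuming a PDF reader such as SumatraPDF is installed (its install path below is a guess to adjust):

import os
import subprocess

file = r"C:\Cheque_Software\All_Data\ISSUE_CHEQUE.pdf"

# Hypothetical reader location -- point this at whatever PDF reader is installed.
sumatra = r"C:\Program Files\SumatraPDF\SumatraPDF.exe"

if os.path.isfile(sumatra):
    # SumatraPDF's -print-to-default switch sends the file to the default printer silently.
    subprocess.run([sumatra, "-print-to-default", file], check=True)
else:
    # Original approach: works only once a PDF viewer is installed and
    # associated with .pdf files (which fixes WinError 1155).
    os.startfile(file, "print")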

How to convert a PDF with images (which I don't care about) to text?

I'm trying to convert PDFs to text files. The problem is that those PDFs contain images, which I don't care about (this is the type of file I want to extract: https://www.sia.aviation-civile.gouv.fr/pub/media/store/documents/file/l/f/lf_sup_2020_213_fr.pdf). Note that if I copy/paste with my mouse, it works quite well (except for the line breaks), so I'd guess it's possible. Most of the answers I found online work pretty well on dummy PDFs with text only, but give especially bad results on the map.
For instance, something like this
from tika import parser # pip install tika
raw = parser.from_file('test2.pdf')
print(raw['content'])
works well for retrieving the text, but I get a lot of garbage like this:
ERY
CTR
3
CH
A
which appear because of the map.
Something like this, which works by converting the PDF to images and then reading the images, faces the same problem (I found it in a very similar thread on Stack Overflow, but there was no answer):
import sys

import pytesseract as pt
from PIL import Image
from pdf2image import convert_from_path  # this import was missing from the snippet

def convert(name):
    pages = convert_from_path(name, dpi=200)
    for idx, page in enumerate(pages):
        page.save('page' + str(idx) + '.jpg', 'JPEG')
        quote = Image.open('page' + str(idx) + '.jpg')
        text = pt.image_to_string(quote, lang="fra")
        file_ex = open('page' + str(idx) + '.text', "w")
        file_ex.write(text)
        file_ex.close()

if __name__ == '__main__':
    convert(sys.argv[1])
Finally, I tried to remove the images first and then use one of the solutions above, but it didn't work any better:
from tika import parser  # pip install tika
from PyPDF2 import PdfFileWriter, PdfFileReader

# Remove the images
inputStream = open("lf_sup_2020_213_fr.pdf", "rb")
outputStream = open("test3.pdf", "wb")
src = PdfFileReader(inputStream)
output = PdfFileWriter()
for i in range(src.getNumPages()):
    output.addPage(src.getPage(i))
output.removeImages()
output.write(outputStream)
outputStream.close()

# Read from the image-stripped PDF
raw = parser.from_file('test3.pdf')
print(raw['content'])
Do you know how to solve this? It can be in any language.
Thanks
One approach you could try is to use a toolkit capable of parsing the text characters in the PDF, then use the object properties to remove the unwanted map labels while keeping the text characters you need.
For example, the ParsePages method from LEADTOOLS PDF toolkit (which is what I am familiar with since I work for the vendor of this toolkit) can be used to obtain the text from the PDF:
using (PDFDocument document = new PDFDocument(pdfFileName))
{
    PDFParsePagesOptions options = PDFParsePagesOptions.All;
    document.ParsePages(options, 1, -1);
    using (StreamWriter writer = File.CreateText(txtFileName))
    {
        IList<PDFObject> objects = document.Pages[0].Objects;
        writer.WriteLine("Objects: {0}", objects.Count);
        foreach (PDFObject obj in objects)
        {
            if (obj.TextProperties.IsEndOfLine)
                writer.WriteLine(obj.Code);
            else
                writer.Write(obj.Code);
        }
        writer.WriteLine("---------------------");
    }
}
This will obtain all the text in the PDF for the first page, with the unwanted results as you mentioned. Here is an excerpt below:
Objects: 3918
5
91L
F5
4
1 LF
N
OY
L2
1AM
TService
8
26
1de l’Information
0
B09SUP AIP 213/20
7
Aéronautique
Date de publication : 05 NOV
e-mail : sia.qualite@aviation-civile.gouv.fr
Internet : www.sia.aviation-civile.gouv.fr
141
17˚
82
N20
9Objet : Création de 4 zones réglementées temporaires (ZRT) pour l’exercice VOLOPS en région de Chambéry
En vigueur : Du mercredi 25 Novembre 2020 au vendredi 04 décembre 2020
More code can be used to examine the properties for each parsed character:
writer.WriteLine(" ObjectType: {0}", obj.ObjectType.ToString());
writer.WriteLine(" Bounds: {0}, {1}, {2}, {3}", obj.Bounds.Left, obj.Bounds.Top, obj.Bounds.Right, obj.Bounds.Bottom);
writer.WriteLine(" TextProperties.FontHeight: {0}", obj.TextProperties.FontHeight.ToString());
writer.WriteLine(" TextProperties.FontIndex: {0}", obj.TextProperties.FontIndex.ToString());
writer.WriteLine(" Code: {0}", obj.Code);
writer.WriteLine("------");
This will give the properties for each character:
Objects: 3918
ObjectType: Text
Bounds: -60.952693939209, 1017.25231933594, -51.8431816101074, 1023.71826171875
TextProperties.FontHeight: 7.10454273223877
TextProperties.FontIndex: 48
Code: 5
------
Using these properties, the unwanted text can be filtered out. For example, I noticed that the FontHeight for a good portion of the unwanted text is around 7 PDF units, so the first code can be altered to avoid extracting any text smaller than 7.25 PDF units:
foreach (PDFObject obj in objects)
{
    if (obj.TextProperties.FontHeight > 7.25)
    {
        if (obj.TextProperties.IsEndOfLine)
            writer.WriteLine(obj.Code);
        else
            writer.Write(obj.Code);
    }
}
The extracted output gives a better result; an excerpt follows:
Objects: 3918
Service
de l’Information
SUP AIP 213/20
Aéronautique
Date de publication : 05 NOV
e-mail : sia.qualite@aviation-civile.gouv.fr
Internet : www.sia.aviation-civile.gouv.fr
Objet : Création de 4 zones réglementées temporaires (ZRT) pour l’exercice VOLOPS en région de Chambéry
En vigueur : Du mercredi 25 Novembre 2020 au vendredi 04 décembre 2020
Lieu : FIR : Marseille LFMM - AD : Chambéry Aix-Les-Bains LFLB, Chambéry Challes les Eaux LFLE
ZRT LE SIRE, MOTTE CASTRALE, ALLEVARD
*
C
D
E
In the end, with this approach you will have to come up with good criteria to filter out the unwanted text without removing the text you need to keep.
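As a rough Python counterpart of the same idea (not from the answer above), PyMuPDF exposes the font size of each text span, so a similar filter can be sketched as below; the 7.25 threshold is simply carried over from the C# example and is an assumption to tune:

# Font-size filter sketch using PyMuPDF (pip install pymupdf, import name "fitz").
import fitz

doc = fitz.open("lf_sup_2020_213_fr.pdf")
lines_out = []
for page in doc:
    for block in page.get_text("dict")["blocks"]:
        for line in block.get("lines", []):          # image blocks have no "lines"
            # Keep only spans whose font size exceeds the threshold,
            # which drops most of the small map labels.
            kept = [span["text"] for span in line["spans"] if span["size"] > 7.25]
            if kept:
                lines_out.append("".join(kept))

print("\n".join(lines_out))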

nuitka fails to generate executable due to "The filename or extension is too long"

I've been trying to compile/generate a standalone executable (.exe) with nuitka but it fails every time with the message:
Nuitka:INFO:Total memory usage before running scons: 2.72 GB (2920177664 bytes):
scons: *** [main_executable.dist\main_executable.exe] The filename or extension is too long
I'm new to programming, but I think I've tried just about everything. I moved my *.py files to the directory C:\main to shorten the path, to no avail. I've also renamed the file so it produces "main.exe" instead of "main_executable.exe", with no luck.
My Python is installed here:
'C:\users\test\Anaconda3...'
I came across the function below for shortening a path, but I have no idea how to use it (taken from http://code.activestate.com/recipes/286179-getshortpathname/).
Could you kindly help? Thanks.
def getShortPathName(filepath):
    "Converts the given path into 8.3 (DOS) form equivalent."
    import win32api, os
    if filepath[-1] == "\\":
        filepath = filepath[:-1]
    tokens = os.path.normpath(filepath).split("\\")
    if len(tokens) == 1:
        return filepath
    ShortPath = tokens[0]
    for token in tokens[1:]:
        PartPath = "\\".join([ShortPath, token])
        Found = win32api.FindFiles(PartPath)
        if Found == []:
            # Python 3 raise syntax (the original recipe used Python 2's "raise X, msg")
            raise WindowsError('The system cannot find the path specified: "%s"' % PartPath)
        else:
            if Found[0][9] == "":
                ShortToken = token
            else:
                ShortToken = Found[0][9]
        ShortPath = ShortPath + "\\" + ShortToken
    return ShortPath
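For what it's worth, assuming pywin32 is installed (it normally ships with Anaconda), Windows can be asked for the 8.3 short form directly, which avoids using the recipe at all; whether the shortened path then satisfies Nuitka is untested here:

# Minimal sketch: GetShortPathName returns the 8.3 (DOS) form of an existing path,
# e.g. something like C:\Users\test\ANACON~1 (short names must be enabled on the volume).
import win32api

long_path = r"C:\users\test\Anaconda3"
short_path = win32api.GetShortPathName(long_path)
print(short_path)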

convert a file object into a string

In a meson.build file, I have a file defined by:
file = files("my_filename.ext")
To build an ID, I tried to write:
myTgt = "_other_ext" + file[0][0]
and I then get this error:
meson.build:257:4: ERROR: Invalid use of addition: must be str, not File
How can I convert the File object into a valid string?
(I tried adding .string(), but that is not the solution.)
I found a solution using the format function:
fmt = "_other_ext_#0#"
myTgt = fmt.format(file[0][0])

How to read multiple .xls files in one go in R

I have tried the code below multiple times, but nothing happens when I run it. I think fread does not read the .xls format, so I tried two other approaches, one with the rio package and another with the openxlsx package. Sorry, I am new to this. There are 38 files, each with a name like "Cust+Txn+Details+Customer (36).xls". Thank you.
## First put all file names into a list
library(data.table)
files <- list.files(path = "F:\\MUMuniv\\machine learning class\\price sensitivty\\PS works\\Customer files",
                    pattern = ".xls", full.names = T)
readdata <- function(fn){
  dt_temp <- fread(fn)
  return(dt_temp)
}
# then using
all.files <- lapply(files, readdata)
final.data <- rbindlist(all.files)
Error I got: " Error in fread(fn) : mmap'd region has EOF at the end "
#Example 2
#rio package
require("rio")
xls <- dir(path = ".", all.files = T)
created <- mapply(convert, xls, gsub(".xlsx", ".csv", "xls"))
unlink(xls)
Error in get_ext(file) : 'file' has no extension
#example 3
# using openxlsx package
require("openxlsx")
# Create a vector of Excel files to read
files.to.read = list.files(path = ".", all.files = T)
# Read each file and write it to csv
lapply(files.to.read, function(f) {
  df = read.xlsx(f, sheet=1)
  write.csv(df, gsub("xlsx", "csv", f), row.names=FALSE)
})
Error in file(con, "r") : invalid 'description' argument In addition: Warning message:
In unzip(xlsxFile, exdir = xmlDir) : error 1 in extracting from zip file
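Not an R answer, but if stepping outside R is acceptable, a minimal Python/pandas sketch (assuming pandas plus the xlrd engine for legacy .xls files) of "read every .xls in a folder and stack them" would be:

# Sketch only: requires pandas and xlrd; the folder path is the one from the question.
import glob
import os

import pandas as pd

folder = r"F:\MUMuniv\machine learning class\price sensitivty\PS works\Customer files"

# Only pick up real .xls files, avoiding the extension-less entries that a bare
# directory listing can return (likely the cause of the "'file' has no extension" error).
paths = glob.glob(os.path.join(folder, "*.xls"))

frames = []
for p in paths:
    df = pd.read_excel(p)                    # first sheet by default
    df.to_csv(p[:-4] + ".csv", index=False)  # optional per-file CSV copy
    frames.append(df)

combined = pd.concat(frames, ignore_index=True)  # one stacked table, like rbindlist()
print(combined.shape)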