pdfminer3k - pdf2txt.py error - pdf

I want to convert my PDF files to text files using the pdfminer3k module and its pdf2txt.py script, but I get an error.
pdf2txt.py -o file.txt -t tag file.pdf
This is the command I ran at the cmd prompt.
Traceback (most recent call last):
  File "C:\Python36\lib\site.py", line 67, in <module>
    import os
  File "C:\Python36\lib\os.py", line 409
    yield from walk(new_path, topdown, onerror, followlinks)
             ^
SyntaxError: invalid syntax
This is the error message I got.
Could you help me fix this problem?

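Added for reference, a great resource:
http://www.degeneratestate.org/posts/2016/Jun/15/extracting-tabular-data-from-pdfs/
The -t flag sets the type of output. The options are text, tag, xml, and html.
The tag type generates tags for XML. Replace tag with text in your command and try it.
The order of the optional arguments also matters.
You also need to invoke python explicitly. The traceback shows the yield from statement in Python 3.6's os.py being rejected as a syntax error, which means an interpreter older than Python 3.3 (where yield from was introduced) is picking up the Python 3.6 library, so your environment seems to be only partly set up. My example is for Windows cmd, run from the Anaconda3\Scripts directory. If you're in a Jupyter notebook or a console, you should be able to run import pdf2txt (the import statement takes the module name, without the .py extension).
To set up your environment, run the command from the directory that contains file.pdf or pass the full path to it; otherwise file.pdf will not be found. (Note that os.path has no append function, and sys.path.append(yourpdfdirectory) only affects where Python looks for modules, not for data files.)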
Try python pdf2txt.py -t text -o file.txt file.pdf
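If you would rather launch the script from Python than from cmd, the command above can be assembled like this. It is only a sketch: build_pdf2txt_command is a name I made up, the script and file paths are placeholders, and the list of output types is the one given above for the -t flag.

```python
import sys

# Output types listed above for the -t flag.
OUTPUT_TYPES = ("text", "tag", "xml", "html")


def build_pdf2txt_command(script_path, pdf_path, out_path, output_type="text"):
    """Return an argument list that runs pdf2txt.py with the current interpreter."""
    if output_type not in OUTPUT_TYPES:
        raise ValueError("unsupported output type: %r" % (output_type,))
    # sys.executable is the interpreter running this code, so the script is
    # not handed to some other Python installation found on the PATH.
    return [sys.executable, script_path, "-t", output_type, "-o", out_path, pdf_path]


if __name__ == "__main__":
    print(build_pdf2txt_command("pdf2txt.py", "file.pdf", "file.txt"))
```

The resulting list can be passed to subprocess.run(cmd, check=True). Using sys.executable is the point of the sketch: it pins the interpreter, which is what the traceback in the question suggests went wrong.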
Or, if you are brave, this is how to do it programmatically. The trouble with XML output is that, if you want the text, each character comes back from the XML tree in an arbitrary order. You can get it to work, but you need to build the string character by character, which is not that hard, just time-consuming to get logically right.
# Imports as laid out in pdfminer3k (PDFDocument lives in pdfminer.pdfparser there)
from pdfminer.pdfparser import PDFParser, PDFDocument
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LAParams, LTTextBox, LTTextBoxHorizontal, LTTextLine

filesin = 'file.pdf'   # path to the input PDF
obj_textbox_line = {}  # maps each text layout object to its text

fp = open(filesin, 'rb')
parser = PDFParser(fp)
doc = PDFDocument()
parser.set_document(doc)
doc.set_parser(parser)
doc.initialize('')
rsrcmgr = PDFResourceManager(caching=False)
laparams = LAParams(all_texts=True)
laparams.boxes_flow = -0.2
laparams.paragraph_indent = 0.2
laparams.detect_vertical = False
#laparams.heuristic_word_margin = 0.03
laparams.word_margin = 0.2
laparams.line_margin = 0.3
outfp = open(filesin + ".out.tag", 'wb')
device = PDFPageAggregator(rsrcmgr, laparams=laparams)
interpreter = PDFPageInterpreter(rsrcmgr, device)
#process_pdf(rsrcmgr, device, pdfparse, pagenos,caching=c, check_extractable=True)
for p, page in enumerate(doc.get_pages()):
    if p == 0:  # temporary: first page only
        interpreter.process_page(page)
        layout = device.get_result()
        alltextinbox = ''
        # This is a rich environment so categorization of this object hierarchy is needed
        for c, lt_obj in enumerate(layout):
            #print(type(lt_obj),"This is type ",c,"th object on the ",p,"th page")
            if isinstance(lt_obj, (LTTextBoxHorizontal, LTTextBox, LTTextLine)):
                print("Type ,", type(lt_obj), " and text ..", lt_obj.get_text())
                obj_textbox_line.update({lt_obj: lt_obj.get_text()})
    elif p != 0:
        pass
fp.close()
outfp.close()
#print(obj_textbox_line)
#call the column finder here
#check_matching("example", "example1")
#text_doc_df = pd.DataFrame(obj_textbox_line,columns=['text'])
#print (text_doc_df)
I'm working on a generic row/column matcher. If you don't want to bother, you can buy this software already for like 150 bucks for a pro converter.
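To give an idea of what "building the string character by character" can look like, here is a minimal sketch in plain Python. It does not use pdfminer at all: Char is a hypothetical record holding a character and its position, chars_to_lines is a name I made up, and the y_tolerance value is an assumption that would need tuning per document.

```python
from collections import namedtuple

# Hypothetical record: one character and the position of its lower-left corner.
Char = namedtuple("Char", "text x y")


def chars_to_lines(chars, y_tolerance=2.0):
    """Group characters into lines by y position, then order each line by x."""
    lines = []  # each entry is [reference_y, [Char, ...]]
    # PDF y coordinates grow upward, so the top line has the largest y.
    for ch in sorted(chars, key=lambda c: -c.y):
        if lines and abs(lines[-1][0] - ch.y) <= y_tolerance:
            lines[-1][1].append(ch)
        else:
            lines.append([ch.y, [ch]])
    return ["".join(c.text for c in sorted(row, key=lambda c: c.x))
            for _, row in lines]


if __name__ == "__main__":
    scrambled = [Char("b", 10, 100), Char("a", 0, 100),
                 Char("d", 10, 80), Char("c", 0, 80.5)]
    print(chars_to_lines(scrambled))
```

Grouping by y first and sorting by x second is the simplest ordering rule; it will not handle multi-column pages, which is where a row/column matcher comes in.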


In Google Colab I get IOPub data rate exceeded

IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
--NotebookApp.iopub_data_rate_limit.
Current values:
NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
NotebookApp.rate_limit_window=3.0 (secs)
An IOPub error usually occurs when you try to print a large amount of data to the console. Check your print statements: if you're trying to print a file that exceeds 10MB, it's likely that this caused the error. Try to read smaller portions of the file/data.
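One way to keep a print statement without flooding the notebook is to print only a bounded preview. This is a rough sketch; preview is a name I made up and the 1000-character default is an arbitrary assumption, not a Colab setting.

```python
def preview(data, limit=1000):
    """Return at most `limit` characters of data, plus a note on how much was cut."""
    # Accept both str and bytes, since downloaded files usually arrive as bytes.
    text = data if isinstance(data, str) else data.decode("utf-8", errors="replace")
    if len(text) <= limit:
        return text
    return text[:limit] + "\n... [%d more characters not shown]" % (len(text) - limit)


if __name__ == "__main__":
    big = "x" * 5000000
    print(preview(big, limit=80))
```

Calling print(preview(contents)) instead of print(contents) keeps the output small no matter how large the data is.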
I faced this issue while reading a file from Google Drive to Colab.
I used this link https://colab.research.google.com/notebook#fileId=/v2/external/notebooks/io.ipynb
and the problem was in this block of code
# Download the file we just uploaded.
#
# Replace the assignment below with your file ID
# to download a different file.
#
# A file ID looks like: 1uBtlaggVyWshwcyP6kEI-y_W3P8D26sz
file_id = 'target_file_id'
import io
from googleapiclient.http import MediaIoBaseDownload
request = drive_service.files().get_media(fileId=file_id)
downloaded = io.BytesIO()
downloader = MediaIoBaseDownload(downloaded, request)
done = False
while done is False:
    # _ is a placeholder for a progress object that we ignore.
    # (Our file is small, so we skip reporting progress.)
    _, done = downloader.next_chunk()
downloaded.seek(0)
#Remove this print statement
#print('Downloaded file contents are: {}'.format(downloaded.read()))
I had to remove the last print statement, print('Downloaded file contents are: {}'.format(downloaded.read())), since its output exceeded the 10MB limit in the notebook.
Your file will still be downloaded, and you can read it in smaller chunks or read only a portion of it.
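Reading in smaller chunks could look roughly like this. It is a sketch only: iter_chunks is a name I made up, the 64 KiB chunk size is an arbitrary assumption, and the BytesIO object here is a stand-in for the downloaded buffer so that the example runs without Google Drive.

```python
import io


def iter_chunks(stream, chunk_size=64 * 1024):
    """Yield successive blocks of at most chunk_size bytes from a binary stream."""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        yield chunk


# Stand-in for the `downloaded` buffer filled by MediaIoBaseDownload.
downloaded = io.BytesIO(b"x" * 200000)
downloaded.seek(0)
sizes = [len(chunk) for chunk in iter_chunks(downloaded)]
# Report sizes only, never the contents, so the notebook output stays small.
print("read %d chunks, %d bytes in total" % (len(sizes), sum(sizes)))
```

Each chunk can be processed or written to disk inside the loop; the important part is that nothing large is ever sent to print.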
The above answer is correct: I just commented out the print statement and the error went away. I'm keeping this here in case someone finds it useful. If you are reading a CSV file from Google Drive, just import pandas and call pd.read_csv(downloaded), and it will work just fine.
import io
import pandas as pd
from googleapiclient.http import MediaIoBaseDownload

file_id = 'FILEID'
request = drive_service.files().get_media(fileId=file_id)
downloaded = io.BytesIO()
downloader = MediaIoBaseDownload(downloaded, request)
done = False
while done is False:
    # _ is a placeholder for a progress object that we ignore.
    # (Our file is small, so we skip reporting progress.)
    _, done = downloader.next_chunk()
downloaded.seek(0)
# Parse the downloaded bytes directly instead of printing them.
df = pd.read_csv(downloaded)
Maybe this will help (via sv1997):
IOPub Error on Google Colaboratory in Jupyter Notebook
The IOPub error occurs in Colab because you are trying to display very large output on the console itself (e.g. with print() statements).
The IOPub error may be related to a print call.
So delete or comment out the print call; that may resolve the error.
%cd darknet
!sed -i 's/OPENCV=0/OPENCV=1/' Makefile
!sed -i 's/GPU=0/GPU=1/' Makefile
!sed -i 's/CUDNN=0/CUDNN=1/' Makefile
!sed -i 's/CUDNN_HALF=0/CUDNN_HALF=1/' Makefile
!apt update
!apt-get install libopencv-dev
It's important to update your Makefile. Also, make sure your input file name is correct.

writeOGR error: creation of output file failed

I'm an R rookie attempting to create home ranges from fish telemetry data using kernel density estimates with the adehabitatHR package:
kud <- kernelUD(muskydetectdata.P[,6], h="href", extent = 5)
class(kud)
image(kud)
kud[[1]]@h
muskykud.P95 <- getverticeshr(kud, percent = 95)
muskykud.P95
muskykud.P50 <- getverticeshr(kud, percent = 50)
muskykud.P50
When exporting to a shapefile with
writeOGR(muskydetectdata.sp,"musky_kde1", "gps",
driver="ESRI Shapefile",
dataset_options= "FieldName= id")
an error message is displayed:
##creation of output file failed
I have also tried writeSpatialShape, with similar results.
I'm using R version 3.3.2 on 64-bit Windows.
I had the same problem and solved it only when I gave the full path of my directory and the layer name plus the .shp suffix:
writeOGR(muskydetectdata.sp, dsn="d:/your directory here/musky_kde.shp", layer="musky_kde", driver="ESRI Shapefile")
I had the same error.
I resolved mine by correcting the directory it was saving to (making sure it existed), e.g.
writeOGR(muskydetectdata.sp, dsn = save.dir, layer = filename.save, driver = 'ESRI Shapefile')
where save.dir is the directory to save to, as a string, and filename.save is the file name to save as (excluding the extension).
I guess you are trying to write to an existing file, and the writeOGR function doesn't allow that. As far as I remember, this is known behavior of some drivers supported by OGR (in R as in Python and in the C API).
You have to check whether the file exists before writing and remove it (or change the path you want to use).
For example, here the first write operation succeeds, but the attempt to overwrite the file fails with your error message:
> rgdal::writeOGR(spdf, 'b.shp', layer="brazil", driver='ESRI Shapefile')
> rgdal::writeOGR(spdf, 'b.shp', layer="brazil", driver='ESRI Shapefile')
Error in rgdal::writeOGR(spdf, "b.shp", layer = "brazil", driver = "ESRI Shapefile") :
Creation of output file failed

File seek in WLST / Jython 2.2.1 fails for lines longer than 8091 characters

For a CSV file generated in WLST / Jython 2.2.1, I want to update the header (the first line of the output file) when new metrics have been detected. This works fine by using seek to go to the start of the file and overwriting the line, but it fails when the first line exceeds 8091 characters.
I made a simplified script that reproduces the issue I am facing.
#!/usr/bin/python
#
import sys
global maxheaderlength
global initheader
maxheaderlength=8092
logFilename = "test.csv"
# Create (overwrite existing) file
logfileAppender = open(logFilename,"w",0)
logfileAppender.write("." * maxheaderlength)
logfileAppender.write("\n")
logfileAppender.close()
# Append some lines
logfileAppender = open(logFilename,"a",0)
logfileAppender.write("2nd line\n")
logfileAppender.write("3rd line\n")
logfileAppender.write("4th line\n")
logfileAppender.write("5th line\n")
logfileAppender.close()
# Seek back to beginning of file and add data
logfileAppender = open(logFilename,"r+",0)
logfileAppender.seek(0) ;
header = "New Header Line" + "." * maxheaderlength
header = header[:maxheaderlength]
logfileAppender.write(header)
logfileAppender.close()
When maxheaderlength is 8091 or lower, I get the expected results: test.csv starts with "New Header Line" followed by 8076 dots,
followed by the lines
2nd line
3rd line
4th line
5th line
When maxheaderlength is 8092 or higher, test.csv starts with 8092 dots followed by "New Header Line" and then 8077 dots. The 2nd to 5th lines are not shown, probably overwritten by the dots.
Any idea how to work around or fix this?
I too was able to reproduce this extremely odd behaviour and indeed it works correctly in Jython 2.5.3 so I think we can safely say this is a bug in 2.2.1 (which unfortunately you're stuck with for WLST).
My usual recourse in these circumstances is to fall back on native Java classes. Changing the last block of code as follows seems to work as expected:
# Seek back to beginning of file and add data
from java.io import RandomAccessFile
logfileAppender = RandomAccessFile(logFilename, "rw")
logfileAppender.seek(0) ;
header = "New Header Line" + "." * maxheaderlength
header = header[:maxheaderlength]
logfileAppender.writeBytes(header)
logfileAppender.close()
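For comparison, this is the result the original script is expected to produce, written as a small self-contained check. Note the assumptions: it runs under CPython 3 rather than Jython 2.2.1, overwrite_header is a name I made up, and it writes to a temporary directory instead of the real log file.

```python
import os
import tempfile


def overwrite_header(path, header, width):
    """Overwrite the first `width` bytes of the file with a dot-padded header."""
    padded = header.ljust(width, ".")[:width]
    # "r+b" opens for reading and writing without truncating the file.
    with open(path, "r+b") as f:
        f.seek(0)
        f.write(padded.encode("ascii"))


path = os.path.join(tempfile.mkdtemp(), "test.csv")
width = 8092
with open(path, "wb") as f:
    f.write(b"." * width + b"\n2nd line\n3rd line\n")

overwrite_header(path, "New Header Line", width)

with open(path, "rb") as f:
    content = f.read()
print(content[:20])
```

Because the new header has exactly the same width as the old one, the lines after it stay where they are; that is the behavior the RandomAccessFile version restores.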

Jython - importing a text file to assign global variables

I am using Jython and wish to import a text file that contains many configuration values such as:
QManager = MYQM
ProdDBName = MYDATABASE
etc.
.. and then I am reading the file line by line.
As I read each line, I assign whatever is before the = sign to a local loop variable named MYVAR and whatever is after it to a local loop variable named MYVAL. What I cannot figure out is how to ensure that, once the loop finishes, I have a set of global variables such as QManager and ProdDBName.
I've been working on this for days - I really hope someone can help.
Many thanks,
Bret.
See other question: Properties file in python (similar to Java Properties)
Automatically setting global variables is not a good idea in my opinion; I would prefer a global ConfigParser object or a dictionary. If your config file is similar to a Windows .ini file, you can read it and set some global variables with something like:
def read_conf():
    global QManager
    import ConfigParser
    conf = ConfigParser.ConfigParser()
    conf.read('my.conf')
    QManager = conf.get('QM', 'QManager')

read_conf()  # must run before QManager is used
print('Conf option QManager: [%s]' % (QManager))
(This assumes you have a [QM] section in your my.conf config file.)
If you want to parse the config file without the help of ConfigParser or a similar module, then try:
my_options = {}
f = open('my.conf')
for line in f:
    if '=' in line:
        k, v = line.split('=', 1)
        k = k.strip()
        v = v.strip()
        print('debug [%s]:[%s]' % (k, v))
        my_options[k] = v
f.close()
print('-' * 20)
# this will show just read value
print('Option QManager: [%s]' % (my_options['QManager']))
# this will fail with KeyError exception
# you must be aware of non-existing values or values
# where case differs
print('Option qmanager: [%s]' % (my_options['qmanager']))
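To avoid the KeyError and the case problem shown above, the dictionary approach can be wrapped in a small function. This is a sketch under a few assumptions: parse_config is a name I made up, keys are lower-cased on purpose, and treating lines that start with # as comments is my own convention, not something the question specifies.

```python
def parse_config(lines):
    """Parse KEY = VALUE lines into a dict with lower-cased keys."""
    options = {}
    for line in lines:
        line = line.strip()
        # Skip blanks, comment lines (assumed to start with '#') and junk.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        options[key.strip().lower()] = value.strip()
    return options


sample = ["QManager = MYQM", "ProdDBName = MYDATABASE", "", "# a comment"]
config = parse_config(sample)
print(config.get("qmanager"))
# .get() returns the fallback instead of raising KeyError.
print(config.get("missing", "default value"))
```

With a real file you would pass the open file object, since it iterates line by line. If you really do want module-level variables, globals().update(config) would create them, but looking values up in the dictionary keeps it obvious where each setting comes from.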

Python file location conventions and Import errors

I'm new to Python and I simply don't know how to handle this specific problem:
I'm trying to run an executable (named ros_M5e.py) that's located in the directory /opt/ros/diamondback/stacks/hrl/hrl_rfid/src/hrl_rfid/ros_M5e.py (annoyingly long filepath, I know, but necessary). However, within the ros_M5e.py file there is a call to another file that is further up the file path: from hrl.hrl_rfid.msg import RFIDread. The directory msg indeed is located at /opt/ros/diamondback/stacks/hrl/hrl_rfid/ and it does indeed contain the file RFIDread. However, whenever I try to execute ros_M5e.py I get this error:
Traceback (most recent call last):
File "/opt/ros/diamondback/stacks/hrl/hrl_rfid/src/hrl_rfid/ros_M5e.py", line 37, in <module>
from hrl.hrl_rfid.msg import RFIDread
ImportError: No module named hrl.hrl_rfid.msg
Would someone with some expertise please shine some light on this problem for me? It seems like just a rudimentary file location problem, but I just don't know the appropriate Python conventions to fix it. I've tried putting the ros_M5e.py file in the same directory as the files it calls and changing the filepaths but to no avail.
Thanks a lot,
Khiya
Sure, I can help you get it up and running.
From the StackOverflow posting, it would seem that you're checking out the stack to /opt/ros/diamondback. This is no good, as it is a system path. You need to install into your local path. The reason for "readonly" on the repository is that you do not have permissions to make changes to the code -- it will still work just fine for you on your local machine. I spent a fair amount of time showing how to use this package (at least the python version) here:
http://www.ros.org/wiki/hrl_rfid
I'll try to do a quick run-through for installing it.... Run the following commands:
cd
mkdir sandbox
cd sandbox/
svn checkout http://gt-ros-pkg.googlecode.com/svn/trunk/hrl/hrl_rfid hrl_rfid (double-check that this checkout works OK!)
Add the following line to the bottom of your bashrc to tell ROS where to find the new package. (You may use "gedit ~/.bashrc")
export ROS_PACKAGE_PATH=$ROS_PACKAGE_PATH:$HOME/sandbox/hrl_rfid
Now execute the following:
roscd hrl_rfid (did you end up in the correct directory?)
rosmake hrl_rfid (did it make without errors?)
roscd hrl_rfid/src/hrl_rfid
At this point everything is actually installed correctly. By default, ros_M5e.py assumes that the reader is located at "/dev/robot/RFIDreader". Unless you've already altered udev rules, this will not be the case on your machine. I suggest running through the code:
http://www.ros.org/wiki/hrl_rfid
using iPython (a command-line python prompt that will let you execute python commands one at a time) to make sure everything is working (replace /dev/ttyUSB0 with whatever device your RFID reader is connected as):
import lib_M5e as M5e
r = M5e.M5e( '/dev/ttyUSB0', readPwr = 3000 )
r.ChangeAntennaPorts( 1, 1 )
r.QueryEnvironment()
r.TrackSingleTag( 'test_tag_id_' )
r.ChangeTagID( 'test_tag_id_' )
r.QueryEnvironment()
r.TrackSingleTag( 'test_tag_id_' )
r.ChangeAntennaPorts( 2, 2 )
r.QueryEnvironment()
This means that the underlying library is working just fine. Next, test ROS (make sure "roscore" is running!), by putting this in a python file and executing:
import time
import lib_M5e as M5e

def P1(r):
    r.ChangeAntennaPorts(1,1)
    return 'AntPort1'

def P2(r):
    r.ChangeAntennaPorts(2,2)
    return 'AntPort2'

def PrintDatum(data):
    ant, ids, rssi = data
    print data

r = M5e.M5e( '/dev/ttyUSB0', readPwr = 3000 )
q = M5e.M5e_Poller(r, antfuncs=[P1, P2], callbacks=[PrintDatum])
# Query the environment for about 3 seconds
q.query_mode()
t0 = time.time()
while time.time() - t0 < 3.0:
    time.sleep( 0.1 )
# Track a single tag for about 3 seconds
q.track_mode( 'test_tag_id_' )
t0 = time.time()
while time.time() - t0 < 3.0:
    time.sleep( 0.1 )
q.stop()
OK, everything works now. You can make your own node that is tuned to your setup:
#!/usr/bin/python
import rospy
import ros_M5e as rm

def P1(r):
    r.ChangeAntennaPorts(1,1)
    return 'AntPort1'

def P2(r):
    r.ChangeAntennaPorts(2,2)
    return 'AntPort2'

ros_rfid = rm.ROS_M5e( name = 'my_rfid_server',
                       readPwr = 3000,
                       portStr = '/dev/ttyUSB0',
                       antFuncs = [P1, P2],
                       callbacks = [] )
rospy.spin()
ros_rfid.stop()
Or, ping me back and I can tweak ros_M5e.py to take an optional "portStr" -- though I recommend making your own so that you can name your antennas sensibly. Also, I highly recommend setting udev rules to ensure that the RFID reader always gets assigned to the same device: http://www.hsi.gatech.edu/hrl-wiki/index.php/Linux_Tools#udev_for_persistent_device_naming
BUS=="usb", KERNEL=="ttyUSB*", SYSFS{idVendor}=="0403", SYSFS{idProduct}=="6001", SYSFS{serial}=="ftDXR6FS", SYMLINK+="robot/RFIDreader"
If you do not do this... there is no guarantee that the reader will always be enumerated at /dev/ttyUSBx.
Let me know if you have any further problems.
~Travis Deyle (Hizook.com)
PS -- Did you modify ros_M5e.py to "from hrl.hrl_rfid.msg import RFIDread"? In the repo, it is "from hrl_rfid.msg import RFIDread". The latter is correct. As long as you have your ROS_PACKAGE_PATH correctly defined, and you've run rosmake on the package, then the import statement should work just fine. Also, I would not recommend posting ROS-related questions to StackOverflow. Very few people on here are going to be familiar with the ROS ecosystem (which is VERY complex). Please post questions here instead:
http://answers.ros.org/
http://code.google.com/p/gt-ros-pkg/issues/list
You need to make sure that the following are true:
The directory /opt/ros/diamondback/stacks/ is in your Python path.
/opt/ros/diamondback/stacks/hrl contains __init__.py
/opt/ros/diamondback/stacks/hrl/hrl_rfid contains __init__.py
/opt/ros/diamondback/stacks/hrl/hrl_rfid/msg contains __init__.py
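The conditions above can be tried out on a throwaway directory tree. This is just an illustration: demo_stack, demo_pkg and reading.py are made-up names, nothing here touches the ROS installation, and it is written for Python 3 (where namespace packages make __init__.py optional, unlike the Python 2 used by this ROS release).

```python
import importlib
import os
import sys
import tempfile

# Build a throwaway tree: <root>/demo_stack/demo_pkg/msg/reading.py
root = tempfile.mkdtemp()
msg_dir = os.path.join(root, "demo_stack", "demo_pkg", "msg")
os.makedirs(msg_dir)
for package_dir in (
    os.path.join(root, "demo_stack"),
    os.path.join(root, "demo_stack", "demo_pkg"),
    msg_dir,
):
    # An empty __init__.py marks each directory as a package.
    with open(os.path.join(package_dir, "__init__.py"), "w") as f:
        f.write("")
with open(os.path.join(msg_dir, "reading.py"), "w") as f:
    f.write("VALUE = 42\n")

# The directory that CONTAINS the top-level package must be on sys.path.
sys.path.insert(0, root)
importlib.invalidate_caches()
module = importlib.import_module("demo_stack.demo_pkg.msg.reading")
print(module.VALUE)
```

The dotted name in the import mirrors the directory names below the sys.path entry, which is why the import in the question only resolves if /opt/ros/diamondback/stacks/ itself is on the path.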
As the asker explained in the comments, RFIDread does not have a .py extension, so here is how it can be imported:
import imp
imp.load_source('RFIDread', '/opt/ros/diamondback/stacks/hrl/hrl_rfid/msg/RFIDread.msg')
Check out the imp documentation for more information.