File seek in WLST / Jython 2.2.1 fails for lines longer than 8091 characters

For a CSV file generated in WLST / Jython 2.2.1, I want to update the header (the first line of the output file) when new metrics have been detected. This works fine when I use seek to go to the start of the file and overwrite the line, but it fails when the first line is longer than 8091 characters.
I made a simplified script that reproduces the issue I am facing.
#!/usr/bin/python
#
import sys
maxheaderlength = 8092  # 8091 or lower works, 8092 or higher fails
logFilename = "test.csv"
# Create (overwrite existing) file
logfileAppender = open(logFilename,"w",0)
logfileAppender.write("." * maxheaderlength)
logfileAppender.write("\n")
logfileAppender.close()
# Append some lines
logfileAppender = open(logFilename,"a",0)
logfileAppender.write("2nd line\n")
logfileAppender.write("3rd line\n")
logfileAppender.write("4th line\n")
logfileAppender.write("5th line\n")
logfileAppender.close()
# Seek back to beginning of file and add data
logfileAppender = open(logFilename,"r+",0)
logfileAppender.seek(0)
header = "New Header Line" + "." * maxheaderlength
header = header[:maxheaderlength]
logfileAppender.write(header)
logfileAppender.close()
When maxheaderlength is 8091 or lower, I get the expected result. The file test.csv starts with "New Header Line" followed by 8076 dots,
followed by the lines
2nd line
3rd line
4th line
5th line
When maxheaderlength is 8092 or higher, test.csv instead starts with 8092 dots, followed by "New Header Line" and then 8077 dots. The 2nd to 5th lines are not shown, probably because they were overwritten by the dots.
Any idea how to work around or fix this?

I too was able to reproduce this extremely odd behaviour and indeed it works correctly in Jython 2.5.3 so I think we can safely say this is a bug in 2.2.1 (which unfortunately you're stuck with for WLST).
My usual recourse in these circumstances is to fall back on native Java classes. Changing the last block of code as follows seems to work as expected:
# Seek back to beginning of file and add data
from java.io import RandomAccessFile
logfileAppender = RandomAccessFile(logFilename, "rw")
logfileAppender.seek(0)
header = "New Header Line" + "." * maxheaderlength
header = header[:maxheaderlength]
logfileAppender.writeBytes(header)
logfileAppender.close()
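Whichever API performs the write, overwriting in place only leaves the following lines intact when the new header occupies exactly as many bytes as the old one. A minimal sketch of that idea in plain Python, assuming a fixed header width is reserved up front. HEADER_WIDTH, pad_header, write_initial and rewrite_header are hypothetical names, and the sketch has not been run on Jython 2.2.1, where the RandomAccessFile version above is the safer choice for the actual write:

```python
import os
import tempfile

HEADER_WIDTH = 64  # hypothetical reserved width, excluding the newline


def pad_header(header, width=HEADER_WIDTH):
    """Pad the header with spaces to the reserved width; refuse to overflow."""
    if len(header) > width:
        raise ValueError("header longer than reserved width")
    return header.ljust(width)


def write_initial(path, header, rows):
    """Create the file with a padded header followed by the data rows."""
    f = open(path, "wb")
    try:
        f.write((pad_header(header) + "\n").encode("ascii"))
        for row in rows:
            f.write((row + "\n").encode("ascii"))
    finally:
        f.close()


def rewrite_header(path, header):
    """Overwrite the header bytes in place; the rows after it are untouched."""
    f = open(path, "r+b")
    try:
        f.seek(0)
        f.write(pad_header(header).encode("ascii"))
    finally:
        f.close()


def demo():
    """Write a small file, grow its header, and return the resulting lines."""
    path = os.path.join(tempfile.mkdtemp(), "test.csv")
    write_initial(path, "time,cpu", ["1,10", "2,20"])
    rewrite_header(path, "time,cpu,mem")
    f = open(path, "rb")
    try:
        return f.read().decode("ascii").split("\n")
    finally:
        f.close()
```

The padding leaves trailing spaces after the last column name, so whatever reads the CSV would need to strip them. If the header can outgrow the reserved width, the whole file has to be rewritten instead.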

Related

How to save an updated fits file with headers in correct places?

I want to edit the data in my FITS file using astropy and then save it back to the original file. Below are my code and the error message. Please ignore any redundant line: I do open the file twice, but I still get the error after removing the extra open.
import glob
import numpy as np
from astropy.io import fits

file_list = sorted(glob.glob('*.fits'))  # read in my three fits files
hdudata = np.full((3,720,1440), 0)  # a test list to store the data
for im in range(len(file_list)):
    hdu_list = fits.open(file_list[im])
    hdudata[im] = hdu_list[0].data  # read in the data from fits file
    if im == 2:  # I only want to change the last image
        with fits.open(file_list[im], mode='update') as hdus:
            hdu = hdus[0]
            hdu.data = (hdudata[im-1] + hdudata[im])/2.  # basically add two images
                                                         # and take the average
            hdu.close()  # this is required otherwise an error message pops up saying
                         # the next line cannot proceed as the file is being run
            hdu.flush()  # the error line
VerifyError:
Verification reported errors:
HDU 0:
'NAXIS1' card at the wrong place (card 4).
'NAXIS2' card at the wrong place (card 5).
'EXTEND' card at the wrong place (card 6).
Note: astropy.io.fits uses zero-based indexing.
I have only accessed and changed the data, so why does the error occur in my header? I had no problem reading the headers (though I didn't include that in the code above), so why do they fail verification when saving?

pdfminer3k - pdf2txt.py error

I want to convert my PDF files to text files using the pdfminer3k module and pdf2txt.py, but I get an error.
pdf2txt.py -o file.txt -t tag file.pdf
This is the command I ran at the cmd prompt.
Traceback (most recent call last):
  File "C:\Python36\lib\site.py", line 67, in <module>
    import os
  File "C:\Python36\lib\os.py", line 409
    yield from walk(new_path, topdown, onerror, followlinks)
    ^
SyntaxError: invalid syntax
This is the error message I got.
Could you help me fix this problem?
Added for reference, a great resource:
http://www.degeneratestate.org/posts/2016/Jun/15/extracting-tabular-data-from-pdfs/
The -t flag sets the type of output. The options are text, tag, xml, and html.
The tag type generates tags for XML. Replace tag with text in your command and try it.
The order of the optional arguments also matters.
You must also invoke python explicitly: your command line doesn't know what import means, even though part of your environment seems to be set up. My example is for Windows cmd, run from the Anaconda3\Scripts directory. If you're in a Jupyter notebook or a console, you should be able to run import pdf2txt (without the .py extension).
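To set up your environment, append the directory containing pdf2txt.py with sys.path.append(yourpdfdirectory) so the import works (os.path has no append function). Also pass the full path to file.pdf or run from its directory, otherwise file.pdf will not be found.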
Try python pdf2txt.py -t text -o file.txt file.pdf
Or, if you are brave, this is how to do it programmatically. The trouble with XML output is that if you want the text, each character from the XML tree is returned in an arbitrary order. You can get it to work, but you need to build the string character by character, which is not that hard, it's just time-consuming to get the logic right.
# Import paths below are the ones pdfminer3k uses; other pdfminer forks differ.
from pdfminer.pdfparser import PDFParser, PDFDocument
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LAParams, LTTextBox, LTTextBoxHorizontal, LTTextLine

filesin = 'file.pdf'   # placeholder: path of the input PDF
obj_textbox_line = {}  # maps each text layout object to its text

fp = open(filesin, 'rb')
parser = PDFParser(fp)
doc = PDFDocument()
parser.set_document(doc)
doc.set_parser(parser)
doc.initialize('')
rsrcmgr = PDFResourceManager(caching=False)
laparams = LAParams(all_texts=True)
laparams.boxes_flow = -0.2
laparams.paragraph_indent = 0.2
laparams.detect_vertical = False
#laparams.heuristic_word_margin = 0.03
laparams.word_margin = 0.2
laparams.line_margin = 0.3
outfp = open(filesin + ".out.tag", 'wb')  # opened for later output; nothing is written to it here
device = PDFPageAggregator(rsrcmgr, laparams=laparams)
interpreter = PDFPageInterpreter(rsrcmgr, device)
#process_pdf(rsrcmgr, device, pdfparse, pagenos,caching=c, check_extractable=True)
for p, page in enumerate(doc.get_pages()):
    if p == 0:  # temporary for page 1
        interpreter.process_page(page)
        layout = device.get_result()
        alltextinbox = ''
        # This is a rich environment so categorization of this object hierarchy is needed
        for c, lt_obj in enumerate(layout):
            #print(type(lt_obj),"This is type ",c,"th object on the ",p,"th page")
            if isinstance(lt_obj, (LTTextBoxHorizontal, LTTextBox, LTTextLine)):
                print("Type ,", type(lt_obj), " and text ..", lt_obj.get_text())
                obj_textbox_line.update({lt_obj: lt_obj.get_text()})
    elif p != 0:
        pass
fp.close()
outfp.close()
#print(obj_textbox_line)
#call the column finder here
#check_matching("example", "example1")
#text_doc_df = pd.DataFrame(obj_textbox_line,columns=['text'])
#print (text_doc_df)
I'm working on a generic row/column matcher. If you don't want to bother, you can buy this software already for like 150 bucks for a pro converter.

How can I print a certain line using for line in lines and line length in Python?

I have to use the sys module for this. What I have so far is this:
import sys
file = sys.argv[1]
fp1 = open(file, 'r+')
fp2 = open(file + 'cl.', 'w+')
lines = fp1.readlines()
for line in lines:
    if len(line) > 1 and line[0] == 'Query':
        print line.split('|')[0:1]
fp1.close()
Basically when I run this on the command line:
python homework4.py sqout
It gives me nothing, but if I take away the line[0]=='Query' condition,
it prints the first 2 splits of every line. That is the output I want, just not for every line: I only want it to print the first line that starts with Query. Thanks
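line[0] is just the first character of the string line, so it can never equal the five-character string 'Query'. You could use line[0:5]=='Query' or line[:5]=='Query'.
Before doing this I suggest checking first that len(line)>4, although slicing a shorter line does not raise an error: it simply yields a shorter string, and the comparison is then False.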
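A minimal sketch of the same check using str.startswith, which sidesteps the length question because it simply returns False for short lines. It also stops at the first match, since only the first Query line is wanted. The function name first_query_fields and the sample lines are made up for illustration:

```python
def first_query_fields(lines, prefix="Query", count=2):
    """Return the first `count` '|'-separated fields of the first line starting with prefix."""
    for line in lines:
        if line.startswith(prefix):
            return line.rstrip("\n").split("|")[:count]
    return None  # no line starts with the prefix


# Made-up sample input standing in for fp1.readlines()
sample = ["\n", "Hit|x|y\n", "Query|sp|P12345|more\n", "Query|second\n"]
print(first_query_fields(sample))
```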

Creating a $variable from a specific part of a txt file

I'm trying to get PowerShell to use a specific section of a text file as a $variable to be used later in the script.
With Get-Content and index I can get to the point of having a whole line, but I just want one word to be the variable, not the whole thing.
The alphanumeric code will always be in exactly the same location: line 5 (counting the first line as 0, of course), between characters 22 and 30 (in other words, the last 8 characters of that line).
I would like that section of the document to be identified as $txtdoc, to be used later in:
$inputfield = $ie.Document.getElementByID('input5')
$inputfield.value = $txtdoc
The txt file contains the following
From: *************
Sent: *************
To: *******************
Subject: *************
On-Demand Tokencode: 79960739
Expires after use or 60 minutes
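Maybe this? Get-Content (alias gc) returns an array of lines, so [5] picks line 5, and Substring(21, 8) takes the 8 characters starting at zero-based offset 21: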
$variable = ( gc mytext.txt )[5].substring(21,8)

How to modify a line in a file with Erlang OTP module

I have a big file and I would like to replace the first line with other content.
When I use {ok, IoDev} = file:open("/root/FileName", [write, raw, binary]), the whole content is removed.
But when I use {ok, IoDev} = file:open("/root/FileName", [append, raw, binary]) and file:pwrite(S, {bof,0}, <<"new content\n">>), I got the result {error, badarg}.
If I set Location to 0: file:pwrite(S, 0, <<"new content\n">>), the string is appended at tail of the file.
You seem to be confused with the actual file API.
file:open/2 will truncate the file if you pass [write, raw, binary] as you do:
(about write mode): The file is opened for writing. It is created if it does not exist. If the file exists, and if write is not combined with read, the file will be truncated.
So you need to pass either [write, read] or [write, append] as documented.
file:pwrite/3 also works exactly as documented. It allows you to write at a given position in the file. In particular, you cannot pass {bof, 0} as second argument since you opened the file in raw mode:
If IoDevice has been opened in raw mode, some restrictions apply: Location is only allowed to be an integer; and the current position of the file is undefined after the operation.
The following sample code shows how they work:
ok = file:write_file("/tmp/file", "This is line 1.\nThis is line 2.\n"),
{ok, F} = file:open("/tmp/file", [read, write, raw, binary]),
ok = file:pwrite(F, 0, <<"This is line A.\n">>),
ok = file:close(F),
{ok, Content} = file:read_file("/tmp/file"),
io:put_chars(Content),
ok = file:delete("/tmp/file").
It will output:
This is line A.
This is line 2.
This works because text "This is line A.\n" is exactly as long as "This is line 1.\n". It does not really replace the line, but just bytes. If you need to replace the first line with content that has a different length, you need to rewrite the whole content of the file. A common approach is indeed to write a new file and swap them eventually. If the file is small enough, however, you can read it entirely in memory and rewrite it. file:read_file/1 and file:write_file/2 would work:
replace_first_line(Path, NewLine) ->
    {ok, Content} = file:read_file(Path),
    %% Split at the first newline only; the old first line is discarded.
    [_FirstLine | Tail] = binary:split(Content, <<"\n">>),
    NewContent = [NewLine, <<"\n">> | Tail],
    ok = file:write_file(Path, NewContent).
The question is not specific to Erlang; it is about file operations in general.
Replacing a line in a file requires rewriting the whole file. The easiest way to do so is to write all the new content to a new file and then move that file over the original.
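The write-a-new-file-then-move approach can be sketched as follows. Python is used here only for illustration (in Erlang the same steps would end with file:rename/2), and replace_first_line is a hypothetical helper name. It assumes Python 3.3 or later for os.replace:

```python
import os
import shutil
import tempfile


def replace_first_line(path, new_line):
    """Rewrite `path` with its first line replaced by `new_line` (given without a newline)."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temporary file next to the original so the final
    # rename stays on the same filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as dst, open(path, "rb") as src:
            src.readline()                      # skip the old first line
            dst.write(new_line.encode("utf-8") + b"\n")
            shutil.copyfileobj(src, dst)        # stream the rest unchanged
        os.replace(tmp_path, path)              # swap the new file into place
    except Exception:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

Because the rest of the file is streamed rather than loaded into memory, this also suits big files. Permissions and ownership of the original file are not carried over in this sketch.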