I have a formatted data file which is typically billions of lines long, with several lines of headers of variable length. The data file takes the form:
# header 1
# header 2
# headers are of variable length.
# data begins from next line.
1.23 4.56 7.89 0.12
2.34 5.67 8.90 1.23
:
:
# billions of lines of data, each row the same length, same format.
-- end of file --
I would like to extract a portion of data from this file, and my current code looks like:
<pre>
do j=1,jmax !Suppose I want to extract jmax lines of data from the file.
[algorithm to determine number of lines to skip, "N(j)"]
!This determines the number of lines to skip from the previous file
!position, when the data was read on j-1th iteration.
!Skip N-1 lines to go to the next data line to read off:
do i=1,N-1
read(unit=unit,fmt='(A)')
end do
!Now read off the line of data I want:
read(unit=unit,fmt='(data_format)'),data1,data2,etc.
!Data is stored in some arrays.
end do
</pre>
The problem is, N(j) can be anywhere between 1 and several billion, so it takes some time to run the code.
My question is, is there a more efficient way of skipping millions of lines of data? The only way I can think of, while sticking to Fortran, is to open the file with direct access and jump to the desired line upon opening the file.
As you suggest, direct access seems like the best option. But that requires the records to all have the same length, which your headers violate. Also, why used formatted output? With a file of this length, its hard to imagine a person reading the file. If you use unformatted IO, the file will be both smaller and IO will be faster. Perhaps create two files, one with the headers (metadata) in human reader form, and the other with the data in native form. Native / binary representation means a loss of portability, which is something to consider if you want to move the files to different computer architectures or have them be useable for decades. Otherwise it's probably worth the convenience. Other options would be to use a more sophisticated file format that combines metadata and data, such as HDF5 or FITS, but for communication between two programs of one person, that's probably excessive.
Related
Please note that I'm not asking how to append texts at the end of the file. I'm asking how to prepend texts to the beginning of file.
let handle = try FileHandle(forWritingTo: someFile)
//handle.seekToEndOfFile() // This is for appending
handle.seek(toFileOffset: 0) // Me trying to seek to the beginning of file
handle.write(content)
handle.closeFile()
It seems like my content is being written at the beginning of the file, but it just replaces the existing consent as well... Thanks!
One reasonable solution is to write the new content to a temporary file, then append the existing contents to the end of the temporary file. Then move the temporary file over the old file.
When you seek to a point in an existing file and then perform a write, the existing contents are overwritten from that point. This is why your current approach fails.
In general, most file systems don't have built-in support for prepending data to files. Likewise, most file I/O APIs don't either.
In order to prepend data, you first have to shift all of the existing data further along the file to make room for the new data at the beginning. You typically do this by starting near the end, reading a chunk of data, writing that data to the original position plus the length of data you hope to eventually prepend, and then repeating with the next chunk closer to the beginning of the file. In this way, you gradually shift everything down. Only after you've done all of that can you safely write the new data at the beginning of the file safely.
Frankly, if there's any way to avoid this, you should try to. The performance is likely to be terrible if the file is large and/or you're doing it frequently.
I am trying to read a binary file with Python. This is the code I use:
fb = open(Bin_File, "r")
a = numpy.fromfile(fb, dtype=numpy.float32)
However, I get zero values at the end of the array. For example, for a case where nrows=296 and ncol=439 and as a result, len(a)=296*439, I get zero values for a[-922:]. I know these values should be noData (-9999 in this example) from a trusted piece of code in R. Does anybody know why I am getting these non-sense zeros?
P.S: I am not sure it is related on not, but len(a) is nrows*ncols+2! I have to get rid of these two using a = a[0:-2] so that when I reshape them into rows and columns using a_reshape = a.reshape(nrows, ncols) I don't get an error.
When opening a file for reading as binary you should use the mode "rb" instead of "r".
Here is some background from the docs. On linux machines you don't need the "b" but it wont hurt. On Windows machines you must use "rb" for binary files.
Also note that the two extra entries you're getting is a common bug/feature when using the "unformatted" binary output format of Fortran. Each write statement given in this mode will produce a record that is surrounded by two 4 byte blocks.
These blocks represent integers that list the number of bytes in the block of unformatted data. For example, [223] [223 bytes of data] [223].
How do I fix the Fortran runtime error: Bad integer for item 0 in list input?
Below is the Fortran program which generates a runtime error.
CHARACTER CNFILE*(*)
REAL BOX
INTEGER CNUNIT
PARAMETER ( CNUNIT = 10 )
INTEGER NN
OPEN ( UNIT = CNUNIT, FILE = CNFILE, STATUS = 'OLD' )
READ ( CNUNIT,* ) NN, BOX
The error message received from gdb is :
At line 688 of file MCNPT.f (unit = 10, file = 'LATTICE-256.txt')
Fortran runtime error: Bad integer for item 0 in list input
[Inferior 1 (process 3052) exited with code 02]
(gdb)
I am not sure what options must be specified for READ() to read to numbers from the text file. Does it matter if the two numbers on the same line are specified as either an integer or a real in the text file?
Below is the gdb execution of the program using a break point at the open call
Breakpoint 1, readcn (
cnfile=<error reading variable: Cannot access memory at address 0x7fffffffdff0>,
box=-3.37898272e+33, _cnfile=30) at MCNPT.f:686
Since you did not specify form="unformatted" on the open statement, the unit / file is opened for formatted IO. This is appropriate for a human-readable text file. ("unformatted" would be used for a non-human readable file in computer-native format, sometimes called "binary".) Therefore you should provide a format on the read, or use list-directed read, i.e., read(unit, *). To advise on a particular format we would have to know the layout of the numbers in the file. A possible read with format is: read (CNUINT, '(I4, 2X, F6.2)' ) NN, BOX
P.S. I'm answering the question in your question and not the title, which seems unrelated.
EDIT: now that you are show the text data file, a list-directed read looks easier. That is because the data doesn't line up in columns. It seems that the file has two integers on the first line, then three real numbers on each of the following lines. Most likely you need a different read for the first line. Is the code sample that you are showing us trying to read the first line, or one of the later lines? If the first line, it would seem plausible to read into two integer variables. If a later line, into two or three real variables. Two if you wish to skip the third data item on the line.
EDIT 2: the question has been substantially altered several times, which is very confusing. The first line of the text file that was shown in one version of the question contained integers, with later lines having reals. Since the listed-directed read is reading into an integer and a floating variable, it will have problems if you attempt to use it on the later lines that have two real values.
I've written several IDL programs to analyse some data. To keep it simple the programs read in some time varying data and calculate the fourier spectrum. This spectrum is written to file using this code:
openw,3,filename
printf,3,[transpose(freq),transpose(power)],format='(e,e)'
close,3
The file is then read by another program using this code:
rdfloat,filename,freq,power,/double
The rdfloat procedure can be found here: http://idlastro.gsfc.nasa.gov/
The error i get when trying to read the a file is: "Input conversion error. Unit: 101"
When i delve in to the file being read, i notice several types of unrecognised characters. I dont know if these are a result of the writing to the file or some thing else related to the number of files being created (over 300 files)
These symbols/characters are in the place of a single number:
< dle> < dc1> < dc2> < dc3> < dc4> < can> < nak> < em> < soh> < syn>
Example of what appears in the file being read, Note they are not consecutive lines.
7.7346< dle>18165493007e+01 8.4796811549010105e+00
7.7354408697119453e+01 1.04459538071< dc2>1749e+01
7.7360701595839< can>28e+01 3.0447318983094189e+00
Whenever I run the procedures that write the files, there is always at least one file that has some or all of these characters. The file/s that contains these characters is always different.
Can anyone explain what these symbols are and what I might be doing to create them as well as how to ensure they are not written to file?
I see two things that may be causing a problem. But first, I want to suggest a few tips.
When you open a file, it is useful to use the /GET_LUN keyword because it allows IDL to find and use a logical unit number (LUN) that is available (e.g., in case you left LUN 3 open somewhere else). When you print formatted data, you should specify the total width and number of decimal places. It will make things easier because then you need not worry about changing spacings between numbers in a file.
So I would change your first set of code to the following (or some variant of the following):
OPENW,gunit,filename[0],/GET_LUN,ERROR=err
FOR j=0L, N_ELEMENTS(freq) - 1L DO BEGIN
PRINTF,gunit,freq[j],power[j],FORMAT='(2e20.12)'
ENDFOR
FREE_LUN,gunit ;; this is better than using the CLOSE routine
So the first potential issue I see is that if your variable power was calculated using something like FFT.pro, then it will be a complex float or complex double, depending on the input and keywords used.
The second potential issue may be due to an incorrect format statement. You did not tell PRINTF how many columns or rows to expect. It might not know how to handle the input properly, so it guesses and may result in those characters you show. Those characters may be spacing characters due to the vague format statement or the software you are using to look at the files (e.g., I would not recommend using Word to open text files, use a text editor).
Side Note: You can open and read the file you just wrote in a similar fashion to what I showed above, but changed to the following:
n = FILE_LINES(filename[0])
freq = DBLARR(n)
power = DBLARR(n)
OPENR,gunit,filename[0],/GET_LUN,ERROR=err
FOR j=0L, N_ELEMENTS(freq) - 1L DO BEGIN
READF,gunit,freq[j],power[j],FORMAT='(2e20.12)'
ENDFOR
FREE_LUN,gunit ;; this is better than using the CLOSE routine
I am looking for ways to read in a PDF file with SAS. Apparently this is not basic functionality and there is very little to be found on the internet. (Let alone that google is not easy with PDF in you search giving you also links to PDF documents that go about other things.)
The only things that can be found, are people looking for ways to import data into datasets from a PDF. For me, that is not even necesarry. I would like to be able to read the contents of the PDF file in one big character variable. If possible, it would even be better to be able to read in the file's binary data.
Is this possible with SAS and how? (I got it to work in Access VBA, but can't find any similar ways in SAS.)
(In the end, the purpose is to convert this to base64 and put that base64-string into an XML document.)
You probably will not be able to read the entire file into one character variable since the maximum size of a character variable is around 33 KB. A simple way to read in one line at a time, though, is something like the following:
%let pdfFileName = Test.pdf;
%let lineSize = 2000;
data base;
format text_line $&lineSize..;
infile "&pdfFileName" lrecl=&lineSize;
input text_line $;
run;
This requires that you have a general idea of the maximum record length ahead of time, but you could write additional code to determine the maximum record size prior to reading in the file. In this example each line of text is read into one character variable named "text_line." From there, you could use a RETAIN statement or double trailers (##) in the INPUT line to process multiple lines at a time. The SAS web-site has plenty of documentation on how to read and process text from various types of input files.