Say I have a 5 GB file. I want to split it in the following way.
First 100 MB is on the file
The rest go some reserve file
I do not want to use readalllines kind of function because it's too slow for large files.
I do not want to read the whole file to the memory. I want the program to handle only a medium chunk of data at a time.
You may use BinaryReader class and its method to read file in chunks.
Dim chunk() As Byte
chunk = br.ReadBytes(1024)
Related
I haven't used openvms for 20+ years. It was my 1st OS. I've been asked if it possible to copy the data from RMS files from openvms server to windows as a text file - so that it's readable.
No-one has experience or knowledge of the record structures etc.
The files are xyz.DAT and are relative files. I'm hoping the dat files are fixed length.
My 1st attempt would be to try and use Datatrieve (DTR) but get an error that the image isn't loaded.
Thought it might be as easy using CONVERT/FDL = nnnn.FDL - by changing the Relative to Sequential. The file seems still to be unreadable.
Is there an easy way to stream an RMS index file to a flat ASCII file?
I use to use COBOL and C to access the data in the past but had lots of libraries to help....
I've notice some solution may use odbc to connect but not sure what I can or cannot install on the server.
I can FTP using Filezilla to the server....
Another plan writing C application to read a file and output out as string.....or DCL too.....doesn't have to be quick...
Any ideas
Has mentioned before
The simple solution MIGHT be to to just use: $ TYPE/OUT=test.TXT test.DAT.
This will handle Relatie and Indexed files alike.
It is much the same as $ CONVERT / FDL=NL: test.DAT test.TXT
Both will just read records from the source and transfer the bytes, byte for byte, to the records in a sequential file.
FTP in ASCII mode will transfer that nicely to windows.
You can also use an 'inline' FDL file to generate a 'unix' LF file like:
$ conv /fdl="record; format stream_lf" test.DAT test.TXT
Or CR-LF file using:
$ conv /fdl="record; format stream" test.DAT test.TXT
Both can be transferring in Binary or Ascii with FTP.
MOSTLY - because this really only works well for TEXT ONLY source .DAT file.
There should be no CR, LF, FF or NUL characters in the source or things will break.
As 'habo' points out, use DUMP /RECORD=COUNT=3 to see how 'readable' the source data is.
If you spot 'binary' data using DUMP then you will need to find a record defintion somewhere which maps byte to Integers or Floating points or Dates as needed.
These defintions can be COBOL LIB files, or BASIC MAPS and are often stores IN the CDD (Common Data Dictionary) or indeed in DATATRIEVE .DIC DICTIONARIES
To use such definition you likely need a program to just read following the 'map' and write/print as text. Normally that's not too hard - notably not when you can find an example program on the server to tweak.
If it is just one or two 'suspect' byte ranges, then you can create a DCL loop to read and write and use F$EXTRACT to select the chunks you like.
If you want further help, kindly describe in words what kind of data is expected and perhaps provide the output from DUMP for 3 or 5 rows.
Good luck!
Hein.
I have been using pandas pd.read_sql_query to read a decent chunk of data into memory each day in order to process it (add columns, calculations, etc to about 1GB of data). This has cause my computer to freeze a few times though so today I tried using psql to create a .csv file. I then zipped that file (.xz) and read it with pandas.
Overall, it was a lot smoother and it made me think about automating the process. Is it possible to replace saving a .csv.xz file and instead copying the data directly to memory while still compressing it (ideally)?
buf = StringIO()
from_curs = from_conn.cursor()
from_curs.copy_expert("COPY table where row_date = '2016-10-17' TO STDOUT WITH CSV HEADER", buf)
(is it possible to compress this?)
buf.seek(0)
(read the buf with pandas to process it)
I have 500,000 files which I need to read in Fortran and each file has ~14,000 entries in it (each entry is only about 100 characters long). I need to process each line for each file at a time. For example, I need to process line 1 for all 500,000 files before moving on to line 2 from the files and so forth.
I cannot open them all at once (I tried making an array of file pointers and opening them all) because there will be too many files open at once. Instead, I would like to do something as follows:
do iline = 1,Nlines
do ifile = 1,Nfiles
! open the file
! read a line
! close the file
enddo
end
In hopes that this would allow me to read one line at a time (from each file) and then move on to the next line (in each file). Unfortunately, each time I open the file it starts me off at line 1 again. Is there any way to open/close a file and then open it again where you left off previously?
Thanks
Unfortunately it is not possible in this way in standard Fortran. Even If you specify
position="ASIS"
the actual position will be unspecified for a not already connected unit and will be in fact the beginning of the file on most systems.
That means You have to use
read(*,*)
enough times to get on the right place in the file.
You could also use stream access. The file would be again opened at the beginning, but you can use
read(u,*,pos=n) number
where n is the position saved from the previous open. You can get the position from
inquire(unit=u, pos=n)
n = n
You would open the file with acess="STREAM".
Also 500000 opened files is indeed too much. There are ways how to inquire for the system limits and how to control them, but also your compiler may have some limits http://www.cyberciti.biz/faq/linux-increase-the-maximum-number-of-open-files/
Other solution: Couldn't you store the content of the files in memory? Today couple of Gigabytes is OK, but it may be not enough for you.
You can try using fseek and ftell in something like the following.
! initialize an array of 0's
do iline = 1,Nlines
do ifile = 1,Nfiles
! open the file
! fseek(fd, array(ifile))
! read a line
! array(ifile)=ftell(fd)
! close the file
enddo
end
The (untested) idea is to store the offset of each file in an array and position the cursor at that place upon opening the file. Then, once a line is read, the ftell retrieves the current position which is saved to memory for next round. If all entries have the same length, you can spare the array and just store one value.
If the files have fixed, i.e., constant, record lengths, you could use direct access. Then you could "directly" read a specific record. A big "if" however.
the overhead of all the file opening/closing will be a big performance bottleneck.
You should try to read as much as you can for each open operation given whatever memory you have:
pseudocode:
loop until done:
loop over all files:
open
fseek !as in damiens answer
read N lines into array ! N=100 eg.
save ftell value for file
close
end file loop
loop over N output files:
open
write array data
close
So I've been using the DotNetZip Library for some time now, and it works pretty well, up until yesterday when I maxed out the zipfile size. On any given day, I need to zip PDFs and transfer them to an SFTP site, that only accepts zip files. The amount of PDFs range from a couple hundred, a couple thousand to well over 10K. I had about 24K PDFs yesterday when the DotNetZip process broke. There is a way to split the zipfiles using the DotNetZip library but for some reason, the system that is being used on the SFTP server cant handle zipfiles that are split.
What's the best way to grab say 5K (or any other set amount of files), zip, delete those files and grab another 5K, zip, delete and repeat the process until all files are zipped?
Here is my current code of the zip process...
Dim PathToPDFs As String = "C:\Temp" 'PDF LOCATION
Using Zip As ZipFile = New ZipFile()
Zip.AddSelectedFiles("(name = *.pdf)", PathToPDFs, "", True)
Zip.CompressionMethod = CompressionMethod.Deflate
Zip.CompressionLevel = Ionic.Zlib.CompressionLevel.BestCompression
Zip.Save("C:\Temp\Zipfile.zip")
End Using
Try enumerating through all files first, getting a list of FileInfo, then going through them in a loop, and creating ZIP files every 5K (or whichever your batch size is). You don't need to delete anything, just keep a batch id in memory, so your zip file names would derive from that (i.e. pdf_batch_01.zip).
So when your batch size is reached you would do Save and create a new ZipFile, and keep adding files in the loop. Don't forget to also "commit" at last file (last batch would most likely be incomplete). To sum up, you "commit" when batch size is reached OR processing last entry (a varitation of i=FileCount-1).
So a .wav file has a few standard chunks. In most of the files I work with, the "RIFF" chunk is first, then a "fmt " chunk, then the "DATA" chunk. When recording using AVAudioRecorder, those chunks are created (although an extra "FLLR" is created before the "DATA" chunk.)
When creating a file with AudioQueue, those standard chunks aren't created. Instead, AudioQueue creates, in order, "caff","desc","lpcm","free", and "data" chunks.
What's going on? Aren't the "RIFF" and "fmt " chunks required? How does one force the inclusion of those chunks?
I'm creating a file by:
AudioFileCreateWithURL(URL, kAudioFileCAFType, &inputDataFormat, kAudioFileFlags_EraseFile, &AudioFile);
with inputDataFormat being a AudioStreamBasicDescription with a full complement of properties.
So how does one write, at least, the "RIFF" and "fmt " chunks with AudioQueue?
Thanks.
So a .wav file has a few standard chunks. …
When creating a file with AudioQueue, those standard chunks aren't created. …
⋮
I'm creating a file by:
AudioFileCreateWithURL(URL, kAudioFileCAFType, &inputDataFormat, kAudioFileFlags_EraseFile, &AudioFile);
Let this be an example of the value of showing one's code in one's question. :-)
kAudioFileCAFType is a Core Audio File, not a WAV file. Try kAudioFileWAVEType instead.