Max file size for File.ReadAllLines - vb.net

I need to read and process a text file. My processing would be easier if I could use the File.ReadAllLines method but I'm not sure what is the maximum size of the file that could be read with this method without reading by chunks.
I understand that the file size depends on the computer memory. But are still there any recommendations for an average machine?

On a 32-bit operating system, you'll get at most a contiguous chunk of memory around 550 Megabytes, allowing loading a file of half that size. That goes down hill quickly after your program has been running for a while and the virtual memory address space gets fragmented. 100 Megabytes is about all you can hope for.
This is not an issue on a 64-bit operating system.
Since reading a text file one line at a time is just as fast as reading all lines, this should never be a real problem.

I've done stuff like this with 1-2GB before, albeit in Python. I do not think .NET would have a problem, though. But I would only do this for one-off processing.
If you are doing this on a regular basis, you might want to go through the file line by line.

Its bad design unless you know the files sizes vs the computer memory that would be avaiable in the running app.
A better solution would be consider memory mapped files. They use themselvses as page fil storage,

Related

Pentaho text file input step crashing (out of memory)

I am using Pentaho for reading a very large file. 11GB.
The process is sometime crashing with out of memory exception, and sometimes it will just say process killed.
I am running the job on a machine with 12GB, and giving the process 8 GB.
Is there a way to run the Text File Input step with some configuration to use less memory? maybe use the disk more?
Thanks!
Open up spoon.sh/bat or pan/kettle .sh or .bat and change the -Xmx figure. Search for JAVAMAXMEM Even though you have spare memory unless java is allowed to use it it wont work. although to be fair in your example above i can't really see why/how it would be consuming much memory anyway!

Accesing files which are currently being written

If a file is in a writing process, and at this time if I try to access it like if it is a log file which is being written every 10 milliseconds and I`m trying to access it will I damage or disturb the writing process?
Specifically I'm asking about video files, like if I start a recording process (using Windows Media Encoder) and at this time I would like to monitor the file if it is a blank file (black pixels everywhere) or there is a real content being recorded.
Sorry if my question is a newbie one, but I really really need to be sure about that.
Best on advance
In general you can certainly read files as they are being written, without corrupting their content. However:
It is possible to face an issue if your recording medium cannot deal with the combined data-rate or of both reading and writing. This can be a problem especially with slow-ish USB flash drives.
It is possible to face an issue on hard drives too, if the combination of reading and writing exceeds the rate of random seeks that the hard drive can handle. This can happen more easily on older drives (e.g. IDE) when dealing with HD video.
The end result is that if you have a real-time writer process, such as a TV recorder, it may be forced to drop some of the data - in the case of video a few frames.
Modern systems have quite fast disk subsystems, reasonably good I/O schedulers and large enough RAM capacities to allow for extensive data caching, which makes it quite unlikely that a single writer/reader combination would saturate the disk subsystem, unless you are doing something unusual like recording several video streams at once.
Keep in mind however, that:
The disk subsystem can also be saturated by unrelated processes reading/writing other files from the same drive.
If you are encoding video, you might also lose frames if something draws enough CPU resources that the encoding process is no longer able to keep-up with the real-time requirements. Depending on the video file, test-playing it might be just enough to do that - at least HD reproduction can be quite demanding. So, watch your CPU load and experiment before relying on it to record your favourite show :-)
EDIT:
If you are among the lucky ones that have SSD drives, seeks and data rate should normally be a non-issue. That leaves the CPU - you'd be surprised how easy it is to push it to the limit.
Above all, you should experiment to find out the limits of your system for each particular application. That way you won't have any nasty surprises...

Copying files to an encrypted USB drive leads to major fillup

I have a Kingston Datatraveller Vault privacy edition with 8 GIG of space and I use a home-made program to copy images to it. Thing is, the amount of space needed for the images cab to be high as 3 times the actual size of the file while it's usually almost on par on a PC. It doesn't seem to affect other types of file though.
Does anyone knows why ?
Just so if someone has the same problem, I did find out the answer. It's because that particular key use 64K blocks instead of 4K and when you have a lot of small images, this is what happen.

Reading a whole file on a network drive the fast way (Windows, C/C++, C#, ...)

Lately I've been having problems reading big files on a network drive and I just can't pinpoint what I may be doing wrong. I tried both in C++ (Unmanaged) and in C# and had about the same performances on both...which were somewhat abysmal.
Sometimes it will read at 4 KB/s a file on the network, but if this file is located on the local HD it will achieve easily the maximum data rate the HD can output. That is with reading 64 KB chunks at a time... I tried with bigger buffers up to insane numbers, or smaller and it doesn't make much differences.
I tried async IO in C# with BeginRead on the FileStream and OVERLAPPED IO in C++ as well as synchronous reads and they all had the same problems, which is being slow on the network.
The only solution we came up with is to copy the file using the OS CopyFile function on the local HD before actually reading the file but I'm not too satisfied with this approach. It just seems like CopyFile is doing something we are not that makes it incredibly faster than our approach.
Anyone has a clue as to why this is?
We would have to guess, since you aren't showing us your code. So my guess is that Windows file copy is opening the file with the FILE_FLAG_SEQUENTIAL_SCAN flag which in turn causes the file system/cache to choose optimal block sizes and submit read requests in anticipation of read calls that havn't been submitted yet.
We only can assume that you have been trying really all possible methods of reading/writing. Have you been reading synchronously or asynchronously? Did you try I/O completion ports? Or ReadFileEx() function? I would guess that the Windows CopyFile() function detects that you want to read a file from network and will use different method for reading then it would use for disk access.
If you have really exhausted all possible reading methods, and if you really need thing to be solved, then I would suggest to check out a bit on what is the CopyFile() function doing. There are numerous tools for doing that. E.g.: this one (or some other -- links on the same page).

Keeping a file in the OS block buffer

I need to keep as much as I can of large file in the operating system block cache even though it's bigger than I can fit in ram, and I'm continously reading another very very large file. ATM I'll remove large chunk of large important file from system cache when I stream read form another file.
In a POSIX system like Linux or Solaris, try using posix_fadvise.
On the streaming file, do something like this:
posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);
while( bytes > 0 ) {
bytes = pread(fd, buffer, 64 * 1024, current_pos);
current_pos += 64 * 1024;
posix_fadvise(fd, 0, current_pos, POSIX_FADV_DONTNEED);
}
And you can apply POSIX_FADV_WILLNEED to your other file, which should raise its memory priority.
Now, I know that Windows Vista and Server 2008 can also do nifty tricks with memory priorities. Probably older versions like XP can do more basic tricks as well. But I don't know the functions off the top of my head and don't have time to look them up.
Within linux, you can mount a filesystem as the type tmpfs, which uses available swap memory as backing if needed. You should be able to create a filesystem greater than your memory size and it will prioritize the contents of that filesystem in the system cache.
mount -t tmpfs none /mnt/point
See: http://lxr.linux.no/linux/Documentation/filesystems/tmpfs.txt
You may also benefit from the files swapiness and drop_cache within /proc/sys/vm
If you're using Windows, consider opening the file you're scanning through with the flag
FILE_FLAG_SEQUENTIAL_SCAN
You could also use
FILE_FLAG_NO_BUFFERING
for that file, but it imposes some restrictions on your read size and buffer alignment.
Some operating systems have ramdisks that you can use to set aside a segment of ram for storage and then mounting it as a file system.
What I don't understand, though, is why you want to keep the operating system from caching the file. Your full question doesn't really make sense to me.
Buy more ram (it's relatively cheap!) or let the OS do its thing. I think you'll find that circumventing the OS is going to be more trouble than it's worth. The OS will cache as much of the file as needed, until yours or any other applications needs memory.
I guess you could minimize the number of processes, but it's probably quicker to buy more memory.
mlock() and mlockall() respectively lock part or all of the calling process’s virtual address space into RAM, preventing that memory from being paged to the swap area.
(copied from the MLOCK(2) Linux man page)