I need to keep as much as I can of a large file in the operating system block cache, even though it's bigger than I can fit in RAM, and I'm continuously reading another very, very large file. At the moment, streaming reads from the second file evict large chunks of the large, important file from the system cache.
In a POSIX system like Linux or Solaris, try using posix_fadvise.
On the streaming file, do something like this:
#include <fcntl.h>
#include <unistd.h>
char buffer[64 * 1024];
off_t current_pos = 0;
ssize_t bytes = 1;
posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);
while (bytes > 0) {
    bytes = pread(fd, buffer, sizeof(buffer), current_pos);
    if (bytes > 0) current_pos += bytes;
    /* drop the pages we have already consumed from the page cache */
    posix_fadvise(fd, 0, current_pos, POSIX_FADV_DONTNEED);
}
And you can apply POSIX_FADV_WILLNEED to your other file, which should raise its memory priority.
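For example, something like this (a minimal sketch; important_fd is just a placeholder for the descriptor of the file you want to keep cached):
/* Hint that the whole file will be needed soon; the kernel may start
   readahead for it, which helps keep it resident while the other file streams. */
posix_fadvise(important_fd, 0, 0, POSIX_FADV_WILLNEED);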
Now, I know that Windows Vista and Server 2008 can also do nifty tricks with memory priorities. Probably older versions like XP can do more basic tricks as well. But I don't know the functions off the top of my head and don't have time to look them up.
Within Linux, you can mount a filesystem of type tmpfs, which uses available swap as backing if needed. You should be able to create a filesystem larger than your memory size, and the contents of that filesystem will be prioritized in the system cache.
mount -t tmpfs none /mnt/point
See: http://lxr.linux.no/linux/Documentation/filesystems/tmpfs.txt
You may also benefit from tuning swappiness and drop_caches under /proc/sys/vm.
If you're using Windows, consider opening the file you're scanning through with the flag
FILE_FLAG_SEQUENTIAL_SCAN
You could also use
FILE_FLAG_NO_BUFFERING
for that file, but it imposes some restrictions on your read size and buffer alignment.
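For example, something like this (a rough sketch; the file name is just a placeholder and error handling is omitted):
#include <windows.h>

/* Open the file being streamed with a sequential-access hint so the cache
   manager can discard pages behind the read position more aggressively. */
HANDLE h = CreateFileA("stream.dat", GENERIC_READ, FILE_SHARE_READ, NULL,
                       OPEN_EXISTING,
                       FILE_ATTRIBUTE_NORMAL | FILE_FLAG_SEQUENTIAL_SCAN,
                       NULL);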
Some operating systems have ramdisks that you can use to set aside a segment of RAM for storage and then mount it as a file system.
What I don't understand, though, is why you want to keep the operating system from caching the file. Your full question doesn't really make sense to me.
Buy more RAM (it's relatively cheap!) or let the OS do its thing. I think you'll find that circumventing the OS is going to be more trouble than it's worth. The OS will cache as much of the file as it can, until your application or any other application needs the memory.
I guess you could minimize the number of processes, but it's probably quicker to buy more memory.
mlock() and mlockall() respectively lock part or all of the calling process’s virtual address space into RAM, preventing that memory from being paged to the swap area.
(copied from the MLOCK(2) Linux man page)
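A minimal sketch of how that could be applied here, assuming the region you lock fits in RAM and within your RLIMIT_MEMLOCK limit (the path is just a placeholder):
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

int fd = open("/data/important.bin", O_RDONLY);   /* hypothetical path */
struct stat st;
fstat(fd, &st);
void *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
if (p != MAP_FAILED)
    mlock(p, st.st_size);   /* pin the mapped pages so they are not paged out */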
SUMMARY: Cannot copy more than 32GB of files to a 128GB memory stick formatted under FAT32 or exFAT despite the fact that I can format the stick and ChkDsk is showing the correct results after formatting (and also when less than 32GB of files are on the stick). I cannot use NTFS because this stick is designed to transfer files to an iPhone and the app will not handle NTFS. See below for details.
DETAILS:
I have a 128GB memory stick which is designed to quickly transfer files between a computer and an iPhone. One end is a USB and the other plugs into the iPhone's lightning port. This particular type is extremely common and looks like a "T" when you unfold it (Amazon link: https://www.amazon.com/gp/product/B07SB12JHG ).
While this stick is not especially fast when I copy Windows data to it, the transfer rate to my iPhone is much better than the wireless alternatives.
Normally I'd format a large memory stick or USB drive in NTFS, but the app used to transfer files to my iPhone ("CooDisk") will only handle exFAT and FAT32. I've tried both. For exFAT formatting, I've tried both Windows 7 and 10, and for FAT32 I used a free product from RidgeCrop consulting (I can give you the link if you want).
As with all USB storage devices, my stick is formatted as a single active partition.
I do not have a problem formatting. After formatting, ChkDsk seems happy with both FAT32 and exFAT. The CooDisk app works fine with either. After formatting, all the space is ostensibly available for files.
My problem arises when populating the stick with files.
Whenever I get beyond 32GB in total space, I have various problems. Either the copy will fail, or ChkDsk will fail. (After running ChkDsk in 'fix' mode, every file created beyond the 32GB limit will be clobbered.) Interestingly, when I use the DOS copy command with "/v" (verify) it will flag an error for files beyond the 32GB limit, although DOS XCopy with "/v" keeps on going. GUI methods also die at 32GB.
Out of sheer desperation, I wrote a script that uses GNU's cp for Windows. Now I can copy more than 32GB of files and ChkDsk flags no errors. However files beyond the 32GB limit end up being filled with binary zeros despite the fact that they appear as they should in a directory or Windows file explorer listing. (Weird, isn't it?)
I have also tried various allocation unit sizes from 4K all the way up to 64K and attempted this with three different Windows OSs (XP, Win7, and Win10).
Let me emphasize: there is no problem with the first 32GB of files copied to the stick regardless of: whether I use exFAT or FAT32; my method of copying; and my choice of AU size.
Finally, there is nothing in these directories that would bother a FAT32 or exFAT system: (a) file and directory names are short (well under 100 characters); (b) directory nesting is minimal (no more than 5 levels); (c) files are small (nothing close to a GB); and (d) directories have relatively few files (nowhere close to 200, for those of you who recall the old FAT limit of 512 files per directory :)
The only platform I haven't yet tried is using an aging MacBook that someone gave to me. I'm not terribly good with Macs, but I would rather not be dependent on it (it's 13 years old, although MacBooks are built like tanks).
Also, is it possible that FAT32 and exFAT don't allow more than 32GB on an active partition? I can find no such limitation documented anywhere; in fact, in my experience USB storage devices are always bootable, as was the original version of my stick.
Any ideas??
I am working on a REST API that has an endpoint to download a file that could be > 2 GB in size. I have read that Java's FileChannel.transferTo(...) will use zero-copy if the OS supports it. During development, my server runs on localhost on my MacBook Pro with OS X 10.11.6.
I compared the following two methods of writing the file to the response stream:
Copying a fixed number of bytes from FileChannel to WritableByteChannel using transferTo
Reading a fixed number of bytes from FileInputStream into a byte array (size 4096) and writing to OutputStream in a loop.
The time taken for a 5.2GB file is between 20 and 23 seconds with both methods. I tried transferTo with the fixed number of bytes per transfer set to the following values: 4KB (i.e. 4 * 1024), 1MB, and 50MB. The time taken to write is in the same range in all 3 cases.
Time taken is measured from before entering the while-loop to after exiting the while-loop, in which bytes are read from the file. This is all on the server side. The network hop time does not figure into this.
Any ideas on what the reason could be? I am quite sure MacOS 10.11.6 should support zero-copy (i.e. sendfile system call).
EDIT (6/18/2018):
I found the following blog post from 2015, saying that sendfile on MacOS X is broken. Could it be that this problem still exists?
https://blog.phusion.nl/2015/06/04/the-brokenness-of-the-sendfile-system-call/
The (high) transfer rate that you are quoting is likely close to or at the limit of what a SATA device can do anyway. If my guess is right, you will not see a performance gain reflected in the time it takes to run your test - however there will likely be a change in the CPU load during the test. Given that you have a relatively powerful machine, your CPU and memory are fast enough. Any method (zero-copy or not) will work at the same speed - which is the speed of your disk. However, zero-copy will cause a lot less CPU load and will not grab unnecessary bandwidth from your memory, either. Therefore, you should test different methods and see which one ends up using the least amount of CPU and choose that method for your application.
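For reference, here is a rough sketch of what a zero-copy transfer looks like at the system-call level on macOS, using the sendfile(2) call mentioned in the question (file_fd and sock_fd are placeholders; this is only an illustration, not the JDK's actual code path):
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* file_fd: an open file descriptor; sock_fd: a connected socket (both placeholders). */
off_t len = 0;    /* 0 asks the kernel to send until end of file */
if (sendfile(file_fd, sock_fd, 0, &len, NULL, 0) == 0) {
    /* len now holds the number of bytes actually sent; the data was moved by
       the kernel without being copied through user-space buffers. */
}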
I have created a Debian VM inside VirtualBox with an encrypted partition, so the OS is running on an encrypted partition. Although I chose dynamic allocation when creating the disk image (VHD), after the OS installation the image appears to consume the entire allocated space. The image size is now 20GB. Is it possible to compress or compact it to a smaller size? I have seen the documentation on compacting a disk image in VirtualBox, but I need to know whether the same can be done for an encrypted disk image.
Your help is greatly appreciated.
Thanks.
It depends on the type of encryption being used. Since you're using Debian I assume you're using LUKS, which is inflexible. The space has to be pre-allocated and therefore the image will utilise the full space allocated to it.
Yes, there is a way to do it, but it is quite complex.
Each time you want to compact it, you need to follow some steps carefully.
(Maybe this is not really needed; try first without it.) Zero out all free space inside the 'clear' (decrypted) mounted partitions. The free space is then zeroed from the 'clear' point of view, but it will not be zeroed from the 'encrypted' point of view, since the encryption layer encrypts those zeros.
Shut down the machine and boot with a LiveCD ISO that lets you mount both the virtual HDD you are using and a new, empty, dynamic one.
Set up the partition scheme and encryption identically on the new one, but make sure the encryption step does not do the 'fill' part, so it does not write to every sector. This is the most important part: this way the new virtual disk stays small in size, but is still encrypted with your LUKS setup, etc.
At this point only the partition scheme and encryption exist on the new 'small' disk. Now mount both encrypted disks, the old and the new, so both can be seen in 'plain' (decrypted) form at the same time.
Again, this is very important: clone from the old 'plain' view to the new 'plain' view only the sectors that contain data (most partition-cloning tools do that).
As I said, the most important things (to get a smaller virtual HDD) are:
Create a new, empty, dynamic virtual disk.
Partition it and encrypt it without writing all sectors; skip the dd pass with random data before the encryption (or the dynamic file will grow to its maximum size), and also skip filling the empty space, which again would grow the virtual disk to its maximum.
Clone the partitions from the plain view (mounted and decrypted on the fly), so the cloning tool only writes the data areas of files, etc., and not the free space.
There is one small part that cannot be reduced: files inside the encrypted partitions whose clusters are entirely filled with zeros (hopefully you do not have any of those). The reason is that without encryption such a cluster is all zeros, so a normal compact sees that it does not need to store it; with encryption, that cluster is not all zeros inside the actual virtual disk file, so the compact method cannot reduce it.
The idea behind all of this is:
When encryption is in use, to get the smallest virtual disk size, start with an empty dynamic disk and write as few clusters as possible to it when cloning the previous one.
As said, it is a lot of work, and over time, as writes occur, the image will start growing and growing again.
My best personal recommendation is to get a big, fast disk and use a fixed-size virtual disk. If I read correctly and your disk is only 20GiB, you gain a lot of speed by having it fixed rather than dynamic, and you will not have to worry about 'fragmentation', etc.
Remember, if you use a USB drive for it, get one that can write at 30MiB/s (if you only have USB 2.0 ports). If you are lucky (like me) and have at least one USB 3 port (better still a USB 3.1 Gen 2 Type-C), look for a 2.5-inch 500GiB SATA III HDD (with a write speed greater than 100MiB/s; it is really cheap, under 25 euros) and a SATA III to USB 3.1 Gen 2 Type-C enclosure (also cheap, some are under 15 euros), and avoid having to 'reduce', 'clone', etc.
I have 10 virtual machines on a 500GiB drive (with more than 50% free space), each 20GiB in size (with the Windows system inside taking nearly 16GiB) and VeraCrypt encryption, so my case is quite close to yours. I opted for a USB 3.1 Gen 2 Type-C enclosure to hold all the fixed-size VDI files; my experience is that encrypted fixed-size images fly compared with non-encrypted dynamic ones.
Of course, make sure you run the necessary tests when encryption is involved: test the virtual HDD speed with no encryption, then benchmark the encryption algorithms in RAM, and choose a method that is faster than about 1.6 times the speed of the disk, so encryption will not be a bottleneck; otherwise you can end up with really bad speeds caused by the encryption.
Also think about this: how many cores do you expose to the guest? That makes a big difference to encryption speed. But also consider the worst case: how much CPU will the non-encryption threads in that guest use?
Just as an example: if inside the guest you are doing LZMA2 compression (or H.264 video transcoding, for example), the CPU left for encryption is very low, so encryption will slow things down a lot. Such workloads also do a lot of disk I/O, so a lot of data has to be encrypted/decrypted per second.
Maybe a better approach would be to encrypt the 'container', not the 'system': in other words, encrypt the place where the VDI files are stored, not the whole guest system. Create one container per VDI if you want different passphrases, etc. That way the VDI can also be dynamic and be compacted, etc.
Of course, I could be of more help if you said which encryption scheme (without details) you are using.
This makes a really big difference to the possible answers:
Are you encrypting the system partition with a tool that runs inside the guest? Then use the 'clone only the used clusters' trick.
Are you instead turning on the VDI encryption property? Then maybe the VirtualBox console can help compact them.
Are you encrypting the container where the VDI is stored? I am quite sure this is not your case, since in that case compacting can be done as normal: the VDI itself is not encrypted at all, nor is anything inside it.
I talk about VDI, but the same applies to the other formats: VHD, VHDX, etc.
Remember: if encryption is done inside the guest and you still want to reduce (compact) the virtual HDD file, start with a new dynamic one, set up the partition scheme and encryption without filling the whole disk (at this point the virtual disk file should not be large, just a few megabytes), then clone from the old disk to the new one all the used clusters, but not the unused ones.
Advice: be prepared to repeat the 'compact by cloning' every hundred hours or so of intense guest use. If the gain is less than 50%, it does not repay the effort; then the best you can do is use a fixed size.
Special note: with a fixed size, access speed is much higher than with a dynamic size. Having a dynamic disk grown to 100% of its size, as if it were fixed, is a big loss in speed. How much? You have to test on your own machine; it depends a lot on the CPU, the I/O rate (input/output operations per second) of your storage, the transfer speed (MiB/s), and other factors, so it is best to run some tests.
Since you are talking about 20GiB, it is better to test a fixed size; I am quite sure you will like it a lot.
It would be a different matter if we were talking about a 500GiB system partition with only 10% used: since the space gain could be 450GiB, the clone method to compact it would be well worth it. That is why I explain how to do it, for such people, for you, and for anyone.
P.S.: If someone does not know how to do something, that does not mean it is not possible; and if someone says something is not possible, they had better be able to demonstrate it or be prepared to be proven wrong. Technology improves over time, and knowledge even more so.
Let's say I have a 10 MB file and go through these steps:
Open it in my favorite programming language for Read/Write
Erase everything in the stream
Write exactly 10 MB of random back to the same stream
Save the changes to disk
Delete the file through normal means
Can I be certain that the new 10 MB successfully overwrote the old 10 MB on a sector level in the hard drive? Or is it possible that the "erase everything in the stream" step deletes the old file and potentially writes the new 10 MB in a new location?
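For concreteness, a minimal sketch of those steps on a POSIX system (the file name is just a placeholder) could look like this:
#include <fcntl.h>
#include <unistd.h>
#include <stdlib.h>

int main(void) {
    int fd = open("secret.dat", O_RDWR);            /* step 1: open read/write (hypothetical name) */
    ftruncate(fd, 0);                               /* step 2: erase everything in the stream */
    unsigned char buf[4096];
    for (off_t pos = 0; pos < 10 * 1024 * 1024; pos += sizeof(buf)) {
        for (size_t i = 0; i < sizeof(buf); i++)
            buf[i] = rand() & 0xFF;                 /* step 3: 10 MB of random data */
        pwrite(fd, buf, sizeof(buf), pos);          /* write back at the same file offsets */
    }
    fsync(fd);                                      /* step 4: flush the changes to disk */
    close(fd);
    unlink("secret.dat");                           /* step 5: delete through normal means */
    return 0;
}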
The data may still be accessible by a professional who knows what they're doing and can access the raw data on the disk (i.e. without going through the filesystem).
Your program is basically equivalent to the Linux shred command, which contains the following warning:
CAUTION: Note that shred relies on a very important assumption: that the file system overwrites data in place. This is the traditional way to do things, but many modern file system designs do not satisfy this assumption. The following are examples of file systems on which shred is not effective, or is not guaranteed to be effective in all file system modes:
log-structured or journaled file systems, such as those supplied with AIX and Solaris (and JFS, ReiserFS, XFS, Ext3, etc.)
file systems that write redundant data and carry on even if some writes fail, such as RAID-based file systems
file systems that make snapshots, such as Network Appliance's NFS server
file systems that cache in temporary locations, such as NFS version 3 clients
compressed file systems
There are other situations as well, such as SSDs with wear leveling.
No. Since commits are atomic on any modern file system, you can be almost 100% certain that the new 10 MB did not overwrite the old 10 MB, and that's before we consider journaling file systems, which actually guarantee this.
Short answer: No.
This might depend on your language and OS. I have a feeling that the stream calls are passed to the OS and the OS then decides what to do, so I'd lean towards your second question being correct just to err on the safe side. Furthermore, magnetic artifacts will be present after a deletion which can still be used to recover said data. Even overwriting the same sectors with all zeros could leave behind the data in a faded state. Generally it is recommended to make several deletion passes. See here for an explanation or here for an open source C# file shredder.
For Windows you could use the SDelete command line utility which implements the Department of Defense clearing and sanitizing standard:
Secure delete applications overwrite a deleted file's on-disk data using techniques that are shown to make disk data unrecoverable, even using recovery technology that can read patterns in magnetic media that reveal weakly deleted files.
Of particular note:
Compressed, encrypted and sparse files are managed by NTFS in 16-cluster blocks. If a program writes to an existing portion of such a file, NTFS allocates new space on the disk to store the new data and, after the new data has been written, deallocates the clusters previously occupied by the file.
I need to read and process a text file. My processing would be easier if I could use the File.ReadAllLines method, but I'm not sure what the maximum file size is that can be read with this method without reading in chunks.
I understand that the limit depends on the computer's memory. But are there still any recommendations for an average machine?
On a 32-bit operating system, you'll get at most a contiguous chunk of memory of around 550 megabytes, allowing you to load a file of about half that size. That goes downhill quickly after your program has been running for a while and the virtual memory address space gets fragmented; 100 megabytes is about all you can hope for.
This is not an issue on a 64-bit operating system.
Since reading a text file one line at a time is just as fast as reading all lines, this should never be a real problem.
I've done stuff like this with 1-2GB before, albeit in Python. I do not think .NET would have a problem, though. But I would only do this for one-off processing.
If you are doing this on a regular basis, you might want to go through the file line by line.
It's bad design unless you know the file sizes relative to the memory that will be available to the running app.
A better solution would be to consider memory-mapped files; they use the file itself as their paging storage.