FileChannel.transferTo (supposedly zero-copy) not giving any performance gain - sendfile

I am working on a REST API that has an endpoint to download a file that could be larger than 2 GB. I have read that Java's FileChannel.transferTo(...) will use zero-copy if the OS supports it. My server is running on localhost during development, on a MacBook Pro with OS X 10.11.6.
I compared the following two methods of writing the file to the response stream:
Copying a fixed number of bytes from a FileChannel to a WritableByteChannel using transferTo
Reading a fixed number of bytes from a FileInputStream into a byte array (of size 4096) and writing it to the OutputStream in a loop.
The time taken for a 5.2 GB file is between 20 and 23 seconds with both methods. I tried transferTo with the number of bytes per single transfer set to the following values: 4 KB (i.e. 4 * 1024), 1 MB, and 50 MB. The time taken to write is in the same range in all three cases.
The time is measured from just before entering the while-loop in which bytes are read from the file to just after exiting it. This is all on the server side; network hop time does not figure into it.
Any ideas what the reason could be? I am quite sure OS X 10.11.6 should support zero-copy (i.e. the sendfile system call).
EDIT (6/18/2018):
I found the following blog post from 2015 saying that sendfile on Mac OS X is broken. Could it be that this problem still exists?
https://blog.phusion.nl/2015/06/04/the-brokenness-of-the-sendfile-system-call/

The (high) transfer rate that you are quoting (5.2 GB in roughly 21 seconds works out to about 250 MB/s) is likely close to or at the limit of what a SATA device can do anyway. If my guess is right, you will not see a performance gain reflected in the time the test takes; what will change is the CPU load during the test. Given that you have a relatively powerful machine, your CPU and memory are fast enough that any method, zero-copy or not, will run at the same speed, namely the speed of your disk. Zero-copy, however, will cause far less CPU load and will not grab unnecessary memory bandwidth either. Therefore, you should test the different methods, see which one ends up using the least CPU, and choose that method for your application.
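For context, at the system-call level the two approaches being compared boil down to something like the following sketch (C++ using the Linux sendfile(2) signature, since the OS X one differs; all names here are illustrative):
#include <sys/sendfile.h>
#include <unistd.h>

// Conventional copy: each chunk crosses the kernel/user boundary twice.
void copy_with_buffer(int in_fd, int out_fd) {
    char buf[4096];
    ssize_t n;
    while ((n = read(in_fd, buf, sizeof(buf))) > 0)
        write(out_fd, buf, n);              // short writes ignored for brevity
}

// Zero-copy: the kernel moves the data itself; user space never touches it.
void copy_with_sendfile(int in_fd, int out_fd, off_t size) {
    off_t offset = 0;
    while (offset < size)
        if (sendfile(out_fd, in_fd, &offset, size - offset) <= 0)
            break;
}
Both loops are limited by how fast the disk can deliver data, so their wall-clock times match; the saving from the second one shows up as lower CPU time (visible with time(1) or getrusage(2)), which is exactly what is worth measuring here.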

Related

How to determine GTK3 memory leaks on OS X

I've written a small, single-line-oriented, UDP-based display service to support the Raspberry Pi projects I frequently work on, where it is nice to see the results of the sensor data being captured. This is a rewrite of that program using GTK3 v3.18.9 and GLib2 v2.46.2. I'm developing on OS X El Capitan.
It seems to double in real memory size every 30 minutes or so, based on traffic, so I'm presuming I have a memory leak somewhere. But for the life of me I can't see where in the code it could possibly be. Valgrind did not initially work for me, so I've got some studying to do to resolve whatever issue that is.
Meanwhile, I was hoping that different eyes might be able to suggest a coding cause for this traffic-based (at least I think so) memory leak. Here are the program and test client.
It starts up using about 10 MB of real memory, jumps to 14 MB once it has received 64 messages in total, then slowly grows from there. From 64 messages on, I delete the 65th message off the end of the list, presuming this saves memory, as this program might run for weeks.
Here is the code for the test client and the display service:
https://gist.github.com/skoona/c1218919cf393c98af474a4bf868009f

What happens at a low level when I call fseek()?

When fseek() is called in C, or seek() is called on a file object in any modern language like Python or Go, what happens at a very low level?
What does the operating system or hard drive actually do?
What gets read?
What overhead is incurred?
How does block size affect this overhead?
Edit to add:
Given NTFS with a block size of 4KB, does seeking 4096 bytes incur less IO overhead than reading 4096 bytes?
Second Edit:
When in doubt, go empirical.
Using some naive Python code with a 1.5GB file:
Reading 4096 sequentially: 21.2
Seek 4096 (relative): 1.35
Seek 4096 (absolute): 0.75 (interesting)
Seek and read every third 4096 (relative): 21.3
Seek and read every third 4096 (absolute): 21.5
The times are averages, in seconds. The hardware is a nondescript PC with a SATA drive running Windows XP.
This was hugely disappointing. I have several GB of files that I have to read on a near-continuous basis. About 66% of the 4 KB blocks in the files are uninteresting, and I know their offsets in advance.
Initially, I thought it might be a big win to rewrite the legacy code involved, which currently reads sequentially through the files 4096 bytes at a time. Assuming Win32 Python is not broken in some fundamental way, incorporating seek has no advantage for non-random reads.
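The kind of measurement used looks roughly like this (sketched here in C++ rather than the original Python; the file name is illustrative):
#include <chrono>
#include <cstdio>

// Time one pass over the file: either read every 4 KB block, or seek
// past two blocks and read every third one (relative seeks).
static double time_pass(const char *path, bool read_every_third) {
    const long BLOCK = 4096;
    char buf[4096];
    std::FILE *f = std::fopen(path, "rb");
    if (!f) return -1.0;
    auto start = std::chrono::steady_clock::now();
    for (;;) {
        if (read_every_third && std::fseek(f, 2 * BLOCK, SEEK_CUR) != 0)
            break;
        if (std::fread(buf, 1, BLOCK, f) != (size_t)BLOCK)
            break;
    }
    std::fclose(f);
    return std::chrono::duration<double>(std::chrono::steady_clock::now() - start).count();
}

int main() {
    std::printf("sequential read:         %.2f s\n", time_pass("big.dat", false));
    std::printf("seek + read every third: %.2f s\n", time_pass("big.dat", true));
}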
This depends heavily on current conditions. Generally, fseek() only changes the state of the stream: it either sets the current position, or returns an error if the parameters are wrong. However, fseek() flushes the buffer, which may incur a pending write operation. And with the Microsoft CRT, if the file is a UTF-8 file with translation enabled, the ftell() called from inside fseek() needs to read that part of the file to calculate the offset correctly; if CRLF translation is enabled, that likewise incurs read operations. But for a plain binary file with no pending write operation, fseek() just sets the position within the stream and doesn't need to go to a lower level. For more details, see the source code of the CRT.
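To illustrate the translation point with the Microsoft CRT (file names and mode strings here are illustrative):
std::FILE *fb = std::fopen("data.bin", "rb");            // binary mode: fseek just moves the position
std::FILE *ft = std::fopen("log.txt", "rt, ccs=UTF-8");  // text mode + UTF-8 translation: fseek may
                                                         // have to read and translate part of the file
                                                         // to compute the byte offset correctly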

mongodb high cpu usage

I have installed MongoDB 2.4.4 on Amazon EC2 with a 64-bit Ubuntu OS and 1.6 GB RAM.
Only MongoDB is running on this server, nothing else.
But sometimes CPU usage reaches 99%, with a load average of 500.01, 400.73, 620.77.
I have also installed MMS on the server to monitor what is going on.
According to the MMS details, indexing is working fine for every query.
The suspect details are as follows:
1) HIGH non-mapped virtual memory
2) HIGH page faults
Can anyone help me understand what exactly is causing the high CPU usage?
EDIT:
After @Dylan Tong's comments, I reduced the active connections, but there is still high non-mapped virtual memory.
Here's a summary of a few things to look into:
1. A large number of connections and cursors (13k) was observed:
- Fix: make sure your connection pool size is appropriate. For reporting, at your current request rate, you only need a few connections at most. Also, I'm guessing you have an m1.small instance, which means you only have one core.
2. Review queries and indexes:
- Run your queries with explain() to observe how they are executed. The right model normally results in queries that pull only a few documents and make use of an index.
3. Memory (compact and readahead setting):
- Make the best use of memory; 1.6 GB is low. Check how much free memory you have and compare it to what is reported as resident. A common cause of low resident memory is fragmentation: if a lot of documents are moving or changing size, run the compact command to defragment your data files. A bad readahead setting can also lead to poor use of memory; check yours (http://manpages.ubuntu.com/manpages/lucid/man2/readahead.2.html) and try a few values, starting low (http://docs.mongodb.org/manual/administration/production-notes/). The production notes recommend 32 (for standard 512-byte blocks); higher values are sometimes optimal if your documents are larger. The hope is that resident memory ends up close to your available memory and that page faults start to drop.
If you're using your resources to the fullest after this and are still capped on CPU, it means you need to increase your resources.

WinhttpReadData slow network

I am downloading an exe from the server using WinHTTP in C++, following the sample code provided on MSDN:
http://msdn.microsoft.com/en-us/library/aa384104%28v=vs.85%29.aspx
It works fine. I normally add up all the data read (read via WinHttpReadData) and log the total.
The expected result is that this sum matches the exe size, and on a reasonably fast network it does.
On a very slow network, however, the logged total of data read is much larger than the original size; yet when I check the downloaded exe, its size is the same as on the server.
Note that this only occurs on a slow network. Has anyone faced this issue?
Are you respecting the value returned via the lpdwNumberOfBytesRead parameter? The number of bytes read with each call may be less than the buffer size you provided -- especially if fewer bytes are available at that time due to the slowness of the network.
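In other words, accumulate the count that comes back through lpdwNumberOfBytesRead rather than the size you asked for. A minimal sketch of that loop, assuming an hRequest already opened and sent as in the MSDN sample:
#include <windows.h>
#include <winhttp.h>
#include <vector>

unsigned long long SumBytesRead(HINTERNET hRequest) {
    unsigned long long total = 0;
    DWORD dwSize = 0;
    do {
        dwSize = 0;
        if (!WinHttpQueryDataAvailable(hRequest, &dwSize) || dwSize == 0)
            break;
        std::vector<char> buffer(dwSize);
        DWORD dwDownloaded = 0;
        if (!WinHttpReadData(hRequest, buffer.data(), dwSize, &dwDownloaded))
            break;
        total += dwDownloaded;   // may be smaller than dwSize on a slow network
        // write exactly dwDownloaded bytes from buffer to the output file here
    } while (dwSize > 0);
    return total;
}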

Keeping a file in the OS block buffer

I need to keep as much as I can of a large file in the operating system's block cache, even though it's bigger than I can fit in RAM, while I am continuously reading another very, very large file. At the moment, streaming through the second file evicts large chunks of the important file from the system cache.
On a POSIX system like Linux or Solaris, try using posix_fadvise().
On the streaming file, do something like this:
posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);

char buffer[64 * 1024];
off_t current_pos = 0;
ssize_t bytes = pread(fd, buffer, sizeof(buffer), current_pos);
while (bytes > 0) {
    current_pos += bytes;   /* advance by what was actually read */
    /* drop the pages behind us from the cache */
    posix_fadvise(fd, 0, current_pos, POSIX_FADV_DONTNEED);
    bytes = pread(fd, buffer, sizeof(buffer), current_pos);
}
And you can apply POSIX_FADV_WILLNEED to your other file, which should raise its memory priority.
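For example (important_fd being the descriptor of the file you want to keep cached; the name is illustrative):
posix_fadvise(important_fd, 0, 0, POSIX_FADV_WILLNEED);  /* hint: the whole file will be needed */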
Now, I know that Windows Vista and Server 2008 can also do nifty tricks with memory priorities. Probably older versions like XP can do more basic tricks as well. But I don't know the functions off the top of my head and don't have time to look them up.
Within Linux, you can mount a filesystem of type tmpfs, which uses available swap as backing storage if needed. You should be able to create a filesystem larger than your memory size, and the contents of that filesystem will be prioritized in the system cache.
mount -t tmpfs none /mnt/point
See: http://lxr.linux.no/linux/Documentation/filesystems/tmpfs.txt
You may also benefit from tuning swappiness and drop_caches under /proc/sys/vm.
If you're using Windows, consider opening the file you're scanning through with the flag
FILE_FLAG_SEQUENTIAL_SCAN
You could also use
FILE_FLAG_NO_BUFFERING
for that file, but it imposes some restrictions on your read size and buffer alignment.
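For example (path illustrative):
HANDLE h = CreateFileW(L"C:\\data\\stream.bin", GENERIC_READ,
                       FILE_SHARE_READ, NULL, OPEN_EXISTING,
                       FILE_FLAG_SEQUENTIAL_SCAN, NULL);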
Some operating systems have ramdisks that you can use to set aside a segment of RAM for storage and then mount it as a file system.
What I don't understand, though, is why you want to keep the operating system from caching the file. Your full question doesn't really make sense to me.
Buy more RAM (it's relatively cheap!) or let the OS do its thing. I think you'll find that circumventing the OS will be more trouble than it's worth. The OS will cache as much of the file as it can, until your application or any other application needs the memory.
I guess you could minimize the number of processes, but it's probably quicker to buy more memory.
mlock() and mlockall() respectively lock part or all of the calling process’s virtual address space into RAM, preventing that memory from being paged to the swap area.
(copied from the MLOCK(2) Linux man page)
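A sketch of that approach, mapping the important file and locking the mapping into RAM (the path is illustrative, and the process needs a sufficient RLIMIT_MEMLOCK or privileges):
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

void pin_file_in_ram(const char *path) {
    int fd = open(path, O_RDONLY);
    struct stat st;
    if (fd < 0 || fstat(fd, &st) != 0) return;
    void *p = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (p != MAP_FAILED)
        mlock(p, st.st_size);   // pages stay resident until munlock()/munmap()
    close(fd);                  // the mapping keeps the file accessible
}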