When fseek() is called in C, or seek() is called on a file object in any modern language like Python or Go, what happens at a very low level?
What does the operating system or hard drive actually do?
What gets read?
What overhead is incurred?
How does block size affect this overhead?
Edit to add:
Given NTFS with a block size of 4KB, does seeking 4096 bytes incur less IO overhead than reading 4096 bytes?
Second Edit:
When in doubt, go empirical.
Using some naive Python code with a 1.5GB file:
Reading 4096 sequentially: 21.2
Seek 4096 (relative): 1.35
Seek 4096 (absolute): 0.75 (interesting)
Seek and read every third 4096 (relative): 21.3
Seek and read every third 4096 (absolute): 21.5
The times are averaged and are in seconds. The hardware is a nondescript PC with a SATA drive running Windows XP.
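For reference, this is roughly the shape of the naive timing code (a reconstruction, not the original script; the file path and loop details are my assumptions):

    import os
    import time

    BLOCK = 4096
    PATH = "testdata.bin"  # hypothetical ~1.5 GB test file

    def read_sequential(path):
        # Read the whole file 4096 bytes at a time.
        with open(path, "rb") as f:
            while f.read(BLOCK):
                pass

    def read_every_third(path):
        # Read one 4096-byte block, then skip the next two using absolute seeks.
        # A relative variant would use f.seek(2 * BLOCK, os.SEEK_CUR) instead.
        size = os.path.getsize(path)
        with open(path, "rb") as f:
            pos = 0
            while pos < size:
                f.read(BLOCK)
                pos += 3 * BLOCK
                f.seek(pos)

    for fn in (read_sequential, read_every_third):
        t0 = time.perf_counter()
        fn(PATH)
        print(fn.__name__, round(time.perf_counter() - t0, 2), "s")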
This was hugely disappointing. I have several GB of files that I have to read on a near-continual basis. About 66% of the 4KB blocks in the files are uninteresting, and I know their offsets in advance.
Initially, I thought it might be a Big Win to rewrite the legacy code involved, which currently does a sequential read through the files 4096 bytes at a time. Assuming Win32 Python is not broken in some fundamental way, incorporating seek offers no advantage for non-random reads.
This depends heavily on current conditions. Generally, fseek() only changes the state of the stream (it either sets the current position or returns an error if the parameters are wrong). But fseek() also flushes the buffer, which might incur a pending write operation. If the file is a UTF-8 file and translation is enabled, the ftell() called from fseek() needs to read that part of the file to correctly calculate the offset. If CRLF translation is enabled, it also incurs read operations. In the case of a plain binary file with no pending write operation, however, fseek() just sets the position within the stream and doesn't need to go down to a lower level. For more details, see the source code of the CRT.
Related
I am working on a REST API that has an endpoint to download a file that could be > 2 GB in size. I have read that Java's FileChannel.transferTo(...) will use zero-copy if the OS supports it. During development, my server is running on localhost on my MacBook Pro with OS X 10.11.6.
I compared the following two methods of writing the file to the response stream:
Copying a fixed number of bytes from FileChannel to WritableByteChannel using transferTo
Reading a fixed number of bytes from FileInputStream into a byte array (size 4096) and writing to OutputStream in a loop.
The time taken for a 5.2GB file is between 20 and 23 seconds with both methods. I tried transferTo with the fixed number of bytes per transfer set to the following values: 4KB (i.e. 4 * 1024), 1MB and 50MB. The time taken to write is in the same range in all three cases.
Time taken is measured from before entering the while-loop to after exiting the while-loop, in which bytes are read from the file. This is all on the server side. The network hop time does not figure into this.
Any ideas on what the reason could be? I am quite sure OS X 10.11.6 should support zero-copy (i.e. the sendfile system call).
EDIT (6/18/2018):
I found the following blog post from 2015, saying that sendfile on MacOS X is broken. Could it be that this problem still exists?
https://blog.phusion.nl/2015/06/04/the-brokenness-of-the-sendfile-system-call/
The (high) transfer rate that you are quoting is likely close to or at the limit of what a SATA device can do anyway. If my guess is right, you will not see a performance gain reflected in the time it takes to run your test - however there will likely be a change in the CPU load during the test. Given that you have a relatively powerful machine, your CPU and memory are fast enough. Any method (zero-copy or not) will work at the same speed - which is the speed of your disk. However, zero-copy will cause a lot less CPU load and will not grab unnecessary bandwidth from your memory, either. Therefore, you should test different methods and see which one ends up using the least amount of CPU and choose that method for your application.
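One low-tech way to do that comparison is to record CPU time alongside wall-clock time around the copy. A minimal sketch (in Python for brevity; the helper names, buffer size, and file paths are my own, not from the question):

    import time

    def timed(label, fn, *args):
        # Run fn and report wall-clock vs CPU time. When the disk is the
        # bottleneck, a zero-copy path shows up mainly as lower CPU time,
        # not lower wall-clock time.
        wall0, cpu0 = time.perf_counter(), time.process_time()
        fn(*args)
        wall, cpu = time.perf_counter() - wall0, time.process_time() - cpu0
        print("%s: wall %.1fs, cpu %.1fs" % (label, wall, cpu))

    def buffered_copy(src_path, dst_path, bufsize=64 * 1024):
        # The read-into-a-buffer-and-write loop: the non-zero-copy baseline.
        with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
            while True:
                chunk = src.read(bufsize)
                if not chunk:
                    break
                dst.write(chunk)

    timed("buffered copy", buffered_copy, "bigfile.bin", "copy.bin")  # example paths

If transferTo really takes the sendfile path, the equivalent Java measurement should show noticeably lower process CPU time for roughly the same wall-clock time.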
How do you set the parameter setRAMBufferSizeMB? Does it depend on the RAM size of the machine? Or on the size of the data that needs to be indexed? Or on some other parameter? Could someone please suggest an approach for deciding the value of setRAMBufferSizeMB?
So, here is what the Lucene javadoc has to say about this parameter:
Determines the amount of RAM that may be used for buffering added documents and deletions before they are flushed to the Directory. Generally for faster indexing performance it's best to flush by RAM usage instead of document count and use as large a RAM buffer as you can. When this is set, the writer will flush whenever buffered documents and deletions use this much RAM.
The maximum RAM limit is inherently determined by the JVMs available memory. Yet, an IndexWriter session can consume a significantly larger amount of memory than the given RAM limit since this limit is just an indicator when to flush memory resident documents to the Directory. Flushes are likely happen concurrently while other threads adding documents to the writer. For application stability the available memory in the JVM should be significantly larger than the RAM buffer used for indexing.
By default, Lucene uses 16 MB for this parameter (which is an indication to me that you don't need a very large value to get good indexing speed). I would recommend tuning this parameter by setting it to, say, 500 MB and checking how well your system behaves. If you get crashes, try a smaller value like 200 MB, and so on, until your system is stable.
Yes, as stated in the javadoc, this parameter depends on the JVM heap, but for Python, I think it could allocate memory without any limit.
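If you are on PyLucene (as the Python mention suggests), the setting goes on the IndexWriterConfig. A rough sketch; the index path and the 500 MB value are just examples, and the exact imports depend on your PyLucene/Lucene version:

    import lucene
    from java.nio.file import Paths
    from org.apache.lucene.analysis.standard import StandardAnalyzer
    from org.apache.lucene.index import IndexWriter, IndexWriterConfig
    from org.apache.lucene.store import FSDirectory

    lucene.initVM()

    directory = FSDirectory.open(Paths.get("index"))   # example index path
    config = IndexWriterConfig(StandardAnalyzer())
    config.setRAMBufferSizeMB(500.0)   # start high, reduce if the JVM runs out of heap
    writer = IndexWriter(directory, config)
    # ... add documents ...
    writer.close()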
Since memory is much slower than the CPU, it should send the data in blocks of some 'x' bytes.
How big would this 'x' be?
Is the data line between memory and the CPU also an x*8-bit lane?
If I access an address 'A' in memory, would it also send the next x-1 memory addresses to the cache?
What is the approximate frequency a memory bus works at?
SIMD: do the SSE and MMX extensions somehow leverage this bulk-reading feature?
Please feel free to provide any references.
Thanks in advance.
The size 'x' is generally the size of a cache line. The cache line size depends on the architecture, but Intel and AMD use 64 bytes.
At least. If you have more channels, you can fetch more data from different channels.
Not exactly the next x-1 memory addresses. You can think of memory as divided into 64-byte chunks. Every time you want to access even one byte, you bring in the whole chunk that the address belongs to. Let's assume you want to access address 123 (decimal). That address falls in the chunk spanning 64 to 127, so you will bring in that whole chunk. This means you bring not only the following addresses but the previous ones as well, depending on where in the chunk the accessed address falls.
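As a quick sketch of that chunk arithmetic (assuming the 64-byte line size mentioned above):

    LINE = 64  # cache line size in bytes, typical for Intel and AMD

    def cache_line_bounds(addr, line=LINE):
        # Return the first and last byte address of the chunk containing addr.
        start = addr & ~(line - 1)   # round down to a multiple of the line size
        return start, start + line - 1

    print(cache_line_bounds(123))   # (64, 127): accessing byte 123 pulls in bytes 64..127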
That depends on the DDR generation your CPU supports. You can check some numbers here: https://en.wikipedia.org/wiki/Double_data_rate
Yes, they do. When you bring data from memory into the caches, you bring one cache line, and SIMD extensions work on multiple data elements in a single instruction. This means that if you want to add 4 values in one instruction, the data you are looking for will already be in the cache (since you brought in the whole chunk) and you just read it from the cache.
I'm looking for a good compression algorithm to use for decompressing data from a flash chip to load to an FPGA (a Xilinx Spartan6-LX9, on the Mojo development board). It must be fast to decompress and not require a lot of working memory to do so, as the CPU (an ATmega16U4) is clocked at 8 MHz and has only 2 KiB of RAM and 16 KiB of program flash, some of which is already in use. Compression speed is not particularly important, as compression will only be run once on a computer, and the compression algorithm need not work on arbitrary inputs.
Here is an example bitstream. The format is documented in the Spartan-6 FPGA Configuration manual (starting on page 92).
Generally, the patterns present in the data fall into a few categories, and I'm not sure which of these will be easiest to exploit given the constraints I'm working with:
The data is organized overall into a set of packets of a known format. Certain parts of the bitstream are somewhat "stereotyped" (e.g., it will always begin and end by writing to certain registers), and other commands will appear in predictable sequences.
Some bytes are much more common than others. 00 and FF are by far the most frequent, but other bytes with few bits set (e.g., 80, 44, 02) are also quite common.
Runs of 00 and FF bytes are very frequent. Other patterns will sometimes appear on a local scale (e.g., a 16-byte sequence will be repeated a few times), but not globally.
What would be an appropriate compression algorithm (not a library, unless you're sure it'll fit!) for this task, given the constraints?
You should consider using the LZO compression library. It probably has one of the fastest decompressors in existence, and decompression requires no additional memory. Compression, however, needs 64 KB of memory (or 8 KB for one of the compression levels). If you only need to decompress, it might just work for you.
The LZO project even provides a special cut-down version of the library called miniLZO. According to the author, miniLZO compiles to less than a 5 KB binary on i386. Since you have 16 KB of flash, it might just fit within your constraints.
The LZO compressor is currently used by UPX (the Ultimate Packer for eXecutables).
From your description, I would recommend run-length encoding followed by Huffman coding the bytes and runs. You would need very little memory over the data itself, mainly for accumulating frequencies and building a Huffman tree in place. Less than 1K.
You should make a histogram of the lengths of the runs to help determine how many bits to allocate to the run lengths.
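For concreteness, a sketch of the run-length pass and the run-length histogram (the file path is a placeholder and the Huffman stage is not shown; this is an illustration, not a drop-in implementation):

    from collections import Counter

    def rle_runs(data):
        # Yield (byte_value, run_length) pairs for the input bytes.
        i = 0
        while i < len(data):
            j = i
            while j < len(data) and data[j] == data[i]:
                j += 1
            yield data[i], j - i
            i = j

    def run_length_histogram(data):
        # Histogram of run lengths, to help decide how many bits to give each length.
        hist = Counter()
        for value, length in rle_runs(data):
            hist[length] += 1
        return hist

    with open("bitstream.bin", "rb") as f:   # hypothetical bitstream file
        data = f.read()
    print(run_length_histogram(data).most_common(10))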
Have you tried the built-in bitstream compression? That can work really well on non-full devices. It's a bitgen option, and the FPGA supports it out of the box, so it has no resource impact on your micro.
The way the compression is achieved is described here:
http://www.xilinx.com/support/answers/16996.html
Other possibilities have been discussed on comp.arch.fpga:
https://groups.google.com/forum/?fromgroups#!topic/comp.arch.fpga/7UWTrS307wc
It appears that one poster implemented LZMA successfully on a relatively constrained embedded system. You could use 7zip to check what sort of compression ratio you might expect and see if it's good enough before committing to implementation of the embedded part.
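For that quick ratio check you don't even need 7zip installed: Python's built-in lzma module uses the same algorithm family, so a few lines give you the expected ratio (the path below is a placeholder):

    import lzma

    # Rough estimate of what an LZMA-class compressor can achieve on the
    # bitstream before committing to an embedded decompressor.
    with open("bitstream.bin", "rb") as f:   # hypothetical bitstream file
        raw = f.read()

    packed = lzma.compress(raw, preset=9)
    print("original:", len(raw), "compressed:", len(packed),
          "ratio: %.2f" % (len(packed) / len(raw)))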
I need to read and process a text file. My processing would be easier if I could use the File.ReadAllLines method, but I'm not sure what the maximum file size is that can be read with this method without reading in chunks.
I understand that it depends on the computer's memory. But are there still any recommendations for an average machine?
On a 32-bit operating system, you'll get at most a contiguous chunk of memory of around 550 megabytes, allowing you to load a file of half that size. That goes downhill quickly after your program has been running for a while and the virtual memory address space gets fragmented. 100 megabytes is about all you can hope for.
This is not an issue on a 64-bit operating system.
Since reading a text file one line at a time is just as fast as reading all lines, this should never be a real problem.
I've done stuff like this with 1-2GB before, albeit in Python. I do not think .NET would have a problem, though. But I would only do this for one-off processing.
If you are doing this on a regular basis, you might want to go through the file line by line.
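For comparison, this is what that streaming approach looks like in Python (the rough .NET analogue would be File.ReadLines or a StreamReader loop); the path is a placeholder:

    # Stream the file line by line instead of loading everything at once;
    # only one line is held in memory at a time ("big.txt" is an example path).
    with open("big.txt", "r", encoding="utf-8") as f:
        for line in f:
            pass  # process each line here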
It's bad design unless you know the file sizes versus the computer memory that will be available to the running app.
A better solution would be to consider memory-mapped files. They use the file itself as the paging storage, so the whole file doesn't have to be read into memory at once.
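To illustrate the idea, here is a memory-mapped read sketched in Python (in .NET the equivalent machinery lives in System.IO.MemoryMappedFiles); the file name is a placeholder:

    import mmap

    # Map the file and let the OS page pieces in on demand, instead of
    # reading the whole thing into the process's memory up front.
    with open("big.txt", "rb") as f:                      # example file name
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            for line in iter(mm.readline, b""):
                pass  # process each line here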