Difference between blocks and sectors - hardware

With reference to this article, there is a line that reads:
Because there are limits to the number of blocks, or drive addresses, that an operating system can address. By defining a block as several sectors, an OS can work with bigger hard drives without increasing the number of block addresses.
What does it mean? What is meant by "operating system can address"? And the subsequent maths isn't clear either. How can 64*512 be less than 64*4?

Look at it this way. Every block that's used in your operating system's file system to store data requires a certain amount of metadata to be stored along with the actual file data you're writing. e.g: timestamps (created, modified), filename, ownership/permission bits. For files that span multiple blocks, you also have to store the IDs of each of those blocks and the order they're chained together, etc.
Determining block size in an OS is a case of tradeoffs. Every file must occupy at least one block, even if the file is 0 bytes long, so there's something for the file's metadata to be attached to. Unless you can guarantee that your files will ALWAYS be some multiple of the block size in size (e.g. in a 4k block OS, all files are 4k), there will be a certain amount of wastage for the files that don't exactly fit within that block.
Small block sizes are good when you need to store many small files. On the other hand, more blocks = more metadata, so you end up wasting a chunk of your storage system on overhead, tracking the location of all the files.
On the flip side, large blocks mean less metadata, but also mean greater wastage when you're storing small files. e.g. a 1 byte file stored in a 4k block wastes 3.99k of that block.
Each of those blocks must be given an ID number by the OS, so it can be uniquely identified. An OS which uses an 8 bit ID field can track only 256 blocks, and therefore, by extension, only 256 files. But if each of those blocks is actually 1 megabyte in size, then you can store up to 256 megabytes of data.
The article you link to has a typo/logical flaw: they meant 512 BYTES, not 512k, so 64*512 bytes is smaller than 64*4k, aka 64*4096 bytes. Most hard drives shipped with 512 byte sector/block sizes.
However, as discussed earlier, small blocks mean more metadata. With drive sizes now in the 3+ terabyte range, with 512 byte blocks, you had to have metadata storage for 3TB/512 bytes = 6.44 billion blocks. That's one major waste of space. So now they ship drives with 4k blocks, 8 times larger, so you only need metadata storage for 805 million blocks. The total number of possible files has been cut by a factor of 8, but the reduced amount of metadata means you can actually store a larger amount of useable data.
Incidentally, 6.4 billion blocks is larger than what can be addressed directly by a 32bit system. 2^32 has an upper limit of ~4.2 billion, so older 32bit machines could not use the entirety of a 3TB drive. Hence switching to larger block sizes. 32bit boxes can easily handle 805 million blocks.

Related

OpenCL (AMD GCN) global memory access pattern for vectorized data: strided vs. contiguous

I'm going to improve OCL kernel performance and want to clarify how memory transactions work and what memory access pattern is really better (and why).
The kernel is fed with vectors of 8 integers which are defined as array: int v[8], that means, before doing any computation entire vector must be loaded into GPRs. So, I believe the bottleneck of this code is initial data load.
First, I consider some theory basics.
Target HW is Radeon RX 480/580, that has 256 bit GDDR5 memory bus, on which burst read/write transaction has 8 words granularity, hence, one memory transaction reads 2048 bits or 256 bytes. That, I believe, what CL_DEVICE_MEM_BASE_ADDR_ALIGN refers to:
Alignment (bits) of base address: 2048.
Thus, my first question: what is the physical sense of 128-byte cacheline? Does it keep the portion of data fetched by single burst read but not really requested? What happens with the rest if we requested, say, 32 or 64 bytes - thus, the leftover exceeds the cache line size? (I suppose, it will be just discarded - then, which part: head, tail...?)
Now back to my kernel, I think that cache does not play a significant role in my case because one burst reads 64 integers -> one memory transaction can theoretically feed 8 work items at once, there is no extra data to read, and memory is always coalesced.
But still, I can place my data with two different access patterns:
1) contiguous
a[i] = v[get_global_id(0) * get_global_size(0) + i];
(wich actually perfomed as)
*(int8*)a = *(int8*)v;
2) interleaved
a[i] = v[get_global_id(0) + i * get_global_size(0)];
I expect in my case contiguous would be faster because as said above one memory transaction can completely stuff 8 work items with data. However, I do not know, how the scheduler in compute unit physically works: does it need all data to be ready for all SIMD lanes or just first portion for 4 parallel SIMD elements would be enough? Nevertheless, I suppose it is smart enough to fully provide with data at least one CU first, as soon as CU's may execute command flows independently.
While in second case we need to perform 8 * global_size / 64 transactions to get a complete vector.
So, my second question: is my assumption right?
Now, the practice.
Actually, I split entire task in two kernels because one part has less register pressure than another and therefore can employ more work items. So first I played with pattern how the data stored in transition between kernels (using vload8/vstore8 or casting to int8 give the same result) and the result was somewhat strange: kernel that reads data in contiguous way works about 10% faster (both in CodeXL and by OS time measuring), but the kernel that stores data contiguously performs surprisingly slower. The overall time for two kernels then is roughly the same. In my thoughts both must behave at least the same way - either be slower or faster, but these inverse results seemed unexplainable.
And my third question is: can anyone explain such a result? Or may be I am doing something wrong? (Or completely wrong?)
Well, not really answered all my question but some information found in vastness of internet put things together more clear way, at least for me (unlike abovementioned AMD Optimization Guide, which seems unclear and sometimes confusing):
«the hardware performs some coalescing, but it's complicated...
memory accesses in a warp do not necessarily have to be contiguous, but it does matter how many 32 byte global memory segments (and 128 byte l1 cache segments) they fall into. the memory controller can load 1, 2 or 4 of those 32 byte segments in a single transaction, but that's read through the cache in 128 byte cache lines.
thus, if every lane in a warp loads a random word in a 128 byte range, then there is no penalty; it's 1 transaction and the reading is at full efficiency. but, if every lane in a warp loads 4 bytes with a stride of 128 bytes, then this is very bad: 4096 bytes are loaded but only 128 are used, resulting in ~3% efficiency.»
So, for my case it does not realy matter how the data is read/stored while it is always contiguous, but the order the parts of vectors are loaded may affect the consequent command flow (re)scheduling by compiler.
I also can imagine that newer GCN architecture can do cached/coalesced writes, that is why my results are different from those prompted by that Optimization Guide.
Have a look at chapter 2.1 in the AMD OpenCL Optimization Guide. It focuses mostly on older generation cards but the GCN architecture did not completely change, therefore should still apply to your device (polaris).
In general AMD cards have multiple memory controllers to which in every clock cycle memory requests are distributed. If you for example access your values in column-major instead of row-major logic your performance will be worse because the requests are sent to the same memory controller. (by column major I mean a column of your matrix is accessed together by all the work-items executed in the current clock cycle, this is what you refer to as coalesced vs interleaved). If you access one row of elements (meaning coalesced) in a single clock cycle (meaning all work-items access values within the same row), those requests should be distributed to different memory controllers rather than the same.
Regarding alignment and cache line sizes, I'm wondering if this really helps improving the performance. If I were in your situation I would try to have a look whether I can optimize the algorithm itself or if I access the values often and it would make sense to copy them to the local memory. But than again it is hard to tell without any knowledge about what your kernels execute.
Best Regards,
Michael

Erasure Code for chunked file

Is there an erasure code, which can be applied to multiple chunks (maybe 100 or 200, each few hundred kB) by (somehow) adding redundancy chunks ?
I heard about Reed-Solomon, but it doesn't look like it can be used for huge data sets and multiple chunks, am I wrong ?
Thanks!
Of course Reed-Solomon can be used for any data size.
Just think of you data as set of multiple RS-sized blocks (eg. 255 byte for a byte-based RS code) and make the calculation for each block independly. All checksums together are the checksum of the whole big data thing.
If your data length is not a multiple of the RS block size, ie. the last block is too short, just add some 0 bytes to fill it up before encoding, and remove the 0s after decoding again. You'll have to save the original data length somewhere, but that should be no problem.
Erasure codes encode $N$ original data chunks into $M$ parity chunks for redundancy, while these $N$ original data chunks and $M$ parity chunks are only a stripe of the whole storage. Theoretically, the size of $N$ can be arbitrary large for Reed-Solomon(RS) codes, only if the Galois field $GF(2^w)$ the RS constructed over is large enough. Based on the above, your question is more likely as follow
Why the number of (original data) chunks in a stripe rarely be too large, for example, $N = 100$ or $200$ ?
The reasons are update problem and repair problem: If you encode a great number data chunks into parity chunks by the erasure codes, a lot of data/parity chunks are co-related. As long as you update one data chunk, all parity chunks must be updated as well, which cause heavy I/O for the parity part; repair problem is the situation when one data/parity chunk is failed, lots of data/parity chunks are accessed and transferred for repairing, causing enormous disk I/O or network traffic.
Take RAID5 of $3$ data chunks (A, B, C) and parity chunk P=A+B+C as a example, the repair for the failure of any chunk requires all other three chunks to participate.
The greater number of chunks are encoded, the more serious update problem and repair problem for a storage system might meet, which further greatly influences the performance of the system.
BTW, the decoding(process to obtain the original data) speed greatly falls off when the $N$ enlarges.

How to get LBA(logical block addressing) of a file from MFT on NTFS file system?

I accessed the $MFT file and extracted file attributes.
Given the file attributes from MFT, how to get a LBA of file from the MFT record on NTFS file system?
To calculate LBA, I know that cluster number of file.
It that possible using cluster number to calculate?
I'm not entirely sure of your question-- But if you're simply trying to find the logical location on disk of a file, there are various IOCTLs that will achieve this.
For instance, MFT File records: FSCTL_GET_NTFS_FILE_RECORD
http://msdn.microsoft.com/en-us/library/windows/desktop/aa364568(v=vs.85).aspx
Location on disk of a specific file via HANDLE: FSCTL_GET_RETRIEVAL_POINTERS
http://msdn.microsoft.com/en-us/library/windows/desktop/aa364572(v=vs.85).aspx
If you're trying to parse NTFS on your own, you'll need to follow the $DATA attribute-- Which will always be non-resident data runs (unless it's a small file that can be resident within the MFT). Microsoft's data runs are fairly simply structures of data contained in the first two nibbles, which specify offset and length for the next run of data.
IMHO you should write the code by doing some basic arithmetic rather than using IOCTLs and FSCTLs for everything. You should know the size of your disk and the offset from which a volume starts (or every extent by using IOCTL_VOLUME_GET_VOLUME_DISK_EXTENTS) and store those values somewhere. Then just add the LCN times the size of a cluster to the offset of the extent on the disk.
Most of the time you just have to deal with one extent. When you have multiple extents you can figure out on which extent the cluster is by multiplying the LCN with the size of a cluster and then subtracting the size of each extent returned by the IOCTL in the order they are returned, if the next number to subtract is greater than your current number, that particular LCN is on that extent.
A file is a single virtually contiguous unit consisting of virtual clusters. These virtual clusters map onto extents (fragments) of logical clusters where LCN 0 is the boot sector of the volume. The logical clusters map onto different logical clusters if there are bad clusters. The actual logical cluster is then translated to a physical cluster, PCN, or LBA (the first sector of the physical cluster) by summing the number of hidden sectors (the sector number of the boot sector relative to the start of the disk) and then adding it to LCN*(sectors per cluster in the volume). PCN = hidden sectors / (sectors per cluster in the volume) + LCN. LBA = hidden sectors + LCN*(sectors per cluster in the volume)

What is the maximum value size you can store in redis?

Does anyone know what the maximum value size you can store in redis? I want to use redis as a message queue with celery to store some small documents that need to be processed by a worker on another server, and I want to make sure the documents aren't going to be too big.
I found one page with a reference to 1GB, but when I followed the link on the page for where they got that answer the link wasn't valid anymore. Here is the link:
http://news.ycombinator.com/item?id=1182005
All string values are limited to 512 MiB. This is the size limit you probably care most about.
EDIT: Because keys in Redis are strings, the maximum key size is 512 MiB. The maximum number of keys is 2^32 - 1 = 4,294,967,295.
Values, on the other hand, can vary in size depending on their type. For aggregate data types (i.e. hash, list, set, and sorted set), the maximum value size is 512 MiB for each element, although the data structure itself can have up to 2^32 - 1 elements.
https://redis.io/topics/data-types
https://redis.io/topics/faq#what-is-the-maximum-number-of-keys-a-single-redis-instance-can-hold-and-what-is-the-max-number-of-elements-in-a-hash-list-set-sorted-set
http://groups.google.com/group/redis-db/browse_thread/thread/1c7e33fbc98734b3?fwc=2
Article about Redis Memory Usage can help you to roughly determine how much memory your database would take.
It's in the order of the amount of RAM you have, at least, so unless you plan on puting multi-gigabyte objects in there I wouldn't worry. I've had sets that were hundreds of megabytes big without a problem, but I don't know the exact limits.
A String value can accommodate the size of max 512MB. But according to this link, the size can be increased.

combining open(Filename, {delayed_write, Size, Delay}) with index

How can I combine a open(Filename, {delayed_write, Size, Delay}) with an index on where to write this data to?
I want to wait until I receive a certain amount of data and then write it to a position in the file.
Also is {read_ahead, Size} the opposite of {delayed_write, Size, Delay}? I would like to read a certain amount of data to send it.
Thanks
read_ahead is kind of the opposite of delayed_write in the sense that reading is the opposite of writing.
If you want to read and send bigger chunks of memory you don't need read_ahead, just read big chunks and send them (not many os calls to save here).
From the file:open/2 manpage on read_ahead:
If read/2 calls are for sizes not significantly less than, or even greater than Size
bytes, no performance gain can be expected.
You don't need to specify and index when opening. Just use pwrite/3 or a combination of position/2 and write/2.
But writing in different positions of the file might just reduce the gain of delayed_write since (also manpage of file:open/2):
The buffered
data is also flushed before some other file
operation than write/2 is executed.
If you have chunks of data for several positions collect them in a list of {Location, Bytes} and from time to time write them with file:pwrite/2 all in one go. This can map to very efficient writev(2) system call that writes several chunks in one go.