How to calculate the capacity of a memory given a range of addresses? - ram

I have another exercise that I couldn't solve.
A central memory is composed of two memory modules (RAM). The total address range attributed to the central memory is:
FROM 0000 0000H TO 3FFF FFFFH
1/ Give the total capacity of the central memory (in megabytes and gigabytes)
2/ Give the capacity of each memory module (RAM)
3/ Give the first and last address of each memory module (RAM)
Sorry for the bad translation; the exercise is in French.

Well, 1 is easy. The range from 0000 0000H to 3FFF FFFFH contains 4000 0000H addresses. (Just like 0 to 3 is four addresses: 0, 1, 2, and 3.) 4000 0000H is 1,073,741,824 in decimal, which is 1 GB, or 1,024 MB.
2 is no problem. If two memory modules give 1GB, then each module must be 512MB.
3 is impossible to answer with certainty: we don't know whether the memory modules are mapped consecutively or interleaved. But if we assume they're consecutive, which I imagine is what the exercise wants us to do, then the first one must be 0000 0000H to 1FFF FFFFH and the second one must be 2000 0000H to 3FFF FFFFH.
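If you want to verify the arithmetic, here is a minimal C sketch (the variable names are mine, and it assumes the consecutive mapping from point 3):
#include <stdio.h>
#include <inttypes.h>

int main(void) {
    uint64_t first = 0x00000000u, last = 0x3FFFFFFFu;
    uint64_t total = last - first + 1;     /* 4000 0000H = 1,073,741,824 bytes */

    printf("total: %" PRIu64 " bytes = %" PRIu64 " MB = %" PRIu64 " GB\n",
           total, total >> 20, total >> 30);

    /* assuming two consecutive modules of equal size */
    uint64_t half = total / 2;             /* 512 MB each */
    printf("module 0: %08" PRIX64 "H - %08" PRIX64 "H\n", first, first + half - 1);
    printf("module 1: %08" PRIX64 "H - %08" PRIX64 "H\n", first + half, last);
    return 0;
}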
Note that mapping memory modules consecutively is generally considered dumb. It means that in the typical case where the memory module bandwidth is the limiting factor, if an application is only using the first half of memory, it's only using one of the two modules, wasting half the available memory bandwidth. (Though, in the less common case where the memory is as fast or faster than the CPU or its memory bus, it doesn't matter.)


The disk IOPS from the scylla_setup iotune study of my disk are different from fio test data

When I use scylla_setup, the iotune study result is:
Measuring sequential write bandwidth: 473 MB/s
Measuring sequential read bandwidth: 499 MB/s
Measuring random write IOPS: 1902 IOPS
Measuring random read IOPS: 1999 IOPS
IOPS is 1900-2000.
When I use fio,
fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=/dev/sdc1 --bs=4k --iodepth=64 --size=4G --readwrite=randrw --rwmixread=75
the result is
test: (groupid=0, jobs=1): err= 0: pid=11697: Wed Jun 26 08:58:13 2019
read: IOPS=47.6k, BW=186MiB/s (195MB/s)(3070MiB/16521msec)
bw ( KiB/s): min=187240, max=192136, per=100.00%, avg=190278.42, stdev=985.15, samples=33
iops : min=46810, max=48034, avg=47569.61, stdev=246.38, samples=33
write: IOPS=15.9k, BW=62.1MiB/s (65.1MB/s)(1026MiB/16521msec)
bw ( KiB/s): min=62656, max=65072, per=100.00%, avg=63591.52, stdev=590.96, samples=33
iops : min=15664, max=16268, avg=15897.88, stdev=147.74, samples=33
cpu : usr=4.82%, sys=12.81%, ctx=164053, majf=0, minf=23
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
issued rwt: total=785920,262656,0, short=0,0,0, dropped=0,0,0
latency : target=0, window=0, percentile=100.00%, depth=64
Run status group 0 (all jobs):
READ: bw=186MiB/s (195MB/s), 186MiB/s-186MiB/s (195MB/s-195MB/s), io=3070MiB (3219MB), run=16521-16521msec
WRITE: bw=62.1MiB/s (65.1MB/s), 62.1MiB/s-62.1MiB/s (65.1MB/s-65.1MB/s), io=1026MiB (1076MB), run=16521-16521msec
Disk stats (read/write):
sdc: ios=780115/260679, merge=0/0, ticks=792798/230409, in_queue=1023170, util=99.47%
Read IOPS is 46000-48000, write IOPS is 15000-16000.
(NB: It looks like the questioner filed this as a Scylla Github issue too - https://github.com/scylladb/scylla/issues/4604 )
[Why is] the disk iops from scylla_setup iotune [...] different from fio test data
Different benchmarks, different results:
Scylla may have been using a much bigger block size (e.g. 64k) per I/O (this is likely the biggest factor). As you make the block size bigger (up to some maximum, due to diminishing returns), the bandwidth (i.e. the total amount of data you can send in, say, a second) achieved with that block size goes up, but the IOPS you get will typically go down (you are sending more data per I/O, after all). This is normal! (See the example fio invocation after this list.)
Scylla could be using buffered I/O (rather than direct I/O)
Scylla may have been benchmarking reads and writes separately
Scylla may have been using a bigger queue depth
Scylla may have been batching its submissions differently
Scylla may be writing a different type of data
And so on...
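To see the block-size effect from the first point for yourself, you could rerun the questioner's fio command with only the block size changed (64k here is just a guess at what Scylla might use, not a confirmed value):
fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=/dev/sdc1 --bs=64k --iodepth=64 --size=4G --readwrite=randrw --rwmixread=75
On the same disk you should see the reported IOPS drop and the reported bandwidth rise relative to the 4k run.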
In general, it's very difficult to take benchmarks done with different tools and compare them directly to each other - you would need to know what they are doing under the hood for any comparison to be meaningful. Trying to look at IOPS or bandwidth in isolation without more context is meaningless, as you typically trade one off against the other. It's better to use the same benchmark tool with identical options to compare two different machines, or to measure the impact of tuning on the same machine.
TL;DR: This is likely an apples-to-oranges comparison where the tools are measuring different contexts.
PS: gtod_reduce is a go-faster stripe that very few people actually need. If your hardware isn't capable of doing gigabytes per second and you're not seeing your CPU maxed out, it's unlikely that reducing gettimeofday calls is going to nudge the result very much.
(This question might be more appropriate for Server Fault (and thus get better replies there) because it's not directly about programming)

OpenCL (AMD GCN) global memory access pattern for vectorized data: strided vs. contiguous

I want to improve OCL kernel performance, and I'd like to clarify how memory transactions work and which memory access pattern is really better (and why).
The kernel is fed with vectors of 8 integers which are defined as an array: int v[8]. That means the entire vector must be loaded into GPRs before doing any computation. So, I believe the bottleneck of this code is the initial data load.
First, I consider some theory basics.
Target HW is a Radeon RX 480/580, which has a 256-bit GDDR5 memory bus on which a burst read/write transaction has 8-word granularity; hence, one memory transaction reads 2048 bits, or 256 bytes. That, I believe, is what CL_DEVICE_MEM_BASE_ADDR_ALIGN refers to:
Alignment (bits) of base address: 2048.
Thus, my first question: what is the physical sense of the 128-byte cache line? Does it keep the portion of data fetched by a single burst read but not actually requested? What happens with the rest if we requested, say, 32 or 64 bytes, so the leftover exceeds the cache line size? (I suppose it will just be discarded - but then, which part: head, tail...?)
Now back to my kernel. I think that cache does not play a significant role in my case, because one burst reads 64 integers -> one memory transaction can theoretically feed 8 work items at once, there is no extra data to read, and memory is always coalesced.
But still, I can place my data with two different access patterns:
1) contiguous
a[i] = v[get_global_id(0) * get_global_size(0) + i];
(which is actually performed as)
*(int8*)a = *(int8*)v;
2) interleaved
a[i] = v[get_global_id(0) + i * get_global_size(0)];
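For reference, here is a minimal OpenCL sketch of the two patterns side by side (the kernel and argument names are mine; the indexing expressions are the ones from the question):
__kernel void load_patterns(__global const int *v, __global int *out, int interleaved)
{
    int a[8];
    size_t gid = get_global_id(0);
    size_t gsz = get_global_size(0);

    for (int i = 0; i < 8; ++i)
        a[i] = interleaved ? v[gid + i * gsz]    /* 2) interleaved */
                           : v[gid * gsz + i];   /* 1) contiguous  */

    /* consume the values so the loads are not optimized away */
    int s = 0;
    for (int i = 0; i < 8; ++i)
        s += a[i];
    out[gid] = s;
}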
I expect that in my case the contiguous pattern would be faster because, as said above, one memory transaction can completely stuff 8 work items with data. However, I do not know how the scheduler in the compute unit physically works: does it need all data to be ready for all SIMD lanes, or would just the first portion, for 4 parallel SIMD elements, be enough? Nevertheless, I suppose it is smart enough to fully provide at least one CU with data first, since CUs may execute command flows independently.
In the second case, meanwhile, we need to perform 8 * global_size / 64 transactions to get a complete vector.
So, my second question: is my assumption right?
Now, the practice.
Actually, I split the entire task into two kernels because one part has less register pressure than the other and can therefore employ more work items. So first I played with the pattern in which the data is stored in the transition between kernels (using vload8/vstore8 or casting to int8 gives the same result), and the result was somewhat strange: the kernel that reads data in a contiguous way works about 10% faster (both in CodeXL and by OS time measurement), but the kernel that stores data contiguously performs surprisingly slower. The overall time for the two kernels is then roughly the same. I expected both to behave the same way - either both slower or both faster - but these inverse results seemed unexplainable.
And my third question is: can anyone explain such a result? Or maybe I am doing something wrong? (Or completely wrong?)
Well, this does not really answer all my questions, but some information found in the vastness of the internet has put things together in a clearer way, at least for me (unlike the abovementioned AMD Optimization Guide, which seems unclear and sometimes confusing):
«the hardware performs some coalescing, but it's complicated...
memory accesses in a warp do not necessarily have to be contiguous, but it does matter how many 32 byte global memory segments (and 128 byte l1 cache segments) they fall into. the memory controller can load 1, 2 or 4 of those 32 byte segments in a single transaction, but that's read through the cache in 128 byte cache lines.
thus, if every lane in a warp loads a random word in a 128 byte range, then there is no penalty; it's 1 transaction and the reading is at full efficiency. but, if every lane in a warp loads 4 bytes with a stride of 128 bytes, then this is very bad: 4096 bytes are loaded but only 128 are used, resulting in ~3% efficiency.»
So, for my case it does not really matter how the data is read/stored, as long as it is always contiguous, but the order in which the parts of the vectors are loaded may affect the subsequent command flow (re)scheduling by the compiler.
I can also imagine that the newer GCN architecture can do cached/coalesced writes, which is why my results differ from those suggested by that Optimization Guide.
Have a look at chapter 2.1 in the AMD OpenCL Optimization Guide. It focuses mostly on older-generation cards, but the GCN architecture did not change completely, so it should still apply to your device (Polaris).
In general, AMD cards have multiple memory controllers to which memory requests are distributed in every clock cycle. If you, for example, access your values in column-major instead of row-major logic, your performance will be worse because the requests are sent to the same memory controller. (By column-major I mean that a column of your matrix is accessed together by all the work items executed in the current clock cycle; this is what you refer to as coalesced vs. interleaved.) If you access one row of elements (meaning coalesced) in a single clock cycle (meaning all work items access values within the same row), those requests should be distributed to different memory controllers rather than the same one.
Regarding alignment and cache line sizes, I'm wondering if this really helps improve the performance. If I were in your situation, I would try to have a look at whether I can optimize the algorithm itself, or check whether I access the values often and it would make sense to copy them to local memory. But then again, it is hard to tell without any knowledge of what your kernels execute.
Best Regards,
Michael

What is the average consumption of a GPS app (data-wise)?

I'm currently working on a school project to design a network, and we're asked to assess traffic on the network. In our solution (dealing with taxi drivers), each driver will have a smartphone that can be used to track their position, in order to assign them the best ride possible (through Google Maps, for instance).
What would be the size of data sent and received by a single app during one day? (I need a rough estimate, no real need for a precise answer to the closest bit)
Thanks
GPS positions stored compactly, but not compressed, need this number of bytes:
time: 8 (4 bytes is possible too)
latitude: 4 (if used as integer or float) or 8
longitude: 4 or 8
speed: 2-4 (short: 2, integer: 4)
course: 2-4
So, stored in binary in main memory, one location including the most important attributes will need 20-24 bytes.
If you store them in main memory as single location objects, an additional 16 bytes per object are needed in a simple (Java) solution.
The maximum recording frequency is usually once per second (1/s). Per hour this needs 3600 s * 40 bytes = 144 KB, so a smartphone easily stores that even in main memory.
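As a sketch, such a location record could look like this in C (the struct and field names are mine; the sizes match the list above):
#include <stdint.h>

/* One GPS fix stored compactly, matching the field list above. */
struct gps_fix {
    int64_t time_ms;      /* time: 8 bytes (a 4-byte epoch-seconds field works too) */
    int32_t lat_e7;       /* latitude:  degrees * 1e7 as a 4-byte integer */
    int32_t lon_e7;       /* longitude: degrees * 1e7 as a 4-byte integer */
    int16_t speed_cms;    /* speed in cm/s: 2 bytes */
    int16_t course_cdeg;  /* course in 1/100 degree: 2 bytes */
};  /* 20 bytes of payload; compilers typically pad this to 24 */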
Not sure if you want to transmit the data:
When transmitting it to a server, the data volume will usually grow, depending on the transmission protocol used.
But it mainly depends on how you transmit the data, and how often.
If you transmit one position every 5 minutes, you don't have to care, even when you use a simple solution that transmits 100 times more bytes than necessary. (At one position every 5 minutes, that is 288 positions * 40 bytes ≈ 12 KB of payload per day.)
For your school project, try to transmit no more often than every 5, or better, 10 minutes.
Encryption adds a huge overhead.
To save bytes:
- Collect as long as feasible, then transmit at once.
- Favor binary protocols over text-based ones (BSON is better than JSON). (This might be out of scope for your school project.)

Difference between blocks and sectors

With reference to this article, there is a line that reads:
Because there are limits to the number of blocks, or drive addresses, that an operating system can address. By defining a block as several sectors, an OS can work with bigger hard drives without increasing the number of block addresses.
What does it mean? What is meant by "operating system can address"? And the subsequent maths isn't clear either. How can 64*512 be less than 64*4?
Look at it this way. Every block that's used in your operating system's file system to store data requires a certain amount of metadata to be stored along with the actual file data you're writing, e.g. timestamps (created, modified), filename, ownership/permission bits. For files that span multiple blocks, you also have to store the IDs of each of those blocks and the order they're chained together, etc.
Determining block size in an OS is a case of tradeoffs. Every file must occupy at least one block, even if the file is 0 bytes long, so there's something for the file's metadata to be attached to. Unless you can guarantee that your files will ALWAYS be some multiple of the block size (e.g. in a 4k-block OS, every file is a multiple of 4k), there will be a certain amount of wastage for the files that don't exactly fit within that block.
Small block sizes are good when you need to store many small files. On the other hand, more blocks = more metadata, so you end up wasting a chunk of your storage system on overhead, tracking the location of all the files.
On the flip side, large blocks mean less metadata, but also mean greater wastage when you're storing small files. e.g. a 1 byte file stored in a 4k block wastes 3.99k of that block.
Each of those blocks must be given an ID number by the OS, so it can be uniquely identified. An OS which uses an 8 bit ID field can track only 256 blocks, and therefore, by extension, only 256 files. But if each of those blocks is actually 1 megabyte in size, then you can store up to 256 megabytes of data.
The article you link to has a typo/logical flaw: they meant 512 BYTES, not 512k, so 64*512 bytes is smaller than 64*4k, aka 64*4096 bytes. Most hard drives shipped with 512 byte sector/block sizes.
However, as discussed earlier, small blocks mean more metadata. With drive sizes now in the 3+ terabyte range, with 512 byte blocks, you had to have metadata storage for 3TB/512 bytes = 6.44 billion blocks. That's one major waste of space. So now they ship drives with 4k blocks, 8 times larger, so you only need metadata storage for 805 million blocks. The total number of possible files has been cut by a factor of 8, but the reduced amount of metadata means you can actually store a larger amount of useable data.
Incidentally, 6.4 billion blocks is larger than what can be addressed directly by a 32bit system. 2^32 has an upper limit of ~4.2 billion, so older 32bit machines could not use the entirety of a 3TB drive. Hence switching to larger block sizes. 32bit boxes can easily handle 805 million blocks.
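The arithmetic, as a quick C sketch (using binary terabytes, as the answer does):
#include <stdio.h>
#include <inttypes.h>

int main(void) {
    uint64_t disk = 3ULL << 40;    /* 3 TiB */
    uint64_t limit = 1ULL << 32;   /* what a 32-bit block ID can address */
    printf("512-byte blocks: %" PRIu64 "\n", disk / 512);   /* ~6.44 billion */
    printf("4K blocks:       %" PRIu64 "\n", disk / 4096);  /* ~805 million  */
    printf("32-bit limit:    %" PRIu64 "\n", limit);        /* ~4.29 billion */
    return 0;
}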

How to use coalesced memory access

I have N threads to run simultaneously on the device, and they need M*N floats from global memory. What is the correct way to access global memory in a coalesced fashion? How can shared memory help with this?
Usually, good coalesced access can be achieved when the neighbouring threads access neighbouring cells in memory. So, if tid holds the index of your thread, then accessing:
arr[tid] --- gives perfect coalescence
arr[tid+5] --- is almost perfect, probably misaligned
arr[tid*4] --- is not that good anymore, because of the gaps
arr[random(0..N)] --- horrible!
I am talking from the perspective of a CUDA programmer, but similar rules apply elsewhere as well, even in a simple CPU programming, although the impact is not that big there.
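As a minimal CUDA sketch of the perfect case from the list above (the kernel and buffer names are mine):
__global__ void copy_coalesced(const float *in, float *out, int n)
{
    // consecutive threads get consecutive tid values, so within a warp
    // the load and store below touch consecutive addresses and coalesce
    // into a minimal number of memory transactions
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n)
        out[tid] = in[tid];   // the arr[tid] pattern: perfect coalescence
}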
"But I have so many arrays everyone has about 2 or 3 times longer than the number of my threads and using the pattern like "arr[tid*4]" is inevitable. What may be a cure for this?"
If the offset is a multiple of some higher 2-power (e.g. 16*x or 32*x) it is not a problem. So, if you have to process a rather long array in a for-loop, you can do something like this:
for (size_t base = 0; base < arraySize; base += numberOfThreads)
    process(arr[base + threadIndex]);
(the above assumes that the array size is a multiple of the number of threads)
So, if the number of threads is a multiple of 32, the memory access will be good.
Note again: I am talking from the perspective of a CUDA programmer. For different GPUs/environment you might need less or more threads for perfect memory access coalescence, but similar rules should apply.
Is "32" related to the warp size which access parallel to the global memory?
Although not directly, there is some connection. Global memory is divided into segments of 32, 64 and 128 bytes which are accessed by half-warps. The more segments you access for a given memory-fetch instruction, the longer it goes. You can read more into details in the "CUDA Programming Guide", there is a whole chapter on this topic: "5.3. Maximise Memory Throughput".
In addition, I heard a little about shared memory to localize the memory access. Is this preferred for memory coalescing or have its own difficulties?
Shared memory is much faster as it lies on-chip, but its size is limited. The memory is not segmented like global, you can access is almost-randomly at no penality cost. However, there are memory bank lines of width 4 bytes (size of 32-bit int). The address of memory that each thread access should be different modulo 16 (or 32, depending on the GPU). So, address [tid*4] will be much slower than [tid*5], because the first one access only banks 0, 4, 8, 12 and the latter 0, 5, 10, 15, 4, 9, 14, ... (bank id = address modulo 16).
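A classic illustration of that bank rule is padding a shared-memory tile so that column accesses do not all land in the same bank (a standard trick, not something from the question; this sketch assumes a single 16x16 thread block per tile):
#define TILE 16

__global__ void transpose16(const int *in, int *out)
{
    // the +1 column shifts each row of the tile by one bank, so reading
    // a column no longer makes every thread hit the same bank
    __shared__ int tile[TILE][TILE + 1];

    int x = threadIdx.x, y = threadIdx.y;
    tile[y][x] = in[y * TILE + x];    // row-wise write: conflict-free
    __syncthreads();
    out[x * TILE + y] = tile[x][y];   // column-wise read: conflict-free thanks to the padding
}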
Again, you can read more in the CUDA Programming Guide.