How much data will be transferred from memory to CPU in a single call

How much data will be transferred from memory to CPU in a single call - hardware

As memory is much slower to CPU, It should send the data in blocks of some 'x'Bytes.
How much would be the size of this 'x'?
Do the data line b/n memory and CPU is also a x*8 bit lane?
If I access an address 'A' on memory, would it be sending all the next x-1 memory addresses to the cache?
What is the Approx frequency a memory bus would be working?
SIMD - Do SSE and MMX extensions somehow leverage this bulk reading feature?
Please feel free to provide any references.
Thanks in advance.

The size 'x' is generally the size of a cache line. Cache line size depends on the architecture, but Intel and AMD use 64 byte.
At least . If you have more channels, you can fetch more data from different channels.
Not exactly the next x-1 memory addresses. You can think of the memory, divided into 64 byte chunks. Every time you want to access even one byte, you will bring the chunk your address belongs. Lets assume you want to access the address 123 (decimal). The start of the address should be 64 to 127. So, you will bring that whole chunk. Which means, you do not only bring the following ones, but the previous addresses as well, depending on the address you access of course.
That depends the version of DDR you CPU supports. You can check some numbers in here: https://en.wikipedia.org/wiki/Double_data_rate
Yes, they do. When you bring a data from memory to caches, you bring one cache line, and SIMD extensions work on multiple data elements in a single instruction. Which means if you want to add 4 values in one instruction, the data you are looking for would be in the cache (since you brought the whole chunk) and you just read it from cache.

Related

Performance Counter for DRAM Per-Rank Memory Access

I have an Intel(R) Core(TM) i7-4720HQ CPU # 2.60GHz (Haswell) processor. I need to retrieve the number of accesses to each DRAM rank, over time, to estimate its power consumption. Based on page 261 of the chipset documentation (i.e., Datasheet, volume 2 (M- and H-processor lines)), I could use the 32-bit value in register, RAM—DRAM_ENERGY_STATUS, as a DRAM energy estimation. But I need rank-level energy estimates. I could also use core and offcore DRAM access performance counters to estimate power consumption, but, as mentioned before, I need per-rank statistics. Besides that, they report whole-system stats, while energy is calculated per-rank. They also do not report many DRAM accesses.
Therefore, IMC counters (which are uncore counters) should be the ideal choice. Perf does not support per-rank counters. I tried to use PCM-Memory to access IMC counter information. But /sys/bus/event_source/devices/uncore_imc is not mounted by the kernel (the version is 5.0.0-37-generic) and the tool does not detect the CPU. I tried to access uncore performance counters, manually. Whole-system DRAM access counters are documented, here (They were not documented in the above-mentioned chipset manual). I can retrieve total DRAM read and write accesses using these counters. But, there is no information about channel or rank-level access stats. How can I find the offset associated with these counters? Should I use trial and error?
P.S.: This question is also asked at Intel Software Tuning, Performance Optimization & Platform Monitoring Forum.

The MSR_DRAM_ENERGY_STATUS always reports an estimate of the energy consume by all memory channels. There is no easy way to break it into per-rank energy. This register reports a highly accurate estimate on Haswell.
The 5.0.0-37-generic kernel is an Ubuntu kernel and does support the uncore_imc/data_reads/ and uncore_imc/data_writes/ events on Haswell, which represent a data read CAS command and a data write CAS command from the IMC, respectively. A full cache-line read and a full cache-line write transactions cause a single bursty 64-byte transaction on the memory bus to a single rank. A partial read is also executed as a single full-line read on the bus, but a partial write may require a full line read followed by a full line write due to restrictions in the protocol. Partial writes are generally negligible.
The uncore_imc/data_reads/ and uncore_imc/data_writes/ events occur for requests targeting DRAM memory generated by any unit, not just cores. These names are given by perf and they correspond to UNC_IMC_DRAM_DATA_READS and UNC_IMC_DRAM_DATA_WRITES, respectively, which are mentioned in the Intel article you've cited. The other three events mentioned there allow you count requests (not CAS commands!) for each of the three possible sources separately (GT, IA, and IO). You won't find them listed under /sys/bus/event_source/devices/uncore_imc/events on your old kernel. They are supported in perf starting with mainstream kernel v5.9-rc2.
By the way, PCM does support these events as well, which it uses to report read and write bandwdith over all channels, but you should use the tool pcm.x, not pcm-memory.x, which only works on server processors.
A Haswell H-processor line processor has a single on-die memory controller with two DDR3L 64-bit channels. Each channel can contain zero, one, or two DIMMs with a total capacity of up to 32 GBs over all channels. Moreover, each DIMM can contain up to two ranks, so a single channel can contain anywhere between zero and four ranks. The i7-4720HQ is a high-end mobile processor. You're probably on a laptop with 8 GBs or 16 GBs of memory. If the memory topology was not changed since purchase, it probably has only two 4GB or 8GB DIMMs, one in each channel, with one remaining free slot per channel available for expansion if desired by the user. This means that there are either one or two ranks per channel.
You can approximate the number of accesses to each rank given the knowledge of how physical addresses are mapped to ranks. If each channel is populated with a single rank DIMM of the same capacity, the mapping is simple on your processor. Bit 6 of the physical address (i.e., the seventh bit) determines which channel, and therefore which rank, a request is mapped to. You can collect a set of samples of physical addresses of requests at the IMC by running perf record on MEM_LOAD_UOPS_L3_MISS_RETIRED.LOCAL_DRAM with the --phys-data option. Obviously this set of samples may only be representative of core-originated retired loads that reach the IMC, which are a small subset of all requests at the IMC.
It appears to me that you want to measure the number of memory accesses per rank in order to estimate the the per-rank energy from the total DRAM energy, but this is not trivial at all for the following reasons:
Not all CAS commands are of the same energy cost. Precharge and activate commands are not counted by any event and may consume significant energy, especially with high row buffer miss rates.
Even if there are zero requests in the IMC, as long as there is at least one active core, the memory channels are powered and do consume energy.
The amount of time it takes to process a request of the same type and to the same address may vary depending surrounding requests due to timing delays required by rank-to-rank turnarounds and read-write switching.
Despite of all of that, I imagine it may be possible to build a good model of upper and lower bounds on per-rank energy given a representative estimate of the number of requests to each rank (as discussed above).
The bottom line is that there is no easy way to get the luxury of per-rank counting like on server processors.

Is there any connection between the segments created in memory by a microprocessor and the memory structure of a process in an operating system?

In 8086 microprocessor, we segment the memory into segments of 64K each because of the 16 bit registers (Since a 20 bit address cannot be stored in the 16 bit register). These segments are categorized as code segment, data segment, stack segment and extra segment. This structure is similar to that of created by a process in operating system. Does that mean each process takes up memory equivalent to 4 segments which will be equivalent to 4*64K in case of 8086 ? And if this is true then by doing some more math we can say that only 4 process will be handled by a 8086 microprocessor at a time (i.e. one of the process will be running state and others would be in block or ready state) since maximum of 16 segments are possible (Total memory size / size of each segment = 1MB/64K = 16).
I have just started studying this and saw this equivalence between process and segments. Does any such connection between the segments of the memory and the memory structure of the process actually exists or it's just my crazy imagination ?

A little history helps. Early UNIX(tm) ran on the Digital pdp minicomputer family. The first circulated versions were V6 & V7, which were exclusive to the pdp-11 family. That family could support a whopping 256K of RAM; but the gp register set (used for address formation) were 16bits wide. There was a limited memory protection scheme in the processor, which permitted the kernel (supervisor) to have a separate address space from user (user); and instructions (addresses generated by pc) to be separate from data (generated by other means). This will probably get edited into the dust by pdp-11 fanbois.
At around this time, intel was rolling what was to become the 8086. Current 8-bit CPUs were already straining at a 64K address space limitation, and were using a concept called bank switching to increase that. In bank switching, some sub-ranges of the 64K address space could be re-pointed into a larger memory bank; so although you could carefully address much more memory. The Hitachi 64180 was one of the CPUs that incorporated this into its silicon; most used external memory controllers.
The 8086 addressing scheme was an amalgamation of these notions. You could produce an Operating System which supported dynamically relocated processes and shared text with up to 64K Instruction + 64K Data. The general idea was you take the segment registers out of the programming model, thus if the OS has to relocate the process, it knows that the process had no saved copy of the old segment value. The commercial OS QNX 1.x, 2.x provided this as a model; the later using the 286 extensions to protect against programs that played with the segment registers.
For programs that didn't care about such subtleties (Lotus 123, ...), you could use the segment registers to effectively create a 2^20 address space on the 8086. It is an ugly programming model in this mode because address formation is A=Seg*16+Base, so Seg=1,Base=0 and Seg=0,Base=16 resolve to the same address.
So, you aren't hallucinating, it was quite intentional, if more than a little half-arsed.

Can a 32-bit processor load a 64-bit memory address using multiple blocks or registers?

I was doing a little on 32-bit microprocessors and have I have learnt that:
1) A 32-bit microprocessor can only address 2^32 bits of memory which means that the memory pointer size should not exceed 32-bit range i.e. the pointer size should be equal to or less than 32-bit.
2) I also came to know that CPU allocate multiple blocks of memory for things like storing numbers and text, that is up to the program and not related to the size of each address (Source:here).So is it possible that a CPU can use multiple blocks (registers) to store pointers more than 32-bit in size?

Processors can access an essentially unlimited amount of memory by using variations on a technique called bank switching. In a simple bank-switching scheme, the memory chips that are wired to a portion of the address space will have some address inputs fed by the processor and some from an external latching device. Historically, the IBM PC had a 1MB address space, but an expanded memory board would IIRC allow two 16KB regions of that space to be mapped to any of dozens or hundreds of 16KB blocks of memory contained thereon. Nowadays processors generally have a memory-management unit built-in, which maps 4KB or 64KB blocks of memory to any address within a much larger space, and additional circuitry may, with OS support, expand things further.
The big difficulty with bank switching is that any given address might identify many different places in memory depending upon how the bank-switching hardware is configured, so accessing data from memories in a banked region will generally be more complicated than accessing data in directly-accessible memory and will only be possible from code which knows how the bank-switching hardware works. Nowadays it's more common to simply use a processor which can access all the memory one needs, but historically bank-switching was often a useful technique for going beyond processor limitations.

You could store a 64 bit pointer using 2 separate locations in memeory. But it probably wouldn't be useful since your processor can only use 32 bit pointers.

Logging 16-bit data to an SD card at the rate of 44 kHz

I am using the STM32F4 microcontroller with a microSD card. I am capturing analogue data via DMA.
I am using a double buffer, taking 1280 (10*128 - 10 FFTs) samples at a time.
When one buffer is full I am setting a flag and I then look at 128 samples at a time and run an FFT calculation on it. All of this is running well.
The data is being sampled at the rate I want and FFT calculation is as I would expect. If I just let the program run for one second, I see that it runs the FFT approximately 343 times (44000/128).
But the problem is I would like to save 64 values from this FFT to the SD card.
I am using the HCC fat file system library.
Each loop of the FFT calculation I am copy the 64 values into an array.
After every 10 calculations I write the contents of this array to file and start again.
The array stores 640 float_32 values (10*64).
This works perfectly for a one-second test run. I get 22,000 values stored to the SD card.
But as I increase the time I start losing samples as it take the SD card longer to write. I need the SD card to store over 87 kbit/s (4 bytes * 64 * 343 = 87808) consistently. I have tried increasing the DMA buffer sample size and then the number of times it writes, but didn't find it helped.
I am using an 8G microSD card, class 4. I formatted the SD card to the default FAT32 allocation unit size 2048.
How should I organize the buffering of data to allow for this? I thought using fewer writes might help. Would a queue help? How would I implement this and would anyone have an example?
I saw that clifford had a similar problem and he was using a queue, How can I use an SD card for logging 16-bit data at 48 ksamples/s?.

In my case I got it to work by trying a large number of different cards - they vary a great deal. If I had enough RAM available for a longer buffer that would have worked too.
If you are not using an RTOS, the queue buffering option may not be available to you, or at least would be non-trivial to implement.
Using an RTOS queue, I suggest that you create a queue of messages each of length 64*sizeof(float_32), the number of messages in the queue will be determined by the ammount of card latency you need to deal with; a length of 343 for example, will sustain a card stall of 1 second, and will require 87Kb of RAM. The application will then have a high priority thread performing the FFT and placing data in the queue, while a low priority thread takes data from the queue and writes to the file.
You might improve performance further by accumulating multiple message blocks in your DMA buffer before initiating a write, and there may be some benefit in carefully selecting an optimum DMA buffer length.

Flash is very, very sensitive to overwrites. Writing 3kB and then a further 3kB may count as an overwrite of the first 4 kB. In your case, there's no good reason why you'd want such small writes anyway. I'd advise 16 kB writes (32 frames/write * 64 samples/frame * 4 bytes/sample). You'd need 5 or 6 writes per second, which should be well in spec of any old SD card.
Now it's quite likely that you'd get another 1280 samples it while writing; you'll have to deal with that on another thread. Should be no problem as the writing should block without using CPU (it's a low-level Flash delay)

The most probable cause of the problem might be the way you are interfacing the card through the library.
SD cards over the SPI protocol (which I assume being used here) can be read or written in 512 byte sector units, some SD commands making it possible to stream (to perform sequential sector access faster). An important element of the SD card SPI protocol are various delays, where you have to poll the card whether you could start an operation (such as writing data to a sector).
You should read the library's API to discover how its writing process might work. You will need to perform some regular action which in the end would poll the card to know whether the writing process could continue. Some cards might require a set number of accesses before becoming ready for an operation, some others might use timeouts for state transitions. It might not work well to have the function called relatively rarely (such as once in 2-3 milliseconds) anticipating the card getting ready meanwhile. You have to keep on nagging it whether it completed already.
Just from own experiences with SD interfacing.

Physical Memory and Virtual Memory data allocation behavior

Im interested in understanding how a computer allocates variables for physical memory vs files in virtual memory ( such as on a hard drive ), in terms of how does the computer determine know where to put data. It almost seems random in both memory storage types, but its not because it simply can't put data at a memory address or sector (any location) of a hard drive that's occupied or allocated for another process already. When I was studying how Norton's speed disk ( a program that de-fragments files on hard drives ) on my old W95 system, I noticed from the program's representation of hard drive's data ( a color coded visual map of different data types, e.g. swap files were always first at the top.), consisting of many files spread out all over the hard drive with empty unused areas. In addition some of these areas, I saw what looked like a mix of data and empty space showed a spotty pattern. I want to think its random for that to happen. Like wise, when I was studying the memory addresses of a simple program I wrote in C, I noticed that each version of my program after recompiling it after changes - showed different addresses for segments and offsets. I was expecting the computer to use the same address when I recompiled it. Sometimes the same address would be used, other times it was different. Again, I want to think its random also for memory locations to be chosen by programs. I thought that memory allocation or file writing was based on the first empty space available, written in a contiguous manner.
So my question is, I want to know how and what is it in the logic works of a common computer, that decides where it writes its data in such a arbitrary manner for either type of location (physical RAM or Dynamic )? What area of computer science (if not assembly language) would I need to study that would explains this, almost random behavior?
Thanks in Advance

Something broader and directly from computer science would be a linked list. http://en.wikipedia.org/wiki/Linked_list
Imagine if you had a linked list and simply added items to the end, these items might live linearly in memory or disk or whatever somewhere. But as you remove some items in the middle of the list by having say item number 7 point at item number 9 eliminating item number 8. As with memory allocation for allocs or virtual memory or hard drive sector allocation, etc how fast you fragment your storage has to do with the algorithm you use for allocating the next item.
file systems can/do use a link list type scheme to keep track of what sectors are tied to a single file. it is fast and easy to use the link list but deal with fragmentation. A much slower method would be to have no fragmentation but be constantly copying/moving files around to keep them on linear sectors.
malloc() allocation schemes and MMU allocation schemes also fall under this category. Basically any time you take something, slice it up into fractions and put a virtual interface in front of those fractions to give the appearance to the programmer/user that they are linear. Malloc() (not counting the virtual memory via the MMU) is the other way around allocating a number of linear chunks of those fractions to meed the alloc need, and having an alloc/free scheme that attempts to keep as many large chunks available, just in case, a bad malloc system is one where you have half of your memory free but the maximum malloc that works without an out of memory error is a malloc of a small fraction of that memory, say you have a gig free and can only allocate 4096 bytes.

You should look at virtual memory and TLB (translation lookaside buffer) or paging.
It is not trivial to implement virtual memory and paging. The performance of your whole system depends on it. If it's not done properly your system will thrash.
It is early morning here so Wikipedia will have to do for now: https://en.m.wikipedia.org/wiki/Translation_lookaside_buffer
EDIT:
Those coloured spots you saw in your defrag were chunks on your HDD. Each chunk is of some specified size. Depending on how fragmented your HDD is, you might have portions of your HDD that look like this:
*-*-***-***-*
where * means full, and - means empty
This (above) could be part of one application/file or multiple files; I will assume one file is split across those to simplify my example. At the end of each * there is a pointer to the next location where the next * chunk is (this is called a linked list). The more fragmented your HDD is (or memory) the more of these pointers to next chunk you will have. This in turn uses more space for next pointers instead of using space for data and the result is more overhead when reading that data. If this is a file on disk, you will have multiple seeks (which are bad because they're slow) if your data is not grouped together (locality principle). When you use defrag, it moves and groups all chunks together (as best as it can).
*-*-***-***-*
becomes
*********----
The OS decides paging and virtual memory addressing (and such). TLB is a hardware (a cache) that aids this process (it maps physical memory to virtual memory addresses for fast look up). The CPU communicates with the TLB via MMU
To answer your questions
You should study operating systems.
Yes the locations where to place your files on HDD are decided by the OS. If you deleted a file and download it again, there is no guarantee it will be placed in the same location-most likely not.
A nice summary of all these components and principles I mentioned here work: Click Here. It's a ppt with slides from a Real Time Operating Systems book (if I'm not mistaken the same exact one I used)

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas