Control memory bandwidth in Gem5 - gem5

Is there any way to control memory bandwidth in Gem5? I want to analyze the impact of memory bandwidth on a micro-benchmark by varying the bandwidth. From the stats I can see that Gem5 reports the consumed read and write bandwidth at the memory controller, but I want to change the available memory bandwidth, either to throttle it or to relax it.
Thanks in advance

Related

Cortex-M3 External RAM Region

I'm currently researching topics such as RAM/ROM/Stack/Heap and data segments etc.
I was looking at the ARM Cortex-M3 memory map and saw the region labeled "External RAM".
According to the data sheet of a random Cortex-M3 STM32 MCU the external RAM region is mapped from 0x60000000-0x9FFFFFFF, so it is quite large!
I couldn't find a definitive answer about how this region is actually used.
I imagine you would have an external SRAM and you would choose between two options.
(1) Read via the SPI interface and place into a local buffer(stack), then load that local buffer into the external ram region. This option seems to have a lot of negative consequences, such as hogging the CPU and increasing the stack temporarily if the requested data is very large.
(2) Utilize a DMA and transfer from the SPI interface into the external ram region.
Now I can't understand, why you would map the data to this specific address range, what are the advantages, why don't you just place the data directly in that huge memory region?
Now I'm asking this question because I have a slight feeling I have completely missed the point of what the External RAM region really is.
-Edit-
In the data sheet that is linking to the STM32 device, the memory region "External RAM" is marked as reserved. It is my conclusion that the memory regions listed by ARM is showing the full potential of a 32bit MCU, as I incorrectly state that the external RAM region "is quite large!" does not necessarily mean that this is "real" size of that region, if it is even used, it depends on what the vendor can physically achieve within the MCU hardware, and I imagine they would limit hardware capabilities to be competitive on price, power consumption etc.
I imagine you would have an external SRAM and you would choose
between two options.
(1) Read via the SPI interface and place into a local buffer(stack), then load that local buffer into the external ram region. This option
seems to have a lot of negative consequences, such as hogging the CPU
and increasing the stack temporarily if the requested data is very
large.
(2) Utilize a DMA and transfer from the SPI interface into the external ram region.
None of the above. External memory on an SPI bus is not memory mapped. If you have an SPI memory, it is not mapped to that region; it is simply an SPI device, and the "address" is simply an offset from the start of the memory device itself. Memory attached to an MCU with a Quad- or Octo-SPI controller can, however, be memory mapped. QSPI RAM is not that common and is relatively expensive; QSPI is more commonly used for flash memory.
The external memory region can be used by STM32 parts with an FSMC (Flexible Static Memory Controller) or an FMC (Flexible Memory Controller), or a QSPI interface. The FMC also supports SDRAM, and is generally available on the higher-end parts. Apart from the QSPI and NAND flash, these interfaces require using the GPIO EMIF (external memory interface) alternate function to create an address and data bus, so they generally require parts with a high pin count. The EMIF can be configured for an 8-, 16- or 32-bit data bus; a narrower bus reduces the pin count at the cost of slower access.
Now I can't understand, why you would map the data to this specific
address range, what are the advantages, why don't you just place the
data directly in that huge memory region?
Since it was precipitated by your earlier misconception this question is perhaps redundant, but memory that exists in the memory map can be used to store data accessed as regular variables, rather than transferring it to and from internal buffers, and it can be used as an execution region: code can be loaded to and executed directly from such memory.
Now I'm asking this question because I have a slight feeling I have completely missed the point of what the External RAM region really is.
Self awareness is a skill. That is known as conscious incompetence and is a motivator for learning.
It is my conclusion that the memory regions listed by ARM is showing the full potential of a 32bit MCU, as I incorrectly state that the external RAM region "is quite large!" does not necessarily mean that this is "real" size of that region, if it is even used, it depends on what the vendor can physically achieve within the MCU hardware, and I imagine they would limit hardware capabilities to be competitive on price, power consumption etc.
No, it is largely about the number of pins available for an address bus (except for QSPI). The external memory is a matter for the board design - it is not something the MCU vendor decides must be present. The constraint is a maximum, not a required amount of physical memory. The STM32 FMC supports the following memory sizes/types:
So you can have up to 512Mb of SDRAM for example. The space available for static memories (NOR/PSRAM/SRAM) is significantly larger than the typical size of such memories.
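To put the numbers in this thread in perspective, here is a small sketch that computes the size of the address range quoted in the question and the number of address pins a parallel memory of a given size would need. The 1 MiB SRAM and the bus widths in the usage lines are hypothetical examples, not values taken from any data sheet.

```python
# Address range quoted in the question for the "External RAM" region.
REGION_START = 0x60000000
REGION_END = 0x9FFFFFFF  # inclusive


def region_size_bytes(start: int, end: int) -> int:
    """Number of byte addresses in an inclusive address range."""
    return end - start + 1


def address_lines_needed(size_bytes: int, data_bus_bits: int) -> int:
    """Address pins needed to select every bus-wide word of a parallel memory.

    Each address selects one word of data_bus_bits, so a wider data bus
    needs fewer address lines for the same capacity.
    """
    words = size_bytes // (data_bus_bits // 8)
    return (words - 1).bit_length()


size = region_size_bytes(REGION_START, REGION_END)
print(f"Region size: {size // 2**20} MiB")

# Hypothetical 1 MiB external SRAM on 16-bit and 8-bit data buses.
print("16-bit bus:", address_lines_needed(2**20, 16), "address lines")
print(" 8-bit bus:", address_lines_needed(2**20, 8), "address lines")
```

The region is only a window of addresses; how much of it is backed by real memory depends on the pins available and on what the board designer attaches.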

What is difference between upload speed and upload throughput?

I am trying to simulate different network speeds using selenium
Maybe I'm missing the point of the question but:
"Bandwidth and throughput have to do with speed, but what's the difference? To be brief, bandwidth is the theoretical speed of data on the network, whereas throughput is the actual speed of data on the network."
Pretty much: bandwidth is what your ISP will market to you, but your throughput is what you'll actually get on your side, in terms of speed. Throughput will almost always be lower than the marketed/advertised bandwidth.
source:
https://study.com/academy/lesson/bandwidth-vs-throughput.html#:~:text=Lesson%20summary,fast%20data%20is%20being%20sent.&text=Bandwidth%20refers%20to%20the%20theoretical,data%20on%20your%20network%20travels.
Possibly the term upload speed, in a broader sense, just refers to internet speed, which matters for both uploading and downloading. Bandwidth and throughput are the two major indicators of speed, where:
Bandwidth is the theoretical speed of data on the network.
Throughput is the actual speed of data on the network.
Bandwidth
In essence, bandwidth refers to the maximum amount of data you can get from point A to point B in a specific amount of time. These days, when dealing with computers, bandwidth refers to how many bits of information we can theoretically transmit in a specific amount of time, expressed in bits per second, e.g. Kbps (kilobits per second) and Mbps (megabits per second).
Throughput
Throughput can never exceed what the bandwidth allows, and is in practice lower, because factors like latency (delays), jitter (irregularities in the signal), and error rate (actual mistakes during transmission) reduce the overall throughput.
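The distinction can be made concrete with a little arithmetic. The sketch below computes measured throughput from a transfer and compares it with the advertised bandwidth; the function names and the figures (10 MB uploaded in 10 s on a 10 Mbps link) are made up purely for illustration.

```python
def throughput_bps(bytes_transferred: int, seconds: float) -> float:
    """Measured throughput in bits per second."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return bytes_transferred * 8 / seconds


def utilization(throughput: float, bandwidth: float) -> float:
    """Fraction of the theoretical bandwidth actually achieved."""
    return throughput / bandwidth


# Hypothetical measurement: 10 MB uploaded in 10 s on a 10 Mbps link.
advertised_bps = 10_000_000
measured_bps = throughput_bps(10_000_000, 10.0)

print(f"bandwidth:  {advertised_bps / 1e6:.1f} Mbps")
print(f"throughput: {measured_bps / 1e6:.1f} Mbps")
print(f"achieved:   {utilization(measured_bps, advertised_bps):.0%}")
```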
I think what you are looking for is the method to do it.
Sets Chromium network emulation settings.
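from selenium import webdriver

driver = webdriver.Chrome()  # network emulation is Chromium-only
driver.set_network_conditions(
    offline=False,
    latency=5,  # additional latency (ms)
    download_throughput=500 * 1024,  # maximal throughput (bytes/s)
    upload_throughput=500 * 1024)  # maximal throughput (bytes/s)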
Note: 'throughput' can be used to set both (for download and upload).
Source
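Since network speeds are usually quoted in Mbps while the emulation settings above take plain numbers, a small helper can build the keyword arguments for a few profiles. This is only a sketch: it assumes the throughput values are interpreted as bytes per second (as in the Chrome DevTools network emulation), and the profile names and speeds are hypothetical.

```python
def mbps_to_bytes_per_sec(mbps: float) -> int:
    """Convert megabits per second to bytes per second."""
    return round(mbps * 1_000_000 / 8)


def network_conditions(download_mbps: float, upload_mbps: float,
                       latency_ms: int = 0, offline: bool = False) -> dict:
    """Build keyword arguments for driver.set_network_conditions()."""
    return {
        "offline": offline,
        "latency": latency_ms,
        "download_throughput": mbps_to_bytes_per_sec(download_mbps),
        "upload_throughput": mbps_to_bytes_per_sec(upload_mbps),
    }


# Hypothetical profiles; the names and speeds are illustrative only.
PROFILES = {
    "slow": network_conditions(0.4, 0.4, latency_ms=400),
    "medium": network_conditions(1.6, 0.8, latency_ms=150),
    "fast": network_conditions(8, 8, latency_ms=20),
}

for name, conditions in PROFILES.items():
    print(name, conditions)
```

With a Chrome driver, a profile would then be applied as driver.set_network_conditions(**PROFILES["slow"]).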

Is redis using cache?

I restarted my redis server after 120 days.
Before restart, memory usage 29.5GB
After restarted, memory usage 27.5GB
So where did the 2GB reduction come from?
Was memory simply freed in RAM, as this article describes? https://redis.io/topics/memory-optimization
Redis will not always free up (return) memory to the OS when keys are
removed. This is not something special about Redis, but it is how most
malloc() implementations work. For example if you fill an instance
with 5GB worth of data, and then remove the equivalent of 2GB of data,
the Resident Set Size (also known as the RSS, which is the number of
memory pages consumed by the process) will probably still be around
5GB, even if Redis will claim that the user memory is around 3GB. This
happens because the underlying allocator can't easily release the
memory. For example often most of the removed keys were allocated in
the same pages as the other keys that still exist. The previous point
means that you need to provision memory based on your peak memory
usage. If your workload from time to time requires 10GB, even if most
of the times 5GB could do, you need to provision for 10GB.
However allocators are smart and are able to reuse free chunks of
memory, so after you freed 2GB of your 5GB data set, when you start
adding more keys again, you'll see the RSS (Resident Set Size) to stay
steady and don't grow more, as you add up to 2GB of additional keys.
The allocator is basically trying to reuse the 2GB of memory
previously (logically) freed.
Because of all this, the fragmentation ratio is not reliable when you
had a memory usage that at peak is much larger than the currently used
memory. The fragmentation is calculated as the physical memory
actually used (the RSS value) divided by the amount of memory
currently in use (as the sum of all the allocations performed by
Redis). Because the RSS reflects the peak memory, when the (virtually)
used memory is low since a lot of keys / values were freed, but the
RSS is high, the ratio RSS / mem_used will be very high.
Or was it cache memory used by my Redis server that got freed?
Is redis using cache? Cache of cache?
Thanks!
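As a rough illustration of the fragmentation ratio discussed in the quoted article: Redis reports mem_fragmentation_ratio in INFO memory as used_memory_rss divided by used_memory. The sketch below applies that to the article's own example (5GB filled, 2GB removed); the function name is hypothetical.

```python
GIB = 1024 ** 3


def fragmentation_ratio(rss_bytes: int, used_bytes: int) -> float:
    """RSS divided by the memory Redis itself accounts for."""
    return rss_bytes / used_bytes


# Example from the quoted article: the RSS stays around 5GB
# while Redis only accounts for about 3GB after keys are removed.
ratio = fragmentation_ratio(5 * GIB, 3 * GIB)
print(f"fragmentation ratio: {ratio:.2f}")
```

A restart begins with a fresh heap, so the RSS no longer carries pages the allocator could not hand back, which is one plausible source of a drop like the one in the question.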

How does memory use affect battery life?

How does memory allocation affect battery usage? Does holding lots of data in variables consume more power than performing many iterations of basic calculations?
P.S. I'm working on a scientific app for mac, and want to optimize it for battery consumption.
The amount of data you hold in memory doesn't influence the battery life as the complete memory has to be refreshed all the time, whether you store something there or not (the memory controller doesn't know whether a part is "unused", AFAIK).
By contrast, calculations do require power. Especially if they might wake up the CPU from an idle or low power state.
I believe RAM power consumption is identical regardless of whether it's full or empty. However, the more physical RAM you have in the machine, the more power it will consume.
On a mac, you will want to avoid hitting the hard drive, so try to make sure you don't read the disk very often and definitely don't consume so much RAM you start using virtual memory (or push other apps into virtual memory).
Most modern macs will also partially power down the CPU(s) when they aren't very busy, so reducing CPU usage will actually reduce power consumption.
On the other hand, when your app uses more memory it pushes other apps' cached data out of memory, and rebuilding that data can have some battery cost if the user decides to switch from one app to the other, but that, I think, will be negligible.
It's best to minimize your application's memory footprint once it transitions to the background, simply to allow more applications to hang around and not be terminated. Also, applications are terminated in descending order of memory size, so if your application is the largest one existing in the background, it will be killed first.

Direct memory access DMA - how does it work?

I read that if DMA is available, then processor can route long read or write requests of disk blocks to the DMA and concentrate on other work. But, DMA to memory data/control channel is busy during this transfer. What else can processor do during this time?
First of all, DMA (per se) is almost entirely obsolete. As originally defined, DMA controllers depended on the fact that the bus had separate lines to assert for memory read/write, and I/O read/write. The DMA controller took advantage of that by asserting both a memory read and I/O write (or vice versa) at the same time. The DMA controller then generated successive addresses on the bus, and data was read from memory and written to an output port (or vice versa) each bus cycle.
The PCI bus, however, does not have separate lines for memory read/write and I/O read/write. Instead, it encodes one (and only one) command for any given transaction. Instead of using DMA, PCI normally does bus-mastering transfers. This means instead of a DMA controller that transfers memory between the I/O device and memory, the I/O device itself transfers data directly to or from memory.
As for what else the CPU can do at the time, it all depends. Back when DMA was common, the answer was usually "not much" -- for example, under early versions of Windows, reading or writing a floppy disk (which did use the DMA controller) pretty much locked up the system for the duration.
Nowadays, however, the memory typically has considerably greater bandwidth than the I/O bus, so even while a peripheral is reading or writing memory, there's usually a fair amount of bandwidth left over for the CPU to use. In addition, a modern CPU typically has a fairly large cache, so it can often execute some instructions without using main memory at all.
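A back-of-the-envelope sketch of that last point: if a peripheral streams data at some rate, the memory bandwidth left for the CPU is roughly the difference. The two figures below are assumptions chosen for illustration (a 25.6 GB/s memory channel and a device transferring 1 GB/s), not measurements, and the model ignores contention effects.

```python
def remaining_bandwidth(memory_gbps: float, dma_gbps: float) -> float:
    """Memory bandwidth (GB/s) left for the CPU during a transfer."""
    if dma_gbps > memory_gbps:
        raise ValueError("transfer rate exceeds memory bandwidth")
    return memory_gbps - dma_gbps


# Assumed figures, for illustration only.
MEMORY_GBPS = 25.6  # one memory channel
DEVICE_GBPS = 1.0   # peripheral doing a bus-mastering transfer

left = remaining_bandwidth(MEMORY_GBPS, DEVICE_GBPS)
print(f"left for the CPU: {left:.1f} GB/s ({left / MEMORY_GBPS:.0%})")
```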
Well, the key point to note is that the CPU bus is only ever partly used by the DMA, and the rest of the channel is free for any other jobs/processes to use. This is the key advantage of DMA over programmed I/O. Hope this answered your question :-)
But, DMA to memory data/control channel is busy during this transfer.
Being busy doesn't mean you're saturated and unable to do other concurrent transfers. It's true the memory may be a bit less responsive than normal, but CPUs can still do useful work, and there are other things they can do unimpeded: crunch data that's already in their cache, receive hardware interrupts etc. And it's not just about the quantity of data, but the rate at which it's generated: some devices create data in hard real-time and need it to be consumed promptly, otherwise it's overwritten and lost. To handle this without DMA, the software may have to nail itself to a CPU core and then spin waiting and reading - avoiding being swapped onto some other task for an entire scheduler time slice - even though most of the time further data's not even ready.
During a DMA transfer, the CPU is idle and has no control over the memory bus. The CPU is put into an idle state by placing its bus lines in a high-impedance state.