How can I verify that my hardware prefetcher is disabled

I have disabled hardware prefetching using the following steps:
Installed msr-tools 1.3
wrmsr -a 0x1A4 1
The prefetcher control for my system (Broadwell) is at MSR address 0x1A4, as shown in the Intel documentation.
I ran rdmsr -a 0x1A4 and the output showed 1.
According to the Intel docs, if the bit corresponding to a particular prefetcher is set to 1, that prefetcher is disabled.
Is there any other way I can verify that my hardware prefetchers have been disabled?

A disabled prefetcher will slow down operations that benefit from an enabled one. You will need to write some code (probably in assembly) and measure its performance with the prefetcher enabled and disabled.
Quite a while ago I wrote a test program to measure memory read performance. It repeatedly read memory in blocks of different sizes, and it showed a clear correlation between the block size and the capacities of the different levels of the memory cache.
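Below is a minimal sketch of that kind of measurement in C++ (not the original program; the block sizes and iteration counts are arbitrary). It times sequential passes over buffers of increasing size; running it once with the prefetchers enabled and once with them disabled should show a clear drop in read bandwidth for the buffers that no longer fit in cache.

    #include <chrono>
    #include <cstddef>
    #include <cstdint>
    #include <cstdio>
    #include <vector>

    // Sequentially read a buffer of `size` bytes `iters` times and
    // return the average read bandwidth in GiB/s.
    static double measure_read_bw(std::size_t size, int iters) {
        std::vector<std::uint64_t> buf(size / sizeof(std::uint64_t), 1);
        volatile std::uint64_t sink = 0;  // keeps the reads from being optimized away

        auto start = std::chrono::steady_clock::now();
        for (int i = 0; i < iters; ++i) {
            std::uint64_t sum = 0;
            for (std::uint64_t v : buf) sum += v;
            sink = sum;
        }
        auto end = std::chrono::steady_clock::now();
        (void)sink;

        double secs = std::chrono::duration<double>(end - start).count();
        return (double(size) * iters) / secs / (1024.0 * 1024.0 * 1024.0);
    }

    int main() {
        // Block sizes chosen to straddle typical L1/L2/LLC capacities.
        for (std::size_t kib : {16, 64, 256, 1024, 4096, 16384, 65536}) {
            double bw = measure_read_bw(kib * 1024, 20);
            std::printf("%8zu KiB: %6.2f GiB/s\n", kib, bw);
        }
        return 0;
    }

Compile with optimizations (e.g. g++ -O2) and pin the process to one core to reduce noise; the interesting comparison is the same binary run before and after toggling MSR 0x1A4.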

You can run some I/O-intensive and prefetch-friendly workloads to verify.
By the way, my CPU is a Gold 6240, and setting 0x1A4 to 1 did not work on it.
Instead, I use sudo wrmsr -a 0x1A4 0xf, which sets all four prefetcher-disable bits rather than just bit 0.

Related

USRP N210 overflows in virtual machine using GnuRadio

I am using the USRP N210 through a Debian (4.19.0-6-amd64 #1 SMP Debian 4.19.67-2+deb10u1) VM and very quickly run into processing overflows. GnuRadio-Companion prints the letter "D" the moment one of the CPUs reaches 100 % load. This was tested by increasing the number of taps of a low-pass filter, as shown in the picture, with a sampling rate of 6.25 MHz.
I have followed all the instructions in How to tune a USRP, except for the CPU governor, because cpufreq-info reports a missing driver. The exact output is
No or unknown cpufreq driver is active on this CPU.
The output of the lscpu command is also shown in a picture.
Does anyone have an idea how I can resolve the problem? Or is GnuRadio just not fully supported in VMs?
Dropping packets when your CPU can't keep up is expected; the "D" output is the direct effect of that.
The problem most likely lies not within your VM, but with the virtualizer.
Virtualization adds some overhead, and whilst modern virtualizers have gotten pretty good at it, you're asking an application with hard real-time requirements to run under high network load.
This might take away CPU cycles on your host side that your VM doesn't even know of – your 100% is less than it looks!
So, first of all, make sure your virtualizer does as little to the network traffic as possible. In particular, no NAT; hardware bridging is the best case.
Then, the freq-xlating FIR definitely isn't the highest-performing block. Try using a rotator instead, followed by an FFT FIR. In your case, let that FIR decimate by a factor of 2 – you've done enough low-pass filtering to reduce the sampling rate without getting aliases.
Lastly, it might be a good idea to use a newer version of GNU Radio. In Debian testing, apt will get you a GNU Radio from the 3.8 release series.

What is the difference between the gem5 CPU models and which one is more accurate for my simulation?

When running a simulation in gem5, I can select a CPU with fs.py --cpu-type.
This option also prints a list of all available CPU types if I pass an invalid CPU type.
What is the difference between those CPU types and which one should I choose for my experiment?
Question inspired by: https://www.mail-archive.com/gem5-users@gem5.org/msg16976.html
An overview of the CPU types can be found at: https://cirosantilli.com/linux-kernel-module-cheat/#gem5-cpu-types
In summary:
simplistic CPUs (derived from BaseSimpleCPU): for example AtomicSimpleCPU (the default one). They have no CPU pipeline, and are therefore completely unrealistic. However, they also run much faster. Therefore, they are mostly useful to boot Linux fast and then checkpoint and switch to a more detailed CPU.
Within the simple CPUs we can notably distinguish:
AtomicSimpleCPU: memory requests finish immediately
TimingSimpleCPU: memory requests actually take time to go through to the memory system and return. Since there is no CPU pipeline however, the simulated CPU stalls on every memory request waiting for a response.
An alternative to those is to use KVM CPUs to speed up boot if host and guest ISA are the same, although as of 2019, KVM is less stable as it is harder to implement and debug.
in-order CPUs: derived from the generic MinorCPU by parametrization (Minor stands for "in order"):
for ARM: HPI is made by ARM and models a "(2017) modern in-order Armv8-A implementation". This is your best in-order ARM bet.
out-of-order CPUs: derived from the generic DerivO3CPU by parametrization (O3 stands for "out of order"):
for ARM: there are no models specifically published by ARM as of 2019. The only specific O3 model available is ex5_big for an A15, but you would have to verify its authors' claims on how well it models the real A15 core.
If none of those are accurate enough for your purposes, you could try to create your own in-order/out-of-order models by parametrizing MinorCPU / DerivO3CPU like HPI and ex5_big do, although this could be hard to get right, as there isn't generally enough public information on non-free CPUs to do this without experiments or reverse engineering.
The other thing you will want to think about is the memory system model. There are basically two choices: classical vs Ruby, and within Ruby, several options are available, see also: https://cirosantilli.com/linux-kernel-module-cheat/#gem5-ruby-build

Exceed Redis maxmemory

I am experimenting with redis 3.0 eviction policies on my local machine - I'd like to limit max memory so redis cannot consume more than 20 megabytes.
my configuration:
loglevel debug
maxmemory 20mb
maxmemory-policy noeviction
from here, I run redis-server with my configuration followed by
redis-benchmark -q -n 100000 -c 50 -P 12
to store a bunch of keys in memory. This puts memory usage for Redis at 21 MB on my Mac, 1 megabyte over the specified limit. If I run it again, even more is consumed.
According to the Redis documentation this should be controlled by my maxmemory directive and eviction policy, with an error thrown on subsequent writes, but I am not finding that this is the case.
Why is redis-server consuming more memory than allotted?
The Redis maxmemory setting controls the user data memory usage (as Itamar Haber says in a comment). But memory consumption is a more complex matter:
It depends on the operating system.
It depends on the CPU and the compiler used (i.e. whether Redis is built as x86 or x64).
It depends on the allocator used (jemalloc by default in Redis).
In a real-world application (such as Redis) you have limited control over memory management, so the same application will consume different amounts of memory when compiled as x64 or x86. In the case of Redis, the data overhead can approach a factor of two.
Why this is important
Each time you write some data to Redis, it allocates or reallocates memory through the allocator. The latter (jemalloc) has a complex strategy for this. In a few words: it allocates a size rounded up to the nearest power of two (if you need 17 bytes, 32 will be allocated). Many Redis structures use the same policy. For example HASH (and ZSET, because a HASH is used under the hood) uses a policy like that. Strings use an even more brute-force strategy: they double the size (with reallocation) while under SDS_MAX_PREALLOC (1 MB), and beyond that just allocate the needed size + SDS_MAX_PREALLOC.
So if you limit maxmemory, the memory actually used at the OS level can be a lot more, and Redis can't do anything about that.
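As a rough illustration of the rounding effect described above (a simplified model; jemalloc's real size classes are more fine-grained than pure powers of two):

    #include <cstddef>
    #include <cstdio>

    // Round a requested allocation size up to the next power of two,
    // as a simplified model of allocator size-class rounding.
    static std::size_t round_up_pow2(std::size_t n) {
        std::size_t p = 1;
        while (p < n) p <<= 1;
        return p;
    }

    int main() {
        for (std::size_t request : {17, 100, 1000, 5000}) {
            std::size_t allocated = round_up_pow2(request);
            std::printf("request %5zu bytes -> allocate %5zu bytes (%.0f%% overhead)\n",
                        request, allocated, 100.0 * (allocated - request) / request);
        }
        return 0;
    }

The 17-byte request from the text comes out as a 32-byte allocation, and in the worst case the rounding alone nearly doubles the footprint, before any per-key bookkeeping is counted.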
p.s. Please do not take this as advertising. Your question is very close to me. Here is a series of articles about real memory usage in Redis (they are all in Russian, sorry for that; I am planning to translate them into English over the New Year weekend and will update the links here after that. A partially translated version is available here).

u-boot : Relocation

This is a basic question related to u-boot.
Why does the u-boot code relocate itself?
OK, it makes sense if u-boot is executing from NOR flash or boot ROM space, but if it already runs from SDRAM, why does it have to relocate itself once again?
This question comes up frequently. Good answers sometimes too.
I agree it is handy to load the build to SDRAM during development. That works for me, I do it all the time. I have some special boot code in flash which does not enable MMU/cache. For my u-boot builds I switch CONFIG_SYS_TEXT_BASE between flash and ram builds. I run my development builds that way routinely.
As a practical matter, handling re-initialization of the MMU/cache would be nontrivial. And U-Boot benefits IMO from simplicity, as a result of leaving out things like that.
The tech lead at Denx has expressed his opinion. IIRC his other posts are more strongly worded than that one. I get the impression that he does not like to repeat himself.
Update: why relocate. Memory access is faster from RAM than from ROM; this matters particularly if the target has no instruction cache. Executing from RAM allows reprogramming the flash; it also (more minor) allows software breakpoints with "trap" instructions; and it is more like the target's normal mode of operation, so if, e.g., burst reads from RAM are iffy, the failure will be seen at early boot.
U-boot has to reserve 3 regions in memory that store: 1) u-boot itself, 2) the uImage (compressed kernel), and 3) the uncompressed kernel. These 3 regions must be carefully placed in u-boot to prevent conflicts.
However, the previous-stage boot loader (BL2 or BL1) that brings u-boot into DRAM doesn't know u-boot's plan for these 3 regions. So it can only load u-boot at a lower address in DRAM and jump to it. Then, after u-boot executes some basic initialization and detects that the current PC is not at the planned location, u-boot calls the relocation function, which moves u-boot to the planned location and jumps to it.
The code in NOR flash must first initialize the SDRAM and then copy the code from NOR flash to SDRAM. The code copies itself because, once the MMU is enabled, virtual address mapping starts.

How to avoid Boost ASIO reactor becoming constrained to a single core?

TL;DR: Is it possible that I am reactor throughput limited? How would I tell? How expensive and scalable (across threads) is the implementation of the io_service?
I have a fairly massively parallel application, running on a hyperthreaded-dual-quad-core-Xeon machine with tons of RAM and a fast SSD RAID. This is developed using boost::asio.
This application accepts connections from about 1,000 other machines, reads data, decodes a simple protocol, and shuffles data into files mapped using mmap(). The application also pre-fetches "future" mmap pages using madvise(WILLNEED) so it's unlikely to be blocking on page faults, but just to be sure, I've tried spawning up to 300 threads.
This is running on Linux kernel 2.6.32-27-generic (Ubuntu Server x64 LTS 10.04). Gcc version is 4.4.3 and boost::asio version is 1.40 (both are stock Ubuntu LTS).
Running vmstat, iostat and top, I see that disk throughput (both in TPS and data volume) is in the single digits of %. Similarly, the disk queue length is always a lot smaller than the number of threads, so I don't think I'm I/O bound. Also, the RSS climbs but then stabilizes at a few gigs (as expected) and vmstat shows no paging, so I imagine I'm not memory bound. CPU is constant at 0-1% user, 6-7% system and the rest idle. Clue! One full "core" (remember hyper-threading) is 6.25% of the CPU.
I know the system is falling behind, because the client machines block on TCP send when more than 64kB is outstanding, and report the fact; they all keep reporting this fact, and throughput to the system is much less than desired, intended, and theoretically possible.
My guess is I'm contending on a lock of some sort. I use an application-level lock to guard a look-up table that may be mutated, so I sharded this into 256 top-level locks/tables to break that dependency. However, that didn't seem to help at all.
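Roughly, the sharded table looks like this (a simplified sketch with illustrative names and std::mutex for brevity, not the actual code):

    #include <array>
    #include <functional>
    #include <mutex>
    #include <string>
    #include <unordered_map>

    // Look-up table split into 256 independently locked shards, so writers
    // touching different shards never contend on the same mutex.
    class ShardedTable {
    public:
        void put(const std::string& key, int value) {
            Shard& s = shard(key);
            std::lock_guard<std::mutex> lock(s.mutex);
            s.map[key] = value;
        }
        bool get(const std::string& key, int& value) {
            Shard& s = shard(key);
            std::lock_guard<std::mutex> lock(s.mutex);
            auto it = s.map.find(key);
            if (it == s.map.end()) return false;
            value = it->second;
            return true;
        }
    private:
        struct Shard {
            std::mutex mutex;
            std::unordered_map<std::string, int> map;
        };
        Shard& shard(const std::string& key) {
            return shards_[std::hash<std::string>{}(key) & 0xff];
        }
        std::array<Shard, 256> shards_;
    };

    int main() {
        ShardedTable table;
        table.put("connection-42", 7);
        int v = 0;
        return table.get("connection-42", v) && v == 7 ? 0 : 1;
    }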
All threads go through one, global io_service instance. Running strace on the application shows that it spends most of its time dealing with futex calls, which I imagine have to do with the event-based implementation of the io_service reactor.
Is it possible that I am reactor throughput limited? How would I tell? How expensive and scalable (across threads) is the implementation of the io_service?
EDIT: I didn't initially find this other thread because it used a set of tags that didn't overlap mine :-/ It is quite possible my problem is excessive locking used in the implementation of the boost::asio reactor. See C++ Socket Server - Unable to saturate CPU
However, the question remains: How can I prove this? And how can I fix it?
The answer is indeed that even the latest boost::asio only calls into the epoll file descriptor from a single thread, not entering the kernel from more than one thread at a time. I can kind of understand why, because thread safety and the lifetime of objects are extremely precarious when you use multiple threads that can each get notifications for the same file descriptor. When I coded this up myself (using pthreads), it worked, and it scales beyond a single core. I am not using boost::asio at that point -- it's a shame that an otherwise well designed and portable library should have this limitation.
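A rough sketch of what I mean by coding it up myself (simplified, illustrative names, no real protocol handling): each worker thread owns its own epoll instance, so several threads can block in epoll_wait() in the kernel at the same time, and accepted sockets get handed out round-robin.

    #include <sys/epoll.h>
    #include <unistd.h>
    #include <thread>
    #include <vector>

    // One epoll instance per worker thread; each socket is registered with
    // exactly one worker, so only that thread ever sees its events.
    struct Worker {
        int epfd = -1;
        std::thread th;
    };

    static void worker_loop(int epfd) {
        epoll_event events[64];
        for (;;) {
            int n = epoll_wait(epfd, events, 64, -1);
            for (int i = 0; i < n; ++i) {
                int fd = events[i].data.fd;
                char buf[4096];
                ssize_t r = read(fd, buf, sizeof buf);
                if (r <= 0) {
                    epoll_ctl(epfd, EPOLL_CTL_DEL, fd, nullptr);
                    close(fd);
                }
                // ... decode the protocol and shuffle data into the mmap'ed files ...
            }
        }
    }

    int main() {
        unsigned nworkers = std::thread::hardware_concurrency();
        if (nworkers == 0) nworkers = 8;
        std::vector<Worker> workers(nworkers);
        for (auto& w : workers) {
            w.epfd = epoll_create1(0);
            w.th = std::thread(worker_loop, w.epfd);
        }
        // Accept loop omitted: register each new connection with the next
        // worker's epoll set, round-robin:
        //   epoll_event ev{}; ev.events = EPOLLIN; ev.data.fd = conn;
        //   epoll_ctl(workers[next++ % nworkers].epfd, EPOLL_CTL_ADD, conn, &ev);
        for (auto& w : workers) w.th.join();
        return 0;
    }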
I believe that if you use multiple io_service objects (say, one per CPU core), each run by a single thread, you will not have this problem. See the HTTP server example 2 on the Boost.Asio page.
I have done various benchmarks against the server example 2 and server example 3 and have found that the implementation I mentioned works the best.
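A minimal sketch of that idea (my own names, loosely modeled on the io_service pool in the HTTP server 2 example; newer Boost releases spell io_service as io_context): one io_service per core, each kept alive by a work guard and run by exactly one thread, with connections handed out round-robin.

    #include <boost/asio.hpp>
    #include <memory>
    #include <thread>
    #include <vector>

    // One io_service per core, each pumped by exactly one thread, so no
    // io_service is ever entered from more than one thread.
    class IoServicePool {
    public:
        explicit IoServicePool(std::size_t n) : next_(0) {
            for (std::size_t i = 0; i < n; ++i) {
                services_.push_back(std::make_unique<boost::asio::io_service>());
                work_.push_back(std::make_unique<boost::asio::io_service::work>(*services_[i]));
            }
        }
        void run() {
            std::vector<std::thread> threads;
            for (auto& s : services_)
                threads.emplace_back([&s] { s->run(); });
            for (auto& t : threads) t.join();
        }
        // Pick the io_service for the next accepted connection, round-robin.
        boost::asio::io_service& pick() {
            return *services_[next_++ % services_.size()];
        }
    private:
        std::vector<std::unique_ptr<boost::asio::io_service>> services_;
        std::vector<std::unique_ptr<boost::asio::io_service::work>> work_;
        std::size_t next_;
    };

    int main() {
        IoServicePool pool(std::thread::hardware_concurrency());
        // The acceptor would live on one of the services, and each accepted
        // socket would be constructed from pool.pick(), so all of its handlers
        // run on that one thread.
        pool.run();
        return 0;
    }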
In my single-threaded application, I found out from profiling that a large portion of the processor instructions was spent on locking and unlocking in io_service::poll(). I disabled the lock operations with the BOOST_ASIO_DISABLE_THREADS macro. It may make sense for you, too, depending on your threading situation.
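For reference, the macro just has to be visible before any Asio header is included (or be passed on the compiler command line), and it is only safe if the program genuinely never uses Asio from more than one thread:

    // Only valid if the program never touches Asio from more than one thread.
    #define BOOST_ASIO_DISABLE_THREADS
    #include <boost/asio.hpp>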