How to count the return instructions executed over a period of a program's run (from the hardware side) - performance counter

I've been working on ROP recently. When using perf to collect hardware counter information, I want to measure the number of return instructions executed by a given piece of code, but the perf interface only seems to provide a counter for branch instructions in general.

If you only care about x86 with a recent Intel CPU:
perf list on my Skylake shows there is a hardware counter for br_inst_retired.near_return. That will count only ret instructions, not other branches (but see erratum SKL091 regarding branch-instruction counters).
perf stat -e instructions,br_inst_retired.near_return,... ./a.out may be what you're looking for. You can also attach perf stat to an already-running program, or use -I 1000 to print counts every 1000 ms.
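For example (a sketch; check perf list on your own machine for the exact event name, which varies by microarchitecture):

perf stat -e instructions,br_inst_retired.near_return ./a.out
perf stat -e br_inst_retired.near_return -I 1000 -p <pid>    # attach to a running process, print counts every second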
But note that if you're looking for ROP gadgets, you can find a C3 opcode inside what normally decodes as some other instruction. So restricting yourself only to ret instructions that actually run during the target program's normal execution is more limiting than it needs to be.
e.g. a 4-byte immediate might usefully decode as something + ret if you jump into the middle of the instruction, at the right byte of the immediate.
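For example (a hypothetical encoding, not taken from any real binary):

b8 01 00 c3 90    mov eax, 0x90c30001    ; the intended decode
                                         ; jumping 3 bytes into the instruction re-decodes the tail as:
c3                ret                    ; an unintended ret gadget
90                nop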

Related

How to measure cpu time of TCL script

I have a long TCL script. I want to measure the exact CPU time the process takes in TCL.
I tried the time command in TCL, but I am not sure it is the correct way to measure the CPU time taken by the complete TCL script.
Tcl itself doesn't provide this information. You want to use the TclX package's times command.
package require Tclx
set pre [times]
# do CPU intensive operation here; for example calculating the length of a number with many digits...
string length [expr {13**12345}]
set post [times]
# the current process's non-OS CPU time is the first element; it's in milliseconds
set cpuSecondsTaken [expr {([lindex $post 0] - [lindex $pre 0]) / 1000.0}]
# The other elements are: system-cpu-time, subprocess-user-cpu-time, subprocess-system-time
You are, of course, dependent on the OS measuring this information correctly. It's not portable to non-POSIX systems (e.g., Windows) though there might be something logically similar in the TWAPI package.
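If you want something reusable, a small helper along these lines might be convenient (a sketch; it assumes Tclx is loaded and reports only the current process's user CPU time):

package require Tclx

proc cputime {script} {
    set pre [times]
    uplevel 1 $script
    set post [times]
    # user-mode CPU milliseconds consumed while running $script
    return [expr {[lindex $post 0] - [lindex $pre 0]}]
}

puts "[cputime {string length [expr {13**12345}]}] ms of CPU time"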

LLVM ScalarEvolution Pass Cannot Compute Exit Count for Loop Vectorizer

I'm trying to figure out how to run LLVM's built-in loop vectorizer. I have a small program containing an extremely simple loop (I had some output at one point which is why stdio.h is still being included despite never being used):
#include <stdio.h>

unsigned NUM_ELS = 10000;

int main() {
    int A[NUM_ELS];

    #pragma clang loop vectorize(enable)
    for (int i = 0; i < NUM_ELS; ++i) {
        A[i] = i*2;
    }

    return 0;
}
As you can see, it does nothing useful at all; I just need the for loop to be vectorizable. I'm compiling it to LLVM bitcode with
clang -emit-llvm -O0 -c loop1.c -o loop1.bc
llvm-dis -f loop1.bc
Then I'm applying the vectorizer with
opt -loop-vectorize -force-vector-width=4 -S -debug loop1.ll
However, the debug output gives me this:
LV: Checking a loop in "main" from loop1.bc
LV: Loop hints: force=? width=4 unroll=0
LV: Found a loop: for.cond
LV: SCEV could not compute the loop exit count.
LV: Not vectorizing: Cannot prove legality.
I've dug around in the LLVM source a bit, and it looks like SCEV comes from the ScalarEvolution pass, which has the task of (among other things) counting the number of back edges back to the loop condition, which in this case (if I'm not mistaken) should be the trip count minus the first trip (so 9,999 in this case). I've run this pass on a much larger benchmark and it gives me the exact same error at every loop, so I'm guessing it isn't the loop itself, but that I'm not giving it enough information.
I've spent quite a bit of time combing through the documentation and Google results to find an example of a full opt command using this transformation, but have been unsuccessful so far; I'd appreciate any hints as to what I may be missing (I'm new to vectorizing code so it could be something very obvious).
Thank you,
Stephen
Vectorization depends on a number of other optimizations that need to run first. None of them run at -O0, so you cannot expect your code to be 'just' vectorized there.
Adding -O2 before -loop-vectorize on the opt command line should help here (make sure your 'A' array is external or otherwise used, or everything will be optimized away).
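For example, a sketch of what that could look like (the pass spelling matches the legacy pass manager used here; newer opt releases spell it -passes=loop-vectorize, and newer clang marks functions optnone at -O0, so you may need to emit the IR at -O1 instead):

unsigned NUM_ELS = 10000;
int A[10000];                 /* file-scope, so the stores are not dead code */

int main(void) {
    #pragma clang loop vectorize(enable)
    for (unsigned i = 0; i < NUM_ELS; ++i)
        A[i] = i * 2;
    return A[0];              /* use a result so -O2 keeps the loop */
}

clang -emit-llvm -O0 -c loop1.c -o loop1.bc
llvm-dis -f loop1.bc
opt -O2 -loop-vectorize -force-vector-width=4 -S -debug loop1.ll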

Mono human readable GC statistics in runtime

Is there a Mono profiler mode similar to Java -Xloggc?
I would like to see a human-readable GC report while my application is running. Currently Mono can be run with the --profile=log option, but the output is in binary format and I have to run mprof-report every time to read it. The output file also contains a lot of information that is not interesting to me.
I tried to reduce the file size by specifying heapshot=14400000ms to collect statistics every few hours, but it didn't help much. In a week I had a log of a few gigabytes.
I also tried to use the "sample" profiler, but the overhead was too high.
You can use Mono's trace filters for this. Just set MONO_LOG_MASK to gc and raise MONO_LOG_LEVEL to debug. Then run your app normally and you will get human-readable GC statistics while it is running:
$ export MONO_LOG_MASK=gc
$ export MONO_LOG_LEVEL=debug
$ mono ... # run your application normally ..
...
# notice the human readable GC output
mono: GC_MAJOR: (LOS overflow) pause 26.00ms, total 26.06ms, bridge 0.00ms major 31472K/0K los 1575K/0K
Mono: GC_MINOR: (Nursery full) pause 2.30ms, total 2.35ms, bridge 0.00ms promoted 31456K major 31456K los 5135K
Mono: GC_MINOR: (Nursery full) pause 2.43ms, total 2.45ms, bridge 0.00ms promoted 31456K major 31456K los 8097K
Mono: GC_MINOR: (Nursery full) pause 1.80ms, total 1.82ms, bridge 0.00ms promoted 31472K major 31472K los 11425K

Statically Defined IDT

I'm working on a project that has tight boot time requirements. The targeted architecture is an IA-32 based processor running in 32 bit protected mode. One of the areas identified that can be improved is that the current system dynamically initializes the processor's IDT (interrupt descriptor table). Since we don't have any plug-and-play devices and the system is relatively static, I want to be able to use a statically built IDT.
However, this is proving to be troublesome on IA-32, since the 8-byte interrupt gate descriptor splits the ISR address: the low 16 bits of the ISR address appear in the first 2 bytes of the descriptor, some other bits fill the next 4 bytes, and the upper 16 bits of the address appear in the last 2 bytes.
I wanted to use a const array to define the IDT and then simply point the IDT register at it like so:
typedef struct s_myIdt {
    unsigned short isrLobits;
    unsigned short segSelector;
    unsigned short otherBits;
    unsigned short isrHibits;
} myIdtStruct;

myIdtStruct myIdt[256] = {
    { (unsigned short)myIsr0, 1, 2, (unsigned short)(myIsr0 >> 16)},
    { (unsigned short)myIsr1, 1, 2, (unsigned short)(myIsr1 >> 16)},
    etc.
Obviously this won't work, as it is illegal in C: myIsr0 is not a compile-time constant; its value is resolved by the linker (which can do only a limited amount of math), not by the compiler.
Any recommendations or other ideas on how to do this?
You ran into a well-known x86 wart. I don't believe the linker can stuff the addresses of your ISR routines into the swizzled form expected by the IDT entry.
If you are feeling ambitious, you could create an IDT builder script that takes something like the following (Linux-based) approach. I haven't tested this scheme and it probably qualifies as a nasty hack anyway, so tread carefully.
Step 1: Write a script to run 'nm' and capture the stdout.
Step 2: In your script, parse the nm output to get the memory address of all your interrupt service routines.
Step 3: Output a binary file, 'idt.bin', that has the IDT bytes all set up and ready for the LIDT instruction. Your script obviously outputs the ISR addresses in the correct swizzled form. (A hypothetical builder covering steps 1-3 is sketched after step 6.)
Step 4: Convert this raw binary into an ELF section with objcopy:
objcopy -I binary -O elf32-i386 idt.bin idt.elf
Step 5: Now the idt.elf file has your IDT binary, with symbols something like this:
> nm idt.elf
000000000000000a D _binary_idt_bin_end
000000000000000a A _binary_idt_bin_size
0000000000000000 D _binary_idt_bin_start
Step 6: relink your binary including idt.elf. In your assembly stubs and linker scripts, you can refer to symbol _binary_idt_bin_start as the base of the IDT. For example, your linker script can place the symbol _binary_idt_bin_start at any address you like.
Be careful that relinking with the IDT section doesn't move anything else in your binary, e.g. your ISR routines. Manage this in your linker script (.ld file) by putting the IDT into its own dedicated section.
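As a concrete (untested) sketch of steps 1-3, a hypothetical host-side builder could read the ISR addresses from stdin, one hex address per line in vector order, and write idt.bin; the kernel code segment selector 0x08 and the 0x8E gate type are assumptions you would adapt to your system:

/* idtgen.c - hypothetical host tool: build a 256-entry 32-bit IDT image */
#include <stdio.h>
#include <stdint.h>
#include <string.h>

int main(void) {
    uint8_t idt[256 * 8];
    char line[128];
    int vec = 0;

    memset(idt, 0, sizeof idt);
    while (vec < 256 && fgets(line, sizeof line, stdin)) {
        unsigned int addr;
        if (sscanf(line, "%x", &addr) != 1)
            continue;                      /* skip lines that are not addresses */
        uint8_t *d = &idt[vec * 8];
        d[0] = addr & 0xff;                /* ISR offset bits 7:0   */
        d[1] = (addr >> 8) & 0xff;         /* ISR offset bits 15:8  */
        d[2] = 0x08;                       /* segment selector, low byte (assumed kernel CS = 0x08) */
        d[3] = 0x00;                       /* segment selector, high byte */
        d[4] = 0x00;                       /* reserved */
        d[5] = 0x8e;                       /* present, DPL 0, 32-bit interrupt gate */
        d[6] = (addr >> 16) & 0xff;        /* ISR offset bits 23:16 */
        d[7] = (addr >> 24) & 0xff;        /* ISR offset bits 31:24 */
        vec++;
    }

    FILE *out = fopen("idt.bin", "wb");
    if (!out || fwrite(idt, 1, sizeof idt, out) != sizeof idt)
        return 1;
    fclose(out);
    return 0;
}

Something like nm kernel.elf | grep myIsr | awk '{print $1}' | ./idtgen would then feed it, provided the symbols come out in vector order (the names here are illustrative; adapt them to your own naming).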
---EDIT---
From comments, there seems to be confusion about the problem. The 32-bit x86 IDT expects the address of the interrupt service routine to be split into two different 16-bit words, like so:
 31                16 15                 0
+-------------------+-------------------+
|   Address 31-16   |                   |
+-------------------+-------------------+
|                   |   Address 15-0    |
+-------------------+-------------------+
A linker is thus unable to plug in the ISR address as a normal relocation. So, at boot time, software must construct this split format, which slows boot.
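For reference, the runtime construction being avoided looks roughly like this per gate (a sketch using the myIdtStruct from the question; the 0x8E00 type/attribute value, i.e. a present DPL-0 32-bit interrupt gate, is an assumption):

/* hypothetical boot-time setup of one gate */
void idt_set_gate(int vec, unsigned int isr, unsigned short sel) {
    myIdt[vec].isrLobits   = (unsigned short)(isr & 0xFFFF); /* ISR offset 15:0  */
    myIdt[vec].segSelector = sel;                            /* code segment selector */
    myIdt[vec].otherBits   = 0x8E00;                         /* gate type/attributes (assumed) */
    myIdt[vec].isrHibits   = (unsigned short)(isr >> 16);    /* ISR offset 31:16 */
}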

Nand partitioning in u-boot

I am working on an embedded ARM9 development board, and I want to rearrange my NAND partitions. Can anybody tell me how to do that?
In my U-Boot shell, the mtdparts command gives the following information:
Boardcon> mtdparts
device nand0 <nandflash0>, # parts = 7
 #: name        size        offset      mask_flags
 0: bios        0x00040000  0x00000000  0
 1: params      0x00020000  0x00040000  0
 2: toc         0x00020000  0x00060000  0
 3: eboot       0x00080000  0x00080000  0
 4: logo        0x00100000  0x00100000  0
 5: kernel      0x00200000  0x00200000  0
 6: root        0x03c00000  0x00400000  0
active partition: nand0,0 - (bios) 0x00040000 @ 0x00000000
defaults:
mtdids  : nand0=nandflash0
mtdparts: mtdparts=nandflash0:256k@0(bios),128k(params),128k(toc),512k(eboot),1024k(logo),2m(kernel),-(root)
Kernel boot message shows the following :
Creating 3 MTD partitions on "NAND 64MiB 3,3V 8-bit":
0x000000000000-0x000000040000 : "Boardcon_Board_uboot"
0x000000200000-0x000000400000 : "Boardcon_Board_kernel"
0x000000400000-0x000003ff8000 : "Boardcon_Board_yaffs2"
Can anybody please explain the relation between these two messages? And which one, U-Boot or the kernel, is responsible for creating the partitions on NAND flash? As far as I know the kernel does not create partitions on each boot, so why the message "Creating 3 MTD partitions"?
For flash devices, either NAND or NOR, there is no partition table on the device itself. That is, you can't read the device in a flash reader and find some table that indicates how many partitions are on the device and where each partition begins and ends. There is only an undifferentiated sequence of blocks. This is a fundamental difference between MTD flash devices and devices such as disks or FTL devices such as MMC.
The partitioning of the flash device is therefore in the eyes of the beholder, that is, either U-Boot or the kernel, and the partitions are "created" when the beholder runs. That's why you see the message Creating 3 MTD partitions. It reflects the fact that the flash partitions really only exist in the MTD system of the running kernel, not on the flash device itself.
This leads to a situation in which U-Boot and the kernel can have different definitions of the flash partitions, which is apparently what has happened in the case of the OP.
In U-Boot, you define the flash partitions in the mtdparts environment variable. In the Linux kernel, the flash partitions are defined in the following places:
In older kernels (e.g. 2.6.35 for i.MX28) the flash partitioning could be hard-coded in gpmi-nfc-mil.c or other driver source code. (what a bummer!).
In newer mainline kernels with device tree support, you can define the MTD partitions in the device tree.
In the newer kernels there is usually support for kernel command line partition definition using a command line like root=/dev/mmcblk0p2 rootwait console=ttyS2,115200 mtdparts=nand:6656k(all),1m(squash),-(jffs2)
The type of partitioning support that you have in the kernel therefore depends on the type of flash you are using, whether its driver supports kernel command-line parsing, and whether your kernel has device tree support.
In any event, there is an inherent risk of conflict between the U-Boot and kernel partitioning of the flash. Therefore, my recommendation is to define the flash partitions in the U-Boot mtdparts variable and to pass this to the kernel on the U-Boot kernel command line, assuming that your kernel supports this option.
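For example, a sketch in the U-Boot shell (the partition string is the one shown above; the console device and root settings are illustrative, not taken from the OP's board):

setenv mtdids nand0=nandflash0
setenv mtdparts 'mtdparts=nandflash0:256k@0(bios),128k(params),128k(toc),512k(eboot),1024k(logo),2m(kernel),-(root)'
setenv bootargs "console=ttySAC0,115200 root=/dev/mtdblock6 rootfstype=yaffs2 ${mtdparts}"
saveenv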
You can set the mtdparts environment variable in U-Boot to do so, but the kernel only uses it if you pass it on the kernel boot command line; otherwise it will default to the NAND partition structure in the kernel source code for your platform, which in this case is the 3-MTD-partition default.