Linux ioctl return value interpreted by who? - ioctl

I'm working with a custom kernel char device which sometimes returns large negative values (around the thousands, say -2000) for its ioctl().
In userspace, I don't get these values returned from the ioctl call. Instead I get a return value of -1 back with errno set to the negated value from the kernel module (+2000).
As far as I can read and google, __syscall_return() is the macro which is supposed to interpret negative return values as errors. But, it only seems to look for values between -1 and -125. So I didn't expect these large negative values to be translated.
Where are these return values translated? Is it expected behaviour?
I am on Linux 2.6.35.10 with EGLIBC 2.11.3-4+deb6u6.

The translation and the move to errno occur at the libc level. Both GNU libc and μClibc treat negative numbers down to at least -4095 as error conditions, per http://www.makelinux.net/ldd3/chp-6-sect-1
See https://github.molgen.mpg.de/git-mirror/glibc/blob/85b290451e4d3ab460a57f1c5966c5827ca807ca/sysdeps/unix/sysv/linux/aarch64/ioctl.S for the GNU libc implementation of ioctl.

So, with the help of BRPocock I will report my findings here.
The Linux kernel will do an error check for all syscalls along the lines of (from unistd.h):
#define __syscall_return(type, res) \
do { \
    if ((unsigned long)(res) >= (unsigned long)(-125)) { \
        errno = -(res); \
        res = -1; \
    } \
    return (type) (res); \
} while (0)
Libc will also do an error check for all syscalls along the lines of (from syscall.S):
.text
ENTRY (syscall)
PUSHARGS_6 /* Save register contents. */
_DOARGS_6(44) /* Load arguments. */
movl 20(%esp), %eax /* Load syscall number into %eax. */
ENTER_KERNEL /* Do the system call. */
POPARGS_6 /* Restore register contents. */
cmpl $-4095, %eax /* Check %eax for error. */
jae SYSCALL_ERROR_LABEL /* Jump to error handler if error. */
ret /* Return to caller. */
PSEUDO_END (syscall)
Glibc gives a reason for the 4096 value (from sysdep.h):
/* Linux uses a negative return value to indicate syscall errors,
unlike most Unices, which use the condition codes' carry flag.
Since version 2.1 the return value of a system call might be
negative even if the call succeeded. E.g., the `lseek' system call
might return a large offset. Therefore we must not anymore test
for < 0, but test for a real error by making sure the value in %eax
is a real error number. Linus said he will make sure the no syscall
returns a value in -1 .. -4095 as a valid result so we can savely
test with -4095. */
__syscall_return seems to be missing from newer kernels; I haven't researched that yet.

Related

Valgrind reports "invalid write" at "X bytes below stack pointer"

I'm running some code under Valgrind, compiled with gcc 7.5 targeting an aarch64 (ARM 64 bits) architecture, with optimizations enabled.
I get the following error:
==3580== Invalid write of size 8
==3580== at 0x38865C: ??? (in ...)
==3580== Address 0x1ffeffdb70 is on thread 1's stack
==3580== 16 bytes below stack pointer
This is the assembly dump in the vicinity of the offending code:
388640: a9bd7bfd stp x29, x30, [sp, #-48]!
388644: f9000bfc str x28, [sp, #16]
388648: a9024ff4 stp x20, x19, [sp, #32]
38864c: 910003fd mov x29, sp
388650: d1400bff sub sp, sp, #0x2, lsl #12
388654: 90fff3f4 adrp x20, 204000 <_IO_stdin_used-0x4f0>
388658: 3dc2a280 ldr q0, [x20, #2688]
38865c: 3c9f0fe0 str q0, [sp, #-16]!
I'm trying to ascertain whether this is a possible bug in my code (note that I've thoroughly reviewed my code and I'm fairly confident it's correct), or whether Valgrind will blindly report any writes below the stack pointer as an error.
Assuming the latter, it looks like a Valgrind bug since the offending instruction at 0x38865c uses the pre-decrement addressing mode, so it's not actually writing below the stack pointer.
Furthermore, at address 0x388640 a similar access (again with pre-decrement addressing mode) is performed, yet this isn't reported by Valgrind; the main difference is the use of an x register at address 0x388640 versus a q register at address 0x38865c.
I'd also like to draw attention to the large stack pointer subtraction at 0x388650, which may or may not have anything to do with the issue (note this subtraction makes sense, given that the offending C code declares a large array on the stack).
So, will anyone help me make sense of this, and whether I should worry about my code?
The code looks fine, and the write is certainly not below the stack pointer. The message seems to be a valgrind bug, possibly #432552, which is marked as fixed. OP confirms that the message is not produced after upgrading valgrind to 3.17.0.
"code declares a large array on the stack"
"should [I] worry about my code?"
I think it depends upon your desire for your code to be more portable.
Take this bit of code that I believe represents at least one important thing you mentioned in your post:
#include <stdio.h>
#include <stdlib.h>

long long foo (long long sz, long long v) {
    long long arr[sz]; // variable-length array allocated on the stack
    arr[sz-1] = v;
    return arr[sz-1];
}

int main (int argc, char *argv[]) {
    if (argc < 2) { // the element count is required
        fprintf(stderr, "usage: %s COUNT\n", argv[0]);
        return 1;
    }
    long long n = atoll(argv[1]);
    long long v = foo(n, n);
    printf("v = %lld\n", v);
    return 0;
}
$ uname -mprsv
Darwin 20.5.0 Darwin Kernel Version 20.5.0: Sat May 8 05:10:33 PDT 2021; root:xnu-7195.121.3~9/RELEASE_X86_64 x86_64 i386
$ gcc test.c
$ a.out 1047934
v = 1047934
$ a.out 1047935
Segmentation fault: 11
$ uname -snrvmp
Linux localhost.localdomain 3.19.8-100.fc20.x86_64 #1 SMP Tue May 12 17:08:50 UTC 2015 x86_64 x86_64
$ gcc test.c
$ ./a.out 2147483647
v = 2147483647
$ ./a.out 2147483648
v = 2147483648
There are at least some minor portability concerns with this code. The amount of allocatable stack memory differs significantly between these two environments, and those are only two platforms. I haven't tried it on my Windows 10 VM, but I don't think I need to, because I got bitten by this one a long time ago.
Beyond the OP's issue, which was due to a Valgrind bug, the title of this question is bound to attract more people (like me) who are getting "invalid write at X bytes below stack pointer" as a legitimate error.
My piece of advice: check that the address you're writing to is not a local variable of another function (not present in the call stack)!
I stumbled upon this issue while attempting to write into the address returned by yyget_lloc(yyscanner) while outside of function yyparse (the former returns the address of a local variable in the latter).

assertion from valgrind mc_main.c

My Valgrind runs report errors like this:
Memcheck: mc_main.c:8292 (mc_pre_clo_init): Assertion 'MAX_PRIMARY_ADDRESS == 0x1FFFFFFFFFULL' failed.
What does this mean? Is it a valgrind internal error or an error from my program?
This is a Valgrind internal error. It is very odd, because the failed
assertion is a self-check done very early on.
You should file a bug in the Valgrind Bugzilla, reporting all the needed
details (version, platform, ...)
From the Valgrind source code (from git HEAD)
/* Only change this. N_PRIMARY_MAP *must* be a power of 2. */
#if VG_WORDSIZE == 4
/* cover the entire address space */
# define N_PRIMARY_BITS 16
#else
/* Just handle the first 128G fast and the rest via auxiliary
primaries. If you change this, Memcheck will assert at startup.
See the definition of UNALIGNED_OR_HIGH for extensive comments. */
# define N_PRIMARY_BITS 21
#endif
/* Do not change this. */
#define N_PRIMARY_MAP ( ((UWord)1) << N_PRIMARY_BITS)
/* Do not change this. */
#define MAX_PRIMARY_ADDRESS (Addr)((((Addr)65536) * N_PRIMARY_MAP)-1)
...
tl_assert(MAX_PRIMARY_ADDRESS == 0x1FFFFFFFFFULL);
So it looks like something has been changed that shouldn't have been.

LLVM ScalarEvolution Pass Cannot Compute Exit Count for Loop Vectorizer

I'm trying to figure out how to run LLVM's built-in loop vectorizer. I have a small program containing an extremely simple loop (I had some output at one point which is why stdio.h is still being included despite never being used):
#include <stdio.h>

unsigned NUM_ELS = 10000;

int main() {
    int A[NUM_ELS];

    #pragma clang loop vectorize(enable)
    for (int i = 0; i < NUM_ELS; ++i) {
        A[i] = i*2;
    }

    return 0;
}
As you can see, it does nothing at all useful; I just need the for loop to be vectorizable. I'm compiling it to LLVM bitcode with
clang -emit-llvm -O0 -c loop1.c -o loop1.bc
llvm-dis -f loop1.bc
Then I'm applying the vectorizer with
opt -loop-vectorize -force-vector-width=4 -S -debug loop1.ll
However, the debug output gives me this:
LV: Checking a loop in "main" from loop1.bc
LV: Loop hints: force=? width=4 unroll=0
LV: Found a loop: for.cond
LV: SCEV could not compute the loop exit count.
LV: Not vectorizing: Cannot prove legality.
I've dug around in the LLVM source a bit, and it looks like SCEV comes from the ScalarEvolution pass, which has the task of (among other things) counting the number of back edges back to the loop condition, which in this case (if I'm not mistaken) should be the trip count minus the first trip (so 9,999 in this case). I've run this pass on a much larger benchmark and it gives me the exact same error at every loop, so I'm guessing it isn't the loop itself, but that I'm not giving it enough information.
I've spent quite a bit of time combing through the documentation and Google results to find an example of a full opt command using this transformation, but have been unsuccessful so far; I'd appreciate any hints as to what I may be missing (I'm new to vectorizing code so it could be something very obvious).
Thank you,
Stephen
Vectorization depends on a number of other optimizations that need to run first. They are not run at all at -O0, so you cannot expect your code to be 'just' vectorized there.
Adding -O2 before -loop-vectorize on the opt command line should help here (make sure your 'A' array is external or used somehow, otherwise everything will be optimized away).

Sending signals from DCL command line on OpenVMS

I'm trying to send a signal via the command line on an OpenVMS server. Using Perl I have set up signal handlers between processes and Perl on VMS is able to send Posix signals. In addition, C++ programs are able to send and handle signals too. However, the problem I run into is that the processes could be running on another node in the cluster and I need to write a utility script to remotely send a signal to them.
I'm trying to avoid writing a new script and would rather simply execute a command remotely to send the signal from the command line. I need to send SIGUSR1, which translates to C$_SIGUSR1 for OpenVMS.
Thanks.
As far as I know, there is no supported command line interface to do this. But you can accomplish the task by calling an undocumented system service called SYS$SIGPRC(). This system service can deliver any condition value to the target process, not just POSIX signals. Here's the interface described in standard format:
FORMAT
SYS$SIGPRC process-id ,[process-name] ,condition-value
RETURNS
OpenVMS usage: cond_value
type: longword (unsigned)
access: write only
mechanism: by value
ARGUMENTS
process-id
OpenVMS usage: process_id
type: longword (unsigned)
access: modify
mechanism: by reference
Process identifier of the process that is to receive the signal. The
process-id argument is the address of an unsigned longword containing the
process identifier. If you do not specify process-id, process-name is
used.
The process-id is updated to contain the process identifier actually
used, which may be different from what you originally requested if you
specified process-name.
process-name
OpenVMS usage: process_name
type: character string
access: read only
mechanism: by descriptor
A 1 to 15 character string specifying the name of the process that
will receive the signal. The process-name argument is the
address of a descriptor pointing to the process name string. The name
must correspond exactly to the name of the process that is to receive
the signal; SYS$SIGPRC does not allow trailing blanks or abbreviations.
If you do not specify process-name, process-id is used. If you specify
neither process-name nor process-id, the caller's process is used.
Also, if you do not specify process-name and you specify zero for
process-id, the caller's process is used.
condition-value
OpenVMS usage: cond_value
type: longword (unsigned)
access: read only
mechanism: by value
OpenVMS 32-bit condition value. The condition-value argument is
an unsigned longword that contains the condition value delivered
to the process as a signal.
CONDITION VALUES RETURNED
SS$_NORMAL The service completed successfully
SS$_NONEXPR Specified process does not exist
SS$_NOPRIV The process does not have the privilege to signal
the specified process
SS$_IVLOGNAM The process name string has a length of 0 or has
more than 15 characters
(plus I suspect there are other possible returns having to do
with various cluster communications issues)
EXAMPLE CODE
#include <stdio.h>
#include <stdlib.h>
#include <ssdef.h>
#include <stsdef.h>
#include <descrip.h>
#include <errnodef.h>
#include <lib$routines.h>
int main (int argc, char *argv[]) {
/*
**
** To build:
**
** $ cc sigusr1
** $ link sigusr1
**
** Run example:
**
** $ sigusr1 := $dev:[dir]sigusr1.exe
** $ sigusr1 20206E53
**
*/
static unsigned int pid;
static unsigned int r0_status;
extern unsigned int sys$sigprc (unsigned int *,
struct dsc$descriptor_s *,
int);
if (argc < 2) {
(void)fprintf (stderr, "Usage: %s PID\n",
argv[0]);
exit (EXIT_FAILURE); /* a missing PID is a usage error */
}
sscanf (argv[1], "%x", &pid);
r0_status = sys$sigprc (&pid, 0, C$_SIGUSR1);
if (!$VMS_STATUS_SUCCESS (r0_status)) {
(void)lib$signal (r0_status);
}
}

What's the line after execve for since it doesn't return on success?

execve(prog[0],prog,env);
return 0;
execve() does not return on success, and the text, data, bss, and
stack of the calling process are overwritten by that of the program
loaded.
What's return 0; for?
I suspect it is there to silence this compiler warning:
$ cat | gcc -W -Wall -x c -
int main(){}
^D
<stdin>: In function 'main':
<stdin>:1:1: warning: control reaches end of non-void function
This will also keep static analyzers and IDEs from warning about the same thing.
That line is there in case execve() fails and does return. That happens, for example, when the program file does not exist or is not executable: execve() then returns -1 with errno set. The code after the call usually returns a nonzero value to signal that there was an error.