The use of strncmp and memcmp

Does
if(strncmp(buf, buf2, 7) == 0)
do the same thing as
if(memcmp(buf, buf2, 7) == 0)
buf and buf2 are char* arrays or similar.
I was going to append this to another question, but then decided it was perhaps better to post it separately. Presumably the answer is either a trivial "yes" or, if not, what is the difference?
(I found these functions from online documentation, but wasn't sure about strncmp because the documentation was slightly unclear.)

Like strcmp(), strncmp() is for comparing strings, therefore it stops comparing when it finds a string terminator in at least one argument. Any differences past that point have no effect on the result. strncmp() differs in that it will also stop comparing after the specified number of bytes if it does not encounter a terminator before then.
memcmp(), on the other hand, is for comparing blocks of random memory. It compares up to the specified number of bytes from each block until it finds a difference, regardless of the values of the bytes. That is, it does not stop at string terminators.

In C and C++ the end of a string is indicated by a byte with value 0.
The function memcmp does not care about the end of a string but will always compare exactly the number of bytes specified.
In contrast, the function strncmp will stop at a byte with value 0, even if the specified number of bytes to compare has not yet been reached.

The main difference between strncmp() and memcmp() is that the former is sensitive to (stops at) '\0', whereas the latter is not. If the first 7 bytes of memory from buf and buf2 do not contain a '\0', then the behaviour is the same.
Consider the following example:
#include <stdio.h>
#include <string.h>
int main(void) {
char buf[] = "123\0 12";
char buf2[] = "123\0 34";
printf("strncmp(): %d\n", strncmp(buf, buf2, 7));
printf("memcmp(): %d\n", memcmp(buf, buf2, 7));
return 0;
}
It will output:
strncmp(): 0
memcmp(): -2
Because strncmp() stops at buf[3], where it finds a '\0', while memcmp() continues until all 7 bytes are compared.
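For completeness, a sketch of the opposite case (buffer contents made up for illustration): with no terminator inside the compared range, the two calls agree, as the answers above state:
#include <stdio.h>
#include <string.h>

int main(void) {
    /* No '\0' within the first 7 bytes, so strncmp() and memcmp() agree. */
    char a[] = "1234567x";
    char b[] = "1234567y";
    printf("strncmp(): %d\n", strncmp(a, b, 7));  /* 0 */
    printf("memcmp():  %d\n", memcmp(a, b, 7));   /* 0 */
    return 0;
}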

Related

what's best way for awk to check arbitrary integer precision

From GNU gawk's manual page
https://www.gnu.org/software/gawk/manual/html_node/Checking-for-MPFR.html
they have a formula to check for arbitrary precision:
function adequate_math_precision(n) { return (1 != (1+(1/(2^(n-1))))) }
My question is: wouldn't it be more efficient to stay within the integer math domain with a formula such as
( 2^abs(n) - 1 ) % 2 # note 2^(n-1) vs. 2^|n| - 1
Since any power of 2 must also be even, subtracting 1 must always give an odd number, so its modulo (%) over 2 becomes an indicator function for is_odd() for n >= 0, while abs(n) handles the cases where n is negative.
Or does the modulo necessitate a cast to floating point, thus nullifying any gains?
Good question. Let's tackle it.
The proposed snippet aims at checking whether gawk was invoked with the -M option.
I'll attach a longer digression on that option at the bottom.
The argument n of the function is the floating-point precision needed for whatever operation you'll have to perform. So, say your script lives in a library somewhere and will get called, but you have no control over how. You run that function at the beginning of the script to promptly throw an exception and bail out, signalling that the end result would be wrong due to a lack of bits to store the numbers.
Your code stays in the integer realm: a power of two of an integer is an integer. There is no need for abs(n) here, because there is no point in specifying the number of bits you'll need as a negative number in the first place.
Then you subtract one from an even integer. Now, unless n=0 (in which case 2^0=1 and your code reads (1 - 1) % 2 = 0), your snippet will always return 1, because the remainder (%) of an odd number divided by two is 1.
The problem is: you are trying to calculate a potentially stupidly large number in a function whose purpose is to check whether you are able to do so in the first place.
Since any power of 2 must also be even, subtracting 1 must always give
an odd number, so its modulo (%) over 2 becomes an indicator function
for is_odd() for n >= 0, while abs(n) handles the cases where n is
negative.
Except when n=0, as discussed above, you are right: the snippet will tell you that any power of 2 is even, and any power of 2, minus 1, is odd. We were discussing another subject entirely, though.
Let's analyze the other function instead:
return (1 != (1+(1/(2^(n-1)))))
Remember that booleans in awk work like this: 0 is false, nonzero is true. So, while 1+x (where x is a very small number, typically the reciprocal of a large power of two, 1/2^122 in the example page) is mathematically guaranteed to be != 1, in the digital world that's not the case. At some point the floating-point computation reaches its precision rock bottom, the sum is rounded down, and x is suddenly declared to be 0. At that point, the arbitrary-precision check returns 0, that is, false: 1 equals 1.
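The same rock bottom can be demonstrated outside awk; a minimal C sketch, assuming IEEE-754 doubles (a 53-bit significand, which is what gawk uses without -M):
#include <stdio.h>
#include <math.h>

int main(void) {
    /* 1 + 2^-52 is the next representable double after 1,
       but 1 + 2^-53 rounds back down to exactly 1. */
    printf("%d\n", 1.0 + ldexp(1.0, -52) != 1.0);  /* 1 (true)  */
    printf("%d\n", 1.0 + ldexp(1.0, -53) != 1.0);  /* 0 (false) */
    return 0;
}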
A larger discussion on types and data representation
The page you link explains precision for gawk invoked with the -M option. This sounds like technoblahblah, let's decipher it.
At some point, your OS architecture has to decide how to store data and how to represent it in memory so that it can be accessed again and displayed. Terms like integer, float, double, and unsigned integer are examples of data representations. Here we are addressing integer representation: how is an integer stored in memory?
A 32-bit system will use 4 bytes to represent an integer, which in turn determines how large the integer can be. The 32 bits are read from most significant bit (MSB) to least significant bit (LSB), and if the type is signed, one bit (typically the MSB) represents the sign, drastically reducing the maximum size of the integer.
If asked to compute a large number, a machine will try to fit it into the largest type available. If the end result is larger than that, you get an overflow and end up with a wrong result or an error. Many online challenges ask you to write code for arbitrarily long loops or large sums, then test it with inputs that break the 64-bit barrier, to see if you master the proper types for indexes.
AWK is not a strongly typed language. Meaning, any variable can store data regardless of its type. The type can change and is determined at runtime by the interpreter, so the developer doesn't need to care. For instance:
$ awk 'BEGIN {a="this is text"; print a; a=2; print a; print a+3.0*2}'
-| this is text
-| 2
-| 8
In the example, a holds text, then an integer, which can be summed with a floating-point number and printed as an integer, without any special type handling.
The Arbitrary Precision Page presents the following snippet:
$ gawk -M 'BEGIN {
> s = 2.0
> for (i = 1; i <= 7; i++)
> s = s * (s - 1) + 1
> print s
> }'
-| 113423713055421845118910464
There is some math voodoo behind it; we will skip that. Since s is interpreted as a floating-point number, the end result is computed in floating point.
Try to enter that number into the Windows calculator as a decimal, and it will fail, although you can enter it in binary. You'll need the programmer setting, and up to 53 bits, to fit it as an unsigned integer.
53 is a magic number here: with the -M option, gawk uses arbitrary precision for numbers. In other words, it works out how many bits are necessary, keeps track of them, and breaks free of the native OS architecture. By default, gawk allocates 53 bits for any given arbitrary-precision number. Fun fact: the actual result of that snippet is wrong, and it would take up to 100 bits to compute it correctly.
To implement handling of arbitrarily large numbers, gawk relies on an external library called MPFR. Given an arbitrarily large number, MPFR handles the memory allocation and bit requisition to store it. However, the interface between gawk and MPFR is not perfect, and gawk can't always control the type that MPFR will use. For integers that's not an issue; for floating-point numbers it results in rounding errors.
This brings us back to the snippet at the beginning: if gawk was called with the -M option, numbers up to 2^53 can be stored as integers. Floating-point values must be smaller than that (you need to represent the decimal point somehow, spending some of the bits allocated for the number, just like the sign). Following the example on the page, and asking for an arbitrary precision larger than 32, the snippet will return TRUE only if the -M option was passed; otherwise 1/2^(n-1) will be rounded down to 0.
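For a concrete feel of that 2^53 boundary, here is a small C sketch (again assuming IEEE-754 doubles, whose 53-bit significand is exactly where that default comes from):
#include <stdio.h>

int main(void) {
    double big = 9007199254740992.0;   /* 2^53 */
    printf("%.0f\n", big + 1.0);       /* 9007199254740992: the +1 is lost */
    printf("%.0f\n", big + 2.0);       /* 9007199254740994: even values still fit */
    return 0;
}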

Using XOR on characters as a simple checksum; is a char just a byte?

I have a string of characters and want to generate a simple checksum by accumulating XOR over each character, then appending the lowest-order byte of the result to the end of the string, formatted by sprintf(twoCharacterBuffer, "%02X", valueHoldingXOR);.
If I just XOR the characters in the string, putting them into an unsigned char value, the compiler warns me that "'sprintf' output between 3 and 9 bytes into a destination of size 2"
The Arduino documentation is a little vague, possibly on purpose, about the number of bytes in a character. I'd like to just XOR the lowest-order byte, whether it's 1 or 2 or 4 bytes, but am not sure of the correct way to do that. Or can I assume that a char is a byte and just cast it?
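A minimal sketch of one common approach (names and payload are made up for illustration): accumulate into an unsigned char, mask to the low-order byte explicitly in case char is wider than 8 bits, and give sprintf() a destination of at least 3 bytes, since "%02X" writes two hex digits plus a terminating NUL; a 2-byte destination is exactly what provokes the quoted warning:
#include <stdio.h>
#include <string.h>

int main(void) {
    const char *msg = "HELLO";          /* hypothetical payload */
    unsigned char xorsum = 0;

    /* XOR the low-order byte of each character; the & 0xFFu mask is a
       no-op where char is 8 bits (as on AVR Arduinos) but makes the
       intent explicit on platforms with a wider char. */
    for (size_t i = 0; i < strlen(msg); i++)
        xorsum ^= (unsigned)msg[i] & 0xFFu;

    char checksumBuffer[3];             /* 2 hex digits + '\0' */
    sprintf(checksumBuffer, "%02X", xorsum);
    printf("checksum: %s\n", checksumBuffer);
    return 0;
}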

Returning uint Value from Char in gawk

I'm trying to get the value of an ASCII char I receive via RS232, to convert it into binary-like values.
Example:
0xFF-->########
0x01--> #
0x02--> #
...
My problem is getting the value of ASCII chars higher than 127.
Test-Code to get the int value:
echo -e "\xFF" | gawk -l ordchr -e '{printf("%c : %i", ord($0),ord($0))}'
Return:
� : -1
Test-Code 2:
echo -e "\x61" | gawk -l ordchr -e '{printf("%c : %i", ord($0),ord($0))}'
Return:
a : 97
So my solution to convert the values into an unsigned int is like this:
if (ord($0) < 0)
{
    new_char = ord($0) + 256;
}
else new_char = ord($0) + 0
But I wanted to know if there was a way to cast directly an int as uint in gawk.
Later I tried to write my own ord() function.
#!/bin/bash
echo -e "\xFF" | awk 'BEGIN {_ord_init()}
{
printf("%s : %d\n", $0, ord($0))
}
function _ord_init( i, t)
{
for (i=0; i <= 255; i++) {
t = sprintf("%c", i)
_ord_[t] = i
}
}
function ord(str, c)
{
# only first character is of interest
c = substr(str, 1, 1)
return _ord_[c]
}'
0xFF returns:
� : 0
0x61 returns:
a : 97
Can someone explain this behavior to me?
I'm using:
GNU Awk 4.1.3, API: 1.1 (GNU MPFR 3.1.4-p1, GNU MP 6.1.1)
But I wanted to know if there was a way to cast directly an int as uint in gawk.
Actually, any string in awk is, in the end, a number.
Strings are converted to numbers and numbers are converted to strings,
if the context of the awk program demands it. [...] A string is
converted to a number by interpreting any numeric prefix of the string
as numerals: "2.5" converts to 2.5, "1e3" converts to 1,000, and
"25fix" has a numeric value of 25. Strings that can’t be interpreted
as valid numbers convert to zero. source
Let's make a quick test:
BEGIN {
print 0xff
print 0xff + 0
print 0xff +0.0
print "0xff"
}
# 255
# 255
# 255
# 0xff
So, any hex value is automatically interpreted as unsigned. Casting an int to uint is a tricky question: generally, you would convert the modulus of the int to hex, then add the sign bit as the MSB (if the number is negative). But you should not need to do so in awk.
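Incidentally, the ord() value of -1 for byte 0xFF in the question is plain sign extension of a signed 8-bit value, and the +256 workaround is the standard fix. A minimal C sketch of the same effect (assuming an 8-bit signed char, and that the extension hands the byte back as a signed value):
#include <stdio.h>

int main(void) {
    signed char c = '\xFF';        /* the byte 0xFF in a signed 8-bit type */
    int as_int = c;                /* sign-extends to -1 */
    int as_uint = (unsigned char)c;                  /* reinterpreted: 255 */
    int fixed = as_int < 0 ? as_int + 256 : as_int;  /* the question's workaround */

    printf("%d %d %d\n", as_int, as_uint, fixed);    /* -1 255 255 */
    return 0;
}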
Remember that conversion is made as a call to sprintf() and you may control it via the CONVFMT variable:
CONVFMT
A string that controls the conversion of numbers to strings
(see section Conversion of Strings and Numbers). It works by being
passed, in effect, as the first argument to the sprintf() function
(see section String-Manipulation Functions). Its default value is
"%.6g". CONVFMT was introduced by the POSIX standard. source
Remember that locale settings may affect the way the conversion is performed, especially with the decimal separator. For more, see this, which is out of scope.
Can someone explain this behavior to me?
I can't actually reproduce it, but I suspect this line of code:
# only first character is of interest
c = substr(str, 1, 1)
In your example, the first char is always 0 and the output should always be the same. I'm testing this online.
Here is another example of mine:
BEGIN {
a = 0xFF
b = 0x61
printf("a: %d %f %X %s %c\n", a,a,a,a,a)
printf("b: %d %f %X %s %c\n", b,b,b,b,b)
}
# a: 255 255.000000 FF 255 ÿ
# b: 97 97.000000 61 97 a
Run gawk in binary mode (gawk -b) to stop it from pre-stitching UTF-8 code points. Then split the input by the empty string //, and each slot in the resulting array will contain something exactly 1 byte wide.
For the other way around, just pre-make an array from 0 to 255. Gawk doesn't stop there at all: in my routine gawk startup sequence, I build that same custom ord lookup from 0x3134F all the way back to zero (around 210k code points or so). The reason to do it backwards is that, for whatever reason, some code points come out as an IDENTICAL character that gawk can't differentiate; doing it in reverse ensures the lowest code point is the one assigned to that character. For this mode, I run gawk in regular UTF-8 mode.
For your scenario, I would just pre-make a 4-hex-digit array from 0x0000 to 0xFFFF mapped back to their integer values; then, for each byte pair 0xZZ 0xWW, throw ZZWW into that lookup dictionary and get back an integer.
If you just try ord() on 128 to 255, it usually won't work like that, because 128 is where Unicode starts using 2 bytes (0x800 starts 3 bytes, and 0x10000 starts 4 bytes). I'm not too familiar with the encodings that extend ASCII to 256 characters; they usually require iconv or similar to get back to UTF-8 first.
A quick note: if you have raw UTF-8 bytes and are trying to figure out how many stitched UTF-8 code points there are, just delete every byte from 0x80 to 0xBF. The length() of the residue is the number of code points.
In decimal lingo, out of the four ranges of 64 numbers from 0 to 255:
000 - 063 - ASCII
064 - 127 - ASCII
128 - 191 - UTF-8 multi-byte continuation bytes (the 0x80 - 0xBF range)
192 - 255 - leading byte of a UTF-8 multi-byte character
and this looks hideous. Luckily, octal comes to the rescue: the 0x80 - 0xBF range is just \200-\277, and you can use any of AWK's regex facilities to find those bytes (also in FS, RS, etc.). I was spending time manually coding up the UTF-8 bit-shifting algorithm before realizing, much later, that I didn't need it to reach my end goal.
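The same continuation-byte trick works outside awk as well; here is a C sketch of the counting logic (sample string made up for illustration):
#include <stdio.h>

/* Count UTF-8 code points: every byte outside 0x80-0xBF (octal
   \200-\277) starts exactly one code point, so skip continuation
   bytes and count the rest. */
size_t utf8_codepoints(const char *s) {
    size_t n = 0;
    for (; *s; s++)
        if (((unsigned char)*s & 0xC0) != 0x80)  /* not 10xxxxxx */
            n++;
    return n;
}

int main(void) {
    const char *sample = "caf\xC3\xA9";        /* "café": 5 bytes, 4 code points */
    printf("%zu\n", utf8_codepoints(sample));  /* prints 4 */
    return 0;
}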
You can easily beat the system's built-in wc -m command at counting UTF-8 code points by combining the logic above with mawk2. On my 2.5-year-old laptop, against a 1.83 GB flat text file filled with Unicode all over, I got it down to approximately 19 seconds to count 1.29 billion UTF-8 code points, using just awk.
I've run into the same problem myself. I ended up, first, with a detector for whether gawk is running in Unicode mode or byte mode (check whether the length() of the 3-octal-byte combo that makes up one UTF-8 code point returns 1 or 3).
Then, when it sees gawk in Unicode mode, it runs a custom shell command from gawk, using Unix printf to print out bytes 128-255, and chunks them back into a gawk array. If you need it, I can paste the code sometime (but it's SUPER hideous, so I hope I won't get dinged for its lack of elegance).
Because there are bytes like C0, C1, or FF that simply don't exist in valid UTF-8, no matter what combination you attempt, you cannot generate all 256 byte values from within gawk. Another way to do it would be to pre-make that chain and use something like xxd -ps to store it as a hex string, converting it back only at runtime, but that's admittedly slower.

Difference between printing pointer address and ampersand address

int firstInt = 10;
int *pointerFirstInt = &firstInt;
printf("The address of firstInt is: %u", &firstInt);
printf("\n");
printf("The address of firstInt is: %p", pointerFirstInt);
printf("\n");
The above code returns the following:
The address of firstInt is: 1606416332
The address of firstInt is: 0x7fff5fbff7cc
I know that 0x7fff5fbff7cc is in hexadecimal, but when I attempt to convert that number to decimal it does not equal 1606416332. Why is this? Shouldn't both return the same memory address?
The reason for this lies here, C11 7.21.6:
If a conversion specification is invalid, the behavior is undefined. If any argument is not the correct type for the corresponding conversion specification, the behavior is undefined.
Look at your hexadecimal address:
The address of firstInt is: 0x7fff5fbff7cc
The address is 6 bytes long, but the size of unsigned int is only 4 bytes. Trying to print the address using %u therefore causes undefined behaviour.
So always print the address using %p.
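For reference, a corrected version of the question's snippet; casting to void * is what %p formally expects:
#include <stdio.h>

int main(void) {
    int firstInt = 10;
    int *pointerFirstInt = &firstInt;

    /* %p requires a void * argument; the casts make both calls well-defined. */
    printf("The address of firstInt is: %p\n", (void *)&firstInt);
    printf("The address of firstInt is: %p\n", (void *)pointerFirstInt);
    return 0;
}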
It seems that you are working on a 64-bit machine, so your pointer is 64 bits long.
Both &firstInt and pointerFirstInt are exactly the same, but they are displayed differently.
"%p" knows that pointers are 64-bit and displays them in hexadecimal. "%u" displays a decimal number and assumes 32 bits, so only part of the address is shown.
If you convert 1606416332 to hexadecimal, it reads 0x5FBFF7CC; you can see that this is the lower half of the 64-bit address.
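You can check that conversion with one line of C:
#include <stdio.h>

int main(void) {
    /* The decimal value printed by %u was the low 32 bits of the pointer. */
    printf("%X\n", 1606416332u);  /* prints 5FBFF7CC */
    return 0;
}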
Edit: further explanation:
Since printf is a varargs function, all the parameters you give it are put on the stack, and you put 8 bytes on it in both cases. Since PCs use little endian, the lower bytes are put there first.
The printf function parses the format string, comes to a %[DatatypeSpecifier], and reads as many bytes from the stack as the data type referred to by the specifier requires. So in the case of "%u" it reads only 4 bytes and ignores the others. And since you wrote "%u" and not "%x", it displays the value in decimal rather than hexadecimal form.

Converting some assembly to VB.NET - SHR operator working differently?

Well, a simple question here.
I am studying some assembly and converting some assembly routines back to VB.NET.
There is a specific line of code I am having trouble with. In assembly, assume the following:
EBX = F0D04080
Then the following line gets executed
SHR EBX, 4
Which gives me the following:
EBX = 0F0D0408
Now, in VB.NET, I do the following:
variable = variable >> 4
Which SHOULD give me the same... but it differs a SLIGHT bit: instead of the value 0F0D0408 I get FF0D0408.
So what is happening here?
From the documentation of the >> operator:
In an arithmetic right shift, the bits shifted beyond the rightmost bit position are discarded, and the leftmost (sign) bit is propagated into the bit positions vacated at the left. This means that if pattern has a negative value, the vacated positions are set to one; otherwise they are set to zero.
If you are using a signed data type, F0D04080 has a negative value (its leading bit is 1), which is copied into the vacated positions on the left.
This is not something specific to VB.NET, by the way: variable >> 4 is translated to the IL instruction shr, which is an "arithmetic shift" and preserves the sign, in contrast to the x86 assembly instruction SHR, which is an unsigned shift. To do an arithmetic shift in x86 assembler, SAR can be used.
To use an unsigned shift in VB.NET, you need to use an unsigned variable:
Dim variable As UInteger = &HF0D04080UI
The UI type character at the end of F0D04080 tells VB.NET that the literal is an unsigned integer (otherwise, it would be interpreted as a negative signed integer and the assignment would result in a compile-time error).
VB's >> operator does an arithmetic shift, which shifts in the sign bit rather than 0's.
variable = (variable >> shift_amt) And Not (Integer.MinValue >> (shift_amt - 1))
should give you an equivalent value, even if it is a bit long. Alternatively, you could use an unsigned integer (UInteger or UInt32), as there's no sign bit to shift.
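The same signed-versus-unsigned distinction exists in C, for comparison (a sketch; right-shifting a negative signed value is implementation-defined in C, though common compilers perform an arithmetic shift):
#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>

int main(void) {
    int32_t  s = (int32_t)0xF0D04080u;  /* bit pattern F0D04080; negative as signed */
    uint32_t u = 0xF0D04080u;

    /* Arithmetic shift (sign bit copied in), like VB.NET's >> on Integer: */
    printf("%08" PRIX32 "\n", (uint32_t)(s >> 4));  /* FF0D0408 on common compilers */
    /* Logical shift (zero filled), like x86 SHR or >> on UInteger: */
    printf("%08" PRIX32 "\n", u >> 4);              /* 0F0D0408 */
    return 0;
}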