How to write integer value "60" in 16bit binary, 32bit binary & 64bit binary

How to write integer value "60" in 16bit binary, 32bit binary & 64bit binary - vb.net

How to write integer value "60" in other binary formats?
8bit binary code of 60 = 111100
16bit binary code of 60 = ?
32bit binary code of 60 = ?
64bit binary code of 60 = ?
is it 111100000000 for 16 bit binary?
why does 8bit binary code contain 6bits instead of 8?
I googled for the answers but I'm not able to get these answers. Please provide answers as I'm still a beginner of this area.

Imagine you're writing the decimal value 60. You can write it using 2 digits, 4 digits or 8 digits:
1. 60
2. 0060
3. 00000060
In our decimal notation, the most significant digits are to the left, so increasing the number of digits for representation, without changing the value, means just adding zeros to the left.
Now, in most binary representations, this would be the same. The decimal 60 needs only 6 bits to represent it, so an 8bit or 16bit representation would be the same, except for the left-padding of zeros:
1. 00111100
2. 00000000 00111100
Note: Some OSs, software, hardware or storage devices might have different Endianness - which means they might store 16bit values with the least significant byte first, then the most signficant byte. Binary notation is still MSB-on-the-left, as above, but reading the memory of such little-endian devices will show any 16bit chunk will be internally reversed:
1. 00111100 - 8bit - still the same.
2. 00111100 00000000 - 16bit, bytes are flipped.

every number has its own binary number, that means that there is only one!
on a 16/32/64 bit system 111100 - 60 would just look the same with many 0s added infront of the number (regulary not shown)
so on 16 bit it would be 0000000000111100
32 bit - 0000000000000000000000000011110
and so on

For storage Endian matters ... otherwise bitwidth zeros are always prefixed so 60 would be...
8bit: 00111100
16bit: 0000000000111100

Related

What is the bias exponent?

My question concerns the IEEE 754 Standard
So im going through the steps necessary to convert denary numbers into a floating point number (IEEE 754 standard) but I dont understand the purpose of determining the biased exponent. I cant get my head around that step and what it is exactly and why its done?
Could any one explain what this is - please keep in mind that I have just started a computer science conversion masters so I wont completely understand certain choices of terminology!
If you think its very long to explain please point me in the right direction!

The exponent in an IEEE 32-bit floating point number is 8 bits.
In most cases where we want to store a signed quantity in an 8-bit field, we use a signed representation. 0x80 is -128, 0xFF is -1, 0x00 is 0, up to 0x7F is 127.
But that's not how the exponent is represented. The exponent field is treated as if it were unsigned 8-bit number that is 127 too large. Look at the unsigned value in the exponent field and subtract 127 to get the actual value. So 0x00 represents -127. 0x7F represents 0.
For 64-bit floats, the field is 11 bits, with a bias of 1023, but it works the same.

A floating point number could be represented (but is not) as a sign bit s, an exponent field e, and a mantissa field m, where e is a signed integer and m is an unsigned fraction of an integer. The value of that number would then be computed as (-1)^s • 2^e • m. But this would not allow to represent important special cases.
Note that one could increase the exponent by ±n and shift the mantissa right by ±n without changing the value of the number. This allows for nearly a numbers to adjust the exponent such that the mantissa starts with a 1 (one exception is of course 0, a special FP number). If one does so, one has normalized FP numbers, and since the mantissa now starts always with 1, one does not have to store the leading 1 in memory, and the saved bit is used to increase the precision of the FP number. Thus, no mantissa m is stored but a mantissa field mf.
But how is now 0 represented? And what about FP numbers that have already the max or min exponent field, but due to their normalization, cannot be made larger or smaller? And what about "not a number-s" that are e.g. the result of x/0?
Here comes the idea of a biased exponent: If half of the max exponent value is added to the exponent field, one gets the bias exponent be. To compute the value of the FP number, this bias has to be subtracted of course. But all normalized FP numbers have now 0 < be <(all 1). Therefore these special biased exponents 0 and (all 1) can now be reserved for special purposes.
be = 0, mf = 0: Exact 0.
be = 0, mf ≠ 0: A denormalized number, i.e. mf is the real mantissa that does not have a leading 1.
be = (all 1), mf = 0: Infinite
be = (all 1), mf ≠ 0: Not a number (NaN)

Excel VBA - Sum of 1 in workbook not equal to 1 in VBA [duplicate]

Why do some numbers lose accuracy when stored as floating point numbers?
For example, the decimal number 9.2 can be expressed exactly as a ratio of two decimal integers (92/10), both of which can be expressed exactly in binary (0b1011100/0b1010). However, the same ratio stored as a floating point number is never exactly equal to 9.2:
32-bit "single precision" float: 9.19999980926513671875
64-bit "double precision" float: 9.199999999999999289457264239899814128875732421875
How can such an apparently simple number be "too big" to express in 64 bits of memory?

In most programming languages, floating point numbers are represented a lot like scientific notation: with an exponent and a mantissa (also called the significand). A very simple number, say 9.2, is actually this fraction:
5179139571476070 * 2 -49
Where the exponent is -49 and the mantissa is 5179139571476070. The reason it is impossible to represent some decimal numbers this way is that both the exponent and the mantissa must be integers. In other words, all floats must be an integer multiplied by an integer power of 2.
9.2 may be simply 92/10, but 10 cannot be expressed as 2n if n is limited to integer values.
Seeing the Data
First, a few functions to see the components that make a 32- and 64-bit float. Gloss over these if you only care about the output (example in Python):
def float_to_bin_parts(number, bits=64):
if bits == 32: # single precision
int_pack = 'I'
float_pack = 'f'
exponent_bits = 8
mantissa_bits = 23
exponent_bias = 127
elif bits == 64: # double precision. all python floats are this
int_pack = 'Q'
float_pack = 'd'
exponent_bits = 11
mantissa_bits = 52
exponent_bias = 1023
else:
raise ValueError, 'bits argument must be 32 or 64'
bin_iter = iter(bin(struct.unpack(int_pack, struct.pack(float_pack, number))[0])[2:].rjust(bits, '0'))
return [''.join(islice(bin_iter, x)) for x in (1, exponent_bits, mantissa_bits)]
There's a lot of complexity behind that function, and it'd be quite the tangent to explain, but if you're interested, the important resource for our purposes is the struct module.
Python's float is a 64-bit, double-precision number. In other languages such as C, C++, Java and C#, double-precision has a separate type double, which is often implemented as 64 bits.
When we call that function with our example, 9.2, here's what we get:
>>> float_to_bin_parts(9.2)
['0', '10000000010', '0010011001100110011001100110011001100110011001100110']
Interpreting the Data
You'll see I've split the return value into three components. These components are:
Sign
Exponent
Mantissa (also called Significand, or Fraction)
Sign
The sign is stored in the first component as a single bit. It's easy to explain: 0 means the float is a positive number; 1 means it's negative. Because 9.2 is positive, our sign value is 0.
Exponent
The exponent is stored in the middle component as 11 bits. In our case, 0b10000000010. In decimal, that represents the value 1026. A quirk of this component is that you must subtract a number equal to 2(# of bits) - 1 - 1 to get the true exponent; in our case, that means subtracting 0b1111111111 (decimal number 1023) to get the true exponent, 0b00000000011 (decimal number 3).
Mantissa
The mantissa is stored in the third component as 52 bits. However, there's a quirk to this component as well. To understand this quirk, consider a number in scientific notation, like this:
6.0221413x1023
The mantissa would be the 6.0221413. Recall that the mantissa in scientific notation always begins with a single non-zero digit. The same holds true for binary, except that binary only has two digits: 0 and 1. So the binary mantissa always starts with 1! When a float is stored, the 1 at the front of the binary mantissa is omitted to save space; we have to place it back at the front of our third element to get the true mantissa:
1.0010011001100110011001100110011001100110011001100110
This involves more than just a simple addition, because the bits stored in our third component actually represent the fractional part of the mantissa, to the right of the radix point.
When dealing with decimal numbers, we "move the decimal point" by multiplying or dividing by powers of 10. In binary, we can do the same thing by multiplying or dividing by powers of 2. Since our third element has 52 bits, we divide it by 252 to move it 52 places to the right:
0.0010011001100110011001100110011001100110011001100110
In decimal notation, that's the same as dividing 675539944105574 by 4503599627370496 to get 0.1499999999999999. (This is one example of a ratio that can be expressed exactly in binary, but only approximately in decimal; for more detail, see: 675539944105574 / 4503599627370496.)
Now that we've transformed the third component into a fractional number, adding 1 gives the true mantissa.
Recapping the Components
Sign (first component): 0 for positive, 1 for negative
Exponent (middle component): Subtract 2(# of bits) - 1 - 1 to get the true exponent
Mantissa (last component): Divide by 2(# of bits) and add 1 to get the true mantissa
Calculating the Number
Putting all three parts together, we're given this binary number:
1.0010011001100110011001100110011001100110011001100110 x 1011
Which we can then convert from binary to decimal:
1.1499999999999999 x 23 (inexact!)
And multiply to reveal the final representation of the number we started with (9.2) after being stored as a floating point value:
9.1999999999999993
Representing as a Fraction
9.2
Now that we've built the number, it's possible to reconstruct it into a simple fraction:
1.0010011001100110011001100110011001100110011001100110 x 1011
Shift mantissa to a whole number:
10010011001100110011001100110011001100110011001100110 x 1011-110100
Convert to decimal:
5179139571476070 x 23-52
Subtract the exponent:
5179139571476070 x 2-49
Turn negative exponent into division:
5179139571476070 / 249
Multiply exponent:
5179139571476070 / 562949953421312
Which equals:
9.1999999999999993
9.5
>>> float_to_bin_parts(9.5)
['0', '10000000010', '0011000000000000000000000000000000000000000000000000']
Already you can see the mantissa is only 4 digits followed by a whole lot of zeroes. But let's go through the paces.
Assemble the binary scientific notation:
1.0011 x 1011
Shift the decimal point:
10011 x 1011-100
Subtract the exponent:
10011 x 10-1
Binary to decimal:
19 x 2-1
Negative exponent to division:
19 / 21
Multiply exponent:
19 / 2
Equals:
9.5
Further reading
The Floating-Point Guide: What Every Programmer Should Know About Floating-Point Arithmetic, or, Why don’t my numbers add up? (floating-point-gui.de)
What Every Computer Scientist Should Know About Floating-Point Arithmetic (Goldberg 1991)
IEEE Double-precision floating-point format (Wikipedia)
Floating Point Arithmetic: Issues and Limitations (docs.python.org)
Floating Point Binary

This isn't a full answer (mhlester already covered a lot of good ground I won't duplicate), but I would like to stress how much the representation of a number depends on the base you are working in.
Consider the fraction 2/3
In good-ol' base 10, we typically write it out as something like
0.666...
0.666
0.667
When we look at those representations, we tend to associate each of them with the fraction 2/3, even though only the first representation is mathematically equal to the fraction. The second and third representations/approximations have an error on the order of 0.001, which is actually much worse than the error between 9.2 and 9.1999999999999993. In fact, the second representation isn't even rounded correctly! Nevertheless, we don't have a problem with 0.666 as an approximation of the number 2/3, so we shouldn't really have a problem with how 9.2 is approximated in most programs. (Yes, in some programs it matters.)
Number bases
So here's where number bases are crucial. If we were trying to represent 2/3 in base 3, then
(2/3)10 = 0.23
In other words, we have an exact, finite representation for the same number by switching bases! The take-away is that even though you can convert any number to any base, all rational numbers have exact finite representations in some bases but not in others.
To drive this point home, let's look at 1/2. It might surprise you that even though this perfectly simple number has an exact representation in base 10 and 2, it requires a repeating representation in base 3.
(1/2)10 = 0.510 = 0.12 = 0.1111...3
Why are floating point numbers inaccurate?
Because often-times, they are approximating rationals that cannot be represented finitely in base 2 (the digits repeat), and in general they are approximating real (possibly irrational) numbers which may not be representable in finitely many digits in any base.

While all of the other answers are good there is still one thing missing:
It is impossible to represent irrational numbers (e.g. π, sqrt(2), log(3), etc.) precisely!
And that actually is why they are called irrational. No amount of bit storage in the world would be enough to hold even one of them. Only symbolic arithmetic is able to preserve their precision.
Although if you would limit your math needs to rational numbers only the problem of precision becomes manageable. You would need to store a pair of (possibly very big) integers a and b to hold the number represented by the fraction a/b. All your arithmetic would have to be done on fractions just like in highschool math (e.g. a/b * c/d = ac/bd).
But of course you would still run into the same kind of trouble when pi, sqrt, log, sin, etc. are involved.
TL;DR
For hardware accelerated arithmetic only a limited amount of rational numbers can be represented. Every not-representable number is approximated. Some numbers (i.e. irrational) can never be represented no matter the system.

There are infinitely many real numbers (so many that you can't enumerate them), and there are infinitely many rational numbers (it is possible to enumerate them).
The floating-point representation is a finite one (like anything in a computer) so unavoidably many many many numbers are impossible to represent. In particular, 64 bits only allow you to distinguish among only 18,446,744,073,709,551,616 different values (which is nothing compared to infinity). With the standard convention, 9.2 is not one of them. Those that can are of the form m.2^e for some integers m and e.
You might come up with a different numeration system, 10 based for instance, where 9.2 would have an exact representation. But other numbers, say 1/3, would still be impossible to represent.
Also note that double-precision floating-points numbers are extremely accurate. They can represent any number in a very wide range with as much as 15 exact digits. For daily life computations, 4 or 5 digits are more than enough. You will never really need those 15, unless you want to count every millisecond of your lifetime.

Why can we not represent 9.2 in binary floating point?
Floating point numbers are (simplifying slightly) a positional numbering system with a restricted number of digits and a movable radix point.
A fraction can only be expressed exactly using a finite number of digits in a positional numbering system if the prime factors of the denominator (when the fraction is expressed in it's lowest terms) are factors of the base.
The prime factors of 10 are 5 and 2, so in base 10 we can represent any fraction of the form a/(2b5c).
On the other hand the only prime factor of 2 is 2, so in base 2 we can only represent fractions of the form a/(2b)
Why do computers use this representation?
Because it's a simple format to work with and it is sufficiently accurate for most purposes. Basically the same reason scientists use "scientific notation" and round their results to a reasonable number of digits at each step.
It would certainly be possible to define a fraction format, with (for example) a 32-bit numerator and a 32-bit denominator. It would be able to represent numbers that IEEE double precision floating point could not, but equally there would be many numbers that can be represented in double precision floating point that could not be represented in such a fixed-size fraction format.
However the big problem is that such a format is a pain to do calculations on. For two reasons.
If you want to have exactly one representation of each number then after each calculation you need to reduce the fraction to it's lowest terms. That means that for every operation you basically need to do a greatest common divisor calculation.
If after your calculation you end up with an unrepresentable result because the numerator or denominator you need to find the closest representable result. This is non-trivil.
Some Languages do offer fraction types, but usually they do it in combination with arbitary precision, this avoids needing to worry about approximating fractions but it creates it's own problem, when a number passes through a large number of calculation steps the size of the denominator and hence the storage needed for the fraction can explode.
Some languages also offer decimal floating point types, these are mainly used in scenarios where it is imporant that the results the computer gets match pre-existing rounding rules that were written with humans in mind (chiefly financial calculations). These are slightly more difficult to work with than binary floating point, but the biggest problem is that most computers don't offer hardware support for them.

What does alignment to 16-byte boundary mean in x86

Intel's official optimization guide has a chapter on converting from MMX commands to SSE where they state the fallowing statment:
Computation instructions which use a memory operand that may not be aligned to a 16-byte boundary must be replaced with an unaligned 128-bit load (MOVDQU) followed by the same computation operation that uses instead register operands.
(chapter 5.8 Converting from 64-bit to 128-bit SIMD Integers, pg. 5-43)
I can't understand what they mean by "may not be aligned to a 16-byte boundary", could you please clarify it and give some examples?

Certain SIMD instructions, which perform the same instruction on multiple data, require that the memory address of this data is aligned to a certain byte boundary. This effectively means that the address of the memory your data resides in needs to be divisible by the number of bytes required by the instruction.
So in your case the alignment is 16 bytes (128 bits), which means the memory address of your data needs to be a multiple of 16. E.g. 0x00010 would be 16 byte aligned, while 0x00011 would not be.
How to get your data to be aligned depends on the programming language (and sometimes compiler) you are using. Most languages that have the notion of a memory address will also provide you with means to specify the alignment.

I'm guessing here, but could it be that "may not be aligned to a 16-byte boundary" means that this memory location has been aligned to a smaller value (4 or 8 bytes) before for some other purposes and now to execute SSE instructions on this memory you need to load it into a register explicitly?

Data that's aligned on a 16 byte boundary will have a memory address that's an even number — strictly speaking, a multiple of two. Each byte is 8 bits, so to align on a 16 byte boundary, you need to align to each set of two bytes.
Similarly, memory aligned on a 32 bit (4 byte) boundary would have a memory address that's a multiple of four, because you group four bytes together to form a 32 bit word.

How would you most efficiently store latitude and longitude data?

This question comes from a homework assignment I was given. You can base your storage system off of one of the three following formats:
DD MM SS.S
DD MM.MMM
DD.DDDDD
You want to maximize the amount of data you can store by using as few bytes as possible.
My solution is based off the first format. I used 3 bytes for latitude: 8 bits for the DD (-90 to 90), 6 bits for the MM (0-59), and 10 bits for the SS.S (0-59.9). I then used 25 bits for the longitude: 9 bits for the DDD (-180 to 180), 6 bits for the MM, and 10 for the SS.S. This solution doesn't fit nicely on a byte border, but I figured the next reading can be stored immediately following the previous one, and 8 readings would use only 49 bytes.
I'm curious what methods others can come up. Is there a more efficient method to storing this data? As a note, I considered an offset based storage, but the problem gave no indication of how much the values may change between readings, so I'm assuming any change is possible.

Your suggested method is not optimal. You are using 10 bits (1024 possible values) to store a value in the range (0..599). This is a waste of space.
If you'll use 3 bytes for latitude, you should map the range [0, 2^24-1] to the range [-90, 90]. Hence each of the 2^24 values represents 180/2^24 degrees, which is 0.086 seconds.
If you want only 0.1 second accuracy, you'll need 23 bits for latitudes and 24 bits for longitudes (you'll get 0.077 seconds accuracy). That's 47 bit total instead of your 49 bits, with better accuracy.
Can we do even better?
The exact number of bits needed for 0.1 second accuracy is log2(180*60*60*10 * 360*60*60*10) < 46.256. Which means that you can use 46256 bits (5782 bytes) to store 1000 (lat,lon) pairs, but the mathematics involved will require dealing with very large integers.
Can we do even better?
It depends. If your data set has concentrations, you can store only some points and relative distances from these points, using less bits. Clustering algorithms should be used.

Sticking to existing technology:
If you used half precision floating point numbers to store only the DD.DDDDD data, you can be a lot more space-efficent, but you'd have to accept an exponent bias of 15, which means: The coordinates stored might not be exact, but at an offset from the original value.
This is due to the way floating point numbers are stored, essentially: A normalized significant is multiplied by an exponent to result in a number, instead of just storing a single value (as in integer numbers, the way you calculated the numbers for your solution).
The next highest commonly used floating point number mechanism uses 32 bits (the type "float" in many programming languages) - still efficient, but larger than your custom format.
If, however, you would design your own custom floating point type as well, and you gradually added more bits, your results would become more exact and it would STILL be more efficient than the solution you first found. Just play around with the number of bits used for significant and exponent, and find out how close your fp approximations come to the desired result in degrees!

Well, if this is for a large number of readings, then you may try a differential approach. Start with an absolute location, and then start saving incremental changes, which should ideally require less bits, depending on the nature of the changes. This is effectively compressing the stream. But somehow I don't think that's what this homework is about.

Convert ADC Bins into Voltage [closed]

Closed. This question is off-topic. It is not currently accepting answers.
Want to improve this question? Update the question so it's on-topic for Stack Overflow.
Closed 9 years ago.
Improve this question
Let's say I have a 12-bit Analog to Digital Converter (4096 bins). And let's say I have a signal from 0 to 5 Volts.
What is the proper conversion formula to convert ADC bins into Volts?
V = ADC / 4096 * 5
or
V = ADC / 4095 * 5
Do I divide by 4096 because there are 4096 bins in the ADC?
Or do I divide by 4095 because that is the highest value that the ADC returns?

V = ADC / 4096 * 5
is the correct formula for converting the digital value back to (an approximation of) the analog voltage.
This is according to The Data Conversion Handbook, edited by Walt Kester (Newnes, 2005),
and available at (as of 2018/10/18) at:
https://www.analog.com/en/education/education-library/data-conversion-handbook.html
See in particular Figures 2.4 and 2.5 in Chapter 2:
In your case, FS would be 5 V. (And of course you're using a 12-bit ADC, not a 3-bit one.) Note that even if the ADC value is the maximum possible value (4095 in your case), the corresponding analog voltage will be slightly less than the "full-scale" voltage (5 V in your case).

Brian's suggestion about checking ADC datasheet is ideal. BUT! Assuming your maximum voltage (5V) equates to the maximum ADC input (12-bits = 4095), the following conversion should work for you:
const float maxAdcBits = 4095.0f; // Using Float for clarity
const float maxVolts = 5.0f; // Using Float for clarity
const float voltsPerBit = (maxVolts / maxAdcBits);
float yourVoltage = ADCReading * voltsPerBit;
A quick inspection of the math with Excel leads me to believe this is correct.

How picky do you want to get? If you want to really get picky, then you should also consider that each "bin" represents a small range of values (about 1.2 mV in your case). So when you convert to a voltage value, do you want to return the voltage value of the middle of the bin, or the lower edge of the bin? That is, do you want to effectively "truncate" or "round" in the value you report?
Also, the ADC's steps are probably even (linear), but take care as to what the ADC does with the bins at the two extremities of the range. Those bins may be possibly half the width of the others. It depends on the ADC, so check the spec.
Whether this concern matters at all depends on your application.

The spec for the ADC should identify how 5V is represented in terms of your 12 bits.
I would suspect that 4095 corresponds to 5V and thus your second solution is correct. Otherwise you would never be able to identify a signal of 5V correctly.

For a 12-bit value the maximum representable value is 4095, but of course there are 4096 values total (including zero). Assuming that your ADC is linear then yes, 4095 is equivalent to full scale. This is not necessarily 5V but whatever your reference voltage is equivalent to OR a value exceeding that voltage (of course).

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas