UTF-8 vs ASCII Text - SQL

Why do SQL databases use UTF-8 encoding? Do UTF-8 and ASCII both use 8 bits to store a character?

UTF-8 is used to support a large range of characters. In UTF-8, up to 4 bytes can be used to represent a single character.
Joel has written an article on this subject that you may want to refer to
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

For "normal" characters, only 8 bits are used. For characters that do not fit in 8 bits more bits can be used. This makes UTF-8 is a variable length encoding.
Wikipedia has a good article on UTF-8.
ASCII only defines 128 characters, so only 7 bits are needed, but it is normally stored with 8 bits per character. RS-232 (old serial communication) can be used with 7-bit bytes.

ASCII can only represent a limited number of characters. It isn't very useful for representing any language that isn't based on a Latin character set. UTF-8, however, which is an encoding of the Unicode character set (UCS), can represent almost any language. It does this by chaining multiple bytes together to represent one character (or, more correctly, one code point).
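To see the variable length in practice, here is a small Python sketch (just for illustration) that encodes a few characters and prints how many UTF-8 bytes each one takes:

    # UTF-8 uses 1 byte for ASCII characters and up to 4 bytes for
    # characters outside the ASCII range.
    for ch in ["A", "é", "€", "𝄞"]:
        encoded = ch.encode("utf-8")
        print(f"{ch!r} -> {len(encoded)} byte(s): {encoded.hex()}")

    # 'A' -> 1 byte(s): 41
    # 'é' -> 2 byte(s): c3a9
    # '€' -> 3 byte(s): e282ac
    # '𝄞' -> 4 byte(s): f09d849e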

A more sophisticated encoding can increase index access time significantly. That is something to think about when you run into performance problems writing to or reading from a database.

Related

What are packed binary data and unpacked binary data in an ISO 8583 message?

I am new to this field and working on a payment gateway. Please tell me the difference between packed and unpacked binary data as used in ISO 8583 messages.
The schema definition files for ISO8583 are available at http://dfdlschemas.github.io/ISO8583. In ISO8583_1993.xsd it says:
* This DFDL schema provides a DFDL model for ISO8583 1993 binary data
* where each bitmap in the message is encoded as 8 bytes of binary data
* (8 bits per byte). The bitmaps are said to be 'packed'.
So, the term "packed" refers to the bitmaps, which can be either packed or unpacked.
In en.wikipedia.org/wiki/ISO_8583#Bitmaps, it says
The bitmap may be transmitted as 8 bytes of binary data, or as 16 hexadecimal characters 0-9, A-F in the ASCII or EBCDIC character sets.
In data structures, packed binary data usually means that more (if not all available) bit combinations are used to encode some values, while unpacked means that some bit combinations remain unused, either to improve readability or to make certain calculations easier (but unpacked data takes more space).
For example, one unsigned byte (8 bits) can encode numbers from 0 to 255. If the numbers are BCD encoded, only numbers from 0 to 99 can be represented, and some bit combinations remain unused. However, it is in some cases easier to base calculations on a BCD encoded number than on a binary encoded number.
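As a small illustration (in Python, purely for the sake of the example), here is the number 42 stored as a plain binary byte versus as a packed BCD byte:

    # The number 42 as a plain binary byte vs. as packed BCD.
    n = 42

    binary_byte = n.to_bytes(1, "big")           # 0x2A -- all 256 combinations usable
    bcd_byte = bytes([(n // 10) << 4 | n % 10])  # 0x42 -- one decimal digit per nibble

    print(binary_byte.hex())  # 2a
    print(bcd_byte.hex())     # 42 (reads as "42" in a hex dump, but only 0-99 fit)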
In summary, ISO 8583 defines two different encodings:
packed, which is 8 bytes of binary data
unpacked, which is 16 bytes of hexadecimal characters (in two different character sets, but that is another aspect).
One obvious difference is that when you dump this data to a console, you can immediately read the unpacked data as hexadecimal digits, while the packed binary encoding will only print garbage characters, depending on your console, your locale and the font you have installed.
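To make that concrete, here is a small Python sketch that builds a hypothetical primary bitmap (the field numbers are made up for the example) and shows the packed and unpacked forms side by side:

    # Hypothetical primary bitmap with fields 2, 3, 4, 7, 11 and 41 present.
    fields = [2, 3, 4, 7, 11, 41]

    bits = 0
    for f in fields:
        bits |= 1 << (64 - f)   # bit 1 is the most significant bit

    packed = bits.to_bytes(8, "big")                 # 8 raw binary bytes
    unpacked = packed.hex().upper().encode("ascii")  # 16 hex characters '0'-'9', 'A'-'F'

    print(packed)    # b'r \x00\x00\x00\x80\x00\x00' -- mostly unreadable in a dump
    print(unpacked)  # b'7220000000800000'           -- readable hexadecimal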

Big O Notation - input size

I am reading a blog about big O notation on TopCoder.
https://www.topcoder.com/community/data-science/data-science-tutorials/computational-complexity-section-1/
I have come across the paragraph below:
Formal notes on the input size
What exactly is this "input size" we started to talk about? In the formal definitions this is the size of the input written in some fixed finite alphabet (with at least 2 "letters"). For our needs, we may consider this alphabet to be the numbers 0..255. Then the "input size" turns out to be exactly the size of the input file in bytes.
Can anyone please explain what this statement means?
it is the size of the input written in some fixed finite alphabet (with at least 2 "letters"). For our needs, we may consider this alphabet to be the numbers 0..255.
The statement is about the fundamental representation of information using symbols. The more symbols you use (the bigger the alphabet is), the more information you can represent with fewer characters, although you can represent everything with just two "letters", i.e. one bit of information per character. Using the numbers 0..255 is equivalent to using 8 bits, i.e. one byte (2^8 = 256).
In computer programming you normally use bytes, but in theoretical computer science bits are used, as they have the same capabilities (you just need more of them) and they make proofs easier to write.
This statement means the following. To process the input with an algorithm you have to represent it, i.e. you have to "write it down". You can write the input down with letters (= symbols). The number of symbols has to be finite (otherwise neither you nor the algorithm can understand it), i.e. they come from a fixed finite alphabet (= the set of possible symbols). The size of the input is simply how many letters you used to write it down.
In the example mentioned in the text, the alphabet contains the numbers from 0 to 255. This means that each letter can be stored in one byte, so you can write down your input as a sequence of bytes; the size of the input (= the number of letters) is then the number of bytes.
Let me explain by example.
Let's take, say, the factorization (sub)problem: given a number n (not prime), find any of its divisors other than 1 and n. Clearly, we need to check at most sqrt(n) numbers to find one, so at first glance the problem seems easy (sublinear in the value of n). Why is it considered a hard nut to crack, then? Because we usually need only about log(n) digits to write n down, and we naturally want to solve problems that are "easy to write down". And although sqrt(n) may seem small compared to n, it is enormous compared to log(n).
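Here is a quick Python sketch of that point (the value of n is arbitrary): the number of candidate divisors grows like sqrt(n), while the input size grows only like log(n):

    import math

    def find_divisor(n):
        """Trial division: checks at most sqrt(n) candidates."""
        for d in range(2, math.isqrt(n) + 1):
            if n % d == 0:
                return d
        return None  # no divisor found: n is prime (or n < 4)

    print(find_divisor(91))   # 7 -- fine for small n

    n = 10**18 + 9
    print(n.bit_length())     # 60         -- the input is only ~60 bits (~8 bytes) long
    print(math.isqrt(n))      # 1000000000 -- but up to ~10**9 candidates to check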
That is why we need to say a word about the "input alphabet" before talking about a problem's complexity.

Does nvarchar always take twice as much space as varchar?

Nvarchar is used to store Unicode data, which in turn is used for multilingual data. If you don't end up storing Unicode text, does it still take up the same space?
YES.
See MSDN Books Online on NCHAR and NVARCHAR.
NCHAR:
The storage size is two times n bytes.
NVARCHAR:
The storage size, in bytes, is two times the number of characters entered + 2 bytes.
Sort of. Not every Unicode character needs two bytes in every encoding. UTF-8, for example, still uses just one byte per character much of the time, though rarely a character may need 4 bytes. What nvarchar does is allocate two bytes per character.
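A rough Python illustration of the difference (ignoring the extra 2 length bytes): UTF-16, which is essentially what nvarchar stores under the hood, versus UTF-8 for a few sample strings:

    # Byte counts for UTF-16 (little-endian, no BOM) vs UTF-8.
    for text in ["hello", "héllo", "こんにちは"]:
        print(text, len(text.encode("utf-16-le")), len(text.encode("utf-8")))

    # hello       10  5   -- plain ASCII: UTF-16 doubles the size
    # héllo       10  6
    # こんにちは   10 15   -- CJK text: UTF-8 can actually be bigger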

Why are an integer's bytes stored backwards? Does this apply to headers only?

I'm currently trying to decipher WAV files. From headers to the PCM data.
I've found a PDF (http://www.tdt.com/T2Support/technical_notes/tn0132.pdf) detailing the anatomy of a WAV file, and I've been able to extract and make sense of the appropriate header data using Ghex2. But my questions are:
Why are the integers' bytes stored backwards? E.g. decimal 20 is stored as 0x14000000 instead of 0x00000014.
Are the integers of the PCM data also stored backwards?
WAV files are little-endian (least significant byte first) because the format originated on operating systems running on Intel-processor-based machines, which use the little-endian format to store numbers.
If you think about it, it kind of makes sense: if you want to cast a long integer to a short one, or even to a single byte, the starting address remains the same; you just look at fewer bytes.
Consequently, for 16-bit encoding and upwards, the little-endian format is used for the PCM data as well. This is quite handy, since you will be able to pull the samples in as integers. Don't forget that they are stored as two's-complement signed integers if they are 16-bit, but not if they are 8-bit (see http://www-mmsp.ece.mcgill.ca/Documents/AudioFormats/WAVE/WAVE.html for more detail).
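For example, in Python you can read the little-endian header fields with the struct module ("sound.wav" is just a placeholder path; '<' tells struct to interpret the bytes as little-endian):

    import struct

    with open("sound.wav", "rb") as f:
        # First 12 bytes of a WAV file: 'RIFF', chunk size, 'WAVE'.
        riff, size, wave = struct.unpack("<4sI4s", f.read(12))
        print(riff, size, wave)   # b'RIFF', <file size - 8>, b'WAVE'

        # After locating the "data" chunk, 16-bit PCM samples are signed
        # little-endian shorts and can be unpacked with the 'h' format, e.g.:
        # samples = struct.unpack("<4h", data_bytes[:8])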
"Backwards" is subjective. Some machines are big-endian, others are little-endian. In byte-oriented contexts like file formats and network protocols, the order is arbitrary. Some formats like to specify big- or little-endian, others like to be flexible and accept either form, with a flag indicating which is in use.
Looks like WAV files just like little-endian.

[My]SQL VARCHAR Size and Null-Termination

Disclaimer: I'm very new to SQL and databases in general.
I need to create a field that will store a maximum of 32 characters of text data. Does "VARCHAR(32)" mean that I have exactly 32 characters for my data? Do I need to reserve an extra character for null-termination?
I conducted a simple test and it seems that this is a WYSIWYG buffer. However, I wanted to get a concrete answer from people who actually know what they're doing.
I have a C[++] background, so this question is raising alarm bells in my head.
Yes, you have 32 characters at your disposal. SQL does not concern itself with nul terminated strings like some programming languages do.
Your VARCHAR size specification is the maximum size of your data, so in this case 32 characters. However, VARCHAR is a variable-length field, so the actual physical storage used is only the size of your data, plus one or two bytes.
If you put a 10-character string into a VARCHAR(32), the physical storage will be 11 or 12 bytes (the manual will tell you the exact formula).
However, when MySQL is dealing with result sets (i.e. after a SELECT), 32 bytes will be allocated in memory for that field for every record.
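As a rough sketch of the storage rule (assuming a single-byte character set; varchar_storage is a made-up helper, and the exact formula is in the MySQL manual), the on-disk size can be estimated like this in Python:

    def varchar_storage(data: str, max_chars: int, bytes_per_char: int = 1) -> int:
        """Rough estimate of MySQL VARCHAR on-disk size: the data bytes plus a
        1- or 2-byte length prefix, depending on the column's maximum byte length."""
        max_bytes = max_chars * bytes_per_char
        prefix = 1 if max_bytes <= 255 else 2
        return len(data.encode("utf-8")) + prefix

    print(varchar_storage("hello world", 32))   # 12 -- 11 data bytes + 1 length byte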