How many bytes in BigQuery types?

How many bytes do the following types take up in BigQuery:
Timestamp
Datetime
Date
My guess was that a date could be stored in 2 bytes, and a timestamp in perhaps 8, but I wasn't sure, and it is not mentioned on the https://cloud.google.com/bigquery/docs/reference/standard-sql/data-types page.

The size of BigQuery's data types is as follows:
Data type        Size
---------------  ----------------------------------------------
INT64/INTEGER    8 bytes
FLOAT64/FLOAT    8 bytes
NUMERIC          16 bytes
BOOL/BOOLEAN     1 byte
STRING           2 bytes + the UTF-8 encoded string size
BYTES            2 bytes + the number of bytes in the value
DATE             8 bytes
DATETIME         8 bytes
TIME             8 bytes
TIMESTAMP        8 bytes
STRUCT/RECORD    0 bytes + the size of the contained fields
GEOGRAPHY        16 bytes + 24 bytes * the number of vertices
(you can verify the number of vertices in a GEOGRAPHY value with the ST_NumPoints function)
Null values for any data type are calculated as 0 bytes.
A repeated column is stored as an ARRAY, and its size is calculated from the number of values. For example, a repeated INT64 column containing 4 entries is calculated as 32 bytes (4 entries x 8 bytes).
See the "Data size calculation" section of the Pricing documentation for more details.
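As a rough illustration, here is a small Python sketch that estimates the data size of one row using the per-type sizes from the table above. The schema, row, and the helper name estimate_row_bytes are made up for the example; this mirrors the billing rules, it is not a BigQuery API:

```python
# Per-type sizes from the table above (logical/uncompressed bytes).
FIXED_SIZES = {
    "INT64": 8, "FLOAT64": 8, "NUMERIC": 16, "BOOL": 1,
    "DATE": 8, "DATETIME": 8, "TIME": 8, "TIMESTAMP": 8,
}

def estimate_row_bytes(row: dict, schema: dict) -> int:
    """Estimate the data size of one row; NULLs count as 0 bytes."""
    total = 0
    for column, value in row.items():
        if value is None:
            continue                      # NULL = 0 bytes
        col_type = schema[column]
        if col_type == "STRING":
            total += 2 + len(value.encode("utf-8"))
        elif col_type == "BYTES":
            total += 2 + len(value)
        else:
            total += FIXED_SIZES[col_type]
    return total

# A hypothetical schema and row:
schema = {"id": "INT64", "name": "STRING", "created": "TIMESTAMP"}
row = {"id": 1, "name": "héllo", "created": "2020-01-01 00:00:00"}
print(estimate_row_bytes(row, schema))   # 8 + (2 + 6) + 8 = 24 bytes
```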

Related

How does hexadecimal representation match word size?

What does the statement "hexadecimal matches cleanly with modulo 8 word sizes, such as 8, 16, 32, and 64 bits" mean?
Since a single hex digit can represent exactly 4 bits of binary data, any word size that's a multiple of 4 can be exactly represented with a fixed number of hex digits.
And every word size that's a multiple of 8 (i.e. the common ones) can be represented with a number of digits that's a multiple of 2:
8 bits can store values from 00 to FF
16 bits can store values from 0000 to FFFF
32 bits can store values from 00000000 to FFFFFFFF
...
All 2-digit hex numbers can be represented in 8 bits, and all 8-bit values can be represented in 2 hex digits. If a hex editor displays some value as CA FE BA BE, you can easily grasp that it's 4 bytes and thus 32 bits. Getting that information from the decimal 3405691582 is not quite as trivial (no matter how you group the digits: there are no nice "byte boundaries" in that representation).
If you compare this with decimal, the same isn't true. For example, 8 bits can represent values from 0 to 255 (decimal). So you need up to 3 digits in decimal to represent 8-bit values. But there are 3-digit decimal values that you can't represent in 8 bits: 256 (or anything higher than that) doesn't map onto 8 bits. So the mapping isn't perfect for decimal numbers.
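You can see this directly in Python, using only built-ins (the value is the CA FE BA BE example from above):

```python
value = 0xCAFEBABE

print(hex(value))                          # 0xcafebabe: a byte boundary every 2 digits
print(value.to_bytes(4, "big").hex(" "))   # 'ca fe ba be': exactly 4 bytes
print(value)                               # 3405691582: no visible byte boundaries
```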

Can DEFLATE only compress duplicate strings up to 32 KiB apart?

According to DEFLATE spec:
Compressed representation overview
A compressed data set consists of a series of blocks, corresponding to successive blocks of input
data. The block sizes are arbitrary, except that non-compressible
blocks are limited to 65,535 bytes.
Each block is compressed using a combination of the LZ77 algorithm and
Huffman coding. The Huffman trees for each block are independent of
those for previous or subsequent blocks; the LZ77 algorithm may use a
reference to a duplicated string occurring in a previous block, up to
32K input bytes before.
Each block consists of two parts: a pair of Huffman code trees that
describe the representation of the compressed data part, and a
compressed data part. (The Huffman trees themselves are compressed
using Huffman encoding.) The compressed data consists of a series of
elements of two types: literal bytes (of strings that have not been
detected as duplicated within the previous 32K input bytes), and
pointers to duplicated strings, where a pointer is represented as a
pair <length, backward distance>. The representation used in the
"deflate" format limits distances to 32K bytes and lengths to 258
bytes, but does not limit the size of a block, except for
uncompressible blocks, which are limited as noted above.
So pointers to duplicate strings only go back 32 KiB, but since block size is not limited, could the Huffman code tree store two duplicate strings more than 32 KiB apart as the same code? Then is the limiting factor the block size?
The Huffman tree for distances contains codes 0 to 29 (table below); code 29, followed by 8191 in "plain" extra bits, means "distance 32768" (24577 + 8191 = 32768). That's a hard limit in the definition of Deflate. The block size is not the limiting factor; in fact, the block size is not stored anywhere: a block is effectively an unbounded stream, and when you want to end it, you emit an end-of-block code.
Distance Codes
--------------
      Extra             Extra              Extra                Extra
 Code Bits Dist    Code Bits Dist     Code Bits Distance   Code Bits Distance
 ---- ---- ----    ---- ---- -------  ---- ---- ---------  ---- ---- -----------
   0    0    1       8    3   17-24     16    7   257-384    24   11   4097-6144
   1    0    2       9    3   25-32     17    7   385-512    25   11   6145-8192
   2    0    3      10    4   33-48     18    8   513-768    26   12   8193-12288
   3    0    4      11    4   49-64     19    8   769-1024   27   12  12289-16384
   4    1   5,6     12    5   65-96     20    9  1025-1536   28   13  16385-24576
   5    1   7,8     13    5   97-128    21    9  1537-2048   29   13  24577-32768
   6    2   9-12    14    6  129-192    22   10  2049-3072
   7    2  13-16    15    6  193-256    23   10  3073-4096
To add to Zerte's answer, the references to previous sequences have nothing to do with blocks or block boundaries. Such references can be within blocks, across blocks, and the referenced sequence can cross a block boundary.
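You can observe the 32 KiB window with Python's zlib module, which implements Deflate. A duplicated chunk that starts within 32 KiB of its first occurrence compresses down to a short back-reference; the same chunk placed beyond the window does not:

```python
import os
import zlib

chunk = os.urandom(8 * 1024)    # 8 KiB of incompressible random data

# Second copy starts 24 KiB after the first: within the 32 KiB window.
near = chunk + os.urandom(16 * 1024) + chunk
# Second copy starts 48 KiB after the first: beyond the 32 KiB window.
far = chunk + os.urandom(40 * 1024) + chunk

print(len(near), len(zlib.compress(near)))  # output roughly 8 KiB smaller than input
print(len(far), len(zlib.compress(far)))    # output roughly input size: no match found
```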

What is the meaning of leading length?

I was checking the difference between CHAR and VARCHAR2 on Google and came across the term LEADING LENGTH in this link. There it was written that:
Suppose you store the string ‘ORATABLE’ in a CHAR(20) field and a VARCHAR2(20) field. The CHAR field will use 22 bytes (2 bytes for leading length). The VARCHAR2 field will use 10 bytes only (8 for the string, 2 bytes for leading length).
Q1: How can the CHAR field use 22 bytes if the string is only 8 characters (1 byte = 1 char)?
Q2: What is the leading length? Why does it occupy 2 bytes?
The CHAR() datatype pads the string with trailing spaces up to the declared length. So, for 'ORATABLE' in a CHAR(20), the stored value looks like:
'ORATABLE            '
 12345678901234567890
The "leading length" are two bytes at the beginning that specify the length of the string. Two bytes are needed because one byte is not enough. Two bytes allow lengths up to 65,535 units; one byte would only allow lengths up to 255.
The important point is that both CHAR() and VARCHAR2() use the same internal format, so there is little reason to use CHAR(). Personally, I would only use it for fixed-length codes, such as ISO country codes or US Social Security numbers.
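To make the arithmetic from the quote concrete, here is a Python sketch of a length-prefixed encoding. It is illustrative only, not Oracle's actual on-disk format; the helper with_leading_length is made up, with a 2-byte big-endian length field playing the role of the leading length:

```python
import struct

def with_leading_length(s: str) -> bytes:
    """Prefix the encoded string with a 2-byte length field."""
    data = s.encode("utf-8")
    return struct.pack(">H", len(data)) + data   # ">H" = 2-byte unsigned length

# VARCHAR2(20): only the 8 actual characters are stored.
print(len(with_leading_length("ORATABLE")))             # 10 = 2 + 8
# CHAR(20): the value is space-padded to 20 characters first.
print(len(with_leading_length("ORATABLE".ljust(20))))   # 22 = 2 + 20
```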

SQL Server 2008, how much space does this occupy?

I am trying to calculate how much space (MB) this would occupy. In the database table there are 7 BIT columns, 2 TINYINTs, and 1 GUID.
I am trying to calculate the amount of space that 16,000 rows would occupy.
My line of thought was that the 7 BIT columns consume 1 byte, the 2 TINYINTs consume 2 bytes, and the GUID consumes 16 bytes, for a total of 19 bytes per row. That would mean 304,000 bytes for 16,000 rows, or roughly 0.3 MB. Is that correct? Is there a metadata byte per row as well?
There are several estimators out there that take away the donkey work.
You have to take into account the NULL bitmap (3 bytes in this case), the row header, row versioning info, pointers, the number of rows per page, and all the stuff described here:
Inside the Storage Engine: Anatomy of a record
Edit:
- 19 bytes of actual data
- 11 bytes of overhead
- 30 bytes per row in total
- around 269 rows per page (8096 / 30)
- requires 60 pages (16000 / 269)
- around 490 KB of space (60 x 8192)
- plus a few KB for the index structure of the primary key
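As a back-of-envelope check of those numbers in Python (the 11-byte overhead figure is taken from the answer above, not derived here):

```python
import math

row_data = 19        # 7 BIT cols -> 1 byte, 2 TINYINT -> 2 bytes, GUID -> 16 bytes
row_overhead = 11    # row header + NULL bitmap + slot pointer (per the answer)
row_total = row_data + row_overhead          # 30 bytes per row

rows_per_page = 8096 // row_total            # 8096 usable bytes per 8 KB page
pages = math.ceil(16_000 / rows_per_page)
print(rows_per_page, pages, pages * 8192)    # 269 60 491520 (~490 KB)
```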

Storing a 30KB BLOB in SQL Server 2005

My data is 30 KB on disk (a serialized object). What size should the binary field in T-SQL be?
Is the number in brackets bytes?
... so is binary(30000) 30 KB?
Thanks
You need to use the varbinary(max) data type; the maximum allowed size for binary is 8,000 bytes. Per the MSDN page on binary and varbinary:
varbinary [ ( n | max) ]
Variable-length binary data. n can be a value from 1 through 8,000. max indicates that the maximum storage size is 2^31-1 bytes. The storage size is the actual length of the data entered + 2 bytes. The data that is entered can be 0 bytes in length.
The number after binary() is the number of bytes, see MSDN:
binary [ ( n ) ]
Fixed-length binary data of n bytes. n
must be a value from 1 through 8,000.
Storage size is n+4 bytes.
Whether 30 KB is 30,000 or 30,720 bytes depends on whether you are using decimal prefixes (1 kB = 1,000 bytes) or binary prefixes (1 KiB = 1,024 bytes).
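Either way, the value is far too large for a fixed-length binary(n) column, as a quick check shows (plain Python, just mirroring the limits quoted above):

```python
blob = 30 * 1024                # 30,720 bytes, if "30 KB" means 30 KiB
print(blob <= 8000)             # False: exceeds the 8,000-byte limit of binary(n)
print(blob <= 2**31 - 1)        # True: fits comfortably in varbinary(max)
print(blob + 2)                 # varbinary storage: actual length + 2 bytes
```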