I am trying to calculate how much space (MB) this would occupy. In the database table there are 7 bit columns, 2 tinyint columns and 1 GUID.
I'm trying to calculate the amount of space that 16,000 rows would occupy.
My line of thought was that the 7 bit columns consume 1 byte, the 2 tinyints consume 2 bytes and the GUID consumes 16 bytes, for a total of 19 bytes per row. That would mean 304,000 bytes for 16,000 rows, or ~0.3 MB. Is that correct? Is there a metadata byte as well?
There are several estimators out there which take away the donkey work.
You have to take into account the NULL bitmap (which will be 3 bytes in this case), the number of rows per page, the row header, row versioning, pointers, and all the stuff covered here:
Inside the Storage Engine: Anatomy of a record
Edit:
Your 19 bytes of actual data
has 11 bytes overhead
total 30 bytes per row
around 269 rows per page (8096 / 30)
requires 60 pages (16000 / 269)
around 490 KB of space (60 x 8,192)
plus a few KB for the index structure of the primary key
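To make the arithmetic concrete, here is a rough sketch in Python (my own illustration, not part of the original answer) that reproduces the estimate above:

import math

ROW_DATA_BYTES = 19          # 7 bit columns (1 byte) + 2 tinyints (2 bytes) + 1 GUID (16 bytes)
ROW_OVERHEAD_BYTES = 11      # row header, NULL bitmap, slot pointer, etc.
PAGE_DATA_BYTES = 8096       # usable space on an 8,192-byte page
PAGE_BYTES = 8192
NUM_ROWS = 16000

row_size = ROW_DATA_BYTES + ROW_OVERHEAD_BYTES       # 30 bytes per row
rows_per_page = PAGE_DATA_BYTES // row_size          # 269 rows per page
pages = math.ceil(NUM_ROWS / rows_per_page)          # 60 pages
print(pages * PAGE_BYTES)                            # 491,520 bytes, about 490 KB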
According to the DEFLATE spec:
Compressed representation overview
A compressed data set consists of a series of blocks, corresponding to successive blocks of input
data. The block sizes are arbitrary, except that non-compressible
blocks are limited to 65,535 bytes.
Each block is compressed using a combination of the LZ77 algorithm and
Huffman coding. The Huffman trees for each block are independent of
those for previous or subsequent blocks; the LZ77 algorithm may use a
reference to a duplicated string occurring in a previous block, up to
32K input bytes before.
Each block consists of two parts: a pair of Huffman code trees that
describe the representation of the compressed data part, and a
compressed data part. (The Huffman trees themselves are compressed
using Huffman encoding.) The compressed data consists of a series of
elements of two types: literal bytes (of strings that have not been
detected as duplicated within the previous 32K input bytes), and
pointers to duplicated strings, where a pointer is represented as a
pair <length, backward distance>. The representation used in the
"deflate" format limits distances to 32K bytes and lengths to 258
bytes, but does not limit the size of a block, except for
uncompressible blocks, which are limited as noted above.
So pointers to duplicate strings only go back 32 KiB, but since block size is not limited, could the Huffman code tree store two duplicate strings more than 32 KiB apart as the same code? Then is the limiting factor the block size?
The Huffman tree for distances contains codes 0 to 29 (table below); code 29, followed by 8191 in "plain" extra bits, means "distance 32768". That is a hard limit in the definition of Deflate. The block size is not the limiting factor; in fact, the block size is not stored anywhere: a block is effectively an unbounded stream, and to end a block you emit an End-Of-Block code.
Distance Codes
--------------
     Extra              Extra               Extra                Extra
Code Bits  Dist    Code Bits  Dist     Code Bits  Distance  Code Bits  Distance
---- ---- -----    ---- ---- -------   ---- ---- ---------  ---- ---- -----------
  0    0      1      8    3    17-24    16    7    257-384   24   11    4097-6144
  1    0      2      9    3    25-32    17    7    385-512   25   11    6145-8192
  2    0      3     10    4    33-48    18    8    513-768   26   12   8193-12288
  3    0      4     11    4    49-64    19    8   769-1024   27   12  12289-16384
  4    1    5,6     12    5    65-96    20    9  1025-1536   28   13  16385-24576
  5    1    7,8     13    5   97-128    21    9  1537-2048   29   13  24577-32768
  6    2   9-12     14    6  129-192    22   10  2049-3072
  7    2  13-16     15    6  193-256    23   10  3073-4096
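As a quick sanity check of the table, here is a small Python sketch (my own, using the base-distance and extra-bit counts from the Deflate spec): the actual distance is the code's base value plus its extra bits, and code 29 with all 13 extra bits set gives the 32,768 limit.

DIST_BASE = [1, 2, 3, 4, 5, 7, 9, 13, 17, 25, 33, 49, 65, 97, 129, 193,
             257, 385, 513, 769, 1025, 1537, 2049, 3073, 4097, 6145,
             8193, 12289, 16385, 24577]
DIST_EXTRA_BITS = [0, 0, 0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6,
                   7, 7, 8, 8, 9, 9, 10, 10, 11, 11, 12, 12, 13, 13]

def distance(code, extra):
    # the extra value must fit in the number of extra bits defined for this code
    assert 0 <= extra < (1 << DIST_EXTRA_BITS[code])
    return DIST_BASE[code] + extra

print(distance(29, 8191))   # 24577 + 8191 = 32768, the hard limit
print(distance(4, 1))       # 5 + 1 = 6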
To add to Zerte's answer, the references to previous sequences have nothing to do with blocks or block boundaries. Such references can be within blocks, across blocks, and the referenced sequence can cross a block boundary.
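To see the 32 KiB window in practice, here is a rough sketch using Python's zlib module (raw Deflate via a negative wbits value); the exact savings depend on zlib's internals, but a repeat within the window is matched while one beyond it is not:

import os
import zlib

def deflate_len(data):
    # raw Deflate stream, no zlib/gzip wrapper, maximum effort
    c = zlib.compressobj(9, zlib.DEFLATED, -15)
    return len(c.compress(data) + c.flush())

chunk = os.urandom(1000)                       # 1 KB of incompressible data
near = chunk + os.urandom(20000) + chunk       # second copy ~21 KB back: inside the window
far = chunk + os.urandom(40000) + chunk        # second copy ~41 KB back: outside the window

print(len(near) - deflate_len(near))   # roughly +1000: the repeated chunk was matched
print(len(far) - deflate_len(far))     # around zero or slightly negative: no match possible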
How many bytes do the following types take up in BigQuery:
Timestamp
Datetime
Date
My guess was that date could be stored in 2 bytes, and a timestamp perhaps 8, but I wasn't sure about that and it is not mentioned on the https://cloud.google.com/bigquery/docs/reference/standard-sql/data-types page.
The size of BigQuery's data types is as follows:
Data type        Size
INT64/INTEGER    8 bytes
FLOAT64/FLOAT    8 bytes
NUMERIC          16 bytes
BOOL/BOOLEAN     1 byte
STRING           2 bytes + the UTF-8 encoded string size
BYTES            2 bytes + the number of bytes in the value
DATE             8 bytes
DATETIME         8 bytes
TIME             8 bytes
TIMESTAMP        8 bytes
STRUCT/RECORD    0 bytes + the size of the contained fields
GEOGRAPHY        16 bytes + 24 bytes * the number of vertices in the geography type (you can verify the number of vertices using the ST_NumPoints function)
Null values for any data type are calculated as 0 bytes.
A repeated column is stored as an array, and the size is calculated
based on the number of values. For example, an integer column (INT64)
that is repeated (ARRAY) and contains 4 entries is calculated
as 32 bytes (4 entries x 8 bytes).
See more details in the Data size calculation section of the Pricing documentation.
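As a rough illustration of how these sizes add up (my own sketch, not an official API; the helper name and input format are made up, and the per-type sizes are the ones listed above):

FIXED_SIZES = {
    "INT64": 8, "FLOAT64": 8, "NUMERIC": 16, "BOOL": 1,
    "DATE": 8, "DATETIME": 8, "TIME": 8, "TIMESTAMP": 8,
}

def estimate_row_bytes(row):
    """row: list of (type_name, value) tuples; repeated fields pass a list as the value."""
    total = 0
    for type_name, value in row:
        if value is None:
            continue                              # NULL counts as 0 bytes
        values = value if isinstance(value, list) else [value]
        for v in values:                          # repeated columns: size per entry
            if type_name == "STRING":
                total += 2 + len(v.encode("utf-8"))
            elif type_name == "BYTES":
                total += 2 + len(v)
            else:
                total += FIXED_SIZES[type_name]
    return total

print(estimate_row_bytes([("INT64", [1, 2, 3, 4])]))                     # 32 (4 x 8 bytes)
print(estimate_row_bytes([("STRING", "abc"), ("DATE", "2020-01-01")]))   # 5 + 8 = 13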
I need to store 5 values in a single SQL Server column, each in the range 1-90. The values cannot be repeated. I thought of using the 2, 4, 8, 16, 32, 64, ... system but, as you can guess, it will get really big, and using decimal I risk calculation errors. Is there a convenient way to:
store the 5 values in a single column so as to avoid having 90 bit columns in the table (see my previous post here);
quickly query the database, for example to return all records with numbers X and Y?
Another option was a string(90) containing flags like 000001000011000, but that way I have to use substrings to query, and I fear it will slow down on a table with 25,000 records or more.
First request: you say most columns are bits, but if not all of them are, you can't use bitwise operators and you can't save it in a single field.
In that case you need an additional table.
Row_id | fieldName | fieldValue
1 | name1 | value1
1 | name2 | value2
.
.
.
1 | name90 | value90
Second request: saving the 5 values is very easy and fast with the additional table. Just create an index on row_id in both tables.
Third request: here you say again that you can save it as bits, but using strings instead is a bad idea.
You are right that a single number isn't big enough to hold 90 bits; that is because a number can only hold 32 or 64 bits depending on its type.
In that case you need to use two fields (64 bits each) or three fields (32 bits each) to store all 90 possible flags.
Again, easy to do and really fast.
EDIT
To use multiple fields you have to split the flags into groups.
For example, imagine 16 bits split into two groups of 8 bits (values 0..255):
01234567 89ABCDEF
01010101 11111111
Create FieldUp and FieldDown.
SAVE
FieldUp holds positions 01234567 (pattern 01010101):
FieldUp = 1 + 4 + 16 + 64 = 85
FieldDown holds positions 89ABCDEF (pattern 11111111):
FieldDown = 1 + 2 + 4 + 8 + 16 + 32 + 64 + 128 = 255
Then selecting a row with flags [b1, b5, bA] set would be (in SQL Server the bitwise result must be compared explicitly, and requiring both bits means comparing against the full mask):
SELECT *
FROM TABLE
WHERE (FieldUp & (4 + 32)) = (4 + 32)
  AND (FieldDown & 8) = 8
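As a sketch of the two-field idea in application code (my own illustration in Python; the pack/contains helpers and the split of flags 1-64 / 65-90 across two 64-bit fields are just assumptions):

def pack(values):
    """Pack flags 1..90 into two 64-bit integers (e.g. for two BIGINT columns)."""
    low, high = 0, 0
    for v in values:                  # each v is in 1..90, no repeats
        if v <= 64:
            low |= 1 << (v - 1)       # flags 1..64 in the first field
        else:
            high |= 1 << (v - 65)     # flags 65..90 in the second field
    return low, high

def contains(low, high, v):
    if v <= 64:
        return (low >> (v - 1)) & 1 == 1
    return (high >> (v - 65)) & 1 == 1

low, high = pack([3, 17, 42, 70, 90])
print(contains(low, high, 42), contains(low, high, 41))  # True False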
I resolved this by saving the numbers comma-separated; in my code I split the field into an array and process the data. The numbers are not meant for math operations, they are just treated as a string.
I'm wondering if it is possible to represent a number as a sequence of bits, each having approximately the same significance, such that if we flip one of the bits, the overall value does not change by much.
For example, we can use groups of 4 bits, where each group represents a value from 0 to 15 and the overall value is the sum of all these values.
0110 0101 1101 1010 1011 → 6 + 5 + 13 + 10 + 11 = 45
and now flipping any single bit changes the final value by at most 8.
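As a concrete sketch of this scheme in Python (my own illustration; the greedy encoder is just one arbitrary choice among the many possible representations of a value):

def encode_naive(n, groups=5):
    """Fill each 4-bit group greedily with up to 15; one of many valid encodings."""
    nibbles = []
    for _ in range(groups):
        take = min(n, 15)
        nibbles.append(take)
        n -= take
    assert n == 0, "value out of range for this many groups"
    return nibbles

def decode(nibbles):
    return sum(nibbles)

nibbles = encode_naive(45)          # e.g. [15, 15, 15, 0, 0]
print(decode(nibbles))              # 45

flipped = nibbles[:]
flipped[0] ^= 0b1000                # flip the most significant bit of one group
print(abs(decode(flipped) - 45))    # 8, the worst case for a single bit flip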
Some drawbacks obviously exist with this approach:
values have multiple representations, with some values having more representations than other ones (for example, there are 39280 distinct representations for the number 38, and only 1 for the number 0);
the number of values that can be represented is greatly reduced (this representation allows for integers from 0 to 75, while 20 bits could normally represent 2^20 ≈ 1 million different integers).
Are there any resources I can find concerning this problem? I can't seem to find anything online, but maybe I'm not searching with the right keywords. What other alternatives exist to my approach? Do they improve on its disadvantages?
I'm preparing for the SQL Server exam (70-431). I have the book from Sybex, "SQL Server 2005 - Implementation and Maintenance". I'm a little confused about estimating the size of a table.
In the 2nd chapter there is explained how to do this:
Calculate the row size from the formula: Row_Size = Fixed_Data_Size + Variable_Data_Size + Null_Bitmap + Row_Header.
Fixed_Data_Size is the sum of the sizes of all fixed-length columns (a simple sum).
Variable_Data_Size = 2 + (num_variable_columns × 2) + max_varchar_size, where num_variable_columns is the number of variable-length columns and max_varchar_size is the maximum size of the varchar columns.
Null_Bitmap = 2 + ((number of columns + 7) ÷ 8) (rounded down).
Row_Header always equals 4.
Calculating rows per page from the formula: Rows_Per_Page = 8096 ÷ (Row_Size + 2) (rounded down)
Estimating the number of rows in the table. Let's say that table has 1,000 rows.
Calculating the number of pages needed: No_Of_Pages = 1,000 / Rows_Per_Page (rounded up)
Total size: Total_Size = No_Of_Pages * 8,192, where 8,192 is the size of one page.
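The steps translate directly into a small calculation; here is a rough Python sketch (my own, simply following the book's formulas above):

import math

def estimate_table_size(num_columns, fixed_data_size, num_rows,
                        num_variable_columns=0, max_varchar_size=0):
    variable_data_size = 0
    if num_variable_columns:
        variable_data_size = 2 + (num_variable_columns * 2) + max_varchar_size
    null_bitmap = 2 + (num_columns + 7) // 8            # rounded down
    row_header = 4
    row_size = fixed_data_size + variable_data_size + null_bitmap + row_header
    rows_per_page = 8096 // (row_size + 2)              # rounded down
    no_of_pages = math.ceil(num_rows / rows_per_page)   # rounded up
    return no_of_pages * 8192                           # 8,192 bytes per page

# e.g. four fixed-length columns totalling 24 bytes and 1,000 rows
print(estimate_table_size(num_columns=4, fixed_data_size=24, num_rows=1000))  # 40,960 bytes (5 pages)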
So everything is perfectly clear for me. I made one example and checked with the answers in the book that my calculations are correct. But there is one question which confuses me.
The question is: we have a table with the following schema:
Name Datatype
-------------------
ID Int
VendorID Int
BalanceDue Money
DateDue Datetime
It is expected that this table will contain about 5,000 rows. Question (literally): "How much space will the Receivables table take?"
So my answer is simple:
null_bitmap = 2 + ((4 + 7) / 8) = 3.375 → 3 (rounded down)
fixed_datasize = 4 + 4 + 8 + 8 = 24
variable_datasize = 0
row_header = 4 (always)
row_size = 3 + 24 + 0 + 4 = 31
But in the answer they omit the row_header and don't add 4. Is it a mistake in the book, or is the row_header only added in some cases (which are not mentioned in the book)? I was thinking that maybe the row_header is added only if there are variable-length fields in the table, but there is another exercise in which there are no variable-length fields and the row_header is still added. I would appreciate it if someone could explain that to me. Thanks.
Inside the Storage Engine: Anatomy of a record says all records have a record header:
The record structure is as follows:
record header
    4 bytes long
    two bytes of record metadata (record type)
    two bytes pointing forward in the record to the NULL bitmap