What is the size of metadata in a postgres table?

There is a table in postgres 9.4 with following types of columns:
NAME | TYPE | TYPE SIZE
id | integer | 4 bytes
timestamp | timestamp with time zone | 8 bytes
num_seconds | double precision | 8 bytes
count | integer | 4 bytes
total | double precision | 8 bytes
min | double precision | 8 bytes
max | double precision | 8 bytes
local_counter | integer | 4 bytes
global_counter | integer | 4 bytes
discrete_value | integer | 4 bytes
Giving in total: 60 bytes per row
The size of the table (including TOAST) returned by pg_table_size(table) is: 49 152 bytes
Number of rows in the table: 97
Since a table is stored in pages of 8 kB, the table occupies 49 152 / 8 192 = 6 pages.
Each page and each row has some meta-data...
Looking at the pure datatype size we should expect something around 97 * 60 = 5 820 bytes of row data and adding approximately the same amount of metadata to it, we are not landing even close to the result returned by pg_table_size: 49 152 bytes.
Does metadata really take ~9x space compared to the pure data in postgres?

A factor 9 is clearly more wasted space ("bloat") than there should be:
Each page has a 24-byte header.
Each row has a 23-byte "tuple header" (padded to 24 bytes so that the data starts on an 8-byte boundary).
There will be four bytes of padding between id and timestamp and between count and total for alignment reasons (you can avoid that by reordering the columns).
Moreover, each tuple has a "line pointer" of four bytes in the data page.
See this answer for some details.
To see exactly how the space in your table is used, install the pgstattuple extension:
CREATE EXTENSION pgstattuple;
and use the pgstattuple function on the table:
SELECT * FROM pgstattuple('tablename');
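As a rough cross-check of the header and padding figures above, the per-row footprint can be estimated by laying the columns out with their alignment. This is only a sketch: it assumes no NULLs (so no null bitmap), 8-byte maximum alignment and the default 8 kB page, and the helper names are made up for illustration.

```python
PAGE_SIZE = 8192      # default block size (assumption: not changed at build time)
PAGE_HEADER = 24      # bytes
TUPLE_HEADER = 23     # bytes, before alignment padding
LINE_POINTER = 4      # bytes per row in the page's item array
MAXALIGN = 8          # assumption: typical 64-bit platform


def align(offset, boundary):
    """Round offset up to the next multiple of boundary."""
    return (offset + boundary - 1) // boundary * boundary


def tuple_size(columns):
    """Size of one row on the page, given (size, alignment) per column."""
    offset = align(TUPLE_HEADER, MAXALIGN)
    for size, alignment in columns:
        offset = align(offset, alignment) + size
    return align(offset, MAXALIGN)


INT4 = (4, 4)
FLOAT8 = (8, 8)
TIMESTAMPTZ = (8, 8)

# Column order from the question.
columns = [INT4, TIMESTAMPTZ, FLOAT8, INT4, FLOAT8,
           FLOAT8, FLOAT8, INT4, INT4, INT4]

row_bytes = tuple_size(columns) + LINE_POINTER
rows_per_page = (PAGE_SIZE - PAGE_HEADER) // row_bytes
pages_needed = -(-97 // rows_per_page)  # ceiling division for 97 rows

# Same columns with the 8-byte types first, which removes the padding.
reordered = sorted(columns, key=lambda c: c[1], reverse=True)

print(tuple_size(columns), row_bytes, rows_per_page, pages_needed)
print(tuple_size(reordered))
```

Under these assumptions a row costs 96 + 4 = 100 bytes, so the 97 live rows would fit in two pages. The rest of the reported size would then have to come from something other than live rows, for example dead tuples, or the free space map and visibility map that pg_table_size also counts; pgstattuple shows how the space in the table itself is used.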

Related

How to select half precision (BFLOAT16 vs FLOAT16) for your trained model?

How do you decide which precision works best for your inference model? Both BF16 and FP16 take two bytes, but they use a different number of bits for the fraction and the exponent.
The range will be different, but I am trying to understand why one would choose one over the other.
Thank you
|--------+------+----------+----------|
| Format | Bits | Exponent | Fraction |
|--------+------+----------+----------|
| FP32 | 32 | 8 | 23 |
| FP16 | 16 | 5 | 10 |
| BF16 | 16 | 8 | 7 |
|--------+------+----------+----------|
Range
bfloat16: ~1.18e-38 … ~3.39e38 with 3 significant decimal digits.
float16: ~5.96e-8 subnormal (~6.10e-5 smallest normal) … 65504 with 4 significant decimal digits.
bfloat16 is generally easier to use, because it works as a drop-in replacement for float32. If your code doesn't create nan/inf numbers or turn a non-0 into a 0 with float32, then it shouldn't do it with bfloat16 either, roughly speaking. So, if your hardware supports it, I'd pick that.
Check out AMP (automatic mixed precision) if you choose float16.
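The range figures quoted above follow from the exponent and fraction widths in the table. The sketch below roughly reproduces them, assuming IEEE-754-style encoding (biased exponent, all-ones exponent reserved for inf/nan, subnormals supported); the function name is only illustrative.

```python
def float_range(exponent_bits, fraction_bits):
    """Return (smallest subnormal, smallest normal, largest finite) value."""
    bias = 2 ** (exponent_bits - 1) - 1
    largest = (2 - 2.0 ** -fraction_bits) * 2.0 ** bias
    smallest_normal = 2.0 ** (1 - bias)
    smallest_subnormal = smallest_normal * 2.0 ** -fraction_bits
    return smallest_subnormal, smallest_normal, largest


for name, e, f in [("FP32", 8, 23), ("FP16", 5, 10), ("BF16", 8, 7)]:
    sub, normal, largest = float_range(e, f)
    print(f"{name}: subnormal {sub:.3g}, normal {normal:.3g}, max {largest:.6g}")
```

BF16 shares FP32's exponent width and therefore its smallest normal value and (almost) its maximum, which is why it behaves like a drop-in replacement as far as overflow and underflow are concerned; what it gives up is fraction bits.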

Deflate: code lengths of > 7 bits for top-level HCLEN?

RFC 1951 specifies that the first level of encoding in a block contains (HCLEN + 4) 3-bit values, which encode the code lengths of the next level of Huffman codes (the code length alphabet). Since these are 3-bit values, it follows that no code for the next level can be longer than 7 bits (111 in binary).
However, there seem to be corner cases which (at least with the "classical" algorithm to build Huffman codes, using a priority queue) apparently generate codes of 8 bits, which can of course not be encoded.
An example I came up with is the following (this represents the 19 possible symbols resulting from the RLE encoding, 0-15 plus 16, 17 and 18):
symbol | frequency
-------+----------
0 | 15
1 | 14
2 | 6
3 | 2
4 | 18
5 | 5
6 | 12
7 | 26
8 | 3
9 | 20
10 | 79
11 | 94
12 | 17
13 | 7
14 | 8
15 | 4
16 | 16
17 | 1
18 | 13
According to various online calculators (eg https://people.ok.ubc.ca/ylucet/DS/Huffman.html), and also building the tree by hand, some symbols in the above table (namely 3 and 17) produce 8-bit long Huffman codes. The resulting tree looks ok to me, with 19 leaf nodes and 18 internal nodes.
So, is there a special way to calculate Huffman codes for use in DEFLATE?
Yes. deflate uses length-limited Huffman codes. You need either a modified Huffman algorithm that limits the length, or an algorithm that shortens a Huffman code that has exceeded the length. (zlib does the latter.)
In addition to the code lengths code being limited to seven bits, the literal/length and distance codes are limited to 15 bits. It is not at all uncommon to exceed those limits when applying Huffman's algorithm to sets of frequencies encountered during compression.
Though your example is not a valid or possible set of frequencies for that code. Here is a valid example that results in a 9-bit Huffman code, which would then need to be squashed down to seven bits:
3 0 0 5 5 1 9 31 58 73 59 28 9 1 2 0 6 0 0
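To see the limit being exceeded, the two frequency sets above can be run through a plain, unrestricted Huffman construction. This is a minimal sketch of the textbook priority-queue algorithm, not of what zlib does; the function and variable names are made up, and ties are broken by insertion order, which is an arbitrary choice.

```python
import heapq
from itertools import count


def huffman_code_lengths(freqs):
    """Unrestricted Huffman code lengths; unused symbols get length 0."""
    lengths = [0] * len(freqs)
    tiebreak = count()
    # Heap entries: (weight, tiebreak, symbols in this subtree).
    heap = [(f, next(tiebreak), [s]) for s, f in enumerate(freqs) if f > 0]
    heapq.heapify(heap)
    if len(heap) == 1:
        lengths[heap[0][2][0]] = 1
        return lengths
    while len(heap) > 1:
        w1, _, s1 = heapq.heappop(heap)
        w2, _, s2 = heapq.heappop(heap)
        # Every symbol below the new parent gets one bit longer.
        for s in s1 + s2:
            lengths[s] += 1
        heapq.heappush(heap, (w1 + w2, next(tiebreak), s1 + s2))
    return lengths


question_freqs = [15, 14, 6, 2, 18, 5, 12, 26, 3, 20,
                  79, 94, 17, 7, 8, 4, 16, 1, 13]
answer_freqs = [3, 0, 0, 5, 5, 1, 9, 31, 58, 73,
                59, 28, 9, 1, 2, 0, 6, 0, 0]

print(max(huffman_code_lengths(question_freqs)))  # longest code, question's table
print(max(huffman_code_lengths(answer_freqs)))    # longest code, answer's example
```

Both results are above the 7-bit limit of the code length code, so a deflate encoder has to apply a length-limiting step (or a length-limited construction) before it can emit the lengths.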

MariaDB table size if FK is empty or Null

Update my question, after negative vote.
I have the following table.
+-------------------------------+
| tbl_IndexDemo |
+---------+------+--------------+
| ID | INT | Primary Key |
| FK_1 | INT | Foreign Key |
| FK_2 | INT | Foreign Key |
| FK_3 | INT | Foreign Key |
| FK_4 | INT | Foreign Key |
+---------+------+--------------+
It has one Primary Key and four Foreign Keys.
I am trying to estimate size (in bytes) of a row.
What is the size if all fields have data? I calculated
ID 4 bytes + 4 bytes (Index-PK)
FK_1 4 bytes
FK_2 4 bytes
FK_3 4 bytes
FK_4 4 bytes
Total per row = 24 bytes
And if FK_3 is empty or NULL, what is the size of the row?
I am using MariaDB with InnoDB.
Computing the size of an InnoDB table is far from that simple. But here is a quick and dirty way to estimate. You got 24; now multiply that by 2 or 3. That is, a crude estimate of the row size will be 48-72 bytes.
As for NULL or not, well that will make only a 4 byte difference per INT, if it makes any difference.
Note that there are 4 ROW_FORMAT values possible. This adds hard-to-quantify wrinkles to the calculations. TEXT and PARTITION also make a mess of estimating size.
If you are worried about space, then consider whether you really need INT, which takes 4 bytes and has a limit of 2 billion. Perhaps MEDIUMINT UNSIGNED (3 bytes, max of 16M) would be better -- especially considering that saves 1 byte for each occurrence. You will have at least 3 occurrences -- the column in tbl_IndexDemo, the column in the other table, and the INDEX implicitly created by the FK.

Luke reveals unknown term values for numeric fields in index

We use Lucene.net for indexing. One of the fields that we index, is a numeric field with the values 1 to 6 and 9999 for not set.
When using Luke to explore the index, we see terms that we do not recognize. The index contains a total of 38673 documents, and Luke shows the following top ranked terms for this field:
Term | Rank | Field | Text | Text (decoded as numeric-int)
1 | 38673 | Axis | x | 0
2 | 38673 | Axis | p | 0
3 | 38673 | Axis | t | 0
4 | 38673 | Axis | | | 0
5 | 19421 | Axis | l | 0
6 | 19421 | Axis | h | 0
7 | 19421 | Axis | d# | 0
8 | 19252 | Axis | ` N | 9999
9 | 19252 | Axis | l | 8192
10 | 19252 | Axis | h ' | 9984
11 | 19252 | Axis | d# p | 9984
12 | 18209 | Axis | ` | 4
13 | 950 | Axis | ` | 1
14 | 116 | Axis | ` | 5
15 | 102 | Axis | ` | 6
16 | 26 | Axis | ` | 3
17 | 18 | Axis | ` | 2
We find the same pattern for other numeric fields.
Where do the unknown values come from?
NumericFields are indexed using a trie structure. The terms you see are part of it, but will not return results if you query for them.
Try indexing your NumericField with a precision step of Int32.MaxValue and the values will go away.
NumericField documentation
... Within Lucene, each numeric value is indexed as a trie structure, where each term is logically assigned to larger and larger pre-defined brackets (which are simply lower-precision representations of the value). The step size between each successive bracket is called the precisionStep, measured in bits. Smaller precisionStep values result in larger number of brackets, which consumes more disk space in the index but may result in faster range search performance. The default value, 4, was selected for a reasonable tradeoff of disk space consumption versus performance. You can use the expert constructor NumericField(String,int,Field.Store,boolean) if you'd like to change the value. Note that you must also specify a congruent value when creating NumericRangeQuery or NumericRangeFilter. For low cardinality fields larger precision steps are good. If the cardinality is < 100, it is fair to use Integer.MAX_VALUE, which produces one term per value. ...
More details on the precision step available in the NumericRangeQuery documentation:
Good values for precisionStep are depending on usage and data type:
• The default for all data types is 4, which is used, when no precisionStep is given.
• Ideal value in most cases for 64 bit data types (long, double) is 6 or 8.
• Ideal value in most cases for 32 bit data types (int, float) is 4.
• For low cardinality fields larger precision steps are good. If the cardinality is < 100, it is fair to use Integer.MAX_VALUE (see below).
• Steps ≥64 for long/double and ≥32 for int/float produces one token per value in the index and querying is as slow as a conventional TermRangeQuery. But it can be used to produce fields, that are solely used for sorting (in this case simply use Integer.MAX_VALUE as precisionStep). Using NumericFields for sorting is ideal, because building the field cache is much faster than with text-only numbers. These fields have one term per value and therefore also work with term enumeration for building distinct lists (e.g. facets / preselected values to search for). Sorting is also possible with range query optimized fields using one of the above precisionSteps.
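The extra terms Luke shows can be reproduced with a little bit arithmetic. The sketch below assumes non-negative 32-bit ints and assumes that the decoded value of a lower-precision term is simply the original value with its lowest bits cleared; it is an illustration of the bracket idea described in the documentation above, not Lucene's actual encoding code, and the function name is hypothetical.

```python
def trie_bracket_values(value, precision_step=4, bits=32):
    """(shift, decoded value) for every term a value is indexed under."""
    return [(shift, (value >> shift) << shift)
            for shift in range(0, bits, precision_step)]


# 9999 = 0x270F: full precision first, then coarser and coarser brackets.
print([v for _, v in trie_bracket_values(9999)])
# A step of at least 32 leaves only the full-precision term.
print([v for _, v in trie_bracket_values(9999, precision_step=2**31 - 1)])
```

With the default step of 4 this yields 9999, 9984, 9984, 8192 and then zeros, which matches the decoded values in the Luke listing; the small values 1 to 6 collapse to 0 in every bracket above full precision.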
EDIT
A little sample: the index produced by this will show terms with the values 8192, 9984, 1792, etc. in Luke, but a query whose range would include them doesn't produce results:
// Reusable field instances: one trie-encoded numeric field and one ordinary text field.
NumericField number = new NumericField("number", Field.Store.YES, true);
Field regular = new Field("normal", "", Field.Store.YES, Field.Index.ANALYZED);
IndexWriter iw = new IndexWriter(FSDirectory.GetDirectory("C:\\temp\\testnum"), new StandardAnalyzer(), true);
Document doc = new Document();
doc.Add(number);
doc.Add(regular);
// Index five documents with the numeric values 1, 2, 13, 2000 and 9999.
number.SetIntValue(1);
regular.SetValue("one");
iw.AddDocument(doc);
number.SetIntValue(2);
regular.SetValue("one");
iw.AddDocument(doc);
number.SetIntValue(13);
regular.SetValue("one");
iw.AddDocument(doc);
number.SetIntValue(2000);
regular.SetValue("one");
iw.AddDocument(doc);
number.SetIntValue(9999);
regular.SetValue("one");
iw.AddDocument(doc);
iw.Commit();
IndexSearcher searcher = new IndexSearcher(iw.GetReader());
// Ranges that contain indexed values match as expected.
NumericRangeQuery rangeQ = NumericRangeQuery.NewIntRange("number", 1, 2, true, true);
var docs = searcher.Search(rangeQ);
Console.WriteLine(docs.Length().ToString()); // prints 2
rangeQ = NumericRangeQuery.NewIntRange("number", 13, 13, true, true);
docs = searcher.Search(rangeQ);
Console.WriteLine(docs.Length().ToString()); // prints 1
// 9984 appears as a term in Luke, but no document has a value in 9000..9998.
rangeQ = NumericRangeQuery.NewIntRange("number", 9000, 9998, true, true);
docs = searcher.Search(rangeQ);
Console.WriteLine(docs.Length().ToString()); // prints 0
Console.ReadLine();

Rounding off a list of numbers to a user-defined step while preserving their sum

I've been reading a lot of posts about rounding off numbers, but I couldn't manage to do what I want :
I have got a list of positive floats.
The unsigned integer roundOffStep to use is user-defined. I have no control over it.
I want to be able to do the most accurate rounding while preserving the sum of those numbers, or at least while keeping the new sum lower than the original sum.
How would I do that? I am terrible with algorithms, so this is way too tricky for me.
Thx.
EDIT : Adding a Test case :
FLOATS
29.20
18.25
14.60
8.76
2.19
sum = 73;
Let's say roundOffStep = 5;
ROUNDED FLOATS
30
15
15
10
0
sum = 70 < 73 OK
1. Round all numbers to the nearest multiple of roundOffStep normally.
2. If the new sum is less than or equal to the original sum, you're done.
3. For each number, calculate rounded_number - original_number. Sort this list of differences in decreasing order so that you can find the numbers with the largest difference.
4. Pick the number that gives the largest difference rounded_number - original_number, and subtract roundOffStep from that number.
5. Repeat step 4 (picking the next largest difference each time) until the new sum is less than or equal to the original.
This process should ensure that the rounded numbers are as close as possible to the originals, without going over the original sum.
Example, with roundOffStep = 5:
Original Numbers | Rounded | Difference
----------------------+------------+--------------
29.20 | 30 | 0.80
18.25 | 20 | 1.75
14.60 | 15 | 0.40
8.76 | 10 | 1.24
2.19 | 0 | -2.19
----------------------+------------+--------------
Sum: 73 | 75 |
The sum is too large, so we pick the number giving the largest difference (18.25 which was rounded to 20) and subtract 5 to give 15. Now the sum is 70, so we're done.
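The procedure above can be sketched in a few lines. This is one possible reading of it, with an illustrative function name; it assumes positive inputs, rounds halves up, and compares float sums directly, which is fine for the test case but may need a tolerance for other data.

```python
import math


def round_preserving_sum(values, step):
    """Round to multiples of step without exceeding the original sum."""
    # Round every value to the nearest multiple of step.
    rounded = [math.floor(v / step + 0.5) * step for v in values]
    target = sum(values)
    # Indices ordered by how far rounding overshot, largest overshoot first.
    order = sorted(range(len(values)),
                   key=lambda i: rounded[i] - values[i],
                   reverse=True)
    # Pull the worst offenders down one step until the sum fits.
    for i in order:
        if sum(rounded) <= target:
            break
        rounded[i] -= step
    return rounded


print(round_preserving_sum([29.20, 18.25, 14.60, 8.76, 2.19], 5))
```

For the test case this gives 30, 15, 15, 10, 0, the same result as the worked example. One pass over the sorted list is enough, because each value overshoots by at most half a step while each correction removes a full step.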