H2 BLOB data type size

Does H2 have a notion of a specific size limit for the BLOB data type? The documentation (https://h2database.com/html/datatypes.html#blob_type) states that you can optionally set a limit, e.g. BLOB(10K), so does that mean that a plain BLOB is unlimited in size?
Similarly, the documentation lists TINYBLOB, MEDIUMBLOB, etc. as acceptable keywords, but doesn't give any specific meaning for them. Are they simply aliases for BLOB, for compatibility with other database dialects?
(I see that the BINARY type has a limit of 2 GB, which is what makes me think that BLOB doesn't have a limit, since none is specified.)

BINARY / VARBINARY data types are limited by available memory, and they also have a hard limit slightly below 2 GB (the maximum array size in Java). Note that BINARY should be used only when you have values with a known fixed size. In H2 1.4.200 BINARY is an alias for VARBINARY, but in the next (not yet released) version they are distinct types.
BLOB values can be much larger. They aren't loaded into memory; they are streamed instead. There is some outdated information about limits in the documentation: https://h2database.com/html/advanced.html#limits_limitations
but that part of the documentation was written for H2's old storage engine; H2 uses a different storage engine by default. In any case, both engines support large binary and character objects.
TINYBLOB, MEDIUMBLOB, etc. don't have any special meaning; they exist for compatibility only. Don't use them.
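
A rough, untested sketch of the difference from Python, assuming the jaydebeapi/JPype packages and a local copy of the H2 jar (the jar path, URL, and table are placeholders):

import jaydebeapi

conn = jaydebeapi.connect(
    "org.h2.Driver",
    "jdbc:h2:mem:demo",                 # throwaway in-memory database
    ["sa", ""],
    "/path/to/h2-1.4.200.jar",          # placeholder path to the H2 jar
)
cur = conn.cursor()

# BLOB without a length is not capped near 2 GB the way VARBINARY is;
# BLOB(10K) would add an explicit 10 KB limit on top of that.
cur.execute("CREATE TABLE docs (id INT PRIMARY KEY, body BLOB)")
cur.execute("INSERT INTO docs VALUES (1, X'DEADBEEF')")   # 4-byte test value

cur.execute("SELECT OCTET_LENGTH(body) FROM docs WHERE id = 1")
print(cur.fetchone()[0])                # -> 4

cur.close()
conn.close()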


Editing Parquet Files as Binary

Assuming Parquet files on AWS S3 (used for querying by AWS Athena).
I need to anonymize a record with a specific numeric field by changing the numeric value (changing one digit is enough).
1. Can I scan a Parquet file as binary and find a numeric value, or will the compression make it impossible to find such a string?
2. Assuming I can do #1: can I anonymize the record by changing a digit of this number at the binary level, without corrupting the Parquet file?
Thanks.
No, this will not be possible. Parquet has two layers in its format that make this impossible: encoding and compression. Both reorder the data so that it fits into less space; the difference between them is CPU usage and generality. Sometimes data can be compressed so that less than a byte per value is needed, for example when all values are the same or very similar. Changing a single value would then require more space, which in turn makes an in-place edit impossible.
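Because of that, the practical route is not an in-place binary edit but a read-modify-rewrite of the whole file. A rough sketch using pyarrow (the file name and the "amount" column are made up for illustration; any equivalent Parquet library works):

import pyarrow.compute as pc
import pyarrow.parquet as pq

table = pq.read_table("records.parquet")            # placeholder input file

# Build an anonymized version of the numeric column; any transformation
# that alters the value will do.
idx = table.schema.get_field_index("amount")
anonymized = pc.add(table.column("amount"), 1)
table = table.set_column(idx, "amount", anonymized)

# Write a brand-new file (and upload it back to S3 in place of the old one).
pq.write_table(table, "records_anonymized.parquet")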

Best Firebird blob size page size relation

I have a small Firebird 2.5 database with a blob field called "note" declared as this:
BLOB SUB_TYPE 1 SEGMENT SIZE 80 CHARACTER SET UTF8
The database page size is:
16384 (which I suspect is too high)
I ran this select in order to find out the average size of the blob values currently stored:
select avg(octet_length(items.note)) from items
and got this result:
2671
As a beginner, I would like to know the best segment size for this blob field and the best database page size in your opinion (I know this depends on other factors, but I still don't know how to figure them out).
Blobs in Firebird are stored in separate pages of your database. The exact storage format depends on the size of your blob. As described in Blob Internal Storage:
Blobs are created as part of a data row, but because a blob could be
of unlimited length, what is actually stored with the data row is a
blobid, the data for the blob is stored separately on special blob
pages elsewhere in the database.
[..]
A blob page stores data for a blob. For large blobs, the blob page
could actually be a blob pointer page, i.e. be used to store pointers
to other blob pages. For each blob that is created a blob record is
defined, the blob record contains the location of the blob data, and
some information about the blobs contents that will be useful to the
engine when it is trying to retrieve the blob. The blob data could be
stored in three slightly different ways. The storage mechanism is
determined by the size of the blob, and is identified by its level
number (0, 1 or 2). All blobs are initially created as level 0, but
will be transformed to level 1 or 2 as their size increases.
A level 0 blob is a blob that can fit on the same page as the blob
header record, for a data page of 4096 bytes, this would be a blob of
approximately 4052 bytes (Page overhead - slot - blob record header).
In other words, if your average blob size is 2671 bytes (and most of the larger ones are still smaller than roughly 4000 bytes), then a page size of 4096 is likely optimal, as it reduces the wasted space from on average 16340 - 2671 = 13669 bytes to 4052 - 2671 = 1381 bytes.
However, for performance itself this is likely to matter very little, and smaller page sizes have other effects that you will need to take into account. For example, a smaller page size also reduces the maximum size of a CHAR/VARCHAR index key, indexes might become deeper (more levels), and fewer records fit in a single page (or wider records get split over multiple pages).
Without measuring and testing it is hard to say if using 4096 for the page size is the right size for your database.
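To make that trade-off concrete, here is a tiny back-of-the-envelope sketch in Python, using the roughly 44 bytes of per-page overhead implied by the quoted figures (4096 - 4052); these are approximations, not exact Firebird internals:

AVG_BLOB = 2671          # average octet_length from the question
OVERHEAD = 44            # 4096 - 4052, per the quote above

def level0_capacity(page_size):
    # approximate payload of a level 0 blob stored on a single page
    return page_size - OVERHEAD

for page_size in (4096, 8192, 16384):
    wasted = level0_capacity(page_size) - AVG_BLOB
    print(page_size, "->", wasted, "bytes wasted on average")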
As to segment sizes: the segment size is a historical artifact that is best ignored (and left out). Some applications or drivers incorrectly assume that blobs need to be written or read in the specified segment size; in those rare cases specifying a larger segment size might improve performance. If you leave it out, Firebird defaults to a value of 80.
From Binary Data Types:
Segment Size: Specifying the BLOB segment is throwback to times past,
when applications for working with BLOB data were written in C
(Embedded SQL) with the help of the gpre pre-compiler. Nowadays, it is
effectively irrelevant. The segment size for BLOB data is determined
by the client side and is usually larger than the data page size, in
any case.

How to insert floating point numbers in Aerospike KV store?

I am using Aerospike 3.40. A bin with a floating point value doesn't appear. I am using the Python client. Please help.
It is now supported as of Aerospike version 3.6.
The server does not natively support floats. It supports integers, strings, bytes, lists, and maps. Different clients handle the unsupported types in different ways. The PHP client, for example, will serialize the other types such as boolean and float and store them in a bytes field, then deserialize them on reads. The Python client will be doing that starting with the next release (>= 1.0.38).
However, this approach has the limitation of making it difficult for different clients (PHP and Python, for example) to read such serialized data, as it's not serialized using a common format.
One common way to get around this with floats is to turn them into integers. For example, if you have a bin called 'currency' you can multiply the float by 100, drop the remaining fraction, and store it as an integer. On the way out you simply divide by 100.
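A small sketch of that fixed-point trick (the 'currency' bin name and the two-decimal precision are just examples):

SCALE = 100  # keep two decimal places

def to_stored(value):
    # 12.34 -> 1234; anything beyond the chosen precision is dropped
    return int(round(value * SCALE))

def from_stored(stored):
    # 1234 -> 12.34
    return stored / float(SCALE)

record = {"currency": to_stored(12.34)}   # this dict is what you would write
print(from_stored(record["currency"]))    # -> 12.34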
A similar method is to store the digits before the decimal point in one bin and the digits after it in another, and recombine them on read. So 123.456789 gets stored as v_sig and v_mantissa:
(v_sig, v_mantissa) = str(123.456789).split('.')
On read you would combine the two:
v = float(v_sig) + float("0." + v_mantissa)
FYI, floats are now supported natively as doubles on Aerospike server versions >= 3.6.0. Most clients, such as the Python and PHP ones, support casting floats to as_double.
A floating point number can be split into two parts, the digits before the decimal point and the digits after it, which can be stored in two bins and recombined in application code.
However, creating more bins has a performance overhead in Aerospike, since a separate malloc is used per bin.
If switching from Python to another language is not part of the use case, it is better to use a good serialization mechanism and save the value in a single bin. That way only one bin is used per floating point number, which also reduces the data size in Aerospike. Less data in Aerospike always helps speed in terms of network I/O, which is the main aim of caching.

Document length in Lucene 4.0

As I've read in the documentation of Lucene 4.0, this library now stores some statistics in order to compute different scoring models, one of them being BM25. Is there a way, besides fetching a document, to fetch its length too?
You can store whatever you want from FieldInvertState into the 'norm', and it doesn't have to be an 8-bit float either.
The default is a lossy encoding of the length; if you want the actual exact length, you could choose to use a short (16 bits) per document, or something else, instead.
See Similarity.computeNorm
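To see why a default one-byte norm is lossy, here is a toy encoding in Python (an illustration of the idea only, not Lucene's actual encoding):

import math

def encode_length(length):
    # squeeze the length into a single byte by storing a scaled log
    return min(255, int(round(math.log2(max(length, 1)) * 16)))

def decode_length(byte):
    return int(round(2 ** (byte / 16.0)))

for n in (10, 11, 990, 1000, 1001):
    print(n, "->", decode_length(encode_length(n)))
# nearby large lengths collapse to the same stored value, which is why an
# exact length needs a wider encoding (e.g. a short per document).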

[My]SQL VARCHAR Size and Null-Termination

Disclaimer: I'm very new to SQL and databases in general.
I need to create a field that will store a maximum of 32 characters of text data. Does "VARCHAR(32)" mean that I have exactly 32 characters for my data? Do I need to reserve an extra character for null-termination?
I conducted a simple test and it seems that this is a WYSIWYG buffer. However, I wanted to get a concrete answer from people who actually know what they're doing.
I have a C[++] background, so this question is raising alarm bells in my head.
Yes, you have 32 characters at your disposal. SQL does not concern itself with NUL-terminated strings the way some programming languages do.
The size in your VARCHAR declaration is the maximum size of your data, so in this case 32 characters. However, VARCHAR is a variable-length type, so the actual physical storage used is only the size of your data, plus one or two bytes.
If you put a 10-character string into a VARCHAR(32), the physical storage will be 11 or 12 bytes (the manual will tell you the exact formula).
However, when MySQL is dealing with result sets (i.e. after a SELECT), 32 bytes will be allocated in memory for that field for every record.
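
As a rough model of that storage formula (a sketch only; it assumes a single-byte character set and the manual's rule of a 1-byte length prefix for columns whose maximum byte length is 255 or less):

def varchar_on_disk(value, max_chars=32, bytes_per_char=1):
    data = value.encode("latin-1")
    assert len(value) <= max_chars, "value longer than the declared size"
    # 1-byte length prefix while the column's maximum byte length is <= 255,
    # 2 bytes otherwise; no trailing NUL is stored.
    prefix = 1 if max_chars * bytes_per_char <= 255 else 2
    return prefix + len(data)

print(varchar_on_disk("0123456789"))   # 10 characters -> 11 bytes
print(varchar_on_disk("x" * 32))       # a full column  -> 33 bytes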