What is the VInt in Lucene? - lucene

I want to know what is the VInt in Lucene ?
I read this article , but i don't understand what is it and where does Lucene use it ?
Why Lucene doesn't use simple integer or big integer ?
Thanks .

VInt is extremely space efficient. It could theoretically save upto 75% space.
In Lucene, many of the structures are list of integers. For example, list of documents for a given term, positions (and offsets) of the terms in documents, among others. These lists form bulk of the lucene data.
Think of Lucene indices for millions of documents that need tens of GBs of space. Shrinking space by more than half reduces disk space requirements. While savings of disk space may not be a big win, given that disk space is cheap, the real gain comes reduced disk IO. Disk IO for reading VInt data is lower than reading integers which automatically translates to better performance.

VInt refers to Lucene's variable-width integer encoding scheme. It encodes integers in one or more bytes, using only the low seven bits of each byte. The high bit is set to zero for all bytes except the last, which is how the length is encoded.

For your first question:
A variable-length format for positive integers is defined where the high-order bit of each byte indicates whether more bytes remain to be read. The low-order seven bits are appended as increasingly more significant bits in the resulting integer value. Thus values from zero to 127 may be stored in a single byte, values from 128 to 16,383 may be stored in two bytes, and so on. https://lucene.apache.org/core/3_0_3/fileformats.html.
So, to save a list of n integers the amount of memory you would need is [eg] 4*n bytes. But with Vint all numbers under 128 would be stored using only 1 byte [and so on] saving a lot of memory.
Vint provides a compressed representation of integers and Shashikant's answer already explains the requirements and benefits of compression in Lucene.

Related

Map words to numbers

I am doing indexing of data in my IRE (Information Retrieval and Extraction) course. Now instead of storing terms in the index, I am storing termID which is a mapping corresponding to the term. The size of term, if the length of the term is 15, would be 15 bytes i.e. 120 bits while if I use termID instead of term then I can definitely store it in less than 120 bits. One of the possible ways is to maintain a dictionary of the (term, termID) where termID would be from 1..n where n is the number of terms. The problems with this method is:
I have to keep this dictionary in the ram and the dictionary size can be in GBs.
To find termID corresponding to a term, it will take O(log(n)) where n is the number of terms in the dictionary.
Can I make some function which takes a term as an input and returns the mapping (encryption) in O(1) ?. It is okay if there are few collisions (Just guessing that a few collisions in exchange of speed and memory is a good trade-off. BTW I don't know how much it will effect my search results).
Is there any other better way to do this?
I think you gave the answer already more or less by saying "it is ok if there are a few collisions". The trick is hashing. You can first reduce the number of "characters" in your search terms. E.g., drop numbers, and special characters. Afterwards you can merge Upper and lower-case characters. Finally you could apply some simple replacements e.g. replacing the german ü bei ue (which is actually there origin). After doing so you have probably sth. like 32bit. You can then represent an four character string in a single byte. If you reserve 4 bytes for each words you need to deal with the longer words. There you can basically resort to xor each 4byte block.
An alternative approach would be to do something hybrid for the dictionary. If you would build a dictionary for only the 10k most frequent words you are most likely covering already most of the texts. Hence, you only need to keep parts of your dictionary in memory, while for most of the words you can use dictionary on hardisc or maybe even ignore them.

Is varchar(128) better than varchar(100)

Quick question. Does it matter from the point of storing data if I will use decimal field limits or hexadecimal (say 16,32,64 instead of 10,20,50)?
I ask because I wonder if this will have anything to do with clusters on HDD?
Thanks!
VARCHAR(128) is better than VARCHAR(100) if you need to store strings longer than 100 bytes.
Otherwise, there is very little to choose between them; you should choose the one that better fits the maximum length of the data you might need to store. You won't be able to measure the performance difference between them. All else apart, the DBMS probably only stores the data you send, so if your average string is, say, 16 bytes, it will only use 16 (or, more likely, 17 - allowing 1 byte for storing the length) bytes on disk. The bigger size might affect the calculation of how many rows can fit on a page - detrimentally. So choosing the smallest size that is adequate makes sense - waste not, want not.
So, in summary, there is precious little difference between the two in terms of performance or disk usage, and aligning to convenient binary boundaries doesn't really make a difference.
If it would be a C-Program I'd spend some time to think about that, too. But with a database I'd leave it to the DB engine.
DB programmers spent a lot of time in thinking about the best memory layout, so just tell the database what you need and it will store the data in a way that suits the DB engine best (usually).
If you want to align your data, you'll need exact knowledge of the internal data organization: How is the string stored? One, two or 4 bytes to store the length? Is it stored as plain byte sequence or encoded in UTF-8 UTF-16 UTF-32? Does the DB need extra bytes to identify NULL or > MAXINT values? Maybe the string is stored as a NUL-terminated byte sequence - then one byte more is needed internally.
Also with VARCHAR it is not neccessary true, that the DB will always allocate 100 (128) bytes for your string. Maybe it stores just a pointer to where space for the actual data is.
So I'd strongly suggest to use VARCHAR(100) if that is your requirement. If the DB decides to align it somehow there's room for extra internal data, too.
Other way around: Let's assume you use VARCHAR(128) and all things come together: The DB allocates 128 bytes for your data. Additionally it needs 2 bytes more to store the actual string length - makes 130 bytes - and then it could be that the DB aligns the data to the next (let's say 32 byte) boundary: The actual data needed on the disk is now 160 bytes 8-}
Yes but it's not that simple. Sometimes 128 can be better than 100 and sometimes, it's the other way around.
So what is going on? varchar only allocates space as necessary so if you store hello world in a varchar(100) it will take exactly the same amount of space as in a varchar(128).
The question is: If you fill up the rows, will you hit a "block" limit/boundary or not?
Databases store their data in blocks. These have a fixed size, for example 512 (this value can be configured for some databases). So the question is: How many blocks does the DB have to read to fetch each row? Rows that span several block will need more I/O, so this will slow you down.
But again: This doesn't depend on the theoretical maximum size of the columns but on a) how many columns you have (each column needs a little bit of space even when it's empty or null), b) how many fixed width columns you have (number/decimal, char), and finally c) how much data you have in variable columns.

Parallelizable hashing algorithm where size and order of sub-strings is irrelevant

EDIT
Here is the problem I am trying to solve:
I have a string broken up into multiple parts. These parts are not of equal, or predictable length. Each part will have a hash value. When I concatenate parts I want to be able to use the hash values from each part to quickly get the hash value for the parts together. In addition the hash generated by putting the parts together must match the hash generated if the string were hashed as a whole.
Basically I want a hashing algorithm where the parts of the data being hashed can be hashed in parallel, and I do not want the order or length of the pieces to matter. I am not breaking up the string, but rather receiving it in unpredictable chunks in an unpredictable order.
I am willing to ensure an elevated collision rate, so long as it is not too elevated. I am also ok with a slightly slower algorithm as it is hardly noticeable on small strings, and done in parallel for large strings.
I am familiar with a few hashing algorithms, however I currently have a use-case for a hash algorithm with the property that the sum of two hashes is equal to a hash of the sum of the two items.
Requirements/givens
This algorithm will be hashing byte-strings with length of at least 1 byte
hash("ab") = hash('a') + hash('b')
Collisions between strings with the same characters in different order is ok
Generated hash should be an integer of native size (usually 32/64 bits)
String may contain any character from 0-256 (length is known, not \0 terminated)
The ascii alpha-numeric characters will be by far the most used
A disproportionate number of strings will be 1-8 ASCII characters
A very tiny percentage of the strings will actually contain bytes with values at or above 127
If this is a type of algorithm that has terminology associated with it, I would love to know that terminology. If I knew what a proper term/name for this type of hashing algorithm was it would be much easier to google.
I am thinking the simplest way to achieve this is:
Any byte's hash should be its value, normalized to <128 (if >128 subtract 128)
To get the hash of a string you normalize each byte to <128 and add it to the key
Depending on key size I may need to limit how many characters are used to hash to avoid overflow
I don't see anything wrong with just adding each (unsigned) byte value to create a hash which is just the sum of all the characters. There is nothing wrong with having an overflow: even if you reach the 32/64 bit limit (and it would have to be a VERY/EXTREMELY long string to do this) the overflow into a negative number won't matter in 2's complement arithmetic. As this is a linear process it doesn't matter how you split your string.

Why historically do people use 255 not 256 for database field magnitudes?

You often see database fields set to have a magnitude of 255 characters, what is the traditional / historic reason why? I assume it's something to do with paging / memory limits, and performance but the distinction between 255 and 256 has always confused me.
varchar(255)
Considering this is a capacity or magnitude, not an indexer, why is 255 preferred over 256? Is a byte reserved for some purpose (terminator or null or something)?
Presumably varchar(0) is a nonsense (has zero capacity)? In which case 2^8 of space should be 256 surely?
Are there other magnitudes that provide performance benefits? For example is varchar(512) less performant than varchar(511) or varchar(510)?
Is this value the same for all relations databases, old and new?
disclaimer - I'm a developer not a DBA, I use field sizes and types that suit my business logic where that is known, but I'd like to know the historic reason for this preference, even if it's no longer relevant (but even more if it still is relevant).
Edit:
Thanks for the answers, there seems to be some concensus that a byte is used to store size, but this doesn't settle the matter definitively in my mind.
If the meta data (string length) is stored in the same contiguous memory/disk, it makes some sense. 1 byte of metadata and 255 bytes of string data, would suit each other very nicely, and fit into 256 contiguous bytes of storage, which presumably is neat and tidy.
But...If the metadata (string length) is stored separately from the actual string data (in a master table perhaps), then to constrain the length of string's data by one byte, just because it's easier to store only a 1 byte integer of metadata seems a bit odd.
In both cases, it would seem to be a subtlety that probably depends on the DB implementation. The practice of using 255 seems pretty widespread, so someone somewhere must have argued a good case for it in the beginning, can anyone remember what that case was/is? Programmers won't adopt any new practice without a reason, and this must have been new once.
With a maximum length of 255 characters, the DBMS can choose to use a single byte to indicate the length of the data in the field. If the limit were 256 or greater, two bytes would be needed.
A value of length zero is certainly valid for varchar data (unless constrained otherwise). Most systems treat such an empty string as distinct from NULL, but some systems (notably Oracle) treat an empty string identically to NULL. For systems where an empty string is not NULL, an additional bit somewhere in the row would be needed to indicate whether the value should be considered NULL or not.
As you note, this is a historical optimisation and is probably not relevant to most systems today.
255 was the varchar limit in mySQL4 and earlier.
Also 255 chars + Null terminator = 256
Or 1 byte length descriptor gives a possible range 0-255 chars
255 is the largest numerical value that can be stored in a single-byte unsigned integer (assuming 8-bit bytes) - hence, applications which store the length of a string for some purpose would prefer 255 over 256 because it means they only have to allocate 1 byte for the "size" variable.
From MySQL Manual:
Data Type:
VARCHAR(M), VARBINARY(M)
Storage Required:
L + 1 bytes if column values require 0 – 255 bytes, L + 2 bytes if values may require more than 255 bytes
Understand and make choice.
255 is the maximum value of a 8 bit integer : 11111111 = 255.
Are there other magnitudes that provide performance benefits? For example is varchar(512) less performant than varchar(511) or varchar(510)?
Recollected the fundamentals of the bits/bytes storage, it requires one byte to store integers below 256 and two bytes for any integer between 256 and 65536.
Hence, it requires same space (two bytes) to store 511 or 512 or for that matter 65535....
Thus it is clear that the this argument mentioned in the discussion above is N/A for varchar(512) or varchar(511).
A maximum length of 255 allows the database engine to use only 1 byte to store the length of each field. You are correct that 1 byte of space allows you to store 2^8=256 distinct values for the length of the string.
But if you allow the field to store zero-length text strings, you need to be able to store zero in the length. So you can allow 256 distinct length values, starting at zero: 0-255.
It used to be that all strings required a NUL terminator, or "backslash-zero". Updated databases don't have that. It was "255 characters of text" with a "\0" added automatically at the end so the system knew where the string ended. If you said VARCHAR(256), it would end up being 257 and then you'd be in the next register for one character. Wasteful. That's why everything was VARCHAR(255) and VARCHAR(31). Out of habit the 255 seems to have stuck around but the 31's became 32's and the 511's became 512's. That part is weird. It's hard to make myself write VARCHAR(256).
Often varchars are implemented as pascal strings: holding the actual length in the byte #0. The length was therefore bound to 255. (Value of a byte varies from 0 to 255.)
8 bits unsigned = 256 bytes
255 characters + byte 0 for length
I think this might answer your question. Looks like it was the max limit of varchar in earlier systems. I took it off another stackoverflow question.
It's hard to know what the longest postal address is, of course, which is why many people choose a long VARCHAR that is certainly longer than any address. And 255 is customary because it may have been the maximum length of a VARCHAR in some databases in the dawn of time (as well as PostgreSQL until more recently).
Are there disadvantages to using a generic varchar(255) for all text-based fields?
Data is saved in memory in binary system and 0 and 1 are binary digits. Largest binary number that can fit in 1 byte (8-bits) is 11111111 which converts to decimal 255.

[My]SQL VARCHAR Size and Null-Termination

Disclaimer: I'm very new to SQL and databases in general.
I need to create a field that will store a maximum of 32 characters of text data. Does "VARCHAR(32)" mean that I have exactly 32 characters for my data? Do I need to reserve an extra character for null-termination?
I conducted a simple test and it seems that this is a WYSIWYG buffer. However, I wanted to get a concrete answer from people who actually know what they're doing.
I have a C[++] background, so this question is raising alarm bells in my head.
Yes, you have 32 characters at your disposal. SQL does not concern itself with nul terminated strings like some programming languages do.
Your VARCHAR specification size is the max size of your data, so in this case, 32 characters. However, VARCHARS are a dynamic field, so the actual physical storage used is only the size of your data, plus one or two bytes.
If you put a 10-character string into a VARCHAR(32), the physical storage will be 11 or 12 bytes (the manual will tell you the exact formula).
However, when MySQL is dealing with result sets (ie. after a SELECT), 32 bytes will be allocated in memory for that field for every record.