Space complexity, Auxiliary space and Input space - space-complexity

Could you explain me this phrase concerning space complexity?
It is composed of two different spaces; Auxiliary space and Input space.

Input space is the space occupied by the input of the algorithm. For instance, if the algorithm sorts an array of integers, then the input space is the memory occupied by that array.
Auxiliary space is the space that the algorithm needs on top of that. For instance, a recursive sorting algorithm (like quicksort) will typically use space from the stack. Also the local variables (like a loop variable, or the temporary variable used to swap values, or an array to copy values into) are part of this auxiliary space.

Related

Why flatbuffer struct fields can not be vector/table/string?

Structs may only contain scalars or other structs. But as I konw , only offset is writed to buffer when writing vector/table/string field into table data (the data of vector/table/string is wirited before table data). So struct contains vector/table/string field still can has fix size. Why flatbuffer do the limit that struct can only contain scalars or other structs?
The idea behind a struct is that it is a self-contained piece of memory that always has the same layout and size and as such can easily be copied around by itself, especially in languages that support such types natively, like C/C++/Rust etc.
If it could contain strings, then it would be at least two pieces of memory, whose distance and size would be variable, and thus not efficient to copy and easy to manage. We have table for such cases.
If you must have a vector or string inside a struct, some languages already support an "array" type, which is of fixed length. You could put that, plus a length field in a struct to emulate vectors and strings, of course with the downside that the space allocated for them is always the same.

Strings or integers for Steam IDs

What is the preferred datatype for storing Steam IDs? These IDs are very similar to credit card numbers, but is different cases of use. Until now I'm using unsigned big integer but I'm not 100% sure yet. If the ID starts with a zero number, can cause issues? Eg ID: 76561197960287930
In general number take less space on the disk to store and on the transfer from the database to the application compared to strings. They are for the same reason faster to compare e.g. in the where-clause of a query.
Have a look here for the bytes needed to store numbers and bytes to store strings.
In the database the numbers are stored without leading zeros. You could fill up your numbers with leading zeros in your application after loading them from the database, if the numbers always have a fixed size.
But if the numbers can have leading zeros strings are easier to handle, because you do not have to implement additional logic for edgecases like leading zeros.

Map words to numbers

I am doing indexing of data in my IRE (Information Retrieval and Extraction) course. Now instead of storing terms in the index, I am storing termID which is a mapping corresponding to the term. The size of term, if the length of the term is 15, would be 15 bytes i.e. 120 bits while if I use termID instead of term then I can definitely store it in less than 120 bits. One of the possible ways is to maintain a dictionary of the (term, termID) where termID would be from 1..n where n is the number of terms. The problems with this method is:
I have to keep this dictionary in the ram and the dictionary size can be in GBs.
To find termID corresponding to a term, it will take O(log(n)) where n is the number of terms in the dictionary.
Can I make some function which takes a term as an input and returns the mapping (encryption) in O(1) ?. It is okay if there are few collisions (Just guessing that a few collisions in exchange of speed and memory is a good trade-off. BTW I don't know how much it will effect my search results).
Is there any other better way to do this?
I think you gave the answer already more or less by saying "it is ok if there are a few collisions". The trick is hashing. You can first reduce the number of "characters" in your search terms. E.g., drop numbers, and special characters. Afterwards you can merge Upper and lower-case characters. Finally you could apply some simple replacements e.g. replacing the german ΓΌ bei ue (which is actually there origin). After doing so you have probably sth. like 32bit. You can then represent an four character string in a single byte. If you reserve 4 bytes for each words you need to deal with the longer words. There you can basically resort to xor each 4byte block.
An alternative approach would be to do something hybrid for the dictionary. If you would build a dictionary for only the 10k most frequent words you are most likely covering already most of the texts. Hence, you only need to keep parts of your dictionary in memory, while for most of the words you can use dictionary on hardisc or maybe even ignore them.

What is the VInt in Lucene?

I want to know what is the VInt in Lucene ?
I read this article , but i don't understand what is it and where does Lucene use it ?
Why Lucene doesn't use simple integer or big integer ?
Thanks .
VInt is extremely space efficient. It could theoretically save upto 75% space.
In Lucene, many of the structures are list of integers. For example, list of documents for a given term, positions (and offsets) of the terms in documents, among others. These lists form bulk of the lucene data.
Think of Lucene indices for millions of documents that need tens of GBs of space. Shrinking space by more than half reduces disk space requirements. While savings of disk space may not be a big win, given that disk space is cheap, the real gain comes reduced disk IO. Disk IO for reading VInt data is lower than reading integers which automatically translates to better performance.
VInt refers to Lucene's variable-width integer encoding scheme. It encodes integers in one or more bytes, using only the low seven bits of each byte. The high bit is set to zero for all bytes except the last, which is how the length is encoded.
For your first question:
A variable-length format for positive integers is defined where the high-order bit of each byte indicates whether more bytes remain to be read. The low-order seven bits are appended as increasingly more significant bits in the resulting integer value. Thus values from zero to 127 may be stored in a single byte, values from 128 to 16,383 may be stored in two bytes, and so on. https://lucene.apache.org/core/3_0_3/fileformats.html.
So, to save a list of n integers the amount of memory you would need is [eg] 4*n bytes. But with Vint all numbers under 128 would be stored using only 1 byte [and so on] saving a lot of memory.
Vint provides a compressed representation of integers and Shashikant's answer already explains the requirements and benefits of compression in Lucene.

in sql,How does fixed-length data type take place in memory?

I want to know in sql,how fixed-length data type take places length in memory?I know is that for varchar,if we specify length is (20),and if user input length is 15,it takes 20 by setting space.for varchar2,if we specify length is (20),and if user input is 15,it only take 15 length in memory.So how about fixed-length data type take place?I searched in Google,but I did not find explanation with example.Please explain me with example.Thanks in advance.
A fixed length data field always consumes its full size.
In the old days (FORTRAN), it was padded at the end with space characters. Modern databases might do that too, but either implicitly trim trailing blanks off or the query might have to do it explicitly.
Variable length fields are a relative newcomer to databases, probably in the 1970s or 1980s they made widespread appearances.
It is considerably easier to manage fixed length record offsets and sizes rather than compute the offset of each data item in a record which has variable length fields. Furthermore, a fixed length data record is easily addressed in a data file by computing the byte offset of its beginning by multiplying the record size times the record number (and adding the length of whatever fixed header data is at the beginning of file).