B Tree Index vs Inverted Index? - indexing

Here is my understanding of both:
B-tree index: It is generally used on database columns. It keeps the column content as the key and the row_id as the value, and it keeps the keys sorted so that a key and its row location can be found quickly.
Inverted index: Generally used in full-text search. Here, too, each word in a document works as a key, stored in sorted order along with the document location/ID as the value.
So what's the difference between a B-tree index and an inverted index? To me they look the same.

Short answer:
Yes, they have the same purpose: finding things fast.
The difference is what they are useful for or particularly good at.
And by the way, the naming is awfully confusing too.
Long answer:
The naming
My knowledge comes from practice in the SQL world, so for me the data storage used to be the "database" and the structure that lets you find things quickly the "index".
The trick is that search engines already call their storage an "index", so what do you call that index-of-"index"? "Inverted index", of course! Why inverted? Because, as I can already see in your question, it just inverts the primary storage. Storage is like primary key --> values; the helper structure inverts it to values --> primary key and helps find documents quickly by their values.
Purpose
Your question mixes two ideas. "Inverted index" really means something like "a data structure that helps find documents that are already in storage", whereas a B-Tree is just one implementation of such a structure.
An index could theoretically be implemented with any data structure you want: hashes, graphs, trees, arrays, bitmaps... It just depends on your use case.
The differences
B-Tree is good for data that changes, so it's used e.g. in databases and filesystems. Downside: multiple indices are harder to combine in one query (I guess because this structure is dynamic and thus the references back to documents are not sorted), and its data tends to become scattered, so I/O can become an issue.
"Inverted index" uses bitmaps/arrays, and everything is sorted (both the list of values and the list of references to documents). These are good for static data sets, and because everything is sorted, multiple indices can be used together. Downside: updating is not performant (a new document means inserting values somewhere in a sorted list), so tricks are used, such as keeping incoming data in small batches and merging them into bigger batches in a background process.
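To make the point about sorted reference lists concrete, here is a minimal Python sketch of an inverted index whose posting lists are kept sorted, so two terms can be combined with a single merge pass. The corpus, the function names and the whitespace tokenization are all assumptions made for illustration; this is not how any particular search engine implements it.

```python
from collections import defaultdict

# Hypothetical toy corpus: document ID -> text.
docs = {
    1: "b tree index keeps keys sorted",
    2: "inverted index maps words to documents",
    3: "sorted keys make index lookups fast",
}

def build_inverted_index(documents):
    """Map each word to a sorted list of the IDs of documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.split():
            postings[word].add(doc_id)
    return {word: sorted(ids) for word, ids in postings.items()}

def intersect(a, b):
    """Merge-intersect two sorted posting lists in a single pass."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

inverted = build_inverted_index(docs)
# Documents that contain both "index" and "sorted".
both = intersect(inverted["index"], inverted["sorted"])
```

Because both lists are sorted, the intersection never has to look back, which is the property the answer credits for letting several indices work together in one query.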

Related

If there's going to be a json(b) column regardless, what data should still have its own column?

I welcome generic answers, but I'll also provide a slightly abstracted version of my specific situation.
I have a Page model that uses the Ancestry gem to organize itself into a tree for sitemap and nav purposes.
Each page is going to have a jsonb column for miscellaneous options, but there are a couple of values where I'm not sure whether they should be in the jsonb column or in a separate column of their own data type.
hidden : Boolean
If true, a page will not be included in the nav menu.
It will mostly be accessed in whole-tree operations along with other data
I'm thinking this should be a column to allow for easily culling hidden pages from queries.
redirect : String
If any value but nil, will redirect to that value instead of rendering that page's content.
I don't expect this to be utilized often, but the fact that its value will be read every time a page is loaded could justify a column.
I'm also open to explanations on why going for jsonb here is a bad idea, but from a "what if" standpoint I'd still like an answer where a jsonb column must exist.
Use individual columns if:
They are meaningful as a unique key
They are foreign keys
They're likely to be updated a lot in changes that don't change the rest of the data
They benefit from richer data types and semantics
They're DB structure, not part of the payload data
They're likely to be useful for index predicates
Sometimes it's also worth splitting them out for indexing, but expression indexes, GIN and GiST, etc. make that less important.
Sometimes it's worth duplicating data within the json payload as a separate column and maintaining that duplication with a trigger. That way you get to keep it in the payload, while keeping a copy as a separate column for use as an index predicate or whatever.
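As a rough illustration of the "index predicate" point, the sketch below keeps hidden and redirect as real columns next to a JSON payload and uses hidden in a partial index. It is an assumption-laden stand-in: it runs on SQLite through Python's sqlite3 module rather than PostgreSQL, the payload is plain TEXT instead of jsonb, and the table, index and function names are hypothetical.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE pages (
        id       INTEGER PRIMARY KEY,
        title    TEXT NOT NULL,
        hidden   INTEGER NOT NULL DEFAULT 0,  -- boolean flag as its own column
        redirect TEXT,                         -- NULL means "render normally"
        options  TEXT NOT NULL DEFAULT '{}'    -- JSON payload (stand-in for jsonb)
    )
    """
)
# Partial index: only visible pages are indexed, so nav queries can skip hidden ones.
conn.execute("CREATE INDEX idx_pages_visible ON pages (title) WHERE hidden = 0")

rows = [
    ("Home", 0, None, json.dumps({"layout": "wide"})),
    ("Drafts", 1, None, json.dumps({})),
    ("Old blog", 0, "/blog", json.dumps({"legacy": True})),
]
conn.executemany(
    "INSERT INTO pages (title, hidden, redirect, options) VALUES (?, ?, ?, ?)", rows
)

def nav_titles(connection):
    """Return titles of pages that should appear in the nav menu."""
    cur = connection.execute(
        "SELECT title FROM pages WHERE hidden = 0 ORDER BY title"
    )
    return [title for (title,) in cur]

nav = nav_titles(conn)
```

The miscellaneous options stay in the payload, while the flag that queries filter on is a typed column the database can constrain and index directly.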

How is a SQL table different from an array of structs? (In terms of usage, not implementation)

Here's a simple example of an SQL table:
CREATE TABLE persons
(
    id INTEGER,
    name VARCHAR(255),
    height DOUBLE
);
Since I haven't used SQL very much, I haven't yet learned to think in its terms. Effectively, my brain translates the above into this:
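#include <string>
using std::string;

struct Person
{
    int id;
    string name;
    double height;
    Person() : id(0), height(0.0) {}  // default constructor, needed for the array below
    Person(int id_, const char* name_, double height_)
        : id(id_), name(name_), height(height_)
    {}
};
Person persons[64];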
Then, inserting some elements, in SQL:
INSERT INTO persons (id, name, height) VALUES (1234, 'Frank', 5.125);
INSERT INTO persons (id, name, height) VALUES (5678, 'Jesse', 6.333);
...and how I'm thinking of it:
persons[0] = Person(1234, "Frank", 5.125);
persons[1] = Person(5678, "Jesse", 6.333);
I've read that SQL can be thought of as two major parts: data manipulation and data definition. I'm more concerned about organizing my data in the first place, as opposed to querying and modifying it. There, the distinctions of SQL are immediately obvious. To me, it seems like the subtleties of how data can and should be structured in SQL is a more obscure topic. Where does the array-of-structs analogy I'm automatically drawing for myself break down?
To give a concrete example, let's say that I want each entry in my persons table (or each of my Person objects) to contain a field denoting the names of that person's children (actual fruit-of-your-loins children, not hierarchical data structure children). In reality, these would probably be cross-table references (or pointers to objects), but let's keep things simple and make this field contain zero or more names. In my C++ example, I'd modify the declaration like so:
vector<string> namesOfChildren;
...and do something like this:
persons[0].namesOfChildren.push_back("John");
persons[0].namesOfChildren.push_back("Jane");
But, from what I can tell, the typical usage of SQL doesn't mirror this approach. If I'm wrong and there's a simple, straightforward solution, great. If not, I'm sure a SQL novice like myself could benefit greatly from a little cogitation on the subject of how databases of SQL tables are meant to be used in contrast to bare, generic data structures.
"To me, it seems like the subtleties of how data can and should be structured in SQL is a more obscure topic."
It's called "data(base) modeling" and is somewhere between an engineering discipline and an art (like much of computer programming). If you are really interested in the topic, take a look at the ERwin Methods Guide.
"Where does the array-of-structs analogy I'm automatically drawing for myself break down?"
At persistency, concurrency, consistency and scalability.
Persistency: The table is automatically saved to permanent storage. It'll stay there and survive reboots (not that a real database server will reboot much) until you explicitly delete it or there is a catastrophic hardware failure. DBMSes have well-oiled backup procedures that should help in the latter case.
Concurrency: Tables are meant to be accessed and (if need be) modified by many clients concurrently. Mechanisms such as locking and multi-version concurrency control are employed to ensure clients will not "step on each other's toes".
Consistency: You can define certain constraints (such as uniqueness, foreign keys or checks) and the DBMS will make sure they are never broken. Furthermore, this can often be done in a declarative manner, minimizing the chance of errors. On top of that, everything you do in a database is transactional, so you reap the benefits of atomicity, consistency, isolation and durability (aka "ACID"). In a nutshell, the database will defend itself from bad data.
Scalability: A well-designed database schema can grow well beyond the confines of the available RAM and still keep good performance, using techniques such as indexing, partitioning, clustering etc. Furthermore, SQL is declarative and set-based, which means that the DBMS has the latitude to pick the optimal "query execution plan" for the data at hand, auto-parallelize the query, cache the results in the hope they will be reused etc., all without changing the meaning of the query.
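The consistency point can be sketched in a few lines. The example below is only an illustration under assumptions: it uses SQLite through Python's sqlite3 module as a convenient stand-in for a full DBMS, and the insert_all helper and the extra rows are hypothetical. It shows a declared constraint rejecting bad data and a transaction rolling back a half-finished batch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE persons ("
    " id INTEGER PRIMARY KEY,"
    " name TEXT NOT NULL,"
    " height REAL CHECK (height > 0))"
)

def insert_all(connection, people):
    """Insert every row or none: the 'with' block commits or rolls back."""
    try:
        with connection:
            connection.executemany(
                "INSERT INTO persons (id, name, height) VALUES (?, ?, ?)", people
            )
        return True
    except sqlite3.IntegrityError:
        return False

ok = insert_all(conn, [(1234, "Frank", 5.125), (5678, "Jesse", 6.333)])
# The second batch reuses id 1234, so the whole batch is rejected and rolled back,
# including the valid row that was inserted just before the failure.
rejected = not insert_all(conn, [(9999, "Alex", 5.5), (1234, "Dup", 6.0)])
count = conn.execute("SELECT COUNT(*) FROM persons").fetchone()[0]
```

With a plain array of structs, the equivalent checks and the undo logic would have to be written and maintained by hand.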
Your analogy to the array of structs is not bad, for a start.
Beyond that, the differences begin with how the data is organized.
Database people love their "Normal Forms" laws. We do not have these laws in C++ or similar programming languages. Organizing data in tables according to these laws helps database engines do their magic (queries, joins) better, i.e. keep databases compact, crunch millions of rows in fractions of a second, and serve multiple requests concurrently. They are not absolute laws: 1NF (First Normal Form) is followed in 99.9999% of cases, but the higher the number (2NF, 3NF, ...), the more often DB designers allow themselves to deviate from them.
A description of the normal forms can be found, for example, here.
I will try to illustrate the differences using your example.
In your example, the fields of your struct correspond to the columns of the database table. Adding a vector of names as a new field of the struct would correspond to adding a comma-separated list of names in a new column of your table. This violates 1NF, which demands that one cell hold one value, not a list of values. To normalize your data you need two arrays: one of Person structs and a new one of Child structs. While in C++ we can just use pointers to link each child to its parent, in SQL we must use keys. You already added an id field to the Person struct; now we need to add a ParentId field to the Child struct so that the database engine can find the parent. The ParentId column is called a foreign key. Another way to satisfy 1NF, instead of creating a new table/struct for children, is to switch to child-centric thinking and have just one table with a record per child that includes all the information about the child's parent. The parent's information will obviously be repeated in as many records as the parent has children.
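The two-table layout described above might look like the following sketch. It assumes SQLite via Python's sqlite3 module, and the children table, the parent_id column and the children_of helper are names chosen for illustration rather than anything from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.execute("CREATE TABLE persons (id INTEGER PRIMARY KEY, name TEXT, height REAL)")
conn.execute(
    "CREATE TABLE children ("
    " id INTEGER PRIMARY KEY,"
    " parent_id INTEGER NOT NULL REFERENCES persons(id),"  # the foreign key
    " name TEXT NOT NULL)"
)
conn.execute("INSERT INTO persons VALUES (1234, 'Frank', 5.125)")
conn.execute("INSERT INTO persons VALUES (5678, 'Jesse', 6.333)")
conn.executemany(
    "INSERT INTO children (parent_id, name) VALUES (?, ?)",
    [(1234, "John"), (1234, "Jane")],
)

def children_of(connection, parent_name):
    """Follow the foreign key from children back to persons with a join."""
    cur = connection.execute(
        "SELECT c.name FROM children AS c"
        " JOIN persons AS p ON p.id = c.parent_id"
        " WHERE p.name = ? ORDER BY c.name",
        (parent_name,),
    )
    return [name for (name,) in cur]

franks_children = children_of(conn, "Frank")  # ["Jane", "John"]
jesses_children = children_of(conn, "Jesse")  # []
```

The join plays the role that dereferencing a pointer plays in the C++ version, and the explicit ORDER BY is there because the table itself promises no particular row order.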
Note (this is also considered part of 1NF) that while in an array of structs we always know the order of the elements, in a database it is up to the engine in what order to store the records. A table is just a mathematical unordered set of records, and the engine can re-sort it in internal storage for various optimizations as it likes. When you retrieve records with a SELECT statement and care about the order, you need to provide an ORDER BY clause.
Loosely speaking, 2NF is about removing repetition from your records (formally, it requires that no non-key column depend on only part of a composite key). Imagine you had workplace-related fields as part of your Person struct as well, including the name of the company and the company address. If many Persons in your dataset work at the same company, you would repeat the company's address in each of their records. We probably wouldn't repeat it like that in C++ either, and extracting these repetitions into a separate table is the normalized design. Even if there are no repetitions and all your Persons work in different places, normalization still calls for extracting the workplace data into a separate table, because one table should represent one entity.
3NF is about removing transitive dependencies and is considered somewhat optional, so I will not describe it here. See the link above.
Another feature of databases that is quite different from conventional programming with data structures in C++ is the database index. Simplifying, an index is just a copy of a column (or columns), i.e. a vertical slice, in a separate table where the values are stored in their natural order and each index record retains a reference to the whole record. So, in your example, to create an index by height you would create another array of 64 elements of the new
struct HeightIndexElem
{
    double height;
    Person* pFullRecord;
};
and sort them by height in this array. This allows the DB engine to optimize certain queries automatically. The database engine itself decides when to use a certain index. In C++ we usually create maps (Dictionaries in C#) to speed up finding an element by a certain characteristic, but we must use these maps ourselves; there is no automatic aspect there.
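For readers who want to see the "sorted copy of one column plus a reference back" idea run, here is a minimal Python sketch of the same hand-built index. The third person, the height_index list and the find_by_height_range helper are hypothetical, and a real engine would typically use a B-tree rather than a flat sorted list.

```python
from bisect import bisect_left, bisect_right
from dataclasses import dataclass

@dataclass
class Person:
    id: int
    name: str
    height: float

persons = [
    Person(1234, "Frank", 5.125),
    Person(5678, "Jesse", 6.333),
    Person(9012, "Alex", 5.75),  # hypothetical extra row for the example
]

# The "index": (height, position in persons) pairs kept sorted by height.
height_index = sorted((p.height, pos) for pos, p in enumerate(persons))
heights = [h for h, _ in height_index]

def find_by_height_range(low, high):
    """Return persons with low <= height <= high using binary search."""
    start = bisect_left(heights, low)
    end = bisect_right(heights, high)
    return [persons[pos] for _, pos in height_index[start:end]]

names = [p.name for p in find_by_height_range(5.5, 6.5)]
```

Here the caller has to remember to go through find_by_height_range; in a database the query planner makes that choice, which is the "automatic aspect" the answer refers to.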
There are major differences:
SQL tables are persistent -- (in plain English: written to disk)
They are transactional -- (really written to disk)
They can be of arbitrary size -- (tables of several hundred million rows are quite common)
They support relational algebra -- (joins with other tables, filtering etc.)
Relational algebra is provable -- for a given SELECT statement there is only one possible correct answer.
The biggest difference is that when you "UPDATE" and "COMMIT", you know your data is saved in the database and will be there until you decide to "DELETE" it. When you update a structure within an array, it's gone when the machine is switched off.
The other big difference is scale. The size of a modern DBMS is only limited by your hard disk budget.
[I really like farfareast's answer from an academic stand point, but I feel the need to add a more practically oriented answer too.]
SQL tables themselves are "bare, generic data structures" as you call C++'s structures. They are only different data structures: a table is always an array of (fixed size) structs and the only pointers you can use are foreign keys.
For example, when you add a vector<string> to your struct, you are already using pointers internally, as strings are only a "fancy" way of writing char*. This would already require a second table in SQL (using a secondary index column to keep the elements in order). Of course there are things like PostgreSQL's arrays that can help in this specific case, but those are "only" shortcuts for similar hand-writable constructs.
The real difference in data structures and algorithms comes from the fact that you can easily add declarations of index structures. Say you know you always need to access Persons in order of their height. In C++ you'd use some kind of tree or sorted list to keep them in order. There is an STL container for that. The downside is that when you need to access them in a different order (say by name), you'll have to add a second tree and duplicate the data, or start using pointers to Persons. If you add a Person, you need to update all containers, and so on. This becomes cumbersome, and soon you'll be on the front page of The Daily WTF. SQL tables, on the other hand, can have attached indices that automatically keep up with new and changed data. Of course, their maintenance must also be paid for in performance, but managing them basically means deciding which ones your access patterns require (something needed in every case) and defining them. Compared with having to rewrite large parts of an application, this is a much more favorable situation.

When to choose a Dictionary ADT

In college we learned the three main abstract data types were Containers (Stacks, Queues, and Tables), Dictionaries, and Priority Queues. There are probably an unlimited number of ways to group ADTs at a high level like this, but this is a good start.
I don't really understand when you would choose a Dictionary ADT to solve a computational problem though. Stacks and Queues seem to come up naturally, but not dictionaries.
The one example I can think of is a dictionary in the sense we use it in the real world. A dictionary keeps an ordered set of words for fast lookup, and what you get when you look up a word is: the correct spelling, how to pronounce the word, what part of speech it is, the definition of the word, etc.
As I'm starting to understand it better, the more it seems like a "dictionary" is another way to think of querying a database. When you write a SQL SELECT statement, you're usually looking for a row where a primary key equals a certain value (not always, of course; you can select on any field that exists in a table).
Is this the correct way to think of a Dictionary ADT? Or is the intended use more limited than this?
Your SQL example is pretty accurate. You're searching for a primary key (the key of the dictionary entry) to get some fields (the values) associated with it.
I personally found dictionaries useful in my game programming courses. I would load my resources and then cache them into a dictionary for later use. This way, I didn't have to know an index number for a specific resource, I could give it a key that would relate to the resource.
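The resource-cache use described above could be sketched as follows. This is only an illustration: the ResourceCache class, the fake_load function and the "player_sprite" key are hypothetical names, and a real game would load textures or sounds instead of building a string.

```python
class ResourceCache:
    """Dictionary-backed cache: load each resource once, then look it up by key."""

    def __init__(self, loader):
        self._loader = loader    # function that turns a key into a resource
        self._resources = {}     # key -> loaded resource
        self.loads = 0           # how many times the loader actually ran

    def get(self, key):
        if key not in self._resources:
            self._resources[key] = self._loader(key)
            self.loads += 1
        return self._resources[key]

def fake_load(key):
    # Hypothetical stand-in for reading a texture or sound from disk.
    return f"<resource:{key}>"

cache = ResourceCache(fake_load)
first = cache.get("player_sprite")
second = cache.get("player_sprite")  # served from the dictionary, no second load
```

The caller only needs a meaningful key, not a position in some array, which is what makes the dictionary the natural ADT here.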

Sphinx question: Structuring database

I'm developing a job service that has features like radial search, full-text search, and the ability to combine full-text search with disabling certain job listings (such as un-checking a checkbox and no longer returning full-time jobs).
The developer who is working on Sphinx wants the database information to all be stored as integers with a key (so under the table "Job Type" values might be stored such as 1="part-time" and 2="full-time"), whereas the other developers want to keep the database as strings (so under the table "Job Type" it says "part-time" or "full-time").
Is there a reason to keep the database as ints? Or should strings be fine?
Thanks!
Walker
Choosing your key can have a dramatic performance impact. Whenever possible, use ints instead of strings. This is called using a "surrogate key", where the key presents a unique and quick way to find the data, rather than the data standing on its own.
String comparisons are resource intensive, potentially orders of magnitude worse than comparing numbers.
You can drive your UI off the surrogate key but show another column (such as job_type). This way, when you hit the database you pass the int in and avoid searching the table for a row with a matching string.
When it comes to joining tables in the database, joins will run much faster if you have ints or another numeric type as your primary keys.
Edit: In the specific case you have mentioned, if you only have two options for what your field may be, and it's unlikely to change, you may want to look into something like a bit field, and you could name it IsFullTime. A bit or boolean field holds a 1 or a 0, and nothing else, and typically isn't related to another field.
If you are normalizing your structure (I hope you are), then numeric keys will be most efficient.
Aside from the usual reasons to use integer primary keys, the use of integers with Sphinx is essential, as the result set returned by a successful Sphinx search is a list of document IDs associated with the matched items. These IDs are then used to extract the relevant data from the database. Sphinx does not return rows from the database directly.
For more details, see the Sphinx manual, especially 3.5. Restrictions on the source data.
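A small sketch may help show how integer keys fit the "search returns document IDs, the database supplies the rows" flow. Everything here is assumed for illustration: SQLite via Python's sqlite3 stands in for the real database, the schema and the fetch_jobs helper are hypothetical, and the matched list is a hard-coded placeholder for whatever IDs the search engine would return (no Sphinx API is called).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE job_types (id INTEGER PRIMARY KEY, label TEXT UNIQUE NOT NULL)")
conn.execute(
    "CREATE TABLE jobs ("
    " id INTEGER PRIMARY KEY,"
    " title TEXT NOT NULL,"
    " job_type_id INTEGER NOT NULL REFERENCES job_types(id))"
)
conn.executemany("INSERT INTO job_types VALUES (?, ?)", [(1, "part-time"), (2, "full-time")])
conn.executemany(
    "INSERT INTO jobs VALUES (?, ?, ?)",
    [(10, "Barista", 1), (11, "Engineer", 2), (12, "Tutor", 1)],
)

def fetch_jobs(connection, matched_ids, excluded_type_ids=()):
    """Load rows for the document IDs a search returned, skipping unchecked job types."""
    if not matched_ids:
        return []
    id_marks = ",".join("?" for _ in matched_ids)
    sql = (
        "SELECT j.title, t.label FROM jobs AS j"
        " JOIN job_types AS t ON t.id = j.job_type_id"
        f" WHERE j.id IN ({id_marks})"
    )
    params = list(matched_ids)
    if excluded_type_ids:
        type_marks = ",".join("?" for _ in excluded_type_ids)
        sql += f" AND j.job_type_id NOT IN ({type_marks})"
        params += list(excluded_type_ids)
    sql += " ORDER BY j.id"
    return connection.execute(sql, params).fetchall()

# Pretend the search engine matched these document IDs (hypothetical result).
matched = [10, 11, 12]
all_jobs = fetch_jobs(conn, matched)
no_full_time = fetch_jobs(conn, matched, excluded_type_ids=[2])
```

The UI can still display "part-time" or "full-time" through the join, while filters and lookups pass integers only.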

Multiple or single index in Lucene?

I have to index different kinds of data (text documents, forum messages, user profile data, etc.) that should be searched together (i.e., a single search would return results of the different kinds of data).
What are the advantages and disadvantages of having multiple indexes, one for each type of data?
And the advantages and disadvantages of having a single index for all kinds of data?
Thank you.
If you want to search all types of document with one search, it's better to keep all types in one index. In the index you can define as many field types as you want and choose which ones to tokenize or store term vectors for.
It takes time to point each IndexSearcher at a directory that contains an index.
If you want to search each type separately, it would be better to give each type its own index.
A single index is more structured than multiple indexes.
On the other hand, multiple indexes let you balance load.
Not necessarily answering your direct questions, but... ;)
I'd go with one index and add a Keyword (indexed, stored) field for the type; it'll let you filter if needed, as well as tell apart the results you get back.
(And maybe in the vein of your questions: using separate indexes will allow each corpus to have its own relevancy score. I don't know whether excessively repeated terms in one corpus will throw off the relevancy of documents in others.)
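The single-index-plus-type-field idea can be sketched without Lucene at all. The toy class below is not Lucene code and makes no claim about its API; ToyIndex, its methods and the sample documents are hypothetical, and there is no scoring. It only shows how one stored type value per document allows both mixed and filtered searches.

```python
from collections import defaultdict

class ToyIndex:
    """In-memory stand-in for a single search index with a stored 'type' field."""

    def __init__(self):
        self._postings = defaultdict(set)  # term -> IDs of documents containing it
        self._types = {}                   # document ID -> type keyword

    def add(self, doc_id, doc_type, text):
        self._types[doc_id] = doc_type
        for term in text.lower().split():
            self._postings[term].add(doc_id)

    def search(self, term, doc_type=None):
        """Return (doc_id, type) pairs matching term, optionally filtered by type."""
        hits = sorted(self._postings.get(term.lower(), ()))
        return [
            (doc_id, self._types[doc_id])
            for doc_id in hits
            if doc_type is None or self._types[doc_id] == doc_type
        ]

index = ToyIndex()
index.add(1, "document", "Lucene index tuning guide")
index.add(2, "forum", "How do I rebuild my Lucene index")
index.add(3, "profile", "Search engineer and Lucene user")

everything = index.search("lucene")
forum_only = index.search("lucene", doc_type="forum")
```

Returning the type with every hit is what lets the caller distinguish a forum message from a profile in a combined result list.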
You should think logically about what each dataset contains and design your indexes by subject matter or other criteria (such as geography, business unit etc.). As a general rule, your index architecture should resemble how you would design databases (you likely wouldn't combine an accounting database with a personnel database, for example, even if technically feasible).
As #llama pointed out, creating a single uber-index affects relevance scores and security/access control, among other things, and causes a whole new set of headaches.
In summary: think of a logical partitioning structure depending on your business need. It would be hard to be more specific without further background.
I agree that each kind of data should have its own index, so that all the index options can be set accordingly: analyzers for the fields, what is stored for the fields, term vectors and similar. It also lets you handle reopening IndexReaders and committing IndexWriters differently for different kinds of data.
One obvious disadvantage is the need to handle several indexes instead of one. To make it easier, and because I always use more than one index, I created a small library to handle it: Multi Index Lucene Manager