Search on multiple indexes in RediSearch - redis

I'm using RediSearch for my project. There are different indexes in the project, such as job_idx, company_idx, article_idx and event_idx (the article_idx and event_idx structures are very similar). Each index is used on a different page, e.g. job_idx is used by the Job page search and company_idx by the Company page search.
The question is: on the homepage, the search should return results from every index, so should I call search 4 times? I think there should be a better solution for my case.

The FT.SEARCH command allows you to pass exactly one index as a parameter. So if you already have 4 indexes, you need to call the command 4 times.
It's typically simplest to have one index per entity, but in the end it's a question of how you design your physical data model to best support your queries. This can range from entirely separate indexes to one single index for everything (e.g., an 'all_fields' index with a type field). The best implementation might be somewhere in the middle (much like the 'normalized vs. de-normalized schema' question in relational database systems).
A potential solution for you could be to create an additional index (e.g., called combined_homepage) that indexes the specific fields needed for the homepage search. This index would then let you do a single search.
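For illustration, here is a rough sketch of that idea, assuming RediSearch 2.x and that the documents are stored as hashes under hypothetical key prefixes (job:, company:, article:, event:) with a couple of shared field names; adjust the schema to your actual data:

FT.CREATE combined_homepage ON HASH PREFIX 4 job: company: article: event: SCHEMA title TEXT WEIGHT 5.0 description TEXT type TAG

The homepage search then becomes a single call (and the type field, if you store one, lets you tell the kinds of results apart or filter them):

FT.SEARCH combined_homepage "search terms" LIMIT 0 10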
However, this additional index does need additional space. So, assuming you don't want to rethink the physical data model from scratch, you either invest in space (memory) to enable more efficient access, or spend more on compute and network (for combining the results of the 4 queries on the client side).
Hope this helps, even if my answer comes basically down to 'it depends' :-).

Related

SQL DBMS - Spatial Index with Arbitrary Types (non-geometric)

First off, at the moment I'm not looking for alternate suggestions, just a yes or a no, and if it's a yes, what the name is.
Are there any SQL DBMSs that allow you to create "spatial" indexes using arbitrary (i.e. non-geometric) data types like integers, dates, etc.? While spatial indexes are most commonly used for location data, they can also be used to properly index queries where you need to search within two or more ranges.
For example (and this is just a made-up example), suppose you had a database of customer receipts, and you wanted to find all transactions between $10 and $1000 that took place between 2000-01-01 and 2005-03-01. The fact that you're searching within multiple ranges means that regular b-tree indexes cannot be used to perform this lookup efficiently, at least not in a way that scales.
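Just to render that made-up example concretely (hypothetical table and column names):

SELECT *
FROM receipts
WHERE amount BETWEEN 10 AND 1000
  AND purchase_date BETWEEN DATE '2000-01-01' AND DATE '2005-03-01';

A composite b-tree on (amount, purchase_date) can typically only seek on the leading amount range; the date predicate is then filtered within that range rather than seeked, which is exactly the scalability problem described.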
Now yes, for the specific example I provided, and probably any other case, you could likely come up with some tricks to do it efficiently using the b-tree indexes, or at the very least narrow it down; I'm well aware, but again, not looking for alternate suggestions, just a no, or a yes and the name.
Appreciate any help you all can provide
EDIT: Just to clarify; I'm using the term spatial index as this is the most common term for it as well as the most commonly implemented use case. I am however referring to any index which uses quadtrees, r-trees, etc to achieve the same or similar effect.

How can I measure the cost of a database index?

Is there a good method for judging whether the costs of creating a database index in Postgres (slower INSERTS, time to build an index, time to re-index) are worth the performance gains (faster SELECTS)?
I am actually going to disagree with Hexist. PostgreSQL's planner is pretty good, and it supports good sequential access to table files based on physical-order scans, so indexes are not necessarily going to help. Additionally, there are many cases where the planner has to decide whether using an index is worthwhile at all, and you are already getting indexes from your unique constraints and primary keys.
I think one of the good default positions with PostgreSQL (MySQL, btw, is totally different!) is to wait until you need an index before adding one, and then only add the indexes you most clearly need. This is, however, just a starting point, and it assumes either a general lack of experience in looking at query plans or a lack of understanding of where the application is likely to go. Having experience in these areas matters.
In general, where you have tables likely to span more than 10 pages (that's 80kb of data and headers at PostgreSQL's default 8kb page size), it's a good idea to index foreign keys. These can be assumed to be clearly needed. Small lookup tables spanning a single page should never have non-unique indexes, because those indexes are never going to be used for selects (no query plan beats a sequential scan over a single page).
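For example (hypothetical table names), an index on a child table's foreign-key column is the kind of index you can add up front without waiting for a slow query to show up:

CREATE INDEX order_lines_order_id_idx ON order_lines (order_id);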
Beyond that point you also need to look at data distribution. Indexing boolean columns is usually a bad idea and there are better ways to index things relating to boolean searches (partial indexes being a good example). Similarly indexing commonly used function output may seem like a good idea sometimes, but that isn't always the case. Consider:
CREATE INDEX gj_transdate_year_idx ON general_journal (extract('YEAR' FROM transdate));
This will not do much. However an index on transdate might be useful if paired with a sparse index scan via a recursive CTE.
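For comparison, a minimal sketch of the simpler part of that alternative, reusing the general_journal example: index the raw column, and ordinary range predicates can use it directly (the recursive-CTE pairing is a separate refinement for distinct-value style queries):

CREATE INDEX gj_transdate_idx ON general_journal (transdate);

SELECT * FROM general_journal
WHERE transdate >= DATE '2011-01-01'
  AND transdate <  DATE '2012-01-01';

Any date range, not just whole calendar years, can then be answered with a plain index range scan, which the expression index above cannot offer.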
Once the basic indexes are in place, then the question becomes what other indexes do you need to add. This is often better left to later use case review than it is designed in at first. It isn't uncommon for people to find that performance significantly benefits from having fewer indexes on PostgreSQL.
Another major thing to consider is what sort of indexes you create, and these are often use-case specific. A b-tree index on an array column, for example, might make sense if ordinality is important to the domain and you frequently search based on the initial elements, but if ordinality is unimportant, I would recommend a GIN index, because a b-tree will do very little good (of course, that is an atomicity red flag, but sometimes that makes sense in Pg). Even when ordinality is important, sometimes you need GIN indexes anyway, because you need to be able to do commutative scans as if ordinality were not important. This is true when using ip4r, for example, to store CIDR blocks and using an EXCLUDE constraint to ensure that no block contains any other block (the actual scan requires using an overlap operator rather than a containment operator, since you don't know on which side of the operator the violation will be found).
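A small sketch of the array case, with hypothetical names (a docs table with a tags text[] column):

CREATE INDEX docs_tags_gin_idx ON docs USING gin (tags);

-- containment and overlap queries can use the GIN index:
SELECT * FROM docs WHERE tags @> ARRAY['postgres'];
SELECT * FROM docs WHERE tags && ARRAY['btree', 'gin'];

A b-tree on the same column would only help with whole-array equality and ordering by the leading elements, which is rarely what such queries need.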
Again this is somewhat database-specific. On MySQL, Hexist's recommendations would be correct, for example. On PostgreSQL, though, it's good to watch for problems.
As far as measuring goes, the best tool is EXPLAIN ANALYZE.
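For example (with a hypothetical orders table): prefix the query you care about and look at whether the plan uses an Index Scan or Bitmap Index Scan rather than a Seq Scan, and how close the estimated row counts are to the actual ones:

EXPLAIN ANALYZE
SELECT * FROM orders WHERE customer_id = 42;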
Generally speaking, unless you have a log or archive table that you won't be doing selects on very frequently (or it's OK if they take a while to run), you should index anything your select/update/delete statements will be using in a where clause.
This, however, is not always as simple as it seems: just because a column is used in a where clause and is indexed doesn't mean the SQL engine will be able to use the index. Using the EXPLAIN and EXPLAIN ANALYZE capabilities of PostgreSQL, you can examine which indexes were used in your selects and figure out whether having an index on a column will even help you.
This is generally true because, without an index, your select speed degrades from an O(log n)-like operation to O(n), while your insert speed only improves from c·O(log n) to d·O(log n), where d is usually less than c. In other words, you may speed up your inserts a little by not having an index, but you're going to kill your select speed if the columns aren't indexed, so it's almost always worth having an index on data you're going to be selecting against.
Now, if you have some small table that you do a lot of inserts and updates on, frequently remove all the entries from, and only periodically select from, it could turn out to be faster not to have any indexes. However, that would be a fairly special case, so you'd have to do some benchmarking and decide whether it makes sense in your specific situation.
Nice question. I'd like to add a bit more to what @hexist has already mentioned and to the info provided by @ypercube's link.
By design, a database doesn't know in which part of the table it will find data that satisfies the provided predicates. Therefore, by default the DB will perform a full or sequential scan of all the table's data, filtering out the needed rows.
An index is a special data structure that, for a given key, can specify precisely in which rows of the table such values will be found. The main differences when an index is involved:
there is a cost for the index scan itself, i.e. the DB has to find the value in the index first;
there's an extra cost for reading the matching data from the table itself.
Working with an index leads to a random IO pattern, compared to the sequential one used in a full scan. You can google for comparison figures of random and sequential disk access, but they can differ by up to an order of magnitude (random being slower, of course).
Still, it's clear that in some cases index access will be cheaper and in others a full scan should be preferred. This depends on how many rows (out of all) will be returned by the specified predicate, i.e. its selectivity:
1. If the predicate returns a relatively small number of rows, say less than 10% of the total, then it is worth picking those directly via the index. This is the typical case for primary/unique keys, or queries like: I need the address information for the customer with internal number = XXX;
2. If the predicate has no big impact on selectivity, i.e. 30% (or more) of the rows are returned, then it's cheaper to do a full scan, because sequential disk access beats random and the data will be delivered faster. All reports covering big areas (like a month, or all customers) fall here;
3. If you need an ordered list of values and there's an index, then an index scan is the fastest option. This is a special case of #2, when you need report data ordered by some column;
4. If the number of distinct values in the column is relatively small compared to the total number of values, then the index will be a good choice. This is the case called a Loose Index Scan, and typical queries look like: I need the 20 most recent purchases for each of the top 5 categories by number of goods (a sketch of the idea follows below).
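A hedged sketch of that loose-index-scan idea in PostgreSQL, using a hypothetical purchases(category, ...) table with an index on category; each step of the recursive CTE jumps to the next distinct value via the index instead of reading every row:

WITH RECURSIVE cats AS (
    SELECT min(category) AS category FROM purchases
    UNION ALL
    SELECT (SELECT min(p.category)
            FROM purchases p
            WHERE p.category > c.category)
    FROM cats c
    WHERE c.category IS NOT NULL
)
SELECT category FROM cats WHERE category IS NOT NULL;

The per-category "most recent purchases" part could then be done with a LATERAL join that orders by purchase date and takes LIMIT 20 for each category found this way.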
How does the DB decide what to do, index or full scan? This is a runtime decision based on statistics, so make sure to keep those up to date. In fact, the numbers provided above have no real-life precision; you have to evaluate each query independently.
All this is a very rough description of what happens. I would very much recommend looking into How PostgreSQL Planner Uses Statistics; it's the best material I've seen on the subject.

Maximizing performance of SQL indexes for VARCHAR columns with homogeneous prefixes

I'm designing a DB2 table, one VARCHAR column of which will store an alpha-numeric product identifier. The first few characters of these IDs have very little variation. The column will be indexed, and I'm concerned that performance may suffer because of the common prefixes.
As far as I can tell, DB2 does not use hash codes for selecting VARCHARs. (At least basic DB2, I don't know about any extensions.)
If this is to be a problem, I can think of three obvious solutions.
Create an extra, hash code column.
Store the text backward, to ensure good distribution of initial characters.
Break the product IDs into two columns, one containing a long enough prefix to produce better distribution in the remainder.
Each of these would be a hack, of course.
Solution #2 would provide the best key distribution. The backwards text could be stored in a separate column, or I could reverse the string after reading. Each approach involves overhead, which I would want to profile and compare.
With solution #3, the key distribution still would be non-optimal, and I'd need to concatenate the text after reading, or use 3 columns for the data.
If I leave my product IDs as-is, is my index likely to perform poorly? If so, what is the best method to optimize the performance?
I'm a SQL Server DBA, not a DB2 one, but I wouldn't think that having common prefixes would hurt you at all, indexing-wise.
The index pages simply store a "from" and "to" range of key values with pointers to the actual pages. The fact that an index page happens to store FrobBar001291 to FrobBar009281 shouldn't matter in the slightest to the db engine.
In fact, having these common prefixes allows other queries to take advantage of the index, like:
SELECT * FROM Products WHERE ProductID LIKE 'FrobBar%'
I agree with BradC that I don't think this is a problem at all, and even if there was some small benefit to the alternatives you suggest, I imagine all the overhead and complexity would outweigh any benefits.
If you're looking to understand and improve index performance, there are a number of topics in the Info Center (http://publib.boulder.ibm.com/infocenter/db2luw/v9r7/nav/2_3_2_4_1) that you should consider, in particular the last two below:
Index structure
Index cleanup and maintenance
Asynchronous index cleanup
Asynchronous index cleanup for MDC tables
Online index defragmentation
Using relational indexes to improve performance
Relational index planning tips
Relational index performance tips

Should I use LIKE to query tables with 4 million rows?

I am designing a search form, and I am wondering whether I should offer the possibility of searching with LIKE '%search_string%' on a table that is going to have up to 4 million rows.
In general, I would say no. This is a good candidate for full-text indexing. The leading % in your search string is going to eliminate the possibility of using any indexes.
There may be cases where the wait is acceptable and/or you do not want the additional administrative overhead of maintaining full-text indexes, in which case you might opt for LIKE.
No, you should really only use LIKE '%...%' when your tables are relatively small or you don't care about the performance of your own or other people's queries on your database.
There are other ways to achieve this capability which scale much better, full text indexing or, if that's unavailable or not flexible enough, using insert/update triggers to extract non-noise words for querying later.
I mention that last possibility since you may not want a full text index. In other words, do you really care about words like "is", "or" and "but" (these are the noise-words I was alluding to before).
You can separate the field into words and place the relevant ones in another table and use blindingly fast queries on that table to find the actual rows.
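As a sketch of the full-text route (assuming PostgreSQL here, since the question doesn't name a DBMS; other engines have their own full-text features):

CREATE INDEX items_description_fts_idx
    ON items USING gin (to_tsvector('english', description));

SELECT * FROM items
WHERE to_tsvector('english', description) @@ plainto_tsquery('english', 'search string');

This handles the stemming and noise-word ("stop word") filtering mentioned above for you, and the GIN index keeps the lookup fast regardless of where the words appear in the text.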
A search with LIKE '%search_string%' is very slow even on indexed columns. In the worst case, the search does a full table scan.
If a search with LIKE 'search_string%' is enough, I'd just provide that possibility.
It depends - without knowing how responsive the search has to be, it could either be fine or a complete no-go. You'll only really know if you profile your search with likely data patterns and search criteria.
And as RedFilter points out, you might want to consider full-text search if a plain search isn't performing well.

Multiple or single index in Lucene?

I have to index different kinds of data (text documents, forum messages, user profile data, etc.) that should be searched together (i.e., a single search would return results of the different kinds of data).
What are the advantages and disadvantages of having multiple indexes, one for each type of data?
And the advantages and disadvantages of having a single index for all kinds of data?
Thank you.
If you want to search all types of documents with one search, it's better to keep all types in one index. In that index you can define additional fields per type and choose which fields to tokenize or store term vectors for. It also takes time to point each additional IndexSearcher at a directory containing its index.
If you want to search the types separately, it's better to index each type in its own index.
A single index is more structured than multiple indexes; on the other hand, multiple indexes let you balance the load.
Not necessarily answering your direct questions, but... ;)
I'd go with one index and add a Keyword (indexed, stored) field for the type; it will let you filter if needed, as well as tell apart the kinds of results you receive back.
(and maybe in the vein of your questions... using separate indexes will allow each corpus to have its own relevancy scoring; I don't know whether excessively repeated terms in one corpus would throw off the relevancy of documents in the others?)
You should think logically about what each dataset contains and design your indexes by subject matter or other criteria (such as geography, business unit, etc.). As a general rule, your index architecture should mirror how you would organize databases (you likely wouldn't combine an accounting database with a personnel database, for example, even if it were technically feasible).
As @llama pointed out, creating a single uber-index affects relevance scores, raises security/access issues, and causes a whole new set of headaches.
In summary: think of a logical partitioning structure that fits your business need. It would be hard to be more specific without further background.
I agree that each kind of data should have its own index, so that all the index options can be set accordingly: analyzers for the fields, what is stored for the fields, term vectors and the like. It also lets you use different dynamics for when IndexReaders/Writers are reopened/committed for the different kinds of data.
One obvious disadvantage is the need to handle several indexes instead of one. To make that easier, and because I always use more than one index, I created a small library to handle it: Multi Index Lucene Manager.