Searching for a large number of IDs in MongoDB - mongodb-query

Thanks for looking at my query. I have 20k+ unique identification IDs provided by a client, and I want to look up all of these IDs in MongoDB using a single query. I tried $in, but it does not seem feasible to put all 20k IDs into $in and search. Is there a better way of achieving this?

If the id field is indexed, an $in query should be very fast, but I don't think it is a good idea to run a query with 20k ids at one time, as it may consume quite a lot of resources such as memory. You can split the ids into multiple groups of a reasonable size and run the queries separately; you can still run those queries in parallel at the application level.
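A minimal sketch of that batching idea in Python. The helper names (`chunked`, `find_in_batches`), the batch size and the worker count are all assumptions for illustration; `fetch_batch` is whatever function runs one $in query for one group of ids.

```python
from concurrent.futures import ThreadPoolExecutor


def chunked(ids, size):
    """Yield consecutive slices of `ids`, each holding at most `size` items."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def find_in_batches(ids, fetch_batch, batch_size=1000, workers=4):
    """Run `fetch_batch` once per group of ids, in parallel, and merge the results."""
    batches = list(chunked(ids, batch_size))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in the same order as the batches.
        per_batch = pool.map(fetch_batch, batches)
        return [doc for docs in per_batch for doc in docs]


# With pymongo, `fetch_batch` might look like this (the collection and
# field names are hypothetical):
#
#     def fetch_batch(batch):
#         return list(collection.find({"id": {"$in": list(batch)}}))
```

The batch size is worth tuning against your own data; 1000 here is only a placeholder.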

Consider importing your 20k+ ids into a collection (say, using mongoimport). Then perform a $lookup from your root collection to the search collection. Depending on whether the $lookup result is an empty array or not, you can proceed with your original operation that requires $in.
Here is Mongo playground for your reference.
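For readers without the playground link, here is a rough sketch of what such a pipeline could look like, built as a plain Python list so it can be passed to pymongo's `aggregate`. The collection name `search_ids`, the field name `id` and the helper `build_lookup_pipeline` are assumptions, not taken from the original answer.

```python
def build_lookup_pipeline(search_collection, local_field, foreign_field):
    """Build an aggregation pipeline that keeps only documents whose id
    also appears in the imported search collection."""
    return [
        # Join each root document against the imported ids.
        {"$lookup": {
            "from": search_collection,
            "localField": local_field,
            "foreignField": foreign_field,
            "as": "matched_ids",
        }},
        # An empty array means the id was not in the imported list.
        {"$match": {"matched_ids": {"$ne": []}}},
        # Drop the helper array from the output.
        {"$project": {"matched_ids": 0}},
    ]


pipeline = build_lookup_pipeline("search_ids", "id", "id")
# Hypothetical usage with pymongo: db["root"].aggregate(pipeline)
```

An index on the joined field in the search collection should help the $lookup stage.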

Related

Searching efficiently with keywords

I'm working with a big table (millions of rows) in a PostgreSQL database. Each row has a name column, and I would like to perform a search on that column.
For instance, if I'm searching for the movie Django Unchained, I would like the query to return the movie whether I search for Django or for Unchained (or Dj or Uncha), just like the IMDB search engine.
I've looked up full text search, but I believe it is more intended for long text; my name column will never be more than 4-5 words.
I've thought about having a keywords table with a many-to-many relationship, but I'm not sure that's the best way to do it.
What would be the most efficient way to query my database?
My guess is that for what you want to do, full text search is the best solution. (Documented here.)
It does allow you to search for any complete words. It allows you to search for prefixes on words (such as "Dja"). Plus, you can add synonyms as necessary. It doesn't allow for wildcards at the beginning of a word, so "Jango" would need to be handled with a synonym.
If this doesn't meet your needs and you need the capabilities of like, I would suggest the following. Put the title into a separate table that basically has two columns: an id and the title. The goal is to make the scanning of the table as fast as possible, which in turn means getting the titles to fit in the smallest space possible.
There is an alternative solution, which is n-gram searching. I'm not sure if Postgres supports it natively, but here is an interesting article on the subject that includes Postgres code for implementing it.
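To make the n-gram idea concrete, here is a small in-memory sketch using trigrams (n = 3). It is not the code from the linked article and not how Postgres does it; the class and function names are made up for illustration. Each name is broken into overlapping 3-character pieces, and a search intersects the row sets of the pieces in the search term before confirming with a real substring check.

```python
from collections import defaultdict


def trigrams(text):
    """Return the set of overlapping 3-character pieces of `text`, lowercased."""
    lowered = text.lower()
    return {lowered[i:i + 3] for i in range(len(lowered) - 2)}


class TrigramIndex:
    def __init__(self):
        self._postings = defaultdict(set)  # trigram -> row ids containing it
        self._names = {}                   # row id -> original name

    def add(self, row_id, name):
        self._names[row_id] = name
        for gram in trigrams(name):
            self._postings[gram].add(row_id)

    def search(self, term):
        grams = trigrams(term)
        if grams:
            # Only rows containing every trigram of the term can match.
            candidates = set.intersection(
                *(self._postings.get(g, set()) for g in grams)
            )
        else:
            # Terms shorter than 3 characters have no trigrams: scan everything.
            candidates = set(self._names)
        needle = term.lower()
        # Confirm candidates with a real substring test.
        return sorted(r for r in candidates if needle in self._names[r].lower())


index = TrigramIndex()
index.add(1, "Django Unchained")
index.add(2, "Jackie Brown")
print(index.search("Uncha"))
```

Note that very short terms such as "Dj" fall back to a full scan in this sketch, which is the weak spot of trigram-based approaches.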
The standard way to search for a sub-string anywhere in a larger string is using the LIKE operator:
SELECT *
FROM mytable
WHERE name LIKE '%Unchai%';
However, in case you have millions of rows it will be slow because there are no significant efficiencies to be had from indexes.
You might want to dabble with multiple strategies, such as first retrieving records where the value for name starts with the search string (which can benefit from an index on the name column - LIKE 'Unchai%';) and then adding middle-of-the-string hits after a second non-indexed pass. Humans tend to be significantly slower than computers on interpreting strings, so the user may not suffer.
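A rough sketch of that two-pass strategy. SQLite (from the Python standard library) is used here only as a stand-in so the example runs anywhere; the table name `mytable` comes from the query above, while the function name and the `limit` parameter are assumptions. The first query is the prefix search that can benefit from an index; the second picks up middle-of-the-string hits only if there is room left.

```python
import sqlite3


def two_pass_search(conn, term, limit=10):
    """Return prefix matches first, then fill up with substring matches."""
    # Pass 1: names that start with the term.
    results = conn.execute(
        "SELECT id, name FROM mytable WHERE name LIKE ? ORDER BY name LIMIT ?",
        (term + "%", limit),
    ).fetchall()

    remaining = limit - len(results)
    if remaining > 0:
        # Pass 2: names containing the term elsewhere, excluding pass-1 hits.
        results += conn.execute(
            "SELECT id, name FROM mytable "
            "WHERE name LIKE ? AND name NOT LIKE ? ORDER BY name LIMIT ?",
            ("%" + term + "%", term + "%", remaining),
        ).fetchall()
    return results


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO mytable VALUES (?, ?)",
    [(1, "Django Unchained"), (2, "Unchained Melody"), (3, "The Matrix")],
)
print(two_pass_search(conn, "Unchai"))
```

This sketch does not escape % or _ inside the search term; a real implementation would need to.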
This question is very much related to the autocomplete in forms. You will find several threads for that.
Basically, you will need a special kind of index, a space-partitioning tree. Postgres has an index type called SP-GiST which supports such index structures. You will find a bunch of useful stuff if you google for that.

Storing Search results for Future Use

I have the following scenario where the search returns a list of userid values (1,2,3,4,5,6... etc.). If the search were to be run again, the results are guaranteed to change given some time. However, I need to store the instance of the search results to be used in the future.
We have a current implementation (legacy), which creates a record for the search_id with the criteria and inserts every row returned into a different table with the associated search_id.
table search_results
search_id unsigned int FK, PK (clustered index)
user_id unsigned int FK
This is an unacceptable approach, as this table has grown to millions of records. I've considered partitioning the table, but I would end up with numerous partitions (1000s).
I've already optimized the existing tables so that search results expire unless they're used elsewhere, so all the remaining search results are referenced elsewhere.
In the current schema, I cannot store the results as serialized arrays or XML. I am looking to efficiently store the search result information, such that it can be efficiently accessed later without being burdened by the number of records.
EDIT: Thank you for the answers, I don't have any problems running the searches themselves, but the result set for the search gets used in this case for recipient lists, which will be used over and over again, the purpose of storing is exactly to have a snapshot of the data at the given time.
The answer is don't store query results. It's a terrible idea!
It introduces statefulness, which is very bad unless you really (really really) need it
It isn't scalable (as you're finding out)
The data is stale as soon as it's stored
The correct approach is to fix your query/database so it runs acceptably quickly.
If you can't make the queries faster using better SQL and/or indexes etc, I recommend using Lucene (or any text-based search engine) and denormalizing your database into it. Lucene queries are incredibly fast.
I recently did exactly this on a large web site that was doing what you're doing: it was caching query results from the production relational database in the session object in an attempt to speed up queries, but it was a mess, and wasn't much faster anyway - before my time, a "senior" Java developer (whose name started with Jam.. and ended with .illiams) who was actually a moron decided it was a good idea.
I put in Solr (a Java-tailored Lucene implementation) and kept Solr up to date with the relational database (using work queues), and the web queries now take just a few milliseconds.
Is there a reason why you need to store every search? Surely you would want the most up to date information available for the user ?
I'll admit first, this isn't a great solution.
Set up another database alongside your current one [SYS_Searches]
The save script could use SELECT INTO [SYS_Searches].Results_{Search_ID}
The script that retrieves can do a simple SELECT out of the matching table.
Benefits:
Every search is neatly packed into its own table, [preferably in another DB]
The retrieval query is very simple
The retrieval time should be very quick, no massive table scans.
Drawbacks:
You will have a table for every x user * y searches a user can store.
This could get very silly very quickly unless there is management involved to expire results or the user can only have 1 cached search result set.
Not pretty, but I can't think of another way.

How does a full text search server like Sphinx work?

Can anyone explain in simple words how a full text server like Sphinx works? In plain SQL, one would use SQL queries like this to search for certain keywords in texts:
select * from items where name like '%keyword%';
But in the configuration files generated by various Sphinx plugins I can not see any queries like this at all. They contain instead SQL statements like the following, which seem to divide the search into distinct ID groups:
SELECT (items.id * 5 + 1) AS id, ...
WHERE items.id >= $start AND items.id <= $end
GROUP BY items.id
..
SELECT * FROM items WHERE items.id = (($id - 1) / 5)
Is it possible to explain in simple words how these queries work and how they are generated?
Inverted Index is the answer to your question: http://en.wikipedia.org/wiki/Inverted_index
Now when you run a sql query through sphinx, it fetches the data from the database and constructs the inverted index which in Sphinx is like a hashtable where the key is a 32 bit integer which is calculated using crc32(word) and the value is the list of documentID's having that word.
This makes it super fast.
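A toy version of that structure, to show why lookups are cheap. This is only a sketch of the idea described above, not Sphinx's actual implementation; the function names are made up, and Python's `zlib.crc32` stands in for the hashing step.

```python
import zlib
from collections import defaultdict


def word_key(word):
    """Hash a word to a 32-bit integer key, as described above."""
    return zlib.crc32(word.lower().encode("utf-8"))


def build_index(documents):
    """Map crc32(word) -> set of document ids containing that word."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.split():
            index[word_key(word)].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents containing every word of the query."""
    postings = [index.get(word_key(word), set()) for word in query.split()]
    if not postings:
        return set()
    # One dictionary lookup per word, then a set intersection.
    return set.intersection(*postings)


docs = {1: "The yellow dog", 2: "The brown fox"}
index = build_index(docs)
print(search(index, "brown fox"))
```

A search never scans the documents themselves; it only touches the posting lists of the words in the query.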
Now you can argue that even a database can create a similar structure for making the searches superfast. However, the biggest difference is that a Sphinx/Lucene/Solr index is like a single-table database without any support for relational queries (JOINs) [From MySQL Performance Blog]. Remember that an index is usually only there to support search and not to be the primary source of the data. So your database may be in "third normal form", but the index will be completely de-normalized and contain mostly just the data needed to be searched.
Another possible reason is that databases generally suffer from internal fragmentation; they need to perform too many semi-random I/O tasks on huge requests.
What that means is, for example, considering the index architecture of a database, the query leads to the indexes, which in turn lead to the data. If the data to recover is widely spread, the result will take a long time, and that seems to be what happens in databases.
EDIT: Also please see the source code in cpp files like searchd.cpp etc for the real internal implementation, I think you are just seeing the PHP wrappers.
The queries you are looking at are the queries Sphinx uses to extract a copy of the data from the database and put it in its own index.
Sphinx needs a copy of the data to build its index (other answers have mentioned how that index works). You then ask for results (matching a specific query) from the searchd daemon - it consults the index and returns the matching documents.
The particular example you have chosen looks quite complicated because it is only extracting a part of the data, probably for sharding - to split the index into parts for performance reasons. It is also using range queries, so that it can access big datasets piecemeal.
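As a rough illustration of the "piecemeal" part: the $start and $end placeholders get filled in with successive ID windows, so the indexer never pulls the whole table in one query. The helper below is a hypothetical sketch of that stepping, not Sphinx's own code; the step size is an assumption.

```python
def id_ranges(min_id, max_id, step):
    """Yield (start, end) windows covering min_id..max_id inclusive."""
    if step <= 0:
        raise ValueError("step must be positive")
    start = min_id
    while start <= max_id:
        end = min(start + step - 1, max_id)
        yield start, end
        start = end + 1


# Each window would be substituted into a query of the form
#   ... WHERE items.id >= $start AND items.id <= $end
for start, end in id_ranges(1, 10, 4):
    print(start, end)
```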
An index could be built with a much simpler query, like
sql_query = select id,name,description from items
which would create a sphinx index, with two fields - name and description that could be searched/queried.
When searching, you would get back the unique id. http://sphinxsearch.com/info/faq/#row-storage
Full text search usually uses some implementation of an inverted index. In simple words, it breaks the content of an indexed field into tokens (words) and saves a reference to that row, indexed by each token. For example, a field with The yellow dog for row #1 and The brown fox for row #2 will populate an index like:
brown -> row#2
dog -> row#1
fox -> row#2
The -> row#1
The -> row#2
yellow -> row#1
A short answer to the question is that databases such as MySQL are specifically designed for storing and indexing records and supporting SQL clauses (SELECT, PROJECT, JOIN, etc). Even though they can be used to do keyword search queries, they cannot give the best performance and features. Search engines such as Sphinx are designed specifically for keyword search queries, thus can provide much better support.

Lucene.Net memory consumption and slow search when too many clauses used

I have a DB holding text file attributes and text file primary key IDs, and
I have indexed around 1 million text files along with their IDs (primary keys in the DB).
Now, I am searching at two levels.
First is a straightforward DB search, where I get primary keys as the result (roughly 2 or 3 million IDs).
Then I make a Boolean query, for instance as follows:
+Text:"test*" +(pkID:1 pkID:4 pkID:100 pkID:115 pkID:1041 .... )
and search it in my Index file.
The problem is that such a query (having 2 million clauses) takes far too much time to return results and consumes far too much memory.
Is there any optimization solution for this problem ?
Assuming you can reuse the dbid part of your queries:
Split the query into two parts: one part (the text query) will become the query and the other part (the pkID query) will become the filter
Make both parts into queries
Convert the pkid query to a filter (by using QueryWrapperFilter)
Convert the filter into a cached filter (using CachingWrapperFilter)
Hang onto the filter, perhaps via some kind of dictionary
Next time you do a search, use the overload that allows you to use a query and filter
As long as the pkid search can be reused, you should see quite a large improvement. As long as you don't optimise your index, the effect of caching should even work through commit points (I understand the bit sets are calculated on a per-segment basis).
HTH
p.s.
I think it would be remiss of me not to note that I think you're putting your index through all sorts of abuse by using it like this!
The best optimization is NOT to use the query with 2 million clauses. Any Lucene query with 2 million clauses will run slowly no matter how you optimize it.
In your particular case, I think it will be much more practical to search your index first with +Text:"test*" query and then limit the results by running a DB query on Lucene hits.
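A sketch of that reversed order, in Python with SQLite as a runnable stand-in for the real database. It assumes the text search has already produced a list of hit IDs; the table name `files`, the function name and the shape of the criteria are hypothetical. The hits are sent to the DB in small batches, so no single query carries millions of clauses.

```python
import sqlite3


def filter_hits_in_db(conn, hit_ids, criteria_sql, params=(), batch_size=500):
    """Keep only the text-search hits that also satisfy the DB criteria.

    `criteria_sql` is a trusted WHERE fragment (never user input);
    its values are passed separately through `params`.
    """
    matched = []
    for start in range(0, len(hit_ids), batch_size):
        batch = hit_ids[start:start + batch_size]
        placeholders = ",".join("?" * len(batch))
        sql = (
            f"SELECT pkID FROM files "
            f"WHERE pkID IN ({placeholders}) AND ({criteria_sql})"
        )
        rows = conn.execute(sql, (*batch, *params)).fetchall()
        matched.extend(row[0] for row in rows)
    return matched
```

This works well when the text query is the more selective of the two; if it also returns millions of hits, the same problem just moves to the other side.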

How (and where) should I combine one-to-many relationships?

I have a user table, and then a number of dependent tables with a one to many relationship
e.g. an email table, an address table and a groups table. (i.e. one user can have multiple email addresses, physical addresses and can be a member of many groups)
Is it better to:
Join all these tables, and process the heap of data in code,
Use something like GROUP_CONCAT and return one row, and split apart the fields in code,
Or query each table independently?
Thanks.
It really depends on how much data you have in the related tables and on how many users you're querying at a time.
Option 1 tends to be messy to deal with in code.
Option 2 tends to be messy to deal with as well in addition to the fact that grouping tends to be slow especially on large datasets.
Option 3 is easiest to deal with but generates more queries overall. If your data-set is small and you're not planning to scale much beyond your current needs, it's probably the best option. It's definitely the best option if you're only trying to display one record.
There is a fourth option, however, that is a middle-of-the-road approach, which I use in my job where we deal with a very similar situation. Instead of getting the related records for each row one at a time, use IN() to get all of the related records for your result set. Then loop in your code to match them to the appropriate record for display. If you cache search queries, you can cache that second query as well. It's only two queries and only one loop in the code (no parsing; use hashes to relate things by their key).
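A minimal sketch of that fourth option, using SQLite from the Python standard library so it runs as-is. The table and column names (`users`, `emails`, `user_id`, `address`) and the function name are assumptions based on the question. The first query fetches the users, the second fetches all of their emails with one IN(), and a dictionary keyed by user id ties them together.

```python
import sqlite3
from collections import defaultdict


def find_users_with_emails(conn, name_pattern):
    """Two queries total: the users, then all their emails via IN()."""
    users = conn.execute(
        "SELECT id, name FROM users WHERE name LIKE ? ORDER BY id",
        (name_pattern,),
    ).fetchall()
    if not users:
        return []

    ids = [user_id for user_id, _ in users]
    placeholders = ",".join("?" * len(ids))
    email_rows = conn.execute(
        f"SELECT user_id, address FROM emails "
        f"WHERE user_id IN ({placeholders}) ORDER BY address",
        ids,
    )

    # Hash the related rows by their key, then attach them in one pass.
    emails_by_user = defaultdict(list)
    for user_id, address in email_rows:
        emails_by_user[user_id].append(address)

    return [
        {"id": user_id, "name": name, "emails": emails_by_user.get(user_id, [])}
        for user_id, name in users
    ]
```

The same pattern repeats for addresses and groups: one extra query and one extra dictionary per related table, regardless of how many users came back.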
Personally, assuming my table indexes were up to scratch, I'd go with a table join, get all the data out in one go, and then process that to end up with a nested data structure. This way you're playing to each system's strengths.
Generally speaking, do the most efficient query for the situation you're in. So don't create a mega query that you use in all cases. Create case specific queries that return just the information you need.
In terms of processing the results, if you use GROUP_CONCAT you have to split all the resulting values during processing. If there are extra delimiter characters in your GROUP_CONCAT'd values, this can be problematic. My preferred method is to put the GROUPed BY field into a $holder during the output loop. Compare that field to the $holder each time through and change your output accordingly.
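The $holder loop might look something like this in Python. It is a sketch under the assumption that the joined rows arrive sorted by the grouped field and have the shape (user_id, name, email); the function name is made up. A new output record is started only when the grouped field differs from the one held from the previous row.

```python
def group_rows(rows):
    """Fold joined rows, sorted by user_id, into one record per user."""
    grouped = []
    holder = object()  # sentinel that never equals a real user_id
    for user_id, name, email in rows:
        if user_id != holder:
            # The grouped field changed: start a new record.
            grouped.append({"id": user_id, "name": name, "emails": []})
            holder = user_id
        if email is not None:  # a LEFT JOIN yields NULL for users with no email
            grouped[-1]["emails"].append(email)
    return grouped


rows = [
    (1, "Ann", "ann@example.org"),
    (1, "Ann", "ann@work.example"),
    (2, "Bob", None),
]
print(group_rows(rows))
```

Because nothing is concatenated and split again, delimiter characters inside the values cause no trouble.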