Nested search in solr - ruby-on-rails-3

I have an Activity model and an ActivityOccurrence model, where Activity has_many :activity_occurrences.
Activity: this model holds all the metadata required by ActivityOccurrence.
ActivityOccurrence: attrs - occurrence (datetime), completed.
Now we have a new requirement where we have to show all occurrences of an activity in the search results when a user searches for activities in a particular range.
Previously we used to show only one record in case of repeating activities.
So, as per the new requirement, we have decided to move the search from Activity to ActivityOccurrence.
Now, I don't want to index the meta information of Activity in each of my ActivityOccurrence records, as Activity has 10 more fields than ActivityOccurrence.
For example:
if I have an Activity with 1000 ActivityOccurrences, then I will be indexing all of my Activity information in 1000 ActivityOccurrence records.
This will take a huge amount of space as the app grows if we index this way.
Hence, my major concern is the amount of indexing I have to do.
So I am thinking of avoiding Activity indexes in ActivityOccurrence.
So is there a way to search Activity based on its filters first and then search ActivityOccurrence in the range based on the results from activities?
Note: we also have never-ending occurrences.
Any ideas?
Thanks in advance.

Unless you're dealing with millions of Activities/Occurrences, this may be a premature optimization - space is cheap, and SOLR is fast. Looking at this the other way around, have you considered just indexing a list of the activity occurrences that pertain to each activity (using callbacks to ensure that it gets updated)? It's hard to really optimize without more info about your data access patterns, but I'm never a fan of doing more round-trips than necessary.
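For the "other way around" idea, here is a minimal Sunspot sketch, assuming a multi-valued time field on Activity; the names occurrence_times, title, range_start and range_end are placeholders made up for illustration:
class Activity < ActiveRecord::Base
  has_many :activity_occurrences
  searchable do
    text :title                                # plus the rest of the Activity metadata fields
    time :occurrence_times, multiple: true do  # one multi-valued field instead of one document per occurrence
      activity_occurrences.map(&:occurrence)
    end
  end
end

class ActivityOccurrence < ActiveRecord::Base
  belongs_to :activity
  after_save { activity.solr_index }           # callback so the parent Activity's index stays current
end

# The range search then stays on Activity:
Activity.solr_search do
  with(:occurrence_times).between(range_start..range_end)
end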
That said, while I'm not sure how to write a pure SOLR query to do this, you can do it with Sunspot pretty easily:
Make sure that ActivityOccurrence is easily searchable by Activity (i.e. by activity ID).
Search Activity for the metadata that you want, and use this to extract the IDs that are relevant:
search = Activity.solr_search {<some block that does what you want>}
activity_ids = search.hits.map { |hit| hit.primary_key.to_i }
Now you can just add a with parameter to your ActivityOccurrence search block:
with(:activity_id, activity_ids)
This will limit the search to the occurrences for those activities. Note that you are trading off search-time performance for index efficiency with this.
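Putting the two steps together, a rough sketch in Sunspot (the field names activity_id, occurrence and completed come from the question; query_text, range_start and range_end are placeholders):
class ActivityOccurrence < ActiveRecord::Base
  belongs_to :activity
  searchable do
    integer :activity_id
    time    :occurrence
    boolean :completed
  end
end

# Step 1: search Activity on its metadata and collect the matching IDs.
search = Activity.solr_search { fulltext query_text }
activity_ids = search.hits.map { |hit| hit.primary_key.to_i }

# Step 2: search ActivityOccurrence, limited to those activities and the requested range.
occurrences = ActivityOccurrence.solr_search do
  with(:activity_id, activity_ids)
  with(:occurrence).between(range_start..range_end)
end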

Related

How to setup splunk summary index?

I'm a bit confused about setting up a summary index in Splunk.
I have an index named index_1 which receives logs from my app.
There are far too many logs, and I need to save an aggregation of them.
I have tried setting up the summary index from here to an index named summary,
but when I search that index there are no log entries.
My search is as follows:
index=index_1 ... level>30
I couldn't understand when to use the collect command and when setting it up from the web UI is enough.
Your search, index=index_1 ... level>30, should reduce the number of events returned to only those you want to store in the summary index. In this case, it looks like you're only interested in keeping events where level>30.
At the end of your search, you need to include the collect command. The collect command takes the remaining events and writes them to the named index, so collect index=summary
Overall, your search should look like
index=index_1 ... level>30 | collect index=summary
Here is an older blog post discussing summary indexing that may help you understand the process and good practices around using it.
https://davidveuve.com/tech/how-i-use-summary-indexes-in-splunk/

How to improve ranking of search in Apache Solr

I am implementing a search engine using Apache Solr. I want to improve results on the basis of the most frequent searches. For example, consider that my index has 5 words:
Down 99
Drawn 46
Dark 86
Dull 75
Dirty 63
The numbers show how many times users searched for a particular word.
I want that, if the next user comes in and types D, the response should be in descending order of previous search frequency, i.e. in the order Down, Dark, Dull, Dirty, Drawn.
The results will change from time to time, as the searched-word frequencies change after every search. How can I implement this in Solr? Any help with this will help me a lot. Thanking you in anticipation.
Regards A.S.Danyal
As vinod writes, you'll have to keep track of actual searches yourself - there is nothing built into Solr to handle this for you. However, when you DO have the search statistics available, you can implement the feature by having a separate collection / core of searches and their popularity that you search against. Each document would be a search term together with a count of how often it has been searched, i.e. a document with the fields search and search_count.
You can also apply a logarithmic function to search_count so that it influences the score of the search terms, which is useful for example if you have more than just the search term as a field influencing the score (such as an active category, etc.).
Depending on search volume, you probably don't need to update these values after every single search - updating them once a day or every other hour will usually be good enough. Keep track of the terms whose search volume has changed since the last update, and update those documents in a batch job at regular intervals.
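As a concrete illustration, a small Ruby sketch using the rsolr gem against such a core (the core name search_terms is hypothetical; the fields search and search_count and the "D" prefix come from the discussion above):
require 'rsolr'

solr = RSolr.connect(url: 'http://localhost:8983/solr/search_terms')

# Return the most frequently searched terms that start with what the user has typed so far.
response = solr.get('select', params: {
  q:    'search:D*',            # prefix query on the stored search term
  sort: 'search_count desc',    # most popular first
  rows: 5
})
suggestions = response['response']['docs'].map { |doc| doc['search'] }
# => ["Down", "Dark", "Dull", "Dirty", "Drawn"] for the counts in the question
If the score should combine text relevance with popularity rather than sorting on popularity alone, something like a bf=log(search_count) boost function on the (e)dismax parser is one way to fold the logarithm mentioned above into the ranking.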
Solr doesn't provide this kind of feature.
One way to achieve this is by using logs:
you will need to build an index of the search terms entered, which can be done by mining your search logs.

Large results set from Oracle SELECT

I have a simple statement, pasted below, that is run against an Oracle database. The result set contains names of businesses, but it has 24,000 rows, and these are displayed in a drop-down list.
I am looking for ideas on ways to reduce the result set to speed up the data returned to the user interface, maybe something like Google's search, or a completely different idea. I am open to any thoughts, and any direction is welcome.
SELECT BusinessName FROM MyTable ORDER BY BusinessName;
Idea:
SELECT BusinessName FROM MyTable WHERE BusinessName LIKE 'A%';
I know all about how LIKE clauses are not wise to use, but like I said, this is a LARGE result set. Maybe something along the lines of a BINARY search?
The last query can perform horribly. String comparisons inside the database can be very slow, and depending on the number of "hits" they can be a huge drag on performance. If that doesn't concern you, that's fine. This is especially true if the company data isn't normalized into its own db table.
As long as the user knows the company he's looking up, an existing JavaScript component from some popular JavaScript library that provides a search text field with a dynamic dropdown of matching results would be an effective mechanism. But you might want to use '%A%' if users might look for part of a name. For example, if I'm looking for IBM Rational, LLC, do I want it to show up in the results when I search for "Rational"?
Either way, watch your performance and, if it makes sense, cache that data in the company lookup service that sits on the server in front of the DB. Also, make sure you don't respond to every keystroke; use a timeout of 500 ms or so, to allow the user to type in multiple characters before going to the server and searching. Also, I would NOT recommend bringing all of the company names to the client. We're always looking to reduce the size and frequency of traversals to the server from the browser page. Waiting for 24k company names to come down to the client when the form loads (or even behind the scenes), when shorter, quicker, very specific queries will perform sufficiently well, does not seem efficient to me. Again, test it and identify the performance characteristics that fit your use case best.
These are techniques I've used on projects with large data sets, like searching for a user from a base of 100,000+ users. Our code was a custom Dojo widget (dijit); I'm not seeing how to do it directly with the dijit code, but jQuery UI provides an autocomplete widget.
Also, limit the number of rows returned by the query behind the text field, so that the drop-down only offers a subset of all the matches, forcing the user to further refine the query. Note that Oracle has no LIMIT keyword; use FETCH FIRST n ROWS ONLY (Oracle 12c and later) or a ROWNUM filter, for example:
SELECT BusinessName FROM MyTable WHERE BusinessName LIKE 'A%' ORDER BY BusinessName FETCH FIRST 10 ROWS ONLY
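For the server-side lookup service described above, a minimal Ruby sketch using the ruby-oci8 gem; the credentials and connect string are placeholders, the table and column come from the question, and the ROWNUM wrapper keeps it compatible with pre-12c Oracle:
require 'oci8'

conn = OCI8.new('app_user', 'secret', '//dbhost:1521/ORCL')   # placeholder connection details

# Return at most 10 business names starting with the typed prefix,
# passing the prefix as a bind variable rather than interpolating it into the SQL.
def business_name_suggestions(conn, prefix)
  sql = <<-SQL
    SELECT BusinessName FROM (
      SELECT BusinessName
      FROM MyTable
      WHERE BusinessName LIKE :1
      ORDER BY BusinessName
    )
    WHERE ROWNUM <= 10
  SQL
  names = []
  conn.exec(sql, "#{prefix}%") { |row| names << row[0] }
  names
end
Combined with the 500 ms debounce suggested above, this keeps each burst of typing down to a single small query.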

Elasticsearch - higher scoring if higher frequency of term

I have 2 documents, and am searching for the keyword "Twitter". Suppose both documents are blog posts with a "tags" field.
Document A has ONLY 1 term in the "tags" field, and it's "Twitter".
Document B has 100 terms in the "tags" field, and 3 of them are "Twitter".
Elasticsearch gives the higher score to Document A even though Document B has a higher term frequency; Document B's score is "diluted" because its field has more terms. How do I give Document B a higher score, since it has a higher frequency of the search term?
I know Elasticsearch/Lucene performs some normalization based on the number of terms in the field. How can I disable this normalization, so that Document B gets the higher score?
As the other answer says, it would be interesting to see whether you get the same result on a single shard. I think you would, and that it depends on the norms for the tags field, which are taken into account when computing the score with the tf/idf similarity (the default).
In fact, Lucene does take into account the term frequency, in other words the number of times the term appears within the field (1 or 3 in your case), and the inverse document frequency, in other words how frequent the term is across the index, in order to compare it with the other terms in the query (in your case that makes no difference, since you are searching for a single term).
But there's another factor, called norms, that rewards shorter fields and takes into account any index-time boosting, which can be per field (in the mapping) or even per document. You can verify that norms are the reason for your result by enabling the explain option in your search request and looking at the explain output.
I guess the fact that the first document contains only that tag makes it more important than the other one, which contains that tag multiple times but a lot of other tags as well. If you don't like this behaviour you can just disable norms in your mapping for the tags field. Norms are enabled by default if the field is "index":"analyzed" (the default). You can either switch to "index":"not_analyzed" if you don't want your tags field to be analyzed (which often makes sense, but depends on your data and domain) or add the "omit_norms": true option in the mapping for your tags field.
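For instance, via the elasticsearch-ruby client the mapping change might look roughly like this; the index and type names blog/post are assumptions, and since norms are baked into the index you may need to reindex for existing documents to be fully affected:
require 'elasticsearch'

client = Elasticsearch::Client.new(url: 'http://localhost:9200')

# Keep the tags field analyzed, but stop storing field-length norms for it,
# so shorter tag lists are no longer rewarded at scoring time.
client.indices.put_mapping(
  index: 'blog',
  type:  'post',
  body:  {
    post: {
      properties: {
        tags: { type: 'string', index: 'analyzed', omit_norms: true }
      }
    }
  }
)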
Are the documents found on different shards? From the Elasticsearch documentation:
"When a query is executed on a specific shard, it does not take into account term frequencies and other search engine information from the other shards. If we want to support accurate ranking, we would need to first execute the query against all shards and gather the relevant term frequencies, and then, based on it, execute the query."
The solution is to specify the search type. Use dfs_query_and_fetch search type to execute an initial scatter phase which goes and computes the distributed term frequencies for more accurate scoring.
You can read more here.
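With the Ruby client the search type is just an extra parameter on the search call; a sketch, with the index and field names assumed from the question (note that newer Elasticsearch versions accept only query_then_fetch and dfs_query_then_fetch, so the "then" variant is used here):
require 'elasticsearch'

client = Elasticsearch::Client.new(url: 'http://localhost:9200')

# Ask for a DFS pre-phase so term and document frequencies are gathered
# from all shards before scoring.
results = client.search(
  index:       'blog',
  search_type: 'dfs_query_then_fetch',
  body:        { query: { match: { tags: 'Twitter' } } }
)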

Getting lucene to return only unique threads (indexing both threads and posts)

I have a StackOverflow-like system where content is organised into threads, each thread having content of its own (the question body / text), and posts / replies.
I'm building the ability to search this content via Lucene, and I have decided that, if possible, I would like to index individual posts (it makes the index easier to update, and means I have more control and the ability to tweak the results) rather than entire threads. The problem I have, however, is that I want the search to display a list of threads, rather than a list of posts.
How can I get Lucene to return only unique threads as results, while also searching the content of the posts?
Each document can have a "threadId" field. After running a search, you can loop through your result set and return all the unique threadIds.
The tricky part is specifying how many results you want to return. If you want to show, say, 10 results on your results page, you'll probably need Lucene to return 10 + m results, since a certain percentage of the returned set will be de-duped out because they are posts belonging to the same thread. You'll need to incorporate some extra logic that runs another Lucene search if the deduped set is < 10.
This is what the Nutch project does when collapsing multiple search results that belong to the same domain.
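To make the refill logic concrete, a small Ruby sketch; run_search(query, offset, limit) and hit.thread_id are hypothetical stand-ins for however you execute the Lucene search and read the stored threadId field, not a real Lucene API:
# Collect up to `wanted` unique thread IDs, fetching further pages of hits
# whenever de-duplication leaves us short.
def unique_thread_ids(query, wanted = 10, batch = 30)
  thread_ids = []
  offset = 0
  loop do
    hits = run_search(query, offset, batch)   # hypothetical search helper returning post hits
    break if hits.empty?
    hits.each do |hit|
      tid = hit.thread_id                     # hypothetical accessor for the stored threadId
      thread_ids << tid unless thread_ids.include?(tid)
      return thread_ids if thread_ids.size == wanted
    end
    offset += batch
  end
  thread_ids
end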
When you index the threads, you should break each thread into its posts and make each post a Document with a field containing a unique ID identifying the thread to which it belongs.
For the search implementation, I would recommend using Lucene 2.9 or later, which enables you to use a Collector. Collectors let you process the retrieved documents as they are collected, so you'll be able to group together posts that originate from the same thread ID.
Just for completeness, later Lucene versions (from 3.2 onwards) support a grouping API that is very useful for this kind of use case:
http://lucene.apache.org/java/3_2_0/api/contrib-grouping/org/apache/lucene/search/grouping/package-summary.html