Spring Data Solr boost - bq clause

We are trying to use Spring Data Solr and have a heavy need for the boost query (bq) parameter.
Is there a way to generate a "bq" clause when querying Solr through Spring Data Solr?
The field we boost on is not used for filtering. Say I am looking for televisions made by Sony (the filter), but I want to boost on size (47-inch the most). The reason is that if there are no 47-inch sets, I can at least show TVs of different sizes.
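As far as I know, the Spring Data Solr criteria API has no first-class bq support, but you can drop down to SolrJ through SolrTemplate.execute. A minimal sketch, assuming Spring Data Solr 1.x (where the callback receives a SolrServer) and hypothetical manufacturer and size fields:

```java
import java.io.IOException;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.springframework.data.solr.core.SolrCallback;
import org.springframework.data.solr.core.SolrTemplate;

public class TvSearch {
    private final SolrTemplate solrTemplate;

    public TvSearch(SolrTemplate solrTemplate) {
        this.solrTemplate = solrTemplate;
    }

    public QueryResponse findSonyTvs() {
        return solrTemplate.execute(new SolrCallback<QueryResponse>() {
            @Override
            public QueryResponse doInSolr(SolrServer solrServer) throws SolrServerException, IOException {
                SolrQuery query = new SolrQuery("manufacturer:sony"); // the filter
                query.set("defType", "edismax");  // bq is a dismax/edismax parameter
                query.set("bq", "size:47^2.0");   // boost 47-inch sets without excluding other sizes
                return solrServer.query(query);
            }
        });
    }
}
```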

Related

Solr indexing in Storm topology vs. HBase NG Indexer

I am designing a data indexing feature for Solr. We are using a Storm topology with an HBase bolt that adds data into HBase. The requirement is that whatever data we add into HBase needs to be indexed as well.
The following are the options:
Add code to index into Solr in the HBase bolt itself.
Create a new bolt and keep Solr indexing separate (a rough sketch of this option follows below).
Use the HBase NG Indexer and hook Solr indexing into HBase row insertion.
The first two options are transaction-like, meaning data goes to both HBase and Solr or to neither. I am not sure we can do this, as we are dealing with data at large scale.
For the third option, the starting point is HBase, so all data is assumed to already be there. However, we do not have complete control over debugging, because we have to deploy the jar into the indexer environment.
Please help me decide which design is preferable.
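For reference, option 2 would look roughly like the sketch below: a dedicated bolt that buffers tuples and flushes them to Solr via SolrJ. This assumes the pre-1.0 backtype.storm API; the URL, field names and batch size are placeholders, and the ack-before-flush simplification trades durability for brevity.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichBolt;
import backtype.storm.tuple.Tuple;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

// Option 2 sketch: a bolt dedicated to Solr indexing, downstream of the HBase bolt.
public class SolrIndexBolt extends BaseRichBolt {
    private transient HttpSolrServer solr;
    private transient List<SolrInputDocument> buffer;
    private OutputCollector collector;

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
        this.solr = new HttpSolrServer("http://localhost:8983/solr/events"); // placeholder URL
        this.buffer = new ArrayList<SolrInputDocument>();
    }

    @Override
    public void execute(Tuple tuple) {
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", tuple.getStringByField("rowkey"));      // hypothetical fields
        doc.addField("content", tuple.getStringByField("value"));
        buffer.add(doc);
        try {
            if (buffer.size() >= 100) {   // flush in batches, not per tuple
                solr.add(buffer);
                solr.commit();
                buffer.clear();
            }
            collector.ack(tuple);         // simplification: acked once buffered
        } catch (Exception e) {
            collector.fail(tuple);        // let Storm replay on Solr errors
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // terminal bolt: nothing to declare
    }
}
```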
After some analysis, we went ahead and implemented the design with the HBase NG Indexer. One argument is that we cannot guarantee identical data in HBase and Solr anyway, since we cannot handle transactions at this scale. We also had a similar design in place for streaming data, so we made use of the existing setup.

Term-Document payload support in Lucene

I am using Elasticsearch 1.3.4, and as a result Lucene 4.9. I have a requirement to store some information per term-document pair (something like term frequency, but spanning a variable number of bytes). I am aware that Lucene supports payloads, but that information is stored per term-document-occurrence trio, so in my case payloads would be overkill. I could try saving the information as a payload on only the first occurrence of a term in a document, but that does not sound very clean.
I would like to know if there is an out-of-the-box solution for storing custom term-document information in Lucene. If not, what are my alternatives?
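On the "payload on the first occurrence only" workaround: it can be done with a small TokenFilter. A sketch against the Lucene 4.x API (the class name is mine, and the payload bytes are just a placeholder for whatever per-term-document data you need):

```java
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PayloadAttribute;
import org.apache.lucene.util.BytesRef;

// Attaches a payload only to the first occurrence of each term in a document.
public final class FirstOccurrencePayloadFilter extends TokenFilter {
    private final Set<String> seen = new HashSet<String>();
    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
    private final PayloadAttribute payloadAtt = addAttribute(PayloadAttribute.class);
    private final byte[] payload;

    public FirstOccurrencePayloadFilter(TokenStream input, byte[] payload) {
        super(input);
        this.payload = payload;
    }

    @Override
    public boolean incrementToken() throws IOException {
        if (!input.incrementToken()) {
            return false;
        }
        if (seen.add(termAtt.toString())) {
            payloadAtt.setPayload(new BytesRef(payload)); // first occurrence only
        } else {
            payloadAtt.setPayload(null);                  // later occurrences carry nothing
        }
        return true;
    }

    @Override
    public void reset() throws IOException {
        super.reset();
        seen.clear(); // start fresh for the next document
    }
}
```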

How to index PDF / MS-Word / Excel files really fast for full text search?

We are building a real-time search feature for institutions. The index is based on user-uploaded files (mostly Word/Excel/PDF/PowerPoint, plus ASCII files). I/O is expected at only 10-20 IOPS, though it varies from day to day; maximum I/O could be 100 IOPS. The current database size is approaching 10 GB after 4 months.
For the search server, I'm considering Solr/Lucene and possibly Elasticsearch. The challenge is how to index these files FAST, so that the search server can query the index in real time.
I have found some similar questions on how to index .doc/.xls/.pdf, but they did not mention how to ensure indexing performance:
Search for keywords in Word documents and index them
Index Word/PDF Documents From File System To SQL Server
How to extract text from MS office documents in C#
Using full-text search with PDF files in SQL Server 2005
So my question is: how do I build the index FAST?
Any suggestions on the architecture? Should I focus on building fast infrastructure (i.e. RAID, SSD, more CPU, network bandwidth), or on the indexing tools and algorithms?
We're building a high-performance full-text search for office documents and can share some insights:
We use Elasticsearch. It is difficult to make it perform well on large files; we have written several posts about it:
Highlighting Large Documents in ElasticSearch
Making ElasticSearch Perform Well with Large Text Fields
Use a microservice architecture and Docker to scale your application easily.
Do not store original files in Elasticsearch as binary data; store them separately, for example in MongoDB. (A sketch of client-side text extraction follows below.)
Hope it helps!
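To make the extraction point concrete: a rough sketch of extracting text with Apache Tika and indexing only the plain text, here via Solr's background-flushing client (the URL, queue size, thread count and field names are assumptions; the same idea applies to Elasticsearch):

```java
import java.io.File;

import org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer;
import org.apache.solr.common.SolrInputDocument;
import org.apache.tika.Tika;

// Extract text client-side with Tika, index it into Solr in batches,
// and keep the original binaries in separate storage.
public class TikaIndexer {
    public static void main(String[] args) throws Exception {
        Tika tika = new Tika();
        ConcurrentUpdateSolrServer solr =
                new ConcurrentUpdateSolrServer("http://localhost:8983/solr/docs", 100, 4);

        for (String path : args) {
            File file = new File(path);
            String text = tika.parseToString(file); // handles PDF, Word, Excel, PowerPoint, ...

            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", file.getAbsolutePath());
            doc.addField("content", text); // index the text; originals live elsewhere
            solr.add(doc);
        }
        solr.commit();
        solr.shutdown();
    }
}
```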

SOLR One collection (core) VS. many

I have multiple entities from a MySQL database that will be indexed in SOLR.
Which method gives the best performance (query time)?
Using a single SOLR collection (core) with a field for the entity type
Or having a collection (core) for every entity type
Thanks
I would add a few more parameters for you to consider (mostly discouraging the one-core-per-entity approach, and not only for the performance reasons you are specifically asking about):
More cores mean more endpoints, and your application will need to be aware of each of them. You may also find it difficult to run a query across cores. For example, if you are searching by a common attribute, say name, you would have to run a separate query against each core and aggregate the results, and this loses the relevancy ranking you get out of the box when querying a single core.
Consider making minimal requests to your database: N+1 JDBC queries drastically slow down indexing. Instead, try to aggregate your results in a view; if you can fire a single query, your indexing will be much faster.
Range queries on common attributes are not possible across cores. For example, if the prices of books and music CDs are stored in different cores, you can't get all products between X and Y in one price-range query (a sketch of the single-collection alternative follows below).
Faceting is similarly compromised.
So while you may see some indexing-time gain from parallelizing with one core per entity, I feel this reduces the features you can benefit from.
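To illustrate the range-query and faceting points, a minimal sketch of the single-collection approach, assuming a hypothetical entity_type discriminator field and a price field shared by all entities:

```java
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

// One collection for all entities, discriminated by an "entity_type" field.
// A single query can range-filter and facet across every entity type at once.
public class SingleCoreQuery {
    public static void main(String[] args) throws SolrServerException {
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/catalog");

        SolrQuery query = new SolrQuery("*:*");
        query.addFilterQuery("price:[10 TO 50]"); // range query spans books, CDs, everything
        query.addFacetField("entity_type");       // facet counts per entity type, one request
        QueryResponse response = solr.query(query);

        System.out.println(response.getResults().getNumFound() + " products in range");
    }
}
```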

Speeding up Solr Indexing

I am working on speeding up my Solr indexing. I want to know how many threads (if any) Solr uses for indexing by default. Is there a way to increase or decrease that number?
When you index a document, several steps are performed:
the document is analyzed,
data is put into the RAM buffer,
when the RAM buffer is full, data is flushed to a new segment on disk,
if there are more than ${mergeFactor} segments, segments are merged.
The first two steps run in as many threads as you have clients sending data to Solr, so if you want Solr to use three threads for these steps, all you need to do is send data to Solr from three threads.
You can configure the number of threads used for the fourth step if you use a ConcurrentMergeScheduler (http://lucene.apache.org/java/3_0_1/api/core/org/apache/lucene/index/ConcurrentMergeScheduler.html). However, there is no way to set the maximum number of threads from Solr's configuration files, so you need to write a custom class that calls setMaxThreadCount in its constructor.
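That custom class can be as small as the sketch below (the class name, package and thread count are my own choices); it would then be referenced from the mergeScheduler element in solrconfig.xml:

```java
import org.apache.lucene.index.ConcurrentMergeScheduler;

// A ConcurrentMergeScheduler subclass that fixes the merge thread count,
// since Solr's config files offer no knob for it. Reference it from
// solrconfig.xml via the mergeScheduler element.
public class FixedThreadMergeScheduler extends ConcurrentMergeScheduler {
    public FixedThreadMergeScheduler() {
        setMaxThreadCount(3); // allow up to three concurrent merge threads
    }
}
```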
My experience is that the main ways to improve indexing speed with Solr are:
buying faster hardware (especially I/O),
sending data to Solr from several threads (as many threads as cores is a good start),
using the Javabin format,
using faster analyzers.
Although StreamingUpdateSolrServer looks interesting for improving indexing performance, it doesn't support the Javabin format. Since Javabin parsing is much faster than XML parsing, I got better performance by sending bulk updates (batches of 800 in my case, with rather small documents) using CommonsHttpSolrServer and the Javabin format; a sketch follows below.
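For illustration, a rough sketch of that bulk-update setup using the SolrJ 1.4/3.x API (the URL, document count and field names are placeholders):

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.solr.client.solrj.impl.BinaryRequestWriter;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

// Bulk updates over Javabin: switch the request writer from XML to binary
// and send documents in batches of 800.
public class BulkIndexer {
    public static void main(String[] args) throws Exception {
        CommonsHttpSolrServer server =
                new CommonsHttpSolrServer("http://localhost:8983/solr");
        server.setRequestWriter(new BinaryRequestWriter()); // Javabin instead of XML

        List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
        for (int i = 0; i < 10000; i++) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", String.valueOf(i));
            doc.addField("text", "document body " + i);
            batch.add(doc);
            if (batch.size() == 800) { // flush a full batch
                server.add(batch);
                batch.clear();
            }
        }
        if (!batch.isEmpty()) {
            server.add(batch); // flush the remainder
        }
        server.commit();
    }
}
```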
You can read http://wiki.apache.org/lucene-java/ImproveIndexingSpeed for further information.
This article describes an approach to scaling indexing with SolrCloud, Hadoop and Behemoth. It is for Solr 4.0, which hadn't been released when this question was originally posted.
You can also store the content in external storage, such as files:
For any field that contains huge content, set stored="false" on that field in the schema and keep the field's content in an external file, using an efficient file-system hierarchy.
This reduced indexing time by 40-45% in my tests. Searches get somewhat slower, though: when the content has to be fetched, a search took about 25% more time than a normal search.