Apache Solr 5 - deduplicating data within a field

Here is my question (pardon the wordiness):
I have millions of documents and all of them are unique.
However, all documents contain a 'description' field, and this field holds text with only a few distinct variations across all 10 million documents. The field is large-ish, 400-800 words or so.
What is the most appropriate way to eliminate this repetition of data in the 'description' field?
Let me elaborate. Here is an example schema that has been simplified:
Doc_id <-- this is unique
Title <-- always unique as well
Description <-- contains mostly dupe data
I search over both the title and description but only return the title itself.
I'm fairly new to Solr but have been unable to find any information on how to tackle a scenario like this. In case it matters, I'm running Solr 5 on Ubuntu.
Thanks for any help!

I will try to provide some strategies to tackle your problem.
You say that you search over both title and description; this means both fields should be set to indexed="true" in your schema.xml. Only title is returned, so only title needs to be set to stored="true"; description should be set to stored="false". See this posting for more information on stored vs. indexed: Solr index vs stored
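A minimal sketch of what that looks like in schema.xml, using the field names from the example schema above (text_general is the stock example type shipped with Solr 5):

    <field name="doc_id"      type="string"       indexed="true" stored="true" required="true"/>
    <field name="title"       type="text_general" indexed="true" stored="true"/>
    <!-- searchable, but the raw text is not kept for retrieval -->
    <field name="description" type="text_general" indexed="true" stored="false"/>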
Another useful option you could try is field compression. If you need to store a field, you can apply gzip compression to certain field types, such as TextField and StrField; see https://wiki.apache.org/solr/SchemaXml for more info.
Lastly, deduplication is supported in Solr; see https://wiki.apache.org/solr/Deduplication. I have not tried this feature, but from the sound of it, you can prevent (nearly) duplicate documents from being indexed, or tag duplicates. Maybe its stated goal, "Allow for both duplicate collapsing in search results as well as deduplication on adding a document.", is what you are looking for?
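For reference, that wiki page wires deduplication up in solrconfig.xml via a SignatureUpdateProcessorFactory. A sketch along those lines (the chain name and signature field are illustrative, and the signature field must also be declared in schema.xml):

    <updateRequestProcessorChain name="dedupe">
      <processor class="solr.processor.SignatureUpdateProcessorFactory">
        <bool name="enabled">true</bool>
        <str name="signatureField">signature</str>
        <!-- false = only tag duplicates; true = overwrite them on add -->
        <bool name="overwriteDupes">false</bool>
        <!-- compute the signature from the field holding the near-duplicate text -->
        <str name="fields">description</str>
        <str name="signatureClass">solr.processor.Lookup3Signature</str>
      </processor>
      <processor class="solr.LogUpdateProcessorFactory" />
      <processor class="solr.RunUpdateProcessorFactory" />
    </updateRequestProcessorChain>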

Related

SQL Server: search entire JSON document for value

I have a few thousand rows in my table (SQL Server 2016).
One of the columns stores JSON documents (NVARCHAR(max)).
The JSON documents are quite complex in terms of nesting, etc.; they can also be very different from one another.
My goal is to search each document for a certain match. Say: "MagicNo":"999000".
So if the document has a property "MagicNo" and if the value is 999000 then it's a match.
I know you can navigate through the document using JSON_VALUE with a $ path, but since these docs can be very different, the "MagicNo" property may appear pretty much anywhere in the document (there is a lot of nesting). So path-based querying is out of the question here.
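For illustration, this is the kind of path-based lookup being ruled out (the table and column names here are invented); every possible nesting would need its own clause:

    -- works only when the exact path is known in advance
    SELECT *
    FROM docs
    WHERE JSON_VALUE(json_col, '$.MagicNo') = '999000'
       OR JSON_VALUE(json_col, '$.order.MagicNo') = '999000';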
Is there some kind of wild card I could use with JSON_VALUE to say search the entire doc and return it if the match is found?
Simple LIKE '%999000%' and CONTAINS searches on the NVARCHAR column are out of the question here due to poor performance.
Any thoughts?
Thanks.

How to exclude text indexed from PDF in a Solr query

I have a Solr index generated from a catalog of PDF files and corresponding metadata fields pertaining to the PDF files themselves. However, I would like to give my users an option to exclude from the query any text indexed from within a PDF, so that the query results are based on the metadata fields alone and are not biased by the vast amount of text within the PDF files.
I have thought of maybe having two indexes (cores) - one with the indexed pdf files and one without.
Is there another way?
Sounds like you are doing a general search against a default field, which means you have a lot of copyField instructions (or just one copyField * -> text) that include the PDF content field.
You can create a second destination and copyField everything except the PDF content field into it as well. This way, users can search against one or the other combined field.
However, remember that this parses all content according to the analysis chain of the destination field. So, eDisMax with a list of source fields may be a better approach there. And, remember, you can use several request handlers (like 'select') and define different default parameters there. That usually makes the client code a bit easier.
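A sketch of both ideas, with invented field names (meta_text as the second destination, pdf_text as the extracted PDF body). In schema.xml:

    <!-- meta_text must be declared as a multiValued text field -->
    <copyField source="title"  dest="meta_text"/>
    <copyField source="author" dest="meta_text"/>
    <!-- pdf_text is deliberately not copied into meta_text -->

And in solrconfig.xml, a handler whose default field skips the PDF content:

    <requestHandler name="/select-meta" class="solr.SearchHandler">
      <lst name="defaults">
        <str name="df">meta_text</str>
      </lst>
    </requestHandler>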
You do not need to use 2 separate indexes. You can use the edismax parser and specify the qf parameter at query time. That will help determine what fields are searched.
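For example (field names assumed), a metadata-only search would look like:

    /select?q=annual+report&defType=edismax&qf=title+author+subject

while the full search simply adds the PDF content field to qf.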
You can also look at field aliases.
If you have index fields
pdfmeta
pdftext
Then you can create two field aliases
quicksearch : pdfmeta
fullsearch : pdfmeta, pdftext
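With the eDisMax parser these aliases can be declared as f.<alias>.qf parameters, typically in the request handler defaults; a sketch:

    <lst name="defaults">
      <str name="defType">edismax</str>
      <str name="f.quicksearch.qf">pdfmeta</str>
      <str name="f.fullsearch.qf">pdfmeta pdftext</str>
    </lst>

A query like q=quicksearch:report then searches only pdfmeta.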
One advantage of using a field alias over qf is that if your users have bookmarks like q=quicksearch:value, you can change the definition of quicksearch without affecting the users' bookmarks.

SOLR indexed item has extra word which is not available in query parameter - how to identify those cases?

We have a scenario where we are trying to perform accurate name matching of Items using SOLR.
Query Parameter: Apple
SOLR Indexed Word: Apple-D
In our business case, "Apple" and "Apple-D" are totally different items and therefore SOLR shouldn't return the match.
Is there an option to achieve the same?
You need to change the fieldType used for the field: use the string fieldType.
The string fieldType makes sure the value is stored and indexed by Solr exactly as it is: no analysis is applied to it and no tokens are created from it.
With the string type applied, "Apple" and "Apple-D" are indexed as different terms, since there is no tokenization; this gives you the exact match you need.
Once you change the fieldType, re-index your data.
You can use the Solr analysis tool to check how values are indexed and queried.
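A minimal sketch in schema.xml (the field name is invented; "string" is the built-in solr.StrField type):

    <!-- no analyzer, no tokenizer: "Apple" and "Apple-D" stay distinct terms -->
    <field name="item_name" type="string" indexed="true" stored="true"/>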
Note: whenever you ask a question about this, share your schema.xml.

Fulltext Solr statistical search

Suppose I have a number of documents indexed with Solr 4.0. Each has two fields: a unique ID and a text DATA field. The DATA field contains a few paragraphs of text. Could anyone advise what kind of analyzers/parsers I should use, and how to build a statistical query to get a sorted list of the most frequently used words across the DATA fields of all documents?
For the most frequent terms, look into the TermsComponent and the StatsComponent.
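A sketch of a TermsComponent request, assuming the /terms handler from the sample solrconfig.xml is enabled and the field is named DATA:

    http://localhost:8983/solr/collection1/terms?terms.fl=DATA&terms.limit=20&terms.sort=count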
Besides the answers mentioned here, you can use the HighFreqTerms class; it's in the lucene-misc-4.0 jar (which is bundled with Solr).
This is a command line application which lets you see the top terms for any field either by document frequency or by total term frequency (the -t option)
Here is the usage:
java org.apache.lucene.misc.HighFreqTerms [-t] [number_terms] [field]
-t: include totalTermFreq
Here's the original patch, which is committed and in the 4.0 (trunk) and branch_3x codebases: https://issues.apache.org/jira/browse/LUCENE-2393
For the ID field, use an analyzer based on the keyword tokenizer - it will take the entire content of the field as a single token.
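A sketch of such a fieldType in schema.xml (the type name is invented; the built-in "string" type behaves the same way):

    <fieldType name="keyword_id" class="solr.TextField">
      <analyzer>
        <!-- emits the whole field value as one token -->
        <tokenizer class="solr.KeywordTokenizerFactory"/>
      </analyzer>
    </fieldType>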
For the DATA field, use a language-specific analyzer. Note that there is a possibility to auto-detect the language of the text (patch).
I'm not sure whether it's possible to find the most frequent words with Solr alone, but if you can use Lucene itself, pay attention to this question. My own suggestion is to use the HighFreqTerms class from the Luke project.

Tag suggestion (not tag autocomplete)

AJAX autocomplete is fairly simple to implement. However, I wonder how to handle smart tag suggestion like this on SO.
To clarify the difference between autocomplete and suggestion:
autocomplete: foo [foobar, foobaz]
suggestion: foo [barfoo, foobar, foobaz], or even better, with 'did you mean' feature: [barfoo, foobar, foobaz, fobar, fobaz]
I suppose I need some full-text search over the tags (all letters indexed, not just whole words). There would be no problem doing it with regex or other patterns for a limited number of tags (even client side).
But how to implement this feature for big number of tags?
Is there any particular reason (besides URL) the tags on SO are dash separated? What about Unicode characters in tags?
I store the tags in the table with the following columns: id, tagname.
My SQL query returns objects with following fields: id, tagname, count
(I use Doctrine ORM and pgsql as default db driver.)
I would go with SELECTing them from the database by REGEXP at every keypress. I did this on my sites and there was no performance problem (I do not have a heavily loaded server, though). If you do not like this idea, I would cache all 1-5 letter combinations which users enter and refresh them on a daily basis in a separate table. If that table is indexed, then you have a very fast implementation.
To elaborate more on the second approach:
Briefly:
1. Make a table SEARCHTABLE representing a 1-n relationship between keywords (limit them to 3-4 letters) and the primary IDs of tags.
2. Put an INDEX on both fields.
3. Every time the user makes a search, look at SEARCHTABLE; if the combination is there, use it - very fast, as everything is indexed. If not, do the regexp search and put all results into SEARCHTABLE.
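A rough pgsql sketch of that idea (the tag table's columns follow the question; the table and index names are invented):

    -- 1-n mapping from short keywords to tag ids
    CREATE TABLE searchtable (
        keyword varchar(4) NOT NULL,
        tag_id  integer    NOT NULL REFERENCES tag (id)
    );
    CREATE INDEX searchtable_keyword_idx ON searchtable (keyword);
    CREATE INDEX searchtable_tag_id_idx  ON searchtable (tag_id);

    -- cache hit: cheap indexed lookup
    SELECT t.id, t.tagname
    FROM searchtable s
    JOIN tag t ON t.id = s.tag_id
    WHERE s.keyword = 'foo';

    -- cache miss: regex scan over all tags; its results then get
    -- inserted into searchtable under the keyword 'foo'
    SELECT id, tagname FROM tag WHERE tagname ~ 'foo';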
Notes:
You should invalidate the table when you add tags, but this should happen much less often than a search. When invalidating the table you do not necessarily have to TRUNCATE it; you can easily rebuild it, taking all keywords into account.
If you want to speed it up, you can "pregenerate" all two- or even three-letter searches.
If you care enough, you should use the information from the n-1 letter keywords to generate the n letter keyword results. It speeds things up tremendously. Imagine that the user has typed "mo" and you have shown them the appropriate results from SEARCHTABLE. Then, when she types "n", giving "mon", you only need to search through the already-selected items to generate the new response.
I hope it is clearer now.