Solr: store original file offset or record number with token - indexing

I have a workflow with a pre-processing layer that extracts fields; the result is later handed to another process for ingestion into Solr. The original files are documents containing records (think tabular data).
Some of these columns are indexed in Solr in order to get the relevant document ID for a given value of the field. That is, you query like:
q=indexedField:indexedValue1
fl=documentId
and have a response like:
... response: {documentID1, documentID3}
assuming indexedValue1 is present in field indexedField of documents documentID1 and documentID3.
Each record will then have a value for each of the fields we want to index. The pre-processing concatenates these values into one (long) text field, with each value as a token, so you can later search by them. Indexed fields look like this when handed to Morphlines:
...
value1 value2 ... valueN
...
Some fields are extracted and then regrouped into one field, so if you want to search by a value, you can find out which document it is in.
(fairly simple up to here)
However, how could I also store in Solr, along with each token that I want to search by, the offset (or record number) in the original file? The problem is not extracting this information (that is a separate problem, and one we can solve).
That is, you would query like above, but for each document ID you would also get the original record number or file offset where the record is located, something like:
... response:{ {documentID1, [1234, 5678]}, { documentID3, [] } }
Is this possible at all? If so, what is the correct Solr data structure to model it efficiently?

It sounds like what you are looking for is payloads. This functionality is present in Solr, but often requires custom code to fully benefit from it.
The challenge, however, is that you seem to want to return the payloads associated with the tokens that matched during the search. That is even more complicated, as search focuses on returning documents; extracting what matched inside a specific document is a separate challenge, usually solved by highlighters.
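On the indexing side, here is a hedged sketch of how payloads could be attached: if the pre-processing writes each token as value|offset (e.g. value1|1234 value2|5678), Solr's DelimitedPayloadTokenFilterFactory can store the offset as a payload on each token. The field and type names below are illustrative, not part of any standard schema; retrieving the payload per matching token is where the custom code mentioned above comes in (since Solr 6.6 the payload() function query covers simple lookup cases).

```xml
<!-- schema.xml sketch: tokens arrive as "value|offset" pairs -->
<fieldType name="payloaded_text" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- splits each token on "|" and stores the remainder as an integer payload -->
    <filter class="solr.DelimitedPayloadTokenFilterFactory"
            delimiter="|" encoder="integer"/>
  </analyzer>
</fieldType>
<field name="indexedField" type="payloaded_text" indexed="true" stored="false"/>
```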

Related

Sql Server search entire Json document for value

I have a few thousand rows in my table (SQL Server 2016).
One of the columns stores JSON documents (NVARCHAR(max)).
The JSON documents are quite complex in terms of nesting, and they can also be very different from one another.
My goal is to search each document for a certain match. Say: "MagicNo":"999000".
So if the document has a property "MagicNo" and if the value is 999000 then it's a match.
I know you can navigate through the document using
JSON_VALUE with a $. path,
but since those docs can be very different, the "MagicNo" property may appear pretty much anywhere in the document (a lot of nesting), so fixed path expressions are out of the question here.
Is there some kind of wildcard I could use with JSON_VALUE to search the entire document and return it if a match is found?
The simple
like '%999000%'
and
CONTAINS
searches on the NVARCHAR column are out of the question here due to their poor performance.
Any thoughts?
Thanks.
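Since JSON_VALUE has no wildcard path, it may help to pin down the semantics being asked for. Below is a minimal, illustrative Python sketch (not T-SQL) of the recursive walk the question describes: find a property with a given name and value at any nesting depth. The function name is made up.

```python
import json

def has_match(node, key, value):
    """Return True if any object at any depth has property `key` equal to `value`."""
    if isinstance(node, dict):
        if node.get(key) == value:
            return True
        return any(has_match(child, key, value) for child in node.values())
    if isinstance(node, list):
        return any(has_match(item, key, value) for item in node)
    return False

doc = json.loads('{"order": {"lines": [{"MagicNo": "999000", "qty": 2}]}}')
print(has_match(doc, "MagicNo", "999000"))  # True
```

In SQL Server itself, one common workaround for this kind of search is to shred the documents with OPENJSON into a key/value table that can be indexed; note that OPENJSON only expands one nesting level per call, so deeply nested documents need recursive shredding.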

Lucene.Net range subquery not returning expected results (RavenDB related)

I'm trying to write a Lucene query to filter some data in RavenDB. Some of the documents in this specific collection are assigned a sequential number, and the valid ranges are not contiguous (for example, one range can be from 100 to 200 and another from 1000 to 1400). I want to query RavenDB using Raven Studio (v2.5, the Silverlight client) to retrieve all documents that have values outside of these user-defined ranges.
This is the overly simplified document structure:
{
  ExternalId: something/1,
  SequentialNumber: 12345
}
To test, I added 3500 documents, all of which have a SequentialNumber inside one of the following two ranges: 123-321 and 9000-18000, except for one that has 100000123. The ExternalId field is a reference to the parent document, and for this test all documents have it set to something/1. This is the Lucene query I came up with:
ExternalId: something/1 AND NOT
(SequentialNumber: [123 TO 321] OR SequentialNumber: [9000 TO 18000])
Running the query in RavenDB's Studio returns all documents whose SequentialNumber isn't in the 123-321 range. I would expect it to return only the document with 100000123 as its SequentialNumber. I've been Googling for help, but so far I haven't found anything to steer me in the right direction.
What am I doing wrong?
RavenDB indexes numbers in two ways: once as strings (which is what you see here) and once in numeric form.
For range queries use:
(SequentialNumber_Range: [Ix123 TO Ix321] OR SequentialNumber_Range: [Ix9000 TO Ix18000])
The Ix prefix means that you are querying the int32 form.

Apache Solr 5 - deduplicating data within a field

Here is my question (pardon the wordiness):
I have millions of documents and all of them are unique.
However, all documents contain a 'description' field, and the text in this field has only a few different variations across all 10 million documents. This field is large-ish, 400-800 words or so.
What is the most appropriate way to eliminate this repetition of data in the 'description' field?
Let me elaborate. Here is an example schema that has been simplified:
Doc_id <-- this is unique
Title <-- always unique as well
Description <-- contains mostly dupe data
I search over both the title and description but only return the title itself.
I'm fairly new to Solr but have been unable to find any information on how to tackle a scenario like this. In case it matters, I'm running Solr 5 on Ubuntu.
Thanks for any help!
I will try to provide some strategies to tackle your problem.
You say that you search over both title and description; this means you should set these fields to indexed=true in your schema.xml. Since only the title is returned, only title needs to be set to stored=true, and description should be set to stored=false. See this posting for more information on stored vs. indexed: Solr index vs stored
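Concretely, that suggestion would look something like this in schema.xml (the field type name is an assumption; use whatever text type your schema already defines):

```xml
<field name="title"       type="text_general" indexed="true" stored="true"/>
<field name="description" type="text_general" indexed="true" stored="false"/>
```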
Another useful option you could try is the field option compression. If you need to store a field, you can use gzip compression on certain fields, such as TextField and StrField, see: https://wiki.apache.org/solr/SchemaXml for more info.
Lastly, deduplication is supported in Solr, see: https://wiki.apache.org/solr/Deduplication. I have not tried this feature, but from the sounds of it, you can prevent (nearly) duplicate documents from being indexed, or tag duplicates. Maybe its stated goal, "Allow for both duplicate collapsing in search results as well as deduplication on adding a document.", is what you are looking for?
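For reference, deduplication is wired up as an update processor chain in solrconfig.xml. The sketch below is adapted from the linked wiki page; since your documents themselves are unique and only the description repeats, it computes a signature over description into a separate field (name assumed) and leaves overwriteDupes off, so duplicates are tagged for collapsing rather than dropped:

```xml
<updateRequestProcessorChain name="dedupe">
  <processor class="solr.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <!-- field that receives the computed signature (illustrative name) -->
    <str name="signatureField">description_signature</str>
    <!-- tag duplicates instead of overwriting documents -->
    <bool name="overwriteDupes">false</bool>
    <str name="fields">description</str>
    <str name="signatureClass">solr.processor.Lookup3Signature</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
```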

RavenDb - What's faster?

I need to do a query in RavenDB and perform a get on a document by Id and a secondary parameter.
To be more precise, I'd like to load a document by document Id and by ApiKey. If the ApiKey of the given document does not match, I want a null back.
My question is: is it faster to do a query with an Id and ApiKey comparison, or to do a Load by Id and throw away the document in code if the ApiKey does not match? My documents are around 20k in size.
Do a load by id, then compare. A load by id is a direct fetch from the document store, whereas a query goes through an index, which RavenDB updates asynchronously, so it is both slower and potentially stale.

SEO and magic numbers in URL

Which URL is more relevant, 1 or 2?
1: http://site.com/language/country/city/category/title
2: http://site.com/language/country/city/category/articleId(number)/title
The thing is, I have to design my DB in an ineffective way for (1), doing textual search and table joins, but I'm not sure whether (2), where I'm just putting a direct table ID in the URL, loses relevance in search results.
The first would be the most relevant, as it doesn't contain any irrelevant data such as the articleId.
If you are concerned about keeping titles unique, add a second database column, called filename for example, which holds a URL-encoded version of the title. If the title is already in use, append an incremented value at the end.
For example, if the title 'SEO' was already in use by another article, call the new one SEO-1, and so on.
That way you are only appending extra values when two titles clash.
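The clash-handling loop described above can be sketched as follows; this is a minimal illustration, with an in-memory set standing in for the database lookup and all names invented:

```python
import re

def slugify(title):
    # Lower-case and replace runs of non-alphanumeric characters with hyphens.
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")

def unique_filename(title, taken):
    # Append -1, -2, ... until the slug is not already taken.
    base = slugify(title)
    slug, n = base, 0
    while slug in taken:
        n += 1
        slug = f"{base}-{n}"
    taken.add(slug)
    return slug

taken = set()
print(unique_filename("SEO", taken))  # seo
print(unique_filename("SEO", taken))  # seo-1
```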