Indexing documents in Solr with their related PDF

I have to index documents together with their related PDFs. Is it possible to do it with one command, like indexing a CSV, JSON, or XML file?
I have a CSV file with the data: id, name, field1, field2, ..., NameOfThePDF
What I need is to add all the documents with each PDF's content indexed (but not stored), so I can query across both the documents and their content.
Examples:
1, Titanic, Movie, Long, SynopsisPDF_titanic.pdf
2, LinkinPark, Music, Short, AlbumList_LinkinPark.pdf
If I query "Di Caprio" that is inside the SynopsisPDF_Titanic.pdf, I must find "Titanic" as result.

Apache Solr - Need to index the same documents two times for plain text query to work

I have a dataset of 76,000 large JSON documents, which I indexed into Solr with post.jar:
java -jar -Dc=sampleindex -Dauto example/exampledocs/post.jar "/home/sample/sample.json"
I can see in the Solr admin dashboard that the 76,000 documents are indexed, and I can do field-based searches against the index. For example, if customer_name is one of the fields, I can search with:
http://localhost:8983/solr/sampleindex/select?fl=customer_name&q=customer_name%3Arachel
and I get results from Solr. However, when I search using only plain text:
http://localhost:8983/solr/sampleindex/select?q=rachel
I don't get any results.
I then indexed the 76,000 documents again, so the total document count is now 152,000. Now, if I search with plain text "rachel":
http://localhost:8983/solr/sampleindex/select?q=rachel
I get results.
I am not sure if something is wrong with the way I am indexing.
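There is no accepted explanation in the thread, but a likely cause: with the default lucene query parser, q=rachel with no field prefix searches only the default field (df), and unless a copyField moves customer_name into that catch-all field, the bare query matches nothing. A hedged SolrJ sketch of making the target field explicit (the field list is an assumption):

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class PlainTextSearch {
    public static void main(String[] args) throws Exception {
        SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/sampleindex").build();

        // Same bare query as in the question, but with an explicit default field
        SolrQuery q = new SolrQuery("rachel");
        q.set("df", "customer_name"); // field to search when the query names none
        // Or search several fields at once with edismax:
        // q.set("defType", "edismax");
        // q.set("qf", "customer_name other_field");

        QueryResponse rsp = solr.query(q);
        System.out.println("hits: " + rsp.getResults().getNumFound());
        solr.close();
    }
}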

How does Solr index documents?

I'm new to Solr and I want to understand exactly how it indexes documents.
Let's say I have a 100 MB document (document1) full of text. The text is not structured, it's just raw text. I send that document to Solr in order to be indexed.
As far as I understand, Lucene will parse the document, extract all the words based on the default schema (let's assume we're using the default schema), and create an index that is basically a mapping between a word and a list of documents, like so:
word1 -> [document1]
word2 -> [document1]
etc
Now, if I want to search for the word "word1", Solr will give me the entire 100 MB document that contains the word "word1", correct?
Please correct me if I'm wrong, I need to understand exactly how it works.
You described most of the indexing part more or less correctly, at least at a high level. The reason you are getting your whole document back is that your field is a stored one in your Solr schema (which is the default).
This means that, apart from having a postings list like
word1 -> doc1, doc3
word2 -> doc2, doc3
etc.
Solr/Lucene also stores the original content of the field, so it is able to return it back to you. You can either explicitly turn this off by setting stored="false" in your schema, or filter it out via the fl parameter and request only fl=id (or something similar).
If you would like to return only the parts of the document around the matched terms, you can use Solr's Highlighting feature. Highlighting in Solr allows fragments of documents that match the user's query to be included with the query response.
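As an illustration of both suggestions (fl trimming and highlighting), a hedged SolrJ sketch; the core name and the content field are assumptions:

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class FragmentsNotWholeDocs {
    public static void main(String[] args) throws Exception {
        SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/mycore").build();

        SolrQuery q = new SolrQuery("content:word1");
        q.setFields("id");            // fl=id: skip the stored 100 MB body in the response
        q.setHighlight(true);         // hl=true: return short matching fragments instead
        q.addHighlightField("content");
        q.setHighlightFragsize(200);  // roughly 200-character snippets around each match

        QueryResponse rsp = solr.query(q);
        System.out.println(rsp.getHighlighting());
        solr.close();
    }
}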

How to exclude text indexed from PDFs in a Solr query

I have a Solr index generated from a catalog of PDF files and corresponding metadata fields pertaining to the PDF files themselves. I would like to give my users an option to exclude from the query any text indexed from within a PDF, so that the query results are based on the metadata fields instead and not biased by the vast amount of text within the PDF files.
I have thought of maybe having two indexes (cores) - one with the indexed pdf files and one without.
Is there another way?
Sounds like you are doing a general search against a default field, which means you have a lot of copyField instructions (or just one copyField * -> text) that include the PDF content field.
You can create a second destination field and copyField everything except the PDF content field into it as well. This way, users can search against one or the other combined field.
However, remember that this parses all content according to the analysis chain of the destination field. So eDisMax with a list of source fields may be a better approach here. And remember, you can define several request handlers (like 'select') with different default parameters for each; that usually makes the client code a bit easier.
You do not need to use two separate indexes. You can use the edismax parser and specify the qf parameter at query time; that determines which fields are searched.
You can look at field aliases.
If you have two index fields:
pdfmeta
pdftext
Then you can create two field aliases
quicksearch : pdfmeta
fullsearch : pdfmeta, pdftext
One advantage of using a field alias over qf is that if your users have bookmarks like q=quicksearch:value, you can change the alias for quicksearch without affecting the user's bookmark.
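A hedged sketch of the alias idea using eDisMax per-field aliasing (the f.<alias>.qf parameters); the core name and query value are made up:

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;

public class MetadataOnlySearch {
    public static void main(String[] args) throws Exception {
        SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/pdfcatalog").build();

        // The user types q=quicksearch:report; the alias decides which real fields are searched
        SolrQuery q = new SolrQuery("quicksearch:report");
        q.set("defType", "edismax");
        q.set("f.quicksearch.qf", "pdfmeta");        // metadata only
        q.set("f.fullsearch.qf", "pdfmeta pdftext"); // metadata plus PDF body

        System.out.println(solr.query(q).getResults().getNumFound());
        solr.close();
    }
}

These parameters could equally be set as defaults on two request handlers, as the first answer suggests, so that client code only has to pick an endpoint.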

Solr: store original file offset or record number with token

I have a workflow with a layer of pre-processing that extracts fields; the result is later handed to another process to be ingested into Solr. The original files comprise documents with records, think tabular data.
Some of these columns are indexed in Solr in order to find the relevant documentID for a given value of a field. That is, you query like:
q=indexedField:indexedValue1
fl= documentId
and get a response like:
... response: {documentID1, documentID3}
assuming indexedValue1 is present in field indexedField in documents documentID1, documentID3.
Each record will then have a value in one of the fields we want to index. The pre-processing concatenates these values into one (long) text field, with each value as a token, so you can later search by them. Indexed fields, when handed to Morphlines, look like this:
...
value1 value2 ... valueN
...
Some fields are extracted and then regrouped into a single field, so if you want to search by a value, you can find out which document it is in.
(fairly simple until here)
However, how could I also store in Solr, along with each token that I want to search by, the offset (or record number) in the original file? The problem is not extracting this information (that is another problem, but one we can solve).
That is, you would query as above, but would get, for each document ID, the original record number or file offset where the record is located, something like:
... response:{ {documentID1, [1234, 5678]}, { documentID3, [] } }
Is this possible at all? In that case, what's the correct Solr data structure to efficiently model it?
It sounds like what you are looking for is Payloads. This functionality is present in Solr, but often requires custom code to fully benefit from it.
The challenge, however, is that you seem to want to return the payloads associated with the tokens that matched during search. That is even more complicated, since search focuses on returning documents; extracting what matched within a specific document is a separate challenge, usually solved by highlighters.
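For completeness, a minimal sketch of the payload route, assuming the default configset's delimited_payloads_float field type (the *_dpf dynamic field) and the payload() function query available in recent Solr versions; the field and core names are invented:

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class TokenPayloads {
    public static void main(String[] args) throws Exception {
        SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/records").build();

        // Each token carries its file offset as a float payload: token|offset
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "documentID1");
        doc.addField("values_dpf", "value1|1234 value2|5678");
        solr.add(doc);
        solr.commit();

        // Return, per matching document, the payload stored with the queried token
        SolrQuery q = new SolrQuery("values_dpf:value1");
        q.setFields("id", "offset:payload(values_dpf,value1)");
        System.out.println(solr.query(q).getResults());
        solr.close();
    }
}

Note the limitation mentioned above: payload() takes the term as an argument, so the client has to repeat the searched value; payloads are not surfaced automatically for whatever happened to match.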

Lucene: are multiple documents created for a large number of unique resources?

I am a beginner in Lucene search. If I have a collection of resources like:
id, name, {list of products}, {list of keywords}
and I want to search based on name, products, or keywords, I have some doubts related to Lucene and its usage:
1) For document creation, I create a document with the structure id, name, products (multiple values), keywords (multiple values). If I have a thousand unique resources, will it create 1000 unique documents?
2) Also, if I make the name and products fields searchable (as StringField), then after searching, will the result (ScoreDocs) contain exactly the set of documents that contain the text I searched for?
Q> <..> will it create 1000 unique documents?
A> Lucene doesn't have the concept of "uniqueness" - it exists only in your head. Alternatively, think of it as if all documents are unique to Lucene. If you search by these fields, the relevant documents will be returned.
Q> <..> will the result (ScoreDocs) contain exactly the set of documents that contain the text I searched for?
A> Strange/unclear question. If your search matches all documents, you will get all documents; if your query only matches some documents, only those will be returned. The internals are more complex - it all depends on how you analyze the text. Maybe you can give a more concrete example with use cases?
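To make this concrete, a small self-contained Lucene sketch (Lucene 8.x API, all names invented): each addDocument call creates exactly one Lucene document, so a thousand resources indexed this way yield a thousand documents. Also note that StringField is indexed verbatim, without analysis, so it only matches the exact full value; TextField is what gets tokenized for free-text search.

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class ResourceIndexDemo {
    public static void main(String[] args) throws Exception {
        Directory dir = new ByteBuffersDirectory();
        try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
            // One Document per resource; call add(...) again for each extra product/keyword value
            Document doc = new Document();
            doc.add(new StringField("id", "42", Field.Store.YES));          // not analyzed: exact match only
            doc.add(new TextField("name", "Blue Widget", Field.Store.YES)); // analyzed: matches "blue" or "widget"
            doc.add(new TextField("products", "widget gadget", Field.Store.YES));
            writer.addDocument(doc); // 1000 resources -> 1000 addDocument calls -> 1000 documents
        }
        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            // Lower-cased term, because StandardAnalyzer lower-cases TextField content at index time
            TopDocs hits = searcher.search(new TermQuery(new Term("name", "blue")), 10);
            System.out.println("matches: " + hits.totalHits);
        }
    }
}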