Modify an existing Solr 7.6.0 / Lucene index (add another field 'URL' to an already indexed file (.pdf, .docx etc.))

I have a Solr 7.6.0 / Lucene index containing lots of .pdf, .docx and .xlsx files.
The index was created with the post command in a command window, pointing at a directory share (a mapped file path) where the files live.
There is also a web URL for each document, which I have in a database and which the index currently knows nothing about. I would like to 'enrich' the existing index with this URL data.
Can I extract the ids of the currently indexed files and then use the Solr web interface to modify the existing index, injecting the URL?
I am looking at the following tutorial for advice:
https://www.tutorialspoint.com/apache_solr/apache_solr_indexing_data.htm
The tutorial shows an example of adding a document but not modifying one.

Thanks @MatsLindh, I managed to get it to work.
I used the Solr admin GUI to add the field with a Schema API add-field command:
{
  "add-field": {
    "name": "URL",
    "type": "string",
    "stored": true,
    "indexed": true
  }
}
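For reference, the same add-field command can also be posted straight to the Schema API with curl (the core name mycore below is a placeholder for whatever your core is called):
curl -X POST -H 'Content-type: application/json' --data-binary '{"add-field":{"name":"URL","type":"string","stored":true,"indexed":true}}' http://localhost:8983/solr/mycore/schema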
I then set the new field on the existing document with an atomic update:
{
  "id": "S:\\Docs\\forIndexing\\indexThisFile_001.pdf",
  "URL": {"set": "https://localhost/urlToFiles/indexThisFile_001.pdf"}
}
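A sketch of how the same atomic update can be sent from the command line (again assuming a core named mycore); note that atomic updates rebuild the document from its stored fields, so the other fields in the document need to be stored as well:
curl -X POST -H 'Content-Type: application/json' 'http://localhost:8983/solr/mycore/update?commit=true' --data-binary '[{"id":"S:\\Docs\\forIndexing\\indexThisFile_001.pdf","URL":{"set":"https://localhost/urlToFiles/indexThisFile_001.pdf"}}]'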

Related

How to create a custom-named time-based index using Logstash?

Currently I am trying to create a time-based index in Elasticsearch using Logstash.
I have come across, and successfully tried, the following output configuration in the Logstash config file.
output {
  elasticsearch {
    hosts => ["127.0.0.1:9200"]
    index => "logstash-%{+xxxx.ww}"
  }
  file { path => "C:/output3.txt" }
}
This is working fine, but the name is required to be "logstash".
I need to give it some custom name other than logstash.
Can you please suggest something?
Thanks again
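A minimal sketch of what a custom name could look like, assuming the prefix in the elasticsearch output can simply be changed (myindex is a placeholder; note that with a name that no longer matches logstash-*, the default Logstash index template may not apply, so mappings might need separate handling):
output {
  elasticsearch {
    hosts => ["127.0.0.1:9200"]
    index => "myindex-%{+xxxx.ww}"
  }
}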

Tika Parser: Exclude PDF Attachments

There is a PDF document that has attachments (here: .joboptions files) that should not be extracted by Tika. Their contents should not be sent to Solr. Is there any way to exclude certain (or all) PDF attachments in the Tika config?
@gagravarr, we changed that behavior via TIKA-2096 in Tika 1.15. The default is now "extract all embedded documents". To avoid parsing embedded documents, call:
parseContext.set(Parser.class, new EmptyParser());
Or subclass EmbeddedDocumentExtractor to do nothing and send that in via the ParseContext, roughly as sketched below.
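A minimal sketch of such a do-nothing extractor (the class name NoEmbeddedDocumentExtractor is just a placeholder of our own choosing):
import java.io.IOException;
import java.io.InputStream;
import org.apache.tika.extractor.EmbeddedDocumentExtractor;
import org.apache.tika.metadata.Metadata;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

public class NoEmbeddedDocumentExtractor implements EmbeddedDocumentExtractor {
    @Override
    public boolean shouldParseEmbedded(Metadata metadata) {
        return false; // never descend into attachments
    }
    @Override
    public void parseEmbedded(InputStream stream, ContentHandler handler,
                              Metadata metadata, boolean outputHtml)
            throws SAXException, IOException {
        // intentionally empty: embedded content is dropped
    }
}
Register it before parsing:
parseContext.set(EmbeddedDocumentExtractor.class, new NoEmbeddedDocumentExtractor());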
If you were using Solr DIH's TikaEntityProcessor, I'd set extractEmbedded to false, but you aren't; and please don't. :)
So, I don't think there's an easy way to turn off parsing of embedded documents only for PDFs, and I'm not sure you'd want to. What if there were an MSWord file attached to a PDF, for example?
If you want to ignore .joboptions, you could use a custom EmbeddedDocumentExtractor.
Implement a custom org.apache.tika.extractor.DocumentSelector and set it on the ParseContext. The DocumentSelector is called with the metadata of each embedded document to decide whether that document should be parsed.
Example DocumentSelector:
public class CustomDocumentSelector implements DocumentSelector {
    @Override
    public boolean select(Metadata metadata) {
        // Skip embedded documents whose resource name ends with ".joboptions"
        String resourceName = metadata.get(Metadata.RESOURCE_NAME_KEY);
        return resourceName == null || !resourceName.endsWith(".joboptions");
    }
}
Register it at the ParseContext:
parseContext.set(DocumentSelector.class, new CustomDocumentSelector());
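A rough end-to-end sketch of wiring the selector into a standalone Tika parse (the file name document.pdf and the unlimited write limit are just placeholders):
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import org.apache.tika.extractor.DocumentSelector;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

public class ParseWithoutJobOptions {
    public static void main(String[] args) throws Exception {
        AutoDetectParser parser = new AutoDetectParser();
        BodyContentHandler handler = new BodyContentHandler(-1); // -1 = no write limit
        Metadata metadata = new Metadata();
        ParseContext context = new ParseContext();
        context.set(DocumentSelector.class, new CustomDocumentSelector());

        try (InputStream stream = Files.newInputStream(Paths.get("document.pdf"))) {
            parser.parse(stream, handler, metadata, context);
        }
        // Extracted text, with .joboptions attachments skipped by the selector
        System.out.println(handler.toString());
    }
}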

Solr 7.1 ExtractingRequestHandler adds many extra metadata fields when extracting PDFs, which I don't want; this doesn't happen in Solr 6

1. When Solr 7 extracts a PDF, it adds many schema columns (PDF metadata) and extra PDF metadata to the document.
2. In Solr 6 this doesn't happen.
3. How can I turn it off?
My guess is that you're running with a "schemaless" update processor in 7, so that any unknown fields are added to the schema by the update processor. If you turn that off and use an explicit schema like you probably did in 6, you should see the old behavior again.
You might need to switch to the ClassicIndexSchemaFactory to get Solr to read your old schema.xml:
1. Rename the managed-schema file to schema.xml.
2. Modify solrconfig.xml to replace the schemaFactory class: remove any ManagedIndexSchemaFactory definition if it exists, and add a ClassicIndexSchemaFactory definition, as in the sketch below.
3. Reload the core(s).
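A sketch of the relevant solrconfig.xml change (everything else in the file stays as it is):
<!-- read the classic schema.xml instead of the mutable managed-schema -->
<schemaFactory class="ClassicIndexSchemaFactory"/>
Alternatively, field guessing can be switched off while keeping the managed schema by setting the update.autoCreateFields user property to false via the Config API, for example:
curl http://localhost:8983/solr/<core>/config -d '{"set-user-property": {"update.autoCreateFields":"false"}}'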

Indexing multiple documents and mapping them to a unique Solr id

My use case is to index two files, a metadata file and a binary PDF file, under a single unique Solr id. The metadata file is an XML file, and some schema fields are mapped to elements in that XML.
What I do: extract content from the PDF files (using pdftotext), process that content and retrieve specific information (for example, the PDF's first page/line has information about the medicine and the research stage). The retrieved information (medicine/research stage) needs to be indexed, and one should be able to search/sort/facet on it.
I can create an XML file with the retrieved information (let's call this the metadata file). Now, assuming my schema would be
<field name="medicine" type="text" stored="true" indexed="true"/>
<field name="researchStage". ../>
Is there a way to put this metadata file and the PDF file in Solr?
What I have tried:
Based on a suggestion in the archives, I zipped these files and gave the zip to the ExtractingRequestHandler. I was able to put all the content in Solr and make it searchable, but it appears as the content of the zip file. (I had to apply some patches to the Solr code base to make this work.) This is not sufficient, though, as the content in the metadata file is not mapped to field names.
curl "http://localhost:8983/solr/update/extract?literal.id=doc1&commit=true" -F "myfile=@file.zip"
I tried to work with the DataImportHandler (BinURLDataSource), but I don't think I understand how it works, so I could not get far.
I thought of adding metadata tags to the PDF itself. For this to work, the ExtractingRequestHandler would have to process this metadata. I am not sure about that either.
So I tried "pdftk" to add metadata, but I was not able to add custom tags with it; it only updates/adds title/author/keywords etc. Does anyone know of a similar Unix tool?
If someone has tips, please share.
I want to avoid creating one file (by merging the PDF text and the metadata file).
Given a file record1234.pdf and metadata like:
<metadata>
<field1>value1</field1>
<field2>value2</field2>
<field3>value3</field3>
</metadata>
Do the programmatic equivalent of
curl "http://localhost:8983/solr/update/extract?
literal.id=record1234.pdf
&literal.field1=value1
&literal.field2=value2
&literal.field3=value3
&captureAttr=true&defaultField=text&capture=div&fmap.div=foo_txt&boost.foo_txt=3&" -F "tutorial=#tutorial.pdf"
Adapted from http://wiki.apache.org/solr/ExtractingRequestHandler#Literals .
This will create a new entry in the index containing the text output from Tika/Solr Cell as well as the fields you specify.
You should be able to perform these operations in your favorite language.
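For instance, a rough SolrJ sketch of the same request (the core name mycore, the field names and the file name are placeholders, and older Solr versions use slightly different client class names):
import java.io.File;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.request.AbstractUpdateRequest;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;

public class IndexPdfWithLiterals {
    public static void main(String[] args) throws Exception {
        SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/mycore").build();

        ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/extract");
        req.addFile(new File("record1234.pdf"), "application/pdf");
        // literal.* parameters become fields on the document that Tika extracts
        req.setParam("literal.id", "record1234.pdf");
        req.setParam("literal.field1", "value1");
        req.setParam("literal.field2", "value2");
        req.setParam("literal.field3", "value3");
        req.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);

        solr.request(req);
        solr.close();
    }
}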
"the content in the metadata file is not mapped to field names"
If they don't map to a predefined field, then use dynamic fields. For example, you can set *_i to be an integer field, as in the sketch below.
"I want to avoid creating one file (by merging the PDF text and the metadata file)."
That looks like programmer fatigue :-) But do you have a good reason?
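A hypothetical dynamic-field declaration for schema.xml (the exact type name for integers depends on the schema version in use):
<dynamicField name="*_i" type="int" indexed="true" stored="true"/>
With that in place, a literal parameter such as literal.stage_i=2 lands in an integer field without stage_i having to be declared explicitly.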

Indexing file paths or URIs in Lucene

Some of the documents I store in Lucene have fields that contain file paths or URIs. I'd like users to be able to retrieve these documents if their query terms contain a path or URI segment.
For example, if the path is
C:\home\user\research\whitepapers\analysis\detail.txt
I'd like the user to be able to find it by querying for path:whitepapers.
Likewise, if the URI is
http://www.stackoverflow.com/questions/ask
A query containing uri:questions would retrieve it.
Do I need to use a special analyzer for these fields, or will StandardAnalyzer do the job? Will I need to do any pre-processing of these fields? (To replace the forward slashes or backslashes with spaces, for example?)
Suggestions welcome!
You can use StandardAnalyzer.
I tested this by adding the following test to Lucene's TestStandardAnalyzer.java:
public void testBackslashes() throws Exception {
    assertAnalyzesTo(a, "C:\\home\\user\\research\\whitepapers\\analysis\\detail.txt",
        new String[]{"c", "home", "user", "research", "whitepapers", "analysis", "detail.txt"});
    assertAnalyzesTo(a, "http://www.stackoverflow.com/questions/ask",
        new String[]{"http", "www.stackoverflow.com", "questions", "ask"});
}
This unit test passed using Lucene 2.9.1. You may want to try it with your specific Lucene distribution. I guess it does what you want, while keeping domain names and file names unbroken. Did I mention that I like unit tests?
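For a quick end-to-end check beyond the analyzer test, a sketch against the Lucene 2.9-era API used above (the field name path and the example path come from the question; newer Lucene versions have a different API):
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class PathSearchSketch {
    public static void main(String[] args) throws Exception {
        Directory dir = new RAMDirectory();
        StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_29);

        // Index a document whose 'path' field is analyzed, so each path segment becomes a term.
        IndexWriter writer = new IndexWriter(dir, analyzer, IndexWriter.MaxFieldLength.UNLIMITED);
        Document doc = new Document();
        doc.add(new Field("path",
                "C:\\home\\user\\research\\whitepapers\\analysis\\detail.txt",
                Field.Store.YES, Field.Index.ANALYZED));
        writer.addDocument(doc);
        writer.close();

        // A query for a single path segment should match the document.
        QueryParser parser = new QueryParser(Version.LUCENE_29, "path", analyzer);
        Query query = parser.parse("whitepapers");
        IndexSearcher searcher = new IndexSearcher(dir);
        TopDocs hits = searcher.search(query, 10);
        System.out.println("hits: " + hits.totalHits); // expected: 1
        searcher.close();
    }
}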