How to test search accuracy in Apache Solr

Hello, I am new to the Solr information retrieval system,
and I want to add a text file to Solr and then search for a word from the file, in order to see Solr's accuracy in other languages, but I am not sure how. I found that there is a UI for search, but I don't know how to use it either. There is also the data import handler, but it expects XML, CSV or JSON, and I want a plain text file; even if I did use it, I wouldn't know how to search for a word or sentence.

I would recommend a basic Apache Lucene/Solr course and a deep dive into the Solr wiki[1].
The getting-started tutorial especially should really help you.
Good luck!
[1] https://lucene.apache.org/solr/guide/7_0/solr-tutorial.html
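For a concrete starting point, here is a minimal SolrJ sketch of the whole loop: index the contents of a text file into a core, then search for a word from it. The core name ("mycore"), the field name ("content_txt") and the file path are illustrative assumptions; the default managed schema of a stock install maps "*_txt" dynamic fields to a text type.

    import java.nio.file.Files;
    import java.nio.file.Paths;

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.common.SolrInputDocument;

    public class IndexAndSearch {
        public static void main(String[] args) throws Exception {
            // Assumes a core called "mycore" on a default local Solr install
            SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/mycore").build();

            // Index the raw text of the file into a single text field
            String text = new String(Files.readAllBytes(Paths.get("sample.txt")), "UTF-8");
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "doc1");
            doc.addField("content_txt", text);
            solr.add(doc);
            solr.commit();

            // Search for a word taken from the file
            QueryResponse response = solr.query(new SolrQuery("content_txt:someword"));
            System.out.println("Hits: " + response.getResults().getNumFound());
            solr.close();
        }
    }

A similar round trip should also work without code: the bin/post tool that ships with Solr can push files into a core, and the Query tab of the core in the admin UI runs searches such as content_txt:someword.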

Related

How to import an external file of indexed documents into a Solr core

We are working as a team to create a Persian search engine.
I am doing the "indexing" part.
I worked with Solr and indexed some English documents to see if it works.
It worked! So it's time for the Persian indexer. I modified the PersianAnalyzer code a little (extending the stop-word set, for instance) and it can index the documents. Now I want to import the externally indexed Persian documents into the core, to inspect the indexing process and run a query against them. How can I do that?
I am in a bit of a hurry, so I will appreciate any help.
Thanks,
Mahshid
You have several options:
the quickest option to get content from a file would be the Solr DataImportHandler (a rough configuration sketch follows this list);
another option would be to write a custom crawler/indexer, but that would take time;
if you need a web crawler instead, you can use Apache Nutch.
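To illustrate the first option, a DataImportHandler data-config for plain-text files might look roughly like this; the base directory, file pattern and field name are placeholders, and the exact attributes should be checked against the DataImportHandler documentation for your Solr version:

    <dataConfig>
      <dataSource type="FileDataSource" encoding="UTF-8"/>
      <document>
        <entity name="files" processor="FileListEntityProcessor"
                baseDir="/path/to/persian/docs" fileName=".*\.txt"
                rootEntity="false" dataSource="null">
          <!-- PlainTextEntityProcessor reads each whole file into a "plainText" column -->
          <entity name="file" processor="PlainTextEntityProcessor"
                  url="${files.fileAbsolutePath}">
            <field column="plainText" name="content_txt"/>
          </entity>
        </entity>
      </document>
    </dataConfig>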

How to develop a simple search engine for full text search in local files

Could someone tell me where I should start to develop a simple full-text search engine for local files?
I have a Debian 7 server with LAMP, and I have mounted a Windows network drive on it. So far I am using this script to show the other local network users the directory tree where they can download files from the mounted network drive.
But I have to build a simple search engine which could index the names and the content (if any) of local files in the mounted folder: Microsoft doc, docx, xls, xlsx, rtf, txt. The search has to return the name of the file and its path, and ideally also a part of the text where the search word(s) are present (if the file has text).
Could someone point me in the right direction as to what I have to read and learn to do this? Thanks.
You need a couple of tools for this. You need something to index and search content, and you've tagged the question with three good tools for that task: lucene, solr, and elasticsearch. Each of them is rich with tutorials and examples to help you get started.
The other thing you will need is a way to read the content from all those different file types. I'd recommend Apache Tika. It's an excellent toolkit for this; it reads all the formats you've listed and works well with Lucene.
You can see an example of their use together in this question: Tika in Action book examples Lucene StandardAnalyzer does not work
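If it helps to see the two together, here is a minimal illustrative sketch: Tika turns each file into plain text, and Lucene indexes the name, path and content. The directory paths and field names are assumptions, not anything your setup requires:

    import java.io.IOException;
    import java.nio.file.*;
    import java.nio.file.attribute.BasicFileAttributes;

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.*;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.tika.Tika;

    public class LocalFileIndexer {
        public static void main(String[] args) throws Exception {
            Tika tika = new Tika();  // auto-detects doc, docx, xls, xlsx, rtf, txt ...
            IndexWriter writer = new IndexWriter(
                    FSDirectory.open(Paths.get("/tmp/index")),
                    new IndexWriterConfig(new StandardAnalyzer()));

            // Walk the mounted share and index file name, path and extracted text
            Files.walkFileTree(Paths.get("/mnt/share"), new SimpleFileVisitor<Path>() {
                @Override
                public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) throws IOException {
                    try {
                        Document doc = new Document();
                        doc.add(new StringField("path", file.toString(), Field.Store.YES));
                        doc.add(new TextField("name", file.getFileName().toString(), Field.Store.YES));
                        doc.add(new TextField("content", tika.parseToString(file.toFile()), Field.Store.YES));
                        writer.addDocument(doc);
                    } catch (Exception e) {
                        // skip files Tika cannot parse
                    }
                    return FileVisitResult.CONTINUE;
                }
            });
            writer.close();
        }
    }

Storing the content field makes it possible to show a snippet around the matched words later; Lucene's highlighter module can do that.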
You may find this helpful, you may not.
I have Solr and Nutch set up to index my local filesystem and store the results in Solr, and I have guides on how I set them up that way.
This would provide a solid backend for your application.
Here are the links; the first two cover the Solr setup, the last two the Nutch integration:
http://amac4.blogspot.co.uk/2013/07/setting-up-solr-with-apache-tomcat-be.html
http://amac4.blogspot.co.uk/2013/07/setting-up-tika-extracting-request.html
http://amac4.blogspot.co.uk/2013/07/configuring-nutch-to-crawl-urls.html
http://amac4.blogspot.co.uk/2013/07/setting-up-nutch-to-crawl-filesystem.html

How do I use the MoreLikeThis function in Solr to find documents similar to a text file?

I am trying to use Solr to do the following:
read some text from a .txt file, and use MoreLikeThis on that text to find documents similar to it. How can I do this with Solr?
From what I know so far, I think I have to use a content stream, but I do not know how to configure it...
If you were forming a MoreLikeThisQuery from a document stored in the index, it would build the query by retrieving the TermVector info from the index.
Since you want to find documents similar to a text file you have, you need to iterate over the text file and form a BooleanQuery from its terms, matching however you want.
The above is true for Lucene, and I believe it is the same for Solr as well, considering that MoreLikeThisQuery works from TermVector info.
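In Lucene code, the MoreLikeThis helper can in fact build that query from a Reader over your text file directly, so you don't have to assemble the BooleanQuery by hand. A minimal sketch, assuming an existing index whose documents have a "content" field (the field names, paths and thresholds are illustrative):

    import java.io.FileReader;
    import java.nio.file.Paths;

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.queries.mlt.MoreLikeThis;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.store.FSDirectory;

    public class MltFromFile {
        public static void main(String[] args) throws Exception {
            DirectoryReader reader = DirectoryReader.open(FSDirectory.open(Paths.get("/tmp/index")));
            IndexSearcher searcher = new IndexSearcher(reader);

            MoreLikeThis mlt = new MoreLikeThis(reader);
            mlt.setAnalyzer(new StandardAnalyzer());
            mlt.setFieldNames(new String[] { "content" });
            mlt.setMinTermFreq(1);  // relaxed so a short text file still yields terms
            mlt.setMinDocFreq(1);

            // Build the similarity query straight from the text file
            Query query = mlt.like("content", new FileReader("input.txt"));
            for (ScoreDoc hit : searcher.search(query, 10).scoreDocs) {
                System.out.println(searcher.doc(hit.doc).get("path"));
            }
            reader.close();
        }
    }

On the Solr side, the equivalent is the MoreLikeThis handler fed with a content stream, which is roughly what the stream.body parameter is for (it may need to be enabled in newer Solr versions).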

How to detect image in a document

How can I detect images in a document, say doc, xls, ppt or pdf?
I came across Apache Tika and am trying its command-line option.
http://tika.apache.org/1.2/gettingstarted.html
But I am not quite sure how it will detect images.
Any help is appreciated.
Thanks
You've said you want a command-line solution and don't want to write any Java code, so it's not going to be the prettiest way to do it... If you are happy to write a little bit of Java and create a new program to call from Python, you can do it much more nicely!
The first thing to do is to have the Tika app extract any embedded resources within your file. Use the --extract option for this, and have the extraction happen in a special temp directory your app controls, e.g.
$ java -jar tika.jar --extract ../testWORD_embedded_pdf.doc
Extracting 'image1.emf' (application/x-emf)
Extracting '_1402837031.pdf' (application/pdf)
Grab the output of the extraction if you can, and parse it looking for images (but be aware that some images have an application/ prefix on their canonical mimetype!). You might need to run a second --detect step on a few, I'm not sure; test how the parsers get on with the extraction.
Now, if there were images, they'll be in your temp dir. Process them as you want. Finally, zap the temp dir when you're done with the file!
Having used Tika in the past, I originally answered that I couldn't see how Tika could help with images embedded within Office documents or PDFs, but I was wrong. You may still want to fall back to native APIs like Apache POI and Apache PDFBox; Tika uses both libraries to parse text and metadata, but it had no embedded-image support at the time.
Using Tika makes these APIs automatically available (a side effect of using Tika).
UPDATE:
Since Tika 0.8, look for EmbeddedResourceHandler and the related examples; thanks to Gagravarr.
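For reference, a minimal sketch of that programmatic route: a ParserContainerExtractor walks the container file and hands every embedded resource to an EmbeddedResourceHandler, where you can filter on the detected media type. The input file name is a placeholder:

    import java.io.File;
    import java.io.InputStream;

    import org.apache.tika.extractor.ContainerExtractor;
    import org.apache.tika.extractor.EmbeddedResourceHandler;
    import org.apache.tika.extractor.ParserContainerExtractor;
    import org.apache.tika.io.TikaInputStream;
    import org.apache.tika.mime.MediaType;

    public class EmbeddedImageLister {
        public static void main(String[] args) throws Exception {
            ContainerExtractor extractor = new ParserContainerExtractor();
            try (TikaInputStream stream = TikaInputStream.get(new File("testWORD_embedded_pdf.doc"))) {
                extractor.extract(stream, null, new EmbeddedResourceHandler() {
                    public void handle(String filename, MediaType mediaType, InputStream contents) {
                        // Report only resources whose detected type is an image
                        if (mediaType != null && "image".equals(mediaType.getType())) {
                            System.out.println("Embedded image: " + filename + " (" + mediaType + ")");
                        }
                    }
                });
            }
        }
    }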

Suggestions on extracting text from uploaded documents

I currently have a number of documents uploaded to my website on a daily basis (.doc, .docx, .odt, .pdf), and these docs are stored in a SQL database (mediumblob).
Currently I open the docs from the database and cut and paste a text version into a field in the database for a quick reference and search function.
I'm looking to automate this "cut & paste" process. Formatting isn't a real concern, as long as I can extract the text, and I was hoping some people might be able to suggest a good route to go down.
I've tried manipulating the content of the blob field with regex, but it is not really working.
I've been looking at Apache POI with a view to extracting the text at the point of upload, but I can't help thinking that it may be a bit of an overkill given my relatively simple needs.
Given the various document formats I encounter, and that the content is currently stored in a blob field, would Apache POI be the best solution to use in this instance, or can anybody suggest an alternative?
Help and suggestions greatly appreciated.
Chris
Apache POI will only work for the Microsoft Office formats (.xls, .docx, .msg etc). For these formats, it provides classes for working with the files (reading always, with write support for many too), as well as text extractors.
For a general text extraction framework, you should look at Apache Tika. Tika uses POI internally to handle the Microsoft formats, and uses a number of other libraries to handle different formats. Tika will, for example, handle both PDF and ODF/ODT, which are the other two file formats you mentioned in the question.
There are some quick-start tutorials and examples on the Apache Tika website; I'd suggest you have a look through them. It's very quick to get started with, and you should be able to easily change your code to send each document through Tika during upload to get a plain-text version, or even XHTML if that's more helpful to you.
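As an illustrative sketch (not your exact setup), the upload-time extraction can be a thin wrapper around the Tika facade; the byte array stands in for the content of your mediumblob column:

    import java.io.ByteArrayInputStream;

    import org.apache.tika.Tika;

    public class UploadTextExtractor {
        // blobBytes would be the uploaded file's bytes, e.g. the mediumblob content
        public static String extractText(byte[] blobBytes) throws Exception {
            Tika tika = new Tika();   // auto-detects .doc, .docx, .odt, .pdf ...
            return tika.parseToString(new ByteArrayInputStream(blobBytes));
        }
    }

If you would rather have XHTML than plain text, Tika's AutoDetectParser combined with its ToXMLContentHandler produces that instead.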