Is there an HTML analyzer/tokenizer for Lucene? - lucene

I want to index text from HTML in Lucene. What is the best way to achieve this?
Is there a good contrib module that can do this in Lucene?
EDIT
I finally ended up using the Jericho parser. It doesn't build a DOM and is easy to use.

I'm assuming that you don't actually want to index the HTML tags. If that's the case, you can first extract text from HTML using Apache Tika. Then you can index the text in Lucene.

I would recommend using the Jsoup HTML parser to extract the text and then indexing that text with Lucene. It worked well for me.

You might also want to take a look at /Lucene-3.0.3/src/demo which has an HTML parser example.
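
The extract-then-index approach suggested in the answers above can be sketched with the JDK alone. Note that this is not Jsoup, Tika, or Jericho: it swaps in the HTML parser bundled with Swing (javax.swing.text.html) purely so the example runs without extra jars, and the names HtmlTextExtractor and extractText are made up for illustration. The returned string is the kind of tag-free text you would then hand to Lucene for indexing.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: strips markup from an HTML string using the JDK's Swing parser.
final class HtmlTextExtractor {

    // Returns the text chunks found in the HTML, joined by single spaces, with all tags dropped.
    static String extractText(String html) {
        StringBuilder text = new StringBuilder();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                // Called once per run of character data between tags.
                if (text.length() > 0) {
                    text.append(' ');
                }
                text.append(data);
            }
        };
        try {
            // The third argument tells the parser to ignore any charset declared in the document.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        String html = "<html><body><p>Hello <b>world</b></p></body></html>";
        // The printed text is what would go into an indexed Lucene field.
        System.out.println(extractText(html));
    }
}
```

A dedicated library such as Jsoup or Tika will usually cope better with malformed real-world pages; the sketch only shows where the extraction step sits before indexing.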

Related

how to test search accuracy in solr

Hello, I am new to the Solr information retrieval system.
I want to add a text file to Solr and then search for a word from the file in order to check Solr's accuracy in other languages, but I am not sure how. I found that there is a UI for search, but I don't know how to use it. There is also the Data Import Handler, but its input must be XML, CSV or JSON, and I want to use a plain text file. Even if I use it, I don't know how to search for a word or sentence.
I would recommend a basic Apache Lucene/Solr course and a deep dive into the Solr wiki [1].
The getting started section, especially, should really help you.
Good luck.
[1] https://lucene.apache.org/solr/guide/7_0/solr-tutorial.html

How to import an external file of indexed documents in solr core

We are working as a team to create a Persian search engine.
I am doing the "indexing" part.
I worked with Solr and indexed some English documents to see if it works.
It worked, so now it is time for the Persian indexer. I tweaked the PersianAnalyzer code a little (extending the stop word set, for instance) and it can index the documents. Now I want to import the externally indexed Persian documents into the core to see the indexing process and run a query against them. How can I import these indexed documents into the core?
I am kind of in a hurry, so I will appreciate any help.
Thanks,
Mahshid
You have several options:
the quickest option in order to get content from a file would be to use Solr DataImportHandler;
another option would be to write a custom crawler/indexer but that would require time;
if you need a web-crawler instead then you can use Apache Nutch.

How do I use MoreLikeThis function on Solr to find similar documents to a text file?

I am trying to use Solr to do the following:
Read some text from a txt file, and use MoreLikeThis on the text to find documents similar to that text. How can I do this with Solr?
From what I know so far, I think I have to use a content stream, but I do not know how to configure it...
If you were forming a MoreLikeThisQuery from a document stored in the index, the query would be built from the TermVector info retrieved from the index.
Since you want to find documents similar to a text file you have, you have to iterate over the text file and form a BooleanQuery from its terms, matching them the way you want.
The above is true for Lucene, and I believe it's the same for Solr as well, considering that the MoreLikeThisQuery works based on TermVector info.
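
As a rough illustration of "iterate over the text file and form a query from its terms", here is a plain-Java sketch that counts the terms in a piece of text and ORs the most frequent ones into a Lucene/Solr query string. It is an approximation under stated assumptions, not what MoreLikeThis itself does: there is no analyzer and no term weighting, short tokens are dropped as a crude stand-in for stop-word removal, and the class name SimilarityQueryBuilder, the method buildQuery, and the field name "content" are all hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper: turns free text into an OR query over its most frequent terms.
final class SimilarityQueryBuilder {

    // Builds "field:term1 OR field:term2 ..." from the maxTerms most frequent terms in text.
    static String buildQuery(String field, String text, int maxTerms) {
        // LinkedHashMap keeps first-seen order, so ties are broken deterministically.
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (token.length() < 3) {
                continue; // skips empty tokens and very short words (assumed to be noise)
            }
            counts.merge(token, 1, Integer::sum);
        }

        // Sort by descending frequency; List.sort is stable, so equal counts keep first-seen order.
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort(Map.Entry.<String, Integer>comparingByValue().reversed());

        StringBuilder query = new StringBuilder();
        for (Map.Entry<String, Integer> entry : entries.subList(0, Math.min(maxTerms, entries.size()))) {
            if (query.length() > 0) {
                query.append(" OR ");
            }
            query.append(field).append(':').append(entry.getKey());
        }
        return query.toString();
    }

    public static void main(String[] args) {
        String text = "Lucene indexes text. Lucene searches text fast.";
        // prints: content:lucene OR content:text
        System.out.println(buildQuery("content", text, 2));
    }
}
```

In practice the text would come from the file (for example via Files.readString) and the resulting string would be sent as the q parameter of a normal Solr search. Tokens contain only letters and digits, so no query-syntax escaping is needed in this sketch.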

how to download page from source code

I need to download a page referenced in some HTML source code. For example:
<span id="businessNumOnMap" class="resultNumberOnMap" style="display:none;"></span><span>Cellini's Italian Restaurant
I want to download the "/len/aaproximat...php" page. I didn't find a suitable regex for it, and I need to download that page. Can anyone help?
I'm using VB.NET.
Parsing HTML with a regex is normally not recommended, except for a simple page whose format you know. The Html Agility Pack is often recommended for this purpose instead.
Be aware though, if you're parsing this from a page that's on the internet, the site in question might have T&Cs for the usage of their data that you might need to follow to stay legal.
Do you want to download the PHP file itself with all its code, and not only the HTML it outputs? If that is the case, it's not possible.
Use the WebClient.DownloadString method for downloading. If you haven't found a suitable expression to extract that "span" from the source, then build your own.

T-SQL search html with regex?

In my database I have a field which contains an HTML document. It must be possible to search in this document. However, the HTML tags themselves must not be matched. So when I have something like this:
<html>
<head>
<title>Bar</title>
</head>
<body>
<p>
this content may be found
</p>
</body>
</html>
It is possible that the document stored in the database is not XHTML. Can you tell me what the best way is to search in the content? Should I use regular expressions? If so, what would that look like? And if not, what should I use instead?
You could try turning on Full-Text Search or use something like Lucene.Net to index the content for you.
How many records are there? I expect you might have to use full-text search and an IFilter to do this efficiently. HTML does not lend itself well to regex: it can quickly become very hard to do something very simple.
If the volume isn't huge, can you iterate over the records with an external parsing application, using something like the HTML Agility Pack (for .NET) or any other DOM parser of your choice?
But the FTS/IFilter would be my first choice.