I have the Lucene search extension (http://www.mediawiki.org/wiki/Extension_talk:Lucene-search) integrated with my MediaWiki installation. It's all working really well; however, Lucene seems to have indexed all the MediaWiki/HTML markup as well, and it is showing up in the results.
For example, searching for "green" will return results containing markup such as style="background:green; color:white".
Is there a way to strip the search results of all the markup? I believe Wikipedia uses the same search plugin; how are they doing it?
You will probably have to transform the raw wiki markup before indexing it with Lucene. When dealing with pure XML content, it's possible to just use an XSL transform with <xsl:value-of select="text()"/> to extract the text content.
I'm afraid that won't work for wiki markup, but maybe you can capture the page post-HTML transformation?
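If you can get at the raw wikitext before indexing, one option is to strip it down to plain text first. Here is a minimal sketch using the third-party mwparserfromhell library (my choice for illustration; the Lucene extension does not ship with it):

    import mwparserfromhell

    def wikitext_to_plain(wikitext):
        # Parse the wikitext and strip templates, link syntax, and tags,
        # keeping only the human-readable text.
        return mwparserfromhell.parse(wikitext).strip_code()

    raw = "'''Green''' is a [[Color|color]] used in {{some template}}."
    print(wikitext_to_plain(raw))  # prints roughly: Green is a color used in .

The stripped text is what you would feed to the indexer, so the snippets come back markup-free.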
I found a solution to part of the problem. The following change will remove HTML markup from the search results. I have not been able to remove Wikitext markup yet. Any tips on that would be appreciated. Note that I do not use the Lucene search extension.
Open /includes/search/SearchEngine.php
In that file, a class named SearchResult is defined.
SearchResult::getTextSnippet() contains the code that formats the search results.
SearchResult::$mText contains the text blurb from the search results.
To fix the problem, go into SearchEngine.php, find the method getTextSnippet(), and add the following line before the "if":
$this->mText = strip_tags( $this->mText );
I found this solution on this random Wiki: http://www.myrandomwiki.com/wiki/MediaWiki_Notes#Strip_HTML_From_Search_Results
Related
I'm currently using the Wikipedia API to get content for my website. At the moment the content comes back as HTML or wikitext (both containing Wikipedia hyperlinks and a lot of junk). Is there a way around this to get just plain text without all that junk?
I have tried fetching the HTML and converting it to plain text, but that still contains all of the wiki junk. I want to create a universal method that can remove all the junk, since I want to be able to call multiple different Wikipedia pages and get plain text for all of them.
HTML:
https://en.wikipedia.org/w/api.php?action=parse&format=json&page=Game_of_Thrones_(season_3)&prop=text&section=1&disabletoc=1
Wikitext:
https://en.wikipedia.org/w/api.php?action=parse&format=json&page=Game_of_Thrones_(season_3)&prop=wikitext&section=1&disabletoc=1
I hope this makes sense; any advice/guidance is greatly appreciated.
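One option worth knowing about: the same api.php endpoint can return plain text directly through the TextExtracts module (prop=extracts with explaintext), which avoids stripping the HTML or wikitext yourself. A minimal sketch, using the page title from the URLs above:

    import requests

    resp = requests.get(
        "https://en.wikipedia.org/w/api.php",
        params={
            "action": "query",
            "format": "json",
            "prop": "extracts",
            "explaintext": 1,  # plain text instead of HTML
            "titles": "Game of Thrones (season 3)",
        },
    )
    pages = resp.json()["query"]["pages"]
    for page in pages.values():
        print(page["extract"])

Extracts are a cleaned prose rendering (tables, infoboxes, and reference markers are dropped), which is usually exactly the junk-free output wanted here; add exintro=1 if you only need the lead section.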
I have added this at the bottom of my HTML (just like how Stack Overflow has it implemented):
<noscript>This site works best with JavaScript enabled</noscript>
but on one of my pages that has very little text, that <noscript> text appears in the Google search result.
Is there a way to tell Google to avoid indexing this part? Or is there a better alternative to using the <noscript> tag?
The issue is that Google often won't render JavaScript. It can, but it often won't.
You either need to present a pre-rendered page or provide a meta description that accurately describes the content. Look up meta description tags and how Google uses them to embellish its search listings.
Other options can discourage Google from deviating from the provided description. However, a pre-rendered page for it to scrape is always more reliable.
I'm building a web application where users can search for PDF documents and view them with pdf.js. I would like to display the search results with a short snippet of the paragraph where the search term was found and a link to open the document at the right page.
So what I need is the page number and a short text snippet of every search result.
I'm using Solr 4.1 to index the PDF documents. The indexing itself works fine, but I don't know how to get the page number and paragraph of a search result.
I found this: "Indexing PDF with page numbers with Solr", but it wasn't really helpful.
I'm now splitting the PDF and sending each page separately to Solr.
So every page is its own document, with an id of <id_of_document>_<page_number> and an additional field doc_id that contains only the <id_of_document>, for grouping the results.
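For anyone wanting a concrete version of this per-page scheme, here is a minimal sketch using pypdf and Solr's JSON update endpoint; the core name, field names, and file name are placeholders that just echo the scheme above:

    import requests
    from pypdf import PdfReader

    SOLR_UPDATE = "http://localhost:8983/solr/documents/update"  # placeholder core
    doc_id = "mydoc"  # placeholder for <id_of_document>

    reader = PdfReader("mydoc.pdf")
    docs = []
    for page_number, page in enumerate(reader.pages, start=1):
        docs.append({
            "id": f"{doc_id}_{page_number}",  # <id_of_document>_<page_number>
            "doc_id": doc_id,                 # shared field used for grouping
            "page_number": page_number,
            "content": page.extract_text() or "",
        })

    # Send all page documents in one request and commit.
    requests.post(SOLR_UPDATE, params={"commit": "true"}, json=docs)

At query time, Solr's result grouping (group=true&group.field=doc_id) collapses the pages back into one hit per document, and the page number can be read straight off each grouped child.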
There is a JIRA issue, SOLR-380, with a patch that you can check out.
I also tried getting the results with page numbers but could not manage it. I used Apache PDFBox to split all the PDFs in a directory and send the files to the Solr server.
I have not tried it myself, but here is an approach:
1. Build a custom Solr connector that integrates with the Apache Tika parser for indexing PDFs.
2. Create multiple attributes in Solr like page1, page2, page3, ..., pageN; alternatively, use dynamic attributes in Solr.
3. In the custom connector, read the PDFs page by page and index each page onto the respective page attribute/dynamic attribute.
4. Enable search on all the "page" attributes.
5. When the user searches, use the "highlighter/Summary/Teaser" component to retrieve only the "page" attributes that have hits (see the sketch after this list).
6. The "page" attributes that have a hit (found via the highlighter/Summary/Teaser) for a given record are the pages that contain the searched phrase.
7. Link to the PDF with the "#PageNumber" of the PDF and pop up the page on click.
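A rough sketch of steps 5 and 6 at query time. It assumes a dynamic field pattern page_* plus a catch-all copyField named all_pages to search against (both are assumptions, not stock Solr fields); the field names returned in the highlighting section reveal which pages matched:

    import requests

    SOLR_SELECT = "http://localhost:8983/solr/pdfs/select"  # placeholder core

    resp = requests.get(SOLR_SELECT, params={
        "q": 'all_pages:"winter is coming"',  # catch-all copyField (assumption)
        "wt": "json",
        "hl": "true",
        "hl.fl": "page_*",  # highlight across all page attributes
    })
    for doc_id, fields in resp.json()["highlighting"].items():
        for field, snippets in fields.items():
            page_number = field.replace("page_", "")  # e.g. "page_12" -> "12"
            print(doc_id, "matches on page", page_number, "->", snippets[0])

Each snippet doubles as the teaser text, and the recovered page number is what you append as "#PageNumber" when linking to the PDF.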
This is a far better approach than splitting the PDFs and indexing them as separate Solr docs.
If you find a flaw in this design, respond to my thread. I will attempt to resolve it.
I know how to make an API call to get the text of the whole page, like this, but is there a way (without having to parse through the wiki markup) to get only the text from a certain section?
If you look at the documentation for the revisions module, you'll notice that it has a parameter, rvsection, which is exactly what you want. So, for example, to retrieve the lead section, use:
http://en.wikipedia.org/w/api.php?format=xml&action=query&titles=Stack%20Overflow&prop=revisions&rvprop=content&rvsection=0
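For completeness, the same call in a few lines of Python, switched to format=json for easier handling (the title is the one from the example URL):

    import requests

    resp = requests.get(
        "https://en.wikipedia.org/w/api.php",
        params={
            "format": "json",
            "action": "query",
            "titles": "Stack Overflow",
            "prop": "revisions",
            "rvprop": "content",
            "rvsection": 0,  # section 0 is the lead
        },
    )
    pages = resp.json()["query"]["pages"]
    for page in pages.values():
        # Older response shapes keep the wikitext under the "*" key;
        # newer MediaWiki versions expect an rvslots parameter instead.
        print(page["revisions"][0]["*"])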
Is there any way to exclude the Sitefinity main template I use for all the pages from searching?
Right now when I search, the results include pages matched on words in the template menu, even though those words don't belong to the page itself.
I need search to exclude that template's contents.
Thanks in advance.
The problem here is that I have added a menu inside a content block widget in a template.
This template is used throughout the site, and when I search for a keyword using the search feature, all the pages of the website are listed in the search results because the keyword is also found in the menu. So I need a solution so that the menu content is not included in the search results.
This is a very high priority; please help me find a solution as soon as possible.
Ivan Pelovski recently published a blog post on how you can hide content from the search engine by using custom layout controls. Not specifically what you are asking, but maybe it can help.
Here: http://www.sitefinity.com/blogs/ivanpelovski/posts/12-02-06/hiding_page_content_from_the_search_engine_in_sitefinity_using_layout_widgets.aspx
Try adding a robots meta tag like this at the top of the template:
<meta name="robots" content="noindex" />
In more recent versions of Sitefinity you can also uncheck a box at each page level that will prevent the page from being indexed. The database column for this setting is crawlable in the sf_page_data table, in case you want to write a SQL script to update several pages at once.
The exclusion of templates from search is mentioned in more detail here:
http://www.sitefinity.com/devnet/forums/sitefinity-4-x/general-discussions/exclude-page-from-search-index.aspx
Note that this will probably also prevent other search engines (such as Google) from indexing that page.