fetch specific title in every page with nutch and solr

I have Solr and Nutch installed. On my site every page has the same title, e.g. Bank Something, but every page also contains a tag with an ID of TITLE, something like:
<div ID="TITLE"><h1>my page specific title</h1></div>
I want to add another field to Solr, something like a second title, that holds my page-specific title so that I can search for words in it. (Right now the page-specific title ends up in the content field, and I want it in a separate field.)
How can I do this?

Check Nutch plugins, which should allow you to extract an element from a web page.
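
To make the extraction step concrete, here is a minimal Python sketch of the logic only. It is not a Nutch plugin (those are written in Java), and the function name extract_page_title is hypothetical. It pulls the text out of the element whose id is TITLE, which is the value you would then store in the extra Solr field.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not change the nesting depth.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class TitleDivParser(HTMLParser):
    """Collects the text inside the element whose id attribute is "TITLE"."""

    def __init__(self):
        super().__init__()
        self.depth = 0   # 0 means we are outside the target element
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth += 1
        elif dict(attrs).get("id") == "TITLE":
            # HTMLParser lowercases attribute names, so ID="TITLE" arrives as id.
            self.depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.parts.append(data)


def extract_page_title(html):
    """Return the whitespace-normalized text of the TITLE element ("" if absent)."""
    parser = TitleDivParser()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.parts).split())


page = (
    "<html><head><title>Bank Something</title></head><body>"
    '<div ID="TITLE"><h1>my page specific title</h1></div>'
    "<p>other content</p></body></html>"
)
print(extract_page_title(page))
```

The sketch assumes reasonably well-nested HTML inside the TITLE element; a real plugin would work on the DOM that Nutch has already parsed.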

Related

How to follow lazy loading with scrapy?

I am trying to crawl a page that is using lazy loading to get the next set of items. My crawler follows normal links, but this one seems to be different:
The page:
https://www.omegawatches.com/de/vintage-watches
is followed by https://www.omegawatches.com/de/vintage-watches?p=2
But only if you load it within the browser. Scrapy will not follow the link.
Is there a way to make Scrapy follow pages 1, 2, 3, 4 automatically?

The page uses virtual scrolling, and the API through which it gets its data is
https://www.omegawatches.com/de/vintage-watches?p=1&ajax=1
It returns JSON data containing various details, including the products in HTML format and whether a next page exists, indicated by an a tag with the class link next.
Increase the page number until there is no a tag with the link next class.
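
A rough sketch of that loop in plain Python, under a few assumptions: the layout of the JSON is not known here, so the code simply scans every string value in it for the a tag, and fetch is a hypothetical callable that returns the response body for a URL (in a Scrapy spider you would instead yield a new request for the next page from the callback).

```python
import json
from html.parser import HTMLParser

BASE_URL = "https://www.omegawatches.com/de/vintage-watches"


class NextLinkFinder(HTMLParser):
    """Sets found=True when an <a> tag carries both classes "link" and "next"."""

    def __init__(self):
        super().__init__()
        self.found = False

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "a" and "link" in classes and "next" in classes:
            self.found = True


def has_next_link(html):
    finder = NextLinkFinder()
    finder.feed(html)
    return finder.found


def html_fragments(value):
    """Yield every string found anywhere in the decoded JSON value."""
    if isinstance(value, str):
        yield value
    elif isinstance(value, dict):
        for item in value.values():
            yield from html_fragments(item)
    elif isinstance(value, list):
        for item in value:
            yield from html_fragments(item)


def crawl_pages(fetch, max_pages=1000):
    """Yield (page_number, decoded_json) until no "link next" tag is present.

    fetch is a hypothetical callable: fetch(url) -> response body as text.
    max_pages is a safety limit so a misbehaving site cannot loop forever.
    """
    page = 1
    while page <= max_pages:
        data = json.loads(fetch(f"{BASE_URL}?p={page}&ajax=1"))
        yield page, data
        if not any(has_next_link(text) for text in html_fragments(data)):
            break
        page += 1
```

Passing the fetch function in keeps the paging logic testable without network access; any HTTP client can be plugged in.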

Sitefinity: How to set page title based on the content item being viewed?

I have a page in Sitefinity 7 and its entire purpose is to show the detailed view of a custom content item.
So I've dragged a widget onto the page and set it to show one particular item only, without selecting which one, because it should be whichever item was chosen on the other page that led the user here.
One not-selected content item.
The page with the list control navigates to the detailed page.
But the page title is the same no matter which item is selected. I want the page title to be one of the content item's fields. How can I make the page title depend on the item being viewed?

Edit the Awards widget on your details page and go to the Advanced settings. In the MetaTitle field, enter the name of the field in your module that you'd like to use, in most cases "Title". Then, in the PageTitleMode field, enter one of the available options: Append, Replace or DoNotSet. Documentation on those options is here. You can also use the MetaKeywordsField and MetaDescriptionField by mapping them to, for instance, a new "SEO Keywords" or "SEO Description" long text field on your module. Documentation on that is here.
The screenshot is from Sitefinity 6.3 but it should be the same.

How to control the way Nutch parses and Solr indexes a URL when its HTML structure is unknown?

I am trying to crawl some sites which have a poorly maintained HTML structure, and I have no control over it. When I look at the Nutch-crawled data indexed by Solr, the 'title' field looks okay, whereas the 'content' field includes a lot of junk: it grabs all the text from the HTML banner with its drop-down menu and works its way down through the left-side menu, navigation, footer, etc.
In my case, I am only interested in getting the "Description:" information, which is defined in a paragraph on the HTML page, into the 'content' field.
Example: (raw html):
<p><strong>Description:</strong> Apache Nutch is an open source Web crawler written in Java. By using it, we can find Web page hyperlinks in an automated manner, reduce lots of maintenance work, for example checking broken links, and create a copy of all the visited pages for searching over.
How can I filter the junk out of the 'content' field and only have the information I am interested in?

You can use the plugin below to extract content based on XPath queries.
If your content is in a specific div, you can use this plugin to extract the content you want from that specific section.
Filter xpath

Get page numbers of searchresult of a pdf in solr

I'm building a web application where users can search for PDF documents and view them with pdf.js. I would like to display the search results with a short snippet of the paragraph where the search term was found and a link to open the document at the right page.
So what I need is the page number and a short text snippet for every search result.
I'm using Solr 4.1 to index PDF documents. The indexing itself works fine, but I don't know how to get the page number and paragraph of a search result.
I found "Indexing PDF with page numbers with Solr" here, but it wasn't really helpful.

I'm now splitting the PDF and sending each page separately to SOLR.
So every page is its own document with an id <id_of_document>_<page_number> and an additional field doc_id, which contains only the <id_of_document>, for grouping the results.
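
As a minimal sketch of this layout in Python: only id and doc_id come from the description above, while the field names page and text, both function names, and the choice of query parameters are assumptions. The parameters used (group, group.field, hl, hl.fl, df) are standard Solr request parameters for result grouping and highlighting.

```python
def build_page_docs(doc_id, page_texts):
    """Build one Solr document per PDF page.

    page_texts is a list of already extracted page texts, in page order.
    The "page" and "text" field names are assumptions for this sketch.
    """
    return [
        {"id": f"{doc_id}_{number}", "doc_id": doc_id, "page": number, "text": text}
        for number, text in enumerate(page_texts, start=1)
    ]


def grouped_query_params(term):
    """Query parameters that group page hits by PDF and request snippets."""
    return {
        "q": term,
        "df": "text",             # default search field (assumed name)
        "group": "true",
        "group.field": "doc_id",  # one group per original PDF
        "hl": "true",
        "hl.fl": "text",          # snippets come from the page text
    }


docs = build_page_docs("report42", ["first page text", "second page text"])
print(docs[1]["id"])
```

Each hit then carries its page number directly, so the result link can point at the right page of the PDF.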

There is a JIRA issue, SOLR-380, with a patch that you can check.

I also tried getting the results with page number but could not do it. I used Apache PDFBox for splitting all the PDFs present in a directory and sending the files to Solr server.

I have not tried it myself.
Approach:
Write a custom Solr connector that integrates with the Apache Tika parser for indexing PDFs.
Create multiple attributes in Solr such as page1, page2, page3, …, pageN. Alternatively, use dynamic attributes in Solr.
In the custom connector, read the PDFs page by page and index each page into its respective page attribute/dynamic attribute.
Enable search on all the "page" attributes.
When a user searches, use the "highlighter/Summary/Teaser" component to retrieve only the "page" attributes that have hits.
The "page" attributes that have a hit (found via highlighter/Summary/Teaser) for a given record are the pages that contain the searched phrase.
Link to the PDF with the "#PageNumber" of the PDF and pop up the page on click.
In my view this is a far better approach than splitting the PDFs and indexing the pages as separate Solr docs.
If you find a flaw in this design, respond to my thread. I will attempt to resolve it.
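
The lookup step of this approach could be sketched in Python as follows. The assumptions: the page attributes are dynamic fields named page_<N> (the naming is hypothetical), the input is the highlighting section of a Solr JSON response (document id mapped to field name mapped to a list of snippets), and the viewer accepts a #page=N fragment, as PDF viewers commonly do.

```python
import re

# Hypothetical naming scheme for the per-page dynamic fields: page_1, page_2, ...
PAGE_FIELD = re.compile(r"^page_(\d+)$")


def pages_with_hits(highlighting):
    """Map each document id to the sorted page numbers that have snippets."""
    hits = {}
    for doc_id, fields in highlighting.items():
        pages = []
        for name, snippets in fields.items():
            match = PAGE_FIELD.match(name)
            if match and snippets:
                pages.append(int(match.group(1)))
        if pages:
            hits[doc_id] = sorted(pages)
    return hits


def page_link(pdf_url, page):
    """Build a link that opens the PDF at the given page (assumed fragment syntax)."""
    return f"{pdf_url}#page={page}"


highlighting = {
    "doc1": {"page_12": ["... <em>term</em> ..."], "page_3": ["<em>term</em>"], "title": ["x"]},
    "doc2": {"page_1": []},
}
print(pages_with_hits(highlighting))
```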

TYPO3 Indexed Search Engine - not all the page content is indexed

I have a TYPO3 site with Indexed Search Engine extension...
The problem is that not all the content is indexed. With the debug option activated in the configuration, the content shown is not the whole page, although the page size is correct: only the first part of the page is there (which is the head/title and the beginning of the menu...).
So for every page the indexed words come only from the beginning of the page (title, menus).
I have tried using the Indexed Search Engine begin and end tags, but with no effect...
What did I do wrong?

I am stupid :-( I have figured out the problem...
The begin/end tags were not correct. I used <!--TYPO3SEARCH_begin> <!--TYPO3SEARCH_end>
instead of <!--TYPO3SEARCH_begin--> <!--TYPO3SEARCH_end-->
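
For anyone hitting the same mistake, a small Python check can flag markers that are not closed with -->. This is only a sketch; the function name find_malformed_markers is made up for illustration, and it checks the exact marker spelling shown above.

```python
import re

# Matches a TYPO3SEARCH marker that is NOT immediately closed by "-->".
MALFORMED_MARKER = re.compile(r"<!--TYPO3SEARCH_(?:begin|end)(?!-->)")


def find_malformed_markers(html):
    """Return the character offsets of begin/end markers with a broken comment close."""
    return [match.start() for match in MALFORMED_MARKER.finditer(html)]


broken = "<!--TYPO3SEARCH_begin><p>content</p><!--TYPO3SEARCH_end>"
correct = "<!--TYPO3SEARCH_begin--><p>content</p><!--TYPO3SEARCH_end-->"
print(find_malformed_markers(broken), find_malformed_markers(correct))
```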
