How to control the way Nutch parses and Solr indexes a URL when its HTML structure is unknown?

I am trying to crawl some sites that have a poorly maintained HTML structure, and I have no control over them to change it. When I look at the Nutch-crawled data indexed by Solr, the 'title' field looks okay, whereas the 'content' field includes a lot of junk: it grabbed all the text from the HTML banner with its drop-down menu and worked its way down through the left-side menu, navigation, footer, etc.
In my case, I am interested in grabbing just the "Description:" information, which is defined in a paragraph on the HTML page, into the 'content' field.
Example (raw HTML):
<p><strong>Description:</strong> Apache Nutch is an open source Web crawler written in Java. By using it, we can find Web page hyperlinks in an automated manner, reduce lots of maintenance work, for example checking broken links, and create a copy of all the visited pages for searching over.
How can I filter the junk out of the 'content' field and only have the information I am interested in?

You can use the plugin below to extract content based on XPath queries.
If your content is in a specific div, you can use this plugin to extract the content you want from that specific section.
Filter xpath
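For illustration, here is a standalone sketch of the same kind of targeted extraction, using Jsoup CSS selectors rather than the plugin's XPath queries (the URL is a placeholder; inside a Nutch parse filter you would receive the fetched content instead of connecting yourself):

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    public class DescriptionExtractor {
        public static void main(String[] args) throws Exception {
            // placeholder URL; a Nutch plugin would hand you the parsed page instead
            Document doc = Jsoup.connect("http://example.com/page.html").get();
            // select the <p> whose <strong> child reads "Description:"
            Element p = doc.selectFirst("p:has(strong:containsOwn(Description:))");
            if (p != null) {
                // drop the "Description:" label, keep only the paragraph text
                System.out.println(p.text().replaceFirst("^Description:\\s*", ""));
            }
        }
    }

The same idea expressed as an XPath query for the plugin would be along the lines of //p[strong[text()='Description:']].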

Related

Apache Tika extracts only field names from PDF XFA forms but not the text content

I have tried to use Apache Tika v1.14 to parse the text contained in PDF XFA forms. However, after trying different configurations in PDFParserConfig, I can only get the field names but not the text content of those fields.
For example, if there is a field called "Telephone", part of Tika's output may be <li fieldName="Telephone">Telephone: </li> (the same repeats for the other fields as well). However, if I use the PDFBox API to traverse the DOM tree to reach the node named "Telephone", I can use getNodeValue() to obtain the text content I want.
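The PDFBox traversal I mean looks roughly like this (a minimal sketch against the PDFBox 2.x API; the file name is a placeholder, and getTextContent() collapses the text children that getNodeValue() reaches individually):

    import java.io.File;
    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.pdmodel.interactive.form.PDXFAResource;
    import org.w3c.dom.Document;
    import org.w3c.dom.NodeList;

    public class XfaDump {
        public static void main(String[] args) throws Exception {
            try (PDDocument pdf = PDDocument.load(new File("form.pdf"))) {
                // the XFA stream lives on the AcroForm
                PDXFAResource xfa = pdf.getDocumentCatalog().getAcroForm().getXFA();
                Document dom = xfa.getDocument();
                // look up a field's data node by its element name
                NodeList nodes = dom.getElementsByTagName("Telephone");
                for (int i = 0; i < nodes.getLength(); i++) {
                    System.out.println(nodes.item(i).getTextContent());
                }
            }
        }
    }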
I am aware of the settings setExtractAcroFormContent() and setIfXFAExtractOnlyXFA() in PDFParserConfig and have also experimented with them, but I still did not get the text content.
So my questions are:
Did I misconfigure Tika so that it does not give the right output? Or,
Is this what Apache Tika's implementation is intended to do? Or,
Is the implementation still under development?
I am sorry that the forms contain medical information of patients, so I am not able to attach them as examples.
Thank you very much.
P.S. I am also aware of Tika's Jira issues https://issues.apache.org/jira/browse/TIKA-973 and https://issues.apache.org/jira/browse/TIKA-1857, so I thought this feature had been implemented.

How to use HTML tables within Tumblr posts

For a public blog, we are currently using the default rich text editor, as our content editors are not HTML writers.
Now, on a certain post / custom page, we want to add tabular data using an HTML table.
The rich text editor in Tumblr has no support for creating tables, so we switched to the HTML editor. There we created the table tag and some rows. Everything looks fine until saving the post. On saving, the table element seems to be gone...
The Tumblr dashboard strips out many different HTML tags. They're visible in your theme, but on the dash they're either removed without a trace or replaced with a small clickthrough "embedded content" symbol.
If you NEED the table on the dash, you may need to use an image. Personally I'd include the Read More break to direct people into viewing the whole thing on your blog, but that won't work inside the app or slide-in bar, etc.

Get page numbers of search results of a PDF in Solr

I'm building a web application where users can search for PDF documents and view them with pdf.js. I would like to display the search results with a short snippet of the paragraph where the search term was found and a link to open the document at the right page.
So what I need is the page number and a short text snippet of every search result.
I'm using Solr 4.1 to index PDF documents. The indexing itself works fine, but I don't know how to get the page number and paragraph of a search result.
I found "Indexing PDF with page numbers with Solr" here, but it wasn't really helpful.
I'm now splitting the PDF and sending each page separately to Solr.
So every page is its own document, with an id of <id_of_document>_<page_number> and an additional field doc_id that contains only the <id_of_document> for grouping the results.
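A minimal sketch of that split-and-index step, assuming current PDFBox 2.x and SolrJ APIs (the core URL, field names, and document id are placeholders; the Solr 4.1-era SolrJ client was HttpSolrServer rather than HttpSolrClient):

    import java.io.File;
    import java.util.List;
    import org.apache.pdfbox.multipdf.Splitter;
    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.text.PDFTextStripper;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    public class PageIndexer {
        public static void main(String[] args) throws Exception {
            String docId = "report42"; // placeholder <id_of_document>
            try (PDDocument pdf = PDDocument.load(new File("report42.pdf"));
                 HttpSolrClient solr = new HttpSolrClient.Builder(
                         "http://localhost:8983/solr/docs").build()) {
                List<PDDocument> pages = new Splitter().split(pdf);
                PDFTextStripper stripper = new PDFTextStripper();
                int pageNo = 1;
                for (PDDocument page : pages) {
                    SolrInputDocument d = new SolrInputDocument();
                    d.addField("id", docId + "_" + pageNo);  // <id_of_document>_<page_number>
                    d.addField("doc_id", docId);             // used for grouping the results
                    d.addField("content", stripper.getText(page));
                    solr.add(d);
                    page.close();
                    pageNo++;
                }
                solr.commit();
            }
        }
    }

With this layout, grouping on doc_id (group=true&group.field=doc_id) collapses the per-page hits back into one result per document, and the page number can be parsed off the id.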
There is the JIRA issue SOLR-380 with a patch, which you can check out.
I also tried getting the results with page numbers but could not do it. I used Apache PDFBox to split all the PDFs present in a directory and send the files to the Solr server.
I have not tried it myself.
Approach:
1. A custom Solr connector integrating with the Apache Tika parser for indexing PDFs.
2. Create multiple attributes in Solr like page1, page2, page3, ..., pageN; alternatively, use dynamic attributes in Solr.
3. In the custom connector, read the PDFs page by page and index them onto the respective page attributes/dynamic attributes.
4. Enable search on all the “page” attributes.
5. When the user searches, use the “highlighter/Summary/Teaser” component to retrieve only the “page” attributes that have hits (see the sketch below).
6. The “page” attributes that have a hit (found via highlighter/Summary/Teaser) for a given record are the pages that contain the searched phrase.
7. Link the PDF with the “#PageNumber” of the PDF and pop up the page on click.
This is a far better approach compared to splitting the PDFs and indexing them as separate Solr docs.
If you find a flaw in this design, respond to my thread. I will attempt to resolve it.
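For step 5, a rough SolrJ sketch, assuming each page was indexed into a dynamic field such as page_1, page_2, ... and copied into a catch-all search field (all names here are hypothetical):

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class PageHitFinder {
        public static void main(String[] args) throws Exception {
            try (HttpSolrClient solr = new HttpSolrClient.Builder(
                    "http://localhost:8983/solr/docs").build()) {
                SolrQuery q = new SolrQuery("nutch");  // placeholder search phrase
                q.set("df", "all_pages");              // hypothetical copyField target over page_*
                q.setHighlight(true);
                q.set("hl.fl", "page_*");              // have the highlighter report per-page fields
                QueryResponse rsp = solr.query(q);
                // the field names returned with snippets are the pages that matched
                rsp.getHighlighting().forEach((docId, fields) ->
                    fields.keySet().forEach(field ->
                        System.out.println(docId + " matched on " + field)));
            }
        }
    }

The matching field name (e.g. page_7) then supplies the page number for the “#PageNumber” link in step 7.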

SEO: Can dynamically generated links be crawled?

I have a page containing <div> tags with onclick="" code that calls an AJAX request to get JSON data and then iterates through the results to form links (<a />) that are appended to the page. These links do not exist in any other place on my website. How can I make these dynamically generated links crawlable?
My initial thought was to turn the <div> tags into <a> tags with href="#", but with my limited knowledge of how typical crawlers work, I don't think this would solve my problem, since the "#" would be what's recognized by the crawler, not necessarily the dynamically generated output. Besides that, I don't want the scroll position to be altered at all, which also rules out giving the <a> tag an id and having it reference itself.
Do I have any options aside from making a new page containing all of the links I need crawled? Thanks.
As a general rule, content that is created or made available through JavaScript cannot be found or indexed by search engines. Google does support crawlable AJAX, but using it as the only means of accessing your content is bad for accessibility, and other search engines can't get to that content either, which is also not a good thing. Basically, crawlable AJAX is a bad thing.
You should always make your content available without requiring JavaScript to get it. Then you can improve your site by adding JavaScript to make getting the content faster or easier. This is called Progressive Enhancement and is how good websites are built.

FAST For SharePoint Web Crawler Meta Tag Extraction

I am using FAST for SharePoint to crawl a non-SharePoint website. The website is crawled with no errors, and I can get results for any keyword.
I want to create a refiner on the result page based on the HTML pages' meta tags. There must be a two-level refiner: category and subcategory. If the user clicks a category, the refiner panel must show all related subcategories.
The meta tags like this:
<meta name="Category" content="Products"/>
<meta name="SubCategory" content="Electronic"/>
How can I extract the meta tags of the crawled HTML page(s) with the FAST for SharePoint Web crawler?
I tried adding the meta tag names to FAST Search Administration > Managed Properties and configured the refiner panel for those meta tags, but I could not get results. It does not work.
Thank you!
If you want to use a custom Managed Property, you need to first bind it to a crawled property. Crawled properties are created automatically during the crawl, or you can create them in PowerShell; see the following link: http://msdn.microsoft.com/en-us/subscriptions/ff393776(v=office.14).aspx
If I understand correctly, what you are trying to do is get information that is in the HTML of your page. In this case, you cannot use the out-of-the-box web crawler to get this information. I suggest you take a look at a custom BDC connector if you want to create a custom crawler to get the information you want: http://msdn.microsoft.com/en-us/library/ee557349(v=office.14).aspx