Apache Nutch: Get outlink URL's text context

Does anyone know an efficient way to extract the text context that wraps an outlink URL? For example, given this sample text containing an outlink:
Nutch can run on a single machine, but gains a lot of its strength from running in a Hadoop cluster. You can download Nutch here.
For more information about Apache Nutch, please see the Nutch wiki.
In this example, I would like to get the sentence containing the link, plus the sentence before and the sentence after it. Is there a way to do this efficiently? Are there any methods I can invoke to get something like the position of the link within the fetched content? Or even a part of the Nutch code I can modify to do this? Thanks!

What you want to do is web scraping. Python and Hadoop both offer tools for that. To achieve it, you can use selectors (a Java sketch of the same idea follows the links below).
Here are some examples of how to do that using Python Scrapy:
Selectors
Scrapy Tutorial
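If you prefer Java (which is what Nutch itself is written in), the same selector idea can be sketched with jsoup. This is a minimal sketch under assumptions: the HTML string is a placeholder for fetched content, and taking the parent element's text is only a rough approximation of the surrounding sentence context.

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    public class LinkContextExtractor {
        public static void main(String[] args) {
            // Placeholder HTML; in practice this would be the fetched page content
            String html = "<p>Nutch can run on a single machine. "
                        + "You can download Nutch <a href=\"http://nutch.apache.org\">here</a>. "
                        + "For more information, see the Nutch wiki.</p>";
            Document doc = Jsoup.parse(html);
            for (Element link : doc.select("a[href]")) {
                // The enclosing element's text approximates the link's context
                String context = link.parent().text();
                System.out.println(link.attr("href") + " -> " + context);
            }
        }
    }

From there you could split the parent's text into sentences and keep only the one containing the anchor text plus its neighbours.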
On Hadoop, the best way to go is to implement a crawler that uses selectors:
Web crawl with Hadoop
HiveQL
Cascading can be used to address the URL you specify:
Hadoop and Cascading
Once you have the data, you can also use R for analysis:
R and Hadoop
Enabling R on Hadoop
If you haven't done anything with Hadoop yet, here is a good starting point. You may also want to have a look at HUE Beeswax as an interactive tool that is very useful for data analysis.

Related

How to know when the DataImportHandler for Solr has finished indexing?

Is it possible to know when Solr has finished indexing my data?
I work with SolrCloud 4.9.0 and ZooKeeper as the configuration file manager.
I have the data.import file, but it only records when the indexing STARTED, not when it ended.
You can get the DataImportHandler status using:
<MY_SERVER>/solr/dataimport?command=status
By reading the status, you can tell whether the import is still running. A similar procedure (with a different URL) is advised in the "Solr in Action" book for checking whether a backup is still running.
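For example, a crude Java polling loop against that endpoint might look like the sketch below; the server URL and core name are placeholders, and simple string matching stands in for proper JSON parsing:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class DihStatusPoller {
        public static void main(String[] args) throws Exception {
            // Placeholder URL: replace host and core with your own
            URL status = new URL(
                "http://localhost:8983/solr/mycore/dataimport?command=status&wt=json");
            while (true) {
                HttpURLConnection conn = (HttpURLConnection) status.openConnection();
                StringBuilder body = new StringBuilder();
                try (BufferedReader in = new BufferedReader(
                        new InputStreamReader(conn.getInputStream()))) {
                    String line;
                    while ((line = in.readLine()) != null) {
                        body.append(line);
                    }
                }
                // DIH reports "busy" while an import is running and "idle" afterwards
                if (body.toString().contains("\"status\":\"idle\"")) {
                    System.out.println("Import finished.");
                    break;
                }
                Thread.sleep(5000); // poll every five seconds
            }
        }
    }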
Another option would involve the use of listeners as advised here.
I also use the /dataimport?command=status way to check if the job is done or not, and while it works, sometimes I get the impression it is a bit flaky.
There are listeners you can use: see here. I would really like to use those, but of course you need to write Java code and deploy your JAR in Solr, etc. So it is a bit of a PITA.

RDF4J RDF Lucene configuration

I have been trying for some time to configure my Sesame RDF repository (now called RDF4J) in order to use full-text queries.
I have not found much documentation about this configuration. I think I need to create a template file so that I can then use it with the console. Here is the little information I found on the topic: https://groups.google.com/forum/#!topic/rdf4j-users/xw2UJCziKl8
Does anybody have any information about configuring RDF4J with Lucene? Any clue would be much appreciated. Otherwise, I would consider swapping the whole repository for another one, for example Virtuoso.
Thanks in advance,
You need to perform these steps:
Start rdf4j-server. I used rdf4j-server.war (and rdf4j-workbench.war). My URL was http://127.0.0.1:8080/rdf4j-server
Put lucene.ttl (lucene.ttl or this) into ~/.RDF4J/console/templates, where "~" is your home directory
Adjust the settings in this file
Then start the console from the RDF4J distribution
Then execute the following commands in the console:
connect http://127.0.0.1:8080/rdf4j-server
create lucene
Enter lucene repository Id: myRepositoryId
Enter lucene repository name: myRepositoryName
Then you can see the created repository in http://127.0.0.1:8080/rdf4j-workbench. If you add some triples, you can use Lucene search.
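If you would rather configure this programmatically than through the console, a minimal Java sketch using RDF4J's LuceneSail looks like this (the index directory is a placeholder, and older RDF4J versions use initialize() instead of init()):

    import org.eclipse.rdf4j.repository.Repository;
    import org.eclipse.rdf4j.repository.sail.SailRepository;
    import org.eclipse.rdf4j.sail.lucene.LuceneSail;
    import org.eclipse.rdf4j.sail.memory.MemoryStore;

    public class LuceneRepoExample {
        public static void main(String[] args) {
            // Wrap a plain store with a LuceneSail to get full-text indexing
            LuceneSail luceneSail = new LuceneSail();
            // Placeholder path for the Lucene index directory
            luceneSail.setParameter(LuceneSail.LUCENE_DIR_KEY, "/tmp/rdf4j-lucene-index");
            luceneSail.setBaseSail(new MemoryStore());

            Repository repo = new SailRepository(luceneSail);
            repo.init();
            // ... add triples; full-text matches are then available via the
            // search:matches SPARQL extension ...
            repo.shutDown();
        }
    }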

What is the significance of the data-config.xml file in Solr?

And when should I use it? How is it configured? Can anyone please tell me in detail?
The data-config.xml file is an example configuration file for how to use the DataImportHandler in Solr. It's one way of getting data into Solr, allowing one of the servers to connect through JDBC (or through a few other plugins) to a database server or a set of files and import them into Solr.
DIH has a few issues (for example the non-distributed way it works), so it's usually suggested to write the indexing code yourself (and POST it to Solr from a suitable client, such as SolrJ, Solarium, SolrClient, MySolr, etc.)
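As a rough illustration of that suggestion, here is a minimal SolrJ sketch; the URL, collection name, and field names are placeholders:

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    public class IndexExample {
        public static void main(String[] args) throws Exception {
            // Placeholder URL: point this at your own core or collection
            try (SolrClient solr = new HttpSolrClient.Builder(
                    "http://localhost:8983/solr/mycollection").build()) {
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", "doc-1");
                doc.addField("title", "Example document");
                solr.add(doc);
                solr.commit();
            }
        }
    }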
It has been mentioned that the DIH functionality really should be moved into a separate application, but that hasn't happened yet as far as I know.

How to check whether Cloudera services like Hive and Impala are running through Java code?

I want to run some Hive queries and then collect different metrics, like HDFS bytes read/written. For this I have written Java code. But before running the code I just want to check whether the Cloudera services like Hive, Impala, and YARN are running. If they are running, the code should execute; otherwise it should just exit. Is there any way to check the status of the services from Java code?
Sampson S gave you a correct answer, but it's not trivial to implement. The information is available via the REST API of the Cloudera Manager (CM) tools offered by Cloudera. You would have your Java program make a web GET request to CM, parse the JSON result and use that to make a decision. Alternatively, you could look at the code behind their APIs to make a more direct query.
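As a rough sketch of that approach (the host, port, API version, cluster name, and credentials below are all assumptions to adjust for your installation, and string matching stands in for real JSON parsing):

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.nio.charset.StandardCharsets;
    import java.util.Base64;

    public class CmServiceCheck {
        public static void main(String[] args) throws Exception {
            // Placeholder host, API version, and cluster name
            URL url = new URL("http://cm-host:7180/api/v10/clusters/MyCluster/services");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            // Placeholder credentials
            String auth = Base64.getEncoder()
                    .encodeToString("admin:admin".getBytes(StandardCharsets.UTF_8));
            conn.setRequestProperty("Authorization", "Basic " + auth);

            StringBuilder body = new StringBuilder();
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(conn.getInputStream()))) {
                String line;
                while ((line = in.readLine()) != null) {
                    body.append(line);
                }
            }
            // Crude check: a real implementation should parse the JSON and
            // inspect each service's serviceState field individually
            boolean anyStopped = body.toString().contains("STOPPED");
            System.out.println(anyStopped ? "Some service is stopped" : "Services look up");
        }
    }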
But I think you should ask "Why?" What are you trying to accomplish? Are you replicating the functionality already provided by CM? When asking questions here on SO it's always helpful to provide some context. It seems like you may be new to the environment. Perhaps it already does what you want.

Recrawl URL with Nutch just for updated sites

I crawled one URL with Nutch 2.1, and now I want to re-crawl pages after they get updated. How can I do this? How can I know that a page has been updated?
Simply put, you can't know in advance. You need to recrawl the page to check whether it has been updated. So, according to your needs, prioritize the pages/domains and recrawl them periodically. For that you need a job scheduler such as Quartz.
You need to write a function that compares the pages. However, Nutch originally saves the pages as index files. In other words, Nutch generates new binary files to store the HTML. I don't think it's possible to compare the binary files directly, as Nutch combines all crawl results within a single file. If you want to save pages in raw HTML format for comparison, see my answer to this question.
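Assuming you do store the raw HTML, one simple way to detect changes is to compare content digests instead of the files themselves. A minimal sketch (where and how the previous digest is stored is left open):

    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;

    public class PageChangeDetector {
        // Compute a hex MD5 digest of the page content
        static String digest(String html) throws Exception {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] hash = md.digest(html.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : hash) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        }

        // A page counts as updated when its digest differs from the stored one
        static boolean isUpdated(String previousDigest, String freshHtml) throws Exception {
            return !digest(freshHtml).equals(previousDigest);
        }
    }

Note that Nutch itself ships signature implementations (such as TextProfileSignature) that serve a similar near-duplicate-detection purpose.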
You have to schedule a job for firing the crawl job.
However, Nutch's AdaptiveFetchSchedule should enable you to crawl and index pages and detect whether a page is new or updated, so you don't have to do it manually.
This article describes the same in detail.
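These settings normally live in nutch-site.xml; as a sketch, the equivalent programmatic configuration looks like this (the interval values are illustrative, not recommendations):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.nutch.util.NutchConfiguration;

    public class AdaptiveScheduleSetup {
        public static void main(String[] args) {
            Configuration conf = NutchConfiguration.create();
            // Switch from the default fetch schedule to the adaptive one
            conf.set("db.fetch.schedule.class",
                     "org.apache.nutch.crawl.AdaptiveFetchSchedule");
            // Illustrative bounds on the recrawl interval, in seconds
            conf.set("db.fetch.schedule.adaptive.min_interval", "3600");   // one hour
            conf.set("db.fetch.schedule.adaptive.max_interval", "604800"); // one week
            System.out.println(conf.get("db.fetch.schedule.class"));
        }
    }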
What about http://pascaldimassimo.com/2010/06/11/how-to-re-crawl-with-nutch/?
This is discussed in: How to recrawl Nutch
I am wondering whether the above-mentioned solution will indeed work. I am trying it as we speak. I crawl news sites, and they update their front pages quite frequently, so I need to re-crawl the index/front page often and fetch the newly discovered links.