nutch crawl using protocol-selenium with phantomjs launched as a Mesos task : org.openqa.selenium.NoSuchElementException - selenium

I am trying to crawl AJAX-based sites with Nutch using protocol-selenium with the PhantomJS driver. I am using apache-nutch-1.13 compiled from Nutch's GitHub repository. These crawls are launched as tasks in a system managed by Mesos. When I launch Nutch's crawl script from a terminal on the server, everything goes perfectly and the site is crawled as requested. However, when I execute the same crawl script with the same parameters inside a Mesos task, Nutch raises the exception:
fetch of http://XXXXX failed with: java.lang.RuntimeException: org.openqa.selenium.NoSuchElementException: {"errorMessage":"Unable to find element with tag name 'body'","request":{"headers":{"Accept-Encoding":"gzip,deflate","Connection":"Keep-Alive","Content-Length":"35","Content-Type":"application/json; charset=utf-8","Host":"localhost:12215","User-Agent":"Apache-HttpClient/4.3.5 (java 1.5)"},"httpVersion":"1.1","method":"POST","post":"{\"using\":\"tag name\",\"value\":\"body\"}","url":"/element","urlParsed":{"anchor":"","query":"","file":"element","directory":"/","path":"/element","relative":"/element","port":"","host":"","password":"","user":"","userInfo":"","authority":"","protocol":"","source":"/element","queryKey":{},"chunks":["element"]},"urlOriginal":"/session/a7f98ec0-b8aa-11e6-8b84-232b0d8e1024/element"}}
My first impression was that there was something strange with the environment variables (HADOOP_HOME, PATH, CLASSPATH...), but I set the same variables in the Nutch script and in the terminal and still got the same result.
Any ideas about what I am doing wrong?
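One thing worth checking (a debugging sketch, not a fix; the file paths are placeholders): Mesos executors usually start tasks with a much sparser environment than an interactive shell, so dumping and diffing the environment from both contexts can show exactly what differs:
env | sort > /tmp/env-terminal.txt   # run from the interactive shell
env | sort > /tmp/env-mesos.txt      # run as the first step of the Mesos task's command
diff /tmp/env-terminal.txt /tmp/env-mesos.txt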

Related

Error: Could not find or load main class org.apache.nutch.crawl.Generator (Nutch 1.14)

I am using Nutch 1.14 on Mac. I have set JAVA_HOME and have been able to run Nutch successfully before, but I had to re-download Nutch and now I am getting the error above.
I think it must have something to do with Apache Ant, which I have never run inside the Nutch folder - I can't figure that step out. Any help?
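For reference, the usual way to (re)build the runtime layout from a fresh download is Ant's runtime target, run from the top-level source folder (the one containing build.xml); a minimal sketch, where the folder name is an assumption:
cd apache-nutch-1.14   # the extracted source folder (name assumed)
ant runtime            # builds runtime/local and runtime/deploy
cd runtime/local
bin/nutch              # with no arguments, prints the available commands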

AWS Elastic Beanstalk: setting up X virtual framebuffer (Xvfb)

I'm trying to get a Selenium script running on an Elastic Beanstalk server. To achieve this, I am using the pyvirtualdisplay package, following this answer. However, for the Display driver to run, Xvfb also needs to be installed on the system. I'm getting this error message:
OSError=[Errno 2] No such file or directory: 'Xvfb'
Is there any way to manually install this on EB? I have also set up an EC2 server as suggested here, but the whole process seems unnecessary for this task.
You can create a file in .ebextensions/, e.g. .ebextensions/xvfb.config, with the following content:
packages:
  yum:
    xorg-x11-server-Xvfb: []
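Once the environment is deployed, you can sanity-check the result over eb ssh; a quick sketch (the display number :99 is arbitrary):
which Xvfb                        # should resolve once the package is installed
Xvfb :99 -screen 0 1024x768x24 &  # start a virtual display manually
export DISPLAY=:99                # point X clients at it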

Apache Nutch 1.9 on Hadoop 1.2.1 no Crawl class in jar file

I'm running a cluster of five Cubieboards (Raspberry Pi-like ARM boards) with Hadoop 1.2.1 installed on them (because they are 32-bit). There is one name node and four slave nodes.
For my final paper I wanted to install Apache Nutch 1.9 and Solr for big data analysis.
I did the setup as explained here: http://wiki.apache.org/nutch/NutchHadoopTutorial#Deploy_Nutch_to_Multiple_Machines
When submitting the job file to deploy Nutch over the whole cluster, a ClassNotFoundException is thrown, because the Crawl class no longer exists since Nutch 1.7: http://wiki.apache.org/nutch/bin/nutch%20crawl
It has already been removed from the source tree as well.
The command and the resulting error:
hadoop jar apache-nutch-1.9.job org.apache.nutch.crawl.Crawl urls -dir crawl -depth 3 -topN 5
Warning: $HADOOP_HOME is deprecated.
Exception in thread "main" java.lang.ClassNotFoundException: org.apache.nutch.crawl.Crawl
at java.net.URLClassLoader$1.run(URLClassLoader.java:217)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:205)
at java.lang.ClassLoader.loadClass(ClassLoader.java:321)
at java.lang.ClassLoader.loadClass(ClassLoader.java:266)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:266)
Other classes I found in the package seem to work, so there should be no problem with the environment setup.
What alternatives are there to perform a crawl over the whole cluster?
Since Nutch version 2.0 there is a Crawler class. But not in 1.9 :(
Any help is very appreciated. Thank you.
I believe you should use the bin/crawl script instead of submitting the Nutch job yourself to Hadoop. To do that, you need to do the following:
Download the Nutch 1.9 source code; let's say you extracted the source into nutch-1.9.
Navigate to nutch-1.9 and run:
ant runtime
Once the build finishes, run:
cd runtime/deploy
hadoop fs -put yourseed yourseedlist
bin/crawl yourseedlist crawl http://yoursolrip/solr/yoursolrcore 2
I hope that will help.
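To verify the crawl actually ran, you can inspect HDFS and the crawl db afterwards; a small sketch using the placeholder names from above:
hadoop fs -ls yourseedlist               # confirm the seed list is on HDFS
hadoop fs -ls crawl                      # segments and crawldb appear here after a round
bin/nutch readdb crawl/crawldb -stats    # summary of fetched/unfetched URLs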

setting up and running apache nutch 2.2.1

I am trying to set up and run Apache Nutch 2.2.1 on my Ubuntu desktop. As a newbie, I found some parts of the tutorial on the official website a bit confusing.
If I were to run it on my own desktop, is it correct to go to the
$NUTCH_HOME/runtime/local
to run the bin/nutch command?
Where should I put the directory named urls (which contains the seed list seed.txt)? Is it under
$NUTCH_HOME/runtime/local
Assuming I am in the right directory, this is the problem I had when executing the command:
bin/nutch crawl urls -dir crawl -depth 1
InjectorJob: Using class org.apache.gora.memory.store.MemStore as the Gora storage class.
InjectorJob: total number of urls rejected by filters: 0
InjectorJob: total number of urls injected after normalization and filtering: 0
Exception in thread "main" java.lang.RuntimeException: job failed: name=generate: null, jobid=job_local1613558008_0002
at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:54)
at org.apache.nutch.crawl.GeneratorJob.run(GeneratorJob.java:199)
at org.apache.nutch.crawl.Crawler.runTool(Crawler.java:68)
at org.apache.nutch.crawl.Crawler.run(Crawler.java:152)
at org.apache.nutch.crawl.Crawler.run(Crawler.java:250)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.crawl.Crawler.main(Crawler.java:257)
I am following the tutorial at http://wiki.apache.org/nutch/NutchTutorial up to section 3.3, and have yet to configure Gora, HBase, etc.
It seems that this problem arises because the injector did not pick up the URLs.
Does anyone know how to solve this problem? Thanks a lot!
You should go to $NUTCH_HOME/runtime/deploy to run the command.
In case you want to integrate with Gora and HBase, add this to nutch-site.xml:
<property>
  <name>storage.data.store.class</name>
  <value>org.apache.gora.hbase.store.HBaseStore</value>
  <description>Default class for storing data</description>
</property>
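Nutch 2.x also reads the default datastore from conf/gora.properties, so the same backend is usually selected there as well (this is the commonly documented setting; treat it as an assumption and check your own build):
# conf/gora.properties
gora.datastore.default=org.apache.gora.hbase.store.HBaseStore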

Why is selenium hanging on INFO - Checking Resource aliases, and how do I even debug this?

I'm trying to follow the tutorial here to set up a headless Selenium test run with Jenkins. I'm running CentOS 5.6, and I've followed the instructions. Now, when I run this:
export DISPLAY=":99" && java -jar /var/lib/selenium/selenium-server.jar -browserSessionReuse -htmlSuite *firefox http://www.google.com ./test/selenium/html/TestSuite.html ./target/selenium/html/TestSuiteResults.html
Selenium hangs on INFO - Checking Resource Aliases. I can run the TestSuite.html file manually, and the path is correct.
How can I even begin to try and figure out what's going on? Is there a way I could connect to the display to see what's happening? I am behind a corporate proxy, but with or without -Dhttp.proxyHost arguments, I get the same hung result.
Well, after pointing at an internal server, I get right on past the INFO - Checking Resource Aliases step, so clearly the proxy was the issue.
By trying to hit a site that required the proxy, I was doing too much at once. Confounding variables confounded me.
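One detail worth knowing if you go the proxy route: -D system properties must appear before -jar, otherwise they are passed to the Selenium server as program arguments instead of reaching the JVM. A sketch with placeholder proxy settings:
export DISPLAY=":99"
java -Dhttp.proxyHost=proxy.example.com -Dhttp.proxyPort=8080 \
 -jar /var/lib/selenium/selenium-server.jar -browserSessionReuse \
 -htmlSuite *firefox http://www.google.com \
 ./test/selenium/html/TestSuite.html ./target/selenium/html/TestSuiteResults.html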
Selenium is not hanging on INFO - Checking Resource Aliases. It's waiting for a command to execute. You need to trigger your tests using Ant or some other build tool in Jenkins. That should get you going.