Nutch 2.x: no errors, but no results either - apache

I've been playing with Nutch 2.x for a while and have it set up according to the Nutch 2.x tutorial, as advised in this post, but I still can't figure it out. Any help would be greatly appreciated.
When using the inject command as per the tutorial, it injects the two URLs I have in seed.txt:
nutch inject ../local/urls/seed.txt
but when running the crawl script, it doesn't visit any of the URLs:
bin/crawl ../local/urls/seed.txt TestCrawl http://localhost:8983/solr 2
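For reference, the seed file format is simply one fully qualified URL per line. A minimal sketch of what the file might contain (these URLs are placeholders, not taken from the question):
# a minimal seed file is just one URL per line
cat ../local/urls/seed.txt
http://nutch.apache.org/
http://example.com/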

I've now started again with a completely new install of Nutch 2.2.1, HBase 0.94.10, and Solr 4.4.0, as advised by someone on the mailing list because the versions mentioned in the tutorial are years old, and now the error I'm getting is:
[root@localhost local]# bin/nutch inject /urls/seed.txt
InjectorJob: starting at 2013-08-11 17:59:32
InjectorJob: Injecting urlDir: /urls/seed.txt
InjectorJob: org.apache.gora.util.GoraException: java.lang.RuntimeException: java.lang.IllegalArgumentException: Not a host:port pair: �2249#localhost.localdomainlocalhost,45431,1376235201648
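For what it's worth, the "Not a host:port pair" message usually indicates an HBase client/server version mismatch: the garbled bytes after the message are the server address as written to ZooKeeper by a newer HBase, which an older bundled client cannot parse. A rough check, assuming the mismatch is between the HBase jars shipped in the Nutch runtime and the running HBase 0.94.10 server (the hbase installation path below is a placeholder):
# Compare the HBase client jar bundled with Nutch/Gora against the server version.
ls runtime/local/lib | grep -i hbase
/path/to/hbase-0.94.10/bin/hbase version
# If they differ, align the HBase revision in ivy/ivy.xml and rebuild with: ant runtime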

Although this is an old question, I have a suggestion here.
Because Nutch is an Apache project, it obeys robots.txt; perhaps that is why you didn't get anything. You can edit src/java/org/apache/nutch/fetcher/FetcherReducer.java and comment out the robots.txt check, as shown below:
/*
if (!rules.isAllowed(fit.u.toString())) {
  // unblock
  fetchQueues.finishFetchItem(fit, true);
  if (LOG.isDebugEnabled()) {
    LOG.debug("Denied by robots.txt: " + fit.url);
  }
  output(fit, null, ProtocolStatusUtils.STATUS_ROBOTS_DENIED,
      CrawlStatus.STATUS_GONE);
  continue;
}
*/
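Before patching the source, it may be worth confirming that robots.txt is actually what blocks the fetch; a quick manual check against one of your seed URLs (example.com below is a placeholder) is often enough:
# Inspect the site's robots.txt for Disallow rules that would affect your crawler.
curl -s http://example.com/robots.txt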

Related

Spartacus API calls return 504 (Gateway Timeout) when running using Server Side Rendering (SSR)

I'm trying to get Spartacus to work with SSR. When opening the default URL, http://localhost:4200, the storefront renders as expected, but only after I clear the site data first. When I attempt to browse the storefront, API calls fail with a 504 (Gateway Timeout). Chrome dev tools indicate the error is happening in the service worker. At this point, I'm wondering if I configured Spartacus incorrectly. When running Spartacus using yarn start rather than yarn serve:ssr, I can load the home page and browse the site normally.
OS: Ubuntu 16.04.6 LTS
Chrome Version: 73.0.3683.75
Node version: 11.15.0
Angular CLI version: 8.3.8
Yarn version: 1.19.1
ng new ssr-spartacus-app --style=scss
cd ssr-spartacus-app
ng add @spartacus/schematics --baseUrl https://localhost:9002 --baseSite cmssiteuid --pwa --ssr
rm src/app/app.component.html
echo "<cx-storefront>Loading...</cx-storefront>" > src/app/app.component.html
yarn build:ssr
yarn serve:ssr
Before running yarn build:ssr, I made the following change to the app.module.ts file:
Before
context: {
  baseSite: ['cmssiteuid'],
},
After
authentication: {
  client_id: 'mobile_android',
  client_secret: 'secret',
},
context: {
  urlParameters: ['baseSite', 'language', 'currency'],
  baseSite: ['cmssiteuid'],
},
I also set anonymousConsents to false. With this set to true, I was getting a lot of CORS errors.
I've been scratching my head over this for a little while now, and I'm hoping someone with more knowledge of Spartacus's inner workings can shed some light on why Spartacus is behaving this way with SSR.
I'm not sure I can give you a definite recipe to fix the issue; I'd need more details and logs related to your problem. Still, based on my experience, I can share some tips and tricks for dealing with SSR-related issues.
Some theory related to SSR:
https://angular.io/guide/universal (feel free to use the official Angular documentation as a primary source, because Spartacus uses Angular's out-of-the-box features to make SSR work)
https://sap.github.io/spartacus-docs/server-side-rendering-in-spartacus/
https://enable.cx.sap.com/tag/tagid/spartacus (SSR-related videos)
Practical approaches for debugging SSR:
Observe and analyze the console output while starting your application in Node.js.
You can use the SSR configuration from the example storefront application (https://github.com/SAP/spartacus/tree/develop/projects/storefrontapp) as a starting point, because the OOTB SSR setup works like a charm.
Something from the Spartacus team: https://sap.github.io/spartacus-docs/how-to-debug-server-side-rendered-storefront/
General checks to ensure that the application has been configured correctly:
SAP Commerce Cloud configuration for working with Spartacus: https://sap.github.io/spartacus-docs/installing-sap-commerce-cloud/
Take a look at the guide https://sap.github.io/spartacus-docs/building-the-spartacus-storefront-from-libraries/ to ensure that your frontend application has the correct configuration.
Double-check the configuration that B2cStorefrontModule is using (you can find an example project here: https://github.com/SAP/spartacus/tree/develop/projects/storefrontapp).
Take a look at the Network and Console browser tabs and try to resolve all errors; the curl check below can also help verify the backend directly.
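As a concrete version of that last point, it can help to call the OCC backend directly from the machine that runs the Node SSR process, bypassing the browser and any service worker. This is only a sketch: port 9002 and the site id cmssiteuid come from the question, /occ/v2/basesites is assumed to be the base-sites endpoint exposed by your backend version, and -k skips validation of the default self-signed certificate.
# Can the SSR host reach the OCC backend at all?
curl -sk "https://localhost:9002/occ/v2/basesites" | head -c 300
# The configured base site should appear in the response; if this call itself
# times out, the 504 is a backend/network problem rather than a Spartacus one.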
Did you turn off PWA?
Turn PWA off.
As soon as Spartacus is installed in PWA mode, a service worker is installed, and it serves a cached version of index.html, along with the js files. This results in SSR being completely skipped. The following steps describe how to turn off PWA:
Check that there are no service workers registered in your app. If you do find any service workers, remove them.
Turn PWA off in your app module configuration, as follows:
StorefrontModule.withConfig({
  backend: {
    occ: {
      baseUrl: 'https://[your_endpoint]',
    },
  },
  pwa: {
    enabled: false,
  },
});
Rebuild your local Spartacus libraries by running the following command:
yarn build:core:lib
Build your local Spartacus shell app by running the following command:
yarn build --prod
Build the SSR version of your shell app by running the following command:
yarn build:ssr
Start Spartacus with the SSR server by running the following command:
yarn serve:ssr
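Once the rebuilt app is running, a rough way to confirm that markup is coming from the SSR server rather than from a cached client-side index.html is to fetch the page without a browser. The port below is the one mentioned in the question; adjust it to whatever serve:ssr reports on startup.
# Server-rendered output should already contain storefront markup.
curl -s http://localhost:4200/ | grep -c "cx-storefront"
# A count of 0 suggests you are still being served the bare app shell.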
If you are getting a 504 after hitting the API service, you need to check your API logs.
If you see an error log like this:
{"instant":{"epochSecond":1644915623,"nanoOfSecond":929833000},"thread":"hybrisHTTP1","level":"ERROR","loggerName":"org.springframework.web.servlet.DispatcherServlet","message":"Context initialization failed","thrown":{"commonElementCount":0,"localizedMessage":"Error creating bean with name 'cartEntriesController': Injection of resource dependencies failed; nested exception is org.springframework.beans.factory.UnsatisfiedDependencyException: Error creating bean with name 'defaultStockValidator' defined in ServletContext resource [/WEB-INF/config/v2/validators-v2-spring.xml]: Unsatisfied dependency expressed through constructor parameter 0: Could not convert argument value of type [de.hybris.platform.ycommercewebservices.stock.impl.DefaultCommerceStockFacade] to required type [de.hybris.platform.commercewebservices.core.stock.CommerceStockFacade]: Failed to convert value of type 'de.hybris.platform.ycommercewebservices.stock.impl.DefaultCommerceStockFacade' to required type 'de.hybris.platform.commercewebservices.core.stock.CommerceStockFacade'; nested exception is java.lang.IllegalStateException: Cannot convert value of type 'de.hybris.platform.ycommercewebservices.stock.impl.DefaultCommerceStockFacade' to required type 'de.hybris.platform.commercewebservices.core.stock.CommerceStockFacade': no matching editors or conversion strategy found","message":"Error creating bean with name 'cartEntriesController': Injection of resource dependencies failed; nested exception is org.springframework.beans.factory.UnsatisfiedDependencyException: Error creating bean with name 'defaultStockValidator'
You can try this resolution:
Remove the ycommercewebservices template extension from manifest.json, then rebuild and redeploy with "Migrate Data" mode.
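If you want to confirm where the template extension is referenced before rebuilding, a quick search of the manifest is enough (the file name follows the usual SAP Commerce Cloud repository layout):
# Show every reference to the template extension in the build manifest.
grep -n "ycommercewebservices" manifest.json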

Parsoid: Unexpected Token error and failing to initialize

mwApis:
  - # This is the only required parameter,
    # the URL of your MediaWiki API endpoint.
    uri: 'http://spgenerations.com/wiki/api.php'
On my Linux box, I can curl this URL and receive the API data.
Regardless of whether I use the apt-get installation or the developer installation (npm install), both give me this error:
{"name":"parsoid","hostname":"play.projecttidal.com.KVM","pid":12636,"level":30,"levelPath":"info/service-runner","msg":"master(12636) initializing 2 workers","time":"2019-03-12T03:55:47.504Z","v":0}
{"name":"parsoid","hostname":"play.projecttidal.com.KVM","pid":12645,"level":60,"moduleName":"lib/index.js","levelPath":"fatal/service-runner/worker","msg":"Unexpected token ...","time":"2019-03-12T03:55:47.917Z","v":0}
{"name":"parsoid","hostname":"play.projecttidal.com.KVM","pid":12636,"level":40,"message":"first worker died during startup, continue startup","worker_pid":12645,"exit_code":1,"startup_attempt":1,"levelPath":"warn/service-runner/master","msg":"first worker died during startup, continue startup","time":"2019-03-12T03:55:48.925Z","v":0}
For context, the hostname here is incorrect and the domain has been removed.
This is my parsoid config:
// Parsoid configuration
$wgVirtualRestConfig['modules']['parsoid'] = array(
  'url' => 'server.spgenerations.com',
  'forwardCookies' => true
);
I have tried everything under the hidden voodoo sun to get this thing to work and I'm beyond frustrated. I've spent four hours tinkering with URLs to no avail, so please, if you know anything relating to this error, lend a hand.
Check which Node.js version you are running:
nodejs --version
If it is 4.x, that's too old for Parsoid. I had the same situation (Debian 9 still ships such an old Node.js version in its repositories). After upgrading to 10.x, it ran fine for me.
I used the following guide (see "Install using a PPA") to update to a newer Node.js release: https://www.digitalocean.com/community/tutorials/how-to-install-node-js-on-debian-9
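For reference, the PPA route from that guide boils down to roughly the following commands; the setup_10.x script name follows the NodeSource convention, so double-check it against the guide before running anything as root.
# Fetch and run the NodeSource setup script for the 10.x line, then install Node.js.
curl -sL https://deb.nodesource.com/setup_10.x -o nodesource_setup.sh
sudo bash nodesource_setup.sh
sudo apt-get install -y nodejs
nodejs --version   # should now report a 10.x version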

Nutch 1.4 with Solr 3.4 - can't crawl URL, "no URLs to fetch"

I followed a tutorial for web crawling with Nutch using Cygwin, Tomcat, Nutch 1.4, and Solr 3.4. I was already able to crawl a URL once, but somehow this doesn't work anymore, no matter which URL I try.
My regex-urlfilter.txt in runtime/local/conf is as following:
# skip file: ftp: and mailto: urls
-^(file|ftp|mailto):
# skip image and other suffixes we can't yet parse
# for a more extensive coverage use the urlfilter-suffix plugin
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$
# skip URLs containing certain characters as probable queries, etc.
-[?*!#=]
# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/
# accept anything else
+^http://([a-z0-9]*\.)*nutch.apache.org/
The only URL in my seed.txt in runtime/local/bin/urls is http://nutch.apache.org/.
For crawling I use the command:
$ ./nutch crawl urls -dir newCrawl3 -solr http://localhost:8080/solr/ -depth 2 -topN 3
Console output is:
cygpath: can't convert empty path
crawl started in: newCrawl3
rootUrlDir = urls
threads = 10
depth = 2
solrUrl=http://localhost:8080/solr/
topN = 3
Injector: starting at 2017-05-18 17:03:25
Injector: crawlDb: newCrawl3/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: finished at 2017-05-18 17:03:28, elapsed: 00:00:02
Generator: starting at 2017-05-18 17:03:28
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: topN: 3
Generator: jobtracker is 'local', generating exactly one partition.
Generator: 0 records selected for fetching, exiting ...
Stopping at depth=0 - no more URLs to fetch.
No URLs to fetch - check your seed list and URL filters.
crawl finished: newCrawl3
I know there are a few similar questions, but most of them are not resolved. Can anyone help?
Thank you very much in advance!
Why use a Nutch version that is really old? Nevertheless, the problem you're facing is the space at the beginning of this line:
_+^http://([a-z0-9]*\.)*nutch.apache.org/
(I've highlighted the space with an underscore.) Every line that starts with a space, \n, or # is ignored by the configuration parser; take a look at:
https://github.com/apache/nutch/blob/master/src/plugin/lib-regex-filter/src/java/org/apache/nutch/urlfilter/api/RegexURLFilterBase.java#L258-L269
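To spot this kind of problem quickly, you can check the filter file for lines that begin with whitespace; the path below is the one from the question, and the sed rewrite is only a suggestion, so keep a backup of the file first.
# List filter lines that start with whitespace (these are silently ignored).
grep -n '^[[:space:]]' runtime/local/conf/regex-urlfilter.txt
# Strip leading whitespace in place (GNU sed).
sed -i 's/^[[:space:]]*//' runtime/local/conf/regex-urlfilter.txt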
You can try deleting the directory newCrawl3. Nutch will not crawl a URL again when it has been crawled recently.
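Clearing the previous crawl state forces the seed to be fetched again; on the command line that is simply (directory name taken from the question):
# Remove the old crawl output so the next run starts from a clean state.
rm -rf newCrawl3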

Nutch 1.12 on Cygwin on Windows 7 - NullPointerException

I'm working to get Nutch running for the first time for a work project. At this time, the plan is to run Nutch from a single machine (Windows 7) to scrape content from a dozen or so websites. Below is the command line output from Cygwin.
$ bin/nutch inject crawl/crawldb urls
Injector: starting at 2016-10-29 09:16:37
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: java.lang.NullPointerException
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1012)
at org.apache.hadoop.util.Shell.runCommand(Shell.java:445)
at org.apache.hadoop.util.Shell.run(Shell.java:418)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:650)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:739)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:722)
at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:633)
at org.apache.hadoop.fs.FilterFileSystem.setPermission(FilterFileSystem.java:467)
at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:456)
at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:424)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:906)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:887)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:849)
at org.apache.hadoop.fs.FileSystem.createNewFile(FileSystem.java:1149)
at org.apache.nutch.util.LockUtil.createLockFile(LockUtil.java:58)
at org.apache.nutch.crawl.Injector.inject(Injector.java:357)
at org.apache.nutch.crawl.Injector.run(Injector.java:467)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.nutch.crawl.Injector.main(Injector.java:441)
Looking through the source, here are lines 440 through 443 of org.apache.nutch.crawl.Injector:
public static void main(String[] args) throws Exception {
  int res = ToolRunner.run(NutchConfiguration.create(), new Injector(), args);
  System.exit(res);
}
It's not clear whether it is NutchConfiguration.create() or new Injector() that is failing there. I set up my installation from the tutorial on the Nutch site. I put a list of three URLs, one per line, in the file ./urls/seed.txt and edited ./conf/nutch-site.xml.
Any suggestions for investigation/debugging this would be appreciated.
Thank you!
OK, after struggling for a while, here are the final steps to get Hadoop working with Cygwin/Windows.
Download the right versions of winutils.exe and hadoop.dll into a bin folder from https://github.com/cdarlint/winutils, based on your Hadoop version.
Set HADOOP_HOME to the directory that contains that bin folder (note: if the two files above are downloaded into D:\winutil\bin, then HADOOP_HOME = D:\winutil).
Make sure to add D:\winutil\bin to the Windows PATH variable. This step is important now (it was not a while back); a Cygwin sketch of these settings follows.
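Since the commands in the question are run from a Cygwin shell, a rough way to apply the same settings for a single session is shown below; D:\winutil is just the example directory from the steps above, and setting these permanently as Windows system environment variables is the more reliable option.
# Export the variables for this Cygwin session, then retry the failing command.
export HADOOP_HOME='D:\winutil'
export PATH="$PATH:/cygdrive/d/winutil/bin"
bin/nutch inject crawl/crawldb urls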
I had the same issue. I solved it by setting up Hadoop on the machine and including winutils.exe in %HADOOP%/bin.
Then you will get a java.lang.UnsatisfiedLinkError. To solve that, open the nutch script in %NUTCH_HOME%/runtime/local/bin and comment out the lines below:
if [ "x$JAVA_LIBRARY_PATH" != "x" ]; then
NUTCH_OPTS=("${NUTCH_OPTS[#]}" -Djava.library.path="$JAVA_LIBRARY_PATH")
fi

bin/nutch inject crawl/crawldb urls not working

I just followed the tutorial to set up Nutch from the NutchWiki.
Downloaded Nutch 2.x src and set all configurations.
The problem occurs when I just started to crawl.
When I run this command: bin/nutch inject crawl/crawldb urls, I get an error message like this: Unrecognized arg urls
I just followed all the steps in the tutorial, created directories, made changes to configuration files, etc. I also have a question: there is no crawldb directory in apache-nutch-2.x/runtime/local/. Is it generated automatically, or do I need to create it manually?
Any help to this problem will be appreciated.
I was running into the same problem. The documentation seems to be outdated; it is for 1.x.
For 2.x I tried the following and it worked for me:
bin/nutch inject urls
Hope it helps.
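This also answers the crawldb question: in 2.x there is no local crawldb directory, because crawl data lives in the Gora storage backend (for example an HBase table) rather than on the local filesystem. A rough way to confirm that the inject actually wrote something, assuming the HBase store with its default table name, is shown below; the table is usually called webpage (or <crawlId>_webpage when a crawl id is used), and the hbase path is a placeholder.
# List tables and peek at a couple of injected rows in the Nutch web table.
echo "list" | /path/to/hbase/bin/hbase shell
echo "scan 'webpage', {LIMIT => 2}" | /path/to/hbase/bin/hbase shell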