OpenSearchServer stops running: Error (The unique key is missing: content)

I am using OpenSearchServer v1.5.11 - build bd28c79a9d with a total of 5000 home pages. After running the Crawl process (Run forever), I get the following error (4608 pages processed):
Error (The unique key is missing: content)
Most of the pages in the URL browser section are Unfetched/Not parsed and Not indexed.
What's wrong?

Related

Vue 3 browser caching doesn't pick up latest files

We have a build.js script in our project which makes a duplicate of our client directory and adds a folder inside src.
example -
client/src/components will become dist/src/abcxy/components
abcxy will change every time we create a new build.
Now the problem is that the browser tries to load the old dist files from its cache and is unable to find them; instead of loading the new files, it throws an error in the console.
The error changes depending on the browser I am using.
EXAMPLE ERROR - Failed to load 'http://localhost/src/pbtsg/components/report/reports.js'. A ServiceWorker intercepted the request and encountered an unexpected error.
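A common way out (a sketch, not from the original post) is to version the service worker's cache per build, so stale entries for the old hashed folder get dropped. The BUILD_ID placeholder below is hypothetical; it is assumed that build.js substitutes it on each build:
// sw.js -- minimal illustrative sketch, not the project's actual worker.
// BUILD_ID is a hypothetical placeholder assumed to be replaced by build.js.
const CACHE_NAME = "app-cache-" + "BUILD_ID";

self.addEventListener("install", () => {
  // Activate the new worker immediately instead of waiting for old tabs to close.
  self.skipWaiting();
});

self.addEventListener("activate", (event) => {
  // Delete caches from previous builds so stale dist/src/<hash>/ paths
  // are no longer served or intercepted.
  event.waitUntil(
    caches.keys()
      .then((keys) => Promise.all(
        keys.filter((key) => key !== CACHE_NAME).map((key) => caches.delete(key))
      ))
      .then(() => self.clients.claim())
  );
});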

Nutch 1.4 with Solr 3.4 - can't crawl URL, "no URLs to fetch"

I followed a tutorial for web crawling with Nutch using Cygwin, Tomcat, Nutch 1.4, and Solr 3.4. I was already able to crawl a URL once, but somehow this doesn't work anymore, no matter which URL I try.
My regex-urlfilter.txt in runtime/local/conf is as following:
# skip file: ftp: and mailto: urls
-^(file|ftp|mailto):
# skip image and other suffixes we can't yet parse
# for a more extensive coverage use the urlfilter-suffix plugin
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$
# skip URLs containing certain characters as probable queries, etc.
-[?*!#=]
# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/
# accept anything else
 +^http://([a-z0-9]*\.)*nutch.apache.org/
The only URL in my seed.txt in runtime/local/bin/urls is http://nutch.apache.org/.
For crawling I use the command:
$ ./nutch crawl urls -dir newCrawl3 -solr http://localhost:8080/solr/ -depth 2 -topN 3
Console output is:
cygpath: can't convert empty path
crawl started in: newCrawl3
rootUrlDir = urls
threads = 10
depth = 2
solrUrl=http://localhost:8080/solr/
topN = 3
Injector: starting at 2017-05-18 17:03:25
Injector: crawlDb: newCrawl3/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: finished at 2017-05-18 17:03:28, elapsed: 00:00:02
Generator: starting at 2017-05-18 17:03:28
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: topN: 3
Generator: jobtracker is 'local', generating exactly one partition.
Generator: 0 records selected for fetching, exiting ...
Stopping at depth=0 - no more URLs to fetch.
No URLs to fetch - check your seed list and URL filters.
crawl finished: newCrawl3
I know there are a few similar questions, but most of them are not resolved. Can anyone help?
Thank you very much in advance!
Why are you using a Nutch version that is really, really old? Nevertheless, the problem you're facing is the space at the beginning of this line:
_+^http://([a-z0-9]*\.)*nutch.apache.org/
(I've highlighted the space with an underscore). Every line that starts with a space, \n, or # gets ignored by the configuration parser; take a look at:
https://github.com/apache/nutch/blob/master/src/plugin/lib-regex-filter/src/java/org/apache/nutch/urlfilter/api/RegexURLFilterBase.java#L258-L269
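Sketched below in JavaScript purely for illustration (the real implementation is the Java file linked above), the parsing loop behaves roughly like this:
// Illustrative sketch of the skipping rule, not the actual Nutch source.
function parseRules(text) {
  const rules = [];
  for (const line of text.split("\n")) {
    // Empty lines and lines whose first character is ' ', '\n' or '#'
    // are silently ignored, so a single leading space disables a rule.
    if (line === "" || " \n#".includes(line[0])) {
      continue;
    }
    const accept = line[0] === "+"; // '+' accepts matching URLs, '-' rejects them
    rules.push({ accept, regex: new RegExp(line.slice(1)) });
  }
  return rules;
}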
You can also try deleting the newCrawl3 directory: Nutch will not re-crawl a URL that has been crawled recently.

django liveservertestcase error

I am trying to run django tests in my app.
My app is "accounts" and after I run the tests using the command
"python manage.py test accounts "
it creates a test database which is apt. So the default port for testing is localhost:8081. But then when tests go the login function with the url "locahost:8081/accounts/login" I don't see the login page and it shows a blank page. I get the error saying
NoSuchElementException: Message: u'Unable to locate element: {"method":"id","selector":"id_username"}' ; Stacktrace: Method FirefoxDriver.prototype.findElementInternal_ threw an error in file:///tmp/tmpfJYYIZ/extensions/fxdriver#googlecode.com/components/driver_component.js
So I think it couldn't load the login page, and therefore couldn't find id_username (the id of the username text box on the login page).
I thought maybe it couldn't resolve "localhost:8081", so I added it to the django_site database table in setUpClass of tests.py. But it still didn't work.
I also tried adding "localhost:8081" to a fixture and including it in tests.py. That didn't help either.
Any ideas, please?

dojo require fails

See this fiddle. Upon running, it gives an error in the console. I'm currently on Chrome. Is this a bug?
Calling require(["dijit/tree"]) should load .../dijit/tree.js, but it gives a 404:
GET http://ajax.googleapis.com/ajax/libs/dojo/1.7.2/dijit//tree.js 404 (Not Found)
There should be only one /, but there are two!
You need to capitalize tree:
require(["dijit/Tree"])

MVC 4 bundling and minification - not getting 304 (Not Modified) when I refresh

I'm trying out MVC 4 Beta's bundling and minification through System.Web.Optimization. I was hoping that the site I'm using it for would receive a 304 (Not Modified) when I hit refresh.
I thought the point of the src to my js bundle, /desktop-js-bundle?v=D33JhbMl9LHkXSPBj1tfRkRI0lHeYVmbSMQqD59bXHg1 (with that version #), was that the version # changed only when one of the files in the bundle on the server was modified. Yet, every time I hit refresh and monitor the Network tab in Chrome's F12, it makes a request with that same version number and gets a 200 status.
Why doesn't it just return a 304? That would decrease the load and improve performance a decent amount. Thanks!
Why doesn't it just return 304?
Because when you hit F5 you force the browser to revalidate its cache. Basically, your test is flawed. You should put links to this bundle on different pages (using the <script> tag), then navigate to those pages via hyperlinks and observe the Network tab.
Also make sure you are running in Release mode.
UPDATE:
OK, after digging a little more, here's what I found out. The HTTP 200 status code is indeed always reported, which is normal, but the second time the bundle is fetched from the cache.
Here's the first request: the bundle comes from the server with HTTP cache response headers.
And here's the second request: the bundle is served from the cache. The entire line in the Network tab is grayed out, and the HTTP 200 status code is fictional => the client doesn't even send an HTTP request to the server, as it retrieves the bundle directly from its cache.
I can observe the same thing in Google Chrome for both the first and the second request.
I had the same issue, and the problem was with the Microsoft.AspNet.Web.Optimization package. As described at http://aspnetoptimization.codeplex.com/workitem/127, versions 1.1.2 - 1.1.3 are affected. After downgrading to 1.1.1 it works fine, and 304 is returned for unchanged resources after a refresh.
You can do it in the Package Manager Console with the following commands:
PM> Uninstall-Package "Microsoft.AspNet.Web.Optimization"
PM> Install-Package "Microsoft.AspNet.Web.Optimization" -Version 1.1.1