FAST Search for SharePoint crawler issue with DokuWiki pages - sharepoint-2010

My level of frustration is maxing out over crawling DokuWiki sites.
I have a content source in FAST Search for SharePoint that I have set up to crawl a dokuwiki/doku.php site. My crawler rule is set to http://servername/*, match case, include all items in this path, and crawl complex URLs. Testing the content source against the crawl rules shows that it will be crawled. However, the crawl always completes in under 2 minutes, having crawled only the page I pointed it to and none of the links on that page. I have checked with the DokuWiki admin and he has robots set to allow. When I look at the source of the pages I see that it says:
<meta name="robots" content="index,follow">
So, to test whether the other linked pages were the problem, I added those links to the content source manually and recrawled. For example, the source page has three links:
site A
site B
site C.
I added the Site A, B, and C URLs to the content source. The results of this crawl are 4 successes: the primary source page and the links A, B, and C that I added manually.
So my question is: why won't the crawler follow the links on the page? Is this something I need to configure in the crawler on my end, or does it have to do with how namespaces are defined and links are constructed in DokuWiki?
Any help would be appreciated.
Eric

Did you disable the delayed indexing and rel=nofollow options?
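If not, both are set in DokuWiki's conf/local.php. This is only an illustrative sketch for a standard install; the key names may vary by DokuWiki version:
<?php
// conf/local.php sketch for a standard DokuWiki install
$conf['indexdelay']  = 0;   // stop emitting "noindex" on recently created/edited pages
$conf['relnofollow'] = 0;   // stop adding rel="nofollow" to external links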

The issue was around authentication, even though nothing in the FAST crawl logs suggested that authentication was the problem.
The fix was adding a $freepass setting for the IP address of the search indexing server so that Apache would not go through the authentication process for each page hit.
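For anyone who hits the same wall without a $freepass hook available, the same idea can be expressed directly in the Apache config. This is only a sketch: it assumes the wiki sits behind Apache 2.2 basic auth, and the paths and indexing-server IP (10.0.0.50) are made up:
# vhost / .htaccess sketch with hypothetical paths and IP, adjust to your setup
<Directory "/var/www/dokuwiki">
    AuthType Basic
    AuthName "Wiki"
    AuthUserFile /etc/apache2/.htpasswd
    Require valid-user
    # Apache 2.2 style: requests from the indexing server skip authentication
    Order deny,allow
    Deny from all
    Allow from 10.0.0.50
    Satisfy Any
</Directory>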
Thanks for the reply
Eric

Related

How to avoid Google indexing my site before development is completed

I have a new site on a new domain; it will take about 2 months to complete development and then it will go live. Only then do I want Google to start crawling and indexing my site.
So the question is: how do I "shut off" Google indexing for these 2 months before going live?
Right now I plan to use this index.html:
<html>
  <head>
    <meta name="googlebot" content="noindex">
  </head>
  <body>UNDER CONSTRUCTION</body>
</html>
I will do the development in index.php; when it is done I will remove index.html, and then Googlebot will start indexing from index.php.
I don't know if this sounds like a good plan.
You can create a robots.txt file in your project's root directory and add the following to it:
User-agent: *
Disallow: /
So when a bot reaches your page it first checks the robots.txt file, and if a Disallow rule matches, it will not crawl your pages. You can read more about it here.

Test my website online

How can I test my website online (before it is public) so that no one can see it except me?
I know I can add a password, but I don't want Google to index it (before it's really public).
To prevent Google from indexing it, use this meta tag in your page's <head>:
<meta name="robots" content="noindex,nofollow" />
This tells search engines you do not wish for your page to show up in the search results.
Add a robots.txt file to your website.
User-agent: *
Disallow: /
Save the text above into a text file called robots.txt, and upload it to your website root.
By using this, any well-behaved crawler will not read your website, and any well-behaved search engine will not index it.
Depending on how your website is built (PHP, Python, Ruby), you will have to set up a web server. The most typical configuration is AMP, which stands for Apache, MySQL, and PHP/Python. (Typically this runs atop Linux, but it can be run under Windows too.)
Ruby on Rails comes with its own built-in web server. Most Python web frameworks (Django, CherryPy, Pylons) do too.
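For quick local testing before committing to a full stack, both PHP and Python ship a throwaway development server. A sketch, assuming PHP 5.4+ or Python 3 is installed and your site root is the current directory (development only, not for production):
# serve the current directory at http://localhost:8000
php -S localhost:8000
# or, for static files, Python's built-in server
python -m http.server 8000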

SharePoint 2010 not searching content across files

Having made sure that all the proper indexing options are set, my dev install of SP2010 is still not searching the content of Word docs, only the titles. Any suggestions?
Does your crawler account have sufficient permission to access the files attached to the list items? Are you crawling your site as a SharePoint site or as a web site (in the latter case you need to make sure that you have links pointing to the documents)?
Do you have a robots.txt file at the root of your web application that might contain exclusion rules preventing the content from being properly crawled?
If you really want to know what's happening while the crawler is doing its job, you can install Fiddler on your dev machine and point the proxy settings of your Search Service Application at the proxy created by Fiddler. Doing so will let you check in real time which URL / content is currently being crawled and which HTTP status codes are being returned, to diagnose permissions or content issues.
Hope it helped.

Stop Google from Indexing files from our second server

I have been scouring the web for the right terms for this question, but after a few hours I decided to post it here.
The scenario is:
We have a website running on two servers, so the files/website are synchronized between these two servers. The second server is for internal purposes. Let's call the first server www and the second server ww2. ww2 is automatically updated once the files are updated on www.
Now, Google is indexing ww2, which I want to stop; only www should be crawled and indexed. My questions are:
1. How can I get the pages already crawled on ww2 removed from Google's index?
2. How can I stop Google from indexing ww2?
Thanks guys!
You can simply use robots.txt to disallow indexing. There is also a robots meta tag that Google obeys.
For your first question, Google has a removal tool in its Webmaster Tools; see "Remove page or site from Google's search results" for more info.
For the second question, you can either use a robots.txt file to block Google from crawling ww2 (see "Blocking Google" for more info) or restrict access to that server.
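One wrinkle with robots.txt here: since ww2 is synchronized from www, a file dropped into the shared web root would end up on both servers. A host-level header sidesteps that. A sketch assuming ww2 runs Apache with mod_headers enabled; the server name and paths are hypothetical:
# in ww2's vhost only, never in www's configuration
<VirtualHost *:80>
    ServerName ww2.example.com
    DocumentRoot /var/www/site
    # tell Google and other engines not to index anything served from this host
    Header set X-Robots-Tag "noindex, nofollow"
</VirtualHost>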

Is there a webcrawler that can download an entire site?

I need to know if there is a crawler/downloader that can crawl and download an entire website to a link depth of at least 4 pages. The site I am trying to download has JavaScript hyperlinks that are rendered only by a browser, so a crawler is unable to follow these hyperlinks unless it renders them itself.
I've used Teleport Pro and it works well.
MetaProducts Offline Explorer claims to do what you need.
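For the parts of a site reachable through plain HTML links, wget's recursive mode can also handle a depth-4 download. A sketch, with the caveat that wget does not execute JavaScript, so the browser-rendered links mentioned in the question still need one of the tools above:
# mirror up to 4 links deep, pulling page assets and rewriting links for offline viewing
wget --recursive --level=4 --page-requisites --convert-links --no-parent http://example.com/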