How do I prevent GoogleBot from finding acquisition URLs? - zope

I have apache in front of zope 2 (multiple virtual hosts) using the standard simple rewrite rule.
I am having big issues with some of the old sites I host and googlebot.
Say I have:
site.example.com/documents/
site.example.com/images/i.jpg
site.example.com/xml/
site.example.com/flash_banner.swf
How do I stop the following from happening?
site.example.com/documents/images/xml/i.jpg
site.example.com/images/xml/i.jpg
site.example.com/images/i.jpg/xml/documents/flash_banner.swf
All of these respond with the correct object for the last folder segment of the URI. The old sites were not written very well, and in some cases Google is crawling in and out of hundreds of permutations of folder structures that don't exist but always finding large Flash files. So instead of Googlebot hitting the flash file once, it's dragging it off the site thousands of times. I am in the process of moving the old sites to Django, but I need to put a halt to it in Zope. In the past I have tried ipchains and mod_security, but they are not an option this time around.

Find out what page is providing Google all the variant paths to the same objects. Then fix that page so that it only provides the canonical paths, using the absolute_url(), absolute_url_path(), or virtual_url_path() methods of traversable objects.
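For example, if the offending page builds its links with a Script (Python), a minimal sketch along these lines keeps the URLs canonical (the "documents" folder name is taken from the question; the script name is made up):

    ## Script (Python) "list_documents"
    # Build links from absolute_url() so acquisition paths like
    # /documents/images/xml/i.jpg are never emitted into the page.
    links = []
    for doc in context.documents.objectValues():
        links.append('<a href="%s">%s</a>' % (doc.absolute_url(), doc.getId()))
    return '\n'.join(links)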
You could also use sitemap.xml or robots.txt to tell Google not to spider the wrong paths, but that's a workaround rather than a fix, unlike the above.
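A robots.txt sketch of that workaround might look like this, assuming the folder names from the question and Googlebot's support for the * wildcard (a Google extension, not part of the original robots.txt spec):

    User-agent: Googlebot
    # block the acquired permutations but leave the real top-level folders alone
    Disallow: /*/documents/
    Disallow: /*/images/
    Disallow: /*/xml/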

Related

Scrapy - making sure I get all the pages from a domain / how to tell I didn't / what to do about it?

I have a pretty generic spider that I do broad crawls with. I feed it a couple hundred starting URLs, limit the allowed_domains and let it go wild (I'm following the suggested 'Avoiding getting banned' measures like auto-throttle, no cookies, rotating user agents, rotating proxies, etc.).
Everything had been going smoothly until about a week ago, when the batch of starting URLs included a pretty big, well-known domain. Fortunately I was monitoring the scrape at the time and noticed that the big domain just "got skipped". When I looked into why, it seemed that the domain recognized I was using a public proxy and 403ed my initial request to 'https://www.exampledomain.com/', so the spider found no URLs to follow and nothing was scraped for that domain.
I then tried using a different set of proxies and/or VPN and that time I was able to scrape some of the pages but got banned shortly after.
The problem is that I need to scrape every single page down to three levels deep; I cannot afford to miss a single one. And as you can imagine, missing a request at the start or first level can mean missing thousands of URLs, or no URLs being scraped at all.
When a page fails on the initial request, it is pretty straightforward to tell something went wrong. But when you scrape thousands of URLs from multiple domains in one go, it's hard to tell whether any got missed. And even if I do notice the 403s and realize I got banned, the only option at that point seems to be crossing my fingers and running the whole domain again, since I can't tell whether the URLs I missed due to 403s (and all the URLs I would have reached from deeper levels) were picked up through other pages that link to the 403ed URL.
The only thing that comes to mind is to SOMEHOW collect the failed URLs, save them to a file at the end of the scrape, make them the start_urls and run the scrape again. But that would re-scrape all of the pages that were scraped successfully the first time. Preventing that would mean somehow passing in a list of successfully scraped URLs and setting them as denied. And even that isn't a be-all-end-all solution, since there are pages that 403 even when you are not banned, like resources you need to be logged in to see, etc.
TLDR: How do I make sure I scrape all the pages from a domain? How do I tell I didn't? What is the best way of doing something about it?
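As a sketch of the "collect the failed URLs and feed them back in" idea described above, Scrapy's errback hook can record everything that 403ed or timed out so a follow-up run can retry just those (the file name and spider details here are made up, not part of the question):

    from scrapy import Request, Spider
    from scrapy.spidermiddlewares.httperror import HttpError

    class BroadSpider(Spider):
        name = "broad"
        start_urls = ["https://www.exampledomain.com/"]  # fed in as in the question

        def __init__(self, *args, **kwargs):
            super().__init__(*args, **kwargs)
            self.failed_urls = []

        def start_requests(self):
            for url in self.start_urls:
                yield Request(url, callback=self.parse, errback=self.on_error)

        def parse(self, response):
            yield {"url": response.url}
            for href in response.css("a::attr(href)").getall():
                yield response.follow(href, callback=self.parse, errback=self.on_error)

        def on_error(self, failure):
            # 403s and other non-2xx responses arrive here as HttpError;
            # DNS errors and timeouts carry the request instead
            if failure.check(HttpError):
                self.failed_urls.append(failure.value.response.url)
            elif getattr(failure, "request", None) is not None:
                self.failed_urls.append(failure.request.url)

        def closed(self, reason):
            # dump the failures so the next run can use them as start_urls
            with open("failed_urls.txt", "w") as f:
                f.write("\n".join(self.failed_urls))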

Force a page to be re-downloaded, rather than fetched from browser cache - Apache Server

I've made a minor text change to our website. The change is minor in that it's only a couple of words, but the meaning is quite significant.
I want all users (both new and returning) to see the new text rather than any cached version. Is there a way I can force a user (new or returning) to re-download the page, rather than fetch it from their browser cache?
The site is a static html site hosted on a LAMP server.
This depends entirely on how your web server has caching set up, but in short: if the page is already cached, you cannot force a re-download until the cache expires. So you'll need to look at the cache headers in your browser's developer tools to see how long the expiry has been set for.
Caching gives huge performance benefits and, in my opinion, really should be used. However, it does mean you have difficulty forcing a refresh, as you've discovered.
In case you're interested in how to handle this in the future, there are various cache-busting methods, all of which basically involve changing the URL to fool the browser into thinking it's a different resource, forcing a fresh download.
For example, you can add a version number to a resource, so that instead of requesting index.html the browser asks for index2.html; but that could mean renaming the file, and every reference to it, each time.
You can also set up rewrites in Apache using regular expressions, so that index[0-9]*.html actually loads index.html. Then you don't need multiple copies of the file but can refer to it as index2.html, index3.html, or even index5274.html, and Apache will always serve the contents of index.html.
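A minimal .htaccess sketch of that rewrite (assuming mod_rewrite is enabled):

    RewriteEngine On
    # index2.html, index3.html, index5274.html ... all serve the one real index.html
    RewriteRule ^index[0-9]*\.html$ index.html [L]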
These methods, though a little complicated to maintain unless you have an automated build process, work very well for resources users don't see directly, such as CSS stylesheets or JavaScript files.
Cache-busting techniques work less well for the HTML pages themselves, for a number of reasons: 1) they create unfriendly URLs, 2) they cannot be used for default files where the file name itself is not specified (e.g. the home page), and 3) without changing the source page, the browser can't pick up the new URLs. For these reasons some sites turn off caching for HTML pages entirely, so they are always reloaded.
Personally I think not caching HTML pages is a lost opportunity. Visitors often hit a site's home page, try a few pages, and return to the home page in between; with no caching, the home page is reloaded every time even though it has probably not changed. So I prefer to set a short expiry and live with the fact that I can't force a refresh during that window.
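In Apache that short expiry might look something like the following, assuming mod_headers is available (the five-minute lifetime is an arbitrary choice):

    <FilesMatch "\.html$">
        # short-lived cache for pages; fingerprinted css/js can be cached much longer
        Header set Cache-Control "max-age=300"
    </FilesMatch>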

Blocking web spider

I want to block all web spiders from vacuuming up my web site.
Is there a way?
I only found some Apache rules from 2008, like this one:
http://perishablepress.com/ultimate-htaccess-blacklist/
Unfortunately, there isn't a way to block ALL scripts from accessing your site: if a human can view a page, nothing prevents someone from writing a spider that behaves just like a person and can therefore view every page.
You can look up techniques to prevent certain robots from accessing your site (you can do it for the majority of search engines), but if the site stays up long enough and gets some visits, it's likely to end up in somebody's database eventually.
Have a look here.
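For what it's worth, polite spiders can be turned away with robots.txt alone, and blacklists like the one linked above boil down to user-agent matching in .htaccess; a rough sketch of both (the agent names are just examples, and a determined scraper can spoof them):

    # robots.txt - honoured only by well-behaved bots
    User-agent: *
    Disallow: /

    # .htaccess - assuming mod_rewrite; refuse a few known scraper user agents
    RewriteEngine On
    RewriteCond %{HTTP_USER_AGENT} (HTTrack|WebCopier|wget) [NC]
    RewriteRule .* - [F,L]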

Should a sitemap have *every* url

I have a site with a huge number (well, thousands or tens of thousands) of dynamic URLs, plus a few static URLs.
In theory, due to some cunning SEO linkage on the homepage, it should be possible for any spider to crawl the site and discover all the dynamic URLs via a spider-friendly search.
Given this, do I really need to worry about expending the effort to produce a dynamic sitemap index that includes all these URLs, or should I simply ensure that all the main static URLs are in there?
The actual way in which I would generate this isn't a concern - I'm just questioning the need to actually do it.
Indeed, the Google FAQ on this (and yes, I know they're not the only search engine!) recommends including URLs in the sitemap that might not be discovered by a crawl. Based on that, if every URL on your site is reachable from another, surely the only URL you really need as a baseline in your sitemap for a well-designed site is your homepage?
If there is more than one way to get to a page, you should pick one main URL for each page that contains the actual content, and put those URLs in the sitemap. That is, the sitemap should contain links to the actual content, not every possible URL that leads to the same content.
Also consider putting canonical meta tags in the pages, referencing this main URL, so that spiders can recognise a page even when it's reachable through different dynamic URLs.
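The canonical tag is a single line in each page's <head>, pointing at whichever URL you picked as the main one (the URL here is a placeholder):

    <link rel="canonical" href="http://example.com/products/view?id=123" />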
Spiders only spend a limited time searching each site, so you should make it easy to find the actual content as soon as possible. A site map can be a great help as you can use it to point directly to the actual content so that the spider doesn't have to look for it.
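A minimal sitemap.xml along those lines lists only the chosen main URLs (placeholders again):

    <?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <url><loc>http://example.com/</loc></url>
      <url><loc>http://example.com/products/view?id=123</loc></url>
    </urlset>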
We have had pretty good results using these methods, and Google now indexes 80-90% of our dynamic content. :)
In an SO podcast they talked about limits on the number of links you can include/submit in a sitemap (around 500 per page, with a page limit based on PageRank?) and how you would need to break them up over multiple pages.
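Breaking a large sitemap over multiple files is what a sitemap index is for; a minimal sketch with placeholder file names:

    <?xml version="1.0" encoding="UTF-8"?>
    <sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <sitemap><loc>http://example.com/sitemap-static.xml</loc></sitemap>
      <sitemap><loc>http://example.com/sitemap-dynamic-1.xml</loc></sitemap>
    </sitemapindex>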
"Given this, do I really need to worry about expending the effort to produce a dynamic sitemap index that includes all these URLs, or should I simply ensure that all the main static URLs are in there?"
I was under the impression that the sitemap wasn't necessarily about disconnected pages but rather about increasing the crawling of existing pages. In my experience, when a site includes a sitemap, minor pages, even when prominently linked to, are more likely to appear in Google results. Depending on your site's pagerank, inbound links, etc., this may be less of an issue.

Google Page Rank - New Domain / Link Structure Migration

I've been tasked with reorganizing a pure HTML site into a CMS. If all goes well, the new site will eventually become the main URL, and the old domain will be phased out. The old domain has a decent enough page rank, and the company wishes to mitigate any loss of it. In looking over the options available, I've discovered a few things:
It's better to use a 301 redirect when you're ready to make the switch (source).
The current site does not have a sitemap, so adding one and submitting it may help their future page rank.
I'll need to suggest that they contact the people currently linking to them and ask them to update their links.
The process for regaining an old page rank takes a while, so plan on rebuilding links while we see whether the new site is flexible enough to warrant switching over completely.
My question is: as a result of the move to a CMS-driven site, the links to various pages will change to accommodate the new structure. Will this be an issue for trying to maintain (or improve) the current page rank? What sort of methods are available to mitigate the issue of changing individual page URLs? Is there a preferable method beyond mapping individual pages to their new locations with 301 redirects? (The site has literally hundreds of pages, ugh...)
ex.
http://domain.com/Messy_HTML_page_with_little_categorization.html ->
http://newdomain.com/nice/structured/pages.php
I realize this isn't strictly a programming question; however, I felt the information could be useful to developers who are tasked with handling this sort of thing in addition to developing the site.
If you truly want to ensure that page rank is not lost, you will want to replace the old content with something that performs a proper 301 redirect to the new location. With a 301 redirect the search spiders will know that the content has moved, and the page rank typically carries over. It also helps with external links.
However, the downside is that after a certain period of time you just have to get rid of the old domains.
You can make a handler for HTML files and map the old pages to the new structure with a 301 redirect.
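For a handful of pages, plain Redirect directives on the old domain are enough; for hundreds, a RewriteMap lookup file in the old domain's vhost config is easier to maintain. A sketch under those assumptions (the map file path is a placeholder, and each map entry is "old-path new-url" with the leading slash on the old path):

    # mod_alias, one line per page
    Redirect 301 /Messy_HTML_page_with_little_categorization.html http://newdomain.com/nice/structured/pages.php

    # mod_rewrite, driven by the lookup file for the remaining pages
    RewriteEngine On
    RewriteMap oldnew txt:/etc/apache2/old-to-new.map
    RewriteCond ${oldnew:%{REQUEST_URI}} ^(.+)$
    RewriteRule ^ %1 [R=301,L]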