SEO, Google Webmaster Tools - How can I generate a 404 crawl error report for bad URLs that are in the sitemap? - seo

I have an automatically generated sitemap for a large website which contains a number of URLs that cause 404 errors, and which I need to remove. I need to generate a report covering only the URLs that are in the sitemap, not crawl errors caused by bad links elsewhere on the site. I cannot see any way of filtering the crawl error reports to include only these URLs. Does anyone know of a way I can achieve this?
Thanks

I'm not sure you can do it easily from Webmaster Tools, but it is trivial to check them all yourself. Here is a Perl program that accepts a sitemap file and checks each URL it contains, printing each URL along with its HTTP status.
#!/usr/bin/perl
use strict;
require LWP::UserAgent;

my $ua = LWP::UserAgent->new;
while (my $line = <>) {
    # Pull the URL out of each <loc>...</loc> element in the sitemap
    if ($line =~ /<loc>(.*?)<\/loc>/) {
        my $url = $1;
        my $response = $ua->get($url);
        my $status = $response->status_line;
        $status =~ s/ .*//;    # keep only the numeric status code
        print "$status $url\n";
    }
}
I save it as checksitemap.pl and use it like this:
$ /tmp/checksitemap.pl /tmp/sitemap.xml
200 http://example.com/
404 http://example.com/notfound.html
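To see only the broken URLs, the output can be filtered (assuming a Unix-like shell with grep available):
$ /tmp/checksitemap.pl /tmp/sitemap.xml | grep '^404'
404 http://example.com/notfound.html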

Nothing natively within WMT. You'll want to do some work in Excel:
1) Download the list of busted links.
2) Get your list of sitemap links.
3) Put them side by side.
4) Use a VLOOKUP to match the columns (http://www.techonthenet.com/excel/formulas/vlookup.php); a sample formula is sketched below.
As a bonus, use some conditional formatting to make it easier to see whether they match, then sort by colour.
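For illustration only (the sheet name and column layout here are assumptions, not part of the answer above): with the crawl-error URLs in column A of the current sheet and the sitemap URLs in column A of a sheet named Sitemap, a formula like this marks which errors come from the sitemap:
=IF(ISNA(VLOOKUP(A2, Sitemap!A:A, 1, FALSE)), "not in sitemap", "in sitemap")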

You can also import the sitemap.xml into A1 Website Analyzer and let it scan them. See:
http://www.microsystools.com/products/website-analyzer/help/crawl-website-pages-list/
After that, you can filter the scan results by e.g. the 404 response code and export them to CSV if need be (including, if you want, the pages they are linked from).

Related

VBA url download and renaming is unsuccessful

Using VBA, I am trying to download images from a URL, rename them, and save them to a folder.
I have found code that does this, but it seems that any "name" with a "/" in it won't download.
Is this possible? Is there a way around it?
I have tried the code from the link Downloading Images from URL and Renaming
No error messages are provided. The images simply won't download.
Files are not allowed to have a / in their name; that has always been the case on Windows. The code works unless the name contains one of the characters that are not allowed: / \ : * ? " < > |. One workaround is to strip or replace those characters before saving, as sketched below.
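A minimal sketch (the function name is mine, not from the original code): replace the disallowed characters before using the string as a file name.
' Replace characters that Windows does not allow in file names with an underscore.
Function SanitizeFileName(ByVal fname As String) As String
    Dim badChars As Variant, ch As Variant
    badChars = Array("/", "\", ":", "*", "?", """", "<", ">", "|")
    For Each ch In badChars
        fname = Replace(fname, ch, "_")
    Next ch
    SanitizeFileName = fname
End Function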

Output to stderr in REBOL2?

I am trying to get my CGI scripts running on my web host (which runs on FreeBSD). To debug why I keep getting the dreaded "premature end of script headers" error, their support recommended that I redirect all my output to stderr, rather than printing it. Looking up how to do this, I came across a very old RAMBO ticket about it, but it looks like it was never implemented.
Per some of the answers to this question, it seems like I should be able to do a call {echo Hello, world >&2} to achieve this, but it doesn't work.
How can I write to stderr in REBOL2?
For my CGI-specific scenario, I have a truly awful workaround. Since writing to stderr in Perl (with which I am entirely unfamiliar) is a one-liner, I'm currently calling the REBOL script from Perl and printing its output to stderr from there:
#!/usr/bin/perl
use strict;
use warnings;
use CGI;
# Note the backticks
my $the_string = `/home/public/rebol -csw test-reb.cgi`;
print STDERR $the_string;
This page has some suggestions for solving your real problem: http://www.liquidweb.com/kb/apache-error-premature-end-of-script-headers/
Perhaps you did not print the HTTP headers as the very first thing in your script; that must come before any other output. Maybe the file permissions are insufficient, or the .r file type was not added to your .htaccess as a CGI-capable type. Perhaps your (otherwise correct!) REBOL core executable does not have the right permissions. Or does your script end up in an endless loop?
Some hints on redirecting errors for a REBOL CGI script:
http://www.rebol.com/docs/core23/rebolcore-2.html#section-6.2
Better late than never... I've just implemented it for Rebol3 in my Rebol fork.
https://github.com/Oldes/Rebol-issues/issues/2468
The syntax will probably change a little, because I don't like that the system console port is named input although it is not only for input.
So far it is:
print 1 ;<- std_out
modify system/ports/input 'error on
print 2 ;<- std_err
modify system/ports/input 'error off
print 3 ;<- std_out

How to set and get variables when working in cmd

I am working in cmd to send HTTP GET and POST requests with cURL.
There are many times when I am sending requests to the same pages, and typing the URLs out every time is a huge pain.
I'm trying to figure out how to use set= so that I can save these URLs for each time I want to use them.
I've tried
C:\>set page = "http://www.mywebpage.com/api/user/friends"
C:\>page
'page' is not recognized as an internal or external command,
operable program or batch file.
C:\>echo %page%
%page%
but it won't return the page name.
How can I accomplish what I need?
C:\Windows\system32>set page="http://www.mywebpage.com/api/user/friends"
C:\Windows\system32>echo %page%
"http://www.mywebpage.com/api/user/friends"
C:\Windows\system32>set page=http://www.mywebpage.com/api/user/friends
C:\Windows\system32>echo %page%
http://www.mywebpage.com/api/user/friends
Don't use spaces around =. Pick the version with or without quotes according to your needs; the variable value may itself contain spaces:
C:\Windows\system32>set page=http://www.mywebpage.com/api/user/my friends
C:\Windows\system32>echo %page%
http://www.mywebpage.com/api/user/my friends
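Once the variable is set, it can be used in a cURL command, for example (assuming curl is available on the PATH):
C:\Windows\system32>curl %page%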
You are setting the value "http://www.mywebpage.com/api/user/friends" inside the variable "page " (notice the trailing space), because you have a space before the =.
So you can either retrieve the value with %page % or use set page="http://..." with no space between page and the equals sign.

How to use the @ symbol in HTML in a CGI script

Surely a very simple question, but I can't seem to find the terminology to find the answer in a search!
I'm using a file-uploader CGI script. Inside the CGI script is some code that generates some HTML. In the HTML I need to put an email address using the @ symbol; however, this breaks the script. What is the correct way to escape the @ symbol in a CGI script?
The error when using the @ symbol is:
"FileChucker: load_external_prefs(): Error processing your prefs file ('filechucker_prefs.cgi'): Global symbol "@email" requires explicit package name at (eval 16) line 1526."
Many thanks for any help
Update:
Hi all, many thanks for the replies. I guess it is Perl (which shows my ignorance of what's going on here perfectly!). The code below shows the problem: the @ in 'email@domain.com'.
$PREF{app_output_template} = qq`
%%%ifelse-onpage_uploader%%%
<div id="fcintro">If you're using a mobile or tablet and have problems uploading, we recommend emailing your CV to: email@domain.com<br><span class="upload_limits">We can accept Adobe PDF, Microsoft Word and all popular image and text file types. (max total upload size: 7MB)</span></div>
%%%else%%%
%%subtitle%%
%%%endelse-onpage_uploader%%%
%%app_output_body%%
Try &#64; instead of @ (64 is the ASCII code for the @ character).
Reference: http://www.w3schools.com/tags/ref_ascii.asp
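As a minimal sketch (the variable names are mine, not from the script above): inside an interpolated qq string, Perl treats @ as the start of an array, so the @ either needs a backslash, or the HTML entity &#64; can be used and the browser will render it as @.
#!/usr/bin/perl
use strict;
use warnings;

# Backslash the @ so Perl does not try to interpolate an array variable:
my $escaped = qq`Email us at: email\@domain.com`;

# Or use the HTML entity; browsers render &#64; as @:
my $entity = qq`Email us at: email&#64;domain.com`;

print "$escaped\n$entity\n";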

Is there any way to generate a set of JWebUnit tests from an apache rewrite config?

Seems unlikely, but is there any way to generate a set of unit tests for the following rewrite rule:
RewriteRule ^/(user|group|country)/([a-z]+)/(photos|videos)$ http://whatever?type=$1&entity=$2&resource=$3
From this I'd like to generate a set of urls of the form:
/user/foo/photos
/user/bar/photos
/group/baz/videos
/country/bar/photos
etc...
The reason I don't want to just do this once by hand is that I'd like the bounded alternation groups (e.g. (user|group|country)) to be able to grow and maintain coverage without having to update the tests by hand.
Is there a rewrite rule or regex parser that might be able to do this, or am I doing it by hand?
If you don't mind hacking a few lines of Perl, there's a package, Regexp::Genex, that you can use to generate something close to what you require, e.g.
# perl -MRegexp::Genex=:all -le 'print for strings(qr/\/(user|group|country)\/([a-z]+)\//)'
/user/dxb/
/user/dx/
/user/d/
/group/xd/
/group/x/
# perl -MRegexp::Genex=:all -le 'my $re=qr/\/(user|group|country)\/([a-z]+)\/(phone|videos)/;$Regexp::Genex::DEFAULT_LEN = length $re;print for strings($re)'
/user/mgcgmccdmgdmmzccgmczgmzzdcmmd/phone
/user/mgcgmccdmgdmmzccgmczgmzzdcmm/phone
/user/mgcgmccdmgdmmzccgmczgmzzdcm/phone
/user/mgcgmccdmgdmmzccgmczgmzzdc/phone
...
/group/gg/videos
/group/g/phone
/group/g/videos
/country/jvmmm/phone
/country/jvmmm/videos
/country/jvmm/phone
/country/jvmm/videos
/country/jvm/phone
/country/jvm/videos
/country/jv/phone
/country/jv/videos
/country/j/phone
/country/j/videos
#
Note:
1) You'll need to write a wrapper to parse the source file, tokenise (extract) the source patterns, escape certain characters in the rule (e.g. "/"), and possibly split your rules into more manageable parts before expanding them via Genex and outputting the results in the desired format. A rough sketch of such a wrapper follows after these notes.
2) To install the module type: cpan Regexp::Genex
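A rough sketch of the wrapper (the handling of anchors and the output format are my assumptions, not part of the answer above): read an Apache config file, pull the pattern out of each RewriteRule line, and expand it via Genex.
#!/usr/bin/perl
use strict;
use warnings;
use Regexp::Genex qw(:all);

while (my $line = <>) {
    # Grab the pattern (first argument) of each RewriteRule directive
    next unless $line =~ /^\s*RewriteRule\s+(\S+)/;
    my $pattern = $1;
    $pattern =~ s/^\^//;   # drop the leading ^ anchor
    $pattern =~ s/\$$//;   # drop the trailing $ anchor
    print "$_\n" for strings(qr/$pattern/);
}
Invoked as, say, perl genexurls.pl httpd.conf (the script name is made up), it prints one generated URL per line.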