Dealing with errors generated by spiders in a Rails app - ruby-on-rails-3

I'm seeing a lot of exceptions in an app that was converted from an off-the-shelf ecommerce site about a year ago, whenever a spider hits routes that no longer exist. There aren't many of these routes, but various spiders hit them, sometimes multiple times a day. I've blocked the worst offenders (mostly garbage spiders), but obviously I can't block Google and Bing. There are too many URLs to remove manually.
I'm not sure why the app doesn't return a 404. My guess is that one of the routes is catching these URLs and trying to generate a view, but since the resource is missing the lookup returns nil, which is what throws the error. Like this:
undefined method `status' for nil:NilClass
app/controllers/products_controller.rb:28:in `show'
Again, this particular product is gone, so I'm not sure why the app didn't return the 404 page. Instead it tries to generate the view even though the resource doesn't exist: it checks whether the (nil) resource has a public status, and the error is thrown.
If I rescue from ActiveRecord::RecordNotFound, will that do it? It's kind of hard to test, as I have to wait for the various bots to come through.
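Roughly what I have in mind (untested sketch; the handler name is a placeholder, and it assumes the show action uses a finder that actually raises when the record is missing, rather than one that returns nil):

# app/controllers/application_controller.rb
class ApplicationController < ActionController::Base
  # Turn missing-record lookups into a real 404 instead of letting a nil
  # record blow up later in the view.
  rescue_from ActiveRecord::RecordNotFound, :with => :render_not_found

  private

  def render_not_found
    render :file => "#{Rails.root}/public/404.html", :status => 404, :layout => false
  end
end

# app/controllers/products_controller.rb
class ProductsController < ApplicationController
  def show
    # Product.find raises ActiveRecord::RecordNotFound for a missing id,
    # unlike the find_by_* finders, which quietly return nil.
    @product = Product.find(params[:id])
  end
end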
I also have trouble with some links that rely on a cookie being set for tracking; if the cookie isn't set, the app sets it before processing the request. That doesn't seem to work with the spiders. I've marked those links as nofollow, but not all spiders seem to honor that.
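The cookie handling is roughly along these lines (an illustration of the pattern only, with placeholder names, not the actual code):

# in ApplicationController
before_filter :ensure_tracking_cookie

def ensure_tracking_cookie
  # Spiders generally don't send cookies back, so for them this ends up
  # running on every request.
  cookies[:tracking_id] ||= SecureRandom.hex(16)
end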

For your first question, about the 404 page:
Take a look at this post, I'm sure it will help you.

Related

Using Shopify 404 page for content

Shopify is quite restrictive about the ways you can structure directories. For example, all pages must have a URL that looks like "my-store.com/pages/my-page".
While there is no way around this in Shopify, I considered a workaround that would work like this:
Use JavaScript to check the queried URL when displaying the 404 page.
If the queried URL matches "my-url", connect to the WordPress REST or GraphQL API, run the query, and then render the desired content on the page.
For example, my-site.com/blog would return a 404 error, but JavaScript would run a function to fetch the content whenever the URL ends in "/blog".
Although this would work from a technical point of view, I understand the server would still be returning a 404 status, and that probably has wider implications. To what extent is this the case, and does it make this an unviable solution?
A really interesting idea.
The biggest issue I see is SEO: the URLs will still point to the 404 page and you won't be able to show the proper content with Liquid, so all of these pages will pull the 404 content and show up as 404 pages in Google search.
That said, I don't see any other major issues that would prevent you from doing this with JS. It really depends on how many types of pages need this logic and how the JS is written, but as an idea I really like the possibility.
I probably wouldn't recommend it to a client who wants an SEO-optimized site, but for a personal one it seems like an interesting idea.

Scrapy - making sure I get all the pages from a domain / how to tell I didn't / what to do about it?

I have a pretty generic spider that I use for broad crawls. I feed it a couple hundred starting URLs, limit the allowed_domains, and let it go wild (I'm following the suggested 'avoiding getting banned' measures like auto-throttle, no cookies, rotating user agents, rotating proxies, etc.).
Everything had been going smoothly until about a week ago, when the batch of starting URLs included a pretty big, well-known domain. Fortunately, I was monitoring the scrape at the time and noticed that the big domain just "got skipped". When I looked into why, it turned out the domain recognized I was using a public proxy and returned a 403 for my initial request to 'https://www.exampledomain.com/', so the spider found no URLs to follow and therefore no URLs were scraped for that domain.
I then tried a different set of proxies and/or a VPN, and that time I was able to scrape some of the pages, but I got banned shortly after.
The problem is that I need to scrape every single page down to 3 levels deep; I cannot afford to miss a single one. And, as you can imagine, missing a request at the start URL or first level can mean missing thousands of URLs, or no URLs being scraped at all.
When a page fails on the initial request, it's pretty straightforward to tell something went wrong. But when you scrape thousands of URLs from multiple domains in one go, it's hard to tell whether any were missed. And even if I do notice the 403s and the ban, the only option at that point seems to be to cross my fingers and run the whole domain again, since I can't tell whether the URLs I missed because of the 403s (and all the URLs the deeper levels would have produced) were picked up via other pages that linked to the 403ed URL.
The only thing that comes to mind is to somehow collect the failed URLs, save them to a file at the end of the scrape, make them the start_urls, and run the scrape again. But that would re-scrape all the pages that were already scraped successfully. Preventing that would require somehow passing in a list of successfully scraped URLs and treating them as denied. And even that isn't a be-all-and-end-all solution, since there are pages that will return 403 even when you're not banned, like resources you need to be logged in to see, etc.
TLDR: How do I make sure I scrape all the pages from a domain? How can I tell if I didn't? And what is the best way to do something about it?

Facebook App in Page Tab receiving signed_request but missing page data

I have a page tab app that I am hosting, with both HTTP and HTTPS supported. While I receive a signed_request package as expected, after I decode it, it does not contain the page information. That data is simply missing.
I verified that matching schemes (HTTPS) are used across Facebook, my hosted site, and even the 'go-between', Facebook's static page handler.
I also created a new application with page tab support, but got the same result: no page information in the signed_request.
Any other causes people can think of?
I add the app to the page tab using this link:
https://www.facebook.com/dialog/pagetab?app_id=176236832519816&next=https://www.intelligantt.com/Facebook/application.html
Here is the page tab I am using (Note: requires permissions):
https://www.facebook.com/pages/School-Auction-Test-2/154869721351873?id=154869721351873&sk=app_176236832519816
Here is the decoded signed_request I am receiving:
{"algorithm":"HMAC-SHA256","code":!REMOVED!,"issued_at":1369384264,"user_id":"1218470256"}
5/25 Update - I thought maybe the canvas app URLs didn't match the page tab URLs, so I spent several hours going through scenarios where they both had a trailing slash or not, where they both had a trailing ? or not, and with query parameters or not.
I also tried changing the 'next' value when creating the page tab to the canvas app URL and to the page tab URL.
No success on either count.
I did read that seeing the 'code' value in the signed_request means Facebook either couldn't match my URLs or that I'm capturing the second request. However, given all the URL permutations I went through, I believe the URLs do match. I also subscribed to 'auth.authResponseChange', which should give me the very first authResponse, which should contain the signed_request with page.id in it (but it doesn't).
If I had any reputation, I'd add a bounty to this.
Thanks.
I've just spent ~5 hours on this exact same problem and posted a prior answer that was incorrect. Here's the deal:
As you pointed out, signed_request appears to be missing the page data if your tab is implemented in pure JavaScript as a static HTML page (with a *.htm extension).
I repeated the exact same test, on the exact same page, but wrapped my HTML page (including the JS) in a Perl script (with a *.cgi extension)... and voila, signed_request has the page info.
Although confusing (and it should be better documented as a design choice by Facebook), this may make some sense, because it would be impossible to validate the signed_request wholly within JavaScript without placing your secret key in scope (and therefore revealing it to a potential attacker).
It would be much easier with the PHP SDK, but if you just want to use JavaScript, maybe this will help:
Facebook Registration - Reading the data/signed request with Javascript
Also, you may want to check out this: https://github.com/diulama/js-facebook-signed-request
Simply put, you can't get the full parameters from the JavaScript signed_request; use the PHP SDK to get the full signed_request, and record the values you need into JavaScript variables.
With the PHP SDK, after instantiation, use the Facebook object as follows:
$signed_request = $facebook->getSignedRequest();
var_dump($signed_request);
This is just for debugging, but you'll see that the printed array contains many values that you won't get from the JS SDK, for security reasons.
Hope that helps anyone who needs it, since it seems this issue takes at least 3 hours for everyone who runs into it.

Block Google from indexing some pages of a site

I have a problem with lots of 404 errors on one site. I figured out that these errors are happening because Google is trying to find pages that no longer exist.
Now I need to tell Google not to index those pages again.
I found some solutions on the internet about using a robots.txt file. But this is not a site that I built; I just need to fix those errors. The thing is, those pages are generated. They do not physically exist in that form, so I cannot add anything in the PHP code.
And I am not quite sure how to add those to robots.txt.
When I just write:
User-agent: *
noindex: /objekten/anzeigen/haus_antea/5-0000001575*
and hit the test button in Webmaster Tools,
I get this from Googlebot:
Allowed
Detected as a directory; specific files may have different restrictions
And I do not know what that means.
I am new to this kind of stuff, so please write your answer as simply as possible.
Sorry for my bad English.
I think Google will automatically remove pages that return a 404 error from its index, and it will not display these pages in the results. So you don't need to worry about that.
Just make sure that these pages are not linked from other pages. If they are, Google may try to index them from time to time. In that case you should return a 301 (moved permanently) and redirect to the correct URL. Google will follow the 301s and use the redirect target instead.
Robots.txt is only necessary if you want to remove pages that are already in the search results. But I think pages with a 404 status code will not be displayed there anyway.
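For completeness, if you do end up needing robots.txt, the standard directive is Disallow (not noindex). Using the example path from your question, a rule would look like this (robots.txt matches by prefix, so the trailing * isn't needed):

User-agent: *
Disallow: /objekten/anzeigen/haus_antea/5-0000001575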

Do I need to send a 404?

We're in the middle of writing a lot of URL rewrite code that would basically take ourdomain.com/SomeTag and do something dynamic to figure out what to display.
Now, if the tag doesn't exist in our system, we're going to display some information to help the visitor find what they were looking for.
And now the question has come up: do we need to send a 404 header? Should we? Are there any reasons to do it or not to do it?
Thanks
Nathan
You aren't required to, but it can be useful for automated checkers to detect the response code instead of having to parse the page.
I certainly send proper response codes in my applications, especially when I have database errors or other fatal errors. Then the search engine knows to give up and retry later instead of indexing the page: e.g. a 503 for "Service Unavailable", along with a Retry-After: 600 header to tell it when to try again... search engines won't take this badly.
A 404 is sent when the page should not be indexed or doesn't exist (e.g. a non-existent tag).
So yes, do send status codes.
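As a sketch of the idea (written in Rails terms since the question doesn't say what stack is in use; Tag and the not_found template are placeholder names), you can render the friendly page and still attach the 404 status:

def show
  @tag = Tag.find_by_name(params[:tag])
  if @tag.nil?
    # Show the helpful "here's what you might have been looking for" page,
    # but send a 404 so crawlers and automated tools know the tag doesn't exist.
    render :action => 'not_found', :status => 404
  end
end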
I say do it. If the 'user' is actually an application acting on behalf of a user (e.g. cURL, wget, something custom, etc.), then a 404 would actually help quite a bit.
Keep in mind that the status code you return is not for the human user; to a typical user, error codes are meaningless, so don't display that info to them.
However, think about what happens if crawlers access your pages and consider them valid (a 200 response): they will start indexing the content, and the page will be added to the index. If you tell the search engine to index the same content for all of your not-found pages, it will certainly affect your ranking, and if one of those pages appears in the top search results, you will look foolish.