Do I need to send a 404? - apache

We're in the middle of writing a lot of URL rewrite code that would basically take ourdomain.com/SomeTag and do something dynamic to figure out what to display.
Now, if the tag doesn't exist in our system, we're going to display some information to help visitors find what they were looking for.
And now the question has come up: do we need to send a 404 header? Should we? Are there any reasons to do it or not to do it?
Thanks
Nathan

You aren't required to, but it can be useful for automated checkers to detect the response code instead of having to parse the page.

I certainly send proper response codes in my applications, especially when I have database errors or other fatal errors. Then the search engine knows to give up and retry in five minutes instead of indexing the page, e.g. a 503 for "Service Unavailable", along with a Retry-After: 600 header to tell it when to try again; search engines won't take this badly.
404 codes are sent when the page doesn't exist or should not be indexed (e.g. a non-existent tag).
So yes, do send status codes.
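A minimal sketch of what that can look like (Ruby/Sinatra here purely for illustration; the Tag lookup and the view names are made up, and the same idea applies just as well to PHP or anything else behind Apache):

require 'sinatra'

get '/:tag' do
  tag = Tag.find_by(slug: params[:tag])    # hypothetical lookup for the rewritten /SomeTag URL
  if tag.nil?
    status 404                             # tag doesn't exist: crawlers won't index this page
    erb :tag_help                          # still render the helpful "here's what you might have meant" page
  else
    erb :tag_page, locals: { tag: tag }
  end
end

error do
  headers 'Retry-After' => '600'           # ask crawlers to come back in ten minutes
  status 503                               # fatal/database error: Service Unavailable
  'Service temporarily unavailable'
end

The point is that the helpful page and the 404 status are not mutually exclusive: you can send both.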

I say do it. If the "user" is actually an application acting on behalf of a user (e.g. cURL, wget, something custom, etc.), then a 404 would actually help quite a bit.

You have to keep in mind that the result code you return is not for the user; for the average user, error codes are meaningless, so don't display that information to them.
However, think about what could happen if crawlers access your pages and consider them valid (with a 200 response): they will start indexing the content and your page will be added to the index. If you tell the search engine to index the same content for all your not-found pages, it will certainly affect your ranking, and if one of those pages appears in the top search results, you will look like a fool.

Related

Using Shopify 404 page for content

Shopify is quite restrictive about the ways you can structure directories. For example, all pages must have a URL that looks like "my-store.com/pages/my-page".
While there is no way around this in Shopify, I considered a workaround that would work like this:
Use JavaScript to check the URL queried when displaying the 404 page.
If the queried URL matches, say, "my-url", connect to the WordPress REST or GraphQL API, run a query, and then render the desired content on the page.
For example, my-site.com/blog would return a 404 error, but JavaScript would run a function to fetch content whenever the URL ends in "/blog".
Although this would work from a technical point of view, I understand the server would still be returning a 404 error, and this probably has wider implications? To what extent is that the case, and does it make this an unviable solution?
A really interesting idea.
The biggest issue I see is SEO: the URLs will still point to the 404 page and you won't be able to show the proper content with Liquid, so all of these pages will pull the 404 content and show up as 404 pages in Google search.
That said, I don't see any other major issues that would prevent you from doing this with JS. It really depends on how many types of pages require this logic and how the JS is written, but as an idea I really like the possibility of it.
I probably wouldn't recommend it to a client who wants an SEO-optimized site, but for a personal one it seems like an interesting idea.

Accessing Metacritic API and/or Scraping

Does anybody know where documentation for the Metacritic API is, and whether it still works? There used to be a Metacritic API at https://market.mashape.com/byroredux/metacritic-v2#get-user-details which disappeared today.
Otherwise I'm trying to scrape the site myself, but I keep getting blocked with a 429 Slow Down response. I got data about 3 times this hour and haven't been able to get any more in the last 20 minutes, which is making testing difficult and the application possibly useless. Please let me know if there's anything else I could be doing to scrape that I don't know about.
I was using that API as well for an app I wrote a while ago. Looks like the creator removed it from Mashape. I just sent him an email to ask whether it'll be back up. I did find this scraper online. It only has a few endpoints but following the examples given you could easily add more. Let me know if you make any progress!
Edit: Looks like CBS requested it to be taken down. The ToS prohibits scraping:
[…] you agree not to do the following, or assist others to do the following:
Engage in unauthorized spidering, “scraping,” data mining or harvesting of Content, or use any other unauthorized automated means to gather data from or about the Services;
Though I was hoping for a JavaScript way of doing this, the creator of the API also told me some useful info.
He says I was getting blocked for not having a User-Agent header, and that I should add a 429-handling procedure, i.e. re-request with longer pauses in between (a rough sketch of that is below).
A PHP plugin is available as well: http://datalinx.io/shop/metacritic-api/
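For what it's worth, a rough sketch of that retry-with-backoff idea in Ruby with Mechanize; the target URL and the delay values are just placeholders, and keep the ToS point above in mind:

require 'mechanize'

agent = Mechanize.new
agent.user_agent_alias = "Mac Firefox"

# Retry a request a few times, doubling the pause whenever a 429 comes back.
def fetch_with_backoff(agent, url, tries: 5)
  delay = 10
  tries.times do
    begin
      return agent.get(url)
    rescue Mechanize::ResponseCodeError => e
      raise unless e.response_code == "429"
      sleep delay
      delay *= 2
    end
  end
  nil
end

page = fetch_with_backoff(agent, "https://www.metacritic.com/some-page")  # placeholder URL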
I had to add a user agent like JCDJulian said, and now it allows me to scrape. So for Ruby:
require 'mechanize'

agent = Mechanize.new
agent.user_agent_alias = "Mac Firefox"  # send a browser-like User-Agent string
Then it stopped giving me the 403 Forbidden error.

Missing enrollment terms in Canvas LMS API

I am currently doing some work with the Canvas LMS REST API and have run into an issue when trying to retrieve a list of all enrollment terms defined in the system. When viewing the terms in the online system, I can see all the terms that have been created, from the first ones up to the furthest defined semester. However, when I try to get a list of terms using
GET /api/v1/accounts/:account_id/terms
I only receive a list of 10 terms, while the rest are missing. Does anyone know what could be causing this?
Additionally, is there a difference between a Term and an EnrollmentTerm object? I only see API calls for EnrollmentTerm objects, while a Term seems to be a subset of the data contained in an EnrollmentTerm that only gets passed within a Course. Could someone explain if there is an important difference here, and what I may be missing?
Lastly, could anyone point me towards some information about error codes that are returned from an API call? For example, when I use
POST /api/v1/accounts/:account_id/terms
with some associated parameters, I get a 400 bad request response. When the parameters are incorrectly named, I get a 500 response instead. Any guidance on this matter would be very helpful.
Let me know if there is anything I can do to help clarify things. Thanks for your help!
I got into contact with Canvas developers and found out that this was caused by how they paginate their API responses. Their default cap appears to be 10 per response, but this can be extended up to 100 by adding ?per_page=100 to the end of the query, like so:
GET /api/v1/accounts/:account_id/terms?per_page=100
Additional pages can be retrieved using the URLs returned in the Link header of the response. More info on that can be found here.
An example Link header would be:
<https://<canvas>/api/v1/accounts/:account_id/terms?page=1&per_page=10>; rel="current",
<https://<canvas>/api/v1/accounts/:account_id/terms?page=2&per_page=10>; rel="next",
<https://<canvas>/api/v1/accounts/:account_id/terms?page=1&per_page=10>; rel="first",
<https://<canvas>/api/v1/accounts/:account_id/terms?page=10&per_page=10>; rel="last"
The URLs in the Link header are only included when they are relevant, so the first page will not return a "prev" link and the last page will not return a "next" link, for example.
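As an illustration, here is a rough Ruby sketch of walking those pages with Net::HTTP; the Canvas domain, account id, and access token are placeholders, and it assumes the terms list is wrapped in an "enrollment_terms" key (check the response shape for your Canvas version):

require 'net/http'
require 'json'
require 'uri'

def fetch_all_terms(account_id, token)
  terms = []
  url = URI("https://canvas.example.edu/api/v1/accounts/#{account_id}/terms?per_page=100")
  while url
    req = Net::HTTP::Get.new(url)
    req['Authorization'] = "Bearer #{token}"
    res = Net::HTTP.start(url.host, url.port, use_ssl: true) { |http| http.request(req) }
    terms.concat(JSON.parse(res.body)['enrollment_terms'])
    # Follow the rel="next" URL from the Link header until there isn't one.
    next_link = res['Link'].to_s.split(',').find { |part| part.include?('rel="next"') }
    url = next_link ? URI(next_link[/<(.*?)>/, 1]) : nil
  end
  terms
end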

Ways to keep Google from indexing Sites/Content

I have a case on my hands where I must be super duper sure that Google (or Yahoo/Bing, for that matter) does not index specific content, so the more redundancy, the better.
As far as I know there are 3 ways to accomplish that; I wonder if there are more (redundancy is key here):
set the robots meta tag to noindex
disallow the affected URL structure in robots.txt
load the content via AJAX after the page has loaded
So if those are all the methods, good, but it would be just dandy if someone had some idea how to be even more sure :D
(I know that's a little bit insane, but if the content shows up in Google somehow, it will get really expensive for my company :'-( )
uh, there are a lot more
a) identify Googlebot (works similarly for other bots)
http://www.google.com/support/webmasters/bin/answer.py?hl=en&answer=80553
and don't show them the content
b) return these pages with an HTTP 404 / HTTP 410 status instead of HTTP 200 (a tiny sketch combining a and b follows this list)
c) only show these pages to clients with cookies / sessions
d) render the whole content as an image (and then disallow the image)
e) render the whole content as an image data URL (then a disallow is not needed)
f) use pipes | in the URL structure (works in Google, don't know about the other engines)
g) use dynamic URLs that only work for, let's say, 5 minutes
and these are just a few off the top of my head ... there are probably more
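To make (a) and (b) concrete, a minimal Sinatra-style sketch (route and view names are made up; note that a User-Agent match alone can be spoofed, and Google's guidance linked above recommends verifying the bot via reverse DNS as well):

require 'sinatra'

BOT_PATTERN = /googlebot|bingbot|slurp/i   # common crawler User-Agent fragments

get '/secret-content' do
  # Idea (b): answer identified crawlers with a 404 instead of the real page.
  halt 404 if request.user_agent.to_s =~ BOT_PATTERN
  erb :secret_content                      # assumed view holding the content to hide
end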
Well, I suppose you could require some sort of registration/authentication to see the content.
We're using the post-load-content-via-AJAX method at my work and it works pretty well. You just have to be sure that you're not returning anything if that same AJAX route is hit without the XHR header (a bare-bones version of that check is sketched below). (We're using it in conjunction with authorization, though.)
I just don't think there's any way to be completely sure without actually locking the data down behind some sort of authentication. And if it's going to be expensive for your company if it gets out there, then you might want to seriously consider that.
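Something along these lines, again as a Sinatra-flavoured sketch with made-up route and view names (X-Requested-With is the header that jQuery-style AJAX calls set):

require 'sinatra'

get '/ajax/protected-content' do
  # Refuse to serve anything unless the request came in via XHR.
  unless request.env['HTTP_X_REQUESTED_WITH'] == 'XMLHttpRequest'
    halt 404
  end
  erb :protected_content   # assumed partial holding the sensitive markup
end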
What about blocking the search engines' IPs, and requests with search-engine user agents, in .htaccess?
It might need more maintenance of the list of IPs and user agents, but it will work.

Why would "/id" as an HTTP GET parameter be a security breach?

While trying to debug my OpenID implementation with Google, which kept returning Apache 406 errors, I eventually discovered that my hosting company does not allow passing a string containing "/id" as a GET parameter (something like "example.php?anyattribute=%2Fid" once URL encoded).
That's rather annoying, as Google's OpenID endpoint includes this dreaded "/id" (https://google.com/accounts/o8/id), so my app returns 406 errors every time I log in with Google because of this. I contacted my hosting company, who told me this has been deactivated for security purposes.
I could use POST instead, for sure. But has anyone got an idea why this could cause security problems?
It can't, your host is being stupid. There's nothing magical about the string /id.
Sometimes people do stupid things with the string /id, like assuming no one is going to guess what follows it, so that example.com/mysensitivedata/id/3/ shows my data because my user has id 3; being the sneaky sort, I wonder what happens if I navigate to example.com/mysensitivedata/id/4/, and your site blindly lets me through to see someone else's stuff.
If that sort of attack breaks your site, no amount of mollycoddling by your host will help you anyway.
One reason a simple ID in the URL could be a security concern is that a user could see their own ID and then type in another one; for example, if it's an integer they might try the next integer up and potentially see another user's info if it is not protected.
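The fix for that class of problem is an ownership check on the server, not filtering the string "/id". A minimal Sinatra-style sketch, assuming a hypothetical Record model and a current_user authentication helper:

require 'sinatra'

get '/mysensitivedata/id/:id' do
  halt 401 unless current_user                                    # assumed authentication helper
  # Scope the lookup to the logged-in user so guessing other ids returns nothing.
  record = Record.find_by(id: params[:id], user_id: current_user.id)
  halt 404 if record.nil?     # don't reveal whether someone else's record even exists
  erb :record, locals: { record: record }
end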