Google-Plus Crawler

Currently, my company is attempting to add Google Plus One links to our site.
We have the code working; however, it appears that the Google+ crawler is unable to access the page content. When the share-link snippet is created, it renders with a message stating that the crawler cannot view the content because it fails a test that differentiates bots from human visitors.
We can whitelist the bot; however, the system we are using only accepts a User-Agent and a URL. When the User-Agent is detected, a reverse lookup is run and the bot's IP is compared against the URL that was entered to see whether it comes from the same set of IPs.
I know that the Google+ crawler does not use a bot-style user agent like Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html), but is there a user agent we can perform the necessary whitelist test on?

Yes, it does. The +Snippet bot's user agent contains the following string:
Google (+https://developers.google.com/+/web/snippet/)

This is what the user agent returned for me:
Mozilla/5.0 (Windows NT 6.1; rv:6.0) Gecko/20110814 Firefox/6.0 Google (+https://developers.google.com/+/web/snippet/)

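If your whitelisting system can match on a user-agent substring, a plain server-side check against that marker may be all you need. Below is a minimal sketch in Python; the function name and the surrounding plumbing are hypothetical, and only the +Snippet marker string comes from the answer above.

# Minimal sketch: detect the Google +Snippet bot by its user-agent marker.
# Only the marker string comes from the answer above; the function name
# and how you obtain the request's User-Agent are hypothetical.
SNIPPET_BOT_MARKER = "Google (+https://developers.google.com/+/web/snippet/)"

def is_plus_snippet_bot(user_agent):
    return SNIPPET_BOT_MARKER in (user_agent or "")

# Example: skip the human/bot challenge for the snippet crawler.
ua = "Mozilla/5.0 (Windows NT 6.1; rv:6.0) Gecko/20110814 Firefox/6.0 " + SNIPPET_BOT_MARKER
assert is_plus_snippet_bot(ua)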

Not allowed by CORS

I'm trying to automate a process so I want to connect to an external API, first just to log in (can't use the API Key since I'm not an admin user).
I basically copied the request the browser makes when it logs in, but when doing this from Postman I get a 400 response with the body "Not allowed by CORS".
Is there any way, through code, that I can bypass that and work with this API?
CORS stands for Cross-Origin Resource Sharing. Basically, it is a mechanism through which browsers give web servers a way to protect themselves against cross-origin requests that change data.
Remove the Origin header, or replace its value with the server's hostname (in this case api.kenjo.io).
Add a Referer header.
With dothttp, the request would look like this:
POST 'https://api.kenjo.io/auth/token'
origin: 'https://www.kenjo.io'
referer: 'https://www.kenjo.io/'
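For reference, here is a rough equivalent in Python with the requests library (my assumption; the endpoint and header values are the ones above, but the login payload fields are hypothetical placeholders for whatever the browser actually sends):

# Sketch: replay the login request outside the browser, spoofing
# Origin and Referer so the server's CORS check passes.
import requests

resp = requests.post(
    "https://api.kenjo.io/auth/token",
    headers={
        "Origin": "https://www.kenjo.io",
        "Referer": "https://www.kenjo.io/",
    },
    json={"username": "user@example.com", "password": "..."},  # hypothetical payload
)
print(resp.status_code, resp.text)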

Server locale xmlhttp requests recognised as wrong language by remote servers?

I have a web server hosted with 1and1, which evidently is hosted in Germany, so if I do an XMLHttpRequest GET for data from Google or Facebook I am presented with German return data, as their sites presume I am a German user.
Does anyone know if this is a server setting that needs to be changed, or is Facebook recognising the IP location?
If the resource is available in two or more languages, the server must decide which version to serve. It often does this by examining the Accept-Language HTTP header. Probably the header in the request issued by your server says that it accepts any language, so the remote server prefers to send German rather than English because of your server's IP. Try adding the header manually to your request:
Accept-Language: en
so your Ajax call will look like this:
xmlhttpobject.setRequestHeader('Accept-Language', 'en');
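If the request is actually being made server-side rather than in the browser, the same fix applies to whatever HTTP client your server uses. A quick sketch in Python with the requests library (the Google URL is just an example endpoint):

# Sketch: request English explicitly so the remote server does not
# fall back to guessing the language from the server's German IP.
import requests

resp = requests.get(
    "https://www.google.com/",
    headers={"Accept-Language": "en"},
)
print(resp.headers.get("Content-Language"), len(resp.text))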

OAuth 2.0: Can a user-agent (client) avoid forwarding fragments?

In the OAuth 2.0 draft specification, user-agent clients receive authorization in the form of a bearer token via redirection (from an authorization server) to a URL such as
HTTP/1.1 302 Found
Location: http://example.com/rd#access_token=FJQbwq9&expires_in=3600
According to Section 3.5.2 it is then the user-agent's job to GET the URL in question, but "The user-agent SHALL NOT include the fragment component with the request." In other words, as a result of the example redirection above, the user-agent should
GET /rd HTTP/1.1
Host: example.com
without passing #access_token to the server.
My question: what user agents behave this way? I thought redirection in Firefox, for example, would (logically) include the fragment in the GET request. Am I just wrong about this, or does the OAuth 2.0 specification rely on non-standard user-agent behavior?
In fact, Firefox and other browsers behave this way by default. Fragments after the # in a URL are used by the browser to determine which part of a page to show; they are not sent to the server as part of a GET request.
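You can see the same split in any URL-handling library: the fragment is parsed separately from the path and query and never becomes part of the request line. A quick illustration in Python:

# Illustration: the fragment is a client-side component of the URL.
# urlsplit separates it from the path, and HTTP clients do not send
# it to the server, matching the browser behaviour described above.
from urllib.parse import urlsplit

parts = urlsplit("http://example.com/rd#access_token=FJQbwq9&expires_in=3600")
print(parts.path)      # "/rd" -- what appears in "GET /rd HTTP/1.1"
print(parts.fragment)  # "access_token=FJQbwq9&expires_in=3600" -- stays client-side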

How do I keep my app from tracking bot requests as views

This is a general question about writing web apps.
I have an application that counts page views of articles, as well as a URL shortener script that I've installed for a client of mine. The problem is that whenever bots hit the site, they tend to inflate the page views.
Does anyone have an idea on how to go about eliminating bot views from the view count of these applications?
There are a few ways you could determine whether your articles are being viewed by an actual user or by a search engine bot. Probably the best way is to check the User-Agent header sent by the browser (or bot). The User-Agent header is essentially a field identifying the client application used to access the resource. For example, Internet Explorer might send something like Mozilla/5.0 (Windows; U; MSIE 7.0; Windows NT 6.0; en-US), while Google's bot might send something like Googlebot/2.1 (+http://www.google.com/bot.html). It is possible to send a fake User-Agent header, but I can't see the average site user or a major company like Google doing that. If it's blank, or a common User-Agent string associated with a commercial bot, it's most likely a bot. (A sketch of this check, together with the Googlebot DNS verification described in the resources below, follows the list.)
While you're at it, you may want to make sure you have an up-to-date robots.txt file. It's a simple text file that provides rules automated bots should respect in terms of which content they are not allowed to retrieve for indexing.
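For example, a minimal robots.txt that asks all bots to stay out of one section of the site might look like this (the /downloads/ path is just a placeholder):

User-agent: *
Disallow: /downloads/

Keep in mind this only helps against bots that respect the rules; it is advisory, not enforcement.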
Here are a few resources that may be helpful:
List of User-Agents
How to Verify Googlebot
Web Robots Page
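As mentioned above, here is a sketch of both checks in Python. The user-agent test is a plain substring match against markers I chose as examples; the Googlebot verification follows the reverse-then-forward DNS procedure described in the "How to Verify Googlebot" resource. The function names are my own.

# Sketch: two layers of bot detection.
# 1) Cheap check: match the User-Agent against known bot markers.
# 2) Stronger check: verify a claimed Googlebot via reverse DNS,
#    then forward-confirm the hostname back to the original IP.
import socket

BOT_MARKERS = ("Googlebot", "bingbot", "+http://www.google.com/bot.html")

def looks_like_bot(user_agent):
    ua = user_agent or ""
    return ua == "" or any(marker in ua for marker in BOT_MARKERS)

def is_real_googlebot(ip):
    try:
        host = socket.gethostbyaddr(ip)[0]  # reverse lookup
    except socket.herror:
        return False
    if not host.endswith((".googlebot.com", ".google.com")):
        return False
    try:
        # Forward-confirm: the hostname must resolve back to the IP.
        return ip in socket.gethostbyname_ex(host)[2]
    except socket.gaierror:
        return False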

How do I stop bots from incrementing my file download counter in PHP?

Check User-Agent. Use this header value to distinguish bots from regular browsers/users.
For example,
Google bot:
Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
Safari:
Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_3; lv-lv) AppleWebKit/531.22.7 (KHTML, like Gecko) Version/4.0.5 Safari/531.22.7

How do I figure out which parts of a web page are encrypted and which aren't?

I'm working on a webserver that I didn't totally set up and I'm trying to figure out which parts of a web page are being sent encrypted and which aren't. Firefox tells me that parts of the page are encrypted, but I want to know what, specifically, is encrypted.
The problem is not always bad links in your page.
If you link to resources at an external site using https://, and that external site then does its own HTTP redirect to non-SSL pages, that will break the SSL lock on your page.
BUT, when you view the source or the information in the Media tab, you will not see any http://, because your page is properly using only https:// links.
As suggested above, the Firebug Net tab will show this and any other problems. Follow these steps:
Install the Firebug add-on into Firefox if you don't already have it, and restart FF when prompted.
Open Firebug (F12 or the little insect menu to the right of your search box).
In Firebug, choose the "Net" tab. Hit "Enable" (text link) to turn it on.
Refresh your problem page without using the cache by hitting Ctrl-Shift-R (or Command-Shift-R in OS X). You will see the "Net" tab in Firefox fill up with a list of each HTTP request made.
Once the page is done loading, hover your mouse over the left column of each HTTP request shown in the Net tab. A tooltip will appear showing you the actual link used. It will be easy to spot any that are http:// instead of https://.
If any of your links resulted in an HTTP redirect, you will see "301 Moved Permanently" in the HTTP status column, and another HTTP request will be just below for the new location. If the problem was due to an external redirect, that's where the evidence will be - the new location's request will be HTTP.
Expand any of those 301 redirects with the plus sign at the left and review the response headers to see what is going on. The Location: header will tell you the new location the external server is asking browsers to use.
Make note of this info in the redirect, then send a friendly, polite email to the external site in question and ask them to remove the https:// -> http:// redirects for you. Explain how it's breaking the SSL on your site, and ideally include a link to the page that is broken, so that they can see the error for themselves. (This will spur faster action than if you just tell them about the error.)
Here is sample output from Firebug for the external redirect issue. In my case I found that a page calling https:// data feeds was getting the feeds rewritten by the external server to http://.
I've renamed my site to "mysite.example.com" and the external site to "external.example.com", but otherwise left the headers intact. The request headers are shown at the bottom, below the response headers. Note that I'm requesting an https:// link from my site but getting redirected to an http:// link, which is what was breaking my SSL lock:
Response Headers
Server nginx/0.8.54
Date Fri, 07 Oct 2011 17:35:16 GMT
Content-Type text/html
Content-Length 185
Connection keep-alive
Location http://external.example.com/embed/?key=t6Qu2&width=940&height=300&interval=week&baseAtZero=false
Request Headers
Host external.example.com
User-Agent Mozilla/5.0 (Windows NT 6.1; WOW64; rv:7.0.1) Gecko/20100101 Firefox/7.0.1
Accept */*
Accept-Language en-gb,en;q=0.5
Accept-Encoding gzip, deflate
Accept-Charset ISO-8859-1,utf-8;q=0.7,*;q=0.7
Connection keep-alive
Referer https://mysite.example.com/real-time-data
Cookie JSESSIONID=B33FF1C1F1B732E7F05A547A9CB76ED3
Pragma no-cache
Cache-Control no-cache
So, the important thing to note is that in the Response Headers (above), you see a Location: that starts with http://, not https://. Your browser takes this into account when deciding whether the lock is valid, and reports only partially encrypted content! (This is actually an important browser security feature, alerting users to potential XSRF and/or phishing attacks.)
The solution in this case is not something you can fix on your site: you have to ask the external site to stop their redirect to http. Often this was done on their side for convenience, without realizing the consequence, and a well-written, polite email can get it fixed.
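If you want to catch these external downgrades without Firebug, you can also walk the redirect chain programmatically. A rough sketch in Python with the requests library (the feed URL is a placeholder modeled on the sample output above):

# Sketch: follow a resource URL's redirect chain and flag any hop
# where an https:// request was redirected to an http:// location.
import requests

def find_https_downgrades(url):
    resp = requests.get(url, allow_redirects=True)
    downgrades = []
    for hop in resp.history:  # each redirect response, in order
        target = hop.headers.get("Location", "")
        if hop.url.startswith("https://") and target.startswith("http://"):
            downgrades.append((hop.url, target))
    return downgrades

print(find_https_downgrades("https://external.example.com/embed/"))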
For each element loaded in the page, check its scheme:
if it starts with https://, it is encrypted.
if it starts with http://, it is not encrypted.
(You can see a relatively complete list in Firefox by right-clicking on the page and selecting "View Page Info", then the "Media" tab.)
EDIT: FF only shows images and multimedia elements. There are also JavaScript and CSS files which have to be checked. Firebug is a good tool for finding what you need.
Some elements may not specify http or https in their URLs; in that case, whichever scheme was used for the page will be used for those items: if the page was requested over SSL, those elements will come encrypted, while if the page was not requested over SSL, they will come unencrypted. Fiddler in Internet Explorer may also be useful in tracking down some of this information.
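As a programmatic version of that per-element check, here is a rough sketch that scans a page's HTML for http:// resource references. The regex over src/href attributes is an approximation (a real check should use an HTML parser), and the page URL is the placeholder from the sample above.

# Sketch: fetch an HTTPS page and list element URLs that still use
# plain http:// (mixed content). Regex matching is approximate.
import re
import requests

def find_mixed_content(page_url):
    html = requests.get(page_url).text
    return re.findall(r'(?:src|href)=["\'](http://[^"\']+)', html)

for url in find_mixed_content("https://mysite.example.com/real-time-data"):
    print("unencrypted element:", url)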
Sniff the packets - that'll tell you really quick. WireShark is a good program for such a task.
Can Firebug do this?
Edit: Looks like Firebug will also do this using the "Net" panel, which also gives you some other interesting statistics.
The best tool I have found for detecting http links on a https connection is Fiddler. It's also great for many other troubleshooting efforts.
I use the FF plugin HTTPFox for this.
https://addons.mozilla.org/en-us/firefox/addon/httpfox/