How do I keep my app from tracking bot requests as views

How do I keep my app from tracking bot requests as views - seo

This is a general question about writing web apps.
I have an application that counts page views of articles as well as a url shortner script that I've installed for a client of mine. The problem is that, whenever bots hit the site, they tend to inflate the page views.
Does anyone have an idea on how to go about eliminating bot views from the view count of these applications?

There are a few ways you could determine whether your articles are being viewed by an actual user or by a search engine bot. Probably the best way is to check the User-Agent header sent by the browser (or bot). The User-Agent header is essentially a field that is sent identifying the client application used to access the resource. For example, Internet Explorer might send something Mozilla/5.0 (Windows; U; MSIE 7.0; Windows NT 6.0; en-US). Google's bot might send something like Googlebot/2.1 (+http://www.google.com/bot.html). It is possible to send a fake User-Agent header, but I can't see the average site user or a major company like Google doing that. If it's blank or a common User-Agent string associated with a commercial bot, it's most likely a bot.
While you're at it, you may want to make sure you have an up-to-date robots.txt file. It's a simple text file that provides rules automated bots should respect in terms of which content they are not allowed to retrieve for indexing.
Here's a few resources that may be helpful:
List of User-Agents
How to Verify Googlebot
Web Robots Page
How do I stop bots from incrementing my file download counter in PHP?

Check User-Agent. Use this header value to distinguish bots from regular browsers/users.
For example,
Google bot:
Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
Safari:
Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_3; lv-lv) AppleWebKit/531.22.7 (KHTML, like Gecko) Version/4.0.5 Safari/531.22.7

Related

Google-Plus Crawler

Currently, my company is attempting to add Google Plus One links to our site.
We have the code working, however it appears that the Google-Plus Crawler is unable to access the page content. When the share link snippet is created, it renders with a message stating that the crawler is unable to view the content because it fails a test to differentiate bots from human visitors.
We can white-list the bot, however the system we are using only accepts a User-Agent and a URL. When the User-Agent is detected, a reverse look up is run and the bot ip is compared to the url that was entered to see if it comes from the same set of ips.
I know that the Google Plus crawler does not use a bot style user-agent, like Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html), but is there a user-agent we can perform the necessary white-list test on?

Yes it does. The +Snippet bot user agent contains the following string:
Google (+https://developers.google.com/+/web/snippet/)

This is what the user agent returned for me:
Mozilla/5.0 (Windows NT 6.1; rv:6.0) Gecko/20110814 Firefox/6.0 Google
(+https://developers.google.com/+/web/snippet/)

This is what the user agent returned for me:
Mozilla/5.0 (Windows NT 6.1; rv:6.0) Gecko/20110814 Firefox/6.0 Google (+)

How can I identify web requests created when someone links to my site from Google+?

To comply with new EU legislation regarding cookies, we've had to implement cookie 'warning' banners in several places on our site. We're now seeing problems when users try and link/embed content from our site on Facebook, Google+, bit.ly and so on - the embedded page thumbnail is showing the cookie notification banner instead of a preview of the actual page.
Most of these sites use a distinctive user agent string, so we can easily identify their incoming requests and disable the cookie banner - but Google+ seems to identify itself as
Mozilla/5.0 (Windows NT 6.1; rv:6.0) Gecko/20110814 Firefox/6.0
which makes it very difficult to distinguish between Google+ automated requests and browsing traffic from 'real' users. Is there any way at all I can identify these requests? Custom HTTP headers I can look for or anything?

G+ has got its user agent. It contains the text: Google (+https://developers.google.com/+/web/snippet/).
Ref: https://developers.google.com/+/web/snippet/#faq-snippet-useragent

There are not any HTTP headers that you can depend on for detecting the +1 button page fetcher, but there is a feature request for it to specify a distinctive user agent. In the mean time, I would not depend on it.
You can, however, use other means to configure the snippet of content that appears on Google+. If you add structured markup such as schema.org or OpenGraph to your pages, the +1 button will pull the snippet from those tags. Google+ provides a configuration tool and docs to help you design your markup.
If you add schema.org markup it might look something like this:
<body itemscope itemtype="http://schema.org/Product">
<h1 itemprop="name">Shiny Trinket</h1>
<img itemprop="image" src="http://example.com/trinket.jpg" />
<p itemprop="description">Shiny trinkets are shiny.</p>
</body>

Server locale xmlhttp requests recognised as wrong language by remote servers?

I have a web server hosted with 1and1 which evidently is hosted in Germany, so if I try to do a xmlhttp get on data from google or facebook I am presented with German return data as their site presumes I am a German user.
Does anyone know if it is a server setting which needs to be changed or is facebook recognising the IP location?

if the resource is available in two or many languages, the server mast decide which version to serve. he does this often by examining Accept-Language HTTP header. Probably the header in the request issued by yur server says that it accepts any language, so the server prefers to send german not english due to your srever's IP. Try to add the header menually to your request:
Accept-Language: en
so your ajax will look like this:
xmlhttpobject.setRequestHeader('Accept-Language', 'en');

HTTP referer headers - mobile browsers

I seem to have a unique problem. In my apache logs (for a mobile site) I see that for some requests the referer header is null ("-"). All requests on the logs are "href" links and occur from a mobile browser by clicking/tapping . So there is almost no chance of the url being accessed from the address bar directly (in which case I can understand the absence of a referer header) . What are the other possibilities that a referer header can be null ?
Are there mobile devices which have a browser feature to turn of referer headers (like desktop browsers) ?
I also understand that such a thing can be achieved on a desktop browser by spoofing the the UserAgent to a mobile device. But I'm slightly hesitant as there is no motive for the user to do so.
Also, will the referer header persist when a link is opened in a separate window/tab ? (Again, I'm not sure if this is possible on a mobile device)
Any thoughts will be appreciated .

most newer mobile browsers support opening in new window/tab ... in that case the referrer is lost...
referrer can also be blank in case of a redirect ... if the redirecting page had no referrer

How do I figure out which parts of a web page are encrypted and which aren't?

I'm working on a webserver that I didn't totally set up and I'm trying to figure out which parts of a web page are being sent encrypted and which aren't. Firefox tells me that parts of the page are encrypted, but I want to know what, specifically, is encrypted.

The problem is not always bad links in your page.
If you link to iresources at an external site using https://, and then the external site does its own HTTP redirect to non-SSL pages, that will break the SSL lock on your page.
BUT, when you viewing the source or the information in the media tab, you will not see any http://, becuase your page is properly using only https:// links.
As suggested above, the firebug Net tab will show this and any other problems. Follow these steps:
Install Firebug add-on into firefox if you don't already have it, and restart FF when prompted.
Open Firebug (F12 or the little insect menu to the right of your search box).
In firebug, choose the "Net" tab. Hit "Enable" (text link) to turn it on
Refresh your problem page without using the cache by hitting Ctrl-Shift-R (or Command-shift-R in OSX). You will see the "Net" tab in firefox fill up with a list of each HTTP request made.
Once the page is done loading, hover your mouse over the left colum of each HTTP request shown in the net tab. A tooltip will appear showing you the actual link used. it will be easy to spot any that are http:// instead of https://.
If any of your links resulted in an HTTP redirect, you will see "301 Moved Permanently" in the HTTP status column, and another HTTP request will be just below for the new location. If the problem was due to an external redirect, that's where the evidence will be - the new location's request will be HTTP.
If your problem is due to redirections from an external site, you will see "301 Moved permanently" status codes for the requests that point them to their new location.
Exapnd any of those 301 relocations with the plus sign at the left, and review the response headers to see what is going on. the Location: header will tell you the new location the external server is requesting browsers use.
Make note of this info in the redirect, then send a friendly polite email to the external site in question and ask them to remove the https:// -> http:// redirects for you. Explain how it's breaking the SSL on your site, and ideally include a link to the page that is broken if possible, so that they can see the error for themselves. (this will spur faster action than if you just tell them about the error).
Here is sample output from Firebug for the the external redirect issue.. In my case I found a page calling https:// data feeds was getting the feeds rewritten by the external server to http://.
I've renamed my site to "mysite.example.com" and the external site to "external.example.com", but otherwise left the heders intact. The request headers are shown at the bottom, below the response headers. Note that I"m requesting an https:// link from my site, but getting redirected to an http:// link, which is what was breaking my SSL lock:
Response Headers
Server nginx/0.8.54
Date Fri, 07 Oct 2011 17:35:16 GMT
Content-Type text/html
Content-Length 185
Connection keep-alive
Location http://external.example.com/embed/?key=t6Qu2&width=940&height=300&interval=week&baseAtZero=false
Request Headers
Host external.example.com
User-Agent Mozilla/5.0 (Windows NT 6.1; WOW64; rv:7.0.1) Gecko/20100101 Firefox/7.0.1
Accept */*
Accept-Language en-gb,en;q=0.5
Accept-Encoding gzip, deflate
Accept-Charset ISO-8859-1,utf-8;q=0.7,*;q=0.7
Connection keep-alive
Referer https://mysite.example.com/real-time-data
Cookie JSESSIONID=B33FF1C1F1B732E7F05A547A9CB76ED3
Pragma no-cache
Cache-Control no-cache
So, the important thing to note is that in the Response Headers (above), you are seeing a Location: that starts with http://, not https://. Your browser will take this into account when figuring out if the lock is valid or not, and report only partially encrypted content! (This is actually an important browser security feature to alert users to a potential XSRF and/or phishing attacks).
The solution in this case is not something you can fix on your site - you have to ask the external site to stop their redirect to http. Often this was done on their side for convenience, without realizing this consequence, and a well written, polite email can get it fixed.

For each element loaded in page, check their scheme:
it starts with HTTPS: it is encrypted.
it starts with HTTP: it's not encrypted.
(you can see a relatively complete list on firefox by right-clicking on the page and selecting "View Page Info" then the "medias"tab.
EDIT: FF only shows images and multimedia elements. They are also javascript files & CSS ones which have to be checked. And Firebug is a good tool to find what you need.

Some elements may not list http or https, in this case whichever was used for the page will be used for these items, i.e. if the page request is under SSL then these images will come encrypted while if the page request is not under SSL then these will come unencrypted. Fiddler in Internet Explorer may also be useful in tracking down some of this information.

Sniff the packets - that'll tell you really quick. WireShark is a good program for such a task.

Can firebug do this?
Edit: Looks like firebug will also do this using the "Net" panel, which also gives you some other interesting statistics.

The best tool I have found for detecting http links on a https connection is Fiddler. It's also great for many other troubleshooting efforts.

I use FF plugin HTTPFox for this.
https://addons.mozilla.org/en-us/firefox/addon/httpfox/

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas