How long does Google continue polling a linked CSE specification file after it's requested? - google-custom-search

When you create a Google Custom Search Engine (CSE) with a linked specification file on your server, Google's "FeedFetcher-Google-CoOp" bot requests that file in order to build the CSE. It appears that even after results have been returned to the user and the specification file is no longer used, Google continues polling it regularly for at least several days.
My question is how long Google will continue polling the file after your CSE code has stopped requesting it, and whether there is any way to force it to stop immediately.
(We created a dynamic linked CSE that was unique to each query, which meant many, many specification files (the same script with different GET arguments each time) were requested. Now that we are no longer using them, FeedFetcher-Google-CoOp continues to request this script with various past arguments.
FeedFetcher-Google-CoOp ignores robots.txt. We are now returning 410 Gone for all requests, but it is difficult to tell whether this is having an effect, since there are so many different versions being requested (i.e. /script.php?query=). Ideally there would be some way to tell Google that script.php does not exist regardless of arguments, but without robots.txt I can't find a way to do so.)
TL;DR:
1) Will Google stop requesting this script on its own eventually? If so, when?
2) Is there a way to make it stop requesting immediately?

If left alone, it appears Google will continue requesting these files indefinitely (at least for months). It ignores 410 (Gone) responses, but it does appear to respect 301 redirects. So, to stop Google from requesting outdated CSE specifications, you can 301 redirect them to a null file. Google will likely still try to access the file once more for every set of arguments it has cached, but it should stop trying after that.
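For illustration only, here is a rough sketch of that idea as a tiny standalone Python server; matching on /script.php, the /null.xml target, and the port are made-up values, and in practice you would more likely configure the redirect in your existing web server:

from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import urlparse

class RedirectOldSpecs(BaseHTTPRequestHandler):
    def do_GET(self):
        # Match on the path only, so every cached combination of GET
        # arguments (/script.php?query=...) gets the same 301 answer.
        if urlparse(self.path).path == "/script.php":
            self.send_response(301)
            self.send_header("Location", "/null.xml")  # hypothetical empty file
            self.end_headers()
        else:
            self.send_response(404)
            self.end_headers()

if __name__ == "__main__":
    HTTPServer(("", 8080), RedirectOldSpecs).serve_forever()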

Related

Blocking Requests using HTTP_ORIGIN to Prevent Spamming

Over the last couple of days I've been getting millions of requests from rotating IPs. They're attempting to run POST requests and seem to be using an incorrect HTTP_ORIGIN. By incorrect, I mean that it's not the same as what my server sends:
My server sends: "https://www.example.com"
The spam request sends: www.example.com
I placed some logging for each scenario:
User logged in and has incorrect HTTP_ORIGIN
User NOT logged in and has incorrect HTTP_ORIGIN
What I've noticed is that there are users who are logged in but have the wrong HTTP_ORIGIN (the origin is missing "https://"). I have checked those user accounts, and while they appear to be real and not created by the original spam requests, they may currently be run through scripts.
It seems like filtering would prevent those users from reaching the site's POST endpoints, but on the other hand, if they are real users, it would cause a problem.
Now if I were to put filtering in place to block requests that didn't match the origin, my questions are:
What would be the side effect of that?
Are there downsides or negative aspects?
Would I see drops in traffic?
As you said, some people are using your website from scripts. If your website is a normal one (I mean not a site built around uploading data or something like that), it would be good to consider adding a CAPTCHA instead of filtering requests, because I think it would be simple for those who send an incorrect HTTP_ORIGIN to forge one matching the original (especially if they use an SSL stream and have malicious goals).
As for the consequences of filtering the HTTP requests: I think the request volume will drop noticeably (since you will refuse the incorrect ones), and some real users who use scripts will either switch to a browser (a rare case, especially if they scrape data from the website automatically) or stop using your website.
You need to do further research and make sure those false requests are not malicious ones (perhaps they are using a simple TCP client). Either way, for the time being it is best to inspect the data sent in the incorrect POST requests and look for anything suspicious; if you find some, add an appropriate safeguard to your website.
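If you do decide to filter, a minimal sketch of the kind of Origin check being discussed might look like the following Python; the allowed origin, the 403 response, and the standard-library server are assumptions for illustration, not a drop-in solution:

from http.server import BaseHTTPRequestHandler, HTTPServer

ALLOWED_ORIGINS = {"https://www.example.com"}   # assumed: the origin your pages send

class OriginFilter(BaseHTTPRequestHandler):
    def do_POST(self):
        origin = self.headers.get("Origin", "")
        if origin not in ALLOWED_ORIGINS:
            # Scripted clients described in the question send "www.example.com"
            # with no scheme, so an exact match rejects them; browsers send the
            # full origin and pass.
            self.send_response(403)
            self.end_headers()
            return
        length = int(self.headers.get("Content-Length", 0))
        self.rfile.read(length)   # consume the body; real handling would go here
        self.send_response(200)
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("", 8080), OriginFilter).serve_forever()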

Different results locally and in staging

I'm using Google Custom Search Engine with an engine that searches specific sites and excludes some patterns within those sites.
I'm testing the API locally and I receive 12 results. I test the same exact call in staging (Heroku US region) and I receive 410 results.
Does google personalise the results when using a custom search engine?
If yes, how do I turn it off? If no, do you have any idea why I am seeing this difference?
Update
OK, I did a test. I issued the exact same request with and without a proxy, and the results are vastly different.
Now, the question is, can this behaviour be disabled?
OK, found it. By specifying the userIp param (https://developers.google.com/custom-search/json-api/v1/using_rest), Google is forced to apply the same behaviour regardless of location.
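As a hedged illustration of that fix, a Python call to the Custom Search JSON API passing the end user's IP via the userIp parameter might look like this; the key, engine ID, query, and IP below are placeholders:

import requests

resp = requests.get(
    "https://www.googleapis.com/customsearch/v1",
    params={
        "key": "YOUR_API_KEY",       # placeholder
        "cx": "YOUR_ENGINE_ID",      # placeholder
        "q": "your query",
        "userIp": "203.0.113.7",     # the end user's IP, not the server's
    },
    timeout=10,
)
data = resp.json()
print(data.get("searchInformation", {}).get("totalResults"))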

Script to download Google web history

How does one write a script to download one's Google web history?
I know about
https://www.google.com/history/
https://www.google.com/history/lookup?hl=en&authuser=0&max=1326122791634447
feed:https://www.google.com/history/lookup?month=1&day=9&yr=2011&output=rss
but they fail when called programmatically rather than through a browser.
I wrote up a blog post on how to download your entire Google Web History using a script I put together.
It all works directly within your web browser on the client side (i.e. no data is transmitted to a third-party), and you can download it to a CSV file. You can view the source code here:
http://geeklad.com/tools/google-history/google-history.js
My blog post has a bookmarklet you can use to easily launch the script. It works by accessing the same feed, but it iterates through the entire history 1000 records at a time, converts it into a CSV string, and makes the data downloadable at the touch of a button.
I ran it against my own history, and successfully downloaded over 130K records, which came out to around 30MB when exported to CSV.
EDIT: It seems that a number of folks who have used my script have run into problems, likely due to some oddities in their history data. Unfortunately, since the script does everything within the browser, I cannot debug it when it encounters histories that break it. If you're a JavaScript developer who uses my script and it appears your history has caused it to break, please feel free to help me fix it and send me any updates to the code.
I tried GeekLad's system; unfortunately two breaking changes have occurred. #1: the URL has changed (I modified and hosted my own copy), which led to #2: the type=rss argument no longer works.
I only needed the timestamps... so began the best/worst hack I've written in a while.
Step 1 - https://stackoverflow.com/a/3177718/9908 - Using chrome disable ALL security protocols.
Step 2 - https://gist.github.com/devdave/22b578d562a0dc1a8303
Using contentscript.js and manifest.json, make a Chrome extension and host ransack.js locally with whatever service you want (PHP, Ruby, Python, etc.). Go to https://history.google.com/history/ after installing your content-script extension in developer mode (unpacked). It will automatically inject ransack.js + jQuery into the DOM, harvest the data, and then move on to the next "Later" link.
Every 60 seconds or so, Google will randomly force you to re-login, so this is not a start-and-walk-away process, BUT it does work, and if they up the obfuscation ante, you can always resort to chaining Ajax calls and sending the page back to the backend for post-processing. At full tilt, my abomination of a script collected 1 page of data per second.
On moral grounds I will not help anyone modify this script to get search terms and results, as this process is not sanctioned by Google (though apparently not blocked), and I recommend it only to individuals sufficiently motivated to make it work for them. By my estimates it took me 3-4 hours to get all 9 years of data (90K records) at 1 page every 900 ms or faster.
While this thing is going, DO NOT browse the rest of the web, because Chrome is running with no safeguards in place, and most of those safeguards exist for a reason.
You can download your search logs directly from Google (in case downloading them with a script is not the primary goal).
Steps:
1) Log in and go to https://history.google.com/history/
2) Just below your profile picture, towards the right side, you will find a settings icon. Its second option is called "Download". Click on that.
3) Then click on "Create Archive"; Google will mail you the log within minutes.
Maybe before issuing a request to get the feed, the script should add a User-Agent HTTP header of a well-known browser, so that Google decides the request came from that browser.
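Sketching that suggestion in Python (the history endpoints quoted above may also require authenticated cookies, so this is only meant to show where the header goes):

import requests

headers = {
    "User-Agent": ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                   "AppleWebKit/537.36 (KHTML, like Gecko) "
                   "Chrome/120.0 Safari/537.36"),
}
url = "https://www.google.com/history/lookup?month=1&day=9&yr=2011&output=rss"
resp = requests.get(url, headers=headers, timeout=10)
print(resp.status_code, resp.headers.get("Content-Type"))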

"Anti-XSS protection" by adding )]}' before ajax response

Google Plus returns AJAX responses with )]}' on the first line. I heard it is protection against XSS. Are there any examples of what anyone could do, and how, without that protection?
Here's my best guess as to what's happening here.
First off, there are other aspects of the Google JSON format that aren't quite valid JSON. So, in addition to any protection purposes, they may be using this specific string to signal that the rest of the file is in Google-JSON format and needs to be interpreted accordingly.
Using this convention also means that the data feed won't execute when called from a script tag, nor when the JavaScript is interpreted directly by an eval(). This ensures front-end developers are passing the content through a parser, which will keep any implanted code from executing.
So to answer your question, there are two plausible attacks that this prevents: one cross-site through a script tag, but the more interesting one is within-site. Both attacks assume that:
a bug exists in how user data is escaped and
it is exploited in a way that allows an attacker to inject code into one of the data feeds.
As a simple example, let's say a user figured out how to take a string like
["example"]
and, by submitting the value "];alert('example');, change it to
[""];alert('example');"]
Now when that data shows up in another user's feed, the attacker can execute arbitrary code in that user's browser. Since it's within-site, cookies are being sent to the server and the attacker could automate things like sharing posts or messaging people from the user's account.
In the Google scenario, these attacks won't work for a number of reasons. The first 5 characters will cause a javascript error before the attack code is run. Plus, since developers are forced to parse the code instead of accidentally running it through an eval, this practice will prevent code from being executed anyway.
As others said, it's a protection against Cross Site Script Inclusion (XSSI)
We explained this on Gruyere as:
Third, you should make sure that the script is not executable. The standard way of doing this is to append some non-executable prefix to it, like )]}while(1);. A script running in the same domain can read the contents of the response and strip out the prefix, but scripts running in other domains can't.
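To make the same-domain side of this concrete, here is a small Python sketch of what a legitimate client does: strip the non-executable prefix quoted in the question before handing the body to a JSON parser (the helper name is made up):

import json

def parse_guarded_json(body: str, prefix: str = ")]}'") -> object:
    # Same-domain code knows the prefix and can remove it; a cross-domain
    # <script> include cannot, which is the point of the convention.
    if body.startswith(prefix):
        body = body[len(prefix):]
    return json.loads(body)

print(parse_guarded_json(")]}'\n[\"example\"]"))   # -> ['example']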

Figure out if a website has restricted/password protected area

I have a big list of websites and I need to know if they have areas that are password protected.
I am thinking about doing this: downloading all of them with httrack and then writing a script that looks for keywords like "Log In" and "401 Forbidden". But the problem is that these websites are different (some static, some dynamic: HTML, CGI, PHP, Java applets...), and most of them won't use the same keywords...
Do you have any better ideas?
Thanks a lot!
Looking for password fields will only get you so far, and won't help with sites that use HTTP authentication. Looking for 401s will help with HTTP authentication, but won't get you sites that don't use it, or ones that don't return a 401. Looking for links like "log in" or "username" fields will get you some more.
I don't think that you'll be able to do this entirely automatically and be sure that you're actually detecting all the password-protected areas.
You'll probably want to take a library that is good at web automation and write a little program yourself that reads the list of target sites from a file, checks each one, and writes one file of "these are definitely passworded" and another of "these are not". Then you might want to manually check the ones that are not, and modify your program to accommodate them. Using httrack is great for grabbing data, but it's not going to help with detection; if you write your own "check for password-protected area" program in a general-purpose high-level language, you can do more checks, and you can avoid generating more requests per site than necessary to determine that a password-protected area exists.
You may need to ignore robots.txt
I recommend using the Python port of Perl's mechanize, or whatever nice web-automation library your preferred language has. Almost all modern languages will have a nice library for opening and searching through web pages, and for looking at HTTP headers.
If you are not capable of writing this yourself, you're going to have a rather difficult time using httrack or wget or similar and then searching through responses.
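As a rough sketch of that kind of checker, here is a Python version that uses requests and a plain-text search rather than mechanize, and only inspects each home page, so treat it as a first pass; the file names are assumptions:

import requests

def looks_password_protected(url: str) -> bool:
    try:
        resp = requests.get(url, timeout=10)
    except requests.RequestException:
        return False                        # unreachable sites go to the "not" pile
    if resp.status_code == 401:             # HTTP authentication
        return True
    page = resp.text.lower()
    return 'type="password"' in page or "type='password'" in page

with open("sites.txt") as f:                # one URL per line (assumed file name)
    for site in (line.strip() for line in f if line.strip()):
        bucket = "passworded.txt" if looks_password_protected(site) else "unknown.txt"
        with open(bucket, "a") as out:
            out.write(site + "\n")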
Look for forms with password fields.
You may need to scrape the site to find the login page. Look for links with phrases like "log in", "login", "sign in", "signin", or scrape the whole site (needless to say, be careful here).
I would use httrack with several limits and then search the downloaded files for password fields.
Typically, a login form could be found within two links of the home page. Almost all ecommerce sites, web apps, etc. have login forms that are accessed just by clicking on one link on the home page, but another layer or even two of depth would almost guarantee that you didn't miss any.
I would also limit the speed at which httrack downloads, tell it not to download any non-HTML files, and prevent it from downloading external links. I'd also limit the number of simultaneous connections to the site to 2 or even 1. This should work for just about all of the sites you are looking at, and it should keep you off the hosts.deny list.
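Once httrack has finished, the scan for password fields could be as simple as the following Python sketch; the mirror directory name is an assumption:

import pathlib

mirror = pathlib.Path("httrack_mirror")     # wherever httrack wrote the site copy
for page in mirror.rglob("*.htm*"):
    text = page.read_text(errors="ignore").lower()
    if 'type="password"' in text or "type='password'" in text:
        print("password form found in", page)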
You could just use wget and do something like:
wget -A html,php,jsp,htm -S -r http://www.yoursite.com 2> output_yoursite.txt
This will cause wget to download the entire site recursively, but only fetch files with the extensions listed in the -A option, which avoids heavy files.
The headers will be written to output_yoursite.txt (wget sends its -S output to stderr, hence the 2> redirect), which you can then parse for 401 status codes, meaning that part of the site requires authentication, and then check the files according to Konrad's recommendation as well.
Looking for 401 codes won't reliably catch them as sites might not produce links to anything you don't have privileges for. That is, until you are logged in, it won't show you anything you need to log in for. OTOH some sites (ones with all static content for example) manage to pop a login dialog box for some pages so looking for password input tags would also miss stuff.
My advice: find a spider program that you can get source for, add in whatever tests (plural) you plan on using, and make it stop on the first positive result. Look for a spider that can be throttled way back, can ignore non-HTML files (maybe by making HEAD requests and looking at the MIME type), and can work with more than one site independently and simultaneously.
You might try using cURL and just attempting to connect to each site in turn (possibly put them in a text file and read each line, try to connect, repeat).
You can set up one of the callbacks to check the HTTP response code and do whatever you need from there.
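A sketch of that approach using pycurl (the Python binding for libcurl); the sites.txt file name is an assumption, and the response-code check mirrors the callback idea described above:

import pycurl
from io import BytesIO

def status_of(url: str) -> int:
    buf = BytesIO()                          # collect the body so it isn't printed
    c = pycurl.Curl()
    c.setopt(pycurl.URL, url)
    c.setopt(pycurl.WRITEDATA, buf)
    c.setopt(pycurl.FOLLOWLOCATION, True)
    c.setopt(pycurl.TIMEOUT, 15)
    try:
        c.perform()
        return c.getinfo(pycurl.RESPONSE_CODE)
    except pycurl.error:
        return 0                             # connection failure, DNS error, etc.
    finally:
        c.close()

with open("sites.txt") as f:                 # one URL per line (assumed file name)
    for site in (line.strip() for line in f if line.strip()):
        if status_of(site) == 401:
            print(site, "requires HTTP authentication")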