API for when Google last crawled my site, given a URL? - seo

I have a bunch of URLs that are currently indexed in Google. Given those URLs, is there a way to find out when Google last crawled them?
Manually, if I look up the link in Google and open the 'cached' link, I see the date it was crawled. Is there a way to do this automatically? A Google API of some sort?
Thank you :)

Google doesn't provide an API for this type of data. The best way of tracking last crawled information is to mine your server logs.
In your server logs, you should be able to identify Googlebot by its typical user-agent: Mozilla/5.0+(compatible;+Googlebot/2.1;++http://www.google.com/bot.html). Then you can see which URLs Googlebot has crawled, and when.
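As a rough illustration of the log-mining approach described above, here is a minimal Python sketch. It assumes the logs are in the common Apache/Nginx "combined" format (your server may use a different layout, in which case the regular expression needs adjusting), and the function name last_googlebot_crawl is made up for this example.

```python
import re
from datetime import datetime

# Assumption: "combined" access log format, e.g.
# IP - - [10/Oct/2023:13:55:36 +0000] "GET /page HTTP/1.1" 200 2326 "referer" "user-agent"
LINE_RE = re.compile(
    r'\S+ \S+ \S+ \[(?P<ts>[^\]]+)\] "(?:\S+) (?P<path>\S+)[^"]*" '
    r'\d{3} \S+ "[^"]*" "(?P<ua>[^"]*)"'
)


def last_googlebot_crawl(lines):
    """Return {path: datetime of the most recent request whose UA mentions Googlebot}."""
    last = {}
    for line in lines:
        match = LINE_RE.match(line)
        if not match or "Googlebot" not in match.group("ua"):
            continue  # unparseable line, or not a Googlebot user-agent
        ts = datetime.strptime(match.group("ts"), "%d/%b/%Y:%H:%M:%S %z")
        path = match.group("path")
        if path not in last or ts > last[path]:
            last[path] = ts
    return last
```

You would call it with an open log file, for example last_googlebot_crawl(open("access.log")), where access.log is a placeholder path. Note that the user-agent string alone can be spoofed.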
If you want to be sure it's Googlebot crawling those pages, you can verify it with a reverse DNS lookup. Bingbot also supports reverse DNS lookups.
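A minimal sketch of that reverse DNS check in Python, using only the standard library. The accepted host suffixes (googlebot.com and google.com) and the forward-confirmation step follow Google's published verification procedure as I understand it; treat the suffix list as an assumption to check against current documentation. The function name is_googlebot and the injectable reverse/forward parameters are hypothetical and exist only to keep the example testable.

```python
import socket

# Assumption: hostnames of genuine Googlebot IPs end in one of these suffixes.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")


def is_googlebot(ip, reverse=socket.gethostbyaddr, forward=socket.gethostbyname_ex):
    """Reverse-resolve the IP, check the domain, then forward-confirm the hostname."""
    try:
        host = reverse(ip)[0]          # (hostname, aliases, addresses)
    except OSError:
        return False                   # no PTR record or lookup failure
    if not host.endswith(GOOGLE_SUFFIXES):
        return False
    try:
        addresses = forward(host)[2]   # IPv4 addresses for the hostname
    except OSError:
        return False
    return ip in addresses
```

The forward lookup matters because anyone can publish a PTR record that claims to be googlebot.com; only the forward record is controlled by the domain owner. gethostbyname_ex only handles IPv4, so IPv6 crawler addresses would need socket.getaddrinfo instead.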
If you don't want to manually parse your server logs, you can always use something like splunk or logstash. Both are great log processing platforms.
Also note that the "cached" date in the SERPs doesn't necessarily match the last crawled date. Googlebot can crawl your pages multiple times after the "cached" date without updating its cached version. You can think of the "cached date" as more of a "last indexed" date, but that's not exactly correct either. In either case, if you ever need to get a page re-indexed, you can always use Google Webmaster Tools (GWT). There's an option in GWT to force Googlebot to re-crawl a page, and also re-index a page. There's a weekly limit of 50 or something like that.

<?php
$domain_name = $_GET["url"];
//get googlebot last access
function googlebot_lastaccess($domain_name)
{
$request = 'http://webcache.googleusercontent.com/search?hl=en&q=cache:'.$domain_name.'&btnG=Google+Search&meta=';
$data = getPageData($request);
$spl=explode("as it appeared on",$data);
//echo "<pre>".$spl[0]."</pre>";
$spl2=explode(".<br>",$spl[1]);
$value=trim($spl2[0]);
//echo "<pre>".$spl2[0]."</pre>";
if(strlen($value)==0)
{
return(0);
}
else
{
return($value);
}
}
$content = googlebot_lastaccess($domain_name);
$date = substr($content , 0, strpos($content, 'GMT') + strlen('GMT'));
echo "Googlebot last access = ".$date."<br />";
function getPageData($url) {
if(function_exists('curl_init')) {
$ch = curl_init($url); // initialize curl with given url
curl_setopt($ch, CURLOPT_USERAGENT, $_SERVER['HTTP_USER_AGENT']); // add useragent
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // write the response to a variable
if((ini_get('open_basedir') == '') && (ini_get('safe_mode') == 'Off')) {
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects if any
}
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5); // max. seconds to execute
curl_setopt($ch, CURLOPT_FAILONERROR, 1); // stop when it encounters an error
return @curl_exec($ch);
}
else {
return @file_get_contents($url);
}
}
?>
Just upload this PHP and create a Cron-Job.
You can test it as follows: .../bot.php?url=http://www....

You can check Googlebot's last visit using http://www.gbotvisit.com/

Related

Google Vision API Invalid operation ID format

I've been using the Google Vision API for a while now to extract text from documents (PDFs) but just came across an issue. I have created a long-running job and now I need to check the job status. According to the documentation, the GET request should be:
GET https://vision.googleapis.com/v1/operations/operation-id
However, when trying that I get this response:
{ "error": { "code": 400, "message": "Invalid operation id format. Valid format is either projects/*/operations/* or projects/*/locations/*/operations/*", "status": "INVALID_ARGUMENT" } }
OK, no problem, so I looked through the docs, and according to the message I should be able to do the following:
https://vision.googleapis.com/v1/projects/project-id/operations/1efec2285bd442df
Or:
https://vision.googleapis.com/v1/projects/project-id/locations/location-id/operations/1efec2285bd442df
My final code is a GET request using PHP cURL, like so:
$url = "https://vision.googleapis.com/v1/projects/myproject-id/operations/longrunningjobid";
// create a curl request
$ch = curl_init($url);
// define the parameters
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_HTTPHEADER, array("Authorization:Bearer $token", "Content-Type: application/json; charset=utf-8"));
// execute the request
$result = curl_exec($ch);
// close the connection
curl_close($ch);
echo $result;
I have tried several combinations of the url to try and get this to work. My gcp project id is correct and the job number is correct but I feel the url is not right. Any ideas?
The implementation was correct; however, I was using a regex earlier in the code and didn't realize that in PHP the \n character is escaped differently than in JavaScript.
In JavaScript I was using \\n to escape it, but in PHP I needed to use \\\n.
This was causing the longrunningjobid to have one character too many.
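Since the root cause was a stray character in the operation ID, one defensive option is to strip and validate the operation name before sending the request. The sketch below, in Python, checks the name against the two formats quoted in the error message above; the function name operation_url and the exact character rules in the regular expression are assumptions for illustration, not part of the Vision API.

```python
import re

BASE_URL = "https://vision.googleapis.com/v1/"

# Mirrors the formats from the error message:
#   projects/*/operations/*  or  projects/*/locations/*/operations/*
# Assumption: no segment may contain a slash or whitespace.
NAME_RE = re.compile(r"projects/[^/\s]+/(?:locations/[^/\s]+/)?operations/[^/\s]+")


def operation_url(project_id, operation_id, location_id=None):
    """Build the status URL for a long-running operation, rejecting malformed names."""
    parts = ["projects", project_id.strip()]
    if location_id:
        parts += ["locations", location_id.strip()]
    # strip() removes stray whitespace such as a trailing newline
    parts += ["operations", operation_id.strip()]
    name = "/".join(parts)
    if not NAME_RE.fullmatch(name):
        raise ValueError(f"invalid operation name: {name!r}")
    return BASE_URL + name
```

For example, operation_url("my-project", "1efec2285bd442df\n") returns the clean URL, while an ID with an embedded space or slash raises ValueError before any request is made.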

How do you obtain a WeChat access_token certificate from api.wechat.com?

This site tries to explain the process: http://admin.wechat.com/wiki/index.php?title=Access_token
The problem is that nowhere do they tell you where to get the AppID or what exactly the secret is. Has anyone else succeeded in communicating with WeChat?
Essentially, we at WeChat have two types of accounts: subscription and service. A subscription account only gives you access to the Message API, which allows for receiving messages and autoresponses and lets you broadcast to your users once a day. Subscription accounts are also grouped in a category in your contacts under subscription.
A service account gives you an APP ID and APP SECRET, which allow you to generate an access token that is needed for pretty much all the other APIs apart from the Message API. A service account displays in the user's contact list under the main chats, in between all your other normal contacts. You can only broadcast to each of your users once a month on a service account.
If you have a service account, you will get the APP ID and APP SECRET from admin.wechat.com -> login -> function -> advanced -> developer mode. Just under your token you will see the APP ID and APP SECRET.
To see what type of account you have, go to admin.wechat.com -> login and look at the top right of the screen: just above your account name it will say either subscription account or service account.
If you want to test all the APIs, I recommend going to the developer sandbox environment where you get full access to all the APIs: How does link with href for Line and Wechat?
Please note that your number needs to be in international format, so 072 111 2233 has to be entered as +27721112233.
Log in at http://admin.wechat.com
Go to [advanced] --> [Developer Mode]; you will get your AppID and AppSecret there.
Don't have a WeChat OA account?
Join [WeChat Space] https://plus.google.com/communities/102783597675617808511
You may go to http://dev.wechat.com/ to sign up for a developer account.
After you sign up, you will get your App ID and AppKey via your signup email.
Then, you can go to http://admin.wechat.com/wiki/index.php?title=Main_Page to obtain more information.
I wrote a code snippet on GitHub that explains the entire process. The code is for Django but can be used with any Python framework.
Here is a snippet:
import xml.etree.ElementTree as ET
from wechat.views import WeChatView

class MyCustomView(WeChatView):
    token = "ad4sf65weG7Db6ddWE"

    def on_message(self, message):
        # message is parsed as XML; fields are read by position
        root = ET.fromstring(message)
        sender = root[1].text  # renamed: "from" is a reserved word in Python
        message_type = root[3].text
        content = root[4].text
        print('from: {}'.format(sender))
        print('message type: {}'.format(message_type))
        print('content: {}'.format(content))
The full code is here https://github.com/tawanda/django-wechat
Here's my code; maybe you can try it.
//Getting access_token from customize menus
static function get_access_token($appid,$secret){
$url="https://api.weixin.qq.com/cgi-bin/token?grant_type=client_credential&appid=".$appid."&secret=".$secret;
$json=http_request_json($url);//here cannot use file_get_contents
$data=json_decode($json,true);
if($data['access_token']){
return $data['access_token'];
}else{
return "Error occurred while getting the access_token";
}
}
// The request URL is HTTPS and file_get_contents cannot be used here, so fetch the JSON data with cURL
function http_request_json($url){
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL,$url);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, false);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$result = curl_exec($ch);
curl_close($ch);
return $result;
}
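For comparison, the same token request can be sketched in Python with only the standard library. The endpoint and query parameters are taken from the PHP code above; the function names (token_url, parse_token, fetch_access_token) are hypothetical, and the error handling simply reports whatever JSON came back when no access_token is present.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

TOKEN_ENDPOINT = "https://api.weixin.qq.com/cgi-bin/token"


def token_url(app_id, app_secret):
    """Build the token request URL; urlencode escapes any special characters."""
    query = urlencode({
        "grant_type": "client_credential",
        "appid": app_id,
        "secret": app_secret,
    })
    return f"{TOKEN_ENDPOINT}?{query}"


def parse_token(body):
    """Extract access_token from the JSON response body, or raise with the payload."""
    data = json.loads(body)
    if "access_token" not in data:
        raise RuntimeError(f"no access_token in response: {data}")
    return data["access_token"]


def fetch_access_token(app_id, app_secret, timeout=10):
    """Perform the HTTPS request (certificate verification stays enabled)."""
    with urlopen(token_url(app_id, app_secret), timeout=timeout) as resp:
        return parse_token(resp.read().decode("utf-8"))
```

Unlike the PHP version, this sketch leaves TLS certificate verification on, which is the default for urlopen.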

Caching JSON with Cloudflare

I am developing a backend system for my application on Google App Engine.
My application and backend server communicate using JSON, with URLs like http://server.example.com/api/check_status/3838373.json or just http://server.example.com/api/check_status/3838373/
I am planning to use Cloudflare for caching the JSON pages.
Which one should I use in the header?
Content-type: application/json
Content-type: text/html
Will Cloudflare cache my server's responses to reduce my costs? I won't be serving CSS, images, etc.
The standard Cloudflare cache level (under your domain's Performance Settings) is set to Standard/Aggressive, meaning it caches only certain types by default: scripts, stylesheets, images. Aggressive caching won't cache normal web pages (i.e. at a directory location or *.html) and won't cache JSON. All of this is based on the URL pattern (e.g. does it end in .jpg?), regardless of the Content-Type header.
The global setting can only be made less aggressive, not more, so you'll need to set up one or more Page Rules to match those URLs, using Cache Everything as the custom cache rule.
http://blog.cloudflare.com/introducing-pagerules-advanced-caching
BTW I wouldn't recommend using an HTML Content-Type for a JSON response.
By default, Cloudflare does not cache JSON files. I ended up configuring a new page rule:
https://example.com/sub-directiory/*.json*
Cache level: Cache Everything
Browser Cache TTL: set a timeout
Edge Cache TTL: set a timeout
Hope it saves someone's day.
The new workers feature ($5 extra) can facilitate this:
Important point:
Cloudflare normally treats normal static files as pretty much never expiring (or maybe it was a month - I forget exactly).
So at first you might think "I just want to add .json to the list of static extensions". This is likely NOT what you want with JSON, unless it changes really rarely or is versioned by filename. You probably want something like 60 seconds or 5 minutes, so that if you update a file it'll update within that time but your server won't get bombarded with individual requests from every end user.
Here's how I did this with a worker to intercept all .json extension files:
// Note: there could be tiny cut and paste bugs in here - please fix if you find!
addEventListener('fetch', event => {
event.respondWith(handleRequest(event));
});
async function handleRequest(event)
{
let request = event.request;
let ttl = undefined;
let cache = caches.default;
let url = new URL(event.request.url);
let shouldCache = false;
// cache JSON files with custom max age
if (url.pathname.endsWith('.json'))
{
shouldCache = true;
ttl = 60;
}
// look in cache for existing item
let response = await cache.match(request);
if (!response)
{
// fetch URL
response = await fetch(request);
// if the resource should be cached then put it in cache using the cache key
if (shouldCache)
{
// clone response to be able to edit headers
response = new Response(response.body, response);
if (ttl)
{
// https://developers.cloudflare.com/workers/recipes/vcl-conversion/controlling-the-cache/
response.headers.append('Cache-Control', 'max-age=' + ttl);
}
// put into cache (need to clone again)
event.waitUntil(cache.put(request, response.clone()));
}
return response;
}
else {
return response;
}
}
You could do this with mime-type instead of extension - but it'd be very dangerous because you'd probably end up over-caching API responses.
Also if you're versioning by filename - eg. products-1.json / products-2.json then you don't need to set the header for max-age expiration.
You can cache your JSON responses on Cloudflare similar to how you'd cache any other page - by setting the Cache-Control headers. So if you want to cache your JSON for 60 seconds on the edge (s-maxage) and the browser (max-age), just set the following header in your response:
Cache-Control: max-age=60, s-maxage=60
You can read more about different cache control header options here:
https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Cache-Control
Please note that different Cloudflare plans have different minimum edge cache TTL values (the Enterprise plan allows as low as 1 second). If your headers have a value lower than that, then I guess they might be ignored. You can see the limits here:
https://support.cloudflare.com/hc/en-us/articles/218411427-What-does-edge-cache-expire-TTL-mean-#summary-of-page-rules-settings
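To make the header-based approach concrete, here is a minimal Python sketch of an origin that serves JSON with the Cache-Control value shown above. It is a plain WSGI callable with no framework; the names cache_control and app and the sample payload are made up for this example, and whether the edge actually honors the header still depends on your Cloudflare cache settings and plan.

```python
import json


def cache_control(browser_ttl, edge_ttl):
    """max-age applies to browsers, s-maxage to shared caches such as a CDN edge."""
    return f"max-age={browser_ttl}, s-maxage={edge_ttl}"


def app(environ, start_response):
    """WSGI app returning a JSON body that may be cached for 60 seconds."""
    body = json.dumps({"status": "ok"}).encode("utf-8")
    start_response("200 OK", [
        ("Content-Type", "application/json"),
        ("Cache-Control", cache_control(60, 60)),
        ("Content-Length", str(len(body))),
    ])
    return [body]
```

It can be served locally with wsgiref.simple_server.make_server("", 8000, app).serve_forever() to inspect the response headers.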

Using Twitter API on shared server - Rate limit exceeded even though I am caching the response

I have written a php script which gets the latest status update for 12 different twitter accounts by pulling an xml for each and caching it on my server. This currently runs every 30 minutes.
Unfortunately I keep getting the "Rate limit exceeded. Clients may not make more than 150 requests per hour." error even though I'm only making 24 requests out of the 150 I should have.
I assume this is because my domain is on a shared server and Twitter is counting other requests against me.
How can I authorise my requests so I'm not restricted by the standard IP limit?
I have no experience of OAuth so need step by step instructions if possible.
Thanks in advance!
OK, so I managed to get most of this working with no previous experience of APIs etc.
Here is my step by step guide:
Step 1.
Create a Twitter list.
Go to: https://twitter.com/username/lists
Click "Create list"
Enter details and save.
Go to a twitter user you wish to add to the list and click the gear dropdown and select "Add or remove from lists". Tick the checkbox next to your list.
Step 2.
Create a Twitter App via: https://dev.twitter.com/apps/new
Log in using your Twitter credentials.
Give your app a name, description etc.
Go to the Settings tab and change the Access type to Read and Write then click "Update this Twitter application's settings".
Click "Create my access token" at the bottom of the page.
You will now have a Consumer Key, Consumer secret, Access token and Access token secret. Make a note of these.
Step 3. Create API tokens.
Download and install onto your server the Abraham Twitter oAuth library from: https://github.com/abraham/twitteroauth (I'll use a folder called "twitter").
Create a new file named authorise.php in the oAuth folder and put the following code inside (with your generated keys in place of the named text). (Put the code between <?php and ?> tags.)
// Create our twitter API object
require_once("twitteroauth/twitteroauth.php");
$oauth = new TwitterOAuth('Put-Consumer-Key-here', 'Put-Consumer-secret-here',
'Put-Access-Token-here', 'Put-Access-token-secret-here');
// Send an API request to verify credentials
$credentials = $oauth->get("account/verify_credentials");
echo "Connected as @" . $credentials->screen_name;
// Post our new "hello world" status
$oauth->post('statuses/update', array('status' => "hello world"));
This has now authorised your twitter App for the API and posted a "hello world" status on your twitter account.
Note: The Read / Write access change we made earlier allowed the code to post the status update. It's not actually needed to pull the list from the API, but I did it to make sure it was working OK. (You can turn this off again by going back to the Settings.)
Step 4.
Create PHP file to pull your list and cache the file.
Create an XML file (YOUR-FILE-NAME.xml) and save it in the oAuth folder.
Create a PHP file (YOUR-PHP-FILE.php) and save it in the oAuth folder
Edit the code below with your Twitter API keys, file name and Twitter list details and save it in your PHP file. (Put the code between <?php and ?> tags.)
/* Twitter keys & secrets here */
$consumer_key = 'INSERT HERE';
$consumer_secret = 'INSERT HERE';
$access_token = 'INSERT HERE';
$access_token_secret = 'INSERT HERE';
// Create Twitter API object
require_once('twitteroauth/twitteroauth.php');
// get access token and secret from Twitter
$oauth = new TwitterOAuth($consumer_key, $consumer_secret, $access_token, $access_token_secret);
// fake a user agent to have higher rate limit
$oauth->useragent = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.9) Gecko/20071025 Firefox/2.0.0.9';
// Send an API request to verify credentials
$credentials = $oauth->get('account/verify_credentials');
echo 'Connected as @' . $credentials->screen_name . "\n";
// Show API hits remaining
$remaining = $oauth->get('account/rate_limit_status');
echo "Current API hits remaining: {$remaining->remaining_hits}.\n";
$ch = curl_init();
$file = fopen("YOUR-FILE-NAME.xml", "w+");
curl_setopt($ch, CURLOPT_URL,'https://api.twitter.com/1/lists/statuses.xml?slug=INSERT-LIST-NAME&owner_screen_name=INSERT-YOUR-TWITTER-USERNAME-HERE&include_entities=true');
curl_setopt($ch, CURLOPT_FILE, $file);
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_exec($ch);
curl_close($ch);
fclose($file);
Copy the file path into your browser and test it. (e.g. http://www.yourwebsite.com/twitter/YOUR-PHP-FILE.php)
This should contact Twitter, pull the list as an XML file and save it into YOUR-FILE-NAME.xml. Test it by opening the XML file; it should have the latest statuses from the users in your Twitter list.
Step 5.
Automate the PHP script to run as often as you like (up to 350 times per hour) via a Cron job.
Open your Cpanel and click "Cron jobs" (usually under Advanced).
You can choose the regularity of your script using the common settings.
In the command field add the following code:
php /home/CPANEL-USERNAME/public_html/WEBSITE/twitter/YOUR-PHP-FILE.php >/dev/null 2>&1
Your script will now run as often as you have chosen, pull the list from twitter and save it into YOUR-FILE-NAME.xml.
Step 6.
You can now pull statuses from the cached XML file meaning your visitors will not be making unnecessary calls to the API.
I've not worked out how to target a specific screen_name yet if anyone can help there?
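One possible way to pick out a single account from the cached file is to filter the statuses by screen_name after loading the XML. The Python sketch below assumes each status element contains a text element and a user element with a screen_name child; that layout is an assumption about the cached response, so check it against your own YOUR-FILE-NAME.xml. The function name statuses_by_user is hypothetical.

```python
import xml.etree.ElementTree as ET


def statuses_by_user(xml_text, screen_name):
    """Return the text of every cached status posted by the given screen_name."""
    root = ET.fromstring(xml_text)
    texts = []
    # Assumed layout: <status><text>...</text><user><screen_name>...</screen_name></user></status>
    for status in root.iter("status"):
        name = status.findtext("user/screen_name")
        if name is not None and name.lower() == screen_name.lower():
            texts.append(status.findtext("text", default=""))
    return texts
```

Because it only reads the cached file, this adds no calls against the API rate limit.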
a) don't check 12 different accounts; create a [public] list https://twitter.com/lists and check only it => 12 times fewer requests
b) use this awesome oAuth lib: https://github.com/abraham/twitteroauth and use oAuth requests instead of unsigned => you will get 350 requests and they will not be affected by IP limit

processing application response from apply with linkedin

I am trying to get the response returned by the Apply with LinkedIn plugin. I am using Zend Framework. I have tried using the sample code from LinkedIn:
<?php
// access raw HTTP POST data
$post_body = file_get_contents('php://input');
$application = json_decode($post_body);
// confirm success by writing applicant name to the error log
error_log($application->person->firstName . " " . $application->person->lastName);
// now parse $application and insert data into a DB
// or perform another action
?>
I pasted this into my controller and then my view, but with no success.
My data-url is the URL of the current page...
Could anyone show me how this is supposed to be done in PHP or JavaScript? (I need to get the response into the database.)
Thanks.
Try replacing
$post_body = file_get_contents('php://input');
With:
$post_body = $this->getRequest()->getRawBody();
Where $this is the current controller, so getRequest() returns the current request. I'm not positive if you get this for free..., but I'm hoping you can take this from here. I'm not a ZF expert.
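Whichever way the raw body is obtained, the parsing step is the same: decode the JSON and read the person fields. As a language-neutral illustration, here is that step sketched in Python rather than PHP; the field names person, firstName and lastName come from the LinkedIn sample code above, and the function name applicant_name is made up.

```python
import json


def applicant_name(raw_body):
    """Decode the POSTed JSON and return the applicant's full name (may be empty)."""
    application = json.loads(raw_body)
    person = application.get("person", {})
    first = person.get("firstName", "")
    last = person.get("lastName", "")
    return f"{first} {last}".strip()
```

If this returns an empty string for a real submission, the body probably never reached the handler, which points at routing or the data-url setting rather than at the parsing.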