How to find inbound links to a given URL on the fly? - api

Technorati's got their Cosmos API, which works fairly well but limits you to noncommercial use and no more than 500 queries a day.
Yahoo's got a Site Explorer InLink Data API, but it defines the task very literally, returning links from sidebar widgets in blogs rather than just links from inside blog content.
Is there any other alternative for tracking who's linking to a given URL (think of the discussion links that run below stories on Techmeme.com)? Or will I have to roll my own?

Well, it's not an API, but if you google (for example): "link:nytimes.com", the search results that come back show inbound links to that site.
I haven't tried to implement what you want yet, but the Google search API almost certainly has that functionality built in.
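For illustration, here is a rough sketch of driving that "link:" query programmatically. It assumes the old Google AJAX Search API (long deprecated) and its documented JSON field names, so treat the endpoint and field names as assumptions and swap in whichever search API you actually use:

# Rough sketch: ask a search API which pages link to a given URL.
# The endpoint and the responseData/results/url field names are assumptions
# based on the old (deprecated) Google AJAX Search API.
import json
import urllib.parse
import urllib.request

def inbound_links(url):
    query = urllib.parse.quote_plus("link:" + url)
    endpoint = ("http://ajax.googleapis.com/ajax/services/search/web"
                "?v=1.0&q=" + query)
    with urllib.request.urlopen(endpoint) as resp:
        data = json.load(resp)
    results = (data.get("responseData") or {}).get("results") or []
    return [hit.get("url") for hit in results]

if __name__ == "__main__":
    for link in inbound_links("nytimes.com"):
        print(link)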

Is this for links to URLs under your control?
If so, you could whip up something quick that logs the Referer HTTP header of incoming requests.
If you wanted to do this for an entire web site without altering application code, you could implement it as an ISAPI filter or equivalent for your web server of choice.
Information available publicly from web crawlers is always going to be incomplete and unreliable (not that my solution isn't...).
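If you go the Referer route, a minimal sketch of that idea (Python/WSGI purely for illustration; the log file name and class name are made up) could be:

# Sketch: log the Referer header of incoming requests so inbound links
# reveal themselves over time. Wrap your existing WSGI app with RefererLogger.
import logging

logging.basicConfig(filename="inbound_links.log", level=logging.INFO)

class RefererLogger:
    def __init__(self, app):
        self.app = app

    def __call__(self, environ, start_response):
        referer = environ.get("HTTP_REFERER")
        if referer:
            # Record which external page sent the visitor to which local path.
            logging.info("%s -> %s", referer, environ.get("PATH_INFO", "/"))
        return self.app(environ, start_response)

# Usage (hypothetical): application = RefererLogger(my_existing_wsgi_app)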

Related

Google Custom Search refinement redirect

So I'm using Google Custom Search (Google CSE) and I'm trying to use the refinement functionality to redirect search queries to Google Scholar.
Basically I'm following exactly the documentation found here. However, it turns out that, despite there being documentation, this functionality doesn't exist, and it doesn't appear that Google has any plans to implement it in the near future (see the StackOverflow post here).
My question is, does anyone have a hack/workaround for this problem, so that I could use Google CSE to search Google Scholar?
Server Side
You can use something like https://github.com/ckreibich/scholar.py to parse the results from google scholar yourself and expose it as an API that you could consume and render any way you liked.
It would use Scholar search under the hood. However, since this isn't an official API, it might break at any time, and it requires you to have server-side resources to service the requests; on the other hand, it gives you the nicest interface, one you have full control over.
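As a rough sketch of that server-side approach (Flask for the API layer; the scholar.py flags shown are assumptions, so check scholar.py --help before relying on them):

# Sketch: expose Google Scholar results as your own JSON API by shelling out
# to scholar.py (https://github.com/ckreibich/scholar.py). The flags below are
# assumptions; scraping Scholar is unofficial and may break or be blocked.
import subprocess
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/scholar")
def scholar_search():
    query = request.args.get("q", "")
    result = subprocess.run(
        ["python", "scholar.py", "--csv", "--phrase", query],
        capture_output=True, text=True, timeout=30,
    )
    # Return the raw CSV for now; parse and shape it however you like.
    return jsonify({"query": query, "raw_csv": result.stdout})

if __name__ == "__main__":
    app.run(port=5000)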
IFrame
You can open an iframe at the particular URL, and this can be embedded inside your page. It looks a bit clunkier, but it means you don't have to link externally and you can embed it locally
<iframe src='http://scholar.google.com/scholar?q={query}'></iframe>
See the documentation here; it might be exactly what renders well for you.
External Link
Alternatively, you can just open a new tab/window with:
<a href='http://scholar.google.com/scholar?q={query}' target='_blank'> My Link </a>

Rightmove API and scraping: technical and legal

I'm looking to build an app using property data. Nestoria has a free API with rules of use, and Zoopla has an API you register for. OnTheMarket and Rightmove have the same terms of use to the letter (bizarre for competitors?). Rightmove advertises an API for upload but not download - I can't find anything for OnTheMarket.
I've discovered that Rightmove does have an API, although the postcode search is obfuscated by their own outcode mappings...
https://api.rightmove.co.uk/api/sale/find?index=0&sortType=1&numberOfPropertiesRequested=2&locationIdentifier=OUTCODE%5E1&apiApplication=IPAD
I'm wary of using an API that's not promoted. The alternative is scraping, which is harder technically and legally questionable, although from what I read the data is in the public domain and so free to use.
I've contacted Rightmove but got no response.
Is anyone using the Rightmove API, and have they had it authorised by Rightmove? It seems most strange that it's open and available but barely mentioned when searching for it.
Can anyone clarify what rules/law/ethics are in place for scraping data?
Don't query their hidden API. You can, however, run a web crawler on the RightMove.co.uk website; that is explicitly permitted by their Terms of Service, section 3.3, which states:
You must not use or attempt to use any automated program unless the automated program identifies itself uniquely in the User Agent field and is fully compliant with the Robots Exclusion Protocol
A web crawler like Apache Nutch follows the Robots Exclusion Protocol to the letter. From their robots.txt file I found that they publish elaborate nested sitemap.xml files, so they effectively encourage organised but polite crawling of their website. I wanted their data myself, so I am starting to crawl them with my own resources - let me know if you need access to this data.
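For illustration, a minimal sketch of those two requirements using only Python's standard library - a unique User-Agent string plus a robots.txt check before every fetch (the agent string and URLs are placeholders):

# Sketch of a "polite" fetcher: identifies itself uniquely in the User-Agent
# field and obeys robots.txt, as the quoted terms require.
import urllib.request
import urllib.robotparser

USER_AGENT = "my-unique-research-crawler/0.1 (contact@example.com)"

robots = urllib.robotparser.RobotFileParser()
robots.set_url("https://www.rightmove.co.uk/robots.txt")
robots.read()

def polite_fetch(url):
    if not robots.can_fetch(USER_AGENT, url):
        raise PermissionError("robots.txt disallows fetching " + url)
    req = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(req) as resp:
        return resp.read()

# Usage (hypothetical): xml = polite_fetch("https://www.rightmove.co.uk/sitemap.xml")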
You are not allowed to scrape their data. Here is what their terms & conditions say about it:
"You must not use or attempt to use any automated program (including, without limitation, any spider or other web crawler) to access our system or this Site. You must not use any scraping technology on the Site. Any such use or attempted use of an automated program shall be a misuse of our system and this Site. Obtaining access to any part of our system or this Site by means of any such automated programs is strictly unauthorised."

Track how often link was clicked

I am currently running a website where I promote different coffees from pubs in my city. On my website I have links to the different coffees.
I have recently seen some of these links being shared on Facebook and other social networks.
So I was wondering whether it is somehow possible to track how often one of these links is clicked?
I have tried using redirects to my site, but then Facebook uses my pictures in the previews, which I don't want because it is misleading.
I have seen that this works with Bitly, so it must somehow be possible?
There are of course various services providing this, but it would be nice if it ran without relying on any third-party services.
So basically I am looking for a solution that will let me know how often a link originating from my site was clicked on Facebook, Google+ or any other forum.
There definitely is. Try looking into Google Analytics; it will show you so much data about your websites and links that it can blow your mind! Here is the link:
Google Analytics helps you analyze visitor traffic and paint a complete picture of your audience and their needs. Track the routes people take to reach you and the devices they use to get there with reporting tools like Traffic Sources. Learn what people are looking for and what they like with In-Page Analytics. Then tailor your marketing and site content for maximum impact.
You can even get a free package to use!
Hope this helps!
Yes, you have plenty of analytics options.
Something as straightforward as Google Analytics, for example.
If you are using cPanel on your host's server, you even have options such as AWStats, which will also provide this information.
If all else fails, you can even use the request data stored in your Apache/Nginx access logs (a rough sketch of this is shown after this answer).
Since you have amended your question, you might want to check out this tool. It is not Google. :)
It is called ClickMeter; it performs link tracking and provides click reports, etc.
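As a rough illustration of the access-log option, here is a sketch that counts requests to one path grouped by Referer, assuming the standard "combined" log format (the log path and target path are placeholders):

# Sketch: count how often a given path was requested, grouped by the Referer
# header, from an Apache/Nginx access log in the standard "combined" format.
import re
from collections import Counter

# combined format: ip - - [time] "METHOD path HTTP/x" status size "referer" "agent"
LINE_RE = re.compile(r'"(?:GET|POST|HEAD) (?P<path>\S+) [^"]*" \d+ \S+ "(?P<referer>[^"]*)"')

def clicks_by_referer(logfile, target_path):
    counts = Counter()
    with open(logfile) as fh:
        for line in fh:
            m = LINE_RE.search(line)
            if m and m.group("path") == target_path:
                counts[m.group("referer") or "(none)"] += 1
    return counts

if __name__ == "__main__":
    for referer, n in clicks_by_referer("/var/log/nginx/access.log",
                                        "/coffee/espresso").most_common():
        print(n, referer)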

Is this a blackhat SEO technique?

I have a site which has been developed completely in Flash. The site owners do not want to shift to a more text/HTML-based site, so I am planning to create an alternative HTML/text-based site to which Googlebot will be redirected (by checking the user agent). My question is: is this officially allowed by Google?
If not, then how come there are many subscription-based sites which display a different set of data to Google than to users? Is that allowed?
Thank you very much.
I've dealt with this exact scenario for a large ecommerce site and Google essentially ignored the site. Google considers it cloaking and addresses it directly here and says:
Cloaking refers to the practice of presenting different content or URLs to users and search engines. Serving up different results based on user agent may cause your site to be perceived as deceptive and removed from the Google index.
Instead, create an ADA-compliant version of the website so that users with screen readers and vision aids can use your web site. As long as there is a link from your home page to your ADA-compliant pages, Google will index them.
The official advice seems to be: offer a visible link to a non-Flash version of the site. Fooling Googlebot is a surefire way to get in trouble. And remember, Google results will link to the matching page, so don't create results that are useless to searchers.
Google already indexes flash content so my suggestion would be to check how your site is being indexed. Maybe you don't have to do anything.
I don't think showing an alternate version of the site is good from a Google perspective.
If you serve up your page with the exact same address, then you're probably fine. For example, if you show 'http://www.somesite.com/' but direct googlebot to 'http://www.somesite.com/alt.htm', then Google might direct search users to alt.htm. You don't want that, right?
This is called cloaking. I'm not sure what the effects of it are but it is certainly not whitehat. I am pretty sure Google is working on a way to crawl flash now so it might not even be a concern.
I'm assuming you're not really doing a redirect but instead a PHP import or something similar so it shows up as the same page. If you're actually redirecting then it's just going to index the other page like normal.
Some sites offer a different level of content -- they LIMIT the content; they don't offer alternative and additional content. This is generally done so that Google doesn't index unrelated things.

How do sites like Hubspot track inbound links?

Are all these types of sites just illegally scraping Google or another search engine?
As far as I can tell there is no 'legal' way to get this data for a commercial site. The Yahoo! API ( http://developer.yahoo.com/search/siteexplorer/V1/inlinkData.html ) is only for noncommercial use, Yahoo! BOSS does not allow automated queries, etc.
Any ideas?
For example, if you wanted to find all the links to Google's homepage, search for
link:http://www.google.com
So if you want to find all the inbound links, you can simply traverse your website's tree and, for each page you find, build its URL. Then query Google for:
link:URL
And you'll get a collection of all the links that Google has from other websites into your website (a rough sketch of this loop is included after this answer).
As for the legality of such harvesting, I'm sure it's not-exactly-legal to make a profit from it, but that's never stopped anyone before, has it?
(So I wouldn't bother wondering whether they did it or not. Just assume they do.)
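To make the traverse-and-query idea concrete, here is a rough sketch. The link: operator is only loosely supported these days and Google throttles automated queries, so the search() helper is hypothetical - plug in whichever (terms-compliant) search API you actually have access to:

# Sketch: walk your own sitemap, build a "link:" query for each page, and
# collect whatever the (hypothetical) search backend returns for it.
import urllib.request
import xml.etree.ElementTree as ET

SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def site_pages(sitemap_url):
    with urllib.request.urlopen(sitemap_url) as resp:
        tree = ET.parse(resp)
    return [loc.text for loc in tree.iter(SITEMAP_NS + "loc")]

def inbound_links(sitemap_url, search):
    links = {}
    for page in site_pages(sitemap_url):
        # e.g. search("link:http://example.com/page") -> list of linking URLs
        links[page] = search("link:" + page)
    return links

# Usage (hypothetical):
# results = inbound_links("http://www.example.com/sitemap.xml", search=my_search_api)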
I don't know what hubspot do, but, if you wanted to find out what sites link to your site, and you don't have the hardware to crawl the web, one thing you can do is monitor the HTTP_REFERER of visitors to your site. This is, for example, how Google Analytics (as far as I know) can tell you where your visitors are arriving from. This is not 100% reliable as not all browsers set it, particularly in "Privacy Mode", but you only need one visitor per link to know that it exists!
This is often accomplished by embedding a script into each of your web pages (often in a common header or footer). For example, if you examine the source for the page you are currently reading you will find (right down at the bottom) a script that reports back to Google information about your visit; a minimal beacon endpoint along those lines is sketched at the end of this answer.
Now this won't tell you if there are links out there that no one has ever used to get to your site, but let's face it, they are a lot less interesting than the ones people actually use.
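For completeness, here is a minimal sketch of such a beacon in Flask (the endpoint name and log file are illustrative): each page embeds a tiny <img> pointing at the endpoint; the request's Referer header identifies the page that was viewed, and a small script can pass document.referrer along as a query parameter to record where the visitor arrived from.

# Sketch of a tracking beacon endpoint. Each page embeds a 1x1 image pointing
# here; the image request's Referer header tells you which of your pages was
# viewed, and an optional ?from=<document.referrer> parameter (filled in by a
# small script on the page) records where the visitor arrived from.
import logging
from flask import Flask, Response, request

logging.basicConfig(filename="referers.log", level=logging.INFO)
app = Flask(__name__)

# Smallest valid transparent GIF, served as the beacon image.
PIXEL = (b"GIF89a\x01\x00\x01\x00\x80\x00\x00\x00\x00\x00\xff\xff\xff!"
         b"\xf9\x04\x01\x00\x00\x00\x00,\x00\x00\x00\x00\x01\x00\x01\x00"
         b"\x00\x02\x02D\x01\x00;")

@app.route("/beacon.gif")
def beacon():
    logging.info("page=%s came_from=%s",
                 request.headers.get("Referer", "(unknown page)"),
                 request.args.get("from", "(not supplied)"))
    return Response(PIXEL, mimetype="image/gif")

if __name__ == "__main__":
    app.run(port=5000)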