Google Custom Search API: Using it as a scraper?

Is this API only for searching your own website, or can any standard Google search (including advanced search features) be submitted to it? I understand there is a limit of 100 queries per day. I am curious whether it can be invoked from, say, your own machine, since the code samples and introduction indicate its intended use is displaying results on your website. I want to search outside a given domain and scrape standard Google results for any given search. This will not be an AJAX call.

My current understanding:
You're limited to 100 queries/day only if you don't pay.
You do have to specify domains, but some TLDs are fine (e.g. .uk).
There's a limit of 100 search results for any given query (ten pages of up to ten results each).
It can be invoked from your own machine.
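For reference, here is a minimal sketch of calling the JSON API from your own machine, under the assumptions above. YOUR_API_KEY and YOUR_CX are placeholders for the credentials you get from the API Console; this uses Python's requests library.

# Minimal sketch: query the Custom Search JSON API from your own machine.
import requests

API_KEY = "YOUR_API_KEY"  # placeholder: create in the API Console
CX = "YOUR_CX"            # placeholder: your search engine ID

params = {"key": API_KEY, "cx": CX, "q": "best movies 2012"}
resp = requests.get("https://www.googleapis.com/customsearch/v1", params=params)
resp.raise_for_status()
for item in resp.json().get("items", []):
    print(item["link"], "-", item["title"])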

Related

List the most viewed pages of Wikimedia projects (Wikipedia) with more than 1000 results

I have seen that there are various APIs and tools that let you see the most visited pages of Wikimedia projects such as Wikipedia, but all of these services have a limit: they will not show more than 1000 pages, while I would like the list of the 5000-10000 (or more) most visited pages, in order of traffic.
These are all the services I checked, and with each of them I ran into this limit:
https://en.wikipedia.org/w/api.php?action=help&modules=query%2Bmostviewed
https://stats.wikimedia.org/#/en.wikipedia.org/reading/top-viewed-articles/normal|table|last-month|~total|monthly
https://pageviews.toolforge.org/topviews/?project=en.wikipedia.org&platform=all-access&date=last-month&excludes=
https://wikimedia.org/api/rest_v1/#/Pageviews%20data
I have also found services like https://quarry.wmflabs.org/ or https://query.wikidata.org/ where you can run a query; technically this could perhaps be done through them, but I don't know the query that would have to be run to list the pages with the most visits.
I also found an interesting article here: https://www.reddit.com/r/bigquery/comments/3dg9le/analyzing_50_billion_wikipedia_pageviews_in_5/ where it is explained that it is possible to use Google's BigQuery, but that is an external service, and before using it I wanted to know whether a simpler method exists.
If the REST API doesn't suit your purpose, you'd need to parse the raw data yourself. That's because all the tools you've linked just consume the REST API.
The raw data are available at https://dumps.wikimedia.org/other/pageviews/. There are two groups of files there: one group starts with pageviews-, which lists the number of views of individual pages; the other starts with projectviews-, which lists the number of views of whole projects.
For your goal, you need the pageviews- files. Download the files for your timespan, and then analyze them using a script, as sketched below.
The files are space-separated. Each row represents one page that was visited in that hour. The first column is the project (en is English Wikipedia, for instance), the second is the page title (spaces are represented by underscores), and the third is the total pageviews for that hour.
The technical documentation is available at https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/Pageviews.
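As an illustration, here is a minimal sketch of such a script, assuming you have already downloaded and decompressed some pageviews- files into the working directory; the file-name pattern and the 5000 cutoff are placeholders.

# Minimal sketch: aggregate hourly pageviews- files into a top-N list.
from collections import Counter
import glob

counts = Counter()
for path in glob.glob("pageviews-*"):  # placeholder pattern for your timespan
    with open(path, encoding="utf-8", errors="replace") as f:
        for line in f:
            fields = line.split(" ")
            if len(fields) < 3 or fields[0] != "en":  # English Wikipedia only
                continue
            title, views = fields[1], fields[2]
            try:
                counts[title] += int(views)
            except ValueError:
                continue  # skip malformed rows

# Print the 5000 most visited pages, most viewed first.
for title, views in counts.most_common(5000):
    print(views, title.replace("_", " "))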

Google Custom Search Engine pricing

The pricing regarding CSE is a little bit vague:
For CSE users, the API provides 100 search queries per day for free. If you need more, you may sign up for billing in the API Console. Additional requests cost $5 per 1000 queries, up to 10k queries per day
Does one query equal one keyword (regardless of the pagination used), or one request? (In this sense the XML API is more efficient than JSON, as it allows 20 in the num parameter, as opposed to JSON's 10.)
Are the queries counted per API key, or per cx key?
It is vague, and you are not the first to be puzzled. When I did my research, I found this blog post helpful.
I assume you are talking about Custom Search Engine (the terms you noted in your question) and NOT Google Site Search (paid from the start). The reason I ask is that the XML API is only for Google Site Search customers; for CSE there are the JSON/Atom API and the Custom Search API.
For Q1, one query = one request. You can use as many keywords or other parameters in your request as you like (see the comments in the blog post I referenced), but you will always be limited to 100 results, as the sketch below illustrates.
For Q2, billing is enabled through the API Console. Once enabled (and in order to allow the 101st query), your code must include both your cx and your API key. So in theory you could set up multiple search engines within your API project and stay under the 100-request limit, but I have not seen a way for one API key to support multiple cx keys.
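To make the counting concrete, here is a minimal sketch under the same assumptions: pulling all 100 available results for one keyword through the JSON API takes ten paginated requests (start=1, 11, ..., 91), and every one of them counts as a query against your quota. The key and cx values are placeholders.

# Minimal sketch: pagination, where every page fetch is one billed query.
import requests

API_KEY = "YOUR_API_KEY"  # placeholder
CX = "YOUR_CX"            # placeholder

results, queries_used = [], 0
for start in range(1, 101, 10):  # 1, 11, ..., 91
    params = {"key": API_KEY, "cx": CX, "q": "example keyword", "num": 10, "start": start}
    resp = requests.get("https://www.googleapis.com/customsearch/v1", params=params)
    resp.raise_for_status()
    queries_used += 1
    items = resp.json().get("items", [])
    if not items:
        break  # fewer than 100 results exist for this keyword
    results.extend(items)
print(len(results), "results fetched using", queries_used, "queries")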

Know the page rank for certain keywords

I want to know the rank of my page for certain keywords. For example, when I search "best movies 2012", my page does come up, but on the 30th to 50th page. I want to query the result set Google returns for my keywords so that I can see the rank of my page and of my competitors for typical keywords.
I think you may be confusing PageRank with positions. PageRank is an algorithm that Google uses to determine the authority of your site; it doesn't always affect the positions of particular keywords.
There are plenty of good programs and web services around that you can use, such as
http://raventools.com/
Most of the good free web services have been shut down because Google now limits the number of searches performed and charges for this data.
You could check out:
http://www.semrush.com
It's free but you have to register to get data.
There are several web services providing this functionality, such as http://raventools.com/ or http://seomoz.org/
Or you can perform the task manually. Here is an example of how to query Google search using Java: How can you search Google Programmatically Java API. A Python sketch of the same idea follows.
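For those who prefer Python over Java, here is a minimal sketch of the manual approach using the Custom Search JSON API: it walks the paginated results for a keyword and reports the position at which a given domain first appears. The key, cx, and domain values are placeholders, and the API caps you at 100 results.

# Minimal sketch: find the position of a domain in the results for a keyword.
import requests

API_KEY = "YOUR_API_KEY"  # placeholder
CX = "YOUR_CX"            # placeholder
DOMAIN = "example.com"    # placeholder: the site whose rank you want

def find_position(keyword):
    position = 0
    for start in range(1, 101, 10):  # the API exposes at most 100 results
        params = {"key": API_KEY, "cx": CX, "q": keyword, "start": start}
        resp = requests.get("https://www.googleapis.com/customsearch/v1", params=params)
        resp.raise_for_status()
        for item in resp.json().get("items", []):
            position += 1
            if DOMAIN in item.get("link", ""):
                return position
    return None  # not within the first 100 results

print(find_position("best movies 2012"))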
You need to compare your webpage's PageRank and website PR against those of the competition. The best indication we have of website PR is the home page PageRank.
Ensure that you do this for the appropriate Google domain: google.com for the USA, google.co.uk for the UK, and so on.
The technique is described in more detail on http://www.keywordseopro.com
You can repeat the technique for each keyword.

How to retrieve all tweets from a user, and not just the most recent 3,200 that Twitter limits its timeline and API to

With https://dev.twitter.com/docs/api/1/get/statuses/user_timeline I can get the 3,200 most recent tweets. However, certain sites like http://www.mytweet16.com/ seem to bypass the limit, and my browse through the API documentation did not turn up anything.
How do they do it, or is there another API that doesn't have the limit?
You can use the Twitter search page to bypass the 3,200 limit; however, you have to scroll down many times in the search results page. For example, I searched for tweets from:beyinsiz_adam. This is the link to the search results:
https://twitter.com/search?q=from%3Abeyinsiz_adam&src=typd&f=realtime
Now, in order to scroll down many times, you can use the following JavaScript code.
// Scroll to the bottom once a second so the page keeps loading older tweets.
var myVar = setInterval(function () { myTimer(); }, 1000);
function myTimer() {
    window.scrollTo(0, document.body.scrollHeight);
}
Just run it in the Firebug console and wait some time for all the tweets to load.
The only way to see more is to start saving them before the user's tweet count hits 3,200. Services that show more than 3,200 tweets have saved them in their own databases; there's currently no way to get more than that through any Twitter API.
http://www.quora.com/Is-there-a-way-to-get-more-than-3200-tweets-from-a-twitter-user-using-Twitters-API-or-scraping
https://dev.twitter.com/discussions/276
Note from that second link: "…the 3,200 limit is for browsing the timeline only. Tweets can always be requested by their ID using the GET statuses/show/:id method."
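To illustrate that note, here is a minimal sketch of requesting one tweet by ID through the 1.1 REST API. The four OAuth credentials and the tweet ID are placeholders, and requests_oauthlib handles the request signing.

# Minimal sketch: fetch a single tweet by ID via GET statuses/show/:id.
import requests
from requests_oauthlib import OAuth1

# Placeholder credentials from a Twitter developer account.
auth = OAuth1("CONSUMER_KEY", "CONSUMER_SECRET", "ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")

resp = requests.get(
    "https://api.twitter.com/1.1/statuses/show.json",
    params={"id": "210462857140252672"},  # placeholder tweet ID
    auth=auth,
)
resp.raise_for_status()
tweet = resp.json()
print(tweet["user"]["screen_name"], ":", tweet["text"])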
I've been in this (Twitter) industry for a long time and have witnessed lots of changes in the Twitter API and documentation. I would like to clarify one thing: there is no way to surpass the 3,200-tweet limit. Twitter doesn't provide this data even in its new premium API.
The only way someone can surpass this limit is by saving an individual Twitter user's tweets as they are posted.
There are tools available which claim to have a wide database and to provide more than 3,200 tweets. A few of them that I know of are followersanalysis.com and keyhole.co.
You can use a tool I wrote that bypasses the limit.
It saves the tweets in JSON format.
https://github.com/pauldotknopf/twitter-dump
You can use the Python library snscrape to do it (a sketch follows), or you can use the ExportData tool to get all tweets for the user, which returns already preprocessed CSV and spreadsheet files. The first option is free, but it yields less information and requires more manual work.
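Here is a minimal sketch of the snscrape option; the username is a placeholder, and the attribute names are those of snscrape's Twitter module at the time of writing (no API keys are needed, since it scrapes the web interface).

# Minimal sketch: pull a user's full timeline with snscrape (pip install snscrape).
import snscrape.modules.twitter as sntwitter

username = "jack"  # placeholder account
for i, tweet in enumerate(sntwitter.TwitterUserScraper(username).get_items()):
    print(tweet.date, tweet.id, tweet.content)
    if i >= 9999:
        break  # stop after 10,000 tweets for this demo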

Programmatic Querying of Google and Other Search Engines With Domain and Keywords

I'm trying to find out if there is a programmatic way to determine how far down in a search engine's results my site shows up for given keywords. For example, my query would provide my domain name and keywords, and the result would return, say, 94, indicating that my site was the 94th result. I'm specifically interested in how to do this with Google, but I'm also interested in Bing and Yahoo.
No.
There is no programmatic access to such data, so people generally roll their own trackers: fetch the Google search results page and use regexes to find your position. But these days different results are shown in different geographies, and results are personalized.
The gl=us parameter will help you get results from the US; you can change the geography accordingly to get other results. A sketch of this approach follows.
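Here is a minimal sketch of such a homegrown tracker, with the caveats above: it fetches a results page with gl=us and scans the links for the target domain. Google's HTML changes frequently and scraping it may violate the terms of service, so treat this as an illustration rather than a robust tool; the User-Agent string, domain, and regex are all approximations.

# Minimal sketch: fetch a Google results page and locate a domain's position.
# Fragile by design: Google's markup changes, and scraping may violate its ToS.
import re
import requests

def position_of(domain, keyword):
    params = {"q": keyword, "gl": "us", "num": 100}  # gl=us pins the geography
    headers = {"User-Agent": "Mozilla/5.0"}  # placeholder UA string
    resp = requests.get("https://www.google.com/search", params=params, headers=headers)
    resp.raise_for_status()
    # Grab outbound result links; the pattern is approximate and may need updating.
    links = re.findall(r'href="(https?://[^"]+)"', resp.text)
    organic = [u for u in links if "google." not in u]
    for i, url in enumerate(organic, start=1):
        if domain in url:
            return i
    return None  # not found on the page

print(position_of("example.com", "best movies 2012"))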
Before creating this from scratch, you may want to save yourself some time (and money) by using a service that does exactly that (and more): Ginzametrics.
They have a free plan (so you can test whether it fits your requirements and check if it's really worth creating your own tool), an API, and they can even import data from Google Analytics.