Filter Google query results - Lucene

I'm writing a search engine for Wikipedia articles using Lucene on the Wikipedia XML dump, and I want to measure the engine's accuracy against Google's Wikipedia results for a particular query, i.e. when I add "site:en.wikipedia.org" to the query. I want to do this for multiple queries, so I'm currently collecting the Google search result URLs manually. I got the Google APIs to search Google with a bot, but the problem is that I want to get rid of certain types of results, like
"/Category:"
"/icon:"
"/file:"
"/photo:"
and user pages.
But I haven't found a convenient way to do this except for an iterative approach: issue a query, get n results, filter out x of them with regular expressions, then retrieve the remaining (n-x) results, and so on. Google keeps blocking me when I do that.
Is there an intelligent way to get Google results the way I want using Java?
Thanks in advance guys.

You could just try excluding those pages from the Google results, like this:
living people site:en.wikipedia.org -inurl:category -inurl:category_talk -inurl:file -inurl:file_talk -inurl:user -inurl:user_talk
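If you issue that query through the Google Custom Search JSON API rather than scraping the results page, the exclusions come back already applied and you stay on an officially supported interface. Below is a minimal Python sketch of the idea (the same HTTP call is easy to make from Java); YOUR_API_KEY and YOUR_CSE_ID are placeholders for credentials you would create in the Google developer console.

```python
# Minimal sketch: run the exclusion query through the Custom Search JSON API.
# YOUR_API_KEY / YOUR_CSE_ID are placeholders, not real credentials.
import requests

API_KEY = "YOUR_API_KEY"
CSE_ID = "YOUR_CSE_ID"   # a custom search engine restricted to en.wikipedia.org

query = ("living people site:en.wikipedia.org "
         "-inurl:category -inurl:category_talk "
         "-inurl:file -inurl:file_talk "
         "-inurl:user -inurl:user_talk")

resp = requests.get(
    "https://www.googleapis.com/customsearch/v1",
    params={"key": API_KEY, "cx": CSE_ID, "q": query, "num": 10},
)
resp.raise_for_status()

# Each item is one search hit; the link field holds the result URL.
for item in resp.json().get("items", []):
    print(item["link"])
```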

Related

Filter, subset and download Wikidata

Is there any easier way to filter data in Wikidata and download a portion of claims?
For example, let's say I want a list of all humans who are currently alive and have an active Twitter profile.
I would like to download a file containing their Q-ids, names and Twitter usernames (https://www.wikidata.org/wiki/Property:P2002).
I expect there to be hundreds of thousands of results, if not millions.
What is the best way to obtain this information?
I am not sure if by submitting a SPARQL query, one can collect results in a file.
I looked at the MediaWiki API, but I'm not sure whether it allows accessing multiple entities in one go.
Thanks!
Wikidata currently has around 190,000 Twitter IDs linked to people. You can easily get them all using the SPARQL Query Interface: Web Interface (with a LIMIT you can remove or increase). In the dropdown on the right, choose SPARQL Endpoint for the Direct Link (no limit, 35MB .csv).
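In case it helps, here is a minimal sketch of running that kind of query against the public SPARQL endpoint from Python and saving the rows to a CSV file. The "currently alive" condition is approximated as "has no date of death (P570)", and the LIMIT is only there to keep the example fast; remove it for a full export.

```python
# Sketch: humans (P31=Q5) with a Twitter username (P2002) and no date of
# death (P570), exported as CSV from the Wikidata SPARQL endpoint.
import requests

SPARQL = """
SELECT ?item ?itemLabel ?twitter WHERE {
  ?item wdt:P31 wd:Q5 ;            # instance of: human
        wdt:P2002 ?twitter .       # Twitter username
  FILTER NOT EXISTS { ?item wdt:P570 [] }   # no date of death recorded
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 1000
"""

resp = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": SPARQL},
    headers={"Accept": "text/csv"},   # the endpoint can return CSV directly
)
resp.raise_for_status()

with open("humans_with_twitter.csv", "w", encoding="utf-8") as f:
    f.write(resp.text)
```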
But, in case you run into timeouts with more complicated queries, you can first try LIMIT and OFFSET, or one of:
Wikibase Dump Filter is a CLI tool that downloads the full wikidata dump but filters the stream as it comes in according to your needs. You can put very much the same thing together with some creative pipe|ing and it tends to work better than one would expect.
https://wdumps.toolforge.org does more or less the same thing but on-premise, then allows you to download the filtered data.
The linked data interface also works rather well for "simple query, high volume" access needs. The example here gives all Twitter IDs (326,000+), and you can read it in pages as fast as you can generate GET requests (set an appropriate Accept header to get JSON).

How do I search this? Is it possible to access more than 100 JSON API search results if I pay for it?

How to search this?
I want to be able to:
1. create a search engine
2. programmatically search it through an API (Python, or other)
3. paginate through the results (all of them, if I chose)
4. store URLs or results that I want.
Is this even possible with Google Custom Search Engine?
I have enabled billing, my credit card is on file with Google, and I can do steps 1-3 above.
On a search I might get back 4,000 results, for example, but I can only access 10 at a time through the API, no more, and once I reach 100 results I am cut off.
I want to be able to process 1000 results if I wish.
Before you reply, do you personally have working code that goes beyond the 100 limit?
If so, would be very much interested in speaking, learning how you did it.
I am using Python at the moment, but it could be any language.
--
I tried using the &start=100, 200, and so on to paginate through, but this does not work.
I tried getting 100 results in a Python script, ending the program, then calling it again with start=100 (after the first set returned), and got nothing back.
I want to use the Google Custom Search API and pay Google a monthly subscription for it, but I have not found that this is possible.
For any given search, I want to decide how many results to process; it could be 1K, could be 20K. I simply need/want access to the full result set, but I have not been able to find a way to do this.
The API allows only a max result depth of 100. See https://developers.google.com/custom-search/v1/cse/list
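To make the cap concrete, here is a minimal Python sketch of paging the Custom Search JSON API with num and start; API_KEY and CSE_ID are placeholders. The API stops serving results past position 100 no matter how much you pay, so the loop simply ends there.

```python
# Sketch: page through Custom Search results 10 at a time.  The API only
# serves the first 100 positions, so the loop can never go deeper than that.
import requests

API_KEY = "YOUR_API_KEY"   # placeholder
CSE_ID = "YOUR_CSE_ID"     # placeholder

def fetch_links(query, max_results=100):
    """Collect result URLs, stopping at the API's 100-result depth limit."""
    links = []
    for start in range(1, min(max_results, 100) + 1, 10):
        resp = requests.get(
            "https://www.googleapis.com/customsearch/v1",
            params={"key": API_KEY, "cx": CSE_ID, "q": query,
                    "num": 10, "start": start},
        )
        resp.raise_for_status()
        items = resp.json().get("items", [])
        if not items:          # fewer results available than the depth limit
            break
        links.extend(item["link"] for item in items)
    return links

print(fetch_links("example query"))
```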

Using Twitter's public API to find similar tweets

I am working on an application that, amongst other things, tries to find similar tweets given a tweet's text as input. The similarity would be based on the amount of matching text. I would like to use the public Twitter Search API to accomplish this.
The closest thing the Twitter API offers is searching with OR operators. This, however, returns a list of seemingly randomly ordered tweets that contain any of the query's words, usually matching on common words like 'with' or 'we' (which is expected behaviour of the OR operator). I, however, am interested in results with as much matching text as possible, and also in results whose text is characteristic of the input tweet (matching common words is less relevant than matching uncommon words).
Is there any way I can use the Twitter API to find results with as many matching words as possible?
Example of results from query with OR operators.
The Twitter REST API does not expose a function that does what you are describing. You will need to capture a large number of tweets (probably from the Streaming API) and then do the comparison/identification of similar tweets in your own code.
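For the "in your own code" part, one common approach (not anything the Twitter API provides) is to rank the captured tweets by TF-IDF cosine similarity against the input tweet, so matches on uncommon words count for more than matches on words like 'with' or 'we'. A minimal sketch with scikit-learn; the captured_tweets list is a stand-in for whatever you collect from the Streaming API.

```python
# Sketch: rank captured tweets by TF-IDF similarity to an input tweet.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

captured_tweets = [
    "we are with you all the way",
    "lucene powers full-text search in many applications",
    "building a search engine with lucene and a wikipedia dump",
]
input_tweet = "writing a search engine for wikipedia articles using lucene"

vectorizer = TfidfVectorizer()
tweet_vectors = vectorizer.fit_transform(captured_tweets)
input_vector = vectorizer.transform([input_tweet])

# Cosine similarity of the input against every captured tweet, best first.
scores = cosine_similarity(input_vector, tweet_vectors).ravel()
for tweet, score in sorted(zip(captured_tweets, scores),
                           key=lambda pair: pair[1], reverse=True):
    print(f"{score:.3f}  {tweet}")
```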

How can I count the results in Gnip's Powertrack API?

I am looking for a URL to count the results retrieved using the PowerTrack API, something similar to what I can get from the Search API:
https://search.gnip.com/accounts/ACCOUNT_NAME/search/LABEL/counts.json
I've been looking at Gnip's docs but I have found nothing that allows me to count the results.
I tried using other URLs (stream.gnip.com, and using search.gnip.com with 'powertrack' instead of 'search'). I can't paste more than 1 link so I can't show the complete URLs here, sorry.
I also looked at Historical PowerTrack API reference, and I can't find anything there related to this.
Thank you.
The only products that support a counts endpoint are the 30 Day and Full Archive Search APIs.
Because PowerTrack is a streaming API and supports tens of thousands of concurrent rules, your best bet would be to store the data in a database or document store (NoSQL) that allows filtered queries to extract the counts you need.
Historical PowerTrack could technically allow you to determine a count for a specific query, just based on the total number of activities returned, but to execute an HPT job for the sole purpose of getting a count would not be cost-effective.
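If the 30 Day or Full Archive Search API is an option, a request to the counts endpoint from the question might look roughly like the Python sketch below. The parameter names (query, bucket) are my assumptions based on the Gnip 2.0 Search API documentation, so check them against your product version; ACCOUNT_NAME, LABEL and the credentials are placeholders.

```python
# Sketch (parameter names are assumptions, see above): daily counts for one rule.
import requests

ACCOUNT_NAME = "ACCOUNT_NAME"            # placeholder
LABEL = "LABEL"                          # placeholder
AUTH = ("user@example.com", "password")  # placeholder HTTP basic auth

resp = requests.get(
    f"https://search.gnip.com/accounts/{ACCOUNT_NAME}/search/{LABEL}/counts.json",
    auth=AUTH,
    params={
        "query": "from:TwitterDev",   # a PowerTrack-style rule
        "bucket": "day",              # day / hour / minute buckets
    },
)
resp.raise_for_status()
for bucket in resp.json().get("results", []):
    print(bucket)
```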
As Steven suggested, you had better store the data in a (NoSQL) database and perform your own aggregations.
Gnip does provide a Usage API which will give you the total volume per period per source.

Programmatic Querying of Google and Other Search Engines With Domain and Keywords

I'm trying to find out whether there is a programmatic way to determine how far down in a search engine's results my site shows up for given keywords. For example, my query would provide my domain name and keywords, and the result would return, say, 94, indicating that my site was the 94th result. I'm specifically interested in how to do this with Google, but I'm also interested in Bing and Yahoo.
No.
There is no programmatic access to such data. People generally roll their own version of such trackers: fetch the Google search results page and use regexes to find your position. But nowadays different results are shown in different geographies, and results are personalized.
The gl=us parameter will help you get results from the US; you can change the geography accordingly.
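As a rough illustration of the roll-your-own approach, the Python sketch below walks search results until the target domain appears and reports its position. It goes through the Custom Search JSON API instead of scraping the results page, so the ranking may differ from what a signed-out browser sees and only the first 100 positions are reachable; API_KEY and CSE_ID are placeholders.

```python
# Sketch: find the 1-based position of a domain for a set of keywords.
from urllib.parse import urlparse
import requests

API_KEY = "YOUR_API_KEY"   # placeholder
CSE_ID = "YOUR_CSE_ID"     # placeholder

def find_rank(domain, keywords, max_depth=100):
    """Return the position of `domain` for `keywords`, or None if not found."""
    position = 0
    for start in range(1, max_depth + 1, 10):
        resp = requests.get(
            "https://www.googleapis.com/customsearch/v1",
            params={"key": API_KEY, "cx": CSE_ID, "q": keywords,
                    "num": 10, "start": start, "gl": "us"},
        )
        resp.raise_for_status()
        items = resp.json().get("items", [])
        if not items:
            break
        for item in items:
            position += 1
            if urlparse(item["link"]).netloc.endswith(domain):
                return position
    return None

print(find_rank("example.com", "my keywords"))
```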
Before creating this from scratch, you may want to save yourself some time (and money) by using a service that does exactly that [and more]: Ginzametrics.
They have a free plan (so you can test if it fits your requirements and check if it's really worth creating your own tool), an API and can even import data from Google Analytics.