Filter, subset and download Wikidata - sparql

Is there any easier way to filter data in Wikidata and download a portion of claims?
For example, let us say that I want a list of all humans who are currently alive and have an active Twitter profile.
I would like to download a file containing their Q-ids, names and Twitter usernames (https://www.wikidata.org/wiki/Property:P2002).
I expect there to be hundreds of thousands of results, if not millions.
What is the best way to obtain this information?
I am not sure whether one can collect the results of a SPARQL query in a file.
I looked at the MediaWiki API, but I am not sure whether it allows accessing multiple entities in one go.
Thanks!

Wikidata currently has around 190,000 Twitter IDs linked to people. You can easily get them all using the SPARQL Query Interface: Web Interface (with a LIMIT you can remove or increase). In the dropdown on the right, choose SPARQL Endpoint for the Direct Link (no limit, 35MB .csv).
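If you would rather script the download than click through the web interface, the same endpoint can be queried over plain HTTP. Below is a minimal Python sketch, assuming the usual modelling of "living human with a Twitter account" (instance of (P31) human (Q5), a Twitter username (P2002) and no date of death (P570)); the output file name and the User-Agent string are placeholders.

    # Minimal sketch: download the query results as CSV from the public
    # Wikidata SPARQL endpoint. The query models "living human with a
    # Twitter username" as P31=Q5 + P2002 present + no P570 (date of death).
    import requests

    QUERY = """
    SELECT ?item ?itemLabel ?twitter WHERE {
      ?item wdt:P31 wd:Q5 ;
            wdt:P2002 ?twitter .
      FILTER NOT EXISTS { ?item wdt:P570 ?dateOfDeath . }
      SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
    }
    """

    response = requests.get(
        "https://query.wikidata.org/sparql",
        params={"query": QUERY},
        headers={
            "Accept": "text/csv",  # ask the endpoint for CSV directly
            "User-Agent": "twitter-id-export/0.1 (you@example.org)",  # placeholder contact
        },
        timeout=300,
    )
    response.raise_for_status()

    with open("humans_with_twitter.csv", "w", encoding="utf-8") as f:
        f.write(response.text)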
But, in case you run into timeouts with more complicated queries, you can first try LIMIT and OFFSET, or one of:
Wikibase Dump Filter is a CLI tool that downloads the full Wikidata dump but filters the stream as it comes in according to your needs. You can put much the same thing together with some creative piping, and it tends to work better than one would expect.
wdumps.toolforge.org (https://wdumps.toolforge.org) does more or less the same thing, but runs the filtering for you server-side and then lets you download the filtered data.
The linked data interface also works rather well for "simple query, high volume" access needs. The example here gives all Twitter IDs (326,000+), and you can read the results in pages as fast as you can issue GET requests (set an appropriate Accept header to get JSON).

Related

How to fetch results from an offset when the API doesn't support offset (HERE Maps API)

I have a search functionality that gets data from the HERE API's Search endpoint. I maintain records of each search's results so I can add metadata that I need for my own purposes, and also so I can provide results without always going back to the HERE API.

The problem I have is with pagination, specifically with providing a starting index when fetching results from HERE. Similar to how Algolia does it, I want to be able to search for a term and begin with the results at a certain index, the offset. The HERE API apparently doesn't allow this at all. The closest it comes to such a feature is providing the URL for the next search, as described here. This is limited because it doesn't let me start the search results at a particular index that I specify.

So essentially I want to know if there's a "standard" way of getting such functionality even when it's not provided by the API.
My own solution
The HERE API provides a size parameter for specifying the total number of results I want, so I can request a larger size than I need and then use code to start the results from my desired index. But this feels a bit hacky, and I wonder if there's a better/more established way of doing this.
Happy to listen to any ideas! Thanks. :)
Such an 'offset' for starting the paging after a specific number of results is indeed not supported by the Places API itself.
You have to set up a workaround within your application.
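A rough sketch of that workaround in Python: over-fetch with size and slice locally. The endpoint URL, the apiKey credential and the response shape below are assumptions; adapt them to the Places API variant and the authentication you actually use.

    # Hedged sketch: emulate an "offset" on top of the HERE Places search by
    # requesting offset + limit results and slicing client-side.
    import requests

    def search_with_offset(query, at, offset, limit, api_key):
        response = requests.get(
            "https://places.ls.hereapi.com/places/v1/discover/search",  # assumed endpoint
            params={
                "q": query,
                "at": at,                # "lat,lng" context for the search
                "size": offset + limit,  # over-fetch so the desired window exists
                "apiKey": api_key,       # assumed auth scheme
            },
            timeout=30,
        )
        response.raise_for_status()
        items = response.json().get("results", {}).get("items", [])
        return items[offset:offset + limit]  # emulate the missing offset locally

    # e.g. the third page of 20 results:
    # page3 = search_with_offset("pizza", "52.5159,13.3777", offset=40, limit=20, api_key="...")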

Kapow Robot - Extract business Operating hours from Google Search Results

Is it possible to create a Kapow Robot that can search Google for the operating hours of the businesses in our list/database and update the timings if changes are made?
Please also share any more efficient and cost-effective alternatives to the Kapow robot that can be implemented with minimal effort.
That's what the Google Places API is there for. While you could in theory just open Google Maps in a Load Page action, enter the query string and then parse the results, I would advise against it. Here's why:
The API will be faster, returning results in a structured manner (JSON)
Kapow has actions for calling RESTful services and parsing/modifying JSON
Google does not like robots parsing their pages, and most likely will lock you out (i.e. present you with Captchas sooner or later)
If you decide to go for the API, here's what you should do:
Get your API key first; see this page for details: https://developers.google.com/places/web-service/get-api-key. Note that the free plan allows 1,000 requests per 24 hours (https://developers.google.com/places/web-service/usage)
Maintain the place ids for all the businesses you'd like to query regularly, and update your list.
For each place, retrieve the details as described in the API documentation. The opening hours will be within the JSON response: https://developers.google.com/places/web-service/details
Update your list. I'd recommend using a definite type in Kapow for that, and using the actions Store in Database and Query Database. In case you need the data elsewhere, you may create additional robots (e.g. for Excel files, sending data per email, et cetera).
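Outside of Kapow, the retrieval step boils down to one REST call per place. Here is a minimal Python sketch of the Place Details request; the opening_hours/weekday_text fields follow the documented JSON response, and error handling is deliberately left out.

    # Sketch of the Place Details call described above: fetch one place by its
    # place ID and pull out the opening hours.
    import requests

    def get_opening_hours(place_id, api_key):
        response = requests.get(
            "https://maps.googleapis.com/maps/api/place/details/json",
            params={"place_id": place_id, "key": api_key},
            timeout=30,
        )
        response.raise_for_status()
        result = response.json().get("result", {})
        hours = result.get("opening_hours", {})
        # "weekday_text" is a human-readable list such as "Monday: 9:00 AM - 5:00 PM"
        return hours.get("weekday_text", [])

    # for place_id in my_place_ids:        # my_place_ids: your maintained list
    #     print(place_id, get_opening_hours(place_id, API_KEY))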

Mechanical Turk - Fetch results for a batch via API

We've created batches of HITs using the Mechanical Turk web interface. Now all we want to do is download the results for a batch using the API, the same way you can download the results for a batch in the web interface using "Download CSV".
The documentation from Amazon says that downloading the results from the API is possible and I would be surprised if it isn't. But after a lot of programming hours and testing I have not been able to get the results of a batch.
http://docs.aws.amazon.com/AWSMechTurk/latest/AWSMturkAPI/ApiReference_OperationsArticle.html
Our problem is not to get the HIT data, that stuff is easy with GetHIT. Our problem isn't either to get the assignment data, that's easily done with GetAssignmentsForHIT. Our problem is to figure out the HIT IDs of a batch so that we only fetch the results of that batch.
We thought we would be able to do this with GetHITsForQualificationType, but since we use the same HIT type ID for all batches this isn't possible. The only other operation I can see is SearchHITs, but this operation only lets you "sort" values, not "filter" by, e.g., batch ID.
If Amazon is a SOA company and they follow the "eat your own dog food" concept, then I wonder how they generate the results in "Download CSV" using their API?
Any hints would be greatly appreciated. Thank you!
UPDATE #1
I believe you could use SearchHITs to pull out all HITs. Then grab the details for each HIT using GetHIT. Then filter all the HITs by "RequesterAnnotation", which actually contains the batch ID, e.g. "BatchId:1234567;". This might be the only solution. Sounds a bit far-fetched, though.
The workflow is exactly as you describe in your Update #1:
(1) Use SearchHITs to get all of your HITs.
(2) Get details with GetHIT (You can actually skip this step because the "Requester Annotation" field comes with SearchHITs if you include the HITDetail response group).
(3) Filter the results by the annotation field to get the HITs you want.
(4) Use GetAssignmentsForHIT to retrieve assignments.
The "batch id" is something that appears to only be accessible to Amazon for use on the Requester User Interface. (see some discussion on the MTurk Developer Forum)
And, of course, the API is going to give you results in XML, which you'll need to parse to turn them into a CSV.
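For reference, here is the same workflow as a Python sketch using boto3's MTurk client, whose list_hits and list_assignments_for_hit roughly correspond to the SearchHITs and GetAssignmentsForHIT operations discussed here. The "BatchId:...;" annotation format is the one observed in Update #1, so verify it against your own HITs.

    # Sketch: collect the assignments of one web-interface batch by filtering
    # all HITs on the RequesterAnnotation field.
    import boto3

    client = boto3.client("mturk", region_name="us-east-1")

    def hits_for_batch(batch_id):
        """Yield every HIT whose RequesterAnnotation references the given batch."""
        marker = f"BatchId:{batch_id};"   # assumed annotation format, per Update #1
        token = None
        while True:
            kwargs = {"MaxResults": 100}
            if token:
                kwargs["NextToken"] = token
            page = client.list_hits(**kwargs)            # step 1: all HITs, paged
            for hit in page["HITs"]:
                if marker in hit.get("RequesterAnnotation", ""):  # step 3: filter
                    yield hit
            token = page.get("NextToken")
            if not token:
                break

    def assignments_for_batch(batch_id):
        for hit in hits_for_batch(batch_id):
            page = client.list_assignments_for_hit(HITId=hit["HITId"])  # step 4
            for assignment in page["Assignments"]:
                yield hit["HITId"], assignment["Answer"]  # Answer is an XML blob

    # for hit_id, answer_xml in assignments_for_batch("1234567"):
    #     ...parse the XML and write your own CSV rows...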

Amazon API search results vs. Amazon.com search results

For our web app, which will use Amazon's API as a basis for some of the site's main interactions, we required the ability to do a generalized search of Amazon's products and return results based on relevancy. The expectation was that their API would work exactly like their actual site's search.
Unfortunately it does not. For instance, querying "joy of cooking" does not return a link to the famous cookbook, but to some food processor. By contrast, on the actual site, one would see that the book isn't just first; it and its derivatives occupy the top five or so results.
Is there a way of getting this level of relevancy search from Amazon's API without specifying a node to browse through? We need to be able to search everything at once, and the API seems very limited on parameter sets.
The answer is that, if you use "All" rather than "Blended" as your search index, you will get results that are in line with Amazon's own product search. Older docs don't seem to account for this discrepancy, but testing both methods has shown "All" to be the preferred option.
http://docs.amazonwebservices.com/AWSECommerceService/2010-11-01/DG/
Search that page for "SearchIndex: All".
You don't get any item sorting options with this method, but if all you want is "most relevant" results, this is the preferred method.
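For what it's worth, below is a Python sketch of that ItemSearch call against the 2010-11-01 API version linked above (a signed GET to /onca/xml). The credentials are placeholders and the signing follows the documented HMAC-SHA256 scheme; treat it as an illustration rather than a drop-in client.

    # Sketch: ItemSearch with SearchIndex=All, which mirrors amazon.com's own
    # relevance ordering. Requests to this API must be signed.
    import base64
    import hashlib
    import hmac
    import time
    import urllib.parse

    import requests

    ACCESS_KEY = "YOUR-ACCESS-KEY"    # placeholder
    SECRET_KEY = b"YOUR-SECRET-KEY"   # placeholder
    ASSOCIATE_TAG = "yourtag-20"      # placeholder

    def item_search(keywords):
        params = {
            "Service": "AWSECommerceService",
            "Operation": "ItemSearch",
            "SearchIndex": "All",     # the relevance-friendly index, per the answer
            "Keywords": keywords,
            "AWSAccessKeyId": ACCESS_KEY,
            "AssociateTag": ASSOCIATE_TAG,
            "Timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
            "Version": "2010-11-01",
        }
        # Canonical query string: keys sorted byte-wise, RFC 3986 encoding.
        query = "&".join(
            f"{urllib.parse.quote(k, safe='')}={urllib.parse.quote(str(v), safe='')}"
            for k, v in sorted(params.items())
        )
        to_sign = "GET\nwebservices.amazon.com\n/onca/xml\n" + query
        signature = base64.b64encode(
            hmac.new(SECRET_KEY, to_sign.encode("utf-8"), hashlib.sha256).digest()
        ).decode("utf-8")
        url = ("https://webservices.amazon.com/onca/xml?" + query +
               "&Signature=" + urllib.parse.quote(signature, safe=""))
        return requests.get(url, timeout=30).text  # XML response to parse

    # print(item_search("joy of cooking"))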

Programmatic Querying of Google and Other Search Engines With Domain and Keywords

I'm trying to find out if there is a programmatic way to determine how far down in a search engine's results my site shows up for given keywords. For example, my query would provide my domain name and keywords, and the result would be, say, 94, indicating that my site was the 94th result. I'm specifically interested in how to do this with Google, but I'm also interested in Bing and Yahoo.
No.
There is no programmatic access to such data. People generally roll their own version of such trackers: fetch the Google search results page and use regexes to find your position. But nowadays different results are shown in different geographies, and results are personalized.
The gl=us parameter will help you get results from the US; you can change the geography parameter accordingly to get other regions' results.
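A rough sketch of such a home-rolled tracker is below, with the caveats already mentioned: Google's markup changes often, results vary by geography and personalization, and automated queries are likely to get blocked, so this is illustrative only.

    # Crude rank checker: fetch a results page and regex out the result links,
    # then report the first position whose URL contains your domain. The regex
    # is deliberately naive and will also match some non-result links.
    import re
    import requests

    def google_rank(domain, keywords, num=100, gl="us"):
        response = requests.get(
            "https://www.google.com/search",
            params={"q": keywords, "num": num, "gl": gl},  # gl pins the geography
            headers={"User-Agent": "Mozilla/5.0"},
            timeout=30,
        )
        response.raise_for_status()
        links = re.findall(r'href="(https?://[^"]+)"', response.text)
        for position, link in enumerate(links, start=1):
            if domain in link:
                return position
        return None  # not found within the first `num` results

    # print(google_rank("example.com", "my product keywords"))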
Before creating this from scratch, you may want to save yourself some time (and money) by using a service that does exactly that [and more]: Ginzametrics.
They have a free plan (so you can test if it fits your requirements and check if it's really worth creating your own tool), an API and can even import data from Google Analytics.