Mechanical Turk - Fetch results for a batch via API

We've created batches of HITs using the Mechanical Turk web interface. Now all we want to do is download the results for a batch using the API, the same way you can download the results for a batch in the web interface using "Download CSV".
Amazon's documentation suggests that downloading the results via the API is possible, and I would be surprised if it weren't, but after many hours of programming and testing I have not been able to fetch the results of a batch.
http://docs.aws.amazon.com/AWSMechTurk/latest/AWSMturkAPI/ApiReference_OperationsArticle.html
Our problem is not getting the HIT data; that's easy with GetHIT. Nor is it getting the assignment data, which is easily done with GetAssignmentsForHIT. Our problem is figuring out which HIT IDs belong to a batch so that we only fetch the results of that batch.
We thought we would be able to do this with GetHITsForQualificationType, but since we use the same HIT type ID for all batches this isn't possible. The only other operation I can see is SearchHITs, but that operation only lets you sort values, not filter by e.g. batch ID.
If Amazon is an SOA company and follows the "eat your own dog food" principle, then I wonder how they generate the "Download CSV" results using their own API?
Any hints would be greatly appreciated. Thank you!
UPDATE #1
I believe you could use SearchHITs to pull out all HITs, then grab the details for each HIT using GetHIT, and finally filter all the HITs by "RequesterAnnotation", which actually contains the batch ID, e.g. "BatchId:1234567;". This might be the only solution, though it sounds a bit far-fetched.

The workflow is exactly as you describe in your Update #1:
(1) Use SearchHITs to get all of your HITs.
(2) Get details with GetHIT (You can actually skip this step because the "Requester Annotation" field comes with SearchHITs if you include the HITDetail response group).
(3) Filter the results by the annotation field to get the HITs you want.
(4) Use GetAssignmentsForHIT to retrieve assignments.
The "batch id" is something that appears to only be accessible to Amazon for use on the Requester User Interface. (see some discussion on the MTurk Developer Forum)
And, of course, the API is going to give you results in XML, which you'll need to parse to turn them into a CSV.
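Here is a minimal sketch of that workflow, assuming you use boto3's MTurk client (whose list_hits and list_assignments_for_hit calls correspond to the legacy SearchHITs and GetAssignmentsForHIT operations, and return parsed dictionaries rather than raw XML). The "BatchId:...;" annotation format is taken from the question's Update #1; adjust the filter to whatever your RequesterAnnotation actually contains:

    # Sketch only: assumes boto3 is installed and AWS credentials are configured,
    # and that your HITs carry a "BatchId:<id>;" RequesterAnnotation as described above.
    import boto3

    mturk = boto3.client("mturk", region_name="us-east-1")

    def hits_for_batch(batch_id):
        """Yield every HIT whose RequesterAnnotation references the given batch."""
        token = None
        while True:
            kwargs = {"MaxResults": 100}
            if token:
                kwargs["NextToken"] = token
            page = mturk.list_hits(**kwargs)                  # step 1: enumerate all HITs
            for hit in page["HITs"]:
                annotation = hit.get("RequesterAnnotation", "")
                if "BatchId:%s;" % batch_id in annotation:    # step 3: filter by annotation
                    yield hit                                 # step 2 (GetHIT) not needed here
            token = page.get("NextToken")
            if not token:
                break

    def assignments_for_batch(batch_id):
        """Step 4: fetch the assignments (i.e. the results) for the filtered HITs."""
        for hit in hits_for_batch(batch_id):
            page = mturk.list_assignments_for_hit(HITId=hit["HITId"], MaxResults=100)
            for assignment in page["Assignments"]:
                yield hit["HITId"], assignment

    for hit_id, assignment in assignments_for_batch("1234567"):
        print(hit_id, assignment["WorkerId"], assignment["AssignmentStatus"])

From those dictionaries you can write out a CSV with Python's csv module if you want something equivalent to the web interface's "Download CSV".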

Related

Filter, subset and download Wikidata

Is there any easier way to filter data in Wikidata and download a portion of claims?
For example, let us say that I want a list of all humans that are currently alive and have an active Twitter profile.
I would like to download a file containing their Q-ids, names and Twitter usernames (https://www.wikidata.org/wiki/Property:P2002).
I expect there to be hundreds of thousands of results, if not millions.
What is the best way to obtain this information?
I am not sure whether, by submitting a SPARQL query, one can collect the results in a file.
I also looked at the MediaWiki API, but I am not sure whether it allows accessing multiple entities in one go.
Thanks!
Wikidata currently has around 190,000 Twitter IDs linked to people. You can easily get them all using the SPARQL Query Interface: Web Interface (with a LIMIT you can remove or increase). In the dropdown on the right, choose SPARQL Endpoint for the Direct Link (no limit, 35MB .csv).
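If you prefer to script it rather than use the web interface, here is a sketch of fetching such a list as CSV from the public SPARQL endpoint with Python's requests library. The query itself is an assumption about what you want: instance-of human (P31 = Q5), no date of death (P570) as a proxy for "currently alive", and a Twitter username (P2002). A query this large may still hit the timeouts discussed next.

    import requests

    SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"

    # Humans (P31 = Q5) with a Twitter username (P2002) and no date of death (P570).
    QUERY = """
    SELECT ?item ?itemLabel ?twitter WHERE {
      ?item wdt:P31 wd:Q5 ;
            wdt:P2002 ?twitter .
      MINUS { ?item wdt:P570 ?dateOfDeath }
      SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
    }
    """

    response = requests.get(
        SPARQL_ENDPOINT,
        params={"query": QUERY},
        # Asking for text/csv makes the endpoint return a CSV file directly.
        headers={"Accept": "text/csv", "User-Agent": "wikidata-twitter-export/0.1 (example)"},
        timeout=300,
    )
    response.raise_for_status()

    with open("humans_with_twitter.csv", "w", encoding="utf-8") as f:
        f.write(response.text)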
But, in case you run into timeouts with more complicated queries, you can first try LIMIT and OFFSET, or one of:
Wikibase Dump Filter is a CLI tool that downloads the full Wikidata dump but filters the stream as it comes in according to your needs. You can put together very much the same thing yourself with some creative piping, and it tends to work better than one would expect.
wdumps.toolforge.org (https://wdumps.toolforge.org) does more or less the same thing but on-premise, then lets you download the filtered data.
The linked data interface also works rather well for "simple query, high volume" access needs. An example query there gives all Twitter IDs (326,000+), and you can read it in pages as fast as you can issue GET requests (set an appropriate Accept header to get JSON).

How to get the most searched words in Solr? [duplicate]

I'm trying to set up a Solr search engine. I've already set up the misspelling system and the suggestions.
However, I can't seem to find how to retrieve the top 10 most searched words/terms/keywords in Solr/Lucene. How can I get this? I want to display those on my homepage.
Solr does not provide this kind of feature out of the box. There is the StatsComponent, which provides you with all kinds of statistics, but all of those are numeric only.
Depending on how you access Solr (directly or via your own app), you could intercept all calls and log the query string. I did this in a recent project where I logged all queries to a database. If you submit all keywords to another core on your Solr server, you can run faceting queries on your search terms as described by Hyque.
You could use a facet for retrieving the Top X words like this:
http://yourservergoeshere/solr/select?q=*&wt=xml&indent=true&facet=true&facet.query=*&facet.field=message&facet.limit=10&facet.mincount=1
The value of facet.field depends on the field you want to search in. With facet.limit you (obviously) limit the number of results to 10. You'll find the facet results at the end of the response, starting with "facet_counts".
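For illustration, here is the same facet request issued from Python. This is only a sketch: the host, core path, and the "message" field are the placeholders from the URL above, and it assumes you have been logging search terms into that field of a separate core.

    import requests

    # Hypothetical Solr host/core; "message" is the field holding logged search terms.
    SOLR_SELECT = "http://yourservergoeshere/solr/select"

    params = {
        "q": "*:*",            # match everything; we only care about the facet counts
        "rows": 0,             # no documents needed, just the facets
        "wt": "json",
        "facet": "true",
        "facet.field": "message",
        "facet.limit": 10,     # top 10 terms
        "facet.mincount": 1,
    }

    resp = requests.get(SOLR_SELECT, params=params, timeout=30)
    resp.raise_for_status()

    # facet_fields maps the field name to a flat [term, count, term, count, ...] list.
    counts = resp.json()["facet_counts"]["facet_fields"]["message"]
    for term, count in zip(counts[::2], counts[1::2]):
        print(term, count)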
Edit: I really should go to bed earlier. I didn't see the "most searched" in your question. Sorry for that.
Apache Solr does not provide any such capability as of today. There is a desire for this and a JIRA ticket corresponding to it. You can vote for it if you'd like to see it in Solr some day: https://issues.apache.org/jira/browse/SOLR-10359.
The stats component provides information around statistics, but it is mostly numeric in nature. You could parse the server logs and build a "frequently searched terms" report from them (e.g. pump those logs into SiLK or Kibana for visualization).
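A rough sketch of that log-parsing idea. The assumptions here: queries arrive through Solr's request log, the log lines contain a params={q=...} section as in a typical solr.log, and the file path below is a placeholder.

    import re
    from collections import Counter
    from urllib.parse import unquote_plus

    # Matches the q= parameter inside a "params={...}" section of a Solr request log line.
    Q_PARAM = re.compile(r"params=\{[^}]*\bq=([^&}]+)")

    counts = Counter()
    with open("solr.log", encoding="utf-8") as log:         # placeholder path
        for line in log:
            match = Q_PARAM.search(line)
            if match:
                query = unquote_plus(match.group(1)).strip().lower()
                if query and query != "*:*":                # ignore match-all queries
                    counts[query] += 1

    for query, count in counts.most_common(10):             # top 10 searched terms
        print(count, query)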
If you have the ability to change the front end and add some JavaScript to the UI, or can intercept the search request and make async or batch calls to tracking APIs, you can use SearchStax Analytics, which provides search analytics covering searches, clicks, cart actions, revenue, etc.

How do I search this? Possible to access more than 100 JSON API search results if I pay for it?

How to search this?
I want to be able to:
1. create a search engine
2. programmatically search it through an API (Python, or other)
3. paginate through the results (all of them, if I choose)
4. store the URLs or results that I want.
Is this even possible with Google Custom Search Engine?
I have enabled billing, my credit card is up to date with Google, and I can do steps 1-3 above.
On a search, I will get back 4,000 results for example, but I can only access 10 at a time with the API, none more, and when I reach 100 results I am shut off.
I want to be able to process 1000 results if I wish.
Before you reply, do you personally have working code that goes beyond the 100 limit?
If so, would be very much interested in speaking, learning how you did it.
I am using Python at the moment, but it could be any language.
--
I tried using the &start=100, 200, and so on to paginate through, but this does not work.
I tried getting 100 results in a Python script, ending the program, then calling it again with start=100 (after the first set returned), and nothing happened.
I want to be able to use the Google Custom Search API and pay Google for a monthly subscription, but I have not found that this is possible.
For any given search, I want to decide how many results to process: it could be 1K, it could be 20K. I simply need/want access to the full result set, but I have not found a way to do this.
The API allows only a max result depth of 100. See https://developers.google.com/custom-search/v1/cse/list
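To make the limit concrete, here is a sketch of paging through the Custom Search JSON API from Python. The API returns at most 10 results per request and will not serve any page past result 100, so the loop below can never collect more than 100 items, regardless of billing. YOUR_API_KEY and YOUR_CX are placeholders.

    import requests

    API_KEY = "YOUR_API_KEY"   # placeholder
    CX = "YOUR_CX"             # placeholder: your custom search engine id

    def search_all(query):
        """Page through results 1..100; the API refuses to go deeper."""
        results = []
        start = 1
        while start <= 91:     # last allowed page is start=91 (results 91-100)
            resp = requests.get(
                "https://www.googleapis.com/customsearch/v1",
                params={"key": API_KEY, "cx": CX, "q": query, "start": start, "num": 10},
                timeout=30,
            )
            resp.raise_for_status()
            items = resp.json().get("items", [])
            if not items:
                break
            results.extend(item["link"] for item in items)
            start += 10
        return results

    print(len(search_all("basketball")))   # at most 100, never 1,000+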

Kapow Robot - Extract business Operating hours from Google Search Results

Is it possible to create a Kapow Robot that can search Google for the Operating hours of the Businesses from our list/database and update the timings if changes are made?
Please also share any other approaches that would be more efficient than a Kapow robot and could be implemented with minimal effort and cost.
That's what the Google Places API is there for. While you could in theory just open Google Maps in a Load Page action, enter the query string and then parse the results, I would advise against it. Here's why:
The API will be faster, returning results in a structured manner (JSON)
Kapow has actions for calling RESTful services and parsing/modifying JSON
Google does not like robots parsing their pages, and most likely will lock you out (i.e. present you with Captchas sooner or later)
If you decide to go for the API, here's what you should do:
Get your API key first; see this page for details: https://developers.google.com/places/web-service/get-api-key. Note that the free plan allows for 1,000 requests per 24 hours (https://developers.google.com/places/web-service/usage)
Maintain the place ids for all the businesses you'd like to query regularly, and update your list.
For each place, retrieve the details as described in the API documentation; the opening hours will be within the JSON response (see the sketch after these steps): https://developers.google.com/places/web-service/details
Update your list. I'd recommend using a definite type in Kapow for that, and using the actions Store in Database and Query Database. In case you need the data elsewhere, you may create additional robots (e.g. for Excel files, sending data per email, et cetera).
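As a plain-Python sketch of the "retrieve the details" step (you could equally issue this request from Kapow's RESTful-service actions mentioned above; the place id and key below are placeholders):

    import requests

    API_KEY = "YOUR_PLACES_API_KEY"   # placeholder
    PLACE_ID = "YOUR_PLACE_ID"        # placeholder: one of the place ids you maintain

    resp = requests.get(
        "https://maps.googleapis.com/maps/api/place/details/json",
        params={
            "place_id": PLACE_ID,
            # Restricting the fields keeps the response small.
            "fields": "name,opening_hours",
            "key": API_KEY,
        },
        timeout=30,
    )
    resp.raise_for_status()

    result = resp.json().get("result", {})
    print(result.get("name"))
    # weekday_text is a human-readable list like "Monday: 9:00 AM - 5:00 PM".
    for line in result.get("opening_hours", {}).get("weekday_text", []):
        print(line)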

What is the maximum results returned for YouTube Data API v3 call

Context
I am in the process of providing some consultancy on doing an HTTP GET using the YouTube Data API v3, in order to develop a Windows-based application to GET a list of results from YouTube for, say, a specific CATEGORY or a specific TAG.
We are open to using any programming language (I'm from a C++ background and am hoping YouTube will support direct HTTP connections without using the Google client SDK and so on) to connect to YouTube and (HTTP) GET data. (This would run once a month or so, so YouTube API quotas should not be a problem.)
The Issue
We are being told by some of my client's web developers that the YouTube API v3 will only return a maximum of 500 records/results, for, say, a query that returns just the total viewers, the video's link, and basic metadata such as that.
So, say I wish to find 5,000 results for the category "House music" or "basketball", and I have the Developer Key etc. all set up: would that be possible?
If so, what GET fields would I need to populate (such as "max_results_per_page")?
Thank you.
The API won't provide more than ~500 search results for any arbitrary query. It's by design. Technically, it means that the nextPageToken field won't be returned once you hit ~500 results. No additional parameter can change that.
If you want more than ~500 results for a query, you have to split it into more specific sub-queries. I'd suggest using the publishedAfter and publishedBefore parameters to achieve that, but feel free to experiment with the other ones here.
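A sketch of that splitting idea using plain HTTP GETs against search.list (no client SDK needed, which matches the question). The date windows and API key below are placeholders; each window is paged with pageToken until it runs out, so each sub-query stays under the ~500-result ceiling:

    import requests

    API_KEY = "YOUR_API_KEY"   # placeholder
    SEARCH_URL = "https://www.googleapis.com/youtube/v3/search"

    def search_window(query, published_after, published_before):
        """Collect video ids for one date window, following pageToken until exhausted."""
        video_ids, page_token = [], None
        while True:
            params = {
                "part": "snippet",
                "q": query,
                "type": "video",
                "maxResults": 50,                    # the per-page maximum
                "publishedAfter": published_after,   # RFC 3339 timestamps
                "publishedBefore": published_before,
                "key": API_KEY,
            }
            if page_token:
                params["pageToken"] = page_token
            data = requests.get(SEARCH_URL, params=params, timeout=30).json()
            video_ids += [item["id"]["videoId"] for item in data.get("items", [])]
            page_token = data.get("nextPageToken")
            if not page_token:                       # no token => this window is exhausted
                return video_ids

    # Example: split one broad query into month-sized sub-queries (placeholder dates).
    windows = [
        ("2015-01-01T00:00:00Z", "2015-02-01T00:00:00Z"),
        ("2015-02-01T00:00:00Z", "2015-03-01T00:00:00Z"),
    ]
    all_ids = []
    for after, before in windows:
        all_ids += search_window("house music", after, before)
    print(len(set(all_ids)))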
This only holds for the search query. Other endpoints, such as PlaylistItems.list, deliver more results; I have tested fetching the videos of a playlist with 100,000 items.