How can I count the results in Gnip's PowerTrack API?

I am looking for a URL to count the results retrieved using the PowerTrack API, something similar to what I use with the Search API:
https://search.gnip.com/accounts/ACCOUNT_NAME/search/LABEL/counts.json
I've been looking at Gnip's docs but I have found nothing that allows me to count the results.
I tried using other URLs (stream.gnip.com, and using search.gnip.com with 'powertrack' instead of 'search'). I can't paste more than 1 link so I can't show the complete URLs here, sorry.
I also looked at Historical PowerTrack API reference, and I can't find anything there related to this.
Thank you.

The only products that support a counts endpoint are the 30 Day and Full Archive Search APIs.
Because PowerTrack is a streaming API and supports tens of thousands of concurrent rules, your best bet is to store the data in a database or document store (NoSQL) that allows filtered queries to extract the counts you need.
Historical PowerTrack could technically allow you to determine a count for a specific query, just based on the total number of activities returned, but to execute an HPT job for the sole purpose of getting a count would not be cost-effective.
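For illustration, here is a minimal sketch of that store-and-count approach, assuming the streamed activities are written into MongoDB one document per activity; the collection and field names ("activities", "matching_rules", "posted_time") are hypothetical and depend on how you persist the payloads:

```python
# Minimal sketch: count stored PowerTrack activities with filtered queries.
# Assumes each activity from the stream was written to MongoDB as one document;
# collection/field names are placeholders for whatever your pipeline uses.
from datetime import datetime
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
activities = client["gnip"]["activities"]

# Total activities matched by a given PowerTrack rule tag.
rule_count = activities.count_documents({"matching_rules.tag": "my_rule_tag"})

# Activities for that rule within a time window (field name assumed).
windowed_count = activities.count_documents({
    "matching_rules.tag": "my_rule_tag",
    "posted_time": {"$gte": datetime(2016, 1, 1), "$lt": datetime(2016, 2, 1)},
})

print(rule_count, windowed_count)
```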

As Steven suggested, you're better off storing the data in a (NoSQL) database and performing your own aggregations.
Gnip does provide a Usage API, which will give you the total volume per period per source.

Related

Filter, subset and download Wikidata

Is there any easier way to filter data in Wikidata and download a portion of claims?
For example, let us say that I want a list of all humans who are currently alive and have an active Twitter profile.
I would like to download a file containing their Q-ids, names and Twitter usernames (https://www.wikidata.org/wiki/Property:P2002).
I expect there to be hundreds of thousands of results, if not millions.
What is the best way to obtain this information?
I am not sure whether results from a SPARQL query can be collected in a file.
I looked at the MediaWiki API, but I'm not sure whether it allows accessing multiple entities in one go.
Thanks!
Wikidata currently has around 190,000 Twitter IDs linked to people. You can easily get them all using the SPARQL query web interface (with a LIMIT you can remove or increase). In the dropdown on the right, choose "SPARQL Endpoint" to get a direct link (no limit, ~35 MB .csv).
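For illustration, here is a minimal sketch of such a query run against the public SPARQL endpoint from Python; "currently alive" is approximated here as "has no date of death (P570)", which is an assumption on my part:

```python
# Minimal sketch: humans (P31 = Q5) with a Twitter username (P2002),
# approximating "currently alive" as "no date of death (P570)".
import requests

query = """
SELECT ?person ?personLabel ?twitter WHERE {
  ?person wdt:P31 wd:Q5 ;
          wdt:P2002 ?twitter .
  FILTER NOT EXISTS { ?person wdt:P570 ?dateOfDeath . }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 100
"""

resp = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": query, "format": "json"},
    headers={"User-Agent": "twitter-id-export-example/0.1"},
)
resp.raise_for_status()

for row in resp.json()["results"]["bindings"]:
    qid = row["person"]["value"].rsplit("/", 1)[-1]   # Q-id from the entity URI
    label = row.get("personLabel", {}).get("value", "")
    print(qid, label, row["twitter"]["value"])
```

Remove the LIMIT (or raise it) once you're happy with the query; for the full dump-sized result, the CSV download from the endpoint is usually the more practical route.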
But, in case you run into timeouts with more complicated queries, you can first try LIMIT and OFFSET, or one of:
Wikibase Dump Filter is a CLI tool that downloads the full Wikidata dump but filters the stream as it comes in according to your needs. You can put very much the same thing together with some creative pipe|ing, and it tends to work better than one would expect.
wdumps.toolforge.org (https://wdumps.toolforge.org) does more or less the same thing, but on-premise, and then allows you to download the filtered data.
The linked data interface also works rather well for "simple query, high volume" access needs. An example query gives all Twitter IDs (326,000+), and you can read it in pages as fast as you can issue GET requests (set an appropriate Accept header to get JSON).

How to fetch results from an offset when the API doesn't support offset (HERE Maps API)

I have a search functionality that gets data from the HERE API's Search endpoint. I keep records of each search's results so I can add metadata I need for my own purposes, and also so I can serve results without always going back to the HERE API.
The problem I have is with pagination, specifically with providing a starting index when fetching results from HERE. Similar to how Algolia does it, I want to be able to search for a term and begin the results at a certain index, the offset. The HERE API apparently doesn't allow this at all. The closest it comes to such a feature is providing the URL for the next search, as described here. This is limited because it doesn't let me start the search results at a particular index that I specify.
So essentially I want to know if there's a "standard" way of getting such functionality even when it's not provided by the API.
My own solution
The HERE API provides a size parameter for specifying the total number of results I want, so I can request a larger size than I need and use code to start the results from my desired index. But this feels a bit hacky, and I wonder if there's a better or more established way of doing this.
Happy to listen to any ideas! Thanks. :)
This kind of 'offset' for starting the paging after a specific number of results is indeed not supported by the Places API itself.
You have to set up a workaround within your application.
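One way to implement that workaround, along the lines of the questioner's own idea, is to over-fetch and slice locally. This is only a minimal sketch: the endpoint and parameter names below (discover/search, q, at, size, apiKey) are assumptions based on the Places API described in the question and may need adjusting for your setup:

```python
# Minimal sketch: emulate an "offset" by requesting offset + limit results
# and slicing locally. Endpoint and parameter names are assumptions and may
# differ from the exact HERE Places API you are using.
import requests

PLACES_SEARCH_URL = "https://places.ls.hereapi.com/places/v1/discover/search"  # assumed

def search_with_offset(query, at, offset, limit, api_key):
    # Ask HERE for enough results to cover offset + limit, then slice.
    resp = requests.get(PLACES_SEARCH_URL, params={
        "q": query,
        "at": at,                  # "lat,lng" of the search centre
        "size": offset + limit,    # over-fetch so we can skip `offset` items
        "apiKey": api_key,
    })
    resp.raise_for_status()
    items = resp.json()["results"]["items"]
    return items[offset:offset + limit]

# Usage: third "page" of 20 results for coffee shops near Berlin.
# results = search_with_offset("coffee", "52.5159,13.3777",
#                              offset=40, limit=20, api_key="...")
```

Since you already store each search's results on your side, you can also serve later pages from your own database and only over-fetch from HERE when your cache runs out.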

Kapow Robot - Extract business Operating hours from Google Search Results

Is it possible to create a Kapow robot that can search Google for the operating hours of the businesses in our list/database and update the timings if changes are made?
Please also share any other, more efficient approaches than the Kapow robot that can be implemented with minimal effort and cost.
That's what the Google Places API is there for. While you could in theory just open Google Maps in a Load Page action, enter the query string and then parse the results, I would advise against it. Here's why:
The API will be faster, returning results in a structured manner (JSON)
Kapow has actions for calling RESTful services and parsing/modifying JSON
Google does not like robots parsing their pages, and most likely will lock you out (i.e. present you with Captchas sooner or later)
If you decide to go for the API, here's what you should do:
Get your API key first; see this page for details: https://developers.google.com/places/web-service/get-api-key. Note that the free plan allows for 1,000 requests within a 24-hour window (https://developers.google.com/places/web-service/usage)
Maintain the place ids for all the businesses you'd like to query regularly, and update your list.
For each place, retrieve the details as described in the API documentation. The opening hours will be within the JSON response (see the sketch after this list): https://developers.google.com/places/web-service/details
Update your list. I'd recommend using a definite type in Kapow for that, and using the actions Store in Database and Query Database. In case you need the data elsewhere, you may create additional robots (e.g. for Excel files, sending data per email, et cetera).
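To illustrate step 3 outside of Kapow, here is a minimal sketch of the Place Details call and where the opening hours sit in the JSON response; the same request can be made from Kapow's REST-call action and parsed with its JSON actions:

```python
# Minimal sketch: fetch a place's opening hours via the Place Details endpoint.
import requests

DETAILS_URL = "https://maps.googleapis.com/maps/api/place/details/json"

def get_opening_hours(place_id, api_key):
    resp = requests.get(DETAILS_URL, params={
        "place_id": place_id,
        "fields": "name,opening_hours",   # limit the response to what we need
        "key": api_key,
    })
    resp.raise_for_status()
    result = resp.json().get("result", {})
    # weekday_text is a human-readable list like "Monday: 9:00 AM - 5:00 PM".
    return result.get("opening_hours", {}).get("weekday_text", [])

# for place_id in my_place_ids:
#     hours = get_opening_hours(place_id, API_KEY)
#     ...store/update the timings in your database...
```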

What is the maximum number of results returned for a YouTube Data API v3 call?

Context
I am providing some consultancy on doing an HTTP GET using the YouTube Data API v3, in order to develop a Windows-based application that GETs a list of results from YouTube for, say, a specific category or a specific tag.
We are open to using any programming language (I'm from a C++ background and am hoping YouTube supports direct HTTP connections without the Google client SDKs and so on) to connect to YouTube and GET data over HTTP. (This would run once a month or so, so YouTube API quotas should not be a problem.)
The Issue
Some of my client's web developers are telling us that the YouTube API v3 will only return a maximum of 500 records/results, even for a query that returns just the total view count, the video's link, and basic metadata like that.
So, say I wish to find 5,000 results for the category "house music" or "basketball", and I have the developer key etc. all set up: would that be possible?
If so, what GET fields would I need to populate (such as "max_results_per_page")?
Thank you.
The API won't provide more than ~500 search results for any arbitrary query. It's by design. Technically, it means that the nextPageToken field won't be returned once you hit ~500 results. No additional parameter can change that.
If you want more than ~500 results for a query, you have to split it into more specific sub-queries. I'd suggest using the publishedAfter and publishedBefore parameters to achieve that, but feel free to experiment with the other ones here.
This only holds for search queries. Other endpoints, such as playlistItems.list, deliver more results; I have tested retrieving the videos of a playlist with 100,000 items.
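As a rough illustration of the date-slicing suggestion above, here is a minimal sketch using plain HTTP GETs against search.list (no client SDK); the category is expressed here as a plain q term, which is a simplification:

```python
# Minimal sketch: collect more than ~500 search results by splitting the query
# into date windows with publishedAfter/publishedBefore and paging each window
# with pageToken. Plain HTTP GET, no Google client SDK required.
import requests

SEARCH_URL = "https://www.googleapis.com/youtube/v3/search"

def search_window(api_key, query, published_after, published_before):
    video_ids, page_token = [], None
    while True:
        params = {
            "key": api_key,
            "part": "snippet",
            "type": "video",
            "q": query,
            "maxResults": 50,                    # hard per-page maximum
            "publishedAfter": published_after,   # RFC 3339, e.g. "2016-01-01T00:00:00Z"
            "publishedBefore": published_before,
        }
        if page_token:
            params["pageToken"] = page_token
        data = requests.get(SEARCH_URL, params=params).json()
        video_ids += [item["id"]["videoId"] for item in data.get("items", [])]
        page_token = data.get("nextPageToken")
        if not page_token:          # ~500-result ceiling reached for this window
            return video_ids

# Usage: run one window per month and merge, de-duplicating video IDs.
# ids = search_window(API_KEY, "house music",
#                     "2016-01-01T00:00:00Z", "2016-02-01T00:00:00Z")
```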

Filter Google query results

I'm writing a search engine for Wikipedia articles using Lucene on the wiki XML dump, and I want to measure the engine's accuracy against Google's Wikipedia results for a particular query, by adding "site:en.wikipedia.org" to the query. I want to do this for multiple queries, so I'm currently getting the Google search result URLs manually. I got the Google APIs to use a bot to search Google, but the problem is that I want to get rid of certain types of results, such as
"/Category:"
"/icon:"
"/file:"
"/photo:"
and user pages.
But I haven't found a convenient way to do this other than an iterative method: issue a query, get n results, filter some out using regular expressions, then retrieve the remaining (n-x) results, and so on. Google keeps blocking me when I do that.
Is there an intelligent way to get Google results the way I want using Java?
Thanks in advance guys.
You could just try excluding those pages from the Google results, like this:
living people site:en.wikipedia.org -inurl:category -inurl:category_talk -inurl:file -inurl:file_talk -inurl:user -inurl:user_talk
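If you run this programmatically rather than against scraped result pages (which is what triggers the blocking), a minimal sketch with the Custom Search JSON API could look like the following; the API key and cx (custom search engine ID) are placeholders you create in the Google developer console, and it's shown in Python for brevity since the request is a plain HTTP GET that maps directly to Java:

```python
# Minimal sketch: run the exclusion query via the Custom Search JSON API
# instead of scraping result pages. API_KEY and CX are placeholders.
import requests

API_KEY = "YOUR_API_KEY"   # placeholder
CX = "YOUR_ENGINE_ID"      # placeholder

QUERY = ("living people site:en.wikipedia.org "
         "-inurl:category -inurl:category_talk -inurl:file -inurl:file_talk "
         "-inurl:user -inurl:user_talk")

def fetch_results(start=1, num=10):
    resp = requests.get("https://www.googleapis.com/customsearch/v1", params={
        "key": API_KEY,
        "cx": CX,
        "q": QUERY,
        "start": start,   # 1-based index of the first result to return
        "num": num,       # up to 10 results per request
    })
    resp.raise_for_status()
    return [item["link"] for item in resp.json().get("items", [])]

# urls = fetch_results(start=1) + fetch_results(start=11) + ...
```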