How to integrate search results with database - api

Let's say I store story documents in Elasticsearch like this:
{
  story_id: 1,
  title: "Hello World",
  content: "Foo. Bar.",
  likes: 2222
}
When the client (frontend) searches, they should have the option to like (or remove their like) any of the search results, but there should also be an indication of whether or not they liked each result already.
What is a good way to get this information to the client?
1. Perform a database query to get all the stories the user has liked and keep it in the client's local storage. When search results are retrieved, map the user's liked stories to the retrieved search results on the client. This would add the complexity of updating local storage as well as the API when a user likes a story. Also, the number of stories a user likes could get very large.
2. Keep a list of users that have liked a story within the document itself and, when searching, check if the user is in the list. This could blow up the search index size?
{ ...
  likes: [ 'foo_user', 'bar_user', ... ]
}
3. In the API, after the search, perform a database query to determine which stories in the search response the user has already liked, and map this info to the search results before returning the API response. This could slow down searches because an additional database query is required, but maybe it is inconsequential?

For this use case, the most common approach would be your option 3.
Save every like as a record in your datastore.
Index the docs in Elasticsearch (ES) with, most likely, only the properties you will use for searching and aggregation, not the whole doc.
When a search comes in from the frontend, look up the matching docs in ES and take their ids.
Go to the like records in the datastore and check whether the user has a like record for each of those ids.
Combine this info and return the whole doc to the frontend.
The additional datastore lookup would not cost you much in either time or money, I would say. It wouldn't affect user experience much either.
My only concern is that, because every query needs to check the likes collection, this request is not CDN/cache friendly.
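A minimal sketch of the lookup-and-merge step in Python, assuming the likes live in a relational table called likes(user_id, story_id) and the search step has already returned a list of hits. The table, the function names and the liked_by_user field are made up for illustration, and the Elasticsearch call itself is left out.

```python
import sqlite3


def annotate_with_likes(conn, user_id, hits):
    """Return copies of the search hits with a liked_by_user flag added."""
    story_ids = [hit["story_id"] for hit in hits]
    if not story_ids:
        return []
    # One query for the whole result page, not one query per story.
    placeholders = ",".join("?" for _ in story_ids)
    rows = conn.execute(
        f"SELECT story_id FROM likes WHERE user_id = ? AND story_id IN ({placeholders})",
        [user_id, *story_ids],
    ).fetchall()
    liked = {row[0] for row in rows}
    return [{**hit, "liked_by_user": hit["story_id"] in liked} for hit in hits]


def demo():
    # In-memory stand-in for the real datastore (assumption for the example).
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE likes (user_id TEXT, story_id INTEGER, "
        "PRIMARY KEY (user_id, story_id))"
    )
    conn.executemany(
        "INSERT INTO likes VALUES (?, ?)", [("foo_user", 1), ("bar_user", 2)]
    )
    # Stand-in for the hits that came back from the search index.
    hits = [
        {"story_id": 1, "title": "Hello World"},
        {"story_id": 2, "title": "Second story"},
    ]
    return annotate_with_likes(conn, "foo_user", hits)
```

Because the lookup is limited to the ids on one result page, its cost depends on the page size and not on how many stories the user has liked overall.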


Filter, subset and download Wikidata

Is there any easier way to filter data in Wikidata and download a portion of claims?
For example, let us say that I want a list of all humans who are currently alive and have an active Twitter profile.
I would like to download a file containing their Q-ids, names and Twitter usernames (https://www.wikidata.org/wiki/Property:P2002).
I expect there to be hundreds of thousands of results, if not millions.
What is the best way to obtain this information?
I am not sure whether the results of a SPARQL query can be collected in a file.
I looked at the MediaWiki API, but I am not sure whether it allows accessing multiple entities in one go.
Thanks!
Wikidata currently has around 190,000 Twitter IDs linked to people. You can easily get them all using the SPARQL Query Interface: Web Interface (with a LIMIT you can remove or increase). In the dropdown on the right, choose SPARQL Endpoint for the Direct Link (no limit, 35MB .csv).
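As a rough sketch of what such a query and download could look like in Python: the query below is my own guess at the filter, not the one behind the links above. It treats "alive" as "no date of death (P570) recorded", which is only a proxy, and the User-Agent string and function names are placeholders.

```python
import urllib.parse
import urllib.request

ENDPOINT = "https://query.wikidata.org/sparql"

# Humans (P31 = Q5) with a Twitter username (P2002) and no date of death (P570).
QUERY = """
SELECT ?person ?personLabel ?twitter WHERE {
  ?person wdt:P31 wd:Q5 ;
          wdt:P2002 ?twitter .
  FILTER NOT EXISTS { ?person wdt:P570 ?died . }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
"""


def build_request(query, user_agent="example-script/0.1 (placeholder contact)"):
    """Build a GET request that asks the endpoint for CSV output."""
    url = ENDPOINT + "?" + urllib.parse.urlencode({"query": query})
    headers = {"Accept": "text/csv", "User-Agent": user_agent}
    return urllib.request.Request(url, headers=headers)


def download_csv(query, path):
    """Run the query and write the raw CSV response to a file."""
    with urllib.request.urlopen(build_request(query)) as resp:
        with open(path, "wb") as out:
            out.write(resp.read())
```

Calling download_csv(QUERY, "twitter_ids.csv") would write one row per person with the entity URI, the English label and the Twitter username. The label service adds work on the server, so dropping ?personLabel is one thing to try if the query times out.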
But, in case you run into timeouts with more complicated queries, you can first try LIMIT and OFFSET, or one of:
Wikibase Dump Filter is a CLI tool that downloads the full Wikidata dump but filters the stream as it comes in according to your needs. You can put very much the same thing together with some creative pipe|ing and it tends to work better than one would expect.
wdumps.toolforge.org (https://wdumps.toolforge.org) does more or less the same thing but runs on the server side, then lets you download the filtered data.
The linked data interface also works rather well for "simple query, high volume" access needs. Example here gives all Twitter IDs (326,000+) and you can read it in pages as fast as you can generate GET requests (set an appropriate Accept header to get JSON).
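For the LIMIT and OFFSET route, paging can be sketched roughly like this in Python. The helper name and the base query are illustrative. The ORDER BY is there because, without a stable order, page boundaries are not guaranteed to line up between requests; it also makes each request more expensive.

```python
def paged_queries(base_query, page_size, max_pages):
    """Yield copies of base_query with LIMIT/OFFSET appended, one per page."""
    for page in range(max_pages):
        offset = page * page_size
        yield f"{base_query.rstrip()}\nLIMIT {page_size}\nOFFSET {offset}"


# Illustrative base query: every item with a Twitter username (P2002).
BASE = (
    "SELECT ?person ?twitter WHERE { ?person wdt:P2002 ?twitter . }\n"
    "ORDER BY ?person"
)
```

The caller would send each generated query in turn and stop as soon as a page comes back with fewer than page_size rows.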

Kapow Robot - Extract business Operating hours from Google Search Results

Is it possible to create a Kapow Robot that can search Google for the operating hours of the businesses in our list/database and update the timings if changes are made?
Please share if there are any ways more efficient than the Kapow robot that can be implemented with minimal effort and cost.
That's what the Google Places API is there for. While you could in theory just open Google Maps in a Load Page action, enter the query string and then parse the results, I would advise against it. Here's why:
The API will be faster, returning results in a structured manner (JSON)
Kapow has actions for calling RESTful services and parsing/modifying JSON
Google does not like robots parsing their pages, and most likely will lock you out (i.e. present you with Captchas sooner or later)
If you decide to go for the API, here's what you should do:
Get your API key first, see this page for details: https://developers.google.com/places/web-service/get-api-key. Note that the free plan allows for 1,000 requests per 24-hour period (https://developers.google.com/places/web-service/usage)
Maintain the place ids for all the businesses you'd like to query regularly, and update your list.
For each place, retrieve the details as described in the API documentation. The opening hours will be within the JSON response: https://developers.google.com/places/web-service/details
Update your list. I'd recommend using a definite type in Kapow for that, and using the actions Store in Database and Query Database. In case you need the data elsewhere, you may create additional robots (e.g. for Excel files, sending data per email, et cetera).
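Outside of Kapow, the details lookup in the steps above can be sketched in a few lines of Python. This assumes the Place Details web service URL and the opening_hours.weekday_text field of its JSON response; the function names are mine, and SAMPLE is made-up data shaped like such a response, not real output.

```python
import urllib.parse

DETAILS_URL = "https://maps.googleapis.com/maps/api/place/details/json"


def details_url(place_id, api_key):
    """Build a Place Details request that asks only for the opening hours."""
    params = {"place_id": place_id, "fields": "opening_hours", "key": api_key}
    return DETAILS_URL + "?" + urllib.parse.urlencode(params)


def extract_weekday_text(response):
    """Pull the human-readable opening hours out of a parsed JSON response."""
    if response.get("status") != "OK":
        return []
    hours = response.get("result", {}).get("opening_hours", {})
    return hours.get("weekday_text", [])


# Made-up sample for illustration only.
SAMPLE = {
    "status": "OK",
    "result": {
        "opening_hours": {
            "weekday_text": ["Monday: 9:00 AM - 5:00 PM", "Tuesday: Closed"]
        }
    },
}
```

Comparing the extracted list with the value stored for that place id is enough to tell whether the hours have changed since the last run.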

How to get public data from Google plus

I have a project that involves downloading public data from Google Plus. Can you give me a reference on how I can download about 1 GB of any type of public data from Google Plus?
The data can be posts or circles information. I've tried working with the developer tools, but the farthest I got was downloading my own profile information, and what I need is public data.
Thanks !
There is no truly "public" data on Google+.
Every stream is unique to a user.
Try viewing the site without logging in, and you'll see what I mean.
Since users have the ability to block other users from viewing even their "public" posts, before Google shows you a post they check to see if you're on the blocked list. For them to be able to do that, you have to be logged in.
Your best bet would be to create a dummy account and only look at your nearby stream or What's Hot.
Otherwise you'd need to circle users, and that would create the stream. G+ is not like Twitter. There's no firehose to speak of.
To programmatically cull data, you would have to use their API, but even then their HTTP API limits you to 20 results per search and you have to provide a query.
You could get up to 100 results per user if you picked individuals and got their userids, but again there's not a programmatic way to get a bulk dump.
You could randomly select users by using an activity search for a dictionary entry, and then seed that into the activity listing API, something like (in pure pseudocode):
for random word in dictionary
    group = userids from GET https://www.googleapis.com/plus/v1/activities?query=[word]
    for userid in group
        GET https://www.googleapis.com/plus/v1/people/[userid]/activities/collection/public
Actual code would of course depend on the language.

Instagram: sort photos with a specific tag with most likes

I'm running a contest on the web where the image with the most likes wins. It's tiresome having to go through 900 images manually, so what I want to do is sort all images with a given tag, let's say #computer, by number of likes, with the most liked pics on top. I have searched the net like crazy for some program or site that does this (ExtraGram, gramhoot, statigram, webstagram) but none offer to sort by number of likes and it drives me INSANE! It's a really relevant request.
I've tried instafeed.js but it doesn't include all images; actually it leaves out the ones with the most likes, which defeats the purpose.
There's nothing I know of in the Instagram API that sends back media sorted by likes in advance. I don't think there's a tool to do this either, but writing one is relatively simple IMO and I've done it before for a contest specifically.
The simplest approach is the following:
1. Use the Instagram API (via a library or pure REST) to query by tag. For instance, if you only care about the most recently tagged media or you want to process by date, you can use the /tags/tag-name/media/recent endpoint.
2. Page through each result page by processing the next_max_id/next_max_tag_id.
3. Collect the results locally into a database. You will receive the "like" count for each media item. You will have to update the data if you want to track the likes over time.
4. Sort the results using your database or, if it's a small result set, you could skip #3 and just sort in memory.
If you need to refresh the results, you need to subscribe to the tag via the API. You can give Instagram a URL to push updates to, and then you'll have to retrieve one or more media items and update them in your database accordingly.
You will of course need to register your application with Instagram to get an API key if you want to do this. Then you can either send them your client_id or use OAuth.
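Here is a rough Python sketch of the page-collect-sort loop for a small result set kept in memory. It assumes each response page has a data list whose items carry likes.count, and a pagination object with next_max_tag_id. fetch_page is a hypothetical callable that performs the actual HTTP request, and FAKE_PAGES is made-up data so the sketch runs without the API.

```python
def collect_media(fetch_page):
    """Follow next_max_tag_id until the last page and return all media items."""
    media, max_tag_id = [], None
    while True:
        page = fetch_page(max_tag_id)
        media.extend(page.get("data", []))
        max_tag_id = page.get("pagination", {}).get("next_max_tag_id")
        if not max_tag_id:
            return media


def sort_by_likes(media):
    """Most liked first; items without a like count sort last."""
    return sorted(
        media, key=lambda m: m.get("likes", {}).get("count", 0), reverse=True
    )


# Made-up pages standing in for real API responses.
FAKE_PAGES = {
    None: {
        "data": [
            {"id": "a", "likes": {"count": 3}},
            {"id": "b", "likes": {"count": 40}},
        ],
        "pagination": {"next_max_tag_id": "p2"},
    },
    "p2": {"data": [{"id": "c", "likes": {"count": 12}}], "pagination": {}},
}


def fake_fetch(max_tag_id):
    return FAKE_PAGES[max_tag_id]
```

With a real fetch_page in place of fake_fetch, sort_by_likes(collect_media(fetch_page)) gives the contest ranking at the moment of the last request. Like counts keep changing, so the run has to be repeated close to the deadline.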
The best way to achieve this is to pull the photos in and then sort them programmatically based on the numeric likes value. I've designed a plugin that does this automatically, for anyone interested: Instagram Journal.

Filtering Foursquare Venue Results

I am currently evaluating several different APIs in order to get venue information. A key component of any provider is the ability to not just return all venues nearby but tailor the list based on previously entered user preferences.
Foursquare does not allow 'munging' their venue data with other data, such as Google Places, to create an aggregated service. But can I take Foursquare's venues for a given area, apply some filtering based on user preferences and recommendation engine techniques, and present a modified, personalized version of their information? Do they frown on only using their venue info as a jumping off point, even if attribution on the final results is given?
This customization would be above and beyond using retailer categories, something that can be included in the facebook request. Asking because other services require results presented exactly as returned from the API, including ads.
First, check out the policies at https://developer.foursquare.com/overview/community
We welcome you to use foursquare as your location database. You can associate additional content with our venue data in your system, but you may not combine our database with another database or export it on your own.
I think that they even encourage you to manipulate the data and create creative solutions with it, as long as you do not break their ground rule of not merging it with another database (see the full text at the link).
The API even lets you filter the results according to your needs with the categoryId and intent parameters. For example, in our app we filter out places where fewer than 2 unique people have checked in, because we assume they are fake places. We do other filtering on the result set as well, but we display only data from the foursquare venues database, and we attribute.
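To make that concrete, here is a small Python sketch of both halves: passing categoryId and intent on the request, then dropping venues with too few unique visitors. It assumes the v2 venues/search URL and that each venue carries a stats.usersCount field. The function names, the default version date and the sample venues are made up for illustration.

```python
import urllib.parse

SEARCH_URL = "https://api.foursquare.com/v2/venues/search"


def search_url(lat, lng, category_id, intent, client_id, client_secret,
               version="20140101"):
    """Build a venue search URL; the version date is an arbitrary example."""
    params = {
        "ll": f"{lat},{lng}",
        "categoryId": category_id,
        "intent": intent,
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
    }
    return SEARCH_URL + "?" + urllib.parse.urlencode(params)


def filter_venues(venues, min_users=2):
    """Keep venues where at least min_users unique people have checked in."""
    return [
        v for v in venues
        if v.get("stats", {}).get("usersCount", 0) >= min_users
    ]


# Made-up venues shaped like items of a search response.
SAMPLE_VENUES = [
    {"name": "Real Cafe", "stats": {"usersCount": 57}},
    {"name": "My Couch", "stats": {"usersCount": 1}},
    {"name": "No Stats Yet"},
]
```

Everything returned by filter_venues is still unmodified foursquare venue data, so the filtering stays on the right side of the rule quoted above.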