How to implement social search/typeahead (like FB/LinkedIn) using Elasticsearch/Lucene - lucene

I'm trying to figure out the best way to implement user search where the users are sorted by their "social distance" from you: first 2nd-degree friends (friends-of-friends), followed by 3rd-degree friends (friends-of-friends-of-friends).
I found a few resources on it online:
1) LinkedIn: Cleo is their older typeahead engine - they use friends lists/adjacency lists as an inverted index. LinkedIn also has a newer search architecture called Galene that is built on top of Lucene, but they don't mention how they implement social search.
2) odnoklassniki.ru: They mention around 5:00 that the typeahead search first looks in a "personal" Lucene index - I would assume that it's basically an index limited to the user's adjacency list.
My questions are:
1) How would you integrate such a "personalized" adjacency list into Lucene on a per-user basis? Is it possible to tell Lucene, "use this friends list as an inverted index for the search"?
2) If you need a more complicated (faceted) search, you would want the social data embedded in the schema (as opposed to having personalized indexes) - how would one go about that? I guess one option would be to use the parent/child relationship, but that only works for 2nd-degree friends, not 3rd.
P.S. Another way to go about it would be to use a graph database, but I haven't seen any info about big companies using one for search.
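As an illustration of question 1, here is one fan-out-on-read sketch (not necessarily what LinkedIn/Cleo actually does): compute the user's 2nd/3rd-degree neighborhood at query time with a BFS over the adjacency lists, then hand that set to the engine as a terms filter. The field names `name` and `user_id` and the Elasticsearch query shape are assumptions for the example:

```python
from collections import deque

def friends_within_degree(adjacency, user, max_degree=3):
    """Breadth-first search over the friendship graph, returning
    {user_id: degree} for everyone within max_degree hops
    (the user themselves is excluded)."""
    degrees = {user: 0}
    queue = deque([user])
    while queue:
        current = queue.popleft()
        if degrees[current] == max_degree:
            continue  # don't expand past the requested distance
        for friend in adjacency.get(current, ()):
            if friend not in degrees:
                degrees[friend] = degrees[current] + 1
                queue.append(friend)
    degrees.pop(user)
    return degrees

# Toy graph: alice-bob, bob-carol, carol-dave
graph = {
    "alice": ["bob"],
    "bob": ["alice", "carol"],
    "carol": ["bob", "dave"],
    "dave": ["carol"],
}
reachable = friends_within_degree(graph, "alice")
# reachable == {"bob": 1, "carol": 2, "dave": 3}

# Pass the neighborhood to Elasticsearch as a terms filter; the degree
# map can then be used for boosting/tie-breaking on the application side.
es_query = {
    "query": {
        "bool": {
            "must": [{"match": {"name": "car"}}],
            "filter": [{"terms": {"user_id": sorted(reachable)}}],
        }
    }
}
```

This avoids per-user indexes entirely, at the cost of recomputing (or caching) the neighborhood and shipping a potentially large terms list with every query.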

Related

How to integrate search results with database

Let's say I store story documents in Elasticsearch like:
{
  story_id: 1,
  title: "Hello World",
  content: "Foo. Bar.",
  likes: 2222
}
When the client (frontend) searches, they should have the option to like (or remove their like) any of the search results, but there should also be an indication of whether or not they liked each result already.
What is a good way to get this information to the client?
1) Perform a database query to get all the stories the user has liked and keep them in the client's local storage. When search results are retrieved, map the user's liked stories onto the results on the client. This adds the complexity of keeping local storage in sync with the API when a user likes a story. Also, the number of stories a user likes could get very large.
2) Keep a list of users that have liked a story within the document itself, and when searching check whether the user is in the list. This could blow up the search index size:
{ ...
likes: [ 'foo_user', 'bar_user', ... ]
}
3) In the API, after the search, perform a database query to determine which stories in the search response the user has already liked, and map this info onto the search results before returning the API response. This could slow down searches because an additional database query is required, but maybe it is inconsequential?
For this use case, the most common/mainstream approach would be your option 3:
Save every like as a record in the datastore.
Index docs in Elasticsearch (ES) with, most likely, only the properties you will use for searching and aggregation, not the whole doc.
When a query/search comes from the frontend, look up the matching docs in ES and take their ids.
Go to the like records in the datastore and check whether there is a like record from this user for each of them.
Combine this info and return the whole docs to the frontend.
The additional datastore lookup would not cost you much in either time or money, I would say, and it wouldn't affect the user experience much either.
My only concern is that because every query has to check the likes collection, the request is not CDN/cache friendly.
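The per-request flow of option 3 can be sketched in a few lines of Python; the `annotate_with_likes` helper and the in-memory `likes_store` dict are stand-ins for the real datastore lookup:

```python
def annotate_with_likes(search_hits, likes_store, user_id):
    """After the ES search, look up which of the returned story ids the
    user has liked and merge a liked_by_me flag into each hit.
    likes_store stands in for the real datastore: {user_id: set of ids}."""
    liked = likes_store.get(user_id, set())
    return [
        {**hit, "liked_by_me": hit["story_id"] in liked}
        for hit in search_hits
    ]

hits = [
    {"story_id": 1, "title": "Hello World", "likes": 2222},
    {"story_id": 2, "title": "Another Story", "likes": 5},
]
likes_store = {"foo_user": {1}}

result = annotate_with_likes(hits, likes_store, "foo_user")
# result[0]["liked_by_me"] is True, result[1]["liked_by_me"] is False
```

In a real datastore this is one query for the handful of ids on the current results page (e.g. `WHERE user_id = ? AND story_id IN (...)`), which is why the extra lookup tends to be cheap.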

Creating a SOLR index for activity stream or newsfeed

I am trying to index the activity feed of a social portal I am building. The portal allows users to follow each other to get updates from the people they follow as an activity feed sorted by date.
For example, user A will be following users B, C, D, E & F. So user A should see all the posts from B, C, D, E & F on his/her activity feed.
Let's assume the post consists of just two fields:
1. The text of the post. (text_field)
2. The name/UID of the user who posted it. (user_field)
Currently, I am creating one index for all the posts, indexing the text_field & user_field. At scale, there can be 1,000,000+ posts, and a user may follow 100s if not 1000s of users. What will be the best way to create an index for this scenario?
Should I also index a person's followers, so that they can be quickly looked up and passed to a second query that gets the posts of all those users sorted by date?
What is the best way to query the index consisting of all these posts by passing the UIDs of all the users that are followed, considering this may be in the 100s or more?
Update:
The motivation for using Solr for the news feed was mainly inspired by this detailed slide and my brief discussion with OpenSocial team.
When starting off with a social portal, fan-out-on-write seems like overkill and more expensive; fan-out-on-read is the better fit. Both the slides and the OpenSocial team suggested using a search backend for fan-out-on-read, and the slides mentioned above also have data on how it helped them.
At present, the feed is going to be flat and the only sort criterion will be date (recency). We won't be considering relevance or posts from closer groups.
It's kind of abstract, but I will do my best here. Based on what you mentioned, I am not sure Solr is really the right tool for the job. You can still use Solr for full-text search, but I am not sure about generating a news feed from it in this scenario. Remember that although Solr is pretty impressive, it is a search engine. I will pretend that you will stick with Solr for the rest of the post; keep in mind, though, that we are trying to put a square peg through a round hole here.
Here are a few additional questions you should think about.
You will probably want to add a timestamp of the post to the data element.
You need to figure out how to properly sort the results. Is it in order of recency? Or based on posts that the user is more likely to interact with?
If a user has 1000+ connections, would he want to see an update from every one of them in the main feed? Or should posts from a closer group of friends show up higher?
Here are some comments about your questions:
1) If you index a person's followers, it may be hard to keep up. I am assuming followers are going to change often, and re-indexing in this scenario would not really be practical.
2) That sounds more on par, but again, you need to figure out the sorting. You can get the list of connections for the user, then run a search for the top posts from all of them.
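That second approach can be sketched in Python by building the Solr request directly, using Solr's `{!terms}` query parser to pass the followed UIDs and sorting by recency. `user_field` comes from the question; `post_date` is a hypothetical timestamp field added per the suggestion above:

```python
def feed_query(followed_uids, rows=20):
    """Build Solr request params for a fan-out-on-read feed: match posts
    whose user_field is any followed UID, newest first. user_field is
    from the question; post_date is a hypothetical timestamp field."""
    return {
        "q": "{!terms f=user_field}" + ",".join(followed_uids),
        "sort": "post_date desc",
        "rows": rows,
    }

params = feed_query(["B", "C", "D", "E", "F"])
# params["q"] == "{!terms f=user_field}B,C,D,E,F"
```

The `{!terms}` parser is designed for exactly this shape of query (matching a document field against a long client-supplied list), which keeps even 100s or 1000s of UIDs manageable compared to a giant boolean OR.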

How to get random page of specific "Portal:" using WikiMedia API

Is there any way to get a random article of a specific Wikimedia portal using the MediaWiki API?
For example, I need random page of Portal:Science.
Does anyone know how to do this?
What you're asking for doesn't quite make sense, because a portal doesn't have a list of pages associated with it.
The closest thing you can do is get a random page from e.g. Category:Science or one of its subcategories. There is no way to do that directly through the API; you would need to traverse all the subcategories and choose a random page from them yourself.
There is a tool that already does this (with a limit on the depth of the category tree): erwin85's random article and there is also a template for it on the English Wikipedia.
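A minimal sketch of that category-based workaround: build a `list=categorymembers` request (a real MediaWiki API module), traverse subcategories yourself, and pick a random page from the collected titles. The HTTP and recursion parts are left as comments so the example stays self-contained:

```python
import random

API_URL = "https://en.wikipedia.org/w/api.php"  # MediaWiki API endpoint

def categorymembers_params(category, limit=500):
    """Params for one page of a MediaWiki list=categorymembers query."""
    return {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": category,
        "cmlimit": limit,
        "format": "json",
    }

def pick_random(member_titles, rng=random):
    """Choose one title from the already-collected member list."""
    return rng.choice(member_titles)

params = categorymembers_params("Category:Science")
# A real client would GET API_URL with these params, recurse into
# subcategories (cmtype=subcat, with a depth limit to avoid category
# loops), collect the page titles, and then call pick_random on them.
```

Note that category trees on Wikipedia can loop and drift off-topic, which is why the existing tools cap the traversal depth.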

Know the page rank for certain keywords

I want to know the rank of my page for certain keywords. For example, when I search for "best movies 2012" my page does come up, but on the 30th to 50th results page. I want to query the result set Google returns for my keywords so that I can see the rank of my page and of my competitors for typical keywords.
I think you may be confusing PageRank with positions. PageRank is an algorithm that Google uses to determine the authority of your site; it doesn't always affect the positions for certain keywords.
There are plenty of good programs and web services around that you can use such as
http://raventools.com/
Most of the good free web services have been shut down because Google now limits the number of searches that can be performed and charges for this data.
You could check out:
http://www.semrush.com
It's free but you have to register to get data.
There are several web services providing this functionality: http://raventools.com/ or http://seomoz.org/
Or, you can perform the task manually. Here is an example on how to query google search using Java: How can you search Google Programmatically Java API
You need to compare your webpage's PageRank and website PR against those of the competition. The best indication we have of website PR is the homepage PageRank.
Ensure that you do this on the appropriate Google domain: Google.com for the USA, Google.co.uk for the UK, etc.
The technique is described in more detail on http://www.keywordseopro.com
You can repeat the technique for each keyword.

Programmatic Querying of Google and Other Search Engines With Domain and Keywords

I'm trying to find out if there is a programmatic way to determine how far down in a search engine's results my site shows up for given keywords. For example, my query would provide my domain name and keywords, and the result would return, say, 94, indicating that my site was the 94th result. I'm specifically interested in how to do this with Google, but also interested in Bing and Yahoo.
No.
There is no programmatic access to such data. People generally roll their own version of such trackers: fetch the Google search page and use regexes to find your position. But note that different results are now shown in different geographies, and results are personalized.
The gl=us parameter will get you results from the US; you can change it accordingly to target other geographies.
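A toy sketch of the roll-your-own approach: once you have extracted the result URLs (by whatever means: scraping with regexes as described above, or an official search API), finding your position is a simple scan. The `position_of` helper and the sample SERP list are illustrative only:

```python
from urllib.parse import urlparse

def position_of(domain, result_urls):
    """Return the 1-based rank of the first result whose host matches
    the given domain, or None if it is absent. Remember that geography
    (e.g. the gl= parameter) and personalization change the list."""
    for rank, url in enumerate(result_urls, start=1):
        host = urlparse(url).netloc.lower()
        if host == domain or host.endswith("." + domain):
            return rank
    return None

serp = [
    "https://example.org/a",
    "https://www.mysite.com/page",
    "https://other.net/x",
]
print(position_of("mysite.com", serp))  # 2
```

The host suffix check (`.mysite.com`) lets `www.` and other subdomains count as a match for the tracked domain.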
Before creating this from scratch, you may want to save yourself some time (and money) by using a service that does exactly that [and more]: Ginzametrics.
They have a free plan (so you can test if it fits your requirements and check if it's really worth creating your own tool), an API and can even import data from Google Analytics.