Creating a SOLR index for activity stream or newsfeed - indexing

I am trying to index the activity feed of a social portal I am building. The portal allows users to follow each other and receive updates from the people they follow as an activity feed sorted by date.
For example, user A will be following users B, C, D, E & F. So user A should see all the posts from B, C, D, E & F on his/her activity feed.
Let's assume a post consists of just two fields.
1. The text of the post. (text_field)
2. The name/UID of the user who posted it. (user_field)
Currently, I am creating an index for all the posts and indexing the text_field & user_field. At scale, there can be 1,000,000+ posts, and a user may follow hundreds if not thousands of users. What is the best way to create an index for this scenario?
Should I also index a person's followers, so that they can be looked up quickly and then passed to a second query that fetches the posts of all those users sorted by date?
What is the best way to query the index of all these posts by passing the UIDs of all the users that are followed, considering there may be hundreds or more?
Update:
The motivation for using Solr for the news feed was mainly inspired by this detailed slide and my brief discussion with the OpenSocial team.
When starting off with a social portal, fan-out on write seems like overkill and is more expensive, whereas fan-out on read is a better fit. Both the slide and the OpenSocial team suggested using a search backend for fan-out on read. The slide mentioned above also has data on how it helped them.
At present, the feed is going to be flat, and the only sort criterion will be the date (recency). We won't be considering relevance or prioritizing posts from closer groups.

The question is fairly abstract, but I will do my best. Based on what you mentioned, I am not sure Solr is really the right tool for the job. You can still use Solr for full-text search, but I am not sure about generating a news feed from it in this scenario. Remember that although Solr is pretty impressive, it is a search engine. For the rest of this post I will assume that you are sticking with Solr, but keep in mind that we are trying to fit a square peg into a round hole.
Here are a few additional points you should think about.
You will probably want to add a timestamp of the post to the data element.
You need to figure out how to properly sort the results. Is it in order of recency? Or based on posts that the user is more likely to interact with?
If a user has 1000+ connections, would he want to see an update from every one of them in the main feed? Or should posts from a closer group of friends show up higher?
Here are some comments about your questions:
1) If you index a person's followers, it may be hard to keep the index up to date. I am assuming followers are going to change often, and re-indexing on every change would not really be practical.
2) That sounds closer to the mark, but again, you need to figure out the sorting. You can get the list of connections for the user, then run a search for the top posts from all of them.
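As a rough illustration of point 2, here is a minimal sketch that builds the request parameters for such a search once the list of followed UIDs has been looked up elsewhere. It assumes the posts also carry a timestamp field, here hypothetically named created_at, and it uses Solr's terms query parser as a filter query so that a long list of UIDs does not have to be written as a chain of ORs. The function name build_feed_query and the rows default are made up for the example, and the UIDs are assumed not to contain commas.

```python
from urllib.parse import urlencode


def build_feed_query(followed_uids, rows=20):
    """Build Solr /select parameters for a flat, recency-sorted feed."""
    if not followed_uids:
        raise ValueError("the user follows nobody, so there is nothing to query")
    return {
        "q": "*:*",
        # The terms query parser takes a comma-separated list of values.
        "fq": "{!terms f=user_field}" + ",".join(followed_uids),
        # created_at is an assumed timestamp field, not part of the original schema.
        "sort": "created_at desc",
        "rows": rows,
        "fl": "user_field,text_field,created_at",
    }


# User A follows B, C, D, E and F.
params = build_feed_query(["B", "C", "D", "E", "F"])
query_string = urlencode(params)  # append this to the core's /select URL
```

Keeping the UID list in a filter query rather than in q means no relevance scoring is involved, which matches a feed that is sorted purely by date.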


Twitter API Standard Search: Can I get hidden replies?

I am trying to get as much data as I can out of the Twitter API for an academic research project. Even though I only have access to the Standard API, the data should be as accurate as possible. I am building myself a "wrapper" around Twarc and other utilities in Python that gets me most of the data I want in just the format I need. A big problem was getting all the replies, but I was able to solve it with a bit of trickery: searching from the tweet in question onwards and then checking whether the tweets in the obtained sample have the original tweet ID in "in_reply_to_tweet_id". Rinse and repeat with those newly obtained tweets.
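The reply-chasing step can be sketched roughly as follows. This is only an outline of the idea, not my actual wrapper: it works on a list of tweet dicts that has already been fetched, the function name collect_replies is made up for the example, and it assumes the reply pointer is stored under the key in_reply_to_status_id.

```python
def collect_replies(root_id, candidates):
    """Return every tweet in candidates that replies to root_id, directly or transitively."""
    known_ids = {root_id}
    replies = []
    remaining = list(candidates)
    found_new = True
    # Repeat until a full pass over the unmatched tweets adds nothing new.
    while found_new:
        found_new = False
        still_unmatched = []
        for tweet in remaining:
            if tweet.get("in_reply_to_status_id") in known_ids:
                known_ids.add(tweet["id"])
                replies.append(tweet)
                found_new = True
            else:
                still_unmatched.append(tweet)
        remaining = still_unmatched
    return replies
```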
Then I noticed the new moderation feature Twitter implemented in March. Now the moderated comments under "More replies" do not show up in my search output.
Example: https://twitter.com/NDRreporter/status/1113353224730365952
I find all replies except the following: under "More replies" ("Mehr Antworten" in German), there is a reply chain started by an extreme right-leaning (possibly troll) account ("#Der Steuerzahler") that got moderated and pushed down there. This chain does not show up in API searches, even if I let the code iterate for over an hour just looking for replies to this particular original tweet.
My question is pretty general: Aside from getting replies as they come in (i.e. before they are moderated) via Filter API, is it possible to find these moderated tweets via the Standard Search API? Not looking for a ready-made solution, general pointers suffice. If I can't find them via Search, then I obviously won't try it with that anymore.
Thanks in advance.

Querying an implicit re-orderable list

I was searching for a way to re-order my records, like blog posts, for instance.
One of the solutions I have found is to have each record reference the previous (or next) one, like in a linked list (https://softwareengineering.stackexchange.com/a/375246). However, this requires the client side (a web service or perhaps a mobile app) to implement the linked-list traversal logic to derive the order.
Is there a way to do this at the database level?
The reason is that if you derive the order on the client side, then even to display only the first 10 records you would have to retrieve all the records anyway.
EDIT
It seems the blog posts example was a very bad example, sorry. I was thinking of blog posts as they are displayed on an admin dashboard, and the user can re-order the position they are displayed by dragging and dropping. Hope this is more clear.
EDIT 2
I guess, generally, what I'm really asking is: how can one implement and query a tree-like structure in SQL?

How to query a thread in SQL

Threads on Twitter consist of tweets connected to each other in a chain, where each tweet points to the tweet it replies to by its id. E.g.
id  text           in_reply_to_status_id
------------------------------------------
1   Hello world    null
2   Hello Twitter  1
3   Hello TL       2
Is it possible to fetch a thread with one SQL query? If not, how does Twitter do it? Or do they store tweets differently from how they're retrieved from their API?
Depending on the database you use, you could use a CONNECT BY clause (Oracle) or a recursive common table expression to automatically unwrap the recursive hierarchy of these tweets.
How Twitter does it is a different question. To serve data to users quickly, Twitter doesn't read it from SQL databases but from NoSQL caches, like Redis (and its variants) and Twemcache. You can start reading about this in the article The Infrastructure Behind Twitter: Scale; the cache section is the one you are looking for.
It is certainly a very interesting topic, but also a very wide one, and trying to understand how Twitter works is a good starting point. This is true for me as well: to be honest, I am not an expert in these technologies, and your question led me to a few things I want to read, so thanks for asking.
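As a small sketch of the single-query approach: CONNECT BY is Oracle syntax, so this example swaps in the more portable WITH RECURSIVE form and runs it against an in-memory SQLite database from Python. The table name tweets and the helper fetch_thread are assumptions made for the example; the rows are the three from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tweets ("
    "id INTEGER PRIMARY KEY, text TEXT, in_reply_to_status_id INTEGER)"
)
conn.executemany(
    "INSERT INTO tweets VALUES (?, ?, ?)",
    [(1, "Hello world", None), (2, "Hello Twitter", 1), (3, "Hello TL", 2)],
)

# Start at the root tweet, then repeatedly join in the tweets replying to
# whatever is already in the thread; depth records the distance from the root.
THREAD_SQL = """
WITH RECURSIVE thread(id, text, depth) AS (
    SELECT id, text, 0 FROM tweets WHERE id = ?
    UNION ALL
    SELECT t.id, t.text, thread.depth + 1
    FROM tweets AS t
    JOIN thread ON t.in_reply_to_status_id = thread.id
)
SELECT id, text FROM thread ORDER BY depth
"""


def fetch_thread(root_id):
    """Return the (id, text) rows of the thread starting at root_id, root first."""
    return conn.execute(THREAD_SQL, (root_id,)).fetchall()
```

Calling fetch_thread(1) walks the whole chain in one statement; if a tweet has several replies, all of them are included, so this returns a reply tree rather than a strict chain.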

Instagram: sort photos with a specific tag with most likes

I'm running a contest on the web where the image with the most likes wins. It's tiresome having to go through 900 images manually, so what I want to do is sort all images with a given tag, let's say #computer, by the number of likes, with the most liked pics on top. I have searched the net like crazy for some program or site that does this (ExtraGram, gramhoot, statigram, webstagram), but none offer to sort by number of likes and it drives me INSANE! It's a really relevant request.
I've tried instafeed.js, but it doesn't include all images; it actually leaves out the ones with the most likes, which defeats the purpose.
There's nothing I know of in the Instagram API that sends back media sorted by likes in advance. I don't think there's a tool to do this either, but writing one is relatively simple IMO and I've done it before for a contest specifically.
The simplest approach is the following:
1. Use the Instagram API (via a library or pure REST) to query by tag. For instance, if you only care about the most recently tagged media or you want to process by date, you can use the /tag/tag-name/media/recent endpoint.
2. Page through each result page by processing the next_max_id/next_max_tag_id.
3. Collect the results locally into a database. You will receive the "like" count for each media item. You will have to update the data if you want to track the likes over time.
4. Sort the results using your database or, if it's a small result set, you could skip #3 and just sort in memory.
5. If you need to refresh the results, you need to subscribe to the tag via the API. You can give Instagram a URL to push updates to, and then you'll have to retrieve one or more media items and update them in your database accordingly.
You will of course need to register your application with Instagram to get an API key if you want to do this. Then you can either send them your client_id or use OAuth.
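The paging and in-memory sorting steps might look roughly like the sketch below. It deliberately leaves out the HTTP call and authentication: fetch_page is a hypothetical callable that you would implement with your client_id or OAuth token, and the response shape used here (a data list, a pagination object with next_max_tag_id, and a likes count per item) is an assumption about the API response that you should check against what you actually receive. FAKE_PAGES only stands in for the API so the example runs on its own.

```python
def collect_media(fetch_page):
    """Follow pagination until no further page is offered; return all media items."""
    media, max_tag_id = [], None
    while True:
        page = fetch_page(max_tag_id)
        media.extend(page["data"])
        max_tag_id = page.get("pagination", {}).get("next_max_tag_id")
        if not max_tag_id:
            return media


def rank_by_likes(media):
    """Sort media items so the most liked come first."""
    return sorted(media, key=lambda item: item["likes"]["count"], reverse=True)


# Stand-in for the real API: maps a max_tag_id to a canned response page.
FAKE_PAGES = {
    None: {
        "data": [{"id": "a", "likes": {"count": 12}}],
        "pagination": {"next_max_tag_id": "p2"},
    },
    "p2": {
        "data": [
            {"id": "b", "likes": {"count": 40}},
            {"id": "c", "likes": {"count": 7}},
        ],
        "pagination": {},
    },
}

ranked = rank_by_likes(collect_media(FAKE_PAGES.get))
```

For a contest with around 900 images this in-memory sort is enough; a database only becomes useful if you want to track like counts over time.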
The best way to achieve this is to pull the photos in and then sort them programmatically based on the numeric likes value. For anyone interested, I've designed a plugin that does this automatically: Instagram Journal

Using APIs to Filter Albums by Years

So I'm working on an application that has a feature that generates a list of 100 or so artists that are similar to those in the user's music catalog using the Echo Nest API. Then, a user can supply a certain year, and, based on the similar artists, the application will return a list of albums that were released in that year.
The only problem is that I have no idea how to filter albums based on year. The Echo Nest API doesn't really do much with albums. The Discogs and Last.fm APIs work with albums, and the Discogs API has data about albums' release dates, but there is no way to filter an initial query by release date. For example, if I have the artist Fleet Foxes and I want only albums released in 2011, there is no option to search for albums by Fleet Foxes restricted to a 2011 release date.
The only option I can really see at this point is iterating over EVERY album an artist has and only adding those albums that meet my specifications. However, this is obviously very heavy on both the APIs and my server, especially considering that many of the artists in the list of 100 similar artists will have no albums that match my criteria and that many artists have well within the range of 100 albums when you take into consideration singles, remixes, etc.
Does anyone see a better way of doing this?
If an API really doesn't have any way to filter by year, then yes, of course you will have to pull down all of the releases and filter them after the fact.
If you think this is a burden on your code and/or their server, you should file a feature request to add the filtering.
However, you should make sure first that they really don't provide such a thing. Most REST APIs separate "fetch" and "search". For example, http://api.example.com/artists/12345/releases may not have any way to filter it, but http://api.example.com/search?type=releases&artist=12345&year=2011 may exist.
Without looking into all of the APIs in detail, a quick check of Discogs' "Run a search query" docs shows that you can include a year criterion in the search (although it looks like maybe you can't actually search by artist ID, just by artist name?).
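To make the two options concrete, here is a small sketch. releases_in_year is the "pull everything down and filter after the fact" fallback, and search_url builds a query string for a search-style endpoint. The parameter names type, artist and year and the "year" key on each release are assumptions for illustration, so check the documentation of whichever API you end up using.

```python
from urllib.parse import urlencode


def releases_in_year(releases, year):
    """Client-side fallback: keep only the releases whose year matches."""
    return [release for release in releases if release.get("year") == year]


def search_url(base_url, artist_name, year):
    """Build a search request that lets the server do the filtering instead."""
    query = urlencode({"type": "release", "artist": artist_name, "year": year})
    return base_url + "?" + query


# Example with made-up data and the placeholder host used above.
catalog = [
    {"title": "Album one", "year": 2008},
    {"title": "Album two", "year": 2011},
    {"title": "Undated single"},
]
matches = releases_in_year(catalog, 2011)
url = search_url("http://api.example.com/search", "Fleet Foxes", 2011)
```

With 100 similar artists, the search form costs one request per artist instead of one request per artist plus one per page of releases, which is where most of the savings would come from.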