How to query a thread in SQL

Threads on Twitter consist of tweets connected to each other in a chain, where a tweet points to another tweet by its id. For example:
id   text            in_reply_to_status_id
--   -------------   ---------------------
1    Hello world     null
2    Hello Twitter   1
3    Hello TL        2
Is it possible to fetch a thread with one SQL query? If not, how does Twitter do it? Or do they store tweets differently from how they're retrieved from their API?

Depending on the database you use, you could use a CONNECT BY clause to automatically unwrap the recursive hierarchy of these tweets.
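CONNECT BY is Oracle's hierarchical-query syntax; most other databases express the same idea with a recursive common table expression. As a rough sketch (assuming a tweets table shaped like the example above, here run from Python against an in-memory SQLite database), walking a thread from its root tweet looks like this:

import sqlite3

# Build the example table from the question.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tweets (
        id INTEGER PRIMARY KEY,
        text TEXT,
        in_reply_to_status_id INTEGER
    );
    INSERT INTO tweets VALUES
        (1, 'Hello world', NULL),
        (2, 'Hello Twitter', 1),
        (3, 'Hello TL', 2);
""")

# Recursive CTE: start at the root tweet and repeatedly join replies onto it.
rows = conn.execute("""
    WITH RECURSIVE thread(id, text, in_reply_to_status_id) AS (
        SELECT id, text, in_reply_to_status_id FROM tweets WHERE id = ?
        UNION ALL
        SELECT t.id, t.text, t.in_reply_to_status_id
        FROM tweets t JOIN thread th ON t.in_reply_to_status_id = th.id
    )
    SELECT id, text FROM thread ORDER BY id
""", (1,)).fetchall()

print(rows)   # [(1, 'Hello world'), (2, 'Hello Twitter'), (3, 'Hello TL')]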
How Twitter does it is a different question. To serve data to users quickly, Twitter doesn't serve tweets straight from SQL databases but from NoSQL caches, like Redis (and its variants) and Twemcache. You can start reading about this in the article The Infrastructure Behind Twitter: Scale; the cache section is the one you are looking for.
It is certainly a very interesting but also very broad topic, and trying to understand how Twitter works is a good starting point. That is true for me as well; to be honest, I am not an expert in these technologies, and your question led me to a few things I want to read, so thanks for asking.

Related

Creating a SOLR index for activity stream or newsfeed

I am trying to index the activity feed of a social portal I am building. The portal allows users to follow each other and get updates from the people they follow as an activity feed sorted by date.
For example, user A will be following users B, C, D, E & F. So user A should see all the posts from B, C, D, E & F on his/her activity feed.
Let's assume a post consists of just two fields.
1. The text of the post. (text_field)
2. The name/UID of the user who posted it. (user_field)
Currently, I am creating an index for all the posts and indexing the text_field & user_field. At scale, there can be 1,000,000+ posts, and a user may follow hundreds if not thousands of users. What will be the best way to create an index for this scenario?
Should I also index a person's followers, so that they can be looked up quickly and then passed to a second query that gets the posts of all those users sorted by date?
What is the best way to query the index of all these posts by passing the UIDs of all the followed users, considering this may be hundreds or more?
Update:
The motivation for using Solr for the news feed was mainly inspired by this detailed slide and my brief discussion with the OpenSocial team.
When starting off with a social portal, fan-out-on-write seems like overkill and more expensive, whereas fan-out-on-read looks better. Both the slide and the OpenSocial team suggested using a search backend for fan-out-on-read. The slide mentioned above also has data on how it helped them.
At present, the feed is going to be flat and the only sort criterion will be the date (recency). We won't be considering relevance or posts from closer groups.
It's kind of abstract, but I will do my best here. Based on what you mentioned, I am not sure Solr is really the right tool for the job. You can still use Solr for full-text search, but I am not sure about generating a news feed from it in this scenario. Remember that although Solr is pretty impressive, it is a search engine. I will assume you stick with Solr for the rest of this post; keep in mind, though, that we are trying to put a square peg through a round hole here.
Here are a few additional questions you should think about.
You will probably want to add the post's timestamp to the data element.
You need to figure out how to properly sort the results. Is it in order of recency? Or based on posts that the user is more likely to interact with?
If a user has 1000+ connections, would he want to see an update from every one of them in the main feed? Or should posts from a closer group of friends show up higher?
Here are some comments about your questions:
1) If you index a person's followers, it may be hard to keep up: followers are going to change often, and re-indexing in this scenario would not really be practical.
2) That sounds more on target, but again, you need to figure out the sorting. You can get the list of connections for the user, then run a single search for the top posts from all of them.
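To make that concrete, here is a rough sketch of such a fan-out-on-read query against Solr, assuming the two fields from the question plus an added post_date timestamp field and a core named "posts" (both assumptions), using Python and the requests library:

import requests

SOLR_SELECT = "http://localhost:8983/solr/posts/select"   # assumed core name

def fetch_feed(followed_uids, rows=50):
    # One query: filter to posts whose author is in the followed set, newest first.
    # For hundreds of UIDs the terms query parser is cheaper than a long OR clause:
    # fq = "{!terms f=user_field}" + ",".join(followed_uids)
    fq = "user_field:(" + " OR ".join(followed_uids) + ")"
    params = {
        "q": "*:*",
        "fq": fq,
        "sort": "post_date desc",    # assumes a stored post_date timestamp field
        "rows": rows,
        "wt": "json",
    }
    resp = requests.get(SOLR_SELECT, params=params)
    resp.raise_for_status()
    return resp.json()["response"]["docs"]

for post in fetch_feed(["userB", "userC", "userD", "userE", "userF"]):
    print(post.get("user_field"), post.get("text_field"))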

Social Tables data model

I've just started looking at the documentation as we are going to need to integrate Salesforce with Social Tables shortly, so I am really new to Social Tables.
Specifically, we will need to sync data between the CRM and Social Tables Events and Guests, and maybe other objects, so it would be very helpful to have a data model or similar to check the relationships and fields available in Social Tables architecture.
I haven't found anything in the documentation. Is there any way to get this, even if only at a high level?
Thanks
Danny
To build an integration with SocialTables you'll have to do a few manual steps; in my experience there is no way to do this completely programmatically. You'll also have to be prepared to contact SocialTables to get the correct guestlist ids. Also keep in mind that the API documentation isn't always correct, and the API logic can be quite difficult to understand at times.
The first thing you need to do is figure out which version of the Venue Mapper you use. You'd want to use the 4.0 API, and as far as I know this version of the API is only supported by Venue Mapper 3.0. I believe Venue Mapper 3.0 is the frontend tool SocialTables provides for venue planning.
In Social Tables an event has two ids, a numerical one and an alpha-numerical one. When you use the 4.0/events endpoint you only get the alpha-numerical event id, and you're going to need the numerical one. The only way I've been able to get the numerical id is to pull it out of the URL when using the Venue Mapper; an example of the URL follows below:
https://plan.socialtables.com/team/{team_id}/event/{event_id}/space/{space_id}
Now you need to get the guestlist id. You can get that by calling the following URL with the numerical event id:
GET https://api.socialtables.com/4.0/diagrams?event={numerical_event_id}
This endpoint returns a JSON structure in which one of the fields is "guestlist_id".
Please be aware that the guestlist id you get from this endpoint might not be the correct one. I struggled quite a bit with this part and ended up with SocialTables sending me the guestlist id by email.
To get the guests in your guestlist use the following api endpoint:
GET https://api.socialtables.com/4.0/guestlists/{guestlist_id}
The {guestlist_id} is an alpha-numerical string similar to: cfdac1c0-yb1d-12e6-84a5-a39e92131645
And by that you should hopefully get access to your guests.
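Putting the steps above together, here is a rough sketch in Python (the endpoints are the ones quoted above; the bearer-token header and the exact JSON shapes are assumptions, so inspect the responses before relying on them):

import requests

API_BASE = "https://api.socialtables.com/4.0"
HEADERS = {"Authorization": "Bearer your-oauth-token"}   # auth mechanism is an assumption

def get_guestlist_id(numerical_event_id):
    # Step 2: look up the guestlist id for an event via the diagrams endpoint.
    resp = requests.get(f"{API_BASE}/diagrams",
                        params={"event": numerical_event_id},
                        headers=HEADERS)
    resp.raise_for_status()
    # The answer above says the response carries a "guestlist_id" parameter;
    # double-check where it sits in the returned JSON.
    return resp.json()["guestlist_id"]

def get_guests(guestlist_id):
    # Step 3: fetch the guestlist itself.
    resp = requests.get(f"{API_BASE}/guestlists/{guestlist_id}", headers=HEADERS)
    resp.raise_for_status()
    return resp.json()

# The numerical event id is the one pulled out of the Venue Mapper URL.
guestlist_id = get_guestlist_id("1234567")
print(get_guests(guestlist_id))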
Hey thanks for using our API.
To answer your question, the best way to see the data model at the moment is to access our developer portal and use the API console to see what is returned. For events, you will need to know the team id of the team you are working with; use the team events endpoint to get access to the event ids.
https://developer.socialtables.com/api-console#!/Events/get_4_0_legacyvm3_teams_team_events
This will return some basic information about each event for that team. You can then request additional details for specific events by using this endpoint:
https://developer.socialtables.com/api-console#!/Events/get_4_0_legacyvm3_events_event
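As a rough sketch, those two console endpoints translate to plain GETs roughly like this (the paths are guessed from the console slugs above, and the auth header and response field names are assumptions, so verify them in the console):

import requests

API_BASE = "https://api.socialtables.com/4.0"
HEADERS = {"Authorization": "Bearer your-oauth-token"}   # auth mechanism is an assumption

# List the events for a team (mirrors the first console link above).
team_id = "your-team-id"
events = requests.get(f"{API_BASE}/legacyvm3/teams/{team_id}/events",
                      headers=HEADERS).json()
print(events)

# Then ask for the details of one event (mirrors the second console link);
# the response shape and the "id" field name are guesses, check the console output.
event_id = events[0]["id"]
details = requests.get(f"{API_BASE}/legacyvm3/events/{event_id}", headers=HEADERS).json()
print(details)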

What is the maximum results returned for YouTube Data API v3 call

Context
I am in the process of providing some consultancy on doing an HTTP GET using the YouTube Data API v3, in order to develop a Windows-based application to GET a list of results from YouTube for, say, a specific category or a specific tag.
We are open to using any programming language (I'm from a C++ background and am hoping YouTube will support direct HTTP connections without using a Google client SDK and so on) to connect to YouTube and GET data. (This would run once a month or so, so YouTube API quotas should not be a problem.)
The Issue
We are being told by some of my client's web developers that YouTube API v3 will only return a maximum of 500 records/results, for, say, a query that returns just the total view count, the video's link, and basic metadata like that.
So, say I wish to find 5,000 results for the category "House music" or "basketball", and the developer key etc. are all set up; would that be possible?
If so, what GET fields would I need to populate (such as "max_results_per_page")?
Thank you.
The API won't provide more than ~500 search results for any arbitrary query. It's by design. Technically, it means that the nextPageToken field won't be returned once you hit ~500 results. No additional parameter can change that.
If you want more than ~500 results for a query, you have to split it into more specific sub-queries. I'd suggest using the publishedAfter and publishedBefore parameters to achieve that, but feel free to experiment with the other ones here.
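A rough sketch of that splitting strategy with plain HTTPS GETs (no client SDK), using the search endpoint's publishedAfter/publishedBefore parameters and following nextPageToken until it stops coming back; the API key, query, and monthly windows are placeholders:

import requests

API_KEY = "your-api-key"
SEARCH_URL = "https://www.googleapis.com/youtube/v3/search"

def search_window(query, published_after, published_before):
    # Collect everything the API will hand back for one publication-date window.
    items, page_token = [], None
    while True:
        params = {
            "part": "snippet",
            "q": query,
            "type": "video",
            "maxResults": 50,                      # API maximum per page
            "publishedAfter": published_after,     # RFC 3339 timestamps
            "publishedBefore": published_before,
            "key": API_KEY,
        }
        if page_token:
            params["pageToken"] = page_token
        data = requests.get(SEARCH_URL, params=params).json()
        items.extend(data.get("items", []))
        page_token = data.get("nextPageToken")     # disappears around the ~500-result cap
        if not page_token:
            return items

# Split one big query into date windows to collect more than ~500 results in total.
results = []
results += search_window("house music", "2015-01-01T00:00:00Z", "2015-02-01T00:00:00Z")
results += search_window("house music", "2015-02-01T00:00:00Z", "2015-03-01T00:00:00Z")
print(len(results))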
This only holds for the search query. Other queries like PlaylistItems.list deliver more results; I have tested retrieving the videos of a playlist with 100,000 items.

Instagram: sort photos with a specific tag with most likes

I'm running a contest on the web where the image with the most likes wins. It's tiresome having to go through 900 images manually, so what I want to do is sort all images with a given tag, let's say #computer, by the number of likes, with the most-liked pics on top. I have searched the net like crazy for some program or site that does this (ExtraGram, gramhoot, statigram, webstagram) but none offer sorting by number of likes, and it drives me INSANE! It's a really relevant request.
I've tried instafeed.js but it doesn't include all images; in fact it leaves out the ones with the most likes, which defeats the purpose.
There's nothing I know of in the Instagram API that sends back media sorted by likes in advance. I don't think there's a tool to do this either, but writing one is relatively simple IMO and I've done it before for a contest specifically.
The simplest thing to do is the following (a rough sketch follows after these steps):
Use the Instagram API (via a library or pure REST) to query by tag. For instance, if you only care about the most recently tagged media or you want to process by date, you can use the /tags/{tag-name}/media/recent endpoint.
Page through each result page by processing the next_max_id/next_max_tag_id.
Collect the results locally into a database. You will receive the "like" count for each media item. You will have to update the data if you want to track the likes over time.
Sort the results using your database or if it's a small result set, you could skip #3 and just sort in memory.
If you need to refresh the results, you need to subscribe to the tag via the API. You can give Instagram a URL to push updates to, and you'll then have to retrieve one or more media items and update them in your database accordingly.
You will of course need to register your application with Instagram to get an API key if you want to do this. Then you can either send them your client_id or use OAuth.
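A rough sketch of steps 1, 2 and 4 against the (since deprecated) v1 tag endpoint named above; the URL, field names, and access token follow the old API's conventions and are assumptions to verify against the current docs:

import requests

ACCESS_TOKEN = "your-access-token"
TAG = "computer"
TAG_RECENT_URL = f"https://api.instagram.com/v1/tags/{TAG}/media/recent"

def collect_tagged_media():
    # Steps 1 and 2: query by tag and page through via next_max_tag_id.
    media, max_tag_id = [], None
    while True:
        params = {"access_token": ACCESS_TOKEN}
        if max_tag_id:
            params["max_tag_id"] = max_tag_id
        data = requests.get(TAG_RECENT_URL, params=params).json()
        media.extend(data.get("data", []))
        max_tag_id = data.get("pagination", {}).get("next_max_tag_id")
        if not max_tag_id:
            return media

# Step 4: a result set this size can simply be sorted in memory by like count.
items = collect_tagged_media()
items.sort(key=lambda m: m["likes"]["count"], reverse=True)
for m in items[:10]:
    print(m["likes"]["count"], m["link"])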
The best way to achieve this is to pull the photos in and then sort them programmatically based on the numeric like count. I've built a plugin that does this automatically, for anyone interested.
Instagram Journal

Twitter REST API consistency with ID placement

Can someone explain to me, in REST terms, Twitter's design decision with the parameter placement in these two calls? It seems that the :id placement is inconsistent and arbitrary (although clearly this was deliberate).
GET statuses/:id/retweeted_by
Show user objects of up to 100 members who retweeted the status.
GET statuses/retweets/:id
Returns up to 100 of the first retweets of a given tweet.
There are other similar examples throughout their API (https://dev.twitter.com/docs/api), so I'm definitely missing something.
Thanks!
Just making guesses here
Someone at Twitter once pointed out that the Twitter API runs on several servlets. I can only assume that this was related - it's easier to map /retweets/* than to map every single combination.
Update: I think that the history of the API itself can also be relevant. Twitter's API hasn't really changed much over the past years, and if it did change then it would be because new features would be added. An endpoint like GET statuses/show/:id is old, while GET statuses/retweets/:id is newer. If Twitter at some point decided to change naming conventions, they couldn't just rename the old ones, since it would break applications.
Another theory of mine is that GET statuses/retweets/:id actually doesn't refer to the Tweet :id itself, but is about the tweets that were based on it. GET statuses/:id/retweeted_by is directly related to the tweet itself, by returning users and not other statuses.
I too am often puzzled by the naming inconsistency. I'm sure they have their reasons though.
I ended up checking with a friend at Twitter, who says:
"I just talked with the guy who originally wrote those two API
endpoints, and he doesn't remember why. To answer your question,
though, there probably isn't a good RESTful reason for that design."