design the feature in LinkedIn where it computes how many hops there are between you and another person? - sql

This was questioned in quora, and I'm really interested in the most efficient answer.
I think one way is a sql db or a document based db
get all your connections, then all your connection's connections, and
then theirs, and check if the person you are looking at is somewhere
in that list.
Taking into account an average of 500 connections per person, and an indexed by user_id db, this will be 3 queries max to the db.
I am interested in a graph db solution which I know very little about to see this feature can be greatly improved.

In a graph database such a task is a typical path finding task. It is solved using algorithms built into the system. For example in Neo4j: http://neo4j.com/docs/stable/rest-api-graph-algos.html
When you find the shortest path between persons you can easily calculate number of edges (hops in terms of your question) between them.
Graph databases has a great advantage in such kind of tasks over relational or key-value databases as they can use effective graph algorithms such as Dijkstra's algorithm.

Related

In MongoDB, if my queries do not involve any joins, can I assume that it will scale?

I have an APP that will be demanding in terms of pulling data. Each time a user logs in, data is pulled, each time a new page is visited data is pulled, etc.
Let's suppose that these queries will never involve joins.
Can I assume then that the queries will scale?
No, it does not follow that using MongoDB and not using joins means "your queries will scale." That's a myth told by MongoDB marketing, not real software engineering.
It depends what your query is doing. Every query has a cost, no matter what brand of datastore you use. Every data access needs to use resources on the server, and that resource usage adds up. Do you queries scan thousands or millions of documents in the MongoDB datastore? Do they need to do map-reduce? How many documents are in the query response? Is it pulling data that is cached, or will it cost I/O overhead to pull that data? How many requests per second do you need to serve? Can MongoDB support the rate of queries you need to do? Are you configuring a MongoDB replica set or a sharded cluster? How many shards do you queries need to visit to get their result? How powerful are the servers hosting each node?
These are some examples of the types of questions you need to understand and analyze for your queries and your MongoDB cluster (the list is not complete).
You don't need to give me the answers to these questions. I'm just using them to illustrate why it's a naive question to ask "will it scale?"
It's like asking "I'm need to drive my car to my brother's house, will I have to refill my fuel tank?" That's not enough information to answer the question. How far away is your brother's house? What type of vehicle do you have? What is its fuel efficiency? Is your vehicle laden with a lot of heavy cargo? How many times do you need to make the trip? How fast are you driving? How rough are the roads on the route?
There are probably many things to consider depending on your needs but i think the main difference comes from the document data model (that MongoDB is made to support and scale on)
Document => more related data in 1 place
fewer joins (expensive especially if data are in different machines)
fewer transactions (single document updates are atomic)
simpler smaller schema, more tailored to your application
data model, similar to the way programmers save their data on
objects(maps)/arrays
If you have many applications or too many different ways to access the same data, maybe you end up normalizing more your data to a more general data representation => losing some of the above benefits or duplicating some of your data to serve the different needs.

Data model guidance, database choice for aggregations on changing filter criteria

Problem:
We are looking for some guidance on what database to use and how to model our data to efficiently query for aggregated statistics as well as statistics related to a specific entity.
We have different underlying data but this example should showcase the fundamental problem:
Let's say you have data of Facebook friend requests and interactions over time. You now would like to answer questions like the following:
In 2018 which American had the most German friends that like ACDC?
Which are the friends that person X most interacted with on topic Y?
The general problem is that we have a lot of changing filter criteria (country, topic, interests, time) on both the entities that we want to calculate statistics for and the relevant related entities to calculate these statistics on.
Non-Functional Requirements:
It is an offline use-case, meaning there are no inserts, deletes or
updates happening, instead every X weeks a new complete dump is imported to replace the old data.
We would like to have an upper bound of 10 seconds
to answer our queries. The faster the better max 2 seconds for queries would be great.
The actual data has around 100-200 million entries, growth rate is linear.
The system has to serve a limited amount of concurrent users, max 100.
Questions:
What would be the right database technology or mixture of technologies to solve our problem?
What would be an efficient data model for computing aggregations with changing filter criteria in several dimensions?
(Bonus) What would be the estimated hardware requirements given a specific technology?
What we tried so far:
Setting up a document store with denormalized entries. Problem: It doesn't perform well on general queries because it has to scan too many entries for aggregations.
Setting up a graph database with normalized entries. Problem: performs even more poorly on aggregations.
You talk about which database to use, but it sounds like you need a data warehouse or business intelligence solution, not just a database.
The difference (in a nutshell) is that a data warehouse (DW) can support multiple reporting views, custom data models, and/or pre-aggregations which can allow you to do advanced analysis and detailed filtering. Data warehouses tend to hold a lot of data and are generally built to be very scalable and flexible (in terms of how the data will be used). For more details on the difference between a DW and database, check out this article.
A business intelligence (BI) tool is a "lighter" version of a data warehouse, where the goal is to answer specific data questions extremely rapidly and without heavy technical end-user knowledge. BI tools provide a lot of visualization functionality (easy to configure graphs and filters). BI tools are often used together with a data warehouse: The data is modeled, cleaned, and stored inside of the warehouse, and the BI tool pulls the prepared data into specific visualizations or reports. However many companies (particularly smaller companies) do use BI tools without a data warehouse.
Now, there's the question of which data warehouse and/or BI solution to use.
That's a whole topic of its own & well beyond the scope of what I write here, but here are a few popular tool names to help you get started: Tableau, PowerBI, Domo, Snowflake, Redshift, etc.
Lastly, there's the data modeling piece of it.
To summarize your requirements, you have "lots of changing filter criteria" and varied statistics that you'll need, for a variety of entities.
The data model inside of a DW would often use a star, snowflake, or data vault schema. (There are plenty of articles online explaining those.) If you're using purely BI tool, you can de-normalize the data into a combined dataset, which would allow you a variety of filtering & calculation options, while still maintaining high performance and speed.
Let's look at the example you gave:
Data of Facebook friend requests and interactions over time. You need to answer:
In 2018 which American had the most German friends that like ACDC?
Which are the friends that person X most interacted with on topic Y?
You want to filter/re-calculate the answers to those questions based on country, topic, interests, time.
One potential dataset can be structured like:
Date of Interaction | Initiating Person's Country | Responding Person's Country | Topic | Interaction Type | Initiating Person's Top Interest | Responding Person's Top Interest
This would allow you to easily count the amount of interactions, grouped and/or filtered by any of those columns.
As you can tell, this is just scratching the surface of a massive topic, but what you're asking is definitely do-able & hopefully this post will help you get started. There are plenty of consulting companies who would be happy to help, as well. (Disclaimer: I work for one of those consulting companies :)

How to mix RDMS DB with a Graph DB

I am developing a website using Django, and PostgreSQL which would seemingly have huge amount of data as gathered in social network sites.
I need to use RDMS with SQL for tabular data for less SQL complexity and also Graph DB with Cipher for large data for high query complexity.
Please let me know how to go about this. Also please let me know whether it is feasible.
EDIT: Clarity as asked in Comments:-
The database structure can be similar to that of a social network like Facebook. I've checked FB Engineering page for their open graph. For graph DB I can find only Neo4J graph DB with proper ACID values though I would prefer an open source graph DB. Graph DB structure, I require basically for summary of huge volume data pertaining to relationships like friends, updates, daily user related updates as individual relations. Horizontal Scalability is important for future up gradation to me.
I intend to use PostgreSQL for base informational data and push the relational data updates to graph DB like Facebook uses both MySql and open graph.
Based on your reply to my queries. I would first suggest looking at TitanDB. I believe it fulfills many of your requirements:
It is open source.
It scales horizontally.
In addition to meeting your requirements it has existed for quite sometime and many companies are using it in Production. The only thing you would have to get used to is that it uses TinkerPop traversals, not Cypher queries. Also note that I believe Titan is not ACID for most backends. This is a result of it being horizontally scalable.
If you would like a more structured (but significantly less mature) approach to Graph DBs then you can look at the stack that myself and some colleagues are working on MindmapsDB which sits on top of Titan, but uses a more "sql-like" query language.
OrientDB Gremlin is also a very good option but lacks the maturity and support of Titan.
There are many other graph vendors out there such as DSE Graph, IBM Graph, etc . . . but the ones I have listed above are the opensource ones I have worked with.

Database Normalisation and Searching it Quickly

I'm working on the technical architecture for a content solution integration. The data from the solution provider runs to millions of rows and normalised to 3NF. It is updated on a regular schedule (daily most likely) and its data is split down to a very granular level of atomicity.
I need to search and query this data and my current inclination is to leave the normalised data alone and create a denormalised database from its data (OLAP to OLTP). The 'transfer' can be a custom built program that can contain the necessary business logic in addition to the raw copying power and be run at a set schedule as required. The denormalised database would then reduce the atomicity and allow the keyword searches and queries to run efficiently. I was looking at using Lucene .NET for the keyword work on the denormalised database.
So before I sing loudly from the hills that this is the way forward, I wanted some expert opinion on this and what is the perceived "best practise". Is the method I have suggested the best way forward considering the data I will be provided? It was suggested that perhaps I could use a 'search engine' to search the normalised data. This scared the hell out of me, but raised the question; what search engine and how?
Opinions, flames, bad language and help appreciated :)
I have built reporting databases and data warehouses based on data stored in normalized form. There is quite a bit of work involved in the transfer program (ETL). Given your description of the data feed, maybe some of that work has been done for you by the feeder.
Millions of rows isn't a lot, these days. You may be able to get away with report oriented views into the existing database. Try it and see.
The biggest benefit to building an OLAP oriented database is not speed. It's flexibility. "We love this report, but now we want to see it weekly and quarterly instead of monthly. Bam! Done!" "Can you break it down by marketing category instead of manufacturing category? Bam! Done!" And so on.
A resonably normalized model (3NF/BCNF) provides the best average performance and the least amount of modification anomalies for the largest number of scenarios. That's big, so I would start from there. As your requirements are fuzzy, it's seems like the most sensible option.
Actually, the most sensible thing would be to go over the requirements until they are a bit more "crisp" ;)
Also, if you could get your hands on a few early extracts from your data provider you could experiment with it and get a feeling for the data distributions (not all people live in one country, and some countries holds more people than others. Not all people have children, and the number children per person is vastly different depending on the country). This is a major point and it is crucial that the optimizer can make good decisions.
Other than that, I agree with everything Walter said and also gave him my vote.

In terms of today's technology, are these meaningful concerns about data size?

We're adding extra login information to an existing database record on the order of 3.85KB per login.
There are two concerns about this:
1) Is this too much on-the-wire data added per login?
2) Is this too much extra data we're storing in the database per login?
Given todays technology, are these valid concerns?
Background:
We don't have concrete usage figures, but we average about 5,000 logins per month. We hope to scale to larger customers, howerver, still in the 10's of 1000's per month, not 1000's per second.
In the US (our market) broadband has 60% market adoption.
Assuming you have ~80,000 logins per month, you would be adding ~ 3.75 GB per YEAR to your database table.
If you are using a decent RDBMS like MySQL, PostgreSQL, SQLServer, Oracle, etc... this is a laughable amount of data and traffic. After several years, you might want to start looking at archiving some of it. But by then, who knows what the application will look like?
It's always important to consider how you are going to be querying this data, so that you don't run into performance bottlenecks. Without those details, I cannot comment very usefully on that aspect.
But to answer your concern, do not be concerned. Just always keep thinking ahead.
How many users do you have? How often do they have to log in? Are they likely to be on fast connections, or damp pieces of string? Do you mean you're really adding 3.85K per time someone logs in, or per user account? How long do you have to store the data? What benefit does it give you? How does it compare with the amount of data you're already storing? (i.e. is most of your data going to be due to this new part, or will it be a drop in the ocean?)
In short - this is a very context-sensitive question :)
Given that storage and hardware are SOOO cheap these days (relatively speaking of course) this should not be a concern. Obviously if you need the data then you need the data! You can use replication to several locations so that the added data doesn't need to move over the wire as far (such as a server on the west coast and the east coast). You can manage your data by separating it by state to minimize the size of your tables (similar to what banks do, choose state as part of the login process so that they look to the right data store). You can use horizontal partitioning to minimize the number or records per table to keep your queries speedy. Lots of ways to keep large data optimized. Also check into Lucene if you plan to do lots of reads to this data.
In terms of today's average server technology it's not a problem. In terms of your server technology it could be a problem. You need to provide more info.
In terms of storage, this is peanuts, although you want to eventually archive or throw out old data.
In terms of network (?) traffic, this is not much on the server end, but it will affect the speed at which your website appears to load and function for a good portion of customers. Although many have broadband, someone somewhere will try it on edge or modem or while using bit torrent heavily, your site will appear slow or malfunction altogether and you'll get loud complaints all over the web. Does it matter? If your users really need your service, they can surely wait, if you are developing new twitter the page load time increase is hardly acceptable.