Creating mutually exclusive groupings in SQL (tables with pairs) - sql

Looking for some query structuring help. I have a table with columns for link timestamp, user_id, linked_id, and link type. The link types are, for example, 'email' vs. 'phone number', so in the example below you can see that user 1 is not directly connected to user 3 but is connected via user 2. The other complication is that each linked account also appears as a user_id in its own row, so every link is stored twice, once in each direction (in the example: rows 1+2, rows 3+4).
ex:
Link time        | user id | linked_id | link type
-----------------+---------+-----------+----------
link_occurred_at | user 1  | user 2    | link a
link_occurred_at | user 2  | user 1    | link a
link_occurred_at | user 2  | user 3    | link b
link_occurred_at | user 3  | user 2    | link b
link_occurred_at | user 4  | user 5    | link a
link_occurred_at | user 5  | user 4    | link a
What functions could I use to get the first user id, a count of all the (directly + indirectly) linked accounts, and possibly an array of the linked account ids?
For example the output I would want here is:
initial user | count of linked accounts | array of linked accounts
-------------+--------------------------+-------------------------
user 1       | 2                        | [user 2, user 3]
user 4       | 1                        | [user 5]
This would give me mutually exclusive grouping of all linked networks of accounts.

I didn't know about recursive CTEs until Erwin Brandstetter mentioned them in the comment above. The concept is what it sounds like: a CTE that refers to itself, and has a base case so that recursion terminates. For your problem, a recursive CTE solution might look something like:
WITH RECURSIVE accumulate_users (user_id, linked_accounts) AS (
    -- Base case: the direct links from a user_id.
    SELECT
        user_id,
        ARRAY_AGG(linked_id) AS linked_accounts
    FROM your_table
    GROUP BY user_id
    UNION ALL
    -- Recursive case: transitively linked accounts.
    -- Both branches must return the same columns, so user_id is carried along.
    SELECT
        accumulate_users.user_id,
        ARRAY_UNION(
            accumulate_users.linked_accounts,
            ARRAY_AGG(DISTINCT your_table.linked_id)
        ) AS linked_accounts
    FROM accumulate_users
    JOIN your_table ON CONTAINS(accumulate_users.linked_accounts, your_table.user_id)
    GROUP BY accumulate_users.user_id, accumulate_users.linked_accounts
    -- But there is no enforced termination condition, hopefully it just
    -- ends at some point? This is part of why implementing recursive CTEs
    -- is challenging, I think.
)
SELECT
    user_id,
    CARDINALITY(linked_accounts) AS count_linked_accounts,
    linked_accounts
FROM accumulate_users
But, I haven't been able to test this query, because as detailed in another Stack Overflow Q&A Presto does not support recursive CTEs.
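For engines that do support recursive CTEs, the idea can be tried on the example data. Below is a minimal sketch using Python's built-in sqlite3 module, so it runs on SQLite rather than Presto. The table name `links`, the CTE name `reachable`, and the integer user ids are assumptions made for illustration. It uses UNION instead of UNION ALL so that already-seen rows are dropped, which is what makes the recursion stop, and it labels each group by its smallest user id, which only works because every link is stored in both directions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE links (user_id INTEGER, linked_id INTEGER, link_type TEXT)")
conn.executemany(
    "INSERT INTO links VALUES (?, ?, ?)",
    [(1, 2, "a"), (2, 1, "a"), (2, 3, "b"), (3, 2, "b"), (4, 5, "a"), (5, 4, "a")],
)

query = """
WITH RECURSIVE reachable(root, member) AS (
    -- Base case: every user reaches itself.
    SELECT user_id, user_id FROM links
    UNION
    -- Recursive case: follow one more link from anything already reached.
    -- UNION discards rows seen before, so the recursion terminates.
    SELECT r.root, l.linked_id
    FROM reachable AS r
    JOIN links AS l ON l.user_id = r.member
)
-- Label each member with the smallest user id that can reach it.
SELECT MIN(root) AS initial_user, member
FROM reachable
GROUP BY member
ORDER BY initial_user, member
"""

groups = {}
for initial_user, member in conn.execute(query):
    groups.setdefault(initial_user, [])
    if member != initial_user:
        groups[initial_user].append(member)

for initial_user, linked in groups.items():
    print(initial_user, len(linked), linked)
```

On the six example rows this yields one group rooted at user 1 and one rooted at user 4, matching the output asked for in the question.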
It is possible to traverse an arbitrary, but finite, number of links by repeatedly joining back to the table you have. Something like this; I've included the first_, second_, and third_degree_links columns only for clarity:
SELECT
    yt1.user_id,
    ARRAY_AGG(DISTINCT yt2.user_id) AS first_degree_links,
    ARRAY_AGG(DISTINCT yt3.user_id) AS second_degree_links,
    ARRAY_AGG(DISTINCT yt3.linked_id) AS third_degree_links,
    ARRAY_UNION(
        ARRAY_AGG(DISTINCT yt2.user_id),
        ARRAY_UNION(ARRAY_AGG(DISTINCT yt3.user_id), ARRAY_AGG(DISTINCT yt3.linked_id))
    ) AS up_to_third_degree_links
FROM your_table AS yt1
JOIN your_table AS yt2 ON yt1.linked_id = yt2.user_id
JOIN your_table AS yt3 ON yt2.linked_id = yt3.user_id
GROUP BY yt1.user_id
I've been working with a similar set of data, although mine has the original identifiers (the 'email' and 'phone number' in your example) as part of the raw data set. I found it helpful to create a table that groups user ids by these connecting identifiers:
CREATE TABLE email_connections AS
SELECT
email,
ARRAY_AGG(DISTINCT user_id) AS users
FROM source_table
GROUP BY email
The same arbitrary-but-finite-depth set of links can then be computed by looking for intersections between the user arrays:
SELECT
    3764350 AS user_id,
    -- FLATTEN only concatenates the per-row arrays, so ARRAY_DISTINCT removes repeated user ids.
    ARRAY_DISTINCT(FLATTEN(ARRAY_AGG(ARRAY_UNION(emails1.users, ARRAY_UNION(emails2.users, ARRAY_UNION(emails3.users, emails4.users)))))) AS all_users,
    CARDINALITY(ARRAY_DISTINCT(FLATTEN(ARRAY_AGG(ARRAY_UNION(emails1.users, ARRAY_UNION(emails2.users, ARRAY_UNION(emails3.users, emails4.users))))))) AS count_all_users
FROM email_connections AS emails1
JOIN email_connections AS emails2 ON CARDINALITY(ARRAY_INTERSECT(emails1.users, emails2.users)) > 0
JOIN email_connections AS emails3 ON CARDINALITY(ARRAY_INTERSECT(emails2.users, emails3.users)) > 0
JOIN email_connections AS emails4 ON CARDINALITY(ARRAY_INTERSECT(emails3.users, emails4.users)) > 0
WHERE CONTAINS(emails1.users, 3764350)
GROUP BY 1
Calculating links to an arbitrary depth is a good use case for a graph database technology like Neo4j or JanusGraph. That's what I'm now looking at to address this "user linking" problem.
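If a graph database is more than you need and the engine cannot recurse, one fallback is to export the (user_id, linked_id) pairs and group them in application code. Here is a minimal sketch of a union-find (disjoint-set) pass in Python. The function name `connected_groups` and the integer ids are made up for illustration, and this is not something I ran against the real data.

```python
def connected_groups(pairs):
    """Group ids into mutually exclusive sets, keyed by the smallest id in each set."""
    parent = {}

    def find(x):
        # Return the representative of x's set, shortening the path as we go.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Keep the smaller id as the representative of the merged set.
            parent[max(root_a, root_b)] = min(root_a, root_b)

    groups = {}
    for user in list(parent):
        groups.setdefault(find(user), []).append(user)
    # Report each group as: initial user -> sorted list of the other members.
    return {
        root: sorted(m for m in members if m != root)
        for root, members in groups.items()
    }


pairs = [(1, 2), (2, 1), (2, 3), (3, 2), (4, 5), (5, 4)]
result = connected_groups(pairs)
for initial_user, linked in sorted(result.items()):
    print(initial_user, len(linked), linked)
```

Unlike the fixed-depth joins, this follows chains of any length, and the duplicate reversed rows in the source table do no harm because merging an already-merged pair is a no-op.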

Related

How to select users for whom one type of event occurred before another in PostgreSQL?

This is an example of the data structure in my SQL table.
In fact I have many users in my table, and some of them have an incorrect order of steps (user number 2 in the picture). How can I select all such users? The logic is to select all users whose sign_in date is earlier than their registration date. I suppose a regular WHERE clause won't work here. Maybe there is a special function for such cases?
I can see two approaches to solve the problem. For reference, this is how I imagine the table might look:
create table users (
user_id int,
action text,
date decimal
);
Use a self join. In this we're basically fetching the records with 'registration' action and adding a self join on matching user_id and 'sign_in' action. Because of the join, the data for each of the action is now available in the same row so this allows you to compare in the where clause
select u1.*
from users u1
join users u2 on u1.user_id = u2.user_id and u2.action = 'sign_in'
where u1.action = 'registration' and u2.date < u1.date;
Use crosstab* function of postgres. This allows you to transpose rows into columns hence gives the ability to compare in the where clause. Personally I think this is more elegant and extensive in the sense that it'll allow you to make other comparisons as well if needed without adding another join. Looking at the cost using "explain", this comes out to be more efficient as well.
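SELECT *
FROM crosstab(
    'select user_id, action, date
     from users
     order by user_id, action',
    -- Explicit category list: a user with a missing action gets NULL in that
    -- column instead of having the remaining values shifted to the left.
    $$values ('del_account'), ('registration'), ('sign_in')$$
) AS ct(user_id int, del_account decimal, registration decimal, sign_in decimal)
where sign_in < registration;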
*Note: In order to use crosstab however you may need superuser access to the database to create the extension. You can do so by running the following query only once
CREATE EXTENSION IF NOT EXISTS tablefunc;
Hope this helps. Let me know in the comments if there's any confusion
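If installing the tablefunc extension is not an option, the same row-to-column comparison can be done with conditional aggregation, which is a different technique from crosstab but needs no extension. A minimal sketch, run here on SQLite through Python's sqlite3 module purely so it is self-contained; the sample rows are invented, and the table layout is the one imagined above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (user_id INTEGER, action TEXT, date REAL)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?, ?)",
    [
        (1, "registration", 1), (1, "sign_in", 2), (1, "del_account", 3),
        (2, "sign_in", 1), (2, "registration", 2),  # wrong order
        (3, "registration", 5),                      # never signed in
    ],
)

query = """
SELECT user_id
FROM users
GROUP BY user_id
-- Each CASE picks out one action's date; a missing action yields NULL,
-- and a comparison with NULL is not true, so such users are skipped.
HAVING MIN(CASE WHEN action = 'sign_in' THEN date END)
     < MIN(CASE WHEN action = 'registration' THEN date END)
ORDER BY user_id
"""

out_of_order = [row[0] for row in conn.execute(query)]
print(out_of_order)
```

Only user 2 is returned: user 1 is in the right order, and user 3 has no sign_in row to compare.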
Your question is a bit vague yet the problem is generic enough.
First let's make your actions comparable and sortable in the right sequence, for example '1.registration', '2.sign_in', '3.del_account' instead of 'registration', 'sign_in', 'del_account'. Even better, use action codes, 2 for sign_in, 1 for registration etc.
Then you can detect misplaced actions and select the list of distinct user_id-s who did them.
select distinct user_id from
(
select user_id,
action > lead(action) over (partition by user_id order by "date") as misplaced
from the_table
) as t
where misplaced;
This approach would work for any number of action steps, not only 3.
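To see the lead() comparison at work, here is a small sketch on invented rows, again using Python's sqlite3 module so it is runnable on its own. It assumes an SQLite build with window function support (3.25 or later) and uses the prefixed action names suggested above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE the_table (user_id INTEGER, action TEXT, "date" REAL)')
conn.executemany(
    "INSERT INTO the_table VALUES (?, ?, ?)",
    [
        (1, "1.registration", 1), (1, "2.sign_in", 2), (1, "3.del_account", 3),
        (2, "2.sign_in", 1), (2, "1.registration", 2), (2, "3.del_account", 3),
    ],
)

query = """
SELECT DISTINCT user_id
FROM (
    SELECT user_id,
           -- True when this action sorts after the one that follows it in time.
           action > LEAD(action) OVER (PARTITION BY user_id ORDER BY "date") AS misplaced
    FROM the_table
) AS t
WHERE misplaced
ORDER BY user_id
"""

misplaced_users = [row[0] for row in conn.execute(query)]
print(misplaced_users)
```

User 2 is flagged because '2.sign_in' is followed in time by '1.registration'; the last row of each user compares against NULL and is never flagged.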
If you map the action column to its expected step number with a CASE expression, you can compare it with the actual position of each row by date and so find users whose sign_in comes before registration:
https://dbfiddle.uk/?rdbms=postgres_9.6&fiddle=1e112d51825f5d3185e445d97d4e9c78
select *
from (
    select row_number() over (partition by user_id order by date) as udid,
           case when action = 'registration' then 1
                when action = 'sign_in' then 2
                when action = 'delete' then 3
                else 4
           end as stsord,
           *
    from duptuser
) as drt
where stsord != udid

PostgreSQL - Best approach for summarize data

We have the following data in our system:
User data
Experience
Education
Job Application
This data will be used across the application, and some logic is attached to it.
To make sure the data is consistent across the application, I thought I would create a view that holds the counts of this data and then use that view in different places.
The question is: since the detail tables have no relation to each other, how should I create the view?
Create a different view for each table and then use GROUP BY
Create one view and write subqueries to get the data
From a performance perspective, which one is the best approach?
For e.g.
SELECT
UserId,
COUNT(*) AS ExperienceCount,
0 AS EducationCount
FROM User
INNER JOIN Experience ON user_id = User_Id
GROUP BY
UserId
UNION ALL
SELECT
UserId,
0,
COUNT(*)
FROM User
INNER JOIN Education ON user_id = user_id
GROUP BY
UserId
And then group by this to get summary of all these data in one row per user.
One way to write the query that you have specified would probably be:
SELECT UserId, SUM(ExperienceCount) AS ExperienceCount, SUM(EducationCount) AS EducationCount
FROM ((SELECT UserId, COUNT(*) AS ExperienceCount, 0 AS EducationCount
       FROM Experience
       GROUP BY UserId
      ) UNION ALL
      (SELECT UserId, 0, COUNT(*)
       FROM Education
       GROUP BY UserId
      )
     ) u
GROUP BY UserId;
This can also be written as a FULL JOIN, as a LEFT JOIN, or using correlated subqueries. Each of these can be appropriate in different circumstances, depending on your data.
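As an illustration of the LEFT JOIN variant, here is a minimal sketch that aggregates each detail table first and then joins the counts to the user table, so users with no rows still appear with zeros. It runs on SQLite via Python's sqlite3 module only to be self-contained; the table and column names (`app_user`, `experience`, `education`, `user_id`) and the sample rows are assumptions, not the asker's real schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE app_user (user_id INTEGER PRIMARY KEY);
CREATE TABLE experience (user_id INTEGER);
CREATE TABLE education (user_id INTEGER);
INSERT INTO app_user VALUES (1), (2), (3);
INSERT INTO experience VALUES (1), (1), (2);
INSERT INTO education VALUES (1);
""")

query = """
SELECT u.user_id,
       COALESCE(ex.n, 0) AS experience_count,
       COALESCE(ed.n, 0) AS education_count
FROM app_user AS u
-- Aggregate each detail table on its own before joining, so the
-- two unrelated tables never multiply each other's rows.
LEFT JOIN (SELECT user_id, COUNT(*) AS n FROM experience GROUP BY user_id) AS ex
       ON ex.user_id = u.user_id
LEFT JOIN (SELECT user_id, COUNT(*) AS n FROM education GROUP BY user_id) AS ed
       ON ed.user_id = u.user_id
ORDER BY u.user_id
"""

summary = conn.execute(query).fetchall()
for row in summary:
    print(row)
```

The same SELECT could serve as the body of a single view; whether it beats the UNION ALL form depends on the data and should be checked with the planner.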

SQL - I need to see how many users are associated with a specific set of ids

I'm trying to identify a list of users that all have the same set of IDs from another table.
I have users 1, 2, 3, and 4, all that can have multiple IDs from the list A, B, C, and D. I need to see how many users from list one have ONLY 3 IDs, and those three IDs must match (so how many users from list one have ONLY A, B, and C, but not D).
I can identify which users have which IDs, but I can't quite get how to get how many users specifically have a specific set of them
Here is the SQL that I'm using where the counts just aren't looking correct. I've identified that there are about 7k users with exactly 16 IDs (of any type), but when I try to use this sql to get a count of a specific set of 16, the count I get is 15k.
select
count(user_id)
from
(
SELECT
user_id
FROM user_id_type
where user_id_type not in ('1','2','3','4','5')
GROUP BY user_id
HAVING COUNT(user_id_type)='16'
)
So you want users with 3 IDs as long as one of the IDs is not D. How about;
select user
from table
group by user
having count(*) = 3 and max(ID) <> 'D'
The HAVING clause is useful in situations like this. This approach will work as long as the excluded ID is the max (or an easy change for min).
Following your comment, if the min/max(ID) approach isn't viable then you could use NOT IN;
select user
from table
where user not in (select user from table where ID = 'D')
group by user
having count(*) = 3
Following the updated question, if I've understood the mapping between the initial example and reality correctly then the query should be something like this;
SELECT user_id
FROM user_id_type
WHERE user_id not in (select user_id from user_id_type where user_id_type in ('1','2','3','4','5'))
GROUP BY user_id
HAVING COUNT(user_id_type)='16'
What is odd is that you appear to have both a table and a column in the table with the same name 'user_id_type'. This isn't the clearest of designs.
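For the original wording of the question (users who have ONLY A, B, and C), another way to phrase it is a single HAVING clause that checks both the count and the absence of anything outside the wanted set. This is an alternative sketch, not the query from the answer above; the table name `user_ids`, the column `id`, and the rows are invented, and it runs on SQLite via Python's sqlite3 module just so it can be tried directly.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user_ids (user_id INTEGER, id TEXT)")
conn.executemany(
    "INSERT INTO user_ids VALUES (?, ?)",
    [
        (1, "A"), (1, "B"), (1, "C"),
        (2, "A"), (2, "B"), (2, "C"), (2, "D"),
        (3, "A"), (3, "B"),
        (4, "A"), (4, "B"), (4, "C"),
    ],
)

query = """
SELECT user_id
FROM user_ids
GROUP BY user_id
-- Exactly three distinct ids, and none of them outside the wanted set.
HAVING COUNT(DISTINCT id) = 3
   AND SUM(CASE WHEN id IN ('A', 'B', 'C') THEN 0 ELSE 1 END) = 0
ORDER BY user_id
"""

exact_match_users = [row[0] for row in conn.execute(query)]
print(exact_match_users, len(exact_match_users))
```

User 2 is excluded for also having D and user 3 for having only two ids, so the count of matching users is simply the length of the result.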

SQL query with OR condition on 2 joined tables

I have a projects table, a users table, a write_permissions table and a read_permissions table. Both read_permissions and write_permissions have 2 columns: project_id and user_id (this is a purposely contrived example, I'm not looking for alternative table settings).
I need for a given user to find all the projects on which he has a write permission or a read permission.
For instance for a user with write permissions on projects A and B, and read permission only on project C, and no permissions for project D, I need to write a query that returns the projects A, B and C.
The query may need to take additional JOIN clauses. For instance I may have a categories table, and a projects_categories table with columns projects_id and user_id, and may want to find all the projects on which a user has write permission and read permission, and that belongs to a given category.
SELECT p.*
FROM (
SELECT project_id
FROM write_permissions
WHERE user_id = 1
UNION
SELECT project_id
FROM read_permissions
WHERE user_id = 1
) sub
JOIN projects p USING (project_id);
UNION without ALL automatically folds duplicates in the result.
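The difference is easy to check on a few rows. A minimal sketch with Python's sqlite3 module; the sample permissions are made up, and project A is deliberately given both a write and a read permission so that a duplicate can occur.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE write_permissions (project_id TEXT, user_id INTEGER);
CREATE TABLE read_permissions (project_id TEXT, user_id INTEGER);
INSERT INTO write_permissions VALUES ('A', 1), ('B', 1);
INSERT INTO read_permissions VALUES ('A', 1), ('C', 1), ('D', 2);
""")

template = """
SELECT project_id FROM write_permissions WHERE user_id = 1
{op}
SELECT project_id FROM read_permissions WHERE user_id = 1
ORDER BY project_id
"""

# UNION removes duplicate rows; UNION ALL keeps every row from both branches.
with_union = [r[0] for r in conn.execute(template.format(op="UNION"))]
with_union_all = [r[0] for r in conn.execute(template.format(op="UNION ALL"))]

print(with_union)
print(with_union_all)
```

With UNION each project appears once, while UNION ALL returns project A twice, which would also duplicate the joined rows from projects.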
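SELECT [project_id] FROM [read_permissions] WHERE [user_id] = @user_id
UNION
SELECT [project_id] FROM [write_permissions] WHERE [user_id] = @user_id
This gives you the projects on which the user has read or write permission; UNION (rather than UNION ALL) keeps a project from appearing twice when the user has both.
You can then use its result, for example by joining it to the projects table.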

Query to ORDER BY the number of rows returned from another SELECT

I'm trying to wrap my head around SQL and I need some help figuring out how to do the following query in PostgreSQL 9.3.
I have a users table, and a friends table that lists user IDs and the user IDs of friends in multiple rows.
I would like to query the user table, and ORDER BY the number of mutual friends in common to a user ID.
So, the friends table would look like:
user_id | friend_user_id
1 | 4
1 | 5
2 | 10
3 | 7
And so on, so user 1 lists 4 and 5 as friends, and user 2 lists 10 as a friend, so I want to sort by the highest count of user 1 in friend_user_id for the result of user_id in the select.
The Postgres way to do this:
SELECT *
FROM users u
LEFT JOIN (
    SELECT user_id, count(*) AS friends
    FROM friends
    GROUP BY user_id
) f USING (user_id)
ORDER BY f.friends DESC NULLS LAST, user_id -- as tiebreaker
The keyword AS is just noise for table aliases. But don't omit it from column aliases. The manual on "Omitting the AS Key Word":
In FROM items, both the standard and PostgreSQL allow AS to be omitted
before an alias that is an unreserved keyword. But this is impractical
for output column names, because of syntactic ambiguities.
Bold emphasis mine.
ISNULL() is a custom extension of MySQL or SQL Server. Postgres uses the SQL-standard function COALESCE(). But you don't need either here. Use the NULLS LAST clause instead, which is faster and cleaner. See:
PostgreSQL sort by datetime asc, null first?
Multiple users will have the same number of friends. These peers would be sorted arbitrarily. Repeated execution might yield different sort order, which is typically not desirable. Add more expressions to ORDER BY as tiebreaker. Ultimately, the primary key resolves any remaining ambiguity.
If the two tables share the same column name user_id (like they should) you can use the syntax shortcut USING in the join clause. Another standard SQL feature. Welcome side effect: user_id is only listed once in the output for SELECT *, as opposed to when joining with ON. Many clients wouldn't even accept duplicate column names in the output.
Something like this?
SELECT * FROM [users] u
LEFT JOIN (SELECT user_id, COUNT(*) friends FROM friends GROUP BY user_id) f
ON u.user_id = f.user_id
ORDER BY ISNULL(f.friends,0) DESC