Postgres: Aggregate accounts into a single identity by common email address

I'm building a directory of users, where:
each user can have an account on one or more external services, and
each of these accounts can have one or more email addresses.
What I want to know is, how can I aggregate these accounts into single identities through common email addresses?
For example, let's say I have two services, A and B. For each service, I have a table that relates an account to one or more email addresses.
So if service A has these account email addresses:
account_id | email_address
-----------|--------------
1 | a@foo.com
1 | b@foo.com
2 | c@foo.com
and service B has these account email addresses:
account_id | email_address
-----------|--------------
3 | a@foo.com
3 | a@bar.com
4 | d@foo.com
I'd like to create a table that aggregates the email addresses of these accounts into a single user identity:
user_id | email_address
--------|--------------
X | a@foo.com
X | b@foo.com
X | a@bar.com
Y | c@foo.com
Z | d@foo.com
As you can see, account 1 from service A and account 3 from service B have been merged into a common user X, based on the shared email address a@foo.com.
The closest answer I could find is this one, and I suspect the solution is a recursive CTE, but given the inputs and engine are different I'm having trouble implementing it.
Clarification: I'm looking for a solution that handles an arbitrary number of services, so perhaps the input table might be better off as:
service_id | account_id | email_address
-----------|------------|--------------
A | 1 | a@foo.com
A | 1 | b@foo.com
A | 2 | c@foo.com
B | 3 | a@foo.com
B | 3 | a@bar.com
B | 4 | d@foo.com
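For concreteness, this is the rough recursive shape I imagine, though I'm not sure it's correct or efficient (an untested sketch, assuming everything lives in a single accounts(service_id, account_id, email_address) table as above):
WITH RECURSIVE edges AS (
    -- two accounts are linked when they share at least one email address;
    -- the self-join also yields the symmetric and self-referencing pairs
    SELECT DISTINCT
        x.service_id || ':' || x.account_id AS src,
        y.service_id || ':' || y.account_id AS dst
    FROM accounts x
    JOIN accounts y USING (email_address)
), reach AS (
    -- transitive closure; UNION (not UNION ALL) discards duplicate rows,
    -- which also terminates the recursion on cyclic link graphs
    SELECT src, dst FROM edges
    UNION
    SELECT r.src, e.dst
    FROM reach r
    JOIN edges e ON e.src = r.dst
)
-- label every account with the smallest key it can reach, then map back to emails
SELECT DISTINCT
    l.user_id,
    a.email_address
FROM accounts a
JOIN (
    SELECT src AS node, min(dst) AS user_id
    FROM reach
    GROUP BY src
) l ON l.node = a.service_id || ':' || a.account_id;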

demo1: db<>fiddle, demo2: db<>fiddle
WITH combined AS (
    SELECT
        a.email as a_email,
        b.email as b_email,
        array_remove(ARRAY[a.id, b.id], NULL) as ids
    FROM a
    FULL OUTER JOIN b ON (a.email = b.email)
), clustered AS (
    SELECT DISTINCT
        ids
    FROM (
        SELECT DISTINCT ON (unnest_ids)
            *,
            unnest(ids) as unnest_ids
        FROM combined
        ORDER BY unnest_ids, array_length(ids, 1) DESC
    ) s
)
SELECT DISTINCT
    new_id,
    unnest(array_cat) as email
FROM (
    SELECT
        array_cat(
            array_agg(a_email) FILTER (WHERE a_email IS NOT NULL),
            array_agg(b_email) FILTER (WHERE b_email IS NOT NULL)
        ),
        row_number() OVER () as new_id
    FROM combined co
    JOIN clustered cl ON co.ids <@ cl.ids
    GROUP BY cl.ids
) s
Step-by-step explanation:
For the explanation I'll use this dataset, which is a little more complex than yours and illustrates the steps better; some of the problems don't occur in your smaller set. Think of the single characters as placeholders for email addresses.
Table A:
| id | email |
|----|-------|
| 1 | a |
| 1 | b |
| 2 | c |
| 5 | e |
Table B
| id | email |
|----|-------|
| 3 | a |
| 3 | d |
| 4 | e |
| 4 | f |
| 3 | b |
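To follow along, the walk-through tables can be recreated with this minimal setup matching the values above:
CREATE TABLE a (id int, email text);
CREATE TABLE b (id int, email text);
INSERT INTO a (id, email) VALUES (1, 'a'), (1, 'b'), (2, 'c'), (5, 'e');
INSERT INTO b (id, email) VALUES (3, 'a'), (3, 'd'), (4, 'e'), (4, 'f'), (3, 'b');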
CTE combined:
Both tables are joined on equal email addresses to find the touch points. The ids of the matching rows are collected into one array (array_remove drops the NULL that the missing side of the FULL OUTER JOIN contributes):
| a_email | b_email | ids |
|---------|---------|-----|
| a       | a       | 1,3 |
| b       | b       | 1,3 |
| c       | (null)  | 2   |
| (null)  | d       | 3   |
| e       | e       | 5,4 |
| (null)  | f       | 4   |
CTE clustered (sorry for the names...):
The goal is to get every account id into exactly one array. In combined you can see that, for example, there are currently multiple arrays containing the element 4: {5,4} and {4}.
First the rows are ordered by the length of their ids arrays, because the later DISTINCT ON should keep the longest array (the one holding the touch point: {5,4} instead of {4}).
Then the ids arrays are unnested to get a basis for filtering. This results in:
| a_email | b_email | ids | unnest_ids |
|---------|---------|-----|------------|
| b | b | 1,3 | 1 |
| a | a | 1,3 | 1 |
| c | (null) | 2 | 2 |
| b | b | 1,3 | 3 |
| a | a | 1,3 | 3 |
| (null) | d | 3 | 3 |
| e | e | 5,4 | 4 |
| (null) | f | 4 | 4 |
| e | e | 5,4 | 5 |
After filtering with DISTINCT ON:
| a_email | b_email | ids | unnest_ids |
|---------|---------|-----|------------|
| b | b | 1,3 | 1 |
| c | (null) | 2 | 2 |
| b | b | 1,3 | 3 |
| e | e | 5,4 | 4 |
| e | e | 5,4 | 5 |
We are only interested in the ids column, which now holds the generated id clusters, and we need each cluster only once. That is the job of the final DISTINCT. So the CTE clustered results in:
| ids |
|-----|
| 2 |
| 1,3 |
| 5,4 |
Now we know which ids belong together and should share their data, so we join the clustered ids back against the original tables. Since we already did that join in the CTE combined, we can reuse it (that is why it was factored out into its own CTE: we do not need a second join of both tables in this step). The operator <@ says: join if the "touch point" array from combined is contained in the id cluster from clustered. This yields:
| a_email | b_email | co.ids | cl.ids |
|---------|---------|--------|--------|
| c       | (null)  | 2      | 2      |
| a       | a       | 1,3    | 1,3    |
| b       | b       | 1,3    | 1,3    |
| (null)  | d       | 3      | 1,3    |
| e       | e       | 5,4    | 5,4    |
| (null)  | f       | 4      | 5,4    |
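If the containment operator is unfamiliar, these standalone one-liners show its behaviour:
SELECT ARRAY[1,3] <@ ARRAY[1,3];  -- true: every element is contained
SELECT ARRAY[3]   <@ ARRAY[1,3];  -- true: subset
SELECT ARRAY[4]   <@ ARRAY[1,3];  -- false: 4 is not in {1,3}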
Now we are able to group the email addresses using the clustered ids (rightmost column).
array_agg collects the emails of one column, and array_cat concatenates the email arrays of both columns into one big email array.
Since some rows have a NULL email, we filter those values out before aggregating with the FILTER (WHERE ...) clause.
Result so far:
| array_cat |
|-----------|
| c |
| a,b,a,b,d |
| e,e,f |
Now all email addresses of a cluster are grouped under a single id. We still have to generate new unique user ids; that is what the window function row_number() is for. It simply numbers the rows:
| array_cat | new_id |
|-----------|--------|
| c | 1 |
| a,b,a,b,d | 2 |
| e,e,f | 3 |
The last step is to unnest the arrays to get one row per email address. Since the arrays still contain some duplicates, we eliminate them in this step with a DISTINCT as well:
| new_id | email |
|--------|-------|
| 1 | c |
| 2 | a |
| 2 | b |
| 2 | d |
| 3 | e |
| 3 | f |

OK, provided you only have two 'services', and assuming for now that you are not overly concerned with how best to represent the new key (I've used text as the easiest option to hand), then please try the query below. It works for me on Postgres 9.6:
WITH shared_addr AS
(
    SELECT
        foo.account_a,
        foo.account_b,
        row_number() OVER (ORDER BY foo.account_a) AS shared_id
    FROM (
        SELECT
            a.account_id AS account_a,
            b.account_id AS account_b
        FROM service_a a
        JOIN service_b b ON a.email_address = b.email_address
        GROUP BY a.account_id, b.account_id
    ) foo
)
SELECT
    bar.account_id,
    bar.email_address
FROM (
    SELECT
        'A-' || service_a.account_id::text AS account_id,
        service_a.email_address
    FROM service_a
    LEFT OUTER JOIN shared_addr ON shared_addr.account_a = service_a.account_id
    WHERE shared_addr.account_b IS NULL
    UNION ALL
    SELECT
        'B-' || service_b.account_id::text,
        service_b.email_address
    FROM service_b
    LEFT OUTER JOIN shared_addr ON shared_addr.account_b = service_b.account_id
    WHERE shared_addr.account_a IS NULL
    UNION ALL
    (
        SELECT
            'shared-' || shared_addr.shared_id::text,
            service_b.email_address
        FROM service_b
        JOIN shared_addr ON shared_addr.account_b = service_b.account_id
        UNION
        SELECT
            'shared-' || shared_addr.shared_id::text,
            service_a.email_address
        FROM service_a
        JOIN shared_addr ON shared_addr.account_a = service_a.account_id
    )
) bar;
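To try it out, the question's sample data can be loaded with something like this (my assumed table definitions):
CREATE TABLE service_a (account_id int, email_address text);
CREATE TABLE service_b (account_id int, email_address text);
INSERT INTO service_a VALUES (1, 'a@foo.com'), (1, 'b@foo.com'), (2, 'c@foo.com');
INSERT INTO service_b VALUES (3, 'a@foo.com'), (3, 'a@bar.com'), (4, 'd@foo.com');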

Related

Merging multiple "state-change" time series

Given a number of tables like the following, representing state-changes at time t of an entity identified by id:
Table A:
| t | id | a |
| - | -- | - |
| 0 | 1  | 1 |
| 1 | 1  | 2 |
| 5 | 1  | 3 |
Table B:
| t | id | b |
| - | -- | - |
| 0 | 1  | 3 |
| 2 | 1  | 2 |
| 3 | 1  | 1 |
where t is in reality a DateTime field with millisecond precision (making discretisation infeasible), how would I go about creating the following output?
| output |
| t | id | a | b |
| - | -- | - | - |
| 0 | 1 | 1 | 3 |
| 1 | 1 | 2 | 3 |
| 2 | 1 | 2 | 2 |
| 3 | 1 | 2 | 1 |
| 5 | 1 | 3 | 1 |
The idea is that for any given input timestamp, the entire state of a selected entity can be extracted by selecting one row from the resulting table. So the latest state of each variable corresponding to any time needs to be present in each row.
I've tried various JOIN statements, but I seem to be getting nowhere.
Note that in my use case:
rows also need to be joined by entity id
there may be more than two source tables to be merged
I'm running PostgreSQL, but I will eventually translate the query to SQLAlchemy, so a pure SQLAlchemy solution would be even better
I've created a db<>fiddle with the example data.
I think you want a full join and some other manipulations. The ideal would be:
select t, id,
last_value(a.a ignore nulls) over (partition by id order by t) as a,
last_value(b.b ignore nulls) over (partition by id order by t) as b
from a full join
b
using (t, id);
But . . . Postgres doesn't support ignore nulls. So an alternative method is:
select t, id,
max(a) over (partition by id, grp_a) as a,
max(b) over (partition by id, grp_b) as b
from (select *,
count(a.a) over (partition by id order by t) as grp_a,
count(b.b) over (partition by id order by t) as grp_b
from a full join
b
using (t, id)
) ab;
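For reference, a minimal setup to run the working variant above (assuming integer t for brevity; the real column would be a timestamp):
CREATE TABLE a (t int, id int, a int);
CREATE TABLE b (t int, id int, b int);
INSERT INTO a VALUES (0, 1, 1), (1, 1, 2), (5, 1, 3);
INSERT INTO b VALUES (0, 1, 3), (2, 1, 2), (3, 1, 1);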

Remove Duplicate Result on Query

Could you help me solve this duplication problem, where the query returns more than one row for the same record? I want to get only one row for each id, containing only the latest history entry of each record.
My Query:
SELECT DISTINCT ON(tickets.ticket_id,ticket_histories.created_at)
ticket.id AS ticket_id,
tickets.priority,
tickets.title,
tickets.company,
tickets.ticket_statuse,
tickets.created_at AS created_ticket,
group_user.id AS group_id,
group_user.name AS user_group,
ch_history.description AS ch_description,
ch_history.created_at AS ch_history
FROM
tickets
INNER JOIN company ON (company.id = tickets.company_id)
INNER JOIN (SELECT id,
tickets_id,
description,
user_id,
MAX(tickets.created_at) AS created_ticket
FROM
ch_history
GROUP BY id,
created_at,
ticket_id,
user_id,
description
ORDER BY created_at DESC LIMIT 1) AS ch_history ON (ch_history.ticket_id = ticket.id)
INNER JOIN users ON (users.id = ch_history.user_id)
INNER JOIN group_users ON (group_users.id = users.group_user_id)
WHERE company = 15
GROUP BY
tickets.id,
ch_history.created_at DESC;
Result of my query (it returns 3 to 5 identical ids with different histories; I want only one row per ticket id, with only the last recorded history of each ticket):
ticket_id | priority | title | company_id | ticket_statuse | created_ticket | company | user_group | group_id | ch_description | ch_history
-----------+------------+--------------------------------------+------------+-----------------+----------------------------+------------------------------------------------------+-----------------+----------+------------------------+----------------------------
49713 | 2 | REMOVE DATA | 1 | t | 2019-12-09 17:50:35.724485 | SAME COMPANY | people | 5 | TEST 1 | 2019-12-10 09:31:45.780667
49706 | 2 | INCLUDE DATA | 1 | f | 2019-12-09 09:16:35.320708 | SAME COMPANY | people | 5 | TEST 2 | 2019-12-10 09:38:52.769515
49706 | 2 | ANY TITLE | 1 | f | 2019-12-09 09:16:35.320708 | SAME COMPANY | people | 5 | TEST 3 | 2019-12-10 09:39:22.779473
49706 | 2 | NOTING ELSE MAT | 1 | f | 2019-12-09 09:16:35.320708 | SAME COMPANY | people | 5 | TESTE 4 | 2019-12-10 09:42:59.50332
49706 | 2 | WHITESTRIPES | 1 | f | 2019-12-09 09:16:35.320708 | SAME COMPANY | people | 5 | TEST 5 | 2019-12-10 09:44:30.675434
I wanted it to return as below:
ticket_id | priority | title | company_id | ticket_statuse | created_ticket | company | user_group | group_id | ch_description | ch_history
-----------+------------+--------------------------------------+------------+-----------------+----------------------------+------------------------------------------------------+-----------------+----------+------------------------+----------------------------
49713 | 2 | REMOVE DATA | 1 | t | 2019-12-09 17:50:10.724485 | SAME COMPANY | people | 5 | TEST 1 | 2020-01-01 18:31:45.780667
49707 | 2 | INCLUDE DATA | 1 | f | 2019-12-11 19:22:21.320701 | SAME COMPANY | people | 5 | TEST 2 | 2020-02-05 16:38:52.769515
49708 | 2 | ANY TITLE | 1 | f | 2019-12-15 07:15:57.320950 | SAME COMPANY | people | 5 | TEST 3 | 2020-02-06 07:39:22.779473
49709 | 2 | NOTING ELSE MAT | 1 | f | 2019-12-16 08:30:28.320881 | SAME COMPANY | people | 5 | TESTE 4 | 2020-01-07 11:42:59.50332
49701 | 2 | WHITESTRIPES | 1 | f | 2019-12-21 11:04:00.320450 | SAME COMPANY | people | 5 | TEST 5 | 2020-01-04 10:44:30.675434
As shown above, the ch_description and ch_history fields should carry only the most recent history record, and only the last one per listed ticket, without duplication. Could you help me achieve this?
Two things jump out at me:
You have listed "created at" as part of your "distinct on", which is going to inherently give you multiple rows per ticket id (unless there happens to be only one).
The distinct on should make the subquery on the ticket history unnecessary... and even if you chose to do it this way, you are again going on the "created at" column, which will give you multiple results. The ideal subquery, should you choose this approach, would have been to group by ticket_id and only ticket_id.
Slightly related:
An alternative approach to the subquery would be an analytic function (windowing function); a rough sketch follows after the main query below.
I think the query you want, which will give you one row per ticket_id based on the history table's created_at field, would be something like this:
select distinct on (t.id)
    <your fields here>
from
    tickets t
    join company c on t.company_id = c.id
    join ch_history ch on ch.ticket_id = t.id
    join users u on ch.user_id = u.id
    join group_users g on u.group_user_id = g.id
where
    company = 15
order by
    t.id, ch.created_at desc -- this is what tells distinct on which record to choose
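And the windowing alternative mentioned above, for completeness (an untested sketch against the same assumed schema):
select *
from (
    select
        t.id as ticket_id,
        t.priority,
        t.title,
        ch.description as ch_description,
        ch.created_at as ch_history,
        -- number each ticket's history rows from newest to oldest
        row_number() over (partition by t.id order by ch.created_at desc) as rn
    from tickets t
    join company c on t.company_id = c.id
    join ch_history ch on ch.ticket_id = t.id
    where c.id = 15 -- assuming "company = 15" means the company id
) x
where rn = 1; -- keep only the newest history row per ticket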

How to insert records based on another table value

I have the following three tables:
Permission
| PermissionId | PermissionName |
+--------------+----------------+
| 1 | A |
| 2 | B |
| 3 | C |
| 100 | D |
Group
| GroupId | GroupLevel | GroupName |
+---------+------------+----------------------+
| 1 | 0 | System Administrator |
| 7 | 0 | Test Group 100 |
| 8 | 20 | Test Group 200 |
| 9 | 20 | test |
| 10 | 50 | TestGroup01 |
| 11 | 51 | TestUser02 |
| 12 | 52 | TestUser03 |
GroupPermission
| GroupPermissionId | FkGroupId | FkPermissionId |
+-------------------+-----------+----------------+
| 1 | 1 | 1 |
| 2 | 1 | 2 |
| 3 | 1 | 3 |
| 4 | 1 | 4 |
I need to insert records into the GroupPermission table: whenever a row in table Group has 0 in its GroupLevel column, I need to take its GroupId and insert a row into GroupPermission with that particular id and permission 100.
Given the sample records above, that means inserting the following two records into the GroupPermission table:
| FkGroupId | FkPermissionId |
+-----------+----------------+
| 1 | 100 |
| 7 | 100 |
How can I do it?
This question is not very clear and I can only assume the value 100 is a static value and that you don't actually have foreign keys as the names of the columns imply. Also, you really should avoid reserved words like "Group" for object names. It makes things more difficult and confusing.
The simple version of your insert might look like this.
insert GroupPermission
(
    FkGroupId
    , FkPermissionId
)
select g.GroupId
    , 100
from [Group] g
where g.GroupLevel = 0
--EDIT--
Since you want to insert only those rows that don't already exist, you can use NOT EXISTS like this.
select g.GroupId
    , 100
from [Group] g
where g.GroupLevel = 0
    AND NOT EXISTS
    (
        select *
        from GroupPermission gp
        where gp.FkGroupId = g.GroupId
            and gp.FkPermissionId = 100
    )
Or you could use a left join like this.
select g.GroupId
    , 100
from [Group] g
left join GroupPermission gp on gp.FkGroupId = g.GroupId
    and gp.FkPermissionId = 100
where g.GroupLevel = 0
    and gp.FkGroupId is null
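Either version then slots straight into the insert from above, e.g. with the left join variant:
insert GroupPermission
(
    FkGroupId
    , FkPermissionId
)
select g.GroupId
    , 100
from [Group] g
left join GroupPermission gp on gp.FkGroupId = g.GroupId
    and gp.FkPermissionId = 100  -- match only the permission we are about to add
where g.GroupLevel = 0
    and gp.FkGroupId is null     -- keep groups that do not have it yet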

Select from a concatenation of two columns after a left join

Problem description
Let the tables C and V have those values
>> Table V <<
| UnID | BillID | ProductDesc | Value | ... |
| 1 | 1 | 'Orange Juice' | 3.05 | ... |
| 1 | 1 | 'Apple Juice' | 3.05 | ... |
| 1 | 2 | 'Pizza' | 12.05 | ... |
| 1 | 2 | 'Chocolates' | 9.98 | ... |
| 1 | 2 | 'Honey' | 15.98 | ... |
| 1 | 3 | 'Bread' | 3.98 | ... |
| 2 | 1 | 'Yogurt' | 8.55 | ... |
| 2 | 1 | 'Ice Cream' | 7.05 | ... |
| 2 | 1 | 'Beer' | 9.98 | ... |
| 2 | 2 | 'League of Legends RP' | 40.00 | ... |
>> Table C <<
| UnID | BillID | ClientName | ... |
| 1 | 1 | 'Alexander' | ... |
| 1 | 2 | 'Tom' | ... |
| 1 | 3 | 'Julia' | ... |
| 2 | 1 | 'Tom' | ... |
| 2 | 2 | 'Alexander' | ... |
Table V has the values of each product, which is associated with a bill number. Table C has the relationship between the client name and the bill number. However, the bill number is a counter that depends on the UnID, which is the store unit ID. That is to say, each store has its own bill number 1, bill number 2, etc. Also, the numbers of bills in each store are not equal.
Solution description
I'm trying to select from C left join V, without success. Because each BillID depends on the UnID, I have to make the join using the concatenation of those two columns.
I've used this script, but it gives me an error:
SELECT
    SUM(V.Value),
    C.ClientName
FROM
    C
LEFT JOIN
    V
ON
    CONCAT(C.UnID, C.BillID) = CONCAT(V.UnID, V.BillID)
GROUP BY
    C.ClientName
and SQL Server returns this error: 'CONCAT' is not a recognized built-in function name.
I'm using Microsoft SQL Server 2008 R2.
Is my use of CONCAT wrong? Or is it the way I tried to SELECT? Could you give me a hand?
[NB: the tables I've presented are just for the purpose of explaining my difficulties. That being said, if you find any errors in the explanation, please let me know so I can correct them.]
You should be joining on the equality of the UnID and BillID columns in the two tables:
SELECT
c.ClientName,
COALESCE(SUM(v.Value), 0) AS total
FROM C c
LEFT JOIN V v
ON c.UnID = v.UnID AND
c.BillID = v.BillID
GROUP BY
c.ClientName;
In theory you could try joining on CONCAT(UnID, BillID). However, you could run into problems. For example, UnID = 1 with BillID = 23 would, concatenated together, be the same as UnID = 12 and BillID = 3.
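A quick way to see that collision (hypothetical values; using + and CAST since CONCAT is unavailable on 2008 R2):
SELECT CAST(1 AS varchar(10)) + CAST(23 AS varchar(10));  -- '123'
SELECT CAST(12 AS varchar(10)) + CAST(3 AS varchar(10));  -- also '123', so the two keys collide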
Note: We wrap the sum with COALESCE, because should a given client have no entries in the V table, the sum would return NULL, which we then replace with zero.
CONCAT is only available in SQL Server 2012 and later.
Here's one option.
SELECT
    SUM(V.Value),
    C.ClientName
FROM
    C
LEFT JOIN
    V
ON
    cast(C.UnID as varchar(100)) + cast(C.BillID as varchar(100)) = cast(V.UnID as varchar(100)) + cast(V.BillID as varchar(100))
GROUP BY
    C.ClientName

Getting Sum of MasterTable's amount which joins to DetailTable

I have two tables:
1. Master
| ID | Name | Amount |
|-----|--------|--------|
| 1 | a | 5000 |
| 2 | b | 10000 |
| 3 | c | 5000 |
| 4 | d | 8000 |
2. Detail
| ID |MasterID| PID | Qty |
|-----|--------|-------|------|
| 1 | 1 | 1 | 10 |
| 2 | 1 | 2 | 20 |
| 3 | 2 | 2 | 60 |
| 4 | 2 | 3 | 10 |
| 5 | 3 | 4 | 100 |
| 6 | 4 | 1 | 20 |
| 7 | 4 | 3 | 40 |
I want to select sum(Amount) from Master joined to Detail where Detail.PID is in (1,2,3).
So I execute the following query:
SELECT SUM(Amount)
FROM Master M
INNER JOIN Detail D ON M.ID = D.MasterID
WHERE D.PID IN (1,2,3)
The result should be 23000, but I am getting 46000.
See this fiddle. Any suggestions?
You are getting exactly double the amount because each matching Master row joins to two Detail rows with PIDs in the WHERE clause.
See the demo.
Use
SELECT SUM(Amount)
FROM Master M
WHERE M.ID IN (
    SELECT DISTINCT MasterID
    FROM Detail
    WHERE PID IN (1,2,3)
)
What is the point of joining the Master table with Detail when all the columns you need are in the Master table?
Also, isn't there an FK relationship defined on these tables? Looking at your data, it seems there should be an FK on the Detail table for MasterID. If that is the case, then you do not need to join the tables at all.
Also, if you want to make sure that matching records exist in the Detail table for the rows you are summing and there is no FK relationship, then you could try EXISTS instead of a join, as sketched below.
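A sketch of that EXISTS variant (it avoids the double counting without needing DISTINCT):
SELECT SUM(M.Amount)
FROM Master M
WHERE EXISTS (
    SELECT 1
    FROM Detail D
    WHERE D.MasterID = M.ID
      AND D.PID IN (1, 2, 3)  -- each Master row is counted once, however many Detail rows match
);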