INNER JOIN of pagevies, contacts and companies - duplicated entries - sql

In short: 3 table inner join duplicates records
I have data in BigQuery in 3 tables:
Pageviews with columns:
timestamp
user_id
title
path
Contacts with columns:
website_user_id
email
company_id
Companies with columns:
id
name
I want to display all recorded pageviews and, if user and/or company is known, display this data next to pageview.
First, I join contact and pageviews data (SQL is generated by Metabase business intelligence tool):
SELECT
`analytics.pageviews`.`timestamp` AS `timestamp`,
`analytics.pageviews`.`title` AS `title`,
`analytics.pageviews`.`path` AS `path`,
`Contacts`.`email` AS `email`
FROM `analytics.pageviews`
INNER JOIN `analytics.contacts` `Contacts` ON `analytics.pageviews`.`user_id` = `Contacts`.`website_user_id`
ORDER BY `timestamp` DESC
It works as expected and I can see pageviews attributed to known contacts.
Next, I'd like to show pageviews of contacts with known company and which company is this:
SELECT
`analytics.pageviews`.`timestamp` AS `timestamp`,
`analytics.pageviews`.`title` AS `title`,
`analytics.pageviews`.`path` AS `path`,
`Contacts`.`email` AS `email`,
`Companies`.`name` AS `name`
FROM `analytics.pageviews`
INNER JOIN `analytics.contacts` `Contacts` ON `analytics.pageviews`.`user_id` = `Contacts`.`website_user_id`
INNER JOIN `analytics.companies` `Companies` ON `Contacts`.`company_id` = `Companies`.`id`
ORDER BY `timestamp` DESC
With this query I would expect to see only pageviews where associated contact AND company are known (just another column for company name). The problem is, I get duplicate rows for every pageview (sometimes 5, sometimes 20 identical rows).
I want to avoid selecting DISTINCT timestamps because it can lead to excluding valid pageviews from different users but with identical timestamp.
How to approach this?

Your description sounds like you have duplciates in companies. This is easy to test for:
select c.id, count(*)
from `analytics.companies` c
group by c.id
having count(*) >= 2;
You can get the details using window functions:
select c.*
from (select c.*, count(*) over (partition by c.id) as cnt
from `analytics.companies` c
) c
where cnt >= 2
order by cnt desc, id;

Related

Handling nested case statements in Redshift

I am writing a Redshift query which require use of multiple case statements.
Pretext:
Customers can associated with more than one organizations like, sweet or salt etc.
Ask :
We have to check that customers associated with 'SWEETS' organization are picked first, if no affiliation with 'SWEETS' is available , than we have to take id of that organization where flag = 1.
I have to use a case statement in redshift to derive the result.
There are three different tables, customer table, organization table and 3 table that determines how customers are associated with organization.
![enter image description here][1]
Code that I have tried is below , but after executing this , I am still getting the two organization ids, instead of one id which should be of sweet org.
SELECT customer_id
, organization_id
FROM customer_details AS customer
LEFT JOIN organization AS org
ON customer.customer_id
AND organization_id = CASE WHEN organization_id IN (SELECT organization_id
FROM organization_type
WHERE organization_type = 'SWEET')
THEN organization_id
ELSE org.organization_id END
You can use window functions:
select customer_id, organization_id
from (select c.customer_id, o.organization_id,
row_number() over (partition by o.customer_id order by o.organization_type = 'SWEET' desc) as seqnum
from customer_details c left join
organization o
on c.customer_id = o.organization_id
) co
where seqnum = 1;

How to Left Join SQL Subquery with Table

Goal: Create a query that calculates the ratio of ids that have/don't have a particular attribute.
Table 1: events
Fields: event_id, event_name, user_id
Field event_id is unique key/index
Field event_name has 3 potential values, one of which is the one being inspected.
Field user_id is a foreign key from Table 2
Table 2: users
Fields: id (and a long list of other attributes that aren't pertinent)
To get the list of user_ids with the qualifying attribute, I created the following:
SELECT DISTINCT events.user_id AS viewing_ids
FROM events
WHERE event_name = 'view_user_profile'
As I would expect this provides the list of users that have the corresponding event_name associated with their user_id
The next part is where I'm getting mixed up. Yes, I could COUNT(DISTINCT the select to get the count ids that have the attribute 'view_user_profile' but that only provides half the answer. What I need to do is then Join that list with the full user_id list from the table users and then determine when the id exists or doesn't.
I'm thinking the initial SELECT needs to be
SELECT
(CASE WHEN viewers IS NULL THEN false
ELSE true END) AS has_viewed_profile
, COUNT(user_id) AS users
FROM
(SELECT DISTINCT events.user_id AS viewing_ids
FROM events
WHERE event_name = 'view_user_profile') viewers
LEFT JOIN
users
ON
??? = users.id
This is where I get lost, I don't have a column name for viewers...
I think this is what you want:
select count(e.user_id) / count(*) as view_ratio
from users u left join
(select distinct e.user_id
from events e
where e.event_name = 'view_user_profile'
) e
on e.user_id = u.id;

SQL: Select count of a record in right table with joins

I have 2 tables one for mobiles and other is for reviews. Reviews table store the reviews of a specific mobile against its mobile id.
Structure of mobiles table.
mobile_id | mobile_name
Structure of reviews table.
review_id | mobile_id | review_body
So far I have written this query.
SELECT c.*, p.review_body
FROM ((select mobile_id, mobile_name from mobiles
WHERE brand_id=1 limit 0,5) c)
left JOIN
(
SELECT mobile_id,
MAX(review_id) MaxDate
FROM reviews
GROUP BY mobile_id
) MaxDates ON c.mobile_id = MaxDates.mobile_id left JOIN
reviews p ON MaxDates.mobile_id = p.mobile_id
AND MaxDates.MaxDate = p.review_id
This query returns the first 5 mobiles from mobile table and their latest (one) review from review table. This is the result it returns.
mobile_id | mobile_name | review_body
Question: But i also want review_count with it. review_count should be equal to total number of reviews a mobile has in reviews table against its mobile_id.
So please tell me how it can be done with a single query that I already have. Any help would be appreciated as i am trying to do this since 24 hours.
I think this would work
SELECT c.*, p.review_body, MaxDates.review_count
FROM ((select mobile_id, mobile_name from mobiles
WHERE brand_id=1 limit 0,5) c)
left JOIN
(
SELECT mobile_id,count(review_id) review_count,
MAX(review_id) MaxDate
FROM reviews
GROUP BY mobile_id
) MaxDates ON c.mobile_id = MaxDates.mobile_id left JOIN
reviews p ON MaxDates.mobile_id = p.mobile_id
AND MaxDates.MaxDate = p.review_id

Select by frequency

I have two tables, like that:
users(id, name)
phones(user_id, number)
I'd like to select all user's names that are in more than three rows in the table phones. How can I do that?
Join the tables and add a having clause that limits the results returned by the count of the user_ids
select name,
count(user_id)
from users u
join phones p
on u.id = p.user_id
group by name
having count(user_id) > 3
SQL Fiddle: http://sqlfiddle.com/#!2/c5516/2
select name from user
join phones on id = user_id
Group By user_id
Having Count(number) > 3

Using a complicated double join to get a count of child objects

Note that I'm using postgresql
I have an organizations table, a users table, a jobs table, and a documents table. I want to get a list of the organizations ordered by the number of total documents they have access to.
organizations
------------
id (pk)
company_name
users
------------
id (pk)
organization_id
jobs
------------
id (pk)
client_id (id of an organization)
server_id (id of an organization)
creator_id (id of a user)
documents
------------
id (pk)
job_id
Result Desired
organizations.id | organizations.company_name | document_count
85 | Big Corporation | 84
905 | Some other folks | 65
403 | ACME, Inc | 14
As you can see, an organization can be connected to a document through 3 different paths:
organizations.id => jobs.client_id => documents.job_id
organizations.id => jobs.server_id => documents.job_id
organizations.id => users.organization_id => jobs.creator_id => documents.job_id
But I want a query that will get the count of all the documents each company has access to...
I tried a couple of things... like this:
SELECT COUNT(documents.id) document_count, organizations.id, organizations.company_name
FROM organizations
INNER JOIN users ON organizations.id = users.organization_id
INNER JOIN jobs ON (
jobs.client_id = organizations.id OR
jobs.server_id = organizations.id OR
jobs.creator_id = users.id
)
INNER JOIN documents ON documents.job_id = jobs.id
GROUP BY organizations.id, organizations.company_name
ORDER BY document_count DESC
LIMIT 10
The query takes awhile to run, but it's not horrible since i'm doing it for a one-time report, but the results... cannot possibly be correct.
The first listed organization has a reported count of 129,834 documents -- but that's impossible since there's only 32,820 records in the documents table. I feel like it must be counting drastic quantities of duplicates (due to an error in one of my joins?) but I'm not sure where I've gone wrong.
The order appears correct since the highest volume user of the system is clearly at the top of the list... but the value is inflated somehow.
The problem is that if jobs.client_id = organizations.id or jobs.server_id = organizations.id, then there's nothing to filter your INNER JOIN users (aside from its ON clause), so you'll get a separate record for every single user that belongs to that organization. In other words, for each organization, you're adding three values:
its total number of users times the total number of documents belonging to jobs for which it's a client
its total number of users times the total number of documents belonging to jobs for which it's a server
the total number of documents belonging to jobs for which one if its users is the creator
One way to fix this is to remove the INNER JOIN users line, and change this:
jobs.creator_id = users.id
to this:
jobs.creator_id IN (SELECT id FROM users WHERE organization_id = organizations.id)
. . . but that might perform terribly. You might need to try a few things before finding a query that performs acceptably.
Simplify your thinking. You have 3 paths to docid so write 3 queries, union them and count that
It's probably too late to redesign this, but you really should.
The jobs table should not have its own id field a d key.
The jobs table is horribly designed because every reference to a disk page from the id index is gonna have to go read 1-100 different pages from disk out of the data file just to get the three other id fields that you always want to use (which is the clue that a job should not have its own id).
You can make a quick fix by making jobs use an index that is clustered or clustering ( depending on the db system) on the job id field. And alternative will be to mark the other three id fields as "includes" on the index so the page reads to the data file will 100% go away. Either of these may be enough to make this "just work".
What I would encourage you to do though is drop the id field and key on jobs and instead make a "natural key" that has the three other id fields in it and use that key on the documents table as well.
I would also demoralize (repeat) the organization of the creator on the jobs table and the document table. A user isn't going to move to another org and keep the same acces, so you should never have to run a sweep to update these in sync and even if you did it would be easy.
With these changes you can just do a select on the documents table directly, skipping the random pages reads needed from the other tables. The group by to group across the three different id fields would be a bit tricky. I might give this a try as it is interesting.
In the short term though, try clustering or includes on the jobs table to solve the performance issue and I will check the join logic tonight.
None of the answers quite got me there except for the one suggesting a UNION. This is what I came up with:
SELECT COUNT(docs.doc_id) document_count, docs.org_id, docs.org_name
FROM (
SELECT documents.id doc_id, organizations.id org_id, organizations.company_name org_name
FROM documents
INNER JOIN jobs ON documents.job_id = jobs.id
INNER JOIN organizations ON jobs.client_id = organizations.id
UNION
SELECT documents.id doc_id, organizations.id org_id, organizations.company_name org_name
FROM documents
INNER JOIN jobs ON documents.job_id = jobs.id
INNER JOIN organizations ON jobs.server_id = organizations.id
UNION
SELECT documents.id doc_id, organizations.id org_id, organizations.company_name org_name
FROM documents
INNER JOIN jobs on documents.job_id = jobs.id
INNER JOIN users ON jobs.creator_id = users.id
INNER JOIN organizations ON users.organization_id = organizations.id
) docs
GROUP BY org_id, org_name
ORDER BY document_count DESC
The performance was much better than any of the people suggesting subqueries and it appears to have given me a reasonable answer
But I want a query that will get the count of all the documents you have access to...
That's where your query starts:
SELECT ... FROM documents
...
Since the only clue to the documents table is in jobs, you'll need the jobs table as well::
SELECT ...
FROM documents dc
JOIN jobs jo ON jo.document_id = dc.id
...
Now, it is time for restrictions. Which documents do you actually want ? There are three cases you want: either the client_id matches the organisation, or the server_id maches the company, or the creator_id matches a user that happens to work for the company:
SELECT ...
FROM documents dc
JOIN jobs jo ON jo.document_id = dc.id
WHERE jo.client_id = $THE_COMPANY
OR jo.server_id = $THE_COMPANY
OR EXISTS (
SELECT *
FROM users uu
JOIN organizations oo ON uu.organization_id = ex.id
WHERE uu.id = jo.creator_id
AND oo.id = $THE_COMAPNY
)
;
But, there might be a problem here. If two or more different jobs-records would point to the same document, you would count these double. You can either add a DISTINCT to the outer query, or move the jobs-table down into a subquery:
SELECT ...
FROM documents dc
WHERE EXISTS (
SELECT *
FROM jobs jo
WHERE jo.document_id = dc.id
AND ( jo.client_id = $THE_COMPANY
OR jo.server_id = $THE_COMPANY
OR EXISTS (
SELECT *
FROM users uu
JOIN organizations oo ON uu.organization_id = ex.id
WHERE uu.id = jo.creator_id
AND oo.id = $THE_COMAPNY
)
)
)
;
As you can see, the thee ways of selecting a document end up in a WHERE (a OR b OR c) clause.
UPDATE: (since the OP does not give us the table definions in a useble form I had to reconstruct these)
DROP SCHEMA tmp CASCADE;
CREATE SCHEMA tmp ;
SET search_path=tmp;
--
-- create the missing tables
--
CREATE TABLE organizations
( id SERIAL NOT NULL PRIMARY KEY
, company_name varchar
);
CREATE TABLE users
( id SERIAL NOT NULL PRIMARY KEY
, organization_id INTEGER NOT NULL REFERENCES organizations(id)
);
CREATE TABLE jobs
( id SERIAL NOT NULL PRIMARY KEY
, client_id INTEGER NOT NULL REFERENCES organizations(id)
, server_id INTEGER NOT NULL REFERENCES organizations(id)
, creator_id INTEGER NOT NULL REFERENCES users(id)
);
CREATE TABLE documents
( id SERIAL NOT NULL PRIMARY KEY
, job_id INTEGER NOT NULL REFERENCES jobs(id)
);
--
-- Populate
--
INSERT INTO organizations(id, company_name) VALUES
(85,'Big Corporation') ,(905,'Some other folks') ,(403,'ACME, Inc')
;
select setval('organizations_id_seq', 905);
INSERT INTO users(organization_id)
SELECT o.id
FROM generate_series(1,1000)
JOIN organizations o ON random() < 0.3
;
INSERT INTO jobs (client_id,server_id,creator_id)
SELECT o1.id, o2.id, u.id
FROM users u
JOIN organizations o1 ON 1=1
JOIN organizations o2 ON o2.id <> o1.id
;
INSERT INTO documents(job_id)
SELECT id FROM jobs j
;
DELETE FROM documents
WHERE random() < 0.5
;
--
-- And the query ...
--
EXPLAIN ANALYZE
SELECT o.id AS org
, count(*) AS the_docs
FROM organizations o
JOIN documents d ON 1=1 -- start with a carthesian product
WHERE EXISTS (
SELECT *
FROM jobs j
WHERE d.job_id = j.id
AND (j.client_id = o.id OR j.server_id = o.id )
)
OR EXISTS (
SELECT *
FROM jobs j
JOIN users u ON j.creator_id = u.id
WHERE u.organization_id = o.id
AND d.job_id = j.id
)
GROUP BY o.id
;