Joining two joins in PostgreSQL - sql

Background
I've needed to learn some PostgreSQL quickly and from scratch in order to do a data analysis project about car insurance. I have a locally stored PostgreSQL database of fairly decent size (around 8 GB of data on insurance claims for vehicles like cars and motorcycles), and I've needed to JOIN and UNION ALL a couple of things in order to get the table I need for my statistical models.
The first part of what I've needed to do is this thing, a JOIN inside of a UNION ALL between two tables about car claims and motorcycle claims:
select
l.customer_combined_id,
l.claim_id,
l."Part_Cd",
l.service_date,
h.principal_problem_cd,
h.problem_code_vers_flag
from claims.auto_claims_line_items as l
JOIN claims.auto_claims_general h on l.claim_id = h.claim_id
UNION ALL
select
l.customer_combined_id,
l.claim_id,
l."Part_Cd",
l.entry_date as service_date,
NULL as principal_problem_cd,
NULL as problem_code_vers_flag
from claims.motorcycle_claims_line_items as l
This yields a table that looks like this (column names abbreviated for aesthetics):
cust_comb_id| claim_id | "Part_Cd" | svc_date | prin_prob_cd | prob_cd_vers_flg |
------------+----------+-----------+----------+--------------+------------------|
| | | | | |
As you can see, the car claims have some columns that the motorcycle claims don't have. This is fine -- I've filled those in as NULL in order to get the UNION ALL to work. Now the car claims table is nicely stacked on top of the motorcycle claims table. So far, so good.
The second part of what I've done so far is this other thing, which concerns data about car and motorcycle insurance policyholders ("customers"):
select m.customer_dob,
m.customer_id,
m.customer_gender_cd,
m.customer_zip_cd,
c.customer_combined_id
from customer."Customer" m
JOIN customer.customer_combined_crosswalk c on m.customer_id = c.customer_id
The result of which looks like this:
dob | customer_id | gender_cd | zip_cd | cust_comb_id |
----+-------------+-----------+--------+--------------|
| | | | |
The Problem
I've figured out two halves of my data manipulation, but I don't know how to put these halves together, so to speak. I want (I think) to left join these two things on cust_comb_id, but I'm not sure how to write it. I want to keep everything in the first part (the claim data) and bring in data from the second part (the policyholders / customers) when cust_comb_id matches, and give null values if it doesn't. Here's a visual of what I'm looking for:
cust_comb_id| claim_id | "Part_Cd" | svc_date | prin_prob_cd | prob_cd_vers_flg |dob | cust_id | gender_cd | zip_cd |
------------+----------+-----------+----------+--------------+------------------|----+---------+-----------+--------+
| | | | | | | | | |
What I've tried
I've tried to use subqueries to join these joins, but I keep getting errors.
Edit: Here's a concrete example of something I've tried:
select *
from
(select
l.customer_combined_id,
l.claim_id,
l."Part_Cd",
l.service_date,
h.principal_problem_cd,
h.problem_code_vers_flag
from claims.auto_claims_line_items as l
JOIN claims.auto_claims_general h on l.claim_id = h.claim_id
UNION ALL
select
l.customer_combined_id,
l.claim_id,
l."Part_Cd",
l.entry_date as service_date,
NULL as principal_problem_cd,
NULL as problem_code_vers_flag
from claims.motorcycle_claims_line_items as l) as cl
LEFT JOIN
select m.customer_dob,
m.customer_id,
m.customer_gender_cd,
m.customer_zip_cd,
c.customer_combined_id
from customer."Customer" m
JOIN customer.customer_combined_crosswalk c on m.customer_id = c.customer_id
This yields the error ERROR: syntax error at or near "select".
Any help is much appreciated.
[Note: customer_combined_id and customer_id are two different things: the combined id is unique, and made to account for when a customer switches from one insurance plan - where they have one customer_id - to another, where they're given a new one.]

So it was a syntax issue.
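OP already had all the needed parts:
Part I and Part II subqueries were already implemented
it was defined how to join them
The only problem was the syntax: a SELECT used as the operand of a join has to be wrapped in parentheses (ideally with an alias), and the LEFT JOIN needs an ON clause. The attempt in the question has neither, hence the error at "select".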
I suppose this form would be the most readable:
WITH PartI AS(
select
l.customer_combined_id,
l.claim_id,
l."Part_Cd",
l.service_date,
h.principal_problem_cd,
h.problem_code_vers_flag
from claims.auto_claims_line_items as l
JOIN claims.auto_claims_general h on l.claim_id = h.claim_id
UNION ALL
select
l.customer_combined_id,
l.claim_id,
l."Part_Cd",
l.entry_date as service_date,
NULL as principal_problem_cd,
NULL as problem_code_vers_flag
from claims.motorcycle_claims_line_items as l
),
PartII AS (
select m.customer_dob,
m.customer_id,
m.customer_gender_cd,
m.customer_zip_cd,
c.customer_combined_id
from customer."Customer" m
JOIN customer.customer_combined_crosswalk c on m.customer_id = c.customer_id
)
SELECT
*
FROM
PartI P1
LEFT JOIN PartII P2
ON P1.customer_combined_id = P2.customer_combined_id;
https://www.db-fiddle.com/f/msAtD89dn4DndMtxukkgkP/2
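The attempt from the question can also be repaired without CTEs. The sketch below uses two tiny made-up derived tables as stand-ins for Part I and Part II (the literal rows and the aliases cl and cu are assumptions for illustration only, not OP's data). The point is just the shape: each SELECT is parenthesized and aliased, and the LEFT JOIN ends with an ON clause.

```sql
-- Stand-in for Part I (claims): hypothetical rows, not real data.
select cl.customer_combined_id,
       cl.claim_id,
       cu.customer_dob
from (
    select 1 as customer_combined_id, 100 as claim_id
    union all
    select 2, 200
) as cl
-- Stand-in for Part II (customers): note the parentheses and the alias.
left join (
    select 1 as customer_combined_id, date '1980-01-01' as customer_dob
) as cu
    on cl.customer_combined_id = cu.customer_combined_id
order by cl.claim_id;
```

With these stand-in rows, claim 100 comes back with its customer's date of birth and claim 200 comes back with NULL, because no customer row has customer_combined_id = 2. Swapping the real Part I and Part II queries into the two pairs of parentheses should give the table OP described.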

Alex Yu's answer is better, but I wanted to post this because a) it also works and b) shows a neat use for views in SQL.
Take the first part, and make a view of it by adding a single line of CREATE OR REPLACE VIEW before the first select:
CREATE OR REPLACE VIEW clms AS
select
l.customer_combined_id,
l.claim_id,
l."Part_Cd",
l.service_date,
h.principal_problem_cd,
h.problem_code_vers_flag
from claims.auto_claims_line_items as l
JOIN claims.auto_claims_general h on l.claim_id = h.claim_id
UNION ALL
select
l.customer_combined_id,
l.claim_id,
l."Part_Cd",
l.entry_date as service_date,
NULL as principal_problem_cd,
NULL as problem_code_vers_flag
from claims.motorcycle_claims_line_items as l
Next, do the same for the second part:
CREATE OR REPLACE VIEW cstmr AS
select m.customer_dob,
m.customer_id,
m.customer_gender_cd,
m.customer_zip_cd,
c.customer_combined_id
from customer."Customer" m
JOIN customer.customer_combined_crosswalk c on m.customer_id = c.customer_id
Finally, do a SQL 101-level simple left join of the two views (left, so that claims without a matching customer are kept):
select *
from clms
left join cstmr m on clms.customer_combined_id = m.customer_combined_id
I bumped into this answer after posting the problem and was happy to find a (somewhat) elegant solution myself.

Related

is there a way to unnest bigquery array independently?

Let's say I have this database:
with tbl as (
select
['Unknown','Eletric','High Voltage'] AS product_category,
['Premium','New'] as client_cluster
) select * from tbl
And its output:
row | product_category | client_cluster
--------------------------------------------------------------
1 | [Unknown, Eletric, High Voltage] | [Premium, New]
I would like to unnest these columns independently, so that the result has N rows, where N is the size of the largest array, and the output looks like this:
row | product_category | client_cluster
---------------------------------------------
1 | Unknown | Premium
2 | Eletric | New
3 | High Voltage | Null
I don't need any particular order. Is there a way to do that? I tried this Stack Overflow answer, but in my case it did not work as expected because my arrays do not have the same size.
"it did not work as expected because my arrays do not have the same size."
For the specific sample in your question, you can left join the unnested arrays.
WITH tbl AS (
SELECT ['Unknown','Eletric','High Voltage'] AS product_category, ['Premium','New'] as client_cluster
)
SELECT p AS product_category, c AS client_cluster
FROM tbl, UNNEST(product_category) p WITH offset
LEFT JOIN UNNEST(client_cluster) c WITH offset USING (offset);
But if product_category is shorter than client_cluster, it won't work as you wish: the extra client_cluster elements have no row to attach to and are dropped.
WITH tbl AS (
SELECT ['Eletric','High Voltage'] AS product_category, ['Supreme', 'Premium','New'] as client_cluster
)
SELECT p AS product_category, c AS client_cluster
FROM tbl, UNNEST(product_category) p WITH offset
LEFT JOIN UNNEST(client_cluster) c WITH offset USING (offset);
I might be wrong, but as far as I know you can't use FULL JOIN or RIGHT JOIN with flattened array to solve this issue. If you try to do so, you will get:
Query error: Array scan is not allowed with FULL JOIN: UNNEST expression at [31:13]
So you might consider the workaround below, which uses a generated array of offsets as a hidden table for the join key.
WITH tbl AS (
SELECT 1 id, ['Unknown','Eletric','High Voltage'] AS product_category, ['Premium','New'] as client_cluster,
UNION ALL
SELECT 2 id, ['Eletric','High Voltage'], ['Premium','New', 'Supreme']
)
SELECT id, p AS product_category, c AS client_cluster
FROM tbl, UNNEST(GENERATE_ARRAY(0, GREATEST(ARRAY_LENGTH(client_cluster), ARRAY_LENGTH(product_category)) - 1)) o0
LEFT JOIN UNNEST(product_category) p WITH offset o1 ON o0 = o1
LEFT JOIN UNNEST(client_cluster) c WITH offset o2 ON o0 = o2;

Returning a number when result set is null

Each lot object contains a corresponding list of work orders. These work orders have tasks assigned to them, which are structured by the task set on the lot's parent (the phase). I am trying to get back the LOT_ID and a count of TASK_ID where the TASK_ID is found to exist for the where condition.
The problem is if the TASK_ID is not found, the result set is null and the LOT_ID is not returned at all.
I have uploaded a single row for LOT, PHASE, and WORK_ORDER to the following SQLFiddle. I would have added more data but there is a fun limiter .. err I mean character limiter to the editor.
SQLFiddle
SELECT W.[LOT_ID], COUNT(*) AS NUMBER_TASKS_FOUND
FROM [PHASE] P
JOIN [LOT] L ON L.[PHASE_ID] = P.[PHASE_ID]
JOIN [WORK_ORDER] W ON W.[LOT_ID] = L.[LOT_ID]
WHERE P.[TASK_SET_ID] = 1 AND W.[TASK_ID] = 41
GROUP BY W.[LOT_ID]
The query returns the expected result when the task id is found (46) but no result when the task id is not found (say 41). I'd expect in that case to see something like:
+--------+--------------------+
| LOT_ID | NUMBER_TASKS_FOUND |
+--------+--------------------+
| 500 | 0 |
| 506 | 0 |
+--------+--------------------+
I have a feeling this needs to be wrapped in a sub-query and then joined but I am uncertain what the syntax would be here.
My true objective is to be able to pass a list of TASK_ID and get back any LOT_ID that doesn't match, but for now I am just doing a query per task until I can figure that out.
You want to see all lots with their counts for the task. So either outer join the tasks or cross apply their count or use a subquery in the select clause.
select l.lot_id, count(wo.work_order_id) as number_tasks_found
from lot l
left join work_order wo on wo.lot_id = l.lot_id and wo.task_id = 41
where l.phase_id in (select p.phase_id from phase p where p.task_set_id = 1)
group by l.lot_id
order by l.lot_id;
or
select l.lot_id, w.number_tasks_found
from lot l
cross apply
(
select count(*) as number_tasks_found
from work_order wo
where wo.lot_id = l.lot_id
and wo.task_id = 41
) w
where l.phase_id in (select p.phase_id from phase p where p.task_set_id = 1)
order by l.lot_id;
or
select l.lot_id,
(
select count(*)
from work_order wo
where wo.lot_id = l.lot_id
and wo.task_id = 41
) as number_tasks_found
from lot l
where l.phase_id in (select p.phase_id from phase p where p.task_set_id = 1)
order by l.lot_id;
Another option would be to outer join the count and use COALESCE to turn null into zero in your result.
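That last option could look roughly like the sketch below. The three small CTEs at the top are toy stand-ins with hypothetical rows, shaped like the tables in the question so the statement runs on its own; task_counts and the alias tc are names made up here. Against the real tables you would drop the first three CTEs and keep the rest.

```sql
-- Toy data (hypothetical rows, shaped like the tables in the question).
with phase (phase_id, task_set_id) as (
    select 1, 1
),
lot (lot_id, phase_id) as (
    select 500, 1 union all
    select 506, 1
),
work_order (work_order_id, lot_id, task_id) as (
    select 1, 500, 46 union all
    select 2, 506, 46
),
-- One row per lot that has at least one work order for the task.
task_counts as (
    select wo.lot_id, count(*) as number_tasks_found
    from work_order wo
    where wo.task_id = 41
    group by wo.lot_id
)
-- Lots without a row in task_counts get NULL from the outer join;
-- COALESCE turns that NULL into 0.
select l.lot_id,
       coalesce(tc.number_tasks_found, 0) as number_tasks_found
from lot l
left join task_counts tc on tc.lot_id = l.lot_id
where l.phase_id in (select p.phase_id from phase p where p.task_set_id = 1)
order by l.lot_id;
```

With these toy rows no work order has task 41, so lots 500 and 506 both come back with a count of 0 instead of disappearing from the result.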

LEFT OUTER JOIN not always matching

I'm starting with a SQL query with a couple of joins and I'm getting the exact data I expect. This is what the current query is.
SELECT DISTINCT o.OrganizationHierarchyUnitLevelFourCd, o.OrganizationHierarchyUnitLevelThreeNm, o.OrganizationHierarchyUnitLevelFourNm
FROM Lab_Space l
JOIN Worker w ON l.Contact_WWID = w.WWID AND w.Employee_Status_Code = 'A'
JOIN Org_Hierarchy o ON o.OrganizationHierarchyUnitLevelThreeNm IS NOT NULL AND w.Org_Hierarchy_Unit_Cd = o.OrganizationHierarchyUnitCd
ORDER BY o.OrganizationHierarchyUnitLevelThreeNm, o.OrganizationHierarchyUnitLevelFourNm
This ends up with a row like
1234 | Finance | IT
Now I've created a new table, where I'm tracking whether or not I want to include the organization in my output. That table just has two columns, an org ID and a bit field. So I thought I could LEFT OUTER JOIN, since the second table won't have data on all orgs, so I expanded the query to this:
SELECT DISTINCT o.OrganizationHierarchyUnitLevelFourCd, o.OrganizationHierarchyUnitLevelThreeNm, o.OrganizationHierarchyUnitLevelFourNm, v.Include
FROM Lab_Space l
JOIN Worker w ON l.Contact_WWID = w.WWID AND w.Employee_Status_Code = 'A'
JOIN Org_Hierarchy o ON o.OrganizationHierarchyUnitLevelThreeNm IS NOT NULL AND w.Org_Hierarchy_Unit_Cd = o.OrganizationHierarchyUnitCd
LEFT OUTER JOIN Validation_Email_Org_Unit_Inclusion v ON o.OrganizationHierarchyUnitCd = v.OrganizationHierarchyUnitCd
ORDER BY o.OrganizationHierarchyUnitLevelThreeNm, o.OrganizationHierarchyUnitLevelFourNm
The problem I have is now I end up with rows like so:
1234 | Finance | IT | NULL
1234 | Finance | IT | 1
Since the Validation_Email_Org_Unit_Inclusion table includes a 1 for the 1234 org, I would expect to just get a single row with a value of 1, not include the row with NULL.
What have I done wrong?
You output OrganizationHierarchyUnitLevelFourCd but currently join on OrganizationHierarchyUnitCd. Join on the same column you output to get the corresponding value.
SELECT DISTINCT o.OrganizationHierarchyUnitLevelFourCd, ...
...
LEFT OUTER JOIN Validation_Email_Org_Unit_Inclusion v ON o.OrganizationHierarchyUnitLevelFourCd = v.OrganizationHierarchyUnitCd
...
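To see why the extra NULL row shows up, here is a small sketch with made-up rows. It assumes (the question does not show the data) that more than one unit rolls up to the same level-four organization; the aliases by_unit and by_level4 are hypothetical.

```sql
-- Toy rows (hypothetical): two units that roll up to the same level-four org.
with Org_Hierarchy (OrganizationHierarchyUnitCd, OrganizationHierarchyUnitLevelFourCd) as (
    select 1234, 1234 union all
    select 5678, 1234
),
Validation_Email_Org_Unit_Inclusion (OrganizationHierarchyUnitCd, Include) as (
    select 1234, 1
)
select distinct
       o.OrganizationHierarchyUnitLevelFourCd,
       by_unit.Include   as include_joined_on_unit_cd,
       by_level4.Include as include_joined_on_level_four_cd
from Org_Hierarchy o
-- Join as in the question: matches only the unit whose own code is listed.
left outer join Validation_Email_Org_Unit_Inclusion by_unit
    on o.OrganizationHierarchyUnitCd = by_unit.OrganizationHierarchyUnitCd
-- Join on the column that is actually output: matches every unit of the org.
left outer join Validation_Email_Org_Unit_Inclusion by_level4
    on o.OrganizationHierarchyUnitLevelFourCd = by_level4.OrganizationHierarchyUnitCd;
```

Here the unit-code join yields 1 for unit 1234 and NULL for unit 5678, so DISTINCT cannot collapse the two rows. The level-four join yields 1 for both, which is the single value the question expects.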

SQL counting and multiple subqueries on huge tables

I have a few SQL tables, named FOS, keywords, and PRef. Their structure and relationships are as follow:
+------------------+ +------------------+ +-----------------+
| FOS | | keywords | | PRef |
+------------------+ +------------------+ +-----------------+
|fosID (PK) |--+ |pkID (PK) | +---|pID1 (PK) |
|fosName | +---|fosID(FK) | +---|pID2 (PK) |
+------------------+ |paperID (FK) |--+ +-----------------+
( 53k+ rows) +------------------+ ( 952M+ rows)
( 157M+ rows)
Currently I can do it by supplying a single fosID to my query, but since the FOS table contains over 1k records, I do not have the manpower to feed in every fosID manually, get its corresponding rowCount, and then merge all the results:
declare @fosID varchar(10)='1234567890';--my fosID
select fos.fosID,fos.fosName,count(*) as rowCount
from PRef pr left join FOS fos on fos.fosID=@fosID
where
pr.pID1 in(SELECT paperID FROM keywords k where k.fosID=@fosID)
OR pr.pID2 in(SELECT paperID FROM keywords k where k.fosID=@fosID)
group by fos.fosID,fos.fosName
Then it gives a correct result as:
+----------+--------+----------+
|fosID |fosName |rowCount |
+----------+--------+----------+
|1234567890|name1 |34 |
+----------+--------+----------+
Now I want to get a list of all FOS items and the number of records in PRef for EACH of the 53k+ FOS items.
I've tried changing the condition where k.fosID=@fosID to where k.fosID in (select fosID from FOS), but that produced a lower count.
Any suggestions on how to solve this problem?
P.S. I am looking at cursors right now but the performance is really...really slow
Edit 1: Expected results:
+----------+--------+----------+
|fosID |fosName |rowCount |
+----------+--------+----------+
|1234567890|name1 |34 |
|1234567891|name2 |3 |
|1234567892|name3 |23 |
|..... |.... |... |
+----------+--------+----------+
(exact same number of rows as table FOS)
You could just modify your subqueries to use correlated subqueries (rowCount is bracketed because ROWCOUNT is a reserved word in SQL Server):
select fos.fosID, fos.fosName, count(*) as [rowCount]
from PRef pr cross join
FOS fos
where pr.pID1 in (SELECT paperID FROM keywords k where k.fosID = fos.fosID) OR
pr.pID2 in (SELECT paperID FROM keywords k where k.fosID = fos.fosID)
group by fos.fosID, fos.fosName;
My guess is that the performance would be pretty bad.
Here is one alternative:
select fos.*, kp.cnt
from fos outer apply
(select count(*) as cnt
from keywords k join
pref pr
on k.paperID in (pr.pID1, pr.pID2) and
k.fosID = fos.fosID
) kp;
I imagine that this will also have pretty bad performance characteristics.
If you can do each id separately, then SQL Server should be able to come up with a better execution plan:
select fos.*, (kp1.cnt + kp2.cnt)
from fos outer apply
(select count(*) as cnt
from keywords k join
pref pr
on k.paperID = pr.pID1 and
k.fosID = fos.fosID
) kp1 outer apply
(select count(*) as cnt
from keywords k join
pref pr
on k.paperID = pr.pID2 and
k.fosID = fos.fosID
) kp2;
First, I suspect you could gain significant improvement by checking the data types in your tables. It looks like you're using varchar(10) with only numeric digits?
That sort of absurdity goes unnoticed on small tables, but on 900M rows it can waste in excess of 5 GB, affecting storage, memory and performance.
Second, FOS is only really used to look up fosName, and at 53k rows it is the smaller part of the work. So start by getting your counts per fosID correct; then join for the names.
;with CountPerFos as (
SELECT k.fosID, COUNT(*) AS fosCount
FROM PRef r
INNER JOIN keywords k ON
r.PID1 = k.paperID
OR r.PID2 = k.paperID
GROUP BY k.fosID
)
SELECT f.fosID, f.fosName,
--Need to handle fosIDs missing from CTE above
COALESCE(c.fosCount, 0) AS fosCount
FROM FOS f
LEFT OUTER JOIN CountPerFos c
ON f.fosID = c.fosID

Selecting a number of related records into a result row

I am currently writing an export function for an MS Access database and I am not quite sure how to write a query that gives me the results that I want.
What I am trying to do is the following:
Let's say I have a table Error and there is a many-to-many relationship to the table Cause, modeled by the table ErrorCauses. Currently I have a query similar to this (simplified; the original also goes one relationship further):
select Error.ID, Cause.ID
from ((Error inner join ErrorCauses on Error.ID = ErrorCauses.Error)
left join Cause on ErrorCauses.Cause = Cause.ID)
I get something like this:
Error | Cause
-------------
12345 | 12
12345 | 23
67890 | 23
67890 | 34
But I need to select the IDs of the first, say, 3 Causes for each error (even if those are empty), so that it looks like this:
Error | Cause1 | Cause2 | Cause3
--------------------------------
12345 | 12 | 23 |
67890 | 23 | 34 |
Is there any way to do this in a single query?
Like selecting the Top 3 and then flattening this into the resulting row?
Thanks in advance for any pointers.
Your requirement is for a specific number of causes--3. This makes it possible and manageable to get three different causes on the same row by doing a three-way join on the same subquery.
First, let's define your error-and-cause query as a straight-up Access query (a QueryDef object, if you want to be technical).
qryErrorCauseInfo:
select
Error.ID as ErrorID
, Cause.ID as CauseID
from (Error
inner join ErrorCauses
on Error.ID = ErrorCauses.Error)
left outer join Cause
on ErrorCauses.Cause = Cause.ID
By the way, I feel that the above left join should really be an inner join, for the reason I mentioned in my comment.
Next, let's do a three-way join to get possible combinations of causes in rows:
qryTotalCause:
select distinct
*
, iif(Cause1 is null, 0, 1)
+ iif(Cause2 is null, 0, 1)
+ iif(Cause3 is null, 0, 1) as TotalCause
from (
select
eci1.ErrorID
, eci1.CauseID as Cause1
, iif(eci2.CauseID = Cause1, null, eci2.CauseID) as Cause2
, iif(
eci3.CauseID = Cause1 or eci3.CauseID = Cause2
, null
, eci3.CauseID
) as Cause3
from (qryErrorCauseInfo as eci1
left outer join qryErrorCauseInfo as eci2
on eci1.ErrorID = eci2.ErrorID)
left outer join qryErrorCauseInfo as eci3
on eci2.ErrorID = eci3.ErrorID
) as sq
where (
Cause1 < Cause2
and Cause2 < Cause3
) or (
Cause1 < Cause2
and Cause3 is null
) or (
Cause2 is null
and Cause3 is null
) or (
Cause1 is null
and Cause2 is null
and Cause3 is null
)
Finally, we need a correlated subquery to select, for each error, the one row with the highest number of causes (the rest of the rows are simply different permutations of the same causes):
select
ErrorID
, Cause1
, Cause2
, Cause3
from qryTotalCause as tc1
where tc1.TotalCause = (
select max(tc2.TotalCause)
from qryTotalCause as tc2
where tc1.ErrorID = tc2.ErrorID
)
Simple! (Not :-)