SQL counting and multiple subqueries on huge tables - sql

I have a few SQL tables, named FOS, keywords, and PRef. Their structure and relationships are as follows:
+------------------+      +------------------+      +-----------------+
| FOS              |      | keywords         |      | PRef            |
+------------------+      +------------------+      +-----------------+
| fosID (PK)       |--+   | pkID (PK)        |  +---| pID1 (PK)       |
| fosName          |  +---| fosID (FK)       |  +---| pID2 (PK)       |
+------------------+      | paperID (FK)     |--+   +-----------------+
   (53k+ rows)            +------------------+         (952M+ rows)
                             (157M+ rows)
Currently I can do it by supplying a single fosID to my query, but since the FOS table contains over 1k records, I do not have the manpower to feed in every fosID manually, collect its rowCount, and then merge all the results:
declare @fosID varchar(10)='1234567890'; -- my fosID
select fos.fosID, fos.fosName, count(*) as rowCount
from PRef pr left join FOS fos on fos.fosID=@fosID
where
pr.pID1 in (SELECT paperID FROM keywords k where k.fosID=@fosID)
OR pr.pID2 in (SELECT paperID FROM keywords k where k.fosID=@fosID)
group by fos.fosID, fos.fosName
This gives the correct result:
+----------+--------+----------+
|fosID |fosName |rowCount |
+----------+--------+----------+
|1234567890|name1 |34 |
+----------+--------+----------+
Now I want to get a list of all FOS items and the number of records in PRef for EACH of the 53k+ FOS items.
I've tried changing the part where k.fosID=@fosID to where k.fosID in (select fosID from FOS), but that produced a lower count.
Any suggestions on how to solve this problem?
P.S. I am looking at cursors right now, but the performance is really, really slow.
Edit 1: Expected results:
+----------+--------+----------+
|fosID |fosName |rowCount |
+----------+--------+----------+
|1234567890|name1 |34 |
|1234567891|name2 |3 |
|1234567892|name3 |23 |
|..... |.... |... |
+----------+--------+----------+
(exact same number of rows as table FOS)

You could just modify your subqueries to use correlated subqueries
select fos.fosID, fos.fosName, count(*) as rowCount
from PRef pr cross join
FOS fos
where pr.pID1 in (SELECT paperID FROM keywords k where k.fosID = fos.fosID) OR
pr.pID2 in (SELECT paperID FROM keywords k where k.fosID = fos.fosID)
group by fos.fosID, fos.fosName;
My guess is that the performance would be pretty bad.
Here is one alternative:
select fos.*, kp.cnt
from fos outer apply
(select count(*) as cnt
from keywords k join
pref pr
on k.paperID in (pr.pID1, pr.pID2) and
k.fosID = fos.fosID
) kp;
I imagine that this will also have pretty bad performance characteristics.
If you can count each id column separately, SQL Server should be able to come up with a better execution plan (note that a PRef row whose pID1 and pID2 both match is then counted twice):
select fos.*, (kp1.cnt + kp2.cnt)
from fos outer apply
(select count(*) as cnt
from keywords k join
pref pr
on k.paperID = pr.pID1 and
k.fosID = fos.fosID
) kp1 outer apply
(select count(*) as cnt
from keywords k join
pref pr
on k.paperID = pr.pID2 and
k.fosID = fos.fosID
) kp2;

First, I suspect you could gain a significant improvement by checking the data types in your tables. It looks like you're using varchar(10) to store values that contain only numeric digits?
That sort of absurdity goes unnoticed on small tables, but on 900M rows it can waste in excess of 5GB, affecting storage, memory and performance.
Second, FOS is only really used to look up fosName, and at 53k rows it is the smaller part of the work. So start by getting your counts per fosID correct, then join for the names.
;with CountPerFos as (
SELECT k.fosID, COUNT(*) AS fosCount
FROM PRef r
INNER JOIN keywords k ON
r.PID1 = k.paperID
OR r.PID2 = k.paperID
GROUP BY k.fosID
)
SELECT f.fosID, f.fosName,
--Need to handle fosIDs missing from CTE above
COALESCE(c.fosCount, 0) AS fosCount
FROM FOS f
LEFT OUTER JOIN CountPerFos c
ON f.fosID = c.fosID
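As a rough sketch of the "count first, then join for the names" idea, here is a tiny runnable version using Python's sqlite3 module. SQLite and the three-row sample data are assumptions made only for illustration; they stand in for the real SQL Server tables. The sketch also swaps the OR join for two equality joins combined with UNION, so that a PRef row whose pID1 and pID2 both belong to the same fosID is counted once, as in the original single-fosID query.

```python
import sqlite3

# In-memory stand-in for the real tables (sample data is hypothetical).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE FOS (fosID TEXT PRIMARY KEY, fosName TEXT);
CREATE TABLE keywords (pkID INTEGER PRIMARY KEY, fosID TEXT, paperID INTEGER);
CREATE TABLE PRef (pID1 INTEGER, pID2 INTEGER, PRIMARY KEY (pID1, pID2));
INSERT INTO FOS VALUES ('f1', 'name1'), ('f2', 'name2'), ('f3', 'name3');
INSERT INTO keywords (fosID, paperID) VALUES ('f1', 1), ('f1', 2), ('f2', 3);
INSERT INTO PRef VALUES (1, 2), (1, 3), (4, 5);
""")

rows = conn.execute("""
WITH Matches AS (
    -- UNION (not UNION ALL) removes duplicates, so a PRef row matched
    -- through both pID1 and pID2 for the same fosID appears only once.
    SELECT k.fosID, r.pID1, r.pID2
    FROM PRef r JOIN keywords k ON k.paperID = r.pID1
    UNION
    SELECT k.fosID, r.pID1, r.pID2
    FROM PRef r JOIN keywords k ON k.paperID = r.pID2
),
CountPerFos AS (
    SELECT fosID, COUNT(*) AS fosCount FROM Matches GROUP BY fosID
)
-- LEFT JOIN keeps every FOS row; COALESCE turns "no matches" into 0.
SELECT f.fosID, f.fosName, COALESCE(c.fosCount, 0) AS fosCount
FROM FOS f
LEFT JOIN CountPerFos c ON c.fosID = f.fosID
ORDER BY f.fosID
""").fetchall()

for row in rows:
    print(row)
```

The result has exactly one row per FOS entry, including entries with no matching PRef rows. Whether the UNION form performs acceptably on 952M rows depends on the available indexes, which this sketch does not address.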

Related

Joining two joins in PostgreSQL

Background
I've needed to learn some PostgreSQL quickly and from scratch in order to do a data analysis project about car insurance. I have a locally stored PostgreSQL database of fairly decent size (around 8 GB of data on insurance claims for vehicles like cars and motorcycles), and I've needed to JOIN and UNION ALL a couple of things in order to get the table I need for my statistical models.
The first part of what I've needed to do is this thing, a JOIN inside of a UNION ALL between two tables about car claims and motorcycle claims:
select
l.customer_combined_id,
l.claim_id,
l."Part_Cd",
l.service_date,
h.principal_problem_cd,
h.problem_code_vers_flag
from claims.auto_claims_line_items as l
JOIN claims.auto_claims_general h on l.claim_id = h.claim_id
UNION ALL
select
l.customer_combined_id,
l.claim_id,
l."Part_Cd",
l.entry_date as service_date,
NULL as principal_problem_cd,
NULL as problem_code_vers_flag
from claims.motorcycle_claims_line_items as l
This yields a table that looks like this (column names abbreviated for aesthetics):
cust_comb_id| claim_id | "Part_Cd" | svc_date | prin_prob_cd | prob_cd_vers_flg |
------------+----------+-----------+----------+--------------+------------------|
| | | | | |
As you can see, the car claims have some columns that the motorcycle claims don't have. This is fine -- I've filled those in as NULL in order to get the UNION ALL to work. Now the car claims table is nicely stacked on top of the motorcycle claims table. So far, so good.
The second part of what I've done so far is this other thing, which concerns data about car and motorcycle insurance policyholders ("customers"):
select m.customer_dob,
m.customer_id,
m.customer_gender_cd,
m.customer_zip_cd,
c.customer_combined_id
from customer."Customer" m
JOIN customer.customer_combined_crosswalk c on m.customer_id = c.customer_id
The result of which looks like this:
dob | customer_id | gender_cd | zip_cd | cust_comb_id |
----+-------------+-----------+--------+--------------|
| | | | |
The Problem
I've figured out two halves of my data manipulation, but I don't know how to put these halves together, so to speak. I want (I think) to left join these two things on cust_comb_id, but I'm not sure how to write it. I want to keep everything in the first part (the claim data) and bring in data from the second part (the policyholders / customers) when cust_comb_id matches, and give null values if it doesn't. Here's a visual of what I'm looking for:
cust_comb_id| claim_id | "Part_Cd" | svc_date | prin_prob_cd | prob_cd_vers_flg |dob | cust_id | gender_cd | zip_cd |
------------+----------+-----------+----------+--------------+------------------|----+---------+-----------+--------+
| | | | | | | | | |
What I've tried
I've tried to use subqueries to join these joins, but I keep getting errors. Edit:
Here's a concrete example of something I've tried:
select *
from
(select
l.customer_combined_id,
l.claim_id,
l."Part_Cd",
l.service_date,
h.principal_problem_cd,
h.problem_code_vers_flag
from claims.auto_claims_line_items as l
JOIN claims.auto_claims_general h on l.claim_id = h.claim_id
UNION ALL
select
l.customer_combined_id,
l.claim_id,
l."Part_Cd",
l.entry_date as service_date,
NULL as principal_problem_cd,
NULL as problem_code_vers_flag
from claims.motorcycle_claims_line_items as l) as cl
LEFT JOIN
select m.customer_dob,
m.customer_id,
m.customer_gender_cd,
m.customer_zip_cd,
c.customer_combined_id
from customer."Customer" m
JOIN customer.customer_combined_crosswalk c on m.customer_id = c.customer_id
This yields the error ERROR: syntax error at or near "select".
Any help is much appreciated.
[Note: customer_combined_id and customer_id are two different things: the combined id is unique, and made to account for when a customer switches from one insurance plan - where they have one customer_id - to another, where they're given a new one.]
So it was a syntax issue.
The OP already had all the needed parts:
- the Part I and Part II subqueries were already implemented
- how to join them was already defined
The only problem was a struggle with syntax.
I suppose this form would be the most readable:
WITH PartI AS(
select
l.customer_combined_id,
l.claim_id,
l."Part_Cd",
l.service_date,
h.principal_problem_cd,
h.problem_code_vers_flag
from claims.auto_claims_line_items as l
JOIN claims.auto_claims_general h on l.claim_id = h.claim_id
UNION ALL
select
l.customer_combined_id,
l.claim_id,
l."Part_Cd",
l.entry_date as service_date,
NULL as principal_problem_cd,
NULL as problem_code_vers_flag
from claims.motorcycle_claims_line_items as l
),
PartII AS (
select m.customer_dob,
m.customer_id,
m.customer_gender_cd,
m.customer_zip_cd,
c.customer_combined_id
from customer."Customer" m
JOIN customer.customer_combined_crosswalk c on m.customer_id = c.customer_id
)
SELECT
*
FROM
PartI P1
LEFT JOIN PartII P2
ON P1.customer_combined_id = P2.customer_combined_id;
https://www.db-fiddle.com/f/msAtD89dn4DndMtxukkgkP/2
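The original attempt fails because the SELECT after LEFT JOIN is not wrapped in parentheses, has no alias, and has no ON clause. For comparison with the CTE form, here is a minimal sketch of the derived-table form with those three things added. It runs against SQLite through Python's sqlite3 module purely for illustration; the table names auto_items, moto_items and customers and their few columns are hypothetical, simplified stand-ins for the real schema.

```python
import sqlite3

# Simplified, hypothetical stand-ins for the claims and customer tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE auto_items (customer_combined_id INTEGER, claim_id INTEGER, service_date TEXT);
CREATE TABLE moto_items (customer_combined_id INTEGER, claim_id INTEGER, entry_date TEXT);
CREATE TABLE customers (customer_combined_id INTEGER, customer_dob TEXT);
INSERT INTO auto_items VALUES (1, 100, '2020-01-01');
INSERT INTO moto_items VALUES (2, 200, '2020-02-02');
INSERT INTO customers VALUES (1, '1980-05-05');
""")

rows = conn.execute("""
SELECT cl.customer_combined_id, cl.claim_id, cl.service_date, cu.customer_dob
FROM (
    -- first half: stacked claims
    SELECT customer_combined_id, claim_id, service_date FROM auto_items
    UNION ALL
    SELECT customer_combined_id, claim_id, entry_date AS service_date FROM moto_items
) AS cl
LEFT JOIN (
    -- second half: parenthesized and aliased, which the failing attempt lacked
    SELECT customer_combined_id, customer_dob FROM customers
) AS cu ON cu.customer_combined_id = cl.customer_combined_id
ORDER BY cl.claim_id
""").fetchall()

for row in rows:
    print(row)
```

The claim with no matching customer is kept, with NULL (None in Python) in the customer column, which is the left-join behaviour the question asks for.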
Alex Yu's answer is better, but I wanted to post this because a) it also works and b) shows a neat use for views in SQL.
Take the first part, and make a view of it by adding a single line of CREATE OR REPLACE VIEW before the first select:
CREATE OR REPLACE VIEW clms AS
select
l.customer_combined_id,
l.claim_id,
l."Part_Cd",
l.service_date,
h.principal_problem_cd,
h.problem_code_vers_flag
from claims.auto_claims_line_items as l
JOIN claims.auto_claims_general h on l.claim_id = h.claim_id
UNION ALL
select
l.customer_combined_id,
l.claim_id,
l."Part_Cd",
l.entry_date as service_date,
NULL as principal_problem_cd,
NULL as problem_code_vers_flag
from claims.motorcycle_claims_line_items as l
Next, do the same for the second part:
CREATE OR REPLACE VIEW cstmr AS
select m.customer_dob,
m.customer_id,
m.customer_gender_cd,
m.customer_zip_cd,
c.customer_combined_id
from customer."Customer" m
JOIN customer.customer_combined_crosswalk c on m.customer_id = c.customer_id
Finally, do a SQL 101-level simple join of the two views (a left join, so that claims without a matching customer are kept):
select *
from clms
left join cstmr m on clms.customer_combined_id = m.customer_combined_id
I bumped into this answer after posting the problem and was happy to find a (somewhat) elegant solution myself.

Returning a number when result set is null

Each lot object contains a corresponding list of work orders. These work orders have tasks assigned to them, which are structured by the task set on the lot's parent (the phase). I am trying to get the LOT_ID back along with a count of TASK_ID where the TASK_ID is found to exist for the where condition.
The problem is if the TASK_ID is not found, the result set is null and the LOT_ID is not returned at all.
I have uploaded a single row for LOT, PHASE, and WORK_ORDER to the following SQLFiddle. I would have added more data but there is a fun limiter .. err I mean character limiter to the editor.
SQLFiddle
SELECT W.[LOT_ID], COUNT(*) AS NUMBER_TASKS_FOUND
FROM [PHASE] P
JOIN [LOT] L ON L.[PHASE_ID] = P.[PHASE_ID]
JOIN [WORK_ORDER] W ON W.[LOT_ID] = L.[LOT_ID]
WHERE P.[TASK_SET_ID] = 1 AND W.[TASK_ID] = 41
GROUP BY W.[LOT_ID]
The query returns the expected result when the task id is found (46) but no result when the task id is not found (say 41). I'd expect in that case to see something like:
+--------+--------------------+
| LOT_ID | NUMBER_TASKS_FOUND |
+--------+--------------------+
| 500 | 0 |
| 506 | 0 |
+--------+--------------------+
I have a feeling this needs to be wrapped in a sub-query and then joined but I am uncertain what the syntax would be here.
My true objective is to be able to pass a list of TASK_ID and get back any LOT_ID that doesn't match, but for now I am just doing a query per task until I can figure that out.
You want to see all lots with their counts for the task. So either outer join the tasks or cross apply their count or use a subquery in the select clause.
select l.lot_id, count(wo.work_order_id) as number_tasks_found
from lot l
left join work_order wo on wo.lot_id = l.lot_id and wo.task_id = 41
where l.phase_id in (select p.phase_id from phase p where p.task_set_id = 1)
group by l.lot_id
order by l.lot_id;
or
select l.lot_id, w.number_tasks_found
from lot l
cross apply
(
select count(*) as number_tasks_found
from work_order wo
where wo.lot_id = l.lot_id
and wo.task_id = 41
) w
where l.phase_id in (select p.phase_id from phase p where p.task_set_id = 1)
order by l.lot_id;
or
select l.lot_id,
(
select count(*)
from work_order wo
where wo.lot_id = l.lot_id
and wo.task_id = 41
) as number_tasks_found
from lot l
where l.phase_id in (select p.phase_id from phase p where p.task_set_id = 1)
order by l.lot_id;
Another option would be to outer join the count and use COALESCE to turn null into zero in your result.
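That last option is not spelled out above, so here is a minimal sketch of it: aggregate the work orders per lot in a derived table, outer join that to the lots, and let COALESCE turn a missing count into zero. It runs on SQLite via Python's sqlite3 module only for illustration; the sample rows and the task_counts helper are hypothetical, and the phase filter is left out to keep the example short.

```python
import sqlite3

# Hypothetical sample data: lot 500 has two work orders for task 46.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE lot (lot_id INTEGER PRIMARY KEY, phase_id INTEGER);
CREATE TABLE work_order (work_order_id INTEGER PRIMARY KEY, lot_id INTEGER, task_id INTEGER);
INSERT INTO lot VALUES (500, 1), (506, 1);
INSERT INTO work_order (lot_id, task_id) VALUES (500, 46), (500, 46), (506, 12);
""")

def task_counts(conn, task_id):
    """Return (lot_id, count) for every lot, with 0 when the task is absent."""
    return conn.execute("""
        SELECT l.lot_id, COALESCE(w.number_tasks_found, 0) AS number_tasks_found
        FROM lot l
        LEFT JOIN (
            -- one row per lot that has the task at least once
            SELECT lot_id, COUNT(*) AS number_tasks_found
            FROM work_order
            WHERE task_id = ?
            GROUP BY lot_id
        ) w ON w.lot_id = l.lot_id
        ORDER BY l.lot_id
    """, (task_id,)).fetchall()

print(task_counts(conn, 41))
print(task_counts(conn, 46))
```

Because the task filter lives inside the derived table rather than in the outer WHERE clause, lots without the task survive the join and simply get a zero.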

SQL query with two EXISTS statements behaving differently than expected

The following SQL query is intended to find label_item_lists which have label_items with given names.
SELECT lils.id FROM label_item_lists AS lils
INNER JOIN label_items AS items ON lils.id = items.label_item_list_id
WHERE EXISTS(SELECT * FROM label_item_lists WHERE items.name=?)
OR EXISTS(SELECT * FROM label_item_lists WHERE items.name=?)
It properly returns the ids of label_item_lists having an item with either name. However, the same query using the AND operator rather than OR returns no results.
SELECT lils.id FROM label_item_lists AS lils
INNER JOIN label_items AS items ON lils.id = items.label_item_list_id
WHERE EXISTS(SELECT * FROM label_item_lists WHERE items.name=?)
AND EXISTS(SELECT * FROM label_item_lists WHERE items.name=?)
There are definitely label_item_list entries that have label_items matching both names provided. In fact the OR SQL query returns the id twice for these entries, but the AND query returns no results. For this reason I think I might be missing an important piece of info on how these JOINed queries with EXISTS work. Thanks for any assistance!
----------------------------------------------------------------
| label_items | id | name | label_item_list_id |
----------------------------------------------------------------
| Row1 | 1 | foo | 1 |
----------------------------------------------------------------
| Row2 | 2 | bar | 1 |
----------------------------------------------------------------
| Row3 | 3 | bar | 2 |
----------------------------------------------------------------
--------------------------------
| label_item_lists | id |
--------------------------------
| Row1 | 1 |
--------------------------------
| Row2 | 2 |
--------------------------------
I want to return the first label_item_list but not the second, as it only has one of the two names I am searching for, 'foo' and 'bar'.
try changing the where condition from
WHERE EXISTS(SELECT * FROM label_item_lists WHERE items.name=?)
AND EXISTS(SELECT * FROM label_item_lists WHERE items.name=?)
to
WHERE EXISTS(SELECT * FROM label_items li WHERE li.label_item_list_id = lils.id AND li.name=?)
AND EXISTS(SELECT * FROM label_items li WHERE li.label_item_list_id = lils.id AND li.name=?)
In your query, AND returns nothing because both filters are applied to the same joined row, and a single label_items row can never have two different names at once, hence the empty output.
With OR, the second condition only matters when the first one is false, so any row matching either name is returned.
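To make the difference concrete, here is a small sketch using the sample rows from the question. Each EXISTS subquery is correlated to the list id and looks at label_items independently, so the two names can be satisfied by two different item rows. SQLite via Python's sqlite3 module is used only as a convenient stand-in, and the lists_with_both helper is a hypothetical name.

```python
import sqlite3

# Sample rows copied from the question.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE label_item_lists (id INTEGER PRIMARY KEY);
CREATE TABLE label_items (id INTEGER PRIMARY KEY, name TEXT, label_item_list_id INTEGER);
INSERT INTO label_item_lists VALUES (1), (2);
INSERT INTO label_items VALUES (1, 'foo', 1), (2, 'bar', 1), (3, 'bar', 2);
""")

def lists_with_both(conn, first, second):
    """Ids of lists that have an item named `first` and an item named `second`."""
    rows = conn.execute("""
        SELECT lils.id
        FROM label_item_lists AS lils
        WHERE EXISTS (SELECT 1 FROM label_items li
                      WHERE li.label_item_list_id = lils.id AND li.name = ?)
          AND EXISTS (SELECT 1 FROM label_items li
                      WHERE li.label_item_list_id = lils.id AND li.name = ?)
        ORDER BY lils.id
    """, (first, second)).fetchall()
    return [r[0] for r in rows]

print(lists_with_both(conn, "foo", "bar"))
```

There is no outer join to label_items here, so each list id appears at most once in the result.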
Try something like this, # is just a place holder to distinguish between two searches:
select * from label_items lil
where label_item_list_id
in (
select li.label_item_list_id from
label_items li
inner join label_items l1
on li.label_item_list_id = l1.label_item_list_id
and li.name <> l1.name
where concat(li.name,'#',l1.name) = 'foo#bar')
This is what I eventually came up with! I'm not 100% confident yet, but it has worked so far. I added a bit of functionality in Ruby and ActiveRecord to allow for as many necessary matches as desired and to return only those which match exactly (without any extra names not in the list).
items = ["foo", "bar"]
db = ActiveRecord::Base.connection
query = <<-EOS
SELECT lils.id FROM label_item_lists AS lils
JOIN label_items AS items ON items.label_item_list_id = lils.id
WHERE lils.id IN (
SELECT label_item_list_id FROM label_items AS items
WHERE items.name IN (#{(['?'] * items.length).join(',')})
) AND lils.id NOT IN (
SELECT label_item_list_id FROM label_items AS items
WHERE items.name NOT IN (#{(['?'] * items.length).join(',')})
)
GROUP BY lils.id
HAVING COUNT(DISTINCT items.name) = #{items.length}
EOS
query = ActiveRecord::Base.send(:sanitize_sql_array, [query, *(items*2)])
Basically it checks that a name is both IN the list of given names (items array) AND it also checks that the name IS NOT outside (or NOT IN) the list of given names. Any list of label_items that has a non-matching name is excluded by the latter IN query. This is helpful so that a label_item_list with both "foo" and "bar" but also "lorem" is not included.
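The same "all of these names and nothing else" check can also be expressed with a single GROUP BY and two conditional counts. The sketch below is an alternative formulation, not the code above translated; it assumes the given names are distinct, runs on SQLite via Python's sqlite3 module for illustration, and uses a hypothetical helper name and a hypothetical third list containing "lorem".

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE label_items (id INTEGER PRIMARY KEY, name TEXT, label_item_list_id INTEGER);
INSERT INTO label_items (name, label_item_list_id) VALUES
    ('foo', 1), ('bar', 1),                -- exactly the wanted names
    ('bar', 2),                            -- one name missing
    ('foo', 3), ('bar', 3), ('lorem', 3);  -- has an extra name
""")

def lists_matching_exactly(conn, names):
    """Ids of lists whose item names are exactly `names` (assumed distinct)."""
    placeholders = ",".join("?" * len(names))
    sql = f"""
        SELECT label_item_list_id
        FROM label_items
        GROUP BY label_item_list_id
        -- every wanted name is present ...
        HAVING COUNT(DISTINCT CASE WHEN name IN ({placeholders}) THEN name END) = ?
        -- ... and no item carries a name outside the list
           AND COUNT(CASE WHEN name NOT IN ({placeholders}) THEN 1 END) = 0
        ORDER BY label_item_list_id
    """
    # Parameters follow the order of the placeholders in the SQL text.
    params = [*names, len(names), *names]
    return [r[0] for r in conn.execute(sql, params).fetchall()]

print(lists_matching_exactly(conn, ["foo", "bar"]))
```

Only the placeholders are interpolated into the SQL string; the names themselves are always passed as bound parameters.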

Is there a simpler way to write this query? [MS SQL Server]

I'm wondering if there is a simpler way to accomplish my goal than what I've come up with.
I am returning a specific attribute that applies to an object. The objects go through multiple iterations and the attributes might change slightly from iteration to iteration. The iteration will only be added to the table if the attribute changes. So the most recent iteration might not be in the table.
Each attribute is uniquely identified by a combination of the Attribute ID (AttribId) and Generation ID (GenId).
Object_Table
ObjectId | AttribId | GenId
32 | 2 | 3
33 | 3 | 1
Attribute_Table
AttribId | GenId | AttribDesc
1 | 1 | Text
2 | 1 | Some Text
2 | 2 | Some Different Text
3 | 1 | Other Text
When I query on a specific object I would like it to return an exact match if possible. For example, Object ID 33 would return "Other Text".
But if there is no exact match, I would like for the most recent generation (largest Gen ID) to be returned. For example, Object ID 32 would return "Some Different Text". Since there is no Attribute ID 2 from Gen 3, it uses the description from the most recent iteration of the Attribute which is Gen ID 2.
This is what I've come up with to accomplish that goal:
SELECT attr.AttribDesc
FROM Attribute_Table AS attr
JOIN Object_Table AS obj
ON attr.AttribId = obj.AttribId
WHERE attr.GenId = (SELECT MIN(GenId)
FROM(SELECT CASE obj2.GenId
WHEN attr2.GenId THEN attr2.GenId
ELSE(SELECT MAX(attr3.GenId)
FROM Attribute_Table AS attr3
JOIN Object_Table AS obj3
ON obj3.AttribId = attr3.AttribId
WHERE obj3.AttribId = 2
)
END AS GenId
FROM Attribute_Table AS attr2
JOIN Object_Table AS obj2
ON attr2.AttribId = obj2.AttribId
WHERE obj2.AttribId = 2
) AS ListOfGens
)
Is there a simpler way to accomplish this? I feel that there should be, but I'm relatively new to SQL and can't think of anything else.
Thanks!
The following query will return the matching value, if found, otherwise use a correlated subquery to return the value with the highest GenId and matching AttribId:
SELECT obj.ObjectId,
CASE WHEN attr1.AttribDesc IS NOT NULL THEN attr1.AttribDesc ELSE attr2.AttribDesc END AS AttribDesc
FROM Object_Table AS obj
LEFT JOIN Attribute_Table AS attr1
ON attr1.AttribId = obj.AttribId AND attr1.GenId = obj.GenId
LEFT JOIN Attribute_Table AS attr2
ON attr2.AttribId = obj.AttribId AND attr2.GenId = (
SELECT max(GenId)
FROM Attribute_Table AS attr3
WHERE attr3.AttribId = obj.AttribId)
In the case where there is no matching record at all with the given AttribId, it will return NULL. If you want to get no record at all in this case, make the second JOIN an INNER JOIN rather than a LEFT JOIN.
Try this...
In case the logic doesn't find a match for the Object_Table GenId, it maps it to the highest available GenId in the ON clause of the JOIN.
SELECT AttribDesc
FROM Object_Table A
INNER JOIN Attribute_Table B
ON A.AttribId = B.AttribId
AND (
CASE
WHEN A.GenId <> B.GenId
THEN (
SELECT MAX(C.GenId)
FROM Attribute_Table C
WHERE A.AttribId = C.AttribId
)
ELSE A.GenId
END
) -- Selecting the right GenId in the join clause should do the job
= B.GenId
This should work:
with x as (
select *, row_number() over (partition by AttribId order by GenId desc) as rn
from Attribute_Table
)
select isnull(a.attribdesc, x.attribdesc)
from Object_Table o
left join Attribute_Table a
on o.AttribId = a.AttribId and o.GenId = a.GenId
left join x on o.AttribId = x.AttribId and rn = 1

MySQL join question

I'm trying to join 3 tables, somehow do a hierarchical inner join, and get data from the 3rd table. My starting point is the article_number (156118) from the article table. Here are the working SQL statements and table structure, but there must be a way to join all this together in one, right?
-- Get the parent task from the article
select task_parent
from article a, tasks t
where a.task_id = t.task_id
and a.article_number = 156118;
-- Get the task id for the 'Blog' task
select task_id
from tasks
where task_parent = 26093
and task_name like '%blog%';
-- Get ALL the blog records
select *
from blogs
where task_id = 26091;
---------Tables------------
* article table *
id | article_number | task_id
1 | 156118 | 26089
* tasks table *
id | task_name | task_parent
26089 | article | 26093
26091 | blogs | 26093
26093 | Main Task | 26093
* blog table *
id | task_id | content
1 | 102 | blah
2 | 102 | blah
3 | 102 | blah
-------------
* How do I get all of blog data with 1 SQl statement using just the article_number?
Thanks in advance!
EDIT: Revised answer after rereading question. You need two joins to the task table. One to get the parent of the article's task and a second to get the blog task with the same parent as the article's task.
select b.id, b.task_id, b.content
from article a
inner join tasks t1
on a.task_id = t1.task_id
inner join tasks t2
on t1.task_parent = t2.task_parent
and t2.task_name like '%blog%'
inner join blogs b
on t2.task_id = b.task_id
where a.article_number = 156118
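To check the double join against data, here is a small sketch that runs the same query on SQLite via Python's sqlite3 module. The task rows follow the question's sample, with the id column named task_id as the queries use it. The blog rows are hypothetical: the question's sample blog rows use task_id 102, which would match nothing, so this sketch assumes the blog rows carry the blog task's id (26091).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE article (id INTEGER PRIMARY KEY, article_number INTEGER, task_id INTEGER);
CREATE TABLE tasks (task_id INTEGER PRIMARY KEY, task_name TEXT, task_parent INTEGER);
CREATE TABLE blogs (id INTEGER PRIMARY KEY, task_id INTEGER, content TEXT);
INSERT INTO article VALUES (1, 156118, 26089);
INSERT INTO tasks VALUES (26089, 'article', 26093),
                         (26091, 'blogs', 26093),
                         (26093, 'Main Task', 26093);
-- Hypothetical blog rows attached to the blog task, plus one unrelated row.
INSERT INTO blogs VALUES (1, 26091, 'first post'),
                         (2, 26091, 'second post'),
                         (3, 999, 'unrelated');
""")

rows = conn.execute("""
SELECT b.id, b.task_id, b.content
FROM article a
INNER JOIN tasks t1                      -- the article's own task
    ON a.task_id = t1.task_id
INNER JOIN tasks t2                      -- sibling task with the same parent
    ON t1.task_parent = t2.task_parent
   AND t2.task_name LIKE '%blog%'
INNER JOIN blogs b
    ON t2.task_id = b.task_id
WHERE a.article_number = ?
ORDER BY b.id
""", (156118,)).fetchall()

for row in rows:
    print(row)
```

Only the article number is supplied; the parent task and the blog task id are resolved inside the query.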
Looks like you want to tie them all together and just use the article number as the parameter...
Try:
select b.*
from blogs b, tasks t, tasks tp, article a
where b.task_id = t.task_id
and t.task_parent = tp.task_parent
and t.task_name like '%blog%'
and tp.task_id = a.task_id
and a.article_number = 156118
Here you go.
SELECT b.*
FROM article a
INNER JOIN tasks t USING (task_id)
INNER JOIN tasks tb ON tb.task_parent = t.task_parent AND tb.task_name LIKE '%blog%'
INNER JOIN blogs b ON b.task_id = tb.task_id
WHERE a.article_number = 156118;