Include zero counts when grouping by multiple columns - sql

I have a table (TCAP) containing the gender (2 categories), race/ethnicity (3 categories), and height (integer in inches) for multiple individuals. For example:
GND RCE HGT
1 3 65
1 2 72
2 1 62
1 2 68
2 1 65
2 2 64
1 3 69
1 1 70
I want to get a count of the number of individuals in each possible gender and race/ethnicity combination. When I group by GND and RCE, however, it doesn't show zero counts. I've tried the following code:
SELECT
GND,
RCE,
COUNT(*) TotalRecords
FROM TCAP
GROUP BY GND, RCE;
This gives me:
GND RCE TotalRecords
1 1 1
1 2 2
1 3 2
2 1 2
2 2 1
I want it to show all possible combinations though. In other words, even though there are no individuals with a gender of 2 and race/ethnicity of 3 in the table, I want that to display as a zero count. So, like this:
GND RCE TotalRecords
1 1 1
1 2 2
1 3 2
2 1 2
2 2 1
2 3 0
I've looked at the responses to similar questions, but they are based on a single group, resolved using an outer join with a table that has all possible values. Would I use a similar process here? Would I create a single table that has all 6 combinations of GND and RCE to join on? Is there another way to accomplish this, especially if the number of combinations increases (for example, 1 group with 5 values and 1 group with 10 values)?
Any help would be much appreciated! Thanks!

You can use a CROSS JOIN to generate every combination of the GND and RCE values, then LEFT JOIN TCAP to that result and count the matches.
Query #1
SELECT t1.GND,t1.RCE,COUNT(t3.GND) TotalRecords
FROM (
SELECT GND,RCE
FROM (
SELECT DISTINCT GND
FROM TCAP
) t1 CROSS JOIN
(
SELECT DISTINCT RCE FROM TCAP
) t2
) t1
LEFT JOIN TCAP t3 ON t3.GND = t1.GND and t3.RCE = t1.RCE
group by t1.GND,t1.RCE;
| GND | RCE | TotalRecords |
| --- | --- | ------------ |
| 1 | 1 | 1 |
| 1 | 2 | 2 |
| 1 | 3 | 2 |
| 2 | 1 | 2 |
| 2 | 2 | 1 |
| 2 | 3 | 0 |

Use a cross join to generate the rows and a left join to bring in the results -- with a final group by:
select g.gnd, r.rce, count(t.gnd) as cnt
from (select distinct gnd from tcap) g cross join
(select distinct rce from tcap) r left join
tcap t
on t.gnd = g.gnd and t.rce = r.rce
group by g.gnd, r.rce;
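To check the cross join plus left join pattern end to end, here is a minimal sketch that loads the question's sample rows into an in-memory SQLite database through Python's sqlite3 module. The names ROWS, TCAP_QUERY and zero_filled_counts are illustrative, and SQLite stands in for whichever database the question actually uses.

```python
import sqlite3

# Sample rows from the question: (GND, RCE, HGT).
ROWS = [(1, 3, 65), (1, 2, 72), (2, 1, 62), (1, 2, 68),
        (2, 1, 65), (2, 2, 64), (1, 3, 69), (1, 1, 70)]

# Build every GND/RCE pair first, then left join the data so that
# pairs without matching rows survive with a count of 0.
TCAP_QUERY = """
SELECT g.GND, r.RCE, COUNT(t.GND) AS TotalRecords
FROM (SELECT DISTINCT GND FROM TCAP) AS g
CROSS JOIN (SELECT DISTINCT RCE FROM TCAP) AS r
LEFT JOIN TCAP AS t ON t.GND = g.GND AND t.RCE = r.RCE
GROUP BY g.GND, r.RCE
ORDER BY g.GND, r.RCE
"""

def zero_filled_counts(rows):
    """Return (GND, RCE, count) for every combination, including zeros."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE TCAP (GND INTEGER, RCE INTEGER, HGT INTEGER)")
    conn.executemany("INSERT INTO TCAP VALUES (?, ?, ?)", rows)
    result = conn.execute(TCAP_QUERY).fetchall()
    conn.close()
    return result

counts = zero_filled_counts(ROWS)
for row in counts:
    print(row)
```

COUNT(t.GND) counts only non-NULL values, which is why the unmatched pair comes back as 0 rather than 1. Note that SELECT DISTINCT can only produce values that occur at least once in TCAP; a category that never appears at all would have to come from a separate lookup table.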

Related

SQL cross match IDs to create new cross-platform ID -> how to optimize

I have a Redshift table with two columns which shows which ID's are connected, that is, belonging to the same person. I would like to make a mapping (extra column) with a unique person ID using SQL.
The problem is similar to this one: SQL: creating unique id for item with several ids
However, in my case the IDs in the two columns are of different kinds, and therefore the suggested joining solution (t1.epid = t2.pid, etc.) will not work.
In the example below there are 4 individual persons using 9 IDs of type 1 and 10 IDs of type 2.
ID_type1 | ID_type2
---------+--------
1 | A
1 | B
2 | C
3 | C
4 | D
4 | E
5 | E
6 | F
7 | G
7 | H
7 | I
8 | I
8 | J
9 | J
9 | B
What I am looking for is an extra column that maps each row to a unique ID for the person. The difficulty is in correctly identifying the IDs related to persons like x and z, who have multiple IDs of both types. The result could look something like this:
ID_type1 | ID_type2 | ID_real
---------+---------------------
1 | A | z
1 | B | z
2 | C | y
3 | C | y
4 | D | x
4 | E | x
5 | E | x
6 | F | w
7 | G | z
7 | H | z
7 | I | z
8 | I | z
8 | J | z
9 | J | z
9 | B | z
I wrote the query below, which goes up to 4 loops and does the job for a small dataset. However, it struggles with larger sets, because the number of rows after joining increases very fast with each loop. I am stuck finding a way to do this more efficiently.
WITH
T1 AS(
SELECT DISTINCT
l1.ID_type1 AS ID_type1,
r1.ID_type1 AS ID_type1_overlap
FROM crossmatch_example l1
LEFT JOIN crossmatch_example r1 USING(ID_type2)
ORDER BY 1,2
),
T2 AS(
SELECT DISTINCT
l1.ID_type1,
r1.ID_type1_overlap
FROM T1 l1
LEFT JOIN T1 r1 on l1.ID_type1_overlap = r1.ID_type1
ORDER BY 1,2
),
T3 AS(
SELECT DISTINCT
l1.ID_type1,
r1.ID_type1_overlap
FROM T2 l1
LEFT JOIN T2 r1 on l1.ID_type1_overlap = r1.ID_type1
ORDER BY 1,2
),
T4 AS(
SELECT DISTINCT
l1.ID_type1,
r1.ID_type1_overlap
FROM T3 l1
LEFT JOIN T3 r1 on l1.ID_type1_overlap = r1.ID_type1
ORDER BY 1,2
),
mapping AS(
SELECT ID_type1,
min(ID_type1_overlap) AS mapped
FROM T4
GROUP BY 1
ORDER BY 1
),
output AS(
SELECT DISTINCT
l1.ID_type1::INT AS ID_type1,
l1.ID_type2,
FUNC_SHA1(r1.mapped) AS ID_real
FROM crossmatch_example l1
LEFT JOIN mapping r1 on l1.ID_type1 = r1.ID_type1
ORDER BY 1,2)
SELECT * FROM output
What you're trying to do is called Transitive Closure. There are articles about how to implement it in SQL.
This is an example in Spark linq-like dsl https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/SparkTC.scala.
The solution to the problem is iterative, and to fully resolve the graph, you may need to apply more iterations. What can be optimised is the input for each iteration. I remember working on it once, but cannot recall the details.
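To make the grouping concrete outside SQL, here is a minimal sketch that treats every ID as a graph node and every row as an edge, then merges nodes with a union-find structure. This is a swapped-in technique for illustration, not the iterative SQL approach described above; PAIRS, find and group_ids are hypothetical names, and each person is labelled by its smallest type-1 ID, mirroring the min(ID_type1_overlap) step in the question's query.

```python
# Rows from the question: (ID_type1, ID_type2).
PAIRS = [(1, "A"), (1, "B"), (2, "C"), (3, "C"), (4, "D"),
         (4, "E"), (5, "E"), (6, "F"), (7, "G"), (7, "H"),
         (7, "I"), (8, "I"), (8, "J"), (9, "J"), (9, "B")]

def find(parent, node):
    """Return the root of node's set, shortening the path on the way."""
    while parent[node] != node:
        parent[node] = parent[parent[node]]
        node = parent[node]
    return node

def group_ids(pairs):
    """Return (ID_type1, ID_type2, label) with one label per person."""
    parent = {}
    for id1, id2 in pairs:
        # Tag each ID with its type so the two ID spaces cannot collide.
        a, b = ("t1", id1), ("t2", id2)
        parent.setdefault(a, a)
        parent.setdefault(b, b)
        root_a, root_b = find(parent, a), find(parent, b)
        if root_a != root_b:
            parent[root_b] = root_a
    # Label every connected component by its smallest type-1 ID.
    smallest = {}
    for id1, _ in pairs:
        root = find(parent, ("t1", id1))
        smallest[root] = min(smallest.get(root, id1), id1)
    return [(id1, id2, smallest[find(parent, ("t1", id1))])
            for id1, id2 in pairs]

mapping = group_ids(PAIRS)
for row in mapping:
    print(row)
```

Each row is processed once, so the intermediate results do not grow the way the repeated self-joins do. The trade-off is that the pairs have to be pulled out of the database first.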

SQL: Add values to STDEVP calculation

I have the following table.
Key | Count | Amount
----| ----- | ------
1 | 2 | 10
1 | 2 | 15
2 | 5 | 1
2 | 5 | 2
2 | 5 | 3
2 | 5 | 50
2 | 5 | 20
3 | 3 | 5
3 | 3 | 4
3 | 3 | 5
Sorry, I couldn't figure out how to make the above a table.
I'm running this on SQL Server Management Studio 2012.
I'd like the stdevp return of the amount columns but if the number of records is less than some value 'x' (there will never be more than x records for a given key), then I want to add zeros to account for the remainder.
For example, if 'x' is 6:
for key 1, I need stdevp(10,15,0,0,0,0)
for key 2, I need stdevp(1,2,3,50,20,0)
for key 3, I need stdevp(5,4,5,0,0,0)
I just need to be able to add zeros to the calculation. I could insert records to my table, but that seems rather tedious.
This seems complicated -- padding data for each key. Here is one approach:
-- xs generates n = 1..6, one slot per value in the padded set
with xs as (
select 0 as val, 1 as n
union all
select 0, n + 1
from xs
where xs.n < 6
)
-- key is a reserved word in SQL Server, so it is bracketed throughout
select k.[key], stdevp(coalesce(t.amount, 0))
from xs cross join
(select distinct [key] from t) k left join
(select t.*, row_number() over (partition by [key] order by [key]) as seqnum
from t
) t
on t.[key] = k.[key] and t.seqnum = xs.n
group by k.[key];
The idea is that the cross join generates 6 rows for each key. Then the left join brings in available rows, up to the maximum.
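As a cross-check on the padded numbers, here is a small sketch using Python's statistics.pstdev on the question's data. AMOUNTS, X, padded_pstdev and padded_pstdev_closed_form are illustrative names. The second function is an assumption-labelled alternative: because the padding values are all zero, the population standard deviation can also be derived from the sum and the sum of squares of the real rows alone, without generating the extra rows.

```python
from math import sqrt
from statistics import pstdev

# Amounts per key, taken from the table in the question.
AMOUNTS = {1: [10, 15], 2: [1, 2, 3, 50, 20], 3: [5, 4, 5]}
X = 6  # target number of values per key

def padded_pstdev(values, size):
    """Population standard deviation after padding with zeros up to size."""
    if len(values) > size:
        raise ValueError("more values than slots")
    return pstdev(list(values) + [0] * (size - len(values)))

def padded_pstdev_closed_form(values, size):
    """Same quantity without materialising the zeros.

    Zeros add nothing to the sum or the sum of squares, so only the
    divisor changes: sqrt(sum(v*v)/size - (sum(v)/size)**2).
    """
    mean = sum(values) / size
    mean_of_squares = sum(v * v for v in values) / size
    # Clamp tiny negative rounding errors before taking the root.
    return sqrt(max(0.0, mean_of_squares - mean * mean))

for key, values in AMOUNTS.items():
    print(key, padded_pstdev(values, X), padded_pstdev_closed_form(values, X))
```

The closed form relies only on aggregates that SQL already provides (sums and a constant divisor), so it may be worth trying if generating the padding rows turns out to be slow.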

SQL - How to get the count of each distinct value?

I have 3 tables
**room**
room_id | nurse_needed
----------------------
1 | 2
2 | 3
3 | 1
**doctor_schedule**
doctor_schedule_id| room_id
---------------------------
1 | 1
2 | 2
3 | 3
**nurse_schedule**
nurse_schedule_id | doctor_schedule_id
--------------------------------------
1 | 1
2 | 1
3 | 2
Each room needs a number of nurses. A doctor works in a room, and a nurse works on a doctor's schedule. I want to count how many nurses are in each room.
The result should be:
room_id | nurse_needed|nurse_have_in_room
---------------------------------------------
1 | 2 | 2
2 | 3 | 1
3 | 1 | 0
Hmmm . . .
select r.*,
(select count(*)
from doctor_schedule ds join
nurse_schedule ns
on ds.doctor_schedule_id = ns.doctor_schedule_id
where ds.room_id = r.room_id
) as nurse_have_in_room
from room r;
select room.*,
(select count(*) from
doctor_schedule docs,
nurse_schedule nurs
where docs.doctor_schedule_id=nurs.doctor_schedule_id
and docs.room_id=room.room_id) as nurse_have_in_room
from room;
Result of the join on doctor_schedule_id between doctor_schedule and nurse_schedule:
nurse_schedule_id | doctor_schedule_id | room_id
------------------+--------------------+--------
1 | 1 | 1
2 | 1 | 1
3 | 2 | 2
Counting these rows per room_id gives the result. The subquery has to be correlated on room_id so that it returns a single value for each room.
select r.room_id,
r.nurse_needed,
ns.nurses_scheduled,
ns.dist_nurses_scheduled
from room r
left join (select ds.room_id,
count(1) nurses_scheduled,
count(distinct ns.nurse_schedule_id) dist_nurses_scheduled
from doctor_schedule ds
join nurse_schedule ns
on ds.doctor_schedule_id = ns.doctor_schedule_id
group by ds.room_id) as ns
on r.room_id = ns.room_id
Left join so you find rooms with no nurses scheduled.
Count(distinct ns.nurse_schedule_id) if needed to see how many different nurses make up the count.
Normally you have a time component in there too. Something like "where r.roomdate = ns.date"
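For a quick check against the expected output, here is a sketch that runs a chained LEFT JOIN variant on the question's data in an in-memory SQLite database. This is one possible formulation rather than any answerer's exact query, and NURSE_SCHEMA, NURSE_QUERY and nurses_per_room are illustrative names.

```python
import sqlite3

NURSE_SCHEMA = """
CREATE TABLE room (room_id INTEGER, nurse_needed INTEGER);
CREATE TABLE doctor_schedule (doctor_schedule_id INTEGER, room_id INTEGER);
CREATE TABLE nurse_schedule (nurse_schedule_id INTEGER, doctor_schedule_id INTEGER);
INSERT INTO room VALUES (1, 2), (2, 3), (3, 1);
INSERT INTO doctor_schedule VALUES (1, 1), (2, 2), (3, 3);
INSERT INTO nurse_schedule VALUES (1, 1), (2, 1), (3, 2);
"""

# Left joins keep rooms without nurses; COUNT of a nullable column
# ignores the NULLs those rooms produce, so they end up with 0.
NURSE_QUERY = """
SELECT r.room_id, r.nurse_needed,
       COUNT(ns.nurse_schedule_id) AS nurse_have_in_room
FROM room AS r
LEFT JOIN doctor_schedule AS ds ON ds.room_id = r.room_id
LEFT JOIN nurse_schedule AS ns
       ON ns.doctor_schedule_id = ds.doctor_schedule_id
GROUP BY r.room_id, r.nurse_needed
ORDER BY r.room_id
"""

def nurses_per_room():
    """Return (room_id, nurse_needed, nurse_have_in_room) for every room."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(NURSE_SCHEMA)
    result = conn.execute(NURSE_QUERY).fetchall()
    conn.close()
    return result

room_counts = nurses_per_room()
for row in room_counts:
    print(row)
```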

SQL - Limiting to one row for matching results

Considering the tables below, how would I write a query that returns profession.profession when the profession.profession_id is present in contractor_has_profession.profession_id, but limited to one result for each profession.profession?
So in this example the result would be [Coder, Database, Frontend]
contractor_has_profession
contractor_id | profession_id
1 | 5
2 | 5
3 | 5
4 | 2
5 | 1
profession
profession_id | profession
1 | Frontend
2 | Database
3 | Graphics
4 | Sound
5 | Coder
SELECT p.profession
FROM profession p
WHERE EXISTS(SELECT *
FROM contractor_has_profession c
WHERE c.profession_id = p.profession_id)
Hmm, this should be sufficient:
select distinct p.profession
from profession p
inner join contractor_has_profession c
on p.profession_id = c.profession_id
or if I'm wrong here, then try:
select p.profession
from profession p
inner join contractor_has_profession c
on p.profession_id = c.profession_id
group by p.profession
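To see that the EXISTS form returns each profession once without needing DISTINCT or GROUP BY, here is a minimal sketch on the question's data, run through Python's sqlite3 module. PROFESSION_SCHEMA, PROFESSION_QUERY and professions_in_use are illustrative names, and the ORDER BY is added only to make the output deterministic.

```python
import sqlite3

PROFESSION_SCHEMA = """
CREATE TABLE contractor_has_profession (contractor_id INTEGER, profession_id INTEGER);
CREATE TABLE profession (profession_id INTEGER, profession TEXT);
INSERT INTO contractor_has_profession VALUES (1, 5), (2, 5), (3, 5), (4, 2), (5, 1);
INSERT INTO profession VALUES
    (1, 'Frontend'), (2, 'Database'), (3, 'Graphics'), (4, 'Sound'), (5, 'Coder');
"""

# EXISTS only tests for a match, so a profession with three
# contractors still yields a single output row.
PROFESSION_QUERY = """
SELECT p.profession
FROM profession AS p
WHERE EXISTS (SELECT 1
              FROM contractor_has_profession AS c
              WHERE c.profession_id = p.profession_id)
ORDER BY p.profession
"""

def professions_in_use():
    """Return the names of professions held by at least one contractor."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(PROFESSION_SCHEMA)
    result = [name for (name,) in conn.execute(PROFESSION_QUERY)]
    conn.close()
    return result

in_use = professions_in_use()
print(in_use)
```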

Help with optimising SQL query

Hi, I need some help with this problem.
I am working on a web application and am using SQLite for the database. Can someone help me optimize one query so that it runs fast?
I have table x:
ID | ID_DISH | ID_INGREDIENT
1 | 1 | 2
2 | 1 | 3
3 | 1 | 8
4 | 1 | 12
5 | 2 | 13
6 | 2 | 5
7 | 2 | 3
8 | 3 | 5
9 | 3 | 8
10| 3 | 2
....
ID_DISH is the ID of a dish, and ID_INGREDIENT is an ingredient that the dish is made of, so in this example the dish with ID 1 is made with ingredients 2, 3, 8 and 12.
This table has more than 15000 rows, and my question is:
I need a query that returns the IDs of dishes that contain at least one of the ingredients I have passed to my algorithm, ordered ascending by the number of their ingredients that I haven't passed yet.
Example: foo(2,4)
will return rows in this order:
ID_DISH | count(stillMissing)
10 | 2
1 | 3
The dish with ID 10 contains ingredients 2 and 4 and is still missing 2 more, then is
My query is:
SELECT
t2.ID_dish,
(SELECT COUNT(*) as c FROM dishIngredient as t1
WHERE t1.ID_ingredient NOT IN (2,4)
AND t1.ID_dish = t2.ID_dish
GROUP BY ID_dish) as c
FROM dishIngredient as t2
WHERE t2.ID_ingredient IN (2,4)
GROUP BY t2.ID_dish
ORDER BY c ASC
It works, but it is slow.
select ID_DISH, sum(ID_INGREDIENT not in (2, 4)) stillMissing
from x
group by ID_DISH
having stillMissing != count(*)
order by stillMissing
This is the solution: my previous query took 5 to 20 s, while this one takes about 80 ms.
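The single-pass query above can be tried on the ten sample rows from the question with a short sqlite3 sketch. DISH_ROWS, DISH_QUERY and still_missing are illustrative names, the ingredient list (2, 4) is hard-coded as in the answer, and a secondary sort on ID_DISH is added only to make ties deterministic.

```python
import sqlite3

# (ID, ID_DISH, ID_INGREDIENT) rows shown in the question.
DISH_ROWS = [(1, 1, 2), (2, 1, 3), (3, 1, 8), (4, 1, 12), (5, 2, 13),
             (6, 2, 5), (7, 2, 3), (8, 3, 5), (9, 3, 8), (10, 3, 2)]

# In SQLite a comparison evaluates to 0 or 1, so SUM(... NOT IN ...)
# counts the ingredients that are still missing. The HAVING clause
# drops dishes where every ingredient is missing (no match at all).
DISH_QUERY = """
SELECT ID_DISH, SUM(ID_INGREDIENT NOT IN (2, 4)) AS stillMissing
FROM x
GROUP BY ID_DISH
HAVING stillMissing != COUNT(*)
ORDER BY stillMissing, ID_DISH
"""

def still_missing(rows):
    """Return (ID_DISH, stillMissing) for dishes using ingredient 2 or 4."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE x (ID INTEGER, ID_DISH INTEGER, ID_INGREDIENT INTEGER)")
    conn.executemany("INSERT INTO x VALUES (?, ?, ?)", rows)
    result = conn.execute(DISH_QUERY).fetchall()
    conn.close()
    return result

missing = still_missing(DISH_ROWS)
for row in missing:
    print(row)
```

The table is scanned and grouped once, whereas the original query runs a correlated subquery for each dish, which is a plausible reason for the speed difference reported above.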
This is from memory, as I don't know the SQL dialect of sqlite.
SELECT DISTINCT T1.ID_DISH, COUNT(T1.ID_INGREDIENT) as COUNT
FROM dishIngredient as T1 LEFT JOIN dishIngredient as T2
ON T1.ID_DISH = T2.ID_DISH
WHERE T2.ID_INGREDIENT IN (2,4)
GROUP BY T1.ID_DISH
ORDER BY T1.ID_DISH