How to include values with count 0? - sql

I have the following table called Trains.
id | name | train_id
1 Carl 1
2 Kat 1
3 Paul 2
4 Adam 4
5 Janet 4
6 James 4
I am trying to count for each name how many other people are in the same train.
Here's what I've gotten so far:
SELECT T1.name, COUNT(T2.name)
FROM Trains T1, Trains T2
WHERE T1.name<>T2.name AND T1.train_id=T2.train_id
GROUP BY T1.name;
However, the result I get is
Janet 2
Adam 2
Kat 1
Carl 1
James 2
but I should also have the name 'Paul' there with count 0. I am new to SQL and I am unsure of how I could change my code to have the zero values here as well.

If you phrase your current logic as a left join, it should work:
SELECT t1.name, COUNT(t2.name) AS cnt
FROM Trains t1
LEFT JOIN Trains t2
ON t1.name <> t2.name AND t1.train_id = t2.train_id
GROUP BY t1.name;
Demo
The problem with your current approach is that it doing an old school implicit inner join, not a left join. This means that first the join happens, then the WHERE clause is filtering off the missing Paul record. By using a left join, all names on the left side of the join are retained.

I don't think a join is needed. Just use window functions:
select name, count(*) over (partition by train_id) - 1
from trains t;
Basically, count(*) over (partition by train_id) counts the number of rows on the train. The - 1 is to subtract the current row.

We use COUNT(*) which counts all of the input rows for a group.
(COUNT() also works with expressions, but it has slightly different behavior.)
Here's how the database executes this query:
FROM train_id: — First, retrieve all of the records from the Trains table.
GROUP BY name — Next, determine the unique name groups.
SELECT ... — Finally, select the name and the count of the number of rows in that group.
We also give this count of rows an alias using AS people_in_same_train to make the output more readable.
SELECT
name, COUNT(*) AS people_in_same_train
FROM Trains
GROUP BY name;

Related

Joining on closest timestamp? Presto SQL

I have 2 tables with epoch values. One with multiple samples per minute such as:
id
First_name
epoch_time
1
Paul
1650317420
2
Jeff
1650317443
3
Raul
1650317455
And one with 1 sample per minute:
id
Home
epoch_time
1
New York
1650317432
What I would like to do is join on the closest timestamp between the two tables. Ideally, finding the closest values between tables 1 and 2 and then populating a field from table 1 and 2. Id like to populate the 'Home' field and keep the rest of the records from table 1 as is, such as:
id
Name
Home
epoch_time
1
Paul
New York
1650317420
2
Jeff
New York
1650317443
3
Raul
New York
1650317455
The problem is the actual join. The ID is not unique hence why I need to not only join on ID but also scan for the closest epoch time between the 2 tables. I cannot use correlated subqueries, since Presto doesn't support correlated subqueries.
Answered my own question. It was as simple as first adding some offset such as a LEAD() between each minute sample and then using a BETWEEN in the join between the tables on the current minute sample looking ahead 59 seconds. Such that:
WITH tbl1 AS (
SELECT
*
FROM table_1
),
tbl2 AS (
SELECT
*,
LEAD(epoch_time) OVER (
PARTITION BY
name,
home
ORDER BY
epoch_time
) - 1 AS next_time
FROM table_2
)
SELECT
t1.Id,
t1.Name,
t2.Home,
t1.epoch_time
FROM tbl1 t1
LEFT JOIN tbl2 t2
ON t1.Id = t2.Id
AND t1.epoch_time BETWEEN t2.epoch_time AND t2.next_time

Optimal SQL to perform multiple aggregate functions with different group by fields

To simplify a complex query I am working on, I feel like solving this is key.
I have the following table
id
city
Item
1
chicago
1
2
chicago
2
3
chicago
1
4
cedar
2
5
cedar
1
6
cedar
2
7
detroit
1
I am trying to find the ratio of number of rows grouped by city and item to the number of rows grouped by just the items for each and every unique city-item pair.
So I would like something like this
City
Item
groupCityItemCount
groupItemCount
Ratio
chicago
1
2
4
2/4
chicago
2
1
3
1/3
cedar
1
1
4
1/4
cedar
2
2
3
2/3
detroit
1
1
4
1/4
This is my current solution but its too slow.
Select city, item, (count(*) / (select count(*) from records t2 where t1.item=t2.item)) AS pen_ratio
From records t1
Group By city, item
Also replaced where with groupBy and having but that is also slow.
Select city, item, (count(*) / (select count(*) from records t2 group by item having t1.item=t2.item)) AS pen_ratio
From records t1
Group By city, item
(Note: I have removed column3 and column4 from the solution for smaller code)
(Edit: Typo as pointed out by xQbert and
MatBailie)
Is it slow because it's evaluating each row separately with the subquery in the select statement? It may be operating as a correlated subquery.
If that's the case it might be faster if you get the values out of a join and go from there -
Select city, t1.item, (COUNT(t1.item) / MAX(t2.it_count)) AS pen_ratio
from records t1
JOIN (SELECT item, count(item) AS it_count
FROM records
group by item) t2
ON t2.item = t1.item
GROUP BY city, t1.item
Updated some errors and included the fiddle based off the starting point from xQbert. I had to CAST as float in the fiddle, but you may not need to CAST and use the above query in yours depending on datatypes.
I believe this follows the intent of your original query.
https://dbfiddle.uk/?rdbms=postgres_13&fiddle=d77a715175159304b9192a16ad903347
You can approach it in two parts.
First, aggregate to the level you're interested in, as normal.
Then, use analytical functions to work out subtotals across your partitions (item, in your case).
WITH
aggregate AS
(
SELECT
city,
item,
COUNT(*) AS row_count
FROM
records
GROUP BY
city,
item
)
SELECT
city,
item,
row_count AS groupCityItemCount,
SUM(row_count) OVER (PARTITION BY item) AS groupItemCount
FROM
aggregate
Fiddle borrowed from xQbert
https://dbfiddle.uk/?rdbms=mysql_8.0&fiddle=730146262267412522f6e27796151f43

unable to use LIMIT when using correlated query

I have two tables in Postgres. I want to get the latest 3records data from table.
Below is the query:
select two.sid as sid,
two.sidname as sidname,
two.myPercent as mypercent,
two.saccur as saccur,
one.totalSid as totalSid
from table1 one,table2 two
where one.sid = two.sid;
The above query displays all records checking the condition one.sid = two.sid;I want to get only recent 3 records data(4,5,6) from table2.
I know in Postgres we can use limit to limit the rows to retrieve, but here in table2 for each ID I have multiple rows. So I guess I cannot use limit on table2 but should use on table1. Any suggestions?
table1:
sid totalSid
1 10
2 20
3 30
4 40
5 50
6 60
table2:
sid sidname myPercent saccur
1 aaaa 11 11t
1 bbb 13 13g
1 ccc 11 11g
1 qw 88 88k
//more data for 2,3,4,5....
6 xyz 89 895W
6 xyz1 90 90k
6 xyz2 91 91p
6 xyz3 92 92q
Given a changed understanding of the question a simple subquery and join should suffice.
We select everything from table1 limit to 3 records in sid order desc. This gives us the 3 most recent Sid's and then join to table2 to get the other SID relevant data. The assumption here is that SID is unique in table one and "most recent" would be those records having the highest SID.
SELECT two.sid as sid
, two.sidname as sidname
, two.myPercent as mypercent
, two.saccur as saccur
, one.totalSid as totalSid
FROM (SELECT * FROM table1 ORDER BY SID DESC LIMIT 3) one
INNER JOIN table2 two
ON one.sid = two.sid;
*note I removed a comma after one alias above.
and below we reinstated the ANSI 88 join syntax using , notation.
SELECT two.sid as sid
, two.sidname as sidname
, two.myPercent as mypercent
, two.saccur as saccur
, one.totalSid as totalSid
FROM (SELECT * FROM table1 ORDER BY SID DESC LIMIT 3) one
, table2 two
WHERE one.sid = two.sid;
This syntax basically says get the 3 most recent SIDs from table one and cross join (For each record in one match it to all records in two) that to all records in table two but then return only records that have the same SID on both sides. Modern compilers may be able to use Cost based optimization to improve performance here negating the need to do the entire cross join; however, order of operation says this is what the database would normally have to do. if one and two are both tables of substantial size, you can see the cross join could result in a very large temporary dataset

Select data from a table where only the first two columns are distinct

Background
I have a table which has six columns. The first three columns create the pk. I'm tasked with removing one of the pk columns.
I selected (using distinct) the data into a temp table (excluding the third column), and tried inserting all of that data back into the original table with the third column being '11' for every row as this is what I was instructed to do. (this column is going to be removed by a DBA after I do this)
However, when I went to insert this data back into the original table I get a pk constraint error. (shocking, I know)
The other three columns are just date columns, so the distinct select didn't create a unique pk for each record. What I'm trying to achieve is just calling a distinct on the first two columns, and then just arbitrarily selecting the three other columns as it doesn't matter which dates I choose (at least not on dev).
What I've tried
I found the following post which seems to achieve what I want:
How do I (or can I) SELECT DISTINCT on multiple columns?
I tried the answers from both Joel,and Erwin.
Attempt 1:
However, with Joels answer the set returned is too large - the inner join isn't doing what I thought it would do. Selecting distinct col1 and col2 there are 400 columns returned, however when I use his solution 600 rows are returned. I checked the data and in fact there were duplicate pk's. Here is my attempt at duplicating Joels answer:
select a.emp_no,
a.eec_planning_unit_cde,
'11' as area, create_dte,
create_by_emp_no, modify_dte,
modify_by_emp_no
from tempdb.guest.temp_part_time_evaluator b
inner join
(
select emp_no, eec_planning_unit_cde
from tempdb.guest.temp_part_time_evaluator
group by emp_no, eec_planning_unit_cde
) a
ON b.emp_no = a.emp_no AND b.eec_planning_unit_cde = a.eec_planning_unit_cde
Now, if I execute just the inner select statement 400 rows are returned. If I select the whole query 600 rows are returned? Isn't inner join supposed to only show the intersection of the two sets?
Attempt 2:
I also tried the answer from Erwin. This one has a syntax error and I'm having trouble googling the spec on the where clause (specifically, the trick he is using with (emp_no, eec_planning_unit_cde))
Here is the attempt:
select emp_no,
eec_planning_unit_cde,
'11' as area, create_dte,
create_by_emp_no,
modify_dte,
modify_by_emp_no
where (emp_no, eec_planning_unit_cde) IN
(
select emp_no, eec_planning_unit_cde
from tempdb.guest.temp_part_time_evaluator
group by emp_no, eec_planning_unit_cde
)
Now, I realize that the post I referenced is for postgresql. Doesn't T-SQL have something similar? Trying to google parenthesis isn't working too well.
Overview of Questions:
Why doesn't inner join return an intersection of two sets? From googling this is what I thought it was supposed to do
Is there another way to achieve the same method that I was trying in attempt 2 in t-sql?
It doesn't matter to me which one of these I use, or if I use another solution... how should I go about this?
A select distinct will be based on all columns so it does not guarantee the first two to be distinct
select pk1, pk2, '11', max(c1), max(c2), max(c3)
from table
group by pk1, pk2
You could TRY this:
SELECT a.emp_no,
a.eec_planning_unit_cde,
b.'11' as area,
b.create_dte,
b.create_by_emp_no,
b.modify_dte,
b.modify_by_emp_no
FROM
(
SELECT emp_no, eec_planning_unit_cde
FROM tempdb.guest.temp_part_time_evaluator
GROUP BY emp_no, eec_planning_unit_cde
) a
JOIN tempdb.guest.temp_part_time_evaluator b
ON a.emp_no = b.emp_no AND a.eec_planning_unit_cde = b.eec_planning_unit_cde
That would give you a distinct on those fields but if there is differences in the data between columns you might have to try a more brute force approch.
SELECT a.emp_no,
a.eec_planning_unit_cde,
a.'11' as area,
a.create_dte,
a.create_by_emp_no,
a.modify_dte,
a.modify_by_emp_no
FROM
(
SELECT ROW_NUMBER() OVER(ORDER BY emp_no, eec_planning_unit_cde) rownumber,
a.emp_no,
a.eec_planning_unit_cde,
a.'11' as area,
a.create_dte,
a.create_by_emp_no,
a.modify_dte,
a.modify_by_emp_no
FROM tempdb.guest.temp_part_time_evaluator
) a
WHERE rownumber = 1
I'll reply one by one:
Why doesn't inner join return an intersection of two sets? From googling this is what I thought it was supposed to do
Inner join don't do an intersection. Le'ts supose this tables:
T1 T2
n s n s
1 A 2 X
2 B 2 Y
2 C
3 D
If you join both tables by numeric column you don't get the intersection (2 rows). You get:
select *
from t1 inner join t2
on t1.n = t2.n;
| N | S |
---------
| 2 | B |
| 2 | B |
| 2 | C |
| 2 | C |
And, your second query approach:
select *
from t1
where t1.n in (select n from t2);
| N | S |
---------
| 2 | B |
| 2 | C |
Is there another way to achieve the same method that I was trying in attempt 2 in t-sql?
Yes, this subquery:
select *
from t1
where not exists (
select 1
from t2
where t2.n = t1.n
);
It doesn't matter to me which one of these I use, or if I use another solution... how should I go about this?
yes, using #JTC second query.

Select a subgroup of records by one distinct column

Sorry if this has been answered before, but all the related questions didn't quite seem to match my purpose.
I have a table that looks like the following:
ID POSS_PHONE CELL_FLAG
=======================
1 111-111-1111 0
2 222-222-2222 0
2 333-333-3333 1
3 444-444-4444 1
I want to select only distinct ID values for an insert, but I don't care which specific ID gets pulled out of the duplicates.
For Example(a valid SELECT would be):
1 111-111-1111 0
2 222-222-2222 0
3 444-444-4444 1
Before I had the CELL_FLAG column, I was just using an aggregate function as so:
SELECT ID, MAX(POSS_PHONE)
FROM TableA
GROUP BY ID
But I can't do:
SELECT ID, MAX(POSS_PHONE), MAX(CELL_FLAG)...
because I would lose integrity within the row, correct?
I've seen some similar examples using CTEs, but once again, nothing that quite fit.
So maybe this is solvable by a CTE or some type of self-join subquery? I'm at a block right now, so I can't see any other solutions.
Just get your aggregation in a subquery and join to it:
SELECT a.ID, sub.Poss_Phone, CELL_FLAG
FROM TableA as a
INNER JOIN (SELECT ID, MAX(POSS_PHONE) as [Poss_Phone]
FROM TableA
GROUP BY ID) Sub
ON Sub.ID = a.ID and SUB.Poss_Phone = A.Poss_Phone
This will keep integrity between your non-aggregated fields but still give you the MAX(Poss_Phone) per ID.