SELECT fields from one table with aggregates from related table - sql

Here is a simplified description of 2 tables:
CREATE TABLE jobs(id PRIMARY KEY, description);
CREATE TABLE dates(id PRIMARY KEY, job REFERENCES jobs(id), date);
There may be one or more dates per job.
I would like to create a query which generates the following (in pidgin):
jobs.id, jobs.description, min(dates.date) as start, max(dates.date) as finish
I have tried something like this:
SELECT id, description,
(SELECT min(date) as start FROM dates d WHERE d.job=j.id),
(SELECT max(date) as finish FROM dates d WHERE d.job=j.id)
FROM jobs j;
which works, but looks very inefficient.
I have tried an INNER JOIN, but can’t see how to join jobs with a suitable aggregate query on dates.
Can anybody suggest a clean efficient way to do this?

While retrieving all rows: aggregate first, join later:
SELECT id, j.description, d.start, d.finish
FROM jobs j
LEFT JOIN (
SELECT job AS id, min(date) AS start, max(date) AS finish
FROM dates
GROUP BY job
) d USING (id);
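To see the "aggregate first, join later" query at work, here is a minimal sketch run from Python against an in-memory SQLite database. SQLite, the column types and the sample rows are assumptions made purely for illustration; the question does not say which database is used.

```python
import sqlite3

# Hypothetical sample data; SQLite is used only so the sketch is runnable.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE jobs(id INTEGER PRIMARY KEY, description TEXT);
    CREATE TABLE dates(id INTEGER PRIMARY KEY,
                       job INTEGER REFERENCES jobs(id),
                       date TEXT);
    INSERT INTO jobs VALUES (1, 'paint'), (2, 'plaster'), (3, 'no dates yet');
    INSERT INTO dates(job, date) VALUES
        (1, '2020-01-05'), (1, '2020-01-02'), (1, '2020-01-09'),
        (2, '2020-02-01');
""")

# Aggregate dates once per job, then LEFT JOIN so that jobs
# without any dates still appear (with NULL start/finish).
rows = conn.execute("""
    SELECT id, j.description, d.start, d.finish
    FROM jobs j
    LEFT JOIN (
        SELECT job AS id, min(date) AS start, max(date) AS finish
        FROM dates
        GROUP BY job
    ) d USING (id)
    ORDER BY id
""").fetchall()

for row in rows:
    print(row)
```

Job 3 has no dates, so the LEFT JOIN keeps it with NULL (None in Python) for start and finish; an INNER JOIN would drop that row.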
Related:
SQL: How to save order in sql query?
About JOIN .. USING
It's not a "different type of join". USING (col) is a standard SQL (!) syntax shortcut for ON a.col = b.col. More precisely, quoting the manual:
The USING clause is a shorthand that allows you to take advantage of
the specific situation where both sides of the join use the same name
for the joining column(s). It takes a comma-separated list of the
shared column names and forms a join condition that includes an
equality comparison for each one. For example, joining T1 and T2 with
USING (a, b) produces the join condition ON T1.a = T2.a AND T1.b = T2.b.
Furthermore, the output of JOIN USING suppresses redundant columns:
there is no need to print both of the matched columns, since they must
have equal values. While JOIN ON produces all columns from T1 followed
by all columns from T2, JOIN USING produces one output column for each
of the listed column pairs (in the listed order), followed by any
remaining columns from T1, followed by any remaining columns from T2.
It's particularly convenient that you can write SELECT * FROM ... and joining columns are only listed once.
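The column-suppression behaviour quoted above can be checked with a small sketch. The tables t1 and t2 and their columns x and y are made up for the example, and SQLite stands in for the database the manual describes; other engines may label duplicate columns differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical tables sharing the join columns a and b.
    CREATE TABLE t1(a INTEGER, b INTEGER, x TEXT);
    CREATE TABLE t2(a INTEGER, b INTEGER, y TEXT);
    INSERT INTO t1 VALUES (1, 10, 'left');
    INSERT INTO t2 VALUES (1, 10, 'right');
""")

def column_names(sql):
    """Return the output column names of a query."""
    cur = conn.execute(sql)
    return [col[0] for col in cur.description]

# JOIN ... USING lists each join column once.
using_cols = column_names("SELECT * FROM t1 JOIN t2 USING (a, b)")

# JOIN ... ON keeps the join columns from both sides.
on_cols = column_names(
    "SELECT * FROM t1 JOIN t2 ON t1.a = t2.a AND t1.b = t2.b"
)

print(using_cols)
print(len(on_cols))
```

With USING the result has four columns (a, b, x, y); with ON it has six, because a and b come back from both tables.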

In addition to Erwin's solution, you can also use a window clause:
SELECT DISTINCT j.id, j.description,
first_value(d.date) OVER w AS start,
last_value(d.date) OVER w AS finish
FROM jobs j
JOIN dates d ON d.job = j.id
WINDOW w AS (PARTITION BY j.id ORDER BY d.date
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING);
Window functions effectively group by one or more columns (the PARTITION BY clause) and/or ORDER BY some other columns and then you can apply some window function to it, or even a regular aggregate function, without affecting grouping or ordering of any other columns (description in your case). It requires a somewhat different way of constructing queries, but once you get the idea it is pretty brilliant.
In your case you need the first value of the partition, which is easy because the default window frame starts at the first row of the partition. You also need the last value in the partition, which lies beyond the default frame (it ends at the current row), so the ROWS clause is needed to extend the frame to the end of the partition. Since two columns use the same window definition, the WINDOW clause is used here to name it once; if the definition applied to a single column, you could write it inline after OVER, without a name: OVER (PARTITION BY ... ORDER BY ...).
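A runnable sketch of the window approach, again assuming SQLite (3.25 or later, which is when window functions arrived) and invented sample rows. The sketch uses DISTINCT so that each job appears once, since window functions do not collapse rows the way GROUP BY does.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical sample data for illustration only.
    CREATE TABLE jobs(id INTEGER PRIMARY KEY, description TEXT);
    CREATE TABLE dates(id INTEGER PRIMARY KEY,
                       job INTEGER REFERENCES jobs(id),
                       date TEXT);
    INSERT INTO jobs VALUES (1, 'paint'), (2, 'plaster');
    INSERT INTO dates(job, date) VALUES
        (1, '2020-01-05'), (1, '2020-01-02'), (1, '2020-01-09'),
        (2, '2020-02-01');
""")

# The named window w spans the whole partition, so last_value()
# sees the latest date and not just the current row.
rows = conn.execute("""
    SELECT DISTINCT j.id, j.description,
           first_value(d.date) OVER w AS start,
           last_value(d.date)  OVER w AS finish
    FROM jobs j
    JOIN dates d ON d.job = j.id
    WINDOW w AS (PARTITION BY j.id ORDER BY d.date
                 ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
    ORDER BY j.id
""").fetchall()

for row in rows:
    print(row)
```

Without DISTINCT, job 1 would come back three times (once per date), each time with the same start and finish.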

Related

SQL join: keep same column name, then refer to it

I'm regularly running into the following issue.
select
A.command_id as command_id,
sum(B.compile_time) as compile_time,
sum(B.run_time) as run_time,
compile_time + run_time as total_time
from commands as A
inner join subcommands as B on A.command_id = B.command_id
group by A.command_id
This doesn't seem to work because on line 5, the SQL engine seems to think that I'm referring to the columns of table B, and not the columns of the resulting table. Is there a way to fix that? Something like this.compile_time?
Of course I can rename the columns of the resulting table, e.g. total_compile_time and total_run_time. But this situation happens to me enough times that I hate having to be creative about the naming every time. It just makes sense to have the same column names in the result.
You can't use a column alias elsewhere in the same SELECT list, because the alias only comes into existence once the SELECT clause has been evaluated; it is not yet available inside that clause.
To avoid the error, repeat the sum function:
select
A.command_id as command_id,
sum(B.compile_time) as compile_time,
sum(B.run_time) as run_time,
sum(B.compile_time) + sum(B.run_time) as total_time
from commands as A
inner join subcommands as B on A.command_id = B.command_id
group by A.command_id
The database engine evaluates clauses in a specific logical order, and in that order the alias only exists after the SELECT clause has completed.
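A small sketch of the repeated-sum query, run against SQLite with made-up rows. It also shows a common alternative that the answer does not mention: compute the sums in a derived table, so that the outer query can refer to the aliases. The alias totals and the integer columns are assumptions for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical sample data.
    CREATE TABLE commands(command_id INTEGER PRIMARY KEY);
    CREATE TABLE subcommands(command_id INTEGER,
                             compile_time INTEGER,
                             run_time INTEGER);
    INSERT INTO commands VALUES (1), (2);
    INSERT INTO subcommands VALUES (1, 2, 3), (1, 4, 5), (2, 10, 20);
""")

# Variant from the answer: repeat the aggregate expressions.
repeated = conn.execute("""
    select
        A.command_id as command_id,
        sum(B.compile_time) as compile_time,
        sum(B.run_time) as run_time,
        sum(B.compile_time) + sum(B.run_time) as total_time
    from commands as A
    inner join subcommands as B on A.command_id = B.command_id
    group by A.command_id
    order by A.command_id
""").fetchall()

# Alternative (not from the answer): aggregate in a derived table,
# then the outer SELECT can use the aliases as ordinary columns.
derived = conn.execute("""
    select command_id, compile_time, run_time,
           compile_time + run_time as total_time
    from (
        select
            A.command_id as command_id,
            sum(B.compile_time) as compile_time,
            sum(B.run_time) as run_time
        from commands as A
        inner join subcommands as B on A.command_id = B.command_id
        group by A.command_id
    ) as totals
    order by command_id
""").fetchall()

print(repeated)
print(derived)
```

Both queries return the same rows; the derived-table form keeps the result column names compile_time and run_time without repeating the sums.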

When to Use * in SQL Query Containing JOINs & Aggregations?

Question
web_events table contains id, ..., channel, account_id
accounts table contains id, ..., sales_rep_id
sales_reps table contains id, name
Given the above tables, write an SQL query to determine the number of times a particular channel was used in the web_events table for each name in sales_reps. Your final table should have three columns - the name of the sales_reps, the channel, and the number of occurrences. Order your table with the highest number of occurrences first.
Answer
SELECT s.name, w.channel, COUNT(*) num_events
FROM accounts a
JOIN web_events w
ON a.id = w.account_id
JOIN sales_reps s
ON s.id = a.sales_rep_id
GROUP BY s.name, w.channel
ORDER BY num_events DESC;
The COUNT(*) is confusing to me. I don't get how SQL figures out that COUNT(*) is COUNT(w.channel). Can anyone clarify?
I don't get how SQL figures out that COUNT(*) is COUNT(w.channel)
COUNT() is an aggregation function that counts the number of rows that match a condition. In fact, COUNT(<expression>) in general (or COUNT(column) in particular) counts the number of rows where the expression (or column) is not NULL.
In general, the following do exactly the same thing:
COUNT(*)
COUNT(1)
COUNT(<primary key used on inner join>)
In general, I prefer COUNT(*) because that is the SQL standard for this. I can accept COUNT(1) as a recognition that COUNT(*) is just feature bloat. However, I see no reason to use the third version, because it just requires excess typing.
More than that, I find that new users often get confused between these two constructs:
COUNT(w.channel)
COUNT(DISTINCT w.channel)
People learning SQL often think the first really does the second. For this reason, I recommend sticking with the simpler ways of counting rows. Then use COUNT(DISTINCT) when you really want to incur the overhead to count unique values (COUNT(DISTINCT) is more expensive than COUNT()).
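The differences between these forms show up as soon as the column contains NULLs or repeated values. A minimal sketch, assuming SQLite and a cut-down web_events table with invented rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical rows: two 'direct', one 'facebook', one NULL channel.
    CREATE TABLE web_events(id INTEGER PRIMARY KEY,
                            channel TEXT,
                            account_id INTEGER);
    INSERT INTO web_events(channel, account_id) VALUES
        ('direct', 1), ('direct', 1), ('facebook', 1), (NULL, 1);
""")

counts = conn.execute("""
    SELECT COUNT(*),                 -- all rows
           COUNT(1),                 -- all rows (1 is never NULL)
           COUNT(id),                -- all rows (primary key is never NULL)
           COUNT(channel),           -- rows where channel IS NOT NULL
           COUNT(DISTINCT channel)   -- distinct non-NULL channel values
    FROM web_events
""").fetchone()

print(counts)
```

The first three counts agree (4 rows), COUNT(channel) skips the NULL row (3), and COUNT(DISTINCT channel) counts each channel once (2).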

Avoid repeated information when having multiple joins?

I have the following query that uses joins to join multiple tables
select DISTINCT
tblArticles.Article_Title,
tblArticles.Article_img,
tblArticles.Article_Content,
tblArticles.Article_Date_Created,
tblArticles.Article_Sequence,
tblWriters.Writer_Name,
tblTypes.Article_Type_Name,
tblimages.image_path as "Extra images"
from tblArticles inner join tblWriters
on tblArticles.Writer_ID_Fkey = tblWriters.Writer_ID inner join
tblArticleType on tblArticles.Article_ID = tblArticleType.Article_ID_Fkey inner join
tblTypes on tblArticleType.Article_Type_ID_Fkey = tblTypes.Article_Type_ID left outer join tblExtraImages
on tblArticles.Article_ID = tblExtraImages.Article_ID_Fkey left outer join tblimages
on tblExtraImages.image_id_fkey = tblimages.image_id
order by tblArticles.Article_Sequence, tblArticles.Article_Date_Created;
And I get the following results:
If an article has more than one type_name then I will get repeated columns for the rest of the records. Is there another way of joining these tables that would prevent that from happening?
The simplest method is to just remove column Article_Type_Name from the select clause. This allows SELECT DISTINCT to identify the rows as duplicates, and eliminate them.
Another option is to use an aggregation function on the column. In recent SQL Server versions, STRING_AGG() comes in handy (you can also use MIN() or MAX()):
select
tblArticles.Article_Title,
tblArticles.Article_img,
tblArticles.Article_Content,
tblArticles.Article_Date_Created,
tblArticles.Article_Sequence,
tblWriters.Writer_Name,
string_agg(tblTypes.Article_Type_Name, ',')
within group(order by tblTypes.Article_Type_Name) Article_Type_Name_List,
tblimages.image_path as Extra_Images
from ..
group by
tblArticles.Article_Title,
tblArticles.Article_img,
tblArticles.Article_Content,
tblArticles.Article_Date_Created,
tblArticles.Article_Sequence,
tblWriters.Writer_Name,
tblimages.image_path
What you're seeing here is a Cartesian product; you've joined tables in such a way that multiple rows from one side match rows from the other.
If you don't care about the article type, then group by the other columns and take max(Article_Type_Name), or leave it out by using a subquery that selects distinct records, not including the article type column, from the table that contains the article type. If your SQL Server is recent enough and you want to know all the article types, you could STRING_AGG them into a CSV list.
Ultimately, what you choose depends on what you want the rows for: filter them out, or group them down.
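The row multiplication and the "group them down" fix can be reproduced on a small scale. This sketch uses SQLite, where group_concat() plays the role of SQL Server's STRING_AGG(); the simplified tables articles and article_types are assumptions and not the poster's schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical, simplified schema: one article can have many types.
    CREATE TABLE articles(article_id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE article_types(article_id INTEGER, type_name TEXT);
    INSERT INTO articles VALUES (1, 'Intro'), (2, 'Deep dive');
    INSERT INTO article_types VALUES (1, 'news'), (1, 'tech'), (2, 'tech');
""")

# Plain join: an article with two types comes back twice.
joined = conn.execute("""
    SELECT a.title, t.type_name
    FROM articles a
    JOIN article_types t ON t.article_id = a.article_id
    ORDER BY a.article_id, t.type_name
""").fetchall()

# Grouped: one row per article, types collapsed into a list.
# Note: SQLite does not guarantee the order inside group_concat(),
# unlike STRING_AGG ... WITHIN GROUP (ORDER BY ...) in SQL Server.
grouped = conn.execute("""
    SELECT a.title, group_concat(t.type_name, ',') AS type_list
    FROM articles a
    JOIN article_types t ON t.article_id = a.article_id
    GROUP BY a.article_id, a.title
    ORDER BY a.article_id
""").fetchall()

print(joined)
print(grouped)
```

The plain join yields three rows for two articles, while the grouped query yields exactly one row per article.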

return the rows representing days that were hotter than average for the zip code

Please take a look at the attached image for the question as well as the corresponding data table and an example of what the output should look like.
You can solve this problem by using a nested query with the AVG aggregate function and a left outer join:
select z2.zip_code, z2.measurement_date, z2.noon_temp
from zip_temps z2 left outer join
(
select avg(z1.noon_temp) noon_temp, z1.zip_code
from zip_temps z1
group by z1.zip_code ) z3
on (z2.zip_code=z3.zip_code)
where z3.noon_temp < z2.noon_temp;
Here is a solution using analytic functions.
select zip_code, measurement_date, noon_temp
from (
select zip_code, measurement_date, noon_temp,
avg(noon_temp) over (partition by zip_code) as avg_temp
from zip_temps
)
where noon_temp > avg_temp
;
(Add an order by clause at the end if needed.)
The older approach is to have an aggregate subquery, to process the entire input table once and produce the average temperature for each zip code. Then the main query reads the main table a second time, it joins to this aggregate subquery by zip code, and outputs the needed rows. So the base table is read twice, and we have a join as well. (Other approaches, such as correlated subqueries, would require even more work, but the optimizer is smart enough to transform them into a join.)
Analytic functions were introduced specifically for this kind of problem, to reduce the amount of work needed. The average temperature (by zip code) is calculated essentially the same way (by partitioning or grouping by zip code), it is attached to every row in the input, and then the where clause in the outer query compares two values in the same row output by the subquery. There is no need to read the base data a second time, and there is no join.
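Both approaches can be compared side by side in a short sketch. It assumes SQLite (3.25 or later for the analytic function) and a handful of invented readings; the subquery alias t is added because SQLite and several other engines expect derived tables to be named.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical readings: averages are 80 for 10001 and 63 for 94105.
    CREATE TABLE zip_temps(zip_code TEXT,
                           measurement_date TEXT,
                           noon_temp INTEGER);
    INSERT INTO zip_temps VALUES
        ('10001', '2020-07-01', 80),
        ('10001', '2020-07-02', 90),
        ('10001', '2020-07-03', 70),
        ('94105', '2020-07-01', 60),
        ('94105', '2020-07-02', 66);
""")

# Analytic version: the per-zip average is attached to every row.
analytic = conn.execute("""
    select zip_code, measurement_date, noon_temp
    from (
        select zip_code, measurement_date, noon_temp,
               avg(noon_temp) over (partition by zip_code) as avg_temp
        from zip_temps
    ) as t
    where noon_temp > avg_temp
    order by zip_code, measurement_date
""").fetchall()

# Join version: aggregate subquery joined back to the base table.
joined = conn.execute("""
    select z2.zip_code, z2.measurement_date, z2.noon_temp
    from zip_temps z2
    join (
        select zip_code, avg(noon_temp) as avg_temp
        from zip_temps
        group by zip_code
    ) z3 on z2.zip_code = z3.zip_code
    where z2.noon_temp > z3.avg_temp
    order by z2.zip_code, z2.measurement_date
""").fetchall()

print(analytic)
print(joined)
```

Both queries return the same two rows; which one is faster depends on the engine and the data, so it is worth checking the query plan on the real table.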

How to join more than one column between 2 tables

I am currently having trouble with learning SQL, and am unable to get a table to join to another one when two or more of the columns in both tables are the same.
For example, I have 2 tables:
(I'm not sure how to post the code so I've just posted a link I hope that this is ok)
This is table 1, it shows how long each stage of each Project will take
http://puu.sh/gt92M/3dfe0063f0.png
This is table 2, it shows how long the stage of each project has been worked upon
http://puu.sh/gt9HO/2fd5090c9a.png
So far I have been able to put them into the same table, but I am unable to get the hours taken into its own column, currently they mix with the hours needed column.
SELECT ID, Stage, SUM(Hours_Taken)
FROM Work
GROUP BY ID, Stage
UNION
SELECT ID, Stage, Hours
FROM Budget_Allocation
GROUP BY ID, Stage
As you can see, each project has stages, and each stage needs a different amount of work hours. I want to be able to display a 4 columned table:
ID
Stage
Hours
Hours_Taken.
You are asking for a result whose columns include some derived from one table and others derived from a different table. That means you need to perform some kind of JOIN. The UNION operator does not join tables, it just collates multiple row sets into a single row set, eliminating duplicates.
One of the rowsets you want to select from is not a base table, however, but rather the result of an aggregate query. This calls for a subquery, the results of which you join to the other base table as needed:
SELECT
tw.ID AS ID,
tw.Stage AS Stage,
ba.Hours AS Hours,
tw.Hours_Taken AS Hours_Taken
FROM
Budget_Allocation ba
-- JOIN operator --
JOIN (
-- here's the subquery --
SELECT ID, Stage, SUM(Hours_Taken) AS Hours_Taken
FROM Work
GROUP BY ID, Stage
) tw
-- predicate for the preceding JOIN operator --
ON ba.ID = tw.ID AND ba.Stage = tw.Stage
Note that in this case you do not want to join base tables first and then aggregate rows of the joint results, because you are selecting values from one column (Budget_Allocation.Hours) that is neither a grouping column nor a function of the groups. There are workarounds and implementation-specific exceptions to that limitation, but in this case it's easy to do the right thing straight off by aggregating before joining.
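Here is the aggregate-before-join query run against a few invented rows, as a minimal sketch. SQLite, the integer column types and the data are assumptions; the real tables are only visible in the linked screenshots.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical data: budgeted hours per stage, and logged work entries.
    CREATE TABLE Budget_Allocation(ID INTEGER, Stage INTEGER, Hours INTEGER);
    CREATE TABLE Work(ID INTEGER, Stage INTEGER, Hours_Taken INTEGER);
    INSERT INTO Budget_Allocation VALUES (1, 1, 10), (1, 2, 20), (2, 1, 5);
    INSERT INTO Work VALUES (1, 1, 3), (1, 1, 4), (1, 2, 8), (2, 1, 5);
""")

# Sum the work entries per (ID, Stage) first, then join the totals
# to the budget table on both columns.
rows = conn.execute("""
    SELECT tw.ID AS ID,
           tw.Stage AS Stage,
           ba.Hours AS Hours,
           tw.Hours_Taken AS Hours_Taken
    FROM Budget_Allocation ba
    JOIN (
        SELECT ID, Stage, SUM(Hours_Taken) AS Hours_Taken
        FROM Work
        GROUP BY ID, Stage
    ) tw ON ba.ID = tw.ID AND ba.Stage = tw.Stage
    ORDER BY tw.ID, tw.Stage
""").fetchall()

for row in rows:
    print(row)
```

Each (ID, Stage) pair comes back once, with the budgeted Hours and the summed Hours_Taken in separate columns.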
You are doing a UNION instead of a JOIN.
select w.id, w.stage, w.hours_taken, b.hours
from work w, budget_allocation b
where w.id = b.id and
w.stage = b.stage;
Now you have everything you need in one row and can do what you want with it.