I'm trying to create a view that takes a base table and joins it to information from multiple other tables and returns one row per row in the original table. For the sake of example, let's say I'm matching college graduates to employment and graduate school data... because that is, in fact, what I'm doing. Now, the issue here is that I can get multiple matches in the employment and graduate school data. People could work for more than one employer, or they could go to one grad school and then decide to transfer to another. This creates duplicate rows when I join, which then need to be eliminated through aggregation (or some other means).
My current solution is to do nested joins/queries something like this:
select Agg.ID, Agg.GradYear, max(Salary) as Salary,
case when sum(case when S.Year=Agg.GradYear+1 then 1 else 0 end)>0 then 1 else 0 end as HasGradSchool
from
(
select G.ID, G.GradYear, sum(case when W.Year=G.GradYear+1 then W.Wages else null end) as Salary
from
(
select ID, GradYear
from dbo.Students
where Graduated=1
) as G
left join dbo.Wages as W
on G.ID=W.ID
group by G.ID, G.GradYear
) as Agg
left join dbo.GradSchool as S
on Agg.ID=S.ID
group by Agg.ID, Agg.GradYear
This seems a bit ugly to me, especially if I want to bring in more data (say, I now want to look for them in the military too). Is there a better way of accomplishing the joining? If I just straight up join the three tables together, I'll end up double counting people's wages if they have 2 grad school records, for example... Let me know if you've got a solution!
SELECT
U.ID,
U.GradYear,
W.Salary,
S.HasGradSchool
FROM dbo.Students U
OUTER APPLY (
SELECT
SUM(Wages) AS Salary
FROM dbo.Wages
WHERE ID = U.ID
AND Year = U.GradYear+1
) W
OUTER APPLY (
SELECT TOP 1
1 AS HasGradSchool
FROM dbo.GradSchool
WHERE ID = U.ID
AND Year = U.GradYear+1
) S
where U.Graduated=1
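If OUTER APPLY is not available on your database, the same idea (collapse each child table to at most one row per student before joining) can be sketched with pre-aggregated subqueries and plain LEFT JOINs. This is only a minimal sketch, run through Python's sqlite3 module so it is self-contained; the sample rows are invented and the table names drop the dbo. prefix.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Students   (ID INTEGER PRIMARY KEY, GradYear INTEGER, Graduated INTEGER);
CREATE TABLE Wages      (ID INTEGER, Year INTEGER, Wages REAL);
CREATE TABLE GradSchool (ID INTEGER, Year INTEGER);

-- Invented sample data: student 1 has two employers AND two grad-school rows.
INSERT INTO Students VALUES (1, 2010, 1), (2, 2010, 1), (3, 2011, 0);
INSERT INTO Wages VALUES (1, 2011, 1000), (1, 2011, 500), (2, 2011, 700);
INSERT INTO GradSchool VALUES (1, 2011), (1, 2011);
""")

# Each child table is reduced to one row per (ID, Year) before the join,
# so the two one-to-many relationships cannot multiply each other.
QUERY = """
SELECT U.ID, U.GradYear, W.Salary,
       CASE WHEN S.ID IS NULL THEN 0 ELSE 1 END AS HasGradSchool
FROM Students AS U
LEFT JOIN (SELECT ID, Year, SUM(Wages) AS Salary
           FROM Wages
           GROUP BY ID, Year) AS W
       ON W.ID = U.ID AND W.Year = U.GradYear + 1
LEFT JOIN (SELECT DISTINCT ID, Year FROM GradSchool) AS S
       ON S.ID = U.ID AND S.Year = U.GradYear + 1
WHERE U.Graduated = 1
ORDER BY U.ID
"""

rows = conn.execute(QUERY).fetchall()
for row in rows:
    print(row)
```

Student 1's wages are summed once even though there are two grad-school rows, and adding another source (military records, say) would just be one more pre-aggregated LEFT JOIN.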
I have 4 tables: employees, skills, interests, and goals. I'm trying to display all skills, interests, and goals for one employee, but there are a different number of values in each of the skills, interests, and goals tables. For example, employee 1 has 3 interests, 5 skills, and 4 goals. What I'm trying to do is display 5 rows and 5 columns. Column 1 would have the name of employee 1 listed 5 times. Column 2 would have the skills of the employee. Column 3 would have the interests of the employee with 2 nulls. Column 4 would have the goals of the employee with one null.
I have tried a number of different joins, but I keep getting all possible combinations as the output.
Any help on this would be greatly appreciated.
Okay, this is an ugly solution, but it works for me, at least in SQL Server:
WITH employees_temp as (
SELECT employees.first_name, row_number() over (ORDER BY id) as RowNum
FROM employees
WHERE employees.first_name LIKE 'will'
),
skills_temp as (
SELECT skills.skill, row_number() over (ORDER BY skills.employee_id) as RowNum
FROM skills
INNER JOIN employees
ON skills.employee_id = employees.id
WHERE employees.first_name LIKE 'will'
),
goals_temp as (
SELECT goals.goal, row_number() over (ORDER BY goals.employee_id) as RowNum
FROM goals
INNER JOIN employees
ON goals.employee_id = employees.id
WHERE employees.first_name LIKE 'will'
),
interests_temp as (
SELECT interests.interest, row_number() over (ORDER BY interests.employee_id) as RowNum
FROM interests
INNER JOIN employees
ON interests.employee_id = employees.id
WHERE employees.first_name LIKE 'will'
)
select employees_temp.first_name, skills_temp.skill, goals_temp.goal, interests_temp.interest
from employees_temp
full outer join skills_temp
on employees_temp.RowNum = skills_temp.RowNum
-- coalesce keeps later lists aligned on rows where an earlier list has already run out
full outer join goals_temp
on coalesce(employees_temp.RowNum, skills_temp.RowNum) = goals_temp.RowNum
full outer join interests_temp
on coalesce(employees_temp.RowNum, skills_temp.RowNum, goals_temp.RowNum) = interests_temp.RowNum
What this is doing is selecting the data you need for each of the four queries, and adding in a row number. Then, we join all four of those together, joining on the row number. We need a row number to join on, or else it will, as you saw, create every possible combination. The row number functions as a sort of dummy ID for us to join on in order to prevent that.
A couple caveats:
Your database needs to support FULL OUTER JOIN for this to work (MySQL, for example, does not).
As a matter of principle, I would filter on the employee ID when possible, rather than pattern matching on the first name. This would also simplify the query by removing the need for the joins to employees in the first half.
If you do so, and it's feasible in your situation, you're probably best off declaring a variable at the very beginning of the query to hold the employee ID and then using that variable throughout, so you don't have to update it in multiple places.
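For what it's worth, here is a rough sketch of the row-number idea combined with the caveats above: it filters on a single employee ID parameter and, instead of FULL OUTER JOIN, builds a "spine" of row numbers and LEFT JOINs each list to it. That spine is a swapped-in technique, not the query above. It runs through Python's sqlite3 module (assuming a SQLite build with window functions, 3.25 or later), and the sample data, the :emp parameter and the CTE names are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employees (id INTEGER PRIMARY KEY, first_name TEXT);
CREATE TABLE skills    (employee_id INTEGER, skill TEXT);
CREATE TABLE interests (employee_id INTEGER, interest TEXT);

-- Invented sample data: three skills but only one interest.
INSERT INTO employees VALUES (1, 'will');
INSERT INTO skills    VALUES (1, 'sql'), (1, 'python'), (1, 'excel');
INSERT INTO interests VALUES (1, 'chess');
""")

# Number each list independently, collect every row number that occurs
# (the spine), then hang each list off the spine by row number.
QUERY = """
WITH s AS (SELECT skill, ROW_NUMBER() OVER (ORDER BY skill) AS rn
           FROM skills WHERE employee_id = :emp),
     i AS (SELECT interest, ROW_NUMBER() OVER (ORDER BY interest) AS rn
           FROM interests WHERE employee_id = :emp),
     spine AS (SELECT rn FROM s UNION SELECT rn FROM i)
SELECT e.first_name, s.skill, i.interest
FROM spine
CROSS JOIN employees AS e
LEFT JOIN s ON s.rn = spine.rn
LEFT JOIN i ON i.rn = spine.rn
WHERE e.id = :emp
ORDER BY spine.rn
"""

rows = conn.execute(QUERY, {"emp": 1}).fetchall()
for row in rows:
    print(row)
```

The name repeats on every row and the shorter list is padded with NULLs (None in Python), with no cross product between skills and interests.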
I'm working in BigQuery. I've got two tables:
TABLE: orgs
code: STRING
group: STRING
TABLE: org_employees
code: STRING
employee_count: INTEGER
The code in each table is effectively a foreign key. I want to get all unique groups, with a count of the orgs in them, and (this is the tricky bit) a count of how many of those orgs only have a single employee. Data that looks like this:
group,orgs,single_handed_orgs
00Q,23,12
00K,15,7
I know how to do the first bit, get the unique groups and count of associated orgs from the orgs table:
SELECT
count(code), group
FROM
[orgs]
GROUP BY group
And I know how to flag single-handed orgs in the org_employees table:
SELECT
code,
(employee_count==1) AS is_single_handed
FROM
[org_employees]
But I'm not sure how to glue them together. Can anyone help?
for BigQuery: legacy SQL
SELECT
[group],
COUNT(o.code) as orgs,
SUM(employee_count = 1) as single_handed_orgs
FROM [orgs] AS o
LEFT JOIN [org_employees] AS e
ON e.code = o.code
GROUP BY [group]
I'm using LEFT JOIN in case some codes are missing from the org_employees table.
for BigQuery: standard SQL
SELECT
grp,
COUNT(o.code) AS orgs ,
SUM(CASE employee_count WHEN 1 THEN 1 ELSE 0 END) AS single_handed_orgs
FROM orgs AS o
LEFT JOIN org_employees AS e
ON e.code = o.code
GROUP BY grp
Note the use of grp instead of group: group is a reserved keyword in standard SQL, so it cannot be used as a bare column name.
Confirmed: you can use the keyword as a column name if you put backticks around it, i.e. `group`.
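To see the LEFT JOIN plus conditional SUM pattern working end to end, here is a small sketch that runs outside BigQuery, using Python's sqlite3 module. The sample rows are invented, and the column is called grp as in the standard SQL version above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orgs          (code TEXT, grp TEXT);
CREATE TABLE org_employees (code TEXT, employee_count INTEGER);

-- Invented sample data: org 'C' has no row in org_employees at all.
INSERT INTO orgs VALUES ('A', '00Q'), ('B', '00Q'), ('C', '00Q'), ('D', '00K');
INSERT INTO org_employees VALUES ('A', 1), ('B', 5), ('D', 1);
""")

# COUNT counts every org in the group; the CASE turns "has exactly one
# employee" into 1/0 so SUM counts only the single-handed orgs.
QUERY = """
SELECT o.grp,
       COUNT(o.code) AS orgs,
       SUM(CASE WHEN e.employee_count = 1 THEN 1 ELSE 0 END) AS single_handed_orgs
FROM orgs AS o
LEFT JOIN org_employees AS e ON e.code = o.code
GROUP BY o.grp
ORDER BY o.grp
"""

rows = conn.execute(QUERY).fetchall()
for row in rows:
    print(row)
```

Org 'C' still counts towards orgs for 00Q thanks to the LEFT JOIN, but contributes 0 to single_handed_orgs because its employee_count is NULL.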
You could join the two tables to get the groups that have just one employee. Then you wrap this in a sub query and you count the groups that you have.
I'm using a COUNT DISTINCT and GROUP BY because I don't know how your data is structured. Is there only a single line per group or multiple?
SELECT
COUNT(DISTINCT group)
FROM (
SELECT
group
FROM
orgs AS o INNER JOIN org_employees AS e ON o.code = e.code
WHERE
employee_count = 1
GROUP BY
group
)
The database used for this question is structured as follows, with primary keys in bold and foreign keys in single quotes (' ').
Countries (Name, Country_ID, area_sqkm, population)
Teams (team_id, name, 'country_id', description, manager)
Stages (stage_id, took_place, start_loc, end_loc, distance, description)
Riders (rider_id, name, 'team_id', year_born, height_cms, weight_kgs, 'country_id', bmi)
Results ('stage_id', 'rider_id', time_seconds)
I am stuck at the question of:
Q: Bradley Wiggins won the tour. Write a query to find the riders who beat him in at least 4 stages, i.e., riders who had a better time than Wiggins in at least 4 of the 21 stages.
I am currently at :
SELECT ri.name
from riders ri
INNER JOIN results re ON ri.name = re.name
WHERE ri.name = 'BRADLEY Wiggins' IN ...
I am unsure how to move on to comparing two time_seconds values.
How can I go about getting to the solution?
Thank you
The task is indeed a little complicated, as it involves several concepts.
The first of these is a self join, i.e. you'll have to select from the same table twice. You want Bradley's results and the others' results, so as to be able to compare them.
select ...
from results bradley
join results other on ...
Or:
select ...
from (select * from results where ...) bradley
join (select * from results where ...) other on ...
Let's use the first option. We add a WHERE clause to pick out Bradley's results, and an ON clause to get the non-Bradleys at the same stage with a better result:
select ...
from results bradley
join results other on other.rider_id <> bradley.rider_id
and other.stage_id = bradley.stage_id
and other.time_seconds < bradley.time_seconds
where bradley.rider_id = (select rider_id from riders where name = 'BRADLEY Wiggins')
The last part is to find riders with at least four better results. This is called aggregation. You want to see riders, so you group by rider_id. And you want to count, so you use COUNT. Moreover you want to restrict results based on COUNT, so you put this in the HAVING clause:
select other.rider_id
from results bradley
join results other on other.rider_id <> bradley.rider_id
and other.stage_id = bradley.stage_id
and other.time_seconds < bradley.time_seconds
where bradley.rider_id = (select rider_id from riders where name = 'BRADLEY Wiggins')
group by other.rider_id
having count(*) >= 4;
As to getting the riders' data, e.g. their names, there are a couple of options:
Join the table and put the columns both in your SELECT clause and your GROUP BY clause. You would do this, if you wanted data from both sets, i.e. riders' data plus the result count.
Subselect the value if you only want one value (e.g. the name). That's simple but really only makes sense when you want only one value from riders table.
You'd change your SELECT clause thus:
select (select name from riders where rider_id = other.rider_id) as name
Write an outer query around the query you already have.
This would be:
select *
from riders
where rider_id in
(
select other.rider_id
from results bradley
join results other on other.rider_id <> bradley.rider_id
and other.stage_id = bradley.stage_id
and other.time_seconds < bradley.time_seconds
where bradley.rider_id = (select rider_id from riders where name = 'BRADLEY Wiggins')
group by other.rider_id
having count(*) >= 4
);
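If you want to try the self join and HAVING steps on a toy data set first, here is a minimal sketch using Python's sqlite3 module. The riders 'Rider A' and 'Rider B', the five stages and all times are invented; only the table and column names follow the schema in the question. It uses the first option from the list above (joining riders and grouping by the name).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE riders  (rider_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE results (stage_id INTEGER, rider_id INTEGER, time_seconds INTEGER);
""")

# Invented data: Wiggins rides 100s every stage, Rider A beats him in
# four stages, Rider B in only three.
conn.executemany("INSERT INTO riders VALUES (?, ?)",
                 [(1, "BRADLEY Wiggins"), (2, "Rider A"), (3, "Rider B")])
results = []
for stage in range(1, 6):
    results.append((stage, 1, 100))
    results.append((stage, 2, 90 if stage <= 4 else 110))
    results.append((stage, 3, 90 if stage <= 3 else 120))
conn.executemany("INSERT INTO results VALUES (?, ?, ?)", results)

# Self join: pair each of Bradley's results with every better result in
# the same stage, then count the pairs per rider and filter with HAVING.
QUERY = """
SELECT r.name, COUNT(*) AS stages_beaten
FROM results AS bradley
JOIN results AS other
  ON other.stage_id = bradley.stage_id
 AND other.time_seconds < bradley.time_seconds
JOIN riders AS r ON r.rider_id = other.rider_id
WHERE bradley.rider_id = (SELECT rider_id FROM riders WHERE name = ?)
GROUP BY other.rider_id, r.name
HAVING COUNT(*) >= ?
"""

winners = conn.execute(QUERY, ("BRADLEY Wiggins", 4)).fetchall()
print(winners)
```

Rider B is filtered out by the HAVING clause because three better stages is below the threshold of four.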
I'm trying to solve a seemingly simple problem, but I think I'm tripping over my understanding of how the EXISTS keyword works. The problem is simple (this is a dumbed-down version of the actual problem): I have a table of students and a table of hobbies. The students table has their student ID and name. Return only the students that share the same number of hobbies with at least one other student (i.e. students who have a unique number of hobbies would not be shown).
So the difficulty I run into is working out how to compare the count of hobbies. What I have tried is this.
SELECT sa.studentnum, COUNT(ha.hobbynum)
FROM student sa, hobby ha
WHERE sa.studentnum = ha.studentnum
AND EXISTS (SELECT *
FROM student sb, hobby hb
WHERE sb.studentnum = hb.studentnum
AND sa.studentnum != sb.studentnum
HAVING COUNT(ha.hobbynum) = COUNT(hb.hobbynum)
)
GROUP BY sa.studentnum
ORDER BY sa.studentnum;
So what appears to be happening is that the count of hobbynums is identical each test, resulting in all of the original table being returned, instead of just those that match the same number of hobbies.
Not tested, but maybe something like this (if I understand the problem correctly):
WITH h AS (
SELECT studentnum, COUNT(hobbynum) OVER (PARTITION BY studentnum) student_hobby_ct
FROM hobby)
SELECT DISTINCT h1.studentnum, h1.student_hobby_ct
FROM h h1 JOIN h h2 ON h1.student_hobby_ct = h2.student_hobby_ct AND
h1.studentnum <> h2.studentnum;
I think your query would only return students who had at least one other student with the same number of hobbies, but you're not returning anything about the students with whom they match. Is that intentional? I'd treat both sides as sub-queries and aggregate before joining on the counts. You could do several things from there. Here it returns, for each student, the number of other students with a matching hobby count; you could also add a HAVING clause on COUNT(DISTINCT sb.studentnum) to narrow this down to the result your query seemed to be after.
with xx as
(SELECT sa.studentnum, count(ha.hobbynum) hobbycount
FROM student sa inner join hobby ha
on sa.studentnum = ha.studentnum
group by sa.studentnum
)
select sa.studentnum, sa.hobbycount, count(distinct sb.studentnum) as matchcount
from
xx sa inner join xx sb on
sa.hobbycount = sb.hobbycount
where
sa.studentnum != sb.studentnum
GROUP by sa.studentnum, sa.hobbycount
ORDER BY sa.studentnum;
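Since the original question was about EXISTS, here is a hedged sketch of the same "aggregate first, then compare" idea written with EXISTS over the pre-aggregated counts. It is run through Python's sqlite3 module with invented sample data, and the CTE name counts is made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE hobby (studentnum INTEGER, hobbynum INTEGER);

-- Invented sample data: students 1 and 2 have two hobbies each,
-- student 3 has one, student 4 has three.
INSERT INTO hobby VALUES (1, 10), (1, 11),
                         (2, 10), (2, 12),
                         (3, 10),
                         (4, 10), (4, 11), (4, 12);
""")

# The counts are computed once per student in the CTE; EXISTS then only
# has to compare two already-aggregated numbers.
QUERY = """
WITH counts AS (
    SELECT studentnum, COUNT(hobbynum) AS hobbycount
    FROM hobby
    GROUP BY studentnum
)
SELECT a.studentnum, a.hobbycount
FROM counts AS a
WHERE EXISTS (SELECT 1
              FROM counts AS b
              WHERE b.hobbycount = a.hobbycount
                AND b.studentnum <> a.studentnum)
ORDER BY a.studentnum
"""

rows = conn.execute(QUERY).fetchall()
print(rows)
```

Note that students with no hobbies at all never appear in counts here; including them would need a LEFT JOIN from the student table.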
I have a case where I want to select every database entry that has an invalid Country, Region, or Area ID. By invalid, I mean an ID for a country, region, or area that no longer exists in my tables. I have four tables: Properties, Countries, Regions, Areas.
I was thinking to do it like this:
SELECT * FROM Properties WHERE
Country_ID NOT IN
(
SELECT CountryID FROM Countries
)
OR
RegionID NOT IN
(
SELECT RegionID FROM Regions
)
OR
AreaID NOT IN
(
SELECT AreaID FROM Areas
)
Now, is my query right? And what would you suggest I do to achieve the same result with better performance?
Your query in fact is optimal.
The LEFT JOINs proposed by others are worse, as they select ALL values and then filter them out.
Most probably your subquery will be optimized to this:
SELECT *
FROM Properties p
WHERE NOT EXISTS
(
SELECT 1
FROM Countries i
WHERE i.CountryID = p.CountryID
)
OR
NOT EXISTS
(
SELECT 1
FROM Regions i
WHERE i.RegionID = p.RegionID
)
OR
NOT EXISTS
(
SELECT 1
FROM Areas i
WHERE i.AreaID = p.AreaID
)
That is the form you should use.
This query selects at most 1 row from each table, and jumps to the next iteration right as it finds this row (i. e. if it does not find a Country for a given Property, it will not even bother checking for a Region).
Again, SQL Server is smart enough to build the same plan for this query and your original one.
Update:
Tested on 512K rows in each table.
All corresponding ID's in dimension tables are CLUSTERED PRIMARY KEY's, all measure fields in Properties are indexed.
For each row in Property, PropertyID = CountryID = RegionID = AreaID, no actual missing rows (worst case in terms of execution time).
NOT EXISTS 00:11 (11 seconds)
LEFT JOIN 01:08 (68 seconds)
You could rewrite it differently as follows:
SELECT p.*
FROM Properties p
LEFT JOIN Countries c ON p.Country_ID = c.CountryID
LEFT JOIN Regions r on p.RegionID = r.RegionID
LEFT JOIN Areas a on p.AreaID = a.AreaID
WHERE c.CountryID IS NULL
OR r.RegionID IS NULL
OR a.AreaID IS NULL
Test the performance difference, if there is any. There should be, as NOT IN is a nasty search, especially over a lot of items, since it HAS to test every single one.
You can also make this faster by indexing the IDs being searched: in each master table (Countries, Regions, Areas) they should be clustered primary keys.
Since this seems to be cleanup sql, this should be ok. But how about using foreign keys so that it does not bother you next time around?
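As a rough illustration of the foreign key suggestion, the sketch below uses Python's sqlite3 module rather than SQL Server, so the syntax details are SQLite's (including the PRAGMA, which SQLite needs because enforcement is off by default). The trimmed-down two-table schema and the sample IDs are assumptions for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only: enable enforcement
conn.execute("CREATE TABLE Countries (CountryID INTEGER PRIMARY KEY)")
conn.execute("""
    CREATE TABLE Properties (
        PropertyID INTEGER PRIMARY KEY,
        CountryID  INTEGER REFERENCES Countries (CountryID)
    )
""")
conn.execute("INSERT INTO Countries VALUES (1)")
conn.execute("INSERT INTO Properties VALUES (10, 1)")  # valid reference

# Inserting a property that points at a non-existent country is rejected.
try:
    conn.execute("INSERT INTO Properties VALUES (11, 99)")
    orphan_insert_rejected = False
except sqlite3.IntegrityError:
    orphan_insert_rejected = True

# Deleting a country that is still referenced is rejected as well.
try:
    conn.execute("DELETE FROM Countries WHERE CountryID = 1")
    parent_delete_rejected = False
except sqlite3.IntegrityError:
    parent_delete_rejected = True

print(orphan_insert_rejected, parent_delete_rejected)
```

With the constraint in place, the invalid rows the cleanup query looks for cannot be created in the first place.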
Well, you could try things like UNION (instead of OR) - but I expect that the optimizer is already doing the best it can given the information available:
SELECT * FROM Properties
WHERE NOT EXISTS (SELECT 1 FROM Areas WHERE Areas.AreaID = Properties.AreaID)
UNION
SELECT * FROM Properties
WHERE NOT EXISTS (SELECT 1 FROM Regions WHERE Regions.RegionID = Properties.RegionID)
UNION
SELECT * FROM Properties
WHERE NOT EXISTS (SELECT 1 FROM Countries WHERE Countries.CountryID = Properties.CountryID)
Subqueries in the conditions can be quite inefficient. Instead, you can do left joins against the related tables. Where there is no matching record, you get a null value. You can use this in the condition to select only the records where a matching record is missing:
select p.*
from Properties p
left join Countries c on c.CountryID = p.Country_ID
left join Regions r on r.RegionID = p.RegionID
left join Areas a on a.AreaID = p.AreaID
where c.CountryID is null or r.RegionID is null or a.AreaID is null
If you're not grabbing the row data from countries/regions/areas you can try using "exists":
SELECT Properties.*
FROM Properties
WHERE Properties.CountryID IS NOT NULL AND NOT EXISTS (SELECT 1 FROM Countries WHERE Countries.CountryID = Properties.CountryID)
OR Properties.RegionID IS NOT NULL AND NOT EXISTS (SELECT 1 FROM Regions WHERE Regions.RegionID = Properties.RegionID)
OR Properties.AreaID IS NOT NULL AND NOT EXISTS (SELECT 1 FROM Areas WHERE Areas.AreaID = Properties.AreaID)
This will typically hint to use the pkey indices of countries et al for the existence check... but whether that is an improvement depends on your data stats, you simply have to plug it into query analyzer and try it.
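One reason for the IS NOT NULL checks in the query above: NOT IN and NOT EXISTS treat a NULL ID in Properties differently. The sketch below shows this on a single lookup table, using Python's sqlite3 module with invented rows; the three-valued NULL logic it relies on is standard SQL, but it's worth checking the behaviour on your own database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Countries  (CountryID INTEGER PRIMARY KEY);
CREATE TABLE Properties (PropertyID INTEGER PRIMARY KEY, CountryID INTEGER);

-- Invented data: property 10 is valid, 11 points at a missing country,
-- 12 has no country at all (NULL).
INSERT INTO Countries  VALUES (1), (2);
INSERT INTO Properties VALUES (10, 1), (11, 99), (12, NULL);
""")

def ids(sql):
    """Run a query and return the PropertyIDs it selects, in order."""
    return [row[0] for row in conn.execute(sql)]

# NULL NOT IN (...) evaluates to NULL, so property 12 is silently skipped.
not_in = ids("""
    SELECT PropertyID FROM Properties
    WHERE CountryID NOT IN (SELECT CountryID FROM Countries)
    ORDER BY PropertyID""")

# NOT EXISTS finds no matching country for NULL, so property 12 is returned.
not_exists = ids("""
    SELECT PropertyID FROM Properties p
    WHERE NOT EXISTS (SELECT 1 FROM Countries c WHERE c.CountryID = p.CountryID)
    ORDER BY PropertyID""")

# Adding the IS NOT NULL guard makes NOT EXISTS agree with NOT IN again.
guarded = ids("""
    SELECT PropertyID FROM Properties p
    WHERE p.CountryID IS NOT NULL
      AND NOT EXISTS (SELECT 1 FROM Countries c WHERE c.CountryID = p.CountryID)
    ORDER BY PropertyID""")

print(not_in, not_exists, guarded)
```

Whether a NULL ID should count as "invalid" is a data question, so pick the variant that matches what you want the cleanup to catch.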