SQL Filter based on results from SQL query - sql

Input table - t1
make | model | engine | kms_covered | start | end
-------------------------------------------------------
suzuki | sx4 | petrol | 11 | City A | City D
suzuki | sx4 | diesel | 150 | City B | City C
suzuki | swift | petrol | 140 | City C | City B
suzuki | swift | diesel | 18 | City D | City A
toyota | prius | petrol | 16 | City E | City A
toyota | prius | hybrid | 250 | City B | City E
I need a subset of the records in which the start or end city is one where both diesel and hybrid cars appear (as either start or end).
In the example above, only City B qualifies, so the expected output is as below.
Output table
make | model | engine | kms_covered | start | end
-------------------------------------------------------
suzuki | sx4 | diesel | 150 | City B | City C
suzuki | swift | petrol | 140 | City C | City B
toyota | prius | hybrid | 250 | City B | City E
I see this as a two-step process:
1. Get the list of cities where both diesel and hybrid cars appear in either start or end
2. Subset the table to only the records having cities from #1
I need help getting past this starting point:
select * from t1
where start in () or end in ()

If I understand the question, you can get the list of cities using a CTE and then use it to filter the table:
with c as (
    select city
    from (select start as city, engine
          from t1
          union all
          select end, engine
          from t1
         ) x
    where engine in ('diesel', 'hybrid')
    group by city
    having count(distinct engine) = 2
)
select t1.*
from t1
where t1.start in (select city from c) or
      t1.end in (select city from c);
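A minimal sketch for checking the two-step approach against the sample data, using Python's built-in sqlite3 module. The column names start_city and end_city are assumptions made for this sketch (END is a keyword in SQLite); the filter looks for cities touched by both a diesel and a hybrid car and keeps rows where either endpoint is such a city.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# start_city / end_city are renamed columns (assumption): END is a keyword in SQLite.
conn.execute("""create table t1 (make text, model text, engine text,
                                 kms_covered integer, start_city text, end_city text)""")
conn.executemany("insert into t1 values (?, ?, ?, ?, ?, ?)", [
    ("suzuki", "sx4", "petrol", 11, "City A", "City D"),
    ("suzuki", "sx4", "diesel", 150, "City B", "City C"),
    ("suzuki", "swift", "petrol", 140, "City C", "City B"),
    ("suzuki", "swift", "diesel", 18, "City D", "City A"),
    ("toyota", "prius", "petrol", 16, "City E", "City A"),
    ("toyota", "prius", "hybrid", 250, "City B", "City E"),
])

query = """
with c as (
    -- step 1: cities seen (as start or end) with both a diesel and a hybrid car
    select city
    from (select start_city as city, engine from t1
          union all
          select end_city, engine from t1) as endpoints
    where engine in ('diesel', 'hybrid')
    group by city
    having count(distinct engine) = 2
)
-- step 2: keep rows where either endpoint is one of those cities
select model, engine, start_city, end_city
from t1
where start_city in (select city from c)
   or end_city in (select city from c)
order by kms_covered
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

With the six sample rows, only City B passes step 1, so the three rows that start or end there are returned.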

Related

Return the row with max value for each group

I have the following table with name t2:
realm | race | gender | total
----------+------------+--------+--------
Buffalo | faerie | F | 5972
Buffalo | faerie | M | 2428
Buffalo | footballer | F | 1954
Buffalo | footballer | M | 2093
Buffalo | raccoon | F | 2118
Buffalo | raccoon | M | 1237
Buffalo | shark | F | 12497
Buffalo | shark | M | 3621
Buffalo | wizard | F | 468
Buffalo | wizard | M | 11079
Camelot | faerie | F | 2414
Camelot | faerie | M | 1455
I want to create a query that just selects the realm, race and gender with the highest total. Every time I use GROUP BY I keep getting both genders.
The output table looks like this:
realm | race | gender | total
----------+------------+--------+--------
Buffalo | faerie | F | 5972
Buffalo | footballer | M | 2093
Buffalo | raccoon | F | 2118
...
I think I have a very poor understanding on how to compare rows.
I can't figure out how to write the WHERE clause so that when I GROUP BY realm,race,gender, I only get 1 gender.
A perfect use case for DISTINCT ON (a PostgreSQL extension):
SELECT DISTINCT ON (realm, race) *
FROM t2
ORDER BY realm, race, total DESC;
Notably, the query has no GROUP BY at all.
Assuming total is NOT NULL, else append NULLS LAST.
In case of a tie, the winner is arbitrary unless you add more ORDER BY items to break the tie.
Detailed explanation:
Select first row in each GROUP BY group?
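Another option is to compare each row's total with the maximum of its group, computed with a window function:
select q.realm
, q.race
, q.gender
, q.total
from (
select t2.realm
, t2.race
, t2.gender
, t2.total
, max(t2.total) over (partition by t2.realm, t2.race) as maxtotal
from t2
) q
where q.total = q.maxtotal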
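For engines without DISTINCT ON, the same one-row-per-group idea can be sketched with ROW_NUMBER(). The example below is an illustration rather than either answerer's exact query: it runs in Python's sqlite3 (window functions need SQLite 3.25 or later) on a few of the sample rows, and it adds gender as an explicit tie-breaker, which is an assumption about how ties should be resolved.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table t2 (realm text, race text, gender text, total integer)")
conn.executemany("insert into t2 values (?, ?, ?, ?)", [
    ("Buffalo", "faerie", "F", 5972), ("Buffalo", "faerie", "M", 2428),
    ("Buffalo", "footballer", "F", 1954), ("Buffalo", "footballer", "M", 2093),
    ("Buffalo", "raccoon", "F", 2118), ("Buffalo", "raccoon", "M", 1237),
])

query = """
select realm, race, gender, total
from (select t2.*,
             -- number the rows of each (realm, race) group, highest total first;
             -- gender is a tie-breaker so the winner is deterministic
             row_number() over (partition by realm, race
                                order by total desc, gender) as rn
      from t2) ranked
where rn = 1
order by realm, race
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

Unlike the max(...) over (...) comparison, this form always returns exactly one row per group, even when two rows share the highest total.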

How to print the students name in this query?

The concerned tables are as follows:
students(rollno, name, deptcode)
depts(deptcode, deptname)
course(crs_rollno, crs_name, marks)
The query is
Find the name and roll number of the students from each department who obtained
highest total marks in their own department.
Consider:
i) Courses of different department are different.
ii) All students of a particular department take same number and same courses.
Then only the query makes sense.
I wrote a successful query for displaying the maximum total marks by a student in each department.
select do.deptname, max(x.marks) from students so
inner join depts do
on do.deptcode=so.deptcode
inner join(
select s.name as name, d.deptname as deptname, sum(c.marks) as marks from students s
inner join crs_regd c
on s.rollno=c.crs_rollno
inner join depts d
on d.deptcode=s.deptcode
group by s.name,d.deptname) x
on x.name=so.name and x.deptname=do.deptname group by do.deptname;
But as mentioned I need to display the name as well. Accordingly if I include so.name in select list, I need to include it in group by clause and the output is as below:
Kendra Summers Computer Science 274
Stewart Robbins English 80
Cole Page Computer Science 250
Brian Steele English 83
expected output:
Kendra Summers Computer Science 274
Brian Steele English 83
Where is the problem?
I think this is easily achieved with a window function:
select name, deptname, marks
from (select s.name as name, d.deptname as deptname, sum(c.marks) as marks,
row_number() over(partition by d.deptname order by sum(c.marks) desc) rn
from students s
inner join crs_regd c on s.rollno=c.crs_rollno
inner join depts d on d.deptcode=s.deptcode
group by s.name,d.deptname) x
where rn = 1;
To solve the problem with a readable query I had to define a couple of views:
total_marks: For each student the sum of their marks
create view total_marks as select s.deptcode, s.name, s.rollno, sum(c.marks) as total from course c, students s where s.rollno = c.crs_rollno group by s.rollno;
dept_max: For each department the highest total score by a single student of that department
create view dept_max as select deptcode, max(total) max_total from total_marks group by deptcode;
So I can get the desired output with the query
select a.deptcode, a.rollno, a.name from total_marks a join dept_max b on a.deptcode = b.deptcode and a.total = b.max_total
If you don't want to use views you can replace their selects on the final query, which will result in this:
select a.deptcode, a.rollno, a.name
from
(select s.deptcode, s.name, s.rollno, sum(c.marks) as total from course c, students s where s.rollno = c.crs_rollno group by s.rollno) a
join (select deptcode, max(total) max_total from (select s.deptcode, s.name, s.rollno, sum(c.marks) as total from course c, students s where s.rollno = c.crs_rollno group by s.rollno) a_ group by deptcode) b
on a.deptcode = b.deptcode and a.total = b.max_total
I'm sure someone more skilled than me could easily improve its performance...
If you (and anybody else) want to try it the way I did, here is the schema:
create table depts ( deptcode int primary key auto_increment, deptname varchar(20) );
create table students ( rollno int primary key auto_increment, name varchar(20) not null, deptcode int, foreign key (deptcode) references depts(deptcode) );
create table course ( crs_rollno int, crs_name varchar(20), marks int, foreign key (crs_rollno) references students(rollno) );
And here are all the entries I inserted:
insert into depts (deptname) values ("Computer Science"),("Biology"),("Fine Arts");
insert into students (name,deptcode) values ("Turing",1),("Jobs",1),("Tanenbaum",1),("Darwin",2),("Mendel",2),("Bernard",2),("Picasso",3),("Monet",3),("Van Gogh",3);
insert into course (crs_rollno,crs_name,marks) values
(1,"Algorithms",25),(1,"Database",28),(1,"Programming",29),(1,"Calculus",30),
(2,"Algorithms",24),(2,"Database",22),(2,"Programming",28),(2,"Calculus",19),
(3,"Algorithms",21),(3,"Database",27),(3,"Programming",23),(3,"Calculus",26),
(4,"Zoology",22),(4,"Botanics",28),(4,"Chemistry",30),(4,"Anatomy",25),(4,"Pharmacology",27),
(5,"Zoology",29),(5,"Botanics",27),(5,"Chemistry",26),(5,"Anatomy",25),(5,"Pharmacology",24),
(6,"Zoology",18),(6,"Botanics",19),(6,"Chemistry",22),(6,"Anatomy",23),(6,"Pharmacology",24),
(7,"Sculpture",26),(7,"History",25),(7,"Painting",30),
(8,"Sculpture",29),(8,"History",24),(8,"Painting",30),
(9,"Sculpture",21),(9,"History",19),(9,"Painting",25) ;
Those inserts will load these data:
select * from depts;
+----------+------------------+
| deptcode | deptname |
+----------+------------------+
| 1 | Computer Science |
| 2 | Biology |
| 3 | Fine Arts |
+----------+------------------+
select * from students;
+--------+-----------+----------+
| rollno | name | deptcode |
+--------+-----------+----------+
| 1 | Turing | 1 |
| 2 | Jobs | 1 |
| 3 | Tanenbaum | 1 |
| 4 | Darwin | 2 |
| 5 | Mendel | 2 |
| 6 | Bernard | 2 |
| 7 | Picasso | 3 |
| 8 | Monet | 3 |
| 9 | Van Gogh | 3 |
+--------+-----------+----------+
select * from course;
+------------+--------------+-------+
| crs_rollno | crs_name | marks |
+------------+--------------+-------+
| 1 | Algorithms | 25 |
| 1 | Database | 28 |
| 1 | Programming | 29 |
| 1 | Calculus | 30 |
| 2 | Algorithms | 24 |
| 2 | Database | 22 |
| 2 | Programming | 28 |
| 2 | Calculus | 19 |
| 3 | Algorithms | 21 |
| 3 | Database | 27 |
| 3 | Programming | 23 |
| 3 | Calculus | 26 |
| 4 | Zoology | 22 |
| 4 | Botanics | 28 |
| 4 | Chemistry | 30 |
| 4 | Anatomy | 25 |
| 4 | Pharmacology | 27 |
| 5 | Zoology | 29 |
| 5 | Botanics | 27 |
| 5 | Chemistry | 26 |
| 5 | Anatomy | 25 |
| 5 | Pharmacology | 24 |
| 6 | Zoology | 18 |
| 6 | Botanics | 19 |
| 6 | Chemistry | 22 |
| 6 | Anatomy | 23 |
| 6 | Pharmacology | 24 |
| 7 | Sculpture | 26 |
| 7 | History | 25 |
| 7 | Painting | 30 |
| 8 | Sculpture | 29 |
| 8 | History | 24 |
| 8 | Painting | 30 |
| 9 | Sculpture | 21 |
| 9 | History | 19 |
| 9 | Painting | 25 |
+------------+--------------+-------+
I take this chance to point out that this database is badly designed. This becomes evident with the course table, for these reasons:
The name is singular
This table does not represent courses, but rather exams or scores
crs_name should be a foreign key referencing the primary key of another table (one that would actually represent the courses)
There are no constraints to limit the marks to a range or to prevent a student from taking the same exam twice
I find it more logical to associate courses with departments, instead of students with departments (this would also make these queries easier)
I tell you this because I understood you are learning from a book, so unless the book at some point says "this database is poorly designed", do not take this exercise as an example for designing your own!
Anyway, if you work the query out with my data you will arrive at these results:
+----------+--------+---------+
| deptcode | rollno | name |
+----------+--------+---------+
| 1 | 1 | Turing |
| 2 | 4 | Darwin |
| 3 | 8 | Monet |
+----------+--------+---------+
As a further reference, here are the contents of the views I needed to define:
select * from total_marks;
+----------+-----------+--------+-------+
| deptcode | name | rollno | total |
+----------+-----------+--------+-------+
| 1 | Turing | 1 | 112 |
| 1 | Jobs | 2 | 93 |
| 1 | Tanenbaum | 3 | 97 |
| 2 | Darwin | 4 | 132 |
| 2 | Mendel | 5 | 131 |
| 2 | Bernard | 6 | 106 |
| 3 | Picasso | 7 | 81 |
| 3 | Monet | 8 | 83 |
| 3 | Van Gogh | 9 | 65 |
+----------+-----------+--------+-------+
select * from dept_max;
+----------+-----------+
| deptcode | max_total |
+----------+-----------+
| 1 | 112 |
| 2 | 132 |
| 3 | 83 |
+----------+-----------+
Hope I helped!
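The totals can also be recomputed mechanically instead of by hand. Below is a minimal sketch using Python's sqlite3 with the students and marks from this answer. The two views are written as CTEs, course names are left out because they do not affect the sums (a simplification made for this sketch), and the marks dictionary is just a compact way to load the same numbers.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table students (rollno integer primary key, name text, deptcode integer)")
# crs_name is omitted here (assumption: only the marks matter for the totals)
conn.execute("create table course (crs_rollno integer, marks integer)")
conn.executemany("insert into students values (?, ?, ?)", [
    (1, "Turing", 1), (2, "Jobs", 1), (3, "Tanenbaum", 1),
    (4, "Darwin", 2), (5, "Mendel", 2), (6, "Bernard", 2),
    (7, "Picasso", 3), (8, "Monet", 3), (9, "Van Gogh", 3),
])
marks = {
    1: [25, 28, 29, 30], 2: [24, 22, 28, 19], 3: [21, 27, 23, 26],
    4: [22, 28, 30, 25, 27], 5: [29, 27, 26, 25, 24], 6: [18, 19, 22, 23, 24],
    7: [26, 25, 30], 8: [29, 24, 30], 9: [21, 19, 25],
}
conn.executemany("insert into course values (?, ?)",
                 [(rollno, m) for rollno, ms in marks.items() for m in ms])

query = """
with total_marks as (          -- per-student sum of marks
    select s.deptcode, s.name, s.rollno, sum(c.marks) as total
    from course c join students s on s.rollno = c.crs_rollno
    group by s.deptcode, s.name, s.rollno
),
dept_max as (                  -- best total in each department
    select deptcode, max(total) as max_total
    from total_marks
    group by deptcode
)
select a.deptcode, a.rollno, a.name, a.total
from total_marks a
join dept_max b on a.deptcode = b.deptcode and a.total = b.max_total
order by a.deptcode
"""
top_students = conn.execute(query).fetchall()
for row in top_students:
    print(row)
```

Because the join is on the maximum total, students who tie for the top score in a department would all be listed.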
Try the following query:
select a.name, c.deptname, b.marks
from students a
, (select crs_rollno, sum(marks) as marks
from crs_regd
group by crs_rollno) b
, depts c
where a.rollno = b.crs_rollno
and a.deptcode = c.deptcode
and(c.deptname,b.marks) in (select do.deptname, max(x.marks)
from students so
inner join depts do
on do.deptcode=so.deptcode
inner join (select s.name as name
, d.deptname as deptname
, sum(c.marks) as marks
from students s
inner join crs_regd c
on s.rollno=c.crs_rollno
inner join depts d
on d.deptcode=s.deptcode
group by s.name,d.deptname) x
on x.name=so.name
and x.deptname=do.deptname
group by do.deptname
)
The inner query fetches each department name with its maximum total marks, and the outer query finds the name of the student who obtained that total.
Try it and let me know whether you get the desired result.
The DENSE_RANK() function would be helpful in this scenario:
SELECT subquery.*
FROM (SELECT Student_Total_Marks.rollno,
Student_Total_Marks.name,
Student_Total_Marks.deptcode,
depts.deptname,
dense_rank() over (partition by Student_Total_Marks.deptcode
order by Student_Total_Marks.total_marks desc) Student_Rank
FROM (SELECT stud.rollno,
stud.name,
stud.deptcode,
sum(course.marks) total_marks
FROM students stud inner join course on stud.rollno = course.crs_rollno
GROUP BY stud.rollno, stud.name, stud.deptcode) Student_Total_Marks,
depts
WHERE Student_Total_Marks.deptcode = depts.deptcode) subquery
WHERE subquery.Student_Rank = 1
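The choice between ROW_NUMBER() and DENSE_RANK() only matters when two students tie for the top total. The sketch below illustrates the difference with Python's sqlite3 (SQLite 3.25 or later) on a small hypothetical totals table; the names and numbers are invented for the illustration and are not from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical pre-aggregated totals; Ann and Bob tie for the top in Physics.
conn.execute("create table totals (name text, deptname text, total integer)")
conn.executemany("insert into totals values (?, ?, ?)", [
    ("Ann", "Physics", 90), ("Bob", "Physics", 90),
    ("Cy", "Physics", 70), ("Dee", "History", 60),
])

query = """
select name, deptname, total,
       row_number() over (partition by deptname order by total desc, name) as rn,
       dense_rank() over (partition by deptname order by total desc) as dr
from totals
"""
ranked = conn.execute(query).fetchall()

# rn = 1 keeps exactly one student per department; dr = 1 keeps every tied leader.
top_by_row_number = sorted(r[0] for r in ranked if r[3] == 1)
top_by_dense_rank = sorted(r[0] for r in ranked if r[4] == 1)
print(top_by_row_number)
print(top_by_dense_rank)
```

If the exercise expects every student who reached the highest total, filtering on the rank is the safer reading; filtering on the row number silently drops tied students.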

SQL: Cascading conditions on Join

I have found a few similar questions to this on SO but nothing which applies to my situation.
I have a large dataset with hundreds of millions of rows in Table 1 and am looking for the most efficient way to run the following query. I am using Google BigQuery but I think this is a general SQL question applicable to any DBMS?
I need to apply an owner to every row in Table 1. I want to join in the following priority:
1: if item_id matches an identifier in Table 2
2: if no item_id matches try match on item_name
3: if no item_id or item_name matches try match on item_division
4: if no item_division matches, return null
Table 1 - Datapoints:
| id | item_id | item_name | item_division | units | revenue
|----|---------|-----------|---------------|-------|---------
| 1 | xyz | pen | UK | 10 | 100
| 2 | pqr | cat | US | 15 | 120
| 3 | asd | dog | US | 12 | 105
| 4 | xcv | hat | UK | 11 | 140
| 5 | bnm | cow | UK | 14 | 150
Table 2 - Identifiers:
| id | type | code | owner |
|----|---------|-----------|-------|
| 1 | id | xyz | bob |
| 2 | name | cat | dave |
| 3 | division| UK | alice |
| 4 | name | pen | erica |
| 5 | id | xcv | fred |
Desired output:
| id | item_id | item_name | item_division | units | revenue | owner |
|----|---------|-----------|---------------|-------|---------|-------|
| 1 | xyz | pen | UK | 10 | 100 | bob | <- id
| 2 | pqr | cat | US | 15 | 120 | dave | <- name
| 3 | asd | dog | US | 12 | 105 | null | <- none
| 4 | xcv | hat | UK | 11 | 140 | fred | <- id
| 5 | bnm | cow | UK | 14 | 150 | alice | <- division
My attempts so far have involved multiple joining the table onto itself and I fear it is becoming hugely inefficient.
Any help much appreciated.
Another option for BigQuery Standard SQL
#standardSQL
SELECT ARRAY_AGG(a)[OFFSET(0)].*,
ARRAY_AGG(owner
ORDER BY CASE
WHEN type = 'id' THEN 1
WHEN type = 'name' THEN 2
WHEN type = 'division' THEN 3
END
LIMIT 1
)[OFFSET(0)] owner
FROM Datapoints a
JOIN Identifiers b
ON (a.item_id = b.code AND b.type = 'id')
OR (a.item_name = b.code AND b.type = 'name')
OR (a.item_division = b.code AND b.type = 'division')
GROUP BY a.id
ORDER BY a.id
It leaves out entries which have no owner, as in the result below (id=3 is omitted because it has no owner):
Row id item_id item_name item_division units revenue owner
1 1 xyz pen UK 10 100 bob
2 2 pqr cat US 15 120 dave
3 4 xcv hat UK 11 140 fred
4 5 bnm cow UK 14 150 alice
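If rows without an owner should be kept, one possible variation is to use a LEFT JOIN and pick the highest-priority match per row. The sketch below is not BigQuery code: it is an analogous query run in Python's sqlite3 (SQLite 3.25 or later), with ROW_NUMBER() standing in for the ARRAY_AGG ... LIMIT 1 step, and only the columns needed for matching are loaded.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table datapoints (id integer, item_id text, item_name text, item_division text)")
conn.execute("create table identifiers (id integer, type text, code text, owner text)")
conn.executemany("insert into datapoints values (?, ?, ?, ?)", [
    (1, "xyz", "pen", "UK"), (2, "pqr", "cat", "US"), (3, "asd", "dog", "US"),
    (4, "xcv", "hat", "UK"), (5, "bnm", "cow", "UK"),
])
conn.executemany("insert into identifiers values (?, ?, ?, ?)", [
    (1, "id", "xyz", "bob"), (2, "name", "cat", "dave"), (3, "division", "UK", "alice"),
    (4, "name", "pen", "erica"), (5, "id", "xcv", "fred"),
])

query = """
select id, owner
from (select a.id, b.owner,
             -- rank each candidate match: id first, then name, then division
             row_number() over (
                 partition by a.id
                 order by case b.type when 'id' then 1
                                      when 'name' then 2
                                      when 'division' then 3 end) as rn
      from datapoints a
      left join identifiers b          -- LEFT JOIN keeps rows with no match
        on (a.item_id = b.code and b.type = 'id')
        or (a.item_name = b.code and b.type = 'name')
        or (a.item_division = b.code and b.type = 'division')) ranked
where rn = 1
order by id
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

Row 3 now comes back with a null owner instead of disappearing from the result.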
I am using the following query (thanks @Barmar) but want to know if there is a more efficient way in Google BigQuery:
SELECT a.*, COALESCE(b.owner,c.owner,d.owner) owner FROM datapoints a
LEFT JOIN identifiers b on a.item_id = b.code and b.type = 'id'
LEFT JOIN identifiers c on a.item_name = c.code and c.type = 'name'
LEFT JOIN identifiers d on a.item_division = d.code and d.type = 'division'
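To see how the COALESCE over three LEFT JOINs resolves the priorities, here is a small sketch that runs the same query shape on the sample tables in Python's sqlite3. It says nothing about BigQuery performance; it only checks the logic, and it loads just the columns involved in the matching.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table datapoints (id integer, item_id text, item_name text, item_division text)")
conn.execute("create table identifiers (id integer, type text, code text, owner text)")
conn.executemany("insert into datapoints values (?, ?, ?, ?)", [
    (1, "xyz", "pen", "UK"), (2, "pqr", "cat", "US"), (3, "asd", "dog", "US"),
    (4, "xcv", "hat", "UK"), (5, "bnm", "cow", "UK"),
])
conn.executemany("insert into identifiers values (?, ?, ?, ?)", [
    (1, "id", "xyz", "bob"), (2, "name", "cat", "dave"), (3, "division", "UK", "alice"),
    (4, "name", "pen", "erica"), (5, "id", "xcv", "fred"),
])

query = """
select a.id, coalesce(b.owner, c.owner, d.owner) as owner
from datapoints a
left join identifiers b on a.item_id = b.code and b.type = 'id'
left join identifiers c on a.item_name = c.code and c.type = 'name'
left join identifiers d on a.item_division = d.code and d.type = 'division'
order by a.id
"""
# COALESCE returns the first non-null owner, so the join order encodes the priority.
owners = dict(conn.execute(query).fetchall())
print(owners)
```

One caveat with this shape: if Identifiers ever holds two rows with the same type and code, the joins multiply the datapoint rows, so uniqueness of (type, code) is assumed here.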
I'm not sure if BigQuery optimizes today a query like this - but at least you would be writing a query that gives strong hints to not run the subqueries when not needed:
#standardSQL
SELECT COALESCE(
null
, (SELECT MIN(payload)
FROM `githubarchive.year.2016`
WHERE actor.login=a.user)
, (SELECT MIN(payload)
FROM `githubarchive.year.2016`
WHERE actor.id = SAFE_CAST(user AS INT64))
)
FROM (SELECT '15229281' user) a
4.2s elapsed, 683 GB processed
{"action":"started"}
For example, the following query took a long time to run, but BigQuery could optimize its execution massively in the future (depending on how frequently users needed an operation like this):
#standardSQL
SELECT COALESCE(
"hello"
, (SELECT MIN(payload)
FROM `githubarchive.year.2016`
WHERE actor.login=a.user)
, (SELECT MIN(payload)
FROM `githubarchive.year.2016`
WHERE actor.id = SAFE_CAST(user AS INT64))
)
FROM (SELECT actor.login user FROM `githubarchive.year.2016` LIMIT 10) a
114.7s elapsed, 683 GB processed
hello
hello
hello
hello
hello
hello
hello
hello
hello
hello

Query to rank rows in groups

I'm using Apache Derby 10.10.
I have a list of participants and would like to calculate their rank in their country, like this:
| Country | Participant | Points | country_rank |
|----------------|---------------------|--------|--------------|
| Australia | Bridget Ciriac | 1 | 1 |
| Australia | Austin Bjorklun | 4 | 2 |
| Australia | Carrol Motto | 7 | 3 |
| Australia | Valeria Seligma | 8 | 4 |
| Australia | Desmond Miyamot | 27 | 5 |
| Australia | Maryjane Digma | 33 | 6 |
| Australia | Kena Elmendor | 38 | 7 |
| Australia | Emmie Hicke | 39 | 8 |
| Australia | Kaitlyn Mund | 50 | 9 |
| Australia | Alisia Vitaglian | 65 | 10 |
| Australia | Anika Bulo | 65 | 11 |
| UK | Angle Ifil | 2 | 1 |
| UK | Demetrius Buelo | 12 | 2 |
| UK | Ermelinda Mell | 12 | 3 |
| UK | Adeline Pee | 21 | 4 |
| UK | Alvera Cangelos | 23 | 5 |
| UK | Keshia Mccalliste | 23 | 6 |
| UK | Alayna Rashi | 24 | 7 |
| UK | Malinda Mcfarlan | 25 | 8 |
| United States | Gricelda Quirog | 3 | 1 |
| United States | Carmina Britto | 5 | 2 |
| United States | Noemi Blase | 6 | 3 |
| United States | Britta Swayn | 8 | 4 |
| United States | An Heidelber | 12 | 5 |
| United States | Maris Padill | 21 | 6 |
| United States | Rachele Italian | 21 | 7 |
| United States | Jacquiline Speake | 28 | 8 |
| United States | Hipolito Elami | 45 | 9 |
| United States | Earl Sayle | 65 | 10 |
| United States | Georgeann Ves | 66 | 11 |
| United States | Conchit Salli | 77 | 12 |
The schema looks like this (sqlfiddle):
create table Country(
id INTEGER NOT NULL GENERATED ALWAYS AS IDENTITY,
name varchar(255),
PRIMARY KEY (id)
);
create table Team(
id INTEGER NOT NULL GENERATED ALWAYS AS IDENTITY,
country_id int not null,
PRIMARY KEY (id),
FOREIGN KEY (country_id) REFERENCES Country(id)
);
create table Participant(
id INTEGER NOT NULL GENERATED ALWAYS AS IDENTITY,
team_id int not null,
name varchar(100),
points int,
PRIMARY KEY (id),
FOREIGN KEY (team_id) REFERENCES Team(id)
);
This is what I have tried:
select
Country.name,
Participant.name,
Participant.points,
ROW_NUMBER() OVER(order by Country.name, Participant.points) as country_rank
from Country
join Team
on Country.id = Team.country_id
join Participant
on Team.id = Participant.team_id;
But according to the Apache Derby documentation, the OVER() clause doesn't take any arguments.
Does anyone have a way to achieve the country rank?
SQL
SELECT c.name AS Country,
p.name AS Participant,
p.points AS Points,
(SELECT COUNT(*)
FROM Participant p2
JOIN Team t2 ON p2.team_id = t2.id
WHERE t2.country_id = t.country_id
AND (p2.points < p.points
OR p2.points = p.points AND p2.name <= p.name)) AS country_rank
FROM Country c
JOIN Team t ON c.id = t.country_id
JOIN Participant p ON t.id = p.team_id
ORDER BY c.name, p.points, p.name;
Online Demo
SQL Fiddle demo: http://sqlfiddle.com/#!5/f48f8/14
Explanation
A simple ANSI-SQL subselect can be used to do the same job, counting the number of records for participants in the same country with a lower score or with the same score and a name that is alphabetically no higher.
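As a quick check of the counting logic, the sketch below runs the correlated subquery in Python's sqlite3 on a few participants from the question. The team assignments and explicit ids are assumptions made for the sketch (the question does not show them); the UK participants are split across two teams to confirm that the rank is computed per country, not per team.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table Country (id integer primary key, name text)")
conn.execute("create table Team (id integer primary key, country_id integer)")
conn.execute("create table Participant (id integer primary key, team_id integer, name text, points integer)")
conn.executemany("insert into Country values (?, ?)", [(1, "Australia"), (2, "UK")])
# Team membership is hypothetical; teams 2 and 3 both belong to the UK.
conn.executemany("insert into Team values (?, ?)", [(1, 1), (2, 2), (3, 2)])
conn.executemany("insert into Participant values (?, ?, ?, ?)", [
    (1, 1, "Bridget Ciriac", 1), (2, 1, "Austin Bjorklun", 4),
    (3, 2, "Angle Ifil", 2), (4, 3, "Demetrius Buelo", 12),
    (5, 2, "Ermelinda Mell", 12), (6, 3, "Adeline Pee", 21),
])

query = """
select c.name, p.name, p.points,
       -- count same-country participants that sort at or before this one
       (select count(*)
        from Participant p2
        join Team t2 on p2.team_id = t2.id
        where t2.country_id = t.country_id
          and (p2.points < p.points
               or (p2.points = p.points and p2.name <= p.name))) as country_rank
from Country c
join Team t on c.id = t.country_id
join Participant p on t.id = p.team_id
order by c.name, p.points, p.name
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

The two UK participants tied on 12 points get ranks 2 and 3, ordered by name, which matches the tie handling shown in the question's expected table.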
Consider a query without window functions that uses a correlated aggregate count subquery. Because the grouping column (Country.name) is not in the same table as the ranking criterion (Participant.points), we need to run the same joins in the subquery, with renamed table aliases so that the inner and outer queries can be compared properly.
In a perfect world that would be all, but we must also account for tied points. Therefore a second, very similar subquery is added to the first as a tie-breaker. It matches the inner and outer queries on Country.name and Participant.points but ranks by the alphabetical order of Participant.name.
SELECT
Country.name AS Country,
Participant.name AS Participant,
Participant.points,
(SELECT Count(*) + 1
FROM Country subC
INNER JOIN Team subT
ON subC.id = subT.country_id
INNER JOIN Participant subP
ON subT.id = subP.team_id
WHERE subC.name = Country.name
AND subP.points < Participant.points)
+
(SELECT Count(*)
FROM Country subC
INNER JOIN Team subT
ON subC.id = subT.country_id
INNER JOIN Participant subP
ON subT.id = subP.team_id
WHERE subC.name = Country.name
AND subP.points = Participant.points
AND subP.name < Participant.name) As country_rank
FROM Country
INNER JOIN Team
ON Country.id = Team.country_id
INNER JOIN Participant
ON Team.id = Participant.team_id
ORDER BY Country.name, Participant.points;
If your database supports PARTITION BY inside OVER() (per the question, Derby 10.10 does not), all you need to add is a partition by country:
SELECT
Country.name,
Participant.name,
Participant.points,
ROW_NUMBER() OVER(PARTITION BY Country.name ORDER BY Participant.points) as country_rank
from Country
join Team
on Country.id = Team.country_id
join Participant
on Team.id = Participant.team_id;

Easiest way to merge rows in Google Refine (OpenRefine) if all columns are identical

I'm cleaning data from multiple sources with OpenRefine (formerly Google Refine). I have files from different sources which contain companies, and the column definitions are identical, i.e.
UNID | Name | Street | City | Country | Phone | ...
sg52d | Company a | A street | a city | c country | 12345
sg52d | Company a | A street | a city | c country | 0099835
dfnsd | Company B | B Street | City B | c country | 33445
dfnsd | Company B | Different | Another | c country | 33445
xxbb3 | Company C | C Street | City B | Country A | 1111
xxbb3 | Company C | C Street | City B | Country A | 1111
What I want is this result (only the last Company is merged, all columns were identical)
UNID | Name | Street | City | Country | Phone | ...
sg52d | Company a | A street | a city | c country | 12345
sg52d | Company a | A street | a city | c country | 0099835
dfnsd | Company B | B Street | City B | c country | 33445
dfnsd | Company B | Different | Another | c country | 33445
xxbb3 | Company C | C Street | City B | Country A | 1111
Is there a simple way to do this?
I understand that I can concatenate all columns into a new column, but this is a little PITA, because of the number of columns.
Perhaps there is a way for the new column definition to loop through all other columns and merge it?
It is a strange approach but this should work: http://googlerefine.blogspot.com/2011/08/remove-duplicate.html
Make sure you make the sort change permanent.
You could create a new column with an expression like:
forEach(["UNID", "Name", "Street", "City", "..." ],x,cells[x].value).join("")
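Outside OpenRefine, the same "merge only fully identical rows" rule can be sketched in a few lines of Python. This is an illustration of the idea, not OpenRefine functionality; drop_exact_duplicates is a hypothetical helper name. It uses the whole row as the key rather than a concatenated string, because joining cells with an empty separator can make different rows look alike.

```python
def drop_exact_duplicates(rows):
    """Keep the first occurrence of each row; a row is a duplicate only if every column matches."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row)          # compare all columns, cell by cell
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


rows = [
    ("sg52d", "Company a", "A street", "a city", "c country", "12345"),
    ("sg52d", "Company a", "A street", "a city", "c country", "0099835"),
    ("dfnsd", "Company B", "B Street", "City B", "c country", "33445"),
    ("dfnsd", "Company B", "Different", "Another", "c country", "33445"),
    ("xxbb3", "Company C", "C Street", "City B", "Country A", "1111"),
    ("xxbb3", "Company C", "C Street", "City B", "Country A", "1111"),
]
deduped = drop_exact_duplicates(rows)
for row in deduped:
    print(row)

# Why not join with "": two different rows can produce the same string.
collision = "".join(("ab", "c")) == "".join(("a", "bc"))
```

If you stay with the concatenated-column approach in OpenRefine, joining with a separator that cannot occur in the data avoids the same collision.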