When is aliasing required when using SQL set theory clauses?

I just started learning SQL and am trying to learn from my mistakes. In one of my practice exercises, I had to find city names from the cities table that are not listed as capital cities in the countries table. Initially I tried the code below, but it yielded an error.
SELECT name
FROM cities
EXCEPT
SELECT capital
FROM countries
ORDER BY capital ASC;
The correct code is:
SELECT city.name
FROM cities AS city
EXCEPT
SELECT country.capital
FROM countries AS country
ORDER BY name;
Can someone explain to me why aliasing made all the difference here?

An ORDER BY for a UNION, EXCEPT or INTERSECT sorts the complete result. The column names of the overall query are defined by the first query. So this query:
SELECT name
FROM cities
EXCEPT
SELECT capital
FROM countries
returns a result with a single column named name.
Adding an order by is conceptually the same as:
select *
from (
SELECT name
FROM cities
EXCEPT
SELECT capital
FROM countries
) x
order by ....;
As the inner query only returns a single column name, that's the only column you can use in the order by.
The aliases that you used in your second query don't change the column name of the overall result which determines the column names available for the order by clause.
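A minimal sketch of the point above, using Python's built-in sqlite3 module only as a convenient way to run the SQL (the rows are made up for illustration). It shows that the EXCEPT result has a single column called name, whatever table aliases are used. SQLite is more lenient than Postgres about which names ORDER BY accepts in a compound query, so this demonstrates only the column naming, not the original error.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE cities (name TEXT);
    CREATE TABLE countries (capital TEXT);
    INSERT INTO cities VALUES ('Paris'), ('Lyon'), ('Berlin'), ('Hamburg');
    INSERT INTO countries VALUES ('Paris'), ('Berlin');
""")

cur = conn.execute("""
    SELECT city.name
    FROM cities AS city
    EXCEPT
    SELECT country.capital
    FROM countries AS country
    ORDER BY name
""")

# The output column names come from the first SELECT of the compound query.
column_names = [d[0] for d in cur.description]
rows = cur.fetchall()
print(column_names)
print(rows)
```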

Related

How to add alias column in SQL Table?

I need some assistance.
I have parsed the city name from the purchase address with the alias "City".
Now I have a problem statement saying "Find the top cities with the highest sales".
The city column is not in the main table, so I cannot run a GROUP BY operation on it. How would I perform this task?
If you're looking for a way to get the city with the highest number of Quantity_Ordered, you can try this.
Since your data table doesn't contain the city exactly but it needs to be parsed, it would probably be best to create a view.
CREATE VIEW vProductDataCity AS
SELECT *, PARSENAME... -- Fill the "..." with your parsing code
And then to get the top city:
SELECT TOP(1) city, SUM(Quantity_Ordered) as SumQuantity_Ordered
FROM vProductDataCity
GROUP BY city
ORDER BY SumQuantity_Ordered DESC;
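As a rough, runnable sketch of the view approach: the table name ProductData, the column Purchase_Address, the address format ("street, city, state zip") and the sample rows are all assumptions for illustration, and the parsing uses SQLite string functions rather than the asker's own parsing code. SQLite has no TOP, so LIMIT 1 stands in for TOP(1).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical source table and sample data.
    CREATE TABLE ProductData (Purchase_Address TEXT, Quantity_Ordered INTEGER);
    INSERT INTO ProductData VALUES
        ('917 1st St, Dallas, TX 75001', 2),
        ('682 Chestnut St, Boston, MA 02215', 1),
        ('669 Spruce St, Dallas, TX 75001', 3),
        ('333 8th St, Boston, MA 02215', 1);

    -- Assumption: the city is the text between the first and second comma.
    CREATE VIEW vProductDataCity AS
    SELECT Purchase_Address,
           Quantity_Ordered,
           trim(substr(rest, 1, instr(rest, ',') - 1)) AS city
    FROM (SELECT *,
                 substr(Purchase_Address, instr(Purchase_Address, ',') + 1) AS rest
          FROM ProductData);
""")

# Because city is now a real column of the view, GROUP BY works as usual.
top_city = conn.execute("""
    SELECT city, SUM(Quantity_Ordered) AS SumQuantity_Ordered
    FROM vProductDataCity
    GROUP BY city
    ORDER BY SumQuantity_Ordered DESC
    LIMIT 1
""").fetchone()
print(top_city)
```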

How can I query to find max population of country from countries table?

I have a table "countries" with columns -> name, continent, area, population.
Let's say I want to find the name and population of the country with the highest population.
SELECT MAX(population) FROM countries;
The example above returns the maximum population.
I also want to see the name of the country with that population.
SELECT name,MAX(population) FROM countries;
I get the error below.
ERROR: column "countries.name" must appear in the GROUP BY clause or be used in an aggregate function
I can't think of another way to do it.
Here is an example of my query.
SELECT name,population
FROM countries
WHERE population >= (
SELECT MAX(population)
FROM countries)
;
This query works, but I am curious why I am getting the error, and whether there is a better way to accomplish this.
SELECT name, population
FROM countries
ORDER BY population DESC
LIMIT 1
MAX selects the maximum element from a list of values. In your first query,
SELECT MAX(population) FROM countries;
the list is formed by extracting the population field from all rows in countries, and then the maximum is selected. This collapses the list of rows down to a single row containing just the maximum.
In your second query,
SELECT name,MAX(population) FROM countries;
you (conceptually) get a list of all name fields from countries, but there's only one MAX(population). The database system doesn't know what to do with this: SELECT name FROM countries would return as many rows as there are in countries, but SELECT MAX(population) FROM countries would only return one row. This doesn't match up; it's unclear how many rows you want returned from this. This is why you get an error.
The error message says you need to either
use name in an aggregate function, which would collapse the list of rows down to a single value that could be returned alongside the single MAX value, or
use a GROUP BY name clause, which would group the list of countries into entries with equal names first, then compute MAX(population) separately for each group. This makes no sense if all your countries have different names.
As far as I know there's no SQL syntax for "select the maximum population and then get the name field from the same row" (it's not quite clear what this would do anyway because there can be more than one country with a population equal to the maximum).
What you can do instead is sort the whole table, then select only fields from the first row:
SELECT name, population
FROM countries
ORDER BY population DESC
LIMIT 1
(I'm pretty sure Postgres optimizes this so there's no actual sort involved.)
Now if there is more than one country with the maximum population, you'll get an arbitrary one of them (we haven't told the database how to order rows with equal population).
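The tie behaviour can be seen in a small sketch. It runs the SQL through Python's sqlite3 module with made-up country names and populations. The subquery form returns every country that shares the maximum, while ORDER BY ... LIMIT 1 returns exactly one row, so a second sort key is added here to make the choice deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE countries (name TEXT, population INTEGER);
    INSERT INTO countries VALUES ('Atlantis', 100), ('Lemuria', 100), ('Mu', 50);
""")

# Subquery form: all rows whose population equals the maximum.
all_max = conn.execute("""
    SELECT name, population
    FROM countries
    WHERE population = (SELECT MAX(population) FROM countries)
    ORDER BY name
""").fetchall()

# Sort-and-limit form: one row only; name breaks the tie explicitly.
one_max = conn.execute("""
    SELECT name, population
    FROM countries
    ORDER BY population DESC, name
    LIMIT 1
""").fetchall()

print(all_max)
print(one_max)
```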
You can use the TOP keyword to select a single record from the countries table. Note that TOP is SQL Server syntax and is not available in Postgres.
SELECT Top 1 name,population
FROM countries
order by population DESC

Why use many columns in GROUP BY and HAVING clause in these examples

Given the schema here I'm trying to understand and solve the below 3 sql queries as I'm confused:
1- Present a table giving the names of the countries with ≥ 50% urbanization
rates, their urbanization rates, and their per capita GDP. Note that
urbanization rate is the percentage of population living in cities. Do not
count cities with NULL values for population.
SELECT country.name, round(sum(city.population)/country.population, 3) AS urban, round(gdp/country.population, 3) AS gdppc
FROM city
INNER JOIN country ON code = country
INNER JOIN economy ON code = economy.country
WHERE city.population IS NOT NULL
GROUP BY country.name, country.population, economy.gdp
HAVING round(sum(city.population)/country.population, 3) >= 0.5
ORDER BY urban DESC;
In the above query, why do I need to include country.population and economy.gdp in the GROUP BY? If I try using just country.name in the GROUP BY, I get an error saying I should include the others.
2- Show organizations that have as members all the European countries with over 50 million people?
SELECT name
FROM organization
INNER JOIN (SELECT organization
FROM country
INNER JOIN encompasses
ON code = encompasses.country
INNER JOIN ismember
ON code = ismember.country
WHERE population > 50000000 AND continent = 'Europe'
GROUP BY organization
HAVING count(ismember.country) = (SELECT count(*)
FROM country
INNER JOIN encompasses
ON code = country
WHERE population > 50000000 AND continent = 'Europe'))
AS innerQuery
ON abbreviation = innerQuery.organization;
Why do I need the HAVING part above?
3- Insert a new organization called “Tivoli” and a trigger that says if Germany joins “Tivoli” then so too must the UK and France. Insert Germany into the “Tivoli” organization. Confirm proper behavior.
I tried the below script but it's not working, any advice please?
do $$
begin
IF(NOT EXISTS ( SELECT 1 FROM organization WHERE organization."name" = 'Tivoli' AND organization.country = 'D' ))
BEGIN
INSERT INTO organization VALUES ('Tivoli','Tivoli organization',NULL,'F',NULL,NULL);
INSERT INTO organization VALUES ('Tivoli','Tivoli organization',NULL,'GB',NULL,NULL);
END;
end $$
1)
You used country.population and economy.gdp in the SELECT list outside of aggregate functions (such as COUNT(), AVG() and SUM()), and you have a GROUP BY. Everything that you select has to be either in the GROUP BY or inside an aggregate function.
2)
Because you were asked to show organizations that have all of the countries with more than 50 million people as members. With HAVING, you check whether the organization has the right number of such countries.
3)
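organization."name" = 'Tivoli'
can also be written as:
organization.name = 'Tivoli'
although both forms are valid in Postgres. The script fails because PL/pgSQL expects IF ... THEN ... END IF; rather than IF ... BEGIN ... END;.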
First of all, you should limit a question to one only, not 3. But here are some pointers for all 3:
In the above query, why do I need to include country.population and economy.gdp in the GROUP BY? If I try using just country.name in the GROUP BY, I get an error saying I should include the others.
This is a requirement. A group by country.name alone would work (in Postgres 9.1+) only if the other two fields are known to be functionally dependent on country.name. But probably country.name is not the primary key of the country table, so in theory it is possible to have two records in that table with the same name, but different population.
The rule is as follows:
When GROUP BY is present, it is not valid for the SELECT list expressions to refer to ungrouped columns except within aggregate functions or if the ungrouped column is functionally dependent on the grouped columns, since there would otherwise be more than one possible value to return for an ungrouped column. A functional dependency exists if the grouped columns (or a subset thereof) are the primary key of the table containing the ungrouped column.
This has been implemented since version 9.1.
Why do I need the HAVING part above?
Because a condition on an aggregate (count in this case) can only be evaluated after grouping, and thus cannot be expressed in the WHERE clause. In this case the HAVING clause makes sure that the organization has as members not just some of the large European countries, but all of them.
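A small sketch of what the HAVING clause contributes, run through Python's sqlite3 module. The three tables are a cut-down, hypothetical version of the schema, and the organizations OrgA and OrgB and their memberships are invented. Without HAVING, every organization with at least one large European member is listed. With it, only the organization whose count matches the total number of such countries remains.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE country (code TEXT PRIMARY KEY, name TEXT, population INTEGER);
    CREATE TABLE encompasses (country TEXT, continent TEXT);
    CREATE TABLE ismember (country TEXT, organization TEXT);
    INSERT INTO country VALUES
        ('D', 'Germany', 83000000), ('F', 'France', 67000000),
        ('GB', 'United Kingdom', 67000000), ('L', 'Luxembourg', 600000);
    INSERT INTO encompasses VALUES
        ('D', 'Europe'), ('F', 'Europe'), ('GB', 'Europe'), ('L', 'Europe');
    -- Invented memberships: OrgA lacks GB, OrgB has every large country.
    INSERT INTO ismember VALUES
        ('D', 'OrgA'), ('F', 'OrgA'), ('L', 'OrgA'),
        ('D', 'OrgB'), ('F', 'OrgB'), ('GB', 'OrgB'), ('L', 'OrgB');
""")

base = """
    SELECT m.organization, count(m.country)
    FROM country c
    JOIN encompasses e ON c.code = e.country
    JOIN ismember m ON c.code = m.country
    WHERE c.population > 50000000 AND e.continent = 'Europe'
    GROUP BY m.organization
"""
# The condition compares each group's count with the total number of
# large European countries, so it can only run after grouping.
having = """
    HAVING count(m.country) = (SELECT count(*)
                               FROM country c2
                               JOIN encompasses e2 ON c2.code = e2.country
                               WHERE c2.population > 50000000
                                 AND e2.continent = 'Europe')
"""
without_having = sorted(conn.execute(base).fetchall())
with_having = sorted(conn.execute(base + having).fetchall())
print(without_having)
print(with_having)
```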
I tried the below script but it's not working, any advice please?
Without a proper database schema, it is not possible to provide you with the correct SQL, but from the ERD it seems that the organization table does not have a country field. Instead, the ismember table connects organizations with countries. You would insert only one organization, but several ismember records (one per member country involved).
It is also better to name the fields in your INSERT statement, so it is clear which value corresponds to which field.
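To illustrate the shape of such a solution, here is a sketch under stated assumptions: the two-column organization and ismember tables are hypothetical simplifications of the real schema, and the trigger uses SQLite syntax (run via Python's sqlite3 module) rather than Postgres, where you would write a trigger function plus CREATE TRIGGER instead. One organization row is inserted, the INSERT statements name their columns, and the trigger adds the UK and France when Germany joins.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical, simplified tables.
    CREATE TABLE organization (abbreviation TEXT PRIMARY KEY, name TEXT);
    CREATE TABLE ismember (
        country TEXT,
        organization TEXT,
        PRIMARY KEY (country, organization)
    );

    -- If Germany ('D') joins Tivoli, the UK and France must join as well.
    CREATE TRIGGER tivoli_germany_joins
    AFTER INSERT ON ismember
    WHEN NEW.country = 'D' AND NEW.organization = 'Tivoli'
    BEGIN
        INSERT OR IGNORE INTO ismember (country, organization) VALUES ('GB', 'Tivoli');
        INSERT OR IGNORE INTO ismember (country, organization) VALUES ('F', 'Tivoli');
    END;

    INSERT INTO organization (abbreviation, name) VALUES ('Tivoli', 'Tivoli organization');
    INSERT INTO ismember (country, organization) VALUES ('D', 'Tivoli');
""")

# Confirm the behaviour: all three countries should now be members.
members = conn.execute("""
    SELECT country FROM ismember WHERE organization = 'Tivoli' ORDER BY country
""").fetchall()
print(members)
```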

JOIN on another table after GROUP BY and COUNT

I'm trying to make sense of the right way to use JOIN, COUNT(*), and GROUP BY to do a pretty simple query. I've actually gotten it to work (see below) but from what I've read, I'm using an extra GROUP BY that I shouldn't be.
(Note: The problem below isn't my actual problem (which deals with more complicated tables), but I've tried to come up with an analogous problem)
I have two tables:
Table: Person
-------------
key  name     cityKey
1    Alice    1
2    Bob      2
3    Charles  2
4    David    1
Table: City
-------------
key  name
1    Albany
2    Berkeley
3    Chico
I'd like to do a query on the Person table (with some WHERE clause) that returns
the number of matching people in each city
the key for the city
the name of the city.
If I do
SELECT COUNT(Person.key) AS count, City.key AS cityKey, City.name AS cityName
FROM Person
LEFT JOIN City ON Person.cityKey = City.key
GROUP BY Person.cityKey, City.name
I get the result that I want
count  cityKey  cityName
2      1        Albany
2      2        Berkeley
However, I've read that throwing in that last part of the GROUP BY clause (City.name) just to make it work is wrong.
So what's the right way to do this? I've been trying to google for an answer, but I feel like there's something fundamental that I'm just not getting.
I don't think that it's "wrong" in this case, because you've got a one-to-one relationship between city name and city key. You could rewrite it so that City is joined to a sub-select that counts persons per city key, with the name taken from City, but it's debatable whether that would be better. It's a matter of style and opinion, I guess.
select PC.ct, City.key, City.name
from City
join (select count(Person.key) ct, cityKey key from Person group by cityKey) PC
on City.key = PC.key
if my SQL isn't too rusty :-)
...I've read that throwing in that last part of the GROUP BY clause (City.name) just to make it work is wrong.
You misunderstand, you got it backwards.
Standard SQL requires you to specify in the GROUP BY all the columns mentioned in the SELECT that are not wrapped in aggregate functions. If you don't want certain columns in the GROUP BY, wrap them in aggregate functions. Depending on the database, you could use the analytic/windowing function OVER...
However, MySQL and SQLite provide the "feature" where you can omit these columns from the GROUP BY, which leads to no end of "why doesn't this port from MySQL to fill_in_the_blank database?!" questions on Stack Overflow and numerous other sites and forums.
However, I've read that throwing in
that last part of the GROUP BY clause
(City.name) just to make it work is
wrong.
It's not wrong. You have to understand how the Query Optimizer sees your query. The order in which it is parsed is what requires you to "throw the last part in." The optimizer sees your query in something akin to this order:
the required tables are joined
the composite dataset is filtered through the WHERE clause
the remaining rows are chopped into groups by the GROUP BY clause, and aggregated
they are then filtered again, through the HAVING clause
finally operated on, by SELECT / ORDER BY, UPDATE or DELETE.
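The order above can be traced on the Person and City tables from the question. This is a sketch run through Python's sqlite3 module, and the WHERE and HAVING conditions are made up for illustration. The numbered SQL comments follow the logical steps: David is removed by WHERE before grouping, so Albany's group has only one person left and is then dropped by HAVING.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE City ("key" INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE Person ("key" INTEGER PRIMARY KEY, name TEXT, cityKey INTEGER);
    INSERT INTO City VALUES (1, 'Albany'), (2, 'Berkeley'), (3, 'Chico');
    INSERT INTO Person VALUES
        (1, 'Alice', 1), (2, 'Bob', 2), (3, 'Charles', 2), (4, 'David', 1);
""")

rows = conn.execute("""
    SELECT City."key", City.name, COUNT(Person."key") AS cnt  -- 5. project
    FROM Person
    JOIN City ON Person.cityKey = City."key"                   -- 1. join
    WHERE Person.name <> 'David'                               -- 2. filter rows
    GROUP BY City."key", City.name                             -- 3. group, aggregate
    HAVING COUNT(Person."key") >= 2                            -- 4. filter groups
    ORDER BY City.name                                         -- 5. sort
""").fetchall()
print(rows)
```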
The point here is that it's not that the GROUP BY has to name all the columns in the SELECT, but in fact it is the opposite - the SELECT cannot include any columns not already in the GROUP BY.
Your query would only work on databases that allow selecting ungrouped columns, such as MySQL, because you group on Person.cityKey but select City.key. Most other databases would require you to use an aggregate like min(city.key), or to add City.key to the GROUP BY clause.
Because the combination of city name and city key is unique, the following are equivalent:
select count(person.key), min(city.key), min(city.name)
...
group by person.citykey
Or:
select count(person.key), city.key, city.name
...
group by person.citykey, city.key, city.name
Or:
select count(person.key), city.key, max(city.name)
...
group by city.key
All rows in the group will have the same city name and key, so it doesn't matter if you use the max or min aggregate.
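A quick check of that equivalence, as a sketch: the three variants are filled in with the LEFT JOIN from the question (the "..." parts are an assumption) and run against the sample data through Python's sqlite3 module.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE City ("key" INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE Person ("key" INTEGER PRIMARY KEY, name TEXT, cityKey INTEGER);
    INSERT INTO City VALUES (1, 'Albany'), (2, 'Berkeley'), (3, 'Chico');
    INSERT INTO Person VALUES
        (1, 'Alice', 1), (2, 'Bob', 2), (3, 'Charles', 2), (4, 'David', 1);
""")

join = 'FROM Person LEFT JOIN City ON Person.cityKey = City."key"'

# Variant 1: aggregate the city columns.
q_min = f'SELECT COUNT(Person."key"), MIN(City."key"), MIN(City.name) {join} GROUP BY Person.cityKey'
# Variant 2: list every selected column in the GROUP BY.
q_all = f'SELECT COUNT(Person."key"), City."key", City.name {join} GROUP BY Person.cityKey, City."key", City.name'
# Variant 3: group by the city key and aggregate only the name.
q_max = f'SELECT COUNT(Person."key"), City."key", MAX(City.name) {join} GROUP BY City."key"'

results = [sorted(conn.execute(q).fetchall()) for q in (q_min, q_all, q_max)]
for r in results:
    print(r)
```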
P.S. If you'd like to count only different persons, even if they have multiple rows, try:
count(DISTINCT person.key)
instead of
count(person.key)
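For instance, when a join produces several rows per person, the two counts differ. The Phone table below is a hypothetical addition used only to create such duplicates, and the sketch runs through Python's sqlite3 module.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Person ("key" INTEGER PRIMARY KEY, name TEXT, cityKey INTEGER);
    CREATE TABLE Phone (personKey INTEGER, number TEXT);  -- hypothetical table
    INSERT INTO Person VALUES (1, 'Alice', 1), (2, 'Bob', 2);
    INSERT INTO Phone VALUES (1, '555-0100'), (1, '555-0101'), (2, '555-0102');
""")

# Alice appears twice after the join, so the plain count sees three rows.
counts = conn.execute("""
    SELECT COUNT(Person."key"), COUNT(DISTINCT Person."key")
    FROM Person
    JOIN Phone ON Phone.personKey = Person."key"
""").fetchone()
print(counts)
```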

Return all Fields and Distinct Rows

Whats the best way to do this, when looking for distinct rows?
SELECT DISTINCT name, address
FROM table;
I still want to return all fields, ie address1, city etc but not include them in the DISTINCT row check.
Then you have to decide what to do when there are multiple rows with the same value in the column you want the distinct check to apply to, but with different values in the other columns. In that case, how does the query processor know which of the multiple values in the other columns to output? If you don't care, then just write a GROUP BY on the distinct column, with Min() or Max() on all the other ones.
EDIT: I agree with the comments from others that as long as you have multiple dependent columns in the same table (e.g., Address1, Address2, City, State), this approach is going to give you mixed (and therefore inconsistent) results. If each column attribute in the table is independent (if addresses are all in an Address table and only an AddressId is in this table), then it's not as significant an issue, because at least all the columns from a join to the Address table will come from the same address. But you are still getting a more or less random selection of one of the set of multiple addresses.
This will not mix and match your city, state, etc., and should even give you the last one added:
select b.*
from (
select max(id) id, Name, Address
from table a
group by Name, Address) as a
inner join table b
on a.id = b.id
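A runnable sketch of this idea via Python's sqlite3 module. The table name contacts and its rows are placeholders (the original uses the literal name table), and "last one added" assumes that id grows with insertion order.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE contacts (id INTEGER PRIMARY KEY, Name TEXT, Address TEXT, City TEXT);
    INSERT INTO contacts VALUES
        (1, 'abc', '123', 'def'),
        (2, 'abc', '123', 'hij'),
        (3, 'xyz', '9',   'klm');
""")

# Pick the highest id per (Name, Address), then fetch that complete row,
# so all other columns come from one and the same row.
rows = conn.execute("""
    SELECT b.*
    FROM (SELECT MAX(id) AS id, Name, Address
          FROM contacts
          GROUP BY Name, Address) AS latest
    INNER JOIN contacts AS b ON latest.id = b.id
    ORDER BY b.id
""").fetchall()
print(rows)
```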
When you have a mixed set of fields, some of which you want to be DISTINCT and others that you just want to appear, you require an aggregate query rather than DISTINCT. DISTINCT is only for returning single copies of identical fieldsets. Something like this might work:
SELECT name,
GROUP_CONCAT(DISTINCT address) AS addresses,
GROUP_CONCAT(DISTINCT city) AS cities
FROM the_table
GROUP BY name;
The above will get one row for each name. addresses contains a comma-delimited string of all the distinct addresses for that name. cities does the same for all the cities.
However, I don't see how the results of this query are going to be useful. It will be impossible to tell which address belongs to which city.
If, as is often the case, you are trying to create a query that will output rows in the format you require for presentation, you're much better off accepting multiple rows and then processing the query results in your application layer.
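The loss of pairing is easy to see in a small sketch. It uses SQLite's group_concat (which also accepts DISTINCT) through Python's sqlite3 module, with made-up rows. The order of the concatenated values is not guaranteed, so they are compared as sets.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE the_table (name TEXT, address TEXT, city TEXT);
    INSERT INTO the_table VALUES
        ('abc', '1 Main St', 'Springfield'),
        ('abc', '2 Oak Ave', 'Shelbyville'),
        ('xyz', '9 Elm Rd',  'Ogdenville');
""")

rows = conn.execute("""
    SELECT name,
           group_concat(DISTINCT address) AS addresses,
           group_concat(DISTINCT city) AS cities
    FROM the_table
    GROUP BY name
    ORDER BY name
""").fetchall()

# One row per name; nothing in the row says which address goes with which city.
abc_addresses = set(rows[0][1].split(","))
abc_cities = set(rows[0][2].split(","))
print(rows)
```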
I don't think you can do this because it doesn't really make sense.
name | address | city | etc...
abc | 123 | def | ...
abc | 123 | hij | ...
if you were to include city, but not have it as part of the distinct clause, the value of city would be unpredictable unless you did something like Max(city).
You can do
SELECT Name, Address, Max(Address1), Max(City)
FROM table
GROUP BY Name, Address
Use @JBrooks's answer below. He has a better answer.
If you're using SQL Server 2005 or above, you can use the ROW_NUMBER() function. This will get you the row with the lowest ID for each name. If you want to 'group' by more columns, add them to the PARTITION BY section of ROW_NUMBER().
SELECT id, Name, Address, ...
FROM (SELECT id, Name, Address, ...,
ROW_NUMBER() OVER (PARTITION BY Name ORDER BY id) AS RowNo
FROM table) sub
WHERE RowNo = 1
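The same pattern can be tried outside SQL Server. The sketch below runs it through Python's sqlite3 module, which needs SQLite 3.25 or newer for window functions. The table name contacts and its rows are placeholders.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE contacts (id INTEGER PRIMARY KEY, Name TEXT, Address TEXT, City TEXT);
    INSERT INTO contacts VALUES
        (1, 'abc', '123', 'def'),
        (2, 'abc', '123', 'hij'),
        (3, 'xyz', '9',   'klm');
""")

# Number the rows within each Name by id and keep only the first of each.
rows = conn.execute("""
    SELECT id, Name, Address, City
    FROM (SELECT id, Name, Address, City,
                 ROW_NUMBER() OVER (PARTITION BY Name ORDER BY id) AS RowNo
          FROM contacts) AS sub
    WHERE RowNo = 1
    ORDER BY id
""").fetchall()
print(rows)
```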