SQL Inequality in duplicates - sql

I found a question in an exam paper which consists of a table Forest having these attributes
Name - type : C(10)
Size - type : I
Location - type : C(10)
Company - type : C(10)
Basically, the question asks to find all forests that share the same location and the same company, and to create a table with the following columns from that information:
Smaller_Forest Larger_Forest CompanyName
I am stuck when I am meant to sort the duplicates by smaller or larger in terms of SQL.
Am I meant to do a CROSS JOIN and eliminate forests with the same size or something like that? And if so, how am I to place the larger and smaller forest in the same record (maybe by the company name value ? )
To filter out the duplicates I did this:
Select * INTO ForestSameLocationCompany
GROUP BY Location, Company
HAVING (count(distinct Location)>1) AND (count(distinct Company)>1)
So this is meant to give me a table with all the duplicate forests by location and company. All that is left is to sort them into the table mentioned above, which is where I am stuck.
Any help on this matter is much appreciated.

The question is suspect, because it assumes that there are only two forests when there are duplicates. I would start by doing:
select cnt, count(*)
from (select company, location, count(*) as cnt
from Forest
group by company, location
) cl
group by cnt
order by cnt;
This tells you the distribution of the number of forests per company/location pair, i.e. how many pairs have one forest, how many have two, and so on.
Then, if you only have two forests per company/location, there are several ways to get the name of the smallest and largest on one line. One problem, of course, is that the two forests could have the same size -- and that could be troubling for the approaches. Here is an attempt:
select f.company, f.location,
min(case when f.size = cl.minsize then f.name end) as minForest,
max(case when f.size = cl.maxsize then f.name end) as maxForest
from Forest f join
(select company, location, min(size) as minsize, max(size) as maxsize
from Forest
group by company, location
having count(*) > 1
) cl
on f.company = cl.company and f.location = cl.location
group by f.company, f.location;
By using both min() and max() in the select clause, the query will return both names when the sizes are the same.
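To check how the min()/max() query above behaves, including the equal-size case, here is a minimal sketch that runs it against an in-memory SQLite database from Python. The forest names, sizes, and the output column names Smaller_Forest and Larger_Forest are made up for illustration; only the table layout comes from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Forest (Name TEXT, Size INTEGER, Location TEXT, Company TEXT);
    -- Hypothetical sample rows, not taken from the exam paper.
    INSERT INTO Forest VALUES
        ('Ash',   10, 'North', 'Acme'),
        ('Birch', 25, 'North', 'Acme'),
        ('Cedar', 40, 'South', 'Acme'),    -- only forest for this pair, filtered out
        ('Elm',   15, 'East',  'Globex'),
        ('Fir',   15, 'East',  'Globex');  -- same size as Elm
""")

query = """
    SELECT f.Company, f.Location,
           MIN(CASE WHEN f.Size = cl.minsize THEN f.Name END) AS Smaller_Forest,
           MAX(CASE WHEN f.Size = cl.maxsize THEN f.Name END) AS Larger_Forest
    FROM Forest f
    JOIN (SELECT Company, Location, MIN(Size) AS minsize, MAX(Size) AS maxsize
          FROM Forest
          GROUP BY Company, Location
          HAVING COUNT(*) > 1) cl
      ON f.Company = cl.Company AND f.Location = cl.Location
    GROUP BY f.Company, f.Location
    ORDER BY f.Company, f.Location
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

In this sample data, the Globex/East pair has two forests of equal size, and the query still reports both names because MIN() and MAX() pick different ends of the alphabetical order.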
As an additional comment, I find such an exercise to be less useful than it should be. There are plenty of examples of real-world data where you have to deal with duplicates. By not mentioning issues such as the number of possible solutions and what to do when the forests are the same size, the exercise is a bit misleading as a real-world example.

Related

Create a report with a query

I have a problem. Consider the following fact and dimension tables in a ROLAP system that collects values of harmful substances measured in foods that are sold in supermarkets.
Fact table:
• Contaminants (TimeID, ShopID, FoodID, substance, quantityPerOunce)
This describes which harmful substance in which quantity was found on a given
food in a given supermarket at a given time.
Dimension tables:
• Time (TimeID, dayNr, dayName, weekNr, monthNr, year)
• Food (FoodID, foodName, brand, foodType)
Example data: (43, egg, Bioland, animalProduct)
• Place (ShopID, name, street1, region, country)
Write one SQL statement to create a report that answers the following query:
List the minimum quantities of the substance "PCB" in animal products and
vegetables (both are foodTypes) that were measured per year in the regions Sachsen,
Thüringen, and Hessen in Germany.
The result should contain years, regions, and the minimum values.
With the same statement, also list
the minimum values per year (i.e. aggregating over all regions in each year)
as well as a grand total with the minimum quantity of PCB in the mentioned regions for animal products and vegetables over all years and all regions.
SQL query
SELECT years, regions, min(quantityPerOunce)
FROM Contaminants as c, Time as t, Food as f, Place as p
WHERE c.TimeID = t.TimeID
AND c.FoodID = f.FoodID
AND c.ShopdID = p.ShopID
AND substance = "PCB"
AND foodType = "vegetables"
AND foodType = "animalProducts"
GROUP BY regions;
I don't know how to solve this kind of exercise. I tried it, but I got stuck. The join should be an equi-join, even if this is not the best way.
You are close. First, remember that in GROUP BY queries, the non-aggregate fields in your SELECT must also appear in the GROUP BY clause. Note also that the columns in your schema are named year and region, not years and regions. So, you should have:
GROUP BY year, region;
Further, if you use this:
foodType = 'vegetables' AND foodType = 'animalProducts'
the query will return nothing, because the foodType can't be both at the same time.
As such, you need this:
(foodType = 'vegetables' OR foodType = 'animalProducts')
or alternatively:
foodType IN ('vegetables','animalProducts')
Your query assumes that the region column contains only the three listed regions. If you aren't 100% sure about that, it would be better to specify them explicitly with:
AND region IN ('Sachsen', 'Thüringen', 'Hessen')
This alone also assumes that these regions are only in Germany. This may be true. It might not be though, so it would be safest to also add:
AND country = 'Germany'
So, something along these lines:
SELECT t.year, p.region, MIN(c.quantityPerOunce) AS min_quantityPerOunce
FROM Contaminants AS c, Time AS t, Food AS f, Place AS p
WHERE c.TimeID = t.TimeID
AND c.FoodID = f.FoodID
AND c.ShopID = p.ShopID
AND c.substance = 'PCB'
AND f.foodType IN ('vegetables','animalProducts')
AND p.region IN ('Sachsen', 'Thüringen', 'Hessen')
AND p.country = 'Germany'
GROUP BY t.year, p.region;
Forgive me if I'm mistaken, but it does seem like this might be a school assignment, so it may help to think about general principles in the future:
Identify ALL the nouns in the problem statement (the names of the regions, the name of the country, the names of the food types, the name of the substance) and make sure they are all represented in the query. They likely wouldn't be mentioned in the problem statement / client request if they weren't important. This is a good rule of thumb for professional settings as well as educational settings.
As a rule, fields in the SELECT which aren't aggregates must also be in the GROUP BY. You can have fields in the GROUP BY which are not in the SELECT, but this is far less common.
For parts of the request which list some items from the same field (regions, for example), use field IN (item1,item2,...,itemX) to allow an OR operator on each of the items.
As an addendum, if you have a dimension table called Time, you may want to enclose the name in double-quotes in some systems to avoid confusion with what is normally a system name of some kind.
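The query above returns only the per-year, per-region minima, while the exercise also asks for per-year minima and a grand total in the same statement. In systems that support it (PostgreSQL, SQL Server, and Oracle, for example), GROUP BY ROLLUP(year, region) yields all three levels at once. The sketch below produces the same three levels with UNION ALL instead, because it runs on SQLite, which has no ROLLUP. The sample rows and the foodType spellings are assumptions made for illustration, and the dimension tables are cut down to the columns the query needs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE "Time" (TimeID INTEGER, year INTEGER);
    CREATE TABLE Food (FoodID INTEGER, foodName TEXT, foodType TEXT);
    CREATE TABLE Place (ShopID INTEGER, region TEXT, country TEXT);
    CREATE TABLE Contaminants (TimeID INTEGER, ShopID INTEGER, FoodID INTEGER,
                               substance TEXT, quantityPerOunce REAL);
    -- Hypothetical sample data.
    INSERT INTO "Time" VALUES (1, 2020), (2, 2021);
    INSERT INTO Food VALUES (1, 'egg', 'animalProducts'),
                            (2, 'carrot', 'vegetables'),
                            (3, 'bread', 'bakery');
    INSERT INTO Place VALUES (1, 'Sachsen', 'Germany'),
                             (2, 'Hessen', 'Germany'),
                             (3, 'Bayern', 'Germany');
    INSERT INTO Contaminants VALUES
        (1, 1, 1, 'PCB',  0.5),
        (1, 1, 2, 'PCB',  0.25),
        (1, 2, 1, 'PCB',  0.75),
        (2, 1, 1, 'PCB',  0.125),
        (2, 2, 2, 'PCB',  1.0),
        (1, 3, 1, 'PCB',  0.0625),    -- region not requested
        (1, 1, 3, 'PCB',  0.03125),   -- food type not requested
        (2, 2, 1, 'Lead', 0.015625);  -- different substance
""")

query = """
    WITH filtered AS (
        SELECT t.year AS year, p.region AS region, c.quantityPerOunce AS qty
        FROM Contaminants AS c, "Time" AS t, Food AS f, Place AS p
        WHERE c.TimeID = t.TimeID
          AND c.FoodID = f.FoodID
          AND c.ShopID = p.ShopID
          AND c.substance = 'PCB'
          AND f.foodType IN ('vegetables', 'animalProducts')
          AND p.region IN ('Sachsen', 'Thüringen', 'Hessen')
          AND p.country = 'Germany'
    )
    SELECT year, region, MIN(qty) FROM filtered GROUP BY year, region
    UNION ALL
    SELECT year, NULL, MIN(qty) FROM filtered GROUP BY year   -- per-year level
    UNION ALL
    SELECT NULL, NULL, MIN(qty) FROM filtered                 -- grand total
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

Rows with NULL in region are the per-year minima, and the row with NULL in both year and region is the grand total, which mirrors how ROLLUP marks its subtotal rows.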

SQL joining two tables with different levels of details

So I have two tables of sales, budget and actual.
"budget" has two columns: location and sales. For example,
location sales
24 $20000
36 $100300
40 $24700
Total $145000
"actual" has three columns: invoice_number, location, and sales. For example,
invoice location sales
10000 36 $5000
10001 40 $6000
10002 99 $7000
and so forth
Total $110000
In summary, "actual" records transactions at the invoice level, whereas "budget" is done at the location level only (no individual invoices).
I'm trying to create a summary table that lists actual and budget sales side by side, grouped by location. The total of the actual column should be $110000, and $145000 for budget. This is my attempt at it (on pgAdmin/ postgresql):
SELECT actual.location, SUM(actual.sales) AS actual_sales, SUM(budget.sales) AS budget_sales
FROM actual LEFT JOIN budget
ON actual.location = budget.location
GROUP BY actual.location;
I used LEFT JOIN because "actual" has locations that "budget" doesn't have (e.g. location 99).
I ended up with some gigantic numbers ($millions) on both the actual_sales and budget_sales columns, far exceeding the total actual ($110000) or budget sales ($145,000).
Is this because the way I wrote my query is basically asking SQL to join each invoice in "actual" to each line in "budget," therefore duplicating things many times over? If so how should I have written this?
Thanks in advance!
Based on your description, you seem to have duplicates in both tables. There are various ways to solve this problem. Here is one using union all and group by:
select Location,
sum(actual_sales) as actual_sales,
sum(budget_sales) as budget_sales
from ((select a.location, a.sales as actual_sales, null as budget_sales
from actual a
) union all
(select b.location, null, b.sales
from budget b
)
) ab
group by location;
This structure guarantees that each value is counted only once, regardless of the table.
The query looks fine to me. However, it is difficult to find out why the figures are wrong. My suggestion is that you do the sum by location separately for budget and actual into 2 temporary tables, and later put them together using LEFT JOIN.
Yes, you're joining the budget in once for each actual sales row. However, your Actual Sales sum shouldn't have been larger unless there were multiple budget rows for the same location. You should check for that, because it doesn't sound like there should be.
What you need to do in a case like this is sum the actual sales first in a CTE or subquery, then later join the result to the budget. That way you only have one row for each location. This does it for the actual sales. If you really do have more than one row for a location for budget as well, you might need to subquery the budget as well the same way.
Select Act.Location, Act.actual_sales, budget.sales as budget_sales
From
(
SELECT actual.location, SUM(actual.sales) AS actual_sales
FROM actual
GROUP BY actual.location
) Act
left join budget on Act.location = budget.location
Gordon's suggestion is good. An alternative using WITH statements is:
WITH aloc AS (
SELECT location, SUM(sales) FROM actual GROUP BY 1
), bloc AS (
SELECT location, SUM(sales) FROM budget GROUP BY 1
)
SELECT location, a.sum AS actual_sales, b.sum AS budget_sales
FROM aloc a LEFT JOIN bloc b USING (location)
This is equivalent to:
SELECT location, a.sum AS actual_sales, b.sum AS budget_sales
FROM (SELECT location, SUM(sales) FROM actual GROUP BY 1) a LEFT JOIN
(SELECT location, SUM(sales) FROM budget GROUP BY 1) b USING (location)
but I find WITH statements more readable.
The purpose of the subqueries is to get tables into a state where a row means something relevant, i.e. aloc contains a row per location, and hence cause the join to evaluate to what you want.
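To see the fan-out effect and the fix side by side, here is a small sketch using Python's built-in sqlite3 module. The rows are hypothetical, loosely based on the sample values in the question, with a second invoice added for location 36 so that the duplication becomes visible.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE budget (location INTEGER, sales INTEGER);
    CREATE TABLE actual (invoice INTEGER, location INTEGER, sales INTEGER);
    -- Hypothetical sample rows.
    INSERT INTO budget VALUES (24, 20000), (36, 100300), (40, 24700);
    INSERT INTO actual VALUES
        (10000, 36, 5000),
        (10001, 40, 6000),
        (10002, 99, 7000),
        (10003, 36, 1000);  -- second invoice for location 36
""")

# Join first, aggregate afterwards: the budget row for location 36 is
# repeated once per invoice, so its budget sum is doubled.
naive = conn.execute("""
    SELECT a.location, SUM(a.sales), SUM(b.sales)
    FROM actual a LEFT JOIN budget b ON a.location = b.location
    GROUP BY a.location
    ORDER BY a.location
""").fetchall()

# Aggregate first, join afterwards: one row per location on each side.
fixed = conn.execute("""
    WITH aloc AS (
        SELECT location, SUM(sales) AS actual_sales
        FROM actual
        GROUP BY location
    )
    SELECT a.location, a.actual_sales, b.sales AS budget_sales
    FROM aloc a LEFT JOIN budget b ON a.location = b.location
    ORDER BY a.location
""").fetchall()

print("join then sum:", naive)
print("sum then join:", fixed)
```

With one budget row per location, only the budget column is inflated. The actual column would be inflated as well if budget contained several rows for the same location, which is the case worth checking in the real data.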

How to get most popular name by year in SQL Server

I am practicing SQL in Microsoft SQL Server 2012 (not a homework question), and have a table Names. The table shows baby names by year, with columns Sex (gender of name), N (number of babies having that name), Yr (year), and Name (the name itself).
I need to write a query using only one SELECT statement that returns the most popular baby name by year, with gender, the year, and the number of babies named. So far I have;
SELECT *
From Names
ORDER By N DESC;
Which gives the highest values of N in DESC order, repeating years. I need to limit it to only the highest value in each year, and everything I have tried to do so has thrown errors. Any advice you can give me for this would be appreciated.
Off the top of my head, something like the following would normally let you do it in (technically) one SELECT statement. That statement includes sub-SELECTs, but I'm not immediately seeing an alternative that wouldn't.
When several names share the top rank, both queries bring back all of the tied results, so there may not be exactly one row per year. If you then just need a single representative row from those results, look at using select top 1, perhaps adding order by to get the first alphabetically.
Most popular by year regardless of gender:
-- ONE PER YEAR:
SELECT n.Yr, n.Name, n.Sex, n.N FROM Names n
WHERE NOT EXISTS (
SELECT 1 FROM Names n2
WHERE n2.Yr = n.Yr
AND n2.N > n.N
)
Most popular by year for each gender:
-- ONE PER GENDER PER YEAR:
SELECT n.Yr, n.Name, n.Sex, n.N FROM Names n
WHERE NOT EXISTS (
SELECT 1 FROM Names n2
WHERE n2.Yr = n.Yr
AND n2.Sex = n.Sex
AND n2.N > n.N
)
Performance is, despite the verbosity of the SQL, usually on a par with alternatives when using this pattern (often better).
There are other approaches, including ones built on GROUP BY, but personally I find this one more readable and more portable across DBMSs.
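As a quick illustration of how the NOT EXISTS pattern treats ties, here is a sketch that runs the per-year query on an in-memory SQLite database. It uses the column names from the question (Sex, N, Yr, Name), and the rows are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Names (Sex TEXT, N INTEGER, Yr INTEGER, Name TEXT);
    -- Hypothetical sample rows; 2001 has a two-way tie at the top.
    INSERT INTO Names VALUES
        ('F', 100, 2000, 'Emma'),
        ('M', 120, 2000, 'Liam'),
        ('F',  90, 2000, 'Olivia'),
        ('F',  80, 2001, 'Ava'),
        ('M',  80, 2001, 'Noah'),
        ('M',  50, 2001, 'Eli');
""")

# Keep a row only if no other row in the same year has a larger count.
query = """
    SELECT n.Yr, n.Name, n.Sex, n.N
    FROM Names n
    WHERE NOT EXISTS (
        SELECT 1 FROM Names n2
        WHERE n2.Yr = n.Yr
          AND n2.N > n.N
    )
    ORDER BY n.Yr, n.Name
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

In this data, 2000 yields a single row while 2001 yields both tied names, which is the behaviour described above.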

Is a better solution like a single set operation possible

I can't think of a single T-SQL operation through which the following problem can be solved. I can only think of a record-by-record operation to solve it.
The problem is as follows:
For each village a number of shops are assigned ( from 1 to n).
Same shop can serve more than one village.
Each shop has different maximum capacity (that is given in a table)
Need to assign all members of a family (based on family id) to the same shop in such a way that a nearly equal number of families is assigned to each FPS. As the number of families may not be evenly divisible by the number of FPS, a few shops may get one additional family. While assigning the last family, it is acceptable if the FPS max capacity is exceeded by a few members. This, however, would not happen if the last family has just one member.
Some families may remain unassigned if the max capacity is reached for all FPS assigned to that village.
Available tables
Population: Uniqid, Familyid, name, shopcode, villagecode
Village: VillageId
Shop: ShopId, Name, MaxCapacity
VillageShopMap: VillageId, ShopId
My solution is as follows
Take each village
Get one Family for that village
Get a shop with minimum number of person allotted for that village , whose current capacity < max Capacity
Continue until that population from that village is exhausted, or Shop MaxCapacity is reached (in that case some people remain unassigned to shops, that is acceptable)
Loop
My solution is extremely slow. Looking for a better solution.
Thanks
This is not a full solution, but you could use the following to fill one shop in a single pass.
In this case 20 is the shop capacity.
The top 20 is there only to avoid evaluating more rows than needed, since a family has at least one member.
This could leave some shops empty.
You could scale the capacity down to a fraction of the actual capacity.
with famA as
( select top 20 sParID as ID, count(*) as famSize
from docSVsys
group by sParID
)
, fam as
( select famA.*, ROW_NUMBER() over (order by ID) as rn
from famA
)
, famCum as
( select fam.ID, famSize, fam.rn,
(select sum(f.famSize) from fam f where f.rn <= fam.rn) as cum
from fam
)
select famCum.*
from famCum
where famCum.rn <= (select max(f.rn) from famCum f where f.cum <= 20) + 1
order by famCum.rn
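The cumulative-sum idea in the query above can be tried out on a tiny data set. The sketch below uses Python's sqlite3 with a hypothetical family table whose ids are consecutive, so the id doubles as the rn column. The family sizes and the capacity of 20 are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE family (id INTEGER PRIMARY KEY, famSize INTEGER);
    -- Hypothetical families; ids are consecutive so they can serve as rn.
    INSERT INTO family VALUES (1, 6), (2, 5), (3, 7), (4, 4), (5, 3);
""")

capacity = 20  # assumed shop capacity

# cum is the running total of members up to and including each family.
# Keep every family that still fits, plus one more that may overshoot.
query = """
    WITH famCum AS (
        SELECT f.id, f.famSize,
               (SELECT SUM(g.famSize) FROM family g WHERE g.id <= f.id) AS cum
        FROM family f
    )
    SELECT id, famSize, cum
    FROM famCum
    WHERE id <= (SELECT MAX(id) FROM famCum WHERE cum <= ?) + 1
    ORDER BY id
"""
rows = conn.execute(query, (capacity,)).fetchall()
for row in rows:
    print(row)
```

Here the first three families add up to 18 members, and the fourth is still taken, which pushes the total to 22. That matches the rule in the question that the last family may exceed the capacity by a few members.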
Repeating shopcode and village code in Population is not 3NF
Should have a Family table and I would denormalize and put a famsize in the table so you are not calculating size over and over.
Or assume you have the above Family table and a ShopView with CurCapacity.
You can then assign one family to each open shop in one pass:
with ShopOne as
( select ShopId, min(VillageID) as VillageID
from ShopView
where CurCapacity < MaxCapacity
group by ShopId
)
, FamilyRn as
( select Family.*, row_number() over (partition by VillageID order by ID) as rn
from Family where ShopID is null
)
select FamilyRn.*, ShopOne.*
from ShopOne
join FamilyRn
on ShopOne.VillageID = FamilyRn.VillageID
and FamilyRn.rn = 1

How do I use the MAX function over three tables?

So, I have a problem with a SQL Query.
It's about getting weather data for German cities. I have 4 tables: staedte (the cities with primary key loc_id), gehoert_zu (contains the city-key and the key of the weather station that is closest to this city (stations_id)), wettermessung (contains all the weather information and the station's key value) and wetterstation (contains the stations key and location). And I'm using PostgreSQL
Here is how the tables look like:
wetterstation
s_id[PK] standort lon lat hoehe
----------------------------------------
10224 Bremen 53.05 8.8 4
wettermessung
stations_id[PK] datum[PK] max_temp_2m ......
----------------------------------------------------
10224 2013-3-24 -0.4
staedte
loc_id[PK] name lat lon
-------------------------------
15 Asch 48.4 9.8
gehoert_zu
loc_id[PK] stations_id[PK]
-----------------------------
15 10224
What I'm trying to do is to get the name of the city with (for example) the highest temperature in a specified period (could be a whole month, or a day). Since the weather data is bound to a station, I actually need to get the station's ID and then just choose one of the cities that belong to this station. A possible question would be: "In which city was it hottest in June?" and, say, the highest measured temperature was at station number 10224. As a result I want to get the city Asch. What I've got so far is this:
SELECT name, MAX (max_temp_2m)
FROM wettermessung, staedte, gehoert_zu
WHERE wettermessung.stations_id = gehoert_zu.stations_id
AND gehoert_zu.loc_id = staedte.loc_id
AND wettermessung.datum BETWEEN '2012-8-1' AND '2012-12-1'
GROUP BY name
ORDER BY MAX (max_temp_2m) DESC
LIMIT 1
There are two problems with the results:
1) it's taking waaaay too long. The tables are not that big (cities has about 70k entries), but it needs between 1 and 7 minutes to get things done (depending on the time span)
2) it ALWAYS produces the same city and I'm pretty sure it's not the right one either.
I hope I managed to explain my problem clearly enough and I'd be happy for any kind of help. Thanks in advance ! :D
If you want to get the max temperature per city use this statement:
SELECT * FROM (
SELECT gz.loc_id, MAX(max_temp_2m) as temperature
FROM wettermessung as wm
INNER JOIN gehoert_zu as gz
ON wm.stations_id = gz.stations_id
WHERE wm.datum BETWEEN '2012-8-1' AND '2012-12-1'
GROUP BY gz.loc_id) as subselect
INNER JOIN staedte as std
ON std.loc_id = subselect.loc_id
ORDER BY subselect.temperature DESC
Use this statement to get the city with the highest temperature (only 1 city):
SELECT * FROM(
SELECT name, MAX(max_temp_2m) as temp
FROM wettermessung as wm
INNER JOIN gehoert_zu as gz
ON wm.stations_id = gz.stations_id
INNER JOIN staedte as std
ON gz.loc_id = std.loc_id
WHERE wm.datum BETWEEN '2012-8-1' AND '2012-12-1'
GROUP BY name
ORDER BY MAX(max_temp_2m) DESC
LIMIT 1) as subselect
ORDER BY temp desc
LIMIT 1
Prefer explicit joins (INNER, LEFT, or RIGHT JOIN) over a comma-separated list of table names. They keep the join conditions next to the tables they belong to, so neither the reader nor you have to hunt through the WHERE clause for them.
This is a general example of how to get the item with the highest, lowest, biggest, smallest, whatever value. You can adjust it to your particular situation.
select fred, barney, wilma
from bedrock join
(select fred, max(dino) maxdino
from bedrock
where whatever
group by fred ) flinstone on bedrock.fred = flinstone.fred
where dino = maxdino
and other conditions
I propose you use a consistent naming convention. Singular terms for tables holding a single item per row is a good convention. Your only table breaking this is staedte, which should be stadt.
I also suggest using station_id consistently instead of s_id and stations_id.
Building on these premises, for your question:
... get the name of the city with the ... highest temperature at a specified date
SELECT s.name, w.max_temp_2m
FROM (
SELECT station_id, max_temp_2m
FROM wettermessung
WHERE datum >= '2012-8-1'::date
AND datum < '2012-12-1'::date -- exclude upper border
ORDER BY max_temp_2m DESC, station_id -- id as tie breaker
LIMIT 1
) w
JOIN gehoert_zu g USING (station_id) -- assuming normalized names
JOIN stadt s USING (loc_id)
Use explicit JOIN conditions for better readability and maintenance.
Use table aliases to simplify your query.
Use x >= a AND x < b to include the lower border and exclude the upper border, which is the common use case.
Aggregate first and pick your station with the highest temperature, before you join to the other tables to retrieve the city name. Much simpler and faster.
You did not specify what to do when multiple "wettermessungen" tie on max_temp_2m in the given time frame. I added station_id as tiebreaker, meaning the station with the lowest id will be picked consistently if there are multiple qualifying stations.
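The "pick the top measurement first, then join" approach can be checked on a handful of rows. The sketch below uses Python's sqlite3 rather than PostgreSQL, so dates are stored as zero-padded ISO strings instead of being cast with ::date. It also assumes the normalized names proposed above (station_id, stadt), and all rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE wettermessung (station_id INTEGER, datum TEXT, max_temp_2m REAL);
    CREATE TABLE gehoert_zu (loc_id INTEGER, station_id INTEGER);
    CREATE TABLE stadt (loc_id INTEGER, name TEXT);
    -- Hypothetical sample rows.
    INSERT INTO wettermessung VALUES
        (10224, '2012-08-15', 31.5),
        (10224, '2012-12-01', 40.0),  -- on the excluded upper border
        (10300, '2012-09-01', 29.0),
        (10300, '2012-07-31', 35.0);  -- before the time frame
    INSERT INTO gehoert_zu VALUES (15, 10224), (16, 10300);
    INSERT INTO stadt VALUES (15, 'Asch'), (16, 'Bremen');
""")

# Find the single hottest measurement in the half-open date range first,
# then join only that one row to the city tables.
query = """
    SELECT s.name, w.max_temp_2m
    FROM (
        SELECT station_id, max_temp_2m
        FROM wettermessung
        WHERE datum >= '2012-08-01'
          AND datum <  '2012-12-01'
        ORDER BY max_temp_2m DESC, station_id
        LIMIT 1
    ) w
    JOIN gehoert_zu g USING (station_id)
    JOIN stadt s USING (loc_id)
"""
rows = conn.execute(query).fetchall()
print(rows)
```

If several cities are mapped to the winning station, this returns one row per city, so a further LIMIT 1 on the outer query may be wanted when exactly one city is required.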