Is a better solution like a single set operation possible - sql

I can't think of a single T-SQL operation through which the following problem can be solved. I can think only of a record-by-record operation to solve it.
The problem is as follows:
For each village, a number of shops are assigned (from 1 to n).
The same shop can serve more than one village.
Each shop has a different maximum capacity (given in a table).
Need to assign all members of a family (based on family id) to the same shop in such a way that 'nearly' equal numbers of families are assigned to each FPS. As the number of families may not be evenly divisible by the number of FPS, a few shops may get one additional family. While assigning the last family, if the FPS max capacity is exceeded by a few members, that is acceptable. This, however, would not happen if the last family has just one member.
Some families may remain unassigned if the max capacity is exceeded for every FPS assigned to that village.
Available tables
Population: Uniqid, Familyid, name, shopcode, villagecode
Village: VillageId
Shop: ShopId, Name, MaxCapacity
VillageShopMap: VillageId, ShopId
My solution is as follows
Take each village
Get one Family for that village
Get the shop with the minimum number of persons allotted for that village, whose current capacity < max capacity
Continue until that population from that village is exhausted, or Shop MaxCapacity is reached (in that case some people remain unassigned to shops, that is acceptable)
Loop
My solution is extremely slow. Looking for a better solution.
Thanks

Not much but could use this to fill a shop in one pass
In this case 20 is the shop capacity
The top 20 is just to not evaluate more than needed - a family will have at least one
This could leave some shops empty
You could scale capacity to a fraction of the actual capacity
with famA as
( select top 20 sParID as ID, count(*) as famSize
from docSVsys
group by sParID
)
, fam as
( select famA.*, ROW_NUMBER() over (order by ID) as rn
from famA
)
, famCum as
( select fam.ID, famSize, fam.rn,
(select sum(f.famSize) from fam f where f.rn <= fam.rn) as cum
from fam
)
select famCum.*
from famCum
where famCum.rn <= (select max(f.rn) from famCum f where f.cum <= 20) + 1
order by famCum.rn
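The running-total logic of the query above can be sketched outside the database as follows. This is only an illustration in Python; the function name `fill_shop` and the sample family sizes are hypothetical and not part of the original answer. It keeps every family whose cumulative size still fits, plus one more, mirroring the `+ 1` in the final `where` clause.

```python
from itertools import accumulate


def fill_shop(family_sizes, capacity):
    """Pick the families that fill one shop.

    family_sizes: list of (family_id, size) pairs, already in assignment order.
    Returns (family_id, size, running_total) for every family whose running
    total fits within capacity, plus one extra family that may overshoot.
    """
    totals = list(accumulate(size for _, size in family_sizes))
    # Running totals only grow, so the count of fitting totals equals
    # "max(rn) where cum <= capacity" in the query.
    fitting = sum(1 for total in totals if total <= capacity)
    if fitting == 0:
        # In the query max() is NULL here, so no rows come back.
        return []
    keep = min(fitting + 1, len(family_sizes))
    return [
        (family_id, size, total)
        for (family_id, size), total in zip(family_sizes[:keep], totals)
    ]


if __name__ == "__main__":
    demo = [("A", 6), ("B", 7), ("C", 5), ("D", 4), ("E", 3)]
    for row in fill_shop(demo, 20):
        print(row)
```

Whether the one overshooting family is acceptable depends on the rule in the question (a few members over capacity are tolerated), so the `+ 1` step may need an extra size check.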
Repeating shopcode and villagecode in Population is not 3NF.
You should have a Family table, and I would denormalize and store a famSize column in it so you are not calculating the size over and over.
Or assume you have the above Family table and a ShopView with CurCapacity.
Then you can assign one family to each open shop in one pass:
with ShopOne as
( select ShopId, min(VillageID) as VillageID
from ShopView
where CurCapacity < MaxCapacity
group by ShopId
)
, FamilyRn as
( select Family.*, row_number() over (partition by VillageID order by ID) as rn
from Family where ShopID is null
)
select FamilyRn.*, ShopOne.*
from ShopOne
join FamilyRn
on ShopOne.VillageID = FamilyRn.VillageID
and FamilyRn.rn = 1

Related

UPDATE Capacity using COUNT()

I have three tables which are joined by the following
FLIGHT F,
RESERVATION R,
AIRPLANE A
where F.AirplaneSerialNum = A.AirplaneSerialNum
and F.FlightCode = R.FlightCode
In the airplane table, there is a column to store the maximum capacity of any given plane.
In the reservation table, records of passengers are stored, and the flight they are embarking on is based on the FlightCode
In the flight table, there is a column to store the remaining capacity of any given plane, and each flight is uniquely determined by its FlightCode
Thus, I would like to find a way to update the remaining capacity by taking the values of the original maximum capacity, then get the remaining capacity by doing a COUNT() of the number of times the FlightCode appears in the reservation table
So far I've got the first half to work (setting RemCapacity as the original max capacity)
UPDATE FLIGHT F
SET F.RemCapacity = (SELECT Capacity FROM airplane
WHERE AIRPLANE.airplaneserialnum = F.airplaneserialnum);
However, I'm stuck trying to subtract the number of reservations
-- to get the count for number of times the FlightCode appears
SELECT COUNT(*) FROM reservation group by flightcode
UPDATE FLIGHT F
SET F.RemCapacity = F.RemCapacity -
(SELECT COUNT(*) FROM reservation group by flightcode ) WHERE F.FlightCode = R.FlightCode;
(returns %s invalid identifier SQL error)
And also if possible, how can I combine both halves into one query?
Not totally sure, but I think this might do the trick for you, doing all the work in one statement:
UPDATE FLIGHT F
SET F.RemCapacity = (SELECT Capacity FROM airplane
WHERE AIRPLANE.airplaneserialnum = F.airplaneserialnum) -
(SELECT COUNT(*) FROM reservation r WHERE F.FlightCode = R.FlightCode);
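To check the combined statement on a small data set, here is a sketch that runs an equivalent UPDATE in an in-memory SQLite database from Python. The table layouts, the `passenger` column and the sample rows are assumptions made for the demo; the table is referenced by its full name instead of an alias to keep the statement portable. Because `COUNT(*)` returns 0 for a flight with no reservations, no NULL handling is needed for the second subquery.

```python
import sqlite3


def remaining_capacity(airplanes, flights, reservations):
    """Load sample rows, run the single-statement UPDATE, return the result.

    airplanes:    list of (airplaneserialnum, capacity)
    flights:      list of (flightcode, airplaneserialnum)
    reservations: list of (passenger, flightcode)
    """
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE airplane (airplaneserialnum TEXT PRIMARY KEY, capacity INTEGER);
        CREATE TABLE flight (flightcode TEXT PRIMARY KEY,
                             airplaneserialnum TEXT,
                             remcapacity INTEGER);
        CREATE TABLE reservation (passenger TEXT, flightcode TEXT);
    """)
    conn.executemany("INSERT INTO airplane VALUES (?, ?)", airplanes)
    conn.executemany(
        "INSERT INTO flight (flightcode, airplaneserialnum) VALUES (?, ?)", flights)
    conn.executemany("INSERT INTO reservation VALUES (?, ?)", reservations)

    # Maximum capacity minus the number of reservations, in one statement.
    conn.execute("""
        UPDATE flight
        SET remcapacity =
            (SELECT capacity FROM airplane
             WHERE airplane.airplaneserialnum = flight.airplaneserialnum)
          - (SELECT COUNT(*) FROM reservation
             WHERE reservation.flightcode = flight.flightcode)
    """)
    rows = conn.execute(
        "SELECT flightcode, remcapacity FROM flight ORDER BY flightcode").fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    print(remaining_capacity(
        [("SN1", 100), ("SN2", 50)],
        [("F1", "SN1"), ("F2", "SN2")],
        [("ann", "F1"), ("bob", "F1"), ("cy", "F1")],
    ))
```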

Selecting percentage of group and population based on a field in a table

I have a table with user IDs and states. I need to assign 20% of users in each state to a control group by setting a flag in another table. I don't know how I would be able to ensure that the numbers are correct though. How would I go about even starting this?
As an example, take a look at this sqlfiddle:
http://sqlfiddle.com/#!4/8e49d/6/0
with counts as
(select stateid, count(userid) as num_users
from userstates
group by stateid)
select *
from (select x.stateid,
x.userid,
sum(1) over(partition by x.stateid order by x.userid) as runner,
y.num_users,
sum(1) over(partition by x.stateid order by x.userid) / y.num_users as pct
from userstates x
join counts y
on x.stateid = y.stateid)
where pct <= .2
There are a couple of assumptions I made:
-- I assumed that, if you could not pull exactly 20%, you would choose, for instance, 19%, rather than 21%. The query would need to be changed slightly if you want to pull 1 ID over 20% when exactly 20% is not possible (you can't pull a fraction of a username, so you have to choose one way or the other).
-- I assumed that you did not want a random 20%, and that 20% of the first user IDs, in order, would suffice. I would need to change the query slightly if you wanted the 20% from each group to be random.
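The first assumption (round down versus round up when exactly 20% is not possible) can be made concrete with a small sketch. This is plain Python, not the Oracle query; the function name `control_group` and its `round_up` flag are hypothetical. Like the query, it takes the first user IDs in order within each state.

```python
from collections import defaultdict


def control_group(user_states, percent=20, round_up=False):
    """Select about `percent` percent of the users in each state.

    user_states: iterable of (userid, stateid) pairs.
    Returns a list of (stateid, userid) for the selected users, taking the
    lowest user IDs of each state first.
    """
    by_state = defaultdict(list)
    for userid, stateid in user_states:
        by_state[stateid].append(userid)

    selected = []
    for stateid in sorted(by_state):
        users = sorted(by_state[stateid])
        n = len(users)
        if round_up:
            # Ceiling of n * percent / 100 using integer arithmetic.
            quota = -((-n * percent) // 100)
        else:
            # Floor, which matches the "pct <= .2" filter in the query.
            quota = (n * percent) // 100
        selected.extend((stateid, userid) for userid in users[:quota])
    return selected


if __name__ == "__main__":
    rows = [(u, "CA") for u in range(1, 8)] + [(u, "NY") for u in range(101, 111)]
    print(control_group(rows))
    print(control_group(rows, round_up=True))
```

With 7 users in a state, rounding down selects 1 user (about 14%) and rounding up selects 2 (about 29%), which is the trade-off described above.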

recursive geometric query : five closest entities

The question is whether the query described below can be done without recourse to procedural logic, that is, can it be handled by SQL and a CTE and a windowing function alone? I'm using SQL Server 2012 but the question is not limited to that engine.
Suppose we have a national database of music teachers with 250,000 rows:
teacherName, address, city, state, zipcode, geolocation, primaryInstrument
where the geolocation column is a geography::point datatype with an optimally tessellated index.
User wants the five closest guitar teachers to his location. A query using a windowing function performs well enough if we pick some arbitrary distance cutoff, say 50 miles, so that we are not selecting all 250,000 rows and then ranking them by distance and taking the closest 5.
But that arbitrary 50-mile radius cutoff might not always succeed in encompassing 5 teachers, if, for example, the user picks an instrument from a different culture, such as sitar or oud or balalaika; there might not be five teachers of such instruments within 50 miles of her location.
Also, now imagine we have a query where a conservatory of music has sent us a list of 250 singers, who are students who have been accepted to the school for the upcoming year, and they want us to send them the five closest voice coaches for each person on the list, so that those students can arrange to get some coaching before they arrive on campus. We have to scan the teachers database 250 times (i.e. scan the geolocation index) because those students all live at different places around the country.
So, I was wondering, is it possible, for that latter query involving a list of 250 student locations, to write a recursive query where the radius begins small, at 10 miles, say, and then increases by 10 miles with each iteration, until either a maximum radius of 100 miles has been reached or the required five (5) teachers have been found? And can it be done only for those students who have yet to be matched with the required 5 teachers?
I'm thinking it cannot be done with SQL alone, and must be done with looping and a temporary table--but maybe that's because I haven't figured out how to do it with SQL alone.
P.S. The primaryInstrument column could reduce the size of the set ranked by distance too but for the sake of this question forget about that.
EDIT: Here's an example query. The SINGER (submitted) dataset contains a column with the arbitrary radius to limit the geo-results to a smaller subset, but as stated above, that radius may define a circle (whose centerpoint is the student's geolocation) which might not encompass the required number of teachers. Sometimes the supplied datasets contain thousands of addresses, not merely a few hundred.
select TEACHERSRANKEDBYDISTANCE.* from
(
select STUDENTSANDTEACHERSINRADIUS.*,
rowpos = row_number()
over(partition by
STUDENTSANDTEACHERSINRADIUS.zipcode+STUDENTSANDTEACHERSINRADIUS.streetaddress
order by DistanceInMiles)
from
(
select
SINGER.name,
SINGER.streetaddress,
SINGER.city,
SINGER.state,
SINGER.zipcode,
TEACHERS.name as TEACHERname,
TEACHERS.streetaddress as TEACHERaddress,
TEACHERS.city as TEACHERcity,
TEACHERS.state as TEACHERstate,
TEACHERS.zipcode as TEACHERzip,
TEACHERS.teacherid,
geography::Point(SINGER.lat, SINGER.lon, 4326).STDistance(TEACHERS.geolocation)
/ (1.6 * 1000) as DistanceInMiles
from
SINGER left join TEACHERS
on
( TEACHERS.geolocation).STDistance( geography::Point(SINGER.lat, SINGER.lon, 4326))
< (SINGER.radius * (1.6 * 1000 ))
and TEACHERS.primaryInstrument='voice'
) as STUDENTSANDTEACHERSINRADIUS
) as TEACHERSRANKEDBYDISTANCE
where rowpos < 6 -- closest 5 is an arbitrary requirement given to us
I think that if you just need the closest 5 teachers regardless of radius, you could write something like this. Each student will appear up to 5 times in the result; I don't know exactly what output you want.
select
S.name,
S.streetaddress,
S.city,
S.state,
S.zipcode,
T.name as TEACHERname,
T.streetaddress as TEACHERaddress,
T.city as TEACHERcity,
T.state as TEACHERstate,
T.zipcode as TEACHERzip,
T.teacherid,
T.geolocation.STDistance(geography::Point(S.lat, S.lon, 4326))
/ (1.6 * 1000) as DistanceInMiles
from SINGER as S
outer apply (
select top 5 TT.*
from TEACHERS as TT
where TT.primaryInstrument='voice'
order by TT.geolocation.STDistance(geography::Point(S.lat, S.lon, 4326)) asc
) as T
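What the `outer apply ... top 5 ... order by distance` pattern computes per student can be sketched in Python as below. This is a brute-force illustration only: it scans every teacher for every student and uses no spatial index, so it says nothing about how SQL Server would execute the query. The haversine formula, the Earth-radius constant and the tuple layouts are assumptions chosen for the demo.

```python
import heapq
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius, an approximation


def distance_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points using the haversine formula."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlam = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * asin(sqrt(a))


def closest_teachers(students, teachers, instrument="voice", k=5):
    """Return {student name: [teacherid, ...]} with the k nearest teachers.

    students: list of (name, lat, lon)
    teachers: list of (teacherid, primary_instrument, lat, lon)
    """
    candidates = [t for t in teachers if t[1] == instrument]
    result = {}
    for name, lat, lon in students:
        # Equivalent of "top k ... order by distance" for this one student.
        nearest = heapq.nsmallest(
            k, candidates,
            key=lambda t: distance_miles(lat, lon, t[2], t[3]))
        result[name] = [t[0] for t in nearest]
    return result


if __name__ == "__main__":
    teachers = [
        (1, "voice", 40.0, -75.1), (2, "voice", 40.0, -75.5),
        (3, "guitar", 40.0, -75.01), (4, "voice", 40.0, -74.8),
        (5, "voice", 41.0, -75.0), (6, "voice", 40.0, -80.0),
    ]
    print(closest_teachers([("Ann", 40.0, -75.0)], teachers, k=3))
```

A student with fewer than k matching teachers simply gets a shorter list, which corresponds to the radius-free behaviour of the query above.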

How do I use the MAX function over three tables?

So, I have a problem with a SQL Query.
It's about getting weather data for German cities. I have 4 tables: staedte (the cities, with primary key loc_id), gehoert_zu (contains the city key and the key of the weather station that is closest to this city (stations_id)), wettermessung (contains all the weather information and the station's key value) and wetterstation (contains the station's key and location). I'm using PostgreSQL.
Here is what the tables look like:
wetterstation
s_id[PK]  standort  lon    lat  hoehe
-------------------------------------
10224     Bremen    53.05  8.8  4

wettermessung
stations_id[PK]  datum[PK]  max_temp_2m  ......
-----------------------------------------------
10224            2013-3-24  -0.4

staedte
loc_id[PK]  name  lat   lon
---------------------------
15          Asch  48.4  9.8

gehoert_zu
loc_id[PK]  stations_id[PK]
---------------------------
15          10224
What I'm trying to do is to get the name of the city with the (for example) highest temperature at a specified date (could be a whole month, or a day). Since the weather data is bound to a station, I actually need to get the station's ID and then just choose one of the corresponding to this station cities. A possible question would be: "In which city was it hottest in June ?" and, say, the highest measured temperature was in station number 10224. As a result I want to get the city Asch. What I got so far is this
SELECT name, MAX (max_temp_2m)
FROM wettermessung, staedte, gehoert_zu
WHERE wettermessung.stations_id = gehoert_zu.stations_id
AND gehoert_zu.loc_id = staedte.loc_id
AND wettermessung.datum BETWEEN '2012-8-1' AND '2012-12-1'
GROUP BY name
ORDER BY MAX (max_temp_2m) DESC
LIMIT 1
There are two problems with the results:
1) it's taking waaaay too long. The tables are not that big (cities has about 70k entries), but it needs between 1 and 7 minutes to get things done (depending on the time span)
2) it ALWAYS produces the same city and I'm pretty sure it's not the right one either.
I hope I managed to explain my problem clearly enough and I'd be happy for any kind of help. Thanks in advance ! :D
If you want to get the max temperature per city use this statement:
SELECT * FROM (
SELECT gz.loc_id, MAX(max_temp_2m) as temperature
FROM wettermessung as wm
INNER JOIN gehoert_zu as gz
ON wm.stations_id = gz.stations_id
WHERE wm.datum BETWEEN '2012-8-1' AND '2012-12-1'
GROUP BY gz.loc_id) as subselect
INNER JOIN staedte as std
ON std.loc_id = subselect.loc_id
ORDER BY subselect.temperature DESC
Use this statement to get the city with the highest temperature (only 1 city):
SELECT * FROM(
SELECT name, MAX(max_temp_2m) as temp
FROM wettermessung as wm
INNER JOIN gehoert_zu as gz
ON wm.stations_id = gz.stations_id
INNER JOIN staedte as std
ON gz.loc_id = std.loc_id
WHERE wm.datum BETWEEN '2012-8-1' AND '2012-12-1'
GROUP BY name
ORDER BY MAX(max_temp_2m) DESC
LIMIT 1) as subselect
ORDER BY temp desc
LIMIT 1
Always use explicit joins (LEFT, RIGHT, INNER JOIN) and avoid joins written as a comma-separated table list, so the join conditions are stated explicitly and your SQL server does not have to guess your table references.
This is a general example of how to get the item with the highest, lowest, biggest, smallest, whatever value. You can adjust it to your particular situation.
select fred, barney, wilma
from bedrock join
(select fred, max(dino) maxdino
from bedrock
where whatever
group by fred ) flinstone on bedrock.fred = flinstone.fred
where dino = maxdino
and other conditions
I propose you use a consistent naming convention. Singular terms for tables holding a single item per row are a good convention. Your only table breaking this is staedte; it should be stadt.
And I suggest using station_id consistently instead of mixing s_id and stations_id.
Building on these premises, for your question:
... get the name of the city with the ... highest temperature at a specified date
SELECT s.name, w.max_temp_2m
FROM (
SELECT station_id, max_temp_2m
FROM wettermessung
WHERE datum >= '2012-8-1'::date
AND datum < '2012-12-1'::date -- exclude upper border
ORDER BY max_temp_2m DESC, station_id -- id as tie breaker
LIMIT 1
) w
JOIN gehoert_zu g USING (station_id) -- assuming normalized names
JOIN stadt s USING (loc_id)
Use explicit JOIN conditions for better readability and maintenance.
Use table aliases to simplify your query.
Use x >= a AND x < b to include the lower border and exclude the upper border, which is the common use case.
Aggregate first and pick your station with the highest temperature, before you join to the other tables to retrieve the city name. Much simpler and faster.
You did not specify what to do when multiple "wettermessungen" tie on max_temp_2m in the given time frame. I added station_id as tiebreaker, meaning the station with the lowest id will be picked consistently if there are multiple qualifying stations.
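The "pick the station first, then join" query above can be tried on a few rows with SQLite from Python, as sketched below. The normalized names (`stadt`, `station_id`) follow the answer's proposal. The sample rows, the helper names `build_demo_db` and `hottest_city`, and the use of zero-padded ISO date strings (so that text comparison orders dates correctly in SQLite) are assumptions for this demo, and the PostgreSQL `::date` casts are dropped.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a few hypothetical sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE wettermessung (station_id INTEGER, datum TEXT, max_temp_2m REAL,
                                    PRIMARY KEY (station_id, datum));
        CREATE TABLE stadt (loc_id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE gehoert_zu (loc_id INTEGER, station_id INTEGER,
                                 PRIMARY KEY (loc_id, station_id));
    """)
    conn.executemany("INSERT INTO wettermessung VALUES (?, ?, ?)", [
        (10224, "2012-08-15", 31.5),
        (10224, "2012-12-01", 40.0),   # on the upper border, so excluded
        (10300, "2012-09-01", 28.0),
    ])
    conn.executemany("INSERT INTO stadt VALUES (?, ?)", [(15, "Asch"), (16, "Ulm")])
    conn.executemany("INSERT INTO gehoert_zu VALUES (?, ?)", [(15, 10224), (16, 10300)])
    return conn


def hottest_city(conn, start, end):
    """Cities attached to the station with the highest reading in [start, end)."""
    return conn.execute("""
        SELECT s.name, w.max_temp_2m
        FROM (
            SELECT station_id, max_temp_2m
            FROM wettermessung
            WHERE datum >= ? AND datum < ?          -- exclude upper border
            ORDER BY max_temp_2m DESC, station_id   -- id as tie breaker
            LIMIT 1
        ) w
        JOIN gehoert_zu g USING (station_id)
        JOIN stadt s USING (loc_id)
        ORDER BY s.name
    """, (start, end)).fetchall()


if __name__ == "__main__":
    print(hottest_city(build_demo_db(), "2012-08-01", "2012-12-01"))
```

If several cities share the winning station, all of them are returned; add a `LIMIT 1` to the outer query if only one is wanted.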

SQL Inequality in duplicates

I found a question in an exam paper which consists of a table Forest having these attributes
Name - type : C(10)
Size - type : I
Location - type : C(10)
Company - type : C(10)
Basically the question asks to find all forests found in the same location with the same company and create this table by that information
Smaller_Forest Larger_Forest CompanyName
I am stuck when I am meant to sort the duplicates by smaller or larger in terms of SQL.
Am I meant to do a CROSS JOIN and eliminate forests with the same size or something like that? And if so, how am I to place the larger and smaller forest in the same record (maybe by the company name value ? )
To filter out the duplicates I did this:
Select * INTO ForestSameLocationCompany
GROUP BY Location, Company
HAVING (count(distinct Location)>1) AND (count(distinct Company)>1)
So this is meant to give me a table with all the duplicate forests by location and company. All that is left is to sort them into the above-mentioned table, which is where I am stuck.
Any help on this matter is much appreciated.
The question is suspect, because it assumes that there are only two forests when there are duplicates. I would start by doing:
select cnt, count(*)
from (select company, location, count(*) as cnt
from Forest
group by company, location
) cl
group by cnt
order by cnt;
This tells you the distribution of the number of forests per company/location pair.
Then, if you only have two forests per company/location, there are several ways to get the name of the smallest and largest on one line. One problem, of course, is that the two forests could have the same size -- and that could be troubling for the approaches. Here is an attempt:
select f.company, f.location,
min(case when f.size = cl.minsize then f.name end) as minForest,
max(case when f.size = cl.maxsize then f.name end) as maxForest
from Forest f join
(select company, location, min(size) as minsize, max(size) as maxsize
from Forest
group by company, location
having count(*) > 1
) cl
on f.company = cl.company and f.location = cl.location
group by f.company, f.location;
By using both min() and max() in the select clause, the query will return both names when the sizes are the same.
As an additional comment, I find such an exercise to be less useful than it should be. There are plenty of examples of real-world data where you have to deal with duplicates. By not mentioning issues such as the number of possible solutions and what to do when the forests are the same size, the exercise is a bit misleading as a real-world example.
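The min/max query from this answer can be exercised on sample data with SQLite from Python, as in the sketch below. The forest names and sizes are made up, the helper name `forest_pairs` is hypothetical, and the output columns are arranged to match the Smaller_Forest / Larger_Forest / CompanyName layout from the question. The columns are qualified with table aliases because `company` and `location` exist on both sides of the join.

```python
import sqlite3


def forest_pairs(forests):
    """Return (smaller_forest, larger_forest, company) per company/location.

    forests: list of (name, size, location, company) rows.
    Only company/location pairs with more than one forest are reported.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE Forest (name TEXT, size INTEGER, location TEXT, company TEXT)")
    conn.executemany("INSERT INTO Forest VALUES (?, ?, ?, ?)", forests)
    rows = conn.execute("""
        SELECT min(CASE WHEN f.size = cl.minsize THEN f.name END) AS smaller_forest,
               max(CASE WHEN f.size = cl.maxsize THEN f.name END) AS larger_forest,
               f.company
        FROM Forest f
        JOIN (SELECT company, location,
                     min(size) AS minsize, max(size) AS maxsize
              FROM Forest
              GROUP BY company, location
              HAVING count(*) > 1) cl
          ON f.company = cl.company AND f.location = cl.location
        GROUP BY f.company, f.location
        ORDER BY f.company, f.location
    """).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    print(forest_pairs([
        ("Oak", 10, "North", "Acme"), ("Pine", 25, "North", "Acme"),
        ("Elm", 5, "South", "Acme"),
        ("Ash", 7, "East", "Birch"), ("Fir", 7, "East", "Birch"),
    ]))
```

In the sample, the two equally sized forests still come out as two different names because min() and max() pick opposite ends of the alphabetical order.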