There are a number of hotels with different bed capacities. I need to find out, for any given day, how many beds are occupied in each hotel.
Sample data:
HOTEL CHECK-IN CHECK-OUT
A 29.05.2010 30.05.2010
A 28.05.2010 30.05.2010
A 27.05.2010 29.05.2010
B 18.08.2010 19.08.2010
B 16.08.2010 20.08.2010
B 15.08.2010 17.08.2010
Intermediate result:
HOTEL DAY OCCUPIED_BEDS
A 27.05.2010 1
A 28.05.2010 2
A 29.05.2010 3
A 30.05.2010 2
B 15.08.2010 1
B 16.08.2010 2
B 17.08.2010 2
B 18.08.2010 2
B 19.08.2010 2
B 20.08.2010 1
Final result:
HOTEL MAX_OCCUPATION
A 3
B 2
A similar question has been asked before. I thought of getting the list of dates (as Tom Kyte shows) between the two dates and calculating each day's occupancy with a GROUP BY. The problem is that my table is relatively big, and I wonder if there is a less costly way of accomplishing this task.
I don't think there's a better approach than the one you outlined in the question. Create your days table (or generate one on the fly). I personally like to have one lying around, updated once a year.
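For example, a one-off Days table can be generated with a row generator; a minimal sketch, assuming Oracle (the question references Tom Kyte) and covering just the sample's year:
-- Hypothetical Days table; DayID matches the query below.
Create Table Days As
Select Date '2010-01-01' + Level - 1 As DayID
From dual
Connect By Level <= 365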
Someone who understands analytic functions will probably be able to do this without an inner/outer query, but as the inner grouping is a subset of the outer one, it doesn't make much difference.
Select
i.Hotel,
Max(i.OccupiedBeds)
From (
Select
s.Hotel,
d.DayID,
Count(*) As OccupiedBeds
From
SampleData s
Inner Join
Days d
-- The +1 might not be needed, depending on business rules:
-- I wouldn't count occupancy on the check-out day; if so, get rid of it.
On d.DayID >= s.CheckIn And d.DayID < s.CheckOut + 1
Group By
s.Hotel,
d.DayID
) i
Group By
i.Hotel
After a bit of playing I couldn't get an analytic-function version to work without an inner query; the closest I got is sketched below.
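For what it's worth, a sketch of that version, using the same Days and SampleData tables as above; the analytic MAX just replaces the outer GROUP BY, the inner aggregation remains:
Select Distinct
    i.Hotel,
    Max(i.OccupiedBeds) Over (Partition By i.Hotel) As MaxOccupation
From (
    Select s.Hotel, d.DayID, Count(*) As OccupiedBeds
    From SampleData s
    Inner Join Days d
        On d.DayID >= s.CheckIn And d.DayID < s.CheckOut + 1
    Group By s.Hotel, d.DayID
) i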
If speed really is a problem with this, you could consider maintaining an intermediate table with triggers on the main table.
http://sqlfiddle.com/#!4/e58e7/24
Create a temp table containing the days you are interested in:
create table #dates (dat datetime)
insert into #dates (dat) values ('20121116')
insert into #dates (dat) values ('20121115')
insert into #dates (dat) values ('20121114')
insert into #dates (dat) values ('20121113')
Get the intermediate result by joining the bookings with the dates, so that one row per booking-day is generated:
SELECT Hotel, d.dat, COUNT(*) AS OCCUPIED_BEDS from bookings b
INNER JOIN #dates d on d.dat BETWEEN b.checkin AND b.checkout
GROUP BY Hotel, d.dat
And finally get the max:
SELECT Hotel, Max(OCCUPIED_BEDS) FROM IntermediateResult GROUP BY Hotel
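The two steps can also be combined into a single statement by treating the intermediate result as a derived table:
SELECT Hotel, MAX(OCCUPIED_BEDS) AS MAX_OCCUPATION
FROM (
    SELECT Hotel, d.dat, COUNT(*) AS OCCUPIED_BEDS
    FROM bookings b
    INNER JOIN #dates d ON d.dat BETWEEN b.checkin AND b.checkout
    GROUP BY Hotel, d.dat
) AS IntermediateResult
GROUP BY Hotel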
The problem with performance is that the join conditions are not based on equality, which makes a hash join impossible. Assuming we have a table hotel_days with hotel-day pairs, I would try something like this:
select ch_in.hotel, ch_in.day,
(check_in_cnt - check_out_cnt) as occupancy_change
from ( select d.hotel, d.day, count(s.hotel) as check_in_cnt
from hotel_days d,
sample_data s
where s.hotel(+) = d.hotel
and s.check_in(+) = d.day
group by d.hotel, d.day
) ch_in,
( select d.hotel, d.day, count(s.hotel) as check_out_cnt
from hotel_days d,
sample_data s
where s.hotel(+) = d.hotel
and s.check_out(+) = d.day
group by d.hotel, d.day
) ch_out
where ch_out.hotel = ch_in.hotel
and ch_out.day = ch_in.day
The tradeoff is a double full scan, but I think it would still run faster, and it may be parallelized. (I assume that sample_data is big mostly due to the number of bookings, not the number of hotels itself.) The output is the change of occupancy in particular hotels on particular days, but this may easily be summed up into total values with either analytic functions (as sketched below) or (probably more efficiently) a PL/SQL procedure with BULK COLLECT.
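For illustration, a minimal sketch of that summing step with an analytic running total, assuming the query above is saved as a view named occupancy_changes (a hypothetical name):
select hotel,
       max(running_occupancy) as max_occupation
from ( select hotel, day,
              sum(occupancy_change)
                  over (partition by hotel order by day) as running_occupancy
       from occupancy_changes
     )
group by hotel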
I keep running into the same problem over and over again, hoping someone can help...
I have a large table with a category column that has 28 distinct values for donkey breed. I'm counting two specific values, grouped by each of those categories, in subqueries like this:
WITH totaldonkeys AS (
SELECT donkeybreed,
COUNT(*) AS total
FROM donkeytable1
GROUP BY donkeybreed
)
,
sickdonkeys AS (
SELECT donkeybreed,
COUNT(*) AS totalsick
FROM donkeytable1
JOIN donkeyhealth on donkeytable1.donkeyid = donkeyhealth.donkeyid
WHERE donkeyhealth.sick IS TRUE
GROUP BY donkeybreed
)
My goal is to end up with a table that has, primarily, the percentage of sick donkeys for each breed, but I keep struggling with not being able to GROUP BY without an aggregate function, which I can't use here:
SELECT (CAST(sickdonkeys.totalsick AS float) / totaldonkeys.total) * 100 AS percentsick,
totaldonkeys.donkeybreed
FROM totaldonkeys, sickdonkeys
GROUP BY totaldonkeys.donkeybreed
When I run this I end up with 28 results for each breed of donkey, one of which is correct, I believe, but obviously hundreds of useless data points.
I know I'm probably being really dumb here, but I keep hitting this same problem again and again with new donkey data. I should obviously be structuring the whole thing a new way, because you just can't do this final query without an aggregate function; I think I must be missing something significant.
You can easily count the proportion that are sick in the donkeyhealth table:
SELECT d.donkeybreed,
AVG( (dh.sick)::int ) AS proportion_sick
FROM donkeytable1 d JOIN
donkeyhealth dh
ON d.donkeyid = dh.donkeyid
GROUP BY d.donkeybreed
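If you want the percentage the question asks for rather than a proportion, just multiply by 100:
SELECT d.donkeybreed,
       100 * AVG( (dh.sick)::int ) AS percentsick
FROM donkeytable1 d JOIN
     donkeyhealth dh
     ON d.donkeyid = dh.donkeyid
GROUP BY d.donkeybreed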
Please help me to optimize my SQL query.
I have a table with the fields: date, commodity_id, exp_month_id, exp_year, price, where the first 4 fields are the primary key. The months are designated with alphabetically ordered letters: e.g. F (for January), G (for February), H (for March), etc. Thus the letter for a month more distant from January sorts after the letter for a less distant one (F < G < H < ...). Some commodity_ids have all 12 months in the table, some only 5 or 3, and the set is constant across years.
I need to calculate the difference between prices (gradient) of neighboring records in terms of (exp_month_id, exp_year). As the first step, I want to define for every pair (exp_month_id, exp_year) the valid pair (next_month_id, next_year). The main problem here is that if the current exp_month_id is the last one in the year, then next_year = exp_year + 1 and next_month_id should be the first one in the year.
I have written the following query to do the job:
WITH trading_months AS (
SELECT DISTINCT commodity_id,
exp_month_id
FROM futures
ORDER BY exp_month_id
)
SELECT DISTINCT f.commodity_id,
f.exp_month_id,
f.exp_year,
(
WITH [temp] AS (
SELECT exp_month_id
FROM trading_months
WHERE commodity_id = f.commodity_id
)
SELECT exp_month_id
FROM [temp]
WHERE exp_month_id > f.exp_month_id
UNION ALL
SELECT exp_month_id
FROM [temp]
LIMIT 1
)
AS next_month_id,
(
SELECT CASE WHEN EXISTS (
SELECT commodity_id,
exp_month_id
FROM trading_months
WHERE commodity_id = f.commodity_id AND
exp_month_id > f.exp_month_id
LIMIT 1
)
THEN f.exp_year ELSE f.exp_year + 1 END
)
AS next_year
FROM futures AS f
This query serves as a base for a dynamic table (view) which is subsequently used for calculating the gradient. However, the execution of this query takes more than one second, and thus the whole process takes minutes. I wonder if you could help me optimize the query.
Note: The following requires SQLite 3.25 or newer for window function support:
The lack of sample data (preferably as CREATE TABLE and INSERT statements for easy importing) and expected results makes it hard to test, but if your end goal is computing the difference in prices between expiration dates (making your question a bit of an XY problem), maybe something like:
SELECT date, commodity_id, price, exp_year, exp_month_id
, price - lag(price, 1) OVER (PARTITION BY commodity_id ORDER BY exp_year, exp_month_id) AS "change from last price"
FROM futures;
Thanks to @Shawn's hint to use window functions, I could rewrite the query in a much shorter form:
CREATE VIEW "futures_nextmonths_win" AS
WITH trading_months AS (
SELECT DISTINCT commodity_id,
exp_month_id,
exp_year
FROM futures)
SELECT commodity_id,
exp_month_id,
exp_year,
lead(exp_month_id) OVER w AS next_month_id,
lead(exp_year) OVER w AS next_year
FROM trading_months
WINDOW w AS (PARTITION BY commodity_id ORDER BY exp_year, exp_month_id);
which is also slightly faster than the original one.
I have a table with potentially up to a few hundred thousand rows. Each row represents an application for funding within a census block, by a given applicant. If I have just a couple hundred rows, I can assign a rank to each application within its census block using this:
SELECT art1.CFA_Plus, art1.Census_Block_ID,
(SELECT count(*) + 1
FROM AppReferenceTable art2
WHERE art2.State_Cost_Per_Unit < art1.State_Cost_Per_Unit AND
art2.Census_Block_ID = art1.Census_Block_ID) AS Rank_In_Block
FROM AppReferenceTable AS art1;
This works fine to rank each application by unit cost within that census block. But it chokes on my test table, which has about 60,000 rows. Is there a better way? Or, what am I doing wrong?
Thanks!
Figured it out: joining the table with itself does the trick.
SELECT art1.Census_Block_ID, art1.CFA_Plus, art1.State_Cost_Per_unit, COUNT(*) As Rank
FROM appreferencetable as art1
LEFT JOIN appreferencetable as art2
ON art1.Census_Block_ID = art2.Census_Block_ID AND art2.State_Cost_Per_Unit <= art1.State_Cost_Per_unit
GROUP BY art1.Census_Block_ID, art1.CFA_Plus, art1.State_cost_per_unit
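For what it's worth, if your database supports window functions (the self-join workaround suggests it may not, e.g. MS Access), RANK() expresses the same ranking directly; a sketch:
SELECT Census_Block_ID, CFA_Plus, State_Cost_Per_Unit,
       RANK() OVER (PARTITION BY Census_Block_ID
                    ORDER BY State_Cost_Per_Unit) AS Rank_In_Block
FROM AppReferenceTable;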
I have to query a table with a few million rows and I want to do it in the most optimized way.
Let's suppose that we want to control access to a movie theater with multiple screening rooms and save it like this:
AccessRecord
(TicketId,
TicketCreationTimestamp,
TheaterId,
ShowId,
MovieId,
SeatId,
CheckInTimestamp)
To simplify, the 'Id' columns are of data type bigint and the 'Timestamp' columns are datetime. Tickets are sold at any time and people enter the theater in random order. The primary key (so also unique) is TicketId.
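For reference, a sketch of the table as described (types per the paragraph above):
CREATE TABLE AccessRecord (
    TicketId bigint PRIMARY KEY,
    TicketCreationTimestamp datetime,
    TheaterId bigint,
    ShowId bigint,
    MovieId bigint,
    SeatId bigint,
    CheckInTimestamp datetime
);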
I want to get, for each Movie, Theater, and Show (time), the AccessRecord info of the first and last person who entered the theater to see a movie. If two check-ins happen at the same time, I just need one of them, either one.
My solution would be to concatenate the PK and the grouped column in a subquery to get the row:
select
AccessRecord.*
from
AccessRecord
inner join(
select
MAX(CONVERT(nvarchar(25),CheckInTimestamp, 121) + CONVERT(varchar(25), TicketId)) as MaxKey,
MIN(CONVERT(nvarchar(25),CheckInTimestamp, 121) + CONVERT(varchar(25), TicketId)) as MinKey
from
AccessRecord
group by
MovieId,
TheaterId,
ShowId
) as MaxAccess
on CONVERT(nvarchar(25),CheckInTimestamp, 121) + CONVERT(varchar(25), TicketId) = MaxKey
or CONVERT(nvarchar(25),CheckInTimestamp, 121) + CONVERT(varchar(25), TicketId) = MinKey
Style 121 converts to the canonical datetime format yyyy-mm-dd hh:mi:ss.mmm (24h), so ordering the strings gives the same result as ordering the datetimes.
As you can see, this join is not very optimized. Any ideas?
Update with how I tested the different solutions:
I've tested all your answers in a real database with SQL Server 2008 R2 with a table over 3M rows to choose the right one.
If I get only the first or the last person who accessed:
Joe Taras's solution lasts 10 secs.
GarethD's solution lasts 21 secs.
If I do the same but with the result ordered by the grouping columns:
Joe Taras's solution lasts 10 secs.
GarethD's solution lasts 46 secs.
If I get both (the first and the last) people who accessed, with an ordered result:
Joe Taras's (doing a union) solution lasts 19 secs.
GarethD's solution lasts 49 secs.
The rest of the solutions (even mine) took more than 60 secs in the first test, so I canceled them.
Try this:
select a.*
from AccessRecord a
where not exists(
select 'next'
from AccessRecord a2
where a2.movieid = a.movieid
and a2.theaterid = a.theaterid
and a2.showid = a.showid
and a2.checkintimestamp > a.checkintimestamp
)
In this way you pick the row with the latest timestamp for the same movie, theater, and show.
TicketId (I suppose) is different for each row.
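The update above mentions "doing a union" to get both the first and the last check-in with this approach; a sketch (note that if several check-ins tie on the timestamp, each tied row is returned, as with the original):
select a.*
from AccessRecord a
where not exists(
    select 'next'
    from AccessRecord a2
    where a2.movieid = a.movieid
    and a2.theaterid = a.theaterid
    and a2.showid = a.showid
    and a2.checkintimestamp > a.checkintimestamp
)
union all
select a.*
from AccessRecord a
where not exists(
    select 'prev'
    from AccessRecord a2
    where a2.movieid = a.movieid
    and a2.theaterid = a.theaterid
    and a2.showid = a.showid
    and a2.checkintimestamp < a.checkintimestamp
)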
Using analytic functions, more specifically ROW_NUMBER, may speed up the query; it should reduce the number of reads:
WITH CTE AS
( SELECT TicketId,
TicketCreationTimestamp,
TheaterId,
ShowId,
MovieId,
SeatId,
CheckInTimestamp,
RowNumber = ROW_NUMBER() OVER(PARTITION By MovieId, TheaterId, ShowId ORDER BY CheckInTimestamp, TicketID),
RowNumber2 = ROW_NUMBER() OVER(PARTITION By MovieId, TheaterId, ShowId ORDER BY CheckInTimestamp DESC, TicketID)
FROM AccessRecord
)
SELECT TicketId,
TicketCreationTimestamp,
TheaterId,
ShowId,
MovieId,
SeatId,
CheckInTimestamp
FROM CTE
WHERE RowNumber = 1
OR RowNumber2 = 1;
However, as always with optimisation, you are best placed to tune your own queries: you have the data to test with and all the execution plans. Try the query with different indexes; if you show the actual execution plan, SSMS will even suggest indexes to help your query. I would expect an index on (MovieId, TheaterId, ShowId) that includes CheckInTimestamp as a non-key column to help.
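For example (the index name is illustrative):
CREATE NONCLUSTERED INDEX IX_AccessRecord_MovieTheaterShow
ON AccessRecord (MovieId, TheaterId, ShowId)
INCLUDE (CheckInTimestamp);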
Either add new columns to the table and pre-convert the dates, or join the PK in that access table to a new table which has the converted values sitting in it already. Looking up the conversion in the new table instead of doing it in the join will speed up your queries immensely. If you can arrange it so that the access record gets an integer FK to the lookup (pre-converted values) table, then you avoid using dates at all and things will be phenomenally faster.
If you normalize the dataset and break it out into a star pattern, things will get even faster.
SELECT
R1.*
FROM AccessRecord R1
LEFT JOIN AccessRecord R2
ON R1.MovieId = R2.MovieId
AND R1.TheaterId = R2.TheaterId
AND R1.ShowId = R2.ShowId
AND (
R1.CheckInTimestamp < R2.CheckInTimestamp
OR (R1.CheckInTimestamp = R2.CheckInTimestamp
AND R1.TicketId < R2.TicketId
))
WHERE R2.TicketId IS NULL
This selects the last entry based on CheckInTimestamp; if there is a tie, it picks the one with the highest TicketId.
Of course an index on MovieId, TheaterId, and ShowId will help.
This is where I learned the trick.
You could also consider a UNION ALL query instead of that nasty OR; ORs are usually slower than UNION ALL queries. See the sketch below.
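For instance, applied to the ROW_NUMBER CTE from the earlier answer (the RowNumber > 1 condition avoids returning a single-row group twice):
WITH CTE AS (
    SELECT TicketId, TicketCreationTimestamp, TheaterId, ShowId,
           MovieId, SeatId, CheckInTimestamp,
           RowNumber  = ROW_NUMBER() OVER (PARTITION BY MovieId, TheaterId, ShowId
                                           ORDER BY CheckInTimestamp, TicketId),
           RowNumber2 = ROW_NUMBER() OVER (PARTITION BY MovieId, TheaterId, ShowId
                                           ORDER BY CheckInTimestamp DESC, TicketId)
    FROM AccessRecord
)
SELECT TicketId, TicketCreationTimestamp, TheaterId, ShowId, MovieId, SeatId, CheckInTimestamp
FROM CTE
WHERE RowNumber = 1
UNION ALL
SELECT TicketId, TicketCreationTimestamp, TheaterId, ShowId, MovieId, SeatId, CheckInTimestamp
FROM CTE
WHERE RowNumber2 = 1 AND RowNumber > 1;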
I am trying to wrap my head around this one this morning.
I am trying to show inventory status for parts (for our products) and this query only becomes complex if I try to return all parts.
Let me lay it out:
single table inventoryReport
I have a distinct list of X parts I wish to display, the result of which must be X # of rows (1 row per part showing latest inventory entry).
table is made up of dated entries of inventory changes (so I only need the LATEST date entry per part).
all data contained in this single table, so no joins necessary.
Currently for 1 single part, it is fairly simple and I can accomplish this by doing the following sql (to give you some idea):
SELECT TOP (1) ldDate, ptProdLine, inPart, inSite, inAbc, ptUm, inQtyOh + inQtyNonet AS in_qty_oh, inQtyAvail, inQtyNonet, ldCustConsignQty, inSuppConsignQty
FROM inventoryReport
WHERE (ldPart = 'ABC123')
ORDER BY ldDate DESC
That gets me my TOP 1 row, simple enough per part, but I need to show all X (let's say 30) parts, so I need 30 rows with that result. Of course the simple solution would be to loop X SQL calls in my code, and that would suffice, but it would be costly; for this purpose I would love to rework this SQL to reduce the X calls back to the DB down to just one query.
From what I can see here I need to keep track of the latest date per item somehow while looking for my result set.
I would ultimately do a
WHERE ldPart in ('ABC123', 'BFD21', 'AA123', etc)
to limit the parts I need. Hopefully I made my question clear enough; let me know if you have an idea. I cannot do a DISTINCT, as the rows are not the same: the date needs to be the latest, and I need a maximum of X rows.
Thoughts? I'm stuck...
SELECT *
FROM (SELECT i.*,
ROW_NUMBER() OVER(PARTITION BY ldPart ORDER BY ldDate DESC) r
FROM inventoryReport i
WHERE ldPart in ('ABC123', 'BFD21', 'AA123', etc)
)
WHERE r = 1
EDIT: Be sure to test the performance of each solution. As pointed out in this question, the CTE method may outperform using ROW_NUMBER.
;with cteMaxDate as (
select ldPart, max(ldDate) as MaxDate
from inventoryReport
group by ldPart
)
SELECT md.MaxDate, ir.ptProdLine, ir.inPart, ir.inSite, ir.inAbc, ir.ptUm, ir.inQtyOh + ir.inQtyNonet AS in_qty_oh, ir.inQtyAvail, ir.inQtyNonet, ir.ldCustConsignQty, ir.inSuppConsignQty
FROM cteMaxDate md
INNER JOIN inventoryReport ir
on md.ldPart = ir.ldPart
and md.MaxDate = ir.ldDate
You need to join to a subquery:
SELECT i.ldPart, x.LastDate, i.inAbc
FROM inventoryReport i
INNER JOIN (Select ldPart, Max(ldDate) As LastDate FROM inventoryReport GROUP BY ldPart) x
on i.ldPart = x.ldPart and i.ldDate = x.LastDate
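Combined with the IN filter from the question (the part list is the question's own example):
SELECT i.ldPart, x.LastDate, i.inAbc
FROM inventoryReport i
INNER JOIN (SELECT ldPart, MAX(ldDate) AS LastDate
            FROM inventoryReport
            WHERE ldPart IN ('ABC123', 'BFD21', 'AA123')
            GROUP BY ldPart) x
    ON i.ldPart = x.ldPart AND i.ldDate = x.LastDate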