Not being an SQL expert, I am struggling with the following:
I inherited a largish table (about 100 million rows) containing time-stamped events that represent stage transitions of mostly short-lived phenomena. The events are unfortunately recorded in a somewhat strange way, with the table looking as follows:
phen_ID  record_time  producer_id  consumer_id  state  ...
000123   10198789                               start
         10298776     000123       000112       hjhkk
000124   10477886                               start
         10577876     000124       000123       iuiii
000124   10876555                               end
Each phenomenon (phen_ID) has a start event and, in theory, an end event, although the end might not have occurred yet and thus not been recorded. Each phenomenon can then go through several states. Unfortunately, for some states the ID is recorded in either the producer or the consumer field. Also, the number of states is not fixed, and neither is the time between the states.
To begin with, I need to create an SQL statement that, for each phen_ID, shows the start time and the time of the last recorded event (which could be an end state or one of the intermediate states).
Just considering a single phen-ID, I managed to pull together the following SQL:
WITH myconstants (var1) as (
    values ('000123')
)
select min(l.record_time), max(l.record_time)
from (
    select distinct *
    from public.phen_table
    JOIN myconstants ON var1 IN (phen_id, producer_id, consumer_id)
) as l
As the start state always has the lowest record_time for a given phenomenon, the above statement correctly returns the recorded time range as one row, irrespective of what the last state is.
Obviously here I have to supply the phen-ID manually.
How can I make this work so that I get a row with the start time and maximum recorded time for each unique phen-ID? I played around with trying to fit in something like select distinct phen_id ... but was not able to "feed" the IDs automatically into the above. Or am I completely off the mark here?
Addition:
Just to clarify, the ideal output using the table above would look something like this:
ID      min-time   max-time
000123  10198789   10577876   (min-time is start, max-time is state iuiii)
000124  10477886   10876555   (min-time is start, max-time is end state)
union all might be an option:
select phen_id,
min(record_time) as min_record_time,
max(record_time) as max_record_time
from (
select phen_id, record_time from phen_table
union all select producer_id, record_time from phen_table
union all select consumer_id, record_time from phen_table
) t
where phen_id is not null
group by phen_id
On the other hand, if you want prioritization, then you can use coalesce():
select coalesce(phen_id, producer_id, consumer_id) as phen_id,
min(record_time) as min_record_time,
max(record_time) as max_record_time
from phen_table
group by coalesce(phen_id, producer_id, consumer_id)
The logic of the two queries is not exactly the same. If there are rows where more than one of the three columns is not null, and the values differ, then the first query takes into account all non-null values, while the second considers only the "first" non-null value.
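A single hypothetical row makes the difference concrete: with phen_id = '000123' and producer_id = '000124' both non-null, the union all query counts the event under both IDs, while coalesce() attributes it to '000123' only:

with phen_table (phen_id, producer_id, consumer_id, record_time) as (
    values ('000123', '000124', null, 10)
)
select phen_id, record_time from phen_table
union all select producer_id, record_time from phen_table
-- returns ('000123', 10) and ('000124', 10), whereas
-- coalesce(phen_id, producer_id, consumer_id) would yield '000123' only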
Edit
In Postgres, which you finally tagged, the union all solution can be phrased more efficiently with a lateral join:
select x.phen_id,
       min(p.record_time) as min_record_time,
       max(p.record_time) as max_record_time
from phen_table p
cross join lateral (values (p.phen_id), (p.producer_id), (p.consumer_id)) as x(phen_id)
where x.phen_id is not null
group by x.phen_id
I think you're on the right track. Try this and see if it is what you are looking for:
select
    min(record_time)
   ,max(record_time)
   ,coalesce(phen_id, producer_id, consumer_id) as "Phen ID"
from public.phen_table
group by coalesce(phen_id, producer_id, consumer_id)
Ok, since it seems that my last two questions (this one and this one) only led to confusion, I will try to explain the FULL problem here, so it might be a long post.
I'm trying to create a database for a trading system. The database has 2 main tables. One is the table "Ticks" and the other is "Candles". As shown in the figure, each table has its own attributes.
Candles, bars and OHLC are the same thing.
The way a candle is seen in a chart is like this:
Candles are just a way to represent aggregated data, nothing more.
There are many ways to aggregate ticks in order to create one candle. In this post, I'm asking about a particular way: creating one candle every 500 ticks. So, if the ticks table has 1000 ticks, I can create only 2 candles. If it has 500 ticks, I can create 1 candle. If it has 5000 ticks, I can create 10 candles. If there are 5001 ticks, I still have only 10 candles, because I'm missing another 499 ticks to create the 11th candle.
Currently, I'm storing all the ticks using a Python script and creating (and therefore inserting into the candles table) candles with another Python script. This is a real-time process.
Both scripts run in a while True: loop. No, I can't (read: shouldn't) stop the scripts, because the market is open 24 hours a day, 5 days a week.
What I'm trying to do is get rid of the Python script that creates and stores the candles in the candles table. Why? Because I think it will improve performance. Instead of doing multiple queries to find out how many ticks are available to create a new candle, I think a trigger could handle it in a more efficient way (please correct me if I'm mistaken).
I don't know how to actually solve it, but what I'm trying to do is this (thanks to @GordonLinoff for helping me in previous questions):
do $$
begin
with total_ticks as (
select count(*) c from (
select * from eurusd_tick2 eurusd where date >
(SELECT date from eurusd_ohlc order by date desc limit 1)
order by date asc) totals),
ticks_for_candles as(
select * from eurusd_tick2 eurusd where date >
(SELECT date from eurusd_ohlc order by date desc limit 1)
order by date asc
), candles as(
select max(date) as date,
max(bid) filter (where mod(seqnum, 500) = 1) as open,
max(bid) as high,
min(bid) as low,
max(bid) filter (where mod(seqnum, 500) = 500-1) as close,
max(ask) filter (where mod(seqnum, 500) = 500-1) as ask
from (
select t.*, row_number() over (order by date) as seqnum
from (select * from ticks_for_candles) t) as a
group by floor((seqnum - 1) /500)
having count(*) = 500
)
case 500<(select * from total_ticks)
when true then
return select * from candles
end;
end $$;
Using this, I get this error:
ERROR: syntax error at or near "case"
LINE 33: case 500<(select * from total_ticks)
^
SQL state: 42601
Character: 945
As you can see, there is no select after the CTEs. If I put:
select case 500<(select * from total_ticks)
when true then
return select * from candles
end;
end $$;
I get this error:
ERROR: subquery must return only one column
LINE 31: (select * from candles)
^
QUERY: with total_ticks as (
select count(*) c from (
select * from eurusd_tick2 eurusd where date >
(SELECT date from eurusd_ohlc order by date desc limit 1)
order by date asc) totals),
ticks_for_candles as(
select * from eurusd_tick2 eurusd where date >
(SELECT date from eurusd_ohlc order by date desc limit 1)
order by date asc
), candles as(
select max(date) as date,
max(bid) filter (where mod(seqnum, 500) = 1) as open,
max(bid) as high,
min(bid) as low,
max(bid) filter (where mod(seqnum, 500) = 500-1) as close,
max(ask) filter (where mod(seqnum, 500) = 500-1) as ask
from (
select t.*, row_number() over (order by date) as seqnum
from (select * from ticks_for_candles) t) as a
group by floor((seqnum - 1) /500)
having count(*) = 500
)
select case 1000>(select * from total_ticks)
when true then
(select * from candles)
end
CONTEXT: PL/pgSQL function inline_code_block line 4 at SQL statement
SQL state: 42601
So, honestly, I don't know how to do it correctly. It doesn't have to be with the actual code I provide here, but the desired output looks as follows:
           date            |  open  |  high   |   low   |  close  |   ask
---------------------------+--------+---------+---------+---------+---------
2020-05-01 20:39:27.603452 | 1.0976 | 1.09766 | 1.09732 | 1.09762 | 1.09776
This would be the output when there are enough ticks to create only 1 candle. If there are enough to create two of them, then there should be 2 rows.
So, at the end of the day, what I have in mind is that the trigger should constantly check whether there is enough data to create a candle and, if there is, create it.
Is this a good idea or should I stick to the Python script?
Can this be achieved with my approach?
What am I doing wrong?
What should I do and how should I manage this situation?
I really hope that the question is now complete and there is no missing information.
All comments and advice are appreciated.
Thanks!
EDIT: Since this is a real-time process, in one second there could be 499 ticks in the database and in the next second there could be 503 ticks. This means that 4 ticks arrived in 1 second.
Being a database guy, my approach would be to use triggers in the database.
Create a third table candle_in_the_making that contains the data from the ticks that have not yet been aggregated to a candles entry.
Create an INSERT trigger on the ticks table (doesn't matter if BEFORE or AFTER) that does the following:
For every tick inserted, add a row to candle_in_the_making.
If the row count reaches 500, compute and insert a new candles row and TRUNCATE candle_in_the_making.
This is simple if ticks are inserted only in a single thread.
If ticks are inserted concurrently, you have to find a way to prevent two threads from inserting the 500th tick into candle_in_the_making at the same time (otherwise you could end up with 501 entries). I can think of two ways to do that in the database:
Have an extra table c_i_m_count that contains only a single number, which is the number of rows in candle_in_the_making. Before you insert into candle_in_the_making, you run the atomic
UPDATE c_i_m_count SET counter = counter + 1 RETURNING counter;
This locks the row, so that any two INSERTs into candle_in_the_making are effectively serialized.
Use advisory locks to serialize the inserting threads. In particular, a transaction-level exclusive lock as taken by pg_advisory_xact_lock would be indicated.
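A minimal sketch of what that trigger could look like (assuming the tick table and columns from the question, eurusd_tick2(date, bid, ask); everything else, including names, is illustrative and untested):

create or replace function ticks_to_candle() returns trigger
language plpgsql as
$$
begin
    -- serialize inserters (option 2 above); the lock key is arbitrary
    perform pg_advisory_xact_lock(42);

    insert into candle_in_the_making (date, bid, ask)
    values (new.date, new.bid, new.ask);

    if (select count(*) from candle_in_the_making) >= 500 then
        insert into eurusd_ohlc (date, open, high, low, close, ask)
        select max(date),
               (array_agg(bid order by date))[1],       -- open  = first bid
               max(bid),                                 -- high
               min(bid),                                 -- low
               (array_agg(bid order by date desc))[1],   -- close = last bid
               (array_agg(ask order by date desc))[1]    -- ask of the last tick
        from candle_in_the_making;
        truncate candle_in_the_making;
    end if;

    return new;
end;
$$;

create trigger make_candle
after insert on eurusd_tick2
for each row execute function ticks_to_candle();
-- (execute function requires PostgreSQL 11+; use execute procedure on older versions)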
I have to query a table with a few million rows and I want to do it in the most optimized way.
Let's suppose that we want to control access to a movie theater with multiple screening rooms and save it like this:
AccessRecord
(TicketId,
TicketCreationTimestamp,
TheaterId,
ShowId,
MovieId,
SeatId,
CheckInTimestamp)
To simplify, the 'Id' columns are of data type 'bigint' and the 'Timestamp' columns are 'datetime'. Tickets are sold at any time and people enter the theater randomly. The primary key (so also unique) is TicketId.
I want to get, for each Movie, Theater and Show (time), the AccessRecord info of the first and the last person who accessed the theater to see a movie. If two check-ins happen at the same time, I just need 1, either of them.
My solution would be to concatenate the timestamp and the PK in a subquery to get the row:
select
AccessRecord.*
from
AccessRecord
inner join(
select
MAX(CONVERT(nvarchar(25),CheckInTimestamp, 121) + CONVERT(varchar(25), TicketId)) as MaxKey,
MIN(CONVERT(nvarchar(25),CheckInTimestamp, 121) + CONVERT(varchar(25), TicketId)) as MinKey
from
AccessRecord
group by
MovieId,
TheaterId,
ShowId
) as MaxAccess
on CONVERT(nvarchar(25),CheckInTimestamp, 121) + CONVERT(varchar(25), TicketId) = MaxKey
or CONVERT(nvarchar(25),CheckInTimestamp, 121) + CONVERT(varchar(25), TicketId) = MinKey
Conversion style 121 is the canonical expression of datetime, resulting in the format yyyy-mm-dd hh:mi:ss.mmm (24h), so ordering it as string data gives the same result as ordering it as a datetime.
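For reference, a quick way to see the style 121 format (the output shown is illustrative):

SELECT CONVERT(nvarchar(25), GETDATE(), 121);
-- e.g. 2012-11-16 10:15:30.123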
As you can see, this join is not very optimized. Any ideas?
Update with how I tested the different solutions:
I've tested all your answers on a real database with SQL Server 2008 R2, on a table of over 3M rows, to choose the right one.
If I get only the first or the last person who accessed:
Joe Taras's solution takes 10 secs.
GarethD's solution takes 21 secs.
If I do the same but with the result ordered by the grouping columns:
Joe Taras's solution takes 10 secs.
GarethD's solution takes 46 secs.
If I get both (the first and the last) people who accessed, with an ordered result:
Joe Taras's solution (doing a union) takes 19 secs.
GarethD's solution takes 49 secs.
The rest of the solutions (even mine) took more than 60 secs in the first test, so I cancelled them.
Try this:
select a.*
from AccessRecord a
where not exists(
select 'next'
from AccessRecord a2
where a2.movieid = a.movieid
and a2.theaterid = a.theaterid
and a2.showid = a.showid
and a2.checkintimestamp > a.checkintimestamp
)
In this way you pick the last row (by timestamp) for the same movie, theater, show.
The ticket (I suppose) is different for each row.
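To get both the first and the last check-in (the "doing a union" variant timed in the question's update), the same pattern can presumably be doubled up; a sketch:

select a.*
from AccessRecord a
where not exists(
    select 'next'
    from AccessRecord a2
    where a2.movieid = a.movieid
    and a2.theaterid = a.theaterid
    and a2.showid = a.showid
    and a2.checkintimestamp > a.checkintimestamp
)
union
select a.*
from AccessRecord a
where not exists(
    select 'prev'
    from AccessRecord a2
    where a2.movieid = a.movieid
    and a2.theaterid = a.theaterid
    and a2.showid = a.showid
    and a2.checkintimestamp < a.checkintimestamp
)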
Using analytical functions may speed up the query, more specifically ROW_NUMBER; it should reduce the number of reads:
WITH CTE AS
( SELECT TicketId,
TicketCreationTimestamp,
TheaterId,
ShowId,
MovieId,
SeatId,
CheckInTimestamp,
RowNumber = ROW_NUMBER() OVER(PARTITION By MovieId, TheaterId, ShowId ORDER BY CheckInTimestamp, TicketID),
RowNumber2 = ROW_NUMBER() OVER(PARTITION By MovieId, TheaterId, ShowId ORDER BY CheckInTimestamp DESC, TicketID)
FROM AccessRecord
)
SELECT TicketId,
TicketCreationTimestamp,
TheaterId,
ShowId,
MovieId,
SeatId,
CheckInTimestamp
FROM CTE
WHERE RowNumber = 1
OR RowNumber2 = 1;
However, as always with optimisation, you are best suited to tune your own queries: you have the data to test with and all the execution plans. Try the query with different indexes; if you show the actual execution plan, SSMS will even suggest indexes to help your query. I would expect an index on (MovieId, TheaterId, ShowId) that includes CheckInTimestamp as a non-key column to help.
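For instance (the index name is made up):

CREATE NONCLUSTERED INDEX IX_AccessRecord_Movie_Theater_Show
ON AccessRecord (MovieId, TheaterId, ShowId)
INCLUDE (CheckInTimestamp);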
Either add new columns to the table and pre-convert the dates, or join the PK in that access table to a new table which already has the converted values sitting in it. Looking up the conversion in a new table instead of doing it in the join will speed up your queries immensely. If you can arrange for the access record to get an integer FK pointing to the lookup (pre-converted values) table, then you avoid using dates at all and things will be phenomenally faster.
If you normalize the dataset and break it out into a star pattern, things will get even faster.
SELECT
R1.*
FROM AccessRecord R1
LEFT JOIN AccessRecord R2
ON R1.MovieId = R2.MovieId
AND R1.TheaterId = R2.TheaterId
AND R1.ShowId = R2.ShowId
AND (
R1.CheckInTimestamp < R2.CheckInTimestamp
OR (R1.CheckInTimestamp = R2.CheckInTimestamp
AND R1.TicketId < R2.TicketId
))
WHERE R2.TicketId IS NULL
This selects the last entry based on the CheckInTimestamp; if two rows match on the timestamp, the one with the highest TicketId is chosen.
Of course an index on MovieId, TheaterId and ShowId will help.
This is where I learned the trick
You could also consider a UNION ALL query instead of that nasty OR. ORs are usually slower than UNION ALL queries.
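A sketch of eliminating the OR here. Note that because this is an anti-join (the WHERE R2.TicketId IS NULL test), splitting the OR gives two anti-joins combined with AND rather than a literal UNION ALL:

SELECT R1.*
FROM AccessRecord R1
LEFT JOIN AccessRecord R2        -- a later check-in exists?
    ON R1.MovieId = R2.MovieId
    AND R1.TheaterId = R2.TheaterId
    AND R1.ShowId = R2.ShowId
    AND R1.CheckInTimestamp < R2.CheckInTimestamp
LEFT JOIN AccessRecord R3        -- same check-in time, higher ticket exists?
    ON R1.MovieId = R3.MovieId
    AND R1.TheaterId = R3.TheaterId
    AND R1.ShowId = R3.ShowId
    AND R1.CheckInTimestamp = R3.CheckInTimestamp
    AND R1.TicketId < R3.TicketId
WHERE R2.TicketId IS NULL
  AND R3.TicketId IS NULL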
Suppose I have a database of athletic meeting results with a schema as follows
DATE,NAME,FINISH_POS
I wish to do a query to select all rows where an athlete has competed in at least three events without winning. For example with the following sample data
2013-06-22,Johnson,2
2013-06-21,Johnson,1
2013-06-20,Johnson,4
2013-06-19,Johnson,2
2013-06-18,Johnson,3
2013-06-17,Johnson,4
2013-06-16,Johnson,3
2013-06-15,Johnson,1
The following rows:
2013-06-20,Johnson,4
2013-06-19,Johnson,2
Would be matched. I have only managed to get started with the following stub:
select date,name FROM table WHERE ...;
I've been trying to wrap my head around the WHERE clause, but I can't even get started.
I think this can be even simpler / faster:
SELECT day, place, athlete
FROM (
SELECT *, min(place) OVER (PARTITION BY athlete
ORDER BY day
ROWS 3 PRECEDING) AS best
FROM t
) sub
WHERE best > 1
->SQLfiddle
This uses the aggregate function min() as a window function to get the minimum place of the last three rows plus the current one.
The then trivial check for "no win" (best > 1) has to be done on the next query level, since window functions are applied after the WHERE clause. So you need at least one CTE or subselect for a condition on the result of a window function.
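In other words, the direct form fails (illustrative):

SELECT day, place, athlete
FROM t
WHERE min(place) OVER (PARTITION BY athlete ORDER BY day ROWS 3 PRECEDING) > 1;
-- ERROR:  window functions are not allowed in WHERE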
Details about window function calls in the manual here. In particular:
If frame_end is omitted it defaults to CURRENT ROW.
If place (finishing_pos) can be NULL, use this instead:
WHERE best IS DISTINCT FROM 1
min() ignores NULL values, but if all rows in the frame are NULL, the result is NULL.
Don't use type names and reserved words as identifiers; I substituted day for your date.
This assumes at most 1 competition per day, else you have to define how to deal with peers in the timeline, or use timestamp instead of date.
@Craig already mentioned the index to make this fast.
Here's an alternative formulation that does the work in two scans without subqueries:
SELECT
"date", athlete, place
FROM (
SELECT
"date",
place,
athlete,
1 <> ALL (array_agg(place) OVER w) AS include_row
FROM Table1
WINDOW w AS (PARTITION BY athlete ORDER BY "date" ASC ROWS BETWEEN 3 PRECEDING AND CURRENT ROW)
) AS history
WHERE include_row;
See: http://sqlfiddle.com/#!1/fa3a4/34
The logic here is pretty much a literal translation of the question. Get the last four placements - current and the previous 3 - and return any rows in which the athlete didn't finish first in any of them.
Because the window frame is the only place where the number of rows of history to consider is defined, you can parameterise this variant unlike my previous effort (obsolete, http://sqlfiddle.com/#!1/fa3a4/31), so it works for the last n for any n. It's also a lot more efficient than the last try.
I'd be really interested in the relative efficiency of this vs @Andomar's query when executed on a dataset of non-trivial size. They're pretty much exactly the same on this tiny dataset. An index on Table1(athlete, "date") would be required for this to perform optimally on a large data set.
; with CTE as
(
select row_number() over (partition by athlete order by date) rn
, *
from Table1
)
select *
from CTE cur
where not exists
(
select *
from CTE prev
where prev.place = 1
and prev.athlete = cur.athlete
and prev.rn between cur.rn - 3 and cur.rn
)
Live example at SQL Fiddle.
There are a number of hotels with different bed capacities. I need to know, for any given day, how many beds are occupied in each hotel.
Sample data:
HOTEL CHECK-IN CHECK-OUT
A 29.05.2010 30.05.2010
A 28.05.2010 30.05.2010
A 27.05.2010 29.05.2010
B 18.08.2010 19.08.2010
B 16.08.2010 20.08.2010
B 15.08.2010 17.08.2010
Intermediary Result:
HOTEL DAY OCCUPIED_BEDS
A 27.05.2010 1
A 28.05.2010 2
A 29.05.2010 3
A 30.05.2010 2
B 15.08.2010 1
B 16.08.2010 2
B 17.08.2010 2
B 18.08.2010 2
B 19.08.2010 2
B 20.08.2010 1
Final result:
HOTEL MAX_OCCUPATION
A 3
B 2
A similar question has been asked before. I thought of getting the list of dates (as Tom Kyte shows) between the two dates and calculating each day's occupancy with a group by. The problem is that my table is relatively big, and I wonder if there is a less costly way of accomplishing this task.
I don't think there's a better approach than the one you outlined in the question. Create your days table (or generate one on the fly). I personally like to have one lying around, updated once a year.
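Generating one on the fly is also cheap; in Oracle that might look like this (the date range is illustrative):

select date '2010-01-01' + level - 1 as DayID
from dual
connect by level <= 365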
Someone who understands analytic functions will probably be able to do this without an inner/outer query, but as the inner grouping is a subset of the outer, it doesn't make much difference.
Select
i.Hotel,
Max(i.OccupiedBeds)
From (
Select
s.Hotel,
d.DayID,
Count(*) As OccupiedBeds
From
SampleData s
Inner Join
Days d
-- might not need to +1 depending on business rules.
-- I wouldn't count occupancy on the day I check out, if so get rid of it
On d.DayID >= s.CheckIn And d.DayID < s.CheckOut + 1
Group By
s.Hotel,
d.DayID
) i
Group By
i.Hotel
After a bit of playing I couldn't get an analytic function version to work without an inner query:
If speed really is a problem with this, you could consider maintaining an intermediate table with triggers on the main table.
http://sqlfiddle.com/#!4/e58e7/24
Create a temp table containing the days you are interested in:
create table #dates (dat datetime)
insert into #dates (dat) values ('20121116')
insert into #dates (dat) values ('20121115')
insert into #dates (dat) values ('20121114')
insert into #dates (dat) values ('20121113')
Get the intermediate result by joining the bookings with the dates, so that one row per booking-day is "generated":
SELECT Hotel, d.dat, COUNT(*) AS OccupiedBeds FROM bookings b
INNER JOIN #dates d on d.dat BETWEEN b.checkin AND b.checkout
GROUP BY Hotel, d.dat
And finally get the max, using the previous query as IntermediateResult:
SELECT Hotel, Max(OCCUPIED_BEDS) FROM IntermediateResult GROUP BY Hotel
The problem with performance is that the join conditions are not based on equality, which makes a hash join impossible. Assuming we have a table hotel_days with hotel-day pairs, I would try something like this:
select ch_in.hotel, ch_in.day,
(check_in_cnt - check_out_cnt) as occupancy_change
from ( select d.hotel, d.day, count(s.hotel) as check_in_cnt
from hotel_days d,
sample_data s
where s.hotel(+) = d.hotel
and s.check_in(+) = d.day
group by d.hotel, d.day
) ch_in,
( select d.hotel, d.day, count(s.hotel) as check_out_cnt
from hotel_days d,
sample_data s
where s.hotel(+) = d.hotel
and s.check_out(+) = d.day
group by d.hotel, d.day
) ch_out
where ch_out.hotel = ch_in.hotel
and ch_out.day = ch_in.day
The tradeoff is a double full scan, but I think it would still run faster, and it may be parallelized. (I assume that sample_data is big mostly due to the number of bookings, not the number of hotels itself.) The output is the change of occupancy in particular hotels on particular days, but this may easily be summed up into total values with either analytic functions or (probably more efficiently) a PL/SQL procedure with bulk collect.
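For instance, assuming the query above is wrapped in a view named occupancy_changes (a hypothetical name), the analytic-function variant of that summing step might look like this:

select hotel, max(occupied_beds) as max_occupation
from ( select hotel, day,
              sum(occupancy_change)
                  over (partition by hotel order by day) as occupied_beds
       from occupancy_changes
     )
group by hotel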