SQLite3: How to calculate differential changes

I have a medium-size database (400,000 rows at the time of writing) containing a Measurements table with the following schema:
CREATE TABLE `Measurements` (
`timestamp` timestamp,
`timetick` INTEGER,
`Sensor1` REAL,
`Sensor2` REAL,
PRIMARY KEY(timestamp));
As timestamp increases (the increments are not constant; there are gaps and delays, but timestamps are guaranteed to be monotonic), timetick normally increases too, but there are cases where it resets to a small but unpredictable value. I need to find all such rows. I have used the following query (inspired by Finding the difference in rows in query using SQLite):
select r0,r1,a,b,rd,d from
(select M0.rowid as r0,
M1.rowid as r1,
M0.timestamp as a,
M1.timestamp as b,
min(M1.timestamp)-M0.timestamp as rd,
M1.timetick-M0.timetick as d
from Measurements M0,Measurements M1
where M1.timestamp>M0.timestamp group by M0.timestamp
) where d<0;
This works but takes hours, while the same job in Python finishes in 30 seconds. Yet this is a very common task: scientists calculate derivatives all the time, and financial professionals calculate price differences. There should be an efficient way to do it.
I would appreciate your help and comments.

A join with a GROUP BY is hard to optimize.
Better use a correlated subquery to find the respective next row:
SELECT m0.rowid AS r0,
m1.rowid AS rn,
m0.timestamp AS a,
m1.timestamp AS b,
m1.timestamp - m0.timestamp AS rd,
m1.timetick - m0.timetick AS d
FROM (SELECT rowid, -- This is the core query attaching to each row
timestamp, -- the rowid of its next row
timetick,
(SELECT rowid
FROM measurements
WHERE timestamp > m.timestamp
ORDER BY timestamp
LIMIT 1
) AS r1
FROM Measurements AS m
) AS m0
JOIN measurements AS m1 ON m0.r1 = m1.rowid
WHERE m1.timetick - m0.timetick < 0;
If the timestamp is an integer, make that column an INTEGER PRIMARY KEY to avoid an extra index lookup.
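As a side note, on SQLite 3.25 or newer a window function avoids the self-join entirely; a minimal sketch against the same Measurements table (not from the original answer):
-- Sketch: flag rows where timetick dropped relative to the previous row,
-- using lag() (requires SQLite 3.25+ for window functions).
SELECT rowid, timestamp, timetick, d
FROM (SELECT rowid,
             timestamp,
             timetick,
             timetick - lag(timetick) OVER (ORDER BY timestamp) AS d
      FROM Measurements)
WHERE d < 0;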

Related

How to find neighboring records in the SQL table in terms of month and year?

Please help me to optimize my SQL query.
I have a table with the fields: date, commodity_id, exp_month_id, exp_year, price, where the first 4 fields are the primary key. The months are designated with alphabet-ordered letters, e.g. F (for Jan), G (for Feb), H (for March), etc. Thus the letter for a month more distant from January is larger than the letter for a less distant month (F < G < H < ...). Some commodity_ids have all 12 months in the table, some only 5 or 3, and these sets are constant across years.
I need to calculate the difference between prices (the gradient) of neighboring records in terms of (exp_month_id, exp_year). As a first step, I want to define for every pair (exp_month_id, exp_year) the valid pair (next_month_id, next_year). The main problem here is that if the current exp_month_id is the last in the year, then next_year = exp_year + 1 and next_month_id should be the first one in the year.
I have written the following query to do the job:
WITH trading_months AS (
SELECT DISTINCT commodity_id,
exp_month_id
FROM futures
ORDER BY exp_month_id
)
SELECT DISTINCT f.commodity_id,
f.exp_month_id,
f.exp_year,
(
WITH [temp] AS (
SELECT exp_month_id
FROM trading_months
WHERE commodity_id = f.commodity_id
)
SELECT exp_month_id
FROM [temp]
WHERE exp_month_id > f.exp_month_id
UNION ALL
SELECT exp_month_id
FROM [temp]
LIMIT 1
)
AS next_month_id,
(
SELECT CASE WHEN EXISTS (
SELECT commodity_id,
exp_month_id
FROM trading_months
WHERE commodity_id = f.commodity_id AND
exp_month_id > f.exp_month_id
LIMIT 1
)
THEN f.exp_year ELSE f.exp_year + 1 END
)
AS next_year
FROM futures AS f
This query serves as a base for a dynamic table (view) which is subsequently used for calculating the gradient. However, the execution of this query takes more than one second, and thus the whole process takes minutes. I wonder if you could help me optimize the query.
Note: The following requires SQLite 3.25 or newer for window function support.
Lack of sample data (preferably as CREATE TABLE and INSERT statements for easy importing) and expected results makes this hard to test, but if your end goal is computing the difference in prices between expiration dates (making your question a bit of an XY problem), maybe something like:
SELECT date, commodity_id, price, exp_year, exp_month_id
, price - lag(price, 1) OVER (PARTITION BY commodity_id ORDER BY exp_year, exp_month_id) AS "change from last price"
FROM futures;
Thanks to @Shawn's hint to use window functions, I could rewrite the query in a much shorter form:
CREATE VIEW "futures_nextmonths_win" AS
WITH trading_months AS (
SELECT DISTINCT commodity_id,
exp_month_id,
exp_year
FROM futures)
SELECT commodity_id,
exp_month_id,
exp_year,
lead(exp_month_id) OVER w AS next_month_id,
lead(exp_year) OVER w AS next_year
FROM trading_months
WINDOW w AS (PARTITION BY commodity_id ORDER BY exp_year, exp_month_id);
which is also slightly faster than the original one.
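From there, the gradient itself could be computed by joining futures back through the view. The sketch below is an assumption about the intended calculation (the difference between the next expiration's price and the current one, for the same trading date), since no sample data or expected results were given:
-- Sketch only: price difference between the current and next expiration,
-- per trading date and commodity, via the futures_nextmonths_win view.
SELECT f.date,
       f.commodity_id,
       f.exp_month_id,
       f.exp_year,
       fn.price - f.price AS gradient
FROM futures AS f
JOIN futures_nextmonths_win AS v
  ON v.commodity_id = f.commodity_id
 AND v.exp_month_id = f.exp_month_id
 AND v.exp_year = f.exp_year
JOIN futures AS fn
  ON fn.date = f.date
 AND fn.commodity_id = f.commodity_id
 AND fn.exp_month_id = v.next_month_id
 AND fn.exp_year = v.next_year;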

Splitting table PK values into roughly same-size ranges

I have a table in Postgres with about half a million rows and an integer primary key.
I'd like to split its entire PK space into N ranges of approximately the same size for independent processing. How do I best do it?
Apparently I can do this by fetching all PK values to a client and remembering every N-th value. But that does a full scan and a fetch of all the values, while I want no more than N+1 of them.
I can select min and max values and cut the range, but if the PKs are not distributed quite evenly, it may give me some ranges of seriously different sizes.
I want ranges for index-based access later on, so any modulo-based tricks do not apply.
Is there any nice SQL-based solution that does not involve fetching all the keys to a client? Writing an N-specific query, e.g. with N clauses, is fine.
An example:
IDs in a range, say, from 1234 to 567890, N = 4.
I'd like to get 3 numbers, say 127123, 254789, 379860, so that there are approximately 125k records in each of the ID ranges [1234, 127123], [127123, 254789], [254789, 379860], and [379860, 567890].
Update:
I've come up with a solution like this:
select
percentile_disc(0.25) within group (order by c.id) over() as pct_25
,percentile_disc(0.50) within group (order by c.id) over() as pct_50
,percentile_disc(0.75) within group (order by c.id) over() as pct_75
from customer c
limit 1
;
It does a decent job of giving me the exact range boundaries, and runs in only a few seconds, which is fine for my purposes.
What bothers me is that I have to add the limit 1 clause to get just one row. Without it, I receive identical rows, one per record in the table. Is there a better way to get just one row of the percentiles?
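For what it's worth, percentile_disc can also be used as a plain ordered-set aggregate by dropping the over() clause, which should collapse the result to a single row without needing the limit; a minimal sketch:
-- Aggregate form: one row for the whole table, no over() and no limit needed.
select
percentile_disc(0.25) within group (order by c.id) as pct_25
,percentile_disc(0.50) within group (order by c.id) as pct_50
,percentile_disc(0.75) within group (order by c.id) as pct_75
from customer c;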
I think you can use row_number() for this purpose. Something like this:
select t.*,
floor((seqnum * N) / cnt) as range
from (select t.*,
row_number() over (order by pk) - 1 as seqnum,
count(*) over () as cnt
from t
) t;
This assumes by range that you mean ranges on pk values. You can also move the range expression to a where clause to just select one particular range.
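For instance, to pull just one particular range as suggested, the expression can be computed in a derived table and filtered in the outer query; a sketch keeping the placeholder names t, pk, and N = 4:
-- Sketch: select only the rows that fall into range number 2 (of N = 4 ranges).
select *
from (select t.*,
             floor(((row_number() over (order by pk) - 1) * 4) / count(*) over ()) as rng
      from t
     ) sub
where rng = 2;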

Update with last known location?

I have a large table of data which holds about 75,000 people's locations per minute for a 24-hour period. The columns are:
ppid (person ID)
point_time (timestamp)
the_geom (geometry point)
My problem is that some (a lot) of the info in the location (the_geom) column is missing. This column needs updating with the last known location of the person. I'm struggling conceptually with how to do this - some sort of self-join on the table, I think. But how do I get the right data for the update?
I've made a SQL fiddle which demonstrates the problem:
http://sqlfiddle.com/#!15/77157/1
Thanks
James
I'm not sure how this will perform over a larger data set, but here's a single query solution, using two nested sub-queries:
SELECT
data.ppid,
data.point_time,
CASE
WHEN data.the_geom IS NULL
THEN (
--Get all locations with an earlier time stamp for that ppid
SELECT geom.the_geom
FROM test_data geom
WHERE data.ppid = geom.ppid
AND geom.point_time < data.point_time
AND geom.the_geom IS NOT NULL
AND NOT EXISTS (
-- Cull all but the most recent one
SELECT *
FROM test_data cull
WHERE cull.ppid = geom.ppid
AND geom.the_geom IS NOT NULL
AND cull.point_time < data.point_time
AND cull.point_time > geom.point_time
AND cull.the_geom IS NOT NULL
)
)
ELSE data.the_geom
END
FROM test_data data
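For comparison, since the fiddle runs on PostgreSQL, a window-function version of the same gap-fill (the usual "carry forward the last non-NULL value" trick) may scale better on a large table; a sketch against the same test_data table:
-- count(the_geom) only counts non-NULL values, so it increments at each known
-- location and stays flat across the NULL rows that follow it; within each such
-- group the first row (by time) holds the last known location.
SELECT ppid,
       point_time,
       first_value(the_geom) OVER (PARTITION BY ppid, grp ORDER BY point_time) AS the_geom
FROM (SELECT ppid,
             point_time,
             the_geom,
             count(the_geom) OVER (PARTITION BY ppid ORDER BY point_time) AS grp
      FROM test_data) sub;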

Calculating the peak capacity of hotels with Sql

There are a number of hotels with different bed capacities. I need to find out, for any given day, how many beds are occupied in each hotel.
Sample data:
HOTEL CHECK-IN CHECK-OUT
A 29.05.2010 30.05.2010
A 28.05.2010 30.05.2010
A 27.05.2010 29.05.2010
B 18.08.2010 19.08.2010
B 16.08.2010 20.08.2010
B 15.08.2010 17.08.2010
Intermediary Result:
HOTEL DAY OCCUPIED_BEDS
A 27.05.2010 1
A 28.05.2010 2
A 29.05.2010 3
A 30.05.2010 2
B 15.08.2010 1
B 16.08.2010 2
B 17.08.2010 2
B 18.08.2010 2
B 19.08.2010 2
B 20.08.2010 1
Final result:
HOTEL MAX_OCCUPATION
A 3
B 2
A similar question has been asked before. I thought of getting the list of dates between the two dates (as Tom Kyte shows) and calculating each day's capacity with a GROUP BY. The problem is that my table is relatively big, and I wonder if there is a less costly way of accomplishing this task.
I don't think there's a better approach than the one you outlined in the question. Create your days table (or generate one on the fly). I personally like to have one lying around, updated once a year.
Someone who understands analytic functions will probably be able to do this without an inner/outer query, but as the inner grouping is a subset of the outer, it doesn't make much difference.
Select
i.Hotel,
Max(i.OccupiedBeds)
From (
Select
s.Hotel,
d.DayID,
Count(*) As OccupiedBeds
From
SampleData s
Inner Join
Days d
-- Might not need the +1 depending on business rules;
-- I wouldn't count occupancy on the check-out day - if so, get rid of the +1.
On d.DayID >= s.CheckIn And d.DayID < s.CheckOut + 1
Group By
s.Hotel,
d.DayID
) i
Group By
i.Hotel
After a bit of playing, I couldn't get an analytic-function version to work without an inner query.
If speed really is a problem with this, you could consider maintaining an intermediate table with triggers on the main table:
http://sqlfiddle.com/#!4/e58e7/24
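As an aside on the "generate one on the fly" option mentioned above, the days list can also be produced with a recursive CTE instead of a permanent table; a minimal sketch in PostgreSQL-style syntax (Oracle and SQL Server spell the recursion and the date arithmetic slightly differently):
-- Sketch: one row per day for the period of interest.
WITH RECURSIVE Days (DayID) AS (
    SELECT DATE '2010-05-01'
    UNION ALL
    SELECT DayID + 1          -- use the dialect's date-add function if + 1 isn't supported
    FROM Days
    WHERE DayID < DATE '2010-09-01'
)
SELECT DayID FROM Days;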
Create a temp table containing the days you are interested in:
create table #dates (dat datetime)
insert into #dates (dat) values ('20121116')
insert into #dates (dat) values ('20121115')
insert into #dates (dat) values ('20121114')
insert into #dates (dat) values ('20121113')
Get the intermediate result by joining the bookings with the dates, so that one row per booking-day is "generated":
SELECT Hotel, d.dat, COUNT(*) from bookings b
INNER JOIN #dates d on d.dat BETWEEN b.checkin AND b.checkout
GROUP BY Hotel, d.dat
And finally get the max:
SELECT Hotel, Max(OCCUPIED_BEDS) FROM IntermediateResult GROUP BY Hotel
The problem with performance is that the join conditions are not based on equality, which makes a hash join impossible. Assuming we have a table hotel_days with hotel-day pairs, I would try something like this:
select ch_in.hotel, ch_in.day,
(check_in_cnt - check_out_cnt) as occupancy_change
from ( select d.hotel, d.day, count(s.hotel) as check_in_cnt
from hotel_days d,
sample_data s
where s.hotel(+) = d.hotel
and s.check_in(+) = d.day
group by d.hotel, d.day
) ch_in,
( select d.hotel, d.day, count(s.hotel) as check_out_cnt
from hotel_days d,
sample_data s
where s.hotel(+) = d.hotel
and s.check_out(+) = d.day
group by d.hotel, d.day
) ch_out
where ch_out.hotel = ch_in.hotel
and ch_out.day = ch_in.day
The trade-off is a double full scan, but I think it would still run faster, and it may be parallelized. (I assume that sample_data is big mostly due to the number of bookings, not the number of hotels itself.) The output is the change of occupancy in particular hotels on particular days, but this may easily be summed up into totals with either analytic functions or (probably more efficiently) a PL/SQL procedure with bulk collect.
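For instance, assuming the output above is materialized as a view called occupancy_changes (a hypothetical name), the running occupancy and the peak per hotel could be obtained with analytic functions along these lines:
-- Sketch: turn per-day occupancy changes into a running occupancy, then take the peak.
select hotel, max(occupied_beds) as max_occupation
from ( select hotel, day,
              sum(occupancy_change) over (partition by hotel order by day) as occupied_beds
       from occupancy_changes
     )
group by hotel;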

table design + SQL question

I have a table foodbar, created with the following DDL. (I am using MySQL 5.1.x.)
CREATE TABLE foodbar (
id INT NOT NULL AUTO_INCREMENT,
user_id INT NOT NULL,
weight double not null,
created_at date not null
);
I have four questions:
1. How may I write a query that returns a result set that gives me the following information: user_id, weight_gain, where weight_gain is the difference between a weight and a weight that was recorded 7 days ago?
2. How may I write a query that will return the top N users with the biggest weight gain (again, say, over a week)? An 'obvious' way may be to use the query obtained in question 1 above as a subquery, but somehow picking the top N.
3. Since in question 2 (and indeed question 1) I am searching the records in the table using a calculated field, indexing would be preferable to optimise the query - however, since it is a calculated field, it is not clear which field to index (I'm guessing the 'weight' field is the one that needs indexing). Am I right in that assumption?
4. Assuming I had another field in the foodbar table (say 'height') and I wanted to select records from the table based on (say) the product (i.e. multiplication) of 'height' and 'weight' - would I be right in assuming again that I need to index 'height' and 'weight'? Do I also need to create a composite key (say (height, weight))? If this question is not clear, I would be happy to clarify.
I don't see why you should need the synthetic key, so I'll use this table instead:
CREATE TABLE foodbar (
user_id INT NOT NULL
, created_at date not null
, weight double not null
, PRIMARY KEY (user_id, created_at)
);
How may I write a query that returns a result set that gives me the following information: user_id, weight_gain where weight_gain is the difference between a weight and a weight that was recorded 7 days ago.
SELECT curr.user_id, curr.weight - prev.weight
FROM foodbar curr, foodbar prev
WHERE curr.user_id = prev.user_id
AND curr.created_at = CURRENT_DATE
AND prev.created_at = CURRENT_DATE - INTERVAL '7 days'
;
The date arithmetic syntax is probably wrong, but you get the idea.
How may I write a query that will return the top N users with the biggest weight gain (again, say, over a week)? An 'obvious' way may be to use the query obtained in question 1 above as a subquery, but somehow picking the top N.
See above; add ORDER BY curr.weight - prev.weight DESC and LIMIT N.
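Putting the two pieces together, a sketch with the MySQL date arithmetic filled in (7 days and top 10 stand in for the actual window and N):
-- Sketch: weight gain over the last 7 days, biggest gainers first.
SELECT curr.user_id,
       curr.weight - prev.weight AS weight_gain
FROM foodbar curr
JOIN foodbar prev ON curr.user_id = prev.user_id
WHERE curr.created_at = CURRENT_DATE
  AND prev.created_at = CURRENT_DATE - INTERVAL 7 DAY
ORDER BY weight_gain DESC
LIMIT 10;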
For the last two questions: don't speculate, examine execution plans (PostgreSQL has EXPLAIN ANALYZE; I don't know about MySQL). You'll probably find you need to index the columns that participate in WHERE and JOIN clauses, not the ones that form the result set.
I think that "just somebody" covered most of what you're asking, but I'll just add that indexing columns that take part in a calculation is unlikely to help you at all unless it happens to be a covering index.
For example, it doesn't help to order the following rows by X, Y if I want to get them in the order of their product X * Y:
X Y
1 8
2 2
4 4
The products would order them as:
X Y Product
2 2 4
1 8 8
4 4 16
If MySQL supports calculated columns in a table and allows indexing on those columns, then that might help.
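(For reference, later MySQL versions do: 5.7 added generated columns, which can be indexed. A sketch, not available on the asker's 5.1, and height is the hypothetical column from question 4:)
-- MySQL 5.7+ sketch: a stored generated column for the product, plus an index on it.
-- 'height' is hypothetical; the asker's table only has weight.
ALTER TABLE foodbar
    ADD COLUMN height_weight DOUBLE AS (height * weight) STORED,
    ADD INDEX idx_height_weight (height_weight);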
I agree with just somebody regarding the primary key, but for what you're asking regarding the weight calculation, you'd be better off storing the delta rather than the weight:
CREATE TABLE foodbar (
user_id INT NOT NULL,
created_at date not null,
weight_delta double not null,
PRIMARY KEY (user_id, created_at)
);
It means you'd store the user's initial weight in, say, the user table, and when you write records to the foodbar table, a user could supply the weight at that time, but the query would subtract the initial weight from the current weight. So you'd see values like:
user_id weight_delta
------------------------
1 2
1 5
1 -3
Looking at that, you know that user 1 gained 4 pounds/kilos/stones/etc.
This way you could use SUM, because it's possible for someone to have weigh-ins every day; using just somebody's expression curr.weight - prev.weight wouldn't work, regardless of time span.
Getting the top x is easy in MySQL - use the LIMIT clause, but make sure you provide an ORDER BY so that the limit is applied correctly.
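A sketch of what that could look like with the delta schema (the 7-day window and top 10 are arbitrary choices):
-- Sketch: total weight change per user over the last week, biggest gainers first.
SELECT user_id,
       SUM(weight_delta) AS weight_gain
FROM foodbar
WHERE created_at >= CURRENT_DATE - INTERVAL 7 DAY
GROUP BY user_id
ORDER BY weight_gain DESC
LIMIT 10;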
It's not obvious, but there's some important information missing from the problem you're trying to solve. It becomes more noticeable when you think about realistic data going into this table. The problem is that you're unlikely to have a consistent, regular daily record of users' weights. So you need to clarify a couple of rules around determining 'current weight' and 'weight x days ago'. I'm going to assume the following simplistic rules:
The most recent weight reading is the 'current-weight'. (Even though that could be months ago.)
The most recent weight reading more than x days ago will be the weight assumed at x days ago. (Even though for example a reading from 6 days ago would be more reliable than a reading from 21 days ago when determining weight 7 days ago.)
Now to answer the questions:
1&2: Using the above extra rules provides an opportunity to produce two result sets: current weights, and previous weights:
Current weights:
select rd.*,
w.Weight
from (
select User_id,
max(Created_at) AS Read_date
from Foodbar
group by User_id
) rd
inner join Foodbar w on
w.User_id = rd.User_id
and w.Created_at = rd.Read_date
Similarly for the x days ago reading:
select rd.*,
w.Weight
from (
select User_id,
max(Created_at) AS Read_date
from Foodbar
where Created_at < DATEADD(dd, -7, GETDATE()) /*Or appropriate MySql equivalent*/
group by User_id
) rd
inner join Foodbar w on
w.User_id = rd.User_id
and w.Created_at = rd.Read_date
Now simply join these results as subqueries:
select cur.User_id,
cur.Weight as Cur_weight,
prev.Weight as Prev_weight,
cur.Weight - prev.Weight as Weight_change
from (
/*Insert query #1 here*/
) cur
inner join (
/*Insert query #2 here*/
) prev on
prev.User_id = cur.User_id
If I remember correctly, the MySQL syntax to get the top N weight gains would be to simply add:
ORDER BY cur.Weight - prev.Weight DESC limit N
2&3: Choosing indexes requires a little understanding of how the query optimiser will process the query:
The important thing when it comes to index selection is which columns you are filtering by or joining on. The optimiser will use the index if it is determined to be selective enough (note that sometimes your filters have to be extremely selective, returning < 1% of data, to be considered useful). There's always a trade-off between the slow disk seek times of navigating indexes and simply processing all the data in memory.
3: Although weights feature significantly in what you display, their only relevance in terms of filtering (or selection) is in #2, to get the top N weight gains. This is a complex calculation based on a number of queries and a lot of processing that has gone before, so Weight will provide zero benefit as an index.
Another note is that even for #2 you have to calculate the weight change of all users in order to determine which have gained the most. Therefore, unless you have a very large number of readings per user, you will read most of the table (i.e. a table scan will be used to obtain the bulk of the data).
Where indexes can benefit:
You are trying to identify specific Foodbar rows based on User_id and Created_at.
You are also joining back to the Foodbar table again using User_id and Created_at.
This implies an index on (User_id, Created_at) would be useful (more so if this is the clustered index).
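In concrete terms, if that composite key isn't already the primary key, something along these lines (the index name is arbitrary):
-- Sketch: composite index matching the filter and join columns used above.
CREATE INDEX idx_foodbar_user_created ON foodbar (user_id, created_at);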
4: No, unfortunately it is mathematically impossible for the individual values H and W to independently determine the ordering of the product. E.g. both H=3 and W=3 are less than 5, yet the product 3*3 = 9 is greater than 5*1 = 5.
You would have to actually store the calculation and index that additional column. However, as indicated in my answer to #3 above, it is still unlikely to prove beneficial.