How to group results from a query based on one-to-many association by some criterion in the "many"? - sql

Please forgive the awkward title. I had a hard time distilling my question into one phrase. If anyone can come up with a better one, feel free.
I have the following simplified schema:
vendors
INT id
locations
INT id
INT vendor_id
FLOAT latitude
FLOAT longitude
I am perfectly capable of returning a list of the nearest vendors, sorted by proximity and limited by an approximation of radius:
SELECT * FROM locations
WHERE latitude IS NOT NULL AND longitude IS NOT NULL
AND ABS(latitude - 30) + ABS(longitude - 30) < 50
ORDER BY ABS(latitude - 30) + ABS(longitude - 30) ASC
I can't at the moment find my way around the repetition of the distance expression in the WHERE and ORDER BY clauses. I initially attempted aliasing it as "distance" among the SELECT fields, but psql told me that this alias wasn't available in the WHERE clause. Fine. If there's some fancy pants way around this, I'm all ears.
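One common workaround is to compute the expression once in a derived table, so that the alias becomes available to the outer WHERE and ORDER BY (a sketch against the schema above):
SELECT *
FROM (
    SELECT locations.*,
           ABS(latitude - 30) + ABS(longitude - 30) AS distance
    FROM locations
    WHERE latitude IS NOT NULL AND longitude IS NOT NULL
) AS l
WHERE distance < 50
ORDER BY distance ASC;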
But on to my main question: what I'd like to do is to return a list of vendors, each joined with the closest of its locations, and have this list ordered by proximity and limited by radius.
So supposing I have 2 vendors, each with two locations. I want a query that limits the radius such that only one of the four locations is within it to return that location's associated vendor alongside the vendor itself. If the radius encompassed all the locations, I'd want vendor 1 presented with the closest between its locations and vendor 2 with the closest between its locations, ultimately ordering vendors 1 and 2 based on the proximity of their closest location.
In MySQL, I managed to get the closest location in each vendor's row by using GROUP BY and then MIN(distance). But PostgreSQL seems to be stricter on the usage of GROUP BY.
I'd like to, if possible, avoid meddling with the SELECT clause. I'd also like to, if possible, reuse the WHERE and ORDER parts of the above query. But these are by no means absolute requirements.
I have made hackneyed attempts at DISTINCT ON and GROUP BY, but these gave me a fair bit of trouble, mostly in terms of me missing mirrored statements elsewhere, which I won't elaborate on in great detail now.
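For reference, a working DISTINCT ON version against the original schema might look roughly like this (a sketch only; it picks each vendor's single closest location, which would then still need a join back to vendors):
SELECT *
FROM (
    SELECT DISTINCT ON (vendor_id)
           locations.*,
           ABS(latitude - 30) + ABS(longitude - 30) AS distance
    FROM locations
    WHERE latitude IS NOT NULL AND longitude IS NOT NULL
    ORDER BY vendor_id, ABS(latitude - 30) + ABS(longitude - 30)
) AS closest
WHERE distance < 50
ORDER BY distance;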
Solution
I ended up adopting a solution based on OMG Ponies' excellent answer.
SELECT vendors.* FROM (
SELECT locations.*,
ABS(locations.latitude - 2.1) + ABS(locations.longitude - 2.1) AS distance,
ROW_NUMBER() OVER(PARTITION BY locations.locatable_id, locations.locatable_type
ORDER BY ABS(locations.latitude - 2.1) + ABS(locations.longitude - 2.1) ASC) AS rank
FROM locations
WHERE locations.latitude IS NOT NULL
AND locations.longitude IS NOT NULL
AND locations.locatable_type = 'Vendor'
) ranked_locations
INNER JOIN vendors ON vendors.id = ranked_locations.locatable_id
WHERE (ranked_locations.rank = 1)
AND (ranked_locations.distance <= 0.5)
ORDER BY ranked_locations.distance;
Some deviations from OMG Ponies' solution:
Locations are now polymorphically associated via _type. A bit of a premise change.
I moved the join outside the subquery. I don't know if there are performance implications, but it made sense in my mind to see the subquery as a getting of locations and partitioned rankings and then the larger query as an act of bringing it all together.
(Minor) Took away table name aliasing. Although I'm plenty used to aliasing, it just made it harder for me to follow along. I'll wait until I'm more experienced with PostgreSQL before working in that flair.

For PostgreSQL 8.4+, you can use analytic (window) functions like ROW_NUMBER:
SELECT x.*
FROM (SELECT v.*,
t.*,
ABS(t.latitude - 30) + ABS(t.longitude - 30) AS distance,
ROW_NUMBER() OVER(PARTITION BY v.id
ORDER BY ABS(t.latitude - 30) + ABS(t.longitude - 30)) AS rank
FROM VENDORS v
JOIN LOCATIONS t ON t.vendor_id = v.id
WHERE t.latitude IS NOT NULL
AND t.longitude IS NOT NULL) x
WHERE x.rank = 1
AND x.distance < 50
ORDER BY x.distance
I left the filtering on distance in place so that if a vendor's top-ranked (closest) location is more than 50 away, that vendor will not appear at all. Remove the x.distance < 50 check if you don't want this to happen.
ROW_NUMBER will return a distinct sequential value that resets for every vendor in this example. If you want tied distances to share the same rank value, you'd need to look at using DENSE_RANK.
See this article for emulating ROW_NUMBER on PostgreSQL pre-8.4.
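For pre-8.4 versions, one common way to emulate the rank = 1 filter without window functions is a NOT EXISTS check for a closer location of the same vendor (a sketch; note that exact distance ties would return more than one row per vendor):
SELECT v.*, t.*,
       ABS(t.latitude - 30) + ABS(t.longitude - 30) AS distance
FROM vendors v
JOIN locations t ON t.vendor_id = v.id
WHERE t.latitude IS NOT NULL
  AND t.longitude IS NOT NULL
  AND ABS(t.latitude - 30) + ABS(t.longitude - 30) < 50
  AND NOT EXISTS (
        SELECT 1
        FROM locations t2
        WHERE t2.vendor_id = t.vendor_id
          AND t2.latitude IS NOT NULL
          AND t2.longitude IS NOT NULL
          AND ABS(t2.latitude - 30) + ABS(t2.longitude - 30)
              < ABS(t.latitude - 30) + ABS(t.longitude - 30)
  )
ORDER BY distance;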

MySQL extends GROUP BY so that not all selected columns are required to be aggregates or appear in the GROUP BY clause: http://dev.mysql.com/doc/refman/5.0/en/group-by-hidden-columns.html
I have seen many questions here with the same issue. The trick is to get the necessary columns in a subquery and then join it back to the table in the outer query:
create temp table locations (id int, vender_id int, latitude int, longitude int);
CREATE TABLE
insert into locations values
(1, 1, 50, 50),
(2, 1, 35, 30),
(3, 2, 5, 30)
;
SELECT
locations.*, distance
FROM
(
SELECT
vender_id,
MIN(ABS(latitude - 30) + ABS(longitude - 30)) as distance
FROM locations
WHERE latitude IS NOT NULL AND longitude IS NOT NULL
GROUP BY vender_id
) AS min_locations
JOIN locations ON
ABS(latitude - 30) + ABS(longitude - 30) = distance
AND min_locations.vender_id = locations.vender_id
WHERE distance < 50
ORDER BY distance
;
id | vender_id | latitude | longitude | distance
----+-----------+----------+-----------+----------
2 | 1 | 35 | 30 | 5
3 | 2 | 5 | 30 | 25

Related

PostgreSQL - Get related columns of an aggregated column

I have a table called "places"
origin | destiny | distance
---------------------------
A | X | 5
A | Y | 8
B | X | 12
B | Y | 9
For each origin, I want to find out which is the closest destiny. In MySQL I could do
SELECT origin, destiny, MIN(distance) FROM places GROUP BY origin
And I could expect the following result
origin | destiny | distance
---------------------------
A | X | 5
B | y | 9
Unfortunately, this query does not work in PostgreSQL. Postgres forces me to either put "destiny" in its own aggregate function or to define it as another argument of the GROUP BY statement. Both "solutions" completely change my desired result.
How could I translate the above MySQL query to PostgreSQL?
MySQL is the only DBMS that allows this broken ("loose" in MySQL terms) GROUP BY handling. Every other DBMS (including Postgres) would reject your original statement.
In Postgres you can use DISTINCT ON to achieve the same thing:
select distinct on (origin)
origin,
destiny,
distance
from places
order by origin, distance;
The ANSI solution would be something like this:
select p.origin,
p.destiny,
p.distance
from places p
join (select p2.origin, min(p2.distance) as distance
from places p2
group by origin
) t on t.origin = p.origin and t.distance = p.distance
order by origin;
Or without a join using window functions
select t.origin,
t.destiny,
t.distance
from (
select origin,
destiny,
distance,
min(distance) over (partition by origin) as min_dist
from places
) t
where distance = min_dist
order by origin;
Or another solution with window functions:
select distinct origin,
first_value(destiny) over (partition by origin order by distance) as destiny,
min(distance) over (partition by origin) as distance
from places
order by origin;
My guess is that the first one (Postgres specific) is probably the fastest one.
Here is an SQLFiddle for all three solutions: http://sqlfiddle.com/#!12/68308/2
Note that the MySQL result might actually be incorrect as it will return an arbitrary (=random) value for destiny. The value returned by MySQL might not be the one that belongs to the lowest distance.
More details on the broken group by handling in MySQL can be found here: http://www.mysqlperformanceblog.com/2006/09/06/wrong-group-by-makes-your-queries-fragile/
The neatest (in my opinion) way to do this in PostgreSQL is to use an aggregate function which clearly specifies which value of destiny should be selected.
The desired value can be described as "the first matching destiny, if you order the matching rows by their distance".
You therefore need two things:
A "first" aggregate, which simply returns the "first" of a list of values. This is very easy to define, but is not included as standard.
The ability to specify what order those matches come in (otherwise, like the MySQL "loose Group By", it will be undefined which value you actually get). This was added in PostgreSQL 9.0, and the syntax is documented under "Aggregate Expressions".
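Such a first() aggregate can be defined along these lines (a sketch, adapted from the snippet commonly circulated on the PostgreSQL wiki):
-- State-transition function that simply keeps the first value it was given
CREATE OR REPLACE FUNCTION public.first_agg (anyelement, anyelement)
RETURNS anyelement LANGUAGE sql IMMUTABLE STRICT AS $$
    SELECT $1;
$$;
-- Wrap an aggregate around it
CREATE AGGREGATE public.first (
    sfunc    = public.first_agg,
    basetype = anyelement,
    stype    = anyelement
);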
Once the first() aggregate is defined (which you need do only once per database, while you're setting up your initial tables), you can then write:
Select
origin,
first(destiny Order by distance Asc) as closest_destiny,
min(distance) as closest_destiny_distance
-- Or, equivalently: first(distance Order by distance Asc) as closest_destiny_distance
from places
group by origin
order by origin;
Here is a SQLFiddle demo showing the whole thing in operation.
Just to add another possible solution to a_horse_with_no_name's answer - using the window function row_number():
with cte as (
select
row_number() over(partition by origin order by distance) as row_num,
*
from places
)
select
origin,
destiny,
distance
from cte
where row_num = 1
It'll work in SQL Server or any other RDBMS supporting row_number() too. In PostgreSQL, though, I prefer the distinct on syntax.
sql fiddle demo

recursive geometric query : five closest entities

The question is whether the query described below can be done without recourse to procedural logic, that is, can it be handled by SQL and a CTE and a windowing function alone? I'm using SQL Server 2012 but the question is not limited to that engine.
Suppose we have a national database of music teachers with 250,000 rows:
teacherName, address, city, state, zipcode, geolocation, primaryInstrument
where the geolocation column is a geography::point datatype with an optimally tessellated index.
User wants the five closest guitar teachers to his location. A query using a windowing function performs well enough if we pick some arbitrary distance cutoff, say 50 miles, so that we are not selecting all 250,000 rows and then ranking them by distance and taking the closest 5.
But that arbitrary 50-mile radius cutoff might not always succeed in encompassing 5 teachers, if, for example, the user picks an instrument from a different culture, such as sitar or oud or balalaika; there might not be five teachers of such instruments within 50 miles of her location.
Also, now imagine we have a query where a conservatory of music has sent us a list of 250 singers, who are students who have been accepted to the school for the upcoming year, and they want us to send them the five closest voice coaches for each person on the list, so that those students can arrange to get some coaching before they arrive on campus. We have to scan the teachers database 250 times (i.e. scan the geolocation index) because those students all live at different places around the country.
So, I was wondering, is it possible, for that latter query involving a list of 250 student locations, to write a recursive query where the radius begins small, at 10 miles, say, and then increases by 10 miles with each iteration, until either a maximum radius of 100 miles has been reached or the required five (5) teachers have been found? And can it be done only for those students who have yet to be matched with the required 5 teachers?
I'm thinking it cannot be done with SQL alone, and must be done with looping and a temporary table--but maybe that's because I haven't figured out how to do it with SQL alone.
P.S. The primaryInstrument column could reduce the size of the set ranked by distance too but for the sake of this question forget about that.
EDIT: Here's an example query. The SINGER (submitted) dataset contains a column with the arbitrary radius to limit the geo-results to a smaller subset, but as stated above, that radius may define a circle (whose centerpoint is the student's geolocation) which might not encompass the required number of teachers. Sometimes the supplied datasets contain thousands of addresses, not merely a few hundred.
select TEACHERSRANKEDBYDISTANCE.* from
(
select STUDENTSANDTEACHERSINRADIUS.*,
rowpos = row_number()
over(partition by
STUDENTSANDTEACHERSINRADIUS.zipcode+STUDENTSANDTEACHERSINRADIUS.streetaddress
order by DistanceInMiles)
from
(
select
SINGER.name,
SINGER.streetaddress,
SINGER.city,
SINGER.state,
SINGER.zipcode,
TEACHERS.name as TEACHERname,
TEACHERS.streetaddress as TEACHERaddress,
TEACHERS.city as TEACHERcity,
TEACHERS.state as TEACHERstate,
TEACHERS.zipcode as TEACHERzip,
TEACHERS.teacherid,
geography::Point(SINGER.lat, SINGER.lon, 4326).STDistance(TEACHERS.geolocation)
/ (1.6 * 1000) as DistanceInMiles
from
SINGER left join TEACHERS
on
( TEACHERS.geolocation).STDistance( geography::Point(SINGER.lat, SINGER.lon, 4326))
< (SINGER.radius * (1.6 * 1000 ))
and TEACHERS.primaryInstrument='voice'
) as STUDENTSANDTEACHERSINRADIUS
) as TEACHERSRANKEDBYDISTANCE
where rowpos < 6 -- closest 5 is an arbitrary requirement given to us
I think that if you just need the closest 5 teachers regardless of radius, you could write something like this. Each student will be duplicated up to 5 times in this query; I'm not sure exactly what output you want.
select
S.name,
S.streetaddress,
S.city,
S.state,
S.zipcode,
T.name as TEACHERname,
T.streetaddress as TEACHERaddress,
T.city as TEACHERcity,
T.state as TEACHERstate,
T.zipcode as TEACHERzip,
T.teacherid,
T.geolocation.STDistance(geography::Point(S.lat, S.lon, 4326))
/ (1.6 * 1000) as DistanceInMiles
from SINGER as S
outer apply (
select top 5 TT.*
from TEACHERS as TT
where TT.primaryInstrument='voice'
order by TT.geolocation.STDistance(geography::Point(S.lat, S.lon, 4326)) asc
) as T

How do I use the MAX function over three tables?

So, I have a problem with a SQL Query.
It's about getting weather data for German cities. I have 4 tables: staedte (the cities, with primary key loc_id), gehoert_zu (contains the city key and the key of the weather station that is closest to this city (stations_id)), wettermessung (contains all the weather information and the station's key value) and wetterstation (contains the station's key and location). I'm using PostgreSQL.
Here is what the tables look like:
wetterstation
s_id[PK] standort lon lat hoehe
----------------------------------------
10224 Bremen 53.05 8.8 4
wettermessung
stations_id[PK] datum[PK] max_temp_2m ......
----------------------------------------------------
10224 2013-3-24 -0.4
staedte
loc_id[PK] name lat lon
-------------------------------
15 Asch 48.4 9.8
gehoert_zu
loc_id[PK] stations_id[PK]
-----------------------------
15 10224
What I'm trying to do is to get the name of the city with the (for example) highest temperature on a specified date (could be a whole month, or a day). Since the weather data is bound to a station, I actually need to get the station's ID and then just choose one of the cities corresponding to that station. A possible question would be: "In which city was it hottest in June?" Say the highest measured temperature was at station number 10224; as a result I want to get the city Asch. What I have so far is this:
SELECT name, MAX (max_temp_2m)
FROM wettermessung, staedte, gehoert_zu
WHERE wettermessung.stations_id = gehoert_zu.stations_id
AND gehoert_zu.loc_id = staedte.loc_id
AND wettermessung.datum BETWEEN '2012-8-1' AND '2012-12-1'
GROUP BY name
ORDER BY MAX (max_temp_2m) DESC
LIMIT 1
There are two problems with the results:
1) it's taking waaaay too long. The tables are not that big (cities has about 70k entries), but it needs between 1 and 7 minutes to get things done (depending on the time span)
2) it ALWAYS produces the same city and I'm pretty sure it's not the right one either.
I hope I managed to explain my problem clearly enough and I'd be happy for any kind of help. Thanks in advance ! :D
If you want to get the max temperature per city use this statement:
SELECT * FROM (
SELECT gz.loc_id, MAX(max_temp_2m) as temperature
FROM wettermessung as wm
INNER JOIN gehoert_zu as gz
ON wm.stations_id = gz.stations_id
WHERE wm.datum BETWEEN '2012-8-1' AND '2012-12-1'
GROUP BY gz.loc_id) as subselect
INNER JOIN staedte as std
ON std.loc_id = subselect.loc_id
ORDER BY subselect.temperature DESC
Use this statement to get the city with the highest temperature (only 1 city):
SELECT * FROM(
SELECT name, MAX(max_temp_2m) as temp
FROM wettermessung as wm
INNER JOIN gehoert_zu as gz
ON wm.stations_id = gz.stations_id
INNER JOIN staedte as std
ON gz.loc_id = std.loc_id
WHERE wm.datum BETWEEN '2012-8-1' AND '2012-12-1'
GROUP BY name
ORDER BY MAX(max_temp_2m) DESC
LIMIT 1) as subselect
ORDER BY temp desc
LIMIT 1
For performance and readability reasons, always use explicit joins (LEFT, RIGHT, INNER JOIN) rather than comma-separated table lists, so your SQL server does not have to guess your table references.
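For example, the question's comma-style join could be rewritten with explicit joins like this (same logic, only the join style changes):
SELECT name, MAX(max_temp_2m)
FROM wettermessung
INNER JOIN gehoert_zu ON wettermessung.stations_id = gehoert_zu.stations_id
INNER JOIN staedte ON gehoert_zu.loc_id = staedte.loc_id
WHERE wettermessung.datum BETWEEN '2012-8-1' AND '2012-12-1'
GROUP BY name
ORDER BY MAX(max_temp_2m) DESC
LIMIT 1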
This is a general example of how to get the item with the highest, lowest, biggest, smallest, whatever value. You can adjust it to your particular situation.
select fred, barney, wilma
from bedrock join
(select fred, max(dino) maxdino
from bedrock
where whatever
group by fred ) flinstone on bedrock.fred = flinstone.fred
where dino = maxdino
and other conditions
I propose you use a consistent naming convention. Singular terms for tables holding a single item per row is a good convention. Your only table breaking this is staedte. It should be stadt.
And I suggest to use station_id consistently instead of either s_id and stations_id.
Building on these premises, for your question:
... get the name of the city with the ... highest temperature at a specified date
SELECT s.name, w.max_temp_2m
FROM (
SELECT station_id, max_temp_2m
FROM wettermessung
WHERE datum >= '2012-8-1'::date
AND datum < '2012-12-1'::date -- exclude upper border
ORDER BY max_temp_2m DESC, station_id -- id as tie breaker
LIMIT 1
) w
JOIN gehoert_zu g USING (station_id) -- assuming normalized names
JOIN stadt s USING (loc_id)
Use explicit JOIN conditions for better readability and maintenance.
Use table aliases to simplify your query.
Use x >= a AND x < b to include the lower border and exclude the upper border, which is the common use case.
Aggregate first and pick your station with the highest temperature, before you join to the other tables to retrieve the city name. Much simpler and faster.
You did not specify what to do when multiple "wettermessungen" tie on max_temp_2m in the given time frame. I added station_id as tiebreaker, meaning the station with the lowest id will be picked consistently if there are multiple qualifying stations.

table design + SQL question

I have a table foodbar, created with the following DDL. (I am using mySQL 5.1.x)
CREATE TABLE foodbar (
id INT NOT NULL AUTO_INCREMENT,
user_id INT NOT NULL,
weight double not null,
created_at date not null,
PRIMARY KEY (id)
);
I have four questions:
1. How may I write a query that returns a result set that gives me the following information: user_id, weight_gain, where weight_gain is the difference between a weight and a weight that was recorded 7 days ago?
2. How may I write a query that will return the top N users with the biggest weight gain (again, say, over a week)? An 'obvious' way may be to use the query obtained in question 1 above as a subquery, but somehow picking the top N.
3. Since in question 2 (and indeed question 1) I am searching the records in the table using a calculated field, indexing would be preferable to optimise the query - however, since it is a calculated field, it is not clear which field to index (I'm guessing the 'weight' field is the one that needs indexing). Am I right in that assumption?
4. Assuming I had another field in the foodbar table (say 'height') and I wanted to select records from the table based on (say) the product (i.e. multiplication) of 'height' and 'weight' - would I be right in assuming again that I need to index 'height' and 'weight'? Do I also need to create a composite key (say (height, weight))? If this question is not clear, I would be happy to clarify.
I don't see why you should need the synthetic key, so I'll use this table instead:
CREATE TABLE foodbar (
user_id INT NOT NULL
, created_at date not null
, weight double not null
, PRIMARY KEY (user_id, created_at)
);
How may I write a query that returns a result set that gives me the following information: user_id, weight_gain where weight_gain is the difference between a weight and a weight that was recorded 7 days ago.
SELECT curr.user_id, curr.weight - prev.weight
FROM foodbar curr, foodbar prev
WHERE curr.user_id = prev.user_id
AND curr.created_at = CURRENT_DATE
AND prev.created_at = CURRENT_DATE - INTERVAL '7 days'
;
the date arithmetic syntax is probably wrong but you get the idea
How may I write a query that will return the top N users with the biggest weight gain (again say over a week).? An 'obvious' way may be to use the query obtained in question 1 above as a subquery, but somehow picking the top N.
see above, add ORDER BY curr.weight - prev.weight DESC and LIMIT N
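Putting that together, and using MySQL's INTERVAL 7 DAY syntax for the date arithmetic above (a sketch, with N replaced by a literal such as 10):
SELECT curr.user_id, curr.weight - prev.weight AS weight_gain
FROM foodbar curr
JOIN foodbar prev ON curr.user_id = prev.user_id
WHERE curr.created_at = CURRENT_DATE
  AND prev.created_at = CURRENT_DATE - INTERVAL 7 DAY
ORDER BY curr.weight - prev.weight DESC
LIMIT 10;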
for the last two questions: don't speculate, examine execution plans. (postgresql has EXPLAIN ANALYZE, dunno about mysql) you'll probably find you need to index columns that participate in WHERE and JOIN, not the ones that form the result set.
I think that "just somebody" covered most of what you're asking, but I'll just add that indexing columns that take part in a calculation is unlikely to help you at all unless it happens to be a covering index.
For example, it doesn't help to order the following rows by X, Y if I want to get them in the order of their product X * Y:
X Y
1 8
2 2
4 4
The products would order them as:
X Y Product
2 2 4
1 8 8
4 4 16
If mySQL supports calculated columns in a table and allows indexing on those columns then that might help.
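For what it's worth, later MySQL versions (5.7 and up, so not the 5.1.x in the question) do support this via generated columns; a sketch, assuming the hypothetical height column from question 4 exists alongside weight:
-- MySQL 5.7+ only; height is a hypothetical extra column
ALTER TABLE foodbar
    ADD COLUMN height_weight DOUBLE AS (height * weight) STORED,
    ADD INDEX idx_height_weight (height_weight);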
I agree with just somebody regarding the primary key, but for what you're asking regarding the weight calculation, you'd be better off storing the delta rather than the weight:
CREATE TABLE foodbar (
user_id INT NOT NULL,
created_at date not null,
weight_delta double not null,
PRIMARY KEY (user_id, created_at)
);
It means you'd store the user's initial weight in, say, the user table, and when you write records to the foodbar table, a user could supply the weight at that time, but the query would subtract the initial weight from the current weight. So you'd see values like:
user_id weight_delta
------------------------
1 2
1 5
1 -3
Looking at that, you know that user 1 gained 4 pounds/kilos/stones/etc.
This way you could use SUM, because it's possible for someone to have weighings every day - using just somebody's equation of curr.weight - prev.weight wouldn't work, regardless of time span.
Getting the top x is easy in MySQL - use the LIMIT clause, but mind that you provide an ORDER BY to make sure the limit is applied correctly.
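A sketch of both ideas together, summing the stored deltas per user over the last week and taking the top 10 gainers:
SELECT user_id, SUM(weight_delta) AS weight_gain
FROM foodbar
WHERE created_at >= CURRENT_DATE - INTERVAL 7 DAY
GROUP BY user_id
ORDER BY weight_gain DESC
LIMIT 10;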
It's not obvious, but there's some important information missing in the problem you're trying to solve. It becomes more noticeable when you think about realistic data going into this table. The problem is that you're unlikely to have a consistent, regular daily record of users' weights. So you need to clarify a couple of rules around determining 'current weight' and 'weight x days ago'. I'm going to assume the following simplistic rules:
The most recent weight reading is the 'current-weight'. (Even though that could be months ago.)
The most recent weight reading more than x days ago will be the weight assumed at x days ago. (Even though for example a reading from 6 days ago would be more reliable than a reading from 21 days ago when determining weight 7 days ago.)
Now to answer the questions:
1&2: Using the above extra rules provides an opportunity to produce two result sets: current weights, and previous weights:
Current weights:
select rd.*,
w.Weight
from (
select User_id,
max(Created_at) AS Read_date
from Foodbar
group by User_id
) rd
inner join Foodbar w on
w.User_id = rd.User_id
and w.Created_at = rd.Read_date
Similarly for the x days ago reading:
select rd.*,
w.Weight
from (
select User_id,
max(Created_at) AS Read_date
from Foodbar
where Created_at < DATEADD(dd, -7, GETDATE()) /*Or appropriate MySql equivalent*/
group by User_id
) rd
inner join Foodbar w on
w.User_id = rd.User_id
and w.Created_at = rd.Read_date
Now simply join these results as subqueries
select cur.User_id,
cur.Weight as Cur_weight,
prev.Weight as Prev_weight,
cur.Weight - prev.Weight as Weight_change
from (
/*Insert query #1 here*/
) cur
inner join (
/*Insert query #2 here*/
) prev on
prev.User_id = cur.User_id
If I remember correctly the MySql syntax to get the top N weight gains would be to simply add:
ORDER BY cur.Weight - prev.Weight DESC limit N
2&3: Choosing indexes requires a little understanding of how the query optimiser will process the query:
The important thing when it comes to index selection is what columns you are filtering by or joining on. The optimiser will use the index if it is determined to be selective enough (note that sometimes your filters have to be extremely selective, returning < 1% of data, to be considered useful). There's always a trade-off between slow disk seek times of navigating indexes and simply processing all the data in memory.
3: Although weights feature significantly in what you display, their only relevance in terms of filtering (or selection) is in #2, to get the top N weight gains. This is a complex calculation based on a number of queries and a lot of processing that has gone before, so Weight will provide zero benefit as an index.
Another note is that even for #2 you have to calculate the weight change of all users in order to determine which have gained the most. Therefore, unless you have a very large number of readings per user, you will read most of the table. (I.e. a table scan will be used to obtain the bulk of the data.)
Where indexes can benefit:
You are trying to identify specific Foodbar rows based on User_id and Created_at.
You are also joining back to the Foodbar table again using User_id and Created_at.
This implies an index on User_id, Created_at would be useful (more so if this is the clustered index).
4: No. Unfortunately, it is mathematically impossible for the individual values of H and W to independently determine the ordering of their product. E.g. H=3 and W=3 are each less than 5, yet with H=5 and W=1 the product 3*3 = 9 is greater than 5*1 = 5.
You would have to actually store the calculation and create an index on that additional column. However, as indicated in my answer to #3 above, it is still unlikely to prove beneficial.

Distribution of table in time

I have a MySQL table with approximately 3000 rows per user. One of the columns is a datetime field, which is mutable, so the rows aren't in chronological order.
I'd like to visualize the time distribution in a chart, so I need a number of individual datapoints. 20 datapoints would be enough.
I could do this:
select timefield from entries where uid = ? order by timefield;
and look at every 150th row.
Or I could do 20 separate queries and use limit 1 and offset.
But there must be a more efficient solution...
Michal Sznajder almost had it, but you can't use column aliases in a WHERE clause in SQL. So you have to wrap it as a derived table. I tried this and it returns 20 rows:
SELECT * FROM (
SELECT @rownum:=@rownum+1 AS rownum, e.*
FROM (SELECT @rownum := 0) r, entries e) AS e2
WHERE uid = ? AND rownum % 150 = 0;
Something like this came to my mind
select @rownum:=@rownum+1 rownum, entries.*
from (select @rownum:=0) r, entries
where uid = ? and rownum % 150 = 0
I don't have MySQL at my hand but maybe this will help ...
As far as visualization, I know this is not the periodic sampling you are talking about, but I would look at all the rows for a user and choose an interval bucket, SUM within the buckets and show on a bar graph or similar. This would show a real "distribution", since many occurrences within a time frame may be significant.
SELECT DATEADD(day, DATEDIFF(day, 0, timefield), 0) AS bucket -- choose an appropriate granularity (days used here)
,COUNT(*)
FROM entries
WHERE uid = ?
GROUP BY DATEADD(day, DATEDIFF(day, 0, timefield), 0)
ORDER BY DATEADD(day, DATEDIFF(day, 0, timefield), 0)
Or if you don't like the way you have to repeat yourself - or if you are playing with different buckets and want to analyze across many users in 3-D (measure in Z against x, y uid, bucket):
SELECT uid
,bucket
,COUNT(*) AS measure
FROM (
SELECT uid
,DATEADD(day, DATEDIFF(day, 0, timefield), 0) AS bucket
FROM entries
) AS buckets
GROUP BY uid
,bucket
ORDER BY uid
,bucket
If I wanted to plot in 3-D, I would probably determine a way to order users according to some meaningful overall metric for the user.
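Note that the DATEADD/DATEDIFF idiom above is SQL Server syntax; since the question is about MySQL, a rough equivalent of the first bucketing query would be (a sketch, using DATE() for day granularity):
SELECT DATE(timefield) AS bucket,
       COUNT(*) AS measure
FROM entries
WHERE uid = ?
GROUP BY DATE(timefield)
ORDER BY DATE(timefield);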
@Michal
For whatever reason, your example only works when the WHERE clause on @rownum uses a less-than operator. I think when the WHERE filters out a row, the rownum doesn't get incremented, and it can't match anything else.
If the original table has an auto incremented id column, and rows were inserted in chronological order, then this should work:
select timefield from entries
where uid = ? and id % 150 = 0 order by timefield;
Of course that doesn't work if there is no correlation between the id and the timefield, unless you don't actually care about getting evenly spaced timefields, just 20 random ones.
Do you really care about the individual data points? Or will using the statistical aggregate functions on the day number instead suffice to tell you what you wish to know?
AVG
STDDEV_POP
VARIANCE
TO_DAYS
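For instance, a sketch of that idea in MySQL, running the aggregates over the day number:
SELECT AVG(TO_DAYS(timefield)) AS avg_day,
       STDDEV_POP(TO_DAYS(timefield)) AS stddev_day,
       VARIANCE(TO_DAYS(timefield)) AS var_day
FROM entries
WHERE uid = ?;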
select timefield
from entries
where rand() < .01 -- will return ~1% of rows; adjust as needed
Not a mysql expert so I'm not sure how rand() operates in this environment.
For my reference - and for those using postgres - Postgres 9.4 will have ordered set aggregates that should solve this problem:
SELECT percentile_disc(0.95)
WITHIN GROUP (ORDER BY response_time)
FROM pageviews;
Source: http://www.craigkerstiens.com/2014/02/02/Examining-PostgreSQL-9.4/