Select every Nth row where N is dynamic - sql

In a single query I'm trying to implement decimation on a database view (the view's size is on the order of 10^7 rows). To do so, I'm trying to select every Nth element in a result set. The implementation I currently have is correct, but very slow (2-3 seconds), and I'm looking for a faster alternative.
Right now I select my result set with row numbers, select from that result set with each row's mod value, and finally select all rows from that result set where the mod value equals 0 or the row number equals the total count (to retrieve the last element). To find the mod value I take the total size of the initial result set and divide it by 1000 (we always decimate to 1000 as a fixed number).
SELECT * FROM (
    SELECT id, age, sex, height, weight, attendence_time,
           INITIAL.rownum,
           MOD(INITIAL.rownum - 1, CEIL(COUNT(*) OVER () / 1000)) AS person_mod
    FROM (
        SELECT id, age, sex, height, weight, attendence_time,
               ROW_NUMBER() OVER (ORDER BY attendence_time) AS rownum
        FROM person p
        WHERE p.age = :age
    ) INITIAL
) MODROW
WHERE MODROW.person_mod = 0
   OR rownum = (
        SELECT COUNT(*)
        FROM person p
        WHERE p.age = :age
   )
I've looked into NTILE, and although I can use it to separate my result set into 1000 groups based on the row number, I'm not sure how to get just the first element from each group to have a decimated response.
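For reference, a sketch of what that could look like (not from the original post; it assumes the same person table, columns, and :age bind variable as above): number the rows within each NTILE bucket in a second pass and keep only the first row of each bucket.
SELECT id, age, sex, height, weight, attendence_time
FROM (
    SELECT t.*,
           -- rank rows inside each bucket so the first row per bucket can be picked
           ROW_NUMBER() OVER (PARTITION BY t.bucket ORDER BY t.attendence_time) AS rn_in_bucket
    FROM (
        SELECT p.id, p.age, p.sex, p.height, p.weight, p.attendence_time,
               NTILE(1000) OVER (ORDER BY p.attendence_time) AS bucket
        FROM person p
        WHERE p.age = :age
    ) t
)
WHERE rn_in_bucket = 1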
Are there any recommendations on how I might be able to make it more efficient?
Since I have a Java backend, would it be better to return the initial large result set and decimate in Java?
Note I'm using Oracle SQL with JPA/JPQL.

Related

Using a subquery in a where clause to find the second smallest value

I'm trying to find the second smallest value for a list to put it into SSRS as a way to highlight that value. The issue is that there are multiple minimum values for a given element. The data is presented such that there is an overarching group A that encompasses smaller groups B, and I want the second smallest value for each of the smaller groups.
I have a query set up right now that uses a subquery in the where clause to exclude the minimum value from the search so that the second smallest value will be considered the new minimum value. This seemed to work but the subquery only rules out the minimum value for the larger A group, which may or may not be the minimum value for each B group. Here is my query:
Select
    BPosition,
    Min(Value) as SecondMinimum
From Table
Where Value > (Select Min(Value)
               From Table
               Where APosition = @AName)
  and APosition = @AName
Group By BPosition
I was expecting a list of the second smallest values for each B group, but it is pulling in the smallest value in each B group that is greater than the smallest value of the A group. This is right for the one B group that contains the true smallest value but incorrect for the others.
If you want the second smallest value, use dense_rank():
Select distinct BPosition, Value as SecondMinimum
From (select t.*,
             dense_rank() over (partition by APosition, BPosition order by Value) as seqnum
      from Table t
     ) t
where seqnum = 2;
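If, as in the original query, the result should also be limited to a single A group, the same @AName filter can go inside the derived table (a sketch under that assumption; partitioning by BPosition alone is then enough):
Select distinct BPosition, Value as SecondMinimum
From (select t.*,
             dense_rank() over (partition by BPosition order by Value) as seqnum
      from Table t
      where APosition = @AName
     ) t
where seqnum = 2;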

Min function in postgresql

I am trying to find the division with the lowest population density. To do so I did the following:
SELECT P.edname, MIN((P.total_area*1000)/P.total2011) AS "Lowest population density"
FROM eds_census2011 P
GROUP BY P.edname
HAVING COUNT (*)> 1
total_area is multiplied by 1000 (so it is in square metres) and divided by the total population.
I want only one record displaying the division (edname) and the calculated population density (MIN((P.total_area*1000)/P.total2011)); instead I get all the records - not even sorted...
The problem is that I have to group it by edname; if I leave out the GROUP BY and HAVING lines I get an error. Any help is greatly appreciated!
Try
SELECT edname, (total_area*1000/total2011) density
FROM eds_census2011
WHERE (total_area*1000/total2011) = (SELECT MIN(total_area*1000/total2011) FROM eds_census2011)
A 'Return only one row' rule could be easily enforced by using LIMIT 1 if it's really necessary
Without subquery:
SELECT p.edname, min((p.total_area * 1000)/p.total2011) AS lowest_pop
FROM eds_census2011 p
GROUP BY p.edname
HAVING COUNT (*) > 1
ORDER BY 2
LIMIT 1;
This one returns only 1 row (if any qualify), even if multiple rows have equally low density.
If you just want the lowest density, period, this can be much simpler:
SELECT edname, (total_area * 1000)/total2011 AS lowest_pop
FROM eds_census2011
ORDER BY 2
LIMIT 1;
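If every row tied for the lowest density should be returned rather than an arbitrary single one, a window-function variant (a sketch, not from the original answers) would be:
SELECT edname, density AS lowest_pop
FROM (
    SELECT edname,
           (total_area * 1000) / total2011 AS density,
           -- RANK gives the same rank to ties, so all lowest-density rows get rnk = 1
           RANK() OVER (ORDER BY (total_area * 1000) / total2011) AS rnk
    FROM eds_census2011
) ranked
WHERE rnk = 1;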

Check Sequence in Max Min Values

I have a database table that Stores Maximum and Minimum Price Breaks for a Product.
Does anyone know the SQL that would tell me if I have a break from one Max to the Min of the next item? E.g. with 1-10 and 12-20, I would like it to return either the numbers that are missing or, at the very least, a count or bool if it can detect a break between the absolute Min and the absolute Max by going through each range.
SQL Server (MSSQL) 2008
For a database that supports window functions, like Oracle:
SELECT t.*
, CASE LAG(maxq+1, 1, minq) OVER (PARTITION BY prod ORDER BY minq)
WHEN minq
THEN 0
ELSE 1
END AS is_gap
FROM tbl t
;
This will produce is_gap = 1 for a row that forms a gap with the previous row (ordered by minq). If your quantity ranges can overlap, the logic would need to be adjusted accordingly.
http://sqlfiddle.com/#!4/f609e/4
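To boil that down to the per-product count (or bool) the question asks for, the same expression can be wrapped in a subquery and aggregated (a sketch using the same assumed tbl, prod, minq, maxq names as above):
SELECT prod,
       SUM(is_gap) AS gap_count   -- anything > 0 means at least one break in the ranges
FROM (
    SELECT prod,
           CASE LAG(maxq + 1, 1, minq) OVER (PARTITION BY prod ORDER BY minq)
                WHEN minq THEN 0
                ELSE 1
           END AS is_gap
    FROM tbl
) t
GROUP BY prod;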
Something like this, giving max quantities that aren't the overall max for the product and don't have a min quantity following them:
select prev.tbProduct_Id,prev.MaxQuantity
from yourtable prev
left join (select tbProduct_ID, max(MaxQuantity) MaxQuantity from yourtable group by tbProduct_id) maxes
on maxes.tbProduct_ID=prev.tbProduct_Id and maxes.MaxQuantity=prev.MaxQuantity
left join yourtable next
on next.tbProduct_Id=prev.tbProduct_Id and next.MinQuantity=prev.MaxQuantity+1
where maxes.tbProduct_Id is null and next.tbProduct_Id is null;
This would fail on your sample data, though, because it would expect a row with MinQuantity 21, not 20.

SQL Server : how to select a fixed amount of rows (select every x-th value)

A short description: I have a table with data that is updated over a certain time period. Now the problem is that - depending on the nature of the sensor which sends the data - in this time period there could be either 50 data sets or 50,000. As I want to visualize this data (using ASP.NET / C#), for a first preview I would like to SELECT just 1000 values from the table.
I already have an approach doing this: I count the rows in the time period of interest, with a simple "where" clause to specify the sensor-id, save it as a variable in SQL, and then divide the count() by 1000. I've tried it in MS Access, where it works just fine:
set @divider = select count(*) from table where [...]
SELECT (Int([RowNumber]/@divider)), First(Value)
FROM myTable
GROUP BY (Int([RowNumber]/@divider));
The trick in Access was that I simply have a data field ("RowNumber"), which is my PK/ID and goes from 0 up. I tried to accomplish that in SQL Server using the ROW_NUMBER() method, which works more or less. I've got the right syntax for the method, but I can not use it in the GROUP BY statement:
Windowed functions can only appear in the SELECT or ORDER BY clauses.
meaning ROW_NUMBER() can't be in the GROUP BY statement.
Now I'm kinda stuck. I've tried to save the ROW_NUMBER value into a char or a separate column, and GROUP BY it later on, but I couldn't get it done. And somehow I start to think, that my strategy might have its weaknesses ...? :/
To clarify once more: I don't need to SELECT TOP 1000 from my table, because this would just mean that I select the first 1000 values (depending on the sorting). I need to SELECT every x-th value, while I can compute the x (and I could even round it to an INT, if that would help to get it done). I hope I was able to describe the problem understandably ...
This is my first post here on StackOverflow, I hope I didn't forget anything essential or important, if you need any further information (table structure, my queries so far, ...) please don't hesitate to ask. Any help or hint is highly appreciated - thanks in advance! :)
Update: SOLUTION! Big thanks to https://stackoverflow.com/users/52598/lieven!!!
Here is how I did it in the end:
I declare 2 variables - I count my rows and SET the count into the first var. Then I use ROUND() on the just-assigned variable and divide it by 1000 (because in the end I want ABOUT 1000 values!). I split this operation into 2 variables, because when I used the value from the COUNT function directly as the basis for my ROUND operation, there were some mistakes.
declare @myvar decimal(10,2)
declare @myvar2 decimal(10,2)
set @myvar = (select COUNT(*)
              from value_table
              where channelid = 135 and myDate >= '2011-01-14 22:00:00.000' and myDate <= '2011-02-14 22:00:00.000'
             )
set @myvar2 = ROUND(@myvar/1000, 0)
Now I have the rounded value, which I want to be my step-size (take every x-th value -> this is our "x" ;)) stored in @myvar2. Next I will subselect the data of the desired timespan and channel, and add the ROW_NUMBER() as column "rn", and finally add a WHERE-clause to the outer SELECT, where I divide the ROW_NUMBER by @myvar2 - when the modulus is 0, the row will be SELECTed.
select * from
(
    select (ROW_NUMBER() over (order by id desc)) as rn, myValue, myDate
    from value_table
    where channel_id = 135 and myDate >= '2011-01-14 22:00:00.000' and myDate <= '2011-02-14 22:00:00.000'
) d
WHERE rn % @myvar2 = 0
Works like a charm - once again all my thanks to https://stackoverflow.com/users/52598/lieven, see the comment below for the original posting!
In essence, all you need to do to select the x-th value is retain all rows where the modulus of the rownumber divided by x is 0.
WHERE rn % @x_thValues = 0
Now to be able to use your ROW_NUMBER's result, you'll need to wrap the entire statement in a subselect
SELECT *
FROM (
SELECT *
, rn = ROW_NUMBER() OVER (ORDER BY Value)
FROM DummyData
) d
WHERE rn % @x_thValues = 0
Combined with a variable for which x-th values you need, you might use something like this test script:
DECLARE @x_thValues INTEGER = 2
;WITH DummyData AS (SELECT * FROM (VALUES (1), (2), (3), (4)) v (Value))
SELECT *
FROM (
SELECT *
, rn = ROW_NUMBER() OVER (ORDER BY Value)
FROM DummyData
) d
WHERE rn % @x_thValues = 0
One more option to consider:
Select Top 1000 *
From dbo.SomeTable
Where ....
Order By NewID()
but to be honest, I like the previous answer more than this one.
The question could be about performance..
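If a random (rather than evenly spaced) sample is acceptable but sorting the whole table by NEWID() gets expensive, SQL Server's TABLESAMPLE clause is another option to be aware of (a sketch against the same hypothetical dbo.SomeTable; it samples data pages, so the returned row count is only approximate):
Select *
From dbo.SomeTable TABLESAMPLE (1000 ROWS)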

Distribution of table in time

I have a MySQL table with approximately 3000 rows per user. One of the columns is a datetime field, which is mutable, so the rows aren't in chronological order.
I'd like to visualize the time distribution in a chart, so I need a number of individual datapoints. 20 datapoints would be enough.
I could do this:
select timefield from entries where uid = ? order by timefield;
and look at every 150th row.
Or I could do 20 separate queries and use limit 1 and offset.
But there must be a more efficient solution...
Michal Sznajder almost had it, but you can't use column aliases in a WHERE clause in SQL. So you have to wrap it as a derived table. I tried this and it returns 20 rows:
SELECT * FROM (
    SELECT @rownum := @rownum + 1 AS rownum, e.*
    FROM (SELECT @rownum := 0) r, entries e
) AS e2
WHERE uid = ? AND rownum % 150 = 0;
Something like this came to my mind
select @rownum := @rownum + 1 rownum, entries.*
from (select @rownum := 0) r, entries
where uid = ? and rownum % 150 = 0
I don't have MySQL at my hand but maybe this will help ...
As far as visualization, I know this is not the periodic sampling you are talking about, but I would look at all the rows for a user and choose an interval bucket, SUM within the buckets and show on a bar graph or similar. This would show a real "distribution", since many occurrences within a time frame may be significant.
SELECT DATEADD(day, DATEDIFF(day, 0, timefield), 0) AS bucket -- choose an appropriate granularity (days used here)
,COUNT(*)
FROM entries
WHERE uid = ?
GROUP BY DATEADD(day, DATEDIFF(day, 0, timefield), 0)
ORDER BY DATEADD(day, DATEDIFF(day, 0, timefield), 0)
Or if you don't like the way you have to repeat yourself - or if you are playing with different buckets and want to analyze across many users in 3-D (measure in Z against x, y uid, bucket):
SELECT uid
,bucket
,COUNT(*) AS measure
FROM (
SELECT uid
,DATEADD(day, DATEDIFF(day, 0, timefield), 0) AS bucket
FROM entries
) AS buckets
GROUP BY uid
,bucket
ORDER BY uid
,bucket
If I wanted to plot in 3-D, I would probably determine a way to order users according to some meaningful overall metric for the user.
@Michal
For whatever reason, your example only works when the where on @recnum uses a less than operator. I think when the where filters out a row, the rownum doesn't get incremented, and it can't match anything else.
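One way around that (a sketch, keeping the same entries table and ? placeholder as above) is to apply the uid filter inside the derived table, so the counter only increments over that user's rows, ordered by timefield:
SELECT *
FROM (
    SELECT @rownum := @rownum + 1 AS rownum, e.*
    FROM (SELECT @rownum := 0) r,
         -- filter and order first, then number the surviving rows
         (SELECT * FROM entries WHERE uid = ? ORDER BY timefield) e
) AS e2
WHERE rownum % 150 = 0;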
If the original table has an auto incremented id column, and rows were inserted in chronological order, then this should work:
select timefield from entries
where uid = ? and id % 150 = 0 order by timefield;
Of course that doesn't work if there is no correlation between the id and the timefield, unless you don't actually care about getting evenly spaced timefields, just 20 random ones.
Do you really care about the individual data points? Or will using the statistical aggregate functions on the day number instead suffice to tell you what you wish to know? (See the sketch after this list.)
AVG
STDDEV_POP
VARIANCE
TO_DAYS
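A minimal sketch of that idea in MySQL, applying the aggregates to TO_DAYS of the same timefield (table and placeholder names as in the question):
SELECT AVG(TO_DAYS(timefield))        AS avg_day,      -- mean day number
       STDDEV_POP(TO_DAYS(timefield)) AS stddev_day,   -- spread of the distribution
       VARIANCE(TO_DAYS(timefield))   AS variance_day
FROM entries
WHERE uid = ?;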
select timefield
from entries
where rand() <= .01 -- will return roughly 1% of rows; adjust as needed.
Not a mysql expert so I'm not sure how rand() operates in this environment.
For my reference - and for those using postgres - Postgres 9.4 will have ordered set aggregates that should solve this problem:
SELECT percentile_disc(0.95)
WITHIN GROUP (ORDER BY response_time)
FROM pageviews;
Source: http://www.craigkerstiens.com/2014/02/02/Examining-PostgreSQL-9.4/