More efficient query to count group where values in range - sql

I am working with data that essentially looks like this.
table:processed_data
sensor_id, reading, time_stamp
1,0.1,1234567890
1,0.3,1234567891
1,0.9,1234567892
1,0.32,1234567893
...
What I want is a query that makes a single pass through the data and counts how many readings fall into each category. A simple example:
categories (0-0.5, 0.5-0.7, 0.7-1) (I am actually planning on breaking them into 10 categories with 0.1 increments, though).
This is essentially what I want, even though it isn't valid sql:
select count(reading between 0 and 0.5), count(reading between 0.5 and 0.7), count(reading between 0.7 and 1) from processed_data;
The only way I can think of doing it, though, is a categories × N operation rather than a single pass:
select count(*) as low from processed_data where reading between 0 and 0.5
union
select count(*) as med from processed_data where reading between 0.5 and 0.7
union
select count(*) as high from processed_data where reading between 0.7 and 1;
I might just resort to doing the processing in php and scan the data once, but I would prefer to have sql do it, if it can be smart enough.

You can derive the category from the value, and use that for grouping:
SELECT CAST(reading * 10 AS INTEGER),
COUNT(*)
FROM processed_data
GROUP BY CAST(reading * 10 AS INTEGER);
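For the asker's original uneven buckets, the single-pass query can also be written with conditional aggregation (SUM over CASE), which works in essentially every SQL dialect. A quick sanity check using SQLite from Python, with the sample rows from the question:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE processed_data (sensor_id INT, reading REAL, time_stamp INT)")
con.executemany(
    "INSERT INTO processed_data VALUES (?, ?, ?)",
    [(1, 0.1, 1234567890), (1, 0.3, 1234567891),
     (1, 0.9, 1234567892), (1, 0.32, 1234567893)],
)

# One pass over the table: each SUM(CASE ...) counts one bucket.
row = con.execute("""
    SELECT SUM(CASE WHEN reading >= 0   AND reading < 0.5 THEN 1 ELSE 0 END) AS low,
           SUM(CASE WHEN reading >= 0.5 AND reading < 0.7 THEN 1 ELSE 0 END) AS med,
           SUM(CASE WHEN reading >= 0.7 AND reading <= 1  THEN 1 ELSE 0 END) AS high
    FROM processed_data
""").fetchone()
print(row)  # (3, 0, 1)
```

Note the half-open boundaries; the question's `BETWEEN` version would count a reading of exactly 0.5 in two buckets.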


In SQL there are aggregation operators, like AVG, SUM, COUNT. Why doesn't it have an operator for multiplication? "MUL" or something.
I was wondering, does it exist for Oracle, MSSQL or MySQL? If not, is there a workaround that would give this behaviour?
By MUL do you mean progressive multiplication of values?
Even with 100 rows of modestly sized values (say, around 10), your MUL(column) is going to overflow any data type! With such a high probability of misuse and abuse, and very limited scope for use, it does not need to be a SQL standard. As others have shown, there are mathematical ways of working it out, just as there are many, many ways to do tricky calculations in SQL using only standard (and common-use) methods.
Sample data:
Column
1
2
4
8
COUNT : 4 items (1 for each non-null)
SUM : 1 + 2 + 4 + 8 = 15
AVG : 3.75 (SUM/COUNT)
MUL : 1 x 2 x 4 x 8 ? ( =64 )
For completeness, the Oracle, MSSQL, MySQL core implementations *
Oracle : EXP(SUM(LN(column))) or POWER(N,SUM(LOG(column, N)))
MSSQL : EXP(SUM(LOG(column))) or POWER(N,SUM(LOG(column)/LOG(N)))
MySQL : EXP(SUM(LOG(column))) or POW(N,SUM(LOG(N,column)))
Care when using EXP/LOG in SQL Server, watch the return type http://msdn.microsoft.com/en-us/library/ms187592.aspx
The POWER form allows for larger numbers (using bases larger than Euler's number), and in cases where the result grows too large to convert back using POWER, you can return just the logarithmic value and calculate the actual number outside of the SQL query.
* LOG(0) and LOG of a negative number are undefined. The code below shows only how to handle this in SQL Server; equivalents can be found for the other SQL flavours using the same concept.
create table MUL(data int)
insert MUL select 1 union all
select 2 union all
select 4 union all
select 8 union all
select -2 union all
select 0
select CASE WHEN MIN(abs(data)) = 0 then 0 ELSE
EXP(SUM(Log(abs(nullif(data,0))))) -- the base mathematics
* round(0.5-count(nullif(sign(sign(data)+0.5),1))%2,0) -- pairs up negatives
END
from MUL
Ingredients:
Taking the abs() of data: if the min is 0, multiplying by anything else is futile, since the result is 0.
When data is 0, NULLIF converts it to NULL. abs() and log() then both return NULL, so the row is excluded from SUM().
If data is not 0, abs() lets us multiply a negative number using the LOG method; we keep track of the negativity elsewhere.
Working out the final sign
sign(data) returns 1 for >0, 0 for 0 and -1 for <0.
We add another 0.5 and take the sign() again, so we have now classified 0 and 1 both as 1, and only -1 as -1.
Again use NULLIF, this time to remove the 1's from COUNT(), since we only need to count up the negatives.
% 2 against the count() of negative numbers returns either
--> 1 if there is an odd number of negative numbers
--> 0 if there is an even number of negative numbers
More mathematical tricks: we subtract that 1 or 0 from 0.5, so that the above becomes
--> (0.5-1=-0.5=>round to -1) if there is an odd number of negative numbers
--> (0.5-0= 0.5=>round to 1) if there is an even number of negative numbers
We multiply this final 1/-1 against the SUM-PRODUCT value for the real result.
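The bookkeeping above is easier to follow outside SQL; here's a Python sketch of the same zero/negative handling (a re-implementation for illustration, not the SQL itself):

```python
import math

def mul(values):
    # CASE WHEN MIN(abs(data)) = 0 THEN 0: any zero makes the product zero
    if any(v == 0 for v in values):
        return 0
    # EXP(SUM(LOG(ABS(...)))): magnitude of the product
    magnitude = math.exp(sum(math.log(abs(v)) for v in values))
    # pairs up negatives: an odd count of negatives flips the sign
    negatives = sum(1 for v in values if v < 0)
    sign = -1 if negatives % 2 else 1
    return sign * round(magnitude)

print(mul([1, 2, 4, 8, -2, 0]))  # 0
print(mul([1, 2, 4, 8, -2]))     # -128
```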
No, but you can use Mathematics :)
if yourColumn is always bigger than zero:
select EXP(SUM(LOG(yourColumn))) As ColumnProduct from yourTable
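The identity being used is x1 · x2 · … · xn = EXP(SUM(LN(xi))), valid when every value is positive; a quick check in Python:

```python
import math

values = [1, 2, 4, 8]
# exp of the sum of logs equals the product (up to floating-point error)
product = math.exp(sum(math.log(v) for v in values))
print(round(product))  # 64
```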
I see an Oracle answer is still missing, so here it is:
SQL> with yourTable as
2 ( select 1 yourColumn from dual union all
3 select 2 from dual union all
4 select 4 from dual union all
5 select 8 from dual
6 )
7 select EXP(SUM(LN(yourColumn))) As ColumnProduct from yourTable
8 /
COLUMNPRODUCT
-------------
64
1 row selected.
Regards,
Rob.
With PostgreSQL, you can create your own aggregate functions, see http://www.postgresql.org/docs/8.2/interactive/sql-createaggregate.html
To create an aggregate function on MySQL, you'll need to build an .so (linux) or .dll (windows) file. An example is shown here: http://www.codeproject.com/KB/database/mygroupconcat.aspx
I'm not sure about mssql and oracle, but i bet they have options to create custom aggregates as well.
You'll break any datatype fairly quickly as numbers mount up.
Using LOG/EXP is tricky because of numbers <= 0 that will fail when using LOG. I wrote a solution in this question that deals with this
Using CTE in MS SQL:
CREATE TABLE Foo(Id int, Val int)
INSERT INTO Foo VALUES(1, 2), (2, 3), (3, 4), (4, 5), (5, 6)
;WITH cte AS
(
SELECT Id, Val AS Multiply, row_number() over (order by Id) as rn
FROM Foo
WHERE Id=1
UNION ALL
SELECT ff.Id, cte.multiply*ff.Val as multiply, ff.rn FROM
(SELECT f.Id, f.Val, (row_number() over (order by f.Id)) as rn
FROM Foo f) ff
INNER JOIN cte
ON ff.rn -1= cte.rn
)
SELECT * FROM cte
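A quick way to sanity-check the running-product idea is SQLite, which also supports recursive CTEs. The join below uses Id = cte.Id + 1 directly, which assumes consecutive Ids (the row_number() wrapper above lifts that restriction); the full product is the last row of the CTE:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Foo(Id INT, Val INT)")
con.executemany("INSERT INTO Foo VALUES (?, ?)",
                [(1, 2), (2, 3), (3, 4), (4, 5), (5, 6)])

# Walk the rows in Id order, multiplying as we go; the last row holds the product.
product = con.execute("""
    WITH RECURSIVE cte AS (
        SELECT Id, Val AS Multiply FROM Foo WHERE Id = 1
        UNION ALL
        SELECT f.Id, cte.Multiply * f.Val
        FROM Foo f JOIN cte ON f.Id = cte.Id + 1
    )
    SELECT Multiply FROM cte ORDER BY Id DESC LIMIT 1
""").fetchone()[0]
print(product)  # 720
```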
Not sure about Oracle or sql-server, but in MySQL you can just use * like you normally would.
mysql> select count(id), count(id)*10 from tablename;
+-----------+--------------+
| count(id) | count(id)*10 |
+-----------+--------------+
|       961 |         9610 |
+-----------+--------------+
1 row in set (0.00 sec)

Lookup values based solely on minimum value of range

I want to place values within a range given only a minimum value, similar to using VLOOKUP/HLOOKUP in Excel with the "FALSE" criterion.
As seen below, TableScore lists the low-end cutpoints (CutpointVal) for a value to be assigned a specific number of points (the minimum value in a range). The below SQL code accomplishes this in two steps, with the first query generating a datasheet that includes a high value for each low value, thus creating a full range.
However, this is a somewhat clunky way of doing this, especially when trying to iterate this many times. The original table (TableScore) cannot be altered to include high values. Is there a way to accomplish a similar mechanism with only one query?
Main
ID Score
72625 2.5
78261 3.2
82766 4.7
58383 0.3
TableScore
CutpointVal Points
0 0
0.3 1
1.2 2
2.7 3
3.4 4
Upper and lower range query (RangeQry):
SELECT a.CutpointVal AS LowVal, Val(Nz((SELECT TOP 1 [CutpointVal]-0.001
FROM TableScore b
WHERE b.Points > a.Points
ORDER BY b.Points ASC),9999)) AS HighVal, a.Points
FROM TableScore AS a
ORDER BY a.Points;
Range assignment query:
SELECT Main.ID, Main.Score, RangeQry.LowVal, RangeQry.HighVal, RangeQry.Points AS PTS
FROM RangeQry, Main
WHERE (((Main.Score) Between [RangeQry].[LowVal] And [RangeQry].[HighVal]));
Desired output:
ID Score Points
72625 2.5 2
78261 3.2 3
82766 4.7 4
58383 0.3 1
Consider:
SELECT Main.ID, Main.Score, (
SELECT Max(Points) FROM TableScore WHERE CutpointVal<=Main.Score) AS Pts
FROM Main;
Or
SELECT Main.ID, Main.Score, (
SELECT TOP 1 Points FROM TableScore
WHERE CutpointVal <= Main.Score
ORDER BY Points DESC) AS Pts
FROM Main;
Or
SELECT Main.ID, Main.Score, DMax("Points","TableScore","CutpointVal<=" & [Score]) AS Pts
FROM Main;
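All three produce the desired output (the DMax form is Access-specific); the first is easy to verify with SQLite from Python, using the question's sample data:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Main(ID INT, Score REAL)")
con.execute("CREATE TABLE TableScore(CutpointVal REAL, Points INT)")
con.executemany("INSERT INTO Main VALUES (?, ?)",
                [(72625, 2.5), (78261, 3.2), (82766, 4.7), (58383, 0.3)])
con.executemany("INSERT INTO TableScore VALUES (?, ?)",
                [(0, 0), (0.3, 1), (1.2, 2), (2.7, 3), (3.4, 4)])

# For each score, take the highest Points whose cutpoint does not exceed it.
rows = con.execute("""
    SELECT Main.ID, Main.Score,
           (SELECT MAX(Points) FROM TableScore
            WHERE CutpointVal <= Main.Score) AS Pts
    FROM Main
""").fetchall()
for r in rows:
    print(r)
```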

Subset large table for use in multiple UNIONs

Suppose I have a table with the following structure:
id measure_1_actual measure_1_predicted measure_2_actual measure_2_predicted
1 1 0 0 0
2 1 1 1 1
3 . . 0 0
I want to create the following table, for each ID (shown is an example for id = 1):
measure actual predicted
1 1 0
2 0 0
Here's one way I could solve this problem (I haven't tested this, but you get the general idea, I hope):
SELECT 1 AS measure,
measure_1_actual AS actual,
measure_1_predicted AS predicted
FROM tb
WHERE id = 1
UNION
SELECT 2 AS measure,
measure_2_actual AS actual,
measure_2_predicted AS predicted
FROM tb WHERE id = 1
In reality, I have five of these "measures" and tens of millions of people - subsetting such a large table five times for each member does not seem the most efficient way of doing this. This is a real-time API, receiving tens of requests a minute, so I think I'll need a better way of doing this. My other thought was to perhaps create a temp table/view for each member once the request is received, and then UNION based off of that subsetted table.
Does anyone have a more efficient way of doing this?
You can use a lateral join:
select t.id, v.*
from t cross join lateral
(values (1, measure_1_actual, measure_1_predicted),
(2, measure_2_actual, measure_2_predicted)
) v(measure, actual, predicted);
Lateral joins were introduced in Postgres 9.4. You can read about them in the documentation.
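SQLite, for comparison, has no LATERAL, but the same unpivot can be emulated with a plain CROSS JOIN against the measure numbers plus CASE expressions; a runnable sketch against an in-memory copy of the question's table (two measures only, for brevity):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE tb (
    id INT,
    measure_1_actual INT, measure_1_predicted INT,
    measure_2_actual INT, measure_2_predicted INT)""")
con.executemany("INSERT INTO tb VALUES (?, ?, ?, ?, ?)",
                [(1, 1, 0, 0, 0), (2, 1, 1, 1, 1)])

# Cross join against the measure numbers and pick the matching column pair.
rows = con.execute("""
    SELECT t.id, m.n AS measure,
           CASE m.n WHEN 1 THEN t.measure_1_actual    ELSE t.measure_2_actual    END AS actual,
           CASE m.n WHEN 1 THEN t.measure_1_predicted ELSE t.measure_2_predicted END AS predicted
    FROM tb t CROSS JOIN (SELECT 1 AS n UNION ALL SELECT 2) m
    WHERE t.id = 1
    ORDER BY m.n
""").fetchall()
print(rows)  # [(1, 1, 1, 0), (1, 2, 0, 0)]
```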

SQL group by steps

I'm using SQL in SAS.
I'm doing a SQL query with a GROUP BY clause on a continuous variable (made discrete), and I'd like it to aggregate more coarsely. I'm not sure this is clear, so here is an example.
Here is my query :
SELECT CEIL(travel_time) AS time_in_mn, MEAN(foo) AS mean_foo
FROM my_table
GROUP BY CEIL(travel_time)
This will give me the mean value of foo for each different value of travel_time. Thanks to the CEIL() function, it will group by minutes and not seconds (travel_time can take values such as 14.7 (minutes)). But I'd like to be able to group by groups of 5 minutes, for instance, so that I have something like that :
time_in_mn mean_foo
5 4.5
10 3.1
15 17.6
20 12
(Of course, the MEAN(foo) should be taken over the whole interval, so for time_in_mn = 5, mean_foo should be the mean of foo where travel_time is in (0, 5].)
How can I achieve that ?
(Sorry if the answer can be found easily, the only search term I could think of is group by step, which gives me a lot of "step by step tutorials" about SQL...)
A common idiom of "ceiling to steps" (or rounding, or flooring, for that matter) is to divide by the step, ceil (or round, or floor, of course) and then multiply by it again. This way, if we take, for example, 12.4:
Divide: 12.4 / 5 = 2.48
Ceil: 2.48 becomes 3
Multiply: 3 * 5 = 15
And in SQL form:
SELECT 5 * CEIL(travel_time / 5.0) AS time_in_mn,
MEAN(foo) AS mean_foo
FROM my_table
GROUP BY 5 * CEIL(travel_time / 5.0)
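The divide/ceil/multiply idiom is easy to check outside SQL; a minimal Python sketch of the same arithmetic:

```python
import math

def bin_ceiling(value, step):
    # divide, ceil, multiply: snap value up to the next multiple of step
    return step * math.ceil(value / step)

print(bin_ceiling(12.4, 5))  # 15
print(bin_ceiling(14.7, 5))  # 15
print(bin_ceiling(20.0, 5))  # 20
```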

Select random row from a PostgreSQL table with weighted row probabilities

Example input:
SELECT * FROM test;
id | percent
----+----------
1 | 50
2 | 35
3 | 15
(3 rows)
How would you write such query, that on average 50% of time i could get the row with id=1, 35% of time row with id=2, and 15% of time row with id=3?
I tried something like SELECT id FROM test ORDER BY p * random() DESC LIMIT 1, but it gives wrong results. After 10,000 runs I get a distribution like: {1=6293, 2=3302, 3=405}, but I expected the distribution to be nearly: {1=5000, 2=3500, 3=1500}.
Any ideas?
This should do the trick:
WITH CTE AS (
SELECT random() * (SELECT SUM(percent) FROM YOUR_TABLE) R
)
SELECT *
FROM (
SELECT id, SUM(percent) OVER (ORDER BY id) S, R
FROM YOUR_TABLE CROSS JOIN CTE
) Q
WHERE S >= R
ORDER BY id
LIMIT 1;
The sub-query Q gives the following result:
1 50
2 85
3 100
We then simply generate a random number in the range [0, 100) and pick the first row whose running sum is at or beyond that number (the WHERE clause). We use a common table expression (WITH) to ensure the random number is calculated only once.
BTW, the SELECT SUM(percent) FROM YOUR_TABLE allows you to have any weights in percent; they don't strictly need to be percentages (i.e. add up to 100).
[SQL Fiddle]
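The cumulative-sum selection can be verified deterministically by substituting a fixed value for random() * SUM(percent); a SQLite sketch (window functions require SQLite 3.25+):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE test(id INT, percent INT)")
con.executemany("INSERT INTO test VALUES (?, ?)", [(1, 50), (2, 35), (3, 15)])

def pick(r):
    # r plays the role of random() * SUM(percent); fixing it makes the test deterministic
    return con.execute("""
        SELECT id FROM (
            SELECT id, SUM(percent) OVER (ORDER BY id) AS s FROM test
        ) q WHERE s >= ? ORDER BY id LIMIT 1
    """, (r,)).fetchone()[0]

# Running sums are 50, 85, 100, so each draw lands in the right bracket:
print(pick(10))    # 1
print(pick(60))    # 2
print(pick(99.5))  # 3
```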
ORDER BY random() ^ (1.0 / p)
from the algorithm described by Efraimidis and Spirakis.
Branko's accepted solution is great (thanks!). However, I'd like to contribute an alternative that is just as performant (according to my tests), and perhaps easier to visualize.
Let's recap. The original question can perhaps be generalized as follows:
Given a map of ids and relative weights, create a query that returns a random id from the map, with probability proportional to its relative weight.
Note the emphasis on relative weights, not percent. As Branko points out in his answer, using relative weights will work for anything, including percents.
Now, consider some test data, which we'll put in a temporary table:
CREATE TEMP TABLE test AS
SELECT * FROM (VALUES
(1, 25),
(2, 10),
(3, 10),
(4, 05)
) AS test(id, weight);
Note that I'm using a more complicated example than the one in the original question, in that it does not conveniently add up to 100, and in that the same weight (10) is used more than once (for ids 2 and 3), which is important to consider, as you'll see later.
The first thing we have to do is turn the weights into probabilities from 0 to 1, which is nothing more than a simple normalization (weight / sum(weights)):
WITH p AS ( -- probability
SELECT *,
weight::NUMERIC / sum(weight) OVER () AS probability
FROM test
),
cp AS ( -- cumulative probability
SELECT *,
sum(p.probability) OVER (
ORDER BY probability DESC
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) AS cumprobability
FROM p
)
SELECT
cp.id,
cp.weight,
cp.probability,
cp.cumprobability - cp.probability AS startprobability,
cp.cumprobability AS endprobability
FROM cp
;
This will result in the following output:
id | weight | probability | startprobability | endprobability
----+--------+-------------+------------------+----------------
1 | 25 | 0.5 | 0.0 | 0.5
2 | 10 | 0.2 | 0.5 | 0.7
3 | 10 | 0.2 | 0.7 | 0.9
4 | 5 | 0.1 | 0.9 | 1.0
The query above is admittedly doing more work than strictly necessary for our needs, but I find it helpful to visualize the relative probabilities this way, and it does make the final step of choosing the id trivial:
SELECT id FROM (queryabove)
WHERE random() BETWEEN startprobability AND endprobability;
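The bracket construction runs essentially unchanged on SQLite (3.25+ for window functions). Note that with the repeated weight, which of ids 2 and 3 gets which bracket is arbitrary, but the brackets themselves are disjoint and correct:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE test(id INT, weight INT)")
con.executemany("INSERT INTO test VALUES (?, ?)",
                [(1, 25), (2, 10), (3, 10), (4, 5)])

# Normalize weights, accumulate them, and derive each id's [start, end) bracket.
rows = con.execute("""
    WITH p AS (
        SELECT *, weight * 1.0 / SUM(weight) OVER () AS probability FROM test
    ),
    cp AS (
        SELECT *, SUM(probability) OVER (
            ORDER BY probability DESC
            ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
        ) AS cumprobability
        FROM p
    )
    SELECT id, cumprobability - probability AS startp, cumprobability AS endp
    FROM cp ORDER BY id
""").fetchall()
for r in rows:
    print(r)
```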
Now, let's put it all together with a test that ensures the query is returning data with the expected distribution. We'll use generate_series() to generate a random number a million times:
WITH p AS ( -- probability
SELECT *,
weight::NUMERIC / sum(weight) OVER () AS probability
FROM test
),
cp AS ( -- cumulative probability
SELECT *,
sum(p.probability) OVER (
ORDER BY probability DESC
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) AS cumprobability
FROM p
),
fp AS ( -- final probability
SELECT
cp.id,
cp.weight,
cp.probability,
cp.cumprobability - cp.probability AS startprobability,
cp.cumprobability AS endprobability
FROM cp
)
SELECT fp.id, count(*)
FROM fp
CROSS JOIN (SELECT random() FROM generate_series(1, 1000000)) AS random(val)
WHERE random.val BETWEEN fp.startprobability AND fp.endprobability
GROUP BY fp.id
ORDER BY count(*) DESC
;
This will result in output similar to the following:
id | count
----+--------
1 | 499679
3 | 200652
2 | 199334
4 | 100335
Which, as you can see, tracks the expected distribution perfectly.
Performance
The query above is quite performant. Even on my average machine, with PostgreSQL running in a WSL1 instance (the horror!), execution is relatively fast:
count | time (ms)
-----------+----------
1,000 | 7
10,000 | 25
100,000 | 210
1,000,000 | 1950
Adaptation to generate test data
I often use a variation of the query above when generating test data for unit/integration tests. The idea is to generate random data that approximates a probability distribution that tracks reality.
In that situation I find it useful to compute the start and end probabilities once and store the results in a table:
CREATE TEMP TABLE test AS
WITH test(id, weight) AS (VALUES
(1, 25),
(2, 10),
(3, 10),
(4, 05)
),
p AS ( -- probability
SELECT *, (weight::NUMERIC / sum(weight) OVER ()) AS probability
FROM test
),
cp AS ( -- cumulative probability
SELECT *,
sum(p.probability) OVER (
ORDER BY probability DESC
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) cumprobability
FROM p
)
SELECT
cp.id,
cp.weight,
cp.probability,
cp.cumprobability - cp.probability AS startprobability,
cp.cumprobability AS endprobability
FROM cp
;
I can then use these precomputed probabilities repeatedly, which gives better performance and simpler use.
I can even wrap it all in a function that I can call any time I want to get a random id:
CREATE OR REPLACE FUNCTION getrandomid(p_random FLOAT8 = random())
RETURNS INT AS
$$
SELECT id
FROM test
WHERE p_random BETWEEN startprobability AND endprobability
;
$$
LANGUAGE SQL STABLE STRICT;
Window function frames
It's worth noting that the technique above is using a window function with a non-standard frame ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW. This is necessary to deal with the fact that some weights might be repeated, which is why I chose test data with repeated weights in the first place!
Your proposed query appears to work; see this SQLFiddle demo. It creates the wrong distribution though; see below.
To prevent PostgreSQL from optimising the subquery away I've wrapped it in a VOLATILE SQL function. PostgreSQL has no way to know that you intend the subquery to run once for every row of the outer query, so unless you force it to be volatile it'll just execute it once. Another possibility, though one the query planner might optimize out in future, is to make it appear to be a correlated subquery, like this hack that uses an always-true WHERE clause: http://sqlfiddle.com/#!12/3039b/9
At a guess (before you updated to explain why it didn't work), your testing methodology was at fault, or you were using this as a subquery in an outer query where PostgreSQL noticed it wasn't a correlated subquery and executed it just once, as in this example.
UPDATE: The distribution produced isn't what you're expecting. The issue here is that you're skewing the distribution by taking multiple samples of random(); you need a single sample.
This query produces the correct distribution (SQLFiddle):
WITH random_weight(rw) AS (SELECT random() * (SELECT sum(percent) FROM test))
SELECT id
FROM (
SELECT
id,
sum(percent) OVER (ORDER BY id),
coalesce(sum(prev_percent) OVER (ORDER BY id),0) FROM (
SELECT
id,
percent,
lag(percent) OVER () AS prev_percent
FROM test
) x
) weighted_ids(id, weight_upper, weight_lower)
CROSS JOIN random_weight
WHERE rw BETWEEN weight_lower AND weight_upper;
Performance is, needless to say, horrible. It's using two nested sets of windows. What I'm doing is:
Creating (id, percent, previous_percent) then using that to create two running sums of weights that are used as range brackets; then
Taking a random value, scaling it to the range of weights, and then picking a value that has weights within the target bracket
Here is something for you to play with:
select t1.id as id1
, case when t2.id is null then 0 else t2.id end as id2
, t1.percent as percent1
, case when t2.percent is null then 0 else t2.percent end as percent2
from "Test1" t1
left outer join "Test1" t2 on t1.id = t2.id + 1
where random() * 100 between t1.percent and
case when t2.percent is null then 0 else t2.percent end;
Essentially, perform a left outer join so that you have two columns to which you can apply a BETWEEN clause.
Note that it will only work if you get your table ordered in the right way.
Based on Branko Dimitrijevic's answer, I wrote this query, which may or may not be faster by using the sum total of percent using tiered windowing functions (not unlike a ROLLUP).
WITH random AS (SELECT random() AS random)
SELECT id FROM (
SELECT id, percent,
SUM(percent) OVER (ORDER BY id) AS rank,
SUM(percent) OVER () * random AS roll
FROM test CROSS JOIN random
) t WHERE roll <= rank LIMIT 1
If the ordering isn't important, SUM(percent) OVER (ROWS UNBOUNDED PRECEDING) AS rank, may be preferable because it avoids having to sort the data first.
I also tried Mechanic Wei's answer (as described in this paper, apparently), which seems very promising in terms of performance, but after some testing, the distribution appears to be off:
SELECT id
FROM test
ORDER BY random() ^ (1.0/percent)
LIMIT 1